A Week Off

So I’m going to be taking next week off, just so I have some more time to relax and refocus. I feel like I never really got a chance to properly recharge between the end of finals and the start of my summer work. As I’ve said before, I just want some time free to think. More than anything, I enjoy having the time to do nothing in particular and not feel guilty about it. My objectives for the week? I’d rather not pin them down exactly, but mostly I’m going to do lots of reading, some coding, some testing of Minefield, and clear out all those links I’ve saved to read later.

I think now is also a good time to give an update on my research, since I’ve completed about half of it, give or take. The goal is to evaluate the applicability of, and to design, a better resource allocation algorithm for a distributed computing system. The main problem is to figure out not only how to evenly distribute the available resources so as to fully utilize the capabilities of the system, but also, at the local level, how to run each program as fast as possible. To accomplish this, it is not enough to just give everyone an equal share; rather, you have to give each program the combination of resources that best fits its needs. A natural consequence of this non-trivial allocation scheme is that some parts of the system, some nodes, end up more heavily utilized, or more “popular”. But what makes them so? These are the questions we are seeking to answer. Our approach is to look to the past for clues. So we’re digging through many years’ worth of usage log files to see how the popularity of nodes has evolved. We are testing out several models of popularity metrics and seeing how well they match the data.

There are three of us working on this project, and each of us is focusing on a different aspect. For my part, I’ve been focusing specifically on how to extract the pertinent data from the terabytes of textual log files at an acceptable speed, and on how best to automate the task of actually finding relationships. I’ve decided to break my work up into many small programs that each do one thing and are tied together by scripts that automate as much as possible. Apart from modularity being good style, it also preserves sanity, since this stuff can get pretty damn hairy.
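To give a flavor of what one of those small programs might look like, here is a minimal sketch of an extraction filter in Python. Everything about it is illustrative: the log format, the field layout, and the choice of CSV output are assumptions for the sketch, not the project’s actual formats.

```python
#!/usr/bin/env python3
"""Sketch of one small extraction filter (hypothetical log format)."""
import csv
import gzip
import sys

def extract(path, writer):
    """Stream one (possibly gzipped) log file and emit timestamp, node, bytes rows."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", errors="replace") as fh:
        for line in fh:
            parts = line.split()
            if len(parts) < 3:      # skip malformed lines instead of dying
                continue
            # Assumed layout: <timestamp> <node-id> <bytes> ... (rest ignored)
            writer.writerow(parts[:3])

if __name__ == "__main__":
    out = csv.writer(sys.stdout)
    for logfile in sys.argv[1:]:
        extract(logfile, out)
```

A driver script can then chain a handful of filters like this one (extract, join, aggregate) and run them over the whole archive unattended.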

There was one specific aspect that caught my interest in particular. The problem was aptly posed by my professor as follows: say we want to see how the temperature of the water in a swimming pool and the height of the dive are related to how much pain the diver feels. Holding the temperature constant, we expect higher dives to hurt more, but if the height isn’t too great, the relation may be weak and depend more on how you dive, and so on. But conversely, how does temperature make a difference? At first glance, there’s no appreciable trend over a small range of temperatures. But cross 32 degrees F and it’s a different ballgame: now we strongly expect the height of the dive to affect the pain felt, and for it to hurt a lot more!

In the same vein, consider the values of two seemingly independent data fields, one being the synthetic field of popularity and the other being a more tangible quantity tied to actual hardware. We seek to automatically find the ranges of the data that yield strong correlations.
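Here is a toy illustration of the point, using NumPy and SciPy (my choice of tools for the sketch, not necessarily what the project uses). The data is synthetic, with a deliberate breakpoint: over the whole range the correlation roughly cancels out, while within each piece it is very strong.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = np.sort(rng.uniform(0.0, 10.0, 400))

# Tent-shaped data: rises up to the breakpoint, falls after it.
bp = 5.0
y = np.where(x < bp, x, 2 * bp - x) + rng.normal(0.0, 0.3, x.size)

lo, hi = x < bp, x >= bp
print("whole range: r = %+.2f" % stats.pearsonr(x, y)[0])          # roughly 0
print("left of bp:  r = %+.2f" % stats.pearsonr(x[lo], y[lo])[0])  # near +1
print("right of bp: r = %+.2f" % stats.pearsonr(x[hi], y[hi])[0])  # near -1
```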

In mathematical language, we seek to automatically perform a segmented regression on the data. The final regression may be piecewise-linear, some crazy high-order interpolated polynomial, or even non-parametric; but for our needs, that is not our concern at present. So I chose to look only at the correlation coefficients. But there’s more than just one type: we are all taught Pearson’s coefficient, but other popular ones include Kendall’s tau and Spearman’s rho. Currently I’ve only implemented Pearson’s, but it will almost certainly be replaced by Kendall’s. Why? Because linearity is not as important as dependence in general. Having decided on that, the next step was to figure out how to implement it in code. The main challenge was to automatically determine the breakpoints that yield the maximum correlations. There is a good deal of literature on this, but in my preliminary search I only found papers that were either very marginally related to what I was doing or just too complicated, so I decided to implement a simple algorithm of my own.
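To see the difference in practice, here is a quick comparison on synthetic data, again with SciPy as my tooling choice for the sketch. The relationship below is monotone but far from linear, so Pearson’s r understates it while the rank-based coefficients pick it up.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 20.0, 300)
y = np.exp(x) + rng.normal(0.0, 1.0, x.size)   # monotone in x, far from linear

print("Pearson  r   = %.3f" % stats.pearsonr(x, y)[0])    # rewards linearity
print("Kendall  tau = %.3f" % stats.kendalltau(x, y)[0])  # rewards monotone dependence
print("Spearman rho = %.3f" % stats.spearmanr(x, y)[0])
```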

Basically, there are two quantities we wish to maximize: the correlation and the length of the segment. The length matters because a segment of only one or two points will de facto have a correlation of unity, and our data is certainly non-uniform. I toyed with the idea of some kind of learning algorithm for the task, but finally ended up with a simple algorithm that runs in ~2N time (where ~ denotes tilde notation). I start with some initial set of breakpoints evenly distributed along one axis. Then I run a loop comparing adjacent blocks: if the merged block’s weighted sum of correlation and number of points exceeds that of either block individually, I remove the middle breakpoint and coalesce the two blocks. I sweep through all the blocks this way and then do the exact same thing in the reverse direction.
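Here is roughly what that pass looks like in code. This is a sketch of the idea rather than my actual implementation: it uses SciPy’s Kendall tau (where the real code currently uses Pearson’s), the block weight is the mix of |correlation| and block size just described, and the constants X and Y are placeholders.

```python
import numpy as np
from scipy import stats

X, Y = 1.0, 0.01   # placeholder mixing constants for the block weight

def block_weight(xs, ys):
    """Weighted mix of |correlation| and block size."""
    if len(xs) < 3:
        return 0.0                        # tiny blocks get no credit
    tau = stats.kendalltau(xs, ys)[0]
    tau = 0.0 if np.isnan(tau) else tau   # constant blocks have undefined tau
    return X * abs(tau) + Y * len(xs)

def coalesce_pass(blocks):
    """One sweep: merge adjacent blocks whenever the merged block beats both."""
    merged_blocks = [blocks[0]]
    for blk in blocks[1:]:
        prev = merged_blocks[-1]
        merged = (np.concatenate([prev[0], blk[0]]),
                  np.concatenate([prev[1], blk[1]]))
        if block_weight(*merged) > max(block_weight(*prev), block_weight(*blk)):
            merged_blocks[-1] = merged    # drop the breakpoint between them
        else:
            merged_blocks.append(blk)
    return merged_blocks

def segment(x, y, n_initial=32):
    """Evenly spaced initial breakpoints, then a forward and a backward sweep."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    blocks = list(zip(np.array_split(x, n_initial), np.array_split(y, n_initial)))
    blocks = coalesce_pass(blocks)
    blocks = coalesce_pass(blocks[::-1])[::-1]
    return blocks
```

Whether to merge on strictly greater or greater-or-equal, and how to pick X and Y, are exactly the kind of knobs that need tweaking against real data.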

Now I have a set of blocks of unequal length. I copy each block’s information into a list and sort it with a comparator on the block’s weight, defined as X*abs(correlation) + Y*(number of points in the block), where X and Y are some constants.
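Continuing the sketch above (same block_weight and placeholder constants), the ranking step could be as simple as:

```python
def rank_blocks(blocks):
    """Sort the coalesced blocks by X*abs(correlation) + Y*(points in block), best first."""
    return sorted(blocks, key=lambda blk: block_weight(*blk), reverse=True)

# Inspect the most promising ranges, e.g. the top five:
# for xs, ys in rank_blocks(segment(x, y))[:5]:
#     print(f"x in [{xs[0]:.2f}, {xs[-1]:.2f}]  n={len(xs)}")
```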

In practice, this approach works reasonably well when the number of initial breakpoints is large enough. And since we only need a heuristic, it does the job.

Now that that’s all done, I ran some tests on the data and the results are looking close to what we expected. When I get back, I’m going to tackle the task of tweaking and reworking the popularity metric to get even finer results.

But for right now, it is time to kick back and relax. Awesome.