July 2011
15 posts
Somehow, ritual drunk-conversation concerning team captains for the apocalypse has become a major part of the lives of 20-somethings. Having been matured in the Grandaddy-crowned masterpiece film (put “A.M. 180” on and forget that you have a job) 28 Days Later and the best-selling …
June 2011
13 posts
I was recently working on a problem where I had a Map which contained essentially String keys and Integer values. The data was being output by a Hadoop mapper and I wanted my reducer to only output the top 50 values per day. I needed to sort my map by the integer value and then grab only the top 50 items. The method I used to grab the top 50 was kind of gross (in my mind anyhow) so I wanted to share it and see what other ways people might solve this problem. Below is the code:
PriorityQueue<Map.Entry<String,Integer>> q = new PriorityQueue<Map.Entry<String,Integer>>(
50,
new Comparator<Map.Entry<String, Integer>>() {
@Override
public int compare(Entry<String, Integer> e1, Entry<String, Integer> e2) {
return e2.getValue().compareTo(e1.getValue());
}
});
I then iterated over my map and shoved all the entries into the priority queue. When I iterate over the queue, I get elements back out sorted. So, here are some issues I see with this approach.
- The memory requirement is
2n. - The runtime for insertion is roughly
O(n*log(n))(iterate over each item in the map, insertion into the queue) - The
PriorityQueueimplementation is unbounded (but I actually only want 50 items in it)
Is there a better (more efficient) way to do this without changing the original map?
Very high level overview of a bunch of graph/key-value/column-family/etc stores.
Alan Mathison Turing, was an English mathematician, logician, cryptanalyst and computer scientist. He was highly influential in the development of computer science, providing a formalisation of the concepts of “algorithm” and “computation” with the Turing machine, which played a significant role in the creation of the modern computer. Turing is widely considered to be the father of computer science and artificial intelligence.
I love to travel. I’ve slept in 45 of the 50 states and been to 19 countries. This past year I made it to Jamaica, Puerto Rico, and Hawaii (not a country, but a beautiful destination). So far my favorite country has probably been the US (just because the landscape is so diverse), but outside the US is Austria as Vienna is just an amazingly beautiful city. Not sure what my favorite continent is but Europe is easy (and safe) to travel in.
I’ve been evaluating a couple of queue systems recently (Kafka from LinkedIn and Kestrel from Twitter) that are both written in Scala. In evaluating the systems I had the need to become more than conversationally familiar with the language. When I lived in Indianapolis my company hosted the Scala meetups and I occasionally went and as such did some reading on the subject, but I didn’t actually code anything with it. My friend JR, organizer of the Indianapolis Scala meetup, recommended I check out the Scala for Java Refugees series to get started. The series is fantastic but was written back in ~2008 and isn’t entirely consistent with the latest and greatest from scala 2.9. Below are a list of my notes regarding things that I had to change from scala 2.9.
- The Application trait is deprecated, use App
- Using the java Math constants is deprecated in favor of scala.Math
Well, that’s it. I started this post on the first installment of the series and assumed it would be riddled with issues but in fact there were very few. Only took a couple of hours to get through including pounding out the examples. I hope to wrap up my ‘learn a new programming language’ exercises this weekend so I can get through some of the Kafka/Kestrel code this week.
If you’ve used Hadoop before you probably know that it’s strongly recommended to use a combiner if your workflow/data will allow for it. The standard convention is to just re-purpose your reducer as your combiner, however that doesn’t work particularly well when you’re operating on an aggregate form in the reducer. The canonical example is computing a mean. If you’re computing the mean in your reducer, you don’t want to do that in your combiner as well as you’ll be calculating the mean of a subset.
I recently implemented a statistical job without using a combiner. The job typically took 15-20 minutes to run. Below are the Hadoop stats for it.
![]()
Notice the substantial filesystem activity. 10 billion bytes written, 8 billion bytes read. Also, 2 billion reduce shuffle bytes.
I got tired of waiting on this job so refactored it to use a combiner. After the combiner was introduced the job time went from an average of 17 minutes down to an average of 3 minutes. The filesystem stats and shuffle stats were also reduced significantly. Below are the Hadoop stats.
![]()
The combiner made a tremendous difference and reduced the time for this job by 82%. Just thought I’d share some real world numbers that might be useful to someone struggling with job time. If you don’t have a combiner, implement one.
I read quite a few boring papers about various nerdy topics but this paper is not boring - it’s AMAZING. I was already a fan of Spotify but after reading this, I am sort of blown away. A few tidbits that I didn’t know -
- Uses Peer-to-Peer to distribute music
- Using Ogg Vorbis for encoding
- Song data comes from 8.8% servers, 35.8% p2p, 55.4% cache
- 88% of access were against 12% of the catalog
- 60% of the catalog was access at least once a week
- All TCP based
- No NAT Traversal but wide use of UPnP on home routers
The whole paper is packed with a ton of insights. Spotify is an impressive engineering feat.
Two additional things that I walked away with that I thought were particularly cool.
- Spotify uses a markov model for throughput estimations. The client doesn’t start playing until all simulations pass (e.g. indicate the track will play without a buffer underrun)
- Neighbor selection (which peers to stay connected to) is performed based on a heuristic which determines peer utility. Peer utility is based on things like; bytes sent over time, other peers found through searches via connection, and number of tracks the peer has that the client has been interested in.
![]()
May 2011
12 posts
In roughly 2003 I was working on the Linux kernel for a research project at Purdue. I didn’t have internet access at home at the time and wanted to be able to take work between my lab on campus and my office. The USB card reader I had didn’t work on Linux at the time so I created a small patch for the kernel to support it. Then I submitted it, and waited, and waited, and waited. The USB devices maintainer was quick to acknowledge, but then it was months before it made its way to an official kernel release and years before it made its way to most distributions. This cycle time wasn’t specific to Linux either, and I found this type workflow pretty common for open source projects.
Contrast this ‘old style’ workflow with what git and github has allowed us to do. Recently I submitted patches to hiredis and webdis for two unrelated issues. In both cases the authors acknowledged the pull request on the same day. The webdis author integrated the change the same day, and the hiredis author worked with me to make some changes to the patch before accepting it less than a week later. Now the code is out there for anyone to use and it didn’t really take me any effort at all. On top of that, github makes it so trivial to fork that if I hadn’t gotten a response I could have just forked the project and maintained it myself. Github would allow that fork to be visible for the world to see and use.
Until github came about, open source projects were relegated primarily to the confines of sourceforge and freshmeat. There seems to be this revival of open source software, largely due to how easy to use github is, that doesn’t require ‘participating in open source’. Github did something brilliant which was to take the open source out of open source. These days it’s just what you do, it’s not a decision to be made or something to be discussed.
Github has made my open source, vim, Linux, windowmaker elitist attitude a thing of the past. And yeah, that kind of brings me down.