Blake Matheny


Tumblr Firehose - The Gory Details

Back in December I started putting some thought into the tumblr firehose. While the initial launch was covered here, and the business stuff surrounding it was covered by places like techcrunch and AllThingsD, not much has been said about the technical details.

First, some back story. I knew in December that a product need for the firehose was upcoming and had simultaneously been spending a fair amount of time thinking about the general tumblr activity stream. In particular I had been toying quite a bit with trying to figure out a reasonable real-time processing model that would work in a heterogenous environment like the one at Tumblr. I had also been quite closely following some of the exciting work being done at LinkedIn by Jay Kreps and others on Kafka and Databus, by Eric Sammer from Cloudera on Flume, and by Nathan Marz from Twitter on Storm.

I had talked with some of the engineers at twitter about their firehose and knew some of the challenges they had overcome in scaling it. I spent some time reading their fantastic documentation and after reviewing some of these systems came up with the system I actually wanted to build, much of it completely influenced by the great work being done by other people. My ‘ideal’ firehose, from the consumer/client side, had the following properties:

  • Usable via curl
  • Allows a client to ‘rewind’ the stream in case of missed events or maintenance
  • If a client disconnects, they should pick up the stream where they left off
  • Client concurrency/parallelism, e.g. multiple consumers getting unique views of the stream
  • Near real-time is good enough (sub 1s from an event emitted to consumed)

From an event emitter (or producer) perspective, we simply wanted an elastic backend that could grow and shrink based on latency and persistence requirements.

What we ended up with accomplishes all of these goals and ended up being fairly simple to implement. We took the best of many worlds (a bit of kafka, a bit of finagle, some flume influences) and created the whole thing in about 10 days. The internal name for this system is Parmesan which is both a cheese as well as an arrested development character (Gene Parmesan, PI).

The system is comprised of 4 primary components.

  • A ZooKeeper cluster, used for coordinating Kafka as well as stream checkpoints
  • Kafka, which is used for message persistence and distribution
  • A thrift process, written with scala/finagle, which the tumblr application talks to
  • An HTTP process, written with scala/finagle, which consumers talk to

The Tumblr application makes a Thrift RPC call containing event data to parmesan. These RPC calls take about 5ms on average, and the client will retry unless it gets a success message back. Parmesan batches these events and uses Kafka to persist them to disk every 100ms. This functionality is all handled by the thrift side of the parmesan application. We also implemented a very simple custom message serialization format so that parmesan could completely avoid any kind of message serialization/deserialization overhead. This had a dramatic impact on GC time (the serialization change wasn’t made until it was needed) which in turn had a significant impact on average connection latency.

On the client side, any standard HTTP client works and requires (besides a username and password) an application ID and an optional offset. The offset is used for determining where in the stream to start reading from, and is specified either as Oldest (7 days ago), Newest (from right now), or an offset in seconds from the current time in UTC. Up to 16 clients with the same application ID can connect, each viewing a unique partition of the activity stream. Stream partitioning allows you to parallelize your consumption without seeing duplicates. This is a great feature for instance if you took your app down for maintenance and want to quickly catch back up in the stream.

Kafka doesn’t easily (natively) support this style of rewinding so we just persist stream offsets to ZooKeeper. That is, periodically clients with a specific application ID will say, “Hey, at this unixtime I saw a message which had this internal Kafka offset”. By periodically persisting this data to Kafka, we can ‘fake’ this rewind functionality in a way that is useful, but imprecise (we basically have to estimate where in the Kafka log to start reading from).

We use 4 ‘queue class’ (tumblr speak for a box with 72GB of RAM and 2 mirrored disks) machines, capable of supporting roughly 100k messages per second each, to support the entire stream. Those 4 machines provide a message backlog of 1 week, allowing clients to drop into the stream anywhere in the past week.

As I mentioned on twitter, I’m quite proud of the software and the team behind it. Many thanks to Derek, Danielle and Wiktor for help and feedback.

If you’re interested in this kind of distributed systems work, we’re hiring.