Blake Matheny


Posts tagged with "tumblr"

The tumblr deploy schedule. Blue vertical lines are deploys, green stacked graphs are requests per second, annotations are my own.

People start deploying just after 10am (not surprising, most folks come in around 10) and keep deploying code until lunch time. Lunch time is a very social thing here, so most folks head out for a bit. After lunch, people are heads down until snack time. Snack time does not really exist. Code pushes resume again after snack time and go until dinner, just after 7pm. After that, last-minute bug fixes until just after 8pm. Then sleep.

These opinions are my own. I don’t actually think most Tumblr employees go to bed at 8:00pm. There also isn’t an officially scheduled snack time, although that would be cool.


Tumblr Firehose - The Gory Details

Back in December I started putting some thought into the Tumblr firehose. While the initial launch was covered here, and the business side was covered by places like TechCrunch and AllThingsD, not much has been said about the technical details.

First, some backstory. I knew in December that a product need for the firehose was coming and had simultaneously been spending a fair amount of time thinking about the general Tumblr activity stream. In particular, I had been trying to work out a reasonable real-time processing model for a heterogeneous environment like the one at Tumblr. I had also been closely following some of the exciting work being done at LinkedIn by Jay Kreps and others on Kafka and Databus, by Eric Sammer from Cloudera on Flume, and by Nathan Marz from Twitter on Storm.

I had talked with some of the engineers at Twitter about their firehose and knew some of the challenges they had overcome in scaling it. I spent some time reading their fantastic documentation, and after reviewing some of these systems I came up with the system I actually wanted to build, much of it heavily influenced by the great work being done by other people. My ‘ideal’ firehose, from the consumer/client side, had the following properties:

  • Usable via curl
  • Allows a client to ‘rewind’ the stream in case of missed events or maintenance
  • If a client disconnects, they should pick up the stream where they left off
  • Client concurrency/parallelism, e.g. multiple consumers getting unique views of the stream
  • Near real-time is good enough (sub-second from the time an event is emitted to the time it is consumed)

From an event emitter (or producer) perspective, we simply wanted an elastic backend that could grow and shrink based on latency and persistence requirements.

What we ended up with accomplishes all of these goals and was fairly simple to implement. We took the best of many worlds (a bit of Kafka, a bit of Finagle, some Flume influences) and built the whole thing in about 10 days. The internal name for this system is Parmesan, which is both a cheese and an Arrested Development character (Gene Parmesan, PI).

The system comprises four primary components:

  • A ZooKeeper cluster, used for coordinating Kafka as well as stream checkpoints
  • Kafka, which is used for message persistence and distribution
  • A Thrift process, written in Scala with Finagle, which the Tumblr application talks to
  • An HTTP process, written in Scala with Finagle, which consumers talk to

The Tumblr application makes a Thrift RPC call containing event data to Parmesan. These RPC calls take about 5ms on average, and the client will retry unless it gets a success message back. Parmesan batches these events and uses Kafka to persist them to disk every 100ms. This functionality is all handled by the Thrift side of the Parmesan application. We also implemented a very simple custom message serialization format so that Parmesan could completely avoid any message serialization/deserialization overhead. This had a dramatic impact on GC time (the serialization change wasn’t made until it was needed), which in turn had a significant impact on average connection latency.
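To give a rough sense of the approach, here’s a minimal sketch of length-prefixed framing (the Frame object and its layout are a hypothetical illustration, not the actual Parmesan wire format). The key property is that the broker path treats every payload as opaque bytes, so nothing allocates per-message objects just to move data through:

    import java.nio.ByteBuffer

    // Sketch: a length-prefixed frame. The broker never parses the
    // payload, so shuttling a message around allocates no per-message
    // deserialization objects (which is what keeps GC time down).
    object Frame {
      // Layout: [4-byte payload length][payload bytes]
      def encode(payload: Array[Byte]): Array[Byte] = {
        val buf = ByteBuffer.allocate(4 + payload.length)
        buf.putInt(payload.length)
        buf.put(payload)
        buf.array()
      }

      // Hands back the raw payload untouched; only end consumers
      // decide how (and whether) to parse it.
      def decode(frame: Array[Byte]): Array[Byte] = {
        val buf = ByteBuffer.wrap(frame)
        val payload = new Array[Byte](buf.getInt())
        buf.get(payload)
        payload
      }
    }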

On the client side, any standard HTTP client works; a consumer needs (besides a username and password) an application ID and an optional offset. The offset determines where in the stream to start reading, and is specified either as Oldest (7 days ago), Newest (from right now), or an offset in seconds back from the current time in UTC. Up to 16 clients with the same application ID can connect, each viewing a unique partition of the activity stream. Stream partitioning allows you to parallelize your consumption without seeing duplicates. This is a great feature if, for instance, you took your app down for maintenance and want to catch back up in the stream quickly.
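Putting that together, consuming the stream can be as simple as the following (the hostname, path, and parameter names here are placeholders for illustration; the real endpoint details come with firehose access):

    # Start reading one hour back in the stream; -N disables curl's
    # output buffering so events print as they arrive.
    curl -N -u username:password \
      "https://firehose.example.com/stream?applicationId=myapp&offset=3600"

Passing offset=Oldest would start seven days back, and offset=Newest starts from right now.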

Kafka doesn’t easily (natively) support this style of rewinding, so we just persist stream offsets to ZooKeeper. That is, periodically, clients with a specific application ID will say, “Hey, at this unixtime I saw a message which had this internal Kafka offset.” By periodically persisting this data to ZooKeeper, we can ‘fake’ this rewind functionality in a way that is useful but imprecise (we basically have to estimate where in the Kafka log to start reading from).
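The estimate itself is easy to sketch. Assuming checkpoints come back from ZooKeeper as simple (unixtime, Kafka offset) pairs (the names and types below are my own illustration, not the actual Parmesan code), a rewind just picks the newest checkpoint at or before the requested point in time:

    case class Checkpoint(unixTime: Long, kafkaOffset: Long)

    // Best-effort rewind: find the newest checkpoint taken at or
    // before the target time and start reading the Kafka log from its
    // offset. Consumers may replay a few messages they have already
    // seen, which is why the rewind is useful but imprecise.
    def estimateStartOffset(checkpoints: Seq[Checkpoint],
                            rewindSeconds: Long): Long = {
      val target = System.currentTimeMillis() / 1000 - rewindSeconds
      checkpoints
        .filter(_.unixTime <= target)
        .sortBy(_.unixTime)
        .lastOption
        .map(_.kafkaOffset)
        .getOrElse(0L) // nothing old enough: start from the oldest data
    }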

We use four ‘queue class’ machines (Tumblr-speak for a box with 72GB of RAM and two mirrored disks), each capable of handling roughly 100k messages per second, to serve the entire stream. Those four machines provide a message backlog of one week, allowing clients to drop into the stream anywhere in the past week.

As I mentioned on Twitter, I’m quite proud of the software and the team behind it. Many thanks to Derek, Danielle, and Wiktor for their help and feedback.

If you’re interested in this kind of distributed systems work, we’re hiring.

The Ablogalypse


coderspiel:

Building Network Services with Finagle and Ostrich (by Tumblr’s Blake Matheny for ny-scala)

First Week at Tumblr

I just wrapped up my first week at Tumblr and thought I’d take a minute to share with folks what it’s like to work here as an engineer. Katherine and I moved to NYC from Indianapolis on May 2nd, and my first day at Tumblr as a Senior Engineer was May 9th. We had a week to get settled and moved in, and then it was off to work for me.

The first thing I noticed arriving at Tumblr was an Aeron chair at my desk. As someone who suffers from RSI (repetitive strain injury), having ergonomic equipment is incredibly important, and I loved that Tumblr was concerned with having a comfortable workspace. The next thing I noticed was that my machine was a shiny new Mac Pro. Not a MacBook Pro, but a Mac Pro. It was hooked up to two 24” Dell monitors as well as the ergonomic keyboard and mouse I had requested. This was by far the best equipment I had ever been given at a startup, and I really appreciated not having to battle antiquated equipment or office furniture.

Within a few hours I had my development environment set up. Basic tools:

  • CentOS image managed by Vagrant
  • Git for version control
  • Develop on my Mac in a folder shared with the Vagrant image (not having to SSH into my image was pretty awesome)

I had a couple of getting-started tickets assigned to me and got to work. By the end of my first day I had committed my changes and gone through a code review, and the changes were deployed to production on Tuesday. I love the low cycle time; it’s a great feeling to get your changes out quickly and in front of users.

By Thursday I had completed and deployed all my getting-started tickets and received my first real project. The project, staircar, is essentially an HTTP interface on top of Redis (a future post on the engineering blog will go into more detail). It’s named after the Arrested Development meme of the same name; most new projects are given an AD-themed name, which the engineers choose. Engineers generally choose their tools on a project-by-project basis and work with people from the product team to jointly make product decisions. The org structure is fairly flat and there’s virtually no top-down management.

By the end of the week I had participated in a total of one scheduled meeting (30 minutes) and received something on the order of 10 emails. To say that the environment encourages focus and productivity is an understatement. Folks will tap each other on the shoulder for ad-hoc discussions, but the distraction level is very low. One additional item of interest is that onsite development is strongly encouraged (and currently the only supported method). Having worked for years in environments that demanded 24/7 attention, working somewhere that allows me to somewhat unplug when I go home is amazing. I’ve noticed it also makes me substantially more productive when I’m actually in the office.

Not mentioning the team would leave this post incomplete. I work with smart, passionate, interesting people. Tumblr employees use the platform constantly and are always thinking about how to improve the user experience; we definitely eat our own dog food. In a given week, engineers might work in several different programming languages (in week one I used PHP, C, and Ruby) to support software used by millions of people. The operations team is managing hundreds of servers in multiple data centers and building some pretty amazing infrastructure support tools.

When I’ve got the link for the engineering blog I’ll share it.

Day 1 at Tumblr

On Monday, May 9, I joined Tumblr full-time as a Senior Software Engineer. Yesterday I committed some code to the git repository, and this morning it was pushed into production after code review. Even though it was a small change, it’s still awesome to see your work in production in under 24 hours.

For all three people following me: no worries, I created a separate test blog for my work, so you shouldn’t see much on here that’s related to it (unless something cool is being released).