Blake Matheny



Aug 1

What do you do?

Sometimes people ask me what I do for a living, and “engineering hobo” doesn’t make sense to most.

I joined Facebook in the TI (traffic infrastructure) team where I work on proxygen (layer 7 load balancer), thrift (RPC and serialization framework), and perf (which focuses on client to edge performance). There was a recent post from Facebook Engineering titled, “Secure browsing by default”. This touched on a lot of work done by TI (as well as other teams) over the past year, some of which has just been landing.

The proxygen team did a lot of the OpenSSL/crypto work so that we could build SPDY support and roll it out widely. This helped enable SSL for a number of clients (since SPDY runs on top of SSL). The post also talked about efforts from the perf team (and other parts of TI) to build out an edge network. The Facebook edge network is what allows clients to terminate connections locally (at an edge POP), which keeps the already expensive SSL handshake from becoming even more expensive by requiring multiple round trips across the world. This makes SSL handshaking cheap on the network and distributes the CPU cost. Pair this with SPDY, where you have multiplexed requests over a single secured connection, and you have a pretty performant SSL infrastructure.

The article also talked about some of the upcoming security related work for Facebook, much of which TI will be building. This includes things like 2048-bit RSA keys, ECC (elliptic curve cryptography), PFS (perfect forward secrecy) via ECDHE key exchange, and a bunch of others as well. I’m particularly excited about ECDHE because it uses ephemeral keys on a per session basis, which makes retroactively decrypting recorded data a much more difficult problem.

It’s really fascinating to work on software that a billion people get to touch. Things that aren’t an issue with a million users or a hundred million users become enormous problems. Scaling software to handle millions of requests per second is really challenging. The proxygen team is ~4 people, the perf team is 3 people. Tiny teams solving huge problems. This is some of the stuff I get to work on. It’s pretty fun.

Open sourcing memkeys

We rely on memcache pretty heavily at Tumblr, with over 10TB of cache memory available across the stack. One of the things we’ve historically had a challenging time with at Tumblr is finding hot keys. A hot key is a memcache key getting dramatically more activity than other keys. This can have a significant performance impact on your cache backend.

We spent the past few days working on a C++ implementation of mctop*, which we’re happy to release today as memkeys. We do some pretty interesting stuff in memkeys to keep from dropping packets, some of which is documented in the wiki. I’m particularly proud of the striped lock-free queue implementation. In some basic benchmarks I found that memkeys dropped less than 2% of packets when seeing 1Gb/s of traffic. Additionally, the latency between a packet being picked up, parsed, processed, and reported on averages less than 1ms. Here is a screenshot of memkeys in action.
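The striped queue implementation itself lives in the memkeys repo, but the basic striping idea can be sketched out. The following is a hypothetical illustration (not the memkeys code, and it leans on JDK concurrent queues rather than a hand-rolled lock-free queue): spread producers across several independent queues so concurrent writers rarely contend on the same head/tail.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative sketch of queue striping: N independent queues,
// producers are spread across them by thread ID to reduce contention.
public class StripedQueue<T> {
    private final ConcurrentLinkedQueue<T>[] stripes;

    @SuppressWarnings("unchecked")
    public StripedQueue(int numStripes) {
        stripes = new ConcurrentLinkedQueue[numStripes];
        for (int i = 0; i < numStripes; i++) {
            stripes[i] = new ConcurrentLinkedQueue<T>();
        }
    }

    // Producers pick a stripe by thread, so concurrent producers
    // usually land on different queues.
    public void offer(T item) {
        int idx = (int) (Thread.currentThread().getId() % stripes.length);
        stripes[idx].offer(item);
    }

    // A consumer drains all stripes; ordering across stripes is not preserved.
    public T poll() {
        for (ConcurrentLinkedQueue<T> stripe : stripes) {
            T item = stripe.poll();
            if (item != null) return item;
        }
        return null;
    }
}
```

The win is that under heavy producer concurrency (like a packet capture loop), writers mostly touch different queues instead of all hammering one.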


Interested in stuff like this? We’re hiring.

Footnote: Etsy created the excellent mctop tool which aims to be like unix top for memcache, showing you which keys are getting the most activity. Unfortunately (as noted in the known issues), mctop drops packets. It drops a lot of packets. This can be really problematic because depending on the packets being dropped, you’re getting a really incomplete view of your cache story.

Eliminating Duplicate Requests

If you’re building a backend service and can avoid the problem of serving duplicate requests, that’s a good thing. A duplicate is defined in this case as: if two requests R1 and R2 both map to the same response r1, they are duplicates. Eliminating duplicates avoids doing unnecessary additional work.

At Tumblr we use finagle for building our backend stack. One of the benefits of using finagle is that services are composable. Here’s an example of a web server that supports Authorization. See how on line 75 you just compose these things together? This leads to very clean separation of concerns in your code.

This week I noticed that 5-10% of requests for a particular service were duplicates. I created a DuplicateRequestFilter that simply caches the most recent 500 requests/responses and if the request has recently been seen, we serve up the old response. Because this caches a Future, and not the actual result, it means that in flight requests are also handled appropriately. That is, we start serving a cached response as soon as the first request comes in, not after the first response goes out. Because this is implemented in terms of a finagle service filter, it required no changes to the actual backend service. Dropping this file in, and adding the class to the service composition, were the only required steps.
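The actual DuplicateRequestFilter is a finagle (Scala) service filter; as a rough, language-agnostic sketch of the core trick, here is a hypothetical Java version using CompletableFuture and an LRU map. The key point from above is preserved: you cache the future, not the result, so a second identical request joins an in-flight computation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Hypothetical sketch (names are illustrative, not the finagle filter):
// cache the most recent N request futures and reuse them for duplicates.
public class DedupingService<Req, Rep> {
    private final Function<Req, CompletableFuture<Rep>> backend;
    private final Map<Req, CompletableFuture<Rep>> cache;

    public DedupingService(final int maxEntries,
                           Function<Req, CompletableFuture<Rep>> backend) {
        this.backend = backend;
        // Access-ordered LinkedHashMap as a simple LRU cache
        this.cache = new LinkedHashMap<Req, CompletableFuture<Rep>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Req, CompletableFuture<Rep>> e) {
                return size() > maxEntries;
            }
        };
    }

    public synchronized CompletableFuture<Rep> apply(Req request) {
        CompletableFuture<Rep> cached = cache.get(request);
        if (cached != null) {
            return cached; // duplicate: reuse the (possibly in-flight) future
        }
        CompletableFuture<Rep> fresh = backend.apply(request);
        cache.put(request, fresh);
        return fresh;
    }
}
```

Because duplicates get the cached future immediately, the backend does the work once even when the second request arrives before the first response goes out.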

Pretty neat.

The tumblr deploy schedule. Blue vertical lines are deploys, green stacked graphs are requests per second, annotations are my own.

People start deploying just after 10am (not surprising, most folks come in around 10) and keep deploying code until lunch time. Lunch time is a very social thing here, so most folks head out for a bit. After lunch, people are heads down until snack time. Snack time does not really exist. Code pushes resume again after snack time and go until dinner, just after 7pm. After that, last minute bug fixes until just after 8pm. Then sleep.

These opinions are my own. I don’t actually think most tumblr employees go to bed at 8:00pm. There also is not any officially scheduled snack time, although this would be cool.

Collins - Infrastructure Management for Engineers

At Tumblr we strive to automate as much as is reasonable. Automation helps us manage thousands of servers, our MySQL topology, our software deployments, our configuration updates, A/B testing, etc. As your production environment grows it generally becomes less and less consistent and more and more difficult to manage, even with tools like Puppet and Chef. Eventually you need a central point of truth from which you can determine the state of any given asset in your environment.

When we started building a second production datacenter last year we needed a way to manage the intake process for the thousands of servers that would be getting shipped to us. Although we built Collins to support our secondary datacenter, we deployed it to the legacy production environment in February 2012. At that point we already had roughly 1500 production servers yet we had no consistent view of the environment. The initial problem we set out to solve was simply to inventory the production environment and get a sense of what servers were in use, which servers weren’t, and where there might be cost savings. After completing the inventory process, Collins quickly found another use in helping automate the management of our MySQL topology via Jetpants.

Pretty quickly most of our infrastructure was using or populating Collins data in some way: puppet, func, the deployment tool, host provisioning, graphing/trending, proxy configuration, nagios configuration, DNS configuration, etc. Today an engineer at Tumblr can log in to Collins using their LDAP credentials, find an available host, click Provision, and be on their new dev box in under an hour. This is actually part of the developer on-boarding process.

In our recipes document you can find some sample use cases for Collins like:

Today we are open sourcing Collins, the Tumblr infrastructure management system. Collins was developed using the play framework but was designed so that people without any Java/Scala experience could integrate with it using the API or via Callbacks. Here at Tumblr we have bash, python and ruby integrations with Collins, all developed by different people. There is also a PHP SDK for Collins in the works.

We are releasing the following components, available under the Apache License v2.0:

The Documentation is available under the Creative Commons BY 3.0 license.

Collins can be integrated with the Jetpants MySQL management toolkit through an open source plugin called jetpants_collins. This plugin allows Jetpants to use Collins as the single point of truth for your hardware inventory, automatically querying the list of pools, shards, hosts, and database instances in your infrastructure. Furthermore, every change you make to your MySQL topology using Jetpants (master promotions, shard splits, cloning replicas, etc) will be reflected in Collins immediately and automatically.

Over the next month we will also be open sourcing a number of other related components which you can find out more about here and here.

In the meantime, here are some more links to get you started:

A number of people were responsible for helping make Collins so successful at Tumblr. Big thanks to Dan Simon, Steve Salevan, Joshua Hoffman, Dallas Marlow, Brad McDuffie, Evan Elias and all the rest of the Tumblr engineering team. Additionally a number of companies helped beta test Collins and provide feedback, thanks to all of you!

Oh, and if you’re interested in this type of work, we’re hiring!

Tumblr Firehose - The Gory Details

Back in December I started putting some thought into the tumblr firehose. While the initial launch was covered here, and the business stuff surrounding it was covered by places like techcrunch and AllThingsD, not much has been said about the technical details.

First, some back story. I knew in December that a product need for the firehose was upcoming and had simultaneously been spending a fair amount of time thinking about the general tumblr activity stream. In particular I had been toying quite a bit with trying to figure out a reasonable real-time processing model that would work in a heterogeneous environment like the one at Tumblr. I had also been quite closely following some of the exciting work being done at LinkedIn by Jay Kreps and others on Kafka and Databus, by Eric Sammer from Cloudera on Flume, and by Nathan Marz from Twitter on Storm.

I had talked with some of the engineers at twitter about their firehose and knew some of the challenges they had overcome in scaling it. I spent some time reading their fantastic documentation and after reviewing some of these systems came up with the system I actually wanted to build, much of it completely influenced by the great work being done by other people. My ‘ideal’ firehose, from the consumer/client side, had the following properties:

  • Usable via curl
  • Allows a client to ‘rewind’ the stream in case of missed events or maintenance
  • If a client disconnects, they should pick up the stream where they left off
  • Client concurrency/parallelism, e.g. multiple consumers getting unique views of the stream
  • Near real-time is good enough (sub 1s from an event emitted to consumed)

From an event emitter (or producer) perspective, we simply wanted an elastic backend that could grow and shrink based on latency and persistence requirements.

What we ended up with accomplishes all of these goals and was fairly simple to implement. We took the best of many worlds (a bit of Kafka, a bit of finagle, some Flume influences) and created the whole thing in about 10 days. The internal name for this system is Parmesan, which is both a cheese and an Arrested Development character (Gene Parmesan, PI).

The system comprises four primary components:

  • A ZooKeeper cluster, used for coordinating Kafka as well as stream checkpoints
  • Kafka, which is used for message persistence and distribution
  • A thrift process, written with scala/finagle, which the tumblr application talks to
  • An HTTP process, written with scala/finagle, which consumers talk to

The Tumblr application makes a Thrift RPC call containing event data to parmesan. These RPC calls take about 5ms on average, and the client will retry unless it gets a success message back. Parmesan batches these events and uses Kafka to persist them to disk every 100ms. This functionality is all handled by the thrift side of the parmesan application. We also implemented a very simple custom message serialization format so that parmesan could completely avoid any kind of message serialization/deserialization overhead. This had a dramatic impact on GC time (the serialization change wasn’t made until it was needed) which in turn had a significant impact on average connection latency.
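The batching pattern described above (ack the RPC quickly, flush to Kafka on a 100ms tick) can be sketched generically. This is a hypothetical illustration, not parmesan's actual code; all the names here are made up, and the `sink` stands in for the Kafka write.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative sketch: buffer incoming events, flush the batch to a
// persistence sink every 100ms instead of writing per event.
public class EventBatcher<E> {
    private final List<E> buffer = new ArrayList<E>();
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    public EventBatcher(final java.util.function.Consumer<List<E>> sink) {
        scheduler.scheduleAtFixedRate(new Runnable() {
            public void run() {
                List<E> batch;
                synchronized (buffer) {
                    if (buffer.isEmpty()) return;
                    batch = new ArrayList<E>(buffer);
                    buffer.clear();
                }
                sink.accept(batch); // one write per tick instead of per event
            }
        }, 100, 100, TimeUnit.MILLISECONDS);
    }

    // The RPC handler just appends; the caller gets its ack without
    // waiting for the disk write.
    public void submit(E event) {
        synchronized (buffer) {
            buffer.add(event);
        }
    }

    public void shutdown() {
        scheduler.shutdown();
    }
}
```

The trade-off is the usual one: you cap per-event overhead at the cost of up to 100ms of buffered data being at risk if the process dies between flushes.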

On the client side, any standard HTTP client works and requires (besides a username and password) an application ID and an optional offset. The offset is used for determining where in the stream to start reading from, and is specified either as Oldest (7 days ago), Newest (from right now), or an offset in seconds from the current time in UTC. Up to 16 clients with the same application ID can connect, each viewing a unique partition of the activity stream. Stream partitioning allows you to parallelize your consumption without seeing duplicates. This is a great feature for instance if you took your app down for maintenance and want to quickly catch back up in the stream.
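The partitioning guarantee above (each of up to 16 consumers sees a unique, non-overlapping slice of the stream) boils down to routing every event to exactly one partition by hashing a stable key. Here's a minimal sketch of that routing, under the assumption (mine, not stated in the post) that something like a blog ID is used as the key:

```java
// Illustrative sketch: route each event to one of N partitions by hashing
// a stable key, so each consumer's partition is a disjoint slice of the stream.
public class StreamPartitioner {
    private final int numPartitions;

    public StreamPartitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    public int partitionFor(String key) {
        // Mask off the sign bit rather than Math.abs, which
        // overflows for Integer.MIN_VALUE
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }
}
```

Since the same key always hashes to the same partition, two consumers on different partitions can never see the same event, which is what makes the parallel catch-up scenario safe.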

Kafka doesn’t easily (natively) support this style of rewinding, so we just persist stream offsets to ZooKeeper. That is, clients with a specific application ID will periodically say, “Hey, at this unixtime I saw a message which had this internal Kafka offset”. By periodically persisting this data to ZooKeeper, we can ‘fake’ the rewind functionality in a way that is useful, but imprecise (we basically have to estimate where in the Kafka log to start reading from).
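The checkpoint lookup itself is straightforward: keep the (unixtime → Kafka offset) pairs sorted and, given a rewind target time, take the latest checkpoint at or before it. A hypothetical sketch (illustrative names, not parmesan's code):

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch of the rewind estimate: periodic
// (unixtime -> Kafka offset) checkpoints, queried with floorEntry.
public class OffsetCheckpoints {
    private final TreeMap<Long, Long> checkpoints = new TreeMap<Long, Long>();

    // Called periodically: "at this unixtime I saw this Kafka offset"
    public void record(long unixtime, long kafkaOffset) {
        checkpoints.put(unixtime, kafkaOffset);
    }

    // Imprecise by design: start from the nearest earlier checkpoint
    // (or the oldest one we have) and read forward from there.
    public long estimateOffsetAt(long unixtime) {
        Map.Entry<Long, Long> floor = checkpoints.floorEntry(unixtime);
        if (floor != null) return floor.getValue();
        return checkpoints.isEmpty() ? 0L : checkpoints.firstEntry().getValue();
    }
}
```

The estimate always lands at or before the requested time, so a rewinding client may re-see a few events near the checkpoint boundary but never misses any.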

We use 4 ‘queue class’ (tumblr speak for a box with 72GB of RAM and 2 mirrored disks) machines, capable of supporting roughly 100k messages per second each, to support the entire stream. Those 4 machines provide a message backlog of 1 week, allowing clients to drop into the stream anywhere in the past week.

As I mentioned on twitter, I’m quite proud of the software and the team behind it. Many thanks to Derek, Danielle and Wiktor for help and feedback.

If you’re interested in this kind of distributed systems work, we’re hiring.

Oct 3

ID Generation at Scale

Most companies that operate at a large enough scale run into the problem of efficiently generating IDs. At tumblr this problem was solved with a simple libevent-based HTTP service that essentially generates IDs fast enough to meet our current demand and handles failure/startup by grabbing the MAX(id) (plus a large cushion). This approach has a duplicate ID/multi-datacenter problem, but we’ll deal with that soon enough.
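The core of that recovery scheme is just a counter seeded past anything that could already be in use. A minimal sketch of the idea (hypothetical, not the tumblr service; the libevent service is C, this just shows the seeding logic):

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch: on startup, seed the counter from MAX(id) in the
// database plus a large cushion, so IDs handed out before a crash (but not
// yet visible in the database) can't be reissued.
public class IdGenerator {
    private final AtomicLong counter;

    public IdGenerator(long maxPersistedId, long cushion) {
        this.counter = new AtomicLong(maxPersistedId + cushion);
    }

    public long nextId() {
        return counter.incrementAndGet();
    }
}
```

The cushion trades a gap in the ID space for not having to durably persist the counter on every request; the multi-datacenter duplicate problem mentioned above remains, since two datacenters seeding from the same MAX(id) would collide.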

Instagram recently wrote an interesting piece on their approach to ID generation. The author looked at Twitter’s snowflake (which I haven’t always been kind to) and discussed why it wasn’t the right fit for their problem. The author also went into a fair amount of detail about the ID generation scheme and why they made specific implementation choices.

Although I personally don’t like the approach Instagram took, I do appreciate their engineering-in-the-open style post and the insight it provides into their infrastructure. I also like that the team looked at snowflake, which I think highlights something that Twitter has been great at but hasn’t gotten much credit for. Twitter puts their stuff out on GitHub, whether it’s usable by the masses or not. Although this means you can’t always take their stuff and just use it, it gives people like the Instagram team ideas on how others are solving problems. I think operating at scale and in the open, like Instagram/Twitter/etc. are doing, is a credit to their engineering teams and company cultures. I wish we saw this kind of stuff from companies like Google & Amazon.

Sep 9

How long does it take to make a context switch?

Really nice breakdown of context switching costs on the JVM, compared across different CPUs.

Sorting Maps by Value in Java

I was recently working on a problem where I had a Map with String keys and Integer values. The data was being output by a Hadoop mapper and I wanted my reducer to only output the top 50 values per day. I needed to sort my map by the integer value and then grab only the top 50 items. The method I used to grab the top 50 was kind of gross (in my mind anyhow), so I wanted to share it and see what other ways people might solve this problem. Below is the code:

import java.util.Comparator;
import java.util.Map;
import java.util.Map.Entry;
import java.util.PriorityQueue;

PriorityQueue<Map.Entry<String, Integer>> q = new PriorityQueue<Map.Entry<String, Integer>>(
    50, new Comparator<Map.Entry<String, Integer>>() {
        public int compare(Entry<String, Integer> e1, Entry<String, Integer> e2) {
            return e2.getValue().compareTo(e1.getValue());
        }
    });

I then iterated over my map and shoved all the entries into the priority queue. Polling elements off the queue then gives them back in sorted order (note that a PriorityQueue’s iterator doesn’t guarantee any ordering; you have to poll()). So, here are some issues I see with this approach.
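Put together, the whole approach looks like this. This is a self-contained sketch of the technique described above (the method and names are mine, for illustration):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Self-contained sketch: push every map entry into a max-ordered
// priority queue, then poll() off the top k. poll() is what yields
// sorted order; the queue's iterator is unordered.
public class TopK {
    public static List<Map.Entry<String, Integer>> topK(
            Map<String, Integer> counts, int k) {
        PriorityQueue<Map.Entry<String, Integer>> q =
            new PriorityQueue<Map.Entry<String, Integer>>(
                Math.max(1, counts.size()),
                new Comparator<Map.Entry<String, Integer>>() {
                    public int compare(Map.Entry<String, Integer> e1,
                                       Map.Entry<String, Integer> e2) {
                        return e2.getValue().compareTo(e1.getValue());
                    }
                });
        q.addAll(counts.entrySet());
        List<Map.Entry<String, Integer>> top =
            new ArrayList<Map.Entry<String, Integer>>();
        while (top.size() < k && !q.isEmpty()) {
            top.add(q.poll());
        }
        return top;
    }
}
```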

  • The memory requirement is 2n.
  • The runtime is roughly O(n log n) (iterating over each of the n map entries, with an O(log n) queue insertion for each)
  • The PriorityQueue implementation is unbounded (but I actually only want 50 items in it)

Is there a better (more efficient) way to do this without changing the original map?