• Using entropy to route web traffic

    adamlaiacano:

    Earlier this week, Blake asked me for some help with a problem he’s working on. He has a couple of hash functions that are being used to route web traffic to a number of different servers. A hash function takes an input, such as a blog’s url, and outputs a number between 0 and 232. Say we have 1000 servers, that means that each one will handle about 430 million points in the hash-space.

    The data looked something like this (with fake blog names, of course):

    ##       blog.name        H1        H2        H3        H4
    ## 1 23.tumblr.com 3.137e+09 1.866e+09 6.972e+08 5.792e+08
    ## 2 19.tumblr.com 1.875e+09 2.545e+08 2.606e+09 1.312e+09
    ## 3 34.tumblr.com 1.366e+09 2.236e+09 1.106e+09 3.640e+09
    ## 4 43.tumblr.com 2.639e+09 1.098e+09 8.755e+08 1.507e+09
    ## 5 90.tumblr.com 6.564e+08 5.397e+07 3.084e+09 2.961e+09
    ## 6 29.tumblr.com 2.476e+09 4.532e+08 2.787e+08 4.894e+08
    

    One important thing to point out before we get started is that this has to be a representative sample of the request data. Despite the wild popularity of my personal blog, it doesn’t get a sliver of the traffic that Beyonce gets. That fact needs to be represented in the sample data, meaning that her blog should appear in more rows of the sample data than mine.

    Plot the data

    The first thing I ever do is plot data to get a sense of what I’m working with and what I’m trying to accomplish. The density plots below show the distribution of values in the hash space for each algorithm. If you’re not familiar with kernel density plots, you can imagine this to be a smoothed (and prettier) version of a histogram. For the electrical engineers out there, it’s the sum of the convolution of a kernel function (usually a gaussian), with an impulse function at each of the points on the x-axis (represented here by dots).

    image

    By comparison, here are the density plots of a “near-ideal” example (1000 pulls from a uniform distribution) and a bad example (all assigned to the value 2e+09). The worst case example here shows the shape of the kernel function.

    image

    Calculate the entropy

    In information theory, entropy is the minimum number of bits required (on average) to identify an encoded symbol (stay with me here…). If we’re trying to transmit a bunch of text digitally, we would encode the alphabet where each “symbol” is a letter that will be represented in 1’s and 0’s. In order to transmit our message quickly, we want to use as few bits as possible. Since the letter “e” appears more frequently than the letter “q”, we want the symbol for “e” to have fewer bits than “q”. Make sense?

    Huffman Coding is one encoding algorithm. There’s an example implementation here, which assigns the code 100 to “e” and 1111110010 to “q”. The more “uneven” your symbol distribution is, the fewer bits it will take, on average, to transmit your message (meaning you’ll have a lower entropy). The entropy value is a lower bound for the actual weighted average of the symbol lengths. There are special cases where some encoding algorithms get closer to the entropy value than others, but none will ever surpass it.

    The actual entropy formula is:

    \[ H(x)=-\sum_{i=0}^{N-1} p_i log_2(p_i) \]

    Where \( H(x) \) is the entropy, and \( p_i \) is the probability of that symbol \( i \) will appear. In the example I linked to above, \( p_e = 0.124 \) and \( p_q=0.0009 \), so it makes sense that e’s symbol is so much shorter. In the example, the average number of bits per symbol is \( \sum S_i p_i = 4.173 \frac{bits}{symbol} \), where \( S_i \) is the number of bits in the symbol. The entropy, from the above equation, is \( 4.142 \frac{bits}{symbol} \).

    The example problem of web traffic distribution is a little different. We’re not actually encoding anything, but rather trying to make the theoretical lower bound for average number of bits/signal as high as possible.

    We can consider each server to be a symbol, and the amount of traffic that it recieves is decided by the hash function that we’re trying to choose. If one of our servers is the equivalent of the letter “e”, it’s going to be totally overloaded while the “q” isn’t going to be handling much traffic at all. We want each symbol (server) to appear (receive traffic) equally often.

    So to calculate the entropy, we’ll take a histogram of the hash values with 20 buckets (representing the 20 servers). That will give us the number of requests that go to each server. Dividng that by the total number of requests gives us each server’s probability of handing the next incoming request. These are the \( p_i \) values that we need in order to calculate the entropy. In code, it looks like this:

    calc.entropy <- function(hash.value) {
        h = hist(hash.value, plot = FALSE, breaks = seq(0, 2^32, length.out = 21))
        probs = h$counts/sum(h$counts)
        print(probs)
        entropy = -sum(probs * log2(probs))
    }
    

    The entropy values for our four hash functions are:

    ##   hash.function entropy
    ## 1            H1   4.203
    ## 2            H2   4.226
    ## 3            H3   4.254
    ## 4            H4   4.180
    

    And while we’re at it, here’s the entropy of our best/worst case example that we plotted earlier.

    ##    hash.function entropy
    ## 1     near.ideal   4.309
    ## 2 worst.possible   0.000
    

    Why is the worst case value 0? Because if all traffic is going to one server, we wouldn’t need any bits at all to tell us which server the request is going to. The theoretical limit for a histogram with 20 buckets is: \( -20\frac{1}{20}log_2{\frac{1}{20}} = 4.32 \), which we’re close to but can never exceed.

    All of our hash functions appear to be working pretty well, especially for such a small sample size that I’m using for this blog post. It looks like our winner is H3.

    To summarize here’s what we did to find the optimal hashing function:

    • Get a representative sample of your web traffic
    • Run each request through the hashing function
    • Take a histogram of the resulting values with N bins, where N is the number of servers you have available
    • Divide the bin counts by the total number of requests in your sample to get the probability of handing a request for each server
    • Calculate the entropy, \( H(x)=-\sum_{i=0}^{N-1} p_i log_2(p_i) \), for each hash function

    There are more considerations to take, like setting an upper bound on the value of \( p_i \) to ensure that no single server ever gets so busy that it can’t handle its load.

    If you want to read up more on information theory, Elements of Information Theory by Thomas Cover and Joy Thomas is an excellent book that is reasonably priced (used) on Amazon.

    There is also, of course, Claude Shannon’s landmark paper from 1948 “A Mathematical Theory of Communication”“, in which he essentially defines the entire field.

  • seejohnrun:

tumblr Java client: jumblr
Last week we announced our first official API client, for JavaScript (read: the announcement). This week we’re back to announce the release of an official Java client, Jumblr:
Blog blog = client.blogInfo("seejohnrun.tumblr.com");
for (Post post : blog.posts()) {
    post.like(); // you're too kind..
}

Like its JavaScript counterpart, Jumblr comes with full support for all of the API V2 endpoints. Check out more detail on the github page!
If you’re interested in following along with any of our open-source work, check out our GitHub page.

    seejohnrun:

    tumblr Java client: jumblr

    Last week we announced our first official API client, for JavaScript (read: the announcement). This week we’re back to announce the release of an official Java client, Jumblr:

    Blog blog = client.blogInfo("seejohnrun.tumblr.com");
    for (Post post : blog.posts()) {
        post.like(); // you're too kind..
    }
    

    Like its JavaScript counterpart, Jumblr comes with full support for all of the API V2 endpoints. Check out more detail on the github page!

    If you’re interested in following along with any of our open-source work, check out our GitHub page.

  • Community is extremely important us here at Tumblr. That goes double for engineers and developer evangelists. After spending all day giving talks at the API Strategy &amp; Practice Conference, many of the speakers who were in town came over to TumblrHQ for the monthly Dev Evangelist Meetup here in NYC. It was a great time had by all and we were even able to get a group shot of everyone!
If you&#8217;re looking to be a part of a team that encourages community, We&#8217;re hiring.

    Community is extremely important us here at Tumblr. That goes double for engineers and developer evangelists. After spending all day giving talks at the API Strategy & Practice Conference, many of the speakers who were in town came over to TumblrHQ for the monthly Dev Evangelist Meetup here in NYC. It was a great time had by all and we were even able to get a group shot of everyone!

    If you’re looking to be a part of a team that encourages community, We’re hiring.

  • seejohnrun:


tumblr.js JavaScript client
Today I’m excited to announce the release of tumblr.js, the first of several official API clients we’ll be rolling out over the next few months.
You can install it now with npm, and start making something awesome:
var tumblr = require('tumblr.js');
var client = tumblr.createClient({
  consumer_key: 'consumer_key',
  consumer_secret: 'consumer_secret',
  token: 'oauth_token',
  token_secret: 'oauth_token_secret'
});

// Name all of the authenticating user's blogs
client.userInfo(function (err, data) {
  data.user.blogs.forEach(function (blog) {
    console.log(blog.name);
  });
});

It comes with full support for all of the API V2 endpoints including tag search, following, liking, and post creation. For more detail, see the GitHub page.
More to come soon!

    seejohnrun:

    tumblr.js JavaScript client

    Today I’m excited to announce the release of tumblr.js, the first of several official API clients we’ll be rolling out over the next few months.

    You can install it now with npm, and start making something awesome:

    var tumblr = require('tumblr.js');
    var client = tumblr.createClient({
      consumer_key: 'consumer_key',
      consumer_secret: 'consumer_secret',
      token: 'oauth_token',
      token_secret: 'oauth_token_secret'
    });
    
    // Name all of the authenticating user's blogs
    client.userInfo(function (err, data) {
      data.user.blogs.forEach(function (blog) {
        console.log(blog.name);
      });
    });
    

    It comes with full support for all of the API V2 endpoints including tag search, following, liking, and post creation. For more detail, see the GitHub page.

    More to come soon!

  • 0xa:

We have Engineering Summer Internships @ Tumblr!  We’re looking for aspiring software engineers with a passion for open source software to join us for a summer of programming and fun.  You will be integrated into a small engineering team working on a real-world project as part of
Product Engineering: Using PHP and JavaScript, create new and improve site features that keep our growing millions of users doing amazing things with their tumblelogs.
Search Engineering: Research, expand and refine Tumblr’s search infrastructure and search features.
Platform Engineering: Write highly optimized distributed services that manage data and requests in real time, helping our site scale to billions of posts.
Infrastructure Engineering: Work hands on with the Network team configuring, deploying and maintaining JunOS network devices.
Mobile Engineering: Come work on the application that’s putting Tumblr in millions of users’ pockets.  A passion for IOS and Android is needed.
Polish your code samples, then send us your resume via our Engineering Summer Internship job page.
(We’re also hiring full time Engineers at every level of our technical stack.  Learn more on Tumblr’s Jobs page).

    0xa:

    We have Engineering Summer Internships @ Tumblr!  We’re looking for aspiring software engineers with a passion for open source software to join us for a summer of programming and fun.  You will be integrated into a small engineering team working on a real-world project as part of

    • Product Engineering: Using PHP and JavaScript, create new and improve site features that keep our growing millions of users doing amazing things with their tumblelogs.
    • Search Engineering: Research, expand and refine Tumblr’s search infrastructure and search features.
    • Platform Engineering: Write highly optimized distributed services that manage data and requests in real time, helping our site scale to billions of posts.
    • Infrastructure Engineering: Work hands on with the Network team configuring, deploying and maintaining JunOS network devices.
    • Mobile Engineering: Come work on the application that’s putting Tumblr in millions of users’ pockets.  A passion for IOS and Android is needed.

    Polish your code samples, then send us your resume via our Engineering Summer Internship job page.


    (We’re also hiring full time Engineers at every level of our technical stack.  Learn more on Tumblr’s Jobs page).

  • Tumblr for iPad’s custom UIPopoverController

    When Peter was designing Tumblr for iPad, he came up with a great custom popover design:

    Implementing his design was a challenge that I wanted to share.

    iOS’s UIKit framework includes UIPopoverBackgroundView, an abstract class which can be subclassed to provide a custom background for a popover. Sounds perfect, right? UIPopoverBackgroundView let me use custom assets provided by Peter and Ben for the background and arrow, but the design also included a custom inset stroke/shadow. Fortunately, UIPopoverBackgroundView makes it easy to hide the default inset:

    +(BOOL)wantsDefaultContentAppearance {
        return NO;
    }
    

    Unfortunately, adding its replacement wasn’t so easy.

    First I subclassed UIPopoverController and added the custom inset to the view hierarchy when the popover is presented (_insetShadow is a resizable UIImageView):

    UIView *superview = self.contentViewController.view.superview;
    [superview addSubview:_insetStroke];
    
    _insetStroke.frame = superview.bounds;
    

    Simple enough, unless the popover’s content view controller is a UINavigationController.

    In that case, the view hierarchy looks different and we have to resort to something like:

    UIViewController *controller = self.contentViewController;
    
    BOOL isNavigationController = [controller isKindOfClass:[UINavigationController class]];
    
    UIView *superview = isNavigationController 
            ? controller.view 
            : controller.view.superview;
    
    [superview addSubview:_insetStroke];
    
    CGRect frame = controller.view.superview.bounds;
    
    if (isNavigationController && 
            !((UINavigationController *)controller).isNavigationBarHidden) {
        float navBarOffset = 36;
    
        frame.origin.y += navBarOffset;
        frame.size.height -= navBarOffset;   
    }
    
    _insetStroke.frame = frame;
    

    Ugly, but it works.

    Except it doesn’t work on iOS 5 because wantsDefaultContentAppearance wasn’t added until iOS 6. On iOS 5 we need to manually traverse the view hierarchy to find the default stroke and remove it. But keep in mind that the view hierarchy looks different depending on whether or not the content view controller is a navigation controller. We end up with something like this:

    if (!IS_IOS_6) {
        for (UIView *subview in [superview subviews]) {
            if (isNavigationController) {
                for (UIView *subsubview in [subview subviews])
                    if ([subsubview class] == [UIImageView class])
                        subsubview.hidden = YES;
            } else {
                if ([subview class] == [UIImageView class] && subview != _insetStroke)
                    subview.hidden = YES;
            }
        }
    }
    
  • Orr Sella: Introducing tumblr4s: A Scala Library For The Tumblr API

    orrsella:

    Back in February of this year I stumbled upon an amazing article on HighScalability.com about Tumblr’s infrastructure. It’s a great read for many reasons (go read it). One of the greatest take-aways I had from this article, was this “new” programming language called Scala. I had never heard of…

    Awesome stuff. Thanks Orr!

  • adamlaiacano:

    John Myles White was kind enough to come by Tumblr HQ this week and give a talk about the advantages that MAB (Multi-Armed Bandit) testing provide over traditional A/B testing.

    Most of the content is drawn from his ebook “Bandit Algorithms for Website Optimization”

  • adamlaiacano:

The Bad Data Handbook is finally out! It’s the first time I’ve contributed to a publication and I’m incredibly excited about it.
The top required skills of a data scientist are generally considered to be: mathematical know-how, programming capabilities, and some sort of domain knowledge enabling them to ask (and then answer) relevant questions.
This book is about all of the other garbage that you have to put up with along the way. Each chapter is written by someone who has spent more time than they probably would have liked dealing with a specific issue, and they provide some tips and pitfalls. I haven’t read the other chapters yet, but some of the included topics are:
Test drive your data to see if it’s ready for analysis
Work spreadsheet data into a usable form
Handle encoding problems that lurk in text data
Develop a successful web-scraping effort (that’s me)
Use NLP tools to reveal the real sentiment of online reviews
Address cloud computing issues that can impact your analysis effort
Avoid policies that create data analysis roadblocks
Take a systematic approach to data quality analysis
I hope at least a few of you buy it and enjoy it.

    adamlaiacano:

    The Bad Data Handbook is finally out! It’s the first time I’ve contributed to a publication and I’m incredibly excited about it.

    The top required skills of a data scientist are generally considered to be: mathematical know-how, programming capabilities, and some sort of domain knowledge enabling them to ask (and then answer) relevant questions.

    This book is about all of the other garbage that you have to put up with along the way. Each chapter is written by someone who has spent more time than they probably would have liked dealing with a specific issue, and they provide some tips and pitfalls. I haven’t read the other chapters yet, but some of the included topics are:

    • Test drive your data to see if it’s ready for analysis
    • Work spreadsheet data into a usable form
    • Handle encoding problems that lurk in text data
    • Develop a successful web-scraping effort (that’s me)
    • Use NLP tools to reveal the real sentiment of online reviews
    • Address cloud computing issues that can impact your analysis effort
    • Avoid policies that create data analysis roadblocks
    • Take a systematic approach to data quality analysis

    I hope at least a few of you buy it and enjoy it.

  • Tumblr for iPhone is now 100% native

    My appreciation for those who build web browsers has grown dramatically over the past few months.

    With today’s 3.2 release I’m happy to announce that Tumblr for iPhone is now 100% native. We love web technologies at Tumblr but believe this change provides a much better experience for this particular product.

    HTML is the first-class citizen at Tumblr. All posts, whether authored in plain text, Markdown, or using our WYSIWIG editor, are stored as HTML. This is what we use to build your dashboard both on the website and in the mobile apps, via our API.

    The decision to use a web view to render lists of posts in the 3.0 rewrite (June 2012) had nothing to do with cross-platform compatibility or ease of development and everything to do with needing to render (somewhat) arbitrary HTML, provided by our users. This makes building a Tumblr iOS app of the highest caliber an interesting challenge.

    As noted by Zach Williams, we used GRMustache to render the post lists and Zepto.js was used to implement the web view’s behavior, with some slight modications (the longTap event didn’t work exactly the way we wanted, tap wouldn’t play HTML5 audio, and we needed to prevent touches while scrolling). CSS classes were used instead of :active pseudo-classes so they could be removed programmatically, as scrolling began. Our JavaScript-native “bridge” was basically this example.

    Suitable scrolling performance was difficult to achieve in the web view as lists of Tumblr posts are usually extremely media-heavy. Some of the measures we took included using images in place of border-radius and box-shadow CSS and scaling/compressing photos on our servers, to the exact size needed on the phone.

    We considered writing a JavaScript dequeuing mechanism but didn’t end up doing so. Since our 3.0 release, Airbnb open-sourced their “UITableView for the web” in ∞.js and LinkedIn published a detailed overview of how they approached a similar problem in their iPad application. Anyone attempting to build a web/native hybrid application like we did should take a look at both of these resources first.

    I’m excited about the speed and stability that I believe going fully native has brought to our app, but please let me know how we can make the Tumblr iOS experience even better.