The making of the NYT’s Netflix graphic

One of The Times’ recent graphics, “A Peek Into Netflix Queues,” ended up being one of our more popular graphics of the past few months. (A good roundup of what people wrote is here.) Since then, there have been a few questions about how the graphic was made, and Tyson Evans, a friend and colleague, thought it might interest SND members. (I bother Tyson with questions about CSS and Ruby pretty regularly, so I owe him a few favors.)

Most readers are probably interested in the interactive graphic, although I will say that we also ran a lovely full-page graphic in print in the Metropolitan section, which goes out to readers in the New York region. That graphic had a lot of interesting statistical analysis – in fact, it would have been nice to get some analysis in the web version, more on that later – but for this I will focus mostly on the web version. If there are questions about the print graphic, I will make sure I get Amanda Cox to try to explain cluster analysis to me again.

First is the data itself. Jo Craven McGinty, a CAR reporter, was in contact with Netflix to obtain a database of the top 50 movies in each ZIP code for every ZIP in the country. That’s about 1.9 million records. The database did not include the number of people renting the movie – just the rank. (We would have loved to have it, but Netflix said no. Understandably, it would have given competitors a perfect map of their market penetration.) The raw data looked like this:

We decided to focus on cities, rather than the nation as a whole, for a few reasons.

First: Most of the interesting trends occurred on a local scale – stark differences between the South Bronx and Lower Manhattan, for example, or the east and west sides of D.C. – and weren’t particularly telling at a national scale. (We actually generated U.S. maps in PDF form that showed all 35,000 or so ZIPs, but when we flipped through them, with a few exceptions, we found the nationwide patterns weren’t nearly as interesting as the close-in views.)

Second: Matthew Bloch’s mapping framework is highly optimized, but it’s not necessarily equipped to handle changing 35,000+ polygons between 100 different movies as fast as would be necessary – no one likes to use a scrubber that’s slow to react.

One solution to the too-many-polygons problem is scaling up the data to wider geographies, such as one based on the first three digits of ZIP codes. But in this case, we couldn’t do that because we didn’t have the total number of renters in each ZIP – we only had the rank, and ranks can’t be added together the way counts can.
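To make that constraint concrete, here is a minimal Ruby sketch of the kind of roll-up that rental counts would have allowed. It is only an illustration: the `RENTALS` records, their counts and the `rank_by_zip3` helper are all hypothetical, and the count column is exactly the field the Netflix data did not include.

```ruby
# Hypothetical records: [zip, movie, rental_count].
# The real data had only a rank per ZIP, never a count.
RENTALS = [
  ["10001", "Milk",     120],
  ["10002", "Milk",      30],
  ["10001", "Twilight",  80],
  ["10002", "Twilight",  90],
  ["11201", "Milk",      50]
]

# Sum counts by the first three ZIP digits, then rank movies in each area.
def rank_by_zip3(records)
  totals = Hash.new { |hash, key| hash[key] = Hash.new(0) }
  records.each do |zip, movie, count|
    totals[zip[0, 3]][movie] += count
  end

  ranked = {}
  totals.each do |zip3, counts|
    # Highest count first; ties are broken alphabetically.
    ordered = counts.sort_by { |movie, count| [-count, movie] }
    ranked[zip3] = ordered.map { |movie, _count| movie }
  end
  ranked
end

p rank_by_zip3(RENTALS)
```

In this made-up sample, “Milk” is first in 10001 and “Twilight” is first in 10002. Only the counts reveal which one leads the combined 100 area, which is why ranks alone could not be merged.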

So, we decided on a dozen cities, determined mostly by population but also geographic distribution, which is why Minneapolis, Seattle, Denver and San Francisco are on the map, but not Houston or Philadelphia. (This apparent injustice was not lost on commenters from those cities.) We made individual GIS shapefiles of each city, then merged them into a single shapefile using ArcView’s ‘merge’ tool.

This reduced the number of shapes down to about 5,000 or so, well within Matthew Bloch’s “still super fast” threshold.

Still, the hardest part about this graphic was designing the interface. We wanted readers to be able to find a given movie quickly, but a search box didn’t really work visually. We also wanted to give readers an idea which movies were most popular and which were most critically acclaimed.

I mocked up at least ten versions. None were any good. The challenge was navigation. As a user, I wanted to be able to see one movie in a bunch of different cities, fast, or I wanted to see a bunch of movies in one city just as fast. So there are two major navigation elements – cities and movies – but the map itself still needed to be the visual focal point of the graphic.

In the end, graphics director Steve Duenes and deputy Matt Ericson came up with a sketch based on elements of my previous mockups:

which I turned into a more refined Illustrator mockup:

We tweaked this until it resembled what’s now online. It’s a complicated interface, but I don’t know if it could be any simpler.

Once we settled on a design, there was still a lot of work to do. We needed to get all the movie thumbnail images, the Metacritic ‘Metascores’, the links to The Times’ reviews and the first few sentences of the reviews themselves. We did this mostly by writing scripts. Both Metacritic and Netflix have great search-engine optimization, so just Googling a film title with the word ‘Netflix’ or ‘Metacritic’ generally gives you what you want in the first search entry:

We wrote a Ruby script that parsed the Google search results page for each movie, which typically contained the Metacritic score and Netflix ID. We used Hpricot, an HTML-parsing Ruby library, to pull out the Netflix ID and Metascore of any film. With the Netflix ID, we know the link to the thumbnail image is

"cdn-0.nflximg.com/us/boxshots/large/" + netflix_movie_id + ".jpg"
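As a rough sketch of that step, the Ruby below pulls a numeric ID out of a chunk of HTML and builds the thumbnail path quoted above. It uses a plain regular expression rather than Hpricot so that it runs without extra gems. The `/Movie/<title>/<id>` link shape, the helper names and the sample HTML are assumptions made for illustration, not the script we actually ran.

```ruby
# Assumed shape of a Netflix title link: ".../Movie/<title>/<numeric id>".
NETFLIX_ID_PATTERN = %r{netflix\.com/Movie/[^/]+/(\d+)}i

# Return the first numeric Netflix ID found in the HTML, or nil if none.
def netflix_id_from(html)
  match = NETFLIX_ID_PATTERN.match(html)
  match && match[1]
end

# Build the thumbnail link from the ID, using the path quoted above.
def thumbnail_url(netflix_movie_id)
  "cdn-0.nflximg.com/us/boxshots/large/" + netflix_movie_id + ".jpg"
end

# Hypothetical search-result fragment with a made-up ID.
sample_html = '<a href="http://www.netflix.com/Movie/Example_Title/12345678">Example</a>'

id = netflix_id_from(sample_html)
puts thumbnail_url(id) if id
```

Returning nil when no ID is found makes it easy to collect the misses in a list and look them up by hand.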

We used a similar technique to fetch the links and content of The Times’ reviews, and then filled in any missing movies by looking the information up by hand.

As for the making of the map itself, the concept is very simple. For any movie, each ZIP code is assigned a color based on its rank. If it’s not in the top 50, it’s not shaded. That’s about it. To optimize the map, Matthew Bloch did a bit of database work, giving each movie title a numerical ID instead of using its full title, since it’s faster to parse through numbers than text.
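The shading rule and the title-to-ID substitution can be sketched in a few lines of Ruby. This is an illustration under stated assumptions, not the code behind the published map (the interactive itself was not written in Ruby): the five-color `PALETTE`, the ten-rank bins and both helper names are invented here.

```ruby
# Assumed palette, darkest first. The published colors and bins may differ.
PALETTE = ["#7f0000", "#b30000", "#e34a33", "#fc8d59", "#fdcc8a"]
TOP_N = 50

# Map a rank (1..50) to a color; nil means "leave the ZIP unshaded".
def color_for_rank(rank)
  return nil if rank.nil? || rank < 1 || rank > TOP_N
  bin_size = TOP_N / PALETTE.length # 10 ranks per color
  PALETTE[(rank - 1) / bin_size]
end

# Give each distinct title a small integer ID, in order of first appearance.
def build_title_ids(titles)
  ids = {}
  titles.each { |title| ids[title] ||= ids.length }
  ids
end

title_ids = build_title_ids(["Milk", "Twilight", "Milk"])
p title_ids
p color_for_rank(3)   # a top-10 rank gets the darkest color
p color_for_rank(51)  # outside the top 50, so no shading
```

With IDs in place, each record can carry a small integer instead of a full title string, which is the substitution described above.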

The result was the graphic that’s online now. We were able to get a lot in, but we still had to leave a lot out, such as different ways to shade the maps other than by movie (e.g., where people rented movies that were nominated for Best Picture, or shading each ZIP code based on a calculation of the Metacritic ratings for its top 50 movies, which we did in print for the New York region). Those options would have been nice to include, but they would have made the interface even more complicated.

Don’t get me wrong – leaving things out is critical in interactive graphics, where the default temptation is to dump all the data you have behind an interface. It’s hard to say no to that, because readers are going to find a lot of things with raw data that you might have missed. (Such as the interesting island of Andrews Air Force Base.)

It’s something we know we can do better; I don’t think anyone would disagree that tidbits of analysis are usually more meaningful than massive streams of raw data. It’s nice to get both in if you can. We’re working on it.

Kevin Quealy has been a graphics editor at the New York Times for almost two years. He has a Master’s degree from the Missouri School of Journalism and a Bachelor’s degree in Physics from Gustavus Adolphus College. He has previously worked at the Philadelphia Inquirer and the St. Louis Post-Dispatch.

29 comments

Kevin, you’ll be interested to know that this will be required reading for my new Multimedia Planning and Design class. You’re the expert grownup now!

This is some great behind-the-scenes info on the process of bringing a complex graphic to life. I’m with Joy – this is going into next fall’s infographics course.

Does anyone know what programs were used other than ArcGIS, Illustrator and Ruby for scraping?

In particular, I’m curious whether the web portal was done with Processing or Flash. How was the database managed?

Also – I should say, I’m not necessarily interested in how NYTimes did it, I’m more interested in getting a sense of how one could do something similar.

Thanks in advance,
Dana

Cool. How about releasing the data for the other zips so I can use ArcMap to make a thematic map of my region, Detroit?

I really liked this piece when I saw it on the Times site, great work!

One of the anomalies I noticed was that you outed the guy who gets his movies delivered to work at LaGuardia (11371) as a fan of “Romancing the Stone”, “Crocodile Dundee 2”, and “Godzilla’s Revenge”.

Fantastic detail. I would love to know the timeline for the graphic … from concept to published graphic – and how many graphic artists involved … Great work.
