The making of the NYT’s Netflix graphic
January 20th, 2010
One of The Times’ recent graphics, “A Peek Into Netflix Queues,” ended up being one of our more popular graphics of the past few months. (A good roundup of what people wrote is here.) Since then, there have been a few questions about how the graphic was made, and Tyson Evans, a friend and colleague, thought it might interest SND members. (I bother Tyson with questions about CSS and Ruby pretty regularly, so I owe him a few favors.)
Most readers are probably interested in the interactive graphic, although I will say that we also ran a lovely full-page graphic in print in the Metropolitan section, which goes out to readers in the New York region. That graphic had a lot of interesting statistical analysis – in fact, it would have been nice to get some analysis in the web version, more on that later – but for this I will focus mostly on the web version. If there are questions about the print graphic, I will make sure I get Amanda Cox to try to explain cluster analysis to me again.
First is the data itself. Jo Craven McGinty, a CAR reporter, was in contact with Netflix to obtain a database of the top 50 movies in each ZIP code for every ZIP in the country. That’s about 1.9 million records. The database did not include the number of people renting the movie – just the rank. (We would have loved to have it, but Netflix said no. Understandably, it would have given competitors a perfect map of their market penetration.) The raw data looked like this:
We decided to focus on cities, rather than the nation as a whole, for a few reasons.
First: Most of the interesting trends occurred on a local scale – stark differences between the South Bronx and Lower Manhattan, for example, or the east and west sides of D.C. – and weren’t particularly telling at a national scale. (We actually generated U.S. maps in PDF form that showed all 35,000 or so ZIPs, but when we flipped through them, with a few exceptions, we found the nationwide patterns weren’t nearly as interesting as the close-in views.)
Second: Matthew Bloch’s mapping framework is highly optimized, but it’s not necessarily equipped to handle changing 35,000+ polygons between 100 different movies as fast as would be necessary – no one likes to use a scrubber that’s slow to react.
One solution to the too-many-polygons problem is scaling up the data to wider geographies, such as one based on the first 3 digits of ZIP codes. But in this case, we couldn’t do that because we didn’t have the total number of renters in each ZIP – we only had the rank. Without counts, there is no way to tell how much weight each ZIP’s ranking should carry in the combined area.
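To see why ranks alone can’t be rolled up, here is a minimal Ruby sketch. Everything in it is invented for illustration: the ZIPs, the movie names, the rental counts and the helper names are all assumptions, not our data or code. It builds two made-up sets of counts that produce exactly the same per-ZIP ranks, yet give different winners once the ZIPs are combined by their first three digits.

```ruby
# Hypothetical rank-only records: [zip, movie, rank]. No rental counts.
RANKS = [
  ["10001", "Movie A", 1],
  ["10001", "Movie B", 2],
  ["10002", "Movie B", 1],
  ["10002", "Movie A", 2]
].freeze

# Two invented sets of rental counts, both consistent with RANKS above.
COUNTS_SCENARIO_1 = {
  "10001" => { "Movie A" => 900, "Movie B" => 100 },
  "10002" => { "Movie B" => 60,  "Movie A" => 50 }
}.freeze

COUNTS_SCENARIO_2 = {
  "10001" => { "Movie A" => 60,  "Movie B" => 50 },
  "10002" => { "Movie B" => 900, "Movie A" => 100 }
}.freeze

# Rank movies within each ZIP from counts (1 = most rented).
def ranks_from_counts(counts_by_zip)
  counts_by_zip.flat_map do |zip, counts|
    sorted = counts.sort_by { |_movie, n| -n }
    sorted.each_with_index.map { |(movie, _n), i| [zip, movie, i + 1] }
  end
end

# Sum counts over all ZIPs sharing a 3-digit prefix, then pick the top movie.
def top_movie_by_prefix(counts_by_zip)
  totals = Hash.new { |hash, key| hash[key] = Hash.new(0) }
  counts_by_zip.each do |zip, counts|
    counts.each { |movie, n| totals[zip[0, 3]][movie] += n }
  end
  totals.transform_values { |movies| movies.max_by { |_movie, n| n }.first }
end
```

Both scenarios reproduce `RANKS`, but `top_movie_by_prefix` names a different top movie for prefix `100` in each one, so the ranks by themselves can’t settle the answer.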
So, we decided on a dozen cities, determined mostly by population but also geographic distribution, which is why Minneapolis, Seattle, Denver and San Francisco are on the map, but not Houston or Philadelphia. (This apparent injustice was not lost on commenters from those cities.) We made individual GIS shapefiles of each city, then merged them into a single shapefile using ArcView’s ‘merge’ tool.
This reduced the number of shapes down to about 5,000 or so, well within Matthew Bloch’s “still super fast” threshold.
Still, the hardest part about this graphic was designing the interface. We wanted readers to be able to find a given movie quickly, but a search box didn’t really work visually. We also wanted to give readers an idea which movies were most popular and which were most critically acclaimed.
I mocked up at least ten versions. None were any good. The challenge was navigation. As a user, I wanted to be able to see one movie in a bunch of different cities, fast, or I wanted to see a bunch of movies in one city just as fast. So there are two major navigation elements – cities and movies – but the map itself still needed to be the visual focal point of the graphic.
In the end, graphics director Steve Duenes and deputy Matt Ericson came up with a sketch based on elements of my previous mockups:
which I turned into a more refined Illustrator mockup:
We tweaked this until it resembled what’s now online. It’s a complicated interface, but I don’t know if it could be any simpler.
Once we settled on a design, there was still a lot of work to do. We needed to get all the movie thumbnail images, the Metacritic ‘Metascores’, the links to The Times’ reviews and the first few sentences of the reviews themselves. We did this mostly by writing scripts. Both Metacritic and Netflix have great search-engine optimization, so just Googling a film title with the word ‘Netflix’ or ‘Metacritic’ generally gives you what you want in the first search entry:
We wrote a Ruby script that parsed the Google search results page for each movie, which typically contained the Metacritic score and Netflix ID. We used hpricot, an HTML-parsing Ruby library, to pull out the Netflix ID and Metascore of any film. With the Netflix ID, we know the link to the thumbnail image is
"" + netflix_movie_id + ".jpg"
We used a similar technique to fetch links and content of The Times’ review, and then filled in any missing movies by looking the information up by hand.
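The extraction step can be sketched roughly as follows. This is not our actual script: it swaps hpricot for plain regular expressions from Ruby’s standard library so it runs without any gems, and the sample HTML, the URL shape, the “Metascore: NN” wording, the `THUMBNAIL_BASE` placeholder and the method names are all assumptions made up for the example.

```ruby
# Hypothetical snippet standing in for a saved search-results page.
# The URL shape and the "Metascore: NN" text are assumptions for illustration.
SAMPLE_HTML = <<~HTML
  <a href="http://movies.example.com/Movie/Some_Title/70000001">Some Title</a>
  <div class="snippet">Some Title reviews. Metascore: 85 out of 100.</div>
HTML

# Placeholder base URL; the real thumbnail path is not reproduced here.
THUMBNAIL_BASE = "http://images.example.com/boxshots/"

# Return the numeric ID at the end of a /Movie/<title>/<id> link, or nil.
def extract_movie_id(html)
  match = html.match(%r{/Movie/[^/"]+/(\d+)})
  match && match[1]
end

# Return the Metascore as an Integer, or nil if the page doesn't mention one.
def extract_metascore(html)
  match = html.match(/Metascore:\s*(\d{1,3})/)
  match && match[1].to_i
end

# Build the thumbnail link from the ID: base + id + ".jpg".
def thumbnail_url(movie_id)
  THUMBNAIL_BASE + movie_id + ".jpg"
end
```

Both extractors return `nil` when nothing matches, which is a convenient flag for the titles that have to be looked up by hand.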
As for the making of the map itself, the concept is very simple. For any movie, each ZIP code is assigned a color based on its rank. If it’s not in the top 50, it’s not shaded. That’s about it. To optimize the map, Matthew Bloch did a bit of database work, giving each movie title a numerical ID instead of using its full title, since it’s faster to parse through numbers than text.
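In code, the idea looks something like the Ruby sketch below. It is only an illustration of the logic, not the mapping framework itself: the five-color ramp, the ten-ranks-per-color bucketing, the sample rows and every method name are assumptions.

```ruby
# Hypothetical five-step ramp, darkest first; the real palette may differ.
COLOR_RAMP = %w[#7f0000 #b30000 #e34a33 #fc8d59 #fdcc8a].freeze
TOP_N = 50

# Map a rank (1..50) to a color; nil means "leave the ZIP unshaded".
def color_for_rank(rank)
  return nil if rank.nil? || rank < 1 || rank > TOP_N

  bucket_size = TOP_N / COLOR_RAMP.length # 10 ranks per color
  COLOR_RAMP[(rank - 1) / bucket_size]
end

# Give each distinct title a small integer ID, in order of first appearance.
def build_title_ids(titles)
  titles.uniq.each_with_index.to_h
end

# Hypothetical rows: [zip, title, rank].
ROWS = [
  ["10001", "Movie A", 1],
  ["10001", "Movie B", 27],
  ["10002", "Movie A", 50]
].freeze

TITLE_IDS = build_title_ids(ROWS.map { |_zip, title, _rank| title })

# Same rows with the title swapped for its numeric ID.
COMPACT_ROWS = ROWS.map { |zip, title, rank| [zip, TITLE_IDS[title], rank] }

# Colors for one movie across ZIPs; ZIPs absent from the result stay unshaded.
def shading_for(movie_id, compact_rows)
  matching = compact_rows.select { |_zip, id, _rank| id == movie_id }
  matching.map { |zip, _id, rank| [zip, color_for_rank(rank)] }.to_h
end
```

Calling `shading_for(TITLE_IDS["Movie A"], COMPACT_ROWS)` returns a hash of ZIP to color for that one movie, and comparing integer IDs in the `select` is the kind of lookup the numeric IDs were meant to speed up.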
The result was the graphic that’s online now. We were able to get a lot in, but we still had to leave a lot out, such as different ways to shade the maps other than by movie: for example, showing where people rented movies that were nominated for Best Picture, or shading each ZIP code based on a calculation of the Metacritic ratings for its top 50 movies, which we did in print for the New York region. Including them would have made the interface even more complicated.
Don’t get me wrong – leaving things out is critical in interactive graphics, where the default temptation is to dump all the data you have behind an interface. It’s hard to say no to that, because readers are going to find a lot of things with raw data that you might have missed (such as the interesting island of Andrews Air Force Base).
It’s something we know we can do better; I don’t think anyone would disagree that tidbits of analysis are usually more meaningful than massive streams of raw data. It’s nice to get both in if you can. We’re working on it.
Kevin Quealy has been a graphics editor at the New York Times for almost two years. He has a Master’s degree from the Missouri School of Journalism and a Bachelor’s degree in Physics from Gustavus Adolphus College. He has previously worked at the Philadelphia Inquirer and the St. Louis Post-Dispatch.