A Guide to Sabermetric Research: How to Find Raw Data

In the early days of sabermetrics, data was hard to come by. Some things weren’t too bad: if you wanted to know Bill Terry’s batting average in 1933, there were two encyclopedias, Macmillan and Neft/Cohen, that would tell you. But if you wanted more esoteric statistics, like Joe Morgan’s career performance with the bases loaded, you were out of luck.

When Bill James started writing his self-published Baseball Abstracts back in the late 1970s, he had to compile situational statistics himself, from the daily box scores, without a computer. At the time, Bill marketed his book as “featuring 18 categories of statistical information that you just can’t get anywhere else.”

James found that he had to keep compiling those stats even into the 1980s; famously, in his 1981 book, he reprinted a letter from the Chicago Cubs refusing to provide him with such “intelligence-type” stats.

Now, of course, things are different. There is no shortage of almost any kind of data. My four favorites, in rough order of increasing detail, are MLB's website, Baseball-Reference.com, the Lahman Database, and Retrosheet.

MLB's website provides copious statistical data, sortable and printable, updated instantly as games progress. But that stuff can be found elsewhere. The main attraction of the MLB website is that it provides PITCHf/x data. That is, for every pitch thrown by any pitcher in MLB, it will tell you the type of pitch, where it crossed the plate, and how much it broke vertically and horizontally. As a result, and not surprisingly, much of the groundbreaking research these days has to do with pitch analysis.

Easily the best source for precalculated historical statistics is Baseball-Reference.com (B-R). That site has pretty much rendered printed baseball encyclopedias obsolete. Not only do you get the regular Bill-Terry’s-batting-average data, but you also get a large selection of sabermetric stats, breakdowns by dozens of different criteria (left/right, day/night, April/September, and so on), and the ability to manipulate the data in ways that other websites don’t allow. You can also do absurdly specific searches. Want to know Joe Morgan’s longest consecutive streak of games where he came to the plate at least twice? The answer: 235 games. (If you want the details, you have to subscribe, but the overwhelming majority of the information on the site can be had for free.)

For those of us who want to do more complicated things, Baseball-Reference, awesome as it is, just isn’t enough. We need the raw data on our own computers, so we can manipulate it in ways that B-R never thought of. There are two main sources of raw data: the Lahman Database and Retrosheet.

The Lahman Database can be obtained for free at seanlahman.com/baseball-archive/statistics, the website of its creator, Sean Lahman. It’s basically a standard Baseball Encyclopedia in downloadable form. You can get it in text form, for loading into Excel, but, more importantly, it also comes in relational database format (Microsoft Access). If you’re familiar with Access and with SQL queries, you know how convenient that makes it to run powerful, specific data searches quickly. (If you’re not familiar with SQL, there have been a few tutorials on sabermetric sites recently.)

Anyway, the Lahman Database has every player’s standard batting and pitching line for every year. It’s got managers, birthdates, awards, all-star games, and other good stuff. Its limitation is that data is available only for single seasons — if you want to know how Eddie Murray hit in July 1979, there’s no way the Lahman Database will tell you. For that, you have to turn to Retrosheet.
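To give a rough idea of what such a query looks like, here is a minimal sketch in Python, using the standard-library sqlite3 module instead of Access. The table and column names (Batting, playerID, yearID, AB, H) are assumptions modeled on a season-level batting table, and the rows are invented placeholders, not real statistics.

```python
import sqlite3

# In-memory database so the sketch runs without downloading anything.
conn = sqlite3.connect(":memory:")

# Assumed, simplified schema: one row per player per season.
conn.execute(
    "CREATE TABLE Batting (playerID TEXT, yearID INTEGER, AB INTEGER, H INTEGER)"
)

# Placeholder rows: hypothetical players, invented numbers.
rows = [
    ("player_a", 1933, 500, 160),
    ("player_a", 1934, 600, 210),
    ("player_b", 1933, 400, 100),
]
conn.executemany("INSERT INTO Batting VALUES (?, ?, ?, ?)", rows)


def season_averages(connection, min_ab=1):
    """Return (playerID, yearID, batting average) rows, best average first."""
    query = """
        SELECT playerID, yearID, ROUND(1.0 * H / AB, 3) AS bat_avg
        FROM Batting
        WHERE AB >= ?
        ORDER BY bat_avg DESC
    """
    return connection.execute(query, (min_ab,)).fetchall()


averages = season_averages(conn)
for player_id, year, bat_avg in averages:
    print(player_id, year, bat_avg)
```

With the real database loaded, the same kind of query would run against the actual season-by-season table. Note that nothing in a table like this can answer a question about a single month, which is the limitation described above.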

Retrosheet is, basically, a miracle. It’s the work of a small army of volunteers combing historical sources to re-create the play-by-play of every game in baseball history and digitizing it for download and analysis. I can’t begin to imagine how difficult it is to find all that information, to reconstruct the top of the 6th inning of the Cardinals/Phillies game of April 29, 1953. But they did. (D. Rice grounded out (shortstop to first); Presko popped to first in foul territory; Hemus popped to first in foul territory.)

You can also see the entire career of any player, game by game. You can see the standings and results from any date in baseball history. You can see a coach’s career, which teams he coached for and what he coached, and even how many times he was ejected.

You can see this stuff online, or, if you have computer data-manipulation skills, you can download it and work with it yourself. You can load the data into Excel and write macros to manipulate it. Or, you can write programs to analyze it; I use Visual Basic, but any language will do. There’s a 2006 book called “Baseball Hacks” (O’Reilly), which explains how to use a computer language called “R” to download and analyze Retrosheet data (and, actually, lots of other baseball data that can be found on the internet). Primers on building your own personal Retrosheet database using MySQL can also be found online.
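As a sketch of what "writing a program to analyze it" can look like, here is a short Python example that reads play records from a small invented sample. The field order (inning, home/visitor flag, batter ID, count, pitch sequence, event code) reflects my understanding of the downloadable play-by-play layout and should be checked against Retrosheet's own documentation. The player IDs and the game ID are hypothetical.

```python
import csv
import io
from collections import Counter
from dataclasses import dataclass

# Invented sample in an assumed Retrosheet-style layout; the IDs are hypothetical.
SAMPLE = """\
id,HYP000000000
play,1,0,batter001,22,CBFBX,63
play,1,0,batter002,00,X,S7
play,1,1,batter003,32,BBCFBS,K
"""


@dataclass
class Play:
    inning: int
    is_home: bool   # assumed: flag "1" means the home team is batting
    batter: str
    count: str      # balls and strikes when the play happened, e.g. "32"
    pitches: str    # pitch sequence; may be empty when not recorded
    event: str      # coded description of the play


def parse_plays(text):
    """Yield a Play for every 'play' record, skipping all other record types."""
    for row in csv.reader(io.StringIO(text)):
        if len(row) >= 7 and row[0] == "play":
            yield Play(int(row[1]), row[2] == "1", row[3], row[4], row[5], row[6])


plays = list(parse_plays(SAMPLE))
plays_per_half = Counter((p.inning, p.is_home) for p in plays)
for (inning, is_home), n in sorted(plays_per_half.items()):
    half = "bottom" if is_home else "top"
    print(f"{half} of inning {inning}: {n} plays")
```

Once the plays are in a structure like this, situational questions (a player's record in a given month, or with the bases loaded) become a matter of filtering and counting.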

Not all of baseball history is available on Retrosheet yet, but the volunteers are still working on it. (Want to help? Details are on the Retrosheet site.) For now, you can see game-by-game summaries from 1871 on. You can see box scores for more than 90 percent of games since 1916. And, if you want full play-by-play data, it’s available for any game after 1952, and a large number of games before that. Some years even include pitch-by-pitch data, recorded as ball, strike, or foul.
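Where pitch-by-pitch data exists, tallying it takes only a few lines. The sketch below assumes each pitch is stored as a one-letter code, and the meanings given for the letters are my assumptions about a handful of common codes, not a complete or authoritative list.

```python
from collections import Counter

# Assumed meanings for a few one-letter pitch codes; anything else is "other".
PITCH_LABELS = {
    "B": "ball",
    "C": "called strike",
    "S": "swinging strike",
    "F": "foul",
    "X": "in play",
}


def tally_pitches(sequence):
    """Count the pitches in a coded sequence by their assumed meaning."""
    return Counter(PITCH_LABELS.get(code, "other") for code in sequence)


tally = tally_pitches("BCFBX")
for label, n in sorted(tally.items()):
    print(label, n)
```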

The result of literally tens of thousands of hours of volunteer labor, Retrosheet is the greatest sabermetric resource ever.


