How to Do Baseball Research: Statistical Databases and Websites
It's a good time to be a baseball researcher with a computer. This was true in 2000 when How to Do Baseball Research was originally published and it is even more true today. And it isn't only that computer and Internet speeds are more than an order of magnitude faster than they were back at the end of the last century; there has been a tremendous increase in the amount of information available as well. This chapter will describe some of the Internet's best sources of baseball data. Few of these sites even existed back at the beginning of the millennium and most of those that did have changed almost beyond recognition.
So what can we find on the Internet and where can we find it? We will break the answer to this question into two parts: sites that let us browse statistical data and sites that let us download it.
Browsing Baseball Data
This section will focus on data that is made available primarily for browsing. Of course, just about anything that can be browsed on the Internet can also be downloaded ("File"->"Save Page As"), but these sites normally present their data scattered over thousands of pages and in many different formats. And while they will let you search for data (by player or team name, for example) and will often even let you sort the data, they are not usually designed to let you answer complicated questions. In many ways, these sites simply provide online baseball encyclopedias, and their visitors access them in much the same way people have been accessing print encyclopedias for decades.
Note: some of these sites also provide data for downloading, but this will be discussed in the next section.
So let's start by answering a sample question: how many hits did Derek Jeter have in 2004? Here are some places that will answer that question:
While each of the sites above provided the same answer to our original question (Derek Jeter had 188 hits in 2004), they also provided a wide range of presentation styles and additional information. The rest of this section will discuss what is available on each of these sites (and several others not included above). It is important to note that we are focusing on the statistical data available on the sites below. In many cases, statistical data is only a small part of what these sites have to offer.
ESPN offers a variety of statistics on players who are currently active. Both regular and some sabermetric stats (IsoP, SecA, PSN) are included. What is available varies from year to year. For example, the number of pitches is included starting in 2002 and ground ball/fly ball information starts in 2004. Batter-pitcher match-ups as well as batting and pitching splits are available for active players. Samples are here (Derek Jeter's match-ups against Tigers pitchers) and here (Derek Jeter's 2005 batting splits).
They also provide sortable regular and post-season stats on all major league players back to 2002 and player game logs on active players back to 2001. Samples are here (2003 major league hitters sorted by triples) and here (Nomar Garciaparra's 2002 game log).
This site is free, although they sell an insider subscription that gives you access to all their columnists as well as some other services (mostly relating to fantasy baseball).
MLB.com has statistics on both active and "historical" players. Derek Jeter's sample page above shows what is included for active players. Ground ball/fly ball and pitch count data starts in 1999. One feature is a leaderboard section showing all seasons a player was in the top 25 in a major statistical category. A sample showing Del Bissonette's entries on the leaderboard is here. Another feature for active players is a hit chart showing the location of various types of batted balls (singles, doubles, triples, home runs, ground outs and fly outs) in each park back to 2000. There are also game logs and splits for active players for the current season.
They provide sortable team and player stats for seasons (including multiple seasons) back to 1871. Here is the main page for this. If you always wondered what player hit the most triples in the AL from 1923 to 1926 (Goose Goslin, with 70), this is the place for you. The flexible interface allows you to specify the league (AL, NL or both), the hitting or pitching stat, and the seasons of interest (hint: hold down the button to add years to the list). From 1999 onward, it also looks like you should be able to add position and situational splits to the selection criteria, but most of these queries didn't end well ("There was a problem retrieving your requested stats. Please try again later").
This site is free.
Baseball-Reference contains a wealth of data on all major league players. Both the usual and unusual (BAbip, OWn%, Rtz, Rtzhm and many more) are included. Two features in the data display warrant mention: you can sort the years in each display by any category and you can sum the stats for any group of years by highlighting them.
Other things included on the site are box scores (most with play-by-play descriptions), batting and pitching splits and daily game logs back to the early 1950s. Here is a link to the schedule and results page (with links to daily box scores) for the 1961 Milwaukee Braves, here is a link to Sandy Koufax's 1965 pitching game log, and here is a link to Johnny Bench's career splits. Each player's page also contains similarity scores (which players are most like the one in question), a leaderboard section showing all the times the player appeared in the top ten in any category, and a home run log, accessed by clicking "HR Log" at the top of the Standard Batting and Pitching sections (Mantle's home run log is here).
In addition to major league players, baseball-reference also has season and career data on minor league players, as well as the minor league records of major league players. A sample, Joe Hauser's minor league record, is here.
The things mentioned so far are free. You can also purchase a subscription that allows you access to the play index, a tool that permits you to query seasonal data as well as Retrosheet game data back to 1954. There is a very flexible and powerful interface that allows subscribers to answer all sorts of questions.
Here are just a few examples of some of the types of questions you can answer using this tool:
1) When was the last time a player who was not in the starting lineup had 7 or more RBIs in a game? To answer this, you would go to the Batting Gamelog Finder, change the Batter's Defensive Position from "Either" to "Sub" and the first set of stat fields to "RBI", ">=" and "7". Pressing "Get Results" quickly told us that John Mayberry had 7 RBIs for the Blue Jays on June 26, 1978 and Roy Sievers did the same for the White Sox on June 21, 1961.
2) Who was the last pitcher to win a game in which he gave up 5 home runs or more? To answer this, you would go to the Pitching Gamelog Finder, change the Pitcher Decision to "Win" and the first set of stat fields to "HR", ">=" and "5". Pressing "Get Results" showed us that Tim Wakefield last did this on August 8, 2004.
3) Who was the last Boston Red Sox switch-hitter to have 30 or more doubles and home runs in a season? The Batting Season Finder can tell us the answer to this one. Once there, change League to "American League", team to "Boston Red Sox", Bats to "Switch" and the first two sets of stats fields to "2B", ">=", "30" and "HR", ">=", "30". The answer: Carl Everett in 2000.
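Each of the three searches above boils down to filtering rows on a handful of conditions and looking at the most recent match. As a rough sketch of that idea only (it says nothing about how the play index itself is built), the third query might be expressed in Python as follows. The `SeasonLine` record, its field names, the team and league codes, and all of the sample rows are hypothetical placeholders, not real statistics.

```python
from dataclasses import dataclass


@dataclass
class SeasonLine:
    """One player-season; a hypothetical record layout for illustration."""
    player: str
    year: int
    league: str
    team: str
    bats: str        # "L", "R" or "S" (switch)
    doubles: int
    home_runs: int


def season_finder(rows, league, team, bats, min_doubles, min_home_runs):
    """Return the seasons meeting every condition, most recent first."""
    matches = [
        r for r in rows
        if r.league == league
        and r.team == team
        and r.bats == bats
        and r.doubles >= min_doubles
        and r.home_runs >= min_home_runs
    ]
    return sorted(matches, key=lambda r: r.year, reverse=True)


# Made-up placeholder rows, NOT real statistics.
SAMPLE = [
    SeasonLine("Player A", 1990, "AL", "BOS", "S", 31, 12),
    SeasonLine("Player B", 1995, "AL", "BOS", "S", 35, 30),
    SeasonLine("Player C", 1998, "AL", "BOS", "R", 40, 35),
    SeasonLine("Player D", 1999, "NL", "NYN", "S", 33, 33),
]

matches = season_finder(SAMPLE, "AL", "BOS", "S", 30, 30)
last_match = matches[0] if matches else None
```

Only "Player B" satisfies all five conditions here; the other rows each fail one of them (too few home runs, wrong batting side, wrong league and team).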
Retrosheet contains box scores with play-by-play data covering 1952-2008 for the NL and 1953-2008 for the AL. In addition, it also has box scores (without play-by-play data) for 1872 and 1874 for the National Association, 1911 for the NL and 1920 to 1931 for both leagues. Here is a sample game log (with links to box scores) for the 1965 New York Mets.
It has encyclopedia entries for all players like the Derek Jeter page shown above. In addition to the usual statistical data, each player page will have links to game logs (like this one for Rogers Hornsby's 1922 season), splits (Billy Williams' 1970 splits are here), a top performance page (like Lou Gehrig's), and batting and pitching matchups (Joe Pepitone's are here).
There are all-time top performance pages covering top statistical marks in one through eight consecutive games, as well as top performance pages for each baseball franchise (the Mets' page is here). Finally, there are ballpark pages containing various splits (the Ebbets Field page is here).
Data for the current year is not available until late November.
Much of the data used to generate the pages on Retrosheet's site is also available for downloading, but that will be discussed in the next section.
This site is free.
Baseball Prospectus contains the kind of statistical data you'd expect from a sabermetrically sophisticated site like theirs, which can be seen from Derek Jeter's sample entry above. In addition, they also have seasonal data that can be sorted by a variety of fields here. Each report permits you to select a year back to 1954 (it doesn't look like a range of years is supported), an optional defensive position (for batting stats), and a series of statistics to sort on. As an example, here is the report of 1996 pitchers, sorted by the number of fly balls they allowed.
The stuff described above is free, although they do offer a subscription service (which includes Custom Sortable Stats).
In addition to offering player encyclopedia entries on major league players with a wide range of normal as well as advanced sabermetric data (see the example above), Fangraphs also has extensive win probability data (since 1974) as well as statistics on batted balls, pitch type and plate discipline (since 2002). They also have player game logs available since 2002 (Manny Ramirez's 2003 log is here), a play log containing every batting play in each season (also since 2002) that can be sorted by a host of categories (Travis Hafner's 2006 log, sorted by win probability added, is here), as well as a series of graphs showing, among other things, how each player has compared to league averages and players his own age in several rate categories. Here is how Barry Bonds compared to his league in on-base percentage during his career. And here are Ichiro Suzuki's daily graphs in a host of rate categories since 2002.
Minor league data is also included starting in 2006.
This site is free.
Baseball Musings gives you access to daily logs and player splits from 1957 to 2009 (here) in a flexible format that allows you to select the games to include using a wide range of criteria. A few examples of the kinds of reports you can generate:
1) A list of Ernie Banks' games played from 1957 to 1971 against the Cardinals in Wrigley Field is here. Note the summary line at the end of the report giving the totals of all the games displayed.
2) Albert Pujols' splits from 2001 to 2003 are here.
3) Todd Helton's yearly road record from 1997 to the present is here.
4) Jorge Posada's batter-pitcher matchups from 2003 to 2006 covering only those games played in Yankee Stadium are here.
You can also generate lists of batters, again using a wide selection of both inclusion and sorting criteria. Two examples:
1) A list of all the players who reached first base on catcher's interference at least 10 times from 1957 to the present is here.
2) A list of the visiting pitchers from 1957 to 2008 with the most shutouts at Yankee Stadium is here.
This site is free.
Baseball Almanac has encyclopedia entries for all players (see the example above). They also have a tool called statmaster that will allow you to generate team reports containing a variety of statistical categories. There are boxscores available for many teams from 1958 to 2004 (the 1960 White Sox main page is here) and player logs from 1954 to 2008 (Stan Musial's 1954 log is here).
In addition, the site has an extensive section on baseball records.
The site is free.
The Baseball Cube contains major league statistical data back to 1903, minor league data starting in 1978 and NCAA data from 2002. Tim Lincecum's page, showing all three types of data, is here. Baseball boxscores are available from 1957 to 2008 here and player logs are also available for the same years. A sample game log, covering Greg Maddux's 1995 season, is available here.
The site is free.
Howe Sports Data has major and minor league data on all active players, major league daily logs back to 2002 and minor league logs back to 1999. A sample log, Jeff Francis' 2004 Texas League record, is here.
This site is free.
Minor League Baseball Split displays minor league splits for players back to 2005. A sample page is here.
This site is free.
BrooksBaseball.net allows you to look at PitchFX data for games since 2007. PitchFX data captures information on each pitch, including speed, vertical and horizontal position, movement and spin angle. Note: not all 2007 games have this data. From the main page, you select a date, game and pitcher and can see a variety of data on that pitcher's performance in the game. For example, Jon Garland's pitches in his September 3, 2009 start for the Dodgers against the Diamondbacks are here.
There is an online encyclopedia of Japanese baseball stats available here, including player registers, yearly standings and team records. There are separate sections for Japanese and non-Japanese players. A sample batting register (Oh - Oishi) is here.
Downloading Baseball Data
In this section, we will discuss websites that provide data you can download or purchase on CD.
Baseball-Databank.org contains a number of tables that together comprise an encyclopedia of seasonal data. There are 27 tables in all, with everything from the usual (batting, pitching and fielding data) to the less usual (salary data, award and hall of fame voting). The tables are available both as comma-delimited text files and as a MySQL database.
A good tutorial on how to use this data is at Statistically Speaking (part 2 of the tutorial is here); it contains a good description of how to get and install MySQL, how to load the Baseball-Databank data into it, and how to query it.
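To give a flavor of the load-then-query workflow, the sketch below reads a few comma-delimited rows into an in-memory database and asks the sample question from earlier in this chapter. SQLite (from Python's standard library) stands in for MySQL purely so the example is self-contained. The column names (`playerID`, `yearID`, `teamID`, `H`), the team codes and the player ID `jeterde01` are assumptions about how a batting table might be laid out, and every row other than Jeter's 188 hits in 2004 is a made-up placeholder.

```python
import csv
import io
import sqlite3

# Stand-in for a downloaded comma-delimited batting file.
# Column names are assumed; only the 188 comes from this chapter.
BATTING_CSV = """playerID,yearID,teamID,H
jeterde01,2004,NYA,188
exampl01,2004,NYA,150
exampl02,2003,BOS,120
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Batting "
    "(playerID TEXT, yearID INTEGER, teamID TEXT, H INTEGER)"
)

# Load every CSV row; SQLite converts the numeric strings to integers
# because of the INTEGER column declarations.
reader = csv.DictReader(io.StringIO(BATTING_CSV))
conn.executemany(
    "INSERT INTO Batting VALUES (:playerID, :yearID, :teamID, :H)",
    list(reader),
)


def hits_in_season(connection, player_id, year):
    """Total hits for one player in one season (None if no rows match).

    SUM is used on the assumption that a player can have more than one
    row per season, for example after a mid-season trade.
    """
    row = connection.execute(
        "SELECT SUM(H) FROM Batting WHERE playerID = ? AND yearID = ?",
        (player_id, year),
    ).fetchone()
    return row[0]


jeter_hits = hits_in_season(conn, "jeterde01", 2004)
print(jeter_hits)
```

With a real download you would replace `BATTING_CSV` with the contents of the batting file and check its header row against the column names assumed here.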
There is also a good discussion on the data and how to use it in Joseph Adler's 2006 book Baseball Hacks.
There is also a Yahoo e-group to discuss the data available at this website here.
The data is free.
The Baseball Archive contains the same data that is available at Baseball-Databank, but it is available here in some different formats, including Microsoft Access (free) and on a CD-ROM (not free). This site also contains documentation on the tables in the database here.
Also available at this site is The Baseball Statistics System, a free Windows application developed by Randy Myers, which is an interactive interface to the data. Documentation on this is available here.
Retrosheet contains two basic types of game data: event files and game logs. Event files come in two varieties: regular event files, containing a play-by-play description of a game, and box score event files, which contain information sufficient to generate a box score for a game but do not contain play descriptions. Game logs contain a wide variety of information on each game (not all of the information is available for each year) back to 1871.
There is a step-by-step example showing how to use the event files here.
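To give a feel for what a regular event file holds, here is a minimal sketch that counts hits per batter from play records. It assumes each play record is a comma-separated line of the form `play,inning,home-or-visitor,batter-id,count,pitches,event`, and that the event field begins with S, D, T or H/HR for a single, double, triple or home run. Treat both as simplifying assumptions and check Retrosheet's own documentation for the authoritative format. The sample lines and player IDs are invented for illustration.

```python
import re
from collections import Counter

# Events assumed to start with S (single), D (double), T (triple) or
# H/HR (home run). SB (stolen base), DI (defensive indifference) and
# HP (hit by pitch) share a first letter with a hit code, so the
# lookaheads exclude them.
HIT_PATTERN = re.compile(r"^(?:S(?!B)|D(?!I)|T|H(?!P))")


def count_hits(lines):
    """Count hits per batter ID from an iterable of event-file lines."""
    hits = Counter()
    for line in lines:
        fields = line.strip().split(",")
        # Skip anything that is not a complete play record.
        if len(fields) < 7 or fields[0] != "play":
            continue
        batter, event = fields[3], fields[6]
        if HIT_PATTERN.match(event):
            hits[batter] += 1
    return hits


# Invented sample lines, not taken from a real game.
SAMPLE_EVENTS = [
    "id,XXX200001010",
    "play,1,0,batta001,12,CBFX,S7",       # single
    "play,1,0,battb001,00,,SB2",          # stolen base, not a hit
    "play,1,0,battb001,21,BBCX,63/G",     # ground out
    "play,1,1,battc001,00,X,HR/F7",       # home run
    "play,2,0,batta001,32,BBCBFX,D8/L",   # double
    "play,2,0,battb001,01,CH,HP",         # hit by pitch, not a hit
]

hit_totals = count_hits(SAMPLE_EVENTS)
```

For real files you would pass an open file object instead of `SAMPLE_EVENTS`; the tools described next handle the many details this sketch ignores.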
Retrosheet makes some software available for accessing regular event files (running on Windows) called BOX, BEVENT and BGAME. They are described here.
Chadwick is an excellent software package, written by Ted Turocy, that can be used to access both regular and box score event files. A description of how it works, as well as how to download and install it, is here.
There is also a Yahoo e-group to discuss the data available at this website here.
The Complete Baseball Encyclopedia was developed by Lee Sinins and allows users to sort player data and generate lists in a variety of ways. You cannot actually download the data, but you can purchase it on CD.
Old-Time Data is the brainchild of Pat Doyle and is actually two products for purchase on CD: Professional Baseball Players Database, containing a few batting and pitching stats for both the minor and major leagues from 1922 to 2004, and Professional Baseball Players Statistical Database, containing a lot more statistics for the same group of players from 1920 to 1945.
Since most of the major league data is covered better elsewhere, the focus here is on the minor league data. And while it is true that SABR and Baseball-Reference now cover much the same territory with their online data, Pat Doyle's products are valuable because they provide a second, independent view of this data. They also have the advantage of not requiring Internet access, since they are installed directly on your computer.
National Pastime Almanac is a free downloadable program that runs on Windows and lets you do a wide variety of sorting and selecting on seasonal and career player and manager data. Its user interface is a little old-fashioned (it really likes to take over your entire screen and you need to continually resize it if you don't want it hogging all of the screen's real estate) but it does let you run a wide variety of queries on its data. For example, if you've ever wondered how many pitchers walked 100 or more batters, struck out 200 or more batters and posted an ERA under 2.00, this tool will quickly tell you all about Jack Coombs (1910), Hal Newhouser (1945) and Sam McDowell (1968).
The Seamheads Ballparks Database is an MS Access database produced by Kevin Johnson and contains statistical and descriptive historical ballpark data. A description of what the database contains is here.
Pro Yakyu Now contains two separate databases for downloading (available in its data section): Michael Westbay's Pro Yakyu Database and Michael Eng's Japanese Baseball Database. Both are available in comma-delimited text (csv) files. It looks like there is more coverage of the older players in Michael Eng's database.
We mentioned PitchFX above briefly (when discussing the BrooksBaseball.net site). gd2.mlb.com contains the PitchFX data from MLB.com. There is a separate directory for each year, month, day and, finally, game. For example, the directory containing the PitchFX data for the Mets-Rockies game on July 13, 2008 is here. A good tutorial on how to capture and use this data is here. One of the problems it deals with is how to set up scripts to automatically download the data from all of these directories into a single database. There is one website that has done some of the work in collecting all the data from these directories into a single SQL database from 2007 to 2009.
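Because the data is spread over one directory per day (and then per game), any download script first has to generate the list of directories to visit. A minimal sketch of that first step follows. The base URL and the `year_YYYY/month_MM/day_DD` naming are assumptions made for illustration, not a documented layout, so check them against the real site before relying on them. The sketch only builds paths; it makes no network requests.

```python
from datetime import date, timedelta

# Hypothetical base URL and directory naming; adjust both to match
# the actual site layout.
BASE_URL = "https://example.invalid/pitchfx"
DAY_TEMPLATE = "{base}/year_{d.year:04d}/month_{d.month:02d}/day_{d.day:02d}/"


def day_directories(start, end, base=BASE_URL, template=DAY_TEMPLATE):
    """Yield one directory URL per calendar day from start to end inclusive."""
    current = start
    while current <= end:
        yield template.format(base=base, d=current)
        current += timedelta(days=1)


urls = list(day_directories(date(2008, 7, 12), date(2008, 7, 14)))
for url in urls:
    print(url)
```

A full script would then list the game directories inside each day directory and fetch the files it needs, ideally pausing between requests to be polite to the server.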
- Related link: For more information on sabermetric research, visit our Guide to Sabermetric Research at SABR.org/sabermetrics