A Primer on Statistics

For the typical fan, sabermetrics doesn’t represent anything as theoretical as scientific inquiry. Rather, sabermetrics is associated with new and unfamiliar statistics. OPS is the most famous of those new stats. It has gone from nearly unknown in the early 1980s, to barely used a decade ago, to mainstream now (it even appears on Topps baseball cards). Other examples include Linear Weights, Runs Created, Extrapolated Runs, and WAR.

I’d still argue that sabermetrics isn’t really about those statistics; rather, the statistics have proven useful because of the evidence that sabermetricians have uncovered. “Runs Created,” for instance, is a statistic created by Bill James in the late 1970s. James’ thinking went this way: a team’s job on offense is to score runs, and the more runs, the better. Suppose you didn’t know how many runs a team scored and wanted to make an estimate based on its batting line. For instance, here’s a real team batting line:

G	AB	H	2B	3B	HR	BB	K	AVG
161	5517	1451	234	22	214	604	908	.263

How many runs would you guess that team scored that year? If I made you guess, you’d probably look over a few years of team statistics, try to find some team that was reasonably close, and use that as a baseline. You might find a team that hit .267 with less power, and scored 788 runs. You’d figure, “Well, this team hit only .263, but they had a few more home runs, so maybe those cancel out and I’d guess the same 788 runs. But wait, this team had about 20 more walks than the other team, so maybe I should bump up my estimate to 800 or something.”

What Bill James probably did was work through logic like that, and, after some trial and error, come up with the Runs Created (RC) formula. That statistic is intended to provide a formal way of estimating how a batting line translates into runs. In its most basic form, RC looks like:

Runs Created = (TB) (H+BB) / (AB+BB)

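Here TB is total bases: a single counts one, a double two, a triple three, and a home run four, so TB = H + 2B + 2(3B) + 3(HR). For the batting line above, TB = 1451 + 234 + 44 + 642 = 2371. If you plug the numbers in, you get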

Runs Created = (2371) (2055) / (6121)

which gives 796 runs.

As it turns out, that was the batting line for the 1985 Baltimore Orioles, who actually scored 818 runs. The estimate is off by 22 runs, which is fairly typical.
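
The arithmetic above can be checked with a short Python sketch. This is only an illustration of the basic formula as written in this article; the function and parameter names (total_bases, basic_runs_created, doubles, triples) are my own choices, not anything Bill James published.

```python
def total_bases(h, doubles, triples, hr):
    """Total bases: every hit counts once, plus the extra bases on extra-base hits."""
    return h + doubles + 2 * triples + 3 * hr


def basic_runs_created(ab, h, doubles, triples, hr, bb):
    """Basic Runs Created as given above: TB * (H + BB) / (AB + BB)."""
    tb = total_bases(h, doubles, triples, hr)
    return tb * (h + bb) / (ab + bb)


# 1985 Baltimore Orioles batting line, taken from the table above.
orioles_rc = basic_runs_created(ab=5517, h=1451, doubles=234, triples=22, hr=214, bb=604)
actual_runs = 818  # actual run total quoted in the text

print(f"Estimated runs: {orioles_rc:.0f}")                    # 796
print(f"Error vs. actual: {actual_runs - orioles_rc:.0f}")    # 22
```

Keeping total bases in its own function makes it easy to see where the 2371 in the worked example comes from.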

Why is Runs Created important? Why do we need RC if we already know the Orioles scored 818 runs? Well, knowing that there is a predictable relationship between a batting line and runs is useful when we don’t know how many runs we actually have. For instance, we can use RC on an individual player’s batting line. Here’s Albert Pujols in 2009:

G	AB	H	2B	3B	HR	BB	K	AVG
160	568	186	45	1	47	115	64	.327

Using the basic RC formula, we can estimate that if a given major league team had a batting line like Pujols did, it would score about 149 runs. That batting line would comprise about 15 games, which gives about 10 runs per game.

What we can conclude, then, is that if you put together a lineup of nine Albert Pujols clones, they’d score about 10 runs per game on average. That’s a huge total: the average MLB team scores somewhere between 4.5 and 5.0 runs per game.
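
The conversion from a single player's batting line to runs per game can be sketched in Python as follows. Two things here are assumptions of mine rather than anything stated in the article: that "games" are measured by batting outs (AB minus H), and that a team makes roughly 25.5 such outs per game. Note also that plugging Pujols's line into the basic formula exactly as written gives roughly 165 runs, not the 149 quoted above, so the quoted figure presumably comes from a more refined version of RC; the per-game conversion works the same way in either case.

```python
# Assumption (not from the article): about 25.5 batting outs (AB - H) per team game.
OUTS_PER_GAME = 25.5


def basic_runs_created(ab, h, doubles, triples, hr, bb):
    """Basic Runs Created: TB * (H + BB) / (AB + BB)."""
    tb = h + doubles + 2 * triples + 3 * hr
    return tb * (h + bb) / (ab + bb)


def runs_per_game(ab, h, doubles, triples, hr, bb, outs_per_game=OUTS_PER_GAME):
    """Return (runs created, equivalent team games, runs created per game)."""
    rc = basic_runs_created(ab, h, doubles, triples, hr, bb)
    games = (ab - h) / outs_per_game  # how many team games these outs would fill
    return rc, games, rc / games


# Albert Pujols, 2009, from the table above.
rc, games, per_game = runs_per_game(ab=568, h=186, doubles=45, triples=1, hr=47, bb=115)
print(f"RC: {rc:.0f}, games: {games:.1f}, runs per game: {per_game:.1f}")
```

With these assumptions the line works out to about 15 games' worth of outs, matching the figure in the text, and to a per-game rate far above the league average.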

We can compare Pujols to Joe Mauer, or Adam Lind, or Alex Rodriguez, to help inform our conclusions on how much each contributed to his team, or even to our arguments about which player deserves the MVP award.

Runs Created is one of the most famous of the statistics used to evaluate offense. Others include Pete Palmer’s “Linear Weights,” Jim Furtado’s “Extrapolated Runs,” and David Smyth’s “Base Runs.” All are very good estimators. But which is the best? Well, that depends. No estimator is perfect, and all have their strengths and weaknesses.

One way to compare the various estimators is to test them for accuracy. Apply them to the last (say) fifty years of baseball, which should give you well over a thousand team-seasons. Have each one estimate runs for all of those teams, and see which comes closest to the actual run totals.
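
An accuracy test of this kind might be sketched in Python as below. The scoring choices (mean absolute error and root mean squared error) are my assumptions, since the article does not say how "best" is measured. The only data row is the 1985 Orioles line from this article; a real test would load every team-season from a statistics database. The naive_baseline function is a deliberately crude, hypothetical placeholder, not a real estimator, and it stands in for where Linear Weights, Extrapolated Runs, or Base Runs would be plugged in.

```python
import math


def basic_rc(t):
    """Basic Runs Created: TB * (H + BB) / (AB + BB)."""
    tb = t["H"] + t["2B"] + 2 * t["3B"] + 3 * t["HR"]
    return tb * (t["H"] + t["BB"]) / (t["AB"] + t["BB"])


def naive_baseline(t):
    """Hypothetical strawman (NOT a real estimator): half a run per baserunner."""
    return 0.5 * (t["H"] + t["BB"])


def score_estimators(team_seasons, estimators):
    """Return mean absolute error and RMSE of each estimator against actual runs."""
    results = {}
    for name, estimate in estimators.items():
        errors = [estimate(t) - t["R"] for t in team_seasons]
        mae = sum(abs(e) for e in errors) / len(errors)
        rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
        results[name] = {"mae": mae, "rmse": rmse}
    return results


# Single real row from the article; extend this list with more team-seasons.
team_seasons = [
    {"team": "1985 BAL", "AB": 5517, "H": 1451, "2B": 234, "3B": 22,
     "HR": 214, "BB": 604, "R": 818},
]

results = score_estimators(
    team_seasons, {"basic_rc": basic_rc, "naive_baseline": naive_baseline}
)
for name, scores in sorted(results.items(), key=lambda kv: kv[1]["rmse"]):
    print(f"{name}: MAE {scores['mae']:.1f}, RMSE {scores['rmse']:.1f}")
```

Passing the estimators in as a dictionary of functions means each new formula can be added to the comparison without touching the scoring code.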
