Offensive Statistics – A Caution
What does all this have to do with how to do baseball research? Well, it brings me to my first suggestion: if you’re just starting out, you might want to consider researching something other than coming up with new ways to evaluate player offenses.
It’s just that it’s been done to death. I’ve listed four different statistics that evaluate offenses, and there are many more besides. All of them are pretty good, and all of them are pushing the limits of how accurate a statistic can possibly be.
Now, I’m not saying that there’s no way you’ll do better. I would have thought the same thing maybe 20 years ago, that there was no way to beat Linear Weights and Runs Created, but then David Smyth came along and invented Base Runs, which, by some measures, is the best yet. My advice is not to suggest that you can’t do better, but rather that your research effort may yield more fruit if applied elsewhere.
But, on the other hand, evaluating players is fun. And if this area of sabermetrics is the one you find most interesting, then go ahead! But if you come up with a new statistic, you will be expected to produce hard evidence that yours works better than the ones already out there. It’s not enough to argue theoretically why it should work; you have to prove it does.
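What does “prove it” look like in practice? The usual test is to run your statistic and the established ones over the same set of team-seasons and compare how closely each predicts actual runs scored, typically by root mean squared error. Here’s a minimal sketch in Python; the team-stat dictionaries and the estimator functions passed in are placeholders for whatever data and statistics you’re comparing.

```python
import math

def rmse(estimates, actuals):
    """Root mean squared error, in runs, across a set of team-seasons."""
    n = len(estimates)
    return math.sqrt(sum((e - a) ** 2 for e, a in zip(estimates, actuals)) / n)

def compare_estimators(teams, actual_runs, estimators):
    """Score each run estimator against actual team runs.

    teams        -- list of dicts of team-season batting totals
    actual_runs  -- parallel list of each team's real runs scored
    estimators   -- dict mapping a name to a function of one team dict
    """
    for name, estimate in estimators.items():
        error = rmse([estimate(team) for team in teams], actual_runs)
        print(f"{name}: off by {error:.1f} runs per team-season")
```

For reference, the established estimators typically land somewhere in the neighborhood of 20 to 25 runs of RMSE per 162-game team-season, so that’s roughly the bar a new statistic has to clear.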
There’s a sabermetric adage: Just because a statistic has Babe Ruth on top and Mario Mendoza on the bottom, that doesn’t mean it’s accurately measuring what it’s supposed to measure.
So, as you work on your new statistic, keep these points in mind:
- It’s possible to get more and more accurate by including more and more information. The basic version of Runs Created includes only six data items: AB, H, 2B, 3B, HR, and BB (it’s sketched in code after this list). Obviously, you can get more accurate if you include SB and CS, and HBP, and SF, and other information. Indeed, some of the other statistics already include those categories, so when you compare your statistic to others, make sure you use the equivalent version, to ensure you’re comparing apples to apples. If you show that your statistic that includes 20 categories is more accurate than a statistic that includes only six, that’s not necessarily a breakthrough.
- It is also possible to get very accurate if you include “situational” statistics that tell you when the various events happened. For instance, if you were to add “batting average with runners in scoring position,” you’d increase the accuracy of your estimates quite a bit. But you wouldn’t necessarily increase your statistic’s usefulness: you’d partly be measuring when the hits happened to fall, not how well the batters actually hit.
- If you’re trying to show how various factors lead to runs scored, you can’t include categories that are themselves based on how many runs actually scored! For instance, you can do a lot better than Runs Created if you include “runners left on base.” In fact, (H + BB − CS − DP − runners left on) is almost exactly equal to runs! That’s because it’s almost equal to (runners reaching base − runners who didn’t score), which is exactly the definition of runs. (A sketch of this near-identity follows the list.)
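To make the apples-to-apples point concrete, here is the six-category basic version of Runs Created from the first bullet, as a sketch:

```python
def basic_runs_created(ab, h, doubles, triples, hr, bb):
    """Bill James's basic Runs Created: (H + BB) * TB / (AB + BB).

    Uses only the six categories AB, H, 2B, 3B, HR, and BB; total
    bases falls out of hits plus the extra bases on extra-base hits.
    """
    total_bases = h + doubles + 2 * triples + 3 * hr
    return (h + bb) * total_bases / (ab + bb)
```

If your statistic uses SB, CS, HBP, and so on, the fair comparison is against the versions of Runs Created or Linear Weights built from those same categories, not against this stripped-down one.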
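And here is why the “runners left on base” trick proves nothing: subtract the runners who didn’t score from the runners who reached base, and you’ve simply restated the definition of runs. A sketch of the near-identity (it ignores smaller terms such as HBP, reaching on error, and other outs on the bases):

```python
def runs_by_bookkeeping(h, bb, cs, dp, lob):
    """Nearly reproduces actual runs by construction, not by insight.

    (h + bb) approximates runners reaching base; (cs + dp + lob)
    approximates runners who reached base but didn't score.  The
    difference is almost exactly runs scored, by definition, so the
    apparent accuracy is circular.
    """
    return (h + bb) - (cs + dp + lob)
```

Any category that is tallied after the fact from who did or didn’t score, like runners left on base, smuggles the answer into the question.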
With all this in mind, if you do come up with a statistic that you can demonstrate is more accurate than its counterparts, you’ll have something of great interest to the sabermetric community. But, again, as I said, you have an uphill climb. This is the one area of sabermetrics that has had the most effort poured into it over the past three or four decades, and a better mousetrap will not be easy to invent.
A similar caution applies to any new statistic, especially one that’s supposed to evaluate or rank players or teams in some dimension. If your new stat is trying to estimate something that can be measured, show how well it does that, especially compared to any other stats that are out there. And if it’s trying to estimate something ethereal, like “consistency” or “durability,” something that doesn’t have a real definition, how do you know that you’re measuring it the best way possible? There’s nothing wrong with a statistic like that — Bill James has “speed score,” which estimates the fuzzy notion of a player’s “baseball speed” — but be aware that those kinds of things are rough tools, not strong empirical findings.