What To Research
In sabermetrics, as in probably any other discipline, there’s no official list of topics to research. Most sabermetricians just study what they’re interested in. Often, ideas for subjects come up during conversations with other fans. You’ll be talking baseball over a beer, and someone will say, “well, I’m worried about the Indians next year … they went 7-25 in September and October, and that’s probably a bad sign of things to come.”
And you think, hmmm, I wonder if that’s true, that a bad September is likely to be a negative indicator for next year’s performance? And, suddenly, you have a topic to study.
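To make that concrete, here is a rough sketch of how you might set up the "bad September" question. Everything in it is an assumption for illustration: the team names and records are invented, the function name `september_effect` is hypothetical, and the 100-point cutoff for a "cold finish" is arbitrary. The idea is to compare how much teams that collapsed late changed the following year with how much everyone else did, measured against their full-season record.

```python
from statistics import mean

# Invented team-seasons, for illustration only:
# (team, year, wins before Sept, losses before Sept,
#  wins in Sept/Oct, losses in Sept/Oct, next year's winning pct)
SEASONS = [
    ("A", 2001, 80, 50, 7, 25, 0.500),
    ("B", 2001, 70, 60, 17, 15, 0.540),
    ("C", 2002, 75, 55, 8, 24, 0.520),
    ("D", 2002, 66, 64, 17, 15, 0.500),
]


def win_pct(wins, losses):
    """Winning percentage as a fraction between 0 and 1."""
    return wins / (wins + losses)


def september_effect(seasons, gap=0.100):
    """Average next-year change in winning pct for two groups of teams.

    A team is a "cold finisher" if its Sept/Oct winning pct was at least
    `gap` lower than its pre-September pct (an arbitrary cutoff).
    Returns (cold_average, other_average); a group with no teams gives None.
    """
    cold, other = [], []
    for _team, _year, w1, l1, w2, l2, next_pct in seasons:
        early = win_pct(w1, l1)
        late = win_pct(w2, l2)
        overall = win_pct(w1 + w2, l1 + l2)
        # Change from this year's full-season record to next year's.
        change = next_pct - overall
        if early - late >= gap:
            cold.append(change)
        else:
            other.append(change)
    return (mean(cold) if cold else None,
            mean(other) if other else None)


if __name__ == "__main__":
    cold_avg, other_avg = september_effect(SEASONS)
    print("cold finishers:", cold_avg)
    print("everyone else: ", other_avg)
```

If the cold finishers declined noticeably more than the others, over many real team-seasons rather than four made-up ones, that would be some evidence for the claim. With real data you would also want to make sure the two groups had similar overall records to begin with.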
Another common source for ideas is baseball broadcasters – they’ll make some claim on the air, without giving evidence, and you spot an opportunity to check if what they say is true. Bill James used to do this a lot.
Or, you might be reading a certain study on one of the many sabermetric internet sites, and someone makes a suggestion in the comments – or, the study raises a question in your mind that you think it would be interesting to investigate.
If you’re just starting out, my suggestion would be to start fairly simple. One possibility is to find a bunch of old Bill James Abstracts, and read through them (which I recommend you do anyway, if you’re new to sabermetrics). Those books are full of little studies that Bill James throws in when a question occurs to him, and those might lead you to related questions that you can test. Even repeating one of Bill’s studies with more current data can be useful.
For instance, in the 1982 Bill James Baseball Abstract (Ballantine, 1982), Bill lists the average attendance for every starting pitcher in the major leagues, and finds that the only pitcher who reliably seemed to draw fans, in 1981, was rookie phenom Fernando Valenzuela. It immediately occurred to me: is it still true that the starting pitcher doesn’t affect attendance? I’d love to see a similar study for recent years[fn]* UPDATE: it turns out that someone has followed up Bill’s study! In an excellent piece in The Hardball Times 2012 Baseball Annual, Max Marchi looked at all pitchers since 1947, adjusted for overall trends, and found many great starters who drew in the fans. Nolan Ryan was the career leader, with 641,000 estimated extra tickets sold, while Mark Fidrych had the highest season average, with a total of around 300,000 tickets over three years.[/fn]. I’d also love to see someone take this a bit further. Bill just eyeballed the data before concluding that there didn’t seem to be an effect. But might there be a small effect that you’d find if you looked harder? You might check whether the better pitchers tended to draw more fans than the worse pitchers, after adjusting for day, weather, and opponent. Maybe there’s a small effect, but maybe there’s not.
The nice thing about using the Bill James Abstracts for ideas is that Bill tends to use straightforward techniques that don’t require any formal statistical expertise. His techniques may not be formal enough for, say, academic journals, but they’re excellent nonetheless, and they have enabled Bill James to teach us more about baseball than any other sabermetrician.
Of course, if you do have some expertise in statistical techniques, that will help too. For the attendance study, you might run a regression to predict attendance based on team, day of the week, opponent, and starting pitcher’s quality. But, even if you don’t use a formal statistical technique (and, for the record, I think in all of Bill James’s work, he’s used linear regression maybe twice), with a bit of creativity you can usually still figure out what’s going on.
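As a rough sketch of what the attendance study might look like, here is one simplified way to do it. It is not Bill's method, and it is not the full regression described above: it leaves out opponent and weather, and instead of fitting team and day-of-week terms it compares each game only to other games with the same home team and day of the week, then fits a single least-squares slope to what's left (sometimes called a "within-group" estimate). The game data, the starter "quality" rating (higher is better), and the function name `within_cell_slope` are all invented for illustration.

```python
from collections import defaultdict
from statistics import mean

# Invented game records, for illustration only:
# (home team, day of week, starter quality rating, attendance)
GAMES = [
    ("AAA", "Sat", 4.0, 41000),
    ("AAA", "Sat", 2.0, 39000),
    ("AAA", "Tue", 4.5, 25000),
    ("AAA", "Tue", 2.5, 24000),
    ("BBB", "Sat", 5.0, 30000),
    ("BBB", "Sat", 3.0, 29000),
]


def within_cell_slope(games):
    """Estimate extra attendance per point of starter quality.

    Games are grouped into (team, day-of-week) cells. Within each cell,
    quality and attendance are measured as deviations from the cell
    average, so differences between teams and between days drop out.
    A single least-squares slope is then fitted through the deviations.
    """
    cells = defaultdict(list)
    for team, day, quality, attendance in games:
        cells[(team, day)].append((quality, attendance))

    sum_xy = 0.0
    sum_xx = 0.0
    for rows in cells.values():
        avg_quality = mean(q for q, _ in rows)
        avg_attendance = mean(a for _, a in rows)
        for q, a in rows:
            sum_xy += (q - avg_quality) * (a - avg_attendance)
            sum_xx += (q - avg_quality) ** 2

    if sum_xx == 0:
        raise ValueError("starter quality never varies within a cell")
    return sum_xy / sum_xx


if __name__ == "__main__":
    slope = within_cell_slope(GAMES)
    print("extra fans per quality point:", round(slope))
```

With enough games, you could add opponent to the grouping key so that each game is compared only to games against the same visiting team. A slope near zero would agree with Bill's impression that the starter doesn't matter much; a clearly positive slope would suggest a small effect that eyeballing missed.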
Once you’ve settled on a question, you have to figure out how you’re going to work your way to an answer. That’ll be difficult without some knowledge of sabermetrics. There’s no field of human knowledge where you can just jump in without some basic understanding of how the field works and what’s already been done.
Indeed, if I were allowed to give only one piece of advice to aspiring researchers, it would be: learn some sabermetrics first. As my friend John Matthew IV said, “If you were interested in astronomy, you would read at least a few books before trying to predict the path of a comet.”
And so: know some of the sabermetric canon. In the next section, I’ll outline what might be a reading list for “Sabermetrics 101.”
Also, before you start working on your problem, you’re going to want to check whether others have worked on it before. Maybe they’ve already done exactly what you’re planning to do. Maybe they’ve gone only part of the way, and you can expand on what they’ve done. And maybe they’ve thought of some things that you haven’t, or maybe you won’t agree with how they did it.
In any case, no matter how knowledgeable you are in sabermetrics, you can’t be aware of everything. Before you start, you’ll want to search the literature, to see what progress has already been made on your problem. We’ll talk about that a bit later too.