# Judge: Sabermetrics and a plea for statistical tolerance

From Jonathan Judge at The Hardball Times on October 27, 2014, with mention of SABR members Bill James and Pete Palmer:

To get the disappointing news out of the way, this article does not contain the latest hot take on Yasiel Puig or the perceived death of baseball.

It does address a statistical topic that is becoming more widely appreciated, but not well understood. That topic is “multicollinearity”—an ugly-sounding term for a phenomenon that can sometimes be a problem, but is not as big a problem as some people seem to think.

Multicollinearity arises in the context of statistical regression. Regression is one of the most popular methods used in baseball research, helping us reveal actual—rather than merely suspected—relationships between baseball statistics. With regression, we can demonstrate that on-base percentage predicts run-scoring better than batting average, confirm that strikeouts and walks are highly predictive of a pitcher’s earned run average, and so on.
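The OBP-versus-AVG comparison can be sketched in a few lines. This is a minimal illustration on synthetic data (the team counts, coefficients, and noise levels are invented for the example, not drawn from the article): we generate run totals driven mostly by on-base percentage, then compare how well each statistic predicts them via simple least-squares fits.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic "team-seasons": runs are generated mostly from OBP, and AVG
# only loosely tracks OBP -- all numbers here are illustrative assumptions.
n = 200
obp = rng.normal(0.320, 0.020, n)
avg = obp - rng.normal(0.060, 0.010, n)    # AVG correlated with, but noisier than, OBP
runs = 2000 * obp + rng.normal(0, 25, n)   # runs scored, plus random noise

def r_squared(x, y):
    """R^2 of a simple least-squares regression of y on x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

print(f"R^2 predicting runs from OBP: {r_squared(obp, runs):.3f}")
print(f"R^2 predicting runs from AVG: {r_squared(avg, runs):.3f}")
```

On data built this way, the OBP fit explains noticeably more of the variance in runs than the AVG fit, which is the shape of the real-world finding the article refers to.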

There are many forms of regression, but in the baseball world, linear regression is often the most useful. There are many reasons for this: (1) many baseball statistics are continuous variables, and thus well-suited for linear regression; (2) linear regression allows us to “control” for variables that otherwise would hide important effects; (3) the simplicity of linear regression makes it easier to model future accomplishments; and (4) unlike more complex methods, linear regression allows us to use statistical significance to determine if perceived relationships are actually meaningful.
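Point (2), "controlling" for variables, is worth a concrete sketch. Below is a toy example with invented numbers (the park factor, the -0.5 strikeout effect, and the sample size are all assumptions for illustration): a confounding variable inflates both a pitcher's strikeout rate and his runs allowed, so a naive one-variable regression hides the strikeout effect, while a multiple regression that includes the confounder recovers it.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500

# Hypothetical setup: a "park factor" inflates both strikeout rate and
# runs allowed, masking the true (negative) strikeout effect of -0.5.
park = rng.normal(0, 1, n)
k_rate = 0.8 * park + rng.normal(0, 1, n)
runs_allowed = -0.5 * k_rate + 1.5 * park + rng.normal(0, 1, n)

def ols(predictors, y):
    """Least-squares coefficients; intercept first, then one per predictor."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

b_naive = ols([k_rate], runs_allowed)        # omits the park factor
b_ctrl = ols([k_rate, park], runs_allowed)   # controls for the park factor

print(f"naive strikeout slope:      {b_naive[1]:+.2f}")  # biased (positive)
print(f"controlled strikeout slope: {b_ctrl[1]:+.2f}")   # close to -0.5
```

The naive fit attributes the park's effect to strikeouts and gets the sign wrong; adding the confounder as a second predictor is exactly the "control" that point (2) describes.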

At the same time, linear regression makes a number of (generally reasonable) assumptions about the data being analyzed. It assumes that the “predictor” variables have a linear relationship with the “outcome” variable. It assumes that the regression errors are normally distributed. It assumes that when a prediction is wrong, it will consistently be wrong in the same ways—that the errors have constant variance. Finally, and particularly relevant to our discussion today, linear regression assumes that the predictors work independently of one another, and are not trying to explain the same phenomenon—in statistical lingo, this is known as avoiding “multicollinearity.”
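Multicollinearity can be made concrete with the variance inflation factor (VIF), a standard diagnostic: regress one predictor on the others and compute 1 / (1 - R²). The sketch below uses synthetic data (the variable names and noise level are assumptions for illustration) in which two predictors largely explain the same thing, such as OBP and AVG.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300

# Two predictors that largely carry the same information
# (think OBP and AVG) -- x2 is x1 plus a little noise.
x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 0.1, n)

def vif(x, others):
    """Variance inflation factor: 1 / (1 - R^2) of x regressed on the others."""
    X = np.column_stack([np.ones(len(x))] + list(others))
    beta, *_ = np.linalg.lstsq(X, x, rcond=None)
    resid = x - X @ beta
    r2 = 1 - resid.var() / x.var()
    return 1 / (1 - r2)

print(f"VIF of x1 given x2: {vif(x1, [x2]):.1f}")  # well above the common ~5-10 cutoff
```

A VIF near 1 means a predictor is essentially independent of the others; values far above the commonly cited cutoffs of 5 or 10, as here, are the textbook symptom of the multicollinearity the article goes on to discuss.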