Calculating Skill and Luck in Major League Baseball
This article was published in the Spring 2017 Baseball Research Journal.
One of my favorite topics is the contributions of skill and luck in baseball. I recently ran one thousand simulations of a 162-game schedule matching the one currently used in the majors (two leagues of 15 teams, three divisions each, with interleague play), in which every team was the same and each game was decided by a coin flip (a random number generated by the computer). Of course you would not expect each team to win exactly 81 games, but you would expect the wins to form a bell-shaped curve centered at 81.
This is of course what happened.
The width of the distribution is characterized by its standard deviation, which is the square root of the following quantity: the sum of the squared differences between each team's wins and 81, divided by the number of teams. You would expect about two-thirds of the values to be within plus or minus one standard deviation, and 95 percent to be within two.
There is a formula that defines what this number should be in a binomial distribution, the case where each trial has only two outcomes, like heads or tails, or in this case, wins or losses. (With a large number of trials, the binomial is closely approximated by the normal distribution.) The formula is simply the square root of the win probability times the loss probability times the number of trials, or games. For a 162-game season, this would be the square root of ½ times ½ times 162, which is 6.36. In my simulation, of course, each season out of the 1000 that I ran would not come out at exactly that value, but the number should be close. It turns out the season value was 6.35 with a range of plus or minus 0.9. If I combined the data into 10-year periods, the range was down to 0.3, reduced by a factor of the square root of 10, as expected.
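As a rough illustration of the formula above, here is a minimal Python sketch. The function names are hypothetical, and the simulation is a simplification: each team's 162 games are treated as independent coin flips rather than being played on the real schedule described earlier, so it is not a reconstruction of the simulation used for this article.

```python
import math
import random


def binomial_sd(n_games, p_win=0.5):
    """Standard deviation of wins: sqrt(p * (1 - p) * n)."""
    return math.sqrt(p_win * (1 - p_win) * n_games)


def simulate_season_sd(n_teams=30, n_games=162, seed=0):
    """Simulate one coin-flip season and return the spread of team wins.

    Simplification: every team flips its own coins, so the real
    schedule (one winner and one loser per game) is ignored.
    """
    rng = random.Random(seed)
    wins = [sum(rng.random() < 0.5 for _ in range(n_games))
            for _ in range(n_teams)]
    mean_wins = n_games / 2
    # Square root of (sum of squared differences from 81 / number of teams).
    return math.sqrt(sum((w - mean_wins) ** 2 for w in wins) / n_teams)


if __name__ == "__main__":
    print(round(binomial_sd(162), 2))      # theoretical value, 6.36
    print(round(simulate_season_sd(), 2))  # one simulated season
```

Any single simulated season will land somewhat above or below the theoretical value, which is the season-to-season scatter described above.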
If you assume luck and skill are independent (the luck factor for a season is the 6.36 wins calculated above, and the total variation is what happened on the field), then you can calculate the skill factor. It is simply the square root of total squared minus luck squared. I took the data by decades from 1871 to the present. The Union Association (1884) and Federal League (1914-15) were excluded.
A variation in the luck factor of 0.3 wins or so would result in a change in skill of about 0.2.
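The subtraction of variances described above can be sketched in a few lines of Python. The function name is hypothetical, and the total of 11 wins used in the example is an illustrative input rather than a figure taken from the table.

```python
import math

# Luck factor for a 162-game season: sqrt(1/2 * 1/2 * 162), about 6.36 wins.
LUCK_SD = math.sqrt(0.25 * 162)


def skill_sd(total_sd, luck_sd=LUCK_SD):
    """Skill factor, assuming skill and luck are independent:
    skill = sqrt(total^2 - luck^2)."""
    if total_sd < luck_sd:
        raise ValueError("total spread is smaller than the luck spread")
    return math.sqrt(total_sd ** 2 - luck_sd ** 2)


if __name__ == "__main__":
    total = 11.0  # illustrative observed standard deviation of wins
    base = skill_sd(total)
    shifted = skill_sd(total, LUCK_SD + 0.3)
    print(round(base, 2), round(base - shifted, 2))
```

With a total of 11 wins the skill factor comes out near 9, and moving the luck factor by 0.3 wins shifts it by roughly 0.2, in line with the sensitivity noted above.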
What this shows is that the skill factor, which was around 13 wins for 1901–1950, has been reduced significantly, so that it has been around 9 since 1971. This is a team skill factor, not a player skill factor, so basically the teams have become more evenly matched. It also shows that over the course of a full season, the skill and luck factors are almost equal. If you assume the 9 wins (.055 pct) would be constant for any length schedule, then you need 81 games before the skill factor and luck factor are equal. In 81 games, the skill factor would be 81 times .055, or 4.5 wins, while the luck factor, the square root of 81/4, would also be 4.5.
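The break-even point can be checked with a little algebra: skill grows in proportion to the number of games, while luck grows with its square root, so the two are equal when n times s equals the square root of n/4, that is, when n = 1/(4s²). A small Python sketch under that assumption (function names are hypothetical):

```python
import math


def skill_wins(n_games, skill_pct):
    """Skill spread in wins, assuming a constant spread in winning percentage."""
    return n_games * skill_pct


def luck_wins(n_games):
    """Luck spread in wins for evenly matched teams: sqrt(n / 4)."""
    return math.sqrt(n_games / 4)


def break_even_games(skill_pct):
    """Number of games at which skill and luck spreads are equal.

    Solves n * s = sqrt(n / 4), which gives n = 1 / (4 * s^2).
    """
    return 1 / (4 * skill_pct ** 2)


if __name__ == "__main__":
    s = 9 / 162  # 9 wins per 162 games, about .055
    n = break_even_games(s)
    print(round(n), round(skill_wins(n, s), 2), round(luck_wins(n), 2))
```

Using 9/162 exactly, the break-even point is 81 games, with both factors at 4.5 wins.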
Next I modified my simulation to use teams that were not the same. I chose teams with a standard deviation equal to 9 wins derived from the previous table. I ran 1000 seasons, each with a different set of expected team win percentages. I noted the number of teams within each win range and the number of expected wins for each team that actually won games in that range. I derived a formula for the probability of one team beating another, which is the difference in overall win probability of the teams plus one half.
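A minimal sketch of the two ingredients described above, in Python. The names are hypothetical, and two details are assumptions not stated in the text: team strengths are drawn from a normal distribution, and the matchup probability is clamped to the range 0 to 1.

```python
import random

# Skill spread of 9 wins in 162 games, expressed as winning percentage.
SKILL_SD_PCT = 9 / 162


def draw_team_strengths(n_teams=30, seed=0):
    """Draw an expected winning percentage for each team.

    Assumption: strengths are normally distributed around .500.
    """
    rng = random.Random(seed)
    return [rng.gauss(0.5, SKILL_SD_PCT) for _ in range(n_teams)]


def matchup_probability(pct_a, pct_b):
    """Probability that team A beats team B: the difference in
    overall win probability plus one half (clamped to [0, 1])."""
    p = 0.5 + (pct_a - pct_b)
    return min(1.0, max(0.0, p))


if __name__ == "__main__":
    # A .550 team against a .450 team wins about 60 percent of the time.
    print(round(matchup_probability(0.550, 0.450), 3))
```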
The results show the actual range is quite a bit broader than the expected one. From the previous table we would expect the actual range to have a standard deviation around 12, which it did. The important column is E/A, which is the expected number of wins for a team that actually won games in that range. For example, a team that actually won 59 to 61 games was expected to win 67.4 games, which was 6.5 wins more than actual. Only 8 percent of the teams that won 59-61 games won more games than expected. Looking at the 89-91 range, those teams were expected to win only 86.5 games, and 70 percent of them won more games than expected.
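The simulated figures can be compared with a standard back-of-envelope estimate. This is not the method used in the simulation: it is the usual normal-theory shrinkage formula, which assumes that both skill and luck are normally distributed and independent. Under that assumption a team's expected true wins are pulled toward 81 by the fraction skill² / (skill² + luck²). The function name is hypothetical.

```python
import math

LUCK_SD = math.sqrt(162 / 4)  # about 6.36 wins
SKILL_SD = 9.0                # skill factor used in the simulation


def expected_true_wins(actual_wins, mean=81.0,
                       skill_sd=SKILL_SD, luck_sd=LUCK_SD):
    """Estimate expected wins for a team with a given actual record.

    Assumption: normal, independent skill and luck, so the estimate
    shrinks toward the mean by skill^2 / (skill^2 + luck^2).
    """
    share = skill_sd ** 2 / (skill_sd ** 2 + luck_sd ** 2)
    return mean + share * (actual_wins - mean)


if __name__ == "__main__":
    for wins in (60, 90):
        print(wins, round(expected_true_wins(wins), 1))
```

This approximation gives about 67 expected wins for a 60-win team and about 87 for a 90-win team, roughly in line with the 67.4 and 86.5 found in the simulation.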
I then looked at actual data showing the change in wins in the following year. I used all teams from 1901 to date and found, not surprisingly, that teams tended to finish closer to .500 the next year. In fact the change was almost identical to the change found in the simulation above. Thus I believe that the so-called regression to the mean is simply due to luck. Of course, we don’t know the true win probabilities of the actual teams, but it does seem likely that they are similar to those in the simulation, and a real-life 90-win team is probably an 86.5-win team that was lucky. Wins were normalized to 162 games to account for varying length seasons. Diff1 is the difference between wins this year and next. Diff2 is the difference between expected and actual team wins from the previous table. The table below is from a much smaller sample, so the numbers are less uniform.
This method can also be used to look at player performance. The variation in batting average due to luck uses the same formula, except the probability of success is more like 0.25 rather than 0.50. For a full season of 500 at-bats, the variation in hits is the square root of 0.25 times 0.75 times 500, which is 9.7 hits, or about 20 points of batting average. The table below shows batting average by decades for all players with at least 300 appearances (at bats plus walks) in a season. Total, luck and skill are in batting average points; that is, 37 means .037 of batting average. The variation in skill level has decreased, which indicates that the average level has probably increased, making it harder for the best players to exceed it. However, some of the decrease may be due to power hitters who sacrifice batting average for homers. The standard deviation from year to year is 1.4 (square root of two) times the yearly value, so that means five percent of the players can change more than 60 points just due to luck.
* Note: Although the chart shows appearances (AB + walks), actual at-bats were used to calculate the variance in batting average. AB are usually around 40 fewer than appearances. Example: for 1901-10 the average was 456 AB, so sigma is sqrt(456 × .265 × .735)/456, or 21 points. Appearances give credit to players who walked, but I used batting average as the criterion and didn’t want too many columns.
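The batting average calculations above follow the same binomial formula, and can be sketched in Python as follows. The function names are hypothetical.

```python
import math


def batting_luck_points(at_bats, avg):
    """Luck spread in batting average points (1 point = .001).

    Spread in hits is sqrt(AB * avg * (1 - avg)); dividing by AB
    converts it to batting average.
    """
    sd_hits = math.sqrt(at_bats * avg * (1 - avg))
    return 1000 * sd_hits / at_bats


def year_to_year_points(season_points):
    """Spread of the difference between two seasons: sqrt(2) times one season."""
    return math.sqrt(2) * season_points


if __name__ == "__main__":
    full_season = batting_luck_points(500, 0.250)
    print(round(full_season, 1))                           # one season
    print(round(batting_luck_points(456, 0.265), 1))       # 1901-10 example
    print(round(2 * year_to_year_points(full_season), 1))  # two-sigma change
```

The unrounded results are a little under the rounded figures quoted in the text: about 19 points for 500 at-bats, about 21 for the 1901-10 example, and a two-standard-deviation year-to-year change of about 55 points.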
Normalized on-base plus slugging (NOPS) is a better measure of batting than average, though. The definition is the player's on-base average (OBA) over the league OBA, plus the player's slugging average (SLG) over the league SLG, minus 1, all times one hundred. The league averages do not include pitcher batting. This is then adjusted for park by dividing by the player's park factor (PF). PF is basically runs scored per inning at home over runs scored per inning away, plus one, divided by two. So a park where twenty percent more runs were scored at home than on the road would have a PF of 1.10. NOPS correlates directly with runs in that a player with a 120 NOPS produces runs at a rate 20 percent higher than the league average. The standard deviation for NOPS is a bit more complicated. Slugging average is driven by home runs, so homer hitters have a higher standard deviation.
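The NOPS and park factor definitions translate directly into code. The Python sketch below uses hypothetical function names, and the player and league figures in the example are made up for illustration.

```python
def park_factor(home_runs_per_inning, away_runs_per_inning):
    """PF: (home rate / away rate + 1) / 2."""
    return (home_runs_per_inning / away_runs_per_inning + 1) / 2


def nops(oba, slg, league_oba, league_slg, pf=1.0):
    """Normalized OPS: (OBA/lgOBA + SLG/lgSLG - 1) * 100, divided by PF."""
    raw = (oba / league_oba + slg / league_slg - 1) * 100
    return raw / pf


if __name__ == "__main__":
    # A park with 20 percent more scoring at home than on the road.
    pf = park_factor(0.60, 0.50)
    print(round(pf, 2))
    # Illustrative player: .384 OBA and .480 SLG in a .320/.400 league.
    print(round(nops(0.384, 0.480, 0.320, 0.400), 1))
    print(round(nops(0.384, 0.480, 0.320, 0.400, pf), 1))
```

In this example the park factor is 1.10, and the player's unadjusted NOPS of 140 drops to about 127 after the park adjustment.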
I ran a simulation where all players each year from 1901 through 2016 with 300 or more plate appearances were run through 100 seasons. The league standard deviation came out 15 points, although it was more like 14.5 for the first half of the period and 15.5 for the last half, where homers were more frequent. If I divided the league in half by homer percentage in 2016, the top half had 16.5 and the bottom half 14.5. I will use 15 for analysis. This table shows that the variation in skill level has remained fairly constant with a slight dip recently, which as with batting average may indicate a rise in average skill.
So for NOPS, five percent of the players can change 40 or more points from year to year due to luck alone.
For pitchers, I would use normalized ERA (NERA), which is simply league ERA over pitcher ERA times 100. Again, a 120 NERA will result in 20 percent fewer runs allowed than an average pitcher. I do have a quarrel with the way earned runs are assigned, though. A pitcher is always charged with runs scored by players he has allowed on base. It would be fairer if they were shared when a relief pitcher comes in. For example, if a pitcher left with the bases loaded and none out, he would be charged with 1.8 runs. This is the number of runs usually scored by the 3 runners. If the relief pitcher got out of the inning with no runs scored, he would get minus 1.8 runs.
The actual value varies a bit from year to year and league to league, and does have a random factor associated with it. If runs were individual events like goals in a hockey game, the standard deviation would be the square root of the average number of runs, but in baseball scoring one run can often lead to another and a grand slam homer can score four in one blow, so the actual value is the square root of twice the number of runs. In the bases loaded case, the 1.8 figure comes from 701 runs scored in 389 cases in the 2015 AL. The square root of 1402, divided by 389, is about 0.1. A runner on third and none out scores about 82 percent of the time, while a runner on first with two outs comes in only 13 percent of the time.
For ERA, the luck factor is calculated the same way. A pitcher with 180 innings and an ERA of 4.00 would have allowed 80 runs, and the luck factor would be the square root of 160 times 9 over 180 or 0.63, a fairly hefty figure. That means five percent of the pitchers could have their ERA off by more than 1.26 due to luck alone. For NERA, the luck factor if the league ERA was 4.00 would be 0.63 divided by 4.00 times 100 or 16. The table below shows all pitchers with 150 or more innings from 1901 to the present by decades.
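The square-root-of-twice-the-runs rule and its application to ERA can be sketched in Python as follows. The function names are hypothetical, and the rule itself is the approximation described above rather than an exact result.

```python
import math


def runs_luck_sd(runs):
    """Luck spread in runs, using the rule sqrt(2 * runs)."""
    return math.sqrt(2 * runs)


def era_luck_sd(innings, era):
    """Luck spread in ERA for a pitcher with the given innings and ERA."""
    runs = era * innings / 9
    return runs_luck_sd(runs) * 9 / innings


def nera_luck_sd(innings, era, league_era):
    """Luck spread in normalized ERA points."""
    return era_luck_sd(innings, era) / league_era * 100


if __name__ == "__main__":
    # Bases loaded, none out: 701 runs in 389 cases (2015 AL).
    print(round(701 / 389, 2), round(runs_luck_sd(701) / 389, 3))
    # 180 innings at a 4.00 ERA in a 4.00 league.
    print(round(era_luck_sd(180, 4.00), 2))
    print(round(nera_luck_sd(180, 4.00, 4.00), 1))
```

These reproduce the figures in the text: 1.8 runs with an uncertainty of about 0.1 for the bases-loaded case, an ERA luck factor of 0.63, and a NERA luck factor of about 16.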
I did a study of all players with at least 300 at bats their first two years and sorted by difference in NOPS and whether they made 300 at-bats in their third year. What it showed was that most players who did worse their second year improved in their third year, while most players who did better their second year didn’t do as well in their third year. About 30 percent of those who were worse their second year did not get a third year, while only 10 percent of those who were better did. In the 30-point change area, those who improved were 37 points higher in year two, but only 5 points higher in year three. So it appears that a big improvement is considered a trend, when it is really mostly luck. The 30-point players who improved did do a little better the third year (up 7 points), while those who did worse only went up 4, but that is a pretty small difference. (See Table 7.)
A handy rule for determining simulated series winners is that the probability of winning a seven-game series is equal to twice the one game percentage minus one half. So a .550 team will win the series sixty percent of the time. If you actually do the math, it turns out that a .500 team will win .500 of the time, obviously, while a .550 team will win .608, a .600 team will win .710 and a .650 team will win .800, but the rule of thumb is close enough.
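The exact figures can be computed by summing over the number of games the winner loses before taking its fourth win. The Python sketch below assumes each game is independent with the same win probability, which ignores home field and pitching matchups; the function names are hypothetical.

```python
from math import comb


def series_win_probability(p, wins_needed=4):
    """Exact probability of winning a best-of-(2 * wins_needed - 1) series.

    The team wins the series while losing k games (k = 0..wins_needed - 1);
    the final game must be a win, so the earlier games can be ordered
    in C(wins_needed - 1 + k, k) ways.
    """
    q = 1 - p
    return sum(comb(wins_needed - 1 + k, k) * p ** wins_needed * q ** k
               for k in range(wins_needed))


def rule_of_thumb(p):
    """Approximation from the text: twice the one-game percentage minus one half."""
    return 2 * p - 0.5


if __name__ == "__main__":
    for p in (0.500, 0.550, 0.600, 0.650):
        print(f"{p:.3f}  exact {series_win_probability(p):.3f}  "
              f"rule {rule_of_thumb(p):.3f}")
```

The exact values match those quoted above (.500, .608, .710 and .800), against .500, .600, .700 and .800 from the rule of thumb.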
I ran four separate runs of 5000 162-game season simulations based on playoff structure. The first was one league of 30 teams with no playoffs. The winner was the first place team at the end of the season.
The next case was 2 leagues of 15 teams with the league winners in the playoff. Then I tried two leagues of two divisions each and 4 teams in the playoffs. Finally it was two leagues, three divisions and a wild card, eight teams in the playoffs. As the number of playoff teams increased, the probability of the best team making the playoffs also increased, but the likelihood of the best team winning went down.
In real life, we do not know which team was really the best, but if we assume that it was the team with the best record during the season, then that team has won the World Series five times since 1995 when the wild card was introduced, although in one case there was a tie for the best. That works out to about 20 percent, consistent with the above table. The average rank of the World Series winner was fourth out of eight. If the playoffs were completely random, the average rank would be 4.5. The worst team has won 4 times.
In a season, the variation due to luck is about 6.3 games or about 40 percentage points. In a seven-game series, the variation is the square root of ½ times ½ times 7, which is 1.32 games or 189 percentage points. For a single game it is the square root of ½ times ½ times 1, which is .5 wins or 500 percentage points. The average difference in team skill in a game is about 55 percentage points, but if you include home/away and variation among starting pitchers, the actual difference per game is around 100 points or one run. We established that the variation due to chance is the square root of twice the number of runs involved. This means the variation of the difference in runs for a single game would be the square root of 18, or 4.24. This is over four times the variation due to skill.
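To see how the luck factor shrinks in relative terms as more games are played, the same binomial formula can be expressed in points of winning percentage. This is a small Python sketch with hypothetical function names; the figure of 18 runs assumes about 9 total runs per game, as implied above.

```python
import math


def luck_sd_games(n_games, p=0.5):
    """Luck spread in wins: sqrt(p * (1 - p) * n)."""
    return math.sqrt(p * (1 - p) * n_games)


def luck_sd_points(n_games, p=0.5):
    """Luck spread in points of winning percentage (1 point = .001)."""
    return 1000 * luck_sd_games(n_games, p) / n_games


if __name__ == "__main__":
    for n in (162, 7, 1):
        print(n, round(luck_sd_games(n), 2), round(luck_sd_points(n)))
    # Spread of the run difference in one game: sqrt(2 * 9 runs).
    print(round(math.sqrt(2 * 9), 2))
```

The spread in points goes from about 39 over a season to about 189 in a seven-game series and 500 in a single game.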
Looking at other sports as comparisons, the skill factor is much more important in basketball and football. With fewer players on a basketball team, one star player can make a big difference. Football has a much shorter season, so the luck factor is higher per game, but skill still wins out. If you assume the skill factor would be the same regardless of the length of the season, then for a 162-game season basketball would be about 150 points of winning percentage, football about 140, and baseball 55 as shown above. Trying to hit a round ball with a round bat introduces a lot of variability which does not exist in other sports.
Table 9.1. Basketball
Table 9.2. Football
A team’s record from year to year includes a great deal of luck, and luck contributes over four times as much as skill to a team's eventual record. If all teams were equal, the standard deviation year to year would be 9 games (the square root of 324/4), or alternately the square root of 2 times the in season variation (square root of 162/4 or 6.36). That means every year there should be one or two teams with differences of 18 just by luck. Below are results by decade. The real difference between teams is only about 7 games a year. There were 166 teams who gained 18 or more games from one year to the next, going from 67 wins to 90 on the average. However, the next year, they dropped back to 85, just like any other team. Likewise there were 156 teams who dropped 18 or more wins, going from 91 to 68, but won 75 the following year. Wins were normalized to 162 games to allow for schedule differences.
Most people think luck is a lot less important than it is. A team’s record from year to year includes a great deal of luck, and luck contributes about as much as skill to a team’s eventual regular season record. (And in the postseason, it’s nearly all luck.)
PETE PALMER is the co-author with John Thorn of "The Hidden Game of Baseball" and co-editor with Gary Gillette of the "Barnes and Noble ESPN Baseball Encyclopedia" (five editions). Pete worked as a consultant to Sports Information Center, the official statisticians for the American League from 1976 to 1987. Pete introduced on-base average as an official statistic for the American League in 1979 and invented on-base plus slugging (OPS), now universally used as a good measure of batting strength. He won the SABR Bob Davids Award in 1989 and was selected by SABR in 2010 as a winner of the inaugural Henry Chadwick Award. Pete also edited with John Thorn seven editions of "Total Baseball." He previously edited four editions of the "Barnes Official Encyclopedia of Baseball" (1974–79). A member of SABR since 1973, Pete is also the editor of "Who’s Who in Baseball," which celebrated its 101st year in 2016.