Baseball Research Journal, Spring 2026

On the Association of Umpire Performance with Age and Experience in MLB

This article was written by Riley Post - Chenyang Li - Dale Zimmerman

This article was published in Spring 2026 Baseball Research Journal


Perhaps no sport relies on the accuracy and consistency of its officials more than baseball, where the home plate umpire calls a ball or strike on every pitch not swung at by the batter. The relatively sedentary nature of the home plate umpire’s duties compared to officials in other major sports allows individuals to perform this role to a more advanced age and to accrue relatively greater experience. Here, we investigate the associations of two metrics of home plate umpire performance with umpire age and experience using game-level data from 2015 to 2023 provided by the StatCast pitch-tracking system. The bounded continuous nature of these metrics, their high game-to-game variability, their correlation over years within umpires, and the high correlation between umpire age and experience make this task quite challenging. We use mixed-effects weighted beta regression methodology to address these challenges. We find that after adjusting for year-to-year changes in aggregate umpire performance from 2015 to 2023, accuracy and consistency were negatively associated with umpire age and experience. That is, older and more experienced umpires performed worse than their younger and less experienced counterparts. These negative associations are statistically and practically significant and stand in stark contrast to the positive associations of referee performance with age and experience observed in other sports.

Figure 1. Measures of center and dispersion for UmpScorecards metrics accuracy and consistency by year, over the period 2015–23. Accuracy and consistency, which are proportions, are expressed as percentages in this plot. The closed circle represents the average value of a metric in a given year, and the vertical bar for a given year extends from the 5th to the 95th sample percentile of the metric in that year.

Figure 3. Matrix of scatter plots showing the relationships between age, experience, and age-at-hiring. Correlations between each pair of variables are given above the main diagonal. The number of asterisks, say a, attached to a correlation indicates that the correlation is significantly different from 0 at the 10⁻ᵃ level of significance.

Marginally, accuracy was negatively and highly significantly associated with both age and experience, and not significantly associated with age-at-hiring (Table 2). The estimated slope coefficient for age in the mixed-effects weighted beta regression of the logit of accuracy on age and indicator variables for year, which was −6.61×10⁻³, translates to a decrease in accuracy for a 60-year-old, compared to a 30-year-old, of 1.7% in 2015 and 1.1% in 2023. Similarly, the estimated slope coefficient for experience in the regression of the logit of accuracy on experience and indicator variables for year, which was −6.31×10⁻³, corresponds to a decrease in accuracy for an umpire with 30 years of experience, compared to one with no experience, that ranges from 1.6% in 2015 to 1.1% in 2023. These decreases, though small, are statistically and practically significant.
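The conversion from a logit-scale slope to a percentage-point change in accuracy can be sketched as follows. The slope is the age estimate reported above. The two baseline accuracies for a 30-year-old (91% and 94.5%) are illustrative assumptions chosen only to show the mechanics, not values taken from the fitted model, and the function name is ours.

```python
import math

def accuracy_drop(slope, delta_x, baseline):
    """Drop in accuracy when the covariate increases by delta_x,
    starting from `baseline` accuracy, under a logit-linear mean."""
    logit = math.log(baseline / (1.0 - baseline))
    shifted = logit + slope * delta_x            # move along the logit scale
    new_acc = 1.0 / (1.0 + math.exp(-shifted))   # back-transform to a proportion
    return baseline - new_acc

AGE_SLOPE = -6.61e-3  # estimated slope for age reported in the text

# Baselines below are hypothetical, not fitted year effects.
for label, base in [("lower baseline", 0.91), ("higher baseline", 0.945)]:
    drop = accuracy_drop(AGE_SLOPE, 30, base)
    print(f"{label}: {100 * drop:.2f} percentage points")
```

Because the logit link is nonlinear, the same 30-year shift on the logit scale produces a smaller drop when the baseline accuracy is higher, which is consistent with the reported decrease being smaller in 2023 than in 2015.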

In the two-regressor models in which one of the regressors was age-at-hiring, accuracy continued to be negatively and highly significantly associated with both age and experience, and not significantly associated with age-at-hiring. In the model with both regressors age and experience, however, neither regressor had a significant effect on accuracy. The vanishing of significance in this model is largely due to a fourfold increase in the estimated standard errors of the estimated slope coefficients, which in turn is due to the very high correlation between age and experience noted previously.

The marginal associations between consistency and age or experience were similar in sign to those just described between accuracy and the same two regressors, but overall about half as steep (Table 3). Specifically, the declines in consistency for the same age and experience comparisons described above for the declines in accuracy were approximately 0.7% in 2015 and 0.6% in 2023. The results of each regression of consistency on two regressors were also similar to those of the regression of accuracy on the same two regressors, only weaker. In particular, in the model with regressors age and experience, neither regressor had a significant effect on consistency.
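To give a sense of how strong a correlation must be to produce the fourfold inflation of standard errors noted above, the following sketch uses the variance inflation factor from ordinary least squares with two regressors. This is a swapped-in simplification: it assumes the residual variance is unchanged when the second regressor is added, and it applies only approximately to the mixed-effects beta regression used here. The function names are ours.

```python
import math

def se_inflation(r):
    """Factor by which a slope's standard error grows when a second
    regressor with correlation r is added (two-regressor OLS,
    residual variance held fixed)."""
    return 1.0 / math.sqrt(1.0 - r * r)

def correlation_for_inflation(k):
    """Correlation magnitude that yields a k-fold standard-error inflation."""
    return math.sqrt(1.0 - 1.0 / (k * k))

# Correlation implied by a fourfold inflation under the OLS approximation.
print(round(correlation_for_inflation(4.0), 3))
```

Under these assumptions, a fourfold inflation corresponds to a correlation of about 0.97 between the two regressors.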

All of the models described above specify that the logit of the mean of the beta distribution depends linearly on the covariates. To check for the possibility of nonlinear dependence, we performed another analysis for which the logit of the mean included not only a linear term for the covariate (age or experience), but also a quadratic term. It turned out that all four of the fitted quadratic mean functions in the beta regression model (accuracy on age, accuracy on experience, consistency on age, and consistency on experience) were strictly decreasing and strictly concave over the entire range of age or years of experience (Figure 5). Each estimated quadratic-term coefficient was negative and statistically significant, indicating that the rate of decline in performance was smaller for younger and less experienced umpires than for their older and more experienced counterparts. However, the degree of concavity (curvature) in each fitted model was so small that the fitted quadratic mean curves differed very little from the fitted linear means over all but the extreme upper end of the ranges of age and experience. In fact, the curvature is not practically significant since the aforementioned performance comparisons of 30- and 60-year-old umpires under the linear mean model are virtually unaffected. For example, the fitted decrease in accuracy for the latter compared to the former under the quadratic mean model is 1.6% (compared to 1.7%) for 2015 and 1.0% (versus 1.1%) for 2023.

Finally, we also repeated the entire analysis for the subset of only those umpires who began umpiring in 2008 or before. There are 64 umpires in this subset, about half as many as in the complete dataset, and they account for 460 umpire-years. The rationale for considering this subset is that umpires who debuted after 2008 were trained after the implementation of PITCHf/x and received direct feedback from it during their training and could thus be systematically different than those who began in or before 2008. However, the results of the analysis of this subset (not shown) indicated that none of our conclusions are substantially affected by excluding umpires who debuted after 2008.

CONCLUSIONS AND DISCUSSION

In this work, we used data compiled by umpscorecards.com along with demographic information supplied by Retrosheet to evaluate the associations of Major League Baseball umpire performance metrics accuracy and consistency with umpire age, experience, and age at hiring. A major finding is that accuracy, and to a lesser degree consistency, is negatively associated with both age and experience. Put simply, at any given point in time from 2015–23, older or more experienced umpires were less accurate and less consistent. This is in stark contrast to the generally positive associations found previously between referee performance and age and experience in other sports. We found little to no association between any of the metrics and the age at which the umpire was hired.

Our findings complement those of previous authors, who showed (over a different, non-overlapping period) that the accuracy of umpires, in aggregate, improved from one season to the next and that younger and less experienced umpires improved more rapidly than those who were older and had more experience.33,34 Our investigation shows that irrespective of performance improvement over time, younger and less experienced umpires performed better in absolute terms than their older and more experienced brethren. The differences in performance due to age or experience were not huge (about 1–2% higher accuracy for a 30-year-old than a 60-year-old, for example), but were practically and statistically significant. Because about 140 pitches are called in a game on average, this difference in accuracy would correspond to about two or three pitches, in a typical game, that the 30-year-old would call correctly that a 60-year-old would not. Over the course of a standard 2430-game season there might be 5000 or so more correct calls made if all umpires were 30 years old than if all umpires were 60 years old.
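The back-of-the-envelope arithmetic behind these per-game and per-season figures can be reproduced as follows. The pitch and game counts come from the text; the 1.5% gap is simply the midpoint of the 1–2% range and is our illustrative choice, as is the function name.

```python
PITCHES_PER_GAME = 140    # average called pitches per game (from the text)
GAMES_PER_SEASON = 2430   # regular-season games (from the text)

def extra_correct_calls(accuracy_gap):
    """Return (per-game, per-season) additional correct calls for a given
    accuracy gap expressed as a proportion."""
    per_game = PITCHES_PER_GAME * accuracy_gap
    return per_game, per_game * GAMES_PER_SEASON

# Gaps of 1%, 1.5% (assumed midpoint), and 2%.
for gap in (0.01, 0.015, 0.02):
    per_game, per_season = extra_correct_calls(gap)
    print(f"gap {gap:.1%}: {per_game:.1f} per game, {per_season:.0f} per season")
```

A gap of 1–2% thus corresponds to roughly 1.4 to 2.8 calls per game and roughly 3,400 to 6,800 calls per season, with the midpoint giving about 5,100.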

It is important to note that our findings do not support a conclusion that umpires decline in performance as they get older and accrue more experience. On the contrary, from 2015–23 the mean umpire performance improved about as much for umpires from pre-2000 cohorts as for cohorts who began their careers after 2000. Note that this contrasts with results from previous studies over the period 2008–15 cited herein.

 

Table 4. Umpire accuracy (%) by cohort in years 2015–23. The sample size (initial number of umpires from cohorts 2016–2023, or the number still umpiring in 2015 for other cohorts) is listed in the last column.

However, as can be seen from Table 4, in any fixed year of our study umpire performance tends to worsen with increasing experience of the cohort. This is particularly so for cohorts in 2015 and after. For example, the average accuracy of rookie umpires from cohorts after 2017 was always greater than 93%, while that of rookie umpires from the 2017 cohort and prior never exceeded 93%, and exceeded 92% only once and that just barely. These results suggest that it is not so much a decline in sensory acuity or a resistance to evaluation that is responsible for the relatively inferior performance of older and more experienced umpires in recent years, but rather an improvement in training that has helped recent cohorts have excellent performance “right off the bat.” As an example of a recent enhancement to training, in 2022, Major League Baseball began running its own Umpire Prospect Development Camp, a month-long program that immerses top prospects in game-speed evaluations that emphasize strike zone accuracy, pitch recognition, and consistency under pressure. Perhaps as a result, it seems that many umpires from recent cohorts are now demonstrating unprecedented levels of ball-and-strike accuracy and zone consistency.

Also relevant to the negative association between umpire performance and experience may be the effects of union representation of MLB umpires provided by the Major League Baseball Umpires Association (MLBUA). Union representation of professional sports officials is not unique to MLB; officials in the National Football League, National Hockey League, and National Basketball Association are also represented by similar organizations. This representation may explain why dismissal of underperforming officials is exceedingly rare (the authors were unable to find an example within the past decade of an MLB umpire being fired for unsatisfactory performance), though suspensions have been issued as a result of questionable officiating. However, these suspensions are rarely related to lack of umpire accuracy and are typically associated with improper application of other rules not related to on-field calls (e.g., the suspension of an umpire for two games for not properly enforcing regulations on pitcher substitutions). While removal is rare, umpires are indeed held to performance standards, with only the best performers being asked to officiate more consequential games such as those in the World Series. Nevertheless, union membership may reduce the incentive for performance improvement.

In addition to examining marginal associations between umpire performance and age or experience, we also examined joint associations. However, because age and experience are so highly collinear for umpires, our results obtained by fitting models that included both regressors were, unfortunately, rather uninformative. In this sense, we were unable to achieve one of our objectives, namely, to understand the impacts of both age and experience, adjusted for the other, on umpire performance. In ordinary regression settings, strategies for dealing with multicollinearity include ridge regression and penalized likelihood. Unfortunately, ridge and penalized likelihood methods for mixed-effects beta regression models have not yet been developed, and (what is equally important) no software exists for implementing them. Furthermore, such methods are not a panacea. Nevertheless, once such methods are developed, future researchers may apply them to UmpScorecards data to attempt to better understand the effects of umpire age or experience, adjusted for the other, on umpire performance. Additional years of data will likely help tease out these adjusted effects by reducing the variability of the estimated slope coefficients for age and experience in the model with both regressors.

Several readers and reviewers of earlier versions of this article inquired as to why we took our units of analysis to be the average accuracy and consistency metrics over all games called by a given umpire within a given year, rather than the game-level versions of these metrics, and why we analyzed the data using beta, rather than logistic, regression methods. Regarding the first question, game-level metrics are so highly variable from game to game that recovering a statistically significant age or experience effect in a regression analysis with one of those metrics as the dependent variable is hopeless. Averages over games within an umpire-year combination are much less variable (compare Figure 5 to Figure 4, taking account of the difference in scale) because the effects of a myriad of possible transient factors (weather, day game versus night game, fatigue from travel, pitcher handedness, and framing ability of the catcher to name just a few) are averaged over, allowing attention to be focused on factors that remain constant over the entire season, such as age and experience (as we defined them). Regarding the second question, the metrics provided by UmpScorecards are mere proportions; the numerator and denominator (number of “trials”) of the proportion are not provided, a situation for which beta regression is generally recommended over logistic regression.36,37
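For readers unfamiliar with beta regression, the following is a minimal sketch of a weighted beta log-likelihood in the mean-precision parameterization of Ferrari and Cribari-Neto, with a logit-linear mean in a single covariate. It is not the model fitted in this article: it omits the umpire random effects and year indicators, the choice of games called as weights is an assumption made for illustration, and the data and parameter values are hypothetical.

```python
import math

def beta_logpdf(y, mu, phi):
    """Log-density of Beta(mu*phi, (1-mu)*phi) at y, for 0 < y < 1 and 0 < mu < 1."""
    a, b = mu * phi, (1.0 - mu) * phi
    return (math.lgamma(phi) - math.lgamma(a) - math.lgamma(b)
            + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y))

def weighted_loglik(beta0, beta1, phi, xs, ys, weights):
    """Weighted log-likelihood with a logit-linear mean in one covariate."""
    total = 0.0
    for x, y, w in zip(xs, ys, weights):
        mu = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))  # inverse logit
        total += w * beta_logpdf(y, mu, phi)
    return total

# Hypothetical umpire-year averages: age, accuracy, and games called (weight).
ages = [30, 45, 60]
acc = [0.935, 0.928, 0.921]
games = [28, 31, 30]
print(weighted_loglik(2.85, -0.0066, 500.0, ages, acc, games))
```

The likelihood needs only the observed proportion for each umpire-year, not the underlying counts of correct calls and called pitches, which is why this family of models suits data supplied as proportions alone.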

It is worth noting that at the time of this writing, Major League Baseball is on the verge of implementing an automated system for calling balls and strikes (“robo-umps”).38 Currently, it appears most likely that the system that is implemented will be something of a “hybrid,” involving human umpires supplemented by a challenge procedure in which robo-ump calls can override a very limited number of umpire-called pitches (two or three for each team per game, not accounting for retention if the challenge is successful). The effects of age and experience on human umpire performance do not bear directly on the decision on whether and when this system should be implemented, but they do suggest that after the system is implemented and as older and more experienced umpires retire, fewer calls will be overturned, at least in the near term. 

DALE ZIMMERMAN is Professor, Department of Statistics and Actuarial Science, University of Iowa. He received his PhD in Statistics from Iowa State University in 1986. He is a Fellow of both the American Statistical Association and the Institute of Mathematical Statistics, and in 2007 he received the Distinguished Achievement Award from the Section on Statistics and the Environment of the American Statistical Association. His research interests include spatial statistics, longitudinal data analysis, multivariate analysis, mixed linear models, environmental statistics, and sports statistics. He has authored or co-authored six books and more than 100 articles in journals such as Biometrics, Biometrika, Environmetrics, Journal of the Royal Statistical Society (Series B), Annals of Applied Statistics, Journal of Quantitative Analysis in Sports, and Journal of Sports Analytics. At the University of Iowa he teaches courses in spatial and environmental statistics, linear models, experimental design, and sports statistics.

CHENYANG LI is a PhD candidate in Statistics at the University of Iowa. His research focuses on longitudinal discrete data and stochastic process modeling. He is particularly interested in statistical methodology for dependent count data and related applications.

RILEY POST is a senior water resources engineer at HDR Inc. in Des Moines, Iowa. He is a licensed professional engineer and holds a PhD in civil engineering from the University of Iowa. His research interests focus on statistical hydrology, extreme rainfall events, and flood management. He is a former baseball player and umpire and current baseball fan. He dabbles in sports statistics when the opportunity presents itself.

 

Notes

1. Alan Nevill, Nigel Balmer, and Mark Williams, “The Influence of Crowd Noise and Experience Upon Refereeing Decisions in Football.” Psychology of Sport and Exercise 3 (2002): 261–72.

2. Sean L. Corrigan, Dan B. Dwyer, Briana Harvey, and Paul B. Gastin, “The Influence of Match Characteristics and Experience on Decision-Making Performance in AFL Umpires,” Journal of Science and Medicine in Sport 22 (2019): 112–16.

3. Cansel Arslanoglu, Erol Dogan, and Kursat Acar, “Investigation of Decision Making and Thinking Styles of Volleyball Referees in Terms of Some Variables,” Journal of Education and Training Studies 6 (2018): 21–28.

4. Aydin Karacam and Niyazi Sidki Adiguzel, “Examining the Relationship Between Referee Performance and Self-Efficacy,” European Journal of Educational Research 8 (2019): 377–82.

5. Ivan Belcic, “Does Age, Experience and Body Fat Have an Influence on the Performance of Handball Referees?,” Applied Sciences 12 (2022): 9399.

6. John Walsh, “The Compassionate Umpire,” The Hardball Times, April 7, 2010. http://www.hardballtimes.com/the-compassionate-umpire/. Accessed June 27, 2025.

7. Max Marchi and Jim Albert, Analyzing Baseball Data with R (Boca Raton, FL: CRC Press, 2014).

8. Etan Green and David P. Daniels, What Does it Take to Call a Strike?: Three Biases in Umpire Decision Making, MIT Sloan Sports Analytics Conference, March 1, 2014. https://web.archive.org/web/20140308013335/https://www.sloansportsconference.com/wp-content/uploads/2014/02/2014_SSAC_What-does-it-Take-to-Call-a-Strike.pdf.

9. Matthew Carruth, “The Strike Zone,” SB Nation, October 29, 2012. https://www.lookoutlanding.com/2012/10/29/3561060/the-strike-zone. Accessed June 27, 2025.

10. Dale L. Zimmerman, Jun Tang, and Rui Huang, “Outline Analyses of the Called Strike Zone in Major League Baseball,” Annals of Applied Statistics 13 (2019): 2416–51.

11. Christopher A. Parsons, Johan Sulaeman, Michael C. Yates, and Daniel S. Hamermesh, “Strike Three: Discrimination, Incentives, and Evaluation,” American Economic Review 101 (2011): 1410–35.

12. Scott Tainsky, Brian M. Mills, and Jason A. Winfree, “Further Examination of Potential Discrimination Among MLB Umpires,” Journal of Sports Economics 16 (2015): 353–74.

13. Jerry W. Kim and Brayden G. King, “Seeing Stars: Matthew Effects and Status Bias in Major League Baseball Umpiring,” Management Science 60 (2014): 2619–44.

14. Mike Hsu, “Umpire Home Bias in Major League Baseball,” Journal of Sports Economics 25 (2024): 423–42.

15. Eric Fesselmeyer, “The Impact of Temperature on Labor Quality: Umpire Accuracy in Major League Baseball,” Southern Economic Journal 88 (2021): 545–67.

16. Brian M. Mills, “Technological Innovations in Monitoring and Evaluation: Evidence of Performance Impacts among Major League Baseball Umpires,” Labour Economics 46 (2017): 189–99.

17. Zimmerman, Tang, and Huang, 2416–51.

18. Riley Post, Jun Tang, and Dale L. Zimmerman, “On the Evolution of the Accuracy, Within-Game Consistency, and Geometry of the Called Strike Zone in Major League Baseball from 2008–2023,” Journal of Sports Analytics 11 (2025): 1–18. https://doi.org/10.1177/22150218251389237.

19. Mills, 189–99.

20. Kevin S. Flannagan, Brian M. Mills, and Robert L. Goldstone, “The Psychophysics of Home Plate Umpire Calls,” Scientific Reports 14 (2024): 2735.

21. Mike Fast, “What the Heck is PITCHf/x?,” The Hardball Times Baseball Annual, 2010. http://baseball.physics.illinois.edu/fastpfxguide.pdf. Accessed June 27, 2025.

22. Glenn Healey and Shiyuan Zhao, “Using PITCHf/x to Model the Dependence of Strikeout Rate on the Predictability of Pitch Sequences,” Journal of Sports Analytics 3 (2017): 93–101.

23. Marcos Lage, Jorge Piazentin Ono, Daniel Cervone, Justin Chiang, Carlos Dietrich, and Claudio T. Silva, “StatCast Dashboard: Exploration of Spatiotemporal Baseball Data,” IEEE Computer Graphics and Applications 36 (2016): 28–37.

24. Glenn Healey, “The New Moneyball: How Ballpark Sensors are Changing Baseball,” Proceedings of the IEEE 105 (2017): 1999–2002.

25. Lage, Ono, Cervone, Chiang, Dietrich, and Silva, 28–37.

26. David J. Hunter, “New Metrics for Evaluating Home Plate Umpire Consistency and Accuracy,” Journal of Quantitative Analysis in Sports 14 (2018): 159–72.

27. Sahadev Sharma, “Joe Maddon Wants Robot Umps, or Maybe Just More Consistent Human Ones,” The Athletic, May 11, 2017. https://www.nytimes.com/athletic/59634/2017/05/11/joe-maddon-wants-robot-umps-or-maybe-just-more-consistent-human-ones/. Accessed June 27, 2025.

28. Ethan Singer, “The Effect of Umpires on Baseball: Umpire Runs Created (uRC),” FanGraphs, April 30, 2020. https://community.fangraphs.com/the-effect-of-umpires-on-baseball-umpire-runs-created-urc/. Accessed June 27, 2025.

29. Mills, 189–99.

30. Flannagan, Mills, and Goldstone, 2735.

31. Silvia Ferrari and Francisco Cribari-Neto, “Beta Regression for Modelling Rates and Proportions,” Journal of Applied Statistics 31 (2004): 799–815.

32. Jorge I. Figueroa-Zuniga, Reinaldo B. Arellano-Valle, and Silvia L. P. Ferrari, “Mixed Beta Regression: A Bayesian Perspective,” Computational Statistics and Data Analysis 61 (2013): 137–47.

33. Mills, 189–99.

34. Flannagan, Mills, and Goldstone, 2735.

35. Benjamin Hoffman, “Umpire Suspended for Blown Call,” New York Times, May 10, 2013.

36. Ferrari and Cribari-Neto, 799–815.

37. Michael Smithson and Jay Verkuilen, “A Better Lemon Squeezer? Maximum-Likelihood Regression with Beta-Distributed Dependent Variables,” Psychological Methods 11 (2006): 54–71.

38. Jesse Rogers, “When, How Will Robot Umps Arrive in MLB? Latest on ABS Plans,” ESPN, June 18, 2024. https://www.espn.com/mlb/story/_/id/40377683/mlb-robot-umpires-automated-balls-strikes-challenge-system-umps-majors. Accessed June 27, 2025.

© 2026 SABR. All Rights Reserved.