On the Association of Umpire Performance with Age and Experience in MLB
This article was written by Riley Post - Chenyang Li - Dale Zimmerman
This article was published in Spring 2026 Baseball Research Journal
Perhaps no sport relies on the accuracy and consistency of its officials more than baseball, where the home plate umpire calls a ball or strike on every pitch not swung at by the batter. The relatively sedentary nature of the home plate umpire’s duties compared to officials in other major sports allows individuals to perform this role to a more advanced age and to accrue relatively greater experience. Here, we investigate the associations of two metrics of home plate umpire performance with umpire age and experience using game-level data from 2015 to 2023 provided by the StatCast pitch-tracking system. The bounded continuous nature of these metrics, their high game-to-game variability, their correlation over years within umpires, and the high correlation between umpire age and experience make this task quite challenging. We use mixed-effects weighted beta regression methodology to address these challenges. We find that after adjusting for year-to-year changes in aggregate umpire performance from 2015 to 2023, accuracy and consistency were negatively associated with umpire age and experience. That is, older and more experienced umpires performed worse than their younger and less experienced counterparts. These negative associations are statistically and practically significant and stand in stark contrast to the positive associations of referee performance with age and experience observed in other sports.
BACKGROUND
Referees, umpires, and other duly appointed officials play an important adjudicatory role in nearly every major sporting contest. Consequently, the in-game performance of referees is of great interest to sports leagues and fans, and even to researchers. Several of the latter have conducted interesting investigations of referee performance and the various factors that may affect it. These studies span a range of sports and research questions. Some investigate the effects of the home team’s crowd on the referee’s contributions to the so-called home advantage, while others study the influence of game situations, or the impact of referees’ age and experience, on referees’ overall performance. In regard to the latter, one study considered the effects of referee experience on referee performance in association football (English soccer) and found that experience, at least up to a certain level, is positively associated with the number of fouls called against players on the home team, suggesting that more experienced referees tend to be less influenced by the game’s spectators.1 Another investigation of over 8000 decisions made by a group of 32 Australian Football League umpires found that while match characteristics (e.g., match location and match attendance) had little effect on umpire error rates, more experienced umpires had lower error rates than those with less experience.2 Additional studies have reported generally positive effects of age and experience on referee performance in volleyball, basketball, and handball.3,4,5
Baseball is perhaps the most thoroughly studied sport of all, and one of its referees, the home-plate umpire, must make and announce more than a hundred decisions known as pitch calls (either balls or strikes) during each game. Hence it is not surprising that the performance of home-plate umpires and the factors that may affect it have been intensely scrutinized. An important metric of umpire performance is “accuracy,” which is the percentage of pitch calls that are correct, in reference to the rule-book strike zone (RBSZ). One factor that has been shown to be extremely influential on umpire accuracy is the “count,” i.e., the cumulative number of called balls and the cumulative number of strikes (called or swinging) over all pitches received by the batter in the current plate appearance, immediately prior to the current pitch.6,7,8 Another is the handedness of the batter, as it seems that until recently there was a substantial shift of the called strike zone to the left of the RBSZ, and thus also a decline in umpire accuracy, for a left-handed batter relative to a right-handed batter.9,10 Some evidence has been found for racial discrimination in umpires’ pitch calls, with strikes more likely to be called when the umpire and pitcher match races, although additional investigation, using more extensive data, was unable to replicate this finding.11,12 It has also been shown that umpires tend to grant a larger strike zone to high-status pitchers (All-Stars) than to other pitchers.13 A home advantage in umpire pitch calls was documented, with home-team batters receiving more called balls on actual balls and fewer called strikes on actual strikes.14 Increasing air temperature (especially temperatures exceeding 95°F) and increasing levels of air pollution (especially carbon monoxide) were found to have significant negative effects on umpire accuracy, suggesting that there may be other, rather more obscure factors affecting umpire performance that remain to be discovered.15 Other work has documented continual improvement in umpire accuracy from 2008 to 2023, a period during which technology to monitor ball-strike calls, together with an evaluation and feedback system for Major League umpires, was implemented.16,17,18
Surprisingly, relatively few published studies exist of the effects of age or experience on umpire performance in baseball, especially since home plate officiating is different from officiating in many sports in that it requires very little movement on the part of the officials. This results in baseball umpires being capable of performing the duties of their position for longer, and to a more advanced age. We are aware of two relevant studies. The first, using data from 2009–14, found that less experienced umpires improved more over time than those with more experience.19 The second, using data from the slightly broader period 2008–15, showed that younger umpires tended to improve more over this period than older umpires.20 But these studies focused on the effects of age or experience on umpire improvement over time rather than on time-adjusted comparisons of umpire performance. It would be entirely possible, for example, for less experienced umpires to perform worse than those with more experience, even if the former improve faster. Furthermore, those previous studies considered how umpire performance is associated with either age or experience marginally (individually), not jointly (simultaneously). The age and experience of MLB umpires are highly positively correlated: it is shown herein, for example, that among the 129 MLB umpires who called at least one game between 2015 and 2023, this correlation was 0.96. Consequently, an effect on performance that is attributed to age could actually be attributable to experience or vice versa, or even to some combination of the two. Moreover, it is possible that the association of performance with age, adjusted for experience, is quite different than the association of performance with age marginally. Likewise, it is possible that the association of performance with experience, adjusted for age, is different than the association of performance with experience marginally. 
The main objective of this article is to investigate whether, and how, MLB umpire performance is associated with umpire age and experience, both marginally and jointly.
We measure umpire performance using two specific measures provided in a database known as “UmpScorecards,” the source of which is described in the next section. The two performance measures are: 1) accuracy of called pitches, relative to the RBSZ, and 2) consistency of called balls and strikes, regardless of whether the pitches were actually called in accordance with the RBSZ. Using mixed-effects weighted beta regression methods, we determine how, and to what extent, these performance measures, adjusted for year, are associated with age, experience, and a third, related variable: the age at which the umpire was hired.
Before analyzing the data, it is difficult to predict the nature of any association between umpire performance and age or experience. On one hand, evidence from other sports shows a positive relationship between these variables, which might suggest a similar pattern in baseball. On the other hand, when an automated feedback system was introduced in 2006, some umpires—many of whom are still active—were initially resistant to incorporating its feedback and may still be so to some extent. In addition, it is reasonable to expect that advanced age could eventually affect visual acuity and stamina. These considerations point toward the possibility of a negative association. Given these competing possibilities, we do not make any prior assumptions about the direction of the relationship between umpire performance and age or experience. Instead, we test a null hypothesis of no association against a two-sided alternative, allowing the data to determine whether any observed relationship is positive or negative.
DATA
The use of pitch tracking technology has revolutionized analytics in baseball. Beginning in a single Major League Baseball (MLB) stadium in 2006, and then expanding to a league-wide rollout prior to the 2008 season, the PITCHf/x camera system, developed by Sportvision, used a pair of cameras to track pitch-related data including trajectory, speed, break, and location. These values were provided to broadcasters and MLB in real time before being released to the public.21,22 In 2015, the StatCast system added optical cameras in an attempt to capture all in-game actions not already collected by PITCHf/x (e.g., player position and ground ball location and speed).23 PITCHf/x was deprecated following the 2016 season in favor of “TrackMan,” which is based on phased-array Doppler radar measurements collected via a sensor mounted above home plate.24 The use of Doppler radar allows for better estimation of the variables associated with the flight of the baseball including back spin, side spin, and speed.25 The data collected by the StatCast system, including data while a ball is in flight measured via TrackMan, are released to the public following each game.
In this study, we use StatCast data from 2015–23 (i.e., PITCHf/x for the 2015 and 2016 seasons, TrackMan for 2017–19, and Hawk-Eye for 2020–23) compiled by Umpire Scorecards (umpscorecards.com; UmpScorecards hereafter). UmpScorecards uses the daily release of data from the StatCast system to quantify the accuracy and consistency of each individual home plate umpire from the previous day’s MLB games. These values are defined by UmpScorecards as follows.
Accuracy represents the proportion of taken pitches called correctly over a game. UmpScorecards uses the pitch location provided by StatCast, the estimated RBSZ based on the unique attributes of a given batter, and an algorithm to determine the likelihood that a given pitch was a strike based on Monte Carlo simulation of a pitch’s potential true location (i.e., accounting for error in the location estimate provided by TrackMan). Using 500 potential true location values, a pitch is determined to be incorrectly called if 1) the probability that the pitch was in fact a strike is over 90% and the umpire called it a ball or 2) the probability that the pitch was in fact a ball is over 90% and the umpire called it a strike.
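The classification rule just described can be illustrated with a minimal sketch. It is not the UmpScorecards implementation: the Gaussian measurement-error model, the error standard deviation `sigma`, the rectangular zone, and all function names are assumptions made here for illustration, and the ball's radius is ignored. Only the 500 draws and the 90% threshold come from the description above.

```python
import random


def strike_probability(px, pz, zone, rng, sigma=0.04, n_draws=500):
    """Estimate the probability that a pitch was truly a strike.

    px, pz : measured horizontal and vertical location (feet)
    zone   : (left, right, bottom, top) edges of this batter's zone (feet)
    sigma  : ASSUMED standard deviation of location error (hypothetical value)
    """
    left, right, bottom, top = zone
    inside = 0
    for _ in range(n_draws):
        # Perturb the measured location to get one potential true location.
        x = rng.gauss(px, sigma)
        z = rng.gauss(pz, sigma)
        if left <= x <= right and bottom <= z <= top:
            inside += 1
    return inside / n_draws


def call_is_incorrect(p_strike, called_strike, threshold=0.90):
    """A call is counted as wrong only when the evidence against it exceeds 90%."""
    if called_strike:
        return (1.0 - p_strike) > threshold
    return p_strike > threshold


def game_accuracy(pitches, seed=0):
    """pitches: list of (px, pz, zone, called_strike) for one game's taken pitches."""
    rng = random.Random(seed)
    wrong = 0
    for px, pz, zone, called_strike in pitches:
        p = strike_probability(px, pz, zone, rng)
        wrong += call_is_incorrect(p, called_strike)
    return 1.0 - wrong / len(pitches)
```

Under this rule a borderline pitch, whose strike probability lies between 10% and 90%, is never counted as an incorrect call regardless of how it was called.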
Consistency represents the proportion of taken pitches considered consistent with the umpire’s established zone (EUZ) in any given game.26 The boundary of the EUZ is defined by the area where a given pitch has a greater than 50% chance of being called a strike by that umpire. Therefore, a pitch is consistent if 1) the pitch fell within the EUZ and was called a strike or 2) it fell outside of the EUZ and was called a ball. Consistency is a relevant additional measure of performance because it seems to be the case that baseball players, managers, and fans will tolerate some level of inaccuracy from an umpire as long as his calls are consistent throughout the course of a game.27
Accuracy and consistency are proportions, hence bounded within [0,1]. The statistical methods we use herein to analyze the effects of age and experience on these metrics will properly account for their bounded support.
UmpScorecards provides accuracy and consistency on a game-by-game basis for each home plate umpire. We used the data from years 2015–23, for which there were 18,682 games played. For our analyses, we supplemented these data with umpire-specific information, namely the age and years of experience at the MLB level of the umpire as of the opening day of the given season. This additional information is not provided by StatCast or UmpScorecards but was obtained from Retrosheet. From the age (in years) and experience (in years) of each umpire, we obtained his age-at-hiring by forming the difference, age minus experience. The ages, years of experience, and ages at hiring of the umpires in our study ranged from 27 to 69, 0 to 45, and 24 to 41, respectively. UmpScorecards provides a third umpire performance metric called favor, which represents the effect of an umpire on the absolute difference in run expectancies of incorrect versus correct calls for the two teams. A more thorough explanation of this metric can be found at umpscorecards.com or elsewhere.28 We considered this metric in our study initially, but found that its association with umpire age and experience was much weaker than those of the other two metrics with age and experience, so we opted not to include it herein.
EXPLORATORY DATA ANALYSES
Prior to applying formal inferential methods to address our research objective, we carry out some exploratory data analysis to gain familiarity with the data and identify its major features. The features discovered will inform our subsequent formal statistical analysis.
Figure 1 displays plots of annual averages of accuracy and consistency versus year over the 9-year study period, 2015–23. Averages were taken over all games in a given year. Vertical bars extending from the 5th to the 95th percentiles of a metric in each year are also shown to convey dispersion.
Figure 1. Measures of center and dispersion for UmpScorecards metrics accuracy and consistency by year, over the period 2015–23. Accuracy and consistency, which are proportions, are expressed as percentages in this plot. The closed circle represents the average value of a metric in a given year, and the vertical bar for a given year extends from 5th to the 95th sample percentiles of the metric in that year.
The plots reveal that umpires improved from 2015 to 2023 in both accuracy and consistency, which comports with previously published reports of improvements in accuracy over the earlier time periods 2009–14 and 2008–15.29,30 They also show that the improvements from 2016 to 2017, the year the switch was made from PITCHf/x to TrackMan, were of similar magnitude to improvements in some other years.
Because we wish to explore the relationship between each performance metric and age and experience, free of the confounding effect of year, we subtract the annual averages shown in Figure 1 from the performance metrics for all subsequent exploratory data analyses. The resulting metrics are referred to as year-adjusted metrics. The interpercentile ranges between the 5th and 95th percentiles of the metrics vary somewhat over years, tending to decline over time but only slightly; they are about 20% smaller in 2023 than they were in 2015. The interpercentile ranges also reveal a slight amount of left skewness, which is to be expected considering the upper bound of 1.0 (100% in the figure).
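As a minimal sketch of the year adjustment described above, the following subtracts each year's average from every game-level value recorded in that year. The function name and the `(year, value)` record layout are hypothetical; the article does not describe the authors' actual code.

```python
from collections import defaultdict


def year_adjust(records):
    """Subtract the annual average from each game-level metric.

    records : list of (year, value) pairs, one per game
    returns : list of year-adjusted values, in the same order
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for year, value in records:
        totals[year] += value
        counts[year] += 1
    # Annual averages are taken over all games in a given year.
    means = {year: totals[year] / counts[year] for year in totals}
    return [value - means[year] for year, value in records]
```

By construction the adjusted values average to zero within each year, so a plot of them against age or experience no longer reflects the league-wide improvement over time.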
Accuracy and consistency, though distinct as measures of performance, could be associated with one another. The scatterplot of the two year-adjusted metrics, shown in Figure 2, investigates this issue. The unit of observation in the scatterplot is a game. The plot shows that accuracy and consistency are positively associated. The strength of the linear association, as measured by Pearson’s correlation coefficient, is 0.52, which is highly statistically significant.
Figure 2. Scatter plot showing the relationship between year-adjusted umpire accuracy and consistency. The unit of observation is a game.
Since accuracy and consistency are observed repeatedly over a lengthy period, it is possible that one or both is correlated over time within umpires. To examine this possibility, we computed the pairwise correlations, for each pair of years, of year-adjusted metrics averaged over all games called by an umpire in a year. These correlations are listed in Table 1.
Table 1. Sample pairwise correlations between years for (a) accuracy, and (b) consistency.
The correlations in Table 1 indicate that there is substantial positive temporal correlation in umpire accuracy, and somewhat less (though still appreciable) positive temporal correlation in consistency. Furthermore, the temporal correlation does not appear to attenuate with elapsed time as it would under an autoregressive model but instead fluctuates around a constant. The average pairwise correlation among accuracy measurements is 0.49, while that among consistency measurements is 0.22.
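For readers who wish to reproduce a between-year correlation of this kind, a minimal sketch follows. It assumes that each correlation is computed over the umpires who called games in both years of the pair (pairwise-complete observations); the article does not state how umpires absent in one of the two years were handled, and the function names and dictionary layout are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def between_year_correlation(umpire_year_means, year_a, year_b):
    """Correlate umpires' year-adjusted averages in year_a with those in year_b.

    umpire_year_means : dict mapping (umpire, year) to an average metric
    Only umpires observed in BOTH years contribute (an assumption).
    """
    common = sorted(
        umpire
        for (umpire, year) in umpire_year_means
        if year == year_a and (umpire, year_b) in umpire_year_means
    )
    xs = [umpire_year_means[(u, year_a)] for u in common]
    ys = [umpire_year_means[(u, year_b)] for u in common]
    return pearson(xs, ys)
```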
Figure 3. Matrix of scatter plots showing the relationships between age, experience, and age-at-hiring. Correlations between each pair of variables are given above the main diagonal. The number of asterisks attached to a correlation, say a, indicates that the correlation is significantly different from 0 at the 10^(−a) level of significance.
Turning our attention from the performance metrics to age and experience, recall that these variables are highly positively correlated (Pearson’s r=0.96). Figure 3 is a matrix scatterplot of age, experience, and age-at-hiring. The unit of observation in this scatterplot is an umpire-year combination, of which there are 833 in the dataset. The middle plot in the first column shows the strong linear relationship between age and experience. This relationship is due partly to a shared one-year increment of these variables with each passing year and partly to the fact that most umpires are hired in their late twenties or early thirties, so that there is relatively little dispersion in experience among umpires of any given age.
Such a strong correlation between variables that we wish to use as regressors is unfortunate and leads us to consider not only age and experience, but also the derived variable age-at-hiring, as regressors. Figure 3 shows that the correlations between age-at-hiring and age, and age-at-hiring and experience, are still statistically significant, but much smaller in magnitude than the correlation between age and experience. The signs of the correlations between age-at-hiring and the other two variables are no surprise, matching as they do the signs of the coefficients on age and experience in the definition of age-at-hiring. Note that at most two of these three variables may be used as regressors in any regression analysis, because any one of them is an exact linear combination of the other two (perfect collinearity).
Figures 4 and 5 (below) consist of bivariate scatterplots of both year-adjusted performance metrics versus umpire age, umpire experience, or umpire age-at-hiring. Each point plotted in Figure 4 represents a metric for an individual game, while a point plotted in Figure 5 represents an average metric over all games called by an umpire in a given year. Note that the scale along the vertical axis differs in the two figures. These two figures suggest that the marginal associations between year-adjusted accuracy and both age and experience are negative, and the same is true of the marginal associations between year-adjusted consistency and both age and experience. In contrast, the marginal associations between both metrics and age-at-hiring appear to be positive, though not strongly so. Each scatterplot reveals considerable game-to-game variability (noise) in the performance metrics, which is reduced substantially when the metric is averaged over all games called by an umpire in a given year (Figure 5). Because of this, and because our regressors (age, experience, and age-at-hiring) do not change within a given year, we take an umpire-year combination, rather than a game, as the basic unit of observation for our more formal analysis described in the next section. As noted previously, there are 833 such units in the dataset.
Figure 4. Bivariate scatterplots depicting the relationships of year-adjusted accuracy and consistency with umpire age, experience, and age-at-hiring. Each point plotted corresponds to a single game. Ordinary least squares lines are superimposed to indicate overall trend.
Figure 5. Bivariate scatterplots depicting the relationships of year-adjusted accuracy and consistency with umpire age, experience, and age-at-hiring. Each point plotted represents the average value of a metric over all games called by an individual umpire in a year. The fitted linear mean model and fitted quadratic mean model from beta regressions are superimposed as two separate lines.
STATISTICAL ANALYSIS
To more quantitatively address the relationships of umpire performance with age, experience, and age-at-hiring, we carried out various regression analyses, as follows. The dependent variable for these regressions was accuracy or consistency, computed by averaging over all games called by an umpire during a given year. These (averaged) metrics have bounded continuous support in (0,1). To account properly for this support, we used beta regression methods.31 Because the number of games each umpire called in any given year varied considerably, the performance metrics varied in reliability across umpire-years, and we therefore weighted each umpire’s performance metric for a given year by the number of games he called in that year. The regressors in these weighted beta regressions included indicator variables for year (to account for the improvement over years noted in our exploratory data analysis), plus one or two of the variables age, experience, and age-at-hiring. Furthermore, to account for the substantial but non-attenuating temporal correlation within umpires discovered in our exploratory data analysis, we included random umpire effects, which are assumed to be independently and identically normally distributed with mean zero and variance $\sigma_u^2$.
Thus, the model fitted to the observed values $\{y_{ij}\}$, where $y_{ij}$ is either the accuracy or consistency of umpire $i$ in year $j$, is specified, in part, by the beta probability density function

$$f(y_{ij};\mu_{ij},\phi)=\frac{\Gamma(\phi w_{ij})}{\Gamma(\mu_{ij}\phi w_{ij})\,\Gamma\big((1-\mu_{ij})\phi w_{ij}\big)}\;y_{ij}^{\mu_{ij}\phi w_{ij}-1}\,(1-y_{ij})^{(1-\mu_{ij})\phi w_{ij}-1},\qquad 0<y_{ij}<1,$$

where $\Gamma(\cdot)$ denotes the gamma function; $\phi$ is a precision (inverse dispersion) parameter; $w_{ij}$ is the number of games called by umpire $i$ in year $j$; $\mu_{ij}=E(y_{ij})$ and $\mathrm{Var}(y_{ij})=\mu_{ij}(1-\mu_{ij})/(1+\phi w_{ij})$. The remaining part of the model specification stipulates that $\mu_{ij}$ is linked to the regressors and random umpire effects via a logit link function:

$$\log\!\left(\frac{\mu_{ij}}{1-\mu_{ij}}\right)=x_{ij}^{T}\beta+u_i,$$

where $x_{ij}^{T}=(x_{ij1},\ldots,x_{ijk})$ is a vector of $k$ regressors; $\beta=(\beta_1,\ldots,\beta_k)^{T}$ is a $k$-vector of fixed unknown regression coefficients (which in our case consists of year effects and one or two of the slope coefficients on age, experience, and age-at-hiring); and $u_i$ is the random effect of the umpire. This model is a mixed-effects variant of a weighted beta regression model.32
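To make the model specification concrete, the sketch below evaluates the log-likelihood of a set of umpire-year observations conditional on given random umpire effects, using a beta density with mean μ and precision ϕw together with the logit link. It is an illustration only, not the authors' fitting code: the function names and argument layout are hypothetical, and a full maximum likelihood fit would additionally integrate over the normally distributed random effects, which this sketch does not attempt.

```python
import math


def inv_logit(eta):
    """Inverse of the logit link: maps a linear predictor to a mean in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def beta_logpdf(y, mu, precision):
    """Log density of a beta distribution parameterized by mean and precision.

    Shape parameters are a = mu * precision and b = (1 - mu) * precision,
    which gives variance mu * (1 - mu) / (1 + precision).
    """
    a = mu * precision
    b = (1.0 - mu) * precision
    return (
        math.lgamma(precision) - math.lgamma(a) - math.lgamma(b)
        + (a - 1.0) * math.log(y)
        + (b - 1.0) * math.log(1.0 - y)
    )


def conditional_loglik(y, x, w, umpire, beta, phi, u):
    """Log-likelihood of all umpire-years GIVEN the random effects.

    y      : observed metrics, each in (0, 1)
    x      : regressor vectors (year indicators plus age, experience, etc.)
    w      : number of games called in each umpire-year (the weights)
    umpire : umpire identifier for each observation
    beta   : fixed-effect coefficients
    phi    : precision parameter
    u      : dict mapping umpire identifier to its random effect
    """
    total = 0.0
    for y_ij, x_ij, w_ij, i in zip(y, x, w, umpire):
        eta = sum(b * v for b, v in zip(beta, x_ij)) + u[i]
        total += beta_logpdf(y_ij, inv_logit(eta), phi * w_ij)
    return total
```

Because the precision is ϕw, an umpire-year based on many games is treated as a more reliable observation than one based on few games.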
Table 2. Results of mixed-effects weighted beta regressions of umpire accuracy on indicator variables for years and one or two of the regressors age, experience (Exp), and age-at-hiring (AAH). The quantities listed are the maximum likelihood estimates of regression parameters and p-values of the tests of the null hypothesis that the parameters are equal to zero. An asterisk indicates that the p-value corresponding to a regression coefficient is less than or equal to 0.001.
Table 3. Results of mixed-effects weighted beta regressions of umpire consistency on indicator variables for years and one or two of the regressors age, experience (Exp), and age-at-hiring (AAH). The quantities listed are the maximum likelihood estimates of regression parameters and p-values of the tests of the null hypothesis that the parameters are equal to zero. An asterisk indicates that the p-value corresponding to a regression coefficient is less than or equal to 0.001.
Marginally, accuracy was negatively and highly significantly associated with both age and experience, and not significantly associated with age-at-hiring (Table 2). The estimated slope coefficient for age in the mixed-effects weighted beta regression of the logit of accuracy on age and indicator variables for year, which was −6.61×10⁻³, translates to a decrease in accuracy for a 60-year-old, compared to a 30-year-old, of 1.7% in 2015 and 1.1% in 2023. Similarly, the estimated slope coefficient for experience in the regression of the logit of accuracy on experience and indicator variables for year, which was −6.31×10⁻³, corresponds to a decrease in accuracy for an umpire with 30 years of experience, compared to one with no experience, that ranges from 1.6% in 2015 to 1.1% in 2023. These decreases, though small, are statistically and practically significant.
In the two-regressor models in which one of the regressors was age-at-hiring, accuracy continued to be negatively and highly significantly associated with both age and experience, and not significantly associated with age-at-hiring. In the model with both regressors age and experience, however, neither regressor had a significant effect on accuracy. The vanishing of significance in this model is largely due to a fourfold increase in the estimated standard errors of the estimated slope coefficients, which in turn is due to the very high correlation between age and experience noted previously. The marginal associations between consistency and age or experience were similar in sign to those just described between accuracy and the same two regressors, but overall about half as steep (Table 3). Specifically, the declines in consistency for the same age and experience comparisons described above for the declines in accuracy were approximately 0.7% in 2015 and 0.6% in 2023. The results of each regression of consistency on two regressors were also similar to those of the regression of accuracy on the same two regressors, only weaker. In particular, in the model with regressors age and experience, neither regressor had a significant effect on consistency.
All of the models described above specify that the logit of the mean of the beta distribution depends linearly on the covariates. To check for the possibility of nonlinear dependence, we performed another analysis for which the logit of the mean included not only a linear term for the covariate (age or experience), but also a quadratic term. It turned out that all four of the fitted quadratic mean functions in the beta regression model (accuracy on age, accuracy on experience, consistency on age, and consistency on experience) were strictly decreasing and strictly concave over the entire range of age or years of experience (Figure 5). Each estimated quadratic-term coefficient was negative and statistically significant, indicating that the rate of decline in performance was smaller for younger and less experienced umpires than for their older and more experienced counterparts. However, the degree of concavity (curvature) in each fitted model was so small that the fitted quadratic mean curves differed very little from the fitted linear means over all but the extreme upper end of the ranges of age and experience. In fact, the curvature is not practically significant since the aforementioned performance comparisons of 30- and 60-year-old umpires under the linear mean model are virtually unaffected. For example, the fitted decrease in accuracy for the latter compared to the former under the quadratic mean model is 1.6% (compared to 1.7%) for 2015 and 1.0% (versus 1.1%) for 2023.
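Because the slope acts on the logit scale, the implied difference in accuracy depends on the baseline level, which is why the same coefficient yields a larger gap in 2015 than in 2023. The sketch below illustrates the conversion. The age slope is the estimate reported above; the two baseline accuracies for a 30-year-old umpire are hypothetical values chosen only for illustration, since the fitted year effects are not reproduced in the text.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))


def accuracy_gap(baseline_accuracy, slope, delta):
    """Drop in the fitted mean when the regressor increases by delta.

    baseline_accuracy : fitted mean at the starting value of the regressor
    slope             : coefficient on the logit scale
    delta             : change in the regressor (e.g., 30 years of age)
    """
    shifted = inv_logit(logit(baseline_accuracy) + slope * delta)
    return baseline_accuracy - shifted


AGE_SLOPE = -6.61e-3  # estimate reported in the article

# HYPOTHETICAL baselines for a 30-year-old umpire in an early and a late year.
for label, baseline in [("lower baseline", 0.91), ("higher baseline", 0.945)]:
    gap = accuracy_gap(baseline, AGE_SLOPE, delta=30)
    print(f"{label}: gap of {100 * gap:.2f} percentage points")
```

The gap shrinks as the baseline approaches 1 because the logistic curve flattens there, so a fixed shift on the logit scale moves the mean less.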
Finally, we also repeated the entire analysis for the subset of only those umpires who began umpiring in 2008 or before. There are 64 umpires in this subset, about half as many as in the complete dataset, and they account for 460 umpire-years. The rationale for considering this subset is that umpires who debuted after 2008 were trained after the implementation of PITCHf/x and received direct feedback from it during their training and could thus be systematically different than those who began in or before 2008. However, the results of the analysis of this subset (not shown) indicated that none of our conclusions are substantially affected by excluding umpires who debuted after 2008.
CONCLUSIONS AND DISCUSSION
In this work, we used data compiled by umpscorecards.com along with demographic information supplied by Retrosheet to evaluate the associations of the Major League Baseball umpire performance metrics accuracy and consistency with umpire age, experience, and age at hiring. A major finding is that accuracy, and to a lesser degree consistency, is negatively associated with both age and experience. Put simply, at any given point in time from 2015 to 2023, older or more experienced umpires were less accurate and less consistent. This is in stark contrast to the generally positive associations found previously between referee performance and age and experience in other sports. We found little to no association between any of the metrics and the age at which the umpire was hired.
Our findings complement those of previous authors, who showed (over earlier, largely non-overlapping periods) that the accuracy of umpires, in aggregate, improved from one season to the next and that younger and less experienced umpires improved more rapidly than those who were older and had more experience.33,34 Our investigation shows that irrespective of performance improvement over time, younger and less experienced umpires performed better in absolute terms than their older and more experienced brethren. The differences in performance due to age or experience were not huge (about 1–2% higher accuracy for a 30-year-old than a 60-year-old, for example), but were practically and statistically significant. Because about 140 pitches are called in a game on average, this difference in accuracy would correspond to about two or three pitches, in a typical game, that the 30-year-old would call correctly that a 60-year-old would not. Over the course of a standard 2430-game season there might be 5000 or so more correct calls made if all umpires were 30 years old than if all umpires were 60 years old.
It is important to note that our findings do not support a conclusion that umpires decline in performance as they get older and accrue more experience. On the contrary, from 2015 to 2023 the mean umpire performance improved about as much for umpires from pre-2000 cohorts as for cohorts who began their careers after 2000. Note that this contrasts with results from previous studies over the period 2008–15 cited herein.
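The arithmetic behind these per-game and per-season figures can be checked with a short sketch. The pitch and game counts are those quoted above; the function name is hypothetical, and the accuracy gap passed in is whatever value within the reported 1–2% range the reader wishes to try.

```python
PITCHES_PER_GAME = 140    # called pitches per game, as quoted in the text
GAMES_PER_SEASON = 2430   # regular-season games, as quoted in the text


def extra_correct_calls(accuracy_gap):
    """Convert an accuracy gap (as a proportion) into additional correct calls.

    Returns (calls per game, calls per season).
    """
    per_game = PITCHES_PER_GAME * accuracy_gap
    per_season = per_game * GAMES_PER_SEASON
    return per_game, per_season


# A gap of 1.5 percentage points, the midpoint of the reported 1-2% range.
per_game, per_season = extra_correct_calls(0.015)
print(f"{per_game:.1f} calls per game, {per_season:.0f} calls per season")
```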
Table 4. Umpire accuracy (%) by cohort in years 2015–23. The sample size (initial number of umpires from cohorts 2016–2023, or the number still umpiring in 2015 for other cohorts) is listed in the last column.
However, as can be seen from Table 4, in any fixed year of our study umpire performance tends to worsen with increasing experience of the cohort. This is particularly so for cohorts in 2015 and after. For example, the average accuracy of rookie umpires from cohorts after 2017 was always greater than 93%, while that of rookie umpires from the 2017 cohort and prior never exceeded 93%, and exceeded 92% only once and that just barely. These results suggest that it is not so much a decline in sensory acuity or a resistance to evaluation that is responsible for the relatively inferior performance of older and more experienced umpires in recent years, but rather an improvement in training that has helped recent cohorts have excellent performance “right off the bat.” As an example of a recent enhancement to training, in 2022, Major League Baseball began running its own Umpire Prospect Development Camp, a month-long program that immerses top prospects in game-speed evaluations that emphasize strike zone accuracy, pitch recognition, and consistency under pressure. Perhaps as a result, it seems that many umpires from recent cohorts are now demonstrating unprecedented levels of ball-and-strike accuracy and zone consistency.
Also relevant to the negative association between umpire performance and experience may be the effects of union representation of MLB umpires provided by the Major League Baseball Umpires Association (MLBUA). Union representation of professional sports officials is not unique to MLB; officials in the National Football League, National Hockey League, and National Basketball Association are also represented by similar organizations. This representation may explain why dismissal of underperforming officials is exceedingly rare (the authors were unable to find an example within the past decade of an MLB umpire being fired for unsatisfactory performance), though suspensions have been issued as a result of questionable officiating. However, these suspensions are rarely related to lack of umpire accuracy and are typically associated with improper application of other rules not related to on-field calls (e.g., the suspension of an umpire for two games for not properly enforcing regulations on pitcher substitutions).35 While removal is rare, umpires are indeed held to performance standards, with only the best performers being asked to officiate more consequential games such as those in the World Series. Nevertheless, union membership may reduce the incentive for performance improvement.
In addition to examining marginal associations between umpire performance and age or experience, we also examined joint associations. However, because age and experience are so highly collinear for umpires, our results obtained by fitting models that included both regressors were, unfortunately, rather uninformative. In this sense, we were unable to achieve one of our objectives, namely, to understand the impacts of both age and experience, adjusted for the other, on umpire performance. In ordinary regression settings, strategies for dealing with multicollinearity include ridge regression and penalized likelihood. Unfortunately, ridge and penalized likelihood methods for mixed-effects beta regression models have not yet been developed, and (what is equally important) no software exists for implementing them. Furthermore, such methods are not a panacea. Nevertheless, once such methods are developed, future researchers may apply them to UmpScorecards data to attempt to better understand the effects of umpire age or experience, adjusted for the other, on umpire performance. Additional years of data will likely help tease out these adjusted effects by reducing the variability of the estimated slope coefficients for age and experience in the model with both regressors.
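To illustrate why collinearity of this kind makes a two-regressor model uninformative, the following sketch simulates hypothetical umpires whose age is simply debut age plus experience. Everything in it is an assumption made for illustration (the debut-age distribution, the range of experience, the sample size, and the helper `pearson_r`); none of it is the authors' data or model. It uses the standard result for ordinary linear regression with two regressors, in which the sampling variance of each slope is inflated by the factor 1/(1 − r²), where r is the correlation between the regressors; the mixed-effects beta regression used in this article is more complicated, so this is an analogy rather than a reproduction.

```python
import math
import random

random.seed(2015)  # fixed seed so the illustration is reproducible

# Hypothetical umpires: debut ages cluster near 30, experience spans 0 to 30 years.
n_umpires = 500
experience = [random.uniform(0, 30) for _ in range(n_umpires)]
debut_age = [random.gauss(30, 2) for _ in range(n_umpires)]
age = [d + e for d, e in zip(debut_age, experience)]


def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(age, experience)
# With two regressors in ordinary least squares, each slope's sampling variance
# is multiplied by 1 / (1 - r^2) relative to the case of uncorrelated regressors.
vif = 1.0 / (1.0 - r ** 2)
print(f"correlation = {r:.3f}, variance inflation factor = {vif:.1f}")
```

When nearly all of the variation in age comes from variation in experience, as in this simulated setting, the inflation factor is large, which is one way to see why the estimated slopes in the joint model were too imprecise to interpret.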
Several readers and reviewers of earlier versions of this article inquired as to why we took our units of analysis to be the average accuracy and consistency metrics over all games called by a given umpire within a given year, rather than the game-level versions of these metrics, and why we analyzed the data using beta, rather than logistic, regression methods. Regarding the first question, game-level metrics are so highly variable from game to game that recovering a statistically significant age or experience effect in a regression analysis with one of those metrics as the dependent variable is hopeless. Averages over games within an umpire-year combination are much less variable (compare Figure 5 to Figure 4, taking account of the difference in scale) because the effects of a myriad of possible transient factors (weather, day game versus night game, fatigue from travel, pitcher handedness, and framing ability of the catcher, to name just a few) are averaged over, allowing attention to be focused on factors that remain constant over the entire season, such as age and experience (as we defined them). Regarding the second question, the metrics provided by UmpScorecards are mere proportions; the numerator and denominator (number of “trials”) of the proportion are not provided, a situation for which beta regression is generally recommended over logistic regression.36,37
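The reduction in variability obtained by averaging can be illustrated with a small simulation. The sketch below is hypothetical throughout: the beta parameters (chosen to give a mean accuracy of 0.93), the figure of 30 home plate assignments per umpire-year, and the number of simulated umpire-years are assumptions made for illustration, and every simulated umpire is identical, so all of the variation is game-to-game noise. Under independence, the standard deviation of a mean of n games is the game-level standard deviation divided by the square root of n.

```python
import random
import statistics

random.seed(2023)  # fixed seed so the illustration is reproducible

ALPHA, BETA = 130.2, 9.8     # hypothetical beta parameters; mean = 130.2 / 140 = 0.93
PLATE_GAMES_PER_YEAR = 30    # assumed home plate assignments per umpire-year
N_UMPIRE_YEARS = 500         # assumed number of simulated umpire-years

game_level = []    # every simulated game-level accuracy
season_means = []  # one average per simulated umpire-year
for _ in range(N_UMPIRE_YEARS):
    games = [random.betavariate(ALPHA, BETA) for _ in range(PLATE_GAMES_PER_YEAR)]
    game_level.extend(games)
    season_means.append(statistics.fmean(games))

sd_game = statistics.stdev(game_level)
sd_season = statistics.stdev(season_means)
ratio = sd_game / sd_season  # theory: roughly sqrt(30) for independent games

print(f"SD of game-level accuracy:    {sd_game:.4f}")
print(f"SD of umpire-year averages:   {sd_season:.4f}")
print(f"ratio of the two SDs:         {ratio:.2f}")
```

In this simulated setting the umpire-year averages are several times less variable than the individual games, which is the sense in which a modest age or experience effect becomes easier to detect at the umpire-year level.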
It is worth noting that at the time of this writing, Major League Baseball is on the verge of implementing an automated system for calling balls and strikes (“robo-umps”).38 Currently, it appears most likely that the system that is implemented will be something of a “hybrid,” involving human umpires supplemented by a challenge procedure in which robo-ump calls can override a very limited number of umpire-called pitches (two or three for each team per game, not accounting for retention if the challenge is successful). The effects of age and experience on human umpire performance do not bear directly on the decision on whether and when this system should be implemented, but they do suggest that after the system is implemented and as older and more experienced umpires retire, fewer calls will be overturned, at least in the near term.
DALE ZIMMERMAN is Professor, Department of Statistics and Actuarial Science, University of Iowa. He received his PhD in Statistics from Iowa State University in 1986. He is a Fellow of both the American Statistical Association and the Institute of Mathematical Statistics, and in 2007 he received the Distinguished Achievement Award from the Section on Statistics and the Environment of the American Statistical Association. His research interests include spatial statistics, longitudinal data analysis, multivariate analysis, mixed linear models, environmental statistics, and sports statistics. He has authored or co-authored six books and more than 100 articles in journals such as Biometrics, Biometrika, Environmetrics, Journal of the Royal Statistical Society (Series B), Annals of Applied Statistics, Journal of Quantitative Analysis in Sports, and Journal of Sports Analytics. At the University of Iowa he teaches courses in spatial and environmental statistics, linear models, experimental design, and sports statistics.
CHENYANG LI is a PhD candidate in Statistics at the University of Iowa. His research focuses on longitudinal discrete data and stochastic process modeling. He is particularly interested in statistical methodology for dependent count data and related applications.
RILEY POST is a senior water resources engineer at HDR Inc. in Des Moines, Iowa. He is a licensed professional engineer and holds a PhD in civil engineering from the University of Iowa. His research interests focus on statistical hydrology, extreme rainfall events, and flood management. He is a former baseball player and umpire and current baseball fan. He dabbles in sports statistics when the opportunity presents itself.
Notes
1. Alan Nevill, Nigel Balmer, and Mark Williams, “The Influence of Crowd Noise and Experience Upon Refereeing Decisions in Football,” Psychology of Sport and Exercise 3 (2002): 261–72.
2. Sean L. Corrigan, Dan B. Dwyer, Briana Harvey, and Paul B. Gastin, “The Influence of Match Characteristics and Experience on Decision-Making Performance in AFL Umpires,” Journal of Science and Medicine in Sport 22 (2019): 112–16.
3. Cansel Arslanoglu, Erol Dogan, and Kursat Acar, “Investigation of Decision Making and Thinking Styles of Volleyball Referees in Terms of Some Variables,” Journal of Education and Training Studies 6 (2018): 21–28.
4. Aydin Karacam and Niyazi Sidki Adiguzel, “Examining the Relationship Between Referee Performance and Self-Efficacy,” European Journal of Educational Research 8 (2019): 377–82.
5. Ivan Belcic, “Does Age, Experience and Body Fat Have an Influence on the Performance of Handball Referees?,” Applied Sciences 12 (2022): 9399.
6. John Walsh, “The Compassionate Umpire,” The Hardball Times, April 7, 2010. http://www.hardballtimes.com/the-compassionate-umpire/. Accessed June 27, 2025.
7. Max Marchi and Jim Albert, Analyzing Baseball Data with R (Boca Raton, FL: CRC Press, 2014).
8. Etan Green and David P. Daniels, What Does it Take to Call a Strike?: Three Biases in Umpire Decision Making, MIT Sloan Sports Analytics Conference, March 1, 2014. https://web.archive.org/web/20140308013335/https://www.sloansportsconference.com/wp-content/uploads/2014/02/2014_SSAC_What-does-it-Take-to-Call-a-Strike.pdf.
9. Matthew Carruth, “The Strike Zone,” SB Nation, October 29, 2012. Accessed June 27, 2025: https://www.lookoutlanding.com/2012/10/29/3561060/the-strike-zone.
10. Dale L. Zimmerman, Jun Tang, and Rui Huang, “Outline Analyses of the Called Strike Zone in Major League Baseball,” Annals of Applied Statistics 13 (2019): 2416–51.
11. Christopher A. Parsons, Johan Sulaeman, Michael C. Yates, and Daniel S. Hamermesh, “Strike Three: Discrimination, Incentives, and Evaluation,” American Economic Review 101 (2011): 1410–35.
12. Scott Tainsky, Brian M. Mills, and Jason A. Winfree, “Further Examination of Potential Discrimination Among MLB Umpires,” Journal of Sports Economics 16 (2015): 353–74.
13. Jerry W. Kim and Brayden G. King, “Seeing Stars: Matthew Effects and Status Bias in Major League Baseball Umpiring,” Management Science 60 (2014): 2619–44.
14. Mike Hsu, “Umpire Home Bias in Major League Baseball,” Journal of Sports Economics 25 (2024): 423–42.
15. Eric Fesselmeyer, “The Impact of Temperature on Labor Quality: Umpire Accuracy in Major League Baseball,” Southern Economic Journal 88 (2021): 545–67.
16. Brian M. Mills, “Technological Innovations in Monitoring and Evaluation: Evidence of Performance Impacts among Major League Baseball Umpires,” Labour Economics 46 (2017): 189–99.
17. Zimmerman, Tang, and Huang, 2416–51.
18. Riley Post, Jun Tang, and Dale L. Zimmerman, “On the Evolution of the Accuracy, Within-Game Consistency, and Geometry of the Called Strike Zone in Major League Baseball from 2008–2023,” Journal of Sports Analytics 11 (2025): 1–18. https://doi.org/10.1177/22150218251389237.
19. Mills, 189–99.
20. Kevin S. Flannagan, Brian M. Mills, and Robert L. Goldstone, “The Psychophysics of Home Plate Umpire Calls,” Scientific Reports 14 (2024): 2735.
21. Mike Fast, “What the Heck is PITCHf/x?,” The Hardball Times Baseball Annual, 2010. Accessed June 27, 2025: http://baseball.physics.illinois.edu/fastpfxguide.pdf.
22. Glenn Healey and Shiyuan Zhao, “Using PITCHf/x to Model the Dependence of Strikeout Rate on the Predictability of Pitch Sequences,” Journal of Sports Analytics 3 (2017): 93–101.
23. Marcos Lage, Jorge Piazentin Ono, Daniel Cervone, Justin Chiang, Carlos Dietrich, and Claudio T. Silva, “StatCast Dashboard: Exploration of Spatiotemporal Baseball Data,” IEEE Computer Graphics and Applications 36 (2016): 28–37.
24. Glenn Healey, “The New Moneyball: How Ballpark Sensors are Changing Baseball,” Proceedings of the IEEE 105 (2017): 1999–2002.
25. Lage, Ono, Cervone, Chiang, Dietrich, and Silva, 28–37.
26. David J. Hunter, “New Metrics for Evaluating Home Plate Umpire Consistency and Accuracy,” Journal of Quantitative Analysis in Sports 14 (2018): 159–72.
27. Sahadev Sharma, “Joe Maddon Wants Robot Umps, or Maybe Just More Consistent Human Ones,” The Athletic, May 11, 2017. https://www.nytimes.com/athletic/59634/2017/05/11/joe-maddon-wants-robot-umps-or-maybe-just-more-consistent-human-ones/. Accessed June 27, 2025.
28. Ethan Singer, “The Effect of Umpires on Baseball: Umpire Runs Created (uRC),” FanGraphs, April 30, 2020. https://community.fangraphs.com/the-effect-of-umpires-on-baseball-umpire-runs-created-urc/. Accessed June 27, 2025.
29. Mills, 189–99.
30. Flannagan, Mills, and Goldstone, 2735.
31. Silvia Ferrari and Francisco Cribari-Neto, “Beta Regression for Modelling Rates and Proportions,” Journal of Applied Statistics 31 (2004): 799–815.
32. Jorge I. Figueroa-Zuniga, Reinaldo B. Arellano-Valle, and Silvia L. P. Ferrari, “Mixed Beta Regression: A Bayesian Perspective,” Computational Statistics and Data Analysis 61 (2013): 137–47.
33. Mills, 189–99.
34. Flannagan, Mills, and Goldstone, 2735.
35. Benjamin Hoffman, “Umpire Suspended for Blown Call,” New York Times, May 10, 2013.
36. Ferrari and Cribari-Neto, 799–815.
37. Michael Smithson and Jay Verkuilen, “A Better Lemon Squeezer? Maximum-Likelihood Regression with Beta-Distributed Dependent Variables,” Psychological Methods 11 (2006): 54–71.
38. Jesse Rogers, “When, How Will Robot Umps Arrive in MLB? Latest on ABS Plans,” ESPN, June 18, 2024. https://www.espn.com/mlb/story/_/id/40377683/mlb-robot-umpires-automated-balls-strikes-challenge-system-umps-majors. Accessed June 27, 2025.