
Objective Method for Determining the Most Valuable Player in Major League Baseball

Running Title: Objective Determination of Major League Baseball MVPs

Keywords: Baseball Analysis, Markov Process, Team Run Production


Kevin Fritz

Hillsborough, NJ 08844

Bruce Bukiet

Department of Mathematical Sciences

Center for Applied Mathematics and Statistics

New Jersey Institute of Technology

Newark, NJ 07102

Abstract

The sportswriters who select the Most Valuable Player (MVP) and Cy Young award winners for Major League Baseball do not use any mandated criteria to make their selections and as a result, many observers feel they do not select the most appropriate players. In this paper, we introduce an objective criterion for selecting the MVP and Cy Young award winners. We extended the Markov Chain model developed by Bukiet et al. (1997) to include more realistic aspects of baseball, improving its authenticity. We used the model to analyze the various candidates for the two awards over a period of 20 years to determine the players whose performance would have added the greatest number of expected wins to an average team and call these players the “objective winners” of the awards. We found that the sportswriters’ selections matched the objective criterion just under half the time and that the sportswriters selected one of the top three performers nearly 70% of the time.

1 Introduction

1.1 Baseball as a Markov Process

Unlike many sports, baseball is very well suited to mathematical analysis. It is primarily a stop-and-go game in which there is a moment of action, followed by enough time to analyze and record what happened. This moment of action depends largely on the performance of two players: the pitcher and the batter. As a result, the complexity of the game can be greatly reduced for the purpose of analysis. In addition, voluminous statistical data are recorded and readily available for Major League Baseball games. Furthermore, baseball can be described as a series of states in which the ending state is easily identifiable. Baseball can be successfully analyzed using a Markov Chain approach because it fits the four main requirements of a Markov Chain described by Trueman (1977). These are:

1. There must be a finite number of possible states.
2. The probability of transitioning from one state to any other must be known.
3. The probability of transitioning from each state must not be affected by how this state was reached.
4. The probability of starting in each of the states must be known.

Baseball fits the first requirement because a game can be in only 1 of 25 possible states (8 possible baserunner situations x 3 possible outs + the end of inning three out state) in any given half of an inning. Baseball fits the second requirement because each batter has a known performance history that can be used to determine his likelihood of causing a transition from any of the 25 states to any other. The third requirement is fulfilled because the probability of a player getting a hit is not affected by how the current situation was arrived at. Finally, baseball fits the fourth requirement because each half inning must start in the no baserunners, zero out state.
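As an illustration of the state count above, the following minimal sketch enumerates the 25 half-inning states. The tuple encoding, the ordering of the states, and all names are assumptions made for this example rather than the representation used in the model.

```python
from itertools import product

# Each base is either empty (0) or occupied (1): (first, second, third).
BASE_SITUATIONS = list(product((0, 1), repeat=3))  # 8 baserunner situations
OUTS = (0, 1, 2)

# Transient states: (outs, bases). The ordering is an arbitrary choice here.
TRANSIENT_STATES = [(outs, bases) for outs in OUTS for bases in BASE_SITUATIONS]

# Absorbing state: three outs, so the half inning is over.
END_OF_INNING = (3, None)

ALL_STATES = TRANSIENT_STATES + [END_OF_INNING]
STATE_INDEX = {state: i for i, state in enumerate(ALL_STATES)}

# Every half inning starts with no baserunners and zero outs.
START_STATE = (0, (0, 0, 0))

print(len(ALL_STATES))           # 25
print(STATE_INDEX[START_STATE])  # 0
```

A 25 x 25 transition matrix for a batter would then be indexed by STATE_INDEX, with the row for END_OF_INNING transitioning only to itself.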

The starting point for the Markov Chain model developed in this study is the formulation of Bukiet et al. (1997). This model uses easily accessible baseball batting and pitching statistics to create the transition probabilities. In that work, the authors demonstrated their approach using a simple and conservative runner advancement model that had previously been used by D’Esopo and Lefkowitz (1977). (Background on the methods used in Bukiet et al. (1997) is provided in the Appendix.) The key output of the Bukiet et al. method is the distribution of runs scored for each team. These distributions show the probability of each team scoring any given number of runs in a game. From this distribution, an expected number of runs scored can also be calculated. In their original work, Bukiet et al. (1997) were mostly concerned with determining the optimal batting order for a set of nine players.

Of course in baseball, as in all sports, the objective is to win games. From the model, the probability of each team winning a game can be determined using the run distributions for lineups representing the two competing teams, while taking into account the ability of the teams’ pitchers, to determine the probability that one team should score at least one more run than the other team. The probability of each team winning can be used to predict the outcome of games and to predict how many wins a team should have attained in a season. In this paper, we expand on this model to study the Most Valuable Player and Cy Young awards. We hope to explore other applications in future papers.
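The step from run distributions to a win probability can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the two teams' run totals are independent, and it handles tied scores by conditioning on the game not being tied, which is an assumption of this sketch. The function names are hypothetical.

```python
def expected_runs(dist):
    """Mean of a run distribution, where dist[r] = P(team scores r runs)."""
    return sum(r * p for r, p in enumerate(dist))


def win_probability(dist_a, dist_b):
    """P(team A beats team B), treating the two run totals as independent.

    A tie is not a final result in baseball, so this sketch conditions on
    the scores being different (an assumption made for illustration).
    """
    p_a_ahead = sum(pa * pb
                    for ra, pa in enumerate(dist_a)
                    for rb, pb in enumerate(dist_b)
                    if ra > rb)
    p_tie = sum(pa * pb for pa, pb in zip(dist_a, dist_b))
    return p_a_ahead / (1.0 - p_tie)
```

Under this treatment, two teams with identical run distributions each win with probability 0.5, which is the baseline that the win contributions in this paper are measured against.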

1.2 Background on the MVP and Cy Young Awards

The Most Valuable Player (MVP) award was created in 1911, and was first given to the player who had the best batting average in each league. Then, in 1914 World War I started, and interest in baseball waned, eventually resulting in the discontinuing of the MVP award. After World War I, from 1922 – 1929, each league appointed a sportswriter from each city in which the league had a team to vote for who they thought should be MVP, and the person with the most votes was declared the winner. However, there were some unusual rules during this time period. For example, only one player per team could be considered, player-managers were not considered, and a player could only win the award once in his career. All this changed in 1931 when the method of choosing the winner was changed to the current method - the MVP award is voted on by sportswriters who follow each of the teams in the particular league. Sportswriters often disagree over what the term “most valuable” means, but under the current system, it is up to an individual sportswriter to decide for him or herself what “most valuable” should mean.

The Cy Young award is given to the “best” pitcher in each league. Baseball Commissioner Ford Frick proposed the award in 1956, stating that pitchers were not being considered fairly in the MVP voting. From 1956, when the award was inaugurated, through 1966, there was only one award given per year. However, in 1967, after Commissioner Frick had retired and under new Commissioner William Eckert, the award was expanded to the best pitcher in each league, rather than the best overall pitcher. As with the MVP award, the Cy Young award is chosen by sportswriters who follow the teams in each of the leagues. They decide for themselves how “best” should be defined.

1.3 Layout of this Paper

The intent of this paper is to develop and study an objective criterion for determining the Most Valuable Player and the Cy Young award winner. Unlike other models, which use only the award candidate’s statistics, we model how interactions between the award candidate and the rest of a team lead to scoring runs. We believe that this approach lends itself to a more appropriate representation of the effect the award candidate has on team run production and wins. We would also like to determine how often the voters select the player who best meets the criterion, i.e. the most appropriate person to be awarded the MVP or Cy Young award. Finally, we will discuss what factors voters take into account at variance with our objective criterion.

We present in section 2.1 some background on previous research in the area of quantifying pitcher and batter performance as well as previous mathematical modeling of baseball. In section 2.2 we define a mathematical criterion for selecting the Most Valuable Player and Cy Young award winner as those players whose performance during the season would have contributed the greatest number of additional wins to an average team. In section 2.3 we discuss how we chose the players who should be considered as possible candidates for the awards. Once the candidates were determined, the Markov Chain model was updated to include more realistic runner advancement as well as more types of statistics, as explained in section 2.4. We then explain the specific procedures for conducting our tests in section 2.5, followed by a display of our results in section 3. The top player using our criterion (in each year) for the 20 baseball seasons from 1988 through 2007 is displayed as well as the MVP and Cy Young award winner as chosen by the sportswriters. We find that sportswriters selected the best candidate for MVP according to our model 40% of the time and the appropriate awardee for the Cy Young award just over half the time. Finally, in section 4 we discuss some of the issues in the award voting, including some noted in previous work, that result in the differences between our computational results and the sportswriters’ selections.

2 Mathematical Analysis of the Most Valuable Player and Cy Young Awards

2.1 Prior Research on the Most Valuable Player and Cy Young Awards

Previous researchers have considered Markov processes in modeling baseball, while others have performed research relating to predicting winners of various baseball awards. However, the current paper is the first attempt to apply Markov processes to MVP and Cy Young award winners.

With respect to using Markov processes to study baseball, Trueman (1977) recognized that baseball can be analyzed as a Markov process because it fits the basic requirements. He used a 25 x 25 transition matrix similar to that used in this paper to analyze batting orders and strategies in baseball. Bellman (1977) used a Markov Chain model to study how baseball managers can apply Markov processes to make strategic decisions during a game.

Bukiet et al. (1997) developed a Markov Chain model with 217 states (9 innings, 3 outs, 8 baserunner situations and the 27 out state) and up to 20 runs per lineup that was used to compute the run distribution for a lineup of nine different players in order to determine the optimal lineup for a set of nine players. (Details of this method are given in the Appendix.) While these computations employed simple runner advancement rules and did not incorporate every possible occurrence in a baseball game, the model can be expanded to include many such aspects of baseball, if the required baseball statistics are available. The methods used in Bukiet et al. (1997) form the foundation for the research presented in the current paper. We demonstrate later in this paper how we have incorporated a more realistic runner advancement model into this structure. Sokol (2003) also employed a similar structure in developing a heuristic method for determining optimal lineups. Hirotsu and Wright (2003) extended the methods of Bukiet et al. (1997) to study optimal pinch hitting strategies, by also including current scoring in a game among other aspects, leading to a transition matrix with 1,434,673 states.

Concerning the prediction of the winners of various baseball awards, many baseball fans and a number of researchers have attempted to quantify player performance using their statistical data. Cover and Keilers (1977) developed the Offensive Earned-Run Average (OERA) to quantify the number of runs for which each player was responsible. The OERA was calculated by looking at a player’s at-bats as their own personal innings, and determining how many runs would have been scored by the given player during the season. Pankin (1978) also performed research quantifying player performance. He created a statistic called the Offensive Performance Average (OPA) which weighted each offensive statistic in baseball, and then computed an average for each individual player. This average can be used to determine the MVP or Cy Young winner by simply looking for the player with the highest average. Bennett and Flueck (1983) compared several popular offensive performance models and used their results to make several inferences about award voting.

More recently, Zillante (2005) considered the idea that award voters prefer players who are well known over equally qualified players who are not as well known by analyzing the effect of a player’s reputation on Gold Glove award winners. This research can be applied to look at the effect of players’ reputations on the winners of other awards such as the MVP and Cy Young award.

Sparks and Abrahamson (2005) developed a mathematical model to predict the winners of the Cy Young award. This model quantified the value of various pitching statistics as a sportswriter would when determining whom to select as the Cy Young winner. They then created a scale from 0 to 10 to determine how likely a specific player was to win the Cy Young award. Their research differs from what is outlined in this paper because they consider only the Cy Young award, and their approach was to develop a model that tries to agree with the sportswriters, whereas we are trying to determine who ought to receive the awards based on an objective criterion and to determine how often the sportswriters select the appropriate winners (based on this criterion).

2.2 An Objective Criterion for Most Valuable Player and Cy Young Award Winner Selection

We define the players most deserving of the MVP and Cy Young awards to be those players who would have had the greatest positive influence on the number of wins that a team of otherwise average players would have won. It is essential that each MVP candidate who is a hitter is compared to the average player at his fielding position, rather than to MVP candidates who play other positions. The reason is that a strong hitter who plays a position with traditionally weak hitters (e.g., second baseman) would add more offense and thus more wins to an average team than an equally skilled player at a traditionally strong hitting position.

2.3 Potential Candidates

While it would be possible to consider each player who played even a single game during a season in our model, most players do not warrant serious consideration for the MVP and Cy Young awards. Thus, we consider only pitchers in the top 5 in one or more of the following statistical categories for the Cy Young award: wins, saves, Earned Run Average (ERA), Walks plus Hits per Innings Pitched (WHIP) or Batting Average Against (BAA). (Data were obtained from www.baseball-reference.com)

Players who were considered for the Cy Young award, or were in the top five in one or more of the following statistical categories were considered as candidates for the MVP award: hits, home runs, Runs Batted In (RBIs), runs scored, bases stolen, batting average, or slugging percentage. (Data were obtained from www.baseball-reference.com)
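The candidate screen described above can be sketched as a union of top-five lists. The data layout, the category keys and the function names below are hypothetical, and ties at the cutoff are broken arbitrarily in this sketch; only the list of Cy Young categories comes from the text.

```python
def top_n(stats, category, n=5, lower_is_better=False):
    """Names of the n best players in one statistical category.

    stats maps player name -> {category: value}.
    """
    ranked = sorted(stats, key=lambda name: stats[name][category],
                    reverse=not lower_is_better)
    return set(ranked[:n])


def candidates(stats, categories, n=5):
    """Players who rank in the top n of at least one category.

    categories maps category name -> True if lower values are better.
    """
    selected = set()
    for category, lower_is_better in categories.items():
        selected |= top_n(stats, category, n, lower_is_better)
    return selected


# Cy Young categories from the text; ERA, WHIP and BAA are better when lower.
CY_YOUNG_CATEGORIES = {"W": False, "SV": False, "ERA": True,
                       "WHIP": True, "BAA": True}

# Made-up pitchers, screened with n=1 to keep the example small.
toy_stats = {
    "A": {"W": 20, "SV": 0, "ERA": 3.50, "WHIP": 1.20, "BAA": 0.250},
    "B": {"W": 12, "SV": 1, "ERA": 2.10, "WHIP": 1.25, "BAA": 0.255},
    "C": {"W": 10, "SV": 30, "ERA": 4.00, "WHIP": 1.30, "BAA": 0.260},
    "D": {"W": 11, "SV": 5, "ERA": 3.90, "WHIP": 1.28, "BAA": 0.258},
}
print(sorted(candidates(toy_stats, CY_YOUNG_CATEGORIES, n=1)))  # ['A', 'B', 'C']
```

The MVP pool would then be the union of the Cy Young pool and the same screen applied to the seven batting categories.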

We note that every player awarded the MVP or Cy Young award since 1988 has met at least one of these criteria. Furthermore, since 1988, 234 out of the possible 240 (97.5%) players who finished in the top three in the final voting met at least one of these criteria.

2.4 Extensions of the Basic Markov Process

We start with the Markov Chain model developed by Bukiet et al. (1997) that used a conservative set of runner advancement rules (described in the Appendix). They used these advancement rules in order to be able to compare their results with those of D’Esopo and Lefkowitz (1977), who computed run distributions expected for a lineup of all equal batters. Examples of the conservative nature of the runner advancement rules include advancing only from first to second base on a single, and only from first to third on a double. Errors, double and triple plays, hit by pitch and several other types of events that occur in a baseball game were ignored. This runner advancement model was tested on each of the Major Leagues for the 2003-2007 seasons using average player offensive data by batting position. The Markov process model with the conservative runner advancement scheme predicted that a team would score about 4% fewer runs than teams actually scored. In the current study, the runner advancement rules were made more authentic by including more detailed runner advancement data.

The conservative runner advancement rules just described allow for a maximum of one out per plate appearance. However, in baseball, approximately 2% of the time two outs are recorded on a single play. In the current study, the possibility of making two outs on a single play was included. Though triple plays could be included, their occurrence is so rare (only 682 have been turned since 1876, or an average of about 5 per season) that they were ignored. The probability of the batter hitting into a double play was determined by the number of double plays into which each batter hit divided by his plate appearances. The possibility of the batter being hit by a pitch was also included.
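The double play probability described above is a simple ratio, sketched below. Applying the same per-plate-appearance ratio to hit by pitch is an assumption of this example, since the text does not say how that probability was computed, and the season line shown is made up.

```python
def per_pa_probability(event_count, plate_appearances):
    """Fraction of a batter's plate appearances that ended in a given event."""
    if plate_appearances <= 0:
        raise ValueError("plate_appearances must be positive")
    return event_count / plate_appearances


# Hypothetical season line: 650 plate appearances, 13 double plays hit into,
# and 6 times hit by a pitch.
p_double_play = per_pa_probability(13, 650)
p_hit_by_pitch = per_pa_probability(6, 650)
print(p_double_play)  # 0.02
```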

Finally, the runner advancement rules were updated. The simple runner advancement rules used in much previous work (e.g. Bukiet et al., 1997; Sokol, 2003; D’Esopo and Lefkowitz, 1977) have a predetermined number of bases that each baserunner advances on each type of event, and the baserunners cannot be thrown out on the bases, nor can they advance on an out. However, in baseball these events occur quite frequently. In this study, play-by-play data from the 2007 season (obtained from www.retrosheet.org) were parsed to determine the probability of advancing each number of bases on each type of event, the probability of making an out on the bases, and the probability of advancing on an out. All of these probabilities were incorporated in the runner advancement rules. The details of how the transition matrices were modified by incorporating these data are provided in the Appendix. Including these modifications in our computations and testing against the 2003-2007 seasons to determine the accuracy, we found that the updated runner advancement model predicted that teams should have scored about 2% more runs than they actually did (half the error of the conservative rules).
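Estimating advancement probabilities from parsed plays amounts to counting outcomes and normalizing. The sketch below assumes a hypothetical record layout of (event, starting base, outcome); it is not Retrosheet's event-file format, and the sample plays are invented.

```python
from collections import Counter, defaultdict


def advancement_probabilities(observations):
    """Estimate P(outcome | event, start_base) from observed plays.

    observations is an iterable of (event, start_base, outcome) tuples, where
    outcome is the number of bases the runner advanced (0, 1, 2 or 3) or the
    string "out" for a runner retired on the bases.
    """
    counts = defaultdict(Counter)
    for event, start_base, outcome in observations:
        counts[(event, start_base)][outcome] += 1
    return {
        key: {outcome: n / sum(tally.values()) for outcome, n in tally.items()}
        for key, tally in counts.items()
    }


# Toy sample: a runner on first base when the batter singles.
plays = [("1B", 1, 1)] * 7 + [("1B", 1, 2)] * 2 + [("1B", 1, "out")] * 1
probs = advancement_probabilities(plays)
print(probs[("1B", 1)])  # {1: 0.7, 2: 0.2, 'out': 0.1}
```

Each conditional distribution sums to one, so the estimates can be used to split a single transition of the conservative model across several destination states.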

2.5 Calculation of the Mathematical MVP and Cy Young Award winners

To determine the “objective” MVP and Cy Young award winners, i.e. those that our model recommends, two lineups (one for each league) were created by averaging the performance of all players by fielding position. The batting order was arranged in a traditional way that is consistent with how teams frequently have arranged their lineups.

Typical Batting Orders by Position

Batting slot  National League  American League
1             Shortstop        Shortstop
2             Outfield         Outfield
3             Outfield         Outfield
4             First Base       First Base
5             Outfield         Designated Hitter
6             Third Base       Outfield
7             Catcher          Third Base
8             Second Base      Catcher
9             Pitcher          Second Base

To determine the contribution that each MVP candidate made over what an average player at their position made, the MVP candidate’s batting statistics were first scaled by the fraction of games that the candidate played (i.e., games played divided by 162 games in a season) and the statistics for an average batter at this player’s position were scaled to make up the difference between this value and unity, so that essentially an average player was used in games that the MVP candidate did not play. Then, the MVP candidate replaced the average player at the MVP candidate’s fielding position in the lineup given in the table above. The code was run considering the MVP candidate in each possible batting position and readjusting the lineup accordingly. For example, to insert the MVP candidate, the average player at the MVP candidate’s fielding position was first removed and all of the players below him were shifted up one position. Then the MVP candidate was inserted at a given position in the lineup and all of the players below that position were shifted down one batting position. The runs distribution for the lineup with all average players and the lineups with the candidate substituted for the average player at his position were analyzed to determine the probability of each lineup winning the game. The probability of the MVP candidate’s team winning the game was recorded for the best case batting position. The increase in the probability of winning (PWin) over 50% (PWin - 0.5) was multiplied by 162 to find the number of additional wins that the MVP candidate would have gained for an average team over an average player at his position over the course of the season. If the probability of the MVP candidate’s team winning a game was less than 50%, the candidate was actually less valuable than an average player at his position, and therefore would have had a negative effect on the number of games that an average team would be expected to win. 
This process was repeated for each batting position and for each MVP candidate who was not a pitcher.
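The substitution procedure above can be sketched in a few small functions. This is an illustration under stated assumptions rather than the authors' program: the statistics are assumed to be expressed on a comparable per-plate-appearance basis, the position labels and function names are hypothetical, and `p_win_of` stands in for the Markov model that turns a lineup into a win probability.

```python
# National League batting order by fielding position, from the table above.
NL_LINEUP = ["SS", "OF", "OF", "1B", "OF", "3B", "C", "2B", "P"]


def blend_stats(candidate, average, games_played, season_games=162):
    """Weight the candidate's statistics by the share of games he played and
    the positional average by the remaining share."""
    share = games_played / season_games
    return {stat: share * candidate[stat] + (1.0 - share) * average[stat]
            for stat in candidate}


def insert_candidate(lineup, position, slot, label="MVP"):
    """Remove one average player at `position`, then insert the candidate at
    batting slot `slot` (0-based); the other players shift accordingly."""
    remaining = list(lineup)
    remaining.remove(position)  # average players at a position are identical
    remaining.insert(slot, label)
    return remaining


def best_slot(lineup, position, p_win_of):
    """Try the candidate in every batting slot and keep the best case."""
    trials = {slot: p_win_of(insert_candidate(lineup, position, slot))
              for slot in range(len(lineup))}
    slot = max(trials, key=trials.get)
    return slot, trials[slot]


def wins_added(p_win, season_games=162):
    """Additional wins over a season relative to a 50% baseline."""
    return (p_win - 0.5) * season_games


print(insert_candidate(NL_LINEUP, "1B", 2))
# ['SS', 'OF', 'MVP', 'OF', 'OF', '3B', 'C', '2B', 'P']
```

For example, a best-case win probability of 0.53 corresponds to (0.53 - 0.5) x 162, or about 4.9 additional wins, while any value below 0.5 gives a negative contribution.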

To analyze the Cy Young award candidates, a number we call the key pitcher number was defined. This number represents how well a pitcher was able to keep runners off the bases compared to the average of all pitchers in the candidate’s league. Since traditionally the data available for pitching performance includes walks, hits and home runs allowed (but not doubles or triples), we use only these statistics (along with innings pitched) in our computations and weight home runs more than average hits. This key pitcher number roughly compares the number of baserunners allowed per inning by the award candidate to the league average. The number was also scaled by the number of innings that the pitcher pitched as a fraction of the total innings a team plays in a season. The formula for the key pitcher number is as follows, where L represents the league numbers and P represents the pitcher being considered, TIP is the total innings played by a team in a season:

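[Equation: key pitcher number. The layout of this formula could not be recovered from the source text. Its terms are the pitcher's hits, home runs, walks, hit batsmen and innings pitched (H_P, HR_P, BB_P, HBP_P, IP_P), the corresponding league totals (H_L, HR_L, BB_L, HBP_L, IP_L), the team's total innings in a season (IP_T), and a constant factor of 3.]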

The opposing lineup’s offensive statistics were multiplied by this number to account for the ability of the pitcher. For example, if the pitcher being considered for the Cy Young Award allowed only 90% (key pitcher number = 0.9) as much offensive production per batter as an average pitcher, then the opposing team’s batters’ statistics would be modified by the computer program so that they only gained 90% of their usual hits and walks, the remainder being added to their outs. Once the computer program was run, the probability of the Cy Young candidate’s team winning a game versus the same average batters but facing an average pitcher (key pitcher number = 1) was recorded. Obviously if one of the two otherwise equal teams gains only 90% of their usual hits and walks, the team with the Cy Young candidate would be expected to have a corresponding increase in their probability of winning. Once again, the probability of winning over 50% (PWin - 0.5) multiplied by 162 was the number of additional wins that the Cy Young candidate would have gained for his all average teammates. This process was repeated for each Cy Young candidate. This process was also applied for all of the MVP candidates who were pitchers. We call the MVP and Cy Young award winners recommended by our model the “objectively computed” award winners.
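How the key pitcher number is applied to an opposing batter can be sketched as follows. The event names and the choice to scale every non-out event are assumptions of this example, as is the made-up batter; the text specifies only that hits and walks are scaled and the remainder is added to outs.

```python
def adjust_for_pitcher(batter, key_pitcher_number):
    """Scale a batter's on-base events by the key pitcher number and move the
    difference into outs, so the probabilities still sum to 1.

    batter maps event name -> probability per plate appearance and must
    include an "OUT" entry.
    """
    adjusted = {event: p * key_pitcher_number
                for event, p in batter.items() if event != "OUT"}
    adjusted["OUT"] = 1.0 - sum(adjusted.values())
    return adjusted


# Made-up average batter facing a pitcher with key pitcher number 0.9.
average_batter = {"1B": 0.16, "2B": 0.05, "3B": 0.005, "HR": 0.03,
                  "BB": 0.085, "OUT": 0.67}
facing_candidate = adjust_for_pitcher(average_batter, 0.9)
```

With a key pitcher number of 1 the batter is unchanged, which corresponds to facing an average pitcher.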

3 Results

In Table 1, we compare our computed win contribution (Win Cont.), in games, of the actual MVP award winners with the win contribution of the objectively computed MVP award winners (objective winner), where pitchers are considered for MVP honors.

Out of the 40 MVP award winners (20 from the American League and 20 from the National League) from 1988 to 2007, the sportswriters voted for the player who would have contributed the most wins to a team of average players 16 times (40% of the time), and for one of the three best candidates according to our model 24 out of 40 times (60% of the time). Interestingly, for all 40 selections the difference between the objectively computed winner and the second best candidate was greater than ¼ of a game. This was not the case for either the Cy Young voting or the MVP voting with pitchers excluded. The remaining 16 times (40% of the time) the writers did not pick one of the three best candidates for MVP according to our model. Perhaps the most extreme case of the sportswriters missing the objectively computed MVP occurred in 2006, when Justin Morneau was named the American League MVP despite contributing only 0.16 extra wins to an average team, while Derek Jeter (who was voted second) was worth 2.6 extra wins and Johan Santana (who was the best candidate according to our model) was worth 4.8.

Table 1 MVP Award w/ Pitchers

League  Year  Actual Winner      Win Cont.  Objective Winner   Win Cont.
AL      2007  Alex Rodriguez     5.38       Alex Rodriguez     5.38
AL      2006  Justin Morneau     0.16       Johan Santana      4.84
AL      2005  Alex Rodriguez     6.71       Alex Rodriguez     6.71
AL      2004  Vladimir Guerrero  4.32       Johan Santana      5.85
AL      2003  Alex Rodriguez     5.87       Alex Rodriguez     5.87
AL      2002  Miguel Tejada      2.40       Alex Rodriguez     5.92
AL      2001  Ichiro Suzuki      2.28       Alex Rodriguez     6.34
AL      2000  Jason Giambi       5.59       Pedro Martinez     9.17
AL      1999  Ivan Rodriguez     3.11       Pedro Martinez     7.40
AL      1998  Juan Gonzalez      3.51       Albert Belle       5.41
AL      1997  Ken Griffey, Jr.   5.03       Roger Clemens      6.79
AL      1996  Juan Gonzalez      2.84       Alex Rodriguez     6.36
AL      1995  Mo Vaughn          1.65       Edgar Martinez     6.01
AL      1994  Frank Thomas       5.33       Frank Thomas       5.33
AL      1993  Frank Thomas       4.52       Ken Griffey, Jr.   5.77
AL      1992  Dennis Eckersley   2.13       Frank Thomas       5.47
AL      1991  Cal Ripken, Jr.    6.31       Cal Ripken, Jr.    6.31
AL      1990  Rickey Henderson   7.19       Rickey Henderson   7.19
AL      1989  Robin Yount        4.07       Bret Saberhagen    6.30
AL      1988  Jose Canseco       5.42       Wade Boggs         6.31
NL      2007  Jimmy Rollins      2.88       Hanley Ramirez     4.67
NL      2006  Ryan Howard        4.49       Lance Berkman      5.11
NL      2005  Albert Pujols      4.89       Derrek Lee         5.36
NL      2004  Barry Bonds        12.31      Barry Bonds        12.31
NL      2003  Barry Bonds        8.48       Barry Bonds        8.48
NL      2002  Barry Bonds        11.02      Barry Bonds        11.02
NL      2001  Barry Bonds        10.86      Barry Bonds        10.86
NL      2000  Jeff Kent          4.96       Todd Helton        6.33
NL      1999  Chipper Jones      6.31       Chipper Jones      6.31
NL      1998  Sammy Sosa         4.96       Greg Maddux        13.76
NL      1997  Larry Walker       8.40       Larry Walker       8.40
NL      1996  Ken Caminiti       5.29       Barry Bonds        7.22
NL      1995  Barry Larkin       4.77       Greg Maddux        8.59
NL      1994  Jeff Bagwell       5.53       Greg Maddux        9.23
NL      1993  Barry Bonds        8.26       Barry Bonds        8.26
NL      1992  Barry Bonds        7.88       Barry Bonds        7.88
NL      1991  Terry Pendleton    3.21       Barry Bonds        5.33
NL      1990  Barry Bonds        5.36       Barry Bonds        5.36
NL      1989  Kevin Mitchell     6.03       Kevin Mitchell     6.03
NL      1988  Kirk Gibson        3.50       Darryl Strawberry  4.17

In Table 2, we compare the win contribution of the actual MVP award winners with the win contribution of the objectively computed MVP award winners, but with the stipulation that pitchers may not be considered for the MVP award.

Often pitchers are not considered for the MVP award since only pitchers can receive the Cy Young award. (Only 8.5% of the MVP award winners since the inception of the Cy Young award have been pitchers.) When pitchers were removed from consideration for the MVP award, the sportswriters’ accuracy in picking the objectively computed MVP award winner increased somewhat. With pitchers removed from consideration, the sportswriters voted for the objectively computed best player 18 times (45% of the time) and for one of the three best candidates according to our model 26 out of 40 times (65% of the time). Furthermore, 3 out of 40 times (7.5%) the sportswriters’ selection trailed our objectively computed selection by less than ¼ of a game, an insignificant margin. The remaining 14 times (35% of the time) the writers did not pick any of the top three objectively computed best candidates for MVP.

In Table 3, we compare the win contribution of the objectively computed Cy Young award winner with the win contribution of the actual Cy Young award winner each year. Out of the 40 Cy Young award winners (20 from the American League and 20 from the National League), the sportswriters voted for the objectively computed best player 21 times or 52.5% of the time and one of the three best candidates according to our model 31 out of 40 times or 77.5% of the time. Furthermore, 4 out of 40 times (10%) the sportswriters’ selection trailed our objectively computed selection by less than ¼ of a game. The remaining 9 times (22.5% of the time) the writers did not pick any of the three best candidates for the Cy Young award.

For completeness, we also compared how often our objectively computed winners of the Cy Young and MVP awards were among the sportswriters’ top 3 selections. When pitchers were considered for the MVP award, the objectively computed winner appeared in the sportswriters’ top 3 selections 24 out of 40 times (60%). When pitchers were removed, this percentage increased to 67.5% (27 out of 40). For the Cy Young award, the objectively computed winner appeared in the sportswriters’ top 3 selections 32 out of 40 times (80%).

Table 2 MVP Award w/o Pitchers

League  Year  Actual Winner      Win Cont.  Objective Winner   Win Cont.
AL      2007  Alex Rodriguez     5.38       Alex Rodriguez     5.38
AL      2006  Justin Morneau     0.16       Derek Jeter        2.60
AL      2005  Alex Rodriguez     6.71       Alex Rodriguez     6.71
AL      2004  Vladimir Guerrero  4.32       Vladimir Guerrero  4.32
AL      2003  Alex Rodriguez     5.87       Alex Rodriguez     5.87
AL      2002  Miguel Tejada      2.40       Alex Rodriguez     5.92
AL      2001  Ichiro Suzuki      2.28       Alex Rodriguez     6.34
AL      2000  Jason Giambi       5.59       Carlos Delgado     5.71
AL      1999  Ivan Rodriguez     3.11       Manny Ramirez      6.02
AL      1998  Juan Gonzalez      3.51       Albert Belle       5.41
AL      1997  Ken Griffey, Jr.   5.03       Edgar Martinez     5.03
AL      1996  Juan Gonzalez      2.84       Alex Rodriguez     6.36
AL      1995  Mo Vaughn          1.65       Edgar Martinez     6.01
AL      1994  Frank Thomas       5.33       Frank Thomas       5.33
AL      1993  Frank Thomas       4.52       Ken Griffey, Jr.   5.77
AL      1992  Dennis Eckersley   2.13       Frank Thomas       5.47
AL      1991  Cal Ripken, Jr.    6.31       Cal Ripken, Jr.    6.31
AL      1990  Rickey Henderson   7.19       Rickey Henderson   7.19
AL      1989  Robin Yount        4.07       Rickey Henderson   4.60
AL      1988  Jose Canseco       5.42       Wade Boggs         6.31
NL      2007  Jimmy Rollins      2.88       Hanley Ramirez     4.67
NL      2006  Ryan Howard        4.49       Lance Berkman      5.11
NL      2005  Albert Pujols      4.89       Derrek Lee         5.36
NL      2004  Barry Bonds        12.31      Barry Bonds        12.31
NL      2003  Barry Bonds        8.48       Barry Bonds        8.48
NL      2002  Barry Bonds        11.02      Barry Bonds        11.02
NL      2001  Barry Bonds        10.86      Barry Bonds        10.86
NL      2000  Jeff Kent          4.96       Todd Helton        6.33
NL      1999  Chipper Jones      6.31       Chipper Jones      6.31
NL      1998  Sammy Sosa         4.96       Mark McGwire       7.91
NL      1997  Larry Walker       8.40       Larry Walker       8.40
NL      1996  Ken Caminiti       5.29       Barry Bonds        7.22
NL      1995  Barry Larkin       4.77       Barry Bonds        4.89
NL      1994  Jeff Bagwell       5.53       Jeff Bagwell       5.53
NL      1993  Barry Bonds        8.26       Barry Bonds        8.26
NL      1992  Barry Bonds        7.88       Barry Bonds        7.88
NL      1991  Terry Pendleton    3.21       Barry Bonds        5.33
NL      1990  Barry Bonds        5.36       Barry Bonds        5.36
NL      1989  Kevin Mitchell     6.03       Kevin Mitchell     6.03
NL      1988  Kirk Gibson        3.50       Darryl Strawberry  4.17

Table 3 Cy Young Award

League  Year  Actual Winner      Win Cont.  Objective Winner   Win Cont.
AL      2007  C.C. Sabathia      2.20       C.C. Sabathia      2.20
AL      2006  Johan Santana      4.84       Johan Santana      4.84
AL      2005  Bartolo Colon      2.05       Johan Santana      5.20
AL      2004  Johan Santana      5.85       Johan Santana      5.85
AL      2003  Roy Halladay       4.39       Tim Hudson         4.48
AL      2002  Barry Zito         2.63       Derek Lowe         5.24
AL      2001  Roger Clemens      1.70       Mike Mussina       4.24
AL      2000  Pedro Martinez     9.17       Pedro Martinez     9.17
AL      1999  Pedro Martinez     7.40       Pedro Martinez     7.40
AL      1998  Roger Clemens      4.94       Roger Clemens      4.94
AL      1997  Roger Clemens      6.79       Roger Clemens      6.79
AL      1996  Pat Hentgen        3.96       Pat Hentgen        3.96
AL      1995  Randy Johnson      6.00       Randy Johnson      6.00
AL      1994  David Cone         4.91       David Cone         4.91
AL      1993  Jack McDowell      1.80       Kevin Appier       5.06
AL      1992  Dennis Eckersley   2.13       Roger Clemens      4.09
AL      1991  Roger Clemens      5.20       Roger Clemens      5.20
AL      1990  Bob Welch          0.92       Roger Clemens      4.03
AL      1989  Bret Saberhagen    6.30       Bret Saberhagen    6.30
AL      1988  Frank Viola        2.86       Teddy Higuera      4.38
NL      2007  Jake Peavy         4.50       Jake Peavy         4.50
NL      2006  Brandon Webb       3.92       Brandon Webb       3.92
NL      2005  Chris Carpenter    4.60       Pedro Martinez     5.27
NL      2004  Roger Clemens      2.93       Randy Johnson      6.87
NL      2003  Eric Gagne         3.72       Jason Schmidt      5.31
NL      2002  Randy Johnson      4.13       Curt Schilling     5.42
NL      2001  Randy Johnson      4.56       Randy Johnson      4.56
NL      2000  Randy Johnson      4.76       Kevin Brown        5.93
NL      1999  Randy Johnson      5.56       Randy Johnson      5.56
NL      1998  Tom Glavine        10.40      Greg Maddux        13.76
NL      1997  Pedro Martinez     6.25       Greg Maddux        6.44
NL      1996  John Smoltz        5.50       Kevin Brown        5.92
NL      1995  Greg Maddux        8.59       Greg Maddux        8.59
NL      1994  Greg Maddux        9.23       Greg Maddux        9.23
NL      1993  Greg Maddux        4.93       Greg Maddux        4.93
NL      1992  Greg Maddux        5.30       Greg Maddux        5.30
NL      1991  Tom Glavine        2.73       Jose Rijo          2.92
NL      1990  Doug Drabek        3.46       Doug Drabek        3.46
NL      1989  Mark Davis         1.08       Scott Garrelts     3.13
NL      1988  Orel Hershiser     2.91       Danny Jackson      3.15

4 Discussion and Conclusions

While there have been numerous efforts to quantify and rank the performance of Major League Baseball players, such as OPA and OERA as discussed in Section 2, the work presented here is, to our knowledge, the first that analyzes the influence of individual players’ performance on expected team performance (wins) with the purpose of selecting the most appropriate Most Valuable Player and Cy Young Award winners. There are several reasons that the sportswriters’ picks for these awards disagree with those computed by the method described in this paper.

Zillante (2005), in his analysis of Gold Glove Award winners, found that a player’s reputation had some effect on the voting for that award. It is therefore reasonable to think that a player’s reputation could also influence the selection of MVP and Cy Young award winners. Reputation is not considered in our analysis.

Our analysis does not consider how well a candidate’s team performed during the season, nor do we attempt to quantify a player’s leadership or personality traits. However, Bennett and Flueck (1983) found that the winning percentage of the MVP award candidate’s team had a great effect on the candidate’s chance of winning the award. They also found that a player’s batting average had a larger effect than it should in determining the MVP award winner. Finally, they found that hard-to-quantify skills that were not included in our analysis, such as team leadership, also had a noticeable effect. Sparks and Abrahamson (2005) agree that, in the selection of Cy Young award winners, the team’s winning percentage has an effect on a candidate’s chances of winning the award.

Furthermore, for the 20-year period from 1988 to 2007, the sportswriters selected a pitcher for the MVP award only once (2.5% of the time), whereas our model showed that a pitcher was the most appropriate awardee 20% of the time. In such a case, this pitcher would also be worthy of the Cy Young award. Interestingly, the one pitcher who won the MVP award, Dennis Eckersley in 1992, was a relief pitcher. This is surprising because relief pitchers very rarely win even the Cy Young award, much less the MVP award.

The results also show that the sportswriters selected the objectively computed Cy Young award winner more often than they selected the objectively computed MVP award winner. We suspect that one reason for this result is that voters for the Cy Young award consider all of the players who are eligible for the award (all pitchers), while the voters for the MVP award do not (all hitters, but none or very few pitchers).

In this paper, we extended the ideas of previously developed Markov Process models for analyzing run distributions in baseball to consider more realistic runner advancement phenomena. This more complicated setup overestimates expected runs by 2%, whereas the simpler, more conservative model underestimates expected runs by 4%. We then applied the method to analyze MVP and Cy Young award selection. The method can also be applied to better understand baseball strategies such as pinch hitting. We hope to address this issue in the future.

Appendix

Here, we present a more detailed description of the method of Bukiet et al. (1997) described in section 1.1. In the method, a transition matrix, P, is created for each batter, incorporating his performance attributes: the probability of getting a single, double, triple, home run, out or walk in a plate appearance.

These transition matrices have the form:

$$
P = \begin{pmatrix}
A & B & C & D \\
0 & A & B & E \\
0 & 0 & A & F \\
0 & 0 & 0 & 1
\end{pmatrix}
$$

where the A’s, B’s and C are 8x8 block matrices and D, E, and F are 8x1 column vectors. The A blocks represent transitions in which outs do not occur. The B blocks represent transitions that increase the number of outs by one – from 0 to 1 or from 1 to 2; the F vector represents transitions from 2 to 3 outs. The C and E blocks represent transitions that increase the number of outs by 2 (double plays) and the D block represents the probability of triple plays.

For the D’Esopo and Lefkowitz (1977) runner advancement rules used by Bukiet et al. (1997), the runners advance as follows:
• On a walk, runners forced to advance move up one base
• On a single, runners advance from first to second base, while base runners on second or third base score
• On a double, a runner advances from first to third base, while runners on second and third base score
• On a triple, all baserunners score
• On a home run, all baserunners plus the batter score
• On an out, no runners advance
• Double and triple plays, hit by pitch and other relatively rare events are not considered.

With this runner advancement model, using subscripts W for walk (e.g., the probability of a walk is $P_W$), S for single, D for double, T for triple, H for home run and O for out, the A blocks become

$$
A = \begin{pmatrix}
P_H & P_S + P_W & P_D & P_T & 0 & 0 & 0 & 0 \\
P_H & 0 & 0 & P_T & P_S + P_W & 0 & P_D & 0 \\
P_H & P_S & P_D & P_T & P_W & 0 & 0 & 0 \\
P_H & P_S & P_D & P_T & 0 & P_W & 0 & 0 \\
P_H & 0 & 0 & P_T & P_S & 0 & P_D & P_W \\
P_H & 0 & 0 & P_T & P_S & 0 & P_D & P_W \\
P_H & P_S & P_D & P_T & 0 & 0 & 0 & P_W \\
P_H & 0 & 0 & P_T & P_S & 0 & P_D & P_W
\end{pmatrix}
$$

while the B blocks are diagonal 8x8 blocks with entries PO, the F block has all entries PO, the C and E blocks have all entries zero. The columns of the A blocks represent transitions to the states with no runners, a man on first, a man on second, a man on third, men on first and second, men on first and third, men on second and third and bases loaded, respectively.
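The block structure described above can be assembled numerically. The following is a minimal sketch, not the authors' implementation: it assumes NumPy, the simple D’Esopo and Lefkowitz advancement rules, and made-up batter probabilities. The function names `basic_a_block` and `basic_transition_matrix` are hypothetical.

```python
import numpy as np

# Base states, in the column order given in the text:
# 0 none, 1 first, 2 second, 3 third, 4 first+second,
# 5 first+third, 6 second+third, 7 bases loaded.


def basic_a_block(pw, ps, pd, pt, ph):
    """A block (no out recorded) under the simple advancement rules."""
    return np.array([
        [ph, ps + pw, pd, pt, 0, 0, 0, 0],
        [ph, 0, 0, pt, ps + pw, 0, pd, 0],
        [ph, ps, pd, pt, pw, 0, 0, 0],
        [ph, ps, pd, pt, 0, pw, 0, 0],
        [ph, 0, 0, pt, ps, 0, pd, pw],
        [ph, 0, 0, pt, ps, 0, pd, pw],
        [ph, ps, pd, pt, 0, 0, 0, pw],
        [ph, 0, 0, pt, ps, 0, pd, pw],
    ], dtype=float)


def basic_transition_matrix(pw, ps, pd, pt, ph):
    """Assemble the 25x25 matrix P from its blocks for one batter."""
    po = 1.0 - (pw + ps + pd + pt + ph)   # probability of an out
    a = basic_a_block(pw, ps, pd, pt, ph)
    b = po * np.eye(8)                    # one more out, runners hold
    z = np.zeros((8, 8))                  # C block and lower-left zeros
    zc = np.zeros((8, 1))                 # D and E blocks (no double/triple plays)
    f = np.full((8, 1), po)               # third out ends the half-inning
    zr = np.zeros((1, 8))
    return np.block([
        [a, b, z, zc],
        [z, a, b, zc],
        [z, z, a, f],
        [zr, zr, zr, np.ones((1, 1))],    # absorbing three-out state
    ])


# Hypothetical batter: the probabilities are illustrative only.
P = basic_transition_matrix(pw=0.09, ps=0.15, pd=0.05, pt=0.005, ph=0.03)
print(P.shape, np.allclose(P.sum(axis=1), 1.0))
```

Checking that every row sums to one is a quick way to catch a misplaced entry when the blocks are edited by hand.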

To determine the probability of being in any possible state, we use a row vector, S, which we call the current situation vector. This vector has one entry for each state. Initially, S represents a 100% chance of being in the state with no outs and no runners on base, i.e., the start of the game. S is then multiplied by the batter transition matrices for the nine batters in the lineup, P1 through P9, one at a time, to compute the probability of arriving at each possible state after the current batter’s plate appearance. The model cycles through the lineup while there are fewer than 27 outs; once the ninth batter is accounted for, the transition matrix for the first batter is used next. The transitions that result in the scoring of runs are recorded, and the model uses them to find the probability of scoring any number of runs in a game. As described in section 2.4, this study made the model more accurate by updating the runner advancement model. As a result, more entries of the transition matrix are nonzero. The updated A block becomes:

$$
A = \begin{pmatrix}
P_H & P_S + P_W & P_D & P_T & 0 & 0 & 0 & 0 \\
P_H & P_S\sigma_{1-H} & P_D\delta_{1-H} & P_T\tau_{1-H} & P_S\sigma_{1-2} + P_W & P_S\sigma_{1-3} & P_D\delta_{1-3} & 0 \\
P_H & P_S\sigma_{2-H} & P_D\delta_{2-H} & P_T\tau_{2-H} & P_W & P_S\sigma_{2-3} & P_D\delta_{2-3} & 0 \\
P_H & P_S\sigma_{3-H} & P_D\delta_{3-H} & P_T\tau_{3-H} & 0 & P_W & 0 & 0 \\
P_H & P_S\sigma_{1,2-H} & P_D\delta_{1,2-H} & P_T\tau_{1,2-H} & P_S\sigma_{1-2;2-H} & P_S\sigma_{1-3;2-H} & P_D\delta_{1-3;2-H} & P_S\sigma_{1-2;2-3} + P_W \\
P_H & P_S\sigma_{1,3-H} & P_D\delta_{1,3-H} & P_T\tau_{1,3-H} & P_S\sigma_{1-2;3-H} & P_S\sigma_{1-3;3-H} & P_D\delta_{1-3;3-H} & P_W \\
P_H & P_S\sigma_{2,3-H} & P_D\delta_{2,3-H} & P_T\tau_{2,3-H} & 0 & P_S\sigma_{2-3;3-H} & P_D\delta_{2-3;3-H} & P_W \\
P_H & P_S\sigma_{1,2,3-H} & P_D\delta_{1,2,3-H} & P_T\tau_{1,2,3-H} & P_S\sigma_{1-2;2,3-H} & P_S\sigma_{1-3;2,3-H} & P_D\delta_{1-3;2,3-H} & P_S\sigma_{1-2;2-3} + P_W
\end{pmatrix}
$$

The B, C, and E blocks were also updated. Due to the width of the B block, we have split it into two 8x4 matrices:

B Block, Part A (first four columns)

$$
\begin{pmatrix}
P_O & 0 & 0 & 0 \\
P_O\omicron_{1-H} & P_O\omicron_{n} + P_S\sigma_{O(1-3)} + P_S\sigma_{O(1-H)} & P_O\omicron_{1-2} + P_D\delta_{O(1-3)} + P_D\delta_{O(1-H)} & P_O\omicron_{1-3} + P_T\tau_{O(1-H)} \\
P_O\omicron_{2-H} & P_S\sigma_{O(2-3)} + P_S\sigma_{O(2-H)} & P_O\omicron_{n} + P_D\delta_{O(2-3)} + P_D\delta_{O(2-H)} & P_O\omicron_{2-3} + P_T\tau_{O(2-H)} \\
P_O\omicron_{3-H} & P_S\sigma_{O(3-H)} & P_D\delta_{O(3-H)} & P_O\omicron_{n} + P_T\tau_{O(3-H)} \\
P_O\omicron_{1,2-H} & P_O\omicron_{n(1);2-H} + P_S\sigma_{O(1-3);2-H} + P_S\sigma_{O(1-H);2-H} & P_O\omicron_{1-2;2-H} + P_D\delta_{O(1-3);2-H} + P_D\delta_{O(1-H);2-H} & P_O\omicron_{1-3;2-H} + P_T\tau_{O(1-H);2-H} \\
P_O\omicron_{1,3-H} & P_O\omicron_{n(1);3-H} + P_S\sigma_{O(1-3);3-H} + P_S\sigma_{O(1-H);3-H} & P_O\omicron_{1-2;3-H} + P_D\delta_{O(1-3);3-H} + P_D\delta_{O(1-H);3-H} & P_O\omicron_{1-3;3-H} + P_T\tau_{O(1-H);3-H} \\
P_O\omicron_{2,3-H} & P_S\sigma_{O(2-3);3-H} + P_S\sigma_{O(2-H);3-H} & P_O\omicron_{n(2);3-H} + P_D\delta_{O(2-3);3-H} + P_D\delta_{O(2-H);3-H} & P_O\omicron_{2-3;3-H} + P_T\tau_{O(2-H);3-H} \\
P_O\omicron_{1,2,3-H} & P_O\omicron_{n(1);2,3-H} + P_S\sigma_{O(1-3);2,3-H} + P_S\sigma_{O(1-H);2,3-H} & P_O\omicron_{1-2;2,3-H} + P_D\delta_{O(1-3);2,3-H} + P_D\delta_{O(1-H);2,3-H} & P_O\omicron_{1-3;2,3-H} + P_T\tau_{O(1-H);2,3-H}
\end{pmatrix}
$$

B Block, Part B (last four columns)

$$
\begin{pmatrix}
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 \\
P_O\omicron_{n} + P_S\sigma_{1-2;O(2-3)} + P_S\sigma_{1-2;O(2-H)} & P_O\omicron_{n(1);2-3} + P_S\sigma_{1-3;O(2-H)} & P_O\omicron_{1-2;2-3} + P_D\delta_{1-3;O(2-H)} & 0 \\
P_S\sigma_{1-2;O(3-H)} & P_O\omicron_{n} + P_S\sigma_{1-3;O(2-H)} & P_O\omicron_{1-2;n(3)} + P_D\delta_{1-3;O(3-H)} & 0 \\
0 & P_S\sigma_{2-3;O(3-H)} & P_O\omicron_{n} + P_D\delta_{2-3;O(3-H)} & 0 \\
P_O\omicron_{n(1,2);3-H} + P_S\sigma_{1-2;O(2-3);3-H} + P_S\sigma_{1-2;O(2-H);3-H} & P_O\omicron_{n(1);2-3;3-H} + P_S\sigma_{1-3;O(2-H);3-H} & P_O\omicron_{1-2;2-3;3-H} + P_D\delta_{1-3;O(2-H);3-H} & P_O\omicron_{n} + P_S\sigma_{1-2;2-3;O(3-H)}
\end{pmatrix}
$$

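The C and E blocks now account for double plays, becoming:

C Block

$$
C = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
P_{DP_1} & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & P_{DP_2} & 0 & 0 & 0 & 0 \\
P_{DP_3} & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & P_{DP_4} & 0 & 0 & 0 & 0
\end{pmatrix}
$$

E Block

$$
E = \begin{pmatrix}
0 \\ P_{DP_1} \\ 0 \\ 0 \\ P_{DP_2} \\ P_{DP_3} \\ 0 \\ P_{DP_4}
\end{pmatrix}
$$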
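As a small illustration of how sparse these two blocks are, the sketch below fills them from four double play probabilities. It assumes NumPy; the function name `double_play_blocks` and the numerical values are hypothetical, since the paper's own DP values are not reproduced here.

```python
import numpy as np

# Base states: 0 none, 1 first, 2 second, 3 third,
# 4 first+second, 5 first+third, 6 second+third, 7 bases loaded.


def double_play_blocks(dp1, dp2, dp3, dp4):
    """Return the 8x8 C block and the 8x1 E block for given double play probabilities."""
    c = np.zeros((8, 8))
    c[1, 0] = dp1   # runner on first -> bases empty
    c[4, 3] = dp2   # first and second -> man on third
    c[5, 0] = dp3   # first and third -> bases empty (run scores)
    c[7, 3] = dp4   # bases loaded -> man on third (run scores)
    # With one out already recorded, the same plays end the half-inning,
    # so E only needs the probability of each play, not the ending base state.
    e = c.sum(axis=1, keepdims=True)
    return c, e


# Illustrative values only.
C, E = double_play_blocks(0.02, 0.01, 0.015, 0.005)
print(np.count_nonzero(C), E.ravel())
```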

In the A and B blocks, σ represents the fraction of singles, starting from a particular situation, which result in the given transition, δ represents the fraction of doubles which result in the given transition, and τ represents the fraction of triples which result in the given transition. In the B blocks, ο is used to represent the fraction of outs which result in the given transition.

The subscripts following the Greek symbols in the A and B blocks represent the bases advanced by the baserunners, where semi-colons separate baserunners that end up on different bases and commas separate baserunners that end up on the same base (they both score on the play). Furthermore, the subscript n followed by numbers (1, 2, 3 or some combination thereof) denotes that the baserunners on the given bases did not advance. When the subscript n is not followed by any numbers, it is understood that no baserunners advanced. Finally, the subscript O denotes that the baserunner was out trying to advance on the play. For example, the term $\sigma_{1-3;3-H}$ in the sixth row and sixth column of the A block represents the fraction of the time that a single with men on first and third base will result in the runner on first advancing to third base (represented by 1-3) and the runner on third scoring (represented by 3-H). Similarly, $\sigma_{1-2;2,3-H}$ in the eighth row and fifth column of the A block represents the situation in which a single with the bases loaded results in the runner on first advancing to second base (represented by 1-2), while the runners on second and third both score (represented by 2,3-H). The numerical values of the σ, δ, τ and ο parameters with their particular subscripts are given in Table 7.

The C block represents plays that increase the number of outs from 0 to 2, i.e., a double play (represented by the subscript DP), and the E block represents plays that increase the number of outs from 1 to 3. Please note that not all double play combinations are included, as there are only a few combinations that occur with any regularity. We represent the probability of transitioning from a runner on first to no runners on a double play as DP1, from men on first and second to just a man on third as DP2, from first and third to no one on (with a run scoring) as DP3, and from bases loaded to a man on third, with a run scoring, as DP4.
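As a quick check on how these fractions are meant to be read, consider a single hit with a runner on first base only. Taking the σ values as listed in Table 7, the five possible outcomes for that runner account for all such singles:

$$
\sigma_{1-2} + \sigma_{1-3} + \sigma_{1-H} + \sigma_{O(1-3)} + \sigma_{O(1-H)} = 0.730 + 0.257 + 0.002 + 0.010 + 0.001 = 1.000
$$

The first three terms are transitions without an out and therefore belong to the A block, while the last two add an out and belong to the B block.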

Finally, the F block represents transitions involving increasing the number of outs from 2 to 3. The entries are the sums of the rows of the B block; runs are not increased on such a play in the model.
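With all blocks defined, the propagation of the current situation vector described earlier in this appendix can be sketched in code. This is a simplified illustration rather than the model used in the paper: it assumes NumPy, uses the simple D’Esopo and Lefkowitz advancement rules instead of the updated blocks, covers a single half-inning instead of 27 outs, and returns only the expected number of runs instead of the full run distribution. All function names are hypothetical.

```python
import numpy as np

# Runners on base in each base state, in the appendix's column order:
# none, first, second, third, first+second, first+third, second+third, loaded.
RUNNERS = np.array([0, 1, 1, 1, 2, 2, 2, 3])


def transition_matrix(pw, ps, pd, pt, ph):
    """25x25 matrix for one batter under the simple advancement rules."""
    po = 1.0 - (pw + ps + pd + pt + ph)          # probability of an out
    a = np.array([
        [ph, ps + pw, pd, pt, 0, 0, 0, 0],
        [ph, 0, 0, pt, ps + pw, 0, pd, 0],
        [ph, ps, pd, pt, pw, 0, 0, 0],
        [ph, ps, pd, pt, 0, pw, 0, 0],
        [ph, 0, 0, pt, ps, 0, pd, pw],
        [ph, 0, 0, pt, ps, 0, pd, pw],
        [ph, ps, pd, pt, 0, 0, 0, pw],
        [ph, 0, 0, pt, ps, 0, pd, pw],
    ], dtype=float)
    p = np.zeros((25, 25))
    for outs in range(3):
        lo = 8 * outs
        p[lo:lo + 8, lo:lo + 8] = a              # A block: no out
        if outs < 2:
            p[lo:lo + 8, lo + 8:lo + 16] = po * np.eye(8)   # B block
        else:
            p[lo:lo + 8, 24] = po                # F block: third out
    p[24, 24] = 1.0                              # absorbing three-out state
    return p


def run_values():
    """Runs scored on each no-out transition: runners before + batter - runners after."""
    r = np.zeros((25, 25))
    scored = RUNNERS[:, None] + 1 - RUNNERS[None, :]
    for outs in range(3):
        lo = 8 * outs
        r[lo:lo + 8, lo:lo + 8] = scored         # only multiplied by nonzero A entries
    return r


def expected_runs_half_inning(lineup, first_batter=0, tol=1e-12, max_pa=500):
    """Propagate the situation vector S through the lineup until three outs."""
    s = np.zeros(25)
    s[0] = 1.0                                   # no outs, bases empty
    r = run_values()
    total = 0.0
    batter = first_batter
    for _ in range(max_pa):
        if s[24] >= 1.0 - tol:                   # essentially all mass absorbed
            break
        p = lineup[batter % len(lineup)]         # wrap from ninth batter to first
        total += s @ (p * r).sum(axis=1)         # expected runs on this plate appearance
        s = s @ p
        batter += 1
    return total


# Hypothetical lineup of nine identical batters; probabilities are illustrative.
lineup = [transition_matrix(0.09, 0.15, 0.05, 0.005, 0.03)] * 9
print(expected_runs_half_inning(lineup))
```

A useful sanity check is a batter who only homers (probability h) or makes an out: the expected number of home runs before the third out is 3h/(1-h), so h = 0.25 should give exactly one run per half-inning.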

The values of all the parameters appearing in the transition matrix blocks based on tabulating data parsed from www.retrosheet.org from the 2007 Major League Baseball season are presented in Table 7.

Table 7

σ
subscript  value
1,2,3-H  0.001
1,2-H  0.001
1,3-H  0.002
1-2  0.730
1-2;2,3-H  0.424
1-2;2-3  0.284
1-2;2-3;O(3-H)  0.000
1-2;2-H  0.424
1-2;3-H  0.730
1-2;O(2-3)  0.000
1-2;O(2-3);3-H  0.000
1-2;O(2-H)  0.021
1-2;O(2-H);3-H  0.021
1-2;O(3-H)  0.000
1-3  0.257
1-3;2,3-H  0.150
1-3;2-H  0.150
1-3;3-H  0.257
1-3;O(2-H)  0.007
1-3;O(2-H);3-H  0.007
1-H  0.002
2,3-H  0.582
2-3  0.389
2-3;3-H  0.389
2-3;O(3-H)  0.000
2-H  0.582
3-H  1.000
O(1-3)  0.010
O(1-3);2,3-H  0.006
O(1-3);2-H  0.006
O(1-3);3-H  0.010
O(1-H)  0.001
O(1-H);2,3-H  0.001
O(1-H);2-H  0.001
O(1-H);3-H  0.001
O(2-3)  0.001
O(2-3);3-H  0.001
O(2-H)  0.029
O(2-H);3-H  0.029
O(3-H)  0.000

δ
subscript  value
1,2,3-H  0.430
1,2-H  0.430
1,3-H  0.436
1-3  0.534
1-3;2,3-H  0.525
1-3;2-H  0.525
1-3;3-H  0.534
1-3;O(2-H)  0.000
1-3;O(2-H);3-H  0.000
1-3;O(3-H)  0.000
1-H  0.436
2,3-H  0.984
2-3  0.015
2-3;3-H  0.015
2-3;O(3-H)  0.000
2-H  0.984
3-H  1.000
O(1-3)  0.000
O(1-3);2,3-H  0.000
O(1-3);2-H  0.000
O(1-3);3-H  0.000
O(1-H)  0.030
O(1-H);2,3-H  0.029
O(1-H);2-H  0.029
O(1-H);3-H  0.030
O(2-3)  0.000
O(2-3);3-H  0.000
O(2-H)  0.001
O(2-H);3-H  0.001
O(3-H)  0.000

τ
subscript  value
1,2,3-H  1.000
1,2-H  1.000
1,3-H  1.000
1-H  1.000
2,3-H  1.000
2-H  1.000
3-H  1.000
O(1-H)  0.000
O(1-H);2,3-H  0.000
O(1-H);2-H  0.000
O(1-H);3-H  0.000
O(2-H)  0.000
O(2-H);3-H  0.000
O(3-H)  0.000

ο
subscript  value
1,2,3-H  0.000
1,2-H  0.000
1,3-H  0.000
1-2  0.240
1-2;2,3-H  0.000
1-2;2-3  0.058
1-2;2-3;3-H  0.014
1-2;2-H  0.000
1-2;3-H  0.058
1-2;n(3)  0.183
1-3  0.000
1-3;2,3-H  0.000
1-3;2-H  0.000
1-3;3-H  0.000
1-H  0.000
2,3-H  0.000
2-3  0.240
2-3;3-H  0.058
2-H  0.000
3-H  0.240
n  0.759
n(1);2,3-H  0.000
n(1);2-3  0.183
n(1);2-3;3-H  0.044
n(1);2-H  0.000
n(1);3-H  0.183
n(1,2);3-H  0.139
n(2);3-H  0.183

References

Bellman, R. (1977). Dynamic Programming and Markovian Decision Processes, with Application to Baseball. In Optimal Strategies in Sports (edited by S. P. Ladany and R. E. Machol), 77-85. New York: Elsevier-North Holland.

Bennett, J. M. and Flueck, J. A. (1983). An Evaluation of Major League Baseball Offensive Performance Models. The American Statistician 37, 76-82.

Bukiet, B., Harold, E. and Palacios, J. L. (1997). A Markov Chain Approach to Baseball. Operations Research 45, 14-23.

Cover, T. M. and Carroll, W. K. (1977). An Offensive Earned-Run Average for Baseball. Operations Research 25, 729-40.

D’Esopo, D. A. and Lefkowitz, B. (1977). The Distribution of Runs in the Game of Baseball. In Optimal Strategies in Sports (edited by S. P. Ladany and R. E. Machol), 55-62. New York: Elsevier-North Holland.

Hirotsu, N. and Wright, M. (2003). A Markov Chain Approach to Optimal Pinch Hitting Strategies in a Designated Hitter Rule Baseball Game. Journal of the Operations Research Society of Japan 46:3, 353-371.

Pankin, M. (1978). Evaluating Offensive Performance in Baseball. Operations Research 26, 610-619.

Sokol, J. (2003). A Robust Heuristic for Batting Order Optimization Under Uncertainty. Journal of Heuristics 9, 353-70.

Sparks, R. and Abrahamson, D. (2005). A Mathematical Model to Predict Award Winners. Math Horizons, 5-13.

Trueman, R. E. (1977). Analysis of Baseball as a Markov Process. In Optimal Strategies in Sports (edited by S. P. Ladany and R. E. Machol), 68-76. New York: Elsevier-North Holland.

Zillante, A. (2005). Reputation Effects in Gold Glove Award Voting. Public Economics 0502003, EconWPA.
