Poster source: cs229.stanford.edu/proj2015/247_poster.pdf

Predicting New Race Times From Race History

Matthew Millett, TJ Melanson

Locally Weighted Linear Regression

Introduction: “What if…?”

Many runners like to ask the question, “What if?”
•  If you race a distance you’ve never raced before, how would you do?
•  What can it tell you about your fitness?

Current methods are limited to intuition; Jack Daniels’ running formula, which requires serious VO2 max testing; or Peter Riegel’s oversimplified formula.

All are either imprecise or fail to generalize well to different distances.

Figure 1: A VO2 max test in progress. Current tests like this are expensive and complicated, but give a good measure of fitness.

Why don’t we just use race history?

Data

Initial tests: Baseline & Oracle

Figure 4: Data sources consisted of online race results.

We’re implementing a Hidden Markov Model to model states of fitness. We have a discrete set of fitness states. Emissions are normally distributed with respect to fitness, and currently all transition probabilities are uniform. We train on our entire data set, using the Baum-Welch EM algorithm. Then we use the forward algorithm to estimate how likely it is that your fitness is in a given state, given the races you’ve run.
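The forward step described above can be sketched in plain Python. This is a minimal illustration, not the poster's YAHMM code: the state means, the shared standard deviation, and the function names `gaussian_pdf` and `forward_filter` are all assumptions made for the example, and the Baum-Welch training step is omitted.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution, used as the emission model."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def forward_filter(observations, means, std, initial=None, transition=None):
    """Return P(fitness state | races so far) via the normalized forward algorithm.

    observations: race times in seconds, oldest first (assumed same distance).
    means: emission mean for each discrete fitness state (hypothetical values).
    initial / transition: default to uniform, as on the poster.
    """
    n = len(means)
    if initial is None:
        initial = [1.0 / n] * n
    if transition is None:
        transition = [[1.0 / n] * n for _ in range(n)]

    belief = initial[:]
    for step, obs in enumerate(observations):
        if step > 0:
            # Predict: push the previous belief through the transition matrix.
            belief = [sum(belief[i] * transition[i][j] for i in range(n))
                      for j in range(n)]
        # Update: weight each state by how well it explains the observed race.
        belief = [belief[j] * gaussian_pdf(obs, means[j], std) for j in range(n)]
        total = sum(belief)
        if total == 0.0:
            raise ValueError("observation is too far from every state mean")
        belief = [b / total for b in belief]
    return belief


# Hypothetical 8K fitness states: one state every 50 s from 1400 s to 1800 s.
state_means = list(range(1400, 1801, 50))
posterior = forward_filter([1705.0, 1690.0], state_means, std=30.0)
best_state = state_means[posterior.index(max(posterior))]
print(best_state, [round(p, 3) for p in posterior])
```

With uniform transition probabilities the predict step resets the belief to uniform, so only the most recent race influences the result; a trained, non-uniform transition matrix is what would let earlier races carry information forward.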

To establish a baseline for performance, we drew from the literature. Peter Riegel’s formula is still widely used today. We took a runner’s most recent race as (d1, t1) and extrapolated from it to find t2, the predicted time at the target distance d2. Our oracle was the running club coach, Pattisue Plumer, a former Olympian. We tested on 16 runners: for each runner, we predicted a randomly chosen race and measured how far off the predicted time was.

Baseline Average Error: 11.71%
Oracle Average Error: 9.73%
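The baseline procedure can be sketched as follows, using Riegel's formula t2 = t1 (d2 / d1)^1.06. The runner and the race times are made up for illustration, and the error definition (absolute error as a percentage of the actual time) is an assumption, since the poster does not spell it out.

```python
def riegel_predict(t1, d1, d2, exponent=1.06):
    """Predict the time over distance d2 from a known race of t1 over d1."""
    return t1 * (d2 / d1) ** exponent


def percent_error(predicted, actual):
    """Absolute error as a percentage of the actual time (assumed metric)."""
    return abs(predicted - actual) / actual * 100.0


# Hypothetical runner: the most recent race is a 5K in 16:40 (1000 s),
# and the held-out race is an 8K actually run in 28:20 (1700 s).
predicted = riegel_predict(t1=1000.0, d1=5000.0, d2=8000.0)
error = percent_error(predicted, actual=1700.0)
print(f"predicted {predicted:.1f} s, error {error:.2f}%")
```

Averaging `percent_error` over one held-out race per runner would give a figure comparable in form to the baseline and oracle errors reported above.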

Future Work

We are tuning the hyperparameters of the regression to see if we can find more reliable values. We are working out some bugs that keep us from fully implementing the HMM with 2 variables instead of 1. Lastly, we hope to eventually incorporate workout data into our model as emissions.

References

1.  Riegel, Peter S. "Athletic Records and Human Endurance: A time-vs.-distance equation describing world-record performances may be used to compare the relative endurance capabilities of various groups of people." American Scientist (1981): 285-290.

2.  BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/

3.  YAHMM: https://github.com/jmschrei/yahmm

Our Hidden Markov Model

We scraped our data from online results on tfrrs.org. Each athlete has a page with a whole history of results. We randomly selected 47 athletes from track meets and 54 athletes from cross-country meets to train and test our models.

CS229 Final Project

$t = s\,d^{E}$

We adapted Riegel’s model to take in a set of races and parameterize fitness with values of (s, E). A lower s value means you’ve got speed, but a lower E means you’ve got strength.
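Under this model, two races by the same runner satisfy t2 / t1 = (d2 / d1)^E, so Riegel's formula is the special case where E is fixed at 1.06. The poster does not say how (s, E) were estimated; the sketch below assumes ordinary least squares on the log-log form log t = log s + E log d, with a hypothetical function name and made-up race data.

```python
import math


def fit_speed_endurance(races):
    """Fit t = s * d**E by least squares on log t = log s + E * log d.

    races: list of (distance_m, time_s) pairs covering at least two
    distinct distances.
    """
    xs = [math.log(d) for d, _ in races]
    ys = [math.log(t) for _, t in races]
    n = len(races)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    E = sxy / sxx                         # slope of the log-log line
    s = math.exp(y_mean - E * x_mean)     # exp of the intercept
    return s, E


# Hypothetical race history: (distance in metres, time in seconds).
history = [(1500.0, 250.0), (5000.0, 930.0), (8000.0, 1560.0), (10000.0, 1980.0)]
s, E = fit_speed_endurance(history)
predicted_3k = s * 3000.0 ** E
print(f"s = {s:.4f}, E = {E:.3f}, predicted 3K = {predicted_3k:.1f} s")
```

Fitting both parameters per runner, rather than fixing E, is what lets the model separate a runner with speed (low s) from one with strength (low E).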

$t_2 = t_1 \left(\frac{d_2}{d_1}\right)^{1.06}$

Figure 2: Riegel’s Formula. Riegel’s formula was first published in Runner’s World magazine in 1977, modeling human performance over increasing distance.

Figure 3: Coach Pattisue, our oracle.

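The closer the distance and the closer the date of a race, the better it will model the race you’re trying to predict. We used locally weighted linear regression, where $\theta = (X^T W X)^{-1} X^T W y$ and W is a diagonal matrix that weights each past race by its closeness to the race being predicted.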

We estimated error using leave-one-out cross-validation (LOOCV). While this model was fast, it was not descriptive and didn’t perform as well as we expected, so we moved to an HMM.

Features: Distance, Date
Output: race speed

XC RMSE: 1.12 m/s
Track RMSE: 0.82 m/s
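The regression and its leave-one-out error can be sketched in plain Python as below. Several details are assumptions rather than the poster's settings: the Gaussian kernel, the bandwidths `tau_dist` and `tau_days`, the tiny ridge term that keeps the normal equations solvable, and the sample races. Features are centered on the query, which leaves the fitted line unchanged and makes the prediction equal to the intercept.

```python
import math


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def lwlr_predict(query, races, tau_dist=3000.0, tau_days=120.0, ridge=1e-8):
    """Predict speed (m/s) at query = (distance_m, day).

    races: list of (distance_m, day, speed_mps) from one athlete.
    Solves theta = (X^T W X)^-1 X^T W y with Gaussian weights in W.
    """
    q_dist, q_day = query
    k = 3  # intercept, distance, date
    xtwx = [[ridge if i == j else 0.0 for j in range(k)] for i in range(k)]
    xtwy = [0.0] * k
    for dist, day, speed in races:
        u = (dist - q_dist) / tau_dist
        v = (day - q_day) / tau_days
        w = math.exp(-0.5 * (u * u + v * v))  # closer races weigh more
        row = [1.0, u, v]
        for i in range(k):
            xtwy[i] += w * row[i] * speed
            for j in range(k):
                xtwx[i][j] += w * row[i] * row[j]
    theta = solve_linear_system(xtwx, xtwy)
    return theta[0]  # features are centered on the query


def loocv_rmse(races, **kwargs):
    """Leave each race out in turn, predict it, and return the RMSE in m/s."""
    squared = []
    for i, (dist, day, speed) in enumerate(races):
        rest = races[:i] + races[i + 1:]
        squared.append((lwlr_predict((dist, day), rest, **kwargs) - speed) ** 2)
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical athlete: (distance in m, days since first race, speed in m/s).
sample = [(1500.0, 0.0, 6.1), (5000.0, 30.0, 5.4), (8000.0, 45.0, 5.1),
          (3000.0, 90.0, 5.8), (10000.0, 120.0, 5.0), (5000.0, 150.0, 5.5)]
print(f"LOOCV RMSE: {loocv_rmse(sample):.3f} m/s")
```

The bandwidths play the role of the regression hyperparameters: smaller values make the fit rely on only the nearest races, while larger values approach an ordinary linear fit over the whole history.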

Figure 5: One athlete’s data: Race distance vs. date vs. speed.

Figure 6: An example of our Markov Model calculation. Given the distribution of races that people have run, and your race, what will you run?

[Figure 6 panels: Initial Probability Distribution; Trained Distribution; Data; P(Race | 28:20 8K). The horizontal axis of each panel runs from 1400 to 1800.]