sat, 29-oct-2016, 21:14
Equinox Marathon Relay leg 2, 2016

Equinox Marathon Relay leg 2, 2016

Introduction

A couple years ago I compared racing data between two races (Gold Discovery and Equinox, Santa Claus and Equinox) in the same season for all runners that ran in both events. The result was an estimate of how fast I might run the Equinox Marathon based on my times for Gold Discovery and the Santa Claus Half Marathon.

Several years have passed and I've run more races and collected more racing data for all the major Fairbanks races and wanted to run the same analysis for all combinations of races.

Data

The data comes from a database I’ve built of race times for all competitors, mostly coming from the results available from Chronotrack, but including some race results from SportAlaska.

We started by loading the required R packages and reading in all the racing data, a small subset of which looks like this.

race year name finish_time birth_year sex
Beat Beethoven 2015 thomas mcclelland 00:21:49 1995 M
Equinox Marathon 2015 jennifer paniati 06:24:14 1989 F
Equinox Marathon 2014 kris starkey 06:35:55 1972 F
Midnight Sun Run 2014 kathy toohey 01:10:42 1960 F
Midnight Sun Run 2016 steven rast 01:59:41 1960 M
Equinox Marathon 2013 elizabeth smith 09:18:53 1987 F
... ... ... ... ... ...

Next we loaded in the names and distances of the races and combined this with the individual racing data. The data from Chronotrack doesn’t include the mileage and we will need that to calculate pace (minutes per mile).

My database doesn’t have complete information about all the racers that competed, and in some cases the information for a runner in one race conflicts with the information for the same runner in a different race. In order to resolve this, we generated a list of runners, grouped by their name, and threw out racers where their name matches but their gender was reported differently from one race to the next. Please understand we’re not doing this to exclude those who have changed their gender identity along the way, but to eliminate possible bias from data entry mistakes.

Finally, we combined the racers with the individual racing data, substituting our corrected runner information for what appeared in the individual race’s data. We also calculated minutes per mile (pace) and the age of the runner during the year of the race (age). Because we’re assigning a birth year to the minimum reported year from all races, our age variable won’t change during the running season, which is closer to the way age categories are calculated in Europe. Finally, we removed results where pace was greater than 20 minutes per mile for races longer than ten miles, and greater than 16 minute miles for races less than ten miles. These are likely to be outliers, or competitors not running the race.

name birth_year gender race_str year miles minutes pace age
aaron austin 1983 M midnight_sun_run 2014 6.2 50.60 8.16 31
aaron bravo 1999 M midnight_sun_run 2013 6.2 45.26 7.30 14
aaron bravo 1999 M midnight_sun_run 2014 6.2 40.08 6.46 15
aaron bravo 1999 M midnight_sun_run 2015 6.2 36.65 5.91 16
aaron bravo 1999 M midnight_sun_run 2016 6.2 36.31 5.85 17
aaron bravo 1999 M spruce_tree_classic 2014 6.0 42.17 7.03 15
... ... ... ... ... ... ... ... ...

We combined all available results for each runner in all years they participated such that the resulting rows are grouped by runner and year and columns are the races themselves. The values in each cell represent the pace for the runner × year × race combination.

For example, here’s the first six rows for runners that completed Beat Beethoven and the Chena River Run in the years I have data. I also included the column for the Midnight Sun Run in the table, but the actual data has a column for all the major Fairbanks races. You’ll see that two of the six runners listed ran BB and CRR but didn’t run MSR in that year.

name gender age year beat_beethoven chena_river_run midnight_sun_run
aaron schooley M 36 2016 8.19 8.15 8.88
abby fett F 33 2014 10.68 10.34 11.59
abby fett F 35 2016 11.97 12.58 NA
abigail haas F 11 2015 9.34 8.29 NA
abigail haas F 12 2016 8.48 7.90 11.40
aimee hughes F 43 2015 11.32 9.50 10.69
... ... ... ... ... ... ...

With this data, we build a whole series of linear models, one for each race combination. We created a series of formula strings and objects for all the combinations, then executed them using map(). We combined the start and predicted race names with the linear models, and used glance() and tidy() from the broom package to turn the models into statistics and coefficients.

All of the models between races were highly significant, but many of them contain coefficients that aren’t significantly different than zero. That means that including that term (age, gender or first race pace) isn’t adding anything useful to the model. We used the significance of each term to reduce our models so they only contained coefficients that were significant and regenerated the statistics and coefficients for these reduced models.

The full R code appears at the bottom of this post.

Results

Here’s the statistics from the ten best performing models (based on ).

start_race predicted_race n p-value
run_of_the_valkyries golden_heart_trail_run 40 0.956 0
golden_heart_trail_run equinox_marathon 36 0.908 0
santa_claus_half_marathon golden_heart_trail_run 34 0.896 0
midnight_sun_run gold_discovery_run 139 0.887 0
beat_beethoven golden_heart_trail_run 32 0.886 0
run_of_the_valkyries gold_discovery_run 44 0.877 0
midnight_sun_run golden_heart_trail_run 52 0.877 0
gold_discovery_run santa_claus_half_marathon 111 0.876 0
chena_river_run golden_heart_trail_run 44 0.873 0
run_of_the_valkyries santa_claus_half_marathon 91 0.851 0

It’s interesting how many times the Golden Heart Trail Run appears on this list since that run is something of an outlier in the Usibelli running series because it’s the only race entirely on trails. Maybe it’s because it’s distance (5K) is comparable with a lot of the earlier races in the season, but because it’s on trails it matches well with the later races that are at least partially on trails like Gold Discovery or Equinox.

Here are the ten worst models.

start_race predicted_race n p-value
midnight_sun_run equinox_marathon 431 0.525 0
beat_beethoven hoodoo_half_marathon 87 0.533 0
beat_beethoven midnight_sun_run 818 0.570 0
chena_river_run equinox_marathon 196 0.572 0
equinox_marathon hoodoo_half_marathon 90 0.584 0
beat_beethoven equinox_marathon 265 0.585 0
gold_discovery_run hoodoo_half_marathon 41 0.599 0
beat_beethoven santa_claus_half_marathon 163 0.612 0
run_of_the_valkyries equinox_marathon 125 0.642 0
midnight_sun_run hoodoo_half_marathon 118 0.657 0

Most of these models are shorter races like Beat Beethoven or the Chena River Run predicting longer races like Equinox or one of the half marathons. Even so, each model explains more than half the variation in the data, which isn’t terrible.

Application

Now that we have all our models and their coefficients, we used these models to make predictions of future performance. I’ve written an online calculator based on the reduced models that let you predict your race results as you go through the running season. The calculator is here: Fairbanks Running Race Converter.

For example, I ran a 7:41 pace for Run of the Valkyries this year. Entering that, plus my age and gender into the converter predicts an 8:57 pace for the first running of the HooDoo Half Marathon. The for this model was a respectable 0.71 even though only 23 runners ran both races this year (including me). My actual pace for HooDoo was 8:18, so I came in quite a bit faster than this. No wonder my knee and hip hurt after the race! Using my time from the Golden Heart Trail Run, the converter predicts a HooDoo Half pace of 8:16.2, less than a minute off my 1:48:11 finish.

Appendix: R code

library(tidyverse)
library(lubridate)
library(broom)

races_db <- src_postgres(host="localhost", dbname="races")

combined_races <- tbl(races_db, build_sql(
    "SELECT race, year, lower(name) AS name, finish_time,
        year - age AS birth_year, sex
     FROM chronotrack
     UNION
     SELECT race, year, lower(name) AS name, finish_time,
        birth_year,
        CASE WHEN age_class ~ 'M' THEN 'M' ELSE 'F' END AS sex
     FROM sportalaska
     UNION
     SELECT race, year, lower(name) AS name, finish_time,
        NULL AS birth_year, NULL AS sex
     FROM other"))

races <- tbl(races_db, build_sql(
    "SELECT race,
        lower(regexp_replace(race, '[ ’]', '_', 'g')) AS race_str,
        date_part('year', date) AS year,
        miles
     FROM races"))

racing_data <- combined_races %>%
    inner_join(races) %>%
    filter(!is.na(finish_time))

racers <- racing_data %>%
    group_by(name) %>%
    summarize(races=n(),
              birth_year=min(birth_year),
              gender_filter=ifelse(sum(ifelse(sex=='M',1,0))==
                                   sum(ifelse(sex=='F',1,0)),
                                   FALSE, TRUE),
              gender=ifelse(sum(ifelse(sex=='M',1,0))>
                            sum(ifelse(sex=='F',1,0)),
                            'M', 'F')) %>%
    ungroup() %>%
    filter(gender_filter) %>%
    select(-gender_filter)

racing_data_filled <- racing_data %>%
    inner_join(racers, by="name") %>%
    mutate(birth_year=birth_year.y) %>%
    select(name, birth_year, gender, race_str, year, miles, finish_time) %>%
    group_by(name, race_str, year) %>%
    mutate(n=n()) %>%
    filter(!is.na(birth_year), n==1) %>%
    ungroup() %>%
    collect() %>%
    mutate(fixed=ifelse(grepl('[0-9]+:[0-9]+:[0-9.]+', finish_time),
                        finish_time,
                        paste0('00:', finish_time)),
           minutes=as.numeric(seconds(hms(fixed)))/60.0,
           pace=minutes/miles,
           age=year-birth_year,
           age_class=as.integer(age/10)*10,
           group=paste0(gender, age_class),
           gender=as.factor(gender)) %>%
    filter((miles<10 & pace<16) | (miles>=10 & pace<20)) %>%
    select(-fixed, -finish_time, -n)

speeds_combined <- racing_data_filled %>%
    select(name, gender, age, age_class, group, race_str, year, pace) %>%
    spread(race_str, pace)

main_races <- c('beat_beethoven', 'chena_river_run', 'midnight_sun_run',
                'run_of_the_valkyries', 'gold_discovery_run',
                'santa_claus_half_marathon', 'golden_heart_trail_run',
                'equinox_marathon', 'hoodoo_half_marathon')

race_formula_str <-
    lapply(seq(1, length(main_races)-1),
           function(i)
               lapply(seq(i+1, length(main_races)),
                      function(j) paste(main_races[[j]], '~',
                                        main_races[[i]],
                                        '+ gender', '+ age'))) %>%
    unlist()

race_formulas <- lapply(race_formula_str, function(i) as.formula(i)) %>%
    unlist()

lm_models <- map(race_formulas, ~ lm(.x, data=speeds_combined))

models <- tibble(start_race=factor(gsub('.* ~ ([^ ]+).*',
                                        '\\1',
                                        race_formula_str),
                                   levels=main_races),
                 predicted_race=factor(gsub('([^ ]+).*',
                                            '\\1',
                                            race_formula_str),
                                       levels=main_races),
                 lm_models=lm_models) %>%
    arrange(start_race, predicted_race)

model_stats <- glance(models %>% rowwise(), lm_models)
model_coefficients <- tidy(models %>% rowwise(), lm_models)

reduced_formula_str <- model_coefficients %>%
    ungroup() %>%
    filter(p.value<0.05, term!='(Intercept)') %>%
    mutate(term=gsub('genderM', 'gender', term)) %>%
    group_by(predicted_race, start_race) %>%
    summarize(independent_vars=paste(term, collapse=" + ")) %>%
    ungroup() %>%
    transmute(reduced_formulas=paste(predicted_race, independent_vars, sep=' ~ '))

reduced_formula_str <- reduced_formula_str$reduced_formulas

reduced_race_formulas <- lapply(reduced_formula_str,
                                function(i) as.formula(i)) %>% unlist()

reduced_lm_models <- map(reduced_race_formulas, ~ lm(.x, data=speeds_combined))

n_from_lm <- function(model) {
    summary_object <- summary(model)

    summary_object$df[1] + summary_object$df[2]
}

reduced_models <- tibble(start_race=factor(gsub('.* ~ ([^ ]+).*', '\\1', reduced_formula_str),
                                           levels=main_races),
                         predicted_race=factor(gsub('([^ ]+).*', '\\1', reduced_formula_str),
                                               levels=main_races),
                         lm_models=reduced_lm_models) %>%
    arrange(start_race, predicted_race) %>%
    rowwise() %>%
    mutate(n=n_from_lm(lm_models))

reduced_model_stats <- glance(reduced_models %>% rowwise(), lm_models)
reduced_model_coefficients <- tidy(reduced_models %>% rowwise(), lm_models) %>%
    ungroup()

coefficients_and_stats <- reduced_model_stats %>%
    inner_join(reduced_model_coefficients,
               by=c("start_race", "predicted_race", "n")) %>%
    select(start_race, predicted_race, n, r.squared, term, estimate)

write_csv(coefficients_and_stats,
          "coefficients.csv")

make_scatterplot <- function(start_race, predicted_race) {
   age_limits <- speeds_combined %>%
      filter_(paste("!is.na(", start_race, ")"),
               paste("!is.na(", predicted_race, ")")) %>%
      summarize(min=min(age), max=max(age)) %>%
      unlist()

   q <- ggplot(data=speeds_combined,
               aes_string(x=start_race, y=predicted_race)) +
            # plasma works better with a grey background
            # theme_bw() +
            geom_abline(slope=1, color="darkred", alpha=0.5) +
            geom_smooth(method="lm", se=FALSE) +
            geom_point(aes(shape=gender, color=age)) +
            scale_color_viridis(option="plasma",
                              limits=age_limits) +
            scale_x_continuous(breaks=pretty_breaks(n=10)) +
            scale_y_continuous(breaks=pretty_breaks(n=6))

   svg_filename <- paste0(paste(start_race, predicted_race, sep="-"), ".svg")

   height <- 9
   width <- 16
   resize <- 0.75

   svg(svg_filename, height=height*resize, width=width*resize)
   print(q)
   dev.off()
}

lapply(seq(1, length(main_races)-1),
      function(i)
            lapply(seq(i+1, length(main_races)),
                  function(j)
                        make_scatterplot(main_races[[i]], main_races[[j]])
                  )
Meta Photolog Archives