Introduction
While I had followed the College Football Data (CFBD) Model Pick’em Contest in passing, I had never participated. Now that IU is a football powerhouse, and since I was looking for a new project, I decided to put together a model and join the competition this year. I’m not a college football expert and didn’t want to commit a ton of time, so the model is relatively straightforward, but it was fun to work on and I was able to brush up on some time series concepts I hadn’t used in a while.
Model Overview
The goal of the competition is to predict how many points the home team will win by for at least 70% of games in the season. Rather than regress on home margin of victory (MOV) directly, I decided to build a model that estimates the number of points each team will score and calculate home MOV from there.
The expected points model is a straightforward linear regression with three predictors: expected number of plays, expected passing predicted points added (PPA) and expected rushing PPA. PPA is the CFBD equivalent of expected points added.
\[xPoints \sim xPlays + xPassPPA + xRushPPA\]
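To make the "points first, margin second" idea concrete, here is a minimal Python sketch. The class name, function names, coefficient values and team inputs are all made up for illustration; the real coefficients come from the fitted regression and are not shown in this post.

```python
from dataclasses import dataclass


@dataclass
class TeamExpectation:
    """Expected per-game inputs for one team's offense (illustrative names)."""
    x_plays: float
    x_pass_ppa: float
    x_rush_ppa: float


# Hypothetical coefficients, chosen only so the example runs; not fitted values.
COEFS = {"intercept": 0.0, "plays": 0.1, "pass_ppa": 1.0, "rush_ppa": 1.0}


def expected_points(team: TeamExpectation, coefs: dict = COEFS) -> float:
    """Linear predictor: intercept plus one coefficient per input."""
    return (
        coefs["intercept"]
        + coefs["plays"] * team.x_plays
        + coefs["pass_ppa"] * team.x_pass_ppa
        + coefs["rush_ppa"] * team.x_rush_ppa
    )


def expected_home_mov(home: TeamExpectation, away: TeamExpectation,
                      coefs: dict = COEFS) -> float:
    """Expected home margin of victory = home xPoints minus away xPoints."""
    return expected_points(home, coefs) - expected_points(away, coefs)


# Made-up inputs for a single game.
home = TeamExpectation(x_plays=70.0, x_pass_ppa=12.0, x_rush_ppa=8.0)
away = TeamExpectation(x_plays=65.0, x_pass_ppa=9.0, x_rush_ppa=6.0)
print(expected_home_mov(home, away))
```

Modeling each team's points separately means the margin falls out as a simple difference, and the same per-team predictions could also be summed if a game total were ever needed.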
The expected plays, passing PPA and rushing PPA factors are estimated using AR(1) models. For each variable, a model is fit with binary indicators for home/neutral site, and random effects for the offense and defense team IDs. The random effects evolve over time through an AR(1) process.
As an example, here is what the passing PPA model looks like.
\[PassPPA \sim \beta_0 + \beta_1 \text{ishome} + \beta_2 \text{isneutral} + OffTeam + DefTeam\]
\[OffTeam_t \sim \beta_{0,off} + \phi_{off} \cdot OffTeam_{t-1}\]
\[DefTeam_t \sim \beta_{0,def} + \phi_{def} \cdot DefTeam_{t-1}\]
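For intuition about how an AR(1) team effect behaves, here is a small Python simulation. It is only a sketch: the actual models were fit in Stan, and the function names, parameter values and the Gaussian innovation with standard deviation `sigma` are assumptions added for illustration.

```python
import random


def simulate_ar1(beta0: float, phi: float, sigma: float,
                 n_steps: int, start: float, seed: int = 0) -> list:
    """Simulate x_t = beta0 + phi * x_{t-1} + noise, with noise ~ Normal(0, sigma).

    The noise term is an assumption of this sketch.
    """
    rng = random.Random(seed)
    path = [start]
    for _ in range(n_steps - 1):
        path.append(beta0 + phi * path[-1] + rng.gauss(0.0, sigma))
    return path


def stationary_mean(beta0: float, phi: float) -> float:
    """Long-run mean of the process; only defined when |phi| < 1."""
    if abs(phi) >= 1.0:
        raise ValueError("the process is not stationary unless |phi| < 1")
    return beta0 / (1.0 - phi)


# Illustrative values: a team effect that starts high and drifts back.
path = simulate_ar1(beta0=0.0, phi=0.9, sigma=0.05, n_steps=12, start=0.5)
print([round(x, 3) for x in path])
print(stationary_mean(beta0=0.0, phi=0.9))
```

A value of phi close to 1 means a team's effect carries over almost fully from one period to the next, while a smaller phi pulls it back toward the long-run mean more quickly.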
These models were estimated in Stan. Weakly informative priors are set for global parameters, but more informative priors are set for offense and defense team skill.
For each team, I find “short run” and “long run” estimates using mixed effects models via the lme4 package. The model structure is regressing the skill factor in question against a binary is_home term and random effects for offense and defense. In the short-run case, this regression is done on one year of data and in the long-run case, four years of data are used. The final prior for each team is the weighted average between the two, with a global weight found within the Stan model.
Continuing with the PassPPA example, an example prior for a given team j is:
\[PassPPA_j = w \cdot PassPPA_{shortrun, j} + (1-w) \cdot PassPPA_{longrun, j}, \quad w \in [0,1]\]
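As a quick sketch of the blend in Python: the function name and the numbers are hypothetical, and in the real model the weight w is estimated inside Stan rather than fixed by hand.

```python
def blend_prior(short_run: float, long_run: float, w: float) -> float:
    """Weighted average of the one-year and four-year estimates for one team."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return w * short_run + (1.0 - w) * long_run


# Made-up estimates for one team: strong last season, more modest over four years.
prior_mean = blend_prior(short_run=0.30, long_run=0.10, w=0.25)
print(prior_mean)
```

With w near 0 the prior leans on the four-year estimate, and with w near 1 it follows last season almost entirely.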
The general thought behind these priors is that blue bloods tend to stay blue bloods, mid-tier programs tend to stay mid-tier, and so on, which is reflected in the four-year regression. However, teams aren’t static, so the one-year regression is aimed at providing a more up-to-date estimate of how the team has been performing heading into the current season. Given how much roster movement there is now with the transfer portal, last season may not be as good a guide as it used to be, but in general this approach works fine, and the skill estimates should update fast enough after a couple of games to make up for these shortcomings.
Once expected points and the corresponding expected home margin of victory are calculated, I run a final regression of home MOV on my model’s expected home MOV and the DraftKings opening line.
\[\text{HomeMOV} \sim \text{DGHomeMOV} + \text{DraftKingsHomeMOV}\]
DraftKings isn’t considered to be a market-making book the way Pinnacle, Cris or Circa are, but they make lines for all the games, and the data was readily available, so I opted to use them. I use opening line rather than closing line because closing line isn’t available to me when I make predictions.
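Here is a rough, dependency-free Python sketch of that final blending regression. All of the numbers are invented, the helper names are hypothetical, and the DraftKings line is assumed to already be expressed as an expected home margin; the post does not say what software the actual regression was run in.

```python
def fit_ols(predictors, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    rows = [[1.0] + list(r) for r in predictors]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] for i in range(k)]


def predict(coefs, row):
    """Apply fitted coefficients to one (model MOV, book MOV) pair."""
    return coefs[0] + sum(c * x for c, x in zip(coefs[1:], row))


# Made-up historical games: model MOV, book MOV, and the actual result.
dg_home_mov = [3.0, -7.0, 10.5, 14.0, -2.5, 6.0]
dk_home_mov = [4.5, -6.0, 7.0, 17.5, -3.0, 3.5]
actual_home_mov = [7.0, -10.0, 3.0, 21.0, 1.0, 6.0]

coefs = fit_ols(list(zip(dg_home_mov, dk_home_mov)), actual_home_mov)
print(coefs)                      # [intercept, weight on model, weight on book]
print(predict(coefs, (5.0, 6.5))) # blended prediction for a new game
```

The fitted weights indicate how much the final pick leans on the model versus the book, which is what keeps the blended prediction from straying too far from the market.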
The final model ends up with predictions that are on average about 4.5 points different from the bookmaker spread. This is in line with predictions made by other contestants in previous years, which gives me confidence that my predictions are not too far off market.
Conclusion
Overall, while I’d be shocked if I finished near the top of the leaderboard, this was a fun project to work on, and I was able to learn a lot about time series modeling in Stan. Further improvements could come from refining the priors, looking at returning production, or adjusting the coefficients on the AR(1) terms to make them more time sensitive, but right now I’m happy with where the model is.
Thanks to Max Resnick for floating the idea of joining this competition, and the short run/long run idea.