I'm always interested in improving my model for predicting game outcomes. My logistic regression model is based on straightforward variables: offensive and defensive passing and running efficiencies, turnover rates, and penalty rates. In this post, I'll question some of my own assumptions and begin to look at which variables really belong in a prediction model.

Explanatory vs. Predictive Models

As I updated the data for the predictions each week over the recent regular season, I noticed that some of the variables were more consistent than others. Turnover rates were particularly erratic. A team with a very good interception rate in the first half of the season would very often have a below average interception rate in the second half of the season.

Any basic regression model that attempts prediction is based on an assumption that the variables used as predictors, the 'to-date' variables, are indicative of what the same variables will be in the future. In football terms, when I include each team's interception rates from weeks 1-8 in the model to predict outcomes for week 9, I'm assuming that previous team interception rates are representative of future team interception rates.

But what if interceptions were completely random? Weeks 1-8 would not be predictive of week 9. Even though interceptions would still explain a large part of previous outcomes, past interceptions would not predict future outcomes at all. Just like mutual funds, past turnover performance does not guarantee future returns.
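One way to see this: if a stat were pure luck, a team's first-half value would tell you nothing about its second-half value. A quick simulation makes the point (a sketch with made-up league parameters, not real data; the exaggerated league size just keeps the estimate stable):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a league where interception rate is pure luck:
# every team has the same true rate, and observed rates are
# just sampling noise around it.
n_teams = 5000           # exaggerated league size so the estimate is stable
true_rate = 0.03         # assumed league-wide interception rate per attempt
attempts_per_half = 250  # rough number of pass attempts in 8 games

first_half = rng.binomial(attempts_per_half, true_rate, n_teams) / attempts_per_half
second_half = rng.binomial(attempts_per_half, true_rate, n_teams) / attempts_per_half

# Split-half correlation: near zero, because the past carries no signal.
r = np.corrcoef(first_half, second_half)[0, 1]
print(f"split-half correlation: {r:.3f}")
```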

What if interception rates were just 'mostly' random? Should they still be included in a prediction model? Perhaps variables with lots of random noise, such as interceptions or fumbles, should not be included in predictive models even though they explain a large part of past outcomes. The question becomes 'what is the critical signal-to-noise ratio that makes a variable appropriate for inclusion as a predictor?'

"Interceptions are very random, and they are 'thrown' by an offense much more than they are 'taken' by a defense."

There is so much statistical data available for football teams that it is tempting to dump them all into regression software. Doing that would produce a very high r-squared, but would include so much noise, so many non-repeating circumstantial conditions, that it would not be an effective prediction model. I believe this is why so many other models out there do so poorly. A system like DVOA may be very good at quantifying how well teams have done to date--something we already know, but not as good at telling us which teams are likely to do well in the future.

Team Stat Self-Correlations

To test which variables should remain in a prediction model, I tested how well each variable predicted itself from the first half of a season to the second half. This is known as longitudinal auto-correlation. This method tests how enduring and repeatable each variable is.

I tested how well team efficiency stats from weeks 1-8 predicted themselves in weeks 9-17. For example, I tested how well offensive passing efficiency from the first half of the season predicted pass efficiency in the second half of the season. Both offensive and defensive stats were tested. I used data from the 2006 and 2007 regular seasons for all 32 teams (n=64, with two exceptions: mid-season 3rd down conversion rate and penalty rates were not available for 2006).
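The calculation itself is simple: for each stat, line up the team values from weeks 1-8 against the same teams' values from weeks 9-17 and compute a correlation. A sketch with invented numbers (the real inputs would come from the season's play-by-play data):

```python
import numpy as np

# Hypothetical example: one row per team, first-half and
# second-half values of the same stat (e.g. off. pass efficiency,
# in yards per attempt). These numbers are invented.
weeks_1_8 = np.array([6.1, 5.4, 7.2, 5.9, 6.8, 5.1, 6.3, 5.7])
weeks_9_17 = np.array([6.3, 5.2, 6.9, 6.1, 6.5, 5.5, 6.0, 5.9])

# Pearson correlation: how well does the first half predict the second?
r = np.corrcoef(weeks_1_8, weeks_9_17)[0, 1]
print(f"self-correlation: {r:.2f}")
```

A stat that is mostly skill will score high here; a stat that is mostly luck will score near zero.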

The correlation coefficients between team stats from weeks 1-8 with stats from weeks 9-17 are listed in the table below.

Variable | Correlation
D Int Rate | 0.08
D Pass | 0.29
D Run | 0.44
D Sack Rate | 0.24
O 3D Rate | 0.43
O Fumble Rate | 0.48
O Int Rate | 0.27
O Pass | 0.58
O Run | 0.56
O Sack Rate | 0.26
Penalty Rate | 0.58

The longitudinal correlations range from as high as 0.58 for offensive pass efficiency and penalty rate, down to as low as 0.08 for defensive interception rate.

The defensive interception rate stands out as the least enduring, least consistent team stat. In contrast, offensive interception rates correlate significantly better, with a coefficient of 0.27.

This indicates there is a lot of randomness in interceptions, which is no surprise. But producing defensive interceptions does not appear to be an enduring, repeatable ability of a team. Instead, defensive interceptions appear to be more a function of 1) randomness, and 2) opponents' tendency to throw interceptions. In other words, interceptions are very random, and they are 'thrown' by an offense much more than they are 'taken' by a defense.
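These self-correlations also point to a practical fix. A common technique (not something from this post's model, just a standard idea) is to regress each team's observed rate toward the league average, using the stat's self-correlation as a rough reliability weight: the noisier the stat, the harder you shrink. A minimal sketch, with an assumed ~3% league-average interception rate:

```python
def shrink_toward_mean(observed, league_mean, reliability):
    """Blend a team's observed rate with the league average.

    reliability: roughly the stat's self-correlation (0 = pure noise,
    1 = perfectly stable), used here as a crude shrinkage weight.
    """
    return reliability * observed + (1 - reliability) * league_mean

# A defense with a 5% interception rate, but the stat's self-correlation
# is only 0.08, so the estimate barely moves off the ~3% league average.
est_d = shrink_toward_mean(0.05, 0.03, 0.08)

# Offensive int rate is more stable (0.27), so the same observed
# rate keeps more of its apparent signal.
est_o = shrink_toward_mean(0.05, 0.03, 0.27)

print(f"defense: {est_d:.4f}, offense: {est_o:.4f}")
```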

In following posts, I'll demonstrate that some of the more random team stats can be more accurately predicted by using other, less noisy stats instead of the to-date stats themselves. This may have large implications for an improved game prediction model.

beats-the-spread: First let me say that we are on the same track in trying to improve our models. The only difference is that you try to predict win-lose outcomes while I try to predict cover-not-cover outcomes against the Vegas spread.

Here are a couple of statistical suggestions for your model that quickly come to mind:

- Observations are correlated while logistic regression assumes independence.

- You have "double error". If you predict week 9 based on stats from weeks 1-8, including, say, offensive stats, you first predict offensive stats and then predict game outcomes: variance on top of variance.

- Errors are assumed to follow a normal distribution; I don't know if this one holds. Check the residuals; if not, it should be easy to take the log of the outcomes or standardize them to achieve normality.

- I have heard that using neural networks on classification data like yours provides much better results. Plus, none of the above assumptions need to be verified (maybe normality still holds, not sure).

- Outliers might be affecting your results. Have you tried using a robust logistic regression or downweighting outliers?
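The residual check suggested above can be sketched as follows (simulated data and a hand-rolled Newton-Raphson fit, since the actual models and data aren't available here):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate game data: one predictor (e.g. a pass-efficiency edge) and
# a binary win/loss outcome generated from a true logistic model.
n = 2000
x = rng.normal(0, 1, n)
p_true = 1 / (1 + np.exp(-(0.2 + 0.8 * x)))
y = rng.binomial(1, p_true)

# Fit logistic regression by Newton-Raphson.
X = np.column_stack([np.ones(n), x])
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    W = p * (1 - p)
    grad = X.T @ (y - p)
    hess = X.T @ (X * W[:, None])
    beta += np.linalg.solve(hess, grad)

p = 1 / (1 + np.exp(-X @ beta))

# Pearson residuals; for a well-specified model these should look
# roughly centered with standard deviation near 1.
resid = (y - p) / np.sqrt(p * (1 - p))
print(f"coefficients: {beta.round(2)}")
print(f"residual mean {resid.mean():.3f}, std {resid.std():.3f}")
```

Eyeballing a histogram or Q-Q plot of these residuals is the usual next step.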

beats-the-spread: Thanks for the great comments. I took a look at your site yesterday. Very impressive. The link behind your name didn't work, so here it is for anyone else who's interested:

http://www.numbersinsight.com/niblog/football.php

I've already tested many of the points you suggested.

-I use rate stats, which are largely independent (total pass yards and total run yards aren't, but yards per pass and yards per run are).

-My current model does not do what I suggested in this post, i.e. first predict some stats, then predict wins. But the numbers suggest that this method may actually reduce total error. Some stats are more random and noisy than others. For example, using stable, less-noisy stats to estimate the central tendency of a team to throw interceptions may be better than using past interception data.

-I've found that residuals for both my linear and logistic models are generally normal.

-I experimented with neural networks for game predictions, but found that it was slightly less effective than logistic regression models. I'm not an expert on NN, so there could be ways to improve the effectiveness.

-Outliers in the past were very "on-axis," i.e. they didn't bias the coefficients. But I have a feeling that when I include this year's data, NE might cause some problems. For example, if you graph their TDs against passing efficiency, they are far off the linear trend. To me it suggests that once an offense becomes efficient enough, it passes an inflection point beyond which it almost can't be stopped.

I've got some similar ideas about how to build a model vs the spread.

One suggestion for you is to try rate stats instead of total stats. It's hard to tell if you do, or if you still use total yards difference between teams.

For example, use yards per pass attempt instead of total passing yards. Losing teams can rack up lots of passing yards because they're passing much more often, not because they're better at passing. But total yards might be a better fit for point spread estimation.
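To make the rate-vs-total distinction concrete, here's a toy comparison (invented box-score numbers):

```python
# Toy box score: the trailing team threw far more often,
# piling up garbage-time yards.
loser_yards, loser_attempts = 320, 48
winner_yards, winner_attempts = 240, 28

# Totals favor the loser; rates favor the winner.
print(f"total yards: loser {loser_yards}, winner {winner_yards}")
print(f"yds/att: loser {loser_yards / loser_attempts:.1f}, "
      f"winner {winner_yards / winner_attempts:.1f}")
```

The total-yards comparison points the wrong way; yards per attempt points the right way.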