To compensate for early-season over-confidence in model predictions, the model now includes a mechanism for extremes in team stat variation.

To explain how the compensation works, consider the recent SD at DEN game. The model had DEN as a 0.97 favorite to win, but the Chargers crushed the Broncos. (Some of the following is taken from my comments to the week 5 predictions.) The current model basically says this:

1. Assumes SD's and DEN's to-date efficiency stats represent their true full-season talent, skill, and performance.

2. If a team with season-long stats such as SD's plays at a team with season-long stats such as DEN's, the home team with a stat advantage that enormous would win 97 times out of 100 games.

I think #1 is the problem, but I think #2 is true. I think the overconfidence in the model at this point in the season comes from only a few data points for each statistic so far. Outlier/unrepresentative performances have a large effect on each team's aggregate efficiency stats early in a season. In other words, Denver may have played their best 4 games in terms of team efficiencies already, and from here out their stats will regress to the mean. Until their efficiency stats stabilize, the model would overweight Denver's odds of winning, and vice versa for San Diego.

[Incidentally, this would be true of all statistical models this early in the season. But I think a pure efficiency model might be less susceptible. For example, including 3rd down conversion rates or red zone rates would exaggerate the effect of unrepresentative performances to an even greater degree.]

Ultimately, it's the unstable input variables which may not represent each team's true ultimate efficiency averages. I had already been working on a simple method of Bayesian compensation for early-season unstable team stats. It's similar to what IMDB or other similar sites do with their movie-rating system. IMDB doesn't want one voter to give 5 stars to his favorite Pauly Shore movie, resulting in a perfect 5.00 rating for "Encino Man" until someone else takes time to rate it a zero. So they assign every movie a certain number of baseline votes, so one early voter doesn't move the aggregate too far. Eventually, as the real votes accumulate, the baseline phantom votes are pushed out of the calculation. Throughout the process, the aggregate rating remains stable.

The model's input stats will include a number of "phantom" games at league-average stats for every team. This mitigates the exaggerated confidence levels early in the season. (DEN's stats wouldn't have looked so good, and SD's wouldn't have looked so bad.) The question is, how many pure-average phantom games should be used? We need to look at how quickly each team's stats stabilize. In other words, how many games of data are required before the stats are within an acceptable margin of their final season average?
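The phantom-game idea amounts to a weighted average of a team's observed stat and the league average. Here is a minimal sketch of that blend. The function name `blend_with_phantom_games` and all of the numbers are hypothetical, and the post does not give the exact formula the model uses.

```python
def blend_with_phantom_games(team_avg, games_played, league_avg, phantom_games):
    """Average a team's stat with a number of league-average 'phantom' games.

    Hypothetical helper: it treats the stat as a simple per-game average,
    which is an assumption made for illustration.
    """
    total_games = games_played + phantom_games
    if total_games == 0:
        return league_avg  # no information at all: fall back to the league average
    return (team_avg * games_played + league_avg * phantom_games) / total_games


# Illustrative numbers only: 4 real games at 8.0 yds/att, a league average
# of 6.0, and 4 phantom games.
adjusted = blend_with_phantom_games(8.0, 4, 6.0, 4)
print(adjusted)
```

With as many phantom games as real ones, the adjusted value lands halfway between the team's own average and the league average. As real games accumulate, the team's own number dominates.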

The answer is more complicated than expected. I plotted each team's 2006 stats by week. The graph for defensive pass efficiency is posted below as an example. The y-axis is yds allowed per attempt and the x-axis is week number. Although hard to read, it's easy to see that the 'funnel' stabilizes shortly before halfway through the season.

Each stat has its own rate of stabilization. For example, offensive run efficiency stabilizes very quickly, within 4 or 5 games. But defensive pass efficiency doesn't stabilize until about week 8. Turnover rates don't stabilize until week 9.

The compensation is reduced as each week goes by in the early season. Each week, one fewer phantom game is included in the team's averages until each stat has stabilized. For example, by week 5 the compensation for offensive run efficiency has run its course. And by week 8, defensive pass efficiency no longer contains any compensation. At this point in the season, between week 5 and 6, the resulting effect is to mitigate the importance of defensive pass efficiency and turnover rates.
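One way to picture the weekly reduction is as a per-stat countdown. The sketch below assumes the number of phantom games is simply the stabilization week minus the current week, floored at zero. That rule, the dictionary keys, and the function name are all assumptions for illustration. The post only says that one phantom game drops out each week.

```python
# Stabilization weeks as described in the post (approximate).
STABILIZATION_WEEK = {
    "off_run_efficiency": 5,
    "def_pass_efficiency": 8,
    "turnover_rate": 9,
}


def phantom_games_for(stat, week):
    """Phantom games still applied to `stat` at `week` (assumed countdown rule)."""
    return max(0, STABILIZATION_WEEK[stat] - week)


# Compensation remaining for each stat at week 5.
schedule = {stat: phantom_games_for(stat, 5) for stat in STABILIZATION_WEEK}
print(schedule)
```

Under this assumed rule, offensive run efficiency carries no compensation at week 5, while defensive pass efficiency and turnover rate are still being pulled toward the league average.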

## Instability Compensation

By
Brian Burke


"I think the overconfidence in the model at this point in the season comes from only a few data points for each statistic so far."

It would really be good to find a way to actually calculate out the error in the model. You should have errors in the regression coefficients, so that's the systematic error right there.

The question then becomes what are the errors in the input data? Most of your input statistics are rates, so assuming the error goes like Poisson (i.e. the error is proportional to the square root of the number of observations) isn't such a bad idea.
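As a rough illustration of the Poisson assumption, the sketch below puts an error bar on a rate by taking the square root of the event count and dividing by the number of attempts. The function name and the numbers are made up for illustration.

```python
import math


def rate_with_poisson_error(events, attempts):
    """Return (rate, approximate 1-sigma error), treating the event count as Poisson."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    rate = events / attempts
    error = math.sqrt(events) / attempts  # sqrt(N) error on the count, scaled to a rate
    return rate, error


# Illustrative numbers only: 4 interceptions in 100 passes.
rate, err = rate_with_poisson_error(4, 100)
print(rate, err)
```

With so few events, the error is half the size of the rate itself. That is the sense in which early-season rate stats carry little information.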

I'm not a fan of trying to ad-hoc correct limited data. Figure out the errors due to the poor data, and live with it. If your data is limited, you can't magically create more information. Doing magic like pulling teams to the mean, etc. is just ad-hoc: it works because the average team is average. But you get the same effect, without the artificial "they can't be THAT good" effect, if you just figure out the uncertainty in the model prediction.

Very Nice!

Re "Instability Compensation"

I agree with Pat.

"You should have errors in the regression coefficients, so that's the systematic error right there.

The question then becomes what are the errors in the input data? Most of your input statistics are rates, so assuming the error goes like Poisson (i.e. the error is proportional to the square root of the number of observations) isn't such a bad idea."

My compensation has never been stable anyway.

I do have the std errors of each coefficient, but for logit models, there typically isn't an overall std error like you see with linear regression. Should I aggregate the individual std errors? Should they be weighted/normalized?

I don't disagree about ad-hoc corrections. But in this case, without the corrections I would need a different model with different coefficients each week of the season. Each week would have its own uncertainty. I realize that is probably a more ideal way to model games, but it is not practical.

I wouldn't say this is totally ad-hoc. It's kind of Bayesian in a way, and is similar to how the human mind actually works. You have a prior distribution of team stats. Then, as games are played, you get a trickle of evidence about what the ultimate true values of each team's stats are. You modify the prior distribution with the new evidence to arrive at a posterior estimate.

Maybe - maybe not.

"Should I aggregate the individual std errors? Should they be weighted/normalized?"

Typically you add in quadrature, by percentage. That's assuming the effect of the input variables in the regression are orthogonal (i.e. there's no covariance matrix - well, it's unity).

"You have a prior distribution of team stats"

The prior distribution is what's ad hoc, not the method. Starting off with "each team is average..." means that while you won't have a bias, and you won't get the extremes, you'll still have just as many mispredictions as before, they just won't be as bold.

And at the end of the year, the mispredictions will, in fact, be biased: a team that was good will consistently be underestimated in the beginning of the year (you didn't 'trust' the data), vice versa for bad teams.

If you just figure out exactly how uncertain the predictions are ("the Broncos have a 97(+2-38)% chance of winning") you'll have the same benefit (no bold mispredictions) while still letting the input data converge as fast as it actually does.

Definitely for some of the stats that are numbers-limited (like, interception rate) you should be trying to take their errors into account.

It'd be a bit of a pain to do in something like Excel, though. It really wants to be a simple Monte Carlo.
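A simple Monte Carlo along these lines might look like the sketch below. It assumes a one-input logistic model with made-up coefficients and a normally distributed error on the input. None of these values come from the actual model, and a real version would draw every input stat (and possibly the coefficients too).

```python
import math
import random


def win_probability(stat_diff, intercept, coef):
    """Logistic model with a single hypothetical input (a stat difference)."""
    return 1.0 / (1.0 + math.exp(-(intercept + coef * stat_diff)))


def monte_carlo_interval(stat_diff, stat_err, intercept, coef, n_draws=10000, seed=0):
    """Propagate input uncertainty through the model by repeated random draws."""
    rng = random.Random(seed)
    draws = sorted(
        win_probability(rng.gauss(stat_diff, stat_err), intercept, coef)
        for _ in range(n_draws)
    )
    low = draws[int(0.16 * n_draws)]   # roughly the 16th percentile
    high = draws[int(0.84 * n_draws)]  # roughly the 84th percentile
    return win_probability(stat_diff, intercept, coef), low, high


# Made-up numbers: a large stat edge measured with a large error.
point, low, high = monte_carlo_interval(
    stat_diff=3.5, stat_err=2.0, intercept=0.0, coef=1.0
)
print(point, low, high)
```

Because a probability cannot exceed 1, the interval comes out lopsided: there is little room above the point estimate and a lot below it. That is the shape of the "97(+2-38)%" style of statement.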

I also should say, the fact that any model said a team has greater than a 95% chance of winning is a sign that you're extrapolating waaay too far.

Easy check: in the training data set (all data before this year) how many games have *ever* been played such that one team had a 95%+ chance of winning? What was the winning record of that team in those games?
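The check described above can be sketched in a few lines. The data layout, a list of (predicted probability for the favorite, whether the favorite won) pairs, is an assumption for illustration, and the games listed are toy values rather than real results.

```python
def high_confidence_record(games, threshold=0.95):
    """Count games where the favorite's predicted probability met the threshold,
    and how many of those games the favorite actually won.

    `games` is an iterable of (favorite_win_prob, favorite_won) pairs.
    """
    flagged = [(p, won) for p, won in games if p >= threshold]
    wins = sum(1 for _, won in flagged if won)
    return len(flagged), wins


# Toy data, not real results.
toy_games = [(0.97, True), (0.96, False), (0.80, True), (0.55, False), (0.95, True)]
n_games, n_wins = high_confidence_record(toy_games)
print(n_games, n_wins)
```

If the favorites in the flagged games win noticeably less often than 95% of the time, the model is extrapolating beyond what its training data supports.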

In fact, given the previous study regarding the relative weight of talent distribution/random luck in the NFL, I'm really surprised that you've got a model that can even go up to 95% chance of winning.

Ok. As soon as I figure out quadrature by percentage I can generate some error margins.

The reason for the >.95 confidence levels is due to early turnover stats. Early in the year, one or two very high/low turnover games can dominate the input.

"The reason for the >.95 confidence levels is due to early turnover stats. Early in the year, one or two very high/low turnover games can dominate the input."

Exactly why you need error margins. The simple fact is that early in the season, you just can't tell the difference between 1 interception every 40 passes and 1 interception every 25 passes.

You know, there's an easy way to handle this: tune the model based on the number of passes. That is, start with datasets for teams with between 0-40 passes, between 40-100 passes, etc. What you'll see is that the interception percentage will have no statistical merit for the "low numbers of passes" dataset, and it'll build up as the sample size increases. That might clean up the bias.

(I've never heard of doing this statistically, so I'm not sure if it's solid or not. I'd have to think about it.)

Quadrature means things add in squares (since you're presuming the errors are orthogonal) - i.e. error squared = error 1 squared + error 2 squared.
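In code, adding in quadrature is a one-liner. A minimal sketch, assuming the errors are independent (the function name is hypothetical):

```python
import math


def add_in_quadrature(*errors):
    """Combine independent errors: sqrt(e1^2 + e2^2 + ...)."""
    return math.sqrt(sum(e * e for e in errors))


# Illustrative values: errors of 3 and 4 combine to 5, not 7.
combined = add_in_quadrature(3.0, 4.0)
print(combined)
```

To add "by percentage", pass in the relative errors (each error divided by its value) instead of the absolute ones.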