This is the third part of a four-part article discussing the relative importance of factors in winning NFL games. Part 1 is here and Part 2 is here.

**CORRELATION SUMMARY**

So far we’ve analyzed each phase of the game and its statistical connection with regular season team wins. Below is a table that lists the relevant statistics and their correlations. The table is sorted in order of absolute strength of correlation.

| Stat | Win Correlation |
| --- | --- |
| Off Pass Yds/Att | 0.61 |
| Def Pass Yds/Att | -0.47 |
| Off Fumble Rate | -0.46 |
| Off Int Rate | -0.45 |
| Def FFumble Rate | -0.41 |
| Def Int Rate | 0.39 |
| Off Pen Rate | -0.37 |
| Off Run Yds/Att | 0.18 |
| Def Run Yds/Att | -0.04 |

The table is presented graphically below. Negative coefficients, such as defensive pass efficiency, are shown as positive values to make it easier to compare each variable's relative importance.

The relative importance of each aspect of the game begins to come into focus. Passing is most important, followed by turnovers, then penalties and running. For every aspect, the correlation on the offensive side of the ball is stronger than on the defensive side.

But this isn’t the final word. Each correlation coefficient is calculated for one stat in isolation, so it ignores the effects of all the other stats.
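For readers who want to see how a number like those in the table is produced, here is a minimal sketch of the Pearson correlation calculation. The function name and the sample numbers are made up for illustration; they are not the data set behind the table.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical numbers for illustration only, NOT real NFL data.
pass_yds_per_att = [5.5, 6.0, 6.4, 7.1, 7.8]
wins = [4, 7, 6, 10, 12]
r = pearson_r(pass_yds_per_att, wins)
print(f"correlation with wins: {r:.2f}")
```

Note that the calculation only ever looks at two columns at a time, which is exactly why it cannot account for the other stats.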

**REGRESSION**

To take all facets of the game into account simultaneously and produce a valid model of winning NFL games, we can use linear regression to estimate coefficients for each stat. The relative value of the coefficients will reveal the relative importance of each phase of the game, holding all other variables constant. This will yield estimates that are more pure and accurate than simple correlations.

The dependent variable of the regression model is regular season wins. The independent variables are the efficiency stats I’ve previously outlined. The data set continues to be all 32 teams over the past 5 seasons for a total of 160 observations. The results of the regression are detailed below.
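A regression like this can be fit with any statistics package. As a rough sketch of what the software does under the hood, the code below solves the ordinary least squares normal equations for a tiny made-up data set with two predictors. The helper names and the numbers are assumptions for illustration, not the actual 160-observation data set or the software behind the results below.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting.

    Assumes the system is non-singular (no perfectly collinear predictors).
    """
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows, y):
    """Return [intercept, b1, b2, ...] that minimize the squared error."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 is the intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical (off pass eff, def pass eff) pairs, NOT real team data.
rows = [(6.0, 6.5), (7.0, 6.0), (5.5, 7.0), (6.5, 5.5), (7.5, 6.8)]
# Synthetic "wins" built from known coefficients so the fit can be checked.
wins = [5.0 + 1.5 * o - 1.0 * d for o, d in rows]
coefficients = fit_ols(rows, wins)  # approximately [5.0, 1.5, -1.0]
```

In practice a library routine would also report standard errors and significance levels, which this sketch leaves out.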

| VARIABLE | COEFFICIENT |
| --- | --- |
| const | 5.31 |
| O Pass Eff | 1.43 |
| D Pass Eff | -1.65 |
| O Int Rate | -53.50 |
| D Int Rate | 81.70 |
| O Fum Rate | -49.10 |
| D FF Rate | 70.90 |
| O Run Eff | 1.00 |
| D Run Eff | -0.55 |
| Pen Rate | -2.73 |
| R-squared | 0.802 |

Each of the independent variables is statistically significant at the 0.05 level or better, except defensive run efficiency, which is significant at the 0.06 level. The R-squared value indicates an extremely good overall fit for the model: about 80% of the variance in team wins can be explained by the included variables. The remaining 20% could be due to any number of factors, but we have to accept that outcomes in any sport are partly due to luck.

Using the regression results we can estimate a team’s expected wins using a linear equation. Here is what the equation would look like:

Wins = 5.31 + (1.43 * O Pass Eff) + (- 1.65 * D Pass Eff) + …
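Spelled out in full with the coefficients from the regression table, the equation can be sketched as a small function. The dictionary keys are names chosen for this sketch, and every input has to be in the same units as the original regression, which are not restated here, so treat this as an illustration rather than a ready-made calculator.

```python
# Coefficients copied from the regression table; key names are hypothetical.
INTERCEPT = 5.31
RAW_COEFFICIENTS = {
    "o_pass_eff": 1.43,
    "d_pass_eff": -1.65,
    "o_int_rate": -53.50,
    "d_int_rate": 81.70,
    "o_fum_rate": -49.10,
    "d_ff_rate": 70.90,
    "o_run_eff": 1.00,
    "d_run_eff": -0.55,
    "pen_rate": -2.73,
}


def expected_wins(stats):
    """Expected wins: intercept plus the sum of coefficient * stat value."""
    return INTERCEPT + sum(
        coef * stats[name] for name, coef in RAW_COEFFICIENTS.items()
    )


def marginal_wins(stat_name, change):
    """Change in expected wins for a change in one stat, all else held equal."""
    return RAW_COEFFICIENTS[stat_name] * change


# One extra yard per pass attempt is worth 1.43 expected wins.
print(marginal_wins("o_pass_eff", 1.0))
```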

The regression coefficients are stated in terms of wins per unit of the variable. For example, the coefficient for offensive pass efficiency (yds/att) is 1.43. So for every 1-yard improvement in pass efficiency, a team can expect 1.43 additional wins, all else being equal. Stating coefficients this way makes it very easy to estimate the effect on the dependent variable (wins) of a change in one of the independent variables. But because the variables are measured on different scales, it also makes it very difficult to get a sense of the relative importance of each variable. Defensive forced fumble rates are certainly not 70 times more important than offensive run efficiency.

To reveal the true relative importance of each factor, we need to standardize each variable by expressing it as the number of standard deviations from its average value. In statistics, these are known as “normalized” or “standardized” variables, denoted by the prefix “z.”
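As a minimal sketch of the standardizing step, the function below converts a list of raw values into z-scores. The function name and sample values are made up for illustration, and the choice of the population standard deviation (rather than the sample version) is an assumption; with 160 observations the difference would be small.

```python
import statistics


def standardize(values):
    """Convert raw values to z-scores: (value - mean) / standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population SD; an assumption of this sketch
    return [(v - mean) / sd for v in values]


# Hypothetical yards-per-attempt figures, NOT real team data.
pass_eff = [5.4, 6.1, 6.3, 6.8, 7.4]
z_pass_eff = standardize(pass_eff)
print([round(z, 2) for z in z_pass_eff])
```

When only the predictors are standardized and wins are left in raw units, each standardized coefficient works out to the raw coefficient multiplied by that stat's standard deviation. That is what puts yards per attempt and turnover rates on a common footing.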

Here are the regression results again, this time calculated with standardized variables. The significance of each variable and the overall fit of the model remain the same, since only the units of the variables have changed.

| VARIABLE | COEFFICIENT |
| --- | --- |
| constant | 8.06 |
| Z O Pass Eff | 1.14 |
| Z D Pass Eff | -0.92 |
| Z O Int Rate | -0.45 |
| Z D Int Rate | 0.76 |
| Z O Fum Rate | -0.33 |
| Z D FF Rate | 0.42 |
| Z O Run Eff | 0.46 |
| Z D Run Eff | -0.24 |
| Z Pen Rate | -0.39 |

It may seem like we’ve gone through a tortured process to arrive at these coefficients. But they are merely the mathematical weight we would need to give each factor to have the best estimate of actual team wins. These are based on real-world data from every team’s season between 2002 and 2006.

Here is a graph representing each variable’s relative weight. Negative coefficients are shown as positive values.

Probably the simplest way to interpret the chart is this: if my team is average in absolutely everything, I'd expect to win about 8 games. But if my team is average in everything except offensive pass efficiency, in which we're one standard deviation above average, I'd expect to win 9.14 games (8 + 1*1.14).

So if my team were the league's best at running the ball, say 2.5 standard deviations above average, but average at everything else, we'd expect to win 9.15 games (8 + 2.5*0.46). Compare that to passing: if my team were average at everything but best in the league in passing, we'd expect to win 10.85 games (8 + 2.5*1.14).
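The same arithmetic can be sketched as a short function using the standardized coefficients from the table. The key names are chosen for this sketch, and any stat left out of the input is assumed to be exactly league average (z = 0). The table's constant is 8.06; the examples pass a rounded intercept of 8 so the results match the figures in the text.

```python
# Standardized coefficients copied from the table; key names are hypothetical.
STANDARDIZED_COEFFICIENTS = {
    "o_pass_eff": 1.14,
    "d_pass_eff": -0.92,
    "o_int_rate": -0.45,
    "d_int_rate": 0.76,
    "o_fum_rate": -0.33,
    "d_ff_rate": 0.42,
    "o_run_eff": 0.46,
    "d_run_eff": -0.24,
    "pen_rate": -0.39,
}


def expected_wins_from_z(z_scores, intercept=8.06):
    """Expected wins given z-scores; stats not listed are treated as average."""
    return intercept + sum(
        STANDARDIZED_COEFFICIENTS[name] * z for name, z in z_scores.items()
    )


# One SD above average in passing, average elsewhere.
good_passing = expected_wins_from_z({"o_pass_eff": 1.0}, intercept=8.0)
# League-best (2.5 SD) running versus league-best passing.
best_running = expected_wins_from_z({"o_run_eff": 2.5}, intercept=8.0)
best_passing = expected_wins_from_z({"o_pass_eff": 2.5}, intercept=8.0)
print(round(good_passing, 2), round(best_running, 2), round(best_passing, 2))
```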

Continue reading the fourth and final part of this article.

First off, this is an outstanding blog that I enjoy and read regularly.

Second, what is the adjusted R-squared of your season win model? The R-squared you reported was 0.802, but I wonder what effect keeping all those regressors (especially defensive run efficiency) has on the fit.

Max-Thanks. Didn't Homer Simpson go by that name in an episode? Classic.

The adjusted R-squared for the model cited in this post is 0.791.

I think one reason the adjusted value is so high is that the model isn't a "kitchen-sink" model. A lot of people are tempted to throw in first downs, touchdowns, field goals, etc. which are really just intermediate results between yards gained in individual plays and team wins.
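For anyone who wants to check the adjustment, here is a minimal sketch of the standard adjusted R-squared formula, using the 160 observations and 9 independent variables described in the post. The function name is made up for this sketch. Plugging in the rounded R-squared of 0.802 gives about 0.790, which agrees with the 0.791 above to within rounding of the inputs.

```python
def adjusted_r_squared(r_squared, n_observations, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    penalty = (n_observations - 1) / (n_observations - n_predictors - 1)
    return 1.0 - (1.0 - r_squared) * penalty


# 160 team-seasons, 9 independent variables, R-squared as reported (rounded).
print(round(adjusted_r_squared(0.802, 160, 9), 3))
```

With this many observations per variable, the penalty factor is only 159/150, so the adjusted value stays close to the unadjusted one.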

The game is far too complicated to take this analysis seriously. Factors such as Time of Possession, first downs/game and Average Time of Possession per offensive drive have not been taken into consideration in this quasi-mathematical analysis.

Apparently, it is too complicated for some people.

In case anyone is curious about time of possession, it does not correlate with winning. TOP is an 'intervening variable' between running and passing performance, and winning. So TOP is an intermediate result of success in other aspects of the game, not an ability in itself.

Hi..just found your site and I really enjoy it. Slowly reading all articles to get up to speed. I also try to predict NFL games and I use a regression.

I start with the assumption that game box score stats are the result of field position, score and time left in the game. I assume the box score stats are a result of the score, not a cause.

One minor nitpick with "what makes team win", which also applies to other posts, is that the game stats (no matter how massaged) do not make a team win. The game is physical, violent and chaotic, and the team with the best combination of bigger and better athletes and better luck wins. The map is not the territory, and the model is not reality. I think we both know that, but that does not stop us from looking for the perfect (or at least a better) model.

Keep up the good work.

Brian,

As always, great article. I've recently come across your blog and am going through it all chronologically.

Just a few clean-up points at some of the numbers at the end of the article.

1) I believe the expected number of wins in the last line of the second-to-last paragraph is 9.14 (not 9.94).

2) In the last sentence, I believe the formula you want to show is (8+2.5*1.14) and not (8+1.5*1.14).

Keep up the fantastic work, and I look forward to working through the rest of your articles!

-Fred

Fixed. Thanks, Fred!

Run points against wins, and all the different stats against points.

This is really great stuff. I am working on an advanced stats project for school. Because I am a Seattle Seahawks fan, and yes I still roll in my sleep over the SB loss to Pittsburgh, I am trying to find a correlation between penalties and the impact of momentum in a game. Not sure if a statistical model is the best tool to predict this... Any advice?

First you'd have to figure out a way to define and measure momentum. I'm not convinced there's any such thing.

One thing I might do is use win probability to evaluate a team's WP before a big penalty and after the penalty. You could select high profile game-turning penalties (like in the SEA-PIT SB) as case studies.

Hi Brian;

I was rereading one of your old (and great!) posts. Some questions about your model:

1. Are all 9 variables in your prediction model 'repeatable' (i.e. attributable to skill)?

If so, could you post your findings, i.e. first-half to second-half correlations? This way we could see how much luck is involved in each variable.

Thanks. I seem to recall you raising the point in later posts that def. interceptions weren't. (I also wonder if penalty rate might be part luck, since it depends on ref. judgement and on when you get the penalty, similar to your idea of when a team gets a series of first downs?)

If so, have you removed or altered interceptions from your model?

2. Does rushing eff. correlate at all with first downs?

I realize you seem to have proved that running doesn't set up the pass. However, I have a habit of checking team stats after an upset, and a lot of the time the 'upset' team had a large difference in run efficiency, even when they had an equal deficiency in pass efficiency. I know these upsets could just be part of 'luck'; however, it seems that when a team has a 'dominant' advantage in run efficiency, it makes the defence have to play run and pass. I was just thinking that these teams might have shorter pass completions by choice, since they are in 2nd and short and 3rd and short more often (thus a correlation from run eff. to first downs).

3. My math is average, but does a .61 correlation mean that .61^2 = 37% is how much the stat explains the results?

Thanks, Dan

Brian,

Anecdotally, mean reversion seems to play a role in sports, i.e., teams that lose big seem to have a tendency to come back in a big way the following game, and vice versa. Have you seen this effect in the data?

Brian, I have been playing with the dataset you provided and can match your results on all but Off Pen Rate. The best I can get for it is a correlation of -.27 which leads to an R^2 of .725 ... can you shed some light on what stats you used to figure the penalty rate? I've tried about all the combinations (ie off + def, number and yds, etc) and no luck.

KenyonLV

How did you calculate the constant in the coefficient tables?

Incredible research and website. I am 17 years old and very interested in football spreads and statistics. I am very curious how you got each of the coefficients for each category in the table, and was wondering why Def FFumble Rate has a negative correlation. I believe that forcing fumbles is positive from a defensive standpoint. Thank you for all of this!

Brian, shouldn't yards per attempt (whether rushing or passing) be the same, and thus have the same coefficients? I.e., a particular game has a result (dep. variable) and independent variables such as passing yards per attempt, but if team A gets 7 ypa, then team B gives up 7 ypa.

or am I not looking at this correctly?

thanks