This is the fourth part in my series examining the concept of momentum in NFL games. The first part looked at whether teams that gained possession of the ball by momentum-swinging means went on to score more frequently than teams that gained possession by regular means. The second part looked at whether teams that gained possession following momentous plays went on to win more often than we would otherwise expect. The third part focused on drive success following a turnover on downs, which is often cited by coaches and analysts as a reason not to go by the numbers when making strategic decisions.

This article will examine how 'streaky' NFL games tend to be. If momentum is real and it affects game outcomes, it would result in streaks of success and failure that are longer than we would expect by chance. But if consecutive plays are independent of previous success, the streaks of success and failure will tend to be no longer than expected by chance. This method of analysis does not rely on any particular definition of a precipitating momentum-swing, as it looks at entire games to measure whether success begets further success and whether failure leads to more failure.

For momentum to have a tangible effect on games, it does not require completely unbroken strings of successful or unsuccessful plays. But if success does enhance the chance of subsequent success, then the streaks of outcomes will be longer than chance alone would produce.

For this analysis, I applied the *Runs Test* to the sequence of plays in a game. This produces a statistic indicating how streaky a string of results is compared to what would be expected by chance. For example, consider the following 3 strings of results of flipping a coin 8 times:

HTHTHTHT, HHHHTTTT, HTTTHTHH

The Runs Test works like this:

The average expected number of runs of the same result (the 'mu' symbol) is calculated based on the number of 'successful' trials (N+) and the number of 'unsuccessful' trials (N-): mu = 2(N+)(N-) / (N+ + N-) + 1. In the 3 examples above, each with 4 heads and 4 tails, the test says we should expect an average of 5 unbroken 'runs' of one result or the other, since 2(4)(4)/8 + 1 = 5.

In the first example above, we have 8 separate runs, which is choppier and more alternating than we would expect. In the second example, we see only 2 runs, which is fewer than expected and might suggest some sort of trend is at work. The third example is more in line with what we would commonly expect by chance, producing 5 runs of heads and tails.
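As a minimal sketch of the counting step, the snippet below tallies the runs in the three coin-flip strings and compares them with the expected number under independence. The function names `count_runs` and `expected_runs` are my own for illustration, and the expected-runs expression is the standard one for the Runs Test, 2(N+)(N-)/(N+ + N-) + 1.

```python
from itertools import groupby


def count_runs(seq: str) -> int:
    """Count maximal unbroken runs of identical symbols in a sequence."""
    return sum(1 for _ in groupby(seq))


def expected_runs(n_pos: int, n_neg: int) -> float:
    """Expected number of runs if trials are independent."""
    return 2 * n_pos * n_neg / (n_pos + n_neg) + 1


for flips in ("HTHTHTHT", "HHHHTTTT", "HTTTHTHH"):
    heads, tails = flips.count("H"), flips.count("T")
    print(flips, "observed:", count_runs(flips),
          "expected:", expected_runs(heads, tails))
```

All three strings have 4 heads and 4 tails, so they share the same expected value and differ only in the observed count.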

Instead of the heads and tails of coin flips, I examined the success of football plays. I classified the plays of a game as successful (S) or unsuccessful (U) based on whether the play improved or worsened a team's net scoring potential, as measured by Expected Points. (This is the basis of the ANS Success Rate statistic.) So instead of HTTTHTHH... we would have a string like SUUUSUSS...

One thing to keep in mind is that unlike coin flips, some teams are better than others, so there isn't a 50/50 chance of success for plays in any particular game. Fortunately, the Runs Test accounts for this kind of imbalance.

For each game we can measure how streaky the plays were. If the plays are significantly more streaky than we would expect from an independent distribution like flipping a coin, then we have evidence of momentum. For every game from the 1999 season through week 8 of the 2013 season, I examined the sequence of play success for all common scrimmage plays *from the home team's perspective*.

This is a key element of my methodology because it would detect momentum for a team's offense or defense alone, as well as any kind of carry-over from one side of the ball to the other. In other words, this method will detect whether an offense or defense 'gets on a roll' in addition to whether plays on one side of the ball inspire teammates on the other side to play better. No matter what your definition of momentum, this method should be able to detect evidence if it exists.
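To make the home-team perspective concrete, here is a rough sketch of how a game's plays could be turned into a string of S and U. Everything about the data layout is an assumption for illustration: each play is a hypothetical pair of (is the home team on offense, change in Expected Points for the offense), and a play with exactly zero change is counted as unsuccessful, which the article does not specify.

```python
from typing import List, Tuple


def home_success_string(plays: List[Tuple[bool, float]]) -> str:
    """Label each play S or U from the home team's point of view.

    plays: hypothetical (home_on_offense, ep_change_for_offense) pairs.
    """
    labels = []
    for home_on_offense, ep_change in plays:
        # A gain for the visiting offense is a loss for the home defense.
        home_gain = ep_change if home_on_offense else -ep_change
        # Assumption: a change of exactly zero counts as unsuccessful.
        labels.append("S" if home_gain > 0 else "U")
    return "".join(labels)


# Made-up numbers, not taken from any real game.
sample_plays = [(True, 0.4), (True, -0.6), (False, 0.3), (False, -0.9), (True, 1.2)]
print(home_success_string(sample_plays))
```

Because offensive and defensive plays land in one sequence, a run can span a change of possession, which is what lets the test pick up carry-over between the two units.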

The bottom line is that we are testing for *independence* between successive plays. If plays are independent 'trials', then there can be no momentum.

Let's start with an example. The Broncos hosted the Eagles in week 4 this season, winning 52-20. In that game, DEN had a total of 147 plays on both sides of the ball. They had 90 successful plays and 57 unsuccessful plays, resulting in 62 streaks. Was this streaky or not?

We expected about 71 streaks (by the expected-runs formula, 2 x 90 x 57 / 147 + 1 = 70.8), but there were actually only 62 runs of the same result. So this game was streakier than we might expect. That doesn't yet say whether momentum is at work. Some games are bound to be streakier than others by chance. So how unlikely is an outcome like this? The standard error of the runs test can be calculated from its variance, and this produces a p-value, which is the probability of such an outcome by chance.

In the case of the PHI-DEN game, the p-value was 0.12. That doesn't meet the traditional cut-off of .05 for statistical significance, but it's still fairly unlikely.
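The arithmetic for this game can be reproduced with a short sketch. It uses the textbook normal approximation for the Runs Test, with variance (mu - 1)(mu - 2)/(N - 1), no continuity correction, and a two-sided p-value. Those choices are assumptions on my part; they happen to land close to the 0.12 quoted above, but the exact procedure behind that figure isn't spelled out. The function name `runs_test` is illustrative.

```python
import math


def runs_test(n_pos: int, n_neg: int, observed_runs: int):
    """Normal-approximation Runs Test; returns (mu, se, z, two-sided p)."""
    n = n_pos + n_neg
    mu = 2 * n_pos * n_neg / n + 1              # expected runs
    variance = (mu - 1) * (mu - 2) / (n - 1)    # variance of the run count
    se = math.sqrt(variance)
    z = (observed_runs - mu) / se
    # Two-sided tail probability of a standard normal (assumption).
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return mu, se, z, p_two_sided


# DEN vs PHI: 90 successful plays, 57 unsuccessful, 62 observed runs.
mu, se, z, p = runs_test(90, 57, 62)
print(f"expected={mu:.1f}  se={se:.2f}  z={z:.2f}  p={p:.3f}")
```

A negative z means fewer runs than expected, which is the streaky direction; a positive z would mean a choppier, more alternating game.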

One game doesn't tell us much, but 3,875 of them can. I repeated this analysis for every game since the start of the 1999 season, and here are the results:

-The average expected number of runs was 67.0.

-The average observed number of runs was 64.7.

-The average standard error for a single game was +/-8.0 runs.

-49 games (or 1.2%) of the 3,875 games exceeded the p=0.05 threshold for streakiness

The number of observed runs per game was less than the expected number, which indicates there is more streakiness than if all the plays were purely independent. The difference is 2.3 runs per game. However, very few games were more streaky than expected to a statistically significant degree.
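One way to reconcile a consistent average shortfall with so few individually significant games is to look at the standard error of the average rather than of a single game. The back-of-envelope sketch below assumes that games are independent of one another and that the average per-game standard error of 8.0 runs is a fair stand-in for the game-to-game spread; both are simplifications rather than part of the analysis above.

```python
import math

n_games = 3875
mean_expected = 67.0   # average expected runs per game
mean_observed = 64.7   # average observed runs per game
per_game_se = 8.0      # average single-game standard error

# Standard error of the mean shrinks with the square root of the sample size.
se_of_mean = per_game_se / math.sqrt(n_games)
z = (mean_observed - mean_expected) / se_of_mean
print(f"se of mean = {se_of_mean:.4f}, z = {z:.1f}")
```

Under those assumptions the average gap is many standard errors from zero, so it is unlikely to be noise, even though it is small in absolute terms.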

I interpret these results to say that yes, there may be some degree of momentum, on average, in an NFL game. But it is imperceptibly small, and we can only point to a handful of games over the past 14 years that are particularly momentum-rich.

I characterize the difference as imperceptible because 2.3 runs is only about 3% fewer than expected. It's a difference of only about 1 play in a game. In other words, we only need to flip one play from a success to a non-success (or vice versa) to create 2 additional runs. If there were a string of heads and tails like HHHHTHTT with 4 runs, we only need to change one result to create 6 runs: HTHHTHTT.
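The one-flip example is easy to check directly. This small sketch counts the runs before and after the change; `count_runs` is an illustrative helper, not part of any library.

```python
from itertools import groupby


def count_runs(seq: str) -> int:
    """Count maximal unbroken runs of identical symbols."""
    return sum(1 for _ in groupby(seq))


before = "HHHHTHTT"
after = "HTHHTHTT"   # only the second flip differs

flips_changed = sum(a != b for a, b in zip(before, after))
print(flips_changed, count_runs(before), count_runs(after))
```

Flipping a result in the middle of a run splits that run in two and adds a new run of length one, which is where the two extra runs come from.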

Further, the difference between success and failure can be razor thin. A run of 5 yards on 1st down is typically a success, but a run of 3 or 4 is usually not. That's the practical size of the effect at work. I don't think one or possibly two additional consecutive successes or failures in an entire game are what believers in momentum have in mind.

Still, these results do show some evidence of momentum. That's interesting in itself. However, this result may only be due to teams with offenses and defenses at opposite ends of the performance spectrum. For example, when the 2000 Ravens defense was on the field, we were likely to see streaks of successful plays, but when their offense was on the field, we were likely to see streaks of unsuccessful plays. On net they had an average success rate somewhere between their two squads' rates, which increases the number of runs we should expect.
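A toy simulation illustrates how lopsided units alone could produce this pattern. In the sketch below every play is independent, so there is no momentum by construction, but the team's success probability is 0.3 when its offense is on the field and 0.7 when its defense is. The probabilities, the fixed 6-play drives, and the 140-play games are all made-up assumptions chosen for illustration.

```python
import random
from itertools import groupby


def count_runs(seq) -> int:
    return sum(1 for _ in groupby(seq))


def expected_runs(n_pos: int, n_neg: int) -> float:
    return 2 * n_pos * n_neg / (n_pos + n_neg) + 1


def simulate_game(rng, n_plays=140, drive_len=6, p_offense=0.3, p_defense=0.7):
    """Independent plays whose success rate depends only on which unit is on."""
    plays = []
    for i in range(n_plays):
        on_offense = (i // drive_len) % 2 == 0   # alternate fixed-length drives
        p = p_offense if on_offense else p_defense
        plays.append(rng.random() < p)
    return plays


rng = random.Random(0)
n_games = 500
observed_total = expected_total = 0.0
for _ in range(n_games):
    plays = simulate_game(rng)
    successes = sum(plays)
    observed_total += count_runs(plays)
    expected_total += expected_runs(successes, len(plays) - successes)

mean_observed = observed_total / n_games
mean_expected = expected_total / n_games
print(f"mean expected runs {mean_expected:.1f}, mean observed {mean_observed:.1f}")
```

The pooled test sees a success rate near 50% and expects many runs, while the actual sequences are clumped by possession, so observed runs fall short of expected without any play influencing the next.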

In the next part of this study, I'll separate offenses and defenses, and also look at series-level success. As mentioned, the difference between success and failure at the play-level can be very thin, and momentum may manifest itself at a higher level. For example, an offense could string together plays like this: SUSUSUS, and still march down the field to score as long as the successes are of enough magnitude to convert first downs. Other methods of analysis could include autocorrelation plots.
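For the autocorrelation idea, a minimal sketch of the quantity such a plot would show is below. It codes S as 1 and U as 0 and computes the sample autocorrelation at a given lag; the function name and the choice to return 0.0 for a constant sequence are my own assumptions.

```python
def lag_autocorrelation(seq: str, lag: int = 1) -> float:
    """Sample autocorrelation of an S/U string at the given lag."""
    x = [1.0 if c == "S" else 0.0 for c in seq]
    mean = sum(x) / len(x)
    dev = [v - mean for v in x]
    denom = sum(d * d for d in dev)
    if denom == 0:
        # Constant sequence: autocorrelation is undefined; return 0.0 (assumption).
        return 0.0
    num = sum(dev[i] * dev[i + lag] for i in range(len(dev) - lag))
    return num / denom


print(lag_autocorrelation("SUSUSUSU"))   # alternating sequence
print(lag_autocorrelation("SSSSUUUU"))   # blocked sequence
```

Positive values at short lags would point toward streakiness and negative values toward alternation, so plotting this across lags gives a view of play-to-play dependence that complements the run count.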

If you're curious, here are a few of the most streaky games in the data:

2000 MIA 37 at NYJ 40 (a classic OT comeback) - 84 expected runs, 57 observed

2002 JAX 23 at CAR 24 (another comeback) - 66 expected, 44 observed

2012 SD 27 at NYJ 17 - 61 expected, 42 observed

2013 DAL 21 at SD 30 (a great visual example of streakiness) - 68 expected, 48 observed

And here's a good example of non-streakiness. You can actually see the choppiness in the graph:

2012 TEN 23 at IND 27 - 71 expected runs, 81 observed

I'm thinking some of the "streakiness" observed here can be attributed to differences in team strength.

Take an extreme example where both teams have the best offense you can imagine and, at the same time, the worst defense you can imagine, so that every play results in a success for the offense. In that case, each possession is a run that lasts for as many plays as the drive, and you would observe a "streaky" game in this method, even though the cause is completely explainable by differences in team strength. I would think you would want to try to control for something like this in the analysis.

Keep up the good work!

Brian A - I apologize. I updated the post earlier this morning to address exactly that point. I'm betting you got a hold of the early draft. In short, I agree.

Ah, very good. Thanks for the reply.

I like this more general approach to testing for momentum.

It does seem like the expected success rate will vary depending on how risky the situation is. Compare, for example, the expected success rates for going for it on fourth and goal from the 6, or kicking the field goal on fourth and goal from the 15.

If the expected success rate has lots of runs, then that could easily lead to runs in the successes without any team performance carry on effect.

Great analysis and very interesting. Something tells me you would expect to see more streakiness/fewer runs than you would expect from an independent model because NFL plays are not independent.

There are all sorts of adjustments, both tactical and personnel, that move the chance of the next play being a success all over the place. It might not be a case of 'success begets more success'; I just believe there would be a lag of at least a few plays before a coach can make whatever changes are needed to bring the balance back towards him.

At a guess, the function denoting the probability of success on a given play should be able to be modelled as a stochastic process. I may have a look into this.

Does this take into account games when fans get arrested for streaking?

It seems like you might get a little bit of streakiness in the data overall just from teams piling up 'unsuccessful' plays at the ends of halves or when they're in run-out-the-clock mode. Did you try to remove those plays, or plan to check using WP instead of EP?

Great stuff. I'm not sure I agree with the conclusion that this data shows evidence of streakiness, though. Doing a one sample t-test on the data provided (mean=64.7, SEM=8.0, expected mean=67, N=3875), I get a p-value of .77 and a 95% CI of (-18.0, 13.4) for the difference. Unless I'm missing something, this would fail to falsify the null hypothesis of no streakiness.

As for the fact that 1.2% of games were streaky at a p=.05 level, doesn't the null hypothesis predict that up to 5% of games will meet this significance threshold? If anything this result might indicate the plays are more "unstreaky" than predicted by random chance--in fact doing a chi-square test seems to suggest just that. Using an observed frequency of 49/3875 and an expected frequency of 193.75/3875 (5%), I get Chi-squared=113.8 with 1 df, which is significant at the p<.0001 level.

does this analysis look at play by play differences in "success"? that is far too small a window.

So every incomplete pass would break a "streak" of success for an offense with momentum.

Also, it seems that success could be long streaks, but failure could only have a maximum of 3 in a row, then it's a punt.

Anon, that's the standard error for a single game. For a single game, s/sqrt(1) = 8, so s = 8; the standard error for the whole sample should then be 8/sqrt(3875) = .1285, which puts the difference far beyond the 95% confidence level. Of course, I had the same thought as others: when the Seahawks demolished the Cardinals 58-0, they didn't outplay them in every phase of the game for 60 minutes because of momentum; it was because they were a far superior team. That'd be tough to control for, but you could use pregame GWP and leave out week 1 games.

Nathan: D'oh, you're absolutely right. The result's highly statistically significant (although the effect size is small). It's still interesting that the number of games that met the p=.05 significance threshold was actually less than predicted by the null hypothesis (unless I'm making another stupid mistake).

I don't think your test for whether the NFL is streaky is correct. Counting how many games fall below the .05 p-value isn't informative; isn't the correct comparison to compare the two means and determine a p-value from the standard error? If the difference in the means is really small but you have a lot of data, you can still conclude that the difference in the means is statistically significant.

Hi Brian,

Methinks you need to issue a cease and desist or plagiarism notice against your namesake!

http://www.nfl.com/news/story/0ap2000000296566/article/december-momentum-just-isnt-everything-its-cracked-up-to-be

Best,

Pete

The difference in the means is statistically significant but not practically significant.

Surprised Billick would take that position.

NFL offenses and defenses are rated more for the success/failure of each drive as a whole, rather than play-by-play sequences. For example, if you get stuffed on 1st and 2nd down, but convert on 3rd down, that is better than gaining yards on 1st and 2nd but turning the ball over on 3rd. Even more than that, if you gain 4 first downs in a row, but fumble the ball away on the 5th set of downs, that is a failure—even though you were piling up “momentum”-positive sequences beforehand.

Likewise, a long drive which results in a touchdown has a lot more value, obviously, than a drive of the same length which results in a field goal. I’m not clear on how this analysis accounts for that difference as well. (The Patriots, for example, have for years ceded a lot of yards, in relation to other teams, but have minimized the number of points resulting from those yards... I would think an analysis of a team like that might result in some misleading conclusions vis-a-vis momentum.)

Maybe I’m misunderstanding the methodology—I’m not a statistician—but thought I’d throw that out there.