This post at Football Outsiders caught my eye today. The IgglesBlog noticed something odd with their team rankings. I’ve noticed the same phenomenon in my own systems—team ranking methods that adjust for opponent strength tend to produce rankings that correlate inversely with a team’s strength of schedule. In other words, top-ranked teams appear to have weaker schedules and low-ranked teams appear to have stronger schedules. The problem is, assuming that a ranking method properly adjusts for opponent strength, it ostensibly should produce no correlation between each team’s ranking and its opponents' average ranking. In fact, we might expect the opposite result because of the two “strength of schedule” games each season—last year’s 1st-place teams play other 1st-place teams, and so on.

In 2011, FO’s “DVOA” method correlated with opponent strength at -0.66, which is considerable. Here at ANS, Generic Win Probability correlated with Average Opponent GWP at -0.60 this season. FO notes that in other years the correlation isn’t nearly as strong, but there is an apparent tendency toward negative correlations in most seasons.

This phenomenon was first pointed out to me a couple years back by a reader, and I too thought it was either a) randomness, or b) a flaw with my methodology. But I soon realized this is exactly what we should expect given the NFL’s scheduling rules. It’s neither luck nor a flaw. In fact, it's a sign the method is doing something right.

Consider a fictional four-team football league. Presume we have a perfect team ranking system that can peer omnisciently into each team’s soul to know its True Winning Probability (TWP). The Sharks, Knights, River Dogs, and Jack Rabbits have TWPs of 0.75, 0.60, 0.40, and 0.25, respectively. (Notice the TWPs average to 0.50, as they would have to.)

Now, suppose this league plays a schedule where each team plays each of the others twice. The top-ranked Sharks, who have a TWP of 0.75, would have an average opponent TWP of 0.42: (0.60 + 0.40 + 0.25)/3. And if we calculate the average opponent TWPs for all the teams, we’d get the chart below, which plots average opponent TWP vs. team TWP.

The better the team, the easier its schedule. In fact, the correlation is a perfect -1. The reason that strength of schedule will always tend to correlate inversely with team strength is that

*a team can never play itself*. Scheduling is not a random draw.

Although the NFL’s scheduling system isn’t a four-team round robin, the 6-game intra-division portion of the schedule is exactly that: a double round robin among four teams. So each year we should expect an inverse correlation between team strength and average opponent strength to some degree. This year’s strong correlation is likely a random outlier, and the FO guys can rest assured it’s not due to a weakness in their system.

Based on your assessment of the negative correlation, you should be able to test your hypothesis by correlating the coefficient you described in your article with the standard deviation of the win percentages of all the teams in the same season. If your assessment is accurate, this correlation should be negative (null hypothesis: rho >= 0).

...a team can never play itself.

Of course. I never imagined this would be a mystery to anybody.

Let's say that in 2007 the Pats and Dolphins played an identical schedule except for playing each other. The Pats were 16-0 and the Dolphins were 1-15, and being in the same division they played each other twice.

So even with the otherwise identical schedule, the opponents of the Pats had 30 fewer wins than those of the Dolphins. The Dolphins played a schedule tougher by 30 opponent wins.

If readers of stats sites like FOers have been puzzled by why the strongest teams have easier schedules, then I feel good about not being as innumerate compared to the average as I feared.

This effect will get weaker as the number of teams and games increases. If you make it a 16-team, 15-game league, and the spread in GWP is even from .3 to .7, the spread in strength of schedule would be .027. That's not a lot. I have trouble believing that this factor is something you will actually see in single-season data... though maybe it gets stronger if you account for the fact that 6 games are within the division, and within a four-team sample there is a lot of variation in GWP.

It wouldn't be hard to construct that either, but it would probably take me more than five minutes, so I will punt.

I did a little more work. I randomly assigned teams to divisions. Then I "played" only the 6 division-rival games. I then calculated the correlation between GWP and ScheduleStrength for the whole data set.

I ran this 10000 times. The average R was -.0024. The standard deviation in R was .3107.

Obviously, I haven't yet included out-of-division games. But the point is that one has to take into account the shape of the schedule when making this kind of analysis.

This effect will get weaker as the number of teams and games increases ... I have trouble believing that this factor is something you will actually see in single season data

Why would it be hard to see?

NFL teams play double round robins in divisions. Stylized division play:

           W-L    Opp W-L    S-o-S
Best team  6-0     6-12      .333
2nd best   4-2     8-10      .444
3rd best   2-4    10-8       .556
Weakest    0-6    12-6       .667

That's a pretty visible spread from top to bottom in Opponent W-L and strength of schedule.
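The Opp W-L column above follows mechanically from the records themselves: in a double round robin, each team's opponents are simply the other three teams. A few lines of Python make that concrete (the labels are from the table above):

```python
# Division records from the stylized table above
records = {"Best team": (6, 0), "2nd best": (4, 2),
           "3rd best": (2, 4), "Weakest": (0, 6)}

for team in records:
    # A team's opponents are the other three teams; in a double round
    # robin their combined record is that team's "Opp W-L".
    ow = sum(w for t, (w, l) in records.items() if t != team)
    ol = sum(l for t, (w, l) in records.items() if t != team)
    print(f"{team}: opp {ow}-{ol}, SoS {ow / (ow + ol):.3f}")
```

The best team's opponents combine for 6-12 and the weakest team's for 12-6, exactly because each team's own wins are missing from its opponents' records.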

To the extent that teams play the same schedule outside of their division, this will of course carry over. Multiply by eight across the league. It is difficult for me to imagine how the effect *wouldn't* be seen.

Same in divisions with closer spreads from top to bottom. Say team records are 4-2, 3-3, 3-3, 2-4. Then the top team has an s-o-s within the division of .444 and the bottom team .556, clearly visible too. That's the starting point for each team going into inter-division play. Multiply by eight; that's plenty enough to be visible in full-season numbers as it is. The actual average divisional case is between these two examples.

And that is not the end. E.g., 10 teams in other divisions had to play the 16-0 Patriots, but the Patriots didn't have to play themselves even once! The Patriots had an easier-than-average schedule *outside* their division right there. And the converse, of course, for the 1-15 Dolphins. The Dolphins never got even one gimme game against themselves, while teams in other divisions got 10 games against them and their division rivals got 6 more.

For an average-strength team the Patriots schedule was very near two full wins easier than the Dolphins' schedule, even if the Pats and Fins played identical schedules apart from playing each other. Two wins is *a lot* in a 16-game season.

Of course Patriots-Dolphins 2007 is an extreme example, but this scenario plays out in all eight divisions, and for the best and worst teams in inter-division play to a greater or lesser extent, every season.

This season, by Pythagorean, the seven strongest teams had opponents with a 47% winning strength and the seven weakest teams had opponents with an average 53% winning strength. That 6% difference is worth one win in a 16-game season.

To fully equalize strength of schedule between strongest and weakest teams the NFL would have to aggressively require strongest teams to play only strong teams outside their division, and set weaker teams playing only weaker teams. The NFL does a little of this but not much.

The result is seen straight up in the strength-of-schedule numbers, which say what they mean and mean what they say.

Hi Jim,

The reason it would be hard to see is because:

"If you make it a 16 team 15 game league, And the spread in GWP is even from .3 to .7, the spread in strength of schedule would be .027"

As in, I calculated that using the pretty reasonable assumptions I gave. Do you have an issue with one of my assumptions? I did assume a certain distribution of GWP, but I could easily run my code on any arbitrary distribution. Though I really don't see why this should make a difference.

Actually, I just improved my code so that it accounts for 14 games of the season. I neglect the two division-champ (strength-of-schedule) games, because they're not easy to handle, and anyway they should give a positive correlation.

After 10000 runs of such a season: The mean of R is .0012.

The standard deviation of R is .3657.

That standard deviation is pretty comparable to the year-by-year correlations from the FO post, which suggests that this correlation itself is fairly random.

If you would like my code (it's in matlab), I'd be happy to send it to you.

Pretty obvious once someone else has figured it out and posted a succinct explanation :)

So, I've looked into this a little because I do my own opponent adjusted EPA stats and here's what I found (maybe someone can help explain).

I adjust offenses based on defensive opponent strength and defenses based on offensive opponent strength. So, instead of just using the wins to determine strength of schedule, I wanted to look at it based on the strength of opponent's offenses and defenses.

Here are the correlations:

Adjusted Off EPA vs Opponent Def EPA: -0.021

Adjusted Def EPA vs Opponent Off EPA: -0.017

This is right around 0 as should be expected.

But,

Adjusted Off EPA vs Opponent OFF EPA: -0.53

Adjusted DEF EPA vs Opponent DEF EPA: -0.46

This falls more in line with what Brian was saying about the nature of the schedule. But it seems to me that it might just be an outlier, because I don't think these should be correlated. Not really sure.

Would love to hear thoughts on it.

Hi Jim ... Do you have an issue with one of my assumptions?

No, I just don't see the need to use assumptions when one has facts. I looked at the actual Pythagorean numbers for all teams and noted what they say:

"This season, by Pythagorean, the seven strongest teams had opponents with a 47% winning strength and the seven weakest teams had opponents with an average 53% winning strength."

Looking at each team's game-by-game schedule one can see exactly where its numbers came from. (The Rams' opponents averaged 59% winning strength, the Packers' 45%, etc.)

If I didn't have the actual numbers I'd make assumptions to figure what they might reasonably be, but I have them.

Well, I'm going to make a mea culpa first. I misordered a vector, so I was correlating the wrong two things. I've fixed my code.

On 100000 runs:

R is at mean -.192

This should be lowered a little by the two conference games I didn't take into account.

R has standard deviation .278

The reason R can go positive is because of how teams are divided into divisions. For an extreme example, suppose the NFC East has the four best teams, the NFC South the next four, and so forth, and that we're in the year when the NFC East and the NFC South play each other. In that case, the eight best teams are playing more than half their games against top-eight teams. That gives a positive correlation of about .9; that's an extreme case - in a hundred thousand seasons the highest correlation I ever got was .71. But intuitively, we all know that there are seasons in which one division is really strong and another division godawful. Those circumstances will make the correlation more positive.

Jim,

"This season, by Pythagorean, the seven strongest teams had opponents with a 47% winning strength and the seven weakest teams had opponents with an average 53% winning strength."

I bet GWP correlates pretty well to Pythagorean. So what you are saying is that there is a negative correlation this year between GWP and schedule strength. We already knew that! The question is *why* is there such a correlation, and why is there such variation in the correlation. I've answered that question. Have you?

Wait - somebody thought there was something wrong with negative correlation? It's obvious that there *should be* a negative correlation. I made this exact argument a week or two ago somewhere in the FO comments section.

For the most obvious example, start with a two-team league where one of the teams is stronger than the other. Proof by induction is left as an exercise for the reader. (Just think about what a minimal counter-example would look like.)

another vote for the fact that it has been well known.

for years, we fans have always said that the lions' strength of schedule was high because they never get to play the lions.

(hopefully, that old joke is no longer valid).

The question is why is there such a correlation

Looking at this...

           W-L    Opp W-L    S-o-S
Best team  6-0     6-12      .333
2nd best   4-2     8-10      .444
3rd best   2-4    10-8       .556
Weakest    0-6    12-6       .667

... you are asking *why* is there a negative correlation between winning percentage and strength of schedule???

and why is there such variation in the correlation.

What a mystery. To start with, one might ponder if there is any reason why the exact correlation *wouldn't* vary year-to-year. Remembering, among other things, Brian's observation that 42% of game results are random.

I've answered that question. Have you?

Um, yes.

~~~~~~~~~~~~~~~~~~

Wait - somebody thought there was something wrong with negative correlation? It's obvious that there *should* be negative correlation. I made this exact argument a week or two ago somewhere in the FO comments section.

Of course. You were right. It's obvious.

There are people at FOers talking about using Python and Matlab to take apart and understand the complexities of this obviousness.

Which is something like using a 155mm howitzer to dissect a gnat. Worse, *needing* a howitzer to dissect a gnat. (Though a lot of it, I think, is some people will just take any excuse to have the fun of playing with a 155mm howitzer.)

I don't understand how anyone who's seen their team rankings methodology takes FO seriously. It's pretty damn clear they don't understand the basics of statistics. Their whole staff is composed of idiots, like the 49ers fans of this year and the Falcons fans of last year who were constantly arguing that turnovers are predictive (even though there is literally no proof of this).

In addition to the effect inherent in the method, there are also some special circumstances in this year's SoS games:

No. 1 seed schedules this year contained:

Indy, KC, and a lot of mediocrity in the NFC

No. 3 seed schedules, OTOH, contained Houston, SF, and Oakland.

I'm with Jim Glass on this one, it's extremely obvious why, retrodictively (and THAT is KEY here) we would see the better teams having an easier schedule.

That is why, for me, GWP should be adjusted to represent the win probability a team would have were it playing an average team from the rest of the league not including itself, as this is far more meaningful.

The average r that db22 reported of -.19 makes sense. 6/16 (37.5%) of each team's schedule will have a "true" r of -1.0. Another 8 games (50%) should theoretically be uncorrelated at r=0. And 2 games (12.5%) should usually be positively correlated (to some degree) due to SoS scheduling.

Brian,

So the presumption of that analysis is that the mean TWP within each division is .5. On average that will be true, and so we arrive at approximately the same number, but there is also a lot of variation in mean GWP between divisions.

That variation is very strong. Strong enough to make this correlation a really terrible way to judge your forecasting model, or anyone else's. I just want to convince you not to head down that blind alley, because I enjoy your work.

Thanks for all that you do here. And GO GIANTS! :)

On Opponent Coaching Strength and Our Coaching Strength Correlation...

Relatedly, I've often wondered, as a Pats fan, why other coaches seem so stupid.

It's because we never get to compete against Belichick.

Twice a year we play Chan Gailey, Dick Jauron, Mike Mularkey, Gregg Williams, Herm Edwards (!), the Mangenius, Dave Wannstedt (!!), Cam Cameron (!!!)

In 4+ years of following this blog I find this article probably the most disappointing. This analysis is completely flawed, and just as terrible as FO's own analysis. Taking a simple average to compute SOS??? At least use an iterative method and try to better understand the nuances involved. Yes, there are many ways to compute SOS, but just averaging opponents' Win% is one of the worst!

"It is difficult to imagine how this wouldnt be seen" Jim do u have to be derisive 24 hrs a day? This is just one example but there r dozens to be found

"It is difficult to imagine how this wouldnt be seen" Jim do u have to be derisive 24 hrs a day?

When something actually is seen consistently, for a clear and simple reason, what is "derisive" about saying it is hard to imagine that it wouldn't be seen?

Your opinion may differ, you may be able to imagine that it wouldn't be consistently seen.

But why read a difference of opinion as a personal slight?

[I take it that before the arrival of blog comments you didn't spend a lot of time in the friendly and happy world of usenet discussion groups, where "every stranger was just a friend you hadn't met". ;-) ]

Pat Laffaye

What's wrong with the average of opponents' generic win probability? After all, SOS is simply trying to say 'if I were an average team, what would my win percentage be against these opponents?'

GWP is the probability a team would win against an average team, so taking the straight average of that seems fair enough to me, because it answers your question.

Mark, I could see how an average could misrepresent the spread of GWP. For instance, if a team had played the Steelers, Saints, Patriots, Packers, and Vikings the average GWP would be .6 and indicate a 3-2 record, but the average team would be heavy underdogs in 4 of the 5 games.

@Mark M

What's wrong with the average of opponents generic win probability?

Lots of things. First, the averaging method is not robust enough. Second, this 4-team league example stinks because it is incomplete and there is way too much variance between the teams. There is no average team, and worse, there are no similarly skilled teams. A team can always play against like opponents, and the NFL has all 32 teams play 2 of these games every season! These matchups alone make it a 50-50 proposition REGARDLESS of the computed SOS or overall Win%.

GWP is the probability a team would win against an average team, so taking the straight average of that seems fair enough to me, because it answers your question.

You need to look at the schedules of each team individually, and just averaging Win% doesn't capture that. For example, you probably didn't realize that the (#1) .75 team playing the (#2) .6 team is TWICE AS GOOD!!! That said, it has only a 2/3 chance of winning!! In this bogus league there is no combination of MATCH-UPs where the probability of winning lies between 34% and 66%. Most unrealistic, to say the least.
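The match-up probabilities quoted here can be checked with the log5 rule (my assumption; the blog's own match-up math may differ slightly):

```python
from itertools import combinations

def log5(a, b):
    """Log5 probability that a team with TWP a beats a team with TWP b."""
    return a * (1 - b) / (a * (1 - b) + (1 - a) * b)

# TWPs from the four-team example in the post
twp = [0.75, 0.60, 0.40, 0.25]
for a, b in combinations(twp, 2):
    print(f"{a:.2f} vs {b:.2f}: {log5(a, b):.3f}")
```

Under log5, the .75 team beats the .60 team exactly 2/3 of the time, the .75 vs .25 match-up comes out .900, and no pairing falls strictly between 1/3 and 2/3, consistent with the claim above.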

James - in that example you'd expect an average team to win 1.89 games (based on season end GWP), so approx a 2-3 record. Just because they are massive underdogs, they've still a chance of winning.

I suppose one problem is we haven't got a unit for strength of schedule. Wins per game seems a reasonable one. That schedule would give you 1.89 wins, vs. 2.5 wins against an average schedule, which comes out to .122 wins per game. The average opponent GWP of those five games is 0.622, exactly 0.5 above the .122 wins-per-game measure.
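For what it's worth, the 1.89 and .622 figures are internally consistent: against an average (GWP .5) team, a log5 match-up reduces to a win probability of 1 - g, so expected wins per game is just 1 minus the average opponent GWP. A sketch with hypothetical opponent GWPs (the actual five values aren't given in the comments; all that matters here is that they average .622):

```python
# Hypothetical GWPs for the five opponents; chosen only so that they
# average .622, matching the figures in the comment above.
opp_gwp = [0.70, 0.65, 0.62, 0.60, 0.54]

avg_opp = sum(opp_gwp) / len(opp_gwp)

# Versus an average team, log5 gives p = .5*(1-g) / (.5*(1-g) + .5*g) = 1 - g
exp_wins = sum(1 - g for g in opp_gwp)

print(f"avg opponent GWP {avg_opp:.3f}, expected wins {exp_wins:.2f} "
      f"({exp_wins / 5:.3f} per game)")
```

So the .622 average opponent GWP and the .378 expected wins per game necessarily sum to 1, which is the identity being argued about in the surrounding comments.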

If I'm massively missing the point then please explain it to me, but to me this seems to work fine.

James is saying that an average team playing that schedule should be expected to win just 1 game.

You could also argue that the 1.89 GWP figure is too high. e.g. take Regular Season Win% and compute what an average team would do. Using the average method, which again I don't like to use as SOS, you get an average of *only* 1.5 wins.

In your last post, there's a problem with your math and understanding. The .122 you compute is not wins per game -- it is actually .122 GWP wins BELOW average. Thus, .378 is 1.89 wins over 5 games, which when combined with the opponent GWP of .622 sums to 1.

I don't know if anyone is still reading this comment thread, but I'll take a shot anyway.

How does the SOS change if you were to remove the games played by the team in question? In other words, if you remove from each opponent's record its games against GB (which, given GB's 15-1 record, removes 1 opponent win and 15 opponent losses), and do the same for every other team, does that help reduce (eliminate?) the SOS-W/L pct. correlation?

I believe this is what they do for the RPI system in college basketball. I'm not trying to say that the RPI is great, because it isn't, but it does seem to make sense when calculating a team's SOS.
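Here is a sketch of what that RPI-style adjustment might look like (the data structures and the toy league are hypothetical, just to make the bookkeeping concrete); the toy season below has a 4-0 team and an 0-4 team that play each other twice and otherwise share a schedule, echoing the Pats/Dolphins example earlier in the thread:

```python
def rpi_style_sos(team, opponents, records, h2h):
    """Average of opponents' win percentages, with each opponent's games
    against `team` removed first (the RPI-style adjustment).

    opponents[t] -- t's opponents, one entry per game played
    records[t]   -- t's (wins, losses) for the full season
    h2h[(a, b)]  -- number of games a won against b
    """
    pcts = []
    for opp in opponents[team]:
        w, l = records[opp]
        w -= h2h.get((opp, team), 0)   # drop opp's wins over `team`
        l -= h2h.get((team, opp), 0)   # drop opp's losses to `team`
        pcts.append(w / (w + l))
    return sum(pcts) / len(pcts)

# Toy 4-team season: NE goes 4-0, MIA 0-4, and they play twice;
# X and Y go 2-2 and fill out both schedules identically.
opponents = {"NE": ["MIA", "MIA", "X", "Y"], "MIA": ["NE", "NE", "X", "Y"]}
records = {"NE": (4, 0), "MIA": (0, 4), "X": (2, 2), "Y": (2, 2)}
h2h = {("NE", "MIA"): 2, ("NE", "X"): 1, ("NE", "Y"): 1,
       ("X", "MIA"): 1, ("Y", "MIA"): 1, ("X", "Y"): 1, ("Y", "X"): 1}

print(rpi_style_sos("NE", opponents, records, h2h))   # raw SoS was .25
print(rpi_style_sos("MIA", opponents, records, h2h))  # raw SoS was .75
```

In this toy case the adjustment narrows the gap (from .25 vs .75 raw to 1/3 vs 2/3 adjusted) but does not eliminate it: NE still gets to play the 0-4 team twice while MIA has to play the 4-0 team instead, and no record adjustment changes who is on the schedule.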

Two problems:

1) The SOS games should on average offset this; first-place teams play two games each vs. the 2nd-, 3rd-, and 4th-place teams in their division, plus two SOS games against other 1st-place teams.

2) There's no reason to believe that teams in a division would balance out to league average. Some divisions are strong or weak. There's also not a great year-over-year correlation in a team's ability: teams can rise or fall quickly, and often do.