Assessing the Accuracy of WP

Between John and me, there have been a couple of posts at FG related to the Advanced NFL Stats Win Probability charts. If you've paid any attention, you know just how neat they are. If you are a generally curious person, you have probably wondered about the accuracy of the WP charts. Wonder no more. If you have no interest in probabilistic models, you should probably just stare at a wall for a few minutes.

Readers accustomed to linear regression models would expect to see a goodness-of-fit statistic known as r-squared. Those familiar with logistic models would expect some other measure, such as the percent of cases predicted correctly. But Burke's win probability model is a complex custom-built model, using multiple smoothing and estimation methods. There isn't a handy goodness-of-fit statistic to cite.

We can still test how accurate the model is by measuring the proportion of observations that correctly favor the ultimate winner. For example, if the model says the home team has a 0.80 WP, and they go on to win, then the model would be "correct."
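The check described above can be sketched in a few lines of Python. The data here is invented for illustration: each observation is a (home-team WP, home-team-won) pair, and the model is "correct" when the team it favored went on to win.

```python
# Made-up (home_WP, home_won) observations, purely for illustration.
observations = [
    (0.80, True),   # model favored the home team; home won  -> correct
    (0.35, False),  # model favored the away team; home lost -> correct
    (0.60, False),  # model favored the home team; home lost -> incorrect
    (0.10, False),  # model favored the away team; home lost -> correct
]

def favored_winner_accuracy(obs):
    """Proportion of observations where the team the model favored
    (WP > 0.5 means home is favored) went on to win."""
    correct = 0
    for wp, home_won in obs:
        favored_home = wp > 0.5
        if favored_home == home_won:
            correct += 1
    return correct / len(obs)

print(favored_winner_accuracy(observations))  # 3 of 4 correct -> 0.75
```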

But it's not that simple. I don't want the model to be correct 100% of the time when it says a team has a 0.80 WP. I want it to be wrong sometimes. Specifically, in this case I'd want it to be wrong 20% of the time. A model whose stated probabilities match the observed frequencies in this way has a desirable property known as calibration.
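A standard way to check calibration is to bin the predictions and compare the average predicted WP in each bin to the actual win rate in that bin. The sketch below uses synthetic data constructed to be calibrated by design (a team given a 0.80 WP really wins about 80% of the time), so the two columns should line up.

```python
import random
from collections import defaultdict

def calibration_table(preds, outcomes, n_bins=10):
    """For each equal-width probability bin, return
    (mean predicted WP, actual win rate). A well-calibrated
    model has the two numbers close in every bin."""
    bins = defaultdict(lambda: [0, 0, 0.0])  # bin -> [wins, count, wp_sum]
    for wp, won in zip(preds, outcomes):
        b = min(int(wp * n_bins), n_bins - 1)
        bins[b][0] += int(won)
        bins[b][1] += 1
        bins[b][2] += wp
    return {b: (wp_sum / n, wins / n)
            for b, (wins, n, wp_sum) in sorted(bins.items())}

# Synthetic demo: outcomes drawn so the "model" is calibrated by construction.
random.seed(0)
preds = [random.random() for _ in range(10000)]
outcomes = [random.random() < p for p in preds]

for b, (mean_pred, win_rate) in calibration_table(preds, outcomes).items():
    print(f"bin {b}: predicted {mean_pred:.2f}, actual {win_rate:.2f}")
```

With real game data you would feed in the model's in-game WP estimates and the eventual outcomes; systematic gaps between the two columns would flag miscalibration.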

Right, so that's it for the wall of text. If you are still reading, now go check out the charts in the article. The first shows a very nice relationship between actual results and the predicted results. There's a problem, however: the top chart shows the relationship between the 2000-2007 data and the model, which was built from that same 2000-2007 data. When building and subsequently testing a model, it's important to split the data into what are known as training and test sets, part of the cross-validation process. If you test a model with the same data you created it with, the results will almost certainly show a very good model*. To ensure that this wasn't the case, Burke tested his model against the 2008 data, and the results look pretty good for a single-season sample.
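The out-of-sample test Burke ran amounts to a split by season: fit on one span of years, evaluate on a held-out year. The sketch below shows the split itself; the "model" is just a stand-in that predicts the training-set home win rate, and the field names and data are invented for illustration.

```python
# A sketch of a season-based train/test split. The games, field names,
# and toy "model" are assumptions for illustration only.
def split_by_season(games, holdout_season):
    """Fit on every season except the holdout; test on the holdout."""
    train = [g for g in games if g["season"] != holdout_season]
    test = [g for g in games if g["season"] == holdout_season]
    return train, test

games = [
    {"season": 2006, "home_won": True},
    {"season": 2006, "home_won": True},
    {"season": 2007, "home_won": False},
    {"season": 2008, "home_won": True},
    {"season": 2008, "home_won": False},
]

train, test = split_by_season(games, holdout_season=2008)
base_rate = sum(g["home_won"] for g in train) / len(train)  # fit on 2006-2007
accuracy = sum((base_rate > 0.5) == g["home_won"] for g in test) / len(test)
print(len(train), len(test), accuracy)
```

Testing against the held-out season is what gives the accuracy number meaning; scoring the model on the seasons it was fit to would, as the footnote says, almost always look good.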

Also, Burke charted out the model confidence, which is pretty interesting to look at from a fan perspective. It makes sense that the average game should start out with a roughly 50/50 split at kickoff. As late as ten minutes from the end, however, there is generally only about 80% confidence in a winner. That leaves a lot up for grabs.
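"Confidence" here can be read as the probability of whichever team is currently favored, max(WP, 1 - WP), averaged across games at a given point in time. A minimal sketch, with invented WP snapshots:

```python
def mean_confidence(wps):
    """Average confidence in a winner: the favored team's probability,
    max(p, 1 - p), averaged over a list of home-team WPs."""
    return sum(max(p, 1 - p) for p in wps) / len(wps)

# Hypothetical home-team WP snapshots across several games (not real data).
kickoff = [0.50, 0.55, 0.45, 0.52]
ten_minutes_left = [0.90, 0.70, 0.15, 0.60]

print(mean_confidence(kickoff))           # near 0.5: a coin flip at kickoff
print(mean_confidence(ten_minutes_left))  # higher, but far from certain
```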

*Unless you screwed up.