Assessing the Accuracy of WP
Between John and me, there have been a couple of posts at FG related to the Advanced NFL Stats Win Probability charts. If you've paid any attention, you know just how neat they are. If you are a generally curious person, you have probably wondered about the accuracy of the WP charts. Wonder no more. If you have no interest in probabilistic models, you should probably just stare at a wall for a few minutes.
For readers who are accustomed to linear regression models, you'd expect to see a goodness-of-fit statistic known as r-squared. And for those familiar with logistic models, you'd expect to see some other measure, such as the percent of cases predicted correctly. But the win probability model I've built is a complex custom-built model, using multiple smoothing and estimation methods. There isn't a handy goodness-of-fit statistic to cite.
We can still test how accurate the model is by measuring the proportion of observations that correctly favor the ultimate winner. For example, if the model says the home team has a 0.80 WP, and they go on to win, then the model would be "correct."
But it's not that simple. I don't want the model to be correct 100% of the time when it says a team has a 0.80 WP. I want it to be wrong sometimes. Specifically, in this case I'd want it to be wrong 20% of the time. If so, that's a good feature of any probability model. This is what's known as model calibration.
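For the curious, here is a minimal sketch of what a calibration check could look like in Python. The function name `calibration_table`, the ten equal-width bins, and the toy inputs are assumptions made for illustration; none of this is taken from the actual WP model.

```python
from collections import defaultdict


def calibration_table(predictions, outcomes, n_bins=10):
    """Bin predicted win probabilities and compare each bin's mean
    prediction with the observed win rate in that bin.

    predictions: floats in [0, 1]
    outcomes:    1 if the favored-or-not team in question won, else 0
    Returns a list of (mean_predicted, observed_win_rate, count) tuples.
    """
    bins = defaultdict(lambda: [0.0, 0, 0])  # sum of predictions, wins, count
    for p, won in zip(predictions, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # keep p == 1.0 in the top bin
        bins[idx][0] += p
        bins[idx][1] += int(won)
        bins[idx][2] += 1
    return [
        (total / n, wins / n, n)
        for _, (total, wins, n) in sorted(bins.items())
    ]


# Toy data (made up): ten situations rated at 0.80 WP, eight of which were won.
toy_predictions = [0.8] * 10
toy_outcomes = [1] * 8 + [0] * 2
toy_table = calibration_table(toy_predictions, toy_outcomes)
```

In a well-calibrated model, the mean prediction and the observed win rate in each row should be close to each other, which is exactly the "wrong 20% of the time" property described above.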
Right, so that's it for the wall of text. If you are still reading, now go check out the charts in the article. The first shows a very nice relationship between actual results and the predicted results. There's a problem, however: The top chart shows the relationship between the 2000-2007 data and the model, which was built off of the 2000-2007 data. When building and subsequently testing a model, it's important to split the data into what's known as test and training data sets, part of the cross-validation process. If you test a model with the same data you created it with, the results will almost certainly show a very good model*. To ensure that this wasn't the case, Burke tested his model against the 2008 data, and the results look pretty good for a single-season sample.
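To make the train/test point concrete, here is a small Python sketch. Everything in it is hypothetical: the row fields (`season`, `margin`, `won`), the toy "model" that just memorizes win rates per score margin, and the made-up games. It is not Burke's method; it only illustrates why scoring a model on the data it was built from flatters it. The Brier score used here is simply the mean squared difference between the predicted probability and the 0/1 outcome, so lower is better.

```python
def split_by_season(rows, test_seasons):
    """Hold out whole seasons for testing; everything else is training data."""
    train = [r for r in rows if r["season"] not in test_seasons]
    test = [r for r in rows if r["season"] in test_seasons]
    return train, test


def fit_bucket_rates(rows):
    """Toy stand-in model: observed win rate for each score margin."""
    totals = {}
    for r in rows:
        wins, n = totals.get(r["margin"], (0, 0))
        totals[r["margin"]] = (wins + r["won"], n + 1)
    return {margin: wins / n for margin, (wins, n) in totals.items()}


def brier_score(model, rows, default=0.5):
    """Mean squared error between predicted probability and 0/1 outcome."""
    errors = [(model.get(r["margin"], default) - r["won"]) ** 2 for r in rows]
    return sum(errors) / len(errors)


# Made-up games, purely for illustration.
games = [
    {"season": 2007, "margin": 7, "won": 1},
    {"season": 2007, "margin": 7, "won": 1},
    {"season": 2007, "margin": -7, "won": 0},
    {"season": 2008, "margin": 7, "won": 1},
    {"season": 2008, "margin": 7, "won": 0},
    {"season": 2008, "margin": -7, "won": 0},
]

train, test = split_by_season(games, test_seasons={2008})
model = fit_bucket_rates(train)
in_sample = brier_score(model, train)      # scored on the data it was built from
out_of_sample = brier_score(model, test)   # scored on the held-out season
```

With these toy rows the in-sample score is perfect while the held-out score is not, which is the gap the 2008 test is meant to expose.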
Also, Burke charted out the model confidence, which is pretty interesting to look at from a fan perspective. It makes sense that the average game should start out with a roughly 50/50 split at kickoff. As late as ten minutes from the end, however, there is generally only about 80% confidence in a winner. That leaves a lot up for grabs.
*Unless you screwed up.
Comments
The WP charts are awesome
Will they be updated real time this season?
by Nate Dogg on Jul 8, 2009 8:57 AM PDT
I'm not a statistician
But I know a little bit about developing a testable model. This defense of WP sounds like a whole lot of hand-waving and “just trust me guys”. Especially this:
But the win probability model I’ve built is a complex custom-built model, using multiple smoothing and estimation methods. There isn’t a handy goodness-of-fit statistic to cite.
Using the right “multiple smoothing and estimation methods” you can make any piece of experimental data say anything you want it to. The fact that the developer of WP tailored his results to fit the original experimental data and that “There isn’t a handy goodness-of-fit statistic to cite” seems really fishy.
If you test a model with the same data you created it with, the results will almost certainly show a very good model*.
Totally true. While it is great that the results agree with the 2008 season, I think some more testing is necessary before WP can be considered more than a novelty act.
by ninjasocks on Jul 8, 2009 10:12 AM PDT
You can't expect to develop a model for WP with any single regression.
You’re off base here. Accuse him of not properly validating the model, fine, but this isn’t a case of lying with statistics. The point of a predictive model is not to “make any piece of experimental data say anything you want it to”, but rather to use that wealth of data to learn about trends and distributions within the data for use in, well, predicting. If Burke needed to use several estimation methods to fit various aspects of game modeling, then his model should be more accurate on account of him having done so. If you read into his comments, he mentioned problems modeling “going for it on 4th down” situations. This is a detailed model. He didn’t just run a linear regression on point differential and time remaining.
by abender20 on Jul 8, 2009 10:36 AM PDT
I read it more as
there are only so many situations when it’s 3rd and 23 with 1:34 to go on your own 35 yard line and down by 4, so WP here is not as simple as in something like baseball, where only so many scenarios are possible. There needs to be a bit of estimation involved.
Also, that there is a reasonable degree of estimation in his calculations, but it might take 50+ pages to explain what he did, why he did it, and how he went about it, so just trust him when he says he tested its predictive value on games played in the past and that it seemed to ‘predict’ the results accurately.
by LantermanC on Jul 8, 2009 10:51 AM PDT
The model performance looks good to me
The first calibration graph (based on 2000-2007 data) is basically useless, but the model looks to be performing pretty well on 2008 data.
The “Confidence” plot is a little hard to interpret, but I think it’s answering the question: “If you bet on the team with the higher WP at time X, what proportion of the time will you win your bet?” Unfortunately, this mixes two kinds of variability: variability due to errors in calculating the win probability via the model, and variability due to the nature of the game (i.e. last-minute lead changes do happen). This second kind of variability cannot be reduced by building a better model; even if you estimated win probability without any error, at every time before the end of the game there is some chance that betting on the team with the higher WP will ultimately be a losing bet.
In the end, I interpret WP at any given state of the game as the proportion of times teams in similar situations have gone on to win the game. And the model seems to be doing a good job at estimating this proportion.
by cyberwulf on Jul 9, 2009 9:45 AM PDT
