In the first article on this topic I explained that injuries are rare, potentially catastrophic, events and that their distribution therefore differs fundamentally from intuitive expectations. We expect one half of a population to be on one side of average and the other half to be on the other. That's not the way it works with rare events. The average earthquake causes some dollar value of damages. The median earthquake causes zero dollars worth of damage.
Really what we care about is how often an earthquake causes huge damage and that takes more than looking at central tendencies.
The rarity of injuries means that models using simple expected values are unlikely to be illuminating (the expected damages due to an earthquake in any given year aren't going to be too different between California and Missouri) and that distribution models can't be normal.
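To make the mean-versus-median point concrete, here's a quick simulation of a hypothetical rare-event process. The 5% event rate and Pareto-distributed damage sizes are made-up numbers for illustration, not real earthquake data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical rare-event model: in any given year there is a 5% chance
# of an event, and event sizes are drawn from a heavy-tailed distribution.
years = 100_000
hits = rng.random(years) < 0.05
damage = np.where(hits, rng.pareto(2.0, years) * 1e6, 0.0)

print(f"mean damage:   {damage.mean():,.0f}")    # well above zero
print(f"median damage: {np.median(damage):,.0f}")  # zero - most years nothing happens
```

The mean is dragged up by a handful of huge years while the median sits at zero, which is exactly why central tendencies tell you so little about rare catastrophes.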
In the last piece I settled on a gamma distribution to describe the population injury risk for QBs and you can read in the comments why that meant individual QB distributions were also likely to be gamma distributions.
As promised this piece will end with a more robust model that will be tested out of population in the next one.
So let's talk about individual modeling.
First, a caveat. Models like these can only account for what is measurable. They shouldn't even try to account for things that aren't. That's how you end up with hideous monsters like QBR. That doesn't mean that the things I can't measure aren't important; it just means that when I say "this is Wilson's injury risk probability distribution" I mean "accounting for the modeled parameters - everything else is league average."
Second, I don't believe in the spaghetti regression. What I mean by that is throwing every measurement I have into a predictive analysis to see if they stick. Don't get me wrong, I'm not doing super rigorous science here so I look at everything. But unless I have a plausible theory of causation beforehand or develop an extremely compelling one after I'd rather have an intuitive understanding of causation than a system that predicts slightly more accurately on paper.
Partly that's because it helps to understand all of the moving parts when something goes wrong but mostly that's because I think this sort of analysis should clarify causal mechanisms - not obscure them. The conclusions in the third piece of this series should give readers a better intuitive understanding of QB injuries not just a jumble of numbers.
To start out I have a jumble of QB stats and the idea that expected games missed due to injury is modeled by a gamma distribution.
Where I want to get is having an individual model for each QB based on key stats.
I've got something else though - the strong suspicion that my population (All QB seasons in the past three years) is actually multiple populations.
Here's what I mean by that:
Let's say I took some polling data on an insensitive caricature of Capitol Hill (Seattle, not DC). I might find that wearing leather clothing has no correlation with going to punk rock shows. Of course it couldn't, punk is dead, but let's say I mean the soft socially conscious stuff the kids call punk nowadays. That would be surprising since we know that there is a large population of hill rats who wear leather and go to punk shows. Problem is that they're being cancelled out by a large population of hill dwellers who wear other leathers and go to dance clubs instead of punk shows. I could differentiate between chaps and jackets but it would be better to admit that I'm polling two fundamentally different populations.
In the NFL there are many different types of QBs - you'll know them by heart: pocket passers, scramblers, option QBs, Tim Tebow, gun slingers, and any phrase Jon Gruden has invented. When I started this analysis I was mostly concerned with two, mutually exclusive, populations: backups and starters.
The variance in play stats for backups is essentially zero so it might seem like they'd be perfect for sussing out the effect of things like age - the problem is that that only works if starters and backups are otherwise the same. I doubt they are. First and foremost, a starter may be more likely to avoid a listing on the injury report than a backup given the exact same injury. That would mean I simply can't compare them, since injury report listings are how I created the eGames stat I use in this analysis. Second, what if QBs with preexisting injuries are likely to become backups and carry a higher injury risk with them? That would mean anything that differentiated backups and starters would show correlation with injuries - causal or not - when looking at the full population.
After poking about with numbers I decided that nobody cares about backup injury risk anyway, and that the drop in sample size would be more than outweighed by getting a more accurate picture of the population of interest.
At this point I could have just made an arbitrary games started cutoff or taken the top 96 QBs in games started, but when I clustered QBs into two groups by all their stats (if you're curious what this means ask away in the comments) they neatly popped out a backup group and a starter group, leaving out some of the annoying game-managing replacements who never did anything under center in their games started.
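For the curious, here's a minimal sketch of that kind of two-group clustering - a bare-bones k-means with two clusters run on made-up per-game stats, not my actual clustering code or data:

```python
import numpy as np

def two_means(X, iters=20):
    """Minimal k-means with k=2: split the rows of X into two clusters."""
    # deterministic init: the rows with the smallest and largest first column
    centers = X[[X[:, 0].argmin(), X[:, 0].argmax()]]
    for _ in range(iters):
        # assign each row to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each center as its cluster's mean
        centers = np.array([X[labels == k].mean(axis=0) for k in (0, 1)])
    return labels

# Toy data: columns are (pass attempts/game, rush yards/game) - starters
# rack up far more attempts than backups, so the groups separate cleanly.
starters = np.column_stack([np.full(10, 33.0), np.linspace(5, 30, 10)])
backups = np.column_stack([np.full(10, 4.0), np.linspace(0, 3, 10)])
labels = two_means(np.vstack([starters, backups]))
print(labels)  # starters land in one cluster, backups in the other
```

With real QB seasons the stat columns are messier, but the idea is the same: let the stats themselves draw the starter/backup line rather than picking a cutoff by hand.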
So that leaves me with a group of starting QB seasons. Tragically it also leaves me a sample size of around 100 QBs and that's small. Fortunately I've never met a sample I couldn't irresponsibly bootstrap.
Bootstrapping is a great example of a word's vernacular meaning and its scientific meaning being the same. Basically I take my sample of QBs and repeatedly draw random samples from it, with replacement. I then use the representative stats and distribution parameters for estimated distributions of those samples to approximate the relationship between the stats and the distribution of injuries.
Note: "Distribution parameters" means the values that control a distribution's shape - for the normal curve it is mean and standard deviation, for the gamma distribution it is shape and scale (or rate).
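Here's a rough sketch of one piece of that in Python - resampling seasons with replacement and fitting gamma parameters to each resample. The eGames values are simulated stand-ins, and `scipy.stats.gamma.fit` is just one of several ways to estimate shape and scale:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Stand-in data: ~100 starter seasons of games missed (the real eGames
# values come from the injury report; these are simulated for illustration).
egames = rng.gamma(shape=0.8, scale=2.5, size=100)

# Bootstrap: resample the seasons with replacement and fit a gamma to each
# resample, collecting the fitted (shape, scale) pairs.
params = []
for _ in range(300):
    boot = rng.choice(egames, size=len(egames), replace=True)
    shape, loc, scale = stats.gamma.fit(boot, floc=0)  # pin location at 0
    params.append((shape, scale))
params = np.array(params)

print("shape: mean %.2f, sd %.2f" % (params[:, 0].mean(), params[:, 0].std()))
print("scale: mean %.2f, sd %.2f" % (params[:, 1].mean(), params[:, 1].std()))
```

The spread of those fitted pairs is what lets me relate the stats of each resample to its distribution parameters, instead of betting everything on one fit to one small sample.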
The stats I used in this process were passing and rushing attempts per game, rushing yards per game, sacks per game, and yards per carry.
As an aside, someone will say that I should have looked at age as well. It turns out I did! It is a very good predictor of backup injuries but falls apart for starters. I didn't really consider including it in the final analysis because, with such a short and small data set, the strong relationship between a QB's age in year x and in years x+1 and x+2 meant that it could just act as a proxy for individual QB luck and for unmeasured stats that don't vary year to year.
In other words I was worried that it would add more questions about the final analysis than it would solve.
Passing attempts was basically independent of the distribution parameters, the others were all related to varying degrees but almost all of the predictable variation could be explained by just rush yards per game and sacks per game.
Here's the super intimidating, super approximate, equation for the probability density function of an individual QB's injuries, ryds is the rush yards per game, sk is the sacks per game, and egames is the games missed due to injury:
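Schematically, it's the gamma density with the shape k and rate θ allowed to depend on the two stats - the coefficient functions below stand in for the bootstrapped fits rather than the actual fitted values:

```latex
f(\mathrm{egames} \mid \mathrm{ryds}, \mathrm{sk})
  = \frac{\theta(\mathrm{ryds}, \mathrm{sk})^{\,k(\mathrm{ryds}, \mathrm{sk})}}
         {\Gamma\!\left(k(\mathrm{ryds}, \mathrm{sk})\right)}
    \,\mathrm{egames}^{\,k(\mathrm{ryds}, \mathrm{sk}) - 1}
    \, e^{-\theta(\mathrm{ryds}, \mathrm{sk})\,\mathrm{egames}}
```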
Everything went fine making the bootstrapped model except for Michael Vick. His 2010 season went ahead and broke the whole thing. I'll show you why because 3D graphs are cool.
The model assumes the distribution parameters are gamma distributed and, basically, Michael Vick's 2010 was past the limit - the model spits out a negative value and that makes no sense in the physical world. In the following chart Vick's 2010 is the expanding red dot.
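To see the failure mode, imagine the shape parameter is a linear function of the stats (the coefficients below are made up purely for illustration) - extreme enough inputs drive it negative, and the gamma density is simply undefined there:

```python
import numpy as np
from scipy import stats

# Hypothetical linear predictor for the gamma shape parameter
# (made-up coefficients, just to illustrate how extrapolation breaks):
def shape_param(ryds, sk):
    return 1.2 - 0.02 * ryds - 0.1 * sk

print(shape_param(20, 2.0))  # typical starter: positive, no problem
print(shape_param(60, 2.5))  # Vick-in-2010 territory: goes negative

# scipy refuses a non-positive shape - the density is undefined there
print(stats.gamma.pdf(1.0, a=-0.25))  # nan
```

A negative shape parameter has no physical interpretation, which is why an outlier season like Vick's 2010 breaks the naive model rather than just getting a bad fit.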
I "solved" the problem by excluding Michael Vick's 2010 - in the next piece I'll propose and test a more robust solution.
Vick's problems aside, now I can make individual QB distributions. Well, I could already do that - I did it in the last piece. How can I show that this method is at least accurate enough to go on with?
Essentially I run a whole mess of simulated seasons using the projected distributions and then perform the Kolmogorov-Smirnov test to see if it is likely that the simulated data and the real data came from the same distribution.
It's actually more complicated than that because the distributions are continuous whereas the actual data only has a few possible values (since it comes from the injury report). That means I had to take the simulated seasons and break them down into possible weekly values, and then reconstruct the seasons from there.
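Here's a toy version of that pipeline, using stand-in data for both the simulated and "actual" seasons and snapping values to half-game increments; `scipy.stats.ks_2samp` does the two-sample comparison:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Simulated games missed drawn from a continuous gamma projection...
simulated = rng.gamma(shape=0.8, scale=2.5, size=5000)
# ...snapped to half-game increments, a stand-in for the weekly-value
# reconstruction described above (the injury report only allows a few values)
simulated_discrete = np.round(simulated * 2) / 2

# Stand-in "actual" seasons drawn the same way, then discretized the same way
actual = np.round(rng.gamma(shape=0.8, scale=2.5, size=100) * 2) / 2

# Two-sample KS test: a large p-value means we can't rule out that both
# samples came from the same underlying distribution
stat, p = stats.ks_2samp(simulated_discrete, actual)
print(f"KS statistic {stat:.3f}, p-value {p:.3f}")
```

In the real analysis the "actual" array is the observed eGames data rather than another simulation, but the discretize-then-compare shape of the test is the same.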
The final data can be visualized like this:
In the end I got a p value of .99 using, as the null, the hypothesis that the simulated injuries and actual injuries were derived from the same distribution. For you betting folk that translates to 99:1 odds. That's Browns don't win the Super Bowl territory in terms of certainty.
The p value of the same simulation using the population distribution for each QB instead of the individual projections is .66 so the individual distributions describe the data better than just assuming each QB has a league average risk of injury.
At this point I am very certain that the model describes the injury risk for in sample QBs - the remaining questions are:
- Does it describe out of sample QB seasons well?
- Can this information be used to project?
- How should projections be used?
- What does it all mean?
I'll answer these (and more!) in the final piece in this series.
Now, just for the laugh out louds I'll share the movement in injury probability relative to QB stats and individual projections for a few QBs.
First, here are the probability distributions for certain numbers of games missed across values of sacks per game and rush yards per game. I'm so sorry for the quality:
And here are the distributions for the NFC west QBs:
Kaepernick and Palmer are the pretty clear winners. Kaepernick's maybe-counterintuitive below-average injury risk is based on his low sack numbers - I'm not convinced that they're sustainable so I doubt the projection is an accurate reflection of his actual injury risk. But that could just be wishful thinking.