In this series of articles I'm going to create a model for individual QB injury risk and test its usefulness. Part one is going to deal with the fundamentals of injury risk and a glance at the QB rush as a determinant of injury risk, part two will propose a robust multivariate model for QB injury risk, and part three will test the model for accuracy on out-of-sample data. I am going to avoid the hardcore math where possible, so I encourage you to ask about anything that bothers you or makes you curious in the comments. Much of the work that went into this was non-trivial and none of it is peer-reviewed, so you're likely enough to catch a mistake that I'd really appreciate you not holding back because you think a question might be silly or the answer outside of your grasp.
At some point we were all young enough that the only universe we could imagine was one with us at the center. In retrospect that should have been a liberating world view. During the window when I could understand there were things outside of me but was unable to understand that they weren't there for me I should have felt like a billionaire toddler playboy. Instead I went full Howard Hughes. If everything was put here for me why shouldn't it be malevolently so?
I knew that I was valuable. I knew there were bad people. Bad people want valuable things so I knew that I was going to be kidnapped late in the night.
I'd lie awake in bed each night filled with growing dread. Eventually I'd either fall asleep or run to my parents' room. They'd walk me through each room turning on the lights and looking behind the larger furniture. Reassured that the house was safe I'd finally fall asleep.
One night I began thinking about what seeing an empty room on our checks meant. I reasoned that each safe room meant that the kidnappers were more likely to be in the next one. I spent some time trying to tease out how the chances increased after each check. Stymied by an incomplete understanding of multiplication my thoughts wandered and I suddenly realized that each check could also be viewed as increasing the odds that there were no kidnappers at all.
At the time it was a reassuring thought. I expanded the reasoning and decided that the multiple checks my parents and I had done meant I could safely adjust my perception of kidnapping risk downwards. In retrospect it was probably the first time I had ever questioned a baseline assumption and I like to think of it as a formative moment.
What I had done was to take the leap from interpreting data based on preconceived models to allowing data to inform my model.
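For anyone who wants the bedtime reasoning in numbers, here is a minimal sketch. Everything in it is an assumption made up for illustration: a 10% starting belief that a kidnapper is in the house, five rooms, an equal chance of hiding in any of them, and checks that never miss.

```python
from fractions import Fraction

def update_after_empty_rooms(prior, rooms, checked):
    """Return (P(kidnapper in house), P(kidnapper in next room)).

    Assumes (hypothetically) at most one kidnapper, hiding in any room
    with equal probability, and a check that never misses.
    """
    remaining = rooms - checked
    # Chance of seeing `checked` empty rooms if a kidnapper is present.
    likelihood_present = Fraction(remaining, rooms)
    # Empty rooms are certain if nobody is there at all.
    evidence = prior * likelihood_present + (1 - prior)
    posterior = prior * likelihood_present / evidence
    # If present, he is equally likely to be in each unchecked room.
    next_room = posterior / remaining if remaining else Fraction(0)
    return posterior, next_room

prior = Fraction(1, 10)  # made-up starting belief
for checked in range(5):
    posterior, next_room = update_after_empty_rooms(prior, 5, checked)
    print(checked, float(posterior), float(next_room))
```

Under these assumptions both childhood thoughts hold at once: the chance that the next room is the bad one creeps up with each empty room, while the chance that there is a kidnapper at all keeps falling.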
The reason I bring it up is because football analysis is full of people assuming kidnappers.
In no area is this more true than injuries.
Injuries are no good, awful, terrible things and when they happen people want to find a cause. Something that could have been changed and will be in the future. When they find that cause they assume that it is something to be feared.
Some QB runs end with injury and there is an intuitive basis for believing that rushes are riskier than throws. That doesn't mean that we should treat every QB run like there's an injury hiding in it - cringing every time Russell Wilson leaves the pocket because it could be his last. That mentality shows a fundamental misunderstanding of how kidnappers work.
What I mean by that - and I promise I'll stop stretching the analogy now - is that people who cringe every run fundamentally misunderstand the nature of rare events.
Let me show what I mean by rare events:
The eGames value, which I discussed in this article, is just a conversion of injury report listings to the probability that a player with that listing will miss the following game. I should call it something like eInjured but eGames is what I chose so deal with it.
In this case each bar is a tenth of a game missed, so to get the probability instead of the probability density (if you prefer it) you can just divide the density by 10. You can clearly see that more than half the players are expected to miss less than a game due to injury while a little less than a third were never listed at all.
Before moving on I have to turn this raw chart into something more useful. eGames is limited because of its very low granularity. You can see the repeating waves in the data - that's not noise, that's because to get .3 eGames you have to be listed as probable in three separate weeks whereas getting to .4 just requires being listed as questionable once.
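To make the granularity point concrete, here is a minimal sketch of how an eGames total could be added up. The weights for "probable" (.1) and "questionable" (.4) follow from the numbers above; the weights for "doubtful" and "out", and the names `LISTING_WEIGHTS` and `egames`, are placeholders invented for this example.

```python
# Probability of missing the next game for each injury report listing.
# "probable" and "questionable" follow from the text (3 x probable = .3,
# 1 x questionable = .4); "doubtful" and "out" are placeholder guesses.
LISTING_WEIGHTS = {
    "probable": 0.1,
    "questionable": 0.4,
    "doubtful": 0.75,  # assumption, not from the article
    "out": 1.0,        # assumption, not from the article
}

def egames(weekly_listings):
    """Sum the miss probabilities over a season of weekly listings.

    Weeks with no listing (None) contribute nothing.
    """
    return sum(LISTING_WEIGHTS[listing] for listing in weekly_listings if listing)

# Three weeks as probable versus one week as questionable.
print(round(egames(["probable", None, "probable", "probable"]), 1))
print(round(egames([None, "questionable", None]), 1))
```

Because the total can only be built from a handful of fixed weights, some values are much easier to reach than others, which is one way to get waves like the ones in the raw chart.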
In order to do fun things with this data I have to model it as a continuous distribution that (hopefully) approximates the actual distribution of probability of games missed due to injury. Because of fun background stuff that we don't need to get into I tend to assume a gamma distribution when modeling an aggregate of many rare events and that's what I ended up using here.
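For readers who want to see what fitting a gamma distribution can look like, here is a minimal sketch using the method of moments. It is not necessarily the fitting procedure used for this article, which isn't spelled out here, and the data are random stand-ins drawn from a gamma with made-up parameters.

```python
import random
from statistics import mean, pvariance

def fit_gamma_moments(values):
    """Method-of-moments gamma fit: returns (shape, scale).

    Uses mean = shape * scale and variance = shape * scale**2.
    """
    m = mean(values)
    v = pvariance(values)
    return m * m / v, v / m

rng = random.Random(2013)  # arbitrary seed so the run is repeatable
# Stand-in data: draws from a gamma with shape 0.6 and scale 1.3 (made up).
fake_egames = [rng.gammavariate(0.6, 1.3) for _ in range(20000)]
shape, scale = fit_gamma_moments(fake_egames)
print(shape, scale)
```

A shape below 1 gives the kind of curve that piles up near zero and trails off slowly, which is the general look you would expect from an aggregate of rare events.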
Here's the distribution I arrived at after the liberal application of intuition and maths:
The red line is my projected distribution and the dashed line is its survival function - the share of player seasons expected to have more than a given number of eGames. Using that you can see that well under half of the players are expected to miss more than a game.
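If the survival function is unfamiliar, here is a minimal sketch of the empirical version: for any cutoff, count the share of player seasons that sit above it. The eGames values in the list are invented for the example.

```python
def survival(values, x):
    """Share of player seasons with more than x eGames."""
    return sum(1 for v in values if v > x) / len(values)

seasons = [0.0, 0.0, 0.1, 0.3, 0.4, 0.8, 1.2, 2.5]  # invented eGames values
for x in (0.0, 0.5, 1.0):
    print(x, survival(seasons, x))
```

Reading the value at 1.0 gives the share of seasons expected to involve more than one missed game, which is how the dashed line is being used here.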
I should note at this point that I have to project the number of uninjured players based on how many player weeks are in the NFL season. This is easy for the aggregate numbers above but for individual positions I need an accurate average of players carried on the 53-man roster for each position - this is why I did the crowd sourcing.
Here is a gif of each individual position group's distribution, the red line is the projected position group distribution and the dashed line is the full population distribution from above:
You guys dramatically underestimated the number of LBs because reality has a 3-4 bias but I'm not doing LB or D line stuff right now so I wasn't worried about it.
At any rate, before we dive into QBs here are the mean, median, and number of non-projected seasons for each position group for the curious:
I'm going to start by asserting that rushes are an important factor to account for when considering QB injury risk. Later I'll use rush attempts but for this part of the series I'm going to use rush yards per game to show this. I want to stick to relatively simple models right now and rush yards per game distills many of the other factors you have to consider in an attempts model into one nice number.
To support the view that rushes increase injury risk here is a chart of the change in mean eGames in the QB population as those with the lowest rush yards per game are dropped. The lines are the 10% and 90% quantiles for the mean of random samples of the same size as the actual QB population at each value:
The consistent upward walk suggests to me that there is a significant relation to be found with a greater sample size but it's undeniably weak - just look at how neatly it stays within the confidence band. If you prefer cold unfeeling numbers to sexy winking graphs I'll just let Mathematica tell you:
(where populations are rush yards per game and eGames)
Yeah, again, that's pretty weak, but this is football analysis. I have a bad proxy for injuries, my sample size is just three seasons, and there are a dozen other plausible risk factors.
Still, that means that injury risk most likely increases with rushes, right? Well, maybe, rush yards per game isn't just a proxy for rushes and could be capturing any number of things. But it doesn't hurt the argument. Later in this series when I get into multivariate analysis we'll see better evidence.
The point I actually want to make is that even if we assume rush yards per game is capturing the relation of rushes and injury risk the relation isn't neatly intuitive because of the nature of rare events.
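For anyone who wants to try the moving-mean chart on their own numbers, here is a minimal sketch of the idea: drop the lowest rushers one at a time, take the mean eGames of who is left, and compare it with the 10% and 90% quantiles of the means of random samples of the same size. The function name, the sampling without replacement, and the QB data are all assumptions made for this example.

```python
import random
from statistics import mean, quantiles

def running_mean_with_band(qbs, n_draws=2000, seed=0):
    """qbs: list of (rush_yards_per_game, egames) pairs.

    For each cutoff, drop the lowest rushers, take the mean eGames of the
    rest, and compare with means of random same-size samples from everyone.
    Returns rows of (cutoff, group_mean, low_10pct, high_90pct).
    """
    rng = random.Random(seed)
    ordered = sorted(qbs)
    all_egames = [e for _, e in ordered]
    rows = []
    for start in range(len(ordered) - 1):  # always keep at least two QBs
        kept = ordered[start:]
        size = len(kept)
        sample_means = [mean(rng.sample(all_egames, size)) for _ in range(n_draws)]
        deciles = quantiles(sample_means, n=10)  # nine cut points
        rows.append((kept[0][0], mean(e for _, e in kept), deciles[0], deciles[-1]))
    return rows

# Invented (rush yards per game, eGames) pairs, not real QB seasons.
fake_qbs = [(0.5, 0.2), (2.0, 0.0), (5.0, 1.1), (9.0, 0.4),
            (14.0, 0.9), (22.0, 1.6), (31.0, 0.8)]
for row in running_mean_with_band(fake_qbs, n_draws=500):
    print(row)
```

A group mean that stays between the two quantiles is what you would expect to see fairly often from a random group of that size, which is why staying inside the band counts as weak evidence.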
Usually one uses the mean to concisely summarize a probability distribution. In the case of distributions describing the frequency of rare events the mean can be very misleading.
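Here is a minimal sketch of how far the mean can drift from the typical outcome in a skewed distribution. The gamma parameters are made up for the example and are not the fitted values from this article.

```python
import random
from statistics import mean, median

rng = random.Random(7)  # arbitrary seed so the run is repeatable
# Made-up skewed "games missed" distribution: gamma, shape 0.5, scale 1.6.
draws = [rng.gammavariate(0.5, 1.6) for _ in range(20000)]

avg = mean(draws)
mid = median(draws)
# Share of simulated seasons that come in above the mean.
share_over_mean = sum(d > avg for d in draws) / len(draws)
print(avg, mid, share_over_mean)
```

In a distribution shaped like this the median sits well below the mean and most draws land under the mean, so the mean describes a season that most players never have.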
In the chart showing the moving mean above you can see the QB population mean is .78 eGames while the high rush yards per game group tops out at over 1 eGame. That's not a huge increase, but it is an increase. Still, as small as it is, it overstates the increase in injury risk. To the GIF-copter!
The dashed line represents the full population distribution and the red line represents the distribution at the rush yards per game cutoff above.
The change in the mean is modest but you can see that the change in the distribution is nearly non-existent.
By way of analogy, if the mean size of a storm in your area has increased by 100% that doesn't mean each storm will be a hurricane - they'll still be very rare.
That is the primary message of this piece. Our minds don't work well with the probability distributions of rare events.
But enough foundational stuff. Let's see what this (unsubstantiated) model has to say about Russell Wilson.
Last season Wilson rushed for just over 30 yards per game. That's kind of a lot and, without looking, I'm fairly certain that they were high leverage yards. So, yay! Unfortunately for modeling, just twenty other QBs have rushed for within +/- 10 yards of Wilson's total so the sample size up there is pretty small. But there's hope for projection. By tracking the change of the projected distribution against average yards per game through the sample I can project an individual injury probability chart for Wilson with more accuracy than by just using the high yards/game QBs.
Will it be an accurate representation of reality? Maybe, I'd have to check it against reality to know. But if you suspend your disbelief for a bit I'll use it to reinforce my point about rare events.
Wilson is in red versus the dashed population distribution.
Note that I used a different scale than in the GIF above.
And the central tendencies of this distribution versus the population distribution are as follows:
We can tell just by looking at Russell Wilson's projection that the mean is misleading. The projection suggests far less than 50% of QBs who run 30.6 yards a game are expected to miss a game. This is borne out by the median, which still sits below .5 eGames for the Wilson PDF.
To close out this article let's make two silly assumptions so that we can have some fun with baseless speculating. Let's first say that all QB injuries occur during an in-game football play where they have the ball. Second, we'll assume that there are two probability distributions for injuries in QBs, a rushing probability and a passing probability. We'll say that Wilson rushes one time for every four passes and is involved in 500 plays a season. I'll project the two distributions using the same method that I used to project Wilson's personal injury risk curve, with a hypothetical QB at zero (actually .01) yards a game standing in for pure passing and another at 200 yards a game (something like his total if he had rushed every play for six y/a) standing in for pure rushing. Here's a graph:
Note that the new Wilson projection has a different shape than the one I showed you a little bit back and has different central tendencies:
The model I used to project Wilson's original projected injury risk curve systematically falls apart for high values of rush yards per game. It is likely underestimating rush injury risk. But like I said this part is just for fun.
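The rush-pass split can also be played with in a much cruder form. Here is a minimal sketch that treats every play as an independent coin flip. The 500 plays and the one-rush-per-four-passes ratio come from the assumptions above; the per-play injury chances and the function name are invented for illustration, and this is a simplification rather than the projection method used for the charts.

```python
def season_injury_probability(plays, passes_per_rush, p_rush, p_pass):
    """Chance of at least one injury in a season of independent plays."""
    rushes = plays / (passes_per_rush + 1)
    passes = plays - rushes
    # An injury-free season needs every rush and every pass to go fine.
    p_no_injury = (1 - p_rush) ** rushes * (1 - p_pass) ** passes
    return 1 - p_no_injury

# 500 plays and a 1:4 rush-to-pass ratio come from the text;
# the per-play risks are invented for illustration only.
print(season_injury_probability(500, 4, p_rush=0.002, p_pass=0.0005))
```

With these made-up inputs a rush is four times as risky as a pass, yet the 100 rushes and the 400 passes end up contributing about equally to the season total, because the safer play happens four times as often.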
I included the projections for individual rush and pass plays to make that point I mentioned earlier. Now I'll poke you with it. The median risk for a rush play is 234 orders of magnitude greater than for a throw play. That's a lot! Clearly the message to coaches is to stop rushing your QB.
But, not really.
That sort of logic would leave you terrified of meteors because being killed by one is more likely than being struck by lightning (possibly true) - it's stupid.
But what about the fun? Well, eagle-eyed readers may have noticed that I haven't used a single word I promised to in the crowdsourcing fan shot (except for "the"). I'm about to! As I continue this series on QB injuries I'm going to need to simulate QB seasons using different models but the same random number strings. That means that I need to keep a bunch of seed strings lying around. I went ahead and asked for your words so that the various projected seasons could have some emotional meaning.
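Here is a minimal sketch of why word seeds are handy: the same word always produces the same string of random numbers, so two models can be compared on identical luck. The per-game miss chances, the stand-in seed words, and the flat per-game model are all assumptions for this example; the career length of twenty 19-game seasons matches the table that follows.

```python
import random

def simulate_career(seed_word, p_miss_per_game, seasons=20, games=19):
    """Games missed in each season of one seeded career."""
    rng = random.Random(seed_word)  # a string seed gives a repeatable stream
    return [sum(rng.random() < p_miss_per_game for _ in range(games))
            for _ in range(seasons)]

# Same seed word, two hypothetical models with different per-game risk.
for word in ("osprey", "twelve"):  # stand-in seed words
    low = simulate_career(word, 0.03)
    high = simulate_career(word, 0.05)
    print(word, sum(low), sum(high))
```

Because both models see the same random numbers, any difference between the two totals comes from the models rather than from one of them getting a luckier draw.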
Using the just-for-fun rush-pass split projection here are Wilson's expected games missed with each of your seeds for a twenty season career with 19 games each season - note that I didn't really try to format things well.
So remember, rare events are funny, rushing is probably more dangerous than throwing, and fiftyone and pqlqi want to see Wilson hurt.
I'll see you all in part 2 which will have 3D graph gifs!