The Hive turned me on to this NFL play-by-play dataset going back to 2002, released by Brian Burke to see what the community could do with it. It’s a set of CSV files (hosted as Google Docs spreadsheets) with some metadata columns (game ID, offensive team, defensive team, yard line, down, yards to go, quarter, time remaining) and a text description of each play.
This seems like a gold mine for evaluating the overall effectiveness of offenses and defenses in different situations. The data is great, but you can’t do much with it unless you can parse the text descriptions. Here are some example descriptions:
- (Shotgun) E.Manning pass incomplete short middle to V.Cruz.
- (No Huddle) L.McCoy left tackle to CLV 43 for 7 yards (T.Ward).
- (Shotgun) T.Brady pass deep middle to W.Welker to DEN 1 for 19 yards (C.Harris). Touchdown was reviewed by Review Assistant. And overturned.
There’s quite a bit you can do with this data as a spreadsheet. But when it comes to parsing these descriptions, a spreadsheet is simply not the right tool. After a few hours and a couple hundred lines of Python, I’ve parsed each description, filtered out non-offensive plays (i.e. kicks, penalties, and weird situations), and then determined the type of each play (pass or run) and the yardage gained or lost. I then rate each play as a success or failure. A successful play meets one of these criteria:
- Play results in a first down
- Play scores a touchdown
- On 1st or 2nd down, play gains at least 4 yards
And, of course, successful plays can’t result in turnovers or losing the game.
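To make the parsing and rating steps concrete, here is a minimal sketch of how they could look in Python. It is not the actual parser: the regular expressions and the names `parse_play` and `is_success` are hypothetical, it assumes kicks and penalties have already been filtered out, and it ignores sacks, fumbles, and reviews. Touchdowns and turnovers are passed in as flags rather than detected from the text.

```python
import re

# Hypothetical patterns for illustration only; the real parser is much longer.
PASS_RE = re.compile(r"\bpass\b")
INCOMPLETE_RE = re.compile(r"\bpass incomplete\b")
NO_GAIN_RE = re.compile(r"\bfor no gain\b")
YARDS_RE = re.compile(r"\bfor (-?\d+) yards?\b")


def parse_play(description):
    """Return (play_type, yards) for a simple offensive play, or None.

    Assumes the description is a plain run or pass; kicks, penalties,
    and other odd situations are expected to be filtered out beforehand.
    """
    play_type = "pass" if PASS_RE.search(description) else "run"
    if INCOMPLETE_RE.search(description) or NO_GAIN_RE.search(description):
        return play_type, 0
    match = YARDS_RE.search(description)
    if match is None:
        return None  # description did not fit any pattern we know
    return play_type, int(match.group(1))


def is_success(down, yards_to_go, yards_gained, touchdown=False, turnover=False):
    """Apply the success criteria listed above to a single play."""
    if turnover:
        return False
    if touchdown:
        return True
    if yards_gained >= yards_to_go:  # play results in a first down
        return True
    return down in (1, 2) and yards_gained >= 4


examples = [
    "(Shotgun) E.Manning pass incomplete short middle to V.Cruz.",
    "(No Huddle) L.McCoy left tackle to CLV 43 for 7 yards (T.Ward).",
]
for text in examples:
    print(parse_play(text))
```

On the example descriptions above, the incomplete pass comes out as a pass for 0 yards and the McCoy run as a run for 7 yards, which on 1st-and-10 would count as a success under the 4-yard rule.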
I also determined how each drive ended, and then I ingested the entire dataset into a MySQL database. SQL is a great tool for filtering the plays based on specific values or ranges for any of the columns.
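As an illustration of the kind of filtering SQL makes easy, here is a small sketch. It uses Python's built-in `sqlite3` with an in-memory database as a stand-in for MySQL so that it runs anywhere. The `plays` table, its column names, the team codes, and the handful of rows are all made up for the example and are not the real schema or data.

```python
import sqlite3

# In-memory SQLite as a stand-in for the MySQL database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE plays (
        offense   TEXT,
        defense   TEXT,
        down      INTEGER,
        to_go     INTEGER,
        play_type TEXT,     -- 'run' or 'pass'
        yards     INTEGER,
        success   INTEGER   -- 1 for a successful play, 0 otherwise
    )
    """
)

# Made-up sample rows, purely for illustration.
sample_rows = [
    ("AAA", "BBB", 3, 1, "run", 2, 1),
    ("AAA", "BBB", 3, 2, "run", 3, 1),
    ("BBB", "AAA", 3, 2, "run", 0, 0),
    ("BBB", "AAA", 3, 2, "pass", 9, 1),
    ("AAA", "BBB", 1, 10, "run", 3, 0),
]
conn.executemany("INSERT INTO plays VALUES (?, ?, ?, ?, ?, ?, ?)", sample_rows)


def success_rate(conn, play_type, down, max_to_go):
    """Return (play_count, success_rate) for plays matching the filters."""
    return conn.execute(
        """
        SELECT COUNT(*), AVG(success)
        FROM plays
        WHERE play_type = ? AND down = ? AND to_go <= ?
        """,
        (play_type, down, max_to_go),
    ).fetchone()


# How often do runs on 3rd down with 2 or fewer yards to go succeed?
count, rate = success_rate(conn, "run", 3, 2)
print(count, rate)
```

The same `WHERE` clause pattern extends to any of the metadata columns, such as quarter or yard line, which is what makes this kind of situational slicing quick.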
I don’t claim 100% of the 350,000+ play descriptions were parsed accurately, but based on some spot checking, I’d guess accuracy is in the high 90s. Football is a game with strange rules, and it would probably take thousands of lines of code to nail down every weird situation involving penalties, official review, blocked kicks, and multiple fumbles.
Regardless, with this mostly accurate database available, it’s now possible to quickly generate stats and even make the data accessible as a web application. I hope to roll out a site in the next couple days.
Here’s a sneak preview of the run stats visualization:
Stay tuned and let me know what great ideas you have for the web app!