The Hive turned me on to this NFL play-by-play dataset going back to 2002, released by Brian Burke to see what the community could do with it. It’s a set of CSV files (hosted as Google Docs spreadsheets) with some metadata columns (game ID, offensive team, defensive team, yard-line, down, yards-to-go, quarter, time remaining) and a text description of the play.
This seems like a gold mine for evaluating the overall effectiveness of offenses and defenses in different situations. The data is great, but you can’t do much with it unless you can parse the text description. Here are some example descriptions:
- (Shotgun) E.Manning pass incomplete short middle to V.Cruz.
- (No Huddle) L.McCoy left tackle to CLV 43 for 7 yards (T.Ward).
- (Shotgun) T.Brady pass deep middle to W.Welker to DEN 1 for 19 yards (C.Harris). Touchdown was reviewed by Review Assistant. And overturned.
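To give a feel for what parsing these involves, here is a minimal sketch written against the three examples above only. It is not the actual parser: the function name `parse_play`, the regular expressions, and the list of non-offensive keywords are all assumptions made for illustration, and real descriptions have many more variations than this handles.

```python
import re

# Hypothetical patterns, derived only from the example descriptions above.
PASS_RE = re.compile(r"\bpass\b")
INCOMPLETE_RE = re.compile(r"\bpass incomplete\b")
YARDS_RE = re.compile(r"\bfor (-?\d+) yards?\b")
NO_GAIN_RE = re.compile(r"\bfor no gain\b")
# Assumed keywords for plays that are not ordinary runs or passes.
NON_OFFENSIVE_RE = re.compile(
    r"\b(kicks|punts|field goal|extra point|penalty)\b", re.IGNORECASE
)


def parse_play(description):
    """Return (play_type, yards) for a run or pass, or None if the play is skipped."""
    if NON_OFFENSIVE_RE.search(description):
        return None
    # Anything that does not mention a pass is treated as a run (an assumption).
    play_type = "pass" if PASS_RE.search(description) else "run"
    if INCOMPLETE_RE.search(description) or NO_GAIN_RE.search(description):
        return play_type, 0
    match = YARDS_RE.search(description)
    if match is None:
        return None  # yardage not found; leave it for manual inspection
    return play_type, int(match.group(1))


examples = [
    "(Shotgun) E.Manning pass incomplete short middle to V.Cruz.",
    "(No Huddle) L.McCoy left tackle to CLV 43 for 7 yards (T.Ward).",
    "(Shotgun) T.Brady pass deep middle to W.Welker to DEN 1 for 19 yards "
    "(C.Harris). Touchdown was reviewed by Review Assistant. And overturned.",
]
for text in examples:
    print(parse_play(text))
```

Returning `None` for anything the patterns don't recognize keeps the uncertain plays out of the stats instead of guessing at them.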
There’s quite a bit you can do with this data as a spreadsheet. But when it comes to parsing these descriptions, a spreadsheet is simply not the right tool. After a few hours and a couple hundred lines of Python, I’ve parsed each description, filtered out non-offensive plays (i.e. kicks, penalties, and weird situations), and then determined the type of each play (pass or run) and the yardage gained or lost. I then rate each play as a success or failure. A successful play meets one of these criteria:
- Play results in a first down
- Play scores a touchdown
- On 1st or 2nd down, play gains at least 4 yards
And, of course, successful plays can’t result in turnovers or losing the game.
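The criteria above can be sketched as a small function. This is only an illustration under some assumptions: the function name and parameters are hypothetical, a first down is approximated as gaining at least the yards-to-go (first downs awarded by penalty are ignored), and the "losing the game" condition is not modeled at all.

```python
def is_successful(down, yards_to_go, yards_gained, touchdown=False, turnover=False):
    """Rate a single offensive play as a success (True) or failure (False)."""
    # A turnover always counts as a failure, regardless of yardage.
    if turnover:
        return False
    # A touchdown is a success on any down.
    if touchdown:
        return True
    # Assumption: gaining at least the yards-to-go means a first down.
    if yards_gained >= yards_to_go:
        return True
    # On 1st or 2nd down, a gain of at least 4 yards also counts.
    return down in (1, 2) and yards_gained >= 4


print(is_successful(down=1, yards_to_go=10, yards_gained=4))  # 4 yards on 1st down
print(is_successful(down=3, yards_to_go=5, yards_gained=4))   # short on 3rd down
```

With these rules, a 4-yard gain is a success on 1st-and-10 but a failure on 3rd-and-5, since only the first two downs get the 4-yard allowance.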
I also determined how each drive ended, and then I ingested the entire dataset into a MySQL database. SQL is a great tool for filtering the plays based on specific values or ranges for any of the columns.
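As a rough sketch of the kind of filtering SQL makes easy, the example below compares run and pass success rates on 3rd-and-short. It uses Python's built-in `sqlite3` with an in-memory database as a stand-in for MySQL so it runs anywhere. The table name `plays`, its columns, and the handful of rows are made up for illustration and are not the real schema or real data.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL database (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE plays (
           down INTEGER, togo INTEGER, play_type TEXT,
           yards INTEGER, success INTEGER)"""
)

# Made-up toy rows, just enough to exercise the query.
toy_rows = [
    (3, 2, "run", 3, 1),
    (3, 1, "run", 0, 0),
    (3, 1, "run", 2, 1),
    (3, 3, "pass", 8, 1),
    (3, 2, "pass", 0, 0),
    (1, 10, "run", 4, 1),
]
conn.executemany("INSERT INTO plays VALUES (?, ?, ?, ?, ?)", toy_rows)


def short_yardage_success(conn, down=3, max_togo=3):
    """Return (play_type, play_count, success_rate) rows for short-yardage plays."""
    query = """
        SELECT play_type, COUNT(*), AVG(success)
        FROM plays
        WHERE down = ? AND togo BETWEEN 1 AND ?
        GROUP BY play_type
        ORDER BY play_type
    """
    return conn.execute(query, (down, max_togo)).fetchall()


for play_type, count, rate in short_yardage_success(conn):
    print(f"{play_type}: {count} plays, {rate:.0%} success")
```

Changing the `WHERE` clause is all it takes to look at a different down, distance, quarter, or field position, which is the appeal of having the plays in a database.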
I don’t claim 100% of the 350,000+ play descriptions were parsed accurately, but based on some spot checking, I’d guess accuracy is in the high 90s. Football is a game with strange rules, and it would probably take thousands of lines of code to nail down every weird situation involving penalties, official review, blocked kicks, and multiple fumbles.
Regardless, with this mostly accurate database available, it’s now possible to quickly generate stats and even make the data accessible as a web application. I hope to roll out a site in the next couple days.
Here’s a sneak preview of the run stats visualization:
Stay tuned and let me know what great ideas you have for the web app!
Awesome work! Is your code on github? I’d like to see and add to it if I can.
great idea… it’s not, but I’ll see if I can make that happen
thanks! looking forward to it
The source is now available on GitHub:
https://github.com/10flow/playbyplay
Great work!!
Yay! I’m not the only programmer / stat junkie who saw errors and “weird situations” in the data.
Capital “O” concatenated to the end of player #?
My favorite so far (aside from multiple-laterals):
9-J.Tucker kicks onside 8 yards from BAL 35 to BAL 43. 98-V.Williams (didn’t try to advance) to BAL 43 for no gain (17-T.Doss). Penalty on BAL-36-J.Miles, Illegal Kick, declined. PENALTY on BAL-9-J.Tucker, Offside on Free Kick, 5 yards, enforced at BAL 43.
Just curious: did you consider going so far as to extract pass and run directions (i.e., left tackle, up the middle, deep right, etc.)?
Lastly, did you read about the NFL’s proposal to use RFID to track player motion? See: http://arstechnica.com/gadgets/2014/07/fantasy-footballers-and-coaches-rejoice-nfl-players-to-wear-rfid-tags/