How to Analyze the Play-by-Play Data

Interested in analyzing the play-by-play football stats? I pushed some code to GitHub to help you parse the plays and output them to your choice of CSV (for use in a spreadsheet), JSON, or a SQL database. You can check out the code (and a brief README) here. The repository also contains all the source code of my play-by-play effectiveness webapp.

How to Parse Plays to CSV

If you don’t need the data in a SQL database, you can quickly parse the plays to a CSV file (comma-separated values). You could then open this file in a spreadsheet application or read it into the scripting environment of your choice (R, MATLAB, Python, etc.).

I created a specific branch of the repository that will parse the plays without any dependencies on database libraries (i.e. SQLAlchemy). Assuming you’re on Linux and have Git installed, just clone the “nodatabase” branch of the repository to get started like this:

$ git clone --branch=nodatabase https://github.com/10flow/playbyplay.git

Then you can run the parser script:

$ cd playbyplay/play-parser/
$ ./play_parser.py

After a minute or two of processing, you’ll have a “output.csv” ready to play with.

How to Ingest Plays into a Database

If you want to use a SQL database, I’ll assume you have one installed. If not, I’ll refer you here for Ubuntu’s guide to MySQL.

My ingest script requires SQLAlchemy, a popular database library for Python. This is easy to install with the “easy_install” package manager. If you don’t have easy install, you can install it on Ubuntu as follows: “sudo apt-get install python-setuptools”. Next, get SQLAlchemy like this:

$ sudo easy_install SQLAlchemy

Now you’re ready to download the code from GitHub. Grab the master branch if you’re planning on using a database:

$ git clone https://github.com/10flow/playbyplay.git

Before you run the parser script, there’s a few things to do. First, specify your database information in “play_parser.py” on line 28. And next, you’ll need to create the database before you can insert the data. For example, if your database connection string (line 28) looks like this:

database_engine = create_engine('mysql+mysqldb://root:password@localhost/playtest', echo=False)

Then create the database like this:

$ mysql -u root -p
mysql> create database playtest;
Query OK, 1 row affected (0.06 sec)
mysql> exit

Now you’re ready to run the script:

$ cd playbyplay/play-parser/
$ ./play_parser.py

Inserting the plays will take a long time. But once it’s done, you can issue queries this:

$ mysql -u root -p
mysql> use playtest;

Database changed
mysql> select result,count(result) from plays where season=2012 and result!='play' group by result;
+--------+---------------+
| result | count(result) |
+--------+---------------+
| downs  |           209 |
| fg     |           855 |
| fumble |           184 |
| half   |           185 |
| int    |           378 |
| loss   |            61 |
| miss   |           193 |
| ot     |            14 |
| other  |           154 |
| punt   |          2480 |
| td     |          1269 |
| tie    |             2 |
| win    |           148 |
+--------+---------------+
13 rows in set (0.02 sec)

And now you’re ready to analyze.

Let me know if you come with any interesting stats in time for the big game!

Play-by-Play Effectiveness

Check out this web app I put together to measure the effectiveness of plays and drives:
http://football.10flow.com/

Here are some interesting stats I’ve found so far:

The stats show Andy Reid’s Eagles performed poorly when down a score in the final 2 minutes. The biggest problem seems to be the terrible 34% pass success rate and a below average plays-per-drive (4.4 compared to 4.7). Two-minute drill drives ended in 11 picks (21.2%), 12 turnovers on downs (23.1%), and 10 TDs (19.2%). They ran out of the clock 9 times. Their scoring percentage was 25% compared to a league average of 33.4%. Good luck Kansas City!

The 2012 Jets stand out as a weak offense. There was a big drop in touchdowns, pass play effectiveness, and interestingly, pass play selection. The Jets passed on 57.1% of plays in 2011, compared to 52.1% this year. The loss of CB Darrelle Revis also had an impact. Opposing team’s pass success rate went from 40.8% (with 15 picks) in 2011 to 42.8% (with 9 picks) in 2012.

Finally, the Pats offense was scary-effective this year. Their touchdown rate stands out the most — 32.4% compared to league average 21.0%.

Leave a reply below or Tweet at me if you find something interesting in the stats!

NFL Play-by-Play Data

The Hive turned me on to this NFL play-by-play dataset going back to 2002, released by Brian Burke to see what the community could do with it. It’s a set of CSV files (hosted as a Google Docs spreadsheets) with some metadata columns (game ID, offensive team, defensive team, yard-line, down, yards-to-go, quarter, time remaining) and a text description of the play.

This seems like a gold mine for evaluating the overall effectiveness of offenses and defenses in different situations. The data is great, but you can’t do much with the data unless you can parse the text description. Here are some example descriptions:

  • (Shotgun) E.Manning pass incomplete short middle to V.Cruz.
  • (No Huddle) L.McCoy left tackle to CLV 43 for 7 yards (T.Ward).
  • (Shotgun) T.Brady pass deep middle to W.Welker to DEN 1 for 19 yards (C.Harris). Touchdown was reviewed by Review Assistant. And overturned.

There’s a quite a bit you can do with this data as a spreadsheet. But when it comes to parsing these descriptions, a spreadsheet is simply not the right tool. After a few hours and a couple hundred lines of Python, I’ve parsed each description, filtered out non-offensive plays (i.e. kicks, penalties, and weird situations), and then determined the type of each play (pass or run) and the yardage gained or lost. I then rate each play as a success or failure. A successful play meets one of these criteria:

  • Play results in a first down
  • Play scores a touchdown
  • On 1st or 2nd down, play gains at least 4 yards

And, of course, successful plays can’t result in turnovers or losing the game.

I also determined how each drive ended, and then I ingested the entire dataset in a MySQL database. SQL is a great tool for filtering the plays based on specific values or ranges for any of the columns.

I don’t claim 100% of the 350,000+ play descriptions were parsed accurately, but based on some spot checking, I’d guess accuracy is in the high 90s. Football is a game with strange rules, and it would probably take thousands of lines of code to nail down every weird situation involving penalties, official review, blocked kicks, and multiple fumbles.

Regardless, with this mostly accurate database available, it’s now possible to quickly generate stats and even make the data accessible as a web application. I hope to roll out a site in the next couple days.

Here’s a sneak preview of the run stats visualization:

Run Statistics

Histograms show run play success

Stay tuned and let me know what great ideas you have for the web app!