A New Python Library To Get PlayByPlay Data, Useful for APM
Posted: Tue Aug 02, 2016 3:21 pm
by ethanluo
Hey guys,
I previously worked on a basketball analytics project that included computing my own RAPM and APM. For that project I had to crawl the play-by-play data from stats.nba.com.
If I understand correctly, many of the existing sources of RAPM (and other metrics) are pre-computed, meaning that you only get the result, not the source data that produced it. This is not enough for some of us, and the play-by-play data can be tough to find.
Nevertheless, I managed to scrape the data from stats.nba.com with my library, which can be found on GitHub:
https://github.com/ethanluoyc/statsnba-playbyplay . It parses the play-by-play data from the website into plays (i.e. the events that happened during a game). You can get the EventType, Score, and OnCourtPlayers for each event. Ideally, I hope it can produce data of similar quality to that on
https://downloads.nbastuffer.com/nba-pl ... -data-sets.
I am still working on it, but the major functionality is currently more or less usable. I would like to find out from APBRmetrics, though, how many people actually find this useful and relevant to themselves. This will help me decide how much time I should devote to this project. (Say it's just me and a few other geeks who are interested in this; then we will do fine with sparse docs and limited functionality. But should this receive wider attention in the community, I am more than willing to write more docs and keep a data feed updated if that is something you guys want.)
It would be great if you guys could comment on this.

Re: A New Python Library To Get PlayByPlay Data, Useful for
Posted: Wed Aug 03, 2016 4:08 am
by ampersand5
Thank you for your contribution!
I'm not doing any RAPM stuff now, but this would have made my life much easier at many points. I hope you get more feedback on this, but I can say, despite what gets posted, tons of people will use this resource and get influenced (and more involved) through your work.
Re: A New Python Library To Get PlayByPlay Data, Useful for
Posted: Wed Aug 03, 2016 8:58 pm
by kmedved
This is a huge contribution to the community. Data scraping is the least fun part of doing RAPM calculations, and yet, obviously integral, so this is a big help.
Re: A New Python Library To Get PlayByPlay Data, Useful for
Posted: Thu Aug 04, 2016 1:53 am
by ethanluo
kmedved wrote:This is a huge contribution to the community. Data scraping is the least fun part of doing RAPM calculations, and yet, obviously integral, so this is a big help.
I actually tried to release it last year, but back then it was my working copy, and the code was hard for others to read and use, so I refactored the whole thing extensively.
Re: A New Python Library To Get PlayByPlay Data, Useful for
Posted: Thu Aug 04, 2016 12:05 pm
by DSMok1
ethanluo wrote:kmedved wrote:This is a huge contribution to the community. Data scraping is the least fun part of doing RAPM calculations, and yet, obviously integral, so this is a big help.
I actually tried to release it last year, but back then it was my working copy, and the code was hard for others to read and use, so I refactored the whole thing extensively.
Excellent contribution! I've been following your repository for this project for a while.
Re: A New Python Library To Get PlayByPlay Data, Useful for
Posted: Thu Aug 04, 2016 1:01 pm
by ethanluo
DSMok1 wrote:ethanluo wrote:kmedved wrote:This is a huge contribution to the community. Data scraping is the least fun part of doing RAPM calculations, and yet, obviously integral, so this is a big help.
I actually tried to release it last year, but back then it was my working copy, and the code was hard for others to read and use, so I refactored the whole thing extensively.
Excellent contribution! I've been following your repository for this project for a while.
Yeah I know

I just hope that more people can be informed that this project exists.
I know that there is another project py-Goldsberry which is also very useful.
Re: A New Python Library To Get PlayByPlay Data, Useful for
Posted: Thu Aug 04, 2016 4:57 pm
by DSMok1
There are 4 major components of the RAPM process, as I see it:
1. Scrape and clean the raw PbP data. This is what this library does--a very valuable part of the process.
2. Parse out what lineups are on the floor at any given time.
3. Construct the RAPM sparse matrix.
4. Actually run the RAPM analysis.
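To make step 3 concrete, here is a rough sketch of how stint-level data could be turned into the sparse design matrix, in coordinate (row, column, value) form. The Stint record and the build_design function are hypothetical names made up for this illustration, not part of any library, and the toy lineups have two players per side instead of five. It assumes the common setup of +1 for home players, -1 for away players, and a target of home margin per 100 possessions.

```python
from collections import namedtuple

# Hypothetical stint record (an assumption for this sketch): the players on
# court for each side, possessions played, and home-minus-away point margin.
Stint = namedtuple("Stint", ["home", "away", "possessions", "margin"])


def build_design(stints):
    """Return COO triplets (rows, cols, vals), targets, weights and players."""
    players = sorted({p for s in stints for p in s.home + s.away})
    col = {p: j for j, p in enumerate(players)}
    rows, cols, vals, y, weights = [], [], [], [], []
    for s in stints:
        if s.possessions == 0:
            continue  # no possessions means no rate to compute; skip the stint
        i = len(y)  # row index of this stint in the matrix
        for p in s.home:
            rows.append(i)
            cols.append(col[p])
            vals.append(1.0)   # home players get +1
        for p in s.away:
            rows.append(i)
            cols.append(col[p])
            vals.append(-1.0)  # away players get -1
        y.append(100.0 * s.margin / s.possessions)  # margin per 100 possessions
        weights.append(s.possessions)
    return rows, cols, vals, y, weights, players


# Toy data with two players per side, purely for illustration.
stints = [
    Stint(home=("A", "B"), away=("C", "D"), possessions=10, margin=4),
    Stint(home=("A", "E"), away=("C", "D"), possessions=5, margin=-1),
]
rows, cols, vals, y, weights, players = build_design(stints)
```

The triplets can be handed to something like scipy.sparse.coo_matrix((vals, (rows, cols))) when it is time to fit; keeping them as plain lists here just avoids any dependency.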
Re: A New Python Library To Get PlayByPlay Data, Useful for
Posted: Fri Aug 05, 2016 1:38 am
by ethanluo
DSMok1 wrote:There are 4 major components of the RAPM process, as I see it:
1. Scrape and clean the raw PbP data. This is what this library does--a very valuable part of the process.
2. Parse out what lineups are on the floor at any given time.
3. Construct the RAPM sparse matrix.
4. Actually run the RAPM analysis.
Yes indeed. I have completed steps 1 and 2. If you check out the v0.2.0 I pushed yesterday, it now features a Matchup class, which is basically the lineup (I have trouble naming these things, so I just left it as Matchup). Every matchup has the on-court players, and you can also query for their boxscores. I also have a test suite that verifies the integrity of this library against the data itself. I ran the test suite on a couple of games and everything seems to be working.
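For readers wondering what step 2 amounts to, here is a minimal sketch of tracking who is on the court by replaying substitution events. This is not the library's Matchup class or its API; the track_lineups function and the event dictionaries (with "type", "out" and "in" keys) are assumptions invented for the example, and it handles one team with a known starting five.

```python
def track_lineups(starters, events):
    """Return a list of (event, lineup) pairs for one team.

    `events` uses a hypothetical shape: a substitution looks like
    {"type": "SUB", "out": "X", "in": "Y"}; any other dict is an ordinary
    play. A substitution event is tagged with the lineup after the swap.
    """
    on_court = set(starters)
    tagged = []
    for ev in events:
        if ev.get("type") == "SUB":
            # Sanity check: the outgoing player must be on the floor and
            # the incoming player must not be.
            if ev["out"] not in on_court or ev["in"] in on_court:
                raise ValueError("inconsistent substitution: %r" % (ev,))
            on_court.remove(ev["out"])
            on_court.add(ev["in"])
        tagged.append((ev, frozenset(on_court)))
    return tagged


starters = ["P1", "P2", "P3", "P4", "P5"]
events = [
    {"type": "SHOT", "player": "P1"},
    {"type": "SUB", "out": "P5", "in": "P6"},
    {"type": "REBOUND", "player": "P6"},
]
tagged = track_lineups(starters, events)
```

Raising on an inconsistent substitution is a deliberate choice in this sketch: a lineup that silently drifts away from five players would corrupt every stint that follows it.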
For 3 and 4 I believe it's more of a personal preference on how you would want to do it. Personally I use the scikit-learn Python machine learning library. I probably won't enforce 3 and 4 but rather provide a useful dataset that would allow the users to implement 3 and 4 and do analysis easily.
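As one possible way to do steps 3 and 4 with scikit-learn, here is a small sketch that fits a ridge regression on a toy sparse matrix. The data, the fit_rapm helper name, the alpha value and the possession weighting are all assumptions chosen for illustration, not a recommended configuration and not something the library provides.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import Ridge

# Toy data (made up): 3 stints, 5 players, two players per side.
players = ["A", "B", "C", "D", "E"]
rows = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
cols = [0, 1, 2, 3, 0, 4, 2, 3, 1, 4, 2, 3]
vals = [1, 1, -1, -1, 1, 1, -1, -1, 1, 1, -1, -1]  # +1 home, -1 away
X = csr_matrix((vals, (rows, cols)), shape=(3, len(players)), dtype=float)
y = np.array([40.0, -20.0, 10.0])  # home margin per 100 possessions
w = np.array([10.0, 5.0, 8.0])     # possessions in each stint


def fit_rapm(X, y, w, alpha):
    """Fit a possession-weighted ridge regression; return player coefficients."""
    # The intercept absorbs the average home margin (home-court advantage).
    model = Ridge(alpha=alpha, fit_intercept=True)
    model.fit(X, y, sample_weight=w)
    return model.coef_


ratings = dict(zip(players, fit_rapm(X, y, w, alpha=1.0)))
```

The penalty alpha controls how strongly ratings are pulled toward zero, so in practice it would be picked by cross-validation rather than fixed by hand as it is here.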
Re: A New Python Library To Get PlayByPlay Data, Useful for
Posted: Tue Sep 06, 2016 8:13 am
by permaximum
ethanluo wrote:DSMok1 wrote:There are 4 major components of the RAPM process, as I see it:
1. Scrape and clean the raw PbP data. This is what this library does--a very valuable part of the process.
2. Parse out what lineups are on the floor at any given time.
3. Construct the RAPM sparse matrix.
4. Actually run the RAPM analysis.
Yes indeed. I have completed steps 1 and 2. If you check out the v0.2.0 I pushed yesterday, it now features a Matchup class, which is basically the lineup (I have trouble naming these things, so I just left it as Matchup). Every matchup has the on-court players, and you can also query for their boxscores. I also have a test suite that verifies the integrity of this library against the data itself. I ran the test suite on a couple of games and everything seems to be working.
For 3 and 4 I believe it's more of a personal preference on how you would want to do it. Personally I use the scikit-learn Python machine learning library. I probably won't enforce 3 and 4 but rather provide a useful dataset that would allow the users to implement 3 and 4 and do analysis easily.
Can you confirm if everything's okay for v0.2.0? It seems like it's missing some files.
Re: A New Python Library To Get PlayByPlay Data, Useful for
Posted: Tue Sep 20, 2016 4:37 am
by ethanluo
permaximum wrote:ethanluo wrote:DSMok1 wrote:There are 4 major components of the RAPM process, as I see it:
1. Scrape and clean the raw PbP data. This is what this library does--a very valuable part of the process.
2. Parse out what lineups are on the floor at any given time.
3. Construct the RAPM sparse matrix.
4. Actually run the RAPM analysis.
Yes indeed. I have completed steps 1 and 2. If you check out the v0.2.0 I pushed yesterday, it now features a Matchup class, which is basically the lineup (I have trouble naming these things, so I just left it as Matchup). Every matchup has the on-court players, and you can also query for their boxscores. I also have a test suite that verifies the integrity of this library against the data itself. I ran the test suite on a couple of games and everything seems to be working.
For 3 and 4 I believe it's more of a personal preference on how you would want to do it. Personally I use the scikit-learn Python machine learning library. I probably won't enforce 3 and 4 but rather provide a useful dataset that would allow the users to implement 3 and 4 and do analysis easily.
Can you confirm if everything's okay for v0.2.0? It seems like it's missing some files.
Hi, glad to help here! Can you tell me exactly what problems you are having? Maybe you can submit an issue on GitHub so I can look into it.
