Building a solid NBA database
Posted: Mon Dec 07, 2020 8:09 am
Hello APBRmetrics community! I can't believe it has taken me this long to find out about this forum, but alas, here I am. Here's a quick intro about myself, as I'm new to the community here, and a brief overview of what I'm aiming to build, as well as what I've built so far.
Me: I work as a data scientist in Silicon Valley. My academic background is in mathematics (BA) & statistics (MS). Most of my work consists of predictive modeling, either using classical statistical techniques or machine learning algorithms. I've dabbled in random projects with sports data for years on my own, but have never really exchanged ideas with others who share my passion. Recently, I've decided to dig deeper into NBA analytics and am in the process of building out a strong database to support that effort.
# The short version
The database I'm building:
1) NBA PBP data for all seasons going back to 1996-97
2) NBA player data, including season totals, per-game and per-possession stats, and metadata for all players in all seasons
3) NBA game info data which includes final scores, teams, location, attendance, starting lineups, and point spreads
# The long version
I've written a web scraper in R for play-by-play data that pulls all the available PBP logs from Basketball-Reference (BBR), which I believe go back to ~1996-97. In the past, I tried using the JSON PBP files from NBA.com, but those had so many data entry errors that I eventually gave up on the project. I've recently discovered that the PBP logs from BBR have similar issues (though they don't seem to be quite as bad as what I got from NBA.com, at least not for the recent seasons I've processed).

After scraping all the data, I went through the most recent 5 seasons with a fine-toothed comb to clean up the PBP logs. Some plays have erroneous time stamps, sometimes a rebound is logged before the corresponding shot, sometimes substitutions appear out of order in the time series, etc. I'm sure many of you know exactly what I'm talking about. Anyhow, I've gotten several of the most recent seasons cleaned up now. I wrote some logic into my processing code that handles most of the errors, but I also had to fix a couple hundred more by hand to get the data as close to perfect as I could. That even involved watching film on NBA.com to see which players started some of the OT periods, since my code couldn't identify players who were on the court for the full period without logging a single statistic.

I use regular expressions to parse the plays and figure out who did what and which players are on the court for a given play. I have meticulously spot-checked this data (I have OCD) to ensure that it is as accurate as possible. But it is a fairly labor-intensive process, so processing additional prior seasons will take time.

After I finished processing the 2018-19 season, I summed each player's minutes played from the duration of each play and my on-court logic, and checked those numbers against each player's season total on BBR. My MP totals line up pretty well. A plurality match their listed totals exactly, 100+ players are within 1 minute, a few are within 2, and 3 players are within 3. No player is off by more than 3 MP. I believe these deltas probably just reflect rounding errors. I've also spot-checked several players to see whether the Pts, Reb, Ast, TO, etc. totals I get from aggregating my processed game logs line up with their reported totals, and all of them line up perfectly. Same thing with total points for home and away teams. So I believe I have a strong data foundation now, at least for a few seasons (I have many more still to finish processing though).
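A minimal sketch of the lineup tracking and minutes check described above, written in Python rather than R. This is not the actual processing code. The event format (seconds elapsed, description), the "enters the game for" substitution wording, and the function names are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed substitution wording; real PBP text may differ.
SUB_RE = re.compile(r"^(?P<player_in>.+?) enters the game for (?P<player_out>.+)$")


def minutes_played(starters, events, period_seconds):
    """Sum on-court minutes per player for one team in one period.

    starters: players on the floor when the period begins.
    events: list of (seconds_elapsed, description), expected in time order.
    period_seconds: length of the period (720 for a regulation quarter).
    """
    on_court = set(starters)
    seconds = defaultdict(float)
    prev_t = 0.0
    for t, desc in events:
        if t < prev_t:
            # Out-of-order time stamps are the kind of error that needs cleanup.
            raise ValueError(f"event out of order at t={t}: {desc!r}")
        # Credit the elapsed time to everyone currently on the floor.
        for player in on_court:
            seconds[player] += t - prev_t
        prev_t = t
        match = SUB_RE.match(desc)
        if match:
            on_court.discard(match["player_out"])
            on_court.add(match["player_in"])
    # Credit the time from the last event to the end of the period.
    for player in on_court:
        seconds[player] += period_seconds - prev_t
    return {player: s / 60 for player, s in seconds.items()}


def compare_totals(computed, reference):
    """Return {player: rounded computed minutes - reference} where they differ."""
    deltas = {}
    for player, ref_minutes in reference.items():
        delta = round(computed.get(player, 0.0)) - ref_minutes
        if delta != 0:
            deltas[player] = delta
    return deltas
```

Running `minutes_played` per team and period, summing over a season, and passing the result to `compare_totals` along with the published totals gives the list of players whose minutes disagree. A player who plays a full period without a substitution only gets credit here if he is listed in `starters`, which is why the period starters have to be known up front.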
I'm curious to know what others have done in tackling this seemingly monumental task. Do most of you use a SQL database to store the data, or just flat files, or something else? This is a fun project for me, so I enjoy wrangling all of this data, but I'm sure others do not. I've looked into purchasing the data from a few sources in the past, but each time I found a source and looked at their sample data, it had flaws that I wasn't happy with (some couldn't even get the players correct - "James, L. - LAC" was one of my favorites). Anyhow, I don't trust others to process the data as well as I trust myself, so I've taken this on as a project to build it "right". But I'm fairly ignorant of what's out there, and for all I know, everyone here is laughing at me because you guys all pass around some flawless golden datasets that noobs like myself would drool over lol.
What have you guys done for your datasets? Do most of you scrape your own data and process it yourselves? If so, what data do you think is the best? (NBA.com, BBR, something else?) Do you purchase your data somewhere instead?
If anyone wants my data after I've finished processing it, you can certainly have it for free. I'd be happy to share it with anyone who is interested, as long as that's not against the rules or something.