Building a solid NBA database

Posted: Mon Dec 07, 2020 8:09 am
by rainmantrail
Hello APBRmetrics community! I can't believe it has taken me this long to find out about this forum, but alas, here I am. Here's a quick intro about myself, as I'm new to the community here, and a brief overview of what I'm aiming to build, as well as what I've built so far.

Me: I work as a data scientist in Silicon Valley. My academic background is in mathematics (BA) & statistics (MS). Most of my work consists of predictive modeling, either using classical statistical techniques or machine learning algorithms. I've dabbled in random projects with sports data for years on my own, but have never really exchanged ideas with others who share my passion. Recently, I've decided to dig deeper into NBA analytics and am in the process of building out a strong database to support that effort.

# The short version
The database I'm building:
1) NBA PBP data for all seasons going back to 1996-97
2) NBA player data, including season totals, per-game and per-possession stats, and metadata for all players in all seasons
3) NBA game info data which includes final scores, teams, location, attendance, starting lineups, and point spreads

# The long version:
I've written a web scraper in R for play-by-play data that pulls all the available PBP logs from BBR (which I believe go back to ~1996-97). In the past, I tried using the JSON PBP files from NBA.com, but those had so many data entry errors that I eventually gave up on the project. I've recently discovered that the PBP logs from BBR have similar issues (though they don't seem to be quite as bad as what I got from NBA.com, at least not for the recent seasons I've processed).

After scraping all the data, I went through the most recent 5 seasons with a fine-tooth comb to clean up the PBP logs. Some plays have erroneous timestamps, sometimes a rebound occurs before the corresponding shot was taken, sometimes substitutions occur out of order in the time series, etc. I'm sure many of you know exactly what I'm talking about. Anyhow, I've gotten several of the most recent seasons cleaned up now. I wrote some logic into my processing code that handled most of the errors, but I also had to fix a couple hundred more manually to get the data as close to perfect as I could (which even involved watching film on NBA.com to see which players started some of the OT periods, as my code couldn't identify them since they were on the court for the full period without logging a single statistic).

I use regular expressions to parse the plays and figure out who did what, and which players are on the court for a given play. I have meticulously spot-checked this data (I have OCD) to ensure that it is as accurate as possible. But it is a fairly labor-intensive process, so processing additional prior seasons will take time.

After I finished processing my 2018-19 NBA season PBP data, I summed up each player's minutes played from the duration of each play and the logic I used to determine who was on the court, and checked those numbers against the player totals from BBR. All of my MP totals line up pretty well. The plurality match their listed totals exactly, 100+ players are off by +/- 1 minute, a few random players are within 2 MP of their totals, and 3 players are within 3 MP. No player is off by more than 3 MP. I believe these deltas probably just reflect rounding errors. I've also spot-checked several players to see if the Pts, Reb, Ast, TO, etc. totals I get from aggregating my processed game logs line up with their reported totals, and all of them line up perfectly. Same thing with total points for home and away teams. So I believe I have a strong data foundation now, at least for a few seasons (I have many more still to finish processing though).
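For anyone who wants to try the same minutes cross-check, here is a rough sketch of the idea in Python (not the R code described above). The play rows, player labels, reference totals, and the 0.25-minute tolerance are all made up for illustration; real data would come from the processed PBP logs and the published season totals.

```python
from collections import defaultdict

# Hypothetical processed PBP rows: (play duration in seconds, players on court).
plays = [
    (30.0, ["A", "B"]),
    (45.0, ["A", "C"]),
    (45.0, ["B", "C"]),
]

def minutes_from_pbp(plays):
    """Credit each play's duration to every player on the court, in minutes."""
    seconds = defaultdict(float)
    for duration, on_court in plays:
        for player in on_court:
            seconds[player] += duration
    return {player: total / 60.0 for player, total in seconds.items()}

def minute_deltas(computed, reference):
    """Computed minus reference minutes for every player in the reference."""
    return {p: computed.get(p, 0.0) - ref for p, ref in reference.items()}

# Made-up "published" totals to compare against.
reference_mp = {"A": 1.0, "B": 1.0, "C": 2.0}

computed = minutes_from_pbp(plays)
deltas = minute_deltas(computed, reference_mp)

# Flag players whose delta exceeds an (arbitrary) tolerance for manual review.
flagged = {p: d for p, d in deltas.items() if abs(d) > 0.25}
print(flagged)
```

Players that end up in `flagged` are the ones worth checking by hand, since a large delta usually points at a lineup-tracking problem rather than rounding.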

I'm curious to know what others have done though in tackling this seemingly monumental task. Do most of you use a SQL database to store the data, or just flat files, or something else? This is a fun project for me, so I enjoy wrangling all of this data, but I'm sure others do not. I've looked into purchasing the data I was looking for from a few sources in the past, but each time I found a source and looked at their sample data, it always had flaws that I wasn't happy with (some couldn't even get the players correct - "James, L. - LAC" was one of my favorites :roll: ). Anyhow, I don't trust others to process the data as well as I trust myself, so I've taken this on as a project to build it "right". But I'm fairly ignorant of what's out there, and for all I know, everyone here is laughing at me because you guys all pass around some flawless golden datasets that noobs like myself would drool over lol.

What have you guys done for your datasets? Do most of you scrape your own data and process it yourselves? If so, what data do you think is the best? (NBA.com, BBR, something else?) Do you purchase your data somewhere instead?

If anyone wants my data after I've finished processing it, you can certainly have it for free. I'd be happy to share it with anyone that is interested, as long as that's not against the rules or something.

Re: Building a solid NBA database

Posted: Mon Dec 07, 2020 9:00 am
by vzografos
Hi, welcome to the site.

I have a pretty big database scraped from NBA.com going back to the 1950s, with all the per-game boxscore stats of every player who has played since then: about 120+ stats for each player (Basic, Advanced, Misc, Tracking, etc.). I am not tracking per-minute stats at the moment.
I use SQL of course, because of the relationships between the data, and I am using NBA.com's own identifiers. These are unique identifiers for the teams and players, so the names are not important (hey, there are even players who change their names).
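To illustrate why keying on a stable identifier matters, here is a minimal sketch using Python's built-in `sqlite3`. The table names, columns, and the ID value are hypothetical and are not NBA.com's actual identifiers or anyone's real schema; the point is only that stat rows reference the ID, so a name change touches a single row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript("""
CREATE TABLE player (
    player_id    INTEGER PRIMARY KEY,   -- stable identifier (hypothetical values)
    display_name TEXT NOT NULL
);
CREATE TABLE player_game (
    player_id INTEGER NOT NULL REFERENCES player(player_id),
    game_id   TEXT NOT NULL,
    pts       INTEGER NOT NULL,
    PRIMARY KEY (player_id, game_id)
);
""")

conn.execute("INSERT INTO player VALUES (?, ?)", (1, "Old Name"))
conn.executemany(
    "INSERT INTO player_game VALUES (?, ?, ?)",
    [(1, "G1", 20), (1, "G2", 15)],
)

# The player changes names: only the player row is updated.
conn.execute(
    "UPDATE player SET display_name = ? WHERE player_id = ?", ("New Name", 1)
)

# Stats still join cleanly because they never referenced the name.
row = conn.execute("""
    SELECT p.display_name, SUM(pg.pts)
    FROM player AS p
    JOIN player_game AS pg ON pg.player_id = p.player_id
    GROUP BY p.player_id
""").fetchone()
print(row)
```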

Even stats.nba.com has flaws in its data, but it is manageable.

Maybe we can work together on this. Send me a PM

Re: Building a solid NBA database

Posted: Mon Dec 07, 2020 10:00 pm
by rainmantrail
Thanks for the reply. I have also built what sounds like the same dataset, except mine uses BBR's data instead of NBA.com's. It has summary statistics for every player for every season in their database, which includes ABA and BAA data as well as NBA. If you or anyone else would like this data, I'd be happy to share it. I could also share the R code for my web scraper if anyone finds that helpful.

Re: Building a solid NBA database

Posted: Mon Oct 18, 2021 8:42 pm
by apophain
Hi rainmantrail,

I've only just read your posts in this forum. I feel like we've encountered very similar problems. I built my plus-minus database in R from play-by-play data from (I guess) bbref. I share many of the struggles you've described, like erroneous timestamps and filling in missing information through film research.

Right now I am at a stage where I want to compare the quality of my data to serious sources. My points match up almost perfectly. Unlike you, I did not really use minutes played as a check, because I wanted to go by PM per possession instead of per minute. The idea was fine initially, but I now have problems comparing my data on this front. First off, I definitely managed to write code that does not add a general bias toward one team. In all of the seasons I tested my code on, there are only maybe 5 instances of lineups that have 2 more or 2 fewer possessions than the opposing lineup. I.e., when Team A has 7 possessions, Team B (nearly) never has 9 or 5 possessions, but almost certainly 6, 7 or 8. That's a success already. But again, checking whether the measured possessions are correct is rather difficult. Bbref does not display possessions in its Player On-Off statistics, only the total minutes and per-100 stats. 82games.com can help here, but I am not quite sure about the quality. The plus-minuses from that site are always a bit higher than on bbref, and also do not always seem to add up with regard to minutes, points and possessions. Pbpstats seems to be the right address for it, I guess? My possessions do not totally match pbpstats' possessions; there might be some games where I got at most 2 more or 2 fewer possessions.
Again, 82games' per-100 numbers definitely differ from bbref (regarding PM) AND pbpstats (possessions). 82games has fewer possessions, which is probably the reason why their plus-minus ORtgs and DRtgs are a bit higher than e.g. bbref's. Therefore my question to you guys: how reliable are the mentioned stats websites? Which site is the gold standard? Are we sure that one of them might not also have a hiccup in its code here and there, just as we all do? Is there any statement on how they treat tricky possessions, like when substitutions occur during a possession (stuff that was already discussed here in other threads)?
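The effect of the possession count on per-100 ratings can be shown with a small worked example in Python. All numbers are invented, and `per_100` and `unbalanced` are hypothetical helper names, not functions from any of the sites mentioned. The sketch assumes the usual definition of a rating as points per 100 possessions.

```python
def per_100(points, possessions):
    """Points scaled to 100 possessions."""
    if possessions <= 0:
        raise ValueError("possessions must be positive")
    return 100.0 * points / possessions

def unbalanced(team_poss, opp_poss, max_gap=1):
    """True when two opposing lineups' possession counts differ by more than max_gap."""
    return abs(team_poss - opp_poss) > max_gap

# Same made-up points, two different possession counts for the same minutes.
points_for, points_against = 110, 100
poss_high = 100  # e.g. one source's possession count
poss_low = 96    # a source that counts fewer possessions

net_high = per_100(points_for, poss_high) - per_100(points_against, poss_high)
net_low = per_100(points_for, poss_low) - per_100(points_against, poss_low)

# Fewer counted possessions inflate ORtg, DRtg, and the net rating alike.
print(net_high, round(net_low, 2))

# Sanity check in the spirit of the "7 vs. 9" example above.
print(unbalanced(7, 9), unbalanced(7, 8))
```

With identical raw points, the source that counts fewer possessions reports higher offensive and defensive ratings, and the net rating is stretched by the same factor, which would be consistent with the pattern described for 82games.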

Sorry for hijacking this thread. Maybe it would be more helpful if I opened a new thread about the quality of the free plus-minus sites mentioned above.

Edit: how do your results compare to those sites? Do you have a perfect match with one of them? Are there others who have built their plus-minus database from scratch? How are your results?

Re: Building a solid NBA database

Posted: Sun Jan 16, 2022 11:48 pm
by mUmblr
I missed this post before making my own. I've been working on a similar app to build out a DB, written in Python: https://github.com/mpope9/nba-sql

It builds up a SQL database. I've experienced very similar things to what you have: rounding errors, old team names/abbreviations and the like. I've found that for the most part, it's workable. It took a bit of time to find the 'right' endpoints to fill all the relations. I've tried to make this as close as possible to a data warehouse snowflake-style schema, except I'm new to the field and the design isn't all that pretty. It works though, and is pretty speedy on SQLite / Postgres through 5-way joins for the 2008-2021 seasons (I haven't backfilled to 1996-97 because I haven't had the need yet).

One thing that was painful was a general 'game' endpoint. I've had to rely on the player_game_log endpoints to really build that out, and rely on some logic to build the game table from cached player_game_log data before inserting the data to correctly build the key constraints. I think this sucks and that I may have missed an 'obvious' endpoint, but it works for now.
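A rough sketch of that "derive the game table first" approach, using Python's built-in `sqlite3`. This is not nba-sql's actual code or schema: the tables, columns, and rows are hypothetical stand-ins for cached player game log data. The idea is just to deduplicate games from the log rows and insert them before the logs, so the foreign key on `game_id` is satisfied.

```python
import sqlite3

# Hypothetical cached log rows: (player_id, game_id, game_date, team, pts).
logs = [
    (1, "G1", "2021-01-01", "AAA", 20),
    (2, "G1", "2021-01-01", "BBB", 12),
    (1, "G2", "2021-01-03", "AAA", 31),
]

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the game_id reference
conn.executescript("""
CREATE TABLE game (
    game_id   TEXT PRIMARY KEY,
    game_date TEXT NOT NULL
);
CREATE TABLE player_game_log (
    player_id INTEGER NOT NULL,
    game_id   TEXT NOT NULL REFERENCES game(game_id),
    team      TEXT NOT NULL,
    pts       INTEGER NOT NULL,
    PRIMARY KEY (player_id, game_id)
);
""")

# One row per game: the dict keeps a single date per game_id.
games = {game_id: game_date for _, game_id, game_date, _, _ in logs}

with conn:  # single transaction; parents go in before children
    conn.executemany("INSERT INTO game VALUES (?, ?)", sorted(games.items()))
    conn.executemany(
        "INSERT INTO player_game_log VALUES (?, ?, ?, ?)",
        [(player, game, team, pts) for player, game, _, team, pts in logs],
    )

n_games = conn.execute("SELECT COUNT(*) FROM game").fetchone()[0]
n_logs = conn.execute("SELECT COUNT(*) FROM player_game_log").fetchone()[0]
print(n_games, n_logs)
```

Inserting the log rows first would raise an `sqlite3.IntegrityError` here, which is the key-constraint problem the ordering is meant to avoid.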