Open source historical RAPM data and code

Home for all your discussion of basketball statistical analysis.
Post Reply
Simon
Posts: 1
Joined: Wed Sep 04, 2013 8:10 pm

Open source historical RAPM data and code

Post by Simon » Sun Jun 24, 2018 2:34 am

Hi all, long time lurker here. I recently finished my master's in computer science, and my thesis project was essentially creating this site: http://basketball-analytics.gitlab.io/rapm-data/

You can read the paper at https://basketball-analytics.gitlab.io/ ... ta-nba.pdf, which basically just summarizes the history (as I know it) of the advanced stats movement in basketball and gives an overview of what I did to generate this data (tl;dr: I watched this video by JE and did what he described: https://www.youtube.com/watch?v=OuC0YZTADcE).

I put all the code I wrote to scrape, parse, and analyze the data here, along with the website code that hosts the final results: https://gitlab.com/basketball-analytics/

One thing that's been bothering me about these results is that they're close to, but don't quite match, JE's numbers as I find them on the internet (e.g. http://web.archive.org/web/201504080428 ... ot.com:80/ and https://sites.google.com/site/rapmstats/). In particular, some examples I was going to use as sanity checks (e.g. Shaq in the early 2000s and LeBron in Miami) don't seem quite right, so I was wondering if anyone had ideas or insight into the discrepancies. I also noticed that the possession numbers I come up with have pretty wide discrepancies with the ones on JE's old site, so I'm not sure what's going on there either.

Would appreciate any feedback! I'm planning to extend this work to do multi-year RAPM, include priors to see whether an RPM-like stat can be reverse-engineered, and rewrite the paper as a blog post that explains to a layperson what this is all about.

Crow
Posts: 5344
Joined: Thu Apr 14, 2011 11:10 pm

Re: Open source historical RAPM data and code

Post by Crow » Sun Jun 24, 2018 2:46 am

Thanks for posting all this. I look forward to reviewing it all later, including the playoff data (and compiling some averages).

Mike G
Posts: 4130
Joined: Fri Apr 15, 2011 12:02 am
Location: Asheville, NC

Re: Open source historical RAPM data and code

Post by Mike G » Mon Jun 25, 2018 12:40 pm

Simon, that's a very well-written paper. Very few people can do stats and also write about them so well.
I read about half and skimmed the second half. Maybe I missed part of the explanation, but I copied the 2018 playoff numbers, and I have a question or two.

To get a player's Net plus-minus, I take RAPM * poss / 100.
The Pacers total +29.6 by this approach.
They outscored the Cavs by 40 in their series -- a team that was +0.59 in the regular season and right around zero in the rest of their postseason, against well-above-average competition.

Shouldn't Indiana's Net be more than 40, rather than less?
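For anyone who wants to check my arithmetic, the conversion is just this (player names and numbers below are made up for illustration, not Simon's actual output):

```python
# Net plus-minus from a per-100-possession rating: Net = RAPM * poss / 100.
# Hypothetical players and values, for illustration only.
players = [
    ("Player A", 3.1, 700),   # (name, RAPM per 100 poss, possessions played)
    ("Player B", -1.2, 650),
    ("Player C", 0.4, 500),
]

def net_plus_minus(rapm, poss):
    """Convert a per-100-possession rating into a raw point contribution."""
    return rapm * poss / 100

team_net = sum(net_plus_minus(r, p) for _, r, p in players)
print(round(team_net, 1))  # 15.9
```

Summing that over a roster is how I got the Pacers' +29.6 above.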


Here are the playoff correlations between your RAPM and possessions played. For each team, I show the highest correlation among its top 8, 9, and 10 players.

Code: Select all

tm     #    corr
GSW   10    0.88
HOU    8    0.84
BOS   10    0.78
NOP   10    0.73
IND    8    0.57
CLE   10    0.55
MIL   10    0.36
SAS   10    0.34
MIA    8    0.18
TOR    9    0.08
UTA   10   -0.12
OKC   10   -0.28
MIN   10   -0.29
WAS   10   -0.34
POR    8   -0.36
PHI   10   -0.43
If RAPM is a good stat, should this ranking indicate which teams had the best/worst lineup management?
Simon wrote: ... the possession numbers I come up with have pretty wide discrepancies with the ones on JE's old site so I'm not sure what's going on there either.
Even on b-r.com there may be notable discrepancies. According to this page --
https://www.basketball-reference.com/te ... 18/on-off/
... the Warriors' pace was 103.2 with Steph Curry on the court. It seems that figure should be used to calculate all his /100 numbers, but last time I checked, all player stats were computed relative to the team's average pace.
With Quinn Cook, they ran at 96.3 -- a difference of some 7% from Curry.
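To show how much the pace choice matters, here's a quick sketch. The two pace figures are from the b-r.com page above; the points-per-minute rate is a hypothetical number used only to illustrate the scaling:

```python
# Converting a per-minute rate to per-100-possessions depends on which
# pace you divide by. Pace = possessions per 48 minutes.
def per_100_poss(rate_per_min, pace):
    poss_per_min = pace / 48
    return rate_per_min / poss_per_min * 100

rate = 0.6  # hypothetical points per minute on court
print(round(per_100_poss(rate, 103.2), 1))  # using Curry's on-court pace
print(round(per_100_poss(rate, 96.3), 1))   # using Cook's on-court pace
```

Same raw production, but the /100 figure moves by about 7% depending on which pace you use -- exactly the Curry/Cook gap.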

Crow
Posts: 5344
Joined: Thu Apr 14, 2011 11:10 pm

Re: Open source historical RAPM data and code

Post by Crow » Mon Jun 25, 2018 5:24 pm

I am interested in the correlation between RAPM estimates and team performance as well. In Mike's example, the correlations are decent or better for the winners (except the Sixers) and really bad for the losers. I'd want to understand what is happening here better. Winners have larger sample sizes, but is there something else involved? Losers probably have worse, more random lineup management... and worse RAPM estimates because of that? The end-of-bench guys might have some notable impact too.

Mike G
Posts: 4130
Joined: Fri Apr 15, 2011 12:02 am
Location: Asheville, NC

Re: Open source historical RAPM data and code

Post by Mike G » Tue Jun 26, 2018 4:24 pm

Seems to me a correlation below .50 is not very good. At that rate -- if RAPM is the real story -- only the final four, plus two first-round victims (Pels and Pacers), were decent. The other coaches sucked, and/or the stat is not quite stable.
Putting players in and out randomly would give us correlations of zero, and six teams look worse than that.

eminence
Posts: 56
Joined: Sun Sep 10, 2017 8:20 pm

Re: Open source historical RAPM data and code

Post by eminence » Tue Jun 26, 2018 4:49 pm

Shaq from the early '00s and Nash in general are the two I've looked into a bit, and they seem surprisingly far off from what I've seen from other sources.

permaximum
Posts: 413
Joined: Tue Nov 27, 2012 7:04 pm

Re: Open source historical RAPM data and code

Post by permaximum » Sat Jun 30, 2018 12:24 am

Nice job! It's also very kind of you to share all this code. You are the first to share code at this scale on this forum, so thank you for that.

First, I must say I'm not a computer scientist, I'm not an analyst, and I'm not getting paid for any of this. I'm just a basketball fan (not even that nowadays), and this is a hobby I revisit for a month or less each year. As that kind of person, I checked your code very quickly for a few minutes, and I want to comment on it without having any credentials. Forgive my arrogance.

In short, I liked what I saw. Your programming approach is definitely better than mine, considering all my code lies in a single Python file that contains thousands of lines. I know I'm a newbie. Your parsing code is also much cleaner. Nice catch on Melvin Booker and the other guy. You're also using ESPN and b-ref on top of nba.com for different things; it wasn't wise of me to use nba.com as my only source. Your approach for getting lineups for subsequent quarters is a good example of the advantage of using other sources.

I see you have clean and detailed PBP error-handling code. But in my experience there are so many errors in PBP data that automatic handling can never be enough, and I'm kind of a perfectionist, so I had to manually correct a lot of PBP data. I also get really obsessive about weird possession outcomes: my code sometimes breaks one possession into 0.125 or even smaller fractions, and it handles end-of-game and end-of-quarter situations where players simply run out the clock and don't shoot. And even more absurd stuff than that: substitutions, injuries, ejections, free throws with substitutions in between, flagrant fouls, clear-path fouls, jump balls, two-minute away-from-the-ball fouls, non-shooting fouls on a player without the ball while someone else was shooting, and crazier stuff still. I had to learn the rules and rule changes in this PBP area like an NBA ref because of this mad mindset.

What I want to say is: your code is probably better than everyone else's out there, but I fear it's still not quite as good as I'd want, because nba.com PBP data has crazy errors, and I'm sure the results get skewed. It has errors in player IDs, in the plays themselves, in event numbers, in times, in team assignments, and worse. How bad is it in your case after your error-handling passes? I can't say, since I gave up after correcting a few years' worth of PBP data from 1996-2000. I had even more sophisticated error-handling code, and it still wasn't nearly enough. And it's really easy to miss the errors that occur in the plays themselves. I don't know the state of the newer seasons, though. If they're not that bad, later-season RAPMs could be very reliable, and if that's the case, perhaps I should finish this stuff after all.

I will look at your code more thoroughly later, but I already know I can rely on your RAPM results more than anyone else's, because I know exactly how you came up with those values, and it's obvious you tried your best to minimize the errors in the PBP data. Again, it's really appreciated that you shared all of this, and I enjoyed how similar our code turned out in certain places that handle the same situations.

Edit: BTW, my code was just a possession-level PBP parser; you also have RAPM-calculating code. I've only calculated RAPM in R before, not yet in Python, but besides the PBP data differences from J.E.'s, your CV method and lambda sequence will also give different results. (I took a quick glimpse: I gather you have optional penalized-regression methods and you're doing 5-fold CV with a strict lambda sequence -- it's called alpha in your code, I guess.) The possession differences, I'm sure, come from the different parsing code and the actual PBP sources.

Edit 2: J.E.'s results combine regular season and playoffs. That's another reason for the difference.

Edit 3: After further investigation, your lambda values (alpha in your code) are too high. As a result, you punish players with relatively low possession counts even more. You should change your sequence to come up with a better one; I would focus on the cross-validation code there. I don't know how the Python libraries handle these things, but your main problem is there.
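To make the suggestion concrete: instead of hand-picking a narrow lambda sequence, you can let cross-validation search a wide, log-spaced grid. This is a sketch with synthetic stint data standing in for the real design matrix (the shapes and noise level are made up; the point is the alpha search, not the numbers):

```python
# Choosing the ridge penalty by cross-validation over a wide grid,
# rather than fixing it by hand. Synthetic data for illustration.
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
X = rng.integers(-1, 2, size=(500, 30)).astype(float)  # toy stint matrix (-1/0/+1)
beta = rng.normal(0, 2, size=30)                       # "true" player effects
y = X @ beta + rng.normal(0, 5, size=500)              # noisy stint margins

# Log-spaced grid spanning several orders of magnitude.
alphas = np.logspace(-2, 4, 25)
model = RidgeCV(alphas=alphas, cv=5).fit(X, y)
print(model.alpha_)  # the penalty the data actually supports
```

If the CV-chosen alpha sits at the edge of your grid, the grid is too narrow -- which is what a sequence like "2500 to 3000 in steps of 50" risks.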

Rd11490
Posts: 88
Joined: Mon Sep 29, 2014 4:54 am

Re: Open source historical RAPM data and code

Post by Rd11490 » Sat Jun 30, 2018 4:41 am

permaximum wrote: After further investigation, your lambda (alpha in your code) values are too high. As a result of that you punish players with relative low possessions even more. You should change your sequence to come up with a better one. I would focus on the cross validation code there. I don't know how python libraries handle those things but your main problem is there.
Without knowing exactly how many rows his input has, I'm not sure that's a fair statement. I agree, though, that the alpha should be calculated from your input:

Code: Select all

# glmnet's lambda -> sklearn's ridge alpha: alpha = lambda * N / 2
lambdas = [.01, .05, .1]
samples = stintX.shape[0]                    # N = number of stint rows
alphas = [l * samples / 2 for l in lambdas]

Just to comment, there are some big outliers in your results that are concerning. Kemba being in the 400s when everyone else has him in the top 30 is a red flag.

permaximum
Posts: 413
Joined: Tue Nov 27, 2012 7:04 pm

Re: Open source historical RAPM data and code

Post by permaximum » Sat Jun 30, 2018 7:29 am

Rd11490 wrote:
After further investigation, your lambda (alpha in your code) values are too high. As a result of that you punish players with relative low possessions even more. You should change your sequence to come up with a better one. I would focus on the cross validation code there. I don't know how python libraries handle those things but your main problem is there.
Without knowing exactly how many rows his input is I'm not sure this is a fair statement. I agree though that the alpha should be calculated based on your input

Code: Select all

lambdas = [.01, .05, .1]
samples = stintX.shape[0]
alphas = [l * samples / 2 for l in lambdas]

Just to comment, There are some big outliers in your results that are concerning. Kemba being in the 400s when all others have him in the top 30 is a red flag.
I think you're confusing things here. I don't know how the Python libraries handle penalized regressions, but looking at his code, I took the alpha there to mean the lambda in R, not the alpha we know from R. And that alpha (R's lambda) is definitely very high. Its CV sequence is also bad.

BTW, l1_ratio in that code is probably the alpha we know from R. So as far as parameters go:

Python Sklearn's Alpha = R Glmnet's Lambda
Python Sklearn's L1_ratio = R Glmnet's Alpha

Rd11490
Posts: 88
Joined: Mon Sep 29, 2014 4:54 am

Re: Open source historical RAPM data and code

Post by Rd11490 » Sat Jun 30, 2018 2:32 pm

I might be confused. Are you talking about his alpha=2900?
If so, that number sounds about right, as alpha in sklearn is N/2 * lambda in R, where N is your number of observations.
If not, then yeah, my bad -- I thought you were talking about something else.

permaximum
Posts: 413
Joined: Tue Nov 27, 2012 7:04 pm

Re: Open source historical RAPM data and code

Post by permaximum » Sat Jun 30, 2018 3:53 pm

Rd11490 wrote:I might be confused. Are you talking about his alpha=2900?
If so that number sounds are right as alpha in sklearn is N/2 * lambda in R where N is your number of observations.
If not then yeah my bad I thought you were talking about something else.
Yes, I'm talking about that "2900" and the "2500, 3000, 50" sequence below it, if that's how he came up with 2900. I guess it's the L2 penalty that's called lambda in R's glmnet package, because there was no other parameter in the ridge-regression calculation stage of his code. So there it looks like it's called alpha.

And that number is definitely too high. How did I come up with that reasoning? Simply by looking at the RAPM values and player possessions. It's very obvious his penalty is high, which is why I said he should improve his cross-validation method to come up with an optimized, and lower, penalty.

Rd11490
Posts: 88
Joined: Mon Sep 29, 2014 4:54 am

Re: Open source historical RAPM data and code

Post by Rd11490 » Sat Jun 30, 2018 4:37 pm

That's fair -- there are a good number of very strange results in this set.
