RAPM request thread

Nathan · Post by **Nathan** » Thu Jan 07, 2016 6:01 pm

I'm having some problems producing SPM related to what Crow's mentioning here. I end up with a decent top-10 for the 2014-15 season (out of sample) with my simple linear model:

and an even better top-10 for the 2014-15 season (out of sample) with my more complex model:

but both give very low ratings, on average, to the top players. Is there some simple way to 'un-mean-regress' your NPI APM (e.g. as some function of possessions or minutes played)? The problem I'm running into, I think, is that I need a form of APM such that each player's APM is a random variate with mean equal to his "true, underlying plus/minus." Clearly, if there's significant mean regression going on, that's not going to be true for top players like Curry. I understand, of course, that the RMSE between APM and "true, underlying plus/minus" will be larger without mean regression, but for the purposes of calculating SPM this shouldn't matter as long as sample size is large enough.

J.E. · Post by **J.E.** » Thu Jan 07, 2016 8:56 pm

Are you using single-season RAPM? That'll always lead to RAPM numbers comparatively close to 0, and thus the SPM numbers will be, as well.

If that's what you do, the solution should probably be to use multiple years of data for RAPM instead.

Nathan · Post by **Nathan** » Thu Jan 07, 2016 9:47 pm

My training data is seasons 2000-2001 through 2013-2014, using your NPI APM. So, if I'm interpreting you correctly, single-year NPI APM will always have an unrealistically small spread. I notice, for instance, that GotBuckets' 2-year NPI APM has a much larger spread. Is this indeed because of the larger sample size, or is it more likely because they are using a weaker prior?

J.E. · Post by **J.E.** » Thu Jan 07, 2016 10:12 pm

Nathan wrote:My training data is seasons 2000-2001 through 2013-2014, using your NPI APM. So, if I'm interpreting you correctly, single-year NPI APM will always have an unrealistically small spread. I notice, for instance, that GotBuckets' 2-year NPI APM has a much larger spread. Is this indeed because of the larger sample size, or is it more likely because they are using a weaker prior?

The whole point of "NPI" is to have no priors. More years -> larger spread

Nathan · Post by **Nathan** » Thu Jan 07, 2016 11:39 pm

Well, there's a zero prior of some sort (which is why players with few minutes tend to cluster around -0.5 or so), right? If I understand correctly this is closely related to the "lambda" value people mention in this thread.

J.E. · Post by **J.E.** » Fri Jan 08, 2016 8:51 am

Difference in lambda will (or should) be very minor from NPI RAPM to NPI RAPM. And all NPI RAPMs share the 0 prior
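
For reference, the "0 prior" and "lambda" discussed here correspond to the standard ridge regression setup. A sketch in generic notation (the symbols are not from this thread: $X$ is assumed to be the stint-by-player design matrix, $y$ the observed point margin per stint, and $\beta$ the vector of player ratings):

```latex
\hat{\beta}_{\text{ridge}}
  = \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2
  = (X^\top X + \lambda I)^{-1} X^\top y
```

The penalty term pulls every rating toward 0, which is equivalent to placing a zero-mean Gaussian prior on each player. Setting $\lambda = 0$ recovers ordinary least squares.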

Nathan · Post by **Nathan** » Fri Jan 08, 2016 3:55 pm

What was done differently, for instance, to produce this one-year APM?

http://nbviewer.ipython.org/gist/EvanZ/ ... eb14f28d58

which has several players >6?

J.E. · Post by **J.E.** » Fri Jan 08, 2016 5:01 pm

Nathan wrote:What was done differently, for instance, to produce this one-year APM?

http://nbviewer.ipython.org/gist/EvanZ/ ... eb14f28d58

which has several players >6?

Something's a tad bit off with his process. It's not the selection of the penalization parameter itself, because he lets the software choose. But something is off before that, or else sklearn wouldn't arrive at 500. 3000 is pretty much the standard here, found by Joe Sill (creator of RAPM), and by myself.
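
To get a feel for how much the choice between 500 and 3000 could matter, here is a minimal sketch using the simplest possible case: a single player whose design column is 1 on every row, so the ridge estimate is the raw average multiplied by n / (n + lambda). Real RAPM involves many collinear players, so this is only a rough illustration. The function name and the sample sizes are hypothetical.

```python
def ridge_shrink_factor(n_obs: float, lam: float) -> float:
    """Fraction of the raw (unregularized) average that survives the penalty.

    Derived for a one-regressor ridge fit with x = 1 on every row:
    minimizing sum((y_i - b)**2) + lam * b**2 gives b = sum(y) / (n_obs + lam).
    """
    return n_obs / (n_obs + lam)


if __name__ == "__main__":
    # Hypothetical sample sizes; not taken from any real RAPM data set.
    for lam in (500, 3000):
        for n_obs in (1_000, 5_000, 20_000):
            factor = ridge_shrink_factor(n_obs, lam)
            print(f"lambda={lam:>4}  n={n_obs:>6}  factor={factor:.3f}")
```

At the same sample size, a smaller lambda leaves more of the raw signal intact, which would be consistent with a one-year fit that has several players above 6.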

Nathan · Post by **Nathan** » Fri Jan 08, 2016 5:28 pm

OK, I see. But what I don't understand is the following. I notice that the standard deviation in your pm ratings is approximately proportional to the square root of minutes played, and this is true all the way up through 2000 minutes:

This suggests that the zero prior is significantly affecting the ratings of all players, even those who played 2000+ minutes.

J.E. · Post by **J.E.** » Fri Jan 08, 2016 6:05 pm

Why should it not?

Nathan · Post by **Nathan** » Fri Jan 08, 2016 6:27 pm

There's nothing wrong with it per se, it's just that to make SPM, I need a version of APM such that at least for players with high minutes played, their APM rating is a random variate with mean equal to their true, underlying plus/minus value. If the zero prior has a significant effect, as it does here, then elite players in particular are systematically underrated, and as a result my SPM also systematically underrates elite players.
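
The concern can be illustrated with a small simulation under a simplified one-player assumption, where the regularized estimate is the season average scaled by n / (n + lambda). All numbers here (a true rating of +6, the noise level, the sample size) are made up for illustration, and the function name is hypothetical.

```python
import random
import statistics


def simulate_estimates(true_rating, n_obs, noise_sd, lam, n_seasons, seed=0):
    """Return (unregularized, ridge) estimates over many simulated seasons."""
    rng = random.Random(seed)
    raw, ridge = [], []
    for _ in range(n_seasons):
        # The average of n_obs noisy observations is itself normal,
        # with standard deviation noise_sd / sqrt(n_obs).
        season_mean = rng.gauss(true_rating, noise_sd / n_obs ** 0.5)
        raw.append(season_mean)                            # lambda = 0 (APM-like)
        ridge.append(season_mean * n_obs / (n_obs + lam))  # shrunk toward 0
    return raw, ridge


raw, ridge = simulate_estimates(
    true_rating=6.0, n_obs=5_000, noise_sd=200.0, lam=3_000, n_seasons=2_000
)
print("mean raw estimate:  ", round(statistics.mean(raw), 2))
print("mean ridge estimate:", round(statistics.mean(ridge), 2))
print("sd raw, sd ridge:   ", round(statistics.stdev(raw), 2),
      round(statistics.stdev(ridge), 2))
```

In this toy setup the unregularized estimates scatter around the true value, while the regularized ones are less noisy but centered well below it, which is the pattern described above for elite players.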

Nathan · Post by **Nathan** » Sun Jan 10, 2016 11:20 pm

I did more reading on the topic of RAPM and ridge regression, and came across this article which was helpful: https://tamino.wordpress.com/2011/02/12 ... egression/

In particular:

"Now the matrix we need to invert no longer has determinant near zero, so the solution does not lead to uncomfortably large variance in the estimated parameters. And that’s a good thing.

We pay a price for this. The new estimates are no longer unbiased, their expected values are not equal to the true values. Generally they tend to underestimate the true values. However, the variance of this new estimate can be so much lower than that of the least-squares estimator, that the total expected mean squared error is also less — and that makes it (in a certain sense) a “better” estimator, surely a better-behaved one."

Basically, as I understand it, APM is the most accurate possible rating that is bias free. To produce a better estimator, it's necessary to "trade" some variance for bias, thereby reducing the mean squared error, which is the sum of the variance and the squared bias. Going from APM to (non-prior-informed) RAPM, and going from APM to SPM, are both valid ways of removing a lot of variance at the expense of introducing a little bit of bias, which reduces the overall mean squared error.
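
Stated a little more precisely, for an estimator $\hat{\theta}$ of a true value $\theta$ (generic symbols, not taken from the article), the decomposition is:

```latex
\operatorname{MSE}(\hat{\theta})
  = \mathbb{E}\!\left[(\hat{\theta} - \theta)^2\right]
  = \operatorname{Var}(\hat{\theta}) + \operatorname{Bias}(\hat{\theta})^2,
\qquad
\operatorname{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta
```

It is the squared bias that adds to the variance, so a small bias costs little in mean squared error while a large one (as for a player far from 0) can dominate it.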

If I start out with RAPM in my attempt to produce SPM, bias gets introduced twice, once in the move from APM to RAPM, and again in the move from RAPM to SPM. That's why it's so important that I start out with APM, in spite of the fact that RAPM has a lower mean squared error than APM.

My other option would be to start out with long-term RAPM, which has much less bias than single year RAPM. This does work decently well (I posted the results in my recent thread), but there's a key advantage to using APM. The mean squared error in APM is, to good approximation, inversely proportional to minutes played. This makes it possible to get accurate uncertainty estimates for my SPM ratings, which is one of my main goals.
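
One way this property could be used, sketched under assumptions that are not spelled out in the thread: if each player's APM error variance is proportional to 1 / minutes, then fitting APM on a box score stat by weighted least squares, with weights equal to minutes, gives each player an influence matched to how reliable his rating is. The single-feature setup, the function name, and the toy numbers below are all hypothetical.

```python
def weighted_linear_fit(xs, ys, weights):
    """Weighted least squares for y = intercept + slope * x.

    Each weight should be proportional to 1 / variance of that observation;
    here that is assumed to mean proportional to minutes played.
    """
    w_total = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, xs)) / w_total
    y_bar = sum(w * y for w, y in zip(weights, ys)) / w_total
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - x_bar) * (y - y_bar)
              for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Toy data: one box score stat per player, that player's APM, and minutes.
stat = [0.5, 1.0, 1.5, 2.0]
apm = [-1.0, 0.5, 1.5, 3.5]
minutes = [400, 1200, 2000, 2800]

intercept, slope = weighted_linear_fit(stat, apm, minutes)
print(f"APM ~ {intercept:.2f} + {slope:.2f} * stat")
```

A full SPM fit would use many box score stats at once, but the weighting idea carries over unchanged.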

So, if it's not too much trouble, would it be possible to post single year NPI APM? I'm hoping this would be relatively easy, since it's really just RAPM with the regularization parameter set to 0.

Thanks as always, and I apologize for not articulating myself more clearly in the first place.

J.E. · Post by **J.E.** » Mon Jan 11, 2016 10:28 am

Here you go.
https://docs.google.com/spreadsheets/d/ ... sp=sharing
Enjoy

...

Seriously though, I think you're too concerned about bias. APM should not be used over RAPM under any circumstances (in NBA basketball).
And this:

Nathan wrote:This makes it possible to get accurate uncertainty estimates for my SPM ratings

is probably *far* more work than you think it is, because you have to, for each player and box score stat, figure out this player's "real ability" (to score, steal, rebound etc.) and some uncertainty estimates for the estimated real ability.

Nathan · Post by **Nathan** » Mon Jan 11, 2016 3:50 pm

Not that you're obligated to help my cause, but I would need a much larger sample of APM than just the incomplete 2015/16 season in order to make SPM. Ideally I'd like single year APM from the 2000-2001 season onward, but I understand if that would be a lot of work.

I do think I'm right to be concerned about the bias introduced in the calculation of RAPM, though. It appears that, at least for the top players, the leading source of error in my SPM calculated from your NPI RAPM is the lingering bias against large ratings from your RAPM:

I.e. these players all have ratings 2-5 points below what they probably should have.

As for calculating "real" rebounding, etc., that's an intractable problem for me until I have access to play-by-play data, which is unlikely to happen anytime soon (although I very much agree that this kind of thing is essential to further refining existing metrics).

DSMok1 · Post by **DSMok1** » Wed Jan 13, 2016 5:14 pm

Jerry posted 15 year RAPM with age adjustments here; it's basically what I used to create Box Plus/Minus: http://www.apbr.org/metrics/viewtopic.p ... 673#p24673
