Statistical +/-

Home for all your discussion of basketball statistical analysis.
J.E.
Posts: 852
Joined: Fri Apr 15, 2011 8:28 am

Statistical +/-

Post by J.E. »

I'm currently in the process of creating another version of 'statistical +/-' (SPM). You can read up on the first installment here. What's different this time around is
a) more data
b) I've included stats that are not available in the BoxScore and thus have to be figured out through the PBP (see here)

I'm posting results of the first run, but more versions are sure to come. Thus far I've kept it relatively simple on the 'shooting' side, only including PPS and '# of shots', with the latter being (FGA+(FTA-AND1s)/2), where AND1s includes 'unsuccessful AND1s'. I'm definitely open to suggestions, whether to include more variables (they'd have to be either on BBR or something I can figure out through the PBP) or to try certain interaction effects.

Another 'fun' exercise could be to include already existing metrics like PER and ORtg/DRtg in the regression. This is potentially interesting because the regression would be able to tell us which metric is 'better' (since '02), and could point out potential weaknesses of each metric. Say we include PER and various box score stats, and PER gets a positive coefficient for offense; if we then also get a large negative coefficient for 'FGA', we 'know' PER has been overrating the chuckers (again, only since '02; no statements possible for the years before).
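A toy sketch of that diagnostic idea (all variables synthetic and made up for illustration, not the actual data set): regress the target on a PER-like metric plus a raw box score stat, and read the sign of the raw stat's coefficient.

```python
# Synthetic illustration: 'per' is built to overrate shot volume, so a
# regression that also sees 'fga' should assign fga a negative coefficient.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n = 1000
fga = rng.normal(size=n)                      # standardized shot volume
true_impact = rng.normal(size=n)              # the 'real' offensive value
per = true_impact + 0.5 * fga                 # a metric that overrates volume
y = true_impact + rng.normal(scale=0.1, size=n)

coef = Ridge(alpha=1.0).fit(np.column_stack([per, fga]), y).coef_
# The fit recovers roughly y ~= 1.0*per - 0.5*fga: the negative FGA
# coefficient flags that 'per' was overrating shot volume.
```

The same logic applies to any pair of a composite metric and the raw stats it was built from.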

I'm using z-scores for the variables, so a coefficient of 2 has twice the impact on the player rating as a coefficient of 1 does.
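As a minimal illustration of that standardization step (synthetic numbers, not the real data): each predictor column is centered and scaled to unit variance before the regression, so coefficient magnitudes are directly comparable.

```python
# z-scoring a small synthetic predictor matrix
import numpy as np

X = np.array([[30.5, 1.10, 12.0],
              [12.3, 0.95,  5.0],
              [36.1, 1.30, 18.0],
              [24.0, 1.05,  9.0]])   # rows = players; cols = MP, pps, shots

X_z = (X - X.mean(axis=0)) / X.std(axis=0)
# every column of X_z now has mean 0 and standard deviation 1
```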

Coefficients for offense

Code: Select all

MP	2.92
pps	1.94
shots	0.88
AST	0.83
off_rebound_ft	0.60
off_rebound_fg	0.57
STL	0.30
height	0.19
weight	0.11
blocks_to_off	0.06
dead_to	-0.04
blocks_to_def	-0.13
live_to	-0.16
goaltends	-0.16
def_rebound_fg	-0.42
G	-0.56
def_rebound_ft	-0.66
PF	-1.22
GS	-1.55
and defense

Code: Select all

def_rebound_fg	1.37
blocks_to_def	1.11
STL	1.05
GS	1.04
height	0.56
AST	0.47
off_rebound_ft	0.36
weight	0.34
blocks_to_off	0.31
pps	0.23
MP	0.02
G	-0.07
PF	-0.08
def_rebound_ft	-0.18
goaltends	-0.37
dead_to	-0.40
live_to	-0.45
off_rebound_fg	-0.52
shots	-1.10
where
- 'off_rebound_fg' is an offensive rebound after an FGA, rather than an FTA. Vice versa for 'off_rebound_ft'.
- goaltends are defensive goaltends only
- 'live_to'/'dead_to' are live/dead ball turnovers
- 'blocks_to_def' are blocks that get recovered by the defensive team. Vice versa for 'off'

The large difference for the (defensive) coefficients for 'blocks_to_def' vs. 'blocks_to_off' and 'def_reb_fg' vs. 'def_reb_ft' tells me there's definitely value in extracting those kinds of stats from the PBP

The player values for '13-'14 are here. It's not a fan of guards who don't assist much and don't shoot very well, and it has the Lopez brothers in the top 10. Definitely a little weird, although they also sport a 125/121 ORtg and 108/107 DRtg this season.

R^2 between '14 SPM and RAPM is 0.35 for offense, 0.2 for defense. I'm not gunning for higher R^2 (I'm gunning for lower OOS prediction error for offensive efficiency of 5-on-5 lineups), but higher R^2 is probably desirable, too.
ilardi
Posts: 27
Joined: Fri Apr 15, 2011 3:29 am

Re: Statistical +/-

Post by ilardi »

Great stuff as always, Jerry, and I especially like the use of PbP-derived metrics. Two quick thoughts, though, on potential improvements:

1) it looks like you've got ~20 IV's in the model right now (a large number), and there's a good chance that some IVs are redundant (vis-a-vis DV variability) and that some may even be functioning as 'suppressors'. For example, Daniel Myers used only about 10 predictors with his ASPM model and says he got much higher R^2 values in predicting RAPM (.73 offense, .49 defense), at least within-sample . . .

2) So, I'm thinking it might be helpful to try one of the newer array of machine-learning algorithms for model specification to help optimize the model and sort the wheat from the chaff, as it were. R has a nice package for 'random forest' regression, which can also be used in tandem with helpful techniques like 'gradient boosting'. The beauty of such approaches is that they permit you to toss in dozens of POTENTIAL variables - as well as all sorts of potential interaction terms and non-linear terms if you want - without having to worry too much about overfitting, suppressor effects, radical shrinkage effects, etc.
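For anyone who'd rather stay in Python than R, a quick sketch of the idea in scikit-learn (synthetic data; the real model would use the box score/PBP predictor matrix):

```python
# Random forest + gradient boosting on a synthetic target that contains
# an interaction (x0*x1) no plain linear model would find unaided.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                      # ~20 candidate IVs
y = X[:, 0] * X[:, 1] + X[:, 2] + rng.normal(scale=0.1, size=500)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
gb = GradientBoostingRegressor(random_state=0).fit(X, y)

# Feature importances give a rough read on which variables carry signal.
importance_order = np.argsort(rf.feature_importances_)[::-1]
```

One caveat (echoed later in the thread): these models are harder to interpret than a linear fit, so they're better for screening variables than for reading off clean coefficients.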

Hope that might prove helpful.

Steve
J.E.
Posts: 852
Joined: Fri Apr 15, 2011 8:28 am

Re: Statistical +/-

Post by J.E. »

Thanks. I'll try to look into random forest

Just for completeness, I've tried different regression techniques, some of which do some type of variable selection, including Lasso, ElasticNet and Lasso with Least Angle Regression. Ridge outperformed all of them for this specific problem - that's why I went ahead and posted coefficients for just Ridge
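A hedged sketch of how that comparison can be run in scikit-learn (stand-in synthetic data; penalty strengths here are arbitrary, whereas the real comparison would tune them and score on held-out seasons):

```python
# Compare Ridge, Lasso, ElasticNet and LassoLars by cross-validated R^2.
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LassoLars, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 19))                     # 19 z-scored predictors
y = X @ rng.normal(size=19) + rng.normal(scale=0.5, size=300)

models = {
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.01),
    "elasticnet": ElasticNet(alpha=0.01, l1_ratio=0.5),
    "lassolars": LassoLars(alpha=0.01),
}
cv_r2 = {name: cross_val_score(m, X, y, cv=5).mean()
         for name, m in models.items()}
best = max(cv_r2, key=cv_r2.get)                   # best-scoring technique
```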

I've never heard of the 'suppressor effect'. I'll have to look into that as well
xkonk
Posts: 307
Joined: Fri Apr 15, 2011 12:37 am

Re: Statistical +/-

Post by xkonk »

As long as you're reporting the 'vanilla' R squared as opposed to some kind of adjusted value, having more predictors cannot lead to a worse fit. If this R squared is lower than previous iterations of SPM, it is due to differences in the data sets used.
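A small demonstration of that point (synthetic data): in-sample, unadjusted R^2 can only go up as predictors are added, even when the extra predictors are pure noise.

```python
# In-sample R^2 is monotonically non-decreasing in the number of predictors.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # only column 0 carries signal
y = X[:, 0] + rng.normal(size=200)

r2 = [LinearRegression().fit(X[:, :k], y).score(X[:, :k], y)
      for k in range(1, 11)]
# Each extra (noise) column can only soak up more in-sample variance,
# which is why adjusted R^2 or OOS error is the fairer yardstick.
```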

But Steve makes a good point on the interpretation of the predictors themselves; the suppressor effect is one potential issue that comes out of multicollinearity problems. Related to that, my first thought was that adding PER (or potentially any box-score based metric) is going to be problematic in that it is going to be ridiculously correlated with the predictors you already have in the model. Ridge regression will help clean it up, as it usually does, but it can only help, not fix the issue entirely.

Speaking along the same lines, are there any a priori reasons to expect some of the differences we see for similar predictors? For example, it looks like minutes played is canceling out some of the effect of games played and games started on offense but on defense minutes and games have little impact while games started is very important. Would you/someone have predicted that beforehand? Why is games started important to defense as opposed to minutes played? Why is it so different for offense?
colts18
Posts: 313
Joined: Fri Aug 31, 2012 1:52 am

Re: Statistical +/-

Post by colts18 »

I don't see it on the list, but did you put charges drawn in your SPM model? What about assisted FG%? I imagine that if you include assisted FG%, you won't have guys like DeAndre Jordan and Tyson Chandler at the top of the list.
DSMok1
Posts: 1119
Joined: Thu Apr 14, 2011 11:18 pm
Location: Maine
Contact:

Re: Statistical +/-

Post by DSMok1 »

ilardi wrote:Great stuff as always, Jerry, and I especially like the use of PbP-derived metrics. Two quick thoughts, though, on potential improvements:

1) it looks like you've got ~20 IV's in the model right now (a large number), and there's a good chance that some IVs are redundant (vis-a-vis DV variability) and that some may even be functioning as 'suppressors'. For example, Daniel Myers used only about 10 predictors with his ASPM model and says he got much higher R^2 values in predicting RAPM (.73 offense, .49 defense), at least within-sample . . .
Most of the difference in R^2 is probably coming from the RAPM basis. I used an 8 year average RAPM, so likely not very noisy at all. This 2014 RAPM has a much smaller sample size, so there's likely quite a bit more noise coming from the RAPM. (Though I agree with your comments on number of predictors otherwise).
Developer of Box Plus/Minus
APBRmetrics Forum Administrator
Twitter.com/DSMok1
Crow
Posts: 10533
Joined: Thu Apr 14, 2011 11:10 pm

Re: Statistical +/-

Post by Crow »

If your exploration includes height and weight, it might be interesting to extend it to include wingspan, vertical jump, speed and agility from the draft combine data (available at DraftExpress), just to see what the model run shows for them. I'd be interested in seeing a RAPM model run with such demographic variables included too, and then comparing what the results show on SPM and RAPM about their impact.
AcrossTheCourt
Posts: 237
Joined: Sat Feb 16, 2013 11:56 am

Re: Statistical +/-

Post by AcrossTheCourt »

I've been wanting to create a stat. +/- model too, and I started one a few weeks ago to include a lot of non-box score information. There's a lot possible if you get creative with the stats.

And for determining which variables to include in a model, I sometimes use stepwise regression.
J.E.
Posts: 852
Joined: Fri Apr 15, 2011 8:28 am

Re: Statistical +/-

Post by J.E. »

Just for the record, I just tried another regression technique which, from subjective experience, usually selects the fewest variables of all the techniques I've come across (including ElasticNet, Lasso, LassoLars). The technique is Lasso with the Bayesian Information Criterion.

Here, it throws out 6 of the 19 variables for offense, 4 of 19 for defense, so it's still keeping most of the variables. I'm guessing that could be due to the large # of observations I have (roughly 12*1300*100 = 1.5 million).

When it comes to OOS prediction it is a little worse than Ridge at predicting '12, '13 and '14, but at least it's close
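For reference, scikit-learn exposes exactly this technique as LassoLarsIC. A sketch on synthetic data (19 predictors, only 5 of which carry signal; the real data would be the z-scored box score/PBP matrix):

```python
# Lasso with BIC-based model selection via LassoLarsIC.
import numpy as np
from sklearn.linear_model import LassoLarsIC

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 19))
beta = np.zeros(19)
beta[:5] = [2.0, -1.5, 1.0, 0.8, -0.6]      # 5 true signals, 14 nulls
y = X @ beta + rng.normal(size=400)

model = LassoLarsIC(criterion="bic").fit(X, y)
kept = np.flatnonzero(model.coef_)          # columns BIC leaves in the model
```

With strong signals and a big sample, BIC keeps the true variables and prunes at least some of the noise, which matches the "most variables stay in" experience above.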
talkingpractice
Posts: 194
Joined: Tue Oct 30, 2012 6:58 pm
Location: The Alpha Quadrant
Contact:

Re: Statistical +/-

Post by talkingpractice »

I have various comments here, and hopefully some will be helpful ->

- So our FORPM model (we've discussed it a bit but not with tons of detail) uses random forest regressions and gradient boosting, for many of the reasons that Ilardi mentioned. Our findings match a lot of what I'm reading here, namely that the machine learning versions of models like this do a bit better, but not tons better, at predicting ptdiff out of sample.

- One problem with random forest type models is that they're sort of a black box. In other words, you can't really know afterwards precisely which interactions the forest did or didn't like, in the same way you can with a traditional model.

- We've used LASSO/BIC based models too (and stepwise regressions), for the same variable selection reasons as mentioned here. We've found the same thing, that most variables stay in the model, and it's likely due to the huge sample.

- We use several 'demographic' variables in there (height, wingspan, vertical, etc), and allow it to try a gazillion different (though logical a priori) interactions. The demographics do more for D than for O. That said, there's nothing very groundbreaking there either (nothing special really about giving guys with long arms some extra credit on D). Boxscore/demographic models will always suck a bit on D until they are able to account for crappy rotations, lazy stuff, bad pnr d, etc. I'm sure that some of the smarter teams use video charting or similar to create the variables of interest for some of those areas.

- Imo, this here is indeed fun/cool as hell, and I hadn't thought of it before ->
J.E. wrote:Another 'fun' exercise could be to include already existing metrics like PER and ORtg/DRtg in the regression. This is potentially interesting because the regression would be able to tell us which metric is 'better' (since '02), and could tell us potential weaknesses of each metric (say we include PER and various BoxScore stats, and assuming PER gets a positive coefficient for offense. If we then get a large negative coefficient for 'FGA' we 'know' PER overrated the chuckers (again, since '02, can't make any statements for the years before))
J.E.
Posts: 852
Joined: Fri Apr 15, 2011 8:28 am

Re: Statistical +/-

Post by J.E. »

talkingpractice wrote:- We've used LASSO/BIC based models too (and stepwise regressions), for the same variable selection reasons as mentioned here. We've found the same thing, that most variables stay in the model, and it's likely due to the huge sample.
The alternative, computing ASPM like DSMok1 does (using long-term RAPM numbers as the dependent variable), could potentially lead to fewer variables, as the number of observations is so much smaller (~2,800 vs ~1.5 million). The big question, to me, with that model is how you weight the observations. If you weight by some function of each player's RAPM standard error (lower standard error -> more weight), I suspect this method could be slightly inferior to the SPM method that has PPP as the dependent variable when it comes to predicting the performance of high-minute players. That's because the standard error doesn't really go down much after a player has played ~20k possessions, so a player with 20k possessions gets the same weight as a player who has played 80k. The model that has PPP as the dependent variable will 'spend more effort' trying to nail the 80k player. I guess you could sidestep that problem, if you think it is a problem, by weighting by minutes instead of standard error.
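A sketch of that trade-off (all numbers synthetic, and the plateau at 20k possessions is just the toy assumption from the paragraph above): weighting by 1/SE^2 caps the weight of high-possession players, while weighting by possessions does not, so the two schemes fit (slightly) different coefficients.

```python
# Compare inverse-variance weighting against raw playing-time weighting.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                       # box-score features
y = X @ rng.normal(size=5) + rng.normal(size=200)   # long-term RAPM stand-in
poss = rng.integers(2_000, 80_000, size=200)

se = 1.0 / np.sqrt(np.minimum(poss, 20_000))        # SE stops shrinking at 20k

fit_by_se = Ridge(alpha=1.0).fit(X, y, sample_weight=1.0 / se**2)
fit_by_poss = Ridge(alpha=1.0).fit(X, y, sample_weight=poss.astype(float))
# 1/se**2 is capped at 20k, so the 80k player counts the same as the 20k
# player in the first fit but four times as much in the second.
```

Which weighting predicts lineup efficiency better out of sample is, as said, the empirical question.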

Ultimately all empirical questions, and definitely worth checking out