Statistical +/-

Home for all your discussion of basketball statistical analysis.
J.E.
Posts: 852
Joined: Fri Apr 15, 2011 8:28 am

Statistical +/-

Post by J.E. »

I'm currently in the process of creating another version of 'statistical +/-' (SPM). You can read up on the first installment here. What's different this time around is
a) more data
b) I've included stats that are not available in the BoxScore and thus have to be figured out through the PBP (see here)

I'm posting results of the first run, but more versions are sure to come. Thus far I've kept it relatively simple on the 'shooting' side, only including PPS and '# of shots', with the latter being (FGA+(FTA-AND1s)/2), where AND1s includes 'unsuccessful AND1s'. I'm definitely open to suggestions, whether to include more variables (they'd have to be either on BBR or something I can figure out through the PBP) or to try certain interaction effects.

Another 'fun' exercise could be to include already existing metrics like PER and ORtg/DRtg in the regression. This is potentially interesting because the regression would be able to tell us which metric is 'better' (since '02), and could point out potential weaknesses of each metric. Say we include PER and various box score stats, and PER gets a positive coefficient for offense; if we then also get a large negative coefficient for 'FGA', we 'know' PER has been overrating the chuckers (again, only since '02; no statements possible for the years before).
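A toy sketch of that diagnostic idea (all variables synthetic and made up for illustration, not the actual data set): regress the target on a PER-like metric plus a raw box score stat, and read the sign of the raw stat's coefficient.

```python
# Synthetic illustration: 'per' is built to overrate shot volume, so a
# regression that also sees 'fga' should assign fga a negative coefficient.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n = 1000
fga = rng.normal(size=n)                      # standardized shot volume
true_impact = rng.normal(size=n)              # the 'real' offensive value
per = true_impact + 0.5 * fga                 # a metric that overrates volume
y = true_impact + rng.normal(scale=0.1, size=n)

coef = Ridge(alpha=1.0).fit(np.column_stack([per, fga]), y).coef_
# The fit recovers roughly y ~= 1.0*per - 0.5*fga: the negative FGA
# coefficient flags that 'per' was overrating shot volume.
```

The same logic applies to any pair of a composite metric and the raw stats it was built from.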

I'm using z-scores for the variables, so a coefficient of 2 has twice the impact on the player rating as a coefficient of 1 does.
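As a minimal illustration of that standardization step (synthetic numbers, not the real data): each predictor column is centered and scaled to unit variance before the regression, so coefficient magnitudes are directly comparable.

```python
# z-scoring a small synthetic predictor matrix
import numpy as np

X = np.array([[30.5, 1.10, 12.0],
              [12.3, 0.95,  5.0],
              [36.1, 1.30, 18.0],
              [24.0, 1.05,  9.0]])   # rows = players; cols = MP, pps, shots

X_z = (X - X.mean(axis=0)) / X.std(axis=0)
# every column of X_z now has mean 0 and standard deviation 1
```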

Coefficients for offense

Code: Select all

MP	2.92
pps	1.94
shots	0.88
AST	0.83
off_rebound_ft	0.60
off_rebound_fg	0.57
STL	0.30
height	0.19
weight	0.11
blocks_to_off	0.06
dead_to	-0.04
blocks_to_def	-0.13
live_to	-0.16
goaltends	-0.16
def_rebound_fg	-0.42
G	-0.56
def_rebound_ft	-0.66
PF	-1.22
GS	-1.55
and defense

Code: Select all

def_rebound_fg	1.37
blocks_to_def	1.11
STL	1.05
GS	1.04
height	0.56
AST	0.47
off_rebound_ft	0.36
weight	0.34
blocks_to_off	0.31
pps	0.23
MP	0.02
G	-0.07
PF	-0.08
def_rebound_ft	-0.18
goaltends	-0.37
dead_to	-0.40
live_to	-0.45
off_rebound_fg	-0.52
shots	-1.10
where
- 'off_rebound_fg' is an offensive rebound after an FGA, rather than an FTA. Vice versa for 'off_rebound_ft'.
- goaltends are defensive goaltends only
- 'live_to'/'dead_to' are live/dead ball turnovers
- 'blocks_to_def' are blocks that get recovered by the defensive team. Vice versa for 'off'

The large difference for the (defensive) coefficients for 'blocks_to_def' vs. 'blocks_to_off' and 'def_reb_fg' vs. 'def_reb_ft' tells me there's definitely value in extracting those kinds of stats from the PBP

The player values for '13-'14 are here. It's not a fan of guards who don't assist much and don't shoot very well, and it has the Lopez brothers in the top 10. Definitely a little weird, although they also sport a 125/121 ORtg and 108/107 DRtg this season.

R^2 between '14 SPM and RAPM is 0.35 for offense, 0.2 for defense. I'm not gunning for higher R^2 (I'm gunning for lower OOS prediction error for offensive efficiency of 5-on-5 lineups), but higher R^2 is probably desirable, too.
ilardi
Posts: 27
Joined: Fri Apr 15, 2011 3:29 am

Re: Statistical +/-

Post by ilardi »

Great stuff as always, Jerry, and I especially like the use of PbP-derived metrics. Two quick thoughts, though, on potential improvements:

1) it looks like you've got ~20 IV's in the model right now (a large number), and there's a good chance that some IVs are redundant (vis-a-vis DV variability) and that some may even be functioning as 'suppressors'. For example, Daniel Myers used only about 10 predictors with his ASPM model and says he got much higher R^2 values in predicting RAPM (.73 offense, .49 defense), at least within-sample . . .

2) So, I'm thinking it might be helpful to try one of the newer array of machine-learning algorithms for model specification to help optimize the model and sort the wheat from the chaff, as it were. R has a nice package for 'random forest' regression, which can also be used in tandem with helpful techniques like 'gradient boosting'. The beauty of such approaches is that they permit you to toss in dozens of POTENTIAL variables - as well as all sorts of potential interaction terms and non-linear terms if you want - without having to worry too much about overfitting, suppressor effects, radical shrinkage effects, etc.
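For anyone who'd rather stay in Python than R, a quick sketch of the idea in scikit-learn (synthetic data; the real model would use the box score/PBP predictor matrix):

```python
# Random forest + gradient boosting on a synthetic target that contains
# an interaction (x0*x1) no plain linear model would find unaided.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                      # ~20 candidate IVs
y = X[:, 0] * X[:, 1] + X[:, 2] + rng.normal(scale=0.1, size=500)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
gb = GradientBoostingRegressor(random_state=0).fit(X, y)

# Feature importances give a rough read on which variables carry signal.
importance_order = np.argsort(rf.feature_importances_)[::-1]
```

One caveat (echoed later in the thread): these models are harder to interpret than a linear fit, so they're better for screening variables than for reading off clean coefficients.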

Hope that might prove helpful.

Steve
J.E.
Posts: 852
Joined: Fri Apr 15, 2011 8:28 am

Re: Statistical +/-

Post by J.E. »

Thanks. I'll try to look into random forest

Just for completeness, I've tried different regression techniques, some of which do some type of variable selection, including Lasso, ElasticNet and Lasso with Least Angle Regression. Ridge outperformed all of them for this specific problem - that's why I went ahead and posted coefficients for just Ridge
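A hedged sketch of how that comparison can be run in scikit-learn (stand-in synthetic data; penalty strengths here are arbitrary, whereas the real comparison would tune them and score on held-out seasons):

```python
# Compare Ridge, Lasso, ElasticNet and LassoLars by cross-validated R^2.
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LassoLars, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 19))                     # 19 z-scored predictors
y = X @ rng.normal(size=19) + rng.normal(scale=0.5, size=300)

models = {
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.01),
    "elasticnet": ElasticNet(alpha=0.01, l1_ratio=0.5),
    "lassolars": LassoLars(alpha=0.01),
}
cv_r2 = {name: cross_val_score(m, X, y, cv=5).mean()
         for name, m in models.items()}
best = max(cv_r2, key=cv_r2.get)                   # best-scoring technique
```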

I've never heard of the 'suppressor effect'. I'll have to look into that as well
xkonk
Posts: 307
Joined: Fri Apr 15, 2011 12:37 am

Re: Statistical +/-

Post by xkonk »

As long as you're reporting the 'vanilla' R squared as opposed to some kind of adjusted value, having more predictors cannot lead to a worse fit. If this R squared is lower than previous iterations of SPM, it is due to differences in the data sets used.
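A small demonstration of that point (synthetic data): in-sample, unadjusted R^2 can only go up as predictors are added, even when the extra predictors are pure noise.

```python
# In-sample R^2 is monotonically non-decreasing in the number of predictors.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # only column 0 carries signal
y = X[:, 0] + rng.normal(size=200)

r2 = [LinearRegression().fit(X[:, :k], y).score(X[:, :k], y)
      for k in range(1, 11)]
# Each extra (noise) column can only soak up more in-sample variance,
# which is why adjusted R^2 or OOS error is the fairer yardstick.
```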

But Steve makes a good point on the interpretation of the predictors themselves; the suppressor effect is one potential issue that comes out of multicollinearity problems. Related to that, my first thought was that adding PER (or potentially any box-score based metric) is going to be problematic in that it is going to be ridiculously correlated with the predictors you already have in the model. Ridge regression will help clean it up, as it usually does, but it can only help, not fix the issue entirely.

Speaking along the same lines, are there any a priori reasons to expect some of the differences we see for similar predictors? For example, it looks like minutes played is canceling out some of the effect of games played and games started on offense but on defense minutes and games have little impact while games started is very important. Would you/someone have predicted that beforehand? Why is games started important to defense as opposed to minutes played? Why is it so different for offense?
colts18
Posts: 313
Joined: Fri Aug 31, 2012 1:52 am

Re: Statistical +/-

Post by colts18 »

I don't see it on the list, but did you put charges drawn in your SPM model? What about assisted FG%? I imagine that if you include assisted FG%, you won't have guys like DeAndre Jordan and Tyson Chandler at the top of the list.
DSMok1
Posts: 1119
Joined: Thu Apr 14, 2011 11:18 pm
Location: Maine
Contact:

Re: Statistical +/-

Post by DSMok1 »

ilardi wrote:Great stuff as always, Jerry, and I especially like the use of PbP-derived metrics. Two quick thoughts, though, on potential improvements:

1) it looks like you've got ~20 IV's in the model right now (a large number), and there's a good chance that some IVs are redundant (vis-a-vis DV variability) and that some may even be functioning as 'suppressors'. For example, Daniel Myers used only about 10 predictors with his ASPM model and says he got much higher R^2 values in predicting RAPM (.73 offense, .49 defense), at least within-sample . . .
Most of the difference in R^2 is probably coming from the RAPM basis. I used an 8 year average RAPM, so likely not very noisy at all. This 2014 RAPM has a much smaller sample size, so there's likely quite a bit more noise coming from the RAPM. (Though I agree with your comments on number of predictors otherwise).
Developer of Box Plus/Minus
APBRmetrics Forum Administrator
Twitter.com/DSMok1
Crow
Posts: 10533
Joined: Thu Apr 14, 2011 11:10 pm

Re: Statistical +/-

Post by Crow »

If your exploration includes height and weight, it might be interesting to extend it to include wingspan, vertical jump, speed and agility from the draft combine data (available at DraftExpress), just to see what the model run shows for them. I'd be interested in seeing a RAPM model run with such demographic variables included too, and then comparing what the results show on SPM and RAPM about their impact.
AcrossTheCourt
Posts: 237
Joined: Sat Feb 16, 2013 11:56 am

Re: Statistical +/-

Post by AcrossTheCourt »

I've been wanting to create a stat. +/- model too, and I started one a few weeks ago to include a lot of non-box score information. There's a lot possible if you get creative with the stats.

And for determining which variables to include in a model, I sometimes use stepwise regression.
J.E.
Posts: 852
Joined: Fri Apr 15, 2011 8:28 am

Re: Statistical +/-

Post by J.E. »

Just for the record, I just tried another regression technique which, from subjective experience, usually selects the fewest variables of all the techniques I've come across (including ElasticNet, Lasso, LassoLars). The technique is Lasso with the Bayesian Information Criterion.

Here, it throws out 6 of the 19 variables for offense, 4 of 19 for defense, so it's still keeping most of the variables. I'm guessing that could be due to the large # of observations I have (roughly 12*1300*100 = 1.5 million).

When it comes to OOS prediction it is a little worse than Ridge at predicting '12, '13 and '14, but at least it's close
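For reference, scikit-learn exposes exactly this technique as LassoLarsIC. A sketch on synthetic data (19 predictors, only 5 of which carry signal; the real data would be the z-scored box score/PBP matrix):

```python
# Lasso with BIC-based model selection via LassoLarsIC.
import numpy as np
from sklearn.linear_model import LassoLarsIC

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 19))
beta = np.zeros(19)
beta[:5] = [2.0, -1.5, 1.0, 0.8, -0.6]      # 5 true signals, 14 nulls
y = X @ beta + rng.normal(size=400)

model = LassoLarsIC(criterion="bic").fit(X, y)
kept = np.flatnonzero(model.coef_)          # columns BIC leaves in the model
```

With strong signals and a big sample, BIC keeps the true variables and prunes at least some of the noise, which matches the "most variables stay in" experience above.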
talkingpractice
Posts: 194
Joined: Tue Oct 30, 2012 6:58 pm
Location: The Alpha Quadrant
Contact:

Re: Statistical +/-

Post by talkingpractice »

I have various comments here, and hopefully some will be helpful ->

- So our FORPM model (we've discussed it a bit but not with tons of detail) uses random forest regressions and gradient boosting, for many of the reasons that Ilardi mentioned. Our findings match a lot of what I'm reading here, namely that the machine learning versions of models like this do a bit better, but not tons better, at predicting ptdiff out of sample.

- One problem with random forest type models is that they're sort of a black box. In other words, you can't really know afterwards precisely which interactions the forest did or didn't like, in the same way you can with a traditional model.

- We've used LASSO/BIC based models too (and stepwise regressions), for the same variable selection reasons as mentioned here. We've found the same thing, that most variables stay in the model, and it's likely due to the huge sample.

- We use several 'demographic' variables in there (height, wingspan, vertical, etc), and allow it to try a gazillion different (though logical a priori) interactions. The demographics do more for D than for O. That said, there's nothing very groundbreaking there either (nothing special really about giving guys with long arms some extra credit on D). Boxscore/demographic models will always suck a bit on D until they are able to account for crappy rotations, lazy stuff, bad pnr d, etc. I'm sure that some of the smarter teams use video charting or similar to create the variables of interest for some of those areas.

- Imo, this here is indeed fun/cool as hell, and I hadn't thought of it before ->
J.E. wrote:Another 'fun' exercise could be to include already existing metrics like PER and ORtg/DRtg in the regression. This is potentially interesting because the regression would be able to tell us which metric is 'better' (since '02), and could tell us potential weaknesses of each metric (say we include PER and various BoxScore stats, and assuming PER gets a positive coefficient for offense. If we then get a large negative coefficient for 'FGA' we 'know' PER overrated the chuckers (again, since '02, can't make any statements for the years before))
J.E.
Posts: 852
Joined: Fri Apr 15, 2011 8:28 am

Re: Statistical +/-

Post by J.E. »

talkingpractice wrote:- We've used LASSO/BIC based models too (and stepwise regressions), for the same variable selection reasons as mentioned here. We've found the same thing, that most variables stay in the model, and it's likely due to the huge sample.
The alternative, computing ASPM like DSMok1 does (using long-term RAPM numbers as the dependent variable), could potentially lead to fewer variables, as the number of observations is so much smaller (~2,800 vs ~1.5 million). The big question, to me, with that model is how you weight the observations. If you weight by some function of each player's RAPM standard error (lower standard error -> more weight), I suspect this method could be slightly inferior to the SPM method that has PPP as the dependent variable when it comes to predicting the performance of high-minute players. That's because the standard error doesn't really go down much after a player has played ~20k possessions, so a player with 20k possessions gets the same weight as a player who has played 80k. The model that has PPP as the dependent variable will 'spend more effort' trying to nail the 80k player. I guess you could sidestep that problem, if you think it is a problem, by weighting by minutes instead of standard error.
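A sketch of that trade-off (all numbers synthetic, and the plateau at 20k possessions is just the toy assumption from the paragraph above): weighting by 1/SE^2 caps the weight of high-possession players, while weighting by possessions does not, so the two schemes fit (slightly) different coefficients.

```python
# Compare inverse-variance weighting against raw playing-time weighting.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                       # box-score features
y = X @ rng.normal(size=5) + rng.normal(size=200)   # long-term RAPM stand-in
poss = rng.integers(2_000, 80_000, size=200)

se = 1.0 / np.sqrt(np.minimum(poss, 20_000))        # SE stops shrinking at 20k

fit_by_se = Ridge(alpha=1.0).fit(X, y, sample_weight=1.0 / se**2)
fit_by_poss = Ridge(alpha=1.0).fit(X, y, sample_weight=poss.astype(float))
# 1/se**2 is capped at 20k, so the 80k player counts the same as the 20k
# player in the first fit but four times as much in the second.
```

Which weighting predicts lineup efficiency better out of sample is, as said, the empirical question.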

Ultimately all empirical questions, and definitely worth checking out