Incorporating Prior Information into RAPM

TopDownHockey · Post by **TopDownHockey** » Thu Nov 26, 2020 9:48 pm

Hello all,

I am currently working on RAPM for analyzing hockey skaters. For those who aren’t familiar with hockey, it is very similar to basketball from the perspective of RAPM: 5 skaters are constantly playing against one another and RAPM is run using a matrix with dummy variables denoting whether or not a player is active on offense or defense as well as a few other dummy variables such as score state, home advantage, etc. I use shot attempts per 60 minutes as the target variable instead of goals because goals are prone to much variance.
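For readers new to the setup, here is a minimal sketch of how such a dummy matrix might be assembled. The shift records, field names, and two-skaters-per-side lineups are made up for brevity (they are not the poster's data or code), and weighting each row by shift length is only an assumption about what "weighted" means here. Python is used purely to keep the example self-contained.

```python
# Hypothetical shift records: skaters on offense, skaters on defense,
# whether the offensive team is at home, shot attempts per 60, and seconds played.
shifts = [
    {"off": ["A", "B"], "def": ["C", "D"], "home": 1, "rate": 62.0, "secs": 40},
    {"off": ["C", "D"], "def": ["A", "B"], "home": 0, "rate": 48.0, "secs": 35},
]

# One column per (player, role) pair, plus one column per context dummy.
players = sorted({p for s in shifts for p in s["off"] + s["def"]})
columns = [f"{p}_off" for p in players] + [f"{p}_def" for p in players] + ["home"]
col_index = {name: j for j, name in enumerate(columns)}


def sparse_row(shift):
    """Return {column index: 1.0} for the dummies that are switched on."""
    row = {col_index[f"{p}_off"]: 1.0 for p in shift["off"]}
    row.update({col_index[f"{p}_def"]: 1.0 for p in shift["def"]})
    if shift["home"]:
        row[col_index["home"]] = 1.0
    return row


X = [sparse_row(s) for s in shifts]  # sparse design matrix, stored row by row
y = [s["rate"] for s in shifts]      # target: shot attempts per 60
w = [s["secs"] for s in shifts]      # assumed regression weights: shift length
```

Each player gets two coefficients, one for offense and one for defense, so a prior for a player would also come in two parts.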

As of right now, I am running a weighted ridge regression using a Gaussian distribution that biases the value of all coefficients towards zero. I would like to incorporate prior information for every player dummy variable and I am not sure how to go about doing so.

I am currently running my calculations in R using the glmnet package. My code is as follows:

glmnet(x = Dummy_Matrix, y = Target_Variable, alpha = 0, standardize = FALSE, lambda = Prior_Obtained_Lambda)

With Lambda being obtained through cross validation (cv.glmnet) on the dataset.

As mentioned, the distribution is Gaussian and therefore coefficients are biased towards zero, and lambda values are equal for all coefficients. I have looked into using bayesglm (from the arm package), but it is much slower than glmnet as it cannot handle sparse matrices. (I always set up my dummy matrix as sparse.)

To those of you who have created prior-informed RAPM in the past: How did you incorporate your priors?
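As background for the replies, the question can be written out. This is a sketch in notation that is not in the post: $w_i$ is the weight of shift $i$, $x_i$ its row of dummies, and $\mu_j$ a hypothetical prior mean for coefficient $j$.

```latex
\begin{align*}
\hat{\beta}_{\text{ridge}}
  &= \arg\min_{\beta} \sum_i w_i \,(y_i - x_i^\top \beta)^2 + \lambda \sum_j \beta_j^2 \\
\hat{\beta}_{\text{prior}}
  &= \arg\min_{\beta} \sum_i w_i \,(y_i - x_i^\top \beta)^2 + \lambda \sum_j (\beta_j - \mu_j)^2
\end{align*}
```

The first penalty corresponds to a Gaussian prior centred on zero for every coefficient, which is the "biased towards zero" behaviour. The second centres the same prior on $\mu_j$. Packages scale the loss and $\lambda$ differently (glmnet, for instance, puts a factor of 1/(2N) on the squared error), so the formulas hold up to that rescaling.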

DSMok1 · Post by **DSMok1** » Fri Nov 27, 2020 3:39 am

I don't think it's the perfect approach, but I have just used them as a pre- and post-processing step. Basically subtract out the prior from the target variable, run the regression, and then add back in the prior. I only use that with a very generic type of prior, because it's hard to validate.

Please understand that I'm no expert in this area, that's just the approach that I have used.
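A minimal sketch of the subtract, regress, add-back idea, assuming a small dense matrix and a closed-form weighted ridge solve in place of glmnet. The data, the function names, and the value of lambda are all made up for illustration.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def ridge(X, y, w, lam):
    """Weighted ridge: minimise sum_i w_i (y_i - x_i.b)^2 + lam * sum_j b_j^2."""
    p = len(X[0])
    A = [[sum(wi * xi[j] * xi[k] for xi, wi in zip(X, w)) + (lam if j == k else 0.0)
          for k in range(p)] for j in range(p)]
    rhs = [sum(wi * xi[j] * yi for xi, yi, wi in zip(X, y, w)) for j in range(p)]
    return solve(A, rhs)


def ridge_with_prior(X, y, w, lam, prior):
    """Subtract the prior's prediction, shrink the remainder, add the prior back."""
    y_adj = [yi - sum(xj * pj for xj, pj in zip(xi, prior)) for xi, yi in zip(X, y)]
    delta = ridge(X, y_adj, w, lam)  # deviations from the prior, shrunk toward 0
    return [pj + dj for pj, dj in zip(prior, delta)]


# Made-up toy data: 3 player dummies, 4 shifts, target centred on league average.
X = [[1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]]
y = [4.0, -1.5, -3.0, 1.0]
w = [40.0, 30.0, 35.0, 45.0]
prior = [3.0, -1.0, -2.0]

beta = ridge_with_prior(X, y, w, lam=50.0, prior=prior)
```

With this construction the penalty shrinks each player toward their prior rather than toward zero: a very large lambda returns the prior itself, and an all-zero prior reduces to ordinary ridge.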

TopDownHockey · Post by **TopDownHockey** » Fri Nov 27, 2020 6:47 am

DSMok1 wrote: ↑Fri Nov 27, 2020 3:39 am I don't think it's the perfect approach, but I have just used them as a pre- and post-processing step. Basically subtract out the prior from the target variable, run the regression, and then add back in the prior. I only use that with a very generic type of prior, because it's hard to validate.

Please understand that I'm no expert in this area, that's just the approach that I have used.

Haha, I found a post from you on this forum suggesting this method over two years ago, and I've been working with it today!

I agree that it may not be perfect, but it is very interesting. I'm working with it for the time being. I may switch over to something else in the future depending on the results I get from testing this for 4-5 seasons.

Would you care to elaborate on what you mean by "it's hard to validate"? And just one other question: do you feel that the finished product from this process is a descriptive metric that measures player performance in that season, or more of an evaluative metric that uses priors to more accurately evaluate their performance in that season? My understanding is that it's more of the latter, but I'm curious to hear your stance on it.

DSMok1 · Post by **DSMok1** » Fri Nov 27, 2020 12:45 pm

Well, no part of the process checks out of sample whether the prior was appropriate. The only real way to assess the quality of a prior with this crude method is to compare the lambda that is automatically selected, with greater shrinkage indicating the prior was better (a good prior leaves less for the regression to explain, so cross-validation can shrink the remainder harder).

I would love to see a process where the priors are validated within the Cross validation process, but I don't know how to do that.

TopDownHockey · Post by **TopDownHockey** » Sat Nov 28, 2020 2:31 am

How exactly did you go about calculating the prior? Did you use box metrics in a given season to calculate the "prior" for that same season? If so, was this calculated through some sort of linear regression that attempts to "predict" a player's RAPM in a given season using box metrics (or just MPG/team strength)?

I've been using raw vanilla RAPM from the previous season as a "prior" but I believe that doing so may be leading the prior to hold too much weight in the calculation. It seems that players at the top and bottom ends of the spectrum are far less malleable to change, which indicates to me that the end result of my prior-informed RAPM may be more of a multi-year metric than a single-year metric.

DSMok1 · Post by **DSMok1** » Mon Nov 30, 2020 5:30 pm

TopDownHockey wrote: ↑Sat Nov 28, 2020 2:31 am How exactly did you go about calculating the prior? Did you use box metrics in a given season to calculate the "prior" for that same season? If so, was this calculated through some sort of linear regression that attempts to "predict" a player's RAPM in a given season using box metrics (or just MPG/team strength)?

I've been using raw vanilla RAPM from the previous season as a "prior" but I believe that doing so may be leading the prior to hold too much weight in the calculation. It seems that players at the top and bottom ends of the spectrum are far less malleable to change, which indicates to me that the end result of my prior-informed RAPM may be more of a multi-year metric than a single-year metric.

Since I knew the approach I was using wasn't cross-validated or tested out of sample, I wanted to under-parameterize the regression. I only used two variables, MPG and Team Adjusted Efficiency, and I regressed those onto the old BPM 1.0 to get a rough value for each. There's a little more to it than that, but that's the general outline. I wanted the prior to just give the general outline of how good we'd expect the player to be, with no more data than required.

I didn't regress onto RAPM, because RAPM's big source of error is how it deals with low minutes players (dragging them all toward 0), so it will not handle a MPG variable well at all.

xkonk · Post by **xkonk** » Mon Nov 30, 2020 11:05 pm

DSMok1 wrote: ↑Fri Nov 27, 2020 12:45 pm I would love to see a process where the priors are validated within the Cross validation process, but I don't know how to do that.

I would argue from a philosophical standpoint that priors shouldn't be 'validated' in any particular way. They should be defensible, but otherwise a prior is just supposed to reflect what values you think are reasonable for whatever parameter they go with. If you aren't sure what a prior should be like or you can't defend a particular choice, then pretty much by definition you want a vague, 'default' prior that is quickly replaced by data. If you do have a good idea what the prior should be, then you don't need to validate it.

vzografos · Post by **vzografos** » Fri Dec 04, 2020 6:57 pm

Hi, just to understand better because I haven't attempted something like this before. I guess by RAPM you mean regularized adjusted plus/minus.

So you are trying to regress Y (predict) shot attempts per 60 minutes (as a way of predicting game outcome?) from the "dummy" X variables you have chosen using some regularised regression model such as Lasso etc.

And now you would like to put priors on what? On your output coefficients or your X 'dummy' variables? From your post it seems the latter.
From reading the documentation on bayesglm it would seem to be the right answer if you wanted priors for the output coefficients.
I am a bit confused about what you are trying to achieve with priors on the X variables, because you then mention the biasing of output coefficients towards zero.

I would imagine all the prior information has been incorporated into your matrix with the dummy variables X. Removing a prior from Y and adding it back after regression sounds a bit ad hoc and looks to me like a re-centering/removing bias approach. That doesn't sound like the "right" way of doing things.

Can you tell us a little bit where your prior information is coming from? i.e. what additional information you would like to incorporate that is not already in X.

TopDownHockey · Post by **TopDownHockey** » Sat Dec 05, 2020 10:30 pm

vzografos wrote: ↑Fri Dec 04, 2020 6:57 pm Hi, just to understand better because I haven't attempted something like this before. I guess by RAPM you mean regularized adjusted plus/minus.

So you are trying to regress Y (predict) shot attempts per 60 minutes (as a way of predicting game outcome?) from the "dummy" X variables you have chosen using some regularised regression model such as Lasso etc.

And now you would like to put priors on what? On your output coefficients or your X 'dummy' variables? From your post it seems the latter.
From reading the documentation on bayesglm it would seem to be the right answer if you wanted priors for the output coefficients.
I am a bit confused about what you are trying to achieve with priors on the X variables, because you then mention the biasing of output coefficients towards zero.

I would imagine all the prior information has been incorporated into your matrix with the dummy variables X. Removing a prior from Y and adding it back after regression sounds a bit ad hoc and looks to me like a re-centering/removing bias approach. That doesn't sound like the "right" way of doing things.

Can you tell us a little bit where your prior information is coming from? i.e. what additional information you would like to incorporate that is not already in X.

Yes, I do mean Regularized Adjusted Plus-Minus. I am using L2 (Tikhonov) regularization - not L1 (Lasso) - but otherwise you are correct.

I would like to place priors on my dummy variables. For example, in basketball terminology, if we know that Lebron's offensive RAPM is +5, I would like to add a +5 prior to the "Lebron James - Offense" dummy variable.

The goal of the priors on the X variables is to incorporate more information into my player estimates. This would reduce collinearity and allow me to make player evaluations with more certainty.

The additional information that I would like to incorporate is past player RAPM.

vzografos · Post by **vzografos** » Sun Dec 06, 2020 11:13 am

ok I apologise in advance for too many questions but I need to understand what exactly you want, and chatting on the forum might not be the best way of understanding.

so.... you are training your regressor as

glmnet(x = Dummy_Matrix, y = Target_Variable, forget the other params....)

and you get your output coefficients. But you don't want priors on the output coefficients (that would have been achieved with bayesglm) but you want priors on....X or on Y?

TopDownHockey wrote: ↑Sat Dec 05, 2020 10:30 pm The additional information that I would like to incorporate is past player RAPM.

That seems to me like priors on Y? Is that correct?

rainmantrail · Post by **rainmantrail** » Mon Dec 14, 2020 9:28 am

DSMok1 wrote: ↑Fri Nov 27, 2020 12:45 pm I would love to see a process where the priors are validated within the Cross validation process, but I don't know how to do that.

I think this should be doable. I believe we can just add new rows to the design matrix, two for each player (one for their offensive prior and one for their defensive prior, if splitting those out) where that player gets a 1 and all others get a 0. Then input the value of their prior as the score margin, then set the weight for however many possessions we want it to count for. I think this should work, as long as we have a row of values for all returning players. You could also use this to include multiple seasons' worth of priors, awarding decaying weights to each season if you wanted. Or you could just wrap those all up into one static prior for each player. Here's an example of how I'm picturing this, in case I'm not making sense. If any of you guys/gals with a better background than me in linear algebra see an issue with this approach, please let me know. But this is how I'm thinking of handling my priors after I finish building my database.

Design matrix for a 3-on-3 setup:

P1o	P2o	P3o	P4o	P5o	P6o	P1d	P2d	P3d	P4d	P5d	P6d	Pts	Poss
1	1	1	0	0	0	0	0	0	-1	-1	-1	10	22
0	0	0	1	1	1	-1	-1	-1	0	0	0	6	13
1	1	1	0	0	0	0	0	0	-1	-1	-1	2	3
0	0	0	1	1	1	-1	-1	-1	0	0	0	6	4
1	0	0	0	0	0	0	0	0	0	0	0	7.5	3000  #P1's offensive prior
0	1	0	0	0	0	0	0	0	0	0	0	1.3	3000  #P2's offensive prior
0	0	1	0	0	0	0	0	0	0	0	0	0.1	3000  #P3's offensive prior
...
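A minimal sketch of the extra-rows construction, assuming plain weighted least squares with no ridge penalty on top. The toy data, function names, and the prior weight of 50 are made up for illustration, and for simplicity every coefficient gets a prior row here.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def weighted_ls(X, y, w):
    """Plain weighted least squares via the normal equations (no ridge penalty)."""
    p = len(X[0])
    A = [[sum(wi * xi[j] * xi[k] for xi, wi in zip(X, w)) for k in range(p)]
         for j in range(p)]
    rhs = [sum(wi * xi[j] * yi for xi, yi, wi in zip(X, y, w)) for j in range(p)]
    return solve(A, rhs)


def add_prior_rows(X, y, w, prior, prior_weight):
    """Append one pseudo-observation per coefficient: a 1 in its column, the prior as target."""
    p = len(prior)
    X_aug = [list(row) for row in X]
    y_aug = list(y)
    w_aug = list(w)
    for j in range(p):
        X_aug.append([1.0 if k == j else 0.0 for k in range(p)])
        y_aug.append(prior[j])
        w_aug.append(prior_weight)
    return X_aug, y_aug, w_aug


# Made-up toy data: 3 coefficients, 4 observed rows.
X = [[1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]]
y = [4.0, -1.5, -3.0, 1.0]
w = [40.0, 30.0, 35.0, 45.0]
prior = [3.0, -1.0, -2.0]

X_aug, y_aug, w_aug = add_prior_rows(X, y, w, prior, prior_weight=50.0)
beta = weighted_ls(X_aug, y_aug, w_aug)
```

The weight on the prior rows sets how many possessions of evidence the prior is worth: as it grows, the estimates move toward the priors. If a ridge penalty is kept on top of the augmented system, each estimate is pulled toward zero and toward its prior at the same time.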

vzografos · Post by **vzografos** » Mon Dec 14, 2020 12:12 pm

I don't see why that wouldn't work, but I am not sure what it means statistically to use priors in that way, i.e. whether the least squares solution that you get from that overdetermined system bears any relation to a Bayesian solution. But maybe that's not important here...
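On the statistical meaning, here is a sketch in notation that is not from the thread: $w_i$ are the possession weights, $p_j$ is the prior value and $k_j$ the pseudo-possession weight on player $j$'s prior row, $W = \mathrm{diag}(w_i)$ and $K = \mathrm{diag}(k_j)$. Weighted least squares on the stacked system, with no further penalty, minimises

```latex
\begin{align*}
L(\beta) &= \sum_i w_i\,(y_i - x_i^\top \beta)^2 + \sum_j k_j\,(\beta_j - p_j)^2 \\
\nabla L(\hat{\beta}) = 0 \;&\Longrightarrow\; (X^\top W X + K)\,\hat{\beta} = X^\top W y + K p
\end{align*}
```

This is the posterior mode (and mean) of a Gaussian linear model with noise variance $\sigma^2 / w_i$ on row $i$ and independent priors $\beta_j \sim N(p_j, \sigma^2 / k_j)$. So the construction does have a Bayesian reading: each prior row acts as a per-player ridge penalty of strength $k_j$ that shrinks toward $p_j$ rather than toward 0. When every $k_j$ equals a common $\lambda$, the estimate coincides with subtracting $Xp$ from the target, running ridge with penalty $\lambda$, and adding $p$ back.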

DSMok1 · Post by **DSMok1** » Mon Dec 14, 2020 2:45 pm

vzografos wrote: ↑Mon Dec 14, 2020 12:12 pm I don't see why that wouldn't work, but I am not sure what it means statistically to use priors in that way, i.e. whether the least squares solution that you get from that overdetermined system bears any relation to a Bayesian solution. But maybe that's not important here...

I had experimented (as have others) with a construction like that, but it didn't seem to work "correctly". That was a long time ago, and I can't remember the issue.

Remember--the L2 regularization shrinks toward 0. That doesn't make a lot of sense with the above construction. All the cross validation does is determine the extent of the shrinkage toward 0.

I was thinking more along the lines of experimenting with a range of priors to determine what the prior should be--using some sort of out of sample validation to determine which prior was best.
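One way the out-of-sample comparison could look, as a rough sketch only. It assumes a single train/test split, a fixed lambda, and noise-free synthetic data in which one candidate prior equals the true coefficients, so that candidate wins by construction. All names and numbers are made up. In practice lambda would be re-tuned for each candidate and the split replaced by k-fold cross-validation.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def predict(x, beta):
    """Predicted target for one row of dummies."""
    return sum(xj * bj for xj, bj in zip(x, beta))


def ridge_with_prior(X, y, w, lam, prior):
    """Weighted ridge on the target minus the prior's prediction, prior added back."""
    p = len(prior)
    y_adj = [yi - predict(xi, prior) for xi, yi in zip(X, y)]
    A = [[sum(wi * xi[j] * xi[k] for xi, wi in zip(X, w)) + (lam if j == k else 0.0)
          for k in range(p)] for j in range(p)]
    rhs = [sum(wi * xi[j] * ya for xi, ya, wi in zip(X, y_adj, w)) for j in range(p)]
    return [pj + dj for pj, dj in zip(prior, solve(A, rhs))]


def holdout_error(train, test, lam, prior):
    """Fit on the training rows; return weighted mean squared error on held-out rows."""
    X_tr, y_tr, w_tr = train
    X_te, y_te, w_te = test
    beta = ridge_with_prior(X_tr, y_tr, w_tr, lam, prior)
    sq = sum(wi * (yi - predict(xi, beta)) ** 2 for xi, yi, wi in zip(X_te, y_te, w_te))
    return sq / sum(w_te)


# Synthetic, noise-free data generated from known coefficients (illustration only).
true_beta = [2.0, -1.0, 0.5]
X = [[1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1], [1, 0, 0], [0, 1, 0]]
w = [40.0, 30.0, 35.0, 45.0, 20.0, 25.0]
y = [predict(xi, true_beta) for xi in X]

train = (X[:4], y[:4], w[:4])
test = (X[4:], y[4:], w[4:])

candidates = {
    "zero": [0.0, 0.0, 0.0],    # ordinary ridge, no prior
    "half": [1.0, -0.5, 0.25],  # prior halfway to the truth
    "full": true_beta,          # prior equal to the truth
}
errors = {name: holdout_error(train, test, 50.0, prior) for name, prior in candidates.items()}
best = min(errors, key=errors.get)
```

The same loop could compare differently discounted versions of last season's values, which would give an out-of-sample handle on how much weight a prior deserves.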

vzografos · Post by **vzografos** » Mon Dec 14, 2020 3:47 pm

DSMok1 wrote: ↑Mon Dec 14, 2020 2:45 pm

I had experimented (as have others) with a construction like that, but it didn't seem to work "correctly". That was a long time ago, and I can't remember the issue.

I guess it all depends on what the expectation of working correctly is. If it is about getting better prediction accuracy, then I don't see how using priors like that can guarantee that, even with CV. Especially if you are using a lot of static historical priors on a non-stationary process (which is what player performance over the years is).

You might get better results by using priors and a proper Bayesian treatment of the regression coefficients rather than the RAPM values.

rainmantrail · Post by **rainmantrail** » Mon Dec 14, 2020 10:12 pm

I'll play around with it and see which approach is more predictive. Another design option would be to have each player going up against the average defender in these additional rows, and have them defending against the average offensive player for their defensive priors. I don't know from a theory standpoint if this is any different from the solution without the average player as a variable, but it might yield more stability? I'm not sure. But I plan to test that as well. I'm fairly confident that this is how David Frohardt-Lane (and many other top NFL handicappers) incorporates his priors when building predictive NFL models. However, I don't think they are using regularization methods in their models as multicollinearity isn't nearly as problematic in that framework.

APBRmetrics
