Demystifying Ridge Regression
I've been meaning to write a post like this for a while, but I haven't really had time. I feel that the name Ridge Regression obscures things by making the technique sound complicated. It isn't. Ignore the matrix algebra, ignore the name, ignore the Wikipedia page. If you are comfortable with weighted multiple linear regression then you should have no problem with ridge regression. In a nutshell, all ridge regression does is insert dummy measurements into a weighted multiple linear regression (the standard APM regression in this case, with one dummy measurement for each individual player). That's it. In standard RAPM the dummy measurements are set to zero (these dummy measurements are also referred to as priors, if you weren't sure). However, there is no requirement that they be set to zero; it is merely the most basic version. The other factor that may confuse people is the lambda value. Lambda is simply a way to apportion a certain weight to the dummy measurements, depending on your confidence in them relative to the data. Cross-validation (i.e. out-of-sample testing) is the standard way to tune lambda values. There is no requirement that the weight be the same for each dummy measurement, either; that is merely the basic model.
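To make the "dummy measurements" idea concrete, here is a minimal sketch in Python with NumPy. The toy data, the +1/-1 home/away coding, the possession weights and the lambda of 50 are all made up for illustration; none of it comes from a real RAPM data set. It appends one zero-valued dummy row per player to a weighted least squares problem and checks that the answer matches the textbook ridge formula.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 stints, 6 players.
# X[i, j] is +1 if player j is on the home side in stint i,
# -1 if on the away side, 0 if off the court.
n_stints, n_players = 200, 6
X = rng.choice([-1.0, 0.0, 1.0], size=(n_stints, n_players))
true_ratings = rng.normal(0.0, 2.0, size=n_players)
y = X @ true_ratings + rng.normal(0.0, 10.0, size=n_stints)  # noisy margin per stint
w = rng.integers(1, 20, size=n_stints).astype(float)         # weight = possessions in the stint

lam = 50.0  # weight given to each dummy measurement (assumed value)


def weighted_lstsq(X, y, w):
    """Weighted least squares: scale each row by sqrt(weight), then solve OLS."""
    s = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(X * s[:, None], y * s, rcond=None)
    return coef


# One dummy measurement per player: "player j on his own was worth 0", weight lam.
X_aug = np.vstack([X, np.eye(n_players)])
y_aug = np.concatenate([y, np.zeros(n_players)])
w_aug = np.concatenate([w, np.full(n_players, lam)])
ratings_dummy = weighted_lstsq(X_aug, y_aug, w_aug)

# The textbook ridge solution: (X'WX + lam*I)^-1 X'Wy.
A = X.T @ (w[:, None] * X) + lam * np.eye(n_players)
b = X.T @ (w * y)
ratings_ridge = np.linalg.solve(A, b)

print(np.allclose(ratings_dummy, ratings_ridge))
```

The two sets of ratings agree to numerical precision, which is the whole point: the matrix formula and the dummy rows are the same calculation written two ways.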
If anybody has any questions I'll be happy to try to answer them.
EDIT:
This post is intended for those who find the immediate introduction of mathematical formality (talk of Bayesian this and corollary that) confusing, and who would rather grasp the mechanics of the problem before trying to grasp the deeper mathematical issues that it presents. If you would like to better grasp those deeper issues then here's the thread for you: http://www.apbr.org/metrics/viewtopic.php?f=2&t=8239
EDIT:
I should mention that once you start varying the values of the priors you are approaching the realm of something called Bayesian Linear Regression, although you will often see this referred to as Ridge Regression as well. The two are very closely related, with Ridge Regression (zero priors) being a special case.
			
			
					Last edited by v-zero on Thu May 16, 2013 11:42 am, edited 2 times in total.
									
			
						
										
						Re: Demystifying Ridge Regression
If you set the lambda to zero, does RAPM just become basic APM?
			
			
									
						
										
						Re: Demystifying Ridge Regression
Yes, but some regression software won't allow zero weights (because that row becomes a null row).
			
			
									
						
										
						Re: Demystifying Ridge Regression
Thank you for that explanation, it has helped to clarify something I have always been baffled by. 
One question though. If the priors are all zero, does the regression become OLS?
			
			
									
						
										
Re: Demystifying Ridge Regression
No, as stated above that would only be the case if the weight (lambda) for each player was zero. Priors of zero would push player ratings towards a value of zero, just as priors of ten would push them towards ten. How much that push changes the values relative to APM depends on the value of lambda and on the number of possessions that player has played in the sample under regression.
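The three cases discussed above (lambda of zero, priors of zero, priors of ten) can be checked with a short Python sketch. The function name `ridge_with_priors` and the toy data are my own illustration, not part of any standard package. It simply solves the weighted regression with one dummy measurement per player, where each dummy has its own value (the prior) and weight (lambda).

```python
import numpy as np


def ridge_with_priors(X, y, w, priors, lam):
    """Weighted regression with one dummy measurement per player.

    priors: value of each player's dummy measurement.
    lam:    weight of each dummy measurement (scalar, or one value per player).
    """
    n_players = X.shape[1]
    lam = np.broadcast_to(np.asarray(lam, dtype=float), (n_players,))
    A = X.T @ (w[:, None] * X) + np.diag(lam)
    b = X.T @ (w * y) + lam * priors
    return np.linalg.solve(A, b)


# Hypothetical toy data, same +1/-1/0 coding as a basic APM regression.
rng = np.random.default_rng(1)
n_stints, n_players = 200, 6
X = rng.choice([-1.0, 0.0, 1.0], size=(n_stints, n_players))
y = X @ rng.normal(0.0, 2.0, size=n_players) + rng.normal(0.0, 10.0, size=n_stints)
w = rng.integers(1, 20, size=n_stints).astype(float)

zeros = np.zeros(n_players)
tens = np.full(n_players, 10.0)

apm = ridge_with_priors(X, y, w, zeros, 0.0)          # lambda = 0: plain APM
rapm = ridge_with_priors(X, y, w, zeros, 200.0)       # zero priors: pulled towards 0
near_prior = ridge_with_priors(X, y, w, tens, 1e9)    # huge lambda: ratings sit on the prior

print(np.linalg.norm(rapm) < np.linalg.norm(apm))
print(np.round(near_prior, 2))
```

With lambda at zero the dummy rows carry no weight and you get APM back; with zero priors and a positive lambda the ratings shrink towards zero; with an enormous lambda the data is effectively ignored and every rating lands on its prior.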
			
			
									
						
										
						Re: Demystifying Ridge Regression
Can you go into a little more detail on the process of cross-validation? Is it basically selecting multiple different samples of players and doing a regression for each sample? And if so, how does that impact how much weight the prior should be given?
Do my questions even make sense? haha
			
			
									
						
										
Re: Demystifying Ridge Regression
That's a fair question. Cross-validation is in fact very simple. You may have come across the term K-fold cross-validation. You have a sample of N measurements (in the case of RAPM that's N periods of 5-on-5 play), and you split that sample equally into K pieces. You leave one of those pieces out and perform your ridge regression on the other pieces together. You then use the result of that regression to predict the piece you left out (i.e. you insert the player values from the regression into the 5-on-5 periods you left out) and calculate the squared error of that prediction. You do this for each of your K pieces and sum all of the errors to reach a total error. This total error is what you want to minimise, and you vary lambda in order to do so. Each time lambda is varied the whole process must be repeated from the start, since lambda affects the regression (otherwise it wouldn't matter).
The natural limit of this is known as leave one out cross validation. This is when K = N, such that each time you do regression on all but one of your measurements (5-on-5 periods), and then add the error on that measurement you left out to your total.
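The procedure above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the data is a made-up toy set, the regression is unweighted with zero priors, the candidate lambdas are an arbitrary grid, and the names `ridge_fit`, `kfold_error` and `best_lambda` are hypothetical.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Ridge regression with zero priors and a single lambda for every player."""
    n_players = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_players), X.T @ y)


def kfold_error(X, y, lam, k, seed=0):
    """Total squared out-of-sample error for one lambda across k folds."""
    n = X.shape[0]
    order = np.random.default_rng(seed).permutation(n)
    total = 0.0
    for held_out in np.array_split(order, k):
        train = np.setdiff1d(order, held_out)       # every row not in the held-out piece
        beta = ridge_fit(X[train], y[train], lam)   # regress on the other pieces
        resid = y[held_out] - X[held_out] @ beta    # predict the piece left out
        total += float(resid @ resid)
    return total


def best_lambda(X, y, lambdas, k):
    """Rerun the whole k-fold procedure for each lambda and keep the lowest total error."""
    errors = [kfold_error(X, y, lam, k) for lam in lambdas]
    return lambdas[int(np.argmin(errors))], errors


# Hypothetical toy data: 300 stints, 8 players.
rng = np.random.default_rng(2)
X = rng.choice([-1.0, 0.0, 1.0], size=(300, 8))
y = X @ rng.normal(0.0, 2.0, size=8) + rng.normal(0.0, 10.0, size=300)

lambdas = [0.1, 1.0, 10.0, 100.0, 1000.0]
lam_10fold, errors_10fold = best_lambda(X, y, lambdas, k=10)
loo_error = kfold_error(X, y, lam_10fold, k=X.shape[0])  # leave-one-out is just K = N
print(lam_10fold, loo_error)
```

Using the same shuffled split for every lambda keeps the comparison fair, since each candidate is scored on exactly the same held-out pieces.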
			
			
									
						
										
Re: Demystifying Ridge Regression
You should charge for these explanations (not me, though) - they're very helpful and clear.
What would you do to determine an appropriate number of folds? Since it's an iterative process (thankfully not a manual one), are there calculation speed concerns with increasing k too high? And are there any other issues to consider as you get closer and closer to leave-one-out cross validation? Also, is there any use for some kind of j x k cross validation, where you would do j cross validations involving k folds in order to get different sample splits?
Thanks for your help in understanding this stuff.
			
			
									
						
										
Re: Demystifying Ridge Regression
I'm glad to help, I feel that there's a sort of blind faith required to believe in a technique like ridge regression if you don't really quite grasp what it does. Blind faith is the enemy of good judgement and scientific reasoning, so I'd like to reduce the split in this forum/in the basketball metrics community between those who 'get it', and those who 'sort of get it/don't get it'.
Back to your question: more folds is always better, so leave one out is the ultimate form, but a way to know how many folds is 'good enough' is to start with, say, five, and find the ideal lambda for that, then increase that to six or seven and find the ideal lambda for that, then increase it to, say, ten and find lambda for that. You keep increasing it until lambda seems to settle (which you can define as some smallish percentage, e.g. a 2% change in value from one lambda to the next). Mathematically you would say that as you increase the number of folds the value of lambda will tend to the value at the full N folds, and how quickly that number converges will depend on your sample, both its character (inherent noise/variance) and size.
As for computational issues - yes, with large samples leave one out can be extremely time consuming, but it is usually entirely pointless. For data such as an NBA season of 5-on-5 periods ten folds is fine. There are no other real issues with leave one out cross validation, but as I suggest it is generally overkill.
Lastly no, there's no real point in doing multiple K-fold cross validations, as it will always be more computationally efficient to simply increase the value of K with the same effect.
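For anyone who wants to try the 'increase the folds until lambda settles' recipe described above, here is a rough Python sketch. Everything in it is an assumption made for illustration: the toy data, the unweighted zero-prior ridge, the grid of candidate lambdas, the list of fold counts and the function names. Because lambda is picked from a grid here, 'settled' usually means the same grid point wins twice in a row.

```python
import numpy as np


def cv_error(X, y, lam, k, seed=0):
    """Total squared out-of-sample error of ridge regression over k folds."""
    n, p = X.shape
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    total = 0.0
    for held_out in folds:
        mask = np.ones(n, dtype=bool)
        mask[held_out] = False                      # training rows only
        Xt, yt = X[mask], y[mask]
        beta = np.linalg.solve(Xt.T @ Xt + lam * np.eye(p), Xt.T @ yt)
        total += float(np.sum((y[held_out] - X[held_out] @ beta) ** 2))
    return total


def lambda_by_fold_count(X, y, lambdas, fold_counts, tol=0.02):
    """Raise the fold count until the chosen lambda moves by no more than tol (2%)."""
    history = []
    for k in fold_counts:
        errors = [cv_error(X, y, lam, k) for lam in lambdas]
        lam_k = float(lambdas[int(np.argmin(errors))])
        settled = bool(history) and abs(lam_k - history[-1][1]) <= tol * history[-1][1]
        history.append((k, lam_k))
        if settled:
            break
    return history


# Hypothetical toy data: 300 stints, 8 players.
rng = np.random.default_rng(3)
X = rng.choice([-1.0, 0.0, 1.0], size=(300, 8))
y = X @ rng.normal(0.0, 2.0, size=8) + rng.normal(0.0, 10.0, size=300)

lambdas = np.geomspace(1.0, 1000.0, 25)
fold_counts = [5, 7, 10, 15, 20]
history = lambda_by_fold_count(X, y, lambdas, fold_counts)
for k, lam in history:
    print(k, round(lam, 2))
```

The last entry of `history` is the fold count at which the search stopped and the lambda it chose there.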
			
			
									
						
										
AcrossTheCourt (Posts: 237, Joined: Sat Feb 16, 2013 11:56 am)
Re: Demystifying Ridge Regression
I actually started doing adjusted plus/minus last night for the first time using basketballvalue's data. I do want to learn how to do RAPM. I've done multiple linear, weighted, step regression, autocorrelation, and nonlinear, but this flavor seems interesting. 
 
In doing numerical work there's usually a rule of thumb people give you, like stop when the change is less than 1%. And ten folds is roughly around there for a full season? What about multiple seasons?
From what I've read about RAPM, it seems like one of the biggest issues is what to do with rookies. Is there a best method yet?
 
One thing I couldn't find online about regular adjusted plus/minus is the best way to separate defense/offense. I think I know how, but I want to know what the established techniques are. You look at the home players versus the away players where home is on offense and compare it to what ... the league average offensive efficiency?
			
			
									
						
										
Re: Demystifying Ridge Regression
Ten folds isn't quite at the 1% level, but it is adequate considering the level of noise present in 5-on-5 data. That is, you'd be arguing over inches when your problem is one of approximating feet. For multiple seasons you should find that lambda is similar and that ten folds is similarly adequate, although I have not done extensive testing on this, for the reason I'm about to mention. I don't advise anybody to do multi-year RAPM, because the computational size of the problem grows to the point where it demands a bucketload of memory, and the rewards aren't there. Multi-year RAPM has a big problem: players aren't the same from one year to the next, so the regression will have issues with letting old players 'die' and new players 'flourish'. No, it is best to stick to the daisy-chain style of using year one as priors for year two, then that as priors for year three, etc. Even that has large numerical issues (Westbrook being assigned Durant's improvement, for instance, since his appearance coincided with it). There simply is no quick, easy hack to fix the RAPM method.
As for rookies, the best simple way to deal with them would be to assign them all a strongly negative prior (-3 on O, -3 on D is a decent rule of thumb for quick RAPM estimates).
You have it right, you break down the 5-on-5 into offence vs defence situations (so your dependent variable is how many points the team on offence scored, and your explanatory variables are your five guys on offence against your five guys on defence - so each player in the regression must have an offence and defence variable). However, all we care about is the marginal value (value above average) of these players, so in order to only have player variables as marginal you must include an intercept term in your regression (as well as an HCA term) in order to swallow up the average efficiency. There are actually very sound numerical reasons for including an intercept in this particular method, but I won't get into that. Suffice to say that introducing an intercept is the way to go.
There is another way to decompose player ratings into offence and defence which I have come to prefer, but the above method is standard and the most obvious, so I advise you stick to that.
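As a rough illustration of the standard offence/defence setup described above, here is a Python sketch that builds the design matrix with an intercept, an HCA column, and separate offence and defence columns for each player. The stint data is randomly generated, and several choices are my own assumptions rather than anything established: defenders are coded as -1 so that a positive defensive rating means points prevented, the dependent variable is points per 100 possessions, rows are weighted by possessions, and lambda is applied only to the player columns (not to the intercept or HCA term).

```python
import numpy as np


def build_design(stints, n_players):
    """Columns: [intercept, HCA, offence rating per player, defence rating per player]."""
    n = len(stints)
    X = np.zeros((n, 2 + 2 * n_players))
    y = np.zeros(n)
    w = np.zeros(n)
    for i, (offence, defence, offence_is_home, points, possessions) in enumerate(stints):
        X[i, 0] = 1.0                              # intercept swallows the average efficiency
        X[i, 1] = 1.0 if offence_is_home else 0.0  # home court advantage term
        for p in offence:
            X[i, 2 + p] = 1.0                      # the five guys on offence
        for p in defence:
            X[i, 2 + n_players + p] = -1.0         # the five guys on defence (assumed sign)
        y[i] = 100.0 * points / possessions        # points scored by the offence per 100
        w[i] = possessions
    return X, y, w


def fit_rapm(X, y, w, lam):
    """Weighted ridge with zero priors; the intercept and HCA columns are left unpenalised."""
    penalty = np.zeros(X.shape[1])
    penalty[2:] = lam
    A = X.T @ (w[:, None] * X) + np.diag(penalty)
    return np.linalg.solve(A, X.T @ (w * y))


# Hypothetical random stints: 12 players, 5 on offence and 5 on defence each time.
rng = np.random.default_rng(4)
n_players = 12
stints = []
for _ in range(300):
    on_court = rng.permutation(n_players)[:10]
    offence_is_home = bool(rng.integers(0, 2))
    possessions = int(rng.integers(1, 15))
    points = int(rng.integers(0, 2 * possessions + 1))
    stints.append((on_court[:5], on_court[5:], offence_is_home, points, possessions))

X, y, w = build_design(stints, n_players)
coef = fit_rapm(X, y, w, lam=100.0)
intercept, hca = coef[0], coef[1]
offence_ratings = coef[2:2 + n_players]
defence_ratings = coef[2 + n_players:]
print(round(intercept, 1), round(hca, 1))
```

Because the intercept absorbs the average efficiency, the player coefficients come out as marginal values, which is what the ratings are meant to be.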
			
			
									
						
										
Re: Demystifying Ridge Regression
FWIW, I'm taking the Coursera Machine Learning class right now, and this week happens to be the section on regularization.
			
			
									
						
										
						Re: Demystifying Ridge Regression
Well, feel free to post here if anything interesting/odd comes up. I'm curious to know its content, but not curious enough to shell out/spend time on it.
			
			
									
						
										
						
Re: Demystifying Ridge Regression
Uh, it's free. I guess time is money, but other than that.
			
			
									
						
										
						Re: Demystifying Ridge Regression
Thought Coursera courses were generally paid, thanks for educating me.