Re: Early 2013-14 stat-based observations
Posted: Sun Jan 19, 2014 5:35 pm
From talkingpractice's website:
"Individual Player Value (“IPV”) is a stabilized in-season RAPM model which uses a robust machine learning based SPM metric (“FORPM”) as a prior for RAPM. There is no previous season information used, to put it on par with other in-season metrics such as NPI RAPM, PER, WS, or EZPM. The choice of a FORPM metric as prior (using an ensemble consisting of random forest regressions and gradient boosting), rather than a traditional SPM metric, was made in part to eliminate discretion in variable selection, with the goal of making IPV a pure metric. In addition, a properly specified FORPM model (fit to SRS) performs much better out of sample than more plain vanilla regression-based models (especially with regard to ‘defense’). Due to not using any previous year info or an aging/experience curve, these values should be considered as descriptive more so than as predictive. The model is based on ‘basketball’, and not on ‘offense’ nor on ‘defense’, and as such there is only one coefficient for each player. This is again both for purity of the metric, and due to this approach being more predictive out of sample. Individual Player Values here are not meant to imply player rankings, nor are they meant to imply that they are the players value if he were to be traded, or have his role changed on his team."
I looked up random forests and saw that they are considered a very helpful approach overall. I did see this caveat on the Wikipedia page:
"This method of determining variable importance has some drawbacks. For data including categorical variables with different number of levels, random forests are biased in favor of those attributes with more levels. Methods such as partial permutations can be used to solve the problem. If the data contain groups of correlated features of similar relevance for the output, then smaller groups are favored over larger groups." I wondered how this would affect your application. Would attributes "with more levels" be scoring, rebounding and assists? Would "groups of correlated features of similar relevance for the output" be FGA or usage, FG%, FTA, 3 pt FGA?
Is there any good reason to use or not use random forest regressions for the RAPM model?
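To make the Wikipedia caveat about correlated features concrete, here is a toy sketch I put together. Everything in it is made up: the data are random numbers, and the labels ("solo_stat", "grouped_stat_1", etc.) are hypothetical stand-ins, not real box score columns or anything from the FORPM model. One stat stands alone and three stats are noisy copies of a single hidden driver, with both sources mattering equally for the target. It assumes scikit-learn is installed and compares the forest's built-in importance with permutation importance.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 400

# Hypothetical setup: one stand-alone stat, plus three stats that are
# noisy copies of the same hidden driver (a correlated group).
solo = rng.normal(size=n)
latent = rng.normal(size=n)
group = latent[:, None] + 0.3 * rng.normal(size=(n, 3))

X = np.column_stack([solo, group])
# The stand-alone stat and the hidden driver contribute equally to the target.
y = solo + latent + 0.1 * rng.normal(size=n)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Built-in (impurity-based) importance; scikit-learn normalizes it to sum to 1.
impurity = rf.feature_importances_

# Permutation importance: drop in score when one column is shuffled.
perm = permutation_importance(
    rf, X, y, n_repeats=10, random_state=0
).importances_mean

names = ["solo_stat", "grouped_stat_1", "grouped_stat_2", "grouped_stat_3"]
for name, imp, pm in zip(names, impurity, perm):
    print(f"{name:15s} impurity={imp:.3f} permutation={pm:.3f}")
```

In a setup like this the credit for the hidden driver gets divided among the three correlated stats, so each one looks less important on its own than the stand-alone stat does. Whether anything like that happens with real box score inputs is exactly what I am asking.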
"The model is based on ‘basketball’, and not on ‘offense’ nor on ‘defense’, and as such there is only one coefficient for each player."
This statement seems important to highlight. It is consistent with Mike G's perspective that offensive stats should be seen in the context of a player's / team's defensive stats, and with some material I have seen about the impact of prior actions on subsequent plays. As much as I have been interested in offensive/defensive splits of RAPM, if a random forest regression for an overall RAPM model had noticeably less estimated error than the separate parts, I would want to keep that in mind.
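For anyone following along, my reading of "SPM as a prior for RAPM" with one coefficient per player is a ridge regression whose penalty pulls each player toward his SPM estimate instead of toward zero. The sketch below is only that reading, with invented numbers; the function name `rapm_with_prior`, the stint matrix, the margins and the prior values are all my own assumptions, not talkingpractice's code or data.

```python
import numpy as np

def rapm_with_prior(X, y, prior, lam):
    """Ridge regression that shrinks toward `prior` instead of toward zero.

    Minimizes ||y - X b||^2 + lam * ||b - prior||^2, one coefficient per player.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    prior = np.asarray(prior, dtype=float)
    resid = y - X @ prior                      # what the prior fails to explain
    A = X.T @ X + lam * np.eye(X.shape[1])
    return prior + np.linalg.solve(A, X.T @ resid)

# Hypothetical stints for three players: +1 on one side, -1 on the other, 0 off court.
X = np.array([
    [1, -1,  0],
    [1,  0, -1],
    [0,  1, -1],
    [1,  1, -1],
], dtype=float)
y = np.array([4.0, 6.0, 1.0, 5.0])     # made-up point margins per stint
prior = np.array([2.0, 0.0, -2.0])     # made-up SPM-style prior values

for lam in (0.0, 5.0, 1e9):
    print(lam, np.round(rapm_with_prior(X, y, prior, lam), 2))
```

With no penalty this is plain least squares, and with a very large penalty the estimates just sit on the prior, so the choice of penalty decides how much the box score prior and the plus-minus data each get to say.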
I am stretching from what I quickly read and by no means fully understand, but is there any added value in using a random forest regression (SPM or RAPM) to find player nearest neighbors / "similars"? Does anyone else who has worked on player similarity models have opinions or questions about this?
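To show what I have in mind by "similars", here is a rough sketch of one common way to get a similarity measure out of a random forest: count how often two rows end up in the same leaf across the trees. The data are random numbers standing in for player stat lines, the helper `similars` is a name I made up, and it assumes scikit-learn is installed. I am not claiming this is how any existing similarity model works.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical data: 60 "players" with 5 made-up stats and a made-up target rating.
X = rng.normal(size=(60, 5))
y = 2.0 * X[:, 0] + X[:, 1] + 0.2 * rng.normal(size=60)

rf = RandomForestRegressor(
    n_estimators=200, min_samples_leaf=3, random_state=0
).fit(X, y)

# apply() returns, for each row, the index of the leaf it lands in for each tree.
leaves = rf.apply(X)                                   # shape (60, 200)

# Proximity: fraction of trees in which two rows share a leaf.
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

def similars(i, k=3):
    """Return the k rows with the highest proximity to row i, excluding i itself."""
    order = np.argsort(-prox[i])
    return [int(j) for j in order if j != i][:k]

print(similars(0))
```

The appeal, if it works, is that two players count as similar only along the stats the forest actually found useful for predicting the target, rather than along every column equally.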
Are there any other statistical outputs generated by random forest regression runs that seem important besides the overall SPM (or RAPM) outputs?
This article looked somewhat interesting. Is a basketball player or his stats a "deformable object"?
http://www.google.com/url?sa=t&rct=j&q= ... 8121,d.cWc
There was some work with "facial mapping" awhile ago. http://www.countthebasket.com/blog/2008 ... off-faces/