Creating an SPM
Posted: Sat Jul 20, 2024 12:21 pm
Hey all,
I'm currently creating a Statistical Plus-Minus (SPM) model for personal use. I plan on building different variants of the model, starting from raw box score data (for use in the game Basketball GM) and eventually incorporating play-by-play data and other advanced metrics. Right now, I have some questions on how to improve model performance.
My current iteration uses raw box score data, and I am regressing players' 2-year RAPM on their average stats, covering the 1998 season through 2024 (data provided by JE). My current R², computed on the entire dataset, is approximately 0.6. Here are my questions:
Length of RAPM: What RAPM window length (e.g., 2-year, 3-year) have you found works best as the target when building SPM models?
Averaging Stats: Is using a player's average stats across a period a promising approach for the X variables in the regression?
Cross-Validation: I'm using scikit-learn's RidgeCV to produce my results. Should I hold out part of my data when training the model to improve performance?
Normalization: Does normalizing data every two years improve model performance?
Cutoff Threshold: Since my basic version is meant to work in a game and I'm only concerned with the best seasons, should I limit my regression to players above a specific RAPM cutoff?
Non-linear Variables: From reading the methodologies that other models have published online, it seems non-linear variables could improve performance. What non-linear stats or methods have you found to be most effective?
Play-by-Play Data: When I try to tackle a more advanced SPM model, where can I find publicly available play-by-play data?
If there are any other things I could improve on that I didn't mention, please let me know.
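For reference, here is a minimal sketch of the kind of setup I'm describing, with a hold-out split added so the cross-validation question is concrete. It is not my actual code: the data is synthetic (random features standing in for averaged box score stats, a noisy linear target standing in for RAPM), and the player IDs, feature count, alpha grid, and 20% hold-out are all assumptions made for illustration. The split is done by player rather than by row, so the same player never appears in both the training and test sets.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import GroupShuffleSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 600 player-stint rows, 8 features.
n_rows, n_features = 600, 8
X = rng.normal(size=(n_rows, n_features))           # fake averaged box score stats
true_coefs = rng.normal(size=n_features)
y = X @ true_coefs + rng.normal(scale=1.0, size=n_rows)  # fake "RAPM" target

# Hypothetical player IDs; a player can appear in several rows.
player_ids = rng.integers(0, 150, size=n_rows)

# Hold out roughly 20% of players (not rows) as a test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=player_ids))

# Standardize features, then let RidgeCV pick alpha from a log-spaced grid.
model = make_pipeline(
    StandardScaler(),
    RidgeCV(alphas=np.logspace(-3, 3, 13)),
)
model.fit(X[train_idx], y[train_idx])

r2_train = model.score(X[train_idx], y[train_idx])
r2_test = model.score(X[test_idx], y[test_idx])
chosen_alpha = model.named_steps["ridgecv"].alpha_

print(f"alpha={chosen_alpha:.4g}  train R2={r2_train:.3f}  test R2={r2_test:.3f}")
```

The R² of about 0.6 I quoted above is the equivalent of `r2_train` here; the held-out `r2_test` is the number I don't currently have.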