Seeking Advice on Data, Validation, and Best Practices for Building an SPM
Posted: Mon Nov 25, 2024 7:53 am
Hi everyone,
I'm thinking of extending my approach to build an SPM (statistical plus-minus) for the NBA and for the game Basketball GM.
The model I’m working on uses **RAPM from the JE dataset** as a foundation, aiming to combine its stability with box score and potentially tracking data to create a metric that’s both predictive and interpretable. For BBGM, I was able to create a fork of the game that calculates accurate RAPM values for each player; if you'd like to see the fork's RAPM code, I'm happy to share it. I’d appreciate your thoughts on a few key questions:
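For context on what I mean by the RAPM calculation: the core is just ridge regression on stint-level point margins. A minimal sketch of that general setup (toy data and illustrative names, not my actual fork code):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy stint data: one row per stint, one column per player.
# In a real design matrix each row has exactly five +1s (home players
# on court) and five -1s (away players); this random version just
# illustrates the shapes.
rng = np.random.default_rng(0)
n_stints, n_players = 1000, 50
X = rng.choice([-1, 0, 1], size=(n_stints, n_players), p=[0.1, 0.8, 0.1])
y = rng.normal(0, 10, size=n_stints)    # home margin per 100 possessions
w = rng.integers(5, 30, size=n_stints)  # stint possessions as weights

# Ridge shrinks low-minute players toward 0; alpha is a tuning knob
# (the value here is arbitrary, not something I've validated).
rapm = Ridge(alpha=2000.0).fit(X, y, sample_weight=w).coef_
```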
---
1. Data Sources
I’m currently using box score data and team ratings but would like to incorporate richer tracking or contextual data. For example:
- Shot location data, defensive contests, off-ball movement
- Screen assists, rim protection metrics, and lineup combinations
- Team offensive/defensive ratings, adjusted for possession context
- Play-by-play data
I’ve explored options like `nba_api`, **PBP Stats**, and **Basketball-Reference**, but I know many of you have far more experience sourcing robust datasets. What are your recommendations for tracking or contextual data sources, especially for defensive and off-ball metrics? If anyone has insights into preprocessing or transforming this data for model inputs, I’d love to hear about that too.
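In case it helps to see what I've got working so far, pulling play-by-play with `nba_api` looks roughly like this (the game ID is a placeholder; real IDs would come from a schedule or scoreboard endpoint):

```python
from nba_api.stats.endpoints import playbyplayv2

# Fetch raw play-by-play for one game (requires network access).
pbp = playbyplayv2.PlayByPlayV2(game_id="0022300001")  # placeholder ID
df = pbp.get_data_frames()[0]

# Event rows carry type codes and clock strings; reconstructing lineups
# and stints from these is the preprocessing step I'm asking about.
print(df[["EVENTNUM", "EVENTMSGTYPE", "PCTIMESTRING"]].head())
```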
---
2. Best Window for RAPM
In terms of RAPM, I’ve been debating the best number of years to include. My current thinking is either a 3-year or a 5-year RAPM.
What has worked best for you? Is there an optimal balance here for projects like this?
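One variant I've been toying with, instead of a hard 3-vs-5-year cutoff, is a longer window with recency decay, i.e. downweighting older seasons in the ridge fit. A sketch of that idea (the half-life value is a guess, not something I've tuned):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy multi-season stint data, same layout as the basic RAPM matrix.
rng = np.random.default_rng(1)
n_stints, n_players = 3000, 60
X = rng.choice([-1, 0, 1], size=(n_stints, n_players), p=[0.1, 0.8, 0.1])
y = rng.normal(0, 10, size=n_stints)
poss = rng.integers(5, 30, size=n_stints)
season = rng.integers(2020, 2025, size=n_stints)  # season of each stint

# Exponential recency decay: a stint `half_life` seasons old counts
# half as much as a current-season stint.
half_life = 1.5
w = poss * 0.5 ** ((2024 - season) / half_life)

rapm = Ridge(alpha=2000.0).fit(X, y, sample_weight=w).coef_
```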
---
3. Playoff Data
Do you include playoff games in your RAPM calculations, or do you stick strictly to regular-season data? I’m torn because playoff games are high-leverage, but the small sample might overweight individual performances.
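One compromise I've been considering rather than a strict include/exclude: keep playoff stints but downweight them, so the high-leverage signal is there without a deep playoff run swamping the regular season. Trivial sketch (the 0.5 factor is purely a guess):

```python
import numpy as np

# poss: possessions per stint; is_playoff flags playoff stints.
poss = np.array([20, 15, 25, 18])
is_playoff = np.array([False, False, True, True])
w = poss * np.where(is_playoff, 0.5, 1.0)  # downweight playoff stints
# ...then pass w as sample_weight to the ridge fit.
```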
---
4. Validation Set and Splits
I’m also trying to establish a solid validation framework. How do you typically construct your validation set?
- Do you use the following season (e.g., RAPM from the next year) as your validation data?
- What split of training, validation, and test data has worked best for you?
- Any advice on balancing in-season predictive accuracy with cross-season generalizability?
This part has been especially tricky for me since I want to ensure the model doesn’t overfit historical RAPM and remains predictive for unseen data.
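What I currently have in mind is a purely chronological scheme: fit on all seasons strictly before a validation season, score against that season's RAPM, and roll forward. A sketch (season cutoffs and column names are illustrative):

```python
import pandas as pd

def rolling_season_splits(df, first_val, last_val):
    """Yield (season, train, val) where training uses only seasons
    strictly before the validation season -- no future data leaks in."""
    for val_season in range(first_val, last_val + 1):
        yield (val_season,
               df[df["season"] < val_season],
               df[df["season"] == val_season])

# df would hold one row per player-season: features plus the target
# (e.g., that season's RAPM). Placeholder data just to show the flow.
df = pd.DataFrame({"season": [2016, 2017, 2018, 2019],
                   "rapm":   [1.2, 0.8, 2.1, -0.5]})
for season, train, val in rolling_season_splits(df, 2018, 2019):
    print(season, len(train), len(val))
```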
---
5. Feature Engineering and Weighting
Finally, I’m exploring how best to engineer features and weight observations. For instance:
- Are there particular features (e.g., interaction terms) that have consistently proven valuable?
- How do you handle outliers or weight contributions from players with limited minutes? (A sketch of what I mean follows this list.)
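On the limited-minutes point, the two simplest levers I know of are weighting observations by minutes in the fit, and shrinking low-minute rate stats toward the league mean before they enter the model. A sketch of the latter (the pseudo-minutes constant is a knob, not a known-good value):

```python
import numpy as np

def shrink_to_mean(rate, minutes, league_mean, pseudo_minutes=500):
    """Shrink rate stats toward the league mean: low-minute players
    are pulled most of the way there, high-minute players mostly
    keep their observed rate."""
    rate = np.asarray(rate, dtype=float)
    minutes = np.asarray(minutes, dtype=float)
    w = minutes / (minutes + pseudo_minutes)
    return w * rate + (1 - w) * league_mean

# A 100-minute player barely moves off the mean; a 2000-minute
# player's own rate dominates.
print(shrink_to_mean([0.45, 0.60], [100, 2000], league_mean=0.50))
```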
---
6. Adding Priors
I’ve been considering adding a prior to the RAPM estimates to stabilize the model for players with limited data (e.g., low minutes or few possessions).
How do you decide on an appropriate prior, and what’s worked best for you?
Should the prior be based on league averages, positional averages, or something else entirely (e.g., aging curves for veteran players)?
If you have experience using priors to improve player evaluation, I’d love to hear how it’s impacted your results.
Also, I'm thinking of using this SPM as a prior in a single season's RAPM calculation, as was done in xRAPM and EPM. How did you go about doing this? EPM's write-up mentions something called a Bayesian prior; how is that different from a normal prior in the RAPM calculation?
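To make that concrete, my understanding of the xRAPM/EPM-style mechanism is that plain ridge shrinks every coefficient toward zero, and an informative prior simply moves the shrinkage target: subtract each lineup's prior-implied contribution from the stint margin, regress the residual, and add the prior back. A sketch of that interpretation (not taken from either metric's actual code):

```python
import numpy as np
from sklearn.linear_model import Ridge

def rapm_with_prior(X, y, prior, alpha=2000.0, sample_weight=None):
    """Ridge RAPM shrunk toward `prior` (per-player, per-100) instead
    of toward zero. With prior = 0 this reduces to plain RAPM."""
    residual = y - X @ prior  # margin the prior doesn't explain
    fit = Ridge(alpha=alpha).fit(X, residual, sample_weight=sample_weight)
    return prior + fit.coef_

# Toy check: low-minute players land near their prior (e.g., an SPM
# estimate); players with lots of on/off data pull away from it.
rng = np.random.default_rng(2)
X = rng.choice([-1, 0, 1], size=(800, 40), p=[0.1, 0.8, 0.1])
prior = rng.normal(0, 2, size=40)           # stand-in for SPM values
y = X @ prior + rng.normal(0, 8, size=800)  # synthetic margins
print(rapm_with_prior(X, y, prior)[:5])
```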
---
7. Splitting RAPM into Offensive and Defensive Components
I’m also curious about the best approach to handling offensive and defensive RAPM. Is it better to create a single overall RAPM metric first and then split it into offensive and defensive components, or should offensive and defensive RAPM be modeled separately and then summed to form an overall metric? I’d love to hear about any trade-offs you’ve encountered with these approaches, particularly in terms of interpretability, stability, or validation results.
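For reference, the "model separately" option as I understand it uses one row per possession with two coefficient blocks: +1 in the offense block for the five offensive players, +1 in the defense block for the five defenders, and points scored on the possession as the target. A sketch of the design-matrix construction (toy data, per-100 scaling omitted):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
n_poss, n_players = 5000, 40
X_off = np.zeros((n_poss, n_players))
X_def = np.zeros((n_poss, n_players))
for i in range(n_poss):
    off = rng.choice(n_players, size=5, replace=False)
    rest = np.setdiff1d(np.arange(n_players), off)
    X_off[i, off] = 1
    X_def[i, rng.choice(rest, size=5, replace=False)] = 1

X = np.hstack([X_off, X_def])
y = rng.choice([0.0, 2.0, 3.0], size=n_poss, p=[0.5, 0.35, 0.15])

coefs = Ridge(alpha=2000.0).fit(X, y).coef_
orapm = coefs[:n_players]   # points added on offense
drapm = coefs[n_players:]   # points allowed on defense (flip the sign
                            # if you want higher = better on one scale)
```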
---
Thanks in advance for any advice or insights you can offer. I’m excited to hear how others have approached similar challenges. Looking forward to learning from everyone here!