APBRmetrics

Posted: **Thu Dec 17, 2020 1:57 pm**

So I was wondering: what do you guys think is the advantage of all these player-specific, 1-dimensional PM-style impact metrics? (See for example the list at www.bball-index.com.)

My opinion is that any prediction should be done in the full-dimensional space of your features, because that captures all the big and small correlations and interactions between the different statistics. Sure, you also include the noise of useless features, but with a big enough dataset and a robust classifier/regressor you should be able to get better results than by classifying/regressing in low-dimensional spaces.
The other issue is that a single-dimensional feature, I think, does not capture interactions between players and between teams (i.e. teammates and opponents).

The advantage I see is only if you want to use a basic classifier (like logistic regression) or if you want to rank players (for whatever reason).

Still, I would like to try to evaluate game win/loss classification accuracy by taking a bunch of these 1-dimensional impact metrics for each player, averaging them over the team, and seeing how well they work compared to the full space.
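A minimal sketch of that baseline, assuming each game is stored as the home players' metric values, the away players' values, and whether the home team won. The function names and the numbers are made up for illustration; they are not real ratings or results.

```python
from statistics import mean

def predict_home_win(home_metrics, away_metrics):
    """Predict a home win when the home team's average impact metric is higher."""
    return mean(home_metrics) > mean(away_metrics)

def accuracy(games):
    """games: list of (home_metrics, away_metrics, home_won) tuples."""
    hits = sum(predict_home_win(h, a) == won for h, a, won in games)
    return hits / len(games)

# Invented per-player impact values, three players per team.
games = [
    ([2.0, 1.0, -0.5], [0.5, 0.0, -1.0], True),
    ([-1.0, 0.5, 0.0], [1.5, 1.0, 0.5], False),
    ([0.5, 0.5, 0.5], [1.0, 0.0, 0.0], False),
]

baseline_accuracy = accuracy(games)
print(baseline_accuracy)  # 2 of the 3 toy games are predicted correctly
```

The same `accuracy` number computed for a model trained on the full feature space would give the comparison described above.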

I would really like to hear your opinions. Maybe I am missing something.

Posted: **Fri Dec 18, 2020 12:42 am**

To enhance understanding by more people, please define the 1-dimension. Is it the overall impact vector? Is offensive and defensive impact still 1-dimension or two? Are RAPM factors 4 dimensions? Would RAPM factor split for estimated teammate and opponent impacts be 8 dimensions or what? Would RAPM factors split for personal boxscore recognized impacts and all other impacts be 8 dimensions? Would that and teammate / opponent influenced splits of these be 16 dimensions?

I'd like to understand. Help appreciated. With definitions and more narrative.

Posted: **Fri Dec 18, 2020 2:48 am**

In my experience, higher-dimensional stuff can often be better simplified into categorical roles/variables (like 3/D player), which then interact with a single (or a couple of) dimensional impact number, vs. having all of the numbers, which can be overwhelming in a lot of situations.

It's a bit easier to weigh a couple-dimensional statistical analysis against a scout's general impression too imo, which is certainly useful for comparing guys in different leagues and whatnot.

Example.

Were I to describe (healthy) Klay Thompson in my grading system I might say: +2/+1 3/Man Wing

+2 - an offensive impact number
+1 - a defensive impact number
3 - a description of offensive role
Man - a description of defensive role
Wing - a positional descriptor

Those categorical descriptors carry with them a lot of data that it's a lot easier for a person to interact with in a simplified form.
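As a rough illustration, a grade like the one above could be held in a small record with two numeric fields and three categorical ones. The class name, field names and parser below are hypothetical, not part of any existing grading tool.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PlayerGrade:
    offense: int          # offensive impact number, e.g. +2
    defense: int          # defensive impact number, e.g. +1
    offensive_role: str   # e.g. "3"
    defensive_role: str   # e.g. "Man"
    position: str         # e.g. "Wing"

    def __str__(self):
        return (f"{self.offense:+d}/{self.defense:+d} "
                f"{self.offensive_role}/{self.defensive_role} {self.position}")

def parse_grade(text):
    """Parse a string such as '+2/+1 3/Man Wing' into a PlayerGrade."""
    impact, roles, position = text.split()
    offense, defense = impact.split("/")
    offensive_role, defensive_role = roles.split("/")
    return PlayerGrade(int(offense), int(defense),
                       offensive_role, defensive_role, position)

klay = parse_grade("+2/+1 3/Man Wing")
print(klay)  # +2/+1 3/Man Wing
```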

Posted: **Fri Dec 18, 2020 12:04 pm**

This discussion is a bit above my pay grade, but...

Isn't the actual objective only 1 dimension? Point differential? And all other dimensions that can be used simply proxies for different "paths" to achieve that point differential? And the only direct way to measure a player's contribution (some form of RAPM) is only 1 dimensional, point differential?

My sense is that by adding secondary dimensions there is a likelihood of introducing bias.

Posted: **Fri Dec 18, 2020 4:08 pm**

eminence wrote: ↑Fri Dec 18, 2020 2:48 am

Those categorical descriptors carry with them a lot of data that it's a lot easier for a person to interact with in a simplified form.

My thoughts exactly....

Posted: **Fri Dec 18, 2020 4:19 pm**

DSMok1 wrote: ↑Fri Dec 18, 2020 12:04 pm This discussion is a bit above my pay grade, but...

Isn't the actual objective only 1 dimension? Point differential? And all other dimensions that can be used simply proxies for different "paths" to achieve that point differential? And the only direct way to measure a player's contribution (some form of RAPM) is only 1 dimensional, point differential?

My sense is that by adding secondary dimensions there is a likelihood of introducing bias.

Now we are confusing input dimensions (many) with output dimensions (most often one, unless you do multivariate regression).

Imagine the following. You have two teams of 3 players each. Each player has 100 stats. And let's take the average of the 3 players to be the team's stats. Let's assume we want to predict the score differential. So we can train a regression algorithm by concatenating our data:

observation 1: [100 stats of Team A, 100 stats of Team B] ---> point spread
observation 2: [100 stats of Team A, 100 stats of Team B] ---> point spread
...
observation 1000: [100 stats of Team A, 100 stats of Team B] ---> point spread

That's 1000 games worth of training data.

Now assume the counterexample where we use some 1-dimensional metric (like RAPM), and also assume that it makes sense to average this over the team or, alternatively, assuming that the teams have an equal number of players, that we can just concatenate the values. So we have:

observation 1: [RAPM_A1, RAPM_A2, RAPM_A3, RAPM_B1, RAPM_B2, RAPM_B3] ---> point spread
observation 2: [RAPM_A1, RAPM_A2, RAPM_A3, RAPM_B1, RAPM_B2, RAPM_B3] ---> point spread
...
observation 1000: [RAPM_A1, RAPM_A2, RAPM_A3, RAPM_B1, RAPM_B2, RAPM_B3] ---> point spread

In your opinion, and assuming all else being equal, which of the above two methods should perform better?
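For concreteness, here is a small sketch of how the two feature vectors above could be assembled for a single game. The stats are random placeholders, and `toy_rapm` is a hypothetical stand-in for a real 1-dimensional metric (it is not how RAPM is computed); the point is only the shape of the inputs.

```python
import random

random.seed(0)
N_STATS = 100  # stats per player, as in the example above

def random_player():
    # Placeholder numbers standing in for a player's 100 raw stats.
    return [random.gauss(0, 1) for _ in range(N_STATS)]

def team_average(players):
    # Element-wise mean of the players' stat vectors.
    return [sum(column) / len(column) for column in zip(*players)]

def toy_rapm(player):
    # Hypothetical 1-dimensional summary: just the mean of the raw stats.
    return sum(player) / len(player)

team_a = [random_player() for _ in range(3)]
team_b = [random_player() for _ in range(3)]

# Method 1: averaged raw stats of each team, concatenated.
full_features = team_average(team_a) + team_average(team_b)

# Method 2: one summary number per player, concatenated.
rapm_features = [toy_rapm(p) for p in team_a + team_b]

print(len(full_features), len(rapm_features))  # 200 6
```

Each game then contributes one such vector plus its point spread as the label, so the first method fits 200 inputs and the second only 6.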

Posted: **Fri Dec 18, 2020 8:45 pm**

Crow wrote: ↑Fri Dec 18, 2020 12:42 am To enhance understanding by more people, please define the 1-dimension. Is it the overall impact vector? Is offensive and defensive impact still 1-dimension or two? Are RAPM factors 4 dimensions? Would RAPM factor split for estimated teammate and opponent impacts be 8 dimensions or what? Would RAPM factors split for personal boxscore recognized impacts and all other impacts be 8 dimensions? Would that and teammate / opponent influenced splits of these be 16 dimensions?

I'd like to understand. Help appreciated. With definitions and more narrative.

Depends what you use and how you use it.
I am talking from the point of view of regression/classification. So using the impact metrics to predict game score or game outcome. If you use the impact metrics independently (no linear/nonlinear combinations) and assuming they are statistically independent (unlikely) then each of them would correspond to a separate dimension.

Posted: **Fri Dec 18, 2020 8:46 pm**

vzografos wrote: ↑Fri Dec 18, 2020 4:19 pm
DSMok1 wrote: ↑Fri Dec 18, 2020 12:04 pm This discussion is a bit above my pay grade, but...

Isn't the actual objective only 1 dimension? Point differential? And all other dimensions that can be used simply proxies for different "paths" to achieve that point differential? And the only direct way to measure a player's contribution (some form of RAPM) is only 1 dimensional, point differential?

My sense is that by adding secondary dimensions there is a likelihood of introducing bias.
Now we are confusing input dimensions (many) with output dimensions (most often one, unless you do multivariate regression).

Imagine the following. You have two teams of 3 players each. Each player has 100 stats. And let's take the average of the 3 players to be the team's stats. Let's assume we want to predict the score differential. So we can train a regression algorithm by concatenating our data:

OBSERVATION: [FEATURE VECTOR] -----> LABEL
................................................................................................................................................
observation 1: [100 stats of Team A, 100 stats of Team B] ---> point spread
observation 2: [100 stats of Team A, 100 stats of Team B] ---> point spread
...
observation 1000: [100 stats of Team A, 100 stats of Team B] ---> point spread

That's 1000 games worth of training data.

Now assume the counterexample where we use some 1-dimensional metric (like RAPM), and also assume that it makes sense to average this over the team or, alternatively, assuming that the teams have an equal number of players, that we can just concatenate the values. So we have:

observation 1: [RAPM_A1, RAPM_A2, RAPM_A3, RAPM_B1, RAPM_B2, RAPM_B3] ---> point spread
observation 2: [RAPM_A1, RAPM_A2, RAPM_A3, RAPM_B1, RAPM_B2, RAPM_B3] ---> point spread
...
observation 1000: [RAPM_A1, RAPM_A2, RAPM_A3, RAPM_B1, RAPM_B2, RAPM_B3] ---> point spread

In your opinion, and assuming all else being equal, which of the above two methods should perform better?

Posted: **Fri Dec 18, 2020 9:07 pm**

If I'm understanding your question, I think it depends on if the summary measure is redundant with the 'full scale' of statistics you mention according to your analysis technique.

Take a simple example where your full scale of statistics is only two measures, A and B, and your analysis is some kind of linear regression. Say I have access to two different summary/all-in-ones, one that takes the average of A and B and another that says A should be worth 3 times as much as B. If I also run a linear regression with those two summary measures it should fit more or less identically to the regression using 'raw' A and B, because the summary measures are just linear combinations of A and B.
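A quick way to check this claim on toy data: fit a least-squares regression once on raw A and B and once on the two summaries, then compare the fitted values. Everything below is an assumption made for the sketch (synthetic data, no intercept term, the second summary taken as 3*A + B).

```python
import random

def fit_two_predictors(x1, x2, y):
    """Least-squares fit of y ~ b1*x1 + b2*x2 (no intercept), 2x2 normal equations."""
    s11 = sum(a * a for a in x1)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s22 = sum(b * b for b in x2)
    t1 = sum(a * c for a, c in zip(x1, y))
    t2 = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return (t1 * s22 - t2 * s12) / det, (s11 * t2 - s12 * t1) / det

def predict(coefs, x1, x2):
    b1, b2 = coefs
    return [b1 * a + b2 * b for a, b in zip(x1, x2)]

random.seed(1)
A = [random.gauss(0, 1) for _ in range(200)]
B = [random.gauss(0, 1) for _ in range(200)]
# Made-up target: a linear mix of A and B plus noise.
y = [2 * a - b + random.gauss(0, 0.5) for a, b in zip(A, B)]

# Two summary measures, both linear combinations of A and B.
avg = [(a + b) / 2 for a, b in zip(A, B)]
weighted = [3 * a + b for a, b in zip(A, B)]

raw_pred = predict(fit_two_predictors(A, B, y), A, B)
summary_pred = predict(fit_two_predictors(avg, weighted, y), avg, weighted)

# Largest gap between the two sets of fitted values (floating-point noise only).
max_diff = max(abs(p - q) for p, q in zip(raw_pred, summary_pred))
print(max_diff)
```

The coefficients differ between the two fits, but the fitted values agree because both pairs of inputs span the same space.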

Differences could/will creep in if the summary measures and/or analysis techniques are mismatched. If I use the same linear regression on a list of statistics and compare it to a linear regression with summary measures that include interactions (steals*rebounds or whatever), they are no longer the same inputs. I would expect the summary measures to do somewhat better, to the extent that the interaction was intelligently chosen and there aren't so many measures included that they drown out the 'smart' one. Or, if you use a linear regression with the summary measures but a fancy deep learning neural network with the raw statistics, I would (broadly speaking) expect the raw stats to do better, or at least differently.

Posted: **Sat Dec 19, 2020 4:36 am**

if you are making predictions at the game level, you don't have enough training samples if you are going to use like 200 features. Better to carefully craft features that actually make sense IMO. If you are predicting at the possession level, yes, you can probably get away with a lot more features. Even then, stuff like "average rebounds in last 10 games" will very likely have no predictive value...so simpler is probably still better.

Furthermore, if you are wanting to use box score stats like assists and rebounds etc, might be useful to first think about where those stats are coming from. What aspects of winning basketball are they proxies for?

Posted: **Sat Dec 19, 2020 8:24 am**

liminal_space wrote: ↑Sat Dec 19, 2020 4:36 am if you are making predictions at the game level, you don't have enough training samples if you are going to use like 200 features.

What exactly do you mean by "at the game level"? I didn't understand that.

Posted: **Sat Dec 19, 2020 8:41 am**

xkonk wrote: ↑Fri Dec 18, 2020 9:07 pm
Take a simple example where your full scale of statistics is only two measures, A and B, and your analysis is some kind of linear regression. Say I have access to two different summary/all-in-ones, one that takes the average of A and B and another that says A should be worth 3 times as much as B. If I also run a linear regression with those two summary measures it should fit more or less identically to the regression using 'raw' A and B, because the summary measures are just linear combinations of A and B.

Agreed.

xkonk wrote: ↑Fri Dec 18, 2020 9:07 pm Differences could/will creep in if the summary measures and/or analysis techniques are mismatched. If I use the same linear regression on a list of statistics and compare it to a linear regression with summary measures that include interactions (steals*rebounds or whatever), they are no longer the same inputs. I would expect the summary measures to do somewhat better, to the extent that the interaction was intelligently chosen and there aren't so many measures included that they drown out the 'smart' one. Or, if you use a linear regression with the summary measures but a fancy deep learning neural network with the raw statistics, I would (broadly speaking) expect the raw stats to do better, or at least differently.

Agreed here as well.
Regarding use in prediction:
My observation is that summary (or impact or whatever you want to call them) measures can be effective assuming that you did a good job choosing them. If anything they can be very efficient since you only need to use a simple regressor/classifier with them. Whereas raw stats need something more advanced (and computationally expensive) to work.

However that "did a good job choosing them" is no simple matter. From the discussions I see on this board for the various impact metrics, I see there is a lot of manual tweaking and modification necessary, simply because it is very difficult to design such metrics in the first place. Especially if these metrics need to make sense on the human level. What I mean by that is that the recipe for designing them needs to have an easily understood narrative.

I experimented in the past with building such impact metrics using automated "black-box" techniques (such as evolutionary programming), which produced very accurate metrics as (often very complex) combinations of raw stats. Although accurate and fast (since you can do prediction with only 1 or 2 dimensions), they were not much good for communicating to other people what exactly they mean. Still, their prediction accuracy, although comparable, is never as good as with raw statistics.

When using raw stats and a sophisticated regression/classification algorithm (such as NNs, RFs or GBTs), you don't have to design much at the feature level. These days the algorithms themselves will even do the feature selection and feature engineering for you.

I think that summary/impact metrics are very effective for manually understanding and communicating the importance of different players, and definitely for ranking them, especially if you want to compare players. With impact metrics you can easily say that (e.g.) Yiannis is 3.5 times better than Harden or whatever, something you cannot easily do with raw stats as a whole. But I would not use impact metrics to do any serious prediction, to be honest.

Posted: **Mon Dec 21, 2020 3:59 pm**

vzografos wrote: ↑Sat Dec 19, 2020 8:24 am
liminal_space wrote: ↑Sat Dec 19, 2020 4:36 am if you are making predictions at the game level, you don't have enough training samples if you are going to use like 200 features.
What exactly do you mean "at the game level"? I didnt understand that

team A is playing team B. you want to use a "full dimensional space" but I am saying you don't have enough data to find the right signals in that space. this isn't ImageNet where you have 14 million training samples...

Posted: **Mon Dec 21, 2020 11:18 pm**

liminal_space wrote: ↑Mon Dec 21, 2020 3:59 pm
vzografos wrote: ↑Sat Dec 19, 2020 8:24 am
liminal_space wrote: ↑Sat Dec 19, 2020 4:36 am if you are making predictions at the game level, you don't have enough training samples if you are going to use like 200 features.
What exactly do you mean "at the game level"? I didnt understand that
team A is playing team B. you want to use a "full dimensional space" but I am saying you don't have enough data to find the right signals in that space. this isn't ImageNet where you have 14 million training samples...

I see. ...

I have about 60,000 data samples (game observations) for a 124-dimensional space (feature space).
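As a back-of-the-envelope check on the "enough data" question, the samples-per-feature ratio for these figures can be set against a single-season case. The single-season numbers are an assumption for illustration only (1,230 games, i.e. 30 teams playing 82 games each, paired with the "like 200 features" mentioned earlier), and the ratio is only a crude heuristic.

```python
def samples_per_feature(n_samples, n_features):
    """Crude heuristic: how many observations are available per input dimension."""
    return n_samples / n_features

# Figures quoted in the post above.
reported = samples_per_feature(60_000, 124)

# Assumed comparison case: one 1,230-game season with 200 features.
single_season = samples_per_feature(1_230, 200)

print(round(reported))        # 484
print(reported / single_season)
```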

What is the advantage of 1-dimensional impact metrics?