Stability of Team Statistics
Kostya Medvedovsky did some excellent work analyzing the stability of NBA team stats. This would also apply to lineup-level stats.
Here is the string of tweets:
Re: Stability of Team Statistics
A thought experiment:
How much time is required for a coach to know if a particular lineup is working well? I know Crow has often mentioned that lineups are used for such short stints that it is hard to know how well a lineup meshes.
Plus/minus stats on even a few hundred minutes are extremely unreliable, because the underlying statistics have a huge amount of variance. For instance, just binomial randomness on a sample size of 20 or 30 three-point attempts will completely swamp any "signal" from a particular lineup combination generating particularly good looks. That random error overwhelms the actual plus/minus point totals.
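To put a number on that binomial point, here is a quick sketch; the 36% shooter and 25 attempts are illustrative assumptions, not figures from the thread:

```python
import math

# Binomial noise in 3P% over a small lineup sample.
# The "true" 36% rate and 25 attempts are assumptions for illustration.
p, n = 0.36, 25
se = math.sqrt(p * (1 - p) / n)                    # SE of the observed 3P%
print(f"3P% noise on {n} attempts: +/- {se:.1%}")  # about +/- 9.6 points
# Each extra make or miss swings the raw total by 3 points, so a 1-SD
# run of shooting luck moves the lineup by ~3 * 0.096 * 25 ≈ 7 points.
```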
As an example, I looked at lineups in the 2017 season and compared the first half of the season with the second half. Even lineups with over 100 possessions in both halves showed almost no correlation in performance between the two halves (the correlation was 0.17). Only team stylistic choices showed good correlation between the first half of the season and the second--things like pace, 3PA, and DRB.
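A minimal sketch of that split-half check, assuming a hypothetical per-half lineup file (this is not the original code):

```python
import pandas as pd

# Hypothetical input: one row per lineup per half-season, with columns
# lineup_id, half (1 or 2), poss, ortg. The file name is an assumption.
df = pd.read_csv("lineups_2017_halves.csv")
wide = df.pivot(index="lineup_id", columns="half", values=["poss", "ortg"])

# Keep lineups with 100+ possessions in BOTH halves, then correlate.
qual = wide[(wide[("poss", 1)] >= 100) & (wide[("poss", 2)] >= 100)]
print(qual[("ortg", 1)].corr(qual[("ortg", 2)]))  # post reports ~0.17
```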
However, these are all viewed through the lens of rough counting stats. A team would be looking at things like the quality of 3pt attempts, whether the team is struggling to get into plays on the offensive end, etc. Those elements should be apparent to the trained eye very quickly, compared to the crude measure of counting stats.
So my question is...how long does it take to assess a lineup, based on these more qualitative, yet basic, criteria?
Re: Stability of Team Statistics
So I looked at this last year in terms of expected points per shot vs points per shot (expected points per shot is just what the team would expect to be shooting if every shooter shot their career average from the location of the shot).
The charts for game by game can be found here:
https://imgur.com/a/Ti9Z94R
I can't find the code right now, but I'm pretty sure I found a very high correlation (>0.9) between a team's first-half ePPS and second-half ePPS, but a lower correlation on PPS (~0.7). On the defensive end I think both were in the 0.6 range.
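A rough sketch of how ePPS as described might be computed; the shot-log columns are assumptions, and for brevity the "career average by location" is approximated from the same log:

```python
import pandas as pd

# Hypothetical shot log: one row per FGA with team, shooter, zone, points.
shots = pd.read_csv("shot_log.csv")

# Expected value of each shot = the shooter's average from that location.
# (A faithful version would use career averages, not the same season's log.)
shots["exp_pts"] = shots.groupby(["shooter", "zone"])["points"].transform("mean")

# Team-level PPS vs ePPS, as compared in the charts above.
team = shots.groupby("team").agg(PPS=("points", "mean"),
                                 ePPS=("exp_pts", "mean"))
print(team)
```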
Re: Stability of Team Statistics
I thought I would bring this back up.
I just ran a Monte Carlo simulation to roughly assess the standard error associated with a team's points/100 possessions observation.
Supposing the observation looks like this: with lineup X, the Sacramento Kings scored Y points/100 possessions. What is the standard error, assuming every possession is an equal observation of an underlying TRUE SKILL LEVEL?
In general, the standard error is of the form: 113/sqrt(n), where n is the number of possessions observed. (The 113 varies a bit season to season--it is a touch higher now with more 3 point shooting).
So if lineup X has scored 120 points/100 possessions over 80 possessions, then the standard error for that offensive rating is +/- 12.6 points.
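A sketch of the kind of Monte Carlo involved; the per-possession point distribution below is a rough assumption, tuned to land near the 113 figure, not the exact distribution behind the simulation described above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed per-possession point distribution (~1.13 points/possession).
outcomes = np.array([0, 1, 2, 3])
probs = np.array([0.45, 0.09, 0.34, 0.12])

n_poss, n_trials = 80, 100_000
sims = rng.choice(outcomes, size=(n_trials, n_poss), p=probs)
ratings = 100 * sims.mean(axis=1)   # offensive rating per simulated trial
print(ratings.std())                # ~12.5, vs 113/sqrt(80) ~ 12.6
```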
Note, team efficiency differential has 2 sources of error--the offense and the defense. I'm just looking at the offensive rating here.
That is the LOWER BOUND for the error. If the possessions/observations are biased, in other words not all against league-average opponents, then the actual standard error is even higher. And all samples in the NBA are biased; some are significantly so. This applies even if you're adjusting the observation for the expected quality of the opponent lineups.
Thoughts?
Re: Stability of Team Statistics
Take two lineups of equal or near-equal size but different current performances. How likely are various levels of net performance difference to be maintained at x times the sample size? That is what I want to know, rather than just saying "huge standard errors," concluding I don't know anything, and therefore not caring to try to optimize rationally. It is about choice, not individual "true" values.
And I'd want to see the math on comparing future expectations of lineups of significantly different sizes.
I'd want to make the best guess possible from available data. And the best guesses possible (though not of good quality) will on average be the lineups with the best combination of sample size and current performance. Sample size is under available control, though in practice not exercised all that much by any coach / team. That information is available even without doing all the desirable math.
How important is it to do all the math? Maybe some, for exact proportions between lineups, but less so for ordinal rank / priority. And almost none if coaches and teams manage lineups almost totally independently of the facts and optimization principles.
Adjusting for opponent 3pt performance is worthwhile, especially for small samples (if you try to make sense of small samples, which I try to move away from as much as possible in current practice), but I'd look at both unadjusted and adjusted rather than completely throwing out unadjusted. And I did look at own-team adjusted 3pt as well, even if it is not as big a deal as opponent 3pt.
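(A sketch of the kind of opponent-3pt adjustment meant here, with an assumed league-average 3P% and a hypothetical helper function:)

```python
LEAGUE_3P = 0.36  # assumed league-average 3P%, for illustration

def luck_adjusted_drtg(pts_allowed, opp_3pm, opp_3pa, possessions):
    """Defensive rating with opponent 3P makes reset to league average."""
    adj_pts = pts_allowed + 3 * (LEAGUE_3P * opp_3pa - opp_3pm)
    return 100 * adj_pts / possessions

# A lineup that allowed 110 pts on 100 possessions while opponents went
# 15/30 (50%) from three gets credited as if they had shot 36% instead:
print(luck_adjusted_drtg(110, 15, 30, 100))  # 97.4
```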
The main things I want to do, though, are to increase sample sizes massively and then generally follow the best performers in that group much more heavily than currently. Simple stuff. Not being done much at all. More could be done to refine, but do that first.
Re: Stability of Team Statistics
Imo, the ideal distribution of minutes to lineups for a team trying to be its best would be something like:
1 lineup of 800-1000 minutes (your best performer)
2 lineups of 400 minutes (your best alternatives)
4-6 of 200 minutes
maybe 5 more at or near 100 minutes
then about 500-1000 minutes of situational calls.
Last season, only 3 teams had a lineup used 8 minutes per game for the season. Only 12 had a lineup even over 400 minutes as their top lineup. Instead of 6-8 lineups at or above 200 minutes as in my model, the actual average was 0.7. Instead of 12-14 lineups over 100 minutes, the actual average was 3.3.
My model might not be exactly right for every team, especially one heavily affected by injuries or change, but the actual average is way short of desirable for every team. There would be a middle ground, giving some priority to development of / experimentation with young or new players, but that middle ground of concentration would still be higher to far higher than the current norm.
Re: Stability of Team Statistics
That's easy with the math. Say we have 80 possessions for each lineup. Lineup A has an offensive rating of 120, Lineup B has an offensive rating of 100.
Standard error for each is 12.6 points.
SD(A-B) is sqrt(12.6^2 + 12.6^2) = 17.8.
The Z score is therefore: (120-100)/17.8 = 1.12. The probability that lineup A has a better TRUE SKILL LEVEL than lineup B, with no outside information, is 87%.
Notice this is not giving any information on what the true skill level of either of these lineups is most likely to be--this is just a straight comparison.
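The same computation in code form, using the normal CDF (a sketch mirroring the numbers above):

```python
from statistics import NormalDist

se = 113 / 80 ** 0.5              # per-lineup SE at 80 possessions, ~12.6
se_diff = (se**2 + se**2) ** 0.5  # SE of the A-minus-B difference, ~17.8
z = (120 - 100) / se_diff         # ~1.12
print(NormalDist().cdf(z))        # ~0.87 = P(A truly better than B)
```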
---------
As an aside, Crow--my feeling is that lineups are generally best evaluated as the sum of their 5 players, each of a given "skill level", and it is rare for there to be enough data to know with any certainty mathematically that there are synergies that make the lineup greater than the sum of its parts.
Now, poorly constructed lineups can certainly be worse than the sum of their parts, and well constructed lineups better. But the effects are small enough and the standard errors large enough that I feel this should be evaluated qualitatively by the coaches rather than quantitatively, at least through the lens of points per possession.
Re: Stability of Team Statistics
Thanks Daniel.
To be clear, how did you go from the Z score to the probability that lineup A has a better TRUE SKILL LEVEL than lineup B? Z-score table?
What are the probabilities of such for a 5 or 10 pt performance difference? If the above is yes, then about 60% and 75%? A 75% chance of being right would certainly move me. 60% by analytics is also probably (imo) better than a subjective guess, given how the results of the coach-driven guessing game actually appear to go.
Are the standard errors for 200 and 400 minutes (or about 400 and 800 possessions) about 5.7 and 4 respectively? And then at 400 possessions the z score for a 20 pt performance margin would be 2.5, 10 pt would be 1.25, and 5 pt would be 0.625? And that would lead to probabilities of roughly 99%, 89% and 73% respectively? And even higher for 800 possessions.
200 possessions would have z-scores of about 1.77, 0.89 and 0.44 respectively? So a 5 pt performance difference at 200 possessions has little predictive value, but raise the sample size and / or the performance difference and it increasingly does? I don't make much of a 5 pt performance difference at 100 possessions. With more of either or both, the more weight I would put on it, and be justified in doing so.
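(A small sketch that computes these probabilities, assuming the 113/sqrt(n) error model from earlier and equal sample sizes for both lineups:)

```python
from statistics import NormalDist

# P(the better-observed lineup is truly better) for a given margin, when
# both lineups have n possessions and each has SE = 113/sqrt(n).
for n in (100, 200, 400, 800):
    se_diff = 2 ** 0.5 * 113 / n ** 0.5
    row = {d: f"{NormalDist().cdf(d / se_diff):.0%}" for d in (5, 10, 20)}
    print(n, "possessions:", row)
```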
Accept a lower confidence threshold and even more of these performance differences become actionable.
A typical team has 12-13 lineups used 40 minutes or 80 possessions or more. Based on this analysis it would be reasonable to accept 10-plus point performance differences as likely real, and smaller performance differences are fairly likely real if one or both of the sample sizes is much bigger. So that likely means increasing minutes for at least some of the top-quartile performers at the expense of the bottom quartile (and less movement in the middle without other rationale). If a team had 12-14 lineups over 200 possessions, as called for in my distribution model, it would have a considerably stronger basis for lineup selection / weighting than currently.
"my feeling is that lineups are generally best evaluated as the sum of their 5 players..."
Adjusting usage or not? Recognizing predominant starter / bench utilization contexts or not? Better and worse spacing contexts?
Sum of 5 may be a decent start, but the variances are important to understand and get more right than random.
As for synergies and learning of them, RAPM pairs would seem to be the best advisor, probably better than coaching eye-test judgment, though hopefully they are fairly similar. How good? I don't know the standard errors, but I have thought they are pretty low for the most-used pairs.
Re: Stability of Team Statistics
The simplest way to improve is probably just to take the best coach-chosen lineups and increase their utilization greatly. Emphasize the positive. (Though grown-ups should be able to face all the data, including the negative results, just as directly.)
Close to 60% of the 100 most-used lineups have been used less than 200 minutes. Any over +5 or +10 should probably see a 50-100% increase in utilization (in perceived favorable circumstances, based on split analysis) for a while, to see what happens. There were almost 20 cases in the league where a +7.5 lineup over 100 minutes failed to get 200 minutes. Warriors 1, Celtics 1, Heat 3, Bucks 2, Griz 2, Suns 1, Lakers 1, Clippers 2... Nuggets 0.
15 of the 40 total lineups over +7.5 and 100 minutes last season are no longer possible. How often was that explicitly recognized before the move that wiped a good lineup out? When an average team had only 1.3 of these, this is a pretty big deal imo.
Re: Stability of Team Statistics
Less than 62% of player pairs over 500 minutes were positive on raw +/-. The average team had 24 such pairs with 9 non-positive. How many of them were / will be directly reviewed and how many will be diminished or eliminated?
The Thunder tested only 19 pairs to this level, and 11 were non-positive. Below average on both counts: only 42% positive, and barely more than half the typical 15 positive bigger-minute pairs.
With only 8 positive pairs with the minutes, every lineup will either have negative primary pairs or pairs not tested to this level. You can't say that about the average team.
Coach D was almost 1/3 less likely to get positive results than league average when awarding 500 plus minutes to a pair.
(P.S. I used nba.com data on pairs. I now see BRef is modestly different. One or both have flaws.)
Re: Stability of Team Statistics
Crow, is your belief that maintaining consistent lineups (more minutes with fewer lineups) will cause those lineups to perform better? Or is the primary benefit that more information will be gathered about those lineups?
----
I believe that the primary reason that numerous player pairs are negative is that the majority of NBA players are negative. So we would expect most player pairs to be negative.
Re: Stability of Team Statistics
DSMok1 wrote: ↑Fri Dec 28, 2018 5:13 pm
... I looked at lineups in the 2017 season, and compared the first half of the season with the second half of the season. Even lineups with over 100 possessions in both halves of the season showed almost no correlation in performance between the first half of the season and the second half of the season. (The correlation was 0.17). ...

I don't recall this old thread, but the above statement would seem to indicate that a couple hundred good minutes with a lineup does not generally continue to be as good. And part of this tendency could be that opponents adjust to well-used lineups over the course of a season.
In playoff series, this is seen most dramatically. A coaching staff figures out how to neutralize a hot player, or when your weakest defender isn't exploited too much. Sticking to your best (to date) 5 players is like asking to be out-coached.
Re: Stability of Team Statistics
I said elsewhere recently that 95.5% of the champion's net playoff margin over the last 5 years came from their 5 biggest lineups. They didn't get outcoached by playing familiar.
More generally 25 of the 31 biggest minute lineups in the last playoffs were positive. Over 80%.
Playing bigger minutes gathers more data and gives more probability that the best of those lineups will persist as positive. Consistency of minutes probably helps performance. Eli Witus showed here, a decade or more ago, that bigger-minute lineup stints on average performed better.
The low first half / second half correlation of performance is probably mostly showing that lineups "over 100 possessions" are mostly still under 200 possessions, which is still small, and it doesn't really speak to the performance over time of truly large / well-tested lineups (400-800 possessions or more).
Re: Stability of Team Statistics
Mike G wrote: ↑Thu Jul 20, 2023 1:06 pm
I don't recall this old thread, but the above statement would seem to indicate that a couple hundred good minutes with a lineup does not generally continue to be as good. ...

Perhaps, but I would argue that most of this effect is simply from inherent noise in the measurement.
How much is the difference between a good lineup and a poor lineup? 10 points per 100 possessions in "true" level?
The standard error of point differential is larger. It's basically 160/sqrt(n), where n is the number of full possessions (one each on offense and defense). Offense and defense each supply half of the variance: sqrt(113^2 + 113^2) ≈ 160.
So if we have a 100 possession sample, the standard error for an individual lineup observation is +/- 16. That is significantly larger than any true talent differences we're trying to observe! So the noise swamps the signal even over half a season.
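To put numbers on that under the same 160/sqrt(n) model (a quick sketch):

```python
# Net-rating standard error at various sample sizes, per 160/sqrt(n).
for n in (100, 200, 400, 800, 1600):
    print(f"{n:5d} possessions: +/- {160 / n ** 0.5:.1f} pts/100")
# Even 1600 possessions (a heavy full-season lineup load) leaves +/- 4,
# against a good-vs-poor "true" gap of maybe 10 points per 100.
```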
Re: Stability of Team Statistics
Of the only 3 lineups used over 600 minutes last season, 2 were positive in both the first and second halves.
I am trying to find more but it is hard to find really big minutes for a lineup in both halves of the season.