Some rules of thumb
Page 1 of 3
 
APBRmetrics Forum Index -> General discussion
Ed Küpfer
Joined: 30 Dec 2004
Posts: 787
Location: Toronto
	
PostPosted: Thu Apr 07, 2005 1:55 am    Post subject: Some rules of thumb 	Reply with quote
Some hacks I'm always flipping through my notes to look up. Figured I'd just jot them down here and bookmark the page to save me some trouble. Feel free to add your own.
Assisted % = 0.75 - AST/MIN * 1.5
This one has a standard error of about 10%.
Potential Assists = AST * (0.5 * PTS/FGM) / TS%
Thanks to Dan for this one.
In-Game Home Team Win Probability = 1 / (1 + EXP(-(0.06 + MinutesRemaining * 0.01+ HomeTeamLead * 0.34)))
This one will be the subject of a massive study this offseason. It only works with less than one quarter remaining.
For every point in team point differential, add 3 games to a team's win total over the course of 82 games. Home court advantage is worth about 3 points per game.
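A rough Python sketch of the win probability and point differential rules, for pasting into a script. The function names are made up for illustration, and the 41-win baseline (a .500 team over 82 games) is an assumption; the rule itself only says each point of differential adds 3 wins.

```python
import math

def late_game_home_win_prob(minutes_remaining, home_lead):
    """In-game home win probability from the logistic rule above.

    Only meant for situations with less than one quarter remaining.
    """
    z = 0.06 + minutes_remaining * 0.01 + home_lead * 0.34
    return 1.0 / (1.0 + math.exp(-z))

def expected_wins(point_diff_per_game):
    """82-game win total from average point differential.

    Assumes a zero-differential team wins 41 games (not stated in the post).
    """
    return 41 + 3 * point_diff_per_game

print(round(late_game_home_win_prob(2, 3), 3))
print(expected_wins(5))
```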
_________________
ed
Ed Küpfer
Joined: 30 Dec 2004
Posts: 787
Location: Toronto
	
PostPosted: Sat Apr 09, 2005 3:56 am    Post subject: 	Reply with quote
For estimating high possession players
    * Regress FT% and FTA/poss 20% to the mean
    * Regress eFG% 25% to the mean
    * Regress ORTG 25% to the mean
    * Regress DRTG 30% to the mean
    * Regress TO% 20% to the mean
Do not regress
    * DR%
    * OR%
    * AST%
    * BLK%
    * STL%
_________________
ed
HoopStudies
Joined: 30 Dec 2004
Posts: 706
Location: Near Philadelphia, PA
	
PostPosted: Sat Apr 09, 2005 10:36 am    Post subject: 	Reply with quote
Ed Küpfer wrote:
For estimating high possession players
    * Regress FT% and FTA/poss 20% to the mean
    * Regress eFG% 25% to the mean
    * Regress ORTG 25% to the mean
    * Regress DRTG 30% to the mean
    * Regress TO% 20% to the mean
What do you mean by this, Ed? What are you "estimating"?
_________________
Dean Oliver
Author, Basketball on Paper
The postings are my own & don't necess represent positions, strategies or opinions of employers.
Ed Küpfer
Joined: 30 Dec 2004
Posts: 787
Location: Toronto
	
PostPosted: Sat Apr 09, 2005 12:44 pm    Post subject: 	Reply with quote
HoopStudies wrote:
What do you mean by this, Ed? What are you "estimating"?
It's my best guess for the next season. Once upon a time I calculated the year-to-year correlation coefficients for these stats -- the numbers above represent the regression to the mean.
For example, FT% is pretty stable: r = 0.8 among players who shoot a lot of free throws. If a player shoots 90% in season 1, my best guess for season 2 is
90% - [(1 - r) * (90% - 75%)] = 87%.
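A minimal Python sketch of that calculation. The function name is hypothetical, and the 75% league-average FT% is simply the figure used in the example above.

```python
def regress_to_mean(observed, r, league_mean):
    """Shrink an observed rate toward the league mean.

    r is the year-to-year correlation, so (1 - r) is the share
    of the gap to the mean that gets regressed away.
    """
    return observed - (1.0 - r) * (observed - league_mean)

# FT% example: 90% shooter, r = 0.8, league mean 75%
print(round(regress_to_mean(0.90, 0.8, 0.75), 2))
```

The "regress X% to the mean" figures listed earlier correspond to (1 - r), so "regress 20%" would mean r = 0.8 here.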
_________________
ed
Ed Küpfer
Joined: 30 Dec 2004
Posts: 787
Location: Toronto
	
PostPosted: Tue Apr 19, 2005 8:12 pm    Post subject: 	Reply with quote
Team1 Vs Team2 predictors
Code:
EFG% = 1 / (1 + EXP (-(-3.1 + 0.05 * HOME + 3.2 * 1offEFG% + 3.0 * 2defEFG%)))
Predictor       Coef    SE Coef        Z      P  Odds Ratio  Lower  Upper
Constant    -3.09752  0.0114094  -271.49  0.000
HOME       0.0460184  0.0007412    62.09  0.000        1.05   1.05   1.05
1offEFG%     3.22721  0.0187841   171.81  0.000       25.21  24.30  26.15
2defEFG%     3.00915  0.0205559   146.39  0.000       20.27  19.47  21.10
Code:
TO% = 1 / (1 + EXP (-(-3.65 - 0.03 * HOME + 6.1 * 1offTO% + 6.3 * 2defTO%)))
Predictor        Coef    SE Coef        Z      P  Odds Ratio   Lower   Upper
Constant     -3.65004  0.0084608  -431.40  0.000
HOME       -0.0284064  0.0010110   -28.10  0.000        0.97    0.97    0.97
1offTO%         6.05658  0.0407233   148.73  0.000      426.91  394.16  462.39
2defTO%         6.29438  0.0384019   163.91  0.000      541.52  502.26  583.85
Code:
OR% = 1 / (1 + EXP (-(-3.0 + 0.08 * HOME + 3.8 * 1offOR% + 3.2 * 2defOR%)))
                                      Odds     95% CI
Predictor       Coef    SE Coef        Z      P  Ratio  Lower  Upper
Constant    -2.99216  0.0057565  -519.79  0.000
HOME       0.0769602  0.0008089    95.14  0.000   1.08   1.08   1.08
1offOR%        3.78669  0.0151349   250.20  0.000  44.11  42.82  45.44
2defOR%        3.15742  0.0176015   179.38  0.000  23.51  22.71  24.34
2defOR% = 1 - 2defDR%
Code:
FTA/Poss = -0.22 + 0.01 * HOME + 0.9 * 1FTA/poss + 0.9 * 2FTA/poss
Predictor       Coef    SE Coef       T      P
Constant   -0.216195   0.006288  -34.38  0.000
Home       0.0108320  0.0008882   12.20  0.000
1FTA         0.88636    0.01859   47.69  0.000
2FTA         0.91714    0.01631   56.25  0.000
S = 0.0759282   R-Sq = 17.9%   R-Sq(adj) = 17.9%
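The four predictors can be sketched in Python as below, using the rounded coefficients. Several things here are assumptions rather than anything stated in the tables: HOME is coded 1 when Team1 is at home and 0 otherwise, all rates are entered as fractions (0.48, not 48), the second FTA term is what Team2 allows, and the function names are made up for illustration.

```python
import math

def _logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_efg(home, off_efg, def_efg):
    return _logistic(-3.1 + 0.05 * home + 3.2 * off_efg + 3.0 * def_efg)

def predict_to_pct(home, off_to, def_to):
    # HOME sign follows the coefficient table (-0.0284)
    return _logistic(-3.65 - 0.03 * home + 6.1 * off_to + 6.3 * def_to)

def predict_or_pct(home, off_or, def_dr):
    def_or_allowed = 1.0 - def_dr  # 2defOR% = 1 - 2defDR%
    return _logistic(-3.0 + 0.08 * home + 3.8 * off_or + 3.2 * def_or_allowed)

def predict_fta_per_poss(home, off_fta, def_fta):
    # Plain linear regression, no logistic transform
    return -0.22 + 0.01 * home + 0.9 * off_fta + 0.9 * def_fta

print(round(predict_efg(0, 0.48, 0.48), 3))
print(round(predict_fta_per_poss(0, 0.30, 0.30), 3))
```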
_________________
ed
Ed Küpfer
Joined: 30 Dec 2004
Posts: 787
Location: Toronto
	
PostPosted: Wed Apr 20, 2005 4:12 pm    Post subject: 	Reply with quote
Linear Weights-style individual possession estimator:
POSS = 0.74 * FGA + 0.44 * FTA + 0.25 * OR + 0.25 * AST + TO
Points Produced estimator:
PtsProd = 1.45 * 2Made + 2.2 * 3Made + FTMade + 0.6 * OR + 0.6 * AST
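Both estimators as a small Python sketch. The function and argument names are made up, and the stat line at the bottom is a hypothetical season, not a real player.

```python
def est_possessions(fga, fta, orb, ast, tov):
    """Linear-weights individual possessions."""
    return 0.74 * fga + 0.44 * fta + 0.25 * orb + 0.25 * ast + tov

def est_points_produced(fg2m, fg3m, ftm, orb, ast):
    """Linear-weights points produced."""
    return 1.45 * fg2m + 2.2 * fg3m + ftm + 0.6 * orb + 0.6 * ast

# Hypothetical season totals
poss = est_possessions(fga=1000, fta=400, orb=100, ast=300, tov=200)
pts = est_points_produced(fg2m=400, fg3m=100, ftm=320, orb=100, ast=300)
print(round(poss, 1), round(pts, 1), round(100 * pts / poss, 1))
```

Dividing the two gives a quick points-per-100-possessions figure for the player.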
_________________
ed
mtamada
Joined: 28 Jan 2005
Posts: 377
	
PostPosted: Mon Apr 25, 2005 7:00 pm    Post subject: 	Reply with quote
Ed Küpfer wrote:
For estimating high possession players
Do not regress
    * DR%
    * OR%
    * AST%
    * BLK%
    * STL%
Why would one not want to regress these statistics? Given that even FT% should be regressed .2 to the mean according to your figures, I would think that these should be also, e.g. an Elmore Smith might have a titanic shot-blocking year which he is unlikely to repeat.
Ed Küpfer
Joined: 30 Dec 2004
Posts: 787
Location: Toronto
	
PostPosted: Mon May 09, 2005 4:18 pm    Post subject: 	Reply with quote
{Edited by ed}
Here's a cool formula for estimating a team's final winning percentage from the within-season win%, from DennisBoz.
Code:
Final Win % = ((1-F)^2)/2 + (2*F - F*F) * Win% To Date
where F = %age of the season completed (GP/82).
I have no idea why this works -- you can see some discussion at the link above. It does work though, giving an overall RMSE of 6.1 games. The formula, of course, gets more accurate as the season progresses -- here's a comparison between Boz's estimate of final season Win% and the final Win% if you simply extrapolated from the present win% (that is, if you assumed your .600 record at game# 40 would produce a season ending record of .600):
Code:
              RMSE
Game#    BOZ  extrapolated
1-10    11.0    21.1
11-20   8.7     9.7
21-30   6.9     6.8
31-40   5.3     5.1
41-50   4.1     4.0
51-60   3.1     3.1
61-70   2.2     2.2
71-80   1.3     1.3
Pretty neat.
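A short Python sketch of the formula, with a hypothetical function name. One way to read it: since (1-F)^2 + (2*F - F*F) = 1, the estimate is a weighted average of .500 and the current Win%, with the weight on .500 shrinking as the season goes on.

```python
def boz_final_win_pct(win_pct_to_date, games_played, season_games=82):
    """Estimate final Win% from the Win% to date."""
    f = games_played / season_games          # share of season completed
    weight_on_500 = (1 - f) ** 2             # weight given to .500
    weight_on_record = 2 * f - f * f         # equals 1 - weight_on_500
    return weight_on_500 * 0.5 + weight_on_record * win_pct_to_date

# A .600 team at the halfway point
print(round(boz_final_win_pct(0.600, 41), 3))
```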
_________________
ed
Ed Küpfer
Joined: 30 Dec 2004
Posts: 787
Location: Toronto
	
PostPosted: Mon May 09, 2005 4:30 pm    Post subject: 	Reply with quote
Ed Küpfer wrote:
Code:
              RMSE
Game#    BOZ  extrapolated
1-10    11.0    21.1
11-20   8.7     9.7
21-30   6.9     6.8
31-40   5.3     5.1
41-50   4.1     4.0
51-60   3.1     3.1
61-70   2.2     2.2
71-80   1.3     1.3
Pretty neat.
Hey, wait a minute -- that's no good! It only does better at games 1-20. Hmm. Maybe I can screw around with the exponent.
_________________
ed
Dan Rosenbaum
Joined: 03 Jan 2005
Posts: 541
Location: Greensboro, North Carolina
	
PostPosted: Mon May 09, 2005 4:59 pm    Post subject: 	Reply with quote
Hey Ed, I know things are slow around here once in a while, but it kind of makes us all look bad if you have to argue with yourself. :D :D :D
Ed Küpfer
Joined: 30 Dec 2004
Posts: 787
Location: Toronto
	
PostPosted: Sun May 15, 2005 2:56 pm    Post subject: 	Reply with quote
A linear weights-style estimator for RTG, based on the four factors. I have no idea what it's good for, but anyway:
Code:
RTG = (1 + 5 * EFG% + FTA% - 4 * TO% + OR%) * 31
where
EFG% = (FGM + .5 * 3M) / FGA
FTA% = FTA / POSS
TO% = TO / POSS
OR% = OR / (OR + OppDR)
and POSS = FGA + FTA * 0.44 - OR + TO
Even these simplified weights -- 1, 5, 1, -4, 1 -- are very accurate: RMSE is 0.7 points per 100 possessions. Adding three decimal places only decreases the RMSE to 0.5.
One further note: I have found that standardising the four factors isn't helpful. I have tried standardising by the season (ie using the season by season mean and SD) and also by the entire sample, and neither has improved upon the accuracy of the raw stats. I don't know why this is.
And while I'm at it, a four factors-based WIN% estimator:
Code:
WIN% = 1 / ( 1 + EXP (bX) )
where bX = - (  20 * oEFG%
              +  5 * oFTA% 
              - 16 * oTO%
              +  6 * oOR%
              - 20 * dEFG%
              -  5 * dFTA%
              + 16 * dTO%
              -  6 * dOR% )
This too is very accurate, with a RMSE of 3.5 games in an 82-game season, compared to an RMSE of 3.0 games for a pythagorean estimate based on exponents customised by season.
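Both four-factor estimators as a Python sketch. The names are made up, the factors are assumed to be entered as fractions, and dOR% is taken to mean the opponent's offensive rebounding rate.

```python
import math

def rtg_estimate(efg, fta_rate, to_rate, or_rate):
    """Points per 100 possessions from the four factors."""
    return (1 + 5 * efg + fta_rate - 4 * to_rate + or_rate) * 31

def win_pct_estimate(off, dfn):
    """off and dfn are (EFG%, FTA%, TO%, OR%) tuples as fractions."""
    o_efg, o_fta, o_to, o_or = off
    d_efg, d_fta, d_to, d_or = dfn
    bx = -(20 * o_efg + 5 * o_fta - 16 * o_to + 6 * o_or
           - 20 * d_efg - 5 * d_fta + 16 * d_to - 6 * d_or)
    return 1.0 / (1.0 + math.exp(bx))

average = (0.48, 0.30, 0.15, 0.29)      # hypothetical league-average team
better_shooting = (0.50, 0.30, 0.15, 0.29)
print(round(rtg_estimate(*average), 2))
print(round(win_pct_estimate(better_shooting, average), 3))
```

A team whose own factors match what it allows comes out at exactly .500, which is a handy sanity check on the signs.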
_________________
ed
Ed Küpfer
Joined: 30 Dec 2004
Posts: 787
Location: Toronto
	
PostPosted: Thu Feb 09, 2006 8:17 pm    Post subject: 	Reply with quote
After fuch mucking around, I finally came up with a logistic within-game win estimator.
Code:
p(Home Team Win) = 1/(1+exp(-(0.13 + 0.12 * HomeTmLead + 0.0044 * HomeTmLead2 + 0.0068 * MinutesRemaining)))
HomeTmLead = Home Team lead
HomeTmLead2 = HomeTmLead * ABS(HomeTmLead)
MinutesRemaining = Minutes remaining in the game.
Overtimes treated like the final five minutes of the 4th quarter, and the fourth quarter of overtime games treated as if they ended in regulation. What? What I mean is this: if a game goes to overtime, and the home team leads by five with two minutes left in the overtime, that situation is dealt with exactly the same as if it were a five point lead with two minutes remaining in the 4th quarter.
WARNING: I have not tested this at the <1 minute remaining level. In fact, I used times rounded off to the nearest minute, so I'm not exactly sure how well the equation performs during the final minutes of games, when time slows down.
This equation above gives a very nice fit to my data (r2 = .96), which includes about 110,000 observations from almost every game in 04-05. I'm adding more observations soon, so the coefficients will be updated a little, and I'll post my methodology at that time. I'm not quite sure if logit is the way to go at the <1 minute remaining spots, but I'll see about that later.
_________________
ed
mtamada
Joined: 28 Jan 2005
Posts: 377
	
PostPosted: Thu Feb 09, 2006 9:10 pm    Post subject: 	Reply with quote
Fascinating stuff, but is there a typo in the formula? The coefficient on MinutesRemaining is positive, so the larger the number of minutes remaining, the more negative the argument to the exponentiation function, so the smaller the denominator, and the higher the probability of a Home Team Win.
That surely can't be right; a home team with a 10 point lead with 1 minute left should have a near guarantee of victory (I get 85.4% from your formula, assuming that I typed it in correctly). As the MinutesRemaining gets larger, shouldn't we see a decrease, not an increase, in the probability of this home team winning (while still remaining substantially above 50%)? Also 85.4% seems way too low, unless the opponent has Reggie Miller or Isiah Thomas on one of their heroic playoff rampages.
My in-my-head calculations in the first paragraph, as well as the numbers in my spreadsheet (again, assuming that I haven't made any typos) both show an ever-growing probability of victory as MinutesRemaining INCREASES. That makes perfect sense for a team that is behind, but not for a team that is ahead.
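The check described above is easy to reproduce. A minimal Python sketch of the formula exactly as posted (the function name is made up for illustration):

```python
import math

def home_win_prob(home_lead, minutes_remaining):
    """Logistic within-game estimator, coefficients as posted."""
    lead2 = home_lead * abs(home_lead)  # signed square of the lead
    z = (0.13 + 0.12 * home_lead + 0.0044 * lead2
         + 0.0068 * minutes_remaining)
    return 1.0 / (1.0 + math.exp(-z))

# Home team up 10 at various points in the game
for minutes in (1, 12, 24, 48):
    print(minutes, round(home_win_prob(10, minutes), 3))
```

With a 10-point lead and 1 minute left this comes out at roughly 85%, and the value only grows as minutes_remaining increases, which is the pattern being questioned.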
Ed Küpfer
Joined: 30 Dec 2004
Posts: 787
Location: Toronto
	
PostPosted: Thu Feb 09, 2006 9:33 pm    Post subject: 	Reply with quote
I'm pretty sure your calculations are correct. The problem is, as you can see, that the logit fit breaks down at the extremes. It's hard for me to tell exactly what's going on here as these areas are represented by only a few observations. In a previous effort, I also had real problems fitting a logistic curve to the extreme areas in both Points Difference and Time Remaining. All I can say for now is, hang on. I'm still accumulating data. If the fit is still poor, I may have to mix curves somehow. I'll keep everyone posted.
_________________
ed
mtamada
Joined: 28 Jan 2005
Posts: 377
	
PostPosted: Thu Feb 09, 2006 10:33 pm    Post subject: 	Reply with quote
Maybe some pure curve-fitting technique, such as cubic splines is the way to go.
http://mathews.ecs.fullerton.edu/n2003/ ... esMod.html
http://www.zoology.ubc.ca/~schluter/splines.html
Using your formula, I was looking for the combinations at which a team would have a 90% probability of winning. If I typed in the formula correctly, a home team with a 12 point lead always has a better than 90% chance of winning, while a team with a lead of 11 or less never has a 90% probability of winning. The first part might be plausible (but not the way the probability rises with increased MinutesRemaining), but not the second.
A friend of mine once said that he used the following as a rule of thumb: if "m" is the number of minutes remaining, then a team with a lead of 2m+7 points is practically guaranteed to win the game ("game in the refrigerator" in Chick Hearn-speak). I don't know if his rule of thumb is a good one or not, but in trying to calibrate it against your formula, I found the patterns that I've pointed out, the ones which don't seem realistic. But if you've got the data and can present, say, a table of "MinutesRemaining" and "Points Ahead" stats and the corresponding observed probabilities of winning, maybe I'll discover that the true probabilities are not what I think they should be.
			
			
									
						
										
Some Rules of Thumb (E. Kupfer)
Neil Paine
Posts: 73
Joined: Mon Apr 18, 2011 1:18 am
Location: Philadelphia
Re: Some Rules of Thumb (E. Kupfer)
Some rules of thumb
Page 2 of 3
 
gabefarkas
Joined: 31 Dec 2004
Posts: 1311
Location: Durham, NC
	
PostPosted: Fri Feb 10, 2006 7:27 am Post subject: Reply with quote
MikeT - Is there any correction in your rule of thumb for whether or not it's the home team with the lead?
Ed Küpfer
Joined: 30 Dec 2004
Posts: 783
Location: Toronto
	
PostPosted: Fri Feb 10, 2006 1:39 pm Post subject: Reply with quote
mtamada wrote:
Maybe some pure curve-fitting technique, such as cubic splines is the way to go.
Wow. That just shows how far out of my depth I am here. I'd never even heard of splines, but it looks just like what I need. What I should do is post the raw data, and give some smart people a chance to hack out a better solution. I'll do that in a separate thread.
_________________
ed
mtamada
Joined: 28 Jan 2005
Posts: 375
	
PostPosted: Sat Feb 11, 2006 7:05 am Post subject: Reply with quote
gabefarkas wrote:
MikeT - Is there any correction in your rule of thumb for whether or not it's the home team with the lead?
No, it's just an informal rule of thumb that my friend came up with.
One very interesting thing about the probabilities produced by EdK's formula is the way the homecourt advantage exists, even in a tie game late in the 4th quarter.
Ed Küpfer
Joined: 30 Dec 2004
Posts: 783
Location: Toronto
	
PostPosted: Wed Mar 08, 2006 1:56 am Post subject: Reply with quote
Say you want to match up two teams, and find the probability of one team beating the other. You model that using the old log5. But if it's early in the season, the WIN% you use in log5 are not going to reflect a team's true ability very accurately — you need to regress that WIN% to 0.500. How much do you regress?
[This is an edit. I just received in the mail a book by Tangotiger et al, titled The Book, in which they describe a much simpler and more precise method of regressing. My paraphrase of their method follows.]
Essentially, the regressed WIN% is calculated as the following:
Code:
RegressedWin% = (Win%/Var(Win%) + 0.5/0.019)/(1/Var(Win%) + 1/0.019)
where Win% is the current team Win%
Var(Win%) = Win% * (1 - Win%) / GamesPlayed
Essentially, the regressed win% is the average of the team's win% and the league win% (ie 0.5), weighted by the inverse of the variance of the true ability of the team and the league. The team variance is simple enough, it's the binomial equation VAR = p * (1 - p) / n. The league variance is tougher. I figured out a long time ago it was about 0.02 by simulation, but Tango presents a way to figure it out mathematically, through this equation:
Code:
Observed Variance = True Variance + Random Variance.
Switch that around to
Code:
True Variance = Observed Variance - Random Variance.
Observed variance is easy enough — it's about 0.022, depending on what time frame you're looking at. The random variance is also easy: 0.5 * 0.5/82 = 0.003. Therefore, true variance equals 0.0224 - 0.0030 = 0.0193.
So, if you have a team that begins the season 2-3 in their first five games, their regressed win% is
Code:
RegressedWin% = (0.4/0.048 + 0.5/0.0193) / (1/0.048 + 1/0.0193)
= 0.47
[End first edit. The hack below still works fine, with the changes included from the second edit.]
That is way too complicated. A much easier way is to add a constant to the team's win and loss record. Using 4 as a constant works great. From the example above:
Code:
Regressed Win% ~= (Wins + 4) / (Wins + 4 + Losses + 4)
~= (2 + 4) / (2 + 4 + 3 + 4)
~= 6/13 = 0.46
Close enough.
Actually the best constant to add is not always 4. In fact, a great approximation of the regressed Win% comes from using the following to calculate the constant:
Code:
Constant = 6.5 - 26*(Win% - 0.5)^2
BTW win estimators like Pythagorean and Correlated Gaussian %ages should also be regressed, especially early in the season. You can use the equations above to be on the safe side, but my testing suggests that you only need to add 1 win and 1 loss if you're using the hack version.
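Here are the long version and the hack side by side as a Python sketch. The function names are made up, and the 0.0193 true variance is the figure worked out above. Note the long version divides by the team's binomial variance, so it is undefined for a team with no wins or no losses.

```python
TRUE_VAR = 0.0193  # true variance of team Win%, from the text above

def regressed_win_pct(wins, losses, true_var=TRUE_VAR):
    """Inverse-variance weighted average of team Win% and .500."""
    n = wins + losses
    p = wins / n
    var = p * (1 - p) / n  # binomial variance; zero if p is 0 or 1
    return (p / var + 0.5 / true_var) / (1 / var + 1 / true_var)

def hack_constant(win_pct):
    """Wins and losses to add, as a function of current Win%."""
    return 6.5 - 26 * (win_pct - 0.5) ** 2

def hack_regressed_win_pct(wins, losses, constant=4.0):
    """Add the same constant to both wins and losses."""
    return (wins + constant) / (wins + losses + 2 * constant)

# The 2-3 team from the example
print(round(regressed_win_pct(2, 3), 3))
print(round(hack_regressed_win_pct(2, 3), 3))
print(round(hack_regressed_win_pct(2, 3, hack_constant(0.4)), 3))
```

For the 2-3 team the variable-constant hack lands within a thousandth of the long version, which suggests the constant formula is just the long version rearranged.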
_________________
ed
Last edited by Ed Küpfer on Thu Mar 09, 2006 3:08 am; edited 3 times in total
Mike G
Joined: 14 Jan 2005
Posts: 3476
Location: Hendersonville, NC
	
PostPosted: Wed Mar 08, 2006 5:34 am Post subject: Reply with quote
Ed Küpfer wrote:
Code:
Regressed Win% = ActualWin% - (ActualWin% - 0.5) * Regression
Code:
Regressed Win% = 0.40 - (0.40 - 0.5) * 0.49
= 0.45
A - (A - .5) = .5
.5 * .49 = .245
_________________
`
36% of all statistics are wrong
THWilson
Joined: 19 Jul 2005
Posts: 164
Location: phoenix
	
PostPosted: Wed Mar 08, 2006 11:09 am Post subject: Reply with quote
Mike G wrote:
Ed Küpfer wrote:
Code:
Regressed Win% = ActualWin% - (ActualWin% - 0.5) * Regression
Code:
Regressed Win% = 0.40 - (0.40 - 0.5) * 0.49
= 0.45
A - (A - .5) = .5
.5 * .49 = .245
You have the parentheses in the wrong place.
Mike G
Joined: 14 Jan 2005
Posts: 3476
Location: Hendersonville, NC
	
PostPosted: Wed Mar 08, 2006 11:32 am Post subject: Reply with quote
That, and I can't tell a (*) from a (+)
_________________
`
36% of all statistics are wrong
Ed Küpfer
Joined: 30 Dec 2004
Posts: 783
Location: Toronto
	
PostPosted: Thu Nov 23, 2006 9:55 pm Post subject: Reply with quote
Something more than just a rule of thumb:
Add 5 wins and 5 losses to each team's record to get a better read of their actual ability. For example, if a team starts the year 3-5 (.375), a best guess as to their true win ability is (3+5)/(3+5 + 5+5) = .444.
This accounts for regression to the mean, and is robust throughout the history of the NBA. It outperforms both pythagorean record and winning record in predicting the outcome of the next game. Plus it's easier to use than pythagorean!
_________________
ed
rob c
Joined: 08 Feb 2006
Posts: 14
	
PostPosted: Fri Nov 24, 2006 11:49 am Post subject: Reply with quote
Ed,
Have you ever done anything like this using some sort of weighting system to cater for more recent form? I have tried this sort of thing for various analyses, using an exponential coefficient which gives matches from around 45 days ago half the weight of the last match, but I am not convinced of the best coefficient.
Rob
Ed Küpfer
Joined: 30 Dec 2004
Posts: 783
Location: Toronto
	
PostPosted: Fri Nov 24, 2006 5:42 pm Post subject: Reply with quote
Every time I've looked at how much information from past games can be used to predict future games, I've never found evidence that more recent games have more information than less recent games. Trying again now, a quickie regression shows that if you include a pyth or win percentage in the model, including the outcome of recent games as a separate variable adds nothing. Specifically, the last 5 games is not a significant factor. Maybe a longer span works, but I can't imagine that it would significantly improve on the hack above or pyth.
_________________
ed
cherokee_ACB
Joined: 22 Mar 2006
Posts: 157
	
PostPosted: Sat Nov 25, 2006 9:50 am Post subject: Reply with quote
Ed Küpfer wrote:
Every time I've looked at how much information from past games can be used to predict future games, I've never found evidence that more recent games have more information than less recent games. Trying again now, a quickie regression shows that if you include a pyth or win percentage in the model, including the outcome of recent games as a separate variable adds nothing. Specifically, the last 5 games is not a significant factor. Maybe a longer span works, but I can't imagine that it would significantly improve on the hack above or pyth.
Do you take into account strength of opponents in recent games? Could it influence the results?
Ed Küpfer
Joined: 30 Dec 2004
Posts: 783
Location: Toronto
	
PostPosted: Sat Nov 25, 2006 12:25 pm Post subject: Reply with quote
cherokee_ACB wrote:
Do you take into account strength of opponents in recent games? Could it influence the results?
It could, but you're starting to make these calculations really complicated. It is a rules of thumb thread, after all.
_________________
ed
dlirag
Joined: 30 Dec 2004
Posts: 29
	
PostPosted: Fri Jan 05, 2007 8:09 pm Post subject: Reply with quote
Ed Küpfer wrote:
Add 5 wins and 5 losses to each team's record to get a better read of their actual ability. For example, if a team starts the year 3-5 (.375), a best guess as to their true win ability is (3+5)/(3+5 + 5+5) = .444.
This accounts for regression to the mean, and is robust throughout the history of the NBA. It outperforms both pythagorean record and winning record in predicting the outcome of the next game. Plus it's easier to use than pythagorean!
Is it possible to make a similar rule of thumb for a player's projected scoring?
Ed Küpfer
Joined: 30 Dec 2004
Posts: 783
Location: Toronto
	
PostPosted: Tue Feb 13, 2007 4:25 pm Post subject: Reply with quote
dlirag wrote:
Ed Küpfer wrote:
Add 5 wins and 5 losses to each team's record to get a better read of their actual ability. For example, if a team starts the year 3-5 (.375), a best guess as to their true win ability is (3+5)/(3+5 + 5+5) = .444.
This accounts for regression to the mean, and is robust throughout the history of the NBA. It outperforms both pythagorean record and winning record in predicting the outcome of the next game. Plus it's easier to use than pythagorean!
Is it possible to make a similar rule of thumb for a player's projected scoring?
I didn't reply to this sooner because I've been thinking about it a lot. My conclusion is that there is a way, but it would be too complicated to work out.
However, a related concept is the confidence interval. These are easy enough to calculate for binomials like FT% and p3%, but not so easy for multinomials like EFG% and TS%. In Tango, Lichtman, and Dolphin's THE BOOK, they outline a way to calculate multinomial variance, which can be extended to find confidence intervals. I've discussed their method elsewhere on this site, but I'll show a quickie version here so I can find it when I need it.
Here is the specific formula for the EFG% variance:
Code:
var(EFG%) = ((p2m/FGA)*(1-p2m/FGA) + 2.25*(p3m/FGA)*(1-p3m/FGA) - 3*(p2m/FGA)*(p3m/FGA))/FGA
Look at two Kings players in 0506:
Bibby:
p2 405-885 45.8%
p3 192-497 38.6%
EFG = 50.1%
var(EFG) = (1.6%)^2
95% CI = +/- 1.6%*2 = +/- 3.2%
Price:
p2 15-31 48.4%
p3 6-27 22.2%
EFG = 41.4%
var(EFG) = (7.4%)^2
95% CI = +/- 7.4%*2 = +/- 14.9%
Here is the method generalised for multinomials comprising up to four outcomes (eg ORTG):
Code:
outcome  weights  n     p
0        w0       n0    p0 = n0/nTOT
1        w1       n1    p1
2        w2       n2    p2
3        w3       n3    p3
ALL      --       nTOT  --
<x> = w0p0 + w1p1 + w2p2 + w3p3
<x2> = (w0^2)p0 + (w1^2)p1 + (w2^2)p2 + (w3^2)p3
stdev = sqrt( (<x2> - <x>^2)/nTOT)
The link above shows how to estimate the p's.
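The generalised method as a Python sketch, checked against the two Kings examples. The function name is made up, and the p's are estimated here simply as n/nTOT (as in the p0 row of the table), which may not be the estimate the linked discussion recommends. For EFG% the outcomes are a miss (weight 0), a made two (weight 1) and a made three (weight 1.5).

```python
import math

def weighted_multinomial_se(weights, counts):
    """Return (mean, standard error) of a weighted multinomial rate."""
    n_tot = sum(counts)
    probs = [n / n_tot for n in counts]
    mean = sum(w * p for w, p in zip(weights, probs))          # <x>
    mean_sq = sum(w * w * p for w, p in zip(weights, probs))   # <x2>
    return mean, math.sqrt((mean_sq - mean ** 2) / n_tot)

EFG_WEIGHTS = [0, 1, 1.5]  # miss, made 2, made 3

# Bibby: 405-885 on twos, 192-497 on threes
bibby_fga = 885 + 497
print(weighted_multinomial_se(EFG_WEIGHTS, [bibby_fga - 405 - 192, 405, 192]))

# Price: 15-31 on twos, 6-27 on threes
price_fga = 31 + 27
print(weighted_multinomial_se(EFG_WEIGHTS, [price_fga - 15 - 6, 15, 6]))
```

Doubling the standard error gives the rough 95% interval used in the examples.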
_________________
ed
Last edited by Ed Küpfer on Thu Feb 15, 2007 12:34 pm; edited 1 time in total
Ed Küpfer
Joined: 30 Dec 2004
Posts: 783
Location: Toronto
	
PostPosted: Thu Feb 15, 2007 1:37 am Post subject: Reply with quote
Ed Küpfer wrote:
Here is the specific formula for the EFG% variance:
Code:
var(EFG%) = ((p2m/FGA)*(1-p2m/FGA) + 2.25*(p3m/FGA)*(1-p3m/FGA) - 3*(p2m/FGA)*(p3m/FGA))/FGA
An easy hack to calculate EFG% variance, or rather standard errors:
Code:
se_EFG% ~ 1.4 * sqrt( EFG% * (1.5 - EFG%) / FGA)
Works okay as far as I can tell. There is no hack for the other multinomials like TS% or ORTG -- you just have to do them the long way.
fake edit: oh, alright. Here's a hack for points/play:
Code:
HHH = (p2A/play)^2 + (p3A/play)^2 + (.44*FTA/play)^2 + (TO/play)^2
se_pts/play ~ (HHH^.044)/sqrt(play)
_________________
ed
			
			
									
						
										
						Page 2 of 3
Post new topic Reply to topic APBRmetrics Forum Index -> General discussion
View previous topic :: View next topic
Author Message
gabefarkas
Joined: 31 Dec 2004
Posts: 1311
Location: Durham, NC
PostPosted: Fri Feb 10, 2006 7:27 am Post subject: Reply with quote
MikeT - Is there any correction in your rule of thumb for whether or not it's the home team with the lead?
Back to top
View user's profile Send private message Send e-mail AIM Address
Ed Küpfer
Joined: 30 Dec 2004
Posts: 783
Location: Toronto
PostPosted: Fri Feb 10, 2006 1:39 pm Post subject: Reply with quote
mtamada wrote:
Maybe some pure curve-fitting technique, such as cubic splines is the way to go.
Wow. That just shows how far out of my depth I am here. I'd never even heard of splines, but it looks just like what I need. What I should do is post the raw data, and give some smart people a chance to hack out a better solution. I'll do that in a separate thread.
_________________
ed
Back to top
View user's profile Send private message Send e-mail
mtamada
Joined: 28 Jan 2005
Posts: 375
PostPosted: Sat Feb 11, 2006 7:05 am Post subject: Reply with quote
gabefarkas wrote:
MikeT - Is there any correction in your rule of thumb for whether or not it's the home team with the lead?
No, it's just an informal rule of thumb that my friend came up with.
One very interesting thing about the probabilities produced by EdK's formula is the way the homecourt advantage exists, even in a tie game late in the 4th quarter.
Back to top
View user's profile Send private message
Ed Küpfer
Joined: 30 Dec 2004
Posts: 783
Location: Toronto
PostPosted: Wed Mar 08, 2006 1:56 am Post subject: Reply with quote
Say you want to match up two teams, and find the probability of one team beating the other. You model that using the old log5. But if it's early in the season, the WIN% you use in log5 are not going to reflect a team's true ability very accurately — you need to regress that WIN% to 0.500. How much do you regress?
[This is an edit. I just recieved in the mail a book by Tangotiger et al, titled The Book, in which they describe a much simpler and more precise method of regressing. My paraphrase of their method follows.]
Essentially, the regressed WIN% is calculated as the following:
Code:
RegressedWin% = (Win%/Var(Win%) + 0.5/0.019)/(1/Var(Win%) + 1/0.019)
where Win% is the current team Win%
Var(Win%) = Win% * (1 - Win%) / GamesPlayed
Essentially, the regressed win% is the average of the team's win% and the league win% (ie 0.5), weighted by the inverse of the variance of the true ability of the team and the league. The team variance is simple enough, it's the binomial equation VAR = p * (1 - p) / n. The league variance is tougher. I figured out a long time ago it was about 0.02 by simulation, but Tango presents a way to figure it out mathematically, through this equation:
Code:
Observed Variance = True Variance + Random Variance.
Switch that around to
Code:
True Variance = Observed Variance- Random Variance.
Observed variance is easy enough — it's about 0.022, depending on what time frame you're looking at. The random variance is also easy: 0.5 * 0.5/82 = 0.003. Therefore, true variance equals 0.0224 - 0.0030 = 0.0193.
So, if you have a team that begins the season 2-3 in their first five games, their regressed win% is
Code:
RegressedWin% = (0.4/0.048 + 0.5/0.0193) / (1/0.048 + 1/0.0193)
= 0.47
[End first edit. The hack below still works fine, with the changes included from the second edit.]
That is way too complicated. A much easier way is to add a constant to the team's win and loss record. Using 4 as a constant works great. From the example above:
Code:
Regressed Win% ~= (Wins + 4) / (Wins + 4 + Losses + 4)
~= (2 + 4) / (2 + 4 + 3 + 4)
~= 6/13 = 0.46
Close enough.
Actually the best constant to add is not always 4. In fact, a great approximation of the regressed Win% comes from using the following to calculate the constant:
Code:
Constant = 6.5 - 26*(Win% - 0.5)^2
BTW win estimators like Pythatgorean and Correlated Gaussian %ages should also be regressed, especially early in the season. You can use the equations above to be on the safe side, but my testing suggests that you only need to add 1 win and 1 loss if you're using the hack version.
_________________
ed
Last edited by Ed Küpfer on Thu Mar 09, 2006 3:08 am; edited 3 times in total
Back to top
View user's profile Send private message Send e-mail
Mike G
Joined: 14 Jan 2005
Posts: 3476
Location: Hendersonville, NC
PostPosted: Wed Mar 08, 2006 5:34 am Post subject: Reply with quote
Ed Küpfer wrote:
Code:
Regressed Win% = ActualWin% - (ActualWin% - 0.5) * Regression
Code:
Regressed Win% = 0.40 - (0.40 - 0.5) * 0.49
= 0.45
A - (A - .5) = .5
.5 * .49 = .245
_________________
`
36% of all statistics are wrong
THWilson
Joined: 19 Jul 2005
Posts: 164
Location: phoenix
PostPosted: Wed Mar 08, 2006 11:09 am Post subject: Reply with quote
Mike G wrote:
Ed Küpfer wrote:
Code:
Regressed Win% = ActualWin% - (ActualWin% - 0.5) * Regression
Code:
Regressed Win% = 0.40 - (0.40 - 0.5) * 0.49
= 0.45
A - (A - .5) = .5
.5 * .49 = .245
You have the parentheses in the wrong place.
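Written out as code, the quoted formula leaves no room for misreading the order of operations: the difference from 0.5 is taken first, then scaled by the regression factor, then subtracted from the actual win%. A small sketch (the function name is made up; the numbers are the ones quoted above):

```python
def regress_to_mean(actual_pct, regression, mean=0.5):
    """Pull actual_pct toward the mean by the given regression fraction."""
    distance_from_mean = actual_pct - mean      # 0.40 - 0.5 = -0.10
    correction = distance_from_mean * regression  # -0.10 * 0.49 = -0.049
    return actual_pct - correction                # 0.40 + 0.049


print(round(regress_to_mean(0.40, 0.49), 3))  # 0.449
```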
Mike G
Joined: 14 Jan 2005
Posts: 3476
Location: Hendersonville, NC
PostPosted: Wed Mar 08, 2006 11:32 am Post subject: Reply with quote
That, and I can't tell a (*) from a (+)
_________________
`
36% of all statistics are wrong
Ed Küpfer
Joined: 30 Dec 2004
Posts: 783
Location: Toronto
PostPosted: Thu Nov 23, 2006 9:55 pm Post subject: Reply with quote
Something more than just a rule of thumb:
Add 5 wins and 5 losses to each team's record to get a better read of their actual ability. For example, if a team starts the year 3-5 (.375), a best guess as to their true win ability is (3+5)/(3+5 + 5+5) = .444.
This accounts for regression to the mean, and is robust throughout the history of the NBA. It outperforms both pythagorean record and winning record in predicting the outcome of the next game. Plus it's easier to use than pythagorean!
_________________
ed
rob c
Joined: 08 Feb 2006
Posts: 14
PostPosted: Fri Nov 24, 2006 11:49 am Post subject: Reply with quote
Ed,
Have you ever done anything like this using some sort of weighting system to cater for more recent form? I have tried this sort of thing for various analyses, using an exponential coefficient which gives matches from around 45 days ago half the weight of the last match, but I am not convinced of the best coefficient.
Rob
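One way to read the weighting described in this post is as an exponential decay with a 45-day half-life. The sketch below assumes that reading; the function names and the three-game sample are hypothetical, and the half-life is left as a parameter since the best value is exactly what is in doubt.

```python
def recency_weight(days_ago, half_life_days=45.0):
    """Weight of a game played days_ago days before the most recent one."""
    return 0.5 ** (days_ago / half_life_days)


def weighted_win_pct(games, half_life_days=45.0):
    """games: iterable of (days_ago, won) pairs; returns recency-weighted win%."""
    total_weight = 0.0
    weighted_wins = 0.0
    for days_ago, won in games:
        w = recency_weight(days_ago, half_life_days)
        total_weight += w
        if won:
            weighted_wins += w
    return weighted_wins / total_weight


# Made-up results: a win today, a loss 45 days ago, a win 90 days ago.
sample = [(0, True), (45, False), (90, True)]
print(recency_weight(45))        # 0.5
print(weighted_win_pct(sample))  # weights 1, 0.5, 0.25 -> 1.25 / 1.75
```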
Ed Küpfer
Joined: 30 Dec 2004
Posts: 783
Location: Toronto
PostPosted: Fri Nov 24, 2006 5:42 pm Post subject: Reply with quote
Every time I've looked at how much information from past games can be used to predict future games, I've never found evidence that more recent games have more information than less recent games. Trying again now, a quickie regression shows that if you include a pyth or win percentage in the model, including the outcome of recent games as a separate variable adds nothing. Specifically, the last 5 games is not a significant factor. Maybe a longer span works, but I can't imagine that it would significantly improve on the hack above or pyth.
_________________
ed
cherokee_ACB
Joined: 22 Mar 2006
Posts: 157
PostPosted: Sat Nov 25, 2006 9:50 am Post subject: Reply with quote
Ed Küpfer wrote:
Every time I've looked at how much information from past games can be used to predict future games, I've never found evidence that more recent games have more information than less recent games. Trying again now, a quickie regression shows that if you include a pyth or win percentage in the model, including the outcome of recent games as a separate variable adds nothing. Specifically, the last 5 games is not a significant factor. Maybe a longer span works, but I can't imagine that it would significantly improve on the hack above or pyth.
Do you take into account strength of opponents in recent games? Could it influence the results?
Ed Küpfer
Joined: 30 Dec 2004
Posts: 783
Location: Toronto
PostPosted: Sat Nov 25, 2006 12:25 pm Post subject: Reply with quote
cherokee_ACB wrote:
Do you take into account strength of opponents in recent games? Could it influence the results?
It could, but you're starting to make these calculations really complicated. It is a rules of thumb thread, after all.
_________________
ed
dlirag
Joined: 30 Dec 2004
Posts: 29
PostPosted: Fri Jan 05, 2007 8:09 pm Post subject: Reply with quote
Ed Küpfer wrote:
Add 5 wins and 5 losses to each team's record to get a better read of their actual ability. For example, if a team starts the year 3-5 (.375), a best guess as to their true win ability is (3+5)/(3+5 + 5+5) = .444.
This accounts for regression to the mean, and is robust throughout the history of the NBA. It outperforms both pythagorean record and winning record in predicting the outcome of the next game. Plus it's easier to use than pythagorean!
Is it possible to make a similar rule of thumb for a player's projected scoring?
Ed Küpfer
Joined: 30 Dec 2004
Posts: 783
Location: Toronto
PostPosted: Tue Feb 13, 2007 4:25 pm Post subject: Reply with quote
dlirag wrote:
Ed Küpfer wrote:
Add 5 wins and 5 losses to each team's record to get a better read of their actual ability. For example, if a team starts the year 3-5 (.375), a best guess as to their true win ability is (3+5)/(3+5 + 5+5) = .444.
This accounts for regression to the mean, and is robust throughout the history of the NBA. It outperforms both pythagorean record and winning record in predicting the outcome of the next game. Plus it's easier to use than pythagorean!
Is it possible to make a similar rule of thumb for a player's projected scoring?
I didn't reply to this sooner because I've been thinking about it a lot. My conclusion is that there is a way, but it would be too complicated to work out.
However, a related concept is the confidence interval. These are easy enough to calculate for binomials like FT% and p3%, but not so easy for multinomials like EFG% and TS%. In Tango, Lichtman, and Dolphin's THE BOOK, they outline a way to calculate multinomial variance, which can be extended to find confidence intervals. I've discussed their method elsewhere on this site, but I'll show a quickie version here so I can find it when I need it.
Here is the specific formula for the EFG% variance:
Code:
var(EFG%) = ((p2m/FGA)*(1-p2m/FGA) + 2.25*(p3m/FGA)*(1-p3m/FGA) - 3*(p2m/FGA)*(p3m/FGA))/FGA
Look at two Kings players in 0506:
Bibby:
p2 405-885 45.8%
p3 192-497 38.6%
EFG = 50.1%
var(EFG) = (1.6%)^2
95% CI = +/- 1.6%*2 = +/- 3.2%
Price:
p2 15-31 48.4%
p3 6-27 22.2%
EFG = 41.4%
var(EFG) = (7.4%)^2
95% CI = +/- 7.4%*2 = +/- 14.9%
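The two examples can be checked with a short Python sketch of the var(EFG%) formula. The function name is made up for illustration; the shooting lines are the ones listed above, and the 95% interval uses the same "2 standard errors" shortcut.

```python
import math


def efg_and_se(p2m, p2a, p3m, p3a):
    """Return (EFG%, standard error) from made/attempted twos and threes."""
    fga = p2a + p3a
    p2 = p2m / fga  # made twos as a share of all attempts
    p3 = p3m / fga  # made threes as a share of all attempts
    efg = p2 + 1.5 * p3
    var = (p2 * (1 - p2) + 2.25 * p3 * (1 - p3) - 3 * p2 * p3) / fga
    return efg, math.sqrt(var)


lines = {"Bibby": (405, 885, 192, 497), "Price": (15, 31, 6, 27)}
for name, line in lines.items():
    efg, se = efg_and_se(*line)
    print(f"{name}: EFG = {efg:.1%}, SE = {se:.1%}, 95% CI = +/- {2 * se:.1%}")
```

Run on these lines, it reproduces the figures above to rounding: a standard error of about 1.6% for Bibby and about 7.4% for Price.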
Here is the method generalised for multinomials comprising up to four outcomes (eg ORTG):
Code:
outcome   weights   n      p
0         w0        n0     p0 = n0/nTOT
1         w1        n1     p1
2         w2        n2     p2
3         w3        n3     p3
ALL       --        nTOT   --
<x> = w0p0 + w1p1 + w2p2 + w3p3
<x2> = (w0^2)p0 + (w1^2)p1 + (w2^2)p2 + (w3^2)p3
stdev = sqrt( (<x2> - <x>^2)/nTOT)
The link above shows how to estimate the p's.
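A sketch of the generalised method in Python, assuming the p's are simply the observed shares n/nTOT as in the table. The function name is hypothetical. EFG% is used as the test case because it is the three-outcome version worked through earlier in this post: weights 0, 1 and 1.5 for a miss, a made two and a made three.

```python
import math


def weighted_multinomial(weights, counts):
    """Return (<x>, stdev) for outcomes with the given weights and counts."""
    n_tot = sum(counts)
    probs = [n / n_tot for n in counts]
    mean = sum(w * p for w, p in zip(weights, probs))         # <x>
    mean_sq = sum(w * w * p for w, p in zip(weights, probs))  # <x2>
    stdev = math.sqrt((mean_sq - mean ** 2) / n_tot)
    return mean, stdev


# Bibby: 1382 FGA, of which 405 made twos and 192 made threes, so 785 misses.
efg, se = weighted_multinomial([0.0, 1.0, 1.5], [785, 405, 192])
print(f"EFG = {efg:.3f}, stdev = {se:.3f}")
```

This gives the same 1.6% as the specific EFG% formula, which is expected since that formula is the three-outcome special case.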
_________________
ed
Last edited by Ed Küpfer on Thu Feb 15, 2007 12:34 pm; edited 1 time in total
Ed Küpfer
Joined: 30 Dec 2004
Posts: 783
Location: Toronto
PostPosted: Thu Feb 15, 2007 1:37 am Post subject: Reply with quote
Ed Küpfer wrote:
Here is the specific formula for the EFG% variance:
Code:
var(EFG%) = ((p2m/FGA)*(1-p2m/FGA) + 2.25*(p3m/FGA)*(1-p3m/FGA) - 3*(p2m/FGA)*(p3m/FGA))/FGA
An easy hack to calculate EFG% variance, or rather standard errors:
Code:
se_EFG% ~ 1.4 * sqrt( EFG% * (1.5 - EFG%) / FGA)
Works okay as far as I can tell. There is no hack for the other multinomials like TS% or ORTG -- you just have to do them the long way.
fake edit: oh, alright. Here's a hack for points/play:
Code:
HHH = (p2A/play)^2 + (p3A/play)^2 + (.44*FTA/play)^2 + (TO/play)^2
se_pts/play ~ (HHH^.044)/sqrt(play)
_________________
ed
Neil Paine
Joined: Mon Apr 18, 2011 1:18 am
Posts: 73
Location: Philadelphia
Re: Some Rules of Thumb (E. Kupfer)
Analyze This
Joined: 17 May 2005
Posts: 364
	
PostPosted: Thu Feb 15, 2007 2:12 am Post subject: Reply with quote
I don't know if you work for an nba team, but if not somebody give this man an nba job.
_________________
Where There's a WilT There's a Way
HoopStudies
Joined: 30 Dec 2004
Posts: 705
Location: Near Philadelphia, PA
	
PostPosted: Fri Feb 16, 2007 7:45 am Post subject: Reply with quote
Analyze This wrote:
I don't know if you work for an nba team, but if not somebody give this man an nba job.
He does and he got it by showing the kind of effort and quality work that he posts.
_________________
Dean Oliver
Author, Basketball on Paper
The postings are my own & don't necess represent positions, strategies or opinions of employers.
Ed Küpfer
Joined: 30 Dec 2004
Posts: 785
Location: Toronto
	
PostPosted: Sat Feb 17, 2007 2:08 pm Post subject: Reply with quote
Ed Küpfer wrote:
Ed Küpfer wrote:
Here is the specific formula for the EFG% variance:
Code:
var(EFG%) = ((p2m/FGA)*(1-p2m/FGA) + 2.25*(p3m/FGA)*(1-p3m/FGA) - 3*(p2m/FGA)*(p3m/FGA))/FGA
An easy hack to calculate EFG% variance, or rather standard errors:
Code:
se_EFG% ~ 0.79 * sqrt( EFG% * (1.5 - EFG%) / FGA)
Works okay as far as I can tell. There is no hack for the other multinomials like TS% or ORTG -- you just have to do them the long way.
fake edit: oh, alright. Here's a hack for points/play:
Code:
HHH = (p2A/play)^2 + (p3A/play)^2 + (.44*FTA/play)^2 + (TO/play)^2
se_pts/play ~ (HHH^.044)/sqrt(play)
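A quick way to see how the EFG% hack with the 0.79 coefficient compares with the full formula is to run both on the Bibby and Price lines from earlier in the thread. This is only a sketch; the function names are invented here.

```python
import math


def se_efg_exact(p2m, p3m, fga):
    """Standard error of EFG% from the full var(EFG%) formula."""
    p2 = p2m / fga
    p3 = p3m / fga
    var = (p2 * (1 - p2) + 2.25 * p3 * (1 - p3) - 3 * p2 * p3) / fga
    return math.sqrt(var)


def se_efg_hack(efg, fga, coefficient=0.79):
    """Shortcut: coefficient * sqrt(EFG% * (1.5 - EFG%) / FGA)."""
    return coefficient * math.sqrt(efg * (1.5 - efg) / fga)


for name, p2m, p3m, fga in [("Bibby", 405, 192, 1382), ("Price", 15, 6, 58)]:
    efg = (p2m + 1.5 * p3m) / fga
    exact = se_efg_exact(p2m, p3m, fga)
    hack = se_efg_hack(efg, fga)
    print(f"{name}: exact = {exact:.4f}, hack = {hack:.4f}")
```

On these two players the shortcut comes out a little under the full formula (roughly 1.5% against 1.6% for Bibby, 7.0% against 7.4% for Price), which fits "works okay".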
_________________
ed
tpryan
Joined: 11 Feb 2005
Posts: 100
	
PostPosted: Sat Feb 17, 2007 3:45 pm Post subject: Reply with quote
Ed Küpfer wrote:
Here is the specific formula for the EFG% variance:
Code:
var(EFG%) = ((p2m/FGA)*(1-p2m/FGA) + 2.25*(p3m/FGA)*(1-p3m/FGA) - 3*(p2m/FGA)*(p3m/FGA))/FGA
Simple question: What is the original source for this expression?
I can't find it in BoP or by using Google. The reason I ask is because there are two sample sizes involved in the EFG% expression, namely the number of 2-pt. attempts and the number of 3-pt. attempts, but these are not involved in the variance given above. That is, I would expect to see something like FGA(2) and FGA(3) involved in the expression. The last term in the expression presumably estimates the covariance, but I was surprised to see -3 rather than +3. I would just like to see the math that produces the expression, or at least find it sketched out somewhere, so that I can follow how the expression was obtained.
Ed Küpfer
Joined: 30 Dec 2004
Posts: 785
Location: Toronto
	
PostPosted: Sat Feb 17, 2007 4:13 pm Post subject: Reply with quote
tpryan wrote:
Ed Küpfer wrote:
Here is the specific formula for the EFG% variance:
Code:
var(EFG%) = ((p2m/FGA)*(1-p2m/FGA) + 2.25*(p3m/FGA)*(1-p3m/FGA) - 3*(p2m/FGA)*(p3m/FGA))/FGA
Simple question: What is the original source for this expression?
One Ray Koopman sketched it out for me. I've since discovered that it is a special case of a general formula for multinomial variance, which goes something like this:
For outcomes 0 pts, 2 pts, 3 pts,
p0 = (p2missed + p3missed)/FGA
p2 = (p2made)/FGA
p3 = (p3made)/FGA
Weights (scaled to EFG%, ie pts/2)
w0 = 0pts / 2
w2 = 2pts / 2 = 1
w3 = 3pts / 2 = 1.5
EFG can be reformulated as
EFG = x = w0p0 + w2p2 + w3p3
To get the variance, you need to first square the weights in EFG, like this
x2 = (w0^2)p0 + (w2^2)p2 + (w3^2)p3
= (0)p0 + (1)p2 + (2.25)p3 = p2 + (2.25)p3
Finally
VAR(EFG) = (x2 - x^2)/FGA
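Expanding that last line term by term recovers the specific formula quoted at the top of this post, including the -3 cross term. With p2 and p3 as defined above (made twos and made threes as shares of FGA):

```latex
\begin{aligned}
x   &= p_2 + 1.5\,p_3, \qquad x_2 = p_2 + 2.25\,p_3 \\
x_2 - x^2 &= p_2 + 2.25\,p_3 - \left(p_2^2 + 3\,p_2 p_3 + 2.25\,p_3^2\right) \\
          &= p_2(1 - p_2) + 2.25\,p_3(1 - p_3) - 3\,p_2 p_3 \\
\mathrm{VAR}(\mathrm{EFG}) &= \frac{p_2(1 - p_2) + 2.25\,p_3(1 - p_3) - 3\,p_2 p_3}{\mathrm{FGA}}
\end{aligned}
```

The sign of the cross term comes from the multinomial covariance: the shares of two mutually exclusive outcomes have covariance -p2*p3/FGA, and the factor 3 is 2 * w2 * w3 = 2 * 1 * 1.5.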
_________________
ed
tpryan
Joined: 11 Feb 2005
Posts: 100
	
PostPosted: Sat Feb 17, 2007 5:20 pm Post subject: Reply with quote
Thanks, Ed. I see that Ray raised some questions when he answered your question. I'm inclined to do the same. I believe there is more than one way of looking at this, and Ray's answer seems to suggest that he would agree.
One immediate thought is the following.
In the multinomial case, there are more than two possible outcomes, as noted. An example would be rolling a die, which of course has 6 possible outcomes. The probability of observing, say, 2 ones, 2 sixes, 2 fives, 1 four, 1 two and 1 three when a die is rolled 9 times is a multinomial probability.
In basketball, however, the "trials" differ. That is, the same type of shot is not attempted each time, with the outcome being either 0, 2, or 3 points. (I.e., obviously it is impossible for a 3-pt. shot attempt to result in 2 points.) Since the trials differ (unlike the die example), my initial reaction is that this is not a multinomial problem. (I see that Ray also questions the multinomial assumption. I haven't worked it out, but his variance result could presumably be obtained by using the variance and covariance expressions at http://en.wikipedia.org/wiki/Multinomial_distribution.)
Instead, I believe the appropriate variance would be obtained by finding the variance of the weighted average of proportions, since we may express EFG% as (n_1*p_1 + 1.5*(n_2*p_2))/FGA, with n_1 denoting the number of 2-pt. FGAs, p_1 the proportion made of 2-pt. attempts, n_2 the number of 3-pt. FGAs, and p_2 the proportion of 3-pt. attempts made. (Of course n_1*p_1 just gives us the number of 2-pt. FGs, and similarly for 3-pt. FGs.)
When viewed in this way, the appropriate variance would be found by computing the variance of a linear combination of proportions, and the result would be a function of the number of 2-pt. FGAs and the number of 3-pt. FGAs. The covariance could be a bit sticky, however, because the number of 3-pt. shots attempted and made is certainly not independent of the number of 2-pt. shots attempted and made.
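For comparison, here is a rough Python sketch of that linear-combination view under the simplest possible assumptions: n_1 and n_2 are treated as fixed, and the two proportions as independent binomials, so the covariance flagged above is ignored entirely. The function name is hypothetical, and the shooting line is Bibby's from earlier in the thread.

```python
import math


def se_efg_two_binomials(makes_2, n_1, makes_3, n_2):
    """SE of EFG% = (n_1*p_1 + 1.5*n_2*p_2)/FGA with n_1, n_2 held fixed."""
    fga = n_1 + n_2
    p_1 = makes_2 / n_1  # proportion of 2-pt. attempts made
    p_2 = makes_3 / n_2  # proportion of 3-pt. attempts made
    var_made_2 = n_1 * p_1 * (1 - p_1)  # binomial variance of made twos
    var_made_3 = n_2 * p_2 * (1 - p_2)  # binomial variance of made threes
    # Assumption: zero covariance between the two counts.
    return math.sqrt(var_made_2 + 1.5 ** 2 * var_made_3) / fga


print(f"{se_efg_two_binomials(405, 885, 192, 497):.4f}")
```

Under those assumptions Bibby's standard error comes out at about 1.6%, essentially the same as the multinomial formula, so for this player at least the two views differ mainly in how they treat the attempt counts rather than in the number they return.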
At least this is my initial reaction. I'll think about it further tonight.
Thanks again for the details.
Tom
tpryan
Joined: 11 Feb 2005
Posts: 100
	
PostPosted: Wed Feb 21, 2007 12:53 am Post subject: Reply with quote
Well, I've thought about this some more and my view hasn't changed. We clearly have two binomials rather than one multinomial.
We also have a mess because n_1 and n_2 are obviously negatively correlated, in addition to not being fixed, and their sum isn't even fixed. There is probably also at least a slight dependency between the two sample proportions, although this would of course vary from team to team, as would the proportions themselves.
I don't see anything that could be done other than taking a descriptive statistics approach and computing the variability over games for each team.
Justin, if you think about it, please remember me to Angela Dean, Noel Cressie and Tom Santner if you happen to be talking to them. (Angela may not think of me fondly at the moment, however, since my design book is now competing against her book. Noel is on the advisory board for the series in which my book is published.)
Tom
Ed Küpfer
Joined: 30 Dec 2004
Posts: 785
Location: Toronto
	
PostPosted: Wed Feb 21, 2007 2:34 am Post subject: Reply with quote
Appreciate the thoughts.
tpryan wrote:
I don't see anything that could be done other than taking a descriptive statistics approach and computing the variability over games for each team.
Here is Chris Duhon's season so far:
http://www.editgrid.com/user/edkupfer/Duhon
Summary stats (weights = attempts/game):
Code:
         FG%    p2%    p3%    EFG%
Mean     0.39   0.40   0.38   0.44
SE       0.23   0.22   0.30   0.27
wtMean   0.40   0.42   0.36   0.46
wtSE     0.21   0.21   0.25   0.25
Look at those huge errors! Those cannot be right. Looking at FG%, the SE from the binomial formula thingy is
SE = sqrt(FG%*(1-FG%)/FGA) = sqrt(.4*.6/468) = 0.022
or 1/10 the SE you get by looking at the game-by-game stats. Maybe at the team level that approach works better because the number of attempts is more constant from game to game, but at the player level you can't get a good estimate on the error by looking at the game-to-game variation.
Now, it simply must be the case that the error of EFG% is greater than FG%, but I can't imagine it would be that much greater -- certainly not 0.25, especially on almost 500 attempts!
Using the handy dandy SE EFG% estimator posted above, you get 0.027 as your standard error, which is slightly greater than the 0.022 for FG%.
I can understand your reluctance to use such a method, which seems to lack any theoretical justification. I appreciate that, but I can't help you there. What I need is a reasonable way to estimate these errors, and screw theory. It seems totally plausible to me that the error on a 3-outcome multinomial will be slightly greater than that of a 2-outcome. The method I described in an earlier post returns numbers that look good. I can't imagine that they will be too far off, and since I'm not doing rocket surgery, it's a hack I can live with. Especially since the alternative, looking at game-by-game variation, returns numbers that are unusable.
_________________
ed
gabefarkas
Joined: 31 Dec 2004
Posts: 1313
Location: Durham, NC
	
PostPosted: Wed Feb 21, 2007 8:07 am Post subject: Reply with quote
tpryan wrote:
When viewed in this way, the appropriate variance would be found by computing the variance of a linear combination of proportions, and the result would be a function of the number of 2-pt. FGAs and the number of 3-pt. FGAs. The covariance could be a bit sticky, however, because the number of 3-pt. shots attempted and made is certainly not independent of the number of 2-pt. shots attempted and made.
I'm not sure why they are necessarily dependent. A player comes down the court with his team in possession of the ball. He is standing beyond the 3-point arc and his teammate passes him the ball. He looks at the basket for a split second before remembering that his name is Shaquille and he passes it off to another teammate. Later on in the same game our player finds himself situated much closer to the basket. This time he receives an entry pass, makes a post move, and scores a two point lay in.
I could give an example of the contrapositive, but I think we all get it.
To me, these are two independent events, each determined by a player's ability and willingness to attempt that type of shot.
tpryan
Joined: 11 Feb 2005
Posts: 100
	
PostPosted: Thu Feb 22, 2007 6:06 am Post subject: Reply with quote
gabefarkas wrote:
tpryan wrote:
When viewed in this way, the appropriate variance would be found by computing the variance of a linear combination of proportions, and the result would be a function of the number of 2-pt. FGAs and the number of 3-pt. FGAs. The covariance could be a bit sticky, however, because the number of 3-pt. shots attempted and made is certainly not independent of the number of 2-pt. shots attempted and made.
I'm not sure why they are necessarily dependent. A player comes down the court with his team in possession of the ball. He is standing beyond the 3-point arc and his teammate passes him the ball. He looks at the basket for a split second before remembering that his name is Shaquille and he passes it off to another teammate. Later on in the same game our player finds himself situated much closer to the basket. This time he receives an entry pass, makes a post move, and scores a two point lay in.
I could give an example of the contrapositive, but I think we all get it.
To me, these are two independent events, each determined by a player's ability and willingness to attempt that type of shot.
It is the two percentages that I was contending are dependent, and this is what, if true, would make it difficult to compute the variance, in addition to the other problems that I mentioned.
It sounds to me like you are saying that the type of shot attempted is independent when we look at a series of trips down the court. I would certainly agree with that.
Here is what I was thinking.
Of the current top 50 in 3-pt. percentage, Brent Barry ranks 4th at 46%. Of that group, he is tied for 1st in 2-pt. percentage at 57%. Thus, loosely speaking, his ranks are correlated.
What surprised me, however, is that when I did some cutting and pasting and used software to compute the correlations, the Pearson correlation between the two sets of percentages was only .135, and when I ranked the two sets and computed the rank (Spearman) correlation, it was zero to the first few decimal places!! Very odd.
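For anyone wanting to repeat this kind of check, both correlations are easy to compute by hand. The sketch below uses plain Python with hypothetical function names. The six pairs of percentages are made up purely to exercise the code (only the 46% / 57% pair echoes the Barry figures above); they are not the top-50 data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def average_ranks(values):
    """Ranks 1..n, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Illustrative, made-up percentages (NOT the actual league data).
three_pt = [0.46, 0.44, 0.43, 0.42, 0.41, 0.40]
two_pt = [0.57, 0.45, 0.50, 0.43, 0.52, 0.47]
print(f"Pearson  = {pearson(three_pt, two_pt):.3f}")
print(f"Spearman = {spearman(three_pt, two_pt):.3f}")
```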
Perhaps there just aren't enough Brent Barrys to support my hypothesis, but I also find it odd that 7 of these 50 have a higher 3-pt. % than 2-pt. %, and 3 have the same percentage!
If we looked at each of those 10 players we would probably find an explanation for this, but on the surface it looks a bit odd to me. (The fact that Earl Boykins has equal percentages is understandable, however.)
Obviously there are many 3-pt. specialists in that group and some, like Barry, have attempted far more 3s than 2s, so this begs the question of how atypical this group is of the entire set of NBA players. Surely there must be a large subset of NBA players for which there would be a substantial correlation between the two percentages.
All things considered, it seems as though there should be at least a small-to-moderate correlation, and it may be large enough to be non-ignorable in computing (estimating) the variance.
tpryan
Joined: 11 Feb 2005
Posts: 100
	
PostPosted: Thu Feb 22, 2007 6:50 am Post subject: Reply with quote
Ed,
I'm not certain how you computed the 0.22. Did the spreadsheet give you the mean and standard error of the mean for those 50 numbers? I'm guessing that is what happened because the number of 2-pt. attempts of course varies from game to game, so there is not a standard error for a fixed number of attempts. I.e., you are apparently getting the standard error for n = 50, not for n = 468 or any other number.
Those numbers also vary greatly because of the relatively small number of attempts in each game, whereas the .022 is based on 468 attempts.
Thus, the standard error from the formula should indeed be much smaller. If Duhon had taken 468 shots per game and you computed just the standard deviation, not the standard error, of the 2-pt. % for that number of attempts, that standard deviation would be very close to the .022.
Thus, I think there are two things going on that prevent approximate equality, but I will look at this further later today ... after I have had some sleep! :)
As a final thought, which may be apparent, the .022 is not what you want as that greatly underestimates the game-to-game variability in his performance. That would apply only if he took 468 2-pt. attempts every game.
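The gap between the two numbers can be illustrated with the binomial formula alone. This sketch assumes 50 games (taken from "those 50 numbers" above) and, as a simplification that does not hold in practice, the same number of attempts in every game.

```python
import math


def binomial_se(p, attempts):
    """Standard deviation of an observed percentage over a number of attempts."""
    return math.sqrt(p * (1 - p) / attempts)


fg_pct = 0.40
season_attempts = 468
games = 50  # assumption: one percentage per game, 50 games

season_se = binomial_se(fg_pct, season_attempts)
attempts_per_game = season_attempts / games  # about 9.4, assumed constant
per_game_sd = binomial_se(fg_pct, attempts_per_game)
se_of_mean = per_game_sd / math.sqrt(games)

print(f"season SE            = {season_se:.3f}")
print(f"single-game SD       = {per_game_sd:.3f}")
print(f"SD / sqrt(games)     = {se_of_mean:.3f}")
```

With about 9 attempts a game, a single-game percentage has a spread of roughly 0.16, the same order as the 0.2-ish figures in the table. Dividing that spread by the square root of the number of games brings it back to the season-level 0.022.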
gabefarkas
Joined: 31 Dec 2004
Posts: 1313
Location: Durham, NC
	
PostPosted: Thu Feb 22, 2007 9:11 am Post subject: Reply with quote
tpryan wrote:
It is the two percentages that I was contending are dependent, and this is what, if true, would make it difficult to compute the variance, in addition to the other problems that I mentioned.
It sounds to me like you are saying that the type of shot attempted is independent when we look at a series of trips down the court. I would certainly agree with that.
Hmmm, that's odd, because I quoted you as saying:
tpryan wrote:
When viewed in this way, the appropriate variance would be found by computing the variance of a linear combination of proportions, and the result would be a function of the number of 2-pt. FGAs and the number of 3-pt. FGAs. The covariance could be a bit sticky, however, because the number of 3-pt. shots attempted and made is certainly not independent of the number of 2-pt. shots attempted and made.
That makes it seem to me that you're saying the number of 3-point shots attempted and made is not independent of the number of 2-point shots attempted and made. In other words, the types of shots are dependent. Maybe I'm misinterpreting?
tpryan wrote:
Here is what I was thinking.
Of the current top 50 in 3-pt. percentage, Brent Barry ranks 4th at 46%. Of that group, he is tied for 1st in 2-pt. percentage at 57%. Thus, loosely speaking, his ranks are correlated.
What surprised me, however, is that when I did some cutting and pasting and used software to compute the correlations, the Pearson correlation between the two sets of percentages was only .135, and when I ranked the two sets and computed the rank (Spearman) correlation, it was zero to the first few decimal places!! Very odd.
Perhaps there just aren't enough Brent Barrys to support my hypothesis, but I also find it odd that 7 of these 50 have a higher 3-pt. % than 2-pt. %, and 3 have the same percentage!
If we looked at each of those 10 players we would probably find an explanation for this, but on the surface it looks a bit odd to me. (The fact that Earl Boykins has equal percentages is understandable, however.)
Obviously there are many 3-pt. specialists in that group and some, like Barry, have attempted far more 3s than 2s, so this begs the question of how atypical this group is of the entire set of NBA players. Surely there must be a large subset of NBA players for which there would be a substantial correlation between the two percentages.
All things considered, it seems as though there should be at least a small-to-moderate correlation, and it may be large enough to be non-ignorable in computing (estimating) the variance.
The first thing I would say to this is that I think we're wading into dangerous territory when we start trying to pull only subsets of the entire data in order to find a correlation. That might be construed as cherry-picking.
Secondly, thinking about the flow of the game, it doesn't surprise me at all that there's no correlation between the two, using either a parametric or nonparametric test.
On one hand, we have players who take the majority of their shots very close to the basket and only take high percentage shots, at that (call them Shaquilles). These guys might shoot 55% around the basket on many 2PA and (hypothetically) 15% on very very few 3PA. When Shaquilles do take these 3PA, they might not be in the typical flow of the game; they might be desperation shots or in a blowout.
On the other hand, coaches design plays to give certain players (call them Hoibergs) wide open 3-point attempts. These players may already have demonstrated an ability to hit a fairly high percentage of 3PA (maybe 45-47%), and the coaches try to exploit this. Hoibergs also understand their role and when they should be taking these wide-open 3PA. When they take 2PA, they might be under more challenging circumstances (like the Shaquilles and their 3PA), or from areas on the court where Hoibergs are not as accustomed to shooting.
However, for every Shaquille, there's a Rasheed who has the ability to step back and take a 3PA more successfully. And for every Hoiberg, there's a Brent, who can demonstrate a wider variety of successful shot selection.
So, the lack of correlation doesn't surprise me. I can see why it might be tempting to want to conclude a continuous (possibly linear) relationship between distance from the basket and FG%, and thus that a successful shooter should be as relatively successful from wherever he shoots. However, the existence of the three-point line as a discrete indicator of the reward for a FGM invalidates this, in my opinion.
To me, that causes the functional relationship to no longer be continuous. There's a slow and steady drop off in the expectation as you extend back from the basket, until you reach 23 feet and 9 inches, at which point the expectation suddenly takes a little jump, corresponding to the increased reward for a 3PM.
tpryan
Joined: 11 Feb 2005
Posts: 100
	
PostPosted: Thu Feb 22, 2007 2:45 pm Post subject: Reply with quote
gabefarkas wrote:
tpryan wrote:
It is the two percentages that I was contending are dependent, and this is what, if true, would make it difficult to compute the variance, in addition to the other problems that I mentioned.
It sounds to me like you are saying that the type of shot attempted is independent when we look at a series of trips down the court. I would certainly agree with that.
Hmmm, that's odd, because I quoted you as saying:
tpryan wrote:
When viewed in this way, the appropriate variance would be found by computing the variance of a linear combination of proportions, and the result would be a function of the number of 2-pt. FGAs and the number of 3-pt. FGAs. The covariance could be a bit sticky, however, because the number of 3-pt. shots attempted and made is certainly not independent of the number of 2-pt. shots attempted and made.
That makes it seem to me that you're saying the number of 3-point shots attempted and made is not independent of the number of 2-point shots attempted and made. In other words, the types of shots are dependent. Maybe I'm misinterpreting?
Well, if we think of the total number of field goal attempts in a game as being defined by an interval that isn't very wide, which seems like a reasonable assumption, then there has to be some dependency of the number of FGAs of each type. That is, the freedom to vary of one is somewhat limited if we condition on a value for the other one.
For individual possessions, whether a 3-pt. FGA is made on a given possession should be "essentially independent" of what occurred on the previous possession, which is what I thought you were saying.
In regression, the residuals are not independent even if the errors are independent since the residuals must sum to zero. The dependency is slight, however, and essentially inconsequential, especially for a large sample size. In basketball, if we fixed the number of FGAs of each type and also fixed the total number of FGAs, then there would be some dependency for individual possessions, but it would be very slight and essentially ignorable. But of course this is moot since nothing is fixed.
So what I'm saying is that there has to be some dependency in the aggregate but not for individual possessions.
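The regression remark is easy to check numerically for an ordinary least-squares line with an intercept. The sketch below uses a hypothetical helper and made-up data points; the residuals add up to zero apart from floating-point noise, so knowing all but one of them pins down the last.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return intercept, slope


# Made-up data, purely for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
print(abs(sum(residuals)) < 1e-9)  # True
```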
Quote:
The first thing I would say to this is that I think we're wading into dangerous territory when we start trying to pull only subsets of the entire data in order to find a correlation. That might be construed as cherry-picking.
I agree, but I was just questioning whether or not these players are representative of the entire league.
Quote:
However, for every Shaquille, there's a Rasheed who has the ability to step back and take a 3PA more successfully. And for every Hoiberg, there's a Brent, who can demonstrate a wider variety of successful shot selection.
If we had an equal number of Rasheeds and Shaquilles, then of course the correlation would be about zero, but I would not have suspected that. (As an aside, if we construct a scatterplot for these 50 players and label the points, Jason Kapono really stands out as an outlier, and Steve Nash and Brent Barry are also some distance from the bunch.)
tpryan
Joined: 11 Feb 2005
Posts: 100
	
PostPosted: Thu Feb 22, 2007 3:04 pm Post subject: Reply with quote
Ed Küpfer wrote:
I can understand your reluctance to use such a method, which seems to lack any theoretical justification. I appreciate that, but I can't help you there. What I need is a reasonable way to estimate these errors, and screw theory. It seems totally plausible to me that the error on a 3-outcome multinomial will be slightly greater than that of a 2-outcome. The method I described in an earlier post returns numbers that look good. I can't imagine that they will be too far off, and since I'm not doing rocket surgery, it's a hack I can live with. Especially since the alternative, looking at game-by-game variation, returns numbers that are unusable.
We often need approximations because the exact expression is intractable or complicated and impractical to use. Yes, there should be more estimation error for a multinomial than for a binomial but here the comparison is a mixture of two binomials (for which there is a dependency, at least regarding n_1 and n_2) versus the single multinomial. I have no idea how well the latter should work as a substitute for the former in this scenario.
gabefarkas
Joined: 31 Dec 2004
Posts: 1313
Location: Durham, NC
	
PostPosted: Thu Feb 22, 2007 10:24 pm Post subject: Reply with quote
tpryan wrote:
For individual possessions, whether a 3-pt. FGA is made on a given possession should be "essentially independent" of what occurred on the previous possession, which is what I thought you were saying.
In regression, the residuals are not independent even if the errors are independent since the residuals must sum to zero. The dependency is slight, however, and essentially inconsequential, especially for a large sample size. In basketball, if we fixed the number of FGAs of each type and also fixed the total number of FGAs, then there would be some dependency for individual possessions, but it would be very slight and essentially ignorable. But of course this is moot since nothing is fixed.
So what I'm saying is that there has to be some dependency in the aggregate but not for individual possessions.
You're talking about OLS only though for the residuals summing to 0.
You mentioned Spearman's in your previous post. When it comes to basketball stuff, I'm slowly coming to the realization that parametric methods (even with large N's that can supposedly allow us to make a normal approximation assumption) just don't always cut it. There's just too much intra- and inter-game variability. I'm not saying that someone like Kapono or Barry is going to be so much of an outlier that everything will be thrown off, just that more robust methods would probably be worth trying.
Regarding fixing the number of total and specific FGAs, my personal impression is that individual team strategy factors into it too much. However, after some cursory examination, I'm not quite sure anymore. According to my data from the 2006 season, the entire league took a 2PA on 68.2% of possessions, and a 3PA on 17.3% of possessions. For 2PA, the Hornets led the league, taking them on 74.4%, while ranking 27th in 3PA, taking them on only 11.9% of their possessions. Leading in 3PA were the Suns, attempting the long ball on 26.2% of their possessions. Phoenix ranked 29th in 2PA frequency, at 63.2%. Calculating the correlation between the two for all 30 teams gives -0.868. Seems fairly convincing.
The details are:
Code:
TEAM          2PA/Poss  3PA/Poss  2P Rank  3P Rank
76ers            71.1%     13.3%        9       24
Blazers          71.2%     14.2%        7       22
Bobcats          71.1%     16.1%        8       19
Bucks            68.4%     17.5%       14       17
Bulls            67.6%     19.0%       16       12
Cavaliers        65.9%     19.5%       21        9
Celtics          65.1%     16.7%       24       18
Clippers         73.4%     11.0%        2       29
Grizzlies        62.8%     21.8%       30        3
Hawks            70.7%     15.3%       11       20
Heat             64.6%     18.9%       26       13
Hornets          74.4%     11.9%        1       27
Jazz             71.5%     12.5%        6       26
Kings            66.2%     18.3%       19       15
Knicks           69.4%     11.5%       13       28
Lakers           66.2%     20.9%       20        6
Magic            72.4%     10.7%        4       30
Mavericks        71.7%     15.2%        5       21
Nets             65.0%     19.3%       25       10
Nuggets          71.1%     13.7%       10       23
Pacers           63.5%     20.6%       28        7
Pistons          70.5%     20.0%       12        8
Raptors          65.7%     21.2%       23        5
Rockets          65.8%     19.2%       22       11
Spurs            67.6%     18.5%       17       14
Suns             63.2%     26.2%       29        1
SuperSonics      66.3%     21.3%       18        4
Timberwolves     72.6%     12.7%        3       25
Warriors         63.5%     23.5%       27        2
Wizards          68.0%     18.0%       15       16
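For anyone who wants to re-run the correlation, here is a minimal Python sketch using the 30 pairs from the table above. The `pearson` helper and the `shares` dictionary are just illustrative names; the result should land close to the -0.868 quoted, with small differences possible because the table is rounded to one decimal.

```python
import math

# (2PA/Poss %, 3PA/Poss %) per team, copied from the table above
shares = {
    "76ers": (71.1, 13.3), "Blazers": (71.2, 14.2), "Bobcats": (71.1, 16.1),
    "Bucks": (68.4, 17.5), "Bulls": (67.6, 19.0), "Cavaliers": (65.9, 19.5),
    "Celtics": (65.1, 16.7), "Clippers": (73.4, 11.0), "Grizzlies": (62.8, 21.8),
    "Hawks": (70.7, 15.3), "Heat": (64.6, 18.9), "Hornets": (74.4, 11.9),
    "Jazz": (71.5, 12.5), "Kings": (66.2, 18.3), "Knicks": (69.4, 11.5),
    "Lakers": (66.2, 20.9), "Magic": (72.4, 10.7), "Mavericks": (71.7, 15.2),
    "Nets": (65.0, 19.3), "Nuggets": (71.1, 13.7), "Pacers": (63.5, 20.6),
    "Pistons": (70.5, 20.0), "Raptors": (65.7, 21.2), "Rockets": (65.8, 19.2),
    "Spurs": (67.6, 18.5), "Suns": (63.2, 26.2), "SuperSonics": (66.3, 21.3),
    "Timberwolves": (72.6, 12.7), "Warriors": (63.5, 23.5), "Wizards": (68.0, 18.0),
}

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

two_share = [pair[0] for pair in shares.values()]
three_share = [pair[1] for pair in shares.values()]
r = pearson(two_share, three_share)
print(f"Pearson r between 2PA/Poss and 3PA/Poss: {r:.3f}")
```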
			
			
									
						
										
Page 3 of 3
Analyze This
Joined: 17 May 2005
Posts: 364
PostPosted: Thu Feb 15, 2007 2:12 am Post subject: Reply with quote
I don't know if you work for an NBA team, but if not, somebody give this man an NBA job.
_________________
Where There's a WilT There's a Way
HoopStudies
Joined: 30 Dec 2004
Posts: 705
Location: Near Philadelphia, PA
PostPosted: Fri Feb 16, 2007 7:45 am Post subject: Reply with quote
Analyze This wrote:
I don't know if you work for an NBA team, but if not, somebody give this man an NBA job.
He does and he got it by showing the kind of effort and quality work that he posts.
_________________
Dean Oliver
Author, Basketball on Paper
The postings are my own & don't necess represent positions, strategies or opinions of employers.
Ed Küpfer
Joined: 30 Dec 2004
Posts: 785
Location: Toronto
PostPosted: Sat Feb 17, 2007 2:08 pm Post subject: Reply with quote
Ed Küpfer wrote:
Ed Küpfer wrote:
Here is the specific formula for the EFG% variance:
Code:
var(EFG%) = ((p2m/FGA)*(1-p2m/FGA) + 2.25*(p3m/FGA)*(1-p3m/FGA) - 3*(p2m/FGA)*(p3m/FGA))/FGA
An easy hack to calculate EFG% variance, or rather standard errors:
Code:
se_EFG% ~ 0.79 * sqrt( EFG% * (1.5 - EFG%) / FGA)
Works okay as far as I can tell. There is no hack for the other multinomials like TS% or ORTG -- you just have to do them the long way.
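A minimal Python sketch of the two calculations above, for comparing the hack against the full expression. The function names and the shot counts (126 made twos, 60 made threes, 468 attempts) are made up for illustration and are not any player's actual line.

```python
import math

def se_efg_exact(p2m, p3m, fga):
    """Standard error of EFG% from the full variance expression."""
    p2 = p2m / fga  # made 2-pointers per field goal attempt
    p3 = p3m / fga  # made 3-pointers per field goal attempt
    var = (p2 * (1 - p2) + 2.25 * p3 * (1 - p3) - 3 * p2 * p3) / fga
    return math.sqrt(var)

def se_efg_hack(efg, fga):
    """The quick approximation: 0.79 * sqrt(EFG% * (1.5 - EFG%) / FGA)."""
    return 0.79 * math.sqrt(efg * (1.5 - efg) / fga)

# Hypothetical inputs, chosen only to exercise the formulas
p2m, p3m, fga = 126, 60, 468
efg = (p2m + 1.5 * p3m) / fga

exact = se_efg_exact(p2m, p3m, fga)
hack = se_efg_hack(efg, fga)
print(f"EFG% = {efg:.3f}, exact SE = {exact:.4f}, hack SE = {hack:.4f}")
```

On inputs like these the two land within a few thousandths of each other, which is the sense in which the hack "works okay".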
fake edit: oh, alright. Here's a hack for points/play:
Code:
HHH = (p2A/play)^2 + (p3A/play)^2 + (.44*FTA/play)^2 + (TO/play)^2
se_pts/play ~ (HHH^.044)/sqrt(play)
_________________
ed
tpryan
Joined: 11 Feb 2005
Posts: 100
PostPosted: Sat Feb 17, 2007 3:45 pm Post subject: Reply with quote
Ed Küpfer wrote:
Here is the specific formula for the EFG% variance:
Code:
var(EFG%) = ((p2m/FGA)*(1-p2m/FGA) + 2.25*(p3m/FGA)*(1-p3m/FGA) - 3*(p2m/FGA)*(p3m/FGA))/FGA
Simple question: What is the original source for this expression?
I can't find it in BoP or by using Google. The reason I ask is because there are two sample sizes involved in the EFG% expression, namely the number of 2-pt. attempts and the number of 3-pt. attempts, but these are not involved in the variance given above. That is, I would expect to see something like FGA(2) and FGA(3) involved in the expression. The last term in the expression presumably estimates the covariance, but I was surprised to see -3 rather than +3. I would just like to see the math that produces the expression, or at least find it sketched out somewhere, so that I can follow how the expression was obtained.
Ed Küpfer
Joined: 30 Dec 2004
Posts: 785
Location: Toronto
PostPosted: Sat Feb 17, 2007 4:13 pm Post subject: Reply with quote
tpryan wrote:
Ed Küpfer wrote:
Here is the specific formula for the EFG% variance:
Code:
var(EFG%) = ((p2m/FGA)*(1-p2m/FGA) + 2.25*(p3m/FGA)*(1-p3m/FGA) - 3*(p2m/FGA)*(p3m/FGA))/FGA
Simple question: What is the original source for this expression?
One Ray Koopman sketched it out for me. I've since discovered that it is a special case of a general formula for multinomial variance, which goes something like this:
For outcomes 0 pts, 2 pts, 3 pts,
p0 = (p2missed + p3missed)/FGA
p2 = (p2made)/FGA
p3 = (p3made)/FGA
Weights (scaled to EFG%, ie pts/2)
w0 = 0pts / 2 = 0
w2 = 2pts / 2 = 1
w3 = 3pts / 2 = 1.5
EFG can be reformulated as
EFG = x = w0p0 + w2p2 + w3p3
To get the variance, you need to first square the weights in EFG, like this
x2 = (w0^2)p0 + (w2^2)p2 + (w3^2)p3
= (0)p0 + (1)p2 + (2.25)p3 = p2 + (2.25)p3
Finally
VAR(EFG) = (x2 - x^2)/FGA
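Written out as a worked expansion (same symbols as above, with x2 the weighted second moment and w0 = 0 dropping out), the general form reduces term by term to the specific formula. It also shows where the -3 comes from: twice the product of the weights, 2 * 1 * 1.5, multiplying the negative multinomial covariance term -p2*p3.

```latex
\begin{aligned}
x &= p_2 + 1.5\,p_3, \qquad x_2 = p_2 + 2.25\,p_3 \\
x_2 - x^2 &= p_2 + 2.25\,p_3 - p_2^2 - 3\,p_2 p_3 - 2.25\,p_3^2 \\
          &= p_2(1 - p_2) + 2.25\,p_3(1 - p_3) - 3\,p_2 p_3 \\
\mathrm{VAR}(\mathrm{EFG}) &= \frac{x_2 - x^2}{\mathrm{FGA}}
\end{aligned}
```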
_________________
ed
tpryan
Joined: 11 Feb 2005
Posts: 100
PostPosted: Sat Feb 17, 2007 5:20 pm Post subject: Reply with quote
Thanks, Ed. I see that Ray raised some questions when he answered your question. I'm inclined to do the same. I believe there is more than one way of looking at this, and Ray's answer seems to suggest that he would agree.
One immediate thought is the following.
In the multinomial case, there are more than two possible outcomes, as noted. An example would be rolling a die, which of course has 6 possible outcomes. The probability of observing, say, 2 ones, 2 sixes, 2 fives, 1 four, 1 two and 1 three when a die is rolled 9 times is a multinomial probability.
In basketball, however, the "trials" differ. That is, the same type of shot is not attempted each time, with the outcome being either 0, 2, or 3 points. (I.e., obviously it is impossible for a 3-pt. shot attempt to result in 2 points.) Since the trials differ (unlike the die example), my initial reaction is that this is not a multinomial problem. (I see that Ray also questions the multinomial assumption. I haven't worked it out, but his variance result could presumably be obtained by using the variance and covariance expressions at http://en.wikipedia.org/wiki/Multinomial_distribution.)
Instead, I believe the appropriate variance would be obtained by finding the variance of the weighted average of proportions, since we may express EFG% as (n_1*p_1 + 1.5*(n_2*p_2))/FGA, with n_1 denoting the number of 2-pt. FGAs, p_1 the proportion made of 2-pt. attempts, n_2 the number of 3-pt. FGAs, and p_2 the proportion of 3-pt. attempts made. (Of course n_1*p_1 just gives us the number of 2-pt. FGs, and similarly for 3-pt. FGs.)
When viewed in this way, the appropriate variance would be found by computing the variance of a linear combination of proportions, and the result would be a function of the number of 2-pt. FGAs and the number of 3-pt. FGAs. The covariance could be a bit sticky, however, because the number of 3-pt. shots attempted and made is certainly not independent of the number of 2-pt. shots attempted and made.
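One way to sketch that linear-combination variance in Python, under the strong simplifying assumptions that n_1 and n_2 are treated as fixed and the two proportions as independent (exactly the assumptions being questioned here). The counts and percentages are hypothetical and the function names are illustrative only; the multinomial version is included for comparison.

```python
import math

def var_efg_two_binomials(n1, p1, n2, p2):
    """Variance of (X1 + 1.5*X2)/FGA with X1 ~ Bin(n1, p1) and X2 ~ Bin(n2, p2),
    assumed independent, with n1 and n2 held fixed."""
    fga = n1 + n2
    return (n1 * p1 * (1 - p1) + 2.25 * n2 * p2 * (1 - p2)) / fga ** 2

def var_efg_multinomial(n1, p1, n2, p2):
    """Single-multinomial variance (x2 - x^2)/FGA for the same shooting line."""
    fga = n1 + n2
    q2 = n1 * p1 / fga   # made 2-pointers per FGA
    q3 = n2 * p2 / fga   # made 3-pointers per FGA
    x = q2 + 1.5 * q3
    x2 = q2 + 2.25 * q3
    return (x2 - x ** 2) / fga

# Hypothetical shooting line: 300 two-point attempts at 42%, 168 threes at 36%
n1, p1 = 300, 0.42
n2, p2 = 168, 0.36

se_two = math.sqrt(var_efg_two_binomials(n1, p1, n2, p2))
se_multi = math.sqrt(var_efg_multinomial(n1, p1, n2, p2))
print(f"two binomials, fixed n: {se_two:.4f}   single multinomial: {se_multi:.4f}")
```

With the shot mix held fixed, the two-binomial figure comes out no larger than the multinomial one, because the multinomial also counts variation in which type of shot is taken. Neither version handles the dependency between n_1 and n_2.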
At least this is my initial reaction. I'll think about it further tonight.
Thanks again for the details.
Tom
tpryan
Joined: 11 Feb 2005
Posts: 100
PostPosted: Wed Feb 21, 2007 12:53 am Post subject: Reply with quote
Well, I've thought about this some more and my view hasn't changed. We clearly have two binomials rather than one multinomial.
We also have a mess because n_1 and n_2 are obviously negatively correlated, in addition to not being fixed, and their sum isn't even fixed. There is probably also at least a slight dependency between the two sample proportions, although this would of course vary from team to team, as would the proportions themselves.
I don't see anything that could be done other than taking a descriptive statistics approach and computing the variability over games for each team.
Justin, if you think about it, please remember me to Angela Dean, Noel Cressie and Tom Santner if you happen to be talking to them. (Angela may not think of me fondly at the moment, however, since my design book is now competing against her book. Noel is on the advisory board for the series in which my book is published.)
Tom
Ed Küpfer
Joined: 30 Dec 2004
Posts: 785
Location: Toronto
PostPosted: Wed Feb 21, 2007 2:34 am Post subject: Reply with quote
Appreciate the thoughts.
tpryan wrote:
I don't see anything that could be done other than taking a descriptive statistics approach and computing the variability over games for each team.
Here is Chris Duhon's season so far:
http://www.editgrid.com/user/edkupfer/Duhon
Summary stats (weights = attempts/game):
Code:
        FG%   p2%   p3%   EFG%
Mean    0.39  0.40  0.38  0.44
SE      0.23  0.22  0.30  0.27
wtMean  0.40  0.42  0.36  0.46
wtSE    0.21  0.21  0.25  0.25
Look at those huge errors! Those cannot be right. Looking at FG%, the SE from the binomial formula thingy is
SE = sqrt(FG%*(1-FG%)/FGA) = sqrt(.4*.6/468) = 0.022
or 1/10 the SE you get by looking at the game-by-game stats. Maybe at the team level that approach works better because the number of attempts is more constant from game to game, but at the player level you can't get a good estimate on the error by looking at the game-to-game variation.
Now, it simply must be the case that the error of EFG% is greater than FG%, but I can't imagine it would be that much greater -- certainly not 0.25, especially on almost 500 attempts!
Using the handy dandy SE EFG% estimator posted above, you get 0.027 as your standard error, which is slightly greater than the 0.022 for FG%.
I can understand your reluctance to use such a method, which seems to lack any theoretical justification. I appreciate that, but I can't help you there. What I need is a reasonable way to estimate these errors, and screw theory. It seems totally plausible to me that the error on a 3-outcome multinomial will be slightly greater than that of a 2-outcome. The method I described in an earlier post returns numbers that look good. I can't imagine that they will be too far off, and since I'm not doing rocket surgery, it's a hack I can live with. Especially since the alternative, looking at game-by-game variation, returns numbers that are unusable.
_________________
ed
gabefarkas
Joined: 31 Dec 2004
Posts: 1313
Location: Durham, NC
PostPosted: Wed Feb 21, 2007 8:07 am Post subject: Reply with quote
tpryan wrote:
When viewed in this way, the appropriate variance would be found by computing the variance of a linear combination of proportions, and the result would be a function of the number of 2-pt. FGAs and the number of 3-pt. FGAs. The covariance could be a bit sticky, however, because the number of 3-pt. shots attempted and made is certainly not independent of the number of 2-pt. shots attempted and made.
I'm not sure why they are necessarily dependent. A player comes down the court with his team in possession of the ball. He is standing beyond the 3-point arc and his teammate passes him the ball. He looks at the basket for a split second before remembering that his name is Shaquille and he passes it off to another teammate. Later on in the same game our player finds himself situated much closer to the basket. This time he receives an entry pass, makes a post move, and scores a two point lay in.
I could give an example of the contrapositive, but I think we all get it.
To me, these are two independent events, each determined by a player's ability and willingness to attempt that type of shot.
tpryan
Joined: 11 Feb 2005
Posts: 100
PostPosted: Thu Feb 22, 2007 6:06 am Post subject: Reply with quote
gabefarkas wrote:
tpryan wrote:
When viewed in this way, the appropriate variance would be found by computing the variance of a linear combination of proportions, and the result would be a function of the number of 2-pt. FGAs and the number of 3-pt. FGAs. The covariance could be a bit sticky, however, because the number of 3-pt. shots attempted and made is certainly not independent of the number of 2-pt. shots attempted and made.
I'm not sure why they are necessarily dependent. A player comes down the court with his team in possession of the ball. He is standing beyond the 3-point arc and his teammate passes him the ball. He looks at the basket for a split second before remembering that his name is Shaquille and he passes it off to another teammate. Later on in the same game our player finds himself situated much closer to the basket. This time he receives an entry pass, makes a post move, and scores a two point lay in.
I could give an example of the contrapositive, but I think we all get it.
To me, these are two independent events, each determined by a player's ability and willingness to attempt that type of shot.
It is the two percentages that I was contending are dependent, and this is what, if true, would make it difficult to compute the variance, in addition to the other problems that I mentioned.
It sounds to me like you are saying that the type of shot attempted is independent when we look at a series of trips down the court. I would certainly agree with that.
Here is what I was thinking.
Of the current top 50 in 3-pt. percentage, Brent Barry ranks 4th at 46%. Of that group, he is tied for 1st in 2-pt. percentage at 57%. Thus, loosely speaking, his ranks are correlated.
What surprised me, however, is that when I did some cutting and pasting and used software to compute the correlations, the Pearson correlation between the two sets of percentages was only .135, and when I ranked the two sets and computed the rank (Spearman) correlation, it was zero to the first few decimal places!! Very odd.
Perhaps there just aren't enough Brent Barrys to support my hypothesis, but I also find it odd that 7 of these 50 have a higher 3-pt. % than 2-pt. %, and 3 have the same percentage!
If we looked at each of those 10 players we would probably find an explanation for this, but on the surface it looks a bit odd to me. (The fact that Earl Boykins has equal percentages is understandable, however.)
Obviously there are many 3-pt. specialists in that group and some, like Barry, have attempted far more 3s than 2s, so this raises the question of how atypical this group is of the entire set of NBA players. Surely there must be a large subset of NBA players for which there would be a substantial correlation between the two percentages.
All things considered, it seems as though there should be at least a small-to-moderate correlation, and it may be large enough to be non-ignorable in computing (estimating) the variance.
tpryan
Joined: 11 Feb 2005
Posts: 100
PostPosted: Thu Feb 22, 2007 6:50 am Post subject: Reply with quote
Ed,
I'm not certain how you computed the 0.22. Did the spreadsheet give you the mean and standard error of the mean for those 50 numbers? I'm guessing that is what happened because the number of 2-pt. attempts of course varies from game to game, so there is not a standard error for a fixed number of attempts. I.e., you are apparently getting the standard error for n = 50, not for n = 468 or any other number.
Those numbers also vary greatly because of the relatively small number of attempts in each game, whereas the .022 is based on 468 attempts.
Thus, the standard error from the formula should indeed be much smaller. If Duhon had taken 468 shots per game and you computed just the standard deviation, not the standard error, of the 2-pt. % for that number of attempts, that standard deviation would be very close to the .022.
Thus, I think there are two things going on that prevent approximate equality, but I will look at this further later today ... after I have had some sleep! Smile
As a final thought, which may be apparent, the .022 is not what you want as that greatly underestimates the game-to-game variability in his performance. That would apply only if he took 468 2-pt. attempts every game.
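A quick Python sketch of the sample-size point, assuming 50 games (the count guessed at above) together with the 40% and 468 attempts from the example. Spreading the attempts evenly across games is a simplification, and `binomial_se` is just an illustrative helper name.

```python
import math

def binomial_se(p, n):
    """Binomial standard error of a proportion p observed over n attempts."""
    return math.sqrt(p * (1 - p) / n)

p = 0.40        # shooting percentage used in the example above
attempts = 468  # season attempts used in the example above
games = 50      # assumed number of games, so about 9.4 attempts per game

season_se = binomial_se(p, attempts)
per_game_sd = binomial_se(p, attempts / games)

print(f"SE over {attempts} attempts:        {season_se:.3f}")
print(f"SD of a single game's percentage: {per_game_sd:.3f}")
print(f"ratio: {per_game_sd / season_se:.2f}")
```

Under these assumptions the single-game figure is larger by a factor of sqrt(50), roughly 7, which would account for most of the gap between 0.022 and the spreadsheet numbers near 0.2.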
gabefarkas
Joined: 31 Dec 2004
Posts: 1313
Location: Durham, NC
PostPosted: Thu Feb 22, 2007 9:11 am Post subject: Reply with quote
tpryan wrote:
It is the two percentages that I was contending are dependent, and this is what, if true, would make it difficult to compute the variance, in addition to the other problems that I mentioned.
It sounds to me like you are saying that the type of shot attempted is independent when we look at a series of trips down the court. I would certainly agree with that.
Hmmm, that's odd, because I quoted you as saying:
tpryan wrote:
When viewed in this way, the appropriate variance would be found by computing the variance of a linear combination of proportions, and the result would be a function of the number of 2-pt. FGAs and the number of 3-pt. FGAs. The covariance could be a bit sticky, however, because the number of 3-pt. shots attempted and made is certainly not independent of the number of 2-pt. shots attempted and made.
That makes it seem to me that you're saying the number of 3-point shots attempted and made is not independent of the number of 2-point shots attempted and made. In other words, the types of shots are dependent. Maybe I'm misinterpreting?
tpryan wrote:
Here is what I was thinking.
Of the current top 50 in 3-pt. percentage, Brent Barry ranks 4th at 46%. Of that group, he is tied for 1st in 2-pt. percentage at 57%. Thus, loosely speaking, his ranks are correlated.
What surprised me, however, is that when I did some cutting and pasting and used software to compute the correlations, the Pearson correlation between the two sets of percentages was only .135, and when I ranked the two sets and computed the rank (Spearman) correlation, it was zero to the first few decimal places!! Very odd.
Perhaps there just aren't enough Brent Barrys to support my hypothesis, but I also find it odd that 7 of these 50 have a higher 3-pt. % than 2-pt. %, and 3 have the same percentage!
If we looked at each of those 10 players we would probably find an explanation for this, but on the surface it looks a bit odd to me. (The fact that Earl Boykins has equal percentages is understandable, however.)
Obviously there are many 3-pt. specialists in that group and some, like Barry, have attempted far more 3s than 2s, so this raises the question of how atypical this group is of the entire set of NBA players. Surely there must be a large subset of NBA players for which there would be a substantial correlation between the two percentages.
All things considered, it seems as though there should be at least a small-to-moderate correlation, and it may be large enough to be non-ignorable in computing (estimating) the variance.
The first thing I would say to this is that I think we're wading into dangerous territory when we start trying to pull only subsets of the entire data in order to find a correlation. That might be construed as cherry-picking.
Secondly, thinking about the flow of the game, it doesn't surprise me at all that there's no correlation between the two, using either a parametric or nonparametric test.
On one hand, we have players who take the majority of their shots very close to the basket and only take high percentage shots, at that (call them Shaquilles). These guys might shoot 55% around the basket on many 2PA and (hypothetically) 15% on very very few 3PA. When Shaquilles do take these 3PA, they might not be in the typical flow of the game; they might be desperation shots or in a blowout.
On the other hand, coaches design plays to give certain players (call them Hoibergs) wide open 3-point attempts. These players may already have demonstrated an ability to hit a fairly high percentage of 3PA (maybe 45-47%), and the coaches try to exploit this. Hoibergs also understand their role and when they should be taking these wide-open 3PA. When they take 2PA, they might be under more challenging circumstances (like the Shaquilles and their 3PA), or from areas on the court where Hoibergs are not as accustomed to shooting.
However, for every Shaquille, there's a Rasheed who has the ability to step back and take a 3PA more successfully. And for every Hoiberg, there's a Brent, who can demonstrate a wider variety of successful shot selection.
So, the lack of correlation doesn't surprise me. I can see why it might be tempting to want to conclude a continuous (possibly linear) relationship between distance from the basket and FG%, and thus that a successful shooter should be as relatively successful from wherever he shoots. However, the existence of the three-point line as a discrete indicator of the reward for a FGM invalidates this, in my opinion.
To me, that causes the functional relationship to no longer be continuous. There's a slow and steady drop off in the expectation as you extend back from the basket, until you reach 23 feet and 9 inches, at which point the expectation suddenly takes a little jump, corresponding to the increased reward for a 3PM.
tpryan
Joined: 11 Feb 2005
Posts: 100
PostPosted: Thu Feb 22, 2007 2:45 pm Post subject: Reply with quote
gabefarkas wrote:
tpryan wrote:
It is the two percentages that I was contending are dependent, and this is what, if true, would make it difficult to compute the variance, in addition to the other problems that I mentioned.
It sounds to me like you are saying that the type of shot attempted is independent when we look at a series of trips down the court. I would certainly agree with that.
Hmmm, that's odd, because I quoted you as saying:
tpryan wrote:
When viewed in this way, the appropriate variance would be found by computing the variance of a linear combination of proportions, and the result would be a function of the number of 2-pt. FGAs and the number of 3-pt. FGAs. The covariance could be a bit sticky, however, because the number of 3-pt. shots attempted and made is certainly not independent of the number of 2-pt. shots attempted and made.
That makes it seem to me that you're saying the number of 3-point shots attempted and made is not independent of the number of 2-point shots attempted and made. In other words, the types of shots are dependent. Maybe I'm misinterpreting?
Well, if we think of the total number of field goal attempts in a game as being defined by an interval that isn't very wide, which seems like a reasonable assumption, then there has to be some dependency of the number of FGAs of each type. That is, the freedom to vary of one is somewhat limited if we condition on a value for the other one.
For individual possessions, whether a 3-pt. FGA is made on a given possession should be "essentially independent" of what occurred on the previous possession, which is what I thought you were saying.
In regression, the residuals are not independent even if the errors are independent since the residuals must sum to zero. The dependency is slight, however, and essentially inconsequential, especially for a large sample size. In basketball, if we fixed the number of FGAs of each type and also fixed the total number of FGAs, then there would be some dependency for individual possessions, but it would be very slight and essentially ignorable. But of course this is moot since nothing is fixed.
So what I'm saying is that there has to be some dependency in the aggregate but not for individual possessions.
Quote:
The first thing I would say to this is that I think we're wading into dangerous territory when we start trying to pull only subsets of the entire data in order to find a correlation. That might be construed as cherry-picking.
I agree, but I was just questioning whether or not these players are representative of the entire league.
Quote:
However, for every Shaquille, there's a Rasheed who has the ability to step back and take a 3PA more successfully. And for every Hoiberg, there's a Brent, who can demonstrate a wider variety of successful shot selection.
If we had an equal number of Rasheeds and Shaquilles, then of course the correlation would be about zero, but I would not have suspected that. (As an aside, if we construct a scatterplot for these 50 players and label the points, Jason Kapono really stands out as an outlier, and Steve Nash and Brent Barry are also some distance from the bunch.)
greyberger
Posts: 32
Joined: Thu Apr 14, 2011 11:14 pm
Re: Some Rules of Thumb (E. Kupfer)
Oh man this thread is great.