More Pyth stuff (Ed Küpfer, 2006)
Posted: Fri Apr 15, 2011 9:32 am
Ed Küpfer
Joined: 30 Dec 2004
Posts: 785
Location: Toronto
Posted: Mon Jan 02, 2006 4:13 pm Post subject: More Pyth stuff (was: Welcome to APBRmetrics!)
HoopStudies wrote:
WizardsKev wrote:
Quote:
- The optimal pythagorean exponent
I recall a thread on this when this site had just opened for business. I think it was Justin and Ed Kupfer primarily who looked at this issue. The exponent that performed best was 14.
I have been asked a couple times why any pythagorean method works given that the Bell Curve method is what theoretically makes sense. And maybe by understanding how a Pythagorean method can come from the Bell Curve, we could come up with the best exponent (as it changes depending on avg # of poss in a game).
To tell you the truth, I don't know why the Bell Curve works, even in theory. It makes sense that it would work better than Pyth, given that it makes use of more information, but it is not obvious to me why it works without any correction for points differential like Pyth (ie the exponent).
Some thoughts.
I know there's a mathematical derivation of Pyth somewhere, but let's simplify. Simple points differential should tell us something about the strength of a team. Good teams will outscore their opponents on average by a greater margin than bad teams will. But the marginal effect of points differential on win percentage cannot be linear, since that would mean that at some point teams would have a win% greater than 1. If you plot Pts Diff against expected win%, you want an s-curve, like this:
The Pythagorean method bends the linear Pts Diff into an s-curve. I have used logistic regression to get similar results. If you optimise the exponent in the Pyth to reduce the errors in expected team win%, the exponent you end up with will depend on your area of focus: for the entire history of the NBA, the optimal exponent is 14, IIRC. For a particular season, the exponent varies. The problem is that the value of the exponent is not strongly correlated to the value of the points scoring environment. That means you can't reliably predict the exponent; you can only optimise it empirically, by reducing errors. I believe any future advances in Pyth will depend on finding the statistical areas to which the optimal exponent correlates strongly.
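For readers following along, here is a minimal sketch of the s-curve idea, assuming the usual Pythagorean form PF^x / (PF^x + PA^x) with the exponent of 14 mentioned above. The function name and the 100-point baseline are illustrative choices, not anything from the thread.

```python
def pyth_win_pct(pts_for, pts_against, exponent=14.0):
    """Pythagorean expected win%: PF^x / (PF^x + PA^x).

    Written with the ratio PA/PF so large exponents do not overflow.
    Assumes pts_for > 0.
    """
    ratio = (pts_against / pts_for) ** exponent
    return 1.0 / (1.0 + ratio)


# Walk the per-game points differential from -20 to +20 around a
# hypothetical 100-point scoring level and print the expected win%.
for diff in range(-20, 21, 5):
    pf = 100.0 + diff / 2.0
    pa = 100.0 - diff / 2.0
    print(f"diff {diff:+3d}: win% = {pyth_win_pct(pf, pa):.3f}")
```

The printed values flatten out toward 0 and 1 at the extremes, which is the bend a straight line through points differential cannot give you.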
_________________
ed
Neil Paine
Joined: 13 Oct 2005
Posts: 774
Location: Atlanta, GA
Posted: Mon Jan 02, 2006 4:27 pm Post subject: Re: More Pyth stuff (was: Welcome to APBRmetrics!)
Quote:
The problem is that the value of the exponent is not strongly correlated to the value of the points scoring environment.
Why is that? I know that in baseball, it can be reliably predicted by:
exponent=1.5*log(Avg. Runs/G of both teams)+.45
What is different about basketball? The standard deviations of the scores are higher, maybe? Or the fact that each baseball team, basically without fail, has 27 outs to work with, while basketball possession numbers vary wildly from game to game?
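A quick sketch of the baseball formula quoted above. The post does not say which logarithm is meant; base 10 is assumed here, and the function name is made up for illustration.

```python
import math


def baseball_exponent(runs_per_game_both_teams):
    """exponent = 1.5 * log(R/G of both teams) + 0.45, with log assumed base 10."""
    return 1.5 * math.log10(runs_per_game_both_teams) + 0.45


# Show how slowly the exponent moves across a range of run environments.
for rpg in (7.0, 9.0, 11.0):
    print(f"{rpg:4.1f} runs/game -> exponent {baseball_exponent(rpg):.2f}")
```

Under that assumption the exponent only changes by a few tenths across realistic run environments, which is part of why it can be pinned down from scoring level alone.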
HoopStudies
Joined: 30 Dec 2004
Posts: 705
Location: Near Philadelphia, PA
Posted: Tue Jan 03, 2006 6:29 pm Post subject: Re: More Pyth stuff (was: Welcome to APBRmetrics!)
Ed Küpfer wrote:
HoopStudies wrote:
WizardsKev wrote:
Quote:
- The optimal pythagorean exponent
I recall a thread on this when this site had just opened for business. I think it was Justin and Ed Kupfer primarily who looked at this issue. The exponent that performed best was 14.
I have been asked a couple times why any pythagorean method works given that the Bell Curve method is what theoretically makes sense. And maybe by understanding how a Pythagorean method can come from the Bell Curve, we could come up with the best exponent (as it changes depending on avg # of poss in a game).
To tell you the truth, I don't know why the Bell Curve works, even in theory. It makes sense that it would work better than Pyth, given that it makes use of more information, but it is not obvious to me why it works without any correction for points differential like Pyth (ie the exponent).
It works in any situation where points and points allowed are roughly normally distributed. Statistical theory says that the sum or difference of normally distributed variables is itself normally distributed, so net points are also normally distributed (although the rule of no ties throws this off a little). With a normally distributed net points distribution, you just use the statistical tools for normal distributions to find the probability that net points will be greater than 0. And normal distributions are characterized only by their means and standard deviations, so that's all that is used.
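The argument above can be sketched in a few lines. This is only an illustration of the reasoning, not necessarily the exact Bell Curve formula: it takes per-game means and standard deviations of points scored and allowed, plus an optional covariance term (an assumption added here, defaulting to 0 for independent scores), and returns the probability that net points exceed 0. All names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def bell_curve_win_pct(mean_for, mean_against, sd_for, sd_against, cov=0.0):
    """P(net points > 0) when net points are treated as normally distributed.

    Var(PF - PA) = Var(PF) + Var(PA) - 2 * Cov(PF, PA).
    """
    var_net = sd_for ** 2 + sd_against ** 2 - 2.0 * cov
    if var_net <= 0.0:
        raise ValueError("net points variance must be positive")
    z = (mean_for - mean_against) / math.sqrt(var_net)
    return normal_cdf(z)


# A team that averages 105 for and 100 against, with made-up
# standard deviations of 12 points on each side.
print(round(bell_curve_win_pct(105.0, 100.0, 12.0, 12.0), 3))
```

Only the two means and the spread of net points enter the calculation, which is why no exponent is needed.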
Ed Küpfer wrote:
I know there's a mathematical derivation of Pyth somewhere, but let's simplify.
Is there a math derivation of the Pyth method? I never really saw one...
Ed Küpfer wrote:
The Pythagorean method bends the linear Pts Diff into an s-curve. I have used logistic regression to get similar results. If you optimise the exponent in the Pyth to reduce the errors in expected team win%, the exponent you end up with will depend on your area of focus: for the entire history of the NBA, the optimal exponent is 14, IIRC. For a particular season, the exponent varies. The problem is that the value of the exponent is not strongly correlated to the value of the points scoring environment. That means you can't reliably predict the exponent; you can only optimise it empirically, by reducing errors. I believe any future advances in Pyth will depend on finding the statistical areas to which the optimal exponent correlates strongly.
I think the reason you see variation is because it isn't a very well-posed problem. It is unstable. An exponent of 17 isn't horrible, nor is one of 11. They both do a pretty good job. I tend to use a higher number because I prefer understanding outliers rather than the middle. Someone did a study a long time ago that showed that smaller exponents work better on 0.500 teams, but larger ones work better on really good or really bad teams.
Maybe that means the derivation from the Bell Curve isn't easy. I still wish someone would do it for me, though. Could be a cool extra credit problem for some college student in here. Hint.
_________________
Dean Oliver
Author, Basketball on Paper
The postings are my own & don't necess represent positions, strategies or opinions of employers.
Ed Küpfer
Joined: 30 Dec 2004
Posts: 785
Location: Toronto
Posted: Tue Jan 03, 2006 8:39 pm Post subject: Re: More Pyth stuff (was: Welcome to APBRmetrics!)
HoopStudies wrote:
Is there a math derivation of the Pyth method? I never really saw one...
This wasn't the one I was thinking of, but there it is. (PDF)
Ah, here's the one I was thinking about.
_________________
ed
mtamada
Joined: 28 Jan 2005
Posts: 376
Posted: Wed Jan 04, 2006 11:41 pm Post subject: Re: More Pyth stuff (was: Welcome to APBRmetrics!)
Ed Küpfer wrote:
HoopStudies wrote:
Is there a math derivation of the Pyth method? I never really saw one...
This wasn't the one I was thinking of, but there it is. (PDF)
Ah, here's the one I was thinking about.
The first one is very heavy going, mathematically, but it is what I would call a true math derivation, using statistical theory to derive a functional form for the equation.
The second one is a nice article, but is not what I would call a derivation. Instead it is an exercise in curve-fitting, trying out different functional forms to see which ones give a good fit.
Theory and empiricism: both have their place. But the empirical estimates can't give you information about why they work; all you know is that they do work (e.g. that result that it is more useful to let the coefficient on the cubic term vary, rather than to account for diminishing returns: why is this so? We don't know, we only know that we get a better fit when we do so.)
On the other hand, the theoretical results may rest on shaky foundations, e.g. the author had to assume that runs scored in baseball are a continuous variable rather than a discrete one. He got good results, but his theoretical derivation, though elegant, does have this shaky part.
kjb
Joined: 03 Jan 2005
Posts: 865
Location: Washington, DC
Posted: Thu Jan 19, 2006 12:17 pm
Is there a formula for getting the Pyth exponent for a non-NBA league? (I'm goofing around with the scores from my church league.)
Using Wylie's formula (above), I get an exponent of 3.4.
Some info that may be useful:
- 32 minute games (8 minute quarters)
- avg. points per game (total points combined): 90.5
- highest scoring team (mine) has outscored 2 opponents 150-65
- lowest scoring team has been outscored 142-51
mtamada
Joined: 28 Jan 2005
Posts: 376
Posted: Thu Jan 19, 2006 5:08 pm
WizardsKev wrote:
Is there a formula for getting the Pyth exponent for a non-NBA league? (I'm goofing around with the scores from my church league.)
Using Wylie's formula (above), I get an exponent of 3.4.
Some info that may be useful:
- 32 minute games (8 minute quarters)
- avg. points per game (total points combined): 90.5
- highest scoring team (mine) has outscored 2 opponents 150-65
- lowest scoring team has been outscored 142-51
Now there would be an interesting research project, probably publishable in a research journal. Given the characteristics of a league, what "should" the exponent be -- can it be calculated just from the league's "vital statistics" or does it have to be empirically derived by looking at actual win-loss records and scores?
More exciting still is if this could be derived from an underlying statistical framework, as with the article by Steve Miller that EdK linked to.
Most exciting would be if that framework could be applied to different sports: basketball, baseball, soccer, lacrosse, etc. American football would probably require a different statistical model due to its scoring being so different from other sports.
kjb
Joined: 03 Jan 2005
Posts: 865
Location: Washington, DC
Posted: Fri Jan 20, 2006 10:19 am
mtamada wrote:
Now there would be an interesting research project, probably publishable in a research journal. Given the characteristics of a league, what "should" the exponent be -- can it be calculated just from the league's "vital statistics" or does it have to be empirically derived by looking at actual win-loss records and scores?
More exciting still is if this could be derived from an underlying statistical framework, as with the article by Steve Miller that EdK linked to.
Most exciting would be if that framework could be applied to different sports: basketball, baseball, soccer, lacrosse, etc. American football would probably require a different statistical model due to its scoring being so different from other sports.
All that goes beyond my statistical capabilities. The only data I have are standings and scores (below):
Code:
TEAM           W  L  PTS  oPTS
McLean1A       2  0  150    65
ArlingtonA     2  0   91    70
Langley        2  0   80    65
Falls Church   1  1   93    68
McLean2        1  1  111   114
McLean1B       0  2   87   107
ArlingtonB     0  2   61    93
Great Falls    0  2   51   142
Do your worst. :)
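One way to take up the challenge is a brute-force search for the exponent that minimises the RMSE between actual and Pythagorean win% on these standings. This is a sketch only: the grid bounds and step are arbitrary choices, and with eight teams and two games each any fitted value is mostly noise.

```python
import math

# (team, wins, losses, points for, points against), copied from the standings above.
STANDINGS = [
    ("McLean1A", 2, 0, 150, 65),
    ("ArlingtonA", 2, 0, 91, 70),
    ("Langley", 2, 0, 80, 65),
    ("Falls Church", 1, 1, 93, 68),
    ("McLean2", 1, 1, 111, 114),
    ("McLean1B", 0, 2, 87, 107),
    ("ArlingtonB", 0, 2, 61, 93),
    ("Great Falls", 0, 2, 51, 142),
]


def pyth(pf, pa, x):
    """Pythagorean expected win% with exponent x."""
    return 1.0 / (1.0 + (pa / pf) ** x)


def rmse(x, rows=STANDINGS):
    """Root mean squared error between actual and Pythagorean win%."""
    errs = [(w / (w + l) - pyth(pf, pa, x)) ** 2 for _, w, l, pf, pa in rows]
    return math.sqrt(sum(errs) / len(errs))


def best_exponent(rows=STANDINGS, lo=0.5, hi=30.0, step=0.1):
    """Grid search over [lo, hi] for the RMSE-minimising exponent."""
    n_steps = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n_steps + 1)]
    return min(grid, key=lambda x: rmse(x, rows))


x = best_exponent()
print(f"best exponent on the grid: {x:.1f}, RMSE {rmse(x):.3f}")
print(f"RMSE with exponent 14:     {rmse(14.0):.3f}")
```

If the best value lands on the upper edge of the grid, that is a sign the sample cannot pin the exponent down rather than evidence for a large exponent.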
parinella
Joined: 16 Dec 2005
Posts: 10
Posted: Fri Jan 20, 2006 11:01 am
Regarding the exponent being tied to the runs per game environment for baseball but not for basketball: the exponent also depends on the average margin of victory (AMOV) (although I'm not sure what the proper mathematical relationship is). The average score of a major league baseball game is about 6-3. I'd guess the average NBA game is something like 100-90. You need a much larger exponent when AMOV/RPG is smaller. I don't know why the exponent for basketball would be independent of the points per game, or if that's even true, but I'd guess that AMOV/RPG doesn't change with RPG.
And while we're at it, does anyone have any thoughts about the special needs of sports that play to a fixed point total, like volleyball or ultimate? Volleyball might have its own problems because a team can score many points (or even win the game) on one possession, but ultimate is like basketball in that each team has the same number of possessions in a game (within one).
94by50
Joined: 01 Jan 2006
Posts: 499
Location: Phoenix
Posted: Fri Feb 03, 2006 2:20 am Post subject: Re: More Pyth stuff (was: Welcome to APBRmetrics!)
Ed Küpfer wrote:
The problem is that the value of the exponent is not strongly correlated to the value of the points scoring environment.
I just spent the last hour or so doing the math because I didn't want to believe this was true. I came up with a correlation between "points per game per team" and "ideal exponent per league" of about .62 and r^2 of about .38. Needless to say, I'm disappointed. A perfect solution certainly is not linear, and I'm not about to spend all night finding a non-linear solution to the problem.
I also understand the differences between the two most common exponents, I think. By maximizing the correlation between "actual winning percentage" and "expected winning percentage", I got close to 16.5. By minimizing the RMSE between actual wins and expected wins, I got to around 14.25.
gabefarkas
Joined: 31 Dec 2004
Posts: 1313
Location: Durham, NC
Posted: Fri Feb 03, 2006 11:43 am
94by -
I don't think even a nonlinear correlation is the answer. Perhaps nonparametric analysis is the way to go.
With something like this, where you're taking a big-picture statistic as the explanatory variable (pts/game/team) and trying to relate it to another big-picture statistic as the dependent variable (a coefficient for win% prediction), I wouldn't be surprised if the distributions of the expected values can't be normalized through the usual techniques.
So, instead of using the actual values, a (possibly weighted) ranking system might be the way to go.
Ed Küpfer
Joined: 30 Dec 2004
Posts: 785
Location: Toronto
Posted: Fri Feb 03, 2006 12:17 pm Post subject: Re: More Pyth stuff (was: Welcome to APBRmetrics!)
94by50 wrote:
I just spent the last hour or so doing the math because I didn't want to believe this was true. I came up with a correlation between "points per game per team" and "ideal exponent per league" of about .62 and r^2 of about .38. Needless to say, I'm disappointed. A perfect solution certainly is not linear, and I'm not about to spend all night finding a non-linear solution to the problem.
To be perfectly honest, I had no idea the correlation was that high. Just goes to show, I cannot be trusted to rely on my memory for these things. Contra Gabe, I think a linear analysis is a good place to start, given such a high r^2. I'm on my way out the door right now, so I don't have time for that, but I did manage to solve for the optimal exponents for each season (given below) using three different fitness measures. First, the RMSE between actual and pythagorean win%s. Second, the mean deviation between actual and pythagorean win%s. Third, I solved for the optimal exponent at the team level, using this handy equation:
Exponent = ln(c / (1 - c)) / ln(a/b)
where a = ppg, b = opponent's ppg, and c = win%
and then I took the median team-level exponent for each season.
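The equation follows from inverting the Pythagorean formula: if c = a^x / (a^x + b^x), then c / (1 - c) = (a/b)^x, and taking logs gives x. A short sketch of the team-level calculation and the season median, with hypothetical function names; the handling of undefined cases (a perfect or winless record, or a = b) is an assumption made here, since the post does not say how those were treated.

```python
import math
from statistics import median


def team_exponent(ppg, opp_ppg, win_pct):
    """Exponent = ln(c / (1 - c)) / ln(a / b), or None where it is undefined."""
    if not 0.0 < win_pct < 1.0 or ppg == opp_ppg:
        return None
    return math.log(win_pct / (1.0 - win_pct)) / math.log(ppg / opp_ppg)


def season_median_exponent(teams):
    """Median of the defined team-level exponents; teams are (ppg, opp_ppg, win_pct)."""
    values = [team_exponent(a, b, c) for a, b, c in teams]
    return median(v for v in values if v is not None)


# Round-trip check: build a win% from a known exponent, then recover it.
a, b, x = 105.0, 100.0, 14.0
c = a ** x / (a ** x + b ** x)
print(round(team_exponent(a, b, c), 6))
```

Note that a team that outscores its opponents but has a losing record comes out with a negative exponent, which is one reason to take the median rather than the mean.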
Right, the data. I'm not very good at time series stuff, but I did not notice any time factor in the exponents. That ppg sure looks like a good place to start.
Code:
SEASON RMSE AVEDEV MEDIAN PPG
1947 9.9 10.2 10.4 67.8
1948 9.6 10.6 9.5 72.7
1949 11.3 12.6 11.1 80.0
1950 11.2 11.1 11.6 80.1
1951 11.4 10.0 9.6 84.0
1952 8.9 9.5 9.4 83.7
1953 14.5 15.5 14.5 82.6
1954 12.1 12.6 12.5 79.5
1955 18.1 16.7 20.2 93.1
1956 11.6 12.0 11.6 99.0
1957 7.6 8.6 6.7 99.6
1958 14.9 14.8 13.1 106.6
1959 12.8 12.5 15.0 108.2
1960 19.3 18.5 19.0 115.3
1961 16.9 18.4 19.1 118.1
1962 15.8 13.8 13.8 118.8
1963 16.8 16.2 16.3 115.3
1964 13.7 14.5 14.3 111.0
1965 16.8 15.8 14.8 110.6
1966 16.5 16.6 16.5 115.5
1967 15.5 14.7 11.2 117.4
1968 16.9 16.7 16.2 116.6
1969 15.0 16.0 16.4 112.3
1970 14.1 13.7 14.0 116.7
1971 13.5 13.0 13.0 112.4
1972 13.3 13.7 14.9 110.2
1973 15.0 13.5 13.9 107.6
1974 14.4 14.1 14.3 105.7
1975 13.7 12.7 11.3 102.6
1976 18.8 17.0 16.1 104.3
1977 11.8 12.2 12.5 106.5
1978 16.3 16.7 17.0 108.5
1979 12.6 13.6 13.5 110.3
1980 16.4 15.9 15.6 109.3
1981 16.5 16.2 16.1 108.1
1982 17.1 16.9 16.6 108.6
1983 15.4 15.0 15.2 108.5
1984 14.9 16.1 16.1 110.1
1985 15.8 15.7 17.6 110.8
1986 14.8 17.0 17.0 110.2
1987 14.7 15.0 13.9 109.9
1988 15.0 15.2 15.2 108.2
1989 13.3 12.8 12.6 109.2
1990 16.1 15.9 15.5 107.0
1991 14.1 14.4 14.3 106.3
1992 13.6 13.4 13.3 105.3
1993 14.6 14.6 14.3 105.3
1994 13.7 13.4 12.4 101.5
1995 13.3 12.9 13.5 101.4
1996 14.0 14.3 13.6 99.5
1997 14.6 14.4 13.5 96.9
1998 14.5 14.7 14.0 95.6
1999 13.8 14.1 13.2 91.6
2000 14.0 14.2 14.4 97.5
2001 13.9 14.2 14.0 94.8
2002 13.1 12.8 12.9 95.5
2003 13.3 14.1 14.2 95.1
2004 13.2 13.8 13.4 93.4
2005 15.4 16.4 16.2 97.2
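As a starting point for the linear analysis, here is a sketch that computes the Pearson correlation and a least-squares line between league PPG and the RMSE-optimal exponent, using only the SEASON, RMSE and PPG columns of the table above. The helper names are made up, and the result need not match the .62 reported earlier, which came from a different data set.

```python
import math

# season, RMSE-optimal exponent, league PPG (copied from the table above)
RAW = """
1947 9.9 67.8    1948 9.6 72.7    1949 11.3 80.0   1950 11.2 80.1
1951 11.4 84.0   1952 8.9 83.7    1953 14.5 82.6   1954 12.1 79.5
1955 18.1 93.1   1956 11.6 99.0   1957 7.6 99.6    1958 14.9 106.6
1959 12.8 108.2  1960 19.3 115.3  1961 16.9 118.1  1962 15.8 118.8
1963 16.8 115.3  1964 13.7 111.0  1965 16.8 110.6  1966 16.5 115.5
1967 15.5 117.4  1968 16.9 116.6  1969 15.0 112.3  1970 14.1 116.7
1971 13.5 112.4  1972 13.3 110.2  1973 15.0 107.6  1974 14.4 105.7
1975 13.7 102.6  1976 18.8 104.3  1977 11.8 106.5  1978 16.3 108.5
1979 12.6 110.3  1980 16.4 109.3  1981 16.5 108.1  1982 17.1 108.6
1983 15.4 108.5  1984 14.9 110.1  1985 15.8 110.8  1986 14.8 110.2
1987 14.7 109.9  1988 15.0 108.2  1989 13.3 109.2  1990 16.1 107.0
1991 14.1 106.3  1992 13.6 105.3  1993 14.6 105.3  1994 13.7 101.5
1995 13.3 101.4  1996 14.0 99.5   1997 14.6 96.9   1998 14.5 95.6
1999 13.8 91.6   2000 14.0 97.5   2001 13.9 94.8   2002 13.1 95.5
2003 13.3 95.1   2004 13.2 93.4   2005 15.4 97.2
"""

tokens = RAW.split()
ROWS = [
    (int(tokens[i]), float(tokens[i + 1]), float(tokens[i + 2]))
    for i in range(0, len(tokens), 3)
]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ols_line(xs, ys):
    """Least-squares (intercept, slope) for y = intercept + slope * x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - slope * mx, slope


ppg = [row[2] for row in ROWS]
exponent = [row[1] for row in ROWS]
r = pearson_r(ppg, exponent)
intercept, slope = ols_line(ppg, exponent)
print(f"n = {len(ROWS)}, r = {r:.2f}, r^2 = {r * r:.2f}")
print(f"exponent ~ {intercept:.2f} + {slope:.4f} * PPG")
```

Swapping `row[1]` for the AVEDEV or MEDIAN column (after adding it to the data) would give the same analysis for the other two fitness measures.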
_________________
ed
94by50
Joined: 01 Jan 2006
Posts: 499
Location: Phoenix
Posted: Fri Feb 03, 2006 3:40 pm
gabefarkas wrote:
I don't think even a nonlinear correlation is the answer. Perhaps nonparametric analysis is the way to go... I wouldn't be surprised if the distributions of the expected values can't be normalized through the usual techniques.
It wouldn't shock me if this were the case. I agree with Ed that the linear analysis was a good starting point, but beyond that, I'm not sure where to go with it.
HoopStudies wrote:
I think the reason you see variation is because it isn't a very well-posed problem. It is unstable.
Perhaps the ideal exponent in each league is less stable in smaller leagues, which can be influenced more heavily by one team that deviates strongly from the norm. That's the impression I get just from eyeballing the data that Ed and I came up with.
Ed Küpfer
Joined: 30 Dec 2004
Posts: 785
Location: Toronto
Posted: Sat Feb 04, 2006 3:10 pm
Some more:
I categorised games by the average number of points scored (Home plus Away, divided by two), rounded to the nearest 5 points. I then calculated the optimal exponent for the home team in each category (OLS, weighted by the number of games in each category), and plotted the results:

That's the 95% confidence interval shown for each observation. The regression equation is Exponent = 5.88 + 0.0758*PPG, with an adjusted r^2 of 0.58. Very nice fit, at least in games where both teams average between 60 and 130 points (that is, when they combine to score between 120 and 260 points). Removing the outliers didn't change the regression equation significantly, due to the small number of games in the extremes (n=40,911, 63 games below 60 points, 554 games over 130). The quadratic regression fit was not a significant improvement.
So, the Pyth exponent should be
Exponent = 5.88 + 0.0758*((PTS + OppPts)/(2 * Games))
That gives us a historical RMSE of 0.039, or 3.2 games in a season. That compares nicely to the Correlated Gaussian method (historically a RMSE of 0.037, or 3.0 games), and Pyth with a 14.1 exponent (RMSE = 0.040, 3.3 games).
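A small sketch of the PPG-dependent exponent in use. The function names are illustrative, and the 82-game season used to convert RMSE into games is an assumption that matches the figures quoted above (0.039 * 82 is about 3.2).

```python
def ppg_exponent(pts, opp_pts, games):
    """Exponent = 5.88 + 0.0758 * ((PTS + OppPts) / (2 * Games))."""
    return 5.88 + 0.0758 * ((pts + opp_pts) / (2.0 * games))


def pyth_win_pct_ppg(pts, opp_pts, games):
    """Pythagorean win% using the PPG-dependent exponent."""
    x = ppg_exponent(pts, opp_pts, games)
    return 1.0 / (1.0 + (opp_pts / pts) ** x)


def rmse_in_games(rmse_win_pct, season_games=82):
    """Convert an RMSE in win% into games over a season (82 assumed)."""
    return rmse_win_pct * season_games


# A hypothetical team scoring 8,400 and allowing 8,000 over 82 games.
print(round(ppg_exponent(8400, 8000, 82), 2))
print(round(pyth_win_pct_ppg(8400, 8000, 82), 3))
print(round(rmse_in_games(0.039), 1))
```

At 100 points per team per game the formula gives an exponent of 13.46, close to the flat 14.1 it is being compared against, so the two versions differ mainly in very low- or high-scoring environments.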
I think we've reached the limits of PPG-level analysis. To understand Pyth better, I think we're going to have to look for other factors, maybe possession-based stats.
_________________
ed
Ed Küpfer
Joined: 30 Dec 2004
Posts: 785
Location: Toronto
Posted: Tue Feb 07, 2006 3:02 pm
Using the same method as above, but binning the games by possessions, and only using games from 87-88 to 04-05, I found that the Pythagorean exponent is not significantly correlated to the pace of the game. Very strange.
The exponent is also unrelated to Turnover% and OR%.
_________________
Ed Küpfer
Joined: 30 Dec 2004
Posts: 785
Location: Toronto
PostPosted: Mon Jan 02, 2006 4:13 pm Post subject: More Pyth stuff (was: Welcome to APBRmetrics! ) Reply with quote
HoopStudies wrote:
WizardsKev wrote:
Quote:
- The optimal pythagorean exponent
I recall a thread on this when this site had just opened for business. I think it was Justin and Ed Kupfer primarily who looked at this issue. The exponent that performed best was 14.
I have been asked a couple times why any pythagorean method works given that the Bell Curve method is what theoretically makes sense. And maybe by understanding how a Pythagorean method can come from the Bell Curve, we could come up with the best exponent (as it changes depending on avg # of poss in a game).
To tell you the truth, I don't know why the Bell Curve works, even in theory. It makes sense that it would work better than Pyth, given that it makes use of more information, but it is not obvious to me why it works without any correction for points differential like Pyth (ie the exponent).
Some thoughts.
I know there's a mathematical derivation of Pyth somewhere, but lets simplify. Simple points differential should tell us something about the strength of a team. Good teams will outscore their opponents on average by a greater margin than bad teams will outscore their opponents. But the marginal points differential effect on win percentage cannot be linear, since this would mean that at some points, teams would have a win% greater than 1. If you plot Pts Diff against expected win%, you want an s-curve, like this:
The Pythagorean method bends the linear Pts Diff into an s-curve. I have used logistic regression to get similar results. If you optimise the exponent in the Pyth to reduce the errors in expected team win%, the exponent you end up with will depend on your area of focus: for the entire history of the NBA, the optimal exponent is 14, IIRC. For particular season, the exponent varies. The problem is that the value of the exponent is not strongly correlated to the value of the points scoring enviroment. That means you can't reliably predict the exponent — you can only optimise it empirically, by reducing errors. I believe any future advances in Pyth will depend on finding the statistical areas to which the optimal exponent correlates strongly.
_________________
ed
Back to top
View user's profile Send private message Send e-mail
Neil Paine
Joined: 13 Oct 2005
Posts: 774
Location: Atlanta, GA
PostPosted: Mon Jan 02, 2006 4:27 pm Post subject: Re: More Pyth stuff (was: Welcome to APBRmetrics! ) Reply with quote
Quote:
The problem is that the value of the exponent is not strongly correlated to the value of the points scoring enviroment.
Why is that? I know that in baseball, it can be reliably predicted by:
exponent=1.5*log(Avg. Runs/G of both teams)+.45
What is different about basketball? The standard deviations of the scores are higher, maybe? Or the fact that each baseball team, basically without fail, has 27 outs to work with, while basketball possession numbers vary wildly from game to game?
Back to top
View user's profile Send private message Visit poster's website
HoopStudies
Joined: 30 Dec 2004
Posts: 705
Location: Near Philadelphia, PA
PostPosted: Tue Jan 03, 2006 6:29 pm Post subject: Re: More Pyth stuff (was: Welcome to APBRmetrics! ) Reply with quote
Ed Küpfer wrote:
HoopStudies wrote:
WizardsKev wrote:
Quote:
- The optimal pythagorean exponent
I recall a thread on this when this site had just opened for business. I think it was Justin and Ed Kupfer primarily who looked at this issue. The exponent that performed best was 14.
I have been asked a couple times why any pythagorean method works given that the Bell Curve method is what theoretically makes sense. And maybe by understanding how a Pythagorean method can come from the Bell Curve, we could come up with the best exponent (as it changes depending on avg # of poss in a game).
To tell you the truth, I don't know why the Bell Curve works, even in theory. It makes sense that it would work better than Pyth, given that it makes use of more information, but it is not obvious to me why it works without any correction for points differential like Pyth (ie the exponent).
It works in any situation where points and points allowed are roughly normally distributed. Statistical theory says that the sum or difference of normal distributions are normally distributed, so the net pts distribution is also normally distributed (although the rule of no ties throws this off a little). With a normally distributed net pt distribution, you just use statistical tools for normal distributions to find out the probability that net pts will be greater than 0. And normal distributions are characterized by only their means and standard deviations. So that's all that is used.
Ed Küpfer wrote:
I know there's a mathematical derivation of Pyth somewhere, but lets simplify.
Is there a math derivation of the Pyth method? I never really saw one...
Ed Küpfer wrote:
The Pythagorean method bends the linear Pts Diff into an s-curve. I have used logistic regression to get similar results. If you optimise the exponent in the Pyth to reduce the errors in expected team win%, the exponent you end up with will depend on your area of focus: for the entire history of the NBA, the optimal exponent is 14, IIRC. For particular season, the exponent varies. The problem is that the value of the exponent is not strongly correlated to the value of the points scoring enviroment. That means you can't reliably predict the exponent — you can only optimise it empirically, by reducing errors. I believe any future advances in Pyth will depend on finding the statistical areas to which the optimal exponent correlates strongly.
I think the reason you see variation is because it isn't a very well-posed problem. It is unstable. An exponent of 17 isn't horrible, nor is one of 11. They both do a pretty good job. I tend to use a higher number because I prefer understanding outliers rather than the middle. Someone did a study a long time ago that showed that smaller exponents work better on 0.500 teams, but larger ones works better on really good or really bad teams.
Maybe that means the derivation from the Bell Curve isn't easy. I still wish someone would do it for me, though. Could be a cool extra credit problem for some college student in here. Hint.
_________________
Dean Oliver
Author, Basketball on Paper
The postings are my own & don't necess represent positions, strategies or opinions of employers.
Back to top
View user's profile Send private message Visit poster's website
Ed Küpfer
Joined: 30 Dec 2004
Posts: 785
Location: Toronto
PostPosted: Tue Jan 03, 2006 8:39 pm Post subject: Re: More Pyth stuff (was: Welcome to APBRmetrics! ) Reply with quote
HoopStudies wrote:
Is there a math derivation of the Pyth method? I never really saw one...
This wasn't the one I was thinking of, but there it is. (PDF)
Ah, here's the one I was thinking about.
_________________
ed
Back to top
View user's profile Send private message Send e-mail
mtamada
Joined: 28 Jan 2005
Posts: 376
PostPosted: Wed Jan 04, 2006 11:41 pm Post subject: Re: More Pyth stuff (was: Welcome to APBRmetrics! ) Reply with quote
Ed Küpfer wrote:
HoopStudies wrote:
Is there a math derivation of the Pyth method? I never really saw one...
This wasn't the one I was thinking of, but there it is. (PDF)
Ah, here's the one I was thinking about.
The first one is very heavy going, mathematically, but it is what I would call a true math derivation, using statistical theory to derive a functional form for the equation.
The second one is a nice article, but is not what I would call a derivation. Instead it is an exercise in curve-fitting, trying out different functional forms to see which ones give a good fit.
Theory and empiricism; both have their place. But the empirical estimates can't give you information about why they work, all you know is that they do work (e.g. that result that it is more useful to let the coefficient on the cubic term vary, rather than to account for diminishing returns: why is this so? We don't know, we only know that we get a better fit when we do so.)
On the other hand, the theoretical results may rest on shaky foundations, e.g. the author had to assume that runs scored in baseball are a continuous variable rather than a discrete one. He got good results, but his theoretical derivation though elegant does have this shaky part.
Back to top
View user's profile Send private message
kjb
Joined: 03 Jan 2005
Posts: 865
Location: Washington, DC
PostPosted: Thu Jan 19, 2006 12:17 pm Post subject: Reply with quote
Is there a formula for getting the Pyth exponent for a non-NBA league? (I'm goofing around with the scores from my church league.)
Using Wylie's formula (above), I get an exponent of 3.4.
Some info that may be useful:
- 32 minute games (8 minute quarters)
- avg. points per game (total points combined): 90.5
- highest scoring team (mine) has outscored 2 opponents 150-65
- lowest scoring team has been outscored 142-51
Back to top
View user's profile Send private message AIM Address Yahoo Messenger
mtamada
Joined: 28 Jan 2005
Posts: 376
PostPosted: Thu Jan 19, 2006 5:08 pm Post subject: Reply with quote
WizardsKev wrote:
Is there a formula for getting the Pyth exponent for a non-NBA league? (I'm goofing around with the scores from my church league.)
Using Wylie's formula (above), I get an exponent of 3.4.
Some info that may be useful:
- 32 minute games (8 minute quarters)
- avg. points per game (total points combined): 90.5
- highest scoring team (mine) has outscored 2 opponents 150-65
- lowest scoring team has been outscored 142-51
Now there would be an interesting research project, probably publishable in a research journal. Given the characterestics of a league, what "should" the exponent be -- can it be calculated just from the league's "vital statistics" or does it have to be empirically derived by looking at actual win-loss records and scores.
More exciting still is if this could be derived from an underlying statistical framework, as with the article by Steve Miller that EdK linked to.
Most exciting would be if that framework could be applied to different sports: basketball, baseball, soccer, lacrosse, etc. American football would probably require a different statistical model due to its scoring being so different from other sports.
Back to top
View user's profile Send private message
kjb
Joined: 03 Jan 2005
Posts: 865
Location: Washington, DC
PostPosted: Fri Jan 20, 2006 10:19 am Post subject: Reply with quote
mtamada wrote:
Now there would be an interesting research project, probably publishable in a research journal. Given the characterestics of a league, what "should" the exponent be -- can it be calculated just from the league's "vital statistics" or does it have to be empirically derived by looking at actual win-loss records and scores.
More exciting still is if this could be derived from an underlying statistical framework, as with the article by Steve Miller that EdK linked to.
Most exciting would be if that framework could be applied to different sports: basketball, baseball, soccer, lacrosse, etc. American football would probably require a different statistical model due to its scoring being so different from other sports.
All that goes beyond my statistical capabilities. The only data I have are standings and scores (below):
Code:
TEAM          W  L  PTS  oPTS
McLean1A      2  0  150    65
ArlingtonA    2  0   91    70
Langley       2  0   80    65
Falls Church  1  1   93    68
McLean2       1  1  111   114
McLean1B      0  2   87   107
ArlingtonB    0  2   61    93
Great Falls   0  2   51   142
Do your worst. :)
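For anyone who wants to play along with these standings, here is a minimal Python sketch of the usual Pythagorean expectation, PTS^x / (PTS^x + oPTS^x), applied to each team. The function name and the two trial exponents are assumptions for illustration (3.4 is the figure kjb mentions, 14 is the commonly quoted NBA value); with two games per team this shows the mechanics and nothing more.

```python
# Church-league standings as posted above:
# (team, wins, losses, points for, points against)
STANDINGS = [
    ("McLean1A", 2, 0, 150, 65),
    ("ArlingtonA", 2, 0, 91, 70),
    ("Langley", 2, 0, 80, 65),
    ("Falls Church", 1, 1, 93, 68),
    ("McLean2", 1, 1, 111, 114),
    ("McLean1B", 0, 2, 87, 107),
    ("ArlingtonB", 0, 2, 61, 93),
    ("Great Falls", 0, 2, 51, 142),
]


def pyth_win_pct(pts, opp_pts, exponent):
    """Pythagorean expected win%: pts^x / (pts^x + opp_pts^x)."""
    ratio = (pts / opp_pts) ** exponent
    return ratio / (ratio + 1.0)


if __name__ == "__main__":
    # Trial exponents are illustrative choices, not fitted values.
    for team, wins, losses, pts, opp_pts in STANDINGS:
        actual = wins / (wins + losses)
        low = pyth_win_pct(pts, opp_pts, 3.4)
        high = pyth_win_pct(pts, opp_pts, 14.0)
        print(f"{team:<13} actual {actual:.3f}  x=3.4 {low:.3f}  x=14 {high:.3f}")
```

The larger the exponent, the faster a given points ratio is pushed toward 0 or 1, which is why a league with lopsided scores tends to want a small exponent.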
parinella
Joined: 16 Dec 2005
Posts: 10
Posted: Fri Jan 20, 2006 11:01 am
Regarding the exponent being tied to the runs per game environment for baseball but not for basketball: the exponent also depends on the average margin of victory (AMOV) (although I'm not sure what the proper mathematical relationship is). The average score of a major league baseball game is about 6-3. I'd guess the average NBA game is something like 100-90. You need a much larger exponent when AMOV/RPG is smaller. I don't know why the exponent for basketball would be independent of the points per game, or if that's even true, but I'd guess that AMOV/RPG doesn't change with RPG.
And while we're at it, does anyone have any thoughts about the special needs of sports that play to a fixed point total, like volleyball or ultimate? Volleyball might have its own problems because a team can score many points (or even win the game) on one possession, but ultimate is like basketball in that each team has the same number of possessions in a game (within one).
94by50
Joined: 01 Jan 2006
Posts: 499
Location: Phoenix
Posted: Fri Feb 03, 2006 2:20 am Post subject: Re: More Pyth stuff (was: Welcome to APBRmetrics!)
Ed Küpfer wrote:
The problem is that the value of the exponent is not strongly correlated to the value of the points scoring environment.
I just spent the last hour or so doing the math because I didn't want to believe this was true. I came up with a correlation between "points per game per team" and "ideal exponent per league" of about .62 and r^2 of about .38. Needless to say, I'm disappointed. A perfect solution certainly is not linear, and I'm not about to spend all night finding a non-linear solution to the problem.
I also understand the differences between the two most common exponents, I think. By maximizing the correlation between "actual winning percentage" and "expected winning percentage", I got close to 16.5. By minimizing the RMSE between actual wins and expected wins, I got to around 14.25.
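To make the two criteria concrete, here is a rough Python sketch that searches a grid of exponents and picks one by minimum RMSE and one by maximum correlation between actual and expected win%. The post does not say how the optimisation was actually done, so the grid search, the function names and the 1-to-30 range are all assumptions. The sample data are kjb's church-league standings from earlier in the thread, far too few games to mean anything; swap in real team-seasons to reproduce the NBA figures.

```python
import math

# (points for, points against, actual win%) per team,
# taken from kjb's church-league table earlier in the thread.
TEAMS = [
    (150, 65, 1.0), (91, 70, 1.0), (80, 65, 1.0), (93, 68, 0.5),
    (111, 114, 0.5), (87, 107, 0.0), (61, 93, 0.0), (51, 142, 0.0),
]


def pyth(pts, opp_pts, x):
    """Pythagorean expected win% for exponent x."""
    ratio = (pts / opp_pts) ** x
    return ratio / (ratio + 1.0)


def rmse(teams, x):
    """Root mean squared error between actual and expected win%."""
    errors = [(w - pyth(p, o, x)) ** 2 for p, o, w in teams]
    return math.sqrt(sum(errors) / len(errors))


def correlation(teams, x):
    """Pearson correlation between actual and expected win%."""
    actual = [w for _, _, w in teams]
    expected = [pyth(p, o, x) for p, o, _ in teams]
    ma, me = sum(actual) / len(actual), sum(expected) / len(expected)
    cov = sum((a - ma) * (e - me) for a, e in zip(actual, expected))
    var_a = sum((a - ma) ** 2 for a in actual)
    var_e = sum((e - me) ** 2 for e in expected)
    return cov / math.sqrt(var_a * var_e)


def grid(lo=1.0, hi=30.0, step=0.1):
    """Candidate exponents from lo to hi inclusive (assumed search range)."""
    n = int(round((hi - lo) / step))
    return [lo + i * step for i in range(n + 1)]


def best_by_rmse(teams, candidates):
    return min(candidates, key=lambda x: rmse(teams, x))


def best_by_correlation(teams, candidates):
    return max(candidates, key=lambda x: correlation(teams, x))


if __name__ == "__main__":
    xs = grid()
    print("min-RMSE exponent:       ", round(best_by_rmse(TEAMS, xs), 1))
    print("max-correlation exponent:", round(best_by_correlation(TEAMS, xs), 1))
```

The two criteria need not agree: correlation ignores the scale of the predictions, so it can favour an exponent that ranks teams well but overshoots or undershoots their win totals, while RMSE penalises that directly.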
gabefarkas
Joined: 31 Dec 2004
Posts: 1313
Location: Durham, NC
Posted: Fri Feb 03, 2006 11:43 am
94by -
I don't think even a nonlinear correlation is the answer. Perhaps nonparametric analysis is the way to go.
With something like this, where you're taking a big-picture statistic as the explanatory variable (pts/game/team) and trying to relate it to another big-picture statistic as the dependent variable (a coefficient for win% prediction), I wouldn't be surprised if the distributions of the expected values can't be normalized through the usual techniques.
So, instead of using the actual values, a (possibly weighted) ranking system might be the way to go.
Ed Küpfer
Joined: 30 Dec 2004
Posts: 785
Location: Toronto
Posted: Fri Feb 03, 2006 12:17 pm Post subject: Re: More Pyth stuff (was: Welcome to APBRmetrics!)
94by50 wrote:
I just spent the last hour or so doing the math because I didn't want to believe this was true. I came up with a correlation between "points per game per team" and "ideal exponent per league" of about .62 and r^2 of about .38. Needless to say, I'm disappointed. A perfect solution certainly is not linear, and I'm not about to spend all night finding a non-linear solution to the problem.
To be perfectly honest, I had no idea the correlation was that high. Just goes to show, I cannot be trusted to rely on my memory for these things. Contra Gabe, I think a linear analysis is a good place to start, given such a high r^2. I'm on my way out the door right now, so I don't have time for that, but I did manage to solve for the optimal exponents for each season (given below) using three different fitness measures. First, the RMSE between actual and pythagorean win%s. Second, the mean deviation between actual and pythagorean win%s. Third, I solved for the optimal exponent at the team level, using this handy equation:
Exponent = ln(c / (1 - c)) / ln(a/b)
where a = ppg, b = opponent's ppg, and c = win%
and then I took the median team-level exponent for each season.
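The equation comes from inverting the Pythagorean formula: if c = a^x / (a^x + b^x), then c / (1 - c) = (a/b)^x, and taking logs gives x. A minimal Python sketch of that step and the per-season median follows. The function names are made up for illustration, and dropping teams where the formula is undefined (win% of exactly 0 or 1, or a = b) is an assumption, since the post does not say how those cases were handled.

```python
import math
from statistics import median


def team_exponent(ppg, opp_ppg, win_pct):
    """Exponent x that makes Pythagorean win% match win_pct exactly.

    x = ln(c / (1 - c)) / ln(a / b), with a = ppg, b = opp_ppg, c = win_pct.
    Returns None where the expression is undefined.
    """
    if not 0.0 < win_pct < 1.0:
        return None  # ln(0) or division by zero in c / (1 - c)
    if ppg <= 0 or opp_ppg <= 0 or ppg == opp_ppg:
        return None  # ln(a / b) is undefined or zero
    return math.log(win_pct / (1.0 - win_pct)) / math.log(ppg / opp_ppg)


def season_median_exponent(teams):
    """Median team-level exponent over (ppg, opp_ppg, win_pct) tuples.

    Undefined teams are skipped (an assumption, see lead-in).
    """
    values = [team_exponent(a, b, c) for a, b, c in teams]
    values = [v for v in values if v is not None]
    return median(values) if values else None
```

Note that a team that wins more than half its games while being outscored gets a negative exponent. Taking the median rather than the mean keeps a few such oddballs from dragging the season figure around.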
Right, the data. I'm not very good at time series stuff, but I did not notice any time factor in the exponents. That ppg sure looks like a good place to start.
Code:
SEASON RMSE AVEDEV MEDIAN PPG
1947 9.9 10.2 10.4 67.8
1948 9.6 10.6 9.5 72.7
1949 11.3 12.6 11.1 80.0
1950 11.2 11.1 11.6 80.1
1951 11.4 10.0 9.6 84.0
1952 8.9 9.5 9.4 83.7
1953 14.5 15.5 14.5 82.6
1954 12.1 12.6 12.5 79.5
1955 18.1 16.7 20.2 93.1
1956 11.6 12.0 11.6 99.0
1957 7.6 8.6 6.7 99.6
1958 14.9 14.8 13.1 106.6
1959 12.8 12.5 15.0 108.2
1960 19.3 18.5 19.0 115.3
1961 16.9 18.4 19.1 118.1
1962 15.8 13.8 13.8 118.8
1963 16.8 16.2 16.3 115.3
1964 13.7 14.5 14.3 111.0
1965 16.8 15.8 14.8 110.6
1966 16.5 16.6 16.5 115.5
1967 15.5 14.7 11.2 117.4
1968 16.9 16.7 16.2 116.6
1969 15.0 16.0 16.4 112.3
1970 14.1 13.7 14.0 116.7
1971 13.5 13.0 13.0 112.4
1972 13.3 13.7 14.9 110.2
1973 15.0 13.5 13.9 107.6
1974 14.4 14.1 14.3 105.7
1975 13.7 12.7 11.3 102.6
1976 18.8 17.0 16.1 104.3
1977 11.8 12.2 12.5 106.5
1978 16.3 16.7 17.0 108.5
1979 12.6 13.6 13.5 110.3
1980 16.4 15.9 15.6 109.3
1981 16.5 16.2 16.1 108.1
1982 17.1 16.9 16.6 108.6
1983 15.4 15.0 15.2 108.5
1984 14.9 16.1 16.1 110.1
1985 15.8 15.7 17.6 110.8
1986 14.8 17.0 17.0 110.2
1987 14.7 15.0 13.9 109.9
1988 15.0 15.2 15.2 108.2
1989 13.3 12.8 12.6 109.2
1990 16.1 15.9 15.5 107.0
1991 14.1 14.4 14.3 106.3
1992 13.6 13.4 13.3 105.3
1993 14.6 14.6 14.3 105.3
1994 13.7 13.4 12.4 101.5
1995 13.3 12.9 13.5 101.4
1996 14.0 14.3 13.6 99.5
1997 14.6 14.4 13.5 96.9
1998 14.5 14.7 14.0 95.6
1999 13.8 14.1 13.2 91.6
2000 14.0 14.2 14.4 97.5
2001 13.9 14.2 14.0 94.8
2002 13.1 12.8 12.9 95.5
2003 13.3 14.1 14.2 95.1
2004 13.2 13.8 13.4 93.4
2005 15.4 16.4 16.2 97.2
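As a starting point for the linear analysis, here is a small Python sketch that computes the Pearson correlation and an ordinary least squares line between the PPG column and the RMSE-optimal exponent column of the table above. The values are copied from the table in season order; the helper names are illustrative. The result need not match 94by50's .62, since his figure may come from a different data set and fitting method.

```python
import math

# Columns copied from the table above, seasons 1947 through 2005 in order.
RMSE_EXPONENT = [
    9.9, 9.6, 11.3, 11.2, 11.4, 8.9, 14.5, 12.1, 18.1, 11.6,
    7.6, 14.9, 12.8, 19.3, 16.9, 15.8, 16.8, 13.7, 16.8, 16.5,
    15.5, 16.9, 15.0, 14.1, 13.5, 13.3, 15.0, 14.4, 13.7, 18.8,
    11.8, 16.3, 12.6, 16.4, 16.5, 17.1, 15.4, 14.9, 15.8, 14.8,
    14.7, 15.0, 13.3, 16.1, 14.1, 13.6, 14.6, 13.7, 13.3, 14.0,
    14.6, 14.5, 13.8, 14.0, 13.9, 13.1, 13.3, 13.2, 15.4,
]
PPG = [
    67.8, 72.7, 80.0, 80.1, 84.0, 83.7, 82.6, 79.5, 93.1, 99.0,
    99.6, 106.6, 108.2, 115.3, 118.1, 118.8, 115.3, 111.0, 110.6, 115.5,
    117.4, 116.6, 112.3, 116.7, 112.4, 110.2, 107.6, 105.7, 102.6, 104.3,
    106.5, 108.5, 110.3, 109.3, 108.1, 108.6, 108.5, 110.1, 110.8, 110.2,
    109.9, 108.2, 109.2, 107.0, 106.3, 105.3, 105.3, 101.5, 101.4, 99.5,
    96.9, 95.6, 91.6, 97.5, 94.8, 95.5, 95.1, 93.4, 97.2,
]


def mean(values):
    return sum(values) / len(values)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ols_line(xs, ys):
    """Unweighted least squares fit ys ~ intercept + slope * xs."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope


if __name__ == "__main__":
    r = pearson_r(PPG, RMSE_EXPONENT)
    intercept, slope = ols_line(PPG, RMSE_EXPONENT)
    print(f"r = {r:.2f}, r^2 = {r * r:.2f}")
    print(f"exponent ~ {intercept:.2f} + {slope:.4f} * PPG")
```

Each season counts equally here regardless of how many teams or games it had, which is a simplification; the early seasons with few teams probably deserve less weight.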
_________________
ed
94by50
Joined: 01 Jan 2006
Posts: 499
Location: Phoenix
Posted: Fri Feb 03, 2006 3:40 pm
gabefarkas wrote:
I don't think even a nonlinear correlation is the answer. Perhaps nonparametric analysis is the way to go... I wouldn't be surprised if the distributions of the expected values can't be normalized through the usual techniques.
It wouldn't shock me if this were the case. I agree with Ed that the linear analysis was a good starting point, but beyond that, I'm not sure where to go with it.
HoopStudies wrote:
I think the reason you see variation is because it isn't a very well-posed problem. It is unstable.
Perhaps the ideal exponent in each league is less stable in smaller leagues, which can be influenced more heavily by one team that deviates strongly from the norm. That's the impression I get just from eyeballing the data that Ed and I came up with.
Ed Küpfer
Joined: 30 Dec 2004
Posts: 785
Location: Toronto
Posted: Sat Feb 04, 2006 3:10 pm
Some more:
I categorised games by the average number of points scored (Home plus Away, divided by two), rounded to 5 points. I then calculated the optimal exponent for the home team in each category (OLS, weighted by the number of games in each category), and plotted the results:

That's the 95% confidence interval shown for each observation. The regression equation is Exponent = 5.88 + 0.0758*PPG, with an adjusted r^2 of 0.58. Very nice fit, at least in games where both teams average between 60 and 130 points (that is, when they combine to score between 120 and 260 points). Removing the outliers didn't change the regression equation significantly, due to the small number of games in the extremes (n = 40,911; 63 games below 60 points, 554 games over 130). The quadratic regression fit was not a significant improvement.
So, the Pyth exponent should be
Exponent = 5.88 + 0.0758*((PTS + OppPts)/(2 * Games))
That gives us a historical RMSE of 0.039, or 3.2 games in a season. That compares nicely to the Correlated Gaussian method (historically an RMSE of 0.037, or 3.0 games), and Pyth with a 14.1 exponent (RMSE = 0.040, 3.3 games).
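For reference, the variable-exponent formula above can be sketched in a few lines of Python. The coefficients 5.88 and 0.0758 are the ones from the regression; the function names are made up, and the 82-game season used to turn a win% RMSE into games is an assumption (it reproduces the 3.2, 3.0 and 3.3 figures quoted above).

```python
def variable_exponent(pts, opp_pts, games):
    """Exponent = 5.88 + 0.0758 * ((PTS + OppPts) / (2 * Games))."""
    ppg = (pts + opp_pts) / (2.0 * games)
    return 5.88 + 0.0758 * ppg


def expected_win_pct(pts, opp_pts, games):
    """Pythagorean win% using the PPG-dependent exponent."""
    x = variable_exponent(pts, opp_pts, games)
    ratio = (pts / opp_pts) ** x
    return ratio / (ratio + 1.0)


def rmse_in_games(rmse_win_pct, season_games=82):
    """Convert a win% RMSE to games; 82-game season is an assumption."""
    return rmse_win_pct * season_games


if __name__ == "__main__":
    # Hypothetical team: 8,300 points for, 8,100 against, over 82 games.
    print(round(variable_exponent(8300, 8100, 82), 2))
    print(round(expected_win_pct(8300, 8100, 82), 3))
```

Keep in mind the fit was only described as good for games averaging between 60 and 130 points per team, so applying it to a low-scoring league means extrapolating outside that range.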
I think we've reached the limits of PPG-level analysis. To understand Pyth better, I think we're going to have to look for other factors, maybe possession-based stats.
_________________
ed
Ed Küpfer
Joined: 30 Dec 2004
Posts: 785
Location: Toronto
Posted: Tue Feb 07, 2006 3:02 pm
Using the same method as above, but binning the games by possessions, and only using games from 1987-88 to 2004-05, I found that the Pythagorean exponent is not significantly correlated with the pace of the game. Very strange.
The exponent is also unrelated to Turnover% and OR%.
_________________