Within Game Win Expectancy (Ed Küpfer, 2006)
Posted: Fri Apr 15, 2011 1:18 am
recovered page 1 of 3
Ed Küpfer
Joined: 30 Dec 2004
Posts: 785
Location: Toronto
PostPosted: Sun Feb 12, 2006 2:53 am Post subject: Within Game Win Expectancy
Okay, I assembled a ton of numbers, and posted the data:
http://ca.geocities.com/edkupfer/basket ... ngData.txt
The file contains just the raw data, about 7500 unique Time Remaining/Home Team Lead combinations, and about one-quarter million total observations.
I'll be looking at it more closely, and post any results in this thread, but if anyone thinks they have a generalised solution, or a way of getting a better fit to the data than any of my attempts, please give it a shot.
_________________
ed
Ed Küpfer
Joined: 30 Dec 2004
Posts: 785
Location: Toronto
PostPosted: Tue Feb 14, 2006 12:40 am
Okay. I got a good result using a logit model with cubed (!) variables. It's ugly, but it's the best I got so far.
Code:
Logistic Regression Table
Predictor Coef SE Coef Z P
Constant 0.0001040 0.0105958 0.01 0.992
Min^1 0.0238027 0.0021063 11.30 0.000
Min^2 -0.0006059 0.0001038 -5.84 0.000
Min^3 0.0000064 0.0000014 4.50 0.000
Lead^1 0.137276 0.0010688 128.44 0.000
Lead^2 -0.0003527 0.0001139 -3.10 0.002
Lead^3 -0.0002829 0.0000125 -22.60 0.000
(Lead^1)/Min 0.171210 0.0044060 38.86 0.000
(Lead^2)/Min 0.0066804 0.0009404 7.10 0.000
(Lead^3)/Min 0.0069239 0.0001444 47.96 0.000
Min = Minutes remaining = MINUTES + SECONDS/60
Lead = Home Team lead
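As an illustration (my own sketch, not Ed's code — the function name and dictionary layout are mine), plugging the fitted coefficients above into the logit link gives a win-probability function:

```python
import math

# Coefficients copied from the logistic regression table above.
COEF = {
    "const":    0.0001040,
    "min1":     0.0238027, "min2": -0.0006059, "min3": 0.0000064,
    "lead1":    0.137276,  "lead2": -0.0003527, "lead3": -0.0002829,
    "lead1_m":  0.171210,  "lead2_m": 0.0066804, "lead3_m": 0.0069239,
}

def home_win_prob(minutes_left, home_lead):
    """Home win probability from the fitted logit model.

    Only valid for minutes_left > 0: the interaction terms divide by
    minutes remaining, so the linear predictor blows up at 0:00.
    """
    m, d = minutes_left, home_lead
    b = (COEF["const"]
         + COEF["min1"] * m + COEF["min2"] * m**2 + COEF["min3"] * m**3
         + COEF["lead1"] * d + COEF["lead2"] * d**2 + COEF["lead3"] * d**3
         + COEF["lead1_m"] * d / m
         + COEF["lead2_m"] * d**2 / m
         + COEF["lead3_m"] * d**3 / m)
    return 1.0 / (1.0 + math.exp(-b))
```

For a tie game at halftime (24 minutes left, lead 0) this returns about 0.58, i.e. the home court advantage baked into the intercept and time terms.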
Here's how it looks for a home team lead over the final three minutes.

Note that the home court advantage disappears near the end.
_________________
ed
mtamada
Joined: 28 Jan 2005
Posts: 376
PostPosted: Tue Feb 14, 2006 2:28 am
Fabulous. I haven't had time to play around with the formula that you derived, but the numbers look plausible.
Have you talked about this with DeanO? I know that one of his research areas, at least as of a year or two ago, was within-game probabilities-of-winning, although I think he was more interested in a discrete game-state approach (e.g. with 30 seconds left, home team has the ball and a 2 point deficit, should they go for a quick shot to get a 2-for-1, or work the regular offense, or try to shoot a 3-pointer?).
His and your approach might complement or supplement each other real well.
tenkev
Joined: 31 Jul 2005
Posts: 20
Location: Memphis,TN
PostPosted: Tue Feb 14, 2006 2:37 am
I think this is absolutely fantastic.
I've had an idea that relates to this for some time.
If you can calculate the expected winning % at any given time during the game based on point differential, time remaining and possession, then you can make a metric that would blow DanVal out of the water.
Dan's regression formula for deriving his player rating is
MARGIN=b0 + b1X1 + b2X2 + . . . + bKXK + e, where
MARGIN=100*(home team points per possession – away team points per possession)
Well, instead of the margin being the difference in points per possession while a unit is on the floor, why not make it the difference in expected winning %?
This way, you could account for the fact that points in a close ball game are more valuable than in a blow out, and a game winning shot is more valuable than another shot, etc.
What do you think? It would take a lot of work, but if somebody did it, that would be the best possible player rating, IMO.
Ed Küpfer
Joined: 30 Dec 2004
Posts: 785
Location: Toronto
PostPosted: Tue Feb 14, 2006 3:47 am
tenkev wrote:
Well, instead of the margin being the difference in points per possession while a unit is on the floor, why not make it the difference in expected winning %?
This way, you could account for the fact that points in a close ball game are more valuable than in a blow out, and a game winning shot is more valuable than another shot, etc.
I seem to recall that DanR put a "clutch" modifier in his model somewhere. But, yes, I think that for a comprehensive rating system, using changes in win probability as the response variable is preferable to using points.
Quote:
What do you think? It would take a lot of work, but if somebody did it, that would be the best possible player rating, IMO.
It would take much more work. The stuff I've done — to the extent that I've done anything at all — is coarse. Some problems:
1. I've only used one season of data. That can't be good. This can be addressed soon.
2. Possession isn't indicated, but is clearly an important variable towards the end of the game. This is harder to address, because I don't have an automatic way of digging possession out of the PBPs the way I did with score changes.
3. You'd still need the other data DanR used: the identity of the other players on the floor.
4. Credit needs to be given out. If the probability of a home win increases by 0.2 on a single possession, who gets what credit? Half should be deducted from the defense, obviously, but should it be shared equally among all defenders? Should a single defender be credited? Same thing for the offense, although there it's probably less problematic to assign credit.
I'm envisioning a smaller scale usage. Maybe a game-level analysis, done one game at a time by any interested fan. This would eliminate most of the problems, since the fan could manually code the missing data. For example, tonight I watched the Raptors at the Wolves, and it seemed to me that KG was a terrifying defensive presence. Since I watched the game, I could print out a PBP and code his defensive assignments manually, along with most of the other players. This type of thing could be done on a larger scale for the playoffs.
I think I'm going to try doing a single game, just to see what kind of problems come up. The Raps are in NY on Wednesday, a game which promises to exhaust my supply of boredom, but maybe scoring the game by hand this way will perk things up.
_________________
ed
Ed Küpfer
Joined: 30 Dec 2004
Posts: 785
Location: Toronto
PostPosted: Tue Feb 14, 2006 3:52 am
mtamada wrote:
I think he was more interested in a discrete game-state approach (e.g. with 30 seconds left, home team has the ball and a 2 point deficit, should they go for a quick shot to get a 2-for-1, or work the regular offense, or try to shoot a 3-pointer?).
I'll drop him a line, unless he wants to pipe up here...
I created a spreadsheet once which simulated the last few minutes of a game, focusing on 2- or 3-pt strategies. I really enjoyed working through it, and although I left lots of variables out, I saw at the time how it could be modified to include more — given a model to base it on. I still think I need more data (as noted in my reply to tenkev) but it should be workable, if I get off my lazy butt to collect more data.
_________________
ed
Tmon
Joined: 09 Oct 2005
Posts: 9
Location: Boston
PostPosted: Fri Feb 17, 2006 4:57 pm
Beautiful stuff, Ed! Thanks for making the data available as well. A few questions/comments:
1. Interesting to note that last year, 4/5 home teams won when time stopped, down by 1 pt with 1 sec remaining! An anomaly I'm sure, but never say die!
2. Conversely, only 6/11 home teams won when winning by 1 pt with 2 secs left when time stopped. Never relax!
3. Were the other "negative lead" data included in the regression, just not shown on the chart? If negative leads were included in the regression, the "lead^2" and "lead^2/min" terms change the sign, causing logic problems.
4. Inclusion of the "min" "min^2" and "min^3" variables seems a bit off logically to me. I realize the "p" values look good... But, the chance of winning should increase at lower time remaining values, so the inverse time terms (lead/min) you include later make more theoretical sense to me, and note those terms have much higher coefficients and coefficient*variable values.
5. For the logistic regression: usually the dependent input is 0 or 1. I think the regression would be more rigorous if the whole data set was broken out, instead of collapsed into say, 110 observations at 1 minute lead of 5, 55 wins, for 50%, which is then weighed as heavily in the regression as a time/lead combo with just one observation for 100%. Or perhaps you did this, and the text file was collapsed for convenience?
6. Finally, I am playing with this data using MATLAB, and the logistic code I have does not provide "p" values (or anything but the coefficients). Is there a chance anybody has more complete logistic code for MATLAB?
All that said, none of the regressions I've done so far give anything that looks as logical as your chart.
-Tmon
Ed Küpfer
Joined: 30 Dec 2004
Posts: 785
Location: Toronto
PostPosted: Fri Feb 17, 2006 5:20 pm
Tmon wrote:
3. Were the other "negative lead" data included in the regression, just not shown on the chart? If negative leads were included in the regression, the "lead^2" and "lead^2/min" terms change the sign, causing logic problems.
Hmm. Before, I included the proper sign even with squared variables (var^2 = var * |var|). I'm not sure why I didn't do it this time. Might be worth trying again.
Quote:
4. Inclusion of the "min" "min^2" and "min^3" variables seems a bit off logically to me. I realize the "p" values look good... But, the chance of winning should increase at lower time remaining values, so the inverse time terms (lead/min) you include later make more theoretical sense to me, and note those terms have much higher coefficients and coefficient*variable values.
There's nothing theoretical about what I did. I tried a bunch of different variables and interaction variables until the results looked good. This was harder than I thought — I never thought I would have to cube anything. If you can think of a way to fit a curve to the data using a more theoretical approach, I would appreciate it. I'm not comfortable with what I have so far.
Quote:
5. For the logistic regression: usually the dependent input is 0 or 1. I think the regression would be more rigorous if the whole data set was broken out, instead of collapsed into say, 110 observations at 1 minute lead of 5, 55 wins, for 50%, which is then weighed as heavily in the regression as a time/lead combo with just one observation for 100%. Or perhaps you did this, and the text file was collapsed for convenience?
I used Minitab for the regressions, being much quicker and easier than the more hardcore stats packages I have on my computer. Minitab allows me to use the number of games to weight the results of the outcomes. Imagine my surprise when I found out that other, more complex packages don't allow this as an option on the regressions.
So to answer your question, each game was used as an observation in the regression. I don't know how you'd "unstack" the observations from my data — the way I presented them was pretty much the way I collected them. I suppose you could run a macro to copy each observation g times, where g is the number of games.
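The macro Ed describes — expanding each stacked row into one 0/1 outcome per game so packages without frequency weights can fit the same regression — might be sketched like this (the column layout is my assumption, not the actual file format):

```python
# Hypothetical sketch: expand a "stacked" row (minutes, lead, games, wins)
# into one row per game with a binary home-win outcome in the last column.
def unstack(rows):
    """rows: iterable of (minutes, lead, games, wins) tuples."""
    out = []
    for minutes, lead, games, wins in rows:
        out.extend([(minutes, lead, 1)] * wins)            # home wins
        out.extend([(minutes, lead, 0)] * (games - wins))  # home losses
    return out
```

For example, a cell with 110 games at 1 minute remaining and a lead of 5, 55 of them won, expands to 110 rows of which 55 end in a 1.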
_________________
ed
Ed Küpfer
Joined: 30 Dec 2004
Posts: 785
Location: Toronto
PostPosted: Sat Feb 18, 2006 11:54 am
Update:
I've doubled the number of observations in the data set. It's now about 650,000. I've also uploaded a zip file containing the same data in "unstacked" format, so that every game observation is on its own row, with a binary win/loss outcome in the final column. Any stats package should now be able to handle this without a problem — as long as it can handle 650,000 rows.
http://ca.geocities.com/edkupfer/basket ... tacked.zip
_________________
ed
Tmon
Joined: 09 Oct 2005
Posts: 9
Location: Boston
PostPosted: Wed Feb 22, 2006 4:17 pm
Whoa nelly! Thanks for unstacking all that. I'm taking and playing with as much data as I can at a time. I definitely can't get all 650,000 rows — it doesn't even let me try. I tried it from 5 minutes on, and it let me get it in there, then crashed. Maybe I can look at different leads at specific time points one at a time or something.
-Tmon
farbror
Joined: 13 Oct 2005
Posts: 15
Location: Sweden
PostPosted: Thu Mar 09, 2006 4:08 am
This is really interesting stuff! How do you model the correlation structure for the repeated measures? I am assuming that you have multiple data points from several games?
Ed Küpfer
Joined: 30 Dec 2004
Posts: 785
Location: Toronto
PostPosted: Thu Mar 09, 2006 11:43 am
farbror wrote:
How do you model the correlation structure for the repeated measures? I am assuming that you have multiple data points from several games?
I have every game from 04-05, 1230 of them. I randomly sampled many observations from within these games, over half a million. I don't know if correlation is much of an issue—the question being asked is, given home team lead L and time remaining in game T, what is the probability of a home team win? I think the method I used was good enough to answer that, at least provisionally, until we add some more observations from other seasons.
_________________
ed
Tmon
Joined: 09 Oct 2005
Posts: 9
Location: Boston
PostPosted: Thu Mar 09, 2006 6:38 pm
Ed,
I'm still liking this stuff a lot, and I think you are essentially there for end-of-game situations. However, I was wondering if it would be possible to simplify and choose times that are commonly stated landmarks, such as halftime and end-of-three. My gut says people often put far too much emphasis on the score at these landmark times. Your current formula is probably just as valid at these times, but doing this would also reduce the number of variables, of course, so I can play too. :)
You could just use every game, instead of randomly sampling. I think there are too few observations at these times in the massive data file you posted to really get a good picture if I pull these out selectively. Any chance you're interested? Gotta figure this out so we can yell at the TV when they say "xx holds a commanding lead at the half"!
-Tmon
gabefarkas
Joined: 31 Dec 2004
Posts: 1313
Location: Durham, NC
PostPosted: Thu Mar 09, 2006 6:53 pm
Ed Küpfer wrote:
farbror wrote:
How do you model the correlation structure for the repeated measures? I am assuming that you have multiple data points from several games?
I have every game from 04-05, 1230 of them. I randomly sampled many observations from within these games, over half a million. I don't know if correlation is much of an issue—the question being asked is, given home team lead L and time remaining in game T, what is the probability of a home team win? I think the method I used was good enough to answer that, at least provisionally, until we add some more observations from other seasons.
So, correct me if I'm wrong, but essentially you could have the following:
Probability(HomeTeamWin) = ( ( x * (L^a) ) + ( y * (T^b) ) ) * E
where x, y, a and b are integers that establish the output probability to be within a certain range and/or threshold, and E is a normalizing component or "fudge factor" to bring the bounds to {0, 1}, making it a true probability.
Perhaps you might even need the natural log of the above to smooth it out.
In any case, from what you've got, do you think you could reasonably come up with values for those 5 variables that satisfy that formula, with a tolerable level or error? From your earlier post, it seems as though the equation would have linear, quadratic, and cubic terms for both variables. Is that correct?
Ed Küpfer
Joined: 30 Dec 2004
Posts: 785
Location: Toronto
PostPosted: Thu Mar 09, 2006 7:17 pm
Tmon wrote:
I was wondering if it would be possible to simplify and choose times that are commonly stated landmarks, such as halftime and end-of-three.
Like this?
Code:
TIME REMAINING
EndQ1 Half EndQ3 10:00 5:00 3:00 2:00 1:00 0:40 0:30 0:20 0:10
20 .92 .96 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
15 .90 .92 .97 .98 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
12 .87 .88 .93 .94 .99 1.00 1.00 1.00 1.00 1.00 1.00 1.00
10 .84 .85 .88 .90 .95 .99 1.00 1.00 1.00 1.00 1.00 1.00
H 9 .83 .83 .86 .87 .93 .97 .99 1.00 1.00 1.00 1.00 1.00
O 8 .81 .81 .83 .84 .89 .94 .97 1.00 1.00 1.00 1.00 1.00
M 7 .79 .79 .80 .81 .85 .90 .94 .99 1.00 1.00 1.00 1.00
E 6 .77 .76 .77 .77 .80 .85 .90 .97 .99 1.00 1.00 1.00
5 .74 .74 .73 .73 .75 .79 .84 .93 .97 .99 1.00 1.00
T 4 .72 .71 .70 .70 .71 .73 .77 .86 .92 .95 .99 1.00
E 3 .69 .68 .66 .66 .66 .67 .70 .77 .83 .87 .94 .99
A 2 .66 .65 .63 .62 .61 .62 .63 .67 .72 .76 .83 .94
M 1 .63 .61 .59 .58 .57 .57 .57 .59 .61 .63 .67 .78
0 .59 .58 .55 .55 .53 .52 .51 .51 .50 .50 .50 .50
L -1 .56 .54 .51 .51 .48 .47 .46 .43 .41 .39 .34 .24
E -2 .52 .51 .48 .47 .44 .42 .40 .35 .31 .27 .20 .08
A -3 .49 .47 .44 .43 .39 .36 .34 .26 .21 .16 .09 .01
D -4 .45 .43 .40 .39 .35 .31 .27 .18 .11 .07 .03 .00
-5 .42 .40 .36 .35 .30 .25 .20 .10 .05 .02 .00 .00
-6 .39 .36 .32 .31 .25 .19 .13 .05 .02 .00 .00 .00
-7 .35 .33 .28 .27 .19 .13 .08 .02 .00 .00 .00 .00
-8 .33 .30 .24 .22 .15 .08 .04 .00 .00 .00 .00 .00
-9 .30 .27 .21 .19 .10 .05 .02 .00 .00 .00 .00 .00
-10 .27 .24 .17 .15 .07 .02 .01 .00 .00 .00 .00 .00
-12 .23 .19 .11 .09 .02 .00 .00 .00 .00 .00 .00 .00
-15 .18 .13 .05 .03 .00 .00 .00 .00 .00 .00 .00 .00
-20 .14 .07 .01 .00 .00 .00 .00 .00 .00 .00 .00 .00
Since all that represents the probability of an average home team beating an average away team, it's more interesting to use the numbers above to modify the log5 formula, like this:
Probability of Home Team Win = (HomeWin * (1 - AwayWin) * W) / (HomeWin * (1 - AwayWin) * W + (1 - HomeWin) * AwayWin * (1 - W))
where HomeWin and AwayWin represent some estimate of the Home and Away teams' win ability (like their Win% or Pythagorean or something), and
W = some HCA weight. Normally, we use simple HCA, which is about 0.6, but the win expectancy equation returns a more precise weight, given the game circumstances.
For example, if two average teams are playing, and the home team has a 5 point lead at halftime, they have a 0.74 probability of a win. But if the home team is the Lakers (Win% = 0.5) and the away team is the Hawks (Win% = 0.3), then the home team win probability is
Code:
p(HomeWin) = (0.5 * (1 - 0.3) * 0.74) / (0.5 * (1 - 0.3) * 0.74 + (1 - 0.5) * 0.3 * (1 - 0.74))
= 0.87
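As a quick sketch (the function name is mine), the modified log5 formula can be written out and checked against the Lakers/Hawks example:

```python
def log5_home_win(home_win_pct, away_win_pct, w):
    """Log5 with a situational home weight w.

    w is the win expectancy for the game state (from the table above),
    replacing the generic ~0.6 home court advantage.
    """
    num = home_win_pct * (1 - away_win_pct) * w
    den = num + (1 - home_win_pct) * away_win_pct * (1 - w)
    return num / den
```

With two equal teams the formula just returns w itself, which is a handy sanity check: `log5_home_win(0.5, 0.5, 0.74)` gives 0.74.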
_________________
ed
page 2 of 3
Ed Küpfer
Joined: 30 Dec 2004
Posts: 783
Location: Toronto
PostPosted: Thu Mar 09, 2006 7:24 pm
gabefarkas wrote:
In any case, from what you've got, do you think you could reasonably come up with values for those 5 variables that satisfy that formula, with a tolerable level or error? From your earlier post, it seems as though the equation would have linear, quadratic, and cubic terms for both variables. Is that correct?
Yeah. I used a logistic regression model (it's up there in the second post of this thread), which takes the form:
p = 1 / (1 + EXP(-b))
where b = all the variables (time, time^2, time^3, etc) weighted by their regression coefficients. I don't know how familiar you are with logistic regression, but it's used on events that have binary outcomes (eg win/loss), and returns a nice s-curve bounded at 0 and 1.
_________________
ed
gabefarkas
Joined: 31 Dec 2004
Posts: 1311
Location: Durham, NC
PostPosted: Thu Mar 09, 2006 10:22 pm
yeah, i know log regression stuff somewhat. it can also be used with Poisson regression models, such as for counts data, or rate data.
that's part of where i was going with my previous post. i think maybe you could try remodeling using the Poisson assumption to simplify it.
Ed Küpfer
Joined: 30 Dec 2004
Posts: 783
Location: Toronto
PostPosted: Thu Mar 09, 2006 11:01 pm
gabefarkas wrote:
that's part of where i was going with my previous post. i think maybe you could try remodeling using the Poisson assumption to simplify it.
Okay, my turn to ask you to explain. I'm not too familiar with Poisson models. How would you turn it into a probability of a binary outcome?
_________________
ed
gabefarkas
Joined: 31 Dec 2004
Posts: 1311
Location: Durham, NC
PostPosted: Thu Mar 09, 2006 11:08 pm
well, the outcome would be modeled as a Poisson, rather than as a Binary.
let me give it some more thought and get back to you.
farbror
Joined: 13 Oct 2005
Posts: 15
Location: Sweden
PostPosted: Fri Mar 10, 2006 1:40 am
Ed Küpfer wrote:
I randomly sampled many observations from within these games, over half a million. I don't know if correlation is much of an issue
(my bold)
My gut feel is that correlation is a major issue! A standard logistic regression is based on the assumption that the data points are independent. Data points from the same game are not.
With 1000+ games available you might want to validate your results by sampling a single data point from each game and then do the Logistic regression.
.....and then perhaps repeat the validation a few times?
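The validation scheme suggested here — draw one observation per game, refit, and repeat — might be sketched like this (the data layout and names are my assumptions, not from the thread):

```python
import random

# Hypothetical sketch of the one-observation-per-game validation sample.
# Each game is assumed to be a pair: (list of (minutes_remaining, home_lead)
# snapshots, home_won as 0/1).
def one_per_game_sample(games, seed=None):
    """Return one randomly chosen (minutes, lead, outcome) row per game."""
    rng = random.Random(seed)
    sample = []
    for snapshots, home_won in games:
        minutes, lead = rng.choice(snapshots)
        sample.append((minutes, lead, home_won))
    return sample
```

Repeating this with different seeds and refitting the logistic regression each time would show how much the within-game correlation is inflating the apparent significance.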
Do you in any way model the strengths of the involved teams? Falling a few points behind, say, the Portland of today might be easier to overcome than trailing Detroit.
Poisson regression: Poisson regression is an excellent model for soccer and hockey scores. You might need to do some clever stuff with the dispersion parameter if you try to model hoops using Poisson regression.
Ed Küpfer
Joined: 30 Dec 2004
Posts: 783
Location: Toronto
PostPosted: Fri Mar 10, 2006 1:09 pm
farbror wrote:
My gut feel is that correlation is a major issue! A standard logistic regression is based on the assumption that the data points are independent. Data points from the same game are not.
I understand what you're saying, but I still don't see how it is a big issue. Think of the question we're trying to answer: given a home team lead of L, and M minutes remaining in the game, what is the probability of a home team win? I can't see how sampling repeatedly from the same game, but at different points, affects the answer here.
farbror wrote:
With 1000+ games available you might want to validate your results by sampling a single data point from each game and then do the Logistic regression.
.....and then perhaps repeat the validation a few times?
Okay, I did this. The problem with this approach is that there are not nearly enough data to give a significant regression result. For example, I repeated the process of sampling a single point from each game 10 times, and I still haven't sampled a single datapoint that has a home team lead with 5-10 minutes remaining. Think of all the possible Time/Lead combinations: say home team leads between -15 and 15, and 48 minutes in a game (actually, I recorded the time down to the second) — that gives us about 1500 possibilities, which means that every sample will have an average of a single datapoint per Time/Lead combination. This won't tell us anything.
I don't want to dismiss your objections out of hand. But I'm still not sure if a) the correlation issue really makes a difference (I'm not very familiar with the problems inherent in the resampling approach I used), and b) a practical alternative can be conceived. So far, the approach I used at least matches our intuitive feel of how the numbers should look, for whatever that's worth.
_________________
ed
farbror
Joined: 13 Oct 2005
Posts: 15
Location: Sweden
PostPosted: Mon Mar 13, 2006 4:24 am
Ed>>
Robust estimation of correlation structures for repeated measurements has been my field of research for some time. It is rather tricky (and I try to deal with simple stuff). The major quirk is that it is really hard to realize when the correlation structure has a major impact on the results.
If 1000+ data points are too few to get significant results, then that is a very interesting finding in itself. It might be an indication that other factors than "time remaining" and "score" are (even more) important predictors.
I appreciate your efforts to investigate this interesting topic. Also, I am very grateful that you share the results.
Cheers, farbror
Sweden
Ed Küpfer
Joined: 30 Dec 2004
Posts: 783
Location: Toronto
PostPosted: Wed Mar 15, 2006 12:10 pm
farbror, I tried this again, this time focusing on a) the final minute of games, and b) the final 2 minutes of games. Neither pass gave me significant results. I think what I have to do is re-visit this issue when I have more data. Probably this summer I'll have added two more seasons' worth to work with.
For now, all I can say is that the results above seem to conform to my intuition. I prefer to look at it as a useful, pragmatic hack, rather than a reflection of reality. I promise not to put any more confidence in it than it deserves.
_________________
ed
Jon Cohodas
Joined: 08 Jul 2005
Posts: 31
Location: Richmond, VA
PostPosted: Fri Apr 28, 2006 3:44 pm
Ed,
This dataset is pure gold. I need to be brief right now (I promise to follow up after I am off the clock today), but here are some of the things I did with a very similar dataset of college football games. Some of these things you probably have already tried.
* Rather than parameterizing the model using the regression, I created an "empirical matrix" of time remaining versus margin. In other words, there would be a cell that says that "empirically" (I'm making up numbers here), when the home team was up by 8 points with exactly 5 minutes left, they won 10/16 times, so that cell would have an "empirical" probability of .625.
Question: Are these recorded events whenever the score changed, or are other events included as well? I ask because, if it is just changes of score, then it should be easy to fill in the blanks for all of the times in between scores.
* One way to get the data to be a little smoother where you do not have many observations, without resorting to parameterization, is to set up a Markov transition matrix for each time/delta. This basically means that prob(W|t,d) is a function of the sum of the different prob(W|t-1) values.
* I love the game graphs! A simple but very telling statistic is what I called the gamescore, which is the integral of your graph. As you stated, this statistic "scores" the game on the change of probability over time and is useful at collapsing blowouts and getting at the "true" closeness of a game. I found that for college football this statistic was better than using Margin Of Victory (MOV) in predicting future matchups.
Question 2: Would you be willing to provide a version of the data with the teams involved? That would make it possible to give each game a gamescore.
* One more thing I pursued: once I had gamescores for a season, I was able to estimate a MOV based on gamescores. This was helpful for those who might want to *ahem* predict a MOV for whatever reason. :)
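The gamescore idea — the integral of the win-probability graph over the game — could be sketched with a trapezoid rule (the function name and input layout are my assumptions, not Jon's actual code):

```python
# Hypothetical sketch of Jon's "gamescore": the time-average of the home
# win-probability curve. Input: (minutes_elapsed, win_probability) points
# in game order, at least two of them.
def gamescore(curve):
    total = 0.0
    for (t0, p0), (t1, p1) in zip(curve, curve[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)  # trapezoid rule
    return total / (curve[-1][0] - curve[0][0])
```

A wire-to-wire blowout scores near 1.0, a coin-flip game near 0.5, so close wins and blowouts that were never in doubt separate cleanly even when the final margin is the same.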
Jon Cohodas
Joined: 08 Jul 2005
Posts: 31
Location: Richmond, VA
PostPosted: Fri May 05, 2006 3:35 pm
Quote:
I have every game from 04-05, 1230 of them. I randomly sampled many observations from within these games, over half a million. I don't know if correlation is much of an issue—the question being asked is, given home team lead L and time remaining in game T, what is the probability of a home team win? I think the method I used was good enough to answer that, at least provisionally, until we add some more observations from other seasons.
Ed,
Excuse me for being dense, but are you saying that the 650,000+ observations are not each time/margin observation in the sample, picked once, but rather 650,000+ independent samples from the dataset, including oversampling?
Would it be unseemly for me to beg for even the reduced dataset that contains each gameid, time, home score and visitor score? I would like to take a crack at replicating the time/delta matrix and also try to generate the probability graphs for individual games.
Jon Cohodas
Joined: 08 Jul 2005
Posts: 31
Location: Richmond, VA
PostPosted: Fri May 05, 2006 4:15 pm Post subject: How one might "smooth" the time/margin matrix
Since there was concern that there might not be enough observations for any one particular time/margin cell to get a good estimate of the probability, here's one way to smooth the data a bit. It is my attempt at a rewrite of something similar I did with college football data.
For notational purposes let p(T,d) be the probability of winning with T seconds remaining with a lead of d (delta). By definition, p(0,>0) = 1, and p(0,<0) = 0. (Overtime, as you noted, is tricky. I would just set p(0,0) to be whatever the empirical probability is of the home team winning in overtime.)
Suppose that there were 20 instances where a team was leading by one point with one second remaining. Now for the sake of simplicity, assume that there were only 3 possible outcomes for the final second: an 18/20 = 90% chance that the lead will not change, a 1/20 = 5% chance that the lead will go to 3 (the team with the lead scores another field goal), and a 1/20 = 5% chance that the other team will lead by 1 (the trailing team scores). In this example, the probability of winning given a one point lead with one second remaining is:
p(1,1)= p(0,1)*.9 + p(0,3)*.05 + p(0,-1)*.05
= 1 *.9 + 1 *.05 + 0 *.05 = .95.
Suppose one did this for every margin with one second left. Then the probability of winning with two seconds remaining, p(2,d), given different point differentials would be calculated using the p(1,d) from above. In other words, the probabilities are being modelled as a Markov process.
Another way of looking at this is that instead of comparing a T/D with the final result, you are just comparing it with the states at T+1.
This method will give some smoothness for situations where say they were down 20 at the half, rallied back to within 5 with a minute to go and lost. Instead of just tabulating this as down 20 at the half therefore lost, the probability would be based on probability that one could win down 5 with a minute to go.
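A minimal sketch of this backward-induction smoothing, assuming the one-second transition frequencies have already been tabulated (all names, the data layout, and the default handling of unseen cells are my assumptions):

```python
# Hypothetical sketch of the Markov smoothing described above.
# trans[(T, d)] maps each possible next-second margin d2 to its empirical
# probability; cells with no observations default to "margin unchanged".
def markov_win_prob(max_T, margins, trans, p_tie):
    """Build p(T, d) for T = 0..max_T by backward induction.

    p_tie is the empirical probability the home team wins in overtime,
    used for the p(0, 0) boundary case.
    """
    p = {}
    for d in margins:  # boundary: game over at T = 0
        p[(0, d)] = 1.0 if d > 0 else (p_tie if d == 0 else 0.0)
    for T in range(1, max_T + 1):
        for d in margins:
            step = trans.get((T, d), {d: 1.0})  # unseen cell: margin holds
            p[(T, d)] = sum(prob * p[(T - 1, d2)] for d2, prob in step.items())
    return p
```

Running it on the worked example above (90% hold, 5% to +3, 5% to -1) reproduces p(1,1) = .95.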
Ed Küpfer
Joined: 30 Dec 2004
Posts: 783
Location: Toronto
PostPosted: Sat May 06, 2006 2:05 pm
Jon: I just saw these posts now. I don't have time to read them closely right now, but it looks like a lot of good stuff. Expect a real reply within a couple of days.
_________________
ed
Back to top
View user's profile Send private message Send e-mail
suburbanDad
Joined: 10 May 2006
Posts: 1
PostPosted: Thu May 11, 2006 8:27 am Post subject: NBA within game different from NCAA? Reply with quote
This is brilliant work Ed.
I wonder whether the within game odds are different for the NBA. Is it difficult to get the NBA PBPs?
Also, ball possession seems important. Three points down at 0:12 is very different with the ball than without. I see that you didn't include ball possession. Is that because you didn't have it in your NCAA dataset?
sD
Back to top
View user's profile Send private message
Jon Cohodas
Joined: 08 Jul 2005
Posts: 31
Location: Richmond, VA
PostPosted: Thu May 11, 2006 2:15 pm Post subject: Reply with quote
Quote:
I wonder whether the within game odds are different for the NBA. Is it difficult to get the NBA PBPs?
Also, ball possession seems important. Three points down at 0:12 is very different with the ball than without. I see that you didn't include ball possession. Is that because you didn't have it in your NCAA dataset?
I am not Ed, but I hope he doesn't mind my answering.
I'm quite certain that Ed used NBA and not NCAA data.
Getting NBA PBPs is not difficult. They are found at nba.com, espn.com, and a few other places. The trick is parsing them. I started to take a crack at it myself a few months back, but my data was corrupted and I did not pursue it further at the time.
I believe what Ed did was sample from the lines of the PBP where a score took place, so by definition the possession was with the team that scored. If one were to look at the continuum of time in between scores, one would have to note each change of possession that did not involve a score in order to do the analysis of ball possession.
Back to top
View user's profile Send private message
gabefarkas
Joined: 31 Dec 2004
Posts: 1311
Location: Durham, NC
PostPosted: Thu May 11, 2006 5:56 pm Post subject: Reply with quote
Ed Küpfer wrote:
gabefarkas wrote:
that's part of where i was going with my previous post. i think maybe you could try remodeling using the Poisson assumption to simplify it.
Okay, my turn to ask you to explain. I'm not too familiar with Poisson models. How would you turn it into a probability of a binary outcome?
I realized I never got back to you about this. What you've done here is a binomial logit (logistic regression) model, with the form:
logit(Pi) = log (Pi / (1 - Pi) ) = a + b1x1 + b2x2 + ...
This model ensures that the response will be between 0 and 1.
A Poisson loglinear model predicts the expected value of "y" (the response variable), and takes the form:
log(E(y)) = a + b1x1 + b2x2 + ...
And it's used for counts of things, or rate data, or also when putting together a contingency table. So, you couldn't use it for a binary outcome, but if you have the total number of games, you could model the rate of success.
Loglinear and logit models have a lot of connections between them, and oftentimes there's an equivalent version of one that can be found in the other.
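As a side note on the two model forms discussed above (my own sketch, not code from the thread): the inverse link functions show why the logit is natural for a binary win/loss outcome and the log link for counts or rates.

```python
import math

def inv_logit(eta):
    # Binomial logit model: logit(p) = a + b1*x1 + ...  =>  p = 1/(1+e^-eta)
    # Output is always strictly between 0 and 1.
    return 1.0 / (1.0 + math.exp(-eta))

def inv_log(eta):
    # Poisson loglinear model: log(E[y]) = a + b1*x1 + ...  =>  E[y] = e^eta
    # Output is always positive but unbounded above.
    return math.exp(eta)

for eta in (-5.0, 0.0, 5.0):
    print(f"eta={eta:+.1f}  logit -> p={inv_logit(eta):.3f}  "
          f"log -> E[y]={inv_log(eta):.3f}")
```

This is the sense in which the logit form guarantees a valid probability for a binary response, while the loglinear form fits counts of events.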
page 3 of 3
Author Message
capnhistory
Joined: 27 Jul 2005
Posts: 62
Location: Durham, North Carolina
PostPosted: Fri May 12, 2006 7:10 pm Post subject: Reply with quote
While those of you with serious math skills work on the real research, I just wanted to toss in an observation. Basketball appeals to me as a spectator more than other sports because of its dynamic nature. Something that stood out to me about Ed's initial chart was how large a lead had to be for the home team to have the game locked down. I think about it like this: imagine the home team is clinging to a four-point lead in the game's closing minutes. If they are exchanging possessions (i.e. matching the visiting team score for score and stop for stop) to maintain that 4-point lead, then the chart indicates there's still as much as a 20% chance of the visiting team coming back to win with as little as 90 seconds remaining in the game. I know the data are still rough, and the results may change, but to me this indicates that a high number of games ending in this situation could result in a dramatic, come-from-behind, stealing-one-on-the-road type of win. Again, I find that nature thrilling, and it's a large part of why I love this sport. I chose a 4-point lead because, while teams can score that much in an atypical play (foul on a made trey), it essentially represents the lowest "two possession" lead a team can have. By comparison, if a home football team has an 8-point lead (again do-able in one possession, but really a two-possession lead), what are the odds the visiting team overcomes that deficit in less than two minutes? I imagine they are much smaller than 1-in-5. I would be really interested if Jon has found any numbers regarding this. I can't contribute to the research, but I just thought I should share that I found meaning in the discussion so far.
_________________
Throw it down big man!
More expansive basketball babble at a slower pace The Captain of History
Back to top
View user's profile Send private message Send e-mail Visit poster's website
Jon Cohodas
Joined: 08 Jul 2005
Posts: 31
Location: Richmond, VA
PostPosted: Mon May 15, 2006 9:55 am Post subject: Reply with quote
Quote:
By comparison, if a home football team has an 8-point lead (again do-able in one possession, but really a two-possession lead), what are the odds the visiting team overcomes that deficit in less than two minutes? I imagine they are much smaller than 1-in-5. I would be really interested if Jon has found any numbers regarding this. I can't contribute to the research, but I just thought I should share that I found meaning in the discussion so far.
As you predicted, it is much more difficult in college football.
With two minutes to go, down 8 points:
The home team came back to ultimately win 6/69 = 8.7% of the time.
The visitors came back 4/75= 5.3% of the time.
Keep in mind that these are all situations with the point differential. The data does not contain possession information.
Back to top
View user's profile Send private message
jmethven
Joined: 16 May 2005
Posts: 51
PostPosted: Tue Jun 16, 2009 10:55 am Post subject: Reply with quote
Sorry to bump this topic after 3 years, but has any more research been done in this direction? I know some cool graphs were posted during this year's playoffs at advancednflstats.com.
I think it would be really interesting to combine a win expectancy framework with a box score player rating model like PER, a la what they do over at fangraphs.com for baseball. There are a lot of great players like David Robinson and Kevin Garnett who have been criticized by Bill Simmons and other sportswriters for not being able to take over a game in crunchtime. Using win expectancy could give a sense of how much, say, Kobe Bryant's willingness to shoot the ball at the end of games gives him value over someone like Garnett.
mtamada
Joined: 28 Jan 2005
Posts: 376
PostPosted: Tue Feb 14, 2006 2:28 am Post subject: Reply with quote
Fabulous. I haven't had time to play around with the formula that you derived, but the numbers look plausible.
Have you talked about this with DeanO? I know that one of his research areas, at least as of a year or two ago, was within-game probabilities-of-winning, although I think he was more interested in a discrete game-state approach (e.g. with 30 seconds left, home team has the ball and a 2 point deficit, should they go for a quick shot to get a 2-for-1, or work the regular offense, or try to shoot a 3-pointer?).
His and your approach might complement or supplement each other real well.
Back to top
View user's profile Send private message
tenkev
Joined: 31 Jul 2005
Posts: 20
Location: Memphis,TN
PostPosted: Tue Feb 14, 2006 2:37 am Post subject: Reply with quote
I think this is absolutely fantastic.
I've had an idea that relates to this for some time.
If you can calculate the expected winning % at any given time during the game based on point differential, time remaining and possession, then you can make a metric that would blow DanVal out of the water.
Dan's regression formula for deriving his player rating is
MARGIN=b0 + b1X1 + b2X2 + . . . + bKXK + e, where
MARGIN=100*(home team points per possession – away team points per possession)
Well, what if, instead of the margin being the difference in points per possession while a unit is on the floor, we made it the difference in expected winning %?
This way, you could account for the fact that points in a close ball game are more valuable than in a blow out, and a game winning shot is more valuable than another shot, etc.
What do you think? It would take a lot of work, but if somebody did it, that would be the best possible player rating, IMO.
Back to top
View user's profile Send private message Send e-mail AIM Address
Ed Küpfer
Joined: 30 Dec 2004
Posts: 785
Location: Toronto
PostPosted: Tue Feb 14, 2006 3:47 am Post subject: Reply with quote
tenkev wrote:
Well, what if, instead of the margin being the difference in points per possession while a unit is on the floor, we made it the difference in expected winning %?
This way, you could account for the fact that points in a close ball game are more valuable than in a blow out, and a game winning shot is more valuable than another shot, etc.
I seem to recall that DanR put a "clutch" modifier in his model somewhere. But, yes, I think that for a comprehensive rating system, using changes in win probability as the response variable is preferable to using points.
Quote:
What do you think? It would take alot of work, but if somebody did it that would be the best possible player rating, IMO.
It would take much more work. The stuff I've done — to the extent that I've done anything at all — is coarse. Some problems:
1. I've only used one season of data. That can't be good. This can be addressed soon.
2. Possession isn't indicated, but it is clearly an important variable towards the end of the game. This is harder to address, because I don't have an automatic way of digging possession out of the PBPs the way I did with score changes.
3. You'd still need the other data DanR used: the identity of the other players on the floor.
4. Credit needs to be given out. If the probability of a home win increases by 0.2 on a single possession, who gets what credit? Half should be deducted from the defense, obviously, but should it be shared equally among all defenders? Should a single defender be credited? Same thing for the offense, although there it's probably less problematic to assign credit.
I'm envisioning a smaller scale usage. Maybe a game-level analysis, done one game at a time by any interested fan. This would eliminate most of the problems, since the fan could manually code the missing data. For example, tonight I watched the Raptors at the Wolves, and it seemed to me that KG was a terrifying defensive presence. Since I watched the game, I could print out a PBP and code his defensive assignments manually, along with most of the other players. This type of thing could be done on a larger scale for the playoffs.
I think I'm going to try doing a single game, just to see what kind of problems come up. The Raps are in NY on Wednesday, a game which promises to exhaust my supply of boredom, but maybe scoring the game by hand this way will perk things up.
_________________
ed
Back to top
View user's profile Send private message Send e-mail
Ed Küpfer
Joined: 30 Dec 2004
Posts: 785
Location: Toronto
PostPosted: Tue Feb 14, 2006 3:52 am Post subject: Reply with quote
mtamada wrote:
I think he was more interested in a discrete game-state approach (e.g. with 30 seconds left, home team has the ball and a 2 point deficit, should they go for a quick shot to get a 2-for-1, or work the regular offense, or try to shoot a 3-pointer?).
I'll drop him a line, unless he wants to pipe up here...
I created a spreadsheet once which simulated the last few minutes of a game, focusing on 2- or 3-pt strategies. I really enjoyed working through it, and although I left lots of variables out, I saw at the time how it could be modified to include more — given a model to base it on. I still think I need more data (as noted in my reply to tenkev) but it should be workable, if I get off my lazy butt to collect more data.
_________________
ed
Back to top
View user's profile Send private message Send e-mail
Tmon
Joined: 09 Oct 2005
Posts: 9
Location: Boston
PostPosted: Fri Feb 17, 2006 4:57 pm Post subject: Reply with quote
Beautiful stuff, Ed! Thanks for making the data available as well. A few questions/comments:
1. Interesting to note that last year, 4/5 home teams won when time stopped down by 1 pt with 1 sec remaining! An anomaly, I'm sure, but never say die!
2. Conversely, only 6/11 home teams won when winning by 1 pt with 2 secs left when time stopped. Never relax!
3. Were the other "negative lead" data included in the regression, just not shown on the chart? If negative leads were included in the regression, the "lead^2" and "lead^2/min" terms change the sign, causing logic problems.
4. Inclusion of the "min" "min^2" and "min^3" variables seems a bit off logically to me. I realize the "p" values look good... But, the chance of winning should increase at lower time remaining values, so the inverse time terms (lead/min) you include later make more theoretical sense to me, and note those terms have much higher coefficients and coefficient*variable values.
5. For the logistic regression: usually the dependent input is 0 or 1. I think the regression would be more rigorous if the whole data set was broken out, instead of collapsed into say, 110 observations at 1 minute lead of 5, 55 wins, for 50%, which is then weighed as heavily in the regression as a time/lead combo with just one observation for 100%. Or perhaps you did this, and the text file was collapsed for convenience?
6. Finally, I am playing with this data using MATLAB, and the logistic code I have does not provide "p" values (or anything but the coefficients). Is there a chance anybody has more complete logistic code for MATLAB?
All that said, none of the regressions I've done so far give anything that looks as logical as your chart.
-Tmon
Back to top
View user's profile Send private message
Ed Küpfer
Joined: 30 Dec 2004
Posts: 785
Location: Toronto
PostPosted: Fri Feb 17, 2006 5:20 pm Post subject: Reply with quote
Tmon wrote:
3. Were the other "negative lead" data included in the regression, just not shown on the chart? If negative leads were included in the regression, the "lead^2" and "lead^2/min" terms change the sign, causing logic problems.
Hmm. Before, I included the proper sign even with squared variables (var^2 = var * |var|). I'm not sure why I didn't do it this time. Might be worth trying again.
Quote:
4. Inclusion of the "min" "min^2" and "min^3" variables seems a bit off logically to me. I realize the "p" values look good... But, the chance of winning should increase at lower time remaining values, so the inverse time terms (lead/min) you include later make more theoretical sense to me, and note those terms have much higher coefficients and coefficient*variable values.
There's nothing theoretical about what I did. I tried a bunch of different variables and interaction variables until the results looked good. This was harder than I thought — I never thought I would have to cube anything. If you can think of a way to fit a curve to the data using a more theoretical approach, I would appreciate it. I'm not comfortable with what I have so far.
Quote:
5. For the logistic regression: usually the dependent input is 0 or 1. I think the regression would be more rigorous if the whole data set was broken out, instead of collapsed into say, 110 observations at 1 minute lead of 5, 55 wins, for 50%, which is then weighed as heavily in the regression as a time/lead combo with just one observation for 100%. Or perhaps you did this, and the text file was collapsed for convenience?
I used Minitab for the regressions, being much quicker and easier than the more hardcore stats packages I have on my computer. Minitab allows me to use the number of games to weight the results of the outcomes. Imagine my surprise when I found out that other, more complex packages don't allow this as an option on the regressions.
So to answer your question, each game was used as an observation in the regression. I don't know how you'd "unstack" the observations from my data — the way I presented them was pretty much the way I collected them. I suppose you could run a macro to copy each observation g times, where g is the number of games.
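The "copy each observation g times" macro can be sketched as follows. The (minutes, lead, games, wins) column layout here is assumed for illustration, not taken from the actual file:

```python
def unstack(rows):
    """Expand collapsed (minutes, lead, games, wins) rows into one
    binary-outcome row per game, for stats packages that can't
    weight observations by a count column."""
    out = []
    for minutes, lead, games, wins in rows:
        out.extend((minutes, lead, 1) for _ in range(wins))          # home wins
        out.extend((minutes, lead, 0) for _ in range(games - wins))  # home losses
    return out

# The example from the thread: lead of 5 with 1:00 left,
# 55 wins out of 110 observations.
rows = unstack([(1.0, 5, 110, 55)])
print(len(rows), sum(r[2] for r in rows))  # 110 rows, 55 of them wins
```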
_________________
ed
Back to top
View user's profile Send private message Send e-mail
Ed Küpfer
Joined: 30 Dec 2004
Posts: 785
Location: Toronto
PostPosted: Sat Feb 18, 2006 11:54 am Post subject: Reply with quote
Update:
I've doubled the number of observations in the data set. It's now about 650,000. I've also uploaded a zip file containing the same data in "unstacked" format, so that every game observation is on its own row, with a binary win/loss outcome in the final column. Any stats package should now be able to handle this without a problem — as long as it can handle 650,000 rows.
http://ca.geocities.com/edkupfer/basket ... tacked.zip
_________________
ed
Back to top
View user's profile Send private message Send e-mail
Tmon
Joined: 09 Oct 2005
Posts: 9
Location: Boston
PostPosted: Wed Feb 22, 2006 4:17 pm Post subject: Reply with quote
Whoa, nelly! Thanks for unstacking all that. I'm taking and playing with as much data as I can at a time. I definitely can't get all 650,000 rows in — it doesn't even let me try. I tried the data from 5 minutes on, and it let me get it in there, then crashed. Maybe I can look at different leads at specific time points one at a time or something.
-Tmon
Back to top
View user's profile Send private message
farbror
Joined: 13 Oct 2005
Posts: 15
Location: Sweden
PostPosted: Thu Mar 09, 2006 4:08 am Post subject: Reply with quote
This is really interesting stuff! How do you model the correlation structure for the repeated measures? I am assuming that you have multiple data points from several games?
Back to top
View user's profile Send private message
Ed Küpfer
Joined: 30 Dec 2004
Posts: 785
Location: Toronto
PostPosted: Thu Mar 09, 2006 11:43 am Post subject: Reply with quote
farbror wrote:
How do you model the correlation structure for the repeated measures? I am assuming that you have multiple data points from several games?
I have every game from 04-05, 1230 of them. I randomly sampled many observations from within these games, over half a million. I don't know if correlation is much of an issue—the question being asked is, given home team lead L and time remaining in game T, what is the probability of a home team win? I think the method I used was good enough to answer that, at least provisionally, until we add some more observations from other seasons.
_________________
ed
Back to top
View user's profile Send private message Send e-mail
Tmon
Joined: 09 Oct 2005
Posts: 9
Location: Boston
PostPosted: Thu Mar 09, 2006 6:38 pm Post subject: Reply with quote
Ed,
I'm still liking this stuff a lot, and I think you are essentially there for end-of-game situations. However, I was wondering if it would be possible to simplify and choose times that are commonly stated landmarks, such as halftime and end-of-three. My gut says people often put far too much emphasis on the score at these landmark times. Your current formula is probably just as valid at these times, but doing this would also reduce the number of variables, of course, so I can play too.
You could just use every game, instead of randomly sampling. I think there are too few observations at these times in the massive data file you posted to really get a good picture if I pull them out selectively. Any chance you're interested? Gotta figure this out so we can yell at the TV when they say "xx holds a commanding lead at the half"!
-Tmon
Back to top
View user's profile Send private message
gabefarkas
Joined: 31 Dec 2004
Posts: 1313
Location: Durham, NC
PostPosted: Thu Mar 09, 2006 6:53 pm Post subject: Reply with quote
Ed Küpfer wrote:
farbror wrote:
How do you model the correlation structure for the repeated measures? I am assuming that you have multiple data points from several games?
I have every game from 04-05, 1230 of them. I randomly sampled many observations from within these games, over half a million. I don't know if correlation is much of an issue—the question being asked is, given home team lead L and time remaining in game T, what is the probability of a home team win? I think the method I used was good enough to answer that, at least provisionally, until we add some more observations from other seasons.
So, correct me if I'm wrong, but essentially you could have the following:
Probability(HomeTeamWin) = ( ( x * (L^a) ) + ( y * (T^b) ) ) * E
where x, y, a and b are integers that establish the output probability to be within a certain range and/or threshold, and E is a normalizing component or "fudge factor" to bring the bounds to {0, 1}, making it a true probability.
Perhaps you might even need the natural log of the above to smooth it out.
In any case, from what you've got, do you think you could reasonably come up with values for those 5 variables that satisfy that formula, with a tolerable level of error? From your earlier post, it seems as though the equation would have linear, quadratic, and cubic terms for both variables. Is that correct?
Back to top
View user's profile Send private message Send e-mail AIM Address
Ed Küpfer
Joined: 30 Dec 2004
Posts: 785
Location: Toronto
PostPosted: Thu Mar 09, 2006 7:17 pm Post subject: Reply with quote
Tmon wrote:
I was wondering if it would be possible to simplify and choose times that are commonly stated landmarks, such as halftime and end-of-three.
Like this?
Code:
TIME REMAINING
EndQ1 Half EndQ3 10:00 5:00 3:00 2:00 1:00 0:40 0:30 0:20 0:10
20 .92 .96 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
15 .90 .92 .97 .98 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
12 .87 .88 .93 .94 .99 1.00 1.00 1.00 1.00 1.00 1.00 1.00
10 .84 .85 .88 .90 .95 .99 1.00 1.00 1.00 1.00 1.00 1.00
H 9 .83 .83 .86 .87 .93 .97 .99 1.00 1.00 1.00 1.00 1.00
O 8 .81 .81 .83 .84 .89 .94 .97 1.00 1.00 1.00 1.00 1.00
M 7 .79 .79 .80 .81 .85 .90 .94 .99 1.00 1.00 1.00 1.00
E 6 .77 .76 .77 .77 .80 .85 .90 .97 .99 1.00 1.00 1.00
5 .74 .74 .73 .73 .75 .79 .84 .93 .97 .99 1.00 1.00
T 4 .72 .71 .70 .70 .71 .73 .77 .86 .92 .95 .99 1.00
E 3 .69 .68 .66 .66 .66 .67 .70 .77 .83 .87 .94 .99
A 2 .66 .65 .63 .62 .61 .62 .63 .67 .72 .76 .83 .94
M 1 .63 .61 .59 .58 .57 .57 .57 .59 .61 .63 .67 .78
0 .59 .58 .55 .55 .53 .52 .51 .51 .50 .50 .50 .50
L -1 .56 .54 .51 .51 .48 .47 .46 .43 .41 .39 .34 .24
E -2 .52 .51 .48 .47 .44 .42 .40 .35 .31 .27 .20 .08
A -3 .49 .47 .44 .43 .39 .36 .34 .26 .21 .16 .09 .01
D -4 .45 .43 .40 .39 .35 .31 .27 .18 .11 .07 .03 .00
-5 .42 .40 .36 .35 .30 .25 .20 .10 .05 .02 .00 .00
-6 .39 .36 .32 .31 .25 .19 .13 .05 .02 .00 .00 .00
-7 .35 .33 .28 .27 .19 .13 .08 .02 .00 .00 .00 .00
-8 .33 .30 .24 .22 .15 .08 .04 .00 .00 .00 .00 .00
-9 .30 .27 .21 .19 .10 .05 .02 .00 .00 .00 .00 .00
-10 .27 .24 .17 .15 .07 .02 .01 .00 .00 .00 .00 .00
-12 .23 .19 .11 .09 .02 .00 .00 .00 .00 .00 .00 .00
-15 .18 .13 .05 .03 .00 .00 .00 .00 .00 .00 .00 .00
-20 .14 .07 .01 .00 .00 .00 .00 .00 .00 .00 .00 .00
Since all that represents the probability of an average home team beating an average away team, it's more interesting to use the numbers above to modify the log5 formula, like this:
Probability of Home Team Win = (HomeWin * (1 - AwayWin) * W) / (HomeWin * (1 - AwayWin) * W + (1 - HomeWin) * AwayWin * (1 - W))
where HomeWin and AwayWin represent some estimate of the Home and Away teams' win ability (like their Win% or Pythagorean or something), and
W = some HCA weight. Normally, we use simple HCA, which is about 0.6, but the win expectancy equation returns a more precise weight, given the game circumstances.
For example, if two average teams are playing, and the home team has a 5 point lead at halftime, they have a 0.74 probability of a win. But if the home team is the Lakers (Win% = 0.5) and the away team is the Hawks (Win% = 0.3), then the home team win probability is
Code:
p(HomeWin) = (0.5 * (1 - 0.3) * 0.74) / (0.5 * (1 - 0.3) * 0.74 + (1 - 0.5) * 0.3 * (1 - 0.74))
= 0.87
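A minimal sketch of that modified log5 calculation, reproducing the Lakers/Hawks example:

```python
def p_home_win(home_win, away_win, w):
    """log5 adjusted by a situational home weight W, where W is the
    within-game win expectancy for two average teams in this situation."""
    num = home_win * (1 - away_win) * w
    den = num + (1 - home_win) * away_win * (1 - w)
    return num / den

# The example from the post: Lakers (Win% = 0.5) hosting the Hawks (0.3),
# up 5 at halftime, so W = 0.74 from the table.
print(round(p_home_win(0.5, 0.3, 0.74), 2))  # 0.87
```

Note that with two average teams (home_win = away_win = 0.5) the formula collapses to W itself, which is the sanity check one would want.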
_________________
ed
page 2 of 3
Author Message
Ed Küpfer
Joined: 30 Dec 2004
Posts: 783
Location: Toronto
PostPosted: Thu Mar 09, 2006 7:24 pm Post subject: Reply with quote
gabefarkas wrote:
In any case, from what you've got, do you think you could reasonably come up with values for those 5 variables that satisfy that formula, with a tolerable level of error? From your earlier post, it seems as though the equation would have linear, quadratic, and cubic terms for both variables. Is that correct?
Yeah. I used a logistic regression model (it's up there in the second post of this thread), which takes the form:
p = 1 / (1 + EXP(-b))
where b = all the variables (time, time^2, time^3, etc.) weighted by their regression coefficients. I don't know how familiar you are with logistic regression, but it's used on events that have binary outcomes (e.g. win/loss), and returns a nice s-curve bounded at 0 and 1.
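For anyone who wants to evaluate the fitted model, here is a sketch that plugs the coefficients from the regression table earlier in the thread into that form. It is a literal transcription, so it inherits the model's quirks: the Lead/Min terms divide by minutes remaining, so it cannot be evaluated at exactly 0:00.

```python
import math

def home_win_prob(minutes, lead):
    """Win expectancy from the cubic logit model posted in this thread.

    minutes -- minutes remaining (must be > 0), MINUTES + SECONDS/60
    lead    -- home team lead (negative if trailing)
    """
    b = (0.0001040
         + 0.0238027 * minutes - 0.0006059 * minutes**2 + 0.0000064 * minutes**3
         + 0.137276 * lead - 0.0003527 * lead**2 - 0.0002829 * lead**3
         + (0.171210 * lead + 0.0066804 * lead**2 + 0.0069239 * lead**3) / minutes)
    return 1.0 / (1.0 + math.exp(-b))  # inverse logit

# Home team up 5 with 3:00 to play: roughly .79, agreeing with the
# landmark table posted earlier in the thread.
print(round(home_win_prob(3.0, 5), 2))
```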
_________________
ed
Back to top
View user's profile Send private message Send e-mail
gabefarkas
Joined: 31 Dec 2004
Posts: 1311
Location: Durham, NC
PostPosted: Thu Mar 09, 2006 10:22 pm Post subject: Reply with quote
yeah, i know log regression stuff somewhat. it can also be used with Poisson regression models, such as for counts data, or rate data.
that's part of where i was going with my previous post. i think maybe you could try remodeling using the Poisson assumption to simplify it.
Back to top
View user's profile Send private message Send e-mail AIM Address
Ed Küpfer
Joined: 30 Dec 2004
Posts: 783
Location: Toronto
PostPosted: Thu Mar 09, 2006 11:01 pm Post subject: Reply with quote
gabefarkas wrote:
that's part of where i was going with my previous post. i think maybe you could try remodeling using the Poisson assumption to simplify it.
Okay, my turn to ask you to explain. I'm not too familiar with Poisson models. How would you turn it into a probability of a binary outcome?
_________________
ed
Back to top
View user's profile Send private message Send e-mail
gabefarkas
Joined: 31 Dec 2004
Posts: 1311
Location: Durham, NC
PostPosted: Thu Mar 09, 2006 11:08 pm Post subject: Reply with quote
well, the outcome would be modeled as a Poisson, rather than as a Binary.
let me give it some more thought and get back to you.
Back to top
View user's profile Send private message Send e-mail AIM Address
farbror
Joined: 13 Oct 2005
Posts: 15
Location: Sweden
PostPosted: Fri Mar 10, 2006 1:40 am Post subject: Reply with quote
Ed Küpfer wrote:
I randomly sampled many observations from within these games, over half a million. I don't know if correlation is much of an issue
(my bold)
My gut feeling is that correlation is a major issue! A standard logistic regression is based on the assumption that the data points are independent. Data points from the same game are not.
With 1000+ games available you might want to validate your results by sampling a single data point from each game and then do the Logistic regression.
.....and then perhaps repeat the validation a few times?
Do you in any way model the strengths of the involved teams? Falling a few points behind, say, the Portland of today might be easier to overcome than trailing Detroit.
Poisson regression: Poisson regression is an excellent model for soccer and hockey scores. You might need to do some clever stuff with the dispersion parameter if you try to model hoops using Poisson regression.
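The one-observation-per-game validation suggested above could be set up like this. The (game_id, minutes, lead, home_win) row format is hypothetical, and the logistic fit itself is left to whatever package you use:

```python
import random
from collections import defaultdict

def one_per_game(observations, rng=random):
    """Pick a single (minutes, lead, home_win) observation per game, so the
    rows fed into the logistic regression are independent across games.
    Repeat with different seeds to validate the full-sample fit."""
    by_game = defaultdict(list)
    for game_id, minutes, lead, home_win in observations:
        by_game[game_id].append((minutes, lead, home_win))
    return [rng.choice(rows) for rows in by_game.values()]

# Two games, several sampled moments each -> exactly one row per game.
obs = [(1, 40.0, 2, 1), (1, 5.0, 7, 1),
       (2, 30.0, -3, 0), (2, 1.0, -8, 0)]
sample = one_per_game(obs, rng=random.Random(0))
print(len(sample))  # 2
```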
Back to top
View user's profile Send private message
Ed Küpfer
Joined: 30 Dec 2004
Posts: 783
Location: Toronto
PostPosted: Fri Mar 10, 2006 1:09 pm Post subject: Reply with quote
farbror wrote:
My gut feeling is that correlation is a major issue! A standard logistic regression is based on the assumption that the data points are independent. Data points from the same game are not.
I understand what you're saying, but I still don't see how it is a big issue. Think of the question we're trying to answer: given a home team lead of L and M minutes remaining in the game, what is the probability of a home team win? I can't see how sampling repeatedly from the same game, but at different points, affects the answer here.
farbror wrote:
With 1000+ games available you might want to validate your results by sampling a single data point from each game and then do the Logistic regression.
.....and then perhaps repeat the validation a few times?
Okay, I did this. The problem with this approach is that there are not nearly enough data to give a significant regression result. For example, I repeated the process of sampling a single point from each game 10 times, and I still haven't sampled a single datapoint with a home team lead and 5-10 minutes remaining. Think of all the possible Time/Lead combinations: home team leads between -15 and +15 over 48 minutes (actually, I recorded the time down to the second) give us about 1500 possibilities, which means that every sample will have an average of a single datapoint per Time/Lead combination. This won't tell us anything.
I don't want to dismiss your objections out of hand. But I'm still not sure whether a) the correlation issue really makes a difference (I'm not very familiar with the problems inherent in the resampling approach I used), and b) a practical alternative can be conceived. So far, the approach I used at least matches our intuitive feel for how the numbers should look, for whatever that's worth.
_________________
ed
Back to top
View user's profile Send private message Send e-mail
farbror
Joined: 13 Oct 2005
Posts: 15
Location: Sweden
PostPosted: Mon Mar 13, 2006 4:24 am Post subject: Reply with quote
Ed>>
Robust estimation of correlation structures for repeated measurements has been my field of research for some time. It is rather tricky (and I try to deal with simple stuff). The major quirk is that it is really hard to tell when the correlation structure has a major impact on the results.
If 1000+ data points are too few to get significant results, then that is a very interesting finding in itself. It might be an indication that other factors than "time remaining" and "score" are (even more) important predictors.
I appreciate your efforts to investigate this interesting topic. Also, I am very grateful that you share the results.
Cheers, farbror
Sweden
Ed Küpfer
Joined: 30 Dec 2004
Posts: 783
Location: Toronto
PostPosted: Wed Mar 15, 2006 12:10 pm Post subject: Reply with quote
farbror, I tried this again, this time focusing on a) the final minute of games, and b) the final 2 minutes of games. Neither pass gave me significant results. I think what I have to do is revisit this issue when I have more data. Probably this summer I'll have added two more seasons' worth to work with.
For now, all I can say is that the results above seem to conform to my intuition. I prefer to look at it as a useful, pragmatic hack, rather than a reflection of reality. I promise not to put any more confidence in it than it deserves.
_________________
ed
Jon Cohodas
Joined: 08 Jul 2005
Posts: 31
Location: Richmond, VA
PostPosted: Fri Apr 28, 2006 3:44 pm Post subject: Reply with quote
Ed,
This dataset is pure gold. I need to be brief right now (I promise to follow up after I am off the clock today), but here are some of the things I did with a very similar dataset of college football games. Some of these things you probably have already tried.
* Rather than parameterizing the model using the regression, I created an "empirical matrix" of time remaining versus margin. In other words, there would be a cell that says, "empirically" (I'm making up numbers here), when the home team is up by 8 points with exactly 5 minutes left, they won 10/16 times, so that cell would have an "empirical" probability of .625.
Question: Are these recorded events whenever the score changed, or are other events included as well? I ask because, if it is just changes of score, then it should be easy to fill in the blanks for all of the times in between scores.
* One way to get the data to be a little smoother where you do not have many observations, without resorting to parameterization, is to set up a Markov transition matrix for each time/delta. This basically means that prob(W|t,d) is a function of the sum of the different prob(W|t-1) values.
* I love the game graphs! A simple but very telling statistic is what I called the gamescore, which is the integral of your graph. As you stated, this statistic "scores" the game on the change of probability over time and is useful at collapsing blowouts and getting at the "true" closeness of a game. I found that for college football, using this statistic was better than using Margin Of Victory (MOV) in predicting future matchups.
Question 2: Would you be willing to provide a version of the data with the teams involved? That would make it possible to give each game a gamescore.
* One more thing I pursued: once I had gamescores for a season, I was able to estimate a MOV based on gamescores. This was helpful for those who might want to *ahem* predict a MOV for whatever reason. Smile
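The "empirical matrix" idea above can be sketched in a few lines. This is a toy illustration with made-up rows, not Jon's actual college football code; the row layout (game_id, seconds_remaining, home_lead, home_won) is an assumption for the example.

```python
from collections import defaultdict

# Hypothetical observations: (game_id, seconds_remaining, home_lead, home_won)
rows = [
    (1, 300, 8, 1), (2, 300, 8, 1), (3, 300, 8, 0),
    (4, 300, -3, 0), (5, 300, -3, 1),
]

wins = defaultdict(int)
trials = defaultdict(int)
for _, t, d, won in rows:
    trials[(t, d)] += 1
    wins[(t, d)] += won

def p_empirical(t, d):
    """Empirical P(home win | t seconds left, home lead d); None if no data."""
    n = trials.get((t, d))
    return wins[(t, d)] / n if n else None

print(p_empirical(300, 8))  # 2/3 in this toy sample
```

The `None` return for empty cells is exactly the sparsity problem the Markov smoothing in the later post is meant to address.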
Jon Cohodas
Joined: 08 Jul 2005
Posts: 31
Location: Richmond, VA
PostPosted: Fri May 05, 2006 3:35 pm Post subject: Reply with quote
Quote:
I have every game from 04-05, 1230 of them. I randomly sampled many observations from within these games, over half a million. I don't know if correlation is much of an issue—the question being asked is, given home team lead L and time remaining in game T, what is the probability of a home team win? I think the method I used was good enough to answer that, at least provisionally, until we add some more observations from other seasons.
Ed,
Excuse me for being dense, but are you saying that the 650,000+ observations are not each time/margin observation in the sample, picked once, but rather 650,000+ independent samples from the dataset, including oversampling?
Would it be unseemly for me to beg for even the reduced dataset that contains each gameid, time, home score and visitor score? I would like to take a crack at replicating the time/delta matrix and also try to generate the probability graphs for individual games.
Jon Cohodas
Joined: 08 Jul 2005
Posts: 31
Location: Richmond, VA
PostPosted: Fri May 05, 2006 4:15 pm Post subject: How one might "smooth" the time/margin matrix Reply with quote
Since there was concern that there might not be enough observations for any one particular time/margin cell to get a good estimate of the probability, here's one way to smooth the data a bit. It is my attempt at a rewrite of something similar I did with college football data.
For notational purposes, let p(T,d) be the probability of winning with T seconds remaining and a lead of d (delta). By definition, p(0, d>0) = 1, and p(0, d<0) = 0. (Overtime, as you noted, is tricky. I would just set p(0,0) to whatever the empirical probability is of the home team winning in overtime.)
Suppose that there were 20 instances where a team was leading by one point with one second remaining. Now for the sake of simplicity, assume that there were only 3 possible outcomes for the final second: an 18/20 = 90% chance that the lead will not change, a 1/20 = 5% chance that the lead will go to 3 (the team with the lead scores another field goal), and a 1/20 = 5% chance that the other team will lead by 1 (the trailing team scores). In this example, the probability of winning given a one point lead with one second remaining is:
p(1,1) = p(0,1)*.90 + p(0,3)*.05 + p(0,-1)*.05
       = 1*.90 + 1*.05 + 0*.05 = .95
Suppose one did this for every margin with one second left. Then the probability of winning with two seconds remaining, p(2,d), for each point differential would be calculated using the p(1,d) values from above. In other words, the probabilities are being modelled as a Markov process.
Another way of looking at this is that instead of comparing a T/D with the final result, you are just comparing it with the states at T+1.
This method will give some smoothness for situations where, say, a team was down 20 at the half, rallied back to within 5 with a minute to go, and lost. Instead of just tabulating this as "down 20 at the half, therefore lost," the probability would be based on the probability that one could win when down 5 with a minute to go.
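Here is a minimal sketch of the backward induction described above, hard-coding the one-second worked example as the transition table. The `trans` structure and the 0.5 tie value in `boundary` are illustrative assumptions, not fitted values.

```python
# Backward induction over a time/lead Markov chain.
# trans[t][d] maps a lead d at t seconds remaining to the empirical
# distribution of leads one second later; here it holds only the toy example.

def boundary(d):
    # p(0, d): game over. A tie goes to overtime; 0.5 stands in for the
    # empirical home overtime win rate suggested in the post above.
    return 1.0 if d > 0 else (0.0 if d < 0 else 0.5)

def win_prob(t, d, trans):
    """p(t, d) = sum over next-second leads d2 of P(d -> d2) * p(t-1, d2)."""
    if t == 0:
        return boundary(d)
    return sum(p * win_prob(t - 1, d2, trans) for d2, p in trans[t][d].items())

# Worked example: up 1 with 1 second left; 18/20 the lead holds,
# 1/20 it grows to +3, 1/20 it flips to -1.
trans = {1: {1: {1: 0.90, 3: 0.05, -1: 0.05}}}
print(win_prob(1, 1, trans))  # 0.95, matching the calculation above
```

With a full transition table estimated from the data, the same recursion would roll back second by second to fill every time/lead cell, borrowing strength from better-populated neighboring states.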
Ed Küpfer
Joined: 30 Dec 2004
Posts: 783
Location: Toronto
PostPosted: Sat May 06, 2006 2:05 pm Post subject: Reply with quote
Jon: I just saw these posts now. I don't have time to read them closely right now, but it looks like a lot of good stuff. Expect a real reply within a couple of days.
_________________
ed
suburbanDad
Joined: 10 May 2006
Posts: 1
PostPosted: Thu May 11, 2006 8:27 am Post subject: NBA within game different from NCAA? Reply with quote
This is brilliant work Ed.
I wonder whether the within game odds are different for the NBA. Is it difficult to get the NBA PBPs?
Also, ball possession seems important. Three points down at 0:12 is very different with the ball than without. I see that you didn't include ball possession. Is that because you didn't have it in your NCAA dataset?
sD
Jon Cohodas
Joined: 08 Jul 2005
Posts: 31
Location: Richmond, VA
PostPosted: Thu May 11, 2006 2:15 pm Post subject: Reply with quote
Quote:
I wonder whether the within game odds are different for the NBA. Is it difficult to get the NBA PBPs?
Also, ball possession seems important. Three points down at 0:12 is very different with the ball than without. I see that you didn't include ball possession. Is that because you didn't have it in your NCAA dataset?
I am not Ed, but I hope he doesn't mind my answering.
I'm quite certain that Ed used NBA and not NCAA data.
Getting NBA PBPs is not difficult. They are found at nba.com, espn.com, and a few other places. The trick is parsing them. I started to take a crack at it myself a few months back, but my data was corrupted and I did not pursue it further at the time.
I believe what Ed did was sample from the lines of the PBP where a score took place, so by definition, the possession was with the team that scored. If one were to look at the continuum of time in between scores, one would have to note each change of possession that did not involve a score in order to analyze ball possession.
gabefarkas
Joined: 31 Dec 2004
Posts: 1311
Location: Durham, NC
PostPosted: Thu May 11, 2006 5:56 pm Post subject: Reply with quote
Ed Küpfer wrote:
gabefarkas wrote:
that's part of where i was going with my previous post. i think maybe you could try remodeling using the Poisson assumption to simplify it.
Okay, my turn to ask you to explain. I'm not too familiar with Poisson models. How would you turn it into a probability of a binary outcome?
I realized I never got back to you about this. What you've done here is a binomial logit (logistic regression) model, with the form:
logit(Pi) = log (Pi / (1 - Pi) ) = a + b1x1 + b2x2 + ...
This model ensures that the response will be between 0 and 1.
A Poisson loglinear model predicts the expected value of "y" (the response variable), and takes the form:
log(E(y)) = a + b1x1 + b2x2 + ...
And it's used for counts of things, or rate data, or also when putting together a contingency table. So, you couldn't use it for a binary outcome, but if you have the total number of games, you could model the rate of success.
Loglinear and logit models have a lot of connections between them, and oftentimes there's an equivalent version of one that can be found in the other.
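For concreteness, the two link functions can be compared directly in code. The coefficients below are illustrative only, not Ed's fitted values; the point is just that the logit link maps any linear predictor into (0, 1), while the Poisson log link maps it onto a positive expected count.

```python
import math

def logit_prob(x, a, b):
    """Binomial logit: log(p / (1 - p)) = a + b*x, so p = 1 / (1 + e^-(a + b*x))."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))

def poisson_mean(x, a, b):
    """Poisson loglinear: log(E[y]) = a + b*x, so E[y] = e^(a + b*x)."""
    return math.exp(a + b * x)

# Illustrative coefficients only, not the fitted model from this thread.
p = logit_prob(5, a=0.0, b=0.137)
mu = poisson_mean(5, a=0.0, b=0.137)
print(p, mu)
```

This is why the logit form suits the binary win/loss outcome here, while the loglinear form would suit a count such as wins out of a known number of games.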
page 3 of 3
capnhistory
Joined: 27 Jul 2005
Posts: 62
Location: Durham, North Carolina
PostPosted: Fri May 12, 2006 7:10 pm Post subject: Reply with quote
While those of you with serious math skills work on the real research, I just wanted to toss in an observation. Basketball appeals to me as a spectator more than other sports because of its dynamic nature. Something that stood out to me about Ed's initial chart was how large a lead had to be for the home team to have the game locked down.
I think about it like this: imagine the home team is clinging to a four point lead in the game's closing minutes. If they are exchanging possessions (i.e. matching the visiting team score for score and stop for stop) to maintain that 4pt lead, then the chart indicates there's still as much as a 20% chance of the visiting team coming back to win with as little as 90 seconds remaining in the game. I know the data are still rough, and the results may change, but to me this indicates that a fair number of games that reach this situation end in a dramatic, come-from-behind, stealing-one-on-the-road type of win. Again, I find that nature thrilling, and it's a large part of why I love this sport.
I chose a 4 point lead because, while teams can score that much on an atypical play (foul on a made trey), it essentially represents the lowest "two possession" lead a team can have. By comparison, if a home football team has an 8 point lead (again doable in one possession, but really a two possession lead), what are the odds the visiting team overcomes that deficit in less than two minutes? I imagine they are much smaller than 1-in-5. I would be really interested if Jon has found any numbers regarding this. I can't contribute to the research, but I just thought I should share that I found meaning in the discussion so far.
_________________
Throw it down big man!
More expansive basketball babble at a slower pace The Captain of History
Jon Cohodas
Joined: 08 Jul 2005
Posts: 31
Location: Richmond, VA
PostPosted: Mon May 15, 2006 9:55 am Post subject: Reply with quote
Quote:
By comparison, if a home football team has an 8 point lead (again doable in one possession, but really a two possession lead), what are the odds the visiting team overcomes that deficit in less than two minutes? I imagine they are much smaller than 1-in-5. I would be really interested if Jon has found any numbers regarding this. I can't contribute to the research, but I just thought I should share that I found meaning in the discussion so far.
As you predicted, it is much more difficult in college football.
With two minutes to go, down 8 points:
The home team came back to ultimately win 6/69 = 8.7% of the time.
The visitors came back 4/75= 5.3% of the time.
Keep in mind that these are all situations with that point differential. The data does not contain possession information.
jmethven
Joined: 16 May 2005
Posts: 51
PostPosted: Tue Jun 16, 2009 10:55 am Post subject: Reply with quote
Sorry to bump this topic after 3 years, but has any more research been done in this direction? I know some cool graphs were posted during this year's playoffs at advancednflstats.com.
I think it would be really interesting to combine a win expectancy framework with a box score player rating model like PER, a la what they do over at fangraphs.com for baseball. There are a lot of great players like David Robinson and Kevin Garnett who have been criticized by Bill Simmons and other sportswriters for not being able to take over a game in crunchtime. Using win expectancy could give a sense of how much, say, Kobe Bryant's willingness to shoot the ball at the end of games gives him value over someone like Garnett.