Regressing toward a prior: Model Selection Question

kmedved
Posts: 80
Joined: Thu Jul 03, 2014 9:18 pm

Regressing toward a prior: Model Selection Question

Post by kmedved » Tue Nov 06, 2018 3:24 pm

I am interested in approaches for weighting a sample of increasing size against a set of priors to predict future outcomes. This is somewhat Python-focused, but I am interested in general modeling thoughts, as this is a common issue in sports stats.

I have created a dataset with each NBA team's opponent-adjusted offensive and defensive rating after each day of the season, for the last 20 years. I'd like to use this, along with each team's offensive/defensive rating from the previous season, to predict its season-ending rating. So for example, the Bucks have an offensive rating 7 points better than league average right now (after 9 games). Last year they were 2 points better than average. I want to predict where they'll end this year.

Here's an example of what my data looks like.

[Image: sample rows of the dataset]

oRTG+_ytd is their year to date rating.
GP_ytd is the number of games played so far.
oRTG+_f is the season ending rating.
oRTG+_lastYear is their rating last year.

I want to predict oRTG+_f.

I can use a simple regression, but that will miss the weighting for the number of games played so far this year. A +10 rating after 40 games is much more significant than a +10 rating after 5 games. I don't think adding sample weights solves this issue either.

My standard approach to this sort of problem is to optimize X and Y in the following equation using Excel's solver or scipy.optimize.curve_fit.

(<oRTG+_ytd>*<GP_ytd> + <oRTG+_lastYear>*X + 0*Y)/(<GP_ytd>+X+Y) = <oRTG+_f>

What that does is essentially weight each team's current rating by the number of games played so far. Last year's rating gets X games of weight, and a regression-to-the-mean term gets Y games of weight (0 = league-average offensive rating). The result is a weighted average. This is similar to the methodology laid out here, except that it adds a feature for last year's rating as well: http://statitudes.com/blog/2013/11/12/h ... -using-srs
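For anyone who wants to try this, here is a minimal sketch of the fit with scipy.optimize.curve_fit. The data below is synthetic and every number in it (the spread of team talent, the per-game noise, the noise in last year's rating) is an assumption made up for illustration, not something taken from the real dataset. The function name blended_rating and the array names are likewise hypothetical.

```python
import numpy as np
from scipy.optimize import curve_fit

def blended_rating(features, x, y):
    """Weighted average of YTD rating, last year's rating, and the league mean (0)."""
    ytd, gp, last_year = features
    return (ytd * gp + last_year * x + 0.0 * y) / (gp + x + y)

# Synthetic stand-in for the real table (all parameters are assumptions).
rng = np.random.default_rng(0)
n = 20000
talent = rng.normal(0.0, 4.0, n)                       # true team strength
last_year = talent + rng.normal(0.0, 3.0, n)           # oRTG+_lastYear
gp = rng.integers(1, 83, n).astype(float)              # GP_ytd, 1..82
ytd = talent + rng.normal(0.0, 12.0, n) / np.sqrt(gp)  # oRTG+_ytd
final = talent + rng.normal(0.0, 12.0 / np.sqrt(82), n)  # oRTG+_f

features = np.vstack([ytd, gp, last_year])

# Fit X and Y, constrained to be non-negative "games of weight".
(x_hat, y_hat), _ = curve_fit(
    blended_rating, features, final, p0=(10.0, 10.0), bounds=(0.0, np.inf)
)

mse_blend = np.mean((blended_rating(features, x_hat, y_hat) - final) ** 2)
mse_naive = np.mean((ytd - final) ** 2)  # baseline: use the YTD rating as-is
print(f"X = {x_hat:.1f} games, Y = {y_hat:.1f} games")
print(f"MSE blended = {mse_blend:.2f}, MSE YTD-only = {mse_naive:.2f}")
```

On real data, the three arrays would simply come from the oRTG+_ytd, GP_ytd and oRTG+_lastYear columns, with oRTG+_f as the target. Note that the synthetic final rating here is drawn independently of the year-to-date games, which is a simplification.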

This may not be the only model however, and I'm not sure it's the best model. It's the one that makes intuitive sense to me (a weighted average based on the number of games played so far), but given the number of machine learning options available, I am interested in whether other, potentially more elegant models exist which can do this just as well.

This is a common problem in sports analytics, and it obviously generalizes beyond this particular dataset. I am curious how people go about solving problems like this. Is this an area where Bayesian regression is needed?

Crow
Posts: 5467
Joined: Thu Apr 14, 2011 11:10 pm

Re: Regressing toward a prior: Model Selection Question

Post by Crow » Sat Nov 10, 2018 4:02 am

Would be nice if one or more folks with the proper background would offer some feedback here. It is not for me.

Mike G
Posts: 4139
Joined: Fri Apr 15, 2011 12:02 am
Location: Asheville, NC

Re: Regressing toward a prior: Model Selection Question

Post by Mike G » Sat Nov 10, 2018 12:44 pm

I actually started something like this. At b-r.com, I looked up a few of last season's standings at various points in the early season. This related to the question of when our pre-season predictions should be outweighed by current season performance.

pre% refers to the average of 18 APBR predictions submitted in Oct. 2017.
Error is relative to what we now know were season totals.
Of course the season ends with zero error between current and final status of whatever team stat we are looking at. I was looking at MOV and came up with these weights for minimal errors:

Date   G    pre%   cur%    err   e=.5
Nov.1  07   0.75   0.25   2.01   0.29
Nov15  14   0.76   0.24   1.93   0.41
Nov31  21   0.57   0.43   1.59   0.51
Dec15  28   0.41   0.59   1.50   0.58
...
April  82   0.00   1.00   0.00   1.00
The final column shows the results of the formula:
cur% = (G/82)^0.5
and is meant to simulate the cur% column.
The G column is just the avg games by all teams, at that point in the season.
Last year, pre-season predictions were outweighed by to-date performance after 21-28 games.
It's a very crude and small sample; another exponent -- this is just the square root -- is apt to be better overall.
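As a quick sketch of the formula above, the snippet below computes cur% = (G/82)^exponent and uses it to blend a pre-season number with a to-date number. The function names and the exponent parameter are hypothetical conveniences; the only thing taken from the table is the square-root form and the 82-game season.

```python
def current_weight(games_played, season_games=82, exponent=0.5):
    """Weight on to-date performance: (G / season_games) ** exponent."""
    return (games_played / season_games) ** exponent

def blend(preseason, current, games_played, exponent=0.5):
    """Mix a pre-season prediction with to-date performance."""
    w = current_weight(games_played, exponent=exponent)
    return (1.0 - w) * preseason + w * current

# Recompute the e=.5 column for the G values listed in the table.
for g in (7, 14, 21, 28, 82):
    print(g, round(current_weight(g), 2))
```

With the square-root exponent the weight on current performance passes 0.5 at G = 20.5, which lines up with the 21-28 game crossover noted above. Trying other exponents is just a matter of changing the exponent argument.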

DSMok1
Posts: 850
Joined: Thu Apr 14, 2011 11:18 pm
Location: Maine

Re: Regressing toward a prior: Model Selection Question

Post by DSMok1 » Mon Nov 12, 2018 2:52 pm

Interesting discussion!

To me, this looks like an excellent topic to view, as you suggest, through a Bayesian framework. In other words, we develop a prior expectation entering the year, with an associated error term, and then update with the current year's data (with its associated error term).

The difficulty lies in assessing the error term for the current year data, but it should be something that can be estimated relatively easily. I would expect the error on the current year data point to be approximately proportional to the inverse of the square root of the number of games played--a standard error term. In reality, as the number of games played goes toward infinity, the error would not approach zero, because the team's overall ability is not 100% unchanging...but that can be accounted for by adding a small constant to the current year standard error term.

I would structure this as:

Overall population curve (+ standard deviation of population)

Update with

Last year's final result (with a standard error--this would have to be derived via trial and error to determine how much it decays in value coming into this year)

Update with

This year's partial result (with a standard error, generally proportional to the inverse of sqrt(n)--again, use a best-fit approach to determine the size of the term).
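The three steps above can be sketched as a chain of normal-normal updates, where each step combines the running estimate and the new data point in proportion to their precisions (1 / variance). This is only one way to read the structure, and every number below (population SD of 4, last-year SD of 3, per-game SD of 12, floor of 0.5) is a placeholder assumption that would need to be fit to data. The function names are hypothetical.

```python
import math

def normal_update(prior_mean, prior_sd, obs, obs_sd):
    """Combine a normal prior with a normal observation, weighting by precision."""
    prior_prec = 1.0 / prior_sd ** 2
    obs_prec = 1.0 / obs_sd ** 2
    post_prec = prior_prec + obs_prec
    post_mean = (prior_mean * prior_prec + obs * obs_prec) / post_prec
    return post_mean, math.sqrt(1.0 / post_prec)

def current_year_sd(games_played, per_game_sd, floor_sd=0.0):
    """Standard error ~ per_game_sd / sqrt(n), plus a small constant floor."""
    return per_game_sd / math.sqrt(games_played) + floor_sd

def predict(ytd, games_played, last_year,
            pop_sd=4.0, last_year_sd=3.0, per_game_sd=12.0, floor_sd=0.5):
    # Step 1: overall population curve, centered on league average (0).
    mean, sd = 0.0, pop_sd
    # Step 2: update with last year's final result.
    mean, sd = normal_update(mean, sd, last_year, last_year_sd)
    # Step 3: update with this year's partial result.
    mean, sd = normal_update(
        mean, sd, ytd, current_year_sd(games_played, per_game_sd, floor_sd)
    )
    return mean, sd

# The Bucks example from the first post: +7 after 9 games, +2 last year.
est, est_sd = predict(ytd=7.0, games_played=9, last_year=2.0)
print(f"estimate = {est:.2f}, sd = {est_sd:.2f}")
```

With floor_sd set to 0, this reduces to the weighted-average formula from the first post, with X = (per_game_sd / last_year_sd)^2 and Y = (per_game_sd / pop_sd)^2 games of weight. So under these assumptions the two approaches are the same model, and fitting X and Y amounts to estimating those two variance ratios.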
Developer of Box Plus/Minus
APBRmetrics Forum Administrator
GodismyJudgeOK.com/DStats/
Twitter.com/DSMok1
