
Testing APM on unseen data (jsill, 2009)

Posted: Fri Apr 15, 2011 7:07 am
by Crow
jsill



Joined: 19 Aug 2009
Posts: 73


PostPosted: Wed Aug 26, 2009 11:29 pm Post subject: Testing APM on unseen data
Hello, everyone. I've been lurking for a bit, but this is my first post.

I've been working on my own website for a little while now. It's not quite ready to go live, but once it is, it should provide several basketball stats analyses which are new, to the best of my knowledge. So I'll let you know when that's ready and I'll be interested in any feedback.

In the meantime, I've been reading up on adjusted plus-minus, which seems to be one of the favorite topics in this forum. I read Eli Witus's writeup (http://www.countthebasket.com/blog/2008 ... lus-minus/ ) and I also downloaded the processed version of the '07-'08 dataset that he provides towards the end of his blog post. I ran the regression on that file and I reproduced the top 10 he gets (Amir Johnson, Ronnie Price, Dwight Howard, etc.) exactly. As he notes, these results are quite close to those of Aaron Barzilai at basketballvalue.com.

My question for those of you who have implemented APM is whether you've tested your models out-of-sample. By "out-of-sample", I mean on data which the model was not fitted on. The APM player ratings arrived at via the regression are presumably supposed to yield predictions about the outcome of future games (or future game-snippets, specifically) if a certain set of 5 players plays against a certain other set of 5 players.

So the ultimate out-of-sample test is to test the model on data from games after the data on which the model was fit. For instance, you could take the APMs from '07-'08 and use them to predict some of the games from '08-'09. If you want to stay within a particular season, you could fit the model on the games through February and then test the model on March and April games, for instance.

A slightly less stringent but still very useful out-of-sample test is to take a randomly selected subset of your dataset and remove it from the data the model is fitted on and simply use it for testing. So here you randomly select game snippets to be used for testing purposes. For instance, you can partition the data into 10 subsets and fit the model 10 times, each time fitting on a different 9/10ths of the data and testing on the final 10th of the data. This procedure is known as cross-validation in the statistics and machine learning literature (I come from a background in machine learning, which is not the same as statistics but is closely related).
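
For concreteness, here is a minimal sketch of that 10-fold procedure applied to an APM-style regression, assuming the data are already in the Witus-style layout: a design matrix of +1/0/-1 player indicators (the intercept / home-court term is handled by the regression itself), an efficiency-margin target in points per 100 possessions, and a possession count per snippet. The names here are placeholders, not anyone's actual code.

Code:
# 10-fold cross-validation for a possessions-weighted APM regression (illustrative sketch).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def cross_validated_errors(X, y, possessions, n_folds=10, seed=0):
    """Fit on 9/10ths of the lineup snippets, test on the held-out 10th, repeat."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    fold_errors = []
    for train_idx, test_idx in kf.split(X):
        model = LinearRegression()
        # Possessions-weighted fit, as in the standard APM setup
        model.fit(X[train_idx], y[train_idx], sample_weight=possessions[train_idx])
        pred = model.predict(X[test_idx])
        # Possessions-weighted squared error on the unseen snippets
        fold_errors.append(np.average((y[test_idx] - pred) ** 2,
                                      weights=possessions[test_idx]))
    return np.array(fold_errors)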

Those of you who know a decent amount about regression will be familiar with the concept of R-squared. Have you looked at the R-squared for the APM regression? Perhaps more importantly, have you looked at an out-of-sample R-squared for the APM regression?

When you evaluate a model like this, you generally want to benchmark it against a trivial model to see that it's doing better than the trivial model. For instance, here a trivial model would be to ignore who is on the floor and simply predict the efficiency margin (i.e. the difference in points per 100 possessions between the two teams) from the average efficiency difference between home and road teams, taken over the entire season of games across the whole league. I get a mean home/road efficiency diff of about 3.7 for the '07-'08 data.

So the benchmarking procedure here would be to compute the (possessions-weighted) mean squared error between the model's predicted margins and the actual margins, on data the model has *not* been fit on. Then you would compare that to the dumb model which just predicts a margin of 3.7 for every single data point.
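
And a correspondingly simple version of the benchmark itself, comparing the model's held-out predictions against the constant home-court margin (again just a sketch; the 3.7 is the '07-'08 figure mentioned above).

Code:
import numpy as np

def weighted_mse(y_true, y_pred, possessions):
    return np.average((y_true - y_pred) ** 2, weights=possessions)

def benchmark(y_test, model_pred, possessions_test, home_court_margin=3.7):
    """Possessions-weighted MSE of the fitted model vs. the 'dumb' constant model."""
    mse_model = weighted_mse(y_test, model_pred, possessions_test)
    mse_dumb = weighted_mse(y_test, np.full(y_test.shape, home_court_margin), possessions_test)
    # If mse_model >= mse_dumb, the APM regression is adding nothing out of sample.
    return mse_model, mse_dumb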

I have done some initial cross-validation experiments along these lines with the '07-'08 data (as processed by Eli Witus) and I have to say that the results are not so good. In fact, I am finding that the dumb model which predicts 3.7 every time does better than APM! However, it is quite possible that I'm doing something wrong, particularly given that I am new to all of this. That's why I'm wondering if someone else has done any of these types of tests.

Can anyone dispute or confirm what I've found regarding the out-of-sample performance of APM?
DJE09



Joined: 05 May 2009
Posts: 148


PostPosted: Thu Aug 27, 2009 12:28 am Post subject:
I would love to see your model evaluation work.

In my area of interest there is considerable modelling going on, with both stochastic/regression models and physically based ones. I have no time for reading papers where the authors haven't attempted to validate their model or test its performance.

Welcome, and great to have your post. I too was once a lurker, and some people might wish I had stayed that way :) But let us know when you are going to start posting your analysis.
Crow



Joined: 20 Jan 2009
Posts: 806


PostPosted: Thu Aug 27, 2009 3:00 am Post subject: Re: Testing APM on unseen data
jsill wrote:


The APM player ratings arrived at via the regression are presumably supposed to yield predictions about the outcome of future games (or future game-snippets, specifically) if a certain set of 5 players plays against a certain other set of 5 players.




If you use 1-year adjusted plus-minus and each player has a standard error of around 1.5, then with 5 guys on each squad doesn't the standard error of that 5-on-5 comparison grow, potentially toward 15, though likely not that high? So isn't what you are finding unsurprising, or at least expected?
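
A rough back-of-the-envelope version of that error propagation, taking the 1.5 per-player standard error as given and ignoring the covariances the regression actually induces between the estimates:

Code:
% D is the predicted margin for lineup A vs. lineup B, built from the 10 estimated ratings
\[
D \;=\; \sum_{i \in A} \hat{\beta}_i \;-\; \sum_{j \in B} \hat{\beta}_j
\]
% If the 10 estimation errors were independent:
\[
\mathrm{SE}(D) \;=\; \sqrt{10 \times 1.5^{2}} \;\approx\; 4.7 \ \text{pts/100 poss}
\]
% In the worst case (errors perfectly correlated in the direction of the signed sum):
\[
\mathrm{SE}(D) \;\le\; 10 \times 1.5 \;=\; 15 \ \text{pts/100 poss}
\]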

Though to be fair, I'd think you'd ideally want adjusted values for home and away, or assume the same home-court advantage tacked onto the sum of the adjusted ratings when comparing against the naive / trivial model.


The alternative would be to compare lineup adjusted rating vs. lineup adjusted rating, assuming the lineup was used enough to get an estimate. What kind of results does that out-of-sample test give? I'd think / hope it would be better, but I agree it would help to know more about how it performs in such a test.



(Ultimately you might be able to characterize lineups as types or player-type sequences, look at a specific lineup's performance against lineups similar to the test case, and adjust the projection by giving some push toward the past results against the most similar lineups.

If 5 players haven't played together, you might be able to estimate it if you had player-pair adjusted ratings, summed them, and perhaps regressed toward the mean some. Or you could find the 5-man lineups that were most similar in composition (preferably only 1 player different; 2 might be too much to be meaningful) and perhaps average them as the estimate, perhaps with some modification.)

Last edited by Crow on Fri Nov 27, 2009 6:12 pm; edited 1 time in total
Ryan J. Parker



Joined: 23 Mar 2007
Posts: 708
Location: Raleigh, NC

PostPosted: Thu Aug 27, 2009 6:15 am Post subject:
Just wanted to say welcome, and I eagerly await the launch of your website. :D

I'm most eager to predict team efficiency in future seasons, but in my mind a basic APM-like model would be the starting point. I've seen other instances where just picking the average will give you a lower RMSE. Maybe others will have a better perspective on this, but I would prefer to set the APM model as the baseline and work on beating that.

I'm currently in the process of doing such analyses, so I'm glad to see there's someone out there already working on this. I know that Steve is very interested in this as well.
_________________
I am a basketball geek.
Ryan J. Parker



Joined: 23 Mar 2007
Posts: 708
Location: Raleigh, NC

PostPosted: Thu Aug 27, 2009 6:56 am Post subject:
Thinking about it more, I'm interested to know exactly what you're predicting. My plan was to look at each possession and predict the mean points scored on that possession, so you'd never see predictions of 3.7. I want to understand exactly how your predictions are set up; that might give insight into why your model is performing poorly.
_________________
I am a basketball geek.
Ilardi



Joined: 15 May 2008
Posts: 263
Location: Lawrence, KS

PostPosted: Thu Aug 27, 2009 8:37 am Post subject:
A few things to keep in mind:

1) Single-season APM estimates are notoriously noisy (high SEs) due to heavily intercorrelated teammate minutes - a problem which can be addressed by using multi-season datasets heavily weighted toward the target season of interest. Here's an explanation and a set of relatively low-noise estimates for 07-08: http://www.82games.com/ilardi2.htm. Likewise, even lower-noise estimates for 08-09: http://spreadsheets.google.com/ccc?key= ... 4WkE&hl=en

2) The 3.7 in jsill's model is simply his constant term, which serves in context as the home court advantage (in pts/100 poss).

3) The R^2 for predicting each individual lineup outcome based on player APM is fairly low (down around .01 to .02), but the model F is always significant at p<.00001. The low R^2 is not surprising, and it tells us little about the importance (or lack thereof) of player APM. Consider, for example, that the R^2 for predicting the outcome of all player at-bats over a season based on each player's batting average is less than .01, as is the R^2 for predicting individuals' lung cancer status based on the number of cigarettes smoked. We still tend to think batting average and smoking are important predictive variables, and rightly so . . . it's just that their predictive impact is much more clearly seen in data aggregates (e.g., a collection of 600 at-bats rather than 1 at-bat, a sample of 100,000 smokers rather than 1 individual). Likewise, you'll more clearly see the effects of APM across a season's worth of lineups, rather than a single lineup observation across only a few possessions. (A rough sketch of this point follows below.)

4) Accordingly, in my opinion, the best test of the predictive validity of APM estimates (or any other metric) is that of predicting (or retrodicting) team performance - i.e., point differential (efficiency) - over an entire season. I've got much more to say about this point - and an APBR Retrodiction Challenge to issue - but have to run off to teach right now.
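
Here is the rough sketch promised under point 3, with d standing for a real player effect and sigma for the per-observation noise in the efficiency margin (both symbolic, not estimated from the data):

Code:
% For a single lineup observation, the explainable share of the variance is tiny:
\[
R^{2}_{\text{single obs}} \;\approx\; \frac{d^{2}}{d^{2} + \sigma^{2}} \;\ll\; 1
\qquad \text{when } d \ll \sigma .
\]
% But averaging N observations shrinks the noise while the effect stays the same size:
\[
\text{signal-to-noise of the season aggregate} \;\approx\; \frac{d}{\sigma/\sqrt{N}} \;=\; \sqrt{N}\,\frac{d}{\sigma},
\]
% which grows with N even though the per-observation R^2 stays near zero.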
Ryan J. Parker



Joined: 23 Mar 2007
Posts: 708
Location: Raleigh, NC

PostPosted: Thu Aug 27, 2009 9:27 am Post subject:
Ilardi wrote:

4) Accordingly, in my opinion, the best test of the predictive validity of APM estimates (or any other metric) is that of predicting (or retrodicting) team performance - i.e., point differential (efficiency) - over an entire season. I've got much more to say about this point - and an APBR Retrodiction Challenge to issue - but have to run off to teach right now.


Predicting team performance over an entire season is exactly what I've been progressing towards. Looking forward to the APBR Retrodiction Challenge. 8-)
_________________
I am a basketball geek.
jsill



Joined: 19 Aug 2009
Posts: 73


PostPosted: Thu Aug 27, 2009 9:39 am Post subject:
Crow:

Quote:
Though to be fair, I'd think you'd ideally want adjusted values for home and away, or assume the same home-court advantage tacked onto the sum of the adjusted ratings when comparing against the naive / trivial model.


When running APM, I am indeed including the same constant term which corresponds to the home-court advantage. The comparison is "constant term only" vs. "adjusted plus-minus parameters for each player plus constant term". You might wonder how the second model could possibly do worse given that it has everything the first model has plus many extra parameters. If you look at the in-sample fit, it can't do worse. If you look at the out-of-sample performance, though, it's not uncommon to see more complicated models do worse when they have lots of extra parameters and the data is noisy and limited.
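
If it helps, here is a self-contained toy illustration of that point on made-up data (not the APM dataset): the model with many extra parameters always matches or beats the constant-only model in-sample, yet loses out-of-sample because the extra predictors are pure noise.

Code:
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_train, n_test, n_extra = 200, 200, 50

# True process: a constant (3.7) plus noise -- the extra predictors carry no signal.
X_train = rng.normal(size=(n_train, n_extra))
X_test = rng.normal(size=(n_test, n_extra))
y_train = 3.7 + rng.normal(scale=20, size=n_train)
y_test = 3.7 + rng.normal(scale=20, size=n_test)

big = LinearRegression().fit(X_train, y_train)   # constant plus 50 extra parameters
const = y_train.mean()                           # constant-only model

def mse(y, pred):
    return np.mean((y - pred) ** 2)

print("in-sample MSE:     big =", mse(y_train, big.predict(X_train)), " constant =", mse(y_train, const))
print("out-of-sample MSE: big =", mse(y_test, big.predict(X_test)), " constant =", mse(y_test, const))
# In-sample the bigger model always looks at least as good; out-of-sample it typically does worse here.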

You could certainly try estimating the performance of lineups directly, as you suggest. It would be an interesting comparison. If the motivation of APM is the evaluation of individuals, though, then it's harder to extract that from a lineup-based model.

Ryan:

Quote:
My plan was to look at each possession and predict a mean points for each possession. Thus you'll never see predictions of 3.7. So I want to understand exactly how your predictions are setup. That might give insight into why your model is performing poorly


As Steve indicates, this is simply a matter of units. The way Eli Witus set up the data, the dependent variable to be predicted is the difference in efficiency between the 2 teams over the period the 2 lineups were on the floor, where efficiency is expressed as points per 100 possessions. So 3.7 corresponds to a difference of 0.037 points per single possession. I guess the motivation for setting it up as points per 100 possessions is that the APM values can then be roughly interpreted as the point-differential impact a player would make over the course of a game, since a game typically has somewhere around 100 possessions.
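
Just restating those units side by side:

Code:
\[
3.7 \ \tfrac{\text{pts}}{100\ \text{poss}}
\;=\; 0.037 \ \tfrac{\text{pts}}{\text{poss}}
\;\approx\; 3.7 \ \text{pts per game, at roughly } 100 \text{ possessions per game.}
\]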

Steve:

Quote:
The R^2 for predicting each individual lineup outcome based on player APM is fairly low (down around .01 to .02), but the model F is always significant at p<.00001. The low R^2 is not surprising, and it tells us little about the importance (or lack thereof) of player APM


I agree that it's not surprising that the R^2 is around 0.01 or so (which is also somewhere around what I get). I agree that models with such low R^2 can still be useful when used in the aggregate to make predictions over a long series of events like a whole season. I just figure you also want to make sure that your *out-of-sample* R^2 is greater than 0, at least. This is, to some degree, a philosophical difference between researchers in statistics and machine learning. Statisticians tend to focus more on in-sample fits and the inferences that can be drawn from them via hypothesis tests about the model parameters, etc. Machine learning researchers tend to focus more on testing models on unseen data, i.e., data the models were not fitted on (probably because the models used in machine learning are more complex and lack the accompanying theory that allows one to make a statement about, e.g., a standard deviation around a parameter). I think it's a useful check in any case to do out-of-sample testing. Your retrodiction challenge sounds interesting.

I'm hoping today to evaluate APM and benchmark it versus the dumb home-court-advantage-only model in the aggregate over the course of many games. I'll let you all know what I find.
DLew



Joined: 13 Nov 2006
Posts: 224


PostPosted: Thu Aug 27, 2009 10:15 am Post subject:
I think testing APM on out-of-sample data is a really useful exercise and I am glad someone is interested in doing it. Now that there are 6 years of data available, it is definitely viable. I doubt you will be able to determine much based on just one-year ratings, but if someone were to divide all the observations from the whole six-year period into two randomly selected groups, fit the model on each data set, and then test its out-of-sample predictive power (relative to a constant-only model) on the other data set, that would likely yield very informative results.
Crow



Joined: 20 Jan 2009
Posts: 806


PostPosted: Thu Aug 27, 2009 10:27 am Post subject:
jsill wrote:


When running APM, I am indeed including the same constant term which corresponds to the home-court advantage. The comparison is "constant term only" vs. "adjusted plus-minus parameters for each player plus constant term".


Ok, I am not surprised that you are doing this though I didn't see explicit mention of it 'til now, so I felt it needed to be smoked out one way or the other. Thanks for the clarification.

Still, some of adjusted plus-minus's worse performance will be related to the data mixing home and away games rather than just one, with some players breaking from the normal home-equals-away performance pattern. The constant term, being an average for everyone, won't necessarily be the right adjustment for those specific players.


I'd be most interested in the out-of-sample performance of adjusted plus-minus against playoff-level teams, whether in the regular season or carried into the playoffs. I think a main goal is to gauge lineup / team strength in that context and prospects for advancement.

Performance against lottery teams counts toward playoff seeding at the same rate, but performance against playoff-level teams, I'd think, is a better predictor of out-of-sample performance against playoff-level teams. So perhaps an adjusted rating based only on performance against playoff-level teams would be worth producing. I think the context difference is large and may be worth removing: some coaches and teams enjoy running up the score against weak teams as much as possible to demonstrate how good they are, while others tend to coast as soon as they feel safe.

Last edited by Crow on Fri Nov 27, 2009 6:15 pm; edited 1 time in total
Ryan J. Parker



Joined: 23 Mar 2007
Posts: 708
Location: Raleigh, NC

PostPosted: Thu Aug 27, 2009 12:25 pm Post subject:
That makes sense. Thanks.

I'd be interested in seeing what sort of errors you're getting with each method.
_________________
I am a basketball geek.
battaile



Joined: 27 Jul 2009
Posts: 38


PostPosted: Thu Aug 27, 2009 11:58 pm Post subject:
It really warms my heart to see all this basketball number crunching going on. I'd be very interested in seeing how this pans out. I've been planning something similar using WinShares just because I have a lot more data available for it and can easily calculate it for the out of sample data, but intuitively it seems clear that APM would yield better results.