Thanks. Would it be possible to cut this down further (somewhere closer to Layne's length)? If you prefer, I would be happy to do the trimming...

nrestifo wrote:
I guess I'll go first. Let me know if this is appropriate as a summary/bio. I tried my best to keep it short, general, and accessible.
My name is Nick Restifo. In my basketball life, I write for Nylon Calculus and am a special assistant for the D2 powerhouse that is the University of New Haven Chargers. If you like, you can follow me on Twitter at @itsastat.
My overall predictions come from an ensemble of four base models predicting a two-year career-peak blend of RAPM and Win Shares. The four models are a regression and a bagged neural network (to help with stability), each trained on two different subsets of data: all prospects with statistics listed on DraftExpress since 2001-02, and just those prospects who were actually drafted since 2001-02. The inputs are standing reach, RSCI high school rank, standing vertical leap, lane agility test time, true shooting percentage, pace-adjusted per-40-minute rates of points, total rebounds, assists, steals, blocks, turnovers, and personal fouls, minutes per game, age on February 1st of a player's draft year, strength of schedule, and percentage of points from three (to account for some spacing benefits). I average over a player's entire pre-NBA career, weighting each year by minutes played. Unlike other models, I do not assign any extra weight to the most recent years; recent years only count for more if the player played more minutes in them (which is usually the case anyway). For the large amount of missing data from players who did not participate in the combine, I impute regression-based estimates of body dimensions (hand length, body fat, etc.) from listed height and weight; body dimensions are mostly very easy to estimate, for obvious reasons. For the vertical and agility tests, I impute missing values via decision trees trained on a player's age and body dimensions.
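To illustrate the career-averaging step, here is a minimal sketch of a minutes-weighted average, with made-up season numbers (the function name and values are hypothetical, not from the actual model code). Each season counts in proportion to minutes played, so recent seasons only dominate if the player actually logged more minutes in them:

```python
def weighted_career_average(seasons):
    """Minutes-weighted average of a per-40 stat.

    seasons: list of (minutes_played, stat_value) tuples,
    one per pre-NBA season.
    """
    total_minutes = sum(m for m, _ in seasons)
    return sum(m * v for m, v in seasons) / total_minutes

# Example: three college seasons of points per 40 (hypothetical values).
# The big-minutes junior year pulls the average up, with no extra
# recency weight beyond the minutes themselves.
seasons = [(400, 12.0), (900, 16.0), (1100, 20.0)]
avg = weighted_career_average(seasons)
```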
Each model in my ensemble plays an individual role conceptually. By training two of the base models on some 20,000-plus prospects since 2002, I set up the framework so the model can be applied to any basketball player anywhere, not just those who make top-100 prospect lists. For prospects who never play in the NBA and therefore have no RAPM or Win Shares values, I fill the missing values with -4 and 0 respectively, each very close to the absolute minimum career peak of any NBA player since 2001-02; only a handful of NBA players have ever peaked below -4 RAPM or 0 Win Shares. For prospects without an RSCI high school rank, I fill the missing value with 600, a rough estimate of what the average rank would be for the remaining unranked prospects each year. The problem with training these models on every potential player is that, in conjunction with imputing all these missing values, the models become a reflection not only of NBA success but also of whether or not a player will be drafted, which isn't always the same thing.
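The fill-in scheme above can be sketched in a few lines. This is only an illustration of the stated constants (-4 peak RAPM, 0 peak Win Shares, 600 RSCI rank); the field names and the `fill_missing` helper are hypothetical, not the model's actual code:

```python
# Floors described above for prospects who never reached the NBA,
# and the stand-in rank for unranked RSCI prospects.
RAPM_FLOOR = -4.0     # near the minimum career-peak RAPM since 2001-02
WS_FLOOR = 0.0        # near the minimum career-peak Win Shares
UNRANKED_RSCI = 600   # rough average rank of unranked prospects

def fill_missing(prospect):
    """Return a copy of a prospect record with missing values filled."""
    filled = dict(prospect)
    if filled.get("peak_rapm") is None:
        filled["peak_rapm"] = RAPM_FLOOR
    if filled.get("peak_ws") is None:
        filled["peak_ws"] = WS_FLOOR
    if filled.get("rsci_rank") is None:
        filled["rsci_rank"] = UNRANKED_RSCI
    return filled

# A never-drafted, unranked prospect gets the floor values.
p = fill_missing({"name": "Prospect A", "peak_rapm": None,
                  "peak_ws": None, "rsci_rank": None})
```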
To counter these effects, I trained two additional base models only on players who were drafted. Though some of these players also never played in the NBA, these models get a better handle on whether a player will actually succeed in the NBA, more independent of the sometimes clouding question of what gets a player drafted in the first place.
In comparison to other models, since I include high school ranking as a variable, my model favors highly heralded high school players significantly more than other models do. High school rank is an especially important predictor in the regression model trained on all available prospects. This results in additional predicted value for highly ranked high schoolers who might not be as favored in other models, players like Cliff Alexander and Myles Turner, and less predicted value for unranked high school players who do a little better in other models, like Frank Kaminsky. (In Kaminsky's case in particular, he does not do very well by the models trained on all prospects, but does considerably better by the models trained on just the drafted prospects.)
With regard to methodology, my ensemble has its strengths and weaknesses like any other prediction system. I used neural networks as part of my ensemble because they were the most accurate out-of-sample prediction method on my data, and accuracy is obviously valuable. Neural networks are flexible and often better than methods like regression at teasing out complex, non-linear relationships in the training data; with regard to draft prospects, they are also good at capturing just how much better the premier players are than the middle class. But while neural networks are powerfully accurate, they also tend to overfit, attaching themselves to noise in the training data in their pursuit of accuracy. To alleviate these concerns, I applied a process known as bagging to my neural networks, which increases the stability of the predictions by taking the consensus of several neural networks, each trained on a resampled subset of the training data, rather than a single neural network over the complete training data, as the latter is more likely to interpret noise as signal.
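The bagging idea above can be sketched as follows. This is a toy illustration only: it substitutes a simple least-squares learner for the neural networks, and the data and function names are invented. The point is the mechanism, fitting each model on a bootstrap resample and averaging the predictions:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_linear(X, y):
    # Ordinary least squares with an intercept column,
    # standing in for a neural network base learner.
    Xb = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return coef

def predict_linear(coef, X):
    Xb = np.column_stack([np.ones(len(X)), X])
    return Xb @ coef

def bagged_predict(X_train, y_train, X_test, n_models=25):
    """Bagging: fit each model on a bootstrap resample, then average."""
    preds = []
    for _ in range(n_models):
        # Sample training rows with replacement (a bootstrap sample).
        idx = rng.integers(0, len(X_train), size=len(X_train))
        coef = fit_linear(X_train[idx], y_train[idx])
        preds.append(predict_linear(coef, X_test))
    # The consensus of the ensemble is the mean prediction.
    return np.mean(preds, axis=0)

# Toy data: one noisy feature predicting a noisy outcome.
X = rng.normal(size=(100, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=100)
yhat = bagged_predict(X, y, X[:5])
```

Because each base model sees a slightly different resample, idiosyncratic noise tends to cancel in the average, which is the stabilizing effect described above.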
I think this is a great write-up and it should definitely be linked from the main article so readers can learn more. However, I think for our purposes (due to both length restrictions and the attention span of the average reader), it's too in-depth.