Data: Preventing Wide-Open 3s
Re: Data: Preventing Wide-Open 3s
Hi, I had a quick look at that exposition (only at the first model, at least).
I have to say I did not really understand the logic behind some of the model assumptions.
A couple of questions (again, I only read up to the first model):
1) The data you are modelling seems to be just 3-point percentages. Yet your question is about wide-open 3s (i.e. where the opponent is more than 6 ft away). Did I misunderstand? How does your data show this if it contains all 3-point attempts?
2) Why is a binomial model (a series of unbiased true/false trials) a good way to model this? That part I didn't get. Are you saying that if a team has no effect on the outcome of the trials, then the outcomes shouldn't be distributed as Binomial? You know that the Bernoulli trials in a Binomial CAN actually model a fixed yet biased coin?
3) I sort of gave up after that, but if I understood correctly you calculate summary statistics (via some MC sims?) to compare the overall Binomial with that of the Celtics? Assuming I understood correctly what you are doing, my question is: don't the parameters theta of your Binomial already contain the Celtics data?
And is that the right way to test your hypothesis in the first place? If your Binomial model is correct (see 2 above), isn't the right way to test the hypothesis a Binomial test, or even better a Kolmogorov-Smirnov or some other non-parametric test? Or at least a parametric test to see if the theta of the Celtics data is (statistically) different from the global theta?
Anyway, sorry if I misunderstood. I only skimmed through it.
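For reference, the parametric test suggested in this post could look something like the sketch below. It is a minimal, standard-library-only exact binomial test, not anything taken from the article: the counts are made up for illustration, and the helper names (`binom_logpmf`, `binom_test_two_sided`) are hypothetical. To sidestep the concern in point 3, the baseline theta is computed from the rest of the league with the team under test removed.

```python
import math

def binom_logpmf(k, n, p):
    """Log-probability of exactly k successes under Binomial(n, p), 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def binom_test_two_sided(k, n, p):
    """Exact two-sided p-value: total probability of all outcomes
    that are no more likely than the observed count k."""
    observed = binom_logpmf(k, n, p)
    tol = 1e-9  # guards against floating-point ties
    total = sum(math.exp(lp)
                for lp in (binom_logpmf(i, n, p) for i in range(n + 1))
                if lp <= observed + tol)
    return min(1.0, total)

# Hypothetical counts of open threes allowed (NOT the article's data).
league_makes, league_attempts = 9000, 24000
team_makes, team_attempts = 270, 800

# Baseline theta from every other team, so the tested team is not in its own null.
rest_p = (league_makes - team_makes) / (league_attempts - team_attempts)

p_value = binom_test_two_sided(team_makes, team_attempts, rest_p)
print(f"team: {team_makes / team_attempts:.3f}, rest of league: {rest_p:.3f}, "
      f"p-value: {p_value:.4f}")
```

Note that testing the league's best team after the fact ignores the selection effect of picking the extreme out of 30, which is presumably why the article looks at the simulated minimum instead.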
Re: Data: Preventing Wide-Open 3s
Mentioned your post to the author at @thecity2 on twitter.
Re: Data: Preventing Wide-Open 3s
This is an interesting piece by Evan. I had originally suggested the binomial test on Twitter, as that is the basic way to check whether a random process is showing pure chance variation or something more. The spread does look a little bit wider than a pure binomial process on open 3s.
I'm not 100% following the beta/theta approach. Does this indicate the results are 100% consistent with the variation coming solely from random chance? Or does this indicate that the "different mints" have something to do with unique characteristics of the arena, or something like that?
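One rough way to put a number on "wider than a pure binomial process" is a dispersion ratio: the Pearson chi-square statistic for equal proportions across teams, divided by its degrees of freedom. The sketch below uses made-up counts (not the article's data), and `dispersion_ratio` is a hypothetical helper name. A ratio near 1 is what pure binomial noise around a single shared rate would produce on average; values well above 1 suggest extra team-to-team spread.

```python
def dispersion_ratio(records):
    """Pearson chi-square for 'all teams share one make rate',
    divided by its degrees of freedom (number of teams - 1)."""
    total_makes = sum(m for m, _ in records)
    total_attempts = sum(n for _, n in records)
    p = total_makes / total_attempts  # pooled make rate
    chi2 = sum((m - n * p) ** 2 / (n * p * (1 - p)) for m, n in records)
    return chi2 / (len(records) - 1)

# Hypothetical (makes, attempts) on open threes allowed per team; NOT real data.
open_threes = [(60, 200), (74, 205), (78, 198), (76, 210),
               (81, 202), (72, 195), (79, 207), (75, 199)]

ratio = dispersion_ratio(open_threes)
print(f"dispersion ratio: {ratio:.2f}")
```

With only a handful of teams the ratio itself is noisy, so it is a quick diagnostic rather than a verdict.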
Re: Data: Preventing Wide-Open 3s
I'm also not Evan, but some things I think I understand:
The data set he read in is titled 'open 3s' or something similar. Assuming the data set was created correctly, it doesn't contain all three-point attempts. Also, a quick look at how many threes teams take in a year (something like https://www.basketball-reference.com/le ... stats-base) shows that total three-point attempts per team are in the 2000s, while the counts in this data are in the upper hundreds.
The first model gives every team in the league the same three point defense probability. This is a kind of default; assume teams *aren't* actually different at defending open threes. This results in a single distribution of plausible theta values (opponent open three point accuracy). Sampling from that distribution and applying it to the actual number of shots faced by teams in the data set, he gets an expected distribution of ranks as well as percentage. To my eye, the actual observed distribution seems rare but maybe not outlandish. He also pulls the minimum rank and minimum percentage from those samples and compares Boston's actual rank/percentage to them. Boston's percentage is unusual but not outlandish (4.8th percentile compared to the samples) but their rank is more unlikely (0.4th percentile).
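The sampling exercise described for the first model can be sketched roughly as below. This is an illustration under stated assumptions, not the article's code: the counts are invented, the prior is assumed to be a flat Beta(1, 1) (the article's prior is not given in this thread), and only the minimum-percentage check is shown, not the rank check. The league is kept small so the plain-Python simulation runs quickly.

```python
import random

random.seed(0)

# Hypothetical (makes, attempts) on open threes allowed; NOT the article's data.
teams = {
    "BOS": (60, 200), "T2": (74, 205), "T3": (78, 198), "T4": (76, 210),
    "T5": (81, 202), "T6": (72, 195), "T7": (79, 207), "T8": (75, 199),
}
total_makes = sum(m for m, _ in teams.values())
total_attempts = sum(n for _, n in teams.values())

def draw_binomial(n, p):
    """One Binomial(n, p) draw as a sum of n Bernoulli trials."""
    return sum(random.random() < p for _ in range(n))

def simulate_min_pct(n_sims=1000):
    """For each simulation: draw one shared theta from its posterior
    (assumed Beta(1, 1) prior), replay every team's attempts with that
    theta, and record the lowest simulated percentage in the league."""
    mins = []
    for _ in range(n_sims):
        theta = random.betavariate(1 + total_makes,
                                   1 + total_attempts - total_makes)
        mins.append(min(draw_binomial(n, theta) / n for _, n in teams.values()))
    return mins

sim_mins = simulate_min_pct()
bos_pct = teams["BOS"][0] / teams["BOS"][1]

# Share of simulated league minimums at or below Boston's observed percentage.
tail = sum(m <= bos_pct for m in sim_mins) / len(sim_mins)
print(f"Boston allowed {bos_pct:.3f}; simulated minimum is that low "
      f"in {tail:.1%} of leagues")
```

A small tail share means the shared-theta model rarely produces a league minimum as low as the observed one.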
The second model allows each team to have its own theta, and the same sampling exercise finds Boston's values to be much more plausible. In other words, the Celtics' observed rank and defensive three-point percentage are more likely in a model where teams differ in their three-point defense. So, the conclusion is that it seems more likely that teams do differ in their open three-point 'defense', with the Celtics being notably good at it. Evan specifically leaves it to someone else to say why, which is a little disappointing.
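A per-team version can be sketched in the same spirit. The assumptions here are loud ones: each team gets its own independent flat Beta(1, 1) prior (the article may well use a hierarchical prior instead, which would shrink teams toward the league average), the counts are invented, and the quantity computed, the posterior probability that Boston has the lowest theta, is a different summary from the article's check. `prob_lowest_theta` is a hypothetical helper name.

```python
import random

random.seed(1)

# Hypothetical (makes, attempts) on open threes allowed; NOT the article's data.
teams = {
    "BOS": (60, 200), "T2": (74, 205), "T3": (78, 198), "T4": (76, 210),
    "T5": (81, 202), "T6": (72, 195), "T7": (79, 207), "T8": (75, 199),
}

def prob_lowest_theta(name, n_draws=5000):
    """Draw every team's theta from its own Beta posterior
    (assumed independent Beta(1, 1) priors) and count how often
    `name` ends up with the lowest theta in the league."""
    wins = 0
    for _ in range(n_draws):
        thetas = {t: random.betavariate(1 + m, 1 + n - m)
                  for t, (m, n) in teams.items()}
        if min(thetas, key=thetas.get) == name:
            wins += 1
    return wins / n_draws

p_lowest = prob_lowest_theta("BOS")
print(f"P(Boston has the lowest theta) is roughly {p_lowest:.2f}")
```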
My own questions would be:
Why not compare the two models directly? AIC or BIC or something else that penalizes for the number of parameters should still prefer the second model if it's a better fit of the data.
How did you get through an entire article using Bayesian models without describing the prior(s) once? What are they? I assume a default in whatever pyro is, but what's the default?
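The direct comparison asked about above could be sketched as follows for the maximum-likelihood versions of the two models: one shared theta (1 parameter) versus one theta per team (as many parameters as teams). The counts are invented, and using total attempts as the sample size in BIC is an assumption, since one could also argue for the number of teams. AIC and BIC are defined for maximum-likelihood fits; for the fitted Bayesian models themselves, something like WAIC or cross-validation would be the closer analogue.

```python
import math

# Hypothetical (makes, attempts) on open threes allowed; NOT the article's data.
teams = {
    "BOS": (60, 200), "T2": (74, 205), "T3": (78, 198), "T4": (76, 210),
    "T5": (81, 202), "T6": (72, 195), "T7": (79, 207), "T8": (75, 199),
}

def binom_loglik(m, n, p):
    """Binomial log-likelihood of m makes in n attempts at make rate p."""
    return (math.lgamma(n + 1) - math.lgamma(m + 1) - math.lgamma(n - m + 1)
            + m * math.log(p) + (n - m) * math.log(1 - p))

def aic(loglik, k):
    return 2 * k - 2 * loglik

def bic(loglik, k, n_obs):
    return k * math.log(n_obs) - 2 * loglik

total_makes = sum(m for m, _ in teams.values())
total_attempts = sum(n for _, n in teams.values())

# Model 1: a single shared theta, estimated by the pooled make rate.
pooled_p = total_makes / total_attempts
ll_pooled = sum(binom_loglik(m, n, pooled_p) for m, n in teams.values())

# Model 2: one theta per team, estimated by each team's own make rate.
ll_team = sum(binom_loglik(m, n, m / n) for m, n in teams.values())

for label, ll, k in [("shared theta", ll_pooled, 1),
                     ("per-team theta", ll_team, len(teams))]:
    print(f"{label:>15}: logL={ll:.2f}  AIC={aic(ll, k):.2f}  "
          f"BIC={bic(ll, k, total_attempts):.2f}")
```

Lower AIC or BIC is better; the per-team model always fits at least as well, so the question is whether the improvement outweighs the penalty for the extra parameters.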
Re: Data: Preventing Wide-Open 3s
xkonk wrote: Mon Feb 01, 2021 9:54 pm
"The first model gives every team in the league the same three point defense probability. This is a kind of default; assume teams *aren't* actually different at defending open threes. This results in a single distribution of plausible theta values (opponent open three point accuracy). Sampling from that distribution and applying it to the actual number of shots faced by teams in the data set, he gets an expected distribution of ranks as well as percentage. To my eye, the actual observed distribution seems rare but maybe not outlandish. He also pulls the minimum rank and minimum percentage from those samples and compares Boston's actual rank/percentage to them. Boston's percentage is unusual but not outlandish (4.8th percentile compared to the samples) but their rank is more unlikely (0.4th percentile)."
OK, I don't want to comment too much on this because, as I said, I didn't really read it thoroughly (not my area of interest), but I believe that if he was trying to determine whether the Boston sample comes from a given Binomial with fixed parameters (just like the rest of the data), then the correct approach might be a parametric hypothesis test: take the Boston sample, calculate the test statistic, and check the p-value. I don't think comparing rank statistics actually tells you much. Of course there is the whole other issue of whether the Binomial is an appropriate model, since it does not say anything about an unbiased or biased sample (the Bernoulli trials can have a fixed bias).
Yes, well, one could have seen that from the heatmap at the beginning. Not sure that whole statistical analysis was necessary. As for the real question, you are right. He didn't answer it or even attempt it, because that is impossible from just looking at the number of 3s alone. I think it might be more complicated to answer.
Again, I quit after reading the first part, but I think he mentions the Beta prior, which is conjugate to the Binomial. So I believe he did something along these lines. I think the whole Pyro exposition is a bit distracting and not really interesting for the statistical analysis (unless you are a Python developer).
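The Beta-Binomial conjugacy mentioned here is a standard result and can be written out as follows. In this sketch $y$ is the number of made open threes out of $n$ attempts, and $\alpha, \beta$ are the prior's shape parameters; the values actually used in the article are not stated in this thread, so they are left symbolic.

```latex
\begin{align*}
\theta &\sim \mathrm{Beta}(\alpha, \beta), \qquad
  y \mid \theta \sim \mathrm{Binomial}(n, \theta) \\
p(\theta \mid y) &\propto \theta^{y}(1-\theta)^{n-y}
  \cdot \theta^{\alpha-1}(1-\theta)^{\beta-1}
  = \theta^{\alpha+y-1}(1-\theta)^{\beta+n-y-1} \\
\Rightarrow\quad \theta \mid y &\sim \mathrm{Beta}(\alpha + y,\; \beta + n - y)
\end{align*}
```

So the posterior stays in the Beta family, with the makes added to the first shape parameter and the misses added to the second.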
Alright, that's all from me on this subject.
It is an interesting question to try to answer, but maybe not like that.