p-hacking
Re: p-hacking
Finding p less than .05 one year, then dismissing it and quitting after checking just one previous year without giving the value, is not rigorous enough. It is almost as bad as stopping at one data point.
Re: p-hacking
Data mining like that seems like a pretty good way to produce hypotheses. Discoveries can happen 'by accident' too.
Re: p-hacking
The article is right in that doing what he described is poor statistical practice. It basically invalidates the p value, so if that's your basis for deciding what's "real" or important, you're in bad shape. But Nate and Crow are right that it could be the start of a more rigorous analysis that could lead to something interesting. Perhaps the more important thing to take away is that if you start by p-hacking, you don't necessarily have a great chance at finding the pattern again in out-of-sample testing.
Re: p-hacking
I'd make an even stronger statement: the researcher will almost always find much less statistical significance in out-of-sample testing, and indeed there's good evidence that the follow-up research will often fail to find statistical significance.
xkonk wrote: Perhaps the more important thing to take away is that if you start by p-hacking, you don't necessarily have a great chance at finding the pattern again in out-of-sample testing.
Theoretical example: if you hunt for a coin that will give you five heads in a row (about a 3% probability), you will eventually find one, without too much work or trouble. But I guarantee that when you take that magic coin and flip it five more times, your chance of getting five heads will be very small (with fair coins, it's the same approximately 3% as your original significance level). Your research result will not be reproducible.
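To put rough numbers on that, here is a quick simulation sketch (Python, with every coin fair by assumption): the hunt finds a 'magic' coin quickly, but the magic does not replicate.

import random

random.seed(42)

def five_heads():
    """Flip a fair coin five times; True if all five come up heads."""
    return all(random.random() < 0.5 for _ in range(5))

# Hunt for a "magic" coin: try coins until one gives five heads in a row (p ~ 1/32).
coins_tried = 1
while not five_heads():
    coins_tried += 1
print(f"Found a 'magic' coin after trying {coins_tried} coins")

# Replication: give that magic coin five more flips, many times over.
trials = 100_000
successes = sum(five_heads() for _ in range(trials))
print(f"Chance the magic coin repeats five heads: {successes / trials:.3f}")  # ~0.031, same as any fair coin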
The evidence: 60% of recent psychology research results were not reproducible, according to this famous study from a couple of years ago. This is not because psychologists are especially incompetent or prone to take shortcuts in their research -- in fact it's a sign of their ability to self-criticize that they're the only field that I know of that made a major study like this. The same problem of p-hacking or data fishing occurs in all social sciences (and can occur in the natural sciences too). It's been an open dirty secret for decades, but the authors of this study were the first ones to systematically measure how badly this causes research results to be unreliable.
Re: p-hacking
Sure. Assuming the coin is fair is begging the question though....
mtamada wrote: Theoretical example: if you hunt for a coin that will give you five heads in a row (about a 3% probability), you will eventually find one, without too much work or trouble. But I guarantee that when you take that magic coin and flip it five more times, your chance of getting five heads will be very small (with fair coins, it's the same approximately 3% as your original significance level). Your research result will not be reproducible.
...
The point that p-hacking leads to misleading results is valid. More generally, it's also true that statistics is abused and misrepresented in many other ways. P-hacking is the bête noire du jour, but nobody seems to care about the conflation of 'margin of error' with 'confidence interval', and then there's the whole 'explanatory stats' and trivia thing.
Hmm... do you have a p-value for that?
mtamada wrote: I'd make an even stronger statement: the researcher will almost always find much less statistical significance in out-of-sample testing, and indeed there's good evidence that the follow-up research will often fail to find statistical significance.
The thing is, people should be leery of statistical results in general. It's not like p-hacking has to be an individual act: If we have a bunch of researchers independently checking similar hypotheses, then we expect 1 in 20 of them to get a p value of 0.05 or less by accident, right?
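A quick sketch of that in Python (numpy/scipy; every 'study' below is pure noise by construction, so the null is true for all of them):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_researchers = 2000   # many independent teams testing similar (null) hypotheses
n_obs = 82             # hypothetical sample size per study, e.g. one season of games

false_positives = 0
for _ in range(n_researchers):
    x = rng.normal(size=n_obs)      # e.g. some play-type frequency (noise)
    y = rng.normal(size=n_obs)      # e.g. team performance (independent noise)
    r, p = stats.pearsonr(x, y)     # the null (true correlation = 0) holds by construction
    false_positives += p < 0.05

print(f"Share of researchers with p < 0.05: {false_positives / n_researchers:.3f}")  # ~0.05, about 1 in 20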
Re: p-hacking
This is certainly true, but I tried to be careful to say 'pattern' instead of 'statistical significance'. Number one, depending on how much data diving/p-hacking you did, it could be just as likely that you find the opposite result in a new sample, let alone one that is statistically significant in the same direction. Number two, we could have an entire separate discussion about whether something as applied as the NBA (or anyone, for that matter) should care about statistical significance as opposed to practical significance.
mtamada wrote: I'd make an even stronger statement: the researcher will almost always find much less statistical significance in out-of-sample testing, and indeed there's good evidence that the follow-up research will often fail to find statistical significance.
xkonk wrote: Perhaps the more important thing to take away is that if you start by p-hacking, you don't necessarily have a great chance at finding the pattern again in out-of-sample testing.
Under some circumstances I could envision this being true, but certainly not if the researchers were using different data sets or if the effect was clearly true/significant. Here's an example that people might find interesting: http://andrewgelman.com/2015/01/27/crow ... d-players/
Nate wrote: If we have a bunch of researchers independently checking similar hypotheses, then we expect 1 in 20 of them to get a p value of 0.05 or less by accident, right?
Re: p-hacking
Do you know what the p-value means?
xkonk wrote: ... Under some circumstances I could envision this being true, but certainly not if the researchers were using different data sets or if the effect was clearly true/significant. ...
Nate wrote: If we have a bunch of researchers independently checking similar hypotheses, then we expect 1 in 20 of them to get a p value of 0.05 or less by accident, right?
Re: p-hacking
This is why, when considering a lot of variables and possible interaction variables, applying AIC or cross-validation is far more valuable for discerning actual value.
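As a minimal illustration (Python with scikit-learn; the predictors here are random noise by construction, purely for the sake of example), cross-validation flags an over-fit relationship that in-sample fit makes look real:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

n_teams, n_predictors = 30, 11                  # e.g. 30 teams, 11 play-type shares (illustrative)
X = rng.normal(size=(n_teams, n_predictors))    # candidate variables: pure noise
y = rng.normal(size=n_teams)                    # team rating: unrelated noise

model = LinearRegression().fit(X, y)
in_sample_r2 = model.score(X, y)                                 # flattering: it fits the noise
cv_r2 = cross_val_score(LinearRegression(), X, y, cv=5).mean()   # honest: near zero or negative

print(f"In-sample R^2:       {in_sample_r2:.2f}")
print(f"Cross-validated R^2: {cv_r2:.2f}")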
Re: p-hacking
Yeah, I'm pretty familiar. It's P(data at least as extreme as what was observed | the null hypothesis is true). In the example of the original article in the thread, the null for each of his correlation tests would be that the correlation is 0. But if there's an actual effect, then one would hope that more than 1 in 20 researchers would find p < .05. Even if the null were true, the particulars of any data set and how the researchers decide to test a hypothesis could affect whether the p value reflects what it's supposed to.
Nate wrote: Do you know what the p-value means?
xkonk wrote: ... Under some circumstances I could envision this being true, but certainly not if the researchers were using different data sets or if the effect was clearly true/significant. ...
Nate wrote: If we have a bunch of researchers independently checking similar hypotheses, then we expect 1 in 20 of them to get a p value of 0.05 or less by accident, right?
Did I pass?
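To make that definition concrete, here is a small sketch (Python; the team data are simulated, purely for illustration) that computes a correlation p-value by brute force as P(a correlation at least this extreme | the true correlation is 0) and compares it with the analytic value:

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# One observed sample of 30 "teams": a play-type share vs. a team rating (both simulated).
x = rng.normal(size=30)
y = rng.normal(size=30)
r_obs, p_analytic = stats.pearsonr(x, y)

# The p-value, spelled out: how often does data generated under the null (true r = 0)
# produce a correlation at least as extreme as the one observed?
null_rs = np.array([np.corrcoef(x, rng.permutation(y))[0, 1] for _ in range(20_000)])
p_by_simulation = np.mean(np.abs(null_rs) >= abs(r_obs))

print(f"observed r = {r_obs:.3f}")
print(f"analytic p = {p_analytic:.3f}, simulated p = {p_by_simulation:.3f}")  # the two should agree closely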
Re: p-hacking
Sure, you pass.
xkonk wrote: ... Yeah, I'm pretty familiar. It's P(data at least as extreme as what was observed | the null hypothesis is true). In the example of the original article in the thread, the null for each of his correlation tests would be that the correlation is 0. But if there's an actual effect, then one would hope that more than 1 in 20 researchers would find p < .05. Even if the null were true, the particulars of any data set and how the researchers decide to test a hypothesis could affect whether the p value reflects what it's supposed to.
Did I pass?
So how would having independent data sets reduce the chance of an accidental p<0.05 result on an individual trial?
Re: p-hacking
Hi everyone!
I'm super flattered by all of you taking an interest in my piece - thanks! Your posts and thoughts have given me quite a bit to think about. I'm still learning and while I don't pretend to have done a perfect job (either in the analysis or the explanation), I'd like to explain a little further about my thought processes that I didn't get into in the original piece (I feel like I'm among kindred analytical spirits here as opposed to a general basketball audience).
To Nate's first point, yes, discoveries can certainly happen by accident - but, in this context, I think the best (or at least a better) practice would be to re-evaluate out of sample to verify. (I kind of shot myself in the foot in this regard because I used all the available Synergy data in the first pass.)
To Crow's (and xkonk's) point about this being the starting point for a more rigorous analysis, I totally agree in the general case - out-of-sample testing is where I would start! However, in this case, the purpose of this analysis was really to say, "This is an example of a sort of statistical analysis that, in its undeveloped and flawed form, a general layperson might believe -- let's try to guard against that a little." I suppose I could've included in the piece potential next steps to make this into a meaningful analysis.
Nate, re: your second post, are you saying there's applicability of "conflation of 'margin of error' with 'confidence interval', and then there's the whole 'explanatory stats' and trivia thing" to this particular analysis? I'm curious to hear your thoughts! Also, you're pretty right that p-hacking is a very common (approaching cliched) topic among stats-inclined folks, but I don't think it's quite as well-known in the general public, maybe even less so among basketball fans. I think/hope that introducing/re-introducing this idea to a more general audience, even in this limited scope and fairly simplified form, is worthwhile.
To xkonk's last point, this is exactly what I was thinking - my prior belief is that no single play type correlates with team quality. To Nate's response, by "independent data sets" in this context, do you mean just separate partitions of the larger data set? Otherwise, I don't think you can have truly independent data sets that describe these offensive play type distributions, though I might be missing something!
Again, thanks to all for your interest in my piece! I'm still in the nascent stages of doing sports analytics work, so I'm certainly open to any additional suggestions, comments, criticisms, etc.
Thanks,
Ryan
Re: p-hacking
Good response post. Thanks for dropping by. Will watch / listen for more.
Re: p-hacking
I don't think it would on an individual test per se, but if your results differed across sets, or you used an independent set for out-of-sample testing and noticed a big drop in accuracy/fit, you would realize that your results are probably not so significant.
Nate wrote: So how would having independent data sets reduce the chance of an accidental p<0.05 result on an individual trial?
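Here is a sketch of that kind of check (Python; the two 'seasons' are independent simulated noise, an assumption for illustration): data-mine one sample for the best-looking play type, then see whether it holds up in the other.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_teams, n_play_types = 30, 11          # illustrative sizes

def season():
    """One simulated season: play-type shares and team ratings, all pure noise."""
    return rng.normal(size=(n_teams, n_play_types)), rng.normal(size=n_teams)

X_a, y_a = season()    # "discovery" sample
X_b, y_b = season()    # independent out-of-sample check

# Data-mine season A: keep whichever play type correlates best with team rating.
best = max(range(n_play_types), key=lambda j: abs(np.corrcoef(X_a[:, j], y_a)[0, 1]))

r_in, p_in = stats.pearsonr(X_a[:, best], y_a)
r_out, p_out = stats.pearsonr(X_b[:, best], y_b)

print(f"in-sample:     r = {r_in:+.2f}, p = {p_in:.3f}")    # the cherry-picked fit looks good
print(f"out-of-sample: r = {r_out:+.2f}, p = {p_out:.3f}")  # it usually collapses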
Re: p-hacking
Thanks for coming by and reading the thread. I don't think those issues are particularly apropos to a discussion about p-hacking in any technical way, but they're other issues with how statistics are presented to the public.
ryanchen wrote: ...
Nate, re: your second post, are you saying there's applicability of "conflation of 'margin of error' with 'confidence interval', and then there's the whole 'explanatory stats' and trivia thing" to this particular analysis? ...
...