P-values and "statistical significance": what they actually mean




For too long, many scientific careers have been built around the pursuit of a single statistic: p < 0.05.

In many scientific disciplines, this is the threshold beyond which study results can be declared "statistically significant", which is often interpreted to mean the results are unlikely to be a fluke, a mere product of chance.

Except that's not what it actually means in practice. "Statistical significance" is too often misunderstood, and misused. That's why a trio of scientists writing in Nature this week are calling "for the entire concept of statistical significance to be abandoned".

Their main argument: "statistically significant" or "not statistically significant" is too often misinterpreted to mean "the study worked" or "the study did not work". A "true" effect can sometimes yield a p-value greater than 0.05. And we have learned in recent years that science is full of false-positive studies that achieved p-values below 0.05 (read my explainer on the replication crisis in the social sciences for more).

The Nature commentators argue that the math is not the problem; human psychology is. Sorting results into "statistically significant" and "statistically non-significant," they explain, leads to an overly black-and-white approach to scrutinizing science.

More than 800 other scientists and statisticians around the world have signed on to this manifesto. For now, it reads more like a provocative argument than the start of a real upheaval. Nature, for its part, "is not seeking to change how it considers statistical analysis in evaluation of papers at this time," the journal noted.

But the tide may be turning against "statistical significance". This is not the first time scientists and statisticians have challenged the status quo. In 2016, I described how many of them had called for tightening the threshold to .005, which would make it much harder to call a result "statistically significant". (Around the same time as the Nature commentary, the journal The American Statistician devoted an entire issue to the problem of "statistical significance".) It is widely recognized that p-values can be problematic.

I suspect this proposal will be debated at length (as most things are among scientists). At the very least, this latest call for radical change highlights an important fact that afflicts science: statistical significance is widely misunderstood. Let me walk you through it. I think it will help you understand this debate better, and see that there are many more ways to judge the merits of a scientific finding than p-values alone.

Wait, what's a p-value? What is statistical significance?



Even the simplest definitions of p-values tend to get complicated, so bear with me as I break it down.

When researchers calculate a p-value, they are testing what's called the null hypothesis. First thing to know: this is not a test of the question the experimenter most desperately wants to answer.

Let's say the experimenter really wants to know whether eating one chocolate bar a day leads to weight loss. To test this, they assign 50 participants to eat one chocolate bar a day. Fifty others are asked to abstain from the delicious stuff. Both groups are weighed before and after the experiment, and their average weight change is compared.

The null hypothesis is the devil's advocate argument. It states that there is no difference in weight between the chocolate eaters and the abstainers.

Rejecting the null is a major hurdle scientists need to clear to prove their hypothesis. If the null stands, it means they have not ruled out a major alternative explanation for their results. And what is science if not a process of narrowing down explanations?

So how do they rule out the null? They calculate some statistics.

The researcher essentially asks: how ridiculous would it be to believe that the null hypothesis is the real answer, given the results we see?

Rejecting the null is a bit like the "innocent until proven guilty" principle in court cases, said Regina Nuzzo, a mathematics professor at Gallaudet University. In court, you start off assuming the defendant is innocent. Then you start looking at the evidence: the bloody knife with the defendant's fingerprints on it, a history of violence, eyewitness accounts. As the evidence mounts, that presumption of innocence starts to look naive. At a certain point, jurors feel, beyond a reasonable doubt, that the defendant is not innocent.

Null hypothesis testing follows a similar logic: if there are large and consistent weight differences between the chocolate eaters and the chocolate abstainers, the null hypothesis (that there is no difference in weight) starts to look ridiculous, and you can reject it.
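To make that concrete, here is a minimal sketch in Python of how the chocolate comparison might be tested. The weight-change numbers are made up, and the two-sample t-test is just one common way to test a null of "no difference in means"; it is not necessarily what any particular study would use.

```python
# A sketch of the chocolate example with made-up numbers (kg of weight change,
# negative = weight lost). ttest_ind tests the null of equal group means.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
chocolate = rng.normal(loc=-0.8, scale=2.0, size=50)   # hypothetical chocolate eaters
abstainers = rng.normal(loc=0.0, scale=2.0, size=50)   # hypothetical abstainers

t_stat, p_value = stats.ttest_ind(chocolate, abstainers)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# A small p-value lets us reject "no difference in weight"; it does not,
# by itself, show that chocolate caused whatever difference we see.
```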

You may be thinking: isn't that a pretty roundabout way to prove that an experiment worked?

You are right!

Rejecting the null hypothesis is indirect evidence for the experimental hypothesis. It says nothing about whether your scientific conclusion is correct.

Sure, the chocolate eaters may lose some weight. But is it because of the chocolate? Maybe. Or maybe they felt extra guilty about eating candy every day, and knowing they would be weighed by strangers in lab coats (weird!), they skimped on other meals.

Rejecting the null tells you nothing about the mechanism by which chocolate causes weight loss. It does not tell you whether the experiment was well designed, or well controlled, or whether the results were cherry-picked.

It just helps you understand how rare the results are.

But, and this is a tricky point, it's not how rare the results of your experiment are in general. It's how rare the results would be in a world where the null hypothesis is true. That is, how rare the results would be if nothing in your experiment worked, and the difference in weight were due to random chance alone.

Here's where the p-value comes in: the p-value quantifies this rarity. It tells you how often you'd see the numerical results of an experiment, or even more extreme results, if the null hypothesis is true and there really is no difference between the groups.

If the p-value is very small, it means the numbers would rarely (but not never!) occur by chance. So when p is small, researchers start to think the null hypothesis looks implausible. And they take a leap to conclude "their [experimental] data are unlikely to be due to chance," says Nuzzo.

Another tricky point: researchers can never completely rule out the null (just as jurors are not direct witnesses to a crime). So scientists instead pick a threshold at which they feel confident enough to reject the null. For many disciplines, that threshold is now p less than 0.05.

Ideally, a p of 0.05 means that if you ran the experiment 100 times (again, assuming the null hypothesis is true), you'd see these same numbers (or more extreme results) five times.
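You can see that frequency interpretation directly with a small simulation. The sketch below assumes a world where the null hypothesis is true, with both groups drawn from the same distribution, and counts how often a standard t-test dips below 0.05 anyway.

```python
# Simulating the "world where the null hypothesis is true": both groups come
# from the same distribution, so any p < 0.05 is a false positive. Roughly 5%
# of runs should cross the threshold anyway.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
runs = 10_000
false_positives = 0
for _ in range(runs):
    a = rng.normal(0, 1, size=50)
    b = rng.normal(0, 1, size=50)   # same distribution: the null is true
    if stats.ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1

print(f"{false_positives / runs:.1%} of runs were 'significant' by chance")
# Expect something close to 5.0%.
```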

And one last super-thorny concept that almost everyone gets wrong: a p < 0.05 does not mean there's less than a 5 percent chance your experimental results are due to chance. It does not mean there's only a 5 percent chance you've landed on a false positive. Nope. Not at all.

Again: a p-value of less than 0.05 means there is less than a 5 percent chance of seeing these results (or more extreme results) in the world where the null hypothesis is true. That sounds like a subtle distinction, but it is critical. It is the misunderstanding that leads people to put too much confidence in p-values. The false-positive rate for experiments at p = 0.05 can be much higher than 5 percent.
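A rough, purely illustrative calculation shows why. Suppose, hypothetically, that only 10 percent of the hypotheses a field tests are actually true and that studies have 80 percent power; the numbers below are assumptions chosen for illustration, not estimates for any real field.

```python
# Back-of-the-envelope arithmetic for why the share of false positives among
# "significant" findings can far exceed 5%. All inputs are illustrative
# assumptions: 1,000 hypotheses tested, 10% truly real, 80% power, alpha = 0.05.
hypotheses = 1000
truly_real = 0.10 * hypotheses          # 100 real effects
power = 0.80                            # chance a real effect reaches p < 0.05
alpha = 0.05                            # false-positive rate per test of a true null

true_hits = truly_real * power                       # 80 true positives
false_hits = (hypotheses - truly_real) * alpha       # 45 false positives

share_false = false_hits / (true_hits + false_hits)
print(f"Share of 'significant' results that are false: {share_false:.0%}")
# About 36% under these assumptions, not 5%, even though every test used p < 0.05.
```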

Again, p-values do not necessarily tell you whether an experiment worked or not.

Kristoffer Magnusson, a PhD student in psychology, has built a cool interactive calculator that estimates the probability of obtaining a range of p-values for any given true difference between groups. I used it to create the following scenario.

Suppose there is a study where the true difference between two groups is half a standard deviation. (Yes, that's a nerdy way of putting it, but think of it like this: it means 69 percent of the people in the experimental group show results above the mean of the control group. Researchers might call this a "medium-sized" effect.) And let's say there are 50 people each in the experimental group and the control group.

In this scenario, you should only be able to obtain a p-value between 0.03 and 0.05 about 7.62 percent of the time.

If you ran this experiment over and over again, you would expect to see many more p-values with much lower numbers, as the following chart shows. The x-axis shows the specific p-values, and the y-axis shows the frequency with which you would find them if you repeated this experiment. Look at how many p-values you would find below 0.001.


This is why many scientists grow wary when they see too many results cluster around 0.05. It shouldn't happen that often, and it raises the red flag that the results have been cherry-picked, or, in science-speak, "p-hacked". In science, it can be far too easy to game and tweak statistics until they reach significance.

And you'll see in this chart: yes, you can get a p-value greater than 0.05 when the experimental hypothesis is true. It just shouldn't happen as often. In this case, around 9.84 percent of all p-values should fall between 0.05 and 0.1.
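If you want to check figures like these yourself, a simulation along the following lines should land close to them. It assumes the same setup as above: a true difference of half a standard deviation, 50 people per group, and a standard two-sample t-test; the exact percentages will wobble from run to run.

```python
# Repeatedly simulating the scenario above: a true effect of 0.5 standard
# deviations, 50 people per group. The bin shares should come out near the
# quoted 7.62% and 9.84% figures, with a large share of p-values below 0.001.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
runs = 20_000
pvals = np.empty(runs)
for i in range(runs):
    control = rng.normal(0.0, 1.0, size=50)
    treated = rng.normal(0.5, 1.0, size=50)   # true difference: 0.5 SD
    pvals[i] = stats.ttest_ind(treated, control).pvalue

print(f"p in (0.03, 0.05]: {np.mean((pvals > 0.03) & (pvals <= 0.05)):.2%}")
print(f"p in (0.05, 0.10]: {np.mean((pvals > 0.05) & (pvals <= 0.10)):.2%}")
print(f"p < 0.001:         {np.mean(pvals < 0.001):.2%}")
```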

There are more nuanced approaches to evaluating science

Many scientists recognize there are more robust ways to evaluate a scientific finding, and they already engage in them. But somehow these haven't gained the same stature as "statistical significance". They include:

  • Concentrating on effect sizes (how big a difference does an intervention make, and is it practically meaningful?)
  • Confidence intervals (how much doubt is built into a given answer?); a short sketch of both of these appears after this list
  • Whether a result comes from a novel study or a replication (giving more weight to a theory that many labs have examined)
  • Whether a study's design was preregistered (so authors can't manipulate their results after the fact), and whether the underlying data is made freely available (so anyone can check the math)
  • There are also other statistical techniques, such as Bayesian analysis, that in some ways more directly evaluate a study's results. (P-values ask, "How rare are my results?" Bayes factors ask, "What is the probability my hypothesis is the best explanation for the results we found?" Both approaches have trade-offs.)
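As a rough illustration of the first two items, here is a minimal sketch that computes an effect size (Cohen's d) and a 95 percent confidence interval for a difference in means, using made-up data.

```python
# Effect size (Cohen's d) and a 95% confidence interval for a difference in
# means, computed on hypothetical data. Illustrative only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
treated = rng.normal(0.5, 1.0, size=50)   # hypothetical experimental group
control = rng.normal(0.0, 1.0, size=50)   # hypothetical control group

diff = treated.mean() - control.mean()
pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
cohens_d = diff / pooled_sd               # "how big is the effect?"

se = np.sqrt(treated.var(ddof=1) / 50 + control.var(ddof=1) / 50)
t_crit = stats.t.ppf(0.975, df=98)        # two-sided 95% interval
ci = (diff - t_crit * se, diff + t_crit * se)

print(f"Cohen's d = {cohens_d:.2f}")
print(f"95% CI for the difference in means: ({ci[0]:.2f}, {ci[1]:.2f})")
```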

The real problem isn't with statistical significance; it's with the culture of science

The authors of the latest Nature commentary aren't calling for the end of p-values. They'd still like scientists to report them where appropriate, but not necessarily label them "significant" or not.

There's likely to be debate around this strategy. Some might argue it's useful to have simple rules of thumb, or thresholds, for evaluating science, and that we still need phrases in our language to describe scientific results. Doing away with "statistical significance" could just sow more confusion.

In any case, changing the definition of statistical significance, or doing away with it altogether, does not address the real problem. And the real problem is the culture of science.

In 2016, Vox sent a survey to more than 200 scientists asking, "If you could change one thing about how science works today, what would it be and why?" One of the clear themes in the answers: scientific institutions need to get better at rewarding failure.

One young scientist told us, "I feel torn between asking questions that I know will lead to statistical significance and asking questions that matter."

The biggest problem in science isn't statistical significance; it's the culture. She felt torn because young scientists need publications to get jobs, and under the status quo, getting publications means getting statistically significant results. Statistical significance alone didn't lead to the replication crisis. The institutions of science incentivized the behaviors that allowed it to fester.

