The p-value may be the most consequential number in modern science. Drug approvals depend on it. Career-defining publications turn on whether it falls below 0.05. Public-health policies, economics papers, and psychology experiments all live or die by this single statistic. And it is, almost certainly, the most misunderstood concept in everyday science.
A standard textbook will tell you a p-value is “the probability of obtaining results as extreme as those observed, assuming the null hypothesis is true.” A standard scientist, asked what that means, will paraphrase it as “the probability the result happened by chance.” Those two statements are not the same. The mismatch between what the p-value actually measures and what scientists believe it measures has contributed to the so-called replication crisis, in which a large share of published findings in some fields fail to reproduce.
This article is about the difference. What is a p-value, mathematically? What can you correctly conclude from a small one? And what is happening, statistically, when a major journal publishes a result that turns out, three years later, to be wrong?
The definition, carefully
A p-value comes from the framework of frequentist hypothesis testing, developed primarily by Jerzy Neyman, Egon Pearson, and Ronald A. Fisher in the early twentieth century.
The basic setup: you have a null hypothesis $H_0$ (typically a statement of “no effect”) and an alternative hypothesis $H_1$. You collect data and compute a test statistic $T$, whose observed value is $t_{\text{obs}}$. The p-value is
$$p = P(T \ge t_{\text{obs}} \mid H_0)$$
In words: if the null hypothesis is true, what is the probability that the test statistic would come out as extreme as, or more extreme than, what we actually saw?
That conditional clause is the entire point. The p-value tells you something about the data, assuming the null hypothesis. It tells you nothing directly about whether the null hypothesis is true.
Consider a coin you suspect is biased toward heads. You flip it 100 times and get 60 heads. Under the null hypothesis that the coin is fair (probability 0.5 of heads each flip), the probability of getting 60 or more heads is about 0.028. The one-sided p-value is therefore 0.028.
What this 0.028 says: if the coin is fair, you’d see results this extreme about 2.8% of the time.
What this 0.028 does not say:
- It does not say the probability the coin is fair is 2.8%.
- It does not say there’s a 97.2% chance the coin is biased.
- It does not say the result is likely to replicate.
- It does not say the effect is large or important.
These are all separate questions, requiring separate analyses.
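The 0.028 in the coin example can be checked directly. The following is a minimal sketch that sums the upper tail of a binomial distribution using only the standard library; the function name `binomial_upper_tail` is illustrative and not taken from any statistics package.

```python
from math import comb


def binomial_upper_tail(n: int, k: int, p: float = 0.5) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p).

    With p set to the null value, this is the one-sided p-value
    for observing k successes in n trials.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# 60 or more heads in 100 flips of a fair coin.
p_value = binomial_upper_tail(100, 60)
print(f"one-sided p-value: {p_value:.3f}")
```

This is a one-sided calculation, matching the suspicion that the coin favors heads. A two-sided test, which also counts 40 or fewer heads as extreme, would double the figure for a fair-coin null because the distribution is symmetric.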
The Bayesian translation
The probability you usually want is $P(H_0 \mid \text{data})$, which is not the same as the p-value, a probability of the form $P(\text{data} \mid H_0)$. These two are related by Bayes’ theorem:
$$P(H_0 \mid \text{data}) = \frac{P(\text{data} \mid H_0)\, P(H_0)}{P(\text{data})}$$
The right-hand side requires the prior probability $P(H_0)$: your belief, before seeing the data, about how likely the null hypothesis was. Without a prior, you cannot translate from p-value to “probability the null is true.” This is mathematically unambiguous.
Worked example. Suppose a new drug is tested for an effect, and you find a p-value of 0.04. You are tempted to say “there’s a 96% chance the drug works.” But suppose drugs in this class are mostly ineffective, say 90% of them are duds. Then the prior probability of an effect is 10%. Run the Bayesian calculation properly, which also requires an assumption about how often the test detects a real effect (its power), and the posterior probability the drug works comes out at roughly 50% for a test with about 50% power, not 96%. A p-value of 0.04 in a field of mostly-failing hypotheses is, on its own, weak evidence.
This is one reason pharmacology and medicine see so many replication failures: when the prior probability of any specific hypothesis is low, even a “significant” p-value gives only modest posterior confidence.
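The drug example can be sketched numerically. The calculation below is a simplification: it conditions on the test being significant at a threshold `alpha` rather than on the exact p-value, and it needs a value for the test's power, which the example above does not fix. The power values tried here are assumptions for illustration only.

```python
def posterior_effect_is_real(prior: float, power: float, alpha: float) -> float:
    """P(effect is real | test is significant), by Bayes' theorem.

    prior: probability the effect is real before seeing data
    power: P(significant | effect is real)
    alpha: P(significant | no effect), the false-positive rate
    """
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    return true_positives / (true_positives + false_positives)


# Prior of 10% (90% of drugs in the class are duds), threshold 0.05.
for assumed_power in (0.8, 0.5, 0.3):
    posterior = posterior_effect_is_real(prior=0.10, power=assumed_power, alpha=0.05)
    print(f"power {assumed_power:.1f}: posterior {posterior:.2f}")
```

Under these assumptions the posterior ranges from about 0.40 to about 0.64, nowhere near 0.96. The exact figure moves with the assumed power, but the qualitative conclusion does not.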
How to read a p-value sensibly
Here is the distinction that working statisticians try to drum into students:
A p-value of 0.04 does NOT mean:
- There is a 4% chance the result is wrong
- There is a 96% chance the alternative hypothesis is right
- The effect is large
- The effect will replicate
- The hypothesis was the only one worth testing
A p-value of 0.04 DOES mean:
- If the null hypothesis is true, this kind of result happens about 4% of the time
- Therefore, this is somewhat surprising data assuming the null is correct
The second list claims much less than the first. A p-value is a piece of evidence, not a conclusion. To turn it into a conclusion, you need:
- The plausibility of the hypothesis before you saw the data
- The size of the effect, not just whether it exists
- Whether the test was pre-specified or chosen after looking at the data
- Whether the result has been independently replicated
Without those, the p-value is one number on a scoreboard. With them, it becomes a starting point for an actual scientific argument.
P-hacking and the garden of forking paths
The replication crisis became visible around 2011, when Joseph Simmons, Leif Nelson, and Uri Simonsohn published a paper titled “False-Positive Psychology” demonstrating, with a now-famous example, that following standard practice you could “prove” that listening to certain Beatles songs makes people younger.
Their point was that working scientists routinely make small choices during analysis — whether to drop outliers, which subgroups to compare, which control variables to include — that inflate the false-positive rate dramatically. If you have any flexibility, the effective number of tests you ran is larger than the number you reported.
The math is simple. Run 20 independent tests at the 0.05 threshold. The probability that all of them show no effect, if no effect actually exists, is $0.95^{20} \approx 0.36$. So there’s a 64% chance at least one comes out significant by pure chance. Report only that one, and you have a “publishable” finding that means nothing.
This is p-hacking. It is rarely deliberate fraud. It is the natural result of conducting science with the freedom to adjust your analysis until something works. Whole fields have struggled to clean it up, relying on reforms such as:
- Pre-registration of hypotheses before data collection
- Multiple-comparison corrections like the Bonferroni adjustment
- Reporting all analyses, not just successful ones
- Insistence on independent replication before accepting results
These reforms work. They also slow down science and make negative results more visible — which is why many practitioners resisted them.
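The 20-test arithmetic and the effect of a Bonferroni adjustment can be sketched in a few lines. This assumes the tests are independent and that every null hypothesis is true; the function names are illustrative.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """P(at least one false positive) across independent tests, all nulls true."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests


uncorrected = familywise_error_rate(20)
corrected = familywise_error_rate(20, alpha=bonferroni_threshold(20))

print(f"uncorrected: {uncorrected:.2f}")  # roughly 0.64
print(f"Bonferroni:  {corrected:.3f}")    # just under 0.05
```

With the per-test threshold lowered to 0.0025, the chance of at least one spurious finding across all 20 tests drops back below 5%. The price is reduced power to detect real effects, which is part of why the correction is sometimes resisted.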
What 0.05 means historically
The 0.05 threshold has no theoretical basis. It comes from R. A. Fisher’s 1925 book Statistical Methods for Research Workers, where he wrote:
“It is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not.”
“Convenient.” Not “correct,” not “principled.” Fisher himself, in later writings, warned against rigid use of any single threshold and argued that a p-value should be interpreted alongside everything else known about the problem.
The convention calcified anyway. By mid-century, many scientific journals effectively required p < 0.05 for publication. Funding bodies asked for it on grant applications. Generations of scientists came to treat 0.04 as truth and 0.06 as fiction, even though the underlying statistical evidence in those two cases is almost identical.
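How close 0.04 and 0.06 really are can be seen by converting each to the test statistic that produces it. The sketch below assumes a two-sided z-test, which is only one of many possible tests; the function name is illustrative.

```python
from statistics import NormalDist


def z_for_two_sided_p(p: float) -> float:
    """Return the |z| score whose two-sided normal tail probability is p."""
    return NormalDist().inv_cdf(1 - p / 2)


z_04 = z_for_two_sided_p(0.04)
z_06 = z_for_two_sided_p(0.06)

print(f"p = 0.04 corresponds to z = {z_04:.2f}")
print(f"p = 0.06 corresponds to z = {z_06:.2f}")
```

Under that assumption the two results sit at roughly 2.05 and 1.88 standard errors from the null value, a gap of less than a fifth of a standard error.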
Some fields have tried to fix this. In 2017 a group of 72 researchers proposed lowering the threshold to 0.005 for new claims, arguing that the false-discovery rate at 0.05 is too high. The proposal has had limited adoption. A more radical alternative, abandoning thresholds entirely and reporting effect sizes with confidence intervals, has been pushed by methodologists for decades and is gradually taking hold in the better journals.
What the math says, what science needs
The p-value is a useful piece of statistical machinery. It is not a complete picture of evidence, and treating it as one has caused enormous waste in modern science. The number tells you whether your data would be surprising under a specific null hypothesis — nothing more.
Three takeaways for anyone reading scientific results:
- Demand effect sizes alongside p-values. A statistically significant 0.1% improvement in something is much less interesting than a marginally non-significant 30% improvement.
- Prefer replicated results to single-study p-values. A finding confirmed in three independent labs is dramatically stronger evidence than a single p < 0.001 from one team.
- Watch for the prior. In fields with low base rates of true effects (drug discovery, parapsychology, much of biomedical research), even small p-values translate to weak posterior beliefs.
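The first takeaway, that significance and effect size are different things, can be illustrated with a small sketch. It uses a one-sample z-test with a known standard deviation, and every number in it (effect sizes, standard deviations, sample sizes) is hypothetical and chosen only to make the contrast visible.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(effect: float, sd: float, n: int) -> float:
    """Two-sided p-value of a one-sample z-test for a mean shift of `effect`."""
    z = effect / (sd / sqrt(n))
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical study A: a 0.1% improvement measured on 100,000 subjects.
tiny_effect_huge_sample = two_sided_p(effect=0.001, sd=0.1, n=100_000)

# Hypothetical study B: a 30% improvement measured on 36 subjects.
large_effect_small_sample = two_sided_p(effect=0.30, sd=1.0, n=36)

print(f"0.1% effect, n = 100000: p = {tiny_effect_huge_sample:.4f}")
print(f"30% effect,  n = 36:     p = {large_effect_small_sample:.4f}")
```

In this sketch the tiny effect clears the 0.05 threshold comfortably while the large one narrowly misses it. The p-value reflects sample size as much as it reflects the size of the effect, which is why it cannot stand in for the effect size.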
The p-value is a measurement instrument. Like any instrument, it produces numbers that need to be interpreted in context. Confusing the reading on the dial with the underlying quantity is one of the most expensive errors in modern science — and it’s one that nearly every working scientist has made at some point.
The problem is not the mathematics. The problem is that a precisely defined statistical quantity got pressed into service as a proxy for a much more general question — “did this effect really happen?” — that requires considerably more than one number to answer.
If there’s one habit worth cultivating: when you read “the result was significant (p < 0.05),” translate it mentally to “the result was somewhat surprising under the null.” That’s what the math actually says. Everything beyond that is interpretation.
Frequently asked
Is a low p-value the same as a true result?
No. A p-value of 0.01 means: assuming the null hypothesis is true, the data would look this extreme or more about 1% of the time. It does not mean there's a 99% chance your hypothesis is true. The probability you actually want — the probability the hypothesis is true given the data — requires Bayes' theorem and the prior probability of your hypothesis.
Why is 0.05 the threshold?
Because R. A. Fisher, in 1925, suggested it as a 'convenient' cutoff for routine work. He explicitly warned against treating it as a universal rule. The threshold has no mathematical or philosophical justification — it is a historical accident that calcified into convention.
What's wrong with p-hacking?
When researchers run many tests, try different statistical models, or report only successful comparisons, they inflate their false-positive rate. A 0.05 cutoff applied to 20 tests produces, on average, one significant result by pure chance even if every null hypothesis is true. The result looks statistically valid but isn't.