The past decade has seen numerous criticisms of statistical methodology in biomedical research, fuelled in part by a ‘reproducibility crisis’ – the concern that many supposedly significant results cannot be replicated. The latest salvo was an editorial recently published in the Journal of the American Medical Association (Ioannidis JPA. JAMA 2018;319:1429-1430).
“P values are misinterpreted, overtrusted, and misused,” states John Ioannidis, an epidemiologist at Stanford University. “Adopting lower P value thresholds [such as p<0.005] may help promote a reformed research agenda with fewer, larger, and more carefully conceived and designed studies.”
In 2005, Professor Ioannidis published a paper arguing that most published research findings are false (Ioannidis JPA. PLoS Med 2005;2:e124, free full text at www.ncbi.nlm.nih.gov/pmc/articles/PMC1182327/pdf/pmed.0020124.pdf). He also co-authored a survey of the biomedical literature, which found that the proportion of abstracts reporting p-values increased from 7.3% in 1990 to 15.6% in 2014 (33.0% in core clinical journals) (Chavalarias et al. JAMA 2016;315:1141-1148).
Perhaps more telling, the 385,393 articles in the analysis reported an average of 8.9 p-values per paper, most of them statistically significant (p<0.05). Other statistical methods, such as confidence intervals (2.3%), Bayesian analysis (0%) and effect sizes (13.9%), were far less common.
Questions about the validity of scientific findings prompted the American Statistical Association to take the unprecedented step of issuing a formal statement on the use of p-values (Wasserstein & Lazar. American Statistician 2016;70:129-133). The statement included six core principles to clarify what p-values do and do not mean:
- P-values can indicate how incompatible the data are with a specified statistical model. The hypothesis being tested is usually the null hypothesis (e.g. there is no difference between drug and placebo), and the p-value is meaningful only insofar as the assumptions used to calculate it hold.
- P-values do not measure the probability that the hypothesis is true, or the probability that the data were produced by random chance alone. Thus, a “significant result” of p<0.05 does not mean that the probability that a result is due to random chance is less than 5%, nor does it mean that there is a 95% probability that the hypothesis is true (e.g. that a drug is superior to placebo).
- Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold. Also important are the study design, the quality of the measurements, the external evidence, and the validity of assumptions underlying the data analysis.
- P-values and related analyses should not be reported selectively. Conducting multiple analyses of the data and reporting only those with certain p-values (so-called “p-hacking”) makes the results uninterpretable; the first demonstration in the simulation sketch after this list shows how quickly this inflates false positives.
- A p-value does not measure effect size or the importance of a result. Smaller p-values do not necessarily imply the presence of larger or more important effects, and larger p-values do not imply a lack of importance or even lack of effect. Any effect, no matter how tiny, can produce a small p-value if the sample is large enough, as the second demonstration in the sketch below shows.
- A p-value does not provide a good measure of evidence regarding a model or hypothesis. For example, a p-value near 0.05 offers only weak evidence against the null hypothesis.
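To make the p-hacking and effect-size principles concrete, here is a minimal simulation sketch (Python with NumPy and SciPy; the sample sizes, the choice of 20 tests and the 0.01-SD effect are illustrative assumptions, not figures from the statement):

```python
# A minimal simulation sketch of two of the principles above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# "p-hacking": run 20 independent t-tests on pure noise and keep only the
# smallest p-value. Although every null hypothesis is true, the chance of
# at least one p < 0.05 is about 1 - 0.95**20, i.e. roughly 64%.
n_experiments, n_tests, n = 2000, 20, 30
hits = sum(
    min(
        stats.ttest_ind(rng.normal(size=n), rng.normal(size=n)).pvalue
        for _ in range(n_tests)
    ) < 0.05
    for _ in range(n_experiments)
)
print(f"Experiments with at least one 'significant' test: {hits / n_experiments:.0%}")

# Tiny effect, huge sample: a mean difference of 0.01 standard deviations is
# negligible, but with 500,000 subjects per group it is almost always highly
# 'significant' despite being unimportant.
a = rng.normal(loc=0.00, size=500_000)
b = rng.normal(loc=0.01, size=500_000)
print(f"Tiny effect, huge sample: p = {stats.ttest_ind(a, b).pvalue:.1e}")
```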
Regarding the last of these principles, the plausibility of the hypothesis before it is tested is particularly important. For example, it has been calculated that for an unlikely hypothesis (estimated beforehand at 19:1 odds against being true), a significant result (p<0.05) raises the probability of the hypothesis being true to only 11% (Nuzzo R. Nature 2014;506:150-152). For an even-money hypothesis (1:1), a p<0.05 result translates to a 71% probability that the hypothesis is correct. Even then, a statistically significant result carries a false-positive rate of 29%, and that is before the clinical relevance of the finding is taken into consideration.
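These percentages match the upper bound on the Bayes factor derived by Sellke, Bayarri and Berger (2001), 1/(−e·p·ln p); a minimal sketch of the arithmetic, assuming the figures derive from that bound (an assumption on our part; the Nature piece reports only the final percentages):

```python
# Reproducing the 11% and 71% figures via the Sellke-Bayarri-Berger bound
# on the Bayes factor (assumed here; not spelled out in the cited article).
import math

def posterior_probability(prior_prob, p=0.05):
    """Best-case probability that the hypothesis is true after observing p."""
    bayes_factor = 1 / (-math.e * p * math.log(p))  # about 2.46 at p = 0.05
    prior_odds = prior_prob / (1 - prior_prob)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1 + posterior_odds)

print(f"19:1 against (prior 5%): {posterior_probability(0.05):.0%}")  # ~11%
print(f"Even money (prior 50%):  {posterior_probability(0.50):.0%}")  # ~71%
```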
Several changes have been proposed to address the p-value problem. One is to rename the methodology “statistical hypothesis inference testing” (Lambdin C. Theory Psychol 2012;22:67-90), if only for the acronym. Another is to lower the significance cut-off to p<0.005, which would render about one-third of p<0.05 results no longer significant. The proposed cut-off would be even more stringent for database analyses, small effects (e.g. genetic studies) and epidemiologic studies. Lowering the threshold is acknowledged to be an imperfect stopgap, but there is no consensus on what other methodologies should be employed.
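Under the same assumed Sellke–Bayarri–Berger bound used above, the stricter threshold can be given a rough rationale: the maximum Bayes factor rises from about 2.5 at p = 0.05 to about 14 at p = 0.005.

```python
# The same hedged bound, applied to the proposed p < 0.005 threshold
# (an illustration of the rationale, not a calculation from the editorial).
import math

def max_bayes_factor(p):
    return 1 / (-math.e * p * math.log(p))

print(f"p = 0.05:  max Bayes factor = {max_bayes_factor(0.05):.1f}")   # ~2.5
print(f"p = 0.005: max Bayes factor = {max_bayes_factor(0.005):.1f}")  # ~13.9
```

For an even-money hypothesis, a Bayes factor of about 14 corresponds to a posterior probability near 93%, cutting the false-positive rate from 29% to roughly 7% under these assumptions.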