What p-hacking really looks like: A comment on Masicampo & LaLande (2012)


Masicampo and Lalande (2012; M&L) assessed the distribution of 3627 exactly calculated p-values between .01 and .10 from 12 issues of three journals. The authors concluded that "The number of p-values in the psychology literature that barely meet the criterion for statistical significance (i.e., that fall just below .05) is unusually large".
"Specifically, the number of p-values between .045 and .050 was higher than that predicted based on the overall distribution of p." There are four factors that determine the distribution of p-values, namely the number of studies examining true effect and false effects, the power of the studies that examine true effects, the frequency of Type 1 error rates (and how they were inflated), and publication bias. Due to publication bias, we should expect a substantial drop in the frequency with which p-values above .05 appear in the literature. True effects yield a right-skewed p-curve (the higher the power, the steeper the curve, e.g., Sellke, Bayarri, & Berger, 2001). When the null-hypothesis is true the p-curve is uniformly distributed, but when the Type 1 error rate is inflated due to flexibility in the data-analysis, the p-curve could become left-skewed below pvalues of .05.
M&L (and others, e.g., Leggett, Thomas, Loetscher, & Nicholls, 2013) model p-values with a single exponential curve estimated to best fit the p-values between .01 and .10 (see Figure 3, right pane). This is not a valid approach, because due to publication bias p-values above and below .05 do not lie on a single continuous curve. It is therefore neither surprising, nor indicative of a prevalence of p-values just below .05, that their single curve does not fit the data very well, nor that Chi-squared tests show the residuals (especially those just below .05) are not randomly distributed.

P-hacking does not create a peak in p-values just below .05. In fact, p-hacking does not even have to lead to a left-skewed p-curve. If you perform multiple independent tests in a study where the null-hypothesis is true, the Type 1 error rate is substantially inflated, but the p-curve remains uniform, as if you had performed five independent studies. The left skew (in addition to the overall increase in false positives) emerges through dependencies in the data in a repeated testing procedure, such as collecting data, performing a test, collecting additional data, and analyzing the old and new data together.

In Figure 1, two multiple testing scenarios (comparing a single mean to up to five other means, or collecting additional participants up to a maximum of five times) are simulated 100,000 times when there is no true effect (for details, see the supplementary material). Without p-hacking, only 500 significant Type 1 errors should be observed in each bin, but we see an increase in false positives above 500 for most of the 10 bins. Identifying a prevalence of Type 1 errors in a large heterogeneous set of studies is, regrettably, even more problematic due to the p-curve of true effects, as Figure 2 illustrates. Clearly, more data is needed, and the reliability and reproducibility of p-curve analyses can be improved by always publishing a p-curve disclosure table (see Simonsohn, Nelson, & Simmons, 2014).
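
The repeated testing procedure described above is easy to simulate. The sketch below is a minimal illustration in the spirit of Figure 1, not the code from the supplementary material; the starting sample size, batch size, and maximum number of looks are assumptions made for the example.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_sims, n_start, n_step, looks = 100_000, 20, 10, 5

def optional_stopping_p():
    # Collect data, test, add participants, and re-test the combined data,
    # stopping at the first p < .05. There is no true effect, so every
    # significant result is a Type 1 error.
    x = rng.normal(size=n_start)
    p = stats.ttest_1samp(x, 0.0).pvalue
    for _ in range(looks - 1):
        if p < 0.05:
            break
        x = np.concatenate([x, rng.normal(size=n_step)])
        p = stats.ttest_1samp(x, 0.0).pvalue
    return p

pvals = np.array([optional_stopping_p() for _ in range(n_sims)])  # slow but simple
bins = np.arange(0, 0.051, 0.005)                  # ten .005-wide bins below .05
print(np.histogram(pvals[pvals < 0.05], bins)[0])  # bins exceed the ~500 expected
print((pvals < 0.05).mean())                       # Type 1 error rate above 5%

Because each look re-analyzes the earlier data, the significant p-values concentrate just below .05, producing the left skew; the same number of fully independent tests would inflate the error rate while leaving the p-curve uniform.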
Altogether, the evidence for a reliable peak of p-values just below .05 in the data collected by M&L is weak. Furthermore, looking for such a peak distracts from the fact that p-hacking will lead to a much greater absolute increase in false positives between .01 and .045 than between .045 and .05. It should be clear that p-hacking can be a big problem even when it is difficult to observe. Although the data by M&L do not indicate a surprising prevalence of p-values just below .05 when interpreted against a more realistic model of expected p-curves, there is a clear drop in the frequency of p-values above .05, which is in line with the strong effect of publication bias on which p-values end up in the literature (see also Kühberger, Fritz, & Scherndl, 2014).
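
A caliper-style check makes the weakness of such a peak concrete: compare the counts in two adjacent bins with a binomial test. The sketch below uses hypothetical counts, not M&L's data, purely to illustrate the test.

from scipy import stats

# Hypothetical counts of p-values in two adjacent .005-wide bins (not M&L's data)
count_040_045 = 120  # p-values in [.040, .045)
count_045_050 = 140  # p-values in [.045, .050)

# If the p-curve is locally flat, a p-value is equally likely to land in either
# bin; a genuine peak just below .05 should push the split away from 50/50.
result = stats.binomtest(count_045_050, count_040_045 + count_045_050,
                         p=0.5, alternative="greater")
print(result.pvalue)

With these counts the one-sided test does not reach significance, illustrating how many p-values a narrow bin must hold before a peak can be called reliable.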
An alternative to attempting to point out p-hacking in the entire psychological literature is to identify left-skewed p-curves in small sets of more homogeneous studies (i.e., where all studies examine a null-hypothesis that is true). Better yet, we should aim to control the Type 1 error rate for the findings reported in an article. Pre-registration and replication (e.g., Nosek & Lakens, 2014) are two approaches that can improve the reliability of findings.