Correcting the Past: Failures to Replicate Psi

Across seven experiments (N = 3,289) we replicate the procedure of Experiments 8 and 9 from Bem (2011), which had originally demonstrated retroactive facilitation of recall. We failed to replicate that finding. We further conduct a meta-analysis of all replication attempts of these experiments and find that the average effect size (d = .04) is no different from zero. We discuss some reasons for differences between the results in this paper and those presented in Bem (2011).

procedures and materials whenever we could, and we used computers to standardize the delivery of the instructions and materials. We also predetermined our intended sample (e.g., "a minimum of 100 participants") and always formally stopped the experiment before looking at any results.
We used the same data analytic strategies that Bem used, and we also heeded the advice of Wagenmakers et al. (2011) to use additional analyses, in particular Bayesian t-tests (described in more detail later).
Altogether, we ran seven experiments with seven different samples, examining over 3,000 participants. We focused our replication attempts on the retroactive facilitation of recall findings described above: four experiments replicated the procedures of Bem (2011) Experiment 8 and three experiments replicated the procedures of Bem (2011) Experiment 9. We chose these findings in particular because the other findings reported in Bem (2011) hinge on nuanced affective responses, such as arousal to erotic images or a preference for avoiding negative images. As Bem (2011) reports, one difficulty with such experiments is finding the appropriate stimuli (e.g., people can foresee erotic images only if they are sufficiently erotic, and men and women require different erotic stimuli and different negative stimuli). Thus, the findings involving affective responses seem to be sensitive to subtle variation in the intensity and character of the stimuli. Not only is extensive pretesting required to find the right stimuli, but this need for appropriate stimuli makes it easy to dismiss any null findings as due to the use of inappropriate stimuli.
In the retroactive facilitation of recall studies, on the other hand, people are simply shown a list of words and are then asked to freely recall as many as possible. Participants are then randomly assigned to practice half of the words, with precognition being observed if people recall more of the words that they subsequently practice than words that they subsequently do not practice. In comparison to the other studies reported by Bem (2011), practicing and remembering words was relatively straightforward for us to replicate without concerns about the stimuli insufficiently matching the parameters suggested in the original paper. In fact, as noted below, we used the exact stimuli used by Bem (2011) in four of our experiments.
In addition to replicating Bem's (2011) retroactive facilitation of recall studies, another goal of this paper was to conduct a meta-analysis of all attempts to replicate these particular studies. We should note that other meta-analyses of psi phenomena have been conducted, but they are not of direct relevance to our conclusions because they do not examine the retroactive facilitation of recall paradigm. Nevertheless, they are worth consideration. Milton (1997) found evidence for a wide range of parapsychological phenomena but warned that the vast majority of experiments did not pre-define their outcome measure and therefore should be greatly discounted. Dunne and Jahn (2003) concluded that evidence for remote perception is relatively weak and, from a meta-analytic point of view, is non-existent. Storm, Tressoldi, and Di Risio (2010) concluded that evidence for psychic communication (i.e., telepathy) does, in fact, persist across a variety of testing conditions. Finally, Tressoldi (2011) conducted a meta-analysis of these three published meta-analyses and two additional unpublished analyses and concluded that, using a frequentist data analytic approach, there is substantial evidence for psi, but using Bayesian analyses, there is mixed evidence for psi. As noted, however, these meta-analyses do not include Bem's (2011) tightly controlled psi experiments. Thus, one of the central goals of this paper, aside from directly attempting to replicate Bem's retroactive facilitation of recall experiments, is to conduct a new meta-analysis which includes both our new empirical findings and all other attempted replications of these particular experiments.

Method
Below we briefly review the basic methodology of our replication attempts. We then provide the relevant details about the specifics of data collection in each experiment. Because the seven experiments that we conducted were highly similar to each other, we present the methods of all seven experiments before turning to their results. This report adheres to the requirements proposed by Simmons, Nelson, and Simonsohn (2012).
All instructions and manipulations were presented through a computer interface. As in Bem (2011), participants first read and agreed to a consent form mentioning that the experimenter was investigating ESP and then read a brief introductory statement almost identical to the one used by Bem (2011): "This experiment tests for ESP (extra sensory perception) by administering several tasks involving common everyday words. The experiment takes about 15 minutes to complete. The program will give you specific instructions as you go. At the end of the session, the computer will explain to you how this procedure tests for ESP." When participants had finished reading the statement (after a forced time delay of seven seconds to better ensure that participants read the text), they clicked to advance to the next screen.
On the two subsequent screens participants answered the same stimulus-seeking items that Bem (2011) reported administering. Both items were preceded by, "To what extent is the following statement true of you:" The first item was "I am easily bored," and the second was "I often enjoy seeing movies I've seen before." Participants responded on 5-point scales anchored at 1 ("Very Untrue") and 5 ("Very True").
Participants then experienced a three-minute relaxation procedure as described in Bem (2011): they looked at an astronomical photograph while listening to relaxing music. When the three minutes had ended, participants clicked a button to acknowledge that they were ready.
Based on the procedure outlined by Bem, they then received these instructions about the task: Next, we would like you to look at a list of 48 common nouns one at a time, for 3 seconds. While looking at each word, please visualize the corresponding object. For example, if the word is "house", please imagine a house. When you are ready to begin, please click continue.
Participants in Experiments 1, 2, 6, and 7, who completed the experiments online, were given an additional instruction: "It is absolutely critical that you focus on only this task and do not perform any other tasks (e.g. check email)." After participants clicked "continue," they were shown the series of words, each for 3 seconds. We completed our first two experiments and began data collection for our seventh experiment prior to Bem making his exact materials publicly available. Accordingly, we created the lists of words ourselves. In Experiments 1 and 7 we used the same four categories as Bem (2011; food, animals, occupations, and clothes), and for Experiment 2 we created four new categories (kitchen items, electronics, body parts, sports). For the remaining experiments we used exactly the set of words used by Bem (2011). Appendix A presents the full lists of words for Experiments 1 through 7. Paralleling Bem's procedure, the words were presented in a predetermined random order (the same order for all participants). After all 48 words had been presented, participants were asked to type any words that they recalled. They had as much time as they wanted, and when they were finished they clicked a button to go to the next stage.
At that point the program, using a pseudo-random number generator, randomly assigned 24 words to be practiced; six words were randomly chosen from each of the four groups of 12 words. Practice unfolded as follows: replicating Bem's Experiment 9, participants in our Experiments 4 through 6 were shown and asked to visualize the 24 practice words one at a time for 3 seconds. Specifically, they were given the following instructions: "You will now be shown 24 of the words you saw earlier, divided into 4 categories: Foods, Animals, Occupations, and Clothing. As you see each word, try to form an image of the thing it refers to (e.g., if the word is tree, visualize a tree)." Consistent with Bem's Experiment 8, participants in our Experiments 1, 2, 3, and 7 did not complete this first practice task. Next, all participants in every experiment viewed the list of 24 practice words. On successive screens, they were asked first to click on the six words from a specified category (at which point the words became highlighted) and then to retype those words in six boxes below. Participants could not continue until they correctly clicked on the appropriate six words and typed the six words in the corresponding boxes. They did this for each of the four categories, as in Bem (2011).
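The stratified random selection of practice words described above can be illustrated with a short sketch. This is not the program used in the experiments; the function name, the seed, and the placeholder word lists are assumptions made purely for illustration.

```python
import random


def choose_practice_words(categories, per_category=6, rng=None):
    """Randomly pick `per_category` words from each category.

    Returns (practice, control): the chosen words and the remaining words.
    """
    rng = rng or random.Random()
    practice = []
    for words in categories.values():
        practice.extend(rng.sample(words, per_category))
    chosen = set(practice)
    control = [w for words in categories.values() for w in words if w not in chosen]
    return practice, control


# Placeholder lists (NOT the actual stimuli): 4 categories of 12 words each.
categories = {
    name: [f"{name}_{i}" for i in range(1, 13)]
    for name in ("food", "animal", "occupation", "clothing")
}

# The seed is arbitrary and only makes the illustration repeatable.
practice, control = choose_practice_words(categories, rng=random.Random(2011))
print(len(practice), len(control))
```

Sampling within each category, rather than from the pooled list of 48 words, is what guarantees exactly six practice words per category.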
Participants in Experiments 1, 2, 6, and 7 (the online experiments) answered one more question: "It is very important for us to know if you were not paying 100% attention to this study (e.g., checking email, going to the bathroom). You will not be penalized in any way if you did other tasks and you will be entered into the lottery regardless of how you respond. So please be honest! Did you, at any point during this study, do something else (e.g. check email)?" Participants could check a box corresponding to either "No, I paid 100% attention to the study" or "Yes, I did other things during the study." Finally, because of the open nature of Experiment 7 (details below), participants in this experiment answered one more question: "Is this your first time taking this experiment (or one similar to it)?" Participants could check a box corresponding to either "No, I've Taken This Experiment Before" or "Yes, I've Never Taken This Experiment Before".
For each experiment we specify how we determined sample sizes, but it is worth an additional mention that in all cases we did not download any of the data prior to terminating any experiment. For all cases we sought at least 100 participants to mirror the number of participants in Bem's Experiment 8. In the cases where we set a target of greater than 100, this was largely done to make sure that the samples were large enough to be considered a fair replication attempt.

Experiment 1
Participants (n = 112; 88 females, 23 males, 1 unknown; median age = 38) were recruited from an online panel to complete the experiment for a chance to win a $100 gift card. All participants were registered members of the website consumerbehaviorlab.com and received an email explaining the compensation and containing a link to the experiment. We predetermined that we wanted at least 100 participants, and once we observed that over 100 people had completed the experiment, we stopped data collection and analyzed the data.
This experiment used the same basic design as Bem (2011) Experiment 8 with the following notable exceptions: it was conducted online (rather than in the lab) and used a different set of words in the same categories used by Bem.

Experiment 2
Participants (n = 158; 119 females, 39 males; median age = 39.5) were recruited from the same online panel and offered the same compensation as Experiment 1 (although none of the same individuals were in this sample). Again, participants received an email that included the link to the experiment. We decided on a minimum sample of 150 for this experiment and stopped collecting data once we saw that we had passed that number.
This experiment used the same basic design as Bem (2011) Experiment 8 with the following notable exceptions: it was conducted online (rather than in the lab) and used a different set of words taken from four different categories.

Experiment 3
Undergraduates (n = 124; 55 females, 69 males; median age = 19) at New York University participated in partial fulfillment of a course requirement. Each participant was scheduled to come into the lab, and upon arrival, was seated at a computer terminal and told to put on the available headphones. The experimenter opened the program and participants went through the procedure at their own pace. We sought a sample of greater than 100 participants, and because students are available in "batches" at NYU, we ended up with 124. This experiment used the same design and words as Bem (2011) Experiment 8.

Experiment 4
Undergraduates (n = 109; 53 females, 55 males, 1 unknown; median age = 21) from Carnegie Mellon University and the University of California, Berkeley participated for partial fulfillment of a course requirement. Scheduling and experimenter interaction were largely the same as in Experiment 3. We drew our sample from two universities because we wanted to make certain that we could reach a sample of at least 100 prior to the end of the semester, and neither participant pool could provide that many participants on its own. This experiment used the same words and design as Bem (2011) Experiment 9.

Experiment 5
Undergraduates (n = 211; 116 females, 94 males, 1 unknown; median age = 20) from the University of Florida participated for extra course credit. Scheduling and experimenter interaction were largely the same as in Experiments 3 and 4. We sought a sample of at least 200. Because participants were scheduled in batches, we ended up with a number that was slightly higher. This experiment used the same words and design as Bem (2011) Experiment 9.

Experiment 6
Participants (n = 175; 122 females, 52 males, 1 unknown; median age = 36) were recruited from the same online panel as in Experiments 1 and 2. Again, participants received an email that included the link to the experiment. Participants were assigned to one of two conditions. Some participants saw the same words and followed the same procedure as in Bem (2011) Experiment 9 (Test-Before-Practice), whereas some received the same elements in the reverse order (Practice-Before-Test). This latter condition was included to establish that participants in an online sample are sufficiently attentive to benefit from practice (and thus, that any null results in Test-Before-Practice conditions could not be blamed on online participants failing to engage in practice). The Practice-Before-Test condition thus followed the sequence typically observed in memory experiments: participants answered the sensation-seeking items and watched a presentation of all 48 words. Then, 24 words were randomly selected by the computer (again, 6 from each of the 4 categories of 12 words), and participants watched a presentation of those 24 words and practiced the 24 words. Next, participants completed the free recall task of all 48 words, and finally, they reported whether or not they had paid attention during the experiment.
More people were intentionally assigned to the Test-Before-Practice condition than the Practice-Before-Test condition, and we left the program running until we observed that there were more than 100 people in the former condition: this led to 106 participants in the Test-Before-Practice condition and 69 in the Practice-Before-Test condition. The non-uniform random assignment was accomplished by having the computer program assign roughly one participant to the Practice-Before-Test condition for every two participants who completed the Test-Before-Practice condition.
This experiment, apart from the manipulation described above, used the same basic design as Bem (2011) Experiment 9 but was conducted online (rather than in the lab).

Experiment 7
Participants (n = 2,469; demographic information not collected) were neither actively recruited nor compensated. After completing Experiment 1, the authors posted a short summary of that experiment on SSRN, the online social science repository, and they included a link to an open study that could be completed by anyone with an Internet connection. A number of commentators on Bem also included hyperlinks to the short report. This, in turn, led to more people completing the open experiment. Data collection began on October 29th, 2010 and concluded on March 2nd, 2012 (when this paper was written).

Data Coding Strategy
To assess whether or not we observed retroactive facilitation of recall we first had to determine which words were recalled as a function of whether they were practiced. On the surface, this seems like a trivial task; however, there were occasionally spelling errors. For Experiments 1 and 2, we coded the recalled words in a two-stage process. First, all entered words that perfectly matched any of the 48 words from the set were coded as either coming from the practice set of words or coming from the control set of words (about 90% of all words fell into one of these two categories). This was done automatically by a computer program. Next, any listed words that did not match any of the 48 words from the set were manually checked, one at a time, to assess whether they were simply misspelled words (e.g. "spageti") or words that were not in the main set of words (e.g. "home"). In all cases, the determination of whether a word was a misspelling was entirely clear, and furthermore, in all cases the coder was blind as to whether the words were drawn from the practice set or the control set.
For Experiments 3 through 7, we developed a fully computerized approach to coding the recalled words, thus removing any possible human bias in the scoring. Specifically, we used a computer program to generate exhaustive lists of common misspellings and typographical errors (e.g., "walruss" instead of "walrus"). If the recalled word matched any of the common misspellings, it was coded as a correctly recalled word.
Finally, for all experiments, any duplicate words were automatically identified and categorized as having come from the practice or control sets. Scores were adjusted accordingly (e.g., if the word "car" was in the control set and a participant responded with "car" twice, the second response was not counted as an additional recalled control word). The originally typed text, the lists of commonly misspelled words, and all of our data are freely available (http://www.consumerbehaviorlab.com/psi/CorrectingThePastData.xlsx).
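The coding rules described in this section (exact matches, a lookup of known misspellings, and the discarding of duplicate responses) can be sketched as follows. This is an illustrative reconstruction, not the authors' coding program; the function name and the tiny word sets are hypothetical, with the example misspellings borrowed from the text above.

```python
def score_recall(responses, practice, control, misspellings=None):
    """Count distinct recalled practice and control words.

    `misspellings` maps a known misspelling to its intended word.
    Duplicates and words outside both sets do not add to either count.
    """
    misspellings = misspellings or {}
    seen = set()
    n_practice = n_control = 0
    for raw in responses:
        word = raw.strip().lower()
        word = misspellings.get(word, word)  # normalize known misspellings
        if word in seen:
            continue  # a repeated response is not counted again
        seen.add(word)
        if word in practice:
            n_practice += 1
        elif word in control:
            n_control += 1
    return n_practice, n_control


# Hypothetical miniature word sets for illustration only.
practice = {"walrus", "spaghetti"}
control = {"car", "nurse"}
misspellings = {"walruss": "walrus", "spageti": "spaghetti"}

counts = score_recall(["Walruss", "car", "car", "home", "spageti"],
                      practice, control, misspellings)
print(counts)  # (2, 1)
```

Because the lookup never consults which set a word belongs to before normalizing it, this style of scoring is blind to the practice/control assignment in the same sense as the manual coding described above.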

Results
To test for the presence of precognition, Bem (2011) computed a weighted differential recall score (DR) for each participant using the formula $DR = (P - C) \times (P + C)$, where P is the number of practice words recalled and C is the number of control words recalled. In the paper, for descriptive purposes, Bem frequently reports this number as DR%, which is the percentage that a participant's score deviated from random chance towards the highest or lowest scores possible (-576 to 576). We conducted the identical analysis on our data and also report DR% (see Table 1). In addition to using the weighted differential recall score, we also computed a simple unweighted recall score, which is the difference between recalled practice words and recalled control words (see Appendix B). For both of these measures, random chance would lead to a score of 0, and our analysis, like Bem's, was conducted using a one-sample t-test. Table 1 presents the results of our seven experiments as well as the results of Bem's (2011) Experiments 8 and 9, for comparison. Bem found DR% = 2.27% in Experiment 8 and 4.21% in Experiment 9, effects that were significant at p = .03 and p = .002, one-tailed.
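As a worked illustration of these measures, the sketch below assumes Bem's (2011) weighting, DR = (P - C) × (P + C), with P and C the numbers of practice and control words recalled; this is consistent with the range of -576 to 576 for 24 practice and 24 control words. The function names and the (P, C) pairs are hypothetical and are not data from any experiment.

```python
import math
import statistics

MAX_DR = 24 * 24  # 576: all 24 practice words recalled, no control words


def weighted_dr(p, c):
    """Weighted differential recall, assuming DR = (P - C) * (P + C)."""
    return (p - c) * (p + c)


def dr_percent(p, c):
    """DR expressed as a percentage of the maximum possible score."""
    return 100 * weighted_dr(p, c) / MAX_DR


def one_sample_t(scores):
    """t statistic for testing whether the mean of `scores` differs from 0."""
    n = len(scores)
    return statistics.mean(scores) / (statistics.stdev(scores) / math.sqrt(n))


# Hypothetical (practice, control) recall counts for five participants.
recall = [(10, 9), (8, 8), (7, 9), (12, 10), (6, 7)]
scores = [dr_percent(p, c) for p, c in recall]
print(f"mean DR% = {statistics.mean(scores):.2f}, t = {one_sample_t(scores):.2f}")
```

The weighting by (P + C) means that a given difference between practice and control recall counts for more when the participant recalled more words overall.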

Main Results
In contrast, only one of our seven experiments showed a significant effect suggesting precognition (using a one-tailed p-value), and the overall effect across our experiments was very close to zero.
In Experiment 1, DR% = -1.21%, t(111) = -1.20, p = .88. Bayesian t-tests suggest that this is "substantial" support for the null hypothesis of no precognition. Bayesian t-tests (advocated by Wagenmakers et al., 2011) allow for hypothesis testing that considers the evidence for and against the null hypothesis, as well as the evidence for and against the alternative hypothesis. The analysis results in a Bayes Factor (BF) that denotes the weight of evidence provided by the data. Formally, the BF is computed as the probability of the data arising given H0, over the probability of the data arising given H1. When BF > 1, there is greater support for H0 and when 0 < BF < 1, there is greater support for H1. For a more detailed review of Bayesian t-tests see Rouder, Speckman, Sun, Morey, and Iverson (2009).
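To make the ratio concrete, the following is a minimal sketch of a two-sided JZS Bayes factor for a one-sample t-test with unit prior scale, following the formula given by Rouder et al. (2009). It is an illustration only: the function name, the integration bounds, and the step count are assumptions, and it is not necessarily the exact variant used for the analyses reported here.

```python
import math


def jzs_bf01(t, n, steps=4000, lo=-30.0, hi=30.0):
    """Two-sided JZS Bayes factor (H0 over H1) for a one-sample t-test.

    The marginal likelihood under H1 is an integral over g in (0, inf);
    it is evaluated with Simpson's rule after substituting g = exp(u).
    """
    nu = n - 1
    null_like = (1 + t * t / nu) ** (-(n + 1) / 2)

    def integrand(u):
        g = math.exp(u)
        a = 1 + n * g
        like = a ** -0.5 * (1 + t * t / (a * nu)) ** (-(n + 1) / 2)
        prior = (2 * math.pi) ** -0.5 * g ** -1.5 * math.exp(-1 / (2 * g))
        return like * prior * g  # trailing g is the Jacobian of g = exp(u)

    h = (hi - lo) / steps  # steps must be even for Simpson's rule
    total = integrand(lo) + integrand(hi)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(lo + i * h)
    alt_like = total * h / 3
    return null_like / alt_like


# t and N as reported for Experiment 1; BF01 > 1 favors the null hypothesis.
bf = jzs_bf01(-1.20, 112)
print(f"BF01 = {bf:.2f}")
```

Because this version is two-sided, the sign of t does not matter; a directional test of the precognition hypothesis would need a one-sided prior and could give a different value.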
In Experiment 3, DR% = 1.17%, t(123) = 1.28, p = .10. Although DR% was indeed above zero, in the direction predicted by the ESP hypothesis, the test statistic did not reach conventional levels of significance, and Bayesian t-tests suggest that this is nevertheless "substantial" support for the null hypothesis.
In Experiment 4, DR% = 1.59%, t(108) = 1.77, p = .04. The test statistic was significant in this one-tailed test, but Bayesian t-tests suggest that this is "anecdotal" support for the null hypothesis.
Bayesian t-tests suggest that this is "strong" support for the null hypothesis.
In sum, in four of our experiments, participants recalled more control words than practice words (Experiments 1, 5, 6, and 7) and in three of our experiments, participants recalled more practice words than control words (Experiments 2, 3, and 4). One of these effects was statistically reliable using one-tailed t-tests (see Table 1), but as noted, Bayesian t-tests suggest that even the findings that were directionally consistent with precognition show substantial support for the null hypothesis of no precognition.

Practice-Before-Test, Experiment 6
In Experiment 6, we wanted to confirm that the basic underlying effect of practice-facilitated recall could be detected online. Accordingly, we assigned some participants to practice the words prior to the free recall test (a non-retroactive condition). In the Practice-Before-Test condition, the results were quite strong (DR% = 41.76%, t(68) = 16.55, p < .001).
Not only was there a substantial mean difference between practiced and control words, but 68 of 69 participants recalled more practice words than control words (the remaining participant remembered the same number of each). Recall that, in the same experiment, some participants received the precognition version (i.e., the retroactive condition). Despite coming from the same population and taking the experiment over the same medium, DR% did not differ reliably from zero in the retroactive condition, and in fact participants remembered slightly more control words than practice words.
It is also worth noting that, among the practice-before-test participants, people who recalled more words overall also showed a larger DR% (r = .70, p < .001). Even in this online environment, people who remembered more words (presumably reflecting more attention) also showed more benefits of practice, but only when the practicing preceded testing. When testing preceded practicing, this correlation was nonsignificant (r = .01, p = .50).

Sensation Seeking as a Correlate
In addition to the primary measure, Bem (2011) reported evidence suggesting that sensation seeking positively influenced precognitive ability. His evidence came in the form of a correlation between DR% and responses on the two-item sensation seeking scale. In Experiment 8, he reports a correlation of r = .22. In Experiment 9 the correlation drops to r = -.10, perhaps because "the same strong stimulus manipulation that produced the higher effect size also restricted the range of DR% scores sufficiently to squelch the predictive power of the individual difference measure" (Bem, 2011, p. 420). We did not observe a significant correlation across any of our experiments. Effect sizes ranged from r = -.11 in Experiment 4 to r = .06 in Experiment 6 (see Table 1). Sensation seeking did not predict (positively or negatively) precognitive performance in any of our experiments.

Meta-Analysis
In addition to conducting our own replications, another goal of this paper was to examine all evidence for or against psi in the retroactive facilitation of recall paradigm. Accordingly, we conducted a meta-analysis of all known published and unpublished replication attempts of the two relevant experiments.

Retrieval of Studies
To locate all such attempts, we employed a number of different strategies. First, we searched for all papers that cite the original Bem (2011) paper using Google Scholar, Web of Science, and ProQuest. We assumed that any attempts to replicate would cite this paper. Next, we posted a request for information regarding replication attempts on the following list-serves: Rhine Research Center, the publishers of the Journal of Parapsychology. Some responders informed us of individuals who may be conducting relevant replications, and we contacted all of those individuals. Every individual that we contacted who conducted a relevant study responded with either their data or with a description of their results.

Criteria for Selection of Studies
Our goal was to identify any direct replication attempts of either Experiment 8 or 9 from Bem (2011). To that end, we identified 12 replications and included 10 of them in our meta-analysis (Table 2). We excluded two experiments reported by Snodgrass (2012) due to the limited sample size (N = 1 in Experiment 1 and N = 9 in Experiment 2). In addition, we included the original results obtained by Bem (2011) and the results from the seven experiments reported in this paper. In total, this yielded data from 4,091 participants.

Calculation and Coding of Effect Sizes
Means and standard deviations were available for all replication attempts and we calculated effect sizes (d) by dividing the DR% score by its standard deviation, with positive values indicating the presence of retroactive facilitation of recall and negative values indicating the presence of anti-retroactive facilitation of recall. In addition to DR%, Bem (2011) reported a positive correlation between sensation seeking and DR% across all but the last of his nine experiments. Accordingly, we obtained these correlation estimates for the experiments in this meta-analysis either by extracting them from provided materials (e.g., published papers or unpublished manuscripts) or by computing them ourselves using data provided by experimenters.
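Because the test of DR% against zero is a one-sample t-test, dividing the mean by its standard deviation is arithmetically the same as dividing the t statistic by the square root of the sample size. The sketch below applies that identity to t values and sample sizes quoted in the Results section; the function name is an assumption, and the printed values are simple arithmetic rather than figures taken from Table 2.

```python
import math


def cohens_d_from_t(t, n):
    """One-sample effect size: d = mean / SD = t / sqrt(n)."""
    return t / math.sqrt(n)


# (t, N) pairs as quoted in the Results section; N = df + 1.
reported = {
    "Experiment 1": (-1.20, 112),
    "Experiment 3": (1.28, 124),
    "Experiment 4": (1.77, 109),
    "Experiment 6, Practice-Before-Test": (16.55, 69),
}
for label, (t, n) in reported.items():
    print(f"{label}: d = {cohens_d_from_t(t, n):.2f}")
```

The contrast between the retroactive conditions (d near 0.1 in absolute value) and the ordinary practice-before-test condition (d near 2) gives a sense of scale for the effects being pooled.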

All effect sizes were coded on six dimensions: 1) whether the experiment attempted to replicate Bem's Experiment 8 or his Experiment 9, 2) whether it was administered online or in a lab, 3) whether it was conducted by Bem, 4) whether the software used to administer the experiment was the software originally used by Bem, 5) whether the results had already been published (we treat our results as unpublished), and 6) whether the experimenters conducting the replication expected to observe a psi effect.
The last criterion merits further explanation. Previous work has shown that experimental results can be influenced by experimenters' expectations (Rosenthal, 1966), and so we thought it appropriate to investigate whether psi effects might also be susceptible to such influence. Furthermore, it has been suggested that this type of expectancy might influence the operation of psi (Bem, personal communication, February 26, 2012). We were able to identify the experimenter expectation associated with each replication attempt by one of two means: 1) collecting publicly made statements by the experimenters (e.g., in their papers or on their public blogs) or 2) contacting the experimenters and explicitly asking them what their expectation was.
We coded the experiments that we conducted as follows. The lead investigator for Experiment 1 initially hypothesized that the experiment would yield positive results. Following the failure to replicate, the same investigator, falling in line with the remaining authors, subsequently updated his personal prior to that of obtaining a null result. It is worth noting that despite the fact that the authors of this paper held priors about psi when conducting the experiments, the goal of our replication attempts was always to be as objective as possible. As far as we know, our expectations did not affect the programming of the experiments, data collection, or analyses. The expectation merely refers to the belief about psi that the experimenters held prior to conducting the experiments, not to a conscious agenda that was pursued.

Meta-analysis of Effect Sizes
A summary of effect sizes is provided in Table 2 and Figure 1. To meta-analyze the effect sizes, we followed the procedure outlined by Hedges and Olkin (1985) and Lipsey and Wilson (2001). For DR%, we first adjusted the effect sizes to correct for biases associated with small samples (raw effect sizes are reported throughout the paper). We then weighted the effect sizes by the inverse of the standard error of each point estimate to account for variations in sample size and then computed weighted average effect sizes for each level of our six effect size coding variables (Table 3). For the correlation between DR% and sensation seeking, we first [...]
Despite these apparent differences, it is important to note that only one variable had a statistically significant influence on the size of the psi effect. That is, for only one potential moderator did the 95% CI around the point estimate of the differences in ds (between levels of the moderator) not include zero. This variable was whether or not the experiment was conducted [...]
It is also important to note that many of the moderators are highly correlated with each other and with whether Bem was the experimenter, and so many of the observed moderation effects likely do not represent unique effects. For example, in our sample a study that is published also tends to be one that Bem conducted (r = .46), suggesting that the "Bem-as-experimenter" result may be driving the publication result. This is further confirmed by the fact that re-running the meta-analysis with the 17 experiments (N = 3,941) [...] -.11, .11]). Given that this was the case for every dimension we examined, we conclude that the rather large effect sizes observed by Bem drove every potential moderator that our meta-analysis originally revealed.
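The pooling step described at the start of this section can be sketched as follows. Several details are assumptions rather than the authors' exact computations: the small-sample correction is the common Hedges approximation, the sampling variance of a one-sample d is approximated as 1/n + d²/(2n), and the weights are inverse variances (the conventional Hedges and Olkin choice), whereas the text above describes weighting by the inverse of the standard error. The (d, n) pairs are hypothetical.

```python
import math


def hedges_correction(d, n):
    """Small-sample bias correction for a one-sample d (df = n - 1)."""
    df = n - 1
    return d * (1 - 3 / (4 * df - 1))


def d_variance(d, n):
    """Approximate sampling variance of a one-sample d (an assumption)."""
    return 1 / n + d * d / (2 * n)


def fixed_effect(studies):
    """Inverse-variance weighted mean effect and its 95% CI.

    `studies` is a list of (d, n) pairs.
    """
    effects = [hedges_correction(d, n) for d, n in studies]
    weights = [1 / d_variance(g, n) for g, (_, n) in zip(effects, studies)]
    mean = sum(w * g for w, g in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1 / sum(weights))
    return mean, (mean - 1.96 * se, mean + 1.96 * se)


# Hypothetical (d, n) pairs, not the values in Table 2.
studies = [(0.25, 100), (0.02, 150), (-0.05, 2000)]
mean, ci = fixed_effect(studies)
print(f"d = {mean:.3f}, 95% CI [{ci[0]:.3f}, {ci[1]:.3f}]")
```

With weights of this kind, one very large study dominates the pooled estimate, which is why a sensitivity analysis that drops the largest sample is informative.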
Sensation Seeking. The average correlation between sensation seeking and DR% across all experiments was -.03 (95% CI:[-0.06, 0.00]), suggesting that there was no relationship between these two variables. Moreover, none of the variables we considered moderated this relationship nor did we observe this relationship in any subset of these dimensions. There seems to be insufficient evidence to conclude that sensation seeking correlates with psi.
Homogeneity. As can be seen in Table 3, the overall meta-analysis is heterogeneous (Q(18) = 38.97, p < .01), suggesting that a fixed effect meta-analytic model may be inappropriate. Accordingly, a random effects model was used, which yielded nearly identical results. Specifically, the overall average effect size of 0.05 did not significantly differ from 0 (95% CI: [-0.02, 0.12]). For simplicity, we do not report the average effect sizes using a random effects model for each level of moderator tested. However, the point estimates do not significantly vary as a function of model used.
Because homogeneity was found for the overall sensation seeking analysis and for every level of moderator, a fixed effect model is sufficient and so no random effects model was tested for sensation seeking.
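As an illustration of the meta-analytic steps described above (small-sample correction, weighting, the homogeneity statistic Q, and a random effects estimate), the following is a minimal sketch and not the code used for the reported analyses. It assumes inverse-variance weights, an approximate variance formula for a one-sample standardized effect, and a DerSimonian-Laird estimate of between-study variance; all function names are hypothetical, and the exact formulas behind the reported values may differ.

```python
import math

def small_sample_correction(d, n):
    """Assumed correction factor J = 1 - 3 / (4*df - 1), with df = n - 1."""
    df = n - 1
    return d * (1 - 3 / (4 * df - 1))

def effect_variance(d, n):
    """Assumed approximate sampling variance of a one-sample standardized effect."""
    return 1 / n + d ** 2 / (2 * n)

def meta_analyze(effects):
    """effects: list of (d, n) pairs, one per experiment (illustrative input)."""
    gs = [small_sample_correction(d, n) for d, n in effects]
    vs = [effect_variance(g, n) for g, (_, n) in zip(gs, effects)]

    # Fixed effect model: inverse-variance weighted average.
    ws = [1 / v for v in vs]
    fixed = sum(w * g for w, g in zip(ws, gs)) / sum(ws)
    fixed_se = math.sqrt(1 / sum(ws))

    # Homogeneity statistic Q, referred to a chi-square with k - 1 df.
    q = sum(w * (g - fixed) ** 2 for w, g in zip(ws, gs))
    df = len(gs) - 1

    # Random effects model: DerSimonian-Laird between-study variance.
    c = sum(ws) - sum(w ** 2 for w in ws) / sum(ws)
    tau2 = max(0.0, (q - df) / c)
    ws_re = [1 / (v + tau2) for v in vs]
    rand = sum(w * g for w, g in zip(ws_re, gs)) / sum(ws_re)
    rand_se = math.sqrt(1 / sum(ws_re))

    return {
        "fixed": fixed, "fixed_se": fixed_se,
        "fixed_ci": (fixed - 1.96 * fixed_se, fixed + 1.96 * fixed_se),
        "q": q, "df": df, "tau2": tau2,
        "random": rand, "random_se": rand_se,
        "random_ci": (rand - 1.96 * rand_se, rand + 1.96 * rand_se),
    }
```

When Q does not exceed its degrees of freedom, the estimated between-study variance is zero and the two models coincide; otherwise the random effects interval is wider than the fixed effect interval.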

Additional Analyses
Because Bem has made his data available (personal communication, November 1, 2010), we are able to perform additional analyses comparing his results with the results of our seven experiments. One way of comparing our results to Bem's is simply to test, via independent-samples t-tests, whether the psi effect observed in our experiments was significantly lower than that observed in the original studies. When comparing our Experiments 1, 2, 3, and 7 against Bem's Experiment 8, we obtain the following results: p = .03, p = .01, p = .47, and p = .04, respectively. Comparing our Experiments 4, 5, and 6 against Bem's Experiment 9, we obtain the following results: p = .11, p < .01, and p < .01, respectively. With the exception of Experiments 3 and 4, all of our experiments produced a psi effect significantly lower than those reported by Bem.
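The comparison above can be sketched as follows. This is an illustration only: it assumes Welch's unequal-variance form of the independent-samples t-test (the variant actually used is not specified here), takes two lists of per-participant DR% scores as hypothetical input, and approximates the one-tailed p-value with a standard normal distribution, which is reasonable only when the degrees of freedom are large.

```python
import math
from statistics import NormalDist, mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    va = variance(sample_a) / na   # squared standard error of mean A
    vb = variance(sample_b) / nb   # squared standard error of mean B
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

def one_tailed_p_lower(t):
    """Approximate P(T <= t) using a standard normal (large-df assumption)."""
    return NormalDist().cdf(t)
```

Here `sample_a` would hold the scores from a replication and `sample_b` the scores from the original experiment, so that a small p-value indicates a replication effect lower than the original.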
Finally, because Experiment 7 differs greatly in sample size from all other experiments included in the meta-analysis, we re-ran the entire analysis excluding this experiment. As can be seen in Appendix C, with one exception, our conclusions do not greatly differ. When using a fixed effect model, the overall d of .06 does significantly differ from 0 (95% CI: [.01, .11]). However, when controlling for heterogeneity with a random effects model, the corrected d of .05 does not significantly differ from 0 (95% CI: [-.02, .13]). Accordingly, despite the rather large weight that Experiment 7 carries in the meta-analysis, excluding it does not meaningfully change the interpretation of our results. Moreover, the conclusions about the moderators are unchanged with Experiment 7 excluded. That is, the only moderator that yields significantly different results is whether the experiment was conducted by Bem or not. All other moderators do not yield statistically significant effects.

General Discussion
We conducted seven experiments testing for precognition and found no evidence supporting its existence. Participants were asked to freely recall a set of words and then subsequently to practice them by retyping and categorizing them. Bem (2011) found (in two experiments with a total of 150 participants) that participants recalled more words from a set that they were then randomly assigned to practice. We did not find this. In our seven experiments (with 3,289 participants), participants were as likely to recall words that were subsequently practiced as words that were not subsequently practiced. Finally, in a meta-analysis including the results of all nine of these experiments (seven of ours and two of Bem's) and the results of ten experiments conducted by other researchers, we observed an overall effect nonsignificantly different from zero (d = .04). This combination of results suggests that, in the retroactive facilitation of recall paradigm, there is insufficient evidence to reject the null hypothesis.
Additionally, we find no evidence to support a relationship between sensation seeking and psi (r = -.03).

Limitations
Despite our best efforts to conduct identical replications of Bem's Experiments 8 and 9, it is possible that the detection of psi requires certain methodological idiosyncrasies that we failed to incorporate into our experiments. For instance, after reading the replication packet provided by Bem (personal communication, November 1, 2010), we noticed that there were at least three differences between our experiments (which followed the procedure described in Bem's published paper) and the full procedure actually employed by Bem. First, prior to the start of Bem's experiments, the experimenter was required to have a conversation with each participant in order to relax the participant. Second, prior to starting Bem's experiments, participants were asked two questions in addition to the sensation seeking scale (agreement with the statement "I have lots of anxiety when I'm taking a test" and frequency of "have(ing)...practiced any form of meditation, self-hypnosis, relaxation exercises, or biofeedback"). Third, the set of words used by Bem was divided into common and uncommon words, something that we did not do in our Experiments 1, 2, and 7. Given the fragility of the observation of psi phenomena, it is possible that these methodological idiosyncrasies are necessary for reliable detection. Indeed, although we failed to replicate Bem's findings, we would be eager to know of a set of conditions that can reliably detect psi. That said, to the extent that Bem elected not to report these specific idiosyncrasies in his published paper, we can only assume that he does not believe that they are necessary for the detection of psi.
Another limitation is in our choice of experiments to replicate and meta-analyze. Although, as mentioned, Bem's Experiments 8 and 9 make the most logical sense to replicate, our investigation into psi is limited to the (lack of) detection of retroactive facilitation of recall. We can reasonably claim a failure to observe this type of psi, but can make no claims regarding precognition, retroactive priming, or retroactive habituation, the other three areas of psi investigated by Bem. For that, we call for more replication attempts by independent research teams.

Of the seven experiments that we conducted, four were conducted online. It is not immediately clear why precognition would not be observed online (i.e., the theoretical development of the construct does not specify whether this should moderate the effect), but we thought that it was reasonable to give the online environment additional consideration. One possible concern might be that, if people are taking the test at some remote location, their surroundings might be sufficiently distracting to make them less attentive.
In Appendix B we report the outcome of two methods for excluding participants who were insufficiently attentive for Experiments 1, 2, and 6 and two additional measures for Experiment 7. One measure simply asked participants to self-report if they were not paying full attention. This measure appears to have some validity, as that exclusion increased the measure of overall recall in all four online experiments. Nevertheless, it did not influence DR%. The second measure was behavioral: we recorded how long each participant spent on the task. We reasoned that participants who were working too quickly (or abandoning the experiment) were unlikely to have attended sufficiently to the task. We chose a relatively liberal cut-off, and excluded any participant who was more than 1 standard deviation faster than the mean completion time. Again, this measure was validated in that the exclusion yielded a higher total recall score, but it had no noticeable influence on DR%. (For two experiments it non-significantly increased DR% and for two it non-significantly decreased it.)
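The behavioral exclusion rule can be illustrated with a short sketch. It assumes completion times in seconds and the sample (n - 1) standard deviation; the function name and the example values are hypothetical and are not taken from our data.

```python
from statistics import mean, stdev

def flag_too_fast(times, sd_cutoff=1.0):
    """Flag participants whose completion time is more than `sd_cutoff`
    standard deviations below the mean completion time."""
    threshold = mean(times) - sd_cutoff * stdev(times)
    return [t < threshold for t in times]

# Hypothetical completion times in seconds; only the first is flagged.
example_times = [100, 600, 620, 640, 660, 680]
example_flags = flag_too_fast(example_times)
```

Analyses would then be run twice, once on all participants and once on those whose flag is False.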
Because of the open nature of Experiment 7, additional precautions were taken to ensure data integrity. First, as described above, participants indicated whether or not they had previously taken this experiment or one like it. Of the 2,469 participants, 250 indicated that they had. We analyze these data both with and without these participants and report the results in Appendix B. Second, because participants may have been interested in simply seeing what the experimental procedure was like, we identified participants who chose not to recall any words at all. Thirty-three participants did not recall any words and, again, to be conservative, we analyze the data both with and without them. Neither of these exclusion criteria had a discernible influence on the total number of words recalled or DR%.
Additionally, we analyzed whether DR% was influenced by the total number of words recalled, for both the online and lab studies. The total number of words recalled can be seen as a reasonable proxy for how closely people attend to the stimuli. This measure was positively related to DR% in four studies, and negatively related in the three others. It never approached significance in either online or lab studies.
Finally, one concern may be that participants actively sought to sabotage our experiments in the direction of observing a null result. Participants completing our experiments at home could have adopted one of two strategies to undermine our investigation. First, they could have "recalled" either zero or all 48 words (something that could be accomplished by writing down the words as they appeared during the learning phase of the experiments). Either strategy would yield a DR% of 0. However, only 44 participants out of all 3,289 "recalled" zero words and none "recalled" all 48, suggesting that this was not the case. Second, participants could have, a priori, decided to write down some subset of words as they were being displayed (say, the first 10) and only "recall" those words. Because practice and control words are randomly determined after the "recall" task, this strategy would, on average, also yield a DR% score of 0. Though we cannot empirically rule out this strategy, we can reason that it would work best if the number of predetermined words to recall was even, and not odd (i.e., an odd number of recalled words necessarily provides evidence either for or against psi). Following this strategy, the sinister participant could minimize the likelihood of contributing to the overall DR% score by recalling an even number of words. This, however, was not the case: there was no difference in the proportion of times the total number of words recalled was odd or even (χ²(1, N = 3,289) = .11, p = .30). Moreover, analyzing the results from only those participants who recalled an odd number of words yielded a DR% of -.30 (t(1,394) = -.99, p = .32), suggesting that even when excluding participants who may have attempted to undermine our results in this way, we failed to observe psi. As such, we suspect that the nefariousness of our participants was minimal.
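The reasoning about these strategies can be checked with a small simulation. The sketch below assumes the DR% scoring of Bem (2011) takes the form (P - C) × (P + C), expressed as a percentage of its maximum of 576; that formula is not restated in this section and is an assumption here, as are the function names.

```python
import random

N_WORDS = 48      # words shown during the learning phase
N_PRACTICE = 24   # words randomly assigned to practice after recall

def dr_percent(p, c):
    """Assumed DR% scoring: (P - C) * (P + C) as a percentage of 24 * 24 = 576."""
    return 100 * (p - c) * (p + c) / 576

def mean_dr_fixed_subset(k, trials, seed=0):
    """Mean DR% when the same k words are always 'recalled' and the practice
    set is drawn at random afterwards."""
    rng = random.Random(seed)
    recalled = set(range(k))   # e.g., the first k words displayed
    total = 0.0
    for _ in range(trials):
        practice = set(rng.sample(range(N_WORDS), N_PRACTICE))
        p = len(recalled & practice)   # recalled words that became practice words
        c = k - p                      # recalled words that became control words
        total += dr_percent(p, c)
    return total / trials
```

Under this scoring, recalling zero or all 48 words gives a DR% of exactly 0, a fixed subset gives a DR% of 0 only on average, and any odd number of recalled words forces P and C to differ, so the individual score cannot be 0.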

How Can These Results Be Reconciled with Bem (2011)?
Bem reports nine experiments (n = 950) suggesting that people can feel the future; we report seven experiments (n = 3,289) suggesting that people cannot. How is that possible? To start, it is certainly useful to point out that we are only looking at one basic procedure from the overall set of Bem experiments. Perhaps, it could be argued, precognition exists, but it cannot be detected in the retroactive facilitation of recall paradigm. Under that assumption we might look at the original Bem paper and suggest that Experiments 8 and 9 are simply Type I errors (false rejections of the null hypothesis). We do not have any empirical grounds for questioning the remaining seven experiments.
Still, even in Experiments 8 and 9, it is unclear how Bem could find significant support for a hypothesis that appears to be untrue. Elsewhere, critics of Bem have implicated his use of a one-tailed statistical test (Wagenmakers et al., 2011), testing multiple comparisons without correction (Wagenmakers et al., 2011), or perhaps simply a lurking file drawer with some less successful pilot experiments. All of these concerns fall under a larger category of researcher degrees of freedom, which raise the likelihood of falsely rejecting the null hypothesis (Simmons et al., 2011). Some of these researcher degrees of freedom can be easily justified and have small and seemingly inconsequential effects. For example, Bem analyzes participant recall using an algorithm which weights the total number of correctly recalled words (i.e., DR%). He could have instead analyzed simple difference scores and found a similar, but not quite identical, result.
The scoring distinction is just a single example, but even for Bem's simple procedure there are many others. For example, Bem's words are evenly split between common and uncommon words, a difference that was not analyzed (or reported) in the original paper, but may reflect an alternative way to consider the data: perhaps psi only persists for uncommon words?
He reports the results of his two-item sensation-seeking measure, but he does not analyze (or report collecting) additional measures of participant anxiety or experimenter-judged participant enthusiasm. Presumably these were collected because there was a possibility that they may be influential as well, but when analyses revealed that they were not, they were dropped from the paper. To be fair, because Bem reported two experiments on retroactive facilitation, his freedom is somewhat constrained. He cannot easily use DR% for one and a simple difference score for the other. On the other hand, he can certainly choose the one that works best for both studies and never report the other. Regardless, all of these decisions are defensible and possibly even recommended. Nevertheless, because their application is at the discretion of the researcher examining data after the completion of the experiment, they can make a true effect more difficult to discern. Researcher degrees of freedom do not make a finding false (e.g., the second law of thermodynamics is still true, even if a researcher tries multiple tests to detect it), but they do make it much harder to distinguish between truth and falseness in reported data.
Popper (1959/2002) defined a scientifically true effect as that "which can be regularly reproduced by anyone who carries out the appropriate experiment in the way prescribed" (pp. 23-24). Though decades have passed, that is still the operational definition of scientific truth. An effect is not an effect unless it is replicable, and a science is not a science unless it conducts (and values) attempted replications. No matter the outcome, it is indisputably admirable for Bem to encourage and facilitate the independent replication of his experiments. It is, by definition, what any scientist should do.

1. To mirror the analysis conducted by Bem, all p-values for experimental data in this paper are one-tailed in the positive direction, except where stated. Because we had no a priori predictions about moderators in the meta-analysis, all p-values there are two-tailed.

2. Throughout the manuscript we primarily report values to two significant digits. In some cases, this results in values of 0.00 and -0.00. In those cases, we include the sign to indicate that, before rounding, the value is positive or negative.

Overall Effect
Note. Size of circles represents weight of experiment in meta-analysis. Vertical dotted line and square represent weighted average overall effect. Horizontal lines represent 95% confidence intervals.

The four means presented in this table (P, C, DR%, and Simple Differential Recall) are each presented with the Standard Error reported in parentheses below.
2 P = the number of practice words correctly recalled (out of 24 possible)
3 C = the number of control words correctly recalled (out of 24 possible)
4 No participants were faster than a standard deviation from the mean in Experiment 6 (Practice-Before-Test).
5 Because total number of words recalled were not provided by Bem, totals (2012) were not available, the n for that meta-analysis is only 16.