Is Statistical Learning Ability Related to Reading Ability, and If So, Why?

ABSTRACT Previous studies have found a relationship between performance on statistical learning (SL) tasks and both reading ability and developmental dyslexia. It has therefore been suggested that the ability to implicitly learn patterns is important for reading acquisition. The causal mechanisms behind this relationship are unclear: Although orthographic sensitivity to letter bigrams may emerge through SL and facilitate reading, there is no empirical support for this link. We tested 84 adults on two SL tasks, reading tests, and a bigram sensitivity task, and assessed correlations using Bayes factors. This allows us to test the prediction that SL and reading ability are correlated and to explore sensitivity to bigram legality as a potential mediator. We found no correlations between the SL tasks and reading ability, between SL and bigram sensitivity, or between the two SL tasks. We conclude that correlating SL with reading ability may not yield replicable results, partly due to low correlations between SL tasks.

reading ability in adults suggests that the effect of SL explains variance in reading ability even when the participants have achieved a high level of competence. A second study, which examined the correlation between reading ability and SL (Frost, Siegelman, Narkiss, & Afek, 2013), tested English native speakers who were learning Hebrew as a second language. Here, performance on a visual SL task, akin to the task used by Arciuli and Simpson, was correlated with reading ability: Participants who performed well on this task also made faster progress in both unpointed nonword and pointed word reading ability. However, as this study tested adult second-language learners of a new script, it is not clear whether it reflects the same relationship as the correlation between SL and learning to read one's first language as a child.
Although only two studies assessed the relationship between SL and reading ability in an unselected sample, more than a dozen studies compared performance on SL tasks between a group of participants with developmental dyslexia (hereafter, dyslexia) and a control group. If SL is correlated with reading ability, one would expect such studies to find group differences, as selecting a group of participants with dyslexia increases the range of reading ability in the sample. The results of the studies on SL and dyslexia are mixed: A recent review (Schmalz, Altoè, & Mulatti, 2017) and a meta-analysis (Van Witteloostuijn, Boersma, Wijnen, & Rispens, 2017) suggest that there is publication bias. Publication bias refers to the preferential publication of positive results, which consequently become overrepresented in the published literature, leading to an increased Type I error rate and inflated effect sizes (Rosenthal, 1979; Van Elk et al., 2015). In the presence of publication bias, it becomes difficult to determine whether an effect is different from zero, as different statistical correction methods for meta-analyses often yield conflicting results (Van Elk et al., 2015). Here, we do not discuss whether there is sufficient evidence for a group difference in SL. For our purposes, it is relevant to consider the tasks that were used by these previous studies. The two studies on SL in an unselected population used a visual SL task: Here, participants see a series of shapes, presented one at a time (Arciuli & Simpson, 2012; Frost et al., 2013). This sequence of shapes includes embedded triplets: Three of the stimuli always follow one another. This means that, once the participant has (implicitly) learned the sequence, the first and second stimuli of a triplet can be used to predict the third.
In a subsequent recognition test, participants perceive stimulus pairs that occurred together within this triplet as more familiar than stimulus pairs that did not frequently occur together, suggesting that they learned these transitional statistics.
In contrast, the studies on SL and dyslexia used either Artificial Grammar Learning (AGL; Reber, 1967) or serial reaction time tasks (SRTTs; Nissen & Bullemer, 1987). In the AGL task, participants first see a set of symbol sequences in a learning phase. These are created according to a set of rules (see Figure 1), which specify the positions and sequences in which the symbols can occur. In a subsequent test phase, participants are presented with symbol strings that did not occur during the learning phase and need to guess, for each string, whether it conforms to the grammar that constrained the learning strings.
In the SRTT, participants see a stimulus that can occur in different positions on the screen (e.g., top, bottom, left, right). The task of the participants is to press a key corresponding to the stimulus's location. Unknown to the participants, the sequence of the locations repeats. With increased exposure to the repeated sequence, the participants' performance improves across blocks. Critically, toward the end a block is inserted where the location sequence is randomised. If participants implicitly learned the sequence, their performance on this random block drops, compared to the preceding block.
A commonality between the three tasks is that participants need to learn to use the available input to predict a future event. In the triplet task and SRTT, these are the identity and location, respectively, of an upcoming stimulus. In the AGL task, participants appear to learn symbol chunks (Pothos, 2007): In deciding whether a given string is grammatical, participants rely on their knowledge of whether a given symbol can occur next to another within a string. This also involves a prediction based on conditional probabilities: Given the first symbol of the sequence, what is the probability of the observed second symbol?
If SL, as a hypothetical single construct measured by all three tasks, is related to reading ability, it likely is so through their shared component: observing regularities in the environment and using this knowledge to predict an upcoming event. An alternative explanation for any correlation between SL tasks and reading ability is that it reflects participant-level confounds, such as the general level of attention or motivation (Staels & Van Den Broeck, 2017; Waber et al., 2003).
Assuming that SL is correlated with reading ability, after partialling out the shared variance with a control task to account for differences in attention and motivation, the next question is about the causal pathways that lead from SL performance to reading ability. Orthographies contain regularities on many levels, and studies have shown that readers develop sensitivities to them. Children learn very quickly which letter sequences do or do not occur in their orthography (Cassar & Treiman, 1997; Pacton, Fayol, & Perruchet, 2005; Pacton et al., 2001; but see also Deacon, Benere, & Castles, 2012; Rothe et al., 2014, for a failure to find evidence for a causal link between orthographic sensitivity and reading ability, using longitudinal designs). Bigram sensitivity could affect reading ability through a mediating link, namely, spelling ability: Knowing frequent letter patterns in one's orthography constrains possible spelling patterns of a word, which would improve a child's spelling ability. When writing the word quick, a child who does not know how to spell it may rely on their knowledge of legal letter patterns of the English orthography to decide against spelling it as ckwik, although this spelling is phonologically plausible (for a review, see Chetail, 2015).
A second possible causal pathway between SL and reading ability could be via the learning of complex grapheme-phoneme correspondences (GPCs). In alphabetic orthographies, graphemes sometimes have multiple sound associations, which depend on their context (Schmalz, Marinus, Coltheart, & Castles, 2015). In English, the grapheme a is generally pronounced as in "cat," but its pronunciation changes when it is preceded by a w or qu, as in "wasp" (Venezky, 1970). In German, vowel length can often be predicted from the number of subsequent consonants (Perry, Ziegler, Braun, & Zorzi, 2010). These rules are not taught explicitly at school. However, with reading experience, German speakers become sensitive to such context-dependent regularities, as the number of consonants after a vowel affects the probability of participants reading a vowel as long or short: The nonword BLAF, with only one consonant in the coda, is more likely to be pronounced with a long vowel than the nonword BAMT, where the vowel is followed by two consonants (Schmalz et al., 2014). SL may be important for learning these complex GPCs through exposure to real words in one's orthography (Apfelbaum, Hazeltine, & McMurray, 2013). This would enhance nonword reading skills, as readers of even a shallow orthography such as German would compute the correct pronunciation more quickly. The ability to decode unfamiliar words is, in turn, a well-established predictor of orthographic learning and reading ability (Share, 1995, 2008).
Figure 1. Adapted from Knowlton and Squire (1994).
Other possible links between SL and reading ability include learning to use probabilistic cues to assign lexical stress in languages without a regular stress pattern (Arciuli, Monaghan, & Seva, 2010; Jouravlev & Lupker, 2015; Mousikou, Sadat, Lucas, & Rastle, 2017; Seva, Monaghan, & Arciuli, 2009; Sulpizio & Colombo, 2013), facilitating written word learning by establishing fully specified links between phonology, orthography, and semantics (Steacy, Elleman, & Compton, 2017), or, indirectly, via oral language skills (Saffran, Newport, & Aslin, 1996; Seidenberg & Gonnerman, 2000; Spencer, Kaschak, Jones, & Lonigan, 2015). It is worth noting that, although there are abundant theories on the relation between SL and reading ability, there is less empirical work that would establish (a) the link between performance on nonlinguistic SL tasks and the learning of orthography-specific regularities, and (b) the link between sensitivity to a given orthography-specific regularity and reading ability.
Here, our aim is twofold. First, we test the proposal that SL is important for reading. In line with previous findings of Arciuli and Simpson (2012), we expect to find a correlation between nonlinguistic SL tasks and reading ability. Second, we test the plausibility of two possible mediators in the relationship: sensitivity to bigram legality and nonword reading ability. We use Bayesian correlation analyses, which allow us to draw conclusions about the absence of a correlation rather than only about significant correlations (Dienes, 2014; Rouder, Speckman, Sun, Morey, & Iverson, 2009).
Like Arciuli and Simpson (2012), we tested a sample of unselected adults on two reading tests (word and nonword reading fluency) and two SL tasks (serial reaction time [RT] and artificial grammar learning). As these two SL tasks have been frequently used in the literature on SL and dyslexia, it is worth establishing whether they correlate with each other and thus measure the same SL construct. In addition, we use a correlational approach to test whether orthographic sensitivity may mediate the relationship between SL and reading. Participants performed an orthographic choice task that measured bigram sensitivity and a task to control for individual differences in attention or motivation (choice RT).

Participants
Participants were 84 adult German native speakers, recruited at two universities and a research institute in southern Germany. Participant characteristics are summarised in Table 1. Reading percentiles show a wide range of reading ability compared to a normative sample of university students, apprentices, and high school graduates (Moll & Landerl, 2010). Participants were tested individually in sessions lasting about 30 min.

Reading tasks
We used a standardised reading task to assess word and nonword reading fluency (Salzburger Lese- und Rechtschreibtest II; Moll & Landerl, 2010). The tests consist of lists of words and nonwords, respectively, arranged in columns and increasing in difficulty. Participants are instructed to read as many items as possible within 60 s.
Dependent variables for the reading tests are the number of words or nonwords, respectively, read correctly within 60 s. Although performance on these two reading subtests is correlated, they reflect different cognitive processes, which are dissociated in some readers (Castles & Coltheart, 1993). If SL specifically affects the learning of GPCs, we might expect that readers with poor SL skills show relatively poor nonword reading skills compared to their word reading skills. To test this possibility, we calculated a difference score by subtracting each participant's z-score for nonword reading performance (relative to the rest of the sample) from their z-score for word reading performance. Negative numbers reflect relatively good nonword reading skills compared to a participant's word reading skills; positive numbers reflect relatively good word reading skills.
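This difference score can be sketched as follows (a minimal illustration in Python; the function name and input format are ours, not part of the original analysis pipeline):

```python
import numpy as np

def reading_difference_score(word_scores, nonword_scores):
    """Word-minus-nonword reading difference, each score z-transformed
    within the sample. Negative values indicate relatively good nonword
    reading; positive values indicate relatively good word reading."""
    word_z = (word_scores - word_scores.mean()) / word_scores.std(ddof=1)
    nonword_z = (nonword_scores - nonword_scores.mean()) / nonword_scores.std(ddof=1)
    return word_z - nonword_z
```

A participant who reads many words but few nonwords relative to the sample thus receives a positive score, matching the interpretation given above.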

SL tasks
Serial reaction time task. We implemented the SRTT in OpenSesame (Mathôt, Schreij, & Theeuwes, 2012). The stimulus, which occurred sequentially in one of four positions on the screen, was a cartoon-like drawing of a cow. Participants were instructed to indicate the cow's position on the numerical keyboard (8 for up, 4 for left, 6 for right, and 2 for down). The instructions were to respond to each stimulus as fast as possible but to avoid making too many mistakes. Each trial was presented for 2 s or until a button press occurred. The location sequence repeated after each 16 trials. There were 12 blocks of 16 trials each. The 11th block consisted of a different, pseudorandomised sequence of 16 trials.

There are numerous ways to calculate an outcome variable for this task, including improvement across repeated blocks, the difference between the random block and the preceding repeated block, the difference between the random block and the succeeding repeated block, or the difference between the random block and an average of the preceding and succeeding repeated blocks. Such flexibility is problematic, because multiple comparisons associated with different variables increase the Type I error rate (Elson, 2016). We therefore decided on the outcome variable a priori. For improvement across repeated blocks, it is unclear whether it reflects a practice effect or implicit learning. For the repeated block that succeeds the random block, implicit knowledge is likely to be already diluted by the random block. We therefore calculated the difference between the random block and the preceding repeated block. As accuracy rates are close to ceiling (in our task, average = 98.0%), RTs are better suited to assess individual differences. With RT measures, one needs to be wary of overadditivity: When relying on raw RTs, differences between conditions are numerically larger for participants with longer overall RTs (Faust, Balota, Spieler, & Ferraro, 1999).
Thus, we z-transformed RTs for each participant. For the analysis, we excluded incorrect trials (2% of the data) and data points that deviated more than 3 SDs from each participant's mean (a further 1.5% of the data). The outcome variable was the z-score difference between the random block (Block 11) and the preceding repeated block (Block 10), where larger positive values reflect stronger implicit sequence learning.
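The preprocessing steps just described (excluding errors and 3-SD outliers, z-transforming per participant, and contrasting Blocks 11 and 10) could be sketched as follows for one participant's data; the column names and data format are illustrative assumptions:

```python
import pandas as pd

def srtt_learning_score(trials: pd.DataFrame) -> float:
    """Z-score difference between the random block (11) and the preceding
    repeated block (10) for one participant.
    Expects columns: 'rt' (ms), 'correct' (bool), 'block' (1-12)."""
    ok = trials[trials["correct"]].copy()
    # Exclude trials deviating more than 3 SDs from the participant's mean RT
    m, sd = ok["rt"].mean(), ok["rt"].std()
    ok = ok[(ok["rt"] - m).abs() <= 3 * sd].copy()
    # Z-transform within participant to guard against overadditivity
    ok["z"] = (ok["rt"] - ok["rt"].mean()) / ok["rt"].std()
    return (ok.loc[ok["block"] == 11, "z"].mean()
            - ok.loc[ok["block"] == 10, "z"].mean())
```

Larger positive values indicate a greater slowdown in the random block, that is, stronger implicit sequence learning.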
Artificial grammar learning. This task consists of a learning phase and a test phase. In the learning phase, participants were exposed to symbol strings, which followed the set of rules summarised in Figure 1. As a cover task for the first phase, we presented participants with two grammatical symbol strings on the screen simultaneously, separated by 25 blank spaces. In half of the trials, the two symbol strings were identical, and in the other half of the trials, they were different. The number of symbols contained in each string was identical for each pair. Participants were instructed to choose whether the strings of each pair were identical or different by pressing the right or left shift key. Each trial stayed on the screen until a response occurred. There were 86 trials in total. Throughout the task, participants saw four repetitions of 43 legal symbol strings. The participants completed the cover task from the first phase with very high accuracy (M = 97.8%, by-participant SD = 2.2%, minimum accuracy = 90.7%), which shows that they attended to the exposure strings. These data are not analysed further.
After the first phase, participants were told that the strings they had just seen followed a set of complex rules. For the second phase, participants were presented with symbol strings that had not occurred in the learning phase. They were told that half of the symbol strings were created using the same rules as the strings in the previous part and that they would need to guess whether each new string was created by the same rules. If the string seemed familiar to them, they were instructed to press the right shift key, and if the string looked less familiar, they were asked to press the left shift key. There were 44 trials altogether. Each symbol string stayed on the screen until a response occurred.
Typically, for the critical second phase of the task, accuracy is too low for an RT analysis; therefore, only accuracy rates are analysed. We calculated overall accuracy and the sensitivity index (d′). The latter measure is the z-score difference between the hit rate and the false alarm rate and accounts for participants' response bias (Stanislaw & Todorov, 1999). To calculate the d′ score, hit or false alarm rates of 0 or 1 were changed to 0.00001 and 0.99999, respectively, as rates of 0 or 1 yield z-scores of ±∞. Higher values of d′ indicate better learning, and d′ = 0 indicates chance performance.1

Sensitivity to frequent letter patterns

Participants were presented with nonwords containing either a letter bigram that never occurs in the German orthography or only legal letter bigrams. The task was to decide, for each item, whether the nonword follows the orthographic principles of German. Participants were instructed to respond as quickly as possible and to guess if they were unsure. The items were presented in random order, for 5 s or until a response occurred, with DMDX (Forster & Forster, 2003). The items were 80 legal and 80 illegal nonwords, taken from Bakos, Landerl, Bartling, Schulte-Körne, and Moll (2018). All items were pronounceable in German. Illegal letter clusters were either letter doublets (e.g., ovv, Tüü) or consonant clusters (e.g., Lutd and Alßt, where the bigrams td and ßt do not occur in German). The nonwords were matched across the two conditions on length and syllabic structure. The test was preceded by 10 practice trials.
We calculated the overall accuracy and sensitivity index (d′) for this task. As accuracy was relatively high, we also calculated the overall RTs, after excluding incorrect trials (10.5% of the data).
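The d′ computation used for both tasks, including the clamping of extreme rates, can be sketched as follows (the function is our illustration, using the standard-normal quantile function from the Python standard library):

```python
from statistics import NormalDist

def d_prime(hit_rate: float, false_alarm_rate: float) -> float:
    """Sensitivity index d' = z(hit rate) - z(false alarm rate).
    Rates of exactly 0 or 1 are clamped to 0.00001 / 0.99999, as in the
    text, because z(0) and z(1) are infinite."""
    z = NormalDist().inv_cdf
    clamp = lambda p: min(max(p, 0.00001), 0.99999)
    return z(clamp(hit_rate)) - z(clamp(false_alarm_rate))
```

A d′ of 0 corresponds to chance performance (equal hit and false alarm rates); higher values indicate better discrimination.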

Control task
To control for overall differences in processing speed, which may reflect attention or motivation, participants saw different cartoon animals on the screen, presented in random order, and were instructed to press a key on the right side of the keyboard if the animal was a cat and a key on the left side if it was not. The instructions were to respond as fast as possible but without making too many mistakes. The task was programmed in OpenSesame (Mathôt et al., 2012). For each trial, the stimulus was presented for 1,500 ms or until a response occurred. The stimuli were three different-coloured pictures each of cats, cows, rabbits, and sheep. There were 120 trials, 30 of which required a yes response. We calculated both the accuracy and the average RTs for each participant.

Results

Table 1 shows the overall descriptive statistics. For the analysis, we generated (a) a correlation matrix containing Pearson's correlations and (b) Bayes factors (BFs) for the presence of each correlation (Table 2).2 The scatterplots showing the relationships between the SL tasks and reading, between the two SL tasks, and between bigram sensitivity and reading and SL are shown in Figure 2. Figure 3 shows the average performance of participants across blocks. A figure with all scatterplots, as well as the data used for the analyses, can be accessed at Schmalz (2017). We did not exclude any outliers. The scatterplots show that although there were some outlying points for several tasks, these do not seem to distort any meaningful patterns.
The correlation coefficients and BFs were calculated with JASP (Love et al., 2015). BFs compare the extent to which the data are compatible with a prespecified alternative hypothesis over a null hypothesis (r = 0). In JASP, the prespecified alternative is a beta distribution centred around r = 0. The width of the distribution determines the probability density, under the alternative model, of different correlation coefficients. The default parameter specifies a flat prior distribution, such that the probability density of r values between −1 and 1 is evenly distributed. As we expected the correlation between reading and SL to be small, we changed the width parameter to 0.5, which makes extreme values less likely under the prior.
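For illustration, assuming JASP's stretched-beta parameterisation (a Beta(1/κ, 1/κ) distribution rescaled from (0, 1) to (−1, 1), with width parameter κ), the prior density over correlation values can be evaluated as follows; this is our reconstruction, not JASP code:

```python
from math import gamma

def stretched_beta_pdf(r: float, kappa: float) -> float:
    """Density of the stretched beta prior on a correlation r in (-1, 1).
    kappa = 1 yields a flat prior; kappa = 0.5 (as used here) concentrates
    mass near r = 0, making extreme correlations less likely a priori."""
    a = 1.0 / kappa
    x = (r + 1.0) / 2.0                      # map (-1, 1) onto (0, 1)
    beta_norm = gamma(a) * gamma(a) / gamma(2 * a)
    return (x ** (a - 1) * (1 - x) ** (a - 1) / beta_norm) / 2.0
```

With κ = 1 the density is 0.5 everywhere on (−1, 1); with κ = 0.5 it peaks at r = 0 and falls off toward ±1.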
Figure 2. Scatterplots of the relationships (A) between the SL tasks and reading ability, (B) between the two statistical learning tasks, and (C) between bigram sensitivity and reading ability. Note. For SRTT, values greater than 0 reflect that learning occurred; for AGL and bigram legality accuracy, 0.5 reflects chance level; and for AGL and bigram legality d′, 0 reflects chance level. For word and nonword reading, the axis reflects the number of words read correctly within 1 min, and for bigram legality, the average number of milliseconds before response.
BFs greater than 1 provide relative support for the alternative hypothesis, and BFs less than 1 provide relative support for the null hypothesis. In line with guidelines summarised by Rouder et al. (2009), we interpret values between 1/3 and 3 as inconclusive evidence, and values less than 1/3 or greater than 3 as evidence for the null and alternative hypotheses, respectively.
Critically, neither of the SL tasks correlated strongly with the reading tasks (all BFs < 0.7), nor did the SRTT outcome variable correlate with either of the AGL measures (BFs < 1/3).

Discussion
In a sample of 84 adult readers, we found no correlation between any of the SL tasks and reading ability, and no correlation between the two SL tasks. Thus, we did not find support for the proposal that SL ability affects reading ability.
The current findings are not in line with previous results of Arciuli and Simpson (2012), who found a correlation between visual SL and reading ability in children and adults. As our study was not a close replication of the original experiment, it is possible that the methodological differences across the studies are responsible for the different outcomes. Thus, a future, preregistered study is needed to determine whether the presence of a correlation can be confirmed, when the protocol closely follows the methods of Arciuli and Simpson. If this is the case, future empirical work is needed to isolate moderating factors that could have led to the outcomes of the current study.
There are several methodological differences that could explain the different outcomes. In the final analyses, we used different tasks than Arciuli and Simpson (2012) and Frost et al. (2013). Thus, there may be task-specific processes associated with the visual SL task that are correlated with reading ability. Furthermore, as the visual SL task requires participants to focus on a stimulus sequence lasting several minutes, attention may be a confounding factor (Staels & Van Den Broeck, 2017; Waber et al., 2003). However, in a subset of our sample comparable in size to that of Arciuli and Simpson, we did not find a correlation of the visual SL task either with reading ability or with learning in the SRTT (see Footnote 1).
The two tasks that we used in the final analyses (AGL and SRTT) have been used by numerous studies on SL and dyslexia. As previous studies have linked our SL tasks to dyslexia, the current results are interpretable within the literature on SL and reading ability. The lack of a correlation between the two tasks raises issues about their psychometric properties. It is possible that the tasks show insufficient variability to allow us to study individual differences (Hedge, Powell, & Sumner, 2017;Siegelman & Frost, 2015): In the AGL task, average performance was significantly above chance level. On the individual level, however, all but six participants were numerically above chance (i.e., at > 50% accuracy), but most of these were only slightly above chance level, such that their accuracy level was not significantly better than chance at the 5% level (see Table 1; see Siegelman, Bogaerts, & Frost, 2016, for a discussion of this problem in SL tasks). This methodological issue prevents us from interpreting the absence of a correlation as evidence against the view that SL is important for reading. However, given the popularity of these tasks, our finding of no correlation is still important for future research.
We also did not find a correlation between SL and bigram sensitivity, or between bigram sensitivity and reading ability. Previous studies have been unable to find evidence for a causal relationship between bigram sensitivity and reading ability (Deacon et al., 2012; Rothe et al., 2014). Furthermore, the role of bigram frequency in reading processes is unclear. Our results are in line with these studies and may suggest that sensitivity to letter bigrams does not act as a mediating link between SL and reading ability. However, our adult participants were clearly already very sensitive to bigram legality, as shown by their high accuracy on the bigram sensitivity task. Bigram sensitivity may be related to reading ability during the early stages of reading acquisition; for adults, this influence may be masked by other variables that influence reading performance.
Finally, it is worth pointing out that our study was conducted with German speakers, whereas the participants of Arciuli and Simpson (2012) were English speakers. German and English are different in terms of orthographic depth: The English orthography contains more complex (multiletter and context sensitive) GPCs than German, as well as more words where the pronunciation is unpredictable based on print-speech regularities (Schmalz et al., 2015). It is possible that SL is more important for learning to read in English, which could be necessary for extracting the orthographic regularities relating to complex rules. However, we consider this an unlikely explanation for our results: Frost et al. (2013) reported that, in learners of Hebrew, SL predicted both the learning efficiency of pointed nonword reading ability (a very shallow script) and unpointed word reading ability (a very deep script). Thus, the influence of SL ability on reading acquisition does not seem to be moderated by orthographic depth.
In summary, we found that SL tasks that are typically used in the literature on SL and dyslexia do not correlate with reading. We cannot distinguish the possibility that there is no link between SL and reading from the possibility that the tasks that are generally used are inadequate to show it. Future research may want to address issues of publication bias and the poor psychometric properties of SL tasks. Researchers will need to design a child-friendly SL task with good psychometric properties (for an adult SL task designed to have good psychometric properties, see Siegelman et al., 2016) and test a large sample of children to establish the presence or absence of a correlation between reading and SL.

Notes

1. Originally, we had also included a visual SL (triplet) task, akin to Arciuli and Simpson (2012) and Frost et al. (2013), with cartoon-like pictures of animals instead of aliens or shapes. However, after the first 30 participants completed this task, it became clear that there was not sufficient variability in the learning performance to yield meaningful correlations. This is in line with recent observations about the rather poor psychometric properties of this task (Siegelman et al., 2016). To save time, we therefore decided to discontinue using this task. The mean accuracy on the test phase was 54.4% (chance level = 50%; SD = 9.4, minimum = 40.0%, maximum = 75.0%). Pearson's correlations with this variable were r(28) = 0.06 for word reading, r(28) = −0.14 for nonword reading, and r(28) = −0.07 for SRTT learning, all BFs < 1/3.

2. We do not report p values, because the multiple comparisons would render them uninterpretable; for the readers' reference, given our sample size of 84, correlations exceeding r ≈ ±0.22 reach the traditional significance threshold of p = .05.
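As a quick arithmetic check on the figure given in Note 2, the critical correlation for a two-tailed test at α = .05 can be recovered by inverting the t statistic for a Pearson correlation, t = r√(n − 2)/√(1 − r²); this is a sketch using SciPy, not part of the reported analyses:

```python
from math import sqrt
from scipy.stats import t

def critical_r(n: int, alpha: float = 0.05) -> float:
    """Smallest |r| reaching two-tailed significance at the given alpha
    for a sample of size n, via the t distribution with n - 2 df."""
    df = n - 2
    t_crit = t.ppf(1 - alpha / 2, df)
    return t_crit / sqrt(df + t_crit ** 2)
```

For n = 84 this yields approximately 0.21, consistent with the r ≈ ±.22 quoted in Note 2; larger samples give smaller critical values.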