Unpredictability and complexity of print-to-speech correspondences increase reliance on lexical processes: more evidence for the orthographic depth hypothesis

ABSTRACT The Orthographic Depth Hypothesis [Katz, L., & Frost, R. (1992). The reading process is different for different orthographies: The orthographic depth hypothesis. In R. Frost & L. Katz (Eds.), Orthography, phonology, morphology, and meaning (pp. 67–84). Amsterdam: Elsevier Science] proposes cross-linguistic differences in the involvement of lexical processing during reading. In orthographies with complex, inconsistent, and/or incomplete sublexical correspondences, decoding is more difficult and therefore slower. This gives more time to the lexical route to retrieve information, and leads to a greater ratio of lexical processing. We test whether this mechanism applies both for words with inconsistent (in English) and for words with complex (in French) correspondences. As complex correspondences are sufficient to derive a correct pronunciation, an increase in lexical processing may not occur. In a reading-aloud task, we used the frequency effect to measure lexical processing. The data showed stronger involvement of lexical processing for inconsistent compared to consistent words, and for complex compared to simple words. The results confirm that Katz and Frost’s proposed mechanism applies to different sources of orthographic depth.

pronunciation of the word wasp is predictable, because an a preceded by a w tends to be pronounced as /ɔ/ (Schmalz et al., 2014; Treiman, Kessler, & Bick, 2003). In English orthography, however, there are also instances where the pronunciation is not predictable based on any sublexical information, such as in the words yacht or colonel. Therefore, there are two reasons why English orthography might be considered deep: First, the relatively high degree of complexity of the print-speech correspondences compared to orthographies such as German, Italian, and Dutch, and second, a high degree of unpredictability, even when those complex rules are applied (van den Bosch, Content, Daelemans, & de Gelder, 1994; Schmalz, Marinus, Coltheart, & Castles, 2015; Ziegler, Perry, & Coltheart, 2000).
Corpus analyses have shown that, across orthographies, unpredictability and complexity are dissociable on a linguistic level (van den Bosch et al., 1994; Schmalz et al., 2015): Although orthographies with simple correspondences tend to also have a high degree of predictability, these two concepts are not perfectly correlated. A particularly interesting example is French orthography, as there is a discrepancy between the degree of complexity and unpredictability. Specifically, French is high in complexity, because it contains multi-letter rules (au → /o/) and context-sensitive rules (c[a,o,u] → /k/, c[e,i] → /s/), but low in unpredictability (van den Bosch et al., 1994; Schmalz et al., 2015).
To date, it is unclear whether complexity and unpredictability of the sublexical correspondences act as separate sources of orthographic depth, or if they affect reading processes in the same way on a behavioural level. An existing hypothesis on orthographic depth is the orthographic depth hypothesis (hereafter: ODH; Katz & Frost, 1992). Here, the authors offer both a well-specified definition of orthographic depth, and propose a specific cognitive mechanism that drives cross-linguistic differences as a function of orthographic depth. In deep orthographies, they describe print-to-speech conversion code as characterised by complex, inconsistent, and/or incomplete sublexical information. This makes the sublexical conversion process more difficult in deep compared to shallow orthographies. As a result, the sublexical conversion process is impaired in one way or another, which gives more time for a lexical look-up mechanism to derive the correct pronunciation. This leads to a higher overall ratio of lexical-to-sublexical processing, as a function of degree of orthographic depth.
It is particularly noteworthy that Katz and Frost (1992) list three different properties that underlie the sublexical regularities of deep versus shallow orthographies: complexity, consistency, and incompleteness. The concepts of complexity and consistency map onto the distinction between complexity and unpredictability proposed by Schmalz et al. (2015; see also van den Bosch et al., 1994). Yet despite the theoretical and linguistic work that has shown a distinction between these multiple constructs underlying orthographic depth, whether these may differentially affect reading processes on the behavioural level has not been previously empirically tested.
The first construct proposed by both Katz and Frost (1992) and Schmalz et al. (2015) is complexity. An orthography with complex correspondences is characterised by multi-letter rules, where several letters are required to denote a single phoneme (e.g., augh → /o:/ in English, aient → /ε/ in French) and/or context-sensitive regularities, where surrounding letters affect a grapheme's pronunciation (e.g., in English, a is pronounced as /ɔ/ when preceded by a w, as in "swan"; in French, a g is pronounced as /ʒ/ when followed by an i or e, as in gélatine). When words contain complex correspondences, the sublexical information is sufficient to access full information about the word's phonology and semantics, once these complex rules are applied. However, evidence exists that applying multi-letter rules slows down the sublexical procedure, as they cause a conflict between the pronunciation of the single letters (e.g., in English, t and h) and the grapheme's pronunciation, th → /θ/ (Marinus & de Jong, 2010; Rastle & Coltheart, 1998; Rey, Jacobs, Schmidt-Weigand, & Ziegler, 1998).
The second construct that has been described by both Katz and Frost (1992) and Schmalz et al. (2015) is inconsistency, or unpredictability. Inconsistency is the presence of two or more pronunciations for the same orthographic unit. Conventionally, this is defined at the level of a word's body (e.g., the body ear is inconsistent because it can be pronounced as in "hear" or "bear"), but the same measure can also be applied to graphemes (e.g., the grapheme th is inconsistent, because it can be pronounced as in "thistle," "this," or "thyme"). For this source of depth, the sublexical information is not sufficient to derive the correct pronunciation. For example, the English words "tough," "though" and "through" have nearly identical sublexical information, but each of them has a different pronunciation of the grapheme ough, which cannot be derived without knowledge of the whole word (see Schmalz et al., 2015, for an in-depth discussion). According to rule-based computational models, such as the Dual Route Cascaded (DRC) model (Coltheart et al., 2001), such words need to be read aloud via the lexical route for a correct response, because the sublexical route will give a "regularised," or rule-based response for words which do not comply with the rules (e.g., /θaim/ for the word "thyme"). In a connectionist framework (Perry et al., 2007; Plaut et al., 1996), such words require the reliance on larger units, which in the case of unpredictable words coincide with a whole word (e.g., ough is pronounced as in "tough" when preceded by a t and as in "though" when preceded by a th). Arguably, the fact that the orthographic unit coincides with a whole word makes its processing qualitatively different from processing sublexical orthographic units, as whole words have direct connections to their semantic information, whereas sublexical units do not (for a discussion, see Schmalz et al., 2015).
The third construct proposed by Katz and Frost (1992) is incompleteness. This construct is of high relevance to Semitic orthographies, where vowels are not always represented. In pointed Hebrew, the sublexical information is complete, because all phonemes are represented; vowels are represented as diacritics. Generally, however, texts are written in unpointed Hebrew, without vowel markings. Here, vowel information is incomplete, and the pronunciation needs to be derived via semantic context by fluent readers. Incompleteness is not of high relevance for European orthographies, however. In the European alphabetic scripts, the orthographic (sublexical and whole-word) information is mostly sufficient to assemble a full phonological representation and to use this to access a word's semantics. There are some examples of words with incomplete lexical and sublexical information, namely heterophonic homographs. For a word like "present," semantic context is needed to derive both a pronunciation, and to access different semantic information depending on whether this word occurs as a verb or a noun. By definition, lexical-semantic processing is required when the sublexical correspondences are incomplete.
According to the ODH, complexity, inconsistency, and incompleteness result in a higher ratio of lexical and/or semantic to sublexical processing. The notion of an independent lexical and sublexical route is the basis of the dual-route framework (Coltheart et al., 2001; Perry et al., 2007). Here, the lexical and sublexical routes operate in parallel to obtain a pronunciation from an orthographic input. The longer the sublexical route takes, the more the final pronunciation will be influenced by excitatory connections from the orthographic lexicon to the phonological lexicon and to the phonological output buffer. If the sublexical information can be processed quickly, the phonological output will be driven to a greater extent by phoneme activation from the sublexical units.
Previous research has provided support for a stronger lexical influence for deep compared to shallow scripts, as predicted by the ODH. Frost et al. (1987) showed, in a between-language comparison, that lexical and semantic marker effects increase as a function of depth in Serbo-Croatian (a shallow orthography), English (medium), and unpointed Hebrew (deep). In a further study, Frost (1994) took advantage of the presence of both the shallow pointed and the deep unpointed script in Hebrew. This allows for a within-item design, where the same words can be presented with and without diacritics. Again, Frost (1994) showed stronger lexical (word frequency) and semantic (semantic priming) effects for the deep compared to the shallow script.
Both studies support the view that incompleteness increases the reliance on lexical processing, as both report a comparison of unpointed Hebrew with a complete orthography (pointed Hebrew, and English and Serbo-Croatian). The comparison between English and Serbo-Croatian, however, can be interpreted in different ways, because these two orthographies differ from each other both in terms of complexity and unpredictability. The first possibility, which is in line with the ODH, is that complex correspondences slow down the process of sublexical decoding. Thus, while the sublexical output is in principle sufficient for a correct response to occur, the slowdown will allow more time for the lexical route to contribute to the final phonological output. This would mean that any source of orthographic depth (i.e., complexity, unpredictability, or incompleteness) should increase the relative contribution of the lexical route.
Alternatively, it is possible that there is a qualitatively different impact of unpredictability and incompleteness as compared to complexity: As unpredictability and incompleteness make it impossible for the reader to compute a pronunciation from the sublexical information, the final response of the sublexical route will be either incorrect or partial. In this case, a correct reading-aloud response cannot occur until the lexical route has provided enough activation to the phonological output buffer. This is different for words with complex correspondences: Here, the sublexical information is, in principle, sufficient for a correct pronunciation. Any slowdown associated with the presence of complex correspondences might not be sufficient to result in a substantial effect on the relative amount of lexical processing.
The existing studies do not allow us to differentiate between the two possibilities. To our knowledge, all comparisons of lexical-semantic marker effects used orthography pairs which differ both in terms of complexity and unpredictability, such as English and Serbo-Croatian (Frost et al., 1987) or English and German (Frith, Wimmer, & Landerl, 1998; Rau et al., 2015). The main aim of the current study was to distinguish between these two possibilities. We use two orthographies, where the correspondences reflect two different sources of depth, namely unpredictability in English, and complexity in French (van den Bosch et al., 1994; Schmalz et al., 2015). We chose reading aloud rather than silent reading as the experimental task, because the ODH is specifically concerned with the process of deriving speech from print. Lexical decision is considered to be less sensitive to this sublexical process, as high accuracy on this task can be achieved purely by relying on lexical access (Coltheart et al., 2001).

The unpredictability measure
Defining unpredictability is not straightforward, because existing models of reading make different assumptions about the way in which the sublexical route assembles a pronunciation (Coltheart et al., 2001; Plaut et al., 1996; for a discussion, see Schmalz et al., 2015). Given that there is no consensus about the type of information that is used to assemble a pronunciation, it is also unclear what kinds of words would be considered to have an unpredictable pronunciation. To ensure that the results are meaningful beyond the assumptions of a specific model, we use a definition which is compatible with both connectionist and rule-based models: We classify a word as unpredictable if (1) it is both irregular (by the set of grapheme-phoneme correspondence rules implemented within the rule-based computational model, DRC; Coltheart et al., 2001) and inconsistent (i.e., if the body has more than one possible pronunciation), such as the word "ghost," or (2) it is irregular and does not have any body neighbours, such as the word "debt." Thus, neither grapheme-phoneme correspondences nor body-rime correspondences can be reliably used to read aloud these words correctly.
The concepts of irregularity and inconsistency are strongly correlated, but reflect theoretically different constructs and can be manipulated to vary orthogonally (Andrews, 1982; Cortese & Simpson, 2000; Jared, 1997, 2002; Jared, McRae, & Seidenberg, 1990). Here, we classified words as predictable if the pronunciation was predictable both from grapheme-phoneme correspondence rules (i.e., regular) and from body-rime correspondences (i.e., consistent), and as unpredictable when neither source could be used to read aloud the words correctly. We excluded words that are regular but inconsistent (e.g., "mint," which is regular but has the enemy "pint") or irregular but consistent (e.g., "walk," which should be pronounced as /wælk/ according to the DRC). Recent behavioural data suggest that participants rely on information from various types of sources to predict a novel word's pronunciation (Schmalz et al., 2014); as it is not yet clear how the cognitive system merges conflicting information from different sources, we excluded these types of words for the current purposes.
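The classification rule described above can be written out as a small decision function. The following is only an illustrative sketch: the function name, the boolean flags, and the body-neighbour counts attached to the example words are hypothetical placeholders, not values taken from the DRC rule set or from any lexical database.

```python
from typing import Optional


def classify_predictability(is_regular: bool,
                            is_consistent: bool,
                            n_body_neighbours: int) -> Optional[str]:
    """Return 'predictable', 'unpredictable', or None (word excluded).

    is_regular: word follows grapheme-phoneme correspondence rules.
    is_consistent: the word's body has only one pronunciation.
    n_body_neighbours: number of other words sharing the body.
    """
    # Predictable: both sublexical sources yield the correct pronunciation.
    if is_regular and is_consistent:
        return "predictable"
    # Unpredictable: irregular AND (inconsistent body OR no body neighbours).
    if not is_regular and (not is_consistent or n_body_neighbours == 0):
        return "unpredictable"
    # Regular-inconsistent ("mint") and irregular-consistent ("walk")
    # words fall in neither category and are excluded.
    return None


# Illustrative flags only; neighbour counts are placeholders.
examples = {
    "forge": (True, True, 1),
    "ghost": (False, False, 3),
    "debt": (False, True, 0),
    "mint": (True, False, 5),
    "walk": (False, True, 3),
}
labels = {word: classify_predictability(*flags)
          for word, flags in examples.items()}
```

Under these assumed flags, "forge" is labelled predictable, "ghost" and "debt" unpredictable, and "mint" and "walk" are excluded, mirroring the examples given in the text.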
It is unconventional to use predictability as a variable in psycholinguistic research. To date, the literature has focused predominantly on contrasting the effects of regularity with those of consistency (Andrews, 1982; Cortese & Simpson, 2000; Jared, 1997, 2002; Jared et al., 1990). Rule-based models, such as DRC, predict effects of regularity, because a lack of compliance to grapheme-phoneme correspondence rules should impair the reading-aloud process via the sublexical route. Connectionist models, such as the triangle or connectionist dual processing (CDP) models (Perry et al., 2007; Plaut et al., 1996), use a learning algorithm to extract the relationships between print and speech, which becomes more difficult when a given orthographic pattern can map onto multiple pronunciations (e.g., the body -ost, which can be pronounced as in "ghost" or as in "lost"). Thus, connectionist models predict an effect of consistency, but not regularity. While previous studies have shown that inconsistent words and nonwords are read aloud more slowly than matched consistent items (Andrews, 1982; Cortese & Simpson, 2000; Glushko, 1979; Jared, 1997, 2002; Jared et al., 1990), other data suggest that participants also rely on print-to-speech rules, especially for unusual orthographic patterns (Andrews & Scarratt, 1998; Pritchard, Coltheart, Palethorpe, & Castles, 2012; Robidoux & Pritchard, 2014).

The current study
In Experiment 1, we compare the frequency effect for English words with predictable versus unpredictable correspondences. The English orthography is used, because the relatively high degree of both complexity and unpredictability allows us to manipulate frequency and predictability. If there is stronger involvement of lexical processing when the pronunciation of a word is unpredictable, we expect a frequency-by-predictability interaction, where the frequency effect is larger for unpredictable compared to predictable words. This study serves as a conceptual replication of the finding of Frost et al. (1987) that there is a stronger relative involvement of the lexical route when the correspondences are unpredictable compared to when they are predictable.
In Experiment 2, we use the French orthography, which has a high degree of complexity, while being highly predictable. This allows us to manipulate frequency and complexity in a within-subject design and without unpredictability as a confounding variable. We aim to establish whether there is a stronger frequency effect for words containing complex correspondences, compared to words which contain only simple correspondences (i.e., there is a one-to-one correspondence between letters and sounds). If the complexity of the correspondences slows down the assembly process, we expect to find a frequency-by-complexity interaction, as lexical processing should be stronger for words with complex than simple correspondences according to the ODH. If we obtain this pattern, it would indicate that both complexity and unpredictability affect reading processes in adults in the same way as incompleteness in the previous studies (Frost, 1994;Frost et al., 1987). If we do not find a frequency-by-complexity interaction, this means that cross-linguistic differences in the relative reliance on lexical processing are driven by unpredictability and incompleteness, but not complexity.

Participants
Twenty undergraduate students at an Australian university participated in the experiment. All were native speakers of English and received course credit for their participation.

Items
To justify the use of the predictability metric rather than the more conventional consistency metric, we first verified that predictability reflects a psychologically valid construct. We created two models from the full data set that is analysed in Experiment 1 (see below). The models were nearly identical to those described in the Results section. The independent variables were predictability (coded as a binary contrast) or consistency (centred ratio of friends to enemies), centred log frequency, and the two-way interaction. Note that we centred all continuous independent variables (by subtracting the mean from each value) and contrast-coded dichotomous conditions (as 0.5 and −0.5) because Linear Mixed Effect (LME) models provide parameter estimates as deviations from the point closest to zero rather than deviations from the mean. The dependent variable was trimmed inverse RT (for more details about the trimming procedure, see the Results section). Items and participants were included as random effects, and the slope of the frequency effect was allowed to vary across participants (Barr, Levy, Scheepers, & Tily, 2013). The model with predictability as the independent variable yielded a numerically better fit than the model with the consistency ratio as the independent variable (AIC = 4,620 for the former, AIC = 4,658 for the latter). A Bayesian analysis, where the two models were contrasted, provided support for the model which used predictability as the independent variable over the model using consistency, with a Bayes Factor value >1,000,000 (for a description of how we interpret Bayes Factors, see below). This justifies the use of predictability as an independent variable, and suggests that predictability has stronger psychological validity than consistency.
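As an illustration of the predictor coding described here, the sketch below centres a continuous variable and contrast-codes a two-level factor. The helper names and the toy values are assumptions made for this example; the original analysis was run in R, and this is not the authors' script.

```python
from statistics import mean


def centre(values):
    """Centre a continuous predictor by subtracting its mean from each value."""
    m = mean(values)
    return [v - m for v in values]


def contrast_code(labels, positive_level):
    """Code a two-level factor as +0.5 (positive_level) and -0.5 (other level)."""
    return [0.5 if label == positive_level else -0.5 for label in labels]


# Toy data (hypothetical log frequencies and condition labels).
log_freq = [0.5, 1.0, 1.5]
condition = ["predictable", "unpredictable", "predictable"]

log_freq_c = centre(log_freq)              # centred values sum to zero
pred_c = contrast_code(condition, "predictable")
```

With both predictors coded this way, each main-effect estimate in a model with an interaction term can be read as the effect at the average level of the other predictor, which is the motivation given in the text.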
For the experiment, we used only monosyllabic words, because traditionally, measures of regularity and consistency, which form the basis of the predictability construct, have been defined for monosyllabic words only (but see Chateau & Jared, 2003; Kearns et al., 2014; Yap & Balota, 2009, for an extension of the consistency measure to multisyllabic words). We extracted all monosyllabic words from the British Lexicon Project (BLP; Keuleers, Lacey, Rastle, & Brysbaert, 2012). We retained all words with log frequencies between 0 and 2, because analyses of large-scale lexical databases have shown that the frequency effect is most robust for this log frequency range (Balota et al., 2007; Brysbaert et al., 2011; Ferrand et al., 2010; Keuleers, Diependaele, & Brysbaert, 2010). All words had a lexical decision accuracy >80%, suggesting that the words should be familiar to the majority of undergraduate students. We classified the words as predictable or unpredictable based on the above criteria.
We selected a total of 376 words. Half of these were predictable (e.g., "forge") and half were unpredictable (e.g., "ghost"), and they were chosen to vary in frequency, as half had a relatively low frequency (log frequency of 0-1) and the other half a relatively high frequency (log frequency of 1-2). Note that we treat frequency as a continuum rather than a dichotomy throughout the paper to increase experimental power. Frequency, as well as orthographic N counts, are based on the subtitle counts provided by the BLP (van Heuven, Mandera, Keuleers, & Brysbaert, 2014; Keuleers et al., 2012). Linear models were fitted to assess whether any of the item characteristics co-varied with frequency or unpredictability. In separate analyses, each centred potential covariate was used as the dependent variable; centred frequency, contrast-coded predictability, and their interaction were used as predictor variables. The outcomes of this set of analyses are shown in Table 1. The individual items and their full descriptives, as well as the raw data and the R script used for the current study, can be found here: osf.io/hm8fw. Note that orthographic neighbourhood and Phonological Levenshtein Distance (the number of phoneme substitutions, deletions, or additions which are required to reach the nearest 20 neighbours; see Yarkoni, Balota, & Yap, 2008) co-vary with frequency, and the ratio of letters to phonemes differs across predictable and unpredictable words. However, the critical comparison in the current experiment is the interaction between predictability and frequency, and none of the covariates show a stronger manipulation for the predictable than unpredictable condition, all p > .3 for the interaction. We therefore do not include any of them as covariates in the main analysis. To confirm that these potential confounds do not influence the results, however, we present a covariate analysis in a post-hoc test.

Procedure
Item presentation was controlled with DMDX (Forster & Forster, 2003). The words were shown, one at a time, in random order, for 2.5 s or until the voice-key was triggered. The participants were instructed to read aloud each item as quickly and accurately as possible.

Results and discussion
The reading-aloud responses were scored offline with the software CheckVocal (Protopapas, 2007), as correct, incorrect, or no response. Response latencies were readjusted using CheckVocal, based on the onset of the sound waves, in the case of premature or late voice-key triggers. This removes potential biases associated with first phonemes.
The data were further analysed using the software R, both with LME models (Baayen, Davidson, & Bates, 2008) and with Bayes Factors (Morey & Rouder, 2014; Rouder, Speckman, Sun, Morey, & Iverson, 2009). LMEs allow us to obtain estimates of the slopes (which serve as descriptives given the use of a continuous measure of frequency). We provide the results of t-tests, and p-values, when appropriate, to provide a point of reference for those unfamiliar with Bayesian analyses.
We report Bayes Factors for all theoretically interesting comparisons (i.e., for the critical interactions), and base our conclusions on them. Unlike frequentist statistics, Bayes Factors allow us to quantify the evidence for (or against) an effect or interaction of interest, given a prior belief. Therefore, they arguably provide a closer link to the conclusions that can be drawn from the data. Here, we use the default prior of the BayesFactor package, which assumes a Cauchy distribution with the width parameter r = 0.5 (Morey & Rouder, 2014). We interpret the results according to a set of guidelines described in Rouder et al. (2009): Bayes Factor values smaller than 1/3 provide evidence against an effect or interaction, values between 1/3 and 1 and between 1 and 3 are considered to provide anecdotal or equivocal evidence against or for it, respectively, values larger than 3 provide some evidence for the effect or interaction, and values larger than 10 provide strong evidence. Thus, throughout the paper, smaller values provide evidence for a null hypothesis, and larger values provide evidence for the alternative hypothesis.
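The interpretation guidelines listed above can be summarised as a simple lookup. This is a sketch of the verbal categories only; the function name is hypothetical, and the treatment of values falling exactly on a boundary (1/3, 1, 3, or 10) is an arbitrary assumption, since the text does not specify it.

```python
def interpret_bayes_factor(bf: float) -> str:
    """Map a Bayes Factor (alternative over null) to a verbal evidence label."""
    if bf <= 0:
        raise ValueError("A Bayes Factor must be positive.")
    if bf < 1 / 3:
        return "evidence against"
    if bf < 1:
        return "anecdotal evidence against"
    if bf <= 3:
        return "anecdotal evidence for"
    if bf <= 10:
        return "some evidence for"
    return "strong evidence for"


# Example with the value reported for the critical interaction in the
# main analysis of Experiment 1.
label_main = interpret_bayes_factor(1.7)
```

For instance, a Bayes Factor of 1.7 falls in the anecdotal range under this scheme, and a value such as 0.2 would count as evidence against the effect.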
For the LME model, we used inverse RT as the dependent variable. For the independent variables, we used centred log frequency as a continuous predictor, and predictability, contrast-coded as 0.5 (predictable) and −0.5 (unpredictable), as a binary predictor. The model also included previous RT (Baayen, 2008). Participants and items were included as random factors, and the frequency slope was allowed to vary across participants (Barr et al., 2013).
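One way to write out the model structure described in this paragraph is given below. This is a sketch under stated assumptions rather than the authors' exact specification: the symbols are introduced here for illustration, with i indexing participants and j indexing items, F the centred log frequency, P the contrast-coded predictability, and the frequency-by-predictability term included because it is the interaction tested in the Results. Details such as whether the random intercept and slope are allowed to correlate are not stated in the text.

```latex
\begin{aligned}
\mathrm{invRT}_{ij} = {} & \beta_0 + \beta_1 F_j + \beta_2 P_j
    + \beta_3 \, (F_j \times P_j) + \beta_4 \, \mathrm{prevRT}_{ij} \\
  & + u_{0i} + u_{1i} F_j + w_{0j} + \varepsilon_{ij}
\end{aligned}
```

Here u_{0i} and w_{0j} are the by-participant and by-item random intercepts, u_{1i} is the by-participant random slope for frequency, and ε_{ij} is the residual error.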
There were seven non-responses (approximately 0.1% of all data), and overall accuracy was 97.2%. The accuracy rates ranged from 93.4% for low-frequency unpredictable words to 99.3% for low-frequency predictable words. An LME model with accuracy as the dependent variable showed a main effect of predictability, β = 1.5, z = 3.9, p < .0001, reflecting higher accuracy for the predictable (99.1%) than unpredictable (95.3%) conditions. The interaction between frequency and predictability was significant, β = −1.7, z = −2.5, p = .013, indicating a facilitatory frequency slope for unpredictable (β = 1.5) but not predictable (β = −0.3) words. The main effect of frequency was not significant, p = .1. The results are broadly in line with the RT results discussed below. As accuracy was relatively high, and there was not a lot of variability between the conditions, we draw conclusions based on the RT results only.
Before conducting the RT analyses, we excluded all incorrect responses, and trials with latencies <300 ms (0.2% of the data) and >1,200 ms (0.1% of the data). This yielded an approximately normal distribution of inverse RTs. When we artificially dichotomise frequency into high (log frequency >1) and low (log frequency <1), the average RTs of the trimmed data set are 494.2 ms (SD = 97.8) and 494.1 ms (SD = 97.2), respectively, for the predictable words, and 503.6 ms (SD = 97.5) and 528.3 ms (SD = 115.1), respectively, for the unpredictable words.
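The trimming step can be sketched as follows. The dictionary keys, the function name, and in particular the exact form of the inverse transformation (here −1000/RT, which keeps the direction of effects the same as for raw RTs) are assumptions made for illustration; the text specifies only the cut-offs and that inverse RTs were analysed.

```python
def trim_trials(trials, lower_ms=300, upper_ms=1200):
    """Keep correct trials with lower_ms <= RT <= upper_ms and add inverse RT.

    Each trial is a dict with keys 'rt_ms' (float) and 'correct' (bool).
    The inverse RT is computed as -1000 / RT (an assumed transformation).
    """
    kept = []
    for trial in trials:
        if not trial["correct"]:
            continue  # incorrect responses are excluded
        if trial["rt_ms"] < lower_ms or trial["rt_ms"] > upper_ms:
            continue  # latencies outside the cut-offs are excluded
        kept.append({**trial, "inv_rt": -1000.0 / trial["rt_ms"]})
    return kept


# Hypothetical trials for illustration.
raw = [
    {"rt_ms": 500.0, "correct": True},
    {"rt_ms": 250.0, "correct": True},    # too fast
    {"rt_ms": 1300.0, "correct": True},   # too slow
    {"rt_ms": 600.0, "correct": False},   # error
]
trimmed = trim_trials(raw)
```

Only the first hypothetical trial survives trimming in this example.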
The LME showed a significant main effect of predictability, β = −0.08, t = −5.3, p < .0001, reflecting shorter RTs for predictable ("forge") than unpredictable ("ghost") words, and a main effect of frequency, β = −0.05, t = −3.5, p = .0006, indicating shorter RTs for words of higher frequencies. The interaction was also significant, β = 0.07, t = 2.3, p = .02, with a steeper frequency slope for unpredictable compared to predictable words. In a comparison of the full model against an additive one that included the main effects of predictability and frequency, the Bayes Factor provided anecdotal evidence for the presence of the interaction, BF = 1.7 (±1.2%).

Follow-up analyses
To potentially strengthen the case for the interaction, we retrieved all trial-level data for our items from the English Lexicon Project (ELP) reading-aloud database (Balota et al., 2007). Note that we included data from the ELP and not the BLP because the ELP has both lexical decision and reading-aloud data, whereas the BLP only has lexical decision. As both the ELP and our experiment employed a standardised reading-aloud procedure, we can increase the amount of evidence by collapsing the two data sets. The ELP contains trial-level reading-aloud data for 375 of the original 376 words. These include 10,342 valid and correct trials, with an average of 27.6 participants per word. We combined the data from our experiment with data of the ELP. The trimming procedure of this bigger item set was identical to that of the original data, as was the model, except that previous RT was not included, as it was unavailable in the ELP database. Dichotomising frequency, the average RTs for the four types of items were 561.9 ms (SD = 125.4) for high-frequency predictable words, 570.2 ms (SD = 125.7) for high-frequency unpredictable words, 575.3 ms (SD = 132.4) for low-frequency predictable words, and 597.6 ms (SD = 141.5) for low-frequency unpredictable words. Note that p-values are not reported for any of the follow-up analyses in this paper, as due to the multiple comparisons, the Type-I error rate increases and is no longer 5% with a cut-off of α = 0.05 (Cramer et al., 2015; Simmons, Nelson, & Simonsohn, 2011).
Note to Table 1 (beginning not recovered): [...] (New, Brysbaert, Veronis, & Pallier, 2007); orthographic N counts are retrieved from the British Lexicon Project (Keuleers et al., 2012); bigram frequency is from the MCWord database (Medler & Binder, 2005); Phonological Levenshtein Distance is retrieved from the English Lexicon Project (Balota et al., 2007).
The LME results showed the same pattern as the original data, with shorter RTs for more frequent compared to less frequent words, with a slope of β = −0.07, t = −10.0, with shorter latencies for predictable than unpredictable words, β = −0.05, t = −6.6, and a steeper frequency slope for unpredictable than predictable words by β = 0.05, t = 3.3. Importantly, the Bayes Factor now provided evidence for the presence of the interaction between frequency and predictability, BF = 9.6 (±0.6%).
In an additional post-hoc analysis, we ensured that the obtained results remain stable after taking into account potential confounds. As shown in Table 1, some of the psycholinguistic variables covaried with our manipulations. The model was identical to the one above, but we also included main effects of orthographic N and PLD20 (which differ as a function of frequency), and the ratio of letters to phonemes (which differs as a function of predictability). The adjusted means, when frequency is dichotomised, are 606.5 ms for high-frequency predictable words, 612.6 ms for high-frequency unpredictable words, 624.7 ms for low-frequency predictable words, and 646.5 ms for low-frequency unpredictable words. The results of the full model can be downloaded from the OSF folder (osf.io/hm8fw). The patterns of results did not change: The LME showed a main effect of frequency, β = −0.07, t = −10.0, predictability, β = −0.04, t = −5.8, and the interaction, β = 0.05, t = 3.8. The Bayes Factor provided evidence for the presence of the critical interaction, BF = 50.7 (±0.88%).
In sum, we found strong evidence for the predicted interaction between frequency and predictability, where the frequency effect is stronger for unpredictable than predictable words. This provides a conceptual replication of the previous experiments by Frost and colleagues (Frost, 1994; Frost et al., 1987), and evidence for the ODH. Specifically, the results suggest that unpredictability of print-to-speech correspondences impairs sublexical processing, which results in stronger lexical involvement compared to words with predictable correspondences.

Experiment 2: Complexity in French
For French, the aim was to assess whether the frequency effect for words containing complex print-to-speech correspondences is stronger than for words where the pronunciation can be deciphered based on simple single-letter correspondences. This would provide further support for the ODH, and insights about the orthographic characteristics that may lead to a script being classified as deep or shallow. A lack of an interaction between frequency and complexity would suggest that complex correspondences are processed qualitatively differently to unpredictable correspondences.

Participants and procedure
The participants were 24 students from a university in France. All were native speakers of French and received course credit in exchange for their participation. The procedure was identical to Experiment 1.

Items
We retrieved words and their corresponding information from the Lexique 2 database (New, Pallier, Brysbaert, & Ferrand, 2004) and the French Lexicon Project (FLP; Ferrand et al., 2010). For frequency, we relied on subtitle counts (Brysbaert et al., 2011; Brysbaert & New, 2009; New et al., 2007). We again removed words with log frequencies of <0 or >2. To classify words as complex or simple, we used the ratio of letters to phonemes in each word: The presence of multi-letter correspondences means that multiple letters correspond to a single phoneme, thus a complex word has a letter-to-phoneme ratio >1. Simple words were those with a letter-to-phoneme ratio of one (e.g., "garnir"), and words with a ratio greater than one were considered complex (e.g., "gâteau"). In the database, this procedure classified 280 words (8.9%) as "simple," and 2,852 (91.1%) as "complex." We selected 384 words, half with complex correspondences ("gâteau") and half with simple correspondences ("garnir"). The words were chosen to vary in frequency, where half the items had frequency counts lower than 1, and the other half higher than 1. In addition, the items were chosen such that they did not differ, across conditions, on average grapheme consistency (Lété, Sprenger-Charolles, & Colé, 2004), suggesting that there were no differences in the degree of unpredictability. Overall, the French orthography has a high degree of predictability once complex rules are taken into account (Schmalz et al., 2015; Ziegler, Perry, & Coltheart, 2003). However, there are some words with ambiguous pronunciations (e.g., "femme," where the second letter is pronounced as /a/ rather than the default /ε/). While, to our knowledge, there is no quantification method of regularity that can be applied to polysyllabic words in French, the Manulex database contains average grapheme consistency ratings (Lété et al., 2004). We use these as a measure of unpredictability, as words with unpredictable pronunciations necessarily have graphemes that can be pronounced in multiple ways.
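The letter-to-phoneme classification described above can be illustrated with a minimal Python sketch. It is not the authors' script (their R scripts are in the OSF folder). The function names, the phoneme transcriptions, and the "unclassified" label for ratios below one (a case the text does not discuss) are assumptions made for illustration only.

```python
import unicodedata


def letter_phoneme_ratio(spelling, phonemes):
    """Return the number of letters divided by the number of phonemes."""
    # Normalise so that an accented letter such as "â" counts as one letter.
    letters = unicodedata.normalize("NFC", spelling)
    return len(letters) / len(phonemes)


def classify(spelling, phonemes):
    """Label a word 'simple' (ratio of one) or 'complex' (ratio above one)."""
    ratio = letter_phoneme_ratio(spelling, phonemes)
    if ratio == 1:
        return "simple"
    if ratio > 1:
        return "complex"
    # Ratios below one are not covered by the description in the text.
    return "unclassified"


# Hypothetical transcriptions, written as lists of phoneme symbols.
items = {
    "garnir": ["g", "a", "ʁ", "n", "i", "ʁ"],  # 6 letters, 6 phonemes
    "gâteau": ["g", "a", "t", "o"],            # 6 letters, 4 phonemes
}
labels = {word: classify(word, phonemes) for word, phonemes in items.items()}
```

Under these assumed transcriptions, "garnir" has a ratio of one and "gâteau" a ratio of 1.5, matching the example items given for the two conditions.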
All items had a lexical decision accuracy, according to the FLP, of >80%. The descriptive statistics are listed in Table 2. For the full item set with individual word characteristics, as well as the raw data and R scripts, see here: osf.io/hm8fw. Again, in the results section, we will follow up with covariate analyses to ensure that the results cannot be explained by the variables that differ as a function of the manipulation.

Results and discussion
The data were scored with CheckVocal as correct, incorrect, or no response, and the RTs were adjusted when the voice-key had been triggered prematurely or late (again, adjusting for potential biases associated with first phonemes). As for the English analyses, we used inverse RTs as the dependent variable, and continuous centralised frequency, binary contrast-coded complexity (−0.5 = simple), and previous RT as independent variables. Participants and items were included as random factors, and the effect of frequency was allowed to vary across participants. As the items were matched on the number of letters (as is common in studies on multi-letter rules; see Rastle & Coltheart, 1998; Rey et al., 1998), the simple ("garnir") condition had, by definition, more phonemes than the complex ("gâteau") condition. This also resulted in a lower number of syllables for complex (average = 2.0) compared to simple (average = 2.8) words. It was therefore decided, a priori, that the number of syllables should be included in the model, to act as a covariate.
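The variable coding described above can be sketched as follows. This is a hedged illustration, not the authors' analysis code. The field names are hypothetical. The −1000/RT form of the inverse transform is an assumption (the text says only "inverse RTs"), chosen because it keeps slower responses numerically larger. Centring over trials and leaving the first trial without a previous RT are likewise assumptions.

```python
from statistics import mean


def code_trials(trials):
    """Add model-ready predictors to trial dicts given in presentation order."""
    mean_log_freq = mean(t["log_freq"] for t in trials)
    coded = []
    previous_rt = None  # assumption: the first trial has no preceding RT
    for t in trials:
        coded.append({
            # Assumed transform: -1000 / RT (the text only says "inverse RTs").
            "inv_rt": -1000.0 / t["rt_ms"],
            # Centred log frequency: zero corresponds to the mean log frequency.
            "freq_c": t["log_freq"] - mean_log_freq,
            # Contrast coding: -0.5 = simple (as in the text), +0.5 = complex.
            "complexity": 0.5 if t["is_complex"] else -0.5,
            # Covariate included a priori.
            "n_syllables": t["n_syllables"],
            "prev_rt": previous_rt,
        })
        previous_rt = t["rt_ms"]
    return coded


# Hypothetical trials, for illustration only.
trials = [
    {"rt_ms": 500.0, "log_freq": 1.5, "is_complex": True, "n_syllables": 2},
    {"rt_ms": 625.0, "log_freq": 0.5, "is_complex": False, "n_syllables": 3},
]
coded = code_trials(trials)
```

With contrast coding at ±0.5, the complexity coefficient corresponds to the difference between the complex and simple conditions, and centring frequency means that this difference is estimated at the mean log frequency.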
Overall, there were no non-responses, and the accuracy rate was 97.5%. Accuracy was very high and evenly distributed across conditions (ranging from 95.7% for low-frequency simple words to 98.7% for high-frequency complex words). An LME on the accuracy rates showed a main effect of frequency, β = 0.5, z = 4.0, p < .0001, reflecting higher accuracy for high- than low-frequency words. Neither the effect of complexity nor the complexity-by-frequency interaction reached significance, p > .1. This is likely to reflect the overall high accuracy rates and lack of variability across conditions. For this reason, as in Experiment 1, we draw conclusions from the RT data.
For the RT analyses, we removed one data point with RT <300 ms, which yielded an approximately normal distribution of inverse RTs. When artificially dichotomising frequency (high: log frequency >1; low: log frequency <1), the average RTs are 608.5 ms (SD = 146.9) and 620.4 ms (SD = 156.4), respectively, for simple words, and 565.9 ms (SD = 114.5) and 603.5 ms (SD = 140.5), respectively, for complex words. Adjusting these means for the number of syllables yields, for simple words, 600.3 and 608.0 ms, for high- and low-frequency words, respectively, and for complex words, 578.4 and 616.0 ms, respectively.
The latency analyses showed a main effect of frequency, β = −0.05, t = −5.3, p < .0001. The main effect of the number of syllables, which was included as a covariate, was also significant, β = 0.07, t = 9.8, p < .0001. The main effect of complexity was not significant, β = −0.01, t = −0.9, p = .4, but the critical interaction between frequency and complexity was, indicating a steeper frequency slope for complex than simple words, β = −0.06, t = −3.5, p = .0005. The Bayes Factor provided strong evidence for the presence of this interaction, BF = 37.9 (±1.1%).

Follow-up analyses
An unexpected finding in Experiment 2 is the absence of a significant main effect of complexity. As the explanation of the complexity-by-frequency interaction, in the Orthographic Depth Hypothesis framework, is based on the assumption that complex words are more difficult to process by the sublexical route than simple words, this finding might compromise our conclusion. A possible explanation is the inclusion of relatively high-frequency words in our item set. Previous research has shown that the complexity effect is diminished for high- compared to low-frequency words (Rey et al., 1998). LME provides the slope estimates at the point where the independent variables equal zero. As we used centred log frequency as an independent variable, it is possible that the slope estimate of the complexity effect is based on a point where the frequency is too high to show a complexity effect. To test this possibility, we conducted follow-up tests of the effect of complexity separately for low-frequency (log frequency <1) and high-frequency (log frequency >1) words. Indeed, the data showed slower RTs for complex than simple items for low-frequency words, β = 0.03, t = 1.6, and faster RTs for complex than simple items for high-frequency words, β = −0.05, t = −2.8. The Bayes Factors provided equivocal evidence for the expected inhibitory complexity effect for low-frequency words, BF = 0.4 (±1.1%), and weak evidence for the unexpected facilitatory complexity effect for high-frequency words, BF = 4.9 (±0.9%).
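The dependence of the complexity estimate on the centring point can be made explicit with a short worked derivation. This is a simplified sketch of the fixed-effects part of the model only: random effects and the covariates are omitted, and the symbols F (centred log frequency) and C (contrast-coded complexity) are introduced here for illustration. The frequency offsets of ±0.5 are hypothetical values, not quantities reported in the study.

```latex
% Simplified fixed-effects structure (covariates and random effects omitted)
\[
  \widehat{y} = \beta_0 + \beta_F F + \beta_C C + \beta_{FC}\,(F \times C)
\]
% The effect of complexity depends on frequency:
\[
  \frac{\partial \widehat{y}}{\partial C} = \beta_C + \beta_{FC}\,F
\]
% so the reported main effect \beta_C is the complexity effect at F = 0,
% i.e., at the mean log frequency. With the reported estimates
% \beta_C = -0.01 and \beta_{FC} = -0.06, and hypothetical offsets of
% half a log unit below and above the mean:
\[
  F = -0.5:\; -0.01 + (-0.06)(-0.5) = 0.02,
  \qquad
  F = +0.5:\; -0.01 + (-0.06)(0.5) = -0.04
\]
```

On this illustration, the near-zero main effect is compatible with an inhibitory effect below the mean frequency and a facilitatory effect above it, which is the direction of the pattern found in the separate follow-up tests.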
As the facilitatory effect (faster RTs for complex than simple words) for high-frequency words goes against both the existing literature and existing models of reading, we considered possible confounds that could be driving this counter-intuitive pattern. In matching items with complex correspondences against items with simple correspondences, it is customary to match for the number of letters, not phonemes (Rastle & Coltheart, 1998; Rey et al., 1998). This is a conservative approach: As complex words contain more letters than phonemes, the complex condition necessarily has fewer phonemes than the simple condition. This could be counteracting the complexity effect in our analysis. Indeed, when adding the number of phonemes as an additional predictor, we still get the predicted inhibitory effect (numerically) for low-frequency words, β = 0.03, t = 1.6 (adjusted means: 579.9 and 598.5 ms for complex and simple words, respectively), and the unexpected numerically facilitatory effect for high-frequency words, β = −0.04, t = −2.3 (adjusted means: 622.2 and 603.0 ms, for complex and simple words, respectively), but now the Bayes Factor provides weak evidence for the expected inhibitory effect for low-frequency words, BF = 3.9 (±0.8%), and equivocal evidence for the unexpected facilitatory effect for high-frequency words, BF = 1.6 (±0.8%). This means that the current data do not give us any conclusive evidence about whether or not there is a complexity effect for high-frequency words after taking into account the number of syllables and phonemes as covariates, but suggest that there might be the expected inhibitory effect for low-frequency words. Note that in a post-hoc analysis of the full French data set which includes the number of phonemes as well as the number of syllables as covariates, we continue to get evidence for a frequency-by-complexity interaction, β = −0.07, t = −3.7, BF = 7.3 (±0.8%), suggesting that the key result is robust.
As with the English analyses, we performed one final post-hoc test to ensure that none of the potential covariates from Table 2 compromise our results. We repeated the analyses while including the main effect of OLD20 and PLD20 (which co-varied with complexity), and the main effect of bigram frequency and its interactions with frequency and complexity. For bigram frequency, missing values were replaced with the global mean. As in the previous model, we also included the number of phonemes, number of syllables, and the two critical main effects of frequency and complexity, and their interaction. Again, the pattern of results remained stable, with a main effect of frequency, β = −0.05, t = −5.1, an unexpected facilitatory effect of complexity, β = −0.04, t = −2.4, and the critical interaction, β = −0.07, t = −3.5. The adjusted mean RTs are 574.2 and 604.3 ms for high-frequency complex and simple words, respectively, and 612.0 and 612.6 ms for low-frequency complex and simple words, respectively. The evidence for the critical interaction between complexity and frequency was BF = 13.6 (±0.86%).
Table 2 note: … (New et al., 2004); grapheme consistency and bigram frequency from Manulex (Lété et al., 2004). Note that Manulex has the bigram frequency for only 314 out of the 384 words; missing cells were excluded for the analysis which included bigram frequency as the dependent variable.
In sum, we found evidence for the critical interaction, showing that the frequency effect is stronger for words with complex compared to words with simple correspondences. This suggests that, like unpredictability, complexity acts as a source of orthographic depth by impairing the sublexical route. This leads to a relative increase in the degree to which the lexical route contributes to the final output.

General discussion
Although orthographic depth has been studied extensively throughout the past decades, it is unclear whether the complexity and the unpredictability of the sublexical correspondences affect skilled reading processes in the same way, or whether these two constructs have a differential effect on the cognitive processes (Schmalz et al., 2015). The Orthographic Depth Hypothesis proposes that in deep orthographies, the lexical route becomes relatively more important, because the sublexical information is less efficient in retrieving a correct pronunciation (Katz & Frost, 1992). We hypothesised that this may not be the case for orthographies with complex but predictable sublexical correspondences, such as French, because here the sublexical information is, in principle, sufficient to derive a correct pronunciation. However, increased lexical processing may be observed if complex correspondences slow down the sublexical route, thus allowing more time for the lexical route to retrieve the relevant phonological information. We found support for the latter possibility inasmuch as the frequency effect (a marker of lexical processing) was greater for words with complex sublexical correspondences than for those with simple correspondences.

Predictability within models of reading
Experiment 1 indicates that, in a within-experiment manipulation, the frequency effect is stronger for English unpredictable ("ghost") than predictable ("forge") words. In line with the ODH (Katz & Frost, 1992) and with previous research (Frost et al., 1987), this suggests that unpredictability increases the relative reliance on lexical processing, as the sublexical processing cannot be resolved without lexical knowledge.
Note that, within a rule-based model of reading, the theoretical explanation of a predictability-by-frequency interaction is slightly different from what is likely to happen in the case of complexity, even though they result in an identical behavioural pattern (Coltheart et al., 2001). If the sublexical route uses a set of print-to-speech conversion rules, the sublexical output for words with irregular correspondences, which do not comply with the rules, will be an incorrect response (e.g., in English /dept/ instead of /dεt/ for the written word debt). A conflict would then take place in the phonological buffer, when combining the output of the lexical and the sublexical routes. Such a conflict may be resolved by postponing the initiation of the verbal response until sufficient activation from the lexical route has accumulated to trump the incorrect phonemic activation from the sublexical route. This would explain the main effect of unpredictability, because the pronunciation of unpredictable or irregular words is delayed, due to the conflict between the two routes. Furthermore, this conflict does not occur for words with predictable or regular correspondences, so the pronunciation does not need to be delayed until the lexical route trumps the activation of the sublexical route. As a result, relatively stronger lexical involvement is needed to resolve the pronunciation of unpredictable words. For predictable words, the sublexical route does not need to be suppressed for a correct pronunciation.
Within a connectionist framework (Perry et al., 2007; Plaut et al., 1996), the sublexical route would be predicted to operate more slowly for unpredictable compared to predictable words. Unpredictable words, by definition, contain inconsistent correspondences (e.g., in the word "ghost," the grapheme o is inconsistent, as it can also be pronounced as in "lost"). It is possible that phonemic activation associated with inconsistent graphemes is slower than the activation of consistent graphemes (e.g., sh → /ʃ/ would be activated faster than th → /θ/). In this case, unpredictability (or, more specifically, inconsistency) would lead to an overall slowdown of the sublexical route, thus giving more time for the lexical or semantic information to contribute to the verbal output. Thus, in contrast to rule-based models, connectionist models would suggest that the mechanism responsible for the predictability-by-frequency interaction is very similar to the mechanism underlying the complexity-by-frequency interaction.

Complexity within models of reading
Experiment 2 examined whether the frequency effect would be stronger, for French, in words containing complex (multi-letter) correspondences ("gâteau"), compared to words with simple correspondences only ("garnir"). Again, there was evidence for an interaction, suggesting that complexity, like unpredictability, increases the relative importance of lexical processing.
As previous studies on the ODH have used cross-linguistic comparisons of pairs of orthographic systems that differed in both complexity and unpredictability (e.g., Serbo-Croatian/English, German/English), our study is the first to suggest that complexity affects the ratio of lexical-to-sublexical processing. Presumably, this is due to a slowdown of the sublexical decoding process, which is caused by the application of complex multi-letter rules. More specifically, complex rules could lead to a conflict between the activation of the phoneme corresponding to a multi-letter grapheme and the phonemes corresponding to its underlying individual letters, as proposed by the dual-route cascaded (DRC) model of reading aloud (Rastle & Coltheart, 1998). In a word like "garnir," each letter maps onto its default phoneme. A simple word would lead to faster activation of the phonemes in the output buffer from the sublexical route, thus reducing the relative contribution of the lexical route in achieving the final pronunciation. For a word with complex correspondences, like "gâteau," the activation of the phonemes of the individual letters, e (→ /ε/), a (→ /a/), and u (→ /y/), would cause a conflict within the sublexical route, as the three letters need to be combined into a single grapheme and mapped onto the correct phoneme /o/. This would slow down the output of the sublexical route, such that the lexical route has a larger contribution to the final output.
We did not find a main effect of complexity, thus failing to replicate the results of Rastle and Coltheart (1998) and Rey et al. (1998). Including articulatory variables, namely the number of syllables and the number of phonemes, as covariates, provided a more coherent picture. Here, there was some evidence for an effect of complexity in the low-frequency condition, though this emerged only in the covariate analysis that included both the number of syllables and the number of phonemes. Thus, it seems that articulatory processes counteract the effect of complexity. Articulatory processes affect reading-aloud latencies at a post-lexical stage (Cholin & Levelt, 2009; Cholin, Schiller, & Levelt, 2004), which results in facilitation of the verbal response, driven by a smaller number of phonemes and syllables for all types of words, regardless of frequency. The effect of complexity counteracts this facilitatory articulation-level effect especially for low-frequency words, as complexity operates on the sublexical level.
Notwithstanding the lack of a main effect of complexity, Experiment 2 provided strong evidence for an interaction between complexity and frequency in French. This suggests that sublexical information plays a role in determining the net ratio of lexical-to-sublexical processing, even if the output is driven to a great extent by the lexical route. While the process of reading aloud appears to happen at the same rate for complex as for simple words, there is relatively more contribution from the lexical than the sublexical route.
Finally, it is worth expanding on our central assumption that the frequency effect is a marker of lexical processing. While it is generally assumed that frequency reflects some kind of threshold of the activation of entries in a mental orthographic lexicon (e.g., Coltheart et al., 2001; Taft, 1991), there are alternative views of how the frequency effect works. First, it is possible that frequency effects reflect other constructs that are strongly correlated, such as imageability (Strain, Patterson, & Seidenberg, 1995), age-of-acquisition (Zevin & Seidenberg, 2002), or contextual diversity (Adelman, Brown, & Quesada, 2006). We did not match for these variables, as it would have substantially limited the choice of items. The norms for these variables are not available for the majority of our items, so we also cannot include them as post-hoc covariates in follow-up analyses. This does not present a problem for our conclusions, however, as these variables reflect lexical-semantic activation and thus measure processes which occur broadly within the lexical route. The DRC (Coltheart et al., 2001) and CDP+ (Perry et al., 2007; Perry, Ziegler, & Zorzi, 2010) models make a distinction between an orthographic lexicon and a purely semantic route. The semantic route can be reached either by activation from the orthographic lexicon or the phonological lexicon, and in turn sends activation to the non-semantic lexical components. These models would therefore predict a close link between non-semantic and semantic lexical processes. Triangle models do not make a distinction between a semantic and a non-semantic route, as there is no purely orthographic representation of whole words, thus there is an even closer link between semantic marker effects and the lexical route (Plaut et al., 1996; Seidenberg & McClelland, 1989).
As a second alternative explanation of the frequency effect, it is possible that it reflects the frequency not of the whole word, but of the letters and letter clusters which are contained in the word. Thus, in a connectionist model, it is possible to show word frequency effects in the absence of an orthographic lexicon, because frequent letter clusters and their pronunciations are easier to learn (Plaut et al., 1996). This would imply that the frequency effect is a measure of sublexical processing. If a sublexical mechanism, reflecting the frequency of letter clusters, drives the interactions with complexity or predictability, one would expect the interaction to disappear once bigram frequency is taken into account. However, in Experiment 1 we found the frequency-by-predictability interaction while the manipulations did not co-vary with bigram frequency, and in Experiment 2, the frequency-by-complexity interaction remained robust after taking into account bigram frequency as a covariate. Thus, we can exclude the possibility that the frequency effect in our study reflects a sublexical process.

Conclusion
The current study is the first, to our knowledge, to empirically address the hypothesis that orthographic depth consists of various components that differentially affect skilled reading processes. The experiments reported here suggest that both complexity and unpredictability independently increase relative reliance on the lexical route. This provides support for the ODH, and the cognitive mechanism that Katz and Frost (1992) proposed as driving the cross-linguistic differences associated with orthographic depth: complexity and unpredictability both act to impair the efficiency of the sublexical route, which allows for a relatively greater influence of the lexical route in retrieving the word's pronunciation.