Text Integration and Speaking Proficiency: Linguistic, Individual Differences, and Strategy Use Considerations

ABSTRACT The current study examined the effects of text-based relational (i.e., cohesion), proposition-specific (i.e., lexical), and syntactic features in a source text on subsequent integration of the source text in spoken responses. It further investigated the effects of word integration on human ratings of speaking performance while taking into consideration individual characteristics of test-takers (e.g., listening proficiency, age, grade point average, working memory capacity) and test-taker strategy use (e.g., note-taking strategies). A total of 263 test-takers' speaking samples were collected using TOEFL-iBT research forms of integrated listen/speak items. These samples, along with individual characteristics measures and note-taking data, were collected over two days. The spoken samples were transcribed and analyzed in terms of textual integration at lexical, cohesion, and syntactic levels. The linguistic features along with the individual characteristics and note-taking data were used to predict human scores of speaking proficiency. The results indicate that the linguistic properties of the source text are almost perfect predictors of which words test-takers will integrate into their response. Moreover, text integration was found to be an important factor in human ratings of speaking proficiency, one that goes beyond individual test-takers' characteristics and note-taking strategies.


Introduction
In academic contexts, text integration skills (i.e., integrating material from reading or listening input into speaking or writing tasks) are presumed to be critical elements of academic success for second language (L2) learners of English. This basic notion is premised on the idea that academic settings require students to both read academic texts and listen to academic lectures while integrating information from both sources into oral and written reports as well as class discussions (Douglas, 1997). Integrated writing and speaking tasks that combine these skills best represent the demands placed on students in academic contexts, and such tasks have become common in a number of standardized testing situations designed to measure students' readiness for academic contexts (Cumming, Grant, Mulcahy-Ernt, & Powers, 2005; Cumming et al., 2006).
Students' success at recalling and integrating previous information can be based on diverse learner characteristics (e.g., working memory), strategy use (e.g., note-taking strategies), and on linguistic properties of a text (e.g., word repetition or word frequency). Working memory capacity (WMC) denotes the ability to temporarily store and manipulate information simultaneously (Baddeley, 2003), and it is an important component of recall that might impact the quality and efficiency of real-time language processing (Miyake & Friedman, 1998). Previous studies have also shown that note-taking strategies can positively affect lecture summarization (Carrell, 2007). In terms of the linguistic properties of text, two types of information that affect the efficiency of encoding of discourse and its subsequent recall have been noted in previous research: proposition-specific information and relational information (McDaniel, Einstein, Dunay, & Cobb, 1986). Proposition-specific information refers to lexical items (i.e., words) that are found within a proposition (e.g., a sentence, clause, or idea) and the semantic relationships between these words. Relational information pertains to organizational elements within a text and how propositions are embedded (i.e., text cohesion). Both proposition-specific and relational information are important factors in L2 processing because L2 learners often have difficulty identifying relationships among ideas (i.e., relational information) and detecting key ideas (i.e., proposition-specific information; Powers, 1986).
The purpose of the current study is to examine how learner characteristics (e.g., working memory, language proficiency, and gender) and the linguistic properties of listening source texts (e.g., the cohesive and lexical properties of source texts) influence source text integration in a standardized language assessment focused on integrated speaking tasks. Further, we assess associations between learner characteristics and linguistic properties in the source texts with expert ratings of speaking proficiency.

Test takers' individual characteristics
In the current study, we examined a variety of test takers' individual characteristics including proficiency level as measured by the Test of English as a Foreign Language (TOEFL) Institutional Testing Program (ITP), first language, gender, and working memory. Language proficiency has been one of the most widely addressed individual characteristics, and researchers often investigate proficiency as a mediating variable of test performance. For instance, Appel and Wood (2016) reported that high-level learners were less dependent on reading sources during integrated writing tasks. Barkaoui found that overall English language proficiency significantly contributed to TOEFL iBT writing scores (2013) and that participants' writing performance was mediated by task types but not proficiency (2015). Lastly, Hill and Liu (2012) reported that language proficiency interacted with background knowledge in TOEFL iBT reading tasks. Overall, previous L2 assessment research has suggested that learner proficiency along with other variables such as background knowledge and task types may be associated with test takers' language performance.
Gender and age are other individual characteristics of test takers and L2 learners that have been examined. As an example, Breland, Lee, Najarian, and Muraki (2004) examined gender effects on TOEFL CBT writing and found that gender was a significant predictor of writing success, with females tending to obtain higher scores than males. Multiple studies have demonstrated that younger learners develop proficiency in an L2 faster than older learners (DeKeyser, 2000; McDonald, 2000).
Another individual characteristic of interest in test takers is WMC, which refers to "the ability to maintain information in an active and readily accessible state, while concurrently and selectively processing new information" (Carrell, 2007, p. 3). Over the last two decades, WMC has been increasingly investigated, and findings suggest it is an important cognitive factor that affects L2 learning and processing (Wen, Mota, & McNeil, 2015). For instance, Linck, Osthus, Koeth, and Bunting (2013) conducted a meta-analysis that included 79 studies and 3,707 participants that focused on associations between working memory and a range of learning outcomes such as L2 comprehension. The results suggested that working memory is an important component of L2 processing and proficiency outcomes. In contrast, Kormos and Trebits (2011) reported a more limited role for WMC in the oral production of L2 learners such that WMC might only affect L2 syntactic production. Recent studies also do not provide strong evidence for a relationship between WMC and L2 listening comprehension even when using multiple WMC measures (Andringa, Olsthoorn, van Beuningen, Schoonen, & Hulstijn, 2012; Vandergrift & Baker, 2015). Research has also shown relationships between WMC, L2 performance, and L2 proficiency level. For instance, Kormos and Sáfár (2008) reported that phonological short-term memory capacity was mediated by proficiency level. Overall, although WMC has been suggested as an important individual characteristic, its role might not be consistent across different L2 tasks that involve different types of processing.
The last individual characteristic we consider is L2 learners' note-taking strategies. Early L2 research suggests an association between students' note-taking strategies and listening comprehension performance as measured by multiple choice tests (Dunkel, 1988). For instance, Dunkel (1988) reported that the total number of words and information units in test-takers' notes were significantly associated with test performance. Additionally, Cushing (1993) indicated that test-takers' academic status and listening comprehension proficiency positively affected the quality and content of notes. More recently, Carrell (2007) found that note-taking and test performance are moderately related. In sum, previous research suggests that students' note-taking strategies vary and that the quality and quantity of note-taking might be associated with language performance.

Text properties and recall
In the current study, the linguistic properties of a text are operationalized in terms of two types of information (i.e., relational information and proposition-specific information). Relational aspects in texts are most commonly related to text cohesion, while proposition-specific information is related to lexical elements. A variety of linguistic features such as connectives, anaphoric references, and word overlap have been used to measure text cohesion (Crossley, Kyle, & McNamara, 2017). These cohesion features provide readers with explicit text markers meant to signal connections between ideas in a text that can help develop a coherent model of the text. However, cohesion is different from text coherence. Coherence refers to the understanding that the reader extracts from the text and, while it can often develop with the help of cohesion features (e.g., connectives and word overlap), it can also develop because of prior knowledge and/or reading skill (McNamara, Kintsch, Songer, & Kintsch, 1996).
While many text features are related to cohesion, connectives such as and, but, or also are probably the most common cohesive devices reported in linguistic research. Connectives can help create cohesive links between ideas and clauses at the sentence level (Crismore, Markkanen, & Steffensen, 1993; Longo, 1994). These links can help develop greater text organization (van de Kopple, 1985) and thus promote increased text comprehension. However, there is some indication that connectives are not linked to text coherence, especially for advanced readers (Crossley & McNamara, 2010, 2011). Another common cohesive device that is used to link sentences is lexical overlap (i.e., overlap between words; Halliday & Hasan, 1976). Previous research has shown that lexical overlap can improve text readability and text processing (Crossley, Greenfield, & McNamara, 2008; Rashotte & Torgesen, 1985). However, similar to the use of connectives, lexical overlap at the sentence level has not been shown to be linked to text coherence (Crossley & McNamara, 2010, 2011). As compared to links between sentence-level text segments (known as local cohesion), global cohesion devices that link larger segments of text together (e.g., at the paragraph level) have shown links with text coherence. These cohesive devices include lexical overlap between paragraphs (Crossley et al., 2017; Crossley & McNamara, 2011; Foltz, 2007) and causal relations among text segments (Graesser, McNamara, Louwerse, & Cai, 2004).
Unlike relational information, proposition-specific features refer to lexical elements within propositions and how words may be easier to recall because of their lexical properties. For instance, research has shown that concrete words have advantages in recall and comprehension tasks as compared to abstract words (Gee, Nelson, & Krawczyk, 1999; Paivio, 1991). Other lexical properties that influence recall include word imageability (Paivio, 1968), word polysemy (i.e., the number of senses per word; Davies & Widdowson, 1974), and word associations (Nelson, McEvoy, & Schreiber, 1990). Word recall can also be influenced by word familiarity and frequency. Word familiarity has demonstrated strong effects on word identification and recall (Paivio, 1991), although it is not as strong a predictor as word imageability (Boles, 1983; Paivio & O'Neill, 1970). High frequency words are named more rapidly (Balota, Cortese, Sergent-Marshall, Spieler, & Yap, 2004) and recognized more quickly (Kirsner, 1994) than lower frequency words.

Text integration
To be successful, language users have to integrate four language skills (i.e., speaking, listening, writing, and reading) in real-world contexts. As a result, integrating language skills is an important pedagogical component in the L2 classroom. Teaching learners how to integrate language skills can help students interact more naturally in an authentic environment (Oxford, 2001) by requiring students to receive, transmit, and demonstrate their knowledge as well as organize and regulate that knowledge for communicative purposes (Butler, Eignor, Jones, McNamara, & Suomi, 2000). From a testing perspective, integrating language skills is simplified by asking test-takers to discuss and include key propositions and terms found in listening and/or reading materials in their spoken or written responses. Standardized tests such as the TOEFL include integrated tasks because they represent an important authentic academic skill that affords test-takers the opportunity to manipulate and control language data that may not rely on their prior knowledge (Hamp-Lyons & Kroll, 1996; Wallace, 1997). Integrated tasks allow test-takers to produce contextually appropriate language (Hamp-Lyons & Kroll, 1996), identify and extract relevant information from the source text(s), and synthesize and organize this information into their responses (Feak & Dobson, 1996). In short, integrated tasks encourage test-takers to produce more authentic language (Plakans & Gebril, 2012).
To date, studies examining text integration have focused mainly on integrated writing tasks, which require test takers to write using source texts. These studies have generally investigated the differences between integrated and independent writing in terms of linguistic features or have examined how linguistic features are predictive of human ratings of integrated writing. For instance, Guo, Crossley, and McNamara (2013) found that integrated essays, as compared to independent essays, focused more on organizational cues, used a more detached style of informational writing, and contained more context-independent lexical items. Cumming et al. (2005, 2006) reported that higher-rated integrated essays generally contained more words, more words per T-unit, and a greater diversity of words.
Few studies have focused on text integration in speaking tasks. Barkaoui, Brooks, Swain, and Lapkin (2012) investigated the strategic behaviors test-takers used during integrated speaking tasks. However, they failed to find clear relationships between strategy use and integrated speaking scores. A more recent study by Crossley, Clevinger, and Kim (2014) examined the effects of the linguistic properties of source material on recall and human ratings of speaking proficiency in a small corpus of TOEFL speaking responses. Their findings demonstrated that the relational and propositional properties of words in the source texts were significant predictors of text integration. Specifically, they found that the average incidence of word occurrence in the source text, the frequency of integrated words in the source text (as measured by an external reference corpus), and the integration of words found in positive connective clauses in the source text predicted whether a word was integrated into a test-taker response or not with over 98% accuracy. They also found that the incidence of integrated words from the source text predicted 51% of score variance in speaking proficiency ratings.

Current study
The findings reported by Crossley et al. (2014) indicated that linguistic properties in the source texts could strongly influence text integration in test-taker responses. Because the human ratings of integrated speaking proficiency appeared to be influenced by different levels of text integration, Crossley et al. concluded that the relational and proposition-specific elements of a text should be controlled during test development. For instance, if a source text was low in relational and proposition-specific elements, it might lead to less information recall, which could influence human judgments of quality. However, Crossley et al. (2014) had several limitations. First, the study was a pilot study that focused on a small number of test-taker responses (N = 60). In addition, the study did not take into consideration learner characteristics such as WMC, language proficiency, gender, and age. Furthermore, although integrated TOEFL speaking tasks allow students to take notes, students' note-taking strategies were not examined. To date, the extent to which test takers' individual characteristics mediate such relationships has not been systematically examined.
In the current study, we conduct a partial replication of Crossley et al. (2014) by examining whether the relational (i.e., cohesive) and proposition-specific (i.e., lexical) properties of words in source texts found in the integrated speaking section of the TOEFL-iBT are predictive of their integration into a spoken response within a relatively large test-taker population. However, unlike Crossley et al. (2014), we assess whether a number of individual differences (e.g., working memory, gender, age, note-taking strategies, and language proficiency) and the lexical and cohesion properties of integrated words are predictive of speaking response quality while controlling for random factors such as participants and task. We focused on TOEFL integrated listen/speak responses referencing academic genres as found in the TOEFL-iBT. The listen/speak integrated tasks ask test-takers to first listen to a spoken source text, such as an academic lecture or a conversation in an academic context. The test-taker then provides a spoken response to a question based on the listening prompts, and their answer is recorded for later assessment. These answers generally include relationships between the examples in the source text and also the task topic. Expert raters then score these speech samples using a standardized rubric that assesses delivery, language use, and topic development.
The current study is guided by the following three research questions (RQs): (1) Do the relational and propositional properties of words in source texts predict their rate of integration into spoken responses? (2) Which individual characteristics of test-takers are predictive of human ratings of speaking quality? (3) Can relational and propositional properties in spoken responses along with individual characteristics predict human ratings of speaking proficiency?

Participants
The study included 280 participants who were enrolled in Intensive English Programs (IEPs) in the Atlanta, Georgia area at the time of data collection. Participants were recruited from intermediate and advanced English classes to ensure they had appropriate language skills to take the integrated listen/speak section of TOEFL-iBT. The participants spoke a number of different first languages. The first languages most strongly represented in the data were Arabic (22%), Portuguese (22%), Spanish (18%), and Chinese (10%). In terms of their gender distribution, 47% of the participants were male and 53% were female. The average age of the participants was 24 years. Of the 280 participants, full data were retrievable for only 263. Four participants were missing working memory scores because of technical problems. Six participants were missing institutional TOEFL scores because they failed to take the tests. Another six participants were missing speaking scores either because of technical difficulties or because the participants did not complete the question. One participant did not fill out the demographic survey.

Background survey
A background survey was created to collect the following information: age, gender, highest educational degree, other foreign language learning experience, time spent in the US, time spent studying English, grade point average (GPA) in the IEP, and previous TOEFL scores. The survey was conducted online using Qualtrics.

Working memory tests
In the current study, complex WMC was measured using two different working memory tests which were administered using E-Prime 2.0: an aural running span test and a listening span test. Because the current study used the TOEFL integrated speaking tests, which used listening prompts, the listening span test was developed based on the original reading span test (Daneman & Carpenter, 1980; Kim, Payant, & Pearson, 2015). The listening span test was similar to that used in previous SLA studies (Mackey, Adams, Stafford, & Winke, 2010; Mackey & Sachs, 2012). The test consisted of 72 sentences with the sequences ranging from three to six spans, and the order of each sequence was randomly presented. For each sentence, participants were asked to judge plausibility (i.e., whether its content is possible in the real world) by pressing either "yes" or "no" on the computer keypad. After they answered the plausibility question, they heard a letter (e.g., "P"), and at the end of each span, they were asked to recall all of the letters they heard in the correct order. The listening span test was piloted with 10 native speakers of English and 3 non-native speakers of English in order to verify the accuracy of the expected judgments. We scored the listening span test using partial-credit scoring rather than all-or-nothing scoring following Conway et al. (2005). One point was given for each correctly recalled letter, and, thus, the possible total score was 72.
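The difference between the two scoring schemes can be illustrated with a minimal sketch. It assumes that a letter counts as correctly recalled only when it appears in its original serial position; the function names and the example trials are hypothetical and are not taken from the study materials.

```python
def partial_credit(presented, recalled):
    """One point per letter recalled in its presented position (assumed rule)."""
    return sum(1 for p, r in zip(presented, recalled) if p == r)


def all_or_nothing(presented, recalled):
    """Full credit for a sequence only if every letter is recalled in order."""
    return len(presented) if list(presented) == list(recalled) else 0


# Hypothetical trials: (letters presented, letters recalled by the participant)
trials = [("PQR", "PQR"), ("FKLT", "FKTL")]

partial_total = sum(partial_credit(p, r) for p, r in trials)
strict_total = sum(all_or_nothing(p, r) for p, r in trials)
```

Under partial-credit scoring, the second trial still earns points for the two letters recalled in position, whereas all-or-nothing scoring discards the whole sequence.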
In order to provide a working memory test that is not overly dependent on L2 proficiency, we also used an aural running span test (Broadway & Engle, 2010). Broadway and Engle (2010) tested the validity of the running span test and found that it is predictive of higher order cognition. Since then, a growing number of second language studies have used the running span test (e.g., Kim et al., 2015). In this test, participants heard a series of letters and were asked to recall the last n items from lists that were m + n items long. The number of letters to recall was pre-determined; however, participants were not informed of the total number of letters that they would hear in the series. For instance, participants would see the message "remember the last 4 letters" on the monitor, but they were not informed a priori of the total number of letters to be presented aurally in any given sequence. The span of letters ranged from three to six, and there were six sets of letters in each span. In total, participants were asked to recall 108 letter items. Based on Broadway and Engle (2010), participants received one point for each correctly recalled item in correct serial position. Thus, the possible total score of the running span test was 108.
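The running span scoring rule and the maximum score can be sketched as follows. This is an illustration under stated assumptions rather than the scoring script used in the study: the function name and example list are hypothetical, and the target is taken to be the last n letters of the list that was heard.

```python
def running_span_score(heard, n, recalled):
    """Score one list: compare the recall against the last n letters heard,
    giving one point per letter in the correct serial position."""
    target = heard[-n:]
    return sum(1 for t, r in zip(target, recalled) if t == r)


# Maximum possible score: spans of 3 to 6 letters, six lists at each span.
max_total = sum(span * 6 for span in range(3, 7))

# Hypothetical list of six letters (m = 2 extra items, n = 4 to recall).
perfect = running_span_score("XBKQRT", 4, "KQRT")
swapped = running_span_score("XBKQRT", 4, "KQTR")
```

The arithmetic (3 + 4 + 5 + 6) × 6 reproduces the reported maximum of 108.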

Institutional TOEFL
Participants completed an institutional TOEFL exam, which utilizes retired items from the paper-based TOEFL. The institutional TOEFL includes three sections: Listening comprehension (k = 50, 30-40 minutes), Structure and written expression (k = 40, 40 minutes), and Reading comprehension (k = 50, 50 minutes). The three sections take approximately two hours to complete in total.

TOEFL iBT speaking tasks
Participants also completed two non-operational research versions of the integrated listen/speak TOEFL iBT speaking tasks. Each version consists of two speaking tasks which are based on two types of listening sources: (1) listening to a conversation in an academic context; and (2) listening to a lecture. For each question, students were given 20 seconds to prepare their response and 60 seconds to respond to the prompt. Participants were allowed to take notes during the tests, but they were not required to do so. The two conversational listening sources included in this study were a discussion between two professors about a student missing class because she was on the swimming team (swimming topic) and a conversation between two students about note-taking in class (note-taking topic). The two lecture sources were a lecture on reciprocity from an anthropology class (reciprocity topic) and a lecture about fungus from a botany class (botany topic).

Procedure
All participants attended two data collection sessions. They completed the institutional TOEFL on Day 1 and then completed the background survey, the two working memory tests, and the two integrated listen/speak tasks from the TOEFL iBT speaking test (listening to a conversation vs. listening to a lecture) on Day 2. On average, participants spent approximately two hours in the lab on the first day, and one hour and 20 minutes in the lab on the second day. The order of the data collection for the two speaking tasks on Day 2 was counter-balanced and randomly assigned to participants.

Transcription
Each spoken response was transcribed by a trained transcriber. The transcriber ignored filler words (e.g., umm, ahh) but did include other disfluency features such as word repetition and repairs. Periods were inserted at the end of each idea unit. All transcriptions were independently checked for accuracy by a second trained transcriber. The same trained transcriber transferred all the notes written by the test-takers into an electronic format. The vast majority of all notes were lexical in nature (i.e., the notes consisted of words and not symbols or abbreviations).

Note-taking
To assess student note-taking, we calculated the number of word lemmas (i.e., word roots) shared between the source text and the notes taken by each participant. We calculated two different note-taking features: the number of lemma tokens (i.e., all words) and the number of lemma types (i.e., unique words) shared between the notes and the source text.
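The two note-taking features can be illustrated with a short sketch. It assumes that the notes and the source text have already been lemmatized into lists of strings, and it counts a note token as shared whenever its lemma occurs anywhere in the source; both the function name and the example data are hypothetical.

```python
def shared_lemma_counts(source_lemmas, note_lemmas):
    """Return (shared tokens, shared types) between notes and source text.

    Tokens: every note lemma that also occurs in the source (repeats counted).
    Types: unique lemmas that occur in both the notes and the source.
    """
    source_types = set(source_lemmas)
    shared_tokens = sum(1 for lemma in note_lemmas if lemma in source_types)
    shared_types = len(source_types & set(note_lemmas))
    return shared_tokens, shared_types


# Hypothetical lemmatized source text and participant notes
source = ["fungus", "grow", "on", "tree", "fungus"]
notes = ["fungus", "tree", "fungus", "root"]

tokens, types = shared_lemma_counts(source, notes)
```

Here "fungus" contributes twice to the token count but only once to the type count, and "root" contributes to neither because it does not occur in the source.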

Human ratings
Two expert TOEFL raters scored each speaking response. The raters used the TOEFL-iBT integrated speaking task rubric, which provides a holistic score (see http://www.ets.org/Media/Tests/TOEFL/pdf/Speaking_Rubrics.pdf). The score is based on a 0-4 scale with a score of 4 representing the highest score. Three criteria formed the basis of ratings: delivery (i.e., pronunciation and prosody), language use (i.e., grammar and vocabulary), and topic development (i.e., content and coherence). Text integration is not addressed in the rubric, but the rubric notes task fulfillment, which requires text integration.
Inter-rater reliability for the human scores was indexed by a Cohen's Kappa of .697 and a Pearson's correlation of r = .714. If the two scores differed by less than two points, the average of the raters' scores was included in the dataset. If the scores between the two raters differed by more than one point, a third rater scored the sample, and the final score was the average of the two closest scores (cf. Bejar, 1985; Carrell, 2007; Sawaki, Stricker, & Oranje, 2008).
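The score resolution rule described above can be sketched as a small function. It assumes whole-number ratings on the 0-4 scale, so that a difference of less than two points means a difference of zero or one; the function name is hypothetical, and the handling of ties among equally close pairs (the first pair is used) is an assumption, since the text does not specify it.

```python
def final_score(rater1, rater2, rater3=None):
    """Average two ratings, or adjudicate with a third when they diverge."""
    if abs(rater1 - rater2) <= 1:
        return (rater1 + rater2) / 2
    if rater3 is None:
        raise ValueError("ratings differ by more than one point; third rating needed")
    # Average the two closest scores among the three ratings.
    pairs = [(rater1, rater2), (rater1, rater3), (rater2, rater3)]
    a, b = min(pairs, key=lambda pair: abs(pair[0] - pair[1]))
    return (a + b) / 2


adjacent = final_score(3, 4)          # ratings within one point
adjudicated = final_score(1, 3, 3)    # third rater agrees with the second
```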

Language feature variables
A variety of cohesion and syntactic values were calculated to assess whether word lemmas were integrated from the source text (i.e., the listening samples) into the test-taker speaking responses. We consider these source internal variables because each word in the source text was assigned a cohesion or syntactic value based on features found in the source texts. These features included the number of repetitions of the word within the source (cohesion), whether the word was in the subject or object position in a clause (syntax), and whether the word was coordinated in a phrase or a clause (syntax). After source internal values were assigned, they were matched to the words produced by the test-takers in their spoken responses in order to examine features for words that were not integrated (i.e., found in the source text, but not in the test-taker response) and words that were integrated (i.e., found in the source text and the test-taker responses). A different procedure was conducted for lexical features. For lexical features, words in each test-taker's response were separated into .txt files that contained either integrated or non-integrated words. These files were then run through the Tool for the Automatic Analysis of Lexical Sophistication (TAALES; Kyle & Crossley, 2015) in order to calculate a number of lexical features (see below for discussion of these features). We considered these features to be response internal.
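The matching step for source internal variables can be sketched as follows, using the repetition count as the only feature. The sketch assumes lemmatized word lists as input; the function name and the example data are hypothetical, and the syntactic features (subject or object position, coordination) would need a parser and are omitted.

```python
from collections import Counter


def split_by_integration(source_lemmas, response_lemmas):
    """Assign each source lemma its repetition count in the source, then
    split the lemmas by whether they also occur in the spoken response."""
    repetitions = Counter(source_lemmas)
    response_types = set(response_lemmas)
    integrated = {w: c for w, c in repetitions.items() if w in response_types}
    non_integrated = {w: c for w, c in repetitions.items() if w not in response_types}
    return integrated, non_integrated


# Hypothetical lemmatized source text and test-taker response
source = ["fungus", "grow", "tree", "fungus", "spore"]
response = ["fungus", "help", "tree"]

integrated, non_integrated = split_by_integration(source, response)
```

Words that appear only in the response (here, "help") fall outside both groups because they carry no source internal value.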
The source and response internal features were used to predict which words were integrated into spoken responses (i.e., RQ 1). The source and response internal features were also used to predict human ratings of speaking proficiency in conjunction with individual characteristics and topic (i.e., RQ 3).

TAALES
TAALES is a computational tool that is freely available, user-friendly, works on most computer operating systems (Linux, Mac, Windows), allows for batch processing of text files, and incorporates over 250 classic and recently developed indices of lexical sophistication. These indices measure word frequency, lexical range, n-gram frequency and proportion, academic words and phrases, word information, lexical and phrasal sophistication, bigram and trigram strength of association, contextual distinctiveness, word neighbor information, lexical decision times, age of exposure, and semantic lexical relations (hypernymy and polysemy). The indices are discussed briefly below. For more detailed accounts of TAALES, please see Kyle and Crossley (2015).
Word frequency indices. TAALES calculates a number of word frequency indices with frequency counts retrieved from the SUBTLEXus database (Brysbaert & New, 2009), the British National Corpus (BNC; 2007), and the five genres found in the Corpus of Contemporary American English (COCA; academic, fiction, magazine, news, and spoken texts; Davies, 2010). TAALES calculates scores for all words (AW), content words (CW), and function words (FW).
Range indices. In addition to frequency information, TAALES computes range indices, which calculate the number of texts within a corpus in which a word appears (i.e., specificity). Range indices were computed from the spoken (574 texts) and written (3,083 texts) subsets of the BNC, SUBTLEXus (8,388 texts), and the five genres found in COCA (190,000 texts in the complete corpus).
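The distinction between a frequency count and a range count can be shown with a minimal sketch. The toy corpus and the function name are hypothetical, and TAALES draws its counts from the reference corpora listed above rather than computing them in this way.

```python
def frequency_and_range(word, texts):
    """Return (total occurrences of word, number of texts containing it).

    Each text is a list of word tokens.
    """
    frequency = sum(text.count(word) for text in texts)
    text_range = sum(1 for text in texts if word in text)
    return frequency, text_range


# Hypothetical three-text corpus
corpus = [
    ["the", "cell", "divides"],
    ["the", "cell", "cell", "wall"],
    ["the", "river"],
]

cell_counts = frequency_and_range("cell", corpus)
the_counts = frequency_and_range("the", corpus)
```

In this toy corpus, "cell" and "the" have the same frequency, but "the" has the wider range because it occurs in every text.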
N-gram frequency and proportion indices. TAALES calculates bigram and trigram frequencies and proportion scores (i.e., the proportion of n-grams in a text that are common in a reference corpus) from both the written (80 million words) and spoken (10 million words) subcorpora of the BNC and from the five genres represented in COCA (440 million words).
N-gram association measures. TAALES calculates five association measures for each bigram and trigram found in the reference corpora: Mutual Information (MI), Mutual Information Squared (MI²), t-score, ΔP, and collexeme score (Gries, 2013). MI, MI², and t-score are bidirectional measures of association between constituent words in an n-gram. While MI and, to a lesser extent, MI² tend to highlight n-grams composed of low-frequency words, t-score tends to favor n-grams composed of high-frequency words. ΔP is a directional association measure and calculates the probability of the second word in a bigram given the occurrence of the first word in it. The collexeme association measure calculates the strength of association between lexemes.
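For readers unfamiliar with these measures, the sketch below computes four of them for a single bigram using formulas that are common in the corpus linguistics literature. These formulas are an assumption about the general form of the measures and may differ in detail from the TAALES implementation; the function name, argument names, and counts are hypothetical, and the collexeme score is omitted. In the common formulation, ΔP subtracts the probability of the second word when the first word is absent from its probability when the first word is present.

```python
from math import log2, sqrt


def bigram_association(f12, f1, f2, n):
    """Association measures for a bigram (w1, w2).

    f12: frequency of the bigram; f1: frequency of w1 as first word;
    f2: frequency of w2 as second word; n: total bigram tokens in the corpus.
    """
    expected = f1 * f2 / n                      # co-occurrence expected by chance
    mi = log2(f12 / expected)                   # Mutual Information
    mi2 = log2(f12 ** 2 / expected)             # Mutual Information Squared
    t_score = (f12 - expected) / sqrt(f12)      # t-score
    # Delta P: P(w2 | w1) minus P(w2 | not w1)
    delta_p = f12 / f1 - (f2 - f12) / (n - f1)
    return {"MI": mi, "MI2": mi2, "t": t_score, "deltaP": delta_p}


# Hypothetical counts for one bigram
scores = bigram_association(f12=8, f1=16, f2=32, n=1024)
```

Because MI divides the observed frequency by the expected frequency, rare word pairs with few chance co-occurrences receive high values, whereas the t-score grows with the absolute number of observed co-occurrences.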
Contextual distinctiveness. TAALES calculates several indices related to contextual distinctiveness, which measure the diversity of contexts in which a word is encountered (Brysbaert & New, 2009; McDonald & Shillcock, 2001). These indices include the Edinburgh Associative Thesaurus (EAT) index, based on empirical free association data collected by Kiss, Armstrong, Milroy, and Piper (1973); the University of South Florida (USF; Nelson, McEvoy, & Schreiber, 1998) stimuli count index, based on a written free association task; semantic diversity (SemD), based on a computationally derived latent semantic analysis (LSA) measure (Hoffman, Ralph, & Rogers, 2013); and a relative entropy index calculated by McDonald and Shillcock (2001) for 8,000 English lexemes as they occurred in the spoken BNC.
Word recognition norms.TAALES reports on lexical decision (LD) and word naming (WN) behavioral norms obtained from The English Lexicon Project (ELP), a large publicly available psycholinguistic dataset (Balota et al., 2007).The ELP includes LD and WN task response latencies and accuracies collected from 816 native English-speaking subjects.Latencies (i.e., response times) and accuracies were calculated in response to 40,481 real words (and an additional 40,481 nonwords for the LD task).
Word neighborhood information.TAALES reports on the word neighborhood information found in ELP.These indices are based on orthographic, phonographic, and phonological neighborhood information for 40,481 words that report word neighborhood size and frequency indices.All neighborhood frequency values are based on the 131 million-word Hyperspace Analogue to Language (HAL) corpus frequency norms (Lund & Burgess, 1996).
Age of exposure.TAALES reports on age of exposure indices that calculate a comprehensive model of word complexity, Age of Exposure, which replicates the learning curve of lexical concepts based on their associations with other words (Dascalu, McNamara, Crossley, & Trausan-Matu, 2016).Hypothetically, AOE indices model the word learning process as a function of language experience with language based on a large-scale corpus.

Statistical analyses
In order to address our three research questions, a number of statistical analyses were conducted. Prior to all analyses, we first checked for multicollinearity between all the linguistic variables in the analysis, which was operationalized as any two variables demonstrating a strong correlation (r > .700). We next conducted correlations between the variables and the speaking scores for each task for each participant to ensure that variables entered into the model demonstrated a significant and meaningful linear relation with the dependent variable (p < .001, r > .100). We selected a cut-off of p < .001 to guard against Type I errors. For research question 1, we first conducted an initial Multivariate Analysis of Variance (MANOVA) to select the linguistic variables that demonstrated the strongest differences between the integrated and unintegrated words. We then entered the significant MANOVA variables that did not demonstrate multicollinearity into a discriminant function analysis (DFA) on the entire set of speaking samples to provide confirmatory evidence for the strength of these variables in classifying the words as integrated or unintegrated. The model reported by this DFA was then used to predict group membership of the speaking samples using leave-one-out cross-validation (LOOCV). The LOOCV procedure allows the accuracy of the model to be tested on an independent data set. The DFA can provide evidence that source internal variables are predictive of which words test-takers will integrate into their responses.
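The screening step described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure actually run: the significance criterion (p < .001) is omitted, absolute correlations are used, and when two predictors are multicollinear the one more strongly correlated with the outcome is retained (the text does not say which member of a pair was kept). All variable names and data values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def screen_predictors(features, outcome, collinear_cut=0.700, effect_cut=0.100):
    """Keep predictors with |r| > effect_cut that are not collinear with a kept one."""
    strength = {name: abs(pearson_r(vals, outcome)) for name, vals in features.items()}
    kept = []
    # Visit the strongest predictors first so they win any collinear tie (assumption).
    for name in sorted(strength, key=strength.get, reverse=True):
        if strength[name] <= effect_cut:
            continue
        if all(abs(pearson_r(features[name], features[k])) <= collinear_cut for k in kept):
            kept.append(name)
    return kept


# Hypothetical data for six responses.
scores = [1, 2, 2, 3, 4, 4]
features = {
    "shared_words": [2, 4, 5, 6, 8, 9],
    "shared_types": [2, 4, 5, 6, 8, 10],   # nearly duplicates shared_words
    "filler_count": [1, 0, 0, 1, 0, 1],    # unrelated to the scores
}
kept = screen_predictors(features, scores)
```

In this toy data set only one of the two nearly identical predictors survives, and the unrelated predictor is dropped for failing the effect size threshold.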
Our second statistical analysis determined whether the linguistic features and individual differences (e.g., working memory and institutional TOEFL sub-scores) could predict the human ratings for the individual integrated speaking tasks. To account for both pooled and individual variance among participants, rather than treating participants as one pooled group, we included subjects as random effects (i.e., assigned a unique intercept to each participant). We used R (R Core Team, 2015) for our statistical analysis and the package lme4 (Bates, Mächler, Bolker, & Walker, 2015) to construct linear mixed effects (LME) models. We also used the package lmerTest (Kuznetsova, Brockhoff, & Christensen, 2015) to analyze the LME output and derive p-values for individual fixed effects. Final model selection and interpretation were based on t and p values for fixed effects, post-hoc comparisons of categorical variables, and visual inspection of the distribution of residuals. To obtain a measure of effect sizes, we computed correlations between fitted and predicted residual values, resulting in an R² value.¹ Prior to running an LME model, we examined correlations between the linguistic features, the individual characteristics, and the speaking scores in order to select variables for inclusion in the LMEs that reported at least a small effect size (r > .100) and that were not multicollinear (r < .700). We conducted two stepwise LMEs. The first LME examined the associations of individual characteristics (e.g., working memory, age, and institutional TOEFL scores) and topic with the speaking scores. This model included subjects as random effects. Descriptive statistics for the continuous scaled individual characteristics used in this analysis are reported in Table 1. The second LME model was conducted to examine the associations of these individual characteristics along with topic and linguistic features with speaking scores.

MANOVA
Prior to conducting the MANOVA, all assumptions for the MANOVA were checked and met. The MANOVA used the integrated and unintegrated words from each text as the independent variables and the linguistic indices as the dependent variables. Seventeen indices were selected from the MANOVA for the DFA based on their effect sizes. Selected indices did not theoretically overlap with each other (see Table 2 for descriptive statistics for these indices). The MANOVA results demonstrated that words integrated into test-takers' spoken responses from the source text were more frequent, had lower age of acquisition, had a greater range, had more orthographic and phonological neighbors, had more free associations, were repeated more often in the source text (i.e., the occurrence of word in source text index), occurred more often in the source text in clausal coordinations and as objects of prepositions, had greater age of exposure, had greater character bigram frequency, and were named more quickly than unintegrated words. Conversely, the words not integrated into test-takers' spoken responses from the source text were less meaningful and less concrete.

¹ We used R² GLMM to present the variance explained in our model. Historically, using R² in mixed-effects models has been problematic because R² algorithms may report decreased or increased R² in larger models. R² GLMM calculates marginal and conditional R² values that are less susceptible to these problems. Marginal effects are concerned with the variance explained by fixed factors, while conditional effects concern the variance explained by both fixed and random factors (Nakagawa & Schielzeth, 2012).
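To make the distinction between marginal and conditional R² concrete, the sketch below computes both from the variance components of a Gaussian random-intercept model, following the definitions in Nakagawa and Schielzeth (2012). The function name and the variance values are hypothetical and are not estimates from this study.

```python
def r2_glmm(var_fixed, var_random, var_residual):
    """Marginal and conditional R-squared for a random-intercept model.

    var_fixed    : variance of the fixed-effects predictions
    var_random   : variance of the random intercepts (here, participants)
    var_residual : residual variance
    """
    total = var_fixed + var_random + var_residual
    marginal = var_fixed / total                      # fixed factors only
    conditional = (var_fixed + var_random) / total    # fixed plus random factors
    return marginal, conditional


# Hypothetical variance components, chosen only to illustrate the arithmetic.
marginal_r2, conditional_r2 = r2_glmm(var_fixed=2.0, var_random=1.0, var_residual=1.0)
```

The conditional value can never be smaller than the marginal value, and the gap between the two reflects how much variance is attributable to differences among participants rather than to the fixed predictors.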

Discriminant function analysis
We conducted a stepwise discriminant function analysis (DFA) to confirm that the indices selected in the MANOVA indeed discriminated between integrated and unintegrated words. A DFA generates a discriminant function, which is then used in an algorithm to predict group membership (i.e., whether the words were integrated or unintegrated). For the DFA, we used the 17 indices from the MANOVA. The stepwise DFA retained 11 of these indices as significant predictors of whether a word was integrated into the test-takers' response or unintegrated (see Table 2 for details on whether each variable was retained in the DFA) and removed the remaining six variables as nonsignificant predictors based on their predictive strength.
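The logic of validating a discriminant function with LOOCV can be sketched in a deliberately reduced form. The example below uses a single predictor and two groups with equal priors, in which case the linear discriminant rule reduces to assigning a case to the group with the nearer mean. It is not the stepwise, multi-predictor DFA reported here, and the predictor values and labels are hypothetical.

```python
def loocv_accuracy(values, labels):
    """Leave-one-out accuracy of a one-predictor, two-group linear discriminant."""
    correct = 0
    for i in range(len(values)):
        # Fit on every observation except the held-out one.
        train = [(v, g) for j, (v, g) in enumerate(zip(values, labels)) if j != i]
        means = {}
        for group in set(g for _, g in train):
            members = [v for v, g in train if g == group]
            means[group] = sum(members) / len(members)
        # With one predictor and equal priors, the discriminant rule assigns
        # the held-out case to the group whose mean is nearer.
        predicted = min(means, key=lambda g: abs(values[i] - means[g]))
        correct += predicted == labels[i]
    return correct / len(values)


# Hypothetical log-frequency values for ten word lists.
log_frequency = [3.1, 2.9, 3.4, 3.0, 1.4, 1.2, 1.5, 1.1, 1.8, 1.3]
group = ["integrated"] * 5 + ["unintegrated"] * 5
accuracy = loocv_accuracy(log_frequency, group)
```

Because each case is classified by a function estimated without it, the resulting accuracy approximates performance on data the model has not seen.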
The results demonstrate that the DFA using these 11 indices correctly allocated 1049 of the 1052 word lists as being integrated or unintegrated, χ²(1) = 1040.068, p < .001, for an accuracy of 99.7% (chance level for this analysis is 50%). The Kappa value for this analysis was .994, which suggests almost perfect agreement between the predicted classification of the word lists and their actual classification. The results from the LOOCV were identical to the initial DFA (see Table 3 for the confusion matrix for this analysis). The results indicate that the 11 variables can predict with almost perfect accuracy if a word is integrated or unintegrated from the source text.
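The relation between the reported accuracy and Kappa can be checked with a short calculation. The sketch below assumes two equally sized groups of 526 word lists and distributes the three misclassifications arbitrarily. These cell counts are hypothetical, since the actual confusion matrix appears only in Table 3, but any split with 1049 correct out of 1052 and balanced groups gives the same result.

```python
def accuracy_and_kappa(matrix):
    """matrix[i][j] = number of cases with actual class i and predicted class j."""
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical cells: rows are actual, columns are predicted
# (integrated, unintegrated); 1049 of 1052 lie on the diagonal.
hypothetical_matrix = [[524, 2], [1, 525]]
accuracy, kappa = accuracy_and_kappa(hypothetical_matrix)
```

With chance agreement at 50%, Kappa is simply the accuracy rescaled so that chance performance maps to 0 and perfect classification to 1.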

Pearson correlations
After controlling for multicollinearity, p values, and effect sizes, we were left with 31 variables. These variables related to key words; Institutional TOEFL reading, listening, and structure subscores; cohesion, syntactic, and lexical sophistication scores taken from the integrated words; note-taking; and working memory (see Table 4 for Pearson correlation results). For our baseline model that answered RQ 2, we included all individual characteristics that showed at least a small effect size (r > .100) along with topic and gender. In order to avoid overfitting the full LME model, which addressed RQ 3, we only selected the linguistic indices that demonstrated at least a medium effect size (r > .300) with speaking scores and all individual characteristics that showed at least a small effect size (r > .100) along with topic and gender. Thus, we included the five linguistic features that showed the highest correlations in the model along with the three TOEFL subscore variables, one note-taking variable, one working memory variable (listening span score), and two categorical variables (gender and topic).

Linear mixed effects models
A baseline stepwise LME model considering participants' individual characteristics and topic revealed significant effects for note-taking, TOEFL listening and structure scores, and topic. The model indicated that students who included more word types from the source in their notes scored higher on listen/speak tasks. In addition, students with higher TOEFL listening and structure scores scored higher, as did students who responded to the note-taking topic (i.e., a conversation task). The model reported a marginal R² of .361 and a conditional R² of .719. Table 5 displays the coefficients, standard errors, t values, and p values for each of the fixed effects. Inspection of residuals suggested the model was not influenced by heteroscedasticity.
A full model including the nested baseline model and linguistic features revealed significant effects for two linguistic features (Number of shared words between response and source, and Occurrence of shared words (noun in object position) between response and source), TOEFL listening and structure scores, and topic. Results indicated that students who had a greater number of words integrated from the source into their response received higher speaking scores. However, if the students integrated words from the source texts that were in the object position, they received lower scores. As in the baseline model, students with higher TOEFL listening and structure scores scored higher.
In terms of topic, students who responded to the "note taking" topic scored higher than students who responded to the fungus and reciprocity topics but not the swimming topic (i.e., students scored higher on the conversation tasks than the lecture tasks). Contrasts indicated that students who responded to the swimming topic scored higher than those who responded to the fungus and reciprocity topics. The model reported a marginal R² of .588 and a conditional R² of .754. Table 6 displays the coefficients, standard errors, t values, and p values for each of the fixed effects. A log likelihood comparison found a significant difference between the baseline and full models (χ²(2) = 193.210, p < .001), suggesting that the inclusion of linguistic features contributed to a significantly better model fit. Inspection of residuals suggested the model was not influenced by heteroscedasticity.
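The log likelihood comparison can be illustrated with a short sketch. The likelihood ratio statistic is twice the difference between the log likelihoods of the nested models, and for 2 degrees of freedom the chi-square tail probability has the closed form exp(−x/2). The function names are hypothetical, and the log likelihoods of the two models are not reported here, so only the reported statistic is evaluated.

```python
import math


def lr_statistic(loglik_baseline, loglik_full):
    """Likelihood ratio statistic for two nested models."""
    return 2 * (loglik_full - loglik_baseline)


def chi2_sf_df2(x):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-x / 2)


# Tail probability for the statistic reported for the baseline versus full model.
p_reported = chi2_sf_df2(193.210)
```

The two degrees of freedom correspond to the two linguistic features added to the baseline model, and the resulting probability falls far below the .001 threshold.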

Discussion
Integrating content from surrounding language is an important indicator of academic success and, in order to better assess the potential for academic success in test-takers, standardized tests now reflect this reality. An important element of integrating content is the ability to recall information from previously encountered discourse. Recall can be aided by individual characteristics such as working memory or language proficiency, by strategy use such as note-taking, or by the linguistic properties of the preceding discourse. The purpose of this study was to examine if linguistic features in source texts could explain word recall and integration for items administered in the listen/speak section of the TOEFL-iBT and to what extent individual characteristics such as working memory and proficiency level and/or linguistic features could predict human judgments of speaking proficiency.
The results provide evidence that words integrated into spoken responses from the source text had word properties that would afford their recall. Twelve linguistic indices related to lexical items (i.e., propositional-specific information), text cohesion (relational information), and syntactic features predicted with almost perfect accuracy (99.7%) whether words from the source text would be integrated into test-takers' spoken responses. The majority of these variables were lexical in nature and demonstrated that words in the source text that were more frequent, had more associations, were named more quickly, contained more frequent character bigrams, and had more phonographic neighbors were more likely to be integrated into the response. Two cohesion variables were also significant predictors in the DFA, indicating that words that were repeated more often in the source texts and words that were found in coordinated phrases were more likely to be integrated into test-takers' responses. Lastly, one syntactic feature (nouns that were objects of a preposition) was a predictor in the DFA, indicating that nouns used in descriptive phrases were more likely to be integrated into the spoken response.
This study also focused on predicting human judgments of speaking responses in terms of individual characteristics, topic, and linguistic features related to both source and response internal variables. A baseline model using only individual characteristics and topic included four variables as significant predictors of human ratings. These included note-taking, TOEFL ITP listening scores and structure scores, and topic. The note-taking variable indicated that students who included more word types (i.e., individual words) from the source text in their notes received a higher score. In terms of topic, lecture tasks led to lower scores than the note-taking conversation, as did the swimming conversation. The note-taking conversation led to higher scores when compared to the swimming conversation, likely because the topic was more common (note-taking as compared to swimming), as was the context (two students talking as compared to two professors).
No demographic variables were significant predictors of speaking proficiency in the LME model. In addition, no working memory test scores were significant predictors in the LME model even though a correlation demonstrated a weak relationship between the listening span and speaking scores (r = .154, see Table 2), while the correlation between the running span and speaking scores was not significant (r = .030). The descriptive statistics for the working memory scores reported in Table 1 do not indicate a ceiling effect and show a relatively robust range and variance in scores, suggesting that our study included participants who had a range of working memory capacity. The findings of the study are, to some extent, in line with previous L2 listening testing literature, which showed a lack of evidence for a significant relationship between WMC and L2 listening (Andringa et al., 2012). In addition, the participants in the current study participated in listen/speak tasks which allowed them to take notes and use them while speaking. Such task characteristics likely reduce the need to rely on working memory during oral responses (i.e., using strategies to overcome cognitive differences).
When linguistic variables were incorporated into the model, the model significantly outperformed the baseline model and included two linguistic features. The first feature indicated that responses that received higher scores included a greater number of shared words between the response and the source text, suggesting that the degree of text integration was the most important factor that predicted human scores. Additionally, responses received a lower score if they included a greater number of nouns from the source text that were located in the object position. The latter finding likely indicates that test-takers who focused on ancillary information in the source text (i.e., not the main subjects of the source text) received lower scores. The LME models also indicated that test-takers with better listening and structure scores scored higher on the speaking section of the TOEFL-iBT. Lastly, topic was an important predictor. Specifically, test-takers received lower scores on the lecture tasks than on the conversation tasks. Unlike in the baseline LME, note-taking was not a significant predictor of human scores when linguistic features were included.
In combination, the findings from the DFA and LME models indicate that linguistic elements in the source text (i.e., cohesive and syntactic features) and lexical properties of words strongly predict which words test-takers integrate into their spoken responses. The findings demonstrate that words that are repeated more often in the source text and nouns that are either coordinated or found as objects of prepositions in the source text are more likely to be integrated into test-takers' responses. In addition, words in the source text that are more frequent, have more associations, are named more quickly, contain more frequent character bigrams, and have more phonographic neighbors are more likely to be integrated into test-takers' responses. These findings suggest that properties of the source text along with properties of the words within the source text assist in text recall and may aid test-takers in noticing and integrating key words and/or concepts into their responses. The findings also show that integration of words from the source text is a significant predictor of human judgments of speaking proficiency although nouns in the object position are not. These linguistic features are still important predictors of speaking proficiency even when individual characteristics such as language proficiency, working memory skills, age, and gender along with topic and strategies such as note-taking are included in the model.
As noted by Crossley et al. (2014), these findings have important implications for the difficulty of test items because listening samples that contain less sophisticated words that are easier to recall and contain greater cohesion between these words appear to lead to better recall of key words from the source text. The integration of these words by test-takers into their spoken responses may lead to higher ratings of speaking proficiency, indicating that source texts containing words with greater recall properties (i.e., words that are more frequent and have more associations) and discourse structures that lead to greater recall (i.e., key words, words in cohesive structures, and words that are objects of prepositions) may positively influence test-taker scores when compared to source texts with lower lexical, cohesion, and syntactic recall properties.
Thus, test designers need to carefully consider lexical and cohesive properties between test items to ensure balance among items across different versions of their tests. When developing speaking assessment tests, developers should consider that linguistic properties of source texts strongly influence text integration, which in turn can impact human ratings of integrated speaking proficiency. If a text contains relational, propositional, and syntactic features that do not lead to recall of items, human ratings of speaking proficiency may decrease. On the other hand, if a source text contains relational, propositional, and syntactic features that do increase recall, ratings of speaking proficiency may increase. As a result, if a test has two forms, or if multiple versions of a test are administered with source texts that differ in the amount of relational, propositional, and syntactic features, one form or test may prime greater recall of source text words/concepts, resulting in increased speaking proficiency scores when compared to the other. Although these properties are not easy to measure, natural language processing tools like TAALES would prove helpful in assessing the properties of words within source texts. For instance, if multiple forms of a test are developed, TAALES could be used to measure differences in the lexical properties of each form (i.e., differences in word frequency, words' phonological and orthographic neighbors, and word meaningfulness) to ensure balance across forms. This could provide a level of certainty that each form would lead to similar integration of words from the source text. Additionally, test developers could identify key words in source texts and ensure that each form included a similar number of key terms.

Conclusion
The current study shows that the relational, propositional, and syntactic properties of source texts are almost perfect predictors of text integration and that lexical integration from the source text into the spoken response (especially nouns) acts as a strong predictor of human ratings of speaking proficiency that goes beyond individual differences such as working memory and listening skills, test-taking strategies such as note-taking, and topic. Overall, the findings indicate that the properties of the source text can predict which words will be included in the response as well as predict human ratings of speaking proficiency. The finding that properties in the input appear to have an effect on the elicitation of spoken responses (Lee, 2006) raises concerns about integrated speaking assessments, which may inadvertently place greater weight on recall ability than on other elements of speaking proficiency such as language use, delivery, and topic development. Future studies would benefit from the inclusion of multiple source texts that are controlled such that they differ in their frequency and type of relational and propositional properties. Such studies could better examine the relationship between linguistic properties in the source text and speaking proficiency scores and provide direct support for our interpretation of the findings from this study.
Overall, this study in conjunction with Crossley et al. (2014) provides strong evidence that linguistic features in the source text can influence text recall and text integration. However, these results cannot be generalized to other types of sources beyond the listen/speak tasks in the TOEFL-iBT. Unlike Crossley et al. (2014), the current study did control for several test-taker variables such as proficiency, age, gender, and working memory. In addition, this study examined a wider range of linguistic features taken from a number of contemporary natural language processing tools. Together, these additions lend further strength to the argument that lexical, cohesion, and syntactic features in the source text can influence text recall and text integration and that this integration is a predictor of test performance.

Table 1. Descriptive statistics for individual differences.

Table 2. Descriptive statistics and MANOVA results for linguistic features.

Table 3. Confusion matrix for DFA integrated and unintegrated words.

Table 4. Correlations between fixed factors and speaking scores.

Table 5. Baseline model for speaking proficiency scores.

Table 6. Full model for speaking proficiency scores.