Validity evidence for a sentence repetition test of Swiss German Sign Language

In this study we seek evidence of validity according to the socio-cognitive framework (Weir, 2005) for a new sentence repetition test (SRT) for young Deaf L1 Swiss German Sign Language (DSGS) users. SRTs have been developed for various purposes for both spoken and sign languages to assess language development in children. In order to address the need for tests to assess the grammatical development of Deaf L1 DSGS users in a school context, we developed an SRT. The test targets young learners aged 6–17 years, and we administered it to 46 Deaf students aged 6.92–17.33 (M = 11.17) years. In addition to the young learner data, we collected data from Deaf adults (N = 14) and from a sub-sample of the children (n = 19) who also took a test of DSGS narrative comprehension, serving as a criterion measure. We analyzed the data with many-facet Rasch modeling, regression analysis, and analysis of covariance. The results show evidence of scoring, criterion, and context validity, suggesting the suitability of the SRT for the intended purpose, and will inform the revision of the test for future use as an instrument to assess the sign language development of Deaf children.

An underlying assumption of an SRT is "if the participant has acquired the grammatical feature associated with or displayed in the stimuli, it should be easy to repeat the stimuli" (Yan, Maeda, Lv, & Ginther, 2016, p. 498). SRTs involve the following: (1) the processing of a stimulus sentence; (2) reconstructing it with the test-takers' own grammar; and (3) reproducing it (Jessop, Suzuki, & Tomita, 2007).
There does not seem to be a consensus on what exact construct an SRT taps into, although several researchers have attempted to address this. Yan et al. (2016) reviewed a number of studies using SRTs, concluding that the measured construct can be summarized as (1) a global proficiency or (2) more specific linguistic features, for example, phonology, morphosyntax, and syntax (p. 504). Okura and Lonsdale (2012) raised the question of whether the construct addressed by SRTs is one of language proficiency or, rather, rote repetition, whereas Spada et al. (2015), in a study of implicit linguistic knowledge in L2 adult learners, argued that the construct addressed is grammatical processing.
In an attempt to settle this controversy, some researchers have added ungrammatical sentences to their stimuli with the expectation that an ungrammatical sentence processed by a test-taker will result in a corrected sentence (Erlam, 2006; Sarandi, 2015; Yan et al., 2016). For example, in the study by Erlam (2006), native speakers of English (N = 20) corrected 91% of ungrammatical sentences in an SRT (and repeated 97% of grammatical sentences correctly), which the author interpreted as "evidence of the validity of the test as a measure of implicit [linguistic] knowledge" (p. 485). Additional evidence that the measured construct is linguistic knowledge came from a study by Klem et al. (2015), who investigated an SRT as a measure of language ability in school-aged children (N = 216) in the Norwegian context, concluding, "sentence repetition is best conceptualized as a measure of language ability" (p. 152). The authors further argued that "sentence repetition is best seen as a complex linguistic task that reflects the integrity of language processing systems at many different levels (speech perception, lexical (vocabulary) knowledge, grammatical skills, and speech production . . .)" (p. 152). Further support for the notion that the SRT format measures linguistic knowledge comes from various other studies (e.g., Devescovi & Caselli, 2007; Graham et al., 2010; Jones, 1994). For example, Devescovi and Caselli (2007) used an SRT for spoken Italian with pre-schoolers aged 2-4 years (N = 25) and compared the results with the children's spontaneous language data. The authors found significant positive correlations between the mean length of utterance, omission of articles, and number of verbs produced in both measures. The authors concluded that an SRT can be used (along with other measures) "to evaluate language abilities in typical developing children between 2 and 4 years of age" (Devescovi & Caselli, 2007, p. 201).

SRTs for sign languages
The authors of only a few studies published in the literature explicitly addressed the development of SRTs for sign languages. For example, Hauser et al. (2008) discussed the development of an SRT for American Sign Language (ASL) as a global measure of proficiency to test Deaf and hearing signers at different levels. The ASL SRT was used both with adults and children (age range of children: 12.5 to 14.1 years; M age = 12.9).
The ASL SRT is based on the Speaking Grammar Subtest of the Test of Adolescent and Adult Language-Third Edition (TOALT3; Hammill, Brown, Larsen, & Wiederholt, 1994). In total, 40 sentences in increasing length and of different syntactic, thematic, and morphological complexity were developed (Hauser et al., 2008). Difficulty was increased, for example, by using more complex morphological signs. It was found that increasing sentence length did not automatically increase the complexity of a sentence in sign languages (Hauser et al., 2008). The ASL SRT has also been adapted to German Sign Language (Deutsche Gebärdensprache, DGS; Kubus & Rathmann, 2012), British Sign Language (BSL; Cormier, Adam, Rowley, Woll, & Atkinson, 2012), and Swedish Sign Language (Svenskt Teckenspråk, STS; Schönström & Holmström, 2017).

Socio-cognitive approach to test validation
The socio-cognitive approach to test validation (O'Sullivan & Weir, 2011; Weir, 2005) includes the cognitive, social, and evaluative dimensions of "language use in test development and validation" (O'Sullivan & Weir, 2011, p. 20). This approach includes various validity arguments for which evidence is gathered at different stages of test development and use, collectively contributing to an argument for the overall validity of the test. The kinds of validity evidence are as follows: (1) test-takers' characteristics; (2) context validity (i.e., characteristics of test tasks and their administration); (3) cognitive validity (i.e., appropriateness of cognitive processes required to complete the tasks); (4) scoring validity (i.e., meaning of the score); (5) consequential validity (i.e., effect of the test on stakeholders); and (6) criterion-related validity (i.e., other, external evidence corroborating test score inferences) (O'Sullivan & Weir, 2011; Weir, 2005). This framework will serve as the basis for the present validation of the SRT for DSGS, with particular attention paid to test-taker characteristics, scoring, and criterion-related validity. Additionally, context validity was partly addressed in item and rating scale development.

Test-taker characteristics
Test-taker characteristics such as chronological age or parental hearing status have often been used as a means to differentiate between early or later access to a sign language (L1 vs. L2) in studies evaluating sign language tests (e.g., Herman, 2002; Mann, 2006). The variable of chronological age is used to investigate whether a test instrument represents developmental progression in signing Deaf children (Herman, 2002). The variable of parental hearing status is often used to account for the heterogeneous linguistic experiences of Deaf children (Mayberry, Lock, & Kazmi, 2002). Only about 5% of Deaf children are born into Deaf families and therefore may have access to a sign language from birth as a first language. The remaining 95% are born into non-signing hearing families (Mitchell & Karchmer, 2004) and might have first access to sign language after the critical period of language acquisition (Mayberry et al., 2002).
A group of native signing Deaf children is therefore often used as a model or a reference against which the performances of children with other linguistic experiences (Deaf children of hearing parents) can be measured. It is important to point out, however, that the use of parents' hearing status as a variable is not entirely undisputed, as Deaf parents may not be native signers per se, as they may have grown up in a hearing family and learned sign language later (e.g., Singleton & Newport, 2004). For the purpose of determining whether the DSGS SRT scores align with developmental expectations, both variables will be included in the model of evaluating the SRT for DSGS.

Deaf adults as a reference for test development
Many sign languages are not as well researched as spoken languages. This at least partly accounts for the incomplete description of DSGS grammar as well as the lack of L1 DSGS child acquisition studies to use as external "benchmarks" to inform the development of items for the measurement of developmental progression. As a result, Deaf adults' performances are the only practical point of reference for mastered/acquired structures of DSGS. Since the present test is the first for DSGS targeting Deaf children, we compared the performances of Deaf adults with those of the target population of children and adolescents (e.g., Rinaldi et al., 2018), on the assumption that adult users of DSGS should outperform the children, since the adults had already acquired DSGS fully.

Linguistic structure of DSGS
Linguistic structures of DSGS that are part of the construct of the SRT will be briefly described in this section. An important feature in any sign language is the distinction between manual and non-manual components. Manual components are produced with the hands; non-manual components are features that are produced with the mouth, the face (e.g., with cheeks, eyes, eyebrows, etc.), with the head, and the upper torso (Boyes Braem, 1995; Sutton-Spence & Woll, 1999). For example, eye gaze can be used to reestablish reference in signing space or raised eyebrows to differentiate between a declarative and an interrogative sentence (Pfau & Quer, 2010).
Another important feature of sign languages is the use of signing space, that is, the physical space in front of the signer's body, which serves various purposes (Johnston & Schembri, 2007). The signing space is important for introducing and maintaining reference. For example, with the first mention of an object or person an index is used to locate it (e.g., a house) at a specific point in space. With gaze or index finger at this same locus the signer can then establish pronominal reference (Boyes Braem, 1995). The signing space is also important in representing how an object (e.g., a car) moves from A to B.
Sign language phonology: The smallest building blocks of sign languages are the sub-lexical units of signs. These sub-lexical units are the handshape, location, movement, and hand orientation (Boyes Braem, 1995).
Sign language morphology: One important aspect of sign language morphology is verb classes, which are, depending on the underlying model, grouped as plain, agreement, and spatial verbs (e.g., Padden, 1990). Another area is negation which is expressed in DSGS manually, non-manually, or by a combination of both.
Sign language syntax: Sign languages are described as having more flexible word/sign order than spoken languages (Erlenkamp, 2012). The difference between a question and a statement is expressed non-manually. For example, the sign GEHÖRLOS (deaf) in a declarative sentence "I am deaf" shows a neutral facial expression and head position. This is different if the sign is part of a question such as "Are you deaf?" Here, the sign is realized with a slight head movement forward and raised eyebrows (Boyes Braem, 1995).
Discourse strategies: A frequently used discourse strategy in sign languages is constructed action (e.g., for BSL: Cormier, Smith, & Zwets, 2013). Constructed action refers to a situation in which the signer "takes the role" of a referent to express his or her feelings, ideas, actions, and so on. The signer uses manual and non-manual techniques to express specific feelings or actions of a referent.

Research questions
In the present study we seek validity evidence for a new sentence repetition test (SRT) for school-aged Deaf users of Swiss German Sign Language (DSGS) through the following research questions (RQs):

RQ1: To what extent does the DSGS SRT demonstrate evidence of scoring validity?
RQ2: To what extent does the DSGS SRT demonstrate evidence of criterion-related validity?
RQ3: To what extent do individual test-taker characteristics (age and hearing status of parents) impact performance on the DSGS SRT?
RQ4: To what extent does the DSGS SRT demonstrate evidence of context validity?

Instruments
DSGS Sentence Repetition Test (SRT).
Existing SRTs for sign languages described above were used as a framework for developing the SRT for DSGS, referring both to the sentences and to the scoring criteria, to be detailed below.
SRT item development. We developed the content of the SRT for the current study through a process of expert moderation. In the first step, an item candidate pool of 75 sentences was developed: 38 were based on the DGS version, 17 on a BSL SRT for children, and 10 on the Italian Sign Language SRT for children. The sentence development was supplemented by five sentences from the DSGS online learning materials for families with Deaf children "E-Kids" (SGB-FSS, n.d., https://ekids.sgb-fss.ch/) and five sentences that were developed by the Deaf research collaborator of the project. Even though the majority of the sentences of the SRT for DSGS are from the DGS version (which, in turn, was a direct translation from the ASL SRT), sentences were also adapted from other SRTs developed explicitly for younger children (under the age of 12, which is the youngest age group of the SRT for ASL). The goal at this stage of the project was to have a pool of DSGS sentences available that (1) varied in length, (2) varied in complexity, and (3) were sensitive to the life experiences of children aged 6-17 years. This pool was subjected to expert moderation twice by two separate panels of Deaf sign language instructors before it was administered to any test-takers. The first panel consisted of four Deaf sign language instructors, and the second, of five. In the first moderation, the sentences were evaluated for the following: regional variation (Haug, 2011; Hauser et al., 2008); grammaticality; and relevance to the child sample, in terms of life experience and linguistic development. This resulted in the removal of 15 sentences.
In the second moderation, the judges individually rated the sentences' difficulty from the perspective of a Deaf child on a four-point holistic Likert scale ranging from "very easy" to "very difficult." Sentences for which the judges showed little or no agreement were later discussed as a group. These discussions resulted in the following criteria for separating easier from more difficult sentences: (1) length of the sentence, (2) use of non-manual components, and (3) use of space. This process of ensuring that the test tasks matched the test-takers in terms of appropriacy of information and content, grammatical and lexical difficulty, and regional language variation suggests support of a claim to context validity (O'Sullivan & Weir, 2011; Weir, 2005).
Thirty-six sentences upon which the majority of the five judges of the second moderation agreed within one point on the Likert scale were kept. The remaining 39 sentences were further discussed by the Deaf research collaborator and one other Deaf colleague with extensive experience in sign language research. On the basis of this discussion, 15 of these 39 sentences were removed, resulting in a pool of 60 sentences.
The pool of 60 sentences was then piloted with a small representative sample of three signing children and two adults, and the responses were used for rater training (described below). Based on rater recommendations from this pilot, a further 20 sentences were removed for either being too difficult (no participants repeated correctly) or too easy (all repeated correctly), resulting in a final instrument composed of 40 sentences that had passed all three moderation/training steps. The final 40 sentences included specific linguistic features of DSGS, for example, phonology (e.g., sub-lexical units of signs), morphology (e.g., types of verbs, different forms of negation), syntax (e.g., different types of sentences), discourse strategies (e.g., constructed action), and non-manuals (e.g., negation) (e.g., Boyes Braem, 1995; Sutton-Spence & Woll, 1999).
SRT rating scale and rater training. The rating scale for the SRT was developed based on a study of three candidate scales (Batty & Haug, forthcoming) with the aim of developing a scale that offered more detailed information about the children's performances than would result from a simple dichotomous scale (Leclercq et al., 2014). The results of this study, as well as the criteria laid out by Marshall et al. (2015), informed the development of the current rating scale. The final, five-step (0-4) scale is presented in Table 1.
Criteria 1, 2, and 4 were judged at the single-sign level; only Criterion 3 was judged at the sentence level. When a single sign was judged as being incorrect (e.g., one out of four signs in a sentence), no points were assigned to the criterion. A total score of 4 was possible for each sentence, one point for each criterion. When something was incorrect, it was possible to specify the error for each individual sign, for example, the wrong use of non-manual features in a negated utterance or wrong sub-lexical units (e.g., incorrect handshape). However, this information did not have an impact on the test-taker's score. Non-manual features (e.g., facial expression of negation or questioning) were not listed as a separate criterion, but included in the different criteria; for example, in Criterion 3, the use of facial expression for asking a question needs to be present for the child to receive the point.
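The scoring rule described above can be illustrated with a minimal Python sketch. The criterion labels and the data layout are hypothetical (the actual criteria are worded in Table 1), and the sketch assumes only what the text states: three criteria are judged sign by sign, one is judged for the sentence as a whole, and a sign-level criterion earns its point only if no sign fails it.

```python
from typing import Dict, List

# Hypothetical labels; the wording of the real criteria is given in Table 1.
SIGN_LEVEL_CRITERIA = ("criterion_1", "criterion_2", "criterion_4")


def score_sentence(sign_judgements: List[Dict[str, bool]], criterion_3_met: bool) -> int:
    """Return a 0-4 score for one repeated sentence, one point per criterion."""
    score = 0
    for criterion in SIGN_LEVEL_CRITERIA:
        # A single incorrect sign means no point for that criterion.
        if all(sign[criterion] for sign in sign_judgements):
            score += 1
    # Criterion 3 is judged once, at the sentence level.
    if criterion_3_met:
        score += 1
    return score


# Illustrative four-sign sentence in which one sign fails criterion_2.
correct = {"criterion_1": True, "criterion_2": True, "criterion_4": True}
one_error = {"criterion_1": True, "criterion_2": False, "criterion_4": True}
example_score = score_sentence([correct, correct, one_error, correct], criterion_3_met=True)
```

In this illustration the sentence would receive 3 of 4 points, and the per-sign error detail could be stored separately without affecting the score.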
Since the goal of this rating scale was to obtain detailed information about the Deaf children's DSGS performances, rater training was also required. Rater training was conducted by the Deaf research collaborator with the two Deaf raters. The training included familiarizing the raters with the rating scale and analyzing the data from three native signing children from the item development pilot described above, including feedback and discussion on the rating criteria moderated by the Deaf collaborator. Ensuring that marking criteria are explicit for the raters further supports a claim of context validity (O'Sullivan & Weir, 2011; Weir, 2005). Furthermore, the interaction between the rating system and the scores produced, which will be discussed in greater detail with respect to the many-facet Rasch model, will provide evidence of scoring validity.
DSGS narrative comprehension test. Owing to the absence of another DSGS test that covers the same construct, the results of a DSGS narrative comprehension test (Haug & Perrollaz, 2015) were correlated with the SRT results in order to investigate criterion-related validity (RQ2). The narrative comprehension test was developed within the EU project SignMET (Sign Language: Methodologies and Evaluation Tools). The test was evaluated as part of the scientific final report for the funder of the project (SignMET Consortium, 2016). A total of 34 Deaf children took this test; their ages ranged from 4.0 to 14.0 years (M age = 8.67). Of these 34 children, 26 had hearing parents; the remaining eight had at least one Deaf parent. The maximum possible score on the test was 17, and the raw scores of the children ranged from 0 to 16 (M raw scores = 9.65, SD raw scores = 5.08).
To investigate the relationship between the chronological age, raw scores, and the hearing status of the parents, a one-way, between-groups analysis of covariance (ANCOVA) was computed with the raw score as the dependent variable, the parents' hearing status as the independent variable, and the chronological age as the covariate. There was a statistically significant difference in the raw scores between Deaf children with hearing and Deaf parents [F(1, 31) = 6.13, p = .019, partial η² = .165]. The parental hearing status explains only 16.5% of the variance in the raw scores. There was also a significant relationship between the chronological age covariate and the dependent variable while controlling for the independent variable [F(1, 31) = 37.97, p < .001, partial η² = .551]. The chronological age explains 55.1% of the variance of the raw scores.
A sub-sample (n = 19) that took part in the SRT project was also tested with this narrative comprehension test (Table 2). Even though the narrative comprehension test does not tap into exactly the same construct as the SRT, there are some aspects in the construct of the narrative comprehension test that are similar. First of all, both the SRT and the narrative test also assess comprehension skills. Additionally, some grammatical features are shared by the construct of both tests, for example:
(1) Signing space (e.g., for pronominal referencing)
(2) Verb type (e.g., agreement and spatial verbs)
(3) Constructed action
(4) Non-manual features for grammatical purposes (e.g., negation, asking questions)
We therefore argue, based on the preliminary statistical results and the overlap of the construct of both tests, that this comparison can be used to investigate criterion-related validity for the SRT for DSGS, thereby addressing RQ2.
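As a quick consistency check, partial eta squared can be recovered from a reported F statistic and its degrees of freedom using the standard identity partial η² = (F × df_effect) / (F × df_effect + df_error). The short Python sketch below applies that identity to the two F values reported for the narrative comprehension test; the function name is an illustrative choice, not part of any statistics package.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Recover partial eta squared from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F values as reported in the text for the narrative comprehension test.
hearing_status_effect = round(partial_eta_squared(6.13, 1, 31), 3)
age_effect = round(partial_eta_squared(37.97, 1, 31), 3)
print(hearing_status_effect, age_effect)  # 0.165 0.551
```

Both values agree with the effect sizes reported in the text.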

Participants
In total, we recruited 46 children and adolescents through the five schools for the Deaf in German Switzerland. We tested them between June and November 2014. Demographic data collected included the hearing status of the participants' parents, as this is often used in sign language research as an indication of L1/L2 status (i.e., for those with Deaf parents, sign language is their L1). See Table 3 for a breakdown of participant characteristics.

Procedure
The entire test was embedded in a PowerPoint presentation, which was presented to the children individually on a laptop. After the pre-recorded test instructions, the children saw six practice items to become familiar with the task, followed by the 40 sentences. During the testing session, a Deaf test administrator was present and guided the children through the test. The children were video-recorded through the built-in webcam of the laptop. The testing took between 20-30 minutes. Apart from the tests, the parents filled out a background questionnaire. Parents also received background information about the study and signed a consent form. All materials were collected through the schools and returned to the researchers. After the data collection, the video files were imported into a bespoke application for the scoring of the SRT results and given to two Deaf raters. The Deaf collaborator produced a written and a signed version of a manual, including how to use the stand-alone application of the rating scale for the raters, and also conducted a live training with them.
Owing to resource constraints, it was not possible to ensure that both raters rated all children (N = 46). Rater 1 scored 38 children, and Rater 2 rated 22 children, with an overlap of 13 children to investigate inter-rater reliability and estimate measures in the Rasch model. This resulted in 25 cases that were evaluated only by Rater 1 and nine that were scored only by Rater 2. It took about one hour for the raters to evaluate the 40 sentences per child.

Comparing the Deaf children's and adolescents' data with results from Deaf adult signers
We also collected data from adult Deaf signers to compare to the children's results. For this purpose, 14 Deaf adults, both "L1" and "L2" users of DSGS as defined by their parents' hearing status (i.e., Deaf vs. hearing parents), were tested with the same set of items of the SRT for DSGS (Table 4).
The Deaf adults filled out a background questionnaire and signed a consent form before they took the DSGS test. Rater 1 rated eight adults, and the remaining six adults were scored by Rater 2. Rater 1 and 2 were the same individuals as in the main study. Despite the lack of overlap, however, severity was estimated with the ratings of the child sample (see below). These data were used to ensure that the lexico-grammatical level of the SRT was appropriate for the developmental level of the target test-takers (Weir, 2005), thereby (together with the process of item and rating scale development) addressing RQ4.

Data analysis
In order to investigate the four research questions, the following statistical procedures were employed.

Many-facet Rasch measurement.
We employed many-facet Rasch measurement (MFRM; Linacre, 1994) with the software package Facets (Linacre, 2018) to address RQ1 and to detect possible threats to scoring validity. This method has frequently been employed to detect and investigate rater effects (Bachman, 2004; Myford & Wolfe, 2003), but can be used wherever two aspects (facets) of a test or testing situation are thought to interact (Batty, 2014; Brunfaut, Harding, & Batty, 2018; Engelhard, 2009). In addition, Rasch residuals-based fit statistics can be used to identify poorly performing items or raters requiring further examination or removal. Although there are no theoretical cut-off values at which an element can be considered too "noisy" to be useful, a commonly used guideline is that offered by Wright and Linacre (1994), which considers elements with fit statistics above 2.0 as distorting or degrading measurement. The Facets software package also provides "fair average" scores for all elements. These are provided in the original units of measurement (a five-step scale from 0 to 4 here), and represent each examinee's score, given the severity of the rater(s) the examinee was rated by. The fair average represents the score the examinee could be expected to receive, had he or she taken the test with a theoretical average-severity rater. These fair average scores will be used to investigate the impact of individual differences on scores.
The present research employs a four-facet MFRM model to investigate instrument reliability and inter-rater reliability, and to compare performances between child and adult examinees in order to demonstrate construct validity. The facets are as follows:
1. Test-takers
2. Child/Adult (dummied)
3. Rater
4. Item
The second facet (Child/Adult) is a dummy facet, not used for estimation, but is used for investigating item difficulties for the two sub-samples.
In order to address RQ4 and compare the children's performances to those of the separate sample of Deaf adult sign language users, we used an anchored model. The model was first estimated using only the children (n = 46) and the estimates were anchored. The adult sample (n = 14) was then added and the model was estimated again. This ensured that the adults' level did not contribute to the calibration of the model, and that their abilities were estimated only in terms of those of the child sample. Two initial estimations of the model revealed six items (Items 2, 14, 30, 33, 35, and 38) with Infit mean-square (MS) values exceeding 2.0, which, according to Wright and Linacre (1994), may have degraded measurement. These items were therefore removed from the model. As such, the final count of items used in estimation was reduced from 40 to 34.
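To make the role of rater severity concrete, the sketch below implements the usual rating-scale form of the many-facet Rasch model, in which the log-odds of category k over category k-1 equal ability minus item difficulty minus rater severity minus the k-th step threshold. It is a simplified illustration, not the Facets implementation: the threshold values are hypothetical, and the "average rater" comparison only approximates how a fair average is obtained.

```python
import math
from typing import List, Sequence


def category_probabilities(ability: float, difficulty: float, severity: float,
                           thresholds: Sequence[float]) -> List[float]:
    """Probabilities of scores 0..len(thresholds) under a rating-scale MFRM."""
    theta = ability - difficulty - severity
    # Cumulative sums of (theta - threshold); the score-0 term is fixed at 0.
    cumulative = [0.0]
    for step in thresholds:
        cumulative.append(cumulative[-1] + theta - step)
    numerators = [math.exp(value) for value in cumulative]
    total = sum(numerators)
    return [value / total for value in numerators]


def expected_score(ability: float, difficulty: float, severity: float,
                   thresholds: Sequence[float]) -> float:
    """Model-expected score on the original 0-4 scale."""
    probabilities = category_probabilities(ability, difficulty, severity, thresholds)
    return sum(score * p for score, p in enumerate(probabilities))


# Hypothetical step thresholds (in logits) for a five-step, 0-4 scale.
STEPS = [-1.5, -0.5, 0.5, 1.5]

with_severe_rater = expected_score(0.42, 0.0, 0.5, STEPS)
with_average_rater = expected_score(0.42, 0.0, 0.0, STEPS)  # severity set to 0 logits
```

Under these assumptions the same examinee is expected to score lower with the more severe rater, which is the distortion that reporting scores for an average-severity rater is meant to remove.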
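The item-screening step can be sketched as follows. Infit mean-square is conventionally the sum of squared score residuals divided by the sum of the model variances of those scores, so that a value near 1 indicates the expected amount of noise. The sketch assumes the expected scores and variances have already been obtained from an estimated model; the numbers are invented for illustration, and the 2.0 cut-off is the guideline cited in the text.

```python
from typing import Sequence

INFIT_CUTOFF = 2.0  # guideline attributed in the text to Wright and Linacre (1994)


def infit_mean_square(observed: Sequence[float], expected: Sequence[float],
                      variances: Sequence[float]) -> float:
    """Information-weighted mean-square: sum of squared residuals over sum of variances."""
    squared_residuals = sum((x - e) ** 2 for x, e in zip(observed, expected))
    return squared_residuals / sum(variances)


def is_misfitting(infit: float, cutoff: float = INFIT_CUTOFF) -> bool:
    """Flag an item whose infit exceeds the chosen cut-off."""
    return infit > cutoff


# Invented ratings for one item across three test-takers.
noisy_item = infit_mean_square(observed=[4, 0, 3], expected=[2.0, 2.0, 2.0],
                               variances=[1.0, 1.0, 1.0])
well_behaved_item = infit_mean_square(observed=[3, 1, 2], expected=[2.0, 2.0, 2.0],
                                      variances=[1.0, 1.0, 1.0])
```

With these invented values the first item would be flagged for removal and the second retained.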
Comparative analyses. To address RQ2, we sought evidence for criterion-related validity through various comparative analyses. In order to investigate the relationship between the results of the SRT and the scores on the narrative comprehension test, we calculated a Pearson product-moment correlation between the Rasch fair average SRT scores and the raw scores of the narrative test.
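For readers unfamiliar with the statistic, a Pearson product-moment correlation between two sets of paired scores can be computed as in the sketch below. The two score lists are invented placeholders, not data from the study, and the function is a plain-Python illustration rather than the software actually used.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation between two equal-length samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Invented placeholder scores for five hypothetical children.
srt_fair_averages = [1.2, 1.8, 2.1, 2.9, 3.4]
narrative_raw_scores = [5, 8, 9, 13, 15]
r = pearson_r(srt_fair_averages, narrative_raw_scores)
```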
Additionally, to address RQ3 and determine the degree to which the test results align with factors explaining sign language acquisition, we set external variables in relation to the test results to explain performance differences (Haug, 2011; Mann, 2006). Variables that were examined were (1) chronological age and (2) hearing status of the parents, with the assumption that both age and having at least one Deaf parent would be predictive of higher scores on the SRT. In order to investigate the variable age, a simple linear regression analysis was applied. In order to investigate whether the parental hearing status contributed to SRT performance, an ANCOVA controlled by the covariate of age was employed.
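The simple linear regression of SRT score on age amounts to fitting a straight line by ordinary least squares. The sketch below shows the calculation with invented ages and scores; it is an illustration of the technique only, and the variable names and data are assumptions rather than values from the study.

```python
from typing import Sequence, Tuple


def simple_linear_regression(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Ordinary least squares for y = intercept + slope * x; also returns R squared."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy ** 2) / (sxx * syy)
    return slope, intercept, r_squared


# Invented example: four hypothetical children, age in years and fair average score.
ages = [7.0, 9.0, 11.0, 13.0]
fair_averages = [1.0, 1.5, 2.0, 2.5]
slope, intercept, r_squared = simple_linear_regression(ages, fair_averages)
```

A positive slope in such a model would correspond to the assumption stated above that older children obtain higher SRT scores.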

Rasch analysis
Summary statistics for the MFRM model can be seen in Table 5. The Wright map is presented in Figure 1.
As shown, the scoring instrument was able to separate the examinee sample into four distinct levels of ability with a reliability of separation of .96. This can be interpreted similarly to a Cronbach's alpha reliability coefficient (Wright & Masters, 2002), and, as such, the instrument can be understood to be highly reliable, providing evidence of scoring validity (RQ1). Although there were three children and one adult whose abilities were outlying, most examinees' abilities were grouped around the mean of 0.42 logits, and the distribution of abilities was roughly similar to the distribution of item difficulties.
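The reliability of separation R and the separation index G are linked by the standard Rasch relation R = G² / (1 + G²), so one can be recovered from the other. The sketch below inverts that relation; because the reliabilities in the text are rounded to two decimals, the resulting values are only approximate and may differ slightly from those in Table 5.

```python
import math


def separation_index(reliability: float) -> float:
    """Invert R = G^2 / (1 + G^2) to obtain the separation index G."""
    return math.sqrt(reliability / (1.0 - reliability))


examinee_separation = separation_index(0.96)  # reliability reported for examinees
item_separation = separation_index(0.97)      # reliability reported for items
```

Under this relation a reliability of .96 corresponds to a separation of roughly 4.9 and a reliability of .97 to roughly 5.7.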
The raters were very nearly equivalent in severity, with a mean severity of 0 logits and a standard deviation of .04. The reliability of the separation between their severities is .20, and a Chi-square test of their comparative severities is non-significant, indicating that there is virtually no difference between the raters' severities. Finally, the Rasch-kappa inter-rater reliability coefficient of .71 indicates a very high degree of inter-rater agreement. As such, they can be understood to be rating objectively, and therefore do not present a threat to the scoring validity of the SRT. A pairwise bias analysis revealed that two items (Items 36 and 40) were rated significantly differently by the raters. These items were then subjected to qualitative item analysis to determine whether they represented more complex linguistic structures or were longer than other sentences, either of which might explain the scoring differences. However, neither was found to be the case, suggesting that the differences were merely spurious (see the "Discussion" section below).
Finally, the items can be separated into five distinct levels of difficulty with a reliability of separation of .97. Average fit statistics are fairly close to their expected values of 1, and the fairly small standard deviations indicate that there was relatively little variation in the degree of fit among the items. After the removal of the six items with Infit MS values over 2.0, the remainder of the items all displayed adequate fit to the Rasch model, demonstrating that they measure the same latent trait, and therefore suggesting construct validity.

Comparative analyses
Comparison to the adult sample. An independent samples t-test (Table 6) revealed a significant difference between Child and Adult fair averages, with an effect size in the "large" range, according to the Plonsky and Oswald (2014) thresholds (RQ4). As this difference would be predicted by studies in the field of sign language linguistics (e.g., Rinaldi et al., 2018), this finding lends further support to an argument for context validity.
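The effect size for a two-group comparison of this kind is commonly expressed as Cohen's d, the mean difference divided by the pooled standard deviation. The sketch below computes d and labels it with the between-groups benchmarks that Plonsky and Oswald (2014) propose for L2 research (roughly .40 small, .70 medium, 1.00 large). The fair averages are invented, and the sketch assumes, rather than confirms, that d is the effect size reported in Table 6.

```python
import math
from typing import Sequence


def cohens_d(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    var_a = sum((x - mean_a) ** 2 for x in group_a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in group_b) / (n_b - 1)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (mean_a - mean_b) / pooled_sd


def label_effect(d: float) -> str:
    """Label |d| with the between-groups benchmarks of Plonsky and Oswald (2014)."""
    size = abs(d)
    if size >= 1.00:
        return "large"
    if size >= 0.70:
        return "medium"
    if size >= 0.40:
        return "small"
    return "below small"


# Invented fair averages for three hypothetical adults and three hypothetical children.
adult_scores = [3.0, 4.0, 5.0]
child_scores = [1.0, 2.0, 3.0]
d = cohens_d(adult_scores, child_scores)
```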
A pairwise bias report (Table 7) revealed seven items which exhibited significantly different difficulty estimates for the Child and Adult samples, with four (Items 11, 16, 36, and 37) being harder for the children and three (Items 18, 22, and 26) being harder for the adults.
Figure 1. Wright variable map. Adult participants are denoted by "Ad," children by "Ch," and hearing status of parents as "L1" for those with at least one Deaf parent, and "L2" for those with hearing parents. "L1" test-takers are underlined.
Parents' hearing status. To address RQ3 and determine whether Deaf children with at least one Deaf parent performed better on the SRT than Deaf children of hearing parents, we computed a one-way, between-groups analysis of covariance (ANCOVA), controlled by the covariate chronological age. There was a significant difference in the test performance between the children of Deaf parents and the children of hearing parents [F(1, 42) = 7.27, p = .010, partial η² = .148]. The parental hearing status factor explains only 14.8% of the variance of the fair average. There was also a significant relationship between the covariate chronological age and the dependent variable while controlling for the independent variable [F(1, 42) = 24.14, p < .001, partial η² = .365]. Chronological age explains 36.5% of the variance of the test performances of the test-takers (see also Figure 2). The implications of these results will be discussed in the "Discussion" section.

Discussion
Our primary goal for this study was to seek evidence of context, scoring, and criterion-related validity within the socio-cognitive framework for an SRT for DSGS aiming to assess the grammatical development of school-aged Deaf children. Although not empirically demonstrable, the item and rating scale development, and rater training (RQ4) provide some evidence for an argument of context validity, whereas evidence to support an argument for scoring validity (RQ1) of the SRT was found during the Rasch analysis. The analysis revealed four distinct levels of ability within the sample with a reliability of separation of .96, which can be interpreted similarly to a Cronbach's alpha (Wright & Masters, 2002), indicating very high reliability. Also, five distinct levels of item difficulty with a reliability of separation of .97 were found. The Rasch-kappa inter-rater reliability coefficient shows a very high degree of agreement between the two raters.
Two items (Items 36 and 40 of 40 items) were rated significantly differently by the raters. It is striking that both items occur towards the end of the test. Two potential explanations are (1) rater fatigue, leading the raters to score these late items differently, or (2) test-taker fatigue, making the responses to these items harder to score. The first issue (fatigue of raters) has been reported for spoken language assessment as well (e.g., Ling, Mollaun, & Xi, 2014). It would have been useful to conduct a follow-up interview with the raters to discuss why they scored these sentences differently (e.g., Isaacs & Thomson, 2013), but the actual scoring took place in summer 2015 and it was therefore not possible to collect any valid follow-up data.
Owing to the absence of a test that measures the same construct as the SRT, a subsample of the children (n = 19) were also tested on a DSGS narrative comprehension test (Haug & Perrollaz, 2015), in order to seek evidence of criterion-related validity (RQ2). The resulting correlation can be considered strong according to the Plonsky and Oswald (2014) thresholds. This evidence contributes to a limited argument for criterion-related validity.
Investigation of the impact of test-taker characteristics (age, parents' hearing status) on scores (RQ3) revealed that the test functioned mostly as expected based on the existing literature. The comparison between the performances of the adult users of DSGS (N = 14) and the children (N = 46) revealed a significant difference, contributing evidence of context validity (RQ4). However, seven items (Items 11, 16, 18, 22, 26, 36, and 37) showed significantly different difficulty estimates for the two samples. Of these seven items, four were more difficult for the children (Items 11, 16, 36, and 37). The authors considered whether these four items might pose a threat to validity with regard to the test-taker characteristics from an acquisition perspective; that is, whether the items represent specific linguistic structures of DSGS that might not have been acquired by all children and are therefore inappropriate for the intended test-takers. The "problem" with this hypothesis is that, for example, constructed action (e.g., Item 11), a discourse strategy using manual and non-manual components to "express a referent's actions, utterances, thoughts, feelings and/or attitudes" (Cormier, Smith, & Zwets, 2013) that is normally mastered above nine years old (e.g., Morgan, Herman, & Woll, 2002), also occurs in other items (e.g., Items 3, 6, and 12). These three items did not differ in difficulty for the two samples. For that reason, although this hypothesis was not borne out in the present study, age should continue to be investigated in future sign language SRT research.
Three of these seven items (Items 18, 22, and 26) were harder for the adults. It is impossible to look at these three items from an acquisition perspective as in the case of the items that were harder for the children (in theory, the adults should outperform the children on all items). In addition, other potential criteria that might explain performance differences, like complexity or length of the items, cannot really explain the differences between the two samples. Further investigation would be needed in the future to shed some light on the question of why these three items were more difficult for the adults than the children.
External variables contributing to the performance differences in the children's sample have been identified in the literature (e.g., Mann, 2006) and set in relation to the SRT results. Chronological age, a crucial variable in child acquisition research, significantly predicted the SRT scores of the children with a strong effect size (Cohen, 1988). This provides further evidence of criterion-related validity, as scores should be expected to increase with age and, therefore, with linguistic development and acquisition.
The variable of parents' hearing status (analogous to L1/L2), when controlled for age, predicted score differences between children with Deaf and hearing parents, but explained only 14.8% of the variance in scores, in contrast with previous work on German Sign Language assessment (Haug, 2011). This may or may not contribute to an argument for validity. On the one hand, one would expect those who have grown up using DSGS with their parents to exhibit more facility with it, in which case, this result is somewhat surprising. On the other hand, given the young age of the participants, it may simply be the case that overall linguistic development is a much better predictor of performance on an integrated test task such as an SRT, requiring sufficient experience and facility with the language not only to parse input, but recreate it in response. Clearly, further work with more varied samples is needed.

Conclusion
In this study we reported the results of the development and evaluation of an SRT for DSGS for the purpose of demonstrating scoring and criterion-related validity (RQ1 and RQ2). We also examined whether test-taker characteristics known to explain children's performance affected the scores as expected (RQ3), and we demonstrated context validity (RQ4). Although some issues may require further examination (e.g., differences in the raters' scoring of individual items; why some items were harder for the children than for the adults), the results demonstrate evidence of context, scoring, and criterion-related validity with regard to global DSGS proficiency and development, and, furthermore, provide a basis for continuing development of the SRT in question, and for encouraging others to consider using SRTs for sign language assessment.
Our study does, however, suffer from some limitations, chief of which was the relatively small sample size. It would have also been preferable to ensure that all performances were double-rated, although the raters in the present study appeared to operate virtually indistinguishably. Additionally, more background information on the test-takers should be collected, for example, information about the test-takers' non-verbal IQ and working memory skills, in order to investigate whether and to what degree cognitive resources are required to solve the task of an SRT (e.g., for spoken languages: Bartlett, 2018).
Likely future directions include a closer examination of the rating scale by comparing the results of the five-step scale of the present study to shorter and dichotomous scales, as these are more frequently found in the literature. The test remains in development, and further validation studies are underway.
Overall, our study provides important validation work in a lesser-tested language, Swiss German Sign Language, and offers an insight into the use of SRTs for sign language assessment generally. The work can be used as a template by other researchers working in similar contexts to develop and validate their own sign language tests.