Examining the Dimensionality of Linguistic Features in L2 Writing Using the Rasch Measurement Model

It has been acknowledged that second/foreign language (L2) writing is a complex, multi-dimensional cognitive process and that linguistic knowledge is the foremost predictor of L2 writing. Previous research on developing models and orientations for characterizing L2 writing and its linguistic features is based on methods rooted in classical test theory (CTT), which tend to overlook qualitative differences among writers. The use of item response theory (IRT) and Rasch models has been largely neglected in L2 writing research. This study psychometrically investigated the dimensionality of linguistic features in L2 writing using the Rasch model. To this end, 500 Iranian English as a foreign language (EFL) students wrote an essay that was marked by four experienced raters using an empirically derived descriptor-based diagnostic checklist. The scores derived from the marking of the essays were subjected to Rasch model analysis: individual item/descriptor fit, separation and reliability, unidimensionality, and local item dependency (LID) were examined. The results provided evidence for the multidimensionality of linguistic features in L2 writing. The analysis of the positive and negative item loadings on Factor 1, extracted from the Rasch model residuals, revealed two sets of descriptors that contribute to the definition of two groups of L2 writers. The first set comprises descriptors with positive loadings, mostly related to higher-level linguistic features of L2 writing, namely content fulfillment (CON) and organizational effectiveness (ORG). The second set includes descriptors with negative loadings, chiefly related to lower-level linguistic features, namely vocabulary use (VOC), grammatical knowledge (GRM), and mechanics (MCH). Implications and suggestions for further research are discussed.

L2 writing involves cognitive (e.g., planning, organizing, and translating ideas), cultural (e.g., audience expectations, cultural norms, communicative purposes), textual (e.g., coherence and structure of the written text), affective (e.g., attitudinal and emotional aspects of writing), and strategic (e.g., paraphrasing, cognates, and language resources) dimensions. Successful L2 writing requires the integration of these factors and dimensions to produce a quality text. These myriad factors complicate the assessment of L2 writing; that is, the dimensions are likely to vary depending on the specific approach, orientation, or framework used for assessing L2 writing. Typical dimensions considered for assessing L2 writing are content, language use, organization, conventions, style, genre knowledge, writing processes (e.g., planning, drafting, revising, and editing), audience and topic awareness, cultural awareness and expectations, and task response.
Over the past few decades, L2 writing scholars have presented numerous orientations and approaches toward L2 writing. Manchón (2011) states that there are three complementary perspectives on the nature of L2 writing in the literature: learning-to-write (LW), writing-to-learn-content (WLC), and writing-to-learn-language (WLL). Each of these orientations highlights distinct aspects of writing and is intricately connected to the purposes of learning and teaching writing as well as the diverse contexts in which L2 writing is acquired and instructed. Under the LW perspective, researchers concentrate on three kinds of approaches to L2 writing. The first kind is concerned with the writer and the cognitive processes used to create texts. In this writer-oriented approach, numerous models have been developed to explain the cognitive processes involved in writing (e.g., Abbott & Berninger, 1993; Alamargot & Chanquoy, 2001; Bereiter & Scardamalia, 1987; Chenoweth & Hayes, 2003; Fayol et al., 2012; Flower & Hayes, 1980, 1981, 1983; Galbraith, 2009; Grabe & Kaplan, 1996; Hayes, 2012; Kellogg, 1996; Schoonen et al., 2003, 2009). The second kind focuses on the products of writing by investigating texts as autonomous objects and discourse (e.g., Feez, 2001; Hyland, 2004; Johns, 1997). The third kind concerns how writers incorporate a sense of audience and address readers based on their expectations (e.g., Hyland, 2009).
The second perspective, WLC, focuses on how the act of writing can serve as a tool for learning disciplinary subject matter in the content areas. The core idea of the WLC revolves around the concept of transfer, which "occurs when people make use of prior experiences to address new challenges" (Cleary, 2013, p. 62). The WLC holds that students utilize writing not only to showcase their acquired knowledge in written form but also to enhance their learning by leveraging the resources that writing offers (James, 2009).
Representing the third and most recent orientation, WLL, Cumming (1990) argues that the act of writing in an L2 has the potential to improve the acquisition of L2 linguistic knowledge. This improvement arises both from the act of writing itself and from reflection upon written corrective feedback on one's own writing. Writing can encourage learners to "analyze and consolidate second language knowledge that they have previously (but not fully) acquired" (Cumming, 1990, p. 483). When engaged in L2 writing, writers are likely to need to "monitor their language production in a way that is not necessary or feasible under time constraints of comprehending or conversing in a second language" (p. 483). Consequently, it is essential to examine how L2 writers learn to write and to understand the instrumental role that writing plays in the development of an L2 in educational settings (Harklau, 2002).
Among the orientations and models proposed for describing writing, Flower and Hayes's (1981) model is the most influential. Flower and Hayes argue that writing comprises three hierarchical and dynamic processes: (1) planning, where both L1 and L2 writers engage in idea generation, organization, and goal-setting to form an internal representation of the information or knowledge they aim to convey through their writing; (2) translating, where writers convert their ideas into written text by drawing on pertinent knowledge and linguistic resources stored in memory; and (3) reviewing, where writers systematically assess and revise their texts to correct errors.
Translation serves as the primary mechanism among the various writing processes, facilitating the conversion of ideas or propositional content into suitable linguistic expressions (van Gelderen et al., 2011). This intricate process demands a thorough understanding of linguistic components such as content, sentence structure, grammar, word selection, textual coherence, organization, and mechanics (e.g., punctuation and spelling). Writing subskills such as mechanics, grammar, and vocabulary are typically regarded as lower-level, whereas higher-level subskills encompass organization and content (Schoonen et al., 2011). To produce a well-constructed text, writers must coordinate these higher- and lower-level subskills, which impose constraints on working memory capacity and influence the overall quality of written texts (Güvendir & Uzun, 2023). Difficulty in mastering these subskills can impede the improvement of L2 writing. It is thus crucial for teachers and educational experts to accurately assess and identify specific writing weaknesses or flawed strategies among students. Once problematic areas are pinpointed, students can receive appropriate and timely feedback and subsequently pursue strategies to remedy and develop their writing skills during the learning process.

| Diagnostic Assessment
Diagnostic assessment of L2 writing ability has received a great deal of attention among L2 researchers (Alderson, 2005) due to its potential to yield rich and informative diagnostic information about writers' specific weaknesses and strengths in writing (Knoch, 2011). Diagnostic assessment identifies what students have already mastered in their learning process and where they need more assistance to eliminate their deficiencies. The information obtained from diagnostic assessment can be used to support inferences about students' knowledge and help teachers tailor remedial instruction to individual students' needs (Jang, 2009). More specifically, diagnostic assessment can be viewed as a form of needs assessment aimed at pinpointing students' gaps, understanding the root causes of issues, determining priorities, and exploring potential solutions (Nation & Macalister, 2010). Diagnostic information can also be useful for providing diagnostic feedback to all stakeholders about students' learning status and ultimately enhancing the quality of learning by maximizing opportunities to learn (Jang, 2009). Many researchers have noted that providing diagnostic information can raise students' awareness of their own learning process, encourage self-regulated learning, and motivate them (Kuo et al., 2016; Zimmerman, 2000). In this regard, diagnostic assessment aligns with formative assessment or assessment for learning (AFL). In fact, it effectively combines assessment with curriculum and instruction (Pellegrino & Chudowsky, 2003). Alderson (2005) contends that there is confusion between diagnostic tests and placement or proficiency tests. However, according to Bachman (1990, p. 60), when we speak of a diagnostic test, we are generally referring to a test that has been designed and developed specifically to provide detailed information about the specific content domains that are covered in a given program or that are part of a general theory of language proficiency. Thus, diagnostic tests may be either theory- or syllabus-based. In the context of writing assessment, Alderson (2005) advocates the use of indirect tests for assessing students' writing ability rather than performance tests. However, the current trend in writing assessment favors performance tests over indirect ones. This shift is attributed to the perception that indirect tests cannot adequately and validly measure the multifaceted aspects of writing (Weigle, 2002).
An important aspect of the performance assessment of writing is rating scales. Within the context of rater-mediated performance assessment, raters are usually provided with some kind of multiple-category or ordinal rating scale to evaluate students' performance on a writing task. They utilize these scales to convey their interpretation of the quality of students' performance. However, Llosa et al. (2011) have argued that most existing tools for assessing writing [including analytic and holistic scales] are not sufficiently informative to identify the exact nature of students' problems with composing nor to provide instructionally useful feedback: Most existing assessments of writing are designed to evaluate writing achievement by eliciting an extended response to a writing prompt and evaluating the response with a holistic score. While useful for determining whether students have met performance goals in composition, such assessments do not provide sufficiently specific information to be useful for instructional purposes (p. 258). What is needed are rating scales that fulfill diagnostic purposes. In this regard, a large number of diagnostic rating scales, which will be reviewed in the following sections, have been developed to identify strengths and weaknesses in students' writing ability. Scoring students' written performance using rating scales holds significant importance in the conceptualization of validity (Kane, 2013) and is essential to the practical aspects of language testing and assessment (Knoch & Chapelle, 2018). This is because scoring establishes a critical connection between a performance and a proficiency claim (Knoch et al., 2021). It is thus necessary to provide evidence for the efficiency of the rating scales used to assess students' writing ability.

| Rating Scales for Scoring Writing
A number of researchers have suggested classifications of rating scales. The most widely cited categorization is the distinction between holistic and analytic scales (Hamp-Lyons, 1991; Lee et al., 2010; Weigle, 2002). In holistic scoring (sometimes referred to as 'impressionistic' scoring), a single score is assigned to a piece of writing based on raters' overall impression of it. Holistic scoring has been shown to be rapid and appropriate for writing characterized by internal coherence or congruency, but not for texts exhibiting internal complexity, such as those that are highly developed but frequently interrupted by grammatical errors (Weigle, 2002). Moreover, raters mostly tend to base a global score on the strengths of a text rather than its weaknesses (Jarvis et al., 2003). On the other hand, analytic scoring (sometimes referred to as 'multiple trait assessment') involves assigning a separate score to each of a number of aspects or criteria of a task. This kind of scoring has the advantages of being highly reliable, providing higher construct validity for L2 writers, and offering more diagnostic information about students' performance, because writing is measured across multiple criteria. However, empirical research has indicated that raters fail to conceptually discriminate between the different criteria of a scale (Hughes & Hughes, 2020). That is to say, raters tend not to judge each criterion independently of the others, a phenomenon known as the 'halo effect' (Cooper, 1981; Engelhard, 2013). This tendency has been shown to produce 'flat' profiles (e.g., where students are assigned ratings that are uniformly negative or uniformly positive across criteria), compromise the diagnostic capacity of the analytic approach to assessing writing, and ultimately provide misleading information about writers' performance (Hamp-Lyons, 1991; Knoch, 2009).
Another well-known categorization of rating scales concerns their construction methodology, or the way they are developed. In his dyadic typology of rating scales, Fulcher (2003) makes a distinction between intuitive methods and empirical methods. Scales constructed intuitively rely on existing scales or the intuition of scale developers to specify common features at different proficiency levels. Fulcher et al. (2011) contend that such scales are more likely to depend on theory indirectly, as the developers' beliefs are influenced by theory, but they do not originate from an examination of real performance data. In contrast, scales developed through empirical methods are grounded in real-world data and corpora and take real performance into consideration. Fulcher et al. (2011) extend this typology and distinguish between 'measurement-driven' scales and 'performance-driven' scales, which are relatively equivalent to intuitive and empirical methods, respectively. Fulcher (2012) argues that well-designed empirical or performance-driven scales better reflect the way language is naturally used in real-world situations. As Fulcher et al. (2011) argue, what is important in scale construction is that rating scales should be empirically developed and establish a meaningful connection with real-world language use. In response to Fulcher et al.'s call for developing rating scales empirically, a number of researchers have developed scales and checklists for assessing students' writing ability. For a comprehensive review of approaches to scale development, refer to Knoch (2009) and Knoch et al. (2021).

| Diagnostic Checklists for Writing Assessment
Over the past few decades, a number of researchers have focused on diagnostic assessment of writing. Two lines of research emerge in the relevant literature. The first comprises studies developing diagnostic rating scales for assessing students' writing ability (e.g., Fulcher, 1996; He et al., 2021; Kim, 2010; Knoch, 2007; Lukácsi, 2021; Ma et al., 2022; North & Schneider, 1998; Safari & Ahmadi, 2023; Struthers et al., 2013; Turner & Upshur, 2002; Upshur & Turner, 1999). For example, Knoch (2007) developed a theoretically based and empirically developed rating scale for a diagnostic writing context. The scale assesses accuracy, fluency, complexity, mechanics, reader-writer interaction, content, coherence, and cohesion. Kim (2010) also developed an empirically derived descriptor-based diagnostic (EDD) checklist using think-aloud protocols from raters. The original purpose of constructing the scale was to assess and characterize the writing of non-native English-speaking students in an academic context, specifically for the independent essay section of the Test of English as a Foreign Language™ Internet-based Test (TOEFL iBT). This checklist contains 35 descriptors assessing five sub-skills of academic writing in English: content fulfillment, organizational effectiveness, vocabulary use, grammatical knowledge, and mechanics. Following the procedure suggested by Crocker and Algina (1986), in another study, Struthers et al.
(2013) developed a 13-item checklist to assess cohesion alone in the writing of children in Grades 4-7 in order to inform instructional practices. Furthermore, Lukácsi (2021) designed a level-specific checklist for assessing English as a Foreign Language (EFL) writing proficiency at the B2 level. Based on a mixed-methods strategy of inquiry (e.g., think-aloud protocols, qualitative data from stimulated recall, and semi-structured interviews), 35 binary items were developed based on the CEFR (Common European Framework of Reference for Languages) to allow researchers and teachers to use the rating scale for level testing regardless of the test purpose. He et al. (2021) also developed a diagnostic checklist using the descriptors of China's Standards of English Language Ability (CSE). The scale consists of 15 descriptors and measures four attributes: grammatical knowledge, textual knowledge, functional knowledge, and sociolinguistic knowledge. Ma et al. (2022) further developed an English writing EDD checklist using raters' think-aloud protocols and expert judgment. The scale includes 30 descriptors measuring five attributes: content, organization, vocabulary, syntax, and mechanics. Most recently, based on L2 students' verbalizations of their challenges in integrated writing tasks, Safari and Ahmadi (2023) developed and validated a binary 30-descriptor empirically based diagnostic checklist for L2 reading-listening-writing integrated tasks. The traits are source use concerns, fulfilling task demands, organization and structure, grammatical knowledge, vocabulary knowledge, and mechanics.
The second line of research comprises studies modifying and applying the above-mentioned rating scales in different educational contexts for different purposes. An advantage of binary diagnostic checklists is that they increase the number of writing items, which, in turn, allows researchers and practitioners to apply complex statistical methods and measurement models to writing. For that reason, a large number of researchers have applied cognitive diagnostic models (CDMs; Ravand & Baghaei, 2019) to diagnose students' writing ability (e.g., Effatpanah et al., 2019; Shahsavar, 2019; Shi et al., 2024; Xie, 2017); Shi et al. (2024) used the checklist developed by Ma et al. (2022), whereas the other studies either utilized Kim's (2010) EDD checklist or modified it. Numerous studies have also used the EDD checklist to apply advanced quantitative methods to L2 writing, such as the linear logistic test model (LLTM; Fischer, 1973), an IRT-based cognitive processing model, for examining the underlying cognitive operations of L2 writing performance (Effatpanah & Baghaei, 2021), and the mixed Rasch model (Rost, 1990) for exploring multiple profiles of L2 writers (Effatpanah et al., 2024). All of the reviewed studies provide empirical evidence for the benefits of using descriptor-based diagnostic checklists for assessing students' L2 writing performance and offering fine-grained diagnostic feedback.

| The Present Study
Using the Rasch model (Rasch, 1960/1980), this study seeks to examine the dimensionality of L2 writing based on an empirically derived descriptor-based checklist. The Rasch model is a probabilistic model used to predict the outcome of encounters between persons and a set of items. It assumes that the probability of giving a correct response to an item or successfully completing a task is a function of the ability of the person and the difficulty of the item/task. The model yields a logistic function of the difference between person ability and item difficulty: the greater a person's ability relative to an item's difficulty, the higher the probability of success. Under the Rasch model, the probability that person $n$ answers item $i$ correctly, given his/her ability $\theta_n$ and the item difficulty $\beta_i$, is expressed as:

$$P(X_{ni} = 1 \mid \theta_n, \beta_i) = \frac{\exp(\theta_n - \beta_i)}{1 + \exp(\theta_n - \beta_i)}$$

where $P(X_{ni} = 1)$ is the probability of success. A distinguishing characteristic of the Rasch model is its ability to transform raw item and person scores into interval-scaled measures. This transformation places items and examinees on a calibrated scale, where their positions correspond to their difficulty and ability measures. In the present study, the following research questions were addressed:

RQ1: Are the linguistic features of L2 writing as measured by the empirically derived descriptor-based diagnostic (EDD) checklist unidimensional?

RQ2: What are the underlying psychometric linguistic dimensions of L2 writing as measured by the empirically derived descriptor-based diagnostic (EDD) checklist?
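As an illustration of the Rasch probability of success described in this section, the following minimal Python sketch computes it directly from the person-item logit difference (the function name is illustrative and not part of the study's analysis software):

```python
import math

def rasch_probability(theta, beta):
    """Probability of a correct response under the dichotomous Rasch model.

    theta: person ability in logits; beta: item difficulty in logits.
    Illustrative helper, not the study's estimation software.
    """
    return math.exp(theta - beta) / (1.0 + math.exp(theta - beta))

# When ability equals difficulty, the probability of success is 0.5;
# a person 1 logit above an item succeeds roughly 73% of the time.
print(round(rasch_probability(0.0, 0.0), 2))
print(round(rasch_probability(1.0, 0.0), 2))
```

Because only the difference between ability and difficulty enters the formula, persons and items can be placed on the same logit scale, which is what the Wright map discussed later visualizes.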

| Method

4.1 | Setting and Participants
The study conducted a secondary analysis of data obtained from a research project carried out by Effatpanah et al. (2019) to diagnose Iranian EFL students' writing performance using cognitive diagnostic models (CDMs). The data were collected from 500 Iranian EFL students aged between 19 and 58 (M = 24.89 years, SD = 6.30) from different classes and universities in Iran. The students were taught by 32 non-native full-time English teachers following various materials and syllabi; the teachers' experience in teaching L2 writing to adults varied from 9 to 24 years. There were 349 female (69.8%) and 151 male (30.2%) participants. This gender imbalance reflects the typical distribution of students in English departments at many Iranian universities, which are heavily dominated by females. The sample included 212 junior, 152 senior, and 136 postgraduate students. Of the total sample, 104 (20.8%) studied English Translation, 128 (25.6%) English Literature, and 268 (53.6%) Teaching English as a Foreign Language (TEFL). The students reported English learning backgrounds ranging from three to more than ten years and amounts of English use ranging from 'cannot say' to 'almost every day'. All participants were native speakers of Persian and were studying English as an academic major. They were informed that the retrieved data would remain anonymous and confidential, and informed consent was obtained from all students.
Four experienced raters (1 female and 3 male), aged 28 to 39 years (M = 31.75, SD = 4.99), were also recruited to assess the essays. They had an average of 14.3 years of experience in teaching and assessing L2 writing.
One rater was a Ph.D. candidate in TEFL and a university lecturer, while the remaining three raters held master's degrees in TEFL and had achieved an overall score of 8 on the IELTS exam. All of the raters were non-native English speakers (with Persian as their first language and English as their foreign language) with native or near-native English language proficiency.

| Instrument

4.2.1 | Writing Prompt
The students were required to write an essay of at least 350 words in response to the following writing prompt: "How to be a first-year college student? Write about the experience you have had. Make a guide for students who might be in a similar situation. Describe how to make new friends, how to escape homesickness, how to be successful in studying, etc." The writing prompt was administered as a class activity in L2 writing courses. The students were urged to rely on their own abilities and to refrain from using any books, dictionaries, or internet resources while writing, so as to give the researchers and their teachers the opportunity to assess their strengths and weaknesses in L2 writing.

| The Empirically-derived Descriptor-based Diagnostic (EDD) Checklist
To operationalize the measurement of writing, a diagnostic assessment scale called the empirically derived descriptor-based diagnostic (EDD) checklist (Kim, 2010, 2011) was used.

| Rater Training and Formal Scoring
Within the context of rater-mediated performance assessment, rater variation is a major threat to construct validity in the rating procedure (Wind & Peterson, 2018). A 2-hour training session was held prior to scoring the essays to mitigate potential inconsistencies (construct-irrelevant variance) or variations across the raters in this study. This session involved instructing raters on the use of the scale and moderating discussions to elucidate the content of each descriptor separately. Raters were also trained in how to interpret the yes/no response options: when writers generally met the standards, a yes response was recommended; otherwise, the no option was chosen. Following the guidelines of Weigle (2002), a small subset of essays was provided to the raters so that they could familiarize themselves with the scale and its specific properties and compare their performance with one another. After training, the essays were randomized and grouped into four packages. Each rater received 125 essays and copies of the checklist. Thirty-five essays were included in each package to be scored by all the raters. The total score on the checklist ranged from 0 to 35, with a mean of 17.62 and a standard deviation of 7.59. As a preliminary check, the Cronbach's alpha reliability of the checklist was computed; the obtained value of 0.89 is highly satisfactory. The degree of inter-rater reliability was also examined using Pearson correlation analysis; the correlation coefficient across the raters was 0.82. According to Dancey and Reidy's (2004) criteria, coefficients smaller than 0.30 are considered weak, between 0.30 and 0.60 moderate, and above 0.60 strong. Cohen's Kappa was also computed, and a value of 0.62 was obtained, indicating substantial agreement among the raters.
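For readers who wish to reproduce such preliminary checks, the two agreement indices reported here can be computed from their standard formulas. A minimal sketch (the function names and any example data are illustrative, not the study's data):

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_persons, n_items) matrix of item scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()      # sum of item variances
    total_var = scores.sum(axis=1).var(ddof=1)        # variance of total scores
    return (k / (k - 1)) * (1.0 - item_vars / total_var)

def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters' binary (0/1) judgments."""
    r1, r2 = np.asarray(rater1), np.asarray(rater2)
    p_obs = np.mean(r1 == r2)                         # observed agreement
    p1, p2 = r1.mean(), r2.mean()
    p_chance = p1 * p2 + (1 - p1) * (1 - p2)          # chance agreement
    return (p_obs - p_chance) / (1 - p_chance)
```

Kappa corrects raw agreement for the agreement expected by chance, which is why a value of 0.62 can accompany a considerably higher raw percentage of identical judgments.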
To ensure that students' scores were not affected by construct-irrelevant factors, such as rater severity, that can confound measurement, a three-faceted many-facet Rasch measurement (MFRM; Linacre, 1989) analysis was performed using the software FACETS, Version 3.71 (Linacre, 2014a). For intra-rater consistency, infit and outfit mean squares (MNSQs), two rater variability indices, were computed. Linacre (2014b) recommended a range of 0.50-1.50 for both fit indices. With outfit values between 0.90 and 1.10 and infit values between 0.91 and 1.06, all the raters achieved sufficient fit to the model, indicating satisfactory self-consistency in scoring the essays using the descriptors. Rater severity measures varied from -0.09 to 0.16, suggesting negligible differences in rating severity. Two measures of the variance of rater severity, rater separation reliability (R = 0.89) and the rater separation ratio (G = 2.82), were also examined with respect to the variation in raters' severity. Finally, an inspection of the vertical ruler (an interval scale comparing the locations of items/descriptors, persons, and raters on the same logit-calibrated scale) showed that rater severity did not affect the students' marks.

| Data Analysis
The ratings awarded by the raters were analyzed with the WINSTEPS computer package, version 3.73 (Linacre, 2009a), to examine the fit of the data to the Rasch model. For the purpose of this study, individual item/descriptor fit, separation and reliability, unidimensionality, and local item dependency (LID) were examined. Item difficulty parameters are estimated based on the proportion of examinees who get an item right or successfully complete a task, regardless of those examinees' ability levels. They indicate the locations of items/descriptors on the latent trait continuum and are expressed in log-odds units, or logits. Unlike CTT, the Rasch model provides an error of measurement index for each item, which indicates the precision of the estimated item difficulty parameters.
Furthermore, item and person reliability and separation indices were examined. Person and item reliability indices serve as indicators of the scale's precision in measuring examinee ability and item difficulty (Linacre, 2009b). The separation index is the ratio of the item or person true standard deviation to the error standard deviation (i.e., the root mean square error (RMSE)). It signifies the degree to which person and item parameters are distinct on the latent trait. Item separation is used to verify the hierarchy of the items of a scale: when separation values are low (< 3 for high, medium, and low item difficulties, item reliability < 0.9), the sample size is not sufficient to validate the hierarchy of item difficulties within the scale (Linacre, 2009b). Person separation, by contrast, is used to classify examinees: low person separation values (< 2, person reliability < 0.8) imply that the scale may lack the sensitivity to differentiate between examinees at varying proficiency levels (Linacre, 2009b). Separation values range from zero to infinity. A higher separation value for persons/items indicates a greater probability that persons/items with higher ability/difficulty estimates indeed possess higher true values than those with lower estimates (Linacre, 2009b).
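The link between the separation ratio and separation reliability can be made concrete: both follow from the observed spread of the estimated measures and the average error of estimation. A minimal sketch under these standard definitions (the helper name is illustrative):

```python
import math

def separation_stats(sd_observed, rmse):
    """Separation ratio G and separation reliability R.

    sd_observed: standard deviation of the estimated measures;
    rmse: root mean square of their standard errors.
    """
    true_var = max(sd_observed ** 2 - rmse ** 2, 0.0)  # error-adjusted "true" variance
    g = math.sqrt(true_var) / rmse                     # separation ratio
    r = true_var / sd_observed ** 2                    # separation reliability
    return g, r
```

The two indices are linked by R = G^2 / (1 + G^2); for instance, a separation ratio of G = 2.82, as reported for the raters above, corresponds to a separation reliability of about 0.89.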
To investigate the fit of the data to the model, outfit and infit mean square (MNSQ) fit statistics were calculated (Linacre, 2002). Both statistics indicate the extent to which the items of a scale align with the underlying latent variable being measured and involve dividing items' chi-square values by their degrees of freedom (Linacre, 2009b). As Linacre (2002) argued, outfit MNSQ is an outlier-sensitive fit statistic which detects erratic response patterns from examinees on items that are relatively very easy or very difficult for them. Infit MNSQ, on the other hand, is an inlier-sensitive fit statistic, identifying unexpected response patterns on items that are well targeted to examinees. Linacre (2009b) considered values between 0.50 and 1.50 an acceptable range for fit indices. Moreover, point-measure correlations were estimated for all items/descriptors to measure the extent to which observed scores align with the expected latent trait. Point-biserial (or point-measure) correlations indicate the extent to which responses to each item within a scale are correlated with the overall measure.
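At the formula level, both fit statistics are averages of squared standardized residuals: outfit weights every response equally, whereas infit weights responses by their model variance (information), which is what makes it inlier-sensitive. A minimal illustrative sketch, assuming observed 0/1 responses and model probabilities for one item (not the WINSTEPS implementation):

```python
import numpy as np

def fit_mnsq(observed, expected):
    """Outfit and infit mean squares for one item across persons.

    observed: 0/1 responses; expected: Rasch model probabilities
    of success for the same persons.
    """
    x = np.asarray(observed, dtype=float)
    p = np.asarray(expected, dtype=float)
    w = p * (1 - p)                          # model variance of each response
    z2 = (x - p) ** 2 / w                    # squared standardized residuals
    outfit = z2.mean()                       # unweighted: outlier-sensitive
    infit = ((x - p) ** 2).sum() / w.sum()   # information-weighted: inlier-sensitive
    return outfit, infit
```

When responses behave exactly as the model expects, both statistics hover around 1; a single highly improbable response inflates outfit much more than infit, which matches the interpretation given above.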
An important feature of the Rasch model is its ability to generate a ruler-like device, functioning as an interval scale.This facilitates the comparison of item/descriptor and person locations.Specifically, it creates an item-person map, often referred to as the Wright map, which represents item difficulty and person ability estimates on the same metric calibrated in logits.The Wright map visually illustrates how items/descriptors are distributed in relation to the abilities of examinees.
Moreover, the unidimensionality of the scale was checked. Unidimensionality is an important assumption in Rasch and IRT models which posits that all items of a scale should measure a single latent trait, that is, only one latent trait should explain variability in the observed responses. This requirement is fundamental for measurement and, more importantly, item construction, because a scale that claims to measure different levels of one latent trait should not be affected by different levels of another latent trait (Ziegler & Hagemann, 2015). The Rasch model is a parametric model which imposes certain assumptions for unidimensionality. When the assumptions of the model hold (e.g., the data fit the Rasch model), it is an indication that all the items of a scale measure a single latent trait, and that persons and items can be located on a common interval continuum. It must be noted that "unidimensionality does not imply that performance on items is due to a single psychological process. In fact, a variety of psychological processes are involved in responding to a set of items. However, as long as they are involved in unison-that is, performance on each item is affected by the same process and in the same form-unidimensionality will hold" (Bejar, 1983, p. 31). Items can still be regarded as unidimensional if they measure the same processes to the same degree (Fischer, 1997). Because infit and outfit MNSQ fit statistics exhibit minimal susceptibility to systematic threats against unidimensionality (Smith & Plackner, 2009, p. 424), the unidimensionality of the scale was checked using principal component analysis of linearized Rasch residuals (PCAR). Because items usually do not adhere perfectly to the expectations of the Rasch model, some residuals remain after data-model fit. Residuals denote the disparities between the predictions of the Rasch model and the actual observations (i.e., the observed data). As the unexpected part of the data, residuals are expected to be uncorrelated and randomly distributed (Linacre, 2009b). Smaller residual values indicate a closer fit to the model. It must be noted that the expected latent trait is excluded from the analysis when PCAR is conducted based on standardized residuals. When the data have adequate fit to the Rasch model, the latent trait is expected to account for all information in the data, with the residuals reflecting random noise. Any component extracted from the residuals is thus identified as a second dimension, indicating a breach of the unidimensionality assumption (Linacre, 2009b). The strength of the emergent component (i.e., its capacity for explaining the common variance in the data) is compared with the strength of the target dimension. According to Linacre (2009b), eigenvalues smaller than 2 verify the unidimensionality of the scale.
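To make the residuals concrete: for a dichotomous response, the standardized residual divides the difference between the observed score and the Rasch-expected probability by the model standard deviation. A minimal Python sketch with hypothetical values (the study's PCAR was computed by Rasch software, not this code):

```python
import math

def std_residual(x, theta, b):
    """Standardized Rasch residual for a dichotomous response x (0 or 1):
    (observed - expected) / sqrt(variance), where expected = p and
    variance = p * (1 - p) under the Rasch model."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return (x - p) / math.sqrt(p * (1.0 - p))

# A 'Yes' from a person far below the descriptor's difficulty is highly
# unexpected, so the residual is large and positive:
print(round(std_residual(1, theta=-2.0, b=2.0), 2))  # 7.39
# The same 'Yes' from a person far above it is unremarkable:
print(round(std_residual(1, theta=2.0, b=-2.0), 2))  # 0.14
```

PCAR then searches for shared structure (a potential second dimension) in the matrix of such residuals across persons and items.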
In the PCAR, loadings indicate the correlation between the items and an off-target secondary component extracted from the residuals that is unrelated to the primary target dimension (Baghaei & Cassady, 2014). Items/Descriptors with positive and negative loadings form two distinct sets that are orthogonal to the target dimension. Items with a correlation close to zero do not contribute to this secondary component. A high loading on the secondary component indicates that the item is associated with the off-target dimension and is more likely to have a weaker correlation with the target Rasch dimension or the latent trait (Baghaei & Cassady, 2014). Scrutiny of the content of contrasting clusters of items with high negative and positive loadings (exceeding ±0.40) helps identify meaningful interpretations of the secondary components as additional dimensions (Linacre, 2009b).
To assess the unidimensionality of a scale, Smith (2002) suggested the estimation of person parameters based on two subsets of a scale. Unidimensionality implies that person parameter estimates should remain consistent regardless of the subset of items encountered by examinees. If the ability estimate of an examinee varies between different subsets of the scale, it indicates that the data may reflect more than one dimension, threatening the construct validity of the scale (Baghaei & Cassady, 2014). The equality of estimates is investigated using t-tests. Statistically significant results suggest a lack of equality across the subsets with regard to person parameters and the presence of additional dimensions in the scale.
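Smith's procedure reduces to a per-person comparison of the two subset estimates relative to their standard errors. A hypothetical sketch (the estimates, standard errors, and function name below are illustrative, not taken from the study's data):

```python
import math

def subset_t(theta1, se1, theta2, se2):
    """Compare the ability estimates one examinee receives from two item
    subsets, following the logic of Smith's (2002) t-test approach:
    t = (theta1 - theta2) / sqrt(se1^2 + se2^2)."""
    return (theta1 - theta2) / math.sqrt(se1 ** 2 + se2 ** 2)

# Hypothetical examinee: 1.10 logits on one subset of descriptors,
# -0.30 logits on the other, each estimate with SE = 0.45:
t = subset_t(1.10, 0.45, -0.30, 0.45)
print(round(t, 2), abs(t) > 1.96)  # 2.2 True
```

Under unidimensionality, only about 5% of examinees should exceed the critical value by chance alone.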
The results of item loadings on the first factor extracted from the residuals are also illustrated as a map by WINSTEPS. The plot visually illustrates the distribution of item loadings on the off-target dimension, with higher loadings at the two extremes. Items/Descriptors at the upper end exhibit positive loadings, while those at the lower end display negative loadings. If there is no discernible pattern in the residuals of items, they are expected to disperse across various regions of the map without clustering in either the positive or negative loading regions (Linacre, 2009b). Conversely, notable dimensions lead to clusters of items referred to as "contrasts," which emerge in opposing regions of the plot based on their loading values. These contrasts reflect structural differences among examinees with regard to their performance in a test.
Finally, LID was also evaluated using Pearson correlation analysis of linearized Rasch residuals. As stated by Wright (1994), residuals indicate the extent to which the items of a scale are locally easier or more difficult than the expectation of the model. Substantial correlations between the residuals of two items suggest local dependency, signifying potential shared dimensions or replicated features (Linacre, 2009b). After removing the Rasch dimension, locally dependent item pairs show notably high positive or negative residual correlations. Consistent with Christensen et al.'s (2017) recommendation, correlations exceeding 0.20 indicate the presence of local dependency.
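The LID check itself is just a Pearson correlation computed over the residual vectors of an item pair. A self-contained sketch with hypothetical residual values:

```python
def pearson(xs, ys):
    """Plain Pearson correlation (equivalent to what Rasch software
    reports for pairs of standardized item residuals)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs)
           * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

# Hypothetical standardized residuals for two descriptors across six
# examinees; a shared leftover pattern signals local dependence:
res_a = [0.8, -1.1, 0.4, -0.6, 1.2, -0.7]
res_b = [0.9, -0.9, 0.5, -0.8, 1.0, -0.5]

r = pearson(res_a, res_b)
print(round(r, 2), r > 0.20)  # 0.98 True (exceeds the 0.20 cut-off)
```

A correlation this far above 0.20 would flag the pair as locally dependent under Christensen et al.'s (2017) criterion.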

| Item/Descriptor Characteristics and Fit Statistics
Table 1 shows descriptive statistics of the data, including mean, standard deviation (SD), skewness, and kurtosis, computed with SPSS for Windows, Version 25, as well as item difficulty measures in logits, standard errors of measurement, infit and outfit MNSQ statistics, and point-measure correlations (PT-Measures). As can be seen, Item/Descriptor 16 had the highest mean score (M = 0.82, SD = 0.385), and Item/Descriptor 35 (M = 0.12, SD = 0.323) had the lowest. The skewness and kurtosis values of some descriptors were high, possibly due to their difficulty measures causing an imbalance in the distribution tails (Bachman, 2004). Because the data are dichotomous, a descriptor being either easy or difficult results in a high prevalence of 1.00 or 0.00 (correct or incorrect) cases, respectively. This creates an asymmetry in the data shape, with a skew towards the side with lower frequency, leading to a long-tailed distribution and a high skewness value.
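The link between dichotomous item difficulty and skewness can be verified with a quick calculation; the response vectors below are hypothetical, chosen to mirror a descriptor endorsed by about 12% of examinees (cf. Descriptor 35):

```python
from statistics import mean, stdev

def skewness(xs):
    """Simple Fisher-Pearson sample skewness."""
    m, s, n = mean(xs), stdev(xs), len(xs)
    return sum((x - m) ** 3 for x in xs) / (n * s ** 3)

hard = [1] * 12 + [0] * 88      # hard descriptor: mostly 0s, tail of 1s
moderate = [1] * 50 + [0] * 50  # moderate descriptor: balanced

print(round(skewness(hard), 2))      # strongly positive: long upper tail
print(round(skewness(moderate), 2))  # near zero: symmetric
```

As the sketch shows, the further an item's endorsement rate departs from 50%, the more extreme its skewness, which matches the pattern reported in Table 1.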
The results also showed that item difficulty parameters ranged from -1.94 to 2.56 logits, with item reliability and separation values of 0.99 and 8.78, respectively. Descriptors 19 and 16 were the easiest descriptors, and Descriptors 35 and 34 were the most difficult. Person estimates ranged from -3.89 to 4.08, with person reliability and separation values of 0.87 and 2.62, respectively, suggesting a wide range of examinees' abilities. With regard to infit and outfit MNSQ fit statistics, except for Descriptor 22, all the fit values were within the acceptable boundary of 0.50-1.50. Descriptor 22 is a poor descriptor for inclusion in the EDD checklist and can be excluded in future studies. The point-measure correlations also indicated that all correlations are positive and mostly medium. These values indicate a substantial agreement between the patterns of item difficulties in the data and the Rasch model (Linacre, 2009b).
A Wright map of the concurrent distribution of persons and the difficulty of the descriptors of the EDD checklist is shown in Figure 1. The first column shows the interval scale in logits. The dotted dividing line represents the hypothetical latent trait continuum. On the left side of the line, hash marks (#) indicate the distribution of examinees. On the right side of the line, numbers show the difficulty of the descriptors. On both sides of the dotted line, M represents the mean, and S and T mark one and two standard deviations from the mean, respectively. Descriptors and examinees range from the easiest and least able at the bottom to the most difficult and most able at the top of the scale, respectively. Descriptors/Items are expected to be located across the whole scale to effectively measure the 'ability' of all examinees (Bond et al., 2020). As can be seen in Figure 1, the descriptors cover a wide range of difficulty, although a few more difficult descriptors should be added to capture the ability of the most able examinees at the top of the scale. This provides evidence for the representativeness of the descriptors. PCAR was carried out to check the unidimensionality of the checklist. The results of PCAR revealed that the model explains 30.9% of the observed variance; 14% is explained by item measures, and 16.9% by person measures. The observed model variance was higher than the model expectation of 30.7%, although 69.1% of the variance is still unexplained. The first factor (contrast) accounted for 4.9% of the unexplained variance, with an eigenvalue of 2.5, which was higher than the critical value of 2, indicating the multidimensionality of the checklist.
Table 2 gives the loadings of the 35 descriptors on the first factor identified in the PCAR, excluding the first component. The first factor showed that there are two sets of descriptors that contribute to the definition of the EDD checklist. Descriptors with positive loadings on Factor 1 are mostly associated with higher-level linguistic features of L2 writing, such as content fulfillment (CON) and organizational effectiveness (ORG). In contrast, descriptors with negative loadings are chiefly associated with lower-level linguistic features, including vocabulary use (VOC), grammatical knowledge (GRM), and mechanics (MCH). To assess whether the two sets produce equivalent person parameter estimates, 500 t-tests were run to compare the person parameters derived from the two sets of descriptors for all the examinees. The analysis showed that 81 of the 500 t-tests (16.2%) were statistically significant. As argued by Smith (2002), fewer than 5% of the t-tests should be significant for a scale to be considered unidimensional. The findings of the present study thus strongly support a lack of equality across the two sets of descriptors, indicating the multidimensionality of the EDD checklist.
Loading patterns of descriptors on the first hypothesized factor in the linearized residuals are presented in Figure 2. Descriptors with negative loadings are represented by lowercase letters and located at the bottom end, while descriptors with positive loadings are represented by capital letters and located at the top. As can be seen, the residuals of the descriptors have formed two distinguishable clusters. In fact, the descriptors have not scattered across the map, so the EDD checklist is multidimensional.
Finally, the largest standardized residual correlations for identifying dependent items are provided in Table 3. As can be seen, all the correlations of item residual pairs are higher than 0.20, suggesting that the items share a large proportion of their random variance, which is an indication of dependency among item pairs.

Figure 2 Plot of Item Loadings on the First Factor in PCAR

| Discussion
This study set out to use the Rasch model to psychometrically investigate the dimensionality of linguistic features in L2 writing. To operationalize the measurement of L2 writing, an empirically-derived descriptor-based diagnostic (EDD) checklist (Kim, 2010, 2011) commonly used within the context of L2 writing research was utilized. The analysis of item/descriptor difficulty parameters, fit statistics, and point-measure correlations showed the conformity between the observed data and the expectations of the Rasch model, although one descriptor (i.e., Descriptor 22) had a poor fit. This descriptor needs to be carefully analyzed before being used in future studies. Item and person separation and reliability coefficients were also evaluated. The results of the LID analysis also indicated the existence of dependency between some item pairs of the checklist. This could be due to replicated features or potential shared dimensions across descriptors of the checklist. However, according to Schroeders et al. (2014), such dependencies represent linguistic knowledge and "lie at the core of the construct; they are by no means irrelevant for the measurement of language abilities" (p. 414). It is highly recommended that future studies which intend to develop new rating scales for assessing L2 writing avoid including descriptors with many replicated features. Furthermore, the unidimensionality of the checklist was analyzed using PCAR. The results showed that the eigenvalue of the first factor (contrast) was greater than 2, suggesting the multidimensionality of the EDD checklist. The analysis of item loadings further revealed the involvement of two sets of descriptors in defining the checklist. The comparison of person estimates across the two sets of descriptors showed a lack of equality across the two sets and provided further evidence for the multidimensionality of the checklist and of L2 writing.
A closer analysis of negative and positive item loadings showed that descriptors that loaded negatively on Factor 1 were mainly related to lower-level linguistic features, such as vocabulary use (VOC), grammatical knowledge (GRM), and mechanics (MCH). In contrast, descriptors that loaded positively on Factor 1 were primarily associated with higher-level linguistic features of L2 writing, including content fulfillment (CON) and organizational effectiveness (ORG). This can be considered empirical evidence that there are two groups of L2 writers who are qualitatively different with respect to using linguistic features in their writing. This finding converges with Effatpanah et al. (2024), who recently conducted a study to detect multiple classes of L2 writers using the mixed Rasch model (Rost, 1990). The authors recognized two classes of L2 writers. The first class consists of L2 writers who are inclined to attend more to lower-level linguistic features, including grammar, vocabulary, and mechanics, at the sentence level in order to produce a written text. On the other hand, the second class is characterized by L2 writers who prioritize higher-level linguistic features, such as content and organization, aiming to move beyond the boundaries of a sentence and pay more attention to the whole structure of a paragraph. This finding also supports Schoonen and De Glopper's (1996) contention that writers of different proficiency levels perceive the qualities of a well-written text differently. Writers with lower proficiency tend to prioritize lower-level features such as layout, grammar, vocabulary, and mechanics, while those with a higher level of proficiency concentrate on more advanced or higher-order features, including text organization and content.
Another finding of the study is that, unlike in reading and listening comprehension, a strict classification of writing sub-skills into lower- and higher-level categories may not be legitimate. Rather, each writing sub-skill can involve both higher- and lower-level constituents. For example, vocabulary items which are more frequent and widely used can be retrieved more easily than sophisticated vocabulary items which require a greater cognitive processing load to be elicited. Similarly, simple grammatical structures can be easily retrieved and used in a written text, but complex grammatical structures require more cognitive capacity. This could be the reason why some descriptors, marked as higher-level linguistic features, were more difficult for the group who tends to focus on lower-level linguistic features at the sentence level.
This account of the hierarchy of linguistic features in L2 writing seems plausible because less proficient L2 writers have limited linguistic knowledge (e.g., content, vocabulary, grammar, organization, and mechanics) and generate written texts marked by many shortcomings, including limited intelligibility, lack of cohesion and coherence, distorted organization, and numerous lower-order issues such as grammar and spelling (Trapman et al., 2018). Consequently, lower-level writers are more likely to resort to certain strategies to make up for their lack of linguistic knowledge. This process can be explained by the inhibition and compensation assumptions proposed by van Gelderen et al. (2011), who argue that

According to the inhibition assumption, inexperienced writers' inefficient use of grammatical and lexical knowledge impedes their monitoring of text production on the level of content and their use of higher order strategies for optimizing text quality on a global level … . According to the compensation assumption, … inexperienced writers may well be able to adopt higher order strategies in writing, although their capability for efficient retrieval and production of linguistic elements is still limited. For example, by using efficient strategies, working memory capacity can be spent on sequential processing of different aspects of the writing task, instead of simultaneous processing. (p. 283)

| Conclusion
This study has several theoretical, pedagogical, and methodological implications. From a theoretical standpoint, the findings of this study supported the multidimensionality of the linguistic features in L2 writing. Understanding the dimensionality of linguistic features in L2 writing enables L2 writing researchers and scholars to characterize the nature of writing. This understanding facilitates the development of a coherent and comprehensive model for L2 writing, as well as the refinement of existing theories related to L2 writing production.
From a pedagogical standpoint, examining qualitative differences among L2 writers with regard to several linguistic features can prove to be a valuable approach for addressing individual variation. The analysis of qualitative differences among L2 writers allows teachers to pay much more attention to different sets of linguistic features (e.g., higher- and lower-level) in writing classes, identify problematic aspects of writing ability, offer targeted and appropriate feedback to individual students, devise more effective tasks, activities, and remedial materials tailored to each student, enhance writing instruction, and evaluate L2 writing. Students themselves can adopt strategies to overcome deficiencies in their learning process and enhance their writing proficiency. From a methodological standpoint, this study extends previous studies on the dimensionality of L2 writing by using the standard dichotomous Rasch model. More importantly, researchers typically use methods rooted in CTT and MFRM to investigate the psychometric qualities of rating scales. However, the use of various group-level inter-rater agreement and inter-rater consistency indices for a given dataset can generate inconsistent results, and the presence of high agreement and reliability among raters does not necessarily represent the precision of ratings (Wind & Peterson, 2018). Furthermore, studies on the use of MFRM mostly tend to focus on the performance of raters, especially their severity and leniency, and/or the interaction of raters with other factors such as the gender of examinees. Too little attention has been devoted to the analysis of the dimensionality of a scale and its construct. Another important implication of the present study for research on validating rating scales within the context of L2 writing is that checking item loadings and LID can provide valuable information for the development and validation of rating scales for assessing L2 writing. Examining item loadings can reveal qualitative differences that underlie the performance of examinees. Similarly, investigating LID allows researchers to avoid including descriptors and elements with replicated features.
This study had limitations that could be addressed in future research. First, the generalizability of the present results is limited to the current sample under investigation. Further similar studies are needed to shed more light on the generalizability of the findings from this study. Second, as repeatedly argued, a problem with binary diagnostic checklists is the dichotomization of an integrated skill like writing, although researchers, practitioners, and teachers acknowledge that writing competence cannot be reduced to a binary distinction. An intriguing avenue for further research involves examining the dimensionality of linguistic features in L2 writing across various academic tasks and genres. Such analyses would provide more in-depth insights into structural or qualitative differences among L2 writers with regard to their (meta)cognitive processes, especially their use of linguistic features.

Funding
The authors did not receive support from any organization for the submitted work.

Figure 1 Wright Map of the Distribution of Persons and Descriptors of the EDD Checklist on the Latent Variable

The checklist comprises 35 binary (Yes, No) descriptors measuring five writing sub-skills. The sub-skills are: (1) content fulfillment (CON), assessing a student's ability to address a given question by ensuring unity and relevance in supporting sentences, information, and examples; (2) vocabulary use (VOC), assessing a student's ability to accurately and appropriately use a broad array of lexical items; (3) grammatical knowledge (GRM), assessing a student's ability to accurately show syntactic variety and complexity; (4) organizational effectiveness (ORG), assessing a student's ability to cohesively and coherently develop and organize ideas and supporting sentences within and between paragraphs; and (5) mechanics (MCH), assessing a student's ability to adhere to English writing conventions, including indentation and margins, capitalization, spelling, and punctuation.

Table 1
Descriptive Statistics, Item Measures, Fit Statistics, and Point-Measure Correlations

Table 2
Item Loadings for the Descriptors of the EDD Checklist on the First Factor in PCA of Residuals

Table 3
The Largest Standardized Residual Correlations