Cortegal. Corpus of Galician Texts Written by Students in an Academic Context
- 1. Universidade de Santiago de Compostela
- 2. Instituto da Lingua Galega
Description
One of the aims of the corpus is to assess students’ command of the Galician standard variety and to identify the main deviations from normative language and academic register. For this reason, all forms and sequences that diverge from the standard are annotated with codes that indicate the type of deviation and, in some cases, its possible origin. These non-standard forms are accompanied by their corresponding standard expressions. Standardisation is applied at six linguistic levels (orthography, morphology, lexicon, syntax, semantics, and discourse) through a multilayer annotation system, which lets users view the texts at different layers of correction.
The dataset includes two types of files:
- XML files encoded in XML-TEI format and annotated using the TEITOK platform. These contain:
  - Full transcriptions of handwritten texts produced by students.
  - Structural elements such as paragraph divisions, line breaks, and segment boundaries, reflecting the original layout and structure of the students’ handwritten texts.
  - Struck-through forms (words deleted by the student during writing).
  - Inserted forms (additions between lines or above crossed-out words).
  - Uncertain readings, i.e., forms or segments where the transcription could not be determined with full certainty.
  - Standardized versions at six linguistic levels: orthographic, morphological, lexical, syntactic, semantic, and discourse-related.
  - Error codes indicating non-standard forms, including their type and possible source (e.g., analogy, language transfer).
  - Lemmatization and part-of-speech tags, both standard and original.
  - Textual connectors, classified according to discourse function.
  - Metadata about each document (exam session, topic, number of words, etc.).
- Plain text files, corresponding to the student’s final version of each text. These exclude any struck-through forms and present the cleaned final version as written by the student.

You can find more information at https://ilg.usc.gal/cortegal/en/index.php?
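The relationship between the two file types can be illustrated with a minimal Python sketch that derives a "final version" reading from an annotated fragment by skipping struck-through forms. The fragment is invented for illustration. The element names (`tok`, `del`, `add`) follow common TEI and TEITOK conventions, and the `nform` attribute for a standardized form is an assumption; the actual Cortegal files may use different names, several attributes (one per correction layer), or an XML namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: element names, attribute names, and the text
# itself are assumptions for illustration, not taken from the corpus.
SAMPLE = """<text>
  <p>
    <tok id="w-1">A</tok>
    <del><tok id="w-2">lingua</tok></del>
    <add><tok id="w-3" nform="lingua">lingoa</tok></add>
    <tok id="w-4">galega</tok>
  </p>
</text>"""


def final_tokens(elem, skip=("del",)):
    """Yield (original, standardized) pairs, ignoring tokens inside skipped elements."""
    if elem.tag in skip:
        return  # struck-through material is not part of the final version
    if elem.tag == "tok":
        original = "".join(elem.itertext()).strip()
        # Fall back to the original form when no standardized form is recorded.
        yield original, elem.get("nform", original)
        return
    for child in elem:
        yield from final_tokens(child, skip)


root = ET.fromstring(SAMPLE)
pairs = list(final_tokens(root))
original_text = " ".join(orig for orig, _ in pairs)
standard_text = " ".join(std for _, std in pairs)

print(original_text)
print(standard_text)
```

The same traversal could be extended to collect error codes, lemmas, or part-of-speech tags once the attribute names used in the real files are known.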
The project that led to the creation of the corpus was funded by the Secretaría Xeral de Política Lingüística of the Xunta de Galicia, and later by the Spanish Ministry of Science, Innovation and Universities (Corpus of Galician Texts Written by Students in an Academic Context. A Tool for the Analysis of Writing Competence in Galician Language, PGC2018-096069-B-I00).
By accessing and using this dataset, you agree to comply with all applicable laws and ethical standards regarding the protection of individual rights. Users are strictly prohibited from using the dataset in any way that infringes upon the rights, privacy, or dignity of any individual represented within it. Any misuse, including but not limited to attempts to engage in discriminatory, harmful, or unlawful activities, is expressly forbidden.