Published June 5, 2025 | Version 1.0
Dataset Restricted

Cortegal. Corpus of Galician Texts Written by Students in an Academic Context

  • 1. Universidade de Santiago de Compostela
  • 2. Instituto da Lingua Galega

Description

Access to this corpus is granted on request, after accepting the terms and conditions.
 
CORTEGAL, Corpus of Galician Texts Written by Students in an Academic Context, is a corpus made up of 1,000 essays written by pre-university students as part of the Galician Language and Literature exam in the university entrance assessment (ABAU). Specifically, the texts were taken from the commentary section of these exams, in which students must write an argumentative text of between 200 and 250 words on a given topic.
 

One of the aims of the corpus is to assess students’ command of the Galician standard variety, as well as to identify the main deviations from normative language and academic register. For this reason, all forms and sequences that diverge from the standard are annotated with codes that indicate the type of deviation and, in some cases, its possible origin. These non-standard forms are accompanied by their corresponding standard expressions. Standardisation is applied across six linguistic levels (orthography, morphology, lexicon, syntax, semantics, and discourse) through a multilayer annotation system, allowing users to visualise the texts across different layers of correction.


The dataset includes two types of files:
  1. XML files encoded in XML-TEI format and annotated using the TEITOK platform. These contain:

    • Full transcriptions of handwritten texts produced by students.

    • Structural elements such as paragraph divisions, line breaks, and segment boundaries, reflecting the original layout and structure of the students’ handwritten texts.
    • Struck-through forms (words deleted by the student during writing).

    • Inserted forms (additions between lines or above crossed-out words).

    • Uncertain readings, i.e., forms or segments where the transcription could not be determined with full certainty.

    • Standardized versions at six linguistic levels: orthographic, morphological, lexical, syntactic, semantic, and discourse-related.

    • Error codes indicating non-standard forms, including their type and possible source (e.g., analogy, language transfer).

    • Lemmas and part-of-speech tags for both the original and the standardized forms.

    • Textual connectors, classified according to discourse function.

    • Metadata about each document (exam session, topic, number of words, etc.).

  2. Plain text files, corresponding to the student’s final version of each text. These exclude any struck-through forms and present the cleaned final version as written by the student.

You can find more information at https://ilg.usc.gal/cortegal/en/index.php?
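As a rough illustration of how the two file types relate, the sketch below derives a final-version token sequence from a TEI-like fragment by skipping struck-through material. It is only a sketch under stated assumptions: the sample fragment is invented and not taken from the corpus, `<del>`, `<add>` and `<unclear>` are standard TEI elements, and the `<tok>` element with an `nform` attribute follows common TEITOK practice; the element and attribute names actually used in the corpus files may differ, so check them against the distributed XML before reusing this.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, NOT taken from the corpus. <del>, <add> and <unclear>
# are standard TEI elements; <tok> and nform are assumed from common TEITOK
# usage and may not match the corpus's real markup.
SAMPLE = """
<text>
  <p>
    <tok>A</tok> <del><tok>lingua</tok></del> <add><tok>fala</tok></add>
    <tok nform="galega">galeja</tok> <unclear><tok>importa</tok></unclear>
  </p>
</text>
"""


def final_tokens(root, normalized=False):
    """Collect token forms in document order, skipping anything inside <del>."""
    tokens = []

    def walk(elem):
        if elem.tag == "del":
            # Struck-through material is excluded from the final version.
            return
        if elem.tag == "tok":
            form = elem.text or ""
            if normalized:
                # Fall back to the original form when no standard form is given.
                form = elem.get("nform", form)
            tokens.append(form)
            return
        for child in elem:
            walk(child)

    walk(root)
    return tokens


root = ET.fromstring(SAMPLE)
print(" ".join(final_tokens(root)))                   # student's final version
print(" ".join(final_tokens(root, normalized=True)))  # assumed standardized layer
```

Inserted forms and uncertain readings are kept here because they belong to the student's final text; only deletions are dropped. A real file would also need namespace handling if the TEI namespace is declared.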

The project that led to the creation of the corpus was funded by the Secretaría Xeral de Política Lingüística of the Xunta de Galicia, and later by the Spanish Ministry of Science, Innovation and Universities (Corpus of Galician Texts Written by Students in an Academic Context. A Tool for the Analysis of Writing Competence in Galician Language, PGC2018-096069-B-I00).

Terms and Conditions

By accessing and using this dataset, you agree to comply with all applicable laws and ethical standards regarding the protection of individual rights. Users are strictly prohibited from using the dataset in any way that infringes upon the rights, privacy, or dignity of any individual represented within it. Any misuse, including but not limited to attempts to engage in discriminatory, harmful, or unlawful activities, is expressly forbidden.
 
 

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.
