iRead4Skills Dataset 2: annotated corpora by level of complexity for FR, PT and SP
Creators
- Pintard, Alice (Researcher) 1, 2
- François, Thomas (Researcher) 1, 2
- Justine, Nagant de Deuxchaisnes (Other) 1, 2
- Barbosa, Sílvia (Researcher) 3, 4
- Reis, Maria Leonor (Researcher) 4, 3
- Moutinho, Michell (Researcher) 4, 3
- Monteiro, Ricardo (Researcher) 3, 4
- Amaro, Raquel (Project leader) 3, 4
- Correia, Susana (Work package leader) 3, 4
- Rodríguez Rey, Sandra (Researcher) 5, 6
- Mu, Keran (Researcher) 7
- Garcia González, Marcos (Researcher) 5, 6
- Bernárdez Braña, André (Researcher) 5, 6
- Blanco Escoda, Xavier (Researcher) 7
Description
The Dataset 2: annotated corpora by level of complexity for FR, PT and SP is a collection of texts categorized by complexity level and annotated for complexity features, presented in Excel format (.xlsx). These corpora were compiled and annotated within the scope of the project iRead4Skills – Intelligent Reading Improvement System for Fundamental and Transversal Skills Development, funded by the European Commission (grant number: 1010094837). The project aims to enhance reading skills in the adult population by creating an intelligent system that assesses text complexity and recommends suitable reading materials to adults with low literacy skills, thereby helping to reduce skills gaps and facilitating access to information and culture (https://iread4skills.com).

This dataset is the result of specifically devised classification and annotation tasks, in which selected texts were organized and distributed to trainers in Adult Learning (AL) and Vocational Education and Training (VET) centres, as well as to adult students in those centres. The tasks were conducted via the Qualtrics platform.

The Dataset 2: annotated corpora by level of complexity for FR, PT and SP is derived from the iRead4Skills Dataset 1: corpora by level of complexity for FR, PT and SP (https://doi.org/10.5281/zenodo.10055909), which comprises written texts of various genres and complexity levels. From this collection, a sample of texts was selected for classification and annotation. The classification and annotation task aimed to provide additional data and test sets for the complexity analysis systems for the three languages of the project: French, Portuguese, and Spanish. The sample texts in each language corpus were selected taking into account the diversity of topics/domains and genres, and the reading preferences of the target audience of the iRead4Skills project.
The sample amounted to a total of 462 texts per language, divided by level of complexity as follows:

- 140 Very Easy texts
- 140 Easy texts
- 140 Plain texts
- 42 More Complex texts

Trainers and students were asked to classify the texts according to the complexity levels of the project, here informally defined as:

- Very Easy: everyone can understand the text or most of the text.
- Easy: a person with less than the 9th year of schooling can understand the text or most of the text.
- Plain: a person with the 9th year of schooling can understand the text the first time he/she reads it.
- More Complex: a person with the 9th year of schooling cannot understand the text the first time he/she reads it.

Annotators were also asked to mark the parts of the texts they considered complex according to various types of features, at word level and at sentence level (e.g., word order, sentence composition, etc.). The full details regarding the students' and trainers' tasks, the qualitative and quantitative description of the data, and inter-annotator agreement are described here: https://zenodo.org/records/14653180

The results are presented here in Excel format. For each language and for each group (trainers and students), there are two files, an annotation file and a classification file, resulting in four files per language and twelve files in total. In all files, the data is organized as a matrix, with each row representing an 'answer' from a particular participant and the columns containing various details about that specific input.
The content of the column “File Name” is color-coded: a green shade indicates a text with a lower level of complexity, and a red shade indicates one with a higher level of complexity. The complete datasets are available under the Creative Commons CC BY-NC-ND 4.0 licence.
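Since each row of the classification files holds one answer from one participant, a natural first step is to aggregate the answers per text. The following is only a minimal sketch of that idea, not the project's own analysis. The column names "Participant" and "Level", the participant codes, the text identifiers and the tie-breaking rule are all assumptions made for illustration; only "File Name" and the four level labels come from the description above.

```python
from collections import Counter, defaultdict

# Complexity levels of the project, ordered from least to most complex.
LEVELS = ["Very Easy", "Easy", "Plain", "More Complex"]

# Hypothetical rows standing in for one classification file.
# "Participant" and "Level" are assumed column names, and the values
# below are invented examples, not data from the actual files.
answers = [
    {"Participant": "T01", "File Name": "text_001", "Level": "Easy"},
    {"Participant": "T02", "File Name": "text_001", "Level": "Easy"},
    {"Participant": "T03", "File Name": "text_001", "Level": "Plain"},
    {"Participant": "T01", "File Name": "text_002", "Level": "More Complex"},
    {"Participant": "T02", "File Name": "text_002", "Level": "Plain"},
    {"Participant": "T03", "File Name": "text_002", "Level": "More Complex"},
]


def tally_levels(rows):
    """Count how often each complexity level was assigned to each text."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["File Name"]][row["Level"]] += 1
    return counts


def majority_level(counter):
    """Return the most frequent level; ties go to the less complex level.

    The tie-breaking rule is an arbitrary choice made for this sketch.
    """
    return max(LEVELS, key=lambda level: (counter[level], -LEVELS.index(level)))


per_text = tally_levels(answers)
summary = {name: majority_level(counts) for name, counts in per_text.items()}
print(summary)
```

With the real files, the rows could instead be read from the .xlsx sheets (for example with `pandas.read_excel`) and converted to dictionaries, after replacing the assumed column names with the actual headers.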
Additional details
Funding
- European Commission
- HORIZON-CL2-2022-TRANSFORMATIONS-01-07 – Conditions for the successful development of skills matched to needs (grant number: 1010094837)