iRead4Skills Dataset 1: corpora by complexity level for FR, PT and SP

Pintard, Alice; François, Thomas; Nagant de Deuxchaisnes, Justine; Barbosa, Sílvia; Reis, Maria Leonor; Moutinho, Michell; Monteiro, Ricardo; Amaro, Raquel; Correia, Susana; Rodríguez Rey, Sandra; Garcia González, Marcos; Mu, Keran; Blanco Escoda, Xavier

doi:10.5281/zenodo.13127399

Published July 25, 2024 | Version v3

Dataset Restricted

1. UCLouvain
2. CENTAL
3. Universidade Nova de Lisboa
4. CLUNL
5. CITIUS
6. Universidade de Santiago de Compostela
7. Universitat Autònoma de Barcelona

The iRead4Skills Dataset 1: corpora by level of complexity for FR, PT and SP is a collection of written texts of several genres and levels of complexity, in txt format, compiled under the scope of the project iRead4Skills – Intelligent Reading Improvement System for Fundamental and Transversal Skills Development. The project, funded by the European Commission (grant number: 1010094837), aims to improve reading skills in the adult population by creating an intelligent system that assesses text complexity and suggests appropriate reading materials to adults with low literacy skills, thereby helping to reduce skills gaps and to provide access to information and culture (https://iread4skills.com/).

The compilation of this first dataset was based on the complexity levels established as relevant for the project (Very Easy (approx. A1), Easy (approx. A2) and Plain (approx. B1)) and on the expected needs of learners and trainers. For some genres, there are also texts of a more complex level. The data will provide the basis for the training and test sets for the complexity analysis systems for the three languages of the project: French, Portuguese, and Spanish. The dataset will be further enhanced, validated, and annotated by end-users, giving rise to forthcoming versions and to a second, derived dataset.

The resource is composed of three subcorpora: French, Portuguese and Spanish.

Each subcorpus spans different complexity levels and covers texts from the following communication domains:

01_personal communication

02_institutional/professional communication

03_social media

04_commercial communication/dissemination

05_non-fiction book

06_fiction book

07_didactic book

08_academic/school

09_political communication/dissemination

10_legal documentation

11_religious texts/dissemination
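
The domain labels above follow a "two-digit code, underscore, name" pattern. As a minimal sketch, assuming a user wants to look up domains by their numeric code, the labels can be split as shown below. The helper name `split_domain` and the lookup table `DOMAIN_BY_CODE` are hypothetical; the record does not state how the files themselves are named or organised, so nothing here should be read as the dataset's actual layout.

```python
# Domain labels copied verbatim from the list above.
DOMAINS = [
    "01_personal communication",
    "02_institutional/professional communication",
    "03_social media",
    "04_commercial communication/dissemination",
    "05_non-fiction book",
    "06_fiction book",
    "07_didactic book",
    "08_academic/school",
    "09_political communication/dissemination",
    "10_legal documentation",
    "11_religious texts/dissemination",
]


def split_domain(label):
    """Split a label such as '06_fiction book' into (6, 'fiction book').

    Hypothetical helper: only the first underscore is treated as the separator.
    """
    code, name = label.split("_", 1)
    return int(code), name


# Illustrative lookup table from numeric code to domain name.
DOMAIN_BY_CODE = dict(split_domain(label) for label in DOMAINS)

if __name__ == "__main__":
    for code, name in sorted(DOMAIN_BY_CODE.items()):
        print(f"{code:02d}: {name}")
```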

French corpus:

Number of texts: 2 199

Number of tokens: 530 298

Spanish corpus:

Number of texts: 2 563

Number of tokens: 960 644

Portuguese corpus:

Number of texts: 2 915

Number of tokens: 802 125
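
For a quick comparison of the three subcorpora, the counts above can be combined into totals and a mean text length. The sketch below uses only the figures reported in this record; the dictionary layout and the function name `mean_tokens_per_text` are illustrative choices, and the record does not say how tokens were counted.

```python
# Text and token counts as reported in this record.
CORPORA = {
    "French": {"texts": 2199, "tokens": 530298},
    "Spanish": {"texts": 2563, "tokens": 960644},
    "Portuguese": {"texts": 2915, "tokens": 802125},
}


def mean_tokens_per_text(stats):
    """Average number of tokens per text for one subcorpus."""
    return stats["tokens"] / stats["texts"]


# Totals across the three subcorpora.
total_texts = sum(stats["texts"] for stats in CORPORA.values())
total_tokens = sum(stats["tokens"] for stats in CORPORA.values())

if __name__ == "__main__":
    for name, stats in CORPORA.items():
        print(f"{name}: {mean_tokens_per_text(stats):.1f} tokens per text")
    print(f"All: {total_texts} texts, {total_tokens} tokens")
```

Under these figures the Spanish texts are the longest on average and the French texts the shortest, which may matter when balancing training and test sets across languages.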

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Request access

If you would like to request access to these files, please fill out the form below.

The data files are open (i.e., they are accessible online, free of charge to the user, and re-usable) under the CC BY-NC-ND 4.0 license.

By accessing and using this dataset, you acknowledge that, because some of the texts in the corpus may still be under copyright, access is granted only for the inspection of research results and to ensure their reproducibility. The texts cannot be published freely or used for any purpose not compliant with the CC BY-NC-ND 4.0 license.

Please fill out the form below or send a request to: iread4skills@fcsh.unl.pt.

Additional details

European Commission
HORIZON-CL2-2022-TRANSFORMATIONS-01-07 – Conditions for the successful development of skills matched to needs 1010094837

