iRead4Skills Dataset 1: corpora by complexity level for FR, PT and SP
Creators
- Pintard, Alice (Researcher)1, 2
- François, Thomas (Project leader)1, 2
- Nagant de Deuxchaisnes, Justine (Other)1, 2
- Barbosa, Sílvia (Researcher)3, 4
- Reis, Maria Leonor (Researcher)3, 4
- Moutinho, Michell (Researcher)3, 4
- Monteiro, Ricardo (Researcher)3, 4
- Amaro, Raquel (Project leader)3, 4
- Correia, Susana (Researcher)3, 4
- Rodríguez Rey, Sandra (Researcher)5, 6
- Garcia González, Marcos (Project leader)5, 6
- Mu, Keran (Researcher)7
- Blanco Escoda, Xavier (Project leader)7
Description
The iRead4Skills Dataset 1: corpora by level of complexity for FR, PT and SP is a collection of written texts of several genres and levels of complexity, in txt format, compiled within the scope of the project iRead4Skills – Intelligent Reading Improvement System for Fundamental and Transversal Skills Development. The project, funded by the European Commission (grant number: 1010094837), aims to improve reading skills in the adult population by creating an intelligent system that assesses text complexity and suggests appropriate reading materials to adults with low literacy skills, thereby helping to reduce skills gaps and to provide access to information and culture (https://iread4skills.com/).
The compilation of this first dataset was based on the complexity levels established as relevant for the project (Very Easy (approx. A1), Easy (approx. A2) and Plain (approx. B1)) and on the expected needs of learners and trainers. For some genres, there are also texts of a more complex level. The data will provide the basis for the training and test sets for the complexity analysis systems for the three languages of the project: French, Portuguese, and Spanish. The dataset will be further enhanced, validated, and annotated by end-users, giving rise to forthcoming versions and to a second, derived dataset.
The resource is composed of three subcorpora: French, Portuguese and Spanish.
Each subcorpus covers different complexity levels and includes texts from the following communication domains:
01_personal communication
02_institutional/professional communication
03_social media
04_commercial communication/dissemination
05_non-fiction book
06_fiction book
07_didactic book
08_academic/school
09_political communication/dissemination
10_legal documentation
11_religious texts/dissemination
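The domain labels above follow a two-digit code, an underscore, and a name. As a minimal sketch, assuming only that labelling pattern (the record does not describe how files or folders are named inside the archive, and the names DOMAIN_LABELS, parse_domain_label and DOMAINS are hypothetical), the labels can be turned into a lookup table:

```python
# Domain labels copied verbatim from the list above.
DOMAIN_LABELS = [
    "01_personal communication",
    "02_institutional/professional communication",
    "03_social media",
    "04_commercial communication/dissemination",
    "05_non-fiction book",
    "06_fiction book",
    "07_didactic book",
    "08_academic/school",
    "09_political communication/dissemination",
    "10_legal documentation",
    "11_religious texts/dissemination",
]


def parse_domain_label(label: str) -> tuple[int, str]:
    """Split an 'NN_name' label into (numeric code, domain name).

    Only the first underscore is treated as the separator, so a name
    that itself contained an underscore would be kept intact.
    """
    code, name = label.split("_", 1)
    return int(code), name


# Hypothetical lookup table: numeric domain code -> domain name.
DOMAINS = dict(parse_domain_label(label) for label in DOMAIN_LABELS)

if __name__ == "__main__":
    for code, name in sorted(DOMAINS.items()):
        print(f"{code:02d}  {name}")
```

Whether the same codes appear in the dataset's file or folder names is an assumption to check against the downloaded archive before relying on this table.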
French corpus:
Number of texts: 2 199
Number of tokens: 530 298
Spanish corpus:
Number of texts: 2 563
Number of tokens: 960 644
Portuguese corpus:
Number of texts: 2 915
Number of tokens: 802 125
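The reported counts can be combined into totals and an average text length per subcorpus. The following is a small sketch of that arithmetic using only the figures listed above; the variable and function names are illustrative, and the tokenization behind the token counts is not stated in this record.

```python
# Figures copied from the corpus statistics above; the spaces are
# thousands separators, so they are kept as strings and parsed below.
REPORTED = {
    "French": {"texts": "2 199", "tokens": "530 298"},
    "Spanish": {"texts": "2 563", "tokens": "960 644"},
    "Portuguese": {"texts": "2 915", "tokens": "802 125"},
}


def to_int(figure: str) -> int:
    """Remove whitespace used as a thousands separator and convert to int."""
    return int("".join(figure.split()))


STATS = {
    lang: {key: to_int(value) for key, value in row.items()}
    for lang, row in REPORTED.items()
}

TOTAL_TEXTS = sum(row["texts"] for row in STATS.values())
TOTAL_TOKENS = sum(row["tokens"] for row in STATS.values())

# Average text length per subcorpus, in tokens.
MEAN_TOKENS = {lang: row["tokens"] / row["texts"] for lang, row in STATS.items()}

if __name__ == "__main__":
    for lang, mean in MEAN_TOKENS.items():
        print(f"{lang}: {mean:.1f} tokens per text")
    print(f"All subcorpora: {TOTAL_TEXTS} texts, {TOTAL_TOKENS} tokens")
```

By these figures, the three subcorpora together contain 7 677 texts and 2 293 067 tokens, with average text lengths of roughly 241 (French), 375 (Spanish) and 275 (Portuguese) tokens.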
Additional details
Funding
- European Commission: HORIZON-CL2-2022-TRANSFORMATIONS-01-07 – Conditions for the successful development of skills matched to needs (grant number: 1010094837)