Nos_CorpusNOS-GL: Galician Macrocorpus for LLM training
Creators
- Universidade de Santiago de Compostela
Description
CorpusNÓS is a massive Galician corpus of 2.1B words, devised primarily for training large language models. Its sources are varied and cover a relatively wide range of genres.
The corpus is structured as follows:
Subcorpus: Data obtained via transfer agreement

Genre | Nº tokens | Nº documents |
---|---|---|
Books | 7.255.784 | 104 |
Research articles | 2.665.351 | 664 |
Press | 124.253.084 | 224.419 |
Governmental | 245.897.880 | 654.505 |
Web contents | 15.946.686 | 44.165 |
Encyclopedic | 4.799.214 | 47.396 |
Subtotal | 400.817.999 | 971.253 |

Subcorpus: Public data

Genre | Nº tokens | Nº documents |
---|---|---|
Press and blogs | 153.497.883 | 665.265 |
Encyclopedic | 57.164.848 | 184.628 |
Web crawls | 1.384.015.664 | 3.366.449 |
Translation corpora | 133.726.004 | 4.745.799 |
Subtotal | 1.728.404.399 | 8.777.514 |

Total (both subcorpora) | 2.129.222.398 | 9.748.767 |
Following this structure, the corpus contains two folders, one for each subcorpus, and within each subcorpus folder there is one folder per genre. Files are in plain text format (*.txt), and individual documents inside each file are separated by two line breaks.
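The layout described above can be read with a few lines of Python. The following is a minimal sketch, not part of the corpus distribution: it assumes that "two line breaks" means documents are separated by at least one blank line, and that files are UTF-8 encoded. The function names `iter_documents` and `count_documents` are hypothetical helpers introduced here for illustration.

```python
from pathlib import Path
from typing import Iterator


def iter_documents(txt_path: Path) -> Iterator[str]:
    """Yield the documents of one corpus file, one string per document.

    Assumption: documents are separated by one or more blank lines,
    and the file is UTF-8 encoded. The file is streamed line by line,
    so large files do not need to fit in memory.
    """
    buffer: list[str] = []
    with txt_path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                # Non-empty line: still inside the current document.
                buffer.append(line.rstrip("\n"))
            elif buffer:
                # Blank line after content: the document is complete.
                yield "\n".join(buffer)
                buffer = []
    if buffer:
        # Last document when the file does not end with a blank line.
        yield "\n".join(buffer)


def count_documents(corpus_root: Path) -> dict[str, int]:
    """Count documents per folder, keyed by the path relative to the root."""
    counts: dict[str, int] = {}
    for txt_path in sorted(corpus_root.rglob("*.txt")):
        key = txt_path.parent.relative_to(corpus_root).as_posix()
        counts[key] = counts.get(key, 0) + sum(1 for _ in iter_documents(txt_path))
    return counts
```

Calling `count_documents(Path("corpusnos"))`, where `corpusnos` stands for whatever directory the archive was extracted to, would give per-folder document counts that can be compared against the tables above. If the separator convention turns out to differ, only `iter_documents` needs to change.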
Note: Some of the files referenced may be missing from this version of the corpus because of pending transfer agreements. They will be included in a future version as soon as they are available for publishing.
Note: The following subcorpora keep their original licenses, as specified in the paper: TED2020 (CC BY-NC-ND 4.0), mC4 (Apache License 2.0), OSCAR (CC0).
For more details, please refer to our paper, "CorpusNÓS: A massive Galician corpus for training large language models".
If you use this data in your work, please cite:
de-Dios-Flores, Iria, Silvia Paniagua Suárez, Cristina Carbajal Pérez, Daniel Bardanca Outeiriño, Marcos Garcia and Pablo Gamallo. 2024. CorpusNÓS: A massive Galician corpus for training large language models. Proceedings of the 16th International Conference on Computational Processing of Portuguese - ACL Anthology (Volume 1), 593-599.
Files
Name | Size |
---|---|
corpusnos.zip (md5:48187e20e44191f272d97b387db9792a) | 5.3 GB |
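The MD5 checksum listed for the archive can be used to confirm that a download completed without corruption. The sketch below uses only Python's standard `hashlib` module. The helper names `md5_of_file` and `is_intact`, and the local path in the usage note, are assumptions introduced for illustration.

```python
import hashlib
from pathlib import Path

# Checksum as listed on this record for the archive.
EXPECTED_MD5 = "48187e20e44191f272d97b387db9792a"


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Read fixed-size chunks so a multi-gigabyte archive is never
        # loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: Path, expected: str = EXPECTED_MD5) -> bool:
    """Compare the file's digest with the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

For example, `is_intact(Path("corpusnos.zip"))` should return `True` for an uncorrupted download, assuming the archive was saved under that name in the current directory.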