There is a newer version of the record available.

Published February 21, 2024 | Version 1.0.0
Dataset Open

Nos_CorpusNOS-GL: Galician Macrocorpus for LLM training

Description

CorpusNÓS is a massive Galician corpus made up of 2.1B words primarily devised for training large language models. The corpus sources are varied and represent a relatively wide range of genres.

The corpus is structured as follows:

Subcorpus: Data obtained via transfer agreement

Genre              Nº tokens      Nº documents
Books              7.255.784      104
Research articles  2.665.351      664
Press              124.253.084    224.419
Governmental       245.897.880    654.505
Web contents       15.946.686     44.165
Encyclopedic       4.799.214      47.396
Subtotal           400.817.999    971.253

 

Subcorpus: Public data

Genre                Nº tokens        Nº documents
Press and blogs      153.497.883      665.265
Encyclopedic         57.164.848       184.628
Web crawls           1.384.015.664    3.366.449
Translation corpora  133.726.004      4.745.799
Subtotal             1.728.404.399    8.777.514

Total                2.129.222.398    9.748.767

Following this structure, the corpus contains two folders, one for each subcorpus. Within each subcorpus folder there is one folder per genre. Files are in plain text format (*.txt), and individual documents inside each file are separated by two line breaks.
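The layout described above can be read with a short script. The following is a minimal sketch, not an official loader: it assumes that "two line breaks" means two consecutive newline characters (one blank line between documents), that files are UTF-8 encoded, and that the unpacked archive follows a root/subcorpus/genre folder layout. The function names `iter_documents` and `iter_corpus` are hypothetical.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_documents(txt_file: Path) -> Iterator[str]:
    """Yield each document stored in one corpus file.

    Assumption: documents are separated by two consecutive newline
    characters and the file is UTF-8 encoded.
    """
    text = txt_file.read_text(encoding="utf-8")
    for chunk in text.split("\n\n"):
        doc = chunk.strip()
        if doc:  # skip empty chunks produced by extra blank lines
            yield doc


def iter_corpus(root: Path) -> Iterator[Tuple[str, str, str]]:
    """Yield (subcorpus, genre, document) for every *.txt file under root.

    Assumption: files live under root/<subcorpus>/<genre>/, possibly
    in deeper subfolders.
    """
    for txt_file in sorted(root.rglob("*.txt")):
        parts = txt_file.relative_to(root).parts
        if len(parts) < 3:
            continue  # not inside a subcorpus/genre folder
        subcorpus, genre = parts[0], parts[1]
        for doc in iter_documents(txt_file):
            yield subcorpus, genre, doc
```

Reading a whole file into memory keeps the sketch simple. For the largest web-crawl files, a line-by-line reader that accumulates text until it meets a blank line may be a better fit.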

Note: Some of the files referenced may be missing from this version of the corpus due to pending transfer agreements. They will be included in a future version as soon as they are available for publishing.

Note: The following subcorpora keep their original licenses, as specified in the paper: TED2020 (CC BY-NC-ND 4.0), mC4 (Apache License 2.0), OSCAR (CC0).

For more details, please refer to our paper, "CorpusNÓS: A massive Galician corpus for training large language models".

If you use this data in your work, please cite:

de-Dios-Flores, Iria, Silvia Paniagua Suárez, Cristina Carbajal Pérez, Daniel Bardanca Outeiriño, Marcos Garcia and Pablo Gamallo. 2024. CorpusNÓS: A massive Galician corpus for training large language models. Proceedings of the 16th International Conference on Computational Processing of Portuguese - ACL Anthology (Volume 1), 593-599.

Files

corpusnos.zip (5.3 GB)
md5:48187e20e44191f272d97b387db9792a
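The MD5 value listed for corpusnos.zip can be used to check that a download is complete. The sketch below uses Python's standard `hashlib` module and reads the archive in chunks. The helper names `md5_of_file` and `verify_archive` are hypothetical, and the local file path is whatever location the archive was saved to.

```python
import hashlib
from pathlib import Path

# MD5 checksum as listed for corpusnos.zip in this record.
EXPECTED_MD5 = "48187e20e44191f272d97b387db9792a"


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path: Path, expected: str = EXPECTED_MD5) -> bool:
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, `verify_archive(Path("corpusnos.zip"))` should return `True` for an intact copy of this version of the archive.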