Nos_CorpusNOS-GL: Galician Macrocorpus for LLM training

de-Dios-Flores, Iria; Paniagua Suárez, Silvia; Bardanca, Daniel; Gamallo, Pablo; García, Marcos; Ramom Pichel Campos, José; Carbajal Pérez, Cristina; Moscoso Sánchez, Antonio; Francisco Marini, Jose Javier; Canosa Pérez, Cristian

doi:10.5281/zenodo.10687642

There is a newer version of the record available.

Published February 21, 2024 | Version 1.0.0

Dataset Open

1. Universidade de Santiago de Compostela

CorpusNÓS is a massive Galician corpus made up of 2.1B words primarily devised for training large language models. The corpus sources are varied and represent a relatively wide range of genres.

The corpus is structured as follows:

Subcorpus: Data obtained via transfer agreement	Genre	Nº tokens	Nº documents
	Books	7.255.784	104
	Research articles	2.665.351	664
	Press	124.253.084	224.419
	Governmental	245.897.880	654.505
	Web contents	15.946.686	44.165
	Encyclopedic	4.799.214	47.396
	Subtotal	400.817.999	971.253

Subcorpus: Public data	Genre	Nº tokens	Nº documents
	Press and blogs	153.497.883	665.265
	Encyclopedic	57.164.848	184.628
	Web crawls	1.384.015.664	3.366.449
	Translation corpora	133.726.004	4.745.799
	Subtotal	1.728.404.399	8.777.514
	Total	2.129.222.398	9.748.767

Following this structure, the corpus contains two folders, one for each subcorpus, and within each of them there is a folder for each genre. Files are in plain text format (*.txt), and individual documents inside each file are separated by two line breaks.
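The layout described above can be read with a few lines of Python. This is a minimal sketch, not the authors' tooling: the function names `split_documents` and `iter_corpus_documents` are hypothetical, the files are assumed to be UTF-8 encoded, and "two line breaks" is interpreted as two or more consecutive newline characters.

```python
import re
from pathlib import Path
from typing import Iterator


def split_documents(text: str) -> list[str]:
    """Split the contents of one corpus file into individual documents."""
    # Normalise Windows line endings so the separator check is uniform.
    text = text.replace("\r\n", "\n")
    # Assumption: documents are separated by two or more newline characters.
    parts = re.split(r"\n{2,}", text)
    # Drop empty pieces left by leading or trailing separators.
    return [part.strip() for part in parts if part.strip()]


def iter_corpus_documents(root: Path) -> Iterator[tuple[Path, str]]:
    """Yield (file path, document text) for every *.txt file under root."""
    # rglob walks the subcorpus and genre folders recursively.
    for path in sorted(root.rglob("*.txt")):
        # Assumption: files are UTF-8 encoded.
        for document in split_documents(path.read_text(encoding="utf-8")):
            yield path, document
```

Calling `iter_corpus_documents(Path("corpusnos"))` on the unpacked archive (the folder name here is a placeholder) would stream documents one file at a time. Some files may be large, so a line-by-line reader could be preferable if memory is tight.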

Note: Some of the referenced files may be missing from this version of the corpus due to pending transfer agreements; they will be included in a future version as soon as they are available for publication.

Note: The following subcorpora keep their original licenses, as specified in the paper: TED2020 (CC BY-NC-ND 4.0), mC4 (Apache License 2.0), OSCAR (CC0).

For more details, please refer to our paper, "CorpusNÓS: A massive Galician corpus for training large language models".

If you use this data in your work, please cite:

de-Dios-Flores, Iria, Silvia Paniagua Suárez, Cristina Carbajal Pérez, Daniel Bardanca Outeiriño, Marcos Garcia and Pablo Gamallo. 2024. CorpusNÓS: A massive Galician corpus for training large language models. Proceedings of the 16th International Conference on Computational Processing of Portuguese - ACL Anthology (Volume 1), 593-599.

Files

Files (5.3 GB)

Name	Size
corpusnos.zip (md5:48187e20e44191f272d97b387db9792a)	5.3 GB
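The MD5 checksum listed for corpusnos.zip can be used to confirm that a download completed without corruption. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `verify_download` are hypothetical, and the local file path is whatever location the archive was saved to.

```python
import hashlib
from pathlib import Path

# Checksum published on the record for corpusnos.zip.
EXPECTED_MD5 = "48187e20e44191f272d97b387db9792a"


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read 1 MiB at a time so the 5.3 GB archive never sits in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected: str = EXPECTED_MD5) -> bool:
    """Compare the digest of a local file against the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

For example, `verify_download(Path("corpusnos.zip"))` would return `True` only if the local copy matches the published checksum. MD5 is adequate for detecting transfer errors, but it is not a guarantee against deliberate tampering.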

	All versions	This version
Views	1,778	840
Downloads	599	204
Data volume	3.3 TB	1.2 TB

Resource type

Dataset

Publisher

Zenodo

Published in

Association for Computational Linguistics, Proceedings of the 16th International Conference on Computational Processing of Portuguese, 593-599, ISSN: 0736-587X, 2024.

Conference

International Conference on Computational Processing of Portuguese (PROPOR), Santiago de Compostela, 12-15 March 2024

Languages

Galician

Licenses

Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

Apache License 2.0

A permissive license whose main conditions require preservation of copyright and license notices. Contributors provide an express grant of patent rights. Licensed works, modifications, and larger works may be distributed under different terms and without source code.

Creative Commons Zero v1.0 Universal

CC0 waives copyright interest in a work you've created and dedicates it to the world-wide public domain. Use CC0 to opt out of copyright entirely and ensure your work has the widest reach.

Creative Commons Attribution Non Commercial No Derivatives 4.0 International

Technical metadata

Created: February 21, 2024
Modified: April 30, 2024
