Published March 22, 2021 | Version 2.0
Dataset Open

TeCla: Text Classification Catalan dataset

Description

Corpus of Catalan news articles for text classification, extracted from the website of the Agència Catalana de Notícies under a CC-BY-NC-ND licence.

TeCla (Text Classification) is a Catalan news corpus for thematic multi-class text classification tasks. The present version (2.0) contains 113,376 articles classified under a hierarchical class structure: each article carries a coarse-grained class and a fine-grained class. Each of the 4 coarse-grained classes accepts a subset of the fine-grained ones, of which there are 53 in total.
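The two-level labelling described above can be pictured as a mapping from each coarse-grained class to the set of fine-grained classes it accepts. The sketch below is illustrative only: the label names (`coarse_A`, `fine_1`, and so on) and the helper functions are hypothetical placeholders, not the dataset's actual class names or schema, and the real hierarchy has 4 coarse and 53 fine classes.

```python
from typing import Dict, Set

# Hypothetical hierarchy with placeholder names; the real TeCla labels
# are not listed in this description and are NOT reproduced here.
HIERARCHY: Dict[str, Set[str]] = {
    "coarse_A": {"fine_1", "fine_2"},
    "coarse_B": {"fine_3"},
}


def is_valid_label(coarse: str, fine: str,
                   hierarchy: Dict[str, Set[str]] = HIERARCHY) -> bool:
    """Return True if `fine` is among the fine classes accepted by `coarse`."""
    return fine in hierarchy.get(coarse, set())


def count_fine_classes(hierarchy: Dict[str, Set[str]]) -> int:
    """Count the distinct fine-grained classes across all coarse classes."""
    return len(set().union(*hierarchy.values()))
```

A check of this kind can be useful before training, to confirm that every (coarse, fine) pair in a loaded split is consistent with the hierarchy.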

The source data was crawled from the ACN (Catalan News Agency) site, http://www.acn.cat, and is used under the CC-BY-NC-ND 4.0 licence. The dataset is released under the same licence and is intended exclusively for training Machine Learning models.

This dataset was developed by BSC TeMU as part of the AINA project and is intended to form part of CLUB (Catalan Language Understanding Benchmark).

Files

Name          Size     Checksum
tecla_v2.zip  93.7 MB  md5:bb4dfda0b612eaf24abbda63cee46fde
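
The MD5 checksum listed for tecla_v2.zip can be used to confirm that a download completed intact. The following is a minimal sketch using only the Python standard library; it assumes the archive has been saved locally under the name `tecla_v2.zip`, and the function names are illustrative.

```python
import hashlib
from pathlib import Path

# Checksum as listed in the Files section of this record.
EXPECTED_MD5 = "bb4dfda0b612eaf24abbda63cee46fde"


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected: str = EXPECTED_MD5) -> bool:
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("tecla_v2.zip")  # assumed local file name
    if archive.exists():
        print("checksum OK" if verify_download(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found")
```

Reading in chunks keeps memory use flat, which matters little for a 93.7 MB file but costs nothing.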