TeCla: Text Classification Catalan dataset
Description
Catalan news corpus for text classification, extracted from the website of the Agència Catalana de Notícies (Catalan News Agency) under a CC-BY-NC-ND licence.
TeCla (Text Classification) is a Catalan news corpus for thematic multi-class text classification tasks. The present version (2.0) contains 113,376 articles classified under a hierarchical class structure: each article has a coarse-grained class and a fine-grained class. Each of the 4 coarse-grained classes accepts a subset of the fine-grained ones, of which there are 53 in total.
The source data was crawled from the ACN (Catalan News Agency) site, http://www.acn.cat, and is used under the CC-BY-NC-ND 4.0 licence. The dataset is released under the same licence and is intended exclusively for training Machine Learning models.
This dataset was developed by BSC TeMU as part of the AINA project and is intended to form part of CLUB (Catalan Language Understanding Benchmark).
Files
Name | Size | Checksum |
---|---|---|
tecla_v2.zip | 93.7 MB | md5:bb4dfda0b612eaf24abbda63cee46fde |
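The MD5 checksum listed for tecla_v2.zip can be used to confirm that a download completed without corruption. The following is a minimal sketch, not part of the dataset itself: the helper names `md5_of_file` and `archive_is_intact` are hypothetical, and the sketch assumes the archive has been saved locally under its original file name.

```python
import hashlib
from pathlib import Path

# Checksum as listed for tecla_v2.zip in the Files section of this record.
EXPECTED_MD5 = "bb4dfda0b612eaf24abbda63cee46fde"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel keeps reading until read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected_md5=EXPECTED_MD5):
    """Hypothetical helper: compare a file's MD5 digest with the expected value."""
    return md5_of_file(path) == expected_md5.lower()
```

Calling `archive_is_intact("tecla_v2.zip")` after the download should return `True` if the local file matches the published checksum; a `False` result suggests the file should be downloaded again.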