TeCla: Text Classification Catalan dataset
Description
If you use this resource in your work, please cite our latest paper:
@inproceedings{armengol-estape-etal-2021-multilingual,
title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
author = "Armengol-Estap{\'e}, Jordi and
Carrino, Casimiro Pio and
Rodriguez-Penagos, Carlos and
de Gibert Bonet, Ona and
Armentano-Oller, Carme and
Gonzalez-Agirre, Aitor and
Melero, Maite and
Villegas, Marta",
booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.findings-acl.437",
doi = "10.18653/v1/2021.findings-acl.437",
pages = "4933--4946",
}
Catalan news corpus for text classification, extracted from the website of the Agència Catalana de Notícies under a CC-BY-NC-ND licence.
TeCla is a Catalan news corpus for thematic text classification tasks. It contains 153,265 articles classified under 30 different categories.
The source data was crawled from the ACN (Catalan News Agency) site, http://www.acn.cat, and is used under the CC-BY-NC-ND 4.0 licence. The dataset is released under the same licence and is intended exclusively for training Machine Learning models.
This dataset was developed by BSC TeMU as part of the AINA project, and is intended as part of CLUB (Catalan Language Understanding Benchmark).
Files
Name | Size | Checksum |
---|---|---|
TeCla_v.1.0.1.zip | 109.8 MB | md5:46f1ebe0a6a55b90a05912ad602093f2 |
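The MD5 checksum listed for TeCla_v.1.0.1.zip can be used to confirm that a download is complete and uncorrupted. The following is a minimal sketch, assuming the archive has been saved under its original name in the current working directory; the helper names `md5_of_file` and `matches_expected` are illustrative and not part of the dataset.

```python
import hashlib
from pathlib import Path

# Checksum as listed on this page for TeCla_v.1.0.1.zip (without the "md5:" prefix).
EXPECTED_MD5 = "46f1ebe0a6a55b90a05912ad602093f2"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Compare a file's MD5 digest with the expected hex string (case-insensitive)."""
    return md5_of_file(path) == expected.lower()


# Assumption: the archive sits in the current directory under its original name.
archive = Path("TeCla_v.1.0.1.zip")
if archive.exists():
    print("checksum OK" if matches_expected(archive) else "checksum MISMATCH")
```

A mismatch usually indicates an interrupted or altered download, in which case fetching the archive again is the simplest remedy.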