Planned intervention: On Wednesday April 3rd 05:30 UTC Zenodo will be unavailable for up to 2-10 minutes to perform a storage cluster upgrade.

There is a newer version of the record available.

Published March 22, 2021 | Version 1.0.1
Dataset Open

TeCla: Text Classification Catalan dataset

Description

If you use this resource in your work, please cite our latest paper:

@inproceedings{armengol-estape-etal-2021-multilingual,
    title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
    author = "Armengol-Estap{\'e}, Jordi  and
      Carrino, Casimiro Pio  and
      Rodriguez-Penagos, Carlos  and
      de Gibert Bonet, Ona  and
      Armentano-Oller, Carme  and
      Gonzalez-Agirre, Aitor  and
      Melero, Maite  and
      Villegas, Marta",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.437",
    doi = "10.18653/v1/2021.findings-acl.437",
    pages = "4933--4946",
}

Corpus de notícies en català per a classificació textual, extret del web de l'Agència Catalana de Notícies sota llicència CC-BY-NC-ND

TeCla is a Catalan News corpus for thematic Text Classification tasks. It contains 153.265 articles classified under 30 different categories.

The source data is crawled from the ACN (Catalan News Agency) site: http://www.acn.cat, and used under CC-BY-NC-ND 4.0 licence. The dataset is released under the same licence, and is intended exclusively for training Machine Learning models.

This dataset was developed by BSC TeMU as part of the AINA project, and intended as part of CLUB (Catalan Language Understanding Benchmark).

Files

TeCla_v.1.0.1.zip

Files (109.8 MB)

Name Size Download all
md5:46f1ebe0a6a55b90a05912ad602093f2
109.8 MB Preview Download