Published February 25, 2021 | Version v1.0
Dataset Open

Catalan Sub-word Embeddings in FastText

Description

These Catalan sub-word embeddings were trained with FastText using BPE on the largest Catalan corpus compiled to date. The corpus contains more than 10 GB of curated, high-quality text.

If this material is useful, please cite it.

 

Copyright (c) 2021 Text Mining Unit - Barcelona Supercomputing Center

Notes

Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL) and the Generalitat de Catalunya, Departament de Polítiques Digitals i Administració Pública.

Files

cased.zip

Files (13.8 GB)

Checksum                                Size
md5:096da389feab2f3efdbef2fdbb61bcf3    6.9 GB
md5:2ab724713fdaf49e4523c4503bfd068d    18.7 kB
md5:674da6b43602ddf95932acbab58f882f    1.0 kB
md5:699d285c17346d53f51074cf7300e9d3    6.9 GB
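
Because the two large archives are 6.9 GB each, it can be worth checking a download against the MD5 values listed above before unpacking it. The following is a minimal sketch using only the Python standard library. The helper names `md5_of_file` and `matches_checksum` are illustrative and are not part of this record, and which checksum belongs to which file is not stated in the listing, so any pairing of a file with a checksum is an assumption to confirm on the record page.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Fixed-size reads keep memory use flat even for multi-gigabyte archives.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 with an expected value such as 'md5:0123...'.

    The optional 'md5:' prefix is dropped and the comparison ignores case.
    """
    expected_hex = expected.strip().lower().split(":", 1)[-1]
    return md5_of_file(path) == expected_hex
```

For example, `matches_checksum("cased.zip", "md5:096da389feab2f3efdbef2fdbb61bcf3")` would return `True` only if a local file named `cased.zip` has that digest. Both the local path and the choice of checksum here are hypothetical.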