Catalan Sub-word Embeddings in FastText

Gutiérrez-Fandiño, Asier; Armengol-Estapé, Jordi; Carrino, Casimiro Pio; De Gibert, Ona; Gonzalez-Agirre, Aitor; Villegas, Marta

doi:10.5281/zenodo.4561598

Published February 25, 2021 | Version v1.0

Dataset Open

1. Barcelona Supercomputing Center

These Catalan sub-word embeddings, trained with FastText using BPE, were generated from the largest Catalan corpus compiled to date. The corpus contains more than 10 GB of curated, high-quality text.

If this material is useful, please cite it.

Notes

Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL) and the Generalitat de Catalunya, Departament de Polítiques Digitals i Administració Pública.

Files

Files (13.8 GB)

Name	MD5	Size
cased.zip	096da389feab2f3efdbef2fdbb61bcf3	6.9 GB
LICENSE.txt	2ab724713fdaf49e4523c4503bfd068d	18.7 kB
README.md	674da6b43602ddf95932acbab58f882f	1.0 kB
uncased.zip	699d285c17346d53f51074cf7300e9d3	6.9 GB
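
Because the two archives are several gigabytes each, it can be worth checking a download against the MD5 digests in the file listing before unpacking it. The following is a minimal sketch, not part of the dataset: the helper names `md5_of_file` and `verify_download` are hypothetical, and it assumes each file was saved locally under the same name as in the listing.

```python
import hashlib
from pathlib import Path

# MD5 digests copied from the file listing of this record.
EXPECTED_MD5 = {
    "cased.zip": "096da389feab2f3efdbef2fdbb61bcf3",
    "LICENSE.txt": "2ab724713fdaf49e4523c4503bfd068d",
    "README.md": "674da6b43602ddf95932acbab58f882f",
    "uncased.zip": "699d285c17346d53f51074cf7300e9d3",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat for the multi-gigabyte archives.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Return True if the file's MD5 matches the digest listed for its name.

    Assumes the local file name is unchanged from the listing; raises
    KeyError for a name that has no listed digest.
    """
    path = Path(path)
    expected = EXPECTED_MD5.get(path.name)
    if expected is None:
        raise KeyError(f"no listed checksum for {path.name}")
    return md5_of_file(path) == expected
```

For example, `verify_download("cased.zip")` would return `True` only if the local copy is byte-identical to the published archive. MD5 is used here only because it is the digest the record publishes; it detects corrupted or truncated downloads but is not a defence against deliberate tampering.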

	All versions	This version
Views	340	340
Downloads	149	149
Data volume	454.9 GB	454.9 GB

Resource type

Dataset

Publisher

Zenodo

Languages

Catalan

License: Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: February 25, 2021
Modified: February 26, 2021
