Dataset Open Access

C4 kōan CBOW embeddings

Irsoy, Ozan; Benton, Adrian; Stratos, Karl

These are 2 million 768-dimensional and 300-dimensional CBOW embeddings trained on the English Colossal Clean Crawled Corpus (C4). They were trained with the corrected CBOW code from kōan:

https://github.com/bloomberg/koan

with intrinsic evaluation reported in:

    Ozan İrsoy, Adrian Benton, Karl Stratos. “Corrected CBOW Performs as well as Skip-gram”. The 2nd Workshop on Insights from Negative Results in NLP. 2021.

Files (8.0 GB)

Name                      Size
c4_koan_embeddings.zip    8.0 GB
md5:4eded8caa2d8a8b0ae655e71c9069d0c