Published September 30, 2021 | Version v1
Dataset Open

C4 kōan CBOW embeddings

  • 1. Bloomberg
  • 2. Rutgers University

Description

These are 2 million 768-dimensional and 300-dimensional CBOW word embeddings trained on the English Colossal Clean Crawled Corpus (C4). They were trained with the corrected CBOW implementation in kōan:

https://github.com/bloomberg/koan

with intrinsic evaluation reported in:

    Ozan İrsoy, Adrian Benton, Karl Stratos. “Corrected CBOW Performs as well as Skip-gram”. The 2nd Workshop on Insights from Negative Results in NLP. 2021.
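For readers who want to inspect the vectors, the following is a minimal sketch of loading embeddings and comparing two words by cosine similarity. It assumes the files inside the archive use the plain-text word2vec layout (an optional "count dim" header, then one "word v1 ... vd" row per line); this record does not state the file format, so check the extracted files first. The function names and the `limit` parameter are illustrative and are not part of kōan.

```python
import math


def load_text_embeddings(path, limit=None):
    """Read word vectors from a word2vec-style text file (ASSUMED format).

    Returns a dict mapping each word to a list of floats. `limit` caps the
    number of rows read, which helps when the full table is too large for
    memory.
    """
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle):
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            # An optional first line of two integers is treated as a header.
            if line_number == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
                continue
            vectors[parts[0]] = [float(value) for value in parts[1:]]
            if limit is not None and len(vectors) >= limit:
                break
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

With 2 million rows of up to 768 dimensions, holding everything in Python lists is likely to be memory-hungry, so passing a `limit` or converting to a compact array type may be preferable for real use.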

Files

c4_koan_embeddings.zip (8.0 GB)
md5:4eded8caa2d8a8b0ae655e71c9069d0c
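
Because the archive is large, it can be worth checking the download against the MD5 digest listed for c4_koan_embeddings.zip. The sketch below uses only Python's standard `hashlib` module and reads the file in chunks so the whole archive is never held in memory. The local path and the helper names are assumptions made for illustration.

```python
import hashlib
from pathlib import Path

# MD5 digest listed on this record for c4_koan_embeddings.zip.
EXPECTED_MD5 = "4eded8caa2d8a8b0ae655e71c9069d0c"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches `expected` (case-insensitive)."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed location: the archive saved in the current working directory.
    archive = Path("c4_koan_embeddings.zip")
    if archive.exists():
        print("checksum OK" if verify_download(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

A mismatch usually means a truncated or corrupted download, in which case fetching the file again is the simplest remedy.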

Additional details

Related works

Is compiled by
Software: https://github.com/bloomberg/koan (URL)
Is derived from
Dataset: https://www.tensorflow.org/datasets/catalog/c4 (URL)