Published September 30, 2021 | Version v1
Dataset
Open
C4 kōan CBOW embeddings
Description
These are 2 million CBOW embeddings, provided in 768-dimensional and 300-dimensional variants, trained on the English portion of the Colossal Clean Crawled Corpus (C4). They were trained with the corrected CBOW code from kōan:
https://github.com/bloomberg/koan
with intrinsic evaluation reported in:
Ozan İrsoy, Adrian Benton, Karl Stratos. “Corrected CBOW Performs as well as Skip-gram”. The 2nd Workshop on Insights from Negative Results in NLP. 2021.
Files
Name | Size |
---|---|
c4_koan_embeddings.zip (md5:4eded8caa2d8a8b0ae655e71c9069d0c) | 8.0 GB |
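Because the archive is large, it may be worth checking the download against the MD5 checksum listed for the file before unpacking it. The following is a minimal sketch, assuming the archive was saved under its listed name in the current working directory; the function names `md5_of_file` and `verify_archive` are hypothetical helpers introduced here for illustration and are not part of kōan.

```python
import hashlib
from pathlib import Path

# MD5 checksum as listed for c4_koan_embeddings.zip on this record.
EXPECTED_MD5 = "4eded8caa2d8a8b0ae655e71c9069d0c"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that an 8 GB archive never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed local path: the archive saved under its listed file name.
    archive = Path("c4_koan_embeddings.zip")
    if archive.exists():
        print("checksum OK" if verify_archive(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

A mismatch usually indicates a truncated or corrupted download, in which case fetching the file again is the simplest fix.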
Additional details
Related works
- Is compiled by
  - Software: https://github.com/bloomberg/koan
- Is derived from
  - Dataset: https://www.tensorflow.org/datasets/catalog/c4