Dataset Open Access

Standardized Project Gutenberg Corpus

Martin Gerlach; Francesc Font-Clos

Standardized Project Gutenberg Corpus
Version: SPGC-2018-07-18
Number of books: 55,905
Uncompressed size: 3 GB (counts) + 18 GB (tokens)

Publication
https://arxiv.org/abs/1812.08092
[ journal link ]

Project Site
https://pgcorpus.github.io/

GitHub
https://github.com/pgcorpus/gutenberg

Files (7.9 GB)

Name                            Size      MD5
SPGC-counts-2018-07-18.zip      1.5 GB    bccfbdf00caa906d84344cf335cc96ee
SPGC-metadata-2018-07-18.csv    10.0 MB   a2d5f325f13846cbec2fd21d982b4ef4
SPGC-tokens-2018-07-18.zip      6.4 GB    13e16ae2c8350a0b7407a8f7a51e8a7e
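
Because the archives are several gigabytes each, it may be worth checking a download against the MD5 digests listed above before unpacking it. The following is a minimal sketch in Python, not part of the dataset's own tooling: the helper names `md5_of_file` and `verify_downloads` are hypothetical, and it assumes the three files were saved under their original names in a single directory.

```python
import hashlib
from pathlib import Path

# MD5 digests copied from the file listing on the dataset page.
EXPECTED_MD5 = {
    "SPGC-counts-2018-07-18.zip": "bccfbdf00caa906d84344cf335cc96ee",
    "SPGC-metadata-2018-07-18.csv": "a2d5f325f13846cbec2fd21d982b4ef4",
    "SPGC-tokens-2018-07-18.zip": "13e16ae2c8350a0b7407a8f7a51e8a7e",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte archives are never loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Hypothetical helper: map each expected file name to True (digest
    matches), False (digest differs), or None (file not found)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = None
        else:
            results[name] = md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify_downloads().items():
        status = "missing" if ok is None else ("OK" if ok else "MISMATCH")
        print(f"{name}: {status}")
```

Run from the download directory, the script prints one status line per file. A MISMATCH usually points to an interrupted or corrupted transfer, in which case the file should be downloaded again.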
                    All versions    This version
Views               1,758           1,758
Downloads           621             621
Data volume         965.5 GB        965.5 GB
Unique views        1,648           1,648
Unique downloads    459             459
