Dataset Open Access

Standardized Project Gutenberg Corpus

Martin Gerlach; Francesc Font-Clos

Standardized Project Gutenberg Corpus
version: SPGC-2018-07-18
number of books: 55905
uncompressed size: 3GB (counts) + 18GB (tokens)

Publication
https://arxiv.org/abs/1812.08092
[ journal link ]

Project Site
https://pgcorpus.github.io/

Github
https://github.com/pgcorpus/gutenberg

Files (7.9 GB)
Name                           Size      md5
SPGC-counts-2018-07-18.zip     1.5 GB    bccfbdf00caa906d84344cf335cc96ee
SPGC-metadata-2018-07-18.csv   10.0 MB   a2d5f325f13846cbec2fd21d982b4ef4
SPGC-tokens-2018-07-18.zip     6.4 GB    13e16ae2c8350a0b7407a8f7a51e8a7e
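
Since each file is listed with an md5 checksum, a download can be checked against it before unpacking. The following is a minimal sketch, assuming the three files were saved under their listed names in the current directory; the helper names `md5_of_file` and `verify_downloads` are illustrative and not part of the corpus tooling.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing on this page.
EXPECTED_MD5 = {
    "SPGC-counts-2018-07-18.zip": "bccfbdf00caa906d84344cf335cc96ee",
    "SPGC-metadata-2018-07-18.csv": "a2d5f325f13846cbec2fd21d982b4ef4",
    "SPGC-tokens-2018-07-18.zip": "13e16ae2c8350a0b7407a8f7a51e8a7e",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks
    so that multi-gigabyte archives do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Return {filename: True | False | None}.

    True means the checksum matches, False means it differs,
    None means the file was not found (assumed location: `directory`).
    """
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if path.is_file():
            results[name] = md5_of_file(path) == expected
        else:
            results[name] = None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "not found"}
    for name, status in verify_downloads().items():
        print(f"{name}: {labels[status]}")
```

A mismatch usually points to an incomplete or corrupted download, in which case fetching the file again is the simplest remedy.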
