Dataset Open Access

Standardized Project Gutenberg Corpus

Martin Gerlach; Francesc Font-Clos

version: SPGC-2018-07-18
number of books: 55905
uncompressed size: 3 GB (counts) + 18 GB (tokens)

Publication
https://arxiv.org/abs/1812.08092
[ journal link ]

Project Site
https://pgcorpus.github.io/

Github
https://github.com/pgcorpus/gutenberg

Files (7.9 GB)
Name                           Size      md5
SPGC-counts-2018-07-18.zip     1.5 GB    bccfbdf00caa906d84344cf335cc96ee
SPGC-metadata-2018-07-18.csv   10.0 MB   a2d5f325f13846cbec2fd21d982b4ef4
SPGC-tokens-2018-07-18.zip     6.4 GB    13e16ae2c8350a0b7407a8f7a51e8a7e
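
Since each file is listed with an md5 checksum, a downloaded copy can be checked against it before unpacking. The following is a minimal sketch, assuming the three files were saved under their listed names in the current directory; the helper names md5_of_file and verify_downloads are illustrative and not part of the corpus tooling.

```python
import hashlib
from pathlib import Path

# Expected MD5 digests, copied from the file listing on this page.
EXPECTED_MD5 = {
    "SPGC-counts-2018-07-18.zip": "bccfbdf00caa906d84344cf335cc96ee",
    "SPGC-metadata-2018-07-18.csv": "a2d5f325f13846cbec2fd21d982b4ef4",
    "SPGC-tokens-2018-07-18.zip": "13e16ae2c8350a0b7407a8f7a51e8a7e",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each expected file name to True (match), False (mismatch), or None (missing).

    Assumption: the files keep the names shown in the listing.
    """
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = None
        else:
            results[name] = md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify_downloads().items():
        status = "missing" if ok is None else ("OK" if ok else "MISMATCH")
        print(f"{name}: {status}")
```

Reading in chunks matters here because the tokens archive is several gigabytes; loading it whole just to hash it would be wasteful.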
Statistics         All versions   This version
Views              326            326
Downloads          181            181
Data volume        704.9 GB       704.9 GB
Unique views       274            274
Unique downloads   72             72
