Published December 19, 2018 | Version SPGC-2018-07-18
Dataset Open

Standardized Project Gutenberg Corpus

  • 1. Department of Chemical and Biological Engineering, Northwestern University
  • 2. Center for Complexity and Biosystems, Department of Physics, University of Milan

Description

version: SPGC-2018-07-18
number of books: 55905
uncompressed size: 3 GB (counts) + 18 GB (tokens)

Publication
https://arxiv.org/abs/1812.08092
[ journal link ]

Project Site
https://pgcorpus.github.io/

Github
https://github.com/pgcorpus/gutenberg

Files

SPGC-counts-2018-07-18.zip

Files (7.9 GB)

Size      Checksum
1.5 GB    md5:bccfbdf00caa906d84344cf335cc96ee
10.0 MB   md5:a2d5f325f13846cbec2fd21d982b4ef4
6.4 GB    md5:13e16ae2c8350a0b7407a8f7a51e8a7e
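
The MD5 values listed above can be used to check that a download completed intact. The following is a minimal sketch using only the Python standard library. The helper names md5_of_file and matches_listed_checksum are illustrative and not part of the corpus tooling. The local file name in the usage note is an assumption, because the listing above does not pair file names with checksums.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte archives are never loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file's MD5 with a listed value.

    `listed` may be written as 'md5:<hex>' (as in the file listing)
    or as the bare hex string.
    """
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, assuming the 1.5 GB archive was saved locally as SPGC-counts-2018-07-18.zip, calling matches_listed_checksum("SPGC-counts-2018-07-18.zip", "md5:bccfbdf00caa906d84344cf335cc96ee") would return True only if the file on disk hashes to the listed value.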