Published September 21, 2018 | Version 1.0
Dataset Open

Hathi Trust Library Vectorized features

  • 1. Northeastern University

Description

A lower-resolution (and therefore more portable) version of the Stable Random Projection Hathi Trust features described in my forthcoming article. The Northeastern repository consists of many individual files with 1280 random dimensions; this version has just 640. The numbers are also experimentally encoded as half-precision floats, which halves the file size again, at the cost of being readable only with my Python module. The net result is a file 1/4 the size of the full-resolution files used for the paper that probably retains something like 60-80% of the information content.
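The 1/4 figure follows from halving both the number of dimensions and the bytes per number. The sketch below works through that arithmetic and shows the rounding that half precision introduces. It assumes the full-resolution files store 32-bit floats (implied by, but not stated in, the description) and ignores any per-record identifiers or headers in the real file format; the names used are illustrative and not part of pySRP.

```python
import struct

# Sizes in bytes of one value in each encoding.
# Assumption: the full-resolution files use 32-bit floats.
FLOAT32_BYTES = struct.calcsize("<f")  # single precision
FLOAT16_BYTES = struct.calcsize("<e")  # half precision

FULL_DIMS = 1280   # dimensions in the full-resolution files
SMALL_DIMS = 640   # dimensions in this dataset


def bytes_per_vector(dims, bytes_per_value):
    """Storage for the numeric part of one vector (identifiers ignored)."""
    return dims * bytes_per_value


full = bytes_per_vector(FULL_DIMS, FLOAT32_BYTES)
small = bytes_per_vector(SMALL_DIMS, FLOAT16_BYTES)
ratio = small / full


def to_half_and_back(value):
    """Round a Python float to half precision and decode it again."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


sample = 0.1234567
roundtrip = to_half_and_back(sample)

print(f"full: {full} bytes, small: {small} bytes, ratio: {ratio}")
print(f"sample: {sample}, after half-precision round trip: {roundtrip}")
```

Half precision keeps only about three significant decimal digits, so the round-tripped value differs slightly from the original.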

The full file is 'ht-640d-complete-half-precision.bin'. You can also download 11 smaller files organized by language.

Because these files use half-precision float encoding, you must specify the precision when reading them, e.g.:

from SRP import Vector_file
f = Vector_file("ita.bin", precision = "half")

Code to read these files is at https://github.com/bmschmidt/pySRP. 

Files (35.4 GB)

Size      Checksum
672.5 MB  md5:9d42e87889548753c87a0789b2974eb7
3.1 GB    md5:3f568d8da7a8ea91cce543c97c2c6548
3.4 GB    md5:1c0cf6c2f28a460f06819d70972f9158
3.0 GB    md5:e2e6849025ed6ed9bcf4a1a66fefb6d4
1.3 GB    md5:9ffcd0bc224808f4eb26a1405ad7ef6c
1.7 GB    md5:901a65063a63db2d61c72ccfec9817fc
17.7 GB   md5:e3752fd49b778674a321fc619c9f81a2
418.0 MB  md5:efadff54374de2e5fab31457ace0d8fc
632.0 MB  md5:d1b710006321ae4b99fd7227be4f0bcf
2.4 GB    md5:fc0b487e8633f95f38bb93ffb229e7bf
530.6 MB  md5:1466aa25e5440176d41f3e23870bf84b
650.2 MB  md5:11997b2761a6fb8ea3a409e92d8678c8
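
Since some of these files are many gigabytes, it can be worth checking a download against its listed md5 checksum before reading it. The following is a minimal sketch using only the Python standard library; the function names are illustrative and not part of pySRP, and which checksum belongs to which file name has to be taken from the repository listing itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the md5 hex digest of a file, reading it in chunks
    so that multi-gigabyte files do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed_checksum):
    """Compare a file against a checksum written as 'md5:<hex digest>'
    (the form used in the file listing) or as a bare hex digest."""
    expected = listed_checksum.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listing("ita.bin", "md5:<checksum listed for that file>")` returns `True` only if the local copy is byte-for-byte identical to the published one.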