Published November 3, 2018 | Version v1
Dataset Open

Materials for 2d representation of the HathiTrust Library

  • 1. Northeastern University

Description

Materials to create the LargeVis visualization online at http://creatingdata.us/datasets/hathi-features/, and described in Benjamin Schmidt, "Stable random projection: lightweight, general-purpose dimensionality reduction for digitized libraries," Journal of Cultural Analytics. October 3, 2018.

Two items. First, `hathi_pca.bin`: a binary file with 100-dimensional representations of the complete HathiTrust Extended Features set. These began as 1280-dimensional SRP features and were reduced to 100 dimensions using a PCA transformation matrix derived from a random sample of the full 13-million-book set. Vectors were normalized to unit length before PCA, but not afterward; this means that, in general, their length gives some sense of how much information was lost in the PCA representation. The file can be read using the code at https://github.com/bmschmidt/pySRP, or with anything that reads word2vec-formatted vectors. It includes HathiTrust identifiers.
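As a rough illustration of what "word2vec formatted" implies for a reader, the sketch below parses the common word2vec binary layout using only the Python standard library. The layout is an assumption, not something confirmed for this file: an ASCII header line giving the row count and dimensionality, then for each row an identifier, a space, and little-endian float32 values. The function names and the toy identifiers are hypothetical; pySRP remains the reference reader.

```python
import io
import struct


def read_word2vec_binary(stream, limit=None):
    """Yield (identifier, vector) pairs from a word2vec-style binary stream.

    Assumed layout: an ASCII header line "<rows> <dims>", then for each row
    the identifier, a single space, and <dims> little-endian float32 values,
    optionally followed by a newline.
    """
    n_rows, n_dims = (int(field) for field in stream.readline().split()[:2])
    if limit is not None:
        n_rows = min(n_rows, limit)
    for _ in range(n_rows):
        name = bytearray()
        while True:
            byte = stream.read(1)
            if not byte:
                return  # stream ended early
            if byte == b" ":
                break
            if byte != b"\n":  # skip the newline left over from the previous row
                name += byte
        payload = stream.read(4 * n_dims)
        if len(payload) < 4 * n_dims:
            return  # truncated final row
        yield name.decode("utf-8"), struct.unpack("<%df" % n_dims, payload)


def vector_length(vector):
    """Euclidean length of a vector."""
    return sum(value * value for value in vector) ** 0.5


# Toy stand-in for the real file: two made-up identifiers, three dimensions.
toy = io.BytesIO(
    b"2 3\n"
    + b"toy.001 " + struct.pack("<3f", 0.6, 0.8, 0.0) + b"\n"
    + b"toy.002 " + struct.pack("<3f", 0.3, 0.0, 0.4) + b"\n"
)
for identifier, vector in read_word2vec_binary(toy):
    print(identifier, round(vector_length(vector), 3))
```

For the real file, the same function could be given `open("hathi_pca.bin", "rb")` together with a small `limit` to inspect the first few rows before committing to a full read.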

Second, `hathi.tsv.gz`: a row-oriented file containing a variety of metadata fields for each book, including (as 'x' and 'y') the coordinates of a 2-d LargeVis visualization. This is the immediate input to the visualization at http://creatingdata.us/datasets/hathi-features/. Columns should be relatively straightforward; they are derived from the HathiTrust MARC records, which can be accessed through Hathi's public API. Classification codes ('lc1') use the Library of Congress classification; they represent the subclass (generally two characters, though it can be one or three). The first character alone represents the LC class and can be useful for coloring high-level overviews.
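The following is a minimal sketch of streaming the metadata file and deriving the one-letter LC class from 'lc1', using only the Python standard library. It assumes the file is UTF-8 and starts with a header row naming the columns; only 'x', 'y', and 'lc1' come from the description above. The `id` and `title` columns in the toy data, the derived `lc0` field, and the function names are hypothetical.

```python
import csv
import gzip
import io


def lc_class(lc1):
    """Return the one-letter LC class for a subclass code such as 'PS'."""
    return (lc1 or "").strip()[:1].upper()


def to_float(text):
    """Convert a field to float, returning None when it is empty or missing."""
    return float(text) if text not in (None, "") else None


def read_metadata(path_or_file):
    """Yield one dict per row of a gzipped, tab-separated file with a header row."""
    with gzip.open(path_or_file, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            row["x"], row["y"] = to_float(row.get("x")), to_float(row.get("y"))
            row["lc0"] = lc_class(row.get("lc1"))  # hypothetical derived field
            yield row


# Toy stand-in for hathi.tsv.gz; the "id" and "title" columns are made up.
toy = io.BytesIO(gzip.compress(
    b"id\ttitle\tlc1\tx\ty\n"
    b"toy.001\tA novel\tPS\t1.5\t-2.0\n"
    b"toy.002\tA physics text\tQC\t-0.25\t3.0\n"
    b"toy.003\tUnclassified\t\t0.0\t0.0\n"
))
for row in read_metadata(toy):
    print(row["id"], row["lc0"] or "?", row["x"], row["y"])
```

Because rows are yielded one at a time, the full file never has to be held in memory; grouping points by `lc0` is then one way to assign colors in a high-level overview.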

These two files can be merged through the HathiTrust identifier present in both.
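A merge on the shared identifier can be sketched as a dictionary lookup. The example assumes the vectors have already been loaded into a dict keyed by identifier and that each metadata row is a dict; the column name `id`, the function name, and all toy values are hypothetical, so the real column name should be checked against the file's header.

```python
def merge_by_identifier(vectors, metadata_rows, id_column="id"):
    """Yield (row, vector) for every metadata row whose identifier has a vector."""
    for row in metadata_rows:
        vector = vectors.get(row[id_column])
        if vector is not None:
            yield row, vector


# Toy inputs standing in for the two files; identifiers and values are made up.
vectors = {"toy.001": (0.6, 0.8), "toy.002": (0.3, 0.4)}
metadata_rows = [
    {"id": "toy.001", "lc1": "PS", "x": 1.5, "y": -2.0},
    {"id": "toy.003", "lc1": "QC", "x": 0.0, "y": 0.0},
]
for row, vector in merge_by_identifier(vectors, metadata_rows):
    print(row["id"], row["lc1"], vector)
```

This behaves like an inner join: rows without a matching vector are skipped. For the full 13 million books, a dict of Python tuples would likely be memory-hungry, so loading vectors only for the identifiers of interest may be more practical.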

 

Files (6.6 GB)

Checksum                              Size
md5:b0341698e8c03566169e18538578d8e7  878.9 MB
md5:9da4d05d8517151c26d956a8b5130df2  5.7 GB

Additional details

Related works

Is supplement to
10.22148/16.025 (DOI)

References

  • Benjamin Schmidt, "Stable random projection: lightweight, general-purpose dimensionality reduction for digitized libraries," Journal of Cultural Analytics. October 3, 2018