Materials for 2d representation of the HathiTrust Library
Description
Materials to create the LargeVis visualization online at http://creatingdata.us/datasets/hathi-features/, and described in Benjamin Schmidt, "Stable random projection: lightweight, general-purpose dimensionality reduction for digitized libraries," Journal of Cultural Analytics. October 3, 2018.
Two items. First, `hathi_pca.bin`: a binary file with 100-dimensional representations of the complete HathiTrust Extended Features set. These began as 1280-dimensional SRP features and were reduced to 100 dimensions using a PCA transformation matrix derived from a random sample of the full 13-million-book set. Vectors were normalized to unit length before PCA, but not afterwards; this means that, in general, their length gives some sense of how much information was lost in the PCA representation. The file can be read using the code at https://github.com/bmschmidt/pySRP, or with anything that reads word2vec-formatted vectors. It includes HathiTrust identifiers.
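As a rough illustration of the layout described above, here is a minimal reader sketch. It assumes the conventional word2vec binary layout (an ASCII header line with row and dimension counts, then for each row an identifier, a space, and little-endian 32-bit floats); the function name and the demo identifiers are made up for illustration, and the pySRP code linked above remains the documented way to read the file.

```python
import io
import math
import struct


def read_word2vec_binary(stream):
    """Yield (identifier, vector) pairs from a word2vec-style binary stream."""
    # Header line: "<number of rows> <number of dimensions>\n" in ASCII.
    header = stream.readline().decode("ascii").split()
    n_rows, n_dims = int(header[0]), int(header[1])
    for _ in range(n_rows):
        # The identifier runs up to the first space. A newline left over
        # from the end of the previous row is skipped.
        name = bytearray()
        while True:
            ch = stream.read(1)
            if not ch:
                raise EOFError("stream ended before all rows were read")
            if ch == b" ":
                break
            if ch != b"\n":
                name.extend(ch)
        # The vector follows as n_dims little-endian 32-bit floats.
        raw = stream.read(4 * n_dims)
        if len(raw) != 4 * n_dims:
            raise EOFError("stream ended in the middle of a vector")
        yield name.decode("utf-8"), struct.unpack(f"<{n_dims}f", raw)


# Build a tiny in-memory file with made-up identifiers to exercise the reader.
demo_rows = [
    ("demo.0001", (0.6, 0.8, 0.0)),
    ("demo.0002", (0.3, 0.0, 0.4)),
]
buffer = io.BytesIO()
buffer.write(f"{len(demo_rows)} 3\n".encode("ascii"))
for name, values in demo_rows:
    buffer.write(name.encode("utf-8") + b" ")
    buffer.write(struct.pack("<3f", *values))
    buffer.write(b"\n")
buffer.seek(0)

# Per the description, the post-PCA vector length hints at how much
# information the 100-dimensional representation retained.
lengths = {
    name: math.sqrt(sum(v * v for v in vector))
    for name, vector in read_word2vec_binary(buffer)
}
for name, length in lengths.items():
    print(f"{name}\t{length:.3f}")
```

For the real file, the same generator could be fed `open("hathi_pca.bin", "rb")`. Reading the identifier one byte at a time is simple but slow, so a dedicated reader is likely the better choice for all 13 million rows.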
Second, `hathi.tsv.gz`: a row-oriented set containing a variety of metadata fields for each book, including (as 'x' and 'y') the coordinates of a 2-d LargeVis visualization. This is the immediate input to the visualization at http://creatingdata.us/datasets/hathi-features/. Columns should be relatively straightforward; they are derived from the HathiTrust MARC records, which can be accessed through Hathi's public API. Classification codes ('lc1') use the Library of Congress classification; they represent the subclass (generally two characters, though it can be one or three). The first character alone represents the LC class and can be useful for coloring high-level overviews.
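The sketch below shows one way the first character of 'lc1' might be used to group points for coloring. Only the column names 'lc1', 'x', and 'y' come from the description; the sample rows and the helper name `lc_class` are invented for illustration, and the real file may contain other columns or empty values.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows standing in for the real file.
sample_tsv = (
    "lc1\tx\ty\n"
    "PS\t1.5\t-2.0\n"
    "PR\t1.0\t-1.0\n"
    "QA\t-3.0\t4.0\n"
    "Z\t0.0\t0.5\n"
)


def lc_class(lc1):
    """Return the top-level LC class: the first character of the subclass."""
    return lc1[:1].upper()


# Group the 2-d coordinates by LC class, e.g. to assign one color per class.
points_by_class = defaultdict(list)
reader = csv.DictReader(
    io.StringIO(sample_tsv), delimiter="\t", quoting=csv.QUOTE_NONE
)
for row in reader:
    point = (float(row["x"]), float(row["y"]))
    points_by_class[lc_class(row["lc1"])].append(point)

for letter in sorted(points_by_class):
    print(letter, len(points_by_class[letter]))
```

To run this against the real file, the `io.StringIO` source could be replaced with `gzip.open("hathi.tsv.gz", "rt", encoding="utf-8")`.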
These two files can be merged on the HathiTrust identifier present in both.
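A minimal sketch of such a merge, using small stand-ins for the two files. The identifier column name `htid`, the identifiers themselves, and all values are assumptions for illustration; the actual column name in `hathi.tsv.gz` should be checked before use.

```python
import csv
import io

# Stand-in for vectors read from the binary file, keyed by identifier.
vectors = {
    "demo.0001": (0.6, 0.8, 0.0),
    "demo.0002": (0.3, 0.0, 0.4),
    "demo.0009": (0.1, 0.1, 0.1),  # has no metadata row
}

# Stand-in for the metadata table; 'htid' is a hypothetical column name.
metadata_tsv = (
    "htid\tlc1\tx\ty\n"
    "demo.0001\tPS\t1.5\t-2.0\n"
    "demo.0002\tQA\t-3.0\t4.0\n"
    "demo.0003\tZ\t0.0\t0.5\n"  # has no vector
)
reader = csv.DictReader(
    io.StringIO(metadata_tsv), delimiter="\t", quoting=csv.QUOTE_NONE
)
metadata = {row["htid"]: row for row in reader}

# Inner join: keep only identifiers present in both sources.
merged = {
    htid: {"vector": vector, **metadata[htid]}
    for htid, vector in vectors.items()
    if htid in metadata
}

# Identifiers found in only one of the two sources, worth inspecting.
unmatched = sorted(set(vectors) ^ set(metadata))

print(sorted(merged))
print(unmatched)
```

Holding the metadata in a dictionary keeps the join simple; at the scale of the full set, streaming the vectors and looking each identifier up as it is read would avoid keeping both files in memory at once.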
Files
(6.6 GB)
Checksum | Size |
---|---|
md5:b0341698e8c03566169e18538578d8e7 | 878.9 MB |
md5:9da4d05d8517151c26d956a8b5130df2 | 5.7 GB |
Additional details
Related works
- Is supplement to
- 10.22148/16.025 (DOI)
References
- Benjamin Schmidt, "Stable random projection: lightweight, general-purpose dimensionality reduction for digitized libraries," Journal of Cultural Analytics. October 3, 2018