Dataset Open Access
Ruest, Nick; Olson, Lauris; Abrams, Samantha
Web archive derivatives of the Popline and K4Health Web Archive collection from the Ivy Plus Libraries Confederation. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.
The ivy-12006-parquet.tar.gz derivatives are in Apache Parquet, a columnar storage format. They are generally small enough to work with on a local machine and can be easily converted to pandas DataFrames. See this notebook for examples.
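As a rough illustration of the conversion to pandas, the sketch below collects every Parquet part file under an extracted directory and concatenates them into one DataFrame. The directory layout inside the tarball is not described here, so the recursive search for `*.parquet` files is an assumption, and `find_parquet_parts` and `load_derivative` are hypothetical helper names. `pandas.read_parquet` also requires a Parquet engine such as pyarrow or fastparquet.

```python
from pathlib import Path


def find_parquet_parts(root):
    """Return every *.parquet file under root, sorted for a stable order."""
    return sorted(Path(root).rglob("*.parquet"))


def load_derivative(root):
    """Read all Parquet part files under root into one pandas DataFrame."""
    # Imported here so the file-discovery helper works without pandas installed.
    import pandas as pd  # read_parquet also needs pyarrow or fastparquet

    parts = find_parquet_parts(root)
    if not parts:
        raise FileNotFoundError(f"no .parquet files found under {root}")
    frames = [pd.read_parquet(part) for part in parts]
    return pd.concat(frames, ignore_index=True)
```

Calling `load_derivative` on the directory that holds one derivative (the path depends on how the archive unpacks) should return a single DataFrame for that derivative.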
Domains
.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)
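Produces a DataFrame with the following columns: url (the extracted domain) and count (the number of pages captured for that domain), sorted by count in descending order.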
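To make the grouping concrete, here is a minimal pure-Python sketch of the same idea: reduce each URL to its domain, count occurrences, and sort by count in descending order. It is not the toolkit's implementation. The exact normalization done by ExtractDomainDF is not stated here, so stripping a leading "www." is an assumption, and `extract_domain` and `count_domains` are hypothetical names.

```python
from collections import Counter
from urllib.parse import urlsplit


def extract_domain(url):
    """Return the host of url, lowercased, without a leading 'www.'."""
    # urlsplit().hostname is already lowercased; it is None when no host is present.
    host = urlsplit(url).hostname or ""
    # Assumption: treat "www.example.org" and "example.org" as the same domain.
    return host[4:] if host.startswith("www.") else host


def count_domains(urls):
    """Return (domain, count) pairs, most frequent first."""
    counts = Counter(extract_domain(u) for u in urls)
    return counts.most_common()
```

For example, passing two URLs on example.org and one on example.com yields the example.org pair first with a count of 2.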
Web Pages
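.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF($"content")).alias("content"))
Produces a DataFrame with the following columns: crawl_date, url, mime_type_web_server, mime_type_tika, and content (the page text with HTTP headers and HTML markup removed).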
Web Graph
.webgraph()
Produces a DataFrame with the following columns:
Image Links
.imageLinks()
Produces a DataFrame with the following columns:
The ivy-12006-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.
Name | Checksum | Size |
---|---|---|
ivy-12006-auk.tar.gz | md5:854d89aad5c02a4f731aa159ebc16716 | 175.8 MB |
ivy-12006-parquet.tar.gz | md5:d253a024cc6e5f46868275639718417a | 457.2 MB |
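The MD5 checksums listed with the files can be used to confirm a download before unpacking it. The following is a minimal sketch using only the Python standard library. The file names and checksums are taken from the listing, while `md5_of` and `verify_and_extract` are hypothetical helper names and the destination directory is up to you.

```python
import hashlib
import tarfile

# Checksums as published alongside the two archives.
EXPECTED_MD5 = {
    "ivy-12006-auk.tar.gz": "854d89aad5c02a4f731aa159ebc16716",
    "ivy-12006-parquet.tar.gz": "d253a024cc6e5f46868275639718417a",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_and_extract(path, expected_md5, dest="."):
    """Extract a .tar.gz archive into dest only if its MD5 matches."""
    actual = md5_of(path)
    if actual != expected_md5:
        raise ValueError(f"checksum mismatch for {path}: got {actual}")
    # Only extract archives from a source you trust.
    with tarfile.open(path, "r:gz") as archive:
        archive.extractall(dest)
```

A call such as `verify_and_extract("ivy-12006-parquet.tar.gz", EXPECTED_MD5["ivy-12006-parquet.tar.gz"], "data")` would raise `ValueError` on a corrupted download and otherwise unpack the archive into `data`.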
Statistic | All versions | This version |
---|---|---|
Views | 27 | 27 |
Downloads | 23 | 23 |
Data volume | 10.0 GB | 10.0 GB |
Unique views | 27 | 27 |
Unique downloads | 19 | 19 |