{ "access": { "embargo": { "active": false, "reason": null }, "files": "public", "record": "public", "status": "open" }, "created": "2020-01-31T18:32:15.651660+00:00", "custom_fields": {}, "deletion_status": { "is_deleted": false, "status": "P" }, "files": { "count": 2, "enabled": true, "entries": { "ivy-11670-auk.tar.gz": { "checksum": "md5:ba492f30424dc294c2666f42ca8dfcbe", "ext": "gz", "id": "055caee0-d86f-462e-9dae-cdf83c2afada", "key": "ivy-11670-auk.tar.gz", "metadata": null, "mimetype": "application/gzip", "size": 853818536 }, "ivy-11670-parquet.tar.gz": { "checksum": "md5:c61be4b5601afd60f526d77a448af23c", "ext": "gz", "id": "195038da-d971-401e-9b29-1fea23ce1c36", "key": "ivy-11670-parquet.tar.gz", "metadata": null, "mimetype": "application/gzip", "size": 3335230435 } }, "order": [], "total_bytes": 4189048971 }, "id": "3633161", "is_draft": false, "is_published": true, "links": { "access": "https://zenodo.org/api/records/3633161/access", "access_links": "https://zenodo.org/api/records/3633161/access/links", "access_request": "https://zenodo.org/api/records/3633161/access/request", "access_users": "https://zenodo.org/api/records/3633161/access/users", "archive": "https://zenodo.org/api/records/3633161/files-archive", "archive_media": "https://zenodo.org/api/records/3633161/media-files-archive", "communities": "https://zenodo.org/api/records/3633161/communities", "communities-suggestions": "https://zenodo.org/api/records/3633161/communities-suggestions", "doi": "https://doi.org/10.5281/zenodo.3633161", "draft": "https://zenodo.org/api/records/3633161/draft", "files": "https://zenodo.org/api/records/3633161/files", "latest": "https://zenodo.org/api/records/3633161/versions/latest", "latest_html": "https://zenodo.org/records/3633161/latest", "media_files": "https://zenodo.org/api/records/3633161/media-files", "parent": "https://zenodo.org/api/records/3633160", "parent_doi": "https://zenodo.org/doi/10.5281/zenodo.3633160", "parent_html": 
"https://zenodo.org/records/3633160", "requests": "https://zenodo.org/api/records/3633161/requests", "reserve_doi": "https://zenodo.org/api/records/3633161/draft/pids/doi", "self": "https://zenodo.org/api/records/3633161", "self_doi": "https://zenodo.org/doi/10.5281/zenodo.3633161", "self_html": "https://zenodo.org/records/3633161", "self_iiif_manifest": "https://zenodo.org/api/iiif/record:3633161/manifest", "self_iiif_sequence": "https://zenodo.org/api/iiif/record:3633161/sequence/default", "versions": "https://zenodo.org/api/records/3633161/versions" }, "media_files": { "count": 0, "enabled": false, "entries": {}, "order": [], "total_bytes": 0 }, "metadata": { "creators": [ { "affiliations": [ { "name": "York University" } ], "person_or_org": { "family_name": "Ruest", "given_name": "Nick", "identifiers": [ { "identifier": "0000-0003-1891-1112", "scheme": "orcid" } ], "name": "Ruest, Nick", "type": "personal" } } ], "description": "
Web archive derivatives of the Literary Authors from Europe and Eurasia Web Archive collection from the Ivy Plus Libraries Confederation. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.
\n\nThe ivy-11670-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.
\n\nDomains
\n\n.webpages().groupBy(ExtractDomainDF($\"url\").alias(\"url\")).count().sort($\"count\".desc)
\n\nProduces a DataFrame with the following columns: url, count
\n\nWeb Pages
\n\n.webpages().select($\"crawl_date\", $\"url\", $\"mime_type_web_server\", $\"mime_type_tika\", RemoveHTMLDF(RemoveHTTPHeaderDF(($\"content\"))).alias(\"content\"))
\n\nProduces a DataFrame with the following columns: crawl_date, url, mime_type_web_server, mime_type_tika, content
\n\nWeb Graph
\n\n.webgraph()
\n\nProduces a DataFrame with the following columns:
\n\nImage Links
\n\n.imageLinks()
\n\nProduces a DataFrame with the following columns:
\n\nThe ivy-11670-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.
\n\nThe Web Archives for Historical Research (WAHR) group has the goal of linking history and big data to give historians the tools required to find and interpret digital sources from web archives. Our research focuses on both web histories (writing about the recent past as reflected in web archives) and methodological approaches to understanding these repositories.
\r\n\r\nFunded by the Social Sciences and Humanities Research Council (SSHRC), the WAHR brings together students and faculty across three universities to explore this field. The WAHR is led at the University of Waterloo by Professor Ian Milligan, in close collaboration with Professor Jimmy Lin, with partnerships at Western University with Professor William J. Turkel and at York University with Nick Ruest.
", "title": "Web Archives for Historical Research Group " }, "revision_id": 0, "slug": "wahr", "updated": "2016-06-20T13:55:42+00:00" } ], "ids": [ "1ec35a90-46e8-448d-be9a-6965270f8f47" ] }, "id": "3633160", "pids": { "doi": { "client": "datacite", "identifier": "10.5281/zenodo.3633160", "provider": "datacite" } } }, "pids": { "doi": { "client": "datacite", "identifier": "10.5281/zenodo.3633161", "provider": "datacite" }, "oai": { "identifier": "oai:zenodo.org:3633161", "provider": "oai" } }, "revision_id": 2, "stats": { "all_versions": { "data_volume": 56939048522.0, "downloads": 26, "unique_downloads": 22, "unique_views": 280, "views": 291 }, "this_version": { "data_volume": 56939048522.0, "downloads": 26, "unique_downloads": 22, "unique_views": 279, "views": 290 } }, "status": "published", "updated": "2020-01-31T19:20:51.385967+00:00", "versions": { "index": 1, "is_latest": true } }