{ "access": { "embargo": { "active": false, "reason": null }, "files": "public", "record": "public", "status": "open" }, "created": "2020-02-25T21:35:19.545116+00:00", "custom_fields": {}, "deletion_status": { "is_deleted": false, "status": "P" }, "files": { "count": 1, "enabled": true, "entries": { "ImmigrationQc-parquet.tar.gz": { "checksum": "md5:572a10bda0579a44cab510e99e305816", "ext": "gz", "id": "7c457b76-81e8-4197-a522-09249c6a5afd", "key": "ImmigrationQc-parquet.tar.gz", "metadata": null, "mimetype": "application/gzip", "size": 33659632 } }, "order": [], "total_bytes": 33659632 }, "id": "3687264", "is_draft": false, "is_published": true, "links": { "access": "https://zenodo.org/api/records/3687264/access", "access_links": "https://zenodo.org/api/records/3687264/access/links", "access_request": "https://zenodo.org/api/records/3687264/access/request", "access_users": "https://zenodo.org/api/records/3687264/access/users", "archive": "https://zenodo.org/api/records/3687264/files-archive", "archive_media": "https://zenodo.org/api/records/3687264/media-files-archive", "communities": "https://zenodo.org/api/records/3687264/communities", "communities-suggestions": "https://zenodo.org/api/records/3687264/communities-suggestions", "doi": "https://doi.org/10.5281/zenodo.3687264", "draft": "https://zenodo.org/api/records/3687264/draft", "files": "https://zenodo.org/api/records/3687264/files", "latest": "https://zenodo.org/api/records/3687264/versions/latest", "latest_html": "https://zenodo.org/records/3687264/latest", "media_files": "https://zenodo.org/api/records/3687264/media-files", "parent": "https://zenodo.org/api/records/3687263", "parent_doi": "https://zenodo.org/doi/10.5281/zenodo.3687263", "parent_html": "https://zenodo.org/records/3687263", "requests": "https://zenodo.org/api/records/3687264/requests", "reserve_doi": "https://zenodo.org/api/records/3687264/draft/pids/doi", "self": "https://zenodo.org/api/records/3687264", "self_doi": 
"https://zenodo.org/doi/10.5281/zenodo.3687264", "self_html": "https://zenodo.org/records/3687264", "self_iiif_manifest": "https://zenodo.org/api/iiif/record:3687264/manifest", "self_iiif_sequence": "https://zenodo.org/api/iiif/record:3687264/sequence/default", "versions": "https://zenodo.org/api/records/3687264/versions" }, "media_files": { "count": 0, "enabled": false, "entries": {}, "order": [], "total_bytes": 0 }, "metadata": { "creators": [ { "affiliations": [ { "name": "York University" } ], "person_or_org": { "family_name": "Ruest", "given_name": "Nick", "identifiers": [ { "identifier": "0000-0003-1891-1112", "scheme": "orcid" } ], "name": "Ruest, Nick", "type": "personal" } }, { "affiliations": [ { "name": "Biblioth\u00e8que et Archives nationales du Qu\u00e9bec" } ], "person_or_org": { "family_name": "Gagn\u00e9", "given_name": "Carole", "name": "Gagn\u00e9, Carole", "type": "personal" } }, { "affiliations": [ { "name": "Biblioth\u00e8que et Archives nationales du Qu\u00e9bec" } ], "person_or_org": { "family_name": "Mitchell", "given_name": "Dave", "name": "Mitchell, Dave", "type": "personal" } } ], "description": "
Web archive derivatives of the Sites of the Quebec Ministry of Immigration from 2012 to 2018 collection from the Bibliothèque et Archives nationales du Québec. The derivatives were created with the Archives Unleashed Toolkit. Merci beaucoup BAnQ!
\n\nThese derivatives are in the Apache Parquet format, a columnar storage format. They are generally small enough to work with on a local machine and can be easily converted to pandas DataFrames. See this notebook for examples.
\n\nDomains
\n\n.webpages().groupBy(ExtractDomainDF($\"url\").alias(\"url\")).count().sort($\"count\".desc)
\n\nProduces a DataFrame with the following columns:
\n\nWeb Pages
\n\n.webpages().select($\"crawl_date\", $\"url\", $\"mime_type_web_server\", $\"mime_type_tika\", RemoveHTMLDF(RemoveHTTPHeaderDF(($\"content\"))).alias(\"content\"))
\n\nProduces a DataFrame with the following columns:
\n\nWeb Graph
\n\n.webgraph()
\n\nProduces a DataFrame with the following columns:
\n\nImage Links
\n\n.imageLinks()
\n\nProduces a DataFrame with the following columns:
\n\nThe Web Archives for Historical Research (WAHR) group has the goal of linking history and big data to give historians the tools required to find and interpret digital sources from web archives. Our research focuses on both web histories - writing about the recent past as reflected in web archives - as well as methodological approaches to understanding these repositories.
\r\n\r\nFunded by the Social Sciences and Humanities Research Council (SSHRC), the WAHR brings together students and faculty across three universities to explore this field. The WAHR is led at the University of Waterloo by Professor Ian Milligan, in close collaboration with Professor Jimmy Lin, with partnerships at Western University with Professor William J. Turkel and at York University with Nick Ruest.
", "title": "Web Archives for Historical Research Group " }, "revision_id": 0, "slug": "wahr", "updated": "2016-06-20T13:55:42+00:00" } ], "ids": [ "1ec35a90-46e8-448d-be9a-6965270f8f47" ] }, "id": "3687263", "pids": { "doi": { "client": "datacite", "identifier": "10.5281/zenodo.3687263", "provider": "datacite" } } }, "pids": { "doi": { "client": "datacite", "identifier": "10.5281/zenodo.3687264", "provider": "datacite" }, "oai": { "identifier": "oai:zenodo.org:3687264", "provider": "oai" } }, "revision_id": 3, "stats": { "all_versions": { "data_volume": 168298160.0, "downloads": 5, "unique_downloads": 5, "unique_views": 303, "views": 304 }, "this_version": { "data_volume": 168298160.0, "downloads": 5, "unique_downloads": 5, "unique_views": 302, "views": 303 } }, "status": "published", "updated": "2020-02-26T07:21:02.815120+00:00", "versions": { "index": 1, "is_latest": true } }