Published February 25, 2020 | Version v1
Dataset Open

Quebec Ministry of Agriculture, Fisheries and Food from 2012-2018 web archive collection derivatives

  • 1. York University
  • 2. Bibliothèque et Archives nationales du Québec

Description

Web archive derivatives of the "Quebec Ministry of Agriculture, Fisheries and Food from 2012-2018" collection from the Bibliothèque et Archives nationales du Québec (BAnQ). The derivatives were created with the Archives Unleashed Toolkit. Many thanks to BAnQ!

These derivatives are stored in Apache Parquet, a columnar storage format. They are generally small enough to work with on a local machine and can be easily converted to pandas DataFrames. See this notebook for examples.
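As a rough illustration of the conversion to pandas, the sketch below assumes that each derivative unpacks to a directory of part files, as Spark typically writes them; the actual layout of the download is not described on this page. The function names and the directory name used afterwards are hypothetical, and reading Parquet with pandas also needs an engine such as pyarrow.

```python
from pathlib import Path


def list_parquet_parts(derivative_dir):
    """Return the Parquet part files in one derivative directory, sorted by name."""
    # Spark marker files such as _SUCCESS do not match the pattern and are skipped.
    return sorted(Path(derivative_dir).glob("*.parquet"))


def load_derivative(derivative_dir):
    """Read every part file and concatenate them into one pandas DataFrame."""
    # Imported here so that listing files works even when pandas is not installed.
    import pandas as pd  # also requires a Parquet engine such as pyarrow

    parts = list_parquet_parts(derivative_dir)
    if not parts:
        raise FileNotFoundError(f"no .parquet files found in {derivative_dir}")
    return pd.concat((pd.read_parquet(part) for part in parts), ignore_index=True)
```

With the archive unpacked, a call such as `load_derivative("domains")` (a hypothetical directory name) would return a single DataFrame for that derivative.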

Domains

.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)

Produces a DataFrame with the following columns:

  • domain
  • count
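To make the grouping above concrete, here is a minimal Python sketch of the same idea: take the host of each URL, count occurrences, and sort by descending count. It only approximates the toolkit's `ExtractDomainDF`, which may normalise hosts differently; `extract_domain` and `count_domains` are hypothetical names, and the URLs in the usage note are made up.

```python
from collections import Counter
from urllib.parse import urlsplit


def extract_domain(url):
    """Return the host part of a URL, or an empty string if it has none."""
    # Assumption: the host is a reasonable stand-in for the toolkit's domain.
    return urlsplit(url).hostname or ""


def count_domains(urls):
    """Count URLs per domain and return (domain, count) pairs, largest first."""
    counts = Counter(extract_domain(url) for url in urls)
    return counts.most_common()
```

For instance, `count_domains(["http://example.org/a", "http://example.org/b", "https://example.net/"])` yields the `example.org` pair first, because it occurs twice.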

Web Pages

.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))

Produces a DataFrame with the following columns:

  • crawl_date
  • url
  • mime_type_web_server
  • mime_type_tika
  • content
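The `content` column is the result of two chained cleaning steps: stripping the HTTP header, then stripping the HTML markup. The sketch below imitates that pipeline with the Python standard library only. It is an approximation under stated assumptions, not the toolkit's implementation: the function names are hypothetical, and the tag stripping is deliberately naive (for example, it keeps the text inside script and style elements).

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def remove_http_header(raw):
    """Drop everything up to the first blank line if the record starts with an HTTP status line."""
    for separator in ("\r\n\r\n", "\n\n"):
        head, found, body = raw.partition(separator)
        if found and head.startswith("HTTP/"):
            return body
    return raw  # no header recognised: leave the text unchanged


def remove_html(markup):
    """Return the text of an HTML fragment with whitespace collapsed."""
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()  # flush any text still buffered by the parser
    return " ".join(" ".join(collector.parts).split())
```

Calling `remove_html(remove_http_header(raw))` mirrors the nesting order of the expression above: the header goes first, then the markup.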

Web Graph

.webgraph()

Produces a DataFrame with the following columns:

  • crawl_date
  • src
  • dest
  • anchor
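One common use of these columns is to collapse page-level links into weighted domain-to-domain edges. The sketch below shows one way to do that, assuming `src` and `dest` hold full URLs (the page does not state this) and that rows arrive as dictionaries keyed by column name, for example from pandas `DataFrame.to_dict("records")`. The helper names and the sample hosts in the usage note are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url):
    """Return the host part of a URL, or an empty string if it has none."""
    return urlsplit(url).hostname or ""


def domain_edges(rows):
    """Aggregate link rows into ((src_host, dest_host), weight) pairs, heaviest first."""
    edges = Counter()
    for row in rows:
        edges[(host_of(row["src"]), host_of(row["dest"]))] += 1
    return edges.most_common()
```

Two links from pages on `a.example` to pages on `b.example` would become one edge with weight 2, which is usually a more manageable input for graph tools than the raw page-level rows.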

Image Links

.imageLinks()

Produces a DataFrame with the following columns:

  • src
  • image_url
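With only these two columns, a simple question to ask is which images are embedded on more than one page. The sketch below groups source pages by image URL; it assumes rows are dictionaries keyed by column name, and the function names and the `min_pages` parameter are hypothetical.

```python
from collections import defaultdict


def pages_per_image(rows):
    """Map each image URL to the sorted list of distinct pages that link to it."""
    pages = defaultdict(set)
    for row in rows:
        pages[row["image_url"]].add(row["src"])
    return {image: sorted(sources) for image, sources in pages.items()}


def shared_images(rows, min_pages=2):
    """Keep only the images referenced from at least min_pages distinct pages."""
    return {
        image: sources
        for image, sources in pages_per_image(rows).items()
        if len(sources) >= min_pages
    }
```

Using a set per image means that a page linking to the same image several times is counted once.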

Binary Analysis

  • Audio
  • Images
  • PDFs
  • Presentation program files
  • Spreadsheets
  • Text files
  • Videos
  • Word processor files

Files

Files (92.3 MB)

Size: 92.3 MB
md5:a8929e50f17ce55c180f0819fbda5f8a