Published January 2, 2020 | Version v4
Dataset Open

Ministry of Environment of Québec (2011-2014) web archive collection derivatives

  • 1. York University
  • 2. Bibliothèque et Archives nationales du Québec

Description

Web archive derivatives of the Ministry of Environment of Québec (2011-2014) collection from the Bibliothèque et Archives nationales du Québec. The derivatives were created with the Archives Unleashed Toolkit. Merci beaucoup BAnQ!

These derivatives are stored in Apache Parquet, a columnar storage format. They are generally small enough to work with on a local machine, and can be easily converted to pandas DataFrames. See this notebook for examples.
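As a minimal sketch of the conversion to a DataFrame, assuming the download unpacks to one directory per derivative holding Spark-style `*.parquet` part files (the directory name `domains` and the helper names below are hypothetical, and the actual layout may differ), a derivative could be read with pandas, which needs `pyarrow` or `fastparquet` installed:

```python
from pathlib import Path


def list_parquet_parts(derivative_dir):
    """Return the sorted Parquet part files found in one derivative directory."""
    return sorted(Path(derivative_dir).glob("*.parquet"))


def load_derivative(derivative_dir):
    """Read every Parquet part in the directory into one pandas DataFrame."""
    # Imported here so that list_parquet_parts() works without pandas installed.
    import pandas as pd

    # pandas.read_parquet accepts a directory of Parquet files as well as a single file.
    return pd.read_parquet(derivative_dir)


if __name__ == "__main__":
    # "domains" is a hypothetical directory name; point it at the unpacked download.
    for part in list_parquet_parts("domains"):
        print(part.name)
```

Calling `load_derivative("domains").head()` would then show the first rows, provided the directory exists and a Parquet engine is available.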

Domains

.webpages().groupBy(ExtractDomainDF($"url").alias("domain")).count().sort($"count".desc)

Produces a DataFrame with the following columns:

  • domain
  • count
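To make the shape of this derivative concrete, here is a rough pure-Python sketch of the same idea: group page URLs by host and sort by frequency. It is an illustration only, not the toolkit's implementation; `extract_domain` and `count_domains` are hypothetical names, and the toolkit's `ExtractDomainDF` may normalize hosts differently (for example in how it treats a `www.` prefix).

```python
from collections import Counter
from urllib.parse import urlparse


def extract_domain(url):
    """Return the lowercase host of a URL, or "" when none can be parsed."""
    return urlparse(url).hostname or ""


def count_domains(urls):
    """Count pages per domain and return (domain, count) pairs, most frequent first."""
    counts = Counter(extract_domain(u) for u in urls)
    return counts.most_common()


if __name__ == "__main__":
    # Illustrative URLs, not taken from the collection.
    sample = [
        "http://www.example.org/a",
        "http://www.example.org/b",
        "https://example.com/",
    ]
    for domain, count in count_domains(sample):
        print(domain, count)
```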

Web Pages

.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))

Produces a DataFrame with the following columns:

  • crawl_date
  • url
  • mime_type_web_server
  • mime_type_tika
  • content
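The `content` column holds page text after the HTTP headers and the HTML markup have been stripped. The sketch below approximates those two steps with the Python standard library so the transformation is easier to picture. It is an assumption-laden stand-in, not the toolkit's `RemoveHTTPHeaderDF` or `RemoveHTMLDF`, and its output can differ from theirs in whitespace and edge cases; all names in it are hypothetical.

```python
from html.parser import HTMLParser


def remove_http_header(raw):
    """Drop the header block (up to the first blank line) of a raw HTTP response."""
    if not raw.startswith("HTTP/"):
        return raw
    # Find the earliest blank-line separator, whichever line ending is used.
    cuts = [(raw.find(sep), len(sep)) for sep in ("\r\n\r\n", "\n\n")]
    cuts = [(pos, size) for pos, size in cuts if pos != -1]
    if not cuts:
        return raw
    pos, size = min(cuts)
    return raw[pos + size:]


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def remove_html(markup):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    # Text nodes are joined with a space, so a word split by inline tags gains a gap.
    return " ".join(" ".join(parser.chunks).split())


if __name__ == "__main__":
    raw = (
        "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
        "<html><body><p>Bonjour <b>Québec</b></p></body></html>"
    )
    print(remove_html(remove_http_header(raw)))
```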

Web Graph

.webgraph()

Produces a DataFrame with the following columns:

  • crawl_date
  • src
  • dest
  • anchor
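One common use of these four columns is to collapse page-level links into host-to-host edge counts. A minimal sketch follows, assuming rows arrive as `(crawl_date, src, dest, anchor)` tuples of strings; `domain_edges` is a hypothetical helper, not part of the toolkit, and the sample rows are invented for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_edges(rows):
    """Count links between hosts from (crawl_date, src, dest, anchor) rows."""
    edges = Counter()
    for _crawl_date, src, dest, _anchor in rows:
        src_host = urlparse(src).hostname
        dest_host = urlparse(dest).hostname
        # Skip rows where either end is not an absolute URL with a host.
        if src_host and dest_host:
            edges[(src_host, dest_host)] += 1
    return edges


if __name__ == "__main__":
    # Illustrative rows, not taken from the collection.
    rows = [
        ("20110101", "http://example.org/a", "http://example.com/x", "link one"),
        ("20110101", "http://example.org/b", "http://example.com/y", "link two"),
        ("20110102", "http://example.org/a", "mailto:someone@example.org", ""),
    ]
    for (src_host, dest_host), count in domain_edges(rows).most_common():
        print(src_host, "->", dest_host, count)
```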

Image Links

.imageLinks()

Produces a DataFrame with the following columns:

  • src
  • image_url

Binary Analysis

  • Audio
  • Images
  • PDFs
  • Presentation program files
  • Spreadsheets
  • Text files
  • Videos
  • Word processor files

Files

Files (491.7 MB)

Name Size
md5:02fa3193b4134405ee62246e686afbd6 3.1 kB
md5:cc737657505209b2698708b0660b4421 491.7 MB