Dataset Open Access

Sites of the Quebec Ministry of Immigration from 2012 to 2018 web archive collection derivatives

Ruest, Nick; Gagné, Carole; Mitchell, Dave

Web archive derivatives of the "Sites of the Quebec Ministry of Immigration from 2012 to 2018" collection from the Bibliothèque et Archives nationales du Québec (BAnQ). The derivatives were created with the Archives Unleashed Toolkit. Thank you very much, BAnQ!

These derivatives are in Apache Parquet, a columnar storage format. They are generally small enough to work with on a local machine and can be easily converted to pandas DataFrames. See this notebook for examples.
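As a minimal sketch of the conversion to pandas, the Python below assumes each derivative is extracted as a directory of Parquet part files, and that pandas and a Parquet engine such as pyarrow are installed. The function names are illustrative and are not part of the Archives Unleashed Toolkit.

```python
from pathlib import Path


def list_parquet_parts(derivative_dir):
    """Return the Parquet part files in one derivative directory, sorted by name."""
    return sorted(Path(derivative_dir).glob("*.parquet"))


def load_derivative(derivative_dir):
    """Load all part files of one derivative into a single pandas DataFrame."""
    # Imported here so that list_parquet_parts works without pandas installed.
    import pandas as pd

    # read_parquet accepts a directory of part files when a Parquet engine
    # such as pyarrow is available (assumed here).
    return pd.read_parquet(derivative_dir)
```

Calling `load_derivative` on the directory that holds, for example, the domains derivative should return a DataFrame whose columns match those listed for that derivative. The directory layout inside the archive is not documented in this record, so the path has to be checked after extraction.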

Domains

.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)

Produces a DataFrame with the following columns:

  • domain
  • count
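To illustrate what the domains derivative contains, here is a plain Python sketch of the same idea: group URLs by host name, count them, and sort by descending count. It approximates the domain by the URL's host name. The exact normalization applied by ExtractDomainDF is not reproduced, and the function name is hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def count_domains(urls):
    """Count URLs per host name, most frequent first."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).hostname
        if host:  # skip URLs without a parsable host
            counts[host] += 1
    # most_common() orders by descending count, mirroring sort($"count".desc).
    return counts.most_common()
```

Each returned pair corresponds to one row of the derivative: a domain and the number of captured pages under it.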

Web Pages

.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))

Produces a DataFrame with the following columns:

  • crawl_date
  • url
  • mime_type_web_server
  • mime_type_tika
  • content

Web Graph

.webgraph()

Produces a DataFrame with the following columns:

  • crawl_date
  • src
  • dest
  • anchor

Image Links

.imageLinks()

Produces a DataFrame with the following columns:

  • src
  • image_url

Binary Analysis

  • Audio
  • Images
  • PDFs
  • Presentation program files
  • Spreadsheets
  • Text files
  • Videos
  • Word processor files

Files (33.7 MB)

  • Name: ImmigrationQc-parquet.tar.gz
  • Size: 33.7 MB
  • md5: 572a10bda0579a44cab510e99e305816
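Before unpacking the archive, it may be worth checking the download against the md5 value listed above. The following is a minimal Python sketch using only the standard library. The function names and the destination directory are illustrative assumptions.

```python
import hashlib
import tarfile

# md5 value taken from the file listing of this record.
EXPECTED_MD5 = "572a10bda0579a44cab510e99e305816"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_and_extract(archive_path, dest_dir, expected_md5=EXPECTED_MD5):
    """Extract a .tar.gz archive only if its md5 matches the expected value."""
    actual = md5_of_file(archive_path)
    if actual != expected_md5:
        raise ValueError(
            f"checksum mismatch: expected {expected_md5}, got {actual}"
        )
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest_dir)
```

For this record, the call would be `verify_and_extract("ImmigrationQc-parquet.tar.gz", "derivatives")`, where `derivatives` is a hypothetical output directory.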