Harvest Quebec Government Websites from December 2006 web archive collection derivatives
Creators
- 1. York University
- 2. Bibliothèque et Archives nationales du Québec
Description
Web archive derivatives of the Harvest Quebec Government Websites from December 2006 collection from the Bibliothèque et Archives nationales du Québec (BAnQ). The derivatives were created with the Archives Unleashed Toolkit. Thank you very much, BAnQ!
These derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.
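As a rough sketch of the Pandas conversion mentioned above, the snippet below collects the Parquet part files from an extracted derivative directory and concatenates them into one DataFrame. The helper names and the `domains-parquet` directory are hypothetical, and it assumes `pandas` and a Parquet engine such as `pyarrow` are installed.

```python
from pathlib import Path


def find_parquet_files(directory):
    """Return the sorted Parquet part files found directly inside `directory`."""
    return sorted(Path(directory).glob("*.parquet"))


def load_derivative(directory):
    """Concatenate every Parquet part file in `directory` into one DataFrame."""
    # Imported lazily so the file-discovery helper works without pandas.
    import pandas as pd

    parts = find_parquet_files(directory)
    if not parts:
        raise FileNotFoundError(f"no .parquet files found in {directory}")
    return pd.concat((pd.read_parquet(p) for p in parts), ignore_index=True)


# Hypothetical usage, once the archive has been downloaded and extracted:
# df = load_derivative("domains-parquet")
# print(df.head())
```

Reading part files one by one keeps the sketch independent of how many files each derivative was written to.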
Domains
.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)
Produces a DataFrame with the following columns:
- domain
- count
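To illustrate what a domain frequency table of this shape contains, here is a minimal sketch in plain Python that counts URLs per host and sorts by descending count. It only approximates the derivative: the toolkit's `ExtractDomainDF` may normalise hosts differently, and the sample URLs and the `count_domains` helper are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def count_domains(urls):
    """Count URLs per host, most frequent first."""
    hosts = (urlparse(u).hostname or "" for u in urls)
    # Skip entries with no recognisable host.
    return Counter(h for h in hosts if h).most_common()


# Made-up URLs, for illustration only.
sample = [
    "http://www.example.org/a.html",
    "http://www.example.org/b.html",
    "http://docs.example.com/",
]
print(count_domains(sample))
```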
Web Pages
.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
Produces a DataFrame with the following columns:
- crawl_date
- url
- mime_type_web_server
- mime_type_tika
- content
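A table with these columns lends itself to simple filtering and tallying. The sketch below works on rows represented as plain dictionaries; the sample values (including the `crawl_date` format) and both helper names are assumptions for illustration, not taken from the collection.

```python
from collections import Counter


def mime_type_counts(rows, column="mime_type_tika"):
    """Tally rows by the given MIME type column."""
    return Counter(row[column] for row in rows)


def pages_mentioning(rows, keyword):
    """Return URLs of rows whose text contains `keyword`, ignoring case."""
    needle = keyword.lower()
    return [r["url"] for r in rows if needle in (r["content"] or "").lower()]


# Made-up rows, for illustration only.
rows = [
    {"crawl_date": "20061201", "url": "http://www.example.org/",
     "mime_type_web_server": "text/html", "mime_type_tika": "text/html",
     "content": "Bienvenue au portail"},
    {"crawl_date": "20061201", "url": "http://www.example.org/doc",
     "mime_type_web_server": "text/html", "mime_type_tika": "application/xhtml+xml",
     "content": None},
]
print(mime_type_counts(rows))
print(pages_mentioning(rows, "portail"))
```

With the data loaded in Pandas, `df["mime_type_tika"].value_counts()` and `df[df["content"].str.contains("portail", case=False, na=False)]` would do roughly the same work.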
Web Graph
.webgraph()
Produces a DataFrame with the following columns:
- crawl_date
- src
- dest
- anchor
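One common use of a link table like this is ranking destinations by how many distinct pages link to them. The following is a minimal sketch over dictionary rows with the columns listed above; the edges and the `in_degree` helper are hypothetical.

```python
from collections import Counter


def in_degree(edges):
    """Count the distinct source pages linking to each destination."""
    # De-duplicate repeated (src, dest) pairs, e.g. several anchors on one page.
    unique_links = {(e["src"], e["dest"]) for e in edges}
    return Counter(dest for _, dest in unique_links)


# Made-up edges, for illustration only.
edges = [
    {"crawl_date": "20061201", "src": "http://a.example.org/",
     "dest": "http://b.example.org/", "anchor": "B"},
    {"crawl_date": "20061201", "src": "http://a.example.org/",
     "dest": "http://b.example.org/", "anchor": "B again"},
    {"crawl_date": "20061201", "src": "http://c.example.org/",
     "dest": "http://b.example.org/", "anchor": "see B"},
    {"crawl_date": "20061201", "src": "http://b.example.org/",
     "dest": "http://a.example.org/", "anchor": "A"},
]
print(in_degree(edges).most_common(1))
```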
Image Links
.imageLinks()
Produces a DataFrame with the following columns:
- src
- image_url
Binary Analysis
- Audio
- Images
- PDFs
- Presentation program files
- Spreadsheets
- Text files
- Videos
- Word processor files
Files
(1.4 GB)
Checksum | Size |
---|---|
md5:0246a7a58b4128491867a9f10485ce65 | 1.4 GB |
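The MD5 checksum in the listing above can be used to confirm that a download completed intact. A minimal sketch, reading the file in chunks so a 1.4 GB archive does not need to fit in memory; the helper names are hypothetical, and the local file name is left to the reader because the listing does not preserve it.

```python
import hashlib

# Checksum copied from the file listing.
EXPECTED_MD5 = "0246a7a58b4128491867a9f10485ce65"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the published checksum."""
    return md5_of_file(path) == expected.lower()
```

Calling `matches_listing` with the path of the downloaded file should return `True` for an uncorrupted copy.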