3687262
doi
10.5281/zenodo.3687262
oai:zenodo.org:3687262
user-wahr
Gagné, Carole
Bibliothèque et Archives nationales du Québec
Mitchell, Dave
Bibliothèque et Archives nationales du Québec
Coalition Avenir Québec (CAQ) web archive collection derivatives
Ruest, Nick
York University
info:eu-repo/semantics/openAccess
Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
web archives
parquet
dataframes
<p>Web archive derivatives of the Coalition Avenir Québec (CAQ) collection from the <a href="https://www.banq.qc.ca/accueil/">Bibliothèque et Archives nationales du Québec</a>. The derivatives were created with the <a href="https://github.com/archivesunleashed/aut/">Archives Unleashed Toolkit</a>. Thank you very much, BAnQ!</p>
<p>These derivatives are in the <a href="https://parquet.apache.org/">Apache Parquet format</a>, a <a href="http://en.wikipedia.org/wiki/Column-oriented_DBMS">columnar storage</a> format. They are generally small enough to work with on a local machine and can be easily converted to pandas DataFrames. See the <a href="https://github.com/archivesunleashed/notebooks/blob/master/parquet_pandas_example.ipynb">parquet_pandas_example notebook</a> for examples.</p>
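<p>A minimal sketch of the pandas conversion described above. It assumes pandas and a Parquet engine such as pyarrow are installed, and that the archive has been extracted so that each derivative is a directory of <code>.parquet</code> part files. The directory name <code>webpages</code> and the helper names <code>list_parquet_parts</code> and <code>load_derivative</code> are hypothetical and not taken from the dataset.</p>

```python
from pathlib import Path


def list_parquet_parts(derivative_dir):
    """Return the sorted Parquet part files found in one derivative directory."""
    return sorted(Path(derivative_dir).glob("*.parquet"))


def load_derivative(derivative_dir):
    """Load every Parquet part file in a directory into one pandas DataFrame."""
    # Imported lazily so the path helper above works without pandas installed.
    import pandas as pd

    parts = list_parquet_parts(derivative_dir)
    if not parts:
        raise FileNotFoundError(f"no .parquet files in {derivative_dir}")
    # pandas.read_parquet needs a Parquet engine such as pyarrow.
    frames = [pd.read_parquet(part) for part in parts]
    return pd.concat(frames, ignore_index=True)


# Example usage (hypothetical path, adjust to where the archive was extracted):
#   pages = load_derivative("webpages")
#   print(pages.columns.tolist())
#   print(pages.head())
```

<p>Reading the part files one by one keeps the sketch independent of how a particular Parquet engine handles directories and marker files.</p>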
<p><strong>Domains</strong></p>
<pre><code class="language-java">.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)</code></pre>
<p>Produces a DataFrame with the following columns:</p>
<ul>
<li>domain</li>
<li>count</li>
</ul>
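<p>To make the group, count, and sort steps of this derivative concrete, here is a rough plain-Python approximation. It is not the toolkit's implementation: <code>ExtractDomainDF</code> may normalize host names differently, and the function name <code>count_domains</code> and the sample URLs are invented for illustration.</p>

```python
from collections import Counter
from urllib.parse import urlparse


def count_domains(urls):
    """Count URLs per host name and return (domain, count) pairs, largest first."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).hostname  # None when the URL has no host part
        if host:
            counts[host] += 1
    # most_common() orders by descending count, like sort($"count".desc)
    return counts.most_common()


# Invented sample URLs, for illustration only.
sample = [
    "https://example.org/a",
    "https://example.org/b",
    "http://example.com/",
]
print(count_domains(sample))
```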
<p><strong>Web Pages</strong></p>
<pre><code class="language-java">.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF($"content")).alias("content"))</code></pre>
<p>Produces a DataFrame with the following columns:</p>
<ul>
<li>crawl_date</li>
<li>url</li>
<li>mime_type_web_server</li>
<li>mime_type_tika</li>
<li>content</li>
</ul>
<p><strong>Web Graph</strong></p>
<pre><code class="language-java">.webgraph()</code></pre>
<p>Produces a DataFrame with the following columns:</p>
<ul>
<li>crawl_date</li>
<li>src</li>
<li>dest</li>
<li>anchor</li>
</ul>
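<p>One way to explore a table with these four columns is to tally how often each source links out and how often each destination is linked to. The sketch below uses only the column names listed above. The sample rows, their values, and the function name <code>link_counts</code> are invented for illustration.</p>

```python
from collections import Counter


def link_counts(rows):
    """Tally outgoing links per src and incoming links per dest."""
    out_links = Counter()
    in_links = Counter()
    for row in rows:
        out_links[row["src"]] += 1
        in_links[row["dest"]] += 1
    return out_links, in_links


# Invented rows that reuse the four column names of the web graph derivative.
rows = [
    {"crawl_date": "20180101", "src": "a.example", "dest": "b.example", "anchor": "B"},
    {"crawl_date": "20180101", "src": "a.example", "dest": "c.example", "anchor": "C"},
    {"crawl_date": "20180102", "src": "b.example", "dest": "c.example", "anchor": "C"},
]
out_links, in_links = link_counts(rows)
print(out_links.most_common(1))
print(in_links.most_common(1))
```

<p>If the derivative has been loaded into a pandas DataFrame, <code>df.to_dict("records")</code> yields rows in this shape.</p>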
<p><strong>Image Links</strong></p>
<pre><code class="language-java">.imageLinks()</code></pre>
<p>Produces a DataFrame with the following columns:</p>
<ul>
<li>src</li>
<li>image_url</li>
</ul>
<p><a href="https://github.com/archivesunleashed/aut-docs/blob/master/current/binary-analysis.md#binary-analysis"><strong>Binary Analysis</strong></a></p>
<ul>
<li>Audio</li>
<li>Images</li>
<li>PDFs</li>
<li>Presentation program files</li>
<li>Spreadsheets</li>
<li>Text files</li>
<li>Videos</li>
<li>Word processor files</li>
</ul>
Zenodo
2020-02-25
info:eu-repo/semantics/other
3687261
user-wahr
1582701662.81496
170645758
md5:e134ad834219c3d596a3105594d212e4
https://zenodo.org/records/3687262/files/CAQ-2012--2018-11-parquet.tar.gz
public
10.5281/zenodo.3687261
isVersionOf
doi