Webis-Web-Archive-17
====================
Version 1.1.0; October 2nd, 2020


This dataset was created in mid-2017 from 10,000 web pages that were carefully sampled from the Common Crawl to include a mixture of high-ranking and low-ranking web pages. The process is described in detail in an upcoming publication.

The file 'sites-and-pages.txt' contains for each page its ID and URL (see 'Web page data') as well as its site name and site ID (see 'Web site data').

All ZIP archives (with the exception of webis-web-archive-17-000000.zip) extract to a common directory hierarchy; for convenience, the resources are split into several parts. The ZIP archive webis-web-archive-17-000000.zip contains all resources for the first page (which are also contained in the respective other archives), in case you just want to have a look at one page without downloading everything.

Web page data
-------------
The directory 'pages' contains for each page the following data:
  - archive-*.warc.gz. The web archive file that contains all web page data. It was created using the [webis-web-archiver](https://zenodo.org/badge/latestdoi/107244409), which scrolled down up to 25 screen heights during the archiving process.
  - replay.db. Index of the warc contents.
  - archiving.png. PNG screenshot taken by the webis-web-archiver after scrolling down.
  - archiving.html. HTML DOM snapshot taken by the webis-web-archiver after scrolling down.
  - reproducing-{custom,pywb,warcprox}.{png,html}. Screenshots and snapshots taken by the webis-web-archiver after scrolling down during reproduction of the page from the warc.gz when using one of the three reproduction methods (details in the paper).
  - normalized-rmse.txt. RMSE (root mean squared error) matrix when comparing the images (archiving and reproducing) using imagemagick's 'compare -metric rmse'.
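As a rough illustration of what the RMSE values express, the following sketch computes a normalized RMSE between two equally sized grayscale images given as nested lists of 8-bit values. It is not the ImageMagick implementation and does not read 'normalized-rmse.txt'; the function name, the in-memory image representation, and the scaling by the maximum pixel value are assumptions made for illustration.

```python
import math


def normalized_rmse(image_a, image_b, max_value=255):
    """Return the RMSE of two equally sized grayscale images, scaled to [0, 1].

    Hypothetical helper for illustration only: images are lists of rows,
    each row a list of integer pixel values in [0, max_value].
    """
    if len(image_a) != len(image_b):
        raise ValueError("images must have the same height")
    squared_error = 0.0
    count = 0
    for row_a, row_b in zip(image_a, image_b):
        if len(row_a) != len(row_b):
            raise ValueError("images must have the same width")
        for a, b in zip(row_a, row_b):
            squared_error += (a - b) ** 2
            count += 1
    if count == 0:
        raise ValueError("images must not be empty")
    # Root of the mean squared error, divided by the largest possible error.
    return math.sqrt(squared_error / count) / max_value
```

A value of 0 means the two images are pixel-identical; a value of 1 means every pixel differs by the maximum possible amount.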


Reproduction quality annotations
--------------------------------
The file 'page-quality-scores.txt' contains a reproduction quality annotation for each web page, on a scale from 1 (no significant difference) to 5 (unusable). The score is the median of the scores that 9 Mechanical Turk workers gave when comparing the archiving screenshot with the reproducing screenshot that has the lowest RMSE to the archiving screenshot (see above). The reproduction approach used is given in the second column of the file. The whole process is explained in more detail in the paper.

In cases where the screenshots taken during archiving and reproducing are identical, we automatically assigned the best score (1) without human annotation. These cases are marked in 'page-quality-scores.txt' with a 'no' in the fourth column.
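The scoring procedure described in this section can be sketched as follows. This is a minimal illustration, not the code used to produce the dataset: the function names are hypothetical, and treating an RMSE of 0 as "identical screenshots" is an assumption.

```python
from statistics import median


def select_best_reproduction(rmse_by_method):
    """Return the reproduction method whose screenshot has the lowest RMSE
    to the archiving screenshot (first one wins on ties)."""
    return min(rmse_by_method, key=rmse_by_method.get)


def page_quality_score(worker_scores, rmse_to_archiving):
    """Return (score, human_annotated) for one page.

    Assumption: identical screenshots are detected by an RMSE of exactly 0;
    such pages get the best score (1) without consulting worker scores.
    """
    if rmse_to_archiving == 0:
        return 1, False
    return median(worker_scores), True


# Made-up example values for one page.
rmse = {"custom": 0.12, "pywb": 0.05, "warcprox": 0.20}
best = select_best_reproduction(rmse)
score, annotated = page_quality_score([1, 2, 2, 3, 3, 3, 4, 5, 5], rmse[best])
```

With an odd number of workers (9), the median is always one of the submitted scores, so the result stays on the 1 to 5 scale.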

The file 'page-quality-interface.png' is a screenshot of the annotation interface. 


Web site data
-------------
The directory 'sites' contains for each site a text file compiled from the Amazon Web Information Service in April 2018. If available for the site, the file contains the following information:
  - url: The name of the site.
  - online-since: Date when the web site went online.
  - adult-content: Whether this site contains adult content (yes/no).
  - locale: Language locale of the site.
  - link-in-count: Number of incoming links to this site.
  - category: First-level and (if available) second-level category (see https://www.alexa.com/topsites/category) of the site. Can contain multiple categories.
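A site file could be read along the following lines. This is only a sketch: it assumes that each field is stored on its own line as 'key: value' and that multiple categories appear as repeated 'category' lines, neither of which is specified here, so check it against an actual file first. The function name is hypothetical.

```python
def parse_site_info(text):
    """Parse 'key: value' lines into a dict mapping each key to a list of values.

    Assumed format (not confirmed by the dataset description): one field per
    line, key and value separated by the first colon, repeated keys allowed.
    """
    info = {}
    for line in text.splitlines():
        key, separator, value = line.partition(":")
        if not separator or not key.strip():
            continue  # skip blank lines and lines without a key
        info.setdefault(key.strip(), []).append(value.strip())
    return info


# Made-up example content, not taken from the dataset.
example = "url: example.com\nadult-content: no\ncategory: Arts\ncategory: Arts/Music\n"
site = parse_site_info(example)
```

Because fields are only present if available for the site, look them up with `site.get("locale")` rather than indexing directly.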

Due to a bug in the Amazon service, the data is not available for site 03109 (www.blogtopsites.com).


Versions
--------
1.1.0
  - Added webis-web-archive-17-000000.zip

1.0.1
  - Added missing entries in page-quality-scores.txt
  - Added missing site data

1.0.0
  - Initial