Dataset Open Access

Webis-Web-Archive-17

Kiesel, Johannes; Potthast, Martin; Hagen, Matthias; Kneist, Florian; Stein, Benno

This dataset was created in mid-2017 from 10,000 web pages that were carefully sampled from the Common Crawl to include a mixture of high-ranking and low-ranking web pages. The process is described in detail in an upcoming publication.

This dataset is supplemented with Content Error Annotations.

Files (99.5 GB)
Name                                        Size      MD5
page-quality-interface.png                  271.4 kB  8ad67239dde6f7d20369d428ba284e06
page-quality-scores.txt                     180.2 kB  f58fa9ad1ee5fd4c8539b01b2b5295fc
README.txt                                  3.0 kB    3a6fa77c0d5c6a63a71236a24a6eede8
sites-and-pages.txt                         1.0 MB    a5cfd4b545772ec3bf46f845fc7696b1
webis-web-archive-17-archives-part0.zip     3.1 GB    2ec3e59b0c8f19a07318397fca1e9a1f
webis-web-archive-17-archives-part1.zip     4.5 GB    208d356268f62174ccb969632ddfb404
webis-web-archive-17-archives-part2.zip     3.0 GB    470913f265a5423e53d5707a32c65086
webis-web-archive-17-archives-part3.zip     2.0 GB    29898c5b1a9fea42aa45e77cbab23ad9
webis-web-archive-17-archives-part4.zip     3.9 GB    6c76d9791d86cc2810d2c9a1676df7ba
webis-web-archive-17-archives-part5.zip     5.3 GB    c1e0e9548e615a655d8e23b3c51b771c
webis-web-archive-17-archives-part6.zip     5.4 GB    fbca883e22ed8cb00d6be8e24477a47a
webis-web-archive-17-archives-part7.zip     4.2 GB    3931e329b8f9e1e8bd4f3527a4551730
webis-web-archive-17-archives-part8.zip     7.4 GB    edc24c195a50e43241ad269fc0f9eec2
webis-web-archive-17-archives-part9.zip     4.9 GB    1cec18ac83f6ebd38e3cb4bb6061481e
webis-web-archive-17-dom-snapshots.zip      1.3 GB    de36dc487398e153551b83647a801562
webis-web-archive-17-screenshots-part0.zip  5.7 GB    21c9a6db2d739c52263cd2f565cfb6e5
webis-web-archive-17-screenshots-part1.zip  6.2 GB    523707ee732e3502b670b37a38908faf
webis-web-archive-17-screenshots-part2.zip  5.2 GB    41076c729087e061aa2f125db643317f
webis-web-archive-17-screenshots-part3.zip  4.6 GB    65fb839103987758a67693047524ee9c
webis-web-archive-17-screenshots-part4.zip  5.1 GB    262a90e4cd847a48b980376b03aad0f8
webis-web-archive-17-screenshots-part5.zip  4.3 GB    15eb19b14ecd6ed8d16d544a749eb1ff
webis-web-archive-17-screenshots-part6.zip  5.3 GB    f4b185bf714afd6de6f2c56c9bb59f47
webis-web-archive-17-screenshots-part7.zip  5.8 GB    0838751d6765630030a3208ebb7ddef7
webis-web-archive-17-screenshots-part8.zip  6.0 GB    6d31ccce3043b80a5c2d14bd76e02a2d
webis-web-archive-17-screenshots-part9.zip  6.2 GB    c87e29c8731f9d19760183236232d019
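
Each file in the listing above comes with an MD5 checksum, which can be used to confirm that a large download arrived intact. The following is a minimal sketch of such a check, not a tool shipped with the dataset. The helper names `md5_of_file` and `verify_download` are hypothetical, and the example assumes README.txt has already been downloaded into the current directory.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Compare a file's MD5 digest with a checksum copied from the file listing.

    The comparison ignores case and tolerates an optional "md5:" prefix.
    """
    expected = expected_md5.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


if __name__ == "__main__":
    # Assumption: README.txt sits in the current working directory.
    local_copy = Path("README.txt")
    if local_copy.exists():
        ok = verify_download(local_copy, "3a6fa77c0d5c6a63a71236a24a6eede8")
        print("README.txt:", "OK" if ok else "checksum mismatch")
    else:
        print("README.txt not found; download it first")
```

Reading in chunks matters here because the individual archive parts are several gigabytes each. MD5 is adequate for detecting transfer corruption, but it is not a safeguard against deliberate tampering.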
Statistics        All versions  This version
Views             293           293
Downloads         10,822        10,821
Data volume       453.1 GB      453.1 GB
Unique views      271           271
Unique downloads  10,210        10,209
