There is a newer version of this record available.

Dataset Open Access

Webis-Web-Archive-17

Kiesel, Johannes; Potthast, Martin; Hagen, Matthias; Kneist, Florian; Stein, Benno

The Webis-Web-Archive-17 comprises 10,000 web page archives from mid-2017 that were carefully sampled from the Common Crawl to include a mixture of high-ranking and low-ranking web pages. The dataset contains the web archive files, HTML DOM snapshots, and screenshots of each web page, as well as per-page annotations of visual web archive quality. See this overview for all datasets that build upon this one. If you use this dataset in your research, please cite it using this paper.

Files (99.5 GB)
Name                                        Size      MD5
page-quality-interface.png                  271.4 kB  md5:8ad67239dde6f7d20369d428ba284e06
page-quality-scores.txt                     196.8 kB  md5:4aa126f9faf6c1d929c3fb477f2581b3
README.txt                                  3.0 kB    md5:3a6fa77c0d5c6a63a71236a24a6eede8
sites-and-pages.txt                         1.0 MB    md5:a5cfd4b545772ec3bf46f845fc7696b1
sites.zip                                   1.6 MB    md5:8282a9408250aa85a88fa75fe5fd74ef
webis-web-archive-17-archives-part0.zip     3.1 GB    md5:2ec3e59b0c8f19a07318397fca1e9a1f
webis-web-archive-17-archives-part1.zip     4.5 GB    md5:208d356268f62174ccb969632ddfb404
webis-web-archive-17-archives-part2.zip     3.0 GB    md5:470913f265a5423e53d5707a32c65086
webis-web-archive-17-archives-part3.zip     2.0 GB    md5:29898c5b1a9fea42aa45e77cbab23ad9
webis-web-archive-17-archives-part4.zip     3.9 GB    md5:6c76d9791d86cc2810d2c9a1676df7ba
webis-web-archive-17-archives-part5.zip     5.3 GB    md5:c1e0e9548e615a655d8e23b3c51b771c
webis-web-archive-17-archives-part6.zip     5.4 GB    md5:fbca883e22ed8cb00d6be8e24477a47a
webis-web-archive-17-archives-part7.zip     4.2 GB    md5:3931e329b8f9e1e8bd4f3527a4551730
webis-web-archive-17-archives-part8.zip     7.4 GB    md5:edc24c195a50e43241ad269fc0f9eec2
webis-web-archive-17-archives-part9.zip     4.9 GB    md5:1cec18ac83f6ebd38e3cb4bb6061481e
webis-web-archive-17-dom-snapshots.zip      1.3 GB    md5:de36dc487398e153551b83647a801562
webis-web-archive-17-screenshots-part0.zip  5.7 GB    md5:21c9a6db2d739c52263cd2f565cfb6e5
webis-web-archive-17-screenshots-part1.zip  6.2 GB    md5:523707ee732e3502b670b37a38908faf
webis-web-archive-17-screenshots-part2.zip  5.2 GB    md5:41076c729087e061aa2f125db643317f
webis-web-archive-17-screenshots-part3.zip  4.6 GB    md5:65fb839103987758a67693047524ee9c
webis-web-archive-17-screenshots-part4.zip  5.1 GB    md5:262a90e4cd847a48b980376b03aad0f8
webis-web-archive-17-screenshots-part5.zip  4.3 GB    md5:15eb19b14ecd6ed8d16d544a749eb1ff
webis-web-archive-17-screenshots-part6.zip  5.3 GB    md5:f4b185bf714afd6de6f2c56c9bb59f47
webis-web-archive-17-screenshots-part7.zip  5.8 GB    md5:0838751d6765630030a3208ebb7ddef7
webis-web-archive-17-screenshots-part8.zip  6.0 GB    md5:6d31ccce3043b80a5c2d14bd76e02a2d
webis-web-archive-17-screenshots-part9.zip  6.2 GB    md5:c87e29c8731f9d19760183236232d019
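
Since each file is listed with an MD5 checksum, a download can be checked against it before unpacking. The following is a minimal sketch, not part of the dataset itself: the helper names `md5_of_file` and `verify_downloads` and the `EXPECTED_MD5` mapping are illustrative assumptions, and only two of the listed digests are included for brevity.

```python
import hashlib
from pathlib import Path

# Digests copied from the file listing; extend this mapping with the
# remaining entries as needed. The mapping itself is a hypothetical helper.
EXPECTED_MD5 = {
    "README.txt": "3a6fa77c0d5c6a63a71236a24a6eede8",
    "page-quality-scores.txt": "4aa126f9faf6c1d929c3fb477f2581b3",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory, expected=EXPECTED_MD5):
    """Map each expected file name to 'ok', 'mismatch', or 'missing'."""
    results = {}
    for name, want in expected.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of_file(path) == want:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

Calling `verify_downloads(".")` from the download directory would report the status of each listed file. Reading in chunks matters here because the larger archive parts are several gigabytes each.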
                  All versions  This version
Views             861           18
Downloads         435,996       8
Data volume       24.1 TB       3.7 MB
Unique views      767           14
Unique downloads  229,637       6
