Dataset Open Access

Webis-Web-Archive-17

Kiesel, Johannes; Potthast, Martin; Hagen, Matthias; Kneist, Florian; Stein, Benno

The Webis-Web-Archive-17 comprises a total of 10,000 web page archives from mid-2017 that were carefully sampled from the Common Crawl to include a mixture of high-ranking and low-ranking web pages. The dataset contains the web archive files, HTML DOM, and screenshots of each web page, as well as per-page annotations of visual web archive quality. See this overview for all datasets that build upon this one. If you use this dataset in your research, please cite it using this paper.

Files (99.5 GB)
Name                                        Size      MD5
page-quality-interface.png                  271.4 kB  8ad67239dde6f7d20369d428ba284e06
page-quality-scores.txt                     196.8 kB  4aa126f9faf6c1d929c3fb477f2581b3
README.txt                                  3.6 kB    6b09d6739bc87bd3998fae6f78c43f34
sites-and-pages.txt                         1.0 MB    a5cfd4b545772ec3bf46f845fc7696b1
sites.zip                                   1.6 MB    8282a9408250aa85a88fa75fe5fd74ef
webis-web-archive-17-000000.zip             23.3 MB   ee2da1054ee7eab6059ebbe323330bd2
webis-web-archive-17-archives-part0.zip     3.1 GB    2ec3e59b0c8f19a07318397fca1e9a1f
webis-web-archive-17-archives-part1.zip     4.5 GB    208d356268f62174ccb969632ddfb404
webis-web-archive-17-archives-part2.zip     3.0 GB    470913f265a5423e53d5707a32c65086
webis-web-archive-17-archives-part3.zip     2.0 GB    29898c5b1a9fea42aa45e77cbab23ad9
webis-web-archive-17-archives-part4.zip     3.9 GB    6c76d9791d86cc2810d2c9a1676df7ba
webis-web-archive-17-archives-part5.zip     5.3 GB    c1e0e9548e615a655d8e23b3c51b771c
webis-web-archive-17-archives-part6.zip     5.4 GB    fbca883e22ed8cb00d6be8e24477a47a
webis-web-archive-17-archives-part7.zip     4.2 GB    3931e329b8f9e1e8bd4f3527a4551730
webis-web-archive-17-archives-part8.zip     7.4 GB    edc24c195a50e43241ad269fc0f9eec2
webis-web-archive-17-archives-part9.zip     4.9 GB    1cec18ac83f6ebd38e3cb4bb6061481e
webis-web-archive-17-dom-snapshots.zip      1.3 GB    de36dc487398e153551b83647a801562
webis-web-archive-17-screenshots-part0.zip  5.7 GB    21c9a6db2d739c52263cd2f565cfb6e5
webis-web-archive-17-screenshots-part1.zip  6.2 GB    523707ee732e3502b670b37a38908faf
webis-web-archive-17-screenshots-part2.zip  5.2 GB    41076c729087e061aa2f125db643317f
webis-web-archive-17-screenshots-part3.zip  4.6 GB    65fb839103987758a67693047524ee9c
webis-web-archive-17-screenshots-part4.zip  5.1 GB    262a90e4cd847a48b980376b03aad0f8
webis-web-archive-17-screenshots-part5.zip  4.3 GB    15eb19b14ecd6ed8d16d544a749eb1ff
webis-web-archive-17-screenshots-part6.zip  5.3 GB    f4b185bf714afd6de6f2c56c9bb59f47
webis-web-archive-17-screenshots-part7.zip  5.8 GB    0838751d6765630030a3208ebb7ddef7
webis-web-archive-17-screenshots-part8.zip  6.0 GB    6d31ccce3043b80a5c2d14bd76e02a2d
webis-web-archive-17-screenshots-part9.zip  6.2 GB    c87e29c8731f9d19760183236232d019
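
Because the parts are large and each file is listed with an MD5 checksum, it may be worth verifying downloads before unpacking them. The following is a minimal sketch, not part of the dataset's own tooling: the function names `md5_of_file` and `verify_downloads`, the `EXPECTED_MD5` mapping, and the local directory name `downloads` are all assumptions made for illustration. Only three of the listed checksums are copied in; the remaining entries would need to be added by hand.

```python
import hashlib
from pathlib import Path

# A few (file name -> MD5) pairs copied from the file list above.
# Hypothetical mapping for illustration; extend it with the other files as needed.
EXPECTED_MD5 = {
    "README.txt": "6b09d6739bc87bd3998fae6f78c43f34",
    "sites.zip": "8282a9408250aa85a88fa75fe5fd74ef",
    "webis-web-archive-17-dom-snapshots.zip": "de36dc487398e153551b83647a801562",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks
    so that multi-gigabyte archives do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory, expected=EXPECTED_MD5):
    """Return a dict mapping each expected file name to
    True (checksum matches), False (mismatch), or None (file missing)."""
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = None
        else:
            results[name] = md5_of_file(path) == checksum
    return results


if __name__ == "__main__":
    # "downloads" is an assumed location for the downloaded files.
    for name, status in verify_downloads("downloads").items():
        label = {True: "OK", False: "MISMATCH", None: "missing"}[status]
        print(f"{label:8}  {name}")
```

A mismatch most likely indicates an incomplete or corrupted download, in which case re-downloading the affected part is the simplest remedy.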
Statistics        All versions  This version
Views             695           31
Downloads         276,771       57
Data volume       2.2 TB        159.4 GB
Unique views      620           28
Unique downloads  160,066       16
