Published May 25, 2020 | Version v1
Dataset Open

Summary GAW

  • 1. Humboldt-Universität zu Berlin

Description


The dataset was created as part of an effort to establish a knowledge base on the "German Academic Web" (GAW). Since 2012, semi-annual focused crawls of the web pages of universities and research institutes in Germany have been performed using Heritrix, the Internet Archive's open-source, archival-quality web crawler.

Starting from a given list of seeds, Heritrix follows newly discovered hyperlinks and stores the content it encounters in the standardised WARC file format.


For each crawl, Heritrix was initialised with a conceptually invariant seed list of, on average, 150 domains, covering all German academic institutions with the right to award doctorates. The seed list is extracted from the current entries of https://de.wikipedia.org/wiki/Liste_der_Hochschulen_in_Deutschland (the German Wikipedia's list of higher education institutions in Germany).

The crawler follows a breadth-first policy on each host, collecting all available pages reachable by links from the homepage. The scope was limited to pages from the seed domains, and certain file types (mainly audio, video, and compressed files) were excluded using regular expressions.
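The scope rule described above can be illustrated with a minimal Python sketch. It is not the Heritrix configuration used for the crawls: the seed domains, the list of file extensions, and the function name `in_scope` are all assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit

# Hypothetical seed domains; the actual seed list is not part of this record.
SEED_DOMAINS = {"uni-a.example", "uni-b.example"}

# Illustrative pattern for audio, video, and compressed files.
# The regular expressions used in the real crawls are not documented here.
EXCLUDED_SUFFIXES = re.compile(
    r"\.(mp3|wav|ogg|mp4|avi|mov|mkv|zip|gz|tar|rar|7z)$", re.IGNORECASE
)


def in_scope(url: str) -> bool:
    """Return True if the URL is on a seed domain and not an excluded file type."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Accept the seed domain itself and any of its subdomains.
    on_seed_domain = any(
        host == domain or host.endswith("." + domain) for domain in SEED_DOMAINS
    )
    return on_seed_domain and not EXCLUDED_SUFFIXES.search(parts.path)
```

Under these assumptions, `in_scope("https://www.uni-a.example/research")` is accepted, while a video file on the same host or any page on a non-seed domain is rejected.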

During each crawl, the URL queues were monitored via a web UI. Hosts that appeared to be undesirable, such as e-learning systems or repositories, were "retired", that is, their URLs were no longer crawled. However, previously harvested URLs from retired hosts were not removed.
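To make the effect of retiring a host concrete, here is a small Python sketch of a crawl frontier with one first-in, first-out queue per host. It is a simplified model, not Heritrix's own frontier; the class `Frontier` and its methods are hypothetical names introduced for illustration.

```python
from collections import deque
from urllib.parse import urlsplit


class Frontier:
    """Toy frontier: one FIFO queue per host gives breadth-first order on each host."""

    def __init__(self):
        self.queues = {}     # host -> deque of pending URLs
        self.seen = set()    # every URL ever queued, to avoid duplicates
        self.retired = set() # hosts that must no longer be crawled
        self.harvested = []  # URLs already handed out; never purged

    def add(self, url):
        host = urlsplit(url).hostname
        if host is None or host in self.retired or url in self.seen:
            return
        self.seen.add(url)
        self.queues.setdefault(host, deque()).append(url)

    def retire(self, host):
        # Drop the pending queue and refuse future URLs for this host.
        # Entries already in self.harvested are deliberately left alone.
        self.retired.add(host)
        self.queues.pop(host, None)

    def next_url(self):
        # Hand out the oldest pending URL of the first host that has one.
        for queue in self.queues.values():
            if queue:
                url = queue.popleft()
                self.harvested.append(url)
                return url
        return None
```

In this model, a page fetched from an e-learning host before `retire()` is called stays in `harvested`, which mirrors the statement that previously harvested URLs from retired hosts were not removed.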


Most crawls were stopped manually after roughly 100 million pages had been collected (according to Heritrix's control console), which took about two weeks per crawl on average.
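As a rough plausibility check, the two approximate figures above imply a mean crawl rate that can be worked out as follows. The helper `mean_rate` is a hypothetical name, and the result is only as precise as the rounded inputs.

```python
def mean_rate(pages: int, days: float) -> tuple[float, float]:
    """Return (pages per second, pages per day) for a crawl of the given size."""
    seconds = days * 24 * 60 * 60
    return pages / seconds, pages / days


# Approximate figures from the description: 100 million pages in two weeks.
per_second, per_day = mean_rate(100_000_000, 14)
print(f"{per_second:.1f} pages/s, {per_day:,.0f} pages/day")
# prints "82.7 pages/s, 7,142,857 pages/day"
```

This corresponds to somewhat more than 80 pages per second sustained across all hosts.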

This dataset provides an overview of the size of the GAW.

Files

overview.csv

Size: 350 Bytes
md5:2d74337441ce8d7be2c54f19d731044c

Additional details