Dataset Open Access

# Summary GAW

Michael Paris; Robert Jäschke

The dataset has been created in an effort to establish a knowledge base on the "German Academic Web" (GAW). Since 2012, semi-annual focused crawls of the web pages of universities and research institutes in Germany have been performed using Heritrix, the open-source, archival-quality web crawler of the Internet Archive.

Starting from a list of given seeds, the crawler follows newly discovered hyperlinks and stores the retrieved content in the standardised WARC file format.

For each crawl, Heritrix was initialised with a conceptually invariant seed list of, on average, 150 domains covering all German academic institutions with the right to award doctorates. The seed list is extracted from the current entries on https://de.wikipedia.org/wiki/Liste_der_Hochschulen_in_Deutschland.
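The seed list can be derived by reducing each institution's homepage URL to its registrable domain. The following is a minimal sketch of that step; the example URLs and the `to_seed_domain` helper are assumptions for illustration, not part of the actual extraction pipeline.

```python
# Hypothetical sketch: deriving crawl seed domains from institution
# homepage URLs, as listed on the Wikipedia page. The entries below
# are illustrative examples only.
from urllib.parse import urlparse

homepages = [
    "https://www.hu-berlin.de/",
    "https://www.uni-hamburg.de/de.html",
]

def to_seed_domain(url: str) -> str:
    """Reduce a homepage URL to its host, stripped of a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host.removeprefix("www.")  # requires Python 3.9+

seeds = sorted({to_seed_domain(u) for u in homepages})
print(seeds)  # ['hu-berlin.de', 'uni-hamburg.de']
```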

The crawler follows a breadth-first policy on each host, thereby collecting all available pages reachable by links from the homepage. The scope was limited to pages from the seed domains, and certain file types (mainly audio, video, and compressed files) were excluded using regular expressions.
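The scoping described above can be illustrated as follows. The actual rules live in Heritrix's crawl configuration; the seed domain, extension list, and `in_scope` helper here are assumptions chosen to mirror the described behaviour, not the original configuration.

```python
# Illustrative scope check: keep only URLs on a seed domain and drop
# audio, video, and compressed-file extensions via a regular expression.
import re
from urllib.parse import urlparse

# Hypothetical single seed entry; the real list holds ~150 domains.
SEED_DOMAINS = {"hu-berlin.de"}

# Extensions assumed here for illustration.
EXCLUDE = re.compile(r".*\.(mp3|wav|mp4|avi|mov|zip|tar|gz|rar)$", re.IGNORECASE)

def in_scope(url: str) -> bool:
    """True if the URL is on a seed domain and not an excluded file type."""
    host = urlparse(url).netloc.lower()
    on_seed = any(host == d or host.endswith("." + d) for d in SEED_DOMAINS)
    return on_seed and not EXCLUDE.match(url)

print(in_scope("https://www.hu-berlin.de/de/studium"))        # True
print(in_scope("https://www.hu-berlin.de/files/lecture.mp4")) # False
print(in_scope("https://example.com/page"))                   # False
```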

During the crawl, the URL queues were monitored via a web UI. Hosts that appeared to be undesirable, such as e-learning systems or repositories, were "retired", that is, their URLs were no longer crawled. However, previously harvested URLs from retired hosts were not removed.
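The effect of retiring a host can be sketched as filtering the pending URL frontier while leaving already-stored records untouched. The variable names and example URLs below are assumptions for illustration, not Heritrix internals.

```python
# Minimal sketch of "retiring" a host: pending URLs from that host are
# dropped from the queue, but records already written to WARC files stay.
from urllib.parse import urlparse

retired_hosts = {"moodle.hu-berlin.de"}  # e.g. an e-learning system

frontier = [
    "https://www.hu-berlin.de/de/studium",
    "https://moodle.hu-berlin.de/course/view.php?id=1",
]

# Keep only URLs whose host has not been retired.
frontier = [u for u in frontier if urlparse(u).netloc not in retired_hosts]
print(frontier)  # ['https://www.hu-berlin.de/de/studium']
```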

Most crawls were finished manually after roughly 100 million pages had been collected (according to Heritrix's control console), which took roughly two weeks per crawl, on average.

This dataset provides an overview of the size of the GAW.

Files (350 Bytes)
- overview.csv (md5:2d74337441ce8d7be2c54f19d731044c), 350 Bytes