Dataset Open Access

Summary GAW

Michael Paris; Robert Jäschke

This dataset was created in an effort to establish a knowledge base on the ``German Academic Web'' (GAW). Since 2012, semi-annual focused crawls of the web pages of universities and research institutes in Germany have been performed using Heritrix, the open-source, archival-quality web crawler of the Internet Archive.

Starting from a list of given seeds, Heritrix follows newly discovered hyperlinks and stores the content it encounters in the standardised WARC file format.

For each crawl, Heritrix was initialised with a conceptually invariant seed list of, on average, 150 domains of all German academic institutions with the right to award doctorates. The seed list is extracted from the current entries on

The crawler follows a breadth-first policy on each host, thereby collecting all available pages reachable by links from the homepage. The scope was limited to pages from the seed domains, and certain file types (mainly audio, video, and compressed files) were excluded using regular expressions.
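The scoping rules described above can be illustrated with a minimal sketch. It is not the Heritrix configuration used for the crawls: the exclusion pattern, the function names, and the toy link graph are assumptions made for illustration, and the sketch runs a single breadth-first traversal over an in-memory graph rather than Heritrix's per-host queues.

```python
import re
from collections import deque
from urllib.parse import urlsplit

# Hypothetical exclusion pattern; the expressions actually used in the
# crawls are not given in this description.
EXCLUDED = re.compile(r"\.(mp3|wav|mp4|avi|zip|gz|tar)$", re.IGNORECASE)


def in_scope(url, seed_domains):
    """Accept a URL only if its host belongs to a seed domain and its
    path does not end in an excluded file extension."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    on_seed_domain = any(
        host == domain or host.endswith("." + domain) for domain in seed_domains
    )
    return on_seed_domain and EXCLUDED.search(parts.path) is None


def crawl(seed_url, links, seed_domains):
    """Breadth-first traversal from seed_url.

    `links` maps each URL to the list of URLs it links to and stands in
    for fetching and parsing real pages.
    """
    seen = {seed_url}
    queue = deque([seed_url])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for target in links.get(url, []):
            if target not in seen and in_scope(target, seed_domains):
                seen.add(target)
                queue.append(target)
    return order


# Toy link graph (invented example hosts).
LINKS = {
    "http://uni-a.example/": [
        "http://uni-a.example/research",
        "http://uni-a.example/lecture.mp4",   # excluded file type
        "http://other.example/",              # outside the seed domains
    ],
    "http://uni-a.example/research": [
        "http://www.uni-a.example/staff",     # subdomain of a seed domain
        "http://uni-a.example/",              # already seen
    ],
}

visited = crawl("http://uni-a.example/", LINKS, {"uni-a.example"})
```

In this toy graph, the video file and the off-domain link are never queued, while the page on the `www` subdomain is collected because its host falls under a seed domain.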

During the crawl, the URL queues were monitored via a web UI. Hosts that appeared to be undesirable, such as e-learning systems or repositories, were `retired', that is, their URLs were no longer crawled. However, previously harvested URLs from retired hosts were not removed.
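The effect of retiring a host can be sketched as follows. This is a toy model, not Heritrix's frontier implementation: the class `Frontier` and its methods are hypothetical names, and the only behaviour modelled is the one stated above, namely that a retired host yields no further URLs while its already harvested URLs are kept.

```python
from collections import deque
from urllib.parse import urlsplit


class Frontier:
    """Toy crawl frontier with one queue of pending URLs per host."""

    def __init__(self):
        self.queues = {}      # host -> deque of pending URLs
        self.retired = set()  # hosts that must not be crawled any more
        self.harvested = []   # URLs fetched so far, never pruned

    def add(self, url):
        """Queue a URL under its host."""
        host = urlsplit(url).hostname
        self.queues.setdefault(host, deque()).append(url)

    def retire(self, host):
        """Stop crawling a host; harvested URLs are left untouched."""
        self.retired.add(host)

    def fetch_next(self, host):
        """Return the next pending URL of a host, or None if the host
        is retired or has nothing left to crawl."""
        if host in self.retired:
            return None
        queue = self.queues.get(host)
        if not queue:
            return None
        url = queue.popleft()
        self.harvested.append(url)
        return url


frontier = Frontier()
frontier.add("http://elearning.uni-a.example/course/1")
frontier.add("http://elearning.uni-a.example/course/2")

first = frontier.fetch_next("elearning.uni-a.example")
frontier.retire("elearning.uni-a.example")
second = frontier.fetch_next("elearning.uni-a.example")
```

After the host is retired, `second` is `None`, while the URL fetched before retirement remains in `frontier.harvested`.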

Most crawls were stopped manually after roughly 100 million pages had been collected (according to Heritrix's control console), which took about two weeks per crawl on average.

This dataset provides an overview of the size of the GAW.

Files (350 Bytes)

