How can we improve our web collection? An evaluation of webarchiving at the KB National Library of the Netherlands (2007–2017)

In 2007, the Koninklijke Bibliotheek, the Dutch National Library (KB-NL), started the project ‘webarchiving’ based on a selection of Dutch websites. The initial selection of 1000 websites has since grown to over 12,000 selected websites, crawled at different intervals. Although legal restrictions currently limit use to the KB-NL reading room, it is important that the KB-NL includes the requirements of (future) users in its approach to creating a web collection. With respect to the long-term preservation of the collection, we also need to incorporate the requirements for long-term archiving in our approach, as described in the Open Archival Information System (OAIS) Model, ISO 14721:2012. This article describes the results of a research project on webarchiving and the web collection of archived sites in the KB-NL, investigating the following questions. What is webarchiving in the Netherlands? What are the selection criteria of the KB-NL and how do these relate to what the contemporary user can find on the Dutch web? How does the choice of harvesting tools influence the final archived website? Do we know enough about the value of the web collection and its potential usage by researchers, and how can we improve this value? This article describes the outcomes of the research and the conclusions and advice that can be drawn from it, and will, it is hoped, inspire broader discussion about the essence of creating web collections for long-term preservation as part of cultural heritage.


Introduction
The importance of websites and webarchives as cultural heritage has been acknowledged worldwide in the UNESCO Charter on the Preservation of the Digital Heritage, published in 2003 (UNESCO, 2003; UNESCO/PERSIST Content Task Force, 2016). The Koninklijke Bibliotheek, the Dutch National Library (KB-NL), started webarchiving a selection of the Dutch web as a research project 10 years ago, in March 2007. 1 Over the years, the webarchiving activity evolved from a pilot project initiated by the Research Department into a regular activity of the Collections Department. The KB-NL collection of archived websites now contains more than 12,000 websites, preserved in almost 26 TB of data, and comprises around 211 million URLs (Uniform Resource Locators). 2 As there is no legal deposit in the Netherlands, and for copyright reasons, the collection can only be studied in the reading rooms of the library. It attracts around 100 users every year.

Open Archival Information System
The aim of the KB-NL web collection is to select, preserve and make accessible a representative set of Dutch websites. The KB-NL's policy is to conform to the Open Archival Information System (OAIS) standard, which describes on a conceptual level the approach to long-term preservation, both in a functional model and in an information model. Currently, the KB-NL web collection is stored on a dedicated server at an external location as '.ARC' files and is not yet in a preservation system. This means that the web collection is currently not compliant with the OAIS standard (as is the case for many web collections). But the KB-NL needs to prepare itself for the ingest of the web collection into the preservation system in a few years, at which point the OAIS requirements will have to be met. Hence, the various concepts of OAIS need to be taken into account now, although the actual ingest might take place at a later stage.
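For illustration, the sketch below shows how such legacy '.ARC' container files can be inspected today with the open-source Python library warcio, which can expose ARC records through its WARC interface. This is an assumption for the example, not a description of the KB-NL's own tooling, and the file name is hypothetical.

    # Minimal sketch: iterate over the records of a legacy .ARC file
    # using warcio (arc2warc=True converts ARC records to the WARC
    # interface on the fly). The file name is a hypothetical example.
    from warcio.archiveiterator import ArchiveIterator

    with open('harvest-2007-example.arc.gz', 'rb') as stream:
        for record in ArchiveIterator(stream, arc2warc=True):
            if record.rec_type == 'response':
                # One time-stamped capture of one URL
                print(record.rec_headers.get_header('WARC-Target-URI'))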
The OAIS information model distinguishes several topics related to what the so-called 'designated community' needs to know in order to derive 'information' from the content data object (the preserved website) and the related representation information, which together make up the content information. As the custodian, the KB-NL is to a certain degree responsible for delivering this information. The content data object has clear boundaries, as it is the 'original target of preservation'. The representation information not only encompasses the information needed to render, in this case, the instance of the web harvest faithfully, but also needs to contain relevant information in relation to the so-called knowledge base of the designated community.
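As a reading aid, the following minimal sketch expresses these OAIS relationships as plain Python dataclasses. The field names are our own illustration, not an OAIS-defined interface.

    # Sketch of the OAIS relationships described above; illustrative only.
    from dataclasses import dataclass

    @dataclass
    class RepresentationInformation:
        """What the designated community needs to interpret the bits."""
        format_info: str     # e.g. 'ARC 1.0 container holding HTML pages'
        rendering_info: str  # e.g. 'replayable via a Wayback instance'
        semantic_info: str   # knowledge not assumed in the user's knowledge base

    @dataclass
    class ContentInformation:
        """Content data object + representation information (OAIS)."""
        data_object: bytes   # the preserved website harvest
        rep_info: RepresentationInformation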

Designated community
In OAIS terminology, the designated community is the target community of the archive: the archive's future users. The primary designated community of the KB-NL web collection is the academic and non-academic research community. Researchers who base their research on the KB-NL web collection need to know their topic of study, so they might be expected to have at least a basic knowledge of the original historic context of the collection, the process of web harvesting and how the web was organised during the period of their investigation. But they will also be interested in several aspects of the collection about which only the KB-NL can provide information.
In order to better understand the KB-NL web collection, context information about the broader environment of which the web collection was part is important. The KB-NL performs selective harvesting of websites from the Dutch national domain, but which criteria were applied during the selection process? And did these selection criteria change over the years? Did the KB-NL harvest all the sources within the scope of the crawl, or were there technical or legal barriers that prevented achieving these goals? How much was selected in comparison with what was available at a certain time? Did ethical questions play a role during the process of selection or harvest? In short: which aspects that will be relevant for researchers influenced the KB-NL collection during the process of selection and harvest? And where can a researcher find this so-called context information if it is not available in the web collection itself?

Preservation description information
Apart from this context information, the OAIS model also requires extra information in the information model for each website that is preserved, especially in the preservation description information (PDI), as described in paragraph 4.2.1.4.2 of the OAIS reference model and its Figure 4.16. The PDI contains reference information, provenance information, context information, fixity information and access rights information. Some of this information can be captured in metadata.
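The sketch below illustrates how these five PDI categories might be captured as metadata for one archived website, with fixity computed as a checksum. All names and values are invented examples, not the KB-NL's actual metadata scheme.

    # Hedged sketch of the five PDI categories (OAIS 4.2.1.4.2) for one
    # archived website; field names and values are illustrative.
    import hashlib
    from dataclasses import dataclass

    @dataclass
    class PreservationDescriptionInformation:
        reference: str      # persistent identifier of the harvest
        provenance: str     # who harvested, when, with which crawler version
        context: str        # why selected; relation to the wider collection
        fixity: str         # checksum guarding against undetected change
        access_rights: str  # e.g. 'reading-room access only'

    def fixity_of(path: str) -> str:
        """Compute a SHA-256 checksum for the fixity field."""
        h = hashlib.sha256()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(1 << 20), b''):
                h.update(chunk)
        return 'sha256:' + h.hexdigest()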
In a survey conducted in 2013 by the IIPC Preservation Working Group among the IIPC members, a question was asked about whether the web collection was already integrated into their preservation system, assuming that for this preservation system the requirements of the OAIS model were realised (Pearson et al., 2015). Of 25 completed surveys, 37% responded that they had it integrated, while 63% were either planning it or had not done so. We can conclude that the majority of the respondents had not yet archived their web collection into a preservation system conforming to the OAIS reference model. One of the exceptions is the Bibliothèque nationale de France, which gave a description of its PREMIS/Metadata Encoding and Transmission Standard (METS) model in which part of the PDI information is incorporated (Dappert et al., 2010). In short, context information is crucial to understand a web collection, both for current and for future users.
Although the KB-NL web collection is not yet ingested into a preservation system organised in compliance with the OAIS data model (including the above-mentioned representation information and PDI), we started this research to investigate what we at the KB-NL could improve in order to prepare ourselves for an OAIS-compliant web collection, with a focus on the three main steps of the workflow: selection, harvesting and presentation.

Selection
The Netherlands has no legal deposit law prescribing what the National Library should collect. The collections of the KB-NL, including the web collection, are based on its collection plans. The web collection should be a representative set of Dutch web publications. The lack of legal deposit also means that the KB-NL must ask each website owner for permission before the website can become part of the collection. So although the KB-NL has a set of selection criteria, the webarchiving team cannot simply collect as it pleases. In practice, this approach meant that for some exemplary websites the KB-NL never received permission to harvest. It also hampers rescue archiving: some sites went offline while the KB-NL was awaiting permission to harvest them. What we could not harvest, or were not allowed to, is also context information, and it is currently lacking.
When webarchiving activities started in 2007, the KB-NL based its seed list on a selection of Dutch websites, with the intention of creating a representative set. Looking back, we must conclude that more effort to determine what the Dutch web encompassed in 2007 might have helped to make a more representative selection from the beginning. That knowledge of the web still had to be built.
The OAIS model states that an OAIS archive should be clear about what is and what is not in the collection, and should be transparent about this to the designated community. In this section, we will briefly describe the Dutch national web, the background information that could be helpful in determining the size of a national web, and extra sources that could be explored in shaping an overview of the Dutch web.

The Dutch national web
If we consider the national Dutch web as the way the Dutch shaped their virtual space on the web, this development started early. The .nl domain of the Netherlands was the world's second country code top-level domain (ccTLD) outside of the United States. The first .nl domain name, cwi.nl, was registered in 1986. 3 The first Dutch website was published online 6 years later, in 1992, as the third website in the world, after those of CERN, the European Organization for Nuclear Research, in Switzerland and the SLAC National Accelerator Laboratory in the United States. 4 The .nl ccTLD comprised 5.76 million domain names in June 2017, according to the figures of the Dutch national domain registrar SIDN (Stichting Internet Domeinregistratie Nederland). 5 Given its size and the number of websites per inhabitant, the Dutch national domain is one of the biggest national domains to be crawled in the world, with a daily growth of 5500 domain names. If we also take into account the non-.nl domain names used to publish Dutch sites, it is even bigger. To preserve even a representative part of this national web is an enormous task for a national library with limited resources.

Mapping the Dutch web and the start of webarchiving
Even before webarchiving started, the KB-NL was aware of the potential information value of the Dutch part of the world wide web for its patrons. It swiftly followed digital developments: as early as 1992 it started mapping the Dutch web by compiling web directories, or web lists of relevant websites, for its users. This 'NLmenu' was a classified overview of information services of Dutch organisations on the web. 6 This activity was part of the traditional library task regarding the collection of Dutch publications, both analogue and digital. Another KB-NL project, the Dutch Electronic Subject Service (DutchESS), started in 1997 and was a service to make internet sources accessible to researchers and to assign the Dutch Basic Classification to them. Only URLs were collected and described: the websites themselves were not harvested at the time. These are only two examples of the various initiatives that the KB-NL started in order to provide its audience with information about new information sources. For years, the web directory of the Dutch National Library was one of the most important providers of information on the geography and content of the Dutch web. This activity ended in 2004 as it seemed no longer useful, due to the popularity of commercial web directory sites and the growth of search engines such as Google.
When webarchiving started in 2007, a selection of 400 websites taken from DutchESS was used as a seed list for the first experiments with selective crawls. Yet after 10 years, this context information about the start of KB-NL webarchiving seemed lost and forgotten. These KB-NL initiatives from the 1990s were part of the collective memory in the library, but their traces were hidden and detailed information was not easy to find. NLmenu was handed over to a public library organisation and DutchESS was stopped. But the limited information that still existed in the KB-NL (sometimes on an old CD-ROM in an employee's drawer, which we were able to rescue) gave us important context information about the historic development of the Dutch web and about what were seen as important websites from the KB-NL perspective in those days, although it required some 'web archaeology' to find and rescue these sources. The future publication of these web lists will place the current web collection in context, which is one of the essentials for long-term preservation. The information collected by the KB-NL was seen as being only of current use and was not preserved at the time. But in the end, this kind of information, already available in the institutional memory, can be important background information and a useful source for researchers.

Selection policy of KB-NL
It is important to stress that although the KB-NL is the national library of the Netherlands, the collection of archived websites was neither created as a national webarchive nor aimed at collecting all material from the Dutch national web. The selection criteria of the KB-NL are defined by technical, financial, legal and human possibilities; web space; content; theme; and period of time. As the web collection is the end result of this range of criteria, this is important context information for a researcher and needs to be added to the preserved web collection in some way. We have not yet discussed how it would fit into the OAIS data model.
The general criteria for selecting material were written down in the KB-NL collection plan (KB, 2014). The collection plan states that the KB-NL collection covers 'everything published in and about the Netherlands' (KB, 2012). At the start of webarchiving in 2007, this policy was refined and limited to websites about Dutch language, history and culture. Moreover, the KB-NL limited itself to the selection of websites from the Dutch national domain: other Dutch materials from the internet, such as emails, programs or apps, are not preserved. Furthermore, a website must be a separate publication of a certain size: tweets and other microblog or social media items were excluded from selection.
As webarchiving became a more regular activity, the KB-NL selection criteria were adapted to technical and other limits and to web trends. Because of copyright issues, the lack of a deposit law and limited resources, the original ambitions became more modest and were adapted to the goal of archiving 'a representative selection of the Dutch national domain'. The legal limits of Dutch webarchiving were described in several publications written in collaboration with legal experts of the University of Leiden (Beunen, 2008; Beunen and Schiphof, 2006; Beunen et al., 2007).
As the webarchiving team got to know the Dutch national domain better, the selection policy was also extended and further refined. The original restriction to sites of the Dutch ccTLD .nl turned out to be too narrow for making a representative selection, as many sites with valuable content had other extensions. Therefore, the selection policy was extended to sites with other extensions as well. Another criterion initially excluded commercial sites. As many old and established Dutch companies went bankrupt and their websites went offline, the KB-NL selection policy became more flexible, to include endangered sites which do not meet the selection criteria but are important to preserve from a Dutch digital web heritage point of view. Finally, websites were selected on the basis of their popularity on the Dutch web, using various sources such as Alexa, Wikipedia and Similarweb.
As the selection policy of the KB-NL evolved over time, these changes were not always recorded, nor communicated to present and future users through the website. One could say that the changing context of the web collection went unrecorded. The future goal is to better inform users, our designated community, about what is preserved, included and excluded from selection. Therefore, we also need to provide information on the development of the Dutch web and the content of past web directories, even if most of the sites listed there are no longer online nor preserved. We even plan to make visible what was not archived due to legal, technical or other issues, by publishing these URLs. It is important for our designated community to realise that the library did its best to preserve as much digital heritage as possible, and to make them aware of the limits of webarchiving by providing context information about what was not preserved.

Future developments in the selection policy
There are still three issues related to the selection policy which will have to be solved in the future. First, since 2007 the selection policy has focused on current websites and includes only a small number of websites published before 2007. We now plan to select relevant Dutch web heritage which is still online by conducting web archaeology, and to harvest sites which are important for studying the origin of the Dutch web. Already one ancient web directory, the 10th website ever published in the Netherlands, was rescued from a server in an attic, reconnected to the web and finally harvested. 7 Second, the webarchiving team now focuses on the selection of more online news sites, as newspapers become less important and online news more influential. Third, the selection policy excluded websites with despicable or untrue content. As abject content on the web becomes more influential in society and fake news is thriving, it is necessary to preserve this source material for future research.

The selection policy of KB-NL from a national perspective
Apart from the KB-NL, several other Dutch organisations are creating web collections based on a specific portion of the Dutch web or on a specific Dutch theme. All of them have a different selection policy and crawl strategy, and sometimes they even use a different web harvesting technology. The KB-NL selection policy takes the selection policies and harvest activities of these other institutions into account, even if a different crawl technology is used: in principle, what other organisations crawl, the KB-NL excludes from its own selection. This principle is also contextual information about our web collection and thus of importance for our designated community.
As far as we know, webarchiving in the Netherlands started at the Dutch Documentation Centre of Political Parties in 2000, using the HTTrack web crawler. 8 This organisation harvests almost all the websites of Dutch political parties, politicians and political movements on a monthly or yearly basis. Many local or municipal archives also run webarchiving projects, like the Frisian Tresoar collection (the repository of the history of Fryslân), the province of Groningen and the cities of Rotterdam and Dordrecht. 9 The Dutch National Archive harvests websites from an archival point of view. Finally, the Netherlands Institute for Sound and Vision collects the websites of the Dutch broadcast organisations. All these organisations together collect around 15,000 websites, but the KB-NL collection is the largest of all. Future researchers must take this national context information into account when studying the Dutch national web and the KB-NL web collection. They will be able to do so, as most of the above-mentioned organisations have a preservation task and will preserve their web collections according to the OAIS model.
If we consider the selection criteria of the different web initiatives in the Netherlands, including the KB-NL, we can observe a bias towards politics, local sites, media, and cultural history and heritage, as well as a lack of sites archived before 2007. Because of this, a national expert group focusing on webarchiving on a national scale was launched at the end of 2016. 10 Its purpose is to promote cooperation between all the different institutions and the professionalisation of webarchiving in the Netherlands. Another goal is to make an inventory of all the different webarchiving initiatives in the Netherlands and to draw up a list of all the websites harvested by the various organisations. In this way, a researcher of the Dutch web can gain better insight into what is archived where and which technique is used. The combined efforts of all these small organisations together will enrich the value of the separate web collections and provide context information about the national Dutch domain for future users.

Webarchive or web collection?
A harvest policy is a key issue for the development of a collection of archived websites. As webarchiving is a relatively new activity for libraries, the definition of the goal driving collection development is still under discussion. The collection's original target in 2007 was defined as 'to harvest a selection of the Dutch web with a maximum of 3000 websites'. Therefore, the collection of archived websites at the KB-NL can be regarded as a special web collection of a scientific library rather than as a general or even a national webarchive. According to Helen Hockx-Yu, Brewster Kahle described a collection of archived websites as a webarchive when he founded the Internet Archive (IA) in 1996. In his opinion, a digital collection of websites must be considered a webarchive and not a web library, as its collection can never be complete. 11 Still, a webarchive is not an archive in the sense of a place in which public records or historical documents are preserved. The archived websites collected and preserved by the Dutch National Archives can be considered a true webarchive from this point of view. 12 The KB-NL owns a collection of more or less similar archived websites which have been selected for a reason, with a specific goal in mind. The term 'web collection' is therefore more suitable in the Dutch KB-NL case.

Harvest strategy of the KB-NL and web sources
What is harvested by the KB-NL for its web collection, and how is this actually done? The mission of the KB-NL is clear: to collect and preserve everything published in and about the Netherlands and Dutch culture, so that researchers, students and other users can consult it now and in the future. A problem arises when trying to apply this policy to web material, as it is not clear how to define the scope, size, content and even the value of the digital object which we want to collect and preserve.
The Danish web historian Niels Brügger has described five analytical layers of the web for identifying digital web objects (Brügger, 2010). These objects can be identified, harvested and archived at the following layers:
1. the individual web element: the textual elements of a web page, such as source code, text, images and style sheets;
2. the individual web page: the layer where all the above-described elements can be found under a certain URL, to which they are linked;
3. the individual website: the level of all linked web material that can be found under a certain domain name;
4. the web sphere: the layer of all sites that are linked together with one certain website; and
5. the web as a whole: the level of all websites that are online at a certain moment.
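Since the rest of this article refers back to these layers, the following trivial enumeration restates them for reference. Brügger's model defines no code interface, so this is purely illustrative.

    # Brügger's five analytical layers as a simple enumeration; illustrative.
    from enum import IntEnum

    class WebLayer(IntEnum):
        ELEMENT = 1  # individual web elements (code, text, images, ...)
        PAGE = 2     # one web page under one URL
        SITE = 3     # all material under one domain name (the KB-NL focus)
        SPHERE = 4   # all sites interlinked around one certain website
        WEB = 5      # everything online at a given moment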
As the KB-NL harvests from a web collection point of view, the focus of its harvest strategy is on the third analytical layer of the web. This means that the KB-NL preserves the individual website, which is considered a separate digital object to be collected as a single unit in web space and time. The KB-NL web collection as a whole is therefore described as a number of selected websites, each with a separate time stamp, and presented as a list of URLs accompanied by the date of selection. The contextual information we want to offer our designated community will focus on layers 1, 2 and 3 in the metadata in the Archival Information Package, while contextual information for layers 4 and 5 needs to be described separately.
The harvest strategy of the KB-NL is to make a snapshot of all the elements of one website at a certain moment or period in time. An online website is a dynamic object linked to the live web. The goal is to harvest the selected live website as completely as possible and to collect as much web material as the harvester can gather from one URL within the shortest possible time. The purpose is that the user can study it as an object in the KB-NL Wayback Machine, as if it were live at a certain moment. 13 During the harvest, the website is cut off from the live web and harvested at the level of the site and the individual web pages by following and harvesting links and web elements. Afterwards, the harvested web material is reconstructed as an archived version of the site in the web collection and made accessible through the Wayback Machine. The context information of the KB-NL web collection which is presented to the user thus concerns the third analytical layer of the web.
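The following sketch illustrates the principle of such a site-scoped snapshot harvest in simplified form: follow links breadth-first, but only within the domain of the selected seed. A production crawler such as Heritrix adds politeness delays, robots.txt handling, deduplication and ARC/WARC writing; none of that is shown here, and all names are illustrative.

    # Minimal sketch of a site-scoped snapshot harvest (layer 3):
    # crawl breadth-first, staying inside the seed's domain.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                for name, value in attrs:
                    if name == 'href' and value:
                        self.links.append(value)

    def snapshot(seed: str, limit: int = 100) -> dict:
        """Harvest pages reachable from `seed`, staying inside its domain."""
        domain = urlparse(seed).netloc
        queue, seen, archive = deque([seed]), {seed}, {}
        while queue and len(archive) < limit:
            url = queue.popleft()
            try:
                html = urlopen(url, timeout=10).read().decode('utf-8', 'replace')
            except OSError:
                continue  # unreachable URL: nothing to capture
            archive[url] = html  # one time-stamped capture of this URL
            parser = LinkExtractor()
            parser.feed(html)
            for href in parser.links:
                absolute = urljoin(url, href)
                if urlparse(absolute).netloc == domain and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
        return archive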
Because the KB-NL harvests from a collection perspective, we can state that its webarchiving activity is not the preservation of a historic source, but the creation of a new one, assembled at the third level of web analysis out of harvested elements. The result, which can be viewed in the Wayback Machine, must resemble the live version as much as possible to be considered an authentic source of our digital age. What the archived instance lacks in authenticity is its dynamics, as it is a snapshot of a dynamic website taken at a certain moment. Still, the harvest of the website can also have its own dynamic characteristics. The bigger and more dynamic the website is at the moment of harvest, the longer the 'shutter time' of the harvest, which also depends on the webarchiving technology used. One link, web page or web element can be harvested at a different time from another.

KB-NL and the IA
The general harvest strategy of the IA is focused on the first analytical layer of the web. 14 The IA describes its collection in the Wayback Machine for the general public in terms of the number of time-stamped web objects or 'web captures', that is, archived web elements. 15 The harvest strategy of the KB-NL differs from that of the IA with regard to the analytical level of the web, as the KB-NL harvests from a collection point of view. The KB-NL focuses on preserving an authentic archived website. The IA focuses on broad harvesting at the first analytical level of selected individual web elements through domain harvests of web spheres, not on snapshots of selected websites. 16 When viewing a specific website in the Wayback Machine, the researcher navigates through snapshots of websites whose elements were harvested at different moments in time and brought together later in the Wayback Machine (Leetaru, 2015).
The difference between harvesting methods has important implications for research on websites. If we research web material at the analytical level of the website, selective harvests offer us a more authentic source. But if we conduct research on websites at the level of the web sphere, selective harvests offer less authenticity, as the instances of separate websites were harvested within different time intervals and cannot together be treated as one source from a certain moment. It is therefore important that the user is aware of this context information when studying the archived web.
A domain crawl of a national domain or national web, in addition to selective crawls, can provide valuable context information about the individual archived websites in the web collection for future users. Because of legal issues, conducting a domain crawl is not yet possible for the KB-NL. If the KB-NL had been able to conduct domain harvests of the Dutch national web, as the British Library and the National Danish Library are able to do, it could have provided an authentic snapshot of the Dutch web sphere for future researchers. 17

Heritrix web crawler tool
As web harvesting technology results in the creation of new sources in the web collection or archive, it is important to understand the working of the web harvester. The KB-NL uses Heritrix version 1.14.1 for webarchiving, as do most national libraries and large heritage institutions in the world, such as the British Library, the National Library of New Zealand, the Biblioteca Nazionale Centrale di Firenze, Netarkivet in Denmark and the Bibliothèque nationale de France. Heritrix is responsible for the majority of archived websites and the content of web collections in the world. Therefore, background information on the working of this program will be crucial for future users of archived web material, to judge the value and authenticity of the sources and to understand the context in which this material was created.
A core setting of Heritrix determines whether it focuses on the first or the third analytical layer of the web. The main difference between the crawl strategies of the IA and the KB-NL described above has its root in different settings of the Heritrix web crawler. The IA conducts broad domain harvests, which means that the crawler harvests as much web material as possible from the first layer of the web, but only one or two levels deep into a website, and therefore only scratches the surface of the third analytical layer. The KB-NL does selective harvests: focused crawls of selected websites. Heritrix is therefore instructed to stay within the scope of the selected website and to crawl as much web material as possible from it. If we compare the IA and the KB-NL at the fourth layer, the web sphere, this means that the IA harvests more, but superficially, while the KB-NL harvests less, but very thoroughly.
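Reduced to its essence, the scoping difference can be sketched as two decision rules: one limits depth but not domain, the other limits domain but not depth. The parameter names below are our illustration, not actual Heritrix settings.

    # Sketch of the two scoping strategies; names are illustrative.
    from urllib.parse import urlparse

    def in_scope_broad(url: str, hops: int, max_hops: int = 2) -> bool:
        """IA-style broad harvest: any URL is in scope, but only shallowly."""
        return hops <= max_hops

    def in_scope_selective(url: str, site: str = 'example.nl') -> bool:
        """KB-NL-style selective harvest: only the chosen site, at any depth."""
        return urlparse(url).netloc.endswith(site)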
It is therefore not possible to state that the IA harvests 'everything' and that the KB-NL possesses only a small sample of the Dutch national web which is already present in the collection of the IA. Which collection is most useful for research on the past web depends on the researcher's needs: which analytical layer of the archived web he or she wants to study, and how authentic the archived resource must be for the research goal.
We can state that it is necessary for webarchivists and researchers to understand the working of Heritrix and its output. At the moment, there is a lack of information on the web about Heritrix and the differences between its versions. Unfortunately, even the developer documentation for Heritrix is largely out of date and scattered around the web (IIPC Heritrix Task Force, 2012). 18 No person or organisation in the world takes responsibility for keeping this information up to date or checking its content. A Heritrix developer community exists, but it is relatively small and no organisation takes the lead in further development. 19 This poses serious limits on the availability of context information about the harvest.
As stated above, many institutions, including the KB-NL, still use an old version of Heritrix for different reasons. Yet this version has serious flaws, of which most researchers are not aware. Sites served over HTTPS, for example, can no longer be harvested. Another serious issue is the crawler trap, through which large amounts of unwanted data are harvested that are useless for analysis. In addition, dynamic websites are hard to crawl. New tools like Brozzler and Webrecorder are under development to deal with these issues, but it takes a serious investment in time and money to implement them in the regular workflow of institutions. 20 Our designated community of researchers needs to become familiar with the details of harvest techniques when doing research on web collections. At present, only a few researchers understand the technical aspects of webarchiving, and most of them are webarchivists themselves. Researchers who take the working of a web crawler into account when analysing archived web resources are still scarce. It is no longer enough for a researcher to understand the first layer of the archived web and to be able to analyse texts and images: for a serious researcher of the digital age, knowledge of all web layers is a prerequisite for researching web collections.
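As an illustration of the kind of guard a crawler needs, the sketch below rejects URLs whose path repeats the same segment suspiciously often, a typical symptom of a crawler trap such as an endlessly self-linking calendar. The threshold is an invented example, not the rule Heritrix applies.

    # Sketch of a simple crawler-trap guard; the threshold is illustrative.
    from urllib.parse import urlparse

    def looks_like_trap(url: str, max_repeats: int = 3) -> bool:
        """Flag URLs whose path repeats one segment more than max_repeats times."""
        segments = [s for s in urlparse(url).path.split('/') if s]
        return any(segments.count(s) > max_repeats for s in set(segments))

    # looks_like_trap('http://example.nl/cal/2017/cal/2017/cal/2017/cal/2017/p')
    # returns True; such URLs would be skipped instead of harvested.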
Source criticism of archived websites is thus still in its infancy, but this knowledge is essential to make web collections useful for researchers and to increase the overall value of web collections. When researchers understand the working of Heritrix and its output better, they are also better able to judge the value of the web collection. It is therefore important not only to collect and preserve context information about the selection policy and the collection, but also to preserve the data and other background information on the harvesting tools used to build the web collections.

Conclusion
The KB-NL has been archiving websites for more than 10 years and has built up a unique web collection of Dutch digital web culture since 2007. Still, this digital collection is not yet ready for long-term preservation. If we want to preserve this collection for the future in a responsible way, we need to incorporate the requirements for long-term archiving as described in the OAIS model. The most important requirement is to provide context information about the Dutch national web domain, the national landscape of Dutch webarchives, the KB-NL selection policy, and the harvest strategy and technology. This can be done by mapping the past and present Dutch national domain using old and forgotten data, and by drawing up a national list of all websites archived by Dutch webarchiving institutions, including those of the KB-NL. The selection policy and policy changes must be recorded and this information made available for future researchers. Furthermore, understanding the harvesting techniques and their outcome is crucial for assessing the authenticity of a preserved digital source of the past web. Webarchiving institutions should make researchers more aware of the possible limits of their objects and of the differences between collections due to different harvesting strategies and tools. Finally, the web has no national borders, and neither does a national web collection. In order to be fully prepared for the future, national libraries must secure national context information through international cooperation with other institutions.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.