Published June 6, 2023 | Version v1
Poster Open

Archiving a Mailing List. A Case Study of the Katalist

Creators

  • 1. National Széchényi Library

Description

Of all born-digital objects, email is among the most challenging to preserve over the long term. Internationally, institutions already have experience archiving even large-scale correspondence, but in Hungary a (public) collection-level practice has yet to develop. To adopt international good practices and results, the Digital Humanities Centre of the National Széchényi Library has launched a pilot project to archive the entire material of the Katalist mailing list (more than 40,000 emails from the start until August 2022), which could serve as a model for further email archiving tasks. Katalist is a mailing list on library and library informatics topics that has been in operation since the early 1990s; its messages have been preserved since 1997. It has thousands of members, its contents are publicly available, and it is an important source for the history of Hungarian librarianship.

As part of the process, the entire archive was first explored using the ePADD software. ePADD is free and open-source software developed by Stanford University's Special Collections & University Archives that supports the appraisal, processing, preservation, discovery, and delivery of historical email archives. It incorporates techniques from computer science and computational linguistics, including machine learning, natural language processing, and named entity recognition, to help users access and search email collections of historical and cultural value. The archive was then packaged with the Mailbagit software into an OAIS-compliant AIP that follows the BagIt package format specification. In addition to the standard EML, the package includes other formats suitable for long-term preservation (HTML, TXT, PDF, WARC), the attachments extracted from the emails, and the collection-relevant metadata for the emails. The Mailbag project is a specification, with an accompanying open-source tool, for preserving email archives in multiple formats such as MBOX, PDF, and WARC; it is developed by a consortium led by the M.E. Grenander Department of Special Collections & Archives, University at Albany, SUNY. The Mailbag proposal is an extension of the Library of Congress BagIt specification. The archived material will be searchable through a Solr-based search engine.
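To make the packaging step more concrete, the following is a minimal sketch of the BagIt layout that such a package builds on: payload files under data/, a bagit.txt declaration, and a checksum manifest. It uses only the Python standard library and is not the Mailbagit implementation; the function names, the sample addresses, and the choice of SHA-256 are assumptions made for illustration, and the derivative formats and metadata that the Mailbag specification adds are omitted.

```python
import hashlib
import tempfile
from email.message import EmailMessage
from pathlib import Path


def write_minimal_bag(messages, bag_dir):
    """Write each message as an .eml payload file and add BagIt tag files.

    Hypothetical helper for illustration; not part of Mailbagit.
    """
    bag_dir = Path(bag_dir)
    payload_dir = bag_dir / "data"
    payload_dir.mkdir(parents=True, exist_ok=True)

    manifest_lines = []
    for index, message in enumerate(messages, start=1):
        eml_path = payload_dir / f"{index}.eml"
        eml_path.write_bytes(message.as_bytes())
        digest = hashlib.sha256(eml_path.read_bytes()).hexdigest()
        # Each manifest entry pairs a checksum with a path relative to the bag root.
        manifest_lines.append(f"{digest}  data/{eml_path.name}")

    # Bag declaration required by the BagIt specification (RFC 8493).
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag_dir


def make_sample_message(subject):
    """Build a small illustrative message (addresses are placeholders)."""
    message = EmailMessage()
    message["From"] = "sender@example.org"
    message["To"] = "list@example.org"
    message["Subject"] = subject
    message.set_content("Hello, list.")
    return message


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bag = write_minimal_bag([make_sample_message("Test")], Path(tmp) / "bag")
        print(sorted(p.name for p in bag.iterdir()))
```

Because the manifest stores one checksum per payload file, the integrity of every archived message can later be verified independently of the software that created the package.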

During the demonstration, visitors can test the discovery and processing capabilities of the ePADD software on the Katalist mailing list, inspect the OAIS-compatible packages created with the Mailbagit software (including testing Mailbagit on the fly, e.g. creating standard derivatives such as WARC and PDF from emails), and try out the Solr-based search engine.
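As a rough illustration of how a Solr-based search over the archive might be queried, the sketch below builds a request URL for Solr's standard /select handler. The base URL, the core name "katalist", and the field name "subject" are hypothetical; the actual index schema of the project is not described here.

```python
from urllib.parse import urlencode


def build_solr_select_url(base_url, core, phrase, rows=10):
    """Return a Solr /select URL for a phrase query on a subject field.

    The field name "subject" is an assumption for illustration.
    """
    # Escape backslashes and double quotes so the phrase stays one quoted term.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    params = {
        "q": f'subject:"{escaped}"',
        "rows": rows,
        "wt": "json",  # ask Solr for a JSON response
    }
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical local Solr instance and core name.
    print(build_solr_select_url("http://localhost:8983/solr", "katalist", "library"))
```

The function only constructs the URL, so it can be tried without a running Solr instance; sending the request and parsing the JSON response would be a separate step.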

Files

Files (1.3 MB)

md5:0886f7e609460bd6d29d89b559176e9a
1.3 MB