Back to analog: the added value of printing digital editions

Kupreyev, Maxim N.

According to Sahle (2017) [1], digital editions are guided by a digital paradigm in their theory, method, and practice, and thus “cannot be given in print without significant loss of content and functionality”. This talk touches upon the challenges of printing TEI XML datasets, but also highlights the diagnostic value of PDF export for data quality. Admittedly, PDF output represents only part of the encoded information, but it can play an essential role in data curation and quality assurance.

The “School of Salamanca” [2] project, jointly sponsored by the AdWL Mainz [3], MPI-LHLT [4] and Goethe-University Frankfurt [5], publishes the works of the jurists and theologians associated with the University of Salamanca, the intellectual center of the Spanish monarchy during the 16th and 17th centuries. Based on a selected set of print editions, we are creating a digital text corpus, which will include 116 works encoded in TEI XML. In addition, we are compiling a historical dictionary of about 300 essential terms, which conveys the fundamental importance of the School of Salamanca for the early modern discourse about law, politics, religion, and ethics.

Our TEI XML data is validated against a RELAX NG (RNG) schema and is exported to HTML and JSON IIIF for web display [6]. Recently, a PDF printout option was added. Considering the complexity and depth of the annotation, we decided to use the established XSL-FO technology, supported by the free Apache FOP processor integrated in the Oxygen Author workflow. Similar results can be achieved with the CSS Paged Media Module or TEI Publisher. The PDF export highlighted issues that fall into two fundamentally different areas:

  • Rendering XML elements in the two-dimensional space of a PDF page.
  • Semantic errors and inconsistencies in the XML encoding.

Issues of the first type concern, for example, the representation of marginal notes and their anchors, and the alignment of pagination across XML, IIIF and PDF. Problems of the second type include, for instance, divergent XML encoding of semantically identical chunks of information, which escaped the Schematron checks but became visible in the print layout.
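
The second type of problem can also be illustrated programmatically. The following is a minimal sketch, not the project's actual tooling: it assumes that only leaf elements are compared, that the attributes `type` and `rend` are the ones relevant for comparison, and that text is matched after whitespace and case normalization. The function name `encoding_variants`, the constant `COMPARED_ATTRS` and the sample fragment are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI_NS = "{http://www.tei-c.org/ns/1.0}"
# Assumption: these attributes matter when comparing encodings.
COMPARED_ATTRS = ("type", "rend")


def signature(el):
    """Describe how an element is encoded: local tag name plus selected attributes."""
    tag = el.tag.replace(TEI_NS, "")
    attrs = "".join(
        f"[@{name}='{el.get(name)}']"
        for name in COMPARED_ATTRS
        if el.get(name) is not None
    )
    return tag + attrs


def encoding_variants(tei_xml):
    """Return text chunks that occur with more than one encoding signature."""
    root = ET.fromstring(tei_xml)
    seen = defaultdict(set)
    for el in root.iter():
        if len(el):  # skip elements with children; only leaf chunks are compared
            continue
        text = " ".join((el.text or "").split()).lower()
        if text:
            seen[text].add(signature(el))
    return {text: sigs for text, sigs in seen.items() if len(sigs) > 1}


# Hypothetical fragment: the same term is tagged once as <term>, once as <hi>.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<p><term>ius gentium</term> and later <hi rend="italic">ius gentium</hi>;
<term>dominium</term> and <term>dominium</term>.</p>
</body></text></TEI>"""

for text, sigs in encoding_variants(SAMPLE).items():
    print(text, "->", sorted(sigs))
```

Both encodings in the sample are well-formed and could pass a schema check, which is why such divergences tend to surface only when the two chunks are rendered differently on the page.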

PDF generation in the School of Salamanca project was initially intended to be one of the export methods for the TEI data. It is now implemented early in the TEI production pipeline as a diagnostic tool that exposes semantic and structural inconsistencies in the data, which can then be corrected before the final XML release. PDF production thus adheres to one of the principles of agile software testing, which states that capturing and eliminating defects in the early stages of the research data life cycle is less time-consuming, less resource-intensive and less prone to collateral bugs (Crispin 2008) [7].
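
An early PDF step of this kind could be scripted roughly as follows. This is a sketch under assumptions: it calls the Apache FOP command-line launcher (`fop -xml … -xsl … -pdf …`) directly rather than through Oxygen Author, and the file names `work.xml`, `tei2fo.xsl` and `work.pdf` as well as the function names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def fop_command(tei, xslt, pdf, fop="fop"):
    """Build the FOP call that transforms TEI XML via an XSL-FO stylesheet into PDF."""
    return [fop, "-xml", str(tei), "-xsl", str(xslt), "-pdf", str(pdf)]


def render_pdf(tei, xslt, pdf):
    """Run FOP and fail loudly, so that defects surface before the XML release."""
    cmd = fop_command(tei, xslt, pdf)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("Apache FOP launcher not found on PATH")
    subprocess.run(cmd, check=True)  # raises CalledProcessError on a failed run
    return Path(pdf)


# Hypothetical file names; printing the command does not require FOP to be installed.
print(" ".join(fop_command(Path("work.xml"), Path("tei2fo.xsl"), Path("work.pdf"))))
```

Because `check=True` turns a failed rendering into an exception, a pipeline built this way would stop at the PDF step instead of releasing unreviewed XML.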


