Published September 22, 2025 | Version v1
Presentation Open

Encoding Complexity - TEI Modeling in the 'Forschungsportal BACH' Project

  • 1. Saxon Academy of Sciences in Leipzig

Description

The long-term project "Forschungsportal BACH", launched in 2023, is a collaborative effort between the Saxon Academy of Sciences and Humanities in Leipzig and the Bach Archive Leipzig. The goal of the project is to document, digitally process, and make available all extant non-musical documents from the family of Johann Sebastian Bach, spanning from the late 16th to the early 19th century, in an online research portal.

The textual sources include private and official correspondence, educational and professional records, legal documents, and other related materials, such as student registers, timetables, account books, wills, petitions, and official records, along with copies and transcripts of various public documents. Letters make up only a small portion of this heterogeneous corpus.

In the project, we established a complex workflow. It starts with visiting the archives and digitizing the sources, continues with recording the sources in the project’s database, automatic text recognition and transcription in “Transkribus”, and the correction and structural annotation of the text that takes place there, and ends with the automatic conversion of the Transkribus output from PAGE XML to TEI via XSLT and further textual annotation in “TEI Publisher”.
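
The PAGE XML to TEI conversion step can be illustrated with a minimal sketch. It is written in Python with the standard library rather than in XSLT, which is what the project uses, and the mapping shown (each PAGE TextRegion becomes a TEI p, each TextLine becomes an lb followed by the line text) is an assumption made for illustration, not the project's actual stylesheet. The sample input and the function name page_to_tei_body are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespaces of PAGE XML (2013-07-15 schema) and TEI P5.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Invented, heavily reduced PAGE document (no coordinates, no metadata).
SAMPLE_PAGE = f"""
<PcGts xmlns="{PAGE_NS}">
  <Page>
    <TextRegion id="r1">
      <TextLine id="l1"><TextEquiv><Unicode>Erste Zeile</Unicode></TextEquiv></TextLine>
      <TextLine id="l2"><TextEquiv><Unicode>Zweite Zeile</Unicode></TextEquiv></TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def page_to_tei_body(page_xml: str) -> ET.Element:
    """Build a TEI <body> from PAGE XML (assumed mapping, see lead-in)."""
    ns = {"pc": PAGE_NS}
    root = ET.fromstring(page_xml)
    body = ET.Element(f"{{{TEI_NS}}}body")
    for region in root.iterfind(".//pc:TextRegion", ns):
        # One paragraph per text region.
        p = ET.SubElement(body, f"{{{TEI_NS}}}p")
        for line in region.iterfind("pc:TextLine", ns):
            # A line beginning, followed by the transcribed text of the line.
            lb = ET.SubElement(p, f"{{{TEI_NS}}}lb")
            lb.tail = line.findtext(
                "pc:TextEquiv/pc:Unicode", default="", namespaces=ns
            )
    return body


if __name__ == "__main__":
    ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace
    print(ET.tostring(page_to_tei_body(SAMPLE_PAGE), encoding="unicode"))
```

A real conversion would also have to carry over the structural annotation made in Transkribus and produce a complete TEI document with a teiHeader. The sketch covers only the line-by-line text transfer.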

To ensure standardization and long-term usability, all texts are encoded in TEI P5. The encoding follows the established guidelines, although the diversity and specificity of the sources occasionally present challenges.

This presentation addresses specific challenges in modeling a structurally and semantically heterogeneous corpus, using three very different document types as examples – a will, a school register, and a coherent document written by a single scribe that contains transcripts and attachments of multiple documents. The focus is on issues of structuring, annotation, and semantic interpretation, especially in areas where the current TEI modules may not fully address the needs of the sources.

Furthermore, solutions will be proposed for how these challenges can be addressed within the current TEI framework – for example, through a nuanced combination of existing elements and careful modeling that closely aligns with the materiality of the sources. The aim is to make transparent the strategies developed within the project and to stimulate further discussion on how to approach structurally and semantically diverse sources, particularly in cases where the scalability of existing modules may not fully meet the demands of the project. Throughout, the encoding remains TEI-valid and is intentionally free of project-specific customizations in the form of a separate schema against which it could be validated, as the full scope of editorial requirements will only become clear during the course of the work. This presentation offers insights into the editorial practice of working with a heterogeneous corpus and aims to foster a discussion on modeling strategies for complex and diverse archival collections.

Files

Quenouille_Encoding_Complexity_Zenodo.pdf (4.7 MB)
md5:f57fbb4da9d5db367334746993ff8595