Creating and questioning research-oriented digital outputs to manuscript metadata. A case-based methodological investigation

Diandra Maria Cristacha

Hello everyone, my name is Diandra Cristache and I am currently working as a data curator for Digital Humanities at the University of Grenoble. Between February and March 2020, while I was finishing my master's degree in Digital Humanities at the University of Tours, I had the great opportunity to spend a few weeks as a student visitor at the Cambridge University Library. During my stay I was tutored by Huw Jones, Head of the Digital Library Unit and Digital Humanities Coordinator; this research project owes much to his wise guidance and bright ideas.

My research in Cambridge explored the issues of multidisciplinary teamwork in creating and questioning research-oriented digital outputs from manuscript metadata. What happens when researchers, cataloguers, librarians, data curators, and photographers come together to create manuscript metadata? What methodological challenges does such a team face? These questions are addressed here through the prism of two principles that are particularly cherished by digital humanists: interoperability and reusability. So, let's rephrase the question above: what challenges does a multidisciplinary team face in creating interoperable and reusable corpora of manuscript metadata? How do such corpora respond to the expectations of scholarly research and, the other way round, how does scholarly research feed back into the multidisciplinary workflow surrounding the metadata?

My intent was to address these questions through a case-based observation. Valuable groundwork for this was provided by the Polonsky Foundation Greek Manuscripts Project, a collaborative project between the universities of Cambridge and Heidelberg. The project runs from 2018 to 2021 and consists of digitizing and describing a corpus of 443 Greek manuscripts. The manuscripts reside partly in Cambridge University Library, the Fitzwilliam Museum and 12 Cambridge colleges, and partly in the Biblioteca Palatina in the Vatican Library. The project is funded by the Polonsky Foundation, and further information can be found at the link in the presentation. For the sake of this research, I joined the project's team as an external observer: I spent some time with the team of photographers in charge of digitizing the manuscripts and some time with the team of cataloguers in charge of creating the manuscript metadata in XML-TEI, and I tested some minor processing solutions in Python on that corpus of metadata myself.

In this talk, I would like to present the results of this observation in the field. First of all, I will outline some methodological premises underlying the work of digital manuscript description; then, I will highlight some methodological gaps that seem to occur in multidisciplinary projects in manuscript studies; finally, I will suggest that mutual understanding across disciplines is fundamental for making the most of manuscript metadata. I chose to address these questions from a methodological perspective that takes into account both the scholarly and the technical issues related to manuscript metadata. I also chose not to develop the computational part of my research, because it mainly served as a functional support for the thoughts that are expressed here: if you would like to know more about it, I would be happy to present it during the Questions and Answers session.

TEI and manuscript sources. Describing manuscript resources implies making a selection of features or, in other words, choosing what to describe; and that choice expresses a scholarly approach that is necessarily subjective. As we all know, though, the scholarly effort goes towards common purposes: research projects are not only about producing content and drawing conclusions, but also about opening the path for further investigation, which implies delivering data to the community. For research projects in manuscript studies, that is where the Text Encoding Initiative comes into play: TEI offers the widest range of encoding choices for manuscript descriptions, while ensuring the further usability, and therefore the sustainability, of metadata. But what about encoders who come from a background of traditional scholarship in the humanities? Do TEI-based projects respond to the needs and expectations of "traditional" cataloguers, philologists, or palaeographers? On the basis of my observation in the field, I would say yes, but only partially: digital manuscript studies set up a confrontation between "traditional" practices in manuscript studies and the solutions offered by TEI, in which fundamental shifts occur in methodology and working practices.

A matter of form and content. The first remarkable shift occurs in the relationship between content and support. In traditional research, there is a fundamental distinction between research information as content and the printed material as support. The printed book is a material carrier for information and makes it shareable and spreadable, but it does not actually partake in the epistemological value of that information: in other words, if the material format of the book were to change, that would not affect how the reader understands what the book or the catalogue says. When digital practices become a part of manuscript studies, a shift of perspective occurs that is reflected in a shift of terminology: "descriptive information" becomes "metadata", "describing manuscripts" becomes "metadata recording", and the "document" becomes a "file". In TEI-encoded metadata, every piece of information is embedded in a hierarchical structure of elements that is meant to be interoperable. Interoperability essentially means reuse, and this implies another important shift, which Rehbein already pointed out back in 2008: digital metadata is not only meant to be read, as a printed book would be, but also to be used. To be interoperable, it is not enough for information to just be there: it also has to be encoded in a way that makes it usable. So the relationship between content and support changes in that the material support of information, the "file", is no longer only a carrier, but acquires a decisive role in the effective contribution of information to the progress of research. In multidisciplinary projects, however, producing reusable metadata can be challenging: on the one hand, the encoding choices determine the further sustainability of a metadata corpus; on the other hand, digital manuscript projects result from a collaboration of professionals from different disciplinary backgrounds, each with their own methods and priorities, which eventually affect their approach to TEI.
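To make the difference between "read" and "used" concrete, here is a minimal Python sketch of what it means for a script to use a TEI record. It is only an illustration under stated assumptions: the record, its shelfmark "MS Example 1", and the `summarize` helper are hypothetical and are not taken from the Polonsky project's data or code. The element names (`msDesc`, `msIdentifier`, `repository`, `idno`, `origDate`) and the `notBefore`/`notAfter` attributes come from the TEI manuscript description module.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI P5 documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical record, invented for illustration only.
SAMPLE_RECORD = """
<msDesc xmlns="http://www.tei-c.org/ns/1.0">
  <msIdentifier>
    <settlement>Cambridge</settlement>
    <repository>Cambridge University Library</repository>
    <idno>MS Example 1</idno>
  </msIdentifier>
  <history>
    <origin>
      <origDate notBefore="1000" notAfter="1100">11th century</origDate>
    </origin>
  </history>
</msDesc>
"""


def summarize(record_xml):
    """Pull a few machine-usable facts out of one manuscript description."""
    root = ET.fromstring(record_xml)
    orig_date = root.find(".//tei:origDate", TEI_NS)
    return {
        "repository": root.findtext(".//tei:repository", namespaces=TEI_NS),
        "shelfmark": root.findtext(".//tei:idno", namespaces=TEI_NS),
        # The attributes carry the date range in a form a program can compare;
        # the element text ("11th century") is the form a human reads.
        "not_before": int(orig_date.get("notBefore")),
        "not_after": int(orig_date.get("notAfter")),
    }


summary = summarize(SAMPLE_RECORD)
print(summary)
```

The same date appears twice in the record: once as prose for the reader and once as attributes for the machine. Only the second form can be sorted, filtered, or compared across a whole corpus without further interpretation.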

From narration to standardization. Digital manuscript projects also experience a shift of framework: from user-friendly word processors (such as the What You See Is What You Get writing tools) to the intimidating environment of Oxygen and other XML editors. While the former allow information to be developed through a free and linear narrative, the latter constrain information into a hierarchical and standardized structure; but experience from the Polonsky project showed that not all information about manuscripts can fit into an encoding standard. Two main issues were highlighted by the members of the project: firstly, the encoding guidelines in use were not designed for describing the specific features of Greek manuscripts; secondly, Greek palaeography is a recent discipline with a limited terminology that cannot be directly translated into the granularity of TEI. When no standardized solution exists, there are two possible outcomes: either the information is not expressed, or, which happens more often, it is expressed in a non-standardized way, for example as a note. Non-standardized information, however, is more difficult to manage: the coexistence of heterogeneous encoding choices will require further adjustments to make the metadata suitable for large-scale processing. And it is not only a matter of quality, but also of quantity: should one encode all the descriptive information that can be found about a manuscript? Or should one make a selection? And if a selection is to be made, what is worthy of note, and why?

Ultimately, the flexibility of TEI has a twofold consequence: on the one hand, it tends to meet the epistemological demand of scholars whose purpose is to convey all of the material resulting from their research; on the other hand, given the difficulties in producing standardized metadata for manuscripts, massive datasets may be difficult to process through computational solutions; and that is where an extensive "proliferation of markup" can become counterproductive.
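The cost of heterogeneous encoding can be sketched in a few lines of Python. The two records below are hypothetical and were written for this illustration, not taken from the Polonsky corpus; they have not been validated against any project schema. Both state that the manuscript is on parchment, but one uses the `material` attribute that TEI defines on `supportDesc`, while the other says it only in a free-text note. The helper names are assumptions of this sketch.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Two hypothetical records describing the same fact in different ways.
RECORDS = {
    "record-a": """<msDesc xmlns="http://www.tei-c.org/ns/1.0">
      <physDesc><objectDesc>
        <supportDesc material="perg"/>
      </objectDesc></physDesc>
    </msDesc>""",
    "record-b": """<msDesc xmlns="http://www.tei-c.org/ns/1.0">
      <physDesc><objectDesc>
        <supportDesc><support>
          <note>Written on parchment, with later paper flyleaves.</note>
        </support></supportDesc>
      </objectDesc></physDesc>
    </msDesc>""",
}


def parchment_by_attribute(records):
    """Structured query: relies on the standardized @material value."""
    hits = []
    for key, xml_text in records.items():
        node = ET.fromstring(xml_text).find(".//tei:supportDesc", NS)
        if node is not None and node.get("material") == "perg":
            hits.append(key)
    return sorted(hits)


def parchment_by_text_search(records):
    """Fallback: keyword search in notes, which depends on the wording."""
    hits = []
    for key, xml_text in records.items():
        root = ET.fromstring(xml_text)
        notes = " ".join(
            "".join(note.itertext()) for note in root.iterfind(".//tei:note", NS)
        )
        if "parchment" in notes.lower():
            hits.append(key)
    return sorted(hits)


print(parchment_by_attribute(RECORDS))
print(parchment_by_text_search(RECORDS))
```

Neither query alone finds both manuscripts. The structured query misses the record whose material is stated only in a note, and the keyword search would need a new rule for every wording or language a cataloguer might use. That is the kind of "further adjustment" that heterogeneous encoding imposes on large-scale processing.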

For a mutual understanding. At this point, we can take a step back and look at the bigger picture. From what we observed, we can see that researchers from a background in the traditional humanities are not only confronted with a framework that they are less familiar with; they also have to manage the gap between their own tools and methods and that framework. This mainly affects how information is expressed for the purpose of research: while the conceptual framework of XML encoding arranges information into a vertical hierarchy, traditional scholarly writing develops information through a narrative approach, and the reader decides which information is most important to retain. In a project team where researchers and data curators work together, this gap often leads to a sort of mutual scepticism: researchers are reluctant to exclude any information from a metadata record, and data curators ask who is really going to use all that information about a manuscript.

This is actually a normal matter of professional intentions: cataloguers and data curators working on the same corpus may share the purpose of a common initiative, but each professional figure brings in field-specific motives. Is it about creating a manuscript catalogue for the use of scholars? Is it about producing a well-formed metadata corpus? Or is it about uncovering data patterns to answer specific research questions? Often, in digital manuscript projects, it is all three at once, and even more.

For all these purposes to be fully achieved, different perspectives need to come together and find a way to address their mutual needs: mutual understanding brings mutual appreciation, and eventually, this is how a multidisciplinary team can make the most of manuscript metadata for the progress of research.