Published January 27, 2022 | Version v1
Presentation | Open Access

Multilingual research projects: Challenges for making use of standards, authority files, and character recognition

  • Matthias Arnold (Heidelberg University)

Description

In 2019, the new Centre for Asian and Transcultural Studies (CATS) opened its doors at the University of Heidelberg.[1] This research collaboratorium also features a strong digital section, comprising research data in various media from across Asia on both the digital library and the digital humanities research side. Providing data and metadata to a multilingual community, however, is not always trivial. In my presentation I will take three use cases from CATS projects as examples of the challenges we face and introduce approaches to solving them.

Use case 1: Not all metadata standards are capable of encoding multilingual content sufficiently. Here I will take XML elements from the VRA Core 4 XML metadata standard [2] as examples and present our extension of the standard [3, 4].
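
A minimal sketch of the general idea, not the actual CATS schema extension [3, 4]: parallel titles are repeated with language tags so that original script, transliteration, and translation can coexist in one record. The namespace URI, the xml:lang handling, and the example titles below are assumptions for illustration only.

```python
# Hypothetical sketch: parallel-language titles in a VRA Core 4-style record.
# The namespace URI and the xml:lang handling are illustrative assumptions;
# the actual CATS extension is defined in the schema referenced under [4].
import xml.etree.ElementTree as ET

VRA = "http://www.vraweb.org/vracore4.htm"    # assumed VRA Core 4 namespace
XML = "http://www.w3.org/XML/1998/namespace"  # standard xml: namespace

ET.register_namespace("vra", VRA)

def title_set(titles):
    """Build a titleSet with one title element per language variant.

    `titles` is a list of (text, lang, preferred) tuples, e.g. original
    script, transliteration, and English translation.
    """
    ts = ET.Element(f"{{{VRA}}}titleSet")
    for text, lang, preferred in titles:
        title = ET.SubElement(ts, f"{{{VRA}}}title",
                              {f"{{{XML}}}lang": lang,
                               "pref": "true" if preferred else "false"})
        title.text = text
    return ts

work = ET.Element(f"{{{VRA}}}work")
work.append(title_set([
    ("北洋畫報", "zh-Hant", True),         # original script
    ("Beiyang huabao", "zh-Latn", False),  # transliteration
    ("Pei-yang Pictorial News", "en", False),
]))
print(ET.tostring(work, encoding="unicode"))
```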

Use case 2: For Western-language source materials, for example newspapers published after 1850, digitization also implies producing an OCR version of the content (full text). Although the results are not always perfect, funding agencies like the DFG have made this processing step mandatory. For non-Latin script (NLS) material this is not yet feasible: not only are the OCR algorithms not good enough yet, and additional characters, such as emphasis marks, significantly disturb the processing, but it is already the document layout recognition that fails. One example is the processing of Chinese newspapers from the first half of the 20th century [5], which will be used to illustrate these challenges [6].
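
To make the layout problem concrete, here is a minimal, hypothetical sketch using the open-source Tesseract engine via pytesseract; it does not reflect the actual ECPO pipeline [5, 6]. It assumes that a vertically set, traditional-Chinese column has already been cropped by hand, because automatic page segmentation of a dense, mixed-orientation newspaper page is precisely the step that fails.

```python
# Hypothetical sketch, not the ECPO workflow: OCR on a pre-cropped column of a
# Republican-era Chinese newspaper using Tesseract. Automatic layout analysis
# on the full page is the step that typically fails, so the column is assumed
# to have been cropped by hand (the file path below is a placeholder).
from PIL import Image
import pytesseract

column = Image.open("column_crop.png")  # placeholder: one manually cropped column

# chi_tra_vert: traditional Chinese, vertical-text model from tessdata
# --psm 5: treat the image as a single uniform block of vertically aligned text
text = pytesseract.image_to_string(column, lang="chi_tra_vert", config="--psm 5")
print(text)
```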

Use case 3: Connecting local databases to international authority files is good practice for opening up local databases. It not only helps to identify local entities precisely, but also makes it possible to use external data to enhance the local resource. Conversely, it allows external parties to re-use the local data and opens the way to enriching the authority databases with domain-specific data. Large international authorities, like the Getty thesauri, or national authorities, like the German National Authority File (GND), tend to be less aware of non-Western items, such as concepts or agents. Projects that link their data systematically to these authority files can turn their local knowledge into an advantage and submit data back to the community. While contributing to the larger authorities may be a challenge in itself, a bottom-up workflow with community-based systems like Wikidata or DBpedia can be a more feasible first step [7].
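
As a hedged illustration of such a bottom-up workflow (not the CATS implementation [7]), the sketch below queries the public Wikidata SPARQL endpoint for items that carry a given label and a GND identifier (property P227); the example label and the matching strategy are placeholders, and a real reconciliation workflow would add disambiguation and manual review.

```python
# Hypothetical sketch: look up a local agent name on Wikidata and retrieve its
# GND identifier (property P227). The matching strategy is a placeholder; a
# real reconciliation workflow would need disambiguation and review steps.
import requests

ENDPOINT = "https://query.wikidata.org/sparql"

def gnd_candidates(label, lang="zh"):
    """Return (item URI, item label, GND id) candidates for a given label."""
    query = f"""
    SELECT ?item ?itemLabel ?gnd WHERE {{
      ?item rdfs:label "{label}"@{lang} ;
            wdt:P227 ?gnd .                      # P227 = GND ID
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en,zh". }}
    }}
    LIMIT 10
    """
    r = requests.get(ENDPOINT,
                     params={"query": query, "format": "json"},
                     headers={"User-Agent": "local-authority-linking-sketch/0.1"})
    r.raise_for_status()
    return [(b["item"]["value"], b["itemLabel"]["value"], b["gnd"]["value"])
            for b in r.json()["results"]["bindings"]]

# Example (placeholder label): candidates for a person name from the local database
for uri, name, gnd in gnd_candidates("魯迅"):
    print(uri, name, gnd)
```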

[1] http://hra.uni-hd.de
[2] https://www.loc.gov/standards/vracore/schemas.html
[3] Application: VRA Core 4 XML Transform tool (.csv to XML): https://www.researchgate.net/project/VRA-Core-4-Transform-Tool
[4] Schema extension: http://cluster-schemas.uni-hd.de/vra-strictCluster.xsd
[5] Project database: https://uni-heidelberg.de/ecpo
[6] ECPO presentation: https://www.slideshare.net/MatthiasArnold/early-chinese-periodicals-online-ecpo-from-digitization-towards-open-data-jadh2018
[7] Agent service presentation: https://www.slideshare.net/MatthiasArnold/transforming-data-silos-into-knowledge-early-chinese-periodicals-online-ecpo

Files

DH2019_Arnold_01_public.pdf (16.7 MB, md5:7194de67dc528ccff66d1bfc83b6a948)