Published September 10, 2019 | Version v1
Presentation Open

Towards Usable and FAIR Software for Arabic Textual Scholarship - Lessons from Kalīla and Dimna Research

  • 1. Freie Universität Berlin

Description

In designing the digital support for the project Kalīla and Dimna – AnonymClassic, we found that some literary researchers had difficulty acquiring the technical skills needed to work with TEI/XML. To address this, we decided to develop a system with an easy-to-use interface for advanced data entry that does not require knowledge of XML. The system includes mechanisms for validating the entered data and ensuring its integrity and homogeneity. Furthermore, the data schema is designed to store the linguistic data in a format that facilitates processing by machine learning algorithms. This should help optimize automated tokenization and lemmatization procedures, which are critical issues in Arabic computational linguistics.

Even Modern Standard Arabic (MSA) requires some specific features for digital representation. Since the Arabic script is written from right to left (RTL), user interfaces need to be reworked to match RTL behavior patterns. Thanks to the CSS flexbox layout model, a correctly implemented website layout can be switched from LTR to RTL with just one additional CSS statement.
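The abstract does not name the single CSS statement; it is presumably `direction: rtl` or the equivalent HTML `dir="rtl"` attribute, which flexbox rows then follow. As a minimal sketch under that assumption, the following Python snippet picks the direction for a piece of entered text from its first strongly directional character. The function names `base_direction` and `paragraph_html` are hypothetical and not taken from the project.

```python
import unicodedata
from html import escape


def base_direction(text: str) -> str:
    """Return 'rtl' or 'ltr' based on the first strongly directional character."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):  # Hebrew-type or Arabic-type letter
            return "rtl"
        if bidi == "L":          # Latin-type letter
            return "ltr"
    # Only digits, spaces or punctuation: fall back to LTR (an assumption).
    return "ltr"


def paragraph_html(text: str) -> str:
    """Wrap text in a <p> element whose dir attribute matches its direction."""
    return f'<p dir="{base_direction(text)}">{escape(text)}</p>'


print(paragraph_html("كليلة ودمنة"))
print(paragraph_html("Kalīla wa-Dimna"))
```

Setting the direction per element rather than per page is one possible design choice; it keeps mixed Arabic and Latin records readable in the same interface.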

Furthermore, most consonants and long vowels are written in ligatures, and these ligatures, along with other features of the Arabic script, pose challenges for digital representation. Early computer systems could not easily display Arabic script with ligatures, which led to a misrepresentation of the language. Although newer CSS settings can enforce Arabic ligatures on websites when a suitable font is used (e.g. Coranica, Amiri or Arabic Typesetting), adding markup tags within a word breaks the ligatures and thus changes how the Arabic letters are rendered.
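One commonly cited workaround for tags inside a word is to place a ZERO WIDTH JOINER (U+200D) on both sides of the tag boundary, so that each fragment keeps its connected letter form. The abstract does not say whether the project uses this technique, and whether it is needed at all depends on the browser and font. The sketch below is therefore only an illustration: the helper `wrap_span` is hypothetical, the list of non-connecting letters is partial, and vocalized text (with diacritics between letters) is not handled.

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER

# Letters that never connect to the following letter.
# Partial list (assumption): basic Arabic alphabet only.
NON_LEFT_JOINING = set("اأإآدذرزوؤةء")
HAMZA = "\u0621"  # stand-alone hamza connects on neither side


def is_arabic_letter(ch: str) -> bool:
    """Rough check for a basic Arabic letter (U+0621..U+064A, no tatweel)."""
    return "\u0621" <= ch <= "\u064a" and ch != "\u0640"


def wrap_span(word: str, start: int, end: int, css_class: str) -> str:
    """Wrap word[start:end] in a <span>, adding ZWJ where a joined pair is split."""

    def joins_across(i: int) -> bool:
        # True if the letters at positions i-1 and i are cursively connected.
        if i <= 0 or i >= len(word):
            return False
        left, right = word[i - 1], word[i]
        return (
            is_arabic_letter(left)
            and is_arabic_letter(right)
            and left not in NON_LEFT_JOINING
            and right != HAMZA
        )

    before, inside, after = word[:start], word[start:end], word[end:]
    if joins_across(start):
        before, inside = before + ZWJ, ZWJ + inside
    if joins_across(end):
        inside, after = inside + ZWJ, ZWJ + after
    return f'{before}<span class="{css_class}">{inside}</span>{after}'


# Highlight the second and third letters of the word "Kalīla".
marked = wrap_span("كليلة", 1, 3, "hl")
print(marked.encode("unicode_escape").decode("ascii"))
```

Removing the tags and the joiners again yields the original word, so the annotation stays separable from the stored text.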

Although Arabic script can be represented as transliterated text in Latin script, there are numerous different transcription systems (e.g. DMG, Anglo-American), some of which are highly inconsistent. Moreover, most transcription systems cover only written Arabic and are not necessarily suitable for representing oral or dialectal language in a phonetically correct way.
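To illustrate how the systems diverge, the sketch below romanizes the consonant skeleton of a word in two ways. It assumes that "Anglo-American" refers to a digraph-based scheme such as ALA-LC, which the abstract does not state. The tables cover only a handful of consonants, ignore vowels entirely, and the function `romanize_consonants` is hypothetical.

```python
# Consonants that both systems render the same way (small excerpt).
SHARED = {
    "ب": "b", "ت": "t", "د": "d", "ر": "r", "س": "s",
    "ف": "f", "ك": "k", "ل": "l", "م": "m", "ن": "n",
}

# Consonants where the two systems differ (small excerpt).
DMG = {"ث": "ṯ", "ج": "ǧ", "خ": "ḫ", "ذ": "ḏ", "ش": "š", "غ": "ġ"}
ALA_LC = {"ث": "th", "ج": "j", "خ": "kh", "ذ": "dh", "ش": "sh", "غ": "gh"}

SYSTEMS = {"dmg": {**SHARED, **DMG}, "ala-lc": {**SHARED, **ALA_LC}}


def romanize_consonants(text: str, system: str) -> str:
    """Map each known consonant to Latin script; keep unknown characters as-is."""
    table = SYSTEMS[system]
    return "".join(table.get(ch, ch) for ch in text)


word = "شمس"  # consonant skeleton sh-m-s ("sun")
print(romanize_consonants(word, "dmg"))     # šms
print(romanize_consonants(word, "ala-lc"))  # shms
```

Even this small excerpt shows why a stored transliteration is only interpretable together with the name of the system that produced it.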

In Western academia, software to support research on the Arabic language needs to be crafted by hand. All the problems mentioned above, as well as the underrepresentation of the Arabic language in software solutions, pose challenges for the long-term preservation of research data and research software. In our contribution, we present these challenges and our solutions for creating research software in Digital Humanities projects that handle Arabic script.

Files

Utrecht Kalīla wa-Dimna Anonym Workshop powerpoint(1).pdf

Files (2.8 MB)