Published July 30, 2020 | Version v1
Presentation Open

Multilingual data in ELTeC: enacting European literary traditions

  • 1. "Alexandru Ioan Cuza" University of Iasi
  • 2. "Gr. T. Popa" University of Iasi

Description

This workshop presents challenges that arose during the implementation of CA 16204: Distant Reading for European Literary History, together with several solutions for fostering a culturally informed and linguistically aware use of data, which the project members, representing 30 participating countries, identified through close collaboration within the 4 working groups. The first part discusses the pros and cons of applying strict sampling principles to heterogeneous and non-synchronously developed literary traditions, many of them defined as “emergent” and therefore showing intermittent dynamics between 1840 and 1920. For the relatively young literary traditions of Central and South-Eastern Europe, the metadata-based approach, “the distance as a condition of knowledge,” the forced estrangement from a novel’s content, and the ELTeC selection criteria (e.g. at least 10/15 % female-authored novels, 20 % long novels) might look like “a bed of Procrustes.” Even so, the encoding schema (Level 1 and 2 in particular) leaves enough room to illustrate linguistic and cultural specificity. In fact, ELTeC presently accommodates 14 typologically diverse languages from groups such as Romance, Balto-Slavic, and Germanic (for the current status of ELTeC, see https://distantreading.github.io/ELTeC/), and it is used as a benchmark corpus to test the performance of tools on lesser-resourced languages. Moreover, new entries (Ukrainian, Belarusian), extensions to collections that contain texts published before or after the indicated time span, and multilingual collections (e.g. Swiss) are encouraged.

The second part is a case study on the ELTeC paratext (titles as “thresholds” to “the great unread”). It aims to illustrate how ELTeC’s multilingual diversity can be used to combat language indifference and to devise more comprehensive research questions that challenge theoretical assumptions and integrate both linguistic and literary concepts. Although it includes first editions as well as later ones, ELTeC reflects literary and cultural conventions (e.g. genres, periods) rather than linguistic features. Nevertheless, tokenization and POS tagging tests carried out on the titles from the Romance language collections (French, Italian, Portuguese, Spanish, and Romanian) brought to the fore several situations that should challenge current tokenization practices.
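
The description does not say which situations the tests surfaced. As one assumed illustration only, elided articles and prepositions, which are common in French and Italian titles, show how a whitespace tokenizer and an elision-aware one can disagree. The minimal Python sketch below uses sample strings and helper names (`whitespace_tokens`, `elision_aware_tokens`) that are hypothetical and not taken from the ELTeC tests or from any tool used in them.

```python
import re

# Illustrative title strings (assumptions, not drawn from the ELTeC test data).
SAMPLE_TITLES = ["L'Assommoir", "All'erta, sentinella!"]

# An elided form: word characters followed by an apostrophe (straight or
# curly) that is immediately followed by another word character.
ELISION = re.compile(r"(\w+['’])(?=\w)")


def whitespace_tokens(title):
    """Naive baseline: split on whitespace only."""
    return title.split()


def elision_aware_tokens(title):
    """Split elided forms such as "L'" into their own tokens.

    Punctuation is left attached to neighbouring words; this sketch only
    targets the elision case.
    """
    return ELISION.sub(r"\1 ", title).split()


if __name__ == "__main__":
    for title in SAMPLE_TITLES:
        print(title, whitespace_tokens(title), elision_aware_tokens(title))
```

With whitespace splitting, "L'Assommoir" stays a single token, so a POS tagger would see the article and the noun as one unit. The elision-aware variant separates them, which is the kind of choice a tokenizer has to make explicitly for each language.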

Files

Patras_Lionte_Multilingual data in ELTeC_ enacting European literary traditions.pdf