Published September 4, 2024 | Version v1
Presentation Open

Cathedral, bazaar, data garden: the Pangloss Collection

Description

Presented by Alexis Michaud and Séverine Guillaume at the Language Documentation and Archiving Conference, Berlin & Online, 4-6 Sept, 2024

Eric Raymond’s classic The Cathedral and the Bazaar (Raymond 1999) contrasts top-down and bottom-up design. How does the Pangloss Collection (pangloss.cnrs.fr), a member of the Digital Endangered Languages and Musics Archives Network, pattern in these terms? There is a cathedral aspect to the format used for display and distribution: hierarchical markup (XML) following a fixed structure that has remained essentially stable over the years (Jacobson, Michailovsky & Lowe 2001; Michailovsky et al. 2014). Yet the collection is on the bazaar side in terms of the corpora hosted. Any speech dataset can in principle be deposited in the archive, provided that it was collected in connection with linguistic/anthropological research (and belongs within the national scope, since the institution taking care of long-term archiving has national scope). Corpora are archived as is: they are documented to varying extents and differ vastly in size; some are annotated in excruciating detail, whereas others have only a transcription, and some are left untranscribed. The diversity of the archive, and the evolution of the hosted resources over time, suggest a third metaphor: that of the data garden, which not only undergoes overall growth over the years, but also allows (and fosters) gradual improvements to the resources. The gardening tasks can favour emulation and cross-fertilization in terms of annotation practices, without hard regulations and guidelines. The presentation of the archive will place emphasis (i) on the functions that it fulfils as part of a broader Open Science environment, (ii) on current plans for improvements to workflows, to interfaces, and to resources and metadata, and (iii) on opportunities, challenges and threats related to Natural Language Processing (such as automatic transcription: Guillaume et al. 2022).

Files

LD&A2024_MichaudGuillaume.mp4

Files (383.0 MB)

Name Size
md5:5c4c623ceda8b69cf025aee4974cec0f 374.8 MB
md5:5efa7f00236de4fe5215718e923b11d3 8.2 MB

Additional details

Related works

Is identical to
Video/Audio: https://youtu.be/9Vgb4hPpPj8?feature=shared (URL)

Dates

Created
2024-09-04

References

  • Guillaume, Séverine, Guillaume Wisniewski, Benjamin Galliot, Minh-Châu Nguyễn, Maxime Fily, Guillaume Jacques & Alexis Michaud. 2022. Plugging a neural phoneme recognizer into a simple language model: a workflow for low-resource settings. In Proceedings of Interspeech 2022. Incheon, Korea. https://halshs.archives-ouvertes.fr/halshs-03625581.
  • Jacobson, Michel, Boyd Michailovsky & John B. Lowe. 2001. Linguistic documents synchronizing sound and text. Speech Communication 33 [special issue: "Speech Annotation and Corpus Tools"]. 79–96.
  • Michailovsky, Boyd, Martine Mazaudon, Alexis Michaud, Séverine Guillaume, Alexandre François & Evangelia Adamou. 2014. Documenting and researching endangered languages: the Pangloss Collection. Language Documentation and Conservation 8. 119–135.
  • Raymond, Eric. 1999. The cathedral and the bazaar. Knowledge, Technology & Policy. Springer 12(3). 23–49.