Published November 14, 2023 | Version 1
Journal article Open

A search tool based on language modelling developed for The Index of Middle English Prose

  • 1. Department of Literature, Area Studies and European Languages, University of Oslo, Oslo, Oslo, 0315, Norway
  • 2. Humit, University of Oslo, Oslo, Oslo, 0313, Norway

Description

Non-standardised early vernaculars pose a problem for search tools because of their high degree of variation. Titles, incipits, and explicits in manuscript copies of the same work can differ in orthography, syntax, and lexicon. Traditional search methods that rely on exact string matching or regular expressions cannot handle this variation comprehensively. This project presents a web-based search tool designed specifically to handle linguistic and textual variation. The software is made available as part of the Index of Middle English Prose (IMEP).

The search tool addresses variation by combining a database of incipits and explicits, character-based n-gram language models (LMs) built with the Stanford Research Institute Language Modelling (SRILM) toolkit, and a fuzzy search script (IMEP: FSS) written in Python. The tool optimizes for recall: it retrieves multiple potential matches for a search string without attempting to identify the 'correct' one. A search looks up exact matches in the database while, at the same time, the fuzzy search script runs in two steps. First, all incipits and explicits are evaluated against a single LM built from the search string. Second, the search string is matched against LMs of the incipits and explicits. Using SRILM to match the search string against every incipit or explicit in the IMEP would favour precision but could be time-consuming; running the first step against a single LM built from the search string shortens the processing time, which would otherwise be unreasonably long.
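The two-step matching described above can be illustrated with a minimal Python sketch. It is not the IMEP: FSS code and does not call SRILM: a small add-one-smoothed character trigram model stands in for the SRILM models, and the names `CharNgramModel`, `two_step_search`, `shortlist_size`, and `alphabet_size` are hypothetical. Restricting the second step to a shortlist from the first step is an assumption about how the first step saves time.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return the character n-grams of text, padded at both ends."""
    padded = "^" * (n - 1) + text.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramModel:
    """Toy character n-gram LM with add-one smoothing (stand-in for SRILM)."""

    def __init__(self, text, n=3, alphabet_size=64):
        self.n = n
        # Assumed size of the character inventory used for smoothing.
        self.alphabet_size = alphabet_size
        grams = char_ngrams(text, n)
        self.ngram_counts = Counter(grams)
        self.context_counts = Counter(g[:-1] for g in grams)

    def perplexity(self, text):
        """Perplexity of text under this model; lower means a closer match."""
        grams = char_ngrams(text, self.n)
        log_prob = 0.0
        for g in grams:
            numerator = self.ngram_counts[g] + 1
            denominator = self.context_counts[g[:-1]] + self.alphabet_size
            log_prob += math.log(numerator / denominator)
        return math.exp(-log_prob / len(grams))


def two_step_search(query, entries, shortlist_size=3):
    """Rank entries against query in two steps.

    Step 1: score every entry against one model built from the query.
    Step 2: score the query against a model of each shortlisted entry.
    """
    query_model = CharNgramModel(query)
    shortlist = sorted(entries, key=query_model.perplexity)[:shortlist_size]
    scored = [(entry, CharNgramModel(entry).perplexity(query))
              for entry in shortlist]
    return sorted(scored, key=lambda pair: pair[1])
```

Step 1 builds only one model, whatever the size of the database, whereas step 2 builds one model per candidate, so in this sketch the cost of step 2 is bounded by `shortlist_size`.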

A web application built with Django and Docker combines the results of the direct database lookup and the fuzzy search script and presents them as a single list: exact matches first, followed by fuzzy matches ordered by increasing model perplexity. The tool is available Open Access and can be adapted to other datasets.
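The ordering of the result list can be sketched in a few lines of Python. This is an illustration only, not the code of the Django application: the function name `combine_results` and the data shapes are hypothetical, and dropping a fuzzy match that duplicates an exact match is an assumption.

```python
def combine_results(exact_matches, fuzzy_matches):
    """Merge exact and fuzzy results into one ordered list.

    exact_matches: list of entry identifiers found by direct lookup.
    fuzzy_matches: list of (entry identifier, perplexity) pairs.
    Returns (identifier, perplexity) pairs; exact matches carry None.
    """
    seen = set()
    combined = []
    # Exact matches come first, in the order the lookup returned them.
    for entry in exact_matches:
        if entry not in seen:
            seen.add(entry)
            combined.append((entry, None))
    # Fuzzy matches follow, lowest perplexity (closest match) first.
    for entry, perplexity in sorted(fuzzy_matches, key=lambda pair: pair[1]):
        if entry not in seen:  # assumed: skip entries already listed
            seen.add(entry)
            combined.append((entry, perplexity))
    return combined
```

For example, `combine_results(["a"], [("b", 5.0), ("a", 1.2), ("c", 3.0)])` lists `"a"` first as an exact match, then `"c"` and `"b"` in order of increasing perplexity.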

Files

openreseurope-3-17907.pdf (877.2 kB)
md5:63222882de0146cf8fb248733ec293b1
