Published June 30, 2023
| Version v1
Conference paper
Open
Polyphemus, a lexical database of the Ancient Greek papyri, and the Madrid Wordlist of Ancient Greek
Contributors
Data managers:
Hosting institution:
- 1. University of Graz
- 2. Belgrade Center for Digital Humanities
- 3. Le Mans Université
- 4. Digital Humanities im deutschsprachigen Raum
Description
At present there is no way to search the corpus of Greek papyri for lemmas or for specific grammatical forms of a word, much less for examples of a grammatical category. Polyphemus addresses these shortcomings, and more. For this purpose we have processed all the papyrus texts from PapyInfo. This processing runs in parallel with the processing that produces the Callimachus database, which we also present at this DH Congress. The procedure by which we obtain our database Polyphemus is summarized below.
A) First we analyze each line of papyrus text and distinguish actual words from gaps and non-textual elements.
B) We then identify the complete words and separate them from the fragments. Sometimes this can be decided from the editorial criteria used in the original, pre-digitization edition; at other times it is necessary to check whether the text shows some of the external marks that define a word in Ancient Greek (presence of accentuation, etc.).
C) We then lemmatize each word, determine which part of speech it belongs to, and produce its morphological analysis, all with the help of the Madrid Wordlist, which I discuss below. For text fragments (incomplete words), we try to see whether they can be ascribed to a root. We also separate proper nouns from common nouns.
D) Lemma assignment and POS tagging is performed in two phases. In a first pass we tag the forms with the highest frequency of occurrence, calculated from the frequency with which lexical forms were tagged in several manually annotated treebanks (over 700,000 words). We then label all the remaining forms using the Madrid Wordlist. The Madrid Wordlist incorporates information about the dialect in which a form appears, so when multiple analyses are possible we prefer those belonging to Koine (the variety of Greek found in the papyri) or Attic Greek.
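The two-phase tagging in step D can be illustrated with a minimal sketch. Everything here is invented for illustration: `treebank_tags` stands in for the most-frequent analyses derived from the annotated treebanks, and `madrid_wordlist` for the Madrid Wordlist with its dialect labels; the tag strings are arbitrary placeholders, not the project's actual tagset.

```python
# Phase 1: high-frequency forms tagged from treebank counts.
# form -> (lemma, POS tag), the most frequent analysis of that form.
treebank_tags = {
    "καί": ("καί", "c--------"),
    "τῶν": ("ὁ", "l-p---mg-"),
}

# Phase 2: remaining forms looked up in the wordlist; when several
# analyses exist, prefer Koine, then Attic, over other dialects.
madrid_wordlist = {
    "λόγου": [
        {"lemma": "λόγος", "pos": "n-s---mg-", "dialect": "attic"},
    ],
}

DIALECT_RANK = {"koine": 0, "attic": 1}  # lower rank = preferred

def tag_form(form):
    """Return (lemma, pos) for a form, or None if unanalyzed."""
    if form in treebank_tags:                  # phase 1: frequency pass
        return treebank_tags[form]
    analyses = madrid_wordlist.get(form, [])   # phase 2: wordlist lookup
    if not analyses:
        return None
    best = min(analyses, key=lambda a: DIALECT_RANK.get(a["dialect"], 99))
    return (best["lemma"], best["pos"])

print(tag_form("καί"))    # resolved in phase 1
print(tag_form("λόγου"))  # resolved in phase 2 via the wordlist
```

The frequency pass explains the trade-off described next: a form always receives its most common analysis, even when a rarer homograph was intended.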
Naturally, this procedure reduces the number of multiple analyses for the same form (thus drastically reducing the number of false positives) at the cost of losing the correct POS tag for low-frequency forms that coincide with high-frequency ones.
E) All this information is transferred to a SQL database and related to the data on the papyri that we obtained when creating the Callimachus database. In this way, for each lexical form we obtain a lemma, a non-disambiguated morphological analysis, and a translation or gloss. Each of these parameters can be searched in combination with the more than fifty categories available to us thanks to Callimachus, such as date, origin, category, extension, subject, etc. To date, we have been able to analyze 97% of the complete words, including proper names, which are very numerous.
4. The Madrid Ancient Greek Word List
Lemmatization and part-of-speech (POS) tagging are performed by comparing each record in our database against the records of a word list that we have created over the last three years, which we have called the Madrid Ancient Greek Wordlist. Most Ancient Greek wordlists are evolutions, simplifications, or improvements of the Morpheus list developed by Gregory Crane between 1984 and 1990 (Crane 1991; Celano et al. 2016). Our list also starts from Morpheus, but has been enriched with our own treebank (the Aristarchus Treebank, 200,000 words; cf. Riaño 2006) and almost 100,000 proper names from the Lexicon of Greek Personal Names and the Trismegistos repository of papyrological and epigraphic resources. All these data were processed to obtain morphological information. I have also manually entered several hundred (mostly irregular) pronominal forms into this list.
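The kind of combined search that step E enables can be sketched with a toy SQLite example. The schema, column names, and data below are hypothetical, invented for illustration; they are not the actual Polyphemus or Callimachus tables.

```python
import sqlite3

# Step E in miniature: lexical forms stored alongside papyrus metadata,
# so a lemma search can be filtered by Callimachus-style categories
# (date, origin, etc.). All names and data here are invented.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE papyri (id INTEGER PRIMARY KEY, date TEXT, origin TEXT);
CREATE TABLE forms  (form TEXT, lemma TEXT, pos TEXT, gloss TEXT,
                     papyrus_id INTEGER REFERENCES papyri(id));
INSERT INTO papyri VALUES (1, 'II CE', 'Oxyrhynchus');
INSERT INTO forms  VALUES ('λόγου', 'λόγος', 'n-s---mg-', 'word', 1);
""")

# Find all attestations of a lemma, restricted by provenance.
rows = con.execute("""
    SELECT f.form, f.pos, p.date
    FROM forms f JOIN papyri p ON f.papyrus_id = p.id
    WHERE f.lemma = ? AND p.origin = ?
""", ("λόγος", "Oxyrhynchus")).fetchall()
print(rows)  # [('λόγου', 'n-s---mg-', 'II CE')]
```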
To complete the list I processed the digital version of the Greek-English Lexicon of Liddell-Scott-Jones and extracted all the nominal lemmas; I then determined the declension of each one and declined each lemma in its Attic and Ionic forms by means of a program we have developed. We then search for each of these forms in the papyri. The program produces over 600,000 lexical forms (many of them already in the Morpheus list). The lemmas are then assigned a translation, or rather a gloss.
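In the spirit of that declension program, here is a deliberately tiny sketch that declines a second-declension masculine lemma in -ος. The paradigm table and function are illustrative only; the real program must handle many declension classes, accent shifts, and Attic versus Ionic endings.

```python
# Toy paradigm: second-declension masculine endings (no accent shifts).
SECOND_DECL_M = {
    "nom_sg": "ος", "gen_sg": "ου", "dat_sg": "ῳ", "acc_sg": "ον",
    "nom_pl": "οι", "gen_pl": "ων", "dat_pl": "οις", "acc_pl": "ους",
}

def decline(lemma):
    """Return {case_number: form} for a lemma in -ος (illustrative only)."""
    assert lemma.endswith("ος"), "sketch handles only -ος lemmas"
    stem = lemma[:-2]
    return {case: stem + ending for case, ending in SECOND_DECL_M.items()}

forms = decline("λόγος")
print(forms["gen_sg"])  # λόγου
```

Generating every paradigm cell for every nominal lemma in this way is what yields the 600,000-plus lexical forms mentioned above.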
Files
RIA_O_RUFILANCHAS_Daniel_Polyphemus__a_lexical_database_of_t.pdf
(84.2 kB)
Additional details
Related works
- Is part of
- Book: 10.5281/zenodo.7961822 (DOI)