ALBERTI, a Multilingual Domain Specific Language Model for Poetry Analysis

The computational analysis of poetry is limited by the scarcity of tools to automatically analyze and scan poems. In multilingual settings, the problem is exacerbated as scansion and rhyme systems only exist for individual languages, making comparative studies very challenging and time-consuming. In this work, we present Alberti, the first multilingual pre-trained large language model for poetry. Through domain-specific pre-training (DSP), we further trained multilingual BERT on a corpus of over 12 million verses from 12 languages. We evaluated its performance on two structural poetry tasks: Spanish stanza type classification, and metrical pattern prediction for Spanish, English, and German. In both cases, Alberti outperforms multilingual BERT and other transformer-based models of similar sizes, and even achieves state-of-the-art results for German when compared to rule-based systems, demonstrating the feasibility and effectiveness of DSP in the poetry domain.


Introduction
Poetry analysis is the process of examining the elements of a poem to understand its meaning. To analyze poetry, readers must examine its words and phrasing from the perspectives of rhythm, sound, imagery, obvious meaning, and implied meaning. Scansion, a common approach to analyzing metrical poetry, is the method or practice of determining and usually graphically representing the metrical pattern of a line of verse. It breaks down the anatomy of a poem by marking its metrical pattern, splitting each line of verse into feet and highlighting the stressed and unstressed syllables [1].
Having multilingual tools for scansion and analysis of poetic language enables large-scale examinations of poetry traditions, helping researchers identify patterns and trends that may not be apparent through an examination of a single tradition or language [2]. By using multilingual tools, scholars can compare and contrast different poetic forms, structures, and devices across languages and cultures, allowing them to uncover similarities and differences and gain a more comprehensive understanding of poetic expression.
SEPLN 2023: 39th International Conference of the Spanish Society for Natural Language Processing

However, the analysis of multilingual poetry presents significant challenges that must be overcome. It demands a deep understanding of diverse linguistic and cultural traditions, as each language brings its own unique poetic conventions and nuances. Researchers and scholars need expertise in multiple languages to navigate the intricacies of each tradition accurately. Additionally, translation and interpretation pose complex obstacles in multilingual poetry analysis. Figurative language, wordplay, and cultural references deeply rooted in the specific language and culture of a poem make it challenging to convey the intended meaning, emotional impact, and artistic integrity in translation. Cultural contexts, historical references, and subtle language connotations often get lost in translation, making it difficult to fully capture the essence of the original work.
Furthermore, the development of advanced computational tools is crucial for effective analysis and comparison of poetic expression across multiple languages.This requires the application of sophisticated machine learning techniques, natural language processing algorithms, and other emerging technologies.Building models that can accurately capture the unique aesthetic qualities, rhythm, rhyme, and stylistic variations in different languages is an ongoing research endeavor that requires continuous refinement and innovation.
In this work, we investigate whether domain-specific pre-training (DSP) [3] in a multilingual poetry setting can be leveraged to mitigate some of these issues. Specifically, we introduce Alberti, a multilingual encoder-only BERT-based language model suited for poetry analysis. We experimentally demonstrate that Alberti performs better than the base model it was built on, multilingual BERT [4], which was pre-trained on the 104 languages with the largest Wikipedias. By reformulating scansion and stanza identification as classification problems, we show that Alberti also outperforms its base model on these downstream tasks. Moreover, we release both Alberti and the dataset used to further train it, which consists of over 12 million verses in multiple languages.

Related Work
The transformer architecture [5] is now pervasive in natural language processing (NLP). In the last five years, context-aware language models have revolutionized the computational modeling of language.
In the humanities, domain-specific BERT-based models [4] trained with the goal of predicting masked words are starting to appear. In MacBERTh [6], the authors present diachronic models for pre-1950 English literature, and a new shared task on historical models for English, French, and Dutch took place last year [7]. While pre-training these large language models from scratch is often cost-prohibitive and extremely data-demanding, adjusting them to work on other domains and tasks via transfer learning requires less data and fewer resources. For poetry, computational approaches have focused primarily on generation [8, 9] and scansion [10, 11, 12, 13, 14, 15, 16, 17], but generally in a monolingual setting. While multilingual systems exist for metrical analysis, they internally work by having different sets of rules for each language [14] or by building ad-hoc neural networks [15]. To the best of our knowledge, the only attempt at multilinguality for metrical pattern prediction was introduced in [18] for English, German, and Spanish, where the authors jointly fine-tune different monolingual language models and document some cross-lingual transferability when using multilingual RoBERTa [19]. Inspired by their good results, in this work we build a domain-specific language model trained on a corpus of verses in 12 languages to explore its performance on tasks of a poetic nature.

Methods and Data
We leverage domain-specific pre-training techniques by fine-tuning the widely used multilingual BERT (mBERT) model, keeping the same base architecture and vocabulary, for our specific domain. We adopt the masked language modeling (MLM) objective and further train the model for 40 epochs on a large corpus consisting of 12 million verses sourced from various poetry anthologies. The training was conducted on a Google TPUv3 virtual machine with a batch size of 256, a learning rate of 1.25e-4, and a weight decay of 0.01. The maximum sequence length was set to 32, since verses with up to 32 tokens under the mBERT tokenizer make up almost 99 percent of the total. Furthermore, we used a 10,000-step warmup, which allowed the model to learn the distribution of the corpus gradually. We name the resulting model Alberti. After training, we evaluate the model on 10% of the corpus held out as a validation set, achieving a final global MLM accuracy of 57.77%.
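The MLM objective can be illustrated with a minimal sketch of BERT-style token masking in plain Python. The 15% selection rate and the 80/10/10 split follow the standard BERT recipe, which we assume this further pre-training inherits; the toy vocabulary is our own:

```python
import random

MASK = "[MASK]"
TOY_VOCAB = ["la", "luna", "el", "mar", "verso", "noche"]  # illustrative only

def mlm_mask(tokens, p=0.15, seed=0):
    """BERT-style MLM masking: select ~p of the tokens; of those,
    80% become [MASK], 10% a random token, 10% stay unchanged.
    Returns (inputs, labels); labels are None for unselected positions,
    which the loss ignores."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < p:
            labels.append(tok)  # the model must predict the original token
            r = rng.random()
            if r < 0.8:
                inputs.append(MASK)
            elif r < 0.9:
                inputs.append(rng.choice(TOY_VOCAB))
            else:
                inputs.append(tok)
        else:
            labels.append(None)  # position excluded from the loss
            inputs.append(tok)
    return inputs, labels
```

MLM accuracy, as reported above, is then simply the fraction of selected positions where the model's top prediction equals the label.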

PULPO
The training of the model was done over a new corpus we built for the occasion. The Prolific Unannotated Literary Poetry Corpus (PULPO) is a set of multilingual verses and stanzas with over 72 million words. It was created to tackle the needs of scholars interested in poetry from a machine learning perspective. Although poetry is a fundamental aspect of human expression that has been around for millennia, the study of poetry from a machine learning perspective is still in its infancy, largely due to the scarcity of poetic corpora. And while literary corpora are becoming more readily available, multilingual poetic corpora remain elusive. The lack of such corpora presents a major challenge for researchers interested in natural language processing (NLP) and machine learning (ML) applied to poetry. The PULPO corpus comprises over 12 million deduplicated metrical verses from 12 different languages in 3 scripts (see Tables 1 and 6). We chose these languages because of the large number of poems freely available on the Internet, either out of copyright or under a permissive license. The poems date from the 15th century to contemporary poetry, and a number of them also have stanza separations. This makes the corpus a valuable resource for multilingual NLP and machine learning research. In addition, the corpus includes poems from various historical periods and literary traditions, providing a diverse range of poetic styles and forms.
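The paper reports deduplicated verses but does not detail the procedure; a minimal sketch of one plausible normalization-based deduplication follows. The `dedup_key` normalization (lowercasing, diacritic and punctuation stripping) is our assumption, not PULPO's documented pipeline:

```python
import unicodedata

def dedup_key(verse):
    """Hypothetical normalization key: lowercase, strip diacritics
    and punctuation, collapse whitespace, so near-identical verses
    map to the same key."""
    nfkd = unicodedata.normalize("NFKD", verse.lower())
    no_marks = "".join(c for c in nfkd if not unicodedata.combining(c))
    kept = "".join(c for c in no_marks if c.isalnum() or c.isspace())
    return " ".join(kept.split())

def deduplicate(verses):
    """Keep the first verse seen for each normalization key."""
    seen, out = set(), []
    for v in verses:
        k = dedup_key(v)
        if k not in seen:
            seen.add(k)
            out.append(v)
    return out
```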

Stanzas
To further evaluate the performance of the model, we conduct extrinsic evaluations using two different tasks. First, a stanza-type classification task for Spanish poetry. This task aims to assess the ability of the model to distinguish between different stanza types, such as tercet, quatrain, and sestina (see Table 2 for an example).
Table 2.
Example of a stanza with its metrical length and rhyme scheme.
A stanza, which is considered the fundamental structural unit of a poem, serves to encapsulate themes or ideas [20]. Comprised of verses, stanzas are influenced by the writing styles and historical preferences of authors. The Spanish tradition boasts a rich abundance of stanza types, rendering their identification a challenging and intricate task. Generally, three factors contribute to the identification of a stanza: metrical length, rhyme type, and rhyme scheme [21, 22, 23, 24]. Consequently, the classification of stanzas can be approached in three stages [21]:
1. Calculation of the metrical length per verse. This typically involves counting the number of syllables while considering rhetorical devices that may alter the count (e.g., syneresis, synalepha). In some cases, the pattern formed by these verse lengths can assist in determining the stanza type.
2. Determination of the rhyme type. When the sounds after the final stressed syllable of each verse match, it is known as consonant rhyme. Alternatively, assonant rhyme involves the matching of vowel sounds while disregarding consonant sounds. There are stanza types, however, where this distinction is irrelevant.
3. Extraction of the rhyme scheme. The rhyme scheme is established based on the verses that share a rhyme.
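The third stage can be sketched as follows. This toy implementation keys assonance on the vowels of each verse's final word only; a real system would first locate the last stressed syllable (stage 1) and distinguish consonant from assonant rhyme (stage 2):

```python
def assonance_key(verse):
    """Crude assonance key: the vowels of the final word, ignoring
    consonants. A full implementation would start from the last
    stressed syllable instead of the whole word."""
    word = verse.lower().split()[-1]
    return "".join(ch for ch in word if ch in "aeiouáéíóú")

def rhyme_scheme(verses):
    """Assign letters a, b, c, ... so that verses sharing an
    assonance key get the same letter."""
    scheme, seen = [], {}
    for v in verses:
        key = assonance_key(v)
        if key not in seen:
            seen[key] = chr(ord("a") + len(seen))
        scheme.append(seen[key])
    return "".join(scheme)
```

The resulting scheme string (e.g., "abba" for a quatrain with enclosed rhyme) is one of the features a stanza classifier can rely on.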
Following [25], we approached stanza type identification as a classification task. We used their 5,005 Spanish stanzas, containing between 12 and 170 examples for each of the 45 different stanza types, and kept the existing splits of 80% for training, 10% for validation, and 10% for testing.

Scansion
Second, a multilingual scansion task aimed at testing the ability of the model to predict the metrical pattern of a given verse in different languages. Scanning a verse relies on assigning stress correctly to the syllables of its words, a process that can be influenced by rhetorical figures and individual traditions. The synalepha is a common device in Spanish, English, and German poetry, which combines separate phonological groups into a single unit for metrical purposes. Syneresis and dieresis are two other devices that operate similarly but within a word, either joining or splitting syllables. The meter of a verse can be seen as a sequence of stressed and unstressed syllables, represented by the symbols '+' and '−', respectively. Examples 1, 2, and 3 from [18] illustrate verses with metrical lengths of 8, 10, and 7 syllables in Spanish, English, and German, respectively. These examples also demonstrate the resulting metrical pattern after applying (or breaking, as in the case of 'la-her' in the Spanish verse) synalepha, represented by '<', and considering the stress of the last word, as it may affect the metrical length in Spanish poetry. To measure the performance of Alberti, we follow the experimental design in [18] and use their chosen datasets of verses manually annotated with syllabic stress for English, German, and Spanish. For Spanish, the Corpus de Sonetos de Siglo de Oro [26] was used. This TEI-XML-annotated corpus consists of hendecasyllabic verses from Golden Age Spanish authors. A subset of 100 poems initially used for evaluating the ADSO Scansion system [27] was selected for testing, while the remaining poems were split for training and evaluation. Unfortunately, suitable annotated corpora of comparable scale were not found for English and German. Instead, an annotated corpus of 103 poems from For Better For Verse [28] was used for English, and a manually annotated corpus from [29, 30] was used for German. The German corpus contains 158 poems covering the period from 1575 to 1936; around 1,200 of its lines have been annotated in terms of syllable stress, foot boundaries, caesuras, and line main accent. These corpora were divided into train, evaluation, and test sets following a 70-15-15 split. Table 3 shows the number of verses per language and split.
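Under this formulation, a metrical pattern such as '+--+-' becomes a vector of per-syllable binary labels. A sketch of the encoding, plus a verse-level exact-match accuracy (our reading of how scansion accuracy is typically scored, not a claim about the exact metric in [18]):

```python
def pattern_to_labels(pattern):
    """'+' (stressed) -> 1, '-' (unstressed) -> 0, one label per syllable."""
    return [1 if s == "+" else 0 for s in pattern]

def labels_to_pattern(labels):
    """Inverse mapping, turning model outputs back into a pattern string."""
    return "".join("+" if l else "-" for l in labels)

def exact_match_accuracy(gold_patterns, predicted_patterns):
    """Verse-level accuracy: a prediction counts only if the entire
    pattern matches the gold annotation."""
    hits = sum(g == p for g, p in zip(gold_patterns, predicted_patterns))
    return hits / len(gold_patterns)
```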

Evaluation and Results
After training, we evaluated the resulting model, Alberti, on several fronts. For intrinsic evaluation, we used the aforementioned MLM accuracy as well as a perplexity proxy score based on the predicted token probabilities. We calculated these metrics for every language on the validation set of PULPO for both Alberti and mBERT. As shown in Figure 1, the MLM accuracy of Alberti is higher than that of mBERT for all languages, with gains ranging from +19.65 percentage points for Portuguese to +40.59 for Finnish. A similar trend is shown for our perplexity proxy score in Figure 2, with clear gains of Alberti over mBERT across the board, ranging from −35.75 for French to a staggering −739.16 points for Chinese. The stark difference for Chinese could be a result of differences in the way text in that language is represented in the pre-training corpus of mBERT and in PULPO.
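The exact proxy is not specified in detail; a common choice for masked models, which we assume here for illustration, is a pseudo-perplexity computed from the probabilities the model assigns to each masked gold token:

```python
import math

def perplexity_proxy(gold_token_probs):
    """Pseudo-perplexity: exponential of the mean negative log-probability
    that the model assigns to each (masked) gold token. Lower is better;
    a model that is always certain of the gold token scores 1.0."""
    n = len(gold_token_probs)
    nll = -sum(math.log(p) for p in gold_token_probs) / n
    return math.exp(nll)
```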
For extrinsic evaluation, we also compared Alberti against mBERT on stanza classification and metrical pattern prediction. We chose the best performing models on the validation set over a small grid search of learning rates (10^-5, 3 × 10^-5, and 5 × 10^-5), epochs (3, 5, and 10), and warmup (0 and 10% of the steps). Figure 3 shows the ROC curves of each stanza type versus the rest for both Alberti and mBERT, with higher areas under the curve (AUC) in 29 out of the 45 stanza types for Alberti, and in 16 out of 45 for mBERT. Table 4 shows macro F1 and accuracy scores for each model, with Alberti outperforming mBERT by a small margin. Interestingly, our baseline fine-tuned mBERT model scores better than the monolingual Spanish BETO [31] reported in [25]. Nonetheless, the combination of the rule-based system Rantanplan [17] with an expert system remains the state of the art for stanza classification. The prediction of meter was approached as a multiclass binary classification task, i.e., one class per syllable, where each syllable can be stressed (strong) or unstressed (weak). After a grid search with roughly the same hyperparameters as in [18], Alberti outperforms mBERT for every language, as shown in Table 5. When compared to other similarly sized models (English RoBERTa [19] and multilingual XLM-RoBERTa [32]) as reported in [18], it still performs better for English and German. Lastly, Alberti achieves a new state of the art for German, performing better than both the large version of XLM-RoBERTa and the rule-based system Metricalizer [33].
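The one-vs-rest AUC values behind Figure 3 can be computed without plotting via the Mann-Whitney formulation, which is equivalent to the area under the ROC curve. A self-contained sketch:

```python
def roc_auc(labels, scores):
    """One-vs-rest AUC: the probability that a randomly chosen positive
    example receives a higher score than a randomly chosen negative one,
    with ties counted as half a win (Mann-Whitney statistic)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For a 45-way classifier, each stanza type's curve is obtained by treating that class as positive (label 1) and all other classes as negative (label 0), scoring with the model's probability for the positive class.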

Conclusions and Further Work
In this work, we hope to make a significant contribution to the fields of Digital Humanities and NLP by introducing Alberti, a multilingual language model built through domain-specific pre-training for poetry. The evaluation of the model on intrinsic and extrinsic metrics highlights its potential for practical applications in tasks such as stanza-type identification and scansion in a multilingual setting.
The release of our model and accompanying corpus will provide an important resource for researchers in the field, facilitating further investigation into poetry-related tasks. We plan to train Alberti at the stanza level and compare its performance against the current verse-based model, an exciting avenue for future research that could improve the ability of the model to capture the meaning and structure of poetry in a more sophisticated way. Given the good results obtained by Alberti despite being built on an arguably outdated base model, future iterations will leverage more powerful and larger pre-trained models, thereby enhancing its performance and versatility.
Moreover, we believe that the strong accuracy of Alberti on the masked language prediction task could pave the way for methods analyzing metaphoric language by leveraging the differences between the predictions of Alberti and those of other models trained on more journalistic or encyclopedic data.
Overall, the results of this study have the potential to significantly advance our understanding of poetry in various languages and contribute to the development of more sophisticated NLP models that can capture the subtleties of poetic language.We hope that our work will inspire further research and innovation in this field, and we look forward to seeing how our model and corpus will be used in future studies.

Figure 3.
True positive rate (TPR) against false positive rate (FPR) of the receiver operating characteristic (ROC) curves and their corresponding areas (AUC) for the classification of each stanza type versus the rest after fine-tuning Alberti (blue) and mBERT (red). Best AUC score in bold.

Table 1.
Number of deduplicated verses and their words per language in PULPO.

Table 3.
Number of verses for each language in the metrical pattern prediction datasets.

Table 4.
F1 scores on stanza classification. Best neural model scores in bold. Rule-based systems italicized.

Table 5.
Accuracy on metrical pattern prediction. Best neural model scores in bold. Rule-based systems italicized.