Published September 21, 2021 | Version v3
Dataset Open

Dataset of limericks for computational poetics

  • 1. Dartmouth College
  • 2. Indiana University
  • 3. University of Connecticut


This dataset comprises 98k limericks scraped from The Omnificent English Dictionary In Limerick Form (OEDILF). It is a subset of the full collection, filtered to pass a basic test of standard limerick form (five lines, no emojis, no symbols). Each limerick was written by a human contributor and has passed through rigorous moderation. The dataset is released alongside two companion papers: "BPoMP: The Benchmark of Poetic Minimal Pairs – Limericks, Rhyme, and Narrative Coherence" (Abdibayev, Riddell, Rockmore, RANLP 2021) and "Automating the Detection of Poetic Features: The Limerick as Model Organism" (Abdibayev, Riddell, Igarashi, Rockmore, SIGHUM 2021).

The dataset is intended primarily for NLP researchers interested in the formal structure of poetry and, more generally, in computational poetics. Each limerick is accompanied by metadata: author information, its id within the website, and an "is_limerick" field, which records whether the limerick was recognized by our custom filter built to check for formal limerick properties (this tagging was a goal of the SIGHUM paper and reflects the results reported there; see the paper for details). Since every entry is a moderated limerick, "is_limerick"=True marks a true positive, while "is_limerick"=False is (almost surely) a false negative. Our filter identifies 70% of the entries as limericks, and we provide the tagging as a benchmark for the community to improve upon.

With these considerations in mind, we hope the NLP community will use this dataset to study the poetic knowledge of language models trained on large corpora, as many of their properties remain a mystery to the community at large. We are excited for the possibilities ahead!
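As a rough illustration of how the "is_limerick" flag and the five-line form test might be used, here is a minimal Python sketch. It assumes the records have already been loaded as a list of dictionaries. Only the "is_limerick" field name comes from the description above; the "limerick" text field, the helper names, and the placeholder records are hypothetical and stand in for real data.

```python
from typing import Iterable, Mapping


def has_five_lines(text: str) -> bool:
    """Basic form check: the text has exactly five non-empty lines."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return len(lines) == 5


def detected_share(records: Iterable[Mapping]) -> float:
    """Fraction of records whose "is_limerick" flag is True."""
    flags = [bool(r["is_limerick"]) for r in records]
    if not flags:
        raise ValueError("no records to summarize")
    return sum(flags) / len(flags)


# Placeholder records, NOT real dataset entries.
# The "limerick" key is an assumed name for the text field.
sample = [
    {"limerick": "a\nb\nc\nd\ne", "is_limerick": True},
    {"limerick": "a\nb\nc\nd\ne", "is_limerick": False},
]

share = detected_share(sample)
all_five = all(has_five_lines(r["limerick"]) for r in sample)
```

On the real data, `detected_share` would give the detection rate of the provided tagging, which is the baseline a better detector would need to exceed.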

UPDATE: We have released a new version of the dataset (v3) that contains all of the limericks we planned to publish. The previous version (v2) was created using code that contained a bug, which lowered the number of available limericks.



