Published February 9, 2019 | Version v1
Thesis Open

Creation and Analysis of the Yugoslav Rock Song Lyrics Corpus from 1945 to 2003

  • 1. University of Belgrade


The thesis examines, from both a theoretical and a practical perspective, the creation and processing of a corpus of rock song lyrics originating from the former Yugoslavia in the period 1945-2003. The lyrics were obtained from the LyricWiki website by web scraping with the Python library lyricsmaster. The collected texts were then merged into a single XML file and automatically annotated with the yattag Python tool. The data were subsequently preprocessed at the formal and content levels. The XML document was then transformed into XHTML with an XSLT processor in order to generate basic corpus data. Diacritic restoration was also automated, using the “Slovo Majstor” application and the morphological electronic dictionaries of Serbian in the LeXimir software package. Text mining comprised retrieving socio-political and patriotic topics with the NLTK library in Python, while romantic and other topics were visualized with the TreeCloud and WordItOut software. The similarity between the authors represented in the corpus was measured with the stylo package for R. Finally, the thesis gives an overview of today’s most relevant programming libraries in the field of natural language processing, which also serves as a guideline for future work.
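
As a rough illustration of two steps named in the abstract (merging song texts into one XML document, then counting topic-related words), the following minimal sketch may help. It is not the thesis's actual code: it uses Python's standard-library xml.etree.ElementTree and collections.Counter in place of yattag and NLTK, and the element names (corpus, song, line), the sample songs, and the keyword list are all assumptions made up for the example.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample records; the real corpus texts are not reproduced here.
SAMPLE_SONGS = [
    {"artist": "Artist A", "title": "Song One", "year": "1981",
     "lyrics": "zemlja moja zemlja\nljubav i sloboda"},
    {"artist": "Artist B", "title": "Song Two", "year": "1990",
     "lyrics": "ljubav ljubav\nnoc bez tebe"},
]


def build_corpus(songs):
    """Merge song records into a single <corpus> element tree."""
    root = ET.Element("corpus")
    for song in songs:
        # Metadata goes into attributes (an assumed layout).
        node = ET.SubElement(
            root, "song",
            artist=song["artist"], title=song["title"], year=song["year"],
        )
        # Each lyric line becomes its own <line> element.
        for line in song["lyrics"].splitlines():
            ET.SubElement(node, "line").text = line
    return root


def keyword_counts(root, keywords):
    """Count how often each keyword occurs in all <line> elements."""
    wanted = {k.lower() for k in keywords}
    counts = Counter()
    for line in root.iter("line"):
        for token in re.findall(r"\w+", (line.text or "").lower()):
            if token in wanted:
                counts[token] += 1
    return counts


corpus = build_corpus(SAMPLE_SONGS)
xml_text = ET.tostring(corpus, encoding="unicode")
topic_counts = keyword_counts(corpus, ["zemlja", "sloboda", "ljubav"])
```

Here `xml_text` holds the merged document as a string and `topic_counts` maps each keyword to its frequency. A real pipeline would presumably add tokenization and lemmatization suited to Serbian before counting.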


Création et analyse du corpus de textes de chansons rock yougoslaves de 1945 à 2003.pdf