Multifunctional Digital Corpus of Czech Novels
Description
Slide 1: Dear colleagues, let me present here the project of a multidimensional digital corpus of Czech prose, which primarily focuses on literary narratives that significantly represent Prague's spatial, topographical landscape. The purpose of this project is to functionally link different methods of digital processing and representation of literary data: specifically, literary-cartographic models, quantitative models focused on the analysis of narrative and its segments, and last but not least, basic tools in the mining of a linguistic corpus.
Slide 2: So the corpus is built on three basic pillars, as you can see on this slide. The first pillar includes literary-cartographic maps of Prague's fictional topography and other tools related to this topic, some of which I will introduce during the presentation. The second pillar consists of selected quantitative methods that focus on the analysis of text segments (I will show them in the presentation), narrative rhythm, and emotional analysis. The third pillar includes features designed to search the language corpus.
Slide 3: As you can see on this slide, the corpus is structured as follows:
a) Its components are the so-called author corpora; specifically, our goal is to sequentially process all 19th-21st century authors in Czech literature who significantly thematize the Prague plein air in their prose.
b) Each author corpus is processed with respect to the three basic areas mentioned above: literary-cartographic maps and quantitative models are developed, together with a database for corpus search.
As you can see, within each of the three main pillars the corpus offers a number of specific functionalities for searching and mining it. Fictional topography maps of Prague are available...
Slide 4: On the homepage of the corpus, whose official name is Literary Cartographic and Quantitative Models of Czech Novels from the 19th to 21st Century, in addition to the main menu and project description, there is basic statistical information about its composition. Here we can see how individual authors are represented in the corpus in terms of number of tokens, lemmas, as well as percentage. There is also information about the number of prose sentences for each author, the number of prose sentences for each year, and a table about the size of the corpus in number of tokens, number of lemmas, number of sentences, number of authors or texts.
Slide 5: The main component is of course the corpus itself and its tools. Clicking on it, we see a list of authors currently available in the corpus, with varying degrees of processing. When we click on any author, a menu is available where we can find the specific tools for searching the corpus that have already been discussed here.
Slide 6: I will demonstrate their real-world form using the example of Jakub Arbes. When you click on his name, you will see a menu that is the same for all authors.
Slide 7: Here we see the literary-cartographic maps. For each of Arbes's works, a separate map is prepared, using a period-historical map base. Plot locations or character movements are plotted into the blind map with respect to how these locations are referred to in the text (referred to by the narrator or character). Places are also colour-coded according to their type (places-in-place, places-out-of-place, parallel places) and reference (type of narrator, character). Using these models, we can trace how the structure of Prague's fictional topography evolved or changed over time in Arbes's selected prose works.
In addition to maps of fictional topography, we find in this section maps of the density of individual locations, maps of the movement of characters through the environment, quantitative models of toponyms, GIS models, etc.
Slide 8: The next segment of the corpus consists of quantitative models of text segments. As you can see in the slide, there are graphs showing the average sentence lengths for each text segment and the variance of these segments, as well as models showing the quantitative usage of each text segment over time, and models of narrative rhythms.
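To make the idea of per-segment sentence statistics concrete, here is a minimal sketch in Python. It is not the corpus's own implementation: the segment labels (`narrator`, `character`), the toy sentences, and the whitespace tokenisation are all assumptions chosen for illustration.

```python
from statistics import mean, pvariance

def sentence_lengths(sentences):
    """Return the length of each sentence in whitespace-separated tokens."""
    return [len(s.split()) for s in sentences]

def segment_stats(segments):
    """Map each segment label to (mean sentence length, population variance)."""
    stats = {}
    for label, sentences in segments.items():
        lengths = sentence_lengths(sentences)
        stats[label] = (mean(lengths), pvariance(lengths))
    return stats

# Hypothetical toy data; labels and sentences are illustrative only.
segments = {
    "narrator": ["The old house stood by the river .", "It was dark ."],
    "character": ["Who is there ?", "Come in ."],
}
stats = segment_stats(segments)
for label, (avg, var) in stats.items():
    print(f"{label}: mean={avg}, variance={var}")
```

A real pipeline would use a proper Czech tokeniser and sentence splitter; the point here is only that each segment type yields one mean and one variance that can then be plotted.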
Slide 9: Other graphs, such as those focusing on Emotional Analysis, are an integral component of the corpus. Two types of graphs are used here. The first is based on a binary classification of emotionality (positive/negative), the second is based on a special lexicon of emotionality based on the Thematic Thesaurus of Czech by Aleš Klégr. We can also see the course of emotionality in individual texts on the axis between positive, negative and neutral emotionality, as well as a graph showing the distances between texts based on emotionality.
Slide 10: As mentioned in the introduction, the last part of the corpus consists of the tools for its linguistic mining (concordance, collocation, frequency search of lemmas and forms, percentage display of word types, and a number of other tools). On this slide you can see the forms for concordance search. As an example, I typed the word "house" and after a while we see the results of a contextual search.
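For readers unfamiliar with concordance search, the following is a rough keyword-in-context (KWIC) sketch in Python. The function name `concordance`, the `window` parameter, and the sample sentence are hypothetical; the sketch matches word forms only, whereas the corpus itself may well search by lemma.

```python
def concordance(tokens, keyword, window=3):
    """Return (left context, hit, right context) for each occurrence of keyword."""
    hits = []
    target = keyword.lower()
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            # Take up to `window` tokens on each side of the hit.
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

# Hypothetical sample text, split on whitespace for simplicity.
tokens = "He left the house at dawn and the house fell silent".split()
rows = concordance(tokens, "house", window=2)
for left, hit, right in rows:
    print(f"{left:>12} [{hit}] {right}")
```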
Slide 11: Similarly, you can search for the closest lexical contexts for a given word, filtered by selected association measures.
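As an illustration of what an association measure does, the sketch below ranks the collocates of a node word by pointwise mutual information (PMI), one commonly used measure. Which measures the corpus actually offers is not stated here, so the choice of PMI, the function name `pmi_collocates`, and the toy text are assumptions.

```python
import math
from collections import Counter

def pmi_collocates(tokens, node, window=2, min_count=1):
    """Rank words co-occurring with `node` by PMI = log2(O * N / (f_node * f_word))."""
    n = len(tokens)
    freq = Counter(tokens)
    co = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo = max(0, i - window)
            hi = min(n, i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    co[tokens[j]] += 1
    scores = {}
    for word, observed in co.items():
        if observed >= min_count:
            scores[word] = math.log2(observed * n / (freq[node] * freq[word]))
    # Highest association first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy text; a real query would run over lemmatised corpus data.
tokens = "the old house and the river and the hill by the old house".split()
ranked = pmi_collocates(tokens, "house", window=1)
for word, score in ranked:
    print(f"{word}\t{score:.2f}")
```

PMI tends to favour rare words, which is one reason corpus tools usually let the user switch between several measures and set a minimum frequency (here the hypothetical `min_count`).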
Slide 12: The corpus also has a number of other functionalities that cannot be presented in detail here due to time constraints. Finally, I would at least like to state what the main goal of this project is. As the above examples have shown, its purpose is to functionally link literary and linguistic research. The corpus is suitable for both literary scholars and linguists, but above all it offers tools and methods of data processing that are mutually reinforcing. For example, this can be seen in the integration of literary-cartographic maps and quantitative models of text segments. Word clouds, in turn, show the most frequent autosemantic words, which, thanks to concordances or collocations, can be supplemented with their contexts.
The future of the project naturally lies in the addition of text/language data and other relevant functionalities.
Files
(82.7 MB)
| Name | Size |
|---|---|
| md5:268f95a3c82bc101ee3d28b102290a59 | 82.7 MB |
Additional details
Dates
- Accepted: 2025