Mapping out the Torlak verb

Milosavljević, Stefan; Babić, Anđela; Simonović, Marko

doi:10.5281/zenodo.15736920

Published June 25, 2025 | Version v1

Dataset Open

1. University of Graz
2. Balkanološki institut Srpske akademije nauka i umetnosti

Stefan Milosavljević (University of Graz)

Anđela Babić (Institute for Balkan Studies, Serbian Academy of Sciences and Arts)

Marko Simonović (University of Graz)

The present database is a deliverable of the project “What’s in a verb? Mapping Serbian verbs borrowed into Romani”, conducted at the Department of Slavic Studies of the University of Graz and the Institute for Balkan Studies, Serbian Academy of Sciences and Arts, and funded by the Ministry of Education, Science, and Research of the Republic of Austria in cooperation with the Serbian Ministry of Science, Technological Development, and Innovation. It is part of the Scientific and Technological Cooperation Program between the Republic of Austria and the Republic of Serbia for the period 2024–2026.

The first deliverable of the project, a database of Romani verbs borrowed from BCMS, has already been published as Mirić et al. (2025). The present database focuses on the properties of Torlak verbs. While many of the properties included are those that proved particularly instrumental in accounting for borrowing patterns into Romani — such as theme vowel classes, the presence of a derivational suffix, and so on — none of them is relevant only for borrowing. The included annotations will be generally useful for researchers working on the Torlak verbal system.

The main dataset

The main dataset consists of 3085 verbs extracted from the Spoken Torlak dialect corpus 1.0 (Vuković, 2020). The corpus was accessed via the following link: https://www.clarin.si/ske/#dashboard?corpname=torlak

As indicated in Vuković (2020), the corpus consists of semi-orthographic transcripts of 86.5 hours of recordings from locations evenly distributed throughout the Timok area of the Torlak dialect zone (see Vuković 2021 for a more detailed description of this corpus).

The procedure for obtaining the verbs was as follows. We extracted all the verbal lemmas from the corpus and then annotated the 5,000 most frequent ones. Two native-speaker annotators (the first two authors) first evaluated the full set to identify actual verbs, i.e., to eliminate any incorrectly marked entries due to typos, erroneous lemmatizations, and similar issues. Following this procedure, the dataset comprises a total of 3085 verbs. The resulting dataset was then annotated for the following properties:

the third-person present tense form; e.g., uzne for the lemma uzeti ‘take’;
the participle (feminine singular) form; e.g., uzela for the lemma uzeti ‘take’;
theme vowel class; e.g., </, ne> for the lemma uzeti ‘take’;
suffix (whether a given verb contains a derivational suffix); e.g., 0 for the lemma uzeti ‘take’, 1 for the lemma izbegavati ‘avoid’, which contains the secondary imperfectivising suffix -av;
root allomorphy (whether or not it is exhibited between the present tense and the participle form); e.g., 1 for the presence of root allomorphy, as in the lemma uzeti ‘take’, and 0 for the absence of root allomorphy, as in the lemma čitati ‘read’.

In annotating these properties, we generally followed the instructions developed for the database WeSoSlaV, as described in Arsenijević et al. (2024) and Arsenijević et al. (2025), but we introduced some modifications. The following notes on the annotated properties are in order.

Inflected forms. The third-person present tense forms and the participle forms are based on their attestation in the Torlak corpus. In cases where these forms were not attested, they were reconstructed by the native speaker annotators. Since Torlak does not have an infinitive form and since no single form suffices to identify the conjugation pattern of the verb, in the remainder of the document we use the combination of the participle and the third-person present tense to identify Torlak verbs. So for the original lemma uzeti ‘take’, based on the standard BCMS infinitive, we will use <uzela, uzne>.

Theme-vowel classes. Like in WeSoSlaV, the theme-vowel classes, which identify the conjugation pattern, are annotated as pairs, where the first theme vowel originates from a non-finite form, and the second one from the present-tense form. Since there is no infinitive form in Torlak, the first TV within a pair is based on the participial form.

Root allomorphy. Root allomorphy is defined as any unpredictable difference in the exponence of the root morpheme (i.e., the material preceding the TV) between the non-finite and the finite form of the verb. For instance, the verb <raznela, raznese> ‘blow up’ exhibits root allomorphy, displayed by the contrast between ne and nes. Our definition excludes instances of predictable, phonologically conditioned allomorphy (such as <pisala, piše> ‘write’). This approach is based on Arsenijević et al. (2024).

Suffixes. In annotating this column, we departed considerably from WeSoSlaV, where three classes of suffixes or suffix-like morphemes were annotated. We only annotated whether the verb contains a derivational suffix, assigning the value 1 if there is an overt suffix, and 0 otherwise. Informed by our preliminary research, we aimed to annotate as containing derivational suffixes only those verbs that exhibit overt suffixal material realised by dedicated segments. This means that verbs derived by ablaut (e.g., <sklapala, sklapa>, a secondary imperfective from <sklopila, sklopi> ‘put together’) or consonant mutation (e.g., <izmišljala, izmišlja>, a secondary imperfective from <izmislila, izmisli> ‘make up’) were not counted as containing a derivational suffix. We also counted as derivational only those affixes that are never part of the theme vowel material. For this reason, we did not consider the suffix n-, which precedes the theme vowels <u, e>, as a derivational affix, since arguably the same element is analysed as a part of a theme vowel in the TV class </, ne>. On the other hand, the suffix k- in <na-sec-k-a-la, na-sec-k-a> ‘cut’ is counted as a derivational affix, as it is never part of a theme vowel.

The columns contained in the dataset are summarized in Table 1.

Table 1. Annotated columns in the main dataset

Column	Explanation
ID	The numeric identifier of the verb in the dataset.
Lemma	Lemma as given in the Spoken Torlak dialect corpus 1.0 — even if it does not accurately reflect the actual verb. This column is not analytically meaningful; its sole purpose is to facilitate linking this dataset with the Spoken Torlak dialect corpus 1.0.
Frequency	The frequency retrieved from the Spoken Torlak dialect corpus 1.0.
PRS.3SG	The third-person singular present tense form of the given verb. E.g., trči ‘runs’.
PTCP.F	The feminine singular active participle form of the verb. E.g., trčala ‘run’.
TV class	The pair of theme vowels based on ᴘᴛᴄᴘ.ꜰ and ᴘʀs.𝟥sɢ, e.g. <a, i> for the verb <trčala, trči> ‘run’.
Suffix	If the verb contains a suffix, this is marked as 1; 0 otherwise.
Root allomorphy	If the verb exhibits root allomorphy, this is marked as 1; 0 otherwise.
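
As an illustration of how the columns in Table 1 might be used, the following minimal Python sketch reads a small inline stand-in for 1.Main_dataset.csv and builds the <participle, present> citation pairs used in this description. The header names and the comma delimiter are assumptions based on Table 1 and may differ from the published file; the helper names citation_form and load_rows are hypothetical. The single sample row contains only the values given above for the lemma uzeti.

```python
import csv
import io

# Inline stand-in for 1.Main_dataset.csv. The header follows Table 1 (an
# assumption about the real file). The row uses only the values that the
# description gives for the lemma uzeti; ID and Frequency are omitted
# because the description does not state them.
SAMPLE = '''Lemma,PRS.3SG,PTCP.F,TV class,Suffix,Root allomorphy
uzeti,uzne,uzela,"/, ne",0,1
'''


def citation_form(row):
    """Return the <participle, present> pair used to cite Torlak verbs."""
    return f"<{row['PTCP.F']}, {row['PRS.3SG']}>"


def load_rows(handle):
    """Read rows and convert the two binary annotations to booleans."""
    rows = []
    for row in csv.DictReader(handle):
        row["Suffix"] = row["Suffix"] == "1"
        row["Root allomorphy"] = row["Root allomorphy"] == "1"
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE))

# Citation pairs of all verbs annotated as showing root allomorphy.
allomorphic = [citation_form(r) for r in rows if r["Root allomorphy"]]
print(allomorphic)
```

To work with the published file, the io.StringIO object would be replaced by an opened file handle, after checking the actual header and delimiter.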

Complementary datasets

In this subsection, we describe four complementary datasets that were used to attest potential variation in behavior of different theme-vowel classes across finite-verb environments, specifically, in the third-person plural present tense, aorist, and imperfectum.

PRESENT TENSE: THIRD-PERSON PLURAL datasets

Since the third-person plural forms of the present tense have no formal properties in common, in order to obtain an initial sample, we used a regular expression that targeted some common plural subject forms followed by a verb. Specifically, we used the query:

From the obtained sample of 692 concordances, we then manually extracted those that contained verbs in the ᴘʀs.𝟥ᴘʟ form. This resulted in a total of 323 verb tokens, which were further annotated.

In addition to the columns identifying examples (ID, the reference of the example in the Torlak corpus, left and right contexts, KWIC), the dataset is annotated for the following properties:

whether the concordance includes a ᴘʀs.𝟥ᴘʟ form;
the ᴘᴛᴄᴘ.ꜰ form;
the TV class;
how the ᴘʀs.𝟥ᴘʟ can be derived from the respective ᴘʀs.𝟥sɢ form.

The columns included are described in Table 2.

Table 2. Annotated columns in the ᴘʀs.𝟥ᴘʟ dataset

Column	Explanation
ID	The example identifier
Left	Left context
KWIC	KWIC
Right	Right context
Reference	The reference in the Spoken Torlak dialect corpus 1.0 (Vuković 2020).
PTCP.F	The feminine singular active participle form of the verb.
TV_class	The theme vowel class based on the participle and the present tense.
From_PRS.3SG	This column contains information on how the ᴘʀs.𝟥ᴘʟ can be derived from the respective ᴘʀs.𝟥ꜱɢ form. For instance, the form zapale ‘they set fire’ can be derived from the singular correspondent zapali ‘s/he sets fire’ by removing the final i and adding an e. This is captured by the marking “- i + e”. In cases where the ᴘʀs.𝟥ᴘʟ can be derived from the respective ᴘʀs.𝟥ꜱɢ form just by adding extra material, this material is entered without a “+” in order to avoid the interpretation as a formula. For instance, the form uzimau ‘they take’ can be derived from the singular correspondent uzima ‘s/he takes’ by adding an u. This is captured by the marking “u”.
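
The From_PRS.3SG markings can be read as small rewrite rules over the singular form. The Python sketch below is one possible interpretation, not part of the dataset: the function name apply_marking is hypothetical, and the parsing assumes that a marking either starts with "-" (remove a final string, then optionally add material after "+") or consists only of the material to be appended.

```python
def apply_marking(prs_3sg: str, marking: str) -> str:
    """Derive a PRS.3PL form from a PRS.3SG form and a From_PRS.3SG marking.

    Two marking shapes are assumed:
      "- i + e"  remove final "i", then add "e"
      "u"        add "u" to the unchanged singular form
    """
    marking = marking.strip()
    if marking.startswith("-"):
        # Split "- i + e" into the removed part and the added part.
        removed, _, added = marking[1:].partition("+")
        removed, added = removed.strip(), added.strip()
        if not prs_3sg.endswith(removed):
            raise ValueError(f"{prs_3sg!r} does not end in {removed!r}")
        stem = prs_3sg[: len(prs_3sg) - len(removed)]
        return stem + added
    # No leading "-": the marking is just the extra material.
    return prs_3sg + marking


print(apply_marking("zapali", "- i + e"))  # zapale
print(apply_marking("uzima", "u"))         # uzimau
```

Whitespace around "-" and "+" is ignored, so markings written with or without spaces are treated alike.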

Our preliminary analysis has shown that variation within TV classes is restricted to a single class: <a, a>, where three different patterns are attested:

"ju", e.g., primaju ‘they receive’ (cf. prima ‘s/he receives’)
"u", e.g., uzimau ‘they take’ (cf. uzima ‘s/he takes’)
"-a + u", e.g., igru ‘they play’ (cf. igra ‘s/he plays’).

In the original sample, "ju" is by far the most common pattern, occurring in 36 out of 46 tokens (about 78%). To determine whether specific verbs tend to specialise in taking particular ᴘʀs.𝟥ᴘʟ endings, we conducted targeted searches for the eight verbs that appeared in the initial sample with the two less frequent patterns.

For instance, for the verb <uzimala, uzima> the query was

[word="(ú|u)z(í|i)m(áju|ajú|aju|áu|aú|au|u|ú)"]

These searches yielded 228 concordances, which are presented in a separate document and annotated for the ᴘʀs.𝟥ᴘʟ ending.
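
To make explicit which forms the query for <uzimala, uzima> targets, the following Python sketch applies the regular expression from the query to individual word forms and maps the matched ending onto the three ᴘʀs.𝟥ᴘʟ patterns. It assumes that the CQL pattern has to match the whole token (hence fullmatch) and that stress is marked with an acute accent; the helper names strip_stress and classify are hypothetical and are not part of the dataset.

```python
import re
import unicodedata

# Regular-expression part of the CQL query for <uzimala, uzima>.
QUERY_REGEX = "(ú|u)z(í|i)m(áju|ajú|aju|áu|aú|au|u|ú)"
PATTERN = re.compile(unicodedata.normalize("NFC", QUERY_REGEX))

ACUTE = "\u0301"  # combining acute accent, assumed to mark stress

# Stress-free ending -> From_PRS.3SG marking relative to uzima.
ENDING_TO_MARKING = {"aju": "ju", "au": "u", "u": "-a + u"}


def strip_stress(text):
    """Remove acute accents while leaving other diacritics intact."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace(ACUTE, ""))


def classify(word):
    """Return the marking for a form matched by the query, else None."""
    match = PATTERN.fullmatch(unicodedata.normalize("NFC", word))
    if match is None:
        return None
    return ENDING_TO_MARKING[strip_stress(match.group(3))]


for form in ["uzimaju", "uzimáu", "uzimu", "uzima"]:
    print(form, classify(form))
```

The form uzima is included only to show that the singular is not matched; the other forms illustrate what the query covers and are not claims about attested frequencies.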

Table 3. Annotated columns for the additional ᴘʀs.𝟥ᴘʟ dataset

Column	Explanation
ID	The example identifier
CQL	The dedicated query (also grouping together the specific verbs)
Reference	The reference in the Spoken Torlak dialect corpus 1.0 (Vuković 2020)
Left	Left context
KWIC	KWIC
Right	Right context
From_PRS.3SG	This column contains information on how the ᴘʀs.𝟥ᴘʟ can be derived from the respective ᴘʀs.𝟥ꜱɢ form.

AORIST

To obtain a sample of aorist forms, we extracted the forms ending in -še, which is the ending common to all third-person plural aorist forms in Torlak. Specifically, we used the query:

[word=".+še" & word!="víše" & word!="dodúše" & word!="náše" & word!="váše"]

We then manually extracted verbs in the aorist from the resulting forms. This resulted in a total of 701 verb tokens, which were further annotated in a separate table.
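
The logic of the query can be sketched in Python as follows. The sketch reads the pattern as .+še (at least one character followed by še), assumes that it must match the whole token, and treats the four excluded words as literal strings; the function name is_candidate and the token list are illustrative only. As in the procedure described above, the result is a candidate list that still requires manual checking.

```python
import re
import unicodedata


def nfc(text):
    """Normalise to NFC so that accented letters compare reliably."""
    return unicodedata.normalize("NFC", text)


# Any token ending in -še with at least one preceding character.
CANDIDATE = re.compile(".+še")

# Frequent non-verbal words in -še excluded by the query.
EXCLUDED = {nfc(w) for w in ("víše", "dodúše", "náše", "váše")}


def is_candidate(token):
    """True if the token passes the -še query and is not excluded."""
    token = nfc(token)
    return CANDIDATE.fullmatch(token) is not None and token not in EXCLUDED


# Illustrative tokens: two aorist forms cited in this description,
# two excluded words, and two tokens that do not fit the pattern.
tokens = ["imaše", "imadoše", "víše", "náše", "še", "ima"]
candidates = [t for t in tokens if is_candidate(t)]
print(candidates)
```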

In addition to the columns identifying examples (ID, left and right contexts, KWIC, and the reference of the example in the Torlak corpus), the dataset is annotated for the following properties:

the theme vowel class;
the vowel preceding inflection;
variation – if the verb exhibits variation in the aorist forms, e.g., the verb imati ‘have’ has the aorist forms imaše or imadoše in the third-person plural;
aspect (1 if perfective; 0 if imperfective).

The columns included are described in Table 4.

Table 4. Annotated columns in the aorist dataset

Column	Explanation
ID	The example identifier
Left	Left context
KWIC	KWIC
Right	Right context
Vowel_before_inflection	The vowel preceding the inflectional ending
Theme_vowel	Theme vowel class based on the participle and the present tense
Perfective_aspect	1 if perfective, 0 if imperfective
Reference	The reference in the Spoken Torlak dialect corpus 1.0 (Vuković 2020)
Variation	Annotated as 1 if the given form exhibits variation related to the theme-vowel class assignment; 0 otherwise. For instance, the verb <imala, ima> ‘have’ has the third-person plural aorist form im-a-še (root-TV-AOR), characteristic of the a/a class, but also imad-o-še, a form otherwise characteristic of the ∅/e class. Both these forms have the value “1” in the Variation column.
Variation_note	Note on the variation attested. E.g., in the row containing the form imaše, the note reads: “alternative: imádoše (characteristic of the class 0/e)”.

IMPERFECTUM

Since several forms in the imperfectum end in -še, the initial sample was obtained using the same query as for the aorist sample, after which the actual imperfectum forms were manually extracted and annotated in a separate table. The resulting dataset comprises a total of 545 verb tokens with their concordances. Since the ending -še appears both in singular and plural imperfectum forms, the dataset is additionally annotated for singular vs. plural.

The columns included are described in Table 5.

Table 5. Annotated columns in the imperfectum dataset

Column	Explanation
ID	The example identifier
Left	Left context
KWIC	KWIC
Right	Right context
Vowel_before_inflection	The vowel preceding the inflectional ending
Theme_vowel	Theme vowel class based on the participle and the present tense
Perfective_aspect	1 if perfective, 0 if imperfective
Reference	The reference in the Spoken Torlak dialect corpus 1.0 (Vuković 2020)
Variation	Annotated as 1 if the given form exhibits variation related to the theme-vowel class assignment; 0 otherwise. For instance, the verb <smejala, smeje> ‘laugh’ has the third-person singular imperfectum form smej-a-še (root-TV-IFM), characteristic of the a/a class, but also sme-e-še, which is a form otherwise characteristic of other classes. Both these forms have the value “1” in the Variation column.
Variation_note	Note on the variation attested.
Number	1 – second/third person singular; 0 – third-person plural.

Acknowledgments

This database is a result of the project “What’s in a verb? Mapping Serbian verbs borrowed into Romani” (Institute for Balkan Studies, Serbian Academy of Sciences and Arts (SASA); Department of Slavic Studies, University of Graz). The project is financed by the Ministry of Science, Technological Development and Innovation of the Republic of Serbia in cooperation with Austria’s Agency for Education and Internationalisation (OeAD), within the program of scientific and technological cooperation between the Republic of Serbia and the Republic of Austria for the period 2024–2026.

References

Arsenijević, B., Gomboc Čeh, K., Marušič, F. L., Milosavljević, S., Mišmaš, P., Simić, J., Simonović, M., & Žaucer, R. (2024). Database of the Western South Slavic Verb HyperVerb 2.0 – WeSoSlaV. http://hdl.handle.net/11356/1846 (Slovenian language resource repository CLARIN.SI)

Arsenijević, B., Marušič, F. L., Milosavljević, S., Mišmaš, P., Simonović, M., & Žaucer, R. (2025). Hyperspacing the verb: The interplay between prosody, morphology, syntax and semantics in the Western South Slavic verbal domain. Berlin: Language Science Press. (Manuscript submitted for publication).

Mirić, M., Ćirković, S., & Simonović, M. (2025). Serbian loanverbs in Gurbet Romani (Knjaževac) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.15642055

Vuković, T. (2020). Spoken Torlak dialect corpus 1.0 (transcription). Slovenian language resource repository CLARIN.SI, ISSN 2820-4042. http://hdl.handle.net/11356/1281

Vuković, T. (2021). Representing variation in a spoken corpus of an endangered dialect: The case of Torlak. Language Resources and Evaluation, 55(3), 731–756. https://doi.org/10.1007/s10579-020-09522-4

Files

Files (1.1 MB)

Name	Size
0.Description.pdf md5:1e73abfc4c7d2aa5eafed70f6d11a519	258.0 kB
1.Main_dataset.csv md5:e67c1ec192c4f6a060b44b6d1a3126d5	138.3 kB
2.PRS.3PL.csv md5:9142fa2384d62b809bfc6471481a7d08	243.3 kB
3.PRS.3PL_Additional_dataset.csv md5:e801b7bcd56e5c9be5ec0c94316310b5	46.9 kB
4.Aorist.csv md5:47490383656cac21bcebbd49218ce026	249.2 kB
5.Imperfectum.csv md5:29183984a2264a1d19b05345b36d90ca	199.1 kB

Additional details

Austrian Agency for International Cooperation in Education and Research
What’s in a verb? Mapping Serbian verbs borrowed into Romani RS03/2024
