Mapping out the Torlak verb
Authors/Creators
- 1. University of Graz
- 2. Balkanoloski institut Srpske akademije nauka i umetnosti
Description
Stefan Milosavljević (University of Graz)
Anđela Babić (Institute for Balkan Studies, Serbian Academy of Sciences and Arts)
Marko Simonović (University of Graz)
The present database is a deliverable of the project “What’s in a verb? Mapping Serbian verbs borrowed into Romani”, conducted at the Department of Slavic Studies of the University of Graz and the Institute for Balkan Studies, Serbian Academy of Sciences and Arts, and funded by the Ministry of Education, Science, and Research of the Republic of Austria in cooperation with the Serbian Ministry of Science, Technological Development, and Innovation. It is part of the Scientific and Technological Cooperation Program between the Republic of Austria and the Republic of Serbia for the period 2024–2026.
The first deliverable of the project, a database of Romani verbs borrowed from BCMS, has already been published as Mirić et al. (2025). The present database focuses on the properties of Torlak verbs. While many of the properties included are those that proved particularly instrumental in accounting for borrowing patterns into Romani — such as theme vowel classes, the presence of a derivational suffix, and so on — none of them is relevant only for borrowing. The included annotations will be generally useful for researchers working on the Torlak verbal system.
The main dataset
The main dataset consists of 3085 verbs extracted from the Spoken Torlak dialect corpus 1.0 (Vuković, 2020). The corpus was accessed via the following link: https://www.clarin.si/ske/#dashboard?corpname=torlak
As indicated in Vuković (2020), the corpus consists of semi-orthographic transcripts of 86.5 hours of recordings from locations evenly distributed throughout the Timok area of the Torlak dialect zone (see Vuković 2021 for a more detailed description of this corpus).
The procedure for obtaining the verbs was as follows. We extracted all the verbal lemmas from the corpus and then annotated the 5,000 most frequent ones. Two native-speaker annotators (the first two authors) first evaluated the full set to identify actual verbs, i.e., to eliminate any incorrectly marked entries due to typos, erroneous lemmatizations, and similar issues. Following this procedure, the dataset comprises a total of 3085 verbs. The resulting dataset was then annotated for the following properties:
- the third-person singular present tense form, e.g., uzne for the lemma uzeti ‘take’;
- the participle (feminine singular) form, e.g., uzela for the lemma uzeti ‘take’;
- theme vowel class, e.g., </, ne> for the lemma uzeti ‘take’;
- suffix (whether a given verb contains a derivational suffix), e.g., 0 for the lemma uzeti ‘take’, 1 for the lemma izbegavati ‘avoid’, which contains the secondary imperfectivising suffix -av;
- root allomorphy (whether or not it is exhibited between the present tense and the participle form): 1 for the presence of root allomorphy, as in the lemma uzeti ‘take’; 0 for its absence, as in čitati ‘read’.
In annotating these properties, we generally followed the instructions developed for the database WeSoSlaV, as described in Arsenijević et al. (2024) and Arsenijević et al. (2025), with some modifications. The following notes on the annotated properties are in order.
Inflected forms. The third-person present tense forms and the participle forms are based on their attestation in the Torlak corpus. In cases where these forms were not attested, they were reconstructed by the native speaker annotators. Since Torlak does not have an infinitive form and since no single form suffices to identify the conjugation pattern of the verb, in the remainder of the document we use the combination of the participle and the third-person present tense to identify Torlak verbs. So for the original lemma uzeti ‘take’, based on the standard BCMS infinitive, we will use <uzela, uzne>.
Theme-vowel classes. Like in WeSoSlaV, the theme-vowel classes, which identify the conjugation pattern, are annotated as pairs, where the first theme vowel originates from a non-finite form, and the second one from the present-tense form. Since there is no infinitive form in Torlak, the first TV within a pair is based on the participial form.
Root allomorphy. Root allomorphy is defined as any unpredictable difference in the exponence of the root morpheme (i.e., the material preceding the TV) between the non-finite and the finite form of the verb. For instance, the verb <raznela, raznese> ‘blow up’ exhibits root allomorphy, shown by the contrast between ne and nes. The definition excludes predictable, phonologically conditioned allomorphy (such as <pisala, piše> ‘write’). This approach is based on Arsenijević et al. (2024).
Suffixes. In annotating this column, we departed considerably from WeSoSlaV, where three classes of suffixes or suffix-like morphemes were annotated. We only annotated whether the verb contains a derivational suffix, assigning the value 1 if there is an overt suffix, and 0 otherwise. Informed by our preliminary research, we aimed to annotate as containing derivational suffixes only those verbs that exhibit overt suffixal material realised by dedicated segments. This means that verbs derived by ablaut (e.g., <sklapala, sklapa>, a secondary imperfective from <sklopila, sklopi> ‘put together’) or consonant mutation (e.g., <izmišljala, izmišlja>, a secondary imperfective from <izmislila, izmisli> ‘make up’) were not counted as containing a derivational suffix. We also counted as derivational only those affixes that are never part of the theme vowel material. For this reason, we did not consider the suffix n-, which precedes the theme vowels <u, e>, as a derivational affix, since arguably the same element is analysed as a part of a theme vowel in the TV class </, ne>. On the other hand, the suffix k- in <na-sec-k-a-la, na-sec-k-a> ‘cut’ is counted as a derivational affix, as it is never part of a theme vowel.
The columns contained in the dataset are summarized in Table 1.
Table 1. Annotated columns in the main dataset
| Column | Explanation |
|---|---|
| ID | The numeric identifier of the verb in the dataset. |
| Lemma | Lemma as given in the Spoken Torlak dialect corpus 1.0, even if it does not accurately reflect the actual verb. This column is not analytically meaningful; its sole purpose is to facilitate linking this dataset with the Spoken Torlak dialect corpus 1.0. |
| Frequency | The frequency retrieved from the Spoken Torlak dialect corpus 1.0. |
| PRS.3SG | The third-person singular present tense form of the given verb, e.g., trči ‘runs’. |
| PTCP.F | The feminine singular active participle form of the verb, e.g., trčala ‘run’. |
| TV class | The pair of theme vowels based on ᴘᴛᴄᴘ.ꜰ and ᴘʀs.𝟥sɢ, e.g., <a, i> for the verb <trčala, trči> ‘run’. |
| Suffix | If the verb contains a suffix, this is marked as 1; 0 otherwise. |
| Root allomorphy | If the verb exhibits root allomorphy, this is marked as 1; 0 otherwise. |
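A quick consistency check on the binary annotations can be useful when working with the main dataset. The following is a minimal sketch, assuming the table has been exported to CSV with the column names from Table 1; the file layout, the function name `find_problems`, and the sample rows (including the frequency value and the deliberately malformed second row) are illustrative assumptions, not part of the published files.

```python
import csv
import io

# Column names as listed in Table 1; the CSV export itself is an assumption.
EXPECTED_COLUMNS = [
    "ID", "Lemma", "Frequency", "PRS.3SG", "PTCP.F",
    "TV class", "Suffix", "Root allomorphy",
]
BINARY_COLUMNS = ("Suffix", "Root allomorphy")


def find_problems(handle):
    """Return (ID, column, value) triples for binary cells that are not 0 or 1."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    problems = []
    for row in reader:
        for name in BINARY_COLUMNS:
            if row[name] not in ("0", "1"):
                problems.append((row["ID"], name, row[name]))
    return problems


# Illustrative sample: the first row follows the uzeti example in the text
# (the frequency is a placeholder); the second row is deliberately malformed.
SAMPLE = '''ID,Lemma,Frequency,PRS.3SG,PTCP.F,TV class,Suffix,Root allomorphy
1,uzeti,100,uzne,uzela,"/, ne",0,1
2,uzeti,100,uzne,uzela,"/, ne",yes,1
'''

print(find_problems(io.StringIO(SAMPLE)))
```

Note that the TV class value contains a comma, so it has to be quoted in a comma-separated export.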
Complementary datasets
In this subsection, we describe four complementary datasets that were used to attest potential variation in behavior of different theme-vowel classes across finite-verb environments, specifically, in the third-person plural present tense, aorist, and imperfectum.
PRESENT TENSE: THIRD-PERSON PLURAL datasets
Since the third-person plural forms of the present tense have no formal properties in common, in order to obtain an initial sample, we used a regular expression that targeted some common plural subject forms followed by a verb. Specifically, we used the query:
[word="oni" | word="one" | word="ljudi" | word="žene"| word="deca" | word="déca" | word="oní" | word="oné" | word="žéne" | word="ljúdi"] [tag="V.*"]
From the obtained sample of 692 concordances, we then manually extracted those that contained verbs in the ᴘʀs.𝟥ᴘʟ form. This resulted in a total of 323 verb tokens, which were further annotated.
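The query above lists each subject form together with its accented variants. As a small illustration, the same query string can be assembled from a list of forms; the helper name `build_subject_verb_query` is hypothetical, and the sketch only builds the string, it does not contact the corpus.

```python
def build_subject_verb_query(subject_forms, verb_tag="V.*"):
    """Build a two-token CQL query: one of the subject forms, then a verb."""
    alternatives = " | ".join(f'word="{form}"' for form in subject_forms)
    return f'[{alternatives}] [tag="{verb_tag}"]'


# The subject forms used in the query above, including accented variants.
SUBJECT_FORMS = [
    "oni", "one", "ljudi", "žene", "deca",
    "déca", "oní", "oné", "žéne", "ljúdi",
]

query = build_subject_verb_query(SUBJECT_FORMS)
print(query)
```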
In addition to the columns identifying examples (ID, the reference of the example in the Torlak corpus, left and right contexts, KWIC), the dataset is annotated for the following properties:
- whether the concordance includes a ᴘʀs.𝟥ᴘʟ form;
- the ᴘᴛᴄᴘ.ꜰ form;
- the TV class;
- how the ᴘʀs.𝟥ᴘʟ can be derived from the respective ᴘʀs.𝟥sɢ form.
The columns included are described in Table 2.
Table 2. Annotated columns in the ᴘʀs.𝟥ᴘʟ dataset
| Column | Explanation |
|---|---|
| ID | The example identifier |
| Left | Left context |
| KWIC | KWIC |
| Right | Right context |
| Reference | The reference in the Spoken Torlak dialect corpus 1.0 (Vuković 2020). |
| PTCP.F | The feminine singular active participle form of the verb. |
| TV_class | The theme vowel class based on the participle and the present tense. |
| From_PRS.3SG | This column contains information on how the ᴘʀs.𝟥ᴘʟ can be derived from the respective ᴘʀs.𝟥ꜱɢ form. For instance, the form zapale ‘they set fire’ can be derived from the singular correspondent zapali ‘s/he sets fire’ by removing the final i and adding an e. This is captured by the marking “- i + e”. In cases where the ᴘʀs.𝟥ᴘʟ can be derived from the respective ᴘʀs.𝟥ꜱɢ form just by adding extra material, this material is entered without a “+” in order to avoid the interpretation as a formula. For instance, the form uzimau ‘they take’ can be derived from the singular correspondent uzima ‘s/he takes’ by adding an u. This is captured by the marking “u”. |
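The From_PRS.3SG markings can be read as small rewrite instructions on the singular form. The following sketch, assuming the two marking shapes described in Table 2 (“- X + Y” and bare added material), applies a marking to a ᴘʀs.𝟥sɢ form; the function name `apply_marking` is hypothetical, and treating a marking without “+” after the minus sign as pure removal is an assumption not covered by the description.

```python
def apply_marking(singular, marking):
    """Derive a plural form from a 3SG form using a From_PRS.3SG-style marking."""
    marking = marking.strip()
    if not marking.startswith("-"):
        # Bare material (e.g. "u", "ju") is simply appended to the singular.
        return singular + marking
    # "- X + Y": remove final X, then add Y. Whitespace around X and Y varies
    # in the examples ("- i + e" vs. "-a + u"), so it is stripped.
    removed, _, added = marking[1:].partition("+")
    removed, added = removed.strip(), added.strip()
    if not singular.endswith(removed):
        raise ValueError(f"{singular!r} does not end in {removed!r}")
    stem = singular[: len(singular) - len(removed)]
    return stem + added


# Examples taken from the description.
print(apply_marking("zapali", "- i + e"))  # zapale
print(apply_marking("uzima", "u"))         # uzimau
```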
Our preliminary analysis has shown that variation within TV classes is restricted to a single class: <a, a>, where three different patterns are attested:
- "ju", e.g., primaju ‘they receive’ (cf. prima ‘s/he receives’);
- "u", e.g., uzimau ‘they take’ (cf. uzima ‘s/he takes’);
- "-a + u", e.g., igru ‘they play’ (cf. igra ‘s/he plays’).
In the original sample, "ju" is by far the most common pattern, occurring in 36 out of 46 tokens. To determine whether specific verbs tend to specialise in taking particular ᴘʀs.𝟥ᴘʟ endings, we conducted targeted searches for the eight verbs that appeared in the initial sample with the two less frequent patterns.
For instance, for the verb <uzimala, uzima> the query was
[word="(ú|u)z(í|i)m(áju|ajú|aju|áu|aú|au|u|ú)"]
These searches yielded 228 concordances, which are presented in a separate document and annotated for the ᴘʀs.𝟥ᴘʟ ending.
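To illustrate what such a targeted query matches and how a hit maps onto the three patterns, here is a minimal sketch. It assumes that the CQL pattern is matched against the whole token (hence `re.fullmatch`) and that accent marks can be ignored when determining the ending; the helper names `matches_query`, `strip_acute`, and `classify_ending` are hypothetical.

```python
import re
import unicodedata

# The pattern from the query for <uzimala, uzima>, normalised to precomposed form.
QUERY_REGEX = unicodedata.normalize(
    "NFC", "(ú|u)z(í|i)m(áju|ajú|aju|áu|aú|au|u|ú)"
)


def matches_query(form):
    """True if the whole token matches the targeted pattern."""
    return re.fullmatch(QUERY_REGEX, unicodedata.normalize("NFC", form)) is not None


def strip_acute(form):
    """Remove acute accent marks only; carons (č, š, ž) are kept."""
    decomposed = unicodedata.normalize("NFD", form)
    without_acute = "".join(ch for ch in decomposed if ch != "\u0301")
    return unicodedata.normalize("NFC", without_acute)


def classify_ending(form, singular):
    """Return the 3PL pattern relative to the 3SG form, or None if none applies."""
    plain = strip_acute(form)
    if plain == singular + "ju":
        return "ju"
    if plain == singular + "u":
        return "u"
    if singular.endswith("a") and plain == singular[:-1] + "u":
        return "-a + u"
    return None


for token in ["uzímaju", "uzimáu", "uzimú", "uzima"]:
    print(token, matches_query(token), classify_ending(token, "uzima"))
```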
Table 3. Annotated columns for the additional ᴘʀs.𝟥ᴘʟ dataset
|
Column |
Explanation |
|
ID |
The example identifier |
|
CQL |
The dedicated query (also grouping together the specific verbs) |
|
Reference |
The reference in the Spoken Torlak dialect corpus 1.0 (Vuković 2020) |
|
Left |
Left context |
|
KWIC |
KWIC |
|
Right |
Right context |
|
From_PRS.3SG |
This column contains information on how the ᴘʀs.𝟥ᴘʟ can be derived from the respective ᴘʀs.𝟥ꜱɢ form. |
AORIST
To obtain a sample of aorist forms, we extracted the forms ending in -še, which is the ending common to all third-person plural aorist forms in Torlak. Specifically, we used the query:
[word=".+še" & word!="víše" & word!="dodúše" & word!="náše" & word!="váše"]
We then manually extracted verbs in the aorist from the resulting forms. This resulted in a total of 701 verb tokens, which were further annotated in a separate table.
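The candidate selection performed by this query can be mimicked on a plain list of tokens. The sketch below assumes the intended pattern is any token of at least one character followed by še, matched against the whole token, with the four listed forms excluded; the function name `is_aorist_candidate` is hypothetical. As in the corpus workflow, the result is only a candidate set that still requires manual filtering.

```python
import re

# Frequent non-verbal forms in -še that the query excludes explicitly.
EXCLUDED_FORMS = {"víše", "dodúše", "náše", "váše"}


def is_aorist_candidate(token):
    """True if the token ends in -še and is not one of the excluded forms."""
    return re.fullmatch(r".+še", token) is not None and token not in EXCLUDED_FORMS


tokens = ["imaše", "imadoše", "víše", "ima", "še"]
print([t for t in tokens if is_aorist_candidate(t)])
```

Only the accented spellings listed in the query are excluded, so an unaccented spelling of the same word would still pass this filter and has to be removed by hand.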
In addition to the columns identifying examples (ID, left and right contexts, KWIC, and the reference of the example in the Torlak corpus), the dataset is annotated for the following properties:
- the theme vowel class;
- the vowel preceding inflection;
- variation (whether the verb exhibits variation in the aorist forms), e.g., the verb imati ‘have’ has the aorist forms imaše or imadoše in the third-person plural;
- aspect (1 if perfective; 0 if imperfective).
The columns included are described in Table 4.
Table 4. Annotated columns in the aorist dataset
| Column | Explanation |
|---|---|
| ID | The example identifier |
| Left | Left context |
| KWIC | KWIC |
| Right | Right context |
| Vowel_before_inflection | The vowel preceding the inflectional ending |
| Theme_vowel | Theme vowel class based on the participle and the present tense |
| Perfective_aspect | 1 if perfective, 0 if imperfective |
| Reference | The reference in the Spoken Torlak dialect corpus 1.0 (Vuković 2020) |
| Variation | Annotated as 1 if the given form exhibits variation related to the theme-vowel class assignment; 0 otherwise. For instance, the verb <imala, ima> ‘have’ has the third-person plural aorist form im-a-še (root-TV-AOR), characteristic of the a/a class, but also imad-o-še, a form otherwise characteristic of the ∅/e class. Both these forms have the value “1” in the Variation column. |
| Variation_note | Note on the variation attested. E.g., in the row containing the form imaše, it says: “alternative: imádoše (characteristic of the class 0/e)”. |
IMPERFECTUM
Since several forms in the imperfectum end in -še, the initial sample was obtained using the same query as for the aorist sample, after which the actual imperfectum forms were manually extracted and annotated in a separate table. The resulting dataset comprises a total of 545 verb tokens with their concordances. Since the ending -še appears both in singular and plural imperfectum forms, the dataset is additionally annotated for singular vs. plural.
The columns included are described in Table 5.
Table 5. Annotated columns in the imperfectum dataset
| Column | Explanation |
|---|---|
| ID | The example identifier |
| Left | Left context |
| KWIC | KWIC |
| Right | Right context |
| Vowel_before_inflection | The vowel preceding the inflectional ending |
| Theme_vowel | Theme vowel class based on the participle and the present tense |
| Perfective_aspect | 1 if perfective, 0 if imperfective |
| Reference | The reference in the Spoken Torlak dialect corpus 1.0 (Vuković 2020) |
| Variation | Annotated as 1 if the given form exhibits variation related to the theme-vowel class assignment; 0 otherwise. For instance, the verb <smejala, smeje> ‘laugh’ has the third-person singular imperfectum form smej-a-še (root-TV-IFM), characteristic of the a/a class, but also sme-e-še, a form otherwise characteristic of other classes. Both these forms have 1 in the Variation column. |
| Variation_note | Note on the variation attested. |
| Number | 1 for second/third-person singular; 0 for third-person plural. |
Acknowledgments
This database is a result of the project “What’s in a verb? Mapping Serbian verbs borrowed into Romani” (Institute for Balkan Studies, Serbian Academy of Sciences and Arts (SASA); Department of Slavic Studies, University of Graz). The project is financed by the Ministry of Science, Technological Development and Innovation of the Republic of Serbia in cooperation with Austria’s Agency for Education and Internationalisation (OeAD), within the program of scientific and technological cooperation between the Republic of Serbia and the Republic of Austria for the period 2024–2026.
References
- Arsenijević, B., Gomboc Čeh, K., Marušič, F. L., Milosavljević, S., Mišmaš, P., Simić, J., Simonović, M., & Žaucer, R. (2024). Database of the Western South Slavic Verb HyperVerb 2.0 – WeSoSlaV. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1846
- Arsenijević, B., Marušič, F. L., Milosavljević, S., Mišmaš, P., Simonović, M., & Žaucer, R. (2025). Hyperspacing the verb: The interplay between prosody, morphology, syntax and semantics in the Western South Slavic verbal domain. Berlin: Language Science Press. (Manuscript submitted for publication).
- Mirić, M., Ćirković, S., & Simonović, M. (2025). Serbian loanverbs in Gurbet Romani (Knjaževac) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.15642055
- Vuković, T. (2020). Spoken Torlak dialect corpus 1.0 (transcription). Slovenian language resource repository CLARIN.SI. ISSN 2820-4042. http://hdl.handle.net/11356/1281
- Vuković, T. (2021). Representing variation in a spoken corpus of an endangered dialect: the case of Torlak. Language Resources and Evaluation, 55(3), 731–756. doi: 10.1007/s10579-020-09522-4
Additional details
Funding
- Austrian Agency for International Cooperation in Education and Research
- What’s in a verb? Mapping Serbian verbs borrowed into Romani RS03/2024