Global database of vowel dissimilation
Description
This dataset contains descriptive data on vowel dissimilation: 133 unique patterns in 116 languages from 38 language families, including Afro-Asiatic, Austronesian, Indo-European, Mayan, Niger-Congo, Sino-Tibetan, and Turkic, as well as isolates such as Basque and Ainu. The best-represented family is Austronesian, with 28 patterns in 23 languages, followed by Indo-European with 17 patterns in 15 languages, Atlantic-Congo and Afro-Asiatic with 10 patterns each (in 10 and 8 languages, respectively), and Mayan with 8 patterns in 7 languages. Other, smaller language families are also represented and account for half of the total patterns collected in the survey. Language isolates are represented with 5 patterns in 3 languages.
Each dissimilation pattern is identified from input-output pairs and from the morpho-syntactic context in which it is observed (a specific morpheme, a group of morphemes, or no morpho-syntactic restriction). The information and the examples are sourced from grammatical descriptions, primarily reference and descriptive grammars, but also from dictionaries, wordlists, corpora, and various online materials such as forums, song lyrics, news portals, magazines, and religious texts. Other valuable sources are phonological descriptions and phonological studies of individual languages, as well as descriptive and theoretical papers that analyse dissimilative patterns in various frameworks.
Genetic information for individual languages is sourced from Glottolog v. 5.2, supplemented with information from the sources themselves where necessary. For example, in some cases the language name in Glottolog differs from the name in the source, in which case priority is given to the source. Linguistic systems are presented in alphabetical order according to the major name followed by the modifier. This means that variants of the same major linguistic system are presented one after another, as in the case of Basque, Guere, etc.
Every linguistic system (languoid) is identified by an ISO 639-3 code, or by a Glottocode where no ISO 639-3 code is available (usually for smaller variants), together with its language family, the area(s) where it is spoken, and the list of sources from which the data were retrieved. The general information is followed by the marker 'phonological' or 'morpho-phonological', depending on the observed nature of the pattern. Next comes the information on the dissimilative regularity and the morpho-syntactic context in which the pattern operates, followed by the data: lists of examples showing the regular pattern in contrast to the dissimilative one, with notes about exceptions and general phonological tendencies in the language. The amount of data available is unfortunately not uniform and is in several cases scarce. In some cases only representative examples are given; in others all of the available data are included, even if that means the pattern is represented by only five examples.
Columns in the dataset:
Language Identification & Metadata
- glottocode - Unique language identifier from Glottolog v. 5.2 (e.g., adyg1241)
- language.x - Language name (e.g., "Adyghe")
- iso.x - ISO 639-3 code (e.g., ady)
- family - Language family (e.g., "Abkhaz-Adyge")
- subfamily - Subgroup (e.g., "Circasian")
- language_glottolog - Glottolog's standardized language name
- language_glottolog.1 - Secondary Glottolog reference
- iso.y - Alternate ISO code (if different from iso.x)
- level - Language/dialect classification ("language" or "dialect")
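The identification rule described above (ISO 639-3 code where available, Glottocode otherwise) can be sketched in a few lines of Python. This is only an illustration: the helper name `languoid_id`, the treatment of empty strings and "NA" as missing values, and the second sample row are assumptions, not part of the dataset.

```python
import csv
import io


def languoid_id(row):
    """Return the ISO 639-3 code if present, otherwise the Glottocode."""
    iso = (row.get("iso.x") or "").strip()
    # Assumption: a missing ISO code appears as an empty cell or as "NA".
    if iso and iso.upper() != "NA":
        return iso
    return row["glottocode"].strip()


# Tiny in-memory sample; the second row is invented for illustration only.
sample = io.StringIO(
    "glottocode,language.x,iso.x\n"
    "adyg1241,Adyghe,ady\n"
    "hypo1234,Hypothetical variant,\n"
)
ids = [languoid_id(row) for row in csv.DictReader(sample)]
print(ids)  # ['ady', 'hypo1234']
```

To run this on the real file, the `io.StringIO` object would be replaced by an opened handle to the CSV, assuming the file is comma-separated with a header row.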
Geographic Data
- area - Macro-region (e.g., "Eurasia", "Africa")
- latitude - Decimal degrees
- longitude - Decimal degrees
- countries - ISO country codes (e.g., "RU;TR")
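As a small illustration of how the geographic columns might be read, the sketch below converts the coordinates to floats and splits the semicolon-separated country codes. The function name `parse_location` is hypothetical; the sample values are those of the Kasem example entry.

```python
def parse_location(row):
    """Convert the geographic columns of one row into typed values."""
    return {
        "area": row["area"],
        "latitude": float(row["latitude"]),
        "longitude": float(row["longitude"]),
        # Country codes are separated by semicolons, e.g. "BF;GH".
        "countries": [code for code in row["countries"].split(";") if code],
    }


kasem = {
    "area": "Africa",
    "latitude": "11.0824",
    "longitude": "-1.39076",
    "countries": "BF;GH",
}
location = parse_location(kasem)
print(location["countries"])  # ['BF', 'GH']
```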
Dissimilation Patterns
- VD.type - Pattern type (P = phonological, MP = morpho-phonological)
- feature.INPUT - Underlying vowel feature (e.g., [+low])
- feature.OUTPUT - Resulting feature (e.g., [-low])
- feature.CONTEXT - Phonological context triggering change
- other.features - Additional relevant features (e.g., [+round])
- type.of.identity - What kind of identity is necessary for dissimilation ("full" or "partial")
- vowel.length - Sensitivity to vowel length ("no", "feeds", "bleeds")
- adjacent - Locality condition ("syllable", "root node", "foot", "unlimited", "variable")
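The pattern columns lend themselves to simple tallies and filters. The following is a minimal sketch using only the standard library; the helper names `count_by` and `select` are hypothetical, and the second sample row is invented for illustration (the first uses values from the Kasem example entry).

```python
from collections import Counter


def count_by(rows, column):
    """Count how often each value of `column` occurs."""
    return Counter(row[column] for row in rows)


def select(rows, conditions):
    """Keep rows whose columns equal every value given in `conditions`."""
    return [
        row for row in rows
        if all(row.get(col) == value for col, value in conditions.items())
    ]


rows = [
    {"glottocode": "kase1253", "VD.type": "MP", "feature.INPUT": "[-low]",
     "feature.OUTPUT": "[+low]", "type.of.identity": "partial",
     "adjacent": "syllable"},
    # Invented row, not taken from the dataset.
    {"glottocode": "hypo1234", "VD.type": "P", "feature.INPUT": "[+low]",
     "feature.OUTPUT": "[-low]", "type.of.identity": "full",
     "adjacent": "foot"},
]

type_counts = count_by(rows, "VD.type")
morpho_phonological = select(rows, {"VD.type": "MP", "adjacent": "syllable"})
print(type_counts["MP"], [r["glottocode"] for r in morpho_phonological])
```

A dictionary is used for the conditions because column names such as VD.type contain dots and cannot be passed as Python keyword arguments.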
Morphosyntactic Context
- morphemes.involved - Morpho-syntactic context (e.g., "pl", "poss")
- another - Secondary morpheme category (if applicable)
- class - Word class affected ("noun", "verb", "both")
- direction - "regressive" or "progressive" dissimilation
- trigger - Where the dissimilation originates ("prefix", "suffix", "root")
- location - Locus of change ("root", "suffix", etc.)
Additional Features
- prosody.related - Stress/tone involvement ("yes"/"no")
- alternative - Alternative to the dissimilative outcome (e.g., "default", "harmony", "reduplication")
- feature_change - Descriptive string (e.g., [[+low]] → [[-low]])
- morpheme_categories - Grammatical categories (e.g., "pers/num")
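Judging from the example entry, where feature.INPUT is [-low], feature.OUTPUT is [+low] and feature_change is [[-low]] → [[+low]], the descriptive string appears to wrap each feature in an extra pair of brackets and join the two with an arrow. The sketch below reproduces that format; it is an inference from a single example, not a documented rule, and the function name is hypothetical.

```python
def feature_change(feature_input, feature_output):
    """Build a feature_change-style string from input and output features."""
    # Assumption: each feature is wrapped in one extra pair of brackets.
    return f"[{feature_input}] → [{feature_output}]"


print(feature_change("[-low]", "[+low]"))  # [[-low]] → [[+low]]
```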
Genealogical & Classification Data
-
affiliation
- Language family with sub-branches -
subclassification
- Detailed genealogical tree -
countries
- Repeat of ISO country codes
Example Entry
kase1253 | Kasem | xsm | Atlantic-Congo | Grusi | MP | [-low] | [+low] | [+high] | [+round] | partial | feeds | syllable | pl | no | noun | regressive | suffix | root | no | default | Kasem | Kasem | xsm | language | Africa | 11.0824 | -1.39076 | BF;GH | Atlantic-Congo, Volta-Congo, North Volta-Congo, Gur, Central Gur, Southern Central Gur, Grusi, Northern Grusi, Nuna-Kasem | (East_Kasem:1,Fere:1,Lela:1,Nuclear_Kasem:1,Nunuma:1,West_Kasem:1)kase1253:1; | [[-low]] → [[+low]] | pl |
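The example entry has 33 pipe-separated cells, and their order differs from the order in which the columns are documented above. The sketch below splits the entry and pairs each cell with a column name. The order in `ASSUMED_COLUMN_ORDER` is inferred by matching the values against the column descriptions; it is an assumption and should be checked against the header row of the CSV file.

```python
EXAMPLE_ENTRY = (
    "kase1253 | Kasem | xsm | Atlantic-Congo | Grusi | MP | [-low] | [+low] | "
    "[+high] | [+round] | partial | feeds | syllable | pl | no | noun | "
    "regressive | suffix | root | no | default | Kasem | Kasem | xsm | language | "
    "Africa | 11.0824 | -1.39076 | BF;GH | Atlantic-Congo, Volta-Congo, "
    "North Volta-Congo, Gur, Central Gur, Southern Central Gur, Grusi, "
    "Northern Grusi, Nuna-Kasem | (East_Kasem:1,Fere:1,Lela:1,Nuclear_Kasem:1,"
    "Nunuma:1,West_Kasem:1)kase1253:1; | [[-low]] → [[+low]] | pl |"
)

# Assumption: column order inferred from the values, not confirmed by the file.
ASSUMED_COLUMN_ORDER = [
    "glottocode", "language.x", "iso.x", "family", "subfamily",
    "VD.type", "feature.INPUT", "feature.OUTPUT", "feature.CONTEXT",
    "other.features", "type.of.identity", "vowel.length", "adjacent",
    "morphemes.involved", "another", "class", "direction", "trigger",
    "location", "prosody.related", "alternative",
    "language_glottolog", "language_glottolog.1", "iso.y", "level",
    "area", "latitude", "longitude", "countries",
    "affiliation", "subclassification", "feature_change",
    "morpheme_categories",
]


def parse_entry(line, columns):
    """Split a pipe-separated entry and pair each cell with a column name."""
    cells = [cell.strip() for cell in line.strip().rstrip("|").split("|")]
    if len(cells) != len(columns):
        raise ValueError(f"expected {len(columns)} cells, got {len(cells)}")
    return dict(zip(columns, cells))


entry = parse_entry(EXAMPLE_ENTRY, ASSUMED_COLUMN_ORDER)
print(entry["VD.type"], entry["direction"], entry["feature_change"])
```

The length check makes a mismatch between the assumed order and the actual number of cells fail loudly instead of silently shifting values into the wrong columns.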
Files
Name | Size | Checksum |
---|---|---|
cleaned_precisely_standardized_data.csv | 88.3 kB | md5:a153fe08b4ff428086a3defcde567e28 |
Additional details
References
- Hammarström, Harald & Forkel, Robert & Haspelmath, Martin & Bank, Sebastian. 2025. Glottolog 5.2. Leipzig: Max Planck Institute for Evolutionary Anthropology. https://doi.org/10.5281/zenodo.15525265 (Available online at http://glottolog.org, accessed on 2025-08-01.)
- Bosque-Gil, J. & Dojchinovski, M. & Cimiano, P. & Forkel, R. & Hammarström, H. 2022. Glottocodes: Identifiers linking families, languages and dialects to comprehensive reference information. Semantic Web 13(6). 917-924. https://doi.org/10.3233/SW-212843
- Moroz, G. 2017. lingtypology: easy mapping for Linguistic Typology. https://CRAN.R-project.org/package=lingtypology