---
annotations_creators:
- expert-generated
language_creators:
- found
languages:
  ady:
  - ady
  ang:
  - ang
  ara:
  - ar
  arn:
  - arn
  ast:
  - ast
  aze:
  - az
  bak:
  - ba
  bel:
  - be
  ben:
  - bn
  bod:
  - bo
  bre:
  - br
  bul:
  - bg
  cat:
  - ca
  ces:
  - cs
  chu:
  - cu
  ckb:
  - ckb
  cor:
  - kw
  crh:
  - crh
  csb:
  - csb
  cym:
  - cy
  dan:
  - da
  deu:
  - de
  dsb:
  - dsb
  ell:
  - el
  eng:
  - en
  est:
  - et
  eus:
  - eu
  fao:
  - fo
  fas:
  - fa
  fin:
  - fi
  fra:
  - fr
  frm:
  - frm
  fro:
  - fro
  frr:
  - frr
  fry:
  - fy
  fur:
  - fur
  gal:
  - gal
  gla:
  - gd
  gle:
  - ga
  glv:
  - gv
  gmh:
  - gmh
  gml:
  - gml
  got:
  - got
  grc:
  - grc
  hai:
  - hai
  hbs:
  - sh
  heb:
  - he
  hin:
  - hi
  hun:
  - hu
  hye:
  - hy
  isl:
  - is
  ita:
  - it
  izh:
  - izh
  kal:
  - kl
  kan:
  - kn
  kat:
  - ka
  kaz:
  - kk
  kbd:
  - kbd
  kjh:
  - kjh
  klr:
  - klr
  kmr:
  - kmr
  krl:
  - krl
  lat:
  - la
  lav:
  - lv
  lit:
  - lt
  liv:
  - liv
  lld:
  - lld
  lud:
  - lud
  mkd:
  - mk
  mlt:
  - mt
  mwf:
  - mwf
  nap:
  - nap
  nav:
  - nv
  nds:
  - nds
  nld:
  - nl
  nno:
  - nn
  nob:
  - nb
  oci:
  - oc
  olo:
  - olo
  osx:
  - osx
  pol:
  - pl
  por:
  - pt
  pus:
  - ps
  que:
  - qu
  ron:
  - ro
  rus:
  - ru
  san:
  - sa
  sga:
  - sga
  slv:
  - sl
  sme:
  - sme
  spa:
  - es
  sqi:
  - sq
  swc:
  - swc
  swe:
  - sv
  syc:
  - syc
  tat:
  - tt
  tel:
  - te
  tgk:
  - tg
  tuk:
  - tk
  tur:
  - tr
  ukr:
  - uk
  urd:
  - ur
  uzb:
  - uz
  vec:
  - vec
  vep:
  - vep
  vot:
  - vot
  xcl:
  - xcl
  xno:
  - xno
  yid:
  - yi
  zul:
  - zu
licenses:
- cc-by-sa-3.0
multilinguality:
- monolingual
size_categories:
  ady:
  - 1K<n<10K
  ang:
  - 1K<n<10K
  ara:
  - 1K<n<10K
  arn:
  - n<1K
  ast:
  - n<1K
  aze:
  - n<1K
  bak:
  - 1K<n<10K
  bel:
  - 1K<n<10K
  ben:
  - n<1K
  bod:
  - 1K<n<10K
  bre:
  - n<1K
  bul:
  - 1K<n<10K
  cat:
  - 1K<n<10K
  ces:
  - 1K<n<10K
  chu:
  - n<1K
  ckb:
  - n<1K
  cor:
  - n<1K
  crh:
  - 1K<n<10K
  csb:
  - n<1K
  cym:
  - n<1K
  dan:
  - 1K<n<10K
  deu:
  - 10K<n<100K
  dsb:
  - n<1K
  ell:
  - 10K<n<100K
  eng:
  - 10K<n<100K
  est:
  - n<1K
  eus:
  - n<1K
  fao:
  - 1K<n<10K
  fas:
  - n<1K
  fin:
  - 10K<n<100K
  fra:
  - 1K<n<10K
  frm:
  - n<1K
  fro:
  - 1K<n<10K
  frr:
  - n<1K
  fry:
  - n<1K
  fur:
  - n<1K
  gal:
  - n<1K
  gla:
  - n<1K
  gle:
  - 1K<n<10K
  glv:
  - n<1K
  gmh:
  - n<1K
  gml:
  - n<1K
  got:
  - n<1K
  grc:
  - 1K<n<10K
  hai:
  - n<1K
  hbs:
  - 10K<n<100K
  heb:
  - n<1K
  hin:
  - n<1K
  hun:
  - 10K<n<100K
  hye:
  - 1K<n<10K
  isl:
  - 1K<n<10K
  ita:
  - 10K<n<100K
  izh:
  - n<1K
  kal:
  - n<1K
  kan:
  - n<1K
  kat:
  - 1K<n<10K
  kaz:
  - n<1K
  kbd:
  - n<1K
  kjh:
  - n<1K
  klr:
  - n<1K
  kmr:
  - 10K<n<100K
  krl:
  - n<1K
  lat:
  - 10K<n<100K
  lav:
  - 1K<n<10K
  lit:
  - 1K<n<10K
  liv:
  - n<1K
  lld:
  - n<1K
  lud:
  - n<1K
  mkd:
  - 10K<n<100K
  mlt:
  - n<1K
  mwf:
  - n<1K
  nap:
  - n<1K
  nav:
  - n<1K
  nds:
  - n<1K
  nld:
  - 1K<n<10K
  nno:
  - 1K<n<10K
  nob:
  - 1K<n<10K
  oci:
  - n<1K
  olo:
  - 10K<n<100K
  osx:
  - n<1K
  pol:
  - 10K<n<100K
  por:
  - 1K<n<10K
  pus:
  - n<1K
  que:
  - 1K<n<10K
  ron:
  - 1K<n<10K
  rus:
  - 10K<n<100K
  san:
  - n<1K
  sga:
  - n<1K
  slv:
  - 1K<n<10K
  sme:
  - 1K<n<10K
  spa:
  - 1K<n<10K
  sqi:
  - n<1K
  swc:
  - n<1K
  swe:
  - 10K<n<100K
  syc:
  - n<1K
  tat:
  - 1K<n<10K
  tel:
  - n<1K
  tgk:
  - n<1K
  tuk:
  - n<1K
  tur:
  - 1K<n<10K
  ukr:
  - 1K<n<10K
  urd:
  - n<1K
  uzb:
  - n<1K
  vec:
  - n<1K
  vep:
  - 1K<n<10K
  vot:
  - n<1K
  xcl:
  - 1K<n<10K
  xno:
  - n<1K
  yid:
  - n<1K
  zul:
  - n<1K
source_datasets:
- original
task_categories:
- structure-prediction
- text-classification
task_ids:
- multi-class-classification
- multi-label-classification
- structure-prediction-other-morphology
paperswithcode_id: null
---

# Dataset Card for UniMorph

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [UniMorph Homepage](https://unimorph.github.io/)
- **Repository:** [List of UniMorph repositories](https://github.com/unimorph)
- **Paper:** [The Composition and Use of the Universal Morphological Feature Schema (UniMorph Schema)](https://unimorph.github.io/doc/unimorph-schema.pdf)
- **Point of Contact:** [Arya McCarthy](mailto:arya@jhu.edu)

### Dataset Summary

The Universal Morphology (UniMorph) project is a collaborative effort to improve how NLP handles complex morphology in the world’s languages.
The goal of UniMorph is to annotate morphological data in a universal schema that allows an inflected word from any language to be defined by its lexical meaning,
typically carried by the lemma, and by a rendering of its inflectional form in terms of a bundle of morphological features from our schema.
The specification of the schema is described in Sylak-Glassman (2016).

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

The current version of the UniMorph dataset covers 110 languages.

## Dataset Structure

### Data Instances

Each data instance consists of a lemma and a set of possible realizations with morphological and meaning annotations. For example:
```
{'forms': {'Aktionsart': [[], [], [], [], []],
  'Animacy': [[], [], [], [], []],
  ...
  'Finiteness': [[], [], [], [1], []],
  ...
  'Number': [[], [], [0], [], []],
  'Other': [[], [], [], [], []],
  'Part_Of_Speech': [[7], [10], [7], [7], [10]],
  ...
  'Tense': [[1], [1], [0], [], [0]],
  ...
  'word': ['ablated', 'ablated', 'ablates', 'ablate', 'ablating']},
 'lemma': 'ablate'}
```
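
Because `forms` is stored column-wise (one list per field, aligned by position with `word`), it can be convenient to pivot an instance into one record per word form. The following is a minimal sketch, assuming only the layout shown above; `forms_to_rows` is a hypothetical helper, not part of any library, and the category values are kept as the integer ids that appear in the instance.

```python
from typing import Any, Dict, List


def forms_to_rows(instance: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Pivot the column-wise `forms` dict into one record per word form.

    Hypothetical helper: assumes every list in `forms` is aligned by
    position with `forms["word"]`, as in the example instance.
    """
    forms = instance["forms"]
    rows = []
    for i, word in enumerate(forms["word"]):
        # Keep only the categories that carry at least one tag for this form.
        tags = {
            category: values[i]
            for category, values in forms.items()
            if category != "word" and values[i]
        }
        rows.append({"lemma": instance["lemma"], "word": word, "tags": tags})
    return rows


# Abridged copy of the instance shown above (elided categories left out).
example = {
    "forms": {
        "Finiteness": [[], [], [], [1], []],
        "Number": [[], [], [0], [], []],
        "Part_Of_Speech": [[7], [10], [7], [7], [10]],
        "Tense": [[1], [1], [0], [], [0]],
        "word": ["ablated", "ablated", "ablates", "ablate", "ablating"],
    },
    "lemma": "ablate",
}

rows = forms_to_rows(example)
for row in rows:
    print(row)
```

Each resulting record pairs a word form with the non-empty categories annotated for it, which is usually easier to filter or inspect than the column-wise layout.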

### Data Fields

Each instance in the dataset has the following fields:
- `lemma`: the common lemma for all forms
- `forms`: all annotated forms for this lemma, with:
  - `word`: the full word form
  - [`category`]: one field per category of the schema (for example `Tense` or `Number`), holding for each form a list of one or several tags from that category; several tags represent a composite tag, originally denoted with `A+B`. The full list of categories and the possible tags for each can be found [here](https://github.com/unimorph/unimorph.github.io/blob/master/unimorph-schema-json/dimensions-to-features.json)
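
To illustrate how a tag bundle maps onto per-category lists, including composite `A+B` tags, here is a minimal sketch. It rests on several assumptions: `CATEGORY_TAGS` is a small hand-written excerpt for illustration (the authoritative mapping is the JSON file linked above), `group_tags` is a hypothetical helper, and the bundle is assumed to separate tags with `;`. Tags are kept as strings here rather than the integer ids used in the dataset.

```python
from typing import Dict, List

# Hypothetical excerpt of a category-to-tags mapping, for illustration only;
# the full mapping is the dimensions-to-features JSON file linked above.
CATEGORY_TAGS: Dict[str, List[str]] = {
    "Part_Of_Speech": ["N", "V", "ADJ"],
    "Tense": ["PRS", "PST", "FUT"],
    "Number": ["SG", "PL"],
    "Case": ["NOM", "ACC", "DAT"],
}


def group_tags(bundle: str) -> Dict[str, List[str]]:
    """Group a `;`-separated tag bundle into one list of tags per category.

    A composite tag such as `NOM+ACC` contributes several tags to the same
    category. Raises KeyError for tags missing from CATEGORY_TAGS.
    """
    tag_to_category = {
        tag: category for category, tags in CATEGORY_TAGS.items() for tag in tags
    }
    grouped: Dict[str, List[str]] = {category: [] for category in CATEGORY_TAGS}
    for item in bundle.split(";"):
        for tag in item.split("+"):
            grouped[tag_to_category[tag]].append(tag)
    return grouped


print(group_tags("N;NOM+ACC;PL"))
```

Categories with no tag in the bundle are left as empty lists, mirroring the empty lists in the instance shown under Data Instances.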


### Data Splits

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

This card's metadata lists the license as Creative Commons Attribution-ShareAlike 3.0 (`cc-by-sa-3.0`).

### Citation Information

[More Information Needed]

### Contributions

Thanks to [@yjernite](https://github.com/yjernite) for adding this dataset.