The DLx Data Formats

A collection of JSON Schemas for representing scientific linguistic data.


Introduction

The canonical way that linguists represent linguistic data in their publications is with an interlinear gloss. This is typically a 3- or 4-line format that shows a phrase in the language of interest, the words and morphemes inside the phrase, what each of those morphemes means, and a translation of the phrase as a whole. Here is a short example of an interlinear gloss for a phrase in a language called Chitimacha:

Wetkx hus naancaakamankx weyt hi hokmiqi.                      (Transcription)
wetkx   hus   naancaaka-mank-x   weyt   hi      hok-mi-qi      (Morpheme Breakdown)
then    his   brother-PL-TOP     he     there   leave-PL-3sg   (Glosses)
'Then he left his brothers there.'                             (Translation)

While humans look at a representation like this and can see which glosses are associated with which morphemes, computers cannot rely on visual layouts in this way, and require more explicit structure. The purpose of the Digital Linguistics Data Format is to define a standard for representing interlinear glosses (as well as other linguistic information, such as dictionary entries) in a digital, computer-readable way.
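To make the contrast concrete, here is a minimal Python sketch of how a program might recover the pairing that a human reader sees at a glance. It assumes that the morpheme and gloss lines separate words with whitespace and morphemes with hyphens, as in the Chitimacha example above. The function name `align_glosses` and the flat `form`/`gloss` keys are illustrative choices, not part of any DLx specification.

```python
def align_glosses(morpheme_line, gloss_line):
    """Pair each morpheme with its gloss, word by word."""
    words = morpheme_line.split()
    word_glosses = gloss_line.split()
    if len(words) != len(word_glosses):
        raise ValueError("lines contain different numbers of words")

    aligned = []
    for word, word_gloss in zip(words, word_glosses):
        forms = word.split("-")
        glosses = word_gloss.split("-")
        if len(forms) != len(glosses):
            raise ValueError(f"morpheme/gloss mismatch in {word!r}")
        # One list of {form, gloss} pairs per word
        aligned.append([{"form": f, "gloss": g} for f, g in zip(forms, glosses)])
    return aligned


morpheme_line = "wetkx   hus   naancaaka-mank-x   weyt   hi      hok-mi-qi"
gloss_line = "then    his   brother-PL-TOP     he     there   leave-PL-3sg"

aligned = align_glosses(morpheme_line, gloss_line)
print(aligned[2])
```

The alignment only succeeds because the code hard-codes the layout conventions of this particular gloss. A structured format stores the pairings directly, so no such guessing is needed.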

There are many ways a linguist could choose to represent their data in digital form. Not only are many formats available (a relational database, XML, a tabular spreadsheet, JSON, etc.), but there is also significant flexibility in deciding what properties to include in your data and what to call them. For example, does the data about a text have a property specifying the language it was spoken in, and should that property be represented as "lang" or "language"?

The Digital Linguistics (DLx) project recommends a data format called JSON (JavaScript Object Notation) for digitally representing your linguistic data. Moreover, the DLx project has drafted recommendations for how to structure linguistic data using JSON. This recommended format was designed to capture hierarchical linguistic data in a way that aligns with the descriptive categories that linguists actually use, relying on fundamental linguistic notions such as text, morpheme, orthography, etc. For instance, this format is capable of capturing the fact that a text contains sentences, sentences contain words, words contain morphemes, and morphemes contain phonemes. This hierarchical structure turns out to be crucial for inputting, editing, searching, and analyzing linguistic data. At the same time, the DLx format is computer-readable, easily searchable, and natively supported by modern web-based tools.

The DLx project recommends JSON because it has become the dominant data interchange format for the modern web, and is supported, natively or through a standard library, by every major programming language. This makes it significantly easier for programmers to develop tools that use the DLx format, meaning that linguists will have a wider variety of options and helpful tools for managing their linguistic data. Moreover, JSON is easy for humans to read. Below is a short phrase represented in JSON. Notice that, even if you don't understand how the format works, you can see the hierarchical relationship between phrases, words, and morphemes, and you can tell which piece of data belongs to which kind of linguistic object.

{
  "transcription": {
    "spa": "Hola, me llamo Daniel.",
    "ipa": "ola me jamo dænjəl"
  },
  "translation": {
    "eng": "Hello, my name is Daniel."
  },
  "words": [
    {
      "transcription": {
        "spa": "hola",
        "ipa": "ola"
      },
      "translation": {
        "eng": "hello"
      },
      "morphemes": [
        {
          "form": {
            "spa": "hola",
            "ipa": "ola"
          },
          "gloss": {
            "eng": "hello"
          }
        }
      ]
    },
    {
      "transcription": {
        "spa": "me",
        "ipa": "me"
      },
      "translation": {
        "eng": "me"
      },
      "morphemes": [
        {
          "form": {
            "spa": "me",
            "ipa": "me"
          },
          "gloss": {
            "eng": "1sg.DO"
          }
        }
      ]
    },
    {
      "transcription": {
        "spa": "llamo",
        "ipa": "jamo"
      },
      "translation": {
        "eng": "I call"
      },
      "morphemes": [
        {
          "form": {
            "spa": "llam",
            "ipa": "jam"
          },
          "gloss": {
            "eng": "call"
          }
        },
        {
          "form": {
            "spa": "o",
            "ipa": "o"
          },
          "gloss": {
            "eng": "1sg.PRES.IND.SUBJ"
          }
        }
      ]
    },
    {
      "transcription": {
        "spa": "Daniel",
        "ipa": "dænjəl"
      },
      "translation": {
        "eng": "Daniel"
      },
      "morphemes": [
        {
          "form": {
            "spa": "Daniel",
            "ipa": "dænjəl"
          },
          "gloss": {
            "eng": "Daniel"
          }
        }
      ]
    }
  ]
}
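
As a rough illustration of how a program can work with this structure, the Python sketch below parses an abridged copy of the phrase above (only the words "me" and "llamo" are kept, for brevity) and rebuilds a gloss line from it. The helper name `gloss_line` is hypothetical; the property names come from the example.

```python
import json

# Abridged copy of the phrase above: two of its four words
phrase_json = """
{
  "transcription": { "spa": "Hola, me llamo Daniel." },
  "translation": { "eng": "Hello, my name is Daniel." },
  "words": [
    {
      "transcription": { "spa": "me", "ipa": "me" },
      "morphemes": [
        { "form": { "spa": "me", "ipa": "me" }, "gloss": { "eng": "1sg.DO" } }
      ]
    },
    {
      "transcription": { "spa": "llamo", "ipa": "jamo" },
      "morphemes": [
        { "form": { "spa": "llam", "ipa": "jam" }, "gloss": { "eng": "call" } },
        { "form": { "spa": "o", "ipa": "o" }, "gloss": { "eng": "1sg.PRES.IND.SUBJ" } }
      ]
    }
  ]
}
"""


def gloss_line(phrase, lang="eng"):
    """Join morpheme glosses with hyphens and words with spaces."""
    return " ".join(
        "-".join(morpheme["gloss"][lang] for morpheme in word["morphemes"])
        for word in phrase["words"]
    )


phrase = json.loads(phrase_json)
print(gloss_line(phrase))  # 1sg.DO call-1sg.PRES.IND.SUBJ
```

Because each gloss is stored inside the morpheme it describes, the code never has to infer the pairing from spacing or column position.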

JSON format is easy to learn: it consists of just a few simple rules.

Another great feature of JSON is that adding new properties to an Object doesn't change or in any way disrupt its other properties. This allows you to take your data from tool to tool without any tedious conversion or formatting. For example, say you've transcribed your data using a tool for morphological analysis, and now you want to add time alignment to each phrase using a different tool. If you were using FLEx and ELAN, you would first have to export from FLEx and create an ELAN file. In other words, you would have to change the data format just to change the type of annotation you want to add. With JSON, adding time alignment data is much simpler. The time alignment tool would merely add properties called "startTime" and "endTime" to the phrase, and enter their values. You could then take your data back to the morphological analysis tool without any conversion, because the data hasn't been altered, just extended. The underlying format stays the same.
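
The Python sketch below illustrates the idea of extending a phrase without altering it. The "startTime" and "endTime" property names come from the paragraph above; the numeric values, and the assumption that they are expressed in seconds, are invented for the example.

```python
import copy
import json

# A phrase as a morphological analysis tool might have left it
phrase = {
    "transcription": {"spa": "Hola, me llamo Daniel."},
    "translation": {"eng": "Hello, my name is Daniel."},
}
original = copy.deepcopy(phrase)

# A time alignment tool adds two properties (hypothetical values, assumed seconds)
phrase["startTime"] = 0.35
phrase["endTime"] = 2.10

# Every property the first tool wrote is still present and unchanged
unchanged = all(phrase[key] == value for key, value in original.items())
print(unchanged)
print(json.dumps(phrase, ensure_ascii=False, indent=2))
```

A tool that only knows about "transcription" and "translation" can read the extended phrase and simply ignore the properties it does not recognize.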

Schemas

Linguistic Schemas

The DLx project provides recommendations for how to format linguistic data in JSON for the following kinds of linguistic objects. Click each object to see its specification. Note that working data does not need to adhere to these schemas. Only data stored or exchanged in JSON format must follow these specifications. Developers may choose to represent the data internally in their software however they wish.
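
The distinction between working data and exchanged data can be sketched as follows. This Python example assumes a tool that keeps morphemes in its own internal `Morpheme` class and only produces JSON at export time. The class, the `to_exchange_json` function, and the exported shape (modeled on the example phrase in the Introduction rather than on the schemas themselves) are all illustrative assumptions.

```python
import json
from dataclasses import dataclass


@dataclass
class Morpheme:
    """Internal working representation; free to differ from the exchange format."""
    form: str
    gloss: str


def to_exchange_json(morphemes, form_lang="spa", gloss_lang="eng"):
    """Convert internal objects to a JSON string shaped like the Introduction example."""
    data = {
        "morphemes": [
            {"form": {form_lang: m.form}, "gloss": {gloss_lang: m.gloss}}
            for m in morphemes
        ]
    }
    return json.dumps(data, ensure_ascii=False)


working = [Morpheme("llam", "call"), Morpheme("o", "1sg.PRES.IND.SUBJ")]
exported = to_exchange_json(working)
print(exported)
```

Only the exported string would need to conform to a published schema; the `Morpheme` class can change freely without affecting other tools.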

Non-Linguistic Schemas

Non-linguistic objects are given specifications as well (click on the name of each to see its specification):

| Schema | Description |
| --- | --- |
| Abbreviation | A human-readable abbreviation, containing no spaces, and only letters A-Z or numbers. |
| Access | Information about who should be allowed to access the current data. Access rights can be specified in many of the formats used by well-known linguistic archives such as ELAR or AILLA. |
| Address | A postal address. |
| Bundle | A collection of resources relating to a single event or task, such as all the files relating to a certain elicitation session, or all the field notes from a given day. |
| Contributor | Information about a person who contributed to the given resource, and the role they played. For example, most texts will have a contributor with the role of speaker specified. |
| DateCreated | The date a database resource was created (not the date the item was recorded). |
| DateModified | The date a database resource was last modified. |
| DateRecorded | The date a database resource (usually a text) was recorded. |
| LexemeReference | An object that contains a reference to any item in a lexicon. |
| Location | A location with optional geographic coordinates. |
| Media | Information and metadata about a media file (e.g. WAV, PDF, or JPEG files, etc.). |
| MultiLangString | An object containing a string in multiple orthographies. Usually this is a transcription of some linguistic data. |
| Note | Most DLx resources allow you to add notes in different languages, of different types. |
| Person | Information about a person, e.g. speaker, linguist, editor, translator, etc. |
| Reference | A bibliographic reference. |
| Tags | A collection of tags on the given resource. Particularly useful for tagging instances of a phenomenon in your corpora. |
| Url | A URL. |
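
As a hedged illustration of what two of these descriptions imply, the Python sketch below checks values against the Abbreviation and MultiLangString descriptions above. It is based only on those one-line descriptions, not on the actual schemas. In particular, it assumes that "letters A-Z" includes lowercase letters and that a MultiLangString is a plain object whose keys and values are all strings. The function names are hypothetical.

```python
import re

# Assumption: ASCII letters in either case, plus digits; no spaces
ABBREVIATION_PATTERN = re.compile(r"[A-Za-z0-9]+")


def is_abbreviation(value):
    """True if value is a non-empty string of ASCII letters and digits only."""
    return isinstance(value, str) and ABBREVIATION_PATTERN.fullmatch(value) is not None


def is_multi_lang_string(value):
    """True if value is a dict mapping string keys to string values."""
    return isinstance(value, dict) and all(
        isinstance(key, str) and isinstance(text, str)
        for key, text in value.items()
    )


print(is_abbreviation("spa"))
print(is_abbreviation("my lang"))
print(is_multi_lang_string({"spa": "hola", "ipa": "ola"}))
```

For real validation, the published JSON Schemas are the authority; a check like this is only useful for quick sanity tests during development.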