Published November 22, 2022 | Version v1
Dataset Open

Aligned Latin-Myanmar Transliteration Dataset

  • 1. NICT

Description

Aligned Latin-Myanmar Transliteration Dataset

                    Chenchen Ding
                    Tue Nov 22 00:00:00 JST 2022

* Introduction

This data set is a further refined and annotated version of the data at
https://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/western-myanmar-transliteration.zip

The data set was developed by Chenchen Ding of NICT. It is released under the following license:

Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License
https://creativecommons.org/licenses/by-nc-sa/4.0/

* Contents

- data.txt : 42,736 segmented and aligned instances.

* Format

Each line contains one segmented transliteration pair in the format

[Latin segment 1] | [Latin segment 2] | ... ||| [Myanmar segment 1] | [Myanmar segment 2] | ...

where the Latin side and the Myanmar side have the same number of segments.
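
The format above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the function names `parse_line` and `parse_lines` are hypothetical, and it assumes that whitespace around the `|` and `|||` separators is not significant. The sample line uses made-up ASCII tokens (`M1`, `M2`) in place of real Myanmar segments.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_line(line: str) -> Tuple[List[str], List[str]]:
    """Split one line into aligned Latin and Myanmar segment lists.

    Assumes surrounding whitespace around separators can be stripped.
    """
    latin_side, sep, myanmar_side = line.strip().partition("|||")
    if not sep:
        raise ValueError("line has no '|||' separator")
    latin = [seg.strip() for seg in latin_side.split("|")]
    myanmar = [seg.strip() for seg in myanmar_side.split("|")]
    # The README states that both sides have the same number of segments.
    if len(latin) != len(myanmar):
        raise ValueError(
            f"segment count mismatch: {len(latin)} vs {len(myanmar)}"
        )
    return latin, myanmar


def parse_lines(lines: Iterable[str]) -> Iterator[Tuple[List[str], List[str]]]:
    """Parse every non-empty line of an iterable such as an open file."""
    for line in lines:
        if line.strip():
            yield parse_line(line)


# Hypothetical sample line; M1 and M2 stand in for Myanmar segments.
sample = "b | a | t ||| M1 | M2 | @"
latin_segments, myanmar_segments = parse_line(sample)
```

To read the whole file, something like `list(parse_lines(open("data.txt", encoding="utf-8")))` should work, assuming the file is UTF-8 encoded, which the README does not state.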

* Annotation Guidelines

- The Latin side is only segmented; nothing is inserted into it.
- A placeholder @ is inserted on the Myanmar side for unaligned Latin segments.

- Consonant clusters at the syllable onset are generally segmented and aligned to Myanmar basic letters.
- Consonants in the coda are generally aligned to the placeholder, unless they are absorbed by a rhyme with nasalization or a glottal stop, or by an extra explicit killed-letter.
- Doubled consonant letters are generally segmented and treated as the coda and the onset of two neighboring syllables.

- Myanmar rhymes are generally not segmented.
- The Myanmar letter A (0x1021) is left unsegmented in vowel-initial words, because nothing can be inserted on the Latin side.

The data can be directly used to train a sequence-labeling model for Myanmar Romanization.
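
One way to turn an aligned pair into sequence-labeling data is sketched below. This is an illustrative encoding, not necessarily the one the author used: it assumes the Latin characters are the input sequence and the Myanmar segments are the labels, and it introduces a hypothetical continuation tag `<cont>` for the non-initial characters of a multi-character Latin segment. The tokens `M1` and `M2` are made-up stand-ins for Myanmar segments.

```python
from typing import List, Tuple

PLACEHOLDER = "@"   # marks an unaligned Latin segment (from the guidelines)
CONT = "<cont>"     # hypothetical tag: character continues the current segment


def to_char_labels(
    latin: List[str], myanmar: List[str]
) -> Tuple[List[str], List[str]]:
    """Give each Latin character one label.

    The first character of a segment carries the aligned Myanmar segment
    (or the placeholder); the remaining characters carry CONT.
    """
    if len(latin) != len(myanmar):
        raise ValueError("both sides must have the same number of segments")
    chars: List[str] = []
    labels: List[str] = []
    for latin_seg, myanmar_seg in zip(latin, myanmar):
        for i, ch in enumerate(latin_seg):
            chars.append(ch)
            labels.append(myanmar_seg if i == 0 else CONT)
    return chars, labels


def decode(labels: List[str]) -> str:
    """Rebuild the Myanmar string by dropping placeholder and CONT labels."""
    return "".join(lab for lab in labels if lab not in (PLACEHOLDER, CONT))


# Hypothetical aligned pair; the final Latin segment is unaligned.
chars, labels = to_char_labels(["th", "a", "n"], ["M1", "M2", "@"])
```

Because the Latin side contains no inserted symbols, the character sequence is the original Latin string, so a tagger trained on such pairs would learn segmentation and transliteration jointly.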

* Disclaimer

[1] NICT bears no responsibility for the contents of the corpus and the lexicon and assumes no liability for any direct or indirect damage or loss whatsoever that may be incurred as a result of using the corpus or the lexicon.

[2] If any copyright infringement or other problems are found in the corpus or the lexicon, please contact us at alt-info [at] khn [dot] nict [dot] go [dot] jp. We will review the issue and undertake appropriate measures when needed.

 

Files

Aligned_Latin_Myanmar_Transliteration_Dataset.zip (290.1 kB)
md5:d8ee186b18491cd9b08e625ae7057c13