Published August 2, 2021 | Version v1
Dataset Open

Named-Entity Recognition for Modern Tibetan Newspapers: Tagset, Guidelines and Training Data

  • 1. SOAS, University of London
  • 2. MIASU, Cambridge University

Contributors

  • 1. Theoretical and Applied Linguistics, University of Cambridge
  • 2. MIASU, Cambridge University

Description

This dataset, tagset and guidelines were the output of a six-month incubator project on the feasibility of developing Named-Entity Recognition (NER) for modern Tibetan, primarily for use with contemporary Tibetan-language newspapers and media published inside the PRC. The project was carried out by the Mongolian and Inner Asian Studies Unit at Cambridge University’s Department of Social Anthropology. It was funded by an incubator grant from Cambridge Language Sciences. The project title was “Named-Entity Recognition in Tibetan and Mongolian Newspapers.” The Project PI was Dr Hildegard Diemberger (Cambridge), the Coordinator and Lead Author was Dr Robert Barnett (SOAS), and Senior Advisers were Dr Nathan Hill (SOAS), Dr Marieke Meelen (Cambridge), and Dr Thomas White (Cambridge). 

Although some forms of NER and other NLP procedures have been developed within China for modern Tibetan (see Liu, Nuo et al., 2011), the data underlying those initiatives have not been made publicly available and their findings cannot be tested or reproduced. Significant work on developing NLP for Tibetan has been carried out outside China, but has focused largely on classical Tibetan and religious texts (see Hill & Garrett, 2017).

The Cambridge incubator project therefore produced a tagset, guidelines and training data for developing NER for modern Tibetan, with a focus on historical and political analysis of contemporary newspapers, media and other public documents in Tibetan. We compiled 3.11 million syllables of data in Tibetan extracted from articles downloaded from Chinese-language news aggregator sites within China, primarily tibet.cpc.people.com.cn and tibet.people.com.cn. From this data, we selected texts containing 280,000 syllables in Tibetan, grouped in 26,000 utterances/sentences (available on request). Using Lighttag, an online annotation site, we developed a tagset for NER consisting of 17 tags (and one for wrong segmentation if using segmented data). We annotated approximately 186,000 syllables, leading to 9,884 annotations. Of these, after discounting flawed data, we produced training data containing c. 6,700 annotations. We carried out the secondary, manual review offline (for our method of converting Lighttag data for offline review, see the attached report “Using Spreadsheets to Review Annotations Offline.pdf”), and found an error rate of 3.6%. The final total of reviewed annotations was 6,624.

The dataset, tagset, guidelines and reports were developed and documented by Robert Barnett, with assistance from Tsering Samdrup, Dr Hill and Dr Meelen. Primary annotation was by Tsering Samdrup, assisted by Dr Barnett.

The datasets published here include:       

  1. The tagset guidelines and annotation manual, including the 17-tag tagset, guidelines, and recommendations ("NER for Modern Tibetan-tagset and guidelines.pdf").
  2. The tagged training data in .csv format ("Tibetan NER Training Data-tagged, reviewed wth context-v10-UTF-8.csv") and .xlsx format ("Tibetan NER Training Data-tagged with context-v10-UTF-8.xlsx"). This includes 6,624 reviewed annotations, arranged according to the Tibetan alphabet, together with the tags and context (utterance) for each annotation.
  3. The raw annotation results downloaded from Lighttag as .json files ("Raw Training Data for NER in Modern Tibetan -Jobs2-11-JSON.zip") and as .xls files ("Training Data for NER in Modern Tibetan -Jobs2-11-XLS.zip"). These include 10 "tasks" or datasets of articles scraped from Tibetan-language websites within Tibet.
  4. A guide to preparing Lighttag annotation results for manual review offline (“Using Spreadsheets to Review Annotations Offline.pdf”).

The project's findings regarding the status of NER and NLP for vertical Mongolian are available at DOI: 10.5281/zenodo.5103499.

Notes

Note: To view normal CSV files with Tibetan content in Excel, do *not* open them as normal files, but always *import* their content into Excel via Data/FromText - otherwise the Tibetan script (even if Unicode) will be erased. Alternatively, save these files in UTF-8 CSV format, not in plain CSV format.
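
For users who prefer to work outside Excel, the following is a minimal Python sketch of the same precaution: reading and writing the CSV explicitly as UTF-8 so that the Tibetan script is preserved. It is an illustration only, not part of the project's workflow. The column names ("annotation", "tag", "context"), the tag value and the sample row are hypothetical placeholders and may not match the published file; the demonstration writes to a temporary file rather than assuming the location of the real data.

```python
import csv
import os
import tempfile

# Hypothetical column names; the published CSV may use different headers.
FIELDNAMES = ["annotation", "tag", "context"]


def read_utf8_csv(path):
    """Read a CSV file as UTF-8 and return its rows as a list of dicts."""
    # "utf-8-sig" accepts UTF-8 files with or without a byte-order mark (BOM).
    with open(path, encoding="utf-8-sig", newline="") as handle:
        return list(csv.DictReader(handle))


def write_utf8_csv(path, rows, fieldnames):
    """Write rows as UTF-8 with a BOM, which Excel generally uses to detect UTF-8."""
    with open(path, "w", encoding="utf-8-sig", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


# Hypothetical sample row (not taken from the dataset), used to show the round trip.
sample_rows = [
    {"annotation": "ལྷ་ས་", "tag": "EXAMPLE_TAG", "context": "ལྷ་ས་གྲོང་ཁྱེར།"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo-UTF-8.csv")
    write_utf8_csv(demo_path, sample_rows, FIELDNAMES)
    # Check that the file on disk begins with the UTF-8 BOM bytes.
    with open(demo_path, "rb") as raw:
        starts_with_bom = raw.read(3) == b"\xef\xbb\xbf"
    loaded = read_utf8_csv(demo_path)

# The Tibetan text should survive the write/read cycle unchanged.
print(loaded == sample_rows)
```

To apply this to the published training data, `read_utf8_csv` could be called with the path to the downloaded .csv file; the keys of the returned dicts will then be whatever headers that file actually contains.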

Files

NER for Modern Tibetan-Tagset and Guidelines.pdf

Additional details

References