2024-03-29T07:51:17Z
https://zenodo.org/oai2d
oai:zenodo.org:4727108
2021-04-30T01:48:24Z
user-tibnlp
openaire_data
Christian Faggionato
Edward Garrett
Nathan W. Hill
Samyo Rode
Nikolai Solmsdorf
Sonam Wangyal
2021-04-29
<p>This is a small hand-annotated partial treebank of Tibetan, primarily in CoNLL-U format. It builds upon the following corpus:</p>
<p>Hill, Nathan W., & Garrett, Edward. (2017). A part-of-speech (POS) tagged corpus of Classical Tibetan [Data set]. Zenodo. <a href="http://doi.org/10.5281/zenodo.574878">http://doi.org/10.5281/zenodo.574878</a></p>
<p>This corpus differs from the above in three ways:</p>
<ol>
<li>The tagset has been converted from the SOAS tag system to the Universal Dependency part-of-speech tagset.</li>
<li>We have added dependency relations between verbs and their arguments.</li>
<li>For some of the texts, English translations were available in digital form. These translations were manually aligned to the Tibetan texts and included in the CoNLL-U files.</li>
</ol>
<p>It was created as part of the AHRC-funded project <em>Lexicography in Motion</em> (PI Ulrich Pagel, 2017-2021).</p>
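As a rough illustration of how verb-argument relations in a CoNLL-U file might be read, the sketch below relies only on the ten standard tab-separated CoNLL-U columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The sample sentence, its placeholder word forms and the function names are invented for illustration and are not taken from the corpus. Because the treebank is partial, the sketch assumes that unannotated tokens carry `_` in the HEAD column.

```python
from typing import Iterator, List, Tuple

Triple = Tuple[str, str, str]  # (dependent form, relation, head form)


def _triples(rows: List[List[str]]) -> List[Triple]:
    """Turn the token rows of one sentence into dependency triples."""
    forms = {cols[0]: cols[1] for cols in rows}
    forms["0"] = "ROOT"  # HEAD = 0 marks the root of the sentence
    out: List[Triple] = []
    for cols in rows:
        head, deprel = cols[6], cols[7]
        if head == "_":  # assumption: token left unannotated in a partial treebank
            continue
        out.append((cols[1], deprel, forms.get(head, "?")))
    return out


def read_dependencies(conllu_text: str) -> Iterator[List[Triple]]:
    """Yield one list of dependency triples per sentence."""
    rows: List[List[str]] = []
    # The extra empty string flushes the last sentence if the text
    # does not end with a blank line.
    for line in conllu_text.splitlines() + [""]:
        if line.startswith("#"):  # sentence-level comment line
            continue
        if not line.strip():  # a blank line ends a sentence
            if rows:
                yield _triples(rows)
                rows = []
            continue
        cols = line.split("\t")
        # Keep plain word lines only; skip ranges (1-2) and empty nodes (1.1).
        if len(cols) == 10 and cols[0].isdigit():
            rows.append(cols)


# Invented sample: w1..w3 are placeholder forms, not corpus data.
SAMPLE = (
    "# sent_id = example-1\n"
    "1\tw1\t_\tNOUN\t_\t_\t3\tnsubj\t_\t_\n"
    "2\tw2\t_\tNOUN\t_\t_\t_\t_\t_\t_\n"
    "3\tw3\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
)

for sentence in read_dependencies(SAMPLE):
    print(sentence)
```

Tokens without a head are skipped rather than treated as errors, which fits a treebank where only verbs and their arguments are annotated.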
Funded by the UK's Arts and Humanities Research Council (grant code: AH/P004644/1)
https://doi.org/10.5281/zenodo.4727108
oai:zenodo.org:4727108
Zenodo
https://github.com/tibetan-nlp/classical-tibetan-corpus/tree/v1.0
https://zenodo.org/communities/tibnlp
https://doi.org/10.5281/zenodo.4727107
info:eu-repo/semantics/openAccess
Other (Open)
Tibetan language
natural language processing
corpus linguistics
classical languages
universal dependencies
Classical Tibetan corpus annotated for verb-argument dependency relations
info:eu-repo/semantics/other
oai:zenodo.org:574882
2021-04-29T14:30:12Z
user-tibnlp
openaire_data
Garrett, Edward
Hill, Nathan W.
2017-05-11
<p>This rule-based Tibetan part-of-speech (POS) tagger was prepared in the course of the research project 'Tibetan in Digital Communication' (2012-2015) hosted at SOAS, University of London and funded by the UK's Arts and Humanities Research Council (grant code: AH/J00152X/1). For a description of the tag set see Garrett et al. 2014 and Garrett et al. 2015. For a description of the tagger itself see Garrett et al. 2014. Note that the tagger must be used together with a lexicon (for example Hill & Garrett 2017a). Users must first run their own script to assign every word all of the tags listed for it in the lexicon, and then apply the tagger to remove the incorrect tags.</p>
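The preliminary step of assigning every lexicon tag to every word might look roughly like the following sketch. The toy lexicon, the tag names and the `UNKNOWN` fallback for words missing from the lexicon are illustrative assumptions; the actual lexicon format and tag set are not described in this record.

```python
from typing import Dict, List, Set, Tuple


def lexical_tag(
    words: List[str],
    lexicon: Dict[str, Set[str]],
    unknown: str = "UNKNOWN",  # hypothetical fallback tag, not part of the real tag set
) -> List[Tuple[str, Set[str]]]:
    """Attach to each word every tag the lexicon lists for it."""
    return [(word, set(lexicon.get(word, {unknown}))) for word in words]


# Toy data: placeholder words and tags, invented for illustration.
toy_lexicon = {"w1": {"n"}, "w2": {"n", "v"}}
print(lexical_tag(["w1", "w2", "w3"], toy_lexicon))
```

The ambiguous output of this step (here `w2` keeps two tags) is what a rule-based tagger would then prune.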
<p>On the associated corpus of 318,230 words (Hill & Garrett 2017b), the lexical tagger (i.e. simply applying all available tags to all words) tags 141,911 words with the correct unique tag and achieves an accuracy of 1.000 (by definition, since the right tag is always among those assigned to each word) with an ambiguity of 2.73111. In contrast, the Rule Tagger tags 241,256 words with the correct unique tag and achieves an accuracy of 0.99893 with an ambiguity of 1.38577.</p>
<p>Because this tagger does not achieve an ambiguity of 1.000, it is not suitable for tagging large-scale corpora; it is instead useful for creating gold standard training data.</p>
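The three figures reported above can be sketched as follows, under two assumptions that this record does not state outright: accuracy is taken to be the share of words whose tag set contains the gold tag, and ambiguity the mean number of tags remaining per word. The toy tag sets and gold tags are invented for illustration.

```python
from typing import List, Set, Tuple


def evaluate(tagged: List[Set[str]], gold: List[str]) -> Tuple[int, float, float]:
    """Return (uniquely correct words, accuracy, ambiguity) for one tagging."""
    n = len(gold)
    # Words left with exactly one tag, and that tag is the gold tag.
    unique_correct = sum(1 for tags, g in zip(tagged, gold) if tags == {g})
    # Assumed definition: the gold tag is among the remaining tags.
    accuracy = sum(1 for tags, g in zip(tagged, gold) if g in tags) / n
    # Assumed definition: mean number of tags per word.
    ambiguity = sum(len(tags) for tags in tagged) / n
    return unique_correct, accuracy, ambiguity


# Toy data, invented for illustration.
toy_tagged = [{"n", "v"}, {"n"}, {"v", "adj", "n"}, {"adj"}]
toy_gold = ["n", "n", "v", "v"]
print(evaluate(toy_tagged, toy_gold))  # (1, 0.75, 1.75)
```

Read this way, the reported counts mean that the share of words receiving exactly the correct tag rises from about 44.6% (141,911 of 318,230) with the lexical tagger to about 75.8% (241,256 of 318,230) with the Rule Tagger.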
<p>N.B. In some rare cases the tagger removes all POS-tags.</p>
Funded by the UK's Arts and Humanities Research Council (grant code: AH/J00152X/1)
https://doi.org/10.5281/zenodo.574882
oai:zenodo.org:574882
Zenodo
https://zenodo.org/communities/tibnlp
https://doi.org/
info:eu-repo/semantics/openAccess
Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
Part of speech tagging
Tibetan language
corpus linguistics
A rule based Tibetan part-of-speech (POS) tagger for the creation of gold standard training data
info:eu-repo/semantics/other
oai:zenodo.org:803268
2021-04-29T14:31:11Z
user-tibnlp
openaire_data
Germano, David
Garrett, Edward
Weinberger, Stephen
2017-06-06
<p>This corpus of spoken Tibetan was compiled in Tibet in the early 2000s by the Tibetan and Himalayan Digital Library project based at the University of Virginia. Aligned video and audio files are available at the Shanti homepage of UVA (see shanti.virginia.edu).</p>
https://doi.org/10.5281/zenodo.803268
oai:zenodo.org:803268
Zenodo
https://zenodo.org/communities/tibnlp
https://doi.org/10.5281/zenodo.803267
info:eu-repo/semantics/openAccess
Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
Tibetan language
Corpus Linguistics
spoken Tibetan
UVA Tibetan Spoken Corpus
info:eu-repo/semantics/other
oai:zenodo.org:821218
2020-01-24T19:25:15Z
user-tibnlp
openaire_data
Wallman, Jeff
Rowinski, Zach
Ngawang Trinley
Tomlinson, Chris
Keutzer, Kurt
2017-04-28
<p>This is the Tibetan etext collection of the Buddhist Digital Resource Center (www.tbrc.org) as of April 28, 2017.</p>
https://doi.org/10.5281/zenodo.821218
oai:zenodo.org:821218
Zenodo
https://doi.org/10.5281/zenodo.823707
https://zenodo.org/communities/tibnlp
https://doi.org/10.5281/zenodo.821217
info:eu-repo/semantics/openAccess
Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
Tibetan language
Tibetan Buddhism
etexts
Collection of Tibetan etexts compiled by the Buddhist Digital Resource Center
info:eu-repo/semantics/other
oai:zenodo.org:4322933
2021-04-29T14:38:16Z
user-tibnlp
openaire_data
user-natural-language-processing
user-minority-nlp
Yliniemi, Juha
Hill, Nathan
2020-12-15
<p>This is a collection of Drenjong (Sikkimese) texts for use in NLP.</p>
https://doi.org/10.5281/zenodo.4322933
oai:zenodo.org:4322933
sip
Zenodo
https://zenodo.org/communities/natural-language-processing
https://zenodo.org/communities/minority-nlp
https://zenodo.org/communities/tibnlp
https://doi.org/10.5281/zenodo.4322932
info:eu-repo/semantics/openAccess
Other (Public Domain)
Drenjongke
Sikkimese
Natural Language Processing
natural language processing
A collection of Drenjong texts for use in NLP
info:eu-repo/semantics/other
oai:zenodo.org:4727200
2021-04-30T01:48:26Z
user-tibnlp
software
Christian Faggionato
2021-04-29
<p>This resource contains two grammars for Tibetan language processing, implemented using the <a href="https://visl.sdu.dk/cg3.html">VISL CG3</a> system:</p>
<ol>
<li><em>Other Dependencies</em>: This grammar takes a Tibetan source file which includes POS-tags and verb-argument dependencies, and adds additional dependencies which can be ascertained with a high degree of certainty.</li>
<li><em>Verb Argument Dependencies</em>: This grammar takes a Tibetan source file which includes POS-tags and adds verb-argument dependencies to it.</li>
</ol>
<p>This work was created as part of the AHRC-funded project <em>Lexicography in Motion</em> (PI Ulrich Pagel, 2017-2021).</p>
Funded by the UK's Arts and Humanities Research Council (grant code: AH/P004644/1)
https://doi.org/10.5281/zenodo.4727200
oai:zenodo.org:4727200
Zenodo
https://github.com/tibetan-nlp/tibcg3/tree/v1.0
https://zenodo.org/communities/tibnlp
https://doi.org/10.5281/zenodo.4727199
info:eu-repo/semantics/openAccess
Other (Open)
Tibetan language
natural language processing
universal dependencies
constraint grammar
Constraint grammars for Tibetan dependency parsing
info:eu-repo/semantics/other
oai:zenodo.org:4727129
2021-04-30T01:48:24Z
user-tibnlp
openaire_data
Jamyang Dakpa
Tashi Dhondup
Yeshi Jigme Gangne
Edward Garrett
Marieke Meelen
Sonam Wangyal
2021-04-29
<p>This is a small hand-annotated partial treebank of Modern Tibetan, primarily in CoNLL-U format. Some texts were POS-tagged by machine, and then dependency relations between verbs and their arguments were added by hand. Other texts include only dependency relations and relevant POS-tags. A number of the texts have English translations which have been manually aligned to the Tibetan text.</p>
<p>This work was created as part of the AHRC-funded project <em>Lexicography in Motion</em> (PI Ulrich Pagel, 2017-2021).</p>
Funded by the UK's Arts and Humanities Research Council (grant code: AH/P004644/1)
https://doi.org/10.5281/zenodo.4727129
oai:zenodo.org:4727129
Zenodo
https://github.com/tibetan-nlp/modern-tibetan-corpus/tree/v1.0
https://zenodo.org/communities/tibnlp
https://doi.org/10.5281/zenodo.4727128
info:eu-repo/semantics/openAccess
Other (Open)
Tibetan language
natural language processing
corpus linguistics
universal dependencies
Modern Tibetan corpus annotated for verb-argument dependency relations
info:eu-repo/semantics/other
oai:zenodo.org:3951486
2020-07-20T00:59:22Z
user-tibnlp
openaire_data
Meelen, Marieke
Roux, Élie
2020-05-04
<p>This corpus, consisting of more than 185 million tokens, is a segmented and part-of-speech tagged version of</p>
<p>Wallman, Jeff, Rowinski, Zach, Ngawang Trinley, Tomlinson, Chris, & Keutzer, Kurt. (2017). Collection of Tibetan etexts compiled by the Buddhist Digital Resource Center [Data set]. Zenodo. http://doi.org/10.5281/zenodo.821218</p>
<p>using the training data of</p>
<p>Hill, Nathan W., & Garrett, Edward. (2017). A part-of-speech (POS) tagged corpus of Classical Tibetan [Data set]. Zenodo. http://doi.org/10.5281/zenodo.574878</p>
<p>The code for segmenting and POS tagging any Tibetan file can be found on GitHub.</p>
<p>This Version 2 of ACTib is based on the same XML files as ACTib Version 1 (http://doi.org/10.5281/zenodo.823707), but contains both segmented and POS-tagged files and is improved in a number of ways, although post-processing was still done automatically and no manual correction was involved. For details of this improved annotation method see:</p>
<p>Meelen, Marieke, Roux, Élie & Hill, Nathan (forthcoming). 'Optimisation of the largest annotated Tibetan corpus combining rule-based, memory-based & deep-learning methods' in <em>TALLIP.</em></p>
Acknowledgements go to the British Academy for funding Meelen's research through grant pf170063.
https://doi.org/10.5281/zenodo.3951486
oai:zenodo.org:3951486
Zenodo
https://zenodo.org/communities/tibnlp
https://doi.org/10.5281/zenodo.3785070
info:eu-repo/semantics/openAccess
Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
Tibetan language
Natural Language Processing
Annotated Historical Corpus
Segmentation
POS tagging
The Annotated Corpus of Classical Tibetan (ACTib) - Version 2.0 (Segmented & POS-tagged)
info:eu-repo/semantics/other
oai:zenodo.org:3785071
2020-07-19T14:54:58Z
user-tibnlp
openaire_data
Meelen, Marieke
Roux, Élie
2020-05-04
<p>This corpus, consisting of more than 185 million tokens, is a segmented and part-of-speech tagged version of</p>
<p>Wallman, Jeff, Rowinski, Zach, Ngawang Trinley, Tomlinson, Chris, & Keutzer, Kurt. (2017). Collection of Tibetan etexts compiled by the Buddhist Digital Resource Center [Data set]. Zenodo. http://doi.org/10.5281/zenodo.821218</p>
<p>using the training data of</p>
<p>Hill, Nathan W., & Garrett, Edward. (2017). A part-of-speech (POS) tagged corpus of Classical Tibetan [Data set]. Zenodo. http://doi.org/10.5281/zenodo.574878</p>
<p>The code for segmenting and POS tagging any Tibetan file can be found on GitHub.</p>
<p>This Version 2 of ACTib is based on the same XML files as ACTib Version 1 (http://doi.org/10.5281/zenodo.823707), but contains both segmented and POS-tagged files and is improved in a number of ways, although post-processing was still done automatically and no manual correction was involved. For details of this improved annotation method see:</p>
<p>Meelen, Marieke, Roux, Élie & Hill, Nathan (forthcoming). 'Optimisation of the largest annotated Tibetan corpus combining rule-based, memory-based & deep-learning methods' in <em>TALLIP.</em></p>
Acknowledgements go to the British Academy for funding Meelen's research through grant pf170063.
https://doi.org/10.5281/zenodo.3785071
oai:zenodo.org:3785071
Zenodo
https://zenodo.org/communities/tibnlp
https://doi.org/10.5281/zenodo.3785070
info:eu-repo/semantics/openAccess
Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
Tibetan language
Natural Language Processing
Annotated Historical Corpus
Segmentation
POS tagging
The Annotated Corpus of Classical Tibetan (ACTib) - Version 2.0 (Segmented & POS-tagged)
info:eu-repo/semantics/other
oai:zenodo.org:4727552
2021-04-30T01:48:24Z
user-tibnlp
openaire_data
Christian Faggionato
Edward Garrett
Marieke Meelen
2021-04-29
<p>This resource includes two tagged and annotated Tibetan texts: the <em>Old Tibetan Annals</em> and the <em>Old Tibetan Chronicle</em>.</p>
<p>The texts were first normalized using the included CG3 grammar, to make them look and behave like Classical Tibetan. They were then part-of-speech tagged by machine. However, these machine tags were individually checked by hand, so we can say both texts have been manually part-of-speech tagged.</p>
<p>In the second stage, verb-argument dependency relations were manually added to the text.</p>
<p>In the third stage, a digital version of Brandon Dotson's translation of the <em>Old Tibetan Annals</em> was obtained and aligned to the Tibetan text.</p>
<p>In the final stage, the text was denormalized back to Old Tibetan orthography, so that the CoNLL-U files reflect the original OT source data.</p>
<p>This work was created as part of the AHRC-funded project <em>Lexicography in Motion</em> (PI Ulrich Pagel, 2017-2021).</p>
Funded by the UK's Arts and Humanities Research Council (grant code: AH/P004644/1)
https://doi.org/10.5281/zenodo.4727552
oai:zenodo.org:4727552
Zenodo
https://github.com/tibetan-nlp/old-tibetan-corpus/tree/v1.0
https://zenodo.org/communities/tibnlp
https://doi.org/10.5281/zenodo.4727551
info:eu-repo/semantics/openAccess
Other (Open)
Tibetan language
text normalization
corpus linguistics
Old Tibetan
classical languages
natural language processing
universal dependencies
Old Tibetan Corpus and Normalization Grammar
info:eu-repo/semantics/other
oai:zenodo.org:4726991
2021-04-30T01:48:24Z
user-tibnlp
openaire_data
Nathan W. Hill
Edward Garrett
2021-04-29
<p>This resource contains versions of Nathan Hill's Tibetan verb dictionary which can be used for various corpus and computational purposes.</p>
<p>It was created as part of the AHRC-funded project <em>Lexicography in Motion</em> (PI Ulrich Pagel, 2017-2021).</p>
Funded by the UK's Arts and Humanities Research Council (grant code: AH/P004644/1)
https://doi.org/10.5281/zenodo.4726991
oai:zenodo.org:4726991
Zenodo
https://github.com/tibetan-nlp/lexicon-of-tibetan-verb-stems/tree/v1.0
https://zenodo.org/communities/tibnlp
https://doi.org/10.5281/zenodo.4726990
info:eu-repo/semantics/openAccess
Other (Open)
natural language processing
Tibetan language
lexicon
verbs
Lexicon of Tibetan Verb Stems
info:eu-repo/semantics/other
oai:zenodo.org:574878
2021-04-29T14:29:12Z
user-tibnlp
openaire_data
Whiteman, Oliver
Grokhovskiy, Pavel
Biondo, Serena
Hill, Nathan W.
Garrett, Edward
2017-05-11
<p>This part-of-speech (POS) tagged corpus of Classical Tibetan was prepared in the course of the research project 'Tibetan in Digital Communication' (2012-2015) hosted at SOAS, University of London and funded by the UK's Arts and Humanities Research Council (grant code: AH/J00152X/1). For a description of the tag set see Garrett et al. 2014 and Garrett et al. 2015. This corpus includes the <em>Mdzaṅs blun</em> (9th century, canonical), the <em>Bu ston chos ḥbyuṅ</em> (13th century, ecclesiastical history), the <em>Mi la ras paḥi rnam thar</em> and <em>Mar paḥi rnam thar</em> (15th century, biography).</p>
Funded by the UK's Arts and Humanities Research Council (grant code: AH/J00152X/1)
https://doi.org/10.5281/zenodo.574878
oai:zenodo.org:574878
Zenodo
https://doi.org/10.5281/zenodo.822537
https://doi.org/10.5281/zenodo.821218
https://zenodo.org/communities/tibnlp
https://doi.org/
info:eu-repo/semantics/openAccess
Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
Tibetan language
corpus linguistics
Classical Tibetan
A part-of-speech (POS) tagged corpus of Classical Tibetan
info:eu-repo/semantics/other
oai:zenodo.org:4536516
2022-12-06T12:36:33Z
user-tibnlp
user-tibetica
openaire_data
Marieke Meelen
Thomas White
Barnett, Robert
Hill, Nathan
Diemberger, Hildegard
Samdrup, Tsering
2021-08-02
<p>This dataset, tagset and guidelines were the output of a six-month incubator project on the feasibility of developing Named-Entity Recognition (NER) for modern Tibetan, primarily for use with contemporary Tibetan-language newspapers and media published inside the PRC. The project was carried out by the Mongolian and Inner Asian Studies Unit at Cambridge University’s Department of Social Anthropology. It was funded by an incubator grant from Cambridge Language Sciences. The project title was “Named-Entity Recognition in Tibetan and Mongolian Newspapers.” The Project PI was Dr Hildegard Diemberger (Cambridge), the Coordinator and Lead Author was Dr Robert Barnett (SOAS), and Senior Advisers were Dr Nathan Hill (SOAS), Dr Marieke Meelen (Cambridge), and Dr Thomas White (Cambridge). <br>
<br>
Although some forms of NER and other NLP procedures have been developed within China for modern Tibetan (see Liu, Nuo <em>et al</em>, 2011), the data underlying those initiatives have not been made publicly available and their findings cannot be tested or reproduced. Significant work on developing NLP for Tibetan has been carried out outside China, but has focused largely on classical Tibetan and religious texts (see Hill & Garrett, 2017).</p>
<p>The Cambridge incubator project therefore produced a tagset, guidelines and training data for developing NER for modern Tibetan, with a focus on historical and political analysis of contemporary newspapers, media and other public documents in Tibetan. We compiled 3.11 million syllables of data in Tibetan extracted from articles downloaded from Chinese-language news aggregator sites within China, primarily tibet.cpc.people.com.cn and tibet.people.com.cn. From this data, we selected texts containing 280,000 syllables in Tibetan, grouped in 26,000 utterances/sentences (available on request). Using Lighttag, an online annotation site, we developed a tagset for NER consisting of 17 tags (and one for wrong segmentation if using segmented data). We annotated approximately 186,000 syllables, leading to 9,884 annotations. Of these, after discounting flawed data, we produced training data containing c. 6,700 annotations. We carried out the secondary, manual review offline (for our method of converting Lighttag data for offline review, see the attached report “Using Spreadsheets to Review Annotations Offline.pdf”), and found an error rate of 3.6%. The final total of reviewed annotations was 6,624.</p>
<p>The dataset, tagset, guidelines and reports were developed and documented by Robert Barnett, with assistance from Tsering Samdrup, Dr Hill and Dr Meelen. Primary annotation was by Tsering Samdrup, assisted by Dr Barnett.<br>
<br>
The datasets published here include: </p>
<ol>
<li>The <strong>tagset guidelines and annotation manual</strong>, including the 17-tag tagset, guidelines, and recommendations ("NER for Modern Tibetan-tagset and guidelines.pdf").</li>
<li>The <strong>tagged training data</strong> in .csv format ("Tibetan NER Training Data-tagged, reviewed wth context-v10-UTF-8.csv") and .xlsx format ("Tibetan NER Training Data-tagged with context-v10-UTF-8.xlsx"). This includes 6,624 reviewed annotations, arranged according to the Tibetan alphabet together with the tags and context (utterance) for each annotation.</li>
<li>The <strong>raw annotation results</strong> downloaded from Lighttag as .json files ("Raw Training Data for NER in Modern Tibetan -Jobs2-11-JSON.zip") and as .xls files ("Training Data for NER in Modern Tibetan -Jobs2-11-XLS.zip"). These include 10 "tasks" or datasets of articles scraped from Tibetan-language websites within Tibet.</li>
<li>A <strong>guide to preparing Lighttag annotation results for manual review offline</strong> (“Using Spreadsheets to Review Annotations Offline.pdf”).</li>
</ol>
<p>The project's findings regarding the status of NER and NLP for vertical Mongolian are available at DOI: 10.5281/zenodo.5103499.</p>
Note: To view normal CSV files with Tibetan content in Excel, do *not* open them as normal files; always *import* their content into Excel via Data/FromText, otherwise the Tibetan script (even if Unicode) will be erased. Alternatively, save these files in UTF-8 CSV format, not in plain CSV format.
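Outside Excel, the same encoding concern can be handled by opening the file with an explicit UTF-8 encoding. The following is a minimal Python sketch; the function name is illustrative and no particular column layout of the training data is assumed. The `utf-8-sig` codec is used so that a leading byte-order mark, if one is present, is dropped rather than ending up in the first cell.

```python
import csv
from typing import List


def read_utf8_csv(path: str) -> List[List[str]]:
    """Read a CSV file as UTF-8 and return its rows as lists of strings."""
    # newline="" lets the csv module handle line endings inside quoted cells.
    with open(path, newline="", encoding="utf-8-sig") as handle:
        return list(csv.reader(handle))
```

Calling `read_utf8_csv` on one of the published .csv files should then return the Tibetan text intact, provided the file really is UTF-8 encoded as its name suggests.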
https://doi.org/10.5281/zenodo.4536516
oai:zenodo.org:4536516
bod
Zenodo
https://zenodo.org/communities/tibetica
https://zenodo.org/communities/tibnlp
https://doi.org/10.5281/zenodo.4536515
info:eu-repo/semantics/openAccess
Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
Modern Tibetan
NER
tags
newspapers
NLP
Tibetan language
Corpus linguistics
Named-Entity Recognition for Modern Tibetan Newspapers: Tagset, Guidelines and Training Data
info:eu-repo/semantics/other
oai:zenodo.org:823707
2021-04-29T14:42:17Z
user-tibnlp
openaire_data
Meelen, Marieke
Hill, Nathan W.
Handy, Christopher
2017-07-06
<p>This corpus is a part-of-speech tagged version of</p>
<p>Wallman, Jeff, Rowinski, Zach, Ngawang Trinley, Tomlinson, Chris, & Keutzer, Kurt. (2017). Collection of Tibetan etexts compiled by the Buddhist Digital Resource Center [Data set]. Zenodo. http://doi.org/10.5281/zenodo.821218</p>
<p>using the training data of</p>
<p>Hill, Nathan W., & Garrett, Edward. (2017). A part-of-speech (POS) tagged corpus of Classical Tibetan [Data set]. Zenodo. http://doi.org/10.5281/zenodo.574878</p>
<p>using the memory based tagger of</p>
<p>https://languagemachines.github.io/mbt/</p>
<p>Please note that the files have not been post-processed or manually corrected, and that a small number of files in the KarmaDelek directory were still annotated even though the original XML input was already corrupted.</p>
https://doi.org/10.5281/zenodo.823707
oai:zenodo.org:823707
Zenodo
https://doi.org/10.5281/zenodo.821218
https://doi.org/10.5281/zenodo.574878
https://zenodo.org/communities/tibnlp
https://doi.org/10.5281/zenodo.823706
info:eu-repo/semantics/openAccess
Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
Tibetan language
natural language processing
corpus linguistics
memory based tagging
Tibetan linguistics
Trans-Himalayan Linguistics
The Annotated Corpus of Classical Tibetan (ACTib), Part I - Segmented version, based on the BDRC digitised text collection, tagged with the Memory-Based Tagger from TiMBL.
info:eu-repo/semantics/other
oai:zenodo.org:3951503
2020-07-20T00:59:22Z
user-tibnlp
openaire_data
Meelen, Marieke
Roux, Élie
2020-05-04
<p>This corpus, consisting of more than 185 million tokens, is a segmented and part-of-speech tagged version of</p>
<p>Wallman, Jeff, Rowinski, Zach, Ngawang Trinley, Tomlinson, Chris, & Keutzer, Kurt. (2017). Collection of Tibetan etexts compiled by the Buddhist Digital Resource Center [Data set]. Zenodo. http://doi.org/10.5281/zenodo.821218</p>
<p>using the training data of</p>
<p>Hill, Nathan W., & Garrett, Edward. (2017). A part-of-speech (POS) tagged corpus of Classical Tibetan [Data set]. Zenodo. http://doi.org/10.5281/zenodo.574878</p>
<p>The code for segmenting and POS tagging any Tibetan file can be found on GitHub.</p>
<p>This Version 2 of ACTib is based on the same XML files as ACTib Version 1 (http://doi.org/10.5281/zenodo.823707), but contains both segmented and POS-tagged files and is improved in a number of ways, although post-processing was still done automatically and no manual correction was involved. For details of this improved annotation method see:</p>
<p>Meelen, Marieke, Roux, Élie & Hill, Nathan (forthcoming). 'Optimisation of the largest annotated Tibetan corpus combining rule-based, memory-based & deep-learning methods' in <em>TALLIP.</em></p>
Acknowledgements go to the British Academy for funding Meelen's research through grant pf170063.
https://doi.org/10.5281/zenodo.3951503
oai:zenodo.org:3951503
Zenodo
https://zenodo.org/communities/tibnlp
https://doi.org/10.5281/zenodo.3785070
info:eu-repo/semantics/openAccess
Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
Tibetan language
Natural Language Processing
Annotated Historical Corpus
Segmentation
POS tagging
The Annotated Corpus of Classical Tibetan (ACTib) - Version 2.0 (Segmented & POS-tagged)
info:eu-repo/semantics/other
oai:zenodo.org:574876
2021-04-29T14:28:04Z
user-tibnlp
openaire_data
Hill, Nathan W.
Garrett, Edward
2017-05-11
<p>This part-of-speech (POS) lexicon of Classical Tibetan was prepared in the course of the research project 'Tibetan in Digital Communication' (2012-2015) hosted at SOAS, University of London and funded by the UK's Arts and Humanities Research Council (grant code: AH/J00152X/1). The data for verbs comes from a digitized version of <em>A Lexicon of Tibetan Verb Stems as Reported by the Grammatical Tradition</em> (Munich: Bayerische Akademie der Wissenschaften, 2010) by Nathan W. Hill. Otherwise data comes from the manually part-of-speech tagged training data produced by the corpus and a few lexical items specifically added by hand to improve rule-based tagging.</p>
Funded by the UK's Arts and Humanities Research Council (grant code: AH/J00152X/1)
https://doi.org/10.5281/zenodo.574876
oai:zenodo.org:574876
Zenodo
https://zenodo.org/communities/tibnlp
https://doi.org/
info:eu-repo/semantics/openAccess
Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
Tibetan language
Natural language processing
part-of-speech tagging
A part-of-speech (POS) lexicon of Classical Tibetan for NLP
info:eu-repo/semantics/other
oai:zenodo.org:822537
2021-04-29T14:43:25Z
user-tibnlp
openaire_data
Meelen, Marieke
Hill, Nathan
Handy, Christopher
2017-07-04
<p>This corpus is a part-of-speech tagged version of</p>
<p>Wallman, Jeff, Rowinski, Zach, Ngawang Trinley, Tomlinson, Chris, & Keutzer, Kurt. (2017). Collection of Tibetan etexts compiled by the Buddhist Digital Resource Center [Data set]. Zenodo. http://doi.org/10.5281/zenodo.821218</p>
<p>using the training data of</p>
<p>Hill, Nathan W., & Garrett, Edward. (2017). A part-of-speech (POS) tagged corpus of Classical Tibetan [Data set]. Zenodo. http://doi.org/10.5281/zenodo.574878</p>
<p>using the memory based tagger of</p>
<p>https://languagemachines.github.io/mbt/</p>
<p>Please note that the files have not been post-processed or manually corrected, and that a small number of files in the KarmaDelek directory were still annotated even though the original XML input was already corrupted.</p>
https://doi.org/10.5281/zenodo.822537
oai:zenodo.org:822537
Zenodo
https://doi.org/10.5281/zenodo.821218
https://doi.org/10.5281/zenodo.574878
https://zenodo.org/communities/tibnlp
https://doi.org/10.5281/zenodo.822536
info:eu-repo/semantics/openAccess
Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
memory based tagging
Tibetan linguistics
Trans-Himalayan Linguistics
POS-tagging
corpus linguistics
The Annotated Corpus of Classical Tibetan (ACTib), Part II - POS-tagged version, based on the BDRC digitised text collection, tagged with the Memory-Based Tagger from TiMBL
info:eu-repo/semantics/other