Meelen, Marieke
Roux, Élie
2020-05-04
<p>This corpus of more than 185 million tokens is a segmented and part-of-speech (POS) tagged version of</p>
<p>Wallman, Jeff, Rowinski, Zach, Ngawang Trinley, Tomlinson, Chris, & Keutzer, Kurt. (2017). Collection of Tibetan etexts compiled by the Buddhist Digital Resource Center [Data set]. Zenodo. http://doi.org/10.5281/zenodo.821218</p>
<p>using the training data of</p>
<p>Hill, Nathan W., & Garrett, Edward. (2017). A part-of-speech (POS) tagged corpus of Classical Tibetan [Data set]. Zenodo. http://doi.org/10.5281/zenodo.574878</p>
<p>The code for segmenting and POS tagging any Tibetan file can be found on GitHub.</p>
<p>Version 2 of ACTib is based on the same XML files as ACTib Version 1 (http://doi.org/10.5281/zenodo.823707), but it contains both segmented and POS-tagged files and improves on Version 1 in a number of ways. Post-processing was still done automatically, and no manual correction was involved. For details of the improved annotation method, see:</p>
<p>Meelen, Marieke, Roux, Élie & Hill, Nathan (forthcoming). 'Optimisation of the largest annotated Tibetan corpus combining rule-based, memory-based & deep-learning methods' in <em>TALLIP.</em></p>
Acknowledgements go to the British Academy for funding Meelen's research through grant pf170063.
https://doi.org/10.5281/zenodo.3951503
oai:zenodo.org:3951503
Zenodo
https://zenodo.org/communities/tibnlp
https://doi.org/10.5281/zenodo.3785070
info:eu-repo/semantics/openAccess
Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
Tibetan language
Natural Language Processing
Annotated Historical Corpus
Segmentation
POS tagging
The Annotated Corpus of Classical Tibetan (ACTib) - Version 2.0 (Segmented & POS-tagged)
info:eu-repo/semantics/other