The Annotated Corpus of Classical Tibetan (ACTib) - Version 2.0 (Segmented & POS-tagged)
Description
This corpus of more than 185 million tokens is a segmented and part-of-speech (POS) tagged version of:
Wallman, Jeff, Rowinski, Zach, Ngawang Trinley, Tomlinson, Chris, & Keutzer, Kurt. (2017). Collection of Tibetan etexts compiled by the Buddhist Digital Resource Center [Data set]. Zenodo. http://doi.org/10.5281/zenodo.821218
It was produced using the training data of:
Hill, Nathan W., & Garrett, Edward. (2017). A part-of-speech (POS) tagged corpus of Classical Tibetan [Data set]. Zenodo. http://doi.org/10.5281/zenodo.574878
The code for segmenting and POS tagging any Tibetan file can be found on GitHub.
Version 2 of ACTib is based on the same XML files as ACTib Version 1 (http://doi.org/10.5281/zenodo.823707), but it contains both segmented and POS-tagged files and is improved in a number of ways. Post-processing was still done automatically, with no manual correction. For details of the improved annotation method, see:
Meelen, Marieke, Roux, Élie & Hill, Nathan (forthcoming). 'Optimisation of the largest annotated Tibetan corpus combining rule-based, memory-based & deep-learning methods' in TALLIP.
Files
SegPOS-DharmaDownload_July2020.zip
Total size: 844.9 MB
MD5 checksum | Size |
---|---|
md5:3a24715c7b1181be0ee2bf671c8142e0 | 91.7 MB |
md5:f9c3c9dd9fdd4e597ad1d6283e68cf9f | 42.1 MB |
md5:fe9e87295f8a5408adc69019f9976da7 | 79.0 MB |
md5:becaf3b43d67f8a903278f0fd71e78d9 | 192.9 MB |
md5:1ff9ccc99e372d6cae11d44f1204febd | 222.2 MB |
md5:75836c8ca070af0b543c17f674b4a628 | 85.1 MB |
md5:2510e3ea27ae0230cc6b060226e5fca1 | 17.5 MB |
md5:2552ba9369579462db2ef7d2cdf94ad2 | 68.2 MB |
md5:e846876dd34da3770e8df0620491b14d | 15.1 MB |
md5:29ce6ea2ec4ae90257c0be7a5d8eb8c5 | 12.9 MB |
md5:82c1fe04c44296b1e18f47815a8d7d88 | 18.0 MB |
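The MD5 checksums in the file listing can be used to confirm that a download arrived intact. The following is a minimal sketch using only the Python standard library; the function names `md5_of_file` and `matches_listing` and the local path in the usage comment are illustrative assumptions, not part of the corpus tooling. Because the listing does not pair file names with checksums, the usage comment leaves the expected value as a placeholder to be copied from the listing.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`.

    The file is read in chunks so that archives of several hundred
    megabytes do not need to fit in memory at once.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a local file against a checksum as it appears in the listing.

    `listed` may carry the "md5:" prefix used in the file table;
    the prefix is removed before comparison.
    """
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


# Hypothetical usage (the path and the checksum pairing are assumptions):
#   ok = matches_listing("downloads/some_archive.zip", "md5:<value from the listing>")
#   print("checksum OK" if ok else "checksum mismatch, download again")
```

A mismatch usually means a truncated or corrupted download, so fetching the file again is the first thing to try.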