This is a proof-of-concept Sanskrit corpus developed for the study of Buddhist Sanskrit lexicology.
The corpus is in romanised Sanskrit (UTF-8 encoding) and is available in three configurations:
The last of these can be used to generate word sketches in Sketch Engine when combined with the included sketch grammar, which infers likely syntactic dependencies from morphological cues.
**Avagraha has been replaced with "a"** in the stemmed versions.
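The avagraha substitution can be illustrated with a minimal Python sketch. It is not the code used to build the corpus. It assumes that the avagraha is written as a word-initial apostrophe (ASCII ' or typographic ’), as is common in IAST romanisation; the function name `replace_avagraha` is hypothetical.

```python
import re

# Assumption: the avagraha is written as an apostrophe (ASCII ' or
# typographic U+2019) at the start of a whitespace-delimited word.
AVAGRAHA = re.compile(r"(?<!\S)['\u2019](?=\w)")


def replace_avagraha(text: str) -> str:
    """Replace a word-initial avagraha mark with the elided vowel 'a'."""
    return AVAGRAHA.sub("a", text)


print(replace_avagraha("so 'ham"))  # so aham
```

An apostrophe inside a word (for example in unsegmented text) is left untouched by this sketch, since only the word-initial position is matched.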
As a proof of concept, this corpus has several limitations. It is very small by contemporary standards, it has not been proofread, and it is currently only segmented and stemmed (not lemmatised or PoS-tagged).
A funding bid has been submitted to expand and lemmatise the corpus.
The corpus has been segmented with Lugli's Sanskrit segmenter (10.5281/zenodo.3459215). The accuracy of this segmenter has been evaluated at 97% on a sample of Buddhist Sanskrit literature.
Please refer to the segmenter documentation stored at 10.5281/zenodo.3459215 for details on evaluation and stemming conventions.
The corpus has been realised as part of the project 'Lexis and Tradition: variation in the vocabulary of Sanskrit Mahāyāna literature'. This project was funded by the British Academy through a Newton International Fellowship (NF161436) and hosted at the Department of Theology and Religious Studies at King's College London under the supervision of Prof. Henrietta Kate Crosby.
Dr. Bruno Galasek-Hul has contributed to versions 1.4 & 1.5 thanks to funding from the Mangalam Research Center for Buddhist Languages.
Thanks to GRETIL, Dr. Vinita Tseng and Prof. Steinkellner for kindly giving their permission to include automatically processed versions of some of their editions in this corpus.
Version 1.5 adds more Buddhist texts, removes the reference corpus and improves segmentation.
Version 1.4 adds 59 Buddhist texts and fixes some recurrent segmentation errors.
Version 1.4.1 corrects some spacing and sentence parsing errors.