Other Open Access

Segmented Sanskrit corpus (proof of concept)

Ligeia Lugli

Project member(s)
Bruno Galasek-Hul

This is a proof-of-concept Sanskrit corpus developed for the study of Buddhist Sanskrit lexicology.

It comprises:

  •  172 metadata-enriched Buddhist Sanskrit texts, for a total of ~5 million words. The corpus contains all Mahāyāna and 'mainstream' Buddhist texts available on GRETIL that are based on Sanskrit editions (reconstructed editions based on Tibetan translations have been filtered out).

The corpus is in romanised Sanskrit (UTF-8 encoding) and is available in three configurations:

  1.  segmented (with dash-separated words);
  2.  segmented and stemmed (with capitalised word stems and compounds separated by an @ sign);
  3.  segmented, stemmed and normalised (normalisation handles some spelling variation and resolves the sandhi of stem initials in most cases); recommended for Word Sketches.
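As a rough illustration of how the segmented and stemmed configurations might be read programmatically, the Python sketch below splits a line into words on dashes and whitespace and then splits each word on the @ sign. The function names and the sample string are hypothetical and not taken from the corpus, and the sketch assumes that @ separates the members of a compound; the exact delimiting conventions should be checked against the segmenter documentation.

```python
import re


def split_words(line: str) -> list[str]:
    """Split a corpus line into words on dashes and whitespace.

    Assumption: words are delimited by '-' (and possibly spaces).
    Empty strings produced by repeated delimiters are dropped.
    """
    return [w for w in re.split(r"[-\s]+", line.strip()) if w]


def split_compound(token: str) -> list[str]:
    """Split a stemmed token on '@'.

    Assumption: '@' separates the members of a compound.
    """
    return [m for m in token.split("@") if m]


def parse_stemmed_line(line: str) -> list[list[str]]:
    """Return one list of stems per word in the line."""
    return [split_compound(w) for w in split_words(line)]


# Invented sample for illustration only, not a quotation from the corpus.
sample = "Dharma@Kāya-Tathāgata"
print(parse_stemmed_line(sample))
```

Keeping the two splitting steps separate means the word-level function can also be reused on the plain segmented configuration, where no @ signs are expected.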

The last of these versions can be used to generate word sketches in Sketch Engine when combined with the included sketch grammar, which infers likely syntactic dependencies from morphological cues.

**Avagraha has been replaced with the letter a** in the stemmed versions.
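When searching the stemmed versions for a form copied from an edition that still marks avagraha, the same replacement may need to be applied to the search term. The sketch below is one possible way to do this in Python. It assumes that avagraha is written as a straight or curly apostrophe in the romanised source, which is not confirmed by this record, and the function name is hypothetical.

```python
# Assumption: in romanised text, avagraha is written as a straight (')
# or curly (’) apostrophe. Adjust this tuple if the source differs.
AVAGRAHA_CHARS = ("'", "\u2019")


def replace_avagraha(term: str) -> str:
    """Replace every assumed avagraha character in `term` with 'a'."""
    for ch in AVAGRAHA_CHARS:
        term = term.replace(ch, "a")
    return term


print(replace_avagraha("so 'ham"))
```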

Limitations
As a proof of concept, this corpus suffers from several limitations. It is very small by contemporary standards, has not been proofread, and is currently only segmented and stemmed (not lemmatised or PoS-tagged).
A funding bid has been submitted to expand and lemmatise the corpus.

Data Quality
The corpus has been segmented with Lugli's Sanskrit segmenter (10.5281/zenodo.3459215). The accuracy of this segmenter has been evaluated at 97% on a sample of Buddhist Sanskrit literature.

Please refer to the segmenter documentation stored at 10.5281/zenodo.3459215 for details on evaluation and stemming conventions.

Acknowledgments
The corpus has been realised as part of the project 'Lexis and Tradition: variation in the vocabulary of Sanskrit Mahāyāna literature'. This project was funded by the British Academy through a Newton International Fellowship (NF161436) and hosted at the Department of Theology and Religious Studies at King's College London under the supervision of Prof. Henrietta Kate Crosby. 

Dr. Bruno Galasek-Hul has contributed to versions 1.4 & 1.5 thanks to funding from the Mangalam Research Center for Buddhist Languages.

Thanks to GRETIL, Dr. Vinita Tseng and Prof. Steinkellner for kindly giving their permission to include automatically processed versions of some of their editions in this corpus.

 

Changelog

version 1.5 adds more Buddhist texts, removes the reference corpus and improves segmentation

version 1.4 adds 59 Buddhist texts and fixes some recurrent segmentation errors

version 1.4.1 corrects some spacing and sentence parsing errors

Also included: bibliography cum metadata summary and a sketch grammar + corpus configuration file for use in Sketch Engine

Files (40.3 MB)

  •  Lugli2019_BuddhSktSketchGrammar.txt (15.1 kB), md5:c58791f712ac0174c2363523179d12f7
  •  Lugli_BuddhistSanskritCorpusMetadata2020-06-22.csv (50.8 kB), md5:aa2c9f2071623329468796ac90f39ce0
  •  Lugli_BuddhistSanskritCorpusSegmented_v1_5.zip (12.7 MB), md5:43e8793746f43c4d86be12936c0d0c9c
  •  Lugli_BuddhistSanskritCorpusStemmed_v1_5.zip (13.8 MB), md5:5aa6c8d2acabc99cc387a2a6c544514a
  •  Lugli_BuddhistSanskritCorpusStemmedNormalisedForGramrels_v1_5.zip (13.8 MB), md5:e63c00b914b6d5f62db1829ea39d4be5
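
The md5 checksums listed above can be used to confirm that a download is intact. The following Python sketch shows one way to do so with the standard library; the function names and the assumption that the files sit in a single local directory are illustrative only.

```python
import hashlib
from pathlib import Path

# File names and md5 values as listed in this record.
EXPECTED_MD5 = {
    "Lugli2019_BuddhSktSketchGrammar.txt": "c58791f712ac0174c2363523179d12f7",
    "Lugli_BuddhistSanskritCorpusMetadata2020-06-22.csv": "aa2c9f2071623329468796ac90f39ce0",
    "Lugli_BuddhistSanskritCorpusSegmented_v1_5.zip": "43e8793746f43c4d86be12936c0d0c9c",
    "Lugli_BuddhistSanskritCorpusStemmed_v1_5.zip": "5aa6c8d2acabc99cc387a2a6c544514a",
    "Lugli_BuddhistSanskritCorpusStemmedNormalisedForGramrels_v1_5.zip": "e63c00b914b6d5f62db1829ea39d4be5",
}


def md5_of(path, chunk_size: int = 1 << 20) -> str:
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory=".") -> dict[str, bool]:
    """Check each listed file found in `directory` against its md5.

    Files that are not present are skipped rather than reported as failures.
    """
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if path.exists():
            results[name] = md5_of(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify(".").items():
        print(f"{name}: {'OK' if ok else 'MISMATCH'}")
```

A mismatch usually points to an interrupted or corrupted download, in which case fetching the file again is the simplest remedy.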