segmented Sanskrit corpus (proof of concept)

Ligeia Lugli

doi:10.5281/zenodo.3659200

Published September 23, 2019 | Version 1.4.1

Resource type: Other | Access: Open

Ligeia Lugli¹

1. Mangalam Research Center

Contributors

Project member:

Bruno Galasek-Hul

This is a proof-of-concept Sanskrit corpus developed for the study of Buddhist Sanskrit lexicology.

It comprises:

131 metadata-enriched Buddhist Sanskrit texts, totalling ~4 million words (~8 million tokens);
a reference corpus of ~2 million words, comprising 30 metadata-enriched non-Buddhist Sanskrit texts.

The corpus is in romanised Sanskrit (UTF-8 encoding) and is available in three configurations:

segmented (with dash-separated words)
segmented and stemmed (with capitalised word stems and compounds separated by an @ sign)
segmented, stemmed and normalised (normalisation treats some spelling variation and, in most cases, resolves sandhi of stem initials); recommended for word sketches.

The last configuration can be used to generate word sketches in Sketch Engine in conjunction with the included sketch grammar, which infers likely syntactic dependencies from morphological cues.

**avagraha has been replaced with a**
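The dash and @ conventions described above can be unpacked with a few lines of Python. This is a minimal sketch, not part of the corpus distribution: the function names are hypothetical, the sample strings in the usage note are invented placeholders rather than corpus data, and treating the leading run of upper-case characters as the stem is an assumption based only on the description above. The segmenter documentation remains the authority on the actual conventions.

```python
import re
import unicodedata
from itertools import takewhile


def split_words(line: str) -> list[str]:
    """Split a line into words on whitespace and dashes (assumed word separators)."""
    return [w for w in re.split(r"[\s-]+", line) if w]


def split_compound(word: str) -> list[str]:
    """Split a word into compound members on the @ sign."""
    return [m for m in word.split("@") if m]


def capitalised_part(member: str) -> str:
    """Return the leading run of upper-case characters.

    Assumption: in the stemmed files the stem is the capitalised part at the
    start of each member. NFC normalisation keeps letters with diacritics
    as single code points so that the upper-case check sees them whole.
    """
    member = unicodedata.normalize("NFC", member)
    return "".join(takewhile(str.isupper, member))
```

For example, with the invented placeholder line `"ALPHAs-BETA@GAMMAm"`, `split_words` separates `ALPHAs` from `BETA@GAMMAm`, `split_compound` then separates `BETA` from `GAMMAm`, and `capitalised_part("GAMMAm")` keeps only `GAMMA`.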

Limitations
As a proof of concept, this corpus suffers from several limitations. It is very small by contemporary standards, it has not been proofread, and it is currently only segmented and stemmed (not lemmatised or PoS-tagged).
A funding bid has been submitted to expand and lemmatise the corpus.

Data Quality
The corpus has been segmented with Lugli's Sanskrit segmenter (doi:10.5281/zenodo.3459215). The accuracy of this segmenter has been evaluated at 97% on a sample of Buddhist Sanskrit literature. No evaluation has been performed on non-Buddhist materials, so the quality of the segmentation may be lower in the non-Buddhist section of the corpus.

Please refer to the segmenter documentation (doi:10.5281/zenodo.3459215) for details on the stemming conventions used in the corpus.

Acknowledgments
The corpus has been realised as part of the project 'Lexis and Tradition: variation in the vocabulary of Sanskrit Mahāyāna literature'. This project was funded by the British Academy through a Newton International Fellowship (NF161436) and hosted at the Department of Theology and Religious Studies at King's College London under the supervision of Prof. Henrietta Kate Crosby.

Dr. Bruno Galasek-Hul has contributed to version 1.4 thanks to funding from the Mangalam Research Center for Buddhist Languages.

Thanks to GRETIL, Dr. Vinita Tseng and Prof. Steinkellner for kindly giving their permission to include automatically processed versions of some of their editions in this corpus.

Changelog

version 1.4 adds 59 Buddhist texts and fixes some recurrent segmentation errors

version 1.4.1 corrects some spacing and sentence parsing errors

Notes

Also included: a bibliography-cum-metadata summary, and a sketch grammar and corpus configuration file for use in Sketch Engine.

Files (58.2 MB)

Name	MD5	Size
Lugli2019_BuddhSktSketchGrammar.txt	c58791f712ac0174c2363523179d12f7	15.1 kB
Lugli_BuddhSktCorpus_MetadataV1_4.txt	16bbad8b8fe0826471bdec1ea216dd0a	51.3 kB
Lugli_BuddhSktCorpusSegmentedFeb2020V1_4_1.zip	015db1ca42d97645729c3f47b5db1249	18.3 MB
Lugli_BuddhSktCorpusSegmentedStemmedFeb2020V1_4_1.zip	818e4bb8ad4c6da2e894d0ea625a14c8	19.9 MB
Lugli_BuddhSktCorpusSegmentedStemmedNormalisedFeb2020V1_4_1.zip	612aeda2cc8327a7b89607de5fcf6f5b	19.9 MB
SegmentedSanskritCorpusMetadataCumBibliography.csv	6dcc0fcf94acc3497d48afb4e5fa6696	39.8 kB
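
The MD5 checksums listed above can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard `hashlib` module; the function names and the idea of checking a local download directory are assumptions for illustration, while the file names and checksums are copied from the listing.

```python
import hashlib
from pathlib import Path

# File names and MD5 checksums as given in the file listing.
EXPECTED_MD5 = {
    "Lugli2019_BuddhSktSketchGrammar.txt": "c58791f712ac0174c2363523179d12f7",
    "Lugli_BuddhSktCorpus_MetadataV1_4.txt": "16bbad8b8fe0826471bdec1ea216dd0a",
    "Lugli_BuddhSktCorpusSegmentedFeb2020V1_4_1.zip": "015db1ca42d97645729c3f47b5db1249",
    "Lugli_BuddhSktCorpusSegmentedStemmedFeb2020V1_4_1.zip": "818e4bb8ad4c6da2e894d0ea625a14c8",
    "Lugli_BuddhSktCorpusSegmentedStemmedNormalisedFeb2020V1_4_1.zip": "612aeda2cc8327a7b89607de5fcf6f5b",
    "SegmentedSanskritCorpusMetadataCumBibliography.csv": "6dcc0fcf94acc3497d48afb4e5fa6696",
}


def md5_of(path, chunk_size=1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory) -> dict[str, bool]:
    """Check every listed file found in `directory`; files not present are skipped."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if path.exists():
            results[name] = md5_of(path) == expected
    return results
```

Calling `verify("downloads")` on a hypothetical `downloads` folder returns a mapping from each file found there to `True` (checksum matches) or `False` (file differs from the published version).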
