segmented Sanskrit corpus (proof of concept)

Ligeia Lugli

doi:10.5281/zenodo.3457822

Published September 23, 2019 | Version 1

Other Open

Ligeia Lugli¹

1. King's College London

This is a proof-of-concept Sanskrit corpus developed for the study of Buddhist Sanskrit lexicology.

It comprises:

66 metadata-enriched Buddhist Sanskrit texts, totalling 2.5 million tokens
a 4-million-token reference corpus comprising 30 metadata-enriched non-Buddhist Sanskrit texts.

The corpus is in romanised Sanskrit (UTF-8 encoding) and is available in two configurations:

segmented (with dash-separated words)
segmented and stemmed (with capitalised word stems and compound members separated by an @ sign).

The latter version can be used to generate word sketches in Sketch Engine when combined with the included sketch grammar, which infers likely syntactic dependencies from morphological cues.
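
As a rough illustration of the two configurations described above, the following Python sketch splits a line into words and a compound into its members. It assumes only what the description states: words are delimited by dashes (and, presumably, whitespace), and compound members are joined by an @ sign. The function names and the sample strings are hypothetical and not taken from the corpus files. The exact convention for capitalised stems should be checked against the files themselves, so it is not handled here.

```python
import re


def split_segmented(line: str) -> list[str]:
    """Split a line of the segmented configuration into words.

    Assumption: words are delimited by dashes or whitespace.
    """
    return [word for word in re.split(r"[\s-]+", line) if word]


def split_compound(token: str) -> list[str]:
    """Split a token of the stemmed configuration into compound members.

    Assumption: compound members are joined by an @ sign.
    """
    return [member for member in token.split("@") if member]


if __name__ == "__main__":
    # Hypothetical sample strings, for illustration only.
    print(split_segmented("evaṃ-mayā śrutam"))
    print(split_compound("dharma@kāya"))
```

Counting the items returned by `split_segmented` over a whole file would give an approximate token count, which may differ from the figures above depending on how punctuation and markup are treated.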

Limitations
As a proof of concept, this corpus has several limitations: it is very small by contemporary standards, it has not been proofread, and it is currently only segmented and stemmed (not normalised, lemmatised or PoS-tagged).
A funding bid has been submitted to expand and lemmatise the corpus.

Data Quality
The corpus has been segmented with Lugli's Sanskrit segmenter (doi:10.5281/zenodo.3459215). The accuracy of this segmenter has been evaluated at 97% on a sample of Buddhist Sanskrit literature. No evaluation has been performed on non-Buddhist materials, so the quality of the segmentation may be lower in the non-Buddhist section of the corpus.

Acknowledgments
The corpus has been realised as part of the project 'Lexis and Tradition: variation in the vocabulary of Sanskrit Mahāyāna literature'. This project was funded by the British Academy through a Newton International Fellowship (NF161436) and hosted at the Department of Theology and Religious Studies at King's College London under the supervision of Prof. Henrietta Kate Crosby.

Thanks to GRETIL, Dr. Vinita Tseng and Prof. Steinkellner for kindly giving their permission to include automatically processed versions of some of their editions in this corpus.

Notes

Also included: a bibliography-cum-metadata summary, and a sketch grammar and corpus configuration file for use in Sketch Engine.

Files

Files (23.8 MB)

Name	MD5	Size
CorpusConfigurationFile_SketchEngine_Skt_Lugli.txt	aef5a40e3f0c8e3f68b944cb312f3d77	2.6 kB
Lugli_SanskritCorpusSegmented.zip	9d4ab9fe4366b2ca7843540413010fee	11.4 MB
Lugli_SanskritCorpusSegmentedAndStemmed.zip	4a40944e6ac9bb541dc7a2793d0f40c9	12.3 MB
SegmentedSanskritCorpusMetadataCumBibliography.csv	6dcc0fcf94acc3497d48afb4e5fa6696	39.8 kB
SketchGrammar_SktLugli.txt	14ceb879b23e2eaac3c2fef38bc213c9	14.8 kB
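
The MD5 checksums listed above can be used to confirm that downloaded files are intact. The following is a minimal Python sketch, assuming the files have been saved under their original names in a single directory. The helper names `md5_of` and `verify` are illustrative and not part of the dataset.

```python
import hashlib
from pathlib import Path

# File names and MD5 checksums as listed on this record.
EXPECTED_MD5 = {
    "CorpusConfigurationFile_SketchEngine_Skt_Lugli.txt": "aef5a40e3f0c8e3f68b944cb312f3d77",
    "Lugli_SanskritCorpusSegmented.zip": "9d4ab9fe4366b2ca7843540413010fee",
    "Lugli_SanskritCorpusSegmentedAndStemmed.zip": "4a40944e6ac9bb541dc7a2793d0f40c9",
    "SegmentedSanskritCorpusMetadataCumBibliography.csv": "6dcc0fcf94acc3497d48afb4e5fa6696",
    "SketchGrammar_SktLugli.txt": "14ceb879b23e2eaac3c2fef38bc213c9",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Map each expected file name to True if present with a matching checksum."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify(".").items():
        print(("OK      " if ok else "MISSING/MISMATCH  ") + name)
```

A file reported as missing or mismatched is best downloaded again before use.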
