segmented Sanskrit corpus (proof of concept)

Ligeia Lugli; Bruno Galasek-Hul; Luis Quiñonen

doi:10.5281/zenodo.5188228

Published September 23, 2019 | Version 1.6

Other Open

1. Mangalam Research Center

Contributors

Project member:

Bruno Galasek-Hul

This is a proof-of-concept Sanskrit corpus developed for the study of Buddhist Sanskrit lexicology.

It comprises:

225 metadata-enriched Buddhist Sanskrit texts, for a total of ~6 million words.

The corpus is in romanised Sanskrit (UTF-8 encoding) and is available in three configurations:

segmented and stemmed (with capitalised word stems and compounds separated by an @ sign).
segmented, stemmed and normalised (normalisation treats some spelling variation and resolves sandhi of stems' initials in most cases); recommended for Word Sketches.
lemmatised (vertical file, currently as CSV; a CoNLL-U version will be available once the corpus has been proofread).

The latter version can be used to generate word sketches in Sketch Engine if used in conjunction with the included sketch grammar, which infers likely syntactic dependencies from morphological cues.

Note: avagraha has been replaced with "a" in the stemmed versions.
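As a rough illustration of the @ convention described above, the following Python sketch splits a line of segmented text into tokens and each token into its @-separated parts. It assumes that tokens are whitespace-delimited and that @ appears only as the compound separator; neither assumption is confirmed by this record, and the sample string is an invented placeholder rather than corpus data. The function names `split_token` and `split_line` are hypothetical.

```python
def split_token(token: str) -> list[str]:
    """Split one token into its parts on the '@' separator (assumed convention)."""
    return [part for part in token.split("@") if part]


def split_line(line: str) -> list[list[str]]:
    """Split a whitespace-delimited line into tokens, then each token into parts."""
    return [split_token(token) for token in line.split()]


# Invented placeholder input, NOT taken from the corpus.
sample = "STEMA@STEMB xyz"
print(split_line(sample))  # → [['STEMA', 'STEMB'], ['xyz']]
```

The segmenter documentation at 10.5281/zenodo.3459215 remains the authoritative source for the actual stemming and separator conventions.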

Limitations
As a proof of concept, this corpus suffers from several limitations. It is very small by contemporary standards and has not been proofread yet.
A funding bid is being submitted to expand and proofread the corpus.

Data Quality
The corpus has been segmented with Lugli's Sanskrit segmenter (10.5281/zenodo.3459215). The accuracy of this segmenter has been evaluated at 97% on a sample of Buddhist Sanskrit literature.

Please refer to the segmenter documentation stored at 10.5281/zenodo.3459215 for details on evaluation and stemming conventions.

Acknowledgments
The corpus has been realised as part of the project 'Lexis and Tradition: variation in the vocabulary of Sanskrit Mahāyāna literature'. This project was funded by the British Academy through a Newton International Fellowship (NF161436) and hosted at the Department of Theology and Religious Studies at King's College London under the supervision of Prof. Henrietta Kate Crosby.

Dr. Bruno Galasek-Hul has contributed to versions 1.4 & 1.5 thanks to funding from the Mangalam Research Center for Buddhist Languages.

Thanks to GRETIL, Dr. Vinita Tseng and Prof. Steinkellner for kindly giving their permission to include automatically processed versions of some of their editions in this corpus.

Changelog

version 1.6 adds more Buddhist texts, improves segmentation and adds an initial iteration of the lemmatised corpus

version 1.5 adds more Buddhist texts, removes the reference corpus and improves segmentation

version 1.4.1 corrects some spacing and sentence parsing errors

version 1.4 adds 59 Buddhist texts and fixes some recurrent segmentation errors

Notes

Also included: a combined bibliography and metadata summary, plus a sketch grammar and corpus configuration file for use in Sketch Engine.

Files

Files (92.7 MB)

Name	MD5	Size
GalasekAndLugli_2021_SanskritBuddhistCorpusMetadata_2021-08-12.csv	91ff4bfc017c32fb60d33dc6b4b53096	407.1 kB
Lugli2019_BuddhSktSketchGrammar.txt	c58791f712ac0174c2363523179d12f7	15.1 kB
Lugli_2021_SanskritBuddhistCorpusLemmatized.zip	edcea86574136e526a399d30f6ceea23	56.7 MB
Lugli_2021_SanskritBuddhistCorpusSegmented.zip	54313fd0f850a53977917201785d262c	17.8 MB
Lugli_2021_SanskritBuddhistCorpusSegmentedAndNormalized.zip	204730790293feaa44d36fbf42653e47	17.8 MB
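
The MD5 checksums listed for these files can be used to check that a download is intact. The sketch below is one possible way to do this in Python with the standard library only. It assumes the files have been saved under their original names in a single directory; the helper names `md5_of` and `verify` are hypothetical and not part of the record.

```python
import hashlib
from pathlib import Path

# File names and MD5 values as listed in this record.
EXPECTED_MD5 = {
    "GalasekAndLugli_2021_SanskritBuddhistCorpusMetadata_2021-08-12.csv": "91ff4bfc017c32fb60d33dc6b4b53096",
    "Lugli2019_BuddhSktSketchGrammar.txt": "c58791f712ac0174c2363523179d12f7",
    "Lugli_2021_SanskritBuddhistCorpusLemmatized.zip": "edcea86574136e526a399d30f6ceea23",
    "Lugli_2021_SanskritBuddhistCorpusSegmented.zip": "54313fd0f850a53977917201785d262c",
    "Lugli_2021_SanskritBuddhistCorpusSegmentedAndNormalized.zip": "204730790293feaa44d36fbf42653e47",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Map each listed file found in `directory` to True (match) or False (mismatch)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if path.is_file():  # files that were not downloaded are skipped
            results[name] = md5_of(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify().items():
        print(f"{name}: {'OK' if ok else 'MISMATCH'}")
```

Files that are absent from the directory are skipped rather than reported as failures, so a partial download can still be checked.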

	All versions	This version
Views	2,388	337
Downloads	1,223	235
Data volume	29.9 GB	5.0 GB
