segmented Corpus of Buddhist Sanskrit (proof of concept)

doi:10.5281/zenodo.7296548

Published June 20, 2022 | Version 1.9

Other Open

1. Mangalam Research Center

Project member:

Bruno Galasek-Hul

This is a proof-of-concept Sanskrit corpus developed for the study of Buddhist Sanskrit lexicology.

It comprises:

367 lemmatized and metadata-enriched Buddhist Sanskrit texts, for a total of ~7 million words
a tokenised reference corpus of general Sanskrit comprising 267 texts, for a total of ~13 million words
a metadata table with information about each text in the Buddhist and reference corpora
a stemmed and normalised version of the Buddhist corpus, plus a sketch grammar for use in Sketch Engine

The corpora are in romanised Sanskrit (UTF-8 encoding).
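Because the texts are romanised Sanskrit with diacritics, the same letter can be stored either precomposed or as a base letter plus combining mark. The following is a minimal Python sketch of reading such text as UTF-8 and normalising it before counting tokens. It assumes plain text with whitespace-separated tokens, which this record does not confirm for any of the files; the helper names (load_text, normalise, token_counts) are illustrative only and are not part of the corpus distribution.

```python
import unicodedata
from collections import Counter


def load_text(path):
    """Read a corpus file as UTF-8 text (the encoding stated on the record)."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


def normalise(text):
    """Compose base letters and combining marks (NFC), so that
    'a' + U+0304 and the precomposed 'ā' compare as equal."""
    return unicodedata.normalize("NFC", text)


def token_counts(text):
    """Count whitespace-separated tokens after NFC normalisation.

    Assumption: tokens are separated by whitespace. The actual file
    layout may differ and should be checked before relying on this.
    """
    return Counter(normalise(text).split())


if __name__ == "__main__":
    # The second 'evaṃ' is written with a combining dot below (U+0323).
    sample = "evaṃ mayā śrutam evam\u0323"
    for token, count in token_counts(sample).most_common():
        print(count, token)
```

Normalising first keeps frequency counts from being split across two byte-level spellings of the same word.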

Limitations
The corpus is currently undergoing proofreading; several segmentation and lemmatization errors remain.

We are grateful to have received an Ashoka grant from the Khyentse Foundation to proofread the Buddhist Sanskrit Corpus. Improved versions will be released in due course.

Acknowledgments
The corpus was first realised as part of the project 'Lexis and Tradition: variation in the vocabulary of Sanskrit Mahāyāna literature'. This project was funded by the British Academy through a Newton International Fellowship (NF161436) and hosted at the Department of Theology and Religious Studies at King's College London under the supervision of Prof. Henrietta Kate Crosby.

Dr. Bruno Galasek-Hul contributed to versions 1.4 to 1.7 thanks to funding from the Mangalam Research Center for Buddhist Languages.

The reference corpus of general Sanskrit has been tokenised by Matej Martinc within the project 'Computing the Dharma' funded by the National Endowment for the Humanities (HAA-277246-21).

Thanks to GRETIL, CTS e-texts, Vinita Tseng, Jowita Kramer and Prof. Steinkellner for kindly giving their permission to include automatically processed versions of some of their editions in this corpus.

Changelog

version 1.9 adds more Buddhist texts, has better segmentation and lemmatization and is partially proofread

version 1.8 adds more Buddhist texts and is partially proofread

version 1.7 adds more Buddhist texts and a new pre-processed corpus of general Sanskrit

version 1.6 adds more Buddhist texts, improves segmentation and adds an initial iteration of the lemmatised corpus

version 1.5 adds more Buddhist texts, removes the reference corpus and improves segmentation

version 1.4.1 corrects some spacing and sentence parsing errors

version 1.4 adds 59 Buddhist texts and fixes some recurrent segmentation errors

Notes

This corpus is being proofread thanks to funding from the Khyentse Foundation

Files (125.4 MB)

Name	MD5	Size
Lugli2019_BuddhSktSketchGrammar.txt	c58791f712ac0174c2363523179d12f7	15.1 kB
Lugli2022_BuddhistSanskritCorpusLemmatized.zip	480e8e2faed119056eaf3e4ece47e312	68.5 MB
Lugli2022_BuddhistSanskritCorpusSegmentedNormalized.zip	ac6ab160d769cb329350d22b10b7cb24	21.9 MB
LugliAndGalasek_2022_ReferenceSanskritCorpus_tokenized.zip	fbe335328a510c0462dbf08ebc2996f9	34.5 MB
LugliGalasekQuinones2022_BuddhistSanskritCorpusMetadata.csv	3d2f8a2e593b07ccdcab5c861e598f49	503.1 kB
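The MD5 digests listed above can be used to check that downloaded files are intact. The following is a minimal Python sketch, not an official tool of the project: the file names and digests are copied from the file list for version 1.9, while the "downloads" directory and the function names are assumptions made for illustration.

```python
import hashlib
from pathlib import Path

# File names and MD5 digests as listed on the record for version 1.9.
EXPECTED_MD5 = {
    "Lugli2019_BuddhSktSketchGrammar.txt": "c58791f712ac0174c2363523179d12f7",
    "Lugli2022_BuddhistSanskritCorpusLemmatized.zip": "480e8e2faed119056eaf3e4ece47e312",
    "Lugli2022_BuddhistSanskritCorpusSegmentedNormalized.zip": "ac6ab160d769cb329350d22b10b7cb24",
    "LugliAndGalasek_2022_ReferenceSanskritCorpus_tokenized.zip": "fbe335328a510c0462dbf08ebc2996f9",
    "LugliGalasekQuinones2022_BuddhistSanskritCorpusMetadata.csv": "3d2f8a2e593b07ccdcab5c861e598f49",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory, expected=EXPECTED_MD5):
    """Map each expected file name to 'ok', 'mismatch', or 'missing'."""
    results = {}
    for name, want in expected.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of(path) == want:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results


if __name__ == "__main__":
    # "downloads" is a hypothetical folder holding the downloaded files.
    for name, status in verify_downloads("downloads").items():
        print(f"{status:8}  {name}")
```

A "mismatch" most likely indicates an incomplete download or a file from a different version of the record, since each version has its own files and digests.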

	All versions	This version
Views	1,996	68
Downloads	892	9
Data volume	22.2 GB	252.8 MB
