Mangalam Corpus of Buddhist Sanskrit Literature

Ligeia Lugli; Luis Quiñones

doi:10.5281/zenodo.16639879

Published July 31, 2025 | Version 3.0

Resource type: Other | Access: Open

1. Mangalam Research Center

Contributors

Hosting institution:

Mangalam Research Center

Others:

Project members:

Supervisor:

Ligeia Lugli⁵

1. 84000
2. University of Alberta
3. Shan State Buddhist University
4. Mangalam Research Center
5. King's College London

This is a Sanskrit corpus developed at the Mangalam Research Center (Berkeley, California) for the study of Buddhist Sanskrit lexicology.

It comprises:

446 lemmatized and metadata-enriched Buddhist Sanskrit texts, for a total of ~7.5 million words
a lemmatized reference corpus of general Sanskrit, comprising 375 texts for a total of ~13.5 million words
a metadata table with information about each text in the Buddhist and reference corpora
a stemmed and normalised version of the Buddhist corpus, and a sketch grammar for use in Sketch Engine

Lemmatization notes

The corpora are in romanised Sanskrit (UTF-8 encoding). Verbs are lemmatized to the stem of the third-person singular present indicative; the verb root can be found in the Root column. Avagraha has been replaced with the letter a.
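
A search string may need the same avagraha normalisation before it is matched against the corpus. The following is a minimal sketch, assuming that avagraha appears in the query as a straight or curly apostrophe or as the Devanagari sign ऽ. The function name normalize_query and the list of characters are illustrative assumptions, not part of the corpus tooling, and the corpus's own pre-processing may treat spacing around the sign differently.

```python
import unicodedata

# Characters often used for avagraha in e-texts.
# This list is an assumption, not taken from the corpus documentation.
AVAGRAHA_CHARS = ("'", "\u2019", "\u093d")


def normalize_query(text: str) -> str:
    """Return text in Unicode NFC form with each avagraha character replaced by 'a'."""
    # NFC composes base letters and combining diacritics (e.g. a + macron -> ā).
    text = unicodedata.normalize("NFC", text)
    for ch in AVAGRAHA_CHARS:
        text = text.replace(ch, "a")
    return text


if __name__ == "__main__":
    print(normalize_query("so 'pi"))
```

Only the character itself is replaced; any whitespace in the query is left untouched.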

Data Quality & Limitations
We are grateful to have received an Ashoka grant from the Khyentse Foundation to proofread samples of the Buddhist Sanskrit Corpus. Still, only a small percentage of the corpus has been proofread and many segmentation and lemmatization errors are likely to remain. Quantitative evaluation based on ~9000 proofread sentences puts pre-processing accuracy at ~94% (F1 0.938 averaged across all sentences).
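
The record does not specify how the F1 score was computed. As a rough illustration only, the sketch below shows one common way to obtain a per-sentence F1 averaged across sentences: each predicted sentence is compared with its proofread counterpart as a bag of tokens, and the sentence scores are then averaged. The function names and the bag-of-tokens matching are assumptions, not the authors' evaluation procedure.

```python
from collections import Counter
from typing import Sequence


def sentence_f1(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """F1 between predicted and gold tokens of one sentence (bag-of-tokens match)."""
    if not predicted and not gold:
        return 1.0
    # Multiset intersection counts each token at most as often as it occurs in both.
    overlap = sum((Counter(predicted) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def mean_f1(pairs: Sequence[tuple]) -> float:
    """Average the per-sentence F1 over (predicted, gold) pairs."""
    if not pairs:
        raise ValueError("no sentences to evaluate")
    return sum(sentence_f1(p, g) for p, g in pairs) / len(pairs)


if __name__ == "__main__":
    pairs = [
        (["tatra", "bhagavat"], ["tatra", "bhagavat"]),
        (["evam", "mayā", "śru"], ["evam", "mayā"]),
    ]
    print(round(mean_f1(pairs), 3))
```

Averaging per sentence gives every sentence equal weight regardless of its length; pooling all tokens before computing F1 would weight long sentences more heavily.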

Semantic tags are currently being added using Claude Sonnet 3.7, with an accuracy of about 80% when evaluated against the manually curated semantic annotation produced in our lexicographic work (see doi:10.5281/zenodo.16633800). A paper detailing our semantic tagging experiments has been submitted for inclusion in the proceedings of eLex 2025.

For general information on the corpus, see the paper 'Word Embeddings for Buddhist Sanskrit'.

Acknowledgments
The Buddhist corpus was first realised as part of the project 'Lexis and Tradition: variation in the vocabulary of Sanskrit Mahāyāna literature'. This project was funded by the British Academy through a Newton International Fellowship (NF161436) and hosted at the Department of Theology and Religious Studies at King's College London. The corpus has subsequently been expanded and its accuracy improved with funding from the Khyentse Foundation (Ashoka Grant 2021-2022).

The reference corpus of general Sanskrit was tokenised by Matej Martinc within the project Computing the Dharma, funded by the National Endowment for the Humanities (HAA-277246-21). Parts of the reference corpus were re-processed from the CoNLL-U corpus created by O. Hellwig (see the corpus metadata included in this repository for details).

Dr Bruno Galasek-Hul contributed to versions 1.4 to 1.7, thanks to funding from the Mangalam Research Center for Buddhist Languages.

Dr Anuja Ajotikar, Madhusudan Rimal and Jai Paranjape proofread sentences sampled from versions 1.7 to 2.0, thanks to funding from the Khyentse Foundation.

Thanks to GRETIL, CTS e-texts, Vinita Tseng, Jowita Kramer and Prof. Steinkellner for kindly giving their permission to include automatically processed versions of some of their editions in this corpus.

Changelog

version 3.0 begins adding semantic tags (only a few words are tagged for now)

version 2.2.3.2 improved segmentation and lemmatization

version 2.2.3.1 improved segmentation and lemmatization

version 2.2.3 improved verb segmentation and lemmatization

version 2.2.2 much improved word segmentation and lemmatization of the Buddhist corpus; adds a few texts

version 2.2.1 improved word-segmentation and lemmatization of Buddhist corpus

version 2.1 reprocessed reference corpus; both corpora expanded

version 2.0 changes the title of the corpus, adds more Buddhist texts and improves pre-processing accuracy.

version 1.9 adds more Buddhist texts, has better segmentation and lemmatization and is partially proofread

version 1.8 adds more Buddhist texts and is partially proofread

version 1.7 adds more Buddhist texts and a new pre-processed corpus of general Sanskrit

version 1.6 adds more Buddhist texts, improves segmentation and adds an initial iteration of the lemmatised corpus

version 1.5 adds more Buddhist texts, removes the reference corpus and improves segmentation

version 1.4.1 corrects some spacing and sentence parsing errors

version 1.4 adds 59 Buddhist texts and fixes some recurrent segmentation errors

Files

Files (234.3 MB)

Name	MD5	Size
Lugli2019_BuddhSktSketchGrammar.txt	c58791f712ac0174c2363523179d12f7	15.1 kB
LugliAndQuinones2025July_BuddhistSanskritCorpus_lemmatized.zip	f6688154308790aca4ab3159bded6814	76.2 MB
LugliAndQuinones2025July_BuddhistSanskritCorpus_StemmedAndNormalized.zip	7640a7136abd3189c3e192b2664c1d65	23.8 MB
LugliAndQuinones2025July_ReferenceSanskritCorpus_lemmatized.zip	1ea6a8b3548a98d90f8e1fe6face31c2	133.3 MB
LugliGalasakQuinones2024_SanskritCorpusMetadata.csv	c15e062956267d2ac5224b5c0ba40784	1.0 MB
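
The MD5 checksums listed above can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard hashlib module; the helper names md5_of and verify_downloads are illustrative, and the script assumes the files have been saved under their original names in a single directory.

```python
import hashlib
from pathlib import Path

# File names and MD5 checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "Lugli2019_BuddhSktSketchGrammar.txt": "c58791f712ac0174c2363523179d12f7",
    "LugliAndQuinones2025July_BuddhistSanskritCorpus_lemmatized.zip": "f6688154308790aca4ab3159bded6814",
    "LugliAndQuinones2025July_BuddhistSanskritCorpus_StemmedAndNormalized.zip": "7640a7136abd3189c3e192b2664c1d65",
    "LugliAndQuinones2025July_ReferenceSanskritCorpus_lemmatized.zip": "1ea6a8b3548a98d90f8e1fe6face31c2",
    "LugliGalasakQuinones2024_SanskritCorpusMetadata.csv": "c15e062956267d2ac5224b5c0ba40784",
}


def md5_of(path, chunk_size=1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory, expected=EXPECTED_MD5) -> dict:
    """Map each file name to True (match), False (mismatch) or None (file missing)."""
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        results[name] = (md5_of(path) == checksum) if path.is_file() else None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "missing"}
    for name, status in verify_downloads(".").items():
        print(f"{labels[status]:9} {name}")
```

A mismatch usually indicates an incomplete or corrupted download, in which case the file should be fetched again.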

Additional details

Related works

Is described by: Publication: https://aclanthology.org/2022.lrec-1.411.pdf (URL)
Is part of: Other: 10.6084/m9.figshare.c.6800682.v2 (DOI)
Is source of: Dataset: 10.5281/zenodo.16633800 (DOI)

Funding

National Endowment for the Humanities
Computing the Dharma HAA-277246-21

Dates

Available: 2025-07-31 (new version)

Software

Development Status: Active

	All versions	This version
Views	2,837	63
Downloads	1,845	53
Data volume	47.7 GB	2.0 GB
