Mangalam Corpus of Buddhist Sanskrit Literature
Contributors
Hosting institution:
Project members:
Supervisor:
- 1. 84000
- 2. University of Alberta
- 3. Shan State Buddhist University
- 4. Mangalam Research Center
- 5. King's College London
Description
This is a Sanskrit corpus developed at the Mangalam Research Center (Berkeley, California) for the study of Buddhist Sanskrit lexicology.
It comprises:
- 446 lemmatized and metadata-enriched Buddhist Sanskrit texts, for a total of ~7.5 million words
- a lemmatized reference corpus of general Sanskrit comprising 375 texts, for a total of ~15 million words
- a metadata table with information about each text in the Buddhist and Reference corpora
- a stemmed and normalised version of the Buddhist corpus, together with a sketch grammar, for use in Sketch Engine
Lemmatization notes
The corpora are in romanised Sanskrit (UTF-8 encoding). Verbs are lemmatized to the stem of the third-person singular present indicative; the verb root can be found in the Root column. We have replaced the avagraha with the letter a.
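Because the avagraha is replaced with a in the corpus, search strings taken from other editions may need the same substitution before they match. The following is a minimal sketch of that idea, not the project's own pre-processing code. It assumes that the avagraha appears in the query either as an apostrophe-like character (a common romanisation convention) or as the Devanagari sign U+093D. The function name `normalise_query` and the list of marks are hypothetical. Note that apostrophes used as quotation marks would be rewritten too.

```python
import unicodedata

# Assumption: these characters stand for the avagraha in the input query.
# The corpus documentation only states that avagraha was replaced with "a".
AVAGRAHA_MARKS = ("'", "\u2019", "\u093d")


def normalise_query(text: str) -> str:
    """Return a query string with avagraha marks replaced by 'a'.

    The text is first brought to Unicode NFC so that letters with
    diacritics (e.g. a + combining macron) become single code points.
    """
    text = unicodedata.normalize("NFC", text)
    for mark in AVAGRAHA_MARKS:
        text = text.replace(mark, "a")
    return text
```

A call such as `normalise_query("so 'pi")` would then yield `"so api"`, which can be compared against the corpus text.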
Data Quality & Limitations
We are grateful to have received an Ashoka grant from the Khyentse Foundation to proofread samples of the Buddhist Sanskrit Corpus. Still, only a small percentage of the corpus has been proofread and many segmentation and lemmatization errors are likely to remain. Quantitative evaluation based on ~9000 proofread sentences puts pre-processing accuracy at ~94% (F1 0.938 averaged across all sentences).
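To illustrate what an F1 score "averaged across all sentences" can mean, here is a minimal sketch that computes a per-sentence F1 from predicted and proofread (gold) tokens and then takes the mean over sentences. The matching scheme (multiset overlap of tokens) and the function names `sentence_f1` and `mean_f1` are assumptions made for this example; the exact evaluation procedure used for the corpus is not described here.

```python
from collections import Counter


def sentence_f1(predicted, gold):
    """F1 for one sentence, based on multiset overlap of tokens (an assumption)."""
    overlap = sum((Counter(predicted) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def mean_f1(pairs):
    """Average the per-sentence F1 over (predicted, gold) sentence pairs."""
    if not pairs:
        raise ValueError("at least one sentence pair is required")
    scores = [sentence_f1(predicted, gold) for predicted, gold in pairs]
    return sum(scores) / len(scores)
```

Averaging per sentence gives every sentence equal weight regardless of its length, which is a different figure from a single F1 computed over all tokens pooled together.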
Acknowledgments
The corpus was first realised as part of the project 'Lexis and Tradition: variation in the vocabulary of Sanskrit Mahāyāna literature'. This project was funded by the British Academy through a Newton International Fellowship (NF161436) and hosted at the Department of Theology and Religious Studies at King's College London under the supervision of Prof. Henrietta Kate Crosby. The corpus has subsequently been expanded and its accuracy improved with funding from the Khyentse Foundation (Ashoka Grant 2021-2022).
Dr. Bruno Galasek-Hul contributed to versions 1.4 to 1.7 thanks to funding from the Mangalam Research Center for Buddhist Languages.
Dr. Anuja Ajotikar, Madhusudan Rimal and Jai Paranjape proofread sentences sampled from versions 1.7 to 2.0 thanks to funding from the Khyentse Foundation.
The reference corpus of general Sanskrit has been tokenised by Matej Martinc within the project 'Computing the Dharma' funded by the National Endowment for the Humanities (HAA-277246-21).
Thanks to GRETIL, CTS e-texts, Vinita Tseng, Jowita Kramer and Prof. Steinkellner for kindly giving their permission to include automatically processed versions of some of their editions in this corpus.
Changelog
version 2.2.3.1 improved segmentation and lemmatization
version 2.2.3 improved verb segmentation and lemmatization
version 2.2.2 Buddhist corpus: much improved word-segmentation and lemmatization + added a few texts.
version 2.2.1 improved word-segmentation and lemmatization of Buddhist corpus
version 2.1 reprocessed reference corpus; both corpora expanded
version 2.0 changes the title of the corpus, adds more Buddhist texts and improves pre-processing accuracy.
version 1.9 adds more Buddhist texts, has better segmentation and lemmatization and is partially proofread
version 1.8 adds more Buddhist texts and is partially proofread
version 1.7 adds more Buddhist texts and a new pre-processed corpus of general Sanskrit
version 1.6 adds more Buddhist texts, improves segmentation and adds an initial iteration of the lemmatised corpus
version 1.5 adds more Buddhist texts, removes the reference corpus and improves segmentation
version 1.4.1 corrects some spacing and sentence parsing errors
version 1.4 adds 59 Buddhist texts and fixes some recurrent segmentation errors
Files
Lugli2019_BuddhSktSketchGrammar.txt
Files (230.6 MB)
MD5 checksum | Size |
---|---|
md5:c58791f712ac0174c2363523179d12f7 | 15.1 kB |
md5:15e6d790b87d465d5879706953fb2cae | 73.9 MB |
md5:707c5dc366580c1a70912542636b6be3 | 23.8 MB |
md5:fa18203ae2faca451cf9c5f566a67e1d | 131.9 MB |
md5:c15e062956267d2ac5224b5c0ba40784 | 1.0 MB |
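The MD5 checksums listed for the files can be used to confirm that a download is complete and uncorrupted. The sketch below uses Python's standard `hashlib` module. The helper names `md5_of_file` and `matches_checksum` are hypothetical, and the local file path passed in is whatever name the file was saved under.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Reading in chunks keeps memory use low for the larger corpus files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

For example, `matches_checksum(local_path, "md5:c58791f712ac0174c2363523179d12f7")` should return `True` for an intact copy of the 15.1 kB file.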
Additional details
Related works
- Is cited by
- Publication: https://aclanthology.org/2022.lrec-1.411.pdf (URL)
- Is part of
- Other: 10.6084/m9.figshare.c.6800682.v2 (DOI)
Dates
- Available: 2024-06-27 (new version)