The wordlist is derived from three dictionaries: Monier-Williams, Edgerton, and (for verbs only) Stardict.
The original dictionary files have been manipulated in the following ways:
- cleaned of extraneous characters and columns other than the lemmata
- lemmata that most likely correspond to lemma + alpha privativum (a- or an-) have been removed
- lemmata have been deduplicated
- compounds that are typically lexicalised in Buddhist sources have been added (e.g. svacitta, pañcadharma)
- lemmata have been stemmed to allow grouping of as many morphological forms of the same lemma as possible in the ngrams
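The deduplication and privative-removal steps above could be sketched as follows. This is a minimal illustration, not the procedure actually used: the function name `clean_lemmata` and the heuristic (drop a lemma when stripping an initial a- before a consonant, or an- before a vowel, leaves another lemma in the list) are assumptions.

```python
import unicodedata

# Used for a vowel-initial check; "ai" and "au" are covered by "a".
VOWELS = tuple(
    unicodedata.normalize("NFC", v) for v in "a ā i ī u ū ṛ ṝ ḷ e o".split()
)


def clean_lemmata(lemmata):
    """Deduplicate lemmata and drop likely a-/an- privative formations.

    Hypothetical heuristic: a lemma counts as privative when removing
    "an" (before a vowel) or "a" (before a consonant) yields another
    lemma that is also present in the list.
    """
    normalized = (unicodedata.normalize("NFC", l.strip()) for l in lemmata)
    # dict.fromkeys deduplicates while preserving the original order.
    unique = [l for l in dict.fromkeys(normalized) if l]
    known = set(unique)

    kept = []
    for lemma in unique:
        an_rest = lemma[2:] if lemma.startswith("an") else ""
        a_rest = lemma[1:] if lemma.startswith("a") else ""
        is_an_privative = an_rest in known and an_rest.startswith(VOWELS)
        is_a_privative = a_rest in known and not a_rest.startswith(VOWELS)
        if not (is_an_privative or is_a_privative):
            kept.append(lemma)
    return kept
```

For example, given dharma, adharma, anta, ananta and agni, this sketch keeps dharma, anta and agni. A heuristic like this will also drop some genuine lemmata that merely look privative, so the result would need manual review.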
stemming procedure:
- the final character of all words is removed.
- for words ending in -as/aś/an/at, the final 2 characters are removed (bhagavat => bhagav)
- words in -ant are stemmed to -at (see below)
- verbs are stemmed by removing the third person singular ending ati/oti. The wordlist has been developed within a project focused on nouns; more work is needed for stemming verbal forms, especially derivative forms.
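The stemming rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `stem`, the `is_verb` flag and the `unstemmed` set (for indeclinables and pronouns) are hypothetical, -ant is read here as "rewrite to -at, then apply the -at rule", and verbs with endings other than ati/oti are simply left unchanged.

```python
import unicodedata


def _nfc(text):
    """Normalize to NFC so that letters like ś are single code points."""
    return unicodedata.normalize("NFC", text)


TWO_CHAR_ENDINGS = tuple(_nfc(e) for e in ("as", "aś", "an", "at"))
VERB_ENDINGS = ("ati", "oti")


def stem(lemma, is_verb=False, unstemmed=frozenset()):
    """Return the stem of a lemma following the rules listed above."""
    word = _nfc(lemma)

    # Indeclinables and pronouns are passed in explicitly (assumption).
    if word in unstemmed:
        return word

    if is_verb:
        for ending in VERB_ENDINGS:
            if word.endswith(ending):
                return word[: -len(ending)]
        return word  # assumption: other verbal forms are left untouched

    # Assumed reading of the -ant rule: treat -ant as -at first.
    if word.endswith("ant"):
        word = word[:-3] + "at"

    if word.endswith(TWO_CHAR_ENDINGS):
        return word[:-2]
    return word[:-1]
```

With these assumptions, `stem("bhagavat")` gives `bhagav`, `stem("dharma")` gives `dharm`, and `stem("karoti", is_verb=True)` gives `kar`.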
exceptions to stemming procedure:
indeclinables and pronouns are usually not stemmed, but there may be exceptions.
sandhi initial surface forms
Since the wordlist is stemmed, we do not need to account for ending sandhi: the string buddhaścadharmaśca will only match buddh & dharm; the (hypothetical) string dharmasyauṣadham will only match dharm and auṣad.
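To illustrate why ending sandhi does not matter for a stemmed list, here is a small sketch that looks up stems as plain substrings of an unsegmented string. Substring search is an assumption made for the example (the actual matching against the ngrams may work differently), and `matched_stems` and the toy `STEMS` set are hypothetical.

```python
import unicodedata


def matched_stems(text, stems):
    """Return the stems found in text, ordered by first occurrence."""
    text = unicodedata.normalize("NFC", text)
    hits = []
    for s in stems:
        s = unicodedata.normalize("NFC", s)
        position = text.find(s)
        if position != -1:
            hits.append((position, s))
    return [s for _, s in sorted(hits)]


# Toy stemmed wordlist using the stems quoted in the text above.
STEMS = {"buddh", "dharm", "citt", "auṣad"}

print(matched_stems("buddhaścadharmaśca", STEMS))
print(matched_stems("dharmasyauṣadham", STEMS))
```

Because the stems stop before the word ending, whatever sandhi change happens at the end of a word falls outside the matched substring.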
Initial sandhi, by contrast, needs to be accounted for. The string cittotpada would only match citt in a simple stemmed wordlist. We need to enrich the wordlist with surface forms for initial sandhi; in other words, we need to add otpad to the wordlist.
Example: prthivyaiveyam (-ā + eva + iyam).
Here are the additions:
- words beginning in a => ā (avagraha is already resolved with corpus cleaning)
- words beginning in i => ī
- words beginning in i => e
- words beginning in e => ai
- words beginning in u => o
- words beginning in u => ū
- words beginning in o => au
- words beginning in ṛ => r
- words beginning in ś => cch
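The additions listed above amount to a lookup table from the initial sound to its possible surface forms. A minimal sketch follows; the names `INITIAL_SANDHI`, `initial_variants` and `enrich` are hypothetical, and skipping stems that begin with ai/au (so that the a => ā rule does not fire on them) is an assumption not stated in the list.

```python
import unicodedata


def _nfc(text):
    """Normalize to NFC so that ṛ and ś are single code points."""
    return unicodedata.normalize("NFC", text)


# Initial sound => surface forms after sandhi, as listed above.
INITIAL_SANDHI = {
    _nfc(initial): tuple(_nfc(s) for s in surfaces)
    for initial, surfaces in {
        "a": ("ā",),
        "i": ("ī", "e"),
        "e": ("ai",),
        "u": ("o", "ū"),
        "o": ("au",),
        "ṛ": ("r",),
        "ś": ("cch",),
    }.items()
}


def initial_variants(stem):
    """Return the set of initial-sandhi surface forms for one stem."""
    word = _nfc(stem)
    # Assumption: diphthong-initial stems get no variants.
    if word.startswith(("ai", "au")):
        return set()
    surfaces = INITIAL_SANDHI.get(word[:1], ())
    return {surface + word[1:] for surface in surfaces}


def enrich(stems):
    """Return the stems plus all of their initial-sandhi surface forms."""
    enriched = {_nfc(s) for s in stems}
    for s in stems:
        enriched |= initial_variants(s)
    return enriched
```

For the stem utpad this adds otpad and ūtpad, so the string cittotpada now matches both citt and otpad. Short variants such as these can produce false positives, so the enriched list may need to be checked against the corpus.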
Derivative forms
Derivative forms have generally been removed from the stemmed wordlist.