The wordlist is derived from three dictionaries: Monier-Williams, Edgerton and (for verbs only) Stardict.

The original dictionary files have been manipulated in the following ways:

stemming procedure:

  • the final character of all words is removed.
  • for words ending in -as/-aś/-an/-at, the final two characters are removed (bhagavat => bhagav).
  • words in -ant are stemmed to -at (see below).
  • verbs are stemmed by removing the third-person singular ending -ati/-oti. The wordlist was developed within a project focused on nouns; more work is needed on stemming verbal forms, especially derivative forms.
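
The rules above can be sketched roughly as follows. This is only an illustration, not the script used to build the wordlist: the function name stem and the is_verb flag are hypothetical, the -ant rule is left out because its exact form is not spelled out here, and the exceptions for indeclinables and pronouns are not handled.

```python
import unicodedata

# Endings for which the final two characters are dropped (from the list above).
TWO_CHAR_ENDINGS = ("as", "aś", "an", "at")
# Third-person singular endings removed from verbs.
VERB_ENDINGS = ("ati", "oti")


def stem(word: str, is_verb: bool = False) -> str:
    """Hypothetical sketch of the stemming rules described above."""
    # Assumption: input is normalised to NFC so that "ś" is a single character.
    word = unicodedata.normalize("NFC", word)
    if is_verb:
        for ending in VERB_ENDINGS:
            if word.endswith(ending):
                return word[: -len(ending)]
        # Assumption: verbs without a listed ending are left unchanged.
        return word
    if word.endswith(TWO_CHAR_ENDINGS):
        return word[:-2]
    # Default rule: drop the final character.
    return word[:-1]


print(stem("bhagavat"))               # bhagav
print(stem("buddha"))                 # buddh
print(stem("bhavati", is_verb=True))  # bhav
```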

exceptions to stemming procedure:

Indeclinables and pronouns are usually not stemmed, but there may be exceptions.

sandhi initial surface forms

Since the wordlist is stemmed, we do not need to account for sandhi at word endings: the string buddhaścadharmaśca will only match buddh and dharm; the (hypothetical) string dharmasyauṣadham will only match dharm and auṣad.
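
As a rough illustration of why stemmed entries sidestep sandhi at word endings, the sketch below scans an unsegmented string for every stem in a toy wordlist. The function name find_stems and the two-entry wordlist are assumptions made for this example; the actual matching code may work differently.

```python
import unicodedata


def find_stems(text: str, stems: list[str]) -> list[tuple[int, str]]:
    """Return (position, stem) for every stem found anywhere in text."""
    # Assumption: NFC normalisation, so each diacritic letter is one character.
    text = unicodedata.normalize("NFC", text)
    hits = []
    for start in range(len(text)):
        for candidate in sorted(stems):
            if text.startswith(candidate, start):
                hits.append((start, candidate))
    return hits


# Toy wordlist: the stems stop before the sandhi-affected endings (-aś).
print(find_stems("buddhaścadharmaśca", ["buddh", "dharm"]))
# [(0, 'buddh'), (9, 'dharm')]
```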

Initial sandhi, by contrast, needs to be accounted for. The string cittotpada would only match citt in a simple stemmed wordlist. We need to enrich the wordlist with surface forms for initial sandhi; in other words, we need to add otpad to the wordlist.

Example: prthivyaiveyam (ā-eva-iyam). Here are the additions:

Derivative forms

Derivative forms have generally been removed from the stemmed wordlist.