Preface

The Visual Dictionary and Thesaurus of Buddhist Sanskrit is part of the Buddhist Translator Workbench, a project directed by Ligeia Lugli and hosted at the Mangalam Research Center for Buddhist Languages.

The initial development of the computational infrastructure behind this app was funded by the British Academy through a Newton International Fellowship [NF161436].

This resource is the product of the concerted efforts of Ligeia Lugli, Bruno Galasek-Hul and Luis Gamaliel Quiñones-Martinez.

This app is currently a working prototype best viewed on wide screens and not designed for mobile devices.

Our goal

We seek to convey a portrait of the vocabulary of Buddhist Sanskrit sources that reflects attested language use.

In particular we aim to:

  1. Highlight the malleability of words in context and help users appreciate the influence of co-text, discourse, genre and traditional affiliation on lexical meaning.

  2. Convey the semantic continuity that typically holds between different identifiable senses of a lemma, and help users negotiate the often fuzzy distinction between polysemy and vagueness (for more on this see the Lexical Data & Annotation section and the explanation of the Polysemy Viewer).

  3. Show the relationship between specialised and non-specialised uses of Buddhist vocabulary. While we appreciate the importance of encyclopedic explanations of Buddhist terminology, here we offer a different perspective. We highlight the continuity between general language and specialised meanings, with a view to inviting our users to reflect on the idiomaticity and level of intelligibility of many Buddhist terminological applications. For a detailed account see Lugli’s paper Words or Terms?.

  4. Facilitate the comparison between clusters of semantically or etymologically similar words. To this end we cluster lemmata into cognates and near-synonyms and offer a thesaurus view of our data.

Corpus

We use a corpus of convenience based on texts that have been digitized and are available in translation in major European languages. Our corpus is a subset of Lugli’s corpus of Buddhist Sanskrit literature and it is currently undergoing significant expansion. A list of all primary sources currently used in this resource is available in the Sources tab on this page.

Metadata

All our sources are tagged with metadata pertaining to title, text.type, period, tradition and, when available, author. All these categories are, to different degrees, uncertain. The title category is perhaps the least controversial; still, it is not without challenges. For example, should we refer to the Madhyāntavibhāgabhāṣya and the Madhyāntavibhāga under two headings, one for the commentary and one for the mūla? We decided to subsume both under a single title, on the assumption that the verses and the commentary are likely to have been composed around the same time, possibly by the same person.

Matters are far more nebulous for the categories of text.type and tradition. Both are used merely as heuristics and are still rather fluid. text.type categorises all texts into broad genre-like categories: sūtras, śāstras, avadānas and ‘literature’ (lit), intended as belles lettres. The category of tradition has been devised here mainly to allow for diachronic comparison within strands of Buddhist literature or philosophy. So far it distinguishes between madhyamaka, yogācāra, tathāgatagarbha (TG), prajñāpāramitā and all the rest, which is tagged with ‘Buddh’. The categorisation has been largely guided by the availability of diachronic subcorpora for a certain type of literature.

period is a very rough and broad chronological categorization that divides the entire corpus into three layers: foundational (ca. I BCE - III CE), classical (ca. IV-V CE) and commentarial (VI-XI CE). This is by no means a definitive periodization and we are open to suggestions as to how to slice the corpus diachronically, especially with regard to later materials, which are beyond our area of expertise.

A further metadata category, discourse, is available in the source corpus but is not yet included in this resource. It allows us, for example, to divide śāstras into abhidharma, commentaries, logic etc. Needless to say, every text could be classified in multiple ways and we are still working out a taxonomy that would strike a balance between precision and heuristic power.

We are well aware that these categorisations are in no small part subjective. Other researchers may want to tag our sources with different metadata for their own analysis. This is why we offer the possibility to customise metadata through our Metadata Editor.

To use the Metadata Editor go to Documentation > Sources & Metadata. If you only wish to change the period associated with a text or two, select the text you want to change, choose your desired period and click save. Repeat the process for each text you want to modify. The changes will last for the duration of the current session or until you click on reset default. This means that you will have to re-enter your changes the next time you use this resource.

Re-entering changes can be tedious if you wish to change many values. To speed up the process, you can upload a revised dataset file. This is also the option to use if you wish to change the metadata tagset altogether, for example if you want to add a fourth period to our threefold periodization or if you want to change the categories for text.type or tradition. To use a customised file we recommend first downloading our metadata file using the download button in the Metadata Editor and making the desired changes to the downloaded file. It is important that the column names and column order remain the same as in the original file downloaded from the Metadata Editor page. Once you have edited the file as needed, please save it as .csv and upload it to the Metadata Editor through the file browser provided. The customised metadata from the uploaded file will then appear in all bar charts (they will not appear in the Examples filters, though). The custom metadata will remain active for the duration of the current session or until you click on reset default, after which a new upload will be necessary to access your custom metadata.
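For users who prefer to revise the downloaded metadata file with a script rather than by hand, the following is a minimal Python sketch of one way to do so. It only illustrates the constraint stated above, namely that column names and column order must stay as downloaded. The header row, the sample rows, the value "late classical" and the function name revise_periods are all assumptions made for illustration; the real file may look different.

```python
import csv
import io

# Hypothetical stand-in for the file downloaded from the Metadata Editor.
# The actual column names and values in the real file may differ.
DOWNLOADED = """\
title,text.type,period,tradition,author
Text A,lit,foundational,Buddh,
Text B,lit,classical,TG,Author X
"""

def revise_periods(csv_text, new_periods):
    """Return revised CSV text with the same columns in the same order.

    new_periods maps a title to the period label it should receive.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames  # header names and order are left untouched
    rows = list(reader)
    for row in rows:
        if row["title"] in new_periods:
            row["period"] = new_periods[row["title"]]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

# Add a (hypothetical) fourth period label to one text.
revised = revise_periods(DOWNLOADED, {"Text B": "late classical"})
print(revised)
```

Reusing the header read from the downloaded file, rather than retyping it, is what keeps the column names and order identical to the original. The resulting text can then be saved as a .csv file and uploaded through the file browser.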

Method

Lemma selection

So far we have adopted a primarily onomasiological approach to lemma selection. That is, we study clusters of words that express different aspects of a set of concepts. The choice of concepts stems from the research interests of our team. To date we have worked on lemmata pertaining to the conceptual domains of LANGUAGE, CONCEPTUALIZATION and FAITH.

This onomasiological approach allows us to compare semantically related words. To expand comparison to etymologically related words, we have extended our scope to also include cognates of the lemmata included in our resource, regardless of their meaning.

In the future we plan to integrate a corpus-driven element into our lemma selection, whereby words over-represented in Buddhist literature vis-à-vis other types of Sanskrit literature are prioritized over other lexical items. Still, for each ‘Buddhist keyword’ thus identified we will continue to cover near-synonyms and cognates to facilitate lexical comparison.

Lexical Data & Annotation

All entries are powered by a lexical dataset compiled by our team. For each lemma, the dataset contains citations extracted from our corpus and annotated by Luis Quiñones and Bruno Galasek-Hul.

The manual annotations include semantic, syntactic, conceptual and discoursive information.

Semantic annotations consist of sense and subsense labels crafted by our team as well as semantic categories derived from the Historical Thesaurus of English (for details, see explanation of the semantic field variable in Lemma Explorer).

Syntactic annotations mark a lemma’s syntactic dependencies in a citation.

Conceptual annotations highlight the conceptual relations a lemma enters into with other words (or the concepts they express) in a citation.

We use a tenfold typology of conceptual relations: leading to, caused by, possessing, belonging to, locus of, located in, by means of, is achieved through, is the goal of and takes as goal.

Discoursive annotations mark cases where the lemma is contrasted with, listed with, juxtaposed to, glossed by or glosses another word in a citation.

Other manual annotations include semantic prosody/connotation and uncertainty.

Semantic prosody (sem.pros) refers to whether a lemma acquires a positive or negative overtone in context. Generally the semantic prosody associated with a word emerges from a repeated pattern of use. For example, we can say that ‘set in’ has a negative connotation in English because it is systematically associated with words that possess a negative connotation.

For a pattern to emerge, we need to consider many individual instances. To this end, in this project we treat semantic prosody slightly differently: we annotate it in each citation as a property of each instantiation of a lemma in context. We use a fourfold typology for semantic prosody: positive, negative, neutral and neutral-negative.

We annotate a lemma as having negative semantic prosody when the concept expressed by the lemma is clearly depicted as negative, e.g. the lemma vikalpa in the phrase vikalpasaṃsārāvahāka (vikalpa is the source of samsara). Conversely, we annotate semantic prosody as positive if the concept expressed by a lemma is described as positive, or leading to something good etc. In cases where the lemma is negated (e.g. na vikalpayati) or is modified by an adjective with a negative connotation which suggests that some aspects of the concept expressed by the lemma are negative (but not the concept tout court, e.g. akuśala-vikalpa), we annotate the semantic prosody as being neutral-negative.

We annotate uncertainty to flag annotations associated with problematic citations. We use a fourfold typology of uncertainty: philological (e.g. the passage appears to be corrupted), disputed (i.e. scholars in the field have put forward radically different interpretations of the meaning of a lemma in that citation), vague (the meaning of the lemma in a citation is under-specified) and other (the problem stems from other causes).

Once all citations selected for a lemma have been annotated, it is possible for us to craft an overview of the lemma based on analysis of the annotations. The production of such lexical overviews will be carried out in the next iteration of the dictionary (see Lugli’s eLex2019 article).

The same lexical dataset that powers the Visual Dictionary of Buddhist Sanskrit also generates the Thesaurus display.

Citation selection

Time constraints prevent us from annotating all citations available in our corpus for a lemma. We perform stratified random sampling to select a fraction of citations to be annotated.

Typically we annotate 20% of the citations available for a lemma in each text. However, we aim to annotate at least 30 citations per lemma. This means that for rare lemmata we sample a greater percentage of citations, or even annotate all available citations.

Needless to say, some lemmata have fewer than 30 attestations in our corpus, some far fewer. Entries based on very few citations are clearly signalled, as the data is insufficient for lexicographic description (see data quality).
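The sampling rule described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used by the team: the function name sample_citations is hypothetical, the per-text quota is rounded up, and the shortfall for rare lemmata is made up by drawing at random from the citations not yet selected, a detail the description above leaves open.

```python
import random

def sample_citations(citations_by_text, percent=20, minimum=30, seed=0):
    """Stratified random sample: `percent` of the citations in each text,
    topped up to `minimum` citations per lemma where possible.

    The rounding and top-up strategy are assumptions made for this sketch.
    """
    rng = random.Random(seed)
    chosen, leftover = [], []
    for text in sorted(citations_by_text):
        shuffled = list(citations_by_text[text])
        rng.shuffle(shuffled)
        # Integer ceiling of percent% of the citations in this text.
        quota = (len(shuffled) * percent + 99) // 100
        chosen.extend(shuffled[:quota])
        leftover.extend(shuffled[quota:])
    shortfall = minimum - len(chosen)
    if shortfall > 0:
        # Rare lemma: sample a greater share, up to all available citations.
        rng.shuffle(leftover)
        chosen.extend(leftover[:shortfall])
    return chosen

# Hypothetical citations, identified by (text, position) pairs.
common = {"A": [("A", i) for i in range(200)], "B": [("B", i) for i in range(100)]}
medium = {"A": [("A", i) for i in range(50)], "B": [("B", i) for i in range(50)]}
rare = {"A": [("A", i) for i in range(10)], "B": [("B", i) for i in range(5)]}

print(len(sample_citations(common)), len(sample_citations(medium)), len(sample_citations(rare)))
```

With these made-up inputs, the well-attested lemma is sampled at 20% per text, the medium one is topped up to the 30-citation floor, and the rare one has all of its citations selected.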

A word of caution

Graphs and stats can be deceptive, especially when applied to inherently fuzzy data such as semantics derived from suboptimal editions of virtually undatable texts that have been subjected to centuries of textual transmission.

All categorisations and metadata are merely heuristics; they are often tentative and subject to change and revision.

Charts should ideally be blurry, as the borders between variables are intended to be porous. We have yet to find a suitable data visualisation for this.

Setting expectations

This dictionary is a product of a very small team of three people working on a tight budget.

None of us is a professional software engineer. We rely on the R shiny package to bring our work to the public as quickly as possible (which, given the labour associated with analysing Sanskrit passages, is still very slow).

R and shiny are amazing open source tools, but have some drawbacks. Chief among them, they can be slow (especially with our clunky code!). Please be patient if the data take a while to load and please let us know if you encounter errors (email ligeia.lugli@kcl.ac.uk).

User guide

The Visual Dictionary and Thesaurus of Buddhist Sanskrit is a rapidly evolving prototype. This guide may not reflect the latest published version exactly.


Dictionary

Choose a Lemma

There are two ways to search for a lemma: from the Lemmata list and from the Dictionary page.

The Lemmata list tab is on the Home page (last tab on the right, below the top navigation bar). This list contains all the lemmata currently available in this resource. Clicking on the corresponding view lemma link will bring you to the Overview of the corresponding lemma in the Dictionary page.



Alternatively, go to the Dictionary page, choose the desired tab (by default Overview is displayed), and select a lemma using the menu on the left.

To select a lemma from the menu, first delete the lemma displayed in the search box, then start typing the desired word in the search box. As you type, a dropdown menu displays headwords matching the string entered so far. Note that a maximum of 7 headwords is displayed at one time, even though there may be more matches; keep typing to refine them. Finally, select the desired headword from the dropdown menu and hit return to view the corresponding entry.

If no matches appear in the dropdown menu, the word you are searching for is not yet in the dictionary.

The dictionary is a work in progress and only relatively few lemmata are currently available. To check which words are available without leaving the Dictionary page, click on the link below the lemma search box.

Lemma Overview

The introduction section of an entry consists of four parts:

  1. data quality

  2. automatic summary

  3. semantic tree

  4. curated overview (work in progress)

Here is how these parts appear in an entry:

data quality

The first line of an entry states how many citations the entry is based on. This is important information about the reliability of an entry.

As a rule of thumb, the more citations an entry is based on, the more reliable the information it contains. The obvious caveat is that no matter how many citations we have studied for a lemma, the sense labels we assign to it are entirely the fruit of our interpretation and as such can always be inaccurate.

When an entry is based on fewer than 20 citations, a warning appears to signal that this is an ancillary entry meant to be perused as part of a wider lexical cluster.

Ancillary entries are a byproduct of our work on other semantically or etymologically related words. We systematically investigate cognates and derivative forms of the lemmata we cover, so it often happens that while working on a fairly well-attested word we also annotate citations for some very rare cognate of the word. For example, while working on abhilāpa we have annotated one single citation for the derivative form abhilapana. A single attestation is unlikely to yield meaningful lexicographic information; however, the attestation may be of interest when perused together with other related cognates and compared with the lemma.

To facilitate such comparative exploration, a tentative cluster of semantically related cognates is provided in the Mini thesaurus section.

automatic summary

This section provides an automatically generated list of the senses we have assigned to the lemma.

The sense labels are in red and are the fruit of our interpretation based on the citations we have annotated. For details on our semantic annotation see Lugli 2015; for the process we use to select citations to annotate see our preface.

semantic tree

The semantic tree offers a graphical representation of our semantic annotation. The tree can be zoomed in and out to better fit the screen.

The size of the nodes in the tree is proportional to the raw frequency of the node. Hover over a node to see frequency data.

The tree shows three levels of semantic granularity going from left to right:

  1. conceptual domain

This is a very general semantic categorisation that serves to identify the broad conceptual domains to which the senses of a lemma can be said to pertain. For example, the English word key has senses that can be associated with the domain of music (think D major) and others that pertain to the physical world (think door key).

  2. sense

This is a sense label in English that we think captures the general meaning expressed by the lemma in some (or all) of the attestations we have analysed.

It can be conceived of as an English cognitive equivalent of the meaning expressed by the Sanskrit lemma (or part thereof).

  3. subsense

This is a more fine-grained sense label aimed at capturing semantic nuances, metaphorical applications and terminological specialisations that can be interpreted as derived from a more general sense.

If we have found no subsenses pertaining to a certain sense, the same label used for sense is repeated at subsense level.

Needless to say, semantic labelling is a subjective endeavour; the semantic tree is a representation of our interpretation of the citations we have analysed.

curated overview

This is a human-curated overview of the semantic spectrum of the lemma, its use in context and its collocational or syntactic preferences as they emerge from our analysis of the corpus attestations considered for the entry. Human curation will be the focus of the next iteration of the Dictionary. Currently only one entry displays this feature, for showcasing purposes.

The rationale for postponing curation to a later iteration of our Visual Dictionary is explained in this paper.

Lemma in Context

Quick Examples

Examples are taken from our corpus with the aim of illustrating the use of a lemma in context in each of its meanings.

Currently the examples are randomly selected, but we are transitioning to a curated selection of examples to better illustrate the use of lemmata in context.

Users wishing to view more examples or retrieve examples matching criteria other than semantic categorisation can use our [Example Explorer] app.

More Examples

This section contains all the citations we have annotated, minus those marked with an uncertainty tag (see Lexical Data & Annotation).

Use the controls on the right to filter examples.

This tab also contains a contrastive examples section that highlights how the lemma relates to other words.

Contextual Wordclouds

Typically three wordclouds are displayed in this section:

  1. a wordcloud showing the co-text (i.e. words that occur in the vicinity) of the lemma in our citations. The information displayed here is derived directly from the corpus and not curated.

  2. a wordcloud showing words that feature in a specific syntactic relation with the lemma, e.g. adjectives that modify the lemma or verbs that take the lemma as subject or object. This information is derived from manual syntactic annotation of the lemma. Note that besides adjectives and appositions, the modifies and modified_by relations include genitive and compounded constructions where one word serves to specify the meaning or scope of application of another (e.g. in the phrase the neighbour’s dog, ‘neighbour’ would be marked as modifying ‘dog’).

  3. a wordcloud showing words that feature in a specific lexical or discoursive relation with the lemma, e.g. words that are contrasted with the lemma, listed with it, used to gloss it etc. This information is based on manual annotation of citations.

The color of each wordcloud corresponds to the type of relation (cotextual, syntactic or discoursive) it represents. The specific relation displayed can be changed using the dropdown menu on the left under ‘wordclouds’.

The size of each word in the wordcloud is proportional to its frequency. The sliders on the left can be used to change the size, maximum and minimum frequency of the words in the wordclouds.

If fewer than three wordclouds are displayed, it means that there are insufficient data to generate a particular type of wordcloud.

Explore the Lemma

This section consists of two parts:

Lemma Explorer

This versatile barchart can be used to show the relationship between most of the variables encoded in the annotated citations, be they curated (e.g. semantic and connotational information) or derived directly from the corpus (e.g. grammatical information).

The variables:

  1. senses (see above under semantic tree)

  2. subsenses (see above under semantic tree)

  3. connotation, or semantic prosody (sem.pros) (see above under Lexical Data & Annotation). This refers to whether a lemma acquires a positive or negative overtone in context. Generally the semantic prosody associated with a word emerges from a repeated pattern of use. For example, we can say that ‘set in’ has a negative connotation in English because it is systematically associated with words that possess a negative connotation. We use a fourfold typology for semantic prosody: positive, negative, neutral and neutral-negative. We annotate a lemma as having negative semantic prosody when the concept expressed by the lemma is clearly depicted as negative, e.g. the lemma vikalpa in the phrase vikalpasaṃsārāvahāka (vikalpa is the source of samsara). Conversely, we annotate semantic prosody as positive if the concept expressed by a lemma is described as positive, or leading to something good etc. In cases where the lemma is negated (e.g. na vikalpayati) or is modified by an adjective with a negative connotation which suggests that some aspects of the concept expressed by the lemma are negative (but not the concept tout court, e.g. akuśala-vikalpa), we annotate the semantic prosody as being neutral-negative.

  4. semantic domain or conceptual domain (see above under semantic tree)

  5. semantic field (sem.field). These are semantic labels that we have adapted from the conceptual taxonomy of the Historical Thesaurus of English (HTE). The semantic space is carved differently in Sanskrit and English, so often there is no semantic field in the HTE taxonomy that approximates the semantics of a Sanskrit word. This is typically the case when a Sanskrit word simultaneously expresses different English semantic fields, as in the case of saṃjñā, which straddles the fields of LANGUAGE and THOUGHT. In these cases we use the @ sign to represent the combination of two different semantic fields. We have limited combinations to a maximum of two semantic fields, as combining more fields would negatively impact the readability of the semantic labels. For more on our use of the HTE’s semantic fields see Lugli 2015.

  6. semantic category (sem.cat). These are semantic labels like those used for the semantic field variable, but they seek to achieve a more granular semantic description than semantic fields.

  7. text-type/genre. This is an approximate descriptor of literature type. Currently the dictionary includes sūtras, śāstras, avadāna and ‘literary text’ (lit). The latter category includes poetry and drama. These categories are to be understood as fluid and merely heuristic.

  8. tradition/school. This is an approximate descriptor of the philosophical tradition to which a text belongs (Madhyamika, Yogācāra, Tathāgatagarbha). These categories are to be understood as fluid and merely heuristic.

  9. period. This refers to a tentative rough categorisation in three broad diachronic layers: foundational (circa I-III CE), classical (circa IV-V CE) and commentarial (VI CE onwards). These categories are to be understood as fluid and merely heuristic.

  10. case/voice. This refers to the grammatical case for nouns and the voice for verbs.

  11. number. This is grammatical number.

  12. text. The titles of the texts in the corpus used by the dictionary. In most cases the titles are too many to display nicely on a screen; it is advisable to choose this variable only for lemmata with few citations.

Each variable can be viewed on either the x or y axis. It is usually easier to read the graph with the variable with the fewest values on the y axis.

The bars in the graph can be viewed as stacked bars, dodged bars (side by side) or as proportional bars. The first two options represent raw frequencies of occurrences of the values of a chosen variable in our data. The last option displays those frequencies as a percentage of all occurrences for that variable. It is advisable to always check raw frequencies, especially when the number of observations differs radically across variables.
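The difference between raw and proportional bars, and the reason for checking raw frequencies, can be illustrated with a small Python sketch. The citation data, the sense labels and the function name proportions are invented for this example and do not come from the dictionary's dataset.

```python
from collections import Counter

# Hypothetical annotated citations as (period, sense) pairs.
citations = (
    [("foundational", "concept")] * 2
    + [("foundational", "imagination")] * 1
    + [("classical", "concept")] * 30
    + [("classical", "imagination")] * 10
)

def proportions(pairs):
    """Percentage of each sense within each period (the 'proportional' view)."""
    counts = Counter(pairs)
    totals = Counter(period for period, _ in pairs)
    return {
        (period, sense): 100 * n / totals[period]
        for (period, sense), n in counts.items()
    }

raw = Counter(citations)        # what stacked and dodged bars display
share = proportions(citations)  # what proportional bars display

for key in sorted(raw):
    print(key, raw[key], round(share[key], 1))
```

In this made-up case the sense "concept" accounts for roughly two thirds of the foundational citations and three quarters of the classical ones, which looks comparable in a proportional chart. The raw counts show that the first figure rests on only three citations while the second rests on forty.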

Polysemy Viewer

This graph shows how the various senses of a word are distributed over a chosen text. The attestation of different senses in close proximity may signal that the senses were perceived as a single vague or underspecified sense in the speech community within which the text was produced, and that the distinction into discrete senses is an artifact of interpreting the word through translation. It may also signal word play. A version of the Polysemy Viewer has been used in the research behind this article.

Mini thesaurus

To understand what a word means and how it is used in context, it is often valuable to contrast it with similar words, especially near synonyms and cognates (words sharing the same root or etymology). In order to do this, we first need to be aware of these similar words. This section of the Visual Dictionary is designed to facilitate this.

Two tree-like visualisations show near synonyms and cognates of the lemma. The tree on the left, in blue, displays words that share a semantic field with the lemma and are likely to belong to the same part of speech (e.g. all nouns). The part of speech is inferred from the morphology of the words and is not manually annotated, so it may not always be accurate. The tree on the right, in red, displays words that share both a semantic field and a morphological root with the lemma. This tree includes words in all parts of speech. Like the previous tree, this one relies on raw corpus data without any manual annotation and may not always be accurate.

The final visualisation of the dictionary allows us to zoom in on each of the semantic domains lexicalised by the lemma and discover which other words in the dictionary lexicalise the same domain. In particular this graph affords an overview of the different aspects of a concept that are foregrounded by the vocabulary covered by the dictionary. For example, the words vitarka and saṃjñā foreground different aspects of the general concept of THOUGHT. Here we use the word ‘concept’ rather loosely to indicate a high-order conceptual domain in a semantic taxonomy. Likewise, we take the different aspects of a concept to coincide with the semantic fields we annotate. Of course this is but a rough approximation; we hope it may still be useful to get a sense of the conceptual universe expressed by the vocabulary covered in the dictionary.

This graph also displays the relative frequency (as a percentage) with which each aspect of a concept (i.e. a semantic field) is lexicalised by each word. For example, the semantic field Thought@Creation, which represents the imaginative/projecting aspect of Thought, is equally lexicalised by forms derived from pari√kḷp (e.g. parikalpa, parikalpita) and kḷp (e.g. kalpanā, kalpa, kalpita).

To better see where each word is in the graph, click on the corresponding word in the list on the right (note that you may have to scroll down the list to view all lemmata).

Thesaurus

Choose a Conceptual Domain

Just as for choosing a lemma, there are two ways to search for a conceptual domain (domain for short): from the Domains list and from the Thesaurus page.

The Domains list tab is on the Home page (next to last tab on the right, below the top navigation bar). This list contains all the domains currently available in this resource. Clicking on the corresponding view domain link will bring you to the Domain Overview tab of the corresponding domain in the Thesaurus page.

Alternatively, go to the Thesaurus page, choose the desired tab (by default Domain Overview is displayed), and select a domain using the menu on the left.

To select a domain from the menu, first delete what is displayed in the search box, then start typing the desired domain in the search box. As you type, a dropdown menu displays domains matching the string entered so far. Note that a maximum of 7 domains is displayed at one time, even though there may be more matches; keep typing to refine them. Finally, select the desired domain from the dropdown menu and hit return to view the corresponding entry.

If no matches appear in the dropdown menu, the domain you are searching for is not yet in this resource.

The Thesaurus is a work in progress and only relatively few domains are currently available. To check which conceptual domains are available without leaving the Thesaurus page, click on the link below the domains search box.