Modeling Structure and Content: Socio-Semantic Network Analysis of the Mahābhārata

There is a demand to incorporate content information into social networks. The authors constructed and visualized a network of the most important gods and heroes in the Sanskrit epic Mahābhārata. The network includes semantic information about the actors and their relationships. These two types of information were collected automatically with the help of the Nubbi topic modeling algorithm, which assigns separate sets of topics to both persons and their relations. The visualization of such a network provides intuitive access to a high density of information, like the topic distribution for each actor and the predominant topic for each relation.

This paper focuses on the methodological problem of the relationship between (social) structure and (semantic) content in humanities network research. Network analysis as a method traditionally leans toward the structural side. The social structure of fictional texts has been studied, for example, by Moretti in his work on Shakespearean drama [1] and by Mac Carron and Kenna in their study of Icelandic sagas [2]. These approaches largely ignore the content of the texts they analyze. Here, we propose a method to analyze structure and content jointly.
We are interested in capturing the social structure while also including the semantics embedded within it. This kind of approach has been called "semantic social network analysis" [3,4]. Recent research in this area developed from "Semantic Web" principles: social networks are extended with information from given ontologies to model semantically different types of nodes and edges. The outcomes are multimodal and multiplex networks. The downside of this approach is that it requires the researcher to specify a semantic content model in the form of an ontology, rather than building such a content model empirically from the data. By contrast, we pursue an inductive approach to uncover the internal qualities of the textual sources under study.
To this end, we apply techniques from topic modeling, which has the advantage of empirically detecting semantic clusters of words with little input from the researcher. One fitting application of topic modeling to the domain of network analysis has been developed and published under the name Nubbi, which is short for Networks Uncovered by Bayesian Inference [5]. This method performs topic modeling of context in a social network setting. The results are topics, or semantic word clusters, for both actors and relations in a social network. This allows the discovery of node classes and relationship types inductively. Nubbi applies a dual data model, which makes use of the fact that the text under study contextualizes entities and their relations. Entities mentioned in the text are extracted as a social network. Additionally, words around entities and entity pairs are assigned to nodes and edges as context documents. The topic modeling process then calculates topics for both nodes and edges. It assumes that words around single entities contribute only to entity topics, while words around entity pairs contribute to either entity topics or pair topics.
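The dual data model described above can be illustrated with a minimal Python sketch. It covers only the preparation of context documents, not the Bayesian inference itself, and it rests on assumptions that are not taken from the paper: the context window is a whole tokenized text unit, units mentioning exactly one entity feed that entity's document, and units mentioning two or more entities feed the documents of the corresponding pairs. The function name `build_contexts` and the toy tokens are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_contexts(units, entities):
    """Split tokenized text units into entity and pair context documents.

    Assumption (not from the paper): the context window is the whole unit.
    """
    entity_docs = defaultdict(list)
    pair_docs = defaultdict(list)
    for tokens in units:
        # Entities mentioned in this unit, in a stable order.
        present = sorted({t for t in tokens if t in entities})
        # All remaining tokens serve as context words.
        words = [t for t in tokens if t not in entities]
        if len(present) == 1:
            # Words around a single entity go to that entity's document.
            entity_docs[present[0]].extend(words)
        else:
            # Words around co-mentioned entities go to each pair's document.
            for pair in combinations(present, 2):
                pair_docs[pair].extend(words)
    return dict(entity_docs), dict(pair_docs)


# Invented toy data, for illustration only.
entities = {"arjuna", "krishna", "bhima"}
units = [
    ["arjuna", "draws", "bow"],
    ["krishna", "teaches", "arjuna", "duty"],
    ["bhima", "swings", "mace"],
    ["arjuna", "fights", "bhima"],
]
entity_docs, pair_docs = build_contexts(units, entities)
```

The two resulting dictionaries correspond to the node and edge context documents that a Nubbi-style model would then take as input.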
The results of this process are topics that describe entities (node classes) and topics that describe entity relations (edge types). There is an implementation of this algorithm available for R as part of the "lda" package [6].
The text to which we applied these methods is the Mahābhārata, the Indian national epic, in its original Sanskrit version (1M+ lexical entities). We built a social network of the 370 most frequent persons, in which two persons are linked when they co-occur in one verse, and added the topic distributions inferred by Nubbi to it. We then visualized a subgraph containing 20 central gods and heroes as an arc diagram (Fig. 1) [7].
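A verse-level co-occurrence network of this kind could be built roughly as follows. This is a sketch under stated assumptions rather than the procedure actually used: verses are given as lists of already lemmatized tokens, person frequency is counted as number of mentions, and edge weights count the verses shared by two persons. The function name `cooccurrence_network` and the toy verses are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(verses, persons, top_n):
    """Return the top_n most frequent persons and their weighted edges."""
    # Count how often each known person is mentioned across all verses.
    freq = Counter(t for verse in verses for t in verse if t in persons)
    kept = {name for name, _ in freq.most_common(top_n)}
    edges = Counter()
    for verse in verses:
        # Each retained person counts once per verse.
        present = sorted({t for t in verse if t in kept})
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return kept, edges


# Invented toy data, for illustration only.
persons = {"arjuna", "krishna", "bhima", "karna"}
verses = [
    ["arjuna", "krishna"],
    ["arjuna", "bhima"],
    ["arjuna", "krishna", "karna"],
    ["bhima", "krishna"],
]
nodes, edges = cooccurrence_network(verses, persons, top_n=3)
```

With the real corpus, `top_n` would be 370; the toy call uses 3 so that the least frequent person is dropped.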
The graph displays the characteristic distribution pattern of entity topics for each actor. The Nubbi algorithm distinguishes, for example, pure warriors from other persons with a more variegated profile. Relations between actors are represented as colored arcs whose thickness is proportional to the number of common contexts. The color of each arc is chosen according to the most frequently occurring pair topic between the two entities, so that it represents dominant edge topics such as "Religion" and "Fighting." For instance, the persons connected by red and brown arcs on the right side of the graph are the most active participants in the great battle, whose description makes up about one third of the entire text.
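The mapping from pair topics to arc attributes can be sketched in a few lines of Python. The sketch assumes that each pair comes with a list of pair-topic labels (one per context word) and a count of common contexts; the palette, the `max_width` parameter, and the function name `arc_style` are hypothetical and do not reproduce the colors of Fig. 1.

```python
from collections import Counter

# Hypothetical palette; the colors used in the actual figure may differ.
TOPIC_COLORS = {"Religion": "blue", "Fighting": "red"}


def arc_style(topic_assignments, n_contexts, max_contexts, max_width=6.0):
    """Pick the arc color from the dominant pair topic and scale the width."""
    # The most frequent pair topic determines the color.
    dominant, _ = Counter(topic_assignments).most_common(1)[0]
    # Width is proportional to the number of common contexts.
    width = max_width * n_contexts / max_contexts
    return {"topic": dominant, "color": TOPIC_COLORS[dominant], "width": width}


# Invented toy input, for illustration only.
style = arc_style(["Fighting", "Fighting", "Religion"],
                  n_contexts=5, max_contexts=10)
```

Scaling relative to the largest context count keeps the thickest arc at `max_width`, which is one simple way to realize the proportionality described above.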
To conclude, we found that Nubbi is a viable solution to the problem of analyzing structure and content in an integrated model. The arc diagram usefully highlights the most central connections but works well only with comparatively few nodes.