Toward supporting decision-making under uncertainty in digital humanities with progressive visualization

Digital Humanities (DH) research and practice are subject to uncertainty throughout the life cycle of any project. Even in cases that are not data-oriented, analysts and other stakeholders need to make decisions without being aware of the level of uncertainty associated with the data being transformed by the computational tools that enable the novel kind of work humanists pursue within DH. In this paper we examine the literature that has characterized the types and sources of uncertainty in other fields, with the intent of establishing a foundation upon which to build novel computational tools that support the decision-making under uncertainty that DH currently faces. We propose the use of progressive visual analytics as a feasible means to manage decision-making under uncertainty, which may help tackle some of the challenges related to the elimination or mitigation of uncertainty in DH that would otherwise hamper the quality of the results.


INTRODUCTION
In recent decades the importance of computational tools in the work of humanities researchers has been continuously increasing, and the definition of DH has been reformulated accordingly, since DH research must be integrated with practice within and beyond academia [47]; both research and practice have been adopting new methodologies and resources that render definitions obsolete quite rapidly. In our work we adhere to the characterization of DH as "the application and or development of digital tools and resources to enable researchers to address questions and perform new types of analyses in the humanities disciplines" [1]. This symbiosis means that the application of humanities methods to research into digital objects or phenomena [47] is another way to look at DH research.
At any rate, the computational methods available to humanities scholars are very rich and may intervene at different stages of the project life cycle. Some examples of computational methods applied in DH research are the analysis of large data sets and digitized sources, data visualization, text mining, and statistical analysis of humanities data. We are aware that the diversity of fields that may fall under the broad outline of what constitutes DH research brings many different and valid goals, methods and measurements into the picture, so there is no general set of procedures that a piece of research must follow to qualify as DH research. However, any intervention of computational tools in the research is bound to deal with data, which will go through several processes and modifications throughout the life cycle of the project, even in cases where the research itself is not data-driven. From the inception of the project to the generation of knowledge, the intervention of computational tools transforms data by means of processes that may increase the uncertainty of the final results. Furthermore, during the life cycle of a project there are many situations in which scholars and/or stakeholders need to make decisions to advance the research based on incomplete or uncertain data [1], which, in turn, will yield another level of uncertainty inherently associated with a particular software or computational method.
The motivation of this paper is to examine when this decision-making under uncertainty occurs in DH projects where data transformations are performed. This work is part of the PROVIDEDH (PROgressive VIsual DEcision Making for Digital Humanities) research project, aimed at providing visual interactive tools that convey the degree of uncertainty of the datasets and of the underlying computational models, and that are designed to progressively adapt the visualizations to incorporate new, more complete or more accurate data.
The rest of the paper is organized as follows: In Section 2 we introduce types of uncertainty as defined in reliability theory, since it provides a mature and sound body of work upon which to build our research. In Section 3 we examine DH research and practice in a first attempt to characterize sources of uncertainty in DH. Section 4 is devoted to discussing how managing and processing data in DH research and practice are subject to uncertainty. Section 5 presents a progressive visual analysis proposal that approaches DH projects or experiences in which uncertainty and decision-making play a big role, with the intention of providing some hints on how to mitigate the impact of uncertainty on the results. Finally, in Section 6 we outline the main conclusions of our work that can be used to scaffold the support of decision-making under uncertainty in DH.

TYPES OF UNCERTAINTY
The characterization of uncertainties has been thoroughly investigated in the literature, with a major emphasis on areas such as risk analysis, risk management and reliability engineering [20][15][19][38] and decision-making and planning [24], with contributions from many other fields: operational research [50], software engineering [35], management [34], ecology [37], environmental modelling [36], health care [18], organizational behavior [21] and uncertainty quantification [33], to name a few. This interest in the formalization and modelling of uncertainty has traditionally been related to the question of what scientific knowledge is [25], and the implications of uncertainty in the humanities have not been approached in depth, although, as Druzdzel argues, uncertainty "is perhaps the most inherent and the most prevalent property of the knowledge of the world around us. Incompleteness of information, imprecision, approximations made for the sake of simplicity, variability of the described phenomena, contribute all to the fact that we rarely can make categorical statements about the world" [11].
Uncertainty has various interpretations in different fields, and in our research we refer to uncertainty as "a complex characterization about data or predictions made from data that may include several concepts including error, accuracy, validity, quality, noise and confidence and reliability" [30].
According to Dubois [12], knowledge can be classified depending on its type and sources as generic (repeated observations), singular (situations like test results or measurements) or coming from beliefs (unobserved singular events).

Aleatory Uncertainty
This uncertainty exists due to the random nature of physical events. This type of uncertainty refers to the inherent uncertainty due to the probabilistic variability and thus is modeled by probability theory. It is also known as statistical uncertainty, stochastic uncertainty, type A uncertainty, irreducible uncertainty, variability uncertainty, and objective uncertainty.
It mainly appears in scientific domains and is usually associated with objective knowledge coming from generic knowledge or singular observations.
The main characteristic of aleatory uncertainty is that it is considered to be irreducible [2].

Epistemic Uncertainty
This type of uncertainty results from the lack of knowledge or its imprecise character and is associated with the analysts performing the analysis. It is also known as systematic uncertainty, subjective uncertainty, type B uncertainty, reducible uncertainty or state of knowledge.
It is mainly found with subjective data based on beliefs and can be modeled with the belief functions theory introduced by Arthur P. Dempster [8].
This kind of uncertainty is especially related to decision-making processes and as such may be found both in scientific research (usually associated with hypothesis testing) and in humanities research (associated with disputed theories or events).
The main characteristic of epistemic uncertainty is that it is considered to be reducible, because new information can reduce or eliminate it.

Implications for decision-making in DH
As explained in the introduction, our research investigates opportunities to support decision-making in DH research and practice by means of interactive visual tools. Given the dual nature of uncertainty described above, the second type (epistemic) offers an opportunity to enhance DH research and support the stakeholders in assessing the level of uncertainty of the project at any given moment.
On the one hand, as already introduced, epistemic uncertainty can be modelled with the belief functions theory, which defines a theory of evidence that can be seen as a general framework for reasoning with uncertainty.
On the other hand, recent efforts can be found in the literature that focus on the adaptation and proposal of data provenance models for DH ecosystems [23][7]. Such models are often used to record the chain of production of digital research results, in order to increase transparency in research and make such results reproducible [46]. These models can be enhanced to also convey the level of uncertainty at any link in the chain. This would offer the opportunity to make decisions about a change in the research direction if, for instance, at some point the conclusion is incompatible with what the humanist feels to be solid ground epistemically, or if new information is introduced that mitigates a given uncertainty level.
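As an illustration of how belief functions can support reasoning about epistemic uncertainty, the following minimal sketch applies Dempster's rule of combination to two sources of evidence about the dating of a text. The scenario, the mass values and all function names are assumptions made for this example only; they are not taken from any of the cited works.

```python
from itertools import product


def combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset of hypotheses to a mass in [0, 1],
    and the masses of one function are assumed to sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            combined[common] = combined.get(common, 0.0) + wa * wb
        else:
            # Mass assigned to incompatible hypotheses counts as conflict.
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("the two sources are in total conflict")
    # Renormalize so the remaining masses sum to 1.
    return {s: w / (1.0 - conflict) for s, w in combined.items()}


def belief(m, hypothesis):
    """Total mass of the subsets that fully support the hypothesis."""
    return sum(w for s, w in m.items() if s <= hypothesis)


def plausibility(m, hypothesis):
    """Total mass of the subsets that do not contradict the hypothesis."""
    return sum(w for s, w in m.items() if s & hypothesis)


# Hypothetical example: two sources disagree on the century of a manuscript.
FRAME = frozenset({"12th", "13th"})
source_1 = {frozenset({"12th"}): 0.6, FRAME: 0.4}
source_2 = {frozenset({"13th"}): 0.3, FRAME: 0.7}

fused = combine(source_1, source_2)
bel_12th = belief(fused, frozenset({"12th"}))
pl_12th = plausibility(fused, frozenset({"12th"}))
```

The gap between `bel_12th` and `pl_12th` is the part of the evidence that remains uncommitted, which is the quantity that new information could reduce.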

SOURCES OF UNCERTAINTY IN DH
Similarly to the case of uncertainty types, different attempts at providing a taxonomy of sources of uncertainty can be found in the literature: Smithson proposed a taxonomy of ignorance [43][3], Pate-Cornell discussed six levels of uncertainty [31] and Fisher presented a conceptual model of uncertainty in spatial data [16]. Building upon these taxonomies, in [42] four notions are identified as sources of epistemic uncertainty: imprecision (inability to express an exact value of a measure), ignorance (inability to express knowledge), incompleteness (when not all situations are covered) and credibility (the weight an agent can attach to its judgement).
Although, to the best of our knowledge, a taxonomy of sources of uncertainty in DH has yet to be proposed, there is no doubt that multiple sources of uncertainty are to be found in this realm. Our aim is to help pave the way towards a taxonomy of uncertainty sources in DH by identifying and discussing some instances of sources of uncertainty related to data in DH research and practice.
Building on the previously discussed work by Simon [42], we can expand on the taxonomy it covers. Focusing on epistemic uncertainty (that related to the imprecise character of knowledge, or plainly the lack of it), multiple sources of uncertainty can be described within its scope. Those sources come from the fact that we often do not know the exact values of the data we are dealing with. On top of that, we must account for the inconsistency that arises when some of the information is itself contradictory. As proposed by Fisher [16], a taxonomy of epistemic uncertainty can be organized in four categories or notions (Figure 1), which are described in greater detail next. To complete the description of Fisher's notions, we also provide real examples of each of them in the context of four different DH projects that dealt with uncertainty in GIS [39], a dataset of French medieval texts [22], information related to early Holocaust data [4], and an approach to the presence of uncertainties in visual analysis [40].
A. Imprecision
This notion is present in datasets whose entries are made up of multiple attributes that may be imprecise.
There exists, therefore, an inability to express the definitive accurate value of a measure, or a lack of the information needed to obtain its exact value. Ideally, we would be able to study and research the topic at hand while working with a dataset in order to sort out any uncertainties and remove them, but in most cases we will find barriers that prevent this.
In three of the cited DH projects [39][22][4], imprecision is present in one way or another. One instance of uncertainty due to imprecision is that related to time and dates, such as in the medieval texts introduced in [22]. Not every text had this problem, but in multiple instances a concrete date of writing was not available; instead, dates were represented in idiosyncratic ways (e.g. between 1095-1291, first half of the 14th century, before 1453, etc.), resulting in a very strong presence of uncertainties to assess.
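One possible way to make such imprecise dates computable is to store each of them as an interval with optional bounds, so that the imprecision is kept explicit instead of being collapsed into a single year. The sketch below is only an assumption about how this could be done; the `DateRange` class, the year granularity and the mapping of each textual expression to bounds are hypothetical and are not taken from [22].

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DateRange:
    """An imprecise date expressed as a range of years (hypothetical model).

    A bound set to None means that the bound is unknown.
    """
    earliest: Optional[int]
    latest: Optional[int]

    def width(self) -> Optional[int]:
        """Size of the range in years, or None if one bound is unknown."""
        if self.earliest is None or self.latest is None:
            return None
        return self.latest - self.earliest

    def may_overlap(self, other: "DateRange") -> bool:
        """True unless the known bounds rule out any common year."""
        starts_in_time = (
            self.earliest is None
            or other.latest is None
            or self.earliest <= other.latest
        )
        ends_in_time = (
            self.latest is None
            or other.earliest is None
            or other.earliest <= self.latest
        )
        return starts_in_time and ends_in_time


# The three expressions quoted in the text, under the stated assumptions.
between_1095_1291 = DateRange(1095, 1291)
first_half_14th = DateRange(1301, 1350)  # assumes the century starts in 1301
before_1453 = DateRange(None, 1452)      # assumes "before" excludes 1453
```

With this representation, the width of a range gives a simple measure of how imprecise a date is, and `may_overlap` tells whether two texts could be contemporary given what is known.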
B. Ignorance
Ignorance can be partial or total, and it is related to the fact that information could have been incorrectly assessed by the persons gathering or organizing the data. It is also possible that people who are not fully sure how to deal with the data, and feel insecure about it, ignore some information and generate uncertainty during the evaluation and decision processes.
Mostly due to the passage of time (in the scope of DH) and to the fact that new knowledge becomes available as new experiences and research projects are completed, we may find information showing that what we had at the inception of our projects is outdated, or was misread or misunderstood at the time.
Interpretation issues can also be considered in this category, given that not everybody has the same perspective on the same data depending on its context, and that can affect its certainty.
In iterative research projects, unexpected results may also be reached. In that scenario, if the person analyzing the data is insecure and the results do not match his or her expectations, uncertainty may arise. This uncertainty can lead to the result being ignored, producing a new, wrongly assessed dataset. This issue is tackled by Seipp et al. [40] in relation to the presence of uncertainties in visual analytics. One of the main issues in visual analysis is the possibility of misinterpretation, and in order to avoid it the data quality needs to be appropriately represented. Even then, the results can be misleading and the analyst may not be able to interpret them correctly, which may lead him or her to ignore them, potentially introducing uncertainty in further iterations if the perceived values differ from the real values conveyed by the visualization.

C. Credibility
Probably one of the strongest sources of uncertainty, the credibility of any dataset or of any person involved in its assessment can be crucial to the presence (or absence) of uncertainty.
This concept can be linked to that of biased opinions, which stem from personal views of the landscape and can lead to wild variations between different groups and individuals given their backgrounds.
Moreover, credibility also refers to the degree to which experts are in charge of preparing, gathering, using and researching the data. The more weight an agent bears, the less (in principle) unpredictability is expected to be present in the data.
This notion is also important in open research projects that allow external agents to contribute in different ways. Their knowledge of the matter at hand can vary widely from one contributor to another, and that must be taken into consideration when dealing with their inputs, as they could introduce other types of uncertainty into the project and potentially change the results of the research.
This last type of research can be related to that carried out by Binder et al. for the GeoBib project [4]. Given its open nature, people can contribute new information or modify already available data. As each individual comes to the system with a different background, experience and knowledge, the information entered in the database for the same record can be completely different depending on who introduces it. It is the researchers' work to assess how credible each input is depending on where it comes from.

D. Incompleteness
Finally, the notion of incomplete data is a type of uncertainty that can be related to that of imprecise values. We can never be totally sure of anything, and that mostly has to do with the lack of knowledge (imprecision) that comes from the impossibility of knowing every possible option available.
When dealing with a dataset comprising logs of visitors to a library in Dublin [39], the authors found records that included names of places that either no longer exist or cannot be traced, because they were renamed or simply because the person recording the entry used a name bound to his or her own knowledge. This makes it impossible to geo-localize those places, resulting in an ultimately incomplete dataset (and also an imprecise one, if wrong coordinates are assigned instead of leaving fields blank).

DATA AND UNCERTAINTY IN DIGITAL HUMANITIES
It is assumed that science advances on a foundation of trusted discoveries [27], and the scientific community has traditionally pursued the reproducibility of experiments, with transparency as a key factor in enabling the scrutiny and validation of results.
Recently the importance of disclosing information on the data handling and computational methods used in experiments has been recognized, since access to the computational steps taken to process data and generate findings is as important as access to the data themselves [44]. By contrast, humanities research has a different relationship with data. Given the nature of this research, data are continuously under assessment and subject to different interpretative perspectives. In [13], Edmond and Nugent argue that "An agent encountering an object or its representation perceives and draws upon the data layer they apprehend to create their own narratives", understanding by narrative "the story we tell about data". The collaboration of humanities and computer science has opened new ways of doing research, but it also brings many challenges to overcome. In relation to our research, we focus here on the role of data in DH, as humanities data are both massive and diverse, and pose enormous analytical challenges for the humanities scholar [48].
In [48] four humanities challenges have been identified, relating to the ways in which perspectives, context, structure and narration can be understood. Those challenges open up many opportunities to collect, store, analyze and enrich the multimodal data used in the research; among the opportunities identified in the paper two are especially relevant to our discussion: a) understanding changes in meaning and perspective, and b) representing uncertainty in data sources and knowledge claims.
These opportunities are inherently related to a notion of uncertainty in data. On the one hand, humanities research is subject to changes in the data over time and across groups of scholars. When new sources or documents are discovered, new interpretations are elaborated, and the resulting understanding of the research objects is highly dependent on the particular theoretical positions of the scholars. On the other hand, those changes in meaning and perspective arise from the availability of sources and reference material, so it is highly important for scholars to be able to assess the nature of the data in terms of what may be missing, ambiguous, contradictory, etc.
This, as expected, generates uncertainty in how the data are ultimately handled and analyzed, depending on the data processing procedures and the provenance associated with them.

MANAGEMENT OF UNCERTAINTY BY MEANS OF PROGRESSIVE VISUAL ANALYTICS
Visual analytics is key to exposing humanities findings and to connecting the two hemispheres of DH in a successful manner. The usefulness and suitability of visually-supported computer techniques is well established nowadays, as shown by the growing number of publications, papers, dissertations and talks touching on the subject in recent years. However, many of these proposals are still regarded with a skeptical eye by prominent authors in the field and are considered by some "a kind of intellectual Trojan horse" that can be harmful for the purpose of humanistic research [10]. These authors' critique points to the inability of these techniques to present categories in qualitative information as subject to interpretation, "riven with ambiguity and uncertainty", and they call for "imaginative action and intellectual engagement with the challenge of rethinking digital tools for visualization on basic principles of the humanities". These claims point to a major issue in DH: on the one hand, humanities scholars are keen on employing computational methods to assist them in their research, but on the other hand such computational methods are often too complex to be understood in full and adequately applied. In turn, acquiring this knowledge would generally require an investment of time and effort that most scholars are reluctant to commit to, and that would invalidate the need for any kind of multidisciplinary cooperation. As a consequence, algorithms and other computational processes are seen as black boxes that produce results in an opaque manner, a key fact that we identify as one of the main causes of the controversy and whose motivations are rooted in the very foundations of HCI. But in the same way that users are not expected to understand the particularities of the HTTP and 4G protocols in order to access an online resource using their mobile phones, algorithmic mastery should not be an entry-level requirement for DH visual analytics either.
By the same token, such analytics systems should not purposely conceal information from the user on the mistaken assumption that a) the user is completely illiterate on these subjects and/or, with perhaps even more harmful consequences, b) the user is unable to learn.
Ghani & Deshpande, in their research dating from 1994, identified the sense of control over one's environment as one of the major factors affecting the experience of flow [17]. We argue that it is precisely the lack of control over the algorithms that drive the visualization that might be frustrating DH practitioners.
It is in the context of this problem that we frame our proposal of an exploration paradigm for DH, which aims to bring scientific rigor and reproducibility into the field without impeding intellectual work as intended by humanities scholars. As presented in previous sections, the tasks of categorization, assessment and display of uncertainty in all its forms play a key role in solving the aforementioned issues. To address these issues we draw on recent research by authors in the CS field to construct a theoretical framework in which the management of uncertainty is streamlined in all phases of the data analysis pipeline: Progressive Visual Analytics.
Progressive Visual Analytics is a computational paradigm [28,45] that refers to the ability of informational systems to deliver their results in a progressive fashion. As opposed to sequential systems, which are limited by the intrinsic latency of the algorithms in action, Progressive Visual Analytics systems by definition are always able to offer partial results of the computation.
The inclusion of this feature is of major importance to avoid well-known issues of exploratory analysis related to human perception, such as continuity, flow and attention preservation, among others [29], and it enhances the sense of direct manipulation of abstract data in the final user of the system [41]. This paradigm also brings important advantages related to the ability to break with the black-box vision of algorithms discussed earlier in this text [28]: there are many examples online and in the literature that illustrate how, by observing visual results of the execution of an algorithm, users are able to understand better how it works [6]. This is useful not only in an educational sense but also in a practical one: Progressive Analytics often produces steerable computations, allowing users to intervene in the ongoing execution of an algorithm and make more informed decisions during the exploration task [28]. In our case, this would allow a fast re-computation of results according to a well-defined series of beliefs or certainties on the data, with important benefits related to the problems presented in [10]. Therefore, the challenge lies in reimplementing the typical DH workflows and algorithms in a progressive manner, allowing for a fast reevaluation of beliefs that sparks critical thinking and intellectual work under conditions of uncertainty. Good first candidates for this conversion are the typical graph layout and force-directed methods, as a) they have typically been implemented in a progressive manner [5] and b) they are considered important to enable research in the humanities [48]. Other good candidates fall into the categories of dimensionality reduction (t-SNE [32]), pattern mining (SPAM [45]) or clustering (K-means [14]), although in principle any algorithm is amenable to conversion following the guidelines explained in [28].
For example, a complete list of relevant methods for the humanities could be compiled from the contributions by Wyatt and Millen [49], Stolper et al. [45] and Fekete and Primet [14].
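To make the notion of partial results concrete, the sketch below shows a deliberately simple progressive computation: a running mean that reports an estimate and its standard error after every chunk of data, instead of only once at the end. It is not one of the algorithms cited above; the function name `progressive_mean` and the `chunk_size` parameter are assumptions made for this illustration, and the running update uses Welford's method.

```python
import math


def progressive_mean(values, chunk_size=100):
    """Yield (items_seen, running_mean, standard_error) after each chunk.

    The consumer (for instance a visualization) can redraw on every yield
    and stop iterating as soon as the estimate is precise enough.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    for start in range(0, len(values), chunk_size):
        for x in values[start:start + chunk_size]:
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)
        if n > 1:
            stderr = math.sqrt(m2 / (n - 1) / n)
        else:
            stderr = float("inf")
        yield n, mean, stderr


# Hypothetical data: any numeric attribute of a collection would do.
values = list(range(1, 1001))
partials = list(progressive_mean(values, chunk_size=250))
```

Each element of `partials` is an intermediate result that could be shown to the user together with its uncertainty. The standard error is only a meaningful indication of that uncertainty if the chunks arrive in random order, which is not the case for the sorted toy data used here.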
In Figure 2 we show a modification of the progressive visualization workflow proposed by Stolper et al. [45], in which we treat the data set as a first-class research object that can be labeled, versioned, stored and retrieved employing a data repository. Our proposal also draws on the ideas by Fekete and Primet [14], and we model uncertainty as a parameter U_p of the progressive computation F_p defined by those authors. Initially a dataset A is loaded, which consists of a series of data tables, each one associated with a concrete uncertainty parameter that may or may not exist yet and that, if it exists, was assigned in a previous session by the same or another user. At the beginning of the session, the user may choose to modify the uncertainty parameters according to their experience or newer research, or leave them as they are. We call this the initial user perspective P, which is a series of uncertainty parameters U_1, ..., U_z, one for each of the data tables D_1, ..., D_z. As the workflow progresses, the user will modify this perspective, subsequently obtaining P', P'', etc. Once the workflow is finished, the dataset A_r, along with the final user perspective P_r, is stored in the data repository for later use and becomes a research object that can be referenced, reused and reproduced in a transparent fashion.
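The bookkeeping part of this workflow can be sketched as follows, assuming that a perspective is an immutable mapping from data table names to uncertainty parameters and that the repository is a simple in-memory list of versions. The classes `Perspective` and `Repository`, the table names and the numeric values are all hypothetical; the sketch does not implement the progressive computation itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Perspective:
    """A user perspective: one uncertainty parameter per data table.

    A value of None means that no parameter has been assigned yet.
    """
    uncertainty: Tuple[Tuple[str, Optional[float]], ...]

    @classmethod
    def from_dict(cls, params: Dict[str, Optional[float]]) -> "Perspective":
        return cls(tuple(sorted(params.items())))

    def as_dict(self) -> Dict[str, Optional[float]]:
        return dict(self.uncertainty)

    def revise(self, table: str, value: float) -> "Perspective":
        """Return a new perspective (P', P'', ...) and keep this one intact."""
        params = self.as_dict()
        if table not in params:
            raise KeyError(table)
        params[table] = value
        return Perspective.from_dict(params)


@dataclass
class Repository:
    """In-memory stand-in for the data repository (assumption)."""
    versions: List[Tuple[str, Perspective]] = field(default_factory=list)

    def store(self, label: str, perspective: Perspective) -> int:
        self.versions.append((label, perspective))
        return len(self.versions) - 1  # version number

    def retrieve(self, version: int) -> Tuple[str, Perspective]:
        return self.versions[version]


# Hypothetical session: two tables, one without a parameter yet.
p = Perspective.from_dict({"letters": None, "places": 0.4})
p1 = p.revise("letters", 0.2)    # P'
p2 = p1.revise("places", 0.1)    # P''

repo = Repository()
version = repo.store("A_r", p2)  # final perspective P_r stored with A_r
label, stored = repo.retrieve(version)
```

Keeping every perspective immutable means that each intermediate state of the session remains available, which is one way to support the transparency and reproducibility that the workflow aims for.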

CONCLUSIONS
In this paper we have seen how the inclusion and treatment of uncertainty in exploratory visual analysis is of key importance to bridging the gap between the humanities and computer science. Although DH constitutes an exciting new field of collaboration between practitioners with substantially different backgrounds, there are still major issues that need to be addressed as promptly as possible in order to achieve better results. To overcome these challenges, we draw on a relatively new data visualization paradigm that breaks with the black-box perception of algorithms that is blocking collaboration in many research areas. Although the progressive workflow model in our proposal is a first approach to the problem, we are currently working to provide concrete implementations that we expect to test within the next 2-3 years. We have seen a great surge of Progressive Analytics in the CS community in the past years, but its applicability in the field of DH is still to be proven with adequate use cases and real data sets.