Dataset for Multidisciplinary Uncertainty Mining - ver1
Description
This dataset contains sentences extracted from articles in various disciplines and annotated with respect to uncertainty in science. It has been produced as part of the ANR InSciM (Modelling Uncertainty in Science) project.
The dataset is drawn from reputable scientific articles from a variety of disciplines. It consists of two distinct samples of sentences, each obtained using a different method. The first sample is obtained through uncertainty cue mapping, while the second is derived from manual annotation of randomly selected articles. Both samples were then manually annotated using our multidimensional annotation framework.
For a more comprehensive understanding of the construction of the dataset, including the selection of journals, sampling procedure, and the annotation methodology, see (Ningrum and Atanassova, 2023).
This dataset supports the study of how uncertainty is represented in scientific literature across different domains. Researchers and practitioners can use it to analyze the different dimensions of uncertainty in scientific discourse.
The dataset is presented as a CSV table where colons (:) are used as delimiters. The columns of the table are as follows:
- source : 'db' or 'manual' referring to the method used to identify and extract the sentence;
- article_id : internal id of the article from which the sentence was extracted;
- sen_id : internal unique id of the sentence;
- cue : uncertainty cue present in the sentence;
- text : sentence text;
- journal_id : short name of the journal;
- check : 'Y' if the sentence expresses uncertainty and 'N' otherwise;
- ref, nature, context, timeline, expression : annotations of the type of uncertainty according to the annotation framework proposed by (Ningrum and Atanassova, 2023).
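The column layout above can be read with a few lines of Python. This is only a minimal sketch: it assumes the file has a header row whose names match the column list, and it takes the delimiter from the description (pass a different `delimiter` if the actual file differs). The helper names `read_rows` and `uncertain_rows` and the two sample rows are made up for illustration and are not part of the dataset.

```python
import csv
import io


def read_rows(handle, delimiter=":"):
    """Return one dict per sentence row from an open CSV handle.

    Assumes the first line is a header naming the columns
    (source, article_id, sen_id, cue, text, journal_id, check, ...).
    """
    return list(csv.DictReader(handle, delimiter=delimiter))


def uncertain_rows(rows):
    """Keep only the rows whose 'check' column is 'Y'."""
    return [row for row in rows if row["check"] == "Y"]


# Invented sample in the described layout (NOT real dataset content).
# Quoting lets the sentence text contain the delimiter itself.
sample = io.StringIO(
    "source:article_id:sen_id:cue:text:journal_id:check\n"
    'db:a1:s1:may:"Result: this may help.":J1:Y\n'
    'manual:a2:s2::"We measured X.":J2:N\n'
)

rows = read_rows(sample)
uncertain = uncertain_rows(rows)
print(len(rows), len(uncertain))
```

To read the real file, the same function could be called on `open("dataset_for_multidisciplinary_uncertainty_mining_v1.csv", newline="", encoding="utf-8")`; the encoding is an assumption.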
Note that the dataset contains duplicate data. These duplicates arise when the cue mapping procedure detects multiple cues in a sentence, so the same sentence can appear more than once. We deliberately chose to retain the duplicates rather than omit them, because analyzing them shows the various ways in which the cues are expressed within the sentences.
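One possible way to work with these duplicates is to group the rows by sentence and collect the cues found in each. The sketch below assumes that duplicated rows share an identical `text` value, which the description does not state explicitly; the function name `cues_by_sentence` and the sample rows are hypothetical.

```python
from collections import defaultdict


def cues_by_sentence(rows):
    """Map each sentence text to the list of distinct cues seen for it.

    Assumption: rows duplicated by the cue mapping carry the same 'text'.
    """
    grouped = defaultdict(list)
    for row in rows:
        cues = grouped[row["text"]]  # creates the entry even if the cue is empty
        cue = row["cue"]
        if cue and cue not in cues:
            cues.append(cue)
    return dict(grouped)


# Invented rows for illustration (NOT real dataset content).
example_rows = [
    {"text": "This may possibly help.", "cue": "may"},
    {"text": "This may possibly help.", "cue": "possibly"},
    {"text": "We measured X.", "cue": ""},
]

grouped_cues = cues_by_sentence(example_rows)
multi_cue = {text: cues for text, cues in grouped_cues.items() if len(cues) > 1}
print(multi_cue)
```

Grouping rather than dropping rows keeps every cue, in line with the decision to retain the duplicates.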
Bibliography
Ningrum, P. K., Atanassova, I. (2023) "Scientific Uncertainty: an Annotation Framework and Corpus Study in Different Disciplines" In 19th International Conference of the International Society for Scientometrics and Informetrics (ISSI 2023), Bloomington, Indiana, US.
Files
dataset_for_multidisciplinary_uncertainty_mining_v1.csv (119.0 kB)
md5:e8b0ba3b34ca7f8b7bda8fb9cdbc9080
Additional details
Funding
- Agence Nationale de la Recherche
- InSciM – Modelling Uncertainty in Science ANR-21-CE38-0003