MUSIC-OpRA: Multidimensional Uncertainty in Scientific Interdisciplinary Corpora for Open Research Article
Description
The MUSIC-OpRA dataset offers valuable insights into the representation of uncertainty in scientific literature across various domains. Researchers and practitioners can use this dataset to study and analyze the variations of uncertainty expressions in scholarly discourse.
This dataset contains sentences extracted from open access articles in a wide range of fields, covering both Science, Technology, and Medicine (STM) and Social Sciences and Humanities (SSH), and annotated with respect to uncertainty in science. The dataset is derived from PubMed, Scopus, and Web of Science (WoS). It was produced as part of the ANR InSciM (Modelling Uncertainty in Science) project.
The sentences were annotated by two independent annotators following the annotation guide proposed by Ningrum and Atanassova (2024). To ensure consistent annotations, the annotators were trained using the annotation guide and a set of previously annotated sentences.
Each sentence was annotated as expressing or not expressing uncertainty (Uncertainty and No Uncertainty).
Sentences expressing uncertainty were then annotated along five dimensions: Reference, Nature, Context, Timeline, and Expression.
The dataset is provided in CSV format. The columns in the table are as follows:
- sentence_id: A unique internal identifier for each sentence.
- journal_name: The name of the journal in which the article was published.
- sampling_technique: The sampling method used to select the sentence. Two approaches were employed:
  - CueMapping: Sentences were randomly selected based on occurrences of uncertainty cues from pre-defined lists (Bongelli et al., 2019; Chen et al., 2018; Hyland, 1996).
  - Manual: Sentences were manually extracted by identifying uncertainty and non-uncertainty expressions in a subset of articles (two randomly selected articles per journal).
- article_title: The title of the article from which the sentence was extracted.
- document_id: The URL where the article is published.
- publication_year: The year the article was published.
- sentence: The text of the sentence.
- uncertainty: '1' if the sentence expresses uncertainty, and '0' otherwise.
- reference, nature, context, timeline, expression: Annotations of the type of uncertainty according to the annotation framework proposed by Ningrum and Atanassova (2023). The annotations of each dimension in this dataset are in numeric rather than textual format. The mapping between textual and numeric labels is presented in the table below.
Dimension | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
Reference | Author | Former | Both | | |
Nature | Epistemic | Aleatory | Both | | |
Context | Background | Methods | Res&Disc | Conclusion | Others |
Timeline | Past | Present | Future | | |
Expression | Quantified | Unquantified | | | |
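As an illustration of how the numeric codes in the table above could be turned back into textual labels, here is a minimal Python sketch using only the standard library. The label mapping is transcribed from the table; the helper names (`decode_row`, `read_decoded`) and the two sample rows are hypothetical and not part of the dataset. The sketch also assumes that dimension columns may be empty for sentences without uncertainty, which the description does not state; in that case it returns `None`.

```python
import csv
import io

# Numeric-to-text label mapping, transcribed from the table above.
LABELS = {
    "reference": {1: "Author", 2: "Former", 3: "Both"},
    "nature": {1: "Epistemic", 2: "Aleatory", 3: "Both"},
    "context": {1: "Background", 2: "Methods", 3: "Res&Disc",
                4: "Conclusion", 5: "Others"},
    "timeline": {1: "Past", 2: "Present", 3: "Future"},
    "expression": {1: "Quantified", 2: "Unquantified"},
}


def decode_row(row):
    """Return a copy of a CSV row with dimension codes replaced by text labels.

    Empty, non-numeric, or unknown codes are decoded as None (an assumption
    about how unannotated dimensions are stored).
    """
    decoded = dict(row)
    for dimension, mapping in LABELS.items():
        raw = (row.get(dimension) or "").strip()
        try:
            code = int(float(raw))  # tolerates "1" as well as "1.0"
        except (ValueError, OverflowError):
            decoded[dimension] = None
            continue
        decoded[dimension] = mapping.get(code)
    return decoded


def read_decoded(handle):
    """Read CSV rows from an open text handle and decode each one."""
    return [decode_row(row) for row in csv.DictReader(handle)]


# Hypothetical two-row sample with a subset of the documented columns.
SAMPLE = (
    "sentence_id,sentence,uncertainty,reference,nature,context,timeline,expression\n"
    "s1,This effect may depend on dosage.,1,1,1,3,2,2\n"
    "s2,We collected 120 samples.,0,,,,,\n"
)

rows = read_decoded(io.StringIO(SAMPLE))
print(rows[0]["nature"], rows[0]["context"])
```

To apply the same function to the distributed file, the in-memory sample can be replaced by an opened file handle, for example `open("AURORA-MM.csv", newline="", encoding="utf-8")`; the UTF-8 encoding is an assumption and may need adjusting.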
For more detail on the construction of the dataset, including the selection of journals, the sampling procedure, and the annotation methodology, see Ningrum and Atanassova (2023) and Ningrum and Atanassova (2024).
References
Bongelli, R., Riccioni, I., Burro, R., & Zuczkowski, A. (2019). Writers’ uncertainty in scientific and popular biomedical articles. A comparative analysis of the British Medical Journal and Discover Magazine. PLoS ONE, 14(9). https://doi.org/10.1371/journal.pone.0221933
Chen, C., Song, M., & Heo, G. E. (2018). A scalable and adaptive method for finding semantically equivalent cue words of uncertainty. Journal of Informetrics, 12(1), 158–180. https://doi.org/10.1016/j.joi.2017.12.004
Hyland, K. E. (1996). Talking to the academy: Forms of hedging in science research articles. Written Communication, 13(2), 251–281. https://doi.org/10.1177/0741088396013002004
Ningrum, P. K., & Atanassova, I. (2023). Scientific Uncertainty: An Annotation Framework and Corpus Study in Different Disciplines. 19th International Conference of the International Society for Scientometrics and Informetrics (ISSI 2023). https://doi.org/10.5281/zenodo.8306035
Ningrum, P. K., & Atanassova, I. (2024). Annotation of scientific uncertainty using linguistic patterns. Scientometrics. https://doi.org/10.1007/s11192-024-05009-z
Files
Name | Size |
---|---|
AURORA-MM.csv (md5:a4ac2d59db267f21ae10acc78a0bf2f2) | 538.2 kB |
Additional details
Related works
- Continues
- Dataset: 10.5281/zenodo.15001250 (DOI)
Funding
- Agence Nationale de la Recherche
- InSciM - Modelling Uncertainty in Science ANR-21-CE38-0003
Dates
- Available: 2025-04