Published April 8, 2025 | Version ver1
Dataset Open

MUSIC-OpRA: Multidimensional Uncertainty in Scientific Interdisciplinary Corpora for Open Research Article

  • 1. Université Marie et Louis Pasteur

Description

The MUSIC-OpRA dataset documents how uncertainty is expressed in scientific literature across domains. Researchers and practitioners can use it to study and compare expressions of uncertainty in scholarly discourse.

This dataset contains sentences extracted from open access articles in a wide range of fields, covering both Science, Technology, and Medicine (STM) and Social Sciences and Humanities (SSH). The sentences are annotated with respect to uncertainty in science. The dataset is derived from PubMed, Scopus, and Web of Science (WoS). It was produced as part of the ANR InSciM (Modelling Uncertainty in Science) project.

The sentences were annotated by two independent annotators following the annotation guide proposed by Ningrum and Atanassova (2024). To promote consistent annotations, the annotators were trained using the annotation guide and previously annotated sentences.

Each sentence was annotated as either expressing or not expressing uncertainty (Uncertainty or No Uncertainty).
Sentences expressing uncertainty were then annotated along five dimensions: Reference, Nature, Context, Timeline, and Expression.

The dataset is provided in CSV format. The columns in the table are as follows:

  • sentence_id: A unique internal identifier for each sentence.
  • journal_name: The name of the journal in which the article was published.
  • sampling_technique: Sampling method used to select the sentence. Two approaches were employed:
    • CueMapping: Sentences were randomly selected based on occurrences of uncertainty cues from pre-defined lists (Bongelli et al., 2019; Chen et al., 2018; Hyland, 1996).
    • Manual: Sentences were manually extracted by identifying uncertainty and non-uncertainty expressions in a subset of articles (two randomly selected articles per journal).
  • article_title: The title of the article from which the sentence was extracted.
  • document_id: The URL where the article is published.
  • publication_year: The year the article was published.
  • sentence: The text of the sentence.
  • uncertainty: '1' if the sentence expresses uncertainty, and '0' otherwise.
  • reference, nature, context, timeline, expression: Annotations of the type of uncertainty according to the annotation framework proposed by Ningrum and Atanassova (2023). The annotations of each dimension are stored as numeric codes rather than textual labels. The mapping between textual and numeric labels is presented in the table below.
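
| Dimension  | 1          | 2            | 3        | 4          | 5      |
|------------|------------|--------------|----------|------------|--------|
| Reference  | Author     | Former       | Both     |            |        |
| Nature     | Epistemic  | Aleatory     | Both     |            |        |
| Context    | Background | Methods      | Res&Disc | Conclusion | Others |
| Timeline   | Past       | Present      | Future   |            |        |
| Expression | Quantified | Unquantified |          |            |        |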
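
As an illustration of the numeric-to-text mapping, the following is a minimal Python sketch that decodes the five dimension columns of a row. It relies on assumptions not stated in this record: that codes are stored as integers (written as "1" or "1.0"), and that the dimension cells are empty for sentences annotated as No Uncertainty. The helper names `decode_row` and `read_decoded` and the sample rows are hypothetical and are not taken from the dataset.

```python
import csv
import io

# Numeric-to-text label mapping, copied from the table in this record.
LABELS = {
    "reference": {1: "Author", 2: "Former", 3: "Both"},
    "nature": {1: "Epistemic", 2: "Aleatory", 3: "Both"},
    "context": {1: "Background", 2: "Methods", 3: "Res&Disc",
                4: "Conclusion", 5: "Others"},
    "timeline": {1: "Past", 2: "Present", 3: "Future"},
    "expression": {1: "Quantified", 2: "Unquantified"},
}


def decode_row(row):
    """Return a copy of a CSV row with dimension codes replaced by text labels.

    Assumption: codes are integers (possibly written as "1.0"); empty or
    non-numeric cells and unknown codes are decoded as None.
    """
    decoded = dict(row)
    for dimension, mapping in LABELS.items():
        raw = (row.get(dimension) or "").strip()
        try:
            code = int(float(raw))
        except ValueError:
            decoded[dimension] = None  # empty or non-numeric cell
            continue
        decoded[dimension] = mapping.get(code)  # None if the code is unknown
    return decoded


def read_decoded(handle):
    """Yield decoded rows from an open CSV file handle."""
    for row in csv.DictReader(handle):
        yield decode_row(row)


# Tiny in-memory sample with made-up values, only to show the call pattern.
SAMPLE = (
    "sentence_id,sentence,uncertainty,reference,nature,context,timeline,expression\n"
    "s1,This may suggest an effect.,1,1,1,3,2,2\n"
    "s2,We collected 40 samples.,0,,,,,\n"
)
rows = list(read_decoded(io.StringIO(SAMPLE)))
```

To apply the same decoding to the downloaded file, pass a handle opened with `open(path, newline="", encoding="utf-8")` to `read_decoded`; the UTF-8 encoding is also an assumption.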

For a fuller account of how the dataset was constructed, including the selection of journals, the sampling procedure, and the annotation methodology, see Ningrum and Atanassova (2023) and Ningrum and Atanassova (2024).

References

Bongelli, R., Riccioni, I., Burro, R., & Zuczkowski, A. (2019). Writers’ uncertainty in scientific and popular biomedical articles. A comparative analysis of the British Medical Journal and Discover Magazine. PLoS ONE, 14(9). https://doi.org/10.1371/journal.pone.0221933

Chen, C., Song, M., & Heo, G. E. (2018). A scalable and adaptive method for finding semantically equivalent cue words of uncertainty. Journal of Informetrics, 12(1), 158–180. https://doi.org/10.1016/j.joi.2017.12.004

Hyland, K. E. (1996). Talking to the academy: Forms of hedging in science research articles. Written Communication, 13(2), 251–281. https://doi.org/10.1177/0741088396013002004

Ningrum, P. K., & Atanassova, I. (2023). Scientific Uncertainty: An Annotation Framework and Corpus Study in Different Disciplines. 19th International Conference of the International Society for Scientometrics and Informetrics (ISSI 2023). https://doi.org/10.5281/zenodo.8306035

Ningrum, P. K., & Atanassova, I. (2024). Annotation of scientific uncertainty using linguistic patterns. Scientometrics. https://doi.org/10.1007/s11192-024-05009-z

Files

AURORA-MM.csv

Files (538.2 kB)

md5:a4ac2d59db267f21ae10acc78a0bf2f2
538.2 kB
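
The MD5 checksum listed above can be used to check that a downloaded copy of the file is intact. The following is a minimal sketch using Python's standard `hashlib` module; the function names `file_md5` and `matches_record` are hypothetical, and the local path to the CSV file is left to the caller.

```python
import hashlib

# Checksum listed for the file on this record.
EXPECTED_MD5 = "a4ac2d59db267f21ae10acc78a0bf2f2"


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path):
    """Return True if the file at `path` has the checksum listed here."""
    return file_md5(path) == EXPECTED_MD5
```

Calling `matches_record` with the path of the downloaded CSV returns `False` if the file was truncated or altered in transit, in which case downloading it again is the simplest remedy.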

Additional details

Related works

Continues
Dataset: 10.5281/zenodo.15001250 (DOI)

Funding

Agence Nationale de la Recherche
InSciM - Modelling Uncertainty in Science ANR-21-CE38-0003

Dates

Available
2025-04