Dataset Open Access

Softcite Dataset: A dataset of software mentions in research publications

James Howison; Patrice Lopez; Caifan Du; Hannah Cohoon

The Softcite dataset is a gold-standard dataset of software mentions in research publications, a free resource primarily for software entity recognition in scholarly text. This is the first release of this dataset.

What's in the dataset

To facilitate software entity recognition at scale and, eventually, to increase the visibility of research software so that software contributions to scholarly research receive due credit, a team of trained annotators from the Howison Lab at the University of Texas at Austin annotated 4,093 software mentions in 4,971 open access research publications in biomedicine (from the PubMed Central Open Access collection) and economics (from Unpaywall open access services). The released dataset is a single TEI/XML corpus file. It contains the annotated software mentions, together with their publisher, version, and access URL where these appear in the text, as well as the publications annotated as containing no software mentions.

To understand the schema of the Softcite corpus, its design considerations, and its provenance, please refer to our paper included in this release (preprint version).
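As a rough illustration of how a TEI/XML corpus of this kind might be read, here is a minimal Python sketch. It rests on assumptions that should be checked against the paper and the released file: that the corpus root wraps one TEI element per publication, and that mentions and their attributes are marked as rs elements whose type attribute is "software", "version", "publisher", or "url". The inline SAMPLE is a hypothetical miniature, not an excerpt from the dataset.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = "http://www.tei-c.org/ns/1.0"  # standard TEI namespace

# Hypothetical two-document sample; element and attribute names are assumptions.
SAMPLE = """<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <TEI>
    <text><body>
      <p>Analyses used <rs type="software">SPSS</rs> <rs type="version">22</rs>
         (<rs type="publisher">IBM</rs>).</p>
    </body></text>
  </TEI>
  <TEI>
    <text><body><p>No software is mentioned here.</p></body></text>
  </TEI>
</teiCorpus>"""


def summarize(root):
    """Count annotations by type and count documents with no software mention."""
    counts = Counter()
    docs_without_software = 0
    for doc in root.iter(f"{{{TEI_NS}}}TEI"):
        # Collect the type of every annotation inside this publication.
        types = [rs.get("type") for rs in doc.iter(f"{{{TEI_NS}}}rs")]
        counts.update(types)
        if "software" not in types:
            docs_without_software += 1
    return counts, docs_without_software


root = ET.fromstring(SAMPLE)
counts, empty_docs = summarize(root)
print(dict(counts), empty_docs)
```

To try the same function on the released file, replace ET.fromstring(SAMPLE) with ET.parse("softcite_corpus-full.tei.xml").getroot(), and adjust the element names if the actual schema differs from what is assumed here.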

Use scenarios

The Softcite dataset is released to encourage researchers and stakeholders to make research software more visible in science, especially in academic databases and information retrieval systems, and to facilitate interoperability and collaboration among related efforts in software entity recognition and in building utilities for software information retrieval. The dataset can also be useful for researchers investigating software use in academic research.

Current release content

The softcite-dataset v1.0 release includes:

  • The Softcite dataset corpus file: softcite_corpus-full.tei.xml
  • Softcite Dataset: A Dataset of Software Mentions in Biomedical and Economic Research Publications, our paper describing the design considerations and creation process of the dataset: Softcite_Dataset_Description_RC.pdf. (This is a preprint version of our forthcoming publication in the Journal of the Association for Information Science and Technology.)

The Softcite dataset is licensed under a Creative Commons Attribution 4.0 International License.

If you have questions, please start a discussion or open an issue in the howisonlab/softcite-dataset GitHub repository.

We thank the Alfred P. Sloan Foundation for supporting this work. We are also grateful to our collaborators and student annotators, whose work made it possible to build this gold-standard dataset and make it available.
Files (2.9 MB)
  • Du, C., Cohoon, J., Lopez, P., & Howison, J. (forthcoming). Softcite Dataset: A Dataset of Software Mentions in Biomedical and Economic Research Publications. Journal of the Association for Information Science and Technology. DOI: 10.1002/asi.24454

                   All versions   This version
Views              546            530
Downloads          39             39
Data volume        113.6 MB       113.6 MB
Unique views       491            481
Unique downloads   38             38

