Published April 4, 2025 | Version 1.0.0
Dataset Open

Softcite Extractions from the Open Access Literature

  • 1. The University of Texas at Austin
  • 2. University of California, Berkeley

Description

The softcite-extractions-oa dataset is a collection of ML-identified mentions of software detected in about 24 million academic papers. All of the papers are open access and were available circa 2024. The extractions were created from academic PDFs using the Softcite mention extraction toolchain, which is built on the Grobid model trained on the Softcite Annotations dataset v2. More details are available at the Softcite Org home page.

This work used JetStream 2 at Indiana through allocation CIS220172 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296. The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing computational resources that have contributed to the creation and processing of this research dataset. URL: http://www.tacc.utexas.edu

Usage examples, issue reporting, and further documentation are available at https://github.com/softcite/softcite-extractions-oa

As documented in the GitHub repo, these parquet files were created by processing the original JSON files produced by softcite-mentions. An intermediate dataset (about 26 GiB) is available here: https://doi.org/10.5281/zenodo.15096765
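Because the internal layout of the archive is not described on this page, a first step after downloading might be to list the parquet files it contains. The following is a minimal sketch using only the Python standard library; the helper name list_parquet_members and the local path in the usage comment are assumptions made for illustration, and the GitHub repo remains the authoritative source for the actual file names and schema.

```python
import zipfile


def list_parquet_members(zip_path):
    """Return the sorted names of all .parquet entries inside a zip archive.

    Hypothetical helper: it makes no assumption about how the archive is
    organized and simply filters entries by file extension.
    """
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(
            name for name in archive.namelist()
            if name.lower().endswith(".parquet")
        )


# Usage, assuming the archive was saved to the current directory:
#   for name in list_parquet_members("softcite-extractions-oa-data.zip"):
#       print(name)
```

Listing the members reads only the zip's central directory, so it does not require unpacking the full archive first.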

Version history:

1.0.0 Removed file restrictions, added instructions on how/where to report extraction problems

0.2.1 Added link to upstream JSONL extractions dataset (which also links to this dataset)

0.2.0 Removed extraneous Mac OS files from the zip

0.1.0 Initial upload

Files

softcite-extractions-oa-data.zip
Size: 6.3 GB
md5:d2c844fa676a8b3022aadf85fb2c92c2
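
With a download of this size, it can be worth checking the file against the md5 checksum listed above before unpacking it. The following is a minimal sketch using Python's standard hashlib module; the function names and the local file path are assumptions made for illustration, and the file is read in chunks so the whole 6.3 GB archive never has to fit in memory.

```python
import hashlib
from pathlib import Path

# Checksum as listed for softcite-extractions-oa-data.zip on this page.
EXPECTED_MD5 = "d2c844fa676a8b3022aadf85fb2c92c2"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex md5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Return True if the file's md5 digest equals the expected hex string."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed local download location; adjust to wherever the file was saved.
    archive = Path("softcite-extractions-oa-data.zip")
    if archive.exists():
        print("checksum OK" if matches_expected(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

A mismatch usually indicates an incomplete or corrupted download, in which case fetching the file again is the simplest remedy.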

Additional details

Software