Softcite Extractions from the Open Access Literature

Howison, James; Lopez, Patrice; Karthik, Ram; Beason, Will

doi:10.5281/zenodo.15149379

Published April 4, 2025 | Version 1.0.0

Dataset | Open

1. The University of Texas at Austin
2. University of California, Berkeley

The softcite-extractions-oa dataset is a collection of ML-identified software mentions detected in about 24 million academic papers. All of the papers are open access and were available circa 2024. The extractions were created from academic PDFs using the Softcite mention extraction toolchain, which is built on the Grobid model trained on the Softcite Annotations dataset v2. More details are available at the Softcite Org home page.

This work used JetStream 2 at Indiana through allocation CIS220172 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296. The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing computational resources that have contributed to the creation and processing of this research dataset. URL: http://www.tacc.utexas.edu

For usage examples, further documentation, and issue reporting, see https://github.com/softcite/softcite-extractions-oa

As documented in the GitHub repo, these parquet files were created by processing the original JSON files produced by softcite-mentions. An intermediate dataset (about 26 GiB) is available here: https://doi.org/10.5281/zenodo.15096765
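The internal layout of the zip archive is not described on this page, so a cautious first step is to list the parquet files it contains before loading any of them. The following is a minimal sketch using only the Python standard library; the function name `list_parquet_members` and the local path in the usage note are illustrative assumptions, not part of the dataset or its documentation.

```python
import zipfile


def list_parquet_members(zip_path):
    """Return (name, uncompressed size in bytes) for each .parquet entry.

    `zip_path` is wherever the archive was saved locally (an assumption;
    this page does not prescribe a location or an internal file layout).
    """
    with zipfile.ZipFile(zip_path) as archive:
        return [
            (info.filename, info.file_size)
            for info in archive.infolist()
            if info.filename.lower().endswith(".parquet")
        ]
```

For example, `list_parquet_members("softcite-extractions-oa-data.zip")` would return the parquet entries found in a locally saved copy. The GitHub repository remains the authoritative source for the actual table names and schemas.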

Version history:

1.0.0 Removed file restrictions, added instructions on how/where to report extraction problems

0.2.1 Added link to upstream JSONL extractions dataset (which also links to this dataset)

0.2.0 Removed extraneous Mac OS files from the zip

0.1.0 Initial upload

Files

softcite-extractions-oa-data.zip

Files (6.3 GB)

Name	Size	Checksum
softcite-extractions-oa-data.zip	6.3 GB	md5:d2c844fa676a8b3022aadf85fb2c92c2
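Since the archive is large (6.3 GB), it may be worth checking a download against the MD5 digest listed above before unpacking it. The sketch below uses Python's standard `hashlib` and reads the file in chunks so that it does not need to fit in memory; the function names and the local file path are assumptions made for illustration.

```python
import hashlib
from pathlib import Path

# MD5 digest listed on this page for softcite-extractions-oa-data.zip.
EXPECTED_MD5 = "d2c844fa676a8b3022aadf85fb2c92c2"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected digest."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Hypothetical local path; adjust to wherever the archive was saved.
    archive = Path("softcite-extractions-oa-data.zip")
    if archive.exists():
        print("checksum ok" if verify_download(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found")
```

A mismatch most likely indicates an incomplete or corrupted download, or a different version of the record than the one described here.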

Additional details

Repository URL: https://github.com/softcite/softcite-extractions-oa
Development Status: Active

	All versions	This version
Views	537	429
Downloads	58	50
Data volume	379.5 GB	328.9 GB
