Softcite Extractions from the Open Access Literature
Description
The softcite-extractions-oa dataset is a collection of ML-identified mentions of software detected in about 24 million academic papers, all of them open access and available circa 2024. The extractions were created from academic PDFs using the Softcite mention extraction toolchain, which is built on the Grobid model trained on the Softcite Annotations dataset v2. More details are available on the Softcite Org home page.
This work used JetStream 2 at Indiana through allocation CIS220172 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296. The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing computational resources that have contributed to the creation and processing of this research dataset. URL: http://www.tacc.utexas.edu
Usage examples, issue reporting, and further documentation are available at https://github.com/softcite/softcite-extractions-oa
As documented in the GitHub repo, these parquet files were created by processing the original JSON files produced by softcite-mentions. An intermediate dataset (about 26 GiB) is available at https://doi.org/10.5281/zenodo.15096765
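As a quick orientation after downloading, the sketch below lists the parquet entries inside the archive without unpacking it. It assumes the parquet files ship inside softcite-extractions-oa-data.zip and that the archive sits in the current directory; the internal file names and layout are not described on this page, so the function simply filters on the `.parquet` suffix. The helper name `list_parquet_members` is hypothetical.

```python
import zipfile
from pathlib import Path


def list_parquet_members(archive_path):
    """Return the sorted names of all .parquet entries in a zip archive."""
    with zipfile.ZipFile(archive_path) as archive:
        return sorted(
            name
            for name in archive.namelist()
            if name.lower().endswith(".parquet")
        )


if __name__ == "__main__":
    # Assumed local download location; adjust to wherever the zip was saved.
    archive = Path("softcite-extractions-oa-data.zip")
    if archive.exists():
        for name in list_parquet_members(archive):
            print(name)
```

The GitHub repository remains the place to look for the actual table names and schemas.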
Version history:
1.0.0 Removed file restrictions, added instructions on how/where to report extraction problems
0.2.1 Added link to upstream JSONL extractions dataset (which also links to this dataset)
0.2.0 Removed extraneous Mac OS files from the zip
0.1.0 Initial upload
Files
| Name | Size | Checksum |
|---|---|---|
| softcite-extractions-oa-data.zip | 6.3 GB | md5:d2c844fa676a8b3022aadf85fb2c92c2 |
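Because the archive is several gigabytes, it may be worth checking the download against the MD5 checksum listed above before unpacking. The following is a minimal sketch using only the Python standard library; the function names `md5_of_file` and `matches_expected` are hypothetical, and the local path is an assumption about where the file was saved.

```python
import hashlib

# Checksum taken from the file listing on this page.
EXPECTED_MD5 = "d2c844fa676a8b3022aadf85fb2c92c2"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so the whole file never sits in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Return True when the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, `matches_expected("softcite-extractions-oa-data.zip")` should return `True` for an intact download of the current version. A mismatch usually points to an interrupted transfer or to a different version of the record.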
Additional details
Software
- Repository URL: https://github.com/softcite/softcite-extractions-oa
- Development Status: Active