Invasion Biology WikiProject Scientific Papers: Text Data Mining and LLM-based Information Extraction of Species, Locations, Habitats, and Ecosystems
Description
This dataset contains the abstracts and, where available, the full texts of the publications whose DOIs are listed in the Invasion Biology WikiProject corpus (DOI: 10.5281/zenodo.12518036). The data were retrieved through the ask.orkg.org API. The script used to obtain the data is in the accompanying GitHub repository: https://github.com/jd-coderepos/invasion-biology-IE/.
The resulting CSV file includes the following fields: "ASK ID", "DOI", "Title", "Abstract", and "Full-text".
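As a minimal sketch of how the CSV could be read, the snippet below streams rows with Python's standard `csv` module and checks that the five documented columns are present. The function name `iter_records`, the UTF-8 encoding, and the presence of a header row are assumptions made for illustration; they are not confirmed by the dataset description.

```python
import csv

# Full-text cells can be far longer than the csv module's default
# 131,072-character field limit, so raise it before reading.
csv.field_size_limit(2**31 - 1)

# Column names as listed in the dataset description.
EXPECTED_FIELDS = ["ASK ID", "DOI", "Title", "Abstract", "Full-text"]


def iter_records(path):
    """Yield one dict per CSV row, keyed by column name (hypothetical helper)."""
    # Assumption: the file is UTF-8 encoded and starts with a header row.
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        missing = [f for f in EXPECTED_FIELDS if f not in (reader.fieldnames or [])]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        yield from reader
```

Streaming rows one at a time avoids loading the whole file into memory, for example `for record in iter_records("publications_tdm.csv"): print(record["DOI"])`.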
Of the 49,438 queried DOIs, the ASK database provided:
- Total DOIs processed: 12,636
- DOIs with neither abstract nor full-text: 36 (abstracts with fewer than 10 tokens were counted as missing)
- DOIs with abstracts but no full-text: 12,636
- DOIs with both abstract and full-text: 2,834
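The breakdown above could be recomputed from the CSV rows along the following lines. This is a sketch under stated assumptions: tokens are counted by whitespace splitting (the tokenizer actually used is not specified), a full-text counts as present when the cell is non-empty, and the names `categorize` and `summarize` are hypothetical. Counts obtained this way may therefore differ from the figures listed.

```python
from collections import Counter

# Threshold taken from the description: abstracts under 10 tokens count as missing.
MIN_ABSTRACT_TOKENS = 10


def token_count(text):
    """Count whitespace-separated tokens (assumed tokenization)."""
    return len((text or "").split())


def categorize(record):
    """Assign a record to one availability category."""
    has_abstract = token_count(record.get("Abstract")) >= MIN_ABSTRACT_TOKENS
    has_fulltext = bool((record.get("Full-text") or "").strip())
    if has_abstract and has_fulltext:
        return "both"
    if has_abstract:
        return "abstract_only"
    if has_fulltext:
        # Not a category named in the description; kept so no row is dropped.
        return "fulltext_only"
    return "neither"


def summarize(records):
    """Return a Counter mapping each category to its number of records."""
    return Counter(categorize(r) for r in records)
```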
The second part of the dataset contains structured information extracted from the publications using the GPT-4o Large Language Model. This structured data is included in the zipped folder structured-publications.zip.
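Because the internal layout of structured-publications.zip is not described here, a cautious first step is to list its contents before extracting anything. The sketch below uses Python's standard `zipfile` module; the helper name `list_archive` is hypothetical, and no assumption is made about the file names or formats inside the archive.

```python
import zipfile


def list_archive(path):
    """Return (name, uncompressed size in bytes) for each file in the archive."""
    with zipfile.ZipFile(path) as archive:
        return [
            (info.filename, info.file_size)
            for info in archive.infolist()
            if not info.is_dir()  # skip directory entries
        ]
```

Calling `list_archive("structured-publications.zip")` on the downloaded file shows which entries exist and how large they are, which helps decide how to parse them.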
The accompanying GitHub repository provides access to the code and scripts used at various stages of the information extraction (IE) process.
Theme of the Study:
"Mining for Species, Locations, Habitats, and Ecosystems from Scientific Papers in Invasion Biology: A Large-Scale Exploratory Study with Large Language Models."
Files
publications_tdm.csv
Total size: 183.7 MB
| MD5 checksum | Size |
|---|---|
| b200dd1e1518a4778c0ad4c16159e1d7 | 176.4 MB |
| f1ab504e62b88db821dd56b40de157b9 | 7.3 MB |
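The MD5 checksums listed for the two files can be used to confirm that a download is complete and uncorrupted. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `file_md5` and `matches_checksum` are hypothetical, and the local file path is whatever name the download was saved under.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Compare a file's MD5 with a listed value, with or without an 'md5:' prefix."""
    expected = expected_md5.strip().lower().split(":")[-1]
    return file_md5(path) == expected
```

For example, `matches_checksum(local_path, "md5:b200dd1e1518a4778c0ad4c16159e1d7")` would be expected to return `True` for an intact copy of the 176.4 MB file.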
Additional details
Dates
- Available: 2024-12-19
Software
- Repository URL: https://github.com/jd-coderepos/invasion-biology-IE/
- Programming language: Python
- Development status: Active
References
- Mietchen, D., Jeschke, J. M., Bernard-Verdier, M., Heger, T., Musseau, C., & Tyszka, S. (2024). Invasion biology corpus 2024-07 [Data set]. Zenodo. https://doi.org/10.5281/zenodo.12518037