Published December 19, 2024 | Version v3
Dataset Open

Invasion Biology WikiProject Scientific Papers: Text Data Mining and LLM-based Information Extraction of Species, Locations, Habitats, and Ecosystems

  • 1. Technische Informationsbibliothek (TIB)

Contributors

Contact person:

  • 1. Technische Informationsbibliothek (TIB)

Description

This dataset contains the abstracts and full texts of the publications listed by DOI in the Invasion Biology WikiProject (DOI: 10.5281/zenodo.12518036). The data were retrieved using the ask.orkg.org API. The script used to obtain the data is available in the accompanying GitHub repository: https://github.com/jd-coderepos/invasion-biology-IE/.

The resulting CSV file includes the following fields: "ASK ID", "DOI", "Title", "Abstract", and "Full-text".

Of the 49,438 queried DOIs, the ASK database provided:

  • Total DOIs processed: 12,636
  • DOIs with neither abstract nor full-text: 36 (an abstract with fewer than 10 tokens was counted as missing)
  • DOIs with abstracts but no full-text: 12,636
  • DOIs with both abstract and full-text: 2,834
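The coverage categories above can be recomputed from the CSV. The following is a minimal sketch, not the authors' script: it assumes the column names "Abstract" and "Full-text" as listed above, treats a token as a whitespace-separated word (the tokenizer actually used is not stated), and the helper names are hypothetical.

```python
import csv
from collections import Counter

MIN_ABSTRACT_TOKENS = 10  # threshold mentioned in the description


def has_abstract(text, min_tokens=MIN_ABSTRACT_TOKENS):
    """Return True if the abstract has at least `min_tokens` tokens.

    Assumption: tokens are whitespace-separated words.
    """
    return len((text or "").split()) >= min_tokens


def coverage_counts(rows):
    """Count rows per coverage category from dicts keyed by column name."""
    counts = Counter()
    for row in rows:
        abstract_ok = has_abstract(row.get("Abstract"))
        fulltext_ok = bool((row.get("Full-text") or "").strip())
        if abstract_ok and fulltext_ok:
            counts["both"] += 1
        elif abstract_ok:
            counts["abstract_only"] += 1
        elif fulltext_ok:
            counts["fulltext_only"] += 1
        else:
            counts["neither"] += 1
        counts["total"] += 1
    return counts


def coverage_from_csv(path="publications_tdm.csv"):
    """Stream the CSV and count categories without loading it all at once."""
    csv.field_size_limit(2**31 - 1)  # full-text cells exceed the default limit
    with open(path, newline="", encoding="utf-8") as handle:
        return coverage_counts(csv.DictReader(handle))
```

Calling coverage_from_csv() on the downloaded file returns a Counter whose values can be compared with the figures listed above; differences may simply reflect a different tokenization.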

The second part of the dataset contains structured information extracted from the publications using the GPT-4o Large Language Model. This structured data is included in the zipped folder structured-publications.zip.
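The internal layout of structured-publications.zip is not described here, so a reasonable first step is to list its members before extracting anything. A minimal sketch using only the Python standard library (the function name is hypothetical):

```python
import zipfile


def list_archive_members(archive="structured-publications.zip"):
    """Return the file names stored in the archive, skipping directory entries."""
    with zipfile.ZipFile(archive) as bundle:
        return [info.filename for info in bundle.infolist() if not info.is_dir()]
```

The argument may be a path or an open binary file object, which is what zipfile.ZipFile accepts.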

The accompanying GitHub repository provides access to the code and scripts used at various stages of the information extraction (IE) process.

Theme of the Study:
"Mining for Species, Locations, Habitats, and Ecosystems from Scientific Papers in Invasion Biology: A Large-Scale Exploratory Study with Large Language Models."

Files

publications_tdm.csv

Files (183.7 MB)

Name Size
md5:b200dd1e1518a4778c0ad4c16159e1d7 176.4 MB
md5:f1ab504e62b88db821dd56b40de157b9 7.3 MB
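After downloading, a file can be checked against the MD5 checksums listed above. The sketch below assumes nothing about which checksum belongs to which file, since the listing does not say; it only tests whether a file's digest is one of the two listed values. Function names are hypothetical.

```python
import hashlib

# Checksums as listed in the file table above.
LISTED_MD5 = {
    "b200dd1e1518a4778c0ad4c16159e1d7",
    "f1ab504e62b88db821dd56b40de157b9",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path):
    """Return True if the file's MD5 equals one of the listed checksums."""
    return md5_of_file(path) in LISTED_MD5
```

Reading in chunks keeps memory use flat, which matters for the larger of the two files.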

Additional details

Funding

Federal Ministry of Education and Research
SCINEXT: Neural-Symbolic SCholarly InnovatioN EXTraction 01IS22070

Dates

Available
2024-12-19

Software

Repository URL
https://github.com/jd-coderepos/invasion-biology-IE/
Programming language
Python
Development Status
Active

References