Published August 22, 2023 | Version v1
Dataset Open

PICKLE Dataset

Authors/Creators

  • 1. Michigan State University

Description

The PICKLE dataset accompanies the paper "In a PICKLE: A gold standard entity and relation corpus for the molecular plant sciences". It is a natural language processing (NLP) dataset of scientific abstracts labeled with gold standard entities and relations. The abstracts were drawn from PubMed searches for the terms "jasmonic acid" and "gibberellic acid". The brat-formatted (.txt/.ann) version contains 6,245 entities and 2,149 relations across 250 documents. The jsonl-formatted version contains 6,164 entity and 2,094 relation annotations, because some annotations cannot be aligned to the tokenization used in the jsonl format and are dropped.
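For readers unfamiliar with the brat standoff format, the following is a minimal sketch of how the entity and relation lines in a .ann file might be read. It assumes the standard brat conventions (tab-separated lines, entity IDs starting with "T", binary relation IDs starting with "R"). The function name `parse_brat_ann`, the sample text, and the labels "Hormone", "Protein", and "interacts" are hypothetical illustrations and are not taken from the PICKLE annotation schema.

```python
def parse_brat_ann(ann_text):
    """Parse entity (T) and binary relation (R) lines from brat standoff text.

    Returns a dict of entities keyed by ID and a list of relation tuples.
    Other annotation kinds (events, attributes, notes) are ignored.
    """
    entities = {}
    relations = []
    for line in ann_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        ann_id, body = fields[0], fields[1]
        if ann_id.startswith("T"):
            # body looks like "Label start end"; the span is kept as a string
            # so discontinuous spans such as "0 5;10 15" are not mangled
            label, span = body.split(" ", 1)
            text = fields[2] if len(fields) > 2 else ""
            entities[ann_id] = (label, span, text)
        elif ann_id.startswith("R"):
            # body looks like "Label Arg1:T1 Arg2:T2"
            label, arg1, arg2 = body.split()
            relations.append(
                (label, arg1.split(":", 1)[1], arg2.split(":", 1)[1])
            )
    return entities, relations


# Hypothetical .ann content; the labels are illustrative only.
sample_ann = (
    "T1\tHormone 0 13\tjasmonic acid\n"
    "T2\tProtein 24 28\tJAZ1\n"
    "R1\tinteracts Arg1:T1 Arg2:T2\n"
)

entities, relations = parse_brat_ann(sample_ann)
print(len(entities), "entities,", len(relations), "relations")
```

Running such a parser over every .ann file in the brat-formatted archive and summing the counts is one way to check the totals reported above.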

Files

brat_formatted_PICKLE_dataset_unsplit.zip

Files (859.2 kB)

md5:caaa214a3d6a5f6355127d2eec97c477 (403.4 kB)
md5:9792e4a131226228aff846fcb50cb89a (455.8 kB)
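The md5 values listed for the files can be used to confirm that a download is intact. The sketch below shows one way to do this with Python's standard library. The helper names `md5_of_file` and `matches_checksum` are hypothetical, and the local path in the usage comment is a placeholder for wherever the archive was saved.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 with an expected value such as 'md5:abc123...'."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


# Usage (placeholder path; substitute the checksum listed for that file):
# matches_checksum("path/to/downloaded_file.zip", "md5:...")
```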

Additional details

Funding

U.S. National Science Foundation
NRT-HDR: Intersecting computational and data science to address grand challenges in plant biology DGE-1828149
U.S. National Science Foundation
TRTech-PGR: Connecting sequences to functions within and between species through computational modeling and experimental studies IOS-2107215
U.S. National Science Foundation
RESEARCH-PGR: Combining machine learning and experimental analysis to define trichome and root-specific gene regulatory networks in cultivated tomato and related Solanaceae species. IOS-2218206
U.S. National Science Foundation
Assessing the connections between genetic interactions, environments, and phenotypes in Arabidopsis thaliana. MCB-2210431
Great Lakes Bioenergy Research Center
Great Lakes Bioenergy Research Center BER DE-SC0018409

Dates

Accepted
2023-11-06
Manuscript formally accepted to in silico Plants
Available
2023-11-07
First Zenodo upload of both dataset formats
Available
2023-08-22
jsonl data uploaded to Hugging Face