Published October 19, 2023 | Version v1
Dataset Open

Plant science corpus

  • 1. Michigan State University


The plant science corpus consists of the titles and abstracts of plant science articles in PubMed published before 2021. A small number of records carry 2021 dates because those records were modified. The columns are:

  • Index: integer index serving as identifier
  • PMID: PubMed identifier
  • Date: Publication date
  • Journal: journal where the article was published
  • Title: Title of the article
  • Abstract: Abstract of the article
  • Corpus: Title and abstract combined
  • Text classification score: score from the model that predicts whether a record is a plant science record
  • Preprocessed corpus: Corpus after lower-casing, stop word removal, removal of non-alphanumeric and non-whitespace characters, and lemmatisation
  • Topic: index of the topic assigned to the record by topic modeling
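As a rough illustration of how the Corpus and Preprocessed corpus columns relate, the sketch below joins a title and an abstract and then applies lower-casing, character filtering, and stop word removal. It is not the pipeline used to build the dataset. The function names `combine` and `preprocess`, the space separator, the tiny stop word list, the ASCII-only character filter, and the order of the steps are all assumptions made for this example. Lemmatisation is left out because it needs an external library.

```python
import re

# Hypothetical, deliberately tiny stop word list; the list used for the
# dataset is not stated in this record.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "with"}


def combine(title: str, abstract: str) -> str:
    """Join title and abstract into one string (the separator is an assumption)."""
    return f"{title} {abstract}".strip()


def preprocess(text: str) -> str:
    """Lower-case, drop non-alphanumeric/non-whitespace characters, remove stop words.

    Lemmatisation is intentionally omitted from this sketch.
    """
    text = text.lower()
    # Keep only ASCII letters, digits, and whitespace (an assumption; this
    # also strips accented letters).
    text = re.sub(r"[^a-z0-9\s]", "", text)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


corpus = combine("Root growth in Arabidopsis.", "Auxin regulates the growth of roots.")
print(preprocess(corpus))
```

The example title and abstract are made up for illustration and are not records from the dataset.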


Files (361.6 MB)


Additional details


  • Collaborative Research: Assessing the connections between genetic interactions, environments, and phenotypes in Arabidopsis thaliana (award 2210431), National Science Foundation
  • TRTech-PGR: Connecting sequences to functions within and between species through computational modeling and experimental studies (award 2107215), National Science Foundation
  • NRT-HDR: Intersecting computational and data science to address grand challenges in plant biology (award 1828149), National Science Foundation
  • RESEARCH-PGR: Combining machine learning and experimental analysis to define trichome and root-specific gene regulatory networks in cultivated tomato and related Solanaceae species (award 2218206), National Science Foundation
  • Great Lakes Bioenergy Research Center (award DE-SC0018409), United States Department of Energy