Published October 19, 2023
| Version v1
Dataset
Open
Plant science corpus
Description
The plant science corpus consists of the titles and abstracts of plant science articles in PubMed published before 2021. A small number of records are dated 2021 because those records were modified. The columns are:
- Index: integer index serving as identifier
- PMID: PubMed identifier
- Date: Publication date
- Journal: journal where the article was published
- Title: Title of the article
- Abstract: Abstract of the article
- Corpus: Title and abstract combined
- Text classification score: score from the model that predicts whether a record is a plant science record
- Preprocessed corpus: Corpus after lower-casing, stop word removal, removal of characters that are neither alphanumeric nor white space, and lemmatization
- Topic: index of the topic assigned to the record by topic modeling
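The relationship between the Title, Abstract, Corpus, and Preprocessed corpus columns can be illustrated with a minimal sketch. This is not the authors' pipeline: the record does not name the stop word list, the lemmatizer, or the order of the steps, so `STOP_WORDS`, `LEMMAS`, `combine`, and `preprocess` below are hypothetical stand-ins chosen only to show the kind of transformation described.

```python
import re

# Hypothetical, deliberately tiny resources. The actual stop word list and
# lemmatizer used to build the dataset are not stated in the record.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}
LEMMAS = {"genes": "gene", "regulates": "regulate", "roots": "root"}


def combine(title, abstract):
    """Join title and abstract into one string (the Corpus column)."""
    return f"{title} {abstract}".strip()


def preprocess(text):
    """Apply the four steps listed for the Preprocessed corpus column."""
    # 1. Lower-case the text.
    text = text.lower()
    # 2. Replace characters that are neither ASCII alphanumeric nor white
    #    space with a space (assumption: a space avoids gluing words together).
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # 3. Drop stop words.
    tokens = [token for token in text.split() if token not in STOP_WORDS]
    # 4. Lemmatize with a lookup table (stand-in for a real lemmatizer).
    return " ".join(LEMMAS.get(token, token) for token in tokens)


if __name__ == "__main__":
    corpus = combine("Auxin response.", "The ARF7 gene regulates roots in Arabidopsis!")
    print(preprocess(corpus))
```

A real lemmatizer from a library such as NLTK or spaCy would replace the lookup table; the dictionary is used here only to keep the sketch self-contained.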
Files
Name | Size |
---|---|
corpus_with_topic_assignment_nodup.zip (md5:1214476564d9ddea991758ce564b467b) | 361.6 MB |
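The MD5 checksum listed for the archive can be used to confirm that a download is complete. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `matches_published_checksum` are hypothetical, and the local path in the usage example assumes the archive was saved under its original name in the working directory.

```python
import hashlib

# Checksum published on this record for corpus_with_topic_assignment_nodup.zip.
EXPECTED_MD5 = "1214476564d9ddea991758ce564b467b"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read 1 MiB at a time so the 361.6 MB archive is never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, expected=EXPECTED_MD5):
    """Compare a file's MD5 digest with the expected hex string."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed local path; adjust to wherever the archive was saved.
    print(matches_published_checksum("corpus_with_topic_assignment_nodup.zip"))
```

A mismatch usually indicates an interrupted or corrupted download, in which case the file should be downloaded again.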
Additional details
Funding
- Collaborative Research: Assessing the connections between genetic interactions, environments, and phenotypes in Arabidopsis thaliana (award 2210431), U.S. National Science Foundation
- TRTech-PGR: Connecting sequences to functions within and between species through computational modeling and experimental studies (award 2107215), U.S. National Science Foundation
- NRT-HDR: Intersecting computational and data science to address grand challenges in plant biology (award 1828149), U.S. National Science Foundation
- RESEARCH-PGR: Combining machine learning and experimental analysis to define trichome and root-specific gene regulatory networks in cultivated tomato and related Solanaceae species (award 2218206), U.S. National Science Foundation
- Great Lakes Bioenergy Research Center (award DE-SC0018409), United States Department of Energy