Functional annotation of 180 RefSeq reference plant proteomes reveals a dataset of 113,684 NLR proteins
Description
Abstract
Nucleotide-binding leucine-rich repeat receptors (NLRs) are critical components of plant immune systems, responsible for detecting pathogens and initiating defence responses. To explore NLR protein diversity across a broad spectrum of plant species, we created a comprehensive NLRome dataset by analysing 180 reference plant genomes from the NCBI RefSeq database (Pruitt et al. 2007). This database includes high-quality genome annotations for species from a wide phylogenetic range, encompassing algae, gymnosperms, early flowering plants, monocots, and dicots (https://www.ncbi.nlm.nih.gov/refseq/). Using NLRtracker, a specialized bioinformatics tool that integrates InterProScan for domain identification, we extracted and catalogued NLR proteins across these diverse genomes. Based on the NLR definition of RefPlantNLR and NLRtracker (Kourelis et al. 2021), 169 of the 180 species had at least one predicted NLR. In total, we catalogued 113,686 NLRs, ranging from 33 in Cucurbita maxima to 4,155 in Quercus robur. In addition to NLR annotation, NLRtracker provided functional annotations for the entire proteome of each species, enabling comparative genomics and evolutionary studies.
NLRtracker output legend:
| File extension | Description |
|---|---|
| *_NLRtracker.tsv | NLRtracker overview output with gene status. |
| *_NLR.lst | Identifier list of NLRs. |
| *_NLR.gff3 | NLR annotation of motifs, domains, and regions in GFF3 format. |
| *_NLR.fasta | NLR FASTA sequences. |
| *_NLR-associated.lst | Identifier list of NLR-associated genes. |
| *_NLR-associated.gff3 | Annotation of motifs, domains, and regions of NLR-associated genes in GFF3 format. |
| *_NLR-associated.fasta | FASTA sequences of NLR-associated genes. |
| *_NBARC.fasta | NB-ARC domain FASTA sequences. |
| *_NBARC_deduplicated.fasta | Deduplicated NB-ARC domain FASTA sequences. |
| *_iTOL.txt | Domain annotation file for iTOL. |
| *_iTOL_dedup.txt | Domain annotation file of the deduplicated sequences for iTOL. |
| *_Domains.tsv | Full-length and domain sequences and metadata for all NLRtracker output. |
| interpro_result.gff | InterProScan output of the query proteome. |
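
As a minimal sketch of how the per-species outputs in the legend might be consumed, the following Python reads an NLR identifier list and the matching FASTA file and reports identifiers that have no sequence record. It assumes that a `*_NLR.lst` file holds one identifier per line and that the first whitespace-delimited token of each FASTA header is the identifier; neither detail is stated in this record, and the function names are illustrative only.

```python
from pathlib import Path


def read_id_list(path):
    """Return identifiers from a list file, assuming one identifier per line."""
    ids = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line:  # skip blank lines
            ids.append(line)
    return ids


def read_fasta(path):
    """Parse a FASTA file into {identifier: sequence}.

    Assumption: the identifier is the first whitespace-delimited token
    after '>' on each header line.
    """
    records = {}
    current = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                tokens = line[1:].split()
                current = tokens[0] if tokens else ""
                records[current] = []
            elif current is not None:
                # Sequence lines may be wrapped; collect and join them later.
                records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def missing_sequences(id_path, fasta_path):
    """Return identifiers listed in id_path that have no FASTA record."""
    sequences = read_fasta(fasta_path)
    return [i for i in read_id_list(id_path) if i not in sequences]
```

For a hypothetical species prefix `example`, calling `missing_sequences("example_NLR.lst", "example_NLR.fasta")` would return an empty list if every listed NLR has a sequence record.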
Supplementary Data
Data S1. RefSeq species list and metadata.
Data S2. Per-genome sequence counts for proteomes, total NLRs, and putative NLR types determined by NLRtracker.
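
Per-genome NLR totals of the kind summarized in Data S2 could, in principle, be recomputed from the identifier lists. The sketch below is not the authors' method: it assumes the archive has been unpacked into a directory containing one `*_NLR.lst` file per species, with one identifier per line, and it treats the file-name prefix as the species label. The directory layout and the function names are hypothetical.

```python
from pathlib import Path


def count_nlrs(results_dir, suffix="_NLR.lst"):
    """Count non-empty lines in every file ending with `suffix` under results_dir.

    Returns {prefix: count}, where prefix is the file name without the suffix.
    Files such as '*_NLR-associated.lst' do not end with '_NLR.lst' and are skipped.
    """
    counts = {}
    for path in sorted(Path(results_dir).rglob("*" + suffix)):
        prefix = path.name[: -len(suffix)]
        with open(path) as handle:
            counts[prefix] = sum(1 for line in handle if line.strip())
    return counts


def summarize(counts):
    """Summarize per-species counts: totals plus the smallest and largest non-zero entry."""
    with_nlrs = {name: n for name, n in counts.items() if n > 0}
    summary = {
        "species": len(counts),
        "species_with_nlrs": len(with_nlrs),
        "total_nlrs": sum(counts.values()),
    }
    if with_nlrs:
        summary["fewest"] = min(with_nlrs.items(), key=lambda kv: kv[1])
        summary["most"] = max(with_nlrs.items(), key=lambda kv: kv[1])
    return summary
```

Species without any predicted NLR may have an empty list file or no list file at all; in the latter case they would not appear in `counts`, so the `species` figure should be cross-checked against Data S1.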
Files
RefSeq_NLRtracker.zip
Total size: 9.0 GB
| MD5 checksum | Size |
|---|---|
| md5:1ebd854ac75ceb1a87e20b8b5d377dd7 | 72.5 kB |
| md5:901e7db6a614f52b38cbccb559f1614b | 28.4 kB |
| md5:2de76e1457339fc078e6b3d5fe527f86 | 9.0 GB |
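
The listed MD5 checksums can be used to confirm that a download completed intact. The following is a minimal sketch using Python's standard `hashlib` module; it reads the file in chunks so that a multi-gigabyte archive does not need to fit in memory. The function names are illustrative, and pairing the 9.0 GB checksum with `RefSeq_NLRtracker.zip` is an assumption based on the matching size, since the listing does not show file names next to the checksums.

```python
import hashlib


def md5sum(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file's MD5 digest with a listed value such as 'md5:<hex digest>'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5sum(path) == expected
```

For example, `matches_listing("RefSeq_NLRtracker.zip", "md5:2de76e1457339fc078e6b3d5fe527f86")` should return `True` for an intact copy, under the assumption stated above.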
Additional details
References
- Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2007 Jan;35(Database issue):D61-5. doi: 10.1093/nar/gkl842. Epub 2006 Nov 27. PMID: 17130148; PMCID: PMC1716718.
- Kourelis J, Sakai T, Adachi H, Kamoun S. RefPlantNLR is a comprehensive collection of experimentally validated plant disease resistance proteins from the NLR family. PLoS Biol. 2021 Oct 20;19(10):e3001124. doi: 10.1371/journal.pbio.3001124. PMID: 34669691; PMCID: PMC8559963.
- Paysan-Lafosse T, Blum M, Chuguransky S, Grego T, Pinto BL, Salazar GA, Bileschi ML, Bork P, Bridge A, Colwell L, Gough J, Haft DH, Letunić I, Marchler-Bauer A, Mi H, Natale DA, Orengo CA, Pandurangan AP, Rivoire C, Sigrist CJA, Sillitoe I, Thanki N, Thomas PD, Tosatto SCE, Wu CH, Bateman A. InterPro in 2022. Nucleic Acids Res. 2023 Jan 6;51(D1):D418-D427. doi: 10.1093/nar/gkac993. PMID: 36350672; PMCID: PMC9825450.