Published September 2024 | Version v1
Dataset Open

Functional annotation of 180 RefSeq reference plant proteomes reveals a dataset of 113,684 NLR proteins

  • 1. Sainsbury Laboratory

Description

Abstract

Nucleotide-binding leucine-rich repeat receptors (NLRs) are critical components of plant immune systems, responsible for detecting pathogens and initiating defence responses. As part of our exploration of NLR protein diversity across a broad spectrum of plant species, we created a comprehensive NLRome dataset by analysing 180 reference plant genomes from the NCBI RefSeq database (Pruitt et al. 2007). This database includes high-quality genome annotations for species from a wide phylogenetic range, encompassing algae, gymnosperms, early flowering plants, monocots, and dicots (https://www.ncbi.nlm.nih.gov/refseq/). Using NLRtracker, a specialized bioinformatics tool that integrates InterProScan (Paysan-Lafosse et al. 2023) for domain identification, we extracted and catalogued NLR proteins across these diverse genomes. Based on the NLR definition of RefPlantNLR and NLRtracker (Kourelis et al. 2021), 169 of the 180 species had at least one predicted NLR. In total, we catalogued 113,686 NLRs; among the species with predicted NLRs, counts ranged from 33 in Cucurbita maxima to 4,155 in Quercus robur. In addition to NLR annotation, NLRtracker provided functional annotations for the entire proteome of each species, enabling comparative genomics and evolutionary studies.

NLRtracker output legend:

File extension                 Description
*_NLRtracker.tsv               NLRtracker overview output with gene status.
*_NLR.lst                      Identifier list of NLRs.
*_NLR.gff3                     NLR annotation of motifs, domains, and regions in GFF3 format.
*_NLR.fasta                    NLR FASTA sequences.
*_NLR-associated.lst           Identifier list of NLR-associated genes.
*_NLR-associated.gff3          NLR-associated gene annotation of motifs, domains, and regions in GFF3 format.
*_NLR_associated.fasta         NLR-associated gene FASTA sequences.
*_NBARC.fasta                  NB-ARC domain FASTA sequences.
*_NBARC_deduplictated.fasta    Deduplicated NB-ARC domain FASTA sequences.
*_iTOL.txt                     Domain annotation file for iTOL.
*_iTOL_dedup.txt               Domain annotation file of the deduplicated sequences for iTOL.
*_Domains.tsv                  Full-length and domain sequences and metadata for all NLRtracker output.
interpro_result.gff            InterProScan output of the query proteome.
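
The GFF3 files in this legend follow the standard nine-column, tab-delimited GFF3 layout, so they can be summarised with a few lines of Python. The sketch below tallies feature types (column 3) per sequence identifier (column 1). The function name `tally_gff3_features` is an illustrative assumption, and the exact feature labels that NLRtracker writes are not specified in this record.

```python
from collections import Counter, defaultdict


def tally_gff3_features(lines):
    """Count GFF3 feature types (column 3) for each sequence ID (column 1)."""
    tallies = defaultdict(Counter)
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment or directive lines (they start with '#').
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # A GFF3 feature line has exactly nine tab-separated columns;
        # anything else (for example an embedded FASTA section) is ignored.
        if len(fields) != 9:
            continue
        seq_id, feature_type = fields[0], fields[2]
        tallies[seq_id][feature_type] += 1
    return tallies
```

A file can be passed directly as an iterable of lines, for example `with open(path) as handle: tallies = tally_gff3_features(handle)`, where `path` points to one of the `*_NLR.gff3` files after extracting the archive.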

Supplementary Data

Data S1. RefSeq species list and metadata.

Data S2. Per genome sequence number statistics table for proteomes, total NLR, and putative NLR types determined by NLRtracker.
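
Per-genome NLR totals such as those in Data S2 could in principle be recomputed from the `*_NLR.lst` identifier lists. The following is a minimal sketch, assuming that each list holds one identifier per line, that the archive has been extracted to a local directory, and that the part of each file name before `_NLR.lst` identifies the proteome. None of these layout details is confirmed by this record, and the function names are hypothetical.

```python
from pathlib import Path

SUFFIX = "_NLR.lst"


def count_nlrs_per_proteome(root):
    """Return {file-name prefix: number of unique identifiers} for each *_NLR.lst under root."""
    counts = {}
    # The pattern does not match *_NLR-associated.lst, which ends differently.
    for lst_path in sorted(Path(root).rglob("*" + SUFFIX)):
        prefix = lst_path.name[: -len(SUFFIX)]
        with lst_path.open() as handle:
            # Assumption: one identifier per line; blank lines are ignored.
            identifiers = {line.strip() for line in handle if line.strip()}
        counts[prefix] = len(identifiers)
    return counts


def summarise(counts):
    """Summarise per-proteome counts: how many have NLRs, the total, and the range."""
    non_zero = [n for n in counts.values() if n > 0]
    return {
        "proteomes": len(counts),
        "with_nlrs": len(non_zero),
        "total": sum(counts.values()),
        "min": min(non_zero, default=None),
        "max": max(non_zero, default=None),
    }
```

Proteomes without any predicted NLR may have an empty list or no list at all, so the `proteomes` figure reflects only the lists that are actually present.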

Files

RefSeq_NLRtracker.zip

Files (9.0 GB)

Checksum                                Size
md5:1ebd854ac75ceb1a87e20b8b5d377dd7    72.5 kB
md5:901e7db6a614f52b38cbccb559f1614b    28.4 kB
md5:2de76e1457339fc078e6b3d5fe527f86    9.0 GB
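
The MD5 checksums listed for each file can be used to confirm that a download completed intact. The sketch below uses Python's standard `hashlib` module and reads the file in chunks, which matters for the 9.0 GB archive. The helper names `md5_of_file` and `matches_listed_checksum` are illustrative, and which checksum belongs to which file name is not stated in this listing.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or plain '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listed_checksum(local_path, "md5:2de76e1457339fc078e6b3d5fe527f86")` should return `True` for an intact copy of the 9.0 GB file, where `local_path` is wherever that file was saved.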

Additional details

References

  • Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2007 Jan;35(Database issue):D61-5. doi: 10.1093/nar/gkl842. Epub 2006 Nov 27. PMID: 17130148; PMCID: PMC1716718.
  • Kourelis J, Sakai T, Adachi H, Kamoun S. RefPlantNLR is a comprehensive collection of experimentally validated plant disease resistance proteins from the NLR family. PLoS Biol. 2021 Oct 20;19(10):e3001124. doi: 10.1371/journal.pbio.3001124. PMID: 34669691; PMCID: PMC8559963.
  • Paysan-Lafosse T, Blum M, Chuguransky S, Grego T, Pinto BL, Salazar GA, Bileschi ML, Bork P, Bridge A, Colwell L, Gough J, Haft DH, Letunić I, Marchler-Bauer A, Mi H, Natale DA, Orengo CA, Pandurangan AP, Rivoire C, Sigrist CJA, Sillitoe I, Thanki N, Thomas PD, Tosatto SCE, Wu CH, Bateman A. InterPro in 2022. Nucleic Acids Res. 2023 Jan 6;51(D1):D418-D427. doi: 10.1093/nar/gkac993. PMID: 36350672; PMCID: PMC9825450.