Published February 26, 2025 | Version 1.0.0
Dataset Open

Deep-learning-based annotation of 230 superasterid genomes reveals a harmonized dataset of 91,366 NLRs (v250214_91366)

  • 1. Sainsbury Laboratory
  • 2. Iwate Biotechnology Research Center

Description

Abstract

Plant nucleotide-binding leucine-rich repeat receptors (NLRs) are intracellular immune receptors crucial for pathogen recognition and immune responses. Despite their importance, NLRs are challenging to annotate and are frequently missed by standard annotation pipelines. To address the variability in NLR annotation accuracy across pipelines, we performed a harmonized de novo annotation of 230 high-quality superasterid genomes using the deep learning-based software Helixer (Holst et al. 2023), which yielded 10,124,265 annotated protein sequences. We then used NLRtracker, which relies on InterProScan for domain identification, to detect NLR and NLR-associated sequences (Kourelis et al. 2021; Blum et al. 2025). Using the NLR definition from the RefPlantNLR dataset, we identified 91,366 NLRs, with counts ranging from 12 and 19 in the parasitic plants Cuscuta campestris and Orobanche coerulescens, respectively, to 2,804 in Solanum tuberosum (potato). In addition to the NLR annotations, we provide the Helixer-generated genome annotations, including proteomes, coding DNA sequences (CDS), and GFF files. This dataset offers a valuable resource for standardized comparative genomics and evolutionary studies across superasterids.

Available at Dryad: https://doi.org/10.5061/dryad.sxksn03d6

 

Methods

Helixer v0.3.2 (Stiehler et al. 2020; Holst et al. 2023) was run under Singularity on each genome FASTA file with the option '--lineage land_plant', which applies the default land plant model (land_plant_v0.3_a_0080.h5). Coding DNA sequence (CDS) and protein FASTA files were extracted from the output GFF files using GffRead v0.12.7 (Pertea and Pertea 2020) with the '-x' and '-y' options, respectively. The extracted protein sequences were then analyzed using NLRtracker (Kourelis et al. 2021), which integrates InterProScan v5.65-97.0 (Jones et al. 2014).
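The CDS and protein extraction step could be reproduced with a small Python wrapper around GffRead. This is an illustrative sketch, not the authors' script: the function names and output file names are hypothetical, and the '-g' option (genome FASTA) is an assumption added here because GffRead needs the genome sequence to write the '-x' and '-y' outputs.

```python
import subprocess
from pathlib import Path


def gffread_command(genome_fasta, gff, out_dir):
    """Build a GffRead command writing CDS (-x) and protein (-y) FASTA files."""
    out_dir = Path(out_dir)
    stem = Path(gff).stem  # hypothetical naming scheme, not taken from the dataset
    return [
        "gffread",
        "-g", str(genome_fasta),                       # genome assembly FASTA (assumed)
        "-x", str(out_dir / f"{stem}.cds.fasta"),      # spliced CDS sequences
        "-y", str(out_dir / f"{stem}.protein.fasta"),  # translated protein sequences
        str(gff),
    ]


def run_gffread(genome_fasta, gff, out_dir):
    """Run GffRead; requires the gffread executable on PATH."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(gffread_command(genome_fasta, gff, out_dir), check=True)
```

Building the argument list separately from running it makes the command easy to inspect or log before execution.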

BUSCO scores were calculated using BUSCO v5.5.0 (Manni et al. 2021) with the options '-m protein --lineage_dataset viridiplantae_odb10'.
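A BUSCO run with these options might be assembled as follows. This is a minimal sketch under stated assumptions: the input ('-i') and output ('-o') arguments and the file names are not given in the record and are added here only for illustration.

```python
import subprocess


def busco_command(proteome_fasta, out_name):
    """Build a BUSCO command in protein mode against viridiplantae_odb10."""
    return [
        "busco",
        "-i", str(proteome_fasta),  # input proteome (assumed; not stated in the record)
        "-o", str(out_name),        # output run name (assumed; not stated in the record)
        "-m", "protein",
        "--lineage_dataset", "viridiplantae_odb10",
    ]


def run_busco(proteome_fasta, out_name):
    """Run BUSCO; requires the busco executable on PATH."""
    subprocess.run(busco_command(proteome_fasta, out_name), check=True)
```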

 

Helixer output legend

Genome annotations are organized by taxonomic order, following APG IV (The Angiosperm Phylogeny Group et al. 2016). Each order has its own subdirectory containing the genome assembly FASTA, GFF annotation, CDS FASTA, protein FASTA, and NLRtracker output files. In addition, two compiled FASTA files are provided, one of proteomes and one of CDS, in which sequences are tagged with their source assembly.

NLRtracker output legend

File extension                 Description
*_NLRtracker.tsv               NLRtracker overview output with gene status.
*_NLR.lst                      Identifier list of NLRs.
*_NLR.gff3                     NLR annotation of motifs, domains, and regions in GFF3 format.
*_NLR.fasta                    NLR FASTA sequences.
*_NLR-associated.lst           Identifier list of NLR-associated genes.
*_NLR-associated.gff3          Annotation of motifs, domains, and regions of NLR-associated genes in GFF3 format.
*_NLR_associated.fasta         FASTA sequences of NLR-associated genes.
*_NBARC.fasta                  NB-ARC domain FASTA sequences.
*_NBARC_deduplictated.fasta    Deduplicated NB-ARC domain FASTA sequences.
*_iTOL.txt                     Domain annotation file for iTOL.
*_iTOL_dedup.txt               Domain annotation file of the deduplicated sequences for iTOL.
*_Domains.tsv                  Full-length and domain sequences and metadata for all NLRtracker output.
interpro_result.gff            InterProScan output of the query proteome.
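As an example of working with these files, the identifier lists can be used to tally NLRs per genome. The sketch below assumes that each *_NLR.lst file holds one identifier per line and that the file name prefix identifies the genome; both are assumptions about the layout, and the function names are hypothetical.

```python
from pathlib import Path


def read_identifiers(lst_path):
    """Return the non-empty lines of an identifier list (assumed one ID per line)."""
    with open(lst_path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def count_nlrs(root_dir):
    """Map each *_NLR.lst prefix found under root_dir to its identifier count."""
    suffix = "_NLR.lst"
    counts = {}
    for lst in sorted(Path(root_dir).rglob(f"*{suffix}")):
        prefix = lst.name[: -len(suffix)]  # file name without the _NLR.lst suffix
        counts[prefix] = len(read_identifiers(lst))
    return counts
```

The glob pattern matches only names ending in _NLR.lst, so the *_NLR-associated.lst lists are not counted.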

 

Recommended command for decompressing NLRtracker output files: "tar -xzvf"
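For scripted workflows, the same extraction can be done with Python's standard tarfile module. This sketch assumes the archives are gzip-compressed tar files, as the '-z' flag suggests; the function name and paths are hypothetical.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive and return its member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as archive:
        names = archive.getnames()
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions: refuse unsafe members such as absolute paths.
            archive.extractall(dest, filter="data")
        else:
            archive.extractall(dest)
    return names
```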

 

Supplementary Data

Data S1. Species list and metadata.

Data S2. Per-genome table of sequence counts for proteomes, total NLRs, and putative NLR types determined by NLRtracker, together with proteome BUSCO scores.

Files (89.9 kB)

md5:36355bd683c6d785da6d24dcd915d668 (44.9 kB)
md5:c5417efb1d601b8b52dbe4e26165967d (45.0 kB)

Additional details

Funding

European Commission
BLASTOFF - Retooling plant immunity for resistance to blast fungi 743165
UK Research and Innovation
Mechanisms of pathogen suppression of NLR-mediated immunity BB/V002937/1
UK Research and Innovation
Engineering CC-HMA-NLR immune receptors for disease resistance in crops (ERiC) BB/W002221/1
UK Research and Innovation
BBSRC Institute Strategic Programme: Advancing Plant Health (APH) Partner Grant BB/Y002997/1
UK Research and Innovation
Genome evolution of a pandemic clonal lineage of the wheat blast fungus BB/W008157/1
UK Research and Innovation
PIKOBODIES: Made-to-order plant disease resistance genes using receptor-nanobody fusions EP/Y032187/1

References

  • Blum M, Andreeva A, Florentino LC, Chuguransky SR, Grego T, Hobbs E, Pinto BL, Orr A, Paysan-Lafosse T, Ponamareva I, et al. InterPro: the protein sequence classification resource in 2025. Nucleic Acids Res. 2025:53(D1):D444–D456. https://doi.org/10.1093/nar/gkae1082
  • Holst F, Bolger A, Günther C, Maß J, Triesch S, Kindel F, Kiel N, Saadat N, Ebenhöh O, Usadel B, et al. Helixer–de novo Prediction of Primary Eukaryotic Gene Models Combining Deep Learning and a Hidden Markov Model. bioRxiv. 2023:2023.02.06.527280. https://doi.org/10.1101/2023.02.06.527280
  • Jones P, Binns D, Chang H-Y, Fraser M, Li W, McAnulla C, McWilliam H, Maslen J, Mitchell A, Nuka G, et al. InterProScan 5: genome-scale protein function classification. Bioinformatics. 2014:30(9):1236–1240. https://doi.org/10.1093/bioinformatics/btu031
  • Kourelis J, Sakai T, Adachi H, and Kamoun S. RefPlantNLR is a comprehensive collection of experimentally validated plant disease resistance proteins from the NLR family. PLOS Biol. 2021:19(10):e3001124. https://doi.org/10.1371/journal.pbio.3001124
  • Manni M, Berkeley MR, Seppey M, Simão FA, and Zdobnov EM. BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Mol Biol Evol. 2021:38(10):4647–4654. https://doi.org/10.1093/molbev/msab199
  • Pertea G and Pertea M. GFF Utilities: GffRead and GffCompare. F1000Research. 2020:9:304. https://doi.org/10.12688/f1000research.23297.1
  • Stiehler F, Steinborn M, Scholz S, Dey D, Weber APM, and Denton AK. Helixer: cross-species gene annotation of large eukaryotic genomes using deep learning. Bioinformatics. 2020:36(22–23):5291–5298. https://doi.org/10.1093/bioinformatics/btaa1044
  • The Angiosperm Phylogeny Group, Chase MW, Christenhusz MJM, Fay MF, Byng JW, Judd WS, Soltis DE, Mabberley DJ, Sennikov AN, Soltis PS, et al. An update of the Angiosperm Phylogeny Group classification for the orders and families of flowering plants: APG IV. Bot J Linn Soc. 2016:181(1):1–20. https://doi.org/10.1111/boj.12385