Genome assembly and annotation of a 17-year periodical cicada Magicicada cassini (Insecta: Hemiptera: Cicadidae)
Authors/Creators
Description
We assembled and annotated the genome of the 17-year periodical cicada Magicicada cassini. Total genomic DNA was extracted from a female of Magicicada cassini Brood II collected in Virginia (38.80366°N, 77.47689°W) on May 26, 2013, and fixed in RNAlater. PacBio Continuous Long Read (CLR) sequences (32 million reads; 658 Gbp; mean length, 20,648 bp; coverage, 101-fold) were obtained using the PacBio Sequel II platform, and Illumina paired-end (PE; 2 x 250 bp) sequences (2505 million reads; 626 Gbp; coverage, 96-fold based on raw read counts) were obtained using the Illumina NovaSeq 6000 platform. The genome size estimated from the Illumina short reads with GenomeScope ver. 2.0 (Ranallo-Benavidez et al., 2020) was 6,428 Mb. We assembled the PacBio CLR sequences using Canu v2.1.1 (Koren et al., 2017). Illumina PE reads were mapped to the assembled genome sequence using BWA, and the genome sequence was polished three times using Pilon v1.23 (Walker et al., 2014). Duplicated haplotigs were eliminated using Purge_dups v1.2.3 (Guan et al., 2020). Six parameter settings were tested, and the best result was selected based on the number of haplotigs (fewest), the scaffold number (7361; smallest), the scaffold N50 (2.05 Mbp; longest), and the BUSCO scores (complete, 95.1%, comprising 92.8% single-copy and 2.3% duplicated; fragmented, 3.7%; missing, 1.2%). Finally, the contigs containing the mitochondrial genome were detected, and the mitogenome sequences were removed from the assembled genome. The final genome assembly was 6.0 Gbp in total length, with 7359 scaffolds and an N50 of 2.05 Mbp. The GC content was 35.27%.
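For readers who want to recompute the summary statistics quoted above from their own copy of the assembly, the following is a minimal sketch in Python. It is not the software used for this record; the function names (`n50`, `fold_coverage`, `gc_percent`) are hypothetical helpers introduced here for illustration, and the coverage example simply divides the reported PacBio base total by the GenomeScope size estimate.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of the total assembled bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]  # not reached for non-empty input


def fold_coverage(total_bases: float, genome_size: float) -> float:
    """Sequencing depth as total sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome size must be positive")
    return total_bases / genome_size


def gc_percent(sequence: str) -> float:
    """GC content in percent, counting only unambiguous A/C/G/T bases."""
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


# Toy scaffold lengths (illustrative values, not from the assembly).
toy_n50 = n50([2, 3, 4, 5, 6])

# Reported PacBio yield (658 Gbp) over the estimated genome size (6,428 Mb).
pacbio_depth = fold_coverage(658e9, 6428e6)
```

With the rounded totals given in the description, the depth comes out at roughly 100-fold; small differences from the reported values are expected because the exact base counts are not listed here.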
Protein-coding genes were predicted by combining the results of RNA-seq-based, homology-based, and ab initio prediction methods. The M. cassini RNA-seq data used for gene prediction were from one third-instar nymph, three female fifth-instar nymphs, and one male and one female adult (GenBank accession no.: DRR526411; DRR526355; DRR526359; DRR526370; DRR055517; DRR055504). RNA-seq-based prediction used both assembly-first and mapping-first methods. For the assembly-first method, RNA-seq data were assembled using Trinity v2.8.4 (Grabherr et al., 2011) and Oases v0.2.8 (Schulz et al., 2012). Redundant assembled RNA contigs were removed using CD-HIT v4.6 (Fu et al., 2012), and the remaining contigs were splice-mapped to the genome sequences using GMAP v2018-07-04 (Wu and Watanabe, 2005). For the mapping-first method, RNA-seq data were mapped to the genome scaffolds using HISAT2 v2.2.1 (Kim et al., 2019), and gene sets were predicted from the mapping results with StringTie v2.1.7 (Pertea et al., 2016). ORF regions were estimated from the results of both the assembly-first and mapping-first methods using TransDecoder v5.0.2 (https://github.com/TransDecoder/TransDecoder). For homology-based prediction, amino acid sequences of Halyomorpha halys (NCBI accession No: GCA_001676915.1), Nilaparvata lugens (NCBI accession No: GCF_014356525.1), Homalodisca vitripennis (NCBI accession No: GCF_021130785.1), Bemisia tabaci (NCBI accession No: GCF_001854935.1), and Acyrthosiphon pisum (NCBI accession No: GCF_005508785.1) were splice-mapped to the genome scaffolds using Spaln v2.3.3f (Gotoh, 2008), and gene sets were predicted. For ab initio prediction, training sets were first selected from the RNA-seq-based prediction results, and AUGUSTUS v3.3.2 (Stanke and Waack, 2003) was trained using this set. SNAP v2006-07-28 (Korf, 2004) was also used for ab initio prediction. Finally, all predicted gene candidates were merged using the GINGER pipeline (Taniguchi et al., 2023).
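As a rough illustration of what "estimating ORF regions" in a transcript means, the sketch below scans the three forward reading frames of a nucleotide sequence and reports the longest ATG-to-stop open reading frame. This is a deliberately simplified stand-in and not the TransDecoder method, which also evaluates the reverse strand and scores candidates for coding potential; `longest_orf` is a hypothetical helper named here for illustration only.

```python
from typing import Optional, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the longest forward-strand ORF.

    Coordinates are 0-based with an exclusive end, and the span runs
    from the first ATG in a frame through its in-frame stop codon.
    Returns None when no complete ORF is found.
    """
    seq = seq.upper()
    best: Optional[Tuple[int, int]] = None
    for frame in range(3):
        start: Optional[int] = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                end = i + 3  # include the stop codon in the span
                if best is None or (end - start) > (best[1] - best[0]):
                    best = (start, end)
                start = None  # look for the next ORF in this frame
    return best


# Toy transcript (illustrative, not from the dataset).
toy = "CCATGAAATTTGGGTAACC"
span = longest_orf(toy)
orf_seq = toy[span[0]:span[1]] if span else ""
```

In the toy sequence the only complete ORF starts at the ATG at position 2, so `orf_seq` is `ATGAAATTTGGGTAA`.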
Files
M.cassini_blasted_genes.txt
Additional details
Funding
- Japan Society for the Promotion of Science
- JP16H06279 (PAGS)
- Japan Society for the Promotion of Science
- JP19H05550/JP20K20461