The genome sequence of Samia ricini, a new model species of lepidopteran insect

Samia ricini, a gigantic saturniid moth, has the potential to be a novel lepidopteran model species. Samia ricini is far more resistant to diseases than the current model species Bombyx mori, and therefore can be more easily reared. In addition, genetic resources available for S. ricini rival those for B. mori: at least 26 ecoraces of S. ricini are reported and S. ricini can hybridize with wild Samia species, which are distributed throughout Asian countries, and produce fertile progenies. Physiological traits such as food preference, integument colour and larval spot pattern differ among S. ricini strains and wild Samia species so that those traits can be targeted in forward genetic analyses. To facilitate genetic research in S. ricini, we determined its whole genome sequence. The assembled genome of S. ricini was 458 Mb with 155 scaffolds, and the scaffold N50 length of the assembly was ~ 21 Mb. In total, 16,702 protein coding genes were predicted. While the S. ricini genome was mostly collinear with that of B. mori with some rearrangements and few S. ricini‐specific genes were discovered, chorion genes and fibroin genes seemed to have expanded in the S. ricini lineage. As the first step of genetic analyses, causal genes for “Blue,” “Yellow,” “Spot,” and “Red cocoon” phenotypes were mapped to chromosomes.

Populations of S. canningi and S. c. pryeri are distributed throughout South and East Asian countries, and natural variation in traits such as larval integument colour, larval marking patterns, cocoon colour and host plant preference can be observed among populations (Brahma et al., 2015;Peigler & Naumann, 2003).
Another advantage of S. ricini is that they are relatively easy to rear. Samia ricini is a multivoltine species whereas most saturniid species are univoltine or bivoltine (Brahma et al., 2015;Sternburg & Waldbauer, 1984), which means that research into S. ricini is free from seasonal limitations (Singh et al., 2017). In addition, S. ricini grows uniformly and can be reared synchronously at large scale, resulting in efficient egg production. We have already succeeded in establishing a genome-editing system in this species using transcription activator-like effector nucleases (TALENs) and have successfully obtained several gene knockout lines (Lee et al., 2018), meaning that functional analysis of genes of interest is now achievable.
To facilitate genetic research of S. ricini, we have determined its whole genome sequence. We used both long-read and shortread sequencers, namely the Pacbio Sequel system and Illumina HiSeq1500, to construct a high-quality genome assembly. After the assembly was completed, we attempted to identify the responsible chromosomes for multiple larval phenotypes in S. ricini and S. c. pryeri as an initial feasibility test for forward genetic research in this system.

| Insects
The UT strain of Samia ricini and Nagano strain of S. c. pryeri larvae were provided by the National BioResource Project (NBRP; http:// shigen.nig.ac.jp/wildm oth/). Samia ricini larvae were reared on Ricinus communis leaves under long-day conditions (16-hr light/8-hr dark) at 25°C. S. c. pryeri larvae were reared on Ailanthus altissima with the same photoperiod and temperature conditions. F 1 interspecific hybrids were obtained by crossing S. c. pryeri females and S. ricini males. F 1 individuals and BC 1 individuals were reared under the same conditions as S. ricini larvae.

| DNA sample preparation for whole genome sequencing
Posterior silk glands were sampled from fifth-instar larvae. Genomic DNA was prepared using Genomic-tip 100/G (Qiagen) according to the manufacturer's protocol.

| Library preparation and genome sequencing
For whole genome sequencing, the PacBio Sequel System (Pacific Bioscience) and Illumina HiSeq (Illumina) were employed. For PacBio, a 20-kb library was prepared and four SMRT cells were used for sequencing; 26.8 Gb in 3,267,255 subreads were obtained (Table S1). Illumina paired-end and mate-pair libraries were pre-  Table S2.

| RNA sequencing
Embryo-derived libraries for RNA sequencing (RNA-seq) were prepared using a TruSeq RNA Library Prep Kit (Illumina) and were sequenced using the Illumina HiSeq 2500 platform with 100-and 101-bp paired-end reads. The library for midgut-derived RNA samples was prepared using a TruSeq RNA Library Prep Kit (Illumina) and sequenced using the Genome Analyser IIx System with 76-bp paired-end reads. The libraries for anterior silk glandand middle silk gland-derived RNA samples were prepared using a SureSelect Strand Specific RNA Library Prep Kit (Agilent) and were sequenced using the Illumina HiSeq 2500 platform with 100-bp paired-end reads. Table S3 summarizes information on the results of RNA-seq.

| Quality check and trimming
The quality of Illumina short reads was examined using fastqc version 0.11.3. Based on the quality check results, trimming of reads were conducted using trimmomatic version 0.36 (Bolger et al., 2014).

| Genome assembly and completeness assessment
Long reads derived from the Sequel System were assembled using the HGAP4 pipeline bundled in smrtlink version 5.0.1. To construct consensus sequences from draft contigs from HGAP4 (Chin et al., 2016), racon version 1.2.0 (Vaser et al., 2016) with minimap version 0.2 (Li, 2016) was employed. racon treatment was repeated until the output fasta file showed no difference from that of the previous run. In this case, four repeats were sufficient for convergence. Then, to polish the assembly, pilon version 1.21 (Walker et al., 2014) was utilized with Illumina short reads. This final assembly was deposited at DDBJ (with accession nos. BLXV01000001-BLXV01000155).

| Linkage analysis of scaffolds
To clarify the linkages between scaffolds, we adapted a classical genetic approach. First, we obtained backcross generation 1 (BC 1 ) individuals between S. ricini and S. c. pryeri, a closely related species to S. ricini. The crossing scheme was (S. c. pryeri × S. ricini) × S. ricini.
We designed 35 PCR-based genetic markers which can specifically detect 35 scaffolds longer than 1 Mb and can molecularly distinguish S. ricini and S. c. pryeri ( Figure S1a) and then performed genomic PCR. Genomic DNA was extracted from the legs of eight BC 1 larvae using the DNeasy Blood and Tissue Kit (Qiagen). The genomic PCR programme was as follows: 40 cycles of 10 s at 98°C, 5 s at 60°C and 5 s at 68°C. KOD One PCR Master Mix (TOYOBO) was used to perform genomic PCR. Then, the allele combinations of scaffolds in eight BC 1 individuals were examined. According to the result, we designated the identified linkage groups according to Yoshido et al. (2011). Electrophoresis was conducted using a 2.0% agarose gel or MultiNA microchip electrophoresis system (Shimadzu). Table S5 lists all primers used for linkage analysis.

| Gene prediction
The braker2 pipeline (Camacho et al., 2009;Hoff et al., 2016Hoff et al., , 2019Lomsadze et al., 2014;Stanke et al., 2006Stanke et al., , 2008 was employed for gene prediction. First, repetitive sequences in the genome identified by repeatmasker were soft-masked. To generate extrinsic evidence for gene prediction, 11 sets of RNA-sequencing (RNA-seq) reads (Table S3) were mapped to the genome sequence using hisat2 (Kim et al., 2015). The resultant BAM files generated by hisat2 were submitted to braker2 by using "--bam" and "--softmasking" options. In parallel, we assembled the RNA-seq reads using the trinity assembler (Haas et al., 2013). Then, the tr2aacds. pl program bundled in the EvidentialGene suite (http://arthr opods. eugen es.org/Evide ntial Gene/evige ne/) was used to merge the assemblies from multiple transcriptome data sets. The merged transcriptome assemblies were aligned to the genome sequence using pasa (Haas et al., 2008(Haas et al., , 2013 for identifying the exon regions. In addition to tr2aacds.pl program, stringtie (Pertea et al., 2015) was also used to merge multiple transcriptome data for exon prediction. In addition, amino acid sequences of manually annotated sequences of S. ricini deposited in the Universal Protein Resource database (UniProt, http://www.unipr ot.org) (Bateman, 2019) were aligned to the genome sequence using exonerate version 2.2.0 (Slater & Birney, 2005) to obtain protein spliced alignment information. Finally, the multiple predictions generated by braker2, pasa, stringtie and exonerate were integrated using evidencemodeler (Haas et al., 2008). To assess the completeness of gene prediction, predicted gene sets were also submitted to busco version 3.0.2 (Waterhouse et al., 2018).

| Functional annotation
Amino acid sequences of the predicted genes were aligned to the Uniprot database with the blastp program (Camacho et al., 2009).

| Comparative genome analysis
To identify orthologue groups among multiple species, including B. mori, D. plexippus, Papilio xuthus and Plutella xylostella, orthofinder (Emms & Kelly, 2015) was used. Each gene set corresponded to the genome assembly, which was used for busco analysis (Table S4).

| Drawing a circular ideogram for Bombyx mori and Samia ricini genomes
To assess the similarity of B. mori and S. ricini genomes, a circular ideogram was drawn using clico (Cheong et al., 2015) with the circos program (Krzywinski et al., 2009). Single-copy orthologues, identified by orthofinder in each genome, were connected. To simplify the ideogram, short scaffolds in the B. mori genome assembly which were not assigned to 28 chromosomes were filtered out.

| Identifying the chorion gene cluster and phylogenetic analysis of chorion genes
orthofinder found that chorion genes were tandemly arrayed on chromosome 1 (Chr. 1). For more detailed information, we performed blastp searches against the NCBI nonredundant protein database, as well as the Uniprot database (Bateman, 2019) with an e-value cut-off of less than 1e-5. The predicted gene models within and around the chorion gene region were used as query sequences. As a result, 80 chorion genes were found in a cluster on Chr. 1. Within this cluster, five nonchorion gene models (evm.model. Sr_HGAP_JL_scaf_2.1123, 1128, 1135, 1136and 1137 were also identified (Table S6).
Phylogenetic analysis of chorion genes was conducted with 80 S. ricini chorion genes, 121 B. mori chorion genes, 21 Plutella xylostella chorion genes, 29 Papilio xuthus chorion genes, 24 D. plexippus chorion genes registered at the Uniprot and NCBI database and one nonchorion gene (evm.model.Sr_HGAP_JL_scaf_2.1135) as an outgroup. muscle was used to generate alignments of protein sequences (Edgar, 2004). Aligned sequences were subjected to phylogenetic analysis by maximum likelihood and ultrafast bootstrap methods (Minh et al., 2013) with 1,000 replicates using iq-tree version 1.5.5 (Nguyen et al., 2015). The phylogenetic tree was constructed based on the PMB+F+R5 model.
To check whether S. ricini has high-cysteine chorion gene or not, amino acid sequences of 38 high-cysteine chorion proteins of B. mori were aligned to deduced amino acid sequences of 80 S. ricini chorion genes via the blastp program.

| Identifying fibroin and sericin genes in the Samia ricini genome
The Fib-H (BAQ55621.1) and p25 (LC001863.1, LC001864.1 and LC001865.1) genes of S. ricini are already registered in GenBank and were used as queries in blastp searches against 16,702 gene models of S. ricini using an e-value cut-off of less than 1e-5 and "-seg no" option. Where no blastp hits were reported, tblastn searches against nucleotide sequences of the S. ricini genome were conducted with the same filtering parameters. To investigate whether the homologue of Fib-L is present or not in the S. ricini genome, B. mori Fib-L (NP_001037488.1) was utilized as a query for blastp and tblastn searches. In addition, we performed tblastn searches against the A. yamamai genome using B. mori Fib-L sequence as query. Tsubota et al. (2015) and Dong et al. (2015) reported that five and four sericin genes are expressed in anterior silk gland and middle silk gland, respectively (Table S7). The deduced amino acid sequences of putative sericin transcripts were submitted to the gene model set of S. ricini through blastp. Regarding LC001867 and LC001870, because the corresponding gene models were not found, tblastn was conducted to confirm whether both transcripts were present or not.
When we tried to determine the repertoire of silk protein-encoding genes in D. plexippus and Plutella xylostella, tblastn searches against the genome assemblies were conducted with B. mori Fib-H (NP_001106733.1), Fib-L, p25 (NP_001139413.1) and sericin-1, 2, 3 (AB112019.1, NP_001166287.1, NP_001108116.1) sequences as queries because no transcripts or amino acid sequences were previously reported as Fib-H, Fib-L, p25 and sericin in Plutella xylostella and D. plexippus. Genome assemblies which were used for tblastn searches were those used in busco analysis (Table S4). As the transcripts of Fib-H, Fib-L and p25 of Papilio xuthus were already registered (see Table 3), those sequences were mapped to the Papilio xuthus genome sequence to confirm the presence. Regarding sericin genes in Papilio xuthus, no sequences were previously registered in GenBank, and thus the same procedure as in Plutella xylostella and D. plexippus was taken. Phylogenetic analysis of sericin was conducted with seven S. ricini putative sericin genes, three B. mori sericin genes and five A. yamamai sericin genes (LC08587, LC08588, LC08589, LC08590 and LC08591; Zurovec et al., 2016). muscle was used to generate alignments of protein sequences (Edgar, 2004).
Aligned sequences were subjected to phylogenetic analysis by maximum likelihood and bootstrap methods with 1,000 replicates using megax (Kumar et al., 2018). The maximum likelihood tree under the Whelan And Goldman + Freq. model (Whelan & Goldman, 2001) was inferred. Nearest-Neighbor-Interchange (NNI) was used for heuristic tree searching. All sites including those containing gaps were used for the analysis.
2.16 | Identifying responsible chromosomes of "Blue," "Yellow," "Spot" and "Red cocoon" phenotypes in BC 1 individuals BC 1 individuals were phenotyped for one of four morphological traits: "Blue," "Yellow," "Spot" and "Red cocoon." The genetic markers designed for scaffold linkage analysis were utilized in segregation analysis (see Table S5). A DNeasy Blood and Tissue kit (Qiagen) and MightyAmp DNA Polymerase Version 3 (TaKaRa) was used for DNA extraction and genomic PCR of BC 1 individuals, respectively.
The genomic PCR programme was as follows: 2 min at 98°C and 40 cycles of 10 s at 98°C, 15 s at 60°C and 1 min at 68°C.

| Overview of Samia ricini genome assembly
The final assembly of the Samia ricini genome was 450,479,495 bp long with 155 scaffolds. Because k-mer analysis (k = 31) estimated that the genome size ranges from 439,526,288 to 439,568,542 bp, we concluded that the assembled genome size of 450 Mb was reasonable. The N50 length of the assembly was ~21 Mb (Table 1).
GC content was 34.3%. The longest scaffold length was ~34 Mb.
Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis using busco version 3.0.2 with insecta odb9, including 1,658 BUSCOs from 42 species, revealed that 97.9% of BUSCOs were completely detected in the assembled genome (1,615, complete and single-copy; eight, complete and duplicated) among 1,658 tested BUSCOs (see Table S4). To the best of our knowledge, these statistic scores are the best among currently available lepidopteran genome assemblies (Challis et al., 2016;Kim et al., 2018;Triant et al., 2018). Low heterozygosity in the S. ricini strain used for this project might be the key to the successful assembly: k-mer distribution analysis (k = 31) estimated that heterozygosity in one male individual of S. ricini was 0.0469 ± 0.0003% (Table 1, see Figure S2), considerably lower than the estimated heterozygosity (0.807%) of a male individual of Ay-7, an inbred line in Antheraea yamamai, which belongs to the same family ( Figure S2; Kim et al., 2018) Figure S1b), which is consistent with a previous report (Yoshido et al., 2011) where BAC-FISH (bacterial artificial chromosomes-fluorescence in situ hybridization) was conducted and concluded that S. ricini has 13 autosomes and one Z chromosome (male: 2n = 28, female: 2n = 27). These 35 scaffolds totalled 443,618,927 bp, meaning that ~98.5% of the genome was assigned to the chromosomes (  (Table 1). Except for "unclassified" repeats, long interspersed nuclear element (LINE) is the largest superfamily of repetitive sequences in S. ricini (Figure 2). Interestingly, although the total length of LINE and its proportion of all repetitive sequences in the genome were similar between S. ricini and B. mori (Figure 2), the components of LINE families were different. Table S8 shows the copy number of each LINE family in S. ricini and B. mori genomes.

| Gene prediction and comparative genome analysis
To maximize the utility of the S. ricini genomic resource for genetic research, we opted to perform gene prediction on a soft-masked ge-  (Table S9). The estimated completeness of the annotation is slightly lower than that of the genome (Table S4) and implies some limitation with gene prediction pipelines. This relationship between annotation and assembly completeness is not uncommon and in this study was also the case for B. mori (Table S4) long-read sequencer will also contribute to more precise gene models (Sharon et al., 2013).
interproscan (Finn et al., 2017) analysis shows that the Reverse transcriptase domain (IPR 000477) and Integrase catalytic core domain (IPR001584) are the two most well-represented domains in S. ricini genes ( Figure S3). As the gene prediction on soft-masked genome did not prohibit but just penalized the prediction within the repetitive regions, this result may reflect the large amount of retrotransposable elements (SINE, LINE, long terminal repeat (LTR) in  Table S10). Of 1,586 S. ricini-specific genes, 873 were not given any GO term annotation (Table S10). Of 205 S. ricini-specific OGs, 46 are related to retrotransposable elements ( Figure 2; Table S10).
Among the three subfamilies, the high-cysteine (Hc) chorion is considered to play an important role in embryonic diapause, because Hc chorion proteins increase the hardness of eggshells for embryos to survive diapause in the winter (Rodakis & Kafatos, 1982).
Interestingly, according to the blast search and phylogenetic analysis, Hc chorion protein genes seemed to be absent in the S. ricini genome (Figure 3c; Tables S6 and S11). Moreover, average cysteine contents of 80 chorions of S. ricini were ~6.00%, whereas those of Hc class chorions of B. mori were ~27.5% (Table S12). Given that S. ricini is a nondiapause species, it is plausible that S. ricini lacks Hc A 0 A 1 9 4 P U S 0  A 0 A 1 9 4 P T 4 6 X M _ 0 1 1 5 6 5 3 6 8 .

| Fibroin and sericin
Fibroin is the major component of silk protein. Although fibroin of B. mori consists of three polypeptides, namely heavy-chain (Fib-H), light-chain (Fib-L) and fibrohexamerin (p25) (Inoue et al., 2000), it was biochemically confirmed that fibroin of S. ricini lacks Fib-L and p25 and it consists of Fib-H/Fib-H homodimer (Tamura & Kubota, 1988).The complete amino acid sequence of Fib-H (SrFib-H) was determined by Sezutsu and Yukuhiro (2014), but our gene prediction was unable to properly construct the gene model for SrFib-H, mainly because of its repetitive sequences. However, a tblastn search using SrFib-H as query detected the near-complete coding sequence of SrFib-H ( Figure S4a), supporting the accuracy of the assembly. The genome information also revealed that the S. ricini genome has three copies of p25, in addition to Fib-H, but lacks Fib-L (Table 3). In addition, we confirmed that Fib-L is absent in the genome of A. yamamai (Kim et al., 2018), another saturniid moth, through tblastn searches ( Figure S4b).
Because other lepidopteran species, including B. mori, Plutella xylostella, Papilio xuthus and Corcyra cephalonica (Chaitanya & Dutta-Gupta, 2010), possess the Fib-L gene, the absence of Fib-L in saturniid moths can be ascribed to the loss of Fib-L in the common ancestor of Saturniinae.
As described above, silk fibroin of B. mori consists of H-chain, L-chain and P25. The three fibroin polypeptides assemble with a 6:6:1 molecular ratio, which is considered to be indispensable for proper secretion of fibroin: mutations in Fib-H or Fib-L cause fibroin secretion deficiency (Inoue et al., 2000;Ma et al., 2014). Bombyx mori strains with deletions in Fib-H or Fib-L cannot properly secrete fibroin protein to the lumen in silk glands, and their cocoons are mainly composed of sericin. Therefore, it has been hypothesized that B. mori has a mechanism which recognizes the three-dimensional structure of fibroin assembled by the three polypeptides with 6:6:1 molecular ratio and selectively transports the fibroin polypeptide complex to the lumen in silk glands (Inoue et al., 2000). As saturniid species lack the Fib-L gene, the fibroin transportation and secretion system in saturniid species must be different from that in B. mori.
Thus far, the biological function of p25 remains unclear. Whether knockout of p25 affects the secretion of fibroin or not remains to be answered. Because p25 protein is undetectable in S. ricini silk, p25 could take on a different function other than being the part of complex structure of fibroin. The presence of multicopies of p25 in the S. ricini genome raises the possibility of functional differentiation among paralogous p25s (Table 3).
Sericin occupies the second largest proportion of silk protein, following fibroin. Unlike fibroin, sericin is water-soluble and makes up the outermost layer of silk. Bombyx mori has three sericin genes, Ser1, Ser2 and Ser3 (Tsubota et al., 2015). While Ser1 and Ser3 are components of cocoon protein, Ser2 is not present in cocoon (Takasu et al., 2010). Two proteins derived from alternative splicing of Ser2 can be found in larval silk produced during the growth stages (Takasu et al., 2010). To date, nine transcripts are registered on NCBI GenBank as Sericin-encoding genes or Sericin-like genes in S. ricini (Table S7; Dong et al., 2015;Tsubota et al., 2015).

| Identification of the chromosomes responsible for larval and cocoon phenotypes in Samia ricini
To examine the feasibility of identifying trait-related genes in S. ricini, we initiated forward genetic analysis using S. ricini and S. c. pryeri. Some morphological traits are different between S. ricini and S. c. pryeri (Figures 1 and 4a), and thus such traits may be good targets for forward genetic analysis. As shown in Figure 4b,c, the phenotypes originally derived from S. c. pryeri were isolated in backcross generation 1 (BC 1 ) individuals which were obtained by crossing (S. ricini × S. c. pryeri) × S. ricini. Here, we tried to identify the responsible chromosomes for four phenotypes, namely "Blue," "Yellow," "Spot" and "Red cocoon." "Blue" and "Yellow" refer to a blue and yellow larval integument, respectively. "Spot" refers to black spots on the larval integument. The "Red cocoon" phenotype literally illustrates the colour of cocoons which some BC 1 individuals produce.
Because meiotic recombination does not occur in lepidopteran females, all chromosomes of the BC 1 individuals should be S. ricini-S. c. pryeri heterozygotes or S. ricini-S. ricini homozygotes, and not chimeric. Given that the above-mentioned four phenotypes derived from S. c. pryeri are dominant, the responsible chromosomes should be heterozygous in all BC 1 individuals.
Genomic PCR with chromosome-specific markers, which can molecularly distinguish S. ricini and S. c. pryeri, revealed that Chr. 8, 13, 3 and 12 were uniformly heterozygotic in all examined "Blue," "Yellow," "Spot" and "Red cocoon" individuals, respectively, linking the causal loci for these traits to those chromosomes ( Figure S6). This is the first report to demonstrate that forward genetic analysis is achievable in S. ricini (and S. c. pryeri). Although the genes responsible for the four phenotypes have not yet been identified, this is the first step toward that goal.

| CON CLUS ION
Here we have reported a high-quality genome sequence of Samia ricini, which shows reciprocal correspondence at the chromosome scale to the B. mori genome, and forward genetic analyses of specific traits in S. ricini were shown to be feasible. We successfully identified the chromosomes responsible for certain traits. We are anticipating that this report will pave the way for "forward genetics of wild silkmoth."

ACK N OWLED G EM ENTS
We thank Dr Masahiro Kasahara for valuable advice on the assembly strategy, and Dr Ken Sahara for helpful discussions. Toru Shimada https://orcid.org/0000-0002-5791-0000