Genomic insights into the chromosomal elongation in a family of Collembola

Collembola is a highly diverse and abundant group of soil arthropods with chromosome numbers ranging from 5 to 11. Previous karyotype studies indicated that the Tomoceridae family possesses an exceptionally long chromosome. To better understand chromosome size evolution in Collembola, we obtained a chromosome-level genome of Yoshiicerus persimilis with a size of 334.44 Mb and BUSCO completeness of 97.0% (n = 1013). Both genomes of Y. persimilis and Tomocerus qinae (recently published) have an exceptionally large chromosome (ElChr greater than 100 Mb), accounting for nearly one-third of the genome. Comparative genomic analyses suggest that chromosomal elongation occurred independently in the two species approximately 10 million years ago, rather than in the ancestor of the Tomoceridae family. The ElChr elongation was caused by large tandem and segmental duplications, as well as transposon proliferation, with genes in these regions experiencing weaker purifying selection (higher dN/dS) than conserved regions. Moreover, inter-genomic synteny analyses indicated that chromosomal fission/fusion events played a crucial role in the evolution of chromosome numbers (ranging from 5 to 7) within Entomobryomorpha. This study provides a valuable resource for investigating the chromosome evolution of Collembola.


Background
Chromosome size is a fundamental characteristic of a genome as it reflects the size of the DNA molecule that carries a vast amount of genetic information [1].Notably, chromosome size exhibits significant variations both among different species and within the same species.For instance, Collembola, recognized as the most diverse and abundant group of soil arthropods, have been studied through karyological research to investigate their chromosomal evolution [2,3].These studies use chromosome-banding techniques to provide important insights into chromosome number and size characteristics [3][4][5].Previous research has revealed that chromosome numbers in collembolan species vary between 5 and 11, and this observation is also supported by publicly available genomic data [3][4][5][6][7][8].Furthermore, Collembola exhibit significant variations in chromosome size among different species.Most species demonstrate a relatively consistent distribution of chromosome lengths, with the longest chromosome comprising approximately one-fifth of the total genome length [7,8] (electronic supplementary material, table S1).However, the early study on collembolan karyotypes identified an exceptionally large chromosome in the species Tomocerus minor (Tomoceridae) [3].More recently, the chromosome-level genome assembly of Tomocerus qinae was reported, revealing the presence of a large chromosome (TqinChr1, approximately 140 Mb) that constituted nearly one-third of the genome size (approx.334 Mb) [6].Although these evolutionary studies have provided valuable collembolan genome resources, the mechanisms underlying the alteration of chromosome size remain unclear and require further exploration.
Changes in chromosome size can result from various processes, including chromosome rearrangements (such as fusions, and fissions), transposable element proliferation and gene duplication [9,10].Chromosomal fusion/fission is usually considered a significant force driving chromosome size changes among these factors.Chromosome size can increase through the fusion of two chromosomes at their telomeres, followed by the loss of one of the centromeres [11,12].By contrast, decreases in chromosome size can occur through simple chromosome fission in the centromere region [13].For example, human chromosome 2 is the result of the head-to-head fusion of two ancestral ape chromosomes [14,15].Chromosomal fission/fusion events are common evolutionary changes and have been reported in several insect groups [16,17].In Heliconius butterflies, it is believed that ten pairs of the 31 ancestral chromosomes underwent fusion events, resulting in 21 chromosomes and a significant increase in chromosome size [18,19].The proliferation of transposable elements (TEs) can also rapidly increase chromosome size.TEs, also known as 'jumping genes', are DNA sequences that move from one place to another within a genome, cutting and pasting themselves or inserting those copies into different genomic positions [20,21].TEs replicate and insert themselves into new locations within the chromosome, resulting in an increase in the chromosome length.Previous research published two chromosome-level genomes of Folsomia candida, a parthenogenetic strain (FCDK, 219.08 Mb) and a sexual strain (FCSH, 153.09Mb), and found that the seven chromosomes of FCDK are 25%-54% larger than their corresponding chromosomes in FCSH due to transposable element proliferation [7].
Gene duplication is another crucial evolutionary mechanism that plays a vital role in the evolution of genome complexity [22].Two primary mechanisms cause gene duplications: large-scale duplication (LSD) and small-scale duplication (SSD) events [23].LSD involves the duplication of a significant number of genes or genomic regions, including large chromosomal segments and even whole-genome duplication (WGD) [24].This process leads to the duplication of numerous genes, resulting in additional copies of genetic material being inserted into the genome and ultimately leading to an increase in chromosome size [22].In contrast, LSD involves the duplication of a limited number of genes or genomic segments within the genome, providing a mechanism contributing to genetic innovation and phenotypic diversity [25,26].
To enhance our understanding of the evolution of chromosome size in Collembola, we generated a high-quality chromosomelevel genome of Yoshiicerus persimilis, which combined PacBio long reads and Hi-C data.Surprisingly, both Y. persimilis and T. qinae have exceptionally large chromosomes, making up approximately one-third of their genome sizes.To further investigate this observation, we conducted annotations of repeat elements and protein-coding genes, as well as orthology inference and gene evolutionary analyses.Additionally, we analysed the interspecific chromosomal macrosynteny among Y. persimilis, T. qinae, FCDK and Sinella curviseta.

Material and methods (a) Sample collection and sequencing
The original population of Y. persimilis was collected in June 2017 on the Purple Mountain (32.056°N, 118.83°E,Nanjing, China).Samples were collected using an entomological aspirator, washed with phosphate-buffered saline and then immediately preserved in liquid nitrogen.We prepared 200, 50, 20 and 100 adult individuals for PacBio, Illumina whole-genome, Illumina transcriptome and Hi-C sequencing, respectively.To identify the X chromosome, we performed BGI (BGI Genomics, Shenzhen, China) whole-genome sequencing on a single male specimen.Initially, we determined the male individuals based on their morphology, as male entomobryomorphan individuals typically have slender, smaller bodies.Subsequently, we validated our observations through molecular analyses.
DNA was extracted and purified from a single male individual of Y. persimilis using QIAamp DNA Micro Kits (Qiagen).Due to the low DNA yield (less than 200 ng), which did not meet the minimum requirement of 500 ng for sequencing library preparation, genomic DNA was amplified using the REPLI-g UltraFast Mini Kit (Qiagen).Final libraries were sequenced using the BGI MGISEQ-2000 platform (Shenzhen, China) with a 150 bp paired-end cycle kit.
Genomic DNA for PacBio and Illumina sequencing was isolated using the Blood & Cell Culture DNA Mini Kit (QIAGEN GmbH, Shanghai, China).A 30 kb long-read library was created and sequenced using the SMRTbell Template Prep Kit 1.0-SPv3 (Pacific Biosciences of California, Menlo Park, USA) on the PacBio Sequel II platform.A library of 150 bp paired-end reads with a 350 bp insert size was constructed for Illumina sequencing using the TruSeq DNA PCR-free kit (Illumina, USA).Hi-C sequencing was carried out by digesting extracted DNA with the Mbol restriction enzyme.Genomic RNA was extracted using TRIzol Reagent, and the TruSeq RNA v2 kit was used to construct the library.All short libraries were sequenced on the Illumina NovaSeq 6000 platform, which was provided by Berry Genomics (Beijing, China).

(b) Genome assembly and annotation
Raw Illumina sequence data were subjected to quality control using BBTools v. 38.67 ([27]; https://sourceforge.net/projects/bbmap/).Duplicate reads were removed using the 'clumpify.sh'script.The 'bbduk.sh'tool was employed to trim sequences with quality scores lower than 20, filtre out sequences containing more than 5 Ns and reads shorter than 15 bp, trim poly-A/G/C tails longer than 10 bp and correct overlapping paired reads.To estimate the size in the Y. persimilis genome, we conducted a genome survey using GenomeScope v. 2.0 [28].
De novo assembly of PacBio long reads was performed using Raven v. 1.6.0[29] with a k-mer length of 15 without polishing (-k 15 -p 0).The assembly was then polished with one round of long reads using Flye v. 2.8.3 [30] and two rounds of Illumina short reads using Next-Polish v. 1.3.1 [31].Heterozygous regions were removed via Purge_Dups v. 1.2.5 [32] with a 70% cut-off for identifying contigs as haplotigs.Minimap2 v. 2.23 [33] was used as the read mapper for all redundancy removal and polishing steps.Hi-C reads were aligned to the assembly using Juicer v. 1.6.2[34].Primary contigs were anchored into chromosomes using 3D-DNA v. 180922 [35].To identify potential errors, Hi-C heatmaps were manually inspected and corrected using Juicebox v. 1.11.08 [34].Potential contaminant sequences were identified using MMseqs2 v. 11 [36] against the NCBI nucleotide (nt) and UniVec databases.
We mapped short reads sequenced from a male individual to the chromosome-level assembly to identify the X chromosome using Minimap2.The coverage of each chromosome was calculated using SAMtools v. 1.9 [37].As Tomoceridae has an XO sex-determination royalsocietypublishing.org/journal/rspb Proc.R. Soc.B 291: 20232937 system, we expected the sequencing depth of X to be approximately half of that of autosomes in males.Two methods were used to evaluate the quality of the final genome assembly.First, the completeness of the assembly was assessed using Benchmarking Universal Single-Copy Orthologs (BUSCO) v. 3.0.2[38] with the 'arthropoda_odb10 00 database (n = 1,013).Second, the mapping rate was calculated by mapping the clean reads from both Illumina and PacBio sequencing to the final assembly using Minimap2 and SAMtools.
Homology searching and de novo prediction were used to mask repetitive sequences in the genome of Y. persimilis.The de novo repeat library was constructed using RepeatModeler v. 2.0.3 [39], which included the new LTR discovery pipeline (-LTRStruct), and we then combined the de novo library with the Dfam 3.5 [40] and RepBase-20181026 databases [41] to create a custom library.The custom library was used to identify repetitive elements with RepeatMasker v. 4.1.2-p1[42].To estimate the abundance and divergence for each TE family of five collembolan genomes (Y.persimilis, T. qinae, FCDK, FCSH and S. curviseta), we used the calcDivergenceFromalign.pl script from RepeatMasker to calculate the Kimura divergence value.Additionally, we used the createRepeatLandscape.plscript within RepeatMasker to generate a repeat landscape.

(c) Intra-genomic synteny
To evaluate intra-genome collinearity, we used an MMseqs2 Blastp-like search with three rounds of iteration and an e-value of 1×10 −5 to identify homologous gene pairs.MCScanX [65] was used to identify synteny blocks with an e-value threshold of 1×10 −5 .The tool Dupli-cate_gene_classifier from MCScanX, effectively classifies the origins of genes into segmental (match genes in syntenic blocks), tandem (continuous repeat), proximal (in the nearby chromosomal region but not adjacent), dispersed (other modes than segmental, tandem and proximal) duplications and singletons [65].TBtools v. 1.09852 [66] was used to visualize gene density, GC content, repeat content and synteny on individual pseudochromosomes.

(d) K S distribution analyses
To conduct an in-depth investigation into the chromosomal expansion history of Y. persimilis and T. qinae and to determine whether it occurred in the ancestor of Tomoceridae or independently in the two species, we analysed the distribution of synonymous distances (K S ) in paralogous families within ElChr, and duplicated regions of ElChr (i.e.regions where major segmental duplication occurred, highlighted by the red circle in figure 1b).Additionally, we examined the one-versus-one ortholog K S distributions of the Y. persimilis and T. qinae genomes.The K S distribution of coding sequences for the two species was constructed by wgd v. 1.1.1[67].For each gene family, a coding sequence alignment was inferred by MUSCLE v. 3.8.31[68].The K S values for all gene pairs in the family were estimated through maximum likelihood estimation using the codeml package in PAML v. 4.9j [69].FastTree v. 2.1 [70] was used to construct a phylogenetic tree and node-weighted empirical K S distributions with default parameters.We employed the Wilcoxon rank-sum test ('wilcox.test'in R v. 4.2.2) to investigate potential differences between the K S distributions of ElChr and duplicated regions within ElChr of Y. persimilis and T. qinae.

(e) Inter-genomic synteny
To reveal chromosomal evolution among Y. persimilis, T. qinae, FCSH and S. curviseta, MMseqs2 was used to identify orthologs gene pairs with one round of iteration and an e-value cut-off of 1×10 −5 .MCScanX was used to identify synteny blocks with a homologous block that includes five or more genes with no rearrangements ('-b 2 -s 5 0 ).

(f ) Gene orthology and dN/dS analysis
The GC content displays considerable variation across distinct regions of the genome, with lower GC content frequently associated with gene duplication events [71].Additionally, shorter genes have a greater probability of being retained following gene duplication compared to longer genes [72].To investigate the relationship between GC content and gene length across different genomic regions, we analysed the GC content (counted every 10 kb window) and gene length of the whole genome, ElChr, duplicated and conserved regions of ElChr in both genomes.
To investigate the selection pressure on gene families in different genomic regions of Y. persimilis and T. qinae, we calculated the dN/dS ratio for gene families on ElChr (including duplicated and conserved regions), and other chromosomes in both species.Orthogroups (gene families) in both genomes were identified using OrthoFinder v. 2.5.2 [73] based on their protein-coding genes, with the longest isoform selected for each gene.For sequence alignment, we used Diamond in the ultra-sensitive mode (-S diamond_ultra_sens).The amino acid sequences of each family were aligned using MAFFT v. 7.487 [74] with the L-INS-I strategy.Coding sequences were aligned based on their amino acid alignments and simultaneously trimmed using trimAl v. 1.4.1 [75] to remove unreliable homologous regions with the 'auto-mated1 0 heuristic method.Individual gene trees were inferred using IQ-TREE v. 2.1.3with the LG site-homogeneous model [76].We used codeml within the PAML package with model = 0 and NSsites = 0 to calculate the dN/dS ratio of gene families in the aforementioned royalsocietypublishing.org/journal/rspb Proc.R. Soc.B 291: 20232937 regions.To evaluate the significance of the average values of GC content, gene length and the dN/dS ratio across different chromosome regions, we utilized the Wilcoxon rank-sum test.

(g) Enrichment analysis of duplicated genes on elChr of Y. persimilis and T. qinae
To explore the functions of the genes within the duplicated regions of ElChr of Y. persimilis and T. qinae, we conducted Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses.These analyses were performed using the R package clusterProfiler v. 3.14.3[77] with default values.

Results (a) Genome assembly and annotation
We generated a total of 152.36 Gb of raw data, including 43.86 Gb of PacBio (139x), 53.98 Gb of Illumina (171x), 34.82 Gb of Hi-C (110x), 13.26 Gb of BGI (42x) and 6.44 Gb of transcriptome reads (electronic supplementary material, table S2).The final genome assembly had a size of 315.79 Mb, comprising 422 scaffolds with scaffold N50 length of 46.22 Mb and a GC content of 34.52%.More than 95% of the genome was anchored to six chromosomes, with chromosome lengths ranging from 21.49 Mb to 109.05Mb.Notably, chromosome 1 was identified as an exceptionally large chromosome (ElChr), with a size of 109.05Mb (electronic supplementary material, figure S1).The assembly size closely matched the estimated genome size (314.63Mb) obtained from GenomeScope.The sequencing coverage for each chromosome was computed and is presented in electronic supplementary material, table S4.A particular chromosome displayed approximately half the coverage compared to the other chromosomes, we inferred that chromosome as the X chromosome.The average sequencing coverage for the X chromosome was 27x, which is significantly lower than that for the other autosomes (42x).
The TE landscapes in all five collembolan genomes exhibited an L-shaped distribution (figure 2c), indicating a recent transposition burst characterized by a predominance of young and active copies.Specifically, in the genomes of Y. persimilis, T. qinae and S. curviseta, there were clear signs of transposition bursts for LTR retrotransposons, with distance values below 10%.Two Folsomia candida genomes showed indications of transposition bursts for DNA retrotransposons, with distance values below 10%.Furthermore, the genomes of Y. persimilis, T. qinae and S. curviseta exhibited transposition bursts for TE elements, with distance values ranging from 30% to 40%.Notably, Y. persimilis and T. qinae specifically showed transposition bursts for LINE retrotransposons, while in S. curviseta, the transposition bursts were primarily associated with LTR retrotransposons (figure 2c).
A total of 19 653 protein-coding genes were predicted in Y. persimilis, which is slightly lower than the number of protein-coding genes in other entomobryomorphan genomes (greater than 20 000) (electronic supplementary material, table S6).The average lengths of the genes, exons, and introns were 4551.8bp, 268.4 bp and 461.9 bp, respectively, with an average of 6.9 exons and 5.9 introns per gene.The BUSCO assessment showed that 97.9% of the 1,013 reference genes were complete.Gene functional annotation identified significant hits with proteins in the UniProtKB database for 15 694 genes (79.86% of all genes).Additionally, GO terms, KEGG ko terms, Enzyme Codes, KEGG pathways, Reactome pathways and COG categories were assigned to 13,976, 10,960, 4,732, 12,772, 14,079 and 15 238 genes, respectively.

(b) Intra-genomic synteny
Our analysis revealed 44 syntenic blocks (503 genes) and 56 syntenic blocks (706 genes) within the genomes of Y. persimilis and T. qinae, respectively.Notably, only five and seven syntenic blocks were inter-chromosomal in Y. persimilis and T. qinae,     respectively.In Y. persimilis, the majority of intra-chromosomal syntenic blocks (34) were clustered within 40-109 Mb of ElChr, while most intra-chromosomal syntenic blocks (44) were located in 0-80 Mb of ElChr in T. qinae (figure 1b).We designated these percentages as duplicated regions, which were highlighted with a red circle, while the remaining regions of ElChr were designated as conserved regions (figure 1b).The genome of both Y. persimilis and T. qinae exhibits a significant percentage of dispersed duplications, with 50.31% and 48.43%, respectively.Tandem duplications and segmental duplications make up a percentage of 17.71% and 2.62% in Y. persimilis, and 15.51% and 3.48% in T. qinae, respectively.However, the percentage of tandem duplications and segmental duplications in ElChr of Y. persimilis (20.45% and 5.07%) and T. qinae (20.39% and 6.71%) was significantly higher compared to the remaining chromosomal regions in Y. persimilis (Chr2-ChrX, 15.43% and 1.15%) and T. qinae (Chr2-Chr5, 12.04% and 1.18%).Furthermore, within the duplicated regions of ElChr, the percentage of tandem duplications and segmental duplications is further increased in Y. persimilis (23.96% and 8.06%) and T. qinae (26.59% and 11.51%), which higher that in the conserved regions within ElChr of Y. persimilis (15.52% and 0.87%) and T. qinae (15.49% and 2.91%) (figure 2a).These results suggest that tandem and segmental duplication events may have significantly contributed to the elongation of ElChr in Y. persimilis and T. qinae.

(c) K S distribution
The presence of large-scale duplication events can be revealed by the K S distributions with peaks.The number of synonymous substitutions per synonymous site, denoted by the abscissa K S , can be utilized as a proxy for the age of paralogues or orthologs.
The K S distribution of paralogous genes in different genomic regions of Y. persimilis and T. qinae and the K S distribution of one-toone orthologs in these species were shown in figure 3a-c.Notably, a distinct bump between 0.2 and 0.3 is observed in ElChr and duplicated regions within ElChr of both species, highlighted in red.To further explore the relationship between the K S distributions of ElChr and duplicated regions within ElChr of Y. persimilis and T. qinae, we compared the average K S values.In Y. persimilis, the average K S in the ElChr was 1.65, which is higher than that in the duplicated regions within ElChr (1.57) ( p = 0.006988).Similarly, in T. qinae, the average K S in the ElChr was 1.66, higher than that in the duplicated regions within ElChr (1.56) ( p = 0.0004684).The results indicate that the duplicated genes are relatively young in age.Furthermore, an additional significant peak is observed between 1 and 1.5 in the K S distribution of Y. persimilis and T. qinae genomes.We observed that most syntenic blocks occurred in genomic regions with high TEs density (figure 1a).To analyse the TE content in different genomic regions, we calculated the proportions of major TE groups (LTR, LINE and DNA) in the whole genome, ElChr (duplicated regions and conserved regions of ElChr) and Chr2-ChrX in Y. persimilis, as well as in T. qinae.SINE content was not included due to its low proportion in the genome (less than 0.1%).Our results revealed significant variation in TEs among genomic regions and even within the same chromosome.For instance, in Y. persimilis, the TE content ranged from approximately 5% in Chr2-ChrX to more than 9% in ElChr, with the highest TE content (approx.12%) observed in the duplicated regions of ElChr, which was greater than that in the conserved regions of ElChr (5%).Similarly, in T. qinae, the TE content varied from around 6% in Chr2-Chr5 to over 11% in ElChr, with the highest TE content (approx.13%) observed in the duplicated regions of ElChr (figure 2b).Our analysis of genomic regions in Y. persimilis and T. qinae revealed a broad distribution of TE families.Notably, DNA transposon families exhibited remarkable diversity, with 19 out of the 40 identified TE families belonging to this category.The most abundant DNA transposons identified in Y. persimilis and T. qinae were Maverick, hAT (Blackjack, Charlie, etc.), MULE, TcMar and CMC.Furthermore, L2, RTE, and Penelope were the LINEs with the largest proportions, while the three LTR retrotransposon types with the largest proportions were Gypsy, Pao and Copia.Notably, our analysis revealed that the TE families Gypsy, Pao, L2, RTE, hAT and Maverick exhibited significant expansions in the duplicated regions of ElChr in both genomes (figure 3d).

(e) Inter-genomic synteny
We observed high synteny of sex chromosome X between the Y. persimilis and S. curviseta genomes, which was homologous to Chr3 of T. qinae and Chr7 of FCSH, respectively.We also detected some fusion and fission events between autosomes of these four entomobryomorphan genomes.Both chromosomes 4 and 5 of Y. persimilis exhibited a perfect match with the corresponding chromosomes of T. qinae.Additionally, both chromosomes 2 and 3 of Y. persimilis were fused to form Chr2 of T. qinae.Chr2, Chr3, Chr4 and Chr5 of Y. persimilis were homologous to Chr4, Chr3, Chr1 and Chr2 of S. curviseta, respectively.In Y. persimilis, the region spanning 0-40 Mb on chromosome 1 of Y. persimilis was found to be split into parts of chromosomes 1 and 5 of S. curviseta.The Chr2-Chr5 of S. curviseta were syntenic with the corresponding chromosomes of FCSH.The Chr1 of S. curviseta was mainly syntenic with Chr1 and Chr6 of FCSH (figure 1c).

(f ) Gene orthology and dN/dS
We investigated the relationship between GC content and gene length in the different genomic regions of Y. persimilis and T. qinae.The average GC contents in the whole genomes of Y. persimilis and T. qinae were 34.51% and 34.42%, respectively, which were higher than those in ElChr (33.93% and 34.02%, respectively; figure 4a,b; p < 2.2×10 −16 ).Moreover, the average GC contents in the duplicated regions of ElChr for Y. persimilis and T. qinae were 33.67% and 33.52%, respectively, which were lower than those in the conserved regions of ElChr (34.38% and 34.66%, respectively; figure 4a,b; p < 2.2×10 −16 ).Additionally, the average gene lengths in the whole genomes of Y. persimilis and T. qinae were 4551.82 and 6083.28 bp, respectively, which were higher than those in ElChr (3718.15 and 5319.25 bp, respectively; figure 4c,d; p < 2.2×10 −16 ).Furthermore, the average gene lengths in the duplicated regions of ElChr for Y. persimilis and T. qinae were 3220.32 and 4789.02bp, respectively, which were lower than those in the conserved regions of ElChr (4417.56 and 5738.72 bp, respectively; figure 4c,d; p = 4.429×10 −12 , respectively).One possible explanation for these findings is that shorter genes are more likely to be retained than longer genes after gene duplication because shorter genes require fewer resources and less energy to replicate and can sustain genetic changes without experiencing significant changes in overall function [72].
We estimated dN/dS for gene families in the corresponding ElChr (duplicated regions and conserved regions) and other chromosomes of Y. persimilis and T. qinae.Based on the orthologous assignment, ElChr and other chromosomes of these two species share 3567 (2488 for conserved regions, 1079 for duplicated regions) and 8026 orthologous groups, respectively.The average dN/dS ratio for gene families on ElChr is 0.07550, which is significantly higher than that on other chromosomes (0.06030; figure 4e; p < 2.2×10 −16 ).In addition, the average dN/dS ratio for gene families in the duplicated regions of ElChr (0.07935) was also higher than that in the conserved regions of ElChr (0.06296) of Y. persimilis and T. qinae (figure 4f; p = 1.043×10 −6 ).The gene pair is considered to be under 'purifying selection' when dN/dS is less than 1 [78].A higher dN/dS ratio observed in gene families within the duplicated regions of ElChr indicates a decreased number of eliminated substitutions and a lower level of selective constraint.This suggests a reduction in purifying selection within these gene families.

(g) GO and KEGG pathway enrichment analyses
We conducted enrichment analyses to further investigate the functions of these genes within the duplicated regions of ElChr of Y. persimilis and T. qinae, respectively.The GO enrichment analysis within the duplicated regions of ElChr of both genomes indicated that these genes were associated with proliferation and growth, including transmembrane receptor protein tyrosine kinase activity, vascular endothelial growth factor-activated receptor activity and positive regulation of MAPK cascade, etc. (electronic supplementary material, figure S2a,b).In Y. persimilis, the KEGG pathway enrichment analysis identified significant functions related to Brassinosteroid biosynthesis (electronic supplementary material, figure S2c).In T. qinae, both KEGG and GO enrichment results showed important functions involved in various growth-related pathways (e.g.MAPK signalling pathway, PI3K−Akt signalling pathway, etc.; electronic supplementary material, figure S2d).

Discussion
In this study, we obtained a high-quality chromosome-level genome of Y. persimilis.Interestingly, both tomocerid genomes (Y.persimilis and T. qinae) have an unusually large chromosome (greater than 100 Mb), which accounts for almost one-third of the genome [6].This finding confirmed the result of a previous karyotype study [3].

(a) Chromosome structure and number evolution in Entomobryomorpha
Our genome synteny analysis revealed strong synteny in the X chromosome among entomobryomorphan genomes (figure 1c), which is consistent with previous research indicating a high level of conservation of the X chromosome across insect species [79].However, we also detected instances of fusion and fission events between the autosomes of these four genomes.For example, the conserved regions of ElChr of Y. persimilis and T. qinae can be attributed to the fusion of chromosomes 1 and 5 present in S. curviseta.Our findings suggest that the observed elongated chromosome in Y. persimilis and T. qinae may have originated from a fusion event involving chromosomes in their common ancestor, marking the initial stage of the elongation process.This finding aligns with the previous literature reported by Hemmer [3].Furthermore, following a prolonged period of genetic stability, this elongated chromosome underwent expansions in specific lineages.In addition, Chr6 of S. curviseta is the result of fusion between Chr1 and Chr6 of FCSH, and chromosomes 2 and 3 of Y. persimilis were fused to form Chr2 of T. qinae (figure 1c).Chromosomal fission and fusion events are believed to have played a crucial role in the evolution of chromosome number in Entomobryomorpha, which typically ranges from 5 to 7 [3].However, it remains unclear whether a similar mechanism has contributed to the evolution of chromosome number in Collembola, and further investigation is necessary.Further chromosome-level assemblies  (b) Elongation in ElChr was caused by tandem and segmental duplication and transposon proliferation Our intra-chromosome synteny analysis revealed evident duplications in the genomes of Y. persimilis and T. qinae, primarily located in ElChr.Notably, a majority of the syntenic blocks (over three-quarters) were identified within this region.This study determined the specific type of duplication events that occur (WGD, LSD or SSD events).During rediploidization, it is expected that some intra-chromosomal co-linearity will emerge, which plays a crucial role in chromosomal rearrangements like translocations and fusions [80].Paralogs resulting from WGD events are typically found on different chromosomes, as observed in the WGD event of Tetraodon nigroviridis [81].However, these duplicated genes observed in both the Y. persimilis and T. qinae genomes were predominantly located on the same chromosome, which is inconsistent with a WGD event.We noticed a distinct peak between 0.2 and 0.3 of the K S distribution in the duplicated regions of ElChr for both species (figure 3a,b).Peaks in the K S distribution histograms often suggest the presence of a LSD event [82].This finding suggests the occurrence of LSD events in the ElChr of both species, and these LSD events may have contributed to the elongation of the large chromosome in Y. persimilis and T. qinae.Moreover, the lower Ks (Ks peak between 0.2 and 0.3) indicated the relatively young age of these duplicated genes compared to the genomes of Y. persimilis and T. qinae (Ks peak at approx.1.5; figure 3c).Previous research has estimated that the divergence between the Yoshiicerus and Tomocerus genera occurred approximately 53 million years ago (Mya) [83].Therefore, we speculate that chromosomal expansion occurred around 10 Mya in both Y. persimilis and T. qinae independently, rather than in a common ancestor of the Tomoceridae family.Further investigation of duplicate gene pairs in the ElChr of both genomes revealed that tandem and segmental duplications were significantly increased in the duplicated regions of ElChr, exhibiting a significantly higher percentage than the conserved regions of ElChr (figure 2a).Based on these findings, we conclude that the elongation observed in ElChr of Y. persimilis and T. qinae was caused by large tandem and segmental duplication.We found evidence suggesting that the segmental duplications in ElChr of these species were driven by the activity of TEs, which are a major source of genomic diversity in arthropods and play a crucial role in shaping genome structure and evolution [84].A rise in homologous regions is caused by hotspots of TEs, which provides an increased opportunity for homologous recombination and unequal crossing over to drive genome amplification [85].Therefore, there would be high TE activity/abundance in gene duplication regions of the genome.In the Y. persimilis and T. qinae genomes, TEs were significantly increased in the duplicated regions of ElChr (figure 2b).Furthermore, our finding is consistent with previous results in F. candida, where synteny blocks were associated with high densities of TEs [86].Our findings suggest that tandem and segmental duplication and transposable element proliferation play an important role in the elongation of ElChr in Y. persimilis and T. qinae.

Figure 1 .
Figure 1.An overview of the genomic characteristics and chromosomal synteny of Yoshiicerus persimilis and Tomocerus qinae.(a) Circos graph of the genome characteristics of Y. persimilis and T. qinae.From the inside to outside rings of the circle represents simple repeats, TEs, gene density, GC content and chromosome length.(b) Intra-chromosomal relationships of Y. persimilis and T. qinae.We highlighted the duplicated regions in red circles.(c) Genome-wide syntenic relationship among Y. persimilis, T. qinae, the sexual strain of Folsomia candida, and Sinella curviseta.We highlighted the fusion process of chromosome 1 using the colour orange, and indicating the conservation of the X chromosome with the colour red.On chromosome 1, the orange colour represents the conserved regions.

Figure 2 .
Figure 2. Numbers of genes from different origins and transposable elements comparison between the genomes of Y. persimilis and T. qinae.(a) Numbers of genes from different origins as classified by duplicate gene classifier in different genomic regions of Y. persimilis and T. qinae genomes.(b) The percentage of TEs (DNA, long interspersed nuclear elements (LINE) and LTR) in different genomic regions of Y. persimilis and T. qinae genomes.The bars denote the proportion of the corresponding family.(c) Distribution of divergence rates for transposable elements in five collembolan genomes.The y-axis represents the genome percentage of various types of TEs, while the x-axis represents the distribution of divergence (Kimura two parameters, K2P) for each TE family.DR, duplicated regions.

Figure 3 .
Figure 3.The K S frequency distribution and diversity and abundance of transposable element families of Y. persimilis and T. qinae.(a,b) K S frequency distribution graphs in different genomic regions of Y. persimilis (a) and T. qinae (b).We highlighted the distinct bump between 0.0 and 0.5 with red boxs.(c) The paranome K S distribution graphs of Y. persimilis and T. qinae are presented alongside the K S distribution of one-to-one orthologs in the two species.(d ) Diversity and abundance of transposable elements families for Y. persimilis and T. qinae.

Figure 4 .
Figure 4.The relationship between GC content, gene length and the dN/dS ratio in different genomic regions of T. qinae and Y. persimilis.Box plots display the median value as a black horizontal line within the box, with the average indicated by a black dot.(a,b) GC content in different genomic regions of (a) T. qinae and (b) Y. persimilis.(c,d) Gene lengths in different genomic regions of (c) T. qinae and (d) Y. persimilis.(e,f ) The dN/dS ratio in different genomic regions, including (e) ElChr and other chromosomes, as well as (f ) ElChr duplicated regions and ElChr conserved regions of and T. qinae and Y. persimilis.