Published October 2, 2025 | Version v3
Dataset Open

Supporting Information for "Forty New Genomes Shed Light on Sexual Reproduction and the Origin of Tetraploidy in Microsporidia"

  • 1. Wellcome Sanger Institute

Description

This is a list of the files and materials contained in this Supporting Information dataset (v3). References mentioned in this description can be found in the manuscript.

Section 1: Genomes and associated data, figures, and annotations

Table S1: Microsporidian genome assembly statistics and host meta-data.

Full list of recovered microsporidian genome assemblies, their associated meta-data, and host meta-data. Genome accessions will be added as genomes are released through ENA.

Fig. S1: Sex of hosts the microsporidian genome assemblies are derived from.

The sex of our genomes’ hosts was unknown in most cases (24 species). In the remaining cases, nine were identified as female and seven as male. A relatively equal proportion of female and male hosts are infected with Nosematida (Fig. 1), but we could not assess skews in host sex ratios for other microsporidian groups due to missing data on sex (for Amblyosporida-infected hosts), or a small sample size (for Neopereziida-infected hosts). The data underlying this figure can be found in Supporting Information Section 1 Table S1. The figure was generated using Matplotlib [92], and manually annotated using InkScape (version 1.2.2).

Fig. S2: Ploidy inference examples for three microsporidian genomes, highlighting segmental duplications. 

GenomeScope2 transformed linear plot and Smudgeplot [82] respectively for (A), (B) diploid iyCepSpin2.µ (host Cephus spinipes [Hymenoptera]); (C), (D) diploid idPhaFune2.µ (host Phania funesta [Diptera]); and (E), (F) polyploid (tetraploid or octoploid) iiMysAzur1.µ (host Mystacides azureus [Trichoptera]). Jellyfish (version 2.2.10) was used to generate the initial k-mer spectra (k = 21) [134]. Both iyCepSpin2.µ and idPhaFune2.µ have mostly diploid genomes, but carry a level of duplication that generates an identifiable “tetraploid” signal in their k-mer spectra. Similarly, the k-mer spectrum of iiMysAzur1.µ can be interpreted as either a highly homozygous tetraploid in which large segmental duplications have occurred in all four copies, leading to a detectable octoploid signal, or an octoploid genome composed of two distinct tetraploids. Such cases are common, with some level of segmental duplication observed in nearly all of the 14 polyploid genomes (refer to Supporting Information Section 1, File Collection 1). The genomes used to generate this figure can be found in Supporting Information Section 1 File Collection 12. The figure was generated using GenomeScope2 [82], and manually annotated using InkScape (version 1.2.2).

File Collection 1: K-mer histogram plots for the reads used to produce the microsporidian genome assemblies, used for ploidy estimation.

GenomeScope2 [82] plots of the reads used to produce the microsporidian genome assemblies presented in this study. Jellyfish was used to generate the k-mer spectrum for each read set (k = 21, version 2.2.10) [134].
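As a rough illustration of what a k-mer spectrum is, the following toy sketch counts canonical k-mers in a handful of reads and tabulates how many distinct k-mers occur at each multiplicity. It is not Jellyfish and is not suitable for real read sets; the function names are hypothetical, and counting canonical k-mers (a k-mer and its reverse complement treated as one) is an assumption, since the description does not state the Jellyfish options used.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_spectrum(reads, k=21):
    """Map multiplicity -> number of distinct canonical k-mers seen that many times."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for start in range(len(read) - k + 1):
            kmer = read[start:start + k]
            # Skip k-mers containing N or any other non-ACGT symbol.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return Counter(counts.values())
```

In a histogram of this kind, peaks at roughly 1x, 2x, and 4x of the base coverage are what tools such as GenomeScope2 model when estimating ploidy.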

File Collection 2: Repeat landscape across chromosome-level microsporidian genome assemblies.

Distribution of repeat annotation results (with RepeatModeler and RepeatMasker) on all chromosome-level genomes [80,81].

File Collection 3: rRNA landscape across chromosome-level microsporidian genome assemblies.

Distribution of rRNA across all chromosome-level genomes.

File Collection 4: Oxford Dot Plots for tetraploid microsporidian genome assemblies.

Oxford dot plots for tetraploid genome assemblies displaying BUSCO genes. Gene pairs that are less divergent than the same-species threshold are shown in sky blue, while gene pairs that are more divergent than the same-species threshold are shown in red.

File Collection 5: BUSCO annotation results for microsporidian genome assemblies.

BUSCO annotation full tables, generated using microsporidia_odb10 (version 5.4.6) [78].

File Collection 6: Self alignment plots for microsporidian genome assemblies.

Self-alignment dot plots for the microsporidian genome assemblies generated in this study. Each genome was aligned to itself using FASTGA (Github: https://github.com/thegenemyers/FASTGA). The plots were generated using HyraxDotPlot (v2.0) (Github: https://github.com/Amjad-Khalaf/HyraxDotPlot).

File Collection 7: Hi-C contact heatmaps for scaffolded microsporidian genome assemblies.

Hi-C contact heatmaps for scaffolded microsporidian genome assemblies visualised using PretextView [103].

Table S2: Filtering parameters used in generating genome assemblies.

Parameters used for filtering microsporidian contigs from their respective (meta-)genomic assemblies in filtering steps 1 (BlobToolKit [133]) and 2 (BubblePlot, Github: https://github.com/Amjad-Khalaf/BubblePlot). See Materials and Methods for details.

File Collection 8: Smudgeplot ploidy estimation.

Smudgeplot [82] plots of the reads used to produce the microsporidian genome assemblies for which ploidy could be estimated using GenomeScope2 [82] (Supporting Information Section 1, File Collection 1).

File Collection 9: K-mer analysis plots for the microsporidian genome assemblies.

MerquryFK plots (Github: https://github.com/thegenemyers/MERQURY.FK) for final microsporidian genome assemblies generated in this study.

File Collection 10: K-mer plots used to inform genome assembly purging.

Purge_dups [75] histogram plots used to inform genome assembly purging, with cutoffs used clearly indicated.

File Collection 11: Statistics of intermediate steps for each microsporidian genome assembly.

Scaffold/contig and read statistics for intermediate steps produced in the generation of each microsporidian genome assembly.

File Collection 12: Microsporidian genome assemblies fasta files.

Fasta files of recovered microsporidian genome assemblies. The primary assemblies listed in Table S1 are given by {Host ToLID}.µ.fasta, whereas purged haplotypic duplication sequences are given by {Host ToLID}.µ.alt.fasta where applicable. In the case of iuLoeVari1.µ, the primary assembly is the best haploid representative genome assembly possible, containing sequences across all four compartments. The diploid genome assemblies of iuLoeVari1.µ’s AB and CD compartments are given by iuLoeVari1.µ.AB.fasta and iuLoeVari1.µ.CD.fasta respectively.

File Collection 13: Repeat annotation results for microsporidian genome assemblies.

Results for repeat annotation (with RepeatModeler and RepeatMasker) on all genomes [80,81].

File Collection 14: GeneMark-ES annotation results for microsporidian genome assemblies.

GeneMark-ES [142] annotation files, generated as part of the BRAKER2 pipeline [138] with protein hints consisting of all microsporidian proteins available on UniProt [143].

Section 2: Phylogeny and associated analyses

Fig. S3: 600 Gene Phylogeny of Microsporidia. 

(A) ASTRAL [77] phylogeny summarising individual phylogenies of 600 BUSCO genes (microsporidia_odb10) [78] across all publicly available microsporidian genome assemblies (including multiple strains where they are available), and the genome assemblies generated in this study (n = 40, marked in purple). Branch lengths were estimated with IQ-TREE using a concatenated alignment of the individual BUSCOs [79]. Nodes with less than 95% support are marked with pink circles. Ploidy is marked in circles at the tips of the tree for genomes where it was characterisable. (B) Genome assembly span (Mb) as calculated by assembly-stats (Github: https://github.com/sanger-pathogens/assembly-stats), with black circles marking chromosome-level genome assemblies. (C) N50 values (Mb) as calculated by assembly-stats (Github: https://github.com/sanger-pathogens/assembly-stats), with asterisks marking purged genome assemblies. (D) BUSCO gene (microsporidia_odb10) completeness percentage, marked in green for single-copy genes, and beige for duplicated genes. (E) Transposable element percentage as predicted by RepeatModeler and RepeatMasker [80,81], marked in burgundy for retroelements, peach for DNA transposons, and blue for rolling circles. Neop.: Neopereziida; Or. Lin.: Orphan Lineage. The data underlying A can be found in Supporting Information Section 2. The data underlying B, C, D and E can be found in Supporting Information Section 1. The figure was generated using ToyTree [74], and manually annotated using InkScape (version 1.2.2).
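For readers unfamiliar with the span and N50 statistics reported in panels B and C, a minimal sketch of how they can be computed from a list of sequence lengths is given below. This is not the assembly-stats implementation; the function names are hypothetical, and the handling of the exact half-way point (`>=`) is an assumption.

```python
def assembly_span(lengths):
    """Total number of bases across all sequences in the assembly."""
    return sum(lengths)


def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the span."""
    if not lengths:
        raise ValueError("cannot compute N50 of an empty assembly")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare doubled running sum to avoid fractional arithmetic.
        if running * 2 >= total:
            return length
```

For example, an assembly with sequences of 5, 4, 3, 2, and 1 Mb has a span of 15 Mb and an N50 of 4 Mb, because the two longest sequences together cover more than half of the span.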

Text S1: Newick string of phylogeny.

ASTRAL [77] phylogeny summarising individual phylogenies of 600 BUSCO genes (microsporidia_odb10) [78] across all publicly available microsporidian genome assemblies (n = 106), and the genome assemblies generated in this study (n = 40, marked in purple). Branch lengths were estimated with IQ-TREE using a concatenated alignment of the individual BUSCOs [79]. The model chosen according to IQ-TREE’s model finder was “Q.yeast.I.G4”.

Table S3: Trait-phylogeny regression.

Transformations representing the fit between the tree’s topology (λ), branch lengths (κ), and root-to-tip distance (δ) [83] and each of three traits: number of coding sequences, transposable element load, and genome span.

Table S4: Trait correlation.

Correlations between transposable element loads and genome spans.

Table S5: Accession numbers for publicly available genomes used in this study.

On the 1st of January 2025, we downloaded all microsporidian genome assemblies available in the NCBI Genome database. This retrieved 106 genome assemblies.

Section 3: Species delineation and haplotype phylogenetics

Table S6: Branch length distances for species delineation.

Pairwise branch length distances which include one of our genomes, and can be classified to a species or a genus. The conservative branch length threshold range was defined using the shortest observed branch lengths between known same-species genomes for the lower bound (0) and the smallest distance between H. tvaerminnensis and H. magnivora genomes for the upper bound (0.012). The relaxed threshold uses the full range of observed branch lengths among known same-species genomes (excluding the H. tvaerminnensis – H. magnivora cutoff).
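The way these thresholds might be applied to a single pairwise distance can be sketched as follows. Only the conservative upper bound (0.012) comes from the description above; the relaxed upper bound is not stated here and is therefore left as a caller-supplied parameter. The function name, the returned labels, and the inclusive comparisons are all assumptions made for illustration.

```python
CONSERVATIVE_UPPER = 0.012  # upper bound of the conservative range given for Table S6


def classify_pair(branch_length, relaxed_upper=None):
    """Label a pairwise branch length against the same-species thresholds.

    relaxed_upper is a placeholder for the relaxed threshold, whose value
    is not given in this description.
    """
    if branch_length < 0:
        raise ValueError("branch lengths cannot be negative")
    if branch_length <= CONSERVATIVE_UPPER:
        return "same species (conservative)"
    if relaxed_upper is not None and branch_length <= relaxed_upper:
        return "same species (relaxed only)"
    return "different species"
```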

Script S1: Plotting histograms depicting phylogenetic branch lengths (in amino acid substitutions per site) between homeologous gene pairs for thirteen tetraploid genomes.

Python script used to extract pairwise branch lengths between homeologous gene pairs for thirteen tetraploid genomes, and plot them as histograms. Please note that the following genomes are represented by deprecated ToLIDs, which differ from the ones used in this manuscript. iyOecSmar33: idDelPlat3; iyOecSmar35: idTanUsma1; iyOecSmar39: idChiSpeb1; iyOecSmar41: idDelPlat4; and iyOecSmar44: idDelPlat5.
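When working with the script's inputs and outputs, the deprecated identifiers can be translated to the ToLIDs used in the manuscript with a small lookup such as the sketch below. The mapping is taken directly from the note above; the dictionary and function names are hypothetical and are not part of Script S1.

```python
# Deprecated ToLID -> ToLID used in the manuscript (from the note above).
DEPRECATED_TO_CURRENT = {
    "iyOecSmar33": "idDelPlat3",
    "iyOecSmar35": "idTanUsma1",
    "iyOecSmar39": "idChiSpeb1",
    "iyOecSmar41": "idDelPlat4",
    "iyOecSmar44": "idDelPlat5",
}


def current_tolid(name):
    """Return the manuscript ToLID for a name, leaving unlisted names unchanged."""
    return DEPRECATED_TO_CURRENT.get(name, name)
```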

File Collection 15: Individual BUSCO haplotype phylogenies for each tetraploid species.

BUSCO (microsporidia_odb10, version 5.4.6) (Simão et al. 2015) was run on the unpurged genome assemblies of the tetraploid genomes. For each tetraploid, the haplotypes of each BUSCO locus were aligned to one another and an outgroup using MAFFT (version 7.525) (Katoh et al. 2002), and a phylogeny was generated for each alignment using IQ-TREE (version 2.3.4, with ModelFinder enabled and 1000 bootstrap replicates) (Minh et al. 2020; Kalyaanamoorthy et al. 2017). Please note that the following genomes are represented by deprecated ToLIDs, which differ from the ones used in this manuscript. iyOecSmar33: idDelPlat3; iyOecSmar35: idTanUsma1; iyOecSmar39: idChiSpeb1; iyOecSmar41: idDelPlat4; and iyOecSmar44: idDelPlat5.
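The per-locus alignment and tree inference step could be driven from Python roughly as sketched below. This is an illustration only, not the authors' pipeline: the function names and file paths are hypothetical, the `--auto` MAFFT strategy is an assumption, the executable name `iqtree2` depends on the local installation, and `-B 1000` (ultrafast bootstrap) is one reading of "1000 bootstrap replicates" (standard non-parametric bootstrap would use `-b 1000` instead).

```python
import subprocess


def mafft_command(unaligned_fasta):
    """MAFFT command; the alignment is written to standard output."""
    return ["mafft", "--auto", unaligned_fasta]


def iqtree_command(alignment, prefix):
    """IQ-TREE 2 command with ModelFinder (-m MFP) and 1000 ultrafast bootstraps."""
    return ["iqtree2", "-s", alignment, "-m", "MFP", "-B", "1000", "--prefix", prefix]


def align_and_infer(unaligned_fasta, alignment_path, prefix):
    """Align one BUSCO locus and infer its phylogeny (requires both tools on PATH)."""
    with open(alignment_path, "w") as out:
        subprocess.run(mafft_command(unaligned_fasta), stdout=out, check=True)
    subprocess.run(iqtree_command(alignment_path, prefix), check=True)
```

Keeping the command construction separate from execution makes it easy to log or inspect the exact command line used for each locus.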

Text S2: Approximately Unbiased phylogenetic test results.

BUSCO (microsporidia_odb10, version 5.4.6) (Simão et al. 2015) was run on the unpurged genome assemblies of the tetraploid genomes. The haplotypes of each BUSCO locus were aligned to one another using MAFFT (version 7.525) (Katoh et al. 2002), and a phylogeny was generated for each alignment using IQ-TREE (version 2.3.4, with ModelFinder enabled and 1000 bootstrap replicates) (Minh et al. 2020; Kalyaanamoorthy et al. 2017). The Approximately Unbiased statistical test [90] was performed on the multi-copy BUSCO gene phylogenies for all pairwise combinations of tetraploid microsporidian genomes. A high-level summary of these pairwise tests is included in this Supporting Information text. For each listed pairwise comparison, the “+” sign indicates the number of phylogenies where haplotypes coalesce more recently than species, and the “-” sign indicates the number of phylogenies where species coalesce more recently than haplotypes.

File Collection 16: Age distributions of duplicate gene pairs.

Output files generated by wgd for diploid genomes shown in Fig. 7.

Fig. S4: Comparison of whole-genome phylogeny species delineation thresholds and individual gene phylogeny branch length distribution species delineation thresholds.

The approach we presented in the main text relies on branch lengths derived from the whole-genome phylogeny in Fig. 2 (i.e. a concatenated supermatrix of genes). Here, we re-estimated same-species branch length thresholds for each gene. For each gene, we used the distribution of branch lengths between genomes known to belong to the same species, and measured each distribution’s mean and 95th percentile. Two upper thresholds were then set: the highest observed 95th percentile (orange dashed line) and the highest observed mean (magenta dashed line). While the percentage of genes exceeding each threshold varies between genomes, the results are relatively consistent, and lead to the same OTU assignment and the same conclusions when investigating tetraploid species. ilAceEphe1.µ still stands out as possessing more genes which exceed the same-species threshold (regardless of the threshold used) than other genomes. The figure was generated using Matplotlib [92], and manually annotated using InkScape (version 1.2.2).
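The per-gene threshold calculation described in this caption can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the code used for the figure: the input layout (a dictionary from gene name to same-species branch lengths), the function names, and the linear-interpolation percentile are all assumptions.

```python
import math
from statistics import mean


def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between sorted values."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values supplied")
    position = (len(ordered) - 1) * q / 100
    lower, upper = math.floor(position), math.ceil(position)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def gene_based_thresholds(same_species_lengths):
    """Return (highest 95th percentile, highest mean) across per-gene distributions."""
    highest_p95 = max(percentile(v, 95) for v in same_species_lengths.values())
    highest_mean = max(mean(v) for v in same_species_lengths.values())
    return highest_p95, highest_mean


def fraction_exceeding(pair_lengths, threshold):
    """Fraction of a genome's gene-pair branch lengths that exceed a threshold."""
    return sum(length > threshold for length in pair_lengths) / len(pair_lengths)
```

Applying `fraction_exceeding` once per genome and per threshold gives the kind of percentages compared in the figure.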

Fig. S5: Relationship between whole-genome phylogeny species delineation thresholds and individual gene phylogeny branch length distribution species delineation thresholds.

We compared our two gene-based metrics (highest 95th percentile and highest mean of branch length distributions of individual gene trees for genomes known to belong to the same species) to the whole-genome-based metric (highest branch length observed between any two same species genomes). We found the relationship between them to be consistent and linear, in line with the fact that they lead to the same conclusions. The figure was generated using Matplotlib [92], and manually annotated using InkScape (version 1.2.2).

Section 4: Between-haplotype rearrangements

Fig. S6: Tetraploid ilAceEphe1.µ is uneven and rearranged.

The number of BUSCO genes found in a given number of haplotypes (X), along with their total copy number. idChiSpeb1.µ is an even tetraploid, so nearly all its BUSCO genes are present in four copies, distributed across four haplotypes. In contrast, ilAceEphe1.µ is an uneven tetraploid. The majority of its BUSCO genes are present in fewer than four copies, and they are not evenly distributed across its haplotypes. For instance, some BUSCO genes occur in three copies that are all present in a single haplotype. The figure was generated using gerbil (Github: https://github.com/Amjad-Khalaf/gerbil), and manually annotated using InkScape (version 1.2.2).
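The tabulation underlying this kind of figure can be sketched as below: for each BUSCO gene, count its total copies and the number of distinct haplotypes carrying it, then tally genes by that pair of values. This is not gerbil's implementation; the input layout (one `(busco_id, haplotype)` pair per gene copy) and the function name are assumptions.

```python
from collections import Counter


def haplotype_copy_table(hits):
    """Tally BUSCO genes by (number of haplotypes, total copy number).

    hits: iterable of (busco_id, haplotype) pairs, one pair per gene copy.
    """
    copies = Counter()
    haplotypes = {}
    for busco_id, haplotype in hits:
        copies[busco_id] += 1
        haplotypes.setdefault(busco_id, set()).add(haplotype)
    return Counter((len(haplotypes[b]), copies[b]) for b in copies)
```

An even tetraploid would place nearly all genes in the `(4, 4)` cell, whereas a gene with three copies on one haplotype falls in `(1, 3)`.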

Section 5: Between-genome rearrangements

Text S3: Details on rearrangements inferred and methods attempted.

Comparing the BUSCO positions on closely related genomes showed a pattern of dynamic change of microsporidian linkage groups. The Encephalitozoon species genomes were highly syntenic, with the exception of a rearrangement involving a single linkage group [67] (Fig. S11-S12). ilEupExig1.µ (host Eupithecia exiguata [Lepidoptera]), placed basal to Encephalitozoon species, had six chromosomes that could be derived through either five pairwise fusions of the 11 chromosomes of Encephalitozoon species, or an ancestral karyotype that was subject to fission in Encephalitozoon. Vairimorpha necatrix was sister to the Encephalitozoon-ilEupExig1.µ clade, and had 11 chromosomes, but these did not correspond to the 11 found in Encephalitozoon species and did not simply confirm the karyotype of ilEupExig1.µ as being ancestral or derived. The enterocytozoonid iyOphElle1.µ, sister to the nosematids (V. necatrix, Encephalitozoon and ilEupExig1.µ), had 12 chromosomes, the largest of which is syntenic with the largest chromosome of V. necatrix, suggesting that the splitting of this chromosome in Encephalitozoon and ilEupExig1.µ is derived. This large linkage group was also found in the neopereziid Antonospora locustae, sister to the Encephalitozoonida, in the orphan lineage species Hamiltosporidium tvaerminnensis and in two genomes of Amblyosporida species, suggesting it was likely present in the last common ancestor of all microsporidia analysed.

The orphan lineage-Amblyosporida group showed a similar general pattern of linkage group conservation between close relatives and members of the same OTU, coupled with major rearrangements between clades (Fig. S11-S12). We were unable to infer a robust set of putative ancestral linkage groups for these genomes using syngraph [139] (Fig. S7-S8) or unsupervised clustering of loci based on their chromosomal occupancy [140,141] (Fig. S9-S10), likely because of the high frequency of rearrangements observed.

Fig. S7: Phylogeny used by Syngraph, with its internal node labelling.

Each node is labelled with its Syngraph name in a grey box. Yellow boxes indicate the number of chromosomes each genome possesses, and blue boxes indicate the number of chromosomes which possess BUSCO gene markers. The figure was generated using ToyTree [74], and manually annotated using InkScape (version 1.2.2).

Fig. S8: Number of chromosomes inferred at each node is highly variable.

The number of chromosomes inferred for each node, and the total number of BUSCO genes assigned to a chromosome for each “m”. “m” is the parameter in Syngraph to determine the minimum number of genes needed to travel together for the event to be counted as a rearrangement. For example, if m = 3, only rearrangements involving 3 or more genes will be counted. Deep nodes are highly variable and their karyotype (and thus the number of rearrangements that have occurred along each branch) cannot be estimated reliably. See Fig. S7 for node labels on the phylogeny. The figure was generated using Matplotlib [92], and manually annotated using InkScape (version 1.2.2).

Fig. S9: t-SNE plot depicting BUSCO linkage groups across the microsporidian phylogeny.

Each point represents a BUSCO gene, positioned based on its co-occurrence profile across the chromosome-level microsporidian genomes. Distances between points reflect similarities in co-occurrence. Points are coloured by their assigned chromosome in Antonospora locustae. This disorganised pattern illustrates that the rate of rearrangement is too high for a reliable complete reconstruction of putative ancestral linkage groups. The large-scale patterns are influenced by more densely sampled taxa (see Fig. S10). The data underlying this figure can be found in Supporting Information Section 1 File Collection 5. The figure was generated using Scikit-learn [140,141] and Matplotlib [92], and manually annotated using InkScape (version 1.2.2).
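One possible way to build a co-occurrence profile for each BUSCO gene is sketched below: for every pair of genes, record the fraction of genomes containing both in which they lie on the same chromosome. The exact profile definition used for the figure is not given in this description, so this definition, the input layout, and the function name are assumptions. Each row of the resulting matrix could then serve as the feature vector for an embedding method such as scikit-learn's `TSNE`.

```python
def cooccurrence_matrix(assignments):
    """Build a gene-by-gene chromosome co-occurrence matrix.

    assignments: dict mapping genome name -> dict of busco_id -> chromosome name.
    Returns (genes, matrix), where matrix[i][j] is the fraction of genomes
    carrying both genes i and j in which they share a chromosome.
    """
    genes = sorted({gene for chrom_of in assignments.values() for gene in chrom_of})
    size = len(genes)
    matrix = [[0.0] * size for _ in range(size)]
    for i, gene_a in enumerate(genes):
        for j, gene_b in enumerate(genes):
            shared = [c for c in assignments.values() if gene_a in c and gene_b in c]
            if shared:
                same = sum(c[gene_a] == c[gene_b] for c in shared)
                matrix[i][j] = same / len(shared)
    return genes, matrix
```

Genes that travel together across most genomes end up with similar rows, which is what would make conserved linkage groups appear as tight clusters in the embedding.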

Fig. S10: t-SNE plot depicting BUSCO linkage groups across the microsporidian phylogeny, highlighting clustering influence by more densely sampled taxa.

Each point represents a BUSCO gene, positioned based on its co-occurrence profile across the chromosome-level microsporidian genomes. Distances between points reflect similarities in co-occurrence. Points are coloured by their assigned chromosome in Encephalitozoon cuniculi. This disorganised pattern illustrates that the rate of rearrangement is too high for a reliable complete reconstruction of putative ancestral linkage groups. The large-scale patterns are influenced by more densely sampled taxa, such as Encephalitozoon cuniculi. The data underlying this figure can be found in Supporting Information Section 1 File Collection 5. The figure was generated using Scikit-learn [140,141] and Matplotlib [92], and manually annotated using InkScape (version 1.2.2).

Fig. S11: Synteny plots of chromosomal microsporidian genome assemblies.

Genome-wide synteny plots of all available chromosomal microsporidian genome assemblies. Each line represents a single-copy BUSCO (microsporidia_odb10) [78]. BUSCOs are painted by their chromosomal position in A. locustae. The data underlying this figure can be found in Supporting Information Section 1 File Collection 5. The figure was generated using ribbon plot scripts from https://github.com/conchoecia/odp [109] and ToyTree [74], and manually annotated using InkScape (version 1.2.2).

Fig. S12: Synteny plots of chromosomal microsporidian genome assemblies.

Genome-wide synteny plots of all available chromosomal microsporidian genome assemblies. Each line represents a single-copy BUSCO (microsporidia_odb10) [78]. BUSCOs are painted by their chromosomal position in H. tvaerminnensis. The figure was generated using ribbon plot scripts from https://github.com/conchoecia/odp [109] and ToyTree [74], and manually annotated using InkScape (version 1.2.2).

Files

Files (794.2 MB)

Checksum  Size
md5:c8337d93af6dda5e98f5ebd2c61e07ff  17.2 kB
md5:b0bdbd6af4f9f4b5b2209476e95d9174  15.6 MB
md5:9c868ef080c29cace632496e7e5a21a6  111.5 kB
md5:511689e9624c44c5a8cb69a3a9c3f4ff  28.6 MB
md5:cb2f4249e4e704d4b5f0d95fe94008e8  27.5 MB
md5:10bd103f9da1082950e69e67167dc704  3.6 MB
md5:b521b6a2168e12a4ffcde5a39eb5aeae  431.4 MB
md5:e2fc7c5ab11a61dcd1ba510eba2892f2  220.8 kB
md5:1af9b935f7f5b54e2be5a243637f205f  121.1 kB
md5:23968a586441baf41d69e4e23df637d3  167.9 kB
md5:e15574895d6d40fe1a0372d27374f991  389.3 kB
md5:273bc60c64d3b06cc364ed5f37bec69b  591.4 kB
md5:01b0161f9f374fdb71787fce4b342883  127.8 kB
md5:2cdb1209435eab4a7fd788622aa85867  8.8 MB
md5:6ad272aa8432bd3a291c560d4d5436c6  564.8 kB
md5:673c0169fbd935ba112d51a4fc828987  74.4 kB
md5:94d193ccc5bf9d3a8558461eef0d8cbb  219.5 MB
md5:c15f272c44d1c1e046899adf92bce63f  13.0 kB
md5:15d2cf17dd44d817255607288b49463a  13.2 MB
md5:38a7786aabcd17aec85108fb88ea033b  1.5 MB
md5:3173ef0b7f70e8fdb234b009d058b7d1  82.1 kB
md5:9379a3d5a154a8a4bf32c3f9d3316595  2.1 MB
md5:e5de42eeec5eeacb4e71beaeb9c632c3  914.4 kB
md5:dc301332cab2fb5ce1fbd7e711830320  946.7 kB
md5:57ad4650e53016e0d75e37b5dd754d3b  860.8 kB
md5:205a66f91f8f9e355fa985a72f846bbd  23.7 MB
md5:f6ab4d460af30b56126fe911fe44f779  9.4 MB
md5:7007241e9d755be1dfcb827afbeb6c66  148.9 kB
md5:99af852d7a938fd01e27f23cace24aca  3.8 MB
md5:d2814d59724808f3febb5af6e0272fd2  3.4 kB
md5:8ddea7c90ea7e76c9a8a37616a7a555f  41.1 kB
md5:1beb41012909ea70394c5250a55637e6  7.6 kB
md5:f350018ff031418dc6a4ec59e26353db  8.0 kB
md5:28d94bfbdde450cb0c6df5c7c2648076  8.1 kB
md5:68f3a9e4b3917c6141657b7c66b468c2  6.9 kB
md5:e1b8ce13576cc8055a4b09e96845c8be  8.6 kB
md5:59fda5694be4017a45d7e7e393c2c358  2.1 kB
md5:f6cbc0a2ce4468728b0e5007d0abd41d  34.6 kB
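
The md5 checksums listed above can be used to confirm that a downloaded file is intact. A minimal sketch using Python's standard `hashlib` module follows; the function names are hypothetical, and the file path passed in is whatever local name the download was saved under (file names are not shown in this listing).

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hexadecimal md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed_checksum):
    """Compare a file against a listed value such as 'md5:c8337d93...'."""
    expected = listed_checksum.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

Reading in chunks keeps memory use low, which matters for the larger archives in this record.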

Additional details

Funding

Wellcome Trust
220540/Z/20/A