Published June 25, 2025 | Version v1
Dataset Open

Genomes and annotations of 4 Diamesa species

Description

Here we provide the genome assemblies and genome annotations of four Diamesa species (D. hyperborea, D. lindrothi, D. serratosioi, D. tonsa). 

We assembled the genomes using a pre-release of the EBP-Nor genome assembly pipeline (https://github.com/ebp-nor/GenomeAssembly). KMC (Kokot et al. 2017) was used to count k-mers of size 32 in the PacBio HiFi reads, excluding k-mers occurring more than 10,000 times. GenomeScope (Ranallo-Benavidez et al. 2022) was run on the k-mer histogram output from KMC to estimate genome size, heterozygosity and repetitiveness, while ploidy level was investigated using Smudgeplot (Ranallo-Benavidez et al. 2022). HiFiAdapterFilt (Sim et al. 2022) was applied to the HiFi reads to remove possible remnant PacBio adapter sequences. The filtered HiFi reads were assembled using hifiasm (Cheng et al. 2021) with Hi-C integration, resulting in a pair of haplotype-resolved assemblies, pseudo-haplotype one (hap1) and pseudo-haplotype two (hap2). Unique k-mers in each assembly/pseudo-haplotype were identified using meryl (Rhie et al. 2020) and used to create two sets of Hi-C reads, one without any k-mers occurring uniquely in hap1 and the other without k-mers occurring uniquely in hap2. K-mer filtered Hi-C reads were aligned to each assembly using BWA-MEM (Li 2013) with -5SPM options. The alignments were sorted by name using samtools (Li et al. 2009) before applying samtools fixmate to remove unmapped reads and secondary alignments and to add mate score, and samtools markdup to remove duplicates. The resulting BAM files were used to scaffold the two assemblies using YaHS (Zhou et al. 2023) with default options. FCS-GX (Astashyn et al. 2024) was used to search for contamination, and contaminated sequences were removed. Oatk (Zhou et al. 2024) was used to search the reads for the mitochondrial genome.
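As a toy illustration of the k-mer counting step described above, the sketch below counts k-mers in a handful of reads and builds the count histogram of the kind that GenomeScope takes as input. It is not KMC's implementation. The function names (`canonical`, `count_kmers`, `kmer_histogram`), the use of canonical k-mers and the skipping of k-mers that contain ambiguous bases are assumptions made for illustration. The `max_count` cutoff mirrors the limit of 10,000 occurrences mentioned above.

```python
from collections import Counter

# Translation table used to complement a DNA sequence.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k):
    """Count canonical k-mers over all reads (assumption: k-mers with non-ACGT bases are skipped)."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def kmer_histogram(counts, max_count=10_000):
    """Map each multiplicity to the number of distinct k-mers seen that many times.

    K-mers occurring more than max_count times are excluded.
    """
    histogram = Counter(c for c in counts.values() if c <= max_count)
    return dict(sorted(histogram.items()))
```

For real data one would call `count_kmers(reads, 32)` to match the k-mer size used here, although a dedicated counter such as KMC is needed at genome scale because a Python dictionary will not fit the k-mers of a full read set in memory.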

We annotated the genome assemblies using a pre-release version of the EBP-Nor genome annotation pipeline (https://github.com/ebp-nor/GenomeAnnotation). First, the AGAT (https://zenodo.org/record/7255559) scripts agat_sp_keep_longest_isoform.pl and agat_sp_extract_sequences.pl were used on the fruit fly (Drosophila melanogaster) genome assembly (BDGP6.46 (GCA_000001215.4) from Ensembl) and annotation to generate one protein (the longest isoform) per gene. Miniprot (Li 2023) was used to align the proteins to the curated assemblies. Proteins from UniProtKB/Swiss-Prot (Coudert et al. 2023) release 2022_03 and from the Arthropoda part of OrthoDB v11 (Kuznetsov et al. 2023) were also aligned separately to the assemblies. Red (Girgis 2015) was run via redmask (https://github.com/nextgenusfs/redmask) on the assemblies to mask repetitive regions. In addition, we ran Earl Grey (Baril et al. 2024) to annotate transposable elements. GALBA (Brunå et al. 2023, Li 2023, Buchfink et al. 2015, Hoff & Stanke 2019, Stanke et al. 2006) was run in miniprot mode with the fruit fly proteins on the masked assemblies. The funannotate-runEVM.py script from Funannotate was used to run EvidenceModeler (Haas et al. 2008) on the alignments of the fruit fly proteins, UniProtKB/Swiss-Prot proteins and Arthropoda proteins, together with the predicted genes from GALBA. The resulting predicted proteins were compared to the protein repeats that Funannotate distributes using DIAMOND blastp, and the predicted genes were filtered based on this comparison using AGAT. The filtered proteins were compared to UniProtKB/Swiss-Prot release 2022_03 using DIAMOND (Buchfink et al. 2015) blastp to find gene names, and InterProScan was used to discover functional domains. AGAT's agat_sp_manage_functional_annotation.pl was used to attach the gene names and functional annotations to the predicted genes.
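The "one protein (the longest isoform) per gene" step above can be sketched in a few lines. This is a simplified illustration of the idea and not AGAT's implementation: the function name `keep_longest_isoform`, the tuple-based input format and the tie-breaking rule (the first isoform seen wins) are all assumptions.

```python
def keep_longest_isoform(records):
    """Select one protein per gene, keeping the longest isoform.

    records: iterable of (gene_id, transcript_id, protein_sequence) tuples
    (a hypothetical input format chosen for this sketch).
    Returns a dict mapping gene_id to (transcript_id, protein_sequence).
    On ties the first isoform encountered is kept (assumption).
    """
    longest = {}
    for gene_id, transcript_id, protein in records:
        current = longest.get(gene_id)
        if current is None or len(protein) > len(current[1]):
            longest[gene_id] = (transcript_id, protein)
    return longest
```

Reducing the reference proteome to one sequence per gene avoids aligning several near-identical isoforms to the same locus, which would otherwise give that locus extra weight as evidence.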

Files

Files (300.4 MB)

Checksum Size
md5:aad14f7ebeab2a380dbbc63d43dc5b2c 30.8 MB
md5:1f961828cde68ac314c3a1dca86c7151 2.4 MB
md5:780025ceb1476d63b1d2ea9789e18ace 33.0 MB
md5:bda2a03ca708a275a22d6b1873591078 2.5 MB
md5:9c725e96811acb3bd4db08bc62ef1aec 3.7 MB
md5:d2b3f74c67c0151620d2df59ae7a2f8c 3.8 MB
md5:23a93cd68de269deed0fce5c746f77eb 30.8 MB
md5:2249e45a6853b4184653bc44f54f82ac 3.7 MB
md5:6f845ecaef721ce9061ef2a2b1fba516 31.7 MB
md5:ea0c1aa5e5f8c5fe02e86e7525cc441c 2.5 MB
md5:de716b1fe8158ad98ce4eee5ef6f0fa1 2.4 MB
md5:71e7d227db4b5d279c552d41ab44c146 3.7 MB
md5:1413eae06f9f75e38fd3873565bf8b77 31.4 MB
md5:09f84da625788d5e6406286fe61de954 2.5 MB
md5:9ea594b3ba6ed971dc92c8950dcb4bea 29.0 MB
md5:7b844925ec927f896cf28a29aa4a6182 2.4 MB
md5:3836bb94facae865a0bc7589be0597a4 3.8 MB
md5:a625e85fa27b2ae1750b7ee7381e35b8 3.6 MB
md5:b837bb713a6a065af5128386afed41e8 33.5 MB
md5:2b16872fbef3dd92c84323af331f9b4b 2.5 MB
md5:4aee0cc3bc49135badf156da9e560d91 30.8 MB
md5:96a8f6234f9f0f8f4395a0ed76f818d6 2.4 MB
md5:7635e6eea7e07429362eb38f5de1264c 3.8 MB
md5:389531ed8296acf40fa11780e52ca0d5 3.7 MB
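
The md5 values listed above can be used to check that a downloaded file arrived intact. The following is a minimal sketch using only the Python standard library. The helper names (`md5_of_file`, `matches_listed_checksum`) are hypothetical, and the file path passed in is whatever name the file was saved under locally, since file names are not part of this listing.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file against a checksum written either as 'md5:<hex>' or as bare hex."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

A call such as `matches_listed_checksum(local_path, "md5:aad14f7ebeab2a380dbbc63d43dc5b2c")` returns `True` only when the local file's digest equals the listed value.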

Additional details

Funding

The Research Council of Norway
Earth Biogenome Project Norway 326819