Genomes and annotations of 4 Diamesa species

Martin, Sarah L Fordyce; La Torre, Renato; Danneels, Bram; Skage, Morten; Kollias, Spyridon; Tørresen, Ole Kristian; Anbaran, Mohsen Falahati; Stur, Elisabeth; Jakobsen, Kjetill Sigurd; Martin, Michael; Ekrem, Torbjørn

doi:10.5281/zenodo.15735891

Published June 25, 2025 | Version v1

Dataset Open

1. Norwegian University of Science and Technology
2. University of Bergen
3. University of Oslo
4. NTNU University Museum

Here we provide the genome assemblies and genome annotations of four Diamesa species (D. hyperborea, D. lindrothi, D. serratosioi, D. tonsa).

We assembled the genomes of the four species using a pre-release of the EBP-Nor genome assembly pipeline (https://github.com/ebp-nor/GenomeAssembly). KMC (Kokot et al. 2017) was used to count k-mers of size 32 in the PacBio HiFi reads, excluding k-mers occurring more than 10,000 times. GenomeScope (Ranallo-Benavidez et al. 2022) was run on the k-mer histogram output from KMC to estimate genome size, heterozygosity and repetitiveness, while ploidy level was investigated using Smudgeplot (Ranallo-Benavidez et al. 2022). HifiAdapterFilt (Sim et al. 2022) was applied to the HiFi reads to remove possible remnant PacBio adapter sequences. The filtered HiFi reads were assembled using hifiasm (Cheng et al. 2021) with Hi-C integration, resulting in a pair of haplotype-resolved assemblies, pseudo-haplotype one (hap1) and pseudo-haplotype two (hap2). Unique k-mers in each assembly/pseudo-haplotype were identified using meryl (Rhie et al. 2020) and used to create two sets of Hi-C reads, one without any k-mers occurring uniquely in hap1 and the other without k-mers occurring uniquely in hap2. K-mer filtered Hi-C reads were aligned to each scaffolded assembly using BWA-MEM (Li 2013) with -5SPM options. The alignments were sorted by name using samtools (Li et al. 2009) before applying samtools fixmate to remove unmapped reads and secondary alignments and to add mate score, and samtools markdup to remove duplicates. The resulting BAM files were used to scaffold the two assemblies using YaHS (Zhou et al. 2023) with default options. FCS-GX (Astashyn et al. 2024) was used to search for contamination, and contaminated sequences were removed. The reads were searched for the mitochondrial genome using Oatk (Zhou et al. 2024).
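As an illustration of the k-mer counting step only, the following is a minimal pure-Python sketch of counting k-mers in reads and building the count histogram that a tool such as GenomeScope takes as input. It is not KMC and is far too slow for real HiFi data. The function names (`canonical`, `count_kmers`, `kmer_histogram`) are hypothetical, and counting k-mers in canonical form (a k-mer and its reverse complement treated as one) is an assumption of this sketch rather than something stated above.

```python
from collections import Counter

# Translation table used to build reverse complements.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k=32):
    """Count canonical k-mers across an iterable of read sequences."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def kmer_histogram(counts, max_count=10_000):
    """Map each multiplicity to the number of distinct k-mers seen that often.

    K-mers occurring more than max_count times are left out, mirroring the
    10,000 cutoff described in the text.
    """
    histogram = Counter()
    for multiplicity in counts.values():
        if multiplicity <= max_count:
            histogram[multiplicity] += 1
    return dict(sorted(histogram.items()))
```

In a real run the histogram would have a peak near the sequencing coverage, and it is the position and shape of that peak that genome size and heterozygosity estimates are derived from.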

We annotated the genome assemblies using a pre-release version of the EBP-Nor genome annotation pipeline (https://github.com/ebp-nor/GenomeAnnotation). First, the AGAT (https://zenodo.org/record/7255559) scripts agat_sp_keep_longest_isoform.pl and agat_sp_extract_sequences.pl were used on the fruit fly (Drosophila melanogaster) genome assembly (BDGP6.46, GCA_000001215.4, from Ensembl) and annotation to generate one protein (the longest isoform) per gene. Miniprot (Li 2023) was used to align the proteins to the curated assemblies. Proteins from UniProtKB/Swiss-Prot (Coudert et al. 2023) release 2022_03 and from the Arthropoda part of OrthoDB v11 (Kuznetsov et al. 2023) were also aligned separately to the assemblies. Red (Girgis 2015) was run via redmask (https://github.com/nextgenusfs/redmask) on the assemblies to mask repetitive areas. In addition, we ran Earl Grey (Baril et al. 2024) to annotate transposable elements. GALBA (Brunå et al. 2023, Li 2023, Buchfink et al. 2015, Hoff & Stanke 2019, Stanke et al. 2006) was run with the fruit fly proteins using the miniprot mode on the masked assemblies. The funannotate-runEVM.py script from Funannotate was used to run EvidenceModeler (Haas et al. 2008) on the alignments of the fruit fly proteins, UniProtKB/Swiss-Prot proteins and Arthropoda proteins, and on the predicted genes from GALBA. The resulting predicted proteins were compared to the protein repeats that Funannotate distributes using DIAMOND blastp, and the predicted genes were filtered based on this comparison using AGAT. The filtered proteins were compared to UniProtKB/Swiss-Prot release 2022_03 using DIAMOND (Buchfink et al. 2015) blastp to find gene names, and InterProScan was used to discover functional domains. AGAT's agat_sp_manage_functional_annotation.pl was used to attach the gene names and functional annotations to the predicted genes.
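The "one protein (the longest isoform) per gene" reduction can be sketched in a few lines of Python. This is an illustration of the idea only, not the AGAT implementation: the function name `longest_isoform_per_gene` and the input layout (tuples of gene ID, transcript ID and protein sequence) are hypothetical, and ranking isoforms by protein length with ties going to the first one seen is an assumption of this sketch.

```python
def longest_isoform_per_gene(records):
    """Keep one (transcript_id, protein) pair per gene: the longest protein.

    records: iterable of (gene_id, transcript_id, protein_sequence) tuples.
    Ties keep the isoform that appears first in the input (an assumption).
    """
    best = {}
    for gene_id, transcript_id, protein in records:
        current = best.get(gene_id)
        # Replace only when strictly longer, so the first isoform wins ties.
        if current is None or len(protein) > len(current[1]):
            best[gene_id] = (transcript_id, protein)
    return best


# Toy example with made-up identifiers and sequences.
example = [
    ("gene1", "tx1", "MKV"),
    ("gene1", "tx2", "MKVLA"),
    ("gene2", "tx3", "MA"),
]
selected = longest_isoform_per_gene(example)
```

Reducing the reference proteome this way avoids aligning several near-identical isoforms of the same gene to each assembly.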

Files

Files (300.4 MB)

Name	MD5	Size
idDiaHype1.1.hap1.fa.gz	aad14f7ebeab2a380dbbc63d43dc5b2c	30.8 MB
idDiaHype1.1.hap1.gff.gz	1f961828cde68ac314c3a1dca86c7151	2.4 MB
idDiaHype1.1.hap2.fa.gz	780025ceb1476d63b1d2ea9789e18ace	33.0 MB
idDiaHype1.1.hap2.gff.gz	bda2a03ca708a275a22d6b1873591078	2.5 MB
idDiaHype1_hap1.proteins.fa.gz	9c725e96811acb3bd4db08bc62ef1aec	3.7 MB
idDiaHype1_hap2.proteins.fa.gz	d2b3f74c67c0151620d2df59ae7a2f8c	3.8 MB
idDiaLind1.1.hap1.fa.gz	23a93cd68de269deed0fce5c746f77eb	30.8 MB
idDiaLind1.1.hap1.proteins.fa.gz	2249e45a6853b4184653bc44f54f82ac	3.7 MB
idDiaLind1.1.hap2.fa.gz	6f845ecaef721ce9061ef2a2b1fba516	31.7 MB
idDiaLind1.1.hap2.gff.gz	ea0c1aa5e5f8c5fe02e86e7525cc441c	2.5 MB
idDiaLind1.hap1.gff.gz	de716b1fe8158ad98ce4eee5ef6f0fa1	2.4 MB
idDiaLind1_hap2.proteins.fa.gz	71e7d227db4b5d279c552d41ab44c146	3.7 MB
idDiaSerr1.1.hap1.fa.gz	1413eae06f9f75e38fd3873565bf8b77	31.4 MB
idDiaSerr1.1.hap1.gff.gz	09f84da625788d5e6406286fe61de954	2.5 MB
idDiaSerr1.1.hap2.fa.gz	9ea594b3ba6ed971dc92c8950dcb4bea	29.0 MB
idDiaSerr1.1.hap2.gff.gz	7b844925ec927f896cf28a29aa4a6182	2.4 MB
idDiaSerr1_hap1.proteins.fa.gz	3836bb94facae865a0bc7589be0597a4	3.8 MB
idDiaSerr1_hap2.proteins.fa.gz	a625e85fa27b2ae1750b7ee7381e35b8	3.6 MB
idDiaTons1.1.hap1.fa.gz	b837bb713a6a065af5128386afed41e8	33.5 MB
idDiaTons1.1.hap1.gff.gz	2b16872fbef3dd92c84323af331f9b4b	2.5 MB
idDiaTons1.1.hap2.fa.gz	4aee0cc3bc49135badf156da9e560d91	30.8 MB
idDiaTons1.1.hap2.gff.gz	96a8f6234f9f0f8f4395a0ed76f818d6	2.4 MB
idDiaTons1_hap1.proteins.fa.gz	7635e6eea7e07429362eb38f5de1264c	3.8 MB
idDiaTons1_hap2.proteins.fa.gz	389531ed8296acf40fa11780e52ca0d5	3.7 MB
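
The MD5 checksums listed with the files can be used to confirm that a download is intact. A minimal sketch using Python's standard `hashlib` module follows; the helper names `md5_of_file` and `verify_md5` are hypothetical, and the commented usage line assumes the file has been downloaded into the current directory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large assemblies need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected):
    """Return True if the file's MD5 digest matches the expected hex string."""
    return md5_of_file(path) == expected.strip().lower()


# Example, assuming the file has been downloaded to the working directory:
# verify_md5("idDiaHype1.1.hap1.fa.gz", "aad14f7ebeab2a380dbbc63d43dc5b2c")
```

A mismatch usually indicates a truncated or corrupted download, in which case the file should be fetched again.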

Additional details

Funding: The Research Council of Norway, Earth Biogenome Project Norway (grant 326819)
