Published August 25, 2025 | Version v1
Dataset Open

Genome assembly and gene annotations for Glacier lanternfish (Benthosema glaciale)

  • 1. University of Oslo
  • 2. Norwegian University of Life Sciences
  • 3. Norwegian University of Science and Technology

Description

Here we provide the genome assembly and gene annotations for the Glacier lanternfish (Benthosema glaciale). We provide these both for convenience and because some of the functional annotations of genes/proteins are removed when we prepare the files for upload to ENA.

We assembled the genome using a pre-release of the EBP-Nor genome assembly pipeline (https://github.com/ebp-nor/GenomeAssembly). HiFiAdapterFilt (Sim et al., 2022) was applied to the HiFi reads to remove possible remnant PacBio adapter sequences. The filtered HiFi reads were assembled using hifiasm (Cheng et al., 2021) with Hi-C integration, resulting in a pair of haplotype-resolved assemblies: pseudo-haplotype one (hap1) and pseudo-haplotype two (hap2). Unique k-mers in each assembly/pseudo-haplotype were identified using meryl (Rhie et al., 2020) and used to create two sets of Hi-C reads, one without any k-mers occurring uniquely in hap1 and the other without k-mers occurring uniquely in hap2. K-mer filtered Hi-C reads were aligned to each scaffolded assembly using BWA-MEM (Li, 2013) with -5SPM options. The alignments were sorted by name using samtools (Li et al., 2009) before applying samtools fixmate to remove unmapped reads and secondary alignments and to add mate score, and samtools markdup to remove duplicates. The resulting BAM files were used to scaffold the two assemblies using YaHS (Zhou et al., 2022) with default options. FCS-GX (Astashyn et al., 2023) was used to search for contamination, and contaminated sequences were removed. If a contaminant was detected at the start or end of a sequence, the sequence was trimmed using a combination of samtools faidx, bedtools (Quinlan and Hall, 2010) complement, and bedtools getfasta. If the contaminant was internal, it was masked using bedtools maskfasta. The mitochondrial genome was searched for in contigs and reads using MitoHiFi (Uliano-Silva et al., 2023). The assemblies were manually curated using PretextView. Chromosomes were identified by inspecting the Hi-C contact map in PretextView and named according to homology to kcLamFluv1. Some of the tools used for evaluation have been implemented in the EBP-Nor genome assembly evaluation pipeline (https://github.com/ebp-nor/GenomeEvaluation).
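The trim-or-mask rule for contaminant hits can be illustrated with a minimal Python sketch. This is not the pipeline's code (which uses samtools and bedtools on FASTA files); the function name clean_sequence, the 0-based half-open interval convention, and the use of "N" as the mask character are assumptions made for illustration only.

```python
def clean_sequence(seq, hits, mask_char="N"):
    """Trim contaminant hits that touch either end; mask internal ones.

    seq  : sequence as a string
    hits : list of (start, end) intervals, 0-based and half-open (assumed)
    """
    left, right = 0, len(seq)
    hits = sorted(hits)

    # Trim from the left: any hit reaching the current left boundary is cut off.
    for start, end in hits:
        if start <= left:
            left = max(left, end)

    # Trim from the right, visiting hits from the rightmost end backwards.
    for start, end in sorted(hits, key=lambda h: h[1], reverse=True):
        if end >= right:
            right = min(right, start)

    # The whole sequence was contaminant: nothing is left to keep.
    if left >= right:
        return ""

    # Mask the remaining (internal) hits inside the retained region.
    chars = list(seq)
    for start, end in hits:
        for i in range(max(start, left), min(end, right)):
            chars[i] = mask_char
    return "".join(chars[left:right])


if __name__ == "__main__":
    # One hit at the start (trimmed) and one internal hit (masked).
    print(clean_sequence("AAACCCGGGTTT", [(0, 3), (5, 7)]))
```

Trimming terminal hits shortens the sequence, while masking internal hits preserves coordinates and scaffold structure on either side of the contaminant.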

We annotated the genome assemblies using a pre-release version of the EBP-Nor genome annotation pipeline (https://github.com/ebp-nor/GenomeAnnotation). First, AGAT (https://zenodo.org/record/7255559) agat_sp_keep_longest_isoform.pl and agat_sp_extract_sequences.pl were used on the GRCz11 genome assembly and annotation to generate one protein (the longest isoform) per gene. Miniprot (Li, 2023) was used to align the proteins to the curated assemblies. UniProtKB/Swiss-Prot (The UniProt Consortium, 2022) release 2022_03 and the Vertebrata part of OrthoDB v11 (Kuznetsov et al., 2022) were also aligned separately to the assemblies. Red (Girgis, 2015) was run via redmask (https://github.com/nextgenusfs/redmask) on the assemblies to mask repetitive areas. GALBA (Brůna et al., 2023; Buchfink et al., 2015; Hoff and Stanke, 2018; Li, 2023; Stanke et al., 2006) was run with the sea lamprey proteins using the miniprot mode on the masked assemblies. The funannotate-runEVM.py script from Funannotate was used to run EvidenceModeler (Haas et al., 2008) on the alignments of sea lamprey proteins, UniProtKB/Swiss-Prot proteins, Vertebrata proteins and the predicted genes from GALBA. The resulting predicted proteins were compared to the protein repeats that Funannotate distributes using DIAMOND blastp, and the predicted genes were filtered based on this comparison using AGAT. The filtered proteins were compared to the UniProtKB/Swiss-Prot release 2022_03 using DIAMOND (Buchfink et al., 2015) blastp to find gene names, and InterProScan was used to discover functional domains. AGAT's agat_sp_manage_functional_annotation.pl was used to attach the gene names and functional annotations to the predicted genes.
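The "one protein (the longest isoform) per gene" step can be sketched in Python as follows. This only illustrates the selection idea and is not AGAT's implementation; the function name longest_isoform_per_gene, the tuple layout, and the example identifiers are hypothetical, and ties are resolved here by keeping the first isoform seen, which is an assumption.

```python
def longest_isoform_per_gene(proteins):
    """Keep the longest protein isoform for each gene.

    proteins : iterable of (gene_id, transcript_id, protein_sequence) tuples
    returns  : dict mapping gene_id -> (transcript_id, protein_sequence)
    """
    best = {}
    for gene_id, transcript_id, seq in proteins:
        current = best.get(gene_id)
        # Replace only when strictly longer, so the first-seen isoform wins ties.
        if current is None or len(seq) > len(current[1]):
            best[gene_id] = (transcript_id, seq)
    return best


if __name__ == "__main__":
    # Hypothetical identifiers and sequences, for illustration only.
    records = [
        ("geneA", "geneA.t1", "MKT"),
        ("geneA", "geneA.t2", "MKTLLV"),
        ("geneB", "geneB.t1", "MSS"),
    ]
    for gene, (transcript, seq) in longest_isoform_per_gene(records).items():
        print(gene, transcript, len(seq))
```

Reducing the reference protein set to one isoform per gene avoids aligning several near-identical proteins to the same locus.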

Files (771.7 MB)

Checksum                                Size
md5:a8b3a2f8730194d099f0aa25e9cefdf4    362.2 MB
md5:9db698f6eebec396509c4c8ce1e29dd5    10.3 MB
md5:76a6b5a56fcff61342a33a0f4d9d043d    10.1 MB
md5:01503c4992859b95da229e72096cbacc    368.4 MB
md5:cac591350444783c9f6b1c38f1f850ce    10.5 MB
md5:1b5b87293677370713729503d98177a9    10.2 MB
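
A downloaded file can be checked against its listed MD5 checksum with a short Python sketch like the one below. The helper names md5_of_file and matches_checksum are hypothetical, and the file path in the usage example is a placeholder, since file names are not shown in the listing above.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large assembly files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 to an expected value such as 'md5:<hex digest>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex


if __name__ == "__main__":
    import sys

    # Usage: python verify_md5.py <downloaded file> <md5:checksum>
    if len(sys.argv) == 3:
        ok = matches_checksum(sys.argv[1], sys.argv[2])
        print("OK" if ok else "MISMATCH")
```

A mismatch usually indicates an incomplete or corrupted download, in which case the file should be fetched again.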