Genome assembly and gene annotations for Glacier lanternfish (Benthosema glaciale)
Description
Here we provide the genome assembly and gene annotations for the glacier lanternfish (Benthosema glaciale). We provide them both for convenience and because some of the functional annotations of genes and proteins are removed when the files are prepared for upload to ENA.
We assembled the genome using a pre-release of the EBP-Nor genome assembly pipeline (https://github.com/ebp-nor/GenomeAssembly). HiFiAdapterFilt (Sim et al., 2022) was applied to the HiFi reads to remove possible remnant PacBio adapter sequences. The filtered HiFi reads were assembled using hifiasm (Cheng et al., 2021) with Hi-C integration, resulting in a pair of haplotype-resolved assemblies: pseudo-haplotype one (hap1) and pseudo-haplotype two (hap2). Unique k-mers in each pseudo-haplotype assembly were identified using meryl (Rhie et al., 2020) and used to create two sets of Hi-C reads, one without any k-mers occurring uniquely in hap1 and the other without any k-mers occurring uniquely in hap2. The k-mer-filtered Hi-C reads were aligned to each assembly using BWA-MEM (Li, 2013) with the -5SPM options. The alignments were sorted by name using samtools (Li et al., 2009), after which samtools fixmate was applied to remove unmapped reads and secondary alignments and to add mate scores, and samtools markdup was applied to remove duplicates. The resulting BAM files were used to scaffold the two assemblies using YaHS (Zhou et al., 2022) with default options.
FCS-GX (Astashyn et al., 2023) was used to search for contamination, and contaminated sequences were removed. If a contaminant was detected at the start or end of a sequence, the sequence was trimmed using a combination of samtools faidx, bedtools (Quinlan and Hall, 2010) complement, and bedtools getfasta. If the contaminant was internal, it was masked using bedtools maskfasta. The mitochondrial genome was searched for in contigs and reads using MitoHiFi (Uliano-Silva et al., 2023). The assemblies were manually curated using PretextView. Chromosomes were identified by inspecting the Hi-C contact map in PretextView and named according to homology to kcLamFluv1. Some of the tools used for evaluation have been implemented in the EBP-Nor genome assembly evaluation pipeline (https://github.com/ebp-nor/GenomeEvaluation).
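The decontamination rule described above (trim a contaminant at the start or end of a sequence, mask an internal one) can be illustrated with a small pure-Python sketch. This is not the pipeline's code, which uses samtools faidx and bedtools. The function name `clean_sequence`, the 0-based half-open interval convention, and the choice to hard-mask with `N` are assumptions made for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 0-based, half-open [start, end), as in BED files


def clean_sequence(seq: str, contaminants: List[Interval]) -> Tuple[str, Interval]:
    """Trim terminal contaminant intervals and N-mask internal ones.

    Returns the cleaned sequence and the retained interval in the
    coordinates of the original sequence. Hypothetical helper, for
    illustration only.
    """
    keep_start, keep_end = 0, len(seq)

    # Trim from the start: consume intervals that touch the current start.
    for start, end in sorted(contaminants):
        if start <= keep_start:
            keep_start = max(keep_start, end)
        else:
            break

    # Trim from the end: consume intervals that touch the current end.
    for start, end in sorted(contaminants, key=lambda iv: iv[1], reverse=True):
        if end >= keep_end:
            keep_end = min(keep_end, start)
        else:
            break

    # Nothing left if the whole sequence was flagged.
    if keep_start >= keep_end:
        return "", (0, 0)

    # Hard-mask whatever contaminant bases remain inside the kept region.
    bases = list(seq)
    for start, end in contaminants:
        for pos in range(max(start, keep_start), min(end, keep_end)):
            bases[pos] = "N"

    return "".join(bases[keep_start:keep_end]), (keep_start, keep_end)


if __name__ == "__main__":
    toy = "AAAACCCCGGGGTTTT"
    cleaned, kept = clean_sequence(toy, [(0, 4), (8, 10), (14, 16)])
    print(cleaned, kept)
```

In the toy call, the intervals at both ends are trimmed away and the two internal bases are replaced by `N`, so the coordinates of the remaining bases shift only because of the leading trim.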
We annotated the genome assemblies using a pre-release version of the EBP-Nor genome annotation pipeline (https://github.com/ebp-nor/GenomeAnnotation). First, the AGAT (https://zenodo.org/record/7255559) scripts agat_sp_keep_longest_isoform.pl and agat_sp_extract_sequences.pl were used on the GRCz11 genome assembly and annotation to generate one protein (the longest isoform) per gene. Miniprot (Li, 2023) was used to align these proteins to the curated assemblies. Proteins from UniProtKB/Swiss-Prot (The UniProt Consortium, 2022) release 2022_03 and from the Vertebrata part of OrthoDB v11 (Kuznetsov et al., 2022) were also aligned separately to the assemblies. Red (Girgis, 2015) was run via redmask (https://github.com/nextgenusfs/redmask) on the assemblies to mask repetitive regions. GALBA (Brůna et al., 2023; Buchfink et al., 2015; Hoff and Stanke, 2018; Li, 2023; Stanke et al., 2006) was run with the GRCz11 proteins using the miniprot mode on the masked assemblies. The funannotate-runEVM.py script from Funannotate was used to run EvidenceModeler (Haas et al., 2008) on the alignments of GRCz11 proteins, UniProtKB/Swiss-Prot proteins, and Vertebrata proteins, together with the predicted genes from GALBA. The resulting predicted proteins were compared to the protein repeats that Funannotate distributes using DIAMOND (Buchfink et al., 2015) blastp, and the predicted genes were filtered based on this comparison using AGAT. The filtered proteins were compared to UniProtKB/Swiss-Prot release 2022_03 using DIAMOND blastp to find gene names, and InterProScan was used to discover functional domains. AGAT's agat_sp_manage_functional_annotation.pl was used to attach the gene names and functional annotations to the predicted genes.
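The first annotation step reduces the reference annotation to one protein per gene by keeping the longest isoform. The idea is sketched below in plain Python. This is an illustrative stand-in, not AGAT's implementation: AGAT works on GFF features and applies its own length criterion and tie-breaking, whereas the hypothetical `longest_isoform_per_gene` helper here simply compares protein lengths and keeps the first isoform seen on a tie.

```python
from typing import Dict, Iterable, NamedTuple


class Isoform(NamedTuple):
    gene_id: str
    transcript_id: str
    protein: str  # amino-acid sequence of this isoform


def longest_isoform_per_gene(isoforms: Iterable[Isoform]) -> Dict[str, Isoform]:
    """Keep, for every gene, the isoform with the longest protein.

    Ties are resolved in favour of the isoform encountered first
    (an assumption of this sketch).
    """
    best: Dict[str, Isoform] = {}
    for iso in isoforms:
        current = best.get(iso.gene_id)
        if current is None or len(iso.protein) > len(current.protein):
            best[iso.gene_id] = iso
    return best


if __name__ == "__main__":
    # Toy records with made-up identifiers and sequences.
    records = [
        Isoform("geneA", "geneA.t1", "MKT"),
        Isoform("geneA", "geneA.t2", "MKTLLV"),
        Isoform("geneB", "geneB.t1", "MA"),
    ]
    for gene_id, iso in longest_isoform_per_gene(records).items():
        print(gene_id, iso.transcript_id, len(iso.protein))
```

Using one representative protein per gene keeps the protein-to-genome alignment step from being dominated by genes with many near-identical isoforms.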
Files (771.7 MB)
| MD5 checksum | Size |
|---|---|
| md5:a8b3a2f8730194d099f0aa25e9cefdf4 | 362.2 MB |
| md5:9db698f6eebec396509c4c8ce1e29dd5 | 10.3 MB |
| md5:76a6b5a56fcff61342a33a0f4d9d043d | 10.1 MB |
| md5:01503c4992859b95da229e72096cbacc | 368.4 MB |
| md5:cac591350444783c9f6b1c38f1f850ce | 10.5 MB |
| md5:1b5b87293677370713729503d98177a9 | 10.2 MB |
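The md5 values listed for the files can be used to check that a download completed without corruption. A minimal sketch using Python's standard `hashlib` module follows. The helper names `md5_of_file` and `matches_listed_checksum` and the file path in the usage example are hypothetical; substitute the path of the file you actually downloaded.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path: str, listed: str) -> bool:
    """Compare a file against a checksum written as 'md5:<hex>' or plain '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


if __name__ == "__main__":
    import sys

    # Usage (hypothetical path): python check_md5.py downloaded_file md5:<hex>
    if len(sys.argv) == 3:
        ok = matches_listed_checksum(sys.argv[1], sys.argv[2])
        print("checksum OK" if ok else "checksum MISMATCH")
```

Reading in chunks matters here because the two largest files are several hundred megabytes each.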