Published December 6, 2024 | Version v2
Dataset Open

Gene annotations for river and brook lamprey (Lampetra fluviatilis and Lampetra planeri)

Description

Here we provide the gene annotations for river and brook lamprey (Lampetra fluviatilis and Lampetra planeri), together with the annotations we produced for sea lamprey (kPetMar1; GCA_010993605.1) and another river lamprey (kcLamFluv1; GCA_964198585.1), for the preprint https://www.biorxiv.org/content/10.1101/2024.12.06.627158v1. We provide these both for convenience and because some of the functional annotations of genes/proteins are removed when the files are prepared for upload to ENA. We also provide the FASTA files for the assemblies we have made.

We assembled the genomes of both species using a pre-release of the EBP-Nor genome assembly pipeline (https://github.com/ebp-nor/GenomeAssembly). HiFiAdapterFilt (Sim et al., 2022) was applied to the HiFi reads to remove possible remnant PacBio adapter sequences. The filtered HiFi reads were assembled using hifiasm (Cheng et al., 2021) with Hi-C integration, resulting in a pair of haplotype-resolved assemblies for each species: pseudo-haplotype one (hap1) and pseudo-haplotype two (hap2).

Unique k-mers in each assembly/pseudo-haplotype were identified using meryl (Rhie et al., 2020) and used to create two sets of Hi-C reads, one without any k-mers occurring uniquely in hap1 and the other without k-mers occurring uniquely in hap2. K-mer filtered Hi-C reads were aligned to each scaffolded assembly using BWA-MEM (Li, 2013) with -5SPM options. The alignments were sorted by name using samtools (Li et al., 2009) before applying samtools fixmate to remove unmapped reads and secondary alignments and to add mate scores, and samtools markdup to remove duplicates. The resulting BAM files were used to scaffold the two assemblies using YaHS (Zhou et al., 2022) with default options.

FCS-GX (Astashyn et al., 2023) was used to search for contamination. Contaminated sequences were removed. If a contaminant was detected at the start or end of a sequence, the sequence was trimmed using a combination of samtools faidx, bedtools (Quinlan and Hall, 2010) complement, and bedtools getfasta. If the contaminant was internal, it was masked using bedtools maskfasta. MitoHiFi (Uliano-Silva et al., 2023) was used to search for the mitochondrial genome in contigs and reads. The assemblies were manually curated using PretextView. Chromosomes were identified by inspecting the Hi-C contact map in PretextView and named according to homology to kcLamFluv1. Some of the tools used for evaluation have been implemented in the EBP-Nor genome assembly evaluation pipeline (https://github.com/ebp-nor/GenomeEvaluation).
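
The Hi-C read processing described above can be sketched roughly as follows. This is a minimal illustration and not the pipeline's actual code: the function name, file names and thread count are hypothetical, the bwa index step and the coordinate sort before samtools markdup are assumptions that the text does not mention (markdup expects position-sorted input), and the k-mer filtering with meryl is left out.

```shell
# Hypothetical helper: align k-mer filtered Hi-C reads to one pseudo-haplotype.
# Usage: align_hic_reads <assembly.fa> <hic_R1.fq.gz> <hic_R2.fq.gz> <out_prefix> [threads]
align_hic_reads() {
    asm=$1
    r1=$2
    r2=$3
    prefix=$4
    threads=${5:-8}

    # Assumption: the assembly has not been indexed yet.
    bwa index "$asm" || return 1

    # Align with the -5SPM options named in the text, then sort by read name.
    bwa mem -5SPM -t "$threads" "$asm" "$r1" "$r2" |
        samtools sort -n -@ "$threads" -o "$prefix.namesorted.bam" - || return 1

    # -r drops unmapped reads and secondary alignments; -m adds the mate score tag.
    samtools fixmate -r -m "$prefix.namesorted.bam" "$prefix.fixmate.bam" || return 1

    # Assumption: sort by coordinate, because markdup needs position-sorted input.
    samtools sort -@ "$threads" -o "$prefix.possorted.bam" "$prefix.fixmate.bam" || return 1

    # -r removes duplicates instead of only flagging them.
    samtools markdup -r "$prefix.possorted.bam" "$prefix.dedup.bam"
}
```

The function would be called once per pseudo-haplotype, each time with the Hi-C read set filtered for that haplotype. The parameters actually used are those in the EBP-Nor pipeline repository.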

We annotated the genome assemblies using a pre-release version of the EBP-Nor genome annotation pipeline (https://github.com/ebp-nor/GenomeAnnotation). First, AGAT (https://zenodo.org/record/7255559) agat_sp_keep_longest_isoform.pl and agat_sp_extract_sequences.pl were used on the sea lamprey (GCA_010993605.1) genome assembly and annotation to generate one protein (the longest isoform) per gene. Miniprot (Li, 2023) was used to align the proteins to the curated assemblies. Proteins from UniProtKB/Swiss-Prot (The UniProt Consortium, 2022) release 2022_03 and from the Vertebrata part of OrthoDB v11 (Kuznetsov et al., 2022) were also aligned separately to the assemblies. Red (Girgis, 2015) was run via redmask (https://github.com/nextgenusfs/redmask) on the assemblies to mask repetitive regions. GALBA (Brůna et al., 2023; Buchfink et al., 2015; Hoff and Stanke, 2018; Li, 2023; Stanke et al., 2006) was run with the sea lamprey proteins using the miniprot mode on the masked assemblies. The funannotate-runEVM.py script from Funannotate was used to run EvidenceModeler (Haas et al., 2008) on the alignments of sea lamprey proteins, UniProtKB/Swiss-Prot proteins, Vertebrata proteins and the predicted genes from GALBA. The resulting predicted proteins were compared to the protein repeats that Funannotate distributes using DIAMOND blastp, and the predicted genes were filtered based on this comparison using AGAT. The filtered proteins were compared to UniProtKB/Swiss-Prot release 2022_03 using DIAMOND (Buchfink et al., 2015) blastp to find gene names, and InterProScan was used to discover functional domains. AGAT's agat_sp_manage_functional_annotation.pl was used to attach the gene names and functional annotations to the predicted genes.

The data underlying these assemblies are available under ENA BioProjects PRJEB77187 and PRJEB77192. Raw PacBio sequencing data for L. fluviatilis (ENA BioSample: SAMEA115797768) are deposited in ENA under ERX12712303, ERX12712308 and ERX12712309, while Illumina Hi-C sequencing data are deposited in ENA under ERX12712501. For L. fluviatilis, pseudo-haplotype one can be found in ENA at PRJEB77117 and pseudo-haplotype two at PRJEB77186. Raw PacBio sequencing data for L. planeri (ENA BioSample: SAMEA115802553) are deposited in ENA under ERX12713780, ERX12713797 and ERX12713807, while Illumina Hi-C sequencing data are deposited in ENA under ERX12714064. For L. planeri, pseudo-haplotype one can be found in ENA at PRJEB77190 and pseudo-haplotype two at PRJEB77191.

We downloaded the assembly of the other river lamprey (kcLamFluv1) from NCBI like this:

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/964/198/595/GCA_964198595.1_kcLamFluv1.1/GCA_964198595.1_kcLamFluv1.1_genomic.fna.gz

We then simplified the FASTA headers like this before annotation:

zcat GCA_964198595.1_kcLamFluv1.1_genomic.fna.gz |cut -f 1 -d " " > GCA_964198595.1_kcLamFluv1.1_genomic.anno.fna

Similarly for sea lamprey:

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/010/993/605/GCF_010993605.1_kPetMar1.pri/GCF_010993605.1_kPetMar1.pri_genomic.fna.gz

zcat GCF_010993605.1_kPetMar1.pri_genomic.fna.gz |cut -f 1 -d " " > GCF_010993605.1_kPetMar1.pri_genomic.anno.fna
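
To illustrate what the header simplification does: cut -f 1 -d " " keeps only the text before the first space on each line, so every FASTA header is reduced to its accession, while sequence lines (which contain no spaces) pass through unchanged. The sketch below shows this on a toy file. The helper name and the toy accession are made up for illustration, and gzip -dc is used in place of zcat because it behaves the same way on Linux and is more portable.

```shell
# Hypothetical helper: keep only the first space-delimited field of every line.
# Sequence lines contain no spaces, so only the headers change.
simplify_fasta_headers() {
    gzip -dc "$1" | cut -f 1 -d " "
}

# Toy input with a made-up accession and description.
tmpdir=$(mktemp -d)
printf '>SEQ000001.1 Example species genome assembly, chromosome: 1\nACGTACGT\nTTGA\n' |
    gzip > "$tmpdir/toy.fna.gz"

simplify_fasta_headers "$tmpdir/toy.fna.gz" > "$tmpdir/toy.anno.fna"
cat "$tmpdir/toy.anno.fna"

# Sanity check: no header should still contain a space.
if grep '^>' "$tmpdir/toy.anno.fna" | grep -q ' '; then
    echo "headers still contain spaces" >&2
fi
```

Short headers of this kind avoid problems with downstream tools that treat the whole header line, or only its first word, as the sequence name.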

Most of the tools used in this project have been implemented in the pipelines mentioned, and their parameters can be viewed there. However, we also ran some other tools, which in total amounted to just a few lines of code. Instead of creating a GitHub repository for those lines of code, we show them here:

The predicted proteins from all the individuals (from hap1 for river and brook lamprey) were put in a folder, and OrthoFinder was run like this:

orthofinder -a 40 -M msa -A mafft -T iqtree -t 40 -f only_one > orthofinder_only_one_msa.out 2> orthofinder_only_one_msa.err

ASTRAL-Pro was then run on the OrthoFinder results:

mkdir -p astral_only_one

cat only_one/OrthoFinder/Results_Sep20_1/Resolved_Gene_Trees/* > astral_only_one/raw.tree

cat astral_only_one/raw.tree |sed "s/kPetMar1_proteins_mod_//g" |sed "s/kcLamFluv2_1_hap1_proteins_mod_//g" | \
sed "s/kcLamPlan1_1_hap1_proteins_mod_//g" |sed "s/kcLamFluv1_proteins_mod_//g" > astral_only_one/mod.tree

grep ">" only_one/kcLamFluv1.proteins.mod.fa | tr -d ">" |awk '{print $1 "\tLF1"}' > astral_only_one/species.map

grep ">" only_one/kcLamFluv2.1.hap1.proteins.mod.fa | tr -d ">" |awk '{print $1 "\tLF2"}' >> astral_only_one/species.map

grep ">" only_one/kcLamPlan1.1.hap1.proteins.mod.fa | tr -d ">" |awk '{print $1 "\tLP1"}' >> astral_only_one/species.map

grep ">" only_one/kPetMar1.proteins.mod.fa | tr -d ">" |awk '{print $1 "\tPM1"}' >> astral_only_one/species.map

astral-pro -u 3 -R -i astral_only_one/mod.tree -a astral_only_one/species.map -o astral_only_one/astral_pro_u3.out 1> astral_pro_u3.out 2> astral_pro_u3.err

cp freqQuad.csv freqQuad_astral_pro.csv
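
Because the trees and the species map above are built by separate commands, it can be useful to confirm that every leaf name left in mod.tree has an entry in species.map. The following is a minimal sketch of such a check and was not part of the original analysis. The helper name and the toy gene IDs are hypothetical, and the leaf-name pattern assumes Newick trees with branch lengths, where each leaf name follows "(" or "," and ends at ":".

```shell
# Hypothetical helper: print leaf names found in the Newick trees that have
# no entry in the first column of the tab-separated gene-to-species map.
unmapped_leaves() {
    leaves=$(mktemp)
    genes=$(mktemp)
    # Leaf names follow "(" or ","; internal node labels follow ")" and are skipped.
    grep -o '[(,][^(),:;]\{1,\}:' "$1" | sed 's/^[(,]//; s/:$//' | sort -u > "$leaves"
    cut -f 1 "$2" | sort -u > "$genes"
    comm -23 "$leaves" "$genes"
    rm -f "$leaves" "$genes"
}

# Toy example: one gene tree with prefixed leaf names (gene IDs are made up).
tmpdir=$(mktemp -d)
printf '%s\n' \
    '((kPetMar1_proteins_mod_g1:0.1,kcLamFluv1_proteins_mod_g2:0.2)0.9:0.1,kcLamPlan1_1_hap1_proteins_mod_g3:0.3);' \
    > "$tmpdir/raw.tree"

# Strip the prefixes with the same substitutions as above.
sed -e "s/kPetMar1_proteins_mod_//g" -e "s/kcLamFluv2_1_hap1_proteins_mod_//g" \
    -e "s/kcLamPlan1_1_hap1_proteins_mod_//g" -e "s/kcLamFluv1_proteins_mod_//g" \
    "$tmpdir/raw.tree" > "$tmpdir/mod.tree"

# The map deliberately lacks g3, so the check reports it.
printf 'g1\tPM1\ng2\tLF1\n' > "$tmpdir/species.map"
unmapped_leaves "$tmpdir/mod.tree" "$tmpdir/species.map"
```

On the real data the call would be unmapped_leaves astral_only_one/mod.tree astral_only_one/species.map, and an empty result means every leaf is covered by the map.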

 

List of files provided here and their description:

kPetMar1.gff.gz - the annotation of sea lamprey

kPetMar1.proteins.fa.gz - predicted proteins from the annotation of sea lamprey

kcLamFluv1.gff.gz - the annotation of UK river lamprey

kcLamFluv1.proteins.fa.gz - predicted proteins from the annotation of UK river lamprey

kcLamFluv2.2.hap1.fa.gz - genome assembly (hap1) of river lamprey

kcLamFluv2.2.hap1.gff.gz - the annotation of river lamprey (hap1)

kcLamFluv2.2.hap1.proteins.fa.gz - predicted proteins from the annotation of river lamprey (hap1)

kcLamFluv2.2.hap2.fa.gz - genome assembly (hap2) of river lamprey

kcLamFluv2.2.hap2.gff.gz - the annotation of river lamprey (hap2)

kcLamFluv2.2.hap2.proteins.fa.gz - predicted proteins from the annotation of river lamprey (hap2)

kcLamPlan1.2.hap1.fa.gz - genome assembly (hap1) of brook lamprey

kcLamPlan1.2.hap1.gff.gz - the annotation of brook lamprey (hap1)

kcLamPlan1.2.hap1.proteins.fa.gz - predicted proteins from the annotation of brook lamprey (hap1)

kcLamPlan1.2.hap2.fa.gz - genome assembly (hap2) of brook lamprey

kcLamPlan1.2.hap2.gff.gz - the annotation of brook lamprey (hap2)

kcLamPlan1.2.hap2.proteins.fa.gz - predicted proteins from the annotation of brook lamprey (hap2)
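
For a quick look at the provided files, the number of annotated genes and predicted proteins can be counted directly from the gzipped files. This is a minimal sketch with hypothetical helper names. It assumes the GFF files use the feature type "gene" in the third column, which is common for GFF3 but not stated in this description.

```shell
# Hypothetical helper: count "gene" features in a gzipped GFF file.
# Column 3 of a GFF line holds the feature type; comment lines start with "#".
count_gff_genes() {
    gzip -dc "$1" | awk -F '\t' '!/^#/ && $3 == "gene"' | wc -l | tr -d ' '
}

# Hypothetical helper: count sequences (header lines) in a gzipped FASTA file.
count_fasta_records() {
    gzip -dc "$1" | grep -c '^>'
}

# Example calls, to be run after downloading the files:
#   count_gff_genes kPetMar1.gff.gz
#   count_fasta_records kPetMar1.proteins.fa.gz
```

The two counts need not be identical, since a gene can have more than one predicted protein or none.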

 

Files (1.2 GB)

md5:93f63a470d57de5a4931d8d50b0565aa  6.5 MB
md5:9d046635cc70becc5c7d94ff3011883b  6.2 MB
md5:6d999bdee51e9b4490730df42731ded9  295.3 MB
md5:b827f72baa1c5e079c8820fa67a74b64  6.7 MB
md5:0b224ad18c4db43de1927f7e73d14c4e  5.8 MB
md5:93ba6f6b79d6aec1dd16f8f78fee05d8  274.1 MB
md5:474fa09e256a9df977e5c5978fa1c208  5.4 MB
md5:51c1ffc89cb3a53b396f04a1bcab18cf  6.5 MB
md5:39de96f3dc60bad2f7d9d1e587b7acb7  292.6 MB
md5:06df74a67197a3be0b830cff6d6dde38  6.4 MB
md5:b4c721ae0eaa63aa701f38f5a485fc59  6.3 MB
md5:16b0114894ed7063c94a83a3a7c62f80  276.9 MB
md5:e3c919322d9fe0dac1a6c57e546b854e  6.4 MB
md5:1f922a5b5e19b8836336f65c1ba21f21  6.7 MB
md5:20d4da8e4a6f24f6a8da7dd6cd3c614b  7.0 MB
md5:e7dc5e9990ff1839cda08a45d81cb845  6.8 MB
md5:723bb48d07790770e1784c92901d6f72  8.5 kB
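
A downloaded file can be checked against its MD5 checksum with a small helper like the one below. This is a sketch under stated assumptions: the helper name is hypothetical, md5sum from GNU coreutils is assumed to be installed, and the pairing of each checksum with its file name has to be taken from the record itself, since the checksums are listed here without file names.

```shell
# Hypothetical helper: compare a file's MD5 checksum with an expected value.
# Assumes md5sum (GNU coreutils) is available.
verify_md5() {
    file=$1
    expected=${2#md5:}    # accept the value with or without the "md5:" prefix
    actual=$(md5sum "$file" | cut -d ' ' -f 1)
    if [ "$actual" = "$expected" ]; then
        echo "OK $file"
    else
        echo "MISMATCH $file (got $actual, expected $expected)" >&2
        return 1
    fi
}

# Example call, with the checksum listed for that file on the record page:
#   verify_md5 kPetMar1.gff.gz md5:<checksum>
```

A mismatch usually indicates an incomplete or corrupted download, in which case the file should be fetched again.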