Gene and repeat annotation for common eider (Somateria mollissima)

doi:10.5281/zenodo.11159637

Published May 9, 2024 | Version v1

Dataset Open

Tørresen, Ole K. (Researcher)¹

1. University of Oslo

Here we provide the gene and repeat annotation for common eider (Somateria mollissima). Unfortunately, it is currently not possible to upload repeat annotation tracks to an international nucleotide sequence database such as ENA. While uploading the gene annotation is possible, some of the cross-references to other databases in the functional annotation are removed. Furthermore, the sequence names in the publicly available genome assemblies on ENA differ from those used in the annotation tracks here, so we also provide the FASTA files for the assemblies. Ideally, all of this would have been available via ENA.
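Because the annotation tracks only make sense together with an assembly that uses the same sequence names, it can be useful to confirm that every name in a GFF file also occurs in the FASTA file it is paired with. The following is a minimal sketch in Python, not part of the dataset; the function names are made up for illustration, and it assumes standard GFF3 (sequence name in column 1) and FASTA (name is the first word after ">") conventions.

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed file in text mode."""
    path = Path(path)
    return gzip.open(path, "rt") if path.suffix == ".gz" else open(path, "rt")


def fasta_ids(path):
    """Collect sequence names: the first word of each '>' header line."""
    ids = set()
    with open_text(path) as handle:
        for line in handle:
            if line.startswith(">"):
                words = line[1:].split()
                if words:
                    ids.add(words[0])
    return ids


def gff_seqids(path):
    """Collect the values of column 1 (seqid) from all feature lines."""
    ids = set()
    with open_text(path) as handle:
        for line in handle:
            if line.startswith("##FASTA"):
                break  # an embedded FASTA section ends the feature table
            if line.startswith("#") or not line.strip():
                continue  # skip directives, comments and blank lines
            ids.add(line.split("\t", 1)[0])
    return ids


def missing_seqids(gff_path, fasta_path):
    """Return the GFF sequence names that have no matching FASTA record."""
    return sorted(gff_seqids(gff_path) - fasta_ids(fasta_path))
```

With the files from this record, a call such as missing_seqids("bSomMol1.1.h1.annotation.gff.gz", "bSomMol1.1.h1.fasta.gz") should return an empty list if the names agree as described; running the same check against a FASTA file downloaded from ENA would be expected to report mismatches.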

We annotated the genome assemblies using a pre-release version of the EBP-Nor genome annotation pipeline (https://github.com/ebp-nor/GenomeAnnotation). First, the AGAT (https://zenodo.org/record/7255559) scripts agat_sp_keep_longest_isoform.pl and agat_sp_extract_sequences.pl were run on the GRCg7b (GCA_016699485.1) chicken genome assembly and annotation to generate one protein (the longest isoform) per gene. Miniprot (Li, 2023) was used to align these proteins to the curated assemblies. Proteins from UniProtKB/Swiss-Prot (Consortium et al., 2022) release 2022_03 and from the Vertebrata part of OrthoDB v11 (Kuznetsov et al., 2022) were also aligned separately to the assemblies.

Red (Girgis, 2015) was run via redmask (https://github.com/nextgenusfs/redmask) on the assemblies to mask repetitive regions. In addition, we ran Earl Grey (Baril et al., 2023) to annotate transposable elements.

GALBA (Brůna et al., 2023; Buchfink et al., 2015; Hoff and Stanke, 2018; Li, 2023; Stanke et al., 2006) was run with the chicken proteins using the miniprot mode on the masked assemblies. The funannotate-runEVM.py script from Funannotate was used to run EVidenceModeler (Haas et al., 2008) on the alignments of chicken proteins, UniProtKB/Swiss-Prot proteins and Vertebrata proteins, together with the predicted genes from GALBA. The resulting predicted proteins were compared to the protein repeats that Funannotate distributes using DIAMOND blastp, and the predicted genes were filtered based on this comparison using AGAT.

The filtered proteins were compared to UniProtKB/Swiss-Prot release 2022_03 using DIAMOND (Buchfink et al., 2015) blastp to find gene names, and InterProScan was used to discover functional domains. AGAT's agat_sp_manage_functional_annotation.pl was used to attach the gene names and functional annotations to the predicted genes. EMBLmyGFF3 (Norling et al., 2018) was used to combine the FASTA files and GFF3 files into EMBL format for submission to ENA.
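To illustrate what "one protein (the longest isoform) per gene" means in practice, the sketch below picks, for each gene in a GFF3 file, the transcript with the largest summed CDS length. This is only an illustration of the idea in Python and is not AGAT's implementation; the function names are hypothetical, the exact criteria used by agat_sp_keep_longest_isoform.pl may differ, and the code assumes conventional GFF3 with mRNA or transcript features whose Parent is a gene and CDS features whose Parent is a transcript.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Turn a GFF3 attribute column such as 'ID=x;Parent=y' into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def longest_isoform_per_gene(gff_lines):
    """Return {gene_id: transcript_id} for the transcript with most CDS bases."""
    transcript_gene = {}           # transcript ID -> parent gene ID
    cds_length = defaultdict(int)  # transcript ID -> summed CDS length
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        ftype, start, end = cols[2], int(cols[3]), int(cols[4])
        attrs = parse_attributes(cols[8])
        if ftype in ("mRNA", "transcript"):
            if "ID" in attrs and "Parent" in attrs:
                transcript_gene[attrs["ID"]] = attrs["Parent"]
        elif ftype == "CDS":
            # A CDS segment may list several parent transcripts.
            for parent in attrs.get("Parent", "").split(","):
                if parent:
                    cds_length[parent] += end - start + 1  # inclusive coords
    best = {}
    for tid, gid in transcript_gene.items():
        # On ties, the transcript seen first in the file is kept.
        if gid not in best or cds_length[tid] > cds_length[best[gid]]:
            best[gid] = tid
    return best
```

The function accepts any iterable of lines, so it can be fed an open file handle or a list of strings. Extracting the corresponding protein sequences is a separate step, which the pipeline described above performs with agat_sp_extract_sequences.pl.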

The assemblies provided here can also be found at ENA under accessions PRJEB61097 (pseudo-haplotype one with sex chromosomes; https://www.ebi.ac.uk/ena/browser/view/PRJEB61097) and PRJEB62037 (pseudo-haplotype two; https://www.ebi.ac.uk/ena/browser/view/PRJEB62037).

Files (702.7 MB)

Name	MD5	Size
bSomMol1.1.h1.annotation.gff.gz	3e9e07103ca3f02ba96bcaeb6a8da149	6.1 MB
bSomMol1.1.h1.eg.filteredRepeats.gff.gz	d9e4d37169395ead4aaee8eab8aa934e	13.7 MB
bSomMol1.1.h1.fasta.gz	2d5d3896f972d548803d341c0c286602	346.8 MB
bSomMol1.1.h2.annotation.gff.gz	b2fe6f665b3da7a70f5325e237e91e64	5.8 MB
bSomMol1.1.h2.eg.filteredRepeats.gff.gz	2f684d39d1076dd749a12bc75556cbc7	12.4 MB
bSomMol1.1.h2.fasta.gz	da6eba9821a57a13291ca67aaf4b6bb6	318.0 MB
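
The MD5 checksums listed above can be used to confirm that a download completed without corruption. The following is a minimal sketch in Python using only the standard library; the file names and checksums are copied from the listing, while the function names and the assumption that all files sit in one directory are illustrative.

```python
import hashlib
from pathlib import Path

# File names and MD5 checksums as given in the file listing of this record.
EXPECTED_MD5 = {
    "bSomMol1.1.h1.annotation.gff.gz": "3e9e07103ca3f02ba96bcaeb6a8da149",
    "bSomMol1.1.h1.eg.filteredRepeats.gff.gz": "d9e4d37169395ead4aaee8eab8aa934e",
    "bSomMol1.1.h1.fasta.gz": "2d5d3896f972d548803d341c0c286602",
    "bSomMol1.1.h2.annotation.gff.gz": "b2fe6f665b3da7a70f5325e237e91e64",
    "bSomMol1.1.h2.eg.filteredRepeats.gff.gz": "2f684d39d1076dd749a12bc75556cbc7",
    "bSomMol1.1.h2.fasta.gz": "da6eba9821a57a13291ca67aaf4b6bb6",
}


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Return {file name: 'ok' | 'mismatch' | 'missing'} for the listed files."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if not path.exists():
            results[name] = "missing"
        else:
            results[name] = "ok" if md5sum(path) == expected else "mismatch"
    return results
```

Calling verify() in the download directory returns one status per file; anything other than "ok" suggests the file should be downloaded again.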
