Published July 20, 2023 | Version v1
Dataset Open

Single-copy orthologous genes used for Ricefish phylogeny

  • 1. Leibniz Institute for the Analysis of Biodiversity Change, Bonn, Germany
  • 2. LOEWE Center for Translational Biodiversity Genomics, Frankfurt, Germany
  • 3. Museum Zoologicum Bogoriense, Research Center for Biosystematics and Evolution, National Research and Innovation Agency (BRIN), Bogor, Indonesia
  • 4. University of Cologne, Cologne Center for Genomics (CCG)
  • 5. Carl von Ossietzky University Oldenburg, Oldenburg, Germany

Description

Ortholog set

We generated a reference set of 8390 single-copy protein-coding genes derived from OrthoDB v.9.1 (Waterhouse et al., 2013) that were available for the following species: Austrofundulus limnaeus, Centrocoris variegatus, Fundulus heteroclitus, Kryptolebias marmoratus, Nothobranchius furzeri, Oryzias latipes, O. melastigma, Poecilia formosa, P. latipinna, P. mexicana, P. reticulata and Xiphophorus maculatus (NCBI accession numbers in Tab. S7). The hierarchical split was set to Actinopterygii (ID 7898). We used the script “make-ogs-corresponding.pl” to check for inconsistencies between the amino acid sequences and the corresponding nucleotide sequences and removed 96 problematic genes (Tab. S7).
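The kind of consistency check described above can be illustrated with a minimal Python sketch. It does not reproduce the logic of “make-ogs-corresponding.pl”, which is not described here. It assumes the standard genetic code, and the function names `translate` and `is_consistent` as well as the example sequences are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), in the canonical TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides: str) -> str:
    """Translate a coding sequence; codons with ambiguous bases become X."""
    seq = nucleotides.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    codons = (seq[i:i + 3] for i in range(0, usable, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def is_consistent(protein: str, nucleotides: str) -> bool:
    """Return True if the nucleotide sequence encodes the given protein.

    Assumption of this sketch: a trailing stop codon is tolerated and
    X is treated as a wildcard on either side.
    """
    if len(nucleotides) % 3 != 0:
        return False
    translated = translate(nucleotides).rstrip("*")
    expected = protein.upper().rstrip("*")
    if len(translated) != len(expected):
        return False
    return all(a == b or "X" in (a, b) for a, b in zip(translated, expected))


if __name__ == "__main__":
    print(is_consistent("MKV", "ATGAAAGTTTAA"))  # matching pair
    print(is_consistent("MKL", "ATGAAAGTT"))     # third residue differs
```

Genes for which a check of this kind fails would be the candidates for removal.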

Identification of orthologs for transcripts and genome and alignment of single-copy genes

Ortholog identification among 16 ricefish species and four outgroups (DS1, supplementary tables Tab. S1a) was carried out with Orthograph v0.7.1 (Petersen et al., 2017). The forward search for candidate transcripts was left at default settings. For the best reciprocal hit search, candidate orthologs needed at least one hit in either O. latipes or O. melastigma, and hits were concatenated if they met the criteria and did not overlap. Both max-blast-searches and blast-max-hits were set to 50. “U” in the amino acid sequences was changed to “X” to avoid issues in downstream analyses. The results of the orthology prediction were summarized for all species using a Perl script provided with the Orthograph package. Only orthologs present in all species were aligned at the amino acid level using MAFFT v7.221 with the L-INS-i algorithm (Katoh & Standley, 2013). Following Misof et al. (2014), we identified 915 orthologs containing outlier sequences and removed them from further analysis. We used the amino acid alignments as a blueprint to generate corresponding nucleotide alignments with a modified version of Pal2Nal v14 (Misof et al., 2014; Suyama et al., 2006). To check each amino acid alignment for ambiguously aligned regions, we ran ALISCORE v2.0 with the maximum possible number of sequence pairs selected for comparison (-r) (Kück et al., 2010; Misof et al., 2014; Misof & Misof, 2009). Sites that needed masking were removed from the amino acid alignments, and correspondingly from the nucleotide alignments, using ALICUT v2.3 (Kück, 2009). For further analyses we proceeded only with the nucleotide data set.
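Three of the steps above can be sketched in a few lines of Python: replacing “U” with “X”, using an amino acid alignment as a blueprint for a nucleotide alignment, and removing masked sites from both. This is an illustration only and not the implementation of Orthograph, Pal2Nal, ALISCORE or ALICUT. The function names, the 0-based column indices and the example sequences are hypothetical, and the sketch assumes that each coding sequence has no stop codon and is exactly three times as long as its ungapped protein.

```python
def mask_selenocysteine(protein: str) -> str:
    """Replace selenocysteine (U) with the ambiguity symbol X."""
    return protein.replace("U", "X").replace("u", "x")


def codon_align(aligned_protein: str, cds: str) -> str:
    """Thread an unaligned coding sequence onto an aligned protein.

    Each residue takes the next codon; each protein gap becomes '---'.
    """
    residues = sum(1 for ch in aligned_protein if ch != "-")
    if len(cds) != 3 * residues:
        raise ValueError("coding sequence length does not match the protein")
    pieces, pos = [], 0
    for ch in aligned_protein:
        if ch == "-":
            pieces.append("---")
        else:
            pieces.append(cds[pos:pos + 3])
            pos += 3
    return "".join(pieces)


def cut_columns(aligned_protein: str, aligned_codons: str, masked_columns):
    """Drop masked amino acid columns and the matching codon triplets."""
    masked = set(masked_columns)
    keep = [i for i in range(len(aligned_protein)) if i not in masked]
    protein = "".join(aligned_protein[i] for i in keep)
    codons = "".join(aligned_codons[3 * i:3 * i + 3] for i in keep)
    return protein, codons


if __name__ == "__main__":
    aligned = mask_selenocysteine("MK-V")
    codons = codon_align(aligned, "ATGAAAGTT")
    print(codons)
    print(cut_columns(aligned, codons, masked_columns=[2]))
```

Because every amino acid column maps to exactly one codon triplet, cutting a column from the protein alignment removes three nucleotide columns, which keeps both alignments in register.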

Notes

This research was funded by an ERC starting grant to Arne W. Nolte ("Evolmapping") and a Leibniz-Gemeinschaft SAW grant to Julia Schwarzer (SAW-Ricefish P91/2016).

Files

single-copy_orth_ricefish_genes_1907.zip

Files (6.2 MB)

md5:23882170e05073c7fa58cfb58f0446d9
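The MD5 checksum listed for the archive can be used to verify a download. The following is a minimal sketch using Python's standard `hashlib` module; the function names `md5_of_file` and `verify_download` are hypothetical, and the local file path is an assumption that depends on where the archive was saved.

```python
import hashlib

# Checksum as listed for single-copy_orth_ricefish_genes_1907.zip.
EXPECTED_MD5 = "23882170e05073c7fa58cfb58f0446d9"


def md5_of_file(path, chunk_size=1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5) -> bool:
    """Return True if the file at `path` has the expected MD5 checksum."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    import sys

    # Usage: pass the path of the downloaded archive as the first argument.
    if len(sys.argv) > 1:
        print(verify_download(sys.argv[1]))
```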

Additional details

Related works

Is cited by
Preprint: 10.1101/2022.07.05.498713 (DOI)