Published July 21, 2021 | Version v1
Dataset Open

Phylo-k-mers databases for SHERPAS

  • 1. Montpellier Laboratory of Informatics, Robotics and Microelectronics

Description

SHERPAS is a new program to identify novel recombinant sequences in a large collection of viral sequences, and to provide a first estimate of their recombinant structure. SHERPAS is much faster than other recombination detection software; its main feature is the use of a pre-computed database of "phylogenetically-informed k-mers" (or phylo-k-mers). Computing this phylo-k-mer database is computationally heavy, but it only needs to be done once for a given reference alignment.

A phylo-k-mer database can be built from any reference alignment and a phylogenetic tree inferred from that alignment, using RAPPAS2 (https://github.com/phylo42/rappas2). We provide here three ready-to-use databases, one for each of three reference alignments:
-An alignment of 167 sequences of the pol region of the HIV genome, provided with the program SCUEAL, accessible at https://github.com/spond/SCUEAL/blob/master/data/pol2009.nex
-An alignment of 339 sequences of the whole HBV genome, provided with the program jpHMM, accessible at http://jphmm.gobics.de/download.html.
-An alignment of 881 sequences of the whole HIV genome, also provided with jpHMM, accessible at http://jphmm.gobics.de/download.html.

For each of these alignments, we provide a .zip file containing three files: the phylo-k-mer database (.rps file), the reference phylogenetic tree used to build the database (.tree file), and a table associating each reference sequence with a strain of the virus (.csv file). Details on how the database and the tree were constructed, and on the origin of the information reported in the table, can be found in the Supplementary Materials associated with the original Bioinformatics publication.

Notes

Structure of the dataset:

For the pol region of the HIV genome (167 reference sequences):
    pkDB-HIV-pol.zip
            DB_k10_o1.5.rps
            pol2009-GTRtree.tree
            ref-group.csv

For the whole HBV genome (339 reference sequences):
    pkDB-HBV-full.zip
            DB_k10_o1.5.rps
            HBV_tree.tree
            ref-group.csv

For the whole HIV genome (881 reference sequences):
    pkDB-HIV-full.zip
            DB_k10_o1.5.rps
            HIV-alignment.tree
            ref-group.csv
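
As a quick sanity check after downloading, the layout above can be verified without unpacking a multi-gigabyte database. The following is a minimal Python sketch and is not part of SHERPAS or RAPPAS2: the function name `find_database_files` is hypothetical, and the only assumption is that each archive holds exactly one .rps, one .tree, and one .csv file, as listed above.

```python
import zipfile
from pathlib import PurePosixPath

# File types each archive is expected to contain, per the listing above.
EXPECTED_SUFFIXES = (".rps", ".tree", ".csv")


def find_database_files(zip_path):
    """Map each expected suffix to the archive member that carries it.

    Only the archive index is read, so nothing is extracted.
    Raises ValueError if a suffix is missing or appears more than once.
    """
    found = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # Skip directory entries and macOS resource-fork copies.
            if name.endswith("/") or name.startswith("__MACOSX/"):
                continue
            suffix = PurePosixPath(name).suffix
            if suffix not in EXPECTED_SUFFIXES:
                continue
            if suffix in found:
                raise ValueError(f"more than one {suffix} file in {zip_path}")
            found[suffix] = name
    missing = [s for s in EXPECTED_SUFFIXES if s not in found]
    if missing:
        raise ValueError(f"{zip_path} lacks: {', '.join(missing)}")
    return found
```

For example, `find_database_files("pkDB-HIV-pol.zip")` would return a dictionary keyed by ".rps", ".tree", and ".csv", whose values are the member paths inside the archive.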

 

SHERPAS download and documentation:

https://github.com/phylo42/sherpas

Funding provided by: ANR
Crossref Funder Registry ID: http://dx.doi.org/10.13039/501100001665

Funding provided by: The French National Research Agency
Crossref Funder Registry ID:

Funding provided by: "Investissements d'avenir" programme
Crossref Funder Registry ID:
Award Number: ANR-16-IDEX-0006

Files

pkDB-HBV-full.zip

Files (7.7 GB)

Name Size
md5:18b6f12936ede678a0d545022a8d64be 667.6 MB
md5:b94ef668a2e90e71ee21d4c740122c4e 6.7 GB
md5:537880a91f6c4b264812abb80d8e02e8 421.3 MB
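
Because the archives are large, it may be worth checking a download against the MD5 checksums listed above before using it. The following is a minimal Python sketch under stated assumptions: the function names `md5_of_file` and `matches_listed_md5` are hypothetical, and since the listing does not show which archive each checksum belongs to, pairing a file with its checksum is left to the reader.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file.

    The file is read in chunks so that a multi-gigabyte archive
    never has to fit in memory at once.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed):
    """Compare a file against a checksum written as 'md5:<hex digest>'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

A call such as `matches_listed_md5(local_zip, "md5:18b6f12936ede678a0d545022a8d64be")`, where `local_zip` is the path to the corresponding downloaded archive, returns `True` only if the file is intact.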

Additional details

Related works