Published December 28, 2022 | Version 1.0
Dataset
Open
Datasets of the manuscript "Rational design of profile HMMs for sensitive and specific sequence detection with case studies applied to viruses, bacteriophages, and casposons"
Authors/Creators
- 1. University of São Paulo, Brazil
- 2. Universidad de los Andes, Colombia
- 3. Friedrich-Schiller-University Jena, Germany
Description
DATASETS
Rational design of profile HMMs for sensitive and specific sequence detection with case studies applied to viruses, bacteriophages, and casposons
Liliane S. Oliveira, Alejandro Reyes, Bas E. Dutilh and Arthur Gruber*
* Correspondence: argruber@usp.br (AG); Tel. +55 11 3091 7274
Here we provide the Microviridae, Flavivirus, and casposon datasets used throughout the work:
- Microviridae folder
  - conserved_HMMs – profile HMMs constructed with TABAJARA in Conservation mode for Microviridae
  - discriminative_HMMs – profile HMMs constructed with TABAJARA in Discrimination mode for Microviridae
  - sequences – sequence datasets and their respective multiple sequence alignments
    - Microviridae_113-seq_training_set.fasta – 113 VP1 sequences covering the diversity of the Microviridae family
    - Microviridae_113-seq.aln – multiple sequence alignment of the 113-protein dataset
    - Microviridae_1836-seq_testset.fasta – 1,836 sequences of the major capsid protein (VP1), comprising 501 Alpavirinae, 1,040 Gokushovirinae, and 295 Pichovirinae sequences
    - Microviridae_1866-seq.aln – multiple sequence alignment of the 1,866-protein Microviridae dataset used in the experiment of Figure 4
- Flavivirus folder
  - conserved_HMMs – profile HMMs constructed with TABAJARA in Conservation mode for Flavivirus
  - discriminative_HMMs – profile HMMs constructed with TABAJARA in Discrimination mode for Flavivirus
    - full-length – models constructed from full-length protein sequences
    - short – models constructed from selected short alignment blocks of the protein sequences
  - sequences – sequence datasets and their respective multiple sequence alignments
    - Flavivirus_127-seq_training_set.fasta – 127 polyprotein sequences covering the species diversity of the genus Flavivirus
    - Flavivirus_127-seq.aln – multiple sequence alignment of the 127-protein dataset
    - Flavivirus_6364-seq_testset.fasta – 6,364 sequences covering the species diversity of Flavivirus: 3,919 of dengue virus (DENV), 327 of Zika virus (ZIKV), 63 of yellow fever virus (YFV), and 2,055 of other available flaviviruses
    - Flavivirus_6364-seq.aln – multiple sequence alignment of the 6,364-protein Flavivirus dataset
- Casposons folder
  - casposon_generic_HMMs – profile HMMs constructed with TABAJARA in Discrimination mode for the generic detection of all casposons and their discrimination from CRISPRs
  - casposon_family_discriminative_HMMs – profile HMMs constructed with TABAJARA in Discrimination mode for discrimination among casposon families and from CRISPRs
  - sequences – sequence datasets and their respective multiple sequence alignments
    - casposons_crisprs.fasta – 106 bona fide Cas1 sequences derived from 52 CRISPRs and 54 casposons
    - casposon_family_discrimination.aln – multiple sequence alignment of the 52 bona fide CRISPR and 54 casposon sequences, with the nomenclature required to run TABAJARA for the discrimination of each casposon family
    - casposons_crisprs_discrimination.aln – multiple sequence alignment of the 52 bona fide CRISPR and 54 casposon sequences, with the nomenclature required to run TABAJARA for the discrimination of CRISPRs and casposons
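The sequence counts stated in the file list can be checked after unpacking the archive. The following is a minimal sketch, not part of the dataset: it assumes the `.fasta` files are plain-text FASTA with one `>` header line per sequence, and the helper names (`count_fasta_records`, `check_file`) are hypothetical. Paths inside the archive are not specified here, so the caller must supply them.

```python
from typing import Iterable

# Record counts as stated in the dataset description, keyed by file name.
EXPECTED_COUNTS = {
    "Microviridae_113-seq_training_set.fasta": 113,
    "Microviridae_1836-seq_testset.fasta": 1836,
    "Flavivirus_127-seq_training_set.fasta": 127,
    "Flavivirus_6364-seq_testset.fasta": 6364,
    "casposons_crisprs.fasta": 106,
}

# Subgroup sizes from the description add up to the stated totals.
assert 501 + 1040 + 295 == 1836          # Alpavirinae + Gokushovirinae + Pichovirinae
assert 3919 + 327 + 63 + 2055 == 6364    # DENV + ZIKV + YFV + other flaviviruses
assert 52 + 54 == 106                    # CRISPR + casposon Cas1 sequences


def count_fasta_records(lines: Iterable[str]) -> int:
    """Count FASTA records by counting header lines (those starting with '>')."""
    return sum(1 for line in lines if line.startswith(">"))


def check_file(path: str, expected: int) -> bool:
    """Return True if the FASTA file at `path` holds `expected` records."""
    with open(path) as handle:
        return count_fasta_records(handle) == expected
```

For example, `check_file(path, EXPECTED_COUNTS["casposons_crisprs.fasta"])` would compare one unpacked file against its stated count, where `path` is wherever that file ends up after extraction.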
Files
| Name | Size |
|---|---|
| Supplementary_data.zip (md5:84935ea937cda12611fd7859412a189c) | 4.5 MB |
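The MD5 checksum listed for Supplementary_data.zip can be used to confirm that a download is complete and uncorrupted. The sketch below uses Python's standard `hashlib` module; the function names and the default path (the archive in the current working directory) are assumptions made for illustration.

```python
import hashlib

# Checksum as listed on this record for Supplementary_data.zip.
EXPECTED_MD5 = "84935ea937cda12611fd7859412a189c"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path: str = "Supplementary_data.zip") -> bool:
    """Return True if the file at `path` matches the listed checksum."""
    return md5_of_file(path) == EXPECTED_MD5
```

If `archive_is_intact()` returns `False`, the download is likely truncated or corrupted and should be repeated before the archive is unpacked.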