CAZyme prediction in Ascomycetes yeast genomes guides discovery of novel xylanolytic species with diverse capacities of hemicellulose hydrolysis
- 1. Chalmers University of Technology
Description
Background
This data is part of our publication "title_placeholder" (link_placeholder). The fasta files are originally from another publication (https://doi.org/10.1016/j.cell.2018.10.023), with data hosted on Figshare (https://doi.org/10.6084/m9.figshare.5854692). We have, however, further processed those fasta files by clustering them at 98% identity and removing whitespace from the fasta headers. They are provided here to enable users to retrieve protein sequences for genes listed in the "332_yeast_genomes_enzyme_info_version_3.tsv" file.
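Retrieving protein sequences for a set of gene names can be sketched as below. This is a minimal illustration, not the authors' pipeline: it assumes that each fasta header (the text after ">") is exactly the gene name used in the tsv file, and the two-record fasta text and the gene names "gene_A" and "gene_B" are made up for the example.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open fasta file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fetch_sequences(handle, wanted):
    """Return {gene: sequence} for every header present in the set `wanted`."""
    return {h: s for h, s in read_fasta(handle) if h in wanted}


# Hypothetical fasta content standing in for one organism file.
example = io.StringIO(">gene_A\nMKT\nLLV\n>gene_B\nMSS\n")
found = fetch_sequences(example, {"gene_A"})
print(found)
```

For a real organism file, the `io.StringIO` object would be replaced by a file opened from the unpacked archive, and `wanted` by the gene names taken from the "gene" column for that organism.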
File description
The main output file is the tab-separated file "332_yeast_genomes_enzyme_info_version_3.tsv". Each row in the data file indicates one gene with a single corresponding hmm hit at a specific position in the gene. A gene can (and often does) occur multiple times, either with hits from different hmm models or with hits from the same hmm model at different positions inside the gene. Below follows a description of the data contained in each column of the output file. The name of each gene is specified, and the corresponding protein sequence can be obtained from the organism fasta files originating from the Figshare repository indicated above.
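Because one gene can appear on several rows, it can help to group the hits per gene before further analysis. The sketch below uses only the Python standard library and the column names documented in this description; the three sample rows (organism, gene, and model values) are invented for illustration and contain only a subset of the columns.

```python
import csv
import io
from collections import defaultdict


def hits_per_gene(handle):
    """Group hmm hits by (organism, gene) from a tab-separated file handle."""
    grouped = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        key = (row["organism"], row["gene"])
        grouped[key].append(
            (row["hmm_model"], int(row["gene_match_from"]), int(row["gene_match_to"]))
        )
    return grouped


# Made-up rows: gene_1 has two hits, gene_2 has one.
sample = io.StringIO(
    "organism\tgene\thmm_model\tgene_match_from\tgene_match_to\n"
    "organism_x\tgene_1\tGH5.hmm\t30\t310\n"
    "organism_x\tgene_1\tCBM1.hmm\t340\t375\n"
    "organism_x\tgene_2\tCE4.hmm\t12\t140\n"
)
grouped = hits_per_gene(sample)
for (organism, gene), hits in grouped.items():
    print(organism, gene, len(hits), hits)
```

With the real file, `sample` would be replaced by the tsv opened with `open(path, newline="")`.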
The columns in the output file "332_yeast_genomes_enzyme_info_version_3.tsv" are as follows:
column: organism
description: the organism name
value type: text
column: gene
description: the gene name as given inside fasta files in "protein_fasta.zip"
value type: text
column: hmm_model
description: the hmm model that gave the hit
value type: text
column: hmm_model_len
description: length of the hmm model, specified in the hmmer output file (there in the "qlen" column)
value type: integer
column: hmm_match_from
description: where in the hmm model the match with the gene starts, specified in the hmmer output file (there in the "hmm coord from" column)
value type: integer
column: hmm_match_to
description: where in the hmm model the match with the gene ends, specified in the hmmer output file (there in the "hmm coord to" column)
value type: integer
column: hmm_match_coverage
description: the fraction of the hmm model that matched the gene, ranging from 0.35 to 1.0, computed as ("hmm_match_to" - "hmm_match_from")/"hmm_model_len"
value type: float
column: match_evalue
description: the e-value of the hmm model hit, specified in the hmmer output file (there in the "Evalue" column)
value type: float, scientific notation
column: gene_match_from
description: where in the gene the hmm model match starts, specified in the hmmer output file (there in the "ali coord from" column)
value type: integer
column: gene_match_to
description: where in the gene the hmm model match ends, specified in the hmmer output file (there in the "ali coord to" column)
value type: integer
column: enzyme
description: the full enzyme name, parsed from the hmm model name by excluding the ".hmm" file extension
value type: text
column: family
description: the enzyme name, excluding subfamily designations, parsed from the "enzyme" column
value type: text
column: enzyme_type
description: which main class of enzyme it is, GH, CBM, CE, etc., parsed from the "family" column
value type: text
column: signal_peptide
description: whether signal peptide is predicted (SP(Sec/SPI)) or not (OTHER), specified in the signalp output file (there in the "Prediction" column)
value type: text
column: signal_peptide_prob
description: probability that a signal peptide is present, specified in the signalp output file (there in the "SP(Sec/SPI)" column)
value type: float
column: sp_cut_pos
description: the position in the protein sequence where the signal peptide is predicted to be cleaved, specified in the signalp output file (there in the "CS Position" column)
value type: text
column: sp_cut_seq
description: the sequence at which the signal peptide is predicted to be cleaved, specified in the signalp output file (there in the "CS Position" column)
value type: text
column: sp_cut_prob
description: the probability of the cut-site prediction, specified in the signalp output file (there in the "CS Position" column)
value type: float
column: genes_in_fasta
description: the number of genes present in the organism's fasta file
value type: integer
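The derived columns described above can be reproduced roughly as follows. The coverage formula is taken directly from the "hmm_match_coverage" description. The name parsing is an assumption made for this sketch: it supposes that a subfamily is appended after an underscore (for example "GH5_7") and that the enzyme type is the leading run of letters in the family name. The model name "GH5_7.hmm" and the numbers used are illustrative only.

```python
import re


def match_coverage(hmm_match_from, hmm_match_to, hmm_model_len):
    """Fraction of the hmm model covered by the match, as defined above."""
    return (hmm_match_to - hmm_match_from) / hmm_model_len


def parse_model_name(hmm_model):
    """Split an hmm model file name into (enzyme, family, enzyme_type)."""
    # enzyme: the model name without the ".hmm" file extension.
    enzyme = hmm_model[:-4] if hmm_model.endswith(".hmm") else hmm_model
    # family: assumption that the subfamily follows the first underscore.
    family = enzyme.split("_", 1)[0]
    # enzyme_type: assumption that the class is the leading letters of the family.
    letters = re.match(r"[A-Za-z]+", family)
    enzyme_type = letters.group(0) if letters else family
    return enzyme, family, enzyme_type


coverage = match_coverage(10, 210, 250)
names = parse_model_name("GH5_7.hmm")
print(coverage, names)
```

Rows in the file are described as having coverage between 0.35 and 1.0, so a value computed this way that falls outside that range would suggest a parsing mistake.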
Files (588.3 MB)
Size | md5 |
---|---|
12.5 MB | 8a775eab1dda1c1a52c96f1553b325b2 |
575.9 MB | ecbe90c92834e70b905c5067d4d2dc44 |