Published February 18, 2021 | Version 1.0
Dataset Open

CAZyme prediction in Ascomycetes yeast genomes guides discovery of novel xylanolytic species with diverse capacities of hemicellulose hydrolysis

  • 1. Chalmers University of Technology

Description

Background

These data are part of our publication "title_placeholder" (link_placeholder). The fasta files originate from another publication (https://doi.org/10.1016/j.cell.2018.10.023), whose data are hosted on Figshare (https://doi.org/10.6084/m9.figshare.5854692). We have further processed those fasta files by clustering the sequences at 98% identity and removing whitespace from the fasta headers. They are provided here so that users can retrieve the protein sequences of the genes listed in the "332_yeast_genomes_enzyme_info_version_3.tsv" file.

File description

The main output file is the tab-separated file "332_yeast_genomes_enzyme_info_version_3.tsv". Each row describes one gene with a single corresponding hmm hit at a specific position in the gene. A gene can (and often does) occur in multiple rows, either with hits from different hmm models or with hits from the same hmm model at different positions in the gene. The name of each gene is given, and the corresponding protein sequence can be retrieved from the organism fasta files in "protein_fasta.zip", which are derived from the Figshare repository indicated above.
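
Because a gene can appear in several rows, it is often convenient to group the hits per gene before looking up its protein sequence. The following is a minimal sketch using only the Python standard library. The function names and the example data are hypothetical, and the sketch assumes that the fasta header (the text after ">") is identical to the value in the "gene" column.

```python
import csv
import io
from collections import defaultdict


def group_hits_by_gene(tsv_text):
    """Group rows of the tab-separated table by (organism, gene)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    hits = defaultdict(list)
    for row in reader:
        hits[(row["organism"], row["gene"])].append(row)
    return dict(hits)


def read_fasta(fasta_text):
    """Parse fasta text into {header: sequence}.

    Assumption: the header line (without '>') equals the gene name.
    """
    records = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


# Made-up example data, only to show the shape of the inputs.
example_tsv = (
    "organism\tgene\thmm_model\tgene_match_from\tgene_match_to\n"
    "org_a\tgene_1\tGH5.hmm\t10\t200\n"
    "org_a\tgene_1\tCBM1.hmm\t220\t250\n"
    "org_a\tgene_2\tGH11.hmm\t5\t180\n"
)
example_fasta = ">gene_1\nMKTAYIAK\nQRQISFVK\n>gene_2\nMSLLTEVE\n"

hits = group_hits_by_gene(example_tsv)
sequences = read_fasta(example_fasta)

for (organism, gene), rows in hits.items():
    models = [row["hmm_model"] for row in rows]
    print(organism, gene, models, sequences[gene])
```

For the real files, the text would be read from "332_yeast_genomes_enzyme_info_version_3.tsv" and from the fasta file of the organism in question.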

The columns in the output file "332_yeast_genomes_enzyme_info_version_3.tsv" are as follows:

column: organism
description: the organism name
value type: text


column: gene
description: the gene name as given inside fasta files in "protein_fasta.zip"
value type: text


column: hmm_model
description: the hmm model (file name, including the ".hmm" extension) that gave the hit in the hmmer search
value type: text


column: hmm_model_len
description: length of the hmm model, specified in the hmmer output file (there in the "qlen" column)
value type: integer


column: hmm_match_from
description: where in the hmm model the match with the gene starts, specified in the hmmer output file (there in the "hmm coord from" column)
value type: integer


column: hmm_match_to
description: where in the hmm model the match with the gene ends, specified in the hmmer output file (there in the "hmm coord to" column)
value type: integer


column: hmm_match_coverage
description: the fraction of the hmm model that matched the gene, ranging from 0.35 to 1.0, computed as ("hmm_match_to" - "hmm_match_from")/"hmm_model_len"
value type: float
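
The coverage formula above can be reproduced from the other three columns. This is a small sketch; the function name and the example numbers are hypothetical, and only the formula itself comes from the description.

```python
def hmm_match_coverage(hmm_match_from, hmm_match_to, hmm_model_len):
    """Fraction of the hmm model covered by the match, as defined above."""
    if hmm_model_len <= 0:
        raise ValueError("hmm_model_len must be positive")
    return (hmm_match_to - hmm_match_from) / hmm_model_len


# Made-up values: a match spanning model positions 11 to 191 of a 200-long model.
coverage = hmm_match_coverage(11, 191, 200)
print(round(coverage, 3))
```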


column: match_evalue
description: the e-value of the hmm model hit, specified in the hmmer output file (there in the "Evalue" column)
value type: float, scientific notation


column: gene_match_from
description: where in the gene the hmm model match starts, specified in the hmmer output file (there in the "ali coord from" column)
value type: integer


column: gene_match_to
description: where in the gene the hmm model match ends, specified in the hmmer output file (there in the "ali coord to" column)
value type: integer


column: enzyme
description: the full enzyme name, parsed from the hmm model name by excluding the ".hmm" file extension
value type: text


column: family
description: the enzyme name, excluding subfamily designations, parsed from the "enzyme" column
value type: text


column: enzyme_type
description: which main class of enzyme it is, GH, CBM, CE, etc., parsed from the "family" column
value type: text
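
The three name-derived columns ("enzyme", "family", "enzyme_type") can be sketched as a chain of string operations. The exact parsing rules are not given here, so this example assumes that subfamilies are appended after an underscore (for example "GH5_22") and that the enzyme class is the leading run of letters. Both are assumptions, and the function name is hypothetical.

```python
import re


def parse_enzyme(hmm_model):
    """Return (enzyme, family, enzyme_type) for an hmm model file name."""
    # enzyme: the model name without the ".hmm" extension
    suffix = ".hmm"
    enzyme = hmm_model[: -len(suffix)] if hmm_model.endswith(suffix) else hmm_model
    # family: assumed to be everything before the first underscore
    family = enzyme.split("_", 1)[0]
    # enzyme_type: assumed to be the leading letters of the family name
    match = re.match(r"[A-Za-z]+", family)
    enzyme_type = match.group(0) if match else ""
    return enzyme, family, enzyme_type


print(parse_enzyme("GH5_22.hmm"))
print(parse_enzyme("CBM1.hmm"))
```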


column: signal_peptide
description: whether signal peptide is predicted (SP(Sec/SPI)) or not (OTHER), specified in the signalp output file (there in the "Prediction" column)
value type: text


column: signal_peptide_prob
description: probability that a signal peptide is present, specified in the signalp output file (there in the "SP(Sec/SPI)" column)
value type: float


column: sp_cut_pos
description: the position in the protein sequence where the signal peptide is predicted to be cleaved, specified in the signalp output file (there in the "CS Position" column)
value type: text


column: sp_cut_seq
description: the sequence at which the signal peptide is predicted to be cleaved, specified in the signalp output file (there in the "CS Position" column)
value type: text


column: sp_cut_prob
description: the probability of the cut-site prediction, specified in the signalp output file (there in the "CS Position" column)
value type: float
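
The "sp_cut_pos", "sp_cut_seq" and "sp_cut_prob" columns all derive from the single "CS Position" field of the signalp output. The sketch below shows one way such a field could be split. It assumes the field looks like "CS pos: 19-20. AYA-AS. Pr: 0.8538"; that layout, the regular expression and the function name are assumptions rather than the parsing code used for this dataset.

```python
import re

# Assumed layout of the field: "CS pos: <from>-<to>. <seq>-<seq>. Pr: <prob>"
CS_PATTERN = re.compile(
    r"CS pos:\s*(\d+-\d+)\.\s*([A-Z]+-[A-Z]+)\.\s*Pr:\s*([0-9.]+)"
)


def parse_cs_position(field):
    """Return (sp_cut_pos, sp_cut_seq, sp_cut_prob), or None if no cut site is found."""
    match = CS_PATTERN.search(field)
    if match is None:
        return None
    return match.group(1), match.group(2), float(match.group(3))


parsed = parse_cs_position("CS pos: 19-20. AYA-AS. Pr: 0.8538")
print(parsed)
```

Under this assumption the cut position is a range such as "19-20", which would explain why "sp_cut_pos" is stored as text rather than as an integer.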


column: genes_in_fasta
description: the number of genes present in the organism's fasta file
value type: integer
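
A value such as "genes_in_fasta" can be checked by counting the header lines of a fasta file. This is a minimal sketch with a hypothetical function name and made-up example text.

```python
def count_fasta_records(fasta_text):
    """Count the records in fasta text: one per line starting with '>'."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))


# Made-up example with three records.
example = ">gene_1\nMKTAYIAK\n>gene_2\nMSLLTEVE\n>gene_3\nMAAQ\n"
n_genes = count_fasta_records(example)
print(n_genes)
```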

Files

protein_fasta.zip

Files (588.3 MB)

md5:8a775eab1dda1c1a52c96f1553b325b2 (12.5 MB)
md5:ecbe90c92834e70b905c5067d4d2dc44 (575.9 MB)