CAZyme prediction in Ascomycetes yeast genomes guides discovery of novel xylanolytic species with diverse capacities of hemicellulose hydrolysis

Ravn, Jonas; Engqvist, Martin; Larsbrink, Johan; Geijer, Cecilia

doi:10.5281/zenodo.4548336

Published February 18, 2021 | Version 1.0

Dataset Open

1. Chalmers University of Technology

Background

These data are part of our publication "title_placeholder" (link_placeholder). The fasta files originate from another publication (https://doi.org/10.1016/j.cell.2018.10.023), with data hosted on Figshare (https://doi.org/10.6084/m9.figshare.5854692). We have further processed those fasta files by clustering the sequences at 98% identity and removing whitespace from the fasta headers. They are provided here so that users can retrieve protein sequences for the genes listed in the "332_yeast_genomes_enzyme_info_version_3.tsv" file.
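As a rough illustration of retrieving protein sequences for listed genes, the following sketch parses fasta-formatted lines and keeps only the requested entries. It assumes that the full fasta header line (after ">") is identical to the value in the "gene" column of the tsv file, which is plausible because whitespace was removed from the headers but is not guaranteed here. The function names are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from fasta-formatted lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]  # drop the leading ">"
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def sequences_for_genes(lines: Iterable[str], wanted: Iterable[str]) -> Dict[str, str]:
    """Return {gene: sequence} for headers found in `wanted`.

    Assumption: the header equals the "gene" column value exactly.
    """
    wanted_set = set(wanted)
    return {h: s for h, s in read_fasta(lines) if h in wanted_set}
```

An open file handle for one organism fasta file can be passed as `lines`, since iterating over a text file yields its lines.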

File description

The main output file is the tab-separated file "332_yeast_genomes_enzyme_info_version_3.tsv". Each row in the data file indicates one gene with a single corresponding hmm hit at a specific position in the gene. A gene can (and often does) occur in multiple rows, either with hits from different hmm models or with hits from the same hmm model at different positions inside the gene. Below follows a description of the data contained in each column of the output file. The name of each gene is specified, and the corresponding protein sequence can be retrieved from the organism fasta files in "protein_fasta.zip", which are derived from the Figshare repository indicated above.
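Because one gene can appear in several rows, it may help to group rows by organism and gene before further analysis. The sketch below does this with the standard library only. The demo rows are made up for illustration and contain just four of the columns; the function name is hypothetical.

```python
import csv
import io
from collections import defaultdict


def hits_per_gene(handle):
    """Group tsv rows (as dicts) by (organism, gene)."""
    reader = csv.DictReader(handle, delimiter="\t")
    grouped = defaultdict(list)
    for row in reader:
        grouped[(row["organism"], row["gene"])].append(row)
    return dict(grouped)


# Made-up demo data, not taken from the real file
demo = io.StringIO(
    "organism\tgene\thmm_model\tgene_match_from\n"
    "org_x\tgene_1\tGH5.hmm\t10\n"
    "org_x\tgene_1\tCBM1.hmm\t300\n"
    "org_x\tgene_2\tGH5.hmm\t25\n"
)
grouped = hits_per_gene(demo)

# Genes that occur in more than one row, with their hit counts
multi = {key: len(rows) for key, rows in grouped.items() if len(rows) > 1}
print(multi)
```

For the real file, the same function can be called on `open("332_yeast_genomes_enzyme_info_version_3.tsv", newline="")`.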

The columns in the output file "332_yeast_genomes_enzyme_info_version_3.tsv" are as follows:

column: organism
description: the organism name
value type: text

column: gene
description: the gene name as given inside fasta files in "protein_fasta.zip"
value type: text

column: hmm_model
description: the file name of the hmm model that gave the hit in the hmmer search (including the ".hmm" file extension)
value type: text

column: hmm_model_len
description: length of the hmm model, specified in the hmmer output file (there in the "qlen" column)
value type: integer

column: hmm_match_from
description: where in the hmm model the match with the gene starts, specified in the hmmer output file (there in the "hmm coord from" column)
value type: integer

column: hmm_match_to
description: where in the hmm model the match with the gene ends, specified in the hmmer output file (there in the "hmm coord to" column)
value type: integer

column: hmm_match_coverage
description: the fraction of the hmm model that matched the gene, ranging from 0.35 to 1.0, computed as ("hmm_match_to" - "hmm_match_from")/"hmm_model_len"
value type: float

column: match_evalue
description: the e-value of the hmm model hit, specified in the hmmer output file (there in the "Evalue" column)
value type: float, scientific notation

column: gene_match_from
description: where in the gene the hmm model match starts, specified in the hmmer output file (there in the "ali coord from" column)
value type: integer

column: gene_match_to
description: where in the gene the hmm model match ends, specified in the hmmer output file (there in the "ali coord to" column)
value type: integer

column: enzyme
description: the full enzyme name, parsed from the hmm model name by excluding the ".hmm" file extension
value type: text

column: family
description: the enzyme name, excluding subfamily designations, parsed from the "enzyme" column
value type: text

column: enzyme_type
description: which main class of enzyme it is, GH, CBM, CE, etc., parsed from the "family" column
value type: text

column: signal_peptide
description: whether signal peptide is predicted (SP(Sec/SPI)) or not (OTHER), specified in the signalp output file (there in the "Prediction" column)
value type: text

column: signal_peptide_prob
description: probability that a signal peptide is present, specified in the signalp output file (there in the "SP(Sec/SPI)" column)
value type: float

column: sp_cut_pos
description: the position in the protein sequence where the signal peptide is predicted to be cleaved, specified in the signalp output file (there in the "CS Position" column)
value type: text

column: sp_cut_seq
description: the sequence at which the signal peptide is predicted to be cleaved, specified in the signalp output file (there in the "CS Position" column)
value type: text

column: sp_cut_prob
description: the probability of the cut-site prediction, specified in the signalp output file (there in the "CS Position" column)
value type: float

column: genes_in_fasta
description: the number of genes present in the organism's fasta file
value type: integer
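Several columns above are derived from others. The following sketch shows one way the derivations could be reproduced; it is not necessarily the code used to build the file. Only the coverage formula is taken directly from the column description. Two points are assumptions: that subfamily designations are an underscore suffix (for example "GH5_1"), and that the signalp "CS Position" string looks like "CS pos: 25-26. ASA-EV. Pr: 0.7408". All function names are hypothetical.

```python
import re


def match_coverage(hmm_from: int, hmm_to: int, hmm_len: int) -> float:
    """("hmm_match_to" - "hmm_match_from") / "hmm_model_len"."""
    return (hmm_to - hmm_from) / hmm_len


def parse_model_name(hmm_model: str):
    """Return (enzyme, family, enzyme_type) from an hmm model name."""
    # enzyme: model name without the ".hmm" extension
    enzyme = hmm_model[:-4] if hmm_model.endswith(".hmm") else hmm_model
    # family: assumption, subfamily is whatever follows the first underscore
    family = enzyme.split("_", 1)[0]
    # enzyme_type: leading letters of the family, e.g. "GH" from "GH5"
    letters = re.match(r"[A-Za-z]+", family)
    enzyme_type = letters.group(0) if letters else ""
    return enzyme, family, enzyme_type


# Assumed layout of the "CS Position" string (not confirmed by this record)
_CS_PATTERN = re.compile(r"CS pos: (\d+-\d+)\. ([A-Z]+-[A-Z]+)\. Pr: ([0-9.]+)")


def parse_cs_position(text: str):
    """Return (sp_cut_pos, sp_cut_seq, sp_cut_prob), or None if no match."""
    found = _CS_PATTERN.search(text)
    if found is None:
        return None
    return found.group(1), found.group(2), float(found.group(3))
```

Under these assumptions, the cut position is a residue range such as "25-26", which would explain why "sp_cut_pos" has value type text rather than integer.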

Files

Files (588.3 MB)

Name	Size	MD5
332_yeast_genomes_enzyme_info_version_3.tsv	12.5 MB	8a775eab1dda1c1a52c96f1553b325b2
protein_fasta.zip	575.9 MB	ecbe90c92834e70b905c5067d4d2dc44
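The md5 checksums listed above can be used to check that a download is complete. A minimal sketch, assuming the files were saved under their original names; the helper names are hypothetical and the file is read in chunks so that the large zip archive does not need to fit in memory.

```python
import hashlib

# Checksums as listed in this record
EXPECTED_MD5 = {
    "332_yeast_genomes_enzyme_info_version_3.tsv": "8a775eab1dda1c1a52c96f1553b325b2",
    "protein_fasta.zip": "ecbe90c92834e70b905c5067d4d2dc44",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex md5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: str, name: str) -> bool:
    """Compare a local file against the checksum listed for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, `is_intact("protein_fasta.zip", "protein_fasta.zip")` should return True for an undamaged download in the current directory.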

