Published January 14, 2024 | Version 1.2.0
Dataset Open

TemStaPro Datasets

  • 1. Institute of Biotechnology, Life Sciences Center, Vilnius University; Institute of Informatics, Faculty of Mathematics and Informatics, Vilnius University
  • 2. Institute of Biotechnology, Life Sciences Center, Vilnius University
  • 3. CasZyme
  • 4. Institute of Biotechnology, Life Sciences Center, Vilnius University; CasZyme

Description

This dataset contains the protein sequences used to train, validate, and test the binary classifiers that make up the TemStaPro program, which predicts protein thermostability with respect to nine temperature thresholds from 40 to 80 degrees Celsius in steps of five degrees.
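The nine thresholds can be enumerated in a few lines. The sketch below is only an illustration: the helper name `threshold_labels` is hypothetical, and the rule that a temperature at or above a threshold counts as the positive class is an assumption, not something this record specifies.

```python
# Nine thresholds from 40 to 80 degrees Celsius in steps of five.
THRESHOLDS = list(range(40, 81, 5))


def threshold_labels(temperature):
    """Return a {threshold: 0 or 1} mapping for one temperature value.

    Assumption: a temperature at or above the threshold is labeled 1.
    The actual labeling rule is described in the corresponding paper.
    """
    return {threshold: int(temperature >= threshold) for threshold in THRESHOLDS}


if __name__ == "__main__":
    print(THRESHOLDS)
    print(threshold_labels(65))
```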

The data is provided as files in FASTA format. Each protein sequence has a header of three values separated by vertical bar symbols: the UniParc taxonomy identifier of the organism to which the protein belongs; the UniProtKB/TrEMBL identifier of the protein sequence; and the organism's growth temperature, taken from the dataset of growth temperatures of over 21 thousand organisms (Engqvist, 2018).
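A minimal sketch of reading such a file with the Python standard library follows. The names `Record`, `parse_header`, and `read_fasta` are hypothetical, the example identifiers are made up, and the sketch assumes the three header fields appear in the order listed above with a numeric growth temperature.

```python
import io
from typing import Iterator, NamedTuple, TextIO, Tuple


class Record(NamedTuple):
    taxonomy_id: str
    protein_id: str
    growth_temperature: float
    sequence: str


def parse_header(header: str) -> Tuple[str, str, float]:
    """Split a header of the form '>taxonomy_id|protein_id|temperature'."""
    fields = header.lstrip(">").strip().split("|")
    if len(fields) != 3:
        raise ValueError(f"expected three '|'-separated fields, got: {header!r}")
    return fields[0], fields[1], float(fields[2])


def read_fasta(handle: TextIO) -> Iterator[Record]:
    """Yield one Record per FASTA entry, joining multi-line sequences."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield Record(*parse_header(header), "".join(chunks))
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield Record(*parse_header(header), "".join(chunks))


# Made-up entries for illustration only; they are not taken from the dataset.
EXAMPLE = ">12345|A0A000TEST1|37\nMKV\nLLA\n>678|A0A000TEST2|80.5\nMST\n"

if __name__ == "__main__":
    for record in read_fasta(io.StringIO(EXAMPLE)):
        print(record)
```

To read a downloaded file, pass an open text handle (for example `open(path)`) instead of the `io.StringIO` object.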

The TemStaPro-Major-30 set is composed of 12 files:

  • one training
  • one validation
  • one imbalanced testing
  • nine balanced samples of 2000 sequences, one from each of the balanced testing sets

The TemStaPro-Minor-30 set is composed of cross-validation and testing files, all balanced for the 65 degrees Celsius temperature threshold.

The SupplementaryFileC2EPsPredictions.tsv file contains thermostability predictions made with the default mode of the TemStaPro program to check the thermostability of different C2EP groups.

The detailed description is given in the revised version of the corresponding paper (https://doi.org/10.1093/bioinformatics/btae157).

If you use the data from this dataset, please cite both the paper and the DOI of the dataset.

Other

This project has received funding from the European Regional Development Fund (project No 13.1.1-LMT-K-718-05-0021) under a grant agreement with the Research Council of Lithuania (LMTLT). Funded as the European Union's measure in response to the COVID-19 pandemic.

Files

Files (329.5 MB)

Checksum Size
md5:a5c547e1d60a4170b5898cd0419dddf6 14.6 MB
md5:7546d561ff98fee20061e2e4feb7da23 407.3 kB
md5:8b465cea3f10076017a3afd5c78b7eff 411.0 kB
md5:36b52448bdb6427c8a099c7ff5fbf38a 391.1 kB
md5:3703b2d13a8be60b15ce697cb71ceeb7 394.5 kB
md5:21b718ae0dfd46bb75756a6656e65a26 397.1 kB
md5:5d8a4fe0b73991d81b5660a1738f0221 403.0 kB
md5:40e0fc04e89d9b847ac61ad8a9b8a188 387.6 kB
md5:b8b3263093236651dbe467114d8848a7 378.8 kB
md5:15b35df9b96bf2057af07878fb6e68f2 378.8 kB
md5:01860c59e648e4b6f49173e5e685c226 41.9 MB
md5:f84885509e4889cd3ae92e1da0e37059 174.9 MB
md5:a3fa609c9f4a886faf155ebee58dff69 43.4 MB
md5:3482baf8f1c53ba2624703221e75e182 43.5 MB
md5:9ff110eeb1af843ce98aedd1fed088ee 7.5 MB
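
The MD5 values listed for the files can be used to check that a download arrived intact. The following is a minimal sketch using Python's standard `hashlib` module; the function names and the file path passed to them are hypothetical, and the listed value is assumed to use the `md5:<hex digest>` form shown above.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a local file with a listed value such as 'md5:<hex digest>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

A call such as `matches_listed_checksum("downloaded_file.fasta", "md5:...")` returns `True` when the local copy matches the listed checksum; both arguments here are placeholders.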

Additional details

Related works

Is supplement to
Preprint: 10.1101/2023.03.27.534365 (DOI)

Dates

Updated
2024-01-14

References

  • Engqvist, Martin Karl Magnus. (2018). Growth temperatures for 21,498 microorganisms (1.0.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.1175609
  • Engqvist, Martin Karl Magnus. (2018). Correlating enzyme annotations with a large set of microbial growth temperatures reveals metabolic adaptations to growth at diverse temperatures. BMC Microbiology, 18, 1-14. https://doi.org/10.1186/s12866-018-1320-7