Life at the extremes: Maximally divergent microbes with similar genomic signatures linked to extreme environments (Dataset)
Authors/Creators
Description
Dataset Description
We curated a dataset of 693 high-quality extremophile microbial genomes, spanning diverse environmental conditions defined by both temperature and pH optima. This includes psychrophiles, mesophiles, thermophiles, hyperthermophiles, acidophiles, and alkaliphiles, representing broad taxonomic coverage across bacteria and archaea. Genome assemblies were collected through a comprehensive review of primary literature and cross-referenced with the Genome Taxonomy Database to ensure accuracy and taxonomic consistency.
The dataset was organized into two subsets:
- Temperature Dataset (598 genomes): 148 psychrophiles, 190 mesophiles, 183 thermophiles, and 77 hyperthermophiles.
- pH Dataset (186 genomes): 100 acidophiles and 86 alkaliphiles.
A total of 91 genomes occur in both subsets, either as mesophiles adapted to acidic/alkaline conditions or as polyextremophiles capable of surviving under both temperature and pH extremes.
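As a quick consistency check on the counts above, the short sketch below recomputes the subset sizes and the number of unique genomes by inclusion-exclusion. The variable names are illustrative; only the numbers come from this description.

```python
# Class counts as stated in the dataset description.
temperature = {
    "psychrophiles": 148,
    "mesophiles": 190,
    "thermophiles": 183,
    "hyperthermophiles": 77,
}
ph = {"acidophiles": 100, "alkaliphiles": 86}
shared = 91  # genomes that occur in both subsets

n_temperature = sum(temperature.values())
n_ph = sum(ph.values())

# Inclusion-exclusion: genomes in both subsets are counted only once.
n_unique = n_temperature + n_ph - shared

print(n_temperature, n_ph, n_unique)  # 598 186 693
```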
To ensure compatibility with downstream machine learning analyses, we applied a genome proxy approach in which each genome was represented by pseudo-concatenated, randomly selected subsequences. This alignment-free strategy enabled the calculation of canonical 𝑘-mer frequency vectors and the identification of convergent genomic signatures across maximally divergent bacteria–archaea pairs.
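The genome proxy idea can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). The number of fragments, fragment length, value of k, random seed, and all function names are assumptions chosen for illustration. The sketch also assumes that k-mers are counted within each fragment and never across fragment boundaries, which is one way to read "pseudo-concatenated".

```python
import random
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """A k-mer and its reverse complement share one canonical form."""
    return min(kmer, reverse_complement(kmer))


def genome_proxy(contigs, n_fragments=10, fragment_len=1000, seed=0):
    """Draw random subsequences from the contigs of one genome.

    All parameter defaults are hypothetical, not the published settings.
    """
    rng = random.Random(seed)
    eligible = [c for c in contigs if len(c) >= fragment_len]
    if not eligible:
        raise ValueError("no contig is long enough for the fragment length")
    fragments = []
    for _ in range(n_fragments):
        contig = rng.choice(eligible)
        start = rng.randrange(len(contig) - fragment_len + 1)
        fragments.append(contig[start:start + fragment_len].upper())
    return fragments


def canonical_kmer_frequencies(fragments, k=3):
    """Relative frequencies of canonical k-mers, in a fixed sorted order."""
    keys = sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})
    counts = Counter()
    for frag in fragments:
        for i in range(len(frag) - k + 1):
            kmer = frag[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip windows with N or other codes
                counts[canonical(kmer)] += 1
    total = sum(counts.values())
    return {key: (counts[key] / total if total else 0.0) for key in keys}


# Toy contigs standing in for a real assembly.
toy_contigs = ["ACGT" * 500, "GGATCC" * 400]
proxy = genome_proxy(toy_contigs, n_fragments=5, fragment_len=200)
vector = canonical_kmer_frequencies(proxy, k=3)
print(len(vector))  # 32 canonical 3-mers
```

Because the vector has the same length and key order for every genome, vectors from different genomes can be stacked directly into a feature matrix.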
Technical Information
- Assemblies.zip: Contains the curated set of extremophile genome assemblies in FASTA format.
- Extremophiles_metadata.tsv: Provides metadata including taxonomy, environmental category (temperature/pH class), and isolation details for each genome.
General description of FASTA files:
Each genome FASTA file contains one or more sequence entries. Each entry begins with a header line that starts with > and carries the sequence identifier, and the sequence itself follows on the subsequent lines.
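A minimal FASTA reader, written with the standard library only, shows how the header lines and identifiers can be extracted. It assumes the identifier is the first whitespace-delimited token after the > character, which is a common convention but not something this record states. The example records are invented.

```python
import io


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from an open FASTA text handle."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            # Assumption: the identifier is the first token of the header.
            tokens = line[1:].split()
            identifier = tokens[0] if tokens else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


# Invented two-record example; real files come from the assemblies archive.
example = io.StringIO(">contig_1 hypothetical description\nACGT\nTTGA\n>contig_2\nGGCC\n")
records = list(read_fasta(example))
print(records)  # [('contig_1', 'ACGTTTGA'), ('contig_2', 'GGCC')]
```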
Metadata linkage:
Sequence identifiers in the FASTA assemblies can be directly mapped to the metadata file (Extremophiles_metadata.tsv), which contains corresponding environmental and taxonomic information.
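The linkage can be sketched as a dictionary lookup keyed on the identifier. The column names used here (`assembly_id`, `temperature_class`) and the example rows are hypothetical, since the actual header of Extremophiles_metadata.tsv is not listed in this record. Check the file and substitute the real key column.

```python
import csv
import io


def load_metadata(handle, key_column):
    """Read a tab-separated metadata table into a dict keyed by key_column."""
    reader = csv.DictReader(handle, delimiter="\t")
    if key_column not in (reader.fieldnames or []):
        raise KeyError(f"column {key_column!r} not found in metadata header")
    return {row[key_column]: row for row in reader}


def link(identifiers, metadata):
    """Split identifiers into those with and without a metadata row."""
    matched = {i: metadata[i] for i in identifiers if i in metadata}
    unmatched = [i for i in identifiers if i not in metadata]
    return matched, unmatched


# Invented table; with the real file, pass open(path, newline="") instead.
tsv = io.StringIO(
    "assembly_id\ttemperature_class\n"
    "genome_A\tthermophile\n"
    "genome_B\tpsychrophile\n"
)
metadata = load_metadata(tsv, key_column="assembly_id")
matched, unmatched = link(["genome_A", "genome_X"], metadata)
print(matched["genome_A"]["temperature_class"], unmatched)
```

Reporting the unmatched identifiers separately makes it easy to spot naming mismatches between the archive and the table before any analysis starts.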
Files
Assemblies_Extremophiles.zip
Total size: 680.2 MB

| MD5 checksum | Size |
|---|---|
| md5:268515e18d952a590dddbd706a17256d | 680.0 MB |
| md5:a4e0996842b3f41c7ae2eea52f0d8336 | 166.6 kB |
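After downloading, the listed MD5 checksums can be used to confirm that a file arrived intact. The sketch below uses Python's standard `hashlib` module. The local path in the usage comment is an assumption about where the archive was saved.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex


# Usage (hypothetical local path; checksum listed for the 680.0 MB file):
# checksum_matches("Assemblies_Extremophiles.zip",
#                  "md5:268515e18d952a590dddbd706a17256d")
```

Reading in chunks keeps memory use flat, which matters for an archive of several hundred megabytes.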
Additional details
Related works
- Is described by: Publication, DOI 10.1101/2025.06.04.657665
Dates
- Available: 2025-09-21
Software
- Repository URL: https://github.com/Kari-Genomics-Lab/Extreme_Env_2
- Programming language: Python
- Development Status: Active