Published September 18, 2025 | Version 1.0.0
Dataset Open

Life at the extremes: Maximally divergent microbes with similar genomic signatures linked to extreme environments (Dataset)

  • 1. University of Waterloo
  • 2. Western University

Contributors

  • 1. University of Guelph
  • 2. University of Waterloo
  • 3. The University of Western Ontario

Description

Dataset Description

We curated a dataset of 693 high-quality extremophile microbial genomes, spanning diverse environmental conditions defined by both temperature and pH optima. This includes psychrophiles, mesophiles, thermophiles, hyperthermophiles, acidophiles, and alkaliphiles, representing broad taxonomic coverage across bacteria and archaea. Genome assemblies were collected through a comprehensive review of primary literature and cross-referenced with the Genome Taxonomy Database to ensure accuracy and taxonomic consistency.

The dataset was organized into two subsets:

  • Temperature Dataset (598 genomes): 148 psychrophiles, 190 mesophiles, 183 thermophiles, and 77 hyperthermophiles.

  • pH Dataset (186 genomes): 100 acidophiles and 86 alkaliphiles.

A total of 91 genomes occur in both subsets, either as mesophiles adapted to acidic/alkaline conditions or as polyextremophiles capable of surviving under both temperature and pH extremes.
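The subset sizes and the overlap above can be reconciled with the overall total by inclusion-exclusion. The short check below uses only the counts stated in this description; the variable names are illustrative.

```python
# Counts exactly as stated in the dataset description.
temperature = {
    "psychrophiles": 148,
    "mesophiles": 190,
    "thermophiles": 183,
    "hyperthermophiles": 77,
}
ph = {"acidophiles": 100, "alkaliphiles": 86}
shared = 91  # genomes that occur in both subsets

n_temperature = sum(temperature.values())
n_ph = sum(ph.values())

# Inclusion-exclusion: genomes in both subsets are counted only once.
n_unique = n_temperature + n_ph - shared

print(n_temperature, n_ph, n_unique)  # 598 186 693
```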

To ensure compatibility with downstream machine learning analyses, we applied a genome proxy approach in which each genome was represented by pseudo-concatenated, randomly selected subsequences. This alignment-free strategy enabled the calculation of canonical 𝑘-mer frequency vectors and the identification of convergent genomic signatures across maximally divergent bacteria–archaea pairs.
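As a rough illustration of the genome proxy idea, the sketch below samples random fragments from a genome's contigs and computes a canonical k-mer frequency vector over them. It is a minimal sketch under stated assumptions, not the authors' pipeline: the fragment length, the number of fragments, the value of k, the uniform choice of contig, and the decision not to count k-mers that span fragment junctions are all assumptions made here for illustration. The function names are hypothetical; the linked repository and publication should be consulted for the actual procedure.

```python
import random
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def genome_proxy(contigs, fragment_len, n_fragments, rng):
    """Sample random fragments from contigs (assumed sampling scheme).

    Fragments are returned as a list rather than one joined string, so that
    k-mers spanning artificial junctions are never counted.
    """
    eligible = [c for c in contigs if len(c) >= fragment_len]
    if not eligible:
        raise ValueError("no contig is long enough for the requested fragment length")
    fragments = []
    for _ in range(n_fragments):
        contig = rng.choice(eligible)
        start = rng.randrange(len(contig) - fragment_len + 1)
        fragments.append(contig[start:start + fragment_len].upper())
    return fragments


def canonical_kmer_frequencies(fragments, k):
    """Return a dict mapping each canonical k-mer to its relative frequency."""
    counts = Counter()
    for fragment in fragments:
        for i in range(len(fragment) - k + 1):
            kmer = fragment[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    keys = sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})
    total = sum(counts.values())
    return {key: (counts[key] / total if total else 0.0) for key in keys}


# Toy input; real contigs would come from the FASTA assemblies.
rng = random.Random(0)
toy_contigs = ["ACGTACGTTAGCCGATAGCTAGCTAGGATCC", "TTGACCA"]
fragments = genome_proxy(toy_contigs, fragment_len=10, n_fragments=5, rng=rng)
frequencies = canonical_kmer_frequencies(fragments, k=3)
print(len(frequencies))  # number of canonical 3-mers
```

Collapsing each k-mer with its reverse complement makes the vector independent of which DNA strand was sequenced, which is why canonical counts are a common choice for alignment-free comparison.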

Technical Information

  • Assemblies_Extremophiles.zip: Contains the curated set of extremophile genome assemblies in FASTA format.

  • Extremophiles_metadata.tsv: Provides metadata including taxonomy, environmental category (temperature/pH class), and isolation details for each genome.

General description of FASTA files:
Each genome FASTA file contains one or more sequence entries. Each entry begins with a header line that starts with ">" and is followed by the sequence identifier; the sequence itself follows on the subsequent lines.
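For readers who prefer not to depend on a bioinformatics library, the FASTA layout can be read with a few lines of standard-library Python. This is a minimal sketch: the function name read_fasta is hypothetical, and treating the first whitespace-delimited token of the header as the identifier is an assumption about these files, not something stated in the record.

```python
import io


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from an open text handle."""
    identifier = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if identifier is not None:
                yield identifier, "".join(chunks)
            parts = line[1:].split()
            # Assumption: the identifier is the first token after ">".
            identifier = parts[0] if parts else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


# Toy example with made-up identifiers and sequences.
example = io.StringIO(">seq1 example description\nACGT\nAC\n>seq2\nGG\n")
records = list(read_fasta(example))
print(records)  # [('seq1', 'ACGTAC'), ('seq2', 'GG')]
```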

Metadata linkage:
Sequence identifiers in the FASTA assemblies can be directly mapped to the metadata file (Extremophiles_metadata.tsv), which contains corresponding environmental and taxonomic information.
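The linkage can be sketched as a dictionary lookup keyed on the identifier column of the metadata table. The column names in Extremophiles_metadata.tsv are not listed in this record, so the key column is passed in as a parameter, and the names used in the toy table below (genome_id, temperature_class) are hypothetical placeholders, as are the toy rows. Exact string matching between FASTA identifiers and metadata keys is also an assumption.

```python
import csv
import io


def load_metadata(handle, key_column):
    """Read a tab-separated table into a dict keyed on key_column."""
    reader = csv.DictReader(handle, delimiter="\t")
    if key_column not in (reader.fieldnames or []):
        raise KeyError(f"column {key_column!r} not found in metadata header")
    return {row[key_column]: row for row in reader}


def link_identifiers(identifiers, metadata):
    """Split identifiers into those with a metadata row and those without."""
    matched = {i: metadata[i] for i in identifiers if i in metadata}
    unmatched = [i for i in identifiers if i not in metadata]
    return matched, unmatched


# Toy table with hypothetical column names and made-up rows.
toy_tsv = io.StringIO(
    "genome_id\ttemperature_class\n"
    "genome_A\tthermophile\n"
    "genome_B\tpsychrophile\n"
)
metadata = load_metadata(toy_tsv, key_column="genome_id")
matched, unmatched = link_identifiers(["genome_A", "genome_X"], metadata)
print(matched["genome_A"]["temperature_class"])  # thermophile
print(unmatched)  # ['genome_X']
```

Reporting unmatched identifiers separately, rather than silently dropping them, makes it easier to spot a wrong key column or a naming mismatch early.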

Files

Assemblies_Extremophiles.zip

Files (680.2 MB)

  • 680.0 MB (md5:268515e18d952a590dddbd706a17256d)

  • 166.6 kB (md5:a4e0996842b3f41c7ae2eea52f0d8336)
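After downloading, a file can be checked against its published MD5 checksum. The sketch below uses Python's standard hashlib module and reads the file in chunks so that the large archive does not need to fit in memory. The helper names and the example path are hypothetical; the expected value should be copied from the checksum listed for the corresponding file.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 with an expected value, with or without an 'md5:' prefix."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex


# Usage (hypothetical local path; the checksum is copied from the record):
# ok = matches_checksum("path/to/downloaded_file", "md5:268515e18d952a590dddbd706a17256d")
```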

Additional details

Related works

Is described by
Publication: 10.1101/2025.06.04.657665 (DOI)

Dates

Available
2025-09-21

Software

Repository URL
https://github.com/Kari-Genomics-Lab/Extreme_Env_2
Programming language
Python
Development Status
Active