Published 2026 | Version 1.0.0
Dataset Open

BioRemPP Database: A Curated Compound-Centric Resource for Bioremediation Potential Profiling

  • 1. Universidade Federal do Rio Grande do Norte

Description

Overview

The BioRemPP Database (Bioremediation Potential Profile Database) is a curated, integrated resource designed to support environmental bioremediation research by systematically linking chemical compounds, genes, enzymes, and regulatory frameworks. It addresses a critical gap in bioremediation science: the absence of a unified, standardized resource connecting priority pollutants to their potential biodegradation pathways across multiple knowledge bases and regulatory contexts.

Scientific Rationale

Environmental contamination by xenobiotic compounds—including chlorinated solvents, polyaromatic hydrocarbons, pesticides, and heavy metals—poses significant ecological and public health challenges. While substantial knowledge exists regarding microbial biodegradation capabilities, this information remains fragmented across databases and regulatory frameworks. BioRemPP systematically integrates these sources into a unified, FAIR-compliant (Findable, Accessible, Interoperable, Reusable) framework.

Database Contents (v1.0.0)

This release contains:

  • 10,869 database entries linking compounds to functional annotations
  • 384 unique chemical compounds with standardized identifiers (CAS, ChEBI, SMILES)
  • 1,541 KEGG Orthology (KO) identifiers for functional annotation
  • 12 chemical compound classes for classification
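The counts above can be re-derived from any table that pairs compounds with KO identifiers. The following is a minimal sketch, not the authors' procedure: the column names `compound_id`, `ko`, and `compound_class` and the sample rows are hypothetical placeholders, and the real field names should be taken from the accompanying data dictionaries.

```python
import csv
import io


def summarize(rows, compound_field="compound_id", ko_field="ko",
              class_field="compound_class"):
    """Count entries and unique compounds, KOs, and compound classes.

    The default field names are assumptions for illustration only.
    """
    rows = list(rows)
    return {
        "entries": len(rows),
        "compounds": len({row[compound_field] for row in rows}),
        "kos": len({row[ko_field] for row in rows}),
        "classes": len({row[class_field] for row in rows}),
    }


# Made-up sample rows; these are not taken from the BioRemPP tables.
SAMPLE = """compound_id,ko,compound_class
CPD_A,KO_1,class_x
CPD_A,KO_2,class_x
CPD_B,KO_1,class_y
"""

summary = summarize(csv.DictReader(io.StringIO(SAMPLE)))
print(summary)
```

Run against the full release, a summary of this kind would be expected to reproduce the figures listed above, provided the correct columns are selected.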

Data Sources and Integration

Data were curated from multiple authoritative sources:

Regulatory Frameworks:

  • ATSDR (Agency for Toxic Substances and Disease Registry) Substance Priority List
  • EPA (U.S. Environmental Protection Agency) National Priorities List
  • CONAMA (Brazilian National Environment Council) Regulations
  • IARC (International Agency for Research on Cancer) Classifications (Groups 1, 2A, 2B)
  • EU Water Framework Directive Priority Substances
  • Canadian Environmental Protection Act Priority Substances List

Functional Annotations:

  • KEGG (Kyoto Encyclopedia of Genes and Genomes) — KO identifiers, pathways, enzymes
  • HADEG (Hydrocarbon Aerobic Degradation Enzymes and Genes) — degradation-specific coverage
  • BlastKOALA (v3.0) and EggNOG-mapper (v2) — sequence-based functional assignments

Chemical Classification:

  • ChEBI (Chemical Entities of Biological Interest) — standardized identifiers, SMILES, compound classes

Toxicological Annotations:

  • ToxCSM — machine learning-based multi-endpoint toxicity predictions (mutagenicity, carcinogenicity, environmental hazard)

Integration Framework and External Database Contributions

BioRemPP is not a copy or aggregation of existing databases. Rather, it is an integration layer that establishes compound-centric relationships across independent external resources, each contributing distinct and complementary information. BioRemPP does not redistribute primary data from these sources; instead, it provides standardized cross-references and relational mappings that enable users to navigate between resources through a unified analytical framework.

The data architecture organizes compound-gene-enzyme relationships within a centralized core containing 1,541 unique KOs and 384 compounds. This core is linked to three external resources that expand functional and toxicological coverage through cross-references, not data duplication.

The following external databases contribute specific layers of information to BioRemPP:

External Resource | Contribution to BioRemPP | Quantitative Coverage | Relationship Type
KEGG (Kyoto Encyclopedia of Genes and Genomes) | KO identifiers, gene symbols, enzyme classifications (EC), and pathway associations for functional annotation | 855 entries from 20 canonical xenobiotic metabolism pathways | Cross-reference via KO identifiers
HADEG (Hydrocarbon Aerobic Degradation Enzymes and Genes) | Extends degradation-specific gene coverage for hydrocarbons, polymers, and biosurfactant-related pathways | 867 entries across 71 sub-pathways | Cross-reference via KO identifiers
ChEBI (Chemical Entities of Biological Interest) | Standardized chemical identifiers, SMILES representations, and compound class assignments | 384 compounds with validated identifiers | Cross-reference via ChEBI IDs
ToxCSM | Machine learning-based toxicity predictions as annotation layers | 370 compounds with 31 toxicity endpoints (nuclear, stress-response, genomic, environmental, dose-related) | Cross-reference via CPD identifiers
Regulatory Frameworks (ATSDR, EPA, CONAMA, IARC, EU-WFD, CEPA-PSL) | Priority compound lists and hazard classifications defining the scope of environmentally relevant compounds | 9 international regulatory references | Compound inclusion criteria

What BioRemPP adds:

  1. Compound-centric relational structure — Links compounds to genes, enzymes, pathways, and regulatory status through standardized identifiers (CAS, ChEBI, KO, SMILES), analogous to toxicogenomic frameworks that prioritize chemical-gene interactions
  2. Cross-database harmonization — Resolves identifier inconsistencies and synonym mappings across sources, ensuring interoperability between KEGG, ChEBI, HADEG, and ToxCSM
  3. Regulatory context integration — Associates functional annotations with environmental priority status from multiple international frameworks, structuring the database around environmentally prioritized pollutants rather than pathway catalogs alone
  4. Analytical framework — Provides structured tidy data tables (10,869 entries with 100% completeness across core fields) optimized for bioremediation potential profiling, functional coverage analysis, and sample comparison
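The completeness figure quoted for the core fields can be checked independently. The sketch below computes, for each chosen field, the fraction of rows with a non-empty value. The field names and sample rows are hypothetical, and which fields count as "core" is an assumption to be settled against the data dictionaries.

```python
def field_completeness(rows, fields):
    """Return the fraction of rows with a non-empty value, per field."""
    rows = list(rows)
    if not rows:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for row in rows if (row.get(field) or "").strip()) / len(rows)
        for field in fields
    }


# Made-up rows: the second one lacks a KO identifier.
example_rows = [
    {"compound_id": "CPD_A", "ko": "KO_1"},
    {"compound_id": "CPD_B", "ko": ""},
]

completeness = field_completeness(example_rows, ["compound_id", "ko"])
print(completeness)
```

A table that is 100% complete across its core fields would yield 1.0 for each of them.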

Users seeking primary sequence data, pathway diagrams, or detailed toxicological reports should consult the original databases directly. BioRemPP facilitates this navigation by maintaining traceable cross-references to source identifiers.

Reference Genomes

For demonstration and validation purposes, the database includes functional annotations from nine representative RefSeq genomes spanning principal bioremediation-relevant groups:

  • Bacteria: Acinetobacter baumannii, Enterobacter asburiae, Pseudomonas aeruginosa
  • Fungi: Aspergillus nidulans, Fusarium graminearum, Cryptococcus gattii
  • Microalgae/Cyanobacteria: Chlorella variabilis, Nannochloropsis gaditana, Synechocystis sp.
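One simple way to profile a genome against the database is to ask, for each compound, what fraction of its associated KOs appears in the genome's annotation. The sketch below illustrates that idea only; it is not the scoring used by the BioRemPP web server, and all identifiers are made-up placeholders.

```python
from collections import defaultdict


def compound_coverage(db_pairs, sample_kos):
    """Fraction of each compound's database KOs found in a sample.

    db_pairs: iterable of (compound_id, ko_id) pairs from the database.
    sample_kos: iterable of KO identifiers annotated in one genome or sample.
    """
    kos_by_compound = defaultdict(set)
    for compound, ko in db_pairs:
        kos_by_compound[compound].add(ko)
    sample = set(sample_kos)
    return {
        compound: len(kos & sample) / len(kos)
        for compound, kos in kos_by_compound.items()
    }


# Made-up database pairs and sample annotation.
pairs = [("CPD_A", "KO_1"), ("CPD_A", "KO_2"), ("CPD_B", "KO_3")]
coverage = compound_coverage(pairs, {"KO_1", "KO_3", "KO_9"})
print(coverage)
```

KOs in the sample that are absent from the database (here `KO_9`) do not affect the result, which keeps the comparison restricted to database-supported functions.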

File Formats

All data tables are provided in CSV format with UTF-8 encoding. Detailed field descriptions and data dictionaries are included in the accompanying documentation (biorempp-schemas).
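Because the tables are plain UTF-8 CSV, they can be read with the Python standard library alone. The following is a minimal sketch that assumes the file has a header row; the column names are not listed in this record, so the example only prints whatever header it finds.

```python
import csv
import os


def load_table(path):
    """Read a UTF-8 CSV file with a header row into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


if __name__ == "__main__":
    # File name as shown in this record; adjust the path to your download.
    path = "kegg_degradation_db.csv"
    if os.path.exists(path):
        rows = load_table(path)
        print(len(rows), "rows")
        if rows:
            print("columns:", list(rows[0].keys()))
```

Passing `newline=""` lets the `csv` module handle line endings and quoted multi-line fields itself, as its documentation recommends.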

Associated Web Server

The BioRemPP web server (https://bioinfo.imd.ufrn.br/biorempp/) provides interactive visualization and analysis tools for exploring this database across eight analytical modules (56 use cases) that support hypothesis generation.

License

This dataset is released under Creative Commons Attribution 4.0 International (CC BY 4.0). Third-party data sources retain their original licenses.

Version History

  • v1.0.0: Initial release

Contact

For questions or feedback, please contact biorempp@gmail.com or submit issues via the project repository at https://github.com/BioRemPP/biorempp_web/issues.

Keywords 

bioremediation, biodegradation, xenobiotics, environmental microbiology, functional genomics, KEGG, pollutants, regulatory compounds, compound-gene associations, toxicology, environmental biotechnology, metagenomics

Metadata Fields

Field | Value
Resource Type | Dataset
Title | BioRemPP Database: A Curated Compound-Centric Resource for Bioremediation Potential Profiling
Version | 1.0.0
License | Creative Commons Attribution 4.0 International (CC BY 4.0)
Language | English
Subjects | Environmental Sciences, Bioinformatics, Microbiology, Biotechnology

 

Files

kegg_degradation_db.csv

Files (1.6 MB)

Checksum | Size
md5:992a8877d75ae3b4733a2df46e7206e8 | 19.4 kB
md5:9ae358299e131fbb0cb939afa876090a | 1.1 MB
md5:22ecc4aa40635de6f3a2dd364647eb39 | 189.4 kB
md5:74dbbd02ad1678cd1ee5e0b44ca2e868 | 44.9 kB
md5:0e388abe869abcd05d8f99f5ea18dc7d | 22.0 kB
md5:9d9631d5bd126e0883565db80e2824bb | 189.6 kB
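The MD5 checksums listed for the files can be used to confirm that a download is intact. The sketch below uses only the standard library. Which checksum belongs to which file is not stated in this listing, so the demonstration runs on a temporary file rather than on a real download.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or '<hex>'."""
    wanted = expected.strip().lower().split(":", 1)[-1]
    return md5_of_file(path) == wanted


if __name__ == "__main__":
    # Self-contained demonstration on a throwaway file.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"example")
    try:
        reference = "md5:" + hashlib.md5(b"example").hexdigest()
        print(matches_checksum(tmp.name, reference))
    finally:
        os.remove(tmp.name)
```

For a real check, pass the path of the downloaded file together with the checksum string shown next to it on the record page.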

Additional details

Related works

Software

Repository URL
https://github.com/BioRemPP/biorempp_web
Programming language
Python, R
Development Status
Active