BioRemPP Database: A Curated Compound-Centric Resource for Bioremediation Potential Profiling
Authors/Creators
Description
Overview
The BioRemPP Database (Bioremediation Potential Profile Database) is a curated, integrated resource designed to support environmental bioremediation research by systematically linking chemical compounds, genes, enzymes, and regulatory frameworks. This database addresses a critical gap in bioremediation science: the absence of a unified, standardized resource connecting priority pollutants to their potential biodegradation pathways across multiple knowledge bases and regulatory contexts.
Scientific Rationale
Environmental contamination by xenobiotic compounds—including chlorinated solvents, polyaromatic hydrocarbons, pesticides, and heavy metals—poses significant ecological and public health challenges. While substantial knowledge exists regarding microbial biodegradation capabilities, this information remains fragmented across databases and regulatory frameworks. BioRemPP systematically integrates these sources into a unified, FAIR-compliant (Findable, Accessible, Interoperable, Reusable) framework.
Database Contents (v1.0.0)
This release contains:
- 10,869 database entries linking compounds to functional annotations
- 384 unique chemical compounds with standardized identifiers (CAS, ChEBI, SMILES)
- 1,541 KEGG Orthology (KO) identifiers for functional annotation
- 12 chemical compound classes for classification
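The release counts above could be re-derived from a downloaded table along the following lines. This is a minimal sketch, not the project's own tooling: the column names `compound_id`, `ko`, and `compound_class`, the placeholder identifiers, and the inline sample rows are all hypothetical. The actual field names are described in the accompanying biorempp-schemas documentation.

```python
import csv
import io

# Hypothetical sample standing in for a BioRemPP CSV table. The column
# names and identifiers are assumptions, not the documented schema.
SAMPLE_CSV = """compound_id,ko,compound_class
CPD_A,KO_1,class_x
CPD_A,KO_2,class_x
CPD_B,KO_1,class_y
"""

def summarize(handle, compound_col="compound_id", ko_col="ko",
              class_col="compound_class"):
    """Count rows and distinct compounds, KOs, and classes in a CSV table."""
    compounds, kos, classes = set(), set(), set()
    entries = 0
    for row in csv.DictReader(handle):
        entries += 1
        compounds.add(row[compound_col])
        kos.add(row[ko_col])
        classes.add(row[class_col])
    return {
        "entries": entries,
        "compounds": len(compounds),
        "kos": len(kos),
        "classes": len(classes),
    }

summary = summarize(io.StringIO(SAMPLE_CSV))
print(summary)
```

For a real file, the same function can be given a handle opened with `open(path, newline="", encoding="utf-8")`, with the column arguments adjusted to the documented schema.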
Data Sources and Integration
Data were curated from multiple authoritative sources:
Regulatory Frameworks:
- ATSDR (Agency for Toxic Substances and Disease Registry) Substance Priority List
- EPA (U.S. Environmental Protection Agency) National Priorities List
- CONAMA (Brazilian National Environment Council) Regulations
- IARC (International Agency for Research on Cancer) Classifications (Groups 1, 2A, 2B)
- EU Water Framework Directive Priority Substances
- Canadian Environmental Protection Act Priority Substances List
Functional Annotations:
- KEGG (Kyoto Encyclopedia of Genes and Genomes) — KO identifiers, pathways, enzymes
- HADEG (Hydrocarbon Aerobic Degradation Enzymes and Genes) — degradation-specific coverage
- BlastKOALA (v3.0) and EggNOG-mapper (v2) — sequence-based functional assignments
Chemical Classification:
- ChEBI (Chemical Entities of Biological Interest) — standardized identifiers, SMILES, compound classes
Toxicological Annotations:
- ToxCSM — machine learning-based multi-endpoint toxicity predictions (mutagenicity, carcinogenicity, environmental hazard)
Integration Framework and External Database Contributions
BioRemPP is not a copy or aggregation of existing databases. Rather, it is an integration layer that establishes compound-centric relationships across independent external resources, each contributing distinct and complementary information. BioRemPP does not redistribute primary data from these sources; instead, it provides standardized cross-references and relational mappings that enable users to navigate between resources through a unified analytical framework.
The data architecture organizes compound-gene-enzyme relationships within a centralized core containing 1,541 unique KOs and 384 compounds. This core is linked to three external resources that expand functional and toxicological coverage through cross-references, not data duplication.
The following external databases contribute specific layers of information to BioRemPP:
| External Resource | Contribution to BioRemPP | Quantitative Coverage | Relationship Type |
|---|---|---|---|
| KEGG (Kyoto Encyclopedia of Genes and Genomes) | KO identifiers, gene symbols, enzyme classifications (EC), and pathway associations for functional annotation | 855 entries from 20 canonical xenobiotic metabolism pathways | Cross-reference via KO identifiers |
| HADEG (Hydrocarbon Aerobic Degradation Enzymes and Genes) | Extends degradation-specific gene coverage for hydrocarbons, polymers, and biosurfactant-related pathways | 867 entries across 71 sub-pathways | Cross-reference via KO identifiers |
| ChEBI (Chemical Entities of Biological Interest) | Standardized chemical identifiers, SMILES representations, and compound class assignments | 384 compounds with validated identifiers | Cross-reference via ChEBI IDs |
| ToxCSM | Machine learning-based toxicity predictions as annotation layers | 370 compounds with 31 toxicity endpoints (nuclear, stress-response, genomic, environmental, dose-related) | Cross-reference via CPD identifiers |
| Regulatory Frameworks (ATSDR, EPA, CONAMA, IARC, EU-WFD, CEPA-PSL) | Priority compound lists and hazard classifications defining the scope of environmentally relevant compounds | 9 international regulatory references | Compound inclusion criteria |
What BioRemPP adds:
- Compound-centric relational structure — Links compounds to genes, enzymes, pathways, and regulatory status through standardized identifiers (CAS, ChEBI, KO, SMILES), analogous to toxicogenomic frameworks that prioritize chemical-gene interactions
- Cross-database harmonization — Resolves identifier inconsistencies and synonym mappings across sources, ensuring interoperability between KEGG, ChEBI, HADEG, and ToxCSM
- Regulatory context integration — Associates functional annotations with environmental priority status from multiple international frameworks, structuring the database around environmentally prioritized pollutants rather than pathway catalogs alone
- Analytical framework — Provides structured tidy data tables (10,869 entries with 100% completeness across core fields) optimized for bioremediation potential profiling, functional coverage analysis, and sample comparison
Users seeking primary sequence data, pathway diagrams, or detailed toxicological reports should consult the original databases directly. BioRemPP facilitates this navigation by maintaining traceable cross-references to source identifiers.
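The compound-centric, cross-reference-only design described above can be illustrated with a small sketch. Everything here is hypothetical: the record layout, the key names (`cpd`, `ko`), and the placeholder identifiers are assumptions chosen for illustration, not BioRemPP's actual schema or code. The point is that each compound carries pointers into external resources rather than copies of their data.

```python
from collections import defaultdict

# Hypothetical core table: one row per compound-KO association.
core = [
    {"cpd": "CPD_A", "ko": "KO_1"},
    {"cpd": "CPD_A", "ko": "KO_2"},
    {"cpd": "CPD_B", "ko": "KO_2"},
]
# Hypothetical cross-reference layers keyed by their own identifiers.
chebi_xref = {"CPD_A": "CHEBI:placeholder-1", "CPD_B": "CHEBI:placeholder-2"}
hadeg_xref = {"KO_2": "example_subpathway"}

def compound_view(core_rows, chebi, hadeg):
    """Group KOs by compound and attach external identifiers as references."""
    view = defaultdict(lambda: {"chebi": None, "kos": [], "hadeg": set()})
    for row in core_rows:
        entry = view[row["cpd"]]
        entry["chebi"] = chebi.get(row["cpd"])  # None if no ChEBI mapping
        entry["kos"].append(row["ko"])
        if row["ko"] in hadeg:
            entry["hadeg"].add(hadeg[row["ko"]])
    return dict(view)

view = compound_view(core, chebi_xref, hadeg_xref)
print(view["CPD_A"])
```

Because only identifiers are stored, a user who needs the underlying sequence, pathway, or toxicology record follows the identifier back to the source database.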
Reference Genomes
For demonstration and validation purposes, the database includes functional annotations from nine representative RefSeq genomes spanning principal bioremediation-relevant groups:
- Bacteria: Acinetobacter baumannii, Enterobacter asburiae, Pseudomonas aeruginosa
- Fungi: Aspergillus nidulans, Fusarium graminearum, Cryptococcus gattii
- Microalgae/Cyanobacteria: Chlorella variabilis, Nannochloropsis gaditana, Synechocystis sp.
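One way annotations from a genome like these might be compared against the database is a per-compound coverage fraction. The sketch below is an assumption for illustration only: the source does not specify how BioRemPP scores bioremediation potential, and the function name, data layout, and placeholder identifiers are hypothetical.

```python
def coverage_by_compound(compound_to_kos, sample_kos):
    """Return, per compound, the fraction of its associated KOs found in a sample."""
    result = {}
    for cpd, kos in compound_to_kos.items():
        # Guard against compounds with no associated KOs.
        result[cpd] = len(kos & sample_kos) / len(kos) if kos else 0.0
    return result

# Hypothetical database slice: compound -> set of associated KOs.
db = {
    "CPD_A": {"KO_1", "KO_2"},
    "CPD_B": {"KO_2", "KO_3", "KO_4", "KO_5"},
}
# Hypothetical KO set annotated in one genome or sample.
sample = {"KO_1", "KO_2", "KO_3"}

profile = coverage_by_compound(db, sample)
print(profile)
```

Running the same function over several samples yields comparable profiles, although a high fraction indicates only annotated genetic potential, not demonstrated degradation activity.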
File Formats
All data tables are provided in CSV format with UTF-8 encoding. Detailed field descriptions and data dictionaries are included in the accompanying documentation (biorempp-schemas).
Associated Web Server
The BioRemPP web server (https://bioinfo.imd.ufrn.br/biorempp/) provides interactive visualization and analysis tools for exploring this database across eight analytical modules (56 use cases) that support hypothesis generation.
License
This dataset is released under Creative Commons Attribution 4.0 International (CC BY 4.0). Third-party data sources retain their original licenses.
Version History
- v1.0.0: Initial release
Contact
For questions or feedback, please contact biorempp@gmail.com or submit an issue via the project repository at https://github.com/BioRemPP/biorempp_web/issues.
Keywords
bioremediation, biodegradation, xenobiotics, environmental microbiology, functional genomics, KEGG, pollutants, regulatory compounds, compound-gene associations, toxicology, environmental biotechnology, metagenomics
Metadata Fields
| Field | Value |
|---|---|
| Resource Type | Dataset |
| Title | BioRemPP Database: A Curated Compound-Centric Resource for Bioremediation Potential Profiling |
| Version | 1.0.0 |
| License | Creative Commons Attribution 4.0 International (CC BY 4.0) |
| Language | English |
| Subjects | Environmental Sciences, Bioinformatics, Microbiology, Biotechnology |
Files
kegg_degradation_db.csv
Total size: 1.6 MB
| Checksum | Size |
|---|---|
| md5:992a8877d75ae3b4733a2df46e7206e8 | 19.4 kB |
| md5:9ae358299e131fbb0cb939afa876090a | 1.1 MB |
| md5:22ecc4aa40635de6f3a2dd364647eb39 | 189.4 kB |
| md5:74dbbd02ad1678cd1ee5e0b44ca2e868 | 44.9 kB |
| md5:0e388abe869abcd05d8f99f5ea18dc7d | 22.0 kB |
| md5:9d9631d5bd126e0883565db80e2824bb | 189.6 kB |
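The MD5 checksums listed with the files can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard `hashlib` module. The helper names `md5_of_file` and `matches_checksum` are hypothetical, and the demonstration runs on a temporary file rather than an actual BioRemPP table.

```python
import hashlib
import os
import tempfile

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or bare hex."""
    expected_hex = expected.split(":")[-1].strip().lower()
    return md5_of_file(path) == expected_hex

# Demonstration on a throwaway file; for a real download, pass its path
# together with the corresponding checksum from the file listing.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example contents")
demo_path = tmp.name
demo_ok = matches_checksum(
    demo_path, "md5:" + hashlib.md5(b"example contents").hexdigest()
)
demo_bad = matches_checksum(demo_path, "md5:" + "0" * 32)
os.remove(demo_path)
print(demo_ok, demo_bad)
```

MD5 is adequate for detecting accidental corruption during transfer, but it is not a safeguard against deliberate tampering.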
Additional details
Related works
- Is source of
- Figure: 10.1016/j.jhazmat.2024.136866 (DOI)
- Figure: 10.1016/j.procbio.2024.11.036 (DOI)
- Figure: 10.1016/j.micres.2023.127420 (DOI)
Software
- Repository URL
- https://github.com/BioRemPP/biorempp_web
- Programming language
- Python, R
- Development Status
- Active