EvoMotif: Evolution-Driven Framework for Protein Motif Discovery
Description
EvoMotif: Evolutionary Protein Motif Discovery and Statistical Validation
OVERVIEW
EvoMotif discovers evolutionarily conserved protein motifs through multi-species sequence analysis, combining information theory, evolutionary substitution matrices, and rigorous statistical validation.
CORE ALGORITHMS
1. Dual-Metric Conservation Scoring
- Shannon Entropy: H(i) = -Σ_a p_a(i) × log₂ p_a(i), normalized to [0,1], where p_a(i) is the frequency of amino acid a in alignment column i
  Detects strict conservation (identical residues at catalytic sites)
- BLOSUM62 Score: Captures functional constraints from evolutionary substitution data
  Detects functional conservation (physicochemically similar substitutions)
- Combined Score: C_final(i) = 0.5 × C_shannon(i) + 0.5 × B_norm(i)
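The Shannon term and the equal-weight combination above can be sketched in a few lines of Python. This is an illustrative sketch, not EvoMotif's own code: the function names are hypothetical, and it assumes that gaps are ignored, that entropy is normalized by log₂(20), and that C_shannon(i) = 1 - H(i)/log₂(20). The BLOSUM62 term B_norm(i) is treated as an already-normalized input.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def shannon_entropy(column):
    """Shannon entropy in bits of one alignment column (gaps ignored)."""
    residues = [c for c in column.upper() if c in AMINO_ACIDS]
    if not residues:
        return None  # column holds only gaps or unknown symbols
    n = len(residues)
    counts = Counter(residues)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def shannon_conservation(column):
    """Assumed conversion: 1 - H / log2(20), so 1.0 means fully conserved."""
    h = shannon_entropy(column)
    if h is None:
        return 0.0
    return 1.0 - h / math.log2(len(AMINO_ACIDS))

def combined_score(c_shannon, b_norm, weight=0.5):
    """C_final = weight * C_shannon + (1 - weight) * B_norm."""
    return weight * c_shannon + (1.0 - weight) * b_norm
```

A column of identical residues such as "HHHH" has zero entropy and therefore a Shannon conservation of 1.0, while a column split evenly between two residues has an entropy of exactly 1 bit.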
2. Sliding Window Motif Discovery
- Multi-scale scanning: windows of 5, 7, 9, 11, 13, 15, 17, 19, 21 residues
- Adaptive thresholding (default: conservation ≥ 0.70)
- Overlap resolution: keeps highest-scoring windows
- Gap filtering: requires ≥70% sequence coverage
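The scanning steps above can be sketched as follows. This is a hedged illustration rather than EvoMotif's implementation: `find_motifs` is a hypothetical name, a window's score is assumed to be the mean of its per-column conservation scores, the gap filter is assumed to compare the mean column coverage (fraction of non-gap sequences) with 0.70, and overlaps are resolved greedily in favour of the highest-scoring window, with ties going to the longer window.

```python
def find_motifs(scores, window_sizes=range(5, 22, 2), threshold=0.70,
                coverage=None, min_coverage=0.70):
    """Return non-overlapping (start, end, mean_score) windows, end exclusive."""
    n = len(scores)
    candidates = []
    for width in window_sizes:              # 5, 7, ..., 21 by default
        for start in range(n - width + 1):  # empty when width > n
            end = start + width
            # Assumed gap filter: mean coverage of the window's columns.
            if coverage is not None:
                if sum(coverage[start:end]) / width < min_coverage:
                    continue
            mean = sum(scores[start:end]) / width
            if mean >= threshold:
                candidates.append((start, end, mean))

    # Greedy overlap resolution: best score first, then longer, then leftmost.
    candidates.sort(key=lambda c: (-c[2], -(c[1] - c[0]), c[0]))
    kept = []
    for start, end, mean in candidates:
        if all(end <= s or start >= e for s, e, _ in kept):
            kept.append((start, end, mean))
    return sorted(kept)
```

For a 17-column profile with a 7-column conserved core, `find_motifs([0.25] * 5 + [0.875] * 7 + [0.25] * 5)` keeps the single window covering columns 5 to 11 and discards the shorter and partially overlapping candidates.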
3. Statistical Validation
- Permutation Testing: 10,000 permutations per motif, giving empirical p-values (resolution of about 1e-4)
- FDR Correction: Benjamini-Hochberg procedure at α = 0.05
- Effect Size: Cohen's d > 0.5 required for reporting
- Only motifs with p < 0.05 (FDR-corrected) AND d > 0.5 are reported
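The three checks above can be sketched with the standard library alone. The code below is an illustration under stated assumptions, not EvoMotif's implementation: all function names are hypothetical, the null distribution is assumed to come from random sets of alignment columns of the same size as the motif, and Cohen's d is assumed to compare motif columns with the remaining columns using a pooled standard deviation.

```python
import random
import statistics

def permutation_p_value(scores, start, end, n_perm=10_000, seed=0):
    """Empirical one-sided p-value for the mean score of scores[start:end]."""
    rng = random.Random(seed)
    width = end - start
    observed = sum(scores[start:end]) / width
    hits = 0
    for _ in range(n_perm):
        # Assumed null: `width` columns drawn at random without replacement.
        sample = rng.sample(scores, width)
        if sum(sample) / width >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction, never exactly 0

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def cohens_d(group, background):
    """Cohen's d with a pooled standard deviation (needs >= 2 values each)."""
    n1, n2 = len(group), len(background)
    v1, v2 = statistics.variance(group), statistics.variance(background)
    pooled = (((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)) ** 0.5
    if pooled == 0:
        raise ValueError("pooled standard deviation is zero")
    return (statistics.mean(group) - statistics.mean(background)) / pooled
```

Under these assumptions, a motif would be reported when its adjusted p-value is below 0.05 and its effect size exceeds 0.5.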
VALIDATION RESULTS
Tested against known functional sites in hemoglobin α-chain, p53 tumor suppressor, and BRCA1:
- Hemoglobin: 100% detection of heme-binding residues (His59, His88)
- p53: All 5 Zn²⁺-binding cysteines identified, R248 and R273 cancer hotspots detected
- BRCA1: RING domain Cys/His residues, BRCT phospho-peptide binding sites found
Conclusion: In these three test proteins, all discovered motifs correspond to experimentally validated functional sites
PERFORMANCE BENCHMARKS (Intel Core i7-9700K, 16 GB RAM)
- Ubiquitin (50 seq, 76 res): 45 sec total, 350 MB memory, 4 motifs
- Hemoglobin α (100 seq, 143 res): 2.5 min total, 580 MB memory, 9 motifs
- p53 (150 seq, 393 res): 8 min total, 1.2 GB memory, 12 motifs
- BRCA1 (200 seq, 1863 res): 28 min total, 3.8 GB memory, 38 motifs
USE CASES
1. Mutagenesis Planning: Identify critical residues (conservation > 0.85) vs safe targets (< 0.4)
2. Disease Variant Interpretation: Assess pathogenicity of missense mutations
3. Functional Domain Annotation: Discover domains in unannotated proteins
4. Protein Engineering: Design minimal functional constructs
5. Structural Biology: Correlate conservation with AlphaFold confidence scores
6. Comparative Genomics: Study evolutionary constraints across protein families
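For the mutagenesis-planning use case, the two thresholds quoted above (conservation > 0.85 and < 0.4) translate directly into a per-residue labelling. The sketch below is only an illustration: the function name and the "intermediate" label for scores between the two thresholds are assumptions, not part of EvoMotif.

```python
def classify_positions(scores, critical=0.85, safe=0.40):
    """Label each conservation score for mutagenesis planning."""
    labels = []
    for score in scores:
        if score > critical:
            labels.append("critical")      # avoid mutating
        elif score < safe:
            labels.append("safe")          # candidate mutagenesis target
        else:
            labels.append("intermediate")  # assumed label, review case by case
    return labels
```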
PIPELINE STAGES
Sequence retrieval (NCBI) → Alignment (MAFFT) → Conservation scoring (Shannon + BLOSUM62) → Motif discovery (sliding windows) → Statistical validation (permutation + FDR) → Phylogenetic tree (FastTree) → Structure mapping (PDB)
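The alignment and tree-building stages rely on the external MAFFT and FastTree programs, both of which write their result to standard output. The sketch below shows one way to drive them from Python; it is not EvoMotif's API, the helper names are hypothetical, and the FastTree executable name is an assumption (it is installed as `fasttree` by some package managers and as `FastTree` by others).

```python
import shutil
import subprocess

def mafft_command(fasta_in):
    """MAFFT with automatic strategy selection; alignment goes to stdout."""
    return ["mafft", "--auto", str(fasta_in)]

def fasttree_command(alignment, executable="fasttree"):
    """FastTree on a protein alignment; Newick tree goes to stdout."""
    return [executable, str(alignment)]

def run_to_file(command, out_path):
    """Run an external tool and capture its standard output in out_path."""
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]} not found on PATH")
    with open(out_path, "w") as handle:
        subprocess.run(command, stdout=handle, check=True)
```

With both tools installed, `run_to_file(mafft_command("seqs.fasta"), "aligned.fasta")` followed by `run_to_file(fasttree_command("aligned.fasta"), "tree.nwk")` would produce an alignment and a Newick tree; the file names here are placeholders.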
OUTPUT FILES
- FASTA: sequences and alignments
- JSON: conservation scores, motifs with p-values and effect sizes
- Newick: phylogenetic trees
- PDB: conservation mapped to B-factor column
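Mapping conservation onto the B-factor column means overwriting columns 61 to 66 of each ATOM or HETATM record in the PDB file, so that molecular viewers can colour the structure by conservation. The sketch below illustrates the idea under assumptions: the function name is hypothetical, scores are written unscaled, and PDB residue numbers are assumed to match the keys of the score dictionary.

```python
def write_scores_to_bfactor(pdb_lines, scores_by_resnum, chain=None):
    """Return PDB lines with conservation scores in the B-factor field."""
    out = []
    for raw in pdb_lines:
        line = raw.rstrip("\n")
        if line.startswith(("ATOM", "HETATM")) and len(line) >= 54:
            resnum = int(line[22:26])            # resSeq, columns 23-26
            chain_ok = chain is None or line[21] == chain
            if chain_ok and resnum in scores_by_resnum:
                line = line.ljust(66)            # make sure the field exists
                bfactor = f"{scores_by_resnum[resnum]:6.2f}"
                line = line[:60] + bfactor + line[66:]  # columns 61-66
        out.append(line)
    return out
```

Records for residues without a score, and all non-coordinate records, pass through unchanged.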
INSTALLATION
pip install evomotif
External dependencies: mafft, fasttree (via apt, brew, or conda)
DOCUMENTATION
GitHub: https://github.com/tahagill/EvoMotif
Complete Guide: https://github.com/tahagill/EvoMotif/blob/main/docs/COMPLETE_GUIDE.md
PyPI: https://pypi.org/project/evomotif/
REQUIREMENTS
Python 3.8-3.11, Linux/macOS/WSL, 8 GB RAM minimum (16 GB recommended)
LICENSE
MIT License
Files
tahagill/EvoMotif-v0.1.5.zip (139.1 kB, md5:4c4fcbbf0b09441c74fff09dabccfdf6)