Published December 23, 2025 | Version v0.1.5

EvoMotif: Evolution-Driven Framework for Protein Motif Discovery

Description

EvoMotif: Evolutionary Protein Motif Discovery and Statistical Validation

OVERVIEW
EvoMotif discovers evolutionarily conserved protein motifs through multi-species sequence analysis, combining information theory, evolutionary substitution matrices, and rigorous statistical validation.

CORE ALGORITHMS

1. Dual-Metric Conservation Scoring
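   - Shannon Entropy: H(i) = -Σ_a p_a(i) × log₂ p_a(i), where p_a(i) is the frequency of residue a in alignment column i; converted to a conservation score C_shannon(i) normalized to [0,1]
     Detects strict conservation (identical residues, e.g. at catalytic sites)
   - BLOSUM62 Score: B_norm(i), a normalized BLOSUM62 substitution score that captures functional constraints from evolutionary substitution data
     Detects functional conservation (physicochemically similar substitutions)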
   - Combined Score: C_final(i) = 0.5 × C_shannon(i) + 0.5 × B_norm(i)
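
As an illustration of the scoring above, the following minimal Python sketch computes a Shannon-based conservation score per alignment column and blends it with a BLOSUM62-derived score. It is not EvoMotif's actual code: the function names are hypothetical, the normalization 1 - H/log₂(20) and the choice to ignore gaps are assumptions, and B_norm(i) is taken as a precomputed input rather than derived here.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MAX_ENTROPY = math.log2(len(AMINO_ACIDS))  # entropy of a uniformly mixed column


def shannon_conservation(column):
    """Return 1 - H/log2(20) for one alignment column (assumed normalization).

    Gap characters and non-standard residues are ignored (an assumption).
    """
    residues = [aa for aa in column.upper() if aa in AMINO_ACIDS]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return 1.0 - entropy / MAX_ENTROPY


def combined_score(c_shannon, b_norm, weight=0.5):
    """C_final = 0.5 * C_shannon + 0.5 * B_norm with the default weight."""
    return weight * c_shannon + (1.0 - weight) * b_norm


# Toy alignment: four sequences, five columns.
alignment = ["MKHLC", "MKHIC", "MRHVC", "MKH-C"]
columns = ["".join(seq[i] for seq in alignment) for i in range(len(alignment[0]))]
shannon_scores = [shannon_conservation(col) for col in columns]
```

In this toy alignment the invariant columns (M, H, C) score 1.0, while the variable fourth column scores lower.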

2. Sliding Window Motif Discovery
   - Multi-scale scanning: windows of 5, 7, 9, 11, 13, 15, 17, 19, 21 residues
   - Adaptive thresholding (default: conservation ≥ 0.70)
   - Overlap resolution: keeps highest-scoring windows
   - Gap filtering: requires ≥70% sequence coverage
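
The scanning and overlap-resolution steps above can be sketched roughly as follows. This is a hedged illustration, not the package's implementation: `find_motifs` is a hypothetical name, scoring a window by the mean of its per-position scores is an assumption, the greedy highest-score-first overlap rule is one plausible reading of "keeps highest-scoring windows", and gap filtering is omitted.

```python
WINDOW_SIZES = range(5, 22, 2)  # 5, 7, 9, ..., 21 residues


def find_motifs(scores, threshold=0.70, window_sizes=WINDOW_SIZES):
    """Return non-overlapping windows as (mean_score, start, end) tuples.

    `end` is exclusive. Window score = mean conservation (an assumption).
    """
    candidates = []
    for size in window_sizes:
        for start in range(len(scores) - size + 1):
            mean = sum(scores[start:start + size]) / size
            if mean >= threshold:
                candidates.append((mean, start, start + size))

    # Greedy overlap resolution: best score first; ties go to the longer window.
    candidates.sort(key=lambda c: (-c[0], -(c[2] - c[1]), c[1]))
    kept = []
    for mean, start, end in candidates:
        if all(end <= s or start >= e for _, s, e in kept):
            kept.append((mean, start, end))
    return sorted(kept, key=lambda c: c[1])


# Toy profile: a fully conserved stretch of 7 positions in a variable background.
profile = [0.0] * 10 + [1.0] * 7 + [0.0] * 10
motifs = find_motifs(profile)
```

With this profile, partially overlapping windows also pass the 0.70 threshold, but the greedy step keeps only the single window covering the conserved stretch.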

3. Statistical Validation
   - Permutation Testing: 10,000 permutations per motif for empirical p-values
   - FDR Correction: Benjamini-Hochberg procedure at α = 0.05
   - Effect Size: Cohen's d > 0.5 required for reporting
   - Only motifs with p < 0.05 (FDR-corrected) AND d > 0.5 are reported
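
A minimal sketch of the three validation statistics, using only the Python standard library. The function names are hypothetical, and the null model (comparing a motif's mean conservation with random position sets of the same size) and the pooled-standard-deviation form of Cohen's d are assumptions about how such a test could be set up, not a description of EvoMotif's internals.

```python
import math
import random
import statistics


def permutation_p_value(motif_scores, all_scores, n_permutations=10_000, seed=0):
    """One-sided empirical p-value: is the motif mean higher than random sets?"""
    rng = random.Random(seed)
    observed = statistics.fmean(motif_scores)
    k = len(motif_scores)
    hits = sum(
        statistics.fmean(rng.sample(all_scores, k)) >= observed
        for _ in range(n_permutations)
    )
    # The +1 terms keep the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled standard deviation (needs non-zero variance)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * statistics.variance(group_a)
        + (n_b - 1) * statistics.variance(group_b)
    ) / (n_a + n_b - 2)
    return (statistics.fmean(group_a) - statistics.fmean(group_b)) / math.sqrt(pooled_var)
```

Under these assumptions, a motif would be reported when its adjusted p-value is below 0.05 and its effect size exceeds 0.5. With 10,000 permutations the smallest attainable p-value is 1/10,001.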

VALIDATION RESULTS
Tested against known functional sites in hemoglobin α-chain, p53 tumor suppressor, and BRCA1:
- Hemoglobin: 100% detection of heme-binding residues (His59, His88)
- p53: All 5 Zn²⁺-binding cysteines identified, R248 and R273 cancer hotspots detected
- BRCA1: RING domain Cys/His residues, BRCT phospho-peptide binding sites found
Conclusion: All discovered motifs correspond to experimentally validated functional sites

PERFORMANCE BENCHMARKS (Intel Core i7-9700K, 16GB RAM)
- Ubiquitin (50 seq, 76 res): 45 sec total, 350 MB memory, 4 motifs
- Hemoglobin α (100 seq, 143 res): 2.5 min total, 580 MB memory, 9 motifs
- p53 (150 seq, 393 res): 8 min total, 1.2 GB memory, 12 motifs
- BRCA1 (200 seq, 1863 res): 28 min total, 3.8 GB memory, 38 motifs

USE CASES
1. Mutagenesis Planning: Identify critical residues (conservation > 0.85) vs safe targets (< 0.4)
2. Disease Variant Interpretation: Assess pathogenicity of missense mutations
3. Functional Domain Annotation: Discover domains in unannotated proteins
4. Protein Engineering: Design minimal functional constructs
5. Structural Biology: Correlate conservation with AlphaFold confidence scores
6. Comparative Genomics: Study evolutionary constraints across protein families

PIPELINE STAGES
Sequence retrieval (NCBI) → Alignment (MAFFT) → Conservation scoring (Shannon + BLOSUM62) → Motif discovery (sliding windows) → Statistical validation (permutation + FDR) → Phylogenetic tree (FastTree) → Structure mapping (PDB)

OUTPUT FILES
- FASTA: sequences and alignments
- JSON: conservation scores, motifs with p-values and effect sizes
- Newick: phylogenetic trees
- PDB: conservation mapped to B-factor column
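
To show what mapping conservation to the B-factor column involves, here is a small sketch that overwrites the temperature-factor field (columns 61-66 of fixed-width ATOM/HETATM records) with a per-residue score. The function name, the dictionary keyed by residue number, and the optional `scale` factor are assumptions made for illustration; chains and insertion codes are ignored.

```python
def map_conservation_to_bfactor(pdb_lines, scores_by_residue, scale=1.0):
    """Return PDB lines with B-factors replaced by conservation scores.

    `scores_by_residue` maps residue sequence numbers to scores
    (a hypothetical input format). Other lines pass through unchanged.
    """
    out = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")) and len(line) >= 66:
            res_num = int(line[22:26])  # columns 23-26: residue sequence number
            score = scores_by_residue.get(res_num)
            if score is not None:
                # Columns 61-66 hold the B-factor, formatted as %6.2f.
                line = f"{line[:60]}{score * scale:6.2f}{line[66:]}"
        out.append(line)
    return out
```

Structure viewers such as PyMOL or ChimeraX can then color the model by B-factor to visualize conserved regions.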

INSTALLATION
pip install evomotif
External dependencies: mafft, fasttree (via apt, brew, or conda)

DOCUMENTATION
GitHub: https://github.com/tahagill/EvoMotif
Complete Guide: https://github.com/tahagill/EvoMotif/blob/main/docs/COMPLETE_GUIDE.md
PyPI: https://pypi.org/project/evomotif/

REQUIREMENTS
Python 3.8-3.11, Linux/macOS/WSL, 8GB RAM minimum (16GB recommended)

LICENSE
MIT License

Files

tahagill/EvoMotif-v0.1.5.zip (139.1 kB)
md5:4c4fcbbf0b09441c74fff09dabccfdf6