Supplementary materials for "MemConverter: An Iterative Pipeline for Reprogramming Protein Localization in Membrane or Aqueous Solution"
Description
This dataset contains 85,051 isolated transmembrane domains derived from the tmAFDB and TED. It was specifically curated to fine-tune ProteinMPNN for membrane protein design (MemProtMPNN).
We focused on isolated domains so that the model learns membrane-specific topological constraints rather than the features of soluble regions.
Data Construction
- Domain Extraction: Primarily based on domain annotations from The Encyclopedia of Domains (TED). For proteins without TED annotations, Merizo was used to identify domains.
- Filtering: Validated by TMbed. We excluded proteins with <20% transmembrane residues or domains shorter than 32 residues.
- Splitting: Clustered using MMseqs2 (30% sequence identity) to strictly separate Training, Validation, and Test sets.
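The filtering and splitting rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline. It assumes that each domain comes with a per-residue topology string in which `H`/`h` and `B`/`b` mark transmembrane residues (an assumption about the prediction format), that the 20% threshold is applied to that string, and that cluster membership is available as a member-to-representative mapping such as can be read from a two-column cluster table. The function names, the random cluster assignment, and the seed are hypothetical.

```python
import random
from collections import defaultdict

# Assumed labels for transmembrane residues in the per-residue prediction string.
TM_LABELS = set("HhBb")

def tm_fraction(labels: str) -> float:
    """Fraction of residues labelled as transmembrane."""
    if not labels:
        return 0.0
    return sum(ch in TM_LABELS for ch in labels) / len(labels)

def passes_filter(labels: str, min_len: int = 32, min_tm: float = 0.20) -> bool:
    """Keep entries with at least min_len residues and at least min_tm TM fraction."""
    return len(labels) >= min_len and tm_fraction(labels) >= min_tm

def split_by_cluster(member_to_rep, fractions=(0.85, 0.09, 0.06), seed=0):
    """Assign whole clusters to splits so that no cluster spans two sets."""
    clusters = defaultdict(list)
    for member, rep in member_to_rep.items():
        clusters[rep].append(member)

    reps = sorted(clusters)              # deterministic order before shuffling
    random.Random(seed).shuffle(reps)

    names = ("train", "validation", "test")
    targets = [f * len(member_to_rep) for f in fractions]
    splits = {name: [] for name in names}
    idx = 0
    for rep in reps:
        # Move to the next split once the current one has reached its target size.
        while idx < len(names) - 1 and len(splits[names[idx]]) >= targets[idx]:
            idx += 1
        splits[names[idx]].extend(clusters[rep])
    return splits
```

Assigning clusters rather than individual sequences is what keeps the three sets separated at the chosen identity threshold: two sequences in the same cluster always land in the same set.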
Statistics
| Set | Count | Percentage |
| --- | --- | --- |
| Train | 72,293 | 85% |
| Validation | 7,655 | 9% |
| Test | 5,103 | 6% |
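As a quick consistency check, the counts in the table sum to the 85,051 domains stated in the description, and the percentages follow from rounding each share to the nearest whole number. The variable names below are illustrative only.

```python
# Counts taken from the Statistics table.
counts = {"Train": 72_293, "Validation": 7_655, "Test": 5_103}

total = sum(counts.values())
shares = {name: round(100 * n / total) for name, n in counts.items()}

print(total)   # 85051
print(shares)  # {'Train': 85, 'Validation': 9, 'Test': 6}
```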
Files Description
- membrane_domain_dataset.zip: The main dataset archive.
  - data_files.zip: Contains the processed dataset partitions (Train, Validation, Test).
  - test_list: A curated list of representative cases from the test set (one representative sequence per cluster) used for performance evaluation.
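To inspect the archive after downloading it, a small helper built on Python's standard `zipfile` module can list its entries. This is a minimal sketch: the helper name `list_entries` is hypothetical, and it assumes only that the download is a zip archive that may itself contain further `.zip` members, as the description of data_files.zip suggests. No file names inside the partitions are assumed.

```python
import io
import zipfile

def list_entries(source, prefix=""):
    """Return entry names in a zip archive, descending into nested .zip members."""
    names = []
    with zipfile.ZipFile(source) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            path = prefix + info.filename
            names.append(path)
            if info.filename.lower().endswith(".zip"):
                # Read the nested archive into memory and list it recursively.
                nested = io.BytesIO(zf.read(info))
                names.extend(list_entries(nested, prefix=path + "/"))
    return names
```

Calling `list_entries("membrane_domain_dataset.zip")` from the download directory returns every entry path, with members of nested archives prefixed by the name of the archive that contains them. Nested archives are read fully into memory, so this is meant for inspection rather than bulk extraction.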
Files
membrane_domain_dataset.zip
Additional details
Additional titles
- Alternative title (English): A curated dataset of transmembrane protein domains for MemProtMPNN