Published February 4, 2026 | Version v1
Dataset Open

Supplementary materials for "MemConverter: An Iterative Pipeline for Reprogramming Protein Localization in Membrane or Aqueous Solution"

  • 1. Peking University

Description

This dataset contains 85,051 isolated transmembrane domains derived from tmAFDB and TED. It was curated to fine-tune ProteinMPNN for membrane protein design (MemProtMPNN).

We restricted the dataset to isolated domains so that the model learns membrane-specific topological constraints rather than features of soluble regions.

Data Construction

  • Domain Extraction: Primarily based on domain annotations from The Encyclopedia of Domains (TED). For proteins without TED annotations, Merizo was used to identify domains.

  • Filtering: Transmembrane content was validated with TMbed. We excluded proteins with fewer than 20% transmembrane residues and domains shorter than 32 residues.

  • Splitting: Clustered using MMseqs2 (30% sequence identity) to strictly separate Training, Validation, and Test sets.
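The filtering rule above can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes a TMbed-style per-residue label string in which H/h and B/b mark transmembrane helix and strand residues, and it applies both thresholds to the same label string, since the description does not say how the per-protein and per-domain checks were combined. The names TM_LABELS, tm_fraction, and keep_domain are hypothetical.

```python
# Assumption: per-residue labels in which H/h (helix) and B/b (strand)
# mark transmembrane residues; every other character is non-membrane.
TM_LABELS = frozenset("HhBb")
MIN_TM_FRACTION = 0.20  # threshold stated in the dataset description
MIN_LENGTH = 32         # threshold stated in the dataset description


def tm_fraction(labels: str) -> float:
    """Return the fraction of residues labelled as transmembrane."""
    if not labels:
        return 0.0
    return sum(ch in TM_LABELS for ch in labels) / len(labels)


def keep_domain(labels: str) -> bool:
    """Keep an entry that is long enough and sufficiently transmembrane."""
    return len(labels) >= MIN_LENGTH and tm_fraction(labels) >= MIN_TM_FRACTION
```

For example, keep_domain("i" * 10 + "H" * 20 + "o" * 10) keeps a 40-residue entry that is half transmembrane, while a 31-residue entry is rejected regardless of its labels.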

Statistics

Set          Count    Percentage
Train        72,293   85%
Validation    7,655    9%
Test          5,103    6%
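A split like the one tabulated above keeps every sequence cluster inside a single set. The sketch below illustrates one way to do that, assuming a two-column tab-separated file of representative and member identifiers such as the cluster table MMseqs2 writes. The shuffle-and-fill assignment, the seed, and the function names read_clusters and split_clusters are assumptions for illustration; the description does not state how clusters were assigned to sets.

```python
import random
from collections import defaultdict


def read_clusters(lines):
    """Group member IDs by cluster representative (rows: representative<TAB>member)."""
    clusters = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        representative, member = line.split("\t")
        clusters[representative].append(member)
    return dict(clusters)


def split_clusters(clusters,
                   fractions=(("train", 0.85), ("validation", 0.09), ("test", 0.06)),
                   seed=0):
    """Assign whole clusters to splits so that no cluster spans two splits."""
    representatives = sorted(clusters)
    random.Random(seed).shuffle(representatives)  # reproducible shuffle

    total = sum(len(members) for members in clusters.values())
    names = [name for name, _ in fractions]
    targets = [fraction * total for _, fraction in fractions]
    splits = {name: [] for name in names}

    current = 0
    for representative in representatives:
        # Advance to the next split once the current one has reached its
        # target size; the last split receives whatever remains.
        while current < len(names) - 1 and len(splits[names[current]]) >= targets[current]:
            current += 1
        splits[names[current]].extend(clusters[representative])
    return splits
```

Because clusters are assigned whole, the resulting set sizes only approximate the target fractions; the published counts (72,293 / 7,655 / 5,103 of 85,051) correspond to 85%, 9%, and 6% after rounding.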

Files Description

  • membrane_domain_dataset.zip: The main dataset archive.
  • data_files.zip: Contains the processed dataset partitions (Train, Validation, Test). 

  • test_list: A curated list of representative cases from the test set (one representative sequence per cluster) used for performance evaluation. 

Files

membrane_domain_dataset.zip

Files (2.5 GB)

Checksum                                Size
md5:83b97226106bcff9aa552849c3e43489    8.9 MB
md5:4923aa4779aa6b38034e4c7cf89a1726    2.4 GB
md5:ac8584ddfb53173a2ee5589f5ffac3ed    108.3 kB
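The MD5 checksums listed above can be used to check a download for corruption. The following is a minimal sketch using only the Python standard library; the function names md5sum and verify are illustrative, and the caller supplies both the local path and the checksum, because the listing here does not show which checksum belongs to which file.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or plain hex."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5sum(path) == expected_hex
```

Reading in chunks keeps memory use flat, which matters for the 2.4 GB archive.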

Additional details

Additional titles

Alternative title (English)
A curated dataset of transmembrane protein domains for MemProtMPNN