Published February 27, 2026 | Version 2.0.0
Dataset Open

MPLID: Membrane Protein-Lipid Interface Dataset v2.0.0

  • 1. Embrapa Digital Agriculture

Description

A large-scale dataset of lipid-contact residue labels, derived from lipids resolved in experimentally determined structures in the Protein Data Bank (PDB).

100% EXPERIMENTAL LABELS - NO COMPUTATIONAL DATABASE DEPENDENCIES

Dataset Statistics (v2.0.0)

  • Proteins: 4,704
  • Total residues: 8,055,325
  • Contact residues: 80,439
  • Contact rate: 1.00%
  • Sequence clusters: 813 (30% identity)
  • Lipid codes recognized: 117
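The contact rate follows directly from the counts above. As a quick sanity check, the arithmetic can be reproduced as follows; the variable names are illustrative and not part of the dataset.

```python
# Counts copied from the Dataset Statistics list (v2.0.0).
proteins = 4_704
total_residues = 8_055_325
contact_residues = 80_439

# Fraction of residues labeled as lipid contacts.
contact_rate = contact_residues / total_residues

# Non-contact residues per contact residue (class imbalance).
negatives_per_positive = (total_residues - contact_residues) / contact_residues

# Average number of residues per protein entry.
mean_residues = total_residues / proteins

print(f"contact rate: {contact_rate:.2%}")                     # contact rate: 1.00%
print(f"negatives per positive: {negatives_per_positive:.1f}")  # negatives per positive: 99.1
print(f"mean residues per protein: {mean_residues:.0f}")        # mean residues per protein: 1712
```

With roughly 99 non-contact residues for every contact residue, plain accuracy is likely to be uninformative on this data; imbalance-aware metrics are a safer choice.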

Train/Validation/Test Splits

  • Train: 2,578 proteins, 4,907,696 residues
  • Val: 1,051 proteins, 1,403,838 residues
  • Test: 1,075 proteins, 1,743,791 residues
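The split sizes sum to the dataset totals (4,704 proteins and 8,055,325 residues). The record describes the splits as cluster-aware, using the 30% identity sequence clusters. The sketch below shows one way such a split could be built: whole clusters are assigned to a single split, so that no cluster spans two splits. It is an illustration only, not the procedure used to build MPLID; the function name, the greedy assignment, the target fractions, and the toy identifiers are all assumptions.

```python
import random
from collections import defaultdict


def cluster_aware_split(protein_to_cluster, fractions=(0.6, 0.2, 0.2), seed=0):
    """Assign whole clusters to train/val/test (hypothetical helper).

    protein_to_cluster maps a protein identifier to its cluster identifier.
    The fractions are illustrative targets, not the dataset's actual ratios.
    """
    clusters = defaultdict(list)
    for protein, cluster in protein_to_cluster.items():
        clusters[cluster].append(protein)

    # Shuffle clusters reproducibly so the assignment is deterministic per seed.
    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)

    names = ("train", "val", "test")
    targets = [f * len(protein_to_cluster) for f in fractions]
    splits = {name: [] for name in names}
    for cid in cluster_ids:
        # Greedy step: place the cluster in the split furthest below its target.
        idx = max(range(len(names)), key=lambda i: targets[i] - len(splits[names[i]]))
        splits[names[idx]].extend(clusters[cid])
    return splits


# Toy mapping with made-up identifiers.
toy = {
    "p1": "c1", "p2": "c1",
    "p3": "c2",
    "p4": "c3", "p5": "c3",
    "p6": "c4",
}
splits = cluster_aware_split(toy)
```

Because clusters are moved as units, the realized split sizes only approximate the target fractions, which is consistent with the uneven protein counts listed above.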

Key Features

  • Labels derived 100% from experimentally resolved lipids in PDB structures
  • 4.0 Å distance cutoff, measured over all heavy (non-hydrogen) atoms
  • 4,704 proteins across all membrane protein classes
  • Cluster-aware splits prevent data leakage
  • Fully reproducible from public PDB data
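The distance-cutoff criterion in the list above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the MPLID pipeline: the input tuples, the function names, the treatment of deuterium as hydrogen, and the inclusive comparison (`<=`) are all assumptions. A real implementation would parse PDB or mmCIF files and would likely use a spatial index instead of the brute-force loop.

```python
import math

CUTOFF = 4.0  # Angstrom, as stated in the feature list


def is_heavy(element):
    """Treat everything except hydrogen (and deuterium) as a heavy atom."""
    return element.upper() not in {"H", "D"}


def contact_residues(protein_atoms, lipid_atoms, cutoff=CUTOFF):
    """Return residue ids with a heavy atom within `cutoff` of a lipid heavy atom.

    protein_atoms: iterable of (residue_id, element, (x, y, z))
    lipid_atoms:   iterable of (element, (x, y, z))
    """
    lipid_xyz = [xyz for element, xyz in lipid_atoms if is_heavy(element)]
    contacts = set()
    for residue_id, element, xyz in protein_atoms:
        if residue_id in contacts or not is_heavy(element):
            continue
        if any(math.dist(xyz, other) <= cutoff for other in lipid_xyz):
            contacts.add(residue_id)
    return contacts


# Toy coordinates (made up): residue 10 has a carbon 3.5 Angstrom from a lipid carbon.
protein = [
    (("A", 10), "C", (0.0, 0.0, 0.0)),
    (("A", 10), "H", (0.0, 0.0, 1.0)),
    (("A", 11), "N", (10.0, 0.0, 0.0)),  # near a lipid hydrogen only
    (("A", 12), "H", (0.0, 3.0, 0.0)),   # hydrogen, so ignored
]
lipid = [("C", (0.0, 3.5, 0.0)), ("H", (9.5, 0.0, 0.0))]

labels = contact_residues(protein, lipid)
print(labels)  # {('A', 10)}
```

Only residue 10 is labeled: residue 11 is close to a lipid hydrogen but not to a lipid heavy atom, and residue 12 contributes no heavy atom in this toy input.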

GitHub: https://github.com/omagebright/MPLID

Notes

Funded by São Paulo Research Foundation (FAPESP) grants 2023/02691-2 and 2025/23708-6.

Files

DATA_DICTIONARY.txt

Files (112.2 MB)

MD5 checksum                            Size
md5:345982a32292f499034e9a6c3d27c434    4.3 kB
md5:e249a876631dede45a3077efd8118309    1.3 kB
md5:9f35324ae0263b1fad2dd57c15a91f93    216.3 kB
md5:fd0775092c4236bcbf747ea75118d4a5    3.9 kB
md5:69c0863eeef4cf0ec32048445a51166d    24.0 MB
md5:e6f3c0660b1740d43d49e7d3a9daf780    68.6 MB
md5:048c3881d0944d0b97c565ba66aaee2b    19.3 MB

Additional details

Related works