Published May 14, 2023 | Version V1.0
Dataset Open

A large expert-curated cryo-EM image dataset for machine learning protein particle picking

  • 1. University of Missouri
  • 2. Brookhaven National Laboratory

Description

Cryo-electron microscopy (cryo-EM) is a powerful technique for determining the structures of biological macromolecular complexes. Picking single-protein particles from cryo-EM micrographs is a crucial step in reconstructing protein structures. However, the widely used template-based particle picking process is labor-intensive and time-consuming. Although machine learning and artificial intelligence (AI) based particle picking can potentially automate the process, its development is hindered by the lack of large, high-quality labelled training data. To address this bottleneck, we present CryoPPP, a large, diverse, expert-curated cryo-EM image dataset for protein particle picking and analysis. It consists of labelled cryo-EM micrographs (images) from 34 representative protein datasets selected from the Electron Microscopy Public Image Archive (EMPIAR). The dataset is 2.6 terabytes in size and includes 9,893 high-resolution micrographs with labelled protein particle coordinates. The labelling process was rigorously validated against the gold standard through 2D particle class validation and 3D density map validation. The dataset is expected to greatly facilitate the development of both AI and classical methods for automated cryo-EM protein particle picking.

Files

Files (315.8 MB)

MD5 checksum  Size
md5:eb5e8de64cf9863b2503bc8edf826011  441.4 kB
md5:f5b3ae87678d2a6b2bb2563e2e5c62b9  5.3 MB
md5:2e63200d5e05cdfcdd87e39696f9c966  3.1 MB
md5:8f4434ab5c95f82a7880b54a502c1d91  19.7 MB
md5:ab7aab865953c4414d9fd6c2131d4ac2  6.3 MB
md5:7fd15b756342f647a1d72fb8ee4d9107  1.3 MB
md5:47341ae2b0ce6480da08c77dd007e2c2  4.4 MB
md5:be2cc9a0bcef289d704ab4f4341ea44b  4.5 MB
md5:db4b969492d880edf379abd70919bb37  5.4 MB
md5:58ff17a25c588507db20f9f3fc4aa38c  24.9 MB
md5:e3d3faf070cd5a4fcfa684d56d57149b  25.9 MB
md5:2281a3bcda545fb0efa696d7c5439c74  8.2 MB
md5:e47eb2c09deda058d9b297dfe5a15100  7.5 MB
md5:5d517e7ec0254790d3a67c341b2110e7  14.3 MB
md5:8739966a224e148576f42bc40d8f1fbc  2.2 MB
md5:2b61192df594fa71233f7165ec1f3f18  8.0 MB
md5:d5fe874f6606f1fe396bf88826605075  1.4 MB
md5:0afb0dbbd4c7598455a1ec808037bd33  2.3 MB
md5:59b9585dddffff244cb367bd22f59da2  6.7 MB
md5:1515b706f5ac854846a38c54d09a575a  868.8 kB
md5:d8d3d3c144e4fe9c81a273bf0d1388bd  10.9 MB
md5:5eac75c86dfeb60cb98ebeaf3bbf4abd  7.9 MB
md5:fc0aec2bde5b8830d31152a1a7a330f5  6.9 MB
md5:f8ba17a86f2b022b8f993cbfb74f3d44  2.8 MB
md5:4ed27bd5606cebd4f4e29e7428281721  14.0 MB
md5:d38ae972a67867c796a112dcb08fb1b6  6.0 MB
md5:502ee09d9f0234a7052fae057ee29329  25.7 MB
md5:db982e06306d2e59fa72fe8b9c1b1d9d  5.1 MB
md5:cc141c4f0a262086287f7d880a034128  27.2 MB
md5:0fbd2049efadd210873d114171d9114d  7.4 MB
md5:c95f5b0e6595231157a42aa863603772  8.5 MB
md5:3ce522e4b9c1965a490170637cf95883  10.7 MB
md5:d919c668a9f30ff3dd94aa1ce0b6b477  6.9 MB
md5:6909fefc9175988c4e68b22541dd9913  23.2 MB
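
The MD5 checksums listed above can be used to confirm that a downloaded file arrived intact. The following is a minimal sketch in Python, not part of the dataset itself: the function names `md5_of_file` and `matches_listed_checksum` are hypothetical helpers introduced here for illustration, and the file path in the usage note is an assumption, since the record does not show file names.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that large downloads do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file against a checksum written as in the list above,
    e.g. 'md5:eb5e8de64cf9863b2503bc8edf826011' (hypothetical helper)."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

A call such as `matches_listed_checksum("downloads/some_file", "md5:eb5e8de64cf9863b2503bc8edf826011")` would return `True` only if the local file hashes to the listed value; the path `downloads/some_file` is a placeholder. Reading in chunks is a design choice that keeps memory use constant regardless of file size.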

Additional details

Related works

Is cited by
Journal article: 10.1038/s41597-023-02280-2 (DOI)