Published August 5, 2025 | Version v1
Dataset Open

Phold Database 3.16M Remaining Protein Structure Predictions

Authors/Creators

Description

Repository containing 1,802,498 protein structures constituting all proteins in Phold DB 3.16M that are not in Phold Search DB 1.36M. All Phold Search DB 1.36M structures can be found at https://doi.org/10.5281/zenodo.16739199. These are all efam and enVhog proteins that could not be assigned a PHROG and have no annotated functional label.

Specifically, the large `final_combined_phold_db_3M16_efam_envhog_no_phrog.tar.xz` tarball consists of:

* 158,898 extremely conservative efam proteins <3000AA without an assigned PHROG group, in `efam`. These have been chunked into batches of 5000 `.pdb` files for file-system sanity.

* 1,638,667 enVhog proteins <3000AA without an assigned PHROG group, in `envhog`. These have also been chunked into batches of 5000 `.pdb` files.

There is an additional smaller top-level tarball, `envhogs_renamed_3000_plus.tar.gz`, containing:

* 4933 `.pdb` files of enVhog proteins 3000+ AA that have been fragmented as described in the Phold manuscript (i.e. proteins were chunked into equal fragments under 3000AA, so a 5000AA protein has two 2500AA chunks and an 8100AA protein has three 2700AA chunks). These fragments are indicated as `_1`, `_2`, etc.
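The fragment arithmetic above can be sketched in POSIX shell. This is only an illustration of the two stated examples, not the code used to build the database: the function name `fragment_lengths` is hypothetical, and the handling of lengths that do not divide evenly (extra residues go to the first fragments) is an assumption, since the exact rule is described in the Phold manuscript rather than here.

```shell
# Print the length of each fragment, one per line.
# Usage: fragment_lengths <protein_length> [limit]
# Assumes protein_length is a positive integer.
fragment_lengths() {
    length=$1
    limit=${2:-3000}
    # Fewest fragments such that every fragment is strictly under the limit,
    # i.e. ceil(length / (limit - 1)).
    n=$(( (length + limit - 2) / (limit - 1) ))
    base=$(( length / n ))
    # Assumption: when the split is uneven, the first `rem` fragments
    # each get one extra residue.
    rem=$(( length % n ))
    i=1
    while [ "$i" -le "$n" ]; do
        if [ "$i" -le "$rem" ]; then
            echo $(( base + 1 ))
        else
            echo "$base"
        fi
        i=$(( i + 1 ))
    done
}
```

With the default limit, `fragment_lengths 5000` prints `2500` twice and `fragment_lengths 8100` prints `2700` three times, matching the examples above.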

All structures are named as they appear in the Phold DB 3.16M.

# Recombining the large tarball

Because the tarball `final_combined_phold_db_3M16_efam_envhog_no_phrog.tar.xz` is very large, I had significant difficulties uploading it to Zenodo. Accordingly, I split it into ten smaller chunks as follows (using the Linux `split` command):

`split --bytes=5G --suffix-length=3 --numeric-suffixes final_combined_phold_db_3M16_efam_envhog_no_phrog.tar.xz final_combined_phold_db_3M16_efam_envhog_no_phrog.tar.xz_part_`

To recombine them into the original larger tarball, download all the chunks in this record and concatenate them:

`cat final_combined_phold_db_3M16_efam_envhog_no_phrog.tar.xz_part_??? > final_combined_phold_db_3M16_efam_envhog_no_phrog.tar.xz`
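As a slightly more defensive variant of the command above, the following POSIX shell sketch checks that at least one chunk is present before concatenating and reports how many it found (ten are expected for this record). The function name `recombine_parts` is hypothetical and not part of Phold.

```shell
# Recombine split chunks into a single file.
# Usage: recombine_parts <part_prefix> <output_file>
recombine_parts() {
    prefix=$1
    output=$2
    # The glob expands in sorted order, so the 3-digit numeric suffixes
    # (000, 001, ...) are concatenated in the order split wrote them.
    set -- "$prefix"???
    # If nothing matched, the pattern is left unexpanded and no such file exists.
    if [ ! -e "$1" ]; then
        echo "no chunks found matching ${prefix}???" >&2
        return 1
    fi
    echo "found $# chunk(s)" >&2
    cat "$@" > "$output"
}
```

It would be called as `recombine_parts final_combined_phold_db_3M16_efam_envhog_no_phrog.tar.xz_part_ final_combined_phold_db_3M16_efam_envhog_no_phrog.tar.xz`. Assuming `xz` and GNU `tar` are installed, the result can then be checked with `xz -t final_combined_phold_db_3M16_efam_envhog_no_phrog.tar.xz` and extracted with `tar -xJf final_combined_phold_db_3M16_efam_envhog_no_phrog.tar.xz`.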


Files