Published November 7, 2023 | Version 1.1.0
Dataset Open

Datasets of sequences, alignments and structural models generated for the structural prediction of complexes mediated by intrinsically disordered regions.

  • 1. Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198, Gif-sur-Yvette, France
  • 2. Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198, Gif-sur-Yvette, France.

Description

This repository contains the input and output files used and generated for scanning intrinsically disordered regions and predicting their binding sites on receptor proteins using the SCAN_IDR pipeline with AlphaFold2-Multimer.

It contains two archives: 

  1. scanidr_data_repository_corr6J08.tar, dedicated to the analysis of a dataset of 42 protein complexes that are non-redundant with the dataset used for AlphaFold2 training,
  2. 923_elm_cases_repository.tar.gz, dedicated to the analysis of 923 complexes from the ELM database.

These data can be used to rerun specific sections of the pipeline and scripts provided in: https://github.com/i2bc/SCAN_IDR
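Before extracting either archive, it can help to look at what it holds. The following is a minimal sketch using only the Python standard library; the function name `top_level_entries` is hypothetical, and the sketch makes no assumption about how the archives are organised internally.

```python
import tarfile
from pathlib import PurePosixPath


def top_level_entries(archive_path):
    """Return the sorted top-level names stored in a .tar or .tar.gz archive."""
    names = set()
    # Mode "r:*" lets tarfile detect the compression (none, gzip, ...) itself,
    # so the same call works for both archives of this record.
    with tarfile.open(archive_path, mode="r:*") as archive:
        for member in archive:
            # Member names use "/" separators; a leading "./" is dropped by PurePosixPath.
            parts = PurePosixPath(member.name).parts
            if parts:
                names.add(parts[0])
    return sorted(names)


if __name__ == "__main__":
    import sys

    # Usage: python list_archive.py 923_elm_cases_repository.tar.gz
    for entry in top_level_entries(sys.argv[1]):
        print(entry)
```

Listing a gzip-compressed archive requires reading through the whole file, so this may take a while on the larger downloads.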

Dataset of 42 non-redundant complexes

The first archive, scanidr_data_repository_corr6J08.tar, contains 3 compressed directories and a README file detailing their contents:

  • the initial raw sequence and alignment data for every chain -> DIRECTORY fasta_msa/
  • the input and output data of every AlphaFold run for every complex -> DIRECTORY af2_runs/
  • the native reference structures -> DIRECTORY ref_capri_curated/

The protein-peptide complex cases have been assigned a distinct index number, from 1 to 42, consistent across the several directories of the archive. Their corresponding directories are labelled as <index>_<pdbcode>.
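As an illustration of the `<index>_<pdbcode>` labelling, the sketch below splits a label such as `36_6J08` (the case mentioned in the notes of this record) into its two parts. The function name `parse_case_label` is hypothetical, and the pattern assumes a plain integer index followed by a four-character PDB code; labels in the archive that deviate from this shape would be rejected.

```python
import re

# Assumed label shape: integer index, underscore, four-character PDB code.
CASE_LABEL = re.compile(r"(\d+)_([0-9A-Za-z]{4})")


def parse_case_label(label):
    """Split '<index>_<pdbcode>' into (index, pdbcode), validating the index range."""
    match = CASE_LABEL.fullmatch(label)
    if match is None:
        raise ValueError(f"not an <index>_<pdbcode> label: {label!r}")
    index = int(match.group(1))
    # The description states that indexes run from 1 to 42.
    if not 1 <= index <= 42:
        raise ValueError(f"index out of the 1-42 range: {index}")
    return index, match.group(2)


if __name__ == "__main__":
    print(parse_case_label("36_6J08"))
```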

The models in this archive were generated using AlphaFold2-Multimer v2.2.

Dataset of 923 complexes selected from the ELM database

The second archive, 923_elm_cases_repository.tar.gz, contains the input and output files used and generated for the analysis of 923 Eukaryotic Linear Motif (ELM) database entries.

Each ELM entry is indexed with a specific integer ID and is composed of a receptor protein and a ligand protein.

The archive contains a table associating ELM indexes with the ELM entry information, 5 directories and a README file detailing their contents:

  • the table describing ELM entries -> FILE Table_923ELM_uid_delimitations_info_for_archive.txt
  • the initial raw sequence and multiple sequence alignment (MSA) data for every chain -> DIRECTORY fasta_msa/
  • the concatenated MSA model for every ELM complex and protocol used -> DIRECTORY af2_elm_coali_inputs/
  • the best model of every AF2 protocol for every complex according to the AF2 -> DIRECTORY af2_elm_models/
  • the best model with the ligand part cut to keep only the ELM motif, as used for the evaluation of the models -> DIRECTORY elm_cut_models/
  • the reference structures used for the evaluation of the models -> DIRECTORY ref_capri_curated/
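After extraction, the entries listed above can be checked for presence. This is a minimal sketch, assuming the table and the five directories sit directly under a single root directory that you pass in; the function name `missing_entries` is hypothetical, and the actual nesting inside the archive may differ (the README in the archive is the authority).

```python
from pathlib import Path

# Names taken from the list in this description.
EXPECTED_FILES = ["Table_923ELM_uid_delimitations_info_for_archive.txt"]
EXPECTED_DIRS = [
    "fasta_msa",
    "af2_elm_coali_inputs",
    "af2_elm_models",
    "elm_cut_models",
    "ref_capri_curated",
]


def missing_entries(root):
    """Return the expected files and directories that are absent under root."""
    root = Path(root)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    # Directories are reported with a trailing slash to tell them apart from files.
    missing += [name + "/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing


if __name__ == "__main__":
    import sys

    # Usage: python check_layout.py path/to/extracted/root
    for name in missing_entries(sys.argv[1]):
        print("missing:", name)
```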

The models in this archive were generated using AlphaFold2-Multimer v2.3.

Notes (English)

In version 1.1.0, case 36_6J08 in the scanidr_data_repository_corr6J08.tar archive was corrected: the correct isoform Q3KP22-3 is now used for this case instead of Q3KP22-1, which had previously been used by mistake.

Files

Files (5.2 GB)

Checksum Size
md5:d1642cd5f29633850a3bb18e3166928c 4.0 GB
md5:4996d9206e0ce21279f798d5f1a11bb0 1.2 GB
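A downloaded archive can be compared against the MD5 checksums listed above. The sketch below uses only the Python standard library; the function names are hypothetical, and because this listing does not pair each checksum with a file name, the expected value is passed in explicitly rather than hard-coded per archive.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or as plain hex."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


if __name__ == "__main__":
    import sys

    # Usage: python check_md5.py <downloaded archive> md5:<checksum from this page>
    ok = matches_listed_md5(sys.argv[1], sys.argv[2])
    print("checksum OK" if ok else "checksum MISMATCH")
```

A mismatch usually points to an incomplete or corrupted download, in which case downloading the archive again is the simplest remedy.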

Additional details

Funding

PPIMei – Protein-Protein Interactions in Meiosis ANR-21-CE44-0009
Agence Nationale de la Recherche
ESPRINet – Integrating heterogeneous Evolutionary, Structural and Omics data to predict Protein-RNA Interaction Networks ANR-18-CE45-0005
Agence Nationale de la Recherche
HPC resources of IDRIS 2023-AD010314343
Grand Équipement National de Calcul Intensif (France)