Published March 31, 2026 | Version v1
Dataset Open

RPIS_FragmentationLibraries: Molecular Fragments Useful for Design of Molecular Glues for Protein-RNA complexes

  • 1. Technical University of Munich
  • 2. ETH Zurich

Description

This database is hosted at:

- [Zenodo] (Large Data Files)
- [GitHub] (Executables and README)
- [CSBJ] (Original Publication)


Overview

This repository provides two fragment libraries, one ECFP-derived and one BRICS-derived. Both are chemical compound libraries in SMILES format, a widely used representation in computational molecular sciences.  

The fragment libraries presented here are part of a peer-reviewed scientific study. These fragments were derived through a comprehensive in silico workflow designed to identify promising stabilizer candidates for protein–RNA interactions (RPIs).  

The workflow integrates several computational methods, including:

- Binding pocket detection and evaluation  
- Molecular docking  
- Molecular dynamics simulations  
- Binding free energy estimations  

This pipeline yielded a set of stabilizer candidates, which were then used to generate fragment libraries through two distinct approaches:

1. Extended Connectivity Fingerprints (ECFP) – to extract the most representative chemical features.  
2. Breaking of Retrosynthetically Interesting Chemical Substructures (BRICS) – to decompose molecules into synthetically meaningful fragments.  

Both approaches were implemented using Python’s rdkit.Chem module and can be used for a large spectrum of different applications.
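
As a rough illustration of the BRICS approach, the sketch below decomposes molecules into fragments with `rdkit.Chem.BRICS.BRICSDecompose`. The input SMILES is a hypothetical placeholder, not one of the stabilizer ligands from the study, and the helper name `brics_fragments` is ours. The authors' actual pipeline may differ in its options and post-processing.

```python
from rdkit import Chem
from rdkit.Chem import BRICS


def brics_fragments(smiles_list):
    """Collect the unique BRICS fragments of a list of SMILES strings."""
    fragments = set()
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:  # skip SMILES that RDKit cannot parse
            continue
        # BRICSDecompose returns fragment SMILES; dummy atoms such as [4*]
        # mark the broken bonds and the BRICS environment they belong to
        fragments.update(BRICS.BRICSDecompose(mol))
    return sorted(fragments)


# Hypothetical input molecule, not taken from the study
example_ligands = ["CCCOCCC(=O)c1ccccc1"]
for fragment in brics_fragments(example_ligands):
    print(fragment)
```

The numbered dummy atoms are what later allow fragments to be recombined only along chemically compatible bonds.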

---

Highlights

- 1,000 most abundant ECFP-derived fragments for compound database filtering
- Executable scripts demonstrating the database filtering workflow  
- T38DrugDB filtered database, organized into 9 sub-databases, each containing only molecules with at least N fragment matches for N from 2 to 10 (provided as .tar.xz archives for portability)  
- All 213 BRICS fragments decomposed from the 96 high-ranking stabilizer ligands identified in our study  
- Executables for de novo compound generation from the 213 BRICS fragments, with adjustable maxDepth parameters  
- Comprehensive database of 3,878,955 compounds (roughly 4 million), generated from all possible combinations of the 213 BRICS fragments with maxDepth = 3


Intended Usage

The fragment libraries in this repository are divided into two subsets: ECFP-derived fragments and BRICS-derived fragments. Although both provide databases of SMILES strings, their intended applications differ.

- ECFP fragments can be used to filter existing compound databases for molecular patterns enriched in RPI stabilizers identified through our in silico workflow. These fragments are particularly useful for identifying chemical motifs associated with stabilizing protein–RNA interactions.

- BRICS fragments, on the other hand, also encode combinatorial information. In this method, chemical bonds of RPI stabilizers were broken in a retrosynthetically meaningful way using the rdkit.Chem.BRICS module. The resulting fragments can be recombined following the same retrosynthetic rules, enabling the construction of new compounds enriched with RPI-stabilizing features. These fragments can also be used for database filtering, but since they lack fragment importance scores, filtering must be done naively using all fragments.

While these two applications, database filtering (ECFP) and de novo compound generation (BRICS), represent the primary intended uses, they are not the only possibilities. Generative machine learning techniques, for instance, could utilize these fragment libraries to design new potential binders, particularly leveraging the ECFP dataset.

In general, the scope of applications for these fragment libraries is broad and limited only by the creativity of the user. The included executable scripts demonstrate the core workflows for database filtering (ECFP) and de novo compound assembly (BRICS).
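
The filtering idea can be sketched in a few lines: count how many fragments occur as substructures in each database molecule and keep those above a threshold. This is a minimal illustration and not the logic of RefilterDatabaseWithECFP.py. It assumes the fragments can be read as SMARTS patterns, and the function names and example molecules are hypothetical.

```python
from rdkit import Chem


def count_fragment_matches(smiles, patterns):
    """Return how many fragment patterns occur as substructures of a molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # unparsable database entry counts as zero matches
        return 0
    return sum(mol.HasSubstructMatch(pattern) for pattern in patterns)


def filter_database(database_smiles, fragment_smarts, min_matches):
    """Keep molecules that contain at least min_matches distinct fragments."""
    # Assumption: fragments are interpreted as SMARTS queries
    patterns = [Chem.MolFromSmarts(s) for s in fragment_smarts]
    patterns = [p for p in patterns if p is not None]
    return [
        smiles
        for smiles in database_smiles
        if count_fragment_matches(smiles, patterns) >= min_matches
    ]


# Hypothetical toy data
database = ["OC(=O)c1ccccc1", "CCO", "c1ccccc1"]
fragments = ["c1ccccc1", "C(=O)O", "N"]
print(filter_database(database, fragments, min_matches=2))
```

Raising `min_matches` trades library size for enrichment, which mirrors the idea behind the "N or more fragments" sub-databases.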

ECFP – Database Filtering

Within the GitHub ECFP/ directory, you will find the executable script RefilterDatabaseWithECFP.py and the example dataset sampleDB.csv. The fragment data required for execution is provided in the file MMGBSA_ChemicalAnalysisFragments_cutoff10_ranked_datatable_3orMore.csv, which must be present for RefilterDatabaseWithECFP.py to run without errors. After installing all required dependencies (listed at the beginning of RefilterDatabaseWithECFP.py), the script can be executed to produce nine .csv files named in the format ECFP_fragmentFilteredLibrary_#N#FragsOrMore.csv, where #N# is the minimum number of fragment matches. These files serve as example outputs.
The results of the full filtering process are stored in the Zenodo database. That dataset contains the results of applying the filtering procedure to the T38DrugDB, which includes approximately 34 million drug-like compounds published elsewhere. The filtered sub-databases are provided as compressed `.tar.xz` archives for portability. These databases can be used for targeted virtual screening of RPI-stabilizing drug candidates. This example workflow can be executed as follows:

# clone this repository
git clone git@github.com:Foly93/RPIS_FragmentationLibraries.git

# switch to ECFP directory
cd RPIS_FragmentationLibraries/ECFP

# INSTALL REQUIRED PYTHON PACKAGES FROM YOUR FAVOURITE PACKAGE MANAGER e.g. micromamba
micromamba activate your_env_name_goes_here

# execute the Python program
python RefilterDatabaseWithECFP.py 

# Check if the expected output was generated (assuming Ubuntu, macOS or PowerShell)
ls ECFP_fragmentFilteredLibrary_*FragsOrMore.csv
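
The filtered sub-databases on Zenodo are distributed as `.tar.xz` archives. The sketch below shows one way to peek at the first rows of such an archive without unpacking it to disk, using only the Python standard library. It assumes each archive holds a single CSV file; the column layout is not documented here, so rows are returned as plain lists. The function name `preview_archive` is ours.

```python
import csv
import io
import itertools
import tarfile


def preview_archive(archive_path, n_rows=5):
    """Return (member name, first n_rows CSV rows) of a .tar.xz archive."""
    with tarfile.open(archive_path, mode="r:xz") as archive:
        for member in archive:
            if not member.isfile():  # skip directories and links
                continue
            raw = archive.extractfile(member)
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            # Read lazily so large archives are not loaded into memory
            rows = list(itertools.islice(csv.reader(text), n_rows))
            return member.name, rows
    return None, []
```

For example, `preview_archive("ECFP_fragmentFilteredLibrary_10FragsOrMore.csv.tar.xz")` would return the name of the contained file together with its first five rows, provided the archive has been downloaded to the working directory.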

BRICS – De Novo Compound Assembly

The BRICS/ directory contains several executables that serve different purposes, as well as the BRICS fragment library RPIS_BRICS_Fragment_DB.smi. This .smi file contains 213 BRICS fragments, which can be combined to generate novel molecules using the rdkit.Chem.BRICS module. This functionality is demonstrated across three executables:
- BRICS_fragment_database_interactive.ipynb
- build_one_example_Mol_from_BRICS_fragments.py
- generate_N_Mols_from_BRICS_fragments.py

The Jupyter notebook BRICS_fragment_database_interactive.ipynb requires a Python environment with jupyter notebook installed. It provides a visual and interactive overview of the fragment set, demonstrating how to:
- Display the BRICS fragments  
- Assemble random molecules from the fragment library  
- Export the resulting molecules as SMILES strings  

The assembly process uses the RDKit function BRICS.BRICSBuild. Its output is highly sensitive to its parameters, particularly the maxDepth option, which controls the maximum number of fragments combined into a single molecule. With 213 fragments available, the total combinatorial space is on the order of 4 × 10¹¹ possible assemblies, making exhaustive enumeration computationally infeasible.
The script build_one_example_Mol_from_BRICS_fragments.py replicates the notebook’s core functionality in a standalone Python executable. It generates a single example molecule, saving the output to a file identified by its timestamp.
The final executable, generate_N_Mols_from_BRICS_fragments.py, creates a specified number of BRICS-assembled molecules and saves them into a timestamped .smi file. While it can be run without command-line arguments, optional parameters are available and can be displayed using the -h flag. The generated .smi files can be directly used for virtual screening or binding affinity prediction in drug discovery workflows. This showcase can be executed as follows:

# clone this repository
# not necessary if already done
git clone git@github.com:Foly93/RPIS_FragmentationLibraries.git

# switch to BRICS directory
cd RPIS_FragmentationLibraries/BRICS

# INSTALL REQUIRED PACKAGES FROM YOUR FAVOURITE PACKAGE MANAGER e.g. micromamba
micromamba activate your_env_name_goes_here

# open the jupyter notebook in your browser and have a good look at the functionality
jupyter notebook BRICS_fragment_database_interactive.ipynb

# execute the one-example-python-program
python build_one_example_Mol_from_BRICS_fragments.py 

# execute the batch creation python program
python generate_N_Mols_from_BRICS_fragments.py 

# display the options of the batch creation python program
python generate_N_Mols_from_BRICS_fragments.py -h 

# run the python program with custom flags
python generate_N_Mols_from_BRICS_fragments.py \
    --maxDepth 3 \
    --numMold 10 \
    --scrambleReagents True \
    --outputDirectory ../trash
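
The assembly step itself can be sketched with `BRICS.BRICSBuild`, which returns a generator, so only as many molecules as requested are built. This is a minimal illustration, not the code of the executables above. The four fragments are hypothetical placeholders rather than entries of RPIS_BRICS_Fragment_DB.smi, and the function name and defaults are ours.

```python
import itertools
import random

from rdkit import Chem
from rdkit.Chem import BRICS


def build_molecules(fragment_smiles, n_mols=5, max_depth=2, seed=42):
    """Assemble up to n_mols molecules from BRICS fragments."""
    # BRICSBuild shuffles its inputs when scrambleReagents is True; seeding
    # Python's random module (as the RDKit docs do) keeps runs repeatable
    random.seed(seed)
    fragments = [Chem.MolFromSmiles(s) for s in fragment_smiles]
    fragments = [m for m in fragments if m is not None]
    builder = BRICS.BRICSBuild(fragments, scrambleReagents=True, maxDepth=max_depth)
    products = []
    # islice stops the generator early instead of enumerating everything
    for mol in itertools.islice(builder, n_mols):
        mol.UpdatePropertyCache(strict=False)  # products are not sanitized
        products.append(Chem.MolToSmiles(mol))
    return products


# Hypothetical fragments with BRICS dummy-atom labels
toy_fragments = ["[16*]c1ccccc1", "[3*]O[3*]", "[4*]CCC", "[4*]CCC([6*])=O"]
for smiles in build_molecules(toy_fragments, n_mols=3):
    print(smiles)
```

Because the generator is consumed lazily, the cost is governed by the number of requested molecules rather than by the full combinatorial space.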

Finally, the Zenodo database also contains BRICS_DB_BuiltMaxDepth_3.txt, a database of all possible combinations for maxDepth set to 3. This file contains 3,878,955 compound SMILES strings, which is fewer than the theoretically possible 213 × 213 × 213 = 9,663,597. The difference results from incompatibilities between some BRICS fragments and from duplicate entries.
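
The arithmetic behind this comparison is easy to reproduce. The sketch below computes the naive upper bound and the fraction of it that the enumerated file actually covers. The constant and function names are ours; the two numbers are taken from the text above.

```python
N_FRAGMENTS = 213          # BRICS fragments in the library
N_ENUMERATED = 3_878_955   # SMILES strings in BRICS_DB_BuiltMaxDepth_3.txt


def naive_upper_bound(n_fragments, n_slots):
    """Count ordered fragment tuples, ignoring BRICS compatibility and duplicates."""
    return n_fragments ** n_slots


upper_bound = naive_upper_bound(N_FRAGMENTS, 3)
retained = N_ENUMERATED / upper_bound

print(f"naive upper bound: {upper_bound:,}")   # 9,663,597
print(f"retained fraction: {retained:.1%}")    # 40.1%
```

About 40% of the naive combinations survive, so compatibility rules and duplicates together remove roughly 60% of the theoretical space.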

File Description

RPIS_FragmentationLibraries/
├── README.md                          # THIS file
├── ECFP_fragmentFilteredLibrary_3FragsOrMore.csv.tar.xz     # T38DrugDB compounds that contain 3 or more ECFP fragments
├── ECFP_fragmentFilteredLibrary_4FragsOrMore.csv.tar.xz     # T38DrugDB compounds that contain 4 or more ECFP fragments
├── ECFP_fragmentFilteredLibrary_5FragsOrMore.csv.tar.xz     # T38DrugDB compounds that contain 5 or more ECFP fragments
├── ECFP_fragmentFilteredLibrary_6FragsOrMore.csv.tar.xz     # T38DrugDB compounds that contain 6 or more ECFP fragments
├── ECFP_fragmentFilteredLibrary_7FragsOrMore.csv.tar.xz     # T38DrugDB compounds that contain 7 or more ECFP fragments
├── ECFP_fragmentFilteredLibrary_8FragsOrMore.csv.tar.xz     # T38DrugDB compounds that contain 8 or more ECFP fragments
├── ECFP_fragmentFilteredLibrary_9FragsOrMore.csv.tar.xz     # T38DrugDB compounds that contain 9 or more ECFP fragments
├── ECFP_fragmentFilteredLibrary_10FragsOrMore.csv.tar.xz    # T38DrugDB compounds that contain 10 or more ECFP fragments
└── BRICS_DB_BuiltMaxDepth_3.txt                             # Database containing SMILES of all available BRICS assemblies with maxDepth set to 3

Citation

If you use these fragment libraries, please cite:
```
Luis Vollmers, Shu-Yu Chen, Martin Zacharias. In Silico Analysis of Potential Stabilizer Binding Sites at Protein–RNA Interfaces. Comput Struct Biotechnol J. 2026;35:0016. DOI: 10.34133/csbj.0016
```

License

This work is licensed under a Creative Commons Attribution 4.0 International License. See creativecommons.org/licenses/by/4.0/ for further information.

Contact

For questions, issues, or contributions:
- luis.vollmers@tum.de
- zacharias@tum.de
- Publication Link: https://doi.org/10.34133/csbj.0016

Files (978.9 MB)

Checksum and size:
md5:4d53232e6e0357f7cfd4b27f00b1da0a  355.6 MB
md5:4eb759045c9e32d3b1104b93c4d288ad  7.7 kB
md5:a3044d73766019cf307875c3912fe1ec  238.5 MB
md5:b89a6ff15e7421e1e9687d52dca578ee  190.6 MB
md5:9b58665803941d34f5c1a6bbbd7dc235  120.0 MB
md5:ad3593e17f72d411860dc4d1605ec879  53.9 MB
md5:7cba351a9aa0a1405267e7236a542d47  16.5 MB
md5:83c4b735fa0e7b1bca02cbc54c52add9  3.3 MB
md5:abf6cc00d5c72a0fed8ca395f1894af9  480.4 kB
md5:621ebce1fad3e17bfbbff111fdc3aae3  68.4 kB
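
After downloading, a file can be checked against its MD5 checksum with the Python standard library. The sketch below is a minimal helper; the function names are ours, and which checksum belongs to which file has to be taken from the Zenodo record, since the listing here does not pair them.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reading keeps memory use flat for files of several hundred MB
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or plain hex."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

A call such as `matches_checksum("BRICS_DB_BuiltMaxDepth_3.txt", "md5:<hex from the record>")` returns `True` only if the download is intact.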