# Medicinal chemistry subproject example (synthesis_bots)

## Workflow preparation and file system setup

The files in `DATA/INPUT_MINIMUM` are required to execute the first part of the workflow. All other input files (in `DATA/INPUT`) correspond to the expected output of earlier steps and were generated by the decision maker based on the experimental results.

### Potential MS hits

Monoisotopic masses for expected products (and a number of common adducts) were calculated using the [pyISOPACh package](https://github.com/AberystwythSystemsBiology/pyISOPACh). The list of possible expected m/z values in each sample is kept in a nested dictionary and follows the structure:

```python
{
    "SAMPLE_ID": {
        "[M+H]+": {
            "1": MZ_VALUE
        },
        "[M+Na]+": {
            "1": MZ_VALUE,
        },
        ...
    },
    ...
}
```
The resulting dictionary can be found in the input data folder as `MEDCHEM-SCREENING-EXPECTED-MS.json`.
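
As a rough illustration of how such a dictionary might be consumed, the sketch below flattens the entries for one sample into a sorted list of candidate m/z values. The sample ID, the numeric values and the helper `expected_mz` are made up for this example; the inner key (`"1"`) is simply carried through without interpretation.

```python
# Toy dictionary following the structure above; IDs and values are illustrative only.
expected = {
    "SAMPLE-01": {
        "[M+H]+": {"1": 255.15},
        "[M+Na]+": {"1": 277.13},
    },
}


def expected_mz(expected_ms, sample_id):
    """Return (m/z, adduct, inner key) tuples for one sample, sorted by m/z."""
    hits = []
    for adduct, entries in expected_ms[sample_id].items():
        for key, mz in entries.items():
            hits.append((mz, adduct, key))
    return sorted(hits)


candidates = expected_mz(expected, "SAMPLE-01")
for mz, adduct, key in candidates:
    print(f"{adduct:>8} ({key}): {mz:.2f}")
```

With the real data, the dictionary would be read from `MEDCHEM-SCREENING-EXPECTED-MS.json` using `json.load` instead of being defined inline.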

### Starting materials NMR spectra

To ensure purity, 1H NMR spectra of the starting materials were acquired. They were processed using TopSpin 4.3.0 analogously to the workflow spectra (referencing to the unsuppressed spectrum by copying the `SR` processing parameter, automated phasing and baseline correction, automated peak picking). The spectra can be found in `DATA/MEDCHEM-SCREENING-SM` and their peaks were exported into `MEDCHEM-SCREENING-SM-NMR.json` in the input data folder.

### ChemSpeed control CSV

As stock solution samples are weighed manually, the CSV file controlling ChemSpeed needs to be updated with the correct weights. An example ChemSpeed CSV file from the `MEDCHEM-SCREENING` workflow can be found in `MEDCHEM-SCREENING-CHEMSPEED.csv` inside the input data folder.

### Sample information file

We have also prepared a JSON file that contains the main sample information, alongside a list of NMR experiments to be performed on each sample. It follows a similar nested dictionary structure (see `MEDCHEM-SCREENING.json` in the input data folder):

```python
{
    "SAMPLE_ID": {
        "sample_info": {
            "amine": AMINE_SM,
            "isocyanate": ISOCYANATE_SM
        },
        "solvent": SOLVENT,
        "nmr_experiments": [
            NMR_PARAMETER_SET
        ]
    },
    ...
}
```

Note that `SAMPLE_ID` is the same across all instruments to facilitate tracking of the samples in the laboratory.
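
Before starting a run, it can be useful to check that every sample entry is complete. The following is a minimal sketch of such a check, not part of `synthesis_bots`; the sample IDs, the starting material names and the helper `incomplete_samples` are hypothetical.

```python
# Top-level keys every sample entry is expected to carry (from the structure above).
REQUIRED_KEYS = {"sample_info", "solvent", "nmr_experiments"}

# Illustrative entries; the second one deliberately lacks "nmr_experiments".
samples = {
    "SAMPLE-01": {
        "sample_info": {"amine": "AMINE-A", "isocyanate": "ISOCYANATE-B"},
        "solvent": "CH3CN",
        "nmr_experiments": ["MULTISUPPDC_f"],
    },
    "SAMPLE-02": {
        "sample_info": {"amine": "AMINE-A", "isocyanate": "ISOCYANATE-C"},
        "solvent": "CH3CN",
    },
}


def incomplete_samples(sample_dict):
    """Map each incomplete sample ID to the sorted list of keys it lacks."""
    problems = {}
    for sample_id, entry in sample_dict.items():
        missing = sorted(REQUIRED_KEYS - set(entry))
        if missing:
            problems[sample_id] = missing
    return problems


problems = incomplete_samples(samples)
print(problems)
```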

### Settings file

The `settings.toml` file contains the settings used throughout the workflow, including the paths for archiving data, names for the workflows, paths for the corresponding Python files and default settings for NMR and LCMS acquisition. In particular, when extending the workflow, it is important to take note of these lines:

```toml
[workflows.NMR]
Medchem_Screening    = "synthesis_bots.workflows.nmr.medchem.screening"
```

as they indicate which file will be run when a given command is received. In the example above, upon receiving the string `Medchem_Screening`, the NMR will execute `main()` from `synthesis_bots/workflows/nmr/medchem/screening.py`.
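
One common way to implement this kind of command-to-module lookup is with `importlib`. The sketch below is only an illustration of the idea and is not taken from `synthesis_bots`: the registry, the `dispatch` helper and the stand-in module `demo_workflow` are all hypothetical.

```python
import importlib
import sys
import types

# Hypothetical stand-in for a workflow module exposing main().
demo = types.ModuleType("demo_workflow")
demo.main = lambda: "screening started"
sys.modules["demo_workflow"] = demo

# Registry analogous to the [workflows.NMR] table: command string -> module path.
WORKFLOWS = {"Medchem_Screening": "demo_workflow"}


def dispatch(command, registry):
    """Import the module registered for `command` and call its main()."""
    module = importlib.import_module(registry[command])
    return module.main()


print(dispatch("Medchem_Screening", WORKFLOWS))
```

Under this scheme, adding a new workflow only requires a new Python file with a `main()` function and one extra line in the settings table.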

In [None]:
settings_toml = '''[dry]

dry = true

[tcp]

HOST      = "tcp://172.31.1.17:5558"
NMR       = "tcp://172.31.1.15:5552"
CHEMSPEED = "tcp://172.31.1.16:5553"
LCMS      = "tcp://172.31.1.18:5554"

[paths]

LCMS_archive   = "."                                 # Path to archive from LCMS PC
LCMS_queue     = "."                                 # Path to LCMS queue for MassLynx
LCMS_data      = "DATA/MEDCHEM-SCREENING/DATA/LCMS"  # Raw LCMS data on LCMS control PC
LCMS_to_NMR    = "."                                 # Path to NMR data from LCMS PC
NMR_data       = "."                                 # Raw NMR data on NMR control PC
NMR_archive    = "."                                 # Path to archive from NMR PC
CS_csv_medchem = "."                                 # CSV on ChemSpeed Computer

[defaults.NMR]

num_scans    = 64
pp_threshold = 0.02
field_presat = 10
l30          = 2
parameters   = "MULTISUPPDC_f"
solvent      = "CH3CN"
wait_time    = 120
shim_time    = 1200
reshim_time  = 14400
shim_sample  = 1
rack_layout  = "KUKA"
owner        = "Filip T. Szczypinski"
origin       = "AIC Group, University of Liverpool"

[defaults.MS]

injection_volume      = 0.5
peak_match_tolerance  = 0.4
analog_peak_threshold = 0.4
analog_peak_threshold2 = 0.05
tic_peak_params       = { "height" = 0.2, "distance" = 50 }
analog_peaks_params   = { "height" = 0.1, "distance" = 50 }
ms_peak_params        = { "height" = 0.1, "distance" = 10 }
solvent_front         = 0.4
lc_run_end            = 3.0
integral_rel_height   = 0.95
lc_ms_flowpath        = 2.84  # Time (s) for sample to reach MS after LC detector

[workflows.PREFIX]

Medchem_Screening    = "MEDCHEM-SCREENING"
Medchem_Scaleup      = "MEDCHEM-SCALEUP"
Medchem_Diversity    = "MEDCHEM-DIVERSITY"

[workflows.NMR]

Medchem_Screening = "synthesis_bots.workflows.nmr.medchem.screening"
Medchem_Scaleup   = "synthesis_bots.workflows.nmr.medchem.scaleup"
Medchem_Diversity = "synthesis_bots.workflows.nmr.medchem.diversity"

[workflows.LCMS]

InsertRack1  = "synthesis_bots.workflows.ms.insert_rack_one"
InsertRack2  = "synthesis_bots.workflows.ms.insert_rack_two"
ExtractRack1 = "synthesis_bots.workflows.ms.eject_rack_one"
ExtractRack2 = "synthesis_bots.workflows.ms.eject_rack_two"
Medchem1 = "synthesis_bots.workflows.ms.medchem.screening"
Medchem2 = "synthesis_bots.workflows.ms.medchem.scaleup"
Medchem3 = "synthesis_bots.workflows.ms.medchem.diversity"

[workflows.decision]

dtw_threshold = 30.0 # Distance threshold for dynamic time warp.

'''

with open("settings.toml", "w") as f:
    f.write(settings_toml)

## Medicinal chemistry screening

The first step is to screen different experiments and identify which ones give the expected product.

### Testing for MS criteria

For screening medicinally relevant organic reactions, we employed a gradient UPLC method coupled with ESI-MS. The code identifies peaks on the UPLC trace (using `scipy`), checks whether their LC area percentage is satisfactory (in this case we decided to use a 40% cut-off), extracts the corresponding mass spectrum and searches for peaks within the mass spectrum that satisfy reasonable criteria (e.g., 10% relative intensity and m/z agreement within 0.4 Da). No UPLC peak detection is performed before the solvent front (ca. 0.4 min) or after the gradient finishes and the column equilibrates (3 min). The flow path between the UPLC detector and the MS detector (2.84 s) needs to be accounted for.

Note that all criteria can be modified by domain experts in the settings file if they so wish.
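
The criteria above can be sketched in plain Python as follows. This is not the `synthesis_bots` implementation: peak detection is skipped (peaks are given as already-integrated toy values), the helper names are hypothetical, and we assume that the area percentage is computed relative to the peaks inside the analysis window.

```python
SOLVENT_FRONT = 0.4   # min, no peaks considered before this time
LC_RUN_END = 3.0      # min, no peaks considered after this time
FLOWPATH_S = 2.84     # s, delay between the LC detector and the MS detector
AREA_CUTOFF = 0.40    # minimum fraction of the total LC peak area
MZ_TOLERANCE = 0.4    # Da, allowed deviation from the expected m/z
MS_REL_HEIGHT = 0.10  # minimum intensity relative to the base peak


def ms_time(lc_time_min):
    """Shift an LC retention time (min) to the matching MS time (min)."""
    return lc_time_min + FLOWPATH_S / 60.0


def eligible_lc_peaks(lc_peaks):
    """Keep LC peaks inside the window whose area fraction passes the cut-off."""
    in_window = [p for p in lc_peaks if SOLVENT_FRONT < p["rt"] < LC_RUN_END]
    total = sum(p["area"] for p in in_window)
    return [p for p in in_window if total > 0 and p["area"] / total >= AREA_CUTOFF]


def matching_mz(spectrum, expected):
    """Return the expected m/z values matched by a sufficiently intense MS peak."""
    base = max(intensity for _, intensity in spectrum)
    strong = [mz for mz, intensity in spectrum if intensity / base >= MS_REL_HEIGHT]
    return [e for e in expected if any(abs(mz - e) <= MZ_TOLERANCE for mz in strong)]


# Toy data: one peak in the solvent front, one major and one minor product peak.
lc_peaks = [
    {"rt": 0.25, "area": 900.0},
    {"rt": 1.42, "area": 700.0},
    {"rt": 2.10, "area": 300.0},
]
spectrum = [(255.3, 1000.0), (277.1, 150.0), (301.2, 50.0)]  # (m/z, intensity)

major = eligible_lc_peaks(lc_peaks)
matches = matching_mz(spectrum, [255.15, 277.13, 301.0])
print([p["rt"] for p in major], matches)
```

In this toy case only the peak at 1.42 min passes the area cut-off, and the weak signal at m/z 301.2 is ignored because it falls below 10% of the base peak.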

In [None]:
from pathlib import Path
from synthesis_bots.utils.constants import PATHS

DATA = Path.cwd() / "DATA"
INPUT = DATA / "INPUT"
RAW_NMR = Path.cwd() / "NMR"
PLOTS = Path.cwd() / "PLOTS"
SUMMARY = Path.cwd() / "SUMMARY"
PATHS["LCMS_data"] = DATA / "MEDCHEM-SCREENING" / "DATA" / "LCMS"

from synthesis_bots.workflows.ms.medchem.screening import results_analysis

results_analysis(
    expected_json=INPUT / "MEDCHEM-SCREENING-EXPECTED-MS.json",
    archive_path=DATA / "MEDCHEM-SCREENING" / "DATA" / "LCMS",
    summary_path=DATA / "MEDCHEM-SCREENING" / "DATA" / "SUMMARY_MS.json"
)

### Testing for NMR criteria

For the medicinal chemistry example, we rely mostly on UPLC-MS data to establish whether the desired product is formed as the major product. We decided to record NMR data for future reference, and the NMR decision is based simply on whether the data indicate that a reaction has taken place. This was assessed using the dynamic time warping (DTW) algorithm implemented in [dtw-python](https://dynamictimewarping.github.io/py-api/html/index.html). To calculate the DTW similarity score between the starting materials NMR and the reaction mixture NMR, we performed min-max normalisation of all spectra and combined the starting materials spectra point by point to generate a "starting materials spectrum". This was then compared against the experimental spectrum of the reaction mixture, and a DTW distance of at least 30 was taken as evidence that a reaction has taken place.

As previously, any of those criteria can be changed in the settings if the users so wish.
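
The idea can be sketched with a textbook DTW implementation in plain Python. Note that this is not the dtw-python routine used in the workflow, and that combining the starting materials by summation followed by renormalisation is an assumption made for this example; all function names and spectra are hypothetical.

```python
def min_max(values):
    """Scale a list of intensities to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def combine(spectra):
    """Sum normalised spectra point by point, then renormalise (assumed scheme)."""
    summed = [sum(points) for points in zip(*map(min_max, spectra))]
    return min_max(summed)


def dtw_distance(a, b):
    """Textbook DTW distance using the absolute difference as local cost."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for x in a:
        curr = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            curr[j] = abs(x - y) + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[-1]


def reaction_detected(starting_materials, mixture, threshold):
    """Flag a reaction when the mixture is far enough from the starting materials."""
    reference = combine(starting_materials)
    return dtw_distance(reference, min_max(mixture)) >= threshold


# Toy eight-point "spectra": two starting materials and two possible outcomes.
amine = [0, 0, 5, 0, 0, 0, 0, 0]
isocyanate = [0, 0, 0, 0, 0, 2, 0, 0]
mixture = [0, 0, 0, 4, 0, 0, 0, 0]    # one new peak, both old peaks gone
unchanged = [0, 0, 3, 0, 0, 3, 0, 0]  # both starting material peaks remain

print(reaction_detected([amine, isocyanate], mixture, threshold=0.5))
print(reaction_detected([amine, isocyanate], unchanged, threshold=0.5))
```

The threshold of 0.5 is chosen only for these eight-point toy spectra. DTW distances grow with the number of points, so the value of 30 in the settings applies to full experimental spectra and cannot be transferred to this example.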

In [None]:
from pathlib import Path

DATA = Path.cwd() / "DATA"
INPUT = DATA / "INPUT"
RAW_NMR = Path.cwd() / "NMR"
PLOTS = Path.cwd() / "PLOTS"
SUMMARY = Path.cwd() / "SUMMARY"

from synthesis_bots.workflows.nmr.medchem.screening import analyse_results

analyse_results(
    json_path=INPUT / "MEDCHEM-SCREENING.json",
    data_path=RAW_NMR / "MEDCHEM-SCREENING" / "DATA" / "NMR",
    sm_nmr_data_path=RAW_NMR / "MEDCHEM-SCREENING-SM" / "DATA" / "NMR",
    ms_summary_path = DATA / "MEDCHEM-SCREENING" / "DATA" / "SUMMARY_MS.json",
    nmr_summary_path=DATA / "MEDCHEM-SCREENING" / "DATA" / "SUMMARY_NMR.json",
    archive_path=DATA / "MEDCHEM-SCREENING" / "DATA" / "NMR"
)

### Decision making on the scaleup step

With the MS and NMR screening results at hand, the decision maker now looks through the summary JSON files to identify samples where all criteria passed. Based on those results, a ChemSpeed control CSV file is automatically generated, alongside the selection of expected m/z values for replication and the overall sample information JSON (which also contains a list of NMR experiments to perform).
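
The selection logic amounts to keeping only those samples for which every criterion passed. The layout of the summary JSON files is not described here, so the dictionary below (with `ms_pass` and `nmr_pass` flags) and the helper `passing_samples` are purely hypothetical.

```python
# Hypothetical summary: one boolean flag per criterion and sample.
summary = {
    "SAMPLE-01": {"ms_pass": True, "nmr_pass": True},
    "SAMPLE-02": {"ms_pass": True, "nmr_pass": False},
    "SAMPLE-03": {"ms_pass": False, "nmr_pass": True},
}


def passing_samples(results):
    """Return the IDs of samples for which all criteria are True."""
    return [sid for sid, criteria in results.items() if all(criteria.values())]


hits = passing_samples(summary)
print(hits)
```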

In [None]:
from pathlib import Path

DATA = Path.cwd() / "DATA"
INPUT = DATA / "INPUT"
RAW_NMR = Path.cwd() / "NMR"
PLOTS = Path.cwd() / "PLOTS"
SUMMARY = Path.cwd() / "SUMMARY"

from synthesis_bots.workflows.nmr.medchem.screening import next_step

next_step(
    json_path=INPUT / "MEDCHEM-SCREENING.json",
    nmr_summary_path=DATA / "MEDCHEM-SCREENING" / "DATA" / "SUMMARY_NMR.json",
    expected_ms_json=INPUT / "MEDCHEM-SCREENING-EXPECTED-MS.json",
    cs_csv_input=INPUT / "MEDCHEM-SCREENING-CHEMSPEED.csv",
    future_cs_csv=INPUT / "MEDCHEM-SCALEUP-CHEMSPEED.csv",
    future_ms_json=INPUT / "MEDCHEM-SCALEUP-EXPECTED-MS.json",
    future_nmr_json=INPUT / "MEDCHEM-SCALEUP.json",
)

## Scale-up reactions

Hits identified in the previous step are now scaled up to yield enough material for future derivatisation.

### Testing for MS criteria

This time, we are trying to establish if any of the m/z values corresponding to the expected structure has been observed.

In [None]:
from pathlib import Path
from synthesis_bots.utils.constants import PATHS

DATA = Path.cwd() / "DATA"
INPUT = DATA / "INPUT"
RAW_NMR = Path.cwd() / "NMR"
PLOTS = Path.cwd() / "PLOTS"
SUMMARY = Path.cwd() / "SUMMARY"
PATHS["LCMS_data"] = DATA / "MEDCHEM-SCALEUP" / "DATA" / "LCMS"

from synthesis_bots.workflows.ms.medchem.scaleup import results_analysis

results_analysis(
    expected_json=INPUT / "MEDCHEM-SCALEUP-EXPECTED-MS.json",
    archive_path=DATA / "MEDCHEM-SCALEUP" / "DATA" / "LCMS",
    summary_path=DATA / "MEDCHEM-SCALEUP" / "DATA" / "SUMMARY_MS.json"
)

### Testing for NMR criteria

During replication, we are interested in whether the NMR spectrum is identical (or very close) to the reference spectrum from the screening run. To that end, we employed the dynamic time warping (DTW) algorithm implemented in [dtw-python](https://dynamictimewarping.github.io/py-api/html/index.html). Before the DTW distance is calculated, the spectra are min-max normalised and all points with intensity below 5% are assumed to contribute only to the noise and are replaced with zeros. We considered a DTW distance of 20 as "good agreement", but this can be adjusted in the settings if others so wish.
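
The pre-processing step can be sketched in a few lines of plain Python. The helper `denoise` and the intensity values are hypothetical, and the 5% level is applied to the min-max normalised intensities as described above.

```python
def denoise(spectrum, noise_level=0.05):
    """Min-max normalise a spectrum and zero all points below `noise_level`."""
    lo, hi = min(spectrum), max(spectrum)
    if hi == lo:
        return [0.0] * len(spectrum)
    scaled = [(v - lo) / (hi - lo) for v in spectrum]
    return [v if v >= noise_level else 0.0 for v in scaled]


raw = [2.0, 3.0, 102.0, 52.0, 6.0, 2.0]  # toy intensities with a noisy baseline
clean = denoise(raw)
print(clean)
```

Zeroing the baseline in this way prevents noise from accumulating in the DTW distance, which would otherwise penalise two spectra that differ only in their noise.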

In [None]:
from pathlib import Path

DATA = Path.cwd() / "DATA"
INPUT = DATA / "INPUT"
RAW_NMR = Path.cwd() / "NMR"
PLOTS = Path.cwd() / "PLOTS"
SUMMARY = Path.cwd() / "SUMMARY"

from synthesis_bots.workflows.nmr.medchem.scaleup import analyse_results

analyse_results(
    json_path=INPUT / "MEDCHEM-SCALEUP.json",
    data_path=RAW_NMR / "MEDCHEM-SCALEUP" / "DATA" / "NMR",
    screening_data_path=RAW_NMR / "MEDCHEM-SCREENING" / "DATA" / "NMR",
    ms_summary_path = DATA / "MEDCHEM-SCALEUP" / "DATA" / "SUMMARY_MS.json",
    nmr_summary_path=DATA / "MEDCHEM-SCALEUP" / "DATA" / "SUMMARY_NMR.json",
    archive_path=DATA / "MEDCHEM-SCALEUP" / "DATA" / "NMR"
)

### Decision making on the diversification step

With the MS and NMR scale-up results at hand, the decision maker now looks through the summary JSON files to identify samples where all criteria passed. Successful scale-up reactions are taken forward to the diversification step. In this workflow, we decided to automate the generation of expected masses post-diversification based on the molecular fragments added (see `MEDCHEM-DIVERSITY-DIVERSIFICATION-MASS.csv` in the input data folder) and the expected reactions (i.e., Sonogashira coupling and alkyne-azide cycloaddition), but those could in principle be enumerated manually or through any other means.
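
The mass bookkeeping behind the two reaction types is simple: an alkyne-azide cycloaddition is an addition (the product mass is the sum of both partners), whereas a Sonogashira coupling formally releases HX. The sketch below illustrates this with textbook compounds that are not part of the workflow; the function names are hypothetical, the aryl iodide is an assumption made for the example, and the layout of the mass CSV file is not reproduced.

```python
H = 1.007825       # monoisotopic mass of 1H (Da)
I = 126.904473     # monoisotopic mass of 127I (Da)
PROTON = 1.007276  # mass added for an [M+H]+ adduct (Da)


def cycloaddition_mass(azide, alkyne):
    """Alkyne-azide cycloaddition: all atoms of both partners are retained."""
    return azide + alkyne


def sonogashira_mass(aryl_halide, alkyne, halogen):
    """Sonogashira coupling: the product is the sum of both partners minus HX."""
    return aryl_halide + alkyne - (H + halogen)


# Illustrative monoisotopic masses (Da) of textbook compounds.
iodobenzene = 203.943598      # C6H5I
phenylacetylene = 102.046950  # C8H6
benzyl_azide = 133.063997     # C7H7N3

coupled = sonogashira_mass(iodobenzene, phenylacetylene, I)  # diphenylacetylene, C14H10
triazole = cycloaddition_mass(benzyl_azide, phenylacetylene)  # C15H13N3

print(f"Sonogashira product [M+H]+: {coupled + PROTON:.4f}")
print(f"Triazole product    [M+H]+: {triazole + PROTON:.4f}")
```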

In [None]:
from pathlib import Path

DATA = Path.cwd() / "DATA"
INPUT = DATA / "INPUT"
RAW_NMR = Path.cwd() / "NMR"
PLOTS = Path.cwd() / "PLOTS"
SUMMARY = Path.cwd() / "SUMMARY"

from synthesis_bots.workflows.nmr.medchem.scaleup import next_step

next_step(
    nmr_summary_path=DATA / "MEDCHEM-SCALEUP" / "DATA" / "SUMMARY_NMR.json",
    expected_ms_json=INPUT / "MEDCHEM-SCALEUP-EXPECTED-MS.json",
    json_path=INPUT / "MEDCHEM-SCALEUP.json",
    cs_csv_input=INPUT / "MEDCHEM-SCALEUP-CHEMSPEED.csv",
    mass_csv_inp=INPUT / "MEDCHEM-DIVERSITY-DIVERSIFICATION-MASS.csv",
    future_cs_csv=INPUT / "MEDCHEM-DIVERSITY-CHEMSPEED.csv",
    future_ms_json=INPUT / "MEDCHEM-DIVERSITY-EXPECTED-MS.json",
    future_nmr_json=INPUT / "MEDCHEM-DIVERSITY.json"
)


## Late-stage diversification

Successfully scaled-up reactions are now divided into batches for late-stage diversification. Akin to common medicinal chemistry and diversity-oriented synthesis approaches, successful products are now identified by UPLC-MS for offline purification (by preparative HPLC or flash chromatography). This allows for full characterisation and separate testing for properties. As the complexity of the molecules increases significantly in this step (and so do their coupling relationships), the low-field NMR spectra are meaningless for structure elucidation. This is further exacerbated by the use of multiple solvents to ensure solubility, which makes spectral analysis next to impossible. Hence, NMR spectra are only recorded for future reference, but no decisions are taken based on them.

To reflect the fact that only a small amount of product is needed for testing, we decided to use a small threshold of UPLC area percentage for analysis. As previously, those settings can be adjusted by users.

In [None]:
from pathlib import Path
from synthesis_bots.utils.constants import PATHS

DATA = Path.cwd() / "DATA"
INPUT = DATA / "INPUT"
RAW_NMR = Path.cwd() / "NMR"
PLOTS = Path.cwd() / "PLOTS"
SUMMARY = Path.cwd() / "SUMMARY"
PATHS["LCMS_data"] = DATA / "MEDCHEM-DIVERSITY" / "DATA" / "LCMS"

from synthesis_bots.workflows.ms.medchem.diversity import results_analysis

results_analysis(
    expected_json=INPUT / "MEDCHEM-DIVERSITY-EXPECTED-MS.json",
    archive_path=DATA / "MEDCHEM-DIVERSITY" / "DATA" / "LCMS",
    summary_path=DATA / "MEDCHEM-DIVERSITY" / "DATA" / "SUMMARY_MS.json"
)

In [None]:
from pathlib import Path

DATA = Path.cwd() / "DATA"
INPUT = DATA / "INPUT"
RAW_NMR = Path.cwd() / "NMR"
PLOTS = Path.cwd() / "PLOTS"
SUMMARY = Path.cwd() / "SUMMARY"

from synthesis_bots.workflows.nmr.medchem.diversity import process_results

process_results(
    json_path=INPUT / "MEDCHEM-DIVERSITY.json",
    data_path=RAW_NMR / "MEDCHEM-DIVERSITY" / "DATA" / "NMR",
    nmr_summary_path=DATA / "MEDCHEM-DIVERSITY" / "DATA" / "SUMMARY_NMR.json",
    archive_path=DATA / "MEDCHEM-DIVERSITY" / "DATA" / "NMR"
)