# Supramolecular chemistry subproject example (synthesis_bots)

## Workflow preparation and file system setup

The files in `DATA/INPUT_MINIMUM` are required to execute the first part of the workflow. All subsequent input files (in `DATA/INPUT`) are the expected output and have been generated by the decision maker based on the experimental results.

### Potential MS hits

A list of all possible metal-organic structures including up to ten metal ions was generated. Their monoisotopic masses were calculated for each possible charge (corresponding to different numbers of counterions), using the [pyISOPACh package](https://github.com/AberystwythSystemsBiology/pyISOPACh). The list of expected m/z values in each sample is kept in a nested dictionary with the following structure:

```python
{
    "SAMPLE_ID": {
        "CHEMICAL_STRUCTURE_1": {
            "CHARGE": MZ_VALUE,
            ...
        },
        "CHEMICAL_STRUCTURE_2": {
            "CHARGE": MZ_VALUE,
            ...
        },
        ...
    },
    ...
}
```
The resulting dictionary can be found in the input data folder as `SUPRAMOL-SCREENING-EXPECTED-MS.json`.
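
As a rough illustration of how such a dictionary could be built and traversed, the sketch below assumes that each charge state arises from the loss of singly charged counterions, so that m/z = (M - z * m_counterion) / z. The helper `expected_mz`, the sample and structure names, and all masses are placeholders rather than real chemical data, and pyISOPACh is not used here.

```python
import json


def expected_mz(total_mass: float, counterion_mass: float, max_charge: int) -> dict[str, float]:
    """Return {charge: m/z} for the loss of 1..max_charge singly charged counterions.

    Simplified assumption: the ion mass is the total mass minus the
    mass of the counterions that were lost.
    """
    return {
        str(z): (total_mass - z * counterion_mass) / z
        for z in range(1, max_charge + 1)
    }


# Placeholder masses, chosen only to keep the arithmetic easy to follow.
expected = {
    "SAMPLE_1": {
        "STRUCTURE_A": expected_mz(1000.0, 100.0, max_charge=3),
        "STRUCTURE_B": expected_mz(1500.0, 100.0, max_charge=2),
    }
}

# Round-trip through JSON to mirror how the dictionary is stored on disk.
loaded = json.loads(json.dumps(expected))

# Walk the three levels: sample -> structure -> charge.
for sample_id, structures in loaded.items():
    for structure, charges in structures.items():
        for charge, mz in charges.items():
            print(f"{sample_id} {structure} {charge}+ m/z = {mz:.4f}")
```

Charges are stored as strings because JSON object keys are always strings.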

### Starting materials NMR spectra

To ensure purity, 1H NMR spectra of the starting materials were acquired. They were processed using TopSpin 4.3.0 analogously to the workflow spectra (referencing to the unsuppressed spectrum, copying the `SR` processing parameter, automated phasing and baseline correction, and automated peak picking). The spectra can be found in `DATA/SUPRAMOL-SCREENING/SUPRAMOL-SM` and their peaks were exported into `SUPRAMOL-SCREENING-SM-NMR.json` in the input data folder.

### ChemSpeed control CSV

As stock solution samples are weighed manually, the CSV file controlling ChemSpeed needs to be updated with the correct weights. An example ChemSpeed CSV file from the `SUPRAMOL-SCREENING` can be found in `SUPRAMOL-SCREENING-CHEMSPEED.csv` inside the input data folder.

### Sample information file

We have also prepared a JSON file that contains the main sample information, alongside a list of NMR experiments to be performed on each sample. It follows a similar nested dictionary structure (see `SUPRAMOL-SCREENING.json` in the input data folder):

```python
{
    "SAMPLE_ID": {
        "sample_info": {
            "amine": AMINE,
            "carbonyl": CARBONYL,
            "metal": METAL
        },
        "solvent": "CH3CN",
        "nmr_experiments": [
            NMR_PARAMETER_SET
        ]
    },
    ...
}
```

Note that `SAMPLE_ID` is the same across all instruments to facilitate tracking of the samples in the laboratory.
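
A minimal sketch of how this file might be consumed is given below. The sample identifiers, the reagent labels and the helper `acquisition_queue` are hypothetical, and the parameter set name is borrowed from the default in the settings file purely for illustration.

```python
# Placeholder content following the structure of SUPRAMOL-SCREENING.json.
samples = {
    "SAMPLE_1": {
        "sample_info": {"amine": "AMINE_1", "carbonyl": "CARBONYL_1", "metal": "METAL_1"},
        "solvent": "CH3CN",
        "nmr_experiments": ["MULTISUPPDC_f"],
    },
    "SAMPLE_2": {
        "sample_info": {"amine": "AMINE_2", "carbonyl": "CARBONYL_1", "metal": "METAL_1"},
        "solvent": "CH3CN",
        "nmr_experiments": ["MULTISUPPDC_f"],
    },
}

REQUIRED_KEYS = {"sample_info", "solvent", "nmr_experiments"}


def acquisition_queue(samples: dict) -> list[tuple[str, str, str]]:
    """Flatten the sample file into (sample_id, solvent, experiment) tasks."""
    queue = []
    for sample_id, entry in samples.items():
        # Fail early if an entry does not follow the expected structure.
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            raise KeyError(f"{sample_id} is missing {sorted(missing)}")
        for experiment in entry["nmr_experiments"]:
            queue.append((sample_id, entry["solvent"], experiment))
    return queue


queue = acquisition_queue(samples)
for task in queue:
    print(task)
```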

### Settings file

The `settings.toml` file contains the settings used throughout the workflow, including the paths for archiving data, names for the workflows, paths for the corresponding Python files and default settings for NMR and LCMS acquisition. In particular, when extending the workflow it is important to take note of these lines:

```toml
[workflows.NMR]
Supramol_Screening   = "synthesis_bots.workflows.nmr.supramol.screening"
```

as they indicate which file will be run when a given command is received. In the example above, upon receiving the string `Supramol_Screening`, the NMR will execute `main()` from `synthesis_bots/workflows/nmr/supramol/screening.py`.
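
The mapping from a command string to a Python file can be sketched as follows. This is only an illustration of the idea using `importlib`; the helpers `module_file` and `resolve_main` are hypothetical and are not taken from `synthesis_bots`, whose actual dispatch code may differ.

```python
import importlib
from pathlib import PurePosixPath

# The [workflows.NMR] table as it would look once parsed into a dictionary.
workflows_nmr = {
    "Supramol_Screening": "synthesis_bots.workflows.nmr.supramol.screening",
}


def module_file(dotted_path: str) -> str:
    """Translate a dotted module path into the corresponding source file path."""
    return str(PurePosixPath(*dotted_path.split(".")).with_suffix(".py"))


def resolve_main(command: str, table: dict[str, str]):
    """Import the module registered for `command` and return its `main` function."""
    module = importlib.import_module(table[command])
    return module.main


print(module_file(workflows_nmr["Supramol_Screening"]))
```

Calling `resolve_main("Supramol_Screening", workflows_nmr)()` would then run the workflow, provided that `synthesis_bots` is installed in the current environment.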

```python
settings_toml = '''[dry]

dry = true

[tcp]

HOST      = "tcp://172.31.1.17:5558"
NMR       = "tcp://172.31.1.15:5552"
CHEMSPEED = "tcp://172.31.1.16:5553"
LCMS      = "tcp://172.31.1.18:5554"

[paths]

LCMS_archive   = "."    # Path to archive from LCMS PC
LCMS_queue     = "."    # Path to LCMS queue for MassLynx
LCMS_data      = "DATA/SUPRAMOL-SCREENING/DATA/LCMS"    # Raw LCMS data on LCMS control PC
LCMS_to_NMR    = "."    # Path to NMR data from LCMS PC
NMR_data       = "."    # Raw NMR data on NMR control PC
NMR_archive    = "."    # Path to archive from NMR PC
CS_csv_supra   = "."    # CSV on ChemSpeed Computer

[defaults.NMR]

num_scans    = 64
pp_threshold = 0.02
field_presat = 10
l30          = 2
parameters   = "MULTISUPPDC_f"
solvent      = "CH3CN"
wait_time    = 120
shim_time    = 1200
reshim_time  = 14400
shim_sample  = 1
rack_layout  = "KUKA"
owner        = "Filip T. Szczypinski"
origin       = "AIC Group, University of Liverpool"

[defaults.MS]
injection_volume = 0.5
peak_match_tolerance = 0.4
tic_peak_params      = { "height" = 0.2, "distance" = 50 }
ms_peak_params       = { "height" = 0.5, "distance" = 10 }

[workflows.PREFIX]

Supramol_Screening   = "SUPRAMOL-SCREENING"
Supramol_Replication = "SUPRAMOL-REPLICATION"
Supramol_HostGuest   = "SUPRAMOL-HOST-GUEST"

[workflows.NMR]

Supramol_Screening   = "synthesis_bots.workflows.nmr.supramol.screening"
Supramol_Replication = "synthesis_bots.workflows.nmr.supramol.replication"
Supramol_HostGuest   = "synthesis_bots.workflows.nmr.supramol.host_guest"

[workflows.LCMS]

InsertRack1  = "synthesis_bots.workflows.ms.insert_rack_one"
InsertRack2  = "synthesis_bots.workflows.ms.insert_rack_two"
ExtractRack1 = "synthesis_bots.workflows.ms.eject_rack_one"
ExtractRack2 = "synthesis_bots.workflows.ms.eject_rack_two"
Supra1       = "synthesis_bots.workflows.ms.supramol.screening"
Supra2       = "synthesis_bots.workflows.ms.supramol.replication"

[workflows.decision]

peak_number = 3 # How many peaks different from SM allowed.
shifted_proportion = 0.5 # What proportion of peaks needs to have shifted.
metals_mz = [
    3,
    2,
] # [x, y] if x metals, required at least y m/z peaks.
dtw_threshold = 20.0 # Distance threshold for dynamic time warp.
ppm_range = [11, 6] # PPM range of interest
hg_shift = 0.005 # PPM shift to trigger host-guest identification
hg_lb = 1.8 # Hz exponential multiplication line broadening

'''

# Write the settings to disk so that the workflow modules can read them.
with open("settings.toml", "w") as f:
    f.write(settings_toml)
```

## Supramolecular screening

The first step of the workflow is to identify hits when screening different experiments.

### Testing for MS criteria

In the supramolecular screening, only direct injection ESI-MS is used. The code identifies the peak of the injection (using `scipy`), extracts the corresponding mass spectrum and searches through peaks within the mass spectrum that satisfy reasonable criteria: in this case, 50% relative intensity and m/z agreement within 0.4 Da.

Because many possible charges are enumerated, and hence there are very many potential false positives, we included another criterion: if at least three metals are present (i.e., the structure corresponds to a polymetallic assembly with at least six counterions present), then we require at least two m/z peaks that correspond to different numbers of counterions removed in the ionisation process.

Note that all criteria can be modified by domain experts in the settings file if they so wish.
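
The matching logic described above can be sketched in plain Python as follows. The functions `matched_charges` and `passes_ms` are hypothetical names, the peak list is made-up data, and intensities are assumed to be already normalised to the base peak; the actual implementation in `synthesis_bots` may differ.

```python
def matched_charges(observed, expected, tolerance=0.4, min_rel_intensity=0.5):
    """Return the charges whose expected m/z is matched by an intense enough peak.

    observed: list of (m/z, relative intensity) pairs, intensity in 0..1.
    expected: {charge: expected m/z} for one structure.
    """
    hits = []
    for charge, mz in expected.items():
        if any(
            abs(obs_mz - mz) <= tolerance and rel >= min_rel_intensity
            for obs_mz, rel in observed
        ):
            hits.append(charge)
    return hits


def passes_ms(observed, expected, n_metals, metals_mz=(3, 2), **criteria):
    """Apply the extra rule: with >= x metals, require >= y matched charge states."""
    min_metals, min_peaks = metals_mz
    required = min_peaks if n_metals >= min_metals else 1
    return len(matched_charges(observed, expected, **criteria)) >= required


# Made-up example: the peak at m/z 900.0 is too weak to count as a match.
expected = {"1": 900.0, "2": 400.0, "3": 233.3333}
observed = [(400.1, 1.0), (233.5, 0.6), (900.0, 0.1)]

hits = matched_charges(observed, expected)
print(hits)
print(passes_ms(observed, expected, n_metals=4))
```

With a single matched charge state, a structure containing four metals would be rejected, whereas one containing two metals would still pass.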

```python
from pathlib import Path
from synthesis_bots.utils.constants import PATHS

DATA = Path.cwd() / "DATA"
INPUT = DATA / "INPUT"
RAW_NMR = Path.cwd() / "NMR"
PLOTS = Path.cwd() / "PLOTS"
SUMMARY = Path.cwd() / "SUMMARY"
# Override the LCMS data path to point at the local copy under DATA/.
PATHS["LCMS_data"] = DATA / "SUPRAMOL-SCREENING" / "DATA" / "LCMS"

from synthesis_bots.workflows.ms.supramol.screening import results_analysis

results_analysis(
    expected_json=INPUT / "SUPRAMOL-SCREENING-EXPECTED-MS.json",
    archive_path=DATA / "SUPRAMOL-SCREENING" / "DATA" / "LCMS",
    summary_path=DATA / "SUPRAMOL-SCREENING" / "DATA" / "SUMMARY_MS.json",
)
```

### Testing for NMR criteria

In the supramolecular screening, the goal is to identify large but symmetric structures that potentially form metal-organic assemblies. Hence we perform automated NMR processing followed by peak picking and compare the reaction mixture to the NMR spectra of the starting materials. In successful reactions, we anticipate a small number of peaks (comparable to the sum of the numbers of peaks of the starting materials) that have shifted from their positions in the NMR spectra of the starting materials. Given that peaks may overlap, or that multiplet shapes may change slightly (which is significant on low-field NMR instruments), we allow there to be three peaks more or fewer than in the starting materials. Similarly, we expect at least 50% of the peaks to have shifted in position.

As previously, any of those criteria can be changed in the settings if the users so wish.
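
A rough sketch of the two criteria is given below. It rests on assumptions that are not stated in the text: a peak counts as "shifted" when no starting-material peak lies within a hypothetical tolerance `shift_tol`, and the proportion is taken over the peaks of the reaction mixture. The function names and chemical shifts are illustrative only.

```python
def is_shifted(peak, sm_peaks, shift_tol):
    """A peak is treated as shifted if no starting-material peak is within shift_tol ppm."""
    return all(abs(peak - sm) > shift_tol for sm in sm_peaks)


def passes_nmr(reaction_peaks, sm_peaks, peak_number=3, shifted_proportion=0.5, shift_tol=0.01):
    """Check the peak-count and shifted-proportion criteria (shift_tol is an assumption)."""
    if not reaction_peaks:
        return False
    # Criterion 1: the number of peaks differs by at most `peak_number`.
    count_ok = abs(len(reaction_peaks) - len(sm_peaks)) <= peak_number
    # Criterion 2: a large enough fraction of the peaks has moved.
    n_shifted = sum(is_shifted(p, sm_peaks, shift_tol) for p in reaction_peaks)
    return count_ok and n_shifted / len(reaction_peaks) >= shifted_proportion


# Made-up chemical shifts in ppm.
sm_peaks = [8.50, 7.90, 7.20, 4.10]
reaction_peaks = [8.72, 8.05, 7.20, 4.10, 3.30]

print(passes_nmr(reaction_peaks, sm_peaks))  # three of five peaks have moved
print(passes_nmr(sm_peaks, sm_peaks))        # nothing has moved
```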

```python
from pathlib import Path

DATA = Path.cwd() / "DATA"
INPUT = DATA / "INPUT"
RAW_NMR = Path.cwd() / "NMR"
PLOTS = Path.cwd() / "PLOTS"
SUMMARY = Path.cwd() / "SUMMARY"

from synthesis_bots.workflows.nmr.supramol.screening import results_analysis

results_analysis(
    sm_nmr_path=INPUT / "SUPRAMOL-SCREENING-SM-NMR.json",
    json_path=INPUT / "SUPRAMOL-SCREENING.json",
    data_path=RAW_NMR / "SUPRAMOL-SCREENING" / "DATA" / "NMR",
    ms_summary_path=DATA / "SUPRAMOL-SCREENING" / "DATA" / "SUMMARY_MS.json",
    nmr_summary_path=DATA / "SUPRAMOL-SCREENING" / "DATA" / "SUMMARY_NMR.json",
    archive_path=DATA / "SUPRAMOL-SCREENING" / "DATA" / "NMR",
)
```

### Decision making on the replication step

With the MS and NMR screening results at hand, the decision maker now looks through the summary JSON files to identify samples where all criteria passed. Based on those results, a ChemSpeed control CSV file is automatically generated, alongside the selection of expected m/z values for replication and the overall sample information JSON (which also contains a list of NMR experiments to perform).
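
The selection itself amounts to keeping the samples that passed both analyses. The sketch below uses a deliberately simplified, hypothetical summary layout (a pass/fail flag per sample); the real summary JSON files may be structured differently.

```python
# Hypothetical pass/fail flags per sample, standing in for the summary files.
ms_summary = {"SAMPLE_1": True, "SAMPLE_2": False, "SAMPLE_3": True}
nmr_summary = {"SAMPLE_1": True, "SAMPLE_2": True, "SAMPLE_3": False}


def select_hits(ms_summary: dict, nmr_summary: dict) -> list[str]:
    """Return the samples present in both summaries that passed both analyses."""
    common = ms_summary.keys() & nmr_summary.keys()
    return sorted(s for s in common if ms_summary[s] and nmr_summary[s])


hits = select_hits(ms_summary, nmr_summary)
print(hits)
```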

```python
from pathlib import Path

DATA = Path.cwd() / "DATA"
INPUT = DATA / "INPUT"
RAW_NMR = Path.cwd() / "NMR"
PLOTS = Path.cwd() / "PLOTS"
SUMMARY = Path.cwd() / "SUMMARY"

from synthesis_bots.workflows.nmr.supramol.screening import next_step

next_step(
    nmr_summary_path=DATA / "SUPRAMOL-SCREENING" / "DATA" / "SUMMARY_NMR.json",
    expected_ms_json=INPUT / "SUPRAMOL-SCREENING-EXPECTED-MS.json",
    cs_csv_supra_input=INPUT / "SUPRAMOL-SCREENING-CHEMSPEED.csv",
    future_cs_csv=INPUT / "SUPRAMOL-REPLICATION-CHEMSPEED.csv",
    future_ms_json=INPUT / "SUPRAMOL-REPLICATION-EXPECTED-MS.json",
    future_nmr_json=INPUT / "SUPRAMOL-REPLICATION.json",
)
```

## Synthesis replication

Hits identified in the previous step are now replicated six times and checked for purity.

### Testing for MS criteria

This time, we are trying to establish if any of the m/z values corresponding to the expected structure has been observed.

```python
from pathlib import Path
from synthesis_bots.utils.constants import PATHS

DATA = Path.cwd() / "DATA"
INPUT = DATA / "INPUT"
RAW_NMR = Path.cwd() / "NMR"
PLOTS = Path.cwd() / "PLOTS"
SUMMARY = Path.cwd() / "SUMMARY"
# Override the LCMS data path to point at the local copy under DATA/.
PATHS["LCMS_data"] = DATA / "SUPRAMOL-REPLICATION" / "DATA" / "LCMS"

from synthesis_bots.workflows.ms.supramol.replication import results_analysis

results_analysis(
    expected_json=INPUT / "SUPRAMOL-REPLICATION-EXPECTED-MS.json",
    archive_path=DATA / "SUPRAMOL-REPLICATION" / "DATA" / "LCMS",
    summary_path=DATA / "SUPRAMOL-REPLICATION" / "DATA" / "SUMMARY_MS.json",
)
```

### Testing for NMR criteria

During replication, we are interested in whether the NMR spectrum is identical (or very close) to the reference spectrum from the screening run. To that end, we employed the dynamic time warping (DTW) algorithm implemented in [dtw-python](https://dynamictimewarping.github.io/py-api/html/index.html). Before the DTW distance is calculated, the spectra are min-max normalised and all points with intensity below 5% are assumed to contribute only to the noise and are replaced with zeros. We considered a DTW distance of 20 as the threshold for "good agreement", but this can be adjusted in the settings (`dtw_threshold`) if others so wish.
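
The preprocessing and comparison can be illustrated with a small pure-Python sketch. Note that this is a textbook DTW with an absolute-difference cost, written out for clarity, and not the dtw-python implementation used in the workflow; distances computed this way are therefore not directly comparable to the threshold of 20. The function names and the toy spectra are made up.

```python
def min_max_normalise(intensities):
    """Scale intensities linearly so that the minimum is 0 and the maximum is 1."""
    lo, hi = min(intensities), max(intensities)
    if hi == lo:
        return [0.0 for _ in intensities]
    return [(v - lo) / (hi - lo) for v in intensities]


def suppress_noise(intensities, threshold=0.05):
    """Replace every point below the threshold with zero."""
    return [v if v >= threshold else 0.0 for v in intensities]


def dtw_distance(a, b):
    """Classic dynamic-programming DTW with |a_i - b_j| as the local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def preprocess(intensities):
    return suppress_noise(min_max_normalise(intensities))


# Toy spectra: the replica is the reference shifted by one point, with different noise.
reference = preprocess([0.0, 0.02, 1.0, 0.03, 0.0, 0.5, 0.01, 0.0])
replica = preprocess([0.0, 0.0, 0.01, 1.0, 0.02, 0.0, 0.5, 0.0])
different = preprocess([0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0])

print(dtw_distance(reference, replica))
print(dtw_distance(reference, different))
```

Because the warping path can stretch the flat baseline, a small shift along the axis does not increase the distance, whereas additional peaks do.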

```python
from pathlib import Path

DATA = Path.cwd() / "DATA"
INPUT = DATA / "INPUT"
RAW_NMR = Path.cwd() / "NMR"
PLOTS = Path.cwd() / "PLOTS"
SUMMARY = Path.cwd() / "SUMMARY"

from synthesis_bots.workflows.nmr.supramol.replication import results_analysis

results_analysis(
    json_path=INPUT / "SUPRAMOL-REPLICATION.json",
    data_path=RAW_NMR / "SUPRAMOL-REPLICATION" / "DATA" / "NMR",
    screening_data_path=RAW_NMR / "SUPRAMOL-SCREENING" / "DATA" / "NMR",
    ms_summary_path=DATA / "SUPRAMOL-REPLICATION" / "DATA" / "SUMMARY_MS.json",
    nmr_summary_path=DATA / "SUPRAMOL-REPLICATION" / "DATA" / "SUMMARY_NMR.json",
    archive_path=DATA / "SUPRAMOL-REPLICATION" / "DATA" / "NMR",
)
```

### Decision making on the host guest step

With the MS and NMR results at hand, the decision maker now looks through the summary JSON files to identify samples where all criteria passed. All samples are taken forward to the host-guest chemistry step, as it would be impossible to selectively keep repeating unsuccessful replicas on ChemSpeed. Furthermore, the addition of guests might template the host and lead to re-equilibration. However, a record is kept of all data for all samples in case any researcher would like to investigate the results after the automated analysis.

```python
from pathlib import Path

DATA = Path.cwd() / "DATA"
INPUT = DATA / "INPUT"
RAW_NMR = Path.cwd() / "NMR"
PLOTS = Path.cwd() / "PLOTS"
SUMMARY = Path.cwd() / "SUMMARY"

from synthesis_bots.workflows.nmr.supramol.replication import next_step

next_step(
    nmr_summary_path=DATA / "SUPRAMOL-REPLICATION" / "DATA" / "SUMMARY_NMR.json",
    json_path=INPUT / "SUPRAMOL-REPLICATION.json",
    cs_csv_supra_input=INPUT / "SUPRAMOL-REPLICATION-CHEMSPEED.csv",
    future_cs_csv=INPUT / "SUPRAMOL-HOST-GUEST-CHEMSPEED.csv",
    future_nmr_json=INPUT / "SUPRAMOL-HOST-GUEST.json",
)
```

## Host-guest chemistry

After identification of potential supramolecular capsules, a set of guest molecules is added to each replica. If the NMR peaks shift compared to the reference, then an interaction is occurring and the sample is identified as a hit. No MS analysis is performed at this stage as host-guest complexes are too fragile to be identified after ionisation.

### Testing for NMR criteria

The spectra are automatically processed and automated peak picking is performed. In the host-guest chemistry experiments, we intentionally apply much greater line broadening (exponential multiplication) so as to emulate the effects of fast NMR exchange (peak coalescence) in systems where slow exchange is observed but the peaks are close together. The algorithm checks the positions of the peaks observed in the reference metal-organic structure and identifies which signals have moved, which implies intermolecular interactions between the host and the guest.
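
The peak comparison can be sketched as follows, using the `hg_shift` value of 0.005 ppm from the settings file as the threshold. Pairing each reference peak with its nearest observed peak is an assumption made for this example, and the function name and chemical shifts are made up.

```python
def shifted_signals(reference_peaks, observed_peaks, hg_shift=0.005):
    """Return (reference, nearest observed, difference) for peaks that moved.

    Each reference peak is paired with the nearest observed peak (an
    assumption of this sketch); at least one observed peak is required.
    """
    shifted = []
    for ref in reference_peaks:
        nearest = min(observed_peaks, key=lambda obs: abs(obs - ref))
        delta = nearest - ref
        if abs(delta) > hg_shift:
            shifted.append((ref, nearest, round(delta, 4)))
    return shifted


# Made-up chemical shifts in ppm: only the first signal moves beyond the threshold.
reference_peaks = [8.500, 7.900, 7.200]
observed_peaks = [8.520, 7.902, 7.200]

result = shifted_signals(reference_peaks, observed_peaks)
is_hit = len(result) > 0
print(result, is_hit)
```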

```python
from pathlib import Path

DATA = Path.cwd() / "DATA"
INPUT = DATA / "INPUT"
RAW_NMR = Path.cwd() / "NMR"
PLOTS = Path.cwd() / "PLOTS"
SUMMARY = Path.cwd() / "SUMMARY"

from synthesis_bots.workflows.nmr.supramol.host_guest import results_analysis

results_analysis(
    json_path=INPUT / "SUPRAMOL-HOST-GUEST.json",
    data_path=RAW_NMR / "SUPRAMOL-HOST-GUEST" / "DATA" / "NMR",
    nmr_summary_path=DATA / "SUPRAMOL-HOST-GUEST" / "DATA" / "SUMMARY_NMR.json",
    replication_data_path=RAW_NMR / "SUPRAMOL-REPLICATION" / "DATA" / "NMR",
    archive_path=DATA / "SUPRAMOL-HOST-GUEST" / "DATA" / "NMR",
)
```