# PFlow

PFlow - Data processing workflow for PFAS identification

## Description

This software contains two data processing workflows. The first generates a PFAS suspect list
from publicly available resources (suspect list generation workflow). The second uses the
generated suspect list and input measurement data to identify PFAS (PFlow, identification
workflow).

The suspect list generation workflow was developed for generating suspect lists for PFAS (Per- and
polyfluoroalkyl substances). It addresses the challenge of non-uniform information across existing
suspect lists and incorporates isotopologues into the suspect list for direct use in isotope
analysis. Users can choose to include these isotopologues in the final suspect list during the
configuration step.
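
As a rough illustration of what adding isotopologues to a suspect list involves, the sketch below computes the m/z of isotopologues that carry one heavy atom. It is not the workflow's own code. The function name, the choice of heavy isotope per element, and the rounded mass differences are assumptions made for this example.

```python
# Approximate mass differences between the heavy and the light isotope (in Da).
# Which heavy isotope the workflow uses per element is an assumption here.
HEAVY_ISOTOPE_SHIFT = {
    "C": 1.003355,   # 13C - 12C
    "O": 2.004245,   # 18O - 16O
    "S": 1.995796,   # 34S - 32S
    "Cl": 1.997050,  # 37Cl - 35Cl
    "Br": 1.997952,  # 81Br - 79Br
}


def isotopologue_mz(mono_mz, element_counts, charge=1):
    """Map each present element to the m/z of its singly substituted isotopologue.

    mono_mz: monoisotopic m/z of the ion
    element_counts: dict such as {"C": 8, "F": 17, "O": 3, "S": 1}
    charge: absolute charge state of the ion (hypothetical parameter)
    """
    z = abs(charge)
    return {
        element: mono_mz + shift / z
        for element, shift in HEAVY_ISOTOPE_SHIFT.items()
        if element_counts.get(element, 0) > 0
    }


# Illustrative values: deprotonated PFOS, [M-H]- at m/z 498.9302
pfos = isotopologue_mz(498.9302, {"C": 8, "F": 17, "O": 3, "S": 1})
```

Only elements that occur in the formula produce an entry, so fluorine-only differences (fluorine is monoisotopic) never appear.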

PFlow is an automated data processing workflow designed for processing DI-UHRMS data from FT-ICR
MS, implemented in KNIME. It integrates R scripts and native KNIME nodes to perform molecular mass
matching and validation using heavy isotopes (C, S, O, Cl, Br, Si).
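
Molecular mass matching is commonly done by comparing a measured m/z with a theoretical m/z under a parts-per-million tolerance. The following is a minimal sketch of that idea, not the implementation used in the KNIME nodes or R scripts. The function names, the 1 ppm default tolerance, and the example m/z values are assumptions.

```python
def ppm_error(measured_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6


def match_suspects(measured_mz, suspects, tol_ppm=1.0):
    """Return (name, ppm error) pairs for suspects within +/- tol_ppm.

    suspects: dict mapping a suspect name to its theoretical m/z
    tol_ppm: tolerance in ppm (assumed default, adjust to your instrument)
    """
    hits = []
    for name, theoretical_mz in suspects.items():
        err = ppm_error(measured_mz, theoretical_mz)
        if abs(err) <= tol_ppm:
            hits.append((name, err))
    return hits


# Illustrative [M-H]- values for two well-known PFAS
suspects = {"PFOA": 412.9664, "PFOS": 498.9302}
hits = match_suspects(412.9665, suspects)
```

A relative (ppm) tolerance scales with m/z, which is why it is usually preferred over a fixed absolute window for high-resolution data.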

## Key Features

### Suspect List Generation Workflow

- **Uniform Suspect List Generation:** Consolidates multiple suspect lists with varying formats
into a standardized list.
- **Isotopologue Integration:** Includes isotopologues for enhanced isotope analysis.
- **User Configuration:** Allows users to decide whether to include isotopologues in the final
suspect list.

### PFAS Identification Workflow

- **Automated Data Processing:** Integrates R scripts and KNIME nodes to automate mass matching and
validation using heavy isotopes (C, S, O, Cl, Br, Si).
- **PubChem Database Integration:** Queries the PubChem database to retrieve possible molecular
formulas for specific m/z values.
- **Visualization:** Provides interactive visualizations such as Kendrick mass defect plots, mass
defect distributions, and elemental ratio plots.
- **Efficient Data Export:** Outputs results in XLSX format for easy compatibility with various
analysis tools.
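
For readers unfamiliar with Kendrick mass defect plots, the sketch below shows the underlying calculation with CF2 as the repeating unit, a common choice for PFAS. Whether the workflow uses CF2 and this rounding convention is an assumption. Conventions for the sign and rounding of the defect vary between tools.

```python
CF2_EXACT = 49.996806  # exact mass of CF2: 12.000000 + 2 * 18.998403
CF2_NOMINAL = 50       # nominal mass of CF2


def kendrick_mass(mz, exact=CF2_EXACT, nominal=CF2_NOMINAL):
    """Rescale m/z so that the repeating unit has an integer mass."""
    return mz * nominal / exact


def kendrick_mass_defect(mz, exact=CF2_EXACT, nominal=CF2_NOMINAL):
    """Nominal Kendrick mass minus Kendrick mass (one common convention)."""
    km = kendrick_mass(mz, exact, nominal)
    return round(km) - km


# Two ions that differ by exactly one CF2 unit (illustrative m/z values)
kmd_a = kendrick_mass_defect(412.9664)
kmd_b = kendrick_mass_defect(412.9664 + CF2_EXACT)
```

Members of a homologous series that differ only by CF2 units share the same Kendrick mass defect, so they line up horizontally in the plot.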

## Installation and Preparation

Follow the steps below to prepare your machine for executing the workflows.

### Download and Install KNIME Analytics Platform

- Download KNIME Analytics Platform version 4.7.0 from the
[KNIME website](https://www.knime.com/downloads/previous)
- Follow the installation instructions for your operating system

### Open the Workflows

- Open the KNIME Analytics Platform and initialize a new workspace by following the instructions
on the screen.
- Import the workflows by right-clicking `LOCAL` and selecting `Import KNIME Workflow...`.
- Navigate in your file system to your PFlow download. There, select a workflow and click `Open`,
then click `Finish`. Repeat this step to import the other workflows.

### Set Up KNIME and Install Extensions

- Double-click the `Suspect list` workflow to open it. KNIME shows an error message asking
whether to install the missing extensions. Click `Yes` and follow the instructions on the screen
to finish the installation. Repeat this process with the `KNIME_Suspect screening` workflow.
- KNIME comes with its own Python environment for a quick start. To enable it, go to `File` >
`Preferences` > `KNIME` > `Python`, select `Bundled`, and click `Apply and Close`.
- R must be installed manually before it can be used in KNIME. After installing it, go to `File` >
`Preferences` > `KNIME` > `R`, set the path to your R home directory, and click
`Apply and Close`.

### Suspect List Preparation

- Ensure that all suspect lists you want to merge are located in a single folder.
- Dummy suspect lists for testing are stored in the directory `PFlow` > `data` > `reference` >
`suspect_lists`.
- Suspect lists should have the following header (column names):
  - `DTXSID`
  - `PREFERRED NAME`
  - `CASRN`
  - `INCHIKEY`
  - `IUPAC NAME`
  - `SMILES`
  - `INCHI STRING`
  - `MOLECULAR FORMULA`
  - `AVERAGE MASS`
  - `MONOISOTOPIC MASS`
  - `# of SOURCES`
  - `# OF PUBMED ARTICLES`
  - `PUBCHEM DATA SOURCES`
  - `CPDAT COUNT`
  - `QC Level`
  - `# ToxCast Active`
  - `% ToxCast Active`
  - `Total Assays`
- The suspect lists do not need to contain the same number of columns, but only the columns listed above are allowed.
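
Before merging, it can help to check a suspect list header against the allowed column names. The sketch below is a hypothetical helper and not part of the workflow. It assumes an exact, case-sensitive match on the names listed above.

```python
# Allowed column names, copied from the list above.
ALLOWED_COLUMNS = {
    "DTXSID", "PREFERRED NAME", "CASRN", "INCHIKEY", "IUPAC NAME", "SMILES",
    "INCHI STRING", "MOLECULAR FORMULA", "AVERAGE MASS", "MONOISOTOPIC MASS",
    "# of SOURCES", "# OF PUBMED ARTICLES", "PUBCHEM DATA SOURCES",
    "CPDAT COUNT", "QC Level", "# ToxCast Active", "% ToxCast Active",
    "Total Assays",
}


def unexpected_columns(header):
    """Return the column names that are not allowed, in input order."""
    return [name for name in header if name.strip() not in ALLOWED_COLUMNS]


# A header with one column that the workflow would not accept
bad = unexpected_columns(["DTXSID", "SMILES", "Notes"])
```

An empty result means the header only contains allowed columns. A subset of the allowed columns is fine, since the lists do not need the same number of columns.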

### Measurement Data Preparation

- Ensure columns in your XLSX files are correctly named. For the measurement sample, use:
  - `MeasuredMass`
  - `S/N`
  - `MeasuredIntensity`
  - `I%`
  - `Res.`
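
A quick way to catch misnamed measurement columns is to compare the header of the XLSX sheet with the names above. The helper below is a hypothetical sketch. It assumes that all five columns are required and that the header has already been read into a list of strings.

```python
# Column names expected in the measurement sample, copied from the list above.
REQUIRED_COLUMNS = ["MeasuredMass", "S/N", "MeasuredIntensity", "I%", "Res."]


def missing_columns(header):
    """Return the required column names that are absent from the header."""
    present = {name.strip() for name in header}
    return [name for name in REQUIRED_COLUMNS if name not in present]


# A header that lacks two of the expected columns
missing = missing_columns(["MeasuredMass", "S/N", "I%"])
```

The check is deliberately strict about spelling and punctuation, for example the trailing period in `Res.`.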

## Usage

### Suspect List Generation

1. **Open KNIME Analytics Platform.**
2. **Load the Workflow:**
   - Double-click the `Suspect list` workflow to open it.
3. **Configure the Workflow:**
   - During the configuration step, preselect the folder containing the suspect lists.
   - Choose whether to include isotopologues in the final suspect list.
4. **Run the Workflow:**
   - From the home panel, press the green button with the two white right arrows to execute the entire workflow.
   - The workflow will automatically generate and save the suspect list in the directory specified during the configuration step.

**Note:** Only entries with a `SMILES` or `INCHIKEY` value go through the de-salting step; all other entries are treated as being in their neutral form.

### PFAS Identification with PFlow

1. **Open KNIME Analytics Platform.**
2. **Load the Workflow:**
   - Double-click the `KNIME_Suspect screening` workflow to open it.
3. **Configure the Workflow:**
   - During the configuration step, select the input data and set the parameters.
4. **Run the Workflow:**
   - From the home panel, press the green button with the two white right arrows to execute the entire workflow.
   - The workflow will automatically generate the results and save them in XLSX format in the directory specified during the configuration step.

## Infrastructure

This workflow was built using a combination of native KNIME nodes, R scripts, and Python scripts.
Below is a brief overview of these components:

- **KNIME Nodes:** Native nodes for data manipulation and integration.
- **R Scripts:** Custom scripts for specific data processing tasks.
- **Python Scripts:** Custom scripts for advanced data analysis and processing.

## License

This project is licensed under a UFZ-adapted GPL-3.0-only software license. See the license file
(LICENSE) for detailed information.

## Credits

This workflow was developed using the following tools and platforms:

- [KNIME Analytics Platform](https://www.knime.com/)
- [R](https://www.r-project.org/)
- [Python](https://www.python.org/)

## How to cite this material

To cite this material, please use the following DOI: 10.5281/zenodo.11633376

## Contact

For questions or support, please contact Silvia Dudasova at [silvia.hupcejova-dudasova@ufz.de] or
Johann Wurz at [johann.wurz@ufz.de].
