Published December 9, 2025 | Version v1.0
Dataset Open

Smart Microfluidics: a curated dataset of microfluidic liposome formulations with cross-laboratory validation for machine-learning applications

  • 1. Sapienza University of Rome
  • 2. Unitelma Sapienza University
  • 1. Sapienza University of Rome
  • 2. BSP Pharmaceuticals

Description

This dataset documents how different formulation choices and microfluidic operating conditions influence the size and uniformity of liposomes produced using a controlled microfluidic system. The data were generated through a multi-step experimental workflow that included a screening phase, two response-surface optimization phases, and an independent cross-laboratory study. Together, these experiments provide an extensive view of how lipid composition and flow conditions shape the final properties of the liposomes.   

Liposome formation is sensitive to both the ingredients used (such as lipid ratios) and the production settings (such as flow rates, flow-rate ratios, and buffer choice). Factors like mixing intensity, solvent–aqueous ratios, and chip geometry all influence how lipids self-assemble into vesicles. This dataset enables the systematic study of these relationships across a broad and diverse experimental space.

The dataset package is distributed as the archive microfluidics_dataset.zip, which contains five subfolders:

  • code: This folder contains Python code for preprocessing steps and examples of data exploration and usage. It includes the following:

    • logs: directory containing logging information about the processing scripts in the current directory.
    • data_gn_adder.py: script that generates the dataset extension data_extensions/formulations_extended_with_gn.csv.
    • data_smote_adder.py: script that generates the dataset extension data_extensions/formulations_extended_with_SMOTE.csv.
    • raw_data_checker.py: script that validates and preprocesses formulations in the raw_data directory.
    • raw_data_slicer.py: script that selects the correct CHIP configuration from the files in the raw_data directory.

  • data: This folder contains the preprocessed datasets derived from the raw spreadsheets, specifically:

    • formulations.csv: the main cleaned dataset obtained by preprocessing and merging initial_formulations_raw.xlsx with new_formulations_raw.xlsx.
    • wet_lab_validation.csv: contains independent wet-lab validation formulations obtained by preprocessing wet_lab_validation_raw.xlsx.

  • data_extensions: This folder contains two artificially extended datasets:

    • formulations_extended_with_gn.csv, generated by adding Gaussian noise to formulations.csv.
    • formulations_extended_with_SMOTE.csv, generated by applying SMOTE interpolation to formulations.csv.
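The two augmentation ideas named above can be sketched as follows. This is a minimal, standard-library-only illustration and not the code in data_gn_adder.py or data_smote_adder.py: the noise scaling rule (`rel_sigma`), the neighbour count `k`, the column names `tfr` and `frr`, and the toy values are all assumptions made for the example. The second function is a simplified SMOTE-style interpolation that ignores class labels.

```python
import math
import random


def add_gaussian_noise(rows, rel_sigma=0.05, copies=1, seed=0):
    """Return noisy copies of numeric rows.

    Each value v is shifted by a draw from N(0, (rel_sigma * |v|)^2).
    This relative scaling is an assumption for illustration only.
    """
    rng = random.Random(seed)
    augmented = []
    for _ in range(copies):
        for row in rows:
            augmented.append(
                {k: v + rng.gauss(0.0, rel_sigma * abs(v)) for k, v in row.items()}
            )
    return augmented


def smote_like(rows, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between a randomly
    chosen row and one of its k nearest neighbours (Euclidean distance)."""
    rng = random.Random(seed)
    keys = list(rows[0])
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(rows)
        base_vec = [base[c] for c in keys]
        others = [r for r in rows if r is not base]
        others.sort(key=lambda r: math.dist(base_vec, [r[c] for c in keys]))
        neighbour = rng.choice(others[:k])
        u = rng.random()  # interpolation weight in [0, 1)
        synthetic.append({c: base[c] + u * (neighbour[c] - base[c]) for c in keys})
    return synthetic


# Toy rows with hypothetical column names; these values are not from the dataset.
TOY_ROWS = [
    {"tfr": 1.0, "frr": 3.0},
    {"tfr": 2.0, "frr": 5.0},
    {"tfr": 3.0, "frr": 4.0},
]
noisy_rows = add_gaussian_noise(TOY_ROWS, copies=2)
synthetic_rows = smote_like(TOY_ROWS, n_new=5)
```

Because interpolation only produces points between existing samples, SMOTE-style rows stay inside the observed range of each feature, whereas Gaussian noise can push values slightly outside it. Augmented rows of either kind are synthetic and should be kept out of any validation split.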

  • metadata: This folder contains documentation and supporting information, including:
    • features_bounds.json: a dictionary describing the physical constraints of each feature.
    • features_names_mappings.json: a dictionary containing standardized names for each feature and additional data conventions.

    • features_descriptions.txt: a text file providing brief descriptions of each feature in the raw datasets.
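A bounds dictionary of this kind can be used to flag physically implausible records. The sketch below assumes a layout of `{"feature": [min, max]}`; the real structure of features_bounds.json may differ, and the feature names and limits shown are illustrative placeholders rather than values taken from the package.

```python
import json

# Hypothetical bounds in an assumed {"feature": [min, max]} layout.
EXAMPLE_BOUNDS_JSON = '{"pdi": [0.0, 1.0], "chol": [0.0, 100.0]}'


def find_bound_violations(record, bounds):
    """Return the names of features whose value lies outside [min, max].

    Features missing from the record are skipped rather than reported.
    """
    violations = []
    for feature, (low, high) in bounds.items():
        value = record.get(feature)
        if value is None:
            continue
        if not (low <= value <= high):
            violations.append(feature)
    return violations


bounds = json.loads(EXAMPLE_BOUNDS_JSON)
# A toy record (not from the dataset) with an out-of-range PDI.
flagged = find_bound_violations({"pdi": 1.4, "chol": 30.0}, bounds)
```

With the actual file, `json.load` on an open file handle would replace the inline string, and the unpacking in the loop would need to match whatever keys the file really uses.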

  • raw_data: This folder contains the raw spreadsheets compiled by the operators:

    • initial_formulations_raw.xlsx was collected by one operator (operator A) using a specific instrument (equipment A) in a laboratory in Latina (Latium), Italy (laboratory A).  

    • new_formulations_raw.xlsx was collected by operator A using an equivalent instrument (equipment B) in a laboratory in Rome, Italy (laboratory B). 

    • wet_lab_validation_raw.xlsx was collected by another independent operator (operator B) using equipment A in laboratory A.

Each data record includes formulation components (ESM, HSPC, CHOL, PEG), microfluidic operating conditions (Total Flow Rate, Flow Rate Ratio, aqueous medium), and the microfluidic hardware type (Droplet or Micromixer). Output variables are liposome size, polydispersity index (PDI), and a binary formation indicator. Only the Micromixer produced reliable liposomes; therefore, only Micromixer entries are retained in the cleaned dataset.
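The chip filtering described above amounts to keeping only the rows whose hardware type is Micromixer. The following sketch shows the idea with the standard csv module; the column name `chip`, the other headers, and the sample rows are hypothetical, and raw_data_slicer.py may implement the selection differently.

```python
import csv
import io

# Toy CSV with hypothetical headers and values (not taken from the dataset).
SAMPLE_CSV = """chip,tfr,frr,size,pdi,formed
Micromixer,2.0,3.0,120.5,0.12,1
Droplet,2.0,3.0,,,0
micromixer ,4.0,5.0,98.3,0.09,1
"""


def keep_micromixer(csv_text, chip_column="chip"):
    """Return only the rows whose chip type is Micromixer.

    The comparison ignores case and surrounding whitespace so that
    inconsistent manual entries are still matched.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row for row in reader
        if row[chip_column].strip().lower() == "micromixer"
    ]


micromixer_rows = keep_micromixer(SAMPLE_CSV)
```

Normalising the label before comparing is a defensive choice for hand-compiled spreadsheets; if the raw files already use a controlled vocabulary, a plain equality check would be enough.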

The preprocessing pipeline applies standardized naming, consistency checks, physical constraint validation, chip filtering, outlier removal, and optional augmentation via Gaussian noise or SMOTE. The final cleaned dataset contains 304 high-quality microfluidic liposome formulations suitable for statistical modelling and machine learning.

Further details, including full preprocessing steps, variable definitions, and usage examples, are available in the README file included in this package.

Files

README.pdf

Files (345.3 kB)

Name Size
md5:28fcc5379016033c8f7a634a9eac0795
100.8 kB
md5:c33f49a3e24bd726a96a98ee307fd7db
244.5 kB
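The MD5 checksums listed above can be used to confirm that a downloaded file is intact. The sketch below uses only hashlib from the standard library; the helper names are illustrative, and the local path passed to them is whatever location the file was saved to.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

A call such as `matches_listed_md5(local_path, "md5:c33f49a3e24bd726a96a98ee307fd7db")` returns True only if the file at `local_path` has exactly that digest. MD5 is adequate for detecting accidental corruption during download, but it is not a safeguard against deliberate tampering.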

Additional details

Related works

Is source of
Publication: 10.1016/j.ijpharm.2025.126362 (DOI)

Dates

Created
2025-12-09

Software