SMART-SHap

Ko, Jae-Hyeok; CHEUNG, Joonsuk; Kim, Chang Min

doi:10.5281/zenodo.18102223

Published February 19, 2026 | Version v2

Software Open


1. Yonsei University
2. Hanyang University


**Statistical Modeling for Advanced semiconductor Recipe Tuning using SHAP algorithms**

### Authors

- Jae-Hyeok Ko (Yonsei University)
- Joonsuk Cheung (Yonsei University)
- Changmin Kim (Hanyang University)

### Release History

- 2025-11-07: SMART-SHap v1 released
- 2025-12-25: SMART-SHap v2 released
- 2026-02-19: SMART-SHap v3 released (actively maintained)

SMART-SHap is a lightweight Python package scaffold for **semiconductor process parameter analysis**.
Its command-line interface (CLI), exposed via `smartsh.py`, orchestrates an end-to-end workflow covering data validation, model training, and feature attribution analysis.

Key functionalities include:

- Model training and validation on process data using **nested cross-validation**, with utilities in the IO module
- Identification of problematic or influential process features using **XGBoost, LightGBM, or CatBoost combined with TreeSHAP** in the FeatureAttribution module
- Parallel execution via `mpirun`, provided a working MPI installation is available

### Repository Structure

- `smartsh.py` – CLI entry point for terminal execution
- `smartsh/` – Core Python package
  - `cli.py` – Argument parsing and workflow dispatch
  - `IO/` – Dataset loading, profiling, dtype coercion, and nested CV utilities
  - `ML_model/` – Algorithm-specific analysis pipelines (XGBoost / LightGBM / CatBoost)
  - `FeatureAttribution/` – TreeSHAP training helpers, plot generation, and report writing
  - `hyperparam_tune/` – Hyperparameter search strategies (grid, random search + TPE, two-stage TPE)
  - `mpirun/` – MPI-aware nested execution helper scripts
- `requirements.txt` – Python dependency list

### Quick Start

0. Input preparation (important)

1) SMART-SHap analyzes files in CSV format.
2) Each row of the CSV file should hold one data instance, and each column one process parameter.
3) The results to be used as the classification target should also be stored in a column.
(That column must be named on the command line; e.g., `smartsh.py --target "Pass/Fail"`, where `Pass/Fail` is the column name.)
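
The layout described above can be sketched with the Python standard library alone. The parameter names `temperature_C` and `pressure_mTorr` and all values below are hypothetical placeholders; only the `Pass/Fail` target column name comes from the examples in this README.

```python
import csv
import io

# Hypothetical process parameters (columns) plus one target column.
FIELDNAMES = ["temperature_C", "pressure_mTorr", "Pass/Fail"]

# Each row is one data instance (for example, one wafer or one run).
ROWS = [
    {"temperature_C": 350.0, "pressure_mTorr": 12.5, "Pass/Fail": 1},
    {"temperature_C": 355.0, "pressure_mTorr": 13.0, "Pass/Fail": -1},
    {"temperature_C": 348.5, "pressure_mTorr": 12.1, "Pass/Fail": 1},
]


def build_csv_text(rows, fieldnames):
    """Serialize rows into CSV text with a single header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = build_csv_text(ROWS, FIELDNAMES)
print(csv_text)
```

Writing `csv_text` to a file such as `your_data.csv` would give an input whose target column can be selected with `--target "Pass/Fail"`.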

1. (Optional) Because some packages can cause version conflicts with the user's existing environment,
we recommend using a Python virtual environment (a conda environment is also strongly recommended).

python -m venv .venv
source .venv/bin/activate

2. Install required dependencies

pip install -r requirements.txt

OR

conda install --file requirements.txt

3. Run the analysis from the directory containing your CSV file

python /path/to/smartsh.py --markdown --algo XGBoost \
--input your_data.csv \
--output run1 \
--target "Pass/Fail"

### CLI Argument Details

- `--input`
Path to the input CSV file (file name only if located in the current directory)

- `--output`
Directory where results will be saved (automatically created if it does not exist)

- `--target`
Name of the target column
If the name contains special characters (e.g., `Pass/Fail`), wrap it in quotes

- Classification targets are automatically re-encoded to `0 ... n-1`, even if the original labels are `-1/1` or otherwise not zero-based

- Non-numeric columns (strings, timestamps) are automatically converted into numeric or categorical representations compatible with XGBoost/LightGBM/CatBoost

- `--nested`
Enable nested cross-validation for hyperparameter tuning (outer folds + inner folds).

- `--nested-mpirun`
Enable MPI-aware nested execution (equivalent to combining nested CV flow with
MPI outer-fold distribution).

- By default, **nested cross-validation preserves temporal order** (time-blocked split).
Use `--shuffled-nested` to switch to shuffled K-Fold or StratifiedKFold.

- `--hypertuning {grid,randomTPE,2stageTPE}`
  Unified hyperparameter tuning selector:
  - `--hypertuning grid` → nested GridSearchCV hyperparameter tuning
  - `--hypertuning randomTPE` → coarse random search followed by refined TPE
  - `--hypertuning 2stageTPE` → two-stage TPE deepsearch (per-stage TPE trials)

- `--sampling <int>`
  - With `--hypertuning randomTPE`: sets coarse random-search trial count
  - With `--hypertuning 2stageTPE`: sets per-stage outer-fold TPE trial count

- If `--group-column <column_name>` is specified:
  - Samples belonging to the same group are assigned to the same fold
  - The group column is automatically excluded from model features

- `--markdown` / `--algo {XGBoost,LightGBM,CatBoost}`
`--markdown` enables dataset summary reporting; `--algo` selects which model + SHAP analysis to run
(e.g., `--algo LightGBM` to run the LightGBM workflow)
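
Two of the behaviors listed above, target re-encoding to `0 ... n-1` and the default time-blocked split, can be illustrated with a small sketch. This is not the package's implementation: the helper names `encode_labels` and `time_blocked_folds` are hypothetical, the sorted label order is an assumption, and "time-blocked" is read here as contiguous, unshuffled test blocks (the real splitter may differ, for example by using forward chaining).

```python
def encode_labels(labels):
    """Map arbitrary class labels to integers 0 ... n-1 (sorted order assumed)."""
    classes = sorted(set(labels))
    mapping = {label: index for index, label in enumerate(classes)}
    return [mapping[label] for label in labels], mapping


def time_blocked_folds(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs with contiguous, unshuffled test blocks."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


encoded, mapping = encode_labels([-1, 1, 1, -1, 1, -1])
print(encoded)  # [0, 1, 1, 0, 1, 0]
print(mapping)  # {-1: 0, 1: 1}

# Rows are assumed to be sorted by time already; no shuffling happens here.
for train_idx, test_idx in time_blocked_folds(n_samples=6, n_folds=3):
    print(train_idx, test_idx)
```

In a nested setup, the same splitter would be applied a second time to each outer `train_idx` to produce the inner folds used for hyperparameter tuning.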

### MPI Usage for Nested TPE Deepsearch

When using `--nested-mpirun` with `--hypertuning 2stageTPE`, the MPI-aware nested
execution path is available for **XGBoost, LightGBM, and CatBoost**. Under MPI:

- `--hypertuning 2stageTPE` enables two-stage TPE deepsearch; use `--sampling <trials>`
to set per-stage trial count (defaults to 60 when omitted).
- Each rank runs stage1/stage2 TPE tuning for the outer folds assigned to that rank
(fold-local hyperparameter optimization).
- Outer folds are distributed across ranks (round-robin), and each fold is evaluated
using its own tuned best parameters.
- Total TPE trial count is `outer_folds × 2 × sampling`, where the factor of 2 accounts for stage 1 and stage 2.
- XGBoost internal thread count is reduced under MPI to avoid nested oversubscription.
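
The fold distribution and trial-count formula above can be worked through numerically. The sketch below assumes round-robin means "fold `i` goes to rank `i % n_ranks`" and uses a hypothetical count of 5 outer folds, since the README does not state the number of outer folds; the function names are illustrative only.

```python
def assign_folds_round_robin(n_outer_folds, n_ranks):
    """Return {rank: [fold indices]} with fold i handled by rank i % n_ranks."""
    assignment = {rank: [] for rank in range(n_ranks)}
    for fold in range(n_outer_folds):
        assignment[fold % n_ranks].append(fold)
    return assignment


def total_tpe_trials(n_outer_folds, sampling, n_stages=2):
    """Total trials = outer folds x stages (stage 1 + stage 2) x per-stage trials."""
    return n_outer_folds * n_stages * sampling


# Hypothetical setup: 5 outer folds on 4 MPI ranks with --sampling 1000.
assignment = assign_folds_round_robin(n_outer_folds=5, n_ranks=4)
print(assignment)                 # {0: [0, 4], 1: [1], 2: [2], 3: [3]}
print(total_tpe_trials(5, 1000))  # 10000
```

Under these assumptions rank 0 handles two folds while the others handle one, so choosing a rank count that divides the number of outer folds would balance the load.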

Example:

mpirun -np 4 python smartsh.py --algo XGBoost \
--input your_data.csv \
--output run_mpi \
--target "Pass/Fail" \
--nested-mpirun --hypertuning 2stageTPE --sampling 1000

Example (LightGBM):

mpirun -np 4 python smartsh.py --algo LightGBM \
--input your_data.csv \
--output run_mpi_lgbm \
--target "Pass/Fail" \
--nested-mpirun --hypertuning 2stageTPE --sampling 1000

Example (CatBoost):

mpirun -np 4 python smartsh.py --algo CatBoost \
--input your_data.csv \
--output run_mpi_cat \
--target "Pass/Fail" \
--nested-mpirun --hypertuning 2stageTPE --sampling 1000

### XGBoost Hyperparameter Overrides (optional)

If you enable XGBoost via `--algo XGBoost` but do **not** provide any of the following flags, the built-in defaults and/or nested CV search grid are used. To force specific values, pass one or more of the CLI options below:

- `--objective` – Explicit XGBoost objective (e.g., `binary:logistic`, `reg:squarederror`)
- `--eval_metric` – Override evaluation metric (otherwise defaults to `logloss` for classification or `rmse` for regression)
- `--booster` – Booster type (`gbtree`, `gblinear`, `dart`)
- `--learning_rate` – Learning rate (eta)
- `--max_depth` – Maximum tree depth
- `--missing` – Value to treat as missing (default `nan` if omitted)
- `--scale_pos_weight` – Class balancing weight for binary classification
- `--random_state` – Seed for reproducibility (affects CV and model)
- `--n_estimators` – Number of boosting rounds
- `--min_child_weight` – Minimum sum of instance weight needed in a child
- `--gamma` – Minimum loss reduction required to make a split
- `--subsample` – Row subsampling ratio
- `--colsample_bytree` – Column subsampling per tree
- `--colsample_bylevel` – Column subsampling per level
- `--colsample_bynode` – Column subsampling per split
- `--reg_alpha` – L1 regularization term
- `--reg_lambda` – L2 regularization term
- `--max_delta_step` – Maximum delta step for weight estimates

Example (forcing a shallower model with custom learning rate):

python smartsh.py --algo XGBoost \
--input your_data.csv \
--output tuned_run \
--target "Pass/Fail" \
--learning_rate 0.1 \
--max_depth 3 \
--subsample 0.8
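
The override behavior described in this section (defaults apply unless a flag is passed) can be sketched with `argparse`. This is an illustration under stated assumptions, not SMART-SHap's actual parser: the `DEFAULTS` values and the helper names `build_parser` and `merge_overrides` are hypothetical, and only three of the flags are shown.

```python
import argparse


def build_parser():
    """Parser with a few optional override flags; None means 'not provided'."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--learning_rate", type=float, default=None)
    parser.add_argument("--max_depth", type=int, default=None)
    parser.add_argument("--subsample", type=float, default=None)
    return parser


def merge_overrides(defaults, args):
    """Start from the defaults and replace only the values the user supplied."""
    params = dict(defaults)
    for name, value in vars(args).items():
        if value is not None:
            params[name] = value
    return params


# Hypothetical defaults, chosen only for illustration.
DEFAULTS = {"learning_rate": 0.3, "max_depth": 6, "subsample": 1.0}

args = build_parser().parse_args(["--learning_rate", "0.1", "--max_depth", "3"])
params = merge_overrides(DEFAULTS, args)
print(params)  # {'learning_rate': 0.1, 'max_depth': 3, 'subsample': 1.0}
```

Using `None` as the sentinel keeps "flag omitted" distinguishable from "flag set to a falsy value" such as `0`.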

### LightGBM Hyperparameter Overrides (optional)

When running `--algo LightGBM`, the workflow will continue to use its built-in defaults and tuning grid unless you pass explicit overrides. Provide any of the following flags to force specific LightGBM settings:

- `--objective` – LightGBM objective (e.g., `binary`, `regression`)
- `--metric` – Evaluation metric (e.g., `auc`, `rmse`)
- `--is_unbalance` / `--scale_pos_weight` – Class imbalance controls
- `--seed` / `--random_state` – Reproducibility seeds
- `--boosting_type` – Boosting flavor (`gbdt`, `dart`, `goss`, `rf`)
- Tree/leaf shape: `--max_depth`, `--num_leaves`, `--min_child_samples`, `--min_child_weight`, `--min_split_gain`
- Sampling: `--feature_fraction` (alias of `--colsample_bytree`), `--colsample_bytree`, `--bagging_fraction` (alias of `--subsample`), `--subsample`, `--bagging_freq`
- Learning schedule: `--learning_rate`, `--n_estimators`
- Regularization: `--lambda_l1`, `--lambda_l2`
- Histogram/binning: `--max_bin`, `--min_data_in_bin`, `--subsample_for_bin`

Example (custom imbalance handling and learning settings):

python smartsh.py --algo LightGBM \
--input your_data.csv \
--output tuned_lgbm \
--target "Pass/Fail" \
--objective binary \
--metric auc \
--is_unbalance \
--learning_rate 0.05 \
--num_leaves 63
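
The flag aliases listed above (for example `--feature_fraction` and `--colsample_bytree`) could be handled by registering both spellings on one `argparse` argument. How SMART-SHap resolves its aliases is not documented here, so the sketch below is only one plausible approach.

```python
import argparse

parser = argparse.ArgumentParser()
# Both option strings write to the same destination attribute.
parser.add_argument("--feature_fraction", "--colsample_bytree",
                    dest="colsample_bytree", type=float, default=None)
parser.add_argument("--bagging_fraction", "--subsample",
                    dest="subsample", type=float, default=None)

via_lightgbm_name = parser.parse_args(["--feature_fraction", "0.7"])
via_sklearn_name = parser.parse_args(["--colsample_bytree", "0.7"])

# Either spelling yields the same parsed value.
print(via_lightgbm_name.colsample_bytree == via_sklearn_name.colsample_bytree)  # True
```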

### CatBoost Hyperparameter Overrides (optional)

When running `--algo CatBoost`, the workflow uses its built-in defaults and tuning grid
unless you pass explicit overrides. The following flags map to CatBoost parameters:

- `--objective` – Loss function (e.g., `Logloss`, `RMSE`)
- `--eval_metric` – Evaluation metric (e.g., `AUC`, `RMSE`)
- `--learning_rate` – Learning rate
- `--max_depth` – Tree depth
- `--n_estimators` – Number of boosting iterations
- `--random_state` – Reproducibility seed
- `--scale_pos_weight` – Class balancing weight for binary classification

Example:

python smartsh.py --algo CatBoost \
--input your_data.csv \
--output tuned_catboost \
--target "Pass/Fail" \
--objective Logloss \
--learning_rate 0.08 \
--max_depth 6
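
Several of these flag names are aliases of CatBoost's native parameter names (`loss_function`, `depth`, `iterations`, `random_seed`). A translation step like the one below is one way the mapping could look; the helper `to_catboost_params` is hypothetical, and whether SMART-SHap renames the keys or passes the aliases through unchanged is not stated in this README.

```python
# CLI-style names on the left, CatBoost's native parameter names on the right.
FLAG_TO_CATBOOST = {
    "objective": "loss_function",
    "max_depth": "depth",
    "n_estimators": "iterations",
    "random_state": "random_seed",
}


def to_catboost_params(cli_values):
    """Rename CLI-style keys to native names and skip unset (None) values."""
    return {
        FLAG_TO_CATBOOST.get(name, name): value
        for name, value in cli_values.items()
        if value is not None
    }


cli_values = {"objective": "Logloss", "learning_rate": 0.08,
              "max_depth": 6, "n_estimators": None}
catboost_params = to_catboost_params(cli_values)
print(catboost_params)
# {'loss_function': 'Logloss', 'learning_rate': 0.08, 'depth': 6}
```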

### Shared Hyperparameter Tuning Strategies

The `smartsh/hyperparam_tune` package centralizes nested CV tuning logic for XGBoost,
LightGBM, and CatBoost. The CLI `--hypertuning` flag dispatches to the following
shared strategy modules:

- `gridsearch.py` → nested GridSearchCV tuning
- `random_tpe.py` → coarse random search followed by TPE refinement
- `two_stage_tpe.py` → two-stage TPE deepsearch (coarse → refined)

Each pipeline forwards the model-specific estimator builder and per-fold SHAP callback
to these shared routines so tuning behavior remains consistent across algorithms.
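
The "estimator builder plus per-fold callback" pattern described above can be sketched as follows. Everything in this block is hypothetical: `tune_per_fold` stands in for a shared strategy routine, `ShrunkenMean` is a toy model replacing a real gradient-boosting estimator, the candidate list replaces a real search strategy, and the callback only records the fold index where the real pipelines would run SHAP analysis.

```python
def tune_per_fold(build_estimator, fold_callback, folds, candidates):
    """Shared routine: pick the best candidate per fold, then notify the callback."""
    results = []
    for fold_id, (train, test) in enumerate(folds):
        scored = []
        for params in candidates:
            model = build_estimator(params)
            model.fit(train)
            scored.append((model.score(test), params))
        # Lower score is better (mean squared error).
        best_score, best_params = min(scored, key=lambda item: item[0])
        fold_callback(fold_id, best_params, best_score)
        results.append((fold_id, best_params, best_score))
    return results


class ShrunkenMean:
    """Toy stand-in for a real estimator: predicts shrink * mean(train)."""

    def __init__(self, shrink):
        self.shrink = shrink
        self.prediction = 0.0

    def fit(self, values):
        self.prediction = self.shrink * sum(values) / len(values)

    def score(self, values):
        """Mean squared error on held-out values."""
        return sum((v - self.prediction) ** 2 for v in values) / len(values)


callback_log = []
folds = [([1.0, 2.0, 3.0], [2.0]), ([2.0, 4.0], [3.0])]
candidates = [{"shrink": 0.5}, {"shrink": 1.0}]

results = tune_per_fold(
    build_estimator=lambda params: ShrunkenMean(**params),
    fold_callback=lambda fold_id, params, score: callback_log.append(fold_id),
    folds=folds,
    candidates=candidates,
)
print(results)       # [(0, {'shrink': 1.0}, 0.0), (1, {'shrink': 1.0}, 0.0)]
print(callback_log)  # [0, 1]
```

Because the routine only depends on the builder and the callback, swapping in a different algorithm would not require changes to the tuning loop itself.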

### SHAP Visualization Options (choose one or use `--shapall`)

- `--shapbar`
SHAP summary bar plot (`shap_summary_bar.png`)

- `--shapbee`
SHAP beeswarm plot (`shap_beeswarm.png`)

- `--shapdepend`
SHAP dependence plot for top-ranked features (`shap_dependence.png`)

- `--shapall`
Generate all three plots
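
The relationship between the four flags and the output files can be summarized in a short sketch. The file names are the ones listed above; the helper `requested_plots` is hypothetical and only mirrors the documented flag semantics.

```python
# Output file written by each individual plot flag (names from this README).
PLOT_FILES = {
    "shapbar": "shap_summary_bar.png",
    "shapbee": "shap_beeswarm.png",
    "shapdepend": "shap_dependence.png",
}


def requested_plots(shapbar=False, shapbee=False, shapdepend=False, shapall=False):
    """Return the plot files implied by the flags; shapall selects all three."""
    flags = {"shapbar": shapbar, "shapbee": shapbee, "shapdepend": shapdepend}
    return [PLOT_FILES[name] for name, enabled in flags.items() if enabled or shapall]


print(requested_plots(shapbar=True))  # ['shap_summary_bar.png']
print(requested_plots(shapall=True))
# ['shap_summary_bar.png', 'shap_beeswarm.png', 'shap_dependence.png']
```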

### Running Only SHAP Visualizations

To generate specific SHAP plots without executing the full workflow, run XGBoost analysis with the desired visualization flag:

python smartsh.py --algo XGBoost \
--input your_data.csv \
--output shap_plots \
--target "Pass/Fail" \
--shapbar

Files

SMART-SHap.zip (418.6 kB, md5:aeb5be1f36bd9feeb5429b6cc86e9a1c)

Additional details

Programming language: Python

