Published April 2025 | Version v3
Computational notebook Open

Breast Cancer - TDA

  • University of the Philippines Los Baños

Description

Topological Insights and Hybrid Feature Extraction for Breast Cancer Detection: A Persistent Homology Classification Approach

This is the repository associated with the submitted AI Applications article (titled above) in PeerJ Computer Science. It contains all the code used to run the experiments in the article. The focus is on using the Persistent Homology Classification Algorithm (PHCA) [1], a novel classifier based on computational topology, together with computer vision techniques to detect breast cancer in mammography scans. To replicate the results, follow the usage instructions below.


Code Information

 

Required Packages and Versions

  • numpy==2.1.2
  • matplotlib==3.9.2
  • scikit-learn==1.5.2
  • scikit-image==0.24.0
  • ripser==0.6.10
  • opencv-python==4.10.0.84
  • tqdm==4.66.6

All of these packages are indicated in the `requirements.txt` file. The ripser package works better for Python versions >3.7 and <= 3.10.10.
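
The version note above can be checked before installing anything. The following is a minimal sketch that reads ">3.7" as "Python 3.8 or newer"; that reading, and the helper name `in_recommended_range`, are assumptions for illustration and not part of the repository.

```python
import sys

# Bounds taken from the note above: versions > 3.7 and <= 3.10.10.
# Reading "> 3.7" as "3.8 or newer" is an assumption.
LOWER_EXCLUSIVE = (3, 7)
UPPER_INCLUSIVE = (3, 10, 10)


def in_recommended_range(version=None):
    """Return True if (major, minor, micro) lies in the recommended range."""
    if version is None:
        version = tuple(sys.version_info[:3])
    version = tuple(version[:3])
    return version[:2] > LOWER_EXCLUSIVE and version <= UPPER_INCLUSIVE


if not in_recommended_range():
    print("Warning: this Python version is outside the range suggested for ripser.")
```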

Content

  • A `main.py` file and a `generate_results.ipynb` notebook.
  • `modules` folder containing the following: `__init__.py`, `classification.py`, `phca.py`, and `pixel_extraction.py`


Usage Instructions

Data Preparation

  • Download the dataset from https://doi.org/10.5281/zenodo.14769221. You should be able to obtain 2 zip files: `BreastCancer_Benign.zip` and `BreastCancer_Malignant.zip`.
  • Save the downloaded files in a single folder. This folder will serve as your working directory for the steps that follow.
  • Extract the contents of each zip file. You should expect two folders: `BreastCancer_Benign` and `BreastCancer_Malignant`.
  • If each folder directly contains the images, you're all set. If a folder instead contains another folder with the same name, move that "inner" folder to your working directory. After the move, the now-empty outer folder is no longer needed.
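
The extraction and un-nesting steps above can also be scripted. This is a sketch using only the standard library; the helper name `extract_and_flatten` and the temporary folder name `_inner_tmp` are hypothetical, and the sketch assumes at most one level of duplicate nesting.

```python
import shutil
import zipfile
from pathlib import Path


def extract_and_flatten(zip_path, workdir):
    """Extract zip_path into workdir/<zip name> and undo one level of duplicate nesting."""
    zip_path, workdir = Path(zip_path), Path(workdir)
    target = workdir / zip_path.stem
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)

    inner = target / target.name  # e.g. BreastCancer_Benign/BreastCancer_Benign
    if inner.is_dir() and list(target.iterdir()) == [inner]:
        # Rename the inner folder first so its contents cannot clash with it,
        # then move everything up one level and remove the empty folder.
        staging = target / "_inner_tmp"
        inner.rename(staging)
        for item in list(staging.iterdir()):
            shutil.move(str(item), str(target / item.name))
        staging.rmdir()
    return target


# Example call (paths are placeholders for your own working directory):
# extract_and_flatten("BreastCancer_Benign.zip", ".")
# extract_and_flatten("BreastCancer_Malignant.zip", ".")
```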

Code Preparation (without cloning)

  • Download all files from this repository.
  • Extract the contents of the `modules.zip` file. You should expect a `modules` folder which contains Python files: `__init__.py`, `classification.py`, `phca.py`, and `pixel_extraction.py`.
  • Create a virtual environment and install all dependencies from the `requirements.txt` file.
  • Create a `prepared_data` folder in your working directory. This folder will hold the train and test data and targets for later use.

Code Preparation (with cloning)


```
git clone https://github.com/ji-chani/BreastCancer-TDA.git
```


Main Implementation

  • Run the `main.py` file. On the first run, make sure that the _extract_data_ and _extract_features_ global controls are set to _True_. This step will take some time.
  • After `main.py` finishes, you should find a new file named `PHCA_predicted_labels.npy`. It contains the true labels and the labels predicted by PHCA.
  • Run through the `generate_results.ipynb` Jupyter notebook to plot the results.


Summary of Methodology


The framework of the implementation is presented in the figure below. All breast cancer mammography scans in the dataset are first converted to grayscale with dimensions 224 pixels x 224 pixels x 1 channel. Then, the Histogram of Oriented Gradients (HOG) [2] descriptor is computed for each image, using (8,8) pixels per cell and (3,3) cells per block. The HOG output is flattened into a vector of dimension 54,756 for each image. Due to the computational limitations of the device used, only 6000 randomly selected images from the dataset are considered, of which 2017 are classified as benign and 3983 as malignant. The images are then split into training and test sets with an 80:20 ratio: 4800 images are used for training and 1200 images for validation or testing. After this, the image descriptors are scaled using Standard Scaler. Finally, to complete the preprocessing stage, Principal Component Analysis (PCA) [3] is performed on the dataset to reduce the dimension. The final feature vectors used for classification have 1,595 dimensions, which is the number of principal components that preserve 95% of the variability of the scaled image descriptors.
![image](framework.png)
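
The descriptor length and the split sizes quoted above can be reproduced with a few lines of arithmetic. This sketch assumes 9 orientation bins (the scikit-image default, not stated in the text) and a block stride of one cell; the helper name `hog_feature_length` is hypothetical.

```python
def hog_feature_length(image_size, pixels_per_cell, cells_per_block, orientations):
    """Length of a flattened HOG descriptor for a square image, block stride of one cell."""
    cells = image_size // pixels_per_cell   # cells along one side: 224 // 8 = 28
    blocks = cells - cells_per_block + 1    # block positions along one side: 26
    return blocks * blocks * cells_per_block * cells_per_block * orientations


# orientations=9 is an assumption (scikit-image default).
n_features = hog_feature_length(224, 8, 3, orientations=9)
print(n_features)  # 54756

# 80:20 split of the 6000 selected images (2017 benign + 3983 malignant).
n_images = 2017 + 3983
n_test = n_images * 20 // 100
n_train = n_images - n_test
print(n_train, n_test)  # 4800 1200
```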

 

Code Explanation

  • Starting with the `main.py` file, the pixels of the images are first extracted using OpenCV and are organized according to class: benign and malignant.
  • Images are then split into 80:20 train and test sets.
  • Features are extracted from these pixels using HOG and PCA. The scaled principal components that preserve 95% of the variability of the HOG output are used as the features for each image.
  • Classification is implemented using PHCA. Metrics are then computed for analysis.
  • Results of the experiment can be viewed in the `generate_results.ipynb` file.
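
The split, scaling, and dimension-reduction steps listed above can be illustrated on synthetic data. The following is a NumPy-only stand-in, not the code in `modules`: the repository may use scikit-learn's scaler and PCA instead. The function names, the random data, and the choice to fit the scaler and PCA on the training split only are assumptions.

```python
import numpy as np


def fit_scaler_pca(X_train, variance=0.95):
    """Fit standard scaling and PCA on training data, keeping `variance` of the variability."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # guard against constant features
    Z = (X_train - mean) / std
    # Rows of Vt are the principal axes of the scaled (already centered) data.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    ratio = S**2 / np.sum(S**2)
    # Smallest number of components whose cumulative ratio reaches `variance`.
    k = min(int(np.searchsorted(np.cumsum(ratio), variance)) + 1, len(S))
    return mean, std, Vt[:k], ratio[:k]


def transform(X, mean, std, components):
    """Scale X with the training statistics and project onto the kept components."""
    return ((X - mean) / std) @ components.T


rng = np.random.default_rng(0)
# Synthetic, correlated features standing in for flattened HOG descriptors.
X = rng.normal(size=(200, 30)) @ rng.normal(size=(30, 30))

# 80:20 random split.
idx = rng.permutation(len(X))
split = len(X) * 80 // 100
train_idx, test_idx = idx[:split], idx[split:]

mean, std, components, ratio = fit_scaler_pca(X[train_idx])
F_train = transform(X[train_idx], mean, std, components)
F_test = transform(X[test_idx], mean, std, components)
print(F_train.shape, F_test.shape, round(float(ratio.sum()), 3))
```

The resulting `F_train` and `F_test` play the role of the reduced feature vectors that are passed to the classifier.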

 

References


[1] De Lara, M. L. D. (2023). Persistent homology classification algorithm. PeerJ Computer Science, 9:e1195.

[2] Madhuri, G. and Negi, A. (2023). Discriminative dictionary learning based on statistical methods, pages 55–77. Elsevier.

[3] Azman, B., Hussain, S., Azmi, N., Ghani, M., and Norlen, N. (2022). Prediction of distant recurrence in breast cancer using a deep neural network. Revista Internacional de Métodos Numéricos para Cálculo y Diseño en Ingeniería, 38(1).

 

 
