A HYBRID CNN–TRANSFORMER FRAMEWORK FOR AUTOMATED SKIN CANCER DETECTION FROM DERMOSCOPIC IMAGES

Hamza A. Mashagba, Suhaila Abuowaida, Azlan B. Abd Aziz, Nawaf Alshdaifat, Mahmoud Baniata, Mardeni Bin Roslee, Mohamad Yusoff Alias, Azwan Mahmud

doi:10.5281/zenodo.19949266

Published May 1, 2026 | Version v1

Journal article | Open access

Hamza A. Mashagba, Suhaila Abuowaida, Azlan B. Abd Aziz, Nawaf Alshdaifat, Mahmoud Baniata, Mardeni Bin Roslee, Mohamad Yusoff Alias, Azwan Mahmud (Contact person)¹

1. Faculty of Engineering and Technology, Multimedia University, Melaka, Malaysia; Centre for Wireless Technology (CWT), Multimedia University; Department of Data Science and Artificial Intelligence, Faculty of Prince Al-Hussein Bin Abdallah II for IT, Al Al-Bayt University, Mafraq, Jordan; Department of Information Technology, Faculty of Prince Al-Hussein Bin Abdullah II for Information Technology, The Hashemite University, Zarqa, Jordan; Department of Computer Science, Faculty of Information Technology, Applied Science Private University, Amman, Jordan; Faculty of Artificial Intelligence and Engineering, Multimedia University, Cyberjaya, Selangor, Malaysia

Early detection of melanoma and other skin cancers remains one of the most difficult challenges in clinical dermatology, because benign and malignant lesions often differ only subtly in appearance. In this work we introduce a hybrid deep learning framework that combines Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) to overcome the limitations of single-paradigm models. The framework uses a pre-trained EfficientNet-B4 to extract hierarchical local features from each image and a multi-layer Vision Transformer to capture long-range spatial dependencies and global contextual information. The two complementary representations are combined by a fusion module based on feature concatenation, multi-layer perceptron processing, and residual connections. We evaluated the hybrid architecture on the 33,126 dermoscopic images of the ISIC 2020 dataset using stratified 5-fold cross-validation. It outperformed the previous state-of-the-art model, a pre-trained EfficientNet-B4 + Attention, achieving 95.4% classification accuracy, 90.7% sensitivity, 95.1% specificity, and an AUC-ROC of 0.982. The gains in sensitivity and specificity represent clinically relevant improvements in melanoma detection and in the reduction of false positives. These results indicate that combining CNN-based local texture analysis with transformer-based global semantic understanding yields a more accurate and robust computer-aided diagnosis system, one that could support clinicians' decision-making and improve patient outcomes.
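For readers less familiar with the evaluation terms in the abstract, the following is a minimal sketch of stratified fold assignment and of the four reported metrics. It is not the authors' code. The function names (`stratified_folds`, `binary_metrics`, `auc_roc`), the encoding of malignant as 1 and benign as 0, and the 0.5 decision threshold are all assumptions made for illustration, and the toy labels and scores are invented. The sketch uses only the Python standard library and assumes each class has at least one sample.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds with similar class ratios (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets a near-equal share of it.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity, assuming malignant=1 and benign=0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # share of malignant lesions that are caught
        "specificity": tn / (tn + fp),  # share of benign lesions correctly cleared
    }


def auc_roc(y_true, scores):
    """AUC-ROC as the chance that a random positive outscores a random negative.

    Ties count as half a win. Quadratic in the number of samples, which is
    fine for a toy example but not for a full dataset.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented toy data: 3 malignant and 7 benign lesions with model scores.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.3, 0.4, 0.2, 0.1, 0.1, 0.05, 0.6, 0.3]
toy_preds = [1 if s >= 0.5 else 0 for s in toy_scores]  # assumed 0.5 threshold

print(binary_metrics(toy_labels, toy_preds))
print(auc_roc(toy_labels, toy_scores))
```

Stratification matters for a screening dataset in which malignant cases are a small minority: without it, a fold could contain too few positives for sensitivity to be estimated reliably. In a cross-validation run, the metrics would typically be computed on each held-out fold and then averaged.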

Files

703.2242-OJS Ready final 1.pdf (565.0 kB, md5:527bfec599e772f2f079d6e46f4cc416)
