Published November 16, 2025 | Version v1
Preprint · Open Access

Explainable Multimodal Deepfake Detection using Image and Audio Transformers with Attention Mechanisms

  • 1. BMS College of Engineering
  • 2. B.M.S. College of Engineering

Description

This work surveys current deepfake detection methods and introduces a new multimodal framework that combines vision transformers, audio-based spectral transformers, and attention-driven explainability tools. We review state-of-the-art approaches to both single-modal and multimodal detection, discuss the principal datasets and evaluation metrics used in the field, and highlight the major open challenges. We then propose an explainable multimodal model that uses cross-modal attention to make deepfake detection both more robust and more transparent.
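The full architecture is described in the accompanying PDF; as an illustration only, the cross-modal attention mechanism mentioned above — image tokens forming queries that attend over audio tokens as keys and values — can be sketched as follows. All function and variable names here are hypothetical and not taken from the paper; a minimal single-head sketch in numpy, assuming learned projection matrices `Wq`, `Wk`, `Wv`:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(img_tokens, aud_tokens, Wq, Wk, Wv):
    """Single-head cross-modal attention (illustrative sketch).

    Image tokens produce queries; audio tokens produce keys and values,
    so each image patch aggregates audio evidence. The returned attention
    map is the kind of artifact an explainability tool can visualize.
    """
    Q = img_tokens @ Wq          # (n_img, d)
    K = aud_tokens @ Wk          # (n_aud, d)
    V = aud_tokens @ Wv          # (n_aud, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    attn = softmax(scores, axis=-1)   # (n_img, n_aud), rows sum to 1
    return attn @ V, attn

# Toy usage: 4 image patch tokens, 6 audio frame tokens, dim 16.
rng = np.random.default_rng(0)
img = rng.standard_normal((4, 16))
aud = rng.standard_normal((6, 16))
Wq, Wk, Wv = (rng.standard_normal((16, 16)) for _ in range(3))
fused, attn = cross_modal_attention(img, aud, Wq, Wk, Wv)
```

Inspecting `attn` shows which audio frames each image patch relied on, which is one way an attention-based detector can be made transparent.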


Files

Explainable Multimodal Deepfake Detection.pdf (189.2 kB)
md5:282f563360d750ab39355ebd605b61c8