Bridging Traditional Explainability Methods and Multimodal Multilingual Models: An XAI-Based Analysis

Pozorski, Paweł Dominik; Muszyński, Jakub Miłosz; Ganzha, Maria

doi:10.5281/zenodo.19677572

Published April 21, 2026 | Version v1

Publication | Open access

1. Warsaw University of Technology
2. Systems Research Institute

Multimodal Large Language Models (MLLMs) can integrate text and audio to interpret context in interactive conversations. However, the mechanisms by which information from different modalities shapes model behavior remain difficult to analyze. Shapley values (SV) are widely used for local, model-agnostic explainability in simple text-based conversations, yet applying them directly to multimodal data is nontrivial because of cross-channel dependencies, dialogue structure, and the prohibitive computational cost of native audio tokenization. This work introduces a multimodal extension of SV in which units of information, such as text tokens and audio segments, are treated as cooperative features. To make the approach feasible under real-world constraints, we pair this formulation with efficient estimation methods: exact Shapley computation for short inputs, and sampling-based approximations (Monte Carlo permutation sampling and stratified sampling with Neyman allocation) that balance variance against a strict computational budget. To address the granularity mismatch between text and audio, we also propose Spectrogram-Guided Phonetic Alignment (SGPA), a pre-processing method that maps dense audio streams to interpretable, word-aligned segments. As an applied contribution, we provide a model-agnostic Python package for computing and visualizing multimodal Shapley values for text and audio. A companion GUI enables interactive inspection of attributions, side-by-side modality visualization, and method-specific estimates of computational cost. Furthermore, we curate resources derived from the VoiceBench and Infinity Instruct datasets, covering diverse modality configurations and multilingual scenarios. Validation experiments on these resources indicate that input modality is a significant driver of attribution volatility, while syntactic importance proxies often fail to predict model attention across languages.
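The estimation strategies named in the abstract can be illustrated with a minimal sketch. It is not the API of the accompanying package: the function names, the toy value function, and the feature indexing are all hypothetical, and the value function stands in for whatever score a real model would assign to a masked input. The sketch shows exact enumeration, Monte Carlo permutation sampling, and a simple Neyman-style budget split that assumes equally weighted strata.

```python
import itertools
import math
import random
from typing import Callable, FrozenSet, List, Sequence

# A value function maps a coalition of feature indices to a scalar score.
ValueFn = Callable[[FrozenSet[int]], float]


def exact_shapley(n: int, value: ValueFn) -> List[float]:
    """Exact Shapley values by enumerating all coalitions (exponential in n)."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Weight of a coalition of size k that excludes feature i.
            weight = (math.factorial(k) * math.factorial(n - k - 1)
                      / math.factorial(n))
            for subset in itertools.combinations(others, k):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def permutation_shapley(n: int, value: ValueFn,
                        num_permutations: int, seed: int = 0) -> List[float]:
    """Monte Carlo estimate: average marginal gains over random orderings."""
    rng = random.Random(seed)
    phi = [0.0] * n
    order = list(range(n))
    for _ in range(num_permutations):
        rng.shuffle(order)
        coalition: FrozenSet[int] = frozenset()
        previous = value(coalition)
        for i in order:
            coalition = coalition | {i}
            current = value(coalition)
            phi[i] += current - previous
            previous = current
    return [p / num_permutations for p in phi]


def neyman_allocation(stratum_stds: Sequence[float], budget: int) -> List[int]:
    """Split a sample budget across strata in proportion to their std.

    Assumes equally weighted strata (e.g. one stratum per coalition size).
    Rounded counts may not sum exactly to the budget.
    """
    total = sum(stratum_stds)
    if total == 0:
        return [budget // len(stratum_stds)] * len(stratum_stds)
    return [max(1, round(budget * s / total)) for s in stratum_stds]


def toy_value(coalition: FrozenSet[int]) -> float:
    """Hypothetical score: features 0 and 1 are text tokens, 2 is an audio segment."""
    score = 0.0
    if 0 in coalition:
        score += 1.0
    if 2 in coalition:
        score += 2.0
    if 0 in coalition and 1 in coalition:
        score += 1.0  # interaction between the two text tokens
    return score


exact = exact_shapley(3, toy_value)
approx = permutation_shapley(3, toy_value, num_permutations=2000)
```

In this toy setting the interaction term is shared equally between the two text tokens, and both estimators attribute a total equal to the gap between the full-input and empty-input scores. With real model calls, each evaluation of the value function is a forward pass, which is why the exact variant is only practical for short inputs.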

Files

Files (8.3 MB)

Name	Size
Engineering_Thesis.pdf (md5:c9c4405d05c9cac428eece5862a3479c)	8.3 MB

Additional details

Accepted: 2026-02-09 (thesis defended)

Repository URL: https://github.com/Pawlo77/MLLM-Shap
Programming language: Python
Development Status: Active

