Published April 21, 2026 | Version v1
Publication Open

Bridging Traditional Explainability Methods and Multimodal Multilingual Models: An XAI-Based Analysis

  • 1. Warsaw University of Technology
  • 2. Systems Research Institute

Description

Multimodal Large Language Models (MLLMs) can integrate text and audio to interpret context in interactive conversations. However, the mechanisms by which information from different modalities shapes model behavior remain difficult to analyze. Shapley values (SV) are widely used for local, model-agnostic explainability in simple text-based conversations, yet their direct application to multimodal data is nontrivial due to cross-channel dependencies, dialogue structure, and the prohibitive computational cost of native audio tokenization. This work introduces a multimodal extension of SV in which units of information, such as text tokens and audio segments, are treated as cooperative features. To make the approach feasible under real-world constraints, we pair this formulation with efficient estimation methods: exact Shapley computation for short inputs, and sampling-based approximations (Monte Carlo permutation sampling and stratified sampling with Neyman allocation) that balance variance against a strict computational budget. Additionally, to address the granularity mismatch between text and audio, we propose Spectrogram-Guided Phonetic Alignment (SGPA), a pre-processing method that maps dense audio streams to interpretable, word-aligned segments. As an applied contribution, we provide a model-agnostic Python package for computing and visualizing multimodal Shapley values for text and audio. A companion GUI enables interactive inspection of attributions, side-by-side modality visualization, and method-specific estimates of computational cost. Furthermore, we curate resources derived from the VoiceBench and Infinity Instruct datasets, encompassing diverse modality configurations and multilingual scenarios. Validation experiments on these resources indicate that input modality is a significant driver of attribution volatility, while syntactic importance proxies often fail to predict model attention across languages.
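To make the estimation trade-off described above concrete, the following is a minimal sketch of exact Shapley computation next to a Monte Carlo permutation estimate. It is not the API of the accompanying package: the function names (`exact_shapley`, `permutation_shapley`) and the toy value function `toy_value` are hypothetical, and the value function stands in for what would, in practice, be a model call on an input restricted to a subset of text tokens and audio segments.

```python
import itertools
import math
import random
from typing import Callable, FrozenSet, List

# A value function maps a coalition of feature indices to a scalar score.
ValueFn = Callable[[FrozenSet[int]], float]


def toy_value(coalition: FrozenSet[int]) -> float:
    """Hypothetical stand-in for a model score on a masked input.

    Features 0 and 1 play the role of text tokens, feature 2 an audio segment.
    Tokens 0 and 1 interact; the audio segment contributes additively.
    """
    score = 0.0
    if 0 in coalition:
        score += 1.0
    if 2 in coalition:
        score += 2.0
    if 0 in coalition and 1 in coalition:
        score += 1.0
    return score


def exact_shapley(n: int, value: ValueFn) -> List[float]:
    """Exact Shapley values by enumerating all coalitions (exponential in n)."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Standard Shapley weight |S|! (n - |S| - 1)! / n!
            weight = (
                math.factorial(size) * math.factorial(n - size - 1)
            ) / math.factorial(n)
            for subset in itertools.combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def permutation_shapley(
    n: int, value: ValueFn, num_permutations: int, seed: int = 0
) -> List[float]:
    """Monte Carlo estimate: average marginal contributions over random orderings."""
    rng = random.Random(seed)
    phi = [0.0] * n
    order = list(range(n))
    for _ in range(num_permutations):
        rng.shuffle(order)
        coalition: FrozenSet[int] = frozenset()
        previous = value(coalition)
        for i in order:
            coalition = coalition | {i}
            current = value(coalition)
            phi[i] += current - previous
            previous = current
    return [total / num_permutations for total in phi]


if __name__ == "__main__":
    print("exact:    ", exact_shapley(3, toy_value))
    print("estimated:", permutation_shapley(3, toy_value, num_permutations=2000))
```

The exact routine needs on the order of 2^n value-function calls, which is why it is only practical for short inputs; the permutation estimate needs n + 1 calls per sampled ordering, so its cost is set directly by the sampling budget. Stratified sampling with Neyman allocation, also mentioned above, is not shown here.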

Files

Engineering_Thesis.pdf

Files (8.3 MB)

Name: Engineering_Thesis.pdf
Size: 8.3 MB
Checksum: md5:c9c4405d05c9cac428eece5862a3479c

Additional details

Dates

Accepted
2026-02-09
Thesis Defended

Software

Repository URL
https://github.com/Pawlo77/MLLM-Shap
Programming language
Python
Development Status
Active