Bridging Traditional Explainability Methods and Multimodal Multilingual Models: An XAI-Based Analysis
Authors/Creators
Description
Multimodal Large Language Models (MLLMs) can integrate text and audio to interpret context in interactive conversations. However, the mechanisms by which information from different modalities shapes model behavior remain difficult to analyze. Shapley values (SV) are widely used for local, model-agnostic explainability in simple text-based conversations, yet applying them directly to multimodal data is nontrivial because of cross-channel dependencies, dialogue structure, and the prohibitive computational cost of native audio tokenization. This work introduces a multimodal extension of SV in which units of information, such as text tokens and audio segments, are treated as cooperative features. To make the approach feasible under real-world constraints, we pair this formulation with efficient estimation methods: exact Shapley computation for short inputs, and sampling-based approximations (Monte Carlo permutation sampling and stratified sampling with Neyman allocation) that balance variance against a strict computational budget. To address the granularity mismatch between text and audio, we also propose Spectrogram-Guided Phonetic Alignment (SGPA), a pre-processing method that maps dense audio streams to interpretable, word-aligned segments. As an applied contribution, we provide a model-agnostic Python package for computing and visualizing multimodal Shapley values for text and audio. A companion GUI enables interactive inspection of attributions, side-by-side modality visualization, and method-specific estimates of computational cost. Finally, we curate resources derived from the VoiceBench and Infinity Instruct datasets, covering diverse modality configurations and multilingual scenarios. These resources are used in validation experiments, which indicate that input modality is a significant driver of attribution volatility, while syntactic importance proxies often fail to predict model attention across languages.
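To make the two basic estimators named in the description concrete, the following is a minimal sketch, not the package's actual API. It treats each text token or audio segment as a player identified by a hypothetical `(modality, index)` tuple, and uses a toy value function `toy_value` in place of a real model score. The function names `exact_shapley` and `permutation_shapley` are illustrative assumptions. Stratified sampling with Neyman allocation is not shown here.

```python
import itertools
import math
import random


def exact_shapley(players, value):
    """Exact Shapley values by enumerating every coalition (feasible for short inputs only)."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            # Standard Shapley weight for coalitions of size k that exclude p.
            weight = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
            for subset in itertools.combinations(others, k):
                s = frozenset(subset)
                phi[p] += weight * (value(s | {p}) - value(s))
    return phi


def permutation_shapley(players, value, num_permutations=2000, seed=0):
    """Monte Carlo estimate: average marginal contributions over random player orderings."""
    rng = random.Random(seed)
    phi = {p: 0.0 for p in players}
    order = list(players)
    for _ in range(num_permutations):
        rng.shuffle(order)
        coalition = frozenset()
        previous = value(coalition)
        for p in order:
            coalition = coalition | {p}
            current = value(coalition)
            phi[p] += current - previous
            previous = current
    return {p: total / num_permutations for p, total in phi.items()}


# Hypothetical players: two text tokens and one audio segment.
players = [("text", 0), ("text", 1), ("audio", 0)]


def toy_value(coalition):
    """Toy stand-in for a model score; includes one cross-modal interaction term."""
    score = 0.0
    if ("text", 0) in coalition:
        score += 1.0
    if ("audio", 0) in coalition:
        score += 2.0
    if ("text", 1) in coalition and ("audio", 0) in coalition:
        score += 3.0  # only present when both modalities contribute
    return score


exact = exact_shapley(players, toy_value)
approx = permutation_shapley(players, toy_value)
```

In this toy setup the interaction term is split evenly between `("text", 1)` and `("audio", 0)`, and the attributions sum to the difference between the full-input score and the empty-input score. The exact routine evaluates a number of coalitions that grows exponentially with the number of players, which is why a sampling-based estimator is needed for longer inputs.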
Files
Engineering_Thesis.pdf
| Name | Size | Checksum |
|---|---|---|
| Engineering_Thesis.pdf | 8.3 MB | md5:c9c4405d05c9cac428eece5862a3479c |
Additional details
Dates
- Accepted
- 2026-02-09 (Thesis Defended)
Software
- Repository URL
- https://github.com/Pawlo77/MLLM-Shap
- Programming language
- Python
- Development Status
- Active