Impact of Visual Modality on Robustness of Self-Supervised Speech Representations Against Adversarial Attacks
Description
The intuitive interaction between the audio and visual modalities is valuable for cross-modal self-supervised learning. This concept has been demonstrated for generic audiovisual tasks like video action recognition and acoustic scene classification. However, self-supervision remains under-explored for audiovisual speech. We propose a method to learn self-supervised speech representations from the raw audio waveform. We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from au
Research goal: How does incorporating the visual modality into self-supervised learning of speech representations affect the robustness of neural source-filter models against adversarial attacks, as measured by metrics such as adversarial accuracy and perturbation resilience?
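As a rough illustration of the adversarial accuracy metric named in the research goal, the sketch below measures how the accuracy of a toy linear classifier degrades under an FGSM-style perturbation of increasing budget. Everything in it is an assumption made for illustration: the classifier, the logistic loss, the L-infinity budget `eps`, and the function names are hypothetical and are not taken from the report, which does not specify its attack or models. "Perturbation resilience" is not defined in the source; one possible reading, used here, is the accuracy retained as `eps` grows.

```python
# Toy sketch (not the report's method): adversarial accuracy of a linear
# binary classifier under an FGSM-style L-infinity perturbation.

def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def predict(w, b, x):
    """Predict label 1 if the linear score is non-negative, else 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else 0

def fgsm_perturb(w, x, y, eps):
    """Move x by eps along the sign of the loss gradient.

    For logistic loss on a linear model the input gradient is (p - y) * w,
    so its sign is sign(w) when y == 0 and -sign(w) when y == 1.
    """
    direction = -1 if y == 1 else 1
    return [xi + eps * direction * sign(wi) for xi, wi in zip(x, w)]

def accuracy(w, b, xs, ys):
    """Fraction of inputs whose predicted label matches the true label."""
    correct = sum(predict(w, b, x) == y for x, y in zip(xs, ys))
    return correct / len(ys)

def adversarial_accuracy(w, b, xs, ys, eps):
    """Accuracy after every input has been perturbed with budget eps."""
    adv = [fgsm_perturb(w, x, y, eps) for x, y in zip(xs, ys)]
    return accuracy(w, b, adv, ys)

# Hypothetical two-feature data; labels follow the sign of x[0] - x[1].
w, b = [1.0, -1.0], 0.0
xs = [[2.0, 0.0], [0.5, 0.0], [0.0, 2.0], [0.0, 0.3]]
ys = [1, 1, 0, 0]

for eps in (0.0, 0.2, 0.5):
    print(eps, adversarial_accuracy(w, b, xs, ys, eps))
```

With `eps = 0.0` the function reduces to clean accuracy, so the same routine yields both ends of the comparison; examples that lie close to the decision boundary are the first to be misclassified as the budget grows.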
Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 7.8/10.
Notes
Files

| Name | Size |
|---|---|
| paper.pdf (md5:641670b742ec6d4b607bb5781a105c9a) | 81.7 kB |