CTHFNet: Contrastive Translation and Hierarchical Fusion Network for Text-Video-Audio Sentiment Analysis
Creators
Description
Multimodal Sentiment Analysis (MSA) aims to predict human sentiment polarity or intensity from heterogeneous information sources such as text, audio, and video. Previous research has focused on exploring multimodal fusion strategies while neglecting intra-modal noise. In fact, both are crucial for sentiment prediction, as sentiment information may be dispersed across modalities or concentrated within a single modality. This paper presents a novel framework called the Contrastive Translation and Hierarchical Fusion Network (CTHFNet) to model complex relationships within and between modalities. Specifically, CTHFNet leverages a modality translator based on contrastive learning and a Seq2seq model to translate the non-verbal modalities into the textual modality, filtering out unimodal noise. In addition, CTHFNet employs a cross-hierarchical multimodal fusion network that captures interactions between modalities at different hierarchies, together with a contrastive learning task over sentiment classes. Extensive experiments on two widely used benchmark datasets, CMU-MOSI and CMU-MOSEI, demonstrate that our approach outperforms state-of-the-art methods on almost all metrics.
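The sketch below illustrates, in simplified form, the two ideas named in the description: a Seq2seq-style translator that maps a non-verbal modality (here, audio) into the text feature space, and an InfoNCE-style contrastive loss that aligns translated and real text features. It is a minimal illustration under assumed feature dimensions and module choices (GRU encoder/decoder, mean pooling), not the released CTHFNet code, which is contained in CTHFNet.zip.

```python
# Hedged sketch of contrastive modality translation; all dimensions and module
# choices are assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityTranslator(nn.Module):
    """Encode a non-verbal sequence and decode it into the text feature space."""

    def __init__(self, src_dim: int, txt_dim: int, hidden: int = 128):
        super().__init__()
        self.encoder = nn.GRU(src_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(txt_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, txt_dim)

    def forward(self, src: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # src: (B, T_src, src_dim); txt: (B, T_txt, txt_dim) used as decoder input (teacher forcing)
        _, h = self.encoder(src)        # h: (1, B, hidden) summarizes the source modality
        out, _ = self.decoder(txt, h)   # decode conditioned on the source summary
        return self.proj(out)           # (B, T_txt, txt_dim) "translated text" features


def contrastive_loss(translated: torch.Tensor, text: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """InfoNCE over pooled utterance features: matching (translated, text) pairs are positives."""
    a = F.normalize(translated.mean(dim=1), dim=-1)     # (B, D)
    b = F.normalize(text.mean(dim=1), dim=-1)           # (B, D)
    logits = a @ b.t() / tau                             # (B, B) similarity matrix
    labels = torch.arange(a.size(0), device=a.device)    # diagonal entries are the positive pairs
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    B = 4
    audio = torch.randn(B, 50, 74)    # placeholder acoustic features
    text = torch.randn(B, 20, 768)    # placeholder text token embeddings
    translator = ModalityTranslator(src_dim=74, txt_dim=768)
    translated = translator(audio, text)
    loss = F.mse_loss(translated, text) + contrastive_loss(translated, text)
    print(loss.item())
```

In this sketch the reconstruction term pulls translated features toward the text features element-wise, while the contrastive term discriminates matching from non-matching utterance pairs within the batch; the paper's actual objective and fusion network may differ.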
Files
Name | Size | MD5
---|---|---
CTHFNet.zip | 180.9 kB | 849067570098f0c4beceacae0fd08c83