Published June 23, 2024 | Version v1
Model · Open

CTHFNet: Contrastive Translation and Hierarchical Fusion Network for Text-Video-Audio Sentiment Analysis

Creators

Description

Multimodal Sentiment Analysis (MSA) aims to predict human sentiment polarity or intensity from heterogeneous information sources such as text, audio, and video. Previous research has focused on exploring multimodal fusion strategies while neglecting intra-modal noise. Both are crucial for sentiment prediction, as sentiment information may be dispersed across modalities or aggregated within a single modality. This paper presents a novel framework called Contrastive Translation and Hierarchical Fusion Network (CTHFNet) to model complex relationships within and between modalities. Specifically, CTHFNet leverages a modality translator based on contrastive learning and a Seq2seq model to translate the non-verbal modalities into the textual modality, filtering unimodal noise. In addition, CTHFNet employs a cross-hierarchical multimodal fusion network that captures interactions between modalities at different hierarchies, complemented by a contrastive learning task. Extensive experiments on two widely used benchmark datasets, CMU-MOSI and CMU-MOSEI, demonstrate that our approach outperforms state-of-the-art methods on almost all metrics.
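The sketch below is not the released CTHFNet code; it is a minimal illustration of the two ideas the description names: a Seq2seq-style translator that maps a non-verbal modality into the text feature space and is trained with a contrastive (InfoNCE-style) loss, and a simple two-level fusion of text with the translated features. All module names, layer sizes, and loss choices here are assumptions for illustration only.

```python
# Minimal illustrative sketch (not the authors' released code) of
# contrastive modality translation plus hierarchical fusion.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityTranslator(nn.Module):
    """Translate a non-verbal feature sequence into the text feature space."""
    def __init__(self, src_dim: int, txt_dim: int, hidden: int = 128):
        super().__init__()
        self.encoder = nn.GRU(src_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(txt_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, txt_dim)

    def forward(self, src, txt):
        # Encode the source (audio/video) sequence; its final state
        # initialises a decoder that produces text-space features.
        _, h = self.encoder(src)
        out, _ = self.decoder(txt, h)
        return self.proj(out)                     # (B, T, txt_dim)


def info_nce(translated, text, temperature: float = 0.07):
    """Contrastive loss: pull each translated clip toward its own text clip."""
    z1 = F.normalize(translated.mean(dim=1), dim=-1)      # pooled translated features
    z2 = F.normalize(text.mean(dim=1), dim=-1)            # pooled text features
    logits = z1 @ z2.t() / temperature                     # (B, B) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)    # positives on the diagonal
    return F.cross_entropy(logits, labels)


class HierarchicalFusion(nn.Module):
    """Fuse text with translated audio/video features at two levels."""
    def __init__(self, dim: int):
        super().__init__()
        self.low = nn.Linear(3 * dim, dim)    # low-level fusion of all three streams
        self.high = nn.Linear(2 * dim, dim)   # high-level fusion with a text residual
        self.head = nn.Linear(dim, 1)         # sentiment intensity regression head

    def forward(self, text, audio2text, video2text):
        t, a, v = (x.mean(dim=1) for x in (text, audio2text, video2text))
        low = torch.relu(self.low(torch.cat([t, a, v], dim=-1)))
        high = torch.relu(self.high(torch.cat([low, t], dim=-1)))
        return self.head(high).squeeze(-1)    # predicted sentiment score per sample


if __name__ == "__main__":
    B, T, TXT, AUD, VID = 4, 20, 64, 33, 35   # toy shapes, not the paper's dimensions
    text = torch.randn(B, T, TXT)
    audio = torch.randn(B, T, AUD)
    video = torch.randn(B, T, VID)

    a2t = ModalityTranslator(AUD, TXT)
    v2t = ModalityTranslator(VID, TXT)
    fusion = HierarchicalFusion(TXT)

    audio_t, video_t = a2t(audio, text), v2t(video, text)
    loss = info_nce(audio_t, text) + info_nce(video_t, text)
    score = fusion(text, audio_t, video_t)
    print(loss.item(), score.shape)           # scalar contrastive loss, (B,) scores
```

In practice the contrastive translation loss would be combined with the sentiment prediction loss during training; the exact weighting and hierarchy design are specific to the paper and not reproduced here.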

Files (180.9 kB)

CTHFNet.zip
md5:849067570098f0c4beceacae0fd08c83
180.9 kB