Multi-TPC: A Multimodal Dataset for Three-Party Conversations with Speech, Motion, and Gaze
Authors/Creators
Description
The dataset comprises multiple synchronized modalities. Motion capture and gaze data are provided as plain-text (TXT) files indexed at the frame level, with joint rotations represented as Euler angles (in degrees) and gaze pitch and yaw angles (in degrees) derived from eye-tracker measurements. Audio recordings are distributed in uncompressed WAV format with a sampling rate of 44.1 kHz, and word-level transcripts are provided as plain-text CSV files containing tokenized words with aligned onset and offset times (in seconds). Prosodic features extracted from the audio signal are stored in a dedicated Prosody directory. In addition, an integrated annotation resource (AudioGazeGBack) combines conversational state labels, gaze information, and head gesture annotations into a unified representation.
Data are organized by modality at the top level, with separate directories for Mocap, Gaze, Audio, Word, Prosody, and AudioGazeGBack. Within each modality directory, files are grouped by recording date (e.g., 01-28-2022, 03-04-2022, 03-11-2022), corresponding to individual recording sessions. This date-based organization enables consistent alignment of all modalities collected during the same session. All frame-based modalities share a common temporal resolution of 60 Hz to support synchronized multimodal analysis.
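Because the frame-based modalities share a 60 Hz rate, a timestamp in seconds (for example, a word onset from the transcripts) can be mapped to a frame index by multiplying by the frame rate. The following is a minimal sketch of that conversion, not part of the dataset tooling. It assumes that frame 0 corresponds to time 0 s and that a timestamp belongs to the frame that contains it; neither convention is confirmed by the description, and the function names are hypothetical.

```python
import math

FRAME_RATE_HZ = 60  # shared rate of the frame-based modalities (from the description)


def time_to_frame(t_seconds, frame_rate=FRAME_RATE_HZ):
    """Index of the frame containing time t (assumes frame 0 starts at 0 s)."""
    # The small epsilon guards against floating-point values just below an integer.
    return math.floor(t_seconds * frame_rate + 1e-9)


def frame_to_time(frame, frame_rate=FRAME_RATE_HZ):
    """Start time, in seconds, of the given frame index."""
    return frame / frame_rate


def word_frame_span(onset_s, offset_s, frame_rate=FRAME_RATE_HZ):
    """Inclusive (first, last) frame indices overlapped by a word interval."""
    first = time_to_frame(onset_s, frame_rate)
    # A word ending exactly on a frame boundary does not occupy the next frame.
    last = max(first, math.ceil(offset_s * frame_rate - 1e-9) - 1)
    return first, last
```

With these conventions, a word spanning 0.5 s to 1.0 s covers frames 30 through 59. If the README specifies a different origin or rounding rule, that rule should take precedence.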
For most modalities, data are stored separately for each participant. The Prosody and AudioGazeGBack directories are exceptions, as they contain session-level files that integrate information across all participants. In the file naming conventions below, D denotes the recording date, s the session index, and n the participant index.
| Modality | File name pattern |
|---|---|
| Motion | Mocap/D/Session_s_PC_n_mocap_data.txt |
| Gaze | Gaze/D/Session_s_PC_n_EyeTracker_data_gapfilled.txt |
| Audio | Audio/D/Session_s_PC_n_audio.wav |
| Text | Word/D/Session_s_PC_n_words.csv |
| Prosody | Prosody/D_Session_s_prosody.csv |
| Annotated | AudioGazeGBack/D_Session_s_audio_gaze_gback.csv |
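The patterns in the table can be turned into paths programmatically. The sketch below is illustrative only: it assumes that D is substituted as the date string exactly as shown in the examples (e.g., 01-28-2022), that s and n are plain integers without zero padding, and that paths are relative to the root of the extracted archive. None of these details is confirmed here, and the helper names are hypothetical.

```python
from pathlib import PurePosixPath

# Per-participant pattern: <Modality>/<D>/Session_<s>_PC_<n>_<suffix>
PARTICIPANT_SUFFIX = {
    "Mocap": "mocap_data.txt",
    "Gaze": "EyeTracker_data_gapfilled.txt",
    "Audio": "audio.wav",
    "Word": "words.csv",
}

# Session-level pattern: <Modality>/<D>_Session_<s>_<suffix>
SESSION_SUFFIX = {
    "Prosody": "prosody.csv",
    "AudioGazeGBack": "audio_gaze_gback.csv",
}


def participant_file(modality, date, session, participant):
    """Relative path of a per-participant file (Mocap, Gaze, Audio, Word)."""
    name = f"Session_{session}_PC_{participant}_{PARTICIPANT_SUFFIX[modality]}"
    return PurePosixPath(modality) / date / name


def session_file(modality, date, session):
    """Relative path of a session-level file (Prosody, AudioGazeGBack)."""
    name = f"{date}_Session_{session}_{SESSION_SUFFIX[modality]}"
    return PurePosixPath(modality) / name
```

For example, `participant_file("Mocap", "01-28-2022", 1, 2)` yields `Mocap/01-28-2022/Session_1_PC_2_mocap_data.txt`. The actual index ranges for sessions and participants should be taken from the README files.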
The AudioGazeGBack directory contains processed, frame-level features that represent each moment of interaction across all participants. These features include speaking activity for each participant, gaze direction labels indicating whether a participant is looking toward the left or right listener relative to the current speaker, and gestural backchannel annotations capturing head nodding and shaking.
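Frame-level activity tracks such as these are often easier to analyze as time intervals. The sketch below collapses a per-frame speaking indicator into (start, end) segments in seconds. It assumes the indicator is stored as one 0/1 value per frame at 60 Hz, which is not confirmed here; the actual column names and encodings are given in the README files, and the input in the usage note is synthetic.

```python
def active_segments(activity, frame_rate=60):
    """Return (start_s, end_s) intervals where the per-frame value is non-zero.

    `activity` is any sequence of 0/1 values, one per frame. The end time is
    exclusive: it is the start time of the first inactive frame after a run.
    """
    segments = []
    start = None  # frame index where the current active run began
    for i, value in enumerate(activity):
        if value and start is None:
            start = i
        elif not value and start is not None:
            segments.append((start / frame_rate, i / frame_rate))
            start = None
    if start is not None:
        # The track ended while still active; close the run at the last frame.
        segments.append((start / frame_rate, len(activity) / frame_rate))
    return segments
```

For a synthetic track with 60 silent frames, 120 active frames, and 30 silent frames, the function returns a single segment from 1.0 s to 3.0 s. The same routine could be applied to any other binary frame-level annotation, such as a nod indicator, under the same encoding assumption.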
Detailed descriptions of column headings, abbreviations, units, and file formats are provided in accompanying README files. The full data processing and preprocessing pipeline used to generate these records is documented and made available through the project GitHub repository (https://github.com/MCMartinLee/Multi-TPC) to support transparency and reproducibility.
Files
| Name | Size |
|---|---|
| Dataset.zip (md5:e4493b9e97f2046550220672cef2e4ad) | 5.5 GB |
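The MD5 checksum listed for the archive can be used to confirm that a download completed without corruption. The following is a minimal sketch using Python's standard `hashlib` module; the local file name passed to `verify_download` is an assumption about where the archive was saved, and the helper names are hypothetical.

```python
import hashlib

# Checksum listed for Dataset.zip on this record.
EXPECTED_MD5 = "e4493b9e97f2046550220672cef2e4ad"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Chunked reads keep memory use low for a multi-gigabyte archive.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected
```

Calling `verify_download("Dataset.zip")` returns `True` only when the local file matches the published checksum; a `False` result suggests the download should be repeated.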
Additional details
Funding
- U.S. National Science Foundation
- CHS: Small: An Analysis-and-Synthesis Framework for Small Group Conversations (award 2005430)
Software
- Repository URL: https://github.com/MCMartinLee/Multi-TPC
- Programming language: Python, MATLAB, C++