Data collected for the paper "Ouvia: A User-centered Framework for Measuring Usability of Speech Translation in Real-World Communication Scenarios"
Description
This dataset contains the raw audio recordings, machine translation outputs, automatic quality metrics, and human survey responses collected during the Ouvia study. The study investigates how end users perceive the usability of speech-to-text machine translation across three English varieties translated into European Portuguese. Each entry pairs a sender (who records a spoken message), a receiver (who answers questions about the translation), and a validator (who assesses translation quality).
See our paper for details: arXiv.
DATASET DETAILS
Dataset Description
- Curated by: Giuseppe Attanasio
- User Study Funded by: European Association for Machine Translation
- Language(s): English, Portuguese (pt-PT)
- English Varieties: English as spoken by native US Black, native US White, and Hindi speakers
- License: CC BY 4.0
USES
Direct Use
The dataset can be used within one of the following broad scopes:
- Studying how end-user perceived usability of each translation outcome differs across demographic groups.
- Benchmarking English speech recognition or speech translation models on translation from different English varieties into European Portuguese.
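As an illustration of the first scope, a minimal sketch of a per-variety comparison could look as follows. It relies on the `language_variety` and `usability_score` fields documented in the data.jsonl field table of this card. The helper name `mean_usability_by_variety`, the handling of missing scores, and the example rows are assumptions made for this sketch and are not taken from the dataset.

```python
from collections import defaultdict
from statistics import mean


def mean_usability_by_variety(rows):
    """Average usability_score per language_variety.

    Assumption: rows are dicts shaped like data.jsonl entries; entries
    whose usability_score is missing or null are skipped.
    """
    groups = defaultdict(list)
    for row in rows:
        score = row.get("usability_score")
        if score is not None:
            groups[row["language_variety"]].append(score)
    return {variety: mean(scores) for variety, scores in groups.items()}


# Made-up rows for illustration only; these values are not from the dataset.
example_rows = [
    {"language_variety": "US Black", "usability_score": 4.0},
    {"language_variety": "US Black", "usability_score": 3.0},
    {"language_variety": "Hindi", "usability_score": 2.0},
]
by_variety = mean_usability_by_variety(example_rows)
```

A plain mean is only a starting point; any claim about group differences would need an appropriate statistical test on the real data.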
Out-of-Scope Use
We consider the following uses out of scope:
- Training of automatic speech systems or models.
- Attempts to deanonymize study participants.
- Cloning voices of the study participants.
DATASET STRUCTURE
The release contains two JSONL files: data.jsonl (study units — one row per entry) and conversation_data.jsonl (conversation and question metadata).
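Both files can be read with the Python standard library. The sketch below assumes the files are UTF-8 encoded with one JSON object per line; the helper name `read_jsonl` and the local file paths in the usage comment are illustrative.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Return a list with one dict per non-empty line of a JSONL file."""
    rows = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines
                rows.append(json.loads(line))
    return rows


# Usage, assuming both files sit in the current working directory:
# entries = read_jsonl("data.jsonl")
# questions = read_jsonl("conversation_data.jsonl")
```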
=== data.jsonl ===
The data fields are:
# | Field | Description
----|----------------------- |--------------------------------------------------
1 | entry_id | Unique identifier for each data entry within each language variety.
2 | conversation_id | Identifier linking this entry to a specific conversation.
3 | conversation_topic | Topic of the conversation (Health or Everyday).
4 | conversation_source | Source or provenance of the conversation text.
5 | conversation_text | The English conversation starter uttered and recorded by the sender.
6 | sender_unique_id | Anonymized unique identifier for the sender (Prolific participant).
7 | sender_gender | Self-reported gender of the sender (woman or man).
8 | receiver_unique_id | Anonymized unique identifier for the message receiver (Prolific participant).
9 | validator_unique_id | Anonymized unique identifier for the validator (Prolific participant).
10 | final_sender_unique_id | Anonymized unique identifier for the final sender in the conversation flow. It matches sender_unique_id for most entries, but when the original sender dropped out of the study it may correspond to a different sender who took their place.
11 | translation_text | The automatically translated conversation starter into Portuguese (pt-PT).
12 | translation_model | Speech translation model used to translate the conversation. One of DeSTA2, Phi 4, Voxtral, or Tower+.
13 | questions | List of questions shown to the receiver.
14 | receiver_responses | Responses provided by the receiver.
15 | validator_translation_score | Overall translation quality score assigned by the validator.
16 | validator_evaluations | Detailed ratings provided by the validator.
17 | validator_corrections | Number of responses assessed as incorrect by the validator.
18 | validator_question_total | Total number of questions the validator was asked to review.
19 | Unbabel--wmt23-cometkiwi-da-xl | Translation quality score from the Unbabel wmt23-cometkiwi-da-xl COMET model (continuous, unnormalized, higher is better).
20 | Unbabel--wmt22-cometkiwi-da | Translation quality score from the Unbabel wmt22-cometkiwi-da COMET model (continuous, unnormalized, higher is better).
21 | Unbabel--wmt22-comet-da | Translation quality score from the Unbabel wmt22-comet-da COMET model (continuous, unnormalized, higher is better).
22 | Unbabel--XCOMET-XL | Translation quality score from the Unbabel XCOMET-XL model (continuous, unnormalized, higher is better).
23 | google--metricx-24-hybrid-xl-v2p6 | MetricX-24 translation quality score, normalized to [0, 1] range (higher is better). Original scores divided by 25 and inverted.
24 | baseline_satisfaction_score | Baseline survey response (1-5 Likert) for satisfaction (before seeing the validator assessment).
25 | baseline_trust_score | Baseline survey response (1-5 Likert) for trust (before seeing the validator assessment).
26 | baseline_reliance_score | Baseline survey response (1-5 Likert) for reliance (before seeing the validator assessment).
27 | satisfaction_score | Survey response (1-5 Likert) for satisfaction after seeing the validator assessment.
28 | trust_score | Survey response (1-5 Likert) for trust in the translation after seeing the validator assessment.
29 | reliance_score | Survey response (1-5 Likert) for reliance on the translation after seeing the validator assessment.
30 | language_variety | Language variety of the data (US Black, US White, or Hindi).
31 | usability_score | Average usability score — mean of satisfaction_score, trust_score, and reliance_score.
32 | baseline_usability_score | Average baseline usability score — mean of the three baseline scores.
33 | qa_score | QA score (0-1, higher is better). Computed as 1 - (validator_corrections / validator_question_total).
34 | audio | Relative path to the audio file for this entry, in the format ./wav/<variety>/entry_<entry_id>.wav.
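The derived fields in the table above can be recomputed from their inputs as a sanity check. The sketch below follows the descriptions of usability_score, baseline_usability_score, and qa_score. The function names are hypothetical, the example row is made up, and the MetricX formula is an assumption: it reads "divided by 25 and inverted" as 1 - raw / 25, which the card does not state explicitly.

```python
from statistics import mean


def usability(row, prefix=""):
    """Mean of the satisfaction, trust, and reliance scores.

    Pass prefix="baseline_" to average the three baseline scores instead.
    """
    names = ("satisfaction", "trust", "reliance")
    return mean(row[f"{prefix}{name}_score"] for name in names)


def qa_score(row):
    """1 - (validator_corrections / validator_question_total)."""
    total = row["validator_question_total"]
    if not total:
        # Assumption: no reviewed questions means no defined score.
        return None
    return 1 - row["validator_corrections"] / total


def normalize_metricx(raw_score):
    """Map a raw MetricX score to [0, 1], higher is better.

    Assumption: raw scores lie in [0, 25] with lower being better, and
    "inverted" means subtracting the scaled score from 1.
    """
    return 1 - raw_score / 25


# Made-up row for illustration only; these values are not from the dataset.
row = {
    "satisfaction_score": 4, "trust_score": 3, "reliance_score": 5,
    "baseline_satisfaction_score": 3, "baseline_trust_score": 3,
    "baseline_reliance_score": 3,
    "validator_corrections": 1, "validator_question_total": 4,
}
recomputed = (usability(row), usability(row, "baseline_"), qa_score(row))
```

Comparing such recomputed values with the stored columns, within a small floating-point tolerance, is one way to verify a download.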
=== conversation_data.jsonl ===
This file contains the conversation starters and associated questions. Each row is one question within a conversation. The data fields are:
# | Field | Description
----|---------------|--------------------------------------------------
1 | conversation_id | Identifier linking to a specific conversation. Joins with data.jsonl on this field.
2 | conversation_topic | Topic of the conversation (Health or Everyday).
3 | conversation_source | Source or provenance of the conversation text (e.g., MED-MT).
4 | conversation_text | The English conversation starter text that the sender recorded.
5 | question_id | Unique identifier for each question within a conversation.
6 | question_text | The English question generated automatically.
7 | translated_question | The machine-translated question text into Portuguese (pt-PT) shown to the receiver.
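Because each row of conversation_data.jsonl is a single question, joining it with data.jsonl on conversation_id is a one-to-many lookup. The sketch below groups question rows by conversation and attaches them to each entry. The function names and the `conversation_questions` key are hypothetical and introduced only for this example, as are the example rows.

```python
from collections import defaultdict


def questions_by_conversation(question_rows):
    """Index conversation_data.jsonl rows by conversation_id."""
    index = defaultdict(list)
    for question in question_rows:
        index[question["conversation_id"]].append(question)
    return index


def attach_questions(entries, question_rows):
    """Return copies of data.jsonl entries with their question rows attached.

    The added key "conversation_questions" is not part of the dataset.
    """
    index = questions_by_conversation(question_rows)
    return [
        {**entry, "conversation_questions": index.get(entry["conversation_id"], [])}
        for entry in entries
    ]


# Made-up rows for illustration only; these values are not from the dataset.
entries = [{"entry_id": 1, "conversation_id": "c1"},
           {"entry_id": 2, "conversation_id": "c2"}]
questions = [{"conversation_id": "c1", "question_id": 1},
             {"conversation_id": "c1", "question_id": 2}]
joined = attach_questions(entries, questions)
```

Entries whose conversation has no question rows receive an empty list rather than being dropped, so the number of entries is preserved.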
DATASET CREATION
Curation Rationale
We release all data collected during our study for reproducibility and to facilitate future research on measuring the usability of speech translation systems in real-world situations.
Source Data
Data Collection and Processing
We conducted an online study through a custom web platform. See our paper linked above for details.
Who are the source data producers?
All data was collected from compensated crowdworkers recruited via Prolific (https://www.prolific.com/).
Personal and Sensitive Information
All personally identifiable information was removed or anonymized before release. Prolific participant IDs have been replaced with random hexadecimal strings. Audio recordings are identified only by entry ID and language variety. No demographic data beyond self-reported gender (restricted to woman/man) and language variety is included.
BIAS, RISKS, AND LIMITATIONS
To address the ethical risks associated with collecting personal data, including sociodemographic attributes and recorded voice samples, we have implemented comprehensive risk management measures aligned with data protection principles and research ethics standards.
Addressing the Risk of Personal Identification
While voice recordings and demographic metadata are made publicly available for research purposes, no other information is disclosed. This choice aligns with the principles of data minimization and anonymization while maintaining utility for fairness research in AI systems.
Each participant is identified in the dataset by one or more anonymized hexadecimal strings (see sender_unique_id, receiver_unique_id, validator_unique_id, and final_sender_unique_id). These are randomly generated and bear no relation to the original Prolific participant IDs. As a result, it is not possible to uniquely identify a real individual from their anonymized identifier.
Addressing the Risk of Voice Cloning
We informed all study participants enrolling as "Sender" through the Informed Consent document that there is a risk their voice may be cloned using automatic tools. However, our study design minimizes this risk by collecting only 10 short segments per participant. Segments are 30 seconds or less and recorded in a neutral tone, which makes faithful and expressive voice cloning harder to achieve. As with personal identification, we will require any user of the dataset to explicitly agree not to use it to clone individual voices.
DATASET CARD CONTACT
Giuseppe Attanasio: gattanasio.work@gmail.com
Software
- Repository URL: https://github.com/g8a9/ouvia