Data collected for the paper "Ouvia: A User-centered Framework for Measuring Usability of Speech Translation in Real-World Communication Scenarios"

Attanasio, Giuseppe; Savoldi, Beatrice; Chechelnitsky, Daniel; Negri, Matteo; Carpuat, Marine; Sap, Maarten; Torres Martins, Andre Filipe

doi:10.5281/zenodo.20544289

Published June 4, 2026 | Version 1.0

Dataset Restricted

1. Instituto de Telecomunicações
2. Fondazione Bruno Kessler
3. Carnegie Mellon University
4. University System of Maryland

This dataset contains the raw audio recordings, machine translation outputs, automatic quality metrics, and human survey responses collected during the Ouvia study. The study investigates how end users perceive the usability of speech-to-text machine translation across three English varieties translated into European Portuguese. Each entry pairs a sender (who records a spoken message), a receiver (who answers questions about the translation), and a validator (who assesses translation quality).

See our paper for details: arXiv.

DATASET DETAILS

Dataset Description

Curated by: Giuseppe Attanasio
User Study Funded by: European Association for Machine Translation
Language(s): English, Portuguese (pt-PT)
English Varieties: English spoken by native US Black, native US White, and Hindi speakers
License: CC BY 4.0

USES

Direct Use

The dataset can be used within one of the following broad scopes:

- Studying how end-user perceived usability of each translation outcome differs across demographic groups.
- Benchmarking English speech recognition or speech translation models on translating from different English varieties into European Portuguese.
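
As an illustration of the first scope, the sketch below averages a score per group. It assumes the rows of data.jsonl have already been parsed into dictionaries and uses the language_variety and usability_score fields described under DATASET STRUCTURE. The helper name mean_by_group is hypothetical, and the rows in the usage comment are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_by_group(rows, group_field="language_variety", value_field="usability_score"):
    """Average value_field over rows that share the same group_field value.

    Rows whose value is missing (None or absent) are skipped.
    """
    groups = defaultdict(list)
    for row in rows:
        value = row.get(value_field)
        if value is not None:
            groups[row[group_field]].append(value)
    return {group: mean(values) for group, values in groups.items()}


# Usage with invented rows (not real study data):
# mean_by_group([
#     {"language_variety": "Hindi", "usability_score": 4.0},
#     {"language_variety": "Hindi", "usability_score": 2.0},
# ])
```

A plain mean is only a starting point; any comparison across groups would still need an appropriate statistical test.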

Out-of-Scope Use

We consider the following uses out of scope:

- Training of automatic speech systems or models.
- Attempts to deanonymize study participants.
- Cloning voices of the study participants.

DATASET STRUCTURE

The release contains two JSONL files: data.jsonl (study units, one row per entry) and conversation_data.jsonl (conversation and question metadata, one row per question).
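
A minimal sketch for reading either file with the Python standard library is shown here. It assumes the files sit in the current working directory after access has been granted; the helper name read_jsonl is hypothetical and not part of the release.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    rows = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


if __name__ == "__main__":
    # File locations are an assumption: adjust to where the release was extracted.
    for name in ("data.jsonl", "conversation_data.jsonl"):
        if Path(name).exists():
            print(name, len(read_jsonl(name)), "rows")
```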

=== data.jsonl ===

The data fields are:

# | Field | Description
----|----------------------- |--------------------------------------------------
1 | entry_id | Unique identifier for each data entry within each language variety.
2 | conversation_id | Identifier linking this entry to a specific conversation.
3 | conversation_topic | Topic of the conversation (Health or Everyday).
4 | conversation_source | Source or provenance of the conversation text.
5 | conversation_text | The English conversation starter uttered and recorded by the sender.
6 | sender_unique_id | Anonymized unique identifier for the sender (Prolific participant).
7 | sender_gender | Self-reported gender of the sender (woman or man).
8 | receiver_unique_id | Anonymized unique identifier for the message receiver (Prolific participant).
9 | validator_unique_id | Anonymized unique identifier for the validator (Prolific participant).
10 | final_sender_unique_id | Anonymized unique identifier for the final sender in the conversation flow. It matches sender_unique_id for most entries, but when the original sender dropped out of the study it may correspond to a different sender who took their place.
11 | translation_text | The automatically translated conversation starter into Portuguese (pt-PT).
12 | translation_model | Speech translation model used to translate the conversation. One of DeSTA2, Phi 4, Voxtral, or Tower+.
13 | questions | List of questions shown to the receiver.
14 | receiver_responses | Responses provided by the receiver.
15 | validator_translation_score | Overall translation quality score assigned by the validator.
16 | validator_evaluations | Detailed ratings provided by the validator.
17 | validator_corrections | Number of responses assessed as incorrect by the validator.
18 | validator_question_total | Total number of questions the validator was asked to review.
19 | Unbabel--wmt23-cometkiwi-da-xl | Translation quality score from the Unbabel wmt23-cometkiwi-da-xl COMET model (continuous, unnormalized, higher is better).
20 | Unbabel--wmt22-cometkiwi-da | Translation quality score from the Unbabel wmt22-cometkiwi-da COMET model (continuous, unnormalized, higher is better).
21 | Unbabel--wmt22-comet-da | Translation quality score from the Unbabel wmt22-comet-da COMET model (continuous, unnormalized, higher is better).
22 | Unbabel--XCOMET-XL | Translation quality score from the Unbabel XCOMET-XL model (continuous, unnormalized, higher is better).
23 | google--metricx-24-hybrid-xl-v2p6 | MetricX-24 translation quality score, normalized to [0, 1] range (higher is better). Original scores divided by 25 and inverted.
24 | baseline_satisfaction_score | Baseline survey response (1-5 Likert) for satisfaction (before seeing the validator assessment).
25 | baseline_trust_score | Baseline survey response (1-5 Likert) for trust (before seeing the validator assessment).
26 | baseline_reliance_score | Baseline survey response (1-5 Likert) for reliance (before seeing the validator assessment).
27 | satisfaction_score | Survey response (1-5 Likert) for satisfaction after seeing the validator assessment.
28 | trust_score | Survey response (1-5 Likert) for trust in the translation after seeing the validator assessment.
29 | reliance_score | Survey response (1-5 Likert) for reliance on the translation after seeing the validator assessment.
30 | language_variety | Language variety of the data (US Black, US White, or Hindi).
31 | usability_score | Average usability score: mean of satisfaction_score, trust_score, and reliance_score.
32 | baseline_usability_score | Average baseline usability score: mean of baseline_satisfaction_score, baseline_trust_score, and baseline_reliance_score.
33 | qa_score | QA score (0-1, higher is better). Computed as 1 - (validator_corrections / validator_question_total).
34 | audio | Relative path to the audio file for this entry, in the format ./wav/<variety>/entry_<entry_id>.wav.
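
The derived fields in the table (usability_score, baseline_usability_score, qa_score) can be recomputed from the raw fields as a sanity check. The sketch below follows the formulas stated in the descriptions. The MetricX helper reads "divided by 25 and inverted" as 1 - raw / 25, which is an assumption about the exact normalization; all function names are hypothetical.

```python
def qa_score(validator_corrections, validator_question_total):
    """1 - corrections / total, as stated for the qa_score field."""
    if not validator_question_total:
        return None  # no reviewed questions, so the score is undefined
    return 1 - validator_corrections / validator_question_total


def mean_of_three(a, b, c):
    """Plain arithmetic mean of three Likert responses."""
    return (a + b + c) / 3


def normalize_metricx(raw_score, max_score=25.0):
    """Assumed reading of 'divided by 25 and inverted': 1 - raw / 25."""
    return 1 - raw_score / max_score


def recompute_derived(row):
    """Recompute the derived scores of one data.jsonl row (a dict)."""
    return {
        "qa_score": qa_score(
            row["validator_corrections"], row["validator_question_total"]
        ),
        "usability_score": mean_of_three(
            row["satisfaction_score"], row["trust_score"], row["reliance_score"]
        ),
        "baseline_usability_score": mean_of_three(
            row["baseline_satisfaction_score"],
            row["baseline_trust_score"],
            row["baseline_reliance_score"],
        ),
    }
```

Comparing these values against the stored fields with a small tolerance is a quick way to confirm that a local copy was parsed correctly.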

=== conversation_data.jsonl ===

This file contains the conversation starters and associated questions. Each row is one question within a conversation. The data fields are:

# | Field | Description
----|---------------|--------------------------------------------------
1 | conversation_id | Identifier linking to a specific conversation. Joins with data.jsonl on this field.
2 | conversation_topic | Topic of the conversation (Health or Everyday).
3 | conversation_source | Source or provenance of the conversation text (e.g., MED-MT).
4 | conversation_text | The English conversation starter text that the sender recorded.
5 | question_id | Unique identifier for each question within a conversation.
6 | question_text | The English question generated automatically.
7 | translated_question | The machine-translated question text into Portuguese (pt-PT) shown to the receiver.
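
Because conversation_data.jsonl has one row per question while data.jsonl has one row per entry, joining on conversation_id is a one-to-many lookup. A minimal sketch, assuming both files are already parsed into lists of dictionaries; the function names and the conversation_questions key are hypothetical and not part of the release.

```python
from collections import defaultdict


def group_questions(conversation_rows):
    """Index question rows by conversation_id, preserving file order."""
    by_conversation = defaultdict(list)
    for row in conversation_rows:
        by_conversation[row["conversation_id"]].append(row)
    return by_conversation


def attach_questions(entries, conversation_rows):
    """Return copies of the entries, each with its conversation's question rows.

    The added key "conversation_questions" is a hypothetical name; entries
    whose conversation_id has no match get an empty list.
    """
    by_conversation = group_questions(conversation_rows)
    return [
        {
            **entry,
            "conversation_questions": by_conversation.get(
                entry["conversation_id"], []
            ),
        }
        for entry in entries
    ]
```

Copying each entry rather than mutating it keeps the original rows untouched, which helps when the same list is reused for other analyses.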

DATASET CREATION

Curation Rationale

We release all data collected during our study for reproducibility and to facilitate future research on measuring the usability of speech translation systems in real-world situations.

Source Data

Data Collection and Processing

We conducted an online study through a custom web platform. See our paper linked above for details.

Who are the source data producers?

All data was collected through compensated crowdworkers who were recruited via Prolific (https://www.prolific.com/).

Personal and Sensitive Information

All personally identifiable information was removed or anonymized before release. Prolific participant IDs were replaced with random hexadecimal strings. Audio recordings are identified only by entry ID and language variety. No demographic data beyond self-reported gender (restricted to woman/man) and language variety is included.

BIAS, RISKS, AND LIMITATIONS

To address the ethical risks associated with collecting personal data, including sociodemographic attributes and recorded voice samples, we have implemented comprehensive risk management measures aligned with data protection principles and research ethics standards.

Addressing the Risk of Personal Identification

While voice recordings and demographic metadata are made publicly available for research purposes, no other information is disclosed. This choice aligns with the principles of data minimization and anonymization while maintaining utility for fairness research in AI systems.

Each participant is identified in the dataset by one or more anonymized hexadecimal strings (see sender_unique_id, receiver_unique_id, validator_unique_id, and final_sender_unique_id). These are randomly generated and bear no relation to the original Prolific participant IDs. As a result, it is not possible to uniquely identify a real individual from their anonymized identifier.

Addressing the Risk of Voice Cloning

We informed all study participants enrolling as "Sender" through the Informed Consent document that there is a risk their voice may be cloned using automatic tools. However, our study design minimizes this risk by collecting only 10 short segments per participant. Segments last 30 seconds or less and are recorded in a neutral tone, which makes faithful and expressive voice cloning more challenging to achieve. As with personal identification, we require every user of the dataset to explicitly agree not to use it to clone individual voices.

DATASET CARD CONTACT

Giuseppe Attanasio: gattanasio.work@gmail.com

Files

Restricted

The record is publicly accessible, but files are restricted.

Request access

This dataset contains voice recordings of real individuals who consented to share their data for research purposes. By downloading or using this dataset, you agree to the following conditions:

- You will not attempt to identify, re-identify, or deanonymize any individual speaker in the dataset, whether through manual inspection, automated techniques, or by cross-referencing with other data sources.
- You will not use the voice recordings to clone, synthesize, or impersonate any individual's voice.
- You will strictly follow the CC BY 4.0 license and its requirements regarding data management, use, and redistribution.

Violation of these terms constitutes a breach of the participants' informed consent and may carry legal consequences under applicable data protection laws.

Additional details

Repository URL: https://github.com/g8a9/ouvia
