A multi-speaker dataset of real-time two-dimensional speech magnetic resonance images with articulator ground-truth segmentations

Ruthven, Matthieu; Peplinski, Agnieszka; Miquel, Marc

doi:10.5281/zenodo.7595164

Published February 1, 2023 | Version 1.0

Dataset Open

1. Clinical Physics, Barts Health NHS Trust, West Smithfield, London EC1A 7BE, United Kingdom. School of Biomedical Engineering & Imaging Sciences, King's College London, King's Health Partners, St Thomas' Hospital, London SE1 7EH, United Kingdom.
2. Clinical Physics, Barts Health NHS Trust, West Smithfield, London EC1A 7BE, United Kingdom.
3. Clinical Physics, Barts Health NHS Trust, West Smithfield, London EC1A 7BE, United Kingdom. Centre for Advanced Cardiovascular Imaging, NIHR Barts Biomedical Research Centre, William Harvey Institute, Queen Mary University of London, London EC1M 6BQ, United Kingdom. Digital Environment Research Institute (DERI), Empire House, 67-75 New Road, Queen Mary University of London, London E1 1HH, United Kingdom.

Summary

This dataset consists of real-time magnetic resonance images of speech and corresponding ground-truth (GT) segmentations and velopharyngeal closure labels.

Images

The images are of five healthy adult volunteers (two females, three males; age range 24-28 years) counting once from 1 to 10 in English. Each volunteer was imaged in a supine position using a 3.0 T TX Achieva magnetic resonance imaging (MRI) scanner and a 16-channel neurovascular coil (both Philips Healthcare, Best, Netherlands). Images of a 10 mm thick midsagittal slice of the head were acquired using a steady-state free precession (SSFP) pulse sequence based on the sequence identified by [1] as being optimal for vocal tract image quality. The acquired matrix size and in-plane pixel size were 120×93 and 2.50×2.45 mm² respectively. However, k-space data were zero padded to a matrix size of 256×256 by the scanner before being reconstructed, resulting in a reconstructed in-plane pixel size of 1.17×1.17 mm². Images were acquired at a temporal resolution of 0.1 s and one image series was acquired per volunteer. The volunteers were instructed to perform the speech task at a rate which they considered to be normal. Some performed the task faster than others and consequently the series do not all contain the same number of images. The five series contain 105, 71, 71, 78 and 67 images respectively (392 images in total).
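
As a quick consistency check on the figures above, the following sketch derives the duration of each series from its image count and the 0.1 s temporal resolution, and the reconstructed pixel size from the acquired field of view. It assumes that zero padding to 256 samples left the 300 mm field of view along the 120-sample direction unchanged; the variable names are illustrative only.

```python
# Image counts per series, as listed in the description above.
series_lengths = [105, 71, 71, 78, 67]
temporal_resolution_s = 0.1  # one image every 0.1 s (10 images per second)

total_images = sum(series_lengths)
durations_s = [round(n * temporal_resolution_s, 1) for n in series_lengths]

# Field of view along the 120-sample direction: 120 x 2.50 mm = 300 mm.
fov_mm = 120 * 2.50
# Assumption: zero padding to 256 samples leaves this field of view unchanged.
reconstructed_pixel_mm = fov_mm / 256

print(total_images)                      # total number of images
print(durations_s)                       # duration of each series in seconds
print(round(reconstructed_pixel_mm, 2))  # reconstructed pixel size in mm
```

Under these assumptions the series last between roughly 7 and 11 seconds each.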

Velopharyngeal closure labels

Each image was visually inspected and labelled as either showing contact between the soft palate and posterior pharyngeal wall or not showing contact. A label of 1 indicates contact, while a label of 0 indicates no contact. To reduce the subjectivity of the labels, each image was independently labelled by four MRI Physicists with four, ten, two and one years of speech MRI experience, and the majority label was chosen as the GT label.
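
The majority-vote step can be sketched as follows. This is a minimal illustration rather than the authors' procedure: the function name is hypothetical, the example labels are invented, and because the description does not say how a two-against-two split between four raters was resolved, the sketch simply flags ties.

```python
from collections import Counter


def majority_label(rater_labels):
    """Return the majority label (0 or 1), or None when the vote is tied."""
    counts = Counter(rater_labels)
    ones, zeros = counts[1], counts[0]
    if ones == zeros:
        return None  # tie: the resolution rule is not described in the source
    return 1 if ones > zeros else 0


# Invented labels from four raters for three images (not dataset values).
example_votes = [[1, 1, 1, 0], [0, 0, 0, 0], [1, 1, 0, 0]]
gt_labels = [majority_label(votes) for votes in example_votes]
print(gt_labels)
```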

Ground-truth segmentations

GT segmentations were created by manually labelling pixels in each of the images. The segmentations consisted of six classes, each made up of one or more anatomical features. There was no overlap between classes: a pixel could not belong to more than one class. For conciseness, the classes were named as follows: head, soft palate, jaw, tongue, vocal tract and tooth space. However, the names of the head, jaw and tongue classes are simplifications. The head class consisted of all anatomical features superior to or posterior to the vocal tract. It therefore included the upper lip, hard palate, brain, skull, posterior pharyngeal wall and neck. The jaw class consisted of the lower lip, the soft tissue anterior to and inferior to the mandible and the soft tissue inferior to the tongue. The tongue class included the epiglottis and the hyoid bone. Pixels not labelled as belonging to one of the classes were considered to belong to the background. GT segmentations were created by the MRI Physicist with four years of speech MRI experience.

Dataset structure

Images are contained in the MRI_SSFP_10fps folder. Within this folder, each subfolder contains the images of a different volunteer. Each image is saved as a separate DICOM file with name image_N.dcm.
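
A minimal sketch of locating and reading one image is given below. It assumes the third-party pydicom package is installed and that image numbering starts at 1; the root folder "data" and the subfolder name "volunteer_1" are hypothetical, since the description does not give the subfolder names.

```python
from pathlib import Path


def image_path(root, volunteer_folder, n):
    """Build the path to image N of one volunteer (N is assumed to start at 1)."""
    return Path(root) / "MRI_SSFP_10fps" / volunteer_folder / f"image_{n}.dcm"


def load_image(path):
    """Read one DICOM file and return its pixel data as a NumPy array."""
    import pydicom  # third-party; imported here so the path helper works without it
    return pydicom.dcmread(path).pixel_array


# "data" and "volunteer_1" are hypothetical names, not taken from the dataset.
example_path = image_path("data", "volunteer_1", 1)
print(example_path.as_posix())
```

Calling load_image(example_path) on an extracted copy of the dataset should return one two-dimensional array per file.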

Velopharyngeal closure labels are saved in velopharyngeal_closure.xlsx. The labels of each volunteer are saved in a separate sheet. The spreadsheet row corresponds to the image number (i.e. the label in row 1 is the label for image 1).
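
The row-to-image correspondence can be sketched as follows, assuming the third-party pandas and openpyxl packages are installed. The sketch further assumes that the sheets have no header row and that the labels sit in the first column, neither of which is stated in the description; the function names are hypothetical.

```python
def load_closure_labels(spreadsheet_path):
    """Return {sheet_name: [label, ...]} with one list of labels per volunteer."""
    import pandas as pd  # third-party; reading Excel workbooks also needs openpyxl
    # sheet_name=None loads every sheet; header=None assumes no header row.
    sheets = pd.read_excel(spreadsheet_path, sheet_name=None, header=None)
    # Assumption: the labels are stored in the first column of each sheet.
    return {name: df.iloc[:, 0].astype(int).tolist() for name, df in sheets.items()}


def label_for_image(labels, image_number):
    """Look up the label of image N, where spreadsheet row 1 holds image 1."""
    return labels[image_number - 1]


# Invented labels for a four-image series (not dataset values).
print(label_for_image([0, 0, 1, 1], 3))
```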

Ground-truth segmentations are contained in the GT_Segmentations folder. Within this folder, each subfolder contains the GT segmentations of a different volunteer. Each GT segmentation is saved as a separate MAT file with name mask_N.mat. In each MAT file, pixel values correspond to classes as follows:

0 = background
1 = head
2 = soft palate
3 = jaw
4 = tongue
5 = vocal tract
6 = tooth space
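
The value-to-class mapping above can be used as in the sketch below, which counts the pixels of each class in a mask. Loading assumes the third-party SciPy package and a MAT file in a format that scipy.io.loadmat can read; because the name of the variable stored in each MAT file is not given in the description, the hypothetical load_mask helper simply takes the first non-metadata variable. The toy mask is invented.

```python
from collections import Counter

# Pixel value to class name, as listed above.
CLASS_NAMES = {
    0: "background", 1: "head", 2: "soft palate", 3: "jaw",
    4: "tongue", 5: "vocal tract", 6: "tooth space",
}


def load_mask(mat_path):
    """Return the first non-metadata variable stored in a MAT file."""
    from scipy.io import loadmat  # third-party; does not read MAT v7.3 files
    contents = loadmat(mat_path)
    # Keys wrapped in double underscores are metadata added by loadmat.
    arrays = [value for key, value in contents.items() if not key.startswith("__")]
    return arrays[0]


def pixels_per_class(mask_rows):
    """Count the pixels of each class in a 2D mask given as rows of integers."""
    counts = Counter(int(value) for row in mask_rows for value in row)
    return {CLASS_NAMES[label]: n for label, n in sorted(counts.items())}


# Invented 3x3 mask for illustration (not dataset values).
toy_mask = [[0, 1, 1], [5, 4, 4], [5, 3, 0]]
print(pixels_per_class(toy_mask))
```

Because pixels_per_class only iterates over rows and values, it accepts both nested lists and two-dimensional NumPy arrays.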

References

[1] A.D. Scott, R. Boubertakh, M.J. Birch, M.E. Miquel, Towards clinical assessment of velopharyngeal closure using MRI: evaluation of real-time MRI sequences at 1.5 and 3 T, Br. J. Radiol. 85 (2012) e1083–e1092. https://doi.org/10.1259/bjr/32938996.

Notes

Funded by NIHR Grant Reference Number: ICA-CDRF-2018-04-ST2-032 and Barts Health Charity Grant Reference Number: MGU0600

Files

data.zip (21.6 MB), md5:5b401468771adecb51060fb51442d5a7
