Published October 24, 2024 | Version v1
Journal Restricted

ParkCeleb: A Novel Longitudinal Corpus for Evaluating Speech Patterns in People with Parkinson's Disease During the Prodromal Period

Description

ParkCeleb is a unique longitudinal corpus of speech samples collected from 40 celebrities diagnosed with Parkinson's Disease (PD) and 40 control subjects (CNs). The PD and CN groups are matched on age, sex, ethnicity, and nationality to ensure comparability. This repository does not contain the actual audio recordings; it provides metadata files with links to YouTube videos, speaker information, and transcriptions. The accompanying GitHub repository (https://github.com/Annafavaro/PARKCELEB) can be used to download the audio files and to segment the recordings for each speaker based on their timestamps.

The paper associated with the corpus is entitled "Unveiling early signs of Parkinson’s disease via a longitudinal analysis of celebrity speech recordings". If you plan to use ParkCeleb in your research, please cite it as follows: Favaro, A., Butala, A., Thebaud, T. et al. Unveiling early signs of Parkinson’s disease via a longitudinal analysis of celebrity speech recordings. npj Parkinsons Dis. 10, 207 (2024). https://doi.org/10.1038/s41531-024-00817-9

Key Features

  • Diverse Acoustic Environments: the recordings contain speech captured in a variety of real-world settings, such as studio interviews, press conferences, red-carpet events, and public speeches.
  • Publicly Available Data: all speech samples were extracted from publicly available videos uploaded to YouTube.

Directory Structure

Root directory

The root directory (ParkCeleb) contains two main subdirectories:

  • PD: Speech data of celebrities diagnosed with Parkinson’s disease.
  • CN: Speech data of control subjects.

Inside the root directory, you will also find the following key files:

  1. speakers_pairs.xlsx: lists the gender- and age-matched speaker pairs (PD-CN) used in the classification and longitudinal analyses.
  2. PD_demo.xlsx: a metadata file with demographic information for the PD group.
  3. CN_demo.xlsx: a metadata file with demographic information for the CN group.

Speaker Folders

Each speaker is assigned an anonymized folder labeled by their ID, e.g., cn_xx for control subjects or pd_xx for PD subjects, where xx is a number ranging from 01 to 40 (the CN group additionally includes subjects numbered 41 to 60, which served as extra controls). Inside each speaker’s folder, you will find a metadata.csv file containing the YouTube video links needed to download the corresponding recordings.
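
As an illustration of the layout described above, the following minimal sketch walks the PD and CN subdirectories and collects the expected location of each speaker's metadata.csv. The function name list_speakers and the folder-name pattern are assumptions made for this example; they are not taken from the ParkCeleb GitHub repository, and the columns of metadata.csv are deliberately not read because they are not documented here.

```python
import re
from pathlib import Path

# Assumed folder-name pattern: "cn_01", "pd_17", ... (prefix, underscore, digits).
SPEAKER_DIR = re.compile(r"^(cn|pd)_(\d+)$")


def list_speakers(root):
    """Return (group, number, metadata_path) for each speaker folder.

    `root` is the ParkCeleb root directory containing the PD and CN
    subdirectories. Folders that do not match the assumed naming
    pattern are skipped. The metadata path is only constructed, not
    checked for existence.
    """
    speakers = []
    for group_dir in ("PD", "CN"):
        base = Path(root) / group_dir
        if not base.is_dir():
            continue
        for folder in sorted(base.iterdir()):
            match = SPEAKER_DIR.match(folder.name)
            if folder.is_dir() and match:
                speakers.append(
                    (match.group(1), int(match.group(2)), folder / "metadata.csv")
                )
    return speakers
```

Calling list_speakers("ParkCeleb") on a local copy of the metadata would return one tuple per speaker folder, PD speakers first, which can then be passed to whatever download routine is used.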

Each speaker folder also includes subfolders named after the YouTube video ID. Inside these video-specific subfolders, you will find:

  1. Transcription File (.json): contains word-by-word transcriptions with a timestamp for each word.
  2. Speaker Timestamps File (.csv): contains speaker labels and timestamps indicating when each speaker is active during the recording.
  3. Target Speaker Annotation (speakers_info.csv):
    1. Column status: identifies the target speaker in the video (the PD or CN subject).
    2. Column years_from_diagnosis: specifies how many years before or after diagnosis the recording was made. For control subjects (CNs), the timeline is matched to the diagnosis year of the corresponding PD subject for comparative purposes. The column is left empty for CN subjects 41 to 60, as these 20 subjects served as additional controls in auxiliary experiments (see the associated paper).
    3. Column intersection: the amount of time for which the diarization segment and the transcript segment (or word) overlap.
    4. Column union: the total time covered by the diarization segment and the transcript segment (or word).
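
To make the intersection and union columns concrete, the sketch below computes both quantities for a single diarization segment and a single transcript word, each given as a start and end time. This is an illustrative reading of the column descriptions above, not the code used to build the corpus; the function name overlap_and_union and the assumption that times are expressed in seconds are hypothetical.

```python
def overlap_and_union(seg_start, seg_end, word_start, word_end):
    """Return (intersection, union) for two time intervals.

    intersection: duration shared by the diarization segment and the word
                  (0.0 if they do not overlap).
    union:        total duration covered by either interval, i.e. the sum
                  of both durations minus the shared part.
    All values are in the same unit as the inputs (assumed: seconds).
    """
    intersection = max(0.0, min(seg_end, word_end) - max(seg_start, word_start))
    union = (seg_end - seg_start) + (word_end - word_start) - intersection
    return intersection, union


# A segment from 1.0 s to 3.0 s and a word from 2.5 s to 3.5 s
# share 0.5 s and together cover 2.5 s.
print(overlap_and_union(1.0, 3.0, 2.5, 3.5))  # (0.5, 2.5)
```

Dividing intersection by union gives a score between 0 and 1 that could serve as a simple measure of how well a word aligns with a speaker segment; whether the corpus authors apply such a ratio is not stated in this record.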

 

Abstract (English)

Numerous studies proposed methods to detect Parkinson’s disease (PD) via speech analysis. However, existing corpora often lack prodromal recordings, have small sample sizes, and lack longitudinal data. Speech samples from celebrities who publicly disclosed their PD diagnosis provide longitudinal data, allowing the creation of a new corpus, ParkCeleb. We collected videos from 40 subjects with PD and 40 controls and analyzed evolving speech features from 10 years before to 20 years after diagnosis. Our longitudinal analysis, focused on 15 subjects with PD and 15 controls, revealed features like pitch variability, pause duration, speech rate, and syllable duration, indicating PD progression. Early dysarthria patterns were detectable in the prodromal phase, with the best classifiers achieving AUCs of 0.72 and 0.75 for data collected ten and five years before diagnosis, respectively, and 0.93 post-diagnosis. This study highlights the potential for early detection methods, aiding treatment response identification and screening in clinical trials.

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Additional details

Software

Repository URL
https://github.com/Annafavaro/PARKCELEB
Programming language
Python
Development Status
Active