Dataset Open Access
Grzegorz Chrupała; Lieke Gelderloos; Afra Alishahi
Synthetically Spoken COCO
This dataset contains synthetically generated spoken versions of MS COCO captions. This
dataset was created as part of the research reported in .
The speech was generated using gTTS. The dataset consists of the following files:
- dataset.json: Captions associated with MS COCO images. This information comes from .
- sentid.txt: List of caption IDs. This file can be used to locate MFCC features of the MP3 files
in the numpy array stored in dataset.mfcc.npy.
- mp3.tgz: MP3 files with the audio. Each file name corresponds to a caption ID in dataset.json
and in sentid.txt.
- dataset.mfcc.npy: NumPy array with the Mel Frequency Cepstral Coefficients extracted from
the audio. Each row corresponds to a caption. The order of the captions corresponds to the
ordering in the file sentid.txt. MFCCs were extracted using .
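
The correspondence between sentid.txt and dataset.mfcc.npy described above can be sketched in Python as follows. This is a minimal illustration, not code shipped with the dataset. It assumes that sentid.txt holds one caption ID per line and that the array may need allow_pickle=True to load (for example, if rows are stored as variable-length objects). The helper names load_sentids, build_index, load_mfcc and mfcc_for_caption are hypothetical.

```python
import numpy as np


def load_sentids(path):
    """Read caption IDs in file order (assumed format: one ID per line)."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def build_index(sentids):
    """Map each caption ID (as a string) to its row number in the MFCC array."""
    return {sid: row for row, sid in enumerate(sentids)}


def load_mfcc(path):
    """Load the MFCC array.

    allow_pickle=True is an assumption: it is only required if the rows
    were saved as Python objects, and it should only be used on files
    from a trusted source.
    """
    return np.load(path, allow_pickle=True)


def mfcc_for_caption(mfcc, index, sentid):
    """Return the MFCC row for a caption ID; raises KeyError if the ID is unknown."""
    return mfcc[index[str(sentid)]]


# Example usage (paths refer to the files listed above):
#   sentids = load_sentids("sentid.txt")
#   index = build_index(sentids)
#   mfcc = load_mfcc("dataset.mfcc.npy")
#   features = mfcc_for_caption(mfcc, index, sentids[0])
```

Building the dictionary once makes each lookup constant time, which matters if features are fetched repeatedly by caption ID rather than read in file order.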