Published June 1, 2017 | Version v1
Dataset Open


  • 1. Université Grenoble Alpes




Our corpus is an extension of the MS COCO image recognition and captioning dataset. MS COCO comprises images, each paired with a set of five captions, but it does not include any speech. We therefore used Voxygen's text-to-speech system to synthesise the available captions.

The addition of speech as a new modality enables MS COCO to be used for research on language acquisition, unsupervised term discovery, keyword spotting, and semantic embedding using speech and vision.

Our corpus is licensed under a Creative Commons Attribution 4.0 License.

Data Set

  • This corpus contains 616,767 spoken captions from MS COCO's val2014 and train2014 subsets (414,113 for train2014 and 202,654 for val2014).

  • We used 8 different voices. 4 of them have a British accent (Paul, Bronwen, Judith, and Elizabeth) and the 4 others have an American accent (Phil, Bruce, Amanda, Jenny).

  • To make the captions sound more natural, we used the SoX tempo command, which changes the speed without changing the pitch. One third of the captions are 10% slower than the original pace, one third are 10% faster, and the last third was left untouched.

  • We also modified approximately 30% of the original captions and added disfluencies such as "um", "uh", "er" so that the captions would sound more natural.

  • Each WAV file is paired with a JSON file containing various information: timecode of each word in the caption, name of the speaker, name of the WAV file, etc. The JSON files have the following data structure:

    {
      "duration": float,
      "speaker": string,
      "synthesisedCaption": string,
      "timecode": list,
      "speed": float,
      "wavFilename": string,
      "captionID": int,
      "imgID": int,
      "disfluency": list
    }

  • On average, each caption comprises 10.79 tokens, disfluencies included. The WAV files are on average 3.52 seconds long.
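
As a minimal sketch of how one of these JSON files might be checked after loading, the Python snippet below parses a record and verifies that each field of the data structure listed above has the expected type. The sample values are illustrative only (the duration is simply the corpus average, and the empty timecode and disfluency lists are placeholders), and `check_metadata` is a hypothetical helper, not part of the speechcoco API.

```python
import json

# Field names and types as listed in the JSON data structure of the corpus.
EXPECTED_FIELDS = {
    "duration": float,
    "speaker": str,
    "synthesisedCaption": str,
    "timecode": list,
    "speed": float,
    "wavFilename": str,
    "captionID": int,
    "imgID": int,
    "disfluency": list,
}


def check_metadata(record):
    """Return a list of problems found in one metadata record (hypothetical helper)."""
    problems = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        # JSON has a single number type, so a whole-valued float may load as int.
        accepted = (int, float) if expected is float else expected
        if not isinstance(value, accepted):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


# Illustrative record; the values are assumptions, not copied from a real file.
sample = """
{
  "duration": 3.52,
  "speaker": "Phil",
  "synthesisedCaption": "A group of turkeys with bushes in the background.",
  "timecode": [],
  "speed": 0.9,
  "wavFilename": "298817_26763_Phil_None_0-9.wav",
  "captionID": 26763,
  "imgID": 298817,
  "disfluency": []
}
"""
record = json.loads(sample)
problems = check_metadata(record)
print(problems or "record matches the documented structure")
```

In practice the string would come from reading one of the files in the json/ folder; the check itself stays the same.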


The repository is organized as follows:

  • CORPUS-MSCOCO (~75GB once decompressed)

    • train2014/ : folder contains 413,915 captions

      • json/

      • wav/

      • translations/

        • train_en_ja.txt

        • train_translate.sqlite3

      • train_2014.sqlite3

    • val2014/ : folder contains 202,520 captions

      • json/

      • wav/

      • translations/

        • train_en_ja.txt

        • train_translate.sqlite3

      • val_2014.sqlite3

    • speechcoco_API/

      • speechcoco/





.wav files contain the spoken version of a caption

.json files contain all the metadata of a given WAV file

.sqlite3 files are SQLite databases containing all the information contained in the JSON files
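
To illustrate the relationship between the JSON metadata and an SQLite database, here is a minimal Python sketch that aggregates a few records into an in-memory database and then filters them with a `LIKE` pattern. The table name and column layout are assumptions made for this example (the schema of the distributed .sqlite3 files is not described here), and the third record is made up so that the filter has something to exclude.

```python
import sqlite3

# Two records taken from the example output in this description, plus one
# made-up record (the last one) that should not match the filter.
records = [
    {"captionID": 26763, "imgID": 298817, "speaker": "Phil", "speed": 0.9,
     "wavFilename": "298817_26763_Phil_None_0-9.wav",
     "text": "A group of turkeys with bushes in the background."},
    {"captionID": 147972, "imgID": 108505, "speaker": "Phil", "speed": 0.9,
     "wavFilename": "108505_147972_Phil_Middle_0-9.wav",
     "text": "Person using a, um, slider cell phone with blue backlit keys."},
    {"captionID": 1, "imgID": 2, "speaker": "Judith", "speed": 1.0,
     "wavFilename": "2_1_Judith_None_1-0.wav",
     "text": "A made-up caption without the search term."},
]

# Hypothetical schema: one row per caption.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE captions ("
    "captionID INTEGER PRIMARY KEY, imgID INTEGER, speaker TEXT, "
    "speed REAL, wavFilename TEXT, text TEXT)"
)
conn.executemany(
    "INSERT INTO captions VALUES "
    "(:captionID, :imgID, :speaker, :speed, :wavFilename, :text)",
    records,
)

# Captions slowed down by 10% whose text contains "keys" at any position.
rows = conn.execute(
    "SELECT wavFilename FROM captions "
    "WHERE speed = ? AND text LIKE ? ORDER BY captionID",
    (0.9, "%keys%"),
).fetchall()
matches = [row[0] for row in rows]
print(matches)
```

Note that the `%keys%` pattern also matches words such as "turkeys", which is why the first record is returned.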

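We adopted the following naming convention for both the WAV and JSON files:

imgID_captionID_speaker_disfluencyPosition_speed.wav (or .json)

For example, 108505_147972_Phil_Middle_0-9.wav is caption 147972 of image 108505, spoken by Phil at speed 0.9, with a disfluency in the middle of the caption.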
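
File names that appear in the example output of this description, such as 108505_147972_Phil_Middle_0-9.wav, suggest five underscore-separated fields: image ID, caption ID, speaker, disfluency position, and speed (with the decimal point written as a hyphen). The Python sketch below splits a name along those lines. The field interpretation is inferred from the examples, and `parse_caption_filename` is a hypothetical helper rather than part of the speechcoco API.

```python
from pathlib import PurePath


def parse_caption_filename(filename):
    """Split a WAV/JSON file name into its five assumed fields."""
    stem = PurePath(filename).stem  # drop the .wav or .json extension
    parts = stem.split("_")
    if len(parts) != 5:
        raise ValueError(f"unexpected file name: {filename}")
    img_id, caption_id, speaker, disfluency, speed = parts
    return {
        "imgID": int(img_id),
        "captionID": int(caption_id),
        "speaker": speaker,
        "disfluency": disfluency,
        # Assumption: "0-9" encodes the speed factor 0.9.
        "speed": float(speed.replace("-", ".")),
    }


info = parse_caption_filename("108505_147972_Phil_Middle_0-9.wav")
print(info)
```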



We created a script (see speechcoco_API/) to handle the metadata and to let users easily find captions matching specific filters. The script uses the *.sqlite3 files. Its features are:


  • Aggregate all the information in the JSON files into a single SQLite database

  • Find captions according to specific filters (name, gender and nationality of the speaker, disfluency position, speed, duration, and words in the caption). The script automatically builds the SQLite query. Users can also provide their own SQLite query.

The following Python code returns all the captions spoken by a male speaker with an American accent that were slowed down by 10% and that contain "keys" at any position:

# create SpeechCoco object
db = SpeechCoco('train_2014.sqlite3', 'train_translate.sqlite3', verbose=True)

# filter captions (returns Caption objects)
captions = db.filterCaptions(gender="Male", nationality="US", speed=0.9, text='%keys%')
for caption in captions:
    ...  # print the image ID, caption ID, speaker, speed, WAV filename and text of each caption

Output:

298817      26763   Phil    0.9     298817_26763_Phil_None_0-9.wav          A group of turkeys with bushes in the background.
108505      147972  Phil    0.9     108505_147972_Phil_Middle_0-9.wav               Person using a, um, slider cell phone with blue backlit keys.
258289      154380  Bruce   0.9     258289_154380_Bruce_None_0-9.wav                Some donkeys and sheep are in their green pens .
545312      201303  Phil    0.9     545312_201303_Phil_None_0-9.wav         A man walking next to a couple of donkeys.

  • Find all the captions belonging to a specific image

captions = db.getImgCaptions(298817)
for caption in captions:
    ...  # print the text of each caption

Output:

Birds wondering through grassy ground next to bushes.
A flock of turkeys are making their way up a hill.
Um, ah. Two wild turkeys in a field walking around.
Four wild turkeys and some bushes trees and weeds.
A group of turkeys with bushes in the background.

  • Parse the timecodes and have them structured


Flat timecode list (excerpt):

[1926.3068, "SYL", ""],
[1926.3068, "SEPR", " "],
[1926.3068, "WORD", "white"],
[1926.3068, "PHO", "w"],
[2050.7955, "PHO", "ai"],
[2144.6591, "PHO", "t"],
[2179.3182, "SYL", ""],
[2179.3182, "SEPR", " "]

Structured result:

{
'begin': 1926.3068,
'end': 2179.3182,
'syllable': [{'begin': 1926.3068,
              'end': 2179.3182,
              'phoneme': [{'begin': 1926.3068,
                           'end': 2050.7955,
                           'value': 'w'},
                          {'begin': 2050.7955,
                           'end': 2144.6591,
                           'value': 'ai'},
                          {'begin': 2144.6591,
                           'end': 2179.3182,
                           'value': 't'}],
              'value': 'wait'}],
'value': 'white'
}
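
The grouping shown above can be sketched in a few lines of Python. This is not the parser shipped with the corpus; it is a minimal illustration under assumptions inferred from the excerpt: a "WORD" entry opens a word, each "PHO" entry opens a phoneme that ends where the next entry begins, "SYL" closes the current syllable, "SEPR" closes the current word, and a syllable's value is the concatenation of its phonemes. The function name `structure_timecode` is hypothetical.

```python
from pprint import pprint


def structure_timecode(events):
    """Group a flat [time, type, value] list into words > syllables > phonemes."""
    words = []
    word = None    # word currently being built
    phonemes = []  # phonemes of the syllable currently being built

    def close_syllable(end):
        nonlocal phonemes
        if word is not None and phonemes:
            phonemes[-1]["end"] = end
            word["syllable"].append({
                "begin": phonemes[0]["begin"],
                "end": end,
                "phoneme": phonemes,
                # Assumption: syllable value = concatenated phoneme values.
                "value": "".join(p["value"] for p in phonemes),
            })
        phonemes = []

    for time, kind, value in events:
        if kind == "PHO":
            if phonemes:
                phonemes[-1]["end"] = time  # previous phoneme ends here
            phonemes.append({"begin": time, "end": None, "value": value})
        elif kind == "SYL":
            close_syllable(time)
        elif kind == "SEPR":
            close_syllable(time)
            if word is not None:
                word["end"] = time
                words.append(word)
                word = None
        elif kind == "WORD":
            word = {"begin": time, "end": None, "syllable": [], "value": value}
    return words


events = [
    [1926.3068, "SYL", ""],
    [1926.3068, "SEPR", " "],
    [1926.3068, "WORD", "white"],
    [1926.3068, "PHO", "w"],
    [2050.7955, "PHO", "ai"],
    [2144.6591, "PHO", "t"],
    [2179.3182, "SYL", ""],
    [2179.3182, "SEPR", " "],
]
words = structure_timecode(events)
pprint(words)
```

Phonemes that occur outside any word (for instance pauses, if the raw list contains them) are simply dropped by this sketch.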
  • Convert the timecodes to Praat TextGrid files

caption.timecode.toTextgrid(outputDir, level=3)
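
For readers unfamiliar with the target format, the following Python sketch shows roughly what writing a TextGrid involves. It is not the code behind `toTextgrid`; it assumes that timecodes are in milliseconds, produces a single interval tier of words in Praat's long text format, and fills gaps between words with empty intervals. The function name `to_textgrid` and the tier name are hypothetical.

```python
def to_textgrid(words, tier_name="words"):
    """Render words (begin/end in milliseconds) as a one-tier TextGrid string."""
    # Build contiguous intervals in seconds, inserting empty ones for gaps.
    intervals, cursor = [], 0.0
    for word in words:
        begin, end = word["begin"] / 1000.0, word["end"] / 1000.0
        if begin > cursor:
            intervals.append((cursor, begin, ""))
        intervals.append((begin, end, word["value"]))
        cursor = end

    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        "",
        "xmin = 0",
        f"xmax = {cursor}",
        "tiers? <exists>",
        "size = 1",
        "item []:",
        "    item [1]:",
        '        class = "IntervalTier"',
        f'        name = "{tier_name}"',
        "        xmin = 0",
        f"        xmax = {cursor}",
        f"        intervals: size = {len(intervals)}",
    ]
    for i, (begin, end, text) in enumerate(intervals, start=1):
        escaped = text.replace('"', '""')  # Praat doubles quotes inside strings
        lines += [
            f"        intervals [{i}]:",
            f"            xmin = {begin}",
            f"            xmax = {end}",
            f'            text = "{escaped}"',
        ]
    return "\n".join(lines) + "\n"


textgrid = to_textgrid([{"begin": 1926.3068, "end": 2179.3182, "value": "white"}])
print(textgrid)
```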
  • Get the words, syllables, and phonemes within a given time interval (in seconds or milliseconds)

The following Python code returns all the words between 0.2 and 0.6 seconds for which at least 50% of the word's total length falls within the specified interval:

from pprint import pprint

pprint(caption.getWords(0.20, 0.60, seconds=True, level=1, olapthr=50))

Output (the caption, followed by the matching words):

404537      827239  Bruce   US      0.9     404537_827239_Bruce_None_0-9.wav                Eyeglasses, a cellphone, some keys and other pocket items are all laid out on the cloth. .
[
    {
        'begin': 0.0,
        'end': 0.7202778,
        'overlapPercentage': 55.53412863758955,
        'word': 'eyeglasses'
    }
]
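
The overlap figure in this output is consistent with dividing the length of the part of the word that lies inside the interval by the word's total length: (0.6 - 0.2) / 0.7202778 is about 55.53%. The Python sketch below reproduces that arithmetic. It is an illustration of the threshold rule, not the API's implementation; the function names are hypothetical, and the second word in the sample list is made up to show a rejected case.

```python
def overlap_percentage(begin, end, start, stop):
    """Percentage of the unit [begin, end] that lies inside [start, stop]."""
    length = end - begin
    if length <= 0:
        return 0.0
    overlap = min(end, stop) - max(begin, start)
    return max(overlap, 0.0) / length * 100.0


def select_words(words, start, stop, olapthr=50):
    """Keep the words whose overlap with [start, stop] reaches the threshold."""
    selected = []
    for word in words:
        percentage = overlap_percentage(word["begin"], word["end"], start, stop)
        if percentage >= olapthr:
            selected.append(dict(word, overlapPercentage=percentage))
    return selected


sample_words = [
    {"begin": 0.0, "end": 0.7202778, "word": "eyeglasses"},  # from the example
    {"begin": 0.7202778, "end": 0.9, "word": "a"},           # made-up timing
]
selected = select_words(sample_words, 0.20, 0.60, olapthr=50)
print(selected)
```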
  • Get the translations of the selected captions

For now, only Japanese translations are available. We also used KyTea to tokenize and tag the captions translated with Google Translate.

captions = db.getImgCaptions(298817)
for caption in captions:

    # Get translations and POS
    print('\tja_google: {}'.format(db.getTranslation(caption.captionID, "ja_google")))
    print('\t\tja_google_tokens: {}'.format(db.getTokens(caption.captionID, "ja_google")))
    print('\t\tja_google_pos: {}'.format(db.getPOS(caption.captionID, "ja_google")))
    print('\tja_excite: {}'.format(db.getTranslation(caption.captionID, "ja_excite")))
Output:

Birds wondering through grassy ground next to bushes.
    ja_google: 鳥は茂みの下に茂った地面を抱えています。
        ja_google_tokens: 鳥 は 茂み の 下 に 茂 っ た 地面 を 抱え て い ま す 。
        ja_google_pos: 鳥/名詞/とり は/助詞/は 茂み/名詞/しげみ の/助詞/の 下/名詞/した に/助詞/に 茂/動詞/しげ っ/語尾/っ た/助動詞/た 地面/名詞/じめん を/助詞/を 抱え/動詞/かかえ て/助詞/て い/動詞/い ま/助動詞/ま す/語尾/す 。/補助記号/。
    ja_excite: 低木と隣接した草深いグラウンドを通って疑う鳥。

A flock of turkeys are making their way up a hill.
    ja_google: 七面鳥の群れが丘を上っています。
        ja_google_tokens: 七 面 鳥 の 群れ が 丘 を 上 っ て い ま す 。
        ja_google_pos: 七/名詞/なな 面/名詞/めん 鳥/名詞/とり の/助詞/の 群れ/名詞/むれ が/助詞/が 丘/名詞/おか を/助詞/を 上/動詞/のぼ っ/語尾/っ て/助詞/て い/動詞/い ま/助動詞/ま す/語尾/す 。/補助記号/。
    ja_excite: 七面鳥の群れは丘の上で進んでいる。

Um, ah. Two wild turkeys in a field walking around.
    ja_google: 野生のシチメンチョウ、野生の七面鳥
        ja_google_tokens: 野生 の シチメンチョウ 、 野生 の 七 面 鳥
        ja_google_pos: 野生/名詞/やせい の/助詞/の シチメンチョウ/名詞/しちめんちょう 、/補助記号/、 野生/名詞/やせい の/助詞/の 七/名詞/なな 面/名詞/めん 鳥/名詞/ちょう
    ja_excite: まわりで移動しているフィールドの2羽の野生の七面鳥

Four wild turkeys and some bushes trees and weeds.
    ja_google: 4本の野生のシチメンチョウといくつかの茂みの木と雑草
        ja_google_tokens: 4 本 の 野生 の シチメンチョウ と いく つ か の 茂み の 木 と 雑草
        ja_google_pos: 4/名詞/4 本/接尾辞/ほん の/助詞/の 野生/名詞/やせい の/助詞/の シチメンチョウ/名詞/しちめんちょう と/助詞/と いく/名詞/いく つ/接尾辞/つ か/助詞/か の/助詞/の 茂み/名詞/しげみ の/助詞/の 木/名詞/き と/助詞/と 雑草/名詞/ざっそう
    ja_excite: 4羽の野生の七面鳥およびいくつかの低木木と雑草

A group of turkeys with bushes in the background.
    ja_google: 背景に茂みを持つ七面鳥の群
        ja_google_tokens: 背景 に 茂み を 持 つ 七 面 鳥 の 群
        ja_google_pos: 背景/名詞/はいけい に/助詞/に 茂み/名詞/しげみ を/助詞/を 持/動詞/も つ/語尾/つ 七/名詞/なな 面/名詞/めん 鳥/名詞/ちょう の/助詞/の 群/名詞/むれ
    ja_excite: 背景の低木を持つ七面鳥のグループ



Files (46.9 GB)

Three files: 7.2 kB, 31.5 GB, and 15.4 GB.

Additional details

Related works

Is documented by
Conference paper: 10.21437/GLU.2017-9 (DOI)
Is new version of
Dataset: 10.18709/perscido.2017.06.ds80 (DOI)


  • SPEECH-COCO: 600k Visually Grounded Spoken Captions Aligned to MSCOCO Data Set