Published November 18, 2021 | Version v1
Video/Audio Open

Places Audio Captions (Japanese) 100k

Description

The Places Audio Caption (Japanese) 100K Corpus contains approximately 100,000 Japanese spoken captions for natural images drawn from the Places 205 image dataset.

This speech corpus was collected to investigate the learning of spoken language (words, sub-word units, higher-level semantics, etc.) from visually-grounded speech. For a description of the corpus, see:

@INPROCEEDINGS{Ohishi2020trilingual,
  author={Ohishi, Yasunori and Kimura, Akisato and Kawanishi, Takahito and Kashino, Kunio and Harwath, David and Glass, James},
  booktitle={ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Trilingual Semantic Embeddings of Visually Grounded Speech with Self-Attention Mechanisms},
  year={2020},
  pages={4352--4356}
}

The corpus includes only the audio recordings, not the associated images. The Places image dataset must be downloaded separately.

The data is distributed under the Creative Commons Attribution-ShareAlike (CC BY-SA) license.

If you use this data in your own publications, please cite the paper above.

Files

Files (43.5 GB)

md5:4a4d1093363001e2a7c8c3ca5aa46533
Size: 43.5 GB
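
Because the archive is large, it may be worth checking a finished download against the MD5 checksum listed above before unpacking it. The following is a minimal sketch of one way to do that in Python, not an official tool for this corpus. The function names and the example file name in the usage comment are hypothetical, since the archive's file name is not given on this page.

```python
import hashlib

# MD5 checksum as listed in the Files section of this record.
EXPECTED_MD5 = "4a4d1093363001e2a7c8c3ca5aa46533"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Chunked reading keeps memory use bounded for a multi-gigabyte archive.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected checksum."""
    return md5_of_file(path) == expected.lower()


# Usage (the file name is an assumption; substitute the downloaded archive):
# print(matches_expected("places_audio_ja_100k.tar.gz"))
```

A mismatch usually indicates an incomplete or corrupted download, in which case fetching the file again is the simplest remedy.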