Published May 10, 2021 | Version 2.0
Dataset Open

Clotho dataset

  • 1. Audio Research Group, Faculty of Information Technology and Communication Sciences, Tampere University

Description

=== There is a newer version. Please use the newer version ===

Clotho is an audio captioning dataset that has now reached version 2. Clotho consists of 6974 audio samples, each with five captions (34 870 captions in total). Audio samples are 15 to 30 s long and captions are 8 to 20 words long.

Clotho is thoroughly described in our paper:

K. Drossos, S. Lipping and T. Virtanen, "Clotho: an Audio Captioning Dataset," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 736-740, doi: 10.1109/ICASSP40776.2020.9052990.

available online at: https://arxiv.org/abs/1910.09387 and at: https://ieeexplore.ieee.org/document/9052990 

If you use Clotho, please cite our paper.

To use the dataset, you can use our code at: https://github.com/audio-captioning/clotho-dataset

These are the files for the development, validation, and evaluation splits of the Clotho dataset.

--------------------------------------------------------------------------------------------------------

== Changes in version 2 ==

In version 2 of Clotho, audio files have been added to the development split and a new validation split has been introduced. The evaluation split is unchanged.

Specifically: 

  • The development split now has 3840 audio files. Clotho version 1 had 2893, so 947 new audio files have been added.
  • There are 1046 new audio files in the validation split. 

All new captions were processed as in version 1 of Clotho, i.e. they have word consistency, no named entities, no speech transcription, and no hapax legomena between splits (i.e. no words appearing in only one of the splits).
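
The constraint on words shared between splits can be checked with a short script. The following is a minimal sketch, not the procedure used to build Clotho: the simplistic lowercase tokenizer and the function names `tokenize` and `words_unique_to_one_split` are assumptions made for illustration, and the toy captions are invented.

```python
import re
from collections import defaultdict


def tokenize(caption):
    """Lowercase a caption and split it into word tokens (assumed, simplistic tokenizer)."""
    return re.findall(r"[a-z']+", caption.lower())


def words_unique_to_one_split(splits):
    """Return {word: split_name} for words that occur in exactly one split.

    `splits` maps a split name to an iterable of caption strings.
    """
    seen_in = defaultdict(set)
    for split_name, captions in splits.items():
        for caption in captions:
            for word in tokenize(caption):
                seen_in[word].add(split_name)
    return {word: next(iter(names))
            for word, names in seen_in.items() if len(names) == 1}


# Toy captions invented for this example (not taken from Clotho).
toy = {
    "development": ["A dog barks in the distance", "Rain falls on a roof"],
    "validation": ["A dog barks on a roof", "Rain falls in the distance"],
}
print(words_unique_to_one_split(toy))  # prints {} because both splits share every word
```

An empty result means that every word occurs in at least two splits, which is the property described above.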

--------------------------------------------------------------------------------------------------------

== Usage ==

To use the dataset you have to:

  1. Download the audio files: clotho_audio_development.7z, clotho_audio_validation.7z, and clotho_audio_evaluation.7z
  2. Download the files with the captions: clotho_captions_development.csv, clotho_captions_validation.csv, and clotho_captions_evaluation.csv
  3. Download the files with the associated metadata: clotho_metadata_development.csv, clotho_metadata_validation.csv, and clotho_metadata_evaluation.csv
  4. Extract the audio files
  5. Then you can use each audio file with its corresponding captions
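
As an illustration of the last step, the sketch below pairs each audio file with its captions. It assumes that the captions CSV has a header row with a `file_name` column and that every caption column name starts with `caption`; neither detail is stated in this description, so check them against the downloaded files. The function name `load_captions` and the example paths are hypothetical.

```python
import csv
from pathlib import Path


def load_captions(csv_path, audio_dir):
    """Map each audio file path to its list of captions.

    Assumes a header row with a `file_name` column; every non-empty
    column whose name starts with `caption` is treated as a caption.
    """
    audio_dir = Path(audio_dir)
    pairs = {}
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            captions = [text for column, text in row.items()
                        if column and column.startswith("caption") and text]
            pairs[audio_dir / row["file_name"]] = captions
    return pairs


# Example call (both paths are placeholders for wherever the files were saved):
# pairs = load_captions("clotho_captions_development.csv", "development")
```

Returning a dictionary keyed by path keeps the audio loading itself out of the sketch, so any audio library can be used on the keys.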

--------------------------------------------------------------------------------------------------------

== License ==

The audio files in the archives:

  • clotho_audio_development.7z,
  • clotho_audio_validation.7z, and
  • clotho_audio_evaluation.7z

and the associated meta-data in the CSV files:

  • clotho_metadata_development.csv
  • clotho_metadata_validation.csv
  • clotho_metadata_evaluation.csv

are under the corresponding licences (mostly Creative Commons with attribution) of the Freesound [1] platform, stated explicitly in the CSV files for each audio file. That is, each audio file in the 7z archives is listed in the CSV files together with its meta-data. The meta-data for each file are:

  • File name
  • Keywords
  • URL for the original audio file
  • Start and ending samples for the excerpt that is used in the Clotho dataset
  • Uploader/user in the Freesound platform (manufacturer)
  • Link to the licence of the file

The captions in the files:

  • clotho_captions_development.csv
  • clotho_captions_validation.csv
  • clotho_captions_evaluation.csv

are under the Tampere University licence, described in the LICENCE file (mainly a non-commercial licence with attribution).

--------------------------------------------------------------------------------------------------------

== References ==
[1] Frederic Font, Gerard Roma, and Xavier Serra. 2013. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia (MM '13). ACM, New York, NY, USA, 411-412. DOI: https://doi.org/10.1145/2502081.2502245

Files (6.5 GB)

MD5 checksum and size:

  • md5:eda144a5e05a60e6d2e37a65fc4720a9 (4.3 GB)
  • md5:4569624ccadf96223f19cb59fe4f849f (1.2 GB)
  • md5:0475bfa5793e80f748d32525018ebada (954.4 MB)
  • md5:800633304e73d3daed364a2ba6069827 (1.3 MB)
  • md5:1b16b9e57cf7bdb7f13a13802aeb57e2 (362.0 kB)
  • md5:3109c353138a089c7ba724f27d71595d (367.6 kB)
  • md5:5fdc51b4c4f3468ff7d251ea563588c9 (830.8 kB)
  • md5:13946f054d4e1bf48079813aac61bf77 (225.3 kB)
  • md5:f69cfacebcd47c4d8d30d968f9865475 (224.8 kB)
  • md5:38d422ac8a2c9c35c288232576e2e810 (1.9 kB)
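
A downloaded file can be checked against its MD5 checksum before extraction. The sketch below uses only the Python standard library; the function names `md5_of_file` and `matches_checksum` are hypothetical, and the checksum to pass is the one shown for the corresponding file on the record page.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.strip().lower().split(":", 1)[-1]
    return md5_of_file(path) == expected_hex


# Example call (the path and the checksum are placeholders):
# ok = matches_checksum("downloaded_file.7z", "md5:<checksum from the record page>")
```

Reading in chunks keeps memory use constant, which matters for the multi-gigabyte audio archives.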

Additional details

Related works

Is supplement to
Conference paper: https://arxiv.org/abs/1910.09387 (URL)
Is supplemented by
Software: https://github.com/audio-captioning/clotho-dataset (URL)

Funding

EVERYSOUND – Computational Analysis of Everyday Soundscapes 637422
European Commission
