4783391
doi
10.5281/zenodo.4783391
oai:zenodo.org:4783391
user-tut-arg
user-audio-captioning
user-eu
Samuel Lipping
Audio Research Group, Faculty of Information Technology and Communication Sciences, Tampere University
Tuomas Virtanen
Audio Research Group, Faculty of Information Technology and Communication Sciences, Tampere University
Clotho dataset
Konstantinos Drossos
Audio Research Group, Faculty of Information Technology and Communication Sciences, Tampere University
url:https://github.com/audio-captioning/clotho-dataset
url:https://arxiv.org/abs/1910.09387
info:eu-repo/semantics/openAccess
Other (Attribution)
Clotho
Audio captioning
Dataset
Audio processing
Signal processing
Machine listening
Computational auditory scene analysis
Captioning
Deep learning
<p>Clotho is an audio captioning dataset that has now reached version 2. Clotho consists of 6974 audio samples, each with five captions (34 870 captions in total). Audio samples are 15 to 30 s long and captions are 8 to 20 words long. </p>
<p>Clotho is thoroughly described in our paper:</p>
<p><em>K. Drossos, S. Lipping and T. Virtanen, "Clotho: an Audio Captioning Dataset," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 736-740, doi: 10.1109/ICASSP40776.2020.9052990.</em></p>
<p>available online at: <a href="https://arxiv.org/abs/1910.09387">https://arxiv.org/abs/1910.09387</a> and at: <a href="https://ieeexplore.ieee.org/document/9052990">https://ieeexplore.ieee.org/document/9052990</a> </p>
<p><strong>If you use Clotho, please cite our paper.</strong></p>
<p><strong>To use the dataset, you can use our code at:</strong> <a href="https://github.com/audio-captioning/clotho-dataset">https://github.com/audio-captioning/clotho-dataset</a> </p>
<p>These are the files for the development, validation, and evaluation splits of the Clotho dataset. </p>
<p>--------------------------------------------------------------------------------------------------------</p>
<p><strong>== Changes in version 2.1 ==</strong></p>
<p>In version 2.1 of Clotho, we fixed files that had been corrupted during compression and transfer (around 150 files) and replaced characters in file names that are illegal on most filesystems, e.g. ":" (around 10 files). </p>
<p>Please use this version for your experiments. </p>
<p><strong>== Changes in version 2 ==</strong></p>
<p>Version 2 of Clotho adds audio files to the development split and introduces a new validation split. The evaluation split is unchanged. </p>
<p>Specifically: </p>
<ul>
<li>The development split now contains 3840 audio files, up from 2893 in Clotho version 1 (947 new audio files). </li>
<li>There are 1046 new audio files in the validation split. </li>
</ul>
<p>All new captions were processed as in version 1 of Clotho, i.e. they have word consistency, no named entities, no speech transcription, and no hapax legomena between splits (i.e. no words that appear in only one of the splits). </p>
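<p>The "no words appearing in only one split" property can be checked with a short script. The following is a minimal sketch, not the procedure used to build Clotho: the tokenizer (lowercasing and keeping runs of letters, digits, and apostrophes) and the function names are assumptions made for illustration, and the captions in the example are toy data.</p>

```python
import re
from collections import Counter


def tokenize(caption):
    # ASSUMPTION: simple lowercase word tokenizer, not the authors' own.
    return re.findall(r"[a-z0-9']+", caption.lower())


def words_in_single_split(captions_by_split):
    """Return, for each split, the words that occur in no other split."""
    # Vocabulary (set of distinct words) of every split.
    vocab = {
        split: {word for caption in captions for word in tokenize(caption)}
        for split, captions in captions_by_split.items()
    }
    # Count in how many splits each word occurs.
    split_count = Counter(word for words in vocab.values() for word in words)
    return {
        split: {word for word in words if split_count[word] == 1}
        for split, words in vocab.items()
    }


# Toy captions (hypothetical, not taken from Clotho).
toy = {
    "development": ["A dog barks loudly", "Rain falls on a roof"],
    "validation": ["A dog barks on a roof"],
    "evaluation": ["Rain falls loudly"],
}
leftovers = words_in_single_split(toy)
```

<p>For a dataset that satisfies the property, every set in the result is empty, as it is for the toy data here.</p>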
<p>--------------------------------------------------------------------------------------------------------</p>
<p><strong>== Usage ==</strong></p>
<p>To use the dataset you have to:</p>
<ol>
<li>Download the audio files: clotho_audio_development.7z, clotho_audio_validation.7z, and clotho_audio_evaluation.7z</li>
<li>Download the files with the captions: clotho_captions_development.csv, clotho_captions_validation.csv, and clotho_captions_evaluation.csv</li>
<li>Download the files with the associated metadata: clotho_metadata_development.csv, clotho_metadata_validation.csv, and clotho_metadata_evaluation.csv</li>
<li>Extract the audio files</li>
<li>Then you can use each audio file with its corresponding captions</li>
</ol>
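<p>The steps above can be sketched in Python once the files are downloaded and the archives extracted (extraction needs a separate 7z tool and is not shown). This is a minimal sketch under stated assumptions: the MD5 values are those listed for the caption files of this record, while the CSV header names ("file_name" plus columns starting with "caption_"), the UTF-8 encoding, and the function names are assumptions that should be checked against the downloaded files.</p>

```python
import csv
import hashlib
from pathlib import Path

# MD5 checksums listed for the caption files of this record.
EXPECTED_MD5 = {
    "clotho_captions_development.csv": "d4090b39ce9f2491908eebf4d5b09bae",
    "clotho_captions_validation.csv": "5879e023032b22a2c930aaa0528bead4",
    "clotho_captions_evaluation.csv": "1b16b9e57cf7bdb7f13a13802aeb57e2",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_ok(path, expected_md5):
    """True if the file at `path` has the expected MD5 digest."""
    return md5_of_file(path) == expected_md5.lower()


def load_captions(csv_path, audio_dir, encoding="utf-8"):
    """Map each audio file path to the list of its captions.

    ASSUMPTION: the CSV has a 'file_name' column and one or more
    columns whose names start with 'caption_'.
    """
    pairs = {}
    with open(csv_path, newline="", encoding=encoding) as handle:
        for row in csv.DictReader(handle):
            captions = [
                value for key, value in row.items()
                if key and key.startswith("caption_")
            ]
            pairs[Path(audio_dir) / row["file_name"]] = captions
    return pairs


# Example usage (the local paths are hypothetical):
# name = "clotho_captions_development.csv"
# assert checksum_ok(name, EXPECTED_MD5[name])
# pairs = load_captions(name, "development")
```

<p>Checking the digest before parsing guards against truncated or corrupted downloads, the kind of problem that version 2.1 fixed on the archive side.</p>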
<p>--------------------------------------------------------------------------------------------------------</p>
<p><strong>== License ==</strong></p>
<p>The audio files in the archives:</p>
<ul>
<li>clotho_audio_development.7z,</li>
<li>clotho_audio_validation.7z, and</li>
<li>clotho_audio_evaluation.7z</li>
</ul>
<p>and the associated meta-data in the CSV files:</p>
<ul>
<li>clotho_metadata_development.csv</li>
<li>clotho_metadata_validation.csv</li>
<li>clotho_metadata_evaluation.csv</li>
</ul>
<p>are under the corresponding licences (<strong><em>mostly CreativeCommons with attribution</em></strong>) of the Freesound [1] platform, which are stated explicitly in the CSV files for each audio file. That is, each audio file in the 7z archives is listed in the CSV files with its meta-data. The meta-data for each file are: </p>
<ul>
<li>File name</li>
<li>Keywords</li>
<li>URL for the original audio file</li>
<li>Start and ending samples for the excerpt that is used in the Clotho dataset</li>
<li>Uploader/user on the Freesound platform (manufacturer)</li>
<li>Link to the licence of the file</li>
</ul>
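<p>Because most of these licences require attribution, it can help to generate credit lines directly from the metadata CSV files. The following is a minimal sketch: the column names in COLUMNS ("file_name", "manufacturer", "sound_link", "license") and the UTF-8 encoding are assumptions that must be checked against the header row of the downloaded CSV files, and the class and function names are hypothetical.</p>

```python
import csv
from dataclasses import dataclass


@dataclass
class Attribution:
    file_name: str
    uploader: str
    source_url: str
    licence_url: str


# ASSUMPTION: mapping from Attribution fields to CSV column names;
# adjust it to match the header row of the metadata CSV.
COLUMNS = {
    "file_name": "file_name",
    "uploader": "manufacturer",
    "source_url": "sound_link",
    "licence_url": "license",
}


def read_attributions(csv_path, columns=COLUMNS, encoding="utf-8"):
    """Read one Attribution per row of a metadata CSV file."""
    with open(csv_path, newline="", encoding=encoding) as handle:
        return [
            Attribution(**{field: row[col] for field, col in columns.items()})
            for row in csv.DictReader(handle)
        ]


def credit_line(item):
    """Format a single human-readable credit line."""
    return (
        f"{item.file_name} by {item.uploader}, "
        f"{item.source_url} (licence: {item.licence_url})"
    )


# Example usage (the local path is hypothetical):
# for item in read_attributions("clotho_metadata_development.csv"):
#     print(credit_line(item))
```

<p>Passing the column mapping as a parameter keeps the sketch usable if the actual header names differ from the ones assumed here.</p>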
<p>The captions in the files:</p>
<ul>
<li>clotho_captions_development.csv</li>
<li>clotho_captions_validation.csv</li>
<li>clotho_captions_evaluation.csv</li>
</ul>
<p>are under the Tampere University licence, described in the LICENSE file (<strong><em>mainly a non-commercial licence with attribution</em></strong>). </p>
<p>--------------------------------------------------------------------------------------------------------</p>
<p><strong>== References ==</strong><br>
[1] Frederic Font, Gerard Roma, and Xavier Serra. 2013. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia (MM '13). ACM, New York, NY, USA, 411-412. DOI: https://doi.org/10.1145/2502081.2502245</p>
Zenodo
2021-05-26
info:eu-repo/semantics/other
3490683
user-tut-arg
user-audio-captioning
user-eu
2.1
award_title=Computational Analysis of Everyday Soundscapes; award_number=637422; award_identifiers_scheme=url; award_identifiers_identifier=https://cordis.europa.eu/projects/637422; funder_id=00k4n6c32; funder_name=European Commission;
1622341254.275726
1249726503
md5:4569624ccadf96223f19cb59fe4f849f
https://zenodo.org/records/4783391/files/clotho_audio_evaluation.7z
1336762
md5:d4090b39ce9f2491908eebf4d5b09bae
https://zenodo.org/records/4783391/files/clotho_captions_development.csv
224803
md5:2e010427c56b1ce6008b0f03f41048ce
https://zenodo.org/records/4783391/files/clotho_metadata_validation.csv
1260701425
md5:7dba730be08bada48bd15dc4e668df59
https://zenodo.org/records/4783391/files/clotho_audio_validation.7z
225311
md5:13946f054d4e1bf48079813aac61bf77
https://zenodo.org/records/4783391/files/clotho_metadata_evaluation.csv
1874
md5:38d422ac8a2c9c35c288232576e2e810
https://zenodo.org/records/4783391/files/LICENSE
4541582263
md5:c8b05bc7acdb13895bb3c6a29608667e
https://zenodo.org/records/4783391/files/clotho_audio_development.7z
830797
md5:170d20935ecfdf161ce1bb154118cda5
https://zenodo.org/records/4783391/files/clotho_metadata_development.csv
367649
md5:5879e023032b22a2c930aaa0528bead4
https://zenodo.org/records/4783391/files/clotho_captions_validation.csv
361995
md5:1b16b9e57cf7bdb7f13a13802aeb57e2
https://zenodo.org/records/4783391/files/clotho_captions_evaluation.csv
public
https://github.com/audio-captioning/clotho-dataset
Is supplemented by
url
https://arxiv.org/abs/1910.09387
Is supplement to
url
10.5281/zenodo.3490683
isVersionOf
doi