<?xml version='1.0' encoding='utf-8'?>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
  <dc:creator>Konstantinos Drossos</dc:creator>
  <dc:creator>Samuel Lipping</dc:creator>
  <dc:creator>Tuomas Virtanen</dc:creator>
  <dc:date>2019-10-15</dc:date>
  <dc:description>&amp;lt;p&amp;gt;Clotho is a novel audio captioning dataset consisting of 4981 audio samples, each with five captions (24 905 captions in total). Audio samples are 15 to 30 s long and captions are 8 to 20 words long.&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;Clotho is thoroughly described in our paper:&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;&amp;lt;em&amp;gt;K. Drossos, S. Lipping and T. Virtanen, &amp;quot;Clotho: an Audio Captioning Dataset,&amp;quot; IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 736-740, doi: 10.1109/ICASSP40776.2020.9052990.&amp;lt;/em&amp;gt;&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;The paper is available online at&amp;nbsp;&amp;lt;a href="https://arxiv.org/abs/1910.09387"&amp;gt;https://arxiv.org/abs/1910.09387&amp;lt;/a&amp;gt;&amp;nbsp;and at&amp;nbsp;&amp;lt;a href="https://ieeexplore.ieee.org/document/9052990"&amp;gt;https://ieeexplore.ieee.org/document/9052990&amp;lt;/a&amp;gt;.&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;&amp;lt;strong&amp;gt;If you use Clotho, please cite our paper.&amp;lt;/strong&amp;gt;&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;&amp;lt;strong&amp;gt;To use the dataset, you can use our code at:&amp;lt;/strong&amp;gt;&amp;nbsp;&amp;lt;a href="https://github.com/audio-captioning/clotho-dataset"&amp;gt;https://github.com/audio-captioning/clotho-dataset&amp;lt;/a&amp;gt;&amp;nbsp;&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;These are the files for the development and evaluation splits of the Clotho dataset.&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;--------------------------------------------------------------------------------------------------------&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;&amp;lt;strong&amp;gt;== Usage ==&amp;lt;/strong&amp;gt;&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;To use the dataset you have to:&amp;lt;/p&amp;gt;

&amp;lt;ol&amp;gt;
	&amp;lt;li&amp;gt;Download the audio files:&amp;nbsp;clotho_audio_development.7z and&amp;nbsp;clotho_audio_evaluation.7z&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;Download the files with the captions:&amp;nbsp;clotho_captions_development.csv and&amp;nbsp;clotho_captions_evaluation.csv&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;Download the files with the associated metadata:&amp;nbsp;clotho_metadata_development.csv and&amp;nbsp;clotho_metadata_evaluation.csv&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;Extract the audio files&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;Then you can use each audio file with its corresponding captions&amp;lt;/li&amp;gt;
&amp;lt;/ol&amp;gt;
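
&amp;lt;p&amp;gt;The last step above can be sketched in Python as follows. This is a minimal sketch, not the loader from the repository linked above: it assumes the audio archive has already been extracted, and that the captions CSV has a header row with a `file_name` column plus one column per caption whose name starts with `caption_`. These column names and the helper names `load_captions` and `pair_audio_with_captions` are assumptions for illustration; check the header of the downloaded file before relying on them.&amp;lt;/p&amp;gt;

```python
import csv
from pathlib import Path


def load_captions(csv_path, encoding="utf-8"):
    """Map each audio file name to its list of captions.

    Assumption: the CSV has a header row with a `file_name` column and
    caption columns whose names start with `caption_`.
    """
    captions = {}
    with open(csv_path, newline="", encoding=encoding) as handle:
        for row in csv.DictReader(handle):
            captions[row["file_name"]] = [
                text
                for column, text in row.items()
                # Skip malformed extra fields (column is None) and empty cells.
                if column and column.startswith("caption_") and text
            ]
    return captions


def pair_audio_with_captions(audio_dir, csv_path):
    """Yield (audio path, captions) for every listed file found on disk."""
    audio_dir = Path(audio_dir)
    for file_name, texts in load_captions(csv_path).items():
        audio_path = audio_dir / file_name
        # Files listed in the CSV but missing on disk are skipped.
        if audio_path.exists():
            yield audio_path, texts
```

&amp;lt;p&amp;gt;For example, `pair_audio_with_captions("clotho_audio_development", "clotho_captions_development.csv")` would iterate over the development split, assuming the archive was extracted into a directory of that (hypothetical) name.&amp;lt;/p&amp;gt;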

&amp;lt;p&amp;gt;--------------------------------------------------------------------------------------------------------&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;&amp;lt;strong&amp;gt;== License&amp;nbsp;==&amp;lt;/strong&amp;gt;&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;The audio files in the archives:&amp;lt;/p&amp;gt;

&amp;lt;ul&amp;gt;
	&amp;lt;li&amp;gt;clotho_audio_development.7z and&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;clotho_audio_evaluation.7z&amp;lt;/li&amp;gt;
&amp;lt;/ul&amp;gt;

&amp;lt;p&amp;gt;and the associated metadata in the CSV files:&amp;lt;/p&amp;gt;

&amp;lt;ul&amp;gt;
	&amp;lt;li&amp;gt;clotho_metadata_development.csv&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;clotho_metadata_evaluation.csv&amp;lt;/li&amp;gt;
&amp;lt;/ul&amp;gt;

&amp;lt;p&amp;gt;are under the corresponding licences (&amp;lt;strong&amp;gt;&amp;lt;em&amp;gt;mostly Creative Commons with attribution&amp;lt;/em&amp;gt;&amp;lt;/strong&amp;gt;) of the Freesound [1] platform, which are stated explicitly in the CSV files for each audio file. That is, each audio file in the 7z archives is listed in the metadata CSV files. The metadata for each file are:&amp;lt;/p&amp;gt;

&amp;lt;ul&amp;gt;
	&amp;lt;li&amp;gt;File name&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;Keywords&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;URL for the original audio file&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;Start and end samples of the excerpt that is used in the Clotho dataset&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;Uploader/user on the Freesound platform&amp;nbsp;(manufacturer)&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;Link to the licence of the file&amp;lt;/li&amp;gt;
&amp;lt;/ul&amp;gt;
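
&amp;lt;p&amp;gt;Because most of these licences require attribution, the fields listed above can be turned into one attribution line per audio file. The following is a minimal sketch, assuming the metadata CSV header uses the column names `file_name`, `manufacturer`, `sound_link` and `license`. These column names and the helper name `attribution_lines` are assumptions for illustration and are not confirmed by this record; adjust them to the header of the downloaded file.&amp;lt;/p&amp;gt;

```python
import csv


def attribution_lines(metadata_csv_path, encoding="utf-8"):
    """Build one attribution line per audio file from the metadata CSV.

    Assumption: the header contains the columns `file_name`,
    `manufacturer` (the Freesound uploader), `sound_link` (URL of the
    original audio file) and `license` (link to the licence).
    """
    lines = []
    with open(metadata_csv_path, newline="", encoding=encoding) as handle:
        for row in csv.DictReader(handle):
            lines.append(
                "{file} by {user} ({url}), licence: {licence}".format(
                    file=row["file_name"],
                    user=row["manufacturer"],
                    url=row["sound_link"],
                    licence=row["license"],
                )
            )
    return lines
```

&amp;lt;p&amp;gt;If the file is not UTF-8 encoded, pass a different `encoding` argument. A missing column raises a `KeyError`, which makes a mismatch with the assumed header visible immediately.&amp;lt;/p&amp;gt;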

&amp;lt;p&amp;gt;The captions in the files:&amp;lt;/p&amp;gt;

&amp;lt;ul&amp;gt;
	&amp;lt;li&amp;gt;clotho_captions_development.csv&amp;lt;/li&amp;gt;
	&amp;lt;li&amp;gt;clotho_captions_evaluation.csv&amp;lt;/li&amp;gt;
&amp;lt;/ul&amp;gt;

&amp;lt;p&amp;gt;are under the Tampere University licence, described in the LICENCE file (&amp;lt;strong&amp;gt;&amp;lt;em&amp;gt;mainly a non-commercial licence with attribution&amp;lt;/em&amp;gt;&amp;lt;/strong&amp;gt;).&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;--------------------------------------------------------------------------------------------------------&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;&amp;lt;strong&amp;gt;== References ==&amp;lt;/strong&amp;gt;&amp;lt;br&amp;gt;
[1]&amp;nbsp;Frederic Font, Gerard Roma, and Xavier Serra. 2013. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia (MM &amp;#39;13). ACM, New York, NY, USA, 411-412. DOI: https://doi.org/10.1145/2502081.2502245&amp;lt;/p&amp;gt;</dc:description>
  <dc:identifier>https://doi.org/10.5281/zenodo.3490684</dc:identifier>
  <dc:identifier>oai:zenodo.org:3490684</dc:identifier>
  <dc:language>eng</dc:language>
  <dc:publisher>Zenodo</dc:publisher>
  <dc:relation>https://github.com/audio-captioning/clotho-dataset</dc:relation>
  <dc:relation>https://arxiv.org/abs/1910.09387</dc:relation>
  <dc:relation>https://zenodo.org/communities/tut-arg</dc:relation>
  <dc:relation>https://zenodo.org/communities/audio-captioning</dc:relation>
  <dc:relation>https://zenodo.org/communities/eu</dc:relation>
  <dc:relation>https://doi.org/10.5281/zenodo.3490683</dc:relation>
  <dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
  <dc:rights>Other (Attribution)</dc:rights>
  <dc:subject>Clotho</dc:subject>
  <dc:subject>Audio captioning</dc:subject>
  <dc:subject>Dataset</dc:subject>
  <dc:subject>Audio processing</dc:subject>
  <dc:subject>Signal processing</dc:subject>
  <dc:subject>Machine listening</dc:subject>
  <dc:subject>Computational auditory scene analysis</dc:subject>
  <dc:subject>Captioning</dc:subject>
  <dc:subject>Deep learning</dc:subject>
  <dc:title>Clotho dataset</dc:title>
  <dc:type>info:eu-repo/semantics/other</dc:type>
</oai_dc:dc>