Cigdem Beyan
Muhammad Shahid
Vittorio Murino
2020-07-02
<p><strong>RealVAD: A Real-world Dataset for Voice Activity Detection</strong></p>
<p>The task of automatically detecting “Who is Speaking and When” is broadly referred to as Voice Activity Detection (VAD). Automatic VAD is an important task that underpins several domains, e.g., human-human and human-computer/robot/virtual-agent interaction analysis, as well as industrial applications.</p>
<p>The RealVAD dataset is constructed from a YouTube video of a panel discussion lasting approximately 83 minutes. The audio is available as a single channel. One static camera captures all panelists, the moderator, and the audience.</p>
<p>Particular aspects of the RealVAD dataset are:</p>
<ul>
<li> It is composed of panelists of different nationalities (British, Dutch, French, German, Italian, American, Mexican, Colombian, Thai). This allows studying the effect of variety in ethnic origin on automatic VAD.</li>
<li> The panel is gender-balanced, with four female and five male panelists.</li>
<li> The panelists sit in two rows and may gaze at the audience, other panelists, their laptop, the moderator, or anywhere in the room, whether speaking or not. They were therefore captured not only in frontal view but also in side view, depending on their posture and head orientation at each moment.</li>
<li> The panelists move freely and perform various spontaneous actions (e.g., drinking water, checking their cell phone, using their laptop), resulting in different postures.</li>
<li> The panelists’ body parts are sometimes partially occluded by their own or others’ body parts or belongings (e.g., a laptop).</li>
<li> There are also natural changes in illumination, and shadows appear on the wall behind the panelists in the back row.</li>
<li> For the panelists sitting in the front row in particular, background motion sometimes occurs when the person(s) behind them move.</li>
</ul>
<p>The annotations include:</p>
<ul>
<li> Upper-body detections of the nine panelists in bounding-box form.</li>
<li> Associated VAD ground truth (speaking, not-speaking) for the nine panelists.</li>
<li> Acoustic features extracted from the video: MFCC and raw filterbank energies.</li>
</ul>
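<p>As an illustration of how speaking/not-speaking ground truth can answer "when" someone speaks, the following minimal sketch turns a per-frame label sequence into timed speaking segments. It assumes the labels for one panelist are already loaded as a list of 0/1 values (1 = speaking) and that the video frame rate is known. The function name <code>labels_to_segments</code>, the label encoding, and the frame rate are hypothetical and are not taken from the dataset files; the actual file layout is described in the ReadMe.txt.</p>

```python
from itertools import groupby


def labels_to_segments(labels, fps):
    """Convert per-frame 0/1 labels into (start_sec, end_sec) speaking segments.

    `labels` is an assumed encoding (1 = speaking, 0 = not-speaking);
    `fps` is the frame rate of the video, supplied by the caller.
    """
    segments = []
    frame = 0  # index of the first frame of the current run
    for is_speaking, run in groupby(labels):
        length = len(list(run))
        if is_speaking:
            # The end time is exclusive: it is the start of the next run.
            segments.append((frame / fps, (frame + length) / fps))
        frame += length
    return segments


# Hypothetical labels for one panelist (not real dataset values).
example = [0, 0, 1, 1, 1, 0, 1, 1]
print(labels_to_segments(example, fps=25.0))
```

<p>Applying the same function to each of the nine panelists would give one segment list per person, which can then be compared against the output of a VAD method.</p>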
<p><em>All information regarding the annotations is given in the ReadMe.txt and Acoustic Features README.txt files.</em></p>
<p><strong>When using this dataset for your research, please cite the following paper in your publication:</strong></p>
<ol>
<li>C. Beyan, M. Shahid and V. Murino, "RealVAD: A Real-world Dataset and A Method for Voice Activity Detection by Body Motion Analysis", in IEEE Transactions on Multimedia, 2020.</li>
</ol>
https://doi.org/10.5281/zenodo.3928151
oai:zenodo.org:3928151
eng
Zenodo
https://doi.org/10.5281/zenodo.3928150
info:eu-repo/semantics/openAccess
Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
voice activity detection
dataset
nonverbal behavior
RealVAD: A Real-world Dataset for Voice Activity Detection
info:eu-repo/semantics/other