Dataset Open Access

Webis YouTube 8M Augmented 2018

Jiani Qu; Anny Marleen Hißbach; Tim Gollub; Martin Potthast


MARC21 XML Export

<?xml version='1.0' encoding='UTF-8'?>
<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nmm##2200000uu#4500</leader>
  <controlfield tag="005">20200328082018.0</controlfield>
  <controlfield tag="001">3724807</controlfield>
  <datafield tag="711" ind1=" " ind2=" ">
    <subfield code="d">5-8 July 2018</subfield>
    <subfield code="g">HCOMP</subfield>
    <subfield code="a">The Sixth AAAI Conference on Human Computation and Crowdsourcing</subfield>
    <subfield code="c">Zurich</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">Bauhaus-Universität Weimar</subfield>
    <subfield code="a">Anny Marleen Hißbach</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">Bauhaus-Universität Weimar</subfield>
    <subfield code="0">(orcid)0000-0003-1737-6517</subfield>
    <subfield code="a">Tim Gollub</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">Leipzig University</subfield>
    <subfield code="0">(orcid)0000-0003-2451-0665</subfield>
    <subfield code="a">Martin Potthast</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">995427910</subfield>
    <subfield code="z">md5:8dc641a8eda0952b9b9a1f072709c558</subfield>
    <subfield code="u">https://zenodo.org/record/3724807/files/captions-list.zip</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">4773743236</subfield>
    <subfield code="z">md5:8fb3191a31242ff485d49eab091022cd</subfield>
    <subfield code="u">https://zenodo.org/record/3724807/files/captions.zip</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">48267380016</subfield>
    <subfield code="z">md5:9c6407509e780eac43b1ba0f4e01c1b2</subfield>
    <subfield code="u">https://zenodo.org/record/3724807/files/comments.zip</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">7878839249</subfield>
    <subfield code="z">md5:62d05a8cea6d772b1288e575c9337e18</subfield>
    <subfield code="u">https://zenodo.org/record/3724807/files/metadata.zip</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">111905277832</subfield>
    <subfield code="z">md5:37a791ddf259b546f2f9c66ea4924e55</subfield>
    <subfield code="u">https://zenodo.org/record/3724807/files/thumbnails.zip</subfield>
  </datafield>
  <datafield tag="542" ind1=" " ind2=" ">
    <subfield code="l">open</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="y">Conference website</subfield>
    <subfield code="u">https://www.humancomputation.com/2018/</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="c">2018-07-05</subfield>
  </datafield>
  <datafield tag="909" ind1="C" ind2="O">
    <subfield code="p">openaire_data</subfield>
    <subfield code="p">user-webis</subfield>
    <subfield code="o">oai:zenodo.org:3724807</subfield>
  </datafield>
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="u">Bauhaus-Universität Weimar</subfield>
    <subfield code="a">Jiani Qu</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">Webis YouTube 8M Augmented 2018</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">user-webis</subfield>
  </datafield>
  <datafield tag="540" ind1=" " ind2=" ">
    <subfield code="u">https://creativecommons.org/licenses/by/4.0/legalcode</subfield>
    <subfield code="a">Creative Commons Attribution 4.0 International</subfield>
  </datafield>
  <datafield tag="650" ind1="1" ind2="7">
    <subfield code="a">cc-by</subfield>
    <subfield code="2">opendefinition.org</subfield>
  </datafield>
  <datafield tag="520" ind1=" " ind2=" ">
    <subfield code="a">&lt;p&gt;We used the YouTube Data API&amp;nbsp;to augment the&amp;nbsp;&lt;a href="https://research.google.com/youtube8m/"&gt;YouTube 8M&lt;/a&gt;&amp;nbsp;corpus by crawling a variety of meta data for the videos.&lt;/p&gt;

&lt;p&gt;The first point of interest was the &amp;quot;video resource,&amp;quot;&amp;nbsp;which comprises data about the video, such as the video&amp;rsquo;s title, description, uploader name, tags, view count, and more. The metadata also indicates whether comments have been left for the video. If so, we downloaded them as well, including information about their authors, likes, dislikes, and responses.&lt;/p&gt;

&lt;p&gt;No property specifies a video&amp;rsquo;s&amp;nbsp;language, since this information is not mandatory when uploading a video. Also, the API provides only information about the available captions, but not the captions themselves: only the uploader of a video is given access to its captions via the API. We therefore extracted them using &lt;a href="https://ytdl-org.github.io/youtube-dl/"&gt;youtube-dl&lt;/a&gt;.&amp;nbsp;For each video, we downloaded all manually created captions, as well as the auto-generated captions in the &amp;quot;default&amp;quot;&amp;nbsp;language and in English. The &amp;quot;default&amp;quot;&amp;nbsp;auto-generated caption gives perhaps the only hint at a video&amp;rsquo;s original language.&lt;/p&gt;

&lt;p&gt;Finally, we downloaded all thumbnails used to advertise a video, which are not available via the API, but only via a canonical URL. Our corpus makes it possible to recreate the way a video is presented on YouTube (metadata and thumbnail), what the actual content is ((sub)titles and descriptions), and how its viewers reacted (comments).&lt;br&gt;
&lt;br&gt;
If you use this dataset in your publication, &lt;strong&gt;please cite the dataset as outlined in the right column.&lt;/strong&gt;&lt;/p&gt;</subfield>
  </datafield>
  <datafield tag="773" ind1=" " ind2=" ">
    <subfield code="n">doi</subfield>
    <subfield code="i">isVersionOf</subfield>
    <subfield code="a">10.5281/zenodo.3724806</subfield>
  </datafield>
  <datafield tag="024" ind1=" " ind2=" ">
    <subfield code="a">10.5281/zenodo.3724807</subfield>
    <subfield code="2">doi</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">dataset</subfield>
  </datafield>
</record>
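
The file entries in the record above (the 856 fields carrying a size in subfield `s`, an `md5:` checksum in subfield `z`, and a URL in subfield `u`) can be read programmatically. The following is a minimal sketch using only the Python standard library, assuming the export has been saved as a string; the helper names `list_files` and `md5_matches` are illustrative and not part of any Zenodo tooling.

```python
import hashlib
import xml.etree.ElementTree as ET

# Namespace declared on the <record> element of the MARC21 XML export.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}


def list_files(marc_xml):
    """Return (url, size_in_bytes, md5_hex) for each 856 field with an md5 checksum."""
    root = ET.fromstring(marc_xml)
    files = []
    for field in root.findall(".//marc:datafield[@tag='856']", MARC_NS):
        subfields = {
            sf.get("code"): (sf.text or "")
            for sf in field.findall("marc:subfield", MARC_NS)
        }
        checksum = subfields.get("z", "")
        if not checksum.startswith("md5:"):
            # 856 fields without a checksum (e.g. the conference website) are skipped.
            continue
        files.append((subfields["u"], int(subfields["s"]), checksum[len("md5:"):]))
    return files


def md5_matches(path, expected_md5, chunk_size=1 << 20):
    """Hash a local file in chunks and compare it with the checksum from the record."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5
```

Reading in chunks matters here because the archives range from roughly 1 GB to over 100 GB according to the sizes listed in the record. After downloading an archive, `md5_matches("captions-list.zip", md5_hex)` with the value returned by `list_files` would confirm that the local copy is intact.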
