Conference paper Open Access

Incorporating Textual Similarity in Video Captioning Schemes

Gkountakos, Konstantinos; Dimou, Anastasios; Papadopoulos, Georgios Th.; Daras, Petros


MARC21 XML Export

<?xml version='1.0' encoding='UTF-8'?>
<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nam##2200000uu#4500</leader>
  <datafield tag="942" ind1=" " ind2=" ">
    <subfield code="a">2020-02-12</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">Video captioning</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">Word2Vec</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">Textual information</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">Encoder-decoder</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">Recurrent Neural Network</subfield>
  </datafield>
  <controlfield tag="005">20200212072052.0</controlfield>
  <controlfield tag="001">3560519</controlfield>
  <datafield tag="711" ind1=" " ind2=" ">
    <subfield code="d">17-19 June 2019</subfield>
    <subfield code="g">ICE/ITMC</subfield>
    <subfield code="a">2019 IEEE International Conference on Engineering, Technology and Innovation</subfield>
    <subfield code="c">Valbonne Sophia-Antipolis, France</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">CERTH</subfield>
    <subfield code="a">Dimou, Anastasios</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">CERTH</subfield>
    <subfield code="a">Papadopoulos, Georgios Th.</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="a">Daras, Petros</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">392781</subfield>
    <subfield code="z">md5:1e2476051c9520268499e8c78c185b2c</subfield>
    <subfield code="u">https://zenodo.org/record/3560519/files/Incorporating Textual Similarity in Video Captioning Schemes.pdf</subfield>
  </datafield>
  <datafield tag="542" ind1=" " ind2=" ">
    <subfield code="l">open</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="c">2019-08-12</subfield>
  </datafield>
  <datafield tag="909" ind1="C" ind2="O">
    <subfield code="p">openaire</subfield>
    <subfield code="o">oai:zenodo.org:3560519</subfield>
  </datafield>
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="u">CERTH</subfield>
    <subfield code="a">Gkountakos, Konstantinos</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">Incorporating Textual Similarity in Video Captioning Schemes</subfield>
  </datafield>
  <datafield tag="536" ind1=" " ind2=" ">
    <subfield code="c">787061</subfield>
    <subfield code="a">Advanced tools for fighting oNline Illegal TrAfficking</subfield>
  </datafield>
  <datafield tag="540" ind1=" " ind2=" ">
    <subfield code="u">https://creativecommons.org/licenses/by/4.0/legalcode</subfield>
    <subfield code="a">Creative Commons Attribution 4.0 International</subfield>
  </datafield>
  <datafield tag="650" ind1="1" ind2="7">
    <subfield code="a">cc-by</subfield>
    <subfield code="2">opendefinition.org</subfield>
  </datafield>
  <datafield tag="520" ind1=" " ind2=" ">
    <subfield code="a">&lt;p&gt;The problem of video captioning has been heavily investigated from the research community the last years and, especially, since Recurrent Neural Networks (RNNs) have been introduced. Aforementioned approaches of video captioning, are usually based on sequence-to-sequence models that aim to exploit the visual information by detecting events, objects, or via matching entities to words. However, the exploitation of the contextual information that can be extracted from the vocabulary has not been investigated yet, except from approaches that make use of parts of speech such as verbs, nouns, and adjectives. The proposed approach is based on the assumption that textually similar captions should represent similar visual content. Specifically, we propose a novel loss function that penalizes/rewards the wrong/correct predicted words based on the semantic cluster that they belong to. The proposed method is evaluated using two widely-known datasets in the video captioning domain, Microsoft Research - Video to Text (MSR-VTT) and Microsoft Research Video Description Corpus (MSVD). Finally, experimental analysis proves that the proposed method outperforms the baseline approach in most cases.&lt;/p&gt;</subfield>
  </datafield>
  <datafield tag="024" ind1=" " ind2=" ">
    <subfield code="a">10.1109/ICE.2019.8792602</subfield>
    <subfield code="2">doi</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">publication</subfield>
    <subfield code="b">conferencepaper</subfield>
  </datafield>
</record>
33
18
views
downloads
Views 33
Downloads 18
Data volume 7.1 MB
Unique views 22
Unique downloads 18

Share

Cite as