Dataset Restricted Access

2014 ImageCLEF WEBUPV Collection

Villegas, Mauricio; Paredes, Roberto


JSON Export

{
  "owners": [
    19451
  ], 
  "doi": "10.5281/zenodo.259758", 
  "stats": {
    "version_unique_downloads": 13.0, 
    "unique_views": 263.0, 
    "views": 288.0, 
    "version_views": 288.0, 
    "unique_downloads": 13.0, 
    "version_unique_views": 263.0, 
    "volume": 110167422414.0, 
    "version_downloads": 62.0, 
    "downloads": 62.0, 
    "version_volume": 110167422414.0
  }, 
  "links": {
    "latest_html": "https://zenodo.org/record/259758", 
    "doi": "https://doi.org/10.5281/zenodo.259758", 
    "badge": "https://zenodo.org/badge/doi/10.5281/zenodo.259758.svg", 
    "html": "https://zenodo.org/record/259758", 
    "latest": "https://zenodo.org/api/records/259758"
  }, 
  "created": "2017-01-25T17:52:13.421037+00:00", 
  "updated": "2020-01-24T19:21:42.466403+00:00", 
  "conceptrecid": "748690", 
  "revision": 7, 
  "id": 259758, 
  "metadata": {
    "access_right_category": "danger", 
    "doi": "10.5281/zenodo.259758", 
    "description": "<p>This document describes the WEBUPV dataset compiled for the ImageCLEF 2014<br>\nScalable Concept Image Annotation challenge. The data mentioned here indicates what<br>\nis ready for download. However, upon request or depending on feedback from the<br>\nparticipants, additional data may be released.</p>\n\n<p>The following is the directory structure of the collection, and below there<br>\nis a brief description of what each compressed file contains. The<br>\ncorresponding MD5 checksums of the files shown (for verifying a correct<br>\ndownload) can be found in md5sums.txt.</p>\n\n<p>Directory structure<br>\n-------------------</p>\n\n<p>.<br>\n|<br>\n|--- README.txt<br>\n|--- md5sums.txt<br>\n|--- webupv14_train_lists.zip<br>\n|--- webupv14_train2_lists.zip<br>\n|--- webupv14_devel_lists.zip<br>\n|--- webupv14_test_lists.zip<br>\n|--- webupv14_baseline.zip<br>\n|<br>\n|--- feats_textual/<br>\n|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |<br>\n|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |--- webupv14_{train|train2}_textual_pages.zip<br>\n|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |--- webupv14_{train|train2}_textual.scofeat.gz<br>\n|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |--- webupv14_{train|train2}_textual.keywords.gz<br>\n|<br>\n|--- feats_visual/<br>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |<br>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |--- webupv14_{train|train2|devel|test}_visual_images.zip<br>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |--- webupv14_{train|train2|devel|test}_visual_gist2.feat.gz<br>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |--- webupv14_{train|train2|devel|test}_visual_sift_1000.feat.gz<br>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |--- webupv14_{train|train2|devel|test}_visual_csift_1000.feat.gz<br>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |--- webupv14_{train|train2|devel|test}_visual_rgbsift_1000.feat.gz<br>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |--- webupv14_{train|train2|devel|test}_visual_opponentsift_1000.feat.gz<br>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |--- 
webupv14_{train|train2|devel|test}_visual_colorhist.feat.gz<br>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |--- webupv14_{train|train2|devel|test}_visual_getlf.feat.gz</p>\n\n<p><br>\nContents of files<br>\n-----------------</p>\n\n<p>* webupv14_train{|2}_lists.zip</p>\n\n<p>&nbsp; The first training set (&quot;train_*&quot;) includes images for the concepts of the<br>\n&nbsp; development set, whereas the second training set (&quot;train2_*&quot;) includes<br>\n&nbsp; images for the concepts in the test set that are not in the development set.</p>\n\n<p>&nbsp; -&gt; train{|2}_iids.txt : IDs of the images (IIDs) in the training set.</p>\n\n<p>&nbsp; -&gt; train{|2}_rids.txt : IDs of the webpages (RIDs) in the training set.</p>\n\n<p>&nbsp; -&gt; train{|2}_*urls.txt : The original URLs from where the images (iurls)<br>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; and the webpages (rurls) were downloaded. Each line in the file<br>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; corresponds to an image, starting with the IID and is followed<br>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; by one or more URLs.</p>\n\n<p>&nbsp; -&gt; train{|2}_rimgsrc.txt : The URLs of the images as referenced in each<br>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; of the webpages. Each line of the file is of the form: IID RID<br>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; URL1 [URL2 ...]. This information is necessary to locate the<br>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; images within the webpages and it can also be useful as a<br>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; textual feature.</p>\n\n<p>* webupv14_{devel|test}_lists.zip</p>\n\n<p>&nbsp; -&gt; {devel|test}_conceptlists.txt : Lists per image of concepts for<br>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; annotation. Each line starts with an image ID and is followed by the<br>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; list of concepts in alphabetical order. Each ID may appear more than<br>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; once. 
In total there are 1940 image annotation lists for the<br>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; development set and 7291 image annotation lists for the test set. These<br>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; correspond to 1000 and 4122 unique images (IDs) for the development and<br>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; test sets, respectively.</p>\n\n<p>&nbsp; -&gt; {devel|test}_allconcepts.txt : Complete list of concepts for the<br>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; development/test set.</p>\n\n<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; The concepts are defined by one or more WordNet synsets, which is<br>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; intended to make it possible to easily obtain more information about<br>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; the concepts, e.g. synonyms. In the concept list, the first column<br>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (which is the name of the concept) indicates the word to search in<br>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; WordNet, the second column the synset type (either noun or adjective),<br>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; the third column is the sense number and the fourth column is the<br>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; WordNet offset (although this cannot be trusted since it changes<br>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; between WordNet versions). For most of the concepts there is a fifth<br>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; column which is a Wikipedia article related to the concept.</p>\n\n<p>&nbsp; -&gt; {devel|test}_groundtruth.txt : Ground truth concepts for the development<br>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; and test sets.</p>\n\n<p>&nbsp; -&gt; {devel|test}_*urls.txt : The original URLs from where the images (iurls)<br>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; and the webpages (rurls) were downloaded. 
Each line in the file<br>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; corresponds to an image, starting with the IID followed by one<br>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; or more URLs.</p>\n\n<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Note: These are included only to acknowledge the source of the<br>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; data, not to be used as input to the annotation systems.</p>\n\n<p><br>\n* webupv14_baseline.zip</p>\n\n<p>&nbsp; An archive that includes code for computing the evaluation measures<br>\n&nbsp; for two baseline techniques. See the included README.txt for<br>\n&nbsp; details.</p>\n\n<p><br>\n* feats_textual/webupv14_train{|2}_textual_pages.zip</p>\n\n<p>&nbsp; Contains all of the webpages which referenced the images in the<br>\n&nbsp; training set after being converted to valid xml. In total there are<br>\n&nbsp; 262588 files, since each image can appear in more than one page, and<br>\n&nbsp; there can be several versions of the same page which differ by the<br>\n&nbsp; method of conversion to xml. To avoid having too many files in a<br>\n&nbsp; single directory (which is an issue for some types of partitions),<br>\n&nbsp; the files are found in subdirectories named using the first two<br>\n&nbsp; characters of the RID, thus the paths of the files after extraction<br>\n&nbsp; are of the form:</p>\n\n<p>&nbsp;&nbsp;&nbsp; ./WEBUPV/pages/{RID:0:2}/{RID}.{CONVM}.xml.gz</p>\n\n<p>&nbsp; To be able to locate the training images within the webpages, the<br>\n&nbsp; URLs of the images as referenced are provided in the file<br>\n&nbsp; train_rimgsrc.txt.</p>\n\n<p>* feats_textual/webupv14_train{|2}_textual.scofeat.gz</p>\n\n<p>&nbsp; The processed text extracted from the webpages near where the images<br>\n&nbsp; appeared. Each line corresponds to one image, having the same order<br>\n&nbsp; as the train_iids.txt list. 
The lines start with the image ID,<br>\n&nbsp; followed by the number of extracted unique words and the<br>\n&nbsp; corresponding word-score pairs. The scores were derived taking into<br>\n&nbsp; account 1) the term frequency (TF), 2) the document object model<br>\n&nbsp; (DOM) attributes, and 3) the word distance to the image. The scores<br>\n&nbsp; are all integers and for each image the sum of scores is always<br>\n&nbsp; &lt;=100000 (i.e. it is normalized).</p>\n\n<p><br>\n* feats_textual/webupv14_train{|2}_textual.keywords.gz</p>\n\n<p>&nbsp; The words used to find the images when querying image search<br>\n&nbsp; engines. Each line corresponds to an image (in the same order as in<br>\n&nbsp; train_iids.txt). The lines are composed of triplets:</p>\n\n<p>&nbsp;&nbsp;&nbsp; [keyword] [rank] [search_engine]</p>\n\n<p>&nbsp; where [keyword] is the word used to find the image, [rank] is the<br>\n&nbsp; position given to the image in the query, and [search_engine] is a<br>\n&nbsp; single character indicating in which search engine it was found<br>\n&nbsp; (&#39;g&#39;:google, &#39;b&#39;:bing, &#39;y&#39;:yahoo).</p>\n\n<p><br>\n* feats_visual/webupv14_*_images.zip</p>\n\n<p>&nbsp; Contains thumbnails (maximum 640 pixels of either width or height)<br>\n&nbsp; of the images in jpeg format. To avoid having too many files in a<br>\n&nbsp; single directory (which is an issue for some types of partitions),<br>\n&nbsp; the files are found in subdirectories named using the first two<br>\n&nbsp; characters of the IID, thus the paths of the files after extraction<br>\n&nbsp; are of the form:</p>\n\n<p>&nbsp;&nbsp;&nbsp; ./WEBUPV/images/{IID:0:2}/{IID}.jpg</p>\n\n<p>* feats_visual/webupv14_*.feat.gz</p>\n\n<p>&nbsp; The visual features in a simple ASCII text sparse format. The first<br>\n&nbsp; line of the file indicates the number of vectors (N) and the<br>\n&nbsp; dimensionality (DIMS). 
Then each line corresponds to one vector,<br>\n&nbsp; starting with the number of non-zero elements and followed by pairs<br>\n&nbsp; of dimension-value, with the first dimension being 0. In summary the file<br>\n&nbsp; format is:</p>\n\n<p>&nbsp;&nbsp;&nbsp; N DIMS<br>\n&nbsp;&nbsp;&nbsp; nz1 Dim(1,1) Val(1,1) ... Dim(1,nz1) Val(1,nz1)<br>\n&nbsp;&nbsp;&nbsp; nz2 Dim(2,1) Val(2,1) ... Dim(2,nz2) Val(2,nz2)<br>\n&nbsp;&nbsp;&nbsp; ...<br>\n&nbsp;&nbsp;&nbsp; nzN Dim(N,1) Val(N,1) ... Dim(N,nzN) Val(N,nzN)</p>\n\n<p>&nbsp; The order of the features is the same as in the lists<br>\n&nbsp; devel_conceptlists.txt, test_conceptlists.txt and train_iids.txt.</p>\n\n<p>&nbsp; The procedure to extract the SIFT based features in this<br>\n&nbsp; subdirectory was conducted as follows. Using the ImageMagick<br>\n&nbsp; software, the images were first rescaled to a maximum of 240<br>\n&nbsp; pixels in both width and height, while preserving the original<br>\n&nbsp; aspect ratio, employing the command:</p>\n\n<p>&nbsp;&nbsp;&nbsp; convert {IMGIN}.jpg -resize &#39;240&gt;x240&gt;&#39; {IMGOUT}.jpg</p>\n\n<p>&nbsp; Then the SIFT features were extracted using the ColorDescriptor<br>\n&nbsp; software from Koen van de Sande<br>\n&nbsp; (http://koen.me/research/colordescriptors). As configuration we<br>\n&nbsp; used the &#39;densesampling&#39; detector with default parameters, and a hard<br>\n&nbsp; assignment codebook using a spatial pyramid as<br>\n&nbsp; &#39;pyramid-1x1-2x2&#39;. The number in the file name indicates the size of<br>\n&nbsp; the codebook. All of the vectors of the spatial pyramid are given in<br>\n&nbsp; the same line, thus keeping only the first 1/5th of the dimensions<br>\n&nbsp; would be like not using the spatial pyramid. The codebook was<br>\n&nbsp; generated using 1.25 million randomly selected features and the<br>\n&nbsp; k-means algorithm. The GIST features were extracted using the<br>\n&nbsp; LabelMe Toolbox. 
The images were first resized to 256x256 ignoring<br>\n&nbsp; original aspect ratio, using 5 scales, 6 orientations and 4<br>\n&nbsp; blocks. The other features, colorhist and getlf, are both color<br>\n&nbsp; histogram based and were extracted using our own implementation.</p>\n\n<p>&nbsp;</p>", 
    "title": "2014 ImageCLEF WEBUPV Collection", 
    "relations": {
      "version": [
        {
          "count": 1, 
          "index": 0, 
          "parent": {
            "pid_type": "recid", 
            "pid_value": "748690"
          }, 
          "is_last": true, 
          "last_child": {
            "pid_type": "recid", 
            "pid_value": "259758"
          }
        }
      ]
    }, 
    "access_conditions": "<p>This dataset is available under a Creative Commons Attribution-<br>\nNonCommercial-ShareAlike 3.0 Unported License. Before downloading<br>\nthe data, please read and accept the Creative Commons License and<br>\nthe following usage agreement:</p>\n\n<p>Data Usage Agreement ImageCLEF 2012/2013/2014/2015/2016 WEBUPV Image<br>\nAnnotation Datasets</p>\n\n<p>By downloading the &quot;Dataset&quot;, you (the &quot;Researcher&quot;) agrees to the<br>\nfollowing terms.</p>\n\n<p>* The Researcher will only use the Dataset for non-commercial<br>\nresearch and/or educational purposes.</p>\n\n<p>* The Researcher will cite one of the following papers in any<br>\npublication that makes use of the Dataset.</p>\n\n<p>&nbsp; Gilbert, A., Piras, L., Wang, J., Yan, F., Ramisa, A.,<br>\n&nbsp; Dellandrea, E., Gaizauskas, R., Villegas, M., Mikolajczyk, K.:<br>\n&nbsp; Overview of the ImageCLEF 2016 scalable concept image<br>\n&nbsp; annotation task. In: CLEF2016 Working Notes, CEUR Workshop<br>\n&nbsp; Proceedings, CEUR-WS.org, &Eacute;vora, Portugal, 5&ndash;8 September 2016</p>\n\n<p>&nbsp; Gilbert, A., Piras, L., Wang, J., Yan, F., Dellandrea, E.,<br>\n&nbsp; Gaizauskas, R., Villegas, M., Mikolajczyk, K.: Overview of the<br>\n&nbsp; ImageCLEF 2015 Scalable Image Annotation, Localization and<br>\n&nbsp; Sentence Generation task. In: CLEF2015 Working Notes. CEUR<br>\n&nbsp; Workshop Proceedings, CEUR-WS.org, Toulouse, France (September<br>\n&nbsp; 8-11 2015)</p>\n\n<p>&nbsp; Villegas, M., Paredes, R.: Overview of the ImageCLEF 2014<br>\n&nbsp; Scalable Concept Image Annotation Task. In: CLEF2014 Working<br>\n&nbsp; Notes. CEUR Workshop Proceedings, vol. 1180, pp. 308&ndash;328.<br>\n&nbsp; CEUR-WS.org, Sheffield, UK (September 15-18 2014)</p>\n\n<p>&nbsp; Villegas, M., Paredes, R., Thomee, B.: Overview of the ImageCLEF<br>\n&nbsp; 2013 Scalable Concept Image Annotation Subtask. In: CLEF 2013<br>\n&nbsp; Evaluation Labs and Workshop, Online Working Notes. 
Valencia,<br>\n&nbsp; Spain (September 23-26 2013)</p>\n\n<p>* The Researcher may provide research associates and colleagues a<br>\ncopy of the Dataset provided that they also agree to this Data<br>\nUsage Agreement.</p>\n\n<p>* The Researcher will assume all responsibility against any claims<br>\narising from Researcher&#39;s use of the Dataset.</p>", 
    "grants": [
      {
        "code": "600707", 
        "links": {
          "self": "https://zenodo.org/api/grants/10.13039/501100000780::600707"
        }, 
        "title": "tranScriptorium", 
        "acronym": "TRANSCRIPTORIUM", 
        "program": "FP7", 
        "funder": {
          "doi": "10.13039/501100000780", 
          "acronyms": [], 
          "name": "European Commission", 
          "links": {
            "self": "https://zenodo.org/api/funders/10.13039/501100000780"
          }
        }
      }
    ], 
    "communities": [
      {
        "id": "ecfunded"
      }, 
      {
        "id": "imageclef"
      }
    ], 
    "publication_date": "2014-04-01", 
    "creators": [
      {
        "affiliation": "Universitat Politecnica de Valencia", 
        "name": "Villegas, Mauricio"
      }, 
      {
        "affiliation": "Universitat Politecnica de Valencia", 
        "name": "Paredes, Roberto"
      }
    ], 
    "meeting": {
      "acronym": "CLEF", 
      "url": "http://clef2014.clef-initiative.eu/", 
      "dates": "15-18 September 2014", 
      "place": "Sheffield, UK", 
      "title": "Conference and Labs of the Evaluation Forum"
    }, 
    "access_right": "restricted", 
    "resource_type": {
      "type": "dataset", 
      "title": "Dataset"
    }, 
    "related_identifiers": [
      {
        "scheme": "url", 
        "identifier": "http://ceur-ws.org/Vol-1180/CLEF2014wn-Image-VillegasEt2014.pdf", 
        "relation": "isSupplementTo"
      }, 
      {
        "scheme": "url", 
        "identifier": "http://imageclef.org/2014/annotation", 
        "relation": "isSupplementTo"
      }
    ]
  }
}
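
The sparse ASCII layout that the description gives for the feats_visual/*.feat.gz files, and the {IID:0:2} subdirectory scheme for the thumbnails, can be illustrated with a short Python sketch. This is not code shipped with the collection: the function names read_sparse_feats, load_feats and image_path are hypothetical, and the sketch assumes that fields are separated by whitespace and that an all-zero vector is written as a line containing only 0.

```python
import gzip
import io
from typing import List, TextIO, Tuple


def read_sparse_feats(stream: TextIO) -> Tuple[int, int, List[List[float]]]:
    """Read an 'N DIMS' header followed by N sparse vectors.

    Each vector line is assumed to look like:
        nz Dim_1 Val_1 ... Dim_nz Val_nz
    with dimensions counted from 0, as stated in the description.
    Returns (N, DIMS, dense_vectors).
    """
    header = stream.readline().split()
    n_vectors, dims = int(header[0]), int(header[1])

    vectors: List[List[float]] = []
    for line in stream:
        fields = line.split()
        if not fields:  # assumption: blank lines carry no vector
            continue
        nz = int(fields[0])
        if len(fields) != 1 + 2 * nz:
            raise ValueError("vector line does not contain nz dimension-value pairs")
        dense = [0.0] * dims
        for k in range(nz):
            dim = int(fields[1 + 2 * k])
            if not 0 <= dim < dims:
                raise ValueError("dimension index out of range")
            dense[dim] = float(fields[2 + 2 * k])
        vectors.append(dense)

    if len(vectors) != n_vectors:
        raise ValueError("number of vectors does not match the header")
    return n_vectors, dims, vectors


def load_feats(path: str) -> Tuple[int, int, List[List[float]]]:
    """Open a gzip-compressed feature file in text mode and parse it."""
    with gzip.open(path, "rt", encoding="ascii") as fh:
        return read_sparse_feats(fh)


def image_path(iid: str) -> str:
    """Build ./WEBUPV/images/{IID:0:2}/{IID}.jpg from an image ID."""
    return f"./WEBUPV/images/{iid[:2]}/{iid}.jpg"


if __name__ == "__main__":
    # Tiny made-up sample: 2 vectors of dimensionality 4.
    sample = io.StringIO("2 4\n2 0 1.5 3 2\n0\n")
    print(read_sparse_feats(sample))
    print(image_path("abcdef"))
```

Dense lists keep the sketch dependency-free; for the 1000-word codebook features with a spatial pyramid, a sparse matrix type would likely be a better fit in practice.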