Dataset Restricted Access

2014 ImageCLEF WEBUPV Collection

Villegas, Mauricio; Paredes, Roberto


Dublin Core Export

<?xml version='1.0' encoding='utf-8'?>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
  <dc:creator>Villegas, Mauricio</dc:creator>
  <dc:creator>Paredes, Roberto</dc:creator>
  <dc:date>2014-04-01</dc:date>
  <dc:description>This document describes the WEBUPV dataset compiled for the ImageCLEF 2014
Scalable Concept Image Annotation challenge. The data described here is what is
currently available for download; upon request, or depending on feedback from
the participants, additional data may be released.

The following is the directory structure of the collection, and below is a
brief description of what each compressed file contains. The corresponding
MD5 checksums of the files shown (for verifying a correct download) can be
found in md5sums.txt.

Directory structure
-------------------

.
|
|--- README.txt
|--- md5sums.txt
|--- webupv14_train_lists.zip
|--- webupv14_train2_lists.zip
|--- webupv14_devel_lists.zip
|--- webupv14_test_lists.zip
|--- webupv14_baseline.zip
|
|--- feats_textual/
|      |
|      |--- webupv14_{train|train2}_textual_pages.zip
|      |--- webupv14_{train|train2}_textual.scofeat.gz
|      |--- webupv14_{train|train2}_textual.keywords.gz
|
|--- feats_visual/
       |
       |--- webupv14_{train|train2|devel|test}_visual_images.zip
       |--- webupv14_{train|train2|devel|test}_visual_gist2.feat.gz
       |--- webupv14_{train|train2|devel|test}_visual_sift_1000.feat.gz
       |--- webupv14_{train|train2|devel|test}_visual_csift_1000.feat.gz
       |--- webupv14_{train|train2|devel|test}_visual_rgbsift_1000.feat.gz
       |--- webupv14_{train|train2|devel|test}_visual_opponentsift_1000.feat.gz
       |--- webupv14_{train|train2|devel|test}_visual_colorhist.feat.gz
       |--- webupv14_{train|train2|devel|test}_visual_getlf.feat.gz


Contents of files
-----------------

* webupv14_train{|2}_lists.zip

  The first training set ("train_*") includes images for the concepts of the
  development set, whereas the second training set ("train2_*") includes
  images for the concepts in the test set that are not in the development set.

  -&gt; train{|2}_iids.txt : IDs of the images (IIDs) in the training set.

  -&gt; train{|2}_rids.txt : IDs of the webpages (RIDs) in the training set.

  -&gt; train{|2}_*urls.txt : The original URLs from which the images (iurls)
       and the webpages (rurls) were downloaded. Each line in the file
       corresponds to an image, starting with the IID, followed by one
       or more URLs.

  -&gt; train{|2}_rimgsrc.txt : The URLs of the images as referenced in each
       of the webpages. Each line of the file is of the form: IID RID
       URL1 [URL2 ...]. This information is necessary to locate the
       images within the webpages and it can also be useful as a
       textual feature.
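The list files above are plain whitespace-separated text; as an illustration, a minimal Python sketch (the helper name and the example line in the test are ours, not part of the dataset) for splitting a train_rimgsrc.txt line into its parts:

```python
def parse_rimgsrc_line(line):
    # Format: IID RID URL1 [URL2 ...]
    parts = line.split()
    iid, rid, urls = parts[0], parts[1], parts[2:]
    return iid, rid, urls
```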

* webupv14_{devel|test}_lists.zip

  -&gt; {devel|test}_conceptlists.txt : Lists per image of concepts for
       annotation. Each line starts with an image ID and is followed by the
       list of concepts in alphabetical order. Each ID may appear more than
       once. In total there are 1940 image annotation lists for the
       development set and 7291 image annotation lists for the test set. These
       correspond to 1000 and 4122 unique images (IDs) for the development and
       test sets, respectively.
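A concept-list line splits the same way: the first token is the image ID and the rest are the concepts. A small sketch (helper name and example line are ours):

```python
def parse_conceptlist_line(line):
    # Format: IID concept1 concept2 ... (concepts in alphabetical order)
    parts = line.split()
    return parts[0], parts[1:]
```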

  -&gt; {devel|test}_allconcepts.txt : Complete list of concepts for the
       development/test set.

       The concepts are defined by one or more WordNet synsets, which is
       intended to make it possible to easily obtain more information about
       the concepts, e.g. synonyms. In the concept list, the first column
       (which is the name of the concept) indicates the word to search in
       WordNet, the second column the synset type (either noun or adjective),
       the third column is the sense number and the fourth column is the
       WordNet offset (although this cannot be trusted since it changes
       between WordNet versions). For most of the concepts there is a fifth
       column which is a Wikipedia article related to the concept.
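The four (or five) columns described above could be unpacked as follows (a sketch; the helper name and the example values in the test are illustrative, not taken from the actual concept list):

```python
def parse_concept_line(line):
    # Columns: name, synset type, sense number, WordNet offset,
    # and optionally a related Wikipedia article.
    cols = line.split()
    return {
        "name": cols[0],
        "type": cols[1],       # noun or adjective
        "sense": int(cols[2]),
        "offset": cols[3],     # version-dependent, not to be trusted
        "wikipedia": cols[4] if len(cols) == 5 else None,
    }
```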

  -&gt; {devel|test}_groundtruth.txt : Ground truth concepts for the development
       and test sets.

  -&gt; {devel|test}_*urls.txt : The original URLs from which the images
       (iurls) and the webpages (rurls) were downloaded. Each line in the
       file corresponds to an image, starting with the IID, followed by
       one or more URLs.

       Note: These are included only to acknowledge the source of the
       data, not to be used as input to the annotation systems.


* webupv14_baseline.zip

  An archive that includes code for computing the evaluation measures
  for two baseline techniques. See the included README.txt for
  details.


* feats_textual/webupv14_train{|2}_textual_pages.zip

  Contains all of the webpages that referenced the images in the
  training set, after conversion to valid xml. In total there are
  262588 files, since each image can appear in more than one page, and
  there can be several versions of the same page that differ by the
  method of conversion to xml. To avoid having too many files in a
  single directory (which is an issue for some file systems), the
  files are found in subdirectories named using the first two
  characters of the RID, thus the paths of the files after extraction
  are of the form:

    ./WEBUPV/pages/{RID:0:2}/{RID}.{CONVM}.xml.gz

  To be able to locate the training images within the webpages, the
  URLs of the images as referenced are provided in the file
  train_rimgsrc.txt.

* feats_textual/webupv14_train{|2}_textual.scofeat.gz

  The processed text extracted from the webpages near where the images
  appeared. Each line corresponds to one image, having the same order
  as the train_iids.txt list. The lines start with the image ID,
  followed by the number of extracted unique words and the
  corresponding word-score pairs. The scores were derived taking into
  account 1) the term frequency (TF), 2) the document object model
  (DOM) attributes, and 3) the word distance to the image. The scores
  are all integers and for each image the sum of scores is always
  &lt;=100000 (i.e. it is normalized).
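A scofeat line could be unpacked as follows (a sketch; the helper name and the example values, with scores summing to 100000, are made up):

```python
def parse_scofeat_line(line):
    # Format: IID NWORDS word1 score1 word2 score2 ...
    parts = line.split()
    iid, nwords = parts[0], int(parts[1])
    scores = {}
    for i in range(nwords):
        word = parts[2 + 2 * i]
        scores[word] = int(parts[3 + 2 * i])
    return iid, scores
```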


* feats_textual/webupv14_train{|2}_textual.keywords.gz

  The words used to find the images when querying image search
  engines. Each line corresponds to an image (in the same order as in
  train_iids.txt). The lines are composed of triplets:

    [keyword] [rank] [search_engine]

  where [keyword] is the word used to find the image, [rank] is the
  position given to the image in the query, and [search_engine] is a
  single character indicating in which search engine it was found
  ('g':google, 'b':bing, 'y':yahoo).
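The triplet structure can be read off in steps of three tokens; a minimal sketch (helper name and example line are ours):

```python
def parse_keywords_line(line):
    # Format: keyword1 rank1 engine1 keyword2 rank2 engine2 ...
    parts = line.split()
    triplets = []
    for i in range(0, len(parts), 3):
        keyword = parts[i]
        rank = int(parts[i + 1])
        engine = parts[i + 2]   # 'g', 'b' or 'y'
        triplets.append((keyword, rank, engine))
    return triplets
```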


* feats_visual/webupv14_*_images.zip

  Contains thumbnails (maximum 640 pixels in either width or height)
  of the images in jpeg format. To avoid having too many files in a
  single directory (which is an issue for some file systems), the
  files are found in subdirectories named using the first two
  characters of the IID, thus the paths of the files after extraction
  are of the form:

    ./WEBUPV/images/{IID:0:2}/{IID}.jpg
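The bucketing scheme above maps an IID to a path directly; a small sketch (the function name is ours):

```python
import os

def image_path(iid, root="WEBUPV"):
    # Thumbnails are bucketed by the first two characters of the IID.
    return os.path.join(root, "images", iid[:2], iid + ".jpg")
```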

* feats_visual/webupv14_*.feat.gz

  The visual features in a simple ASCII text sparse format. The first
  line of the file indicates the number of vectors (N) and the
  dimensionality (DIMS). Each subsequent line corresponds to one
  vector, starting with the number of non-zero elements, followed by
  dimension-value pairs, with dimensions indexed from 0. In summary,
  the file format is:

    N DIMS
    nz1 Dim(1,1) Val(1,1) ... Dim(1,nz1) Val(1,nz1)
    nz2 Dim(2,1) Val(2,1) ... Dim(2,nz2) Val(2,nz2)
    ...
    nzN Dim(N,1) Val(N,1) ... Dim(N,nzN) Val(N,nzN)

  The order of the features is the same as in the lists
  devel_conceptlists.txt, test_conceptlists.txt and train_iids.txt.
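A reader for this format can be sketched in a few lines (assuming the gzipped files decompress to the ASCII layout above; the function name is ours):

```python
import gzip

def read_sparse_feats(path):
    # Parse the sparse ASCII format: a header line "N DIMS", then one
    # line per vector: nnz followed by nnz dimension-value pairs
    # (dimensions indexed from 0).
    with gzip.open(path, "rt") as f:
        n, dims = (int(x) for x in f.readline().split())
        vectors = []
        for _ in range(n):
            parts = f.readline().split()
            nnz = int(parts[0])
            vec = {}
            for i in range(nnz):
                dim = int(parts[1 + 2 * i])
                vec[dim] = float(parts[2 + 2 * i])
            vectors.append(vec)
    return dims, vectors
```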

  The SIFT-based features in this subdirectory were extracted as
  follows. Using the ImageMagick software, the images were first
  rescaled to a maximum of 240 pixels in both width and height, while
  preserving the original aspect ratio, employing the command:

    convert {IMGIN}.jpg -resize '240x240&gt;' {IMGOUT}.jpg

  Then the SIFT features were extracted using the ColorDescriptor
  software from Koen van de Sande
  (http://koen.me/research/colordescriptors). As configuration we used
  the 'densesampling' detector with default parameters, and a hard
  assignment codebook with a 'pyramid-1x1-2x2' spatial pyramid. The
  number in the file name indicates the size of the codebook. All of
  the vectors of the spatial pyramid are given on the same line, thus
  keeping only the first 1/5th of the dimensions would be equivalent
  to not using the spatial pyramid. The codebook was generated from
  1.25 million randomly selected features using the k-means
  algorithm. The GIST features were extracted using the LabelMe
  Toolbox: the images were first resized to 256x256, ignoring the
  original aspect ratio, using 5 scales, 6 orientations and 4
  blocks. The other features, colorhist and getlf, are both color
  histogram based, extracted using our own implementation.
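The remark that keeping the first 1/5th of the dimensions amounts to dropping the spatial pyramid can be expressed as a simple slice (a sketch; the function name is ours, assuming a 1000-word codebook so that the full pyramid vector has 5000 dimensions):

```python
def drop_spatial_pyramid(vec, codebook_size=1000):
    # The 1x1-2x2 pyramid concatenates 5 histograms of codebook_size
    # dimensions each; the first one covers the whole image, i.e. it
    # is the no-pyramid histogram.
    return vec[:codebook_size]
```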

 </dc:description>
  <dc:identifier>https://zenodo.org/record/259758</dc:identifier>
  <dc:identifier>10.5281/zenodo.259758</dc:identifier>
  <dc:identifier>oai:zenodo.org:259758</dc:identifier>
  <dc:relation>info:eu-repo/grantAgreement/EC/FP7/600707/</dc:relation>
  <dc:relation>url:http://ceur-ws.org/Vol-1180/CLEF2014wn-Image-VillegasEt2014.pdf</dc:relation>
  <dc:relation>url:http://imageclef.org/2014/annotation</dc:relation>
  <dc:relation>url:https://zenodo.org/communities/ecfunded</dc:relation>
  <dc:relation>url:https://zenodo.org/communities/imageclef</dc:relation>
  <dc:rights>info:eu-repo/semantics/restrictedAccess</dc:rights>
  <dc:title>2014 ImageCLEF WEBUPV Collection</dc:title>
  <dc:type>info:eu-repo/semantics/other</dc:type>
  <dc:type>dataset</dc:type>
</oai_dc:dc>