Dataset Restricted Access

2014 ImageCLEF WEBUPV Collection

Villegas, Mauricio; Paredes, Roberto

MARC21 XML Export

<?xml version='1.0' encoding='UTF-8'?>
<record xmlns="">
  <controlfield tag="005">20200124192142.0</controlfield>
  <controlfield tag="001">259758</controlfield>
  <datafield tag="711" ind1=" " ind2=" ">
    <subfield code="d">15-18 September 2014</subfield>
    <subfield code="g">CLEF</subfield>
    <subfield code="a">Conference and Labs of the Evaluation Forum</subfield>
    <subfield code="c">Sheffield, UK</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">Universitat Politecnica de Valencia</subfield>
    <subfield code="a">Paredes, Roberto</subfield>
  </datafield>
  <datafield tag="542" ind1=" " ind2=" ">
    <subfield code="l">restricted</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="y">Conference website</subfield>
    <subfield code="u"></subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="c">2014-04-01</subfield>
  </datafield>
  <datafield tag="909" ind1="C" ind2="O">
    <subfield code="p">openaire_data</subfield>
    <subfield code="p">user-ecfunded</subfield>
    <subfield code="p">user-imageclef</subfield>
    <subfield code="o"></subfield>
  </datafield>
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="u">Universitat Politecnica de Valencia</subfield>
    <subfield code="a">Villegas, Mauricio</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">2014 ImageCLEF WEBUPV Collection</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">user-ecfunded</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">user-imageclef</subfield>
  </datafield>
  <datafield tag="536" ind1=" " ind2=" ">
    <subfield code="c">600707</subfield>
    <subfield code="a">tranScriptorium</subfield>
  </datafield>
  <datafield tag="520" ind1=" " ind2=" ">
    <subfield code="a">&lt;p&gt;This document describes the WEBUPV dataset compiled for the ImageCLEF 2014&lt;br&gt;
Scalable Concept Image Annotation challenge. The data mentioned here indicates what&lt;br&gt;
is ready for download. However, upon request or depending on feedback from the&lt;br&gt;
participants, additional data may be released.&lt;/p&gt;

&lt;p&gt;The following is the directory structure of the collection, and below there&lt;br&gt;
is a brief description of what each compressed file contains. The&lt;br&gt;
corresponding MD5 checksums of the files shown (for verifying a correct&lt;br&gt;
download) can be found in md5sums.txt.&lt;/p&gt;
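
&lt;p&gt;As a minimal sketch (not part of the original collection), the check against md5sums.txt could be scripted as follows. It assumes md5sums.txt uses the common md5sum layout of one hash followed by a file name per line, which is not stated here; the function names are illustrative.&lt;/p&gt;

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5sums(text):
    """Parse md5sum-style lines into a {filename: hash} dict (assumed layout)."""
    expected = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        digest, name = parts
        # md5sum marks binary mode with a leading '*' before the file name
        expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def find_mismatches(md5sums_path, root="."):
    """Return the listed files that are missing or whose MD5 does not match."""
    expected = parse_md5sums(Path(md5sums_path).read_text())
    bad = []
    for name, digest in expected.items():
        target = Path(root) / name
        if not target.is_file() or md5_of_file(target) != digest:
            bad.append(name)
    return bad
```

&lt;p&gt;An empty list returned by find_mismatches("md5sums.txt") would indicate that every listed file was downloaded intact.&lt;/p&gt;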

&lt;p&gt;Directory structure&lt;br&gt;

|--- README.txt&lt;br&gt;
|--- md5sums.txt&lt;br&gt;
|--- feats_textual/&lt;br&gt;
|&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; |&lt;br&gt;
|&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; |--- webupv14_{train|train2}&lt;br&gt;
|&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; |--- webupv14_{train|train2}_textual.scofeat.gz&lt;br&gt;
|&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; |--- webupv14_{train|train2}_textual.keywords.gz&lt;br&gt;
|--- feats_visual/&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; |&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; |--- webupv14_{train|train2|devel|test}&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; |--- webupv14_{train|train2|devel|test}_visual_gist2.feat.gz&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; |--- webupv14_{train|train2|devel|test}_visual_sift_1000.feat.gz&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; |--- webupv14_{train|train2|devel|test}_visual_csift_1000.feat.gz&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; |--- webupv14_{train|train2|devel|test}_visual_rgbsift_1000.feat.gz&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; |--- webupv14_{train|train2|devel|test}_visual_opponentsift_1000.feat.gz&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; |--- webupv14_{train|train2|devel|test}_visual_colorhist.feat.gz&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; |--- webupv14_{train|train2|devel|test}_visual_getlf.feat.gz&lt;/p&gt;

Contents of files&lt;br&gt;

&lt;p&gt;* webupv14_train{|2}&lt;/p&gt;

&lt;p&gt;&amp;nbsp; The first training set (&amp;quot;train_*&amp;quot;) includes images for the concepts of the&lt;br&gt;
&amp;nbsp; development set, whereas the second training set (&amp;quot;train2_*&amp;quot;) includes&lt;br&gt;
&amp;nbsp; images for the concepts in the test set that are not in the development set.&lt;/p&gt;

&lt;p&gt;&amp;nbsp; -&amp;gt; train{|2}_iids.txt : IDs of the images (IIDs) in the training set.&lt;/p&gt;

&lt;p&gt;&amp;nbsp; -&amp;gt; train{|2}_rids.txt : IDs of the webpages (RIDs) in the training set.&lt;/p&gt;

&lt;p&gt;&amp;nbsp; -&amp;gt; train{|2}_*urls.txt : The original URLs from where the images (iurls)&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; and the webpages (rurls) were downloaded. Each line in the file&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; corresponds to an image: it starts with the IID and is followed&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; by one or more URLs.&lt;/p&gt;

&lt;p&gt;&amp;nbsp; -&amp;gt; train{|2}_rimgsrc.txt : The URLs of the images as referenced in each&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; of the webpages. Each line of the file is of the form: IID RID&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; URL1 [URL2 ...]. This information is necessary to locate the&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; images within the webpages and it can also be useful as a&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; textual feature.&lt;/p&gt;

&lt;p&gt;* webupv14_{devel|test}&lt;/p&gt;

&lt;p&gt;&amp;nbsp; -&amp;gt; {devel|test}_conceptlists.txt : Lists per image of concepts for&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; annotation. Each line starts with an image ID and is followed by the&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; list of concepts in alphabetical order. Each ID may appear more than&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; once. In total there are 1940 image annotation lists for the&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; development set and 7291 image annotation lists for the test set. These&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; correspond to 1000 and 4122 unique images (IDs) for the development and&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; test sets, respectively.&lt;/p&gt;

&lt;p&gt;&amp;nbsp; -&amp;gt; {devel|test}_allconcepts.txt : Complete list of concepts for the&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; development/test set.&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; The concepts are defined by one or more WordNet synsets, which is&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; intended to make it possible to easily obtain more information about&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; the concepts, e.g. synonyms. In the concept list, the first column&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; (which is the name of the concept) indicates the word to search in&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; WordNet, the second column the synset type (either noun or adjective),&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; the third column is the sense number and the fourth column is the&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; WordNet offset (although this cannot be trusted since it changes&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; between WordNet versions). For most of the concepts there is a fifth&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; column which is a Wikipedia article related to the concept.&lt;/p&gt;

&lt;p&gt;&amp;nbsp; -&amp;gt; {devel|test}_groundtruth.txt : Ground truth concepts for the development&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; and test sets.&lt;/p&gt;

&lt;p&gt;&amp;nbsp; -&amp;gt; {devel|test}_*urls.txt : The original URLs from where the images (iurls)&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; and the webpages (rurls) were downloaded. Each line in the file&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; corresponds to an image: it starts with the IID and is followed by one&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; or more URLs.&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Note: These are included only to acknowledge the source of the&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; data, not to be used as input to the annotation systems.&lt;/p&gt;


&lt;p&gt;&amp;nbsp; An archive that includes code for computing the evaluation measures&lt;br&gt;
&amp;nbsp; for two baseline techniques. See the included README.txt for&lt;br&gt;
&amp;nbsp; details.&lt;/p&gt;

&lt;p&gt;* feats_textual/webupv14_train{|2}&lt;/p&gt;

&lt;p&gt;&amp;nbsp; Contains all of the webpages which referenced the images in the&lt;br&gt;
&amp;nbsp; training set after being converted to valid xml. In total there are&lt;br&gt;
&amp;nbsp; 262588 files, since each image can appear in more than one page, and&lt;br&gt;
&amp;nbsp; there can be several versions of the same page which differ by the&lt;br&gt;
&amp;nbsp; method of conversion to xml. To avoid having too many files in a&lt;br&gt;
&amp;nbsp; single directory (which is an issue for some types of partitions),&lt;br&gt;
&amp;nbsp; the files are found in subdirectories named using the first two&lt;br&gt;
&amp;nbsp; characters of the RID, thus the paths of the files after extraction&lt;br&gt;
&amp;nbsp; are of the form:&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; ./WEBUPV/pages/{RID:0:2}/{RID}.{CONVM}.xml.gz&lt;/p&gt;

&lt;p&gt;&amp;nbsp; To be able to locate the training images within the webpages, the&lt;br&gt;
&amp;nbsp; URLs of the images as referenced are provided in the file&lt;br&gt;
&amp;nbsp; train_rimgsrc.txt.&lt;/p&gt;

&lt;p&gt;* feats_textual/webupv14_train{|2}_textual.scofeat.gz&lt;/p&gt;

&lt;p&gt;&amp;nbsp; The processed text extracted from the webpages near where the images&lt;br&gt;
&amp;nbsp; appeared. Each line corresponds to one image, having the same order&lt;br&gt;
&amp;nbsp; as the train_iids.txt list. The lines start with the image ID,&lt;br&gt;
&amp;nbsp; followed by the number of extracted unique words and the&lt;br&gt;
&amp;nbsp; corresponding word-score pairs. The scores were derived taking into&lt;br&gt;
&amp;nbsp; account 1) the term frequency (TF), 2) the document object model&lt;br&gt;
&amp;nbsp; (DOM) attributes, and 3) the word distance to the image. The scores&lt;br&gt;
&amp;nbsp; are all integers and for each image the sum of scores is always&lt;br&gt;
&amp;nbsp; &amp;lt;=100000 (i.e. it is normalized).&lt;/p&gt;
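
&lt;p&gt;&amp;nbsp; A minimal sketch of reading this format, assuming the fields are separated by whitespace and the file is UTF-8 encoded (neither is stated here). The function names and the example image ID are hypothetical.&lt;/p&gt;

```python
import gzip


def parse_scofeat_line(line):
    """Split one line into (image_id, {word: score})."""
    fields = line.split()
    image_id, count = fields[0], int(fields[1])
    pairs = fields[2:]
    if len(pairs) != 2 * count:
        raise ValueError("expected %d word-score pairs for %s" % (count, image_id))
    # Words sit at even positions, their integer scores at the following odd ones
    scores = {pairs[i]: int(pairs[i + 1]) for i in range(0, len(pairs), 2)}
    return image_id, scores


def read_scofeat(path):
    """Yield (image_id, scores) for every non-empty line of a .scofeat.gz file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_scofeat_line(line)
```

&lt;p&gt;&amp;nbsp; For example, parse_scofeat_line("abc123 2 dog 60000 park 40000") would give the ID "abc123" and a dictionary with two scores that sum to 100000.&lt;/p&gt;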

&lt;p&gt;* feats_textual/webupv14_train{|2}_textual.keywords.gz&lt;/p&gt;

&lt;p&gt;&amp;nbsp; The words used to find the images when querying image search&lt;br&gt;
&amp;nbsp; engines. Each line corresponds to an image (in the same order as in&lt;br&gt;
&amp;nbsp; train_iids.txt). The lines are composed of triplets:&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; [keyword] [rank] [search_engine]&lt;/p&gt;

&lt;p&gt;&amp;nbsp; where [keyword] is the word used to find the image, [rank] is the&lt;br&gt;
&amp;nbsp; position given to the image in the query, and [search_engine] is a&lt;br&gt;
&amp;nbsp; single character indicating in which search engine it was found&lt;br&gt;
&amp;nbsp; (&amp;#39;g&amp;#39;:google, &amp;#39;b&amp;#39;:bing, &amp;#39;y&amp;#39;:yahoo).&lt;/p&gt;
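
&lt;p&gt;&amp;nbsp; A minimal sketch of splitting one such line into triplets. It assumes that a line consists only of whitespace-separated triplets and that [rank] is an integer; both are assumptions, and the names Keyword and parse_keywords_line are illustrative.&lt;/p&gt;

```python
from collections import namedtuple

Keyword = namedtuple("Keyword", ["keyword", "rank", "engine"])

# Single-character search engine codes as listed in the description
ENGINES = {"g": "google", "b": "bing", "y": "yahoo"}


def parse_keywords_line(line):
    """Return the list of (keyword, rank, engine) triplets found on one line."""
    fields = line.split()
    if len(fields) % 3 != 0:
        raise ValueError("line is not a whole number of triplets")
    return [
        # Unknown engine codes are kept unchanged rather than rejected
        Keyword(fields[i], int(fields[i + 1]), ENGINES.get(fields[i + 2], fields[i + 2]))
        for i in range(0, len(fields), 3)
    ]
```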

&lt;p&gt;* feats_visual/webupv14_*&lt;/p&gt;

&lt;p&gt;&amp;nbsp; Contains thumbnails (maximum 640 pixels of either width or height)&lt;br&gt;
&amp;nbsp; of the images in jpeg format. To avoid having too many files in a&lt;br&gt;
&amp;nbsp; single directory (which is an issue for some types of partitions),&lt;br&gt;
&amp;nbsp; the files are found in subdirectories named using the first two&lt;br&gt;
&amp;nbsp; characters of the IID, thus the paths of the files after extraction&lt;br&gt;
&amp;nbsp; are of the form:&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; ./WEBUPV/images/{IID:0:2}/{IID}.jpg&lt;/p&gt;
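
&lt;p&gt;&amp;nbsp; The {IID:0:2} and {RID:0:2} notation denotes the first two characters of the ID. A minimal sketch of building these paths follows; the function names, the example IDs and the conversion method label are hypothetical.&lt;/p&gt;

```python
import posixpath


def image_path(iid, root="./WEBUPV/images"):
    """Path of a thumbnail: subdirectory is the first two characters of the IID."""
    return posixpath.join(root, iid[:2], iid + ".jpg")


def page_path(rid, convm, root="./WEBUPV/pages"):
    """Path of a converted webpage: subdirectory is the first two characters of the RID."""
    return posixpath.join(root, rid[:2], "%s.%s.xml.gz" % (rid, convm))


if __name__ == "__main__":
    print(image_path("abcdef"))        # ./WEBUPV/images/ab/abcdef.jpg
    print(page_path("xyz123", "m1"))   # ./WEBUPV/pages/xy/xyz123.m1.xml.gz
```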

&lt;p&gt;* feats_visual/webupv14_*.feat.gz&lt;/p&gt;

&lt;p&gt;&amp;nbsp; The visual features in a simple ASCII text sparse format. The first&lt;br&gt;
&amp;nbsp; line of the file indicates the number of vectors (N) and the&lt;br&gt;
&amp;nbsp; dimensionality (DIMS). Then each line corresponds to one vector,&lt;br&gt;
&amp;nbsp; starting with the number of non-zero elements and followed by pairs&lt;br&gt;
&amp;nbsp; of dimension-value, with the first dimension being 0. In summary the file&lt;br&gt;
&amp;nbsp; format is:&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; N DIMS&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; nz1 Dim(1,1) Val(1,1) ... Dim(1,nz1) Val(1,nz1)&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; nz2 Dim(2,1) Val(2,1) ... Dim(2,nz2) Val(2,nz2)&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; ...&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; nzN Dim(N,1) Val(N,1) ... Dim(N,nzN) Val(N,nzN)&lt;/p&gt;
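
&lt;p&gt;&amp;nbsp; A minimal sketch of a reader for this sparse format, assuming whitespace-separated fields. It expands each vector into a dense list, which is only practical for small files; the function names are illustrative.&lt;/p&gt;

```python
import gzip


def parse_sparse_vector(line, dims):
    """Expand one 'nz dim val dim val ...' line into a dense list of floats."""
    fields = line.split()
    nz = int(fields[0])
    if len(fields) != 1 + 2 * nz:
        raise ValueError("expected %d dimension-value pairs" % nz)
    dense = [0.0] * dims
    for k in range(nz):
        # Dimensions are 0-based, so they index the dense list directly
        dense[int(fields[1 + 2 * k])] = float(fields[2 + 2 * k])
    return dense


def read_feat_file(path):
    """Read a .feat.gz file and return (n_vectors, dims, list of dense vectors)."""
    with gzip.open(path, "rt") as handle:
        n_vectors, dims = (int(x) for x in handle.readline().split())
        vectors = [parse_sparse_vector(line, dims) for line in handle if line.strip()]
    if len(vectors) != n_vectors:
        raise ValueError("header announces %d vectors, found %d" % (n_vectors, len(vectors)))
    return n_vectors, dims, vectors
```

&lt;p&gt;&amp;nbsp; For the full training files, keeping the dimension-value pairs in a sparse structure instead of dense lists would likely be preferable.&lt;/p&gt;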

&lt;p&gt;&amp;nbsp; The order of the features is the same as in the lists&lt;br&gt;
&amp;nbsp; devel_conceptlists.txt, test_conceptlists.txt and train_iids.txt.&lt;/p&gt;

&lt;p&gt;&amp;nbsp; The procedure to extract the SIFT based features in this&lt;br&gt;
&amp;nbsp; subdirectory was conducted as follows. Using the ImageMagick&lt;br&gt;
&amp;nbsp; software, the images were first rescaled to a maximum of 240&lt;br&gt;
&amp;nbsp; pixels in both width and height, while preserving the original&lt;br&gt;
&amp;nbsp; aspect ratio, employing the command:&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; convert {IMGIN}.jpg -resize &amp;#39;240&amp;gt;x240&amp;gt;&amp;#39; {IMGOUT}.jpg&lt;/p&gt;

&lt;p&gt;&amp;nbsp; Then the SIFT features were extracted using the ColorDescriptor&lt;br&gt;
&amp;nbsp; software from Koen van de Sande. As configuration we used the&lt;br&gt;
&amp;nbsp; &amp;#39;densesampling&amp;#39; detector with default parameters, and a hard&lt;br&gt;
&amp;nbsp; assignment codebook using a spatial pyramid as&lt;br&gt;
&amp;nbsp; &amp;#39;pyramid-1x1-2x2&amp;#39;. The number in the file name indicates the size of&lt;br&gt;
&amp;nbsp; the codebook. All of the vectors of the spatial pyramid are given in&lt;br&gt;
&amp;nbsp; the same line, thus keeping only the first 1/5th of the dimensions&lt;br&gt;
&amp;nbsp; would be like not using the spatial pyramid. The codebook was&lt;br&gt;
&amp;nbsp; generated using 1.25 million randomly selected features and the&lt;br&gt;
&amp;nbsp; k-means algorithm. The GIST features were extracted using the&lt;br&gt;
&amp;nbsp; LabelMe Toolbox. The images were first resized to 256x256, ignoring&lt;br&gt;
&amp;nbsp; the original aspect ratio, and GIST was computed using 5 scales, 6&lt;br&gt;
&amp;nbsp; orientations and 4 blocks. The other features, colorhist and getlf, are&lt;br&gt;
&amp;nbsp; both color histogram based and were extracted using our own implementation.&lt;/p&gt;
</subfield>
  </datafield>

  <datafield tag="773" ind1=" " ind2=" ">
    <subfield code="n">url</subfield>
    <subfield code="i">isSupplementTo</subfield>
    <subfield code="a"></subfield>
  </datafield>
  <datafield tag="773" ind1=" " ind2=" ">
    <subfield code="n">url</subfield>
    <subfield code="i">isSupplementTo</subfield>
    <subfield code="a"></subfield>
  </datafield>
  <datafield tag="024" ind1=" " ind2=" ">
    <subfield code="a">10.5281/zenodo.259758</subfield>
    <subfield code="2">doi</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">dataset</subfield>
  </datafield>
</record>