Dataset Restricted Access

2014 ImageCLEF WEBUPV Collection

Villegas, Mauricio; Paredes, Roberto


DataCite XML Export

<?xml version='1.0' encoding='utf-8'?>
<resource xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://datacite.org/schema/kernel-4" xsi:schemaLocation="http://datacite.org/schema/kernel-4 http://schema.datacite.org/meta/kernel-4.1/metadata.xsd">
  <identifier identifierType="DOI">10.5281/zenodo.259758</identifier>
  <creators>
    <creator>
      <creatorName>Villegas, Mauricio</creatorName>
      <givenName>Mauricio</givenName>
      <familyName>Villegas</familyName>
      <affiliation>Universitat Politecnica de Valencia</affiliation>
    </creator>
    <creator>
      <creatorName>Paredes, Roberto</creatorName>
      <givenName>Roberto</givenName>
      <familyName>Paredes</familyName>
      <affiliation>Universitat Politecnica de Valencia</affiliation>
    </creator>
  </creators>
  <titles>
    <title>2014 ImageCLEF WEBUPV Collection</title>
  </titles>
  <publisher>Zenodo</publisher>
  <publicationYear>2014</publicationYear>
  <dates>
    <date dateType="Issued">2014-04-01</date>
  </dates>
  <resourceType resourceTypeGeneral="Dataset"/>
  <alternateIdentifiers>
    <alternateIdentifier alternateIdentifierType="url">https://zenodo.org/record/259758</alternateIdentifier>
  </alternateIdentifiers>
  <relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="URL" relationType="IsSupplementTo">http://ceur-ws.org/Vol-1180/CLEF2014wn-Image-VillegasEt2014.pdf</relatedIdentifier>
    <relatedIdentifier relatedIdentifierType="URL" relationType="IsSupplementTo">http://imageclef.org/2014/annotation</relatedIdentifier>
    <relatedIdentifier relatedIdentifierType="URL" relationType="IsPartOf">https://zenodo.org/communities/ecfunded</relatedIdentifier>
    <relatedIdentifier relatedIdentifierType="URL" relationType="IsPartOf">https://zenodo.org/communities/imageclef</relatedIdentifier>
  </relatedIdentifiers>
  <rightsList>
    <rights rightsURI="info:eu-repo/semantics/restrictedAccess">Restricted Access</rights>
  </rightsList>
  <descriptions>
    <description descriptionType="Abstract">&lt;p&gt;This document describes the WEBUPV dataset compiled for the ImageCLEF 2014&lt;br&gt;
Scalable Concept Image Annotation challenge. The data mentioned here indicates what&lt;br&gt;
is ready for download. However, upon request or depending on feedback from the&lt;br&gt;
participants, additional data may be released.&lt;/p&gt;

&lt;p&gt;The following is the directory structure of the collection, and below there&lt;br&gt;
is a brief description of what each compressed file contains. The&lt;br&gt;
corresponding MD5 checksums of the files shown (for verifying a correct&lt;br&gt;
download) can be found in md5sums.txt.&lt;/p&gt;
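A download can be checked against md5sums.txt along the following lines. This is a minimal sketch: it assumes md5sums.txt uses the usual md5sum layout (one "HASH  PATH" pair per line, paths relative to the collection root), which the description does not state, and the function names are illustrative only.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1048576):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5sums(text):
    """Parse md5sum-style lines ("HASH  PATH") into a {path: hash} dict.

    Assumption: md5sums.txt follows the usual md5sum output layout.
    """
    expected = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            # A leading "*" marks binary mode in md5sum output; drop it.
            expected[parts[1].strip().lstrip("*")] = parts[0].lower()
    return expected


def verify(root, md5sums_name="md5sums.txt"):
    """Return the relative paths whose file is missing or does not match."""
    root = Path(root)
    expected = parse_md5sums((root / md5sums_name).read_text())
    failed = []
    for rel_path, checksum in expected.items():
        target = root / rel_path
        if not target.is_file() or md5_of_file(target) != checksum:
            failed.append(rel_path)
    return failed
```

Calling verify(".") from the collection root returns an empty list when every listed file is present and intact.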

&lt;p&gt;Directory structure&lt;br&gt;
-------------------&lt;/p&gt;

&lt;p&gt;.&lt;br&gt;
|&lt;br&gt;
|--- README.txt&lt;br&gt;
|--- md5sums.txt&lt;br&gt;
|--- webupv14_train_lists.zip&lt;br&gt;
|--- webupv14_train2_lists.zip&lt;br&gt;
|--- webupv14_devel_lists.zip&lt;br&gt;
|--- webupv14_test_lists.zip&lt;br&gt;
|--- webupv14_baseline.zip&lt;br&gt;
|&lt;br&gt;
|--- feats_textual/&lt;br&gt;
|&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; |&lt;br&gt;
|&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; |--- webupv14_{train|train2}_textual_pages.zip&lt;br&gt;
|&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; |--- webupv14_{train|train2}_textual.scofeat.gz&lt;br&gt;
|&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; |--- webupv14_{train|train2}_textual.keywords.gz&lt;br&gt;
|&lt;br&gt;
|--- feats_visual/&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; |&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; |--- webupv14_{train|train2|devel|test}_visual_images.zip&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; |--- webupv14_{train|train2|devel|test}_visual_gist2.feat.gz&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; |--- webupv14_{train|train2|devel|test}_visual_sift_1000.feat.gz&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; |--- webupv14_{train|train2|devel|test}_visual_csift_1000.feat.gz&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; |--- webupv14_{train|train2|devel|test}_visual_rgbsift_1000.feat.gz&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; |--- webupv14_{train|train2|devel|test}_visual_opponentsift_1000.feat.gz&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; |--- webupv14_{train|train2|devel|test}_visual_colorhist.feat.gz&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; |--- webupv14_{train|train2|devel|test}_visual_getlf.feat.gz&lt;/p&gt;

&lt;p&gt;&lt;br&gt;
Contents of files&lt;br&gt;
-----------------&lt;/p&gt;

&lt;p&gt;* webupv14_train{|2}_lists.zip&lt;/p&gt;

&lt;p&gt;&amp;nbsp; The first training set (&amp;quot;train_*&amp;quot;) includes images for the concepts of the&lt;br&gt;
&amp;nbsp; development set, whereas the second training set (&amp;quot;train2_*&amp;quot;) includes&lt;br&gt;
&amp;nbsp; images for the concepts in the test set that are not in the development set.&lt;/p&gt;

&lt;p&gt;&amp;nbsp; -&amp;gt; train{|2}_iids.txt : IDs of the images (IIDs) in the training set.&lt;/p&gt;

&lt;p&gt;&amp;nbsp; -&amp;gt; train{|2}_rids.txt : IDs of the webpages (RIDs) in the training set.&lt;/p&gt;

&lt;p&gt;&amp;nbsp; -&amp;gt; train{|2}_*urls.txt : The original URLs from where the images (iurls)&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; and the webpages (rurls) were downloaded. Each line in the file&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; corresponds to an image, starting with the IID and followed&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; by one or more URLs.&lt;/p&gt;

&lt;p&gt;&amp;nbsp; -&amp;gt; train{|2}_rimgsrc.txt : The URLs of the images as referenced in each&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; of the webpages. Each line of the file is of the form: IID RID&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; URL1 [URL2 ...]. This information is necessary to locate the&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; images within the webpages and it can also be useful as a&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; textual feature.&lt;/p&gt;
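As an illustration of the "IID RID URL1 [URL2 ...]" layout of the rimgsrc files, the sketch below splits such lines and groups them by webpage. It assumes the fields are separated by whitespace and that URLs contain no spaces; the names ImageSource, parse_rimgsrc_line and index_by_page are hypothetical and not part of the collection.

```python
from collections import namedtuple

ImageSource = namedtuple("ImageSource", ["iid", "rid", "urls"])


def parse_rimgsrc_line(line):
    """Split one 'IID RID URL1 [URL2 ...]' line into its fields."""
    fields = line.split()
    if 3 > len(fields):
        raise ValueError("expected IID, RID and at least one URL: %r" % line)
    return ImageSource(iid=fields[0], rid=fields[1], urls=fields[2:])


def index_by_page(lines):
    """Group (IID, URLs) entries by webpage ID (RID), skipping blank lines."""
    by_rid = {}
    for line in lines:
        if line.strip():
            entry = parse_rimgsrc_line(line)
            by_rid.setdefault(entry.rid, []).append((entry.iid, entry.urls))
    return by_rid
```

With such an index, the image references to look for inside a given converted page can be retrieved by its RID.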

&lt;p&gt;* webupv14_{devel|test}_lists.zip&lt;/p&gt;

&lt;p&gt;&amp;nbsp; -&amp;gt; {devel|test}_conceptlists.txt : Lists per image of concepts for&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; annotation. Each line starts with an image ID and is followed by the&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; list of concepts in alphabetical order. Each ID may appear more than&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; once. In total there are 1940 image annotation lists for the&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; development set and 7291 image annotation lists for the test set. These&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; correspond to 1000 and 4122 unique images (IDs) for the development and&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; test sets, respectively.&lt;/p&gt;

&lt;p&gt;&amp;nbsp; -&amp;gt; {devel|test}_allconcepts.txt : Complete list of concepts for the&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; development/test set.&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; The concepts are defined by one or more WordNet synsets, which is&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; intended to make it possible to easily obtain more information about&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; the concepts, e.g. synonyms. In the concept list, the first column&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; (which is the name of the concept) indicates the word to search in&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; WordNet, the second column the synset type (either noun or adjective),&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; the third column is the sense number and the fourth column is the&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; WordNet offset (although this cannot be trusted since it changes&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; between WordNet versions). For most of the concepts there is a fifth&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; column which is a Wikipedia article related to the concept.&lt;/p&gt;

&lt;p&gt;&amp;nbsp; -&amp;gt; {devel|test}_groundtruth.txt : Ground truth concepts for the development&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; and test sets.&lt;/p&gt;

&lt;p&gt;&amp;nbsp; -&amp;gt; {devel|test}_*urls.txt : The original URLs from where the images (iurls)&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; and the webpages (rurls) were downloaded. Each line in the file&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; corresponds to an image, starting with the IID and followed by one&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; or more URLs.&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Note: These are included only to acknowledge the source of the&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; data, not to be used as input to the annotation systems.&lt;/p&gt;

&lt;p&gt;&lt;br&gt;
* webupv14_baseline.zip&lt;/p&gt;

&lt;p&gt;&amp;nbsp; An archive that includes code for computing the evaluation measures&lt;br&gt;
&amp;nbsp; for two baseline techniques. See the included README.txt for&lt;br&gt;
&amp;nbsp; details.&lt;/p&gt;

&lt;p&gt;&lt;br&gt;
* feats_textual/webupv14_train{|2}_textual_pages.zip&lt;/p&gt;

&lt;p&gt;&amp;nbsp; Contains all of the webpages which referenced the images in the&lt;br&gt;
&amp;nbsp; training set after being converted to valid xml. In total there are&lt;br&gt;
&amp;nbsp; 262588 files, since each image can appear in more than one page, and&lt;br&gt;
&amp;nbsp; there can be several versions of the same page which differ by the&lt;br&gt;
&amp;nbsp; method of conversion to xml. To avoid having too many files in a&lt;br&gt;
&amp;nbsp; single directory (which is an issue for some types of partitions),&lt;br&gt;
&amp;nbsp; the files are found in subdirectories named using the first two&lt;br&gt;
&amp;nbsp; characters of the RID, thus the paths of the files after extraction&lt;br&gt;
&amp;nbsp; are of the form:&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; ./WEBUPV/pages/{RID:0:2}/{RID}.{CONVM}.xml.gz&lt;/p&gt;

&lt;p&gt;&amp;nbsp; To be able to locate the training images within the webpages, the&lt;br&gt;
&amp;nbsp; URLs of the images as referenced are provided in the file&lt;br&gt;
&amp;nbsp; train_rimgsrc.txt.&lt;/p&gt;

&lt;p&gt;* feats_textual/webupv14_train{|2}_textual.scofeat.gz&lt;/p&gt;

&lt;p&gt;&amp;nbsp; The processed text extracted from the webpages near where the images&lt;br&gt;
&amp;nbsp; appeared. Each line corresponds to one image, having the same order&lt;br&gt;
&amp;nbsp; as the train_iids.txt list. The lines start with the image ID,&lt;br&gt;
&amp;nbsp; followed by the number of extracted unique words and the&lt;br&gt;
&amp;nbsp; corresponding word-score pairs. The scores were derived taking into&lt;br&gt;
&amp;nbsp; account 1) the term frequency (TF), 2) the document object model&lt;br&gt;
&amp;nbsp; (DOM) attributes, and 3) the word distance to the image. The scores&lt;br&gt;
&amp;nbsp; are all integers and for each image the sum of scores is always&lt;br&gt;
&amp;nbsp; &amp;lt;=100000 (i.e. it is normalized).&lt;/p&gt;
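The line layout described above (image ID, number of unique words, then word-score pairs) could be read roughly as follows. This is a sketch under the assumption that fields are separated by whitespace and that words contain no spaces; the function names are illustrative.

```python
def parse_scofeat_line(line):
    """Parse 'IID COUNT word1 score1 word2 score2 ...' into (iid, {word: score})."""
    fields = line.split()
    iid, count = fields[0], int(fields[1])
    pairs = fields[2:]
    if len(pairs) != 2 * count:
        raise ValueError("declared %d words, found %d fields" % (count, len(pairs)))
    # Scores are integers according to the description.
    scores = {pairs[i]: int(pairs[i + 1]) for i in range(0, len(pairs), 2)}
    return iid, scores


def relative_weights(scores):
    """Rescale integer scores to fractions of the 100000 normalization cap."""
    return {word: score / 100000 for word, score in scores.items()}
```

The compressed file can be read line by line with gzip.open(path, "rt") from the Python standard library; the text encoding of the file is not stated in this description.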

&lt;p&gt;&lt;br&gt;
* feats_textual/webupv14_train{|2}_textual.keywords.gz&lt;/p&gt;

&lt;p&gt;&amp;nbsp; The words used to find the images when querying image search&lt;br&gt;
&amp;nbsp; engines. Each line corresponds to an image (in the same order as in&lt;br&gt;
&amp;nbsp; train_iids.txt). The lines are composed of triplets:&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; [keyword] [rank] [search_engine]&lt;/p&gt;

&lt;p&gt;&amp;nbsp; where [keyword] is the word used to find the image, [rank] is the&lt;br&gt;
&amp;nbsp; position given to the image in the query, and [search_engine] is a&lt;br&gt;
&amp;nbsp; single character indicating in which search engine it was found&lt;br&gt;
&amp;nbsp; (&amp;#39;g&amp;#39;:google, &amp;#39;b&amp;#39;:bing, &amp;#39;y&amp;#39;:yahoo).&lt;/p&gt;
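A possible way to read the triplets is sketched below. The description does not say whether each line also begins with the image ID, so the sketch makes an assumption: if exactly one extra leading field is present, it is treated as the ID. Whitespace-separated fields are also assumed, and the function name is hypothetical.

```python
ENGINES = {"g": "google", "b": "bing", "y": "yahoo"}


def parse_keywords_line(line):
    """Parse one line of '[keyword] [rank] [search_engine]' triplets.

    Assumption: fields are whitespace separated. If one extra leading field
    is present it is treated as the image ID; otherwise the ID is None.
    """
    fields = line.split()
    iid = None
    if len(fields) % 3 == 1:
        iid, fields = fields[0], fields[1:]
    if len(fields) % 3 != 0:
        raise ValueError("fields do not form complete triplets: %r" % line)
    hits = []
    for i in range(0, len(fields), 3):
        keyword, rank, engine = fields[i:i + 3]
        # Unknown engine codes are kept as-is rather than rejected.
        hits.append((keyword, int(rank), ENGINES.get(engine, engine)))
    return iid, hits
```

Each returned hit is a (keyword, rank, engine name) tuple, so the best rank per keyword or per engine can be computed directly.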

&lt;p&gt;&lt;br&gt;
* feats_visual/webupv14_*_images.zip&lt;/p&gt;

&lt;p&gt;&amp;nbsp; Contains thumbnails (maximum 640 pixels of either width or height)&lt;br&gt;
&amp;nbsp; of the images in jpeg format. To avoid having too many files in a&lt;br&gt;
&amp;nbsp; single directory (which is an issue for some types of partitions),&lt;br&gt;
&amp;nbsp; the files are found in subdirectories named using the first two&lt;br&gt;
&amp;nbsp; characters of the IID, thus the paths of the files after extraction&lt;br&gt;
&amp;nbsp; are of the form:&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; ./WEBUPV/images/{IID:0:2}/{IID}.jpg&lt;/p&gt;
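The path patterns used in this collection take the first two characters of the ID as the subdirectory name. A small sketch of how the paths could be built is given here; the helper names are illustrative, and the possible values of CONVM are not listed in this description, so it is passed in as a parameter.

```python
from pathlib import PurePosixPath


def image_path(iid, root="."):
    """Path of a thumbnail: ./WEBUPV/images/{IID:0:2}/{IID}.jpg"""
    return PurePosixPath(root) / "WEBUPV" / "images" / iid[:2] / (iid + ".jpg")


def page_path(rid, convm, root="."):
    """Path of a converted page: ./WEBUPV/pages/{RID:0:2}/{RID}.{CONVM}.xml.gz"""
    name = rid + "." + convm + ".xml.gz"
    return PurePosixPath(root) / "WEBUPV" / "pages" / rid[:2] / name
```

Both helpers return relative paths by default; pass the extraction directory as root to get full paths.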

&lt;p&gt;* feats_visual/webupv14_*.feat.gz&lt;/p&gt;

&lt;p&gt;&amp;nbsp; The visual features in a simple ASCII text sparse format. The first&lt;br&gt;
&amp;nbsp; line of the file indicates the number of vectors (N) and the&lt;br&gt;
&amp;nbsp; dimensionality (DIMS). Then each line corresponds to one vector,&lt;br&gt;
&amp;nbsp; starting with the number of non-zero elements, followed by&lt;br&gt;
&amp;nbsp; dimension-value pairs, with dimensions indexed from 0. In summary the file&lt;br&gt;
&amp;nbsp; format is:&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; N DIMS&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; nz1 Dim(1,1) Val(1,1) ... Dim(1,nz1) Val(1,nz1)&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; nz2 Dim(2,1) Val(2,1) ... Dim(2,nz2) Val(2,nz2)&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; ...&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; nzN Dim(N,1) Val(N,1) ... Dim(N,nzN) Val(N,nzN)&lt;/p&gt;
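The sparse format above can be read with a few lines of code. The following is a minimal sketch, assuming whitespace-separated fields and values that parse as floating point numbers; the function names are illustrative and not part of the collection.

```python
import gzip


def parse_sparse_features(lines):
    """Parse the 'N DIMS' header and the sparse vectors that follow.

    Returns (dims, vectors), where each vector is a {dimension: value}
    dict with dimensions counted from 0.
    """
    lines = iter(lines)
    n_vectors, dims = (int(x) for x in next(lines).split())
    vectors = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        nz = int(fields[0])
        if len(fields) != 1 + 2 * nz:
            raise ValueError("expected %d dimension-value pairs" % nz)
        vectors.append(
            {int(fields[i]): float(fields[i + 1]) for i in range(1, len(fields), 2)}
        )
    if len(vectors) != n_vectors:
        raise ValueError("header declares %d vectors, found %d" % (n_vectors, len(vectors)))
    return dims, vectors


def read_sparse_features(path):
    """Read a gzip-compressed feature file from disk."""
    with gzip.open(path, "rt") as handle:
        return parse_sparse_features(handle)


def to_dense(vector, dims):
    """Expand one sparse vector into a dense list of length dims."""
    dense = [0.0] * dims
    for dim, value in vector.items():
        dense[dim] = value
    return dense
```

Keeping the vectors sparse avoids allocating N x DIMS values at once, which matters for the larger feature files.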

&lt;p&gt;&amp;nbsp; The order of the features is the same as in the lists&lt;br&gt;
&amp;nbsp; devel_conceptlists.txt, test_conceptlists.txt and train_iids.txt.&lt;/p&gt;

&lt;p&gt;&amp;nbsp; The procedure to extract the SIFT based features in this&lt;br&gt;
&amp;nbsp; subdirectory was conducted as follows. Using the ImageMagick&lt;br&gt;
&amp;nbsp; software, the images were first rescaled so that neither width nor&lt;br&gt;
&amp;nbsp; height exceeds 240 pixels, while preserving the original&lt;br&gt;
&amp;nbsp; aspect ratio, employing the command:&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; convert {IMGIN}.jpg -resize &amp;#39;240&amp;gt;x240&amp;gt;&amp;#39; {IMGOUT}.jpg&lt;/p&gt;

&lt;p&gt;&amp;nbsp; Then the SIFT features were extracted using the ColorDescriptor&lt;br&gt;
&amp;nbsp; software from Koen van de Sande&lt;br&gt;
&amp;nbsp; (http://koen.me/research/colordescriptors). As configuration we&lt;br&gt;
&amp;nbsp; used the &amp;#39;densesampling&amp;#39; detector with default parameters, and a hard&lt;br&gt;
&amp;nbsp; assignment codebook using a spatial pyramid as&lt;br&gt;
&amp;nbsp; &amp;#39;pyramid-1x1-2x2&amp;#39;. The number in the file name indicates the size of&lt;br&gt;
&amp;nbsp; the codebook. All of the vectors of the spatial pyramid are given in&lt;br&gt;
&amp;nbsp; the same line, thus keeping only the first 1/5th of the dimensions&lt;br&gt;
&amp;nbsp; would be like not using the spatial pyramid. The codebook was&lt;br&gt;
&amp;nbsp; generated using 1.25 million randomly selected features and the&lt;br&gt;
&amp;nbsp; k-means algorithm. The GIST features were extracted using the&lt;br&gt;
&amp;nbsp; LabelMe Toolbox. The images were first resized to 256x256, ignoring the&lt;br&gt;
&amp;nbsp; original aspect ratio, and the features were computed using 5 scales,&lt;br&gt;
&amp;nbsp; 6 orientations and 4 blocks. The other features, colorhist and getlf,&lt;br&gt;
&amp;nbsp; are both color histogram based and were extracted using our own implementation.&lt;/p&gt;
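The remark about keeping only the first 1/5th of the dimensions can be illustrated as follows. The sketch assumes a sparse vector held as a {dimension: value} dict with dimensions counted from 0, and takes the full dimensionality from the file header. The figure of 5000 dimensions mentioned in the comment is an inference (a codebook of 1000 times the five cells of a 1x1 plus 2x2 pyramid), not something stated in this description.

```python
def drop_spatial_pyramid(vector, dims):
    """Keep only the first fifth of the dimensions of a sparse vector.

    vector maps a dimension index (from 0) to its value; dims is the full
    dimensionality from the file header (presumably 5000 for the *_1000
    SIFT files, which is an inference rather than a documented figure).
    Returns the reduced vector and its new dimensionality.
    """
    if dims % 5 != 0:
        raise ValueError("dimensionality is not a multiple of 5")
    keep = dims // 5
    reduced = {dim: value for dim, value in vector.items() if keep > dim}
    return reduced, keep
```

The reduced vectors are no longer normalized in the same way as the originals, so renormalizing them before use may be appropriate.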

&lt;p&gt;&amp;nbsp;&lt;/p&gt;</description>
  </descriptions>
  <fundingReferences>
    <fundingReference>
      <funderName>European Commission</funderName>
      <funderIdentifier funderIdentifierType="Crossref Funder ID">10.13039/501100000780</funderIdentifier>
      <awardNumber awardURI="info:eu-repo/grantAgreement/EC/FP7/600707/">600707</awardNumber>
      <awardTitle>tranScriptorium</awardTitle>
    </fundingReference>
  </fundingReferences>
</resource>