2014 ImageCLEF WEBUPV Collection
--------------------------------
Villegas, Mauricio; Paredes, Roberto
Zenodo dataset (restricted access), DOI: 10.5281/zenodo.259758, issued 2014-04-01.
Conference and Labs of the Evaluation Forum (CLEF), Sheffield, UK.
This document describes the WEBUPV dataset compiled for the ImageCLEF 2014
Scalable Concept Image Annotation challenge. The data mentioned here indicates
what is ready for download. However, upon request or depending on feedback
from the participants, additional data may be released.

The following is the directory structure of the collection, and below there
is a brief description of what each compressed file contains. The
corresponding MD5 checksums of the files shown (for verifying a correct
download) can be found in md5sums.txt.

Directory structure
-------------------

.
|
|--- README.txt
|--- md5sums.txt
|--- webupv14_train_lists.zip
|--- webupv14_train2_lists.zip
|--- webupv14_devel_lists.zip
|--- webupv14_test_lists.zip
|--- webupv14_baseline.zip
|
|--- feats_textual/
|    |
|    |--- webupv14_{train|train2}_textual_pages.zip
|    |--- webupv14_{train|train2}_textual.scofeat.gz
|    |--- webupv14_{train|train2}_textual.keywords.gz
|
|--- feats_visual/
     |
     |--- webupv14_{train|train2|devel|test}_visual_images.zip
     |--- webupv14_{train|train2|devel|test}_visual_gist2.feat.gz
     |--- webupv14_{train|train2|devel|test}_visual_sift_1000.feat.gz
     |--- webupv14_{train|train2|devel|test}_visual_csift_1000.feat.gz
     |--- webupv14_{train|train2|devel|test}_visual_rgbsift_1000.feat.gz
     |--- webupv14_{train|train2|devel|test}_visual_opponentsift_1000.feat.gz
     |--- webupv14_{train|train2|devel|test}_visual_colorhist.feat.gz
     |--- webupv14_{train|train2|devel|test}_visual_getlf.feat.gz

Contents of files
-----------------

* webupv14_train{|2}_lists.zip

  The first training set ("train_*") includes images for the concepts of the
  development set, whereas the second training set ("train2_*") includes
  images for the concepts in the test set that are not in the development set.

  -> train{|2}_iids.txt : IDs of the images (IIDs) in the training set.

  -> train{|2}_rids.txt : IDs of the webpages (RIDs) in the training set.

  -> train{|2}_*urls.txt : The original URLs from which the images (iurls)
     and the webpages (rurls) were downloaded. Each line in the file
     corresponds to an image, starting with the IID and followed by one or
     more URLs.

  -> train{|2}_rimgsrc.txt : The URLs of the images as referenced in each of
     the webpages. Each line of the file is of the form: IID RID URL1
     [URL2 ...]. This information is necessary to locate the images within
     the webpages, and it can also be useful as a textual feature.

* webupv14_{devel|test}_lists.zip

  -> {devel|test}_conceptlists.txt : Lists per image of concepts for
     annotation. Each line starts with an image ID and is followed by the
     list of concepts in alphabetical order. Each ID may appear more than
     once. In total there are 1940 image annotation lists for the development
     set and 7291 for the test set, corresponding to 1000 and 4122 unique
     images (IDs), respectively.

  -> {devel|test}_allconcepts.txt : Complete list of concepts for the
     development/test set.

     The concepts are defined by one or more WordNet synsets, which is
     intended to make it easy to obtain further information about the
     concepts, e.g. synonyms. In the concept list, the first column (the name
     of the concept) indicates the word to search for in WordNet, the second
     column is the synset type (either noun or adjective), the third column
     is the sense number, and the fourth column is the WordNet offset (which
     should not be relied upon, since it changes between WordNet versions).
     For most of the concepts there is a fifth column giving a Wikipedia
     article related to the concept.

  -> {devel|test}_groundtruth.txt : Ground truth concepts for the development
     and test sets.

  -> {devel|test}_*urls.txt : The original URLs from which the images (iurls)
     and the webpages (rurls) were downloaded. Each line in the file
     corresponds to an image, starting with the IID and followed by one or
     more URLs.

     Note: These are included only to acknowledge the source of the data,
     not to be used as input to the annotation systems.

* webupv14_baseline.zip

  An archive that includes code for computing the evaluation measures for
  two baseline techniques. See the included README.txt for details.

* feats_textual/webupv14_train{|2}_textual_pages.zip

  Contains all of the webpages that referenced the images in the training
  set, after conversion to valid XML. In total there are 262588 files, since
  each image can appear in more than one page, and there can be several
  versions of the same page that differ by the method of conversion to XML.
  To avoid having too many files in a single directory (which is an issue for
  some types of partitions), the files are placed in subdirectories named
  using the first two characters of the RID, so the paths of the files after
  extraction are of the form:

    ./WEBUPV/pages/{RID:0:2}/{RID}.{CONVM}.xml.gz

  To locate the training images within the webpages, the URLs of the images
  as referenced are provided in the file train_rimgsrc.txt.

* feats_textual/webupv14_train{|2}_textual.scofeat.gz

  The processed text extracted from the webpages near where the images
  appeared. Each line corresponds to one image, in the same order as the
  train_iids.txt list. Each line starts with the image ID, followed by the
  number of extracted unique words and the corresponding word-score pairs.
  The scores were derived taking into account 1) the term frequency (TF),
  2) the document object model (DOM) attributes, and 3) the word distance to
  the image. The scores are all integers, and for each image the sum of the
  scores is always <= 100000 (i.e. it is normalized).

* feats_textual/webupv14_train{|2}_textual.keywords.gz

  The words used to find the images when querying image search engines. Each
  line corresponds to an image (in the same order as in train_iids.txt). The
  lines are composed of triplets:

    [keyword] [rank] [search_engine]

  where [keyword] is the word used to find the image, [rank] is the position
  given to the image in the query results, and [search_engine] is a single
  character indicating in which search engine it was found ('g': Google,
  'b': Bing, 'y': Yahoo).

* feats_visual/webupv14_*_images.zip

  Contains thumbnails (a maximum of 640 pixels in either width or height) of
  the images in JPEG format. To avoid having too many files in a single
  directory (which is an issue for some types of partitions), the files are
  placed in subdirectories named using the first two characters of the IID,
  so the paths of the files after extraction are of the form:

    ./WEBUPV/images/{IID:0:2}/{IID}.jpg

* feats_visual/webupv14_*.feat.gz

  The visual features in a simple ASCII text sparse format. The first line of
  the file indicates the number of vectors (N) and the dimensionality (DIMS).
  Each subsequent line corresponds to one vector, starting with the number of
  non-zero elements and followed by dimension-value pairs, with dimension
  indices starting at 0. In summary, the file format is:

    N DIMS
    nz1 Dim(1,1) Val(1,1) ... Dim(1,nz1) Val(1,nz1)
    nz2 Dim(2,1) Val(2,1) ... Dim(2,nz2) Val(2,nz2)
    ...
    nzN Dim(N,1) Val(N,1) ... Dim(N,nzN) Val(N,nzN)

  The order of the features is the same as in the lists
  devel_conceptlists.txt, test_conceptlists.txt and train_iids.txt.

  The procedure to extract the SIFT-based features was as follows. Using the
  ImageMagick software, the images were first rescaled to a maximum of 240
  pixels in both width and height, while preserving the original aspect
  ratio, employing the command:

    convert {IMGIN}.jpg -resize '240>x240>' {IMGOUT}.jpg

  Then the SIFT features were extracted using the ColorDescriptor software
  from Koen van de Sande (http://koen.me/research/colordescriptors). As
  configuration we used the 'densesampling' detector with default parameters,
  and a hard assignment codebook with a 'pyramid-1x1-2x2' spatial pyramid.
  The number in the file name indicates the size of the codebook. All of the
  vectors of the spatial pyramid are given in the same line, so keeping only
  the first 1/5th of the dimensions is equivalent to not using the spatial
  pyramid. The codebook was generated using 1.25 million randomly selected
  features and the k-means algorithm. The GIST features were extracted using
  the LabelMe Toolbox; the images were first resized to 256x256, ignoring the
  original aspect ratio, using 5 scales, 6 orientations and 4 blocks. The
  other features, colorhist and getlf, are both color-histogram based,
  extracted using our own implementation.
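The MD5 checksums in md5sums.txt can be checked programmatically after download. The sketch below assumes md5sums.txt uses the standard md5sum output format (`<md5>  <filename>`, with filenames relative to the file's own directory); that format is an assumption, not stated in the README.

```python
import hashlib
import os


def md5_of(path, chunk=1 << 20):
    """Compute the MD5 hex digest of a file, reading in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            h.update(block)
    return h.hexdigest()


def check_md5sums(md5sums_path):
    """Verify files listed in an md5sum-style checksum file.

    Assumes lines of the form '<md5>  <filename>' (hypothetical for this
    dataset; inspect md5sums.txt to confirm).
    """
    base = os.path.dirname(os.path.abspath(md5sums_path))
    with open(md5sums_path) as f:
        for line in f:
            expected, name = line.split(None, 1)
            name = name.strip()
            status = "OK" if md5_of(os.path.join(base, name)) == expected else "FAILED"
            print(f"{name}: {status}")
```

Chunked reading keeps memory use constant, which matters here given the multi-gigabyte archives.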
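The scofeat line format described above (image ID, unique-word count, then word-score pairs) lends itself to a small parser. The sketch below is illustrative only; the sample line in the comment is invented, not taken from the dataset.

```python
import gzip


def parse_scofeat_line(line):
    """Parse one scofeat line: 'IID NWORDS word1 score1 word2 score2 ...'."""
    parts = line.split()
    iid, nwords = parts[0], int(parts[1])
    pairs = parts[2:]
    # Words are unique per image, so a dict loses nothing.
    scores = {pairs[i]: int(pairs[i + 1]) for i in range(0, 2 * nwords, 2)}
    assert len(scores) == nwords
    return iid, scores


def read_scofeat(path):
    """Yield (iid, {word: score}) per image, in train_iids.txt order."""
    with gzip.open(path, "rt") as f:
        for line in f:
            yield parse_scofeat_line(line)


# Hypothetical line for illustration (IID and words are made up):
# parse_scofeat_line("000a1b 2 cat 60000 dog 40000")
```

Since scores per image sum to at most 100000, dividing each score by the per-image sum recovers a probability-like weighting if one is needed.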
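The keyword triplets can be decoded in the same spirit. This sketch assumes each [keyword] is a single whitespace-free token, as the singular "word" in the description suggests; the sample line is invented.

```python
# Mapping from the single-character engine codes given in the README.
ENGINES = {"g": "google", "b": "bing", "y": "yahoo"}


def parse_keywords_line(line):
    """Parse one keywords line: repeated '[keyword] [rank] [search_engine]'."""
    toks = line.split()
    triplets = []
    for i in range(0, len(toks), 3):
        keyword, rank, engine = toks[i], int(toks[i + 1]), toks[i + 2]
        triplets.append((keyword, rank, ENGINES[engine]))
    return triplets


# Hypothetical line for illustration:
# parse_keywords_line("cat 3 g kitten 12 b")
```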
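The two-character sharding scheme used for both the converted webpages and the thumbnails can be captured in small path helpers. A minimal sketch; the IID/RID values used for illustration are placeholders, and the set of valid {CONVM} conversion-method tags is not specified in the README.

```python
from pathlib import Path


def thumb_path(iid: str, root: str = ".") -> Path:
    """Thumbnail path after extracting webupv14_*_images.zip:
    ./WEBUPV/images/{IID:0:2}/{IID}.jpg"""
    return Path(root) / "WEBUPV" / "images" / iid[:2] / f"{iid}.jpg"


def page_path(rid: str, convm: str, root: str = ".") -> Path:
    """Converted-webpage path after extracting *_textual_pages.zip:
    ./WEBUPV/pages/{RID:0:2}/{RID}.{CONVM}.xml.gz"""
    return Path(root) / "WEBUPV" / "pages" / rid[:2] / f"{rid}.{convm}.xml.gz"
```

Sharding by ID prefix keeps directory sizes bounded, which is why the archives are laid out this way in the first place.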
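A reader for the sparse .feat.gz format might look like the following sketch. It densifies each vector into a plain Python list, which is fine for the smaller descriptors; for the spatial-pyramid codebook features a sparse matrix (e.g. SciPy CSR) would be preferable.

```python
import gzip


def read_sparse_feats(path):
    """Read a webupv14 *.feat.gz file into a list of dense row vectors.

    Format: first line 'N DIMS'; each following line is
    'nz dim1 val1 ... dim_nz val_nz' with 0-based dimension indices.
    Rows follow the order of the corresponding *_iids.txt / *_conceptlists.txt.
    """
    with gzip.open(path, "rt") as f:
        n, dims = map(int, f.readline().split())
        feats = []
        for line in f:
            toks = line.split()
            nz = int(toks[0])
            row = [0.0] * dims
            for k in range(nz):
                row[int(toks[1 + 2 * k])] = float(toks[2 + 2 * k])
            feats.append(row)
    assert len(feats) == n
    return feats
```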
Record statistics:

|                  | All versions | This version |
|------------------|--------------|--------------|
| Views            | 288          | 288          |
| Downloads        | 62           | 62           |
| Data volume      | 110.2 GB     | 110.2 GB     |
| Unique views     | 263          | 263          |
| Unique downloads | 13           | 13           |