2012 ImageCLEF WEBUPV Collection

doi:10.5281/zenodo.1038533

Published May 15, 2012 | Version v1

1. Universitat Politecnica de Valencia

This document describes the WEBUPV dataset compiled for the ImageCLEF
2012 Scalable image annotation task. The data listed here is what is
currently available for download. However, additional data can be
released upon request or depending on feedback from the participants.
For debugging purposes, thumbnails of the images in the dataset can be
obtained from a web server at the following URL, where '{IID}' is the
image identifier:

http://risenet.prhlt.upv.es/db/img/{IID}.jpg
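
As a minimal sketch, the thumbnail URL for an image can be built from
its identifier as shown next. The function name thumbnail_url and the
sample identifier are hypothetical and only serve as illustration.

```python
# URL pattern taken from this README; {IID} is the image identifier.
THUMBNAIL_URL = "http://risenet.prhlt.upv.es/db/img/{IID}.jpg"


def thumbnail_url(iid):
    """Return the thumbnail URL for the given image identifier."""
    return THUMBNAIL_URL.format(IID=iid)


# "example_iid" is a made-up placeholder, not a real identifier.
print(thumbnail_url("example_iid"))
```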

The directory structure of the collection is shown next, and below
it there is a brief description of what each compressed file
contains. The MD5 checksums of the files (for verifying a correct
download) can be found in md5sums.txt.
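
A download could be checked against md5sums.txt along the following
lines. This is only a sketch: it assumes that md5sums.txt uses the
usual 'HASH  FILENAME' layout produced by the md5sum tool, which this
README does not confirm, and all function names are hypothetical.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5sums(lines):
    """Parse 'HASH  FILENAME' lines (assumed layout) into {filename: hash}."""
    expected = {}
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) == 2:
            # md5sum marks binary mode with a leading '*' on the file name.
            expected[parts[1].strip().lstrip("*")] = parts[0].lower()
    return expected


def verify_downloads(md5sums_path, base_dir="."):
    """Return {filename: True/False} for the listed files found in base_dir."""
    with open(md5sums_path) as f:
        expected = parse_md5sums(f)
    results = {}
    for name, digest in expected.items():
        path = os.path.join(base_dir, name)
        if os.path.exists(path):
            results[name] = md5_of_file(path) == digest
    return results
```

Files that are listed in md5sums.txt but have not been downloaded are
skipped, so the check can be run on a partial download.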

Directory structure
-------------------

.
|
|--- README.txt
|--- md5sums.txt
|--- webupv_train_lists.zip
|--- webupv_devel_lists.zip
|--- webupv_test_lists.zip
|--- baseline.zip
|
|--- feats_textual/
|      |
|      |--- webupv_train_textual.rawfeat.gz
|      |--- webupv_train_textual.scofeat.gz
|      |--- webupv_train_textual.keywords.gz
|
|--- feats_visual/
       |
       |--- webupv_{train|devel|test}_visual_gist.feat.gz
       |--- webupv_{train|devel|test}_visual_sift_*.feat.gz
       |--- webupv_{train|devel|test}_visual_csift_*.feat.gz
       |--- webupv_{train|devel|test}_visual_rgbsift_*.feat.gz
       |--- webupv_{train|devel|test}_visual_opponentsift_*.feat.gz
       |--- webupv_{train|devel|test}_visual_colorhist.feat.gz

Contents of files
-----------------

* webupv_train_lists.zip
-> train_iids.txt : IDs of the images in the training set (250000).
-> train_rids.txt : IDs of the webpages in the training set.
-> train_rimgsrc.txt : The URLs of the images as referenced in each
of the webpages. This can also be useful as a
textual feature.

* webupv_devel_lists.zip
-> devel_iids.txt : IDs of the images in the development set (1000).
-> devel_concepts.txt : List of concepts for the development set.
-> devel_gnd.txt : Ground truth concepts for the development set
images.

* webupv_test_lists.zip
-> test_iids.txt : IDs of the images in the test set (2000).
-> test_concepts.txt : List of concepts for the test set.
-> test_gnd.txt : Ground truth concepts for the test set images.

* baseline.zip

An archive that includes code for computing the evaluation measures
for two baseline techniques for the "Scalable concept image
annotation" subtask. See the included README.txt for details.

* feats_textual/webupv_train_textual.rawfeat.gz

The raw text extracted from the webpages near where the images
appeared. Each line starts with the image and webpage IDs followed
by the text extracted. The position of the image within the text is
indicated by the special word '{X}'. The extracted text is somewhat
filtered (e.g. there are no HTML tags), although removed words and
tags have been replaced by full stops '.' to preserve word
distances. The title of the webpage is always included, and it is
the first sentence of the text. In total, the file has 275749 lines,
which is more than the number of training images, since an image can
appear in more than one webpage.
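
The line layout described above could be parsed roughly as follows.
This sketch assumes that the two IDs and the text are separated by
whitespace and that '{X}' appears as a standalone token; the function
names and the sample line are hypothetical.

```python
def parse_rawfeat_line(line):
    """Split a raw-text line into (image_id, webpage_id, text)."""
    parts = line.rstrip("\n").split(None, 2)
    iid, rid = parts[0], parts[1]
    # The text may be empty if nothing was extracted for the image.
    text = parts[2] if len(parts) > 2 else ""
    return iid, rid, text


def image_position(text, marker="{X}"):
    """Return the token index of the image marker, or -1 if it is absent."""
    tokens = text.split()
    return tokens.index(marker) if marker in tokens else -1


# Made-up sample line, not taken from the dataset.
sample = "iid1 rid1 Page title . some words {X} more words\n"
iid, rid, text = parse_rawfeat_line(sample)
print(iid, rid, image_position(text))
```

Since removed words and tags are replaced by full stops, the token
distance between a word and '{X}' can serve as a proxy for how close
that word was to the image in the original webpage.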

* feats_textual/webupv_train_textual.scofeat.gz

The processed text extracted from the webpages near where the images
appeared. Each line corresponds to one image, having the same order
as the train_iids.txt list. The lines start with the image ID,
followed by the number of extracted unique words and the
corresponding word-score pairs. The scores were derived taking into
account 1) the term frequency (TF), 2) the document object model
(DOM) attributes, and 3) the word distance to the image. The scores
are all integers and for each image the sum of scores is always
<=100000 (i.e. it is normalized).
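
A line of this file could be read as sketched next, assuming
whitespace-separated fields and that each pair is given as the word
followed by its score. The function name and the sample line are
hypothetical.

```python
def parse_scofeat_line(line):
    """Parse 'IID COUNT word1 score1 word2 score2 ...' into (iid, {word: score})."""
    fields = line.split()
    iid, count = fields[0], int(fields[1])
    pairs = fields[2:]
    if len(pairs) != 2 * count:
        raise ValueError("expected %d word-score pairs for %s" % (count, iid))
    # Assumed order within each pair: word first, then its integer score.
    scores = {pairs[i]: int(pairs[i + 1]) for i in range(0, len(pairs), 2)}
    return iid, scores


# Made-up sample line, not taken from the dataset.
iid, scores = parse_scofeat_line("img1 2 dog 60000 park 40000")
print(iid, scores, sum(scores.values()) <= 100000)
```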

* feats_textual/webupv_train_textual.keywords.gz

The words used to find the images when querying image search
engines. Each line corresponds to an image (in the same order as
in train_iids.txt). The lines are composed of triplets:

[keyword] [rank] [search_engine]

where [keyword] is the word used to find the image, [rank] is the
position of the image in the results of the query, and
[search_engine] is a single character indicating in which search
engine it was found ('g':google, 'b':bing, 'y':yahoo).
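
The triplets could be parsed as in the following sketch. It assumes
whitespace-separated fields and integer ranks. Whether each line also
starts with the image ID is not stated here, so the has_id flag is a
hypothetical option for that case; all names are illustrative.

```python
from collections import namedtuple

Hit = namedtuple("Hit", "keyword rank engine")

# Mapping given in this README for the search engine character.
ENGINES = {"g": "google", "b": "bing", "y": "yahoo"}


def parse_keywords_line(line, has_id=False):
    """Parse a line of [keyword] [rank] [search_engine] triplets."""
    fields = line.split()
    # Optionally strip a leading image ID (assumption, see the text above).
    iid = fields.pop(0) if has_id and fields else None
    if len(fields) % 3 != 0:
        raise ValueError("line is not a sequence of triplets")
    hits = [
        Hit(fields[i], int(fields[i + 1]), ENGINES[fields[i + 2]])
        for i in range(0, len(fields), 3)
    ]
    return iid, hits


# Made-up sample line, not taken from the dataset.
print(parse_keywords_line("dog 12 g puppy 3 b"))
```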

* feats_visual/webupv_*.feat.gz

The visual features in a simple ASCII text sparse format. The first
line of the file indicates the number of vectors (N) and the
dimensionality (DIMS). Then each line corresponds to one vector,
starting with the number of non-zero elements and followed by
dimension-value pairs, where the dimension indices start at 0. In
summary, the file format is:

    N DIMS
    nz1 Dim(1,1) Val(1,1) ... Dim(1,nz1) Val(1,nz1)
    nz2 Dim(2,1) Val(2,1) ... Dim(2,nz2) Val(2,nz2)
    ...
    nzN Dim(N,1) Val(N,1) ... Dim(N,nzN) Val(N,nzN)

The order of the features is the same as in the lists
devel_iids.txt, test_iids.txt and train_iids.txt.
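
The sparse format above could be read with a sketch like the
following, which expands each line into a dense list of floats. The
function names are hypothetical, and the toy file written in the
usage part is made up for illustration.

```python
import gzip


def parse_sparse_vector(line, dims):
    """Expand 'nz dim1 val1 ... dim_nz val_nz' into a dense list of length dims."""
    fields = line.split()
    nz = int(fields[0])
    if len(fields) != 1 + 2 * nz:
        raise ValueError("expected %d dimension-value pairs" % nz)
    vec = [0.0] * dims
    for i in range(nz):
        # Dimension indices are 0-based, as stated in the format description.
        vec[int(fields[1 + 2 * i])] = float(fields[2 + 2 * i])
    return vec


def read_feats(path):
    """Yield one dense vector per line of a gzip-compressed feature file."""
    with gzip.open(path, "rt") as f:
        n, dims = map(int, f.readline().split())
        for _ in range(n):
            yield parse_sparse_vector(f.readline(), dims)


# Usage with a tiny made-up file: 2 vectors of dimensionality 4.
with gzip.open("toy.feat.gz", "wt") as f:
    f.write("2 4\n2 0 1.5 3 0.25\n0\n")
for vec in read_feats("toy.feat.gz"):
    print(vec)
```

The reader is a generator so that vectors are processed one at a
time; expanding all 250000 training vectors into dense lists at once
may require a large amount of memory.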

The SIFT-based features in this subdirectory were extracted as
follows. Using the ImageMagick software, the images were first
rescaled so that neither the width nor the height exceeds 240
pixels, while preserving the original aspect ratio, employing the
command:

convert {IMGIN}.jpg -resize '240>x240>' {IMGOUT}.jpg

Then the SIFT features were extracted using the ColorDescriptor
software from Koen van de Sande
(http://koen.me/research/colordescriptors). As configuration, we
used the 'densesampling' detector with default parameters and a hard
assignment codebook with the spatial pyramid 'pyramid-1x1-2x2'. The
number in the file name indicates the size of the codebook. All of
the vectors of the spatial pyramid are given in the same line, so
keeping only the first 1/5th of the dimensions would be equivalent
to not using the spatial pyramid. The codebook was generated using
1.25 million randomly selected features and the k-means algorithm.
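
The 'pyramid-1x1-2x2' configuration yields 1 + 4 = 5 histograms per
image, which is why the first 1/5th of the dimensions corresponds to
not using the spatial pyramid. A sketch of that truncation for one
sparse vector follows. The function name is hypothetical, and the
codebook size of 1000 in the usage is only an illustrative value.

```python
def drop_spatial_pyramid(pairs, dims):
    """Keep only the (dimension, value) pairs in the first fifth of the vector.

    Returns the filtered pairs and the reduced dimensionality.
    """
    base = dims // 5  # size of one histogram, i.e. the codebook size
    return [(d, v) for d, v in pairs if d < base], base


# Made-up sparse vector for a 5 * 1000 dimensional feature.
pairs = [(0, 1.0), (999, 2.0), (1000, 3.0), (4999, 4.0)]
print(drop_spatial_pyramid(pairs, 5000))
```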

Contact
-------

For further questions, please contact:
Mauricio Villegas <mauvilsa@upv.es>

The record is publicly accessible, but files are restricted to users with access.

This dataset is available under a Creative Commons Attribution-
NonCommercial-ShareAlike 3.0 Unported License. Before downloading
the data, please read and accept the Creative Commons License and
the following usage agreement:

Data Usage Agreement ImageCLEF 2012/2013/2014/2015/2016 WEBUPV Image
Annotation Datasets

By downloading the "Dataset", you (the "Researcher") agree to the
following terms.

* The Researcher will only use the Dataset for non-commercial
research and/or educational purposes.

* The Researcher will cite one of the following papers in any
publication that makes use of the Dataset.

Gilbert, A., Piras, L., Wang, J., Yan, F., Ramisa, A.,
Dellandrea, E., Gaizauskas, R., Villegas, M., Mikolajczyk, K.:
Overview of the ImageCLEF 2016 scalable concept image
annotation task. In: CLEF2016 Working Notes. CEUR Workshop
Proceedings, CEUR-WS.org, Évora, Portugal (September 5-8 2016)

Gilbert, A., Piras, L., Wang, J., Yan, F., Dellandrea, E.,
Gaizauskas, R., Villegas, M., Mikolajczyk, K.: Overview of the
ImageCLEF 2015 Scalable Image Annotation, Localization and
Sentence Generation task. In: CLEF2015 Working Notes. CEUR
Workshop Proceedings, CEUR-WS.org, Toulouse, France (September
8-11 2015)

Villegas, M., Paredes, R.: Overview of the ImageCLEF 2014
Scalable Concept Image Annotation Task. In: CLEF2014 Working
Notes. CEUR Workshop Proceedings, vol. 1180, pp. 308–328.
CEUR-WS.org, Sheffield, UK (September 15-18 2014)

Villegas, M., Paredes, R., Thomee, B.: Overview of the ImageCLEF
2013 Scalable Concept Image Annotation Subtask. In: CLEF 2013
Evaluation Labs and Workshop, Online Working Notes. Valencia,
Spain (September 23-26 2013)

* The Researcher may provide research associates and colleagues a
copy of the Dataset provided that they also agree to this Data
Usage Agreement.

* The Researcher will assume all responsibility against any claims
arising from Researcher's use of the Dataset.

Additional details

Is supplement to: http://ceur-ws.org/Vol-1178/CLEF2012wn-ImageCLEF-VillegasEt2012.pdf (URL); http://imageclef.org/2012/photo-web (URL)
