Dataset Restricted Access
Villegas, Mauricio; Paredes, Roberto
<?xml version='1.0' encoding='utf-8'?> <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:adms="http://www.w3.org/ns/adms#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dct="http://purl.org/dc/terms/" xmlns:dctype="http://purl.org/dc/dcmitype/" xmlns:dcat="http://www.w3.org/ns/dcat#" xmlns:duv="http://www.w3.org/ns/duv#" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:frapo="http://purl.org/cerif/frapo/" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:gsp="http://www.opengis.net/ont/geosparql#" xmlns:locn="http://www.w3.org/ns/locn#" xmlns:org="http://www.w3.org/ns/org#" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:prov="http://www.w3.org/ns/prov#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:schema="http://schema.org/" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:vcard="http://www.w3.org/2006/vcard/ns#" xmlns:wdrs="http://www.w3.org/2007/05/powder-s#"> <rdf:Description rdf:about="https://doi.org/10.5281/zenodo.259758"> <rdf:type rdf:resource="http://www.w3.org/ns/dcat#Dataset"/> <dct:type rdf:resource="http://purl.org/dc/dcmitype/Dataset"/> <dct:identifier rdf:datatype="http://www.w3.org/2001/XMLSchema#anyURI">https://doi.org/10.5281/zenodo.259758</dct:identifier> <foaf:page rdf:resource="https://doi.org/10.5281/zenodo.259758"/> <dct:creator> <rdf:Description> <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Agent"/> <foaf:name>Villegas, Mauricio</foaf:name> <foaf:givenName>Mauricio</foaf:givenName> <foaf:familyName>Villegas</foaf:familyName> <org:memberOf> <foaf:Organization> <foaf:name>Universitat Politecnica de Valencia</foaf:name> </foaf:Organization> </org:memberOf> </rdf:Description> </dct:creator> <dct:creator> <rdf:Description> <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Agent"/> <foaf:name>Paredes, Roberto</foaf:name> <foaf:givenName>Roberto</foaf:givenName> <foaf:familyName>Paredes</foaf:familyName> <org:memberOf> <foaf:Organization> <foaf:name>Universitat Politecnica de 
Valencia</foaf:name> </foaf:Organization> </org:memberOf> </rdf:Description> </dct:creator> <dct:title>2014 ImageCLEF WEBUPV Collection</dct:title> <dct:publisher> <foaf:Agent> <foaf:name>Zenodo</foaf:name> </foaf:Agent> </dct:publisher> <dct:issued rdf:datatype="http://www.w3.org/2001/XMLSchema#gYear">2014</dct:issued> <frapo:isFundedBy rdf:resource="info:eu-repo/grantAgreement/EC/FP7/600707/"/> <schema:funder> <foaf:Organization> <dct:identifier rdf:datatype="http://www.w3.org/2001/XMLSchema#string">10.13039/100011102</dct:identifier> <foaf:name>European Commission</foaf:name> </foaf:Organization> </schema:funder> <dct:issued rdf:datatype="http://www.w3.org/2001/XMLSchema#date">2014-04-01</dct:issued> <owl:sameAs rdf:resource="https://zenodo.org/record/259758"/> <adms:identifier> <adms:Identifier> <skos:notation rdf:datatype="http://www.w3.org/2001/XMLSchema#anyURI">https://zenodo.org/record/259758</skos:notation> <adms:schemeAgency>url</adms:schemeAgency> </adms:Identifier> </adms:identifier> <dct:relation rdf:resource="http://ceur-ws.org/Vol-1180/CLEF2014wn-Image-VillegasEt2014.pdf"/> <dct:relation rdf:resource="http://imageclef.org/2014/annotation"/> <dct:isPartOf rdf:resource="https://zenodo.org/communities/ecfunded"/> <dct:isPartOf rdf:resource="https://zenodo.org/communities/imageclef"/> <dct:description><p>This document describes the WEBUPV dataset compiled for the ImageCLEF 2014<br> Scalable Concept Image Annotation challenge. The data mentioned here indicates what<br> is ready for download. However, upon request or depending on feedback from the<br> participants, additional data may be released.</p> <p>The following is the directory structure of the collection, and below there<br> is a brief description of what each compressed file contains. 
The<br> corresponding MD5 checksums of the files shown (for verifying a correct<br> download) can be found in md5sums.txt.</p> <p>Directory structure<br> -------------------</p> <p>.<br> |<br> |--- README.txt<br> |--- md5sums.txt<br> |--- webupv14_train_lists.zip<br> |--- webupv14_train2_lists.zip<br> |--- webupv14_devel_lists.zip<br> |--- webupv14_test_lists.zip<br> |--- webupv14_baseline.zip<br> |<br> |--- feats_textual/<br> |&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |<br> |&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |--- webupv14_{train|train2}_textual_pages.zip<br> |&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |--- webupv14_{train|train2}_textual.scofeat.gz<br> |&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |--- webupv14_{train|train2}_textual.keywords.gz<br> |<br> |--- feats_visual/<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |--- webupv14_{train|train2|devel|test}_visual_images.zip<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |--- webupv14_{train|train2|devel|test}_visual_gist2.feat.gz<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |--- webupv14_{train|train2|devel|test}_visual_sift_1000.feat.gz<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |--- webupv14_{train|train2|devel|test}_visual_csift_1000.feat.gz<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |--- webupv14_{train|train2|devel|test}_visual_rgbsift_1000.feat.gz<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |--- webupv14_{train|train2|devel|test}_visual_opponentsift_1000.feat.gz<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |--- webupv14_{train|train2|devel|test}_visual_colorhist.feat.gz<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |--- webupv14_{train|train2|devel|test}_visual_getlf.feat.gz</p> <p><br> Contents of files<br> -----------------</p> <p>* webupv14_train{|2}_lists.zip</p> <p>&nbsp; The first training set (&quot;train_*&quot;) includes images for the concepts of the<br> &nbsp; development set, whereas the second training set (&quot;train2_*&quot;) includes<br> &nbsp; images for the concepts in the test set that are not in the development 
set.</p> <p>&nbsp; -&gt; train{|2}_iids.txt : IDs of the images (IIDs) in the training set.</p> <p>&nbsp; -&gt; train{|2}_rids.txt : IDs of the webpages (RIDs) in the training set.</p> <p>&nbsp; -&gt; train{|2}_*urls.txt : The original URLs from where the images (iurls)<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; and the webpages (rurls) were downloaded. Each line in the file<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; corresponds to an image, starting with the IID and is followed<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; by one or more URLs.</p> <p>&nbsp; -&gt; train{|2}_rimgsrc.txt : The URLs of the images as referenced in each<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; of the webpages. Each line of the file is of the form: IID RID<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; URL1 [URL2 ...]. This information is necessary to locate the<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; images within the webpages and it can also be useful as a<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; textual feature.</p> <p>* webupv14_{devel|test}_lists.zip</p> <p>&nbsp; -&gt; {devel|test}_conceptlists.txt : Lists per image of concepts for<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; annotation. Each line starts with an image ID and is followed by the<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; list of concepts in alphabetical order. Each ID may appear more than<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; once. In total there are 1940 image annotation lists for the<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; development set and 7291 image annotation lists for the test set. 
These<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; correspond to 1000 and 4122 unique images (IDs) for the development and<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; test sets, respectively.</p> <p>&nbsp; -&gt; {devel|test}_allconcepts.txt : Complete list of concepts for the<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; development/test set.</p> <p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; The concepts are defined by one or more WordNet synsets, which is<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; intended to make it possible to easily obtain more information about<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; the concepts, e.g. synonyms. In the concept list, the first column<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (which is the name of the concept) indicates the word to search in<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; WordNet, the second column the synset type (either noun or adjective),<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; the third column is the sense number and the fourth column is the<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; WordNet offset (although this cannot be trusted since it changes<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; between WordNet versions). For most of the concepts there is a fifth<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; column which is a Wikipedia article related to the concept.</p> <p>&nbsp; -&gt; {devel|test}_groundtruth.txt : Ground truth concepts for the development<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; and test sets.</p> <p>&nbsp; -&gt; {devel|test}_*urls.txt : The original URLs from where the images (iurls)<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; and the webpages (rurls) were downloaded. 
Each line in the file<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; corresponds to an image, starting with the IID and followed by one<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; or more URLs.</p> <p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Note: These are included only to acknowledge the source of the<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; data, not to be used as input to the annotation systems.</p> <p><br> * webupv14_baseline.zip</p> <p>&nbsp; An archive that includes code for computing the evaluation measures<br> &nbsp; for two baseline techniques. See the included README.txt for<br> &nbsp; details.</p> <p><br> * feats_textual/webupv14_train{|2}_textual_pages.zip</p> <p>&nbsp; Contains all of the webpages which referenced the images in the<br> &nbsp; training set after being converted to valid xml. In total there are<br> &nbsp; 262588 files, since each image can appear in more than one page, and<br> &nbsp; there can be several versions of the same page which differ by the<br> &nbsp; method of conversion to xml. To avoid having too many files in a<br> &nbsp; single directory (which is an issue for some types of partitions),<br> &nbsp; the files are found in subdirectories named using the first two<br> &nbsp; characters of the RID, thus the paths of the files after extraction<br> &nbsp; are of the form:</p> <p>&nbsp;&nbsp;&nbsp; ./WEBUPV/pages/{RID:0:2}/{RID}.{CONVM}.xml.gz</p> <p>&nbsp; To be able to locate the training images within the webpages, the<br> &nbsp; URLs of the images as referenced are provided in the file<br> &nbsp; train_rimgsrc.txt.</p> <p>* feats_textual/webupv14_train{|2}_textual.scofeat.gz</p> <p>&nbsp; The processed text extracted from the webpages near where the images<br> &nbsp; appeared. Each line corresponds to one image, having the same order<br> &nbsp; as the train_iids.txt list. The lines start with the image ID,<br> &nbsp; followed by the number of extracted unique words and the<br> &nbsp; corresponding word-score pairs. 
The scores were derived taking into<br> &nbsp; account 1) the term frequency (TF), 2) the document object model<br> &nbsp; (DOM) attributes, and 3) the word distance to the image. The scores<br> &nbsp; are all integers and for each image the sum of scores is always<br> &nbsp; &lt;=100000 (i.e. it is normalized).</p> <p><br> * feats_textual/webupv14_train{|2}_textual.keywords.gz</p> <p>&nbsp; The words used to find the images when querying image search<br> &nbsp; engines. Each line corresponds to an image (in the same order as in<br> &nbsp; train_iids.txt). The lines are composed of triplets:</p> <p>&nbsp;&nbsp;&nbsp; [keyword] [rank] [search_engine]</p> <p>&nbsp; where [keyword] is the word used to find the image, [rank] is the<br> &nbsp; position given to the image in the query, and [search_engine] is a<br> &nbsp; single character indicating in which search engine it was found<br> &nbsp; (&#39;g&#39;:google, &#39;b&#39;:bing, &#39;y&#39;:yahoo).</p> <p><br> * feats_visual/webupv14_*_images.zip</p> <p>&nbsp; Contains thumbnails (maximum 640 pixels of either width or height)<br> &nbsp; of the images in jpeg format. To avoid having too many files in a<br> &nbsp; single directory (which is an issue for some types of partitions),<br> &nbsp; the files are found in subdirectories named using the first two<br> &nbsp; characters of the IID, thus the paths of the files after extraction<br> &nbsp; are of the form:</p> <p>&nbsp;&nbsp;&nbsp; ./WEBUPV/images/{IID:0:2}/{IID}.jpg</p> <p>* feats_visual/webupv14_*.feat.gz</p> <p>&nbsp; The visual features in a simple ASCII text sparse format. The first<br> &nbsp; line of the file indicates the number of vectors (N) and the<br> &nbsp; dimensionality (DIMS). Then each line corresponds to one vector,<br> &nbsp; starting with the number of non-zero elements and followed by pairs<br> &nbsp; of dimension-value, being the first dimension 0. 
In summary the file<br> &nbsp; format is:</p> <p>&nbsp;&nbsp;&nbsp; N DIMS<br> &nbsp;&nbsp;&nbsp; nz1 Dim(1,1) Val(1,1) ... Dim(1,nz1) Val(1,nz1)<br> &nbsp;&nbsp;&nbsp; nz2 Dim(2,1) Val(2,1) ... Dim(2,nz2) Val(2,nz2)<br> &nbsp;&nbsp;&nbsp; ...<br> &nbsp;&nbsp;&nbsp; nzN Dim(N,1) Val(N,1) ... Dim(N,nzN) Val(N,nzN)</p> <p>&nbsp; The order of the features is the same as in the lists<br> &nbsp; devel_conceptlists.txt, test_conceptlists.txt and train_iids.txt.</p> <p>&nbsp; The procedure to extract the SIFT based features in this<br> &nbsp; subdirectory was conducted as follows. Using the ImageMagick<br> &nbsp; software, the images were first rescaled to have a maximum of 240<br> &nbsp; pixels, of both width and height, while preserving the original<br> &nbsp; aspect ratio, employing the command:</p> <p>&nbsp;&nbsp;&nbsp; convert {IMGIN}.jpg -resize &#39;240&gt;x240&gt;&#39; {IMGOUT}.jpg</p> <p>&nbsp; Then the SIFT features were extracted using the ColorDescriptor<br> &nbsp; software from Koen van de Sande<br> &nbsp; (http://koen.me/research/colordescriptors). As configuration we<br> &nbsp; used the &#39;densesampling&#39; detector with default parameters, and a hard<br> &nbsp; assignment codebook using a spatial pyramid as<br> &nbsp; &#39;pyramid-1x1-2x2&#39;. The number in the file name indicates the size of<br> &nbsp; the codebook. All of the vectors of the spatial pyramid are given in<br> &nbsp; the same line, thus keeping only the first 1/5th of the dimensions<br> &nbsp; would be like not using the spatial pyramid. The codebook was<br> &nbsp; generated using 1.25 million randomly selected features and the<br> &nbsp; k-means algorithm. The GIST features were extracted using the<br> &nbsp; LabelMe Toolbox. The images were first resized to 256x256 ignoring<br> &nbsp; original aspect ratio, using 5 scales, 6 orientations and 4<br> &nbsp; blocks. 
The other features, colorhist and getlf, are both color<br> &nbsp; histogram based and were extracted using our own implementation.</p> <p>&nbsp;</p></dct:description> <dct:accessRights rdf:resource="http://publications.europa.eu/resource/authority/access-right/RESTRICTED"/> <dct:accessRights> <dct:RightsStatement rdf:about="info:eu-repo/semantics/restrictedAccess"> <rdfs:label>Restricted Access</rdfs:label> </dct:RightsStatement> </dct:accessRights> <dcat:distribution> <dcat:Distribution> <dcat:accessURL rdf:resource="https://doi.org/10.5281/zenodo.259758"/> </dcat:Distribution> </dcat:distribution> </rdf:Description> <foaf:Project rdf:about="info:eu-repo/grantAgreement/EC/FP7/600707/"> <dct:identifier rdf:datatype="http://www.w3.org/2001/XMLSchema#string">600707</dct:identifier> <dct:title>tranScriptorium</dct:title> <frapo:isAwardedBy> <foaf:Organization> <dct:identifier rdf:datatype="http://www.w3.org/2001/XMLSchema#string">10.13039/100011102</dct:identifier> <foaf:name>European Commission</foaf:name> </foaf:Organization> </frapo:isAwardedBy> </foaf:Project> </rdf:RDF>
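The sparse ASCII layout that the description gives for the `feats_visual/webupv14_*.feat.gz` files (a header line `N DIMS`, then one line per vector holding the count of non-zero elements followed by dimension-value pairs, with dimensions starting at 0) could be read with something like the following minimal Python sketch. The function names `read_sparse_feats` and `read_feat_gz` are hypothetical, parsing the values as floats is an assumption (the description does not state the value type), and the in-memory sample is made up for illustration rather than taken from the dataset.

```python
import gzip
import io


def read_sparse_feats(stream):
    """Parse 'N DIMS' followed by N lines of 'nz dim val [dim val ...]'.

    Returns (n, dims, vectors), where each vector is a dict mapping a
    zero-based dimension index to its value (assumed numeric, read as float).
    """
    header = stream.readline().split()
    n, dims = int(header[0]), int(header[1])
    vectors = []
    for _ in range(n):
        tokens = stream.readline().split()
        nz = int(tokens[0])
        pairs = tokens[1:]
        if len(pairs) != 2 * nz:
            raise ValueError("expected %d dimension-value pairs" % nz)
        vec = {}
        for i in range(0, 2 * nz, 2):
            dim = int(pairs[i])
            if not 0 <= dim < dims:
                raise ValueError("dimension %d outside [0, %d)" % (dim, dims))
            vec[dim] = float(pairs[i + 1])
        vectors.append(vec)
    return n, dims, vectors


def read_feat_gz(path):
    """Open a gzip-compressed feature file as text and parse it."""
    with gzip.open(path, "rt", encoding="ascii") as fh:
        return read_sparse_feats(fh)


# Made-up two-vector sample with four dimensions, for illustration only.
sample = io.StringIO("2 4\n2 0 0.5 3 1.25\n0\n")
n, dims, vectors = read_sparse_feats(sample)
```

Since the description states that the vectors follow the order of the corresponding ID lists, the i-th parsed vector would pair with the i-th entry of, for example, `train_iids.txt`. Keeping each vector as a dict avoids allocating dense arrays for the larger codebook features.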
Metric | All versions | This version |
---|---|---|
Views | 288 | 288 |
Downloads | 62 | 62 |
Data volume | 110.2 GB | 110.2 GB |
Unique views | 263 | 263 |
Unique downloads | 13 | 13 |