Dataset Open Access
Hagen, Matthias;
Gollub, Tim;
Busse, Matthias
<?xml version='1.0' encoding='UTF-8'?> <record xmlns="http://www.loc.gov/MARC21/slim"> <leader>00000nmm##2200000uu#4500</leader> <datafield tag="999" ind1="C" ind2="5"> <subfield code="x">Tim Gollub, Matthias Busse, Benno Stein, and Matthias Hagen. Keyqueries for Clustering and Labeling. In 12th Asia Information Retrieval Societies Conference (AIRS 2016), pages 42-55, November 2016. Springer.</subfield> </datafield> <datafield tag="041" ind1=" " ind2=" "> <subfield code="a">eng</subfield> </datafield> <datafield tag="653" ind1=" " ind2=" "> <subfield code="a">subtopic information retrieval</subfield> </datafield> <datafield tag="653" ind1=" " ind2=" "> <subfield code="a">subtopic</subfield> </datafield> <datafield tag="653" ind1=" " ind2=" "> <subfield code="a">documents</subfield> </datafield> <datafield tag="653" ind1=" " ind2=" "> <subfield code="a">ambient</subfield> </datafield> <controlfield tag="005">20200124192255.0</controlfield> <controlfield tag="001">3250669</controlfield> <datafield tag="711" ind1=" " ind2=" "> <subfield code="g">AIRS 2016</subfield> <subfield code="a">12th Asia Information Retrieval Societies Conference</subfield> </datafield> <datafield tag="700" ind1=" " ind2=" "> <subfield code="u">Bauhaus-Universität Weimar</subfield> <subfield code="0">(orcid)0000-0003-1737-6517</subfield> <subfield code="a">Gollub, Tim</subfield> </datafield> <datafield tag="700" ind1=" " ind2=" "> <subfield code="u">Bauhaus-Universität Weimar</subfield> <subfield code="a">Busse, Matthias</subfield> </datafield> <datafield tag="856" ind1="4" ind2=" "> <subfield code="s">80294291</subfield> <subfield code="z">md5:69bfbc52d51b0b84c433b9f1f9950200</subfield> <subfield code="u">https://zenodo.org/record/3250669/files/webis-ambient-html.tar.gz</subfield> </datafield> <datafield tag="856" ind1="4" ind2=" "> <subfield code="s">18786945</subfield> <subfield code="z">md5:7f5489c7aa9b9df0ced802a3d59f5637</subfield> <subfield 
code="u">https://zenodo.org/record/3250669/files/webis-ambient-main-content.tar.gz</subfield> </datafield> <datafield tag="856" ind1="4" ind2=" "> <subfield code="s">20282119</subfield> <subfield code="z">md5:9b65a761c86456f5de99e0eee76e4ff9</subfield> <subfield code="u">https://zenodo.org/record/3250669/files/webis-ambient-plain-text.tar.gz</subfield> </datafield> <datafield tag="542" ind1=" " ind2=" "> <subfield code="l">open</subfield> </datafield> <datafield tag="260" ind1=" " ind2=" "> <subfield code="c">2015-03-13</subfield> </datafield> <datafield tag="909" ind1="C" ind2="O"> <subfield code="p">openaire_data</subfield> <subfield code="p">user-webis</subfield> <subfield code="o">oai:zenodo.org:3250669</subfield> </datafield> <datafield tag="100" ind1=" " ind2=" "> <subfield code="u">Bauhaus-Universität Weimar</subfield> <subfield code="0">(orcid)0000-0002-9733-2890</subfield> <subfield code="a">Hagen, Matthias</subfield> </datafield> <datafield tag="245" ind1=" " ind2=" "> <subfield code="a">Webis-Ambient-15</subfield> </datafield> <datafield tag="980" ind1=" " ind2=" "> <subfield code="a">user-webis</subfield> </datafield> <datafield tag="540" ind1=" " ind2=" "> <subfield code="u">https://creativecommons.org/licenses/by/4.0/legalcode</subfield> <subfield code="a">Creative Commons Attribution 4.0 International</subfield> </datafield> <datafield tag="650" ind1="1" ind2="7"> <subfield code="a">cc-by</subfield> <subfield code="2">opendefinition.org</subfield> </datafield> <datafield tag="520" ind1=" " ind2=" "> <subfield code="a"><p>This corpus is an extension of the <a href="http://search.fub.it/ambient/">Ambient data set created by Carpineto and Romano</a>. For each subtopic, the websites of the given URLs were downloaded (if accessible). These documents are named after the original documents, for example, 1/1.4/1.3.html. 
Each subtopic was then manually enriched to ten documents with websites retrieved by Google (for example, 1/1.1/g00.html - &#39;g&#39; for Google, 00 for the first Google result). Some subtopics could not be sufficiently enriched and were discarded. Moreover, some subtopics were duplicates or not interpretable and were also discarded.</p> <p>The data set consists of 44 topics (topics.txt) and 481 subtopics (subtopics.txt). Some subtopics are topically very similar and therefore rather difficult to cluster. These subtopics (11.2, 12.13, 14.2, 19.33, 20.2, 20.5, 21.2, 24.3, 24.4, 27.26, 31.16, 36.7, 44.9) are omitted from the file subtopics-filtered.txt, which lists only the remaining 468 subtopics.</p></subfield> </datafield> <datafield tag="773" ind1=" " ind2=" "> <subfield code="n">doi</subfield> <subfield code="i">isVersionOf</subfield> <subfield code="a">10.5281/zenodo.3250668</subfield> </datafield> <datafield tag="024" ind1=" " ind2=" "> <subfield code="a">10.5281/zenodo.3250669</subfield> <subfield code="2">doi</subfield> </datafield> <datafield tag="980" ind1=" " ind2=" "> <subfield code="a">dataset</subfield> </datafield> </record>
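
The naming scheme in the description (for example, 1/1.4/1.3.html for a document from the original Ambient data set and 1/1.1/g00.html for the first Google result) can be read programmatically. The following is a minimal Python sketch, not part of the corpus distribution. It assumes that paths are relative to the corpus root and follow the layout `<topic>/<subtopic>/<name>.html`; the names `DocRef`, `parse_doc_path`, and `in_filtered_set` are hypothetical helpers introduced here for illustration.

```python
import re
from typing import NamedTuple, Optional

# Subtopics that the description lists as missing from subtopics-filtered.txt.
EXCLUDED_SUBTOPICS = frozenset({
    "11.2", "12.13", "14.2", "19.33", "20.2", "20.5", "21.2",
    "24.3", "24.4", "27.26", "31.16", "36.7", "44.9",
})

# Assumed layout (inferred from the two example paths in the description):
#   <topic>/<topic>.<subtopic>/<name>.html
_PATH = re.compile(r"^(\d+)/(\d+\.\d+)/(.+)\.html$")
# 'g' followed by two digits marks a document retrieved via Google.
_GOOGLE = re.compile(r"^g(\d{2})$")


class DocRef(NamedTuple):
    topic: int
    subtopic: str
    name: str
    google_rank: Optional[int]  # 0 = first Google result; None = original document


def parse_doc_path(path: str) -> DocRef:
    """Split a corpus-relative document path into its components."""
    match = _PATH.match(path)
    if match is None:
        raise ValueError(f"unexpected document path: {path!r}")
    topic, subtopic, name = match.groups()
    google = _GOOGLE.match(name)
    rank = int(google.group(1)) if google else None
    return DocRef(int(topic), subtopic, name, rank)


def in_filtered_set(doc: DocRef) -> bool:
    """True if the document's subtopic is not one of the 13 excluded subtopics."""
    return doc.subtopic not in EXCLUDED_SUBTOPICS
```

For instance, `parse_doc_path("1/1.1/g00.html").google_rank` evaluates to `0`, while the same field is `None` for `1/1.4/1.3.html`. Removing the 13 listed subtopics from the 481 in subtopics.txt leaves the 468 subtopics stated for subtopics-filtered.txt.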
| | All versions | This version |
|---|---|---|
| Views | 342 | 343 |
| Downloads | 48 | 48 |
| Data volume | 2.0 GB | 2.0 GB |
| Unique views | 300 | 301 |
| Unique downloads | 23 | 23 |