Dataset Open Access

Wikidata Graph Pattern Benchmark (WGPB) for RDF/SPARQL

Aidan Hogan; Cristian Riveros; Carlos Rojas; Adrián Soto


MARC21 XML Export

<?xml version='1.0' encoding='UTF-8'?>
<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nmm##2200000uu#4500</leader>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">graph database</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">sparql</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">worst-case optimal</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">benchmark</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">wikidata</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">rdf</subfield>
  </datafield>
  <controlfield tag="005">20210111020435.0</controlfield>
  <controlfield tag="001">4035223</controlfield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">DCC, Pontificia Universidad Católica de Chile; IMFD</subfield>
    <subfield code="0">(orcid)0000-0003-0832-116X</subfield>
    <subfield code="a">Cristian Riveros</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">DCC, Pontificia Universidad Católica de Chile; IMFD</subfield>
    <subfield code="0">(orcid)0000-0002-3328-9256</subfield>
    <subfield code="a">Carlos Rojas</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">DCC, Pontificia Universidad Católica de Chile; IMFD</subfield>
    <subfield code="0">(orcid)0000-0001-7682-1639</subfield>
    <subfield code="a">Adrián Soto</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">15013</subfield>
    <subfield code="z">md5:ebe3ad01f980c124cd7d841152101b98</subfield>
    <subfield code="u">https://zenodo.org/record/4035223/files/wgpb-queries.zip</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">593708070</subfield>
    <subfield code="z">md5:a11b29999983209cfc5fad6f9f439b82</subfield>
    <subfield code="u">https://zenodo.org/record/4035223/files/wikidata-wcg-filtered.nt.bz2</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">6710684347</subfield>
    <subfield code="z">md5:bae0977d699f8f45c28d7b7b428a8529</subfield>
    <subfield code="u">https://zenodo.org/record/4035223/files/wikidata-wcg.nt.bz2</subfield>
  </datafield>
  <datafield tag="542" ind1=" " ind2=" ">
    <subfield code="l">open</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="c">2019-10-30</subfield>
  </datafield>
  <datafield tag="909" ind1="C" ind2="O">
    <subfield code="p">openaire_data</subfield>
    <subfield code="p">user-linkeddata</subfield>
    <subfield code="o">oai:zenodo.org:4035223</subfield>
  </datafield>
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="u">DCC, Universidad de Chile; IMFD</subfield>
    <subfield code="0">(orcid)0000-0001-9482-1982</subfield>
    <subfield code="a">Aidan Hogan</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">Wikidata Graph Pattern Benchmark (WGPB) for RDF/SPARQL</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">user-linkeddata</subfield>
  </datafield>
  <datafield tag="540" ind1=" " ind2=" ">
    <subfield code="u">https://creativecommons.org/licenses/by/4.0/legalcode</subfield>
    <subfield code="a">Creative Commons Attribution 4.0 International</subfield>
  </datafield>
  <datafield tag="650" ind1="1" ind2="7">
    <subfield code="a">cc-by</subfield>
    <subfield code="2">opendefinition.org</subfield>
  </datafield>
  <datafield tag="520" ind1=" " ind2=" ">
    <subfield code="a">&lt;p&gt;The&amp;nbsp;Wikidata Graph Pattern Benchmark (WGPB) is a benchmark consisting of 50 instances of 17 different abstract query patterns giving a total of 850 SPARQL queries.&amp;nbsp;The goal of the benchmark is to test the performance of query engines for more complex basic graph patterns. The benchmark was designed for&amp;nbsp;evaluating worst-case optimal join algorithms but also serves as a general-purpose benchmark for evaluating (basic) graph patterns. The queries are provided in &lt;a href="https://www.w3.org/TR/sparql11-query/"&gt;SPARQL syntax&lt;/a&gt; and all return at least one solution. We limit the number of results returned to a maximum of 1,000.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Queries&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We provide an example of a &amp;quot;square&amp;quot; basic graph pattern (comments are added here for readability):&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-sql"&gt;SELECT * WHERE { 
 ?x1 &amp;lt;http://www.wikidata.org/prop/direct/P149&amp;gt; ?x2 .  # architectural style
 ?x2 &amp;lt;http://www.wikidata.org/prop/direct/P1269&amp;gt; ?x3 . # facet of
 ?x3 &amp;lt;http://www.wikidata.org/prop/direct/P156&amp;gt; ?x4 .  # followed by
 ?x1 &amp;lt;http://www.wikidata.org/prop/direct/P135&amp;gt; ?x4 .  # movement
} LIMIT 1000&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;There are 49 other queries similar to this one in the dataset (replacing the predicates with other predicates), and 50 queries for 16 other abstract query patterns. For more details on these patterns, we refer to the publication mentioned below.&lt;/p&gt;

&lt;p&gt;Note that you can try the queries on the public &lt;a href="https://query.wikidata.org/"&gt;Wikidata Query Service&lt;/a&gt;, though some might give a timeout.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The queries were generated over a reduced version of the &lt;a href="https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Truthy_statements"&gt;Wikidata truthy dump&lt;/a&gt; from November 15,&amp;nbsp;2018 that we call the Wikidata Core Graph (WCG). Specifically, in order to reduce the data volume, multilingual labels, comments, etc., were removed as they have limited use for evaluating joins (English labels were kept under &lt;em&gt;schema:name&lt;/em&gt;). Thereafter, in order to facilitate the generation of the queries, triples with rare predicates appearing in fewer than 1,000 triples, and very common predicates appearing in more than 1,000,000 triples, were removed. The queries provided will generate the same results over both graphs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Files&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This dataset includes three files:&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;&lt;strong&gt;wgpb-queries.zip &lt;/strong&gt;The list of 850 queries&lt;/li&gt;
	&lt;li&gt;&lt;strong&gt;wikidata-wcg.nt.bz2&amp;nbsp;&lt;/strong&gt;Wikidata truthy graph&amp;nbsp;with English labels&lt;/li&gt;
	&lt;li&gt;&lt;strong&gt;wikidata-wcg-filtered.nt.bz2&amp;nbsp;&lt;/strong&gt;Wikidata truthy graph&amp;nbsp;with English labels,&amp;nbsp;with triples removed for rare predicates (&amp;lt;1,000 triples) and very common predicates (&amp;gt;1,000,000 triples)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Code&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We provide &lt;a href="https://cirojas.github.io/leapfrog-benchmark/"&gt;the code&lt;/a&gt; for generating the datasets, queries, etc., along with scripts and instructions on how to run these queries in a variety of SPARQL engines (Blazegraph, Jena, Virtuoso and our worst-case optimal variant of Jena).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Publication&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The benchmark is proposed, described and used in the following paper. You can find more details about how it was generated, the 17 abstract patterns that were used, as well as results for prominent SPARQL engines.&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;Aidan Hogan, Cristian Riveros, Carlos Rojas and Adri&amp;aacute;n Soto. &amp;quot;&lt;a href="http://aidanhogan.com/docs/SPARQL_worst_case_optimal.pdf"&gt;&lt;em&gt;A Worst-Case Optimal Join Algorithm for SPARQL&lt;/em&gt;&lt;/a&gt;&amp;quot;. In the Proceedings of the&amp;nbsp;18th International Semantic Web Conference (ISWC), Auckland, New Zealand, October 26&amp;ndash;30, 2019.&lt;/li&gt;
&lt;/ul&gt;</subfield>
  </datafield>
  <datafield tag="773" ind1=" " ind2=" ">
    <subfield code="n">doi</subfield>
    <subfield code="i">isVersionOf</subfield>
    <subfield code="a">10.5281/zenodo.4035222</subfield>
  </datafield>
  <datafield tag="024" ind1=" " ind2=" ">
    <subfield code="a">10.5281/zenodo.4035223</subfield>
    <subfield code="2">doi</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">dataset</subfield>
  </datafield>
</record>
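
The record above lists each downloadable file in an 856 datafield. The following is a minimal Python sketch of how those entries could be read and a downloaded file checked against its listed checksum. It assumes that subfield "s" holds the size in bytes, "z" the MD5 digest prefixed with "md5:", and "u" the download URL, as the values in this record suggest. The helper names (list_files, md5_of, matches_record) and the cut-down SAMPLE record are illustrative and not part of the dataset or of any Zenodo tooling.

```python
import hashlib
import xml.etree.ElementTree as ET

# Namespace declared on the <record> element of the MARC21 export.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Cut-down copy of the record, keeping one 856 field for illustration.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">15013</subfield>
    <subfield code="z">md5:ebe3ad01f980c124cd7d841152101b98</subfield>
    <subfield code="u">https://zenodo.org/record/4035223/files/wgpb-queries.zip</subfield>
  </datafield>
</record>"""


def list_files(marc_xml):
    """Return one dict (url, size, md5) per 856 datafield in the record."""
    root = ET.fromstring(marc_xml)
    entries = []
    for field in root.findall("marc:datafield[@tag='856']", MARC_NS):
        # Map subfield code -> text, e.g. {"s": "15013", "z": "md5:...", "u": "https://..."}
        sub = {s.get("code"): (s.text or "")
               for s in field.findall("marc:subfield", MARC_NS)}
        entries.append({
            "url": sub.get("u", ""),
            "size": int(sub.get("s", "0")),          # assumed: size in bytes
            "md5": sub.get("z", "").split(":", 1)[-1],  # drop the "md5:" prefix
        })
    return entries


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, entry):
    """True if the local file's MD5 equals the digest listed in the record."""
    return md5_of(path) == entry["md5"]


files = list_files(SAMPLE)
for entry in files:
    print(entry["url"], entry["size"], entry["md5"])
```

After downloading a file, matches_record("wgpb-queries.zip", files[0]) would compare the local copy against the listed digest. Reading in chunks keeps memory use flat, which matters for the multi-gigabyte wikidata-wcg.nt.bz2 dump.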