Published May 8, 2025 | Version 1.0.0
Dataset Open

A Compendium of Regular Expression Shapes in SPARQL Queries

  • 1. University of Bayreuth

Description

Regular path queries (RPQs) are at the heart of navigational queries in graph databases. Motivated by new features of regular path queries in the languages Cypher, GQL, and SQL/PGQ, which require new approaches for indexing and compactly storing intermediate query results, we investigate a large corpus of real-world RPQs. Our corpus consists of 148.7 million RPQs occurring in 937.2 million SPARQL queries, used on 29 different data sets.

We investigate three main questions on these logs. First, what is the syntactic structure of RPQs in practice? Second, how much non-determinism do they have? Third, can they be evaluated tractably under simple path and trail semantics?

Concerning the first question, we show that all the RPQs can be classified into only 572 different syntactic shapes, which we provide as a downloadable data set on Zenodo. Furthermore, we classify the relative use of various RPQ operators and of popular predicates that are used for transitive navigation. Concerning the second question, we show that although non-determinism occurs in the RPQs, fewer than one in ten million requires a deterministic finite automaton with more states than the size of the regular expression. This is remarkable because this blow-up is known to be exponential in the worst case.
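To make the worst-case blow-up concrete, here is a minimal sketch using the textbook language "the n-th symbol from the end is a", written (a|b)*a(a|b)^(n-1). This example is an illustration only and is not taken from the query logs; the function names `nth_from_end_nfa` and `count_dfa_states` are hypothetical. The sketch builds an NFA with n+1 states and counts the states reachable by the subset construction.

```python
from collections import deque


def nth_from_end_nfa(n):
    """Transition table of an NFA for (a|b)* a (a|b)^(n-1), with n >= 1.

    State 0 is the start state and loops on every symbol. On 'a' it may
    also guess that this is the n-th symbol from the end and move to
    state 1. States 1..n-1 count the remaining symbols; state n accepts.
    """
    delta = {(0, "a"): {0, 1}, (0, "b"): {0}}
    for i in range(1, n):
        delta[(i, "a")] = {i + 1}
        delta[(i, "b")] = {i + 1}
    return delta


def count_dfa_states(delta, start=0, alphabet=("a", "b")):
    """Run the subset construction and count the reachable DFA states."""
    start_set = frozenset({start})
    seen = {start_set}
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        for symbol in alphabet:
            # Union of all NFA moves from the states in the current subset.
            target = frozenset(
                q for s in current for q in delta.get((s, symbol), ())
            )
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return len(seen)


if __name__ == "__main__":
    for n in (1, 2, 3, 10):
        print(n, count_dfa_states(nth_from_end_nfa(n)))
```

For this family the subset construction reaches 2^n states while the expression grows only linearly in n, which is the kind of exponential gap that, according to the description above, almost never shows up in the RPQs of the corpus.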

When using this data set, please cite the following paper:
@inproceedings{HM25,
  author       = {Janik Hammerer and Wim Martens},
  title        = {A Compendium of Regular Expression Shapes in SPARQL Queries},
  booktitle    = {Joint International Workshop on Graph Data Management Experiences {\&} Systems {(GRADES)}
                  and Network Data Analytics (NDA)},
  publisher    = {{ACM}},
  year         = {2025},
  url          = {https://doi.org/10.1145/3735546.3735853},
  doi          = {10.1145/3735546.3735853}
}

Files

affymetrix.csv

Files (144.2 kB)

Checksum Size
md5:cf44a966ac88eb77d921611002b60826 157 Bytes
md5:cf44a966ac88eb77d921611002b60826 157 Bytes
md5:d9adc915601ddc821ff22d8193dbd719 509 Bytes
md5:76b65caa7a6d4a527d9f299e353fc3ca 196 Bytes
md5:6440912491482ef3b4b4dae58ea4c6d8 5.4 kB
md5:cf44a966ac88eb77d921611002b60826 157 Bytes
md5:ea1fd6ae972d63d89eed14e5d125bfaa 314 Bytes
md5:3e7b75ceefa0633daef0c071812be8ba 99.5 kB
md5:cf44a966ac88eb77d921611002b60826 157 Bytes
md5:87da6e5d36978479ceb22c365be61a0c 196 Bytes
md5:b1d641450ea73141306c6dc9659e0c22 197 Bytes
md5:85ef38d9ccf3d9e3c517af0333fc9fad 196 Bytes
md5:cf44a966ac88eb77d921611002b60826 157 Bytes
md5:cf44a966ac88eb77d921611002b60826 157 Bytes
md5:cf44a966ac88eb77d921611002b60826 157 Bytes
md5:fe9c2d863e189e3f8762d739c57aaaca 157 Bytes
md5:ad720c7947832a9d9186235209281219 445 Bytes
md5:cf44a966ac88eb77d921611002b60826 157 Bytes
md5:05a833740ec55a14c1561074d9263dcd 195 Bytes
md5:2ec7c5b507225d64c47f83f5962eee9b 196 Bytes
md5:72d1fea37f991ef16962f174fd113507 310 Bytes
md5:85fdff2c4893342b13b1225f1e1f9b91 156 Bytes
md5:cf44a966ac88eb77d921611002b60826 157 Bytes
md5:cf44a966ac88eb77d921611002b60826 157 Bytes
md5:50cd452e53b4754421c619e95175def7 195 Bytes
md5:5926b9d4337a1e47db45e7fc9e62bacd 634 Bytes
md5:764a2a1307f236664c61433618d9073c 196 Bytes
md5:13be5bc14461f248fab3ca1d5ecbaaab 26.3 kB
md5:afb3d405997e166097aed7e49cfcc868 7.2 kB
md5:cf44a966ac88eb77d921611002b60826 157 Bytes

Additional details

Related works

Is supplement to
Dataset: 10.1145/3735546.3735853 (DOI)

Software

Programming language
CSV