Published September 10, 2016 | Version v1
Journal article Open

Workload-Aware Self-Tuning Histograms for the Semantic Web

  • 1. NCSR "Demokritos"
  • 2. National Technical University of Athens

Description

Query processing systems typically rely on histograms, data structures that approximate the data distribution, in order to optimize query execution. Histograms can be constructed by scanning the database tables and aggregating the values of the attributes in the table or, more efficiently, refined progressively by analysing query results. Most of the relevant literature focuses on histograms of numerical data, exploiting the natural concept of a numerical range as an estimator of the volume of data that falls within the range. This, however, leaves Semantic Web data outside the scope of the histograms literature, as its most prominent datatype, the URI, does not lend itself to defining such ranges. This article first establishes a framework that formalises histograms over arbitrary data types and provides a formalism for specifying value ranges for different datatypes. This makes explicit the properties that ranges are required to have so that histogram refinement algorithms are applicable. We demonstrate that our framework subsumes histograms over numerical data as a special case by using it to formulate the state of the art in numerical histograms. We then use the Jaro-Winkler metric to define URI ranges by exploiting the hierarchical nature of URI strings. This greatly extends the state of the art, where strings are treated as categorical data that can only be described by enumeration. We then present the open-source STRHist system that implements these ideas. Finally, we present empirical evaluation results using STRHist over a real dataset and query workload extracted from AGRIS, the most popular and widely used bibliographic database on agricultural research and technology.
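
The idea of a self-tuning histogram over string ranges can be illustrated with a minimal sketch. It is not the STRHist implementation: it swaps the Jaro-Winkler-based ranges described above for plain URI prefixes, and the class name `PrefixHistogram`, its methods, and the example URIs are all hypothetical. Each bucket is a prefix with a count, an estimate comes from the most specific bucket enclosing the query range, and query feedback adds or corrects buckets.

```python
class PrefixHistogram:
    """Toy self-tuning histogram; a bucket's range is 'all URIs with this prefix'.

    Assumption: plain prefix matching stands in for the Jaro-Winkler-based
    ranges of the article. All names here are illustrative only.
    """

    def __init__(self, total):
        # Root bucket: the empty prefix encloses every URI.
        self.buckets = {"": total}

    def estimate(self, query_prefix):
        # Pick the most specific bucket whose range encloses the query range.
        # Its count is an upper bound on the true cardinality, not a
        # calibrated estimate.
        enclosing = [p for p in self.buckets if query_prefix.startswith(p)]
        return self.buckets[max(enclosing, key=len)]

    def refine(self, query_prefix, observed):
        # Query feedback: store the cardinality actually observed for this
        # range, so later estimates inside it become tighter.
        self.buckets[query_prefix] = observed


h = PrefixHistogram(total=1000)
before = h.estimate("http://example.org/a/x")  # only the root bucket applies
h.refine("http://example.org/a/", 40)          # a query over /a/ returned 40 results
after = h.estimate("http://example.org/a/x")   # now bounded by the /a/ bucket
other = h.estimate("http://example.org/b/")    # unaffected by the refinement
```

With these inputs, `before` and `other` fall back to the root count of 1000, while `after` drops to 40. No table scan is needed for the refinement: the histogram only learns from the results of queries that were executed anyway.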

Files

selftuning-TLDKS.pdf (366.1 kB)
md5:5f4a9b35409b1781d91e919f79ad6645

Additional details

Funding

European Commission: SEMAGROW – SemaGrow: Data intensive techniques to boost the real-time performance of global agricultural data infrastructures (grant 318497)