KGCW 2024 Challenge @ ESWC 2024
Description
Knowledge Graph Construction Workshop 2024: challenge
Knowledge graph construction from heterogeneous data has seen a lot of uptake
in the last decade, ranging from compliance to performance optimizations with
respect to execution time. Apart from execution time, however, other metrics
for comparing knowledge graph construction systems, e.g. CPU or memory usage,
are not considered. This challenge aims at benchmarking systems to find which
RDF graph construction system optimizes for metrics such as execution time,
CPU usage, memory usage, or a combination of these metrics.
Task description
The task is to reduce and report the execution time and computing resources
(CPU and memory usage) for the parameters listed in this challenge, compared
to the state of the art of existing tools and the baseline results provided
by this challenge. The challenge is not limited to reducing execution time to
create the fastest pipeline; it also covers reducing computing resources to
achieve the most efficient pipeline.
We provide a tool which can execute such pipelines end-to-end. This tool also
collects and aggregates the metrics necessary for this challenge, such as
execution time, CPU and memory usage, as CSV files. Moreover, information
about the hardware used during the execution of the pipeline is available as
well, to allow a fair comparison of different pipelines. Your pipeline should
consist of Docker images which can be executed on Linux to run the tool. The
tool has already been tested with existing systems, relational databases
(e.g. MySQL and PostgreSQL), and triplestores (e.g. Apache Jena Fuseki and
OpenLink Virtuoso), which can be combined in any configuration. It is strongly
encouraged to use this tool for participating in this challenge. If you prefer
to use a different tool, or our tool imposes technical requirements you cannot
solve, please contact us directly.
Track 1: Conformance
The set of new specifications for the RDF Mapping Language (RML) established by the W3C Community Group on Knowledge Graph Construction provides a set of test-cases for each module.
These test-cases are evaluated in this Track of the Challenge to determine their feasibility, correctness, etc. by applying them in implementations. This Track is in Beta status because these new specifications have not seen any implementation yet, so it may contain bugs and issues. If you find problems with the mappings, output, etc., please report them to the corresponding repository of each module.
Note: validating the output of the RML Star module automatically through the provided tooling is currently not possible, see https://github.com/kg-construct/challenge-tool/issues/1.
Through this Track we aim to spark development of implementations for the new specifications and improve the test-cases. Let us know your problems with the test-cases and we will try to find a solution.
Track 2: Performance
Part 1: Knowledge Graph Construction Parameters
These parameters are evaluated using synthetically generated data to gain more
insight into their influence on the pipeline.
Data
- Number of data records: scaling the data size vertically by the number of records with a fixed number of data properties (10K, 100K, 1M, 10M records).
- Number of data properties: scaling the data size horizontally by the number of data properties with a fixed number of data records (1, 10, 20, 30 columns).
- Number of duplicate values: scaling the number of duplicate values in the dataset (0%, 25%, 50%, 75%, 100%).
- Number of empty values: scaling the number of empty values in the dataset (0%, 25%, 50%, 75%, 100%).
- Number of input files: scaling the number of datasets (1, 5, 10, 15).
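To make the data parameters above concrete, the following is a minimal sketch of how such a synthetic table could be produced. It is not the generator used by the challenge: the function names, the `p0, p1, ...` column names, the `v<key>_<column>` value pattern, and the choice to make all duplicated rows copies of one single row are assumptions made for illustration only.

```python
import csv
import io


def generate_rows(n_records, n_properties, duplicate_pct=0, empty_pct=0):
    """Build a header and rows for a synthetic table (illustrative only)."""
    n_duplicates = n_records * duplicate_pct // 100
    n_empty = n_records * empty_pct // 100
    header = [f"p{c}" for c in range(n_properties)]
    rows = []
    for i in range(n_records):
        if i >= n_records - n_empty:
            # The last n_empty rows carry only empty values.
            rows.append([""] * n_properties)
            continue
        # Assumption: the first n_duplicates rows are copies of one row (key 0);
        # every other row gets its own key and is therefore unique.
        key = 0 if i < n_duplicates else i
        rows.append([f"v{key}_{c}" for c in range(n_properties)])
    return header, rows


def to_csv(header, rows):
    """Serialize the table as CSV text with the header as the first row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


header, rows = generate_rows(n_records=100, n_properties=3, duplicate_pct=25)
csv_text = to_csv(header, rows)
```

Scaling the number of input files would then amount to calling the generator once per file.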
Mappings
- Number of subjects: scaling the number of subjects with a fixed number of predicates and objects (1, 10, 20, 30 TMs).
- Number of predicates and objects: scaling the number of predicates and objects with a fixed number of subjects (1, 10, 20, 30 POMs).
- Number and type of joins: scaling the number and type of joins (1-1, N-1, 1-N, N-M).
Part 2: GTFS-Madrid-Bench
The GTFS-Madrid-Bench provides insights into the pipeline with real data from
the public transport domain in Madrid.
Scaling
- GTFS-1 SQL
- GTFS-10 SQL
- GTFS-100 SQL
- GTFS-1000 SQL
Heterogeneity
- GTFS-100 XML + JSON
- GTFS-100 CSV + XML
- GTFS-100 CSV + JSON
- GTFS-100 SQL + XML + JSON + CSV
Example pipeline
The ground truth dataset and baseline results are generated in different steps
for each parameter:
- The provided CSV files and SQL schema are loaded into a MySQL relational database.
- Mappings are executed by accessing the MySQL relational database to construct a knowledge graph as RDF in N-Triples format.
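Because N-Triples is a line-based format with one triple per line, the size of a generated knowledge graph can be checked with a few lines of code. The sketch below is a hypothetical helper and not part of the challenge tool. It assumes that distinct triples are what should be counted, and it compares statements textually, so two spellings of the same triple that differ only in whitespace would be counted twice.

```python
def count_distinct_triples(lines):
    """Count distinct statements in an iterable of N-Triples lines."""
    seen = set()
    for line in lines:
        statement = line.strip()
        # Skip blank lines and comment lines.
        if not statement or statement.startswith("#"):
            continue
        seen.add(statement)
    return len(seen)


sample = [
    '<http://example.org/s> <http://example.org/p> "a" .',
    "",
    "# a comment",
    '<http://example.org/s> <http://example.org/p> "a" .',
    '<http://example.org/s> <http://example.org/p> "b" .',
]
distinct = count_distinct_triples(sample)
```

For a file on disk, the open file object can be passed directly, since iterating over it yields lines.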
The pipeline is executed 5 times, and the median execution time of each step
is calculated from these runs. For each step, the run with the median execution
time is then reported in the baseline results with all its measured metrics.
Knowledge graph construction timeout is set to 24 hours.
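The median-based reporting described above can be sketched as follows. This is one illustrative reading of it, not the challenge tool's implementation: the step names and the metric keys (`execution_time`, `max_memory`) are hypothetical, and `statistics.median_low` is used so that the selected value is always an actually measured run.

```python
import statistics


def median_run_per_step(runs):
    """For each step, return the metrics of the run with the median execution time.

    `runs` is a list with one dict per pipeline execution, mapping a step name
    to a dict of metrics that contains at least 'execution_time'.
    """
    reported = {}
    for step in runs[0]:
        measurements = [run[step] for run in runs]
        median_time = statistics.median_low(
            m["execution_time"] for m in measurements
        )
        # Report all metrics of the run that produced the median time.
        reported[step] = next(
            m for m in measurements if m["execution_time"] == median_time
        )
    return reported


# Five hypothetical runs of a two-step pipeline.
load_times = [3.0, 1.0, 2.0, 5.0, 4.0]
map_times = [10.0, 50.0, 30.0, 20.0, 40.0]
runs = [
    {
        "load": {"execution_time": lt, "max_memory": 100 + i},
        "map": {"execution_time": mt, "max_memory": 200 + i},
    }
    for i, (lt, mt) in enumerate(zip(load_times, map_times))
]
baseline = median_run_per_step(runs)
```

Note that the median is taken per step, so the reported metrics of different steps may come from different runs.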
The execution is performed with the following tool: https://github.com/kg-construct/challenge-tool.
You can adapt the execution plans for this example pipeline to your own needs.
Each parameter has its own directory in the ground truth dataset with the
following files:
- Input dataset as CSV.
- Mapping file as RML.
- Execution plan for the pipeline in metadata.json.
Datasets
Knowledge Graph Construction Parameters
The dataset consists of:
- Input dataset as CSV for each parameter.
- Mapping file as RML for each parameter.
- Baseline results for each parameter with the example pipeline.
- Ground truth dataset for each parameter generated with the example pipeline.
Format
All input datasets are provided as CSV. Depending on the parameter that is
being evaluated, the number of rows and columns may differ. The first row is
always the header of the CSV.
GTFS-Madrid-Bench
The dataset consists of:
- Input dataset as CSV with SQL schema for the scaling, and a combination of XML, CSV, and JSON for the heterogeneity.
- Mapping file as RML for both scaling and heterogeneity.
- SPARQL queries to retrieve the results.
- Baseline results with the example pipeline.
- Ground truth dataset generated with the example pipeline.
Format
CSV datasets always have a header as their first row.
JSON and XML datasets have their own schema.
Evaluation criteria
Submissions must evaluate the following metrics:
- Execution time of all the steps in the pipeline. The execution time of a step is the difference between the begin and end time of a step.
- CPU time as the time spent in the CPU for all steps of the pipeline. The CPU time of a step is the difference between the begin and end CPU time of a step.
- Minimal and maximal memory consumption for each step of the pipeline. The minimal and maximal memory consumption of a step are the minimum and maximum of the memory consumption measured during the execution of that step.
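The three metric definitions above reduce to simple arithmetic over begin/end timestamps and a series of memory samples. The sketch below only illustrates that arithmetic. The function name, the dictionary keys, and the idea of passing memory consumption as a list of sampled values are assumptions and do not reflect the output format of the challenge tool.

```python
def summarize_step(begin, end, cpu_begin, cpu_end, memory_samples):
    """Derive the metrics of one step from raw measurements (illustrative)."""
    return {
        # Execution time: difference between end and begin time of the step.
        "execution_time": end - begin,
        # CPU time: difference between end and begin CPU time of the step.
        "cpu_time": cpu_end - cpu_begin,
        # Memory: minimum and maximum over all samples taken during the step.
        "min_memory": min(memory_samples),
        "max_memory": max(memory_samples),
    }


metrics = summarize_step(
    begin=10.0,
    end=12.5,
    cpu_begin=1.0,
    cpu_end=2.0,
    memory_samples=[100, 250, 180],
)
```

With these hypothetical inputs, the step took 2.5 time units of wall-clock time and 1.0 of CPU time, with memory ranging from 100 to 250.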
Expected output
Duplicate values
Scale | Number of Triples |
---|---|
0 percent | 2000000 triples |
25 percent | 1500020 triples |
50 percent | 1000020 triples |
75 percent | 500020 triples |
100 percent | 20 triples |
Empty values
Scale | Number of Triples |
---|---|
0 percent | 2000000 triples |
25 percent | 1500000 triples |
50 percent | 1000000 triples |
75 percent | 500000 triples |
100 percent | 0 triples |
Mappings
Scale | Number of Triples |
---|---|
1TM + 15POM | 1500000 triples |
3TM + 5POM | 1500000 triples |
5TM + 3POM | 1500000 triples |
15TM + 1POM | 1500000 triples |
Properties
Scale | Number of Triples |
---|---|
1M rows 1 column | 1000000 triples |
1M rows 10 columns | 10000000 triples |
1M rows 20 columns | 20000000 triples |
1M rows 30 columns | 30000000 triples |
Records
Scale | Number of Triples |
---|---|
10K rows 20 columns | 200000 triples |
100K rows 20 columns | 2000000 triples |
1M rows 20 columns | 20000000 triples |
10M rows 20 columns | 200000000 triples |
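The triple counts in the Duplicate values, Empty values, Mappings, Properties, and Records tables above are consistent with a simple model in which every non-empty cell yields one triple and duplicated rows collapse into a single set of triples. The sketch below checks that model against the tables. The dataset size of 100K rows with 20 columns for the duplicate and empty value scales, and of 100K rows for the mappings scale, is not stated in the description; it is inferred from the 0 percent counts and should be treated as an assumption.

```python
def triples_plain(rows, columns):
    """One triple per cell."""
    return rows * columns


def triples_with_duplicates(rows, columns, duplicate_pct):
    """Duplicated rows collapse to one row's worth of distinct triples."""
    distinct_rows = rows * (100 - duplicate_pct) // 100
    if duplicate_pct > 0:
        distinct_rows += 1  # the duplicated row itself still counts once
    return distinct_rows * columns


def triples_with_empty(rows, columns, empty_pct):
    """Rows consisting of empty values yield no triples."""
    return rows * (100 - empty_pct) // 100 * columns


def triples_for_mappings(rows, triples_maps, pred_obj_maps):
    """Each TM produces one triple per POM for every row."""
    return rows * triples_maps * pred_obj_maps


ROWS, COLUMNS = 100_000, 20  # assumed size, inferred from the 0 percent rows

duplicates = [triples_with_duplicates(ROWS, COLUMNS, p) for p in (0, 25, 50, 75, 100)]
empties = [triples_with_empty(ROWS, COLUMNS, p) for p in (0, 25, 50, 75, 100)]
mappings = [triples_for_mappings(ROWS, tm, pom) for tm, pom in ((1, 15), (3, 5), (5, 3), (15, 1))]
records = [triples_plain(n, 20) for n in (10_000, 100_000, 1_000_000, 10_000_000)]
properties = [triples_plain(1_000_000, c) for c in (1, 10, 20, 30)]
```

Under these assumptions, each list reproduces the corresponding table row by row, which can serve as a quick sanity check when comparing a system's output against the expected counts.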
Joins
1-1 joins
Scale | Number of Triples |
---|---|
0 percent | 0 triples |
25 percent | 125000 triples |
50 percent | 250000 triples |
75 percent | 375000 triples |
100 percent | 500000 triples |
1-N joins
Scale | Number of Triples |
---|---|
1-10 0 percent | 0 triples |
1-10 25 percent | 125000 triples |
1-10 50 percent | 250000 triples |
1-10 75 percent | 375000 triples |
1-10 100 percent | 500000 triples |
1-5 50 percent | 250000 triples |
1-10 50 percent | 250000 triples |
1-15 50 percent | 250005 triples |
1-20 50 percent | 250000 triples |
N-1 joins
Scale | Number of Triples |
---|---|
10-1 0 percent | 0 triples |
10-1 25 percent | 125000 triples |
10-1 50 percent | 250000 triples |
10-1 75 percent | 375000 triples |
10-1 100 percent | 500000 triples |
5-1 50 percent | 250000 triples |
10-1 50 percent | 250000 triples |
15-1 50 percent | 250005 triples |
20-1 50 percent | 250000 triples |
N-M joins
Scale | Number of Triples |
---|---|
5-5 50 percent | 1374085 triples |
10-5 50 percent | 1375185 triples |
5-10 50 percent | 1375290 triples |
5-5 25 percent | 718785 triples |
5-5 50 percent | 1374085 triples |
5-5 75 percent | 1968100 triples |
5-5 100 percent | 2500000 triples |
5-10 25 percent | 719310 triples |
5-10 50 percent | 1375290 triples |
5-10 75 percent | 1967660 triples |
5-10 100 percent | 2500000 triples |
10-5 25 percent | 719370 triples |
10-5 50 percent | 1375185 triples |
10-5 75 percent | 1968235 triples |
10-5 100 percent | 2500000 triples |
GTFS-Madrid-Bench
Generated Knowledge Graph
Scale | Number of Triples |
---|---|
1 | 395953 triples |
10 | 3959530 triples |
100 | 39595300 triples |
1000 | 395953000 triples |
Queries
Query | Scale 1 | Scale 10 | Scale 100 | Scale 1000 |
---|---|---|---|---|
Q1 | 58540 results | 585400 results | No results available | No results available |
Q2 | 636 results | 11998 results | 125565 results | 1261368 results |
Q3 | 421 results | 4207 results | 42067 results | 420667 results |
Q4 | 13 results | 130 results | 1300 results | 13000 results |
Q5 | 35 results | 350 results | 3500 results | 35000 results |
Q6 | 1 result | 1 result | 1 result | 1 result |
Q7 | 68 results | 67 results | 67 results | 53 results |
Q8 | 35460 results | 354600 results | No results available | No results available |
Q9 | 130 results | 1300 results | 13000 results | 130000 results |
Q10 | 1 result | 1 result | 1 result | 1 result |
Q11 | 130 results | 260 results | 260 results | 260 results |
Q12 | 13 results | 130 results | 1300 results | 13000 results |
Q13 | 265 results | 2650 results | 26500 results | 265000 results |
Q14 | 2234 results | 22340 results | 223400 results | No results available |
Q15 | 592 results | 8684 results | 35502 results | 206628 results |
Q16 | 390 results | 780 results | 260 results | 780 results |
Q17 | 855 results | 8550 results | 85500 results | 855000 results |
Q18 | 104 results | 1300 results | 13000 results | 130000 results |
Files (9.8 GB)
Checksum | Size |
---|---|
md5:3ca35fb72ba2202f4a6e996f53a8a986 | 4.5 GB |
md5:29299fe585b641f09d1c54f01628a117 | 5.3 GB |