Knowledge bases for explainable benchmarking (QALD10, QALD9+DB, QALD9+WK)

Zhang, Quannian; Röder, Michael; Srivastava, Nikit; Kouagou, N'Dah Jean; Ngonga Ngomo, Axel-Cyrille

doi:10.5281/zenodo.14720669

Published January 22, 2025 | Version v1

Dataset Open

1. Paderborn University

This project provides three knowledge graphs, one for each of three QA benchmarks: QALD-9 plus DBpedia, QALD-9 plus Wikidata, and QALD-10. The sections below describe how the graphs were preprocessed and how they are structured.

1. Preprocessing

We removed all questions with an empty ground truth answer set from the three QA datasets.
We preprocessed the DBpedia reference graph by:
- Removing 43,618 triples containing IRIs that do not pass the RDF checker.
- Removing properties of the http://dbpedia.org/property/ namespace.
- Inferring the classes of all entities based on the class hierarchy.
We preprocessed Wikidata by replacing the property http://www.wikidata.org/prop/direct/P31 with http://www.w3.org/1999/02/22-rdf-syntax-ns#type.
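
The preprocessing steps above can be illustrated with a minimal sketch. It is not the code used to build the dataset: it assumes triples are held in memory as plain (subject, predicate, object) string tuples and that the class hierarchy is expressed with rdfs:subClassOf in the same graph. The function names are hypothetical.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
DBP_PROPERTY_NS = "http://dbpedia.org/property/"
WD_P31 = "http://www.wikidata.org/prop/direct/P31"


def drop_dbpedia_properties(triples):
    """Drop triples whose predicate lies in the dbpedia.org/property/ namespace."""
    return {(s, p, o) for (s, p, o) in triples if not p.startswith(DBP_PROPERTY_NS)}


def infer_types(triples):
    """Add an rdf:type triple for every superclass of an entity's stated classes.

    Assumption: the hierarchy is given as rdfs:subClassOf triples in `triples`.
    """
    parents = {}
    for s, p, o in triples:
        if p == RDFS_SUBCLASS_OF:
            parents.setdefault(s, set()).add(o)

    result = set(triples)
    for s, p, o in triples:
        if p != RDF_TYPE:
            continue
        # Walk upwards through the hierarchy starting at the stated class.
        stack, seen = [o], {o}
        while stack:
            cls = stack.pop()
            for parent in parents.get(cls, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
                    result.add((s, RDF_TYPE, parent))
    return result


def rewrite_p31(triples):
    """Replace the Wikidata 'instance of' property (P31) with rdf:type."""
    return {(s, RDF_TYPE if p == WD_P31 else p, o) for (s, p, o) in triples}
```

A real pipeline would stream triples from the dump files rather than hold them in a Python set; the set-based form is only meant to make the three transformations easy to read.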

2. Knowledge Base Structure

In the first step of our benchmarking framework, we generate a knowledge graph comprising information from the dataset used during the benchmarking process. Our work relies on the QALD datasets, which include three types of data for each question:

Natural language question
Each question comes with a representation in several languages. From the English question, we extract linguistic features such as:
- The length of the question (dqb:hasLength). Note: The prefix dqb: refers to the namespace http://w3id.org/dice-research/qa-bench#.
- The presence of negation (dqb:hasNegation)
- The question word (dqb:hasQuestionWord)
- The NLP parse tree (dqb:hasNlpParseTreeRoot)
  Note: We employ the Stanford NLP toolkit for the extraction.
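
As a rough illustration of the linguistic features listed above, the following sketch derives three of them with simple heuristics. It is a stand-in, not the Stanford NLP based extraction used for the dataset: the tokenizer, the word lists, the choice to measure length in tokens, and the function name are all assumptions, and the parse tree is omitted.

```python
import re

# Assumed word lists; the dataset's actual extraction relies on the Stanford NLP toolkit.
QUESTION_WORDS = ("who", "whom", "whose", "what", "which", "when", "where", "why", "how")
NEGATION_CUES = {"not", "no", "never", "none", "neither", "nor", "without"}


def tokenize(question):
    """Lowercase the question and split it into word tokens (keeps forms like "doesn't")."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", question.lower())


def question_features(question):
    """Return a dict keyed by the dqb: properties named in the description."""
    tokens = tokenize(question)
    question_word = next((t for t in tokens if t in QUESTION_WORDS), None)
    has_negation = any(t in NEGATION_CUES or t.endswith("n't") for t in tokens)
    return {
        "dqb:hasLength": len(tokens),  # assumption: length counted in tokens
        "dqb:hasNegation": has_negation,
        "dqb:hasQuestionWord": question_word,
    }
```

For example, `question_features("Which countries do not border Germany?")` yields a length of 6, a negation flag of True, and the question word "which".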
Answer(s)
Each question comes with the ground truth answers. We add these answers to the generated graph with three different properties distinguishing:
- IRI answers (dqb:hasIRIAnswer)
- Boolean answers (dqb:hasBooleanAnswer)
- Other literal answers (dqb:hasLiteralAnswer)
  For each IRI listed as an answer, we add its concise bounded description (CBD) extracted from the reference knowledge graph.
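
The concise bounded description of an IRI answer can be sketched as follows. This is a simplified reading of the CBD definition, assuming triples are (subject, predicate, object) tuples and blank nodes are strings starting with "_:"; it follows blank-node objects recursively but leaves out reified statements. The function names are hypothetical.

```python
def is_blank(node):
    """Assumption: blank nodes are represented as strings with the prefix '_:'."""
    return isinstance(node, str) and node.startswith("_:")


def concise_bounded_description(triples, resource):
    """Collect all triples with `resource` as subject, expanding blank-node objects.

    Simplification: reifications of the collected statements are not included.
    """
    by_subject = {}
    for triple in triples:
        by_subject.setdefault(triple[0], []).append(triple)

    cbd = set()
    stack, seen = [resource], {resource}
    while stack:
        node = stack.pop()
        for s, p, o in by_subject.get(node, ()):
            cbd.add((s, p, o))
            # Blank nodes have no global identity, so their description is pulled in too.
            if is_blank(o) and o not in seen:
                seen.add(o)
                stack.append(o)
    return cbd
```

Applied to each IRI attached via dqb:hasIRIAnswer, the returned triples would be the ones copied from the reference knowledge graph into the generated graph.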
SPARQL query
Each question has a SPARQL query that returns the ground truth answer when used on the reference knowledge graph. We adopt LSQ to add the following SPARQL query features to our knowledge graph:
- Entities (dqb:hasEntity), properties (dqb:hasProperty) contained in the query, and the CBD of the entities
- Type of query
- The number of triple patterns
- The number of basic graph patterns
- The average degree of vertices
- The median degree of vertices involved in join operations
- The minimum, maximum, and median number of triple patterns in a basic graph pattern
- The presence of certain keywords such as FILTER, DISTINCT, and GROUP BY
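
Two of the query features listed above, the query type and the presence of keywords, can be approximated with regular expressions. The sketch below is not LSQ and does not parse SPARQL: it only strips IRIs and string literals before searching, ignores comments, and cannot compute the structural features (triple patterns, basic graph patterns, vertex degrees), which require a real parser. The function names are hypothetical.

```python
import re

KEYWORDS = ("FILTER", "DISTINCT", "GROUP BY")


def strip_iris_and_literals(query):
    """Blank out <...> IRIs and quoted strings so keywords inside them are not counted."""
    query = re.sub(r"<[^<>\s]*>", " ", query)
    query = re.sub(r'"(?:[^"\\]|\\.)*"', " ", query)
    query = re.sub(r"'(?:[^'\\]|\\.)*'", " ", query)
    return query


def query_type(query):
    """Return the first query form keyword found, or None."""
    match = re.search(r"\b(SELECT|ASK|CONSTRUCT|DESCRIBE)\b",
                      strip_iris_and_literals(query), re.IGNORECASE)
    return match.group(1).upper() if match else None


def keyword_features(query):
    """Map each keyword to True if it occurs outside IRIs and string literals."""
    text = strip_iris_and_literals(query)
    features = {}
    for keyword in KEYWORDS:
        # Allow arbitrary whitespace between the words of multi-word keywords.
        pattern = r"\b" + r"\s+".join(keyword.split()) + r"\b"
        features[keyword] = re.search(pattern, text, re.IGNORECASE) is not None
    return features
```

A parser-based tool remains the safer choice for anything beyond this kind of surface check, since regular expressions cannot distinguish nested patterns or subqueries.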

Files

Name	Size
QALD_KG.zip (md5:5bc7d33a19e77f5ac502dec06c93bc3d)	2.4 GB

