There is a newer version of the record available.

Published October 2, 2024 | Version v2
Dataset Open

PrimeKGQA, the dataset from the paper: Bridging the Gap: Generating a Comprehensive Biomedical Knowledge Graph Question Answering Dataset

  • 1. Universität Hamburg

Description

Despite the plethora of resources such as large-scale corpora and manually curated Knowledge Graphs (KGs), reasoning over biomedical graphs with natural language inputs remains challenging due to insufficient training data. We propose a novel method for automatically constructing a Biomedical Knowledge Graph Question Answering (BioKGQA) dataset sourced from PrimeKG, the largest precision medicine-oriented KG. In total, we create 83,999 question-answer pairs along with their respective SPARQL queries. Our approach generates a diverse array of contextually relevant questions covering a wide spectrum of biomedical concepts and levels of complexity. We evaluate our method using automatic metrics alongside manual annotations. We establish novel standards tailored for KGQA systems to highlight the linguistic correctness and semantic faithfulness of the generated questions based on extracted KG facts. The compiled dataset, PrimeKGQA, serves as a valuable benchmarking resource for advancing knowledge-driven biomedical research and evaluating KGQA systems.

Files

test_call_bioLLM.json

Files (1.5 GB)

Checksum                                Size
md5:43c40df023e7e199774cfb37758a776b    287.3 MB
md5:1457be128b9ac4c70fe6139ee01d5b10    942.4 MB
md5:b0404cc5dbe44acce8769a3aeb4f1d76    317.8 MB
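
The MD5 checksums listed for the files can be used to confirm that a download completed without corruption. The following is a minimal sketch, not part of the record itself: the helper names `md5_of_file` and `matches_record` are hypothetical, and because the listing does not show which file name belongs to which checksum, the check only tests whether a local file matches any of the three published values.

```python
import hashlib

# MD5 values copied from the file listing of this record.
# The listing does not pair them with file names, so they are kept as a set.
EXPECTED_MD5 = {
    "43c40df023e7e199774cfb37758a776b",
    "1457be128b9ac4c70fe6139ee01d5b10",
    "b0404cc5dbe44acce8769a3aeb4f1d76",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until an empty bytes object signals end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path):
    """Return True if the file's MD5 equals one of the published checksums."""
    return md5_of_file(path) in EXPECTED_MD5
```

Calling `matches_record` on the local path of a downloaded file returns `False` if the file was truncated or altered in transit. Reading in chunks matters here because the largest file is close to 1 GB.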

Additional details

Dates

Accepted
2024-08-20

Software

Repository URL
https://github.com/xixi019/primeKGQG
Programming language
Python