DSEBench
DSEBench is a test collection designed to support the evaluation of Dataset Search with Examples (DSE), a task that generalizes two established paradigms: keyword-based dataset search and similarity-based dataset discovery. Given a textual query q and a set of target datasets Dt known to be relevant, the goal of DSE is to retrieve a ranked list Dc of candidate datasets that are both relevant to q and similar to the datasets in Dt.
As an extension, Explainable DSE further requires identifying, for each result dataset d∈Dc, a subset of metadata or content fields that explain its relevance to q and similarity to Dt.
This repository contains the datasets, queries, and relevance judgments. For source code, baseline implementations, and experimental setups, please visit our GitHub Repository. For further details, please refer to the accompanying paper.
Datasets
We reused the 46,615 datasets collected from NTCIR. The "datasets.json" file provides the id, title, description, tags, author, and summary of each dataset in JSON format.
{
    "id": "0000de36-24e5-42c1-959d-2772a3c747e7",
    "title": "Montezuma National Wildlife Refuge: January - April, 1943",
    "description": "This narrative report for Montezuma National Wildlife Refuge outlines Refuge accomplishments from January through April of 1943. ...",
    "tags": ["annual-narrative", "behavior", "populations"],
    "author": "Fish and Wildlife Service",
    "summary": "Almost continuous rains during April brought flood conditions to the Clyde River as well as to the refuge storage pool. Cayuga Lake is at its highest level in about ton years. ..."
}
Below is an example of how to load and use the datasets.json file:
import json
# Load the dataset file
with open('datasets.json', 'r', encoding='utf-8') as f:
    datasets_data = json.load(f)
# Iterate through each dataset
for dataset in datasets_data:
    dataset_id = dataset['id']  # Get the dataset ID
    title = dataset['title']  # Get the title
    # Other code to process the dataset...
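Queries, cases, and judgments all refer to datasets by id, so it can be convenient to index the records once. The following is a minimal sketch, assuming datasets.json holds a JSON array of objects shaped like the example above; index_by_id and dataset_text are hypothetical helper names, not part of DSEBench, and the sample record is a shortened copy of that example.

```python
# Fields listed in the description of datasets.json.
TEXT_FIELDS = ["title", "description", "tags", "author", "summary"]


def index_by_id(records):
    """Map each dataset ID to its record for direct lookup."""
    return {record["id"]: record for record in records}


def dataset_text(record, fields=TEXT_FIELDS):
    """Concatenate the selected fields into one string (hypothetical helper).

    List-valued fields such as tags are joined with spaces; missing or
    empty fields are skipped.
    """
    parts = []
    for field in fields:
        value = record.get(field)
        if not value:
            continue
        parts.append(" ".join(value) if isinstance(value, list) else str(value))
    return " ".join(parts)


# Small in-memory sample so the sketch runs without the real file.
sample = [{
    "id": "0000de36-24e5-42c1-959d-2772a3c747e7",
    "title": "Montezuma National Wildlife Refuge: January - April, 1943",
    "tags": ["annual-narrative", "behavior", "populations"],
    "author": "Fish and Wildlife Service",
}]

by_id = index_by_id(sample)
record = by_id["0000de36-24e5-42c1-959d-2772a3c747e7"]
print(dataset_text(record, fields=["title", "tags"]))
```

With the real file, passing datasets_data to index_by_id in place of sample should give the same kind of lookup table.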
Queries
The "queries.tsv" file provides 3,979 keyword queries. Each row represents a query with two "\t"-separated columns: query_id and query_text. The queries can be divided into two categories: generated queries, created from the metadata of datasets, and NTCIR queries, imported from the English part of the NTCIR dataset. Queries with IDs starting with "GEN_" are generated queries, while those starting with "NTCIR" are NTCIR queries.
Below is an example of how to load and use the queries.tsv file:
# Load the queries file
with open('queries.tsv', 'r', encoding='utf-8') as f:
    # Iterate through each line
    for line in f:
        # Strip the trailing newline, then get the query ID and the query text
        query_id, query_text = line.rstrip('\n').split('\t', 1)
        # Other code to process the data...
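The two query categories can be separated by ID prefix. The following is a minimal sketch, assuming the prefixes "GEN_" and "NTCIR" described above; parse_queries and query_category are hypothetical helper names, and the "GEN_1" row and both query texts are made-up placeholders rather than real entries of queries.tsv.

```python
from collections import Counter


def parse_queries(lines):
    """Parse tab-separated lines into a {query_id: query_text} dict."""
    queries = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        query_id, query_text = line.split("\t", 1)
        queries[query_id] = query_text
    return queries


def query_category(query_id):
    """Classify a query by its ID prefix."""
    if query_id.startswith("GEN_"):
        return "generated"
    if query_id.startswith("NTCIR"):
        return "ntcir"
    return "unknown"


# Placeholder rows standing in for the contents of queries.tsv.
sample_lines = [
    "NTCIR_200000\tplaceholder query text\n",
    "GEN_1\tanother placeholder query\n",
]

queries = parse_queries(sample_lines)
counts = Counter(query_category(qid) for qid in queries)
print(dict(counts))
```

On the real file, an open file handle can be passed to parse_queries in place of sample_lines.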
Test and Training Cases
In DSEBench, each input is a case: a query together with a set of target datasets known to be relevant to that query. The "cases.tsv" file provides 141 test cases and 5,699 training cases. Each row represents a case with three "\t"-separated columns: case_id, query_id, and target_dataset_id.
Test cases are identified by a case_id consisting only of digits. They are adapted from highly relevant query-dataset pairs in the NTCIR dataset. All remaining cases are training cases: those with a case_id starting with l1_ are adapted from partially relevant query-dataset pairs from NTCIR, while those starting with gen_ are synthetic training cases whose queries are generated queries.
Below is an example of how to load and use the cases.tsv file:
# Load the cases file
with open('cases.tsv', 'r', encoding='utf-8') as f:
    # Iterate through each line
    for line in f:
        # Strip the trailing newline, then get the case ID, the query ID, and the target dataset ID
        case_id, query_id, target_dataset_id = line.rstrip('\n').split('\t')
        # Other code to process the data...
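The case_id conventions above can be turned into a small classifier, and rows can be grouped per case. The following is a minimal sketch under two assumptions that the description does not confirm: that a case with several target datasets appears as several rows sharing one case_id, and that the prefixes are exactly "l1_" and "gen_". The helper names and the "gen_1" row are hypothetical.

```python
def case_split(case_id):
    """Classify a case by the case_id conventions described above."""
    if case_id.isdigit():
        return "test"
    if case_id.startswith("l1_"):
        return "train_ntcir_partial"
    if case_id.startswith("gen_"):
        return "train_synthetic"
    return "unknown"


def group_cases(rows):
    """Group (case_id, query_id, target_dataset_id) rows by case_id.

    Assumption: rows with the same case_id share one query and contribute
    one target dataset each.
    """
    cases = {}
    for case_id, query_id, target_id in rows:
        entry = cases.setdefault(case_id, {"query_id": query_id, "targets": []})
        entry["targets"].append(target_id)
    return cases


# The first row reuses IDs from the judgment example in this description;
# the second row is a made-up placeholder.
sample_rows = [
    ("1", "NTCIR_200000", "002ece58-9603-43f1-8e2e-54e3d9649e84"),
    ("gen_1", "GEN_1", "hypothetical-dataset-id"),
]

cases = group_cases(sample_rows)
for case_id, case in cases.items():
    print(case_id, case_split(case_id), case["query_id"], len(case["targets"]))
```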
Relevance Judgments
The "human_annotated_judgments.json" file contains 7,415 human-annotated judgments, and the "llm_annotated_judgments.json" file contains 122,585 judgments generated by a large language model (LLM). Each JSON object has eight keys: query_id, target_dataset_id, candidate_dataset_id, case_id (the ID of the input), query_rel (relevance of the candidate dataset to the query, 0: irrelevant; 1: partially relevant; 2: highly relevant), field_query_rel, target_sim (similarity of the candidate dataset to the target datasets, 0: dissimilar; 1: partially similar; 2: highly similar), and field_target_sim. The field_query_rel and field_target_sim are both lists of length 5 consisting of 0 and 1, corresponding to the fields [title, description, tags, author, summary].
{
    "query_id": "NTCIR_200000",
    "target_dataset_id": "002ece58-9603-43f1-8e2e-54e3d9649e84",
    "candidate_dataset_id": "99e3b6a2-d097-463f-b6e1-3caceff300c9",
    "case_id": "1",
    "query_rel": 1,
    "field_query_rel": [1, 1, 1, 0, 0],
    "target_sim": 2,
    "field_target_sim": [1, 1, 1, 1, 1]
}
Below is an example of how to load and use the human_annotated_judgments.json file:
import json
# Load the judgments file
with open('Data/human_annotated_judgments.json', 'r', encoding='utf-8') as f:
    judgments_data = json.load(f)
# Iterate through each judgment
for judgment in judgments_data:
    case_id = judgment['case_id']  # Get the case ID
    candidate_dataset_id = judgment['candidate_dataset_id']  # Get the candidate dataset ID
    query_rel = judgment['query_rel']  # Get the query relevance score
    field_query_rel = judgment['field_query_rel']  # Get the field-level query relevance scores (title, description, tags, author, summary)
    # Other code to process the judgment data...
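The field-level lists are easier to read once mapped back to field names. The following is a minimal sketch, assuming the field order [title, description, tags, author, summary] stated above; flagged_fields is a hypothetical helper name, and the sample judgment copies the field-level values from the example above.

```python
# Field order stated in the description of the judgment files.
FIELDS = ["title", "description", "tags", "author", "summary"]


def flagged_fields(flags, fields=FIELDS):
    """Return the names of the fields whose flag is 1."""
    if len(flags) != len(fields):
        raise ValueError("expected one flag per field")
    return [name for name, flag in zip(fields, flags) if flag == 1]


# Field-level values copied from the example judgment.
judgment = {
    "field_query_rel": [1, 1, 1, 0, 0],
    "field_target_sim": [1, 1, 1, 1, 1],
}

print(flagged_fields(judgment["field_query_rel"]))
# prints ['title', 'description', 'tags']
print(flagged_fields(judgment["field_target_sim"]))
# prints ['title', 'description', 'tags', 'author', 'summary']
```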
Splits for Training, Validation, and Test Sets
To ensure that evaluation results are comparable, we provide predefined train-validation-test splits. The "Splits/5-Fold_split" folder contains five sub-folders, each providing three qrel files for training, validation, and test sets. The "Splits/Annotators_split" folder contains three qrel files for the training, validation, and test sets as well.
These files are used in the same way as the relevance judgments files.
Evaluation Scripts
We provide Python scripts to facilitate standard evaluation.
1. DSE Evaluation (Retrieval/Reranking)
Use evaluate_dse.py to calculate metrics (MAP, NDCG, Recall).
- Input Format (JSON):
  { "case_id_1": {"dataset_id_A": 0.95, "dataset_id_B": 0.82}, "case_id_2": {"...": ...} }
- Usage:
  python evaluate_dse.py --qrels Data/human_annotated_judgments.json --run path/to/your_results.json
- Note: This script requires the pytrec_eval library.
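A run file in this format can be assembled from scored (case, dataset) pairs. The following is a minimal sketch: build_run is a hypothetical helper, the IDs and scores are placeholders in the style of the format example, and the cast to float reflects pytrec_eval's expectation that run scores are floats.

```python
import json


def build_run(scored_rows):
    """Nest (case_id, dataset_id, score) rows as {case_id: {dataset_id: score}}."""
    run = {}
    for case_id, dataset_id, score in scored_rows:
        run.setdefault(str(case_id), {})[dataset_id] = float(score)
    return run


# Placeholder scores; a real run would come from a retrieval or reranking model.
rows = [
    ("case_id_1", "dataset_id_A", 0.95),
    ("case_id_1", "dataset_id_B", 0.82),
    ("case_id_2", "dataset_id_C", 0.40),
]

run = build_run(rows)
print(json.dumps(run, indent=2))

# To save the run for evaluate_dse.py, something like the following should work:
# with open("your_results.json", "w", encoding="utf-8") as f:
#     json.dump(run, f)
```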
2. Explainable DSE Evaluation
Use evaluate_explanation.py to calculate F1-scores for field-level explanations.
- Input Format (JSON): The query and dataset lists correspond to the binary relevance of ['title', 'description', 'tags', 'author', 'summary'].
  { "case_id": { "dataset_id": { "query": [1, 1, 1, 1, 0], "dataset": [1, 1, 1, 1, 0] } } }
- Usage:
  python evaluate_explanation.py --qrels Data/human_annotated_judgments.json --run path/to/your_explanations.json
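To make the metric concrete, the F1-score of one predicted field list against one annotated list can be computed as below. This is a minimal sketch of the standard binary F1 and is not necessarily how evaluate_explanation.py aggregates or handles edge cases; binary_f1 is a hypothetical helper, and returning 0.0 when there are no true positives is an assumed convention.

```python
def binary_f1(gold, predicted):
    """F1-score of predicted binary flags against gold binary flags."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    if tp == 0:
        return 0.0  # assumed convention when precision or recall is undefined or zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


gold = [1, 1, 1, 0, 0]       # annotated field_query_rel from the example judgment
predicted = [1, 1, 1, 1, 0]  # "query" list from the input format example

# 3 true positives, 1 false positive, 0 false negatives:
# precision = 0.75, recall = 1.0, F1 = 6/7
print(round(binary_f1(gold, predicted), 3))
# prints 0.857
```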
Code and Baselines
To access the source code for retrieval, reranking, and explanation models, as well as the implementation details and full baseline results, please refer to our GitHub Repository.
Files
DSEBench_data.zip (38.0 MB)
md5:2dc4207d4b2997114de30f3ef98fadf2