Published December 3, 2025 | Version v2
Dataset Open

DSEBench

Description


DSEBench is a test collection designed to support the evaluation of Dataset Search with Examples (DSE), a task that generalizes two established paradigms: keyword-based dataset search and similarity-based dataset discovery. Given a textual query q and a set of target datasets Dt known to be relevant, the goal of DSE is to retrieve a ranked list Dc of candidate datasets that are both relevant to q and similar to the datasets in Dt.

As an extension, Explainable DSE further requires identifying, for each result dataset d∈Dc, a subset of metadata or content fields that explain its relevance to q and similarity to Dt.

This repository contains the datasets, queries, and relevance judgments. For source code, baseline implementations, and experimental setups, please visit our GitHub Repository. For further details, please refer to the accompanying paper.

 

Datasets

We reused the 46,615 datasets collected from NTCIR. The "datasets.json" file provides the id, title, description, tags, author, and summary of each dataset in JSON format.

{ 
  "id": "0000de36-24e5-42c1-959d-2772a3c747e7", 
  "title": "Montezuma National Wildlife Refuge: January - April, 1943", 
  "description": "This narrative report for Montezuma National Wildlife Refuge outlines Refuge accomplishments from January through April of 1943. ...", 
  "tags": ["annual-narrative", "behavior", "populations"], 
  "author": "Fish and Wildlife Service", 
  "summary": "Almost continuous rains during April brought flood conditions to the Clyde River as well as to the refuge storage pool. Cayuga Lake is at its highest level in about ton years. ..."
}

Below is an example of how to load and use the datasets.json file:

import json

# Load the dataset file
with open('datasets.json', 'r', encoding='utf-8') as f:
    datasets_data = json.load(f)

# Iterate through each dataset
for dataset in datasets_data:
    dataset_id = dataset['id']  # Get the dataset ID
    title = dataset['title']  # Get the title

    # Other code to process the dataset...
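Since queries, cases, and judgments all refer to datasets by ID, it can help to index the loaded records once. The following is a minimal sketch, assuming datasets.json is a JSON array of objects shaped like the example above; the helper names index_datasets and dataset_text are illustrative and are not part of DSEBench.

```python
import json

def index_datasets(datasets):
    """Map each dataset ID to its full record for direct lookup."""
    return {record["id"]: record for record in datasets}

def dataset_text(record, fields=("title", "description", "tags", "author", "summary")):
    """Join the chosen metadata fields into one string.

    List-valued fields such as tags are joined with spaces; missing or
    empty fields are skipped.
    """
    parts = []
    for field in fields:
        value = record.get(field, "")
        if isinstance(value, list):
            value = " ".join(str(item) for item in value)
        if value:
            parts.append(str(value))
    return "\n".join(parts)

# In-memory sample shaped like one entry of datasets.json (fields abridged).
sample = json.loads("""
[{"id": "0000de36-24e5-42c1-959d-2772a3c747e7",
  "title": "Montezuma National Wildlife Refuge: January - April, 1943",
  "tags": ["annual-narrative", "behavior", "populations"],
  "author": "Fish and Wildlife Service"}]
""")

by_id = index_datasets(sample)
text = dataset_text(by_id["0000de36-24e5-42c1-959d-2772a3c747e7"], fields=("title", "tags"))
print(text)
```

With the real file, the list returned by json.load would be passed to index_datasets in place of the sample.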

 

Queries

The "queries.tsv" file provides 3,979 keyword queries. Each row represents a query with two "\t"-separated columns: query_id and query_text. The queries can be divided into two categories: generated queries, created from the metadata of datasets, and NTCIR queries, imported from the English part of the NTCIR dataset. Queries with IDs starting with "GEN_" are generated queries, while those starting with "NTCIR" are NTCIR queries.

Below is an example of how to load and use the queries.tsv file:

# Load the queries file
with open('queries.tsv', 'r', encoding='utf-8') as f:
    # Iterate through each line
    for line in f:
        # Strip the trailing newline, then get the query ID and the query text
        query_id, query_text = line.rstrip('\n').split('\t')

        # Other code to process the data...
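The ID prefixes described above can be used to separate generated queries from NTCIR queries. The sketch below assumes the file has no header row; the helper names and the sample rows (including the ID GEN_1 and both query texts) are made up for illustration.

```python
import io
from collections import Counter

def load_queries(handle):
    """Read query_id<TAB>query_text rows into a dict, skipping blank lines."""
    queries = {}
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue
        # Split on the first tab only, so a tab inside the text is preserved.
        query_id, query_text = line.split("\t", 1)
        queries[query_id] = query_text
    return queries

def query_source(query_id):
    """Classify a query by its ID prefix."""
    if query_id.startswith("GEN_"):
        return "generated"
    if query_id.startswith("NTCIR"):
        return "ntcir"
    return "unknown"

# Hypothetical rows standing in for queries.tsv.
sample_tsv = io.StringIO("NTCIR_200000\texample ntcir query\nGEN_1\texample generated query\n")
queries = load_queries(sample_tsv)
counts = Counter(query_source(query_id) for query_id in queries)
print(dict(counts))
```

To run this on the real file, replace the StringIO object with open('queries.tsv', encoding='utf-8').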
 

 

Test and Training Cases

In DSEBench, each input consists of a case, which includes a query and a set of target datasets that are known to be relevant to the query. The "cases.tsv" file provides 141 test cases and 5,699 training cases. Each row represents a case with three "\t"-separated columns: case_id, query_id, and target_dataset_id.

Test cases are identified by a case_id composed of pure numbers. These test cases are adapted from highly relevant query-dataset pairs from the NTCIR dataset. The remaining cases are training cases. Among these, those with a case_id starting with l1_ are adapted from partially relevant query-dataset pairs from NTCIR, while those starting with gen_ are synthetic training cases, where the queries are generated queries.

Below is an example of how to load and use the cases.tsv file:

# Load the cases file
with open('cases.tsv', 'r', encoding='utf-8') as f:
    # Iterate through each line
    for line in f:
        # Strip the trailing newline, then get the case ID, the query ID, and the target dataset ID
        case_id, query_id, target_dataset_id = line.rstrip('\n').split('\t')

        # Other code to process the data...
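The case_id conventions above make it possible to tell test cases from the two kinds of training cases without extra metadata. The sketch below is one way to do that. It assumes no header row and, in case a case lists several targets, that each target appears on its own row with the same case_id; the function names and all sample rows except the first are hypothetical.

```python
import io

def case_kind(case_id):
    """Classify a case from its ID, following the naming rules described above."""
    if case_id.isdigit():
        return "test"
    if case_id.startswith("l1_"):
        return "train_partially_relevant"
    if case_id.startswith("gen_"):
        return "train_synthetic"
    return "unknown"

def load_cases(handle):
    """Group rows by case_id into {case_id: {"query_id": ..., "targets": [...]}}."""
    cases = {}
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue
        case_id, query_id, target_dataset_id = line.split("\t")
        case = cases.setdefault(case_id, {"query_id": query_id, "targets": []})
        case["targets"].append(target_dataset_id)
    return cases

# Stand-in for cases.tsv; only the first row reuses IDs shown in this description.
sample_tsv = io.StringIO(
    "1\tNTCIR_200000\t002ece58-9603-43f1-8e2e-54e3d9649e84\n"
    "l1_5\tNTCIR_200001\thypothetical-dataset-x\n"
    "gen_7\tGEN_7\thypothetical-dataset-y\n"
)
cases = load_cases(sample_tsv)
test_case_ids = [case_id for case_id in cases if case_kind(case_id) == "test"]
print(test_case_ids)
```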
 

 

Relevance Judgments

The "human_annotated_judgments.json" file contains 7,415 human-annotated judgments, and the "llm_annotated_judgments.json" file contains 122,585 judgments generated by a large language model (LLM). Each JSON object has eight keys: query_id, target_dataset_id, candidate_dataset_id, case_id (the ID of the input), query_rel (relevance of the candidate dataset to the query, 0: irrelevant; 1: partially relevant; 2: highly relevant), field_query_rel, target_sim (similarity of the candidate dataset to the target datasets, 0: dissimilar; 1: partially similar; 2: highly similar), and field_target_sim. Both field_query_rel and field_target_sim are binary lists of length 5 whose positions correspond to the fields [title, description, tags, author, summary].

{
    "query_id": "NTCIR_200000", 
    "target_dataset_id": "002ece58-9603-43f1-8e2e-54e3d9649e84", 
    "candidate_dataset_id": "99e3b6a2-d097-463f-b6e1-3caceff300c9", 
    "case_id": "1", 
    "query_rel": 1, 
    "field_query_rel": [1, 1, 1, 0, 0], 
    "target_sim": 2, 
    "field_target_sim": [1, 1, 1, 1, 1]
}
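To read the field-level lists, each position can be mapped back to its field name. The following sketch applies the field order stated above to the example judgment; the helper name flagged_fields is illustrative only.

```python
FIELDS = ["title", "description", "tags", "author", "summary"]

def flagged_fields(flags, fields=FIELDS):
    """Return the names of the fields whose flag is 1."""
    if len(flags) != len(fields):
        raise ValueError(f"expected {len(fields)} flags, got {len(flags)}")
    return [name for name, flag in zip(fields, flags) if flag == 1]

# The two field-level lists from the example judgment above.
judgment = {
    "field_query_rel": [1, 1, 1, 0, 0],
    "field_target_sim": [1, 1, 1, 1, 1],
}

query_fields = flagged_fields(judgment["field_query_rel"])
target_fields = flagged_fields(judgment["field_target_sim"])
print(query_fields)
print(target_fields)
```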
 

Below is an example of how to load and use the human_annotated_judgments.json file:

import json

# Load the judgments file
with open('Data/human_annotated_judgments.json', 'r') as f:
    judgments_data = json.load(f)
    
    # Iterate through each judgment
    for judgment in judgments_data:
        case_id = judgment['case_id']  # Get the case ID
        candidate_dataset_id = judgment['candidate_dataset_id']  # Get the candidate dataset ID
        query_rel = judgment['query_rel']  # Get the query relevance score
        field_query_rel = judgment['field_query_rel']  # Get the field-level query relevance scores (title, description, tags, author, summary)
        
        # Other code to process the judgment data...
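For ranking evaluation, judgments are commonly reshaped into a nested mapping from case to candidate to graded label. The sketch below builds one such mapping per dimension (query_rel and target_sim). How the provided evaluation script combines the two dimensions is not described here, so no combination is attempted; build_qrels is a hypothetical helper name.

```python
from collections import defaultdict

def build_qrels(judgments, key):
    """Return {case_id: {candidate_dataset_id: label}} for the chosen label key."""
    qrels = defaultdict(dict)
    for judgment in judgments:
        case_id = str(judgment["case_id"])
        qrels[case_id][judgment["candidate_dataset_id"]] = int(judgment[key])
    return dict(qrels)

# One judgment, copied from the example above (field-level lists omitted).
sample = [{
    "query_id": "NTCIR_200000",
    "target_dataset_id": "002ece58-9603-43f1-8e2e-54e3d9649e84",
    "candidate_dataset_id": "99e3b6a2-d097-463f-b6e1-3caceff300c9",
    "case_id": "1",
    "query_rel": 1,
    "target_sim": 2,
}]

query_qrels = build_qrels(sample, "query_rel")
target_qrels = build_qrels(sample, "target_sim")
print(query_qrels)
print(target_qrels)
```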
 

 

Splits for Training, Validation, and Test Sets

To ensure that evaluation results are comparable, we provide predefined train-validation-test splits. The "Splits/5-Fold_split" folder contains five sub-folders, each providing three qrel files for training, validation, and test sets. The "Splits/Annotators_split" folder contains three qrel files for the training, validation, and test sets as well.

These files are used in the same way as the relevance judgments files.

 

Evaluation Scripts

We provide Python scripts to facilitate standard evaluation.

1. DSE Evaluation (Retrieval/Reranking)

Use evaluate_dse.py to calculate metrics (MAP, NDCG, Recall).

  • Input Format (JSON):
    {
      "case_id_1": {"dataset_id_A": 0.95, "dataset_id_B": 0.82},
      "case_id_2": {"...": ...}
    }
    
  • Usage:
    python evaluate_dse.py --qrels Data/human_annotated_judgments.json --run path/to/your_results.json
    
  • Note: This script requires the pytrec_eval library.
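A run file in the input format above can be produced from any per-case score table. The sketch below is one way to do so; build_run and its top_k parameter are illustrative names, and the scores are placeholders rather than real results.

```python
import json

def build_run(scores_by_case, top_k=None):
    """Convert {case_id: {dataset_id: score}} into the run format shown above.

    Case IDs and dataset IDs are coerced to strings and scores to floats;
    if top_k is given, only the top_k highest-scoring datasets per case are kept.
    """
    run = {}
    for case_id, scores in scores_by_case.items():
        ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
        if top_k is not None:
            ranked = ranked[:top_k]
        run[str(case_id)] = {str(dataset_id): float(score) for dataset_id, score in ranked}
    return run

# Placeholder scores for a single case.
scores = {1: {"dataset_id_A": 0.95, "dataset_id_B": 0.82, "dataset_id_C": 0.10}}
run = build_run(scores, top_k=2)
serialized = json.dumps(run, indent=2)
print(serialized)
```

Writing serialized to a file gives a path that can be passed to the --run argument.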

2. Explainable DSE Evaluation

Use evaluate_explanation.py to calculate F1-scores for field-level explanations.

  • Input Format (JSON): The query and dataset lists each hold five binary values, one per field, in the order ['title', 'description', 'tags', 'author', 'summary'].
    {
      "case_id": {
        "dataset_id": {
           "query": [1, 1, 1, 1, 0],
           "dataset": [1, 1, 1, 1, 0]
        }
      }
    }
    
  • Usage:
    python evaluate_explanation.py --qrels Data/human_annotated_judgments.json --run path/to/your_explanations.json
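As a reference point for the metric, the following sketch computes an F1-score between one predicted field list and one annotated field list. It is not the logic of evaluate_explanation.py, whose aggregation across cases and datasets is not described here. Returning 0.0 when there are no true positives is an assumed convention.

```python
def binary_f1(predicted, gold):
    """F1 between two equal-length binary lists, treating 1 as the positive class."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    tp = sum(1 for p, g in zip(predicted, gold) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(predicted, gold) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(predicted, gold) if p == 0 and g == 1)
    if tp == 0:
        # Assumed convention: no true positives gives a score of 0.0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Predicted flags from the input-format example, compared against the
# field_query_rel list of the example judgment.
predicted = [1, 1, 1, 1, 0]
gold = [1, 1, 1, 0, 0]
score = binary_f1(predicted, gold)
print(round(score, 4))
```

Here three of the four predicted fields are correct and no annotated field is missed, so precision is 0.75 and recall is 1.0.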
    

Code and Baselines

To access the source code for retrieval, reranking, and explanation models, as well as the implementation details and full baseline results, please refer to our GitHub Repository.

Files

DSEBench_data.zip (38.0 MB)
md5:2dc4207d4b2997114de30f3ef98fadf2