
Published June 13, 2023 | Version 0
Dataset Open

DORIS-MAE-v0

Authors/Creators

Description

In scientific research, the ability to effectively retrieve relevant documents based on complex, multifaceted queries is critical. Existing evaluation datasets for this task are limited, primarily due to the high cost and effort required to annotate resources that effectively represent complex queries. To address this, we propose a novel task, Scientific DOcument Retrieval using Multi-level Aspect-based quEries (DORIS-MAE), which is designed to handle the complex nature of user queries in scientific research.

The DORIS-MAE dataset is publicly available at https://github.com/Real-Doris-Mae/Doris-Mae-Dataset. It comprises four main sub-datasets, each serving a distinct purpose.

The Query dataset contains 50 human-crafted complex queries spanning five categories: ML, NLP, CV, AI, and Composite, with 10 queries per category. Each query is broken down into 3 to 9 aspects, and each aspect into 0 to 6 sub-aspects (0 meaning no further breakdown is needed). Each query also comes with a candidate pool of 99 to 130 relevant paper abstracts.
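The query structure above (queries containing aspects, aspects containing sub-aspects) can be traversed with a few lines of Python. This is a minimal sketch only: the field names (`aspects`, `sub_aspects`, `category`) are illustrative assumptions, not the dataset's actual schema; consult the GitHub repository for the real field names.

```python
# Hypothetical sketch of counting aspects/sub-aspects in one query record.
# Per the record description: each query has 3-9 aspects, and each aspect
# has 0-6 sub-aspects. Field names here are assumptions.

def count_units(query):
    """Return (n_aspects, n_sub_aspects) for a single query record."""
    n_aspects = len(query["aspects"])
    n_sub = sum(len(a.get("sub_aspects", [])) for a in query["aspects"])
    return n_aspects, n_sub

# Tiny synthetic example mirroring the assumed schema:
sample_query = {
    "category": "NLP",
    "aspects": [
        {"text": "aspect 1", "sub_aspects": ["s1", "s2"]},
        {"text": "aspect 2", "sub_aspects": []},
        {"text": "aspect 3"},  # no sub-aspect key: no further breakdown
    ],
}
print(count_units(sample_query))  # (3, 2)
```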

The Corpus dataset is composed of 363,133 abstracts from computer science papers published between 2011 and 2021 and sourced from arXiv. Each entry includes the title, original abstract, URL, primary and secondary categories, and citation information retrieved from Semantic Scholar. A masked version of each abstract is also provided, facilitating the automated creation of queries.

The Annotation dataset includes generated annotations for all 83,591 question pairs, each pairing an aspect or sub-aspect with a paper abstract from the query's candidate pool. For each pair, it includes the original text generated by ChatGPT (version gpt-3.5-turbo-0301) explaining its decision-making process, along with a three-level relevance score (0, 1, or 2) representing ChatGPT's final decision.

Finally, the Test Set dataset contains human annotations for a random selection of 250 question pairs used in hypothesis testing. It records each of the three human annotators' final decisions as a three-level relevance score (0, 1, or 2).
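Because the Test Set pairs three human scores with a ChatGPT score for the same question pair, a natural first analysis is a simple agreement rate between the human majority vote and ChatGPT. The sketch below assumes hypothetical field names (`human_scores`, `gpt_score`) that are not taken from the dataset's actual schema.

```python
from collections import Counter

# Hedged sketch: majority vote over the three annotators' 0/1/2 relevance
# scores, then the fraction of pairs where that majority matches ChatGPT.
# Field names ("human_scores", "gpt_score") are illustrative assumptions.

def majority_vote(scores):
    """Most common score among the annotators (ties broken arbitrarily)."""
    return Counter(scores).most_common(1)[0][0]

def agreement_rate(pairs):
    """Fraction of question pairs where the human majority matches ChatGPT."""
    hits = sum(majority_vote(p["human_scores"]) == p["gpt_score"] for p in pairs)
    return hits / len(pairs)

# Tiny synthetic example with two question pairs:
sample_pairs = [
    {"human_scores": [2, 2, 1], "gpt_score": 2},  # majority 2, matches
    {"human_scores": [0, 0, 0], "gpt_score": 1},  # majority 0, differs
]
print(agreement_rate(sample_pairs))  # 0.5
```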

Notes

The DORIS-MAE dataset is described in a paper submitted to the NeurIPS 2023 Datasets and Benchmarks Track for review. Please refer to the GitHub repository for benchmarking details: https://github.com/Real-Doris-Mae/Doris-Mae-Dataset. Author information will be made available shortly.

Files

DORIS_MAE_dataset_v0.json (1.0 GB)
md5:e6e0749e6d818a019f0af4650340459f
