DOREMI Denoised Distantly Supervised Datasets

Menotti, Laura; Marchesin, Stefano; Silvello, Gianmaria

doi:10.5281/zenodo.18170553

Published January 7, 2026 | Version 1.0

Dataset Open

1. University of Padua
2. University of Padova

This repository contains the denoised distantly supervised datasets produced using the DOREMI framework.

DOcument-level Relation Extraction optiMizing the long taIl (DOREMI) is an active learning-based system that enhances the training data through targeted manual annotation of highly informative examples. DOREMI operates upstream in the general DocRE pipeline by augmenting the dataset in a model-agnostic fashion, enabling any downstream DocRE model to benefit from improved long-tail coverage. This approach produces a Denoised Distantly Supervised Dataset (DDS) that can be used to train any existing DocRE model and that yields improvements in long-tail relation predictions.

We release four DDSs, which were used for the experimental evaluation of DOREMI. Two datasets (denoted by "DOREMI") were generated by cleaning the DocRED distant dataset with the DOREMI framework, utilizing DocRED and Re-DocRED, respectively. The other two datasets (denoted by "DU") were generated by a hybrid approach that combines DOREMI long-tail predictions with UGDRE annotations for frequent relations.

File Outline

All datasets are denoised versions of the DocRED distant dataset. Hence, they all contain the same documents and entities. The DocRED distant dataset consists of 101,873 documents and 1,965,484 entities.

The repository contains the following files:

DOREMI-DDS-DocRED.json: DDS generated by DOREMI utilizing DocRED. The dataset consists of 1,704,161 positive examples.
DOREMI-DDS-ReDocRED.json: DDS generated by DOREMI utilizing Re-DocRED. The dataset consists of 3,957,238 positive examples.
DU-DDS-DocRED.json: DDS generated by combining DOREMI (utilizing DocRED) long-tail predictions with UGDRE annotations for frequent relations. The dataset consists of 1,726,707 positive examples.
DU-DDS-ReDocRED.json: DDS generated by combining DOREMI (utilizing Re-DocRED) long-tail predictions with UGDRE annotations for frequent relations. The dataset consists of 1,796,229 positive examples.
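
The counts above can be checked with a short script. The following is a minimal sketch, assuming each file is a JSON array of documents in the DocRED schema (a "vertexSet" list with one entry per entity and a "labels" list with one entry per positive example). This record does not state the schema explicitly, so the field names and the helper names `summarize` and `summarize_file` are assumptions for illustration.

```python
import json


def summarize(docs):
    """Return (documents, entities, positive examples) for DocRED-style documents."""
    n_docs = len(docs)
    # Assumption: "vertexSet" holds one list of mentions per entity.
    n_entities = sum(len(doc.get("vertexSet", [])) for doc in docs)
    # Assumption: each item in "labels" is one positive (head, relation, tail) example.
    n_positive = sum(len(doc.get("labels", [])) for doc in docs)
    return n_docs, n_entities, n_positive


def summarize_file(path):
    """Load one DDS file (a JSON array of documents is assumed) and summarize it."""
    with open(path, encoding="utf-8") as f:
        return summarize(json.load(f))


if __name__ == "__main__":
    # Toy document in the assumed schema; it is not taken from the released files.
    toy = [{
        "title": "Example",
        "sents": [["Padua", "is", "in", "Italy", "."]],
        "vertexSet": [
            [{"name": "Padua", "sent_id": 0, "pos": [0, 1], "type": "LOC"}],
            [{"name": "Italy", "sent_id": 0, "pos": [3, 4], "type": "LOC"}],
        ],
        "labels": [{"h": 0, "t": 1, "r": "P17"}],
    }]
    print(summarize(toy))
```

Calling `summarize_file("DOREMI-DDS-DocRED.json")` reads the whole file into memory; since the files are several hundred megabytes each, a streaming JSON parser may be preferable on machines with limited RAM.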

Reproducibility Guidelines

This section describes how to obtain the results presented in the "Experimental Results" section of the paper.

Table 4 (and Tables A8 and B10 (a) of the Technical Appendix)

The results for rows utilizing DOREMI as Distant Data were obtained by training the DocRE models with DOREMI-DDS-DocRED.json.
The results for rows utilizing D+U as Distant Data were obtained by training the DocRE models with DU-DDS-DocRED.json.

Tables 5 and 6 (and Tables A8 and B10 (b) of the Technical Appendix)

The results for rows utilizing DOREMI as Distant Data were obtained by training the DocRE models with DOREMI-DDS-ReDocRED.json.
The results for rows utilizing D+U as Distant Data were obtained by training the DocRE models with DU-DDS-ReDocRED.json.
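
The mapping from table rows to files follows the file naming pattern, so it can be captured in a small lookup. This is only an illustrative sketch: the function name `distant_file` and its argument labels are hypothetical and are not part of the released material.

```python
def distant_file(distant_data, reference):
    """Return the DDS file name for a 'Distant Data' row label and a reference dataset.

    distant_data: "DOREMI" or "D+U", as the rows are labeled in the tables.
    reference: "DocRED" (Table 4) or "Re-DocRED" (Tables 5 and 6).
    """
    variants = {"DOREMI": "DOREMI", "D+U": "DU"}
    references = {"DocRED": "DocRED", "Re-DocRED": "ReDocRED"}
    return f"{variants[distant_data]}-DDS-{references[reference]}.json"


if __name__ == "__main__":
    for ref in ("DocRED", "Re-DocRED"):
        for row in ("DOREMI", "D+U"):
            print(row, ref, "->", distant_file(row, ref))
```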

Files

Files (1.9 GB)

Name	MD5	Size
DOREMI-DDS-DocRED.json	48b4609c725b39dfdf6300a8ff0c6970	443.6 MB
DOREMI-DDS-ReDocRED.json	e43d92797b8b9d836abe0feb4fab052e	575.5 MB
DU-DDS-DocRED.json	1bc328ff8696382f73cfed0946846cf6	441.9 MB
DU-DDS-ReDocRED.json	434d0ac06b38cdbdb483204387826838	445.7 MB
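
After downloading, the MD5 checksums listed for the files can be used to confirm that each download is intact. The sketch below uses Python's standard `hashlib` module and reads each file in chunks so that it is not loaded into memory at once. The helper names `md5_of` and `verify` are illustrative, and the files are assumed to sit in the current directory unless another directory is passed.

```python
import hashlib
import os

# Checksums as listed for the files in this record.
EXPECTED_MD5 = {
    "DOREMI-DDS-DocRED.json": "48b4609c725b39dfdf6300a8ff0c6970",
    "DOREMI-DDS-ReDocRED.json": "e43d92797b8b9d836abe0feb4fab052e",
    "DU-DDS-DocRED.json": "1bc328ff8696382f73cfed0946846cf6",
    "DU-DDS-ReDocRED.json": "434d0ac06b38cdbdb483204387826838",
}


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Return {file name: True/False/None}; None means the file was not found."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = os.path.join(directory, name)
        results[name] = md5_of(path) == expected if os.path.exists(path) else None
    return results


if __name__ == "__main__":
    for name, ok in verify().items():
        status = {True: "OK", False: "MISMATCH", None: "missing"}[ok]
        print(f"{name}: {status}")
```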

Additional details

European Commission
HEREDITARY - HetERogeneous sEmantic Data integratIon for the guT-bRain interplaY 101137074
