Published January 7, 2026 | Version 1.0
Dataset Open

DOREMI Denoised Distantly Supervised Datasets

  • 1. University of Padua
  • 2. University of Padova

Description

This repository contains the denoised distantly supervised datasets produced using the DOREMI framework.

DOcument-level Relation Extraction optiMizing the long taIl (DOREMI) is an active learning-based system that enhances the training data through targeted manual annotation of highly informative examples. DOREMI operates upstream in the general DocRE pipeline by augmenting the dataset in a model-agnostic fashion, enabling any downstream DocRE model to benefit from improved long-tail coverage. Such an approach results in the production of a Denoised Distantly Supervised Dataset (DDS) that can be used to train any existing DocRE model, demonstrating improvements in long-tail relation predictions.

We release four DDSs, which were used for the experimental evaluation of DOREMI. Two datasets (denoted by "DOREMI") were generated by cleaning the DocRED distant dataset with the DOREMI framework, utilizing DocRED and Re-DocRED, respectively. The other two datasets were generated by a hybrid approach (denoted by "DU"), combining DOREMI long-tail predictions with UGDRE annotations for frequent relations.

File Outline

All datasets are denoised versions of the DocRED distant dataset. Hence, they all contain the same documents and entities. The DocRED distant dataset consists of 101,873 documents and 1,965,484 entities.

The repository contains the following files:

  • DOREMI-DDS-DocRED.json: DDS generated by DOREMI utilizing DocRED. The dataset consists of 1,704,161 positive examples.
  • DOREMI-DDS-ReDocRED.json: DDS generated by DOREMI utilizing Re-DocRED. The dataset consists of 3,957,238 positive examples.
  • DU-DDS-DocRED.json: DDS generated by combining DOREMI (utilizing DocRED) long-tail predictions with UGDRE annotations for frequent relations. The dataset consists of 1,726,707 positive examples.
  • DU-DDS-ReDocRED.json: DDS generated by combining DOREMI (utilizing Re-DocRED) long-tail predictions with UGDRE annotations for frequent relations. The dataset consists of 1,796,229 positive examples.
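To check a downloaded file against the counts listed above, a minimal sketch follows. It assumes (this is not stated on this page) that each DDS keeps the DocRED distant JSON layout: a top-level array of documents, each with a "vertexSet" list of entities and a "labels" list of relation triples carrying the relation id under "r". The helper names load_dds and summarize are hypothetical.

```python
import json
from collections import Counter


def load_dds(path):
    """Load a DDS file, ASSUMED to be a JSON array of DocRED-style documents."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize(docs):
    """Count documents, entities, and positive examples (relation triples).

    The field names "vertexSet", "labels", and "r" are assumptions taken
    from the DocRED format, not confirmed for these files.
    """
    relation_counts = Counter()
    n_entities = 0
    for doc in docs:
        n_entities += len(doc.get("vertexSet", []))
        for label in doc.get("labels", []):
            relation_counts[label["r"]] += 1
    return {
        "documents": len(docs),
        "entities": n_entities,
        "positive_examples": sum(relation_counts.values()),
        "per_relation": relation_counts,
    }


# Tiny in-memory document so the sketch runs without downloading anything.
sample = [{
    "title": "Example",
    "vertexSet": [[{"name": "A"}], [{"name": "B"}]],
    "labels": [{"h": 0, "t": 1, "r": "P17"}],
}]
stats = summarize(sample)
print(stats["documents"], stats["entities"], stats["positive_examples"])
```

On a real file, summarize(load_dds("DOREMI-DDS-DocRED.json")) would be the corresponding call. Each file is several hundred megabytes, so json.load needs a comparable amount of memory; the "per_relation" counter is useful for inspecting how the long-tail relations are covered.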

Reproducibility Guidelines

This section describes how to obtain the results presented in the "Experimental Results" section of the paper.

Table 4 (and Tables A8 and B10 (a) of the Technical Appendix)

  • The results for rows utilizing DOREMI as Distant Data were obtained by training the DocRE models with DOREMI-DDS-DocRED.json.
  • The results for rows utilizing D+U as Distant Data were obtained by training the DocRE models with DU-DDS-DocRED.json.

Tables 5 and 6 (and Tables A8 and B10 (b) of the Technical Appendix)

  • The results for rows utilizing DOREMI as Distant Data were obtained by training the DocRE models with DOREMI-DDS-ReDocRED.json.
  • The results for rows utilizing D+U as Distant Data were obtained by training the DocRE models with DU-DDS-ReDocRED.json.
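The mapping between paper tables, "Distant Data" row labels, and released files can be written down as a small lookup, which may help when scripting training runs. This is only a sketch: the dictionary DDS_BY_TABLE and the helper dds_file_for are hypothetical names, and the snippet covers only the main-paper tables listed in this section.

```python
# Main-paper table number -> {"Distant Data" row label -> released DDS file}.
# The file names come from this page; the structure itself is illustrative.
DDS_BY_TABLE = {
    4: {"DOREMI": "DOREMI-DDS-DocRED.json", "D+U": "DU-DDS-DocRED.json"},
    5: {"DOREMI": "DOREMI-DDS-ReDocRED.json", "D+U": "DU-DDS-ReDocRED.json"},
    6: {"DOREMI": "DOREMI-DDS-ReDocRED.json", "D+U": "DU-DDS-ReDocRED.json"},
}


def dds_file_for(table, distant_data):
    """Return the DDS file used for a given table and Distant Data label."""
    try:
        return DDS_BY_TABLE[table][distant_data]
    except KeyError:
        raise ValueError(
            f"no DDS file listed for table {table!r}, label {distant_data!r}"
        ) from None


print(dds_file_for(4, "D+U"))
```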

Files


Files (1.9 GB)

Checksum and size of each file:

  • md5:48b4609c725b39dfdf6300a8ff0c6970 (443.6 MB)
  • md5:e43d92797b8b9d836abe0feb4fab052e (575.5 MB)
  • md5:1bc328ff8696382f73cfed0946846cf6 (441.9 MB)
  • md5:434d0ac06b38cdbdb483204387826838 (445.7 MB)
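A download can be verified against the MD5 checksums listed above. The sketch below hashes a file in chunks and checks whether the digest is one of the four published values. This page does not pair each checksum with a file name, so the check is deliberately made against the whole set. The names PUBLISHED_MD5, md5_of_file, and is_published_checksum are hypothetical.

```python
import hashlib

# The four MD5 digests listed on this page (file-name pairing not given here).
PUBLISHED_MD5 = {
    "48b4609c725b39dfdf6300a8ff0c6970",
    "e43d92797b8b9d836abe0feb4fab052e",
    "1bc328ff8696382f73cfed0946846cf6",
    "434d0ac06b38cdbdb483204387826838",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_published_checksum(path):
    """Return True if the file's MD5 matches any checksum listed above."""
    return md5_of_file(path) in PUBLISHED_MD5
```

For example, is_published_checksum("DOREMI-DDS-DocRED.json") should return True for an intact download. Reading in chunks keeps memory use flat regardless of file size.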

Additional details

Funding

European Commission
HEREDITARY - HetERogeneous sEmantic Data integratIon for the guT-bRain interplaY (grant no. 101137074)