DOREMI Denoised Distantly Supervised Datasets
Description
This repository contains the denoised distantly supervised datasets produced with the DOREMI framework.
DOcument-level Relation Extraction optiMizing the long taIl (DOREMI) is an active learning-based system that enhances the training data through targeted manual annotation of highly informative examples. DOREMI operates upstream in the general DocRE pipeline by augmenting the dataset in a model-agnostic fashion, enabling any downstream DocRE model to benefit from improved long-tail coverage. This approach produces a Denoised Distantly Supervised Dataset (DDS) that can be used to train any existing DocRE model and that yields improvements in long-tail relation predictions.
We release four DDSs, which were used for the experimental evaluation of DOREMI. Two datasets (denoted by "DOREMI") were generated by cleaning the DocRED distant dataset with the DOREMI framework, utilizing DocRED and Re-DocRED respectively. The other two datasets were generated by a hybrid approach (denoted by "DU"), combining DOREMI long-tail predictions with UGDRE annotations for frequent relations.
File Outline
All datasets are denoised versions of the DocRED distant dataset. Hence, they all contain the same documents and entities. The DocRED distant dataset consists of 101,873 documents and 1,965,484 entities.
The repository contains the following files:
- DOREMI-DDS-DocRED.json: DDS generated by DOREMI utilizing DocRED. The dataset consists of 1,704,161 positive examples.
- DOREMI-DDS-ReDocRED.json: DDS generated by DOREMI utilizing Re-DocRED. The dataset consists of 3,957,238 positive examples.
- DU-DDS-DocRED.json: DDS generated by combining DOREMI (utilizing DocRED) long-tail predictions with UGDRE annotations for frequent relations. The dataset consists of 1,726,707 positive examples.
- DU-DDS-ReDocRED.json: DDS generated by combining DOREMI (utilizing Re-DocRED) long-tail predictions with UGDRE annotations for frequent relations. The dataset consists of 1,796,229 positive examples.
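The counts above can be checked with a short script. The following is a minimal sketch, not part of the release: it assumes each file is a JSON list of documents in the DocRED format (a "vertexSet" list of entities and a "labels" list of relation instances with a relation id under "r"), and it assumes that one entry in "labels" corresponds to one positive example. The function names `load_dds` and `summarize_dds` are hypothetical.

```python
import json
from collections import Counter


def load_dds(path):
    """Load a DDS file, assumed to be a JSON list of DocRED-style documents."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def summarize_dds(docs):
    """Count documents, entities, and positive examples.

    Assumes the DocRED schema: doc["vertexSet"] holds one item per entity,
    doc["labels"] holds one item per positive relation instance.
    """
    n_entities = sum(len(doc.get("vertexSet", [])) for doc in docs)
    relation_counts = Counter(
        label["r"] for doc in docs for label in doc.get("labels", [])
    )
    return {
        "documents": len(docs),
        "entities": n_entities,
        "positive_examples": sum(relation_counts.values()),
        "relation_counts": relation_counts,
    }
```

For example, `summarize_dds(load_dds("DOREMI-DDS-DocRED.json"))` returns the totals for the first file, and `relation_counts.most_common()` gives a quick view of how the positive examples are spread between frequent and long-tail relations.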
Reproducibility Guidelines
This section describes how to obtain the results presented in the "Experimental Results" section of the paper.
Table 4 (and Tables A8 and B10 (a) of the Technical Appendix)
- The results for rows utilizing DOREMI as Distant Data were obtained by training the DocRE models with DOREMI-DDS-DocRED.json.
- The results for rows utilizing D+U as Distant Data were obtained by training the DocRE models with DU-DDS-DocRED.json.
Tables 5 and 6 (and Tables A8 and B10 (b) of the Technical Appendix)
- The results for rows utilizing DOREMI as Distant Data were obtained by training the DocRE models with DOREMI-DDS-ReDocRED.json.
- The results for rows utilizing D+U as Distant Data were obtained by training the DocRE models with DU-DDS-ReDocRED.json.
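For scripted experiments, the table-to-file correspondence above can be kept in a small lookup. This is only an illustrative sketch: the names `DDS_BY_TABLE` and `dds_file_for` are hypothetical, and the keys simply restate the main-paper table numbers and "Distant Data" row labels listed above.

```python
# Main-paper table number -> "Distant Data" row label -> DDS file name.
DDS_BY_TABLE = {
    4: {"DOREMI": "DOREMI-DDS-DocRED.json", "D+U": "DU-DDS-DocRED.json"},
    5: {"DOREMI": "DOREMI-DDS-ReDocRED.json", "D+U": "DU-DDS-ReDocRED.json"},
    6: {"DOREMI": "DOREMI-DDS-ReDocRED.json", "D+U": "DU-DDS-ReDocRED.json"},
}


def dds_file_for(table, distant_data):
    """Return the DDS file used for a given table and Distant Data label."""
    try:
        return DDS_BY_TABLE[table][distant_data]
    except KeyError:
        raise ValueError(
            f"no DDS file listed for table {table!r}, Distant Data {distant_data!r}"
        ) from None
```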