CrowdTruth Corpus for Open Domain Relation Extraction from Sentences

Anca Dumitrache; Lora Aroyo; Chris Welty

doi:10.5281/zenodo.1472330

Published October 26, 2018 | Version v.1.0

Dataset Open


1. Vrije Universiteit Amsterdam
2. Google

This repository contains a ground truth corpus for open domain relation extraction from sentences, acquired with crowdsourcing and processed with CrowdTruth metrics that capture ambiguity in annotations by measuring inter-annotator disagreement.
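To make the idea of measuring disagreement concrete, the sketch below computes an unweighted sentence-relation score: the cosine similarity between a sentence's aggregated annotation vector and the unit vector of each relation. This is a simplified illustration only. The corpus itself may have been processed with a different variant of the CrowdTruth metrics (for example, one weighted by worker quality), and the function name and relation labels here are hypothetical.

```python
import math
from collections import Counter


def sentence_relation_scores(worker_annotations):
    """Return an unweighted cosine score per relation for one sentence.

    worker_annotations: list of sets, one per worker, each holding the
    relation labels that worker selected for the sentence.
    This is an illustrative simplification, not the corpus's pipeline.
    """
    # Sentence vector: how many workers picked each relation.
    counts = Counter()
    for labels in worker_annotations:
        counts.update(labels)

    # Euclidean norm of the sentence vector.
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}

    # Cosine with a relation's unit vector reduces to count / norm.
    return {relation: c / norm for relation, c in counts.items()}


# Hypothetical example: 15 workers, 12 pick one relation and 3 pick another.
example = [{"relation_a"}] * 12 + [{"relation_b"}] * 3
scores = sentence_relation_scores(example)
```

A score near 1 means the workers largely agreed on that relation for the sentence; lower scores indicate that the annotations were spread across several relations, which is the kind of ambiguity the metrics are meant to surface.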

The dataset contains annotations for 4,100 sentences sampled from Angeli et al. (1) and Riedel et al. (2), over 16 relations, with each sentence annotated by 15 workers. The sentences have been pre-processed with Distant Supervision (3) using the Freebase knowledge base, in order to identify the term pairs in each sentence that are likely to express a relation. The crowdsourced data was collected from Figure Eight and Amazon Mechanical Turk.

This corpus has been discussed in the following papers:

Anca Dumitrache, Lora Aroyo and Chris Welty: Crowdsourcing Semantic Label Propagation in Relation Classification. FEVER Workshop at EMNLP 2018.
Anca Dumitrache, Lora Aroyo and Chris Welty: False Positive and Cross-relation Signals in Distant Supervision Data. AKBC Workshop at NIPS 2017.
Anca Dumitrache, Lora Aroyo and Chris Welty: Disagreement in Crowdsourcing and Active Learning for Better Distant Supervision Quality. Collective Intelligence 2017.

Sentence-level data is available in file: data/output/aggregated_sentences.csv

Worker-level data is available in file: data/output/aggregated_workers.csv

Raw crowdsourcing data is available in folder: data/input/

Results of the relation classification model are available in folder: data/model_results/
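A minimal sketch for taking a first look at the aggregated files listed above, using only the Python standard library. The column names are not documented in this description, so the helper (a hypothetical name, `load_rows`) simply returns the header as found in the file along with the rows. It assumes the archive has been extracted and that the CSV files are UTF-8 encoded.

```python
import csv


def load_rows(csv_path):
    """Read a CSV file and return (column_names, rows).

    Each row is a dict keyed by the column names found in the header.
    The encoding is assumed to be UTF-8.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        return reader.fieldnames or [], rows


# Example usage, with the path relative to the extracted archive:
# columns, rows = load_rows("data/output/aggregated_sentences.csv")
# print(columns)    # inspect the header before relying on any column
# print(len(rows))  # number of data rows in the file
```

Inspecting the header first avoids guessing at field names, which may differ between the sentence-level and worker-level files.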

References

(1) Angeli, Gabor, et al. "Combining distant and partial supervision for relation extraction." Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014.

(2) Riedel, Sebastian, et al. "Relation extraction with matrix factorization and universal schemas." Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). 2013.

(3) Mintz, Mike, et al. "Distant supervision for relation extraction without labeled data." Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2. Association for Computational Linguistics, 2009.

Files

CrowdTruth/Open-Domain-Relation-Extraction-v.1.0.zip (9.5 MB)
md5:4a2d16d2244c676a8990817ac1184646
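The MD5 checksum listed for the archive can be used to confirm that a download is intact. The sketch below uses Python's standard `hashlib` module; the helper names and the local file name in the usage comment are assumptions, and the file should be wherever the archive was actually saved.

```python
import hashlib

# Checksum published on this record for the v.1.0 archive.
EXPECTED_MD5 = "4a2d16d2244c676a8990817ac1184646"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, expected=EXPECTED_MD5):
    """True if the file's MD5 digest equals the expected checksum."""
    return md5_of_file(path) == expected.lower()


# Example usage (the local file name is an assumption):
# ok = matches_published_checksum("Open-Domain-Relation-Extraction-v.1.0.zip")
```

A mismatch usually points to an incomplete or corrupted download, in which case fetching the archive again is the simplest fix.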

Additional details

Is supplement to: https://github.com/CrowdTruth/Open-Domain-Relation-Extraction/tree/v.1.0

Anca Dumitrache, Lora Aroyo and Chris Welty: Crowdsourcing Semantic Label Propagation in Relation Classification. FEVER Workshop at EMNLP 2018. arXiv:1809.00537
Anca Dumitrache, Lora Aroyo and Chris Welty: False Positive and Cross-relation Signals in Distant Supervision Data. AKBC Workshop at NIPS 2017. arXiv:1711.05186

