Creating Training Corpora for Micro-Planners

doi:10.5281/zenodo.827394

Published August 1, 2017 | Version v1

Conference paper | Open

1. CNRS, LORIA
2. LORIA
3. University of Edinburgh

In this paper, we present a novel framework for semi-automatically creating linguistically challenging microplanning data-to-text corpora from existing Knowledge Bases. Because our method pairs data of varying size and shape with texts ranging from simple clauses to short texts, a dataset created using this framework provides a challenging benchmark for microplanning. Another feature of this framework is that it can be applied to any large-scale knowledge base and can therefore be used to train KB verbalisers. We apply our framework to DBpedia data and compare the resulting dataset with that of Wen et al. (2016). We show that while Wen et al.’s dataset is more than twice as large as ours, it is less diverse both in terms of input and in terms of text. We thus propose our corpus generation framework as a novel method for creating challenging data sets from which NLG models can be learned that are capable of handling the complex interactions occurring during micro-planning between lexicalisation, aggregation, surface realisation, referring expression generation and sentence segmentation. To encourage researchers to take up this challenge, we made available a dataset of 21,855 data/text pairs created using this framework in the context of the WEBNLG shared task.

Files

Name	Size
acl_2017.pdf (md5:cdcb4403aaac1767055ea56d6f9cf8f1)	285.2 kB

Additional details

Funding: SUMMA – Scalable Understanding of Multilingual Media (688139), European Commission

	All versions	This version
Views	66	66
Downloads	75	74
Data volume	22.2 MB	22.0 MB
