Webis EditorialSum Corpus 2020
Creators
- 1. Leipzig University
- 2. German Aerospace Centre (DLR)
- 3. Bauhaus Universität, Weimar
Description
The Webis EditorialSum Corpus consists of 1330 manually curated extractive summaries for 266 news editorials spanning three diverse portals: Al-Jazeera, Guardian and Fox News. Each editorial has 5 summaries, each labeled for overall quality and fine grained properties such as thesis-relevance, persuasiveness, reasonableness, self-containedness.
The files are organized as follows:
corpus.csv - Contains all the editorials and their acquired summaries
Note: (X = [1,5] for five summaries)
- article_id : Article ID in the corpus
- title : Title of the editorial
- article_text : Plain text of the editorial
- summary_{X}_text : Plain text of the corresponding summary
- thesis_{X}_text : Plain text of the thesis from the corresponding summary
- lead : top 15% of the editorial's segments
- body : segments between lead and conclusion sections
- conclusion : bottom 15% of the editorial's segments
- article_segments: Collection of paragraphs, each further divided into collection of segments containing:
{ "number": segment order in the editorial,
"text" : segment text,
"label": ADU type
}
- summary_{X}_segments: Collection of summary segments containing:
{ "number": segment order in the editorial,
"text" : segment text,
"adu_label": ADU type from the editorial,
"summary_label": can be 'thesis' or 'justification'
}
quality-groups.csv - Contains the IDs for high(and low)-quality summaries for each quality dimension per editorial
For example: article_id 2 has four high_quality summaries (summary_1, summary_2, summary_3, summary_4) and one low_quality summary (summary_5) in terms of overall quality.
The summary texts can be obtained from corpus.csv respectively.
Files
corpus.csv
Files
(10.8 MB)
Name | Size | Download all |
---|---|---|
md5:b3053455c6c58580570c9e30390f7d62
|
10.7 MB | Preview Download |
md5:117cb5a5712a3772b0da9ab254a331d7
|
95.0 kB | Preview Download |