Dataset Open Access

Webis EditorialSum Corpus 2020

Syed, Shahbaz; El Baff, Roxanne; Al-Khatib, Khalid; Kiesel, Johannes; Stein, Benno; Potthast, Martin

Dublin Core Export

<?xml version='1.0' encoding='utf-8'?>
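<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">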
  <dc:creator>Syed, Shahbaz</dc:creator>
  <dc:creator>El Baff, Roxanne</dc:creator>
  <dc:creator>Al-Khatib, Khalid</dc:creator>
  <dc:creator>Kiesel, Johannes</dc:creator>
  <dc:creator>Stein, Benno</dc:creator>
  <dc:creator>Potthast, Martin</dc:creator>
  <dc:description>The Webis EditorialSum Corpus consists of 1330 manually curated extractive summaries for 266 news editorials from three diverse portals: Al-Jazeera, the Guardian, and Fox News. Each editorial has 5 summaries, and each summary is labeled for overall quality and for fine-grained properties such as thesis-relevance, persuasiveness, reasonableness, and self-containedness.

The files are organized as follows:

corpus.csv - Contains all the editorials and their acquired summaries

Note: X ranges from 1 to 5, one index for each of the five summaries.

- article_id : Article ID in the corpus
- title : Title of the editorial
- article_text : Plain text of the editorial
- summary_{X}_text : Plain text of the corresponding summary
- thesis_{X}_text : Plain text of the thesis from the corresponding summary
- lead : top 15% of the editorial's segments
- body : segments between lead and conclusion sections
- conclusion : bottom 15% of the editorial's segments
- article_segments : Collection of paragraphs, each further divided into a collection of segments containing:
  { "number": segment order in the editorial,
    "text" : segment text,
    "label": ADU type }
- summary_{X}_segments : Collection of summary segments containing:
  { "number": segment order in the editorial,
    "text" : segment text,
    "adu_label": ADU type from the editorial,
    "summary_label": can be 'thesis' or 'justification' }
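
The column layout above can be explored with a short Python sketch. It is illustrative only: it assumes the segment collections are stored as JSON strings inside the CSV cells, which the description does not confirm, and it uses a made-up row instead of real corpus data. The helper names summary_columns and group_by_summary_label are hypothetical.

```python
import csv
import io
import json


def summary_columns(x):
    """Return the documented column names for summary number x (1 to 5)."""
    if x not in range(1, 6):
        raise ValueError("summary index must be between 1 and 5")
    return {
        "text": f"summary_{x}_text",
        "thesis": f"thesis_{x}_text",
        "segments": f"summary_{x}_segments",
    }


def group_by_summary_label(segments):
    """Group segment texts by 'summary_label', in editorial order."""
    groups = {"thesis": [], "justification": []}
    for segment in sorted(segments, key=lambda s: s["number"]):
        groups[segment["summary_label"]].append(segment["text"])
    return groups


# Made-up row for illustration only; texts and ADU labels are placeholders.
cols = summary_columns(1)
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["article_id", cols["text"], cols["segments"]]
)
writer.writeheader()
writer.writerow({
    "article_id": "2",
    cols["text"]: "Claim A. Reason B.",
    cols["segments"]: json.dumps([
        {"number": 7, "text": "Reason B.",
         "adu_label": "placeholder", "summary_label": "justification"},
        {"number": 3, "text": "Claim A.",
         "adu_label": "placeholder", "summary_label": "thesis"},
    ]),
})
buffer.seek(0)

# Read the row back the way one would read corpus.csv.
row = next(csv.DictReader(buffer))
segments = json.loads(row[cols["segments"]])  # assumption: JSON-encoded cell
groups = group_by_summary_label(segments)
print(groups)
```

To try this on the real file, replace the in-memory buffer with open("corpus.csv", newline="", encoding="utf-8"). If the segment cells turn out not to be JSON, the json.loads call would need to be swapped for a matching parser.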

quality-groups.csv - Contains, for each editorial and each quality dimension, the IDs of its high-quality and low-quality summaries

For example: article_id 2 has four high_quality summaries (summary_1, summary_2, summary_3, summary_4) and one low_quality summary (summary_5) in terms of overall quality.
The corresponding summary texts can be obtained from corpus.csv.</dc:description>



  <dc:subject>editorial summarization</dc:subject>
  <dc:subject>argumentation summarization</dc:subject>
  <dc:subject>extractive summarization</dc:subject>
  <dc:title>Webis EditorialSum Corpus 2020</dc:title>
</oai_dc:dc>
                   All versions   This version
Views              56             56
Downloads          63             63
Data volume        548.5 MB       548.5 MB
Unique views       48             48
Unique downloads   38             38

