Dataset Open Access

Webis EditorialSum Corpus 2020

Syed, Shahbaz; El Baff, Roxanne; Al-Khatib, Khalid; Kiesel, Johannes; Stein, Benno; Potthast, Martin


Dublin Core Export

<?xml version='1.0' encoding='utf-8'?>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
  <dc:creator>Syed, Shahbaz</dc:creator>
  <dc:creator>El Baff, Roxanne</dc:creator>
  <dc:creator>Al-Khatib, Khalid</dc:creator>
  <dc:creator>Kiesel, Johannes</dc:creator>
  <dc:creator>Stein, Benno</dc:creator>
  <dc:creator>Potthast, Martin</dc:creator>
  <dc:date>2020-10-19</dc:date>
  <dc:description>The Webis EditorialSum Corpus consists of 1330 manually curated extractive summaries for 266 news editorials from three diverse portals: Al-Jazeera, Guardian, and Fox News. Each editorial has 5 summaries, each labeled for overall quality and for fine-grained properties such as thesis-relevance, persuasiveness, reasonableness, and self-containedness.

The files are organized as follows:


corpus.csv - Contains all the editorials and their acquired summaries


Note: X ranges from 1 to 5, one index for each of the five summaries.

- article_id : Article ID in the corpus
- title : Title of the editorial
- article_text : Plain text of the editorial
- summary_{X}_text : Plain text of the corresponding summary
- thesis_{X}_text : Plain text of the thesis from the corresponding summary
- lead : top 15% of the editorial's segments
- body : segments between lead and conclusion sections
- conclusion : bottom 15% of the editorial's segments
- article_segments : Collection of paragraphs, each further divided into a collection of segments containing:
  { "number": segment order in the editorial,
    "text": segment text,
    "label": ADU (argumentative discourse unit) type
  }
- summary_{X}_segments : Collection of summary segments containing:
  { "number": segment order in the editorial,
    "text": segment text,
    "adu_label": ADU type from the editorial,
    "summary_label": can be 'thesis' or 'justification'
  }


quality-groups.csv - Contains the IDs of the high- and low-quality summaries for each quality dimension, per editorial

For example: article_id 2 has four high_quality summaries (summary_1, summary_2, summary_3, summary_4) and one low_quality summary (summary_5) in terms of overall quality.
The text of each listed summary can be looked up in corpus.csv.
 </dc:description>
  <dc:identifier>https://zenodo.org/record/4105765</dc:identifier>
  <dc:identifier>10.5281/zenodo.4105765</dc:identifier>
  <dc:identifier>oai:zenodo.org:4105765</dc:identifier>
  <dc:language>eng</dc:language>
  <dc:relation>doi:10.5281/zenodo.4105764</dc:relation>
  <dc:relation>url:https://zenodo.org/communities/webis</dc:relation>
  <dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
  <dc:rights>https://creativecommons.org/licenses/by/4.0/legalcode</dc:rights>
  <dc:subject>editorial summarization</dc:subject>
  <dc:subject>argumentation summarization</dc:subject>
  <dc:subject>extractive summarization</dc:subject>
  <dc:title>Webis EditorialSum Corpus 2020</dc:title>
  <dc:type>info:eu-repo/semantics/other</dc:type>
  <dc:type>dataset</dc:type>
</oai_dc:dc>
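
The column layout described for corpus.csv above can be sketched in Python as follows. This is a minimal sketch, not the authors' loading code: it assumes corpus.csv is comma-delimited with a header row that uses exactly the column names from the description. The helper names summaries_for and read_corpus are hypothetical, and the single row is a made-up stand-in so the example runs without the real file.

```python
import csv
import io

SUMMARY_IDS = range(1, 6)  # X = 1..5, per the corpus description


def summaries_for(row):
    """Return {X: {"summary": ..., "thesis": ...}} for one corpus.csv row."""
    return {
        x: {
            "summary": row[f"summary_{x}_text"],
            "thesis": row[f"thesis_{x}_text"],
        }
        for x in SUMMARY_IDS
    }


def read_corpus(handle):
    """Yield (article_id, title, summaries) from an open corpus.csv handle."""
    for row in csv.DictReader(handle):
        yield row["article_id"], row["title"], summaries_for(row)


# Toy stand-in for corpus.csv: made-up values, documented column names only.
header = ["article_id", "title"]
values = ["2", "Toy title"]
for x in SUMMARY_IDS:
    header += [f"summary_{x}_text", f"thesis_{x}_text"]
    values += [f"summary text {x}", f"thesis text {x}"]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerow(values)
buffer.seek(0)

records = list(read_corpus(buffer))
article_id, title, summaries = records[0]
print(article_id, title, summaries[5]["thesis"])
```

With the real file, the in-memory buffer would be replaced by something like open("corpus.csv", newline="", encoding="utf-8"); the encoding is an assumption, since the record does not state it.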