Dataset Open Access

Webis ChangeMyView Corpus 2020 (Webis-CMV-20)

Al-Khatib, Khalid; Völske, Michael; Syed, Shahbaz; Kolyada, Nikolay; Stein, Benno


Citation Style Language JSON Export

{
  "publisher": "Zenodo", 
  "DOI": "10.5281/zenodo.3778298", 
  "language": "eng", 
  "title": "Webis ChangeMyView Corpus 2020 (Webis-CMV-20)", 
  "issued": {
    "date-parts": [
      [
        2020, 
        4, 
        30
      ]
    ]
  }, 
  "abstract": "<p>The Webis-CMV-20 dataset comprises all&nbsp;available posts and comments in the <a href=\"https://reddit.com/r/changemyview\">ChangeMyView</a>&nbsp;subreddit&nbsp;from the foundation of the subreddit&nbsp;in 2005, until September 2017. From these, we have derived two sub-datasets for the tasks of persuasiveness prediction and opinion malleability prediction. In addition, the corpus comprises historical posts by CMV authors and derived personal characteristics.</p>\n\n<p><strong>Dataset specification</strong></p>\n\n<p>All files are in bzip2-compressed <a href=\"http://jsonlines.org/\">JSON Lines</a> format.</p>\n\n<ul>\n\t<li><strong>threads.jsonl:</strong> contains all the selected discussion threads from CMV</li>\n\t<li><strong>pairs.jsonl:</strong> each record contains a submission, a delta comment, a non-delta comment, and the comments&#39;&nbsp;similarity score</li>\n\t<li><strong>posts-malleability.jsonl:</strong> contains&nbsp;posts&nbsp;for&nbsp;opinion malleability prediction,&nbsp;in the format provided in the original <a href=\"https://files.pushshift.io/reddit/\">Reddit Crawl</a> dataset</li>\n\t<li><strong>author_entity_category.jsonl:</strong> each record contains the author and a list of Wikipedia entities mentioned by the author in messages across all subreddits. For each mentioned entity we provide the following data:&nbsp;</li>\n</ul>\n\n<pre><code class=\"language-json\">[title, wikidata_id, wikipedia_page_id, mentioned_entity_title, wikifier_score, subreddit_name, subreddit_id, subreddit_category_name, subreddit_topcategory_name]</code></pre>\n\n<ul>\n\t<li><strong>author_liwc.jsonl:</strong>&nbsp;personality-trait features computed with <a href=\"https://liwc.wpengine.com/\">LIWC</a> for the authors from the pairs.jsonl and posts-malleability.jsonl datasets</li>\n\t<li><strong>author_subreddit.jsonl:</strong> for each author, statistics on the number of posts (submissions/comments) across all subreddits are provided</li>\n\t<li><strong>author_subreddit_category.jsonl:</strong> similar to author_subreddit.jsonl, but the statistics of all author posts are grouped by top-categories and categories according to <a href=\"https://snoopsnoo.com/subreddits/\">snoopsnoo.com</a><br>\n\t&nbsp;</li>\n</ul>",
  "author": [
    {
      "family": "Al-Khatib, Khalid"
    }, 
    {
      "family": "V\u00f6lske, Michael"
    }, 
    {
      "family": "Syed, Shahbaz"
    }, 
    {
      "family": "Kolyada, Nikolay"
    }, 
    {
      "family": "Stein, Benno"
    }
  ], 
  "type": "dataset", 
  "id": "3778298"
}
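
The abstract states that all files are bzip2-compressed JSON Lines, and lists nine fields provided per mentioned entity. A minimal Python sketch of reading such a file might look like the following. The helper names `iter_jsonl_bz2` and `entity_to_dict` are hypothetical, as is the assumption that each entity is stored as a flat list in the order given in the abstract; the record does not document the on-disk file names or the exact record layout.

```python
import bz2
import json
from typing import Any, Dict, Iterator, Sequence

# Field order copied from the abstract; that entities are stored as flat
# lists in exactly this order is an assumption, not confirmed by the record.
ENTITY_FIELDS = (
    "title",
    "wikidata_id",
    "wikipedia_page_id",
    "mentioned_entity_title",
    "wikifier_score",
    "subreddit_name",
    "subreddit_id",
    "subreddit_category_name",
    "subreddit_topcategory_name",
)


def iter_jsonl_bz2(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON value per non-empty line of a bzip2-compressed JSON Lines file."""
    # Text mode lets bz2 decompress and decode line by line, so the whole
    # file never has to fit in memory.
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def entity_to_dict(values: Sequence[Any]) -> Dict[str, Any]:
    """Label a nine-element entity list with the field names from the abstract."""
    if len(values) != len(ENTITY_FIELDS):
        raise ValueError(
            f"expected {len(ENTITY_FIELDS)} values, got {len(values)}"
        )
    return dict(zip(ENTITY_FIELDS, values))
```

For example, `for record in iter_jsonl_bz2("threads.jsonl.bz2"): ...` would stream the threads file one record at a time, assuming the download is stored locally under that name.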