Dataset Open Access
Al-Khatib, Khalid;
Völske, Michael;
Syed, Shahbaz;
Kolyada, Nikolay;
Stein, Benno
<?xml version='1.0' encoding='utf-8'?> <resource xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://datacite.org/schema/kernel-4" xsi:schemaLocation="http://datacite.org/schema/kernel-4 http://schema.datacite.org/meta/kernel-4.1/metadata.xsd"> <identifier identifierType="DOI">10.5281/zenodo.3778298</identifier> <creators> <creator> <creatorName>Al-Khatib, Khalid</creatorName> <givenName>Khalid</givenName> <familyName>Al-Khatib</familyName> <affiliation>Bauhaus-Universität Weimar</affiliation> </creator> <creator> <creatorName>Völske, Michael</creatorName> <givenName>Michael</givenName> <familyName>Völske</familyName> <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="http://orcid.org/">0000-0002-9283-6846</nameIdentifier> <affiliation>Bauhaus-Universität Weimar</affiliation> </creator> <creator> <creatorName>Syed, Shahbaz</creatorName> <givenName>Shahbaz</givenName> <familyName>Syed</familyName> <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="http://orcid.org/">0000-0002-4821-1507</nameIdentifier> <affiliation>Leipzig University</affiliation> </creator> <creator> <creatorName>Kolyada, Nikolay</creatorName> <givenName>Nikolay</givenName> <familyName>Kolyada</familyName> <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="http://orcid.org/">0000-0002-6493-9557</nameIdentifier> <affiliation>Bauhaus-Universität Weimar</affiliation> </creator> <creator> <creatorName>Stein, Benno</creatorName> <givenName>Benno</givenName> <familyName>Stein</familyName> <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="http://orcid.org/">0000-0001-9033-2217</nameIdentifier> <affiliation>Bauhaus-Universität Weimar</affiliation> </creator> </creators> <titles> <title>Webis ChangeMyView Corpus 2020 (Webis-CMV-20)</title> </titles> <publisher>Zenodo</publisher> <publicationYear>2020</publicationYear> <subjects> <subject>social media</subject> <subject>argumentation</subject> <subject>persuasiveness</subject> </subjects> <dates> <date 
dateType="Issued">2020-04-30</date> </dates> <language>en</language> <resourceType resourceTypeGeneral="Dataset"/> <alternateIdentifiers> <alternateIdentifier alternateIdentifierType="url">https://zenodo.org/record/3778298</alternateIdentifier> </alternateIdentifiers> <relatedIdentifiers> <relatedIdentifier relatedIdentifierType="DOI" relationType="IsVersionOf">10.5281/zenodo.3778297</relatedIdentifier> <relatedIdentifier relatedIdentifierType="URL" relationType="IsPartOf">https://zenodo.org/communities/webis</relatedIdentifier> </relatedIdentifiers> <rightsList> <rights rightsURI="https://creativecommons.org/licenses/by/4.0/legalcode">Creative Commons Attribution 4.0 International</rights> <rights rightsURI="info:eu-repo/semantics/openAccess">Open Access</rights> </rightsList> <descriptions> <description descriptionType="Abstract"><p>The Webis-CMV-20 dataset comprises all available posts and comments in the <a href="https://reddit.com/r/changemyview">ChangeMyView</a> subreddit from the foundation of the subreddit in 2005 until September 2017. From these, we have derived two sub-datasets for the tasks of persuasiveness prediction and opinion malleability prediction.
In addition, the corpus comprises historical posts by CMV authors and derived personal characteristics.</p> <p><strong>Dataset specification</strong></p> <p>All files are in bzip2-compressed <a href="http://jsonlines.org/">JSON Lines</a> format.</p> <ul> <li><strong>threads.jsonl:</strong> contains all the selected discussion threads from CMV</li> <li><strong>pairs.jsonl:</strong> each record contains a submission, a delta comment, a non-delta comment, and the comments' similarity score</li> <li><strong>posts-malleability.jsonl:</strong> contains posts for opinion malleability prediction, in the format provided in the original <a href="https://files.pushshift.io/reddit/">Reddit Crawl</a> dataset</li> <li><strong>author_entity_category.jsonl:</strong> each record contains an author and the list of Wikipedia entities that the author mentioned in messages across all subreddits. For each mentioned entity we provide the following data:</li> </ul> <pre><code class="language-json">[title, wikidata_id, wikipedia_page_id, mentioned_entity_title, wikifier_score, subreddit_name, subreddit_id, subreddit_category_name, subreddit_topcategory_name]</code></pre> <ul> <li><strong>author_liwc.jsonl:</strong> personality trait features computed with <a href="https://liwc.wpengine.com/">LIWC</a> for the authors in the pairs.jsonl and posts-malleability.jsonl datasets</li> <li><strong>author_subreddit.jsonl:</strong> for each author, statistics on the number of posts (submissions/comments) across all subreddits</li> <li><strong>author_subreddit_category.jsonl:</strong> similar to author_subreddit.jsonl, but with each author's post statistics grouped by top-category and category according to <a href="https://snoopsnoo.com/subreddits/">snoopsnoo.com</a></li> </ul></description> </descriptions> </resource>
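Since the description states that all files are bzip2-compressed JSON Lines, the following is a minimal Python sketch of how such a file could be streamed record by record. It is an illustration, not code shipped with the corpus: the helper names `read_jsonl_bz2` and `entity_to_dict` are hypothetical, and the assumption that each entity entry is a positional list in exactly the order shown in the description is ours.

```python
import bz2
import json
from typing import Any, Iterator

# Field order copied from the dataset description of author_entity_category.jsonl.
# Assumption: each entity entry is a positional list in exactly this order.
ENTITY_FIELDS = [
    "title", "wikidata_id", "wikipedia_page_id", "mentioned_entity_title",
    "wikifier_score", "subreddit_name", "subreddit_id",
    "subreddit_category_name", "subreddit_topcategory_name",
]


def read_jsonl_bz2(path: str) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of a bzip2-compressed file."""
    # bz2.open in text mode decompresses lazily, so large files are never
    # loaded into memory as a whole.
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def entity_to_dict(values: list) -> dict:
    """Pair a positional entity entry with the field names listed above."""
    return dict(zip(ENTITY_FIELDS, values))
```

A call such as `for record in read_jsonl_bz2("threads.jsonl.bz2"): ...` would then iterate over the threads one at a time. The file name with a `.bz2` suffix is an assumption; use whatever name the downloaded archive actually has. Streaming line by line is a deliberate choice here, given that the full corpus is large.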
| | All versions | This version |
|---|---|---|
| Views | 713 | 713 |
| Downloads | 333 | 333 |
| Data volume | 148.5 GB | 148.5 GB |
| Unique views | 600 | 600 |
| Unique downloads | 140 | 140 |