PAN19 Authorship Analysis: Style Change Detection (Dataset, Restricted Access)
Zangerle, Eva;
Tschuggnall, Michael;
Specht, Günther;
Potthast, Martin;
Stein, Benno
{ "inLanguage": { "alternateName": "eng", "@type": "Language", "name": "English" }, "description": "<p>Many approaches have been proposed recently to identify <em>the</em> author of a given document. Thereby, one fact is often silently assumed: i.e., that the given document is indeed written by only author. For a realistic author identification system it is therefore crucial to at first determine whether a document is single- or multiauthored.</p>\n\n<p>To this end, previous PAN editions aimed to analyze multi-authored documents. As it has been shown that it is a hard problem to reliably identify individual authors and their contribution within a single document (<a href=\"https://pan.webis.de/clef16/pan16-web/author-identification.html\">Author Diarization, 2016</a>; <a href=\"https://pan.webis.de/clef17/pan17-web/author-identification.html\">Style Breach Detection, 2017</a>), last year's task substantially relaxed the problem by asking only for binary decision (single- or multi-authored). Considering the promising results achieved by the submitted approaches (see the <a href=\"http://ceur-ws.org/Vol-2125/invited_paper_2.pdf\">overview paper</a> for details), we continue last year's task and additionally ask participants to predict the number of involved authors.</p>\n\n<p>Given a document, participants thus should apply intrinsic style analyses to hierarchically answer the following questions:</p>\n\n<ol>\n\t<li>Is the document written by one or more authors, i.e., do style changes exist or not?</li>\n\t<li>If it is multi-authored, how many authors have collaborated?</li>\n</ol>\n\n<p>All documents are provided in English and may contain zero up to arbitrarily many style changes, resulting from arbitrarily many authors.</p>\n\n<p>The <strong>training</strong> set: contains 50% of the whole dataset and includes solutions. Use this set to feed/train your models.</p>\n\n<p>Like last year, the whole data set is based on user posts from various sites of the <a href=\"https://stackexchange.com/\">StackExchange network</a>, covering different topics and containing approximately 300 to 2000 tokens per document.</p>\n\n<p>For each problem instance X, two files are provided:</p>\n\n<ul>\n\t<li><code>problem-X.txt</code> contains the actual text</li>\n\t<li><code>problem-X.truth</code> contains the ground truth, i.e., the correct solution in <a href=\"http://www.json.org/\">JSON</a> format:</li>\n</ul>\n\n<pre><code class=\"language-json\">{ \"authors\": number_of_authors, \"structure\": [author_segment_1, ..., author_segment_3], \"switches\": [ character_pos_switch_segment_1, ..., character_pos_switch_segment_n, ] }</code></pre>\n\n<p>An example for a multi-author document could look as follows:</p>\n\n<pre><code class=\"language-json\">{ \"authors\": 4, \"structure\": [\"A1\", \"A2\", \"A4\", \"A2\", \"A4\", \"A2\", \"A3\", \"A2\", \"A4\"], \"switches\": [805, 1552, 2827, 3584, 4340, 5489, 7564, 8714] }</code></pre>\n\n<p>whereas a single-author document would have exactly the following form:</p>\n\n<pre><code class=\"language-json\">{ \"authors\": 1, \"structure\": [\"A1\"], \"switches\": [] }</code></pre>\n\n<p>Note that authors within the <em>structure</em> correspond only to the respective document, i.e., they are not the same over the whole dataset. For example, author <em>A1</em> in document 1 is most likely <strong>not</strong> the same author as <em>A1</em> in document 2 (it <strong>could</strong> be, but as there are hundreds of authors the chances are very small that this is the case). 
Further, please note that the *structure* and the *switches* are provided only as additional resources for the development of your algorithms, i.e., they are **not expected to be predicted**; a usage sketch follows below.

To tackle the problem, you can develop novel approaches, extend existing algorithms from last year's task, or adapt approaches from related problems such as **intrinsic plagiarism detection** or **text segmentation**. You are also free to additionally evaluate your approaches on last year's training/validation/test datasets (for the number of authors, use the corresponding metadata).
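Since *structure* and *switches* are meant purely as development aids, one straightforward use is to recover author-labeled text segments from them, for example to study style differences between contributing authors. The sketch below continues from the loading example above; it assumes that each switch value is the character index at which the next segment begins, matching the n switches / n + 1 segments relationship of the examples.

```python
def author_segments(text: str, truth: dict) -> list[tuple[str, str]]:
    """Split a document into (author_label, segment_text) pairs.

    Assumes each value in "switches" is the character position at which the
    next segment starts, so n switches yield n + 1 segments, aligned with the
    author labels in "structure".
    """
    boundaries = [0] + truth["switches"] + [len(text)]
    return [
        (author, text[start:end])
        for author, start, end in zip(truth["structure"], boundaries, boundaries[1:])
    ]

# Continuing from the loading sketch above:
segments = author_segments(text, truth)
assert len(segments) == len(truth["switches"]) + 1
# Per the format description, the distinct labels should match the author count.
assert len({author for author, _ in segments}) == truth["authors"]
```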
Published: January 17, 2019 (Version 2.0)
DOI: [10.5281/zenodo.3577602](https://doi.org/10.5281/zenodo.3577602)
Record: https://zenodo.org/record/3577602
Keywords: authorship analysis, style, detection, change
Event: PAN at Conference and Labs of the Evaluation Forum (CLEF) 2019, Switzerland

| | All versions | This version |
|---|---|---|
| Views | 167 | 143 |
| Downloads | 12 | 12 |
| Data volume | 81.8 MB | 81.8 MB |
| Unique views | 126 | 115 |
| Unique downloads | 6 | 6 |