Dataset Restricted Access

PAN19 Authorship Analysis: Style Change Detection

Zangerle, Eva; Tschuggnall, Michael; Specht, Günther; Potthast, Martin; Stein, Benno


Citation Style Language JSON Export

{
  "publisher": "Zenodo", 
  "DOI": "10.5281/zenodo.3577602", 
  "container_title": "CLEF 2019 Labs and Workshops, Notebook Papers", 
  "language": "eng", 
  "title": "PAN19 Authorship Analysis: Style Change Detection", 
  "issued": {
    "date-parts": [
      [
        2019, 
        1, 
        17
      ]
    ]
  }, 
  "abstract": "<p>Many approaches have been proposed recently to identify&nbsp;<em>the</em>&nbsp;author of a given document. In doing so, one fact is often silently assumed, namely that the given document is indeed written by only one author. For a realistic author identification system it is therefore crucial to first determine whether a document is single- or multi-authored.</p>\n\n<p>To this end, previous PAN editions aimed to analyze multi-authored documents. Because reliably identifying individual authors and their contributions within a single document has proven to be a hard problem (<a href=\"https://pan.webis.de/clef16/pan16-web/author-identification.html\">Author Diarization, 2016</a>;&nbsp;<a href=\"https://pan.webis.de/clef17/pan17-web/author-identification.html\">Style Breach Detection, 2017</a>), last year&#39;s task substantially relaxed the problem by asking only for a binary decision (single- or multi-authored). Considering the promising results achieved by the submitted approaches (see the&nbsp;<a href=\"http://ceur-ws.org/Vol-2125/invited_paper_2.pdf\">overview paper</a>&nbsp;for details), we continue last year&#39;s task and additionally ask participants to predict the number of involved authors.</p>\n\n<p>Given a document, participants should thus apply intrinsic style analyses to hierarchically answer the following questions:</p>\n\n<ol>\n\t<li>Is the document written by one or more authors, i.e., do style changes exist or not?</li>\n\t<li>If it is multi-authored, how many authors have collaborated?</li>\n</ol>\n\n<p>All documents are provided in English and may contain anywhere from zero to arbitrarily many style changes, resulting from arbitrarily many authors.</p>\n\n<p>The&nbsp;<strong>training</strong>&nbsp;set contains 50% of the whole dataset and includes solutions. Use this set to feed/train your models.</p>\n\n<p>Like last year, the whole dataset is based on user posts from various sites of the&nbsp;<a href=\"https://stackexchange.com/\">StackExchange network</a>, covering different topics and containing approximately 300 to 2000 tokens per document.</p>\n\n<p>For each problem instance X, two files are provided:</p>\n\n<ul>\n\t<li><code>problem-X.txt</code>&nbsp;contains the actual text</li>\n\t<li><code>problem-X.truth</code>&nbsp;contains the ground truth, i.e., the correct solution in&nbsp;<a href=\"http://www.json.org/\">JSON</a>&nbsp;format:</li>\n</ul>\n\n<pre><code class=\"language-json\">{ \"authors\": number_of_authors, \"structure\": [author_segment_1, ..., author_segment_m], \"switches\": [ character_pos_switch_segment_1, ..., character_pos_switch_segment_n ] }</code></pre>\n\n<p>An example of a multi-author document could look as follows:</p>\n\n<pre><code class=\"language-json\">{ \"authors\": 4, \"structure\": [\"A1\", \"A2\", \"A4\", \"A2\", \"A4\", \"A2\", \"A3\", \"A2\", \"A4\"], \"switches\": [805, 1552, 2827, 3584, 4340, 5489, 7564, 8714] }</code></pre>\n\n<p>whereas a single-author document would have exactly the following form:</p>\n\n<pre><code class=\"language-json\">{ \"authors\": 1, \"structure\": [\"A1\"], \"switches\": [] }</code></pre>\n\n<p>Note that authors within the&nbsp;<em>structure</em>&nbsp;correspond only to the respective document, i.e., they are not the same over the whole dataset. For example, author&nbsp;<em>A1</em>&nbsp;in document 1 is most likely&nbsp;<strong>not</strong>&nbsp;the same author as&nbsp;<em>A1</em>&nbsp;in document 2 (it&nbsp;<strong>could</strong>&nbsp;be, but as there are hundreds of authors the chances are very small that this is the case). Further, please note that the&nbsp;<em>structure</em>&nbsp;and the&nbsp;<em>switches</em>&nbsp;are provided only as additional resources for the development of your algorithms, i.e., they are&nbsp;<strong>not expected to be predicted</strong>.</p>\n\n<p>To tackle the problem, you can develop novel approaches, extend existing algorithms from last year&#39;s task, or adapt approaches from related problems such as&nbsp;<strong>intrinsic plagiarism detection</strong>&nbsp;or&nbsp;<strong>text segmentation</strong>. You are also free to additionally evaluate your approaches on last year&#39;s training/validation/test dataset (for the number of authors, use the corresponding metadata).</p>", 
  "author": [
    {
      "family": "Zangerle, Eva"
    }, 
    {
      "family": "Tschuggnall, Michael"
    }, 
    {
      "family": "Specht, G\u00fcnther"
    }, 
    {
      "family": "Potthast, Martin"
    }, 
    {
      "family": "Stein, Benno"
    }
  ], 
  "id": "3577602", 
  "note": "Version 2.0: added validation set", 
  "event-place": "Switzerland", 
  "version": "2.0", 
  "type": "dataset", 
  "event": "PAN at Conference and Labs of the Evaluation Forum 2019 (PAN at CLEF 2019)"
}
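
The ground-truth format described in the abstract can be read with a few lines of Python. The following is only a minimal sketch, not part of the official PAN tooling: the function names parse_truth and is_multi_authored are hypothetical, and the consistency checks (one more segment than switch positions, as many distinct labels as authors, ascending switch positions) are assumptions inferred from the two examples in the abstract rather than documented guarantees.

```python
import json


def parse_truth(text):
    """Parse the contents of a problem-X.truth file and sanity-check it.

    The checks below are assumptions derived from the examples in the
    abstract, not rules stated by the task organizers.
    """
    truth = json.loads(text)
    authors = truth["authors"]
    structure = truth["structure"]
    switches = truth["switches"]

    # Assumption: n switch positions split the text into n + 1 segments.
    if len(structure) != len(switches) + 1:
        raise ValueError("expected one more segment than switch positions")

    # Assumption: "authors" equals the number of distinct segment labels.
    if len(set(structure)) != authors:
        raise ValueError("author count does not match distinct labels")

    # Assumption: switch positions are character offsets in ascending order.
    if switches != sorted(switches):
        raise ValueError("switch positions are not in ascending order")

    return truth


def is_multi_authored(truth):
    """Answer question 1 of the task: do style changes exist?"""
    return truth["authors"] > 1


# The two example solutions quoted in the abstract.
multi = parse_truth(
    '{ "authors": 4, "structure": ["A1", "A2", "A4", "A2", "A4", "A2", '
    '"A3", "A2", "A4"], "switches": [805, 1552, 2827, 3584, 4340, 5489, '
    '7564, 8714] }'
)
single = parse_truth('{ "authors": 1, "structure": ["A1"], "switches": [] }')

print(is_multi_authored(multi), multi["authors"])
print(is_multi_authored(single), single["authors"])
```

Only the "authors" field has to be predicted; the structure and switches are read here merely because they are available in the training data.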