Published March 8, 2021 | Version 2.0
Dataset Restricted

PAN21 Authorship Analysis: Style Change Detection

  • 1. Universität Innsbruck
  • 2. University Leipzig
  • 3. Bauhaus-Universität Weimar

Description

This is the dataset for the Style Change Detection task of PAN 2021.

The goal of the style change detection task is to identify text positions within a given multi-author document at which the author switches. 

Tasks

Given a document, we ask participants to answer the following three questions:

  • Single vs. Multiple. Given a text, find out whether the text is written by a single author or by multiple authors (task 1).
  • Style Change Basic. Given a text that is written by two or more authors and contains a number of style changes, find the positions of the changes (task 2).
  • Style Change Real-World. Given a text written by two or more authors, find all positions of writing style change, i.e., assign all paragraphs of the text uniquely to some author out of the number of authors you assume for the multi-author document (task 3).

All documents are provided in English and may contain an arbitrary number of style changes, resulting from at most five different authors. However, style changes may only occur between paragraphs (i.e., a single paragraph is always authored by a single author and does not contain any style changes).
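
Because style changes can only occur at paragraph boundaries, the three answers are nested: a per-paragraph author assignment (task 3) determines the change positions (task 2), which in turn determine the single-vs-multiple answer (task 1). The following Python sketch illustrates that relationship. The function names are hypothetical and are not part of any official PAN tooling.

```python
from typing import List


def changes_from_authors(paragraph_authors: List[int]) -> List[int]:
    """Return one flag per pair of consecutive paragraphs:
    1 if the two paragraphs have different authors, 0 otherwise."""
    return [
        int(prev != curr)
        for prev, curr in zip(paragraph_authors, paragraph_authors[1:])
    ]


def is_multi_author(changes: List[int]) -> int:
    """Return 1 if at least one style change is present, 0 otherwise."""
    return int(any(changes))


# Hypothetical four-paragraph document: the author switches once,
# between the third and the fourth paragraph.
example_authors = [1, 1, 1, 2]
example_changes = changes_from_authors(example_authors)
example_multi = is_multi_author(example_changes)
```

A document with n paragraphs therefore has n - 1 change flags, and a single-author document yields only zeros.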

Data

The dataset is split into three parts:

  1. training set: Contains 70% of the whole dataset and includes ground truth data. Use this set to develop and train your models.
  2. validation set: Contains 15% of the whole dataset and includes ground truth data. Use this set to evaluate and optimize your models.
  3. test set: Contains 15% of the whole dataset. For the documents in the test set, you are not given ground truth data. This set is used for evaluation.

The dataset is based on user posts from various sites of the StackExchange network, covering different topics. We refer to each input problem (i.e., the document for which to detect style changes) by an ID, which is subsequently also used to identify the submitted solution to this input problem. We provide one folder each for the training, validation, and test data.

For each problem instance X (i.e., each input document), two files are provided:

  1. problem-X.txt contains the actual text, where paragraphs are denoted by \n\n.
  2. truth-problem-X.json contains the ground truth, i.e., the correct solution in JSON format:
    {
    "authors": NUMBER_OF_AUTHORS,
    "site": SOURCE_SITE,
    "multi-author": RESULT_TASK1,
    "changes": RESULT_ARRAY_TASK2,
    "paragraph-authors": RESULT_ARRAY_TASK3
    }
    The result for task 1 (key "multi-author") is a binary value (1 if the document is multi-authored, 0 if the document is single-authored). The result for task 2 (key "changes") is represented as an array holding a binary value for each pair of consecutive paragraphs within the document (0 if there was no style change, 1 if there was a style change). If the document is single-authored, the solution to task 2 is an array filled with 0s. For task 3 (key "paragraph-authors"), the result lists the author of each paragraph in document order (e.g., [1, 2, 1] for a two-author document), where the first author appearing in the document is referred to as "1", the second author appearing in the document is referred to as "2", etc. Furthermore, we provide the total number of authors and the StackExchange site the texts were extracted from (i.e., the topic).

    An example of a multi-author document with a style change between the third and fourth paragraphs could look as follows (we only list the relevant key/value pairs here):
    {
    "multi-author": 1,
    "changes": [0,0,1,...],
    "paragraph-authors": [1,1,1,2,...]
    }
    A single-author document would have the following form (again, only listing the relevant key/value pairs):
    {
    "multi-author": 0,
    "changes": [0,0,0,...],
    "paragraph-authors": [1,1,1,...]
    }
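
The file layout described above can be read with a few lines of Python. The sketch below loads one problem instance and checks that the ground truth is internally consistent with the description. It is a minimal illustration under stated assumptions: the helper names, the UTF-8 encoding, and the handling of empty paragraphs are not specified by the dataset and are chosen here for illustration only.

```python
import json
from pathlib import Path
from typing import Dict, List, Tuple


def split_paragraphs(text: str) -> List[str]:
    # Paragraphs are separated by "\n\n". Dropping empty pieces
    # (e.g., from a trailing separator) is an assumption.
    return [p for p in text.split("\n\n") if p.strip()]


def load_problem(folder: Path, problem_id: str) -> Tuple[List[str], Dict]:
    """Load problem-X.txt and truth-problem-X.json for one ID (UTF-8 assumed)."""
    text = (folder / f"problem-{problem_id}.txt").read_text(encoding="utf-8")
    truth_path = folder / f"truth-problem-{problem_id}.json"
    truth = json.loads(truth_path.read_text(encoding="utf-8"))
    return split_paragraphs(text), truth


def check_truth(paragraphs: List[str], truth: Dict) -> List[str]:
    """Return a list of inconsistency messages (empty if none are found).

    The checks encode expectations inferred from the description,
    not an official validation procedure.
    """
    issues = []
    changes = truth["changes"]
    authors = truth["paragraph-authors"]
    if len(authors) != len(paragraphs):
        issues.append("paragraph-authors length differs from paragraph count")
    if len(changes) != max(len(paragraphs) - 1, 0):
        issues.append("changes length differs from paragraph count minus one")
    if truth["multi-author"] != int(any(changes)):
        issues.append("multi-author flag disagrees with changes")
    if truth["authors"] != len(set(authors)):
        issues.append("authors count disagrees with paragraph-authors")
    derived = [int(a != b) for a, b in zip(authors, authors[1:])]
    if derived != changes:
        issues.append("changes disagree with paragraph-authors")
    return issues
```

For example, `load_problem(Path("train"), "1")` would read `train/problem-1.txt` and `train/truth-problem-1.json`; the folder name `train` is hypothetical, and the test set cannot be checked this way because it ships without ground truth.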

     

Notes

Version 2.0: added test set.

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.
