Published March 13, 2023 | Version 1.0
Dataset Open

PAN23 Multi-Author Writing Style Analysis

  • 1. University of Innsbruck
  • 2. Leipzig University
  • 3. Bauhaus-Universität Weimar

Description

This is the dataset for the shared task on Multi-Author Writing Style Analysis PAN@CLEF2023. Please consult the task's page for further details on the format, the dataset's creation, and links to baselines and utility code.

Task: We ask participants to solve the following intrinsic style change detection task: for a given text, find all positions of writing style change at the paragraph level (i.e., for each pair of consecutive paragraphs, assess whether there was a style change). The simultaneous change of authorship and topic is carefully controlled, and we provide participants with datasets of three difficulty levels:

  1. Easy: The paragraphs of a document cover a variety of topics, allowing approaches to make use of topic information to detect authorship changes.
  2. Medium: The topical variety in a document is small (though still present), forcing approaches to focus more on style to solve the detection task effectively.
  3. Hard: All paragraphs in a document are on the same topic.

All documents are provided in English and may contain an arbitrary number of style changes. However, style changes may only occur between paragraphs (i.e., a single paragraph is always authored by a single author and contains no style changes).
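The paragraph-level formulation above implies that a document with n paragraphs has n - 1 decision points, one per pair of consecutive paragraphs. The following is a minimal sketch of that relationship, not the official data format or baseline: the function name `changes_from_authors` and the per-paragraph author IDs are hypothetical and used only for illustration. Consult the task's page for the actual file format.

```python
def changes_from_authors(paragraph_authors):
    """Derive one binary label per pair of consecutive paragraphs.

    `paragraph_authors` is a hypothetical list with one author ID per
    paragraph. A label of 1 means the author (and thus the style)
    changes between the two paragraphs; 0 means it does not.
    """
    return [
        int(left != right)
        for left, right in zip(paragraph_authors, paragraph_authors[1:])
    ]


if __name__ == "__main__":
    # Five paragraphs yield four decision points.
    labels = changes_from_authors([1, 1, 2, 2, 1])
    print(labels)
```

Because changes can only occur between paragraphs, a single-paragraph document has no decision points and yields an empty list.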

Data: To develop and then test your algorithms, three datasets including ground truth information are provided (dataset1 for the easy task, dataset2 for the medium task, and dataset3 for the hard task).

Each dataset is split into three parts:

  1. training set: Contains 70% of the whole dataset and includes ground truth data. Use this set to develop and train your models.
  2. validation set: Contains 15% of the whole dataset and includes ground truth data. Use this set to evaluate and optimize your models.
  3. test set: Contains 15% of the whole dataset; no ground truth data is given. This set is used for evaluation.

You are free to use additional external data for training your models. However, we ask you to make the additional data utilized freely available under a suitable license.

Versioning: 

  • 1.0: initial upload

Files

Files (26.1 MB)

Name: pan23-multi-author-analysis.zip
Size: 26.1 MB
md5: 3af611691569c82891b6fc7a53ad04f2