PAN18 Author Identification: Attribution

Kestemont, Mike; Tschuggnall, Michael; Stamatatos, Efstathios; Daelemans, Walter; Specht, Günther; Stein, Benno; Potthast, Martin

doi:10.5281/zenodo.3737849

Published September 10, 2018 | Version v2

Dataset Open

1. Bauhaus-Universität Weimar
2. Universität Leipzig

We provide a corpus comprising a set of cross-domain authorship attribution problems in each of the following 5 languages: English, French, Italian, Polish, and Spanish. Note that we specifically avoid using the term 'training corpus' because the sets of candidate authors of the development and the evaluation corpora do not overlap. Therefore, your approach should not be designed to handle the candidate authors of the development corpus in particular.

Each problem consists of a set of known fanfics by each candidate author and a set of unknown fanfics, located in separate folders. The file problem-info.json, found in the main folder of each problem, gives the name of the folder of unknown documents and the list of names of the candidate author folders.
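The folder layout described above could be read along the following lines. This is a minimal sketch, not the organizers' code: the JSON key names ("unknown-folder", "candidate-authors", "author-name") and the ".txt" file extension are assumptions that should be checked against an actual problem-info.json, and load_problem is a hypothetical helper name.

```python
import json
from pathlib import Path


def load_problem(problem_dir, encoding="utf-8"):
    """Return (known, unknown) for one problem folder.

    known maps each candidate author folder name to a list of texts;
    unknown maps each unknown file name to its text.
    """
    problem_dir = Path(problem_dir)
    info = json.loads(
        (problem_dir / "problem-info.json").read_text(encoding=encoding)
    )
    # Assumed key names; verify them against the real file.
    unknown_folder = info["unknown-folder"]
    candidates = [entry["author-name"] for entry in info["candidate-authors"]]

    # Assumption: documents are stored as .txt files.
    known = {
        author: [
            path.read_text(encoding=encoding)
            for path in sorted((problem_dir / author).glob("*.txt"))
        ]
        for author in candidates
    }
    unknown = {
        path.name: path.read_text(encoding=encoding)
        for path in sorted((problem_dir / unknown_folder).glob("*.txt"))
    }
    return known, unknown
```

Keeping the known texts grouped by author folder name makes it straightforward to compare predictions with the ground truth, which refers to authors by the same names.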

The true author of each unknown document is given in the file ground-truth.json, also found in the main folder of each problem.
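As a rough illustration of how ground-truth.json might be used to check predictions, the sketch below loads it into a dictionary and computes plain accuracy. The key names ("ground_truth", "unknown-text", "true-author") are assumptions, both function names are hypothetical, and accuracy is used here only as a simple sanity check, not as the task's official evaluation measure.

```python
import json
from pathlib import Path


def load_ground_truth(problem_dir):
    """Map each unknown document name to the name of its true author."""
    path = Path(problem_dir) / "ground-truth.json"
    data = json.loads(path.read_text(encoding="utf-8"))
    # Assumed structure: {"ground_truth": [{"unknown-text": ..., "true-author": ...}]}
    return {
        entry["unknown-text"]: entry["true-author"]
        for entry in data["ground_truth"]
    }


def accuracy(predictions, truth):
    """Fraction of unknown documents whose predicted author is correct."""
    if not truth:
        return 0.0
    correct = sum(
        1 for doc, author in truth.items() if predictions.get(doc) == author
    )
    return correct / len(truth)
```

Documents missing from the predictions count as errors, so an incomplete run is not scored more favorably than a complete one.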

In addition, to handle a collection of such problems, the file collection-info.json includes all relevant information. In more detail, for each problem it lists its main folder, the language (either "en", "fr", "it", "pl", or "sp") and the encoding (always UTF-8) of its documents.
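A whole collection could then be traversed by reading collection-info.json first. The sketch below assumes the file is a JSON list whose entries carry the keys "problem-name", "language", and "encoding"; these names are assumptions, and list_problems is a hypothetical helper.

```python
import json
from pathlib import Path


def list_problems(collection_dir):
    """Return (problem_folder, language, encoding) for every listed problem."""
    collection_dir = Path(collection_dir)
    entries = json.loads(
        (collection_dir / "collection-info.json").read_text(encoding="utf-8")
    )
    # Assumed key names; verify them against the real file.
    return [
        (collection_dir / entry["problem-name"], entry["language"], entry["encoding"])
        for entry in entries
    ]
```

Each returned folder can be processed independently, which matches the point that candidate author sets are specific to each problem.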

More information: Link

Notes

new version: removed passwords inside packages

Files

Files (10.6 MB)

Name	Size
pan18-cross-domain-authorship-attribution-dataset.zip (md5:5619b5365e04d0275642fce22ffbcb3d)	10.6 MB
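The MD5 checksum listed above can be used to verify a downloaded copy of the archive. A minimal sketch using Python's standard hashlib module follows; md5_of_file is a hypothetical helper name, and the local file path in the usage note is an assumption.

```python
import hashlib

# Checksum as listed for the archive on this record.
EXPECTED_MD5 = "5619b5365e04d0275642fce22ffbcb3d"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, with the archive saved in the current directory, md5_of_file("pan18-cross-domain-authorship-attribution-dataset.zip") == EXPECTED_MD5 should evaluate to True for an intact download.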

Additional details

Mike Kestemont, Michael Tschuggnall, Efstathios Stamatatos, Walter Daelemans, Günther Specht, Benno Stein, and Martin Potthast. Overview of the Author Identification Task at PAN-2018: Cross-domain Authorship Attribution and Style Change Detection. In Linda Cappellato, Nicola Ferro, Jian-Yun Nie, and Laure Soulier, editors, Working Notes Papers of the CLEF 2018 Evaluation Labs, volume 2125 of CEUR Workshop Proceedings, September 2018. CEUR-WS.org. ISSN 1613-0073.

