Blog-1K

Haining Wang

doi:10.5281/zenodo.7455623

Published December 18, 2022 | Version 1.1

Dataset Open

Haining Wang¹

1. Indiana University Bloomington

The Blog-1K corpus is a redistributable authorship identification testbed for contemporary English prose. It contains 1,000 candidate authors, 16K+ posts, and a predefined data split (train/dev/test in a ratio of approximately 8:1:1). It is a subset of the Blog Authorship Corpus from Kaggle. The MD5 checksum of the Blog-1K file (blog1000.csv.gz) is '0a9e38740af9f921b6316b7f400acf06'.
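The checksum above can be used to confirm that a download is intact. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `is_intact` are illustrative and not part of the dataset.

```python
import hashlib

# Checksum quoted in the dataset description.
EXPECTED_MD5 = "0a9e38740af9f921b6316b7f400acf06"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected=EXPECTED_MD5):
    """True when the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

Assuming the archive was saved in the working directory, `is_intact("blog1000.csv.gz")` should return `True` for an uncorrupted copy.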

1. Preprocessing

We first discard texts shorter than 1,000 characters. We then select 1,000 authors whose remaining writings meet the following criteria:
- at least 10,000 characters in total,
- at most 49,410 characters in total,
- at least 16 posts in total,
- at most 40 posts in total, and
- each text contains at least 50 function words from the Koppel512 list (to filter out non-English prose).
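The criteria above can be sketched as a filter over a mapping from author id to that author's posts. This is an illustration under stated assumptions, not the maintainer's actual code: function words are counted as token occurrences (not distinct types), tokenization is a simple regular expression, and `function_words` stands in for the Koppel512 list, which is not reproduced here. How the final 1,000 authors are drawn from the eligible pool is not specified in the description, so the sketch stops at eligibility.

```python
import re

MIN_POST_CHARS = 1_000
MIN_AUTHOR_CHARS, MAX_AUTHOR_CHARS = 10_000, 49_410
MIN_AUTHOR_POSTS, MAX_AUTHOR_POSTS = 16, 40
MIN_FUNCTION_WORDS = 50


def count_function_words(text, function_words):
    """Count tokens of `text` found in `function_words` (a set of lowercase words)."""
    # Assumption: a token is a run of ASCII letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(token in function_words for token in tokens)


def eligible_authors(posts_by_author, function_words):
    """Map each qualifying author id to the posts retained for that author."""
    eligible = {}
    for author, posts in posts_by_author.items():
        # Step 1: drop posts shorter than 1,000 characters.
        kept = [post for post in posts if len(post) >= MIN_POST_CHARS]
        total_chars = sum(len(post) for post in kept)
        # Step 2: per-author bounds on post count and total characters.
        if not MIN_AUTHOR_POSTS <= len(kept) <= MAX_AUTHOR_POSTS:
            continue
        if not MIN_AUTHOR_CHARS <= total_chars <= MAX_AUTHOR_CHARS:
            continue
        # Step 3: every retained post must look like English prose.
        if all(count_function_words(post, function_words) >= MIN_FUNCTION_WORDS
               for post in kept):
            eligible[author] = kept
    return eligible
```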

Blog-1K has three columns: 'id', 'text', and 'split', where 'id' is the identifier carried over from the parent corpus.

2. Statistics

The creation procedure and the statistics of Blog-1K are documented in the Jupyter Notebook.

Split	# Authors	# Posts	# Characters	Avg. Characters Per Author (Std.)	Avg. Characters Per Post (Std.)
Train	1,000	16,132	30,092,057	30,092 (5,884)	1,865 (1,007)
Validation	935	2,017	3,755,362	4,016 (2,269)	1,862 (999)
Test	924	2,017	3,732,448	4,039 (2,188)	1,850 (936)
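Figures of this kind can be recomputed from the three columns of the file. The sketch below is a hedged illustration rather than the notebook's code: it assumes character counts are `len(text)` and that the reported spread is the sample standard deviation, neither of which is confirmed by the description. The function name `split_statistics` is hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def split_statistics(rows):
    """Summarise rows (dicts with 'id', 'text', 'split') for each split."""
    post_lengths = defaultdict(list)                      # split -> chars per post
    author_chars = defaultdict(lambda: defaultdict(int))  # split -> id -> total chars
    for row in rows:
        n_chars = len(row["text"])
        post_lengths[row["split"]].append(n_chars)
        author_chars[row["split"]][row["id"]] += n_chars

    def spread(values):
        # Sample standard deviation; reported as 0.0 for fewer than two values.
        return stdev(values) if len(values) > 1 else 0.0

    summary = {}
    for split, lengths in post_lengths.items():
        per_author = list(author_chars[split].values())
        summary[split] = {
            "authors": len(per_author),
            "posts": len(lengths),
            "characters": sum(lengths),
            "chars_per_author": (mean(per_author), spread(per_author)),
            "chars_per_post": (mean(lengths), spread(lengths)),
        }
    return summary
```

With the pandas DataFrame from the usage section, the rows can be obtained with `df.to_dict('records')`.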

3. Usage

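import pandas as pd

# load the gzip-compressed CSV; compression is inferred from the '.gz' extension
df = pd.read_csv('blog1000.csv.gz', compression='infer')

# read in training data: post texts and their 'id' values as labels
train = df.loc[df['split'] == 'train', ['text', 'id']]
train_text, train_label = train['text'].tolist(), train['id'].tolist()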
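
The same pattern extends to the remaining splits. The sketch below groups rows by whatever values occur in the 'split' column, since only the value 'train' is confirmed by the snippet above; the helper name `texts_and_labels_by_split` is hypothetical.

```python
from collections import defaultdict


def texts_and_labels_by_split(rows):
    """Group rows (dicts with 'id', 'text', 'split') into per-split (texts, labels)."""
    grouped = defaultdict(lambda: ([], []))
    for row in rows:
        texts, labels = grouped[row["split"]]
        texts.append(row["text"])
        labels.append(row["id"])
    return dict(grouped)
```

For example, `splits = texts_and_labels_by_split(df.to_dict('records'))` followed by `train_text, train_label = splits['train']`; the other split names can be inspected with `splits.keys()`.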

4. License
All materials are licensed under the ISC License.

5. Contact
Please contact the maintainer with any questions.

Files (15.6 MB)

Name	Size
blog1000.csv.gz (md5:0a9e38740af9f921b6316b7f400acf06)	15.6 MB
