Blog-1K
Description
The Blog-1K corpus is a redistributable authorship identification testbed for contemporary English prose. It has 1,000 candidate authors, 16K+ training posts (20K+ posts across all splits), and a predefined data split (train/dev/test in a ratio of roughly 8:1:1). It is a subset of the Blog Authorship Corpus from Kaggle. The MD5 checksum of Blog-1K is '0a9e38740af9f921b6316b7f400acf06'.
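A download can be checked against the MD5 given above. The following is a minimal sketch: the helper name `md5_of_file` is hypothetical, and it assumes the checksum refers to the downloaded archive (the file name `blog1000.csv.gz` is taken from the Usage section).

```python
import hashlib

# Checksum quoted in the description above.
EXPECTED_MD5 = "0a9e38740af9f921b6316b7f400acf06"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example (assumes the archive sits in the working directory):
# assert md5_of_file("blog1000.csv.gz") == EXPECTED_MD5
```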
1. Preprocessing
We first filter out texts shorter than 1,000 characters. We then select 1,000 authors whose remaining writings meet the following criteria:
- at least 10,000 characters in total,
- at most 49,410 characters in total,
- at least 16 posts,
- at most 40 posts, and
- each text contains at least 50 function words from the Koppel512 list (to filter out non-English prose).
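The selection criteria above can be sketched with pandas as follows. This is an illustration under stated assumptions, not the notebook's actual code: the function name `eligible_authors` is hypothetical, function words are counted as lowercased whitespace tokens with repetition, an author is dropped if any remaining text falls below the function-word threshold, and the Koppel512 list must be supplied by the caller. How the final 1,000 authors are drawn from the eligible pool is not described here, so the sketch returns every eligible author.

```python
import pandas as pd


def eligible_authors(posts, function_words):
    """Return ids of authors that satisfy the criteria listed above.

    posts: DataFrame with columns 'id' (author) and 'text'.
    function_words: set of lowercase function words (e.g. the Koppel512 list).
    """
    # Step 1: drop texts shorter than 1,000 characters.
    posts = posts[posts["text"].str.len() >= 1000].copy()
    posts["n_chars"] = posts["text"].str.len()

    # Assumption: whitespace tokenisation, lowercased, counted with repetition.
    posts["n_fw"] = posts["text"].map(
        lambda text: sum(tok in function_words for tok in text.lower().split())
    )

    # Step 2: aggregate per author and apply the thresholds (all inclusive).
    per_author = posts.groupby("id").agg(
        total_chars=("n_chars", "sum"),
        n_posts=("text", "count"),
        min_fw=("n_fw", "min"),
    )
    keep = (
        per_author["total_chars"].between(10_000, 49_410)
        & per_author["n_posts"].between(16, 40)
        & (per_author["min_fw"] >= 50)
    )
    return per_author.index[keep].tolist()
```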
Blog-1K has three columns: 'id', 'text', and 'split', where 'id' is the author identifier carried over from the parent corpus.
2. Statistics
The creation of the corpus and the statistics below are documented in the Jupyter Notebook.
Split | # Authors | # Posts | # Characters | Avg. Characters per Author (Std.) | Avg. Characters per Post (Std.) |
---|---|---|---|---|---|
Train | 1,000 | 16,132 | 30,092,057 | 30,092 (5,884) | 1,865 (1,007) |
Validation | 935 | 2,017 | 3,755,362 | 4,016 (2,269) | 1,862 (999) |
Test | 924 | 2,017 | 3,732,448 | 4,039 (2,188) | 1,850 (936) |
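Per-split figures of this kind can be recomputed from the three columns. The sketch below is one way to do it, with a hypothetical function name `split_statistics`. It assumes that characters are counted with `str.len()`. Note that pandas' `.std()` returns the sample standard deviation (ddof=1), which may or may not match the convention used for the table.

```python
import pandas as pd


def split_statistics(df):
    """Summarise authors, posts, and character counts for each split."""
    df = df.assign(n_chars=df["text"].str.len())
    rows = []
    for name, part in df.groupby("split", sort=False):
        # Total characters written by each author within this split.
        per_author = part.groupby("id")["n_chars"].sum()
        rows.append({
            "split": name,
            "authors": part["id"].nunique(),
            "posts": len(part),
            "characters": int(part["n_chars"].sum()),
            "avg_chars_per_author": per_author.mean(),
            "std_chars_per_author": per_author.std(),
            "avg_chars_per_post": part["n_chars"].mean(),
            "std_chars_per_post": part["n_chars"].std(),
        })
    return pd.DataFrame(rows).set_index("split")
```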
3. Usage
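import pandas as pd
# load the gzip-compressed CSV (compression is inferred from the .gz extension)
df = pd.read_csv('blog1000.csv.gz', compression='infer')
# read in training data: texts and their author labels ('id') as two parallel tuples
train_text, train_label = zip(*df.loc[df['split'] == 'train', ['text', 'id']].itertuples(index=False))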
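The same pattern extends to the other splits. The helper below is a small sketch with a hypothetical name, `texts_and_labels`. Only the split value 'train' appears in this description, so the labels used for the remaining splits are an assumption to check with `df['split'].unique()` before use.

```python
import pandas as pd


def texts_and_labels(df, split_name):
    """Return (texts, author ids) for one value of the 'split' column."""
    part = df.loc[df["split"] == split_name, ["text", "id"]]
    return part["text"].tolist(), part["id"].tolist()


# Example (after loading df as shown above):
# print(df["split"].unique())  # inspect the actual split labels first
# train_text, train_label = texts_and_labels(df, "train")
```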
4. License
All materials are licensed under the ISC License.
5. Contact
Please contact the maintainer with any questions.
Files
One file (15.6 MB), md5:0a9e38740af9f921b6316b7f400acf06