Two-Annotator Sentence-Level Annotations of Human-Written Short Stories

León, Carlos; Gervás, Pablo; de la Torre Moreno, Pablo; Tapscott, Alan

doi:10.5281/zenodo.18791170

Published December 1, 2025 | Version v1

Dataset Open

1. Universidad Complutense de Madrid
2. Universidad de Cádiz
3. Pompeu Fabra University

Zenodo DOI (this dataset): https://doi.org/10.5281/zenodo.18791170
Release date: 2025-12-01
Related publication: Quantitative Characteristics of Human-Written Short Stories as a Metric for Automated Storytelling (León et al., 2020), https://doi.org/10.1007/s00354-020-00111-1

This Zenodo record contains two independent annotation files (Annotator 1 and Annotator 2) for a collection of human-written short stories used in the study above.
Each row corresponds to one sentence from a participant-written story, plus a set of narrative-structure annotations (plot actions/descriptions, agency, and causal dependencies).

The primary goal of releasing these files is to support:
- Reproducibility of the analyses reported in the associated article.
- Follow-up work on metrics for automated storytelling and comparisons between human-written and machine-generated narratives.
- Research on inter-annotator agreement for story-structure annotation schemes.

Important: The dataset contains human-written narrative text. While it does not include direct identifiers (names, emails), researchers should treat the content as potentially sensitive and follow good data stewardship practices.

High-level dataset summary

Collection / scenario

In the underlying experiment, human subjects wrote short plots in a constrained scenario; the plots were later annotated to quantify narrative components and relations (see the related paper for full experimental protocol and analysis).

Units of annotation

One row = one sentence (a single sentence of a story, not spanning multiple sentences).
Sentences can be grouped into stories / subjects using the pair:
- DATE + CLASS SEAT
  (i.e., the date of data gathering and the participant’s seat identifier in the classroom).
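
As a minimal sketch of this grouping, the snippet below collects sentences per story using the (DATE, CLASS SEAT) pair as the key. It assumes rows are available as dictionaries keyed by column name (for example from `csv.DictReader`) and that row order within a file follows story order. The helper name `group_by_story` and the toy rows are illustrative assumptions, not part of the dataset.

```python
from collections import defaultdict

def group_by_story(rows):
    """Group sentence rows into stories keyed by (DATE, CLASS SEAT)."""
    stories = defaultdict(list)
    for row in rows:
        # Coerce the seat to a string so 1 and "1" land in the same story.
        key = (row["DATE"], str(row["CLASS SEAT"]))
        stories[key].append(row["SENTENCE"])
    return dict(stories)

# Toy rows (invented for illustration; not taken from the dataset).
rows = [
    {"DATE": "23/5/18", "CLASS SEAT": 1, "SENTENCE": "A."},
    {"DATE": "23/5/18", "CLASS SEAT": 1, "SENTENCE": "B."},
    {"DATE": "24/5/18", "CLASS SEAT": 1, "SENTENCE": "C."},
]
stories = group_by_story(rows)
story_lengths = {key: len(sentences) for key, sentences in stories.items()}
```

Using both fields in the key matters because the same seat number can appear on different collection dates.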

Coverage and size

Annotator	Rows (sentences)	Columns	Unique subjects (DATE + CLASS SEAT)	Notes
Annotator 1	742	11	25	`TYPE` is missing for 4 rows
Annotator 2	770	11	26	`TYPE` is missing for 1 row

Dates present (both annotators): 23/5/18, 24/5/18, 25/5/18
Exercise IDs present (both annotators): 1-1 (in these files, only one exercise identifier appears)

Per-subject story length (sentences)

Annotator	min	median	mean	max
Annotator 1	13	25.0	29.68	61
Annotator 2	13	25.5	29.62	61

Notable difference between annotators

Annotator 2 includes one additional subject not present in Annotator 1:
- DATE = 25/5/18, CLASS SEAT = 20 (28 sentences)

If you compute inter-annotator agreement, consider aligning the datasets by (DATE, CLASS SEAT, SENTENCE) or by position within each story.
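
Before aligning, it can help to check which subjects appear in only one file. The sketch below compares the sets of (DATE, CLASS SEAT) pairs. The function names are hypothetical, the toy rows are invented (seat 19 is a placeholder), and the seat is coerced to a string on the assumption that both files encode it as an integer-like value.

```python
def subject_keys(rows):
    """Set of (DATE, CLASS SEAT) pairs, with the seat coerced to a string."""
    return {(row["DATE"], str(row["CLASS SEAT"])) for row in rows}

def subject_diff(rows_a1, rows_a2):
    """Return (subjects only in annotator 1, subjects only in annotator 2)."""
    keys_a1, keys_a2 = subject_keys(rows_a1), subject_keys(rows_a2)
    return sorted(keys_a1 - keys_a2), sorted(keys_a2 - keys_a1)

# Toy rows (invented for illustration).
rows_a1 = [{"DATE": "25/5/18", "CLASS SEAT": 19}]
rows_a2 = [
    {"DATE": "25/5/18", "CLASS SEAT": 19},
    {"DATE": "25/5/18", "CLASS SEAT": 20},
]
only_a1, only_a2 = subject_diff(rows_a1, rows_a2)
```

Subjects reported in either list cannot contribute to agreement figures and would normally be excluded before alignment.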

Data schema (columns)

All files share the same columns:

Text and grouping fields

SENTENCE
The sentence text (one sentence per row). The story texts are primarily in Spanish.
DATE
Date when the data were gathered (string, e.g., 23/5/18). In these files, values are limited to 23/5/18, 24/5/18, 25/5/18.
CLASS SEAT
Participant seat identifier used during collection (integer-like). Together with DATE, it can be used to approximate a unique subject ID.
EXERCISE
Exercise identifier (string). In these files the value is 1-1 for all rows.

Annotation fields

TYPE
Narrative function of the sentence according to the annotator:
- Plot Description
- Plot Action
  Some rows have missing values (NaN) and should be handled explicitly during analysis.
PROTAGONIST IS AGENT
Whether the protagonist behaves with agency in the sentence.
In the JSON files this is stored as "TRUE" / "FALSE" (strings). In CSV it is also "TRUE"/"FALSE" but many parsers will auto-convert to booleans.
PROTAGONIST IS OBJECT
Whether the protagonist is the object of an action in the sentence (again, "TRUE" / "FALSE").
PREVIOUS ACTION
The previous action in the story according to the annotator.
Many rows do not have a value here (missing/blank). When present, it is typically a string containing a previous sentence (often the relevant prior action).
DEPENDENCY 1, DEPENDENCY 2, DEPENDENCY 3
Zero to three causal dependencies: the earlier events that, according to the annotator, trigger the current sentence.
Each dependency, when present, is represented as a string (typically the textual form of a previous sentence).
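
The field conventions above can be applied in one normalization step. The following is a sketch under stated assumptions: rows are dictionaries keyed by the column names listed here, and missing values may arrive as `None`, blank strings, or float NaN depending on the parser. The helper names (`is_missing`, `parse_flag`, `clean_row`) and the output keys are hypothetical.

```python
import math

def is_missing(value):
    """True for None, blank strings, and float NaN (as produced by some parsers)."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""

def parse_flag(value):
    """Map "TRUE"/"FALSE" strings or booleans to bool; missing values stay None."""
    if is_missing(value):
        return None
    if isinstance(value, bool):
        return value
    return str(value).strip().upper() == "TRUE"

def clean_row(row):
    """Return a normalized view of one annotated sentence."""
    deps = [row.get(f"DEPENDENCY {i}") for i in (1, 2, 3)]
    raw_type = row.get("TYPE")
    return {
        "sentence": row["SENTENCE"],
        "type": None if is_missing(raw_type) else str(raw_type).strip(),
        "agent": parse_flag(row.get("PROTAGONIST IS AGENT")),
        "object": parse_flag(row.get("PROTAGONIST IS OBJECT")),
        "dependencies": [d for d in deps if not is_missing(d)],
    }

# Toy row (invented for illustration).
row = {
    "SENTENCE": "S2", "TYPE": float("nan"),
    "PROTAGONIST IS AGENT": "TRUE", "PROTAGONIST IS OBJECT": False,
    "DEPENDENCY 1": "S1", "DEPENDENCY 2": "", "DEPENDENCY 3": None,
}
clean = clean_row(row)
```

Keeping a missing `TYPE` as `None` rather than coercing it to a label makes the handful of untyped rows easy to filter out explicitly.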

Practical notes for analysis

1) Inter-annotator agreement (IAA)

If you want to compute IAA for TYPE, PROTAGONIST IS AGENT, or PROTAGONIST IS OBJECT, you will need an alignment strategy.

Common approaches:
- Exact-text alignment: merge on (DATE, CLASS SEAT, SENTENCE) (works if sentence strings match exactly).
- Positional alignment: within each (DATE, CLASS SEAT) group, align by row order (only safe if both annotators segmented and ordered sentences identically).
- Hybrid alignment: normalize whitespace/punctuation in SENTENCE and then merge.
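
A hybrid alignment might look like the sketch below. The normalization rules (Unicode NFKC, case folding, punctuation removal, whitespace collapsing) are one possible choice rather than the procedure used in the related paper, and the function names are hypothetical. If a story repeats the same sentence, only the first occurrence in the second file is matched.

```python
import re
import unicodedata

def normalize_sentence(text):
    """Case-fold, drop punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    text = re.sub(r"[^\w\s]", "", text)       # remove punctuation, keep accented letters
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

def align(rows_a1, rows_a2):
    """Pair rows whose (DATE, CLASS SEAT, normalized SENTENCE) keys match."""
    def key(row):
        return (row["DATE"], str(row["CLASS SEAT"]), normalize_sentence(row["SENTENCE"]))
    index = {}
    for row in rows_a2:
        index.setdefault(key(row), row)  # keep the first occurrence of each key
    return [(row, index[key(row)]) for row in rows_a1 if key(row) in index]

# Toy rows (invented for illustration).
rows_a1 = [{"DATE": "23/5/18", "CLASS SEAT": 1,
            "SENTENCE": "El rey  llegó.", "TYPE": "Plot Action"}]
rows_a2 = [
    {"DATE": "23/5/18", "CLASS SEAT": "1",
     "SENTENCE": "el rey llegó", "TYPE": "Plot Description"},
    {"DATE": "23/5/18", "CLASS SEAT": "1",
     "SENTENCE": "Otra frase.", "TYPE": "Plot Action"},
]
pairs = align(rows_a1, rows_a2)
```

Comparing `len(pairs)` with the row counts of each file gives a quick estimate of how many sentences failed to align.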

After aligning, you can compute:
- Cohen’s kappa for categorical labels (TYPE, booleans)
- Percentage agreement
- Confusion matrices
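
For reference, these three quantities can be computed from two aligned label lists with the standard library alone. This is a minimal sketch: the function names and toy labels are illustrative, and kappa is returned as NaN when chance agreement equals 1 (both annotators always use the same single label), where it is undefined.

```python
from collections import Counter

def percent_agreement(labels_a, labels_b):
    """Fraction of aligned items on which both annotators agree."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(labels_a)
    p_o = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both annotators labelled independently.
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_e == 1:
        return float("nan")
    return (p_o - p_e) / (1 - p_e)

def confusion_matrix(labels_a, labels_b):
    """Counts of (annotator 1 label, annotator 2 label) pairs."""
    return Counter(zip(labels_a, labels_b))

# Toy labels (invented for illustration).
a = ["Plot Action", "Plot Action", "Plot Description", "Plot Description"]
b = ["Plot Action", "Plot Description", "Plot Description", "Plot Description"]
agreement = percent_agreement(a, b)
kappa = cohen_kappa(a, b)
matrix = confusion_matrix(a, b)
```

In the toy example the observed agreement is 0.75 and the chance agreement is 0.5, so kappa is 0.5.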

2) Story-structure graphs

You can view each story as a directed graph:
- Nodes: sentences
- Edges: DEPENDENCY links, and optionally PREVIOUS ACTION links

This supports analyses of:
- causal chain length
- branching factor
- action/description alternation
- agency patterns
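
One way to build such a graph is sketched below. It assumes that each dependency string exactly matches the SENTENCE text of an earlier row in the same story, which should be checked against the actual files; unmatched or forward-pointing dependencies are skipped, which also guarantees the graph is acyclic. The function names and toy story are hypothetical.

```python
DEPENDENCY_COLUMNS = ("DEPENDENCY 1", "DEPENDENCY 2", "DEPENDENCY 3")

def build_graph(story_rows):
    """Map each sentence index to the indices of earlier sentences it depends on."""
    index_of = {}
    for i, row in enumerate(story_rows):
        index_of.setdefault(row["SENTENCE"], i)  # first occurrence wins
    graph = {}
    for i, row in enumerate(story_rows):
        deps = []
        for column in DEPENDENCY_COLUMNS:
            text = row.get(column)
            j = index_of.get(text) if text else None
            if j is not None and j < i:  # keep only links to earlier sentences
                deps.append(j)
        graph[i] = deps
    return graph

def longest_chain(graph):
    """Length, in edges, of the longest causal chain in the story."""
    depth = {}
    for i in sorted(graph):  # dependencies always have smaller indices
        depth[i] = max((1 + depth[j] for j in graph[i]), default=0)
    return max(depth.values(), default=0)

# Toy story (invented for illustration).
story = [
    {"SENTENCE": "s0"},
    {"SENTENCE": "s1", "DEPENDENCY 1": "s0"},
    {"SENTENCE": "s2", "DEPENDENCY 1": "s1", "DEPENDENCY 2": "s0"},
    {"SENTENCE": "s3"},
]
graph = build_graph(story)
chain = longest_chain(graph)
```

Branching factor can be derived from the same structure by counting, for each index, how many other sentences list it as a dependency.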

3) Aggregate metrics

The paper uses quantitative characteristics of stories; this dataset enables replication and extension of metrics such as:
- ratio of Plot Action vs Plot Description
- prevalence of protagonist agency/objecthood
- dependency density (how often dependencies are recorded)
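
A sketch of how these metrics could be computed for one set of rows follows. The definitions used here (share of actions among typed rows, share of rows flagged "TRUE", share of rows with a non-empty DEPENDENCY 1) are assumptions for illustration and may differ from the exact definitions in the related paper. Rows are assumed to be dictionaries of strings, with missing values as empty strings or `None`, as `csv.DictReader` would produce.

```python
def story_metrics(rows):
    """Compute simple aggregate metrics over a list of annotated sentence rows."""
    typed = [r for r in rows if r.get("TYPE") in ("Plot Action", "Plot Description")]
    actions = sum(r["TYPE"] == "Plot Action" for r in typed)
    agents = sum((r.get("PROTAGONIST IS AGENT") or "").strip().upper() == "TRUE"
                 for r in rows)
    with_dependency = sum(bool((r.get("DEPENDENCY 1") or "").strip()) for r in rows)
    total = len(rows)
    return {
        # Rows with a missing TYPE are excluded from the action ratio.
        "action_ratio": actions / len(typed) if typed else None,
        "agent_rate": agents / total if total else None,
        "dependency_density": with_dependency / total if total else None,
    }

# Toy rows (invented for illustration).
rows = [
    {"TYPE": "Plot Action", "PROTAGONIST IS AGENT": "TRUE", "DEPENDENCY 1": ""},
    {"TYPE": "Plot Action", "PROTAGONIST IS AGENT": "FALSE", "DEPENDENCY 1": "x"},
    {"TYPE": "Plot Description", "PROTAGONIST IS AGENT": "TRUE", "DEPENDENCY 1": "x"},
    {"TYPE": "", "PROTAGONIST IS AGENT": "FALSE", "DEPENDENCY 1": ""},
]
metrics = story_metrics(rows)
```

Applying the function per (DATE, CLASS SEAT) group rather than to a whole file yields per-story values that can be compared across annotators.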

Missingness and sparsity

The annotation is intentionally sparse in some columns.

Annotator 1: missing values (count of rows)

TYPE: 4
PREVIOUS ACTION: 714
DEPENDENCY 1: 106
DEPENDENCY 2: 517
DEPENDENCY 3: 725

Annotator 2: missing values (count of rows)

TYPE: 1
PREVIOUS ACTION: 738
DEPENDENCY 1: 122
DEPENDENCY 2: 675
DEPENDENCY 3: 769
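
Counts like these can be recomputed with a short script. The sketch below uses the standard `csv` module on an in-memory toy file; it assumes that missing values appear as empty cells in the CSV files, which should be verified against the real data. The function name and toy content are illustrative.

```python
import csv
import io

SPARSE_COLUMNS = ["TYPE", "PREVIOUS ACTION",
                  "DEPENDENCY 1", "DEPENDENCY 2", "DEPENDENCY 3"]

def count_missing(rows, columns=SPARSE_COLUMNS):
    """Count rows whose value is absent or blank, per column."""
    counts = dict.fromkeys(columns, 0)
    for row in rows:
        for column in columns:
            if not (row.get(column) or "").strip():
                counts[column] += 1
    return counts

# Toy CSV content (invented for illustration).
toy_csv = io.StringIO(
    "SENTENCE,TYPE,PREVIOUS ACTION,DEPENDENCY 1,DEPENDENCY 2,DEPENDENCY 3\n"
    "S1,Plot Action,,,,\n"
    "S2,,S1,S1,,\n"
)
counts = count_missing(csv.DictReader(toy_csv))
```

For the real files, the toy buffer would be replaced by a file handle such as `open("annotator1.csv", newline="", encoding="utf-8")`; the UTF-8 encoding is an assumption.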

Reproducible loading examples

Python (pandas)

import pandas as pd

a1 = pd.read_csv("annotator1.csv")
a2 = pd.read_csv("annotator2.csv")

# Basic checks
print(a1.shape, a2.shape)
print(a1.columns)

# Optional: normalize booleans if your parser did not do it.
# Note: any value that is neither a bool nor the string "TRUE"
# (including NaN) maps to False.
def to_bool(x):
    if isinstance(x, bool):
        return x
    if isinstance(x, str):
        return x.strip().upper() == "TRUE"
    return False

for df in (a1, a2):
    df["PROTAGONIST_IS_AGENT"]  = df["PROTAGONIST IS AGENT"].map(to_bool)
    df["PROTAGONIST_IS_OBJECT"] = df["PROTAGONIST IS OBJECT"].map(to_bool)

# Example: exact-text alignment for IAA
common = a1.merge(
    a2,
    on=["DATE", "CLASS SEAT", "EXERCISE", "SENTENCE"],
    suffixes=("_a1", "_a2"),
)

print("Aligned rows:", len(common))

R (readr)

library(readr)
a1 <- read_csv("annotator1.csv", show_col_types = FALSE)
a2 <- read_csv("annotator2.csv", show_col_types = FALSE)

dim(a1); dim(a2)
names(a1)

Provenance

A prior project README (included with the files provided for this Zenodo deposit) describes this dataset as experimental data for the related article, including:
- that each row corresponds to a sentence,
- the collection dates (May 23–25, 2018),
- and the role of DATE + CLASS SEAT as an approximate subject identifier.

License

Please select a license in Zenodo that matches:
1) the participant consent / institutional requirements for the original texts, and
2) your intended reuse policy.

For open research data, CC BY 4.0 is a common choice; however, if the original stories are subject to additional constraints, consider a more restrictive license.

How to cite

Dataset (this Zenodo record)

Recommended citation
León, C., Gervás, P., Delatorre, P., & Tapscott, A. (2025). Two-Annotator Sentence-Level Annotations of Human-Written Short Stories (Version 1.0.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.18791170

BibTeX

@dataset{leon2025_story_annotations,
  author       = {León, Carlos and Gervás, Pablo and Delatorre, Pablo and Tapscott, Alan},
  title        = {Two-Annotator Sentence-Level Annotations of Human-Written Short Stories},
  year         = {2025},
  month        = dec,
  publisher    = {Zenodo},
  version      = {1.0.0},
  doi          = {10.5281/zenodo.18791170},
  url          = {https://doi.org/10.5281/zenodo.18791170}
}

Related paper

León, C., Gervás, P., Delatorre, P., & Tapscott, A. (2020). Quantitative Characteristics of Human-Written Short Stories as a Metric for Automated Storytelling. New Generation Computing, 38, 635–671. https://doi.org/10.1007/s00354-020-00111-1

BibTeX

@article{leon2020_quantitative_characteristics,
  author  = {León, Carlos and Gervás, Pablo and Delatorre, Pablo and Tapscott, Alan},
  title   = {Quantitative Characteristics of Human-Written Short Stories as a Metric for Automated Storytelling},
  journal = {New Generation Computing},
  year    = {2020},
  volume  = {38},
  pages   = {635--671},
  doi     = {10.1007/s00354-020-00111-1},
  url     = {https://doi.org/10.1007/s00354-020-00111-1}
}

Contact

Corresponding / contact author:
Carlos León — cleon@ucm.es

Changelog

2025-12-01 (v1.0.0): Zenodo release (this record).

Files

Files (1.1 MB)

Name	md5	Size
annotator1.csv	20efed46101c7a40849bac047e6190cc	183.3 kB
annotator1.json	6e7144c7f635aaf6ad63b28d82846137	346.9 kB
annotator2.csv	90ed4414b61d72aeb4a770aace86aa6f	175.1 kB
annotator2.json	2612105a8c1322809ea22bc0084a6115	344.8 kB
README.md	09987f7a3e0d2254df9e57fbb85dae01	10.0 kB
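
After downloading, the md5 checksums listed for each file can be verified locally. The following sketch uses Python's `hashlib`; the checksum values are copied from the file list in this record, while the function names and the assumption that the files sit in the current directory are illustrative.

```python
import hashlib

# Checksums as listed for this record.
EXPECTED_MD5 = {
    "annotator1.csv": "20efed46101c7a40849bac047e6190cc",
    "annotator1.json": "6e7144c7f635aaf6ad63b28d82846137",
    "annotator2.csv": "90ed4414b61d72aeb4a770aace86aa6f",
    "annotator2.json": "2612105a8c1322809ea22bc0084a6115",
    "README.md": "09987f7a3e0d2254df9e57fbb85dae01",
}

def file_md5(path, chunk_size=1 << 20):
    """Return the hexadecimal md5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_download(path, name):
    """True if the file at `path` matches the checksum listed for `name`."""
    return file_md5(path) == EXPECTED_MD5[name]

# Example usage once the files are downloaded:
# verify_download("annotator1.csv", "annotator1.csv")
```

A mismatch usually indicates an incomplete download or a file that was modified after retrieval.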

Additional details

Is described by: Journal article: 10.1007/s00354-020-00111-1 (DOI)

Repository URL: https://github.com/NILGroup/annotatedShortStoryFeatures
Development Status: Active
