Published December 1, 2025 | Version v1
Dataset Open

Two-Annotator Sentence-Level Annotations of Human-Written Short Stories

  • 1. Universidad Complutense de Madrid
  • 2. Universidad de Cádiz
  • 3. Pompeu Fabra University

Description

Zenodo DOI (this dataset): https://doi.org/10.5281/zenodo.18791170
Release date: 2025-12-01
Related publication: Quantitative Characteristics of Human-Written Short Stories as a Metric for Automated Storytelling (León et al., 2020), https://doi.org/10.1007/s00354-020-00111-1

This Zenodo record contains two independent annotation files (Annotator 1 and Annotator 2) for a collection of human-written short stories used in the study above.
Each row corresponds to one sentence from a participant-written story, plus a set of narrative-structure annotations (plot actions/descriptions, agency, and causal dependencies).

The primary goal of releasing these files is to support:

  • Reproducibility of the analyses reported in the associated article.
  • Follow-up work on metrics for automated storytelling and comparisons between human-written and machine-generated narratives.
  • Research on inter-annotator agreement for story-structure annotation schemes.

Important: The dataset contains human-written narrative text. While it does not include direct identifiers (names, emails), researchers should treat the content as potentially sensitive and follow good data stewardship practices.

Contents

This record provides the same data in CSV and JSON formats for each annotator:

  • annotator1.csv
  • annotator1.json
  • annotator2.csv
  • annotator2.json

The CSV and JSON versions for each annotator are equivalent (same rows, same fields), differing only in serialization.

High-level dataset summary

Collection / scenario

In the underlying experiment, human subjects wrote short plots in a constrained scenario; the plots were later annotated to quantify narrative components and relations (see the related paper for full experimental protocol and analysis).

Units of annotation

  • One row = one sentence of a story; no row spans multiple sentences.
  • Sentences can be grouped into stories / subjects using the pair:
    • DATE + CLASS SEAT
      (i.e., the date of data gathering and the participant’s seat identifier in the classroom).

Coverage and size

Annotator | Rows (sentences) | Columns | Unique subjects (DATE + CLASS SEAT) | Notes
Annotator 1 | 742 | 11 | 25 | TYPE is missing for 4 rows
Annotator 2 | 770 | 11 | 26 | TYPE is missing for 1 row

Dates present (both annotators): 23/5/18, 24/5/18, 25/5/18
Exercise IDs present (both annotators): 1-1 (in these files, only one exercise identifier appears)

Per-subject story length (sentences)

Annotator | min | median | mean | max
Annotator 1 | 13 | 25.0 | 29.68 | 61
Annotator 2 | 13 | 25.5 | 29.62 | 61
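
The per-subject figures above can be recomputed by grouping rows on (DATE, CLASS SEAT) and counting sentences. The following is a minimal sketch that uses only the standard library; the function names and the toy rows are illustrative assumptions, not part of the dataset or the paper's code.

```python
from collections import Counter
from statistics import mean, median

def story_lengths(rows):
    """Count sentences per subject, keyed by (DATE, CLASS SEAT)."""
    return Counter((row["DATE"], row["CLASS SEAT"]) for row in rows)

def summarize(lengths):
    """Return min, median, mean and max of the per-subject sentence counts."""
    values = list(lengths.values())
    return {
        "min": min(values),
        "median": median(values),
        "mean": round(mean(values), 2),
        "max": max(values),
    }

# Toy rows shaped like csv.DictReader output; they are NOT taken from the dataset.
toy_rows = [
    {"DATE": "23/5/18", "CLASS SEAT": "1", "SENTENCE": "s1"},
    {"DATE": "23/5/18", "CLASS SEAT": "1", "SENTENCE": "s2"},
    {"DATE": "23/5/18", "CLASS SEAT": "1", "SENTENCE": "s3"},
    {"DATE": "24/5/18", "CLASS SEAT": "1", "SENTENCE": "s4"},
]
lengths = story_lengths(toy_rows)
summary = summarize(lengths)
print(summary)
```

Run on the real files, the summary should be comparable to the table above, provided (DATE, CLASS SEAT) identifies each subject uniquely.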

Notable difference between annotators

Annotator 2 includes one additional subject that is not present in Annotator 1:

  • DATE = 25/5/18, CLASS SEAT = 20 (28 sentences)

If you compute inter-annotator agreement, consider aligning the datasets by (DATE, CLASS SEAT, SENTENCE) or by position within each story.
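
Before aligning, it can help to list the subjects that appear in only one file. The sketch below is one possible way to do that with set differences; `subject_keys` and `unmatched_subjects` are hypothetical helper names, and the toy rows are invented for illustration.

```python
def subject_keys(rows):
    """Set of (DATE, CLASS SEAT) pairs present in a list of row dicts."""
    return {(row["DATE"], row["CLASS SEAT"]) for row in rows}

def unmatched_subjects(rows_a1, rows_a2):
    """Return (only in annotator 1, only in annotator 2) as sorted lists."""
    keys_a1, keys_a2 = subject_keys(rows_a1), subject_keys(rows_a2)
    return sorted(keys_a1 - keys_a2), sorted(keys_a2 - keys_a1)

# Toy rows (invented): the second file has one extra subject.
toy_a1 = [{"DATE": "23/5/18", "CLASS SEAT": "1"}]
toy_a2 = [
    {"DATE": "23/5/18", "CLASS SEAT": "1"},
    {"DATE": "25/5/18", "CLASS SEAT": "20"},
]
only_a1, only_a2 = unmatched_subjects(toy_a1, toy_a2)
print(only_a1, only_a2)
```

Subjects reported in either list have no counterpart and would normally be excluded from agreement statistics.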

Data schema (columns)

All files share the same columns:

Text and grouping fields

  • SENTENCE
    The sentence text (one sentence per row). The story texts are primarily in Spanish.

  • DATE
    Date when the data were gathered (string, e.g., 23/5/18). In these files, values are limited to 23/5/18, 24/5/18, 25/5/18.

  • CLASS SEAT
    Participant seat identifier used during collection (integer-like). Together with DATE, it can be used to approximate a unique subject ID.

  • EXERCISE
    Exercise identifier (string). In these files the value is 1-1 for all rows.
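
The DATE strings are day-first (the collection took place on May 23 to 25, 2018), so they need an explicit format when parsed. A minimal sketch, assuming the `D/M/YY` pattern holds for every row; the `subject_id` label format is a hypothetical convention introduced here, not something defined by the dataset.

```python
from datetime import datetime

def parse_date(value):
    """Parse a day-first string such as '23/5/18' into a datetime.date."""
    return datetime.strptime(value.strip(), "%d/%m/%y").date()

def subject_id(row):
    """Build an approximate subject label from DATE and CLASS SEAT.

    The label layout (ISO date + seat number) is an illustrative choice.
    """
    day = parse_date(row["DATE"]).isoformat()
    seat = int(float(row["CLASS SEAT"]))  # tolerates "20", 20 or 20.0
    return f"{day}_seat{seat:02d}"

print(parse_date("23/5/18"))
print(subject_id({"DATE": "25/5/18", "CLASS SEAT": "20"}))
```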

Annotation fields

  • TYPE
    Narrative function of the sentence according to the annotator:

    • Plot Description
    • Plot Action

    Some rows have missing values (NaN) and should be handled explicitly during analysis.

  • PROTAGONIST IS AGENT
    Whether the protagonist behaves with agency in the sentence.
    In the JSON files this is stored as "TRUE" / "FALSE" (strings). In CSV it is also "TRUE"/"FALSE" but many parsers will auto-convert to booleans.

  • PROTAGONIST IS OBJECT
    Whether the protagonist is the object of an action in the sentence (again, "TRUE" / "FALSE").

  • PREVIOUS ACTION
    The previous action in the story according to the annotator.
    Many rows do not have a value here (missing/blank). When present, it is typically a string containing a previous sentence (often the relevant prior action).

  • DEPENDENCY 1, DEPENDENCY 2, DEPENDENCY 3
    Zero to three causal dependencies: the earlier events that, according to the annotator, trigger the current sentence.
    Each dependency, when present, is represented as a string (typically the textual form of a previous sentence).
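
Because missing values, string booleans, and up to three dependency columns all need handling, a small row-cleaning step can make later analysis simpler. The following is a sketch under stated assumptions: the output keys (`type`, `agent`, `object`, `dependencies`) and helper names are hypothetical, and a missing flag is kept as `None` rather than coerced to `False`.

```python
import math

DEPENDENCY_COLUMNS = ("DEPENDENCY 1", "DEPENDENCY 2", "DEPENDENCY 3")

def is_missing(value):
    """Treat None, NaN and blank strings as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""

def parse_flag(value):
    """Map "TRUE"/"FALSE" strings (or real booleans) to bool; missing stays None."""
    if isinstance(value, bool):
        return value
    if is_missing(value):
        return None
    return str(value).strip().upper() == "TRUE"

def clean_row(row):
    """Return a normalized view of one annotated sentence."""
    return {
        "sentence": row["SENTENCE"],
        "type": None if is_missing(row.get("TYPE")) else row["TYPE"].strip(),
        "agent": parse_flag(row.get("PROTAGONIST IS AGENT")),
        "object": parse_flag(row.get("PROTAGONIST IS OBJECT")),
        "dependencies": [
            row[col] for col in DEPENDENCY_COLUMNS if not is_missing(row.get(col))
        ],
    }

# Toy row (invented) mixing the representations a CSV or JSON parser may produce.
toy_row = {
    "SENTENCE": "x",
    "TYPE": float("nan"),
    "PROTAGONIST IS AGENT": "TRUE",
    "PROTAGONIST IS OBJECT": False,
    "DEPENDENCY 1": "prev",
    "DEPENDENCY 2": "",
    "DEPENDENCY 3": None,
}
cleaned = clean_row(toy_row)
print(cleaned)
```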

Practical notes for analysis

1) Inter-annotator agreement (IAA)

If you want to compute IAA for TYPE, PROTAGONIST IS AGENT, or PROTAGONIST IS OBJECT, you will need an alignment strategy.

Common approaches:

  • Exact-text alignment: merge on (DATE, CLASS SEAT, SENTENCE) (works if sentence strings match exactly).
  • Positional alignment: within each (DATE, CLASS SEAT) group, align by row order (only safe if both annotators segmented and ordered sentences identically).
  • Hybrid alignment: normalize whitespace/punctuation in SENTENCE and then merge.
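
The hybrid approach can be sketched as follows. The normalization rules (Unicode NFC, case folding, stripping ASCII and common Spanish punctuation, collapsing whitespace) are assumptions chosen for illustration, not the procedure used in the paper. Keys that occur more than once on either side are skipped so that ambiguous matches are not paired silently.

```python
import re
import string
import unicodedata
from collections import defaultdict

# Punctuation to strip: ASCII plus a few marks common in Spanish text (assumed set).
_PUNCT = str.maketrans("", "", string.punctuation + "¿¡«»“”‘’…")

def normalize_sentence(text):
    """Case-fold, drop punctuation and collapse whitespace."""
    text = unicodedata.normalize("NFC", text).casefold()
    text = text.translate(_PUNCT)
    return re.sub(r"\s+", " ", text).strip()

def _index(rows):
    """Group rows by (DATE, CLASS SEAT, normalized SENTENCE)."""
    index = defaultdict(list)
    for row in rows:
        key = (row["DATE"], row["CLASS SEAT"], normalize_sentence(row["SENTENCE"]))
        index[key].append(row)
    return index

def align(rows_a1, rows_a2):
    """Pair rows whose key is unique in both annotators' files."""
    idx1, idx2 = _index(rows_a1), _index(rows_a2)
    return [
        (idx1[key][0], idx2[key][0])
        for key in idx1
        if len(idx1[key]) == 1 and len(idx2.get(key, [])) == 1
    ]

# Toy rows (invented): same sentence, different punctuation and spacing.
toy_a1 = [{"DATE": "23/5/18", "CLASS SEAT": "1", "SENTENCE": "¿Dónde  está?"}]
toy_a2 = [{"DATE": "23/5/18", "CLASS SEAT": "1", "SENTENCE": "Dónde está"}]
pairs = align(toy_a1, toy_a2)
print(len(pairs))
```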

After aligning, you can compute:

  • Cohen’s kappa for categorical labels (TYPE, booleans)
  • Percentage agreement
  • Confusion matrices
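
For reference, Cohen's kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from each annotator's label frequencies. A minimal standard-library sketch follows; the label lists are toy values, not results from this dataset.

```python
from collections import Counter

def percent_agreement(labels_a, labels_b):
    """Share of aligned items on which both annotators gave the same label."""
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

def confusion_matrix(labels_a, labels_b):
    """Counts of (annotator 1 label, annotator 2 label) pairs."""
    return Counter(zip(labels_a, labels_b))

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = percent_agreement(labels_a, labels_b)
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Toy labels (invented) for four aligned sentences.
a1_types = ["Plot Action", "Plot Action", "Plot Description", "Plot Description"]
a2_types = ["Plot Action", "Plot Description", "Plot Description", "Plot Description"]
kappa = cohen_kappa(a1_types, a2_types)
print(percent_agreement(a1_types, a2_types), kappa)
```

Rows with a missing TYPE in either file should be dropped before calling these functions, or treated as a label of their own, depending on the analysis.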

2) Story-structure graphs

You can view each story as a directed graph:

  • Nodes: sentences
  • Edges: DEPENDENCY links, and optionally PREVIOUS ACTION links

This supports analyses of:

  • causal chain length
  • branching factor
  • action/description alternation
  • agency patterns
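
One way to build such a graph is sketched below. It assumes that each dependency string exactly matches the text of an earlier sentence in the same story, which the schema describes as typical but does not guarantee; dependencies with no matching earlier sentence are ignored. Rows are assumed to be dicts of strings (as from `csv.DictReader`), and all function names are illustrative.

```python
from collections import Counter

DEPENDENCY_COLUMNS = ("DEPENDENCY 1", "DEPENDENCY 2", "DEPENDENCY 3")

def build_edges(story):
    """Return (cause_index, effect_index) pairs for one story, in row order."""
    first_index = {}  # sentence text -> index of its first occurrence
    edges = []
    for i, row in enumerate(story):
        for col in DEPENDENCY_COLUMNS:
            dep = (row.get(col) or "").strip()
            if dep and dep in first_index:
                edges.append((first_index[dep], i))
        first_index.setdefault(row["SENTENCE"].strip(), i)
    return edges

def longest_chain(n_sentences, edges):
    """Length, in edges, of the longest causal chain.

    Edges always point from an earlier to a later sentence, so processing
    them by effect index guarantees each cause's depth is already final.
    """
    depth = [0] * n_sentences
    for cause, effect in sorted(edges, key=lambda edge: edge[1]):
        depth[effect] = max(depth[effect], depth[cause] + 1)
    return max(depth, default=0)

def out_degree(edges):
    """Number of sentences each cause triggers (a simple branching measure)."""
    return Counter(cause for cause, _ in edges)

def toy(sentence, *deps):
    """Build an invented row with blank cells for unused dependency columns."""
    padded = list(deps) + [""] * (len(DEPENDENCY_COLUMNS) - len(deps))
    return {"SENTENCE": sentence, **dict(zip(DEPENDENCY_COLUMNS, padded))}

toy_story = [toy("A"), toy("B", "A"), toy("C", "B"), toy("D", "A")]
edges = build_edges(toy_story)
print(edges, longest_chain(len(toy_story), edges), out_degree(edges))
```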

3) Aggregate metrics

The paper uses quantitative characteristics of stories; this dataset enables replication and extension of metrics such as:

  • ratio of Plot Action vs Plot Description
  • prevalence of protagonist agency/objecthood
  • dependency density (how often dependencies are recorded)
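
These metrics can be approximated per file or per story with a few counts. The definitions in this sketch are assumptions made for illustration (for example, dependency density is taken as the share of rows with at least one dependency) and may differ from the definitions used in the paper. Rows are assumed to be dicts of strings with blank cells for missing values.

```python
DEPENDENCY_COLUMNS = ("DEPENDENCY 1", "DEPENDENCY 2", "DEPENDENCY 3")

def share(flags):
    """Fraction of True values, or None for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else None

def is_true(cell):
    return cell.strip().upper() == "TRUE"

def aggregate_metrics(rows):
    """Compute simple story-level proportions from annotated rows."""
    typed = [row for row in rows if row["TYPE"].strip()]  # skip missing TYPE
    return {
        "action_share": share(row["TYPE"].strip() == "Plot Action" for row in typed),
        "agent_rate": share(is_true(row["PROTAGONIST IS AGENT"]) for row in rows),
        "object_rate": share(is_true(row["PROTAGONIST IS OBJECT"]) for row in rows),
        "dependency_density": share(
            any(row[col].strip() for col in DEPENDENCY_COLUMNS) for row in rows
        ),
    }

def toy(type_, agent, obj, *deps):
    """Build an invented row; unused dependency columns are left blank."""
    padded = list(deps) + [""] * (len(DEPENDENCY_COLUMNS) - len(deps))
    row = {"TYPE": type_, "PROTAGONIST IS AGENT": agent, "PROTAGONIST IS OBJECT": obj}
    row.update(zip(DEPENDENCY_COLUMNS, padded))
    return row

toy_rows = [
    toy("Plot Description", "FALSE", "FALSE"),
    toy("Plot Action", "TRUE", "FALSE", "x"),
    toy("Plot Action", "FALSE", "TRUE", "x", "y"),
    toy("", "TRUE", "FALSE"),
]
metrics = aggregate_metrics(toy_rows)
print(metrics)
```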

Missingness and sparsity

The annotation is intentionally sparse in some columns.

Annotator 1: missing values (count of rows)

  • TYPE: 4
  • PREVIOUS ACTION: 714
  • DEPENDENCY 1: 106
  • DEPENDENCY 2: 517
  • DEPENDENCY 3: 725

Annotator 2: missing values (count of rows)

  • TYPE: 1
  • PREVIOUS ACTION: 738
  • DEPENDENCY 1: 122
  • DEPENDENCY 2: 675
  • DEPENDENCY 3: 769
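
The counts above can be cross-checked after loading. The sketch below reads CSV text with the standard `csv` module and counts blank cells per column; the column order in the toy header and the two data rows are invented for illustration, and only the column names come from the schema.

```python
import csv
import io

ANNOTATION_COLUMNS = [
    "TYPE", "PREVIOUS ACTION", "DEPENDENCY 1", "DEPENDENCY 2", "DEPENDENCY 3",
]

def missing_counts(rows, columns=ANNOTATION_COLUMNS):
    """Count rows whose cell is absent or blank, per column."""
    return {
        column: sum(1 for row in rows if not (row.get(column) or "").strip())
        for column in columns
    }

# Toy CSV with the documented column names; the data rows are NOT from the dataset.
toy_csv = io.StringIO(
    "SENTENCE,DATE,CLASS SEAT,EXERCISE,TYPE,PROTAGONIST IS AGENT,"
    "PROTAGONIST IS OBJECT,PREVIOUS ACTION,DEPENDENCY 1,DEPENDENCY 2,DEPENDENCY 3\n"
    "First sentence.,23/5/18,1,1-1,Plot Description,FALSE,FALSE,,,,\n"
    "Second sentence.,23/5/18,1,1-1,,TRUE,FALSE,,First sentence.,,\n"
)
rows = list(csv.DictReader(toy_csv))
counts = missing_counts(rows)
print(counts)
```

For the real files, replace the `StringIO` object with `open("annotator1.csv", newline="", encoding="utf-8")`; the UTF-8 encoding is an assumption and should be checked against the downloaded file.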

Reproducible loading examples

Python (pandas)

import pandas as pd

a1 = pd.read_csv("annotator1.csv")
a2 = pd.read_csv("annotator2.csv")

# Basic checks
print(a1.shape, a2.shape)
print(a1.columns)

# Optional: normalize booleans if your parser did not do it
def to_bool(x):
    if isinstance(x, bool):
        return x
    if isinstance(x, str):
        return x.strip().upper() == "TRUE"
    return False

for df in (a1, a2):
    df["PROTAGONIST_IS_AGENT"]  = df["PROTAGONIST IS AGENT"].map(to_bool)
    df["PROTAGONIST_IS_OBJECT"] = df["PROTAGONIST IS OBJECT"].map(to_bool)

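# Example: exact-text alignment for IAA.
# merge() defaults to an inner join, so sentences whose text differs between
# annotators are dropped; a sentence repeated within a story yields extra rows.
common = a1.merge(
    a2,
    on=["DATE", "CLASS SEAT", "EXERCISE", "SENTENCE"],
    suffixes=("_a1", "_a2"),
)

print("Aligned rows:", len(common))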

R (readr)

library(readr)
a1 <- read_csv("annotator1.csv", show_col_types = FALSE)
a2 <- read_csv("annotator2.csv", show_col_types = FALSE)

dim(a1); dim(a2)
names(a1)

Provenance

A prior project README (included with the files provided for this Zenodo deposit) describes this dataset as experimental data for the related article, including:

  • that each row corresponds to a sentence,
  • the collection dates (May 23–25, 2018),
  • and the role of DATE + CLASS SEAT as an approximate subject identifier.

License

Please select a license in Zenodo that matches:

1) the participant consent / institutional requirements for the original texts, and
2) your intended reuse policy.

For open research data, CC BY 4.0 is a common choice; however, if the original stories are subject to additional constraints, consider a more restrictive license.

How to cite

Dataset (this Zenodo record)

Recommended citation
León, C., Gervás, P., Delatorre, P., & Tapscott, A. (2025). Two-Annotator Sentence-Level Annotations of Human-Written Short Stories (Version 1.0.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.18791170

BibTeX

@dataset{leon2025_story_annotations,
  author       = {León, Carlos and Gervás, Pablo and Delatorre, Pablo and Tapscott, Alan},
  title        = {Two-Annotator Sentence-Level Annotations of Human-Written Short Stories},
  year         = {2025},
  month        = dec,
  publisher    = {Zenodo},
  version      = {1.0.0},
  doi          = {10.5281/zenodo.18791170},
  url          = {https://doi.org/10.5281/zenodo.18791170}
}

Related paper

León, C., Gervás, P., Delatorre, P., & Tapscott, A. (2020). Quantitative Characteristics of Human-Written Short Stories as a Metric for Automated Storytelling. New Generation Computing, 38, 635–671. https://doi.org/10.1007/s00354-020-00111-1

BibTeX

@article{leon2020_quantitative_characteristics,
  author  = {León, Carlos and Gervás, Pablo and Delatorre, Pablo and Tapscott, Alan},
  title   = {Quantitative Characteristics of Human-Written Short Stories as a Metric for Automated Storytelling},
  journal = {New Generation Computing},
  year    = {2020},
  volume  = {38},
  pages   = {635--671},
  doi     = {10.1007/s00354-020-00111-1},
  url     = {https://doi.org/10.1007/s00354-020-00111-1}
}

Contact

Corresponding / contact author:
Carlos León — cleon@ucm.es

Changelog

  • 2025-12-01 (v1.0.0): Zenodo release (this record).

Files

Files (1.1 MB)

  • md5:20efed46101c7a40849bac047e6190cc (183.3 kB)
  • md5:6e7144c7f635aaf6ad63b28d82846137 (346.9 kB)
  • md5:90ed4414b61d72aeb4a770aace86aa6f (175.1 kB)
  • md5:2612105a8c1322809ea22bc0084a6115 (344.8 kB)
  • md5:09987f7a3e0d2254df9e57fbb85dae01 (10.0 kB)
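
A downloaded file can be checked against its listed MD5 value with `hashlib`. This is a minimal sketch; the function names are illustrative, and because the listing above does not show which checksum belongs to which file, the expected value should be taken from the entry shown next to the file name on the record page.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_listed_md5(path, listed):
    """Compare a file against a value written as 'md5:<hex>' or plain '<hex>'."""
    expected = listed.strip().lower().split(":")[-1]
    return md5_of_file(path) == expected
```

For example, `matches_listed_md5("annotator1.csv", "md5:...")` returns `True` only when the local copy is byte-identical to the deposited file.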

Additional details

Related works

Is described by
Journal article: 10.1007/s00354-020-00111-1 (DOI)

Software