Two-Annotator Sentence-Level Annotations of Human-Written Short Stories
Zenodo DOI (this dataset): https://doi.org/10.5281/zenodo.18791170
Release date: 2025-12-01
Related publication: Quantitative Characteristics of Human-Written Short Stories as a Metric for Automated Storytelling (León et al., 2020), https://doi.org/10.1007/s00354-020-00111-1
This Zenodo record contains two independent annotation files (Annotator 1 and Annotator 2) for a collection of human-written short stories used in the study above.
Each row corresponds to one sentence from a participant-written story, plus a set of narrative-structure annotations (plot actions/descriptions, agency, and causal dependencies).
The primary goal of releasing these files is to support:
- Reproducibility of the analyses reported in the associated article.
- Follow-up work on metrics for automated storytelling and comparisons between human-written and machine-generated narratives.
- Research on inter-annotator agreement for story-structure annotation schemes.
Important: The dataset contains human-written narrative text. While it does not include direct identifiers (names, emails), researchers should treat the content as potentially sensitive and follow good data stewardship practices.
Contents
This record provides the same data in CSV and JSON formats for each annotator:
- annotator1.csv
- annotator1.json
- annotator2.csv
- annotator2.json
The CSV and JSON versions for each annotator are equivalent (same rows, same fields), differing only in serialization.
High-level dataset summary
Collection / scenario
In the underlying experiment, human subjects wrote short plots in a constrained scenario; the plots were later annotated to quantify narrative components and relations (see the related paper for full experimental protocol and analysis).
Units of annotation
- One row = one sentence (a single sentence of a story, not spanning multiple sentences).
- Sentences can be grouped into stories / subjects using the pair DATE + CLASS SEAT (i.e., the date of data gathering and the participant's seat identifier in the classroom).
Coverage and size
| Annotator | Rows (sentences) | Columns | Unique subjects (DATE + CLASS SEAT) | Notes |
|---|---|---|---|---|
| Annotator 1 | 742 | 11 | 25 | TYPE is missing for 4 rows |
| Annotator 2 | 770 | 11 | 26 | TYPE is missing for 1 row |
Dates present (both annotators): 23/5/18, 24/5/18, 25/5/18
Exercise IDs present (both annotators): 1-1 (in these files, only one exercise identifier appears)
Per-subject story length (sentences)
| Annotator | min | median | mean | max |
|---|---|---|---|---|
| Annotator 1 | 13 | 25.0 | 29.68 | 61 |
| Annotator 2 | 13 | 25.5 | 29.62 | 61 |
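The per-subject summary can be recomputed along the following lines. This is a minimal sketch, assuming each row is a dictionary keyed by column name (for example, as produced by csv.DictReader); the helper names story_lengths and summarize and the toy rows are illustrative and not part of the dataset.

```python
from collections import Counter
from statistics import mean, median

def story_lengths(rows):
    """Count sentences per subject, keyed by (DATE, CLASS SEAT)."""
    return Counter((r["DATE"], r["CLASS SEAT"]) for r in rows)

def summarize(lengths):
    """Return min, median, mean, and max of the per-subject sentence counts."""
    values = list(lengths.values())
    return {
        "min": min(values),
        "median": median(values),
        "mean": round(mean(values), 2),
        "max": max(values),
    }

# Hypothetical toy rows (three subjects with 3, 5, and 4 sentences);
# replace with rows read from annotator1.csv or annotator2.csv.
rows = (
    [{"DATE": "23/5/18", "CLASS SEAT": "1"}] * 3
    + [{"DATE": "23/5/18", "CLASS SEAT": "2"}] * 5
    + [{"DATE": "24/5/18", "CLASS SEAT": "1"}] * 4
)
summary = summarize(story_lengths(rows))
print(summary)
```

Grouping on the (DATE, CLASS SEAT) pair rather than on CLASS SEAT alone matters because the same seat number can recur on different collection dates.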
Notable difference between annotators
Annotator 2 includes one additional subject not present in Annotator 1:
- DATE = 25/5/18, CLASS SEAT = 20 (28 sentences)
If you compute inter-annotator agreement, consider aligning the datasets by (DATE, CLASS SEAT, SENTENCE) or by position within each story.
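Before aligning, it can help to list the subjects that appear in only one file. The sketch below assumes rows are dictionaries keyed by column name; subject_diff is a hypothetical helper and the rows are toy data, not values from the dataset.

```python
def subjects(rows):
    """Set of (DATE, CLASS SEAT) pairs present in a list of rows."""
    return {(r["DATE"], r["CLASS SEAT"]) for r in rows}

def subject_diff(rows_a, rows_b):
    """Return (subjects only in A, subjects only in B), each sorted."""
    sa, sb = subjects(rows_a), subjects(rows_b)
    return sorted(sa - sb), sorted(sb - sa)

# Hypothetical toy rows; only the grouping columns are needed here.
rows_a = [
    {"DATE": "23/5/18", "CLASS SEAT": "1"},
    {"DATE": "24/5/18", "CLASS SEAT": "2"},
]
rows_b = rows_a + [{"DATE": "25/5/18", "CLASS SEAT": "20"}]

only_a, only_b = subject_diff(rows_a, rows_b)
print("Only in A:", only_a)
print("Only in B:", only_b)
```

Subjects reported on either side can then be excluded from agreement statistics, since they have no counterpart to compare against.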
Data schema (columns)
All files share the same columns:
Text and grouping fields
- SENTENCE: The sentence text (one sentence per row). The story texts are primarily in Spanish.
- DATE: Date when the data were gathered (string, e.g., 23/5/18). In these files, values are limited to 23/5/18, 24/5/18, and 25/5/18.
- CLASS SEAT: Participant seat identifier used during collection (integer-like). Together with DATE, it can be used to approximate a unique subject ID.
- EXERCISE: Exercise identifier (string). In these files the value is 1-1 for all rows.
Annotation fields
- TYPE: Narrative function of the sentence according to the annotator, either Plot Description or Plot Action. Some rows have missing values (NaN) and should be handled explicitly during analysis.
- PROTAGONIST IS AGENT: Whether the protagonist behaves with agency in the sentence. In the JSON files this is stored as "TRUE"/"FALSE" (strings). In CSV it is also "TRUE"/"FALSE", but many parsers will auto-convert to booleans.
- PROTAGONIST IS OBJECT: Whether the protagonist is the object of an action in the sentence (again, "TRUE"/"FALSE").
- PREVIOUS ACTION: The previous action in the story according to the annotator. Many rows do not have a value here (missing/blank). When present, it is typically a string containing a previous sentence (often the relevant prior action).
- DEPENDENCY 1, DEPENDENCY 2, DEPENDENCY 3: Zero to three dependencies that establish the causal occurrences that trigger the current sentence (according to the annotator). Each dependency, when present, is represented as a string (typically the textual form of a previous sentence).
Practical notes for analysis
1) Inter-annotator agreement (IAA)
If you want to compute IAA for TYPE, PROTAGONIST IS AGENT, or PROTAGONIST IS OBJECT, you will need an alignment strategy.
Common approaches:
- Exact-text alignment: merge on (DATE, CLASS SEAT, SENTENCE) (works if sentence strings match exactly).
- Positional alignment: within each (DATE, CLASS SEAT) group, align by row order (only safe if both annotators segmented and ordered sentences identically).
- Hybrid alignment: normalize whitespace/punctuation in SENTENCE and then merge.
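The hybrid approach might look like the sketch below. The normalization rules (case folding, stripping ASCII and common Spanish punctuation, collapsing whitespace) and the decision to keep only keys that are unique in both files are assumptions made for illustration; normalize, index_rows, and align are hypothetical helper names, and the sentences are invented.

```python
import re
import string
import unicodedata

# Characters to delete: ASCII punctuation plus common Spanish/typographic marks.
_PUNCT = str.maketrans("", "", string.punctuation + "¿¡«»“”‘’…")

def normalize(text):
    """Case-fold, strip punctuation, and collapse whitespace (accents are kept)."""
    text = unicodedata.normalize("NFC", text).casefold()
    text = text.translate(_PUNCT)
    return re.sub(r"\s+", " ", text).strip()

def index_rows(rows):
    """Map (DATE, CLASS SEAT, normalized SENTENCE) to the rows sharing that key."""
    idx = {}
    for r in rows:
        key = (r["DATE"], r["CLASS SEAT"], normalize(r["SENTENCE"]))
        idx.setdefault(key, []).append(r)
    return idx

def align(rows_a, rows_b):
    """Pair rows whose key occurs exactly once in each file; skip ambiguous keys."""
    ia, ib = index_rows(rows_a), index_rows(rows_b)
    return [
        (ia[k][0], ib[k][0])
        for k in ia
        if k in ib and len(ia[k]) == 1 and len(ib[k]) == 1
    ]

# Hypothetical toy rows, not taken from the dataset.
rows_a = [
    {"DATE": "23/5/18", "CLASS SEAT": "1", "SENTENCE": "El héroe salió  de casa."},
    {"DATE": "23/5/18", "CLASS SEAT": "1", "SENTENCE": "¿Quién era?"},
    {"DATE": "23/5/18", "CLASS SEAT": "1", "SENTENCE": "Nadie respondió."},
]
rows_b = [
    {"DATE": "23/5/18", "CLASS SEAT": "1", "SENTENCE": "El héroe salió de casa"},
    {"DATE": "23/5/18", "CLASS SEAT": "1", "SENTENCE": "quién era"},
]
pairs = align(rows_a, rows_b)
print("Aligned rows:", len(pairs))
```

Skipping keys that repeat within a story avoids silently pairing the wrong occurrences of an identical sentence; those rows would need positional alignment instead.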
After aligning, you can compute:
- Cohen's kappa for categorical labels (TYPE, booleans)
- Percentage agreement
- Confusion matrices
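As a worked illustration of these statistics, Cohen's kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from each annotator's label frequencies. The sketch below is dependency-free; the label lists are invented, and pairs with a missing label are assumed to have been dropped beforehand.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label counts.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical aligned TYPE labels for eight sentences.
A, D = "Plot Action", "Plot Description"
type_a1 = [A, A, D, D, A, D, A, A]
type_a2 = [A, D, D, D, A, D, A, A]

kappa = cohen_kappa(type_a1, type_a2)
agreement = sum(x == y for x, y in zip(type_a1, type_a2)) / len(type_a1)
confusion = Counter(zip(type_a1, type_a2))  # (label_a1, label_a2) -> count

print(f"kappa={kappa:.2f} agreement={agreement:.3f}")
print(confusion)
```

In this toy case the annotators agree on 7 of 8 items (p_o = 0.875), chance agreement is 0.5, and kappa is therefore 0.75.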
2) Story-structure graphs
You can view each story as a directed graph:
- Nodes: sentences
- Edges: DEPENDENCY links, and optionally PREVIOUS ACTION links

This supports analyses of:
- causal chain length
- branching factor
- action/description alternation
- agency patterns
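One way to build such a graph for a single story is sketched below. It assumes that a dependency string matches the SENTENCE text of an earlier row exactly, that edges point from cause to effect, and that rows are in story order; none of this is confirmed by the dataset description, and build_edges and longest_chain are hypothetical helpers run on invented rows.

```python
from collections import Counter

DEP_COLUMNS = ("DEPENDENCY 1", "DEPENDENCY 2", "DEPENDENCY 3")

def build_edges(story):
    """Return (cause_index, effect_index) pairs for one story's rows, in order."""
    position = {}
    for i, row in enumerate(story):
        position.setdefault(row["SENTENCE"], i)  # first occurrence wins
    edges = []
    for i, row in enumerate(story):
        for col in DEP_COLUMNS:
            dep = (row.get(col) or "").strip()
            j = position.get(dep)
            if j is not None and j < i:  # keep backward links only, so no cycles
                edges.append((j, i))
    return edges

def longest_chain(n, edges):
    """Number of edges on the longest causal chain among n sentences."""
    depth = [0] * n
    # Causes always precede effects, so processing by effect index is safe.
    for cause, effect in sorted(edges, key=lambda e: e[1]):
        depth[effect] = max(depth[effect], depth[cause] + 1)
    return max(depth, default=0)

# Hypothetical four-sentence story.
story = [
    {"SENTENCE": "S0"},
    {"SENTENCE": "S1", "DEPENDENCY 1": "S0"},
    {"SENTENCE": "S2", "DEPENDENCY 1": "S1"},
    {"SENTENCE": "S3", "DEPENDENCY 1": "S0", "DEPENDENCY 2": "S1"},
]
edges = build_edges(story)
chain = longest_chain(len(story), edges)
out_degree = Counter(cause for cause, _ in edges)  # branching per sentence
print(edges, chain, dict(out_degree))
```

The out-degree counter gives a simple per-sentence branching measure; dependency strings that match no earlier sentence are ignored here and may deserve a separate check.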
3) Aggregate metrics
The paper uses quantitative characteristics of stories; this dataset enables replication and extension of metrics such as:
- ratio of Plot Action vs Plot Description
- prevalence of protagonist agency/objecthood
- dependency density (how often dependencies are recorded)
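These metrics could be computed roughly as follows. The definitions used here (share of typed rows that are Plot Action, share of rows flagged "TRUE", share of rows with at least one dependency) are one plausible reading and not necessarily the exact definitions used in the paper; aggregate_metrics is a hypothetical helper and the rows are invented.

```python
DEP_COLUMNS = ("DEPENDENCY 1", "DEPENDENCY 2", "DEPENDENCY 3")

def is_true(value):
    """Accept Python booleans or "TRUE"/"FALSE" strings; anything else is False."""
    if isinstance(value, bool):
        return value
    return isinstance(value, str) and value.strip().upper() == "TRUE"

def aggregate_metrics(rows):
    """Compute simple story-level proportions from a non-empty list of row dicts."""
    if not rows:
        raise ValueError("rows must not be empty")
    n = len(rows)
    # Rows with a missing TYPE are excluded from the action share only.
    typed = [r for r in rows if (r.get("TYPE") or "").strip()]
    actions = sum(r["TYPE"].strip() == "Plot Action" for r in typed)
    with_deps = sum(
        1 for r in rows if any((r.get(c) or "").strip() for c in DEP_COLUMNS)
    )
    return {
        "action_share": actions / len(typed) if typed else None,
        "agent_rate": sum(is_true(r.get("PROTAGONIST IS AGENT")) for r in rows) / n,
        "object_rate": sum(is_true(r.get("PROTAGONIST IS OBJECT")) for r in rows) / n,
        "dependency_density": with_deps / n,
    }

# Hypothetical toy rows; the third one has a missing TYPE.
rows = [
    {"TYPE": "Plot Action", "PROTAGONIST IS AGENT": "TRUE",
     "PROTAGONIST IS OBJECT": "FALSE", "DEPENDENCY 1": "x"},
    {"TYPE": "Plot Description", "PROTAGONIST IS AGENT": "FALSE",
     "PROTAGONIST IS OBJECT": "FALSE"},
    {"TYPE": "", "PROTAGONIST IS AGENT": "TRUE",
     "PROTAGONIST IS OBJECT": "FALSE", "DEPENDENCY 1": "y"},
    {"TYPE": "Plot Action", "PROTAGONIST IS AGENT": "FALSE",
     "PROTAGONIST IS OBJECT": "TRUE", "DEPENDENCY 1": "", "DEPENDENCY 2": ""},
]
metrics = aggregate_metrics(rows)
print(metrics)
```

Applying the function per (DATE, CLASS SEAT) group yields one metric vector per story, which can then be compared across annotators or against generated stories.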
Missingness and sparsity
The annotation is intentionally sparse in some columns.
Annotator 1: missing values (count of rows)
- TYPE: 4
- PREVIOUS ACTION: 714
- DEPENDENCY 1: 106
- DEPENDENCY 2: 517
- DEPENDENCY 3: 725
Annotator 2: missing values (count of rows)
- TYPE: 1
- PREVIOUS ACTION: 738
- DEPENDENCY 1: 122
- DEPENDENCY 2: 675
- DEPENDENCY 3: 769
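Counts like these can be re-derived with the standard library alone. The sketch below treats empty cells and the literal text "nan" as missing, which is an assumption about how blanks are serialized; missing_counts is a hypothetical helper, and the inline CSV is invented sample data standing in for annotator1.csv or annotator2.csv.

```python
import csv
import io

MISSING = {"", "nan"}  # assumed spellings of a missing value

def missing_counts(rows, columns):
    """Number of rows whose value is missing, for each requested column."""
    return {
        c: sum(1 for r in rows if (r.get(c) or "").strip().casefold() in MISSING)
        for c in columns
    }

# Hypothetical sample; for real data use open("annotator1.csv", newline="").
sample = io.StringIO(
    "SENTENCE,TYPE,PREVIOUS ACTION,DEPENDENCY 1\n"
    "Uno.,Plot Action,,\n"
    "Dos.,,Uno.,Uno.\n"
    "Tres.,Plot Description,,Dos.\n"
)
rows = list(csv.DictReader(sample))
counts = missing_counts(rows, ["TYPE", "PREVIOUS ACTION", "DEPENDENCY 1"])
print(counts)
```

Comparing the output against the counts listed above is a quick integrity check after downloading the files.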
Reproducible loading examples
Python (pandas)
```python
import pandas as pd

a1 = pd.read_csv("annotator1.csv")
a2 = pd.read_csv("annotator2.csv")

# Basic checks
print(a1.shape, a2.shape)
print(a1.columns)

# Optional: normalize booleans if your parser did not do it
def to_bool(x):
    if isinstance(x, bool):
        return x
    if isinstance(x, str):
        return x.strip().upper() == "TRUE"
    return False  # any other value (e.g., NaN) is treated as False

for df in (a1, a2):
    df["PROTAGONIST_IS_AGENT"] = df["PROTAGONIST IS AGENT"].map(to_bool)
    df["PROTAGONIST_IS_OBJECT"] = df["PROTAGONIST IS OBJECT"].map(to_bool)

# Example: exact-text alignment for IAA
# (an inner join: rows whose text differs between files are dropped)
common = a1.merge(
    a2,
    on=["DATE", "CLASS SEAT", "EXERCISE", "SENTENCE"],
    suffixes=("_a1", "_a2"),
)
print("Aligned rows:", len(common))
```
R (readr)
```r
library(readr)

a1 <- read_csv("annotator1.csv", show_col_types = FALSE)
a2 <- read_csv("annotator2.csv", show_col_types = FALSE)

dim(a1); dim(a2)
names(a1)
```
Provenance
A prior project README (included with the files provided for this Zenodo deposit) describes this dataset as experimental data for the related article, including:
- that each row corresponds to a sentence,
- the collection dates (May 23–25, 2018),
- and the role of DATE + CLASS SEAT as an approximate subject identifier.
License
Please select a license in Zenodo that matches:
1) the participant consent / institutional requirements for the original texts, and
2) your intended reuse policy.
For open research data, CC BY 4.0 is a common choice; however, if the original stories are subject to additional constraints, consider a more restrictive license.
How to cite
Dataset (this Zenodo record)
Recommended citation
León, C., Gervás, P., Delatorre, P., & Tapscott, A. (2025). Two-Annotator Sentence-Level Annotations of Human-Written Short Stories (Version 1.0.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.18791170
BibTeX
@dataset{leon2025_story_annotations,
author = {León, Carlos and Gervás, Pablo and Delatorre, Pablo and Tapscott, Alan},
title = {Two-Annotator Sentence-Level Annotations of Human-Written Short Stories},
year = {2025},
month = dec,
publisher = {Zenodo},
version = {1.0.0},
doi = {10.5281/zenodo.18791170},
url = {https://doi.org/10.5281/zenodo.18791170}
}
Related paper
León, C., Gervás, P., Delatorre, P., & Tapscott, A. (2020). Quantitative Characteristics of Human-Written Short Stories as a Metric for Automated Storytelling. New Generation Computing, 38, 635–671. https://doi.org/10.1007/s00354-020-00111-1
BibTeX
@article{leon2020_quantitative_characteristics,
author = {León, Carlos and Gervás, Pablo and Delatorre, Pablo and Tapscott, Alan},
title = {Quantitative Characteristics of Human-Written Short Stories as a Metric for Automated Storytelling},
journal = {New Generation Computing},
year = {2020},
volume = {38},
pages = {635--671},
doi = {10.1007/s00354-020-00111-1},
url = {https://doi.org/10.1007/s00354-020-00111-1}
}
Contact
Corresponding / contact author:
Carlos León — cleon@ucm.es
Changelog
- 2025-12-01 (v1.0.0): Zenodo release (this record).
Files
Total size: 1.1 MB
| md5 | Size |
|---|---|
| 20efed46101c7a40849bac047e6190cc | 183.3 kB |
| 6e7144c7f635aaf6ad63b28d82846137 | 346.9 kB |
| 90ed4414b61d72aeb4a770aace86aa6f | 175.1 kB |
| 2612105a8c1322809ea22bc0084a6115 | 344.8 kB |
| 09987f7a3e0d2254df9e57fbb85dae01 | 10.0 kB |
Additional details
Related works
- Is described by: Journal article, 10.1007/s00354-020-00111-1 (DOI)
Software
- Repository URL: https://github.com/NILGroup/annotatedShortStoryFeatures
- Development Status: Active