Published March 24, 2026 | Version v3
Dataset | Open Access

Attribution Bias in Literary Style Evaluation: Comparing Human and AI Perceptions of Authorship

  • Princeton University

Description

This dataset contains the complete experimental data for a large-scale study investigating attribution bias in literary style evaluation across two complementary experimental designs. Study 1 compares how human evaluators (N=556) and AI language models (N=13) assess identical literary content (the original Queneau exercises vs. GPT-4-generated versions) depending on perceived authorship (human-authored vs. AI-generated). Study 2 expands content generation to 14 models and examines bias patterns when each evaluator judges material from every AI creator in the experimental matrix.

Both studies employed a three-condition experimental design using Raymond Queneau's "Exercises in Style" as literary stimuli across 30 distinct writing styles. Participants and AI models evaluated the same content under blind, open-label, and counterfactual conditions, revealing systematic attribution bias where identical text receives different quality assessments based solely on authorship labels.

Dataset Contents

This repository includes raw experimental data, quality-filtered datasets, AI model simulation logs, and processed analysis-ready files spanning the complete research pipeline for both experimental studies. Human participant response data are provided in de-identified form (N=2,780 evaluations from 556 participants), with personally identifiable information removed in accordance with Princeton University IRB protocol #18320.

The compressed data folder (data.zip) should be extracted into the cloned GitHub repository (place the unzipped data folder at the same level as the analysis folder) to enable full replication of the analyses through the provided Jupyter notebooks:

style_and_prejudice/
├── analysis/
│   ├── 01_data_quality_analysis.ipynb
│   ├── 02_human_bias_analysis.ipynb
│   ├── 03_run_ai_simulation.ipynb            # Study 1: single creator, 13 AI evaluators
│   ├── 03b_run_ai_simulation_expanded.ipynb  # Study 2: cross-model design, 14x14 creators
│   ├── 04_ai_bias_analysis.ipynb             # Study 1: analysis
│   ├── 04b_ai_bias_analysis_extended.ipynb   # Study 2: analysis
│   ├── 05_comparative_analysis.ipynb         # Study 1: human vs. AI evaluator bias
│   ├── 06_surface_feature_confounds.ipynb    # Surface feature analysis (Study 1)
│   └── 07_llm_criterion_inversion_annotation.ipynb  # Criterion inversion coding & analysis
├── data/  # ← COPY HERE!
│   ├── literary_materials/
│   ├── responses/
│   └── logs/
├── criterion_inversions_coding/
└── README.md

Folder Descriptions

  • responses/ --> core response data organized by evaluator type and processing stage:

==> input for notebook #1 (01_data_quality_analysis.ipynb)

    • human_evaluators/raw/ → original participant data (fully anonymized) collected through the web platform
      • questionnaire.csv → demographics, attention checks, and screening responses
      • responses.csv → all participant evaluations across experimental conditions

-------------------------------------------------------------------------------------------------------

==> generated by notebook #1 (01_data_quality_analysis.ipynb), used by notebook #2 (02_human_bias_analysis.ipynb)

    • human_evaluators/processed/ → quality-filtered human participant datasets
      • valid_participants_dataset.csv → participants meeting inclusion criteria (N=556)
      • excluded_participants_dataset.csv → excluded participants with reasons

-------------------------------------------------------------------------------------------------------

==> generated by notebook #3 and #3b (03_run_ai_simulation.ipynb, 03b_run_ai_simulation_expanded.ipynb), used by notebook #4 and #4b (04_ai_bias_analysis.ipynb, 04b_ai_bias_analysis_extended.ipynb)

    • ai_evaluators/raw/ → AI model simulation data
      • Study 1 (single AI-generator): ai_participant_simulation_20250805_113946.csv → complete AI evaluations across all 13 models (13 models × 30 styles × 3 conditions, 3 separate runs)
      • Study 1 robustness replications:
        • ai_participant_simulation_20250915_210237-awardwinningai-t07.csv → alternative labeling ("award-winning AI")
        • ai_participant_simulation_20250916_195700-symmetricallanguage-t07.csv → symmetric labeling (t=0.7)
        • ai_participant_simulation_20250917_211949-symmetricallanguage-t00.csv → symmetric labeling, deterministic (t=0)
      • Study 2 (cross-model design): ai_participant_simulation_expanded.csv → cross-model evaluations (14 evaluators × 14 creators × 30 styles × 3 conditions, 1 run)
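The design sizes quoted in this section can be sanity-checked with simple arithmetic, using only the counts stated above:

```python
# Study 1: 13 evaluator models x 30 styles x 3 conditions x 3 runs
study1_rows = 13 * 30 * 3 * 3
assert study1_rows == 3510

# Study 2: 14 evaluators x 14 creators x 30 styles x 3 conditions, 1 run
study2_rows = 14 * 14 * 30 * 3
assert study2_rows == 17640

# 14 evaluators x 14 creators x 30 styles = 5,880 open-label/counterfactual
# pairs, matching the criterion-inversion annotation files described below
assert 14 * 14 * 30 == 5880

# Human sample: 2,780 evaluations from 556 participants -> 5 evaluations each
assert 2780 == 556 * 5
```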

-------------------------------------------------------------------------------------------------------

==> generated by notebooks #2 (02_human_bias_analysis.ipynb) & #4 (04_ai_bias_analysis.ipynb), used by notebook #5 (05_comparative_analysis.ipynb)

    • processed_for_comparison/ → final analysis-ready datasets for cross-evaluator comparison
      • human_responses_processed.csv → processed human evaluation data
      • human_summary_for_comparison.csv → human bias summary statistics
      • ai_responses_processed.csv → AI evaluation data formatted for comparison
      • ai_summary_for_comparison.csv → AI bias summary statistics
      • human_experiment_stats.json & ai_condition_stats.json → experimental condition metadata

-------------------------------------------------------------------------------------------------------

  • logs/ --> processing logs of AI evaluator simulations:
    • ai_evaluators/ → API execution logs from AI simulation
    • Study 1:
      • ai_participant_simulation_20250805_113946.log
      • ai_participant_simulation_20250915_210237-awardwinningai-t07.log
      • ai_participant_simulation_20250916_195700-symmetricallanguage-t07.log
      • ai_participant_simulation_20250917_211949-symmetricallanguage-t00.log
    • Study 2:
      • ai_participant_simulation_expanded.log

-------------------------------------------------------------------------------------------------------

  • literary_materials/
    • Study 1: experimental_stimuli.xlsx → GPT-4 generated stories (30 styles)
    • Study 2: experimental_stimuli_expanded.xlsx → multi-model generated stories (14 AI creators × 30 styles)

The experimental stimuli files contain structured literary content across 30 style categories, enabling the full replication pipeline. The datasets include AI-generated stories from multiple contemporary language models, giving researchers the flexibility to experiment with alternative literary materials. For complete replication of the original evaluation study, the placeholder literary content should be replaced with the corresponding excerpts from Raymond Queneau's Exercises in Style (2012 New Directions edition). The placeholder text thus permits complete reproduction of the findings, and demonstrates the experimental methodology, without redistributing copyright-protected content.
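Restoring the Queneau excerpts can be scripted once the stimuli sheet has been loaded (e.g. via pandas.read_excel) as a list of row dicts. A minimal sketch, where the "style"/"text" field names and the placeholder marker are assumptions to adapt to the actual spreadsheet headers:

```python
def restore_queneau_text(rows, excerpts, placeholder="[PLACEHOLDER]"):
    """Swap placeholder stimulus text for the matching excerpt, keyed by style.

    rows: list of dicts with (assumed) "style" and "text" fields.
    excerpts: mapping from style name to the licensed excerpt text.
    """
    restored = []
    for row in rows:
        if row["text"] == placeholder and row["style"] in excerpts:
            row = {**row, "text": excerpts[row["style"]]}  # copy, don't mutate input
        restored.append(row)
    return restored
```

Styles without an entry in the excerpt mapping are passed through unchanged, so the replacement can be done incrementally.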

-------------------------------------------------------------------------------------------------------

==> used by notebook #7 (07_llm_criterion_inversion_annotation.ipynb)

  • criterion_inversions_coding/ → systematic coding of AI evaluator rationales for criterion inversion analysis
    • llm_coders/ → compressed JSONL annotation files from three independent LLM coder models
      • criterion_inversion_annotations__openai__gpt-5.2.jsonl.zip → 5,880 paired rationale annotations
      • criterion_inversion_annotations__openai__gpt-4.1.jsonl.zip → 5,880 paired rationale annotations
      • criterion_inversion_annotations__anthropic__claude-sonnet-4.6.jsonl.zip → 5,880 paired rationale annotations

        Each record contains structured annotations for one paired rationale (open-label + counterfactual) using a six-criterion codebook (C1–C6): criterion presence, valence toward Version A and Version B (+1/0/−1), and grounding evidence snippets. All three coders annotated all pairs independently at temperature 0, blind to experimental conditions. Decompress before use.

    • human_audit/ → human inter-coder reliability data (98 paired rationales, 2 independent coders)
      • human_audit_coderA.csv → annotations from human coder A (98 pairs)
      • human_audit_coderB.csv → annotations from human coder B (98 pairs)
      • human_audit_combinedAB.csv → merged file with both coders' annotations side by side (196 rows)

        Two researchers independently coded a stratified sample of 98 paired rationales using the same codebook and blinding protocol as the LLM coders. These files serve as the human reference set for validating the LLM coding pipeline.


Individual files are also provided separately for direct access to specific datasets.
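For the human audit files described above, inter-coder agreement can be checked without external dependencies. A minimal Cohen's kappa sketch for a single criterion, assuming the two coders' label lists are aligned per paired rationale:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders' labels over the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    # observed agreement: fraction of items where the coders match
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # chance agreement: product of each coder's marginal rates, summed
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)
```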

Funding and Support

This work was supported by the Princeton Language and Intelligence (PLI) Seed Grant Program. Research conducted by Wouter Haverals and Meredith Martin at Princeton University's Center for Digital Humanities.

Files (96.0 MB)
Additional details