Attribution Bias in Literary Style Evaluation: Comparing Human and AI Perceptions of Authorship
Authors/Creators
Wouter Haverals; Meredith Martin (Princeton University)
Description
This dataset contains the complete experimental data for a large-scale study investigating attribution bias in literary style evaluation across two complementary experimental designs. Study 1 compares how human evaluators (N=556) and AI language models (N=13) assess identical literary content (Raymond Queneau's original exercises vs. GPT-4-generated versions) differently depending on perceived authorship (human-authored vs. AI-generated). Study 2 expands content generation to 14 models and examines bias patterns when evaluators judge material from every AI creator in the experimental matrix.
Both studies employed a three-condition experimental design using Raymond Queneau's "Exercises in Style" as literary stimuli across 30 distinct writing styles. Participants and AI models evaluated the same content under blind, open-label, and counterfactual conditions, revealing systematic attribution bias where identical text receives different quality assessments based solely on authorship labels.
Dataset Contents
This repository includes raw experimental data, quality-filtered datasets, AI model simulation logs, and processed analysis-ready files spanning the complete research pipeline for both experimental studies. Human participant response data is provided in de-identified form (N=2,780 evaluations from 556 participants), with personally identifiable information removed in accordance with Princeton University IRB protocol #18320.
The compressed data folder (data.zip) should be extracted into the cloned GitHub repository (place the unzipped data folder at the same level as the analysis folder) to enable full replication of the analyses through the provided Jupyter notebooks:
style_and_prejudice/
├── analysis/
│   ├── 01_data_quality_analysis.ipynb
│   ├── 02_human_bias_analysis.ipynb
│   ├── 03_run_ai_simulation.ipynb                   # Study 1: single creator, 13 AI evaluators
│   ├── 03b_run_ai_simulation_expanded.ipynb         # Study 2: cross-model design, 14x14 creators
│   ├── 04_ai_bias_analysis.ipynb                    # Study 1: analysis
│   ├── 04b_ai_bias_analysis_extended.ipynb          # Study 2: analysis
│   ├── 05_comparative_analysis.ipynb                # Study 1: human vs. AI evaluator bias
│   ├── 06_surface_feature_confounds.ipynb           # Surface feature analysis (Study 1)
│   └── 07_llm_criterion_inversion_annotation.ipynb  # Criterion inversion coding & analysis
├── data/                                            # ← COPY HERE!
│   ├── literary_materials/
│   ├── responses/
│   └── logs/
├── criterion_inversions_coding/
└── README.md
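The extraction step can be sketched in Python; paths are illustrative, and the layout check simply confirms that data/ ends up alongside analysis/ as shown in the tree above:

```python
import zipfile
from pathlib import Path

def extract_data(zip_path: str, repo_root: str) -> Path:
    """Extract data.zip so that data/ sits alongside the analysis/ folder."""
    root = Path(repo_root)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(root)
    data_dir = root / "data"
    # The archive is expected to contain a top-level data/ folder with
    # responses/, logs/, and literary_materials/ inside it.
    if not (data_dir / "responses").is_dir():
        raise RuntimeError("data/responses/ not found; check the archive layout")
    return data_dir
```

If the archive unzips to a differently named folder, rename or move it so the notebooks' relative paths resolve.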
Folder Descriptions
responses/ → core response data organized by evaluator type and processing stage:
==> input for notebook #1 (01_data_quality_analysis.ipynb)
- human_evaluators/raw/ → original participant data (fully anonymized) collected through the web platform
  - questionnaire.csv → demographics, attention checks, and screening responses
  - responses.csv → all participant evaluations across experimental conditions
-------------------------------------------------------------------------------------------------------
==> generated by notebook #1 (01_data_quality_analysis.ipynb), used by notebook #2 (02_human_bias_analysis.ipynb)
- human_evaluators/processed/ → quality-filtered human participant datasets
  - valid_participants_dataset.csv → participants meeting inclusion criteria (N=556)
  - excluded_participants_dataset.csv → excluded participants with exclusion reasons
-------------------------------------------------------------------------------------------------------
==> generated by notebooks #3 and #3b (03_run_ai_simulation.ipynb, 03b_run_ai_simulation_expanded.ipynb), used by notebooks #4 and #4b (04_ai_bias_analysis.ipynb, 04b_ai_bias_analysis_extended.ipynb)
- ai_evaluators/raw/ → AI model simulation data
  - Study 1 (single AI generator):
    - ai_participant_simulation_20250805_113946.csv → complete AI evaluations across all 13 models (13 models × 30 styles × 3 conditions, 3 separate runs)
  - Study 1 robustness replications:
    - ai_participant_simulation_20250915_210237-awardwinningai-t07.csv → alternative labeling ("award-winning AI")
    - ai_participant_simulation_20250916_195700-symmetricallanguage-t07.csv → symmetric labeling (t=0.7)
    - ai_participant_simulation_20250917_211949-symmetricallanguage-t00.csv → symmetric labeling, deterministic (t=0)
  - Study 2 (cross-model design):
    - ai_participant_simulation_expanded.csv → cross-model evaluations (14 evaluators × 14 creators × 30 styles × 3 conditions, 1 run)
-------------------------------------------------------------------------------------------------------
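The design cells above imply fixed expected row counts for the raw simulation CSVs. A minimal sanity check a re-user might run before analysis (pure arithmetic, assuming one row per evaluation; no column names are assumed):

```python
# Expected evaluation counts implied by the experimental designs described above.
STUDY1 = {"evaluators": 13, "styles": 30, "conditions": 3, "runs": 3}
STUDY2 = {"evaluators": 14, "creators": 14, "styles": 30, "conditions": 3, "runs": 1}

def expected_rows(design: dict) -> int:
    """Multiply all design factors to get the expected number of evaluations."""
    total = 1
    for factor in design.values():
        total *= factor
    return total

print(expected_rows(STUDY1))  # 3510 rows expected in the Study 1 CSV
print(expected_rows(STUDY2))  # 17640 rows expected in the Study 2 CSV
```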
==> generated by notebooks #2 (02_human_bias_analysis.ipynb) & #4 (04_ai_bias_analysis.ipynb), used by notebook #5 (05_comparative_analysis.ipynb)
- processed_for_comparison/ → final analysis-ready datasets for cross-evaluator comparison
  - human_responses_processed.csv → processed human evaluation data
  - human_summary_for_comparison.csv → human bias summary statistics
  - ai_responses_processed.csv → AI evaluation data formatted for comparison
  - ai_summary_for_comparison.csv → AI bias summary statistics
  - human_experiment_stats.json & ai_condition_stats.json → experimental condition metadata
-------------------------------------------------------------------------------------------------------
logs/ → processing logs of AI evaluator simulations:
- ai_evaluators/ → API execution logs from the AI simulations
  - Study 1:
    - ai_participant_simulation_20250805_113946.log
    - ai_participant_simulation_20250915_210237-awardwinningai-t07.log
    - ai_participant_simulation_20250916_195700-symmetricallanguage-t07.log
    - ai_participant_simulation_20250917_211949-symmetricallanguage-t00.log
  - Study 2:
    - ai_participant_simulation_expanded.log
-------------------------------------------------------------------------------------------------------
literary_materials/
- Study 1:
  - experimental_stimuli.xlsx → GPT-4-generated stories (30 styles)
- Study 2:
  - experimental_stimuli_expanded.xlsx → multi-model generated stories (14 AI creators × 30 styles)

The experimental stimuli files contain structured literary content across 30 style categories, enabling full replication of the pipeline. The datasets include AI-generated stories from multiple contemporary language models, giving researchers the flexibility to experiment with alternative literary materials. For complete replication of the original evaluation study, the placeholder literary content should be replaced with the corresponding excerpts from Raymond Queneau's Exercises in Style (2012 New Directions edition). This arrangement enables full reproduction of the findings while respecting copyright restrictions: the placeholder text demonstrates the experimental methodology without redistributing protected content.
-------------------------------------------------------------------------------------------------------
==> used by notebook #7 (07_llm_criterion_inversion_annotation.ipynb)
criterion_inversions_coding/ → systematic coding of AI evaluator rationales for criterion inversion analysis
- llm_coders/ → compressed JSONL annotation files from three independent LLM coder models
  - criterion_inversion_annotations__openai__gpt-5.2.jsonl.zip → 5,880 paired rationale annotations
  - criterion_inversion_annotations__openai__gpt-4.1.jsonl.zip → 5,880 paired rationale annotations
  - criterion_inversion_annotations__anthropic__claude-sonnet-4.6.jsonl.zip → 5,880 paired rationale annotations

Each record contains structured annotations for one paired rationale (open-label + counterfactual) using a six-criterion codebook (C1–C6): criterion presence, valence toward Version A and Version B (+1/0/−1), and grounding evidence snippets. All three coders annotated all pairs independently at temperature 0, blind to experimental conditions. Decompress the archives before use.
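A decompressed JSONL line can be consumed along these lines; the field names below are hypothetical (inspect one real record and adapt accordingly):

```python
import json

# Hypothetical schema -- the actual JSONL field names may differ.
sample_line = (
    '{"pair_id": "style07", "criteria": {'
    '"C1": {"present": true, "valence_a": 1, "valence_b": -1, "evidence": "..."},'
    '"C2": {"present": false, "valence_a": 0, "valence_b": 0, "evidence": ""}}}'
)

def count_inversions(record: dict) -> int:
    """Count criteria whose valence flips sign between Version A and Version B."""
    return sum(
        1
        for ann in record["criteria"].values()
        if ann["present"] and ann["valence_a"] * ann["valence_b"] < 0
    )

record = json.loads(sample_line)
print(count_inversions(record))  # 1 (C1 flips from +1 to -1)
```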
human_audit/ → human inter-coder reliability data (98 paired rationales, 2 independent coders)
- human_audit_coderA.csv → annotations from human coder A (98 pairs)
- human_audit_coderB.csv → annotations from human coder B (98 pairs)
- human_audit_combinedAB.csv → merged file with both coders' annotations side by side (196 rows)
Two researchers independently coded a stratified sample of 98 paired rationales using the same codebook and blinding protocol as the LLM coders. These files serve as the human reference set for validating the LLM coding pipeline.
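For the reliability check itself, Cohen's kappa over the two coders' codes is the standard statistic; a self-contained sketch using made-up binary presence codes (not the actual audit data):

```python
def cohen_kappa(a: list, b: list) -> float:
    """Cohen's kappa for two coders' categorical codes of equal length."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Made-up presence codes for one criterion (1 = present, 0 = absent):
coder_a = [1, 1, 0, 0, 1, 0]
coder_b = [1, 0, 0, 0, 1, 1]
print(round(cohen_kappa(coder_a, coder_b), 3))  # 0.333
```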
Individual files are also provided separately for direct access to specific datasets.
Funding and Support
This work was supported by the Princeton Language and Intelligence (PLI) Seed Grant Program. Research conducted by Wouter Haverals and Meredith Martin at Princeton University's Center for Digital Humanities.
Files
ai_condition_stats.json
Files
(96.0 MB)
| Name | Size | Download all |
|---|---|---|
|
md5:1cb8b2584db307888b1f48f04166be31
|
856 Bytes | Preview Download |
|
md5:ea4a80834ff5b01c8478019747d3e324
|
7.5 MB | Preview Download |
|
md5:6613fe905f94fe5182ce6d062d15445a
|
1.6 MB | Download |
|
md5:4a1ad09ce7750e280389e208de38e7fa
|
6.4 MB | Preview Download |
|
md5:976426e8bf83b9be4bc865c446b7b7a1
|
1.3 MB | Download |
|
md5:5abe4f4740ae208da0ff80bba76c1092
|
6.2 MB | Preview Download |
|
md5:01e1080f71532b7c237762b58dd02390
|
1.4 MB | Download |
|
md5:e3e2d39f6d0f9c6406b062b9c92a6e4f
|
6.2 MB | Preview Download |
|
md5:6fcfa10aae85a78708ca956b9d475e8b
|
1.5 MB | Download |
|
md5:3e9b16ac7cb6a7e592c2f6c4e8588c74
|
18.5 MB | Preview Download |
|
md5:8a44139eb5159de2e0ad3b9b68c85526
|
4.8 MB | Download |
|
md5:8c5fe88028fac82a39b7b0188cfe8e5d
|
4.7 MB | Preview Download |
|
md5:de77c24288991c34b5be962531182079
|
150 Bytes | Preview Download |
|
md5:4b1180d1ccee4df65b9013dcb3bff70d
|
4.3 MB | Preview Download |
|
md5:3cfe63e35c04140a5e89c5adf5e28053
|
3.7 MB | Preview Download |
|
md5:3337620c3076c45590588aef78455cf3
|
3.1 MB | Preview Download |
|
md5:44f29429d2f3557fcf51e5beb6434768
|
11.2 MB | Preview Download |
|
md5:a0c018bfa945fbbccad4fad6ac1a3094
|
197.0 kB | Preview Download |
|
md5:57b7ebe03900cc4d62c7b4a8f7be0532
|
51.5 kB | Download |
|
md5:f804856e42148eebaa9a6caa3c581cfe
|
162.0 kB | Download |
|
md5:e0e03e7589b051530c0acb1db1c5f942
|
92.1 kB | Preview Download |
|
md5:6d7900433563496424078671885a913d
|
135.1 kB | Preview Download |
|
md5:2477d89ae8599978e19284bc09fa0b88
|
223.1 kB | Preview Download |
|
md5:afaee92effc32899954eb0f486c4a422
|
435 Bytes | Preview Download |
|
md5:988ab49075e01765b0a7a2a2b14a6976
|
3.6 MB | Preview Download |
|
md5:c7fc9b1c615cec205584d68a11938ac8
|
157 Bytes | Preview Download |
|
md5:2d5e3b7362d573fb808c9022ac3b3d8a
|
172.0 kB | Preview Download |
|
md5:ff82f0c72e7896e45a45c4b6abbe323c
|
3.9 MB | Preview Download |
|
md5:a6af6431b1ce1286363a1bf00312c984
|
5.2 MB | Preview Download |
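Downloaded files can be verified against the published md5 checksums with a short streaming hash; the file path and expected digest in the comment are placeholders to be substituted from the file table:

```python
import hashlib

def md5_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through MD5 and return its hex digest."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Placeholder values -- substitute the real path and checksum from the file table:
# assert md5_of("path/to/downloaded/file") == "<md5 from the table>"
```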
Additional details
Software
- Repository URL: https://github.com/WHaverals/style_and_prejudice