Replication Data for: The SEO Floor: Measuring Google Rank Distribution of AI-Cited Pages
Description
This package contains the full data, code, protocol, and pre-registration log for Study A: "The SEO Floor — Measuring Google Rank Distribution of AI-Cited Pages" (Lee, 2026). The companion manuscript is available as a separate Zenodo record (linked under "Related identifiers"). Pre-registration is filed at OSF: DOI 10.17605/OSF.IO/FMSRD.
Study summary. We tested whether Google ranking is a prerequisite for AI citation across four production AI platforms (ChatGPT, Perplexity, Claude, Google AI Mode) using a comparison-pool design that addresses the Berkson's-paradox concern in prior cited-only research. We collected 100,411 AI citation events across 2,000 user queries spanning 14 verticals, pulled Google's top-100 SERPs for the same queries (~213,000 SERP rows), assembled a comparison pool of 165,661 unique URLs (cited + uncited), crawled them all via Playwright, and fit mixed-effects logistic regressions on the resulting 114,729 (URL, query) observations with query random intercepts.
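For orientation, here is a minimal sketch of that model family in lme4. The variable names (`cited`, `rank_band`, `geo_z`, `query_id`, `h2_data`) are illustrative assumptions, not the exact columns used in the bundle's scripts; see code/ for the canonical analysis.

```r
## Sketch of the H2 mixed-effects logistic regression (illustrative names).
## `cited` is the 0/1 outcome per (URL, query) row; `rank_band` is a factor
## of SERP-rank bands; `geo_z` is the pre-registered GEO composite. The
## query random intercept absorbs per-query differences in baseline
## citation probability.
library(lme4)

fit <- glmer(
  cited ~ rank_band + geo_z + (1 | query_id),
  data   = h2_data,
  family = binomial(link = "logit")
)

## Odds ratios with Wald 95% CIs (per Addendum 8)
exp(cbind(OR = fixef(fit), confint(fit, method = "Wald", parm = "beta_")))
```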
Headline findings. (1) 75.4% of citation events go to pages outside Google's top 30 in aggregate — but per-page, top-3 pages are 7.82× more likely to be cited than rank 11–30 pages, while rank 31–100 pages are roughly a quarter as likely as rank 11–30 pages (95% CI excludes 1.0 throughout, p<0.0001). The 75% aggregate is a denominator artifact; SEO ranking dominates per-page citation odds with a ~34× range, as the toy calculation below illustrates. (2) A pre-registered seven-feature GEO composite adds a small but statistically robust independent effect above SEO rank (Z-sum OR=1.06, PCA-1 OR=1.15 per 1 SD), driven primarily by schema markup (OR=1.31). Six of the seven content features show positive associations after rank conditioning. (3) Most "deep-tier" citations are one-shot, sprawling across 43,000 unique URLs and 16,000 domains, with sharp platform divergence in tolerance for user-generated content (Claude: 0.6% UGC in the deep tier; Perplexity: 24%). (4) ~1% of citations go to URLs Google has dropped from its index entirely (CommonCrawl-confirmed on the open web), 84% of which were cited within the corpus's 30-day window — direct evidence of live retrieval beyond Google's index.
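The denominator artifact in finding (1) is easy to reproduce with toy numbers. Every figure in the sketch below is invented for illustration; none comes from the study's data.

```r
## Toy illustration of the denominator artifact: a small pool of top-ranked
## pages with high per-page citation odds can still account for a minority
## of total citation events when the deep pool is vastly larger.
n_top  <- 3 * 2000     # roughly three top-3 slots per query (invented)
n_deep <- 150000       # rest of the comparison pool (invented)
p_top  <- 0.20         # per-page citation probability, top 3 (invented)
p_deep <- 0.01         # per-page citation probability, deep tier (20x lower)

cites_top  <- n_top * p_top                  # 1200 events
cites_deep <- n_deep * p_deep                # 1500 events
cites_deep / (cites_top + cites_deep)        # ~0.56: most events are deep-tier,
                                             # despite 20x lower per-page odds
```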
Contents of this record.
- code/ — 47 analysis scripts (Python pipeline for data assembly, R scripts for lme4::glmer mixed-effects logistic regression, and a PowerShell orchestrator). Includes the canonical Princeton + Experiment-M feature extractors imported verbatim per pre-registration §5.2 (princeton_replication/step1_extract_princeton_features.py, experiment_M/extract_features.py, experiment_M/crawl_pages.py).
- protocol/ — Pre-registered protocol v2.1 and the complete pre-registration log with nine documented addenda (date-range correction, API-only data exclusions, ecom rescue, Method-B κ scope changes, ChatGPT κ deviation, H3 reframing, Time Epoch drop, Wald CIs, feature-pipeline correction).
- paper/ — Manuscript draft (markdown source).
- data/inputs/ — Locked 2,000-query analysis set (sha256 5c22fcc1ea7315feaaf9970ec8ab3e8f22d9dacce938d9807cc220a30685f3c8), Gemini 3.1 Flash Lite keyword reformulations, and the VPS citation snapshot.
- data/analytic/ — Final tiered citation corpus (100,411 events); Google + Bing SERP pool (213,814 rows); Tier 5 confirmed URLs verified via direct CommonCrawl CDX bypass (810 URLs); Cohen's κ per-event comparison (397 events); Playwright-crawled feature data (115,248 URLs with usable feature data after Princeton processing); the regression-ready H2 dataset (114,729 (URL, query) observations); the final H2 regression results (JSON + markdown); and the archived flawed-features run (preserved per Addendum 9 for transparency).
- MANIFEST.json — SHA256 hash of every file in the bundle (a verification sketch follows this list).
- README.md — Replication guide with phase-by-phase command sequences.
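A minimal manifest-verification sketch in R. It assumes (this is an assumption, not documented here) that MANIFEST.json is a flat JSON object mapping relative file paths to hex SHA256 digests; check the actual file for its schema and adapt accordingly.

```r
## Sketch: verify bundle contents against MANIFEST.json. Assumes a flat
## JSON object of the form {"path/to/file": "sha256hex", ...}.
library(jsonlite)
library(digest)

manifest <- fromJSON("MANIFEST.json")
for (path in names(manifest)) {
  got <- digest(path, algo = "sha256", file = TRUE)  # hash the file on disk
  if (!identical(got, manifest[[path]])) {
    warning("hash mismatch: ", path)
  }
}
```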
Reproducibility. All analysis is fully reproducible from the locked input data via code/scripts/run_h2_all.ps1 (PowerShell orchestrator) running on R 4.5.3 + lme4 + jsonlite + dplyr + broom.mixed + optimx. Total wall time on commodity hardware: ~7 hours. The git tag study-a-v1.0.2 in the bundle's repository identifies the analysis state at this submission.
Methodological notes. Pre-registration was filed before Phase 2 data collection. All deviations from the filed protocol are documented in nine addenda on the OSF project wiki (osf.io/w76y8) and reproduced verbatim in protocol/study_a_preregistration_log.md. The most consequential are Addendum 7 (the Time Epoch covariate was dropped because the corpus spans only ~10 weeks and uncited observations carry no timestamp), Addendum 8 (Wald CIs replace the pre-registered profile-likelihood CIs for tractability at n=114k, with statistically negligible difference; see the sketch below), and Addendum 9 (the feature post-processor was rewritten to import the Princeton/Experiment-M extractors directly after a pre-registration compliance failure was caught in internal review; the flawed-vs-corrected coefficient comparison is documented).
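For reference, the two CI methods Addendum 8 trades between, applied to a fit like the one sketched earlier (illustrative, not the bundle's exact call):

```r
## Addendum 8 in miniature: profile CIs re-fit the model along each fixed
## effect and become expensive at n = 114k rows; Wald CIs use the estimated
## curvature at the optimum and are near-instant.
ci_wald    <- confint(fit, method = "Wald",    parm = "beta_")
ci_profile <- confint(fit, method = "profile", parm = "beta_")  # slow at scale
```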
Files
| Name | Size | MD5 |
|---|---|---|
| study_a_reproducibility_v1.0.2.zip | 106.1 MB | md5:458b756748e0eba674e696bf3512d480 |
Additional details
Related works
- Is supplemented by: 10.17605/OSF.IO/FMSRD (DOI; OSF pre-registration)
Dates
- Collected: 2026-02-04 / 2026-04-17 (citation-corpus effective span per Addendum 1; 99.96% of events fall in this range)
- Submitted: 2026-04-20 (OSF pre-registration filing date; finalized 2026-04-21)
- Created: 2026-04-25 (final reproducibility bundle build; git tag study-a-v1.0.2)