There is a newer version of the record available.

Published June 6, 2026 | Version v3

Modern large language models

Authors/Creators

Description

License notice

This record is licensed under the Apache License 2.0.

SPDX-License-Identifier: Apache-2.0

Copyright (c) 2025-2026 Stanislav Volokhovych.

The deposited materials were authored and controlled by the sole copyright holder, Stanislav Volokhovych.

Previous inconsistent license metadata associated with this record was unintended and has been corrected. The current official license metadata for this record is Apache License 2.0.

Notes

Code, scripts, and analysis results: https://github.com/ngscode23/latent-space-shift-research

This record contains scripts, runbooks, readout documents, and metric archives for an empirical study of context-induced latent-state shifts in large language models.

The main research question is whether a coherent context merely changes the final visible answer of a model, or whether it moves the model into a different measurable internal state during prompt processing and generation.

The package includes Gemma3-12B-IT hidden-state geometry experiments, target/control/shuffle comparisons, component-axis construction (`x_full`, `x_content`, `x_order`, `x_order_orth`), generation-trajectory metrics, causal component interventions, SAE feature readouts, SAE decoder-direction steering runs, final next-token KL, teacher-forced per-token KL, and Qwen replication material.
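
To make the two KL readouts named above concrete, here is a minimal sketch. It assumes the usual definitions: final next-token KL compares the next-token distributions at the last position under two contexts, and teacher-forced per-token KL repeats that comparison at every position of a fixed continuation. The function names, the KL direction (first argument relative to second), and the toy logits are hypothetical and not taken from the repository; the study's own scripts may compute these quantities differently.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p_logits, q_logits):
    """KL(P || Q) in nats between two next-token distributions given as logits."""
    p = softmax(p_logits)
    q = softmax(q_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def per_token_kl(p_logits_seq, q_logits_seq):
    """KL at each position when both runs are fed the same continuation tokens."""
    return [kl_divergence(p, q) for p, q in zip(p_logits_seq, q_logits_seq)]

# Toy logits over a 3-token vocabulary at 2 positions (assumed values).
target_run = [[2.0, 0.5, -1.0], [0.1, 0.1, 0.1]]
control_run = [[1.0, 1.0, -1.0], [0.1, 0.1, 0.1]]

kls = per_token_kl(target_run, control_run)
final_kl = kls[-1]  # KL at the last position only
```

In practice the logits would come from two forward passes of the same model over identical continuation tokens, one with the target context and one with the control context.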

Version 3 adds a clean review package with organized scripts and documentation, together with separate metric archives for audit.

 

Technical info

Version 3 adds a clean review package and the corresponding metric archives for the context-induced latent-state shift project.

This version includes:

1. A clean review package
   - main navigation files (`README_FIRST.md`, `README.md`, `START_HERE.md`);
   - experiment scripts and runbooks under `experiments/`;
   - Gemma3 Grade4 hidden-geometry scripts;
   - SAE candidate discovery, scale calibration, and decoder-direction steering scripts;
   - dense `x_order_orth` axis steering script;
   - selected post-hoc analysis tools under `scripts/analysis_tools/latent_gpu_rapids_analysis/`;
   - English/Russian readout documents for the Gemma3 Grade4 + SAE line.

2. Metric archives
   - Gemma3 Grade4 hidden-geometry / SAE metric packages;
   - Qwen replication metric packages;
   - Gemma SAE decoder-direction steering metric packages;
   - full CSV/ZIP outputs needed for audit and reproduction of the reported readouts.
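
The component axes and the dense `x_order_orth` steering script listed above can be illustrated with a small sketch. It rests on two assumptions that the record does not confirm: that `x_order_orth` is the part of `x_order` orthogonal to `x_content` (a single Gram-Schmidt step), and that steering means adding a scaled unit direction to a hidden-state vector. The helper names `orthogonalize` and `steer`, the scale `alpha`, and the toy vectors are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthogonalize(v, reference):
    """Remove from v its projection onto reference (one Gram-Schmidt step)."""
    ref_sq = dot(reference, reference)
    if ref_sq == 0.0:
        return list(v)
    coeff = dot(v, reference) / ref_sq
    return [x - coeff * r for x, r in zip(v, reference)]

def steer(hidden, direction, alpha):
    """Add alpha times the unit-normalized direction to a hidden-state vector."""
    length = math.sqrt(dot(direction, direction))
    if length == 0.0:
        raise ValueError("steering direction must be non-zero")
    return [h + alpha * d / length for h, d in zip(hidden, direction)]

# Toy 3-dimensional axes (assumed values, not measured ones).
x_content = [1.0, 0.0, 0.0]
x_order = [0.5, 2.0, 0.0]

# Assumption: the orthogonal axis is x_order with its x_content part removed.
x_order_orth = orthogonalize(x_order, x_content)

# Push a toy hidden state along the orthogonalized axis.
steered = steer([1.0, 1.0, 1.0], x_order_orth, alpha=3.0)
```

Normalizing the direction before scaling keeps `alpha` comparable across axes of different norm, which is one common reason to calibrate steering scale separately from axis construction.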

The clean review package is intended as the main entry point for readers. The metric archives are included separately so that the repository-style package remains readable while the full evidence trail is still available for inspection.

Suggested reading order:
1. `README_FIRST.md`
2. `START_HERE.md`
3. `experiments/gemma3_grade4_sae_academic_readout/`
4. `experiments/grade4_axis_decomposition_gemma/RUNBOOK.md`
5. `experiments/steering/sae_gemma_qwen/RUNBOOK.md`
6. Metric ZIP files for detailed audit

Main GitHub review branch:
https://github.com/ngscode23/latent-space-shift-research/tree/review/gemma3-latent-shift-clean

Large metric package folder:
https://drive.google.com/drive/folders/1Zl9iY33Lmwz3VuOATWx4jup-cE7TJ7TJ?usp=drive_link

Abstract

Modern large language models may not regulate behavior primarily through isolated refusals, local token suppression, or shallow instruction following. Instead, they appear capable of entering internally organized discourse-level regimes: distributed latent states that shape how the model reasons, frames conclusions, allocates caution, tolerates asymmetry, performs neutrality, and structures epistemic authority.

These regimes do not behave like simple lexical priming effects. Evidence suggests that they persist across neutral conversational turns, survive arbitrary neutral relabeling, systematically alter downstream reasoning style, concentrate in late-layer representation geometry, and depend only partially on explicit alignment vocabulary. The strongest effects appear to come not from safety keywords themselves but from higher-order rhetorical topology: pressure cadence, procedural framing, asymmetry structure, institutional tone, and discourse-level authority signals.

This suggests that prompting is not merely instruction transmission; it may function as state induction. Under this view, many apparently separate phenomena in aligned LLMs (caution drift, procedural overreach, sycophancy, disclaimer inflation, neutrality performance, refusal persistence, jailbreak sensitivity, and style locking) may be manifestations of transitions between latent discourse-policy manifolds.

In this picture, alignment is no longer well described as a modular wrapper placed on top of an otherwise independent intelligence system. Instead, alignment may reshape the topology of the model’s representational space itself, globally reorganizing discourse behavior rather than only filtering outputs. This would explain why alignment effects often appear entangled with reasoning style, directness, specificity, decisiveness, and institutional tone. The model is not merely “prevented” from saying certain things; its generative dynamics may already be reorganized around different discourse attractors.

If true, this changes the effective unit of analysis for language models. The relevant object is no longer just the token, the instruction, the refusal, or the output distribution. It becomes the discourse regime itself: a temporary but structured representational configuration governing epistemic posture, rhetorical organization, procedural behavior, and judgment style across time. This reframes prompt engineering as latent-state induction rather than keyword optimization, jailbreaks as transitions between attractor regimes rather than simple filter bypasses, and alignment as geometry engineering rather than purely policy engineering.

The implication is not that language models possess beliefs, intentions, or consciousness. Rather, large sequence learners may naturally develop metastable high-level representational modes that functionally resemble cognitive framing states: transient global configurations that persist, influence future reasoning, and organize behavior across otherwise unrelated tasks.

If this interpretation is correct, the central scientific challenge of alignment shifts fundamentally. The problem is no longer merely “Which outputs should the model refuse?” but “Which latent discourse regimes exist inside the model, how are they induced, how stable are they, how do they interact, and how do they reshape reasoning itself?” In that sense, alignment may ultimately be less about constraining outputs and more about shaping the geometry of cognition-like generative states inside large language models.

Files

gemma_sae_steering_fast_readout_3tasks.zip

Files (249.3 MB)

- md5:0572ececbb9d836918cc75b4774401f7 (3.6 MB)
- md5:b221bdaefa76257c630821838837a8d3 (16.6 kB)
- md5:de019cf884330737fd054904961f0089 (889.0 kB)
- md5:6ad8c677c8c6c0c0845cd6d9f05abdf5 (1.3 MB)
- md5:88872bb5c14aefbdd59cc6597c728218 (517.0 kB)
- md5:bebf196d6030bd56660509b9bb034dd6 (3.0 MB)
- md5:a07cd94230a1d98223e098e454e72a75 (57.0 MB)
- md5:27f6147e0bbe60d32a5daac013fc3392 (182.9 MB)
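
The MD5 checksums listed above can be used to confirm that a downloaded archive is intact before auditing its contents. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_record` are hypothetical, and the self-check at the end runs on a throwaway file rather than on any file from this record.

```python
import hashlib
import os
import tempfile

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_record(path, listed_checksum):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed_checksum.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected

# Self-check on a temporary file whose MD5 is well known (the bytes b"hello").
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.bin")
    with open(demo_path, "wb") as handle:
        handle.write(b"hello")
    demo_ok = matches_record(demo_path, "md5:5d41402abc4b2a76b9719d911017c592")
```

For a real download, pass the local path of the archive and the corresponding `md5:` string from the file list to `matches_record`.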

Additional details

Software

Programming language: Python

Development status: Active