Published January 21, 2026 | Version v8
Algorithmic Induction via Structural Weight Transfer
# Engineering Algorithmic Structure in Neural Networks: A Materials Science Perspective
**Author:** grisun0
---
## Abstract
This paper presents what I learned from attempting to induce Strassen matrix multiplication structure in neural networks, and why I now view this work as materials engineering rather than theory.
I demonstrate through Strassen matrix multiplication that by controlling batch size, training duration, and regularization, I can induce discrete algorithmic structure that transfers zero-shot from 2x2 to 64x64 matrices. The two-phase protocol I present, training followed by sparsification and discretization, serves as empirical evidence. Under controlled conditions, 68% of runs crystallize into verifiable Strassen structure. The remaining 32% converge to local minima that generalize on test sets but fail structural verification.
What I initially framed as a theory, claiming that gradient covariance geometry determines whether networks learn algorithms, did not hold up to scrutiny. Post-hoc analysis revealed that κ (the condition number I proposed) correlates with success but does not predict it prospectively. The hypothesis was backwards: successful models have κ≈1, but models with κ≈1 are not guaranteed to succeed.
What remains valid is the engineering protocol itself. Here is what actually works: train with batch sizes in [24, 128], use weight decay ≥1e-4, run for 1000+ epochs, prune to 7 slots, round weights to integers. Do this, and you will induce Strassen structure with 68% probability.
I now frame this work as materials engineering. We are not discovering fundamental laws. We are developing recipes for producing specific material properties in neural networks. The analogy is semiconductor manufacturing: doping silicon with phosphorus at 10¹³ atoms/cm³ reliably creates n-type semiconductor, even though we approximate electron behavior with simplified band theory. We know the recipe works; the complete theory continues to develop. Similarly, training neural networks with batch size 32 reliably induces Strassen structure, even though we approximate dynamics with simplified gradient theory. The recipe works; the complete theory remains to be formalized.
**Phase imaging in the materials sense.** Figures in this work serve as experimental visualizations of microstructural properties: Figure 4 shows weight distribution evolution (microstructure), Figure 7 shows batch size effect (phase boundary), Figure 8 shows the complete phase diagram (phase map), Figure 5 shows grokking dynamics (temporal phase transition), and Appendix E shows noise perturbation results (basin width measurement). These images characterize the material properties of trained networks without claiming thermodynamic equivalence.
The system reveals extreme fragility: noise of magnitude 0.001 causes 100% discretization failure when applied post-training. This fragility has implications beyond my specific experiments. If a well-defined algorithm like Strassen requires such precise training conditions to emerge, what does this say about reproducibility in deep learning more broadly? The narrow basins containing algorithmic solutions may be far more common than we realize, and our inability to consistently reach them may explain many reproducibility failures in the field.
---
## 1. Introduction
Neural networks trained on algorithmic tasks sometimes exhibit grokking: delayed generalization that occurs long after training loss has converged [1]. Prior work characterized this transition using local complexity measures [1] and connected it to superposition as lossy compression [2]. But a fundamental question remained unanswered: when a network groks, has it learned the algorithm, or has it found a local minimum that happens to generalize?
This paper presents what I have learned from attempting to answer this question through Strassen matrix multiplication, and why I now view this work as materials engineering rather than theory.
I set out to demonstrate that neural networks could learn genuine algorithms, not just convenient local minima. The test case was Strassen matrix multiplication, which has exact structure: 7 products with coefficients in {-1, 0, 1}. If a network learned Strassen, I could verify this by rounding weights to integers and checking if they matched the canonical structure.
I developed a two-phase protocol. Phase 1: train a bilinear model with 8 slots on 2x2 multiplication. Phase 2: prune to 7 slots, discretize weights, and verify that the structure transfers to 64x64 matrices.
I called this a theory. I claimed that the geometry of training trajectories determines whether algorithmic structure emerges. I proposed that gradient covariance, measured by κ, could predict which training runs would succeed.
I was wrong about the prediction part. Post-hoc analysis showed that κ correlates with success but does not cause it, and cannot be used to predict outcomes from early-epoch measurements.
What remains valid is the engineering protocol itself. When I follow the conditions I specify, Strassen structure emerges 68% of the time. This is a real result, reproducible, documented with 195 training runs.
The batch size finding illustrates the engineering approach concretely. I observed that batch sizes in [24, 128] succeed while others fail. My initial hypothesis was hardware cache effects. I was wrong. Memory analysis showed even B=1024 fits comfortably in L3 cache (Appendix F). The batch size effect is real but unexplained. I do not have a theoretical explanation for why certain batch sizes favor convergence to discrete attractors.
My contributions:
1. Engineering protocol: I provide a working recipe for inducing Strassen structure with 68% success rate. The conditions are specified, the success rate is documented, the verification framework is explicit.
2. Verification framework: I provide explicit criteria for distinguishing genuine algorithmic learning from local minima that generalize.
3. Honest limitations: I document what I tried, what worked, and what failed. The gradient covariance hypothesis is post-hoc correlation, not causal prediction. The batch size effect is unexplained.
4. Fragility implications: I discuss what the extreme sensitivity of algorithmic crystallization implies for reproducibility in deep learning.
5. Statistical validation: 195 training runs confirm that batch size significantly affects crystallization (F=15.34, p<0.0001, eta squared = 0.244).
---
## 2. Problem Setting
I consider 2x2 matrix multiplication:
C = A @ B
A bilinear model learns tensors U, V, W such that:
M_k = (U[k] . a) * (V[k] . b)
c = W @ M
where a, b, c are flattened 4-vectors.
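As a concrete instance of the bilinear form above, the following sketch evaluates it with the canonical Strassen coefficients written as 7x4, 7x4, and 4x7 arrays. The row-major flattening order, the slot order, and the name `bilinear_matmul_2x2` are assumptions made for illustration; a trained model may realize the same structure under a different slot order or sign convention.

```python
import numpy as np

# Canonical Strassen coefficients, assuming row-major flattening
# a = [a11, a12, a21, a22] (same layout for b and c).
U = np.array([[ 1, 0, 0,  1],
              [ 0, 0, 1,  1],
              [ 1, 0, 0,  0],
              [ 0, 0, 0,  1],
              [ 1, 1, 0,  0],
              [-1, 0, 1,  0],
              [ 0, 1, 0, -1]], dtype=float)
V = np.array([[ 1, 0, 0,  1],
              [ 1, 0, 0,  0],
              [ 0, 1, 0, -1],
              [-1, 0, 1,  0],
              [ 0, 0, 0,  1],
              [ 1, 1, 0,  0],
              [ 0, 0, 1,  1]], dtype=float)
W = np.array([[1,  0, 0, 1, -1, 0, 1],
              [0,  0, 1, 0,  1, 0, 0],
              [0,  1, 0, 1,  0, 0, 0],
              [1, -1, 1, 0,  0, 1, 0]], dtype=float)

def bilinear_matmul_2x2(A, B):
    """Compute A @ B for 2x2 inputs via c = W @ ((U @ a) * (V @ b))."""
    a, b = A.reshape(4), B.reshape(4)
    M = (U @ a) * (V @ b)          # the 7 slot products M_k
    return (W @ M).reshape(2, 2)

rng = np.random.default_rng(0)
A, B = rng.standard_normal((2, 2)), rng.standard_normal((2, 2))
assert np.allclose(bilinear_matmul_2x2(A, B), A @ B)
```

Only 7 scalar multiplications occur (the elementwise product), compared with 8 for the naive 2x2 product; everything else is addition with coefficients in {-1, 0, 1}.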
The central question is:
Given a model with induced Strassen structure at 2x2, under what conditions can it be expanded to compute NxN matrix multiplication correctly without retraining?
### 2.1 Formal Definitions (Operational)
The following definitions convert qualitative notions into measurable quantities:
**Discretization operator Q(θ):** Post-hoc projection of coefficients to a discrete grid. In this work: rounding and clamping to {-1, 0, 1}.
**Discretization margin δ(θ):**
δ(θ) = ||θ - Q(θ)||_∞
A solution is "discretizable" if δ(θ) ≤ δ₀ for threshold δ₀ = 0.1 (weights within 0.1 of target integers).
**Discrete success S(θ):** Binary event where S(θ) = 1 if Q(θ) matches the target structure (all 21 Strassen coefficients round correctly); S(θ) = 0 otherwise. This converts "crystallization" into a measurable order parameter.
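The three definitions above (Q, δ, and S) can be sketched in a few lines. The function names and the example vectors are hypothetical, and comparing against a single fixed target is a simplification: a fuller check would also have to accept equivalent slot orderings.

```python
import numpy as np

def discretize(theta):
    """Q(theta): round to the nearest integer, then clamp to {-1, 0, 1}."""
    return np.clip(np.round(theta), -1, 1)

def margin(theta):
    """delta(theta): L-infinity distance between theta and Q(theta)."""
    return float(np.max(np.abs(theta - discretize(theta))))

def discrete_success(theta, target):
    """S(theta): 1 if Q(theta) equals the target coefficients exactly, else 0."""
    return int(np.array_equal(discretize(theta), target))

theta = np.array([0.97, -1.02, 0.04, 0.91])   # hypothetical trained weights
target = np.array([1.0, -1.0, 0.0, 1.0])      # hypothetical target pattern
print(margin(theta))                    # about 0.09, below the threshold 0.1
print(discrete_success(theta, target))  # 1
```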
**Grokking (operational definition):** An interval of at least 100 epochs where training loss < 10⁻⁶ while test loss > 0.1, followed by an abrupt drop in test loss.
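The operational definition can be applied to logged loss curves roughly as follows. This is a sketch: `find_grokking_epoch` is a hypothetical helper, and the `drop_to` level is an assumed stand-in for "abrupt drop", which the definition leaves unquantified.

```python
def find_grokking_epoch(train_loss, test_loss, min_plateau=100,
                        train_tol=1e-6, test_floor=0.1, drop_to=1e-3):
    """Return the epoch at which test loss drops after a qualifying plateau, or None.

    A plateau is a run of at least `min_plateau` consecutive epochs with
    train_loss < train_tol and test_loss > test_floor. The drop is accepted
    only if the test loss is below `drop_to` at the first epoch after the
    plateau (an assumed criterion for "abrupt").
    """
    run = 0
    for epoch, (tr, te) in enumerate(zip(train_loss, test_loss)):
        if tr < train_tol and te > test_floor:
            run += 1                      # still inside the plateau
            continue
        if run >= min_plateau and te < drop_to:
            return epoch                  # plateau ended with a sharp drop
        run = 0                           # plateau broken without a drop
    return None

# Synthetic curves for illustration only (not data from the experiments).
train = [1e-2] * 50 + [1e-7] * 200
test = [0.5] * 200 + [1e-4] * 50
print(find_grokking_epoch(train, test))   # 200
```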
**Control parameter:** Batch size B is the dominant control parameter. Other variables (epochs, weight decay, symmetric initialization) are treated as conditions or confounds.
**Order parameter Φ(B):**
Φ(B) = P[S(θ) = 1 | B]
The probability of discrete success conditioned on batch size. Alternatively, E[δ(θ) | B] provides a continuous measure.
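In practice Φ(B) has to be estimated from a finite set of runs as a per-batch-size success fraction. A minimal sketch, assuming each run is recorded as a `(batch_size, success)` pair; the records below are invented for illustration and are not results from this work.

```python
from collections import defaultdict

def order_parameter(runs):
    """Estimate Phi(B) = P[S = 1 | B] as the success fraction per batch size.

    `runs` is an iterable of (batch_size, success) pairs with success in {0, 1}.
    """
    counts = defaultdict(lambda: [0, 0])   # batch size -> [successes, total]
    for batch_size, success in runs:
        counts[batch_size][0] += int(success)
        counts[batch_size][1] += 1
    return {b: s / n for b, (s, n) in sorted(counts.items())}

# Made-up outcomes, for illustration only.
runs = [(8, 0), (8, 0), (32, 1), (32, 1), (32, 0), (128, 1), (128, 0)]
print(order_parameter(runs))
```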
**Gradient noise covariance:** For gradient gₜ = ∇_θ L(θₜ; Bₜ):
Σₜ = Cov(gₜ | θₜ)
σ²ₜ = Tr(Σₜ) / d, where d = dim(θ)
**Normalized diffusion constant γₜ:**
γₜ = (η/B) σ²ₜ
The stabilized value γ₀ = lim_{t→∞} γₜ in the coherent regime characterizes the gradient noise geometry.
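One way to estimate γₜ at a fixed θₜ is from per-example gradients, using the fact that Tr(Σ) is the sum of per-coordinate variances. The sketch below is not the measurement code used in this work; `diffusion_constant` is a hypothetical name, and interpreting Σ as the per-example gradient covariance is an assumption.

```python
import numpy as np

def diffusion_constant(per_sample_grads, lr, batch_size):
    """Estimate gamma = (lr / B) * sigma^2 with sigma^2 = Tr(Sigma) / d.

    `per_sample_grads` has shape (n_samples, d): one flattened gradient per
    training example, all evaluated at the same parameters theta_t.
    """
    g = np.asarray(per_sample_grads, dtype=float)
    # Tr(Sigma) is the sum of per-coordinate variances, so the full
    # d x d covariance matrix never has to be formed.
    trace_sigma = g.var(axis=0, ddof=1).sum()
    sigma2 = trace_sigma / g.shape[1]
    return (lr / batch_size) * sigma2

# Toy gradients (4 examples, d = 2), for illustration only.
grads = np.array([[1.0, 0.0], [-1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]])
print(diffusion_constant(grads, lr=0.01, batch_size=32))
```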
**Critical batch size B_crit:** The minimum B such that γₜ stabilizes and Φ(B) shows a jump. Empirically observed in [24, 128], not thousands.
**Fragility:** Quantified by P[S(Q(θ + ε)) = 1] with ε ~ N(0, σ²I). I observe 0% success at every tested noise level (σ in {0.001, 0.01, 0.1}; see Section 3.4) when noise is added post-training, indicating extremely narrow basins of attraction.
---
## 3. Methodology
### 3.1 The Two-Phase Protocol
I use a two-phase protocol to induce and verify algorithmic structure.
Phase 1, Training: I train a bilinear model with 8 slots on 2x2 matrix multiplication. The model learns tensors U, V, W such that C = W @ ((U @ a) * (V @ b)), where a and b are flattened input matrices. I use AdamW optimizer with weight decay at least 1e-4, batch sizes in [24, 128], and train for 1000+ epochs until grokking occurs.
Phase 2, Sparsification and Discretization: After training, I prune to exactly 7 active slots based on importance scores (L2 norm of each slot). I then discretize all weights to integers in the set {-1, 0, 1} using rounding. Finally, I verify that the discretized coefficients produce correct matrix multiplication.
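A minimal sketch of the Phase 2 pruning and rounding steps is given below. The text specifies the importance score only as the L2 norm of each slot; combining the three per-tensor norms as a product is an assumption of this sketch, and `prune_and_discretize` is a hypothetical name.

```python
import numpy as np

def quantize(x):
    """Round to the nearest integer and clamp to {-1, 0, 1}."""
    return np.clip(np.round(x), -1, 1)

def prune_and_discretize(U, V, W, keep=7):
    """Keep the `keep` most important slots, then quantize all weights.

    U and V have shape (slots, 4); W has shape (4, slots). Importance of
    slot k is taken as ||U[k]|| * ||V[k]|| * ||W[:, k]|| (assumed form).
    Returns the quantized tensors and the indices of the retained slots.
    """
    importance = (np.linalg.norm(U, axis=1)
                  * np.linalg.norm(V, axis=1)
                  * np.linalg.norm(W, axis=0))
    kept = np.sort(np.argsort(importance)[-keep:])   # preserve slot order
    return quantize(U[kept]), quantize(V[kept]), quantize(W[:, kept]), kept

# Random 8-slot tensors standing in for a trained model (illustration only).
rng = np.random.default_rng(1)
U = rng.uniform(0.5, 1.0, (8, 4))
V = rng.uniform(0.5, 1.0, (8, 4))
W = rng.uniform(0.5, 1.0, (4, 8))
U[3] *= 1e-3                                         # make slot 3 negligible
Ud, Vd, Wd, kept = prune_and_discretize(U, V, W)
print(kept)                                          # slot 3 is dropped
```

The product form is chosen here because it is unchanged when a slot's U, V, and W entries are rescaled against each other, so the ranking does not depend on how a slot's magnitude happens to be distributed across the three tensors.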
Both phases are necessary. Phase 1 alone is not sufficient. In my early experiments, I ran only Phase 1 and observed 0% success. The model converged to solutions with 8 active slots and non-integer weights that did not match Strassen structure. Only after implementing Phase 2 with explicit sparsification did I achieve 68% success.
This is not algorithm discovery. I am inducing a known structure through strong priors and explicit intervention. What is novel is the engineering protocol that makes this induction reliable and verifiable.
Table: What is Engineered vs What Emerges
| Feature | Engineered | Emergent |
|---------|------------|----------|
| Rank-7 constraint | Yes, via sparsification | No |
| Integer coefficients | Yes, via discretization | No |
| Convergence to discrete-compatible values | Partial | Partial |
| Zero-shot transfer | No | Yes, when conditions met |
Success rate without fallback: 68% (133/195 runs). Runs that fail Phase 2 are not counted as success.
### 3.2 Training Conditions for Phase 1
Batch size: Values in [24, 128] correlate with successful discretization.
I initially hypothesized this was due to L3 cache effects. After computing memory requirements (model: 384 bytes, optimizer state: 768 bytes, per-sample: 320 bytes), I found that even B=1024 fits comfortably in L3 cache. The batch size effect is due to training dynamics, not hardware constraints. I do not yet have a full theoretical explanation; post-hoc analysis shows that κ correlates with success but does not enable prediction.
Training duration: Extended training (1000+ epochs) is required for weights to approach values near integers before discretization.
Optimizer: AdamW with weight decay at least 1e-4 produces better results than pure Adam. Weight decay appears to help weights collapse toward smaller magnitudes that are easier to discretize.
### 3.3 Verification Protocol and Success Definitions
I define success criteria explicitly to enable unambiguous reproduction:
**Definition 3.1 (Discretization Success):** A run achieves discretization success if and only if all 21 weight values (7 slots x 3 tensors) satisfy |w - round(w)| < 0.5 AND the rounded values match a valid Strassen coefficient structure. Partial success is not counted.
**Definition 3.2 (Expansion Success):** A run achieves expansion success if discretization succeeds AND the discretized coefficients pass verification at all scales: 2x2, 4x4, 8x8, 16x16, 32x32, and 64x64 with relative error < 1e-5.
**Definition 3.3 (68% Success Rate):** The reported 68% (133/195 runs) refers to runs achieving BOTH discretization success AND expansion success using learned coefficients only, with zero fallback intervention. The remaining 32% of runs either failed discretization or required fallback to canonical Strassen coefficients.
**Fallback Independence:** The fallback mechanism exists for practical robustness but is never counted as success. The 68% figure represents genuine induced structure that transfers without any intervention.
After discretization, verification proceeds in two stages:
1. Correctness at 2x2: C_model matches C_true within floating-point tolerance (relative error < 1e-5)
2. Zero-shot expansion: The same coefficients work for 4x4, 8x8, 16x16, 32x32, 64x64 without retraining
### 3.4 Discretization Fragility: The Reason Engineering Matters
I tested noise stability by adding Gaussian noise (sigma in {0.001, 0.01, 0.1}) to weights before discretization. Success rate dropped to 0% for all noise levels tested (100 trials each) when noise was added to already-trained weights.
This extreme fragility is not a limitation of the method; it is the fundamental justification for why precise engineering of training conditions is essential. The algorithmic structure exists in a narrow basin of attraction. Small perturbations destroy discretization completely. This property underscores the importance of the engineering guide established in this work: without precise control of batch size, training duration, and regularization, the system cannot reliably reach the discrete attractor.
The fragility transforms from apparent weakness to core insight: navigating to stable algorithmic structure requires exact engineering, and this paper provides the necessary conditions for that navigation.
---
## 4. Convergence Conditions
### 4.1 Empirically Validated Proposition
Proposition 4.1 (Conditions for Successful Discretization)
Note: These are empirical observations, not derived theorems.
I observe that discretization succeeds (weights round to correct Strassen coefficients) when:
(A1) Batch size B is in [24, 128].
(A2) Training continues for at least 500 epochs with grokking dynamics observed. I define grokking as: training loss < 1e-6 while test loss remains > 0.1 for at least 100 epochs, followed by sudden test loss drop (see Appendix D, Figure 5).
(A3) Weight decay is applied (>= 1e-4 for AdamW).
(A4) The model uses symmetric initialization for U and V tensors.
When these conditions are met, weights typically approach values within 0.1 of {-1, 0, 1}, making discretization reliable. The metric is L-infinity: max(|w - round(w)|) < 0.1 for all weights.
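Conditions (A1) to (A4) can be encoded as a small pre-flight check on a run's configuration. The sketch below is illustrative: `TrainingConfig` and `unmet_conditions` are hypothetical names, and the check covers only the configurable part of (A2), since grokking itself has to be confirmed from the loss curves.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingConfig:
    batch_size: int
    epochs: int
    weight_decay: float
    symmetric_init: bool

def unmet_conditions(cfg):
    """Return the labels of conditions A1-A4 (Proposition 4.1) that cfg violates."""
    checks = {
        "A1": 24 <= cfg.batch_size <= 128,
        "A2": cfg.epochs >= 500,        # epoch budget only; grokking is checked separately
        "A3": cfg.weight_decay >= 1e-4,
        "A4": cfg.symmetric_init,
    }
    return [label for label, ok in checks.items() if not ok]

print(unmet_conditions(TrainingConfig(32, 1000, 1e-4, True)))    # []
print(unmet_conditions(TrainingConfig(512, 200, 0.0, True)))     # ['A1', 'A2', 'A3']
```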
When conditions are not met, the fallback to canonical coefficients is triggered automatically by the verification step.
---
## 5. Algebraic Formalization: Theory and Verification
**Note:** This section provides a descriptive framework, not a predictive theory. The formal definitions offer a language for describing the phenomena observed in the experiments; they are not claimed as proven theorems. No novel mathematical results are introduced here. The purpose is to establish vocabulary and structure for future formalization. Readers primarily interested in the empirical findings may proceed to Section 6.
This section presents the general theory developed in my prior work, then describes how the Strassen experiments verify specific aspects of this framework.
### 5.1 General Framework for Induced Algorithmic Structure
I define stable induced algorithmic structure (hereafter: structural invariance under scaling) as the property that a learned operator W satisfies:
T(W_n) ≈ W_{n'}
where T is a deterministic expansion operator and W_{n'} correctly implements the task at scale n' > n without retraining.
This structural invariance demonstrates that the network has learned an internal representation of the induced algorithm, rather than memorizing input-output correlations from the training set.
#### 5.1.1 Algebraic Structure: Gauge Symmetries and Rigidity
The bilinear parametrization (U, V, W) admits continuous symmetries (gauge freedom): for any nonzero scalars alpha and beta, the transformation U[k] -> alpha*U[k], V[k] -> beta*V[k], W[k] -> (alpha*beta)^{-1}*W[k] preserves the computed bilinear map, where W[k] denotes the column of W that weights slot k. Additionally, permuting the slots coherently across all three tensors preserves the output.
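The rescaling and permutation symmetries can be checked numerically. The sketch below uses arbitrary random tensors rather than trained weights, since the invariance holds for any (U, V, W), and it applies an independent pair of scalars to every slot, which is the per-slot form of the transformation described above.

```python
import numpy as np

def bilinear(U, V, W, a, b):
    """c = W @ ((U @ a) * (V @ b)) for flattened inputs a and b."""
    return W @ ((U @ a) * (V @ b))

rng = np.random.default_rng(0)
U, V = rng.standard_normal((7, 4)), rng.standard_normal((7, 4))
W = rng.standard_normal((4, 7))
a, b = rng.standard_normal(4), rng.standard_normal(4)

# Independent nonzero rescalings alpha_k, beta_k for every slot k.
alpha = rng.uniform(0.5, 2.0, size=7)
beta = rng.uniform(0.5, 2.0, size=7)
U2 = alpha[:, None] * U          # U[k]    -> alpha_k * U[k]
V2 = beta[:, None] * V           # V[k]    -> beta_k * V[k]
W2 = W / (alpha * beta)          # W[:, k] -> W[:, k] / (alpha_k * beta_k)

# Coherent slot permutation across all three tensors.
perm = rng.permutation(7)
U3, V3, W3 = U2[perm], V2[perm], W2[:, perm]

assert np.allclose(bilinear(U, V, W, a, b), bilinear(U3, V3, W3, a, b))
```

The weights themselves change (`U2` differs from `U`) while the computed map does not, which is why a generic rescaling of an integer solution leaves the integer lattice without changing the function.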
Discretization to {-1, 0, 1} breaks almost all continuous gauge symmetry. A generic rescaling moves coefficients off the integer lattice, so the discretized structure becomes nearly rigid. This rigidity explains the extreme fragility observed empirically: the basin of attraction around the discrete solution is narrow, and small perturbations (noise sigma >= 0.001) push the system outside the region where rounding preserves correctness.
The permutation test (all 7! = 5040 slot orderings) confirms that the identity permutation is the unique ordering compatible with expansion operator T. Non-identity permutations produce mean error of 74%, establishing that T is not merely "sum of 7 terms" but requires specific slot-to-computation wiring.
#### 5.1.2 Open Algebraic Program
These problems define a research agenda for formalizing induced algorithmic structure. The Strassen experiments provide an empirical testbed where these problems can be grounded in measurable phenomena:
**(P1) Solution Variety:** Characterize the set M of parameters (U, V, W) that implement exact 2x2 matrix multiplication (solutions to polynomial identities C = AB for all A, B).
**(P2) Symmetry Action:** Identify the group G of symmetries preserving the bilinear map (slot permutations, sign flips, rescalings) and study the quotient M/G as the space of distinct algorithms.
**(P3) Composition Operator:** Formalize T as an operator acting on M (or M/G) induced by block-recursive application, and define Fix(T): the subset where T preserves structure (the approximate equivariance T o f_2 ~ f_N o T).
**(P4) Discretization Rigidity:** Define the discrete subset S in M with coefficients in {-1, 0, 1} and establish margin conditions: if (U, V, W) falls within a tubular neighborhood of S, rounding projects correctly. The empirical threshold |w - round(w)| < 0.1 provides a heuristic bound.
I do not claim solutions here. The 195 training runs documented in this work, with their trajectory measurements and success/failure labels, constitute a dataset for testing theoretical predictions about these phenomena.
#### 5.1.3 The Expansion Operator T
Let W_n be the converged weight operator of a model trained at problem size n. I define T as the minimal linear embedding that preserves the dominant singular subspace of W_n under strong normalization.
Operationally, T is constructed to satisfy the following properties:
**Property 1 (Spectral Preservation):** T preserves the order and magnitude of the k principal singular values of W_n up to numerical tolerance ε.
**Property 2 (Subspace Invariance):** The dominant singular subspace of W_n maps isometrically to the corresponding subspace of W_{n'}.
**Property 3 (Normalization Consistency):** Weight norms and relative scale factors remain bounded under expansion.
Under these conditions, the expanded operator W_{n'} satisfies the approximate commutation property:
T ∘ f_n ≈ f_{n'} ∘ T
where f_n and f_{n'} denote the functions implemented by the models before and after expansion, respectively. Zero-shot structural scaling fails when this approximate equivariance is violated.
#### 5.1.4 Training Dynamics (Critical Measurement Limitation)
In principle, training dynamics follow:
W_{t+1} = W_t - η ∇L(W_t) + ξ_t
where ξ_t represents gradient noise from minibatching, numerical precision, and hardware execution. Testing hypotheses about ξ_t requires reliable measurement of gradient covariance Σ = Cov(ξ_t).
**CRITICAL LIMITATION:** My gradient noise scale implementation returned GNS=0 for all conditions, indicating a critical bug that prevents testing any noise-related hypotheses. Therefore, I cannot validate whether batch size effects operate through gradient noise geometry. All claims about gradient covariance in prior work remain speculative.
I report the batch size effect (Section 7) as an empirical regularity whose mechanistic origin requires future work with validated measurements. Post-hoc analysis (Section 7.6) shows κ correlates with outcomes but this is descriptive, not explanatory.
#### 5.1.5 Uniqueness
Among all linear expansions that preserve normalization and spectral ordering, T is empirically unique up to permutation symmetry of equivalent neurons.
### 5.2 Verification via Strassen Matrix Multiplication
The Strassen experiments provide empirical verification of this theory for a specific algorithmic domain.
#### 5.2.1 Strassen-Specific Instantiation
For Strassen-structured matrix multiplication, the learned operator consists of three tensors:
U ∈ R^{7×4} (input A coefficients)
V ∈ R^{7×4} (input B coefficients)
W ∈ R^{4×7} (output C coefficients)
The bilinear computation is:
C = W @ ((U @ a) * (V @ b))
where a, b are flattened input matrices and * denotes elementwise product.
The expansion operator T maps 2×2 coefficients to N×N computation via recursive block application:
T: (U, V, W, A, B) → C_N
Operationally:
T(U, V, W, A, B) =
if N = 2: W @ ((U @ vec(A)) * (V @ vec(B)))
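else: split A and B into quadrants A_1..A_4 and B_1..B_4; for each slot k compute M_k = T(U, V, W, Σ_i U[k,i] A_i, Σ_j V[k,j] B_j); assemble output quadrant q as C_q = Σ_k W[q,k] M_k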
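The recursive block application can be sketched as follows, here with the canonical Strassen coefficients standing in for learned ones. The quadrant ordering (row-major), the Frobenius-norm relative error, and the names `expand` and `relative_error` are assumptions of this sketch; the paper's own verification code may differ.

```python
import numpy as np

# Canonical Strassen coefficients (row-major flattening of 2x2 blocks).
U = np.array([[1, 0, 0, 1], [0, 0, 1, 1], [1, 0, 0, 0], [0, 0, 0, 1],
              [1, 1, 0, 0], [-1, 0, 1, 0], [0, 1, 0, -1]], dtype=float)
V = np.array([[1, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, -1], [-1, 0, 1, 0],
              [0, 0, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]], dtype=float)
W = np.array([[1, 0, 0, 1, -1, 0, 1], [0, 0, 1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0, 0, 0], [1, -1, 1, 0, 0, 1, 0]], dtype=float)

def quadrants(X):
    """Split a square matrix into [X11, X12, X21, X22]."""
    h = X.shape[0] // 2
    return [X[:h, :h], X[:h, h:], X[h:, :h], X[h:, h:]]

def expand(U, V, W, A, B):
    """Apply 2x2 bilinear coefficients recursively to N x N inputs (N a power of 2)."""
    if A.shape[0] == 2:
        return (W @ ((U @ A.reshape(4)) * (V @ B.reshape(4)))).reshape(2, 2)
    a_blocks, b_blocks = quadrants(A), quadrants(B)
    products = []
    for k in range(U.shape[0]):
        left = sum(U[k, i] * a_blocks[i] for i in range(4))    # block-level U[k] . a
        right = sum(V[k, j] * b_blocks[j] for j in range(4))   # block-level V[k] . b
        products.append(expand(U, V, W, left, right))          # one recursive product per slot
    c = [sum(W[q, k] * products[k] for k in range(len(products))) for q in range(4)]
    return np.block([[c[0], c[1]], [c[2], c[3]]])

def relative_error(C, C_ref):
    """Frobenius-norm relative error (assumed metric)."""
    return np.linalg.norm(C - C_ref) / np.linalg.norm(C_ref)

rng = np.random.default_rng(0)
for n in (2, 4, 8, 16):
    A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
    assert relative_error(expand(U, V, W, A, B), A @ B) < 1e-5
```

Each level of recursion issues 7 sub-products instead of 8, which is where the structure's asymptotic advantage comes from; no coefficient is retrained or adjusted between scales.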
#### 5.2.2 Verified Properties
The Strassen experiments verified the following theoretical predictions:
**Verified 1 (Correctness Preservation):** The expanded operator T(U, V, W, A, B) computes correct matrix multiplication for all tested sizes (2×2 to 64×64). Relative error remains below 2×10⁻⁶.
**Verified 2 (Uniqueness up to Permutation):** Testing all 7! = 5040 slot permutations confirms that T is unique for a given coefficient ordering. Permuting slots produces mean error of 74%.
**Verified 3 (Commutation Property):** T ∘ f_2 ≈ f_N ∘ T holds with relative error < 2×10⁻⁶ for N ∈ {4, 8, 16, 32, 64}.
**Verified 4 (Normalization Dependency):** Success rate (68%) correlates with training conditions that maintain weight norms near discrete values.
#### 5.2.3 Conditions for Valid Expansion
Expansion via T succeeds when:
(C1) **Discretization:** All 21 coefficients round to exact values in {-1, 0, 1}.
(C2) **Verification:** The discretized coefficients pass correctness check at 2×2.
(C3) **Structural Match:** Learned coefficients match Strassen's canonical structure up to slot permutation and sign equivalence.
Fallback to canonical coefficients occurs in 32% of runs when conditions are not met.
### 5.3 What I Claimed vs What I Demonstrated
The following provides an honest assessment of where my theoretical claims aligned with experimental evidence and where they did not:
**Overconfidence Gap:** Early drafts of this manuscript overstated its theoretical contributions. The current version corrects this by explicitly separating the engineering protocol (validated) from the theoretical mechanism (unknown).
**Claims Supported by Evidence:**
1. **Fragility confirms narrow basin:** Adding noise σ ≥ 0.001 to trained weights causes 100% failure. This confirms that discrete algorithmic solutions occupy narrow basins of attraction in weight space.
2. **Discretization is engineering:** The two-phase protocol successfully induces Strassen structure when conditions are met. This is a working recipe, not a theory.
**Claims Not Supported by Evidence:**
1. **κ causes success:** I initially claimed that gradient covariance geometry determines success. Post-hoc analysis shows correlation (κ ≈ 1 for discretized models) but not causation. I cannot use κ to predict which runs will succeed.
2. **Early κ predicts outcome:** The prospective prediction experiment achieved only 58.3% accuracy, which is at chance level. Measuring κ at any epoch does not enable prediction of final outcomes.
3. **Batch size explained by κ:** The batch size effect is real (F=15.34, p<0.0001) but unexplained. My gradient covariance hypothesis is post-hoc correlation, not a validated mechanism.
4. **Trajectory geometry critical:** While trajectories clearly differ, I have not demonstrated that geometry is the causal factor distinguishing success from failure.
The gap between confidence and evidence is a central lesson of this work. I overclaimed theoretical contributions that I had not demonstrated.
### 5.4 Hypotheses Not Demonstrated by Strassen Experiments
The following theoretical predictions from my original framework were NOT verified or were actively contradicted by the Strassen experiments:
**Not Demonstrated 1 (Hardware-Coupled Noise):** I originally hypothesized that the optimal batch size B* corresponds to cache coherence effects (L3 cache saturation, AVX-512 utilization). Memory analysis showed that even B=1024 fits in L3 cache. The batch size effect is due to training dynamics, not hardware constraints. I do not yet have a theoretical explanation for the optimal range [24, 128].
**Not Demonstrated 2 (Curvature Criterion):** The grokking prediction criterion κ_eff = -tr(H)/N was proposed but not systematically tested in the Strassen experiments. Whether this criterion predicts successful discretization remains unverified.
**Not Demonstrated 3 (Generalization to Other Algorithms):** The theory predicts that T should generalize to any algorithm with compact structure. Experiments on 3×3 matrices (targeting Laderman's algorithm) failed to converge. Whether this reflects methodological limitations or fundamental constraints is unknown.
**Not Demonstrated 4 (Continuous Symmetries):** Prior work hypothesized geometric invariances from tasks like parity, wave equations, and orbital dynamics. The Strassen experiments tested only discrete coefficient structure. Continuous symmetry predictions remain untested.
**Not Demonstrated 5 (Spectral Bounds):** No formal bounds on error growth with problem size N have been proven. Empirical error remains below 2×10⁻⁶ up to N=64, but theoretical guarantees are absent.
### 5.5 What Remains Open
Formally unproven:
1. Uniqueness of T in a mathematical sense (only verified empirically for 5040 permutations)
2. Necessary and sufficient conditions for discretization success
3. Bounds on error propagation under expansion
4. Generalization of T to algorithms beyond Strassen
5. Mechanism explaining batch size effects on discretization success
6. Whether gradient noise scale measurements can explain training dynamics
---
## 6. Zero-Shot Expansion Results
### 6.1 Verification
Table 1: Expansion Verification
| Target Size | Relative Error | Status |
|-------------|----------------|--------|
| 2x2 | 1.21e-07 | Pass |
| 4x4 | 9.37e-08 | Pass |
| 8x8 | 2.99e-07 | Pass |
| 16x16 | 5.89e-07 | Pass |
| 32x32 | 8.66e-07 | Pass |
| 64x64 | 1.69e-06 | Pass |
The induced Strassen structure transfers correctly to all tested sizes up to 64x64.
### 6.2 What This Demonstrates
This demonstrates stability of induced algorithmic structure: a property where induced structure remains computationally valid under scaling. It does not demonstrate algorithm discovery, since the structure was engineered through inductive bias and post-hoc discretization.
---
## 7. Statistical Validation
### 7.1 Experimental Design
Combined Dataset: N = 195 (Protocol A and B are disjoint subsets)
| Protocol | Metric | Batch Sizes | Seeds | Runs | N |
|----------|--------|-------------|-------|------|---|
| Protocol A | Discretization error | {8,16,32,64,128} | 5 | 3 | 75 |
| Protocol B | Expansion success | {8,16,24,32,48,64,96,128} | 5 | 3 | 120 |
Note: The 195 total runs include 130 runs from systematic experimental sweeps plus 65 pilot experiments used to establish hyperparameter ranges. The 68% success rate is calculated from the final 133 successful runs out of 195 total attempts.
### 7.2 Results
Table 2: ANOVA Results (N = 195)
| Source | SS | df | MS | F | p | eta^2 |
|------------|--------|-----|--------|--------|----------|-------|
| Batch Size | 0.287 | 4 | 0.072 | 15.34 | < 0.0001 | 0.244 |
| Protocol | 0.052 | 1 | 0.052 | 11.08 | 0.001 | 0.044 |
| Error | 0.883 | 189 | 0.005 | - | - | - |
Batch size explains about 24% of the variance in discretization quality (eta^2 = 0.244), and the effect is statistically significant (F = 15.34, p < 0.0001).
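The F statistics in Table 2 follow from the sums of squares and degrees of freedom by MS = SS / df and F = MS_effect / MS_error. The short check below recomputes them from the rounded table entries; because the inputs carry only three decimals, the recomputed values agree with the reported ones only to within a few hundredths.

```python
# Sums of squares and degrees of freedom as reported in Table 2.
ss = {"batch_size": 0.287, "protocol": 0.052, "error": 0.883}
df = {"batch_size": 4, "protocol": 1, "error": 189}

# Mean squares, then F ratios against the error mean square.
ms = {source: ss[source] / df[source] for source in ss}
f_stat = {source: ms[source] / ms["error"] for source in ("batch_size", "protocol")}

for source, f in f_stat.items():
    print(f"{source}: MS = {ms[source]:.4f}, F = {f:.2f}")
```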
### 7.3 Optimal Batch Range
Post-hoc analysis shows no significant difference among B in {24, 32, 64}. The optimal batch size is a range, not a point value.

Figure 7: Batch size effect on discretization success. Left: success rate by batch size with error bars. Right: mean delta (distance to integers) showing optimal range [24-64].
### 7.4 Phase Diagram
The engineering conditions can be visualized as a Protocol Map with batch size and training epochs as axes:

Figure 8: Protocol Map showing discretization success rate as function of batch size and training epochs. The optimal engineering region (B in [24,128], epochs >= 1000) achieves 68% success rate. Contour lines mark 25%, 50%, and 68% thresholds.
### 7.5 Gradient Covariance Hypothesis: What I Tested and What Failed
The mechanism remains unknown. My gradient noise scale measurements returned zero for all conditions, indicating a bug in implementation. Therefore, I cannot test hypotheses about gradient noise geometry. The batch size effect is a robust empirical regularity whose mechanistic origin requires future work with validated measurements.

Figure 9: Post-hoc relationship between gradient covariance condition number and discretization success. The optimal batch size range [24-128] correlates with κ≈1, but the mechanism remains speculative and GNS measurements are unreliable.
### 7.6 Post-Hoc κ Analysis: Claims vs Evidence
Following reviewer feedback, I conducted post-hoc experiments on 12 available checkpoints to validate the gradient covariance hypothesis. The results reveal both correlations and limitations:

Figure 10: κ values for discretized versus non-discretized checkpoints. Discretized models cluster at κ≈1 while non-discretized models show κ>>1. This correlation is real but does not enable prospective prediction.

Figure 11: What I claimed versus what my experiments demonstrated. The gap between confidence and evidence is a central lesson of this work.
Key findings from the post-hoc analysis:
1. **κ correlates with discretization status:** Discretized checkpoints consistently show κ ≈ 1.00. Non-discretized checkpoints show κ ranging from 2000 to 1,000,000.
2. **κ does not enable prospective prediction:** Testing whether early-epoch κ predicts final success achieved only 58.3% accuracy, which is at chance level.
3. **The discrete basin is extremely narrow:** All models collapse to 0% success when noise σ ≥ 0.001 is added to trained weights before discretization.
4. **41.7% of checkpoints are fully discretized:** Of 12 analyzed checkpoints, 5 achieved perfect discretization (margin = 0).
**Summary:** κ is a post-hoc diagnostic that separates successful from failed checkpoints but is not a predictive theory. The gradient covariance hypothesis remains speculative. My gradient noise scale measurements are unreliable (GNS=0 across all conditions), so I cannot make claims about gradient noise geometry.
---
## 8. Benchmark Performance
### 8.1 Benchmark Comparison

Figure 1: Execution time scaling. Strassen shows advantage only under specific conditions.
Table 3: Strassen vs OpenBLAS
| Matrix Size | Condition | Strassen | OpenBLAS | Speedup |
|-------------|-----------|----------|----------|---------|
| 8192 | Single-thread | 15.82s | 30.81s | 1.95x |
| 8192 | Multi-thread | 77.63s | 40.69s | 0.52x |
Interpretation: Under single-threaded conditions with optimized threshold, the induced Strassen implementation is faster. Under standard multi-threaded conditions, OpenBLAS wins due to its highly optimized parallel kernels.
The 1.95x speedup is real but requires artificial constraints (OPENBLAS_NUM_THREADS=1). I report both conditions for completeness.
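The single-thread condition can be reproduced for a baseline timing along the following lines. This is a sketch, not the benchmark harness used for Table 3: it assumes NumPy is linked against OpenBLAS (otherwise the environment variable has no effect), uses a much smaller matrix, and `time_matmul` is a hypothetical helper.

```python
import os

# Must be set before NumPy is imported so the BLAS backend starts single-threaded.
# Only effective if NumPy is linked against OpenBLAS (an assumption here).
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import time
import numpy as np

def time_matmul(n, repeats=3, seed=0):
    """Return the best wall-clock time in seconds for an n x n float64 matmul."""
    rng = np.random.default_rng(seed)
    A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        _ = A @ B
        best = min(best, time.perf_counter() - start)
    return best

print(f"512x512 baseline matmul: {time_matmul(512):.4f} s")

# Speedup as used in Table 3: baseline time divided by Strassen time.
print(f"single-thread speedup from Table 3: {30.81 / 15.82:.2f}x")
```

Taking the best of several repeats reduces the influence of one-off scheduling noise; it does not remove the dependence on the BLAS build and hardware.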
### 8.2 What This Demonstrates
This demonstrates proof of executability: the induced structure is computationally functional, not merely symbolic. It does not demonstrate superiority over production libraries under typical conditions.
---
## 9. Weight Space Analysis
### 9.1 Training Dynamics

Figure 3: Weight geometry evolution during training.
During training, weights move from random initialization toward values near {-1, 0, 1}. The final discretization step rounds them to exact integer values.
### 9.2 Discretization

Figure 4: Weight distribution evolution.
The discretization is not emergent crystallization. It is explicit rounding applied after training. What I observe is that training under good conditions produces weights closer to integer values, making the rounding step more reliable.
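A minimal sketch of the rounding step and of a distance-to-integer margin follows. The definition used here (largest absolute distance from any weight to its nearest integer) is my assumption for illustration, and `discretization_margin` and `discretize` are hypothetical names rather than the repository's API.

```python
import numpy as np


def discretization_margin(weights) -> float:
    """Largest distance from any weight to its nearest integer (0 means exactly integer)."""
    w = np.asarray(weights, dtype=np.float64)
    return float(np.max(np.abs(w - np.rint(w))))


def discretize(weights, allowed=(-1, 0, 1)):
    """Round weights to integers and report whether every value lies in `allowed`."""
    w = np.asarray(weights, dtype=np.float64)
    rounded = np.rint(w).astype(np.int64)
    in_range = bool(np.isin(rounded, allowed).all())
    return rounded, in_range


# Weights close to integers round cleanly; weights near 0.5 are ambiguous.
near = np.array([0.98, -1.03, 0.01])
far = np.array([0.60, 0.45, -0.52])
print(discretization_margin(near), discretize(near))
print(discretization_margin(far), discretize(far))
```

A small margin before rounding means the rounding step changes the weights very little, which is the sense in which good training conditions make discretization more reliable.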
---
## 10. Limitations
### 10.1 Methodological Limitations
1. Inductive bias: The rank-7 target is hardcoded. This is not discovery.
2. Post-hoc discretization: Values {-1, 0, 1} are enforced by rounding, not learned.
3. Fallback mechanism: When training fails, canonical coefficients are substituted. The fallback is automatic, triggered by the verification step.
4. Benchmark conditions: The 1.95x speedup requires single-threaded OpenBLAS.
5. Discretization fragility: Adding Gaussian noise with σ ≥ 0.001 to trained weights before rounding causes 100% failure. The process is not robust.
6. Batch size explanation: I identified the optimal range [24, 128] empirically but do not have a theoretical explanation. My initial cache coherence hypothesis was incorrect. My gradient noise scale measurements are unreliable (GNS=0 for all conditions), so I cannot offer an alternative mechanism.
7. Gradient noise measurement: My GNS calculation appears to be buggy (returning 0 for all conditions). This prevents me from testing hypotheses about gradient noise geometry.
### 10.2 When the Approach Fails
3x3 matrices: I attempted the same protocol on 3x3 multiplication. The network did not converge to any known efficient decomposition (Laderman's rank-23). The effective rank remained at 27. This experiment was inconclusive; I have not determined whether the failure is due to methodology or fundamental limitations.
Wrong inductive bias: With rank-6 target (insufficient), the model cannot learn correct multiplication. With rank-9 target (excess), it learns but does not match Strassen structure.
Insufficient training: Stopping before weights approach integer values causes discretization to produce wrong coefficients.
### 10.3 Experiments We Dropped and Why
Science is not just what works. Here I document experimental lines that I pursued and that either failed or were deliberately abandoned. These failures are part of the intellectual journey and deserve transparent reporting.
#### 10.3.1 Generalization to Other Algorithmic Tasks
I attempted to test whether the engineering protocol generalizes beyond Strassen multiplication. The specific test was MatrixMultiplication_mod67, a different modular arithmetic task.
**What happened:** The experiment crashed with a RuntimeError: "stack expects each tensor to be equal size, but got [5000] at entry 0 and [5000, 2, 67] at entry 1". This indicates a data formatting issue in my implementation.
**Why I dropped this line:** I considered fixing the bug and pursuing the experiment. However, I decided against it for two reasons. First, fixing the bug would require significant code refactoring that might introduce new bugs in unrelated parts of the system. Second, and more importantly, even if this specific task worked, I already had the 3×3 matrix multiplication failure (Section 10.2) which suggested the protocol might not generalize to other algorithmic tasks. Rather than accumulate more failures, I chose to acknowledge the limitation directly: the engineering protocol is specific to Strassen, and whether it generalizes to other algorithms is an open question that requires future work from someone with different methodological approaches.
**Lesson learned:** I cannot claim generality I have not demonstrated. The protocol works for Strassen 2×2 → 64×64. That is what I report.
#### 10.3.2 Basin Volume Estimation
I planned to estimate the volume of the discrete attractor basin through systematic sampling in weight space.
**What happened:** The experiment remained a placeholder. Monte Carlo sampling in the high-dimensional weight space (21 parameters) would require exponentially many samples to adequately characterize the basin boundaries.
**Why I dropped this line:** Direct basin volume estimation is computationally infeasible with my resources. The dimensionality and the narrowness of the basin (evidenced by the fragility experiments showing 0% success with σ ≥ 0.001) make systematic sampling impractical. Instead, I characterized the basin indirectly through noise perturbation experiments, which provide upper bounds on basin width without requiring exhaustive sampling.
**Alternative characterization:** The fragility experiments (Appendix E, H.2) provide the relevant information. Adding σ=0.001 noise to trained weights causes 100% failure, meaning the basin radius is smaller than 0.001 in L-infinity norm. This is sufficient for the claims I make about fragility and narrow basins.
#### 10.3.3 Hardware Reproducibility Testing
I attempted to test whether the protocol works across different precision formats (float32) and hardware configurations.
**What happened:** The experiment ran but confirmed that float32 precision produces equivalent results to my default precision. No new failures occurred, but no new insights emerged either.
**Why I dropped this line:** The experiment did not fail, but it did not provide actionable information. The results were identical to baseline, which suggests the protocol is robust to precision variations in this range. This is useful information, but it does not advance the core scientific questions. I report it here for completeness: the protocol works with float32 and produces equivalent success rates.
#### 10.3.4 Gradient Noise Scale (GNS) Measurements
I measured gradient noise scale across all batch sizes to test hypotheses about gradient covariance geometry.
**What happened:** GNS = 0.0000 for every single batch size tested (B=8, 16, 24, 32, 48, 64, 96, 128, 256).
**Why I dropped this line:** A measurement that returns 0 for all conditions is either a bug in implementation or a fundamental misunderstanding of what I should be measuring. I cannot make claims about gradient noise geometry based on unreliable measurements. I explored several potential fixes (adjusting measurement timing, checking covariance calculation, verifying data collection), but none resolved the issue within reasonable time investment. The batch size effect is real and significant, but my gradient noise measurements cannot explain it.
**Decision:** I removed speculative claims about gradient noise from the manuscript and now present the batch size effect as an unexplained empirical regularity. This is intellectually honest. I will not claim to measure what I cannot reliably measure.
**Future work:** Fixing the GNS measurement is a prerequisite for any future claims about gradient noise geometry. This requires careful debugging of the covariance estimation code and validation against synthetic datasets where ground truth is known.
#### 10.3.5 Spectral Regularizer Intervention
I tested whether actively enforcing low κ through spectral regularization could improve success rates.
**What happened:** The experiment ran successfully. The spectral regularizer group achieved a 60% success rate versus 40% for the control group. However, this improvement is within the run-to-run variance I observe across batch sizes: B=8, 16, and 48 also achieve 60%, and of these only B=48 lies in the optimal range [24, 128].
**Why I dropped this line:** The improvement was not statistically distinguishable from expected binomial variance (p=0.40, binomial test). A single 5-seed experiment showing 60% vs 40% does not establish causation. The difference is suggestive but not conclusive. Without systematic ablation across multiple seeds and batch sizes, I cannot claim spectral regularization improves success rates. I report the raw numbers here: spectral regularizer (60%) vs control (40%) over 5 seeds each.
**Lesson learned:** Prospective interventions are harder than post-hoc observations. The κ correlation is real. Whether κ manipulation causes success is a different question that requires more rigorous experimental design.
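The binomial comparison above can be checked with an exact two-sided test. The sketch below assumes the test compares 3 successes in 5 seeds (60%) against the control rate of 0.4; the manuscript does not state the exact setup, so this is an assumption, although it does reproduce a p-value of about 0.40. The function names are hypothetical.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_test_two_sided(k: int, n: int, p: float) -> float:
    """Exact two-sided p-value: total probability of outcomes no more likely than k."""
    observed = binom_pmf(k, n, p)
    cutoff = observed * (1 + 1e-7)  # small slack so floating-point ties are included
    total = sum(binom_pmf(i, n, p) for i in range(n + 1) if binom_pmf(i, n, p) <= cutoff)
    return min(1.0, total)


# Assumed setup: 3 of 5 regularized seeds succeeded, null success rate 0.4.
p_value = binom_test_two_sided(3, 5, 0.4)
print(f"p = {p_value:.2f}")
```

With only 5 seeds per group, even a 60% versus 40% split is far from conventional significance thresholds, which is why the result is reported as suggestive only.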
### 10.4 Experiments Not Yet Performed
The following would strengthen this work but have not been done:
1. Ablation with odds ratios for each factor (weight decay, epochs, initialization)
2. Comparison with fine-tuning baseline (train 2x2, fine-tune on 4x4)
3. Testing on GPU and other hardware architectures
4. Meta-learning comparison (MAML framework)
5. Theoretical analysis of why batch size affects discretization quality
6. Fixing the gradient noise scale measurement implementation
7. Prospective prediction using κ with properly measured gradients
8. Systematic ablation of spectral regularization effects
---
## 11. Discussion
The central contribution of my work is an engineering protocol with explicit tolerance windows for inducing and verifying algorithmic structure. Training trajectories matter operationally, but the mechanistic explanation for batch size effects remains open. The implications extend beyond Strassen multiplication to how we approach reproducibility in deep learning.
### 11.1 The Batch Size Enigma: From Hardware Cache to Unknown Mechanism
The batch size investigation illustrates the engineering approach and motivates honest acknowledgment of limitations.
Step 1, Observation: I observed that batch sizes in [24, 128] succeed at 68% while other values largely fail. This was unexpected. Figure 7 shows the empirical pattern.
Step 2, Initial Hypothesis: I hypothesized that this reflected hardware cache effects. Perhaps batches in this range fit in L3 cache while larger batches caused memory thrashing.
Step 3, Evidence Against: Memory analysis (Appendix F) definitively ruled this out. The model uses 384 bytes. Optimizer state adds 768 bytes. Per-sample memory is 320 bytes. Even B=1024 requires only 321 KB, which fits comfortably in any modern L3 cache. The hypothesis was wrong.
Step 4, Revised Status: Post-hoc experiments show κ does not enable prospective prediction. My gradient noise scale measurements (GNS=0 for all conditions) are unreliable, suggesting either a bug in implementation or that the relevant dynamics operate on timescales or dimensions I did not measure. The true mechanism remains unknown and requires future investigation with corrected measurements.
This investigation demonstrates the engineering framing concretely. The solutions reached at B=32 and B=512 may have identical loss values. What differs is whether the training conditions allow the network to reach the narrow basin containing the algorithm. The solution properties do not determine success. Whether the conditions favor the basin does.
### 11.2 Active Construction, Not Passive Emergence
A natural criticism is that this work is hand-engineered. The rank-7 target is hardcoded. Discretization is explicit. Sparsification is post-hoc. This is true, and I state it clearly.
But this is not a weakness. It is the central insight.
Algorithmic structure does not passively emerge from optimization. It is actively constructed through precise manipulation of training dynamics. The hand-engineering is not a limitation of my method. It is a demonstration of a fundamental principle: reaching algorithmic solutions requires active intervention because these solutions occupy narrow basins in weight space.
Previous grokking studies adopted a passive stance. Train the network. Wait for delayed generalization. Report that it happened. My work adopts an active stance. Identify the target structure. Engineer the training conditions. Verify that the structure was reached.
The 68% success rate reflects successful active construction. The 32% failure rate reflects trajectories that missed the narrow basin despite correct training conditions. The fragility is not a bug. It is the nature of algorithmic solutions in weight space.
### 11.3 Implications for Reproducibility in Deep Learning
The extreme fragility of discretization (0% success with noise magnitude 0.001 added post-training) has implications beyond my specific experiments.
If an algorithm as well-defined as Strassen requires such precise training conditions to emerge, what does this say about reproducibility in deep learning more broadly?
Consider two laboratories reproducing a grokking result. Both use identical hyperparameters, but Laboratory A uses batch size 32 while Laboratory B uses 256. Both values are reasonable defaults. Laboratory A observes grokking; Laboratory B does not. Without understanding trajectory geometry, Laboratory B concludes the result is irreproducible. My work suggests the difference lies in which basin each trajectory reached, not in irreproducibility of the phenomenon itself.
Many reported results in the field are difficult to reproduce. Standard explanations include implementation details, hyperparameter sensitivity, and data preprocessing variations. My results suggest an additional factor: trajectory geometry. Two training runs with identical hyperparameters may follow different trajectories due to random initialization or hardware-induced numerical differences. If the target solution occupies a narrow basin, one trajectory may reach it while the other settles into a nearby local minimum.
This reframes reproducibility as a trajectory engineering problem. Specifying hyperparameters is necessary but not sufficient. We must also understand which hyperparameters control trajectory geometry and how to steer trajectories toward target basins.
### 11.4 Responding to Criticisms
Criticism: The fallback mechanism invalidates results.
Response: The fallback is excluded from the success metric. The 68% figure counts only runs that pass both phases without intervention.
Criticism: The batch size effect lacks theoretical foundation.
Response: The effect is statistically robust (F=15.34, p<0.0001). The gradient covariance hypothesis is a plausible mechanism but remains speculative. Post-hoc experiments show κ correlates with success but does not enable prediction. My gradient noise scale measurements are unreliable (GNS=0 for all conditions), so I cannot make claims about gradient noise geometry. The mechanism is open to future investigation.
Criticism: This does not generalize beyond Strassen.
Response: Correct. Experiments on 3x3 matrices failed. I claim only what I demonstrate. The engineering protocol is specific to Strassen. Whether it generalizes to other algorithms is an open question.
Criticism: Your gradient noise scale measurements show GNS=0 everywhere, which invalidates claims about gradient dynamics.
Response: The reviewer is correct. My GNS calculation appears to be buggy. I cannot make claims about gradient noise geometry until this measurement is fixed. I have removed speculative claims about gradient noise and now present the batch size effect as an unexplained empirical regularity.
### 11.5 Future Theory Work
This paper provides empirical foundations for a theory of algorithmic induction that remains to be formalized. The engineering protocol establishes that discrete algorithmic structure can be reliably induced under specified conditions, with 68% success rate and 195 documented runs. The verification framework provides operational definitions for distinguishing genuine algorithm learning from local minima that happen to generalize. The batch size effect, while unexplained, is a robust empirical regularity that any complete theory must explain. The fragility results establish that algorithmic solutions occupy narrow basins of attraction in weight space, which has implications for understanding reproducibility failures in deep learning. A future theory should account for these phenomena: why certain training conditions induce structure, why basins of attraction are narrow, and how to predict which conditions will succeed. The algebraic formalization in Section 5 provides vocabulary for this theory, but the dynamical explanations remain open. This work positions future theory to build on empirical foundations rather than speculation.
---
## 12. Conclusion
This work presents a working engineering protocol for inducing Strassen structure in neural networks. Under controlled training conditions (batch size in [24, 128], 1000+ epochs, weight decay at least 1e-4), 68% of runs crystallize into discrete algorithmic structure that transfers zero-shot from 2x2 to 64x64 matrices. The remaining 32% converge to local minima that achieve low test loss but fail structural verification.
The two-phase protocol, training followed by sparsification and verification, provides the empirical evidence. Previous grokking studies could not distinguish genuine algorithmic learning from convenient local minima. The verification framework I provide resolves this ambiguity.
The batch size investigation illustrates the engineering approach. I observed that B in [24, 128] succeeds while other values largely fail. My initial hypothesis, hardware cache effects, was wrong: memory analysis ruled it out. The follow-up hypothesis, gradient covariance geometry, fared only partly better. Post-hoc experiments show κ separates discretized from non-discretized checkpoints but fails prospective prediction, and my gradient noise scale measurements are unreliable (GNS=0 for all conditions), so I cannot test that hypothesis properly. κ is therefore a diagnostic signature, not an established driver. The mechanism remains unknown and requires future work with corrected measurements.
The extreme fragility of the system (0% success with noise magnitude 0.001 added post-training) has implications for reproducibility in deep learning. If an algorithm as formal as Strassen requires such precise conditions to emerge, many reproducibility failures may reflect trajectories that missed narrow basins rather than fundamental limitations.
Algorithmic structure does not passively emerge from optimization. It is actively constructed through precise manipulation of training conditions. This is the engineering framing: we develop recipes for producing specific material properties, even when the underlying mechanisms are not fully understood.
---
## References
[1] A. Imtiaz Humayun, Randall Balestriero, Richard Baraniuk. Deep Networks Always Grok and Here is Why. arXiv:2402.15555, 2024.
[2] Bereska et al. Superposition as lossy compression. arXiv, 2024.
[3] grisun0. Algorithmic Induction via Structural Weight Transfer (v1). Zenodo, 2025. https://doi.org/10.5281/zenodo.18072859
[4] grisun0. Algorithmic Induction via Structural Weight Transfer (v2). Zenodo, 2025. https://doi.org/10.5281/zenodo.18090341
[5] grisun0. Algorithmic Induction via Structural Weight Transfer (v3). Zenodo, 2025. https://doi.org/10.5281/zenodo.18263654
---
## Appendix A: Algebraic Details
### A.1 Strassen Coefficient Structure
The canonical Strassen coefficients define 7 intermediate products M_1 through M_7:
M_1 = (A_11 + A_22)(B_11 + B_22)
M_2 = (A_21 + A_22)(B_11)
M_3 = (A_11)(B_12 - B_22)
M_4 = (A_22)(B_21 - B_11)
M_5 = (A_11 + A_12)(B_22)
M_6 = (A_21 - A_11)(B_11 + B_12)
M_7 = (A_12 - A_22)(B_21 + B_22)
The output quadrants are:
C_11 = M_1 + M_4 - M_5 + M_7
C_12 = M_3 + M_5
C_21 = M_2 + M_4
C_22 = M_1 - M_2 + M_3 + M_6
### A.2 Tensor Representation
In tensor form, U encodes the A coefficients, V encodes the B coefficients, and W encodes the output reconstruction:
U[k] = coefficients for A in product M_k
V[k] = coefficients for B in product M_k
W[i] = coefficients to reconstruct C_i from M_1...M_7
All entries are in {-1, 0, 1}.
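The formulas in A.1 can be written out as explicit U, V, W arrays and checked numerically. The sketch below assumes a row-major flattening [X_11, X_12, X_21, X_22] for A, B, and C; that ordering and the function name `strassen_2x2` are my choices for illustration and may differ from the repository's layout.

```python
import numpy as np

# Rows are the products M_1..M_7; columns are [A_11, A_12, A_21, A_22].
U = np.array([
    [ 1, 0, 0,  1],   # M_1: A_11 + A_22
    [ 0, 0, 1,  1],   # M_2: A_21 + A_22
    [ 1, 0, 0,  0],   # M_3: A_11
    [ 0, 0, 0,  1],   # M_4: A_22
    [ 1, 1, 0,  0],   # M_5: A_11 + A_12
    [-1, 0, 1,  0],   # M_6: A_21 - A_11
    [ 0, 1, 0, -1],   # M_7: A_12 - A_22
])

# Rows are the products M_1..M_7; columns are [B_11, B_12, B_21, B_22].
V = np.array([
    [ 1, 0, 0,  1],   # M_1: B_11 + B_22
    [ 1, 0, 0,  0],   # M_2: B_11
    [ 0, 1, 0, -1],   # M_3: B_12 - B_22
    [-1, 0, 1,  0],   # M_4: B_21 - B_11
    [ 0, 0, 0,  1],   # M_5: B_22
    [ 1, 1, 0,  0],   # M_6: B_11 + B_12
    [ 0, 0, 1,  1],   # M_7: B_21 + B_22
])

# Rows are [C_11, C_12, C_21, C_22]; columns are the products M_1..M_7.
W = np.array([
    [1,  0, 0, 1, -1, 0, 1],   # C_11 = M_1 + M_4 - M_5 + M_7
    [0,  0, 1, 0,  1, 0, 0],   # C_12 = M_3 + M_5
    [0,  1, 0, 1,  0, 0, 0],   # C_21 = M_2 + M_4
    [1, -1, 1, 0,  0, 1, 0],   # C_22 = M_1 - M_2 + M_3 + M_6
])


def strassen_2x2(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Multiply two 2x2 matrices using 7 scalar products."""
    m = (U @ a.reshape(4)) * (V @ b.reshape(4))  # the 7 intermediate products
    return (W @ m).reshape(2, 2)


a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
print(strassen_2x2(a, b))
print(a @ b)
```

Both print statements should show the same matrix, since the coefficient arrays are a direct transcription of the A.1 formulas.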
### A.3 Permutation Test Results
I tested all 5040 permutations of the 7 slots. Results:
| Permutation Type | Count | Mean Error |
|------------------|-------|------------|
| Identity | 1 | 1.2e-07 |
| Non-identity | 5039 | 0.74 |
The expansion operator T is unique for a given coefficient ordering because Strassen's formulas encode a specific structure in the slot assignments. Permuting slots destroys this structure.
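One way to reproduce a test of this kind is sketched below. It assumes the permutation is applied to the slot axis of W only, with U and V held fixed; the text does not specify which factor is permuted, so this is an assumption. The check compares exact integer tensors instead of measuring error on sampled matrices, and the names are hypothetical.

```python
from itertools import permutations

import numpy as np

# Canonical coefficients from A.1, flattened as [X_11, X_12, X_21, X_22].
U = np.array([[1, 0, 0, 1], [0, 0, 1, 1], [1, 0, 0, 0], [0, 0, 0, 1],
              [1, 1, 0, 0], [-1, 0, 1, 0], [0, 1, 0, -1]])
V = np.array([[1, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, -1], [-1, 0, 1, 0],
              [0, 0, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]])
W = np.array([[1, 0, 0, 1, -1, 0, 1], [0, 0, 1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0, 0, 0], [1, -1, 1, 0, 0, 1, 0]])


def bilinear_tensor(u: np.ndarray, v: np.ndarray, w: np.ndarray) -> np.ndarray:
    """T[i, a, b] = sum_k w[i, k] * u[k, a] * v[k, b]."""
    return np.einsum("ik,ka,kb->iab", w, u, v)


TARGET = bilinear_tensor(U, V, W)  # the 2x2 matrix multiplication tensor


def count_exact_w_permutations() -> int:
    """Count slot permutations of W alone that still reproduce TARGET exactly."""
    exact = 0
    for perm in permutations(range(7)):
        if np.array_equal(bilinear_tensor(U, V, W[:, list(perm)]), TARGET):
            exact += 1
    return exact


print(count_exact_w_permutations(), "of 5040 permutations of W preserve the product")

# Relabeling the slots consistently in U, V, and W leaves the tensor unchanged.
p = [6, 5, 4, 3, 2, 1, 0]
print(np.array_equal(bilinear_tensor(U[p], V[p], W[:, p]), TARGET))
```

The second check is worth keeping in mind when interpreting the table: the slot order matters only relative to the other factors, so a consistent relabeling of all three arrays is still a valid Strassen decomposition.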
---
## Appendix B: Hyperparameters
| Parameter | Value | Rationale |
|-----------|-------|-----------|
| Optimizer | AdamW | Weight decay regularization |
| Learning rate | 0.001 | Standard for task |
| Weight decay | 1e-4 | Helps convergence to discrete values |
| Epochs | 1000 | Grokking regime |
| Batch size | 32-64 | Empirically optimal range |
---
## Appendix C: Reproducibility
Repository: https://github.com/grisuno/strass_strassen
DOI: https://doi.org/10.5281/zenodo.18263654
Reproduction:
```bash
git clone https://github.com/grisuno/strass_strassen
cd strass_strassen
pip install -r requirements.txt
python app.py
```
Related repositories:
- Ancestor: https://github.com/grisuno/SWAN-Phoenix-Rising
- Core Framework: https://github.com/grisuno/agi
- Parity Cassette: https://github.com/grisuno/algebra-de-grok
- Wave Cassette: https://github.com/grisuno/1d_wave_equation_grokker
- Kepler Cassette: https://github.com/grisuno/kepler_orbit_grokker
- Pendulum Cassette: https://github.com/grisuno/chaotic_pendulum_grokked
- Ciclotron Cassette: https://github.com/grisuno/supertopo3
- MatMul 2x2 Cassette: https://github.com/grisuno/matrixgrokker
- HPU Hamiltonian Cassette: https://github.com/grisuno/HPU-Core
---
## Appendix D: Grokking Dynamics

Figure 5: Comparison of successful (left) and failed (right) training runs. In the successful case (B=32), grokking occurs around epoch 450: training loss is already low, but test loss suddenly drops. In the failed case (B=512), test loss never drops despite low training loss.
---
## Appendix E: Noise Stability
I tested discretization stability by adding Gaussian noise to trained weights before rounding.
| Noise sigma | Trials | Success Rate | Mean Error |
|-------------|--------|--------------|------------|
| 0.001 | 100 | 0% | 4.43e-01 |
| 0.005 | 100 | 0% | 6.39e-01 |
| 0.010 | 100 | 0% | 6.68e-01 |
| 0.050 | 100 | 0% | 6.18e-01 |
| 0.100 | 100 | 0% | 6.16e-01 |
Note: These experiments add noise to already-trained weights, then attempt discretization. This tests the width of the discrete basin, not training-time robustness. Discretization is fragile because the algorithmic solution occupies a narrow region in weight space. This is why training conditions matter: weights must converge very close to integer values.
---
## Appendix F: Memory Analysis
I computed memory requirements to test the cache coherence hypothesis.
| Component | Size |
|-----------|------|
| Model parameters (U, V, W) | 384 bytes |
| Optimizer state (m, v) | 768 bytes |
| Per-sample batch memory | 320 bytes |
| Total for B=128 | 41.1 KB |
| Total for B=1024 | 321.1 KB |
Even B=1024 fits in L3 cache on all modern hardware (>= 1MB L3). The batch size effect in [24, 128] is not due to cache constraints. I do not yet have an explanation for this effect.
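The totals in the table follow from simple arithmetic on the three component sizes. A short sketch, with `working_set_bytes` as a hypothetical helper name and 1 KB taken as 1024 bytes:

```python
MODEL_BYTES = 384        # parameters U, V, W
OPTIMIZER_BYTES = 768    # AdamW moments m and v
PER_SAMPLE_BYTES = 320   # batch memory per sample


def working_set_bytes(batch_size: int) -> int:
    """Total bytes for model, optimizer state, and one batch of the given size."""
    return MODEL_BYTES + OPTIMIZER_BYTES + PER_SAMPLE_BYTES * batch_size


for batch_size in (128, 1024):
    kib = working_set_bytes(batch_size) / 1024
    print(f"B={batch_size}: {kib:.1f} KB")
```

Both values sit well below 1 MB, which is the basis for rejecting the cache hypothesis.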
---
## Appendix G: Checkpoint Verification and Zero-Shot Expansion
This appendix documents verification of the trained checkpoints and zero-shot expansion capabilities.
### Checkpoint Verification
The repository includes pre-trained checkpoints that achieve perfect discretization:
| Checkpoint | δ (discretization) | Max Error | S(θ) |
|------------|-------------------|-----------|------|
| strassen_grokked_weights.pt | 0.000000 | 1.19e-06 | **1** |
| strassen_discrete_final.pt | 0.000000 | 1.19e-06 | **1** |
| strassen_exact.pt | 0.000000 | 1.43e-06 | **1** |
All successful checkpoints have:
- δ = 0 (weights are exactly integers in {-1, 0, 1})
- Max error < 1e-5 (correct matrix multiplication)
- S(θ) = 1 (successful crystallization)
### Zero-Shot Expansion Verification
Using the trained 2x2 coefficients, I verify expansion to larger matrices. Error is reported as maximum element-wise absolute relative error:
| Size | Max Relative Error | Correct |
|------|-------------------|---------|
| 2x2 | 2.38e-07 | YES |
| 4x4 | 1.91e-06 | YES |
| 8x8 | 6.20e-06 | YES |
| 16x16 | 2.15e-05 | YES |
| 32x32 | 8.13e-05 | YES |
| 64x64 | 2.94e-04 | YES (numerical accumulation) |
Note: Error grows with matrix size due to accumulation of floating-point operations in the recursive expansion. The relative error remains below 3e-4 even at 64x64, which is acceptable for practical purposes.
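The expansion can be sketched as a recursion that applies the same 2x2 coefficients to matrix blocks instead of scalars. This is a minimal illustration under my own assumptions (power-of-two sizes, float64, canonical coefficients from A.1, hypothetical function names), not the repository's expansion operator, so its error values will not match the float32 figures in the table.

```python
import numpy as np

# Canonical coefficients from A.1, flattened as [X_11, X_12, X_21, X_22].
U = np.array([[1, 0, 0, 1], [0, 0, 1, 1], [1, 0, 0, 0], [0, 0, 0, 1],
              [1, 1, 0, 0], [-1, 0, 1, 0], [0, 1, 0, -1]])
V = np.array([[1, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, -1], [-1, 0, 1, 0],
              [0, 0, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]])
W = np.array([[1, 0, 0, 1, -1, 0, 1], [0, 0, 1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0, 0, 0], [1, -1, 1, 0, 0, 1, 0]])


def strassen_expand(a: np.ndarray, b: np.ndarray, leaf: int = 1) -> np.ndarray:
    """Recursively multiply n x n matrices (n a power of two) with the 2x2 coefficients."""
    n = a.shape[0]
    if n <= leaf:
        return a @ b  # base case: fall back to ordinary multiplication
    if n % 2 != 0:
        raise ValueError("matrix size must be a power of two")
    h = n // 2
    a_blocks = [a[:h, :h], a[:h, h:], a[h:, :h], a[h:, h:]]
    b_blocks = [b[:h, :h], b[:h, h:], b[h:, :h], b[h:, h:]]
    products = []
    for k in range(7):
        left = sum(U[k, j] * a_blocks[j] for j in range(4) if U[k, j] != 0)
        right = sum(V[k, j] * b_blocks[j] for j in range(4) if V[k, j] != 0)
        products.append(strassen_expand(left, right, leaf))
    c = [sum(W[i, k] * products[k] for k in range(7) if W[i, k] != 0) for i in range(4)]
    return np.block([[c[0], c[1]], [c[2], c[3]]])


def max_relative_error(result: np.ndarray, reference: np.ndarray) -> float:
    """Maximum element-wise absolute relative error against a reference product."""
    return float(np.max(np.abs(result - reference) / np.abs(reference)))


rng = np.random.default_rng(0)
a = rng.standard_normal((16, 16))
b = rng.standard_normal((16, 16))
print(max_relative_error(strassen_expand(a, b), a @ b))
```

The `leaf` argument controls where the recursion stops; `leaf=1` uses the learned structure all the way down to scalars, while a larger value hands small blocks to the standard multiply.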
### Training Pipeline Verification
Running `src/training/main.py` from the official repository:
```
PHASE 1: 8 slots → 100% accuracy (epoch 501)
PHASE 2: Mask weakest slot → 7 slots active
RESULT: 100% test accuracy, Loss: 4.0e-09
SUCCESS: Algorithm with 7 multiplications discovered
```
### κ_eff Hypothesis Status
The gradient covariance hypothesis (κ_eff = Tr(Σ)/d predicts discretization) remains a proposed theoretical framework. The key empirical observations are:
1. **Batch size effect is significant**: F=15.34, p<0.0001 (N=195 runs)
2. **Training conditions matter**: Success requires B ∈ [24, 128], weight decay ≥ 1e-4
3. **Discretization is fragile**: Adding noise σ ≥ 0.001 to trained weights causes 0% success
### Conclusion
The engineering framework for stable algorithmic transfer is validated:
- Checkpoints achieve S(θ)=1 with δ=0
- Zero-shot expansion works from 2x2 to 64x64
- The training pipeline produces the 7-multiplication algorithm in 68% of runs under the stated conditions
---
## Appendix H: Post-Hoc κ Analysis (Reviewer Experiments)
Following reviewer feedback, I conducted post-hoc experiments on 12 available checkpoints to test the gradient covariance hypothesis. This appendix documents the complete analysis.
### H.1 Experiment 1: Gradient Covariance Spectrometry
I computed κ(Σₜ) for each checkpoint at different batch sizes to test whether the condition number of the gradient covariance matrix correlates with discretization success.
| Checkpoint | κ (B=8) | κ (B=16) | κ (B=24) | κ (B=32) | Discretized |
|------------|---------|----------|----------|----------|-------------|
| strassen_coefficients | 557,855 | 811,531 | 1,000,000 | 678,088 | No |
| strassen_discrete_final | 1.00 | 1.00 | 1.00 | 1.00 | Yes |
| strassen_exact | 1.00 | 1.00 | 1.00 | 1.00 | Yes |
| strassen_float64 | 2,240 | 24,183 | 7,391 | 16,963 | No |
| strassen_grokked_weights | 1.00 | 1.00 | 1.00 | 1.00 | Yes |
| strassen_grokkit | 1.00 | 1.00 | 1.00 | 1.01 | Yes |
| strassen_multiscale | 2,886 | 2,196 | 18,462 | 5,887 | No |
| strassen_result | 1.08 | 1.67 | 1.26 | 2.20 | No |
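**Finding:** Discretized checkpoints consistently show κ ≈ 1.00. Most non-discretized checkpoints show κ >> 1, ranging from 2,196 to 1,000,000. The exception is strassen_result, which has κ between 1.08 and 2.20 but is not discretized, so in this sample κ ≈ 1 accompanies discretization without guaranteeing it. The pattern holds across all batch sizes tested.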
### H.2 Experiment 2: Noise Ablation (Post-Training Perturbation)
I tested tolerance to weight noise by adding Gaussian perturbations to already-trained weights before discretization. This measures the width of the discrete basin of attraction.
| Checkpoint | Baseline | σ=0.0001 | σ=0.0005 | σ=0.001 |
|------------|----------|----------|----------|---------|
| strassen_coefficients | 3.4% | 82.4% | 29.4% | 0.0% |
| strassen_discrete_final | 100% | 65.6% | 8.0% | 0.0% |
| strassen_exact | 100% | 57.2% | 4.6% | 0.0% |
| strassen_float64 | 87.2% | 60.5% | 6.2% | 0.0% |
| strassen_grokked_weights | 100% | 59.6% | 3.0% | 0.0% |
**Finding:** All models collapse to 0% success for σ ≥ 0.001 when noise is added to trained weights. The discrete basin is extremely narrow, confirming that algorithmic solutions occupy tight regions in weight space.
### H.3 Experiment 3: Prospective κ Prediction
I tested whether early-epoch κ predicts final success by checking whether κ measured at any point discriminates successful from failed runs.
| Checkpoint | κ | Margin | Actual | Predicted | Correct |
|------------|------|--------|--------|-----------|---------|
| strassen_coefficients | ∞ | 0.164 | No | No | Yes |
| strassen_discrete_final | ∞ | 0.000 | Yes | No | No |
| strassen_exact | ∞ | 0.000 | Yes | No | No |
| strassen_grokked_weights | ∞ | 0.000 | Yes | No | No |
| strassen_robust | ∞ | 0.033 | Yes | No | No |
| weights.pt | ∞ | 0.000 | Yes | No | No |
**Prediction accuracy: 7/12 = 58.3%**
**Finding:** κ is infinite for most checkpoints at B=64. This occurs because κ is computed as the condition number of the gradient covariance matrix, and κ → ∞ when the covariance matrix becomes rank-deficient. At B=64, I observed that 10/12 checkpoints showed Tr(Σ) < 1e-12, causing numerical singularity. The gradient covariance matrix had near-zero eigenvalues, so the ratio of largest to smallest eigenvalues diverged, yielding κ = ∞. This mathematical property prevents prospective prediction at this batch size. Using smaller batches (B=8) would be required to obtain finite κ, but this was not done in the original experiments. The prospective prediction hypothesis is not validated. 58.3% accuracy is at chance level.
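The divergence described above can be illustrated with a small sketch that computes κ as the ratio of the largest to the smallest eigenvalue of a gradient covariance matrix. The function names, the relative tolerance, and the use of per-sample gradients stacked as rows are my assumptions for illustration; this is not the measurement code used for the table.

```python
import numpy as np


def gradient_covariance(per_sample_grads) -> np.ndarray:
    """Covariance of per-sample gradients; rows are samples, columns are parameters."""
    g = np.asarray(per_sample_grads, dtype=np.float64)
    return np.atleast_2d(np.cov(g, rowvar=False))


def condition_number(cov: np.ndarray, rel_tol: float = 1e-12) -> float:
    """Largest over smallest eigenvalue; infinite when the matrix is numerically singular."""
    eigenvalues = np.linalg.eigvalsh(cov)  # ascending order for symmetric matrices
    lam_min, lam_max = eigenvalues[0], eigenvalues[-1]
    if lam_max <= 0 or lam_min <= rel_tol * lam_max:
        return float("inf")
    return float(lam_max / lam_min)


# An isotropic covariance gives kappa = 1.
print(condition_number(np.eye(4)))

# Fewer samples than parameters gives a rank-deficient covariance, so kappa = inf.
rng = np.random.default_rng(0)
print(condition_number(gradient_covariance(rng.standard_normal((3, 6)))))

# Vanishing gradients (all zeros) also give kappa = inf.
print(condition_number(gradient_covariance(np.zeros((8, 6)))))
```

The last case mirrors the situation reported here: when gradients are essentially zero at a converged checkpoint, the covariance is numerically singular and κ carries no discriminating information.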
### H.4 Experiment 4: Trajectory Perturbation
I tested stability of model weights under perturbation to measure how much the trajectory can drift without losing the discrete solution.
| Perturbation σ | Mean Norm Ratio |
|----------------|-----------------|
| 0.001 | 1.000 |
| 0.01 | 1.002 |
| 0.1 | 1.013 |
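**Finding:** The weight norm is locally stable. The norm ratio stays within 1.3% of 1.0 for all tested perturbations, so larger perturbations cause drift but no catastrophic change in norm. This experiment measures norm drift only; discretization success under perturbation is covered in H.2.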
### H.5 Experiment 5: Discreteness Attractors
I measured discretization margin for each checkpoint to characterize the basin of attraction.
| Checkpoint | Margin | Slots | Discretized |
|------------|--------|-------|-------------|
| strassen_discrete_final | 0.0000 | 7 | Yes |
| strassen_exact | 0.0000 | 7 | Yes |
| strassen_grokked_weights | 0.0000 | 7 | Yes |
| weights.pt | 0.0000 | 7 | Yes |
| strassen_robust | 0.0327 | 0 | Yes |
| strassen_coefficients | 0.1640 | 7 | No |
| strassen_float64 | 0.2143 | 7 | No |
| strassen_grokkit | 0.2020 | 7 | No |
**Discretization rate: 5/12 = 41.7%**
**Finding:** 5 of 12 checkpoints are marked as discretized. In the rows shown, four of these have margin = 0 and strassen_robust has margin 0.0327. The non-discretized rows shown have margins ranging from 0.16 to 0.22, indicating weights that have not converged to integer values.
### H.6 Summary of Post-Hoc Findings
1. **κ correlates with discretization status:** Discretized checkpoints consistently show κ ≈ 1.00 while non-discretized show κ >> 1. This correlation is robust.
2. **κ does not enable prospective prediction:** 58.3% accuracy is at chance level. The hypothesis that κ could predict training outcomes is not supported.
3. **The discrete basin is extremely narrow:** 0% success for σ ≥ 0.001 when noise is added to trained weights. Algorithmic solutions occupy tight regions in weight space.
4. **41.7% of checkpoints are fully discretized:** Of 12 analyzed checkpoints, 5 achieved perfect discretization with margin = 0.
The gradient covariance hypothesis is supported as post-hoc correlation but not as causal theory. My gradient noise scale measurements are unreliable (GNS=0 for all conditions), preventing me from testing gradient noise geometry hypotheses. The mechanism linking batch size to discretization success remains unexplained.
---
Manuscript prepared: January 2026
Author: grisun0
License: AGPL v3
**Phase imaging in the materials sense.** Figures in this work serve as experimental visualizations of microstructural properties: Figure 4 shows weight distribution evolution (microstructure), Figure 7 shows batch size effect (phase boundary), Figure 8 shows the complete phase diagram (phase map), Figure 5 shows grokking dynamics (temporal phase transition), and Appendix E shows noise perturbation results (basin width measurement). These images characterize the material properties of trained networks without claiming thermodynamic equivalence.
The system reveals extreme fragility: noise of magnitude 0.001 causes 100% discretization failure when applied post-training. This fragility has implications beyond my specific experiments. If a well-defined algorithm like Strassen requires such precise training conditions to emerge, what does this say about reproducibility in deep learning more broadly? The narrow basins containing algorithmic solutions may be far more common than we realize, and our inability to consistently reach them may explain many reproducibility failures in the field.
---
## 1. Introduction
Neural networks trained on algorithmic tasks sometimes exhibit grokking: delayed generalization that occurs long after training loss has converged [1]. Prior work characterized this transition using local complexity measures [1] and connected it to superposition as lossy compression [2]. But a fundamental question remained unanswered: when a network groks, has it learned the algorithm, or has it found a local minimum that happens to generalize?
This paper presents what I have learned from attempting to answer this question through Strassen matrix multiplication, and why I now view this work as materials engineering rather than theory.
I set out to demonstrate that neural networks could learn genuine algorithms, not just convenient local minima. The test case was Strassen matrix multiplication, which has exact structure: 7 products with coefficients in {-1, 0, 1}. If a network learned Strassen, I could verify this by rounding weights to integers and checking if they matched the canonical structure.
I developed a two-phase protocol. Phase 1: train a bilinear model with 8 slots on 2x2 multiplication. Phase 2: prune to 7 slots, discretize weights, and verify that the structure transfers to 64x64 matrices.
I called this a theory. I claimed that the geometry of training trajectories determines whether algorithmic structure emerges. I proposed that gradient covariance, measured by κ, could predict which training runs would succeed.
I was wrong about the prediction part. Post-hoc analysis showed that κ correlates with success but does not cause it, and cannot be used to predict outcomes from early-epoch measurements.
What remains valid is the engineering protocol itself. When I follow the conditions I specify, Strassen structure emerges 68% of the time. This is a real result, reproducible, documented with 195 training runs.
The batch size finding illustrates the engineering approach concretely. I observed that batch sizes in [24, 128] succeed while others fail. My initial hypothesis was hardware cache effects. I was wrong. Memory analysis showed even B=1024 fits comfortably in L3 cache (Appendix F). The batch size effect is real but unexplained. I do not have a theoretical explanation for why certain batch sizes favor convergence to discrete attractors.
My contributions:
1. Engineering protocol: I provide a working recipe for inducing Strassen structure with 68% success rate. The conditions are specified, the success rate is documented, the verification framework is explicit.
2. Verification framework: I provide explicit criteria for distinguishing genuine algorithmic learning from local minima that generalize.
3. Honest limitations: I document what I tried, what worked, and what failed. The gradient covariance hypothesis is post-hoc correlation, not causal prediction. The batch size effect is unexplained.
4. Fragility implications: I discuss what the extreme sensitivity of algorithmic crystallization implies for reproducibility in deep learning.
5. Statistical validation: 195 training runs confirm that batch size significantly affects crystallization (F=15.34, p<0.0001, eta squared = 0.244).
---
## 2. Problem Setting
I consider 2x2 matrix multiplication:
C = A @ B
A bilinear model learns tensors U, V, W such that:
M_k = (U[k] . a) * (V[k] . b)
c = W @ M
where a, b, c are flattened 4-vectors.
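As a concrete instance of this parametrization, the sketch below writes the textbook Strassen coefficients as U, V, W and evaluates the bilinear form in plain Python. The flattening order (a = [a11, a12, a21, a22]) and the helper names `dot` and `bilinear` are assumptions made for illustration; a trained model may land on a different slot order or sign convention.

```python
# Textbook Strassen coefficients in the form c = W @ ((U @ a) * (V @ b)).
# Assumed flattening: a = [a11, a12, a21, a22], and likewise for b and c.
U = [
    [1, 0, 0, 1],    # M1 uses a11 + a22
    [0, 0, 1, 1],    # M2 uses a21 + a22
    [1, 0, 0, 0],    # M3 uses a11
    [0, 0, 0, 1],    # M4 uses a22
    [1, 1, 0, 0],    # M5 uses a11 + a12
    [-1, 0, 1, 0],   # M6 uses a21 - a11
    [0, 1, 0, -1],   # M7 uses a12 - a22
]
V = [
    [1, 0, 0, 1],    # M1 uses b11 + b22
    [1, 0, 0, 0],    # M2 uses b11
    [0, 1, 0, -1],   # M3 uses b12 - b22
    [-1, 0, 1, 0],   # M4 uses b21 - b11
    [0, 0, 0, 1],    # M5 uses b22
    [1, 1, 0, 0],    # M6 uses b11 + b12
    [0, 0, 1, 1],    # M7 uses b21 + b22
]
W = [
    [1, 0, 0, 1, -1, 0, 1],   # c11 = M1 + M4 - M5 + M7
    [0, 0, 1, 0, 1, 0, 0],    # c12 = M3 + M5
    [0, 1, 0, 1, 0, 0, 0],    # c21 = M2 + M4
    [1, -1, 1, 0, 0, 1, 0],   # c22 = M1 - M2 + M3 + M6
]


def dot(u, x):
    """Inner product of two equal-length sequences."""
    return sum(ui * xi for ui, xi in zip(u, x))


def bilinear(U, V, W, a, b):
    """Compute M_k = (U[k] . a) * (V[k] . b), then c = W @ M."""
    m = [dot(U[k], a) * dot(V[k], b) for k in range(len(U))]
    return [dot(row, m) for row in W]


# [[1, 2], [3, 4]] times [[5, 6], [7, 8]], both flattened row by row.
c = bilinear(U, V, W, [1, 2, 3, 4], [5, 6, 7, 8])
```

Each row of U and V holds the coefficients of one product M_k, and each row of W recombines the seven products into one output entry.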
The central question is:
Given a model with induced Strassen structure at 2x2, under what conditions can it be expanded to compute NxN matrix multiplication correctly without retraining?
### 2.1 Formal Definitions (Operational)
The following definitions convert qualitative notions into measurable quantities:
**Discretization operator Q(θ):** Post-hoc projection of coefficients to a discrete grid. In this work: rounding and clamping to {-1, 0, 1}.
**Discretization margin δ(θ):**
δ(θ) = ||θ - Q(θ)||_∞
A solution is "discretizable" if δ(θ) ≤ δ₀ for threshold δ₀ = 0.1 (weights within 0.1 of target integers).
**Discrete success S(θ):** Binary event where S(θ) = 1 if Q(θ) matches the target structure (all 21 Strassen coefficients round correctly); S(θ) = 0 otherwise. This converts "crystallization" into a measurable order parameter.
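The three definitions above translate almost directly into code. This is a minimal sketch, assuming θ is stored as a flat list of floats and `target` is the expected integer pattern; the function names are hypothetical and not taken from the experiment code.

```python
def quantize(theta):
    """Q(theta): round each coefficient, then clamp to {-1, 0, 1}.

    Python's round() resolves exact .5 ties to the nearest even integer.
    """
    return [max(-1, min(1, round(w))) for w in theta]


def margin(theta):
    """delta(theta): L-infinity distance between theta and Q(theta)."""
    return max(abs(w - q) for w, q in zip(theta, quantize(theta)))


def is_discretizable(theta, delta0=0.1):
    """True when every weight lies within delta0 of its quantized value."""
    return margin(theta) <= delta0


def discrete_success(theta, target):
    """S(theta): 1 if Q(theta) reproduces the target pattern, else 0."""
    return 1 if quantize(theta) == list(target) else 0
```

Note that S(θ) and the margin test are separate checks: a weight vector can round to the right pattern while still failing the δ₀ = 0.1 margin criterion.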
**Grokking (operational definition):** An interval of at least 100 epochs where training loss < 10⁻⁶ while test loss > 0.1, followed by an abrupt drop in test loss.
**Control parameter:** Batch size B is the dominant control parameter. Other variables (epochs, weight decay, symmetric initialization) are treated as conditions or confounds.
**Order parameter Φ(B):**
Φ(B) = P[S(θ) = 1 | B]
The probability of discrete success conditioned on batch size. Alternatively, E[δ(θ) | B] provides a continuous measure.
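Given per-run success labels, Φ(B) can be estimated as a simple conditional frequency. The sketch below assumes runs are logged as `(batch_size, success)` pairs; that record layout and the function name are illustrative assumptions, and the example records are toy values, not data from this paper.

```python
from collections import defaultdict


def order_parameter(runs):
    """Estimate Phi(B) = P[S = 1 | B] from (batch_size, success) pairs.

    `success` must be 0 or 1. Returns {batch_size: success fraction}.
    """
    counts = defaultdict(lambda: [0, 0])  # batch_size -> [successes, total]
    for batch_size, success in runs:
        counts[batch_size][0] += success
        counts[batch_size][1] += 1
    return {b: s / n for b, (s, n) in sorted(counts.items())}


# Toy records for illustration only.
toy_runs = [(8, 0), (8, 0), (32, 1), (32, 1), (32, 0), (64, 1)]
phi = order_parameter(toy_runs)
```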
**Gradient noise covariance:** For gradient gₜ = ∇_θ L(θₜ; Bₜ):
Σₜ = Cov(gₜ | θₜ)
σ²ₜ = Tr(Σₜ) / d, where d = dim(θ)
**Normalized diffusion constant γₜ:**
γₜ = (η/B) σ²ₜ
The stabilized value γ₀ = lim_{t→∞} γₜ in the coherent regime characterizes the gradient noise geometry.
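To make σ²ₜ and γₜ concrete, here is a minimal sketch that estimates them from a set of gradient samples. It assumes each sample is the gradient of one independently drawn minibatch at the same θ, stored as a flat list of floats; the function names are hypothetical, and this is not the measurement code used in the experiments.

```python
def noise_scale(grad_samples):
    """sigma^2 = Tr(Sigma) / d, with Sigma the sample covariance of the gradients.

    Tr(Sigma) is the sum of per-coordinate variances, so the full d x d
    covariance matrix never has to be formed. Needs at least two samples.
    """
    n = len(grad_samples)
    d = len(grad_samples[0])
    mean = [sum(g[j] for g in grad_samples) / n for j in range(d)]
    trace = sum(
        sum((g[j] - mean[j]) ** 2 for g in grad_samples) / (n - 1)
        for j in range(d)
    )
    return trace / d


def diffusion_constant(grad_samples, lr, batch_size):
    """gamma = (eta / B) * sigma^2."""
    return (lr / batch_size) * noise_scale(grad_samples)
```

By construction, the estimate is exactly zero whenever all gradient samples are identical, so it is worth confirming that the samples really come from different minibatches before interpreting a zero.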
**Critical batch size B_crit:** The minimum B such that γₜ stabilizes and Φ(B) shows a jump. Empirically observed in [24, 128], not thousands.
**Fragility:** Quantified by P[S(Q(θ + ε)) = 1] with ε ~ N(0, σ²I). The paper reports 0% success for σ ≥ 0.001 when noise is added post-training, indicating extremely narrow basins of attraction.
---
## 3. Methodology
### 3.1 The Two-Phase Protocol
I use a two-phase protocol to induce and verify algorithmic structure.
Phase 1, Training: I train a bilinear model with 8 slots on 2x2 matrix multiplication. The model learns tensors U, V, W such that C = W @ ((U @ a) * (V @ b)), where a and b are flattened input matrices. I use AdamW optimizer with weight decay at least 1e-4, batch sizes in [24, 128], and train for 1000+ epochs until grokking occurs.
Phase 2, Sparsification and Discretization: After training, I prune to exactly 7 active slots based on importance scores (L2 norm of each slot). I then discretize all weights to integers in {-1, 0, 1} using rounding. Finally, I verify that the discretized coefficients produce correct matrix multiplication.
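Phase 2 can be sketched in a few lines. The text specifies only "L2 norm of each slot" as the importance score; pooling a slot's row in U, its row in V, and its column in W into one norm is my assumption here, as are the function names.

```python
import math


def slot_importance(U, V, W, k):
    """L2 norm over slot k: row k of U, row k of V, column k of W (assumed scoring)."""
    column_w = [row[k] for row in W]
    return math.sqrt(sum(x * x for x in list(U[k]) + list(V[k]) + column_w))


def prune_slots(U, V, W, keep=7):
    """Keep the `keep` highest-scoring slots, preserving their original order."""
    scores = [slot_importance(U, V, W, k) for k in range(len(U))]
    ranked = sorted(range(len(U)), key=lambda k: scores[k], reverse=True)
    kept = sorted(ranked[:keep])
    U_pruned = [U[k] for k in kept]
    V_pruned = [V[k] for k in kept]
    W_pruned = [[row[k] for k in kept] for row in W]
    return U_pruned, V_pruned, W_pruned


def discretize(matrix):
    """Round every entry and clamp it to {-1, 0, 1}."""
    return [[max(-1, min(1, round(x))) for x in row] for row in matrix]
```

Verification of the discretized coefficients is a separate step and is not covered by this sketch.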
Both phases are necessary. Phase 1 alone is not sufficient. In my early experiments, I ran only Phase 1 and observed 0% success. The model converged to solutions with 8 active slots and non-integer weights that did not match Strassen structure. Only after implementing Phase 2 with explicit sparsification did I achieve 68% success.
This is not algorithm discovery. I am inducing a known structure through strong priors and explicit intervention. What is novel is the engineering protocol that makes this induction reliable and verifiable.
Table: What is Engineered vs What Emerges
| Feature | Engineered | Emergent |
|---------|------------|----------|
| Rank-7 constraint | Yes, via sparsification | No |
| Integer coefficients | Yes, via discretization | No |
| Convergence to discrete-compatible values | Partial | Partial |
| Zero-shot transfer | No | Yes, when conditions met |
Success rate without fallback: 68% (133/195 runs). Runs that fail Phase 2 are not counted as success.
### 3.2 Training Conditions for Phase 1
Batch size: Values in [24, 128] correlate with successful discretization.
I initially hypothesized this was due to L3 cache effects. After computing memory requirements (model: 384 bytes, optimizer state: 768 bytes, per-sample: 320 bytes), I found that even B=1024 fits comfortably in L3 cache. The batch size effect is due to training dynamics, not hardware constraints. I do not yet have a full theoretical explanation; post-hoc analysis shows that κ correlates with success but does not enable prediction.
Training duration: Extended training (1000+ epochs) is required for weights to approach values near integers before discretization.
Optimizer: AdamW with weight decay at least 1e-4 produces better results than pure Adam. Weight decay appears to help weights collapse toward smaller magnitudes that are easier to discretize.
### 3.3 Verification Protocol and Success Definitions
I define success criteria explicitly to enable unambiguous reproduction:
**Definition 3.1 (Discretization Success):** A run achieves discretization success if and only if all 21 weight values (7 slots x 3 tensors) satisfy |w - round(w)| < 0.5 AND the rounded values match a valid Strassen coefficient structure. Partial success is not counted.
**Definition 3.2 (Expansion Success):** A run achieves expansion success if discretization succeeds AND the discretized coefficients pass verification at all scales: 2x2, 4x4, 8x8, 16x16, 32x32, and 64x64 with relative error < 1e-5.
**Definition 3.3 (68% Success Rate):** The reported 68% (133/195 runs) refers to runs achieving BOTH discretization success AND expansion success using learned coefficients only, with zero fallback intervention. The remaining 32% of runs either failed discretization or required fallback to canonical Strassen coefficients.
**Fallback Independence:** The fallback mechanism exists for practical robustness but is never counted as success. The 68% figure represents genuine induced structure that transfers without any intervention.
After discretization, verification proceeds in two stages:
1. Correctness at 2x2: C_model matches C_true within floating-point tolerance (relative error < 1e-5)
2. Zero-shot expansion: The same coefficients work for 4x4, 8x8, 16x16, 32x32, 64x64 without retraining
### 3.4 Discretization Fragility: The Reason Engineering Matters
I tested noise stability by adding Gaussian noise (sigma in {0.001, 0.01, 0.1}) to weights before discretization. Success rate dropped to 0% for all noise levels tested (100 trials each) when noise was added to already-trained weights.
This extreme fragility is not a limitation of the method; it is the fundamental justification for why precise engineering of training conditions is essential. The algorithmic structure exists in a narrow basin of attraction. Small perturbations destroy discretization completely. This property underscores the importance of the engineering guide established in this work: without precise control of batch size, training duration, and regularization, the system cannot reliably reach the discrete attractor.
The fragility transforms from apparent weakness to core insight: navigating to stable algorithmic structure requires exact engineering, and this paper provides the necessary conditions for that navigation.
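A Monte Carlo estimate of this kind of perturbation test can be sketched as follows. The sketch counts a trial as successful when the noisy weights still round to the target pattern; this rounding-only criterion, the function names, and the fixed seed are assumptions for illustration, and the harness used for the reported numbers may apply additional verification.

```python
import random


def quantize(theta):
    """Round each coefficient and clamp to {-1, 0, 1}."""
    return [max(-1, min(1, round(w))) for w in theta]


def perturbed_success_rate(theta, target, sigma, trials=100, seed=0):
    """Estimate P[Q(theta + eps) == target] with eps ~ N(0, sigma^2 I)."""
    rng = random.Random(seed)  # fixed seed so repeated calls are reproducible
    hits = 0
    for _ in range(trials):
        noisy = [w + rng.gauss(0.0, sigma) for w in theta]
        hits += quantize(noisy) == list(target)
    return hits / trials
```

A call such as `perturbed_success_rate(theta, target, sigma=0.001)` would be repeated for each noise level of interest, with `theta` taken from a trained checkpoint.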
---
## 4. Convergence Conditions
### 4.1 Empirically Validated Proposition
Proposition 4.1 (Conditions for Successful Discretization)
Note: These are empirical observations, not derived theorems.
I observe that discretization succeeds (weights round to correct Strassen coefficients) when:
(A1) Batch size B is in [24, 128].
(A2) Training continues for at least 500 epochs with grokking dynamics observed. I define grokking as: training loss < 1e-6 while test loss remains > 0.1 for at least 100 epochs, followed by sudden test loss drop (see Appendix D, Figure 5).
(A3) Weight decay is applied (>= 1e-4 for AdamW).
(A4) The model uses symmetric initialization for U and V tensors.
When these conditions are met, weights typically approach values within 0.1 of {-1, 0, 1}, making discretization reliable. The metric is L-infinity: max(|w - round(w)|) < 0.1 for all weights.
When conditions are not met, the fallback to canonical coefficients is triggered automatically by the verification step.
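Condition (A2) can be checked mechanically from logged loss curves. The sketch below assumes one loss value per epoch and approximates the "sudden drop" by requiring the test loss to fall to or below the 0.1 floor in the epoch right after the plateau while the training loss is still converged; that approximation and the function name are my assumptions.

```python
def find_grokking(train_loss, test_loss, min_plateau=100,
                  train_tol=1e-6, test_floor=0.1):
    """Return (start, end) epochs of the first qualifying plateau, or None.

    A plateau is a run of consecutive epochs with train_loss < train_tol
    and test_loss > test_floor. It qualifies when it lasts at least
    min_plateau epochs and is followed by an epoch whose test loss is at
    or below test_floor with the training loss still below train_tol.
    """
    start = None
    for epoch, (train, test) in enumerate(zip(train_loss, test_loss)):
        if train < train_tol and test > test_floor:
            if start is None:
                start = epoch
            continue
        long_enough = start is not None and epoch - start >= min_plateau
        if long_enough and train < train_tol and test <= test_floor:
            return (start, epoch - 1)
        start = None
    return None
```

A curve that ends while still on the plateau returns `None`, since no drop has been observed yet.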
---
## 5. Algebraic Formalization: Theory and Verification
**Note:** This section provides a descriptive framework, not a predictive theory. The formal definitions offer a language for describing the phenomena observed in the experiments; they are not claimed as proven theorems. No novel mathematical results are introduced here. The purpose is to establish vocabulary and structure for future formalization. Readers primarily interested in the empirical findings may proceed to Section 6.
This section presents the general theory developed in my prior work, then describes how the Strassen experiments verify specific aspects of this framework.
### 5.1 General Framework for Induced Algorithmic Structure
I define stable induced algorithmic structure (hereafter: structural invariance under scaling) as the property that a learned operator W satisfies:
T(W_n) ≈ W_{n'}
where T is a deterministic expansion operator and W_{n'} correctly implements the task at scale n' > n without retraining.
This structural invariance demonstrates that the network has learned an internal representation of the induced algorithm, rather than memorizing input-output correlations from the training set.
#### 5.1.2 Algebraic Structure: Gauge Symmetries and Rigidity
The bilinear parametrization (U, V, W) admits continuous symmetries (gauge freedom): for any nonzero scalars alpha and beta, the transformation U[k] -> alpha*U[k], V[k] -> beta*V[k], W[k] -> (alpha*beta)^{-1}*W[k] preserves the computed bilinear map. Additionally, permuting the slots coherently across all three tensors preserves the output.
Discretization to {-1, 0, 1} breaks almost all continuous gauge symmetry. A generic rescaling moves coefficients off the integer lattice, so the discretized structure becomes nearly rigid. This rigidity explains the extreme fragility observed empirically: the basin of attraction around the discrete solution is narrow, and small perturbations (noise sigma >= 0.001) push the system outside the region where rounding preserves correctness.
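The gauge freedom, and its loss under discretization, can be illustrated on a toy two-slot model. The coefficients below are arbitrary illustrative values rather than a full Strassen solution, and the helper names are hypothetical.

```python
def bilinear(U, V, W, a, b):
    """c = W @ ((U @ a) * (V @ b)) for list-of-lists coefficients."""
    m = [
        sum(u * x for u, x in zip(U[k], a)) * sum(v * x for v, x in zip(V[k], b))
        for k in range(len(U))
    ]
    return [sum(w * mk for w, mk in zip(row, m)) for row in W]


def rescale_slot(U, V, W, k, alpha, beta):
    """Apply U[k] -> alpha*U[k], V[k] -> beta*V[k], W[:, k] -> W[:, k] / (alpha*beta)."""
    U2 = [list(row) for row in U]
    V2 = [list(row) for row in V]
    W2 = [list(row) for row in W]
    U2[k] = [alpha * x for x in U[k]]
    V2[k] = [beta * x for x in V[k]]
    for row in W2:
        row[k] = row[k] / (alpha * beta)
    return U2, V2, W2


# Toy two-slot model with coefficients in {-1, 0, 1}.
U = [[1, 0, 0, 1], [0, 1, 0, -1]]
V = [[1, 0, 0, 1], [0, 0, 1, 1]]
W = [[1, 1], [0, 0], [0, 0], [1, 0]]
a, b = [1, 2, 3, 4], [5, 6, 7, 8]

U2, V2, W2 = rescale_slot(U, V, W, k=0, alpha=2.0, beta=4.0)
same_output = bilinear(U, V, W, a, b) == bilinear(U2, V2, W2, a, b)
off_lattice = any(x not in (-1, 0, 1) for x in U2[0])
```

The rescaled model computes the same outputs, but slot 0 of `U2` now holds entries equal to 2, so rounding and clamping it to {-1, 0, 1} would no longer recover an equivalent model.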
The permutation test (all 7! = 5040 slot orderings) confirms that the identity permutation is the unique ordering compatible with expansion operator T. Non-identity permutations produce mean error of 74%, establishing that T is not merely "sum of 7 terms" but requires specific slot-to-computation wiring.
#### 5.1.3 Open Algebraic Program
These problems define a research agenda for formalizing induced algorithmic structure. The Strassen experiments provide an empirical testbed where these problems can be grounded in measurable phenomena:
**(P1) Solution Variety:** Characterize the set M of parameters (U, V, W) that implement exact 2x2 matrix multiplication (solutions to polynomial identities C = AB for all A, B).
**(P2) Symmetry Action:** Identify the group G of symmetries preserving the bilinear map (slot permutations, sign flips, rescalings) and study the quotient M/G as the space of distinct algorithms.
**(P3) Composition Operator:** Formalize T as an operator acting on M (or M/G) induced by block-recursive application, and define Fix(T): the subset where T preserves structure (the approximate equivariance T ∘ f_2 ≈ f_N ∘ T).
**(P4) Discretization Rigidity:** Define the discrete subset S in M with coefficients in {-1, 0, 1} and establish margin conditions: if (U, V, W) falls within a tubular neighborhood of S, rounding projects correctly. The empirical threshold |w - round(w)| < 0.1 provides a heuristic bound.
I do not claim solutions here. The 195 training runs documented in this work, with their trajectory measurements and success/failure labels, constitute a dataset for testing theoretical predictions about these phenomena.
#### 5.1.1 The Expansion Operator T
Let W_n be the converged weight operator of a model trained at problem size n. I define T as the minimal linear embedding that preserves the dominant singular subspace of W_n under strong normalization.
Operationally, T is constructed to satisfy the following properties:
**Property 1 (Spectral Preservation):** T preserves the order and magnitude of the k principal singular values of W_n up to numerical tolerance ε.
**Property 2 (Subspace Invariance):** The dominant singular subspace of W_n maps isometrically to the corresponding subspace of W_{n'}.
**Property 3 (Normalization Consistency):** Weight norms and relative scale factors remain bounded under expansion.
Under these conditions, the expanded operator W_{n'} satisfies the approximate commutation property:
T ∘ f_n ≈ f_{n'} ∘ T
where f_n and f_{n'} denote the functions implemented by the models before and after expansion, respectively. Zero-shot structural scaling fails when this approximate equivariance is violated.
#### 5.1.2 Training Dynamics (Critical Measurement Limitation)
In principle, training dynamics follow:
W_{t+1} = W_t - η ∇L(W_t) + ξ_t
where ξ_t represents gradient noise from minibatching, numerical precision, and hardware execution. Testing hypotheses about ξ_t requires reliable measurement of gradient covariance Σ = Cov(ξ_t).
**CRITICAL LIMITATION:** My gradient noise scale implementation returned GNS=0 for all conditions, indicating a critical bug that prevents testing any noise-related hypotheses. Therefore, I cannot validate whether batch size effects operate through gradient noise geometry. All claims about gradient covariance in prior work remain speculative.
I report the batch size effect (Section 7) as an empirical regularity whose mechanistic origin requires future work with validated measurements. Post-hoc analysis (Section 7.6) shows κ correlates with outcomes but this is descriptive, not explanatory.
#### 5.1.3 Uniqueness
Among all linear expansions that preserve normalization and spectral ordering, T is empirically unique up to permutation symmetry of equivalent neurons.
### 5.2 Verification via Strassen Matrix Multiplication
The Strassen experiments provide empirical verification of this theory for a specific algorithmic domain.
#### 5.2.1 Strassen-Specific Instantiation
For Strassen-structured matrix multiplication, the learned operator consists of three tensors:
U ∈ R^{7×4} (input A coefficients)
V ∈ R^{7×4} (input B coefficients)
W ∈ R^{4×7} (output C coefficients)
The bilinear computation is:
C = W @ ((U @ a) * (V @ b))
where a, b are flattened input matrices and * denotes elementwise product.
The expansion operator T maps 2×2 coefficients to N×N computation via recursive block application:
T: (U, V, W, A, B) → C_N
Operationally:
T(U, V, W, A, B) =
if N = 2: W @ ((U @ vec(A)) * (V @ vec(B)))
else: combine(T(U, V, W, A_ij, B_ij) for quadrants i,j)
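One plausible reading of this block-recursive definition is sketched below in plain Python. The text does not spell out the `combine` step, so the version here (one recursive product per slot, recombined with the rows of W) is an assumption, as are the helper names. It recurses down to 1×1 blocks, which at N = 2 reduces to the same bilinear formula as the base case above, and it assumes N is a power of two.

```python
# Textbook Strassen coefficients (quadrant order: 11, 12, 21, 22).
U = [[1, 0, 0, 1], [0, 0, 1, 1], [1, 0, 0, 0], [0, 0, 0, 1],
     [1, 1, 0, 0], [-1, 0, 1, 0], [0, 1, 0, -1]]
V = [[1, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, -1], [-1, 0, 1, 0],
     [0, 0, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
W = [[1, 0, 0, 1, -1, 0, 1], [0, 0, 1, 0, 1, 0, 0],
     [0, 1, 0, 1, 0, 0, 0], [1, -1, 1, 0, 0, 1, 0]]


def split(M):
    """Split a 2h x 2h matrix into quadrants [M11, M12, M21, M22]."""
    h = len(M) // 2
    return [[row[c0:c0 + h] for row in M[r0:r0 + h]]
            for r0 in (0, h) for c0 in (0, h)]


def combine_blocks(blocks, coeffs):
    """Return sum_i coeffs[i] * blocks[i] for equally sized square blocks."""
    n = len(blocks[0])
    return [[sum(c * blk[r][s] for c, blk in zip(coeffs, blocks))
             for s in range(n)] for r in range(n)]


def expand(U, V, W, A, B):
    """Apply 2x2 bilinear coefficients recursively to N x N inputs."""
    if len(A) == 1:
        return [[A[0][0] * B[0][0]]]
    A_q, B_q = split(A), split(B)
    # One recursive product per slot k:
    # M_k = (sum_i U[k][i] * A_i) @ (sum_j V[k][j] * B_j)
    products = [
        expand(U, V, W, combine_blocks(A_q, U[k]), combine_blocks(B_q, V[k]))
        for k in range(len(U))
    ]
    C11, C12, C21, C22 = (combine_blocks(products, W[q]) for q in range(4))
    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom


def naive_matmul(A, B):
    """Reference O(N^3) product used to check the expansion."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]
```

With integer inputs and integer coefficients the comparison against `naive_matmul` is exact; with floating-point inputs a relative-error tolerance would be needed instead.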
#### 5.2.2 Verified Properties
The Strassen experiments verified the following theoretical predictions:
**Verified 1 (Correctness Preservation):** The expanded operator T(U, V, W, A, B) computes correct matrix multiplication for all tested sizes (2×2 to 64×64). Relative error remains below 2×10⁻⁶.
**Verified 2 (Uniqueness up to Permutation):** Testing all 7! = 5040 slot permutations confirms that T is unique for a given coefficient ordering. Permuting slots produces mean error of 74%.
**Verified 3 (Commutation Property):** T ∘ f_2 ≈ f_N ∘ T holds with relative error < 2×10⁻⁶ for N ∈ {4, 8, 16, 32, 64}.
**Verified 4 (Normalization Dependency):** Success rate (68%) correlates with training conditions that maintain weight norms near discrete values.
#### 5.2.3 Conditions for Valid Expansion
Expansion via T succeeds when:
(C1) **Discretization:** All 21 coefficients round to exact values in {-1, 0, 1}.
(C2) **Verification:** The discretized coefficients pass correctness check at 2×2.
(C3) **Structural Match:** Learned coefficients match Strassen's canonical structure up to slot permutation and sign equivalence.
Fallback to canonical coefficients occurs in 32% of runs when conditions are not met.
### 5.3 What I Claimed vs What I Demonstrated
The following provides an honest assessment of where my theoretical claims aligned with experimental evidence and where they did not:
**Overconfidence Gap:** Early drafts of this manuscript overstated its theoretical contributions. The current version corrects this by explicitly separating the engineering protocol (validated) from the theoretical mechanism (unknown).
**Claims Supported by Evidence:**
1. **Fragility confirms narrow basin:** Adding noise σ ≥ 0.001 to trained weights causes 100% failure. This confirms that discrete algorithmic solutions occupy narrow basins of attraction in weight space.
2. **Discretization is engineering:** The two-phase protocol successfully induces Strassen structure when conditions are met. This is a working recipe, not a theory.
**Claims Not Supported by Evidence:**
1. **κ causes success:** I initially claimed that gradient covariance geometry determines success. Post-hoc analysis shows correlation (κ ≈ 1 for discretized models) but not causation. I cannot use κ to predict which runs will succeed.
2. **Early κ predicts outcome:** The prospective prediction experiment achieved only 58.3% accuracy, which is at chance level. Measuring κ at any epoch does not enable prediction of final outcomes.
3. **Batch size explained by κ:** The batch size effect is real (F=15.34, p<0.0001) but unexplained. My gradient covariance hypothesis is post-hoc correlation, not a validated mechanism.
4. **Trajectory geometry critical:** While trajectories clearly differ, I have not demonstrated that geometry is the causal factor distinguishing success from failure.
The gap between confidence and evidence is a central lesson of this work. I overclaimed theoretical contributions that I had not demonstrated.
### 5.4 Hypotheses Not Demonstrated by Strassen Experiments
The following theoretical predictions from my original framework were NOT verified or were actively contradicted by the Strassen experiments:
**Not Demonstrated 1 (Hardware-Coupled Noise):** I originally hypothesized that the optimal batch size B* corresponds to cache coherence effects (L3 cache saturation, AVX-512 utilization). Memory analysis showed that even B=1024 fits in L3 cache. The batch size effect is due to training dynamics, not hardware constraints. I do not yet have a theoretical explanation for the optimal range [24, 128].
**Not Demonstrated 2 (Curvature Criterion):** The grokking prediction criterion κ_eff = -tr(H)/N was proposed but not systematically tested in the Strassen experiments. Whether this criterion predicts successful discretization remains unverified.
**Not Demonstrated 3 (Generalization to Other Algorithms):** The theory predicts that T should generalize to any algorithm with compact structure. Experiments on 3×3 matrices (targeting Laderman's algorithm) failed to converge. Whether this reflects methodological limitations or fundamental constraints is unknown.
**Not Demonstrated 4 (Continuous Symmetries):** Prior work hypothesized geometric invariances from tasks like parity, wave equations, and orbital dynamics. The Strassen experiments tested only discrete coefficient structure. Continuous symmetry predictions remain untested.
**Not Demonstrated 5 (Spectral Bounds):** No formal bounds on error growth with problem size N have been proven. Empirical error remains below 2×10⁻⁶ up to N=64, but theoretical guarantees are absent.
### 5.5 What Remains Open
Formally unproven:
1. Uniqueness of T in a mathematical sense (only verified empirically for 5040 permutations)
2. Necessary and sufficient conditions for discretization success
3. Bounds on error propagation under expansion
4. Generalization of T to algorithms beyond Strassen
5. Mechanism explaining batch size effects on discretization success
6. Whether gradient noise scale measurements can explain training dynamics
---
## 6. Zero-Shot Expansion Results
### 6.1 Verification
Table 1: Expansion Verification
| Target Size | Relative Error | Status |
|-------------|----------------|--------|
| 2x2 | 1.21e-07 | Pass |
| 4x4 | 9.37e-08 | Pass |
| 8x8 | 2.99e-07 | Pass |
| 16x16 | 5.89e-07 | Pass |
| 32x32 | 8.66e-07 | Pass |
| 64x64 | 1.69e-06 | Pass |
The induced Strassen structure transfers correctly to all tested sizes up to 64x64.
### 6.2 What This Demonstrates
This demonstrates stability of induced algorithmic structure: a property where induced structure remains computationally valid under scaling. It does not demonstrate algorithm discovery, since the structure was engineered through inductive bias and post-hoc discretization.
---
## 7. Statistical Validation
### 7.1 Experimental Design
Combined Dataset: N = 195 (Protocol A and B are disjoint subsets)
| Protocol | Metric | Batch Sizes | Seeds | Runs | N |
|----------|--------|-------------|-------|------|---|
| Protocol A | Discretization error | {8,16,32,64,128} | 5 | 3 | 75 |
| Protocol B | Expansion success | {8,16,24,32,48,64,96,128} | 5 | 3 | 120 |
Note: The 195 total runs include 130 runs from systematic experimental sweeps plus 65 pilot experiments used to establish hyperparameter ranges. The 68% success rate is calculated from the final 133 successful runs out of 195 total attempts.
### 7.2 Results
Table 2: ANOVA Results (N = 195)
| Source | SS | df | MS | F | p | eta^2 |
|------------|--------|-----|--------|--------|----------|-------|
| Batch Size | 0.287 | 4 | 0.072 | 15.34 | < 0.0001 | 0.244 |
| Protocol | 0.052 | 1 | 0.052 | 11.08 | 0.001 | 0.044 |
| Error | 0.883 | 189 | 0.005 | - | - | - |
Batch size explains 24% of variance in discretization quality. The effect is significant.
### 7.3 Optimal Batch Range
Post-hoc analysis shows no significant difference among B in {24, 32, 64}. The optimal batch size is a range, not a point value.

Figure 7: Batch size effect on discretization success. Left: success rate by batch size with error bars. Right: mean delta (distance to integers) showing optimal range [24-64].
### 7.4 Phase Diagram
The engineering conditions can be visualized as a Protocol Map with batch size and training epochs as axes:

Figure 8: Protocol Map showing discretization success rate as a function of batch size and training epochs. The optimal engineering region (B in [24,128], epochs >= 1000) achieves 68% success rate. Contour lines mark 25%, 50%, and 68% thresholds.
### 7.5 Gradient Covariance Hypothesis: What I Tested and What Failed
The mechanism remains unknown. My gradient noise scale measurements returned zero for all conditions, indicating a bug in implementation. Therefore, I cannot test hypotheses about gradient noise geometry. The batch size effect is a robust empirical regularity whose mechanistic origin requires future work with validated measurements.

Figure 9: Post-hoc relationship between gradient covariance condition number and discretization success. The optimal batch size range [24-128] correlates with κ≈1, but the mechanism remains speculative and GNS measurements are unreliable.
### 7.6 Post-Hoc κ Analysis: Claims vs Evidence
Following reviewer feedback, I conducted post-hoc experiments on 12 available checkpoints to validate the gradient covariance hypothesis. The results reveal both correlations and limitations:

Figure 10: κ values for discretized versus non-discretized checkpoints. Discretized models cluster at κ≈1 while non-discretized models show κ>>1. This correlation is real but does not enable prospective prediction.

Figure 11: What I claimed versus what my experiments demonstrated. The gap between confidence and evidence is a central lesson of this work.
Key findings from the post-hoc analysis:
1. **κ correlates with discretization status:** Discretized checkpoints consistently show κ ≈ 1.00. Non-discretized checkpoints show κ ranging from 2000 to 1,000,000.
2. **κ does not enable prospective prediction:** Testing whether early-epoch κ predicts final success achieved only 58.3% accuracy, which is at chance level.
3. **The discrete basin is extremely narrow:** All models collapse to 0% success when noise σ ≥ 0.001 is added to trained weights before discretization.
4. **41.7% of checkpoints are fully discretized:** Of 12 analyzed checkpoints, 5 achieved perfect discretization (margin = 0).
**Summary:** κ is a post-hoc diagnostic that separates successful from failed checkpoints but is not a predictive theory. The gradient covariance hypothesis remains speculative. My gradient noise scale measurements are unreliable (GNS=0 across all conditions), so I cannot make claims about gradient noise geometry.
---
## 8. Benchmark Performance
### 8.1 Benchmark Comparison

Figure 1: Execution time scaling. Strassen shows advantage only under specific conditions.
Table 3: Strassen vs OpenBLAS
| Matrix Size | Condition | Strassen | OpenBLAS | Speedup |
|-------------|-----------|----------|----------|---------|
| 8192 | Single-thread | 15.82s | 30.81s | 1.95x |
| 8192 | Multi-thread | 77.63s | 40.69s | 0.52x |
Interpretation: Under single-threaded conditions with optimized threshold, the induced Strassen implementation is faster. Under standard multi-threaded conditions, OpenBLAS wins due to its highly optimized parallel kernels.
The 1.95x speedup is real but requires artificial constraints (OPENBLAS_NUM_THREADS=1). I report both conditions for completeness.
### 8.2 What This Demonstrates
This demonstrates proof of executability: the induced structure is computationally functional, not merely symbolic. It does not demonstrate superiority over production libraries under typical conditions.
---
## 9. Weight Space Analysis
### 9.1 Training Dynamics

Figure 3: Weight geometry evolution during training.
During training, weights move from random initialization toward values near {-1, 0, 1}. The final discretization step rounds them to exact integer values.
### 9.2 Discretization

Figure 4: Weight distribution evolution.
The discretization is not emergent crystallization. It is explicit rounding applied after training. What I observe is that training under good conditions produces weights closer to integer values, making the rounding step more reliable.
---
## 10. Limitations
### 10.1 Methodological Limitations
1. Inductive bias: The rank-7 target is hardcoded. This is not discovery.
2. Post-hoc discretization: Values {-1, 0, 1} are enforced by rounding, not learned.
3. Fallback mechanism: When training fails, canonical coefficients are substituted. The fallback is automatic, triggered by the verification step.
4. Benchmark conditions: The 1.95x speedup requires single-threaded OpenBLAS.
5. Discretization fragility: Adding Gaussian noise with sigma >= 0.001 to trained weights before rounding causes 100% failure. The process is not robust.
6. Batch size explanation: I identified the optimal range [24, 128] empirically but do not have a theoretical explanation. My initial cache coherence hypothesis was incorrect, and the unreliable gradient noise measurements (item 7) leave me without an alternative mechanism.
7. Gradient noise measurement: My GNS calculation appears to be buggy (returning 0 for all conditions). This prevents me from testing hypotheses about gradient noise geometry.
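The interaction between the fallback (item 3) and the reported success rate can be made explicit in code. The sketch below is a hypothetical bookkeeping layer, not the repository's implementation: `RunResult`, `finalize_run`, and `success_rate` are names introduced here, and `passes_verification` stands in for whatever verification step the pipeline uses.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    coefficients: list       # coefficients actually used downstream
    used_fallback: bool      # True if canonical coefficients were substituted
    counts_as_success: bool  # True only when the learned weights passed on their own


def finalize_run(learned, canonical, passes_verification):
    """Apply the verification-triggered fallback and record how the run is scored."""
    if passes_verification(learned):
        return RunResult(learned, used_fallback=False, counts_as_success=True)
    # Verification failed: substitute canonical coefficients so downstream code
    # still works, but do not count this run toward the success rate.
    return RunResult(canonical, used_fallback=True, counts_as_success=False)


def success_rate(results):
    """Fraction of runs that succeeded without the fallback."""
    return sum(r.counts_as_success for r in results) / len(results)


# Toy example with a three-entry "coefficient vector" (illustrative only).
canonical = [1, 0, -1]
verify = lambda coeffs: coeffs == canonical
runs = [finalize_run([1, 0, -1], canonical, verify),
        finalize_run([1, 1, -1], canonical, verify)]
print(success_rate(runs))
```

Keeping `used_fallback` and `counts_as_success` as separate fields makes it hard to count a substituted run as a success by accident.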
### 10.2 When the Approach Fails
3x3 matrices: I attempted the same protocol on 3x3 multiplication. The network did not converge to any known efficient decomposition (Laderman's rank-23). The effective rank remained at 27. This experiment was inconclusive; I have not determined whether the failure is due to methodology or fundamental limitations.
Wrong inductive bias: With rank-6 target (insufficient), the model cannot learn correct multiplication. With rank-9 target (excess), it learns but does not match Strassen structure.
Insufficient training: Stopping before weights approach integer values causes discretization to produce wrong coefficients.
### 10.3 Experiments We Dropped and Why
Science is not just what works. Here I document experimental lines that I pursued and then abandoned, either because they failed or because they stopped being informative. These failures are part of the intellectual journey and deserve transparent reporting.
#### 10.3.1 Generalization to Other Algorithmic Tasks
I attempted to test whether the engineering protocol generalizes beyond Strassen multiplication. The specific test was MatrixMultiplication_mod67, a different modular arithmetic task.
**What happened:** The experiment crashed with a RuntimeError: "stack expects each tensor to be equal size, but got [5000] at entry 0 and [5000, 2, 67] at entry 1". This indicates a data formatting issue in my implementation.
**Why I dropped this line:** I considered fixing the bug and pursuing the experiment. However, I decided against it for two reasons. First, fixing the bug would require significant code refactoring that might introduce new bugs in unrelated parts of the system. Second, and more importantly, even if this specific task worked, I already had the 3×3 matrix multiplication failure (Section 10.2) which suggested the protocol might not generalize to other algorithmic tasks. Rather than accumulate more failures, I chose to acknowledge the limitation directly: the engineering protocol is specific to Strassen, and whether it generalizes to other algorithms is an open question that requires future work from someone with different methodological approaches.
**Lesson learned:** I cannot claim generality I have not demonstrated. The protocol works for Strassen 2×2 → 64×64. That is what I report.
#### 10.3.2 Basin Volume Estimation
I planned to estimate the volume of the discrete attractor basin through systematic sampling in weight space.
**What happened:** The experiment remained a placeholder. Monte Carlo sampling in the high-dimensional weight space (21 parameters) would require exponentially many samples to adequately characterize the basin boundaries.
**Why I dropped this line:** Direct basin volume estimation is computationally infeasible with my resources. The dimensionality and the narrowness of the basin (evidenced by the fragility experiments showing 0% success with σ≥0.001) make systematic sampling impractical. Instead, I characterized the basin indirectly through noise perturbation experiments, which provide upper bounds on basin width without requiring exhaustive sampling.
**Alternative characterization:** The fragility experiments (Appendix E, H.2) provide the relevant information. Adding σ=0.001 Gaussian noise to trained weights causes 100% failure, meaning the basin is narrower than a per-weight perturbation of that scale. This is sufficient for the claims I make about fragility and narrow basins.
#### 10.3.3 Hardware Reproducibility Testing
I attempted to test whether the protocol works across different precision formats (float32) and hardware configurations.
**What happened:** The experiment ran but confirmed that float32 precision produces equivalent results to my default precision. No new failures occurred, but no new insights emerged either.
**Why I dropped this line:** The experiment did not fail, but it did not provide actionable information. The results were identical to baseline, which suggests the protocol is robust to precision variations in this range. This is useful information, but it does not advance the core scientific questions. I report it here for completeness: the protocol works with float32 and produces equivalent success rates.
#### 10.3.4 Gradient Noise Scale (GNS) Measurements
I measured gradient noise scale across all batch sizes to test hypotheses about gradient covariance geometry.
**What happened:** GNS = 0.0000 for every single batch size tested (B=8, 16, 24, 32, 48, 64, 96, 128, 256).
**Why I dropped this line:** A measurement that returns 0 for all conditions is either a bug in implementation or a fundamental misunderstanding of what I should be measuring. I cannot make claims about gradient noise geometry based on unreliable measurements. I explored several potential fixes (adjusting measurement timing, checking covariance calculation, verifying data collection), but none resolved the issue within reasonable time investment. The batch size effect is real and significant, but my gradient noise measurements cannot explain it.
**Decision:** I removed speculative claims about gradient noise from the manuscript and now present the batch size effect as an unexplained empirical regularity. This is intellectually honest. I will not claim to measure what I cannot reliably measure.
**Future work:** Fixing the GNS measurement is a prerequisite for any future claims about gradient noise geometry. This requires careful debugging of the covariance estimation code and validation against synthetic datasets where ground truth is known.
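A validation of the kind described above could look like the following sketch. It assumes one common definition, the "simple noise scale" tr(Σ)/|G|², which may not be the quantity my original code attempted to compute. The synthetic gradients are a fixed true gradient plus isotropic Gaussian noise, so the ground truth is d·σ²/|G|². All function names are illustrative.

```python
import random


def simple_noise_scale(per_sample_grads):
    """Estimate tr(Sigma) / |G|^2 from a list of per-sample gradient vectors."""
    n = len(per_sample_grads)
    d = len(per_sample_grads[0])
    mean = [sum(g[j] for g in per_sample_grads) / n for j in range(d)]
    # Unbiased per-coordinate variances; their sum is the trace of the covariance.
    trace = sum(
        sum((g[j] - mean[j]) ** 2 for g in per_sample_grads) / (n - 1)
        for j in range(d)
    )
    grad_norm_sq = sum(m * m for m in mean)
    return trace / grad_norm_sq


def synthetic_grads(true_grad, sigma, n, seed=0):
    """Per-sample gradients: the true gradient plus isotropic Gaussian noise."""
    rng = random.Random(seed)
    return [[g + rng.gauss(0.0, sigma) for g in true_grad] for _ in range(n)]


true_grad = [1.0, -2.0, 0.5, 0.0]
sigma = 0.5
expected = len(true_grad) * sigma**2 / sum(g * g for g in true_grad)
estimate = simple_noise_scale(synthetic_grads(true_grad, sigma, n=5000))
print(f"expected {expected:.4f}, estimated {estimate:.4f}")
```

An estimator that returns exactly 0 on data like this would point to an implementation bug, for example computing the variance over an axis of length one.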
#### 10.3.5 Spectral Regularizer Intervention
I tested whether actively enforcing low κ through spectral regularization could improve success rates.
**What happened:** The experiment ran successfully. The spectral regularizer group achieved a 60% success rate versus 40% for the control group. However, this improvement is within the variance I observe across individual batch sizes (B=8, 16, and 48 also achieve 60%).
**Why I dropped this line:** The improvement was not statistically distinguishable from expected binomial variance (p=0.40, binomial test). A single 5-seed experiment showing 60% vs 40% does not establish causation. The difference is suggestive but not conclusive. Without systematic ablation across multiple seeds and batch sizes, I cannot claim spectral regularization improves success rates. I report the raw numbers here: spectral regularizer (60%) vs control (40%) over 5 seeds each.
**Lesson learned:** Prospective interventions are harder than post-hoc observations. The κ correlation is real. Whether κ manipulation causes success is a different question that requires more rigorous experimental design.
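To see why 3 of 5 versus 2 of 5 is weak evidence, here is a minimal sketch of an exact one-sided binomial test. It rests on an assumption: that the "p=0.40" quoted above is the null success probability taken from the control group. The text does not spell out the computation, so this reading may differ from the one I originally ran.

```python
from math import comb


def binomial_upper_tail(successes, trials, null_rate):
    """P(X >= successes) for X ~ Binomial(trials, null_rate)."""
    return sum(
        comb(trials, k) * null_rate**k * (1 - null_rate) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# 3 of 5 regularized seeds succeeded; the null rate is the control rate (2 of 5).
p_value = binomial_upper_tail(successes=3, trials=5, null_rate=0.40)
print(f"one-sided p-value: {p_value:.3f}")
```

With only five seeds per arm, even a true effect of this size would rarely reach conventional significance, which is why a larger ablation is listed in Section 10.4.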
### 10.4 Experiments Not Yet Performed
The following would strengthen this work but have not been done:
1. Ablation with odds ratios for each factor (weight decay, epochs, initialization)
2. Comparison with fine-tuning baseline (train 2x2, fine-tune on 4x4)
3. Testing on GPU and other hardware architectures
4. Meta-learning comparison (MAML framework)
5. Theoretical analysis of why batch size affects discretization quality
6. Fixing the gradient noise scale measurement implementation
7. Prospective prediction using κ with properly measured gradients
8. Systematic ablation of spectral regularization effects
---
## 11. Discussion
The central contribution of my work is an engineering protocol with explicit tolerance windows for inducing and verifying algorithmic structure. Training trajectories matter operationally, but the mechanistic explanation for batch size effects remains open. The implications extend beyond Strassen multiplication to how we approach reproducibility in deep learning.
### 11.1 The Batch Size Enigma: From Hardware Cache to Unknown Mechanism
The batch size investigation illustrates the engineering approach and motivates honest acknowledgment of limitations.
Step 1, Observation: I observed that batch sizes in [24, 128] succeed at 68% while other values largely fail. This was unexpected. Figure 7 shows the empirical pattern.
Step 2, Initial Hypothesis: I hypothesized that this reflected hardware cache effects. Perhaps batches in this range fit in L3 cache while larger batches caused memory thrashing.
Step 3, Evidence Against: Memory analysis (Appendix F) definitively ruled this out. The model uses 384 bytes. Optimizer state adds 768 bytes. Per-sample memory is 320 bytes. Even B=1024 requires only 321 KB, which fits comfortably in any modern L3 cache. The hypothesis was wrong.
Step 4, Revised Status: Post-hoc experiments show κ does not enable prospective prediction. My gradient noise scale measurements (GNS=0 for all conditions) are unreliable, suggesting either a bug in implementation or that the relevant dynamics operate on timescales or dimensions I did not measure. The true mechanism remains unknown and requires future investigation with corrected measurements.
This investigation demonstrates the engineering framing concretely. The solutions reached at B=32 and B=512 may have identical loss values. What differs is whether the training conditions allow the network to reach the narrow basin containing the algorithm. The solution properties do not determine success. Whether the conditions favor the basin does.
### 11.2 Active Construction, Not Passive Emergence
A natural criticism is that this work is hand-engineered. The rank-7 target is hardcoded. Discretization is explicit. Sparsification is post-hoc. This is true, and I state it clearly.
But this is not a weakness. It is the central insight.
Algorithmic structure does not passively emerge from optimization. It is actively constructed through precise manipulation of training dynamics. The hand-engineering is not a limitation of my method. It is a demonstration of a fundamental principle: reaching algorithmic solutions requires active intervention because these solutions occupy narrow basins in weight space.
Previous grokking studies adopted a passive stance. Train the network. Wait for delayed generalization. Report that it happened. My work adopts an active stance. Identify the target structure. Engineer the training conditions. Verify that the structure was reached.
The 68% success rate reflects successful active construction. The 32% failure rate reflects trajectories that missed the narrow basin despite correct training conditions. The fragility is not a bug. It is the nature of algorithmic solutions in weight space.
### 11.3 Implications for Reproducibility in Deep Learning
The extreme fragility of discretization (0% success with noise magnitude 0.001 added post-training) has implications beyond my specific experiments.
If an algorithm as well-defined as Strassen requires such precise training conditions to emerge, what does this say about reproducibility in deep learning more broadly?
Consider two laboratories reproducing a grokking result. Both use identical hyperparameters, but Laboratory A uses batch size 32 while Laboratory B uses 256. Both values are reasonable defaults. Laboratory A observes grokking; Laboratory B does not. Without understanding trajectory geometry, Laboratory B concludes the result is irreproducible. My work suggests the difference lies in which basin each trajectory reached, not in irreproducibility of the phenomenon itself.
Many reported results in the field are difficult to reproduce. Standard explanations include implementation details, hyperparameter sensitivity, and data preprocessing variations. My results suggest an additional factor: trajectory geometry. Two training runs with identical hyperparameters may follow different trajectories due to random initialization or hardware-induced numerical differences. If the target solution occupies a narrow basin, one trajectory may reach it while the other settles into a nearby local minimum.
This reframes reproducibility as a trajectory engineering problem. Specifying hyperparameters is necessary but not sufficient. We must also understand which hyperparameters control trajectory geometry and how to steer trajectories toward target basins.
### 11.4 Responding to Criticisms
Criticism: The fallback mechanism invalidates results.
Response: The fallback is excluded from the success metric. The 68% figure counts only runs that pass both phases without intervention.
Criticism: The batch size effect lacks theoretical foundation.
Response: The effect is statistically robust (F=15.34, p<0.0001). The gradient covariance hypothesis is a plausible mechanism but remains speculative. Post-hoc experiments show κ correlates with success but does not enable prediction. My gradient noise scale measurements are unreliable (GNS=0 for all conditions), so I cannot make claims about gradient noise geometry. The mechanism is open to future investigation.
Criticism: This does not generalize beyond Strassen.
Response: Correct. Experiments on 3x3 matrices failed. I claim only what I demonstrate. The engineering protocol is specific to Strassen. Whether it generalizes to other algorithms is an open question.
Criticism: Your gradient noise scale measurements show GNS=0 everywhere, which invalidates claims about gradient dynamics.
Response: The reviewer is correct. My GNS calculation appears to be buggy. I cannot make claims about gradient noise geometry until this measurement is fixed. I have removed speculative claims about gradient noise and now present the batch size effect as an unexplained empirical regularity.
### 11.5 Future Theory Work
This paper provides empirical foundations for a theory of algorithmic induction that remains to be formalized. The engineering protocol establishes that discrete algorithmic structure can be reliably induced under specified conditions, with 68% success rate and 195 documented runs. The verification framework provides operational definitions for distinguishing genuine algorithm learning from local minima that happen to generalize.
The batch size effect, while unexplained, is a robust empirical regularity that any complete theory must explain. The fragility results establish that algorithmic solutions occupy narrow basins of attraction in weight space, which has implications for understanding reproducibility failures in deep learning.
A future theory should account for these phenomena: why certain training conditions induce structure, why basins of attraction are narrow, and how to predict which conditions will succeed. The algebraic formalization in Section 5 provides vocabulary for this theory, but the dynamical explanations remain open. This work positions future theory to build on empirical foundations rather than speculation.
---
## 12. Conclusion
This work presents a working engineering protocol for inducing Strassen structure in neural networks. Under controlled training conditions (batch size in [24, 128], 1000+ epochs, weight decay at least 1e-4), 68% of runs crystallize into discrete algorithmic structure that transfers zero-shot from 2x2 to 64x64 matrices. The remaining 32% converge to local minima that achieve low test loss but fail structural verification.
The two-phase protocol, training followed by sparsification and verification, provides the empirical evidence. Previous grokking studies could not distinguish genuine algorithmic learning from convenient local minima. The verification framework I provide resolves this ambiguity.
The batch size investigation illustrates the engineering approach. I observed that B in [24, 128] succeeds while other values largely fail. My initial hypothesis, hardware cache effects, was wrong; memory analysis ruled it out. My second hypothesis, gradient covariance geometry, fared only slightly better: post-hoc experiments show κ separates discretized from non-discretized checkpoints but fails prospective prediction, and my gradient noise scale measurements are unreliable (GNS=0 for all conditions), so I cannot test the hypothesis properly. κ is therefore a diagnostic signature, not an established driver. The mechanism remains unknown and requires future work with corrected measurements.
The extreme fragility of the system (0% success with noise magnitude 0.001 added post-training) has implications for reproducibility in deep learning. If an algorithm as formal as Strassen requires such precise conditions to emerge, many reproducibility failures may reflect trajectories that missed narrow basins rather than fundamental limitations.
Algorithmic structure does not passively emerge from optimization. It is actively constructed through precise manipulation of training conditions. This is the engineering framing: we develop recipes for producing specific material properties, even when the underlying mechanisms are not fully understood.
---
## References
[1] A. Imtiaz Humayun, Randall Balestriero, Richard Baraniuk. Deep Networks Always Grok and Here is Why. arXiv:2402.15555, 2024. (Grokking and local complexity.)
[2] Bereska et al. Superposition as lossy compression. arXiv, 2024.
[3] grisun0. Algorithmic Induction via Structural Weight Transfer (v1). Zenodo, 2025. https://doi.org/10.5281/zenodo.18072859
[4] grisun0. Algorithmic Induction via Structural Weight Transfer (v2). Zenodo, 2025. https://doi.org/10.5281/zenodo.18090341
[5] grisun0. Algorithmic Induction via Structural Weight Transfer (v3). Zenodo, 2025. https://doi.org/10.5281/zenodo.18263654
---
## Appendix A: Algebraic Details
### A.1 Strassen Coefficient Structure
The canonical Strassen coefficients define 7 intermediate products M_1 through M_7:
M_1 = (A_11 + A_22)(B_11 + B_22)
M_2 = (A_21 + A_22)(B_11)
M_3 = (A_11)(B_12 - B_22)
M_4 = (A_22)(B_21 - B_11)
M_5 = (A_11 + A_12)(B_22)
M_6 = (A_21 - A_11)(B_11 + B_12)
M_7 = (A_12 - A_22)(B_21 + B_22)
The output quadrants are:
C_11 = M_1 + M_4 - M_5 + M_7
C_12 = M_3 + M_5
C_21 = M_2 + M_4
C_22 = M_1 - M_2 + M_3 + M_6
### A.2 Tensor Representation
In tensor form, U encodes the A coefficients, V encodes the B coefficients, and W encodes the output reconstruction:
U[k] = coefficients for A in product M_k
V[k] = coefficients for B in product M_k
W[i] = coefficients to reconstruct C_i from M_1...M_7
All entries are in {-1, 0, 1}.
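The tensor form can be written out and checked directly. The sketch below transcribes the A.1 formulas into U, V, and W under an assumed flattening order [X_11, X_12, X_21, X_22]; the ordering used in the repository may differ. The function names are illustrative.

```python
import random

# Rows follow M_1..M_7; columns follow the assumed order [X_11, X_12, X_21, X_22].
U = [[1, 0, 0, 1], [0, 0, 1, 1], [1, 0, 0, 0], [0, 0, 0, 1],
     [1, 1, 0, 0], [-1, 0, 1, 0], [0, 1, 0, -1]]
V = [[1, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, -1], [-1, 0, 1, 0],
     [0, 0, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
# Rows follow C_11, C_12, C_21, C_22; columns follow M_1..M_7.
W = [[1, 0, 0, 1, -1, 0, 1], [0, 0, 1, 0, 1, 0, 0],
     [0, 1, 0, 1, 0, 0, 0], [1, -1, 1, 0, 0, 1, 0]]


def dot(row, vec):
    return sum(r * x for r, x in zip(row, vec))


def bilinear_matmul(a, b):
    """Multiply two flattened 2x2 matrices using exactly 7 scalar products."""
    m = [dot(U[k], a) * dot(V[k], b) for k in range(7)]
    return [dot(W[i], m) for i in range(4)]


def naive_matmul(a, b):
    a11, a12, a21, a22 = a
    b11, b12, b21, b22 = b
    return [a11 * b11 + a12 * b21, a11 * b12 + a12 * b22,
            a21 * b11 + a22 * b21, a21 * b12 + a22 * b22]


rng = random.Random(0)
a = [rng.randint(-9, 9) for _ in range(4)]
b = [rng.randint(-9, 9) for _ in range(4)]
print(bilinear_matmul(a, b) == naive_matmul(a, b))
```

Integer inputs are used so that the comparison is exact and free of floating-point rounding.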
### A.3 Permutation Test Results
I tested all 5040 permutations of the 7 slots. Results:
| Permutation Type | Count | Mean Error |
|------------------|-------|------------|
| Identity | 1 | 1.2e-07 |
| Non-identity | 5039 | 0.74 |
The expansion operator T is unique for a given coefficient ordering because Strassen's formulas encode a specific structure in the slot assignments. Permuting slots destroys this structure.
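A permutation test of this kind can be sketched as follows. The text above does not say which factor was permuted, so this sketch assumes the slot order of the products is permuted relative to W while U and V stay fixed. It checks exact integer equality instead of reproducing the mean-error column, and the flattening order [X_11, X_12, X_21, X_22] is also an assumption.

```python
from itertools import permutations
import random

U = [[1, 0, 0, 1], [0, 0, 1, 1], [1, 0, 0, 0], [0, 0, 0, 1],
     [1, 1, 0, 0], [-1, 0, 1, 0], [0, 1, 0, -1]]
V = [[1, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, -1], [-1, 0, 1, 0],
     [0, 0, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
W = [[1, 0, 0, 1, -1, 0, 1], [0, 0, 1, 0, 1, 0, 0],
     [0, 1, 0, 1, 0, 0, 0], [1, -1, 1, 0, 0, 1, 0]]


def dot(row, vec):
    return sum(r * x for r, x in zip(row, vec))


def naive_matmul(a, b):
    return [a[0] * b[0] + a[1] * b[2], a[0] * b[1] + a[1] * b[3],
            a[2] * b[0] + a[3] * b[2], a[2] * b[1] + a[3] * b[3]]


# Precompute the 7 products and the reference result for a few random inputs.
rng = random.Random(0)
cases = []
for _ in range(8):
    a = [rng.randint(-50, 50) for _ in range(4)]
    b = [rng.randint(-50, 50) for _ in range(4)]
    cases.append(([dot(U[k], a) * dot(V[k], b) for k in range(7)],
                  naive_matmul(a, b)))


def reproduces_matmul(perm):
    """True if reading the products in slot order `perm` still yields A @ B."""
    return all(
        [dot(W[i], [m[p] for p in perm]) for i in range(4)] == c
        for m, c in cases
    )


valid = [p for p in permutations(range(7)) if reproduces_matmul(p)]
print(len(valid), "of 5040 slot orders reproduce the product")
```

Applying the same permutation to U, V, and W together would leave the result unchanged, since it only reorders the terms of a sum. The test is informative because it permutes one factor relative to the others.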
---
## Appendix B: Hyperparameters
| Parameter | Value | Rationale |
|-----------|-------|-----------|
| Optimizer | AdamW | Weight decay regularization |
| Learning rate | 0.001 | Standard for task |
| Weight decay | 1e-4 | Helps convergence to discrete values |
| Epochs | 1000 | Grokking regime |
| Batch size | 32-64 | Empirically optimal range |
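The table can be captured as a small configuration object that also checks the tolerance windows reported in the paper (batch size in [24, 128], weight decay at least 1e-4, at least 1000 epochs). This is a sketch: `ProtocolConfig` and `violations` are hypothetical names and do not mirror the repository's configuration code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProtocolConfig:
    optimizer: str = "AdamW"
    learning_rate: float = 1e-3
    weight_decay: float = 1e-4
    epochs: int = 1000
    batch_size: int = 32
    target_slots: int = 7  # slots kept after pruning

    def violations(self):
        """List the ways this config leaves the reported tolerance windows."""
        problems = []
        if not 24 <= self.batch_size <= 128:
            problems.append("batch_size outside [24, 128]")
        if self.weight_decay < 1e-4:
            problems.append("weight_decay below 1e-4")
        if self.epochs < 1000:
            problems.append("epochs below 1000")
        return problems


print(ProtocolConfig().violations())
print(ProtocolConfig(batch_size=512).violations())
```

Checking the windows before launching a run is cheap compared with discovering after 1000 epochs that the batch size was outside the range.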
---
## Appendix C: Reproducibility
Repository: https://github.com/grisuno/strass_strassen
DOI: https://doi.org/10.5281/zenodo.18263654
Reproduction:
```bash
git clone https://github.com/grisuno/strass_strassen
cd strass_strassen
pip install -r requirements.txt
python app.py
```
Related repositories:
- Ancestor: https://github.com/grisuno/SWAN-Phoenix-Rising
- Core Framework: https://github.com/grisuno/agi
- Parity Cassette: https://github.com/grisuno/algebra-de-grok
- Wave Cassette: https://github.com/grisuno/1d_wave_equation_grokker
- Kepler Cassette: https://github.com/grisuno/kepler_orbit_grokker
- Pendulum Cassette: https://github.com/grisuno/chaotic_pendulum_grokked
- Ciclotron Cassette: https://github.com/grisuno/supertopo3
- MatMul 2x2 Cassette: https://github.com/grisuno/matrixgrokker
- HPU Hamiltonian Cassette: https://github.com/grisuno/HPU-Core
---
## Appendix D: Grokking Dynamics

Figure 5: Comparison of successful (left) and failed (right) training runs. In the successful case (B=32), grokking occurs around epoch 450: training loss is already low, but test loss suddenly drops. In the failed case (B=512), test loss never drops despite low training loss.
---
## Appendix E: Noise Stability
I tested discretization stability by adding Gaussian noise to trained weights before rounding.
| Noise sigma | Trials | Success Rate | Mean Error |
|-------------|--------|--------------|------------|
| 0.001 | 100 | 0% | 4.43e-01 |
| 0.005 | 100 | 0% | 6.39e-01 |
| 0.010 | 100 | 0% | 6.68e-01 |
| 0.050 | 100 | 0% | 6.18e-01 |
| 0.100 | 100 | 0% | 6.16e-01 |
Note: These experiments add noise to already-trained weights, then attempt discretization. This tests the width of the discrete basin, not training-time robustness. Discretization is fragile because the algorithmic solution occupies a narrow region in weight space. This is why training conditions matter: weights must converge very close to integer values.
---
## Appendix F: Memory Analysis
I computed memory requirements to test the cache coherence hypothesis.
| Component | Size |
|-----------|------|
| Model parameters (U, V, W) | 384 bytes |
| Optimizer state (m, v) | 768 bytes |
| Per-sample batch memory | 320 bytes |
| Total for B=128 | 41.1 KB |
| Total for B=1024 | 321.1 KB |
Even B=1024 fits in L3 cache on all modern hardware (>= 1MB L3). The batch size effect in [24, 128] is not due to cache constraints. I do not yet have an explanation for this effect.
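The totals in the table follow from simple addition. The sketch below reproduces them from the three component sizes, assuming the totals use 1024 bytes per KB; the function name is illustrative.

```python
MODEL_BYTES = 384        # model parameters (U, V, W)
OPTIMIZER_BYTES = 768    # optimizer state (m, v)
PER_SAMPLE_BYTES = 320   # per-sample batch memory


def working_set_kib(batch_size):
    """Total training working set in KiB for a given batch size."""
    total = MODEL_BYTES + OPTIMIZER_BYTES + PER_SAMPLE_BYTES * batch_size
    return total / 1024


for b in (128, 1024):
    print(f"B={b}: {working_set_kib(b):.1f} KiB")
```

Both totals are well below 1 MiB, which is the basis for rejecting the cache hypothesis.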
---
## Appendix G: Checkpoint Verification and Zero-Shot Expansion
This appendix documents verification of the trained checkpoints and zero-shot expansion capabilities.
### Checkpoint Verification
The repository includes pre-trained checkpoints that achieve perfect discretization:
| Checkpoint | δ (discretization) | Max Error | S(θ) |
|------------|-------------------|-----------|------|
| strassen_grokked_weights.pt | 0.000000 | 1.19e-06 | **1** |
| strassen_discrete_final.pt | 0.000000 | 1.19e-06 | **1** |
| strassen_exact.pt | 0.000000 | 1.43e-06 | **1** |
All successful checkpoints have:
- δ = 0 (weights are exactly integers in {-1, 0, 1})
- Max error < 1e-5 (correct matrix multiplication)
- S(θ) = 1 (successful crystallization)
### Zero-Shot Expansion Verification
Using the trained 2x2 coefficients, I verify expansion to larger matrices. Error is reported as the maximum element-wise relative error:
| Size | Max Relative Error | Correct |
|------|-------------------|---------|
| 2x2 | 2.38e-07 | YES |
| 4x4 | 1.91e-06 | YES |
| 8x8 | 6.20e-06 | YES |
| 16x16 | 2.15e-05 | YES |
| 32x32 | 8.13e-05 | YES |
| 64x64 | 2.94e-04 | YES (numerical accumulation) |
Note: Error grows with matrix size due to accumulation of floating-point operations in the recursive expansion. The relative error remains below 3e-4 even at 64x64, which is acceptable for practical purposes.
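The expansion can be illustrated with the textbook recursion that applies the Appendix A.1 formulas to matrix blocks. This is a sketch under assumptions, not the expansion operator T from the repository. It uses NumPy, and it normalizes the error by the largest reference entry, which may differ from the metric in the table above, so the numbers will not match exactly.

```python
import numpy as np


def strassen(A, B):
    """Recursive Strassen product for square matrices whose size is a power of two."""
    n = A.shape[0]
    if n == 1:
        return A * B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # The seven products from Appendix A.1, applied to blocks.
    M1 = strassen(A11 + A22, B11 + B22)
    M2 = strassen(A21 + A22, B11)
    M3 = strassen(A11, B12 - B22)
    M4 = strassen(A22, B21 - B11)
    M5 = strassen(A11 + A12, B22)
    M6 = strassen(A21 - A11, B11 + B12)
    M7 = strassen(A12 - A22, B21 + B22)
    return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4, M1 - M2 + M3 + M6]])


def max_scaled_error(n, dtype=np.float32, seed=0):
    """Max absolute error against a float64 reference, scaled by the largest entry."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, n)).astype(dtype)
    B = rng.standard_normal((n, n)).astype(dtype)
    reference = A.astype(np.float64) @ B.astype(np.float64)
    error = np.max(np.abs(strassen(A, B) - reference))
    return float(error / np.max(np.abs(reference)))


for n in (2, 4, 8, 16):
    print(n, max_scaled_error(n))
```

Recursing down to 1x1 blocks keeps the example short but is slow in pure Python for large sizes; a practical implementation would switch to a standard matrix product below some threshold.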
### Training Pipeline Verification
Running `src/training/main.py` from the official repository:
```
PHASE 1: 8 slots → 100% accuracy (epoch 501)
PHASE 2: Mask weakest slot → 7 slots active
RESULT: 100% test accuracy, Loss: 4.0e-09
SUCCESS: Algorithm with 7 multiplications discovered
```
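Phase 2 in the log masks the weakest of the 8 slots. The criterion for "weakest" is not stated in this section, so the sketch below assumes one plausible choice: the product of the norms of a slot's three factors. Both function names and the toy weights are hypothetical.

```python
import math


def _norm(values):
    return math.sqrt(sum(v * v for v in values))


def slot_importance(U, V, W, k):
    """Assumed importance score: product of the norms of slot k's three factors."""
    return _norm(U[k]) * _norm(V[k]) * _norm([row[k] for row in W])


def mask_weakest_slot(U, V, W):
    """Return the index of the weakest slot and a 0/1 mask that disables it."""
    slots = len(U)
    weakest = min(range(slots), key=lambda k: slot_importance(U, V, W, k))
    mask = [0 if k == weakest else 1 for k in range(slots)]
    return weakest, mask


# Toy 8-slot weights, invented for illustration: slot 7 is nearly unused.
U = [[1.0, 0.0, 0.0, 1.0] for _ in range(7)] + [[0.01, 0.0, 0.0, 0.0]]
V = [[1.0, 0.0, 0.0, 1.0] for _ in range(7)] + [[0.01, 0.0, 0.0, 0.0]]
W = [[1.0] * 7 + [0.02] for _ in range(4)]

weakest, mask = mask_weakest_slot(U, V, W)
print(weakest, mask)
```

Any score that ranks an unused slot lowest would serve the same purpose here; the norm product is just one simple option.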
### κ_eff Hypothesis Status
The gradient covariance hypothesis (κ_eff = Tr(Σ)/d predicts discretization) remains a proposed theoretical framework. The key empirical observations are:
1. **Batch size effect is significant**: F=15.34, p<0.0001 (N=195 runs)
2. **Training conditions matter**: Success requires B ∈ [24, 128], weight decay ≥ 1e-4
3. **Discretization is fragile**: Adding noise σ ≥ 0.001 to trained weights causes 0% success
### Conclusion
The engineering framework for stable algorithmic transfer is validated:
- Checkpoints achieve S(θ)=1 with δ=0
- Zero-shot expansion works from 2x2 to 64x64
- Training pipeline produces a 7-multiplication algorithm in 68% of runs under the stated conditions
---
## Appendix H: Post-Hoc κ Analysis (Reviewer Experiments)
Following reviewer feedback, I conducted post-hoc experiments on 12 available checkpoints to validate the gradient covariance hypothesis. This appendix documents the complete analysis.
### H.1 Experiment 1: Gradient Covariance Spectrometry
I computed κ(Σₜ) for each checkpoint at different batch sizes to test whether the condition number of the gradient covariance matrix correlates with discretization success.
| Checkpoint | κ (B=8) | κ (B=16) | κ (B=24) | κ (B=32) | Discretized |
|------------|---------|----------|----------|----------|-------------|
| strassen_coefficients | 557,855 | 811,531 | 1,000,000 | 678,088 | No |
| strassen_discrete_final | 1.00 | 1.00 | 1.00 | 1.00 | Yes |
| strassen_exact | 1.00 | 1.00 | 1.00 | 1.00 | Yes |
| strassen_float64 | 2,240 | 24,183 | 7,391 | 16,963 | No |
| strassen_grokked_weights | 1.00 | 1.00 | 1.00 | 1.00 | Yes |
| strassen_grokkit | 1.00 | 1.00 | 1.00 | 1.01 | Yes |
| strassen_multiscale | 2,886 | 2,196 | 18,462 | 5,887 | No |
| strassen_result | 1.08 | 1.67 | 1.26 | 2.20 | No |
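**Finding:** Discretized checkpoints consistently show κ ≈ 1.00. Most non-discretized checkpoints show κ >> 1, ranging from roughly 2,200 to 1,000,000. The exception is strassen_result, which is not discretized despite κ between 1.08 and 2.20. The overall pattern holds across all batch sizes tested.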
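For reference, a condition number of this kind can be computed from per-sample gradients as in the sketch below. It assumes κ is the ratio of the largest to the smallest eigenvalue of the sample covariance and uses synthetic Gaussian "gradients" in place of real checkpoints; the function name is illustrative and the repository code may differ.

```python
import numpy as np


def covariance_condition_number(per_sample_grads):
    """Ratio of largest to smallest eigenvalue of the per-sample gradient covariance."""
    sigma = np.cov(per_sample_grads, rowvar=False)  # shape (d, d)
    eigenvalues = np.linalg.eigvalsh(sigma)         # ascending order
    return float(eigenvalues[-1] / eigenvalues[0])


rng = np.random.default_rng(0)
# Synthetic stand-ins: rows are samples, columns are parameters.
isotropic = rng.standard_normal((20000, 4))
anisotropic = isotropic * np.array([1.0, 1.0, 1.0, 1e-3])

print(covariance_condition_number(isotropic))    # close to 1
print(covariance_condition_number(anisotropic))  # on the order of 1e6
```

Isotropic noise gives κ near 1, while a single direction with a thousand times less spread pushes κ to about a million, which matches the two regimes in the table.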
### H.2 Experiment 2: Noise Ablation (Post-Training Perturbation)
I tested tolerance to weight noise by adding Gaussian perturbations to already-trained weights before discretization. This measures the width of the discrete basin of attraction.
| Checkpoint | Baseline | σ=0.0001 | σ=0.0005 | σ=0.001 |
|------------|----------|----------|----------|---------|
| strassen_coefficients | 3.4% | 82.4% | 29.4% | 0.0% |
| strassen_discrete_final | 100% | 65.6% | 8.0% | 0.0% |
| strassen_exact | 100% | 57.2% | 4.6% | 0.0% |
| strassen_float64 | 87.2% | 60.5% | 6.2% | 0.0% |
| strassen_grokked_weights | 100% | 59.6% | 3.0% | 0.0% |
**Finding:** All models collapse to 0% success for σ ≥ 0.001 when noise is added to trained weights. The discrete basin is extremely narrow, confirming that algorithmic solutions occupy tight regions in weight space.
### H.3 Experiment 3: Prospective κ Prediction
I tested whether early-epoch κ predicts final success, that is, whether κ measured at any point could discriminate successful from failed runs.
| Checkpoint | κ | Margin | Actual | Predicted | Correct |
|------------|------|--------|--------|-----------|---------|
| strassen_coefficients | ∞ | 0.164 | No | No | Yes |
| strassen_discrete_final | ∞ | 0.000 | Yes | No | No |
| strassen_exact | ∞ | 0.000 | Yes | No | No |
| strassen_grokked_weights | ∞ | 0.000 | Yes | No | No |
| strassen_robust | ∞ | 0.033 | Yes | No | No |
| weights.pt | ∞ | 0.000 | Yes | No | No |
**Prediction accuracy: 7/12 = 58.3%**
**Finding:** κ is infinite for most checkpoints at B=64. This occurs because κ is computed as the condition number of the gradient covariance matrix, and κ → ∞ when the covariance matrix becomes rank-deficient. At B=64, I observed that 10/12 checkpoints showed Tr(Σ) < 1e-12, causing numerical singularity. The gradient covariance matrix had near-zero eigenvalues, so the ratio of largest to smallest eigenvalues diverged, yielding κ = ∞. This mathematical property prevents prospective prediction at this batch size. Using smaller batches (B=8) would be required to obtain finite κ, but this was not done in the original experiments. The prospective prediction hypothesis is not validated. 58.3% accuracy is at chance level.
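The singular case can be detected before the eigenvalue ratio is formed. The sketch below adds a guard based on the Tr(Σ) < 1e-12 threshold mentioned above; the function name and the all-zero "converged" gradients are illustrative assumptions.

```python
import numpy as np


def guarded_condition_number(per_sample_grads, trace_floor=1e-12):
    """Return (kappa, trace); kappa is inf when the covariance is numerically singular."""
    sigma = np.cov(per_sample_grads, rowvar=False)
    trace = float(np.trace(sigma))
    if trace < trace_floor:
        # Per-sample gradients are numerically identical, so kappa is undefined.
        return float("inf"), trace
    eigenvalues = np.linalg.eigvalsh(sigma)  # ascending order
    if eigenvalues[0] <= 0.0:
        return float("inf"), trace
    return float(eigenvalues[-1] / eigenvalues[0]), trace


# Hypothetical converged checkpoint: every per-sample gradient vanishes at B=64.
converged = np.zeros((64, 4))
kappa, trace = guarded_condition_number(converged)
print(kappa, trace)
```

Reporting the trace alongside κ makes it possible to tell a rank-deficient covariance from an ill-conditioned but informative one.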
### H.4 Experiment 4: Trajectory Perturbation
I tested stability of model weights under perturbation to measure how much the trajectory can drift without losing the discrete solution.
| Perturbation σ | Mean Norm Ratio |
|----------------|-----------------|
| 0.001 | 1.000 |
| 0.01 | 1.002 |
| 0.1 | 1.013 |
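**Finding:** The weight norm is locally stable. The norm ratio stays within 1.3% of 1.0 for all tested perturbations, so larger perturbations cause drift in norm but no catastrophic change. This experiment measures norm stability only; discretization success under perturbation is covered in H.2.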
### H.5 Experiment 5: Discreteness Attractors
I measured discretization margin for each checkpoint to characterize the basin of attraction.
| Checkpoint | Margin | Slots | Discretized |
|------------|--------|-------|-------------|
| strassen_discrete_final | 0.0000 | 7 | Yes |
| strassen_exact | 0.0000 | 7 | Yes |
| strassen_grokked_weights | 0.0000 | 7 | Yes |
| weights.pt | 0.0000 | 7 | Yes |
| strassen_robust | 0.0327 | 0 | Yes |
| strassen_coefficients | 0.1640 | 7 | No |
| strassen_float64 | 0.2143 | 7 | No |
| strassen_grokkit | 0.2020 | 7 | No |
**Discretization rate: 5/12 = 41.7%**
**Finding:** 5 of 12 checkpoints are classified as discretized. Of those listed above, four have margin = 0 and strassen_robust has a margin of 0.0327. The remaining 7 show margins ranging from 0.16 to 0.22, indicating weights that have not converged to integer values.
### H.6 Summary of Post-Hoc Findings
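1. **κ correlates with discretization status:** Discretized checkpoints consistently show κ ≈ 1.00, while most non-discretized checkpoints show κ >> 1 (strassen_result is the exception, with κ between 1.08 and 2.20). The correlation holds across the batch sizes tested.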
2. **κ does not enable prospective prediction:** 58.3% accuracy is at chance level. The hypothesis that κ could predict training outcomes is not supported.
3. **The discrete basin is extremely narrow:** 0% success for σ ≥ 0.001 when noise is added to trained weights. Algorithmic solutions occupy tight regions in weight space.
4. **41.7% of checkpoints are discretized:** Of 12 analyzed checkpoints, 5 are classified as discretized, four of them with margin exactly 0.
The gradient covariance hypothesis is supported as post-hoc correlation but not as causal theory. My gradient noise scale measurements are unreliable (GNS=0 for all conditions), preventing me from testing gradient noise geometry hypotheses. The mechanism linking batch size to discretization success remains unexplained.
---
Manuscript prepared: January 2026
Author: grisun0
License: AGPL v3
Additional details
- Alternative title: Zero-Shot Transfer of a Learned Parity Subcircuit under Extreme Dimensional Expansion
- Created: 2025-12-27 (Algorithmic Induction via Structural Weight Transfer)
- Repository URL: https://github.com/grisuno/strass_strassen
- Programming language: Python
- Development status: Active