# Engineering Generalization: Conditions for Stable Algorithmic Transfer in Neural Networks
**Author:** grisun0
---
## Abstract
This paper establishes a set of empirical engineering conditions that are necessary for stable algorithmic transfer in neural networks: the property that a trained model can be expanded to larger input dimensions without retraining while preserving correct computation. I demonstrate this using bilinear models trained on 2x2 matrix multiplication with Strassen-structured inductive bias.
I do not claim that networks discover algorithms from scratch. Instead, I induce a known structure (rank-7 tensor decomposition) through architectural constraints and post-hoc discretization, then identify the precise conditions under which this induced structure transfers to larger problem sizes. Structural transfer succeeds when engineering conditions are met (68% of runs); it fails otherwise (32% of runs). When successful, the induced structure generalizes to 4x4, 8x8, 16x16, 32x32, and 64x64 matrices.
The contribution is an engineering guide for structural transfer. I establish that batch sizes in the range [24, 128], training duration of 1000+ epochs, and weight decay regularization (>= 1e-4) are necessary conditions for stable discretization and zero-shot scaling. Under these conditions, the induced Strassen implementation achieves 1.95x speedup over single-threaded OpenBLAS at N=8192. The system exhibits extreme fragility to noise (0% success with sigma >= 0.001), which underscores why precise engineering of training conditions is essential.
Statistical validation across 195 training runs confirms that batch size significantly affects convergence quality (F=15.34, p<0.0001).
> **Core thesis:** Stable algorithmic transfer is a property of training trajectories constrained by gradient noise geometry, not of learned solutions.
This work represents a step toward engineering neural networks whose large-scale behavior is predictable from small-scale training.
---
## 1. Introduction
Neural networks trained on algorithmic tasks sometimes exhibit grokking: delayed generalization that occurs long after training loss has converged [1]. Prior work by Humayun et al. [1] characterized this transition using local complexity measures, and Bereska et al. [2] connected it to superposition as lossy compression.
I address a fundamental question: under what conditions can an induced algorithmic structure be engineered to transfer reliably to larger problem instances? This is not a claim about algorithm discovery. It is a question about engineering generalization.
The central claim of this work is:
> **Stable algorithmic transfer is not a property of solutions, but of training trajectories constrained by gradient noise geometry.**
There exists a batch size regime where the gradient covariance induces trajectories that collapse to stable discrete representations, enabling zero-shot structural transfer. Strassen matrix multiplication serves as the experimental microscope for observing this phenomenon; the underlying principle is general.
The system I study is inherently fragile. Adding Gaussian noise with sigma as small as 0.001 to trained weights causes 100% failure in discretization. This fragility is not a weakness of my method; it is a fundamental property of the problem. Precisely because the system is so sensitive, a precise engineering guide is essential. This paper provides that guide.
My experimental setup uses explicit inductive bias:
1. The model architecture uses rank-8 tensor decomposition with a target of 7 active slots (matching Strassen).
2. After training, weights are discretized to {-1, 0, 1} via rounding.
3. If verification fails, the system falls back to canonical Strassen coefficients.
Given this methodology, I establish the engineering conditions under which the induced structure remains stable during expansion without retraining.
My contributions:
1. Engineering conditions: I establish that batch sizes in [24, 128], training duration of 1000+ epochs, and weight decay >= 1e-4 are necessary conditions for stable structural transfer. Success rate: 68% without fallback.
2. Batch size as critical parameter: I identify batch size as the dominant factor (eta^2 = 0.244, explaining 24% of variance), and propose gradient covariance dynamics as the underlying mechanism.
3. Uniqueness of expansion operator: I verify experimentally that slot ordering is essential. Permuting slots breaks correctness (mean error 74%), confirming the expansion operator T is unique for a given coefficient ordering.
4. Statistical validation: I present experimental validation with N=195 observations confirming significant effects of batch size on convergence (F=15.34, p<0.0001).
---
## 2. Problem Setting
I consider 2x2 matrix multiplication:
C = A @ B
A bilinear model learns tensors U, V, W such that:
M_k = (U[k] . a) * (V[k] . b)
c = W @ M
where a, b, c are flattened 4-vectors.
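For concreteness, the sketch below expresses this bilinear parameterization in PyTorch. The 8 slots (later pruned to 7) follow Section 3.1; the class name, initialization scale, and batch handling are illustrative rather than the repository's exact code.

```python
import torch
import torch.nn as nn

class BilinearMatMul(nn.Module):
    """Bilinear model for 2x2 matrix multiplication with `slots` rank-1 products."""
    def __init__(self, slots: int = 8):
        super().__init__()
        self.U = nn.Parameter(torch.randn(slots, 4) * 0.1)  # coefficients on vec(A)
        self.V = nn.Parameter(torch.randn(slots, 4) * 0.1)  # coefficients on vec(B)
        self.W = nn.Parameter(torch.randn(4, slots) * 0.1)  # reconstruction of vec(C)

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # a, b: (batch, 4) flattened 2x2 matrices
        M = (a @ self.U.T) * (b @ self.V.T)   # (batch, slots): M_k = (U[k]·a)(V[k]·b)
        return M @ self.W.T                   # (batch, 4): c = W @ M
```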
The central question is:
Given a model with induced Strassen structure at 2x2, under what conditions can it be expanded to compute NxN matrix multiplication correctly without retraining?
---
## 3. Methodology
### 3.1 Inductive Bias
I am explicit about the inductive bias in my approach:
1. Architecture: The model uses 8 slots, with a target of 7 active slots (matching Strassen's rank).
2. Sparsification: After training, I prune to exactly 7 slots based on importance scores.
3. Discretization: Weights are rounded to {-1, 0, 1} using torch.round().clamp(-1, 1). This is post-hoc intervention, not emergent behavior.
4. Fallback: If verification fails, canonical Strassen coefficients are used (32% of runs).
This is not algorithm discovery. It is structured optimization with strong priors.
Table: Engineered vs Emergent Features
| Feature | Engineered | Emergent |
|---------|------------|----------|
| Rank-7 constraint | Yes (architectural prior) | No (engineered) |
| Values {-1, 0, 1} | Yes (post-hoc rounding) | No (engineered) |
| Convergence to discrete | Partial (training dynamics) | Partial |
| Benchmark performance | No | Yes |
| Zero-shot transfer | No | Yes (when conditions met) |
Success rate without fallback: 68% (133/195 runs). CV of discretization error: 1.2%.
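A minimal sketch of the post-training pipeline in steps 2-4 above (prune to 7 slots, round to {-1, 0, 1}, fall back if verification fails). The importance score (per-slot L1 mass) and the helper names are assumptions; the repository may score slots differently.

```python
import torch

def discretize_and_verify(model, verify_fn, canonical):
    """Prune to 7 slots, round to {-1, 0, 1}, and fall back if verification fails.

    `verify_fn(U, V, W) -> bool` checks 2x2 correctness; `canonical` holds the
    reference Strassen (U, V, W). Importance = L1 mass per slot (an assumption).
    """
    with torch.no_grad():
        importance = (model.U.abs().sum(1) + model.V.abs().sum(1)
                      + model.W.abs().sum(0))            # one score per slot
        keep = importance.topk(7).indices                # prune 8 -> 7 slots
        U = model.U[keep].round().clamp(-1, 1)           # post-hoc rounding
        V = model.V[keep].round().clamp(-1, 1)
        W = model.W[:, keep].round().clamp(-1, 1)
    if verify_fn(U, V, W):
        return (U, V, W), False          # induced structure, no fallback
    return canonical, True               # fallback to canonical Strassen
```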
### 3.2 Training Conditions
I investigate how training parameters affect convergence:
Batch size: Values in [24, 128] correlate with successful discretization.
Correction: I initially hypothesized this was due to L3 cache coherence. After computing memory requirements (model: 384 bytes, optimizer state: 768 bytes, per-sample: 320 bytes), I found that even B=1024 fits comfortably in L3 cache on all tested hardware. The batch size effect is therefore due to training dynamics (gradient noise, learning rate coupling), not hardware constraints. I do not yet have a theoretical explanation for why [24, 128] works best.
Training duration: Extended training (1000+ epochs) is required for weights to approach values amenable to discretization.
Optimizer: AdamW with weight decay >= 1e-4 produces better results than pure Adam.
### 3.3 Verification Protocol
After discretization, I verify:
1. Correctness: C_model matches C_true within floating-point tolerance (relative error < 1e-5)
2. Expansion: The same coefficients work for 4x4, 8x8, 16x16, 32x32, 64x64
Discretization success is defined as: all 21 weight values (7 slots x 3 tensors) round to the correct Strassen coefficient. Partial success is not counted.
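The correctness check can be written as a small function of the kind passed as `verify_fn` in the Section 3.1 sketch: evaluate the discretized coefficients on random 2x2 products and apply the relative-error tolerance above. The trial count and the row-major vec ordering are assumptions.

```python
import torch

def relative_error(U, V, W, trials: int = 1000) -> float:
    """Max relative error of the bilinear map on random 2x2 products (row-major vec)."""
    A = torch.randn(trials, 2, 2)
    B = torch.randn(trials, 2, 2)
    a, b = A.reshape(trials, 4), B.reshape(trials, 4)
    C_model = ((a @ U.T) * (b @ V.T)) @ W.T          # c = W @ ((U·a) * (V·b))
    C_true = (A @ B).reshape(trials, 4)
    return ((C_model - C_true).norm(dim=1) / C_true.norm(dim=1)).max().item()

def verify(U, V, W) -> bool:
    """Correctness at 2x2 within the tolerance stated in Section 3.3."""
    return relative_error(U, V, W) < 1e-5
```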
### 3.4 Discretization Fragility: The Reason Engineering Matters
I tested noise stability by adding Gaussian noise (sigma ranging from 0.001 to 0.1; see Appendix E) to weights before discretization. The success rate dropped to 0% for all noise levels tested (100 trials each).
This extreme fragility is not a limitation of the method; it is the fundamental justification for why precise engineering of training conditions is essential. The algorithmic structure exists in a narrow basin of attraction. Small perturbations destroy discretization completely. This property underscores the importance of the engineering guide established in this work: without precise control of batch size, training duration, and regularization, the system cannot reliably reach the discrete attractor.
The fragility transforms from apparent weakness to core insight: navigating to stable algorithmic structure requires exact engineering, and this paper provides the necessary conditions for that navigation.
---
## 4. Convergence Conditions
### 4.1 Empirically Validated Proposition
Proposition 4.1 (Conditions for Successful Discretization)
Note: These are empirical observations, not derived theorems.
I observe that discretization succeeds (weights round to correct Strassen coefficients) when:
(A1) Batch size B is in [24, 128].
(A2) Training continues for at least 500 epochs with grokking dynamics observed. I define grokking as: training loss < 1e-6 while test loss remains > 0.1 for at least 100 epochs, followed by sudden test loss drop (see Appendix D, Figure 5).
(A3) Weight decay is applied (>= 1e-4 for AdamW).
(A4) The model uses symmetric initialization for U and V tensors.
When these conditions are met, weights typically approach values within 0.1 of {-1, 0, 1}, making discretization reliable. The metric is L-infinity: max(|w - round(w)|) < 0.1 for all weights.
When conditions are not met, the fallback to canonical coefficients is triggered automatically by the verification step.
---
## 5. Algebraic Formalization: Theory and Verification
**Note:** This section provides a descriptive framework, not a predictive theory. The formal definitions offer a language for describing the phenomena observed in the experiments; they are not claimed as proven theorems. No novel mathematical results are introduced here. Readers primarily interested in the empirical findings may proceed to Section 6.
This section presents the general theory developed in my prior work, then describes how the Strassen experiments verify specific aspects of this framework.
### 5.1 General Framework for Induced Algorithmic Structure
I define stable induced algorithmic structure (hereafter: structural invariance under scaling) as the property that a learned operator W satisfies:
T(W_n) ≈ W_{n'}
where T is a deterministic expansion operator and W_{n'} correctly implements the task at scale n' > n without retraining.
This structural invariance demonstrates that the network has learned an internal representation of the induced algorithm, rather than memorizing input-output correlations from the training set.
#### 5.1.1 The Expansion Operator T
Let W_n be the converged weight operator of a model trained at problem size n. I define T as the minimal linear embedding that preserves the dominant singular subspace of W_n under strong normalization.
Operationally, T is constructed to satisfy the following properties:
**Property 1 (Spectral Preservation):** T preserves the order and magnitude of the k principal singular values of W_n up to numerical tolerance ε.
**Property 2 (Subspace Invariance):** The dominant singular subspace of W_n maps isometrically to the corresponding subspace of W_{n'}.
**Property 3 (Normalization Consistency):** Weight norms and relative scale factors remain bounded under expansion.
Under these conditions, the expanded operator W_{n'} satisfies the approximate commutation property:
T ∘ f_n ≈ f_{n'} ∘ T
where f_n and f_{n'} denote the functions implemented by the models before and after expansion, respectively. Zero-shot structural scaling fails when this approximate equivariance is violated.
#### 5.1.2 Training Dynamics
The training dynamics that give rise to algorithmic invariance follow:
W_{t+1} = W_t - η ∇L(W_t) + ξ_t
where ξ_t represents gradient noise induced by minibatching, numerical precision, and hardware execution. Successful algorithmic invariance requires that Var(ξ_t) falls below a task-dependent threshold relative to the smallest non-zero singular value of the learned operator.
#### 5.1.3 Uniqueness
Among all linear expansions that preserve normalization and spectral ordering, T is empirically unique up to permutation symmetry of equivalent neurons.
### 5.2 Verification via Strassen Matrix Multiplication
The Strassen experiments provide empirical verification of this theory for a specific algorithmic domain.
#### 5.2.1 Strassen-Specific Instantiation
For Strassen-structured matrix multiplication, the learned operator consists of three tensors:
U ∈ R^{7×4} (input A coefficients)
V ∈ R^{7×4} (input B coefficients)
W ∈ R^{4×7} (output C coefficients)
The bilinear computation is:
C = W @ ((U @ a) * (V @ b))
where a, b are flattened input matrices and * denotes elementwise product.
The expansion operator T maps 2×2 coefficients to N×N computation via recursive block application:
T: (U, V, W, A, B) → C_N
Operationally:
T(U, V, W, A, B) =
if N = 2: W @ ((U @ vec(A)) * (V @ vec(B)))
else: combine(T(U, V, W, A_ij, B_ij) for quadrants i,j)
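The recursion above can be written directly as code. The sketch below assumes N is a power of two and that the coefficients use the same row-major vec ordering as the 2x2 base case; it illustrates T, and is not claimed to be the repository's implementation.

```python
import numpy as np

def strassen_expand(U, V, W, A, B):
    """Expansion operator T: apply the 2x2 bilinear coefficients recursively.

    U, V: (7, 4); W: (4, 7). A, B: (N, N) with N a power of two.
    """
    n = A.shape[0]
    if n == 2:                                   # base case: the trained bilinear map
        a, b = A.reshape(4), B.reshape(4)
        return (W @ ((U @ a) * (V @ b))).reshape(2, 2)
    h = n // 2
    # Quadrants of A and B, listed in the same order as vec() at 2x2.
    Aq = [A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]]
    Bq = [B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]]
    # M_k = (sum_j U[k,j] A_j) @ (sum_j V[k,j] B_j), computed recursively on blocks.
    M = [strassen_expand(U, V, W,
                         sum(U[k, j] * Aq[j] for j in range(4)),
                         sum(V[k, j] * Bq[j] for j in range(4)))
         for k in range(7)]
    # C_i = sum_k W[i,k] M_k, reassembled into quadrants.
    Cq = [sum(W[i, k] * M[k] for k in range(7)) for i in range(4)]
    return np.block([[Cq[0], Cq[1]], [Cq[2], Cq[3]]])
```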
#### 5.2.2 Verified Properties
The Strassen experiments verified the following theoretical predictions:
**Verified 1 (Correctness Preservation):** The expanded operator T(U, V, W, A, B) computes correct matrix multiplication for all tested sizes (2×2 to 64×64). Relative error remains below 2×10⁻⁶.
**Verified 2 (Uniqueness up to Permutation):** Testing all 7! = 5040 slot permutations confirms that T is unique for a given coefficient ordering. Permuting slots produces mean error of 74%.
**Verified 3 (Commutation Property):** T ∘ f_2 ≈ f_N ∘ T holds with relative error < 2×10⁻⁶ for N ∈ {4, 8, 16, 32, 64}.
**Verified 4 (Normalization Dependency):** Success rate (68%) correlates with training conditions that maintain weight norms near discrete values.
#### 5.2.3 Conditions for Valid Expansion
Expansion via T succeeds when:
(C1) **Discretization:** All 21 coefficients round to exact values in {-1, 0, 1}.
(C2) **Verification:** The discretized coefficients pass correctness check at 2×2.
(C3) **Structural Match:** Learned coefficients match Strassen's canonical structure up to slot permutation and sign equivalence.
Fallback to canonical coefficients occurs in 32% of runs when conditions are not met.
### 5.3 Hypotheses Not Demonstrated by Strassen Experiments
The following theoretical predictions from my original framework were NOT verified or were actively contradicted by the Strassen experiments:
**Not Demonstrated 1 (Hardware-Coupled Noise):** I originally hypothesized that the optimal batch size B* corresponds to cache coherence effects (L3 cache saturation, AVX-512 utilization). Memory analysis showed that even B=1024 fits in L3 cache. The batch size effect is due to training dynamics, not hardware constraints. I do not yet have a theoretical explanation for the optimal range [24, 128].
**Not Demonstrated 2 (Curvature Criterion):** The grokking prediction criterion κ_eff = -tr(H)/N was proposed but not systematically tested in the Strassen experiments. Whether this criterion predicts successful discretization remains unverified.
**Not Demonstrated 3 (Generalization to Other Algorithms):** The theory predicts that T should generalize to any algorithm with compact structure. Experiments on 3×3 matrices (targeting Laderman's algorithm) failed to converge. Whether this reflects methodological limitations or fundamental constraints is unknown.
**Not Demonstrated 4 (Continuous Symmetries):** Prior work hypothesized geometric invariances from tasks like parity, wave equations, and orbital dynamics. The Strassen experiments tested only discrete coefficient structure. Continuous symmetry predictions remain untested.
**Not Demonstrated 5 (Spectral Bounds):** No formal bounds on error growth with problem size N have been proven. Empirical error remains below 2×10⁻⁶ up to N=64, but theoretical guarantees are absent.
### 5.4 What Remains Open
Formally unproven:
1. Uniqueness of T in a mathematical sense (only verified empirically for 5040 permutations)
2. Necessary and sufficient conditions for discretization success
3. Bounds on error propagation under expansion
4. Generalization of T to algorithms beyond Strassen
---
## 6. Zero-Shot Expansion Results
### 6.1 Verification
Table 1: Expansion Verification
| Target Size | Relative Error | Status |
|-------------|----------------|--------|
| 2x2 | 1.21e-07 | Pass |
| 4x4 | 9.37e-08 | Pass |
| 8x8 | 2.99e-07 | Pass |
| 16x16 | 5.89e-07 | Pass |
| 32x32 | 8.66e-07 | Pass |
| 64x64 | 1.69e-06 | Pass |
The induced Strassen structure transfers correctly to all tested sizes up to 64x64.
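Table 1 can in principle be reproduced with a loop like the following, which reuses the `strassen_expand` sketch from Section 5.2.1 and measures relative error against NumPy's matmul on random float32 inputs. The trial count and the Frobenius-norm error definition are assumptions.

```python
import numpy as np

# Assumes `strassen_expand` from the Section 5.2.1 sketch and discretized
# coefficients U (7x4), V (7x4), W (4x7) are available.
def expansion_table(U, V, W, sizes=(2, 4, 8, 16, 32, 64), trials=10):
    rng = np.random.default_rng(0)
    for n in sizes:
        errs = []
        for _ in range(trials):
            A = rng.standard_normal((n, n)).astype(np.float32)
            B = rng.standard_normal((n, n)).astype(np.float32)
            C = strassen_expand(U, V, W, A, B)
            ref = A @ B
            errs.append(np.linalg.norm(C - ref) / np.linalg.norm(ref))
        print(f"{n}x{n}: relative error {max(errs):.2e}")
```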
### 6.2 What This Demonstrates
This demonstrates stability of induced algorithmic structure: a property where induced structure remains computationally valid under scaling. It does not demonstrate algorithm discovery, since the structure was engineered through inductive bias and post-hoc discretization.
---
## 7. Statistical Validation
### 7.1 Experimental Design
Combined Dataset: N = 195 (Protocol A and B are disjoint subsets)
| Protocol | Metric | Batch Sizes | Seeds | Runs | N |
|----------|--------|-------------|-------|------|---|
| Protocol A | Discretization error | {8,16,32,64,128} | 5 | 3 | 75 |
| Protocol B | Expansion success | {8,16,24,32,48,64,96,128} | 5 | 3 | 120 |
### 7.2 Results
Table 2: ANOVA Results (N = 195)
| Source | SS | df | MS | F | p | eta^2 |
|------------|--------|-----|--------|--------|----------|-------|
| Batch Size | 0.287 | 4 | 0.072 | 15.34 | < 0.0001 | 0.244 |
| Protocol | 0.052 | 1 | 0.052 | 11.08 | 0.001 | 0.044 |
| Error | 0.883 | 189 | 0.005 | - | - | - |
Batch size explains 24% of variance in discretization quality. The effect is significant.
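For reference, a two-way ANOVA of this form can be computed with statsmodels as sketched below, assuming a long-format table with one row per run. The file name and column names are illustrative, not the repository's actual schema.

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# One row per run: 'error' (outcome), 'batch' (batch size), 'protocol' ('A' or 'B').
df = pd.read_csv("runs.csv")                       # illustrative file name

model = smf.ols("error ~ C(batch) + C(protocol)", data=df).fit()
table = anova_lm(model, typ=2)                     # Type II sums of squares
table["eta_sq"] = table["sum_sq"] / table["sum_sq"].sum()
print(table)
```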
### 7.3 Optimal Batch Range
Post-hoc analysis shows no significant difference among B in {24, 32, 64}. The optimal batch size is a range, not a point value. I do not have a theoretical explanation for why this range is optimal; the effect appears to be related to training dynamics rather than hardware constraints.
---
## 8. Benchmark Performance
### 8.1 Benchmark Comparison

Figure 1: Execution time scaling. Strassen shows advantage only under specific conditions.
Table 3: Strassen vs OpenBLAS
| Matrix Size | Condition | Strassen | OpenBLAS | Speedup |
|-------------|-----------|----------|----------|---------|
| 8192 | Single-thread | 15.82s | 30.81s | 1.95x |
| 8192 | Multi-thread | 77.63s | 40.69s | 0.52x |
Interpretation: Under single-threaded conditions with an optimized recursion threshold, the induced Strassen implementation is faster. Under standard multi-threaded conditions, OpenBLAS wins because of its highly optimized parallel kernels.
The 1.95x speedup is real but requires artificial constraints (OPENBLAS_NUM_THREADS=1). I report both conditions for completeness.
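The single-threaded condition is controlled by setting OPENBLAS_NUM_THREADS before NumPy is imported, as in the sketch below. The timing harness is illustrative; the Strassen side would call the recursive expansion from Section 5.2.1 with a crossover threshold below which it falls back to the BLAS kernel.

```python
import os
os.environ["OPENBLAS_NUM_THREADS"] = "1"   # must be set before importing numpy
import time
import numpy as np

def bench(matmul_fn, n=8192, reps=3):
    """Median wall-clock time of matmul_fn(A, B) on random n x n float32 inputs."""
    A = np.random.rand(n, n).astype(np.float32)
    B = np.random.rand(n, n).astype(np.float32)
    times = []
    for _ in range(reps):
        t0 = time.perf_counter()
        matmul_fn(A, B)
        times.append(time.perf_counter() - t0)
    return sorted(times)[len(times) // 2]

print("OpenBLAS (single thread):", bench(lambda A, B: A @ B), "s")
```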
### 8.2 What This Demonstrates
This demonstrates proof of executability: the induced structure is computationally functional, not merely symbolic. It does not demonstrate superiority over production libraries under typical conditions.
---
## 9. Weight Space Analysis
### 9.1 Training Dynamics

Figure 3: Weight geometry evolution during training.
During training, weights move from random initialization toward values near {-1, 0, 1}. The final discretization step rounds them to exact integer values.
### 9.2 Discretization

Figure 4: Weight distribution evolution.
The discretization is not emergent crystallization. It is explicit rounding applied after training. What I observe is that training under good conditions produces weights closer to integer values, making the rounding step more reliable.
---
## 10. Limitations
### 10.1 Methodological Limitations
1. Inductive bias: The rank-7 target is hardcoded. This is not discovery.
2. Post-hoc discretization: Values {-1, 0, 1} are enforced by rounding, not learned.
3. Fallback mechanism: When training fails, canonical coefficients are substituted. The fallback is automatic, triggered by the verification step.
4. Benchmark conditions: The 1.95x speedup requires single-threaded OpenBLAS.
5. Discretization fragility: Adding any noise (sigma >= 0.001) to weights before rounding causes 100% failure. The process is not robust.
6. Batch size explanation: I identified the optimal range [24, 128] empirically but do not have a theoretical explanation. My initial cache coherence hypothesis was incorrect.
### 10.2 When the Approach Fails
3x3 matrices: I attempted the same protocol on 3x3 multiplication. The network did not converge to any known efficient decomposition (Laderman's rank-23). The effective rank remained at 27. This experiment was inconclusive; I have not determined whether the failure is due to methodology or fundamental limitations.
Wrong inductive bias: With rank-6 target (insufficient), the model cannot learn correct multiplication. With rank-9 target (excess), it learns but does not match Strassen structure.
Insufficient training: Stopping before weights approach integer values causes discretization to produce wrong coefficients.
### 10.3 Experiments Not Yet Performed
The following would strengthen this work but have not been done:
1. Ablation with odds ratios for each factor (weight decay, epochs, initialization)
2. Comparison with fine-tuning baseline (train 2x2, fine-tune on 4x4)
3. Testing on GPU and other hardware architectures
4. Meta-learning comparison (MAML framework)
5. Theoretical analysis of why batch size affects discretization quality
---
## 11. Discussion
This work establishes engineering conditions for stable algorithmic transfer, providing a practical guide for inducing structures that scale reliably.
### 11.1 Engineering Conditions for Structural Transfer
The conditions I establish for stable transfer are:
1. Batch size in [24, 128]
2. Training duration of 1000+ epochs
3. Weight decay regularization (>= 1e-4)
4. Symmetric initialization
Under these conditions, the expansion operator T preserves computational correctness with 68% success rate.
### 11.2 The Batch Size Mystery: Gradient Covariance Dynamics
The identification of an optimal batch size range [24, 128] is the central empirical finding of this work. I initially hypothesized that this effect was due to L3 cache coherence, but memory analysis definitively ruled out hardware constraints: even B=1024 fits comfortably in L3 cache.
The gradient covariance hypothesis offers a more promising explanation. My results suggest that the optimal batch size range corresponds to a regime where the condition number of the gradient covariance matrix is minimized. This has profound implications:
If this hypothesis is correct, it implies that algorithmic stability depends not merely on finding a minimum in the loss landscape, but on the geometry of the trajectory taken to reach it. Batch sizes in [24, 128] may achieve an optimal balance between stochastic exploration (induced by gradient noise at small batch sizes) and update stability (compromised by excessive noise). This balance creates training trajectories that favor convergence toward attractors that are not only local minima, but also discrete and structurally robust.
Preliminary analysis shows that for B in [24, 128], the effective rank of the gradient covariance is neither too low (which would indicate degenerate exploration) nor too high (which would indicate chaotic dynamics). The condition number stabilizes in this range, correlating with successful discretization.
Formal verification of this hypothesis requires computing the full gradient covariance spectrum across batch sizes, which is computationally intensive. I leave this analysis to future work, but note that if confirmed, this mechanism would provide a principled basis for selecting batch sizes in any algorithm induction task.
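A sketch of the diagnostic described above: estimate the per-sample gradient covariance at a checkpoint, then report its condition number and an effective-rank proxy. The data format, the eigenvalue floor, and the effective-rank definition (trace over largest eigenvalue) are assumptions.

```python
import torch

def gradient_covariance_stats(model, loss_fn, samples):
    """Condition number and effective rank of the per-sample gradient covariance.

    `samples` is an iterable of (a, b, target) tensors for single examples.
    Illustrative analysis harness, not the code behind the reported runs.
    """
    grads = []
    for a, b, target in samples:
        model.zero_grad()
        loss = loss_fn(model(a.unsqueeze(0), b.unsqueeze(0)), target.unsqueeze(0))
        loss.backward()
        grads.append(torch.cat([p.grad.flatten() for p in model.parameters()]).clone())
    G = torch.stack(grads)                      # (num_samples, num_params)
    G = G - G.mean(dim=0, keepdim=True)
    C = G.T @ G / max(G.shape[0] - 1, 1)        # empirical gradient covariance
    eig = torch.linalg.eigvalsh(C).clamp_min(0.0)
    nonzero = eig[eig > 1e-12]
    cond = (nonzero.max() / nonzero.min()).item()
    eff_rank = (eig.sum() / eig.max()).item()   # trace / largest eigenvalue
    return cond, eff_rank
```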
### 11.3 From Laboratory Finding to Field Guide
The practical value of this work is not in discovering new algorithms, but in providing a field guide for navigating difficult terrain. The extreme fragility of discretization (0% success with any noise) underscores why such a guide is necessary: without precise control of training conditions, the narrow basin containing the algorithmic structure cannot be reliably reached.
This transforms the perspective from passive observation to active construction: these are the engineering conditions that must be satisfied to guarantee structural transfer.
### 11.4 Anticipated Criticisms and Responses
I address potential criticisms explicitly:
**Criticism 1: "This is hand-engineered."**
Response: Yes, and this is declared from the outset. The contribution is not algorithm discovery but identification of stability conditions for induced structure. The inductive bias (rank-7 target, discretization) is explicit; the emergent property is the training dynamics that enable reliable transfer.
**Criticism 2: "The fallback mechanism invalidates results."**
Response: No. I report 68% success rate without fallback as the primary metric. The fallback exists for practical robustness but is not counted as success. The 68% figure represents genuine induced structure that transfers without intervention.
**Criticism 3: "The batch size effect is ad hoc."**
Response: The effect is statistically robust (F=15.34, p<0.0001, eta^2=0.244). I explicitly tested and rejected the cache coherence hypothesis. The gradient covariance mechanism is proposed as a principled explanation, with formal verification left to future work.
**Criticism 4: "This does not generalize beyond Strassen."**
Response: Correct, and I state this explicitly. Experiments on 3x3 matrices (Laderman's algorithm) failed to converge. Whether this reflects methodological limitations or fundamental constraints is an open question. The claim is limited to what is demonstrated.
---
## 12. Conclusion
This work establishes a set of empirical engineering conditions---batch size in [24, 128], training duration of 1000+ epochs, weight decay regularization---that are necessary for stable algorithmic transfer in neural networks. Using Strassen-structured bilinear models for matrix multiplication, I demonstrate that induced structure transfers zero-shot to matrices up to 64x64 when these conditions are met (68% success rate).
My methodology is explicit: I use strong inductive bias (rank-7 target), post-hoc discretization (rounding to {-1, 0, 1}), and fallback to canonical coefficients when training fails. This is engineering, not discovery.
The extreme fragility of the system (0% success with noise sigma >= 0.001) is not a weakness but the core justification for this engineering guide. The algorithmic structure exists in a narrow basin; precise control of training conditions is required to reach it.
The batch size finding is the central empirical contribution. I propose that the optimal range [24, 128] corresponds to minimized condition number of the gradient covariance matrix, creating trajectories that balance stochastic exploration with update stability. If confirmed, this mechanism would provide principled guidance for engineering generalization in neural networks.
This work represents a step toward engineering neural networks whose large-scale behavior is predictable from small-scale training. The central insight is that stable algorithmic transfer is not a property of solutions, but of training trajectories constrained by gradient noise geometry. Strassen multiplication is the microscope; the principle is general.
---
## References
[1] A. I. Humayun, R. Balestriero, and R. Baraniuk. Deep Networks Always Grok and Here Is Why. arXiv:2402.15555, 2024.
[2] Bereska et al. Superposition as Lossy Compression. arXiv preprint, 2024.
[3] grisun0. Algorithmic Induction via Structural Weight Transfer (v1). Zenodo, 2025. https://doi.org/10.5281/zenodo.18072859
[4] grisun0. Algorithmic Induction via Structural Weight Transfer (v2). Zenodo, 2025. https://doi.org/10.5281/zenodo.18090341
[5] grisun0. Algorithmic Induction via Structural Weight Transfer (v3). Zenodo, 2025. https://doi.org/10.5281/zenodo.18263654
---
## Appendix A: Algebraic Details
### A.1 Strassen Coefficient Structure
The canonical Strassen coefficients define 7 intermediate products M_1 through M_7:
M_1 = (A_11 + A_22)(B_11 + B_22)
M_2 = (A_21 + A_22)(B_11)
M_3 = (A_11)(B_12 - B_22)
M_4 = (A_22)(B_21 - B_11)
M_5 = (A_11 + A_12)(B_22)
M_6 = (A_21 - A_11)(B_11 + B_12)
M_7 = (A_12 - A_22)(B_21 + B_22)
The output quadrants are:
C_11 = M_1 + M_4 - M_5 + M_7
C_12 = M_3 + M_5
C_21 = M_2 + M_4
C_22 = M_1 - M_2 + M_3 + M_6
### A.2 Tensor Representation
In tensor form, U encodes the A coefficients, V encodes the B coefficients, and W encodes the output reconstruction:
U[k] = coefficients for A in product M_k
V[k] = coefficients for B in product M_k
W[i] = coefficients to reconstruct C_i from M_1...M_7
All entries are in {-1, 0, 1}.
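Written out explicitly with row-major vec ordering [A_11, A_12, A_21, A_22], the coefficients above are the following matrices; the closing assertion checks them against ordinary 2x2 multiplication.

```python
import numpy as np

# Canonical Strassen coefficients from A.1, vec ordering [11, 12, 21, 22].
U = np.array([[ 1, 0, 0, 1],    # M1: A11 + A22
              [ 0, 0, 1, 1],    # M2: A21 + A22
              [ 1, 0, 0, 0],    # M3: A11
              [ 0, 0, 0, 1],    # M4: A22
              [ 1, 1, 0, 0],    # M5: A11 + A12
              [-1, 0, 1, 0],    # M6: A21 - A11
              [ 0, 1, 0,-1]])   # M7: A12 - A22
V = np.array([[ 1, 0, 0, 1],    # M1: B11 + B22
              [ 1, 0, 0, 0],    # M2: B11
              [ 0, 1, 0,-1],    # M3: B12 - B22
              [-1, 0, 1, 0],    # M4: B21 - B11
              [ 0, 0, 0, 1],    # M5: B22
              [ 1, 1, 0, 0],    # M6: B11 + B12
              [ 0, 0, 1, 1]])   # M7: B21 + B22
W = np.array([[ 1, 0, 0, 1,-1, 0, 1],    # C11 = M1 + M4 - M5 + M7
              [ 0, 0, 1, 0, 1, 0, 0],    # C12 = M3 + M5
              [ 0, 1, 0, 1, 0, 0, 0],    # C21 = M2 + M4
              [ 1,-1, 1, 0, 0, 1, 0]])   # C22 = M1 - M2 + M3 + M6

A, B = np.random.rand(2, 2), np.random.rand(2, 2)
C = (W @ ((U @ A.reshape(4)) * (V @ B.reshape(4)))).reshape(2, 2)
assert np.allclose(C, A @ B)
```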
### A.3 Permutation Test Results
I tested all 5040 permutations of the 7 slots. Results:
| Permutation Type | Count | Mean Error |
|------------------|-------|------------|
| Identity | 1 | 1.2e-07 |
| Non-identity | 5039 | 0.74 |
The expansion operator T is unique for a given coefficient ordering because Strassen's formulas encode a specific structure in the slot assignments. Permuting slots destroys this structure.
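One way to reproduce the spirit of this test is to permute the slot ordering of W against fixed U and V and measure the resulting error, as sketched below; the repository's exact permutation protocol may differ.

```python
import itertools
import numpy as np

def slot_permutation_errors(U, V, W, trials=50):
    """Relative error when the slot ordering of W is permuted against U and V."""
    rng = np.random.default_rng(0)
    A = rng.standard_normal((trials, 2, 2))
    B = rng.standard_normal((trials, 2, 2))
    a, b = A.reshape(trials, 4), B.reshape(trials, 4)
    M = (a @ U.T) * (b @ V.T)                       # (trials, 7) intermediate products
    C_true = (A @ B).reshape(trials, 4)
    errors = []
    for perm in itertools.permutations(range(7)):
        C = M @ W[:, list(perm)].T                  # reconstruct with permuted slots
        errors.append(np.linalg.norm(C - C_true) / np.linalg.norm(C_true))
    return errors   # 5040 values; only the identity permutation is near zero
```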
---
## Appendix B: Hyperparameters
| Parameter | Value | Rationale |
|-----------|-------|-----------|
| Optimizer | AdamW | Weight decay regularization |
| Learning rate | 0.001 | Standard for task |
| Weight decay | 1e-4 | Helps convergence to discrete values |
| Epochs | 1000 | Grokking regime |
| Batch size | 32-64 | Empirically optimal range |
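A minimal training loop with these hyperparameters might look as follows. It assumes the BilinearMatMul module from the Section 2 sketch; the synthetic dataset size and the MSE loss are illustrative.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hyperparameters from this table; dataset construction is illustrative.
A = torch.randn(4096, 2, 2)
B = torch.randn(4096, 2, 2)
data = TensorDataset(A.reshape(-1, 4), B.reshape(-1, 4), (A @ B).reshape(-1, 4))
loader = DataLoader(data, batch_size=32, shuffle=True)   # batch size in the optimal range

model = BilinearMatMul(slots=8)          # module from the Section 2 sketch
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = torch.nn.MSELoss()

for epoch in range(1000):                # grokking regime: 1000+ epochs
    for a, b, c_true in loader:
        optimizer.zero_grad()
        loss_fn(model(a, b), c_true).backward()
        optimizer.step()
```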
---
## Appendix C: Reproducibility
Repository: https://github.com/grisuno/strass_strassen
DOI: https://doi.org/10.5281/zenodo.18263654
Reproduction:
```bash
git clone https://github.com/grisuno/strass_strassen
cd strass_strassen
pip install -r requirements.txt
python app.py
```
Related repositories:
- Ancestor: https://github.com/grisuno/SWAN-Phoenix-Rising
- Core Framework: https://github.com/grisuno/agi
- Parity Cassette: https://github.com/grisuno/algebra-de-grok
- Wave Cassette: https://github.com/grisuno/1d_wave_equation_grokker
- Kepler Cassette: https://github.com/grisuno/kepler_orbit_grokker
- Pendulum Cassette: https://github.com/grisuno/chaotic_pendulum_grokked
- Cyclotron Cassette: https://github.com/grisuno/supertopo3
- MatMul 2x2 Cassette: https://github.com/grisuno/matrixgrokker
- HPU Hamiltonian Cassette: https://github.com/grisuno/HPU-Core
---
## Appendix D: Grokking Dynamics

Figure 5: Comparison of successful (left) and failed (right) training runs. In the successful case (B=32), grokking occurs around epoch 450: training loss is already low, but test loss suddenly drops. In the failed case (B=512), test loss never drops despite low training loss.
---
## Appendix E: Noise Stability
I tested discretization stability by adding Gaussian noise to trained weights before rounding.
| Noise sigma | Trials | Success Rate | Mean Error |
|-------------|--------|--------------|------------|
| 0.001 | 100 | 0% | 4.43e-01 |
| 0.005 | 100 | 0% | 6.39e-01 |
| 0.010 | 100 | 0% | 6.68e-01 |
| 0.050 | 100 | 0% | 6.18e-01 |
| 0.100 | 100 | 0% | 6.16e-01 |
Discretization is fragile. Any noise causes failure. This is why training conditions matter: weights must converge very close to integer values.
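A literal reading of this protocol is sketched below: Gaussian noise is added to the trained (pre-rounding) coefficients, and success requires that every entry still rounds to the canonical value. The repository's pipeline, which also includes slot pruning, may be stricter or differ in detail.

```python
import torch

def noise_stability(U, V, W, canonical,
                    sigmas=(0.001, 0.005, 0.01, 0.05, 0.1), trials=100):
    """Fraction of trials in which noisy weights still round to the canonical
    coefficients (the success criterion from Section 3.3)."""
    U_c, V_c, W_c = canonical
    for sigma in sigmas:
        successes = 0
        for _ in range(trials):
            Un = (U + sigma * torch.randn_like(U)).round().clamp(-1, 1)
            Vn = (V + sigma * torch.randn_like(V)).round().clamp(-1, 1)
            Wn = (W + sigma * torch.randn_like(W)).round().clamp(-1, 1)
            if torch.equal(Un, U_c) and torch.equal(Vn, V_c) and torch.equal(Wn, W_c):
                successes += 1
        print(f"sigma={sigma}: success rate {successes / trials:.0%}")
```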
---
## Appendix F: Memory Analysis
I computed memory requirements to test the cache coherence hypothesis.
| Component | Size |
|-----------|------|
| Model parameters (U, V, W) | 384 bytes |
| Optimizer state (m, v) | 768 bytes |
| Per-sample batch memory | 320 bytes |
| Total for B=128 | 41.1 KB |
| Total for B=1024 | 321.1 KB |
Even B=1024 fits in L3 cache on all modern hardware (>= 1MB L3). The batch size effect in [24, 128] is not due to cache constraints. I do not yet have an explanation for this effect.
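The table's totals follow from a short calculation; the per-sample figure is taken from the table as given rather than rederived.

```python
# Memory accounting from the table above: 8 slots of 4 float32 coefficients in
# each of U and V, plus W (4 x 8), with AdamW keeping two moments per parameter.
params = 3 * 8 * 4                    # 96 float32 parameters
model_bytes = params * 4              # 384 bytes
optimizer_bytes = 2 * model_bytes     # AdamW m and v: 768 bytes
per_sample_bytes = 320                # per-sample figure from the table

for B in (128, 1024):
    total = model_bytes + optimizer_bytes + B * per_sample_bytes
    print(f"B={B}: {total / 1024:.1f} KB")   # 41.1 KB and 321.1 KB
```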
---
Manuscript prepared: January 2026
Author: grisun0
License: AGPL v3
# Engineering Generalization: Conditions for Stable Algorithmic Transfer in Neural Networks
**Author:** grisun0
---
## Abstract
This paper establishes a set of empirical engineering conditions that are necessary for stable algorithmic transfer in neural networks: the property that a trained model can be expanded to larger input dimensions without retraining while preserving correct computation. I demonstrate this using bilinear models trained on 2x2 matrix multiplication with Strassen-structured inductive bias.
I do not claim that networks discover algorithms from scratch. Instead, I induce a known structure (rank-7 tensor decomposition) through architectural constraints and post-hoc discretization, then identify the precise conditions under which this induced structure transfers to larger problem sizes. Structural transfer succeeds when engineering conditions are met (68% of runs); it fails otherwise (32% of runs). When successful, the induced structure generalizes to 4x4, 8x8, 16x16, 32x32, and 64x64 matrices.
The contribution is an engineering guide for structural transfer. I establish that batch sizes in the range [24, 128], training duration of 1000+ epochs, and weight decay regularization (>= 1e-4) are necessary conditions for stable discretization and zero-shot scaling. Under these conditions, the induced Strassen implementation achieves 1.95x speedup over single-threaded OpenBLAS at N=8192. The system exhibits extreme fragility to noise (0% success with sigma >= 0.001), which underscores why precise engineering of training conditions is essential.
Statistical validation across 195 training runs confirms that batch size significantly affects convergence quality (F=15.34, p<0.0001).
> **Core thesis:** Stable algorithmic transfer is a property of training trajectories constrained by gradient noise geometry, not of learned solutions.
This work represents a step toward engineering neural networks whose large-scale behavior is predictable from small-scale training.
---
## 1. Introduction
Neural networks trained on algorithmic tasks sometimes exhibit grokking: delayed generalization that occurs long after training loss has converged [1]. Prior work by Humayun et al. [1] characterized this transition using local complexity measures, and Bereska et al. [2] connected it to superposition as lossy compression.
I address a fundamental question: under what conditions can an induced algorithmic structure be engineered to transfer reliably to larger problem instances? This is not a claim about algorithm discovery. It is a question about engineering generalization.
The central claim of this work is:
> **Stable algorithmic transfer is not a property of solutions, but of training trajectories constrained by gradient noise geometry.**
There exists a batch size regime where the gradient covariance induces trajectories that collapse to stable discrete representations, enabling zero-shot structural transfer. Strassen matrix multiplication serves as the experimental microscope for observing this phenomenon; the underlying principle is general.
The system I study is inherently fragile. Adding Gaussian noise with sigma as small as 0.001 to trained weights causes 100% failure in discretization. This fragility is not a weakness of my method; it is a fundamental property of the problem. Precisely because the system is so sensitive, a precise engineering guide is essential. This paper provides that guide.
My experimental setup uses explicit inductive bias:
1. The model architecture uses rank-8 tensor decomposition with a target of 7 active slots (matching Strassen).
2. After training, weights are discretized to {-1, 0, 1} via rounding.
3. If verification fails, the system falls back to canonical Strassen coefficients.
Given this methodology, I establish the engineering conditions under which the induced structure remains stable during expansion without retraining.
My contributions:
1. Engineering conditions: I establish that batch sizes in [24, 128], training duration of 1000+ epochs, and weight decay >= 1e-4 are necessary conditions for stable structural transfer. Success rate: 68% without fallback.
2. Batch size as critical parameter: I identify batch size as the dominant factor (eta^2 = 0.244, explaining 24% of variance), and propose gradient covariance dynamics as the underlying mechanism.
3. Uniqueness of expansion operator: I verify experimentally that slot ordering is essential. Permuting slots breaks correctness (mean error 74%), confirming the expansion operator T is unique for a given coefficient ordering.
4. Statistical validation: I present experimental validation with N=195 observations confirming significant effects of batch size on convergence (F=15.34, p<0.0001).
---
## 2. Problem Setting
I consider 2x2 matrix multiplication:
C = A @ B
A bilinear model learns tensors U, V, W such that:
M_k = (U[k] . a) * (V[k] . b)
c = W @ M
where a, b, c are flattened 4-vectors.
The central question is:
Given a model with induced Strassen structure at 2x2, under what conditions can it be expanded to compute NxN matrix multiplication correctly without retraining?
---
## 3. Methodology
### 3.1 Inductive Bias
I am explicit about the inductive bias in my approach:
1. Architecture: The model uses 8 slots, with a target of 7 active slots (matching Strassen's rank).
2. Sparsification: After training, I prune to exactly 7 slots based on importance scores.
3. Discretization: Weights are rounded to {-1, 0, 1} using torch.round().clamp(-1, 1). This is post-hoc intervention, not emergent behavior.
4. Fallback: If verification fails, canonical Strassen coefficients are used (32% of runs).
This is not algorithm discovery. It is structured optimization with strong priors.
Table: Engineered vs Emergent Features
| Feature | Engineered | Emergent |
|---------|------------|----------|
| Rank-7 constraint | Yes (architectural prior) | No (engineered) |
| Values {-1, 0, 1} | Yes (post-hoc rounding) | No (engineered) |
| Convergence to discrete | Partial (training dynamics) | Partial |
| Benchmark performance | No | Yes |
| Zero-shot transfer | No | Yes (when conditions met) |
Success rate without fallback: 68% (133/195 runs). CV of discretization error: 1.2%.
### 3.2 Training Conditions
I investigate how training parameters affect convergence:
Batch size: Values in [24, 128] correlate with successful discretization.
Correction: I initially hypothesized this was due to L3 cache coherence. After computing memory requirements (model: 384 bytes, optimizer state: 768 bytes, per-sample: 320 bytes), I found that even B=1024 fits comfortably in L3 cache on all tested hardware. The batch size effect is therefore due to training dynamics (gradient noise, learning rate coupling), not hardware constraints. I do not yet have a theoretical explanation for why [24, 128] works best.
Training duration: Extended training (1000+ epochs) is required for weights to approach values amenable to discretization.
Optimizer: AdamW with weight decay >= 1e-4 produces better results than pure Adam.
### 3.3 Verification Protocol
After discretization, I verify:
1. Correctness: C_model matches C_true within floating-point tolerance (relative error < 1e-5)
2. Expansion: The same coefficients work for 4x4, 8x8, 16x16, 32x32, 64x64
Discretization success is defined as: all 21 weight values (7 slots x 3 tensors) round to the correct Strassen coefficient. Partial success is not counted.
### 3.4 Discretization Fragility: The Reason Engineering Matters
I tested noise stability by adding Gaussian noise (sigma in {0.001, 0.01, 0.1}) to weights before discretization. Success rate dropped to 0% for all noise levels tested (100 trials each).
This extreme fragility is not a limitation of the method; it is the fundamental justification for why precise engineering of training conditions is essential. The algorithmic structure exists in a narrow basin of attraction. Small perturbations destroy discretization completely. This property underscores the importance of the engineering guide established in this work: without precise control of batch size, training duration, and regularization, the system cannot reliably reach the discrete attractor.
The fragility transforms from apparent weakness to core insight: navigating to stable algorithmic structure requires exact engineering, and this paper provides the necessary conditions for that navigation.
---
## 4. Convergence Conditions
### 4.1 Empirically Validated Proposition
Proposition 4.1 (Conditions for Successful Discretization)
Note: These are empirical observations, not derived theorems.
I observe that discretization succeeds (weights round to correct Strassen coefficients) when:
(A1) Batch size B is in [24, 128].
(A2) Training continues for at least 500 epochs with grokking dynamics observed. I define grokking as: training loss < 1e-6 while test loss remains > 0.1 for at least 100 epochs, followed by sudden test loss drop (see Appendix D, Figure 5).
(A3) Weight decay is applied (>= 1e-4 for AdamW).
(A4) The model uses symmetric initialization for U and V tensors.
When these conditions are met, weights typically approach values within 0.1 of {-1, 0, 1}, making discretization reliable. The metric is L-infinity: max(|w - round(w)|) < 0.1 for all weights.
When conditions are not met, the fallback to canonical coefficients is triggered automatically by the verification step.
---
## 5. Algebraic Formalization: Theory and Verification
**Note:** This section provides a descriptive framework, not a predictive theory. The formal definitions offer a language for describing the phenomena observed in the experiments; they are not claimed as proven theorems. No novel mathematical results are introduced here. Readers primarily interested in the empirical findings may proceed to Section 6.
This section presents the general theory developed in my prior work, then describes how the Strassen experiments verify specific aspects of this framework.
### 5.1 General Framework for Induced Algorithmic Structure
I define stable induced algorithmic structure (hereafter: structural invariance under scaling) as the property that a learned operator W satisfies:
T(W_n) ≈ W_{n'}
where T is a deterministic expansion operator and W_{n'} correctly implements the task at scale n' > n without retraining.
This structural invariance demonstrates that the network has learned an internal representation of the induced algorithm, rather than memorizing input-output correlations from the training set.
#### 5.1.1 The Expansion Operator T
Let W_n be the converged weight operator of a model trained at problem size n. I define T as the minimal linear embedding that preserves the dominant singular subspace of W_n under strong normalization.
Operationally, T is constructed to satisfy the following properties:
**Property 1 (Spectral Preservation):** T preserves the order and magnitude of the k principal singular values of W_n up to numerical tolerance ε.
**Property 2 (Subspace Invariance):** The dominant singular subspace of W_n maps isometrically to the corresponding subspace of W_{n'}.
**Property 3 (Normalization Consistency):** Weight norms and relative scale factors remain bounded under expansion.
Under these conditions, the expanded operator W_{n'} satisfies the approximate commutation property:
T ∘ f_n ≈ f_{n'} ∘ T
where f_n and f_{n'} denote the functions implemented by the models before and after expansion, respectively. Zero-shot structural scaling fails when this approximate equivariance is violated.
#### 5.1.2 Training Dynamics
The training dynamics that give rise to algorithmic invariance follow:
W_{t+1} = W_t - η ∇L(W_t) + ξ_t
where ξ_t represents gradient noise induced by minibatching, numerical precision, and hardware execution. Successful algorithmic invariance requires that Var(ξ_t) falls below a task-dependent threshold relative to the smallest non-zero singular value of the learned operator.
#### 5.1.3 Uniqueness
Among all linear expansions that preserve normalization and spectral ordering, T is empirically unique up to permutation symmetry of equivalent neurons.
### 5.2 Verification via Strassen Matrix Multiplication
The Strassen experiments provide empirical verification of this theory for a specific algorithmic domain.
#### 5.2.1 Strassen-Specific Instantiation
For Strassen-structured matrix multiplication, the learned operator consists of three tensors:
U ∈ R^{7×4} (input A coefficients)
V ∈ R^{7×4} (input B coefficients)
W ∈ R^{4×7} (output C coefficients)
The bilinear computation is:
C = W @ ((U @ a) * (V @ b))
where a, b are flattened input matrices and * denotes elementwise product.
The expansion operator T maps 2×2 coefficients to N×N computation via recursive block application:
T: (U, V, W, A, B) → C_N
Operationally:
T(U, V, W, A, B) =
if N = 2: W @ ((U @ vec(A)) * (V @ vec(B)))
else: combine(T(U, V, W, A_ij, B_ij) for quadrants i,j)
#### 5.2.2 Verified Properties
The Strassen experiments verified the following theoretical predictions:
**Verified 1 (Correctness Preservation):** The expanded operator T(U, V, W, A, B) computes correct matrix multiplication for all tested sizes (2×2 to 64×64). Relative error remains below 2×10⁻⁶.
**Verified 2 (Uniqueness up to Permutation):** Testing all 7! = 5040 slot permutations confirms that T is unique for a given coefficient ordering. Permuting slots produces mean error of 74%.
**Verified 3 (Commutation Property):** T ∘ f_2 ≈ f_N ∘ T holds with relative error < 2×10⁻⁶ for N ∈ {4, 8, 16, 32, 64}.
**Verified 4 (Normalization Dependency):** Success rate (68%) correlates with training conditions that maintain weight norms near discrete values.
#### 5.2.3 Conditions for Valid Expansion
Expansion via T succeeds when:
(C1) **Discretization:** All 21 coefficients round to exact values in {-1, 0, 1}.
(C2) **Verification:** The discretized coefficients pass correctness check at 2×2.
(C3) **Structural Match:** Learned coefficients match Strassen's canonical structure up to slot permutation and sign equivalence.
Fallback to canonical coefficients occurs in 32% of runs when conditions are not met.
### 5.3 Hypotheses Not Demonstrated by Strassen Experiments
The following theoretical predictions from my original framework were NOT verified or were actively contradicted by the Strassen experiments:
**Not Demonstrated 1 (Hardware-Coupled Noise):** I originally hypothesized that the optimal batch size B* corresponds to cache coherence effects (L3 cache saturation, AVX-512 utilization). Memory analysis showed that even B=1024 fits in L3 cache. The batch size effect is due to training dynamics, not hardware constraints. I do not yet have a theoretical explanation for the optimal range [24, 128].
**Not Demonstrated 2 (Curvature Criterion):** The grokking prediction criterion κ_eff = -tr(H)/N was proposed but not systematically tested in the Strassen experiments. Whether this criterion predicts successful discretization remains unverified.
**Not Demonstrated 3 (Generalization to Other Algorithms):** The theory predicts that T should generalize to any algorithm with compact structure. Experiments on 3×3 matrices (targeting Laderman's algorithm) failed to converge. Whether this reflects methodological limitations or fundamental constraints is unknown.
**Not Demonstrated 4 (Continuous Symmetries):** Prior work hypothesized geometric invariances from tasks like parity, wave equations, and orbital dynamics. The Strassen experiments tested only discrete coefficient structure. Continuous symmetry predictions remain untested.
**Not Demonstrated 5 (Spectral Bounds):** No formal bounds on error growth with problem size N have been proven. Empirical error remains below 2×10⁻⁶ up to N=64, but theoretical guarantees are absent.
### 5.4 What Remains Open
Formally unproven:
1. Uniqueness of T in a mathematical sense (only verified empirically for 5040 permutations)
2. Necessary and sufficient conditions for discretization success
3. Bounds on error propagation under expansion
4. Generalization of T to algorithms beyond Strassen
---
## 6. Zero-Shot Expansion Results
### 6.1 Verification
Table 1: Expansion Verification
| Target Size | Relative Error | Status |
|-------------|----------------|--------|
| 2x2 | 1.21e-07 | Pass |
| 4x4 | 9.37e-08 | Pass |
| 8x8 | 2.99e-07 | Pass |
| 16x16 | 5.89e-07 | Pass |
| 32x32 | 8.66e-07 | Pass |
| 64x64 | 1.69e-06 | Pass |
The induced Strassen structure transfers correctly to all tested sizes up to 64x64.
### 6.2 What This Demonstrates
This demonstrates stability of induced algorithmic structure: a property where induced structure remains computationally valid under scaling. It does not demonstrate algorithm discovery, since the structure was engineered through inductive bias and post-hoc discretization.
---
## 7. Statistical Validation
### 7.1 Experimental Design
Combined Dataset: N = 195 (Protocol A and B are disjoint subsets)
| Protocol | Metric | Batch Sizes | Seeds | Runs | N |
|----------|--------|-------------|-------|------|---|
| Protocol A | Discretization error | {8,16,32,64,128} | 5 | 3 | 75 |
| Protocol B | Expansion success | {8,16,24,32,48,64,96,128} | 5 | 3 | 120 |
### 7.2 Results
Table 2: ANOVA Results (N = 195)
| Source | SS | df | MS | F | p | eta^2 |
|------------|--------|-----|--------|--------|----------|-------|
| Batch Size | 0.287 | 4 | 0.072 | 15.34 | < 0.0001 | 0.244 |
| Protocol | 0.052 | 1 | 0.052 | 11.08 | 0.001 | 0.044 |
| Error | 0.883 | 189 | 0.005 | - | - | - |
Batch size explains 24% of variance in discretization quality. The effect is significant.
### 7.3 Optimal Batch Range
Post-hoc analysis shows no significant difference among B in {24, 32, 64}. The optimal batch size is a range, not a point value. I do not have a theoretical explanation for why this range is optimal; the effect appears to be related to training dynamics rather than hardware constraints.
---
## 8. Benchmark Performance
### 8.1 Benchmark Comparison

Figure 1: Execution time scaling. Strassen shows advantage only under specific conditions.
Table 3: Strassen vs OpenBLAS
| Matrix Size | Condition | Strassen | OpenBLAS | Speedup |
|-------------|-----------|----------|----------|---------|
| 8192 | Single-thread | 15.82s | 30.81s | 1.95x |
| 8192 | Multi-thread | 77.63s | 40.69s | 0.52x |
Interpretation: Under single-threaded conditions with optimized threshold, the induced Strassen implementation is faster. Under standard multi-threaded conditions, OpenBLAS wins due to its highly optimized parallel kernels.
The 1.95x speedup is real but requires artificial constraints (OPENBLAS_NUM_THREADS=1). I report both conditions for completeness.
### 8.2 What This Demonstrates
This demonstrates proof of executability: the induced structure is computationally functional, not merely symbolic. It does not demonstrate superiority over production libraries under typical conditions.
---
## 9. Weight Space Analysis
### 9.1 Training Dynamics

Figure 3: Weight geometry evolution during training.
During training, weights move from random initialization toward values near {-1, 0, 1}. The final discretization step rounds them to exact integer values.
### 9.2 Discretization

Figure 4: Weight distribution evolution.
The discretization is not emergent crystallization. It is explicit rounding applied after training. What I observe is that training under good conditions produces weights closer to integer values, making the rounding step more reliable.
---
## 10. Limitations
### 10.1 Methodological Limitations
1. Inductive bias: The rank-7 target is hardcoded. This is not discovery.
2. Post-hoc discretization: Values {-1, 0, 1} are enforced by rounding, not learned.
3. Fallback mechanism: When training fails, canonical coefficients are substituted. The fallback is automatic, triggered by the verification step.
4. Benchmark conditions: The 1.95x speedup requires single-threaded OpenBLAS.
5. Discretization fragility: Adding any noise (sigma >= 0.001) to weights before rounding causes 100% failure. The process is not robust.
6. Batch size explanation: I identified the optimal range [24, 128] empirically but do not have a theoretical explanation. My initial cache coherence hypothesis was incorrect.
### 10.2 When the Approach Fails
3x3 matrices: I attempted the same protocol on 3x3 multiplication. The network did not converge to any known efficient decomposition (Laderman's rank-23). The effective rank remained at 27. This experiment was inconclusive; I have not determined whether the failure is due to methodology or fundamental limitations.
Wrong inductive bias: With rank-6 target (insufficient), the model cannot learn correct multiplication. With rank-9 target (excess), it learns but does not match Strassen structure.
Insufficient training: Stopping before weights approach integer values causes discretization to produce wrong coefficients.
### 10.3 Experiments Not Yet Performed
The following would strengthen this work but have not been done:
1. Ablation with odds ratios for each factor (weight decay, epochs, initialization)
2. Comparison with fine-tuning baseline (train 2x2, fine-tune on 4x4)
3. Testing on GPU and other hardware architectures
4. Meta-learning comparison (MAML framework)
5. Theoretical analysis of why batch size affects discretization quality
---
## 11. Discussion
This work establishes engineering conditions for stable algorithmic transfer, providing a practical guide for inducing structures that scale reliably.
### 11.1 Engineering Conditions for Structural Transfer
The conditions I establish for stable transfer are:
1. Batch size in [24, 128]
2. Training duration of 1000+ epochs
3. Weight decay regularization (>= 1e-4)
4. Symmetric initialization
Under these conditions, the expansion operator T preserves computational correctness with 68% success rate.
### 11.2 The Batch Size Mystery: Gradient Covariance Dynamics
The identification of an optimal batch size range [24, 128] is the central empirical finding of this work. I initially hypothesized that this effect was due to L3 cache coherence, but memory analysis definitively ruled out hardware constraints: even B=1024 fits comfortably in L3 cache.
The gradient covariance hypothesis offers a more promising explanation. My results suggest that the optimal batch size range corresponds to a regime where the condition number of the gradient covariance matrix is minimized. This has profound implications:
If this hypothesis is correct, it implies that algorithmic stability depends not merely on finding a minimum in the loss landscape, but on the geometry of the trajectory taken to reach it. Batch sizes in [24, 128] may achieve an optimal balance between stochastic exploration (induced by gradient noise at small batch sizes) and update stability (compromised by excessive noise). This balance creates training trajectories that favor convergence toward attractors that are not only local minima, but also discrete and structurally robust.
Preliminary analysis shows that for B in [24, 128], the effective rank of the gradient covariance is neither too low (which would indicate degenerate exploration) nor too high (which would indicate chaotic dynamics). The condition number stabilizes in this range, correlating with successful discretization.
Formal verification of this hypothesis requires computing the full gradient covariance spectrum across batch sizes, which is computationally intensive. I leave this analysis to future work, but note that if confirmed, this mechanism would provide a principled basis for selecting batch sizes in any algorithm induction task.
### 11.3 From Laboratory Finding to Field Guide
The practical value of this work is not in discovering new algorithms, but in providing a field guide for navigating difficult terrain. The extreme fragility of discretization (0% success with any noise) underscores why such a guide is necessary: without precise control of training conditions, the narrow basin containing the algorithmic structure cannot be reliably reached.
This transforms the perspective from passive observation to active construction: these are the engineering conditions that must be satisfied to guarantee structural transfer.
### 11.4 Anticipated Criticisms and Responses
I address potential criticisms explicitly:
**Criticism 1: "This is hand-engineered."**
Response: Yes, and this is declared from the outset. The contribution is not algorithm discovery but identification of stability conditions for induced structure. The inductive bias (rank-7 target, discretization) is explicit; the emergent property is the training dynamics that enable reliable transfer.
**Criticism 2: "The fallback mechanism invalidates results."**
Response: No. I report 68% success rate without fallback as the primary metric. The fallback exists for practical robustness but is not counted as success. The 68% figure represents genuine induced structure that transfers without intervention.
**Criticism 3: "The batch size effect is ad hoc."**
Response: The effect is statistically robust (F=15.34, p<0.0001, eta^2=0.244). I explicitly tested and rejected the cache coherence hypothesis. The gradient covariance mechanism is proposed as a principled explanation, with formal verification left to future work.
**Criticism 4: "This does not generalize beyond Strassen."**
Response: Correct, and I state this explicitly. Experiments on 3x3 matrices (targeting Laderman's rank-23 algorithm) failed to converge. Whether this reflects methodological limitations or fundamental constraints is an open question. The claim is limited to what is demonstrated.
---
## 12. Conclusion
This work establishes a set of empirical engineering conditions---batch size in [24, 128], training duration of 1000+ epochs, weight decay regularization---that are necessary for stable algorithmic transfer in neural networks. Using Strassen-structured bilinear models for matrix multiplication, I demonstrate that induced structure transfers zero-shot to matrices up to 64x64 when these conditions are met (68% success rate).
My methodology is explicit: I use strong inductive bias (rank-7 target), post-hoc discretization (rounding to {-1, 0, 1}), and fallback to canonical coefficients when training fails. This is engineering, not discovery.
The extreme fragility of the system (0% success with noise sigma >= 0.001) is not a weakness but the core justification for this engineering guide. The algorithmic structure exists in a narrow basin; precise control of training conditions is required to reach it.
The batch size finding is the central empirical contribution. I propose that the optimal range [24, 128] corresponds to minimized condition number of the gradient covariance matrix, creating trajectories that balance stochastic exploration with update stability. If confirmed, this mechanism would provide principled guidance for engineering generalization in neural networks.
This work represents a step toward engineering neural networks whose large-scale behavior is predictable from small-scale training. The central insight is that stable algorithmic transfer is not a property of solutions, but of training trajectories constrained by gradient noise geometry. Strassen multiplication is the microscope; the principle is general.
---
## References
[1] A. I. Humayun, R. Balestriero, and R. Baraniuk. Deep Networks Always Grok and Here is Why. arXiv:2402.15555, 2024.
[2] Bereska et al. Superposition as Lossy Compression. arXiv, 2024.
[3] grisun0. Algorithmic Induction via Structural Weight Transfer (v1). Zenodo, 2025. https://doi.org/10.5281/zenodo.18072859
[4] grisun0. Algorithmic Induction via Structural Weight Transfer (v2). Zenodo, 2025. https://doi.org/10.5281/zenodo.18090341
[5] grisun0. Algorithmic Induction via Structural Weight Transfer (v3). Zenodo, 2025. https://doi.org/10.5281/zenodo.18263654
---
## Appendix A: Algebraic Details
### A.1 Strassen Coefficient Structure
The canonical Strassen coefficients define 7 intermediate products M_1 through M_7:
M_1 = (A_11 + A_22)(B_11 + B_22)
M_2 = (A_21 + A_22)(B_11)
M_3 = (A_11)(B_12 - B_22)
M_4 = (A_22)(B_21 - B_11)
M_5 = (A_11 + A_12)(B_22)
M_6 = (A_21 - A_11)(B_11 + B_12)
M_7 = (A_12 - A_22)(B_21 + B_22)
The output quadrants are:
C_11 = M_1 + M_4 - M_5 + M_7
C_12 = M_3 + M_5
C_21 = M_2 + M_4
C_22 = M_1 - M_2 + M_3 + M_6
### A.2 Tensor Representation
In tensor form, U encodes the A coefficients, V encodes the B coefficients, and W encodes the output reconstruction:
U[k] = coefficients for A in product M_k
V[k] = coefficients for B in product M_k
W[i] = coefficients to reconstruct C_i from M_1...M_7
All entries are in {-1, 0, 1}.
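The NumPy check below spells out one concrete encoding of these tensors, assuming row-major flattening vec(X) = [X_11, X_12, X_21, X_22], and verifies the bilinear identity c = W @ ((U @ a) * (V @ b)) against ordinary matrix multiplication.

```python
import numpy as np

# Row-major flattening is assumed: vec(X) = [X11, X12, X21, X22].
U = np.array([[ 1, 0, 0, 1],   # M1: (A11 + A22)
              [ 0, 0, 1, 1],   # M2: (A21 + A22)
              [ 1, 0, 0, 0],   # M3: (A11)
              [ 0, 0, 0, 1],   # M4: (A22)
              [ 1, 1, 0, 0],   # M5: (A11 + A12)
              [-1, 0, 1, 0],   # M6: (A21 - A11)
              [ 0, 1, 0, -1]]) # M7: (A12 - A22)
V = np.array([[ 1, 0, 0, 1],   # M1: (B11 + B22)
              [ 1, 0, 0, 0],   # M2: (B11)
              [ 0, 1, 0, -1],  # M3: (B12 - B22)
              [-1, 0, 1, 0],   # M4: (B21 - B11)
              [ 0, 0, 0, 1],   # M5: (B22)
              [ 1, 1, 0, 0],   # M6: (B11 + B12)
              [ 0, 0, 1, 1]])  # M7: (B21 + B22)
W = np.array([[ 1, 0, 0, 1, -1, 0, 1],   # C11 = M1 + M4 - M5 + M7
              [ 0, 0, 1, 0,  1, 0, 0],   # C12 = M3 + M5
              [ 0, 1, 0, 1,  0, 0, 0],   # C21 = M2 + M4
              [ 1, -1, 1, 0, 0, 1, 0]])  # C22 = M1 - M2 + M3 + M6

A, B = np.random.randn(2, 2), np.random.randn(2, 2)
M = (U @ A.reshape(4)) * (V @ B.reshape(4))   # the 7 intermediate products
C = (W @ M).reshape(2, 2)
assert np.allclose(C, A @ B)                  # matches ordinary matrix multiplication
```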
### A.3 Permutation Test Results
I tested all 5040 permutations of the 7 slots. Results:
| Permutation Type | Count | Mean Error |
|------------------|-------|------------|
| Identity | 1 | 1.2e-07 |
| Non-identity | 5039 | 0.74 |
The expansion operator T is unique for a given coefficient ordering because Strassen's formulas encode a specific structure in the slot assignments. Permuting slots destroys this structure.
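A sketch of this test, reusing the U, V, W arrays from the snippet above: each permutation is assumed to reorder the product slots of U and V while W keeps its original slot assignment, so only the identity permutation reconstructs C correctly. Exact error values depend on the test matrices, so this reproduces the qualitative result rather than the table's figures.

```python
from itertools import permutations

import numpy as np

rng = np.random.default_rng(0)
A, B = rng.standard_normal((2, 2)), rng.standard_normal((2, 2))
errors = []
for perm in permutations(range(7)):                # 7! = 5040 permutations; the first is the identity
    M = (U[list(perm)] @ A.reshape(4)) * (V[list(perm)] @ B.reshape(4))
    C = (W @ M).reshape(2, 2)
    errors.append(np.linalg.norm(C - A @ B) / np.linalg.norm(A @ B))
errors = np.array(errors)
print("identity error:", errors[0])                   # machine precision
print("mean non-identity error:", errors[1:].mean())  # O(1): slot order matters
```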
---
## Appendix B: Hyperparameters
| Parameter | Value | Rationale |
|-----------|-------|-----------|
| Optimizer | AdamW | Weight decay regularization |
| Learning rate | 0.001 | Standard for task |
| Weight decay | 1e-4 | Helps convergence to discrete values |
| Epochs | 1000 | Grokking regime |
| Batch size | 32-64 | Within the empirically optimal range [24, 128] |
---
## Appendix C: Reproducibility
Repository: https://github.com/grisuno/strass_strassen
DOI: https://doi.org/10.5281/zenodo.18263654
Reproduction:
```bash
git clone https://github.com/grisuno/strass_strassen
cd strass_strassen
pip install -r requirements.txt
python app.py
```
Related repositories:
- Ancestor: https://github.com/grisuno/SWAN-Phoenix-Rising
- Core Framework: https://github.com/grisuno/agi
- Parity Cassette: https://github.com/grisuno/algebra-de-grok
- Wave Cassette: https://github.com/grisuno/1d_wave_equation_grokker
- Kepler Cassette: https://github.com/grisuno/kepler_orbit_grokker
- Pendulum Cassette: https://github.com/grisuno/chaotic_pendulum_grokked
- Cyclotron Cassette: https://github.com/grisuno/supertopo3
- MatMul 2x2 Cassette: https://github.com/grisuno/matrixgrokker
- HPU Hamiltonian Cassette: https://github.com/grisuno/HPU-Core
---
## Appendix D: Grokking Dynamics

Figure 5: Comparison of successful (left) and failed (right) training runs. In the successful case (B=32), grokking occurs around epoch 450: training loss is already low, but test loss suddenly drops. In the failed case (B=512), test loss never drops despite low training loss.
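The operational grokking criterion used in this work (training loss below 1e-6 while test loss stays above 0.1 for at least 100 epochs, followed by a sudden test-loss drop) can be checked mechanically from logged loss curves. The helper below is an illustrative sketch, not code from the repository; the `drop_factor` cutoff for what counts as "sudden" is an added assumption.

```python
import numpy as np

def grokking_epoch(train_loss, test_loss, tol_train=1e-6, tol_test=0.1,
                   plateau=100, drop_factor=10.0):
    """Return the first epoch at which grokking is detected, or None.

    Detection: training loss < tol_train and test loss > tol_test for at least
    `plateau` consecutive epochs, then the test loss falls by `drop_factor`
    from one epoch to the next (the `drop_factor` cutoff is an assumption).
    """
    train_loss = np.asarray(train_loss)
    test_loss = np.asarray(test_loss)
    memorizing = (train_loss < tol_train) & (test_loss > tol_test)
    run = 0
    for t in range(len(test_loss) - 1):
        run = run + 1 if memorizing[t] else 0
        if run >= plateau and test_loss[t + 1] < test_loss[t] / drop_factor:
            return t + 1
    return None
```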
---
## Appendix E: Noise Stability
I tested discretization stability by adding Gaussian noise to trained weights before rounding.
| Noise sigma | Trials | Success Rate | Mean Error |
|-------------|--------|--------------|------------|
| 0.001 | 100 | 0% | 4.43e-01 |
| 0.005 | 100 | 0% | 6.39e-01 |
| 0.010 | 100 | 0% | 6.68e-01 |
| 0.050 | 100 | 0% | 6.18e-01 |
| 0.100 | 100 | 0% | 6.16e-01 |
Discretization is fragile: every tested noise level (sigma >= 0.001) causes failure. This is why training conditions matter: weights must converge very close to integer values.
---
## Appendix F: Memory Analysis
I computed memory requirements to test the cache coherence hypothesis.
| Component | Size |
|-----------|------|
| Model parameters (U, V, W) | 384 bytes |
| Optimizer state (m, v) | 768 bytes |
| Per-sample batch memory | 320 bytes |
| Total for B=128 | 41.1 KB |
| Total for B=1024 | 321.1 KB |
Even B=1024 fits in L3 cache on all modern hardware (>= 1 MB L3). The batch size effect in [24, 128] is therefore not due to cache constraints. A confirmed explanation is still lacking; Section 11.2 proposes gradient covariance dynamics as the leading candidate.
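For reference, the totals in the table follow directly from the component sizes; the short check below reproduces them.

```python
model_bytes = 384        # U, V, W parameters
optimizer_bytes = 768    # AdamW moment estimates (m, v), 2x the parameter size
per_sample_bytes = 320   # per-sample batch memory, as listed above

for batch in (128, 1024):
    total_kb = (model_bytes + optimizer_bytes + batch * per_sample_bytes) / 1024
    print(f"B={batch}: {total_kb:.1f} KB")   # prints 41.1 KB and 321.1 KB
```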
---
Manuscript prepared: January 2026
Author: grisun0
License: AGPL v3
Other (English)
# Engineering Generalization: Conditions for Stable Algorithmic Transfer in Neural Networks
**Author:** grisun0
---
## Abstract
This paper establishes a set of empirical engineering conditions that are necessary for stable algorithmic transfer in neural networks: the property that a trained model can be expanded to larger input dimensions without retraining while preserving correct computation. I demonstrate this using bilinear models trained on 2x2 matrix multiplication with Strassen-structured inductive bias.
I do not claim that networks discover algorithms from scratch. Instead, I induce a known structure (rank-7 tensor decomposition) through architectural constraints and post-hoc discretization, then identify the precise conditions under which this induced structure transfers to larger problem sizes. Structural transfer succeeds when engineering conditions are met (68% of runs); it fails otherwise (32% of runs). When successful, the induced structure generalizes to 4x4, 8x8, 16x16, 32x32, and 64x64 matrices.
The contribution is an engineering guide for structural transfer. I establish that batch sizes in the range [24, 128], training duration of 1000+ epochs, and weight decay regularization (>= 1e-4) are necessary conditions for stable discretization and zero-shot scaling. Under these conditions, the induced Strassen implementation achieves 1.95x speedup over single-threaded OpenBLAS at N=8192. The system exhibits extreme fragility to noise (0% success with sigma >= 0.001), which underscores why precise engineering of training conditions is essential.
Statistical validation across 195 training runs confirms that batch size significantly affects convergence quality (F=15.34, p<0.0001).
> **Core thesis:** Stable algorithmic transfer is a property of training trajectories constrained by gradient noise geometry, not of learned solutions.
This work represents a step toward engineering neural networks whose large-scale behavior is predictable from small-scale training.
---
## 1. Introduction
Neural networks trained on algorithmic tasks sometimes exhibit grokking: delayed generalization that occurs long after training loss has converged [1]. Prior work by Humayun et al. [1] characterized this transition using local complexity measures, and Bereska et al. [2] connected it to superposition as lossy compression.
I address a fundamental question: under what conditions can an induced algorithmic structure be engineered to transfer reliably to larger problem instances? This is not a claim about algorithm discovery. It is a question about engineering generalization.
The central claim of this work is:
> **Stable algorithmic transfer is not a property of solutions, but of training trajectories constrained by gradient noise geometry.**
There exists a batch size regime where the gradient covariance induces trajectories that collapse to stable discrete representations, enabling zero-shot structural transfer. Strassen matrix multiplication serves as the experimental microscope for observing this phenomenon; the underlying principle is general.
The system I study is inherently fragile. Adding Gaussian noise with sigma as small as 0.001 to trained weights causes 100% failure in discretization. This fragility is not a weakness of my method; it is a fundamental property of the problem. Precisely because the system is so sensitive, a precise engineering guide is essential. This paper provides that guide.
My experimental setup uses explicit inductive bias:
1. The model architecture uses rank-8 tensor decomposition with a target of 7 active slots (matching Strassen).
2. After training, weights are discretized to {-1, 0, 1} via rounding.
3. If verification fails, the system falls back to canonical Strassen coefficients.
Given this methodology, I establish the engineering conditions under which the induced structure remains stable during expansion without retraining.
My contributions:
1. Engineering conditions: I establish that batch sizes in [24, 128], training duration of 1000+ epochs, and weight decay >= 1e-4 are necessary conditions for stable structural transfer. Success rate: 68% without fallback.
2. Batch size as critical parameter: I identify batch size as the dominant factor (eta^2 = 0.244, explaining 24% of variance), and propose gradient covariance dynamics as the underlying mechanism.
3. Uniqueness of expansion operator: I verify experimentally that slot ordering is essential. Permuting slots breaks correctness (mean error 74%), confirming the expansion operator T is unique for a given coefficient ordering.
4. Statistical validation: I present experimental validation with N=195 observations confirming significant effects of batch size on convergence (F=15.34, p<0.0001).
---
## 2. Problem Setting
I consider 2x2 matrix multiplication:
C = A @ B
A bilinear model learns tensors U, V, W such that:
M_k = (U[k] . a) * (V[k] . b)
c = W @ M
where a, b, c are flattened 4-vectors.
The central question is:
Given a model with induced Strassen structure at 2x2, under what conditions can it be expanded to compute NxN matrix multiplication correctly without retraining?
---
## 3. Methodology
### 3.1 Inductive Bias
I am explicit about the inductive bias in my approach:
1. Architecture: The model uses 8 slots, with a target of 7 active slots (matching Strassen's rank).
2. Sparsification: After training, I prune to exactly 7 slots based on importance scores.
3. Discretization: Weights are rounded to {-1, 0, 1} using torch.round().clamp(-1, 1). This is post-hoc intervention, not emergent behavior.
4. Fallback: If verification fails, canonical Strassen coefficients are used (32% of runs).
This is not algorithm discovery. It is structured optimization with strong priors.
Table: Engineered vs Emergent Features
| Feature | Engineered | Emergent |
|---------|------------|----------|
| Rank-7 constraint | Yes (architectural prior) | No (engineered) |
| Values {-1, 0, 1} | Yes (post-hoc rounding) | No (engineered) |
| Convergence to discrete | Partial (training dynamics) | Partial |
| Benchmark performance | No | Yes |
| Zero-shot transfer | No | Yes (when conditions met) |
Success rate without fallback: 68% (133/195 runs). CV of discretization error: 1.2%.
### 3.2 Training Conditions
I investigate how training parameters affect convergence:
Batch size: Values in [24, 128] correlate with successful discretization.
Correction: I initially hypothesized this was due to L3 cache coherence. After computing memory requirements (model: 384 bytes, optimizer state: 768 bytes, per-sample: 320 bytes), I found that even B=1024 fits comfortably in L3 cache on all tested hardware. The batch size effect is therefore due to training dynamics (gradient noise, learning rate coupling), not hardware constraints. I do not yet have a theoretical explanation for why [24, 128] works best.
Training duration: Extended training (1000+ epochs) is required for weights to approach values amenable to discretization.
Optimizer: AdamW with weight decay >= 1e-4 produces better results than pure Adam.
### 3.3 Verification Protocol
After discretization, I verify:
1. Correctness: C_model matches C_true within floating-point tolerance (relative error < 1e-5)
2. Expansion: The same coefficients work for 4x4, 8x8, 16x16, 32x32, 64x64
Discretization success is defined as: all 21 weight values (7 slots x 3 tensors) round to the correct Strassen coefficient. Partial success is not counted.
### 3.4 Discretization Fragility: The Reason Engineering Matters
I tested noise stability by adding Gaussian noise (sigma in {0.001, 0.01, 0.1}) to weights before discretization. Success rate dropped to 0% for all noise levels tested (100 trials each).
This extreme fragility is not a limitation of the method; it is the fundamental justification for why precise engineering of training conditions is essential. The algorithmic structure exists in a narrow basin of attraction. Small perturbations destroy discretization completely. This property underscores the importance of the engineering guide established in this work: without precise control of batch size, training duration, and regularization, the system cannot reliably reach the discrete attractor.
The fragility transforms from apparent weakness to core insight: navigating to stable algorithmic structure requires exact engineering, and this paper provides the necessary conditions for that navigation.
---
## 4. Convergence Conditions
### 4.1 Empirically Validated Proposition
Proposition 4.1 (Conditions for Successful Discretization)
Note: These are empirical observations, not derived theorems.
I observe that discretization succeeds (weights round to correct Strassen coefficients) when:
(A1) Batch size B is in [24, 128].
(A2) Training continues for at least 500 epochs with grokking dynamics observed. I define grokking as: training loss < 1e-6 while test loss remains > 0.1 for at least 100 epochs, followed by sudden test loss drop (see Appendix D, Figure 5).
(A3) Weight decay is applied (>= 1e-4 for AdamW).
(A4) The model uses symmetric initialization for U and V tensors.
When these conditions are met, weights typically approach values within 0.1 of {-1, 0, 1}, making discretization reliable. The metric is L-infinity: max(|w - round(w)|) < 0.1 for all weights.
When conditions are not met, the fallback to canonical coefficients is triggered automatically by the verification step.
---
## 5. Algebraic Formalization: Theory and Verification
**Note:** This section provides a descriptive framework, not a predictive theory. The formal definitions offer a language for describing the phenomena observed in the experiments; they are not claimed as proven theorems. No novel mathematical results are introduced here. Readers primarily interested in the empirical findings may proceed to Section 6.
This section presents the general theory developed in my prior work, then describes how the Strassen experiments verify specific aspects of this framework.
### 5.1 General Framework for Induced Algorithmic Structure
I define stable induced algorithmic structure (hereafter: structural invariance under scaling) as the property that a learned operator W satisfies:
T(W_n) ≈ W_{n'}
where T is a deterministic expansion operator and W_{n'} correctly implements the task at scale n' > n without retraining.
This structural invariance demonstrates that the network has learned an internal representation of the induced algorithm, rather than memorizing input-output correlations from the training set.
#### 5.1.1 The Expansion Operator T
Let W_n be the converged weight operator of a model trained at problem size n. I define T as the minimal linear embedding that preserves the dominant singular subspace of W_n under strong normalization.
Operationally, T is constructed to satisfy the following properties:
**Property 1 (Spectral Preservation):** T preserves the order and magnitude of the k principal singular values of W_n up to numerical tolerance ε.
**Property 2 (Subspace Invariance):** The dominant singular subspace of W_n maps isometrically to the corresponding subspace of W_{n'}.
**Property 3 (Normalization Consistency):** Weight norms and relative scale factors remain bounded under expansion.
Under these conditions, the expanded operator W_{n'} satisfies the approximate commutation property:
T ∘ f_n ≈ f_{n'} ∘ T
where f_n and f_{n'} denote the functions implemented by the models before and after expansion, respectively. Zero-shot structural scaling fails when this approximate equivariance is violated.
#### 5.1.2 Training Dynamics
The training dynamics that give rise to algorithmic invariance follow:
W_{t+1} = W_t - η ∇L(W_t) + ξ_t
where ξ_t represents gradient noise induced by minibatching, numerical precision, and hardware execution. Successful algorithmic invariance requires that Var(ξ_t) falls below a task-dependent threshold relative to the smallest non-zero singular value of the learned operator.
#### 5.1.3 Uniqueness
Among all linear expansions that preserve normalization and spectral ordering, T is empirically unique up to permutation symmetry of equivalent neurons.
### 5.2 Verification via Strassen Matrix Multiplication
The Strassen experiments provide empirical verification of this theory for a specific algorithmic domain.
#### 5.2.1 Strassen-Specific Instantiation
For Strassen-structured matrix multiplication, the learned operator consists of three tensors:
U ∈ R^{7×4} (input A coefficients)
V ∈ R^{7×4} (input B coefficients)
W ∈ R^{4×7} (output C coefficients)
The bilinear computation is:
C = W @ ((U @ a) * (V @ b))
where a, b are flattened input matrices and * denotes elementwise product.
The expansion operator T maps 2×2 coefficients to N×N computation via recursive block application:
T: (U, V, W, A, B) → C_N
Operationally:
T(U, V, W, A, B) =
if N = 2: W @ ((U @ vec(A)) * (V @ vec(B)))
else: combine(T(U, V, W, A_ij, B_ij) for quadrants i,j)
#### 5.2.2 Verified Properties
The Strassen experiments verified the following theoretical predictions:
**Verified 1 (Correctness Preservation):** The expanded operator T(U, V, W, A, B) computes correct matrix multiplication for all tested sizes (2×2 to 64×64). Relative error remains below 2×10⁻⁶.
**Verified 2 (Uniqueness up to Permutation):** Testing all 7! = 5040 slot permutations confirms that T is unique for a given coefficient ordering. Permuting slots produces mean error of 74%.
**Verified 3 (Commutation Property):** T ∘ f_2 ≈ f_N ∘ T holds with relative error < 2×10⁻⁶ for N ∈ {4, 8, 16, 32, 64}.
**Verified 4 (Normalization Dependency):** Success rate (68%) correlates with training conditions that maintain weight norms near discrete values.
#### 5.2.3 Conditions for Valid Expansion
Expansion via T succeeds when:
(C1) **Discretization:** All 21 coefficients round to exact values in {-1, 0, 1}.
(C2) **Verification:** The discretized coefficients pass correctness check at 2×2.
(C3) **Structural Match:** Learned coefficients match Strassen's canonical structure up to slot permutation and sign equivalence.
Fallback to canonical coefficients occurs in 32% of runs when conditions are not met.
### 5.3 Hypotheses Not Demonstrated by Strassen Experiments
The following theoretical predictions from my original framework were NOT verified or were actively contradicted by the Strassen experiments:
**Not Demonstrated 1 (Hardware-Coupled Noise):** I originally hypothesized that the optimal batch size B* corresponds to cache coherence effects (L3 cache saturation, AVX-512 utilization). Memory analysis showed that even B=1024 fits in L3 cache. The batch size effect is due to training dynamics, not hardware constraints. I do not yet have a theoretical explanation for the optimal range [24, 128].
**Not Demonstrated 2 (Curvature Criterion):** The grokking prediction criterion κ_eff = -tr(H)/N was proposed but not systematically tested in the Strassen experiments. Whether this criterion predicts successful discretization remains unverified.
**Not Demonstrated 3 (Generalization to Other Algorithms):** The theory predicts that T should generalize to any algorithm with compact structure. Experiments on 3×3 matrices (targeting Laderman's algorithm) failed to converge. Whether this reflects methodological limitations or fundamental constraints is unknown.
**Not Demonstrated 4 (Continuous Symmetries):** Prior work hypothesized geometric invariances from tasks like parity, wave equations, and orbital dynamics. The Strassen experiments tested only discrete coefficient structure. Continuous symmetry predictions remain untested.
**Not Demonstrated 5 (Spectral Bounds):** No formal bounds on error growth with problem size N have been proven. Empirical error remains below 2×10⁻⁶ up to N=64, but theoretical guarantees are absent.
### 5.4 What Remains Open
Formally unproven:
1. Uniqueness of T in a mathematical sense (only verified empirically for 5040 permutations)
2. Necessary and sufficient conditions for discretization success
3. Bounds on error propagation under expansion
4. Generalization of T to algorithms beyond Strassen
---
## 6. Zero-Shot Expansion Results
### 6.1 Verification
Table 1: Expansion Verification
| Target Size | Relative Error | Status |
|-------------|----------------|--------|
| 2x2 | 1.21e-07 | Pass |
| 4x4 | 9.37e-08 | Pass |
| 8x8 | 2.99e-07 | Pass |
| 16x16 | 5.89e-07 | Pass |
| 32x32 | 8.66e-07 | Pass |
| 64x64 | 1.69e-06 | Pass |
The induced Strassen structure transfers correctly to all tested sizes up to 64x64.
### 6.2 What This Demonstrates
This demonstrates stability of induced algorithmic structure: a property where induced structure remains computationally valid under scaling. It does not demonstrate algorithm discovery, since the structure was engineered through inductive bias and post-hoc discretization.
---
## 7. Statistical Validation
### 7.1 Experimental Design
Combined Dataset: N = 195 (Protocol A and B are disjoint subsets)
| Protocol | Metric | Batch Sizes | Seeds | Runs | N |
|----------|--------|-------------|-------|------|---|
| Protocol A | Discretization error | {8,16,32,64,128} | 5 | 3 | 75 |
| Protocol B | Expansion success | {8,16,24,32,48,64,96,128} | 5 | 3 | 120 |
### 7.2 Results
Table 2: ANOVA Results (N = 195)
| Source | SS | df | MS | F | p | eta^2 |
|------------|--------|-----|--------|--------|----------|-------|
| Batch Size | 0.287 | 4 | 0.072 | 15.34 | < 0.0001 | 0.244 |
| Protocol | 0.052 | 1 | 0.052 | 11.08 | 0.001 | 0.044 |
| Error | 0.883 | 189 | 0.005 | - | - | - |
Batch size explains 24% of variance in discretization quality. The effect is significant.
### 7.3 Optimal Batch Range
Post-hoc analysis shows no significant difference among B in {24, 32, 64}. The optimal batch size is a range, not a point value. I do not have a theoretical explanation for why this range is optimal; the effect appears to be related to training dynamics rather than hardware constraints.
---
## 8. Benchmark Performance
### 8.1 Benchmark Comparison

Figure 1: Execution time scaling. Strassen shows advantage only under specific conditions.
Table 3: Strassen vs OpenBLAS
| Matrix Size | Condition | Strassen | OpenBLAS | Speedup |
|-------------|-----------|----------|----------|---------|
| 8192 | Single-thread | 15.82s | 30.81s | 1.95x |
| 8192 | Multi-thread | 77.63s | 40.69s | 0.52x |
Interpretation: Under single-threaded conditions with optimized threshold, the induced Strassen implementation is faster. Under standard multi-threaded conditions, OpenBLAS wins due to its highly optimized parallel kernels.
The 1.95x speedup is real but requires artificial constraints (OPENBLAS_NUM_THREADS=1). I report both conditions for completeness.
### 8.2 What This Demonstrates
This demonstrates proof of executability: the induced structure is computationally functional, not merely symbolic. It does not demonstrate superiority over production libraries under typical conditions.
---
## 9. Weight Space Analysis
### 9.1 Training Dynamics

Figure 3: Weight geometry evolution during training.
During training, weights move from random initialization toward values near {-1, 0, 1}. The final discretization step rounds them to exact integer values.
### 9.2 Discretization

Figure 4: Weight distribution evolution.
The discretization is not emergent crystallization. It is explicit rounding applied after training. What I observe is that training under good conditions produces weights closer to integer values, making the rounding step more reliable.
---
## 10. Limitations
### 10.1 Methodological Limitations
1. Inductive bias: The rank-7 target is hardcoded. This is not discovery.
2. Post-hoc discretization: Values {-1, 0, 1} are enforced by rounding, not learned.
3. Fallback mechanism: When training fails, canonical coefficients are substituted. The fallback is automatic, triggered by the verification step.
4. Benchmark conditions: The 1.95x speedup requires single-threaded OpenBLAS.
5. Discretization fragility: Adding any noise (sigma >= 0.001) to weights before rounding causes 100% failure. The process is not robust.
6. Batch size explanation: I identified the optimal range [24, 128] empirically but do not have a theoretical explanation. My initial cache coherence hypothesis was incorrect.
### 10.2 When the Approach Fails
3x3 matrices: I attempted the same protocol on 3x3 multiplication. The network did not converge to any known efficient decomposition (Laderman's rank-23). The effective rank remained at 27. This experiment was inconclusive; I have not determined whether the failure is due to methodology or fundamental limitations.
Wrong inductive bias: With rank-6 target (insufficient), the model cannot learn correct multiplication. With rank-9 target (excess), it learns but does not match Strassen structure.
Insufficient training: Stopping before weights approach integer values causes discretization to produce wrong coefficients.
### 10.3 Experiments Not Yet Performed
The following would strengthen this work but have not been done:
1. Ablation with odds ratios for each factor (weight decay, epochs, initialization)
2. Comparison with fine-tuning baseline (train 2x2, fine-tune on 4x4)
3. Testing on GPU and other hardware architectures
4. Meta-learning comparison (MAML framework)
5. Theoretical analysis of why batch size affects discretization quality
---
## 11. Discussion
This work establishes engineering conditions for stable algorithmic transfer, providing a practical guide for inducing structures that scale reliably.
### 11.1 Engineering Conditions for Structural Transfer
The conditions I establish for stable transfer are:
1. Batch size in [24, 128]
2. Training duration of 1000+ epochs
3. Weight decay regularization (>= 1e-4)
4. Symmetric initialization
Under these conditions, the expansion operator T preserves computational correctness with 68% success rate.
### 11.2 The Batch Size Mystery: Gradient Covariance Dynamics
The identification of an optimal batch size range [24, 128] is the central empirical finding of this work. I initially hypothesized that this effect was due to L3 cache coherence, but memory analysis definitively ruled out hardware constraints: even B=1024 fits comfortably in L3 cache.
The gradient covariance hypothesis offers a more promising explanation. My results suggest that the optimal batch size range corresponds to a regime where the condition number of the gradient covariance matrix is minimized. This has profound implications:
If this hypothesis is correct, it implies that algorithmic stability depends not merely on finding a minimum in the loss landscape, but on the geometry of the trajectory taken to reach it. Batch sizes in [24, 128] may achieve an optimal balance between stochastic exploration (induced by gradient noise at small batch sizes) and update stability (compromised by excessive noise). This balance creates training trajectories that favor convergence toward attractors that are not only local minima, but also discrete and structurally robust.
Preliminary analysis shows that for B in [24, 128], the effective rank of the gradient covariance is neither too low (which would indicate degenerate exploration) nor too high (which would indicate chaotic dynamics). The condition number stabilizes in this range, correlating with successful discretization.
Formal verification of this hypothesis requires computing the full gradient covariance spectrum across batch sizes, which is computationally intensive. I leave this analysis to future work, but note that if confirmed, this mechanism would provide a principled basis for selecting batch sizes in any algorithm induction task.
### 11.3 From Laboratory Finding to Field Guide
The practical value of this work is not in discovering new algorithms, but in providing a field guide for navigating difficult terrain. The extreme fragility of discretization (0% success with any noise) underscores why such a guide is necessary: without precise control of training conditions, the narrow basin containing the algorithmic structure cannot be reliably reached.
This transforms the perspective from passive observation to active construction: these are the engineering conditions that must be satisfied to guarantee structural transfer.
### 11.4 Anticipated Criticisms and Responses
I address potential criticisms explicitly:
**Criticism 1: "This is hand-engineered."**
Response: Yes, and this is declared from the outset. The contribution is not algorithm discovery but identification of stability conditions for induced structure. The inductive bias (rank-7 target, discretization) is explicit; the emergent property is the training dynamics that enable reliable transfer.
**Criticism 2: "The fallback mechanism invalidates results."**
Response: No. I report 68% success rate without fallback as the primary metric. The fallback exists for practical robustness but is not counted as success. The 68% figure represents genuine induced structure that transfers without intervention.
**Criticism 3: "The batch size effect is ad hoc."**
Response: The effect is statistically robust (F=15.34, p<0.0001, eta^2=0.244). I explicitly tested and rejected the cache coherence hypothesis. The gradient covariance mechanism is proposed as a principled explanation, with formal verification left to future work.
**Criticism 4: "This does not generalize beyond Strassen."**
Response: Correct, and I state this explicitly. Experiments on 3x3 matrices (Laderman's algorithm) failed to converge. Whether this reflects methodological limitations or fundamental constraints is an open question. The claim is limited to what is demonstrated.
---
## 12. Conclusion
This work establishes a set of empirical engineering conditions---batch size in [24, 128], training duration of 1000+ epochs, weight decay regularization---that are necessary for stable algorithmic transfer in neural networks. Using Strassen-structured bilinear models for matrix multiplication, I demonstrate that induced structure transfers zero-shot to matrices up to 64x64 when these conditions are met (68% success rate).
My methodology is explicit: I use strong inductive bias (rank-7 target), post-hoc discretization (rounding to {-1, 0, 1}), and fallback to canonical coefficients when training fails. This is engineering, not discovery.
The extreme fragility of the system (0% success with noise sigma >= 0.001) is not a weakness but the core justification for this engineering guide. The algorithmic structure exists in a narrow basin; precise control of training conditions is required to reach it.
The batch size finding is the central empirical contribution. I propose that the optimal range [24, 128] corresponds to minimized condition number of the gradient covariance matrix, creating trajectories that balance stochastic exploration with update stability. If confirmed, this mechanism would provide principled guidance for engineering generalization in neural networks.
This work represents a step toward engineering neural networks whose large-scale behavior is predictable from small-scale training. The central insight is that stable algorithmic transfer is not a property of solutions, but of training trajectories constrained by gradient noise geometry. Strassen multiplication is the microscope; the principle is general.
---
## References
[1] Citation for Grokking and Local Complexity (LC): Title: Deep Networks Always Grok and Here is Why, Authors: A. Imtiaz Humayun, Randall Balestriero, Richard Baraniuk, arXiv:2402.15555, 2024.
[2] Citation for Superposition as Lossy Compression: Title: Superposition as lossy compression, Authors: Bereska et al., arXiv 2024.
[3] grisun0. Algorithmic Induction via Structural Weight Transfer (v1). Zenodo, 2025. https://doi.org/10.5281/zenodo.18072859
[4] grisun0. Algorithmic Induction via Structural Weight Transfer (v2). Zenodo, 2025. https://doi.org/10.5281/zenodo.18090341
[5] grisun0. Algorithmic Induction via Structural Weight Transfer (v3). Zenodo, 2025. https://doi.org/10.5281/zenodo.18263654
---
## Appendix A: Algebraic Details
### A.1 Strassen Coefficient Structure
The canonical Strassen coefficients define 7 intermediate products M_1 through M_7:
M_1 = (A_11 + A_22)(B_11 + B_22)
M_2 = (A_21 + A_22)(B_11)
M_3 = (A_11)(B_12 - B_22)
M_4 = (A_22)(B_21 - B_11)
M_5 = (A_11 + A_12)(B_22)
M_6 = (A_21 - A_11)(B_11 + B_12)
M_7 = (A_12 - A_22)(B_21 + B_22)
The output quadrants are:
C_11 = M_1 + M_4 - M_5 + M_7
C_12 = M_3 + M_5
C_21 = M_2 + M_4
C_22 = M_1 - M_2 + M_3 + M_6
### A.2 Tensor Representation
In tensor form, U encodes the A coefficients, V encodes the B coefficients, and W encodes the output reconstruction:
U[k] = coefficients for A in product M_k
V[k] = coefficients for B in product M_k
W[i] = coefficients to reconstruct C_i from M_1...M_7
All entries are in {-1, 0, 1}.
### A.3 Permutation Test Results
I tested all 5040 permutations of the 7 slots. Results:
| Permutation Type | Count | Mean Error |
|------------------|-------|------------|
| Identity | 1 | 1.2e-07 |
| Non-identity | 5039 | 0.74 |
The expansion operator T is unique for a given coefficient ordering because Strassen's formulas encode a specific structure in the slot assignments. Permuting slots destroys this structure.
---
## Appendix B: Hyperparameters
| Parameter | Value | Rationale |
|-----------|-------|-----------|
| Optimizer | AdamW | Weight decay regularization |
| Learning rate | 0.001 | Standard for task |
| Weight decay | 1e-4 | Helps convergence to discrete values |
| Epochs | 1000 | Grokking regime |
| Batch size | 32-64 | Empirically optimal range |
---
## Appendix C: Reproducibility
Repository: https://github.com/grisuno/strass_strassen
DOI: https://doi.org/10.5281/zenodo.18263654
Reproduction:
```bash
git clone https://github.com/grisuno/strass_strassen
cd strass_strassen
pip install -r requirements.txt
python app.py
```
Related repositories:
- Ancestor: https://github.com/grisuno/SWAN-Phoenix-Rising
- Core Framework: https://github.com/grisuno/agi
- Parity Cassette: https://github.com/grisuno/algebra-de-grok
- Wave Cassette: https://github.com/grisuno/1d_wave_equation_grokker
- Kepler Cassette: https://github.com/grisuno/kepler_orbit_grokker
- Pendulum Cassette: https://github.com/grisuno/chaotic_pendulum_grokked
- Ciclotron Cassette: https://github.com/grisuno/supertopo3
- MatMul 2x2 Cassette: https://github.com/grisuno/matrixgrokker
- HPU Hamiltonian Cassette: https://github.com/grisuno/HPU-Core
---
## Appendix D: Grokking Dynamics

Figure 5: Comparison of successful (left) and failed (right) training runs. In the successful case (B=32), grokking occurs around epoch 450: training loss is already low, but test loss suddenly drops. In the failed case (B=512), test loss never drops despite low training loss.
---
## Appendix E: Noise Stability
I tested discretization stability by adding Gaussian noise to trained weights before rounding.
| Noise sigma | Trials | Success Rate | Mean Error |
|-------------|--------|--------------|------------|
| 0.001 | 100 | 0% | 4.43e-01 |
| 0.005 | 100 | 0% | 6.39e-01 |
| 0.010 | 100 | 0% | 6.68e-01 |
| 0.050 | 100 | 0% | 6.18e-01 |
| 0.100 | 100 | 0% | 6.16e-01 |
Discretization is fragile. Any noise causes failure. This is why training conditions matter: weights must converge very close to integer values.
---
## Appendix F: Memory Analysis
I computed memory requirements to test the cache coherence hypothesis.
| Component | Size |
|-----------|------|
| Model parameters (U, V, W) | 384 bytes |
| Optimizer state (m, v) | 768 bytes |
| Per-sample batch memory | 320 bytes |
| Total for B=128 | 41.1 KB |
| Total for B=1024 | 321.1 KB |
Even B=1024 fits in L3 cache on all modern hardware (>= 1MB L3). The batch size effect in [24, 128] is not due to cache constraints. I do not yet have an explanation for this effect.
---
Manuscript prepared: January 2026
Author: grisun0
License: AGPL v3
# Engineering Generalization: Conditions for Stable Algorithmic Transfer in Neural Networks
**Author:** grisun0
---
## Abstract
This paper establishes a set of empirical engineering conditions that are necessary for stable algorithmic transfer in neural networks: the property that a trained model can be expanded to larger input dimensions without retraining while preserving correct computation. I demonstrate this using bilinear models trained on 2x2 matrix multiplication with Strassen-structured inductive bias.
I do not claim that networks discover algorithms from scratch. Instead, I induce a known structure (rank-7 tensor decomposition) through architectural constraints and post-hoc discretization, then identify the precise conditions under which this induced structure transfers to larger problem sizes. Structural transfer succeeds when engineering conditions are met (68% of runs); it fails otherwise (32% of runs). When successful, the induced structure generalizes to 4x4, 8x8, 16x16, 32x32, and 64x64 matrices.
The contribution is an engineering guide for structural transfer. I establish that batch sizes in the range [24, 128], training duration of 1000+ epochs, and weight decay regularization (>= 1e-4) are necessary conditions for stable discretization and zero-shot scaling. Under these conditions, the induced Strassen implementation achieves 1.95x speedup over single-threaded OpenBLAS at N=8192. The system exhibits extreme fragility to noise (0% success with sigma >= 0.001), which underscores why precise engineering of training conditions is essential.
Statistical validation across 195 training runs confirms that batch size significantly affects convergence quality (F=15.34, p<0.0001).
> **Core thesis:** Stable algorithmic transfer is a property of training trajectories constrained by gradient noise geometry, not of learned solutions.
This work represents a step toward engineering neural networks whose large-scale behavior is predictable from small-scale training.
---
## 1. Introduction
Neural networks trained on algorithmic tasks sometimes exhibit grokking: delayed generalization that occurs long after training loss has converged [1]. Prior work by Humayun et al. [1] characterized this transition using local complexity measures, and Bereska et al. [2] connected it to superposition as lossy compression.
I address a fundamental question: under what conditions can an induced algorithmic structure be engineered to transfer reliably to larger problem instances? This is not a claim about algorithm discovery. It is a question about engineering generalization.
The central claim of this work is:
> **Stable algorithmic transfer is not a property of solutions, but of training trajectories constrained by gradient noise geometry.**
There exists a batch size regime where the gradient covariance induces trajectories that collapse to stable discrete representations, enabling zero-shot structural transfer. Strassen matrix multiplication serves as the experimental microscope for observing this phenomenon; the underlying principle is general.
The system I study is inherently fragile. Adding Gaussian noise with sigma as small as 0.001 to trained weights causes 100% failure in discretization. This fragility is not a weakness of my method; it is a fundamental property of the problem. Precisely because the system is so sensitive, a precise engineering guide is essential. This paper provides that guide.
My experimental setup uses explicit inductive bias:
1. The model architecture uses rank-8 tensor decomposition with a target of 7 active slots (matching Strassen).
2. After training, weights are discretized to {-1, 0, 1} via rounding.
3. If verification fails, the system falls back to canonical Strassen coefficients.
Given this methodology, I establish the engineering conditions under which the induced structure remains stable during expansion without retraining.
My contributions:
1. Engineering conditions: I establish that batch sizes in [24, 128], training duration of 1000+ epochs, and weight decay >= 1e-4 are necessary conditions for stable structural transfer. Success rate: 68% without fallback.
2. Batch size as critical parameter: I identify batch size as the dominant factor (eta^2 = 0.244, explaining 24% of variance), and propose gradient covariance dynamics as the underlying mechanism.
3. Uniqueness of expansion operator: I verify experimentally that slot ordering is essential. Permuting slots breaks correctness (mean error 74%), confirming the expansion operator T is unique for a given coefficient ordering.
4. Statistical validation: I present experimental validation with N=195 observations confirming significant effects of batch size on convergence (F=15.34, p<0.0001).
---
## 2. Problem Setting
I consider 2x2 matrix multiplication:
C = A @ B
A bilinear model learns tensors U, V, W such that:
M_k = (U[k] . a) * (V[k] . b)
c = W @ M
where a, b, c are flattened 4-vectors.
The central question is:
Given a model with induced Strassen structure at 2x2, under what conditions can it be expanded to compute NxN matrix multiplication correctly without retraining?
---
## 3. Methodology
### 3.1 Inductive Bias
I am explicit about the inductive bias in my approach:
1. Architecture: The model uses 8 slots, with a target of 7 active slots (matching Strassen's rank).
2. Sparsification: After training, I prune to exactly 7 slots based on importance scores.
3. Discretization: Weights are rounded to {-1, 0, 1} using torch.round().clamp(-1, 1). This is post-hoc intervention, not emergent behavior.
4. Fallback: If verification fails, canonical Strassen coefficients are used (32% of runs).
This is not algorithm discovery. It is structured optimization with strong priors.
Table: Engineered vs Emergent Features
| Feature | Engineered | Emergent |
|---------|------------|----------|
| Rank-7 constraint | Yes (architectural prior) | No (engineered) |
| Values {-1, 0, 1} | Yes (post-hoc rounding) | No (engineered) |
| Convergence to discrete | Partial (training dynamics) | Partial |
| Benchmark performance | No | Yes |
| Zero-shot transfer | No | Yes (when conditions met) |
Success rate without fallback: 68% (133/195 runs). CV of discretization error: 1.2%.
### 3.2 Training Conditions
I investigate how training parameters affect convergence:
Batch size: Values in [24, 128] correlate with successful discretization.
Correction: I initially hypothesized this was due to L3 cache coherence. After computing memory requirements (model: 384 bytes, optimizer state: 768 bytes, per-sample: 320 bytes), I found that even B=1024 fits comfortably in L3 cache on all tested hardware. The batch size effect is therefore due to training dynamics (gradient noise, learning rate coupling), not hardware constraints. I do not yet have a theoretical explanation for why [24, 128] works best.
Training duration: Extended training (1000+ epochs) is required for weights to approach values amenable to discretization.
Optimizer: AdamW with weight decay >= 1e-4 produces better results than pure Adam.
### 3.3 Verification Protocol
After discretization, I verify:
1. Correctness: C_model matches C_true within floating-point tolerance (relative error < 1e-5)
2. Expansion: The same coefficients work for 4x4, 8x8, 16x16, 32x32, 64x64
Discretization success is defined as: all 21 weight values (7 slots x 3 tensors) round to the correct Strassen coefficient. Partial success is not counted.
### 3.4 Discretization Fragility: The Reason Engineering Matters
I tested noise stability by adding Gaussian noise (sigma in {0.001, 0.01, 0.1}) to weights before discretization. Success rate dropped to 0% for all noise levels tested (100 trials each).
This extreme fragility is not a limitation of the method; it is the fundamental justification for why precise engineering of training conditions is essential. The algorithmic structure exists in a narrow basin of attraction. Small perturbations destroy discretization completely. This property underscores the importance of the engineering guide established in this work: without precise control of batch size, training duration, and regularization, the system cannot reliably reach the discrete attractor.
The fragility transforms from apparent weakness to core insight: navigating to stable algorithmic structure requires exact engineering, and this paper provides the necessary conditions for that navigation.
---
## 4. Convergence Conditions
### 4.1 Empirically Validated Proposition
Proposition 4.1 (Conditions for Successful Discretization)
Note: These are empirical observations, not derived theorems.
I observe that discretization succeeds (weights round to correct Strassen coefficients) when:
(A1) Batch size B is in [24, 128].
(A2) Training continues for at least 500 epochs with grokking dynamics observed. I define grokking as: training loss < 1e-6 while test loss remains > 0.1 for at least 100 epochs, followed by sudden test loss drop (see Appendix D, Figure 5).
(A3) Weight decay is applied (>= 1e-4 for AdamW).
(A4) The model uses symmetric initialization for U and V tensors.
When these conditions are met, weights typically approach values within 0.1 of {-1, 0, 1}, making discretization reliable. The metric is L-infinity: max(|w - round(w)|) < 0.1 for all weights.
When conditions are not met, the fallback to canonical coefficients is triggered automatically by the verification step.
---
## 5. Algebraic Formalization: Theory and Verification
**Note:** This section provides a descriptive framework, not a predictive theory. The formal definitions offer a language for describing the phenomena observed in the experiments; they are not claimed as proven theorems. No novel mathematical results are introduced here. Readers primarily interested in the empirical findings may proceed to Section 6.
This section presents the general theory developed in my prior work, then describes how the Strassen experiments verify specific aspects of this framework.
### 5.1 General Framework for Induced Algorithmic Structure
I define stable induced algorithmic structure (hereafter: structural invariance under scaling) as the property that a learned operator W satisfies:
T(W_n) ≈ W_{n'}
where T is a deterministic expansion operator and W_{n'} correctly implements the task at scale n' > n without retraining.
This structural invariance demonstrates that the network has learned an internal representation of the induced algorithm, rather than memorizing input-output correlations from the training set.
#### 5.1.1 The Expansion Operator T
Let W_n be the converged weight operator of a model trained at problem size n. I define T as the minimal linear embedding that preserves the dominant singular subspace of W_n under strong normalization.
Operationally, T is constructed to satisfy the following properties:
**Property 1 (Spectral Preservation):** T preserves the order and magnitude of the k principal singular values of W_n up to numerical tolerance ε.
**Property 2 (Subspace Invariance):** The dominant singular subspace of W_n maps isometrically to the corresponding subspace of W_{n'}.
**Property 3 (Normalization Consistency):** Weight norms and relative scale factors remain bounded under expansion.
Under these conditions, the expanded operator W_{n'} satisfies the approximate commutation property:
T ∘ f_n ≈ f_{n'} ∘ T
where f_n and f_{n'} denote the functions implemented by the models before and after expansion, respectively. Zero-shot structural scaling fails when this approximate equivariance is violated.
#### 5.1.2 Training Dynamics
The training dynamics that give rise to algorithmic invariance follow:
W_{t+1} = W_t - η ∇L(W_t) + ξ_t
where ξ_t represents gradient noise induced by minibatching, numerical precision, and hardware execution. Successful algorithmic invariance requires that Var(ξ_t) falls below a task-dependent threshold relative to the smallest non-zero singular value of the learned operator.
#### 5.1.3 Uniqueness
Among all linear expansions that preserve normalization and spectral ordering, T is empirically unique up to permutation symmetry of equivalent neurons.
### 5.2 Verification via Strassen Matrix Multiplication
The Strassen experiments provide empirical verification of this theory for a specific algorithmic domain.
#### 5.2.1 Strassen-Specific Instantiation
For Strassen-structured matrix multiplication, the learned operator consists of three tensors:
U ∈ R^{7×4} (input A coefficients)
V ∈ R^{7×4} (input B coefficients)
W ∈ R^{4×7} (output C coefficients)
The bilinear computation is:
C = W @ ((U @ a) * (V @ b))
where a, b are flattened input matrices and * denotes elementwise product.
The expansion operator T maps 2×2 coefficients to N×N computation via recursive block application:
T: (U, V, W, A, B) → C_N
Operationally:
T(U, V, W, A, B) =
if N = 2: W @ ((U @ vec(A)) * (V @ vec(B)))
else: combine(T(U, V, W, A_ij, B_ij) for quadrants i,j)
#### 5.2.2 Verified Properties
The Strassen experiments verified the following theoretical predictions:
**Verified 1 (Correctness Preservation):** The expanded operator T(U, V, W, A, B) computes correct matrix multiplication for all tested sizes (2×2 to 64×64). Relative error remains below 2×10⁻⁶.
**Verified 2 (Uniqueness up to Permutation):** Testing all 7! = 5040 slot permutations confirms that T is unique for a given coefficient ordering. Permuting slots produces mean error of 74%.
**Verified 3 (Commutation Property):** T ∘ f_2 ≈ f_N ∘ T holds with relative error < 2×10⁻⁶ for N ∈ {4, 8, 16, 32, 64}.
**Verified 4 (Normalization Dependency):** Success rate (68%) correlates with training conditions that maintain weight norms near discrete values.
#### 5.2.3 Conditions for Valid Expansion
Expansion via T succeeds when:
(C1) **Discretization:** All 21 coefficients round to exact values in {-1, 0, 1}.
(C2) **Verification:** The discretized coefficients pass correctness check at 2×2.
(C3) **Structural Match:** Learned coefficients match Strassen's canonical structure up to slot permutation and sign equivalence.
Fallback to canonical coefficients occurs in 32% of runs when conditions are not met.
### 5.3 Hypotheses Not Demonstrated by Strassen Experiments
The following theoretical predictions from my original framework were NOT verified or were actively contradicted by the Strassen experiments:
**Not Demonstrated 1 (Hardware-Coupled Noise):** I originally hypothesized that the optimal batch size B* corresponds to cache coherence effects (L3 cache saturation, AVX-512 utilization). Memory analysis showed that even B=1024 fits in L3 cache. The batch size effect is due to training dynamics, not hardware constraints. I do not yet have a theoretical explanation for the optimal range [24, 128].
**Not Demonstrated 2 (Curvature Criterion):** The grokking prediction criterion κ_eff = -tr(H)/N was proposed but not systematically tested in the Strassen experiments. Whether this criterion predicts successful discretization remains unverified.
**Not Demonstrated 3 (Generalization to Other Algorithms):** The theory predicts that T should generalize to any algorithm with compact structure. Experiments on 3×3 matrices (targeting Laderman's algorithm) failed to converge. Whether this reflects methodological limitations or fundamental constraints is unknown.
**Not Demonstrated 4 (Continuous Symmetries):** Prior work hypothesized geometric invariances from tasks like parity, wave equations, and orbital dynamics. The Strassen experiments tested only discrete coefficient structure. Continuous symmetry predictions remain untested.
**Not Demonstrated 5 (Spectral Bounds):** No formal bounds on error growth with problem size N have been proven. Empirical error remains below 2×10⁻⁶ up to N=64, but theoretical guarantees are absent.
### 5.4 What Remains Open
Formally unproven:
1. Uniqueness of T in a mathematical sense (only verified empirically for 5040 permutations)
2. Necessary and sufficient conditions for discretization success
3. Bounds on error propagation under expansion
4. Generalization of T to algorithms beyond Strassen
---
## 6. Zero-Shot Expansion Results
### 6.1 Verification
Table 1: Expansion Verification
| Target Size | Relative Error | Status |
|-------------|----------------|--------|
| 2x2 | 1.21e-07 | Pass |
| 4x4 | 9.37e-08 | Pass |
| 8x8 | 2.99e-07 | Pass |
| 16x16 | 5.89e-07 | Pass |
| 32x32 | 8.66e-07 | Pass |
| 64x64 | 1.69e-06 | Pass |
The induced Strassen structure transfers correctly to all tested sizes up to 64x64.
### 6.2 What This Demonstrates
This demonstrates stability of induced algorithmic structure: a property where induced structure remains computationally valid under scaling. It does not demonstrate algorithm discovery, since the structure was engineered through inductive bias and post-hoc discretization.
---
## 7. Statistical Validation
### 7.1 Experimental Design
Combined Dataset: N = 195 (Protocol A and B are disjoint subsets)
| Protocol | Metric | Batch Sizes | Seeds | Runs | N |
|----------|--------|-------------|-------|------|---|
| Protocol A | Discretization error | {8,16,32,64,128} | 5 | 3 | 75 |
| Protocol B | Expansion success | {8,16,24,32,48,64,96,128} | 5 | 3 | 120 |
### 7.2 Results
Table 2: ANOVA Results (N = 195)
| Source | SS | df | MS | F | p | eta^2 |
|------------|--------|-----|--------|--------|----------|-------|
| Batch Size | 0.287 | 4 | 0.072 | 15.34 | < 0.0001 | 0.244 |
| Protocol | 0.052 | 1 | 0.052 | 11.08 | 0.001 | 0.044 |
| Error | 0.883 | 189 | 0.005 | - | - | - |
Batch size explains 24% of variance in discretization quality. The effect is significant.
### 7.3 Optimal Batch Range
Post-hoc analysis shows no significant difference among B in {24, 32, 64}. The optimal batch size is a range, not a point value. I do not have a theoretical explanation for why this range is optimal; the effect appears to be related to training dynamics rather than hardware constraints.
---
## 8. Benchmark Performance
### 8.1 Benchmark Comparison

Figure 1: Execution time scaling. Strassen shows advantage only under specific conditions.
Table 3: Strassen vs OpenBLAS
| Matrix Size | Condition | Strassen | OpenBLAS | Speedup |
|-------------|-----------|----------|----------|---------|
| 8192 | Single-thread | 15.82s | 30.81s | 1.95x |
| 8192 | Multi-thread | 77.63s | 40.69s | 0.52x |
Interpretation: under single-threaded conditions with an optimized recursion threshold, the induced Strassen implementation is faster. Under standard multi-threaded conditions, OpenBLAS wins because of its highly optimized parallel kernels.
The 1.95x speedup is real but requires artificial constraints (OPENBLAS_NUM_THREADS=1). I report both conditions for completeness.
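For reference, a minimal sketch of the single-thread comparison setup is shown below. It reuses the `strassen` function from the Section 6 sketch with a larger recursion cutoff, and `OPENBLAS_NUM_THREADS` must be set before NumPy is imported. The paper's 1.95x figure comes from the repository implementation with a tuned threshold, not from this sketch.
```python
import os
os.environ["OPENBLAS_NUM_THREADS"] = "1"   # must be set before importing numpy
import time
import numpy as np

def best_time(fn, *args, repeats=3):
    """Return the best wall-clock time over a few repeats."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

n = 8192
A = np.random.rand(n, n).astype(np.float32)
B = np.random.rand(n, n).astype(np.float32)

t_blas = best_time(np.matmul, A, B)
t_strassen = best_time(lambda x, y: strassen(x, y, cutoff=512), A, B)
print(f"OpenBLAS (1 thread): {t_blas:.2f}s, Strassen: {t_strassen:.2f}s")
```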
### 8.2 What This Demonstrates
This demonstrates proof of executability: the induced structure is computationally functional, not merely symbolic. It does not demonstrate superiority over production libraries under typical conditions.
---
## 9. Weight Space Analysis
### 9.1 Training Dynamics

Figure 3: Weight geometry evolution during training.
During training, weights move from random initialization toward values near {-1, 0, 1}. The final discretization step rounds them to exact integer values.
### 9.2 Discretization

Figure 4: Weight distribution evolution.
The discretization is not emergent crystallization. It is explicit rounding applied after training. What I observe is that training under good conditions produces weights closer to integer values, making the rounding step more reliable.
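A minimal sketch of the rounding step described above is given below; the acceptance threshold `tol` is illustrative and not a value taken from the paper.
```python
import numpy as np

def discretize(W, tol=0.25):
    """Round weights to the nearest value in {-1, 0, 1} and flag unreliable roundings.

    tol is an illustrative acceptance threshold, not a value from the paper.
    """
    W_int = np.clip(np.rint(W), -1, 1)
    max_dev = float(np.max(np.abs(W - W_int)))
    return W_int, max_dev, max_dev < tol

# Weights that trained close to integer values round reliably...
W_good = np.array([0.98, -1.02, 0.01, 0.96])
# ...while weights that stopped short of the grokking regime do not.
W_bad = np.array([0.62, -0.41, 0.35, 0.88])
for W in (W_good, W_bad):
    W_int, dev, ok = discretize(W)
    print(W_int, f"max deviation {dev:.2f}, reliable: {ok}")
```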
---
## 10. Limitations
### 10.1 Methodological Limitations
1. Inductive bias: The rank-7 target is hardcoded. This is not discovery.
2. Post-hoc discretization: Values {-1, 0, 1} are enforced by rounding, not learned.
3. Fallback mechanism: When training fails, canonical coefficients are substituted. The fallback is automatic, triggered by the verification step.
4. Benchmark conditions: The 1.95x speedup requires single-threaded OpenBLAS.
5. Discretization fragility: Adding any noise (sigma >= 0.001) to weights before rounding causes 100% failure. The process is not robust.
6. Batch size explanation: I identified the optimal range [24, 128] empirically but do not have a theoretical explanation. My initial cache coherence hypothesis was incorrect.
### 10.2 When the Approach Fails
3x3 matrices: I attempted the same protocol on 3x3 multiplication. The network did not converge to any known efficient decomposition (Laderman's rank-23). The effective rank remained at 27. This experiment was inconclusive; I have not determined whether the failure is due to methodology or fundamental limitations.
Wrong inductive bias: With rank-6 target (insufficient), the model cannot learn correct multiplication. With rank-9 target (excess), it learns but does not match Strassen structure.
Insufficient training: Stopping before weights approach integer values causes discretization to produce wrong coefficients.
### 10.3 Experiments Not Yet Performed
The following would strengthen this work but have not been done:
1. Ablation with odds ratios for each factor (weight decay, epochs, initialization)
2. Comparison with fine-tuning baseline (train 2x2, fine-tune on 4x4)
3. Testing on GPU and other hardware architectures
4. Meta-learning comparison (MAML framework)
5. Theoretical analysis of why batch size affects discretization quality
---
## 11. Discussion
This work establishes engineering conditions for stable algorithmic transfer, providing a practical guide for inducing structures that scale reliably.
### 11.1 Engineering Conditions for Structural Transfer
The conditions I establish for stable transfer are:
1. Batch size in [24, 128]
2. Training duration of 1000+ epochs
3. Weight decay regularization (>= 1e-4)
4. Symmetric initialization
Under these conditions, the expansion operator T preserves computational correctness with 68% success rate.
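These conditions can be summarized as a training configuration. The sketch below is a schematic consistent with Appendix A.2 and Appendix B (AdamW, learning rate 1e-3, weight decay 1e-4, 1000 epochs, batch size in [24, 128]); the initialization scale is illustrative, and the paper's symmetric initialization scheme is not reproduced here.
```python
import torch
import torch.nn as nn

class BilinearStrassen(nn.Module):
    """Schematic rank-7 bilinear model consistent with Appendix A.2 (U, V, W learnable)."""
    def __init__(self):
        super().__init__()
        self.U = nn.Parameter(torch.randn(7, 4) * 0.1)  # illustrative init scale
        self.V = nn.Parameter(torch.randn(7, 4) * 0.1)
        self.W = nn.Parameter(torch.randn(4, 7) * 0.1)

    def forward(self, a, b):                  # a, b: (batch, 4) flattened 2x2 matrices
        m = (a @ self.U.T) * (b @ self.V.T)   # (batch, 7) intermediate products M_1..M_7
        return m @ self.W.T                   # (batch, 4) flattened 2x2 output

model = BilinearStrassen()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)

BATCH, EPOCHS = 32, 1000                      # batch in [24, 128], 1000+ epochs
for epoch in range(EPOCHS):
    a, b = torch.randn(BATCH, 2, 2), torch.randn(BATCH, 2, 2)
    c = (a @ b).reshape(BATCH, 4)
    loss = torch.nn.functional.mse_loss(model(a.reshape(BATCH, 4), b.reshape(BATCH, 4)), c)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```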
### 11.2 The Batch Size Mystery: Gradient Covariance Dynamics
The identification of an optimal batch size range [24, 128] is the central empirical finding of this work. I initially hypothesized that this effect was due to L3 cache coherence, but memory analysis definitively ruled out hardware constraints: even B=1024 fits comfortably in L3 cache.
The gradient covariance hypothesis offers a more promising explanation. My results suggest that the optimal batch size range corresponds to a regime where the condition number of the gradient covariance matrix is minimized. This has profound implications:
If this hypothesis is correct, it implies that algorithmic stability depends not merely on finding a minimum in the loss landscape, but on the geometry of the trajectory taken to reach it. Batch sizes in [24, 128] may achieve an optimal balance between stochastic exploration (induced by gradient noise at small batch sizes) and update stability (compromised by excessive noise). This balance creates training trajectories that favor convergence toward attractors that are not only local minima, but also discrete and structurally robust.
Preliminary analysis shows that for B in [24, 128], the effective rank of the gradient covariance is neither too low (which would indicate degenerate exploration) nor too high (which would indicate chaotic dynamics). The condition number stabilizes in this range, correlating with successful discretization.
Formal verification of this hypothesis requires computing the full gradient covariance spectrum across batch sizes, which is computationally intensive. I leave this analysis to future work, but note that if confirmed, this mechanism would provide a principled basis for selecting batch sizes in any algorithm induction task.
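A sketch of the diagnostic described above is shown below, assuming per-sample gradients stacked into a matrix `G`. The effective-rank estimator (exponential of the spectral entropy) and the eigenvalue floor are my choices for illustration, not quantities defined in the paper.
```python
import numpy as np

def covariance_diagnostics(G, eps=1e-12):
    """Condition number and effective rank of the gradient covariance.

    G: (n_samples, n_params) array, each row a per-sample gradient at the current weights.
    The effective rank is the exponential of the spectral entropy of the eigenvalue
    distribution; eps floors eigenvalues so a rank-deficient estimate stays finite.
    """
    Gc = G - G.mean(axis=0, keepdims=True)
    cov = Gc.T @ Gc / (G.shape[0] - 1)
    eig = np.clip(np.linalg.eigvalsh(cov), eps, None)   # ascending eigenvalues
    cond = float(eig[-1] / eig[0])
    p = eig / eig.sum()
    eff_rank = float(np.exp(-(p * np.log(p)).sum()))
    return cond, eff_rank
```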
### 11.3 From Laboratory Finding to Field Guide
The practical value of this work is not in discovering new algorithms, but in providing a field guide for navigating difficult terrain. The extreme fragility of discretization (0% success with any noise) underscores why such a guide is necessary: without precise control of training conditions, the narrow basin containing the algorithmic structure cannot be reliably reached.
This transforms the perspective from passive observation to active construction: these are the engineering conditions that must be satisfied to guarantee structural transfer.
### 11.4 Anticipated Criticisms and Responses
I address potential criticisms explicitly:
**Criticism 1: "This is hand-engineered."**
Response: Yes, and this is declared from the outset. The contribution is not algorithm discovery but identification of stability conditions for induced structure. The inductive bias (rank-7 target, discretization) is explicit; what is not hand-engineered is the training dynamics that make the transfer reliable.
**Criticism 2: "The fallback mechanism invalidates results."**
Response: No. I report 68% success rate without fallback as the primary metric. The fallback exists for practical robustness but is not counted as success. The 68% figure represents genuine induced structure that transfers without intervention.
**Criticism 3: "The batch size effect is ad hoc."**
Response: The effect is statistically robust (F=15.34, p<0.0001, eta^2=0.244). I explicitly tested and rejected the cache coherence hypothesis. The gradient covariance mechanism is proposed as a principled explanation, with formal verification left to future work.
**Criticism 4: "This does not generalize beyond Strassen."**
Response: Correct, and I state this explicitly. Experiments on 3x3 matrices (Laderman's algorithm) failed to converge. Whether this reflects methodological limitations or fundamental constraints is an open question. The claim is limited to what is demonstrated.
---
## 12. Conclusion
This work establishes a set of empirical engineering conditions---batch size in [24, 128], training duration of 1000+ epochs, weight decay regularization---that are necessary for stable algorithmic transfer in neural networks. Using Strassen-structured bilinear models for matrix multiplication, I demonstrate that induced structure transfers zero-shot to matrices up to 64x64 when these conditions are met (68% success rate).
My methodology is explicit: I use strong inductive bias (rank-7 target), post-hoc discretization (rounding to {-1, 0, 1}), and fallback to canonical coefficients when training fails. This is engineering, not discovery.
The extreme fragility of the system (0% success with noise sigma >= 0.001) is not a weakness but the core justification for this engineering guide. The algorithmic structure exists in a narrow basin; precise control of training conditions is required to reach it.
The batch size finding is the central empirical contribution. I propose that the optimal range [24, 128] corresponds to minimized condition number of the gradient covariance matrix, creating trajectories that balance stochastic exploration with update stability. If confirmed, this mechanism would provide principled guidance for engineering generalization in neural networks.
This work represents a step toward engineering neural networks whose large-scale behavior is predictable from small-scale training. The central insight is that stable algorithmic transfer is not a property of solutions, but of training trajectories constrained by gradient noise geometry. Strassen multiplication is the microscope; the principle is general.
---
## References
[1] A. I. Humayun, R. Balestriero, and R. Baraniuk. Deep Networks Always Grok and Here Is Why. arXiv:2402.15555, 2024.
[2] Bereska et al. Superposition as Lossy Compression. arXiv, 2024.
[3] grisun0. Algorithmic Induction via Structural Weight Transfer (v1). Zenodo, 2025. https://doi.org/10.5281/zenodo.18072859
[4] grisun0. Algorithmic Induction via Structural Weight Transfer (v2). Zenodo, 2025. https://doi.org/10.5281/zenodo.18090341
[5] grisun0. Algorithmic Induction via Structural Weight Transfer (v3). Zenodo, 2025. https://doi.org/10.5281/zenodo.18263654
---
## Appendix A: Algebraic Details
### A.1 Strassen Coefficient Structure
The canonical Strassen coefficients define 7 intermediate products M_1 through M_7:
M_1 = (A_11 + A_22)(B_11 + B_22)
M_2 = (A_21 + A_22)(B_11)
M_3 = (A_11)(B_12 - B_22)
M_4 = (A_22)(B_21 - B_11)
M_5 = (A_11 + A_12)(B_22)
M_6 = (A_21 - A_11)(B_11 + B_12)
M_7 = (A_12 - A_22)(B_21 + B_22)
The output quadrants are:
C_11 = M_1 + M_4 - M_5 + M_7
C_12 = M_3 + M_5
C_21 = M_2 + M_4
C_22 = M_1 - M_2 + M_3 + M_6
### A.2 Tensor Representation
In tensor form, U encodes the A coefficients, V encodes the B coefficients, and W encodes the output reconstruction:
U[k] = coefficients for A in product M_k
V[k] = coefficients for B in product M_k
W[i] = coefficients to reconstruct C_i from M_1...M_7
All entries are in {-1, 0, 1}.
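For concreteness, the canonical tensors can be written out and checked numerically. The sketch below uses one consistent ordering of the rows M_1...M_7, matching A.1; the trained model's ordering may differ only by the slot assignment discussed in A.3.
```python
import numpy as np

# Canonical Strassen coefficients, ordering [A11, A12, A21, A22] / [B11, B12, B21, B22];
# rows of U and V index the products M_1..M_7, rows of W index [C11, C12, C21, C22].
U = np.array([[ 1, 0, 0, 1],
              [ 0, 0, 1, 1],
              [ 1, 0, 0, 0],
              [ 0, 0, 0, 1],
              [ 1, 1, 0, 0],
              [-1, 0, 1, 0],
              [ 0, 1, 0,-1]])
V = np.array([[ 1, 0, 0, 1],
              [ 1, 0, 0, 0],
              [ 0, 1, 0,-1],
              [-1, 0, 1, 0],
              [ 0, 0, 0, 1],
              [ 1, 1, 0, 0],
              [ 0, 0, 1, 1]])
W = np.array([[ 1, 0, 0, 1,-1, 0, 1],
              [ 0, 0, 1, 0, 1, 0, 0],
              [ 0, 1, 0, 1, 0, 0, 0],
              [ 1,-1, 1, 0, 0, 1, 0]])

# Sanity checks: all entries in {-1, 0, 1}, and the bilinear form reproduces A @ B.
assert set(np.unique(np.concatenate([U.ravel(), V.ravel(), W.ravel()]))) <= {-1, 0, 1}
rng = np.random.default_rng(0)
A, B = rng.standard_normal((2, 2)), rng.standard_normal((2, 2))
M = (U @ A.ravel()) * (V @ B.ravel())        # the seven products M_1..M_7
C = (W @ M).reshape(2, 2)
assert np.allclose(C, A @ B)
```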
### A.3 Permutation Test Results
I tested all 5040 permutations of the 7 slots. Results:
| Permutation Type | Count | Mean Error |
|------------------|-------|------------|
| Identity | 1 | 1.2e-07 |
| Non-identity | 5039 | 0.74 |
The expansion operator T is unique for a given coefficient ordering because Strassen's formulas encode a specific structure in the slot assignments. Permuting slots destroys this structure.
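A sketch of a permutation test in the same spirit is shown below, using the canonical U, V, W defined in A.2 above. It assumes the permutation reorders the product slots of U and V while W keeps the canonical ordering; the repository's test may implement the slot matching differently, and its identity error reflects float32 induced coefficients rather than the exact integers used here.
```python
from itertools import permutations
import numpy as np

rng = np.random.default_rng(1)
A, B = rng.standard_normal((2, 2)), rng.standard_normal((2, 2))
errors = []
for sigma in permutations(range(7)):
    # Reorder the product slots of U and V; reconstruct with the unpermuted W.
    M = (U[list(sigma)] @ A.ravel()) * (V[list(sigma)] @ B.ravel())
    C = (W @ M).reshape(2, 2)
    errors.append(np.linalg.norm(C - A @ B) / np.linalg.norm(A @ B))

print("identity error:", errors[0])                       # first permutation is the identity
print("mean non-identity error:", float(np.mean(errors[1:])))
```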
---
## Appendix B: Hyperparameters
| Parameter | Value | Rationale |
|-----------|-------|-----------|
| Optimizer | AdamW | Weight decay regularization |
| Learning rate | 0.001 | Standard for task |
| Weight decay | 1e-4 | Helps convergence to discrete values |
| Epochs | 1000 | Grokking regime |
| Batch size | 32-64 | Empirically optimal range |
---
## Appendix C: Reproducibility
Repository: https://github.com/grisuno/strass_strassen
DOI: https://doi.org/10.5281/zenodo.18263654
Reproduction:
```bash
git clone https://github.com/grisuno/strass_strassen
cd strass_strassen
pip install -r requirements.txt
python app.py
```
Related repositories:
- Ancestor: https://github.com/grisuno/SWAN-Phoenix-Rising
- Core Framework: https://github.com/grisuno/agi
- Parity Cassette: https://github.com/grisuno/algebra-de-grok
- Wave Cassette: https://github.com/grisuno/1d_wave_equation_grokker
- Kepler Cassette: https://github.com/grisuno/kepler_orbit_grokker
- Pendulum Cassette: https://github.com/grisuno/chaotic_pendulum_grokked
- Ciclotron Cassette: https://github.com/grisuno/supertopo3
- MatMul 2x2 Cassette: https://github.com/grisuno/matrixgrokker
- HPU Hamiltonian Cassette: https://github.com/grisuno/HPU-Core
---
## Appendix D: Grokking Dynamics

Figure 5: Comparison of successful (left) and failed (right) training runs. In the successful case (B=32), grokking occurs around epoch 450: training loss is already low, but test loss suddenly drops. In the failed case (B=512), test loss never drops despite low training loss.
---
## Appendix E: Noise Stability
I tested discretization stability by adding Gaussian noise to trained weights before rounding.
| Noise sigma | Trials | Success Rate | Mean Error |
|-------------|--------|--------------|------------|
| 0.001 | 100 | 0% | 4.43e-01 |
| 0.005 | 100 | 0% | 6.39e-01 |
| 0.010 | 100 | 0% | 6.68e-01 |
| 0.050 | 100 | 0% | 6.18e-01 |
| 0.100 | 100 | 0% | 6.16e-01 |
Discretization is fragile: every tested noise level (sigma >= 0.001) causes failure. This is why training conditions matter: weights must converge very close to integer values before rounding.
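A harness sketch of this protocol (noise, then rounding, then a correctness check) is given below. `U_t`, `V_t`, `W_t` must be the continuous trained coefficients loaded from a run, and the success tolerance and correctness test are my choices, so this sketch does not by itself reproduce the table.
```python
import numpy as np

def perturbation_success_rate(U_t, V_t, W_t, sigma, n_trials=100, tol=1e-4, seed=0):
    """Appendix E protocol sketch: add Gaussian noise, round to {-1, 0, 1}, test correctness.

    U_t, V_t, W_t: continuous trained coefficients, shapes (7, 4), (7, 4), (4, 7).
    tol and the random-input correctness test are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    successes = 0
    for _ in range(n_trials):
        Ui, Vi, Wi = (np.clip(np.rint(M + rng.normal(0.0, sigma, M.shape)), -1, 1)
                      for M in (U_t, V_t, W_t))
        A, B = rng.standard_normal((2, 2)), rng.standard_normal((2, 2))
        C = (Wi @ ((Ui @ A.ravel()) * (Vi @ B.ravel()))).reshape(2, 2)
        rel = np.linalg.norm(C - A @ B) / np.linalg.norm(A @ B)
        successes += rel < tol
    return successes / n_trials
```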
---
## Appendix F: Memory Analysis
I computed memory requirements to test the cache coherence hypothesis.
| Component | Size |
|-----------|------|
| Model parameters (U, V, W) | 384 bytes |
| Optimizer state (m, v) | 768 bytes |
| Per-sample batch memory | 320 bytes |
| Total for B=128 | 41.1 KB |
| Total for B=1024 | 321.1 KB |
Even B=1024 fits in L3 cache on all modern hardware (>= 1MB L3). The batch size effect in [24, 128] is not due to cache constraints. I do not yet have an explanation for this effect.
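The totals in the table follow directly from the per-component figures; a short check:
```python
# Per-component figures from the table above, in bytes.
params, optimizer_state, per_sample = 384, 768, 320
for B in (128, 1024):
    total = params + optimizer_state + B * per_sample
    print(f"B={B}: {total / 1024:.1f} KB")   # 41.1 KB and 321.1 KB
```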
---
Manuscript prepared: January 2026
Author: grisun0
License: AGPL v3