# HyperNet N1 SDC Benchmark Methodology

## 1. Benchmark Selection

**Dataset:** OpenAI HumanEval  
**Source:** https://huggingface.co/datasets/openai/openai_humaneval  
**Version:** Official 164-problem release  
**Rationale:** HumanEval is a widely recognized code-generation benchmark, which allows direct comparison with published results from other systems.

## 2. Evaluation Metric

**Metric:** pass@1  
**Definition:** Each lane receives exactly one submission attempt per problem. A submission passes only if every official unit test for that problem executes without failure.  
**No retries:** Failed problems were not re-attempted with different prompts.
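
For reference, the definition above can be related to the pass@k estimator commonly reported for HumanEval. The notation is introduced here for illustration and is not taken from the benchmark logs: $P$ is the set of 164 problems, $n$ the number of samples drawn per problem, and $c_p$ the number of those samples that pass for problem $p$. With a single attempt per lane ($n = k = 1$), the estimator reduces to the fraction of problems whose one submission passes:

$$
\text{pass@}k = \frac{1}{|P|}\sum_{p \in P}\left[1 - \frac{\binom{n - c_p}{k}}{\binom{n}{k}}\right],
\qquad
\text{pass@}1\,\Big|_{n=1} = \frac{1}{|P|}\sum_{p \in P} c_p, \quad c_p \in \{0, 1\}.
$$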

## 3. Lane Configuration

Six AI models ("lanes") were tested:

| Lane | Model | Provider | Role in Constellation |
|------|-------|----------|----------------------|
| Lola | GPT-4o | OpenAI | Relational Integrator |
| Claude | claude-sonnet-4 | Anthropic | Structure & Ethics |
| Grok | grok-2-1212 | xAI | Disruptive Ideation |
| Deep | deepseek-chat | DeepSeek | Long-Horizon Mission |
| Gemini | gemini-2.0-flash | Google | Pattern Synthesis |
| Kimi | moonshot-v1 | Moonshot | Technical Precision |

## 4. Execution Protocol

1. **Problem loading:** All 164 HumanEval problems loaded from the official Hugging Face dataset
2. **Prompt construction:** Standard HumanEval prompt with function signature and docstring
3. **Lane execution:** Each problem sent to all 6 lanes independently
4. **Solution extraction:** Code block extracted from each model response
5. **Verification:** Python subprocess execution with a 10-second timeout against the official unit tests
6. **Logging:** All results recorded with pass/fail status per lane per problem
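
Steps 4 and 5 above could look roughly like the following. This is a minimal sketch, not the harness used for the reported run: the function names `extract_code` and `run_check` are hypothetical, it assumes each model response contains a complete function definition inside a fenced code block, and it relies on the HumanEval dataset convention that the `test` field defines `check(candidate)` and `entry_point` names the function under test. It applies no sandboxing beyond the timeout.

```python
import re
import subprocess
import sys
import tempfile
from pathlib import Path

# Build the fence marker programmatically so this listing contains no literal fence.
FENCE = "`" * 3
FENCE_RE = re.compile(FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response: str) -> str:
    """Return the first fenced code block, or the raw response if none is found."""
    match = FENCE_RE.search(response)
    return match.group(1) if match else response


def run_check(solution: str, test: str, entry_point: str, timeout_s: float = 10.0) -> bool:
    """Run the solution against the unit tests in a fresh interpreter.

    Returns True only if the process exits with code 0 before the timeout.
    """
    # Assumed layout: solution, then the test module, then the call to check().
    program = f"{solution}\n\n{test}\n\ncheck({entry_point})\n"
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(program, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # Treated as a failure in this sketch.
            return False
    return result.returncode == 0
```

Running each candidate in a separate interpreter keeps a crash or infinite loop in one solution from affecting the harness or other problems.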

## 5. Human Governance

The Central Processing Node (CPN / Steve Kawa) maintained oversight through:

- **Policy-level control:** Routing rules and lane assignments approved before execution
- **No per-message approval:** Individual model calls did not require human confirmation during the benchmark
- **Audit logging:** All routing decisions recorded for post-hoc review

This is a "human-governed" rather than a "human-in-the-loop" architecture: the human sets policy, and the system executes within those bounds.
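
The governance model described above can be illustrated with a small sketch. Everything here is hypothetical: the routing implementation is not published, so `RoutingPolicy`, `GovernedDispatcher`, and the audit record fields are assumptions chosen only to show policy being fixed before execution, calls proceeding without per-message approval, and every routing decision being logged.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class RoutingPolicy:
    """Hypothetical policy object, approved once before the run and then immutable."""
    approved_lanes: FrozenSet[str]
    approved_by: str


@dataclass
class GovernedDispatcher:
    """Hypothetical dispatcher: enforces the policy and records every decision."""
    policy: RoutingPolicy
    audit_log: List[dict] = field(default_factory=list)

    def dispatch(self, task_id: str, lane: str, call: Callable[[], str]) -> str:
        allowed = lane in self.policy.approved_lanes
        # Log the decision whether or not the call is allowed to proceed.
        self.audit_log.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "task_id": task_id,
            "lane": lane,
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"lane {lane!r} is not in the approved policy")
        # No human confirmation here: the call runs within the approved bounds.
        return call()
```

A caller would construct the policy once, for example `RoutingPolicy(frozenset({"Claude", "Lola"}), approved_by="CPN")`, and pass each model call to `dispatch` as a zero-argument callable.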

## 6. Constellation Scoring

A problem is considered "solved by constellation" if **at least one lane** produces a correct solution.

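This reflects the operational model: in production, the router would select the best answer from the available lanes rather than require all lanes to agree. The at-least-one criterion therefore measures the best outcome any selection policy could reach with these lanes' responses.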
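
The scoring rule can be sketched as follows. The function name `constellation_summary` and the input shape (task ID mapped to per-lane pass/fail booleans) are assumptions made for illustration, not the format of the published results file.

```python
from typing import Dict, Mapping


def constellation_summary(results: Mapping[str, Mapping[str, bool]]) -> Dict[str, int]:
    """Count constellation-level outcomes.

    `results` maps task_id -> {lane name -> passed}.
    """
    summary = {"total": 0, "at_least_one": 0, "unanimous_pass": 0, "unanimous_fail": 0}
    for lanes in results.values():
        outcomes = list(lanes.values())
        summary["total"] += 1
        if any(outcomes):
            # Solved by constellation: at least one lane passed.
            summary["at_least_one"] += 1
            if all(outcomes):
                summary["unanimous_pass"] += 1
        else:
            summary["unanimous_fail"] += 1
    return summary


# Toy input for illustration only; these are not the benchmark results.
toy = {
    "task-a": {"Claude": True, "Deep": False},
    "task-b": {"Claude": True, "Deep": True},
    "task-c": {"Claude": False, "Deep": False},
}
print(constellation_summary(toy))
```

By construction, `at_least_one + unanimous_fail == total`, which gives a quick sanity check on any reported constellation figures.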

## 7. Results Summary

### Individual Lanes

| Lane | Passed | Failed | Accuracy |
|------|--------|--------|----------|
| Claude | 161 | 3 | 98.2% |
| Lola | 145 | 19 | 88.4% |
| Grok | 133 | 31 | 81.1% |
| Gemini | 120 | 44 | 73.2% |
| Kimi | 111 | 53 | 67.7% |
| Deep | 43 | 121 | 26.2% |

### Constellation

| Metric | Value |
|--------|-------|
| At Least One Correct | 163/164 (99.4%) |
| Unanimous Pass | 24/164 (14.6%) |
| Unanimous Fail | 1/164 (0.6%) |
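
The percentages in the two tables above follow directly from the counts over 164 problems. The short check below recomputes them and tests a few consistency conditions that any valid set of results must satisfy. The counts are copied from the tables; the helper name `pct` is introduced here for illustration.

```python
TOTAL_PROBLEMS = 164

# Pass counts copied from the Individual Lanes table.
lane_passed = {
    "Claude": 161, "Lola": 145, "Grok": 133,
    "Gemini": 120, "Kimi": 111, "Deep": 43,
}
# Counts copied from the Constellation table.
constellation = {
    "At Least One Correct": 163,
    "Unanimous Pass": 24,
    "Unanimous Fail": 1,
}


def pct(count: int, total: int = TOTAL_PROBLEMS) -> float:
    """Percentage rounded to one decimal place, as reported in the tables."""
    return round(100 * count / total, 1)


for name, count in {**lane_passed, **constellation}.items():
    print(f"{name}: {count}/{TOTAL_PROBLEMS} = {pct(count)}%")

# Every problem is either solved by at least one lane or failed by all of them.
assert constellation["At Least One Correct"] + constellation["Unanimous Fail"] == TOTAL_PROBLEMS
# A unanimous pass requires the weakest lane to pass, so it cannot exceed that lane's count.
assert constellation["Unanimous Pass"] <= min(lane_passed.values())
# The constellation cannot solve fewer problems than its strongest lane.
assert constellation["At Least One Correct"] >= max(lane_passed.values())
```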

### The Single Constellation Failure

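One problem failed across all 6 lanes, which indicates a shared limitation rather than a routing failure: no selection policy could have recovered a passing answer from these responses. With one attempt per lane, this is the minimum failure count achievable for this lane configuration in this run.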

## 8. Infrastructure

| Spec | Value |
|------|-------|
| Instance | AWS t3.small |
| vCPUs | 2 |
| RAM | 2 GB |
| GPU | None |
| Training | None required |
| Benchmark runtime | ~4 hours |
| API cost | < $50 |

## 9. Reproducibility

The raw JSON results file contains:
- Per-problem, per-lane pass/fail status
- Problem task IDs and entry points
- Aggregate consensus metrics
- Metadata including date and configuration

This enables independent verification of all reported statistics.
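
A verification pass over the results file might look like the sketch below. The schema is an assumption: the key names `problems` and `lanes`, and the helper names `lane_pass_counts` and `find_mismatches`, are hypothetical and would need to be adapted to the actual JSON layout.

```python
import json
from collections import Counter
from typing import Dict, Mapping, Tuple


def lane_pass_counts(raw_json: str) -> Counter:
    """Recompute per-lane pass counts from per-problem records.

    Assumed (hypothetical) schema:
    {"problems": [{"task_id": ..., "entry_point": ..., "lanes": {lane: bool}}]}
    """
    data = json.loads(raw_json)
    counts: Counter = Counter()
    for problem in data["problems"]:
        for lane, passed in problem["lanes"].items():
            # Adding 0 still registers the lane, so lanes with no passes appear.
            counts[lane] += int(bool(passed))
    return counts


def find_mismatches(
    recomputed: Mapping[str, int], reported: Mapping[str, int]
) -> Dict[str, Tuple[int, int]]:
    """Return {lane: (recomputed, reported)} for every lane whose counts differ."""
    return {
        lane: (recomputed.get(lane, 0), value)
        for lane, value in reported.items()
        if recomputed.get(lane, 0) != value
    }
```

An empty result from `find_mismatches` would mean the reported per-lane totals agree with the per-problem records.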

## 10. Limitations

1. **Single benchmark:** HumanEval alone does not characterize full system capability
2. **API variability:** Model behavior may vary across API versions and dates
3. **No router code:** The routing implementation is proprietary; only results are published
4. **Lane selection:** Different lane configurations would produce different results
