Statistics Summary for op agent (mean ± std):
==================================================
accuracy: 88.3333 ± 2.8868
total_entries: 20.0000 ± 0.0000
correct_matches: 17.6667 ± 0.5774
average_execution_time: 4.5562 ± 0.7519
metrics_by_complexity:
  simple:
    count: 20.0000 ± 0.0000
    full_matches: 17.6667 ± 0.5774
    exact_code_matches: 11.0000 ± 0.0000
    exact_code_match_rate: 55.0000 ± 0.0000
    pv_exact_matched: 17.6667 ± 0.5774
    pv_exact_mismatched: 2.3333 ± 0.5774
    pv_exact_match_rate: 88.3333 ± 2.8868
    average_pv_match_rate: 93.2708 ± 2.4537
    average_pv_mismatch_rate: 6.7292 ± 2.4537
    timing_matched: 18.6667 ± 0.5774
    timing_mismatched: 1.3333 ± 0.5774
    timing_match_rate: 93.3333 ± 2.8868
    temp_matched: 0.0000 ± 0.0000
    temp_mismatched: 0.0000 ± 0.0000
    temp_match_rate: 0.0000 ± 0.0000
    full_matched: 17.6667 ± 0.5774
    full_mismatched: 2.3333 ± 0.5774
    full_match_rate: 88.3333 ± 2.8868
    accuracy: 88.3333 ± 2.8868
    average_timing_score: 0.9247 ± 0.0210
    average_temp_score: 1.0000 ± 0.0000
    average_full_score: 0.9311 ± 0.0238
    average_best_codebleu:
      average_CodeBLEU: 0.7887 ± 0.0094
      comb_7_CodeBLEU: 0.8701 ± 0.0068
      dataflow_match_score: 0.0611 ± 0.0127
      ngram_match_score: 0.6582 ± 0.0223
      syntax_match_score: 0.8710 ± 0.0049
      weighted_ngram_match_score: 0.6477 ± 0.0191
    average_best_levenshtein: 11.7667 ± 2.9569
    average_best_normalized_levenshtein: 0.1153 ± 0.0133
    average_inference_time: 4.5562 ± 0.7519
==================================================
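
Note on the "mean ± std" format: the sketch below shows one way such lines could be produced. It is an illustration under stated assumptions, not the code that generated this report. The per-run counts (18, 18, 17 correct out of 20) are hypothetical values chosen because they reproduce the reported correct_matches and accuracy lines when the sample standard deviation (n - 1 denominator) is used. The helper name `summarize` is likewise made up for this example.

```python
import statistics


def summarize(values):
    """Format a list of per-run values as 'mean ± std' with 4 decimals.

    Uses the sample standard deviation (n - 1 denominator); this choice
    is an assumption, not something the report states.
    """
    mean = statistics.mean(values)
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return f"{mean:.4f} ± {std:.4f}"


# Hypothetical per-run counts of correct matches (not taken from the report).
hypothetical_correct = [18, 18, 17]
total_entries = 20

correct_line = summarize(hypothetical_correct)
# Accuracy as a percentage of total entries for each hypothetical run.
accuracy_line = summarize([100 * c / total_entries for c in hypothetical_correct])

print("correct_matches:", correct_line)
print("accuracy:", accuracy_line)
```

Under these assumptions the script prints "correct_matches: 17.6667 ± 0.5774" and "accuracy: 88.3333 ± 2.8868", matching the corresponding lines above. Other per-run values could yield the same summary, so this does not establish the actual number of runs.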
