o3-high
-------
Statistics Summary for op agent (mean ± std):
==================================================
accuracy: 88.3333 ± 2.8868
total_entries: 20.0000 ± 0.0000
correct_matches: 17.6667 ± 0.5774
average_execution_time: 4.5562 ± 0.7519
metrics_by_complexity:
  simple:
    count: 20.0000 ± 0.0000
    full_matches: 17.6667 ± 0.5774
    exact_code_matches: 11.0000 ± 0.0000
    exact_code_match_rate: 55.0000 ± 0.0000
    pv_exact_matched: 17.6667 ± 0.5774
    pv_exact_mismatched: 2.3333 ± 0.5774
    pv_exact_match_rate: 88.3333 ± 2.8868
    average_pv_match_rate: 93.2708 ± 2.4537
    average_pv_mismatch_rate: 6.7292 ± 2.4537
    timing_matched: 18.6667 ± 0.5774
    timing_mismatched: 1.3333 ± 0.5774
    timing_match_rate: 93.3333 ± 2.8868
    temp_matched: 0.0000 ± 0.0000
    temp_mismatched: 0.0000 ± 0.0000
    temp_match_rate: 0.0000 ± 0.0000
    full_matched: 17.6667 ± 0.5774
    full_mismatched: 2.3333 ± 0.5774
    full_match_rate: 88.3333 ± 2.8868
    accuracy: 88.3333 ± 2.8868
    average_timing_score: 0.9247 ± 0.0210
    average_temp_score: 1.0000 ± 0.0000
    average_full_score: 0.9311 ± 0.0238
    average_best_codebleu:
      average_CodeBLEU: 0.7887 ± 0.0094
      comb_7_CodeBLEU: 0.8701 ± 0.0068
      dataflow_match_score: 0.0611 ± 0.0127
      ngram_match_score: 0.6582 ± 0.0223
      syntax_match_score: 0.8710 ± 0.0049
      weighted_ngram_match_score: 0.6477 ± 0.0191
    average_best_levenshtein: 11.7667 ± 2.9569
    average_best_normalized_levenshtein: 0.1153 ± 0.0133
    average_inference_time: 4.5562 ± 0.7519
==================================================
================================================================================

llama3.3
-------
Statistics Summary for op agent (mean ± std):
==================================================
accuracy: 38.3333 ± 2.8868
total_entries: 20.0000 ± 0.0000
correct_matches: 7.6667 ± 0.5774
average_execution_time: 7.1210 ± 0.0142
metrics_by_complexity:
  simple:
    count: 20.0000 ± 0.0000
    full_matches: 7.6667 ± 0.5774
    exact_code_matches: 4.0000 ± 0.0000
    exact_code_match_rate: 20.0000 ± 0.0000
    pv_exact_matched: 8.0000 ± 0.0000
    pv_exact_mismatched: 12.0000 ± 0.0000
    pv_exact_match_rate: 40.0000 ± 0.0000
    average_pv_match_rate: 54.3647 ± 0.0000
    average_pv_mismatch_rate: 45.6353 ± 0.0000
    timing_matched: 9.6667 ± 0.5774
    timing_mismatched: 10.3333 ± 0.5774
    timing_match_rate: 48.3333 ± 2.8868
    temp_matched: 0.0000 ± 0.0000
    temp_mismatched: 0.0000 ± 0.0000
    temp_match_rate: 0.0000 ± 0.0000
    full_matched: 7.6667 ± 0.5774
    full_mismatched: 12.3333 ± 0.5774
    full_match_rate: 38.3333 ± 2.8868
    accuracy: 38.3333 ± 2.8868
    average_timing_score: 0.5778 ± 0.0121
    average_temp_score: 1.0000 ± 0.0000
    average_full_score: 0.5505 ± 0.0024
    average_best_codebleu:
      average_CodeBLEU: 0.5872 ± 0.0000
      comb_7_CodeBLEU: 0.6997 ± 0.0000
      dataflow_match_score: 0.0250 ± 0.0000
      ngram_match_score: 0.3781 ± 0.0000
      syntax_match_score: 0.5746 ± 0.0000
      weighted_ngram_match_score: 0.4211 ± 0.0000
    average_best_levenshtein: 124.2500 ± 0.0000
    average_best_normalized_levenshtein: 0.2994 ± 0.0000
    average_inference_time: 7.1210 ± 0.0142
==================================================
================================================================================

athene-v2
-------
Statistics Summary for op agent (mean ± std):
==================================================
accuracy: 35.0000 ± 0.0000
total_entries: 20.0000 ± 0.0000
correct_matches: 7.0000 ± 0.0000
average_execution_time: 6.4077 ± 0.0140
metrics_by_complexity:
  simple:
    count: 20.0000 ± 0.0000
    full_matches: 7.0000 ± 0.0000
    exact_code_matches: 4.0000 ± 0.0000
    exact_code_match_rate: 20.0000 ± 0.0000
    pv_exact_matched: 7.0000 ± 0.0000
    pv_exact_mismatched: 13.0000 ± 0.0000
    pv_exact_match_rate: 35.0000 ± 0.0000
    average_pv_match_rate: 35.7500 ± 0.0000
    average_pv_mismatch_rate: 64.2500 ± 0.0000
    timing_matched: 7.0000 ± 0.0000
    timing_mismatched: 13.0000 ± 0.0000
    timing_match_rate: 35.0000 ± 0.0000
    temp_matched: 0.0000 ± 0.0000
    temp_mismatched: 0.0000 ± 0.0000
    temp_match_rate: 0.0000 ± 0.0000
    full_matched: 7.0000 ± 0.0000
    full_mismatched: 13.0000 ± 0.0000
    full_match_rate: 35.0000 ± 0.0000
    accuracy: 35.0000 ± 0.0000
    average_timing_score: 0.3582 ± 0.0023
    average_temp_score: 1.0000 ± 0.0000
    average_full_score: 0.3576 ± 0.0005
    average_best_codebleu:
      average_CodeBLEU: 0.5797 ± 0.0000
      comb_7_CodeBLEU: 0.6447 ± 0.0000
      dataflow_match_score: 0.0375 ± 0.0000
      ngram_match_score: 0.4663 ± 0.0000
      syntax_match_score: 0.3885 ± 0.0000
      weighted_ngram_match_score: 0.4767 ± 0.0000
    average_best_levenshtein: 55.8500 ± 0.0000
    average_best_normalized_levenshtein: 0.1916 ± 0.0000
    average_inference_time: 6.4077 ± 0.0140
==================================================
================================================================================

phi3.5
-------
Statistics Summary for op agent (mean ± std):
==================================================
accuracy: 0.0000 ± 0.0000
total_entries: 20.0000 ± 0.0000
correct_matches: 0.0000 ± 0.0000
average_execution_time: 5.4872 ± 1.2150
metrics_by_complexity:
  simple:
    count: 20.0000 ± 0.0000
    full_matches: 0.0000 ± 0.0000
    exact_code_matches: 0.0000 ± 0.0000
    exact_code_match_rate: 0.0000 ± 0.0000
    pv_exact_matched: 0.0000 ± 0.0000
    pv_exact_mismatched: 20.0000 ± 0.0000
    pv_exact_match_rate: 0.0000 ± 0.0000
    average_pv_match_rate: 0.0000 ± 0.0000
    average_pv_mismatch_rate: 100.0000 ± 0.0000
    timing_matched: 0.0000 ± 0.0000
    timing_mismatched: 20.0000 ± 0.0000
    timing_match_rate: 0.0000 ± 0.0000
    temp_matched: 0.0000 ± 0.0000
    temp_mismatched: 0.0000 ± 0.0000
    temp_match_rate: 0.0000 ± 0.0000
    full_matched: 0.0000 ± 0.0000
    full_mismatched: 20.0000 ± 0.0000
    full_match_rate: 0.0000 ± 0.0000
    accuracy: 0.0000 ± 0.0000
    average_timing_score: 0.0000 ± 0.0000
    average_temp_score: 1.0000 ± 0.0000
    average_full_score: 0.0000 ± 0.0000
    average_best_codebleu:
      average_CodeBLEU: 0.2990 ± 0.0077
      comb_7_CodeBLEU: 0.4401 ± 0.0106
      dataflow_match_score: 0.0583 ± 0.0000
      ngram_match_score: 0.0418 ± 0.0010
      syntax_match_score: 0.1098 ± 0.0251
      weighted_ngram_match_score: 0.0861 ± 0.0048
    average_best_levenshtein: 3464.9333 ± 1038.4222
    average_best_normalized_levenshtein: 0.8432 ± 0.0177
    average_inference_time: 5.4872 ± 1.2150
==================================================
================================================================================

claude-3.5-sonnet
-------
Statistics Summary for op agent (mean ± std):
==================================================
accuracy: 78.3333 ± 2.8868
total_entries: 20.0000 ± 0.0000
correct_matches: 15.6667 ± 0.5774
average_execution_time: 2.8904 ± 0.1838
metrics_by_complexity:
  simple:
    count: 20.0000 ± 0.0000
    full_matches: 15.6667 ± 0.5774
    exact_code_matches: 12.0000 ± 0.0000
    exact_code_match_rate: 60.0000 ± 0.0000
    pv_exact_matched: 16.0000 ± 0.0000
    pv_exact_mismatched: 4.0000 ± 0.0000
    pv_exact_match_rate: 80.0000 ± 0.0000
    average_pv_match_rate: 87.2708 ± 0.0000
    average_pv_mismatch_rate: 12.7292 ± 0.0000
    timing_matched: 17.6667 ± 0.5774
    timing_mismatched: 2.3333 ± 0.5774
    timing_match_rate: 88.3333 ± 2.8868
    temp_matched: 0.0000 ± 0.0000
    temp_mismatched: 0.0000 ± 0.0000
    temp_match_rate: 0.0000 ± 0.0000
    full_matched: 15.6667 ± 0.5774
    full_mismatched: 4.3333 ± 0.5774
    full_match_rate: 78.3333 ± 2.8868
    accuracy: 78.3333 ± 2.8868
    average_timing_score: 0.8953 ± 0.0052
    average_temp_score: 1.0000 ± 0.0000
    average_full_score: 0.8772 ± 0.0010
    average_best_codebleu:
      average_CodeBLEU: 0.7860 ± 0.0000
      comb_7_CodeBLEU: 0.8645 ± 0.0000
      dataflow_match_score: 0.0500 ± 0.0000
      ngram_match_score: 0.6589 ± 0.0000
      syntax_match_score: 0.8336 ± 0.0000
      weighted_ngram_match_score: 0.6516 ± 0.0000
    average_best_levenshtein: 6.0000 ± 0.0000
    average_best_normalized_levenshtein: 0.0758 ± 0.0000
    average_inference_time: 2.8904 ± 0.1838
==================================================
================================================================================

qwen3
-------
Statistics Summary for op agent (mean ± std):
==================================================
accuracy: 0.0000 ± 0.0000
total_entries: 20.0000 ± 0.0000
correct_matches: 0.0000 ± 0.0000
average_execution_time: 48.6507 ± 0.0461
metrics_by_complexity:
  simple:
    count: 20.0000 ± 0.0000
    full_matches: 0.0000 ± 0.0000
    exact_code_matches: 0.0000 ± 0.0000
    exact_code_match_rate: 0.0000 ± 0.0000
    pv_exact_matched: 0.0000 ± 0.0000
    pv_exact_mismatched: 20.0000 ± 0.0000
    pv_exact_match_rate: 0.0000 ± 0.0000
    average_pv_match_rate: 0.0000 ± 0.0000
    average_pv_mismatch_rate: 100.0000 ± 0.0000
    timing_matched: 0.0000 ± 0.0000
    timing_mismatched: 20.0000 ± 0.0000
    timing_match_rate: 0.0000 ± 0.0000
    temp_matched: 0.0000 ± 0.0000
    temp_mismatched: 0.0000 ± 0.0000
    temp_match_rate: 0.0000 ± 0.0000
    full_matched: 0.0000 ± 0.0000
    full_mismatched: 20.0000 ± 0.0000
    full_match_rate: 0.0000 ± 0.0000
    accuracy: 0.0000 ± 0.0000
    average_timing_score: 0.0000 ± 0.0000
    average_temp_score: 1.0000 ± 0.0000
    average_full_score: 0.0000 ± 0.0000
    average_best_codebleu:
      average_CodeBLEU: 0.2985 ± 0.0000
      comb_7_CodeBLEU: 0.4563 ± 0.0000
      dataflow_match_score: 0.0792 ± 0.0000
      ngram_match_score: 0.0016 ± 0.0000
      syntax_match_score: 0.1439 ± 0.0000
      weighted_ngram_match_score: 0.0693 ± 0.0000
    average_best_levenshtein: 8949.8000 ± 0.0000
    average_best_normalized_levenshtein: 0.9681 ± 0.0000
    average_inference_time: 48.6507 ± 0.0461
==================================================
================================================================================

mistral-nemo
-------
Statistics Summary for op agent (mean ± std):
==================================================
accuracy: 0.0000 ± 0.0000
total_entries: 20.0000 ± 0.0000
correct_matches: 0.0000 ± 0.0000
average_execution_time: 1.3368 ± 0.0006
metrics_by_complexity:
  simple:
    count: 20.0000 ± 0.0000
    full_matches: 0.0000 ± 0.0000
    exact_code_matches: 0.0000 ± 0.0000
    exact_code_match_rate: 0.0000 ± 0.0000
    pv_exact_matched: 0.0000 ± 0.0000
    pv_exact_mismatched: 20.0000 ± 0.0000
    pv_exact_match_rate: 0.0000 ± 0.0000
    average_pv_match_rate: 4.6875 ± 0.0000
    average_pv_mismatch_rate: 95.3125 ± 0.0000
    timing_matched: 1.0000 ± 0.0000
    timing_mismatched: 19.0000 ± 0.0000
    timing_match_rate: 5.0000 ± 0.0000
    temp_matched: 0.0000 ± 0.0000
    temp_mismatched: 0.0000 ± 0.0000
    temp_match_rate: 0.0000 ± 0.0000
    full_matched: 0.0000 ± 0.0000
    full_mismatched: 20.0000 ± 0.0000
    full_match_rate: 0.0000 ± 0.0000
    accuracy: 0.0000 ± 0.0000
    average_timing_score: 0.0481 ± 0.0007
    average_temp_score: 1.0000 ± 0.0000
    average_full_score: 0.0471 ± 0.0001
    average_best_codebleu:
      average_CodeBLEU: 0.4523 ± 0.0000
      comb_7_CodeBLEU: 0.5661 ± 0.0000
      dataflow_match_score: 0.0000 ± 0.0000
      ngram_match_score: 0.2624 ± 0.0000
      syntax_match_score: 0.2840 ± 0.0000
      weighted_ngram_match_score: 0.2629 ± 0.0000
    average_best_levenshtein: 22.9000 ± 0.0000
    average_best_normalized_levenshtein: 0.3455 ± 0.0000
    average_inference_time: 1.3368 ± 0.0006
==================================================
================================================================================

claude-opus-4-20250514-thinking
-------
Statistics Summary for op agent (mean ± std):
==================================================
accuracy: 85.0000 ± 0.0000
total_entries: 20.0000 ± 0.0000
correct_matches: 17.0000 ± 0.0000
average_execution_time: 8.0936 ± 0.3190
metrics_by_complexity:
  simple:
    count: 20.0000 ± 0.0000
    full_matches: 17.0000 ± 0.0000
    exact_code_matches: 13.0000 ± 0.0000
    exact_code_match_rate: 65.0000 ± 0.0000
    pv_exact_matched: 17.0000 ± 0.0000
    pv_exact_mismatched: 3.0000 ± 0.0000
    pv_exact_match_rate: 85.0000 ± 0.0000
    average_pv_match_rate: 90.7240 ± 0.0151
    average_pv_mismatch_rate: 9.2760 ± 0.0151
    timing_matched: 18.0000 ± 0.0000
    timing_mismatched: 2.0000 ± 0.0000
    timing_match_rate: 90.0000 ± 0.0000
    temp_matched: 0.0000 ± 0.0000
    temp_mismatched: 0.0000 ± 0.0000
    temp_match_rate: 0.0000 ± 0.0000
    full_matched: 17.0000 ± 0.0000
    full_mismatched: 3.0000 ± 0.0000
    full_match_rate: 85.0000 ± 0.0000
    accuracy: 85.0000 ± 0.0000
    average_timing_score: 0.9457 ± 0.0022
    average_temp_score: 1.0000 ± 0.0000
    average_full_score: 0.9149 ± 0.0005
    average_best_codebleu:
      average_CodeBLEU: 0.7995 ± 0.0000
      comb_7_CodeBLEU: 0.8827 ± 0.0000
      dataflow_match_score: 0.0500 ± 0.0000
      ngram_match_score: 0.6678 ± 0.0000
      syntax_match_score: 0.8764 ± 0.0000
      weighted_ngram_match_score: 0.6537 ± 0.0000
    average_best_levenshtein: 5.9000 ± 0.0000
    average_best_normalized_levenshtein: 0.0718 ± 0.0000
    average_inference_time: 8.0936 ± 0.3190
==================================================
================================================================================

claude-sonnet-4-20250514-thinking
-------
Statistics Summary for op agent (mean ± std):
==================================================
accuracy: 85.0000 ± 0.0000
total_entries: 20.0000 ± 0.0000
correct_matches: 17.0000 ± 0.0000
average_execution_time: 5.8662 ± 0.1981
metrics_by_complexity:
  simple:
    count: 20.0000 ± 0.0000
    full_matches: 17.0000 ± 0.0000
    exact_code_matches: 13.0000 ± 0.0000
    exact_code_match_rate: 65.0000 ± 0.0000
    pv_exact_matched: 17.0000 ± 0.0000
    pv_exact_mismatched: 3.0000 ± 0.0000
    pv_exact_match_rate: 85.0000 ± 0.0000
    average_pv_match_rate: 90.7352 ± 0.0261
    average_pv_mismatch_rate: 9.2648 ± 0.0261
    timing_matched: 18.0000 ± 0.0000
    timing_mismatched: 2.0000 ± 0.0000
    timing_match_rate: 90.0000 ± 0.0000
    temp_matched: 0.0000 ± 0.0000
    temp_mismatched: 0.0000 ± 0.0000
    temp_match_rate: 0.0000 ± 0.0000
    full_matched: 17.0000 ± 0.0000
    full_mismatched: 3.0000 ± 0.0000
    full_match_rate: 85.0000 ± 0.0000
    accuracy: 85.0000 ± 0.0000
    average_timing_score: 0.9451 ± 0.0008
    average_temp_score: 1.0000 ± 0.0000
    average_full_score: 0.9149 ± 0.0001
    average_best_codebleu:
      average_CodeBLEU: 0.7995 ± 0.0000
      comb_7_CodeBLEU: 0.8827 ± 0.0000
      dataflow_match_score: 0.0500 ± 0.0000
      ngram_match_score: 0.6678 ± 0.0000
      syntax_match_score: 0.8764 ± 0.0000
      weighted_ngram_match_score: 0.6537 ± 0.0000
    average_best_levenshtein: 5.9000 ± 0.0000
    average_best_normalized_levenshtein: 0.0718 ± 0.0000
    average_inference_time: 5.8662 ± 0.1981
==================================================
================================================================================

gpt-4o-abacus
-------
Statistics Summary for op agent (mean ± std):
==================================================
accuracy: 85.0000 ± 0.0000
total_entries: 20.0000 ± 0.0000
correct_matches: 17.0000 ± 0.0000
average_execution_time: 1.8779 ± 0.0917
metrics_by_complexity:
  simple:
    count: 20.0000 ± 0.0000
    full_matches: 17.0000 ± 0.0000
    exact_code_matches: 13.0000 ± 0.0000
    exact_code_match_rate: 65.0000 ± 0.0000
    pv_exact_matched: 17.0000 ± 0.0000
    pv_exact_mismatched: 3.0000 ± 0.0000
    pv_exact_match_rate: 85.0000 ± 0.0000
    average_pv_match_rate: 92.2708 ± 0.0000
    average_pv_mismatch_rate: 7.7292 ± 0.0000
    timing_matched: 19.0000 ± 0.0000
    timing_mismatched: 1.0000 ± 0.0000
    timing_match_rate: 95.0000 ± 0.0000
    temp_matched: 0.0000 ± 0.0000
    temp_mismatched: 0.0000 ± 0.0000
    temp_match_rate: 0.0000 ± 0.0000
    full_matched: 17.0000 ± 0.0000
    full_mismatched: 3.0000 ± 0.0000
    full_match_rate: 85.0000 ± 0.0000
    accuracy: 85.0000 ± 0.0000
    average_timing_score: 0.9433 ± 0.0083
    average_temp_score: 1.0000 ± 0.0000
    average_full_score: 0.9268 ± 0.0017
    average_best_codebleu:
      average_CodeBLEU: 0.7995 ± 0.0000
      comb_7_CodeBLEU: 0.8827 ± 0.0000
      dataflow_match_score: 0.0500 ± 0.0000
      ngram_match_score: 0.6678 ± 0.0000
      syntax_match_score: 0.8764 ± 0.0000
      weighted_ngram_match_score: 0.6537 ± 0.0000
    average_best_levenshtein: 5.9000 ± 0.0000
    average_best_normalized_levenshtein: 0.0718 ± 0.0000
    average_inference_time: 1.8779 ± 0.0917
==================================================
================================================================================

gpt-4o
-------
Statistics Summary for op agent (mean ± std):
==================================================
accuracy: 76.6667 ± 2.8868
total_entries: 20.0000 ± 0.0000
correct_matches: 15.3333 ± 0.5774
average_execution_time: 1.4703 ± 0.1535
metrics_by_complexity:
  simple:
    count: 20.0000 ± 0.0000
    full_matches: 15.3333 ± 0.5774
    exact_code_matches: 12.3333 ± 0.5774
    exact_code_match_rate: 61.6667 ± 2.8868
    pv_exact_matched: 16.3333 ± 0.5774
    pv_exact_mismatched: 3.6667 ± 0.5774
    pv_exact_match_rate: 81.6667 ± 2.8868
    average_pv_match_rate: 87.3750 ± 4.8446
    average_pv_mismatch_rate: 12.6250 ± 4.8446
    timing_matched: 16.6667 ± 1.1547
    timing_mismatched: 3.3333 ± 1.1547
    timing_match_rate: 83.3333 ± 5.7735
    temp_matched: 0.0000 ± 0.0000
    temp_mismatched: 0.0000 ± 0.0000
    temp_match_rate: 0.0000 ± 0.0000
    full_matched: 15.3333 ± 0.5774
    full_mismatched: 4.6667 ± 0.5774
    full_match_rate: 76.6667 ± 2.8868
    accuracy: 76.6667 ± 2.8868
    average_timing_score: 0.8738 ± 0.0330
    average_temp_score: 1.0000 ± 0.0000
    average_full_score: 0.8738 ± 0.0451
    average_best_codebleu:
      average_CodeBLEU: 0.7837 ± 0.0145
      comb_7_CodeBLEU: 0.8641 ± 0.0178
      dataflow_match_score: 0.0500 ± 0.0000
      ngram_match_score: 0.6567 ± 0.0096
      syntax_match_score: 0.8354 ± 0.0402
      weighted_ngram_match_score: 0.6427 ± 0.0096
    average_best_levenshtein: 6.0000 ± 0.1000
    average_best_normalized_levenshtein: 0.0742 ± 0.0024
    average_inference_time: 1.4703 ± 0.1535
==================================================
================================================================================

qwen2.5-coder
-------
Statistics Summary for op agent (mean ± std):
==================================================
accuracy: 55.0000 ± 0.0000
total_entries: 20.0000 ± 0.0000
correct_matches: 11.0000 ± 0.0000
average_execution_time: 3.4048 ± 0.0013
metrics_by_complexity:
  simple:
    count: 20.0000 ± 0.0000
    full_matches: 11.0000 ± 0.0000
    exact_code_matches: 6.0000 ± 0.0000
    exact_code_match_rate: 30.0000 ± 0.0000
    pv_exact_matched: 11.0000 ± 0.0000
    pv_exact_mismatched: 9.0000 ± 0.0000
    pv_exact_match_rate: 55.0000 ± 0.0000
    average_pv_match_rate: 61.8478 ± 0.0000
    average_pv_mismatch_rate: 38.1522 ± 0.0000
    timing_matched: 12.0000 ± 0.0000
    timing_mismatched: 8.0000 ± 0.0000
    timing_match_rate: 60.0000 ± 0.0000
    temp_matched: 0.0000 ± 0.0000
    temp_mismatched: 0.0000 ± 0.0000
    temp_match_rate: 0.0000 ± 0.0000
    full_matched: 11.0000 ± 0.0000
    full_mismatched: 9.0000 ± 0.0000
    full_match_rate: 55.0000 ± 0.0000
    accuracy: 55.0000 ± 0.0000
    average_timing_score: 0.6412 ± 0.0006
    average_temp_score: 1.0000 ± 0.0000
    average_full_score: 0.6230 ± 0.0001
    average_best_codebleu:
      average_CodeBLEU: 0.6385 ± 0.0000
      comb_7_CodeBLEU: 0.7375 ± 0.0000
      dataflow_match_score: 0.0333 ± 0.0000
      ngram_match_score: 0.4630 ± 0.0000
      syntax_match_score: 0.6237 ± 0.0000
      weighted_ngram_match_score: 0.4841 ± 0.0000
    average_best_levenshtein: 45.3000 ± 0.0000
    average_best_normalized_levenshtein: 0.2650 ± 0.0000
    average_inference_time: 3.4048 ± 0.0013
==================================================
================================================================================
