================================================================================
CANDIDATE LANGUAGE VALIDATION SUITE
Fair test of seven candidate languages for the Voynich manuscript
================================================================================

Loading and decoding Voynich corpus with H12...
  35,916 tokens, 7,733 types
  4,972 unique decoded forms

Loading language dictionaries...
  Sinhala     :  1,470,278 words  (sinhala_dictionary.txt)
  Latin       :    141,827 words  (latin_wordlist.txt)
  Arabic      :    379,737 words  (arabic_wordlist.txt)
  Turkish     :    409,612 words  (turkish_wordlist.txt)
  Tamil       :    403,980 words  (tamil_wordlist.txt)
  Hindi       :    127,451 words  (hindi_wordlist.txt)
  Hebrew      :    897,978 words  (hebrew_wordlist.txt)

================================================================================
H12 DECODED OUTPUT vs EACH LANGUAGE DICTIONARY
================================================================================

  Language       Dict Size   Exact Tok   Exact %  Edit-1 Tok  Combined %   Exact Typ
  ------------  ----------  ----------  --------  ----------  ----------  ----------
  Arabic           379,737       3,365      9.4%      15,045       51.3%         132
  Hebrew           897,978           0      0.0%       4,332       12.1%           0
  Hindi            127,451      12,917     36.0%      16,984       83.3%         402
  Latin            141,827      10,848     30.2%      18,440       81.5%         292
  Sinhala        1,470,278      16,977     47.3%      14,588       87.9%         704
  Tamil            403,980       5,509     15.3%      15,452       58.4%         116
  Turkish          409,612      14,533     40.5%      15,310       83.1%         487
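
  Note: the figures above are consistent with Combined % = (Exact Tok + Edit-1 Tok) / 35,916,
  where Edit-1 Tok counts tokens that miss the dictionary exactly but sit one insertion,
  deletion, or substitution away from an entry. The sketch below shows one way such a table
  row could be computed. The function names and the lowercase Latin alphabet are assumptions
  for illustration, not the suite's actual code.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumption: decoded forms use lowercase Latin letters


def edit1_variants(word, alphabet=ALPHABET):
    """Return every string one insertion, deletion, or substitution away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    replaces = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    return (deletes | inserts | replaces) - {word}


def coverage(tokens, dictionary):
    """Count exact and edit-distance-1 matches of decoded tokens against a word set."""
    # Classify each distinct form once, then tally over all tokens.
    status = {}
    for form in set(tokens):
        if form in dictionary:
            status[form] = "exact"
        elif any(v in dictionary for v in edit1_variants(form)):
            status[form] = "edit1"
        else:
            status[form] = "none"

    exact = sum(status[t] == "exact" for t in tokens)
    edit1 = sum(status[t] == "edit1" for t in tokens)
    total = len(tokens)
    return {
        "exact_tokens": exact,
        "edit1_tokens": edit1,
        "exact_pct": round(100.0 * exact / total, 1),
        "combined_pct": round(100.0 * (exact + edit1) / total, 1),
        "exact_types": sum(s == "exact" for s in status.values()),
    }
```

  With short decoded forms such as "sa" or "ga", almost any large word list contains a
  neighbor at edit distance 1, which is why the Combined % column is high for most languages.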

  Arabic top matches:
    ara                 461 tokens
    ala                 348 tokens
    sa                  257 tokens
    tha                 208 tokens
    kha                 195 tokens
    da                  163 tokens
    ra                  125 tokens
    ma                  121 tokens
    la                  111 tokens
    ka                   96 tokens
    ... and 122 more types

  Hebrew top matches:
    (none; 0 exact-match types)

  Hindi top matches:
    ula               1,131 tokens
    ura                 607 tokens
    eda                 530 tokens
    ena                 494 tokens
    ugena               491 tokens
    ara                 461 tokens
    meda                445 tokens
    uta                 437 tokens
    ga                  392 tokens
    ala                 348 tokens
    ... and 392 more types

  Latin top matches:
    ula               1,131 tokens
    gena                797 tokens
    ura                 607 tokens
    ara                 461 tokens
    meda                445 tokens
    uta                 437 tokens
    ga                  392 tokens
    ala                 348 tokens
    ea                  344 tokens
    gala                302 tokens
    ... and 282 more types

  Sinhala top matches:
    ula               1,131 tokens
    gena                797 tokens
    ura                 607 tokens
    eda                 530 tokens
    ena                 494 tokens
    ugena               491 tokens
    ara                 461 tokens
    uga                 458 tokens
    meda                445 tokens
    uta                 437 tokens
    ... and 694 more types

  Tamil top matches:
    ula               1,131 tokens
    ura                 607 tokens
    ena                 494 tokens
    ara                 461 tokens
    ala                 348 tokens
    sa                  257 tokens
    utala               240 tokens
    tha                 208 tokens
    ra                  125 tokens
    mu                  123 tokens
    ... and 106 more types

  Turkish top matches:
    ula               1,131 tokens
    gena                797 tokens
    ura                 607 tokens
    eda                 530 tokens
    ena                 494 tokens
    ara                 461 tokens
    meda                445 tokens
    uta                 437 tokens
    ga                  392 tokens
    ala                 348 tokens
    ... and 477 more types

================================================================================
RANDOM DECODER COMPARISON (200 trials)
================================================================================

Running random decoders (exact match only for each language)...
  ... trial 1/200
  ... trial 51/200
  ... trial 101/200
  ... trial 151/200
  Done.
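
  Note: this output does not show how a random decoder is built. The sketch below assumes
  one plausible scheme: keep the decoder's set of output strings but shuffle which source
  glyph each one is assigned to, then count exact dictionary matches per trial. The glyph
  mapping, the `decode` helper, and the shuffling scheme are all hypothetical.

```python
import random


def decode(token, mapping):
    """Decode a token glyph by glyph; glyphs missing from the mapping are dropped."""
    return "".join(mapping.get(glyph, "") for glyph in token)


def random_trials(tokens, mapping, dictionary, trials=200, seed=0):
    """Return the exact-match token count for each randomly shuffled decoder."""
    rng = random.Random(seed)          # seeded so a run can be reproduced
    glyphs = list(mapping)
    outputs = [mapping[g] for g in glyphs]
    counts = []
    for _ in range(trials):
        shuffled = outputs[:]
        rng.shuffle(shuffled)          # assumption: same outputs, random assignment
        rnd_map = dict(zip(glyphs, shuffled))
        counts.append(sum(decode(t, rnd_map) in dictionary for t in tokens))
    return counts
```

  Shuffling the real decoder's outputs keeps the output inventory fixed, so a random
  decoder differs from the tested one only in which glyph maps to which string.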

================================================================================
RESULTS: H12 vs RANDOM DECODERS (per language)
================================================================================

  Language       Dict Size   H12 Exact    H12 %     Rnd Avg     Rnd Std   Z-score   Beat H12
  ------------  ----------  ----------  -------  ----------  ----------  --------  ---------
  Arabic           379,737       3,365     9.4%      6020.7      2979.4     -0.89   175/200
  Hebrew           897,978           0     0.0%       144.8       198.4     -0.73   200/200
  Hindi            127,451      12,917    36.0%     15292.2      1480.4     -1.60   196/200
  Latin            141,827      10,848    30.2%     11764.4      1191.0     -0.77   162/200
  Sinhala        1,470,278      16,977    47.3%     18595.3      1974.2     -0.82   154/200
  Tamil            403,980       5,509    15.3%      6702.0      1466.0     -0.81   157/200
  Turkish          409,612      14,533    40.5%     16885.6      1322.0     -1.78   198/200
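
  Note: the Z-score column is consistent with (H12 Exact - Rnd Avg) / Rnd Std, and Beat H12
  with the number of trials whose count exceeds H12's. A minimal sketch follows. The function
  names are illustrative, and the use of the population standard deviation is an assumption,
  since the report does not say which estimator was used.

```python
import statistics


def z_from_summary(h12_exact, rnd_avg, rnd_std):
    """Z-score of the H12 match count against the random-decoder distribution."""
    return (h12_exact - rnd_avg) / rnd_std


def compare_to_random(h12_exact, random_counts):
    """Summarize random trials and compare the H12 count to them."""
    mean = statistics.fmean(random_counts)
    std = statistics.pstdev(random_counts)  # assumption: population std
    z = (h12_exact - mean) / std if std > 0 else float("nan")
    beat = sum(c > h12_exact for c in random_counts)  # trials that outscored H12
    return mean, std, z, beat


# Reproducing two table rows from their summary statistics:
print(round(z_from_summary(16977, 18595.3, 1974.2), 2))  # Sinhala
print(round(z_from_summary(14533, 16885.6, 1322.0), 2))  # Turkish
```

  A negative Z means H12 matched fewer tokens than the average random decoder for that
  language; every row in the table is negative.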

================================================================================
SIZE-NORMALIZED COMPARISON
(Tokens matched per 100K dictionary entries — controls for dict size)
================================================================================

  Language        H12 per 100K    Rnd per 100K     Ratio
  ------------  --------------  --------------  --------
  Arabic                 886.1          1585.5     0.56x
  Hebrew                   0.0            16.1     0.00x
  Hindi                10134.9         11998.5     0.84x
  Latin                 7648.8          8294.9     0.92x
  Sinhala               1154.7          1264.7     0.91x
  Tamil                 1363.7          1659.0     0.82x
  Turkish               3548.0          4122.3     0.86x
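
  Note: each rate is matched tokens divided by dictionary size, scaled to 100,000 entries,
  and Ratio is the H12 rate over the random rate. A short sketch with the Sinhala row as
  input (the function name is illustrative):

```python
def per_100k(matched_tokens, dict_size):
    """Matched tokens per 100,000 dictionary entries."""
    return matched_tokens / dict_size * 100_000


sinhala_h12 = per_100k(16977, 1_470_278)      # H12 exact matches
sinhala_rnd = per_100k(18595.3, 1_470_278)    # average over random decoders
ratio = sinhala_h12 / sinhala_rnd

print(f"{sinhala_h12:.1f}  {sinhala_rnd:.1f}  {ratio:.2f}x")
```

  Because both rates share the same dictionary size, the Ratio column equals H12 Exact
  divided by Rnd Avg. The normalization changes how languages compare with each other,
  not how H12 compares with random within one language.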

================================================================================
LANGUAGE RANKING
================================================================================

  By Z-score (H12 exact matches vs. random decoders; all values are negative, i.e. below the random mean):
    1. Hebrew        Z =  -0.73
    2. Latin         Z =  -0.77
    3. Tamil         Z =  -0.81
    4. Sinhala       Z =  -0.82 <-- H12 hypothesis
    5. Arabic        Z =  -0.89
    6. Hindi         Z =  -1.60
    7. Turkish       Z =  -1.78

  By raw exact-match coverage:
    1. Sinhala         16,977 tokens ( 47.3%) <-- H12 hypothesis
    2. Turkish         14,533 tokens ( 40.5%)
    3. Hindi           12,917 tokens ( 36.0%)
    4. Latin           10,848 tokens ( 30.2%)
    5. Tamil            5,509 tokens ( 15.3%)
    6. Arabic           3,365 tokens (  9.4%)
    7. Hebrew               0 tokens (  0.0%)

================================================================================
VERDICT
================================================================================

  Best language by Z-score:     Hebrew (Z=-0.73; least negative, with 0 exact matches)
  Best language by raw matches: Sinhala (16,977 tokens)
  Second-best by raw matches:   Turkish (14,533 tokens)
  Discrimination ratio:         1.2x (best / second-best raw matches)

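  Sinhala has the highest raw coverage but ranks 4th of 7 by Z-score (Z=-0.82).
  H12 falls below the random-decoder mean for every language, Sinhala included:
  154 of 200 random decoders matched more Sinhala tokens than H12 did.
  Sinhala's raw lead may partly reflect its dictionary size (1,470,278 words, the largest):
  per 100K entries, H12 matches 0.91x the random average.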

  Per-language assessment:
    Arabic      :   9.4% coverage, Z= -0.89 -> RULED OUT (below random)
    Hebrew      :   0.0% coverage, Z= -0.73 -> RULED OUT (below random)
    Hindi       :  36.0% coverage, Z= -1.60 -> RULED OUT (below random)
    Latin       :  30.2% coverage, Z= -0.77 -> RULED OUT (below random)
    Sinhala     :  47.3% coverage, Z= -0.82 -> HYPOTHESIS LANGUAGE (also below random)
    Tamil       :  15.3% coverage, Z= -0.81 -> RULED OUT (below random)
    Turkish     :  40.5% coverage, Z= -1.78 -> RULED OUT (below random)

================================================================================
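
  Note: the verdict figures can be rechecked from the tables. The sketch below assumes the
  discrimination ratio is the best raw match count divided by the second-best, which
  reproduces the reported 1.2x, and applies one below-random test to every language,
  the hypothesis language included. Both function names are illustrative.

```python
def discrimination_ratio(raw_matches):
    """Best raw match count divided by the second-best (assumed definition)."""
    ranked = sorted(raw_matches.values(), reverse=True)
    return ranked[0] / ranked[1]


def assess(z_score):
    """Apply the same threshold to every language, hypothesis or not."""
    return "below random" if z_score < 0 else "at or above random"


raw = {"Arabic": 3365, "Hebrew": 0, "Hindi": 12917, "Latin": 10848,
       "Sinhala": 16977, "Tamil": 5509, "Turkish": 14533}
z = {"Arabic": -0.89, "Hebrew": -0.73, "Hindi": -1.60, "Latin": -0.77,
     "Sinhala": -0.82, "Tamil": -0.81, "Turkish": -1.78}

print(f"Discrimination ratio: {discrimination_ratio(raw):.1f}x")
for language in sorted(z):
    print(f"{language:<12}: Z={z[language]:6.2f}  {assess(z[language])}")
```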
