================================================================================
CROSS-LANGUAGE PHARMACEUTICAL DISCRIMINATION TEST
Does H12 decode to Sinhala rather than Tamil, Hindi, Malayalam, or Pali?
================================================================================

Loading cross-language pharmaceutical vocabulary...
  Source: crosslang_pharmaceutical_vocab.tsv
  Concepts: 60
  sinhala     :  102 term variants
  tamil       :  115 term variants
  hindi       :  115 term variants
  malayalam   :  105 term variants
  pali        :  102 term variants

Loading Voynich corpus...
  35,916 tokens, 7,733 types

Decoding corpus with H12...
  Decoded 7,733 unique types -> 4,972 unique decoded forms

H12 per-language match counts:
  Language        Tokens   Types   Token %
  ------------  --------  ------  --------
  sinhala          5,094      28    14.18%
  tamil                3       1     0.01%
  hindi              332       3     0.92%
  malayalam           77       2     0.21%
  pali             1,306      10     3.64%
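
The per-language counts above can be reproduced in outline as follows. This is a minimal sketch that assumes a decoded token counts as a match when it is exactly equal to one of a language's term variants; the log does not state the matching rule, and the function name `count_matches` and the toy data are hypothetical.

```python
from collections import Counter

def count_matches(decoded_tokens, vocab):
    """Return (token_matches, type_matches, token_pct) for one language.

    Assumption: a match is an exact string match between a decoded form
    and a term variant. The real test script may use a different rule.
    """
    freq = Counter(decoded_tokens)
    matched = {form: n for form, n in freq.items() if form in vocab}
    token_matches = sum(matched.values())   # every occurrence counts
    type_matches = len(matched)             # distinct matched forms
    total = len(decoded_tokens)
    token_pct = 100.0 * token_matches / total if total else 0.0
    return token_matches, type_matches, token_pct

# Toy data for illustration only (not the real corpus or vocabulary).
toy_tokens = ["ula", "gena", "ula", "xyz", "kara", "qqq", "ula", "abc"]
toy_vocab = {"ula", "gena", "kara", "mula"}
tok, typ, pct = count_matches(toy_tokens, toy_vocab)
print(f"{tok} tokens, {typ} types, {pct:.2f}%")
```

The Token % column follows the same arithmetic: 5,094 matched tokens out of 35,916 corpus tokens gives 14.18%.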

  Sinhala terms matched by H12:
    ula               (1,131 tokens)
    gena              (  797 tokens)
    ura               (  607 tokens)
    ara               (  461 tokens)
    ala               (  348 tokens)
    gara              (  331 tokens)
    gala              (  302 tokens)
    uda               (  182 tokens)
    mula              (  178 tokens)
    gula              (  135 tokens)
    leda              (  129 tokens)
    ugura             (  104 tokens)
    sula              (   75 tokens)
    kara              (   70 tokens)
    tula              (   67 tokens)
    ata               (   64 tokens)
    udara             (   51 tokens)
    dena              (   16 tokens)
    asa               (   10 tokens)
    guda              (    9 tokens)
    kasa              (    8 tokens)
    mukha             (    7 tokens)
    masa              (    6 tokens)
    phala             (    2 tokens)
    kata              (    1 token)
    ... and 3 more

Running 200 random decoders...
  ... trial 1/200
  ... trial 51/200
  ... trial 101/200
  ... trial 151/200
  Done.
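
The log does not say how the random decoders are built. One plausible construction, sketched below purely as an assumption, treats a decoder as a one-to-one character substitution table and creates each random decoder by shuffling which source character maps to which output character. The names `random_baseline`, `base_map`, and the toy inputs are hypothetical.

```python
import random
import statistics

def random_baseline(tokens, base_map, vocab, n_trials=200, seed=0):
    """Token-match counts for n_trials randomly shuffled decoders.

    Assumption: a "random decoder" keeps the output alphabet of the
    real decoder but permutes the source-to-output assignment.
    """
    rng = random.Random(seed)
    sources = sorted(base_map)
    targets = [base_map[s] for s in sources]
    counts = []
    for _ in range(n_trials):
        shuffled = targets[:]
        rng.shuffle(shuffled)
        table = dict(zip(sources, shuffled))
        # Decode each token character by character with the shuffled table.
        decoded = ("".join(table.get(ch, ch) for ch in tok) for tok in tokens)
        counts.append(sum(1 for form in decoded if form in vocab))
    return counts

# Toy inputs for illustration only.
toy_tokens = ["ab", "ba", "ab", "cc"]
toy_map = {"a": "u", "b": "l", "c": "a"}
toy_vocab = {"ul", "la"}
counts = random_baseline(toy_tokens, toy_map, toy_vocab, n_trials=200, seed=1)
# Population std is used here; the real script may use the sample std.
print(statistics.mean(counts), statistics.pstdev(counts))
```

Keeping the output alphabet fixed means a random decoder produces strings with the same letter inventory as H12, so the baseline measures the specific assignment rather than the choice of letters.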

================================================================================
RESULTS: H12 vs RANDOM DECODERS (per language)
================================================================================

  Language       H12 tok   H12 typ   Rnd avg tok   Rnd avg typ   Rnd std tok   Z-score
  ------------  --------  --------  ------------  ------------  ------------  --------
  sinhala          5,094        28       2,514.7          16.2       1,325.5      1.95
  tamil                3         1           6.2           0.6          19.0     -0.17
  hindi              332         3         319.7           4.0         350.7      0.03
  malayalam           77         2         210.3           1.0         437.0     -0.30
  pali             1,306        10         394.8           5.3         418.6      2.18

================================================================================
KEY METRICS
================================================================================

  Sinhala Z-score (H12 vs random Sinhala):  1.95
  Approximate p-value (one-sided, normal):  2.58e-02
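
The two figures above are consistent with a Z-score against the random-decoder mean and standard deviation, followed by an upper-tail p-value under a normal approximation. A minimal sketch, assuming that is how the script derives them (the function name `z_and_p` is hypothetical):

```python
import math

def z_and_p(observed, rnd_mean, rnd_std):
    """Z-score of the observed count against the random-decoder
    distribution, plus a one-sided upper-tail p-value assuming the
    random counts are roughly normal."""
    z = (observed - rnd_mean) / rnd_std
    p = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z) for a standard normal
    return z, p

# Sinhala row of the results table: H12 tokens, random mean, random std.
z, p = z_and_p(5094, 2514.7, 1325.5)
print(f"Z = {z:.2f}, p = {p:.2e}")
```

The normal approximation is rough here: the random-decoder standard deviations are comparable to or larger than their means for several languages, which suggests skewed distributions. An empirical p-value (the fraction of random trials that reach or exceed the H12 count) would avoid that assumption.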

  Language discrimination (H12 token matches):
    1. sinhala          5,094 tokens
    2. pali             1,306 tokens
    3. hindi              332 tokens
    4. malayalam           77 tokens
    5. tamil                3 tokens

  Best language:       sinhala (5,094 tokens)
  Next-best language:  pali (1,306 tokens)
  Discrimination ratio (best / next-best): 3.90x

  Language discrimination (H12 type matches):
    1. sinhala           28 types
    2. pali              10 types
    3. hindi              3 types
    4. malayalam          2 types
    5. tamil              1 type

  Type discrimination ratio: 2.80x
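
Both discrimination ratios are the best count divided by the next-best count. A short sketch using the H12 figures reported above (the function name `discrimination_ratio` is hypothetical):

```python
def discrimination_ratio(counts):
    """Return (best language, next-best language, best / next-best)."""
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_n), (second, second_n) = ranked[0], ranked[1]
    ratio = best_n / second_n if second_n else float("inf")
    return best, second, ratio

# H12 match counts copied from the report.
h12_tokens = {"sinhala": 5094, "tamil": 3, "hindi": 332, "malayalam": 77, "pali": 1306}
h12_types = {"sinhala": 28, "tamil": 1, "hindi": 3, "malayalam": 2, "pali": 10}

for label, counts in (("tokens", h12_tokens), ("types", h12_types)):
    best, second, ratio = discrimination_ratio(counts)
    print(f"{label}: {best} / {second} = {ratio:.2f}x")
```

Applying the same function to the random-decoder averages is one way to check how much of the ratio comes from the vocabulary rather than from H12.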

================================================================================
RANDOM DECODER LANGUAGE PROFILES (average tokens per language)
================================================================================

    1. sinhala        2,514.7 avg tokens
    2. pali             394.8 avg tokens
    3. hindi            319.7 avg tokens
    4. malayalam        210.3 avg tokens
    5. tamil              6.2 avg tokens

  Sinhala wins among random decoders: 187/200 (93.5%)
  (If all 5 languages were equally likely to win, expected ~20%, i.e. ~40/200)
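
To gauge how far 187/200 is from the equal-languages expectation, an exact binomial tail can be computed. This sketch assumes the 200 trials are independent and that, under the null, each of the 5 languages wins with probability 1/5; the function name `binom_upper_tail` is hypothetical.

```python
import math

def binom_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n_trials, sinhala_wins, n_languages = 200, 187, 5
p_equal = 1 / n_languages              # 20% if languages were equivalent
expected_wins = n_trials * p_equal     # about 40 wins out of 200
tail = binom_upper_tail(sinhala_wins, n_trials, p_equal)
print(f"expected wins: {expected_wins:.0f}, P(X >= {sinhala_wins}) = {tail:.1e}")
```

The tail probability is vanishingly small, so the random decoders clearly do not treat the five vocabularies as equivalent. That points to a property of the Sinhala term list itself (for example short, common letter patterns) rather than to anything specific to H12.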

================================================================================
VERDICT
================================================================================

  WEAK PASS

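  Sinhala is the best-matching language by raw counts (5,094 tokens vs 1,306 for Pali, ratio=3.90x), but its advantage over random decoders is modest (Z=1.95, p=0.026). Random decoders also favor Sinhala in 187/200 trials, and Pali has the higher Z-score (2.18), so the raw ratio may partly reflect the vocabulary rather than the decoder. More data or tighter vocabulary definitions may strengthen discrimination.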

================================================================================
