================================================================================
STRUCTURAL MULTI-LANGUAGE TEST (FAIR VERSION)
Are the dictionary matches MEDICAL terms, or just coincidental CV overlap?
================================================================================

FAIRNESS CONTROLS:
  - All 6 languages use the SAME equalized pharmaceutical vocabulary
  - 115 concepts across 8 medical categories (term counts vary by language)
  - Results normalized by vocabulary size (matched tokens per 100 vocabulary terms)
  - Same random decoder baseline for all languages

Loading corpus and decoding with H12...
  35,916 tokens

Loading equalized pharmaceutical vocabulary...
  115 concepts across 8 categories
  Arabic      :  193 terms
  Hindi       :  188 terms
  Latin       :  174 terms
  Sinhala     :  180 terms
  Tamil       :  179 terms
  Turkish     :  154 terms

================================================================================
TEST 1: EQUALIZED CROSS-LANGUAGE PHARMACEUTICAL VOCABULARY
Does H12 output match pharmaceutical terms in each language?
All languages use the same 115-concept vocabulary.
================================================================================

  Arabic      :    399 pharma tokens ( 1.11%)    5 types  [vocab: 193]  [per 100 terms:  206.7]
    Top: gala(302), sara(90), usara(4), ama(2), maa(1)

  Hindi       :    344 pharma tokens ( 0.96%)    6 types  [vocab: 188]  [per 100 terms:  183.0]
    Top: gala(302), dena(16), lena(14), guda(9), le(2)

  Latin       :      1 pharma token  ( 0.00%)    1 type   [vocab: 174]  [per 100 terms:    0.6]
    Top: sapa(1)

  Sinhala     :  5,021 pharma tokens (13.98%)   35 types  [vocab: 180]  [per 100 terms: 2789.4]
    Top: ula(1131), gena(797), ura(607), ara(461), ala(348)

  Tamil       :     74 pharma tokens ( 0.21%)    3 types  [vocab: 179]  [per 100 terms:   41.3]
    Top: tula(67), aara(4), pu(3)

  Turkish     :    105 pharma tokens ( 0.29%)    4 types  [vocab: 154]  [per 100 terms:   68.2]
    Top: sara(90), agu(7), su(7), uc(1)
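
The three figures on each row above (token count, share of the corpus, and
matches per 100 vocabulary terms) can be sketched as follows. This is a minimal
illustration, not the script that produced this report: the function name
`pharma_match_stats` and the toy token list are hypothetical, and it assumes a
match means exact string equality between a decoded token and a vocabulary term.

```python
from collections import Counter

def pharma_match_stats(decoded_tokens, vocab_terms):
    """Count decoded tokens that exactly equal a term in one language's vocabulary."""
    vocab = set(vocab_terms)  # duplicates in vocab_terms collapse here
    hits = Counter(tok for tok in decoded_tokens if tok in vocab)
    matched = sum(hits.values())
    return {
        "tokens": matched,                                        # "pharma tokens"
        "types": len(hits),                                       # distinct matched terms
        "share_pct": 100.0 * matched / len(decoded_tokens),       # share of all tokens
        "per_100_terms": 100.0 * matched / len(vocab),            # normalized by vocab size
        "top": hits.most_common(5),
    }

# Toy data only; the real corpus has 35,916 tokens.
tokens = ["gala", "sara", "ula", "gala", "xx", "sara", "gala", "ama"]
vocab = ["gala", "sara", "ama", "maa"]
stats = pharma_match_stats(tokens, vocab)
print(stats)

# Recomputing the Arabic row from its reported counts (399 tokens, 193 terms).
arabic_share = round(100.0 * 399 / 35916, 2)
arabic_per_100 = round(100.0 * 399 / 193, 1)
print(arabic_share, arabic_per_100)
```

The last two lines recompute the Arabic row from the counts printed in this
report, which is a quick way to confirm how the normalization is defined.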

================================================================================
TEST 2: PER-CATEGORY PHARMACEUTICAL MATCHING
Which medical categories drive the matches for each language?
================================================================================

  Category              Arabic     Hindi     Latin   Sinhala     Tamil   Turkish
  ------------------  --------  --------  --------  --------  --------  --------
  ingredient                 1         0         1     1,179         0         7
  plant_part                 4        10         0       596         3         0
  body_part                  0       302         0     1,164         0         0
  disease                   92         0         0     1,012         0        97
  process                  302        32         0     1,185         4         0
  preparation                0         0         0         0         0         0
  dosage_timing              0         0         0       256        67         1
  quality                    0         0         0         1         0         0
  TOTAL                    399       344         1     5,393        74       105
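
The Sinhala TOTAL in this table (5,393) is larger than the Sinhala count in
Test 1 (5,021), while the other five totals equal their Test 1 counts. The
output does not say why. One possible explanation is that some Sinhala terms
are listed under more than one category and are therefore counted once per
category. The sketch below shows that effect on toy data; the function name,
the `term_categories` mapping, and the placeholder terms are all hypothetical.

```python
from collections import Counter, defaultdict

def per_category_counts(decoded_tokens, term_categories):
    """Sum token counts per category; a term in N categories is counted N times."""
    token_counts = Counter(decoded_tokens)
    totals = defaultdict(int)
    for term, categories in term_categories.items():
        for category in categories:
            totals[category] += token_counts[term]
    return dict(totals)

# Toy data: "term_a" is (hypothetically) listed under two categories.
tokens = ["term_a"] * 3 + ["term_b"] * 2 + ["other"]
term_categories = {
    "term_a": {"ingredient", "process"},
    "term_b": {"disease"},
}

by_category = per_category_counts(tokens, term_categories)
category_total = sum(by_category.values())
distinct_total = sum(1 for tok in tokens if tok in term_categories)
print(by_category, category_total, distinct_total)
```

Here the per-category total is 8 while only 5 tokens matched, the same kind of
gap as 5,393 versus 5,021. Whether this is what happened in the real run would
need to be checked against the vocabulary file.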

================================================================================
TEST 3: PANCHAVIDHA KASHAYA KALPANA (Classical Dosage Forms)
CAVEAT: The Sinhala Panchavidha terms are H12-decoded forms.
Testing them against H12 output is partially circular.
The Z-score comparison (H12 vs random decoders) mitigates this,
since random decoders produce DIFFERENT decoded forms.
================================================================================

  SINHALA (Ayurvedic Panchavidha):
    ugeda        = Churna          (powder         )    476 tokens
    ea           = Ghrita          (ghee vehicle   )    344 tokens
    uteda        = Kashaya         (decoction      )    323 tokens
    mea          = Madhu           (honey vehicle  )    282 tokens
    gula         = Vati/Gutika     (pill           )    135 tokens
    ugeea        = Sneha           (fat-soluble    )      0 tokens
    TOTAL                                             1,560 tokens

  EQUALIZED PREPARATION TERMS (from equalized vocab):
    Arabic      :     0 tokens  (0 types of 20 vocab)
    Hindi       :     0 tokens  (0 types of 15 vocab)
    Latin       :     0 tokens  (0 types of 14 vocab)
    Sinhala     :     0 tokens  (0 types of 18 vocab)
    Tamil       :     0 tokens  (0 types of 18 vocab)
    Turkish     :     0 tokens  (0 types of 12 vocab)

================================================================================
PHARMACEUTICAL MATCHING SUMMARY
================================================================================

  Language      Pharma Tok  Vocab Size     Per 100  Dict Match   Med Density   Dict Size
  ------------  ----------  ----------  ----------  ----------  ------------  ----------
  Arabic               399         193       206.7       3,365         11.9%     379,737
  Hindi                344         188       183.0      12,917          2.7%     127,451
  Latin                  1         174         0.6      10,848          0.0%     141,827
  Sinhala            5,021         180      2789.4      16,977         29.6%   1,470,278
  Tamil                 74         179        41.3       5,509          1.3%     403,980
  Turkish              105         154        68.2      14,533          0.7%     409,612
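
The derived columns in this table can be recomputed from the raw counts. The
output does not define them, so the readings below are inferred from the
numbers: "Per 100" appears to be pharma tokens per 100 vocabulary terms, and
"Med Density" appears to be pharma tokens as a percentage of all dictionary
matches. The sketch checks both readings against the six reported rows.

```python
# (language, pharma tokens, vocab size, dictionary matches), copied from the table.
rows = [
    ("Arabic",   399, 193,  3365),
    ("Hindi",    344, 188, 12917),
    ("Latin",      1, 174, 10848),
    ("Sinhala", 5021, 180, 16977),
    ("Tamil",     74, 179,  5509),
    ("Turkish",  105, 154, 14533),
]

derived = {}
for language, pharma, vocab_size, dict_match in rows:
    per_100 = round(100.0 * pharma / vocab_size, 1)      # assumed meaning of "Per 100"
    density = round(100.0 * pharma / dict_match, 1)      # assumed meaning of "Med Density"
    derived[language] = (per_100, density)
    print(f"{language:<8} per_100={per_100:>7.1f}  med_density={density:>5.1f}%")
```

Both inferred formulas reproduce every row of the table to the printed
precision, which supports the readings but does not prove they are what the
script computes.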

================================================================================
WHAT DO THE MATCHES MEAN?
Top 10 dictionary matches per language — are they medical?
================================================================================

  Arabic top 10 matches:
     1. ara                 461 tokens  
     2. ala                 348 tokens  
     3. sa                  257 tokens  
     4. tha                 208 tokens  
     5. kha                 195 tokens  
     6. da                  163 tokens  
     7. ra                  125 tokens  
     8. ma                  121 tokens  
     9. la                  111 tokens  
    10. ka                   96 tokens  

  Hindi top 10 matches:
     1. ula               1,131 tokens  
     2. ura                 607 tokens  
     3. eda                 530 tokens  
     4. ena                 494 tokens  
     5. ugena               491 tokens  
     6. ara                 461 tokens  
     7. meda                445 tokens  
     8. uta                 437 tokens  
     9. ga                  392 tokens  
    10. ala                 348 tokens  

  Latin top 10 matches:
     1. ula               1,131 tokens  
     2. gena                797 tokens  
     3. ura                 607 tokens  
     4. ara                 461 tokens  
     5. meda                445 tokens  
     6. uta                 437 tokens  
     7. ga                  392 tokens  
     8. ala                 348 tokens  
     9. ea                  344 tokens  
    10. gala                302 tokens  

  Sinhala top 10 matches:
     1. ula               1,131 tokens  PHARMA
     2. gena                797 tokens  PHARMA
     3. ura                 607 tokens  PHARMA
     4. eda                 530 tokens  
     5. ena                 494 tokens  
     6. ugena               491 tokens  
     7. ara                 461 tokens  PHARMA
     8. uga                 458 tokens  
     9. meda                445 tokens  
    10. uta                 437 tokens  

  Tamil top 10 matches:
     1. ula               1,131 tokens  
     2. ura                 607 tokens  
     3. ena                 494 tokens  
     4. ara                 461 tokens  
     5. ala                 348 tokens  
     6. sa                  257 tokens  
     7. utala               240 tokens  
     8. tha                 208 tokens  
     9. ra                  125 tokens  
    10. mu                  123 tokens  

  Turkish top 10 matches:
     1. ula               1,131 tokens  
     2. gena                797 tokens  
     3. ura                 607 tokens  
     4. eda                 530 tokens  
     5. ena                 494 tokens  
     6. ara                 461 tokens  
     7. meda                445 tokens  
     8. uta                 437 tokens  
     9. ga                  392 tokens  
    10. ala                 348 tokens  

================================================================================
RANDOM DECODER COMPARISON (200 trials)
Do random decoders also produce pharmaceutical matches?
Uses equalized vocabulary for ALL languages.
================================================================================

  Sinhala Panchavidha (CAVEAT: partially circular):
    H12:           1,560 tokens
    Random avg:      443.7
    Random std:      155.0
    Z-score:           7.2
    Beat H12:     1/200

  Equalized pharmaceutical matching (Z-scores):
  (Same 115-concept vocabulary for all languages)

    Arabic      : H12=  399 (norm= 206.7)  Rnd avg= 244.6 (norm= 126.7)  Z= 0.78  Beat=48/200  [vocab=193]
    Hindi       : H12=  344 (norm= 183.0)  Rnd avg= 353.4 (norm= 188.0)  Z=-0.03  Beat=68/200  [vocab=188]
    Latin       : H12=    1 (norm=   0.6)  Rnd avg=  19.5 (norm=  11.2)  Z=-0.38  Beat=200/200  [vocab=174]
    Sinhala     : H12=5,021 (norm=2789.4)  Rnd avg=2657.7 (norm=1476.5)  Z= 1.73  Beat=9/200  [vocab=180]
    Tamil       : H12=   74 (norm=  41.3)  Rnd avg=  55.0 (norm=  30.7)  Z= 0.32  Beat=83/200  [vocab=179]
    Turkish     : H12=  105 (norm=  68.2)  Rnd avg= 171.7 (norm= 111.5)  Z=-0.56  Beat=131/200  [vocab=154]
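
The Z-score and "Beat" figures above compare the H12 count with the
distribution of counts from the random decoders. A minimal sketch of that
comparison follows. The function name `z_and_beat` is hypothetical, and two
details are assumptions because the output does not state them: whether the
standard deviation is the population or sample form, and whether "Beat" counts
trials strictly above H12 or ties as well.

```python
import statistics

def z_and_beat(h12_count, random_counts):
    """Return (mean, std, z, beat) for one language's random-decoder trials."""
    mean = statistics.mean(random_counts)
    # Assumption: population standard deviation; the script may use the sample form.
    std = statistics.pstdev(random_counts)
    z = (h12_count - mean) / std  # undefined if every trial gives the same count
    # Assumption: "Beat" counts trials strictly above the H12 count.
    beat = sum(1 for count in random_counts if count > h12_count)
    return mean, std, z, beat

# Toy trial counts, not the real 200 trials.
toy_trials = [10, 20, 30, 40]
mean, std, z, beat = z_and_beat(35, toy_trials)
print(mean, round(std, 2), round(z, 2), beat)

# Recomputing the reported Panchavidha Z-score from its summary figures.
panchavidha_z = (1560 - 443.7) / 155.0
print(round(panchavidha_z, 1))
```

The Panchavidha figures reproduce the reported Z of 7.2. Working backwards
from the Sinhala row, a gap of 5,021 - 2,657.7 tokens with Z = 1.73 implies a
random-decoder standard deviation of roughly 1,370 tokens, which is why a large
raw gap still yields a modest Z-score.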

================================================================================
VERDICT
================================================================================

  Pharmaceutical matches (equalized vocab, ranked):
    Sinhala     :  5,021 tokens  (per 100: 2789.4)  Z= 1.73 <== H12
    Arabic      :    399 tokens  (per 100:  206.7)  Z= 0.78
    Hindi       :    344 tokens  (per 100:  183.0)  Z=-0.03
    Turkish     :    105 tokens  (per 100:   68.2)  Z=-0.56
    Tamil       :     74 tokens  (per 100:   41.3)  Z= 0.32
    Latin       :      1 token   (per 100:    0.6)  Z=-0.38

  Panchavidha Z-score:  7.2  (partially circular)

  PASS: Sinhala pharmaceutical matches exceed every other language by more
  than 2x (2789.4 vs 206.7 per 100 terms for the runner-up, Arabic).

  HONESTY NOTE: The Sinhala pharma Z-score (1.73) is below 2, so the
  result is not statistically significant against the random-decoder baseline.
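
The verdict combines two separate checks: a dominance ratio against the other
languages and a Z-score threshold against random decoders. The sketch below
shows how the two can disagree. The function name `verdict` and both default
thresholds are assumptions based on the "2x+" and "below 2" wording, and the
output does not say whether dominance is measured on raw tokens or on the
per-100 figure; the per-100 figure is used here.

```python
def verdict(per_100, z_scores, target, ratio_threshold=2.0, z_threshold=2.0):
    """Check dominance over the runner-up and significance against random decoders."""
    runner_up = max((lang for lang in per_100 if lang != target), key=per_100.get)
    runner_up_value = per_100[runner_up]
    ratio = per_100[target] / runner_up_value if runner_up_value > 0 else float("inf")
    return {
        "runner_up": runner_up,
        "dominance_ratio": ratio,
        "dominates": ratio >= ratio_threshold,
        "significant": z_scores[target] >= z_threshold,
    }

# Per-100 figures and Z-scores copied from the ranked list above.
per_100 = {"Sinhala": 2789.4, "Arabic": 206.7, "Hindi": 183.0,
           "Turkish": 68.2, "Tamil": 41.3, "Latin": 0.6}
z_scores = {"Sinhala": 1.73, "Arabic": 0.78, "Hindi": -0.03,
            "Turkish": -0.56, "Tamil": 0.32, "Latin": -0.38}

result = verdict(per_100, z_scores, target="Sinhala")
print(result["runner_up"], round(result["dominance_ratio"], 1),
      result["dominates"], result["significant"])
```

With the reported figures the dominance check passes and the significance
check fails, matching the PASS line and the HONESTY NOTE.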

================================================================================
