Clinical-EN SmPC AWQ Calibration Corpus: an English-language activation-aware quantization calibration set derived from EMA Summary of Product Characteristics
Authors/Creators
- 1. Department of Respiratory Physiopathology, Medical University of Białystok, Poland
Description
Summary. The Clinical-EN SmPC AWQ Calibration Corpus is an English-language, clinical-domain text corpus assembled as a calibration set for W4A16 post-training quantization of English and multilingual large language models intended for clinical use, primarily activation-aware weight quantization (AWQ, for example via AutoAWQ) and also other calibration-based methods such as GPTQ. It comprises 412 text chunks (~512-token budget; observed length 80–255 words, median 161) of dense, domain-specific clinical English drawn from the pulmonology and thoracic-oncology therapeutic areas. The corpus is distributed as a single newline-delimited JSON file (corpus.jsonl) in which every record carries full per-chunk source provenance. It is the English cross-language counterpart to the Clinical-PL SmPC AWQ Calibration Corpus (mozarcik/clinical-pl-smpc-awq-calibration): it covers the same 61 EMA medicines, but the text is the English Annex I Summary of Product Characteristics rather than the Polish Charakterystyka Produktu Leczniczego, extracted from the same EMA EPAR documents. Together the two corpora support cross-lingual calibration experiments alongside the mozarcik/Llama-PLLuM-70B-*-awq series, with this corpus supplying the English side.
What it is. A calibration corpus for post-training quantization (AWQ / AutoAWQ / GPTQ) of English-language and multilingual clinical LLMs. Activation-aware quantization requires a representative, dense sample of in-domain text so that per-channel activation scales are estimated on data resembling the deployment distribution; for the clinical use case that target distribution is dense clinical English (pulmonology and thoracic oncology).
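To illustrate why the calibration text matters, the following minimal sketch computes per-channel activation statistics on a toy matrix. It is a simplified illustration, not the AutoAWQ implementation: the function names and the toy activations are hypothetical, and the exponent alpha, which AWQ searches over, is simply fixed here.

```python
from typing import List


def per_channel_mean_abs(activations: List[List[float]]) -> List[float]:
    """Mean absolute activation per channel over all calibration tokens.

    `activations` is a tokens x channels matrix (hypothetical toy input).
    """
    if not activations:
        raise ValueError("need at least one token of activations")
    n_tokens = len(activations)
    sums = [0.0] * len(activations[0])
    for row in activations:
        for channel, value in enumerate(row):
            sums[channel] += abs(value)
    return [total / n_tokens for total in sums]


def awq_style_scales(mean_abs: List[float], alpha: float = 0.5) -> List[float]:
    """Simplified per-channel scale: mean_abs ** alpha (alpha fixed, not searched)."""
    return [m ** alpha for m in mean_abs]


# Toy example: channel 1 carries much larger activations than channel 0,
# so it receives the larger scale.
toy_activations = [[0.1, 4.0], [-0.1, -4.0], [0.1, 4.0]]
stats = per_channel_mean_abs(toy_activations)
scales = awq_style_scales(stats, alpha=0.5)
```

Because these statistics are averaged over the calibration tokens, text that does not resemble the deployment domain would shift them, which is the motivation for an in-domain corpus.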
How it was built. For each of the 61 centrally authorised European Medicines Agency (EMA) medicines shared with the Polish corpus, the English Product Information PDF (Annex I SmPC) was retrieved from EMA on 2026-06-06 using the documented English PI URL pattern (.../product-information/<epar_slug>-epar-product-information_en.pdf). The brand → EPAR-slug map (manifest.json) is derived directly from the verified source URLs of the Polish corpus, so no URL is guessed. Extraction (script extract_corpus.py, using PyMuPDF; an English-parametrized copy of the Polish extractor) was restricted to Annex I clinical prose, excluding Annex II/III labelling and leaflet text and the Section 6 pharmaceutical-particulars tail (excipients, shelf-life, marketing-authorisation boilerplate). Text was chunked with a soft window of ~150 words (hard cap 255, minimum 80), preferring sentence boundaries, and sampled per drug in proportion to that drug's clinical-prose volume, with section weighting biased toward clinical-efficacy (SmPC §5.1), pharmacokinetics (§5.2), dosing (§4.2) and special-warnings (§4.4) content. The chunking, sampling and section-weighting logic is byte-for-byte identical to the Polish logic; only language-bound parsing tokens and the explicit fetch step differ. The extraction workflow is reproducible and included in the deposit.
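The chunking rule described above (soft window of about 150 words, hard cap 255, minimum 80, sentence boundaries preferred) can be sketched roughly as follows. This is not the deposited extract_corpus.py: the sentence splitter, the function names and the choice to drop chunks shorter than the minimum are assumptions made for illustration only.

```python
import re
from typing import List

SOFT_WORDS = 150  # soft window, from the description
HARD_CAP = 255    # hard cap, from the description
MIN_WORDS = 80    # minimum chunk length, from the description


def split_sentences(text: str) -> List[str]:
    # Naive splitter (assumption): break after '.', '!' or '?' plus whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_text(text: str) -> List[str]:
    chunks: List[List[str]] = []
    current: List[str] = []
    for sentence in split_sentences(text):
        words = sentence.split()
        # Close the chunk at the sentence boundary if adding this
        # sentence would push it past the hard cap.
        if current and len(current) + len(words) > HARD_CAP:
            chunks.append(current)
            current = []
        # A single sentence longer than the hard cap is cut at the cap.
        while len(words) > HARD_CAP:
            chunks.append(words[:HARD_CAP])
            words = words[HARD_CAP:]
        current.extend(words)
        # Once the soft window is reached, end the chunk after this sentence.
        if len(current) >= SOFT_WORDS:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    # Assumption: chunks below the minimum length are discarded.
    return [" ".join(c) for c in chunks if len(c) >= MIN_WORDS]
```

Calling `chunk_text` on a section's clinical prose returns a list of strings, each between 80 and 255 words long; per-drug sampling and section weighting would be applied afterwards and are not shown here.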
Per-chunk provenance schema (every record in corpus.jsonl).
{
  "text": "… (English SmPC clinical prose, ~512 tokens) …",
  "source_authority": "EMA",
  "source_document_type": "SmPC / Product Information",
  "source_url": "https://www.ema.europa.eu/en/documents/product-information/tagrisso-epar-product-information_en.pdf",
  "medicine": "osimertinib",
  "brand_name": "TAGRISSO",
  "language": "en",
  "retrieved_at": "2026-06-06",
  "chunk_id": "EMA_osimertinib_en_0001",
  "license_note": "EMA reproduction policy; source attribution required"
}
All 412 released chunks have source_authority = "EMA", source_document_type = "SmPC / Product Information" and language = "en". corpus.jsonl SHA-256: 820b7a4de75f20baddf3d89fa73a3d8f348ae22086f1ed52f0278ec6040aac85.
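A downloaded copy can be checked against the published checksum and schema with a short script such as the sketch below. The field names, constant values and SHA-256 digest are taken from this description; the function names and the default local path corpus.jsonl are assumptions.

```python
import hashlib
import json
from typing import Dict, List

EXPECTED_SHA256 = "820b7a4de75f20baddf3d89fa73a3d8f348ae22086f1ed52f0278ec6040aac85"
REQUIRED_KEYS = {
    "text", "source_authority", "source_document_type", "source_url",
    "medicine", "brand_name", "language", "retrieved_at", "chunk_id",
    "license_note",
}
CONSTANT_FIELDS = {
    "source_authority": "EMA",
    "source_document_type": "SmPC / Product Information",
    "language": "en",
}


def sha256_of_file(path: str) -> str:
    """Hex SHA-256 digest of a file, read in 1 MiB blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def validate_record(record: Dict[str, str]) -> List[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    for key, expected in CONSTANT_FIELDS.items():
        if key in record and record[key] != expected:
            problems.append(f"{key} is {record[key]!r}, expected {expected!r}")
    return problems


def validate_corpus(path: str = "corpus.jsonl") -> List[str]:
    """Check the file digest, then every non-empty JSONL line."""
    problems = []
    if sha256_of_file(path) != EXPECTED_SHA256:
        problems.append("SHA-256 does not match the published digest")
    with open(path, encoding="utf-8") as fh:
        for number, line in enumerate(fh, start=1):
            if line.strip():
                for issue in validate_record(json.loads(line)):
                    problems.append(f"line {number}: {issue}")
    return problems
```

An empty list from `validate_corpus()` would indicate that the file matches the published digest and that every record carries the documented fields.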
Intended use. Calibration data for post-training quantization (AWQ / AutoAWQ / GPTQ) of English-language and multilingual clinical LLMs, and the English side of cross-lingual calibration experiments paired with the Polish corpus; reuse as a corpus-controlled calibration set enabling like-for-like quantization-quality comparison across model scales and languages.
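For the calibration use described above, the corpus can be read into a plain list of strings, as in this minimal sketch. The function name and default path are hypothetical; only the `text` field and the newline-delimited JSON layout come from this description.

```python
import json
from typing import List


def load_calibration_texts(path: str = "corpus.jsonl") -> List[str]:
    """Read the `text` field of every non-empty line of a JSONL file."""
    texts = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                texts.append(json.loads(line)["text"])
    return texts
```

The resulting list can then be handed to a quantization toolkit that accepts calibration samples as strings. AutoAWQ's `quantize` method is understood to take such a list through its `calib_data` argument, but the exact signature should be checked against the installed version.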
What it is NOT. This is not a training set and not an evaluation / benchmark set; it is not a question-answering, instruction-tuning, or clinical-decision-support dataset and must not be used to fine-tune clinical behaviour or to evaluate clinical accuracy. It contains no patient data and no protected health information (PHI): SmPC documents describe medicinal products (indications, dosing, adverse reactions, pharmacokinetics, aggregate clinical-trial data), not individuals. Every chunk's text is verbatim source text (verified: 412/412 chunks are a contiguous span of their source document after line-cleaning; zero fabricated or paraphrased text) and absence of PHI was confirmed by automated pattern scan. The corpus is not clinical advice and confers no clinical authority; the canonical, legally-authoritative product information remains the EMA-published SmPC.
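The verbatim-text check mentioned above can be approximated as follows. This is a sketch under a stated assumption, not the author's verification code: "line-cleaning" is modelled here simply as collapsing runs of whitespace, and the function names are hypothetical.

```python
import re


def normalize(text: str) -> str:
    # Assumed stand-in for line-cleaning: collapse all whitespace runs
    # (including line breaks) into single spaces.
    return re.sub(r"\s+", " ", text).strip()


def is_contiguous_span(chunk: str, source: str) -> bool:
    """True if the cleaned chunk occurs as one unbroken span of the cleaned source."""
    needle = normalize(chunk)
    return bool(needle) and needle in normalize(source)
```

Running `is_contiguous_span` for each record against the extracted text of its source document, and counting the matches, would reproduce the kind of 412/412 tally reported above.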
Rights and source attribution. This dataset is a calibration corpus that contains text chunks derived from official medicinal product information documents. Source text is derived from EMA-published English SmPC / Product Information documents. EMA source text: © European Medicines Agency. EMA-published documents are reproduced and distributed under EMA's content-reproduction policy, which permits reproduction and/or distribution, in whole or in part, for non-commercial and commercial purposes, provided that EMA is always acknowledged as the source. Compilation, drug selection, extraction workflow, chunking, dataset structuring and metadata: Łukasz Minarowski / navimed-umb. No claim is made that the underlying SmPC source text is licensed under CC-BY-4.0, CC-BY-NC-4.0, MIT, Apache-2.0 or any other open-source software licence. Users are responsible for preserving EMA source attribution and for checking source-specific restrictions before redistribution or downstream use. The licence on this record is therefore set to "other (attribution)" (EMA public-reproduction policy and source-specific reuse terms); the author's compilation contribution is the author's own work but does not relicense the underlying source text.
Context. Produced for NaviMed-UMB (https://github.com/kicrazom/navimed-umb), a local-LLM benchmarking and clinical-decision-support feasibility project at the Medical University of Białystok. The corpus is published on HuggingFace at mozarcik/clinical-en-smpc-awq-calibration; this deposit makes it independently citable with its own DOI. It is the English cross-language counterpart to the Polish corpus mozarcik/clinical-pl-smpc-awq-calibration (same 61 medicines, English Annex I instead of Polish ChPL).
AI assistance disclosure. Project documentation was prepared with assistance from large language models (Claude, Anthropic; GPT, OpenAI; Gemini, Google). All drug selection, extraction design, provenance assignment and scientific claims are the author's. See AI_USAGE_DISCLOSURE.md in the repository.
Files
(629.8 kB)
| Name | Checksum | Size |
|---|---|---|
| | md5:9f996dc1f40d7045fc9de57906e2e2ca | 629.8 kB |