Clinical-EN SmPC AWQ Calibration Corpus: an English-language activation-aware quantization calibration set derived from EMA Summary of Product Characteristics

Minarowski, Łukasz

doi:10.5281/zenodo.20565901

Published June 6, 2026 | Version 1.0.0

Dataset | Open

Minarowski, Łukasz¹

1. Department of Respiratory Physiopathology, Medical University of Białystok, Poland

Summary. The Clinical-EN SmPC AWQ Calibration Corpus is an English-language, clinical-domain text corpus assembled to serve as a calibration set for activation-aware weight quantization and related calibration-based post-training methods (AWQ / AutoAWQ / GPTQ, W4A16) of English and multilingual large language models intended for clinical use. It comprises 412 text chunks (each within a ~512-token budget; observed length 80–255 words, median 161) of dense, domain-specific clinical English drawn from the pulmonology and thoracic-oncology therapeutic areas. The corpus is distributed as a single newline-delimited JSON file (corpus.jsonl) in which every record carries full per-chunk source provenance. It is the English cross-language counterpart to the Clinical-PL SmPC AWQ Calibration Corpus (mozarcik/clinical-pl-smpc-awq-calibration): it covers the same 61 EMA medicines, but the text is the English Annex I Summary of Product Characteristics rather than the Polish Charakterystyka Produktu Leczniczego, extracted from the same EMA EPAR documents. Pairing the two corpora supplies the English side of cross-lingual calibration experiments alongside the mozarcik/Llama-PLLuM-70B-*-awq series.

What it is. A calibration corpus for post-training quantization (AWQ / AutoAWQ / GPTQ) of English-language and multilingual clinical LLMs. Activation-aware quantization requires a representative, dense sample of in-domain text so that per-channel activation scales are estimated on data resembling the deployment distribution; for the clinical use case that target distribution is dense clinical English (pulmonology and thoracic oncology).
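
To make the role of the calibration text concrete, the following is a minimal sketch of how a per-channel activation statistic could be estimated from calibration activations. It is a toy in plain Python, not the AWQ or AutoAWQ implementation: the function name `channel_scales`, the `alpha` default and the `eps` floor are assumptions made for illustration, and real implementations search over the exponent and normalise the resulting scales.

```python
def channel_scales(activations, alpha=0.5, eps=1e-8):
    """Return one scale per input channel from calibration activations.

    activations: list of rows (one per token), each a list of floats
    (one per channel). Hypothetical toy input, not a real model hook.
    """
    if not activations:
        raise ValueError("need at least one token of calibration activations")
    n_tokens = len(activations)
    n_channels = len(activations[0])
    scales = []
    for c in range(n_channels):
        # Average magnitude of channel c over all calibration tokens.
        mean_abs = sum(abs(row[c]) for row in activations) / n_tokens
        # Floor at eps so an all-zero channel does not yield a zero scale.
        scales.append(max(mean_abs, eps) ** alpha)
    return scales


toy = [[1.0, -4.0], [-1.0, 4.0]]  # two tokens, two channels
print(channel_scales(toy))
```

Because the statistic is an average over the calibration tokens, text that does not resemble the deployment domain would shift which channels appear salient, which is the motivation for in-domain calibration data.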

How it was built. For each of 61 centrally authorised European Medicines Agency (EMA) medicines shared with the Polish corpus, the English Product Information PDF (Annex I SmPC) was retrieved from EMA on 2026-06-06 using the documented English PI URL pattern (.../product-information/<epar_slug>-epar-product-information_en.pdf); the brand → EPAR-slug map (manifest.json) is derived directly from the verified Polish corpus source URLs, so no URL is guessed. Extraction (script extract_corpus.py, using PyMuPDF; an English-parametrized copy of the Polish extractor) was restricted to Annex I clinical prose, excluding Annex II/III labelling and leaflet text and the Section 6 pharmaceutical-particulars tail (excipients, shelf-life, marketing-authorisation boilerplate). Text was chunked at a soft window of ~150 words (hard cap 255, min 80), with sentence boundaries preferred, and sampled per drug in proportion to that drug's clinical-prose volume, with a section weighting biased toward clinical-efficacy (SmPC §5.1), pharmacokinetics (§5.2), dosing (§4.2), and special-warnings (§4.4) content. The chunking, sampling and section-weighting logic is byte-for-byte the Polish logic; only language-bound parsing tokens and the explicit fetch step differ. The extraction workflow is reproducible and included in the deposit.
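
The chunking rule described above (soft window of ~150 words, hard cap 255, minimum 80, sentence boundaries preferred) can be sketched as follows. This is an illustration under stated assumptions, not the deposited extract_corpus.py: the naive regex sentence splitter, the choice to drop undersized chunks, and all function names are hypothetical.

```python
import re

SOFT_TARGET = 150  # words; soft window from the description
HARD_CAP = 255     # words; hard upper bound from the description
MIN_WORDS = 80     # words; this sketch drops chunks below it (assumption)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_text(text, soft=SOFT_TARGET, hard=HARD_CAP, minimum=MIN_WORDS):
    """Group whole sentences into chunks of roughly `soft` words."""
    chunks, current = [], []

    def flush():
        # Emit the pending words as a chunk only if it is long enough.
        if len(current) >= minimum:
            chunks.append(" ".join(current))
        current.clear()

    for sentence in split_sentences(text):
        words = sentence.split()
        # A single sentence longer than the hard cap is cut at word level.
        while len(words) > hard:
            flush()
            current.extend(words[:hard])
            flush()
            words = words[hard:]
        # Close the chunk early if this sentence would breach the cap.
        if len(current) + len(words) > hard:
            flush()
        current.extend(words)
        # Close the chunk at a sentence boundary once the soft target is met.
        if len(current) >= soft:
            flush()
    flush()
    return chunks
```

A chunk is closed at the first sentence boundary at or beyond the soft target, so lengths cluster near 150 words while never exceeding the cap.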

Per-chunk provenance schema (every record in corpus.jsonl).

{
  "text": "… (English SmPC clinical prose, ~512 tokens) …",
  "source_authority": "EMA",
  "source_document_type": "SmPC / Product Information",
  "source_url": "https://www.ema.europa.eu/en/documents/product-information/tagrisso-epar-product-information_en.pdf",
  "medicine": "osimertinib",
  "brand_name": "TAGRISSO",
  "language": "en",
  "retrieved_at": "2026-06-06",
  "chunk_id": "EMA_osimertinib_en_0001",
  "license_note": "EMA reproduction policy; source attribution required"
}
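
A reader consuming corpus.jsonl might check each record against the schema shown above. The sketch below uses only the Python standard library; the helper names `validate_record` and `load_corpus` are hypothetical, and the checks are limited to field presence, string type, and the two constant values the record description states for all chunks.

```python
import json

REQUIRED_FIELDS = (
    "text", "source_authority", "source_document_type", "source_url",
    "medicine", "brand_name", "language", "retrieved_at", "chunk_id",
    "license_note",
)


def validate_record(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record
    ]
    for name in REQUIRED_FIELDS:
        if name in record and not isinstance(record[name], str):
            problems.append(f"field is not a string: {name}")
    if record.get("source_authority") != "EMA":
        problems.append("source_authority is not 'EMA'")
    if record.get("language") != "en":
        problems.append("language is not 'en'")
    return problems


def load_corpus(path):
    """Read a newline-delimited JSON file; blank lines are skipped."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                records.append(json.loads(line))
    return records
```

Typical use would be `all(not validate_record(r) for r in load_corpus("corpus.jsonl"))`, assuming the file has been downloaded to the working directory.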

All 412 released chunks have source_authority = "EMA", source_document_type = "SmPC / Product Information" and language = "en". corpus.jsonl SHA-256: 820b7a4de75f20baddf3d89fa73a3d8f348ae22086f1ed52f0278ec6040aac85.
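
The published SHA-256 digest allows a downloaded copy to be checked for integrity. A minimal sketch with the standard-library `hashlib` module follows; the function names and the local file path are assumptions, while the expected digest is the one stated above.

```python
import hashlib

EXPECTED_SHA256 = "820b7a4de75f20baddf3d89fa73a3d8f348ae22086f1ed52f0278ec6040aac85"


def sha256_of_file(path, block_size=1 << 20):
    """Stream the file in blocks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_digest(path, expected=EXPECTED_SHA256):
    """True if the file's SHA-256 equals the expected hex digest."""
    return sha256_of_file(path) == expected.lower()
```

Calling `matches_published_digest("corpus.jsonl")` on an unmodified download should return True; a False result suggests a truncated or altered file.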

Intended use. Calibration data for post-training quantization (AWQ / AutoAWQ / GPTQ) of English-language and multilingual clinical LLMs, and the English side of cross-lingual calibration experiments paired with the Polish corpus; reuse as a corpus-controlled calibration set enabling like-for-like quantization-quality comparison across model scales and languages.

What it is NOT. This is not a training set and not an evaluation / benchmark set; it is not a question-answering, instruction-tuning, or clinical-decision-support dataset and must not be used to fine-tune clinical behaviour or to evaluate clinical accuracy. It contains no patient data and no protected health information (PHI): SmPC documents describe medicinal products (indications, dosing, adverse reactions, pharmacokinetics, aggregate clinical-trial data), not individuals. Every chunk's text is verbatim source text (verified: 412/412 chunks are a contiguous span of their source document after line-cleaning; zero fabricated or paraphrased text), and the absence of PHI was confirmed by an automated pattern scan. The corpus is not clinical advice and confers no clinical authority; the canonical, legally authoritative product information remains the EMA-published SmPC.
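
A contiguous-span check of the kind described above could look like the sketch below. The author's line-cleaning rules are not specified in this record, so collapsing all whitespace runs to single spaces is an assumption, as are the function names and the `(source_key, text)` input shape.

```python
import re


def normalise_whitespace(text):
    """Collapse whitespace runs to one space (assumed stand-in for line-cleaning)."""
    return re.sub(r"\s+", " ", text).strip()


def is_verbatim_span(chunk_text, source_text):
    """True if the chunk occurs as one contiguous span of the source."""
    chunk = normalise_whitespace(chunk_text)
    if not chunk:
        return False  # an empty chunk would trivially match any source
    return chunk in normalise_whitespace(source_text)


def count_verbatim(chunks, sources):
    """chunks: iterable of (source_key, text); sources: dict of key -> full text."""
    return sum(
        1 for key, text in chunks if is_verbatim_span(text, sources.get(key, ""))
    )
```

Any paraphrase, reordering, or inserted word makes the substring test fail, so a count equal to the number of chunks is consistent with fully verbatim text under the assumed cleaning.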

Rights and source attribution. This dataset is a calibration corpus that contains text chunks derived from official medicinal product information documents. Source text is derived from EMA-published English SmPC / Product Information documents. EMA source text: © European Medicines Agency. EMA-published documents are reproduced and distributed under EMA's content-reproduction policy, which permits reproduction and/or distribution, in whole or in part, for non-commercial and commercial purposes, provided that EMA is always acknowledged as the source. Compilation, drug selection, extraction workflow, chunking, dataset structuring and metadata: Łukasz Minarowski / navimed-umb. No claim is made that the underlying SmPC source text is licensed under CC-BY-4.0, CC-BY-NC-4.0, MIT, Apache-2.0 or any other open-source software licence. Users are responsible for preserving EMA source attribution and for checking source-specific restrictions before redistribution or downstream use. The licence on this record is therefore set to "other (attribution)" (EMA public-reproduction policy and source-specific reuse terms); the author's compilation contribution is the author's own work but does not relicense the underlying source text.

Context. Produced for NaviMed-UMB (https://github.com/kicrazom/navimed-umb), a local-LLM benchmarking and clinical-decision-support feasibility project at the Medical University of Białystok. The corpus is published on HuggingFace at mozarcik/clinical-en-smpc-awq-calibration; this deposit makes it independently citable with its own DOI. It is the English cross-language counterpart to the Polish corpus mozarcik/clinical-pl-smpc-awq-calibration (same 61 medicines, English Annex I instead of Polish ChPL).

AI assistance disclosure. Project documentation was prepared with assistance from large language models (Claude, Anthropic; GPT, OpenAI; Gemini, Google). All drug selection, extraction design, provenance assignment and scientific claims are the author's. See AI_USAGE_DISCLOSURE.md in the repository.

Notes

DRAFT — not yet deposited. Corpus text is EMA-derived and is NOT released under CC-BY or any open-source software licence; licence is "other-at" (Other / attribution required: EMA public-reproduction policy + source-specific reuse terms). Legacy Zenodo rejects bare "other"; the sibling PL draft should be reconciled to "other-at" as well. EMA must always be acknowledged as the source. The author's compilation/extraction/metadata are the author's own work. No patient data / no PHI. Not a training or evaluation set — calibration only. No throughput/latency/benchmark numbers are included (those remain under the NaviMed-UMB METHODOLOGY publication embargo). DESIGN DECISION PENDING (owner): PL and EN may be deposited as TWO separate records (this draft) or as ONE bilingual PL+EN record — the PL Zenodo record is still an unpublished draft so combining is still possible. See zenodo-draft.md section 8.A.

Files

Files (629.8 kB)

Name	Size	Checksum
corpus.jsonl	629.8 kB	md5:9f996dc1f40d7045fc9de57906e2e2ca

