Published November 15, 2025 | Version v1
Software Open

Replication Package for: Evaluating LLMs on Medical Diagnostics Leveraging a Golden Model & Combinatorial Testing

Description

Replication package for "Evaluating LLMs on Medical Diagnostics Leveraging a Golden Model & Combinatorial Testing", which is an extension of [5, 6, 7]. This package provides the data collected during the experiments for the corresponding paper. The experiments were performed using ChatGPT [1] and GPT-4o [2, 3], which are evaluated against NetDoktor's "Symptom-Checker" [4]. Note that these models are closed-source and might be updated by the vendors at any time. Because of this, and because of the non-deterministic nature of LLMs, results might differ on re-runs of the experiments. Finally, we provide the code for aggregating data from [4] so that it can be leveraged as a golden model.

[1] OpenAI (2023). ChatGPT. Online: chat.openai.com/chat.
[2] OpenAI (2023). GPT-4 technical report. arXiv: 2303.08774.
[3] OpenAI (2024). Introducing GPT-4o and more tools to ChatGPT free users. Online: https://openai.com/index/gpt-4o-and-more-tools-to-chatgpt-free.
[4] Richter et al. (2024). Symptom-Checker. Online: www.netdoktor.at/symptom-checker.
[5] Perko et al. (2024). Testing ChatGPT's Performance on Medical Diagnostic Tasks.
[6] Perko et al. (2024). Using Combinatorial Testing for Prompt Engineering of LLMs in Medicine.
[7] Mujic et al. (2025). Extraction of Knowledge Representations for Reasoning from Medical Questionnaires.

Files

13765132.zip

Files (1.6 MB)

Size, checksum
646.5 kB, md5:5434563f543dac11c49430855bd6c4f1
972.2 kB, md5:503d033f2ae7e65cc1d02cc0be8cabd5
15.2 kB, md5:b2b67e2a886d9e0c7d6d0d5992ffd6b6
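
The MD5 checksums listed above can be used to confirm that downloaded files arrived intact. The following is a minimal sketch, not part of the replication package itself: the helper names md5_of and verify_directory are hypothetical, and because the file names are not shown in the listing, files are matched by digest rather than by name.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing. The listing does not show file
# names, so membership in this set is the only check performed here.
LISTED_MD5 = {
    "5434563f543dac11c49430855bd6c4f1",
    "503d033f2ae7e65cc1d02cc0be8cabd5",
    "b2b67e2a886d9e0c7d6d0d5992ffd6b6",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_directory(directory, expected=LISTED_MD5):
    """Map each regular file in `directory` to True if its digest is in `expected`."""
    return {
        path.name: md5_of(path) in expected
        for path in sorted(Path(directory).iterdir())
        if path.is_file()
    }
```

Calling verify_directory on the folder that holds the downloaded files returns a dictionary from file name to True or False; any False entry points to a corrupted download or to a file that is not part of this record.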

Additional details

Funding

European Commission
ChatMED - Bridging Research Institutions to Catalyze Generative AI Adoption by the Health Sector in the Widening Countries (grant no. 101159214)