Replication Package for: Evaluating LLMs on Medical Diagnostics Leveraging a Golden Model & Combinatorial Testing

Perko, Alexander; Mujić, Emir; Nica, Iulia; Wotawa, Franz

doi:10.5281/zenodo.17619761

Published November 15, 2025 | Version v1

Software Open

1. Graz University of Technology

Replication package for "Evaluating LLMs on Medical Diagnostics Leveraging a Golden Model & Combinatorial Testing", which is an extension of [5, 6, 7]. This package provides the data collected during the experiments for the corresponding paper. The experiments were performed using ChatGPT [1] and GPT-4o [2, 3], which are evaluated against NetDoktor's "Symptom-Checker" [4]. Note that these models are closed-source and may be updated by their vendors at any time. Because of this, and because of the non-deterministic nature of LLMs, results may differ when the experiments are re-run. Finally, we provide the code for aggregating data from [4] so that it can be leveraged as a golden model.

[1] OpenAI (2023). ChatGPT. Online: chat.openai.com/chat.
[2] OpenAI (2023). GPT-4 technical report. arXiv:2303.08774.
[3] OpenAI (2024). Introducing GPT-4o and more tools to ChatGPT free users. Online: https://openai.com/index/gpt-4o-and-more-tools-to-chatgpt-free.
[4] Richter et al. (2024). Symptom-Checker. Online: www.netdoktor.at/symptom-checker.
[5] Perko et al. (2024). Testing ChatGPT's Performance on Medical Diagnostic Tasks.
[6] Perko et al. (2024). Using Combinatorial Testing for Prompt Engineering of LLMs in Medicine.
[7] Mujić et al. (2025). Extraction of Knowledge Representations for Reasoning from Medical Questionnaires.

Files

Files (1.6 MB)

Name	Size	MD5
13765132.zip	646.5 kB	5434563f543dac11c49430855bd6c4f1
13765346.zip	972.2 kB	503d033f2ae7e65cc1d02cc0be8cabd5
golden_model_aggregator.zip	15.2 kB	b2b67e2a886d9e0c7d6d0d5992ffd6b6
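
The MD5 checksums listed for the three archives can be used to confirm that a download is intact before re-running any analysis. The following is a minimal sketch, not part of the package itself: the function names md5_of_file and verify_downloads are hypothetical, and it assumes the archives were saved under their listed names in a single directory.

```python
import hashlib
from pathlib import Path

# MD5 digests copied from the file list of this record.
EXPECTED_MD5 = {
    "13765132.zip": "5434563f543dac11c49430855bd6c4f1",
    "13765346.zip": "503d033f2ae7e65cc1d02cc0be8cabd5",
    "golden_model_aggregator.zip": "b2b67e2a886d9e0c7d6d0d5992ffd6b6",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory, expected=EXPECTED_MD5):
    """Map each archive name to 'ok', 'mismatch', or 'missing'."""
    results = {}
    for name, want in expected.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of_file(path) == want:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results


if __name__ == "__main__":
    # Assumption: the archives were saved to the current directory.
    for name, status in verify_downloads(".").items():
        print(f"{name}: {status}")
```

A "mismatch" status suggests a corrupted or incomplete download; fetching the archive again is usually the simplest remedy.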

Additional details

Funding: European Commission
ChatMED - Bridging Research Institutions to Catalyze Generative AI Adoption by the Health Sector in the Widening Countries (grant no. 101159214)
