Replication Package for: Evaluating LLMs on Medical Diagnostics Leveraging a Golden Model & Combinatorial Testing
Description
Replication package for "Evaluating LLMs on Medical Diagnostics Leveraging a Golden Model & Combinatorial Testing", which is an extension of [5, 6, 7]. This package provides the data collected during the experiments for the corresponding paper. The experiments were performed using ChatGPT [1] and GPT-4o [2, 3], which are evaluated against NetDoktor's "Symptom-Checker" [4]. Note that these models are closed-source and might be updated by the vendors at any time. Because of this, and because of the non-deterministic nature of LLMs, results might differ when the experiments are re-run. Finally, we provide the code for aggregating data from [4], so that it can be used as a golden model.
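Since results can vary between re-runs, one way to compare LLM answers with a golden model is to report, per test case, the fraction of runs that agree with the golden model's diagnosis. The following is a minimal sketch of that idea only. The function name `agreement_rate`, the dictionary layout, and the toy cases are hypothetical and are not taken from the package's code or data format.

```python
def agreement_rate(golden, runs):
    """Per case, return the fraction of runs whose answer matches the golden model.

    golden: dict mapping a case id to the golden model's diagnosis (assumed layout).
    runs:   list of dicts, one per experiment run, mapping case id to the LLM answer.
    """
    rates = {}
    for case_id, expected in golden.items():
        # Collect the answers of all runs that contain this case.
        answers = [run[case_id] for run in runs if case_id in run]
        if not answers:
            continue  # no run covered this case, so no rate can be computed
        # Compare case-insensitively, ignoring surrounding whitespace.
        matches = sum(a.strip().lower() == expected.strip().lower() for a in answers)
        rates[case_id] = matches / len(answers)
    return rates


# Hypothetical toy data, for illustration only.
golden = {"case-1": "Migraine", "case-2": "Influenza"}
runs = [
    {"case-1": "migraine", "case-2": "Common cold"},
    {"case-1": "Migraine", "case-2": "Influenza"},
]
rates = agreement_rate(golden, runs)
print(rates)
```

Exact string matching is a deliberate simplification here. Free-text LLM answers would likely need normalization or manual mapping to diagnosis labels before such a comparison is meaningful.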
[1] OpenAI (2023). ChatGPT. Online: chat.openai.com/chat.
[2] OpenAI (2023). GPT-4 technical report. arXiv:2303.08774.
[3] OpenAI (2024). Introducing GPT-4o and more tools to ChatGPT free users. Online: https://openai.com/index/gpt-4o-and-more-tools-to-chatgpt-free.
[4] Richter et al. (2024). Symptom-Checker. Online: www.netdoktor.at/symptom-checker.
[5] Perko et al. (2024). Testing ChatGPT's Performance on Medical Diagnostic Tasks.
[6] Perko et al. (2024). Using Combinatorial Testing for Prompt Engineering of LLMs in Medicine.
[7] Mujic et al. (2025). Extraction of Knowledge Representations for Reasoning from Medical Questionnaires.