KomMKonLLM - Combinatorial test cases for consistency testing of LLMs
Description
This dataset was created using the KomMKonLLM implementation available at https://github.com/KomMKonLLM/KomMKonLLM.
Example dataset and results
This folder contains an example set of questions with labelled correct answers, the synonyms used in the testing process (which form the basis for constructing the input parameter models required to generate a covering array), the resulting covering arrays, and the LLM output for 6 models together with its boolean interpretation.
Models
The following models are used in this example, all accessed through the Ollama meta-model:
- Starling-LM
- Llama 3.1
- Llama 3.2
- Llama 2 Uncensored
- DeepSeek-r1
- Mistral
Questions
The file `public_questions.jsonl` contains 27 questions and the associated correct answer label. It was constructed using a human-guided ChatGPT session.
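As a minimal sketch of how a JSON Lines file such as `public_questions.jsonl` can be read, the snippet below parses one JSON object per line. The field names `question` and `answer` in the sample are hypothetical placeholders; the actual keys in the dataset may differ.

```python
import json
from io import StringIO


def read_jsonl(stream):
    """Parse one JSON object per non-empty line of a text stream."""
    return [json.loads(line) for line in stream if line.strip()]


# Hypothetical two-line sample; the real field names may differ.
sample = StringIO(
    '{"question": "Is water wet?", "answer": true}\n'
    '{"question": "Is the Sun cold?", "answer": false}\n'
)
records = read_jsonl(sample)
print(len(records), records[0])

# With the real file:
# with open("public_questions.jsonl", encoding="utf-8") as f:
#     records = read_jsonl(f)
```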
Test runs
Data regarding the executed test runs is available in `test_runs.csv`.
The first column contains a unique ID that is also utilized in the file names for synonyms and covering arrays described below.
Each row further contains the original sentence text, the date, the correct answer, the model name (which is always OLLAMA in these examples), the note (which contains the actual model) and the strength.
Note that identical sentence texts are contained multiple times in this file, once per model.
Synonyms
Files named `synonyms-<id>.txt` contain the synonyms for the test sentence with the ID `<id>`.
Each line is a JSON list with synonyms for a token (roughly equivalent to a word). A limit of 3 synonyms per token is applied internally and only some types of tokens (such as proper nouns and adjectives) are replaced with synonyms.
The first entry in each line is the token as it appears in the original sentence; the remaining entries are synonyms provided by an NLP library.
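The line format described above could be parsed roughly as follows. This is a sketch on made-up sample lines; it assumes that tokens without replacements appear as single-entry lists and that joining the first entries with spaces approximates the original sentence, neither of which is confirmed by the dataset description.

```python
import json


def parse_synonyms(lines):
    """Return (original_token, alternatives) pairs, one per non-empty line."""
    parsed = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        options = json.loads(line)  # each line is a JSON list of strings
        parsed.append((options[0], options[1:]))
    return parsed


# Made-up sample in the described format (not taken from the dataset).
sample_lines = ['["big", "large", "huge"]', '["city", "town"]', '["is"]']
tokens = parse_synonyms(sample_lines)

# Assumption: the first entries, joined by spaces, approximate the original sentence.
original_sentence = " ".join(token for token, _ in tokens)
print(tokens)
print(original_sentence)
```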
Covering Arrays
The generated covering arrays (at strength 2 in this example) are available in files called `ca-<id>.csv`. They are based on the associated synonym list such that the n-th column in `ca-<id>.csv` refers to the n-th row in `synonyms-<id>.txt` and each row in the CA file forms one test case.
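To illustrate the column-to-row mapping, the sketch below turns covering array rows into mutated sentences and checks pairwise (strength 2) coverage. It assumes that each CA cell is a zero-based index into the corresponding synonym list and that the CSV has no header row; the actual encoding in `ca-<id>.csv` may differ. The sample data is invented.

```python
import csv
from io import StringIO
from itertools import combinations, product


def build_test_cases(ca_rows, synonym_lists):
    """Map each CA row to a sentence: column n selects from synonym list n."""
    cases = []
    for row in ca_rows:
        words = [synonym_lists[col][int(cell)] for col, cell in enumerate(row)]
        cases.append(" ".join(words))
    return cases


def covers_all_pairs(ca_rows, synonym_lists):
    """Check strength 2: every value pair of every two columns occurs in some row."""
    rows = [[int(cell) for cell in row] for row in ca_rows]
    for i, j in combinations(range(len(synonym_lists)), 2):
        needed = set(product(range(len(synonym_lists[i])),
                             range(len(synonym_lists[j]))))
        seen = {(row[i], row[j]) for row in rows}
        if not needed <= seen:
            return False
    return True


# Invented example: three tokens with two options each.
synonym_lists = [["big", "large"], ["city", "town"], ["grows", "expands"]]
ca_rows = list(csv.reader(StringIO("0,0,0\n0,1,1\n1,0,1\n1,1,0\n")))

cases = build_test_cases(ca_rows, synonym_lists)
print(cases)
print(covers_all_pairs(ca_rows, synonym_lists))
```

In this toy example, four rows cover all value pairs of three two-valued parameters, whereas exhaustive testing would need eight.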
Queries
The resulting queries (i.e. mutated test sentences) are listed in `queries.csv`. This file contains the full prompt submitted to the LLM (note that the `sentence_id` in this file refers to the `id` in `test_runs.csv`; this can be used to identify the concrete LLM that was queried), the LLM's response, and the response interpreted as a boolean via our oracle.
In each of these rows, the value of the `correct_answer_label` column should be compared to the `result` column to evaluate whether the LLM's response was correct.
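The comparison described above, together with the lookup of the concrete model via `test_runs.csv`, might look like the following sketch. The column names `sentence_id`, `id`, `correct_answer_label` and `result` come from the description; the column name `note`, the string encoding of booleans, and the sample rows are assumptions. Unparsed (empty) results are counted as incorrect here.

```python
import csv
from collections import defaultdict
from io import StringIO

# Invented sample data; the real files contain more columns.
test_runs_csv = "id,note\n1,mistral\n2,llama3.1\n"
queries_csv = (
    "sentence_id,correct_answer_label,result\n"
    "1,True,True\n"
    "1,True,False\n"
    "2,True,True\n"
    "2,True,\n"
)

# Map each test run ID to the concrete model (assumed to be in a `note` column).
model_by_run = {
    row["id"]: row["note"] for row in csv.DictReader(StringIO(test_runs_csv))
}

correct = defaultdict(int)
total = defaultdict(int)
for row in csv.DictReader(StringIO(queries_csv)):
    model = model_by_run[row["sentence_id"]]
    total[model] += 1
    if row["result"] == row["correct_answer_label"]:
        correct[model] += 1

accuracy = {model: correct[model] / total[model] for model in total}
print(accuracy)
```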
Results
This section visualizes the results of our analysis, which was performed using the JupyterLab notebook contained in this project.
Parsed LLM responses
The chart in llm-responses-parsed.png shows the parsed LLM responses across all test runs.
Responses that could not be parsed as a boolean are listed as "undefined". In general, this means that the LLM did not return a JSON boolean. Some models tend to print significantly more complex data structures or add a textual explanation instead of following the prompt.
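A strict interpretation of a response as a JSON boolean can be sketched as below. This is not the project's actual oracle, only an illustration of the idea; the function name `interpret_response` is hypothetical, and `None` stands in for "undefined".

```python
import json


def interpret_response(text):
    """Return True/False for a bare JSON boolean, otherwise None ("undefined")."""
    try:
        value = json.loads(text.strip())
    except json.JSONDecodeError:
        return None  # free text, truncated JSON, empty output, ...
    # Reject valid JSON that is not a boolean (objects, numbers, strings).
    return value if isinstance(value, bool) else None


print(interpret_response("true"))
print(interpret_response('{"answer": true}'))
print(interpret_response("Yes, because the sky is blue."))
```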
Precision
The chart precision.png visualizes the precision of each model in this analysis.
The precision is defined as the ratio of true positives over the sum of true positives and false positives.
Recall
The file recall.png shows the recall of the models contained in our evaluation.
The recall is the ratio of true positives over the sum of true positives and false negatives.
F1 score
The image f1.png contains the F1 scores of all models under test, defined as 2*(precision*recall)/(precision + recall).
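The three metrics can be computed from confusion counts as in the sketch below, using the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 as their harmonic mean. How the analysis notebook treats "undefined" responses is not stated here; this example simply assumes they have been filtered out beforehand. The sample pairs are invented.

```python
def confusion_counts(pairs):
    """Count (tp, fp, fn, tn) from (expected, predicted) boolean pairs."""
    tp = fp = fn = tn = 0
    for expected, predicted in pairs:
        if predicted and expected:
            tp += 1
        elif predicted and not expected:
            fp += 1
        elif not predicted and expected:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1); each falls back to 0.0 on a zero denominator."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * (precision * recall) / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Invented (expected, predicted) pairs.
pairs = [(True, True), (True, True), (True, False), (False, True), (False, False)]
tp, fp, fn, tn = confusion_counts(pairs)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, fp, fn, tn)
print(precision, recall, f1)
```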
Consistency
The chart in consistency.png visualizes the consistency of parsed LLM responses within each test run.
The consistency measures the ratio of responses to mutated queries (i.e. those where parts have been replaced with synonyms) that are identical to the response of the original query.
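For a single test run, this ratio could be computed as in the following sketch. It assumes that an unparsed response is represented as `None` and counts as different from a boolean original response; the notebook may handle this case differently.

```python
def consistency(original_response, mutated_responses):
    """Share of mutated-query responses equal to the original query's response."""
    if not mutated_responses:
        return None  # undefined when there are no mutated queries
    same = sum(1 for response in mutated_responses if response == original_response)
    return same / len(mutated_responses)


# Invented example: two of four mutated queries agree with the original answer.
score = consistency(True, [True, True, False, None])
print(score)
```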
Additional details
Dates
- Available: 2025-04-14