Published April 14, 2025 | Version 1.0.0
Dataset Open

KomMKonLLM - Combinatorial test cases for consistency testing of LLMs

Description

This dataset was created using the KomMKonLLM implementation available at https://github.com/KomMKonLLM/KomMKonLLM.

Example dataset and results

This folder contains an example set of questions with labelled correct answers, the synonyms used in the testing process (which form the basis of the input parameter models required to generate a covering array), the resulting covering arrays, and the LLM output for six models together with its boolean interpretation.

Models

The following models are used in this example, all of which are accessed through the Ollama meta-model:

  • Starling-LM
  • Llama 3.1
  • Llama 3.2
  • Llama 2 Uncensored
  • DeepSeek-r1
  • Mistral

Questions

The file `public_questions.jsonl` contains 27 questions, each with its correct answer label. It was constructed using a human-guided ChatGPT session.

Test runs

Data regarding the executed test runs is available in `test_runs.csv`.
The first column contains a unique ID that is also utilized in the file names for synonyms and covering arrays described below.
Each row further contains the original sentence text, the date, the correct answer, the model name (which is always OLLAMA in these examples), the note (which contains the actual model) and the strength.
Note that identical sentence texts are contained multiple times in this file, once per model.

Synonyms

Files named `synonyms-<id>.txt` contain the synonyms for the test sentence with the ID `<id>`.
Each line is a JSON list of synonyms for one token (roughly equivalent to a word). Internally, a limit of 3 synonyms per token is applied, and only some types of tokens (such as proper nouns and adjectives) are replaced with synonyms.
The first entry in each line is the token as it appears in the original sentence; the other entries are provided by an NLP library.
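
The line format described above can be read with a few lines of Python. This is a minimal sketch, not part of the KomMKonLLM implementation: the function name `parse_synonym_lines` and the sample tokens are hypothetical, and the input is given as in-memory strings so that the example runs without the dataset files.

```python
import json


def parse_synonym_lines(lines):
    """Parse one JSON list per non-empty line.

    Entry 0 of each list is taken to be the original token,
    the remaining entries are its synonyms.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entries = json.loads(line)
        if not isinstance(entries, list) or not entries:
            raise ValueError(f"expected a non-empty JSON list, got: {line!r}")
        rows.append(entries)
    return rows


# Hypothetical sample content in the shape of a synonyms-<id>.txt file.
sample_lines = ['["quick", "fast", "rapid"]', '["dog", "hound"]']
synonym_rows = parse_synonym_lines(sample_lines)
original_tokens = [row[0] for row in synonym_rows]
```

To read an actual file, the same function can be applied to an open file handle, since iterating over a text file yields its lines.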

Covering Arrays

The generated covering arrays (at strength 2 in this example) are available in files called `ca-<id>.csv`. They are based on the associated synonym list such that the n-th column in `ca-<id>.csv` refers to the n-th row in `synonyms-<id>.txt` and each row in the CA file forms one test case.
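
The column-to-row mapping can be illustrated as follows. This sketch rests on assumptions that the description does not confirm: that each cell of the CA file is a zero-based integer index into the corresponding synonym list, and that the file has no header row. The function name `build_test_cases` and the toy data are hypothetical, and the three toy rows are not a complete strength-2 covering array.

```python
import csv
import io


def build_test_cases(ca_text, synonym_rows):
    """Turn each CA row into a list of tokens.

    Assumption: the n-th cell of a CA row is a zero-based index
    into the n-th synonym list.
    """
    cases = []
    for row in csv.reader(io.StringIO(ca_text)):
        if not row:
            continue  # skip empty lines
        if len(row) != len(synonym_rows):
            raise ValueError("CA row width does not match number of synonym lists")
        cases.append(
            [synonym_rows[col][int(cell)] for col, cell in enumerate(row)]
        )
    return cases


# Hypothetical toy data: two tokens with 3 and 2 alternatives.
synonym_rows = [["quick", "fast", "rapid"], ["dog", "hound"]]
ca_text = "0,0\n1,1\n2,0\n"
test_cases = build_test_cases(ca_text, synonym_rows)
```

Each resulting list holds the replacement tokens for one test case; tokens that are not replaced by synonyms are not represented here.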

Queries

The resulting queries (i.e. mutated test sentences) are listed in `queries.csv`. This file contains the full prompt submitted to the LLM, the LLM's response, and the response interpreted as a boolean via our oracle. The `sentence_id` in this file refers to the `id` in `test_runs.csv`, which can be used to identify the concrete LLM that was queried.
In each of these rows, the value of the `correct_answer_label` column should be compared to the `result` column to evaluate whether the LLM's response was correct.
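
The comparison described above might look like the following sketch. The column names `sentence_id`, `correct_answer_label` and `result` are taken from the description; how the values are encoded is an assumption (here: the strings `true` and `false`, with anything else counted as incorrect), and the sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def accuracy_by_sentence(csv_text):
    """Fraction of rows per sentence_id where result matches the label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        sentence_id = row["sentence_id"]
        expected = row["correct_answer_label"].strip().lower()
        actual = row["result"].strip().lower()
        total[sentence_id] += 1
        # Assumption: only "true"/"false" are valid parsed results.
        if actual in {"true", "false"} and actual == expected:
            correct[sentence_id] += 1
    return {sid: correct[sid] / total[sid] for sid in total}


# Hypothetical rows in the shape of queries.csv (other columns omitted).
sample_csv = """sentence_id,correct_answer_label,result
1,true,true
1,true,false
2,false,false
2,false,undefined
"""
accuracy = accuracy_by_sentence(sample_csv)
```

Because `sentence_id` maps to one row of `test_runs.csv`, grouping by it also groups by model.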

Results

This section visualizes the results of our analysis, which was performed using the JupyterLab notebook contained in this project.

Parsed LLM responses

The chart in llm-responses-parsed.png shows the parsed LLM responses across all test runs.
Responses that could not be parsed as a boolean are listed as "undefined". In general, this means that the LLM did not return a JSON boolean. Some models tend to print significantly more complex data structures or add a textual explanation instead of following the prompt.
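
A strict interpretation of a response as a JSON boolean can be sketched as below. This is an illustrative stand-in, not the project's oracle, whose exact rules are not described here; the function name `parse_boolean_response` is hypothetical, and `None` is used to represent "undefined".

```python
import json


def parse_boolean_response(text):
    """Return True/False for a bare JSON boolean, otherwise None."""
    try:
        value = json.loads(text.strip())
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    # Objects, lists, numbers and strings all count as undefined.
    return value if isinstance(value, bool) else None


# Hypothetical responses, including the failure modes mentioned above.
examples = ["true", " false\n", '{"answer": true}', "Yes, because it is."]
parsed = [parse_boolean_response(e) for e in examples]
```

The `isinstance` check is needed because `json.loads` also accepts valid JSON that is not a boolean, such as a number or an object wrapping the answer.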

Precision

The chart precision.png visualizes the precision of each model in this analysis.
The precision is defined as the ratio of true positives over the sum of true positives and false positives.

Recall

The file recall.png shows the recall of the models contained in our evaluation.
The recall is the ratio of true positives over the sum of true positives and false negatives.

F1 score

The image f1.png contains the F1 scores of all models under test, defined as 2*(precision*recall)/(precision + recall).
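
For reference, the three metrics can be computed from labelled and parsed answers as in the sketch below, using the standard definitions (precision is TP / (TP + FP), recall is TP / (TP + FN)). Two choices here are assumptions rather than documented behaviour of the analysis notebook: undefined responses (represented as `None`) are skipped, and a metric with a zero denominator is reported as 0.0. The function names and sample data are hypothetical.

```python
def confusion_counts(pairs):
    """Count TP, FP, TN, FN from (expected, predicted) boolean pairs."""
    tp = fp = tn = fn = 0
    for expected, predicted in pairs:
        if predicted is None:
            continue  # assumption: undefined responses are ignored
        if predicted and expected:
            tp += 1
        elif predicted and not expected:
            fp += 1
        elif not predicted and not expected:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def precision_recall_f1(tp, fp, fn):
    """Standard definitions; 0.0 is returned when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * (precision * recall) / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical (expected, predicted) pairs; None marks an undefined response.
pairs = [(True, True), (True, True), (True, False),
         (False, True), (False, False), (True, None)]
tp, fp, tn, fn = confusion_counts(pairs)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
```

Whether undefined responses should instead count as errors is a design decision that changes all three values, so it is worth stating explicitly when reporting results.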

Consistency

The chart in consistency.png visualizes the consistency of parsed LLM responses within each test run.
Consistency is the fraction of responses to mutated queries (i.e. those where parts have been replaced with synonyms) that are identical to the response to the original query.
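
As a minimal sketch of this measure for a single test run, assuming the parsed response to the original query is known separately and that undefined responses are represented as `None` (both assumptions; the function name `consistency` is hypothetical):

```python
def consistency(original_response, mutated_responses):
    """Fraction of mutated-query responses equal to the original response."""
    if not mutated_responses:
        raise ValueError("at least one mutated response is required")
    matches = sum(1 for r in mutated_responses if r == original_response)
    return matches / len(mutated_responses)


# Hypothetical run: the original query was answered True.
score = consistency(True, [True, False, True, None])
```

Note that this measure is independent of correctness: a model that answers every variant wrongly in the same way is still fully consistent.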

Notes (English)

This project is funded by Internet Privatstiftung Austria (Internet Foundation Austria) through its netidee programme under the title "KomMKonLLM: Kombinatorische Methoden für Konsistenztests von Large Language Models". See https://www.netidee.at/kommkonllm for details.

Files

Total size: 2.1 MB

Additional details

Dates

Available
2025-04-14