Published April 15, 2025 | Version 1.0.2
Dataset Open

XAI-FUNGI: Dataset from the user study on comprehensibility of XAI algorithms

  • 1. Uniwersytet Jagielloński w Krakowie
  • 2. Jagiellonian University
  • 3. Uniwersytet Jagiellonski w Krakowie

Description

We present a dataset created during a user study evaluating the explainability of artificial intelligence (AI), carried out at the Jagiellonian University as a collaboration between the computer science (GEIST team) and information sciences research groups. The main goal of the research was to explore how explanations of AI model patterns can be communicated effectively to diverse audiences.

The dataset contains material collected from 39 participants during interviews conducted by the information sciences research group. The participants were recruited from 149 candidates to form three groups: domain experts in mycology (DE), students with a data science and visualization background (IT), and students from social sciences and humanities (SSH). Each group was given explanations of a machine learning model trained to classify mushrooms as edible or non-edible, and was asked to interpret the explanations and answer questions during the interview. The machine learning model and the explanations of its decisions were prepared by the computer science research team.

The resulting dataset was constructed from the surveys obtained from the candidates, the anonymized transcripts of the interviews, the results of the thematic analysis, and the original explanations with modifications suggested by the participants. The dataset is complemented by source code that allows one to reproduce the initial machine learning model and explanations.

The general structure of the dataset is described in the following table. Files whose names contain [RR]_[SS]_[NN] hold the individual results obtained from a particular participant. The parts of this identifier have the following meaning:

  • RR - initials of the researcher conducting the interview,
  • SS - type of the participant (DE for domain expert, SSH for social sciences and humanities students, or IT for computer science students),
  • NN - number of the participant.
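The naming convention above can be parsed programmatically, for example when iterating over the files in TRANSCRIPTS.zip. The following is a minimal sketch only: the regular expression, the names ParticipantId and parse_participant_id, and the sample ID "AB_DE_01" are illustrative assumptions, and the exact form of the initials and numbers in the real files may differ.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern: uppercase initials, one of the three group codes, digits.
_ID_PATTERN = re.compile(r"^(?P<rr>[A-Z]+)_(?P<ss>DE|SSH|IT)_(?P<nn>\d+)$")


class ParticipantId(NamedTuple):
    researcher: str  # RR: initials of the researcher conducting the interview
    group: str       # SS: DE, SSH, or IT
    number: int      # NN: number of the participant


def parse_participant_id(name: str) -> Optional[ParticipantId]:
    """Parse '[RR]_[SS]_[NN]', optionally followed by a file extension."""
    stem = name.rsplit(".", 1)[0] if "." in name else name
    match = _ID_PATTERN.match(stem)
    if match is None:
        return None  # not a per-participant file, e.g. SURVEY.csv
    return ParticipantId(match["rr"], match["ss"], int(match["nn"]))


# Hypothetical file name, used only to show the call.
example = parse_participant_id("AB_DE_01.csv")
```

Names that do not follow the convention yield None, so the same function can be used to filter per-participant files out of a mixed listing.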

 

File Description
SURVEY.csv The results of a survey filled in by 149 candidates, of whom 39 were selected to form the final group of participants.
SURVEY_en.csv Content of SURVEY.csv translated into English.
CODEBOOK.csv The codebook used in the thematic analysis and MAXQDA coding.
QUESTIONS.csv List of questions that the participants were asked during the interviews.
SLIDES.csv List of slides used in the study, with their interpretation and references to the MAXQDA themes and the VISUALIZATION_MODIFICATIONS table.
MAXQDA_SUMMARY.csv Summary of the thematic analysis for each participant, using the codes from CODEBOOK.csv.
PROBLEMS.csv List of problems that the participants were asked to solve during the interviews. They correspond to three instances from the dataset that the participants had to classify using the knowledge gained from the explanations.
PROBLEMS_en.csv Content of PROBLEMS.csv translated into English.
PROBLEMS_RESPONSES.csv The responses of each participant to the problems listed in PROBLEMS.csv.
VISUALIZATION_MODIFICATIONS.csv Information on how each participant modified the order of the slides, which slides (explanations) were removed, and what kind of additional explanation was suggested.
ORIGINAL_VISUZALIZATIONS.pdf The PDF file containing the visualizations of explanations presented to the participants during the interviews.
ORIGINAL_VISUZALIZATIONS_EN.pdf Content of ORIGINAL_VISUZALIZATIONS.pdf translated into English.
VISUALIZATION_MODIFICATIONS.zip An archive of PDF files containing the original slides from ORIGINAL_VISUZALIZATIONS.pdf with the modifications suggested by each participant. Each file is a PDF named after the participant ID, i.e. [RR]_[SS]_[NN].pdf.
TRANSCRIPTS.zip The anonymized transcripts of the interviews, one per participant, zipped into one archive. Each transcript is named after the participant ID, i.e. [RR]_[SS]_[NN].csv, and contains text tagged with the slide number it relates to, the question number from QUESTIONS.csv, and the problem number from PROBLEMS.csv.

The detailed structure of the files listed in the table above is given in the Technical info section.

The source code used to train the ML model and to generate the explanations is available on GitLab (https://gitlab.geist.re/pro/xai-fungi).

 

Technical info

The following sections describe the records in the particular files. We describe only the CSV files, which have a tabular structure. The content of the PDF files is self-explanatory and needs no additional description.

SURVEY

Column name Column description
candidate_id Unique ID of a survey participant. Note that this ID is not used in any other file. Use participant_id instead to join records from the CSV files.
[set of columns corresponding to survey content] These columns are self-explanatory: the column title contains the question the participant was asked, and the cell contains the answer given. The set also includes survey metadata, such as the time the participant spent filling in the survey and the dates on which the survey was filled in.
participant_id The participant ID, a unique identifier of an individual who participated in the interviews. Use this column to join records from other tables.
comment Additional comments about participants. In the current version these comments only indicate which candidates took part in the pilot study and were not included in the final dataset.
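As a rough illustration of joining on participant_id, the sketch below attaches survey rows to rows of PROBLEMS_RESPONSES using only the Python standard library. The inline sample data is invented, and it is an assumption that participant_id is left empty for candidates who were not interviewed and that the files are comma-separated; only the column names are taken from the tables in this section.

```python
import csv
import io

# Tiny in-memory stand-ins for SURVEY.csv and PROBLEMS_RESPONSES.csv.
# All values are invented for illustration.
SURVEY_SAMPLE = """candidate_id,participant_id,comment
1,AB_DE_01,
2,,
3,AB_IT_02,
"""

RESPONSES_SAMPLE = """problem_id,participant_id,prediction_decision_en
1,AB_DE_01,edible
2,AB_DE_01,poisonous
1,AB_IT_02,edible
"""


def index_by_participant(survey_file):
    """Map participant_id -> survey row, skipping rows without an ID."""
    return {
        row["participant_id"]: row
        for row in csv.DictReader(survey_file)
        if row["participant_id"]  # assumed empty for candidates not interviewed
    }


def join_responses(survey_file, responses_file):
    """Attach the survey row to every response with a matching participant_id."""
    survey = index_by_participant(survey_file)
    joined = []
    for response in csv.DictReader(responses_file):
        survey_row = survey.get(response["participant_id"])
        if survey_row is not None:
            joined.append({**survey_row, **response})
    return joined


rows = join_responses(io.StringIO(SURVEY_SAMPLE), io.StringIO(RESPONSES_SAMPLE))
```

With the real files, the io.StringIO objects would be replaced by files opened with open(..., newline="", encoding=...), where the encoding and delimiter need to be checked against the actual data.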

CODEBOOK

Column name Column description
code The name of a code used in the thematic analysis, with the hierarchy encoded in the name. The > symbol separates hierarchy levels. For instance, Aesthetics > layout is a code on the second level of the hierarchy, with Aesthetics as its parent.
memo The meaning of a given code.
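The hierarchy encoded in code names such as Aesthetics > layout can be unpacked with a few lines of Python. This is a sketch under the assumption that ">" never occurs inside a level name; the function names split_code and parent_of are illustrative and not part of the dataset tooling.

```python
from typing import List, Optional


def split_code(code: str) -> List[str]:
    """Split a hierarchical code into its levels, trimming whitespace."""
    return [level.strip() for level in code.split(">")]


def parent_of(code: str) -> Optional[str]:
    """Return the parent code, or None for a top-level code."""
    levels = split_code(code)
    if len(levels) < 2:
        return None
    return " > ".join(levels[:-1])


levels = split_code("Aesthetics > layout")
parent = parent_of("Aesthetics > layout")
```

The depth of a code in the hierarchy is then simply len(split_code(code)).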

SLIDES

Column name Column description
slide_id Unique ID of a slide, used to join TRANSCRIPTS with the other tables that use slide IDs
maxqda_theme The name of the explanation type used in the thematic analysis and present in the columns of MAXQDA_SUMMARY.csv
slide_name Slide name used in VISUALIZATION_MODIFICATIONS.csv
comment Explanation of the content of the slide

MAXQDA_SUMMARY

Column name Column description
code Name of the code, which can be matched with CODEBOOK.csv
[list of columns corresponding to a particular type of explanation, e.g., LIME, descriptive statistics, etc.] Number of occurrences of a given code in a given type of explanation
participant_id ID of the participant for whom the summary was prepared
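Because MAXQDA_SUMMARY holds one row per code and participant, a common first step is to total the occurrences of each code across participants for every explanation type. The sketch below shows one way to do that. The sample rows are invented, the two explanation-type column names are borrowed from the examples mentioned in the table above, and treating empty cells as zero is an assumption.

```python
import csv
import io
from collections import defaultdict

# Invented sample in the shape described for MAXQDA_SUMMARY.csv.
SUMMARY_SAMPLE = """code,LIME,descriptive statistics,participant_id
Aesthetics > layout,2,0,AB_DE_01
Aesthetics > layout,1,3,AB_IT_02
Aesthetics,0,1,AB_IT_02
"""

KEY_COLUMNS = ("code", "participant_id")


def total_counts(summary_file):
    """Sum code occurrences over participants, per explanation-type column."""
    totals = defaultdict(lambda: defaultdict(int))
    for row in csv.DictReader(summary_file):
        for column, value in row.items():
            if column in KEY_COLUMNS:
                continue
            totals[row["code"]][column] += int(value or 0)  # empty cell -> 0 (assumed)
    return {code: dict(counts) for code, counts in totals.items()}


totals = total_counts(io.StringIO(SUMMARY_SAMPLE))
```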

PROBLEMS

Column name Column description
problem_id Unique ID of the problem that can be used to join the table with other tables
[set of features for a given mushroom used in a particular problem] Feature values of the mushroom to be classified
model_class Class returned by the machine learning model
model_probability Probability assigned by the machine learning model to the prediction

PROBLEMS_RESPONSES

Column name Column description
problem_id Unique ID of the problem that can be used to join the responses with PROBLEMS table and other tables.
participant_id Unique ID of a given participant that can be used as a key to join with other tables
prediction_decision The class assigned by the participant to the given problem
prediction_decision_en English translation of the prediction decision given by the participant
prediction_certainty The certainty of the participant's decision
prediction_certainty_en English translation of a prediction certainty given by the participant

VISUALIZATION_MODIFICATIONS

Column name Column description
participant_id Unique ID of the participant that can be used to join the table with other tables.
slide_id ID of a slide used in other tables
original_order The original order of a given slide in the presentation. For custom slides that were added to the presentation by the participant, this field is empty.
new_order The order (place in the presentation) assigned to the slide by the participant.
slide_name Symbolic name of the slide
modification The type of modification suggested by the participant, where 0 means no modification, 1 means removal, and 2 means addition of a custom slide
details Details of a custom slide added to the presentation. For other slides this field is empty.
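The columns above are enough to reconstruct the presentation as a participant rearranged it. The following sketch drops removed slides (modification 1) and sorts the rest by new_order. The sample rows, slide IDs, and slide names are invented, and it is an assumption that removed slides carry an empty new_order.

```python
import csv
import io

# Codes as documented for the modification column; the labels are paraphrases.
MODIFICATION_LABELS = {0: "no modification", 1: "removal", 2: "custom slide added"}

# Invented sample in the shape described for VISUALIZATION_MODIFICATIONS.csv.
MODIFICATIONS_SAMPLE = """participant_id,slide_id,original_order,new_order,slide_name,modification,details
AB_DE_01,S01,1,2,intro,0,
AB_DE_01,S02,2,,lime,1,
AB_DE_01,S03,3,1,stats,0,
AB_DE_01,,,3,custom,2,extra legend
"""


def participant_order(rows, participant_id):
    """Return slide names in the order chosen by one participant."""
    kept = [
        row for row in rows
        if row["participant_id"] == participant_id
        and int(row["modification"]) != 1   # skip removed slides
        and row["new_order"]                # skip rows without a new position
    ]
    kept.sort(key=lambda row: int(row["new_order"]))
    return [row["slide_name"] for row in kept]


sample_rows = list(csv.DictReader(io.StringIO(MODIFICATIONS_SAMPLE)))
order = participant_order(sample_rows, "AB_DE_01")
```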

QUESTIONS

Column name Column description
question_id Unique ID of each question, used to match the actual question text with the transcripts
realted_slide_id The ID of the slide that the question was originally assigned to
question The actual text of the question 
question_en English translation of the question

TRANSCRIPTS

Column name Column description
speaker_id ID of the person who authored the text in the text column. This is essentially the distinction between investigator and participant.
slide_id ID of the slide, matching the ID in SLIDES.csv, that the following text relates to. The row in which a slide ID appears marks the point at which that slide came up on the participant's screen. Four special slide numbers identify the stages of the interview: __S00__ marks the beginning of the core part of the interview with slide analysis (before it, the study is introduced to the participant); __S99__ marks the beginning of the section in which the participants analyze the visualization order (slide IDs are not assigned in this section because participants switched slides dynamically); __S15__ marks the end of the slide analysis section; __S88__ marks the beginning of the problem-solving section.
question_id ID of a question that can be matched with QUESTIONS.csv. It marks the place where the question was asked or where the participant started answering it. Not all of the questions from QUESTIONS.csv can be matched with the transcripts, because some participants did not answer some of the questions, or the tagging of an answer was too vague.
problem_id ID of a problem the participants were asked to solve, which can be matched with PROBLEMS.csv. The appearance of the ID in a row indicates that from this point on the participant was trying to solve the problem.
text The anonymized transcript of the speaker's words, obtained with MS Teams.
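Since a slide ID appears only in the row where the slide first comes on screen, later rows usually need that ID carried forward before the text can be grouped by slide. The sketch below does this with the standard library. The sample transcript, the speaker IDs "I" and "P", and the slide and question IDs other than the stage marker are invented; it is also an assumption that rows without a new slide have an empty slide_id cell and that the stage markers appear in the files exactly as written above.

```python
import csv
import io

# Invented sample in the shape described for a transcript file.
TRANSCRIPT_SAMPLE = """speaker_id,slide_id,question_id,problem_id,text
I,__S00__,,,Let us start.
P,,,,OK.
I,S01,Q1,,What do you see?
P,,,,A bar chart.
"""


def forward_fill(rows, column):
    """Yield copies of rows with empty cells in `column` set to the last seen value."""
    current = ""
    for row in rows:
        if row[column]:
            current = row[column]
        yield {**row, column: current}


reader = csv.DictReader(io.StringIO(TRANSCRIPT_SAMPLE))
filled = list(forward_fill(reader, "slide_id"))
slide_per_row = [row["slide_id"] for row in filled]
```

The same function can be applied to question_id or problem_id, although in the __S99__ section slide IDs are not assigned, so carried-forward values there should be treated with care.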

Notes (En)

Release notes

  • 1.0.2 — Typos fixed in the VISUALIZATION_MODIFICATIONS.csv file  
  • 1.0.1 — English translations added for Polish-only columns and content (for preview for the international community)  
  • 1.0.0 — Original dataset  

Files

Files (31.7 MB)

Checksum Size
md5:b90cac76f0e5df3aabcb1478322830b9 13.1 kB
md5:8cfd50bec23d6459c632cc475eb966ea 174.2 kB
md5:c2eac1cf62d2325d7af99b05116b00c1 1.3 MB
md5:2e1d01641b0f4e160944c1a7c5283891 1.3 MB
md5:78fdd4cb137fb233f6dff966c06bc374 1.1 kB
md5:779a8ae7904f4a8d2cafed812a04bcaa 755 Bytes
md5:aa642e278c460e72dbf94033cb4dec71 5.8 kB
md5:d871a460524b3e223c4adaabc32beed6 6.6 kB
md5:a7e07e0c3942cd71057de7e70d4bf4c2 1.8 kB
md5:367e558dbb12e48007738fd7a31e4cbf 128.6 kB
md5:5526cc7acb1245c326bef7085c217dd1 125.8 kB
md5:83396f076ebe16e00450c0d15e8281d1 633.9 kB
md5:b34d0cf7bb6b5ee40ce6b0271cf11af7 14.1 kB
md5:ee3df0e76526ef14ab385e4b990cdc23 28.0 MB
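The md5 checksums listed above can be used to check that a downloaded file is intact. The sketch below computes the digest of a binary stream with Python's hashlib and compares it with a listed value. The function names md5_hex and matches_listed are illustrative, and because the listing above does not show which checksum belongs to which file, the pairing has to be taken from the repository page itself.

```python
import hashlib
import io


def md5_hex(stream, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_listed(stream, listed):
    """Compare a stream's digest with a listed value such as 'md5:0123...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_hex(stream) == expected


# Self-check on in-memory data with a well-known digest.
ok = matches_listed(io.BytesIO(b"hello"), "md5:5d41402abc4b2a76b9719d911017c592")
```

For a file on disk, the stream would come from open(path, "rb"), for example with open("CODEBOOK.csv", "rb") as handle followed by matches_listed(handle, listed_value).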

Additional details

Related works

Is described by
Journal article: 10.1038/s41597-025-05167-6 (DOI)

Funding

National Science Centre
XPM - Explainable Predictive Maintenance 2020/02/Y/ST6/00070
Ministry of Science and Higher Education
Priority Research Area (DigiWorld) under the Strategic Programme Excellence Initiative at Jagiellonian University. ID.UJ

Software

Repository URL
https://gitlab.geist.re/pro/xai-fungi
Programming language
Python