Published January 11, 2026 | Version v2

[Supplementary material] Meta-Fair: AI-Assisted Fairness Testing of Large Language Models

  • 1. SCORE Lab, I3US Institute, Universidad de Sevilla
  • 2. Mondragon University

Description

This repository contains supplementary material for the paper Meta-Fair: AI-Assisted Fairness Testing of Large Language Models. It includes data, scripts, and resources to reproduce and analyse the experiments described in the paper.
 
The contents are organised into four main folders:
  • data/: Contains data for metamorphic tests, structured by research question (see the loading sketch after this list):
    • attributes_catalogue.csv: Catalogue of attributes used in metamorphic test generation.
    • Subfolders for each research question (rq1/, rq2/, rq3/) include:
      • generation/: CSV files of generated data for each metamorphic relation.
      • execution/: CSV files representing the execution results of the metamorphic tests.
      • evaluation/: Evaluation-related data, such as experiments and judgements.
      • manual_revision/: Manually revised data provided by human judges.
  • experimental_setup/: Provides the setup required to generate, execute, and evaluate the metamorphic tests:
    • configuration/: JSON files (metamorphic_relations.json, rq1.json, rq2.json, rq3.json) defining configurations for each research question.
    • jobs/: Subfolders (rq1/, rq2/, rq3/) containing job configurations for task execution.
    • scripts/: Python scripts (generation.py, execution.py, evaluation.py, experiment.py) that automate the generation, execution, and evaluation processes (see the pipeline sketch after this list).
    • tools/: Source code of the three tools developed for LLM-assisted generation (MUSE), execution (GENIE), and evaluation (GUARD-ME).
    • requirements.txt: Lists dependencies needed to run the experiments.
  • prompt_templates/: Includes the prompt templates used in the generation and evaluation of the metamorphic tests (see the template-filling sketch after this list):
    • base_generation.txt: The base prompt template for generation tasks.
    • base_evaluation.txt: The base prompt template for evaluation tasks. 
    • generation_derivates/: Prompt templates derived from base_generation.txt and employed to generate metamorphic tests with different strategies: dual_attributes.txt, hypothetical_scenario.txt, metal.txt, multiple_choice.txt, prioritisation.txt, proper_nouns.txt, ranked_list.txt, score.txt, sentence_completion.txt, single_attribute.txt, and yes_no_question.txt.
    • evaluation_derivates/: Prompt templates derived from base_evaluation.txt and used to guide the judge model across the different evaluation methods: attribute_comparison.txt, inverted_consistency.txt, and proper_nouns_comparison.txt.
  • analysis/: Contains resources for analysing and visualising the results of the experiments:
    • results_analysis.ipynb: A Jupyter notebook used for data analysis, visualisation, and supplementary experiments. It enables interactive result analysis and supports reproducibility.
    • requirements.txt: Lists the dependencies required to run the notebook environment.
    • outputs/: Figures (figures/), tables (tables/) and statistical tests (statistical_tests/) that summarise the experimental findings.
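
As a quick way to inspect the generated data, the CSV files under data/ can be loaded with pandas. This is only a sketch: the file name used below is an assumption made for illustration, so check the actual files under data/rq1/generation/ for the real names and schema.

    # Minimal sketch: load one generated-test CSV and inspect its schema.
    # NOTE: "dual_attributes.csv" is a hypothetical file name; list the
    # contents of data/rq1/generation/ to find the real files.
    import pandas as pd

    generated = pd.read_csv("data/rq1/generation/dual_attributes.csv")
    print(generated.shape)             # number of rows (test cases) and columns
    print(generated.columns.tolist())  # inspect the real column names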
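
The scripts under experimental_setup/scripts/ automate the three pipeline stages. Their command-line interfaces are not documented in this record, so the pipeline sketch below assumes each script can be run directly without arguments; consult the scripts themselves, or the job files under experimental_setup/jobs/, for the real entry points and arguments.

    # Hypothetical end-to-end run of the pipeline, in the order
    # generation -> execution -> evaluation. The bare invocations are an
    # assumption; the scripts may expect paths to the JSON files under
    # experimental_setup/configuration/.
    import subprocess

    for stage in ("generation.py", "execution.py", "evaluation.py"):
        subprocess.run(
            ["python", f"experimental_setup/scripts/{stage}"],
            check=True,  # abort the pipeline if any stage fails
        )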
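
Because the derived templates are plain-text files, instantiating one amounts to reading it and substituting its placeholders, as in the template-filling sketch below. The placeholder name "{attribute}" is a hypothetical example; open the template files to see the placeholders they actually define.

    # Minimal sketch: read a derived generation template and fill in one
    # placeholder. "{attribute}" is a hypothetical placeholder name used
    # only for illustration.
    from pathlib import Path

    template = Path("prompt_templates/generation_derivates/single_attribute.txt").read_text()
    # str.replace is used instead of str.format so that any literal
    # braces elsewhere in the template are left untouched.
    prompt = template.replace("{attribute}", "gender")
    print(prompt)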

Files

meta-fair-supplementary-material.zip (120.8 MB)
md5: 9a09287d235a848276777b9043ee97c0