IDEAFix: An Evaluation Framework for Creative Defixation Prompting in LLMs
Dataset Card for IDEAFix
This dataset accompanies the paper: "IDEAFix: An Evaluation Framework for Creative Defixation Prompting in LLMs," submitted to NeurIPS 2026.
Dataset Details
Dataset Description
IDEAFix is a controlled evaluation dataset for studying divergent thinking and creative idea generation in Large Language Models (LLMs). It combines structured design briefs, attribute-based expansions, and method-inspired prompting strategies to enable systematic analysis of how task formulation and prompting influence LLM creativity. The dataset includes 81 design briefs expanded into 567 experimental conditions, paired with 25 structured prompts, yielding a total of 14,350 input prompts. Inference outputs from 6 LLMs are also provided.
Main dataset files:
- Brief categories-Table 1.csv maps category symbols to their corresponding attributes. It contains three columns: Category (a single letter representing the category), Attributes (the category name), and Explanation (a short description of the category). Categories cover two main axes: adjective placement strategies within a brief (e.g., start, end, after the 1st word) and brief type classifications (e.g., product, service, strategy, process). A subset of categories is reserved for jailbreak-related briefs, flagging prompts that LLMs should not or may not be able to answer.
- Briefs-Table 1.csv contains 81 design briefs used in the benchmark, organized in a hierarchical structure. Each brief (e.g., Bookmark, Lemonade Stand) is associated with a category tag (as defined in Brief categories-Table 1.csv) and a 3×3 matrix of adjectives, spanning three property types (Traditional, Atypical, Surprising) and three sentiment polarities (Neutral, Positive, Negative). Each adjective is itself tagged with one or more categories, capturing its placement strategy and/or brief type. Together, these adjectives represent the structured prompting variations used to study how task formulation influences LLMs' idea generation.
- Crea_prompts-Table 1.csv contains 25 prompts structured around established creativity and design methods. Each row corresponds to a named method (TRIZ, Brainstorming, C-K, SCAMPER, Design Thinking, Prospective Scenarios, Control Prompts, or AI Creativity Prompts) and provides up to four alternative phrasings of the method-specific prompt. All prompts follow a shared instruction format, directing the model to generate a list of solutions (each at most two sentences) in English for a given [Brief context] placeholder. The alternative phrasings allow for systematic comparison of how prompt wording influences idea generation within the same methodological framework.
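As a rough illustration of how a prompt template and a brief variation could be combined into one input prompt, the sketch below fills the [Brief context] placeholder with a brief text. Only the placeholder name comes from the description above; the template wording, the brief texts, and the helper name build_prompt are assumptions made for illustration and are not taken from the dataset files.

```python
PLACEHOLDER = "[Brief context]"


def build_prompt(template: str, brief_text: str) -> str:
    """Fill the brief placeholder in a method-specific prompt template."""
    if PLACEHOLDER not in template:
        raise ValueError(f"template has no {PLACEHOLDER} placeholder")
    return template.replace(PLACEHOLDER, brief_text)


# Hypothetical template and brief variations (not copied from the CSV files).
template = (
    "Generate a list of solutions, each at most two sentences, "
    "in English for the following brief: [Brief context]"
)
brief_variants = ["Design a bookmark", "Design a surprising bookmark"]

# One input prompt per (template, brief variation) pair.
prompts = [build_prompt(template, brief) for brief in brief_variants]
for prompt in prompts:
    print(prompt)
```

Raising an error when the placeholder is missing is a design choice of this sketch: it avoids silently sending a template to a model without any brief attached.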
Inference files:
We provide inference outputs for six models. For five of them (Gemini-2.5-Flash, GPT-4o, Grok-4.1-Fast-Reasoning, Qwen3-30B, and Llama-3.1-70B), three sets of output files are available; for Claude, only one set of outputs is provided.
Files are named following the convention inference-<model_name>_{1,2,3}.{json,tsv}, and are released in both JSON and TSV formats containing identical information.
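The naming convention can be expanded programmatically. The following is a minimal sketch that enumerates the file names for one model; the helper name inference_filenames is hypothetical, and "GPT-4o" stands in for the <model_name> part only as an example, since the exact spelling used in the released file names may differ.

```python
from itertools import product


def inference_filenames(model_name, runs=(1, 2, 3), formats=("json", "tsv")):
    """List file names of the form inference-<model_name>_<run>.<format>."""
    return [
        f"inference-{model_name}_{run}.{fmt}"
        for run, fmt in product(runs, formats)
    ]


# Assumed model-name spelling, for illustration only.
names = inference_filenames("GPT-4o")
for name in names:
    print(name)
```

For a model with a single set of outputs, the same helper can be called with a shorter runs tuple, assuming the run suffix is still present in the file name.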
Each inference record contains the following fields:
- Brief metadata (brief.categories, brief.name): the category tags and name of the design brief used as input.
- Prompt metadata (full_prompt, prompt.method_name, prompt.version_name): the full prompt sent to the model, along with the creativity method and version it belongs to.
- Inference configuration (inference.brief_categories, inference.brief_sentiment_jailbreak, inference.brief_sentiment_position, inference.brief_text, inference.brief_word, inference.property_type, inference.sentiment_type): structured metadata describing the specific variation of the brief used, including adjective placement, sentiment polarity, and property type.
- Model metadata (model_metadata.raw, model_metadata.thoughts, model_metadata.usage.completion_tokens, model_metadata.usage.n_generations, model_metadata.usage.prompt_tokens, model_metadata.usage.total_tokens): raw model output, chain-of-thought traces where available, and token usage statistics.
- Model response (model_response.texts): the list of generated ideas returned by the model.
- Other (index, source, timestamp): record identifier, data source, and timestamp of the inference run.
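The dotted field names above do not by themselves say whether a JSON record stores them as flat keys or as nested objects. As a hedged sketch, the hypothetical helper get_field below looks a field up under either layout; the two sample records are synthetic and only reuse field names from the list above.

```python
import json


def get_field(record, dotted, default=None):
    """Look up a dotted field name in a flat or nested record."""
    if dotted in record:  # flat layout: {"brief.name": ...}
        return record[dotted]
    node = record
    for part in dotted.split("."):  # nested layout: {"brief": {"name": ...}}
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


# Synthetic records for illustration only; the values are made up.
flat = json.loads('{"brief.name": "Bookmark", "prompt.method_name": "SCAMPER"}')
nested = json.loads(
    '{"brief": {"name": "Bookmark"},'
    ' "model_response": {"texts": ["idea A", "idea B"]}}'
)

print(get_field(flat, "brief.name"))
print(get_field(nested, "model_response.texts"))
print(get_field(nested, "prompt.method_name", default="n/a"))
```

Checking the flat key first keeps the helper usable on TSV rows read as dictionaries, where each column header would be the full dotted name.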
- Curated by: Anonymous Author(s)
- Language(s) (NLP): English
- License: CC-BY 4.0
Dataset Sources
- Repository: Code
- Dataset: Dataset
- Paper: Submitted to the 40th Conference on Neural Information Processing Systems (NeurIPS 2026). Do not distribute.
Uses
IDEAFix is intended for researchers and practitioners studying LLM creativity, divergent thinking, and the influence of prompting on idea generation. Suitable use cases include:
- Benchmarking and comparing the creative capabilities of LLMs across divergent thinking dimensions (fluency, diversity, novelty, rarity).
- Studying the effects of task formulation (brief type, attribute typicality, sentiment polarity) on LLM-generated ideas.
- Evaluating structured prompting strategies — including creativity and defixation methods such as TRIZ, Brainstorming, C-K, SCAMPER, and Design Thinking — on LLM idea generation.
- Analyzing homogenization and fixation effects in LLM outputs at both the individual and collective level.
- Extending the framework with new briefs, attributes, prompts, or models, given its compositional and modular design.
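As one possible starting point for the benchmarking use case, the sketch below computes a simple count-based fluency proxy: the mean number of generated ideas per record, grouped by prompting method. This is an assumption about how fluency might be operationalized, not necessarily the metric used in the paper. It also assumes records are dictionaries keyed by the flat field names prompt.method_name and model_response.texts, and the sample records are synthetic.

```python
from collections import defaultdict
from statistics import mean


def mean_fluency_by_method(records):
    """Average number of generated ideas per record, grouped by method."""
    counts = defaultdict(list)
    for rec in records:
        method = rec["prompt.method_name"]
        counts[method].append(len(rec["model_response.texts"]))
    return {method: mean(values) for method, values in counts.items()}


# Synthetic records for illustration only.
records = [
    {"prompt.method_name": "TRIZ", "model_response.texts": ["a", "b", "c"]},
    {"prompt.method_name": "TRIZ", "model_response.texts": ["a"]},
    {"prompt.method_name": "SCAMPER", "model_response.texts": ["a", "b"]},
]

result = mean_fluency_by_method(records)
print(result)
```

The same grouping pattern extends to other keys, such as inference.property_type or inference.sentiment_type, to compare task formulations instead of methods.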
Annotations
Candidate attribute words (adjectives) for each brief were generated by an LLM and subsequently reviewed by experts, who selected the most contextually relevant variant for each cell from the 18 candidates per brief (3 typicality levels × 2 sentiment polarities × 3 candidates per cell). Each of the 25 prompting strategies was similarly submitted to an academic expert in creativity and innovation management, who provided structured written feedback; expert comments were systematically reviewed and integrated into the final prompt formulations prior to dataset deployment.
Personal and Sensitive Information
The dataset does not contain any personal or individually identifiable information. However, a subset of briefs is explicitly flagged as sensitive or jailbreak-relevant, covering categories such as fraud, pornography, political lobbying, privacy violations, legal opinion, financial advice, and government decision-making. These briefs are included solely to evaluate model safety and refusal behavior, and are clearly identified through the category tagging system defined in Brief categories-Table 1.csv.