Published October 10, 2025 | Version 2.0.0
Dataset Open

Exploring Large Language Models for Automated Non-Functional Requirements Generation

  • 1. Pennsylvania State University

Description

Exploring Large Language Models for Automated Non-Functional Requirements Generation: A Human Annotated Dataset for NFR Quality

This artifact provides a dataset and analysis tools for evaluating the quality of Non-Functional Requirements (NFRs) that Large Language Models (LLMs) generate from Functional Requirements (FRs) alone. The dataset includes human evaluations of NFR quality against the quality attributes defined in the ISO/IEC 25010:2023 standard.

This research artifact contains:

  • Human evaluation data for NFRs generated by 8 different LLMs across 34 functional requirements
  • Responses from software engineering professionals, collected through the Turso database service
  • Analysis scripts for data processing and statistical analysis
  • LLM outputs in structured JSON format for all tested models
  • Advanced prompting techniques incorporating ISO/IEC 25010:2023 standards

The study evaluates two key aspects:

  1. NFR Validity (1-5 scale): Coherence and appropriateness of generated NFRs
  2. Attribute Applicability (1-5 scale): Relevance of assigned ISO quality attributes
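
To make the two rating scales concrete, the following TypeScript sketch models a single evaluation and checks that both scores lie on the 1-5 ordinal scale. The `Evaluation` type, its field names, and both helper functions are illustrative assumptions, not the artifact's actual schema or code.

```typescript
// Hypothetical shape of one human evaluation; field names are assumptions.
interface Evaluation {
  validity: number;      // NFR Validity, 1-5
  applicability: number; // Attribute Applicability, 1-5
}

// True when a score is an integer on the 1-5 ordinal scale.
function isOrdinalScore(score: number): boolean {
  return Number.isInteger(score) && score >= 1 && score <= 5;
}

// True when both scores of an evaluation are on the scale.
function isValidEvaluation(e: Evaluation): boolean {
  return isOrdinalScore(e.validity) && isOrdinalScore(e.applicability);
}

console.log(isValidEvaluation({ validity: 5, applicability: 4 })); // true
console.log(isValidEvaluation({ validity: 0, applicability: 4 })); // false
```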

Requirements

Software Dependencies

  • Deno (JavaScript/TypeScript runtime) - Version 1.40+ recommended
  • SQLite3 (for database operations)
  • Standard text editor (for viewing TSV/JSON files)

Hardware Requirements

  • RAM: Minimum 4GB (recommended 8GB)
  • Storage: 500MB free space
  • OS: Cross-platform (Linux, macOS, Windows)

Installation

  1. Install Deno:
# Linux/macOS
curl -fsSL https://deno.land/install.sh | sh

# Windows (PowerShell)
irm https://deno.land/install.ps1 | iex
  2. Verify Installation:
deno --version
  3. Clone/Download Artifact: Extract downloaded archive

Step-by-Step Instructions to Reproduce Paper Results

Step 1: Examine Raw Data Sources

Input: Professional evaluation data collected via Turso database service

  • File: data/dump.sql
  • Description: SQL dump containing responses from software engineering professionals who evaluated LLM-generated NFRs
  • Content: Raw evaluation data including validity scores (1-5), applicability scores (1-5), and quality attribute assignments

Expected Output: Understanding of data collection methodology and raw response structure

Step 2: Generate Analysis Database

Purpose: Convert SQL dump to SQLite database for analysis

cd analysis
deno run --allow-read --allow-write generateData.ts

Process:

  • The generateData.ts script reads data/dump.sql
  • Creates data/dump.db SQLite database
  • Structures data for statistical analysis

Expected Output: data/dump.db file created (approximately 2-5MB)

Step 3: Process and Merge Evaluation Data

Purpose: Combine human evaluations with LLM assignments and generate final dataset

The generateData.ts script performs:

  1. Assignment Processing: Maps evaluators to specific FR-LLM combinations:
    • NFR Validity evaluations: 10 evaluators × 3 FRs each × 8 LLMs
    • Attribute Applicability evaluations: 10 evaluators × 3 FRs each × 8 LLMs
  2. Data Merging: Combines database records with assignment metadata
  3. TSV Generation: Outputs a structured TSV file for analysis

Expected Output: analysis/Human_Evaluation_Data.tsv (final dataset used in paper)
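
As a rough illustration of the assignment arithmetic in Step 3, the sketch below enumerates every evaluator × FR × LLM combination for one task. It is not taken from generateData.ts: the `Assignment` type, the `frSlot` index, and `enumerateAssignments` are hypothetical names, and the actual mapping of evaluators to specific FR IDs is not modelled here.

```typescript
// Model names copied from the LLMOutputs/ file names listed in Step 4.
const llms: string[] = [
  "claude-3-5-haiku", "claude-3-7-sonnet", "deepSeek-V3", "gemini-1.5-pro",
  "gpt-4o-mini", "grok-2", "lama-3.3-70B", "Qwen2.5-72B",
];

// Hypothetical record for one evaluator / FR / LLM combination.
interface Assignment {
  evaluator: number; // evaluator index, 1-10 (assumed numbering)
  frSlot: number;    // which of the evaluator's 3 FRs (1-3), not a real FR ID
  llm: string;
}

// Enumerate every evaluator x FR-slot x LLM combination for one task.
function enumerateAssignments(
  evaluators: number,
  frsEach: number,
  models: string[],
): Assignment[] {
  const out: Assignment[] = [];
  for (let e = 1; e <= evaluators; e++) {
    for (let f = 1; f <= frsEach; f++) {
      for (const llm of models) {
        out.push({ evaluator: e, frSlot: f, llm });
      }
    }
  }
  return out;
}

const perTask = enumerateAssignments(10, 3, llms);
console.log(perTask.length); // 240 = 10 x 3 x 8
```

Under this reading, each task covers 240 evaluator-FR-LLM combinations, and each combination can contain several NFRs.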

Step 4: Analyze LLM Output Structure

Files: LLMOutputs/*.json (8 files, one per LLM)

  • claude-3-5-haiku.json
  • claude-3-7-sonnet.json
  • deepSeek-V3.json
  • gemini-1.5-pro.json
  • gpt-4o-mini.json
  • grok-2.json
  • lama-3.3-70B.json
  • Qwen2.5-72B.json

Expected Format for each FR:

{
  "functionalRequirement": "System shall allow users to log in with username and password",
  "identifiedNFRs": [
    {
      "attribute": "Security",
      "requirement": "The system must encrypt passwords using AES-256 encryption",
      "justification": "Login functionality requires secure credential handling"
    }
  ]
}

Analysis: Each JSON file contains 34 FR entries with generated NFRs following ISO/IEC 25010:2023 categories
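
The format above can be captured in TypeScript types, which helps when tallying NFRs per quality attribute. This is a minimal sketch: it assumes each file holds an array of such FR entries (the top-level layout is not stated here), and `FREntry`, `IdentifiedNFR`, and `countByAttribute` are illustrative names rather than part of the artifact.

```typescript
// Types mirroring the per-FR JSON format shown above.
interface IdentifiedNFR {
  attribute: string;
  requirement: string;
  justification: string;
}

interface FREntry {
  functionalRequirement: string;
  identifiedNFRs: IdentifiedNFR[];
}

// Count generated NFRs per quality attribute across a list of FR entries.
function countByAttribute(entries: FREntry[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const entry of entries) {
    for (const nfr of entry.identifiedNFRs) {
      counts[nfr.attribute] = (counts[nfr.attribute] ?? 0) + 1;
    }
  }
  return counts;
}

// Inline sample reusing the entry shown above; a real run would parse
// one of the LLMOutputs/*.json files instead.
const sampleEntries: FREntry[] = [
  {
    functionalRequirement: "System shall allow users to log in with username and password",
    identifiedNFRs: [
      {
        attribute: "Security",
        requirement: "The system must encrypt passwords using AES-256 encryption",
        justification: "Login functionality requires secure credential handling",
      },
    ],
  },
];

console.log(countByAttribute(sampleEntries)["Security"]); // 1
```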

Step 5: Examine Prompt Engineering Approach

File: data/AdvancedPrompt.txt
Content: Complete prompt used for NFR generation, including:

  • Role assignment (expert software quality engineer)
  • Knowledge grounding (ISO/IEC 25010:2023 standard)
  • Output structure constraints (JSON format)
  • Quality requirements (specific, actionable, testable NFRs)

File Descriptions

Core Dataset Files

  • analysis/Human_Evaluation_Data.tsv: Main evaluation dataset (2,240 evaluated NFRs)
    • Columns: FR ID, FR text, NFR ID, LLM model, ISO attribute, NFR text, justification, validity score, applicability score, human attribute assignment, evaluator assignment type, evaluator ID
  • data/FR_34.tsv: 34 functional requirements subset used for evaluation
  • data/dump.sql: Raw SQL dump from Turso database service containing professional evaluations
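
For readers who want to load Human_Evaluation_Data.tsv programmatically, the following sketch parses tab-separated text with a header row into one record per line. The header names in the sample (`llm`, `validity`, `applicability`) are hypothetical stand-ins, since the exact column headers are not listed here, and the parser assumes no cell contains an embedded tab or newline.

```typescript
// Parse TSV text with a header row into one record per data line.
function parseTsv(text: string): Record<string, string>[] {
  const lines = text.split(/\r?\n/).filter((line) => line.length > 0);
  const header = (lines[0] ?? "").split("\t");
  return lines.slice(1).map((line) => {
    const cells = line.split("\t");
    const row: Record<string, string> = {};
    header.forEach((name, i) => {
      row[name] = cells[i] ?? ""; // missing trailing cells become empty strings
    });
    return row;
  });
}

// Hypothetical header names and values; the real file has more columns.
const sampleTsv = "llm\tvalidity\tapplicability\ngrok-2\t5\t4\ngpt-4o-mini\t4\t5\n";
const parsedRows = parseTsv(sampleTsv);

console.log(parsedRows.length);      // 2
console.log(parsedRows[0]?.["llm"]); // "grok-2"
```

In Deno, the input text could be obtained with `await Deno.readTextFile(path)`, which requires the --allow-read flag.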

LLM Output Files

  • LLMOutputs/[model].json: Structured NFR generations for each of 8 LLMs
    • Each file contains 34 FR entries with associated NFRs in JSON format

Configuration Files

  • data/AdvancedPrompt.txt: Complete prompt template with ISO/IEC 25010:2023 integration
  • analysis/generateData.ts: Data processing script for database creation and TSV generation

Documentation

  • LICENSE.md: Distribution rights and usage terms
  • analysis/visualization.ipynb: Jupyter notebook for data visualization and statistical analysis

Mapping to Paper Claims

Key Paper Statistics (Section 6 - Results)

  • 1,593 total NFRs generated across 8 LLMs and 34 FRs
  • 174 NFRs evaluated for validity and applicability scoring
  • 168 NFRs evaluated for attribute selection task
  • Mean validity score: 4.63 (median: 5.0) on 1-5 scale
  • Mean applicability score: 4.59 (median: 5.0) on 1-5 scale
  • 80.4% attribute accuracy in expert vs. LLM attribute selection
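
The mean and median figures above are ordinary summary statistics over the 1-5 scores. The sketch below shows one way to compute them; the sample scores are made up for illustration and are not drawn from the dataset, and the helper names are assumptions.

```typescript
// Arithmetic mean of a non-empty list of scores.
function meanScore(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Median of a non-empty list: the middle value, or the average of the two middle values.
function medianScore(xs: number[]): number {
  const sorted = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const upper = sorted[mid] ?? NaN;
  const lower = sorted[mid - 1] ?? NaN;
  return sorted.length % 2 === 1 ? upper : (lower + upper) / 2;
}

// Illustrative scores only.
const exampleScores = [5, 5, 4, 5, 3];
console.log(meanScore(exampleScores));   // 4.4
console.log(medianScore(exampleScores)); // 5
```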

Figure Reproduction Mapping

  • Figure 3: Reproduced from validity scores in Human_Evaluation_Data.tsv
    • Shows 90.8% of NFRs scored ≥4, with 76.4% scoring perfect 5
  • Figure 4: Generated from applicability scores in Human_Evaluation_Data.tsv
    • Demonstrates 90.2% highly applicable ratings (scores 4-5)
  • Figure 5: Computed from attribute selection task data
    • Visualizes 80.4% exact matches, 8.3% near misses, 11.3% complete mismatches
  • Figure 6: Generated from LLM vs. expert attribute assignments
    • Shows specific misclassification patterns (e.g., Functional Suitability vs. Reliability)
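
The percentages behind these figures are shares of a score list or of attribute pairs. The sketch below computes the share of scores at or above a threshold and the exact-match rate between LLM-assigned and expert-assigned attributes. All sample values are invented for illustration, the function names are assumptions, and the "near miss" category is left out because its definition is not given here.

```typescript
// Percentage of scores at or above a threshold, rounded to one decimal place.
function shareAtLeast(scores: number[], threshold: number): number {
  const hits = scores.filter((s) => s >= threshold).length;
  return Math.round((hits / scores.length) * 1000) / 10;
}

// Percentage of items where the LLM-assigned attribute equals the expert's choice.
function exactMatchRate(pairs: Array<{ llm: string; expert: string }>): number {
  const matches = pairs.filter((p) => p.llm === p.expert).length;
  return Math.round((matches / pairs.length) * 1000) / 10;
}

// Illustrative data only, not taken from Human_Evaluation_Data.tsv.
const demoScores = [5, 5, 4, 3, 5, 2, 5, 4, 5, 5];
const demoPairs = [
  { llm: "Security", expert: "Security" },
  { llm: "Reliability", expert: "Functional Suitability" },
  { llm: "Usability", expert: "Usability" },
  { llm: "Performance Efficiency", expert: "Performance Efficiency" },
];

console.log(shareAtLeast(demoScores, 4)); // 80
console.log(shareAtLeast(demoScores, 5)); // 60
console.log(exactMatchRate(demoPairs));   // 75
```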

Table Reproduction Mapping

  • Table 4 (LLM Comparison): Directly derived from Human_Evaluation_Data.tsv grouped by LLM model
    • Validity ranges: 3.96 (claude-3-7-sonnet) to 4.94 (llama-3.3-70B)
    • Applicability ranges: 3.67 (claude-3-7-sonnet) to 4.97 (grok-2)
    • Attribute accuracy ranges: 71.4% (deepSeek-V3) to 90.9% (gemini-1.5-pro)
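
A per-model comparison of this kind amounts to grouping rows by LLM and averaging a score column. The following sketch shows that grouping step under assumed names (`ScoredRow`, `meanByModel`) with invented sample values; the paper's actual analysis code may differ.

```typescript
// Hypothetical minimal row: one validity score attributed to one model.
interface ScoredRow {
  llm: string;
  validity: number;
}

// Group rows by model and return the mean validity score per model.
function meanByModel(rows: ScoredRow[]): Record<string, number> {
  const sums: Record<string, { total: number; n: number }> = {};
  for (const r of rows) {
    const acc = sums[r.llm] ?? { total: 0, n: 0 };
    acc.total += r.validity;
    acc.n += 1;
    sums[r.llm] = acc;
  }
  const means: Record<string, number> = {};
  for (const llm of Object.keys(sums)) {
    const acc = sums[llm];
    if (acc) means[llm] = acc.total / acc.n;
  }
  return means;
}

// Illustrative rows only.
const demoRows: ScoredRow[] = [
  { llm: "grok-2", validity: 5 },
  { llm: "grok-2", validity: 4 },
  { llm: "gpt-4o-mini", validity: 4 },
  { llm: "gpt-4o-mini", validity: 5 },
  { llm: "gpt-4o-mini", validity: 3 },
];

const modelMeans = meanByModel(demoRows);
console.log(modelMeans["grok-2"]);      // 4.5
console.log(modelMeans["gpt-4o-mini"]); // 4
```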

Research Questions Validation

  • RQ1 (LLM Effectiveness): Validated through high validity (90.8% ≥4) and applicability (90.2% ≥4) scores
  • RQ2 (Best Performing LLM): Answered via Table 4 comparison showing gemini-1.5-pro (highest attribute accuracy), llama-3.3-70B (highest validity), and grok-2 (highest applicability)
  • RQ3 (Prompting Technique Impact): Demonstrated through advanced vs. baseline prompting comparison

Methodology Reproduction (Section 4)

  • 34 FRs Selection: Subset available in data/FR_34.tsv
  • 8 LLM Configuration: Models and parameters detailed in Table 3, outputs in LLMOutputs/*.json
  • Evaluation Framework: 10 evaluators with an average of 13 years of experience, dual-task design
  • Custom Prompting: Complete advanced prompt available in data/AdvancedPrompt.txt

Statistical Claims Verification

  • Sample Size Calculation: Based on an analysis of 15 SRS documents (Table 2), which yielded an average of 33.5 FRs
  • Expert Evaluation Distribution: Task 1 (32 FRs, 174 NFRs) and Task 2 (2 FRs, 168 NFRs)
  • Temperature Setting: 0.4 selected through systematic testing (Section 4.3.2)
  • Quality Assessment: Ordinal scale (1-5) with specific rubrics for validity and applicability

Data Provenance

  1. Professional Collection: Software engineering professionals recruited through academic networks
  2. Turso Database: Cloud database service used for response collection and management
  3. Assignment Strategy: Balanced design ensuring each FR-LLM combination is evaluated by multiple professionals
  4. Quality Control: Validation checks implemented in data processing pipeline

Expected Results Summary

When following the reproduction steps, you should observe:

  • 2,240 total NFR evaluations across 8 LLMs and 34 FRs
  • Validity scores ranging 1-5 with LLM-specific distributions
  • Applicability scores showing attribute-specific patterns
  • JSON-structured LLM outputs demonstrating prompt effectiveness
  • Professional evaluation data providing ground truth for NFR quality assessment

This artifact enables full reproduction of the paper's experimental methodology and statistical findings regarding LLM performance in automated NFR generation.

Troubleshooting

Common Issues

  • Deno not found: Ensure Deno is properly installed and added to your PATH
  • Permission errors: Use --allow-read --allow-write flags when running Deno scripts
  • Database file not found: Ensure generateData.ts has been run to create the database from dump.sql

Support

For questions about the artifact or reproduction steps, please refer to the paper or contact the authors through the conference proceedings.

Files

NFR generation Artifacts.zip (2.0 MB)
md5:625b3d6d8246cf76d018237bf90482d3

Additional details

Dates

Submitted
2025-03-10
Accepted
2025-09-06

Software

Programming language
TypeScript
Development Status
Active