Exploring Large Language Models for Automated Non-Functional Requirements Generation
Exploring Large Language Models for Automated Non-Functional Requirements Generation: A Human Annotated Dataset for NFR Quality
This artifact provides a dataset and analysis tools for evaluating the quality of Non-Functional Requirements (NFRs) that Large Language Models (LLMs) generate from Functional Requirements (FRs) alone. The dataset includes human evaluations of NFR quality against the quality attributes of the ISO/IEC 25010:2023 standard.
Description
This research artifact contains:
- Human evaluation data for NFRs generated by 8 different LLMs across 34 functional requirements
- Responses from software engineering professionals, collected through the Turso database service
- Analysis scripts for data processing and statistical analysis
- LLM outputs in structured JSON format for all tested models
- Advanced prompting techniques incorporating ISO/IEC 25010:2023 standards
The study evaluates two key aspects:
- NFR Validity (1-5 scale): Coherence and appropriateness of generated NFRs
- Attribute Applicability (1-5 scale): Relevance of assigned ISO quality attributes
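Both aspects are rated on the same 1-5 scale, so a small guard can catch out-of-range values when score columns are read as text. The sketch below assumes scores are stored as whole numbers; the function name parseScore is hypothetical and is not part of the artifact's scripts.

```typescript
// Hypothetical helper: converts a raw text cell into a 1-5 score.
// Assumption: scores are whole numbers on the 1-5 scale described above.
function parseScore(raw: string): number {
  const value = Number(raw.trim());
  // NaN fails the Math.floor comparison, so non-numeric text is rejected too.
  const isWholeNumber = Math.floor(value) === value;
  if (!isWholeNumber || value < 1 || value > 5) {
    throw new Error("Score outside the 1-5 scale: " + JSON.stringify(raw));
  }
  return value;
}

console.log(parseScore(" 4 "));
```

Rejecting bad values early keeps later averages from being skewed by empty or malformed cells.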
Requirements
Software Dependencies
- Deno (JavaScript/TypeScript runtime) - Version 1.40+ recommended
- SQLite3 (for database operations)
- Standard text editor (for viewing TSV/JSON files)
Hardware Requirements
- RAM: Minimum 4GB (recommended 8GB)
- Storage: 500MB free space
- OS: Cross-platform (Linux, macOS, Windows)
Installation
- Install Deno:
# Linux/macOS
curl -fsSL https://deno.land/install.sh | sh
# Windows (PowerShell)
irm https://deno.land/install.ps1 | iex
- Verify Installation:
deno --version
- Clone/Download Artifact: Extract downloaded archive
Step-by-Step Instructions to Reproduce Paper Results
Step 1: Examine Raw Data Sources
Input: Professional evaluation data collected via Turso database service
- File: data/dump.sql
- Description: SQL dump containing responses from software engineering professionals who evaluated LLM-generated NFRs
- Content: Raw evaluation data including validity scores (1-5), applicability scores (1-5), and quality attribute assignments
Expected Output: Understanding of data collection methodology and raw response structure
Step 2: Generate Analysis Database
Purpose: Convert SQL dump to SQLite database for analysis
cd analysis
deno run --allow-read --allow-write generateData.ts
Process:
- The generateData.ts script reads data/dump.sql
- Creates the data/dump.db SQLite database
- Structures data for statistical analysis
Expected Output: data/dump.db file created (approximately 2-5MB)
Step 3: Process and Merge Evaluation Data
Purpose: Combine human evaluations with LLM assignments and generate final dataset
The generateData.ts script performs:
- Assignment Processing: Maps evaluators to specific FR-LLM combinations:
- NFR Validity evaluations: 10 evaluators × 3 FRs each × 8 LLMs
- Attribute Applicability evaluations: 10 evaluators × 3 FRs each × 8 LLMs
- Data Merging: Combines database records with assignment metadata
- TSV Generation: Outputs a structured TSV file for analysis
Expected Output: analysis/Human_Evaluation_Data.tsv (final dataset used in paper)
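Once the TSV exists, it can be inspected with a few lines of TypeScript. The following is a minimal sketch of a TSV reader, assuming the first line is a header row and that no field contains embedded tabs or line breaks; the column names in the sample are placeholders, and the real header in Human_Evaluation_Data.tsv may differ.

```typescript
// One record per data line, keyed by the header names.
type Row = Record<string, string>;

function parseTsv(text: string): Row[] {
  // Drop blank lines so a trailing newline does not produce an empty record.
  const lines = text.split(/\r?\n/).filter((line) => line.trim() !== "");
  if (lines.length === 0) return [];
  const header = lines[0].split("\t");
  return lines.slice(1).map((line) => {
    const cells = line.split("\t");
    const row: Row = {};
    header.forEach((name, i) => {
      // Missing trailing cells become empty strings.
      row[name] = cells[i] ?? "";
    });
    return row;
  });
}

// Made-up two-row sample; header names are assumptions for illustration.
const sample = "LLM model\tvalidity score\ngrok-2\t5\ngpt-4o-mini\t4\n";
const rows = parseTsv(sample);
console.log(rows.length, rows[0]["LLM model"]);
```

Under Deno, the input text could be loaded with `await Deno.readTextFile("analysis/Human_Evaluation_Data.tsv")`, which requires the --allow-read flag.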
Step 4: Analyze LLM Output Structure
Files: LLMOutputs/*.json (8 files, one per LLM)
- claude-3-5-haiku.json
- claude-3-7-sonnet.json
- deepSeek-V3.json
- gemini-1.5-pro.json
- gpt-4o-mini.json
- grok-2.json
- lama-3.3-70B.json
- Qwen2.5-72B.json
Expected Format for each FR:
{
  "functionalRequirement": "System shall allow users to log in with username and password",
  "identifiedNFRs": [
    {
      "attribute": "Security",
      "requirement": "The system must encrypt passwords using AES-256 encryption",
      "justification": "Login functionality requires secure credential handling"
    }
  ]
}
Analysis: Each JSON contains 34 FR entries with generated NFRs following ISO/IEC 25010:2023 categories
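The expected format above can be expressed as TypeScript types with a runtime check, which helps when loading the model outputs for inspection. This is a sketch based only on the field names shown in the example entry; the type and function names are hypothetical, and how entries are arranged at the top level of each file is not assumed here.

```typescript
// Field names are taken from the example entry; type names are illustrative.
interface IdentifiedNfr {
  attribute: string;
  requirement: string;
  justification: string;
}

interface FrEntry {
  functionalRequirement: string;
  identifiedNFRs: IdentifiedNfr[];
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function isIdentifiedNfr(value: unknown): value is IdentifiedNfr {
  return isRecord(value) &&
    typeof value["attribute"] === "string" &&
    typeof value["requirement"] === "string" &&
    typeof value["justification"] === "string";
}

// Returns true only when every field of the entry has the expected type.
function isFrEntry(value: unknown): value is FrEntry {
  if (!isRecord(value)) return false;
  const nfrs = value["identifiedNFRs"];
  return typeof value["functionalRequirement"] === "string" &&
    Array.isArray(nfrs) &&
    nfrs.every((item) => isIdentifiedNfr(item));
}

const parsed: unknown = JSON.parse(
  '{"functionalRequirement":"FR text","identifiedNFRs":' +
    '[{"attribute":"Security","requirement":"NFR text","justification":"Reason"}]}',
);
console.log(isFrEntry(parsed));
```

A check of this kind makes malformed model output visible before any counting or scoring is attempted.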
Step 5: Examine Prompt Engineering Approach
File: data/AdvancedPrompt.txt
Content: Complete prompt used for NFR generation, including:
- Role assignment (expert software quality engineer)
- Knowledge grounding (ISO/IEC 25010:2023 standard)
- Output structure constraints (JSON format)
- Quality requirements (specific, actionable, testable NFRs)
File Descriptions
Core Dataset Files
- analysis/Human_Evaluation_Data.tsv: Main evaluation dataset (2,240 evaluated NFRs)
- Columns: FR ID, FR text, NFR ID, LLM model, ISO attribute, NFR text, justification, validity score, applicability score, human attribute assignment, evaluator assignment type, evaluator ID
- data/FR_34.tsv: Subset of 34 functional requirements used for evaluation
- data/dump.sql: Raw SQL dump from Turso database service containing professional evaluations
LLM Output Files
- LLMOutputs/[model].json: Structured NFR generations for each of the 8 LLMs
- Each file contains 34 FR entries with associated NFRs in JSON format
Configuration Files
- data/AdvancedPrompt.txt: Complete prompt template with ISO/IEC 25010:2023 integration
- analysis/generateData.ts: Data processing script for database creation and TSV generation
Documentation
- LICENSE.md: Distribution rights and usage terms
- analysis/visualization.ipynb: Jupyter notebook for data visualization and statistical analysis
Mapping to Paper Claims
Key Paper Statistics (Section 6 - Results)
- 1,593 total NFRs generated across 8 LLMs and 34 FRs
- 174 NFRs evaluated for validity and applicability scoring
- 168 NFRs evaluated for attribute selection task
- Mean validity score: 4.63 (median: 5.0) on 1-5 scale
- Mean applicability score: 4.59 (median: 5.0) on 1-5 scale
- 80.4% attribute accuracy in expert vs. LLM attribute selection
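The mean and median figures above are plain descriptive statistics over the 1-5 scores. As a minimal sketch of how such values are computed (the sample scores are made up and are not taken from the dataset):

```typescript
// Arithmetic mean of a non-empty list of scores.
function mean(values: number[]): number {
  if (values.length === 0) throw new Error("mean of an empty list is undefined");
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Median: middle value of the sorted list, or the average of the two
// middle values when the list has an even length.
function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of an empty list is undefined");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Made-up scores for illustration only.
const demoScores = [5, 5, 4, 3, 5, 4];
console.log(mean(demoScores).toFixed(2), median(demoScores));
```

With the real validity or applicability column as input, these two functions should yield the values the paper reports, provided the same subset of evaluations is used.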
Figure Reproduction Mapping
- Figure 3: Reproduced from validity scores in Human_Evaluation_Data.tsv
- Shows 90.8% of NFRs scored ≥4, with 76.4% scoring a perfect 5
- Figure 4: Generated from applicability scores in Human_Evaluation_Data.tsv
- Demonstrates 90.2% highly applicable ratings (scores 4-5)
- Figure 5: Computed from attribute selection task data
- Visualizes 80.4% exact matches, 8.3% near misses, 11.3% complete mismatches
- Figure 6: Generated from LLM vs. expert attribute assignments
- Shows specific misclassification patterns (e.g., Functional Suitability vs. Reliability)
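The percentages behind these figures are simple proportions. The sketch below shows one way to compute the share of scores at or above a threshold (as in Figures 3 and 4) and the exact-match rate between LLM-assigned and expert-assigned attributes (as in Figure 5). The function names and sample values are hypothetical, and the "near miss" category is left out because its definition is not given here.

```typescript
// Fraction of scores that are greater than or equal to the threshold.
function shareAtLeast(scores: number[], threshold: number): number {
  if (scores.length === 0) throw new Error("no scores given");
  return scores.filter((s) => s >= threshold).length / scores.length;
}

// Fraction of pairs where the LLM attribute equals the expert attribute.
function exactMatchRate(pairs: Array<[string, string]>): number {
  if (pairs.length === 0) throw new Error("no pairs given");
  return pairs.filter(([llm, expert]) => llm === expert).length / pairs.length;
}

// Made-up inputs for illustration only.
const sampleScores = [5, 4, 3, 5, 2];
const samplePairs: Array<[string, string]> = [
  ["Security", "Security"],
  ["Reliability", "Functional Suitability"],
  ["Usability", "Usability"],
  ["Security", "Security"],
];
console.log(shareAtLeast(sampleScores, 4), exactMatchRate(samplePairs));
```

Multiplying either result by 100 gives a percentage comparable to the figures.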
Table Reproduction Mapping
- Table 4 (LLM Comparison): Directly derived from Human_Evaluation_Data.tsv grouped by LLM model
- Validity ranges: 3.96 (claude-3-7-sonnet) to 4.94 (llama-3.3-70B)
- Applicability ranges: 3.67 (claude-3-7-sonnet) to 4.97 (grok-2)
- Attribute accuracy ranges: 71.4% (deepSeek-V3) to 90.9% (gemini-1.5-pro)
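Grouping by LLM model, as described for Table 4, amounts to accumulating a sum and a count per model and dividing. The following sketch illustrates the idea; the Rating shape and the sample scores are assumptions made for illustration, not the artifact's actual data structures.

```typescript
// Hypothetical record shape: one validity rating for one model.
interface Rating {
  model: string;
  validity: number;
}

// Returns the mean validity score for each model name.
function meanByModel(ratings: Rating[]): Record<string, number> {
  const sums: Record<string, number> = {};
  const counts: Record<string, number> = {};
  for (const r of ratings) {
    sums[r.model] = (sums[r.model] ?? 0) + r.validity;
    counts[r.model] = (counts[r.model] ?? 0) + 1;
  }
  const means: Record<string, number> = {};
  for (const model of Object.keys(sums)) {
    means[model] = sums[model] / counts[model];
  }
  return means;
}

// Made-up ratings; model names are borrowed from the file list for readability.
const sampleRatings: Rating[] = [
  { model: "grok-2", validity: 5 },
  { model: "grok-2", validity: 4 },
  { model: "gpt-4o-mini", validity: 3 },
];
console.log(meanByModel(sampleRatings));
```

The same pattern applies to applicability scores by swapping the field that is summed.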
Research Questions Validation
- RQ1 (LLM Effectiveness): Validated through high validity (90.8% ≥4) and applicability (90.2% ≥4) scores
- RQ2 (Best Performing LLM): Answered via the Table 4 comparison, which shows gemini-1.5-pro with the highest attribute accuracy, llama-3.3-70B with the highest validity, and grok-2 with the highest applicability
- RQ3 (Prompting Technique Impact): Demonstrated through advanced vs. baseline prompting comparison
Methodology Reproduction (Section 4)
- 34 FRs Selection: Subset available in data/FR_34.tsv
- 8 LLM Configuration: Models and parameters detailed in Table 3, outputs in LLMOutputs/*.json
- Evaluation Framework: 10 evaluators with 13 years average experience, dual-task design
- Custom Prompting: Complete advanced prompt available in data/AdvancedPrompt.txt
Statistical Claims Verification
- Sample Size Calculation: Based on 15 SRS documents analysis (Table 2) yielding 33.5 average FRs
- Expert Evaluation Distribution: Task 1 (32 FRs, 174 NFRs) and Task 2 (2 FRs, 168 NFRs)
- Temperature Setting: 0.4 selected through systematic testing (Section 4.3.2)
- Quality Assessment: Ordinal scale (1-5) with specific rubrics for validity and applicability
Data Provenance
- Professional Collection: Software engineering professionals recruited through academic networks
- Turso Database: Cloud database service used for response collection and management
- Assignment Strategy: Balanced design ensuring each FR-LLM combination evaluated by multiple professionals
- Quality Control: Validation checks implemented in data processing pipeline
Expected Results Summary
When following the reproduction steps, you should observe:
- 2,240 total NFR evaluations across 8 LLMs and 34 FRs
- Validity scores ranging 1-5 with LLM-specific distributions
- Applicability scores showing attribute-specific patterns
- JSON-structured LLM outputs demonstrating prompt effectiveness
- Professional evaluation data providing ground truth for NFR quality assessment
This artifact enables full reproduction of the paper's experimental methodology and statistical findings regarding LLM performance in automated NFR generation.
Troubleshooting
Common Issues
- Deno not found: Ensure Deno is properly installed and added to your PATH
- Permission errors: Use the --allow-read --allow-write flags when running Deno scripts
- Database file not found: Ensure generateData.ts has been run to create the database from dump.sql
Support
For questions about the artifact or reproduction steps, please refer to the paper or contact the authors through the conference proceedings.
Files
- NFR generation Artifacts.zip (2.0 MB), md5:625b3d6d8246cf76d018237bf90482d3
Additional details
Dates
- Submitted: 2025-03-10
- Accepted: 2025-09-06
Software
- Programming language: TypeScript
- Development Status: Active