Published November 16, 2025 | Version v5

Are We SOLID Yet? An Empirical Study on Prompting LLMs to Detect Design Principle Violations

  • Bilkent University

Description

Replication package for "Are We SOLID Yet?", accepted at the ASE 2025 NIER track.

 

@misc{pehlivan2025solidyetempiricalstudy,
      title={Are We SOLID Yet? An Empirical Study on Prompting LLMs to Detect Design Principle Violations}, 
      author={Fatih Pehlivan and Arçin Ülkü Ergüzen and Sahand Moslemi Yengejeh and Mayasah Lami and Anil Koyuncu},
      year={2025},
      eprint={2509.03093},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2509.03093}, 
}

Background & Motivation

Developers often overlook SOLID principles, leading to maintainability and scalability issues. While LLMs show promise in code analysis, their effectiveness in detecting OOP violations remains unclear.

Objectives

This project aims to develop a locally deployable LLM-based tool that detects violations of:

  • Single Responsibility Principle (SRP)
  • Open-Closed Principle (OCP)
  • Liskov Substitution Principle (LSP)
  • Interface Segregation Principle (ISP)
  • Dependency Inversion Principle (DIP)

Our goal is to systematically evaluate LLM performance across principles, programming languages, and prompt strategies.

Dataset

We provide 240 synthetic examples covering all five SOLID principles, across:

  • 4 programming languages: Java, Python, Kotlin, and C#
  • 3 difficulty levels: Easy, Moderate, Hard
  • 4 scenarios per principle × 3 levels × 4 languages × 5 principles = 240 examples

Table: Representative Scenarios Used to Generate SOLID Violation Examples in the Dataset

SRP

  • User Database: A class handling both user persistence and email notifications.
  • Salary Payslip: An employee class calculating salary and printing pay slips.
  • File Archiving: A processor handling file content, archiving, and history tracking.
  • Product Discount: A product class applying discounts and managing display logic.

OCP

  • Payment Processing: A processor requiring modification to add new payment methods.
  • Customer Registration: A service requiring changes to support new customer types.
  • Document Notification: A service requiring modification for new notification channels.
  • Report Exporting: An exporter requiring modification to add new export formats.

DIP

  • Email Service: A service with a direct, hardcoded dependency on a MySQLDatabase class.
  • Payment Processing: A system tightly coupled to a specific PayPalGateway.
  • File Processing: A processor with hardcoded dependencies on file system operations.
  • Notification System: A service with a direct dependency on a TwilioSMS provider.

LSP

  • Vehicle Fuel System: An ElectricVehicle subclass violating the expected refuel() behavior.
  • Payment Processing: A CashProcessor subclass improperly restricting valid payment amounts.
  • Bird Flying Behavior: A Penguin subclass breaking the inherited fly() contract.
  • Document Processing: A ReadOnlyDocument subclass breaking the inherited save() contract.

ISP

  • Game Development: A mage class forced to implement an unused meleeAttack() method.
  • Restaurant Management: A waiter class forced to implement an irrelevant cookFood() method.
  • Vehicle Control System: A car class forced to implement an irrelevant fly() method.
  • Music Player System: A CDPlayer implementing a "fat" interface with unnecessary methods.
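To make the scenario style concrete, here is a minimal sketch in the spirit of the "User Database" SRP scenario. It is an illustration only, not an actual dataset sample; the class and method names (`UserManager`, `UserRepository`, `EmailNotifier`) are invented for this example.

```python
class UserManager:
    """SRP violation: one class owns both persistence and notification."""

    def save_user(self, user: dict) -> str:
        # Persistence concern
        return f"saved {user['name']}"

    def send_welcome_email(self, user: dict) -> str:
        # Notification concern, unrelated to persistence
        return f"emailed {user['email']}"


# SRP-compliant split: each class has a single reason to change.
class UserRepository:
    def save(self, user: dict) -> str:
        return f"saved {user['name']}"


class EmailNotifier:
    def send_welcome(self, user: dict) -> str:
        return f"emailed {user['email']}"
```

The dataset pairs each violating class (like `UserManager`) with a non-violating counterpart in this refactored style.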

Methodology Overview

(Methodology figure omitted in this text version; see the paper.)

Results

(Results figures omitted in this text version; see the paper and evaluation_final/.)

Dataset Location

  • The replication package for the research is located in dataset/.
  • Each .json file in dataset/ includes 48 records (4 languages × 3 difficulties × 4 scenarios). Each record contains exactly one violation and is paired with its non-violating version. These files are the manually prepared ground truth.
  • dataset/clean_code_pipeline.py, dataset/processing_pipeline.py, and dataset/known_violation_pipeline.py generate the files in dataset/output/ via the API (with the server running locally) for all strategies.
  • dataset/clean_code_pipeline.py uses already-clean code as input.
  • dataset/processing_pipeline.py uses violating code without specifying the ground-truth violation. It generates files with the same names as those from dataset/clean_code_pipeline.py.
  • dataset/known_violation_pipeline.py uses violating code and includes the ground-truth violation type in the prompt.
  • dataset/creation_scenarios.md: Describes the violation scenarios used to create the dataset.
  • dataset/groundtruth/: Ground-truth labeled samples.
  • dataset/completions/test/: LLM outputs on test examples.
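The 48-records-per-file arithmetic can be sanity-checked with a short script. The enumeration below is a sketch only: the real files in dataset/ have their own schema, which this snippet does not reproduce.

```python
import itertools

# Dataset dimensions as stated above; one record per combination.
LANGUAGES = ["Java", "Python", "Kotlin", "C#"]
DIFFICULTIES = ["Easy", "Moderate", "Hard"]
SCENARIOS_PER_PRINCIPLE = 4
PRINCIPLES = ["SRP", "OCP", "LSP", "ISP", "DIP"]

records_per_principle = len(list(itertools.product(
    LANGUAGES, DIFFICULTIES, range(SCENARIOS_PER_PRINCIPLE))))
print(records_per_principle)                    # 48 records per .json file
print(records_per_principle * len(PRINCIPLES))  # 240 examples in total
```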

Evaluation Methodology

  • Each sample's detected violation is compared against its ground truth label.
  • Performance metrics include accuracy and F1-score, broken down by model, principle, language, and difficulty.
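The metric computation can be sketched as below. This is a plain-Python illustration of accuracy and macro-F1 over violation labels, not the code from evaluation_final/final_analysis.py, which performs the actual evaluation.

```python
def accuracy(expected: list[str], detected: list[str]) -> float:
    """Fraction of samples whose detected violation matches the ground truth."""
    return sum(e == d for e, d in zip(expected, detected)) / len(expected)


def macro_f1(expected: list[str], detected: list[str], labels: list[str]) -> float:
    """Unweighted mean of per-label F1 scores."""
    scores = []
    for label in labels:
        tp = sum(e == label and d == label for e, d in zip(expected, detected))
        fp = sum(e != label and d == label for e, d in zip(expected, detected))
        fn = sum(e == label and d != label for e, d in zip(expected, detected))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * precision * recall / (precision + recall)
                      if precision + recall else 0.0)
    return sum(scores) / len(scores)


# Toy example: one LSP sample misclassified as DIP.
expected = ["SRP", "OCP", "LSP", "SRP"]
detected = ["SRP", "OCP", "DIP", "SRP"]
print(accuracy(expected, detected))  # 0.75
```

The same functions can be applied per model, principle, language, or difficulty by filtering the sample lists before scoring.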

Evaluation Artifacts

  • evaluation_final/: Contains all accuracy/F1 plots and CSVs by model, strategy, language, and level.
  • detailed_results_final.json: Full sample-level results including model, strategy, expected/detected violations, language, and difficulty level.

Steps to Reproduce:

  • Run dataset/clean_code_pipeline.py, dataset/processing_pipeline.py, and dataset/known_violation_pipeline.py to generate the files in dataset/output/ via the API (with the server running locally) for all strategies.
  • Then run manual_evaluation/violation_comparison.py to search the raw model responses using our specified regex. This produces detailed_results.json with all results, as well as failed_extraction_for_review.json and multiple_violations_for_review.json for manual review.
  • After the manual review, update the JSON file with the new violation_match entries by running match_dataset.py. This yields the final file ready for evaluation, detailed_results_final.json.
  • For evaluation, simply run evaluation_final/final_analysis.py. To trace every step of the evaluation, you can also run evaluation_final/evaluation_traceable/final_analysis_traceability.py.
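The regex-based extraction step can be sketched as follows. The pattern here is an assumption for illustration; the actual expression and review logic live in manual_evaluation/violation_comparison.py.

```python
import re

# Hypothetical pattern: match any SOLID acronym as a whole word.
VIOLATION_RE = re.compile(r"\b(SRP|OCP|LSP|ISP|DIP)\b")


def extract_violations(raw_response: str) -> list[str]:
    """Return the distinct SOLID acronyms mentioned in a raw model response,
    in order of first appearance."""
    seen = []
    for match in VIOLATION_RE.findall(raw_response):
        if match not in seen:
            seen.append(match)
    return seen


raw = "The class violates SRP because it mixes persistence and email logic (SRP)."
print(extract_violations(raw))  # ['SRP']
```

Responses yielding no match would go to the failed-extraction review file, and responses naming more than one principle to the multiple-violations review file.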

Structure of the Project

AreWeSOLID/
├── __pycache__/                              # Python cache files
├── .idea/                                    # IDE configuration files
├── analytic_reports_trials/                  # Trials for output analysis (not used for final evaluation)
├── dataset/                                  # Core dataset and processing scripts
│   ├── groundtruth/                          # Ground-truth labeled samples
│   ├── completions/test/                     # LLM outputs on test examples
│   │   └── output/                           # Generated outputs from pipelines
│   ├── clean_code_pipeline.py                # Pipeline for clean code processing
│   ├── processing_pipeline.py                # Main processing pipeline
│   ├── known_violation_pipeline.py           # Pipeline with known violations
│   └── creation_scenarios.md                 # Violation scenarios documentation
├── evaluation_final/                         # Evaluation results and analysis
│   ├── evaluation_traceable/                 # Traceable evaluation scripts
│   ├── final_analysis.py                     # Main evaluation script
│   └── [accuracy/F1 plots and CSVs by model, strategy, language, level]
├── manual_evaluation/                        # Manual evaluation tools
│   ├── violation_comparison.py               # Regex-based violation detection
│   ├── failed_extraction_for_review_v5.json  # Outputs to be manually reviewed for failed extraction
│   ├── multiple_violations_for_review_v5.json # Outputs to be manually reviewed for multiple violations
│   └── SOLID_Violation_Cases_for_Manual_Review_complete.csv  # Completed manual review
├── plots/                                    # Generated visualization plots
├── viewer/                                   # Data visualization tools
├── calculate_metrics.py                      # Metrics calculation script
├── complexity_analysis_report.txt            # Complexity analysis results
├── cyclo_complexity.py                       # Cyclomatic complexity analysis
├── cyclomatic_complexity_results.csv         # Complexity results data
├── detailed_results_final.json               # Final output results after manual evaluation
├── llm-request.py                            # LLM API request handler
├── match_dataset.py                          # Dataset matching utilities
├── plot_dataset_analytics.py                 # Dataset analytics plotting
├── sequence.diagram.md                       # System sequence diagrams
└── README.md                                 # Project documentation

Files

  • AreWeSOLID.zip — 76.1 MB (md5:4c17cf854a55747da202e95d20217cf3)
