Published August 27, 2025 | Version v1
Dataset Open

Dataset and Code for 'Feature Extraction-based Clustering Selection Methodology to Identify Representative Buildings for Scalable Energy Simulations'

  • 1. Aalto University
  • 2. City of Helsinki
  • 3. Tallinn University of Technology

Description

This repository contains the anonymized dataset and reproducible code accompanying the paper:

“Feature extraction-based clustering selection methodology to identify representative buildings for scalable energy simulations.”

The dataset includes:

  • epc_metadata.csv: EPC-based building attributes (energy performance value, weighted U-value, air leakage, heat recovery)

  • dh_merged_kwh_per_m2.csv: hourly district heating consumption, normalized per m²

  • outdoor_temp_2023.csv: hourly outdoor temperature data (2023)

  • features_epc_only_standardized.csv: standardized feature table used for clustering

  • data_dictionary.xlsx: description of all variables
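
The feature table above is described as standardized, but this record does not say which scaler was used. As a minimal sketch, assuming plain z-score standardization with the population standard deviation, the transformation could look like the following. The column names and values are hypothetical; the real variable names are documented in data_dictionary.xlsx.

```python
import statistics


def standardize(columns):
    """Z-score each column: subtract the mean, divide by the population std."""
    scaled = {}
    for name, values in columns.items():
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)
        # A constant column has zero spread; map it to zeros instead of dividing by zero.
        scaled[name] = [0.0 if std == 0 else (v - mean) / std for v in values]
    return scaled


# Hypothetical toy values, not taken from epc_metadata.csv.
raw = {
    "u_value": [0.3, 0.5, 0.7],
    "air_leakage": [2.0, 2.0, 2.0],
}
scaled = standardize(raw)
```

After this step each non-constant column has mean 0 and unit variance, so no single EPC attribute dominates a distance-based clustering purely because of its units.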

The code provides a minimal pipeline to reproduce the main results of the paper:

  1. Load and preprocess data

  2. Extract EPC-based features

  3. Run clustering (K-Medoids, Agglomerative, GMM)

  4. Compute validation metrics and agreement indices
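
To illustrate step 3, here is a minimal, dependency-free sketch of K-Medoids using the simple alternating scheme (assign points to the nearest medoid, then re-pick each medoid). It is an assumption-laden illustration, not the implementation shipped in the bundle, which may use a different variant or a library. The function name `k_medoids` and the toy points are hypothetical.

```python
import math
import random


def k_medoids(points, k, max_iter=100, seed=0):
    """Alternating K-Medoids; returns (medoid indices, cluster label per point)."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)

    def assign(current):
        # Each point joins the cluster of its closest medoid (Euclidean distance).
        return [
            min(range(k), key=lambda c: math.dist(p, points[current[c]]))
            for p in points
        ]

    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                # Keep the old medoid if duplicates left this cluster empty.
                new_medoids.append(medoids[c])
                continue
            # The new medoid is the member with the smallest total distance to the others.
            best = min(
                members,
                key=lambda i: sum(math.dist(points[i], points[j]) for j in members),
            )
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(medoids)


# Hypothetical 2-D feature vectors forming two well-separated groups.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
medoids, labels = k_medoids(points, k=2)
```

Because every medoid is an actual data point, each one can be read directly as a candidate representative building for its cluster, which is what makes medoid-based methods a natural fit for selecting buildings to simulate.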

Expected outputs (e.g., validation_metrics.csv, agreement_metrics.csv) reproduce the main tables in the manuscript.
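
This record does not name the agreement indices written to agreement_metrics.csv. As one common choice, the adjusted Rand index (ARI) compares two clusterings of the same buildings while correcting for chance agreement. The sketch below is a self-contained illustration under that assumption; the function name and label lists are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """ARI between two labelings of the same items (1.0 means identical partitions)."""
    n = len(labels_a)
    if n != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    if n < 2:
        return 1.0
    # Count item pairs that share a cluster in both, in A only, and in B only.
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = in_a * in_b / comb(n, 2)
    max_index = (in_a + in_b) / 2
    if max_index == expected:
        # Degenerate case: both partitions are trivial (one cluster or all singletons).
        return 1.0
    return (both - expected) / (max_index - expected)


# Hypothetical labels from two algorithms; cluster IDs need not match, only the grouping.
same_grouping = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_grouping = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

An index like this is invariant to relabeling, so it can compare, for example, a K-Medoids result with an Agglomerative or GMM result without first aligning cluster numbers.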

Anonymization: all identifiers (school names, locations, addresses) are removed. Only numerical features are included.

Licenses:

  • Code: MIT License

  • Data: Creative Commons Attribution 4.0 International (CC BY 4.0)

Related resources:
The GitHub repository with the same reproducible pipeline is available at:
https://github.com/hatefh/zenodo_code_bundle_epc_clustering_paper

Files

zenodo_code_bundle_v1.0.0.zip (1.3 MB)
md5:f97519d6d53cc78627dd1f000bcbb934
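
The MD5 checksum listed above can be used to confirm that a downloaded archive is intact. The following is a minimal sketch using Python's standard `hashlib` module; the helper names and the local file path in the usage comment are assumptions for illustration.

```python
import hashlib

# Checksum as listed on this record for zenodo_code_bundle_v1.0.0.zip.
EXPECTED_MD5 = "f97519d6d53cc78627dd1f000bcbb934"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """True if the file's MD5 matches the expected checksum (case-insensitive)."""
    return md5_of_file(path) == expected.lower()


# Usage, assuming the archive was saved in the current directory:
# verify_download("zenodo_code_bundle_v1.0.0.zip")
```

MD5 is adequate here for detecting a corrupted or truncated download; it is not a safeguard against deliberate tampering.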

Additional details

Software

Programming language: Python
Development status: Active