Published October 4, 2024 | Version v1
Dataset Open

A large synthetic dataset for machine learning applications in power transmission grids

  • 1. HES-SO Valais-Wallis
  • 2. University of Geneva

Description

With the ongoing energy transition, power grids are evolving fast. They operate more and more often close to their technical limits, under more and more volatile conditions. Fast, essentially real-time computational approaches to evaluate their operational safety, stability and reliability are therefore highly desirable. Machine learning methods have been advocated to solve this challenge. However, they are heavy consumers of training and testing data, while historical operational data for real-world power grids are hard, if not impossible, to access.

This dataset contains long time series for production, consumption, and line flows, amounting to 20 years of data with a time resolution of one hour, for several thousand loads and several hundred generators of various types representing the ultra-high-voltage transmission grid of continental Europe. The synthetic time series have been statistically validated against real-world data.

Data generation algorithm

The algorithm is described in a Nature Scientific Data paper. It relies on the PanTaGruEl model of the European transmission network -- the admittance of its lines as well as the location, type and capacity of its power generators -- and aggregated data gathered from the ENTSO-E transparency platform, such as power consumption aggregated at the national level.

Network

The network information is encoded in the file europe_network.json. It is given in PowerModels format, which is itself derived from MatPower and compatible with PandaPower. The network features 7822 power lines and 553 transformers connecting 4097 buses, to which are attached 815 generators of various types.
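As a rough sketch of how these counts could be checked, the snippet below counts the entries in the main component tables of a PowerModels-style dictionary. The key names "bus", "branch" and "gen" follow the usual PowerModels layout but are an assumption about this particular file, and the helper count_components as well as the toy network are purely illustrative.

```python
import json


def count_components(network: dict) -> dict:
    """Count entries in the main component tables of a PowerModels-style dict."""
    # Assumed key names: "bus", "branch" (lines and transformers together), "gen".
    return {key: len(network.get(key, {})) for key in ("bus", "branch", "gen")}


# Toy network standing in for the real file (hypothetical content).
toy_network = json.loads(
    '{"bus": {"1": {}, "2": {}}, "branch": {"1": {}}, "gen": {"1": {}}}'
)
counts = count_components(toy_network)
print(counts)

# With the real file, one would instead load it first:
# with open("europe_network.json") as f:
#     counts = count_components(json.load(f))
```

If the assumed layout holds, the real file should give 4097 buses, 8375 branches (7822 lines plus 553 transformers) and 815 generators.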

Time series

The time series forming the core of this dataset are given in CSV format. Each CSV file is a table with 8736 rows, one for each hourly time step of a 364-day year. All years are truncated to exactly 52 weeks of 7 days, and start on a Monday (the load profiles are typically different during weekdays and weekends). The number of columns depends on the type of table: there are 4097 columns in load files, 815 for generators, and 8375 for lines (including transformers). Each column is described by a header corresponding to the element identifier in the network file. All values are given in per-unit, both in the model file and in the tables, i.e. they are multiples of a base unit taken to be 100 MW.
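The per-unit convention can be illustrated with a minimal sketch that converts a table to megawatts using the 100 MW base stated above. The function name to_megawatts and the toy table are hypothetical; a real table would have 8736 rows (24 × 364).

```python
import pandas as pd

BASE_MW = 100.0          # base unit stated in the dataset description
EXPECTED_ROWS = 24 * 364  # hourly time steps in a 52-week year


def to_megawatts(table_pu: pd.DataFrame, base_mw: float = BASE_MW) -> pd.DataFrame:
    """Convert a per-unit table to megawatts by scaling with the base unit."""
    return table_pu * base_mw


# Toy table with two columns and two time steps (hypothetical values).
toy_pu = pd.DataFrame({"1": [0.5, 1.25], "2": [0.0, 3.0]})
toy_mw = to_megawatts(toy_pu)
print(toy_mw)
```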

There are 20 tables of each type, labeled with a reference year (2016 to 2020) and an index (1 to 4), zipped into archive files arranged by year. This amounts to a total of 20 years of synthetic data. When using load, generator, and line profiles together, it is important to use the same label: for instance, the files loads_2020_1.csv, gens_2020_1.csv, and lines_2020_1.csv represent the same year of the dataset, whereas gens_2020_2.csv is unrelated (it actually shares some features, such as nuclear profiles, but it is based on a dispatch with distinct loads).
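One way to avoid mixing labels is to derive the three file names from a single (year, index) pair. The sketch below follows the naming pattern of the examples above; the helper table_files is a hypothetical convenience, not part of the dataset tooling.

```python
from itertools import product

YEARS = range(2016, 2021)  # reference years 2016 to 2020
INDICES = range(1, 5)      # indices 1 to 4


def table_files(year: int, index: int) -> dict:
    """Return the matching loads, gens and lines file names for one label."""
    return {kind: f"{kind}_{year}_{index}.csv" for kind in ("loads", "gens", "lines")}


# All 20 labels of the dataset, and the consistent file set for one of them.
all_labels = list(product(YEARS, INDICES))
files_2020_1 = table_files(2020, 1)
print(len(all_labels), files_2020_1)
```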

Usage

The time series can be used without a reference to the network file, simply using all or a selection of columns of the CSV files, depending on the needs. We show below how to select series from a particular country, or how to aggregate hourly time steps into days or weeks. These examples use Python and the data analysis library pandas, but other frameworks can be used as well (Matlab, Julia). Since all the yearly time series are periodic, it is always possible to define a coherent time window modulo the length of the series.
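The remark on periodicity can be sketched as follows: a window that runs past the end of a table wraps around to its beginning by taking row positions modulo the table length. The function periodic_window and the toy table are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def periodic_window(table: pd.DataFrame, start: int, length: int) -> pd.DataFrame:
    """Select `length` consecutive rows from `start`, wrapping around the end."""
    rows = np.arange(start, start + length) % len(table)
    return table.iloc[rows].reset_index(drop=True)


# Toy table of 5 time steps; a real table has 8736 rows.
toy = pd.DataFrame({"1": [10, 11, 12, 13, 14]})
window = periodic_window(toy, start=3, length=4)
print(window["1"].tolist())
```

On a real table, a call such as periodic_window(hourly_loads, start=8700, length=168) would return a full week that straddles the end of the year.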

Selecting a particular country

This example illustrates how to select generation data for Switzerland in Python. This can be done without parsing the network file, but using instead gens_by_country.csv, which contains a list of all generators for any country in the network. We start by importing the pandas library, and read the column of the file corresponding to Switzerland (country code CH):

import pandas as pd
CH_gens = pd.read_csv('gens_by_country.csv', usecols=['CH'], dtype=str)

The object created in this way is a DataFrame with some null values (not all countries have the same number of generators). It can be turned into a list with:

CH_gens_list = CH_gens.dropna().squeeze().to_list()

Finally, we can import all the time series of Swiss generators from a given data table with:

pd.read_csv('gens_2016_1.csv', usecols=CH_gens_list)

The same procedure can be applied to loads using the list contained in the file loads_by_country.csv.
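The three steps above can be wrapped into a single helper that works for both generators and loads. This is a sketch under assumptions: the name read_country_series is hypothetical, and the toy CSV content below only mimics the layout described above (one column per country, padded with empty cells). The column is selected by name instead of calling squeeze(), so that a country with a single element still yields a list.

```python
import io

import pandas as pd


def read_country_series(table_csv, by_country_csv, country: str) -> pd.DataFrame:
    """Read from a data table only the columns that belong to one country."""
    ids = pd.read_csv(by_country_csv, usecols=[country], dtype=str)[country]
    id_list = ids.dropna().to_list()  # drop padding cells of shorter columns
    return pd.read_csv(table_csv, usecols=id_list)


# Toy stand-ins for gens_by_country.csv and a gens table (hypothetical content).
by_country = io.StringIO("CH,FR\n1,2\n,3\n")
table = io.StringIO("1,2,3\n0.1,0.2,0.3\n0.4,0.5,0.6\n")
ch_series = read_country_series(table, by_country, "CH")
print(ch_series)
```

With the real files, the call would look like read_country_series('loads_2016_1.csv', 'loads_by_country.csv', 'CH').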

Averaging over time

This second example shows how to change the time resolution of the series. Suppose that we are interested in all the loads from a given table, which are given by default with a one-hour resolution:

hourly_loads = pd.read_csv('loads_2018_3.csv')

To get a daily average of the loads, we can use: 

daily_loads = hourly_loads.groupby([t // 24 for t in range(24 * 364)]).mean()

This results in series of length 364. To average further over entire weeks and get series of length 52, we use: 

weekly_loads = hourly_loads.groupby([t // (24 * 7) for t in range(24 * 364)]).mean()
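Because every table starts on a Monday, the day of the week can be recovered from the row position alone. The sketch below uses this to compare average weekday and weekend values; the function name weekday_weekend_means and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def weekday_weekend_means(hourly: pd.DataFrame) -> pd.DataFrame:
    """Average each column separately over weekday and weekend hours."""
    t = np.arange(len(hourly))
    day_of_week = (t // 24) % 7  # 0 = Monday ... 6 = Sunday (tables start on a Monday)
    labels = np.where(day_of_week < 5, "weekday", "weekend")
    return hourly.groupby(labels).mean()


# Toy week: value 1.0 during weekday hours and 3.0 during weekend hours.
hours = np.arange(24 * 7)
toy_hourly = pd.DataFrame({"x": np.where((hours // 24) % 7 < 5, 1.0, 3.0)})
means = weekday_weekend_means(toy_hourly)
print(means)
```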

Source code

The code used to generate the dataset is freely available at https://github.com/GeeeHesso/PowerData. It consists of two packages and several documentation notebooks. The first package, written in Python, provides functions to handle the data and to generate synthetic series based on historical data. The second package, written in Julia, is used to perform the optimal power flow. The documentation, in the form of Jupyter notebooks, contains numerous examples of how to use both packages. The entire workflow used to create this dataset is also provided, starting from raw ENTSO-E data files and ending with the synthetic dataset given in the repository.

Funding

This work was supported by the Cyber-Defence Campus of armasuisse and by an internal research grant of the Engineering and Architecture domain of HES-SO.

Files

europe_network.png

Files (17.8 GB)

Checksum Size
md5:e47d30d1992160011871bd4da1373885 4.3 MB
md5:d21bc6901a010bfd5bbf3d7c3bc940c3 492.4 kB
md5:7eca87c8e48815318244a85f43b770a5 125.3 MB
md5:da71157da136e95c222f944ef9b8efa5 125.2 MB
md5:3c0a4dffd0ff432958c2f7ada24f9fde 125.8 MB
md5:499e964d85298967498ccaa68e01855e 125.6 MB
md5:df03da605b63fd98a4c19bee593e40a7 125.8 MB
md5:dec6cc0486f75b4e9fefd2bb9ee0c360 5.3 kB
md5:a5f53687434fc8fa10fc7851bb80bc7a 2.1 GB
md5:43c5746bc830f31570dc4c76de076be4 2.1 GB
md5:f1a7abb55d04acb1e19bdacbb08dae0c 2.1 GB
md5:5114aca295b45fd6411ff1e287e746e8 2.1 GB
md5:9f7729c48784b84c12a856515d4c32cc 2.1 GB
md5:b03079b2d595bf43e7e60673a373b782 1.3 GB
md5:bdb95750f155d51b02440bb14e76129d 1.3 GB
md5:dd7e2e6d472b9b21bb5699185f6adf50 1.3 GB
md5:21f86e74f50cc66bd1566575fbc26f92 1.3 GB
md5:7f8d37d9f028748d75306f6ff5b7a5fa 1.3 GB
md5:137f7dce9b7dc21fc55ab7d29175c141 38.8 kB

Additional details

Related works

Is described by
Data paper: 10.1038/s41597-025-04479-x (DOI)
References
Dataset: 10.5281/zenodo.2642175 (DOI)

Software

Repository URL
https://github.com/GeeeHesso/PowerData
Programming language
Python, Julia, Jupyter Notebook
Development Status
Active