A large synthetic dataset for machine learning applications in power transmission grids
Description
With the ongoing energy transition, power grids are evolving fast. They operate ever closer to their technical limits, under increasingly volatile conditions. Fast, essentially real-time computational approaches to evaluate their operational safety, stability, and reliability are therefore highly desirable. Machine learning methods have been advocated to address this challenge; however, they are heavy consumers of training and testing data, while historical operational data for real-world power grids are hard, if not impossible, to access.
This dataset contains long time series for production, consumption, and line flows, amounting to 20 years of data with a time resolution of one hour, for several thousand loads and several hundred generators of various types representing the ultra-high-voltage transmission grid of continental Europe. The synthetic time series have been statistically validated against real-world data.
Data generation algorithm
The algorithm is described in a Nature Scientific Data paper. It relies on the PanTaGruEl model of the European transmission network -- the admittance of its lines as well as the location, type and capacity of its power generators -- and aggregated data gathered from the ENTSO-E transparency platform, such as power consumption aggregated at the national level.
Network
The network information is encoded in the file europe_network.json. It is given in PowerModels format, which is itself derived from MatPower and compatible with PandaPower. The network features 7822 power lines and 553 transformers connecting 4097 buses, to which 815 generators of various types are attached.
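As a minimal sketch (assuming the standard PowerModels JSON layout, in which each component type is stored as a dictionary keyed by its identifier), the network file can be inspected directly in Python:

import json

with open('europe_network.json') as f:
    network = json.load(f)

# Component counts: buses, generators, and branches (lines and transformers together)
print(len(network['bus']), len(network['gen']), len(network['branch']))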
Time series
The time series forming the core of this dataset are given in CSV format. Each CSV file is a table with 8736 rows, one for each hourly time step of a 364-day year. All years are truncated to exactly 52 weeks of 7 days and start on a Monday (load profiles typically differ between weekdays and weekends). The number of columns depends on the type of table: there are 4097 columns in load files, 815 in generator files, and 8375 in line files (including transformers). Each column is described by a header corresponding to the element identifier in the network file. All values are given in per-unit, both in the model file and in the tables, i.e. they are multiples of a base power taken to be 100 MW.
There are 20 tables of each type, labeled with a reference year (2016 to 2020) and an index (1 to 4), zipped into archive files arranged by year. This amounts to a total of 20 years of synthetic data. When using load, generator, and line profiles together, it is important to use the same label: for instance, the files loads_2020_1.csv, gens_2020_1.csv, and lines_2020_1.csv represent the same year of the dataset, whereas gens_2020_2.csv is unrelated (it actually shares some features, such as nuclear profiles, but it is based on a dispatch with distinct loads).
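For example, the three tables sharing the label 2020_1 can be loaded together and converted from per-unit to MW as follows (a simple sketch using pandas):

import pandas as pd

BASE_POWER_MW = 100  # per-unit base power

# Load the three tables belonging to the same synthetic year
loads = pd.read_csv('loads_2020_1.csv')
gens = pd.read_csv('gens_2020_1.csv')
lines = pd.read_csv('lines_2020_1.csv')

# Convert per-unit values to MW
loads_mw = loads * BASE_POWER_MW
gens_mw = gens * BASE_POWER_MW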
Usage
The time series can be used without reference to the network file, simply by using all or a selection of the columns of the CSV files, depending on the needs. We show below how to select series from a particular country, and how to aggregate hourly time steps into days or weeks. These examples use Python and the data analysis library pandas, but other frameworks can be used as well (Matlab, Julia). Since all the yearly time series are periodic, it is always possible to define a coherent time window modulo the length of the series.
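As an illustration of the latter point, a window that extends past the end of a year can be built by wrapping row indices modulo the length of the series (a sketch; the chosen file and window are arbitrary):

import pandas as pd

loads = pd.read_csv('loads_2017_2.csv')

start = 24 * 350   # first hour of the window (day 350 of the year)
length = 24 * 28   # four weeks of hourly steps
indices = [(start + t) % len(loads) for t in range(length)]
window = loads.iloc[indices].reset_index(drop=True)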
Selecting a particular country
This example illustrates how to select generation data for Switzerland in Python. This can be done without parsing the network file, using instead gens_by_country.csv, which lists the generators of each country in the network. We start by importing the pandas library and reading the column of the file corresponding to Switzerland (country code CH):
import pandas as pd
CH_gens = pd.read_csv('gens_by_country.csv', usecols=['CH'], dtype=str)
The object created in this way is a DataFrame with some null values (not all countries have the same number of generators). It can be turned into a list with:
CH_gens_list = CH_gens.dropna().squeeze().to_list()
Finally, we can import all the time series of Swiss generators from a given data table with:
pd.read_csv('gens_2016_1.csv', usecols=CH_gens_list)
The same procedure can be applied to loads using the list contained in the file loads_by_country.csv.
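For instance, continuing from the previous example and assuming loads_by_country.csv has the same column layout as gens_by_country.csv, the Swiss load series can be extracted in the same way:

# Build the list of Swiss load identifiers, then read the corresponding columns
CH_loads = pd.read_csv('loads_by_country.csv', usecols=['CH'], dtype=str)
CH_loads_list = CH_loads.dropna().squeeze().to_list()
swiss_loads = pd.read_csv('loads_2016_1.csv', usecols=CH_loads_list)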
Averaging over time
This second example shows how to change the time resolution of the series. Suppose that we are interested in all the loads from a given table, which are given by default with a one-hour resolution:
hourly_loads = pd.read_csv('loads_2018_3.csv')
To get a daily average of the loads, we can use:
daily_loads = hourly_loads.groupby([t // 24 for t in range(24 * 364)]).mean()
This results in series of length 364. To average further over entire weeks and get series of length 52, we use:
weekly_loads = hourly_loads.groupby([t // (24 * 7) for t in range(24 * 364)]).mean()
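Equivalently, one can attach a datetime index and use pandas' resampling machinery. The sketch below assumes an arbitrary reference year that starts on a Monday (2018-01-01 is a Monday), matching the convention of the dataset:

# Hourly datetime index covering the 364-day synthetic year
hourly_loads.index = pd.date_range('2018-01-01', periods=24 * 364, freq='h')

daily_loads = hourly_loads.resample('D').mean()       # 364 daily averages
weekly_loads = hourly_loads.resample('W-SUN').mean()  # 52 weekly averages (Monday-Sunday weeks)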
Source code
The code used to generate the dataset is freely available at https://github.com/GeeeHesso/PowerData. It consists of two packages and several documentation notebooks. The first package, written in Python, provides functions to handle the data and to generate synthetic series based on historical data. The second package, written in Julia, is used to perform the optimal power flow. The documentation, in the form of Jupyter notebooks, contains numerous examples of how to use both packages. The entire workflow used to create this dataset is also provided, starting from raw ENTSO-E data files and ending with the synthetic dataset given in the repository.
Funding
This work was supported by the Cyber-Defence Campus of armasuisse and by an internal research grant of the Engineering and Architecture domain of HES-SO.
Files (17.8 GB)
The record contains 19 files, including europe_network.png together with the network file, the country lists, and the zipped yearly time-series archives described above. Individual file sizes range from a few kilobytes to 2.1 GB, and an md5 checksum is provided for each file.
Additional details
Related works
- Is described by: Data paper, DOI 10.1038/s41597-025-04479-x
- References: Dataset, DOI 10.5281/zenodo.2642175
Software
- Repository URL: https://github.com/GeeeHesso/PowerData
- Programming language: Python, Julia, Jupyter Notebook
- Development status: Active