Published July 9, 2025 | Version v6
Dataset Open

Wind Turbine SCADA Data For Early Fault Detection

  • 1. Fraunhofer Institute for Energy Economics and Energy System Technology

Description

This dataset is published together with the paper "CARE to Compare: A real-world dataset for anomaly detection in wind turbine data", which explains the dataset in detail and defines the CARE score for evaluating anomaly detection algorithms on this dataset. When referring to this dataset, please cite the paper listed in the Related works section.

The data consists of 95 datasets containing 89 years of SCADA time series distributed across 36 different wind turbines from the three wind farms A, B and C. The number of features depends on the wind farm: wind farm A has 86 features, wind farm B has 257 features and wind farm C has 957 features.

The overall dataset is balanced, as 45 out of the 95 datasets contain a labeled anomaly event that leads up to a turbine fault and the other 50 datasets represent normal behavior. Additionally, the quality of the training data is ensured by turbine-status-based labels for each data point, and further information about some of the given turbine faults is included.

The data for wind farm A is based on data from the EDP open data platform (https://www.edp.com/en/innovation/open-data/data) and covers 5 wind turbines of an onshore wind farm in Portugal. It contains SCADA data and information derived from a fault logbook, which defines start timestamps for specified faults. From this data, 22 datasets were selected for inclusion in this data collection. The other two wind farms are offshore wind farms located in Germany. The data from all three wind farms was anonymized for reasons of confidentiality concerning wind farms B and C. Each dataset is provided as a CSV file, with columns defining the features and rows representing the data points of the time series.
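The column-per-feature, row-per-data-point layout can be read with Python's standard library. The following is a minimal sketch, not part of the dataset tooling: the function name read_dataset, the semicolon delimiter and the column names time_stamp and sensor_0 are assumptions for illustration, so check the README for the actual separator and header names.

```python
import csv
import io


def read_dataset(handle, delimiter=";"):
    """Return (column_names, rows) from an open CSV file handle.

    The delimiter is an assumption; adjust it to what the README states.
    Each row is a dict mapping column name to the raw string value.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    rows = list(reader)
    return reader.fieldnames, rows


# Tiny in-memory stand-in for one dataset file. Apart from status_type_id,
# the column names used here are hypothetical.
sample = io.StringIO(
    "time_stamp;status_type_id;sensor_0\n"
    "2022-01-01 00:00:00;0;12.5\n"
    "2022-01-01 00:10:00;0;12.9\n"
)
columns, rows = read_dataset(sample)
print(columns)
print(len(rows))
```

For a real file, the same function could be called on a handle opened with open(path, newline="").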

More detailed information can be found in the included README file.

Notes

For wind farm A, status_type_id labels can be ignored when evaluating prediction time frames of error events with metrics like the CARE score, since the status_type_id of wind farm A is based on the EDP failure logbook and is intended for filtering the training data.
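Filtering training data by status_type_id could look like the sketch below. It assumes rows have been read as dictionaries keyed by column name. The helper name select_training_rows is hypothetical, and treating only status_type_id 0 as normal is an assumption based on the version notes; the README defines the full meaning of each ID.

```python
def select_training_rows(rows, normal_status_ids=(0,)):
    """Keep only rows whose status_type_id is in the set of normal IDs.

    Which IDs count as normal is an assumption here; pass a different
    tuple if the README lists further normal-operation IDs.
    """
    normal = set(normal_status_ids)
    return [row for row in rows if int(row["status_type_id"]) in normal]


# Illustrative rows; sensor_0 is a hypothetical feature column.
rows = [
    {"status_type_id": "0", "sensor_0": "12.5"},
    {"status_type_id": "4", "sensor_0": "3.1"},
    {"status_type_id": "0", "sensor_0": "12.9"},
]
training_rows = select_training_rows(rows)
print(len(training_rows))
```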

Version Changes:

Version 5->6:

  • Changed the unit of sensor_40 and sensor_61 for wind farm C from bar to hPa. This unit error became obvious when comparing the data to the standard air pressure.
  • Edited the event_description of events 34, 7 and 19 to "high temperature in transformer cell".
  • Changed the date in the event description of event 44, since it was not affected by the change in the date anonymization procedure from version 2.
  • Changed the date in the event description of event 47, since it was not affected by the change in the date anonymization procedure from version 2, and edited the description text.
  • Changed the date format in the event_info files to match the date format in the dataset files.
  • Fixed a typo in the README.
  • Re-added the README files.
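The unit correction for sensor_40 and sensor_61 can be illustrated with a small plausibility check. This is only a sketch under the assumption that both sensors record ambient air pressure; the function name plausible_pressure_unit and the 20 % tolerance are hypothetical choices, not part of the dataset. Standard air pressure is 1013.25 hPa, which equals 1.01325 bar.

```python
STANDARD_PRESSURE_HPA = 1013.25  # standard atmosphere in hPa
HPA_PER_BAR = 1000.0             # 1 bar = 1000 hPa


def plausible_pressure_unit(values, tolerance=0.2):
    """Guess whether ambient-pressure readings are given in hPa or bar.

    Compares the mean reading with standard air pressure expressed in each
    unit and returns the unit whose reference value lies within the relative
    tolerance, or None if neither fits.
    """
    mean = sum(values) / len(values)
    references = {
        "hPa": STANDARD_PRESSURE_HPA,
        "bar": STANDARD_PRESSURE_HPA / HPA_PER_BAR,
    }
    for unit, reference in references.items():
        if abs(mean - reference) <= tolerance * reference:
            return unit
    return None


# Made-up readings for illustration only, not taken from the dataset.
print(plausible_pressure_unit([1008.2, 1015.0, 1021.7]))
```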

Version 4->5:

Corrections to labels were made:

  • Previously missing status_type_id 4 labels were added to datasets in Wind Farm A.
  • Event 51 from Wind Farm A was wrongly labeled as a normal event. With the newly added status_type_id 4 occurrences, it is to be considered an anomaly event due to a gearbox bearing damage within the prediction data.
  • Wind Farm A no longer contains status_type_id 5. All occurrences of status_type_id 5 have been changed to 0 and are considered normal timestamps. This change was made because status_type_id 5 was set as a result of a wind speed and power analysis that flagged potentially anomalous data. This is not based on a fixed ground truth, so status_type_id 5 was removed. For Wind Farms B and C, status_type_id 5 is still valid since it is based on real SCADA status codes.
  • The event_info.csv files now contain an additional column 'asset_id'.
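The relabeling of status_type_id 5 to 0 for Wind Farm A amounts to a simple value replacement. The sketch below shows the idea on rows held as dictionaries; the function name remap_status is hypothetical and this is not the maintainers' actual script.

```python
def remap_status(rows, old_id=5, new_id=0):
    """Return copies of the rows with status_type_id old_id replaced by new_id."""
    remapped = []
    for row in rows:
        updated = dict(row)  # copy so the input rows stay unchanged
        if int(updated["status_type_id"]) == old_id:
            updated["status_type_id"] = new_id
        remapped.append(updated)
    return remapped


# Illustrative rows only.
rows = [{"status_type_id": 5}, {"status_type_id": 4}, {"status_type_id": 0}]
print([row["status_type_id"] for row in remap_status(rows)])
```

Such a remap would apply to Wind Farm A only, since status_type_id 5 remains valid for Wind Farms B and C.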

Version 3->4:

  • The change of the timestamp anonymization led to duplicate timestamps when transitioning from a leap year to 2022. This is fixed in version 4.
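Users of older versions can check a file for this issue by counting repeated timestamps. A minimal sketch, assuming the timestamps have already been read as strings; the function name find_duplicate_timestamps is hypothetical.

```python
from collections import Counter


def find_duplicate_timestamps(timestamps):
    """Return the sorted list of timestamps that occur more than once."""
    counts = Counter(timestamps)
    return sorted(ts for ts, count in counts.items() if count > 1)


# Made-up timestamps for illustration.
timestamps = [
    "2022-02-28 23:50:00",
    "2022-03-01 00:00:00",
    "2022-03-01 00:00:00",
    "2022-03-01 00:10:00",
]
print(find_duplicate_timestamps(timestamps))
```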

Version 2->3:

  • In version 2, the timestamp changes were not consistent with the timestamps in the event_info files. Version 3 fixes this.

Version 1->2:

  • Version 2 contains one deviation from version 1 regarding the anonymization procedure. Instead of shifting the timestamps of each sub-dataset by a random number of years, the time shift is now the number of years required for each sub-dataset to start in 2022. This change makes the timestamp anonymization more consistent and avoids future timestamps in the data.
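The whole-year shift described above can be sketched as follows. This is an illustration of the idea, not the procedure actually used for the dataset; the function name shift_to_target_year is hypothetical. Note that datetime.replace raises ValueError when 29 February is moved into a non-leap year. How the dataset's own procedure treats such dates is not stated here, so the sketch leaves that case unhandled.

```python
from datetime import datetime

TARGET_START_YEAR = 2022


def shift_to_target_year(timestamps, target_year=TARGET_START_YEAR):
    """Shift all timestamps by a whole number of years.

    The offset is chosen so that the earliest timestamp falls in target_year.
    A 29 February that lands in a non-leap year raises ValueError.
    """
    offset = target_year - min(timestamps).year
    return [ts.replace(year=ts.year + offset) for ts in timestamps]


# Made-up timestamps for illustration.
original = [datetime(2017, 3, 1, 0, 0), datetime(2018, 3, 1, 0, 10)]
shifted = shift_to_target_year(original)
print(shifted[0].year, shifted[-1].year)
```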

Files

CARE_To_Compare.zip (5.5 GB)
md5:2547b58c21ac8c242d13232860cf500c
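After downloading, the archive can be checked against the MD5 checksum listed for it. A minimal sketch using Python's hashlib; the function name md5_of_file and the local file path in the usage comment are assumptions. The file is read in chunks so that the 5.5 GB archive does not have to fit in memory.

```python
import hashlib

# Checksum as listed for CARE_To_Compare.zip in this record.
EXPECTED_MD5 = "2547b58c21ac8c242d13232860cf500c"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage, assuming the archive was saved in the current directory:
# if md5_of_file("CARE_To_Compare.zip") != EXPECTED_MD5:
#     raise SystemExit("Checksum mismatch: download may be corrupted")
```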

Additional details

Additional titles

Alternative title (English)
CARE To Compare Data

Related works

Is described by
Journal article: 10.3390/data9120138 (DOI)