Power Output Reconstruction of Photovoltaic Curtailment

The use of renewable energy sources in the grid’s energy mix has recently gained popularity. Because solar photovoltaic (PV) generation has almost zero emissions during operation, it is preferred over fuel-based electricity production. However, expanding the share of PV generation in grid capacity increases the chance of PV curtailment, and microgrids supported by large-scale PV generation almost certainly experience PV curtailment regularly. As the forecast of PV production is one of the cornerstones of electricity grid operation, the prediction model should be as accurate as possible. The latest trend is to use machine learning (ML) models to predict PV output, thanks to their excellent learning and regression capabilities. However, their performance can be highly influenced by the measurements used during model design. Unfortunately, only a few studies on this topic deal with the PV curtailment problem, which results in underperforming ML models. This paper proposes a novel approach to identify and replace curtailed PV measurements. The methodology uses a physical model as a baseline of truly producible energy, which is then investigated and corrected as a piecewise linear system using the Pearson correlation and weather measurements. In real-life comparative scenarios, the suggested data reconstruction method increases the accuracy of supervised ML-based solar power prediction.


I. INTRODUCTION
In recent years, the world has been shifting towards renewable sources, and solar energy in particular has emerged as one of the leading clean and cheap means of power production. This change in the energy mix is essential as it establishes a sustainable and ecological electrical system [1]. However, the operating grid cannot rely purely on renewable production, as its generation is volatile and cannot be directly shifted to match the electricity load demand [2]. This burden requires power generation operators to predict their production as accurately as possible. Based on these generation forecasts, microgrid customers can schedule load demand, buy electricity on the short-term market, or use model predictive control to optimize the operation of the microgrid [3]. Large grid operators, on the other hand, can prepare other resources to provide electricity when needed [4]. Hence, the quality of the provided photovoltaic (PV) production predictions is crucial, as many decision-making actions depend on it.
Many approaches exist for designing PV forecasters, with different prediction horizons and steps depending on their use. The basis of microgrid control and real-time grid scheduling is the day-ahead forecast with quarter-hourly or hourly steps. Direct solar energy prediction approaches can be categorized as physical, statistical, and machine learning (ML) [5], [6]. The most widely used are ML-based predictions, especially deep learning techniques, thanks to their excellent regression performance and their capability of learning hidden patterns. The bottleneck of this method is the data used to train the models and generate forecasts. To accurately predict future PV production and its fluctuation, it is necessary to correctly identify and predict the weather conditions acting on the PV system. In this field, it is standard to use various data preprocessing techniques, as shown in [7], to improve prediction capabilities [8]. However, most of them focus on removing outliers, filtering signals, replacing missing data, and finding the correct feature inputs to the ML models. Only a few of them address the problem of PV curtailment [9], which directly affects solar power measurements, see [10], [11]. The critical step in finding the most accurate ML model is to consider all possible influences that can act on the PV system. If the data used contain curtailed PV production, the model will underperform, as it is trained to provide a false (lower) PV prediction for specific inputs.
This paper's main contribution and purpose is to improve ML-based PV power forecasts for systems where PV curtailment occurs. The proposed data reconstruction algorithm utilizes the physical model of the PV system, the Pearson correlation coefficient, and a wide range of weather measurements. The capability of the data reconstruction method is tested on a real large-scale system with five different scenarios. To demonstrate the benefits of this approach, multiple ML-based models are trained (using the reconstructed and curtailed data) and tested to produce a day-ahead prediction.

A. System and Data Description
In general, solar plants consist of multiple assets. The two main ones are photovoltaic generators, which directly produce electricity from sunlight, and power converters, which transform direct current (DC) to alternating current (AC).
The power plant can function in three distinct modes. The first mode is the off-the-grid system, which operates independently of any national or local electricity distribution network. The remaining two setups, known as hybrid and grid-connected, are connected to the primary power grid. Both the hybrid and off-the-grid modes allow the utilization of excess energy through battery storage, although the battery capacity is typically undersized. In addition, the grid connection of the hybrid and grid-connected setups enables the export of excess electricity to the primary grid or the import of electricity during shortages.
The generation of such a solar power plant can be measured through inverters or smart electricity meters. The drawback of this system from the data perspective is that the collected measurements are limited to the energy actually produced. When the power currently available on the solar panels is greater than the load's consumption, the load naturally consumes only the necessary power. The extra energy is lost (curtailed), along with the information about it. If the power plant includes batteries, this behavior is only shifted in time: once the batteries reach their maximum capacity and can no longer store energy, the power production is curtailed. Such behavior occurs mainly in off-grid systems, as the others can export excess electric energy. However, grid participants must adhere to their agreed deals, as deviations can destabilize the whole grid operation. This results in the same situation, in which PV production is curtailed as the load consumption decreases.
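The loss of information described above can be illustrated with a minimal sketch. It assumes an idealized off-grid system without batteries, in which the metered production equals the smaller of the achievable PV power and the load. The function and variable names are hypothetical and serve only as an illustration.

```python
import numpy as np


def metered_production(achievable, load):
    """Return (measured, curtailed) power for an idealized off-grid system.

    Assumption: no battery and no grid export, so the meter only sees the
    part of the achievable PV power that the load actually consumes.
    """
    achievable = np.asarray(achievable, dtype=float)
    load = np.asarray(load, dtype=float)
    measured = np.minimum(achievable, load)  # what the inverter or meter records
    curtailed = achievable - measured        # lost energy, never recorded
    return measured, curtailed
```

In the recorded data only `measured` is available; `curtailed` is exactly the quantity that has to be estimated afterwards.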

B. Physical Model
The PV system's output relies on multiple factors, categorized as weather conditions and mechanical properties. The more significant factor is the corresponding meteorological state, which behaves as a multi-variable non-linear system including solar irradiance, temperature, humidity, wind speed, pressure, visibility, and many more. Not all of the mentioned parameters are essential for developing a good estimator of accessible power production. The most important are solar irradiance, which is directly transformed into electrical energy, and temperature, which influences PV panel efficiency.
In general, the total solar irradiance $G_t$ can be modeled as a sum of three components: the surface absorbed irradiance $G_s$, the diffused irradiance $G_d$, and the ground reflected irradiance $G_g$,

$$G_t(k) = G_s(k) + G_d(k) + G_g(k). \tag{1}$$

By direct measurement of the diffuse horizontal irradiance $G_{DHI}(k)$ and the direct normal irradiance $G_{DNI}(k)$ at each time step $k$, it is possible to determine each component separately with respect to $\alpha_{STA}$ and $\alpha_{SAA}$, the PV surface tilt and azimuth angles, respectively. The final output power production (2) is calculated as the energy produced from the total solar irradiance, affected by the current PV panel efficiency. The left part of the brackets in (2) represents the total irradiation (1) transformed at the defined time step $k$ relative to standard test conditions (STC), $G_{STC} = 1000$ W/m$^2$ and $T_{STC} = 25$ °C; $P_{max}$ is the maximal (peak) power production of the specific PV panel at STC, and $N_p$ is the number of installed panels. The right side of the brackets defines the effect of PV panel efficiency linked to the cell temperature $T_c(k)$, where $\kappa$ is the temperature coefficient of $P_{max}$. As the PV cell temperature is hard to measure in real time, it can be calculated as in (3), where $T_{NOCT}$ is the nominal operating cell temperature (NOCT) obtained at the test condition (TC) $G_{TC} = 800$ W/m$^2$, $T_{TC} = 20$ °C, and $T(k)$ is the measured ambient temperature. It follows from (2) that if $T_c(k)$ is higher than $T_{STC}$, the efficiency of the PV panel decreases according to the coefficient $\kappa$. The values defined by STC, TC, and NOCT, together with $P_{max}$ and $\kappa$, can be obtained from the PV manufacturer. The number of PV panels and the surface tilt and azimuth angles are unique for each solar power plant installation. This section provides a brief overview of the model used. For more details, see [12].
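A minimal sketch of the power and cell-temperature calculation follows. It assumes the commonly used forms $P(k) = N_p P_{max} \frac{G_t(k)}{G_{STC}} \left[1 + \kappa (T_c(k) - T_{STC})\right]$ and $T_c(k) = T(k) + \frac{T_{NOCT} - 20}{800} G_t(k)$ with a negative $\kappa$. These forms, the sign convention, the function names, and every numeric value in the usage note are assumptions for illustration, not parameters of the studied plant.

```python
import numpy as np


def cell_temperature(g_total, t_ambient, t_noct=45.0, g_tc=800.0, t_tc=20.0):
    """NOCT-based cell temperature estimate in deg C (assumed standard form)."""
    g_total = np.asarray(g_total, dtype=float)
    t_ambient = np.asarray(t_ambient, dtype=float)
    return t_ambient + (t_noct - t_tc) / g_tc * g_total


def pv_power(g_total, t_ambient, p_max, n_panels, kappa,
             t_noct=45.0, g_stc=1000.0, t_stc=25.0):
    """Achievable plant power for total irradiance g_total (W/m^2).

    p_max    : panel peak power at STC (W)
    n_panels : number of installed panels
    kappa    : temperature coefficient of p_max (1/deg C), assumed negative
    """
    g_total = np.asarray(g_total, dtype=float)
    t_cell = cell_temperature(g_total, t_ambient, t_noct)
    irradiance_term = n_panels * p_max * g_total / g_stc  # scaling relative to STC
    efficiency_term = 1.0 + kappa * (t_cell - t_stc)      # thermal derating
    return irradiance_term * efficiency_term
```

For example, ten hypothetical 300 W panels with `kappa = -0.004` at 800 W/m² and 20 °C ambient give a cell temperature of 45 °C and therefore an 8 % derating of the irradiance-scaled power.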

C. Pearson Correlation Coefficient
The purpose of correlation coefficients is to find and interpret how strong the relationship is between the investigated variables [13]. The Pearson correlation coefficient is one of the most popular tools in the field of feature extraction for machine learning inputs with a large number of available variables. This coefficient represents a linear correlation between two continuous variables $(x, y)$, and it is formulated as

$$r_{xy} = \frac{\sum_{m=1}^{M} (x_m - \bar{x})(y_m - \bar{y})}{\sqrt{\sum_{m=1}^{M} (x_m - \bar{x})^2}\,\sqrt{\sum_{m=1}^{M} (y_m - \bar{y})^2}}, \tag{4}$$

where $r_{xy}$ represents the correlation coefficient, $\bar{x}$ denotes the mean of $x$, and $\bar{y}$ denotes the mean of $y$ across all samples $M$. The value from (4) lies in the range $r_{xy} \in [-1, 1]$. The closer the coefficient gets to 1, the higher the positive correlation between $x$ and $y$. Conversely, the closer the coefficient gets to $-1$, the higher the negative correlation between the variables, and if $r_{xy}$ is close to 0, there is no direct linear correlation between $x$ and $y$.
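The coefficient can be computed directly from its definition. The sketch below is a plain NumPy illustration; returning 0 for a constant series, where the coefficient is undefined, is an assumption made here for convenience.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient between two equally long 1-D series."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    denom = np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))
    if denom == 0.0:
        return 0.0     # constant series: undefined, treated as uncorrelated here
    return float(np.sum(dx * dy) / denom)
```

For non-constant inputs the result agrees with the off-diagonal entry of `np.corrcoef(x, y)`.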

III. PROBLEM STATEMENT
As described in Section II-A, power production can be curtailed by many factors regardless of the solar power plant operation mode. The consequence of such behavior is that the measurements from smart devices provide incomplete information, as other parts of the grid influence them. This creates unwanted data corruption, which is hard to identify and process before the synthesis of the forecasting model. If the data are not processed correctly, a model trained with such data provides underestimated PV production. This affects the utilization of solar power generation, as its forecasts are used in model predictive control (MPC), which benefits from the most accurate predictions available. Likewise, various decision-making processes depend on the best PV forecasts.
This work suggests a novel curtailed data reconstruction approach for supervised machine learning modeling of PV panel production. It utilizes the solar panel physical model and the Pearson correlation coefficient to identify curtailed power production and replace it with estimated values of the maximal possible generation.

IV. PREDICTION OF ACHIEVABLE ENERGY
In general, the designed PV energy forecasting model $f_P$ takes weather forecasts as input features $x$ and produces a one-step prediction $y$ of the generated power for the defined time step $k$.
In the field of ML-based solar power prediction, the adopted methodology consists of standard procedures, including:
1) Data collection and pre-processing.
2) Identification and feature extraction.
3) Model selection (type, structure).
4) Model training and validation using weather measurements and historical PV production.
5) Testing procedure using weather forecasts and historical PV production to select the best-performing model.
The following section extends the pre-processing step, as it aims to avoid biased predictions and model underperformance. The training data are reconstructed to replace curtailed PV production with estimated achievable production.

A. Data Reconstruction
Assume we have acquired a representative sample of historical weather measurements $W \in \mathbb{R}^{M \times n_w}$, where $n_w$ represents the number of unique variables $w \in \mathbb{R}^{M}$, and power production measurements $Y \in \mathbb{R}^{M}$, which include corrupted information.
The first step in PV data reconstruction is to find the physical model (2) and identify its parameters. Using historical weather measurements and the designed model, we can construct the dataset $Y_{Pt} \in \mathbb{R}^{M}$, which corresponds to the measured one. This dataset represents a naive guideline of achievable power generation, which is unaware of grid limitations. Because curtailment can only reduce power production, the only interesting information in the modeled dataset $Y_{Pt}$ is the part that is greater than the measured information from $Y$. Using this fact, the created dataset is modified element-wise, where $m = 1, \ldots, M$ represents the index of a single value of the vector. In this way, it is possible to keep the original measurements intact and separate the potentially corrupted ones.
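One reading of this element-wise modification, assumed here for illustration, is to keep the modeled value only where it exceeds the measurement and to keep the measurement everywhere else. The function name and the returned mask are hypothetical.

```python
import numpy as np


def candidate_baseline(y_model, y_meas):
    """Combine modeled and measured power element-wise.

    Assumption: a sample is a curtailment candidate only when the modeled
    power exceeds the measured power; otherwise the measurement is kept.
    Returns the combined series and a boolean mask of candidate samples.
    """
    y_model = np.asarray(y_model, dtype=float)
    y_meas = np.asarray(y_meas, dtype=float)
    exceeds = y_model > y_meas                     # potentially corrupted samples
    combined = np.where(exceeds, y_model, y_meas)  # measurements stay intact elsewhere
    return combined, exceeds
```

The mask separates the potentially corrupted samples from those that can be trusted as measured.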
To correctly identify curtailed power production, it is necessary to investigate the modeled data and determine whether the mismatch between $Y_{Pt}$ and $Y$ is caused by effects of the connected grid or by exogenous weather influences that are not included in the model (2). By calculating the difference between them, $\Delta Y$ (7), and using it in the Pearson correlation from Section II-C, it is possible to determine whether the weather measurements $w$ from $W$ significantly impact the mismatch between modeled and measured data. If they do, this should be reflected in the reconstructed data. Otherwise, we can assume that the PV-produced energy was curtailed.
The compensation of weather variables that are not modeled is based on a simple linear model whose coefficients $a \in \mathbb{R}^{n_s}$ and $b \in \mathbb{R}$ are fitted to the data, where $S \in \mathbb{R}^{M \times n_s}$ includes only the dependent variables from $W$ ($n_s \le n_w$). The weather measurements are selected based on a simple rule: if the investigated weather variable $w^{(i)}$, where $i = 1, \ldots, n_w$, has a greater correlation coefficient (4) than the variables from the model (2) ($G_{DHI}$, $G_{DNI}$, $T$), it is included in $S$. After the linear model is found, the resulting preprocessed data $Y_f$ are formed from the modeled data $Y_{Pt}$ and the term $S a^{\top} + b$, which is a compensation of exogenous weather influences acting on the PV system. However, as mentioned in Section II-B, the meteorological impact on PV power production is highly non-linear. This means that a simple linear model provides unsatisfying results for a large number of historical samples $M$ containing different weather conditions (rainy, warm, or cold days, changing seasons, etc.). In addition, the Pearson correlation is a measure of linear dependency between two datasets, which creates the same problem.
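The linear compensation can be sketched as an ordinary least-squares fit. The sketch assumes that the linear model is fitted to the mismatch $\Delta Y$ and that the fitted term is subtracted from the modeled data; both the fitting target and the sign are assumptions, and the function names are hypothetical.

```python
import numpy as np


def fit_compensation(S, delta):
    """Least-squares fit of delta ~ S @ a + b; returns (a, b)."""
    delta = np.asarray(delta, dtype=float)
    S = np.asarray(S, dtype=float)
    if S.ndim == 1:
        S = S[:, None]                              # single selected variable
    A = np.column_stack([S, np.ones(len(delta))])   # extra column for intercept b
    coef, *_ = np.linalg.lstsq(A, delta, rcond=None)
    return coef[:-1], coef[-1]


def compensate(y_model, S, a, b):
    """Remove the weather-explained part of the mismatch from the modeled data."""
    S = np.asarray(S, dtype=float)
    if S.ndim == 1:
        S = S[:, None]
    return np.asarray(y_model, dtype=float) - (S @ a + b)
```

When the selected variables fully explain the mismatch, the compensated series falls back to the measured production, which corresponds to the case without curtailment.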
The key step is to investigate only a small portion of the time series data at a time, $M \Rightarrow \{M_1, \ldots, M_n\}$. This decomposes the time series into a piecewise linear process, which provides the desired results.

B. Implementation Details
The condition on lines 13 and 16 of Algorithm 1 can be interpreted as follows. If the correlation of the exogenous weather variable $w^{(i)}$ with $\Delta Y$ for the defined portion of data is smaller than that of the modeled variables ($G_{DHI}$, $G_{DNI}$, $T$), it is possible to assume that external weather influences did not create the difference $\Delta Y$. Otherwise, the linear model (9) compensates their influence. This condition can be modified by the tuning coefficient $p_r \ge 0$, which for non-zero values allows less dependent weather variables to pass.
The range of the investigated data $\Delta t$ should be selected based on the sampling time of the measured data $W$, $Y$ and the weather conditions at the PV panel's location. The chosen time range of investigated data is crucial as it is directly linked to the linear models. Using too large a portion of the dataset at once (week, month, year) leads to inaccurate correlations, as the modeled variables are often the most dependent and directly affect power production. As a consequence, the algorithm will never, or only very rarely, compensate possible exogenous effects, and if it does, the compensation will be inaccurate. On the contrary, selecting too small a portion of the dataset compared to the sampling time of the measurement leads to too much compensation of the modeled data, and we end up with an underestimation of power production. It is worth mentioning that the range of the reconstructed data $\Delta t$ may be different in each iteration. The proposed algorithm does not restrict that, and if the data indicate it, this is even a recommended step. Another point is that the physical model from Section II-B can take any complexity, form, or additional variables. The only difference is that $S$ will be a smaller subset of $W$, as the number of exogenous variables decreases when the number of modeled variables increases.

Algorithm 1 (excerpt):
Set $\delta \leftarrow \Delta Y(j : j + \Delta t)$, $\varphi \leftarrow Y_{Pt}(j : j + \Delta t)$ and $\omega \leftarrow W(j : j + \Delta t)$.
Calculate correlation for modeled variables:
11: Set $r^{DHI}_{xy} \leftarrow f(G_{DHI}, \delta)$, $r^{DNI}_{xy} \leftarrow f(G_{DNI}, \delta)$ and $r^{T}_{xy} \leftarrow f(T, \delta)$.
...
22: Append $Y_f \leftarrow \sigma$.
23: end for
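The windowed procedure can be sketched end to end as follows. This is an illustrative reading of the description rather than the authors' implementation. It assumes that correlations are compared by absolute value against the strongest modeled variable, that $p_r$ is added to the exogenous correlation before the comparison, and that the fitted linear term is subtracted from the modeled window. All names are hypothetical.

```python
import numpy as np


def _abs_corr(x, y):
    """Absolute Pearson correlation; 0 for constant series."""
    dx, dy = x - x.mean(), y - y.mean()
    denom = np.sqrt((dx @ dx) * (dy @ dy))
    return 0.0 if denom == 0.0 else float(abs(dx @ dy) / denom)


def reconstruct(W, y_meas, y_model, modeled_cols, window, p_r=0.0):
    """Piecewise reconstruction of curtailed PV measurements.

    W            : (M, n_w) weather measurements
    y_meas       : (M,) measured, possibly curtailed, production
    y_model      : (M,) output of the physical model
    modeled_cols : columns of W already used by the physical model
    window       : number of samples per investigated portion (delta t)
    p_r          : tuning coefficient >= 0 relaxing the selection rule
    """
    W = np.asarray(W, dtype=float)
    y_meas = np.asarray(y_meas, dtype=float)
    y_model = np.asarray(y_model, dtype=float)
    exo_cols = [c for c in range(W.shape[1]) if c not in modeled_cols]
    y_f = np.empty_like(y_meas)
    for j in range(0, len(y_meas), window):
        sl = slice(j, j + window)
        phi = np.maximum(y_model[sl], y_meas[sl])  # modeled baseline, never below data
        delta = phi - y_meas[sl]                   # mismatch in this window
        omega = W[sl]
        r_model = max(_abs_corr(omega[:, c], delta) for c in modeled_cols)
        selected = [c for c in exo_cols
                    if _abs_corr(omega[:, c], delta) + p_r > r_model]
        sigma = phi                                # default: treat mismatch as curtailment
        if selected:                               # weather explains the mismatch
            A = np.column_stack([omega[:, selected], np.ones(len(delta))])
            coef, *_ = np.linalg.lstsq(A, delta, rcond=None)
            sigma = phi - A @ coef                 # compensate exogenous influence
        y_f[sl] = sigma
    return y_f
```

With hourly data, `window=24` corresponds to investigating one day at a time. A window in which no exogenous variable is selected keeps the modeled baseline, while a window whose mismatch is explained by weather is pulled back toward the measurements.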

V. CASE STUDY
This work presents a real large-scale PV system on which we test the proposed algorithm and demonstrate its reliability for any PV setup. The investigated PV installation is located in Romania and includes 11680 individual solar panels of the same efficiency and peak power $P_{max}$. The system contains three master-slave inverter control groups, each containing four separate inverters with individual smart meters. In each group, the primary inverter controls the other three, which connect and disconnect based on the power production. The system is directly connected to the main grid without local power consumption, so it does not show frequent power curtailment. This makes it a perfect example, as the proposed algorithm can be tested against true power production. By excluding the data of power inverters one by one from the aggregated dataset, we can simulate PV power curtailment, which is then reconstructed using Algorithm 1 shown in the previous section.
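The simulation of curtailment by excluding inverters can be sketched as below. The array layout (one column per inverter) and the function name are assumptions. Which inverters are excluded in each scenario is defined by the case study and is not encoded here.

```python
import numpy as np


def simulate_curtailment(inverter_power, excluded):
    """Aggregate plant power with and without a subset of inverters.

    inverter_power : (M, n_inverters) power measured by each inverter
    excluded       : indices of inverters dropped to mimic curtailment
    Returns (curtailed, achievable) aggregated series of length M.
    """
    inverter_power = np.asarray(inverter_power, dtype=float)
    n_inverters = inverter_power.shape[1]
    keep = ~np.isin(np.arange(n_inverters), list(excluded))
    achievable = inverter_power.sum(axis=1)          # true total production
    curtailed = inverter_power[:, keep].sum(axis=1)  # production seen after exclusion
    return curtailed, achievable
```

The `achievable` series plays the role of the ground truth against which the reconstructed data are compared.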

VI. RESULTS AND DISCUSSION
The data included in this work represent the total PV production, weather measurements, and forecasts over 230 days with a one-hour sampling time $T_s = 1$ h. Weather measurements and forecasts for the defined location are imported from the third-party open-source API Tomorrow.io, which provides 20 different weather variables with a possible prediction horizon of 430 h.
The total PV production is used in five different showcase simulations to provide representative results across a wide range of solar power curtailment. In the presented results, we apply the scenario scheme listed in Tab. II. From now on, we refer to the aggregated information (the sum of all groups) as curtailed production, achievable production (without curtailment), and reconstructed data (curtailed data that have undergone the reconstruction process).
Each case is processed individually, as it contains unique curtailed PV measurements $Y$, which are then transformed into the reconstructed dataset $Y_f$ representing the maximal possible power generation for a defined time. The range of the investigated data is chosen as $\Delta t = 24$ h. By comparing these datasets $Y$, $Y_f$ with the achievable production $Y_R$, we generate the results shown in Fig. 1 and Tab. III. Fig. 1a shows a detailed statistical comparison across all measurements (case 2) for each hour of active PV production (power generation from 17:00 to 04:00 is negligible). As we can see, the mean of the reconstructed data is shifted toward the mean of the achievable PV production. In addition, the variance of the reconstructed data is stretched compared to the curtailed PV measurements, leading to a better correlation with achievable power within every hour of the day. As the results suggest, the reconstruction procedure succeeded in mimicking the real measurements (achievable power), as shown in the example in Fig. 1b. The difference between reconstructed and achievable PV production is caused by one-time changes in the weather conditions, which are not captured as significant by the Pearson correlation (4) and are therefore ignored by the reconstruction procedure.

Fig. 2: Day-ahead PV forecast using LSTM trained with reconstructed ($P_f$) and curtailed ($P_{PV}$) data, compared to achievable historical PV production (case 2).
In Tab. III, we can see statistical results for the whole dataset across all cases. Naturally, an increase in the missing total produced power $P_{miss}$ decreases the overall accuracy of the proposed algorithm. Looking at the mean and variance of the reconstructed data, we observe a slower decline compared to the curtailed PV production, which decreases at a much greater rate. This shows that the reconstruction procedure remains beneficial as the power curtailment grows.
To show the advantages of reconstructed data in PV power prediction, we design a pair of long short-term memory (LSTM, see [14]) networks for every case from Tab. II. Each scenario is then represented by an LSTM trained using curtailed PV measurements and an LSTM trained with reconstructed data, which predict PV power $P$ and $P_f$, respectively. We divide our datasets into three groups. First, the training dataset includes measurements from 138 days and is used to estimate the trainable parameters of the LSTMs. The second group represents validation data (23 days) for model validation. Finally, the last group includes 69 days of measurements, on which we perform the accuracy analysis presented in Tab. IV and Fig. 2. The structure of the LSTM networks is selected as follows. Weather feature inputs are chosen using Algorithm 1 by including all modeled variables and those used in the model compensation (9). In summary, the inputs include direct normal irradiance, diffuse horizontal irradiance, outside temperature, cloud base, dew point, humidity, precipitation intensity, wind speed, and wind direction. The hyperparameters of the ML model are found using Bayesian optimization: the number of LSTM units is 543 and the learning rate is $3.715 \cdot 10^{-4}$. With the ADAM optimizer, 2900 epochs of minimizing the sum of squared errors were sufficient to find the optimal weights and biases of the LSTM units. The inputs to the LSTM do not change, and the outputs do not differ much between the provided scenarios, so we fix these parameters for all ten recurrent neural networks.
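The division into 138, 23, and 69 days can be sketched as follows. The sketch assumes that the three groups are consecutive blocks in chronological order and that each day contains 24 hourly samples; the ordering is an assumption, and the function name is hypothetical.

```python
import numpy as np


def chronological_split(data, days=(138, 23, 69), samples_per_day=24):
    """Split a time-ordered array into train, validation, and test blocks."""
    data = np.asarray(data)
    n_train = days[0] * samples_per_day
    n_val = days[1] * samples_per_day
    n_test = days[2] * samples_per_day
    train = data[:n_train]                               # first 138 days
    val = data[n_train:n_train + n_val]                  # next 23 days
    test = data[n_train + n_val:n_train + n_val + n_test]  # last 69 days
    return train, val, test
```

Keeping the blocks contiguous avoids mixing samples of the same day across the training and testing groups.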
In Fig. 2, we can see a comparison of the day-ahead forecasts (predictions for 24 h from the midnight of the previous day) of the constructed models with historical values for case 2. As shown, the LSTM trained with reconstructed data is more reliable on days with higher power generation. For days with lower production, both LSTMs provide similar results, as the training data are rarely curtailed at such low production. More detailed results can be found in Tab. IV, where $P_{tot}$ represents the total power produced without curtailment in our testing dataset, which is the same across all scenarios. From the mean squared error (MSE), we can conclude that the LSTM trained with reconstructed data provides superior results. The maximal absolute error between the aggregated true daily production and the day-ahead forecast is shown in the third column. As we can see, the LSTM trained with curtailed power production has an almost twice larger maximal error, $\Delta P \approx 2\Delta P_f$. As the last indicator, we chose the sum of the daily aggregated surplus and deficit across all daily samples. As already pointed out, the LSTM trained with reconstructed data naturally provides a larger surplus (overestimated PV prediction) compared to the LSTM trained with curtailed PV measurements, which, on the contrary, provides a more significant deficit (underestimated PV prediction). However, as we can see, the sums of the deficit and surplus relative to $P_{tot}$ differ substantially between the two LSTM models, which shows the benefit of the proposed data reconstruction.
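The indicators discussed above can be computed as in the following sketch. It assumes that the daily error is the difference between the daily sums of the forecast and of the achievable production, that the surplus is the sum of positive daily errors, and that the deficit is the sum of negative daily errors reported as a positive number. These definitions and all names are assumptions.

```python
import numpy as np


def forecast_indicators(y_true, y_pred, samples_per_day=24):
    """MSE, maximal absolute daily error, and total daily surplus and deficit."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = float(np.mean((y_pred - y_true) ** 2))
    # Aggregate to daily totals; the length must be a multiple of samples_per_day.
    daily_true = y_true.reshape(-1, samples_per_day).sum(axis=1)
    daily_pred = y_pred.reshape(-1, samples_per_day).sum(axis=1)
    daily_err = daily_pred - daily_true
    return {
        "mse": mse,
        "max_abs_daily_error": float(np.abs(daily_err).max()),
        "surplus": float(daily_err[daily_err > 0].sum()),   # overestimated days
        "deficit": float(-daily_err[daily_err < 0].sum()),  # underestimated days
    }
```

Dividing the surplus and deficit by the total achievable production of the test period gives values that can be compared across scenarios.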

VII. CONCLUSION
This paper proposes a novel approach to photovoltaic (PV) power production measurement reconstruction for systems where PV generation occasionally exceeds load consumption. The methodology is based on constructing a physical model of a real PV system, which provides the baseline of maximal production. By investigating the curtailed PV measurements as a piecewise linear system, we were able to utilize the Pearson correlation coefficient and a linear model of weather measurements. Both tools are used to compensate the difference between the modeled and curtailed PV power production. Using this approach, we reconstruct historical PV measurements and preserve the behavior of the non-linear system. The provided algorithm was tested on a real large-scale PV system including 11680 individual panels. To investigate the performance of our scheme, we simulated various levels of curtailment by artificially excluding information from a subset of PV inverters. In all scenarios, the algorithm successfully reconstructs the corrupted (curtailed) PV measurements with a degree of accuracy that is linked to the size of the power curtailment.
The direct benefit of the proposed approach is improved machine learning (ML) based photovoltaic power production forecasts. Our study involves generating and comparing day-ahead predictions using ML models for a defined system, both with and without data reconstruction. From the investigation of different scenarios, we conclude that the LSTM trained with reconstructed data provides superior results. Not only is the maximal error between the historical values and their predictions almost two times lower, but the approach also provides a good trade-off between the prediction deficit and surplus through possible tuning of the data reconstruction algorithm.