Generation of Realistic Synthetic Financial Time-series

Financial markets have always been a point of interest for automated systems. Due to their complex nature, financial algorithms and fintech frameworks require vast amounts of data to accurately respond to market fluctuations. This data availability is tied to the daily market evolution, so it is impossible to accelerate its acquisition. In this article, we discuss several solutions for augmenting financial datasets via synthesizing realistic time-series with the help of generative models. This problem is complex, since financial time series present very specific properties, e.g., fat-tail distribution, cross-correlation between different stocks, specific autocorrelation, cluster volatility and so on. In particular, we propose solutions for capturing cross-correlations between different stocks and for transitioning from fixed to variable length time-series without resorting to sequence modeling networks, and adapt various network architectures, e.g., fully connected and convolutional GANs, variational autoencoders, and generative moment matching networks. Finally, we tackle the problem of evaluating the quality of synthetic financial time-series. We introduce qualitative and quantitative metrics, along with a portfolio trend prediction framework that validates our generative models’ performance. We carry out experiments on real-world financial data extracted from the US stock market, proving the benefits of these techniques.


INTRODUCTION
Financial markets are a field which, if acted upon correctly, can bring major financial gain. This has attracted attention both from individual traders and from researchers. The latter focused on fitting mathematical models to the market's behavior, trying to reach accurate automatic prediction of financial events. Historically, researchers opted for one of two major directions: (i) statistical models, e.g., ARCH with its variants [5,16,17,24,31,47,52,65,66,75,85], and (ii) agent-based models [7-9,51,60,81]. Despite all these efforts, it is still too complex to perfectly capture the underlying properties of financial time series with such mathematical tools [78].
One possible solution to overcoming this drawback is to resort to state-of-the-art deep learning methods [33]. This idea is strongly backed by the rapid progress made in other signal modeling domains such as speech recognition [3,68,88], speaker recognition [67,77], medical image classification [1,23], action recognition [13,35,40] or image enhancement [10,21,37]. This makes deep neural networks strong candidates for tackling the financial data modeling problem as well, since financial time-series are time-varying signals with very specific properties. However, these networks need significantly larger amounts of data to train robust models than their hand-crafted feature-based learning counterparts. In domains such as multimedia and vision, the content available for research is growing at an exponential pace [49,56]. Moreover, when there is a need for a very specific type of data, researchers can create these resources themselves by scraping the Internet and using crowd-sourced annotation tools or even fully automatic annotations, e.g., for face and object detection [4] or for concept detection [70].
In contrast, even though financial market indices are public and accessible to everybody, what is missing is a thorough and long-lasting gathering of this data. Quite often, if such datasets are available, they are kept behind a pay-wall, discouraging open-source research and reproducibility. Complete and curated datasets can be bought from specialised providers, such as Bloomberg L.P. This restriction forced researchers to resort to building their own lightweight versions of the datasets. There are works that evaluated their research on the Google stocks [84], Shanghai Composite Index, International Business Machine (IBM) Index, Microsoft Corporation (MSFT) Index, Ping An Insurance Company of China (PAICC) Index [86], CSI 300 Index, Nifty 50 Index, Hang Seng Index, Nikkei 225 Index, DJIA Index [58] and S&P 500 Index [58,86]. The most common approach remains, however, to evaluate on the entire S&P 500 dataset [48,78,82]. The financial markets depend, by their nature, on the daily evolution of worldwide events and there is no available mechanism to accelerate the acquisition of such data. Thus, it is of great interest to be able to synthesize, on the spot, new financial data which resembles real stock markets in order to train the models.
Deep learning methods have achieved great success in realistic data generation and, out of all the proposed network models, Generative Adversarial Networks (GANs) [26] offered the most spectacular results. This has been proven in several fields, GANs having the ability to generate realistic faces [42,71], perform image-to-image translation [36], generate scenes [2], raw audio waveforms [15] or realistic text sequences [87]. Different approaches involving variational embeddings have also proved successful [18,38,63]. In this paper we tackle this issue and investigate various solutions to generating realistic synthetic time-series as well as providing effective tools for assessing their quality.
The applications related to financial time-series generation include, but are not limited to: 1) portfolio management: portfolio managers (stockbrokers and hedge funds) extract meaningful data related to the companies in which they should invest on the short, medium or long term, e.g., daily values prediction and trend pattern analysis; 2) metadata management: extracting valuable information from a statistically relevant number of samples each day, e.g., understanding data seasonality and predicting sales patterns for various companies; 3) forecasting models: all previously mentioned synthetically generated samples. Contrary to other existing works, we decided to focus more on exploring the boundaries of this problem and therefore investigated several solutions rather than proposing a single one. Previous experiments show that different network architectures each come with their own pros and cons, and finding a candidate that performs best in all situations, e.g., capturing all the statistical properties of the real time-series, may be an ill-posed task, as one does not look for an identical replica of the initial data. Researchers will usually find solutions for different usage situations of the data, where various independent properties of the data should be captured, but very seldom will they focus on meeting all the properties at once.
The contributions of our work beyond the state of the art are as follows: (i) we provide solutions for data pre- and post-processing that increase the performance of the models, (ii) we extend the range of generative models and evaluation metrics for financial time-series addressed by the current state of the art, (iii) we propose a solution for capturing cross-correlation between different stocks during training, (iv) we propose a solution for transitioning from fixed to variable length time-series without resorting to sequence modeling networks (e.g., LSTM, RNN, GRU etc.), (v) we explore a great wealth of advanced generative architectures and provide several solutions for different scenarios, (vi) we validate the proposed approach by performing stock trend prediction and show that synthetic samples help improve prediction accuracy. This paper builds upon our previous work [12], where fixed-length financial time-series were generated with GANs. The main new features of this work consist of investigating more classes of generative models, proposing several additional qualitative metrics, proposing a new batch feeding mechanism to capture the cross-correlation of the real stocks and implementing a complex trend prediction setup used to evaluate the goodness of the synthetic samples, as well as providing extensive testing. We compare our results with our previous work [12] and with those of Takahashi et al. [78], since the principle they follow is the closest to our work.
Overall, this is an incipient domain which started to gain traction recently, as is visible from the small number of publications and algorithm resources available. Even though they are scarce, it is worth mentioning that they are very recent, showing that this field receives growing attention. In this context, we consider our work exploratory, providing a deeper understanding of the financial time series generation problem.

FINANCIAL TIME-SERIES
Financial data represents information about the state and progress of a company's financial assets. A company's financial time series represents the chronological evolution of several indicators. To create good prediction models it is necessary to train on sufficiently large and diverse datasets. Such a dataset must contain not only a great number of companies (in order to offer good generalization perspectives), but also a large number of samples for each company, i.e., to span a long time period (in order to capture as many different moments in trading history as possible). However, this resource is not always freely available and, when it is, it does not provide enough data for more complex models. It is therefore extremely helpful to have a system that can synthetically generate training data for models to become more profitable.
In our work, we use the daily values of the closing price (C), i.e., the price that the stock reached at the end of the trading day. All entities involved in the trading of stocks use the closing price as a reference point to monitor a company's performance over one day. Moreover, the trend that the closing price is following is more important than its magnitude, since stock prices can have various ranges between companies. We take as an example the evolution of Apple stocks from March 20th to 21st, 2019, when the closing price rose from $187.43 to $194.34, resulting in an absolute difference of $6.91. Similarly, the Amazon closing price increased from $1897.83 to $1904.28 from June 26th to 27th, 2019, resulting in an absolute difference of $6.45. These two differences are comparable in magnitude but, relative to their corresponding closing prices, they differ by one order of magnitude. Therefore, we focus on ratios of closing prices, rather than absolute closing prices. In particular, we investigate log returns, i.e., the logarithm of ratios of closing prices from consecutive days, given by:
$$r_t = \log\left(\frac{C_t}{C_{t-1}}\right) \tag{1}$$
where $C_t$ represents the closing price of day $t$ and $r_t$ the log return of the closing price of day $t$. This ratio is useful because it not only reduces the intra-variation of the time series, but also acts as a normalization between different companies' stocks, as displayed in Figure 1. We can see there that two stocks that do not have the same behavior and whose closing prices belong to different ranges can be successfully encoded under the same range by applying the log return transformation. Financial time-series are different from other data in the sense that it is necessary to wait for an entire day to extract one new sample (if the granularity is at day level), given that international stock markets update closing prices at the end of the trading day. This strengthens the necessity to have a good data generator for this type of data.
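As a small worked example of the log return transformation, the sketch below applies it to the two price pairs quoted in the text. The function name `log_returns` is an illustrative choice, not something taken from the original implementation.

```python
import math

def log_returns(closing_prices):
    """Log of the ratio between each closing price and the previous one."""
    return [math.log(curr / prev)
            for prev, curr in zip(closing_prices, closing_prices[1:])]

# Closing prices quoted in the text (two consecutive trading days each).
apple = [187.43, 194.34]
amazon = [1897.83, 1904.28]

print(log_returns(apple))   # about 0.036
print(log_returns(amazon))  # about 0.0034
```

Although the absolute price changes are similar, the two log returns differ by roughly a factor of ten, which is the normalization effect the paragraph describes.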
The microstructure of the financial market gives the financial time series several properties and shapes [6,11,82]. It is known that the distributions of these time series are more peaked than the normal distribution and exhibit a fat-tailed behavior, meaning that extreme values (both high and low) are more probable than under a normal distribution. Also, large price changes tend to cluster together, an effect called volatility clustering, which can be observed in Figure 1, where large/small changes are followed by large/small changes, respectively. Volatility is also negatively correlated with the return process, a property called the leverage effect. Lastly, empirical asset returns are uncorrelated for any value of the lag larger than one, but not independent. Generative models face the major challenge of having to cover all these properties of the financial time-series.
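Two of these properties can be checked with very simple statistics. The sketch below is only an illustration on a hypothetical toy series (alternating calm and turbulent blocks of Gaussian noise, with block lengths and volatilities chosen arbitrarily), not on real market data: positive excess kurtosis indicates fat tails, while autocorrelation of absolute returns combined with near-zero autocorrelation of raw returns indicates volatility clustering without linear dependence.

```python
import random

def excess_kurtosis(xs):
    """Fourth standardized moment minus 3 (0 for a Gaussian)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    fourth = sum((x - mean) ** 4 for x in xs) / n
    return fourth / var ** 2 - 3.0

def autocorrelation(xs, lag):
    """Sample autocorrelation of xs at a positive lag."""
    n = len(xs)
    mean = sum(xs) / n
    num = sum((xs[t] - mean) * (xs[t + lag] - mean) for t in range(n - lag))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den

def toy_clustered_returns(n_blocks=20, block_len=100, seed=0):
    """Hypothetical toy returns: calm and turbulent blocks alternate."""
    rng = random.Random(seed)
    returns = []
    for block in range(n_blocks):
        sigma = 0.01 if block % 2 == 0 else 0.03
        returns.extend(rng.gauss(0.0, sigma) for _ in range(block_len))
    return returns

returns = toy_clustered_returns()
print("excess kurtosis:", excess_kurtosis(returns))
print("lag-1 autocorrelation, returns:", autocorrelation(returns, 1))
print("lag-1 autocorrelation, |returns|:",
      autocorrelation([abs(r) for r in returns], 1))
```

The same two functions can be applied unchanged to real or synthetic log returns when comparing their statistical behavior.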

GENERATIVE MODELS
The first step of our proposed solution consists of generating a fixed-length 1-dimensional array of samples with the help of several generative models. Since there is no standardised architecture agreed upon in the literature that would solve this problem, we performed an in-depth study over a vast number of generative architectures. Our aim was not to come up with a novel model architecture, since this domain is still young and there has not been any work conducted to prove one model's superiority over the others. Instead, we adapted to our framework many architectures that have been successfully applied in other multimedia domains (mostly image generation). We investigated 3 major classes of generative models: Generative Adversarial Networks (GANs) [26], Variational Autoencoders (VAEs) [45,73] and Generative Moment Matching Networks (GMMNs) [14,54]. For each such class we explore various architectures and training setups in search of the model that best captures the financial market's characteristics. We explored the influence of the number of neurons, layers, activations and types of layers by proposing several models that would cover a range as diverse as possible, and were surprised that choosing one model over the others made an important difference. Moreover, as supported by our previous work [12] and the work of Takahashi et al. [78], we concluded that small variations in the model architecture, such as batch normalization, make a crucial difference by turning an otherwise well-working model into an unusable generator. The target is to generate a fixed-length vector of log return Close prices. In the following, we describe each of the architectures that were implemented. The length of the synthesized 1D array has been set to 250 for all models, the equivalent of an entire working year in finance. Please note that all architectures are presented in their optimized versions, achieved after in-depth ablation studies.

Generative Adversarial Networks
GANs have been successfully used in other tasks such as image generation with outstanding results. As pointed out in [25], if both the generator and the discriminator have enough capacity and at each step of the training process the discriminator is allowed to reach its optimum given the generator, then the generator model's probability distribution will converge to that of the training data under the classical GAN optimization function. As mentioned in Section 3, financial time-series follow specific probability distributions, so our motivation was to fit this exact distribution with the help of GANs. We experimented with two types of GANs: fully connected and fully convolutional. The fully connected setup aims to take into consideration the effect that each value from the 250-sample-long 1D array has on the outcome. The fully convolutional approach is more oriented towards the effect that short groups of consecutive values captured by the receptive field of the convolution (which also have the highest correlation, since they refer to consecutive business days) have on the outcome. Time series are essentially 1-dimensional arrays that hold a different value for each time step. Therefore, our architectures have been redesigned for the 1D case. The five architectures are as follows:
• $M_1$: the generator contains 5 layers of 1D transpose convolutions, each of which is followed by a batch normalization layer and a ReLU activation, except for the last layer, which is not followed by batch normalization. The discriminator has a mirrored structure, with convolutional layers replacing the transpose convolutions;
• $M_2$: a shallower version of $M_1$ that requires flattening layers to adapt to the output dimension;
• $M_3$: the same as $M_1$, but without any batch normalization;
• $M_4$: the same as $M_2$, but without any batch normalization;
• $M_{SN}$: spectral normalization GAN. This has the same layer organization as $M_1$, with the only difference that we replaced batch normalization layers with spectral normalization layers [64].

Matching statistic moments.
Motivated by the fact that we would like to generate samples whose statistical moments match those of the real data, we adopted a weighted loss function between the Maximum Mean Discrepancy (MMD) [27,28] loss and the classical generator loss function for both fully connected and fully convolutional GANs. Thus, the objective functions to optimize alternately during one training iteration become:
$$\mathcal{L}_D = -\mathbb{E}_{x}\left[\log D(x)\right] - \mathbb{E}_{z}\left[\log\left(1 - D(G(z))\right)\right] \tag{2}$$
$$\mathcal{L}_G = (1 - \alpha)\,\mathbb{E}_{z}\left[\log\left(1 - D(G(z))\right)\right] + \alpha\,\mathcal{L}_{MMD^2} \tag{3}$$
where $\mathcal{L}_D$ and $\mathcal{L}_G$ are the discriminator and generator losses, respectively, $D(\cdot)$ and $G(\cdot)$ are the discriminator and generator outputs, respectively, $x$ are real financial data, $z$ are the latent noise vectors that are transformed by the generator into synthetic samples, $\alpha$ is a weighting factor and $\mathcal{L}_{MMD^2}$ is the Maximum Mean Discrepancy computed between the generated samples and the real ones, expressed as:
$$\mathcal{L}_{MMD^2} = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{i'=1}^{N} k(x_i, x_{i'}) - \frac{2}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M} k(x_i, y_j) + \frac{1}{M^2}\sum_{j=1}^{M}\sum_{j'=1}^{M} k(y_j, y_{j'}) \tag{4}$$
where $x$ and $y$ represent samples from two different sets $X = \{x_i\}_{i=1}^{N}$ and $Y = \{y_j\}_{j=1}^{M}$, with $N$ and $M$ being the sample sets' dimensions. These samples are drawn from two different distributions $p_X$ and $p_Y$. MMD gives an estimate of the distance between the two distributions. If this loss becomes 0, then $p_X = p_Y$. In our case, we compute the MMD between real and generated samples on each iteration of the generator's training, over one full batch. Therefore, $N = M = B$, where $B$ is the batch size. Then,
$$k(x, x') = \exp\left(-\frac{\lVert x - x' \rVert^2}{2\sigma^2}\right) \tag{5}$$
represents the Gaussian kernel with $\sigma$ being the bandwidth parameter. With this kernel, we can get an explicit feature map by using a Taylor series expansion with an infinite number of terms which, in theory, covers all orders of statistics. The rationale behind adding the MMD is that we wanted the generator to have two objectives: generating samples good enough to fool the discriminator while at the same time matching the statistical moments of the training distribution. Since it is not clear beforehand which of the two terms is more important, we added a weighting parameter that was varied (between 0, the original GAN formulation, and 1, a generator trained solely on the MMD loss) throughout the training procedure.
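A minimal sketch of the moment-matching term is given below, assuming the biased (V-statistic) estimator of the squared MMD with a single Gaussian kernel whose exponent is divided by 2σ². The function names, the single-bandwidth choice and the convex-combination form of `generator_loss` are illustrative assumptions based on the description above (weight 0 gives the plain adversarial loss, weight 1 gives MMD only), not the authors' code.

```python
import math

def gaussian_kernel(a, b, sigma):
    """Gaussian kernel between two equal-length vectors."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

def mmd_squared(xs, ys, sigma=1.0):
    """Biased estimate of the squared MMD between two sample sets."""
    n, m = len(xs), len(ys)
    k_xx = sum(gaussian_kernel(a, b, sigma) for a in xs for b in xs) / (n * n)
    k_yy = sum(gaussian_kernel(a, b, sigma) for a in ys for b in ys) / (m * m)
    k_xy = sum(gaussian_kernel(a, b, sigma) for a in xs for b in ys) / (n * m)
    return k_xx - 2.0 * k_xy + k_yy

def generator_loss(adversarial_loss, mmd2, alpha):
    """Weighted mix: alpha=0 is the plain GAN loss, alpha=1 is MMD only."""
    return (1.0 - alpha) * adversarial_loss + alpha * mmd2

real = [[0.0, 0.0], [1.0, 1.0]]
fake = [[5.0, 5.0], [6.0, 6.0]]
print(mmd_squared(real, real))  # identical sets give zero
print(mmd_squared(real, fake))  # distant sets give a clearly positive value
```

In an actual training loop the same quantity would be computed on tensors so that gradients can flow back to the generator; the pure-Python version only illustrates the arithmetic.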

Wasserstein training.
For each of the aforementioned models we applied two different training frameworks. One is the vanilla GAN setup [26] and the other is the Wasserstein GAN setup [2], with gradient penalty. This change affected only the way the training was performed, not the model architectures. In the Wasserstein models' case, for each epoch we trained the Discriminator for 5 iterations, then the Generator for 1 iteration.

Variational Autoencoders
We built two VAE models in close connection to the GAN architectures described above, one based on $M_2$ and the other on $M_3$. The discriminator's structure was copied in the encoder and the generator's structure was copied in the decoder. Regarding the rationale for these specific architectures, we found these two approaches (one for MLP and one for FCGAN) to offer the best stability in long-run training, so we continued only with them in the extended testing phase that is introduced in the next sections. As also explained in Section 3, there are other properties of the financial time-series which are not easily quantifiable (such as volatility clusters, auto- and cross-correlation for specific lag values etc.). It was our intuition that these properties may be hidden in a lower, fundamental dimension of the data, which led to the choice of VAEs. Moreover, VAEs have a more direct way of training, which can be assessed numerically with the help of the RMSE. The input of the VAE, $x$, is encoded into the mean and variance vectors. Random noise $\epsilon$ is drawn from the standard normal distribution $\mathcal{N}(0, 1)$ and scaled by the standard deviation (the square root of the variance), and the result is added to the mean, forming $z \sim \mathcal{N}(\mu, \sigma^2)$, which is decoded into the output $x'$. We chose the bottleneck for both models to be of size 20.
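The sampling step of the VAE can be sketched as follows. The sketch assumes that the encoder outputs a log-variance vector (a common convention, not confirmed by the text) and scales the noise by the standard deviation, i.e., the square root of the variance, which is what makes the latent code follow a Gaussian with the encoded mean and variance. The name `reparameterize` and the example values are hypothetical.

```python
import math
import random

def reparameterize(mean, log_var, rng):
    """Draw z = mean + std * eps with eps ~ N(0, 1), element-wise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]

rng = random.Random(0)
bottleneck = 20                    # bottleneck size stated in the text
mean = [0.0] * bottleneck          # placeholder encoder outputs
log_var = [0.0] * bottleneck       # log(1) = 0, i.e., unit variance
z = reparameterize(mean, log_var, rng)
print(len(z))
```

Keeping the randomness in a separate noise variable is what lets gradients pass through the mean and variance during training.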
The cost function that we minimize during the VAEs' training is as follows:
$$\mathcal{L}_{VAE} = \alpha \cdot MSE(x, \hat{x}) + (1 - \alpha) \cdot D_{KL}\left(q(z|x) \,\|\, p_\theta(z)\right) \tag{6}$$
where $\alpha$ is a weighting factor, $MSE$ is the mean square error, $D_{KL}$ is the Kullback-Leibler divergence, $\hat{x}$ is the reconstruction of the input, $q(z|x)$ models the encoder network and $p_\theta(z)$ the decoder network. Additionally, the weighting factor helps us analyze the impact of each of the two losses.
For $\alpha = \frac{1}{2}$ we obtain the original VAE loss formulation. Choosing this loss function was motivated by the fact that for this application it is more useful to capture the overall data distribution than to reconstruct the input samples, especially since we treated the entire generation process predominantly from a statistical point of view. Moreover, our intuition was that if we focused more on the reconstruction than on the regularization part, we would end up with averaged versions of the input, which would probably help the prediction part due to the smoother nature of the data, but would definitely fail in the subjective metrics described in Section 6.1.
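The weighted objective can be illustrated with the sketch below. It assumes a convex combination of the reconstruction error and the KL term (so that equal weights recover the usual VAE loss up to a constant factor), a diagonal Gaussian encoder parameterized by mean and log-variance, and a standard normal prior, for which the KL divergence has a closed form. All function names are illustrative.

```python
import math

def mse(x, x_hat):
    """Mean squared reconstruction error."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def kl_to_standard_normal(mean, log_var):
    """Closed-form KL( N(mean, exp(log_var)) || N(0, 1) ), summed over dims."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mean, log_var))

def vae_loss(x, x_hat, mean, log_var, alpha=0.5):
    """alpha weights reconstruction, (1 - alpha) weights the KL term."""
    return alpha * mse(x, x_hat) + (1.0 - alpha) * kl_to_standard_normal(mean, log_var)

x = [0.01, -0.02, 0.005]
x_hat = [0.0, -0.01, 0.0]
print(vae_loss(x, x_hat, mean=[0.1, -0.1], log_var=[0.0, 0.0], alpha=0.5))
```

Lowering `alpha` pushes the model towards matching the prior, which fits the stated preference for capturing the overall distribution over exact reconstruction.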

Generative Moment Matching Networks
The key idea of GMMNs is the use of a statistical hypothesis testing framework, namely the maximum mean discrepancy (MMD). Minimizing this discrepancy is equivalent to replicating the statistical moments. That is, if the samples generated by the model follow a distribution whose moments match those of the training data distribution, then the two distributions (empirical and model) are bound to be similar.
The setup for successfully creating a GMMN involves first training an autoencoder on a given dataset. Next, the encoder is used to transform the input data into the latent code space. The generator is then trained to sample data from the latent code distribution which, in turn, will be transformed by the autoencoder's decoder into new samples. Autoencoders, by themselves, have a discrete latent code distribution, which makes them unusable for generation. GMMNs, however, infer a continuous data distribution over the latent code space. One disadvantage of this method is that it requires a large batch size in order to have the moment estimation average over a statistically significant number of samples. This setup is also known as GMMN+AE. The autoencoder is trained to encode the training samples, $x$, into a latent vector representation, $h$. Afterwards, the GMMN is trained to map the noise vector $z$ to the latent vector distribution. The GMMN ensures that the model's latent code distribution and that of the training data are similar by minimizing the MMD. This means that we can generate new samples from the continuous latent code distribution which will be transformed, by the autoencoder's decoder, into new samples, $x'$.
The use of MMD in the GANs and the GMMN+AE is slightly different. For GANs, we use it as a weighted term in the generator's cost function in order to force it to also focus on matching the statistics of the training data. For GMMN+AE, we use it to train a noise-driven generator that outputs values in the autoencoder's latent code (bottleneck) distribution by matching the central moment statistics of this space, thus adding generating capabilities to the autoencoder. Similarly to the VAEs, we use the same two encoder-decoder setups, $M_2$ and $M_3$. The moment matching network is an MLP with the $d_z - 40 - 80 - 120 - 180 - d_h$ structure. $d_z$ represents the dimension of the noise vector, set to 100. $d_h$ is the dimension of the autoencoder's bottleneck and depends on the encoder-decoder architecture.

DATA
Financial data is different from most other types of multimedia data in the sense that it possesses several distinct properties, as discussed in Section 3. Furthermore, each country has its own stock market with specific companies and specific behaviours. This means that we will see differences in how the stock market evolves in different countries. This discrepancy affects financial algorithms and makes it difficult to perform a fair comparison between models trained on different stock markets. However, it has become the norm to train, validate and compare results on the S&P500 dataset. Financial data is publicly available for each listed company, but gathering data from all companies under a single dataset, aligning them from a temporal point of view and pruning them is currently an effort that resides behind a pay-wall.

Dataset Creation
We train our generative models on the S&P dataset provided by Hana Institute of Technology. This dataset consists of 1,506 companies with daily closing price records from January 1st, 2000 to March 31st, 2020. Compared to the commonly used S&P500 dataset, it contains more companies (1,506 vs 500), but our dataset spans a shorter time interval (the start date is January 1st, 2000 vs March 31st, 1964). We address only the most recent 20 years of data because they are closer to current market behaviour than older stocks. We chose the granularity of the data to be at a daily level after running preliminary experiments that showed that a finer granularity, e.g., hour-based, would bring significant noise to the monitored statistics, whereas a coarser granularity, e.g., week-based, would overlook important variations that took place during the workweek.
In this dataset, the financial time series are available for companies which entered the stock market (were listed) at different times, so we expect them to start at different moments on the time axis. However, all companies were still active on the stock market at the moment when the data was gathered, meaning that these time series last until the same day. In other words, the dataset consists of time series with different starting dates, but the same end date. We illustrate in Figure 2 how these time series are laid out in time. This introduces a certain bias, since there is no available data related to companies that disappeared from the stock market at previous moments (due to bankruptcy, merger or being bought by other companies).
Our generative models restrict us from using varying length inputs, therefore we need to feed them fixed-size data. The solution that we found for this problem was to split each available time series into segments of a fixed number of samples using a sliding window mechanism. Starting from the earliest position in our dataset, we began extracting segments of 250 samples for each company (or ticker, as depicted in Figure 2) in order to prepare the data for processing. This amount is equivalent to one working year's worth of samples and is a reasonable choice, since it allows capturing possible seasonality (events happening once per season/year) and it is not long enough for the market to change drastically between the start and end point. When fixing such a window it is possible to encounter tickers that have an incomplete set of samples because they appeared on the stock market sometime during the captured window's time interval. This problem can be dealt with in two possible ways. One can pad the incomplete segments with 0 until they reach the full extent of the window, or remove them completely. We chose to drop these segments completely in order to avoid altering the dataset by adding hard-coded values. We formally represent missing values from these incomplete segments as 'NaN'. Consequently, if the starting position of a segment is 'NaN', then the segment can be deemed incomplete. Thus, we define:
$$W_t = \left\{ \left[C_{i,t}, C_{i,t+1}, \dots, C_{i,t+L_w-1}\right] \mid C_{i,t} \neq \text{'NaN'},\ i \in [1, 1{,}506] \right\} \tag{7}$$
to be the set of all segments starting at day $t$ extracted from all companies among the 1,506 that are listed on the stock market at moment $t$. $L_w$ represents the length of the window, which we set to 250, as previously mentioned. Formally, we consider the start of the dataset (January 1st, 2000) to be the day with index $t = 1$.
We process the rest of the dataset in a sliding window fashion, with a step of 30 samples, equivalent to 6 working weeks, and add the resulting segments to our training data set. We denote the obtained training dataset as:
$$D = \left\{ W_t \mid t = 1 + 30k,\ k = 0, 1, 2, \dots \right\} \tag{8}$$
All data was transformed to log returns, as explained in Equation 1. The closer we come to the present with the sliding window, the more companies will present full segments over the captured period, due to the previously discussed reasons.
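The windowing procedure described above can be sketched as follows. The sketch assumes that all tickers are aligned on a common trading calendar and that days before a company's listing are stored as NaN; the function name `extract_window_sets` and the dictionary layout are hypothetical. Following the text, a segment is dropped when its first value is missing, which is sufficient because all series share the same end date.

```python
import math

WINDOW = 250  # one working year of daily samples
STEP = 30     # six working weeks

def extract_window_sets(series_by_ticker, window=WINDOW, step=STEP):
    """Return one list of complete segments per window start position."""
    n_days = len(next(iter(series_by_ticker.values())))
    window_sets = []
    for start in range(0, n_days - window + 1, step):
        segments = []
        for values in series_by_ticker.values():
            segment = values[start:start + window]
            if math.isnan(segment[0]):
                continue  # ticker not yet listed at the window start: drop it
            segments.append(segment)
        window_sets.append(segments)
    return window_sets

nan = float("nan")
toy = {
    "AAA": [100.0 + d for d in range(310)],             # listed from day 0
    "BBB": [nan] * 20 + [50.0 + d for d in range(290)],  # listed from day 20
}
sets = extract_window_sets(toy)
print([len(s) for s in sets])  # number of complete segments per window
```

Keeping the result grouped by window start, rather than flattening it, preserves which segments cover the same calendar period.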

Dataset Preparation
The success of the time series generation process depends on the dataset preparation. Thus, an important contribution of this paper lies in the way we addressed this step, as explained next. Our final dataset, $D$, contains more than 200k 1-dimensional entries, each of length $L_w$. An important issue that arises here is how to draw samples from this set. In our previous work [12], we considered $D$ to be a homogeneous mixture of windows and randomly sampled batches of data from it. This led to partly realistic synthetic samples, but there was no cross-correlation present between the generated samples. This is mainly due to the fact that the generative models can encounter cross-correlated input (values from the same time frame for different companies) only by accident. The same approach was carried out by [78], [48] and [20] with window lengths of 8,192, 252 and 230 samples, respectively.
Cross-correlation is important in the stock market domain because companies operate in a limited number of sectors. Thus, news that may impact a given sector will directly impact all companies belonging to that industry and indirectly impact connected industries. For example, if there is a global shortage of silicon, then companies which operate in the extraction industry will be directly affected and suffer a decrease in prices. Then, companies which use this resource for their products will also be negatively impacted (e.g., the glass industry, the electronic devices industry etc.) and from here on there is a chain reaction up to a certain point. This means that stocks belonging to the aforementioned sectors will behave similarly when confronted with a powerful external factor (political, social, economic etc.).
In order to feed cross-correlated samples to the models and, inherently, teach the generator to synthesize cross-correlated data, we suggest the following processing. Instead of scrambling all the extracted windows inside a large dataset, we kept each set of segments $W_t$ in its original form and formed batches out of each such subset. This means that we process each subset as an entire batch. We perform shuffling inside each subset and between subsets, but we do not mix entries belonging to different subsets. We are aware that this leads to batches of different sizes and that it also forces large batch sizes (up to 1,506), but since each window is only 250 samples long, this does not pose any problem (the maximum amount of values fed at one iteration is 1,506 × 250, which is less than the equivalent of one HD image). Even if this is only an implementation issue, we discovered that it greatly helps in encoding latent connections between different stocks. It is very helpful to have such a mechanism because usually the entire stock market responds in approximately the same manner to strong external stimuli, e.g., economic crises, presidential elections, pandemic outbreaks etc.
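The batch feeding mechanism can be sketched as a small generator function. It assumes the training data is already grouped into one list of segments per window position; the name `iterate_batches` is hypothetical. Subsets are visited in random order and each subset is shuffled internally, but entries from different subsets are never mixed, so every batch contains segments from the same calendar period.

```python
import random

def iterate_batches(window_sets, rng):
    """Yield one batch per window subset, shuffled within and between subsets."""
    order = list(range(len(window_sets)))
    rng.shuffle(order)                   # shuffle between subsets
    for index in order:
        batch = list(window_sets[index])
        rng.shuffle(batch)               # shuffle inside the subset
        yield batch                      # batch size equals the subset size

# Toy subsets: each entry is tagged with (ticker, window index).
toy_sets = [
    [("AAA", 0), ("BBB", 0)],
    [("AAA", 1), ("BBB", 1), ("CCC", 1)],
]
for batch in iterate_batches(toy_sets, random.Random(0)):
    print(len(batch), batch)
```

The batch size therefore varies from one iteration to the next, growing as more companies become listed.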

Regime Splits
One interesting aspect of the stock market dataset is that it is generally "growing", meaning that closing prices increase in the long term. In log return terms (see Equation 1), the positive values outweigh the negative ones. In this context, we need to determine, for each day, whether the short-term regime of a time series is up-trending or down-trending. We define the regime for day $t$ as up-trending when $\mu^{\times}_t \geq \mu^{\times_r}_t$ and as down-trending otherwise. Here, $\mu^{\times}_t$ and $\mu^{\times_r}_t$ are the cross-sectional mean and the cross-sectional rolling mean, respectively, for day $t$. If $S_t = \{ p_{i,t} \mid p_{i,t} \neq \text{NaN}, \forall i \in [1, 1506] \}$ is the set of all available samples on day $t$:

$$\mu^{\times}_t = \frac{1}{|S_t|} \sum_{p_{i,t} \in S_t} p_{i,t}, \qquad \mu^{\times_r}_t = \frac{1}{w_r} \sum_{j=t-w_r}^{t-1} \mu^{\times}_j$$

In other words, we define $\mu^{\times}_t$ as the mean over all closing prices from day $t$ and $\mu^{\times_r}_t$ as the mean of $\mu^{\times}$ over the previous $w_r$ days. $w_r$ is a constant representing the length of the window over which the regime is computed. To set its value, we examined values from the set {30, 50, 100, 150, 200, 230, 250}. Since there was no significant difference between these setups, we kept consistency with the previously mentioned rolling windows and set $w_r = 250$. Also, for $\mu^{\times_r}_t$ we applied a triangular rolling window. Again, this had no significant impact compared to applying a regular rolling window. We show in Figure 3 how the market is distributed between up-trending and down-trending days. Performing this split based on regimes labels 68.07% of the days as up-trending and the remaining 31.93% as down-trending. The difference between the two regimes is large enough that we must consider that adding data belonging to only one of the two regimes to the training process might introduce some noise in the generative models' outcome. Therefore, we implemented 2 strategies for training each of our models. The first was to train each model on the complete dataset. The second was to compute the regime for each day on the original dataset, split the dataset according to the two regimes, and then apply the windowing mechanism described in Equations 7 and 8 to each of the two regimes. This means that for each architecture setup we trained 3 models: a complete one, trained on the entire dataset without regime splits; an up-trending one, trained only on up-trending days; and a down-trending one, trained only on down-trending days. We denote these 3 versions as 'complete', 'up' and 'down', respectively.
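The regime labelling described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `compute_regimes`, the toy price matrix, the use of a uniform (rather than triangular) rolling window, the shortened window at the start of the series, and the convention of labelling the very first day as up-trending are all assumptions made for illustration.

```python
import numpy as np

def compute_regimes(prices, window=250):
    """Label each day as up-trending (True) or down-trending (False).

    prices: array of shape (n_days, n_stocks); NaN marks unavailable samples.
    """
    # Cross-sectional mean over the stocks that are available on each day.
    cross_mean = np.nanmean(prices, axis=1)
    regimes = np.zeros(len(cross_mean), dtype=bool)
    for t in range(len(cross_mean)):
        if t == 0:
            regimes[t] = True  # assumption: no history yet, default to 'up'
            continue
        start = max(0, t - window)
        # Uniform rolling mean of the cross-sectional mean over previous days.
        rolling = cross_mean[start:t].mean()
        regimes[t] = cross_mean[t] >= rolling
    return regimes

rng = np.random.default_rng(0)
# Hypothetical toy data: 5 stocks drifting upwards for 60 days, then downwards.
trend = np.concatenate([np.linspace(10, 20, 60), np.linspace(20, 12, 40)])
prices = trend[:, None] + rng.normal(0, 0.05, size=(100, 5))
prices[3, 2] = np.nan  # a missing sample is ignored by the cross-sectional mean
regimes = compute_regimes(prices, window=10)
up_share = regimes.mean()  # fraction of days labelled up-trending
```

On real data, `up_share` would play the role of the up-trending quota reported in the text; the toy series here is only meant to show the mechanics.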

Synthetic Time Series Formation
Once the generators are trained to output fixed-length time series, we need to combine their outputs to obtain time series of arbitrary size. In the case of the 'complete' models, we simply generate several fixed-length batches and concatenate them until reaching the desired length. For the 'mixed' regimes approach, however, we apply the following procedure. We rely on the fact that the up-trending and down-trending regimes come in bursts of 20 to 120 (statistically determined) consecutive samples. Moreover, we know the final quota of each of the two regimes. Therefore, we sample segments of random length between 20 and 120 samples, each with a 68% probability of coming from the 'up' model generator and a 32% probability of coming from the 'down' model. We concatenate these segments until we reach the desired time series length. Also, adjusting the batch size for the generator is equivalent to setting the number of stocks that we want to generate for a given period, i.e., the financial universe size.
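The 'mixed' concatenation procedure can be sketched as follows. This is an illustrative simplification rather than the original code: `build_mixed_series` is a hypothetical helper, and the two generators are stand-in callables that return Gaussian log returns of any requested length, whereas the trained models emit fixed-length windows that would have to be cut or stitched to the sampled segment length.

```python
import numpy as np

def build_mixed_series(gen_up, gen_down, total_len, n_stocks, rng,
                       min_seg=20, max_seg=120, p_up=0.68):
    """Concatenate random-length segments drawn from two regime generators.

    gen_up / gen_down: callables (length, n_stocks) -> array (length, n_stocks).
    Returns the series and a per-day flag telling which generator produced it.
    """
    segments, labels, filled = [], [], 0
    while filled < total_len:
        seg_len = int(rng.integers(min_seg, max_seg + 1))
        seg_len = min(seg_len, total_len - filled)  # trim the last segment
        use_up = bool(rng.random() < p_up)          # 68% 'up', 32% 'down'
        gen = gen_up if use_up else gen_down
        segments.append(gen(seg_len, n_stocks))
        labels.extend([use_up] * seg_len)
        filled += seg_len
    return np.concatenate(segments, axis=0), np.array(labels)

rng = np.random.default_rng(1)
# Hypothetical stand-ins for the trained 'up' and 'down' generators.
gen_up = lambda n, k: rng.normal(0.001, 0.01, size=(n, k))
gen_down = lambda n, k: rng.normal(-0.001, 0.01, size=(n, k))
series, is_up = build_mixed_series(gen_up, gen_down, total_len=1000,
                                   n_stocks=8, rng=rng)
```

The second dimension of `series` corresponds to the batch size of the generator, i.e., the size of the financial universe.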

EVALUATION PROCEDURE
While the evaluation of classification and retrieval systems is a well-known problem and many metrics have been validated over time [61], the evaluation of synthetically generated data is still an open issue and an entire research domain in itself [25,79], especially in the financial field. Several attempts have been made to assess the goodness of synthesized images, e.g., Inception Score [74] and Fréchet Inception Distance [32], but financial time series transformed into images behave poorly when evaluated by a neural network trained on natural images. We therefore analyze and propose a series of metrics inspired by signal processing problems. Most of these metrics are still experimental regarding how accurately they describe the performance of financial data synthesis, but they can still indicate a hierarchy between different models. We approach the evaluation at three levels: (i) qualitatively, (ii) quantitatively, and (iii) via a predictive accuracy test, presented as follows.

Qualitative analysis
Lucic et al. [59] argue that it is necessary to report a summary of the distribution of results, rather than the single best result achieved. To capture as much information as possible, we randomly select one batch of real data and one of synthesized data during each epoch and compare several metrics. Graphical examples of how these metrics were assessed are available in the appendix. Since many architectures and variations have to be compared, a complete qualitative analysis is virtually impossible. To overcome this setback, we propose to explore the following properties:
• Central moments - one way to determine whether two distributions are alike is to examine their statistical central moments. If the samples generated from two distributions behave similarly in terms of mean, variance, skew and kurtosis, it is a strong indicator that the two distributions might be similar.
• Autocorrelation - a well-known property of financial time series is that they do not possess linear predictability, meaning that the autocorrelation of the returns is a diminishing function.
• Heavy-tailed distribution - financial time series are known to exhibit heavy-tailed behavior, i.e., their distribution assigns a higher probability than a normal distribution to very high and very low values. This translates into a taller peak and a thinner middle than a normal distribution.
• Volatility clustering - another well-known property is volatility clustering, meaning that changes (either high or low) come grouped in clusters. In other words, small changes tend to be followed by small changes and large changes tend to be followed by large changes. This can be assessed by examining the autocorrelation plot of non-linear transformations of returns, such as squared or absolute returns, which remain significantly autocorrelated even though no significant autocorrelation can be observed in the returns themselves.
• Cumulative sum - plotting the cumulative sum of the returns of any company should yield a non-monotonic curve with varying shapes. The cumulative sum for any company at time $t$ is defined as $c_t = \sum_{i=0}^{t} r_i$, where $r_i$ is the log return from Equation 1. We are interested in the general form of these graphs and not in specific values.
• Trend ratios - we define the trend ratio as:

$$TR_t = \frac{T_t}{N_t}$$

where $T_t = \frac{p_t}{p_{t-w_{tr}}} - 1$ is the stock's trend and $N_t = \sum_{i=t-w_{tr}}^{t} \left| \frac{p_i}{p_{i-1}} - 1 \right|$ is the noise corresponding to the same stock. Here, $w_{tr}$ is the trend lookback window, which we set to 20. Again, we are interested only in the general form of these graphs.
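Several of the properties listed above reduce to short computations on a vector of log returns. The following NumPy sketch shows one possible way to obtain the central moments, the autocorrelation (of raw and absolute returns) and the cumulative sum. The helper names and the Student-t toy returns are assumptions introduced for illustration, and the kurtosis is the plain (non-excess) fourth standardized moment.

```python
import numpy as np

def central_moments(r):
    """Return mean, variance, skewness and (non-excess) kurtosis of r."""
    mean, var = r.mean(), r.var()
    z = (r - mean) / np.sqrt(var)
    return mean, var, (z ** 3).mean(), (z ** 4).mean()

def autocorrelation(r, max_lag):
    """Sample autocorrelation of r for lags 1..max_lag."""
    r = r - r.mean()
    denom = np.dot(r, r)
    return np.array([np.dot(r[:-k], r[k:]) / denom
                     for k in range(1, max_lag + 1)])

def cumulative_sum(r):
    """Running sum of log returns, c_t = sum of r_i for i <= t."""
    return np.cumsum(r)

rng = np.random.default_rng(0)
# Hypothetical heavy-tailed returns standing in for one stock of a batch.
returns = 0.01 * rng.standard_t(df=5, size=2000)

mean, var, skew, kurt = central_moments(returns)
acf_returns = autocorrelation(returns, max_lag=10)          # linear predictability
acf_abs = autocorrelation(np.abs(returns), max_lag=10)      # volatility clustering
curve = cumulative_sum(returns)                             # shape of the trajectory
```

Computing the same quantities on a real batch and on a synthetic batch, and comparing them side by side, mirrors the visual inspection described in the text.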

Quantitative analysis
Quantitative analysis is generally quite difficult to perform on generative models, especially in the context of the relatively new problem of financial data generation. To provide a solution, we adapt metrics that are generally used in information theory, as well as metrics designed for generative models in other fields, such as image and speech processing, namely:
• Kullback-Leibler divergence [50] - this is a measure of how one probability distribution differs from another. Since it is not symmetric, we set the real data distribution as reference and computed the metric for all synthetic data distributions. This measures the capability of our models to assign high probability to the most realistic points [25]. For continuous probabilities it is defined as:

$$D_{KL}(P \,\|\, Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx$$

where $P$ and $Q$ are distributions of 2 continuous random variables and $p$ and $q$ denote the probability densities of $P$ and $Q$, respectively. We are interested in the lowest values that we can attain.
• Jensen-Shannon divergence [55] - compared to the Kullback-Leibler divergence, the Jensen-Shannon divergence is symmetric, and it is a measure of similarity between two probability distributions. It is defined as:

$$JSD(P \,\|\, Q) = \frac{1}{2} D_{KL}(P \,\|\, M) + \frac{1}{2} D_{KL}(Q \,\|\, M)$$

where $M = \frac{1}{2}(P + Q)$ is the average of the two distributions, $P$ and $Q$. Again, we are interested in the lowest values.
• Kolmogorov-Smirnov test statistic [62] - this is a method of telling whether two samples belong to the same distribution or not. We run the test on two randomly selected batches, one of real data and one of synthesized data, and report the K-S statistic. A low value means that we cannot reject the hypothesis that the two samples come from the same data distribution. We expect the statistic to become more precise as the number of samples in each batch increases.
• Earth Mover's distance [41] - the first Wasserstein distance between two 1D distributions. It can be seen as the minimum amount of "work" required to transform distribution $P$ into distribution $Q$, where "work" is measured as the amount of distribution weight that must be moved, multiplied by the distance it has to be moved. It is defined as:

$$W_1(P, Q) = \inf_{\gamma \in \Gamma(P, Q)} \int_{\mathbb{R} \times \mathbb{R}} |x - y| \, d\gamma(x, y)$$

where $\Gamma(P, Q)$ is the set of (probability) distributions on $\mathbb{R} \times \mathbb{R}$ whose marginals are $P$ and $Q$ on the first and second factors, respectively.
All of these metrics are in accordance with the optimization criterion of the generative models and we found them to be the most adequate for our setup from an information theoretic point of view.
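The four metrics above can be estimated from two batches of samples with a few lines of NumPy. The sketch below is only one possible estimator, not the procedure used in the experiments: the divergences are computed on shared-bin histograms with a small smoothing constant `eps` (an assumption needed to avoid division by zero), and the Earth Mover's distance uses the sorted-sample shortcut that is valid for equally sized 1D samples.

```python
import numpy as np

def histogram_densities(real, synth, bins=50, eps=1e-10):
    """Discretize both samples on shared bins; eps avoids empty bins."""
    lo = min(real.min(), synth.min())
    hi = max(real.max(), synth.max())
    edges = np.linspace(lo, hi, bins + 1)
    p = np.histogram(real, bins=edges)[0].astype(float) + eps
    q = np.histogram(synth, bins=edges)[0].astype(float) + eps
    return p / p.sum(), q / q.sum()

def kl_divergence(p, q):
    """Discrete KL divergence with p (real data) as the reference."""
    return float(np.sum(p * np.log(p / q)))

def js_divergence(p, q):
    """Symmetric Jensen-Shannon divergence (natural log, at most ln 2)."""
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def ks_statistic(a, b):
    """Largest gap between the two empirical CDFs."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def earth_movers_distance(a, b):
    """First Wasserstein distance for equally sized 1D samples."""
    assert len(a) == len(b), "sorted-sample shortcut needs equal sizes"
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

rng = np.random.default_rng(0)
real = rng.normal(0.0, 0.01, size=4000)     # hypothetical real log returns
synth = rng.normal(0.0, 0.012, size=4000)   # hypothetical synthetic log returns
p, q = histogram_densities(real, synth)
scores = {"KL": kl_divergence(p, q), "JSD": js_divergence(p, q),
          "KS": ks_statistic(real, synth),
          "EMD": earth_movers_distance(real, synth)}
```

All four scores are zero for identical samples and grow as the synthetic distribution drifts away from the real one, which is why lower values are preferred.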

Predictive accuracy test
To assess the quality of the synthetic data, we propose to predict stock market movement by framing the prediction as a classification problem: discerning whether stocks are up-trending or down-trending at a predefined time point, since the movement information is inherently generated from price. The goal is to provide profitable buy and sell action signals. In this context, this section presents a deep learning approach for financial time series prediction, involving four stages: (i) statistical clustering of stocks based on their normalized returns, (ii) statistical labelling of stocks as up-trending or down-trending, (iii) denoising and reducing the dimensionality of representations using a stacked autoencoder model, and (iv) training predictive models to generate the one-step-ahead output. In the following, we explain each block in detail.

Statistical Clustering for Industry Classification.
We selected an unsupervised learning algorithm over industry classifications such as GICS, NAICS, Finviz, etc. (which are independent of the pricing data) because the synthetic data is unknown to these classification schemes. We are therefore required to use an automatic algorithm to cluster the real and synthetic stock universe. In this context, we followed the protocol in [39], applying the k-means clustering algorithm to cluster the entire universe of stocks according to how close the normalized returns are to the cross-sectional means of the parent clusters. Let $N$ be the number of observations, $T$ the trading days, and $R_{is}$ the daily stock returns, $i = 1, \dots, N$, and $s = 1, \dots, T$. We cluster the normalized returns $\tilde{R}_{is}$, where $\tilde{R}_{is} = \frac{R_{is}}{\sigma_i u_i}$, $u_i = \frac{\sigma_i}{v}$ and $v = \exp(\text{Median}(\ln(\sigma_i)) - 3\,\text{MAD}(\ln(\sigma_i)))$. For all $u_i < 1$, $u_i \equiv 1$, and Median(·) and MAD(·) are cross-sectional. The standard deviation is computed with a lookback of 100 days; the clustering uses 15 clusters and is performed with a lookback of 1000 days and a stride of 30.
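The return normalization that precedes the k-means step can be sketched in NumPy as follows. This is a simplified illustration under several assumptions: `normalize_returns` is a hypothetical helper, the volatility is computed over the full sample rather than with the 100-day lookback, and MAD is taken to be the median absolute deviation, which the text does not spell out.

```python
import numpy as np

def normalize_returns(returns):
    """Normalize returns so that highly volatile stocks are damped.

    returns: array of shape (n_stocks, n_days) holding daily returns.
    """
    sigma = returns.std(axis=1, ddof=1)            # per-stock volatility
    log_sigma = np.log(sigma)
    med = np.median(log_sigma)                     # cross-sectional median
    mad = np.median(np.abs(log_sigma - med))       # assumption: median abs. dev.
    v = np.exp(med - 3.0 * mad)
    u = np.maximum(sigma / v, 1.0)                 # values below 1 are set to 1
    return returns / (sigma * u)[:, None]

# Hypothetical toy universe: three stocks with volatilities in ratio 1:2:4.
base = np.array([1.0, -1.0, 1.0, -1.0])
toy_returns = np.outer([0.01, 0.02, 0.04], base)
normalized = normalize_returns(toy_returns)
```

The rows of `normalized` could then be passed to any k-means implementation; the clustering call itself is left out to keep the sketch dependency-free.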

Statistical Labeling.
To predict the stock trend, we formulate the task as a classification problem, classifying stocks as up-trending or down-trending for each period in the training set and for each statistical cluster defined previously. Let $P_{is}$ be the time series of stock prices, where $i = 1, \dots, N$ labels the stocks and $s = 1, \dots, T$ labels the trading days. A time series from day $s$ is assigned a corresponding label, denoted $y_{is}$, according to the value of $P_{is}$ compared to the median of the cluster it belongs to. If $P_{is}$ is greater than or equal to the median, then $y_{is} = 1$; otherwise $y_{is} = 0$.
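The labelling rule can be illustrated for a single day with a short NumPy sketch. The function name, the toy values and the cluster assignments are hypothetical; only the rule itself (label 1 when a stock's value is greater than or equal to the median of its cluster, 0 otherwise) follows the text.

```python
import numpy as np

def label_by_cluster_median(values, clusters):
    """Binary labels per stock, relative to the median of its own cluster.

    values:   array (n_stocks,) with the quantity compared on a given day.
    clusters: array (n_stocks,) with the cluster id of each stock.
    """
    labels = np.zeros(len(values), dtype=int)
    for c in np.unique(clusters):
        mask = clusters == c
        median = np.median(values[mask])
        labels[mask] = (values[mask] >= median).astype(int)
    return labels

# Hypothetical day with five stocks split into two clusters.
values = np.array([1.0, 3.0, 2.0, 10.0, 30.0])
clusters = np.array([0, 0, 0, 1, 1])
labels = label_by_cluster_median(values, clusters)  # -> [0, 1, 1, 0, 1]
```

Because the comparison is made within each cluster, a stock is judged against its statistical peers rather than against the whole market.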

Denoising and dimensionality reduction.
Because of the huge number of immediate market movements and trade noise, financial data has a complicated structure of irregularities and roughness. The noise in financial data generally shows strong tailedness, which means that the underlying time series exhibits sharp breaks every once in a while. Ignoring these anomalies might lead to erroneous data mining and statistical modeling results. Therefore, in order to unveil more meaningful representations, we propose to denoise and reduce the dimensionality of the data using a stacked autoencoder (SAE) structure, built by layering a succession of single-layer autoencoders (AEs). The input daily log returns are mapped into the first hidden vector using the first single-layer autoencoder. The reconstruction layer of this autoencoder is discarded after training, and its hidden layer is passed as the input layer of the succeeding AEs. By trial and error, the bottleneck size was fixed at 16 and the depth was set to 4. The denoising is obtained by encoding and then decoding the input. If we reconstruct the time series using the bottleneck features, we reduce the outliers and obtain a smoothed input estimate from which to predict future stock prices. The dimensionality reduction is obtained by encoding the input and using the bottleneck features to predict future stock prices.

Prediction.
Three DNN variants have been implemented and tested, starting with a universal approximator, a 4-layer perceptron (MLP), moving on to a 1D ResNet50 variant [30], and finally a bidirectional Long Short-Term Memory (BiLSTM) network, with the goal of avoiding the long-term dependency problem of time-series data. We run our simulations over 18 years of data (from 2000 up to 2020, with the first 3 years being used solely for the statistical clustering), using two protocols: (i) train and test on real data, and (ii) train on a mix of real and synthetic data and test strictly on real data. Both scenarios use a split protocol in which the past 7 years of data are used for training and the following year for testing, in a rolling window manner, until we pass through the whole dataset. We use the first protocol to build a baseline approach. The bidirectional LSTM achieved the best results when trained and tested on real data, so we report it as our baseline. This approach is then compared with the second protocol, where the same model is trained on the mix of real and synthetic datasets, to assess whether the synthetic data improves the prediction of up-trending and down-trending stocks. To the best of our knowledge, our work is the first to perform trend prediction on such a long time span and for so many companies.
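The rolling train/test protocol can be sketched with a small pure-Python helper. The function `rolling_splits` and the year range used below are illustrative assumptions and are not meant to reproduce the exact evaluation periods of the experiments; only the 7-year training window followed by a 1-year test window, advanced one year at a time, follows the text.

```python
def rolling_splits(years, train_len=7, test_len=1):
    """Return (train_years, test_years) pairs in a rolling-window manner."""
    splits = []
    start = 0
    while start + train_len + test_len <= len(years):
        train = years[start:start + train_len]
        test = years[start + train_len:start + train_len + test_len]
        splits.append((train, test))
        start += test_len  # slide the window forward by one test period
    return splits

# Illustrative range only; the real dataset covers a longer period.
years = list(range(2010, 2020))
splits = rolling_splits(years)
for train, test in splits:
    # Test years always come strictly after the training years,
    # so no information from the future leaks into training.
    assert max(train) < min(test)
```

With this toy range the helper yields three pairs, the first training on 2010-2016 and testing on 2017.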

RESULTS AND DISCUSSION
Each generative model took 6±1 hours to train for 100 epochs on an NVIDIA QUADRO M4000, using the PyTorch [69] framework, and we examined the results of approximately 1200 models in total (excluding the preliminary stages in which we established the final architectures and training setups). The pre-processing took about 0.5 hours for each experiment, but since it was the same for all models, we saved the state of the system after pre-processing and reused it for each model. Regarding the prediction setup, the pre-processing took roughly 20 minutes and the entire training about 2 hours on the same hardware. As there are very few other approaches in the literature tailored to this type of time series data generation, we compare our proposed architectures to the ones in [78] and to our previous work [12]. Due to the vast number of experiments that we conducted, presenting all the results is impractical (we explored the outcome after more than 100 different epochs for more than 650 network models), which led us to adopt the following best-performer selection process. We took a snapshot of each network setting whenever it reached a new best value for any of the proposed quantitative metrics. Afterwards, we manually inspected all the qualitative metrics of these snapshots. Empirical results show that, among the proposed metrics, the Jensen-Shannon divergence is the best indicator of which model has the better overall performance, so we present the results for the snapshots that achieved the best JSD for each model. Also, between vanilla and Wasserstein training setups, preliminary results and previous work [12] suggest that the Wasserstein GAN with gradient penalty is superior in all aspects, so we continued solely with this framework for the GAN setups. By examining the results summarized in Table 1, we draw the following conclusions.

Training Results
Regarding the models' training procedure, several aspects are worth mentioning. We carried out experiments with the following learning rates: 1e-03, 5e-04, 1e-04 and 1e-05. The majority of models that produced viable results were trained with a learning rate of 1e-04. Choosing between mixed and complete models does not have a major influence on the trend prediction accuracy. Both techniques offer similar results, thus validating our mixing technique. Only 2 cases ended up with considerable differences: FCGAN_2 (the complete model outperforms the mixed model by 0.29%) and MLP_1 (the mixed model outperforms the complete model by 0.17%). All other model pairs have small differences (<0.1%). In all cases, however, the lower performing models still bring a meaningful improvement to the ground truth dataset and help the prediction models achieve higher accuracy.
Batch normalization layers no longer hurt the model training. A possible explanation is that in the stock market all prices across one day usually follow the same trend. If a major event happens, it is likely to affect the entire market in the same direction for all stocks. Therefore, log returns of different companies do not have significantly different values for a given day. Consequently, batch normalization applied to cross-sectional batches of data is likely to adapt well to individual samples (since they will all have similar mean and variance). This does not happen in the setups proposed in [12] and [78], where batch normalization layers lead to the notorious mode collapse. Another argument that batch normalization is no longer an issue is that models such as MLP_1 and FCGAN_1, which contain batch normalization layers, achieved valid results, as opposed to the works in [12,78], where any model containing batch normalization layers would fail.
Lastly, balancing the different losses (the weighting parameter in Equations 3 and 5) one way or another does not have a meaningful impact on the result. Each of our models was trained under 5 different setups, corresponding to the following values of this parameter: 0, 0.3, 0.5, 0.7 and 1.

Metric Results
We can identify several metrics that especially emphasize bad models, which is useful in reducing the search time for good candidates. For example, the Jensen-Shannon divergence is a very strong indicator of bad models: high JSD values mean that the model does not perform well, while low values generally indicate good models. However, this is not flawless, since the metric can be misleading when the model collapses to a single sample. Such a sample acts as a solution that minimizes the JSD in the generative model's cost function, so the model obtains low JSD values despite having bad overall characteristics. This is the case for the sn_FCGAN complete model, which, despite achieving a JSD value of 0.0666 (among the smallest ones), generates the same sample irrespective of the noise vector used to drive the generator. A metric characterizing the cumulative sum difference between two or more time series could therefore serve as a very good exclusion criterion.
We noticed that models generating samples whose probability density function (PDF) matches the PDF of the real samples tend to offer good results, whereas non-overlapping PDFs indicate bad models. This can be assessed by examining the 'Heavy tail' property, which also incorporates the PDF of the generated samples. We also noticed that models that managed to fit the 4th central moment, i.e., kurtosis, behave well on many levels. This can constitute a ranking criterion in future developments and was met only by FCGAN_2 and the VAE family. As also pointed out in [12,78], volatility clustering is difficult to capture. In our experiments, only FCGAN_1 managed to meet this property. Fitting the autocorrelation property, however, improved compared to our previous work [12]. With our new approach, 12/39 (30.77%) models managed to capture autocorrelation, while previous models [12] had a lower rate: 3/12 (25%). We believe that the proposed dataset preparation technique helped in capturing autocorrelation by feeding cross-correlated samples at each iteration. Based on the number of properties that our models successfully captured and on the quantitative metrics obtained, we compiled a shortlist of the best performing models from each major class that was implemented: MLP_4 complete, FCGAN_4 up, sn_FCGAN complete, GMMN_AE_FC complete and VAE_FC complete. Out of these, the clear winner is VAE_FC complete, which outperforms every other model in almost every aspect.

Prediction Results
Trained and tested strictly on real data, the best performer was the BiLSTM network, which achieved an accuracy of 50.04% and became the baseline reference for our research. We further augmented the training dataset with synthetic data obtained with each of the models presented in Table 2 and tested the baseline algorithm on real data to assess whether the synthetic data adds value. We report the mean accuracy obtained over the 10 train-test split pairs presented in Section 6.3, as well as the maximum accuracy over all evaluation periods. The motivation is that the mean accuracy offers a general appreciation of how well such a model is adapted to the entire period, whereas the maximum accuracy identifies the best performing training and testing periods. One could argue that after finding the maximum accuracy it would be worth freezing the corresponding model and using it to predict the market performance on the entire validation dataset. The maximum accuracies are on average 0.11% higher than the mean accuracies, showing an important performance variation and confirming the volatile nature of stock markets.
One important note regarding prediction accuracy values is that most papers in the literature report accuracies roughly in the 50%-65% range: Feng et al. [19] obtained accuracies of 57.2% and 53.05% on datasets containing 88 and 50 stocks, spanning 2 and 9 years, respectively; Hu et al. [34] reached an accuracy of 47.8% on 2527 stocks over the course of 3 years; Kinlay [46] tested 1 million different prediction models and obtained an accuracy of 51.5% on 10 stocks, spanning 10 years; Liu [57] obtained an accuracy of 66.93% on 473 stocks, spanning 12 years; finally, Wiese et al. [82] report a prediction accuracy of 58.23% on 88 stocks over 2 years. The results vary widely, mostly because of the way the experiments were conducted: there is no universal consensus regarding the training and testing datasets, which makes prediction results difficult to compare. Our financial partners clearly expressed that achieving 52% true prediction accuracy for this application is nearly impossible and would lead to immense financial profits, so we set this as a gold standard. Contrary to the enumerated approaches, we report the average results obtained on 1,506 companies over 20 years. This is a significantly larger dataset than anything reported in the literature. Moreover, we did not make any selection as to which periods to report. This is important, since the year 2008 introduces a strong disturbance in the algorithms' performance due to the financial crisis, when most patterns were broken and almost all companies suffered important losses. Finally, our trend prediction algorithm focuses more on fairness (following the correct steps such that no information from the future is leaked into the training set) than on achieving the best performance, which is not our goal, because subtle errors can occur very often and lead to unrealistically high prediction results. Looking at the previously compiled shortlist of models, we can see the following absolute accuracy improvements over the baseline: MLP_4 complete (+0.30%), FCGAN_4 up (+0.28%), sn_FCGAN complete (+0.27%), GMMN_AE_FC complete (+0.36%) and VAE_FC complete (+0.27%). Given that these accuracies are computed as an average over the performances of 1506 companies, and that the baseline prediction accuracy is 50.04%, we can conclude that the proposed generative models achieved their intended purpose of boosting the trend prediction accuracy.
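As a small worked example, the absolute improvements quoted above can be converted into the resulting accuracies and compared against the 52% gold standard. The snippet only restates the reported numbers; the variable names are introduced here for illustration.

```python
# Baseline accuracy and absolute gains (in percentage points) reported above.
baseline = 50.04
gains = {
    "MLP_4 complete": 0.30,
    "FCGAN_4 up": 0.28,
    "sn_FCGAN complete": 0.27,
    "GMMN_AE_FC complete": 0.36,
    "VAE_FC complete": 0.27,
}

# Accuracy obtained when training on the augmented (real + synthetic) data.
augmented = {name: round(baseline + gain, 2) for name, gain in gains.items()}

best = max(gains, key=gains.get)                  # largest absolute gain
gap_to_gold = round(52.0 - augmented[best], 2)    # distance to the 52% target

for name, acc in augmented.items():
    print(f"{name}: {acc:.2f}%")
```

Even the largest gain leaves the augmented accuracy below the 52% mark, which is consistent with the statement that this threshold is very hard to reach.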

Takeaway Findings
Closely examining each model allowed us to identify several key aspects:
• Under the GAN formulation, Wasserstein training (with gradient penalty) outperforms its vanilla counterparts. As mentioned before, all preliminary experiments indicated this aspect.
• There are several models that converged to generating a single sample, irrespective of the noise vector used to drive the generators. These models are marked with * in Table 1 and among them are the spectral normalization GANs and GMMNs. This indicates that these 2 types of generative networks are not well suited for this problem.
• Variational models reached convergence much faster than all other models. On average, it took them 20 epochs to reach the best state, whereas other models took 59 epochs.
• Variational models offered the best overall results. We ranked the models based on each individual quantitative metric, averaged these ranks and performed a final ranking based on this average. 4 out of the first 5 ranks were occupied by variational models, which is consistent with our manual analysis.
• Using any of the proposed models for dataset augmentation helps in achieving better prediction accuracy with the proposed prediction framework. Even though the accuracy increase is not spectacular and the prediction framework is not optimal, it is enough to prove that our proposed solutions achieve their goal. Concretely, a financial time-series regime prediction model achieves better results if the training dataset is augmented with synthetically generated samples.
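The rank-averaging procedure mentioned in the list above can be sketched in NumPy as follows. The helper `average_rank`, the model names and the metric values are hypothetical placeholders rather than results from the experiments, and the sketch assumes that lower is better for every metric and that ties are broken by row order.

```python
import numpy as np

def average_rank(metric_table):
    """Rank models per metric, then order them by their mean rank.

    metric_table: array (n_models, n_metrics), lower values are better.
    Returns the mean rank of each model and the model indices, best first.
    """
    # argsort applied twice converts values into 0-based ranks per column.
    ranks = np.argsort(np.argsort(metric_table, axis=0), axis=0) + 1
    mean_rank = ranks.mean(axis=1)
    order = np.argsort(mean_rank, kind="stable")
    return mean_rank, order

# Hypothetical scores for three models on two metrics (e.g. JSD and EMD).
names = ["model_A", "model_B", "model_C"]
table = np.array([[0.10, 0.30],
                  [0.20, 0.10],
                  [0.30, 0.20]])
mean_rank, order = average_rank(table)
ranking = [names[i] for i in order]  # -> ['model_B', 'model_A', 'model_C']
```

Averaging ranks instead of raw values avoids having to rescale metrics that live on different numerical ranges.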

Open Challenges
Given that this research field is still in its early stages, we encountered several issues that have not been specifically addressed in the literature, nor mentioned by researchers in their prospective future works so far. We have compiled a list of the challenges that remain open up to this point:
• Optimal architecture. There is no consensus in the current state of the art as to what network architecture works best for generating financial time series. Researchers have tried several models but have failed to find one type that outperforms all others.
• Cost function. Finding the right cost function to optimize the generation process is another key aspect that requires further investigation. Since a concrete evaluation metric is missing, it is difficult to design a proper cost function without knowing what the end goal of the learning procedure is. This makes the choice of an optimization function a random process.
• Standardized evaluation. Most papers in this field report their results on different datasets and under different evaluation setups. Without a common denominator, it is extremely difficult to assess which algorithm performs better. Moreover, it is still debatable what evaluation metric should be used to assess the goodness of the synthetic data. This last point is a common problem for generative models, also encountered in other fields, such as computer vision.

CONCLUSIONS
In this paper we proposed a complex framework for generating realistic financial time-series. We proposed a new way of extracting batches of data from the training set, adapted to the particularities of financial time-series. We investigated 3 major classes of generative models with various model compositions, setups, hyperparameters, training frameworks and data regimes. We examined different qualitative and quantitative metrics and tested the dataset augmentation ability on real data, under a complex prediction scenario.
Based on our results, we strongly believe it is necessary to perform exhaustive tests on a large number of network models in order to find the optimal setup. This study involved a large number of iterations in order to single out the best combinations that can provide a performance boost to fintech algorithms. Our progress was validated at each step by experts from our financial partner, who offered valuable insights when the nature of the processed data cast a shade of ambiguity on the obtained results.
Our current work suggests that even though good results can be achieved with various generative models, it is far more probable to find a good setup under the variational autoencoder framework. This framework satisfies both qualitative and quantitative constraints, improves accuracy when used for augmenting the training dataset on the prediction task, converges faster than all other models, and generates samples realistic enough to be difficult for financial experts to tell apart from real data.
Finally, similar to the image generation field, we stress the need to find a metric or validation framework that can capture both objective and subjective properties in a single quantifiable value. This should completely characterise the goodness of the generated samples and serve as an optimisation criterion. Our future work will focus on this specific part, since it was among the most difficult obstacles that we encountered during the development stage.

Fig. 1. Log return influence on the intra and inter variations of the closing prices for two companies (A.O. Smith Corp - AOS and Adobe Inc. - ADBE); top - closing prices evolution over 20 years, bottom - log return of the same prices.

Fig. 2. Visual representation of the S&P dataset layout in time.

Fig. 4. Continuous dataset arrangement for training and testing during the entire sample period.
(a) Synthetic samples were obtained with a good generation model. (b) Synthetic samples were obtained with a bad generation model.

Fig. 6. Autocorrelation box plots computed on an entire batch of samples, for different lags.
(a) Cumulative sum plot obtained on a batch of real samples. (b) Cumulative sum plot obtained on a batch of synthetic samples generated by a good model. (c) Cumulative sum plot obtained on a batch of synthetic samples generated by a bad model.

Fig. 7. Cumulative sum plots computed on an entire batch of samples.
(a) Probability density functions of real samples and synthetic samples generated by a good model. (b) Probability density functions of real samples and synthetic samples generated by a bad model.

Fig. 8. Probability density function plots computed on an entire batch of samples.
2 : is similar to  1 , but without dropout and batch normalization layers;•  3 : has the same layer organization structure as  2 , but with different number of neurons on each layer; •  4 : is a shallow version of  3 , with different number of neurons in the discriminator's layers.  (  ) structure in the generator and a   − 1 − 12 − 24 − 48 − 96 − 1 −   (1) structure in the discriminator, where each value represents the number of feature maps in the respective layer and   represents the length of the synthesized 1D array.The dimension of the normal distributed random noise vector that was used as input for the generator is on the first layer's position, i.e., 100.

Table 1. Qualitative and quantitative results. The * in the Regime column indicates that the respective model resulted in a mode collapse. Qualitative metrics that have been met by the model are indicated by 'yes', whereas those that were missed are indicated by 'no'.