WaveTF: a fast 2D wavelet transform for machine learning in Keras

The wavelet transform is a powerful tool for performing multiscale analysis and it is a key subroutine in countless applications, from image processing to astronomy. Recently, it has extended its range of users to include the ever-growing machine learning community. For a wavelet library to be efficiently adopted in this context, it needs to provide transformations which can be integrated seamlessly into already existing machine learning workflows and neural networks, leveraging the same libraries and running on the same hardware (e.g., CPU vs GPU) as the rest of the machine learning pipeline, without impacting training and evaluation performance. In this paper we present WaveTF, a wavelet library available as a Keras layer, which leverages TensorFlow to exploit GPU parallelism and can be used to enrich already existing machine learning workflows. To demonstrate its efficiency we compare its raw performance against alternative libraries and finally measure the overhead it causes to the learning process when it is integrated into an already existing Convolutional Neural Network.


Introduction
The wavelet transform [18] is a powerful tool for multiscale analysis. It produces a mix of time/spatial and frequency data and has countless applications in many areas of science, including image compression, medical imaging, finance, geophysics, and astronomy [2]. Recently, the wavelet transform has also been applied to machine learning, for instance to extract the feature set to be used by a standard learning workflow [3,16] and to enhance Convolutional Neural Networks (CNNs) [4,12,21,15]. For many of these applications, and machine learning in particular, parallel execution on GPGPU accelerators is of critical importance to ensure the tractability of real-world problems. Therefore, a library that provides wavelet transform functionality in this context must integrate efficiently into existing computational pipelines, mitigating the loss of performance due to exchanging data between memories in different phases of the computation; e.g., if our pipeline runs on a GPU we would like to execute the wavelet transform on the same device, without repeatedly moving data between the GPU and main memory.
In this work we present WaveTF, a library providing a fast 1D and 2D wavelet implementation that provides scalable parallel execution on CPU and GPU devices. WaveTF enables full GPU execution of computational pipelines including wavelet transforms. The library is built on top of the popular TensorFlow framework and is exposed as a Keras layer, making it easy to integrate into existing Python workflows based on these widely adopted frameworks. Our evaluation shows that WaveTF improves upon the state of the art by providing faster routines and by adding only a negligible overhead to machine learning applications.
The rest of this manuscript is structured as follows. In Sec. 2 we provide a description of wavelet transforms, followed by a discussion of the related work in Sec. 3. Section 4 describes the implementation of the WaveTF library, while an evaluation of its performance is presented in Sec. 5. Finally, Sec. 6 points the reader to the software and Sec. 7 concludes the manuscript.

Wavelet transform
Wavelet transforms are a family of invertible signal transformations that, given an input signal evolving in time, produce an output which mixes time and frequency information [8]. This paper will only focus on discrete transformations.

Haar transform
The simplest wavelet transform is the Haar transform which, given as input a signal x = (x_0, ..., x_{n-1}) (with n even), produces as output

H(x) := (l_0, ..., l_{n/2-1}, h_0, ..., h_{n/2-1}) = (l(x), h(x)),  where  l_i = (x_{2i} + x_{2i+1}) / √2  and  h_i = (x_{2i} - x_{2i+1}) / √2,   (1)

with l_i and h_i containing the low- and high-frequency components localized at times 2i and 2i+1 of the original signal. Note that when the input size is not even, the signal must be extended using some form of padding. The wavelet transform is often iterated on the low components to carry out a multiscale analysis:

x ↦ (l(x), h(x)) ↦ (l(l(x)), h(l(x)), h(x)) ↦ ...   (2)
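The formulas above can be checked with a minimal NumPy sketch of one Haar level (WaveTF itself operates on TensorFlow tensors; this stand-alone version is only for illustration):

```python
import numpy as np

def haar_1d(x):
    """Single-level 1D Haar transform of an even-length signal.

    Returns the concatenation (l, h) of the low- and high-frequency
    components, each of length len(x) // 2.
    """
    x = np.asarray(x, dtype=np.float64)
    assert x.size % 2 == 0, "odd-length inputs must be padded first"
    even, odd = x[0::2], x[1::2]
    l = (even + odd) / np.sqrt(2)  # low-frequency components
    h = (even - odd) / np.sqrt(2)  # high-frequency components
    return np.concatenate([l, h])
```

Since the transform is orthonormal, it preserves the Euclidean norm of its input, which makes for a quick sanity check of any implementation.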

Daubechies wavelet
The Haar transform can be extended so that l_i and h_i are linear functions of more than two terms, as done by the following Daubechies-N=2 (DB2) transform (see [7] for details):

l_i = λ_0 x_{2i-1} + λ_1 x_{2i} + λ_2 x_{2i+1} + λ_3 x_{2i+2},   h_i = µ_0 x_{2i-1} + µ_1 x_{2i} + µ_2 x_{2i+1} + µ_3 x_{2i+2},   (3)

with the vectors λ := (λ_0, λ_1, λ_2, λ_3) and µ := (µ_0, µ_1, µ_2, µ_3) being orthonormal. When working with larger kernels (4×2 in this case, where Haar was 2×2), the border of the signal must always be extended with padding to be able to invert the transformation.
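As a concrete instance, the standard DB2 coefficient values (shown here in one common convention; the ordering and normalization used in Eq. (3) may differ) satisfy the orthonormality requirement, as this small NumPy check illustrates:

```python
import numpy as np

# Standard DB2 low-pass coefficients (one common convention; the paper's
# Eq. (3) may order or normalize them differently).
s3 = np.sqrt(3.0)
lam = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2.0))
# High-pass coefficients: alternating-sign reversal of the low-pass ones.
mu = np.array([lam[3], -lam[2], lam[1], -lam[0]])

# Orthonormality of lambda and mu, needed for invertibility:
assert np.isclose(lam @ lam, 1.0)
assert np.isclose(mu @ mu, 1.0)
assert np.isclose(lam @ mu, 0.0)
```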

Multidimensional transform
The wavelet transform is extended to multidimensional signals by executing it successively along each dimension. For instance, in the two-dimensional case the input is a matrix and the output is obtained by first transforming the rows and then the columns; it is thus formed by 4 matrices (conventionally called LL, LH, HL, and HH), containing the low and high components for the horizontal and vertical directions (an example can be seen in Fig. 1). As in the 1D case, multidimensional transformations can also be iterated to perform a multilevel analysis (see Fig. 2). When the input is a multichannel image (e.g., RGB or HSV), the transformation is performed independently for each channel.
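The row-then-column construction can be sketched in NumPy as follows (Haar kernel; note that the LH/HL naming convention varies between libraries, so the quadrant labels here are illustrative):

```python
import numpy as np

def haar_rows(m):
    """1D Haar along the last axis (even length assumed)."""
    even, odd = m[..., 0::2], m[..., 1::2]
    return np.concatenate([even + odd, even - odd], axis=-1) / np.sqrt(2)

def haar_2d(m):
    """2D Haar: transform the rows, then the columns, of an even-sided matrix."""
    t = haar_rows(haar_rows(m).T).T
    h, w = t.shape
    # Four quadrants: low/high vertical x low/high horizontal.
    return {"LL": t[:h // 2, :w // 2], "LH": t[:h // 2, w // 2:],
            "HL": t[h // 2:, :w // 2], "HH": t[h // 2:, w // 2:]}
```

For a constant input all the energy ends up in the LL quadrant, while the three detail quadrants are identically zero.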

TensorFlow and Keras
TensorFlow [1] is a powerful framework for efficiently manipulating multidimensional arrays (i.e., tensors) in parallel, and it provides APIs for Python, C++, Java and JavaScript. It has been developed as a fast and scalable framework for machine learning, and for this purpose it is complemented by the higher-level Keras library [6]. However, TensorFlow offers many powerful algebraic routines which can be used independently of the application, and its Python API can be seen as a parallel, GPU-enabled version of NumPy [23], with which it shares many similarities in the syntax and names of its methods. Note that TensorFlow supports a wide variety of computing hardware: it can run on multiple CPUs and GPUs, and also on specialized ASICs known as TPUs [13], which are now available to end-users as part of the Google Cloud infrastructure.
We have chosen to implement WaveTF leveraging TensorFlow's rich API and scalability, so that it can easily exploit the available parallelism, integrate easily and efficiently with other programs that use TensorFlow and Keras, and provide its functions to the growing machine learning community.

Related work
In this section we briefly describe three alternative wavelet libraries available for Python and published as open source software: PyWavelets, pypwt and TF-Wavelets. In Sec. 5.1 we will compare their raw performance to that of our library.

PyWavelets
PyWavelets [14] is probably the most widely used Python library for wavelet transforms. Its core routines are written in C and made available to Python through Cython. It supports 1D and 2D transformations and provides over 100 built-in wavelet kernels and 9 signal extension modes. Unlike WaveTF, it is a sequential library and runs exclusively on CPUs.

pypwt
pypwt [20] is a Python wrapper of PDWT, which in turn is a C++ wavelet transform library, written on the parallel CUDA platform and running on NVIDIA GPUs. It implements 1D and 2D transforms (though it does not support batched 2D transforms), supports 72 wavelet kernels, and adopts periodic padding for signal extension.

TF-Wavelets
TF-Wavelets [10,17] is a Python wavelet implementation which, like WaveTF, leverages the TensorFlow framework. It features two wavelet kernels (Haar and DB2) and implements periodic padding for signal extension. It is the library most conceptually similar to ours, allowing, for instance, both input and output to reside in GPU memory, and it is thus the best match for a raw performance comparison against WaveTF. However, it lacks support for batched, multichannel, 2D transforms, which are typically required for machine learning applications in Keras; as a consequence, it does not provide a network layer for that framework.

Implementation
WaveTF is written in Python using the TensorFlow API. It exposes its functions via a Keras layer which can either be called directly or be plugged easily into already existing neural networks. The library currently implements the Haar (Eq. (1)) and DB2 (Eq. (3)) wavelet kernels, which are the two most commonly used ones. To handle border effects, anti-symmetric-reflect padding (known as asym in MATLAB) has been implemented, which extends the signal while preserving its first-order finite difference at the border. WaveTF supports both 32- and 64-bit floats transparently at runtime.
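The anti-symmetric-reflect scheme can be sketched in NumPy as follows (a stand-in for WaveTF's internal padding, not its actual code): the signal is point-reflected about its edge values, so a locally linear signal is extended linearly and the first-order finite difference at the border is preserved.

```python
import numpy as np

def antireflect_pad(x, k):
    """Pad x with k samples per side by anti-symmetric reflection."""
    x = np.asarray(x, dtype=np.float64)
    left = 2 * x[0] - x[k:0:-1]          # 2*x[0] - (x[k], ..., x[1])
    right = 2 * x[-1] - x[-2:-k - 2:-1]  # 2*x[-1] - (x[-2], ..., x[-k-1])
    return np.concatenate([left, x, right])
```

For example, antireflect_pad([1, 2, 3, 4], 2) continues the linear ramp to [-1, 0, 1, 2, 3, 4, 5, 6].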

Direct transform
In order to efficiently implement the wavelet transform in TensorFlow we first reshape it as a matrix operation. Let us consider, as an example, the 1D DB2 transform with input size n, where n is a multiple of 4. The original formulation of the transform presented in Sec. 2.1.2 can be rewritten as a matrix multiplication between a data matrix, whose rows are the overlapping length-4 windows of the (padded) input, and the 4×2 kernel matrix whose columns are λ and µ. In order to generate this data matrix we need to group the data vector by 4 and interleave it with a copy of itself, shifted left by two (plus some constant operations for the padding at the border). This operation can be implemented with the reshape, concat and stack methods provided by TensorFlow. Alternatively, the specialized conv1d method can be employed instead of the standard matrix multiplication, somewhat simplifying the data rearrangement. We have implemented both variants and found that the convolution one is faster in all considered cases, except for the 1D Haar transform (for which we have thus adopted the matrix multiplication algorithm). Note that when n is not a multiple of 4, the border values are arranged slightly differently, but the procedural steps remain the same.
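The convolution variant can be sketched outside TensorFlow with a strided window view (a NumPy stand-in for tf.nn.conv1d with stride 2; the padding convention here is anti-symmetric reflect, and the code is illustrative rather than WaveTF's actual implementation):

```python
import numpy as np

def db2_direct(x):
    """1D DB2 forward transform as a stride-2 correlation with 4-tap kernels."""
    x = np.asarray(x, dtype=np.float64)
    s3 = np.sqrt(3.0)
    lam = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2.0))
    mu = np.array([lam[3], -lam[2], lam[1], -lam[0]])
    # Pad one sample per side so that every stride-2 window of 4 is defined.
    xp = np.concatenate([[2 * x[0] - x[1]], x, [2 * x[-1] - x[-2]]])
    windows = np.lib.stride_tricks.sliding_window_view(xp, 4)[::2]  # (n/2, 4)
    return windows @ lam, windows @ mu  # low and high components
```

On a constant signal the low components all equal √2 times the constant, while the high components vanish, matching the behaviour of the matrix formulation.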

Inverse transform
In this section we show how to properly invert the DB2 wavelet transform, taking into account the border effects while keeping the padding as small as possible. This is done both to justify the exact algorithmic steps we adopted and to offer a future reference for alternative implementations by other authors. To the best of our knowledge the following derivation, at this level of detail, is original, though it may already be present, at least implicitly, in the vast literature on wavelet transforms.
To better understand how to properly handle the border effects when computing the inverse, let us reshape the transformation above in a slightly different way, i.e., as w = Wx = KPx, with K being the n × (n+2) kernel matrix and P the (n+2) × n (anti-symmetric-reflect) padding matrix. We can then decompose K, P and W into (non-square) blocks. To invert W we first note that the central block K_11 has orthonormal rows and thus admits its transpose as a right inverse: K_11 K_11^T = I_{n-4}. Furthermore, the border blocks W_00 := K_00 P_00 and W_22 := K_22 P_22 have linearly independent columns and thus admit a (Moore-Penrose) left inverse: W_00^+ W_00 = W_22^+ W_22 = I_2. Finally, the choice of coefficients in Eq. (3) provides an additional identity relating the border and central blocks. Combining these facts yields an explicit inverse of W: its non-border elements are computed similarly to the direct transform case, while its border values are recovered via the left inverses W_00^+ and W_22^+.
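The key orthonormality fact (K_11 admitting its transpose as a right inverse) can be verified numerically; the sketch below builds a small stand-in for the central block out of 2-shifted copies of the standard DB2 coefficient vectors (illustrative sizes, not the paper's exact matrices):

```python
import numpy as np

s3 = np.sqrt(3.0)
lam = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2.0))
mu = np.array([lam[3], -lam[2], lam[1], -lam[0]])

def central_block(pairs, width):
    """Rows are copies of lam and mu, shifted right by 2 for each pair."""
    K = np.zeros((2 * pairs, width))
    for i in range(pairs):
        K[2 * i, 2 * i:2 * i + 4] = lam
        K[2 * i + 1, 2 * i:2 * i + 4] = mu
    return K

K11 = central_block(4, 10)  # 8 rows spanning 10 columns
# Orthonormal rows: K11 @ K11.T is the 8x8 identity.
assert np.allclose(K11 @ K11.T, np.eye(8))
```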

Correctness
In addition to the formal derivation given above, we have tested our implementation for consistency against PyWavelets, and we have composed direct and inverse transforms to verify that they result in an identity map (up to numerical precision errors). The randomized test code is included with the source code and is runnable with the pytest framework [19]. Note that, contrary to PyWavelets, WaveTF always uses a minimal padding when transforming: e.g., WaveTF's output for an input vector of size 10 is a 2×5 matrix, whereas PyWavelets produces a 2×6 matrix when using the DB2 kernel and a 2×5 one when using the Haar kernel.
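A randomized roundtrip check in the same spirit can be written as follows (Haar kernel, NumPy stand-ins rather than WaveTF's actual API, runnable under pytest):

```python
import numpy as np

def haar_fwd(x):
    # Forward Haar: pairwise sums and differences, normalized.
    even, odd = x[0::2], x[1::2]
    return np.concatenate([even + odd, even - odd]) / np.sqrt(2)

def haar_inv(w):
    # Inverse Haar: recover even/odd samples from the (l, h) halves.
    n = w.size // 2
    l, h = w[:n], w[n:]
    out = np.empty(2 * n)
    out[0::2] = (l + h) / np.sqrt(2)
    out[1::2] = (l - h) / np.sqrt(2)
    return out

def test_haar_roundtrip():
    # Composing direct and inverse transforms must yield the identity
    # (up to floating-point precision).
    rng = np.random.default_rng(42)
    x = rng.standard_normal(1024)
    assert np.allclose(haar_inv(haar_fwd(x)), x)
```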

Performance results
The performance of WaveTF has been tested in two ways:
• by executing raw signal transforms, leaving the output data available for the user either in RAM or in the GPU memory;
• by integrating it as an additional layer in an existing Convolutional Neural Network and measuring the resulting training and evaluation overhead.
In the first test, we also computed the same transformations with the PyWavelets, pypwt and TF-Wavelets libraries to compare their performance to WaveTF's. In order to better exploit the computational power provided by the GPU [5], the tests have been run with single-precision floating-point types: np.float32 for PyWavelets, tf.float32 for WaveTF and TF-Wavelets, and pypwt compiled to use 32-bit floats.
The hardware and software used in the tests are detailed in Tables 1 and 2.

Raw transformation
PyWavelets operates in RAM, and pypwt uses RAM for input and output but runs its computation on the GPU. On the other hand, WaveTF and TF-Wavelets operate on TensorFlow tensors which, when GPUs are available and used, reside in the GPU memory. We expect to see this difference reflected in the runtimes, because of the overhead of moving data between GPU and RAM. We have recorded the wall clock time of one- and two-dimensional Haar and DB2 wavelet transforms using WaveTF, PyWavelets, pypwt and TF-Wavelets. For WaveTF, we have measured both the time required when leaving the data in the GPU memory and when input and output are required to be in main memory. For TF-Wavelets we have instead focused on the fastest case of working only on GPU memory, to offer a fair comparison with WaveTF. The test procedure for the one-dimensional case is as follows:
• a random array of n elements is created, with n ranging from 5·10^6 to 10^8;
• for the non-batched case the array is used as is (i.e., shape = [n]), while for the batched case it is reshaped to [b, n/b], with b = 100;
• the transform is executed on the same input array from a minimum of 500 up to a maximum of 10000 times (more repetitions for the smaller data sizes); the total time is measured and the time per iteration is recorded.
For the two-dimensional case, the input matrix is chosen to be as square as possible given the target total size of n elements, i.e., shape ≈ [√n, √n].
Note that we have not measured the time to execute a single transformation, but instead the time to execute many of them grouped together (up to 10000), because the single execution time when working in GPU memory would have been completely overshadowed by the setup time required for the library calls. The standard deviations for these grouped measures are all well below 1%, so they are not shown in the plots.
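The grouped-timing procedure can be sketched as follows (a generic harness, not the paper's benchmark code; for GPU-resident libraries a result must additionally be fetched to force synchronization before stopping the clock):

```python
import time
import numpy as np

def time_per_call(fn, x, reps=500):
    """Average wall-clock time of fn(x) over reps grouped executions."""
    fn(x)  # warm-up call, excluded from the measurement
    t0 = time.perf_counter()
    for _ in range(reps):
        fn(x)
    return (time.perf_counter() - t0) / reps

# Example with a trivial stand-in transform:
t = time_per_call(lambda a: a * 0.5, np.ones(10_000), reps=100)
```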

Discussion
As can be seen from the data in Fig. 3 and Table 3, there is a huge performance gap between PyWavelets and pypwt on the one hand and the TensorFlow-based libraries on the other. The performance of PyWavelets is explained by the fact that it is a serial program which does not exploit the parallelism available in the GPU. pypwt, on the other hand, does use the GPU but incurs a big overhead caused by the data movement between GPU and main memory, as demonstrated by the similar performance achieved by WaveTF when it is forced to have both input and output in RAM. When working directly in GPU memory, WaveTF and TF-Wavelets have a big performance advantage over the other evaluated libraries, with WaveTF being about 70x faster than PyWavelets and pypwt on 1D Haar and 30-40x on 1D DB2. For the 2D cases WaveTF has a speedup greater than 40x over PyWavelets and 12-14x over pypwt. This test scenario mirrors the common situation in TensorFlow-based machine learning workflows using wavelet transforms.
The speedup of WaveTF over TF-Wavelets is still quite remarkable, considering that both libraries adopt the same general strategy: it ranges from 1.6x up to 3.2x. This improvement is mainly due to a careful algorithmic implementation that avoids redundant computations.

Machine learning
In this section we quantify the overhead of integrating WaveTF in machine learning workflows. For this purpose we consider a classification problem on a standard image dataset solved by a simple CNN. In our experiment we measure the training and evaluation times before and after enriching the CNN with wavelet layers.
For this test we have adopted the Imagenette2-320 dataset [11], a subset of 10 classes from ImageNet [22], consisting of 9469 training and 3925 validation RGB images. For the classification task we used a basic CNN featuring 5 levels of convolution, each followed by a downscaling which halves the spatial feature dimensions (i.e., 320x320 → 160x160 → 80x80 → 40x40 → 20x20). To enrich this network with the wavelet transform, the feature maps produced at each downscaling level are concatenated with the wavelet components (LL_l, LH_l, HL_l, HH_l) of the corresponding scale, before the following convolution is performed (see Fig. 4); the successive wavelet levels are computed iteratively as shown in Eq. (2). This approach has been used, e.g., for improving texture classification [9].

Since the objective of our experiment is only to quantify the computational overhead of adding wavelet features via WaveTF to the network, we disabled all forms of data augmentation for the training: these procedures would add their own considerable overhead, which would confound our results. To compute the training overhead, we measured the wall clock time required to train the model for 20 epochs, with and without enriching the network with the wavelet features. We repeated this training process 20 times (after a first, unmeasured run, used to bring memory buffering to a stationary state). To measure the overhead incurred in evaluation, we instead used the trained network to evaluate all the images in the dataset and repeated the process 20 times.
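The channel bookkeeping of one enrichment step can be sketched as follows (NumPy stand-in; the feature-map count and the Haar kernel are illustrative, and the real network would use WaveTF's Keras layer plus a concatenation layer instead):

```python
import numpy as np

def haar_2d_channels(img):
    """Per-channel 2D Haar quadrants of an (H, W, C) image (H, W even)."""
    e, o = img[0::2], img[1::2]                       # transform rows
    lo, hi = (e + o) / np.sqrt(2), (e - o) / np.sqrt(2)
    def col_split(m):                                 # then columns
        e, o = m[:, 0::2], m[:, 1::2]
        return (e + o) / np.sqrt(2), (e - o) / np.sqrt(2)
    (ll, lh), (hl, hh) = col_split(lo), col_split(hi)
    return ll, lh, hl, hh

# One enrichment step: concatenate the downscaled CNN feature maps with
# the wavelet components of the matching scale along the channel axis.
features = np.zeros((160, 160, 32))        # hypothetical feature maps
ll, lh, hl, hh = haar_2d_channels(np.zeros((320, 320, 3)))
enriched = np.concatenate([features, ll, lh, hl, hh], axis=-1)
```

Each wavelet component of an RGB input contributes 3 channels, so the enriched tensor here carries 32 + 4·3 = 44 channels into the next convolution.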

Discussion
As can be seen from the results shown in Table 4, the overhead of adding wavelet features to the existing 5-level CNN is below 1%, both in training and evaluation, thus allowing its use at an almost negligible cost.
Software availability
WaveTF is distributed as open source software, with accompanying documentation and some usage examples, which also include the CNN used in this paper. The link to the GitHub repository is shown in Table 2.

Conclusion and future work
In this work we have presented an efficient wavelet library which leverages TensorFlow and Keras to exploit GPU parallelism and allows for easy integration into already existing machine learning workflows. Since the wavelet transform is characterized by high parallelism and low computational complexity (its time complexity being O(n) for an input of size n), minimizing communication is pivotal to achieving good performance, and in this work we have shown how to do so by limiting transfers between the GPU and main memory whenever possible.
In the future we plan to extend the library to include other popular wavelet kernels and padding modes, as well as to support 3D signals.