A quasi-orthogonal, invertible, and perceptually relevant time-frequency transform for audio coding

We describe ERB-MDCT, an invertible real-valued time-frequency transform based on the MDCT, which is widely used in audio coding (e.g. MP3 and AAC). ERB-MDCT is designed along the lines of ERBLet, a recent invertible transform whose resolution evolves across frequency to match the perceptual ERB frequency scale, whereas the frequency scale of most invertible transforms (e.g. MDCT) is uniform. ERB-MDCT has nearly the same frequency scale as ERBLet, but its atoms are quasi-orthogonal, i.e. its redundancy is close to 1. Furthermore, its energy is sparser in the time-frequency plane. It is therefore more suitable for audio coding than ERBLet.


INTRODUCTION
State-of-the-art lossy audio codecs use real-valued time-frequency (TF) transforms, typically the Modified Discrete Cosine Transform (MDCT) for MP3 and AAC [1]. The motivation is that modeling auditory perception is more efficient in the TF domain. The MDCT is perfectly invertible and has a redundancy of 1. In other words, the number of transform coefficients equals the number of samples in the signal. Usually, TF transforms for audio coding are orthogonal bases, i.e. the vectors (called atoms here) that define the analysis and synthesis operators are orthogonal and span the signal space. These properties are usually associated with a fixed frequency resolution that is not in line with auditory perception (see Sec. 2). In practice, a masking threshold is computed on the uniform frequency grid by interpolating masking thresholds computed on another, perceptual frequency scale, which is not optimal.
Perceptual TF transforms have already been proposed (e.g. Gammatone [2]), but they do not achieve perfect reconstruction and they introduce redundancy, and thus are not suitable for audio coding. In [3], it was proposed to decompose the signal on a union of MDCTs, which is well suited to audio coding, but the frequency scale is not perceptually motivated. Recently, the ERBLet transform was proposed [4]. Its frequency resolution is matched to the Equivalent Rectangular Bandwidth (ERB) scale and it achieves perfect reconstruction as long as the redundancy is larger than 1. In this paper, we propose a real-valued variant of the ERBLet called ERB-MDCT, which is more suitable for audio coding. The frequency scale still follows the ERB scale, but the analysis and synthesis sets of atoms are nearly orthogonal bases, which means that the redundancy is close to 1.
This paper is organized as follows. First, we briefly describe the ERB scale and the ERBLet. Then, we describe the ERB-MDCT and give some implementation details. Finally, we compare it to a standard MDCT and to ERBLet in terms of orthogonality, redundancy and TF energy localization. We also provide TF images obtained with a real audio signal.

THE ERB SCALE AND THE ERBLET
The peripheral auditory system can be modeled as a bank of bandpass filters, usually described by their equivalent rectangular bandwidth (ERB). The ERB (in Hz) of the auditory filter centered at frequency $f$ (in Hz) is [5]: [...] (1) The full range of audible frequencies (20 Hz-20 kHz) can be modeled as a juxtaposition of 39 bandpass filters whose center frequencies $f_b$, $b \in \{1, \dots, 39\}$, are given by [5]: [...] (2) In [4], a transform with a resolution evolving across frequency was formulated based on the theory of nonstationary Gabor frames [6]. Specifically, Gaussian windows with bandwidths satisfying equation (1) are constructed in the frequency domain and equidistantly spaced on the ERB scale according to equation (2). The resulting ERBLet transform is computed by applying the set of windows to the Fourier transform of the signal.
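As a numerical illustration of the ERB scale, the sketch below uses the widely cited Glasberg and Moore expression ERB(f) = 24.7 (4.37 f / 1000 + 1), together with the associated ERB-number scale 21.4 log10(4.37 f / 1000 + 1). It is assumed here that this matches the formulation of [5] referred to as equations (1) and (2), possibly up to notation; the function names are illustrative only.

```python
import math

def erb_bandwidth(f_hz):
    """ERB (in Hz) of the auditory filter centered at f_hz (Glasberg-Moore form, assumed)."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def hz_to_erb_number(f_hz):
    """Position of f_hz on the ERB-number scale (in ERB units)."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)

def erb_number_to_hz(b):
    """Inverse mapping: frequency (in Hz) located b ERB units above 0 Hz."""
    return (10.0 ** (b / 21.4) - 1.0) * 1000.0 / 4.37

if __name__ == "__main__":
    # Bandwidth and ERB number at a few frequencies.
    for f in (100.0, 1000.0, 10000.0):
        print(f, round(erb_bandwidth(f), 1), round(hz_to_erb_number(f), 2))
```

Placing `v` bands per ERB then amounts to evaluating `erb_number_to_hz(p / v)` for successive band indices `p`.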

ERB-MDCT basics
The original MDCT has a constant TF resolution [7]. Extensions were proposed in which the TF resolution changes over time [8]. This "time-domain non-stationary" MDCT is actually used in audio codecs such as MP3 and AAC (the coder can switch between two resolutions [1]). Basically, ERB-MDCT is a "frequency-domain non-stationary" MDCT that follows the ERB scale. This is achieved by applying a Discrete Cosine Transform (DCT) to a time-domain non-stationary MDCT.
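For reference, a fixed-size MDCT with perfect reconstruction can be sketched as follows. This is a direct O(M^2) textbook formulation with a sine window (which satisfies the Princen-Bradley condition), not the variable-size transform used by ERB-MDCT; the function names and the frame size are illustrative assumptions.

```python
import math

def sine_window(M):
    """Sine window of length 2M; satisfies w[n]^2 + w[n+M]^2 = 1."""
    return [math.sin(math.pi * (n + 0.5) / (2 * M)) for n in range(2 * M)]

def mdct(frame, w):
    """MDCT of one frame of 2M samples -> M coefficients (orthonormal scaling)."""
    M = len(frame) // 2
    s = math.sqrt(2.0 / M)
    return [s * sum(w[n] * frame[n]
                    * math.cos(math.pi / M * (n + 0.5 + M / 2) * (k + 0.5))
                    for n in range(2 * M))
            for k in range(M)]

def imdct(coeffs, w):
    """Inverse MDCT: M coefficients -> 2M windowed, time-aliased samples."""
    M = len(coeffs)
    s = math.sqrt(2.0 / M)
    return [w[n] * s * sum(coeffs[k]
                           * math.cos(math.pi / M * (n + 0.5 + M / 2) * (k + 0.5))
                           for k in range(M))
            for n in range(2 * M)]

def mdct_roundtrip(x, M):
    """Analyze x with hop M, then overlap-add the inverse frames."""
    w = sine_window(M)
    y = [0.0] * len(x)
    for start in range(0, len(x) - 2 * M + 1, M):
        coeffs = mdct(x[start:start + 2 * M], w)
        for n, v in enumerate(imdct(coeffs, w)):
            y[start + n] += v
    return y

if __name__ == "__main__":
    M = 8
    x = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(8 * M)]
    y = mdct_roundtrip(x, M)
    # Aliasing cancels wherever two frames overlap, i.e. away from the edges.
    err = max(abs(a - b) for a, b in zip(x[M:-M], y[M:-M]))
    print("max interior error:", err)
```

The first and last M samples are covered by a single frame only, so the signal is not recovered there; each frame of 2M samples yields M coefficients, which is why the redundancy is 1.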
For a discrete time domain of length $N$ samples, $n \in \{0, \dots, N-1\}$, any linear TF transform can be defined by two sets of signals $\psi_{p,\tau}[n]$ and $\tilde{\psi}_{p,\tau}[n]$, called respectively analysis and synthesis atoms. For a signal $x$, the analysis and synthesis operators can be defined as [9]: $X_{p,\tau} = \sum_{n=0}^{N-1} x[n]\,\psi_{p,\tau}[n]$ and $\tilde{x}[n] = \sum_{p,\tau} X_{p,\tau}\,\tilde{\psi}_{p,\tau}[n]$ (3), where $X_{p,\tau}$ are the transform coefficients and $\tilde{x}$ is the reconstructed signal; $p$ is a frequency index and $\tau$ a time-shift index. For the MDCT, we have $\tilde{\psi}_{p,\tau} = \psi_{p,\tau}$ and $\tilde{x} = x$ (i.e. perfect reconstruction), except at the edges of the time domain.
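The analysis and synthesis operators can be illustrated on a toy example. The sketch below flattens the index pair (p, τ) into a single atom index and uses a 4-point orthonormal Walsh-Hadamard basis as a stand-in for real TF atoms, so that the synthesis atoms equal the analysis atoms; all names are illustrative assumptions.

```python
def analyze(x, analysis_atoms):
    """Transform coefficients: inner product of x with each analysis atom."""
    return [sum(xn * an for xn, an in zip(x, atom)) for atom in analysis_atoms]

def synthesize(coeffs, synthesis_atoms):
    """Reconstruction: sum of synthesis atoms weighted by the coefficients."""
    n_samples = len(synthesis_atoms[0])
    return [sum(c * atom[n] for c, atom in zip(coeffs, synthesis_atoms))
            for n in range(n_samples)]

# Toy orthonormal basis of R^4 (Walsh-Hadamard vectors scaled by 1/2).
ATOMS = [[0.5 * s for s in row] for row in (
    (1, 1, 1, 1),
    (1, -1, 1, -1),
    (1, 1, -1, -1),
    (1, -1, -1, 1),
)]

if __name__ == "__main__":
    x = [0.3, -1.2, 0.7, 2.0]
    X = analyze(x, ATOMS)          # as many coefficients as samples
    x_rec = synthesize(X, ATOMS)   # same atoms used for synthesis
    print(X, x_rec)
```

Because the atoms form an orthonormal basis, the number of coefficients equals the number of samples and the reconstruction matches the input up to rounding.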
In a first step, we focus on the ERB-MDCT synthesis atoms. In a discrete frequency domain: [...] where $N_p$ is the MDCT size, $\tau \in \{0, \dots, N_p - 1\}$, and the window $w_p$ can be seen as the frequency response of a band-pass filter centered on $k_p$. The support of $w_p$ is: [...] When $k_p$ and $w_p$ are properly defined, $\{\phi_{p,\tau}\}$ is an orthogonal basis [8]. The ERB-MDCT synthesis atoms are defined as DCT-IV transforms of $\phi_{p,\tau}$: [...] (4) There are $N_p$ atoms in band $p$, thus the total number of atoms is $N_T = \sum_p N_p$. $k_p$ should be defined such that the frequencies $\nu_p$ follow equation (2), and $w_p$ and $N_p$ such that the transform is invertible. However, we cannot force the bandwidth of the atoms to follow equation (1) because of the orthogonality constraint.

Setting the frequency scale
The frequency (in Hz) corresponding to $k_p$ is $f_p = (k_p + \tfrac{1}{2})\,\frac{F_s}{2N}$, where $F_s$ is the sampling frequency. Ideally, $f_p$ should follow the ERB scale with $v$ bands per ERB (defined by equation (2) with $b = p/v$), with $f_0 = 0$ and $f_P = \frac{F_s}{2} + \frac{1}{2N}$. This is not possible because:
1. For a given value of $v$, one can usually not find an integer [...]
2. The extreme values $k_p = 0$ and $k_p = N$ correspond respectively to $f_p = \frac{F_s}{4N}$ and $f_p = \frac{F_s}{2} + \frac{F_s}{4N}$.
Thus, we first set $P$ as the closest integer such that $f$ [...] $2N$ in equation (2), and then compute $k_p$ using: [...] (5) which is an approximation of the ERB scale. These real values will be converted to integers later.
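The two extreme values quoted above are consistent with a linear mapping f = (k + 1/2) F_s / (2N) between a DCT-IV bin index k and its frequency in Hz. The sketch below assumes this mapping (it is inferred from the extreme values, not copied from an equation of the paper) and F_s = 44.1 kHz; the function names are illustrative.

```python
def bin_to_hz(k, n_bins, fs):
    """Frequency (Hz) of (possibly fractional) DCT-IV bin k, for n_bins bins."""
    return (k + 0.5) * fs / (2.0 * n_bins)

def hz_to_bin(f_hz, n_bins, fs):
    """Inverse mapping: real-valued bin position of frequency f_hz."""
    return f_hz * 2.0 * n_bins / fs - 0.5

if __name__ == "__main__":
    N, FS = 4096, 44100.0  # FS is an assumed sampling frequency
    print(bin_to_hz(0, N, FS))       # lowest position, Fs / (4N)
    print(bin_to_hz(N, N, FS))       # upper limit, Fs/2 + Fs / (4N)
    print(hz_to_bin(1000.0, N, FS))  # real-valued position of 1 kHz
```

The inverse mapping returns real values in general, which is why the target positions have to be converted to integers in a later step.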

Setting MDCT sizes
The variable-size MDCT is invertible under the conditions: [...] (6) [...] (7) These conditions imply that the second half of $w_{p-1}$ is the "flipped" version of the first half of $w_p$ with respect to a center of symmetry (see Figure 1): [...] (8) We know from equation (5) that $k_0 = 0$. Thus, equation (8) leads to: [...] (9) Solving this system for $k_p$ defined by equation (5) should lead to a suitable sequence $N_p$. However, there might be an infinite set of solutions (because the system is under-determined) or no solution at all (because only even-integer, increasing sequences $N_p$ are acceptable). In Section 3.6, we propose a heuristic to solve this problem.

Setting MDCT windows
The perfect-reconstruction conditions (6) and (7) are satisfied by the following window: [...] with: [...] Such a window is illustrated in Figure 1.

Analysis and synthesis atoms
Equation (4) defines the synthesis atoms. One can see from the system (10) that the union of [...] Thus, the analysis atoms are: [...] (12) Perfect reconstruction on variable-size MDCT is achieved when adjacent windows overlap, i.e. for $k \in \{0, \dots, N-1\}$. This justifies the boundaries of the sum in equation (4).

Implementation details
One can see from equations (3) and (12) that the ERB-MDCT analysis operator is equivalent to:
1. Apply a DCT-IV to the whole signal in the time domain.
2. Apply a variable-length MDCT to the DCT-IV coefficients.
The synthesis operator follows the reverse scheme. The DCT-IV and the MDCT can be efficiently implemented using the FFT algorithm. In practice, in an audio codec, the signal should be segmented into overlapping frames that are processed separately. To minimize the final redundancy, one should use long frames and short overlapping sections.
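Step 1 of the analysis relies on the DCT-IV. The sketch below is a direct O(N^2) evaluation with orthonormal scaling, meant only to show that the transform is its own inverse; a practical implementation would use an FFT-based algorithm, as noted above. The function name is illustrative.

```python
import math

def dct_iv(x):
    """Orthonormal DCT-IV; with this scaling the transform is its own inverse."""
    n = len(x)
    scale = math.sqrt(2.0 / n)
    return [scale * sum(x[m] * math.cos(math.pi / n * (m + 0.5) * (k + 0.5))
                        for m in range(n))
            for k in range(n)]

if __name__ == "__main__":
    x = [1.0, -2.0, 0.5, 3.0, 0.0, -1.5, 2.5, 0.25]
    c = dct_iv(x)      # analysis: time samples -> DCT-IV coefficients
    x_rec = dct_iv(c)  # synthesis: the same transform applied again
    print(max(abs(a - b) for a, b in zip(x, x_rec)))
    # Orthonormality also preserves energy (Parseval).
    print(sum(v * v for v in x), sum(v * v for v in c))
```

Since the DCT-IV is involutory, the same routine can serve as the last step of the synthesis operator.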
Computing valid sequences $N_p$ and $k_p$ is not a simple problem (see Section 3.3). We propose a simple heuristic that works for most values of $N$ and $v$:
1. Compute the (real) target values for $k_p$ as in Section 3.2.
2. Set $N_0 = k_1$.
3. Compute $N_p$ by finding the unique solution to (9).
4. Round each $N_p$ to the nearest even integer.
5. Compute the final integer values of $k_p$ using (9).
The system (9) implies that $2 \le N_0 \le k_1$. Thus, no valid solution can be found for $k_1 < 2$. Step 2 comes from the fact that $N_p$ must be an increasing sequence, and we found that this is always verified when choosing $N_0 = k_1$. However, the final value of $k_P$ may not be equal to $N$. This can be tackled by iteratively initializing the heuristic with a value of $N$ slightly different from the target value, and stopping when $k_P$ matches the target value.

Orthogonality and redundancy
We would like the ERB-MDCT analysis and synthesis atoms to each form an orthogonal set. For $p = 1, \dots, P-1$, the atoms are orthogonal because variable-size MDCT atoms and DCT-IV atoms are orthogonal. However, for $p = 0$ and $p = P$, variable-size MDCT atoms are computed from DCT-IV coefficients that are symmetric with respect to 0 and $N$ (see equation (12)). This gives physically relevant transform coefficients but breaks the orthogonality. Therefore, ERB-MDCT is quasi-orthogonal.
The ERB-MDCT redundancy (equal to $N_T/N$) is always larger than 1 and depends on the discretization of the ERB scale. In Table 1, we give the redundancy as a function of $v$, for $N = 4096$, for ERB-MDCT and ERBLet (in the painless case, i.e. straightforward synthesis). The ERB-MDCT and ERBLet redundancies cannot a priori be compared, because ERBLet represents positive and negative frequencies with complex coefficients, while ERB-MDCT represents positive frequencies with real coefficients. For real-valued signals, however, the comparison is meaningful because of the Hermitian symmetry in ERBLet. One can observe that the redundancy is close to 1 for ERB-MDCT and much higher for ERBLet. The redundancy decreases with $v$ for ERB-MDCT but increases for ERBLet. This is because ERBLet bandwidths follow equation (1) and do not depend on $v$. Thus, the overlap between bands increases with $v$, which is not the case with ERB-MDCT. In practice, audio coding requires that the partials of pitched sounds be resolved, so a sufficiently high frequency resolution is required (typically $v = 3$). This corresponds to a negligible redundancy for ERB-MDCT (+2%), whereas the ERBLet redundancy is clearly inappropriate for compressive coding.
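Both quantities discussed here are easy to measure numerically. The sketch below computes the redundancy N_T / N from a list of band sizes, and the largest off-diagonal entry of the Gram matrix of a set of unit-norm atoms as a simple indicator of departure from orthogonality. The band sizes and the atoms in the example are made-up placeholders, not values from Table 1, and the Gram-matrix indicator is an assumption of this sketch rather than the measure used in the paper.

```python
def redundancy(band_sizes, n_samples):
    """Redundancy N_T / N, with N_T the total number of atoms."""
    return sum(band_sizes) / n_samples

def max_cross_correlation(atoms):
    """Largest |<a_i, a_j>|, i != j; it is 0 for an orthogonal set."""
    worst = 0.0
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            dot = sum(u * v for u, v in zip(atoms[i], atoms[j]))
            worst = max(worst, abs(dot))
    return worst

if __name__ == "__main__":
    # Hypothetical even, non-decreasing band sizes (placeholders only).
    sizes = [2, 2, 4, 4, 6, 8, 10]
    print(redundancy(sizes, 32))  # 36 atoms for 32 samples
    orthogonal = [[1.0, 0.0], [0.0, 1.0]]
    skewed = [[1.0, 0.0], [0.6, 0.8]]
    print(max_cross_correlation(orthogonal), max_cross_correlation(skewed))
```

A redundancy of exactly 1 together with a zero cross-correlation would indicate an orthogonal basis; small deviations from both correspond to the quasi-orthogonal case.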

Energy localization in time and frequency domains
In this section, we compare the energy localization of synthesis atoms between the standard MDCT, ERBLet and ERB-MDCT for $N = 4096$. For ERBLet and ERB-MDCT, we chose $v = 1$, i.e. 43 bands (negative frequencies in the ERBLet are discarded). We focus on atoms that are approximately centered on $N/2$ and oscillate around 1000 Hz (where the sensitivity of the hearing system is maximal). This corresponds either to $p = 16$ or $p = 17$. For the standard MDCT, we chose the same frequency resolution at 1000 Hz as with ERB-MDCT. This corresponds to 160 bands and either to $p = 7$ or $p = 8$. In Figure 2, we plot the energy of the atoms in the time domain. One can observe that the energy oscillates for cosine-modulated atoms (MDCT, ERB-MDCT), while it is smooth for complex-modulated atoms (ERBLet). MDCT atoms are compactly supported in the time domain (320 samples, i.e. 7.2 ms), whereas the others are not. Thus, energy is best localized with the MDCT. Furthermore, energy decays faster with ERBLet than with ERB-MDCT: -3 dB at the first lobe (time delay: 8.4 ms), and -15 dB at the second lobe (time delay: 16.7 ms).
In Figure 3, we plot the spectra of the atoms described above. As MDCT atoms are compactly supported in the time domain, their energy decays slowly in the frequency domain. In contrast, ERB-MDCT and especially ERBLet are much more selective. Both follow the ERB scale, but the central frequency is slightly shifted to the right with ERB-MDCT because of the modified ERB scale (see (5)). Furthermore, ERBLet atoms are compactly supported in the frequency domain, which is not the case for the others: the attenuation in the stop-band is about -60 dB for ERB-MDCT and -20 dB for MDCT. Thus, energy is better localized in the frequency domain with ERBLet. One can also notice that the bandwidths of ERBLet atoms are broader than those of ERB-MDCT. This is because ERBLet atoms are optimized on both ERB center frequencies and bandwidths, whereas ERB-MDCT atoms are optimized only on ERB center frequencies.

Time-frequency images for a real audio signal
In this section, we compare TF images obtained for a real audio signal: the beginning of "Tom's Diner" by Suzanne Vega. $N$ equals the length of the audio excerpt and $F_s = 44.1$ kHz. We set $v = 3$ for ERB-MDCT and ERBLet (i.e. 128 bands, keeping only positive frequencies in ERBLet). We also apply an MDCT with the same frequency resolution at 1000 Hz, i.e. with 500 bands. We use the implementation of ERBLet available in the LTFAT 2.0 Toolbox for Matlab (http://ltfat.sourceforge.net/). We also provide online (http://potion.cnrs-mrs.fr/eusipco15.html) an implementation of ERB-MDCT for Matlab. The ERB-MDCT, MDCT and ERBLet TF images are plotted in Figure 4.
Between the ERB transforms and the MDCT, the spreading of energy in the frequency domain is clearly different. With the MDCT, most coefficients in the upper three quarters represent low-energy information, while high-energy partials are concentrated in the lower quarter. With the ERB transforms, high frequencies are "compressed" in the upper part, and partials are more salient.
Between ERB-MDCT and ERBLet, the main difference is that the partials are broader in frequency with ERBLet, because ERBLet bandwidths are wider. As a result, partials might be unresolved, especially at high frequencies. In other words, the TF representation is sparser with ERB-MDCT, which is desirable for audio coding: more zero (or zero-quantized) coefficients require fewer coding bits. Finally, one can observe that the TF image is smoother with ERBLet. This comes from the fact that ERB-MDCT is based on the MDCT, which is not shift-invariant in time. This generates local oscillations of energy in the TF plane [10].
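The link between sparsity and bit rate can be made concrete by counting the coefficients that a uniform quantizer maps to zero. The sketch below is a deliberately simplified model (a single quantization step for all coefficients, no perceptual weighting); the step size and the coefficient values are arbitrary placeholders, not data from the experiment.

```python
def zero_quantized_fraction(coeffs, step):
    """Fraction of coefficients that a uniform mid-tread quantizer maps to 0."""
    zeros = sum(1 for c in coeffs if abs(c) < step / 2.0)
    return zeros / len(coeffs)

if __name__ == "__main__":
    # Two placeholder coefficient sets: one sparse, one dense.
    sparse = [0.01, 3.2, -0.02, 0.0, 0.03, -2.7, 0.01, 0.0]
    dense = [0.9, 1.1, -0.8, 1.0, 0.7, -1.2, 0.9, 1.0]
    print(zero_quantized_fraction(sparse, 0.1))
    print(zero_quantized_fraction(dense, 0.1))
```

A higher fraction of zero-quantized coefficients means fewer coefficients to transmit for the same quantization step.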

CONCLUSION
We proposed a real-valued, perfectly invertible TF transform, inspired by ERBLet but close to a basis. It was conceived as a trade-off between an efficient modeling of the hearing system and constraints specific to audio coding: redundancy close to 1, sparse representation in the transform domain, and low computational cost. However, the localization of energy in the time and frequency domains is not as good as with ERBLet.
In future work, we will use this transform in a real audio codec and evaluate its performance compared to an MDCT.

Table 1. Number of frequency bands (K) and redundancy as a function of the number of bands per ERB (v), for N = 4096.