Phase reconstruction of spectrograms with linear unwrapping: application to audio signal restoration

This paper introduces a novel technique for reconstructing the phase of modified spectrograms of audio signals. From the analysis of mixtures of sinusoids we obtain relationships between phases of successive time frames in the Time-Frequency (TF) domain. To obtain similar relationships over frequencies, in particular within onset frames, we study an impulse model. Instantaneous frequencies and attack times are estimated locally to encompass the class of non-stationary signals such as vibratos. These techniques ensure both the vertical coherence of partials (over frequencies) and the horizontal coherence (over time). The method is tested on a variety of data and demonstrates better performance than traditional consistency-based approaches. We also introduce an audio restoration framework and observe that our technique outperforms traditional methods.


INTRODUCTION
A variety of music signal processing techniques act in the TF domain, exploiting the particular structure of music signals. For instance, the family of techniques based on Nonnegative Matrix Factorization (NMF) is often applied to spectrogram-like representations, and has proved to provide a successful and promising framework for source separation [1]. Magnitude-recovery techniques are also useful for restoring missing data in corrupted signals [2].
However, when it comes to resynthesizing time signals, the phase recovery of the corresponding Short-Time Fourier Transform (STFT) is necessary. In the source separation framework, a common practice consists in applying Wienerlike filtering (soft masking of the complex-valued STFT of the original mixture). When there is no prior on the phase of a component (e.g. in the context of audio restoration), a consistency-based approach is often used for phase recovery [3]. That is, a complex-valued matrix is iteratively computed to be close to the STFT of a time signal. A recent benchmark has been conducted to assess the potential of source separation methods with phase recovery in NMF [4].
It points out that consistency-based approaches provide poor results in terms of audio quality. Besides, Wiener filtering fails to provide good results when sources overlap in the TF domain. Thus, phase recovery of modified audio spectrograms is still an open issue. The High Resolution NMF (HRNMF) model [5] has shown to be a promising approach, since it models a TF mixture as a sum of autoregressive (AR) components in the TF domain, thus dealing explicitly with a phase model.
Another approach to reconstruct the phase of a spectrogram is to use a phase model based on the observation of fundamental signals that are mixtures of sinusoids. Contrary to consistency-based approaches using the redundancy of the STFT, this model exploits the natural relationship between adjacent TF bins due to the model. This approach is used in the phase vocoder algorithm [6], although it is mainly dedicated to time stretching and pitch modification of signals, and it requires the phase of the original STFT. More recently, [7] proposed a complex NMF framework with phase constraints based on sinusoidal modeling, and [8] used a similar technique for recovering the phase of speech signals in noisy mixtures. Although promising, these approaches are limited to harmonic and stationary signals. Besides, the phase constrained complex NMF model [7] requires prior knowledge on fundamental frequencies and numbers of partials. In the speech enhancement framework introduced in [8], the fundamental frequency is estimated, however the estimation error is propagated and amplified through partials and time frames.
In this paper, we propose a generalization of this approach that consists in estimating the phase of mixtures of sinusoids from its explicit calculation. We then obtain an algorithm which unwraps the phases horizontally (over time frames) to ensure the temporal coherence of the signal, and vertically (over frequency channels) to enforce spectral coherence between partials, which is observed in musical acoustics for several instruments [9]. Our technique is suitable for a variety of pitched music signals, such as piano or guitar sounds, but percussive signals are outside the scope of this research. A dynamic estimation (at each time frame) of instantaneous frequencies extends the validity of this technique to nonstationary signals such as cellos and speech. This technique is tested on a variety of signals and integrated in an audio restoration framework.
The paper is organized as follows. Section 2 presents the horizontal phase unwrapping model. Section 3 is dedicated to phase reconstruction on onset frames. Section 4 presents a performance evaluation of this technique through various experiments. Section 5 introduces an audio restoration framework using this phase recovery method. Finally, section 6 draws some concluding remarks.

Sinusoidal modeling
Let us consider a sinusoid of normalized frequency f 0 ∈ [− 1 2 ; 1 2 ], initial phase φ 0 ∈ [−π; π] and amplitude A > 0: The expression of the STFT is, for each frequency channel (with F the odd-valued Fourier transform length) and time frame t ∈ Z: where w is a N sample-long analysis window and S is the time shift (in samples) between successive frames. Let The unwrapped phase of the STFT is then: where ∠z denotes the argument of the complex number z. This leads to a relationship between two successive time frames: More generally, we can compute the phase of the STFT of a frequency-modulated sinusoid. If the frequency variation is low between two successive time frames, we can generalize the previous equation: Instantaneous frequency must then be estimated at each time frame to encompass variable frequency signals such as vibratos, which commonly occur in music signals (singing voice or cello signals for instance).

Instantaneous frequency estimation
Quadratic interpolation FFT (QIFFT) is a powerful tool for estimating the instantaneous frequency near a magnitude peak in the spectrum [10]. It consists in approximating the shape of a spectrum near a magnitude peak by a parabola. This parabolic approximation is justified theoretically for Gaussian analysis windows, and used in practical applications for any window type. The computation of the maximum of the parabola leads to the instantaneous frequency estimate. Note that this technique is suitable for signals where only one sinusoid is active per frequency channel.
The frequency bias of this method can be reduced by increasing the zero-padding factor [11]. For a Hann window without zero-padding, the frequency estimation error is less than 1 %, which is hardly perceptible in most music applications according to the authors.

Regions of influence
When the mixture is composed of one sinusoid, the phase must be unwrapped in all frequency channels according to (5) using the instantaneous frequency f 0 . When there is more than one sinusoid, frequency estimation is performed near each magnitude peak. Then, the whole frequency range must be decomposed in several regions (regions of influence [6]) to ensure that the phase in a given frequency channel is unwrapped with the appropriate instantaneous frequency.
At time frame t, we consider a magnitude peak A p in channel k p . The magnitudes (resp. the frequency channels) of neighboring peaks are denoted A p−1 and A p+1 (resp. k p−1 and k p+1 ). We define the region of influence I p of the p-th peak as follows: The greater A p is relatively to A p−1 and A p+1 , the wider I p is. Note that other definitions of regions of influence exist, such as choosing the limit between two peaks as the channel of lowest energy [6].

Impulse model
Impulse signals are useful to obtain a relationship between phases over frequencies (vertical unwrapping) [12]. Although they do not accurately model attack sounds, they provide simple equations that can be further improved for more complex signals. The model is: where δ is equal to one if n = n 0 (the so-called attack time) and zero elsewhere and A > 0 is the amplitude. Its STFT is equal to zero except within attack frames: We can then obtain a relationship between the phases of two successive frequency channels within an onset frame, assuming that w ≥ 0: and φ(0, t) = 0. The similarity between (10) and (5) was expected because the impulse is the dual of the sinusoid in the TF domain. This comparison naturally leads to estimating parameter n 0 (the "instantaneous" attack time) in each frequency channel as we previously estimated f 0 (the instantaneous frequency) in each time frame (cf. equation (6)). This leads to the following vertical unwrapping equation:

Attack time estimation
In order to estimate n 0 (k), we look at the magnitude of the STFT of the impulse in a frequency channel k: We then choose n 0 such that the STFT magnitude of the impulse over onset frames has a shape similar to that of the analysis window. For instance, a least-squares estimation method can be used. We tested this technique on synthetic mixtures of impulses: perfect reconstruction has been reached. Alternatively, we can also estimate n 0 (k) with a temporal QIFFT and update the phase with (11).

Protocol and datasets
The MATLAB Tempogram Toolbox [13] provides a fast and reliable onset frames detection from spectrograms. We use several datasets in our experiments:   [17] and expressed in dB. The popular consistency-based Griffin and Lim (GL) algorithm [3] is also used as a reference. We run 200 iterations of this algorithm (performance is not further improved beyond). It is initialized with random values, except for TF bins where the phase is known. Results are averaged over 30 initializations.
Simulations are run on a 3.60GHz CPU processor and 16Go RAM computer. The related MATLAB code and some sound excerpts are provided on the author web page 1 . Figure 1 illustrates the instantaneous frequencies estimated with the phase vocoder technique [6], used as a reference, and with our algorithm on a vibrato. Identical results are obtained. Our method is thus suitable for estimating variable instantaneous frequency signals as well as stationary components. We computed the average frequency error between phase vocoder and QIFFT estimates for the datasets presented in section 4.1. The results presented in the first column of Table 1 confirm that QIFFT provides an accurate frequency estimation. Table 1 also presents reconstruction performance for Griffin and Lim (GL) and our Phase Unwrapping (PU) algorithms. In both cases the onset phases are known. Our approach significantly outperforms the traditional GL method: both stationary and variable frequency signals are reconstructed accurately. In addition, our algorithm is faster than the GL technique: on a 3min 48s piano piece, the reconstruction is

Onset phase reconstruction
Onset phases can be reconstructed with n 0 -estimation using the impulse magnitude (Imp) or with QIFFT (QI). We also test random phases values (Rand, no vertical coherence), zero phases (0, partials in phase) and alternating partial phases between 0 and π (Alt, phase-opposed partials). These choices are justified by the observation of the phase relationships between piano partials in musical acoustics [9]. The phase of the partials is then fully recovered with horizontal unwrapping. We test these methods on dataset A. Results presented in Table 2 show that all our approaches provide better results than GL algorithm on this class of signals. Onset phase unwrapping with n 0 -estimation based on QIFFT provides the best result, ensuring some form of vertical coherence. In particular, we perceptually observe that this approach provides a neat percussive attack.

Complete phase reconstruction
We consider unaltered magnitude spectrograms from dataset A. A variable percentage of the STFT phases is randomly corrupted. We evaluate the performance of our algorithm to restore the phase both on onset and non-onset frames. Figure 2 confirms the potential of this technique. Our method produced an average increase in SDR of 6dB over the corrupted data. It also performs better than the GL algorithm when a high percentage of the STFT phases must be recovered.
However, note that this experiment consists in phase reconstruction of consistent spectrograms (i.e positive matrices that are the magnitude of the STFT of a time signal): GL  algorithm is then naturally advantaged in this case. Realistic applications (cf. next section) involve the restoration of both phase and magnitude, which leads to inconsistent spectrograms.

APPLICATION TO AUDIO RESTORATION
A common alteration of music signals is the presence of noise on short time periods (a few samples) called clicks. We corrupt time signals with clicks that represent less than 1 % of the total duration. Clicks are obtained by differentiating a 10 sample-long Hann window and added to the clean signal. Magnitude restoration of missing bins is performed by linear interpolation of the log-magnitudes in each frequency channel. Figure 3 illustrates this technique. Phase recovery is then performed with our method (PU) or alternatively with the GL algorithm. We compare those results to the traditional restoration method based on autoregressive (AR) modeling of the time signal [18], and with HRNMF [5]. Table 3 presents results of restoration. HRNMF provides the best results in terms of SDR. Though, our approach outperforms the traditional method and GL algorithm. Besides, we underline that the HRNMF model uses the phase of the non-corrupted bins, while our algorithm is blind. Lastly, our technique remains faster than HRNMF: for a 3min55s piano piece, restoration is performed in 99s with our algorithm and in 222s with HRNMF.

CONCLUSION
The new phase reconstruction technique introduced in this work appears to be an efficient and promising method. The analysis of mixtures of sinusoids leads to relationships between successive TF bins phases. Physical parameters such as instantaneous frequencies and attack times are estimated dynamically, encompassing a variety of signals such as piano and cellos sounds. The phase is then unwrapped in all frequency channels for onset frames and over time for partials. Experiments have demonstrated the accuracy of this method, and we integrated it in an audio restoration framework. Better results than with traditional methods have been reached.
The reconstruction of onset frames still needs to be improved as suggested by the variety of data. Further work will focus on exploiting known phase data for reconstruction: missing bins can be inferred from observed phase values. Alternatively, time-invariant parameters such as phase offsets between partials [19] can be used. Such developments will be introduced in an audio source separation framework, where the phase of the mixture can be exploited.