Parallel Voice Conversion Based on a Continuous Sinusoidal Model

The main challenge in current voice conversion (VC) is the tradeoff between speaker similarity and computational complexity. To tackle these problems, this paper introduces a novel sinusoidal model for VC with parallel training data. Conventional source-filter based techniques usually degrade the sound quality and similarity of the converted voice due to parameterization errors and over-smoothing, which lead to a mismatch in the converted characteristics. We therefore developed a VC method using a continuous sinusoidal model (CSM), which decomposes the source voice into harmonic components to improve VC performance. In contrast to current VC approaches, our method is motivated by two observations. First, it uses a continuous fundamental frequency (F0) to avoid alignment errors that may occur in voiced and unvoiced segments and degrade the converted speech, which is important for maintaining high converted speech quality. Second, we compare our model with two high-quality modern vocoders (MagPhase and WORLD) applied to VC, and with a vocoder-free VC framework based on a differential Gaussian mixture model that was recently used in the Voice Conversion Challenge 2018. Similarity and intelligibility are finally evaluated with objective and subjective measures. Experimental results confirm that the proposed method achieves higher speaker similarity than the conventional methods.


INTRODUCTION
Voice conversion (VC), as considered in this paper, aims to modify the speech signal of a source speaker into that of a target speaker. It has great potential in the development of various speech tasks such as Text-to-Speech (TTS) [1], speaking assistance [2], and speech enhancement [3].
Numerous statistical approaches have been employed for mapping the source features to the target features. The Gaussian mixture model (GMM) [4] is a typical form of VC that requires a source-target alignment for training the conversion models. Other statistical methods have also been proposed for VC, such as non-negative matrix factorization (NMF) [5], restricted Boltzmann machines [6], variational auto-encoders [7], and maximum likelihood estimation of spectral parameter trajectories [8]. Although these techniques achieve improvements in converting the voice signal into the target one, the naturalness of the sound quality usually deteriorates due to over-smoothing or discontinuity problems, which make the converted speech sound muffled. Recently, deep neural networks (DNNs) have significantly improved the conversion accuracy of statistical VC techniques. Deep belief networks [9], generative adversarial networks [10], and deep bidirectional long short-term memory networks [11] have been proposed to preserve the sound quality. Notably, the similarity of the converted voices is still degraded in terms of subjective quality due to model complexity and computational expense. Hence, it is desirable to develop a VC technique that converts speech to more natural-sounding speech with simple DNN models.
Most of the VC systems found in the literature are built either on a parallel framework, in which the source and target speakers read out the same set of utterances, or on a non-parallel framework, in which the target speaker's utterances differ from those of the source speaker. In practice, however, the subjective experiment results in [12] [13] indicate that the average performance of non-parallel VC systems does not outperform that of parallel VC systems. The main reason behind this challenging issue is that it is usually hard to achieve an accurate frame alignment between non-parallel speaker utterances; therefore, a parallel data-driven approach is used in this work.
In essence, a well-designed VC system often consists of analysis, conversion, and synthesis modules. The process of parametrizing the input waveform into acoustic features and then synthesizing the converted waveform based on the converted features is one of the major factors that may degrade the performance of VCs. For this, the characteristics of the speech vocoder (analysis/synthesis system) given to the VC are of paramount importance.
Various parametric vocoders (see [14] for a comparison) have been used to model the speech signal. In general terms, the state-of-the-art vocoders used in VC can be grouped into three categories: a) source-filter models, e.g., STRAIGHT [15] and mixed excitation [16]; b) sinusoidal models, of which the Harmonic plus Noise Model [17] is the only one found in the VC literature; c) end-to-end complex models, e.g., WaveNet-based waveform generation [18] and Tacotron [19]. Despite their clear differences, each model works reasonably well for a particular speaker or gender conversion task, which makes them attractive to researchers. Nonetheless, mismatches between the trained, converted, and tested features still exist, which often cause significant quality and similarity degradation. Consequently, a simple and uniform vocoder that would handle all speech sounds and voice qualities (e.g., creaky voice) in a unified way is still missing in VC.
There seem to be three important factors that should be taken into consideration in the design and development of a VC system. Firstly, the most common feature of the above-mentioned VC techniques is that they are based on the spectral envelope (SE). Although the SE contains enough information to convert the original speech signal into that of the target speaker, the SE alone is not enough to achieve the desired conversion results for particular applications, even with a better SE estimation method. Secondly, traditional conversion systems focus on the prosodic feature represented by a discontinuous fundamental frequency (F0) assumption that depends on a binary voicing decision. Modelling F0 in VC applications is therefore problematic because of the differing nature of F0 observations between voiced and unvoiced speech regions. An alternative way to increase the accuracy of the acoustic VC model is to use a continuous F0 (contF0) to avoid alignment errors that may happen in voiced and unvoiced segments and can degrade the converted speech. A third issue that leads to degraded VC performance is that most existing VC techniques discard, or do not typically preserve, phase spectrum information. However, the effectiveness of phase information in detecting synthetic speech has recently been demonstrated in [20]. Hence, one possible way of enhancing the accuracy of VC models is to incorporate phase information in order to achieve superior synthesized speech. Therefore, it is still worthwhile to develop advanced vocoders for VC to achieve high-quality converted speech.
To trade off between model complexity and conversion accuracy in statistical VC, we propose to use a sinusoidal-type synthesis model based on contF0. The goal of this paper is to evaluate the performance of a continuous sinusoidal model (CSM) suitable for statistical modeling in a voice conversion system. The remainder of the paper is organized as follows. In Section II, we propose the novel idea of CSM-based voice conversion using continuous F0 and a feed-forward neural network. Datasets, experimental conditions, and baseline VC systems are described in Section III. In Section IV, objective and subjective evaluations are presented. Finally, we summarize this paper in Section V and suggest avenues for future research.

II. PROPOSED CSM-BASED VOICE CONVERSION

A. Continuous Sinusoidal Model
The continuous sinusoidal model (CSM) is a continuous-vocoder-based sinusoidal model designed to overcome the discontinuity of speech parameters and the computational complexity of modern vocoders. The novelty of this vocoder is the use of harmonic features to facilitate and improve the synthesis step before speech reconstruction.
By keeping the number of our previous source-filter vocoder parameters unchanged [21], and similarly to [22] [23], the synthesis algorithm implemented in this paper decomposes the speech frames into a lower-band voiced component $s_v(t)$ and an upper-band noise component $s_n(t)$ based on Maximum Voiced Frequency (MVF) values. We define these components here as

$$s(t) = s_v(t) + s_n(t). \quad (1)$$

In order to avoid discontinuities at the frame boundaries, the overlap-add (OLA) technique is used to reconstruct the speech signal from the corresponding parameters estimated by our analysis model in [21]. If the current frame is voiced, the harmonic part can be expressed as

$$s_v(t) = \sum_{k=1}^{K} A_k(t) \cos\big(\phi_k(t)\big), \quad (2)$$

where $A_k(t)$ and $\phi_k(t)$ are the amplitude and phase of the $k$-th harmonic at frame $i$ (both obtained in a similar manner as described in [23]), $t = 0, 1, \dots, N$, and $N$ is the frame length. $K$ is the number of time-varying frequency components (harmonics), which depends on the contF0 and the MVF as

$$K = \left\lfloor \frac{\mathrm{MVF}}{\mathrm{contF0}} \right\rfloor. \quad (3)$$

The synthetic noise signal $n(t)$ is filtered by a high-pass filter $f_h(t)$ with a cutoff frequency equal to the local MVF, and then modulated by its time-domain envelope $e(t)$, as described in our previous study [21]:

$$s_n(t) = e(t) \cdot \big[ f_h(t) * n(t) \big]. \quad (4)$$

If the current frame is unvoiced (MVF = 0), the harmonic part is zero and the synthetic frame is equal to the produced noise. Thus, the synthesized speech signal is obtained by adding the harmonic and noise components.
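As an illustration, the per-frame synthesis described above can be sketched as follows. This is not the authors' implementation: the sample rate, the crude FFT-domain high-pass standing in for the filter f_h(t), and the constant per-frame amplitudes and phases are assumptions of the example, and the time-domain envelope e(t) and OLA windowing are omitted for brevity.

```python
import numpy as np

def synthesize_frame(cont_f0, mvf, amps, phases, noise, fs=16000):
    """One CSM synthesis frame: a harmonic sum below the MVF plus
    high-pass filtered noise above it (illustrative sketch only)."""
    n = len(noise)
    t = np.arange(n) / fs

    # The number of harmonics is bounded by the maximum voiced frequency.
    K = int(mvf // cont_f0) if cont_f0 > 0 else 0

    # Lower-band voiced part: sum of K sinusoids with amplitudes A_k
    # and phases phi_k (held constant over the frame for simplicity).
    s_v = np.zeros(n)
    for k in range(1, K + 1):
        s_v += amps[k - 1] * np.cos(2 * np.pi * k * cont_f0 * t + phases[k - 1])

    # Upper-band noise part: an FFT-domain high-pass with cutoff at the
    # local MVF stands in for f_h(t); the envelope e(t) is omitted here.
    spec = np.fft.rfft(noise)
    freqs = np.fft.rfftfreq(n, 1 / fs)
    spec[freqs < mvf] = 0.0
    s_n = np.fft.irfft(spec, n)

    # Unvoiced frame (MVF = 0): K = 0, so only the noise part remains.
    return s_v + s_n
```

In a full synthesizer, consecutive frames produced this way would be windowed and overlap-added to avoid discontinuities at frame boundaries.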

B. Voice Conversion Based on DNN
In [24] [25], neural network based VC reached higher conversion performance than GMM-based solutions. In this work, a feed-forward deep neural network (FF-DNN) is used to model the transformation between source and target speech features, as shown in the middle part of Fig. 1. It consists of 6 feed-forward hidden layers, each with 1024 units, that apply a non-linear function to the previous layer's representation, followed by a linear activation function at the output layer. We applied a hyperbolic tangent activation function, whose outputs lie in the range (-1, 1), as it can yield lower error rates and faster convergence than a logistic sigmoid. For the first 15 epochs, a fixed learning rate of 0.002 was used; the momentum was 0.3 for the first 10 epochs and was then increased to 0.9, after which the learning rate was halved regularly. Thus, input features are propagated forward through the FF-DNN with the estimated parameters to produce the corresponding output parameters.
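The forward pass of such a network can be sketched in a few lines of NumPy. The feature dimensions and the random weights standing in for trained parameters are placeholders, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Feature sizes are hypothetical; the paper does not specify them here.
IN_DIM, OUT_DIM, HIDDEN, N_HIDDEN = 82, 82, 1024, 6

# Random weights stand in for parameters learned during training.
dims = [IN_DIM] + [HIDDEN] * N_HIDDEN + [OUT_DIM]
weights = [0.01 * rng.standard_normal((a, b)) for a, b in zip(dims[:-1], dims[1:])]
biases = [np.zeros(b) for b in dims[1:]]

def forward(x):
    """Propagate one source feature vector through the FF-DNN:
    tanh in the 6 hidden layers, linear activation at the output."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(h @ W + b)            # hidden outputs lie in (-1, 1)
    return h @ weights[-1] + biases[-1]   # linear output layer
```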
The framework of the proposed VC system is shown in Fig. 1.

III. EXPERIMENTAL CONDITIONS

A. Datasets
We used the CMU-ARCTIC database [28] to evaluate the sound quality and speaker identity of the proposed VC framework. The parallel speech data of four speakers were chosen as our corpus: BDL (American English, male), JMK (Canadian English, male), SLT (American English, female), and CLB (US English, female), each consisting of 1132 sentences. The four speakers read out the same set of sentences, which were divided into training, validation, and testing sets of 1000, 66, and 66 sentences, respectively. A 16 kHz sampling frequency with 16-bit samples was used, and acoustic features were extracted with a 5 ms frame shift. We conducted intra-gender and cross-gender conversions; consequently, the number of source-target speaker combinations was 12 pairs. Note that we trained the conversion models for every speaker pair independently. The FF-DNN used in this work was implemented in the open-source Merlin toolkit for speech synthesis [29], with some changes introduced to be able to train our CSM. The training procedures were conducted on an NVidia Titan X GPU.
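The speaker pairing and data split described above can be sketched as follows; the sentence identifiers are hypothetical stand-ins for the actual CMU-ARCTIC file names:

```python
from itertools import permutations

# Ordered source -> target combinations of the four ARCTIC speakers.
speakers = ["BDL", "JMK", "SLT", "CLB"]
pairs = list(permutations(speakers, 2))  # 12 conversion pairs

# Train/validation/test split of the 1132 parallel sentences;
# the identifiers here are placeholders, not the real file names.
sentences = [f"sent_{i:04d}" for i in range(1, 1133)]
train, valid, test = sentences[:1000], sentences[1000:1066], sentences[1066:]
```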

B. Baseline Systems
In this experiment, the proposed CSM based VC system was evaluated by comparing it with three systems:
• WORLD: It was found in [30] that the WORLD vocoder outperformed state-of-the-art vocoders for speech synthesis (e.g., STRAIGHT). Therefore, we used WORLD vocoder based VC as the first baseline to measure our model's performance.
• MagPhase: As our CSM followed the sinusoidal concept that contains both amplitude (intensity) and phase information, we chose a recently proposed MagPhase vocoder [31] based VC as a second baseline system.
• Sprocket: It is a vocoder-free VC system based on a differential GMM [32] submitted to the Voice Conversion Challenge 2018 (VCC2018). It will be used as a third baseline system in our study.
To fairly compare all of the systems mentioned above, we used the same nonlinear conversion function architecture (FF-DNN) as for the proposed system, except for the third baseline, which uses a linear conversion function based on GMM. Thus, we ran 48 experiments in order to measure the performance of these VC systems.

IV. EVALUATION

A. Objective Evaluation
Two objective speech quality measures are considered to evaluate the quality of the proposed model. The frequency-weighted segmental signal-to-noise ratio (fwSNRseg) [33] was calculated first, defined as

$$\mathrm{fwSNRseg} = \frac{10}{M} \sum_{m=0}^{M-1} \frac{\sum_{j=1}^{K} W_j \log_{10} \left[ X_{j,m}^2 / \big(X_{j,m} - Y_{j,m}\big)^2 \right]}{\sum_{j=1}^{K} W_j}, \quad (6)$$

where $X_{j,m}$ and $Y_{j,m}$ are the critical-band magnitude spectra of the target and converted frame signals in frequency band $j$ of frame $m$, $M$ is the number of frames, $K$ is the number of bands, and $W_j$ is a weight vector. Secondly, the log-likelihood ratio (LLR) [34] was employed to evaluate the distance between the converted and target speech from their linear prediction coefficients (LPC), which takes the form

$$\mathrm{LLR} = \log \frac{\mathbf{a}_y R_x \mathbf{a}_y^T}{\mathbf{a}_x R_x \mathbf{a}_x^T}, \quad (7)$$

where $\mathbf{a}_x$, $\mathbf{a}_y$, and $R_x$ are the LPC vector of the target signal frame, the LPC vector of the converted signal frame, and the autocorrelation matrix of the target speech signal, respectively. A more detailed case-by-case analysis by fwSNRseg and LLR is shown in Table 1. The results were averaged over 20 synthesized test utterances for each pair. The calculation is done frame-by-frame, and the best value in each column of Table 1 is boldfaced.
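A minimal sketch of the two measures, assuming per-frame critical-band spectra and LPC vectors are already available (the function and variable names are our own, and a small epsilon is added to avoid division by zero):

```python
import numpy as np

def llr(a_target, a_converted, r_target):
    """Log-likelihood ratio between the LPC vector of a target frame and
    that of the converted frame, using the target autocorrelation matrix."""
    num = a_converted @ r_target @ a_converted
    den = a_target @ r_target @ a_target
    return float(np.log(num / den))

def fwsnrseg_frame(x, y, w, eps=1e-12):
    """Frequency-weighted SNR of one frame, from critical-band magnitude
    spectra x (target) and y (converted) with band weights w; the full
    measure averages this value over all frames."""
    snr = 10.0 * np.log10(x**2 / ((x - y) ** 2 + eps))
    return float(np.sum(w * snr) / np.sum(w))
```

Identical target and converted LPC vectors give an LLR of zero, and the closer the converted spectrum is to the target, the larger the fwSNRseg value.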
First, it can be observed that our proposed method gives significantly better LLR scores than the other systems in female-to-male voice conversion. In other words, the CSM can convert voice characteristics more accurately than the other methods when a female is the source speaker. Similar observations can be made for male-to-female voice conversions (in particular, BDL-to-SLT, BDL-to-CLB, and JMK-to-CLB), where the fwSNRseg measure tended to have the highest scores for our proposed model. In a sense, there is a tendency toward an increased fwSNRseg when considering continuous F0 in the proposed method. Second, for the same-gender speaker pairs, the LLR values in Table 1 indicate that the proposed system clearly outperforms the baseline systems in female-to-female conversions. On the other hand, in terms of male-to-male voice conversions, our proposed system achieves the second highest sound quality. Overall, these findings demonstrate that the CSM can yield performance comparable to the other systems.
The comparison of the spectral envelope of one speech frame converted by the proposed method is given in Fig. 2a. It may be observed that the converted spectral envelope is generally more similar to the target one than to the source one. It can also be seen in Fig. 2b that the converted contF0 trajectories generated by the proposed method follow the same shape as the target, confirming the similarity between them, and can provide better F0 predictions. Similarly, Fig. 2c makes it apparent that the proposed framework produces converted speech with MVF trajectories more similar to the target than to the source.
As a result, these experiments show that the proposed model with continuous sinusoidal vocoder is competitive for the VC task and superior to the reference WORLD model.

B. Subjective Evaluation
A perceptual listening test was designed to evaluate the quality of our proposed model. We performed a web-based MUSHRA-like (MUlti-Stimulus test with Hidden Reference and Anchor) listening test [35] to evaluate the speaker identity/similarity of the converted speech against a natural reference target voice. The listeners had to rate the naturalness and similarity of each stimulus from 0 to 100. Twelve utterances were randomly chosen and presented in a randomized order; altogether, 72 utterances were included in the MUSHRA test (6 types x 12 sentences). Twenty listeners (11 males and 9 females) participated in the experiment. On average, the MUSHRA test took 10 minutes to complete. The listening test samples can be found online at http://smartlab.tmit.bme.hu/sped2019_vc.
The MUSHRA similarity scores of the listening test are presented in Fig. 3. Notably, the listeners preferred our system over the previously developed ones.
According to Mann-Whitney-Wilcoxon rank-sum tests (with a 95% confidence level), all differences are statistically significant. This means that our proposed model has successfully converted the source voice to the target voice in both the same-gender and cross-gender cases. Moreover, Fig. 3 shows that the WORLD and Sprocket systems get higher scores in the MUSHRA test only for the JMK-to-SLT, JMK-to-BDL, and CLB-to-SLT speaker conversions, respectively.
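For illustration, the core rank-sum statistic behind such a test can be computed as below; in practice a library routine such as scipy.stats.mannwhitneyu would be used, and this sketch omits the tie-corrected p-value computation:

```python
import numpy as np

def mann_whitney_u(x, y):
    """Mann-Whitney U statistic for two independent groups of MUSHRA
    scores: the number of (x, y) pairs with x > y, counting ties as 0.5."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    greater = (x[:, None] > y[None, :]).sum()
    ties = (x[:, None] == y[None, :]).sum()
    return float(greater + 0.5 * ties)
```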
Overall, these results suggest that the CSM is the best-performing vocoder for this VC task, while WORLD is also a good option, achieving the second highest similarity scores.

V. DISCUSSION AND CONCLUSIONS
This work proposed a CSM-based voice conversion framework in which the continuous F0 is our main interest, used to avoid alignment errors that may happen at voiced-unvoiced boundaries. A number of recently developed VC methods were applied and compared with the proposed model. The performance of the methods was statistically analyzed with two error metrics and subjectively evaluated in a listening test. The results discussed in Section IV show the effectiveness of the proposed method in terms of naturalness and speaker similarity. The advantage of the CSM is that it gives the closest results to the target speaker in both the objective and similarity tests compared to the other approaches. Future work will aim at improving the quality scores through the use of bidirectional recurrent neural networks, with which many-to-one and one-to-many voice conversion can be achieved.

Fig. 2. Example of the natural source (black), target (red), and converted (blue) spectral envelope, contF0, and MVF trajectories using the proposed method.