Audio/video supervised independent vector analysis through multimodal pilot dependent components

Independent Vector Analysis (IVA) is a powerful tool for estimating, in the frequency domain, the broadband acoustic transfer functions between multiple sources and the microphones. In this work, we consider an extended IVA model which adopts the concept of pilot dependent signals. Without imposing any constraint on the de-mixing system, pilot signals depending on the target source are injected into the model, enforcing a permutation of the outputs that is consistent over time. A neural network trained on acoustic data and a lip-motion detector are jointly used to produce a multimodal pilot signal dependent on the target source. Experimental results show that this structure enables the enhancement of a predefined target source in very difficult and ambiguous scenarios.


I. INTRODUCTION
Independent Vector Analysis (IVA) is a popular tool for unsupervised multichannel source separation [1]. Its main virtue is that it avoids the permutation problem of traditional narrow-band frequency-domain methods for source separation [2], [3]. Unlike Independent Component Analysis (ICA), IVA uses a multivariate source model in order to jointly estimate the separated components in each frequency bin. The multivariate model makes it possible to bypass the need for additional permutation solver algorithms, which often rely on prior assumptions about the geometrical interpretation of the mixing system [4], [5].
On-line implementations [6] and several other extensions [7] have been proposed. Nevertheless, despite its potential, IVA is still not widely used in commercial applications such as VoIP and ASR preprocessing. Indeed, the effectiveness of IVA is intrinsically limited by the core paradigm of unsupervised source separation: the nature of the sources of interest is not explicitly defined. Therefore, although the internal permutation problem is solved by the multivariate model, the external order of the recovered sources cannot be guaranteed. The same output can contain portions of different source signals at different time instants, especially when the mixing conditions are not static.
To overcome this issue, geometrical constraints have been employed by imposing that a given output signal is associated to a source having a known angular position [8]. However, these constraints cannot work well if the source and the noise are located at similar angles or when the target source position cannot be uniquely defined. Furthermore, they make IVA similar to adaptive beamforming [9] or to geometrically constrained ICA [10], thus limiting the potential of multivariate modeling [11].

¹The work of Zbyněk Koldovský was supported by the Czech Science Foundation through Project No. 17-00902S.
To mitigate the mentioned IVA ambiguities without imposing any geometrical constraint, in [12] we proposed to modify the multivariate model by injecting pilot signals that are mutually dependent with the sources of interest. The pilot signals were defined to be proportional to the posterior probabilities of observing each source, given an observed wideband spatial or spectral feature. Inspired by [11], in this work we further extend the model by considering posteriors derived from multimodal signals:
• A neural network is trained on extensive prior acoustic data such that it produces a pilot signal that contains posteriors of the target source dominance in the observed mixture.
• Another pilot signal is derived from lip detection in a video recorded together with the audio. This disambiguates the separation in cases where both the target and the interference are speech signals.
Experimental evaluations are carried out to confirm the effectiveness of the supervised structure in separating speech from noise sources in difficult, ambiguous scenarios, e.g., when the target and the noise are both speech sources located at a similar angular direction.

II. SUPERVISED IVA

$N$ source signals are assumed to be recorded by an array of $M$ microphones. Let $S_n^k$ and $X_m^k$ be the STFT coefficients obtained for the $k$th frequency bin, the $n$th source and the $m$th mixture signal, respectively. Let $\mathbf{S}^k = [S_1^k \cdots S_N^k]^T$ and $\mathbf{X}^k = [X_1^k \cdots X_M^k]^T$. The mixing model is

$$\mathbf{X}^k = \mathbf{H}^k \mathbf{S}^k + \mathbf{N}^k, \tag{1}$$

where $\mathbf{N}^k = [N_1^k, \cdots, N_M^k]^T$ is the vector of background noise and interference signals, and $\mathbf{H}^k$ indicates the mixing matrix for the $k$th frequency bin. Assuming $N = M$, the objective of IVA is to estimate a set of de-mixing matrices $\mathbf{W}^k = \{W_{nm}^k\}$, $k = 1, \ldots, K$, where $K$ is the number of frequency bins. The de-mixing matrices jointly recover independent multidimensional sources $\mathbf{Y}_n = [Y_n^1, \cdots, Y_n^K]$, $n = 1, \ldots, N$, where

$$Y_n^k = \sum_{m=1}^{M} W_{nm}^k X_m^k, \tag{2}$$

up to a scaling ambiguity, which can be subsequently resolved by applying the Minimal Distortion Principle (MDP) [13] to each matrix $\mathbf{W}^k$.
A typical way to model the sources is with a multivariate spherical super-Gaussian distribution defined as [1]

$$f(\mathbf{a}_n) = \alpha \exp\left(-\sqrt{\sum_{k=1}^{K} |a_n^k|^2}\right), \tag{3}$$

where $\mathbf{a}_n = [a_n^1, \ldots, a_n^K]^T$. In the supervised IVA (S-IVA) [12], the multivariate model (3) is extended by injecting additional "pilot" dependent components. In this work we consider $Q$ pilots for each source, $P_n^1, \ldots, P_n^Q$, which will be related to different modalities:

$$\mathbf{a}_n = [a_n^1, \ldots, a_n^K, \gamma_1 P_n^1, \ldots, \gamma_Q P_n^Q], \tag{4}$$

where $\gamma_q$ is a hyper-parameter controlling the influence of each pilot. To obtain the update rule, the Maximum Likelihood (ML) approach [1] is used by considering the cost function

$$J(\mathbf{W}) = \sum_{k=1}^{K} \log|\det \mathbf{W}^k| + E\left[\sum_{n=1}^{N} \log f(\mathbf{Y}_n)\right], \tag{5}$$

where $\mathbf{Y}_n = [Y_n^1, \ldots, Y_n^K, \gamma_1 P_n^1, \ldots, \gamma_Q P_n^Q]$ denotes the extended output vector. The expectation $E[\cdot]$ is approximated with the time average over the frames. Then, by taking the derivatives of (5) with respect to $W_{nm}^k$ and applying the natural gradient modification to maximize (5), we obtain the update rule

$$W_{nm}^k \leftarrow W_{nm}^k + \eta \sum_{m'} \left(I_{nm'} - E\left[\varphi^k(\mathbf{Y}_n)\,(Y_{m'}^k)^*\right]\right) W_{m'm}^k, \tag{6}$$

where $\eta$ is the adaptation rate, $I_{nm}$ indicates the $nm$th element of the identity matrix, and the nonlinearities $\varphi^k(\cdot)$, $k = 1, \ldots, K$, are the score functions related to the density of the extended model (4), namely,

$$\varphi^k(\mathbf{Y}_n) = \frac{Y_n^k}{\sqrt{\sum_{k'=1}^{K} |Y_n^{k'}|^2 + \sum_{q=1}^{Q} \gamma_q^2 |P_n^q|^2}}. \tag{7}$$

As the pilot components do not depend on $W_{nm}^k$, the second sum under the square root in (7) remains constant during the optimization. This way, any IVA algorithm can be modified to its supervised version.
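The score function (7) and the batch natural-gradient update (6) can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the function and variable names are ours, and the time average replaces the expectation as stated in the text.

```python
import numpy as np

def score(Y, P, gamma):
    """Score function phi^k of (7) for one source, under the spherical
    Laplacian model extended with pilots. The pilot sum in the
    denominator does not depend on the de-mixing matrices.

    Y     : (K, L) complex STFT output of the source (bins x frames)
    P     : (Q, L) complex pilot components for that source
    gamma : (Q,)  pilot weights gamma_q
    """
    denom = np.sqrt(np.sum(np.abs(Y) ** 2, axis=0)
                    + np.sum((gamma[:, None] * np.abs(P)) ** 2, axis=0))
    return Y / np.maximum(denom, 1e-12)  # guard against all-zero frames

def natural_gradient_step(W, X, P, gamma, eta=0.1):
    """One batch natural-gradient update of all de-mixing matrices (6),
    assuming N = M. W : (K, N, N), X : (K, N, L), P : (N, Q, L)."""
    K, N, L = X.shape
    Y = np.einsum('knm,kml->knl', W, X)          # separated outputs (2)
    Phi = np.empty_like(Y)
    for n in range(N):
        Phi[:, n, :] = score(Y[:, n, :], P[n], gamma)
    W_new = W.copy()
    for k in range(K):
        # E[phi(Y) Y^H] approximated by the time average over frames
        R = Phi[k] @ Y[k].conj().T / L
        W_new[k] = W[k] + eta * (np.eye(N) - R) @ W[k]
    return W_new
```

Note that $|\varphi^k| \le 1$ by construction, since the denominator of (7) dominates each numerator term.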

III. DEFINITION OF PILOT SIGNALS
The proposed method can be related to a previous work in [14] where a user-guided source activity was used to supervise the IVA adaptation. However, the formulation of S-IVA is far more general as it can naturally include many supervising modalities, through the definition of multiple pilot signals. In this work, the pilot signals are derived from audio and video information.
As we use the spherical Laplacian model in (4), the pilots are assumed to be complex-valued zero-mean signals, uncorrelated with the frequency components but with a dependent time-varying variance. Therefore, only the variance of the pilots has to be defined. Denoting by $a_n^l$ and $b_n^l$ the posteriors of source activity derived from the audio and video modalities (with $a_n^l, b_n^l \in [0, 1]$), the pilot signal variances are defined as

$$E[|P_n^1(l)|^2] = c(l)\,a_n^l, \qquad E[|P_n^2(l)|^2] = c(l)\,b_n^l, \tag{8}$$

where $l$ is the time frame index, $E[\cdot]$ indicates the expectation, which is approximated as a smooth time average, and the term $c(l)$ rescales the pilot to a dynamic range proportional to the sum of the frequency components.
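As an illustration, a pilot realization with the variance profile of (8) can be drawn as below. This is only a sketch: the concrete choice of $c(l)$ as the summed per-frame output power is our assumption (the text only states that $c(l)$ is proportional to the sum of the frequency components), and `make_pilot` is a hypothetical helper name.

```python
import numpy as np

def make_pilot(post, Y_power, rng=None):
    """Draw a zero-mean complex pilot whose time-varying variance follows
    the source-activity posterior, rescaled frame-wise by c(l).

    post    : (L,) posteriors in [0, 1] (a_n^l or b_n^l)
    Y_power : (K, L) per-bin output powers |Y_n^k(l)|^2
    """
    rng = rng or np.random.default_rng()
    c = Y_power.sum(axis=0)          # c(l): assumed summed frame power
    var = c * post                   # E[|P_n(l)|^2] = c(l) * posterior, cf. (8)
    # circularly symmetric complex Gaussian with the requested variance
    P = np.sqrt(var / 2) * (rng.standard_normal(len(post))
                            + 1j * rng.standard_normal(len(post)))
    return P
```

When the posterior is zero the pilot vanishes, so silent frames contribute nothing to the extended source norm in (7).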
A. Derivation of $a_n^l$ through an Acoustic Neural Network

A neural network (NN), trained to solve a regression problem, is used to predict the source activity posteriors $a_n^l$. Namely, the network is trained to estimate the power ratio between the true target speech and the noisy mixture. Any machine learning method for regression could be used, such as recurrent neural networks (see, e.g., [15]), but we found that a simple multilayer feed-forward NN, often named deep NN (DNN), is sufficiently accurate to produce a useful prediction.
Let $S_{nd}^{kl}$ be the $kl$th time-frequency coefficient of the $d$th signal corresponding to the $n$th source, included in the training set $D_n$. Example mixtures for the training are obtained as

$$X_d^{kl} = S_{nd}^{kl} + N_d^{kl}, \tag{9}$$

where $N_d^{kl}$ is a noise example. The logarithm and the DCT are applied to $|X_d^{kl}|$ to define the transformed features $\tilde{X}_d^{kl}$, where $\tilde{\mathbf{X}}_d^l = [\tilde{X}_d^{1l}, \cdots, \tilde{X}_d^{\tilde{K}l}]$, $\tilde{K} < K$ (i.e., only the first $\tilde{K}$ DCT coefficients are used). Two hidden layers of 256 neurons are used with the hyperbolic tangent as the activation function. The softmax function is used in the output layer, which has dimension $N$. Each output represents a dominance-related feature for the $n$th source. For the $d$th mixture at the $l$th frame, the input vector $\mathbf{v}^l$ is formed from the transformed features $\tilde{\mathbf{X}}_d^l$, and the corresponding training target is the power ratio between the true source and the mixture.

In this work we focus on the scenario where the source of interest is "speech" while any other non-speech acoustic event is considered as "noise" (i.e., the noise can be considered as composed of multiple sources). For the training of the DNN, a large set of 100k mixtures was generated by randomly combining noise examples with speech sentences from the TIMIT database. Noises were collected from different sources and the dataset was designed to balance the amount of noise belonging to different categories. The selected noise signals did not contain any speech, as the scope of the network is only to discriminate between speech and noise. Two datasets of 10k mixtures were generated for cross-validation and testing, respectively. After training, the output prediction for the $n$th source at the $l$th frame, indicated as $a_n^l$, is obtained through the feed-forward propagation of the input vectors $\mathbf{v}^l$ computed on the test recordings. Figure 1 shows an example of the DNN output for a given test recording used in the experimental evaluation. We want to highlight that, although in this work our target is a speech source, S-IVA can also be applied to separate other types of acoustic sources, e.g., musical sources, as long as the DNN can discriminate them from their time-frequency representation.
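The feature extraction and the forward pass described above can be sketched as follows. The truncation length `K_keep` is an illustrative value (the text only requires $\tilde{K} < K$), and the weight layout is our own; this is not the trained model from the paper.

```python
import numpy as np
from scipy.fft import dct

def dnn_features(X_mag, K_keep=40):
    """Log-magnitude spectrum followed by a DCT, keeping only the first
    K_keep coefficients (a cepstrum-like compression).

    X_mag : (K, L) magnitude spectrogram |X^kl| of one mixture
    returns (K_keep, L) feature matrix
    """
    logmag = np.log(np.maximum(X_mag, 1e-10))    # avoid log(0)
    return dct(logmag, type=2, norm='ortho', axis=0)[:K_keep]

def forward(v, params):
    """Forward pass of the network described in the text: two tanh
    hidden layers of 256 units, softmax output of dimension N.
    params is a list of (W, b) pairs, last pair being the output layer."""
    h = v
    for W, b in params[:-1]:
        h = np.tanh(W @ h + b)
    W, b = params[-1]
    z = W @ h + b
    e = np.exp(z - z.max())                      # numerically stable softmax
    return e / e.sum()
```

The softmax output sums to one, so each component can be read directly as a dominance-related posterior for one source.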

B. Derivation of $b_n^l$ from lip-motion detection
In scenarios where the target source and the noise are acoustically similar, additional modalities can be used to disambiguate the definition of the target source. Here, we use the video signal synchronized with the audio recording to extract the lip motion of a main target speaker.
In order to track the movement of the speaker's lips, a set of 68 facial landmarks is extracted from each frame of the video using the Ensemble of Regression Trees algorithm [16]. A subset of 8 landmarks describing the inner lip region then defines a polygon whose area $r_i$ approximately corresponds to the mouth opening in the $i$th frame. Then, the mean $m_i$ and variance $v_i$ of the 21 consecutive values $r_{i-10}, \ldots, r_{i+10}$ are computed and normalized to the range $[0, 1]$. The posteriors $b_n^l$ are then derived from these normalized statistics, taking values close to one when the lips are moving and close to zero otherwise. Since the audio and video streams were captured at different rates, resampling was applied in order to produce a signal consistent with the time-frequency representation used by IVA.
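The mouth-opening area and the sliding statistics can be computed as below. This is a sketch under stated assumptions: the landmarks are assumed ordered along the contour (so the shoelace formula applies), and the treatment of the first/last 10 frames with a truncated window is our choice, not specified in the text.

```python
import numpy as np

def mouth_area(lip_pts):
    """Shoelace area of the polygon formed by the 8 inner-lip landmarks,
    assumed ordered along the contour. lip_pts : (8, 2) array of (x, y)."""
    x, y = lip_pts[:, 0], lip_pts[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def lip_activity(areas, half=10):
    """Mean m_i and variance v_i of 21 consecutive area values
    r_{i-10}, ..., r_{i+10}, each normalized to [0, 1] over the sequence.
    Border frames use a truncated window (our assumption)."""
    r = np.asarray(areas, dtype=float)
    m = np.empty_like(r)
    v = np.empty_like(r)
    for i in range(len(r)):
        w = r[max(0, i - half): i + half + 1]
        m[i], v[i] = w.mean(), w.var()
    def norm01(u):
        span = u.max() - u.min()
        return (u - u.min()) / span if span > 0 else np.zeros_like(u)
    return norm01(m), norm01(v)
```

A static mouth yields zero normalized variance, which maps to a near-zero video posterior $b_n^l$.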

IV. EXPERIMENTAL EVALUATION
We conduct experiments with $M = 2$. An on-line S-IVA implementation is realized by updating the de-mixing matrices at each frame $l$ according to (6), with the expectation replaced by the instantaneous statistics of the current frame. The scaling normalization is applied to each bin to stabilize the convergence as in [17]. The signal mixtures are transformed into their time-frequency representation through the Short-Time Fourier Transform with a Hanning window of 4096 points and 75% overlap. After separation, the images of the target source at each microphone are recovered through the MDP, and the signals are transformed back to the time domain using overlap-add. Two different experimental evaluations were carried out:
• Test1: separation with pilots based on acoustic features only;
• Test2: separation based on combined audio/video features.
The block diagram of the supervised IVA is depicted in Fig. 2.
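A per-frame on-line variant of the update can be sketched as follows. This is only an illustration of replacing the batch expectation with the instantaneous outer product; the exact recursion of the on-line algorithm in [6], [17] may additionally smooth the statistics over past frames, and the function name is ours.

```python
import numpy as np

def online_step(W_k, x_kl, phi_kl, eta=0.05):
    """One on-line update of a single de-mixing matrix W^k at frame l.

    W_k    : (N, N) de-mixing matrix for bin k
    x_kl   : (N,) mixture STFT coefficients at bin k, frame l
    phi_kl : (N,) score-function values phi^k(Y_n) of the extended outputs
    """
    y = W_k @ x_kl                                  # instantaneous outputs
    # instantaneous approximation of I - E[phi(Y) Y^H]
    G = np.eye(len(y)) - np.outer(phi_kl, y.conj())
    return W_k + eta * G @ W_k
```

With a 4096-point window and 75% overlap, one such update per bin is performed every 1024 samples, i.e., every 64 ms at $f_s = 16$ kHz.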

A. Test1: Separation of speech from noise
In this experiment, the video information was not available. Thus, $|P_n^2(l)|^2$ in (8) was set to 0, for each $n$ and $l$. Recordings were made with two microphones with a mutual distance of 0.2 m. Signals were recorded at $f_s = 16$ kHz in a room of size 5 × 5 × 2.5 m with $T_{60} = 300$ ms. Partially diffuse noise was simulated according to the 3QUEST standard by playing, through 4 loudspeakers, multichannel signals consisting of different types of real-world noise such as cafeteria, road noise, train station, etc. The target speaker was recorded at a distance of 2 m from the center of the microphones at different angles. Note that this can be considered an underdetermined scenario, as the noise was generated by playing partially uncorrelated signals through multiple loudspeakers.
In order to validate the robustness of the proposed approach, a dataset of 100 mixtures was generated by combining speech signals (speakers at random angles) with randomly selected noise examples (not included in the training set of the DNN model). Performance was evaluated by computing both the Noise-to-Speech Ratio improvement (NSRi) at the noise output and the Signal-to-Distortion Ratio improvement (SDRi) at the speech output. Indeed, it should be noted that the scenario is highly underdetermined and a complete speech extraction system should make use of both speech and noise estimates [18], [19]. Fig. 3 shows the performance averaged over the test recordings, comparing standard IVA (i.e., $\gamma_1 = 0$) with S-IVA with $\gamma_1$ tuned for the best SDRi ($\gamma_1 = 24$). It is seen that S-IVA consistently improves the average performance compared to standard IVA, since without supervision the source order of IVA is not guaranteed to be consistent over all test samples.
In a second experiment, we evaluated the robustness of S-IVA to inaccuracies in the DNN prediction: artificial noise, whose amount is controlled by a parameter $\beta$, was added to the predicted posteriors. Fig. 4 shows the performance with varying $\beta$, demonstrating the robustness of S-IVA to noisy pilot signals.

B. Test2: Separation of target speech from noise speech
In this experiment, we consider S-IVA endowed with a multimodal audio/video pilot signal. A target speaker was recorded live in front of a commercial laptop while simulating a VoIP conversation. Noise was generated by recording a TV, located behind the laptop, at a distance of about 2 m. Although the speaker position is known in advance, it is worth noting that applying spatial constraints as in [8] would not be effective in these conditions. In fact, the angular positions of the target and of the noise source are very close to each other. For a more detailed analysis of this aspect, see the experimental evaluation in [12]. A multimodal audio/video pilot signal is generated as in (8) and the performance was evaluated by varying both parameters $\gamma_1$ and $\gamma_2$.
In a first experiment, we consider a recording where the TV noise contains only spoken news. This scenario is very difficult for the acoustic DNN as it cannot discriminate between the target and noise speech. Figures 5 and 6 show the SNRi and SDRi performance averaged over the target and noise sources. It is evident that the pilot based on the acoustic DNN prediction is not reliable, as increasing $\gamma_1$ degrades the performance. On the other hand, the video information undoubtedly provides a robust supervision, as both SDRi and SNRi increase with $\gamma_2$.
In a second experiment, we consider a recording where the TV noise contained a mix of speech and music. From Figures 7 and 8 it can be seen that the acoustic DNN prediction is more effective in these noise conditions, as the presence of non-speech events helps S-IVA converge to the correct source order. Interestingly, this experiment shows that the best performance is obtained when combining both audio and video information. Indeed, while the lip-detection accuracy should not be sensitive to the presence of acoustic noise, other disturbances can make it less reliable. For example, false detections are produced by lip movements that occur even when no speech is produced. This also suggests that more effective multimodal formulations could be defined as alternatives to (8), in order to better reflect the statistical correlation of the errors produced by each modality.

V. CONCLUSIONS
In this work we have presented a supervised extension of Independent Vector Analysis. A pilot signal is injected into the multivariate model to steer the estimation toward the extraction of a specific wanted source. A multimodal pilot signal was defined combining both audio and video information. A deep neural network was used to produce time-varying posteriors of source dominance in order to discriminate speech from acoustic noise events. A lip-motion detector was used to distinguish the activity of the desired speaker from that of interfering speech. It was shown that, without explicit constraints on the de-mixing system, it is possible to achieve a consistent enhancement of a specific target source in difficult scenarios, such as far-field and underdetermined conditions, and when the sources propagate from similar directions.
It was shown that when S-IVA is supervised by the DNN-based pilot signal, good performance can be obtained if the noise does not contain any speech. On the other hand, when the noise is a speech source, the performance obtained with a video-based pilot signal clearly outperforms the acoustic supervision. Nevertheless, it was also observed that in mixed noise conditions the best performance was obtained by combining the audio and video modalities. This result suggests that further work is required to design multimodal formulations more effective than a naive weighted combination of each single modality. Furthermore, future work might also explore the use of EEG-based pilot signals, to realize effective biofeedback source enhancement methods [20].