Making sense of periodicity glimpses in a prediction-update loop: A computational model of attentive voice tracking

Humans are able to follow a speaker even in challenging acoustic conditions, but the perceptual mechanisms underlying this ability remain unclear. A computational model of attentive voice tracking is presented, consisting of four computational blocks: (1) sparse periodicity-based auditory features (sPAF) extraction, (2) foreground-background segregation, (3) state estimation, and (4) top-down knowledge. The model connects the theories about auditory glimpses, foreground-background segregation, and Bayesian inference. It is implemented with the sPAF, sequential Monte Carlo sampling, and probabilistic voice models. The model is evaluated by comparing it with the human data obtained in the study by Woods and McDermott [Curr. Biol. 25(17), 2238-2246 (2015)], which measured the ability to track one of two competing voices with time-varying parameters [fundamental frequency (F0) and formants (F1, F2)]. Three model versions were tested, which differ in the type of information used for the segregation: version (a) uses the oracle F0, version (b) uses the estimated F0, and version (c) uses the spectral shape derived from the estimated F0 and the oracle F1 and F2. Version (a) simulates the optimal human performance in conditions with the largest separation between the voices, version (b) simulates the conditions in which the separation is not sufficient to follow the voices, and version (c) is closest to the human performance for moderate voice separation.

A prominent example is the listener's ability to attentively follow a given speaker at a cocktail party (Cherry, 1953).

A simple but powerful stimulus for investigating this phenomenon in the auditory system was presented by Woods and McDermott (2015). They measured the listener's ability to attentively track one of two simultaneously active synthetic voices whose parameters, the fundamental frequency and the first two formants, varied over time. There were no constant, distinctive features between the voices that could facilitate stream formation (for example, direction of arrival or timbre), and the multidimensional parameter trajectories crossed each other over time. Exactly how the mixture of acoustic signals is decomposed into attended foreground and residual background remains unclear (Carlyon, 2004).

This study presents a computational model of human auditory perception which takes the above-discussed aspects into account. It illustrates the human ability to attentively track a voice in the presence of other sounds. We demonstrate the feasibility of the model based on a simple but challenging auditory scene used in the study of Woods and McDermott (2015): two simultaneously active synthetic voices with varying $F0$, $F1$, and $F2$, whose parameter trajectories cross in time.

The computational framework for modeling attentive voice tracking is depicted in Fig. 1. We assume that an auditory scene consists of the foreground (F) and the background (B).

The foreground contains the attended stream of information: the target auditory object. All remaining sound sources belong to the background. The observation is segregated based on the estimates from the previous time step, $\hat{s}_B(n-1)$ and $\hat{s}_F(n-1)$. $O_B(n)$ and $O_F(n)$ are then passed to the state estimation stage (Fig. 1c).

State estimation consists of two parallel particle filters: one responsible for the foreground and one for the background.
The voices are generated with the Klatt synthesizer (Klatt, 1980), which yields acoustic signals with time-varying $F0$, $F1$, and $F2$. The signals are added to create a mixture signal, which is presented to the model and further processed by the sPAF feature extraction stage.
The full-length signals are defined by the hidden state trajectories $\mathbf{s} = [s(1), \ldots, s(N)]$, where the trajectory sampling rate is $F_S = 50$ Hz and $N$ is the length of the trajectory. The acoustic signal containing a mixture of voices is the input to the model (Fig. 2). The objective of the model is to track the voice state, given the sPAF features from the mixed signal.
The hat symbol ($\hat{\cdot}$) is also used to differentiate between the hidden and the estimated quantities.

Tracking yields one-dimensional estimated state trajectories $\hat{\mathbf{s}}_F = [\hat{s}_F(1), \ldots, \hat{s}_F(N)]$ and $\hat{\mathbf{s}}_B = [\hat{s}_B(1), \ldots, \hat{s}_B(N)]$.

The method consists of three main steps:

1. Auditory pre-processing, which provides the auditory-inspired time-frequency representation.

2. Periodicity analysis, which analyses the periodic structure of the sound in each considered frequency band and yields a time-frequency-period representation.

At every time instance $n$, the output of the sPAF feature extraction stage is an observation $O(n)$, which consists of 23 channel sets $P_{cn}$, one for each frequency channel $c$ (see Fig. 3). A channel set $P_{cn}$ consists of the salient period values, the period glimpses, denoted as $P_{cnm}$, where $M_{cn}$ is the number of period glimpses in the set $P_{cn}$. One channel set $P_{cn}$ can consist of several period glimpses.
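To make the structure of the observation concrete, the following minimal sketch represents $O(n)$ as a mapping from frequency channels to lists of period glimpses. The names and the helper function are hypothetical illustrations, not part of the published model:

```python
# Hypothetical sketch of the sPAF observation structure O(n): one channel set
# P_cn per frequency channel c, each holding M_cn salient period values.
from typing import Dict, List

NUM_CHANNELS = 23  # number of auditory frequency channels used in the model

# Observation: channel index c -> period glimpses P_cnm (in seconds)
Observation = Dict[int, List[float]]

def make_observation(glimpses: Dict[int, List[float]]) -> Observation:
    """Keep only non-empty channel sets; empty channels carry no glimpses."""
    return {c: g for c, g in glimpses.items() if 0 <= c < NUM_CHANNELS and g}

# Example: channel 5 holds two glimpses near 5 ms (a ~200 Hz voice),
# channel 12 holds one glimpse near 2.5 ms (a harmonic of that voice).
O_n = make_observation({5: [0.0050, 0.0049], 12: [0.0025], 20: []})
```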

It should be emphasized that the fundamental assumption about the sPAF features is that they represent only the robust components originating from a single speech source.

Noisy T-F bins, as well as bins containing a superposition of many sound sources, should be discarded.

1. FG-BG segregation based on oracle $F0(n-1)$

The first method uses the likelihood function (described in detail in Section II F 2), which summarizes the statistical relationship between $F0$ and the observed period glimpses. For every non-empty channel set $P_{cn}$, the likelihoods $p(P_{cn} \mid F0_F(n-1))$ and $p(P_{cn} \mid F0_B(n-1))$ (Eq. 23) are compared, and the set $P_{cn}$ is assigned to the voice for which the likelihood is larger. This method is used to confirm whether the $F0$ from a preceding time step can be used to perform the segregation.

2. FG-BG segregation based on estimated $\hat{F0}(n-1)$

The second method is also based on $F0$. The difference is that in each time instance $n$, the segregation is done based on the estimated $F0$s from the preceding time step, $\hat{F0}_B(n-1)$ and $\hat{F0}_F(n-1)$ (see Eq. (4)). With this method, we evaluate the single-dimensional tracking performance of our system without any oracle information.

3. FG-BG segregation based on $\hat{F0}(n-1)$, $F1(n-1)$ and $F2(n-1)$

The third method is substantially different from the first two: in addition to the estimated $F0$, it uses oracle information about the formants to segregate the voices. For each voice, a channel-dependent weight $\tilde{E}_F(n,c)$ or $\tilde{E}_B(n,c)$ is computed, which reflects the energy distribution over frequency channels for a given combination of $F0$, $F1$, and $F2$. The $F0$ estimate is used in the encoding of the weights, but it is not explicitly used for the segregation as in the first two methods. For example, the channel-dependent weight $\tilde{E}_F(n,c)$ for the foreground voice is computed as the spectral power per frequency subband for a voice with $F0 = \hat{F0}_F(n-1)$, $F1 = F1_F(n-1)$, and $F2 = F2_F(n-1)$.

The weight $\tilde{E}_B(n,c)$ is computed analogously, using the $F0$ estimate $\hat{F0}_B(n-1)$ and the hidden formants $F1_B(n-1)$ and $F2_B(n-1)$ of the background voice. The resulting weights $\tilde{E}_F(n,c)$ and $\tilde{E}_B(n,c)$ are compared in every channel with a non-empty channel set $P_{cn}$, and the set $P_{cn}$ is assigned to the voice with the larger weight.
This method is used to demonstrate that information about the formants from a preceding time step can improve the feature segregation in our modeling framework.
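The per-channel assignment shared by all three methods can be summarized in a short sketch. The code below is a hypothetical illustration (names and signatures are ours, not from the paper): each non-empty channel set is given to the voice with the larger score, where the score is a likelihood $p(P_{cn} \mid F0)$ in methods 1 and 2 and a spectral weight $\tilde{E}(n,c)$ in method 3:

```python
# Hypothetical sketch of per-channel FG-BG segregation: every non-empty
# channel set P_cn is assigned to the voice with the larger score.
from typing import Callable, Dict, List, Tuple

def segregate(
    observation: Dict[int, List[float]],            # channel -> period glimpses
    score_fg: Callable[[int, List[float]], float],  # foreground score per channel
    score_bg: Callable[[int, List[float]], float],  # background score per channel
) -> Tuple[Dict[int, List[float]], Dict[int, List[float]]]:
    O_F: Dict[int, List[float]] = {}
    O_B: Dict[int, List[float]] = {}
    for c, P_cn in observation.items():
        if score_fg(c, P_cn) >= score_bg(c, P_cn):
            O_F[c] = P_cn   # assigned to the attended (foreground) voice
        else:
            O_B[c] = P_cn   # assigned to the background voice
    return O_F, O_B
```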

The idea of particle filtering is to represent the posterior distribution of the voice state by a set of hypothetical states (samples) with associated weights and to compute the estimates $\hat{s}_F(n)$ and $\hat{s}_B(n)$ based on these samples and weights. Below, we review the particle filtering steps for the foreground particle filter.
The background particle filter executes the same operations for the background stream.

The particle filter consists of a finite set of $K$ particles: hypothetical states (in this case, hypothetical $F0$ values) with weights assigned to them. At time $n$, the particle set is defined as $H_F(n) = \{h_F^k(n)\}_{k=1}^{K}$, and the corresponding set of weights is defined as $W_F(n) = \{w_F^k(n)\}_{k=1}^{K}$. We use $K = 300$ particles in the model evaluation. The particle set is iteratively updated using the incoming observation $O_F(n)$ and the statistical voice models (see Sec. II F).

Specifically, the algorithm iteratively executes the following steps (see Fig. 4B).

Before executing the actual iterative procedure, the particle filter is initialized with the available prior knowledge. The initial hypotheses are sampled from the attention prior probability distribution $p(s_F(0))$ (see Sec. II F 1). The initial weights are all set to the same value $1/K$.
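A minimal sketch of this initialization is given below. The Gaussian spread and the $F0$ search range are illustrative assumptions, not values from the paper; only $K = 300$ and the equal initial weights come from the text:

```python
import numpy as np

def init_filters(cue_f0, K=300, f0_range=(80.0, 400.0), spread_hz=5.0,
                 rng=np.random.default_rng(0)):
    """Foreground particles start around the cued F0 (attention prior);
    background particles cover the remaining plausible F0 range."""
    fg = rng.normal(cue_f0, spread_hz, size=K)       # around the correct value
    bg = rng.uniform(f0_range[0], f0_range[1], K)    # broad coverage elsewhere
    weights = np.full(K, 1.0 / K)                    # equal initial weights
    return fg, bg, weights
```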

Weights represent the likelihood of the observation given hypothetical states (particles).

When normalized to 1, they approximate the posterior distribution of the state given a sequence of observations. In the foreground particle filter, the weights are updated using the segregated observation $O_F(n)$: each weight is set to the likelihood of the observation given the hypothesis:

$w_F^k(n) = p\big(O_F(n) \mid h_F^k(n)\big).$

The resulting weights are normalized so that their sum is equal to one:

$\tilde{w}_F^k(n) = \dfrac{w_F^k(n)}{\sum_{j=1}^{K} w_F^j(n)}.$ (16)

The hypothetical states in the set $H_F(n)$, together with the normalized weights $W_F(n)$ assigned to them, constitute the approximate posterior distribution for the foreground voice:

$\Pr\big[s_F(n) = h_F^k(n)\big] \approx \tilde{w}_F^k(n),$

where $\Pr[\cdot]$ denotes the probability value for a discrete random variable.

Finally, the estimate of the hidden state of a voice is the expected value of the current approximate posterior:

$\hat{s}_F(n) = \sum_{k=1}^{K} \tilde{w}_F^k(n)\, h_F^k(n).$
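The update and estimation steps, together with the conditional resampling described in the next paragraph, can be sketched as follows. This is a hypothetical implementation: `particles` is a NumPy array of hypothetical $F0$ values, and the likelihood function and threshold are placeholders:

```python
import numpy as np

def pf_step(particles, observation, likelihood_fn, n_thr,
            rng=np.random.default_rng()):
    # Weight update: each weight is the likelihood of the segregated
    # observation given the hypothetical F0; then normalize (Eq. 16).
    w = np.array([likelihood_fn(observation, f0) for f0 in particles])
    w /= w.sum()
    # State estimate: expected value of the approximate posterior.
    estimate = float(np.dot(w, particles))
    # Conditional resampling (described below): triggered when the effective
    # sample size N_eff drops below the predetermined threshold n_thr.
    n_eff = 1.0 / np.sum(w ** 2)
    if n_eff < n_thr:
        idx = rng.choice(len(particles), size=len(particles), p=w)
        particles = particles[idx]
        w = np.full(len(particles), 1.0 / len(particles))
    return particles, w, estimate
```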

The resampling step is executed in order to focus the limited computational resources on the relevant region of the state space. It is triggered when the particle diversity, measured in terms of the effective sample size $N_F^{\mathrm{eff}}(n)$, is lower than a predetermined threshold. The effective sample size is defined as

$N_F^{\mathrm{eff}}(n) = \dfrac{1}{\sum_{k=1}^{K} \big(\tilde{w}_F^k(n)\big)^2},$

and the particles are resampled if the following condition is fulfilled:

$N_F^{\mathrm{eff}}(n) < N_{\mathrm{thr}}.$

The resampled hypothesis set $H_F(n)$ is generated by drawing samples from the approximate discrete posterior distribution, and the corresponding weights are equalized to the same value $1/K$.

After hearing the cue, the listener knows the initial parameters associated with the attended stream. In the case of $F0$ tracking, this prior information can be used to limit the range of $F0$ values to be considered when determining the $F0$ of a stimulus. This is reflected in our model by the attention prior probability for the voices, which initializes the foreground particle filter around the correct value and the background particle filter everywhere else.

The likelihood of a single period glimpse given a hypothetical $F0$ is evaluated with the function $p(P_{cnm} \mid F0)$ (described in paragraph II F 2 d). Next, in each non-empty channel set $P_{cn}$, the likelihood is integrated by computing a product across the likelihoods of the elements of the channel set:

$p(P_{cn} \mid F0) = \prod_{m=1}^{M_{cn}} p(P_{cnm} \mid F0),$

where we assume the mutual independence of the period glimpses within a channel set $P_{cn}$. Therefore, the likelihood is summed across frequency channels. Sections c and d describe the motivation and implementation of the function $p(P_{cnm} \mid F0)$, which evaluates the likelihood of a single period glimpse $P_{cnm}$.

In higher frequency channels, several harmonics of a voice interact with each other; in these channels, the period is related to the difference frequency, which is $F0$ itself (a similar nature of the periodicity at the output of a cochlear-inspired filterbank was also described in Shamma and Dutta (2019)). Furthermore, a signal with a period $P$
is also periodic at $2P$, $3P$, $4P$, etc.; therefore, multiples of the period are also detected. By relating each period glimpse to the period multiple $i$, we can transform the period glimpses in seconds into relative period glimpses, where $P_0 = F0^{-1}$ is the period of the hypothetical $F0$ and $\operatorname{rem}(\cdot)$ is the remainder operation. In the resulting expression, $F0$ is the hypothetical fundamental frequency and $R_{cnm}(j \cdot F0)$ is the relative period value.

To predict the next value for a given hypothetical $F0$, we compute the trend between the two previous estimates, $\Delta\hat{F0}_F(n) = \hat{F0}_F(n-1) - \hat{F0}_F(n-2)$, predict the next value according to that trend, $F0 + \Delta\hat{F0}_F(n)$, and finally add Gaussian noise to this value, with $\sigma_{\mathrm{trans}} = 1$ Hz. In addition, we make sure that the difference between the two previous estimates stays within a plausible range.
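A minimal sketch of this prediction step is given below. The function name is hypothetical; the trend and the $\sigma_{\mathrm{trans}} = 1$ Hz noise follow the description above:

```python
import numpy as np

def predict_particles(particles, f0_est_prev, f0_est_prev2, sigma_trans=1.0,
                      rng=np.random.default_rng()):
    """Move each hypothetical F0 along the trend between the two previous
    estimates and add Gaussian transition noise (sigma_trans = 1 Hz)."""
    trend = f0_est_prev - f0_est_prev2   # change between estimates n-2 -> n-1
    return particles + trend + rng.normal(0.0, sigma_trans, size=len(particles))
```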

In this section, we explain how we obtain responses of the model in a psychoacoustic task, and we discuss the simulated experiments in more detail.

In the study by Woods and McDermott (2015), participants were given the following task:
After hearing a 500 ms cue signal indicating which voice should be attended, a 2 s long signal containing two competing voices was presented. At the end of a trial, a 500 ms probe signal, coming from one of the two voices, was presented, and the listeners had to decide whether or not the probe came from the attended voice (see Fig. 8, upper panel).

Performance was measured in terms of the sensitivity index ($d'$).

We simulated this experimental procedure using the attentive tracking model, with each trial following the structure of the original experiment.

The next step was to obtain the model's response to a probe. The sensitivity index was computed as

$d' = \sqrt{2} \cdot Z(\mathrm{AUC}),$

where AUC is the area under the ROC curve, computed with the trapezoidal approximation, and the function $Z(p)$, with $p \in [0, 1]$, is the inverse of a cumulative Gaussian distribution.
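For illustration, this computation can be sketched as follows. The $d' = \sqrt{2}\,Z(\mathrm{AUC})$ relation is the standard equal-variance Gaussian link between AUC and $d'$; the ROC points themselves would come from the model's trial responses:

```python
import numpy as np
from scipy.stats import norm

def d_prime_from_roc(false_alarm_rates, hit_rates):
    """d' = sqrt(2) * Z(AUC), with AUC from the trapezoidal approximation."""
    auc = np.trapz(hit_rates, false_alarm_rates)  # area under the ROC curve
    return np.sqrt(2.0) * norm.ppf(auc)           # Z: inverse cumulative Gaussian

# Example with a few ROC points (monotone in false-alarm rate):
print(d_prime_from_roc([0.0, 0.2, 0.5, 1.0], [0.0, 0.6, 0.85, 1.0]))
```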

The lower panel of Fig. 8 shows schematically a single trial of the simulated experiment.

B. Computational simulations
The following numerical experiments were performed in the scope of this paper. In Simulation 1.c., the oracle formant values $F1_F(n-1)$, $F2_F(n-1)$, $F1_B(n-1)$, and $F2_B(n-1)$ were available in every time step. We expected the discrimination performance in this condition to improve significantly in comparison to Simulation 1.b., but not to exceed the model's performance from Simulation 1.a.

Results showed that discrimination between the attended and the unattended voice improved continuously as the voice distance was increased (see Fig. 11).

We expected that the discrimination performance would decrease when we replaced the oracle $F0$ with the estimated $F0$. Nevertheless, after seeing the successful discrimination results in Simulation 1.a., we expected it to remain above chance level for most conditions.

However, as shown in Fig. 9 (middle panel), the results dropped significantly in comparison to Simulation 1.a.

Looking more closely at the tracking results, we identified the different ways in which the algorithm without any oracle information could typically be misled:

• Identity switch at the $F0$ crossings

At the $F0$ crossings, the foreground particle filter could take over the tracking of the competing voice.

• Identity switch at the (sub)harmonics of the correct $F0$ (see Fig. 10C.)

The third reason was a combination of the two aspects mentioned above. A particle filter could start to track a harmonic or subharmonic of a competing voice. This can be seen as an identity switch at the places where the $F0$ of one voice crosses a (sub)harmonic of the second voice.

In summary, tracking can potentially be misled at every point where the $F0$ trajectories or their harmonics or subharmonics cross (see Fig. 10D.). We concluded that purely $F0$-based segregation is not sufficient for robust attentive tracking in this task.

Simulation 1.c.

In this simulation, we investigated whether additional information about the formant frequencies could prevent period glimpses from being assigned to the wrong stream.

As expected, with the formant-guided tracking, we obtained a significant improvement in discrimination performance in comparison to Simulation 1.b. without oracle information.

The median d-prime across all runs of the simulation was $d' = 0.98$ for voices varying in three dimensions and $d' = 0.2$ for voices varying only in $F0$ (see Fig. 9).

The results indicate that, for low minimum distances, attentive tracking in humans is prone to similar errors as the model without oracle information (see Fig. 10).

In the next two conditions, with minimum distances of 5.5 and 7.5 semitones, the model's results are, interestingly, slightly worse for 7.5 semitones than for 5.5 semitones. A possible explanation could be that an increasing distance raises the likelihood that the model will follow a track one octave away from the correct $F0$.

In this simulation, the oracle information about the formants of the voices was used to segregate the voices.

With the examples above, we demonstrate that our model has a generic structure, which could be used to simulate experiments other than the one simulated in this study.

Our framework comprises various sub-tasks, which could be adapted or extended depending on the application.

To demonstrate the feasibility of the approach, we used a one-dimensional state space. Integrating a multidimensional state mapping into the tracking system would enable the full use of the potential of the particle filtering and the periodicity-based features, which was only partly exploited in this study.

Instead of using two particle filters with resampling, the resampling could be used in the foreground particle filter alone. The background particle filter could be updated without resampling, providing information about the background statistics but keeping the hypotheses broadly distributed.

On the other hand, the voices are simultaneously active throughout the whole stimulus duration, meaning that they always 'share' the frequency space. In a more realistic scenario, we would expect not only much more disturbance from the acoustic environment, but also, due to the sparsity of the speech signal, many more time-frequency windows with one clearly dominating voice.

In future work, we plan to test the model using more realistic signals containing speech.