Wide Learning for Auditory Comprehension

Classical linguistic, cognitive, and engineering models of speech recognition and human auditory comprehension posit representations of sounds and words that mediate between the acoustic signal and interpretation. Recent advances in automatic speech recognition have shown, using deep learning, that state-of-the-art performance can be obtained without such units. We present a cognitive model of auditory comprehension based on wide rather than deep learning that was trained on 20 to 80 hours of TV news broadcasts. Like deep network models, our model is an end-to-end system that does not make use of phonemes or phonological word form representations. Nevertheless, it performs well on the difficult task of single word identification (model accuracy 11.37%, Mozilla DeepSpeech: 4.45%). The architecture of the model is a simple two-layered wide neural network with weighted connections between acoustic frequency band features as inputs and lexical outcomes (pointers to semantic vectors) as outputs. Model performance shows hardly any degradation when the model is trained on speech in noise rather than on clean speech. Performance was further enhanced by adding a second network to the standard wide network. The present word recognition module is designed to become part of a larger system modeling the comprehension of running speech.


Introduction
The question of how we understand speech is under investigation in many disciplines, ranging from linguistics, cognitive science and neuroscience, to natural language engineering [1]. Almost all current linguistic theories assume speech recognition is a two-stage process, with an initial stage at which the acoustic signal is mapped onto a sequence of phonemes, and a subsequent stage at which the stream of phonemes is segmented into a sequence of words. Accordingly, a substantial body of research has focused on linking properties of the acoustic signal to linguistic units such as phonemes and phonological word form representations [2,3], and cognitive architectures have been put forward that specify how these representations are accessed [4]. Classical automatic speech recognition (ASR) systems build on hidden Markov models (HMMs) in which phonemes again play a pivotal role [5]. However, deep learning has enabled considerable progress, replacing hand-engineered processing with end-to-end approaches that learn directly from data. "Deep Speech" is an example of a state-of-the-art ASR system based on end-to-end deep learning that does not depend on the concept of a "phoneme" as theoretical construct or computational unit [6].
The present study is a progress report on a linguistic approach to auditory comprehension that, like deep learning, rejects the phoneme as a pivotal unit for language comprehension, reflecting the discomfort that also exists within the linguistics community about the validity and usefulness of the phoneme as a theoretical construct [7,8]. Unlike deep learning, we make use of wide learning, in combination with substantial investment in the development of linguistically and cognitively well-motivated input features. The general framework within which this development takes place is that of naive discriminative learning (NDL) [9,10]. NDL implements error-driven learning based on the learning rule proposed by Rescorla and Wagner [11], which has a strong history in the field of animal learning and more recently also human learning [12,13].
The network architecture used for the standard NDL model is a simple two-layer network in which the weights on connections from input units (henceforth, cues) to output units (henceforth, outcomes) are gradually updated according to the Rescorla-Wagner learning rule (for more details, see section 2.3). The aim of NDL is to build end-to-end models, with, in the case of auditory comprehension, low-level form features as cues and semantic units as outcomes. Importantly, the standard implementation of NDL (available as an R package [14] and a Python library [15]) does not make use of any hidden layers, and hence explores to what extent it is possible, given well-chosen acoustic features, to discriminate between lexical meanings using a simple linear network. However, Sering et al. [16] proposed an extension of the NDL architecture with a second two-layer network, trained independently of the first, that further enhances classification performance.
NDL has been successfully employed in modeling the data from a range of experimental studies, showing promising results in explaining human language processing [9,17,18] as well as lexical learning in animals [19]. For human auditory comprehension, Arnold et al. [20] developed an NDL-based model of single word recognition, and applied it successfully to spontaneous conversational German speech. A comparison of model performance on lexical discrimination with human performance on the same speech tokens revealed that model accuracy was within the human range.
The current study builds on this model, testing it on larger amounts and different kinds of speech data, while also exploring whether the second network proposed by Sering et al. indeed improves classification accuracy. Results for single word recognition are compared to those of Mozilla DeepSpeech [21]. A further contribution of the present study is a method for distinguishing between relatively clean speech and speech in noise in the input corpus, which comprises TV broadcast videos recorded in the studio or outdoors, along with music and background noise.

Material
The data resource employed for this study is a subset of the data of the Distributed Little Red Hen Lab, a vast repository of multi-modal TV news broadcasts. We used 500 audio files containing 266 hours of national and cable broadcasts from the United States in English, recorded in 2016, which were accompanied by high-quality transcripts and had been aligned successfully for more than 97% of their words by the Gentle forced aligner [22]. The advantage of working with a huge archive such as the Red Hen data is that a substantial amount of speech is recorded in noisy conditions. The archive not only contains relatively clean speech recorded in a studio, but also speech recorded when reporters are outside, in which case considerable background noise can be present. Furthermore, even recordings made in the studio often carry not only speech, but also music playing in the background.
We developed an algorithm to automatically distinguish relatively clean parts, with speech but without background noise or music, from noisy data. To this end, a threshold of 350 in a CD-quality (44,100 Hz sampling frequency, 16 bit resolution) pulse-code modulation (PCM) encoded speech stream was defined to mark amplitudes close to zero. This threshold (≈ 3% of the peak amplitude) was chosen to capture pauses during speech and the short periods of silence during the closure of plosives. Background noise typically results in such short periods of silence being absent. Sliding a non-overlapping time window of 30 s over the audio files, we counted the number of pause markers completely contained in the 30-second window. All speech chunks with more than 40 pause markers were considered clean. As a result, a total of 5924 audio files, each with a duration of 30 s, was selected, representing almost 50 h of clean speech. A subset of 970 randomly selected files was manually evaluated by an American English native speaker. The proportion of files for which no background noise could be detected anywhere in the full 30 seconds was 0.35. Thus, the clean dataset comprises both truly clean speech files and speech files with mild background noise.
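The pause-counting heuristic can be sketched as follows. This is a minimal illustration, not the original implementation: the 5 ms minimum pause duration and the run-length bookkeeping are our assumptions; only the amplitude threshold of 350, the 30 s non-overlapping window, and the criterion of more than 40 pauses come from the procedure described above.

```python
import numpy as np

def find_clean_windows(samples, sr=44100, amp_threshold=350,
                       window_s=30, min_pauses=40):
    """Flag non-overlapping 30 s windows as clean speech when they
    contain enough near-silent pauses (breath pauses, plosive closures).

    `samples` is a 16-bit PCM signal as an integer array; the amplitude
    threshold of 350 is roughly 3% of the 16-bit peak amplitude.
    """
    # Mark samples whose absolute amplitude falls below the threshold.
    near_zero = np.abs(samples.astype(np.int64)) < amp_threshold
    win = window_s * sr
    clean = []
    for start in range(0, len(samples) - win + 1, win):
        chunk = near_zero[start:start + win]
        # Count contiguous runs of near-zero samples lasting >= 5 ms
        # (assumption): each such run is treated as one pause marker.
        edges = np.diff(chunk.astype(np.int8))
        run_starts = np.flatnonzero(edges == 1) + 1
        run_ends = np.flatnonzero(edges == -1) + 1
        if chunk[0]:
            run_starts = np.r_[0, run_starts]
        if chunk[-1]:
            run_ends = np.r_[run_ends, len(chunk)]
        pause_count = int(np.sum((run_ends - run_starts) >= 0.005 * sr))
        clean.append(pause_count > min_pauses)
    return clean
```

Windows with many short silences are kept as clean; windows where background noise or music fills the pauses fail the criterion.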
A noisy subset was also compiled, using Praat's "To TextGrid (silences)" analysis [23] with a silence threshold of −26 dB and a minimum silence interval duration of 0.09 s. Audio chunks with speech, as opposed to those tagged as silence, with a duration of at least 5.6 s were included as representing noisy speech, the noise being either outdoor noise or music playing in the background. In this way, 19,602 audio files of varying durations were obtained, for a total of 80 h of speech. A random subset of 2000 noisy files was evaluated by the same native speaker, who reported that 91.9% of the files were indeed noisy speech and music snippets.
From the clean and noisy data sets with 50 and 80 hours of speech respectively, henceforth clean-50 and noisy-80, we randomly sampled subsets of 20 and 50 hours of speech (clean-20, noisy-20, and noisy-50), in order to enable comparison with the original results of Arnold et al., which were based on 20 hours of speech, and to provide insight into how the classification algorithm performs as the amount of speech is increased.

Acoustic features
The acoustic features that served as input cues for the NDL network were the Frequency Band Summary (FBS) features developed by Arnold et al. [20]. FBS features are derived as follows. Given the audio signal for a word, minima of the Hilbert amplitude envelope of the speech wave are used to segment the speech into chunks of varying duration. When no clear minima are present, the signal contributes one chunk. Next, each chunk is evaluated on 21 MEL spectrum frequency bands, which are motivated by the different receptive areas of the cochlea that are known to be responsive to variation in different frequency ranges in acoustic signals [24]. For each chunk, and each of the 21 frequency bands of these chunks, an FBS feature brings together band number, chunk number, and a summary of the temporal variation in the band by means of the median, minimum, maximum, initial, and final intensities of the values in the band. We used the AcousticNDLCodeR R package [25] to extract the FBS features from our speech files.
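The pipeline above can be sketched in a few lines of numpy. This is a simplified stand-in for the AcousticNDLCodeR implementation, not the reference code: the moving-window band energies in place of the Hilbert amplitude envelope, the equal-width mel bands without triangular filters, the five-level quantization of the summary statistics, and the cue naming scheme are all assumptions for illustration.

```python
import numpy as np

def fbs_features(signal, sr=16000, n_bands=21, frame=0.025, hop=0.010):
    """Sketch of Frequency Band Summary (FBS) feature extraction:
    segment the signal at envelope minima into chunks, then summarize
    the intensity contour of each of 21 mel-spaced bands per chunk."""
    flen, hlen = int(frame * sr), int(hop * sr)
    n_frames = 1 + (len(signal) - flen) // hlen
    frames = np.stack([signal[i * hlen : i * hlen + flen]
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames * np.hanning(flen), axis=1))
    # Map FFT bins onto n_bands equal-width bands on the mel scale.
    freqs = np.fft.rfftfreq(flen, 1.0 / sr)
    mel = 2595.0 * np.log10(1.0 + freqs / 700.0)
    edges = np.linspace(0, mel[-1] + 1e-9, n_bands + 1)
    band_idx = np.clip(np.digitize(mel, edges) - 1, 0, n_bands - 1)
    bands = np.stack([spec[:, band_idx == b].sum(axis=1)
                      for b in range(n_bands)], axis=1)
    # Chunking: split at local minima of the overall intensity envelope
    # (stand-in for minima of the Hilbert amplitude envelope).
    env = bands.sum(axis=1)
    minima = [i for i in range(1, n_frames - 1)
              if env[i] < env[i - 1] and env[i] < env[i + 1]]
    bounds = [0] + minima + [n_frames]
    features = []
    for chunk, (lo, hi) in enumerate(zip(bounds, bounds[1:])):
        for b in range(n_bands):
            v = bands[lo:hi, b]
            # One discrete cue per (chunk, band): initial, median, min,
            # max, and final intensity, quantized into five levels.
            q = lambda x: min(4, int(5 * x / (v.max() + 1e-9)))
            code = "".join(str(q(x)) for x in
                           (v[0], np.median(v), v.min(), v.max(), v[-1]))
            features.append(f"c{chunk}b{b}s{code}")
    return features
```

Each returned string is one discrete cue for the NDL network; a word token thus yields 21 cues per chunk.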

The NDL classifier
Consider a set of cues C with m unique members c_i (i = 1, …, m) and a set of lexical outcomes O with n unique members o_j (j = 1, …, n). Cues and outcomes occur, with repetition, in pairs called learning events. A sequence of learning events E_train of length r constitutes the training data. The NDL network is defined by an m × n matrix W of connection weights w_ij, where w_ij is the association strength from c_i to o_j. The connection weights in W are initialized to zero: w_ij^(0) = 0 (i = 1, …, m; j = 1, …, n). During learning, events are visited one at a time. At time t (t = 1, …, r), the learning event at t comprises a set of active cues C_t (C_t ⊆ C) and a set of observed outcomes O_t (O_t ⊆ O) that drive the updating of the weights of W. Denoting the weight from c_i to o_j at time t by w_ij^(t), the weights are updated as

  w_ij^(t) = w_ij^(t−1) + Δw_ij^(t).   (1)

The update Δw_ij^(t) itself is given by the Rescorla-Wagner learning rule:

  Δw_ij^(t) = 0                                           if c_i ∉ C_t,
  Δw_ij^(t) = α_i β_j (λ − Σ_{k: c_k ∈ C_t} w_kj^(t−1))   if c_i ∈ C_t and o_j ∈ O_t,
  Δw_ij^(t) = α_i β_j (0 − Σ_{k: c_k ∈ C_t} w_kj^(t−1))   if c_i ∈ C_t and o_j ∉ O_t.   (2)

The parameters of the Rescorla-Wagner learning rule were set to λ = 1.0, α_i = 1.0 and β_j = 0.001 for all i, j, following earlier modeling studies with NDL, and were never changed in the course of the present study.
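A minimal numpy sketch of one such update step (the function name is illustrative; α_i and β_j are treated as the constant scalars used in this study):

```python
import numpy as np

def rw_update(W, cue_idx, outcome_idx, lam=1.0, alpha=1.0, beta=0.001):
    """One Rescorla-Wagner step on the m x n weight matrix W.

    cue_idx: indices of the cues active in this learning event (C_t);
    outcome_idx: indices of the outcomes observed (O_t).
    Only rows of active cues are updated; the prediction error for each
    outcome is lambda (if observed) or 0 (if not), minus the summed
    support from the active cues.
    """
    # Total support for every outcome from the currently active cues.
    support = W[cue_idx].sum(axis=0)          # shape (n,)
    # Target is lambda for observed outcomes, 0 for all others.
    target = np.zeros(W.shape[1])
    target[outcome_idx] = lam
    # Identical additive update for every active cue.
    W[cue_idx] += alpha * beta * (target - support)
    return W

# Toy example: 3 cues, 2 outcomes; one event with cues {0, 2}, outcome {1}.
W = rw_update(np.zeros((3, 2)), [0, 2], [1])
```

Because the support is summed over all active cues, frequently co-occurring cues compete for association strength, which is what drives discrimination learning in NDL.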
Given W and the cues active at learning event e_i, c_k ∈ C_i, the support a_ij (henceforth, activation) of these cues for a given outcome o_j is given by the sum of the weights from the active cues to this outcome:

  a_ij = Σ_{k: c_k ∈ C_i} w_kj.   (3)

More generally, given an r × m matrix C of learning events by cues that is zero everywhere except for ones marking the cues present at a given learning event, we have

  A = C W.   (4)

The activation matrix A provides, for each word presented to the network, the network's support for all possible outcomes. To assess network performance, the lexical outcome with the highest activation is selected and compared with the targeted outcome. Alternatively, the number of targeted outcomes among the top n most highly activated outcomes can be considered.

Following Sering et al. [16], a second network was stacked on top of the first one. This second network is defined by an n × n decision matrix D that linearly seeks to rotate the activation matrix A of the first network onto an r × n target matrix T specifying for each learning event whether a given outcome is present (1) or absent (0), i.e.,

  T ≈ A D.   (5)

D can therefore be estimated by solving

  D̂ = argmin_D ‖T − A D‖²,   (6)

resulting in a matrix of predicted outcome strengths

  T̂ = A D̂.   (7)

As for the standard NDL network, model performance is based on whether the most activated outcome in the relevant row of T̂ is identical to the targeted outcome. Here too, evaluation can be extended to include targeted outcomes among the top n best supported outcomes.

For each of the five datasets introduced above, 10-fold cross-validation was applied, resulting in a total of 50 models for the standard NDL model (using A) and a second set of 50 models for the extended NDL model, henceforth NDL+ (using T̂). All models were trained and tested on single word tokens (as given by the word boundaries provided by the aligner), with FBS features of the audio file as cues and the orthographic forms of the word types as identifiers for lexical outcomes. Out-of-vocabulary word types were discounted when computing accuracy.
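The estimation of the decision matrix and the resulting classification can be sketched in a few lines of numpy (function name and toy dimensions are illustrative; the reference implementation is the Python code developed in the context of [16]):

```python
import numpy as np

def ndl_plus(A, T):
    """Second-network step of NDL+.

    A: r x n activation matrix of the first network;
    T: r x n 0/1 target matrix (one column per outcome).
    D is the least-squares solution of A D ~ T, and T_hat = A D holds
    the predicted outcome strengths used for classification.
    """
    D, *_ = np.linalg.lstsq(A, T, rcond=None)
    T_hat = A @ D
    # Predicted outcome per learning event: column with highest strength.
    predicted = T_hat.argmax(axis=1)
    return D, T_hat, predicted
```

Note that the second network is trained on the output of the first, not jointly with it: no error is back-propagated through both layers.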
From the clean-50 corpus, 72,711 FBS features and 15,698 lexomes were extracted from 401,015 word tokens. The noisy-80 corpus contained 66,106 FBS features, 13,523 lexomes, and 289,245 word tokens. The ndl2 R package (version 0.1.0.9002) [14] was used to estimate A. The matrices of NDL+ were obtained using Python code developed in the context of [16]. Figure 1 visualizes model accuracy across cross-validation runs for the 5 datasets when using standard NDL (left) and NDL+ (right). NDL reaches, on average, a recognition accuracy of 11.72% on the clean datasets (11.58% on clean-20 and 11.86% on clean-50) and 11.13% on the noisy datasets (10.75%, 11.36%, and 11.29% on noisy-20, noisy-50, and noisy-80, respectively). A Wilcoxon rank sum test indicated that NDL accuracy on 50 hours of clean speech was higher than accuracy on 20 hours of clean speech (W = 8, p < .001). There were also statistically significant differences in mean NDL accuracy between the three noisy datasets (Kruskal-Wallis rank sum test; H(2) = 19.66, p < .001), with post hoc pairwise comparisons using the Nemenyi test and Bonferroni-corrected p-values indicating that mean NDL accuracy for the 20-hour dataset is lower than that for the 50-hour (p < .001) and the 80-hour datasets (p < .01). As expected, accuracy decreases when replacing clean speech by noisy speech (Wilcoxon rank sum test; W = 560, p < .001), but the decrease is modest, around 1%.

Results
As illustrated by Figure 1, the performance of NDL+ was superior to that of NDL (W = 332, p < .001, Wilcoxon test), a relative increase of 9.14% over NDL accuracy, from 11.37% ± 0.41 to 12.41% ± 0.76. Wilcoxon tests showed that the observed improvement in model accuracy from NDL to NDL+ is significant for all corpora (noisy-20: W = 13, p < .005; other corpora: W = 0, p < .001). Analysis of variance revealed a significant effect of corpus (F(4, 45) = 50.73, p < .001) and corpus size (F(2, 47) = 61.03, p < .001) on the size of the increase in model accuracy from NDL to NDL+.
A linear model for NDL accuracy, excluding the noisy-80 condition, showed a significant interaction between the clean/noisy status and the size of the corpus. No such interaction was present for NDL+ (Table 1). Furthermore, all three two-way interactions in a model of accuracy as a function of corpus size, corpus clean/noisy status, and method (NDL or NDL+) are well supported.

The recognition of isolated words sliced out of running speech is a hard task both for ASR systems and for humans. Human accuracy on the German data of [20] ranged from 20% to 40% (NDL performance with training on 20 hours of speech from 20 female speakers was around 20-25%). Recognition accuracy of the present models is lower, ranging from 10.37% to 13.53%. This is unsurprising, as there is much greater speaker variability, while at the same time we are working not with lab-recorded speech but with speech with a much lower signal-to-noise ratio. To put the present results in perspective, the performance of the open-source Mozilla DeepSpeech [21] speech-to-text engine with a pre-trained English model was assessed on the isolated words from the clean-50 and noisy-80 corpora on which the NDL models were trained. Accuracy of single word recognition was 6.28% for the clean corpus and 2.62% for the noisy corpus.

Discussion and Conclusions
We presented a cognitively motivated model of speech recognition trained and evaluated on single word tokens taken from real speech data of the Red Hen Lab, using 10-fold cross-validation to assess model accuracy across five datasets that were automatically sampled from the data, including both relatively clean speech and speech with substantial background noise. We extended previous work with NDL on auditory comprehension by increasing the volume of training data from 20 to 50 and 80 hours. We also tested a recent extension of the model, NDL+, which adds a second network that takes the activation vectors of the first network as input and is trained to map these onto one-hot encoded output vectors for the lexical outcomes. The results show that NDL and NDL+ accuracies improve when the model is exposed to more training data, for both clean and noisy speech. Increasing the amount of training data was more beneficial for the noisy than for the clean data. For NDL+, but not for NDL, accuracy improved for 80 as compared with 50 hours of noisy speech, suggesting that with greater quantities of training data, further improvement is possible. Training on more data was also more advantageous for NDL+ than for NDL. We therefore plan to test NDL+ on much larger volumes of speech, with hundreds and perhaps thousands of hours of speech, as available in the Red Hen repository.
As expected, NDL and NDL+ accuracies dropped when the models were exposed to speech in noise, compared to relatively clean studio-recorded speech, but the drop in accuracy was surprisingly modest. To our knowledge, other cognitive models of speech comprehension trained on real speech have been restricted to clear laboratory speech only [26,27]. We also observed that the number of cues was lower in the noisy condition than in the clean condition, which dovetails well with reduced sensitivity to speech and degraded comprehension performance in noise. The number of outcomes was also lower in the noisy condition, suggesting that speakers communicating in noise fall back on a more restricted and presumably more robustly transmittable vocabulary.
We have evaluated model performance by calculating the proportion of targets that received the highest activation out of, on average, 12,030 lexical outcomes. When we consider the number of targeted lexical outcomes among the top 5 and top 10 best supported outcomes, accuracy reaches 30.80% and 40.46% for the clean data, and 29.72% and 38.82% for the noisy data. In future work, we plan to compare NDL performance with human performance on words sampled from the Red Hen Lab datasets.
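Top-n evaluation of this kind can be computed directly from the matrix of outcome strengths; a small sketch (function name illustrative, applicable to either A or the NDL+ predictions):

```python
import numpy as np

def top_k_accuracy(strengths, targets, k=5):
    """Proportion of events whose targeted outcome is among the k most
    strongly supported outcomes in each row of `strengths`.

    strengths: r x n matrix of outcome activations or predictions;
    targets: length-r array of target column indices.
    """
    # Indices of the k highest-scoring outcomes per row (order among
    # the k is irrelevant, so a partial sort suffices).
    top_k = np.argpartition(-strengths, kth=k - 1, axis=1)[:, :k]
    hits = (top_k == np.asarray(targets)[:, None]).any(axis=1)
    return hits.mean()
```

With k = 1 this reduces to the standard accuracy measure used throughout the paper.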
To place the performance of NDL and NDL+ in the context of ASR systems, we compared the performance of our wide learning networks with that of Mozilla DeepSpeech. The NDL and NDL+ wide models outperformed the DeepSpeech system by roughly 6 to 9 percentage points. This comparison does not do full justice to DeepSpeech, as this model is optimized to recognize words in context rather than in isolation, and was in all likelihood trained on a broader range of registers of spoken English than our news broadcast data. However, we note that the present NDL models are developed as part of a wider project addressing word recognition in running speech. A blueprint of the envisioned general framework can be found in [28].
The results of the present study indicate that a simple error-driven wide network, or a pair of such networks trained independently, without any back-propagation of errors, can go quite far in modeling auditory comprehension, given the challenges of the task: discriminating between thousands of different lexical outcomes under huge variability with respect to background noise and speakers' accent, dialect, sociolect, speech rate, age, and gender. We hope the model will also prove useful for understanding, predicting, and modeling the sensitivity of human listeners to the many social features that characterize speakers and that are part and parcel of what they communicate when speaking.