Thesis Open Access
Automatic speech recognition (ASR) is a key element in making the dream of natural human-machine communication a reality, free from the burden of intermediary interfaces and aids. Long years of research in this fascinating field have led to numerous advances, in spite of the field’s complexity. Indeed, the progress made so far has brought ASR systems into everyday modern life. This expansion is largely due to the increased deployment of ASR technology in intelligent mobile devices, as well as in numerous other embedded systems. Because these systems are usually used in noisy environments, noise robustness has become a critical requirement in ASR system design.
This PhD Thesis focuses on two aspects of the field of ASR. The first is the design and implementation of a speaker-independent ASR system for Macedonian with a medium-sized vocabulary. Such a system does not yet exist in Macedonia, which leaves the Macedonian language out of the global revolution in speech technologies. To address this, the Thesis presents the design and development of an ASR system for Macedonian named “Talk to me in Macedonian”. After three development stages, the system in its final form supports a vocabulary of close to 200 words that gives the user partial control of a smartphone. The system was developed using a database recorded with 30 native speakers of Macedonian from different dialect regions. The evaluation showed a word recognition accuracy of 95% for clean speech, and above 90% for signal-to-noise ratios (SNRs) of 15 dB and above. This shows that the presented ASR system has practical value and can serve as a basis for ASR system deployment in real-world scenarios. This would open the doors for Macedonian to join the family of languages of the developed world.
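To make the SNR figures above concrete, the following minimal sketch shows one common way to prepare noisy test speech at a target SNR, using the definition SNR = 10·log10(P_speech / P_noise) with mean signal power. It is an illustration only: the abstract does not state how the noisy evaluation data were produced, and the function names `mean_power`, `mix_at_snr` and `measured_snr_db` are hypothetical.

```python
import math
import random


def mean_power(samples):
    """Mean power (average of squared samples) of a signal."""
    return sum(v * v for v in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech + scaled noise has the requested SNR in dB.

    Returns the noisy signal and the gain that was applied to the noise.
    Assumes both signals have the same length and the same sampling rate.
    """
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_noise == 0:
        raise ValueError("noise signal has zero power")
    # From SNR_dB = 10*log10(p_speech / (gain^2 * p_noise)), solve for gain.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    noisy = [s + gain * n for s, n in zip(speech, noise)]
    return noisy, gain


def measured_snr_db(speech, scaled_noise):
    """SNR in dB between a clean signal and the noise that was added to it."""
    return 10 * math.log10(mean_power(speech) / mean_power(scaled_noise))


if __name__ == "__main__":
    random.seed(0)
    # Stand-ins for real recordings: a sine tone and white Gaussian noise.
    speech = [math.sin(2 * math.pi * 0.01 * n) for n in range(1000)]
    noise = [random.gauss(0.0, 1.0) for _ in range(1000)]
    noisy, gain = mix_at_snr(speech, noise, 15.0)
    print(measured_snr_db(speech, [gain * n for n in noise]))
```

Scaling the noise rather than the speech keeps the speech level constant across SNR conditions, which makes recognition results at different SNRs easier to compare.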
The second aspect addressed in this Thesis is the development of a novel noise-robust ASR feature extraction algorithm that improves on the performance of the standard algorithms in noise. Towards this goal, the Kernel Power flow Orientation Coefficients (KPOC) are introduced. The key contribution of the KPOC algorithm is the use of the power flow orientation in the auditory spectrogram of the speech signal to describe its spectro-temporal content, extracted using banks of 2D spectro-temporal kernels. The orientation coefficients make up a novel type of spectrogram termed the Power flow Orientation Spectrogram (POS). The advantages of the POS are that it is inherently robust to various types of noise, and that it eliminates the need for the feature dimensionality reduction that is otherwise necessary in the standard spectro-temporal approach. Results show that, compared to several of the reference algorithms used in the analysis, KPOC gives a good performance improvement in various types of noise for SNRs below 15 dB. For higher SNRs, ASR system performance using KPOC approaches that of high-end feature extraction algorithms.
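As a rough illustration of how a local orientation can be read off a spectrogram with 2D kernels, the sketch below applies two 3×3 Sobel kernels (one along time, one along frequency) and converts the pair of responses into an angle. This is not the KPOC algorithm: the abstract does not specify the kernels or how the orientation coefficients are computed, so the Sobel kernels, the 3×3 size, and the names `K_TIME`, `K_FREQ`, `correlate3x3` and `orientation_map` are all assumptions made for illustration.

```python
import math

# Sobel kernels used as stand-ins for a spectro-temporal kernel bank (assumption).
# Rows index frequency channels, columns index time frames.
K_TIME = [[-1, 0, 1],
          [-2, 0, 2],
          [-1, 0, 1]]
K_FREQ = [[-1, -2, -1],
          [0, 0, 0],
          [1, 2, 1]]


def correlate3x3(spec, kernel, f, t):
    """Response of a 3x3 kernel centred on spectrogram bin (f, t)."""
    return sum(kernel[i][j] * spec[f + i - 1][t + j - 1]
               for i in range(3) for j in range(3))


def orientation_map(spec):
    """Local orientation angle (radians) for every interior spectrogram bin.

    `spec` is a list of rows (frequency channels), each a list of time frames.
    The angle is atan2(change along frequency, change along time), so 0 means
    power increasing purely along time and pi/2 purely along frequency.
    Border bins are skipped, so the output is two rows and two columns smaller.
    """
    n_freq, n_time = len(spec), len(spec[0])
    angles = []
    for f in range(1, n_freq - 1):
        row = []
        for t in range(1, n_time - 1):
            g_time = correlate3x3(spec, K_TIME, f, t)
            g_freq = correlate3x3(spec, K_FREQ, f, t)
            row.append(math.atan2(g_freq, g_time))
        angles.append(row)
    return angles


if __name__ == "__main__":
    # Toy spectrogram whose power rises equally along time and frequency.
    toy = [[f + t for t in range(5)] for f in range(4)]
    for row in orientation_map(toy):
        print([round(a, 3) for a in row])
```

Each bin is reduced to a single angle regardless of how many kernel responses feed into it, which hints at why an orientation-based representation can avoid a separate dimensionality reduction step.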