Handling within-word and cross-word pronunciation variation for Arabic speech recognition (knowledge-based approach)

Arabic is a phonetically complex language, and building an accurate speech recognition system for it is a challenging task. The phonetic dictionary is an essential component of an automatic speech recognition (ASR) system. Pronunciation variation in Arabic is substantial and has been widely investigated using either a data-driven or a knowledge-based approach. Phonological rules are used to derive the pronunciation of each word accurately, reducing the mismatch between the actual phoneme representation of the spoken words and the ASR dictionary. Several studies on Arabic ASR systems have been conducted using different numbers of phonological rules. In this paper we focus on the rules that handle within-word and cross-word pronunciation variation. The experimental results indicate that handling within-word pronunciation variation with phonological rules does not enhance recognition performance, whereas using such rules to handle cross-word variation yields good performance.


1. Introduction
Automatic speech recognition can be defined as converting a speech signal into text. The quality of a system is measured by how close the recognized text is to the text a human would recognize. Speech recognition attracts great interest in many fields, such as natural language processing (NLP) and human-computer interaction (HCI). There are three types of Arabic, each with different characteristics [1]: Classical Arabic (CA), the formal and standard form of Arabic and the language of the Quran; Modern Standard Arabic (MSA), used in TV and the news, the "common language" used by speakers of different dialects; and Spoken Arabic (dialect), which differs from one country to another and has no organized written form. Despite the importance of the Arabic language and the research effort devoted to it, Arabic automatic speech recognition (ASR) is unfortunately still insufficient. Several issues in Arabic need to be addressed to catch up with the progress made for other languages [2]. Diacritization is one of the obstacles facing Arabic ASR systems, since not all text is diacritized, and this leads to a shortage of the training data needed by ASR systems. Diacritization is essential for an Arabic ASR system that is integrated with another system that performs better with diacritics, such as a speech-to-speech system [3]. Another problem is morphological complexity: Arabic has a large number of potential word forms, which increases the out-of-vocabulary rate. Pronunciation variation (within-word or cross-word) also leads to a mismatch between the spoken word and the text used in ASR system modelling. Within-word variation causes alternate pronunciations of the same word. In contrast, cross-word variation happens in continuous speech, in which a sequence of words forms a compound word that must be treated as one entity [4]. Modelling pronunciation variation in any ASR system is a critical task. It helps to improve performance by reducing the mismatch between the speech and the text used in acoustic model training [5][6].
Two main methods have been used in the previous literature to model pronunciation variation [7][8]. The knowledge-based approach uses phonetic and linguistic knowledge to write phonological rules that handle pronunciation variants. The data-driven approach uses a corpus of real speech to derive the variation in speech. The choice of approach depends on the type of variation to be handled and the purpose of handling it [6]. Pronunciation variation modelling should be considered at three levels: the pronunciation dictionary, the acoustic model, and the language model [9].

Arabic phoneme set
The phoneme is the smallest basic unit of speech. It represents a distinct sound in the language's phonology. Changing a phoneme in a word changes the meaning of the word. Phonemes play a vital role in the performance of ASR and text-to-speech systems. In this work, we use the phoneme set of [10], in addition to the proposed phonemes, to generate the adapted dictionary that handles within-word variation. Arabic contains 28 consonants, 3 short vowels (Fatha, Damma, and Kasra), 3 long vowels that are the long versions of the short vowels, and the pharyngealized allophones, as illustrated in Table 1.

The Arabic phonological rules
The phonetic dictionary has a great impact on the accuracy of an ASR system. It contains the words available in the language and their pronunciations as sequences of the phonemes or allophones that exist in the acoustic model. The dictionary can be created manually by experts, but this is a hard and time-consuming task; English dictionaries, for example, were built manually over many years because of the large number of exceptions [10]. The pronunciation of Arabic follows specific rules, especially when the text is fully diacritized, so the phonetic dictionary can be created automatically by following these rules [11][12]. After the dictionary is generated, it can be adapted manually for exception words. A number of research issues for Arabic speech recognition, such as the absence of short vowels in written text and the presence of compound words formed by attaching conjunctions, prepositions, articles, and pronouns to the word stem as prefixes and suffixes, are discussed in [13]. An Arabic broadcast news transcription system has been developed whose phonetic dictionary provides different pronunciation variants for words that may be pronounced differently [14]. A modification of the standard phonetic rules that accommodates pronunciation variation, for better training and decoding, has also been developed. A rule-based technique was developed to generate Arabic phonetic dictionaries for a large-vocabulary speech recognition system [10]; it used classical Arabic pronunciation rules, MSA rules, and morphologically driven rules. Al-Haj et al. (2009) developed a knowledge-based approach that handles short vowels in Iraqi Arabic speech and adds a number of pronunciation variants to the phonetic dictionary. A set of 80 pronunciation rules was written to create a phonetic dictionary for Tunisian Arabic [15].
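As an illustration of how a dictionary entry can be generated automatically from fully diacritized text, the following is a minimal sketch. It is not the rule set of [10] or of this paper: the function name, the four-consonant table, and the phoneme symbols (B, T, D, K, AE, UH, IH) are assumptions chosen for the example. It also assumes that a shadda immediately follows the consonant it doubles, and that a sukun simply produces no phoneme.

```python
# Hypothetical, deliberately tiny letter-to-phoneme tables. A real generator
# would cover all consonants, long vowels, tanween, and exception words.
CONSONANTS = {
    "\u0628": "B",   # BEH
    "\u062A": "T",   # TEH
    "\u062F": "D",   # DAL
    "\u0643": "K",   # KAF
}
SHORT_VOWELS = {
    "\u064E": "AE",  # Fatha
    "\u064F": "UH",  # Damma
    "\u0650": "IH",  # Kasra
}
SHADDA = "\u0651"    # doubles the preceding consonant
SUKUN = "\u0652"     # marks the absence of a vowel


def to_phonemes(word):
    """Map a fully diacritized word to a list of (assumed) phoneme symbols."""
    phonemes = []
    for ch in word:
        if ch in CONSONANTS:
            phonemes.append(CONSONANTS[ch])
        elif ch in SHORT_VOWELS:
            phonemes.append(SHORT_VOWELS[ch])
        elif ch == SHADDA:
            # Assumption: shadda directly follows the consonant it doubles.
            if not phonemes:
                raise ValueError("shadda with no preceding consonant")
            phonemes.append(phonemes[-1])
        elif ch == SUKUN:
            continue  # no vowel is pronounced
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return phonemes


# KAF+Fatha, TEH+Fatha, BEH+Fatha ("kataba")
kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"
print(" ".join(to_phonemes(kataba)))
```

Exception words would still need the manual adaptation step mentioned above, since a table-driven mapping of this kind cannot capture them.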

The proposed method
Some letters are pronounced differently when followed by a particular letter, for example DAL [d], TEH [t], and DAD [dd]. The letter DAL (د) is omitted when followed by a voweled TEH (ت), and the letter DAD (ض) is omitted when followed by a voweled TEH (ت) or TAH (ط) [2], but Ramsay   . Table 2 shows the phonetic dictionary for the two approaches, in which point of view 1 is the approach in [5] and point of view 2 is the approach in [2]. The phonemes NN, NK, and NF are used to represent ɲ̟, ŋ, and ɱ respectively.
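The DAL and DAD omission rules described above can be sketched as a rewrite over a phoneme sequence. This is only an illustration of the rule as stated for [2], not the implementation used in the experiments. The symbols D, T, and DD follow the text, while TT (for TAH), the vowel symbols, and the function name are assumptions. "Voweled" is interpreted here as "immediately followed by a vowel phoneme".

```python
# Assumed vowel inventory (short and long); the symbols are hypothetical.
VOWELS = {"AE", "UH", "IH", "AE:", "UH:", "IH:"}

# Phoneme to omit -> following consonants that trigger the omission.
# D = DAL, DD = DAD, T = TEH, TT = TAH (TT is an assumed symbol).
DELETION_RULES = {
    "D": {"T"},
    "DD": {"T", "TT"},
}


def apply_within_word_deletion(phonemes):
    """Drop DAL/DAD when directly followed by a voweled TEH (or TAH for DAD)."""
    result = []
    n = len(phonemes)
    for i, p in enumerate(phonemes):
        nxt = phonemes[i + 1] if i + 1 < n else None
        after = phonemes[i + 2] if i + 2 < n else None
        triggers = DELETION_RULES.get(p, set())
        if nxt in triggers and after in VOWELS:
            continue  # omit this phoneme from the adapted pronunciation
        result.append(p)
    return result


# Hypothetical base pronunciation containing the sequence D T UH.
base = ["AE", "R", "AE", "D", "T", "UH"]
print(apply_within_word_deletion(base))
```

Keeping the rules in a table makes it straightforward to generate both the base and the adapted dictionary from the same word list and compare them.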

Experiment result
This experiment was conducted using the Nawar Halabi dataset, a continuous, speaker-dependent speech corpus. The transcript of the dataset was collected from Aljazeera Learn (Al Jazeera, 2015), a language learning website.

Handling Cross-word pronunciation variations
Cross-word pronunciation variations change the phonetic spelling of words away from their listed forms in the phonetic dictionary, which leads to a number of out-of-vocabulary (OOV) word forms [8]. Cross-word variation occurs at the boundaries between words, which are captured by the triphones of the acoustic model. It can also be realized as a change in pronunciation that depends on the last phoneme of a word and the first phoneme of the next word [16]. While cross-word variation modelling has been done for many languages, little work has been done for Arabic. Two well-known MSA phonological rules are applied: assimilation (Idgham) and changing (Iqlaab).
There are 3 types of assimilation  Noon Saakinah or Tanween
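To make the idea of a cross-word compound entry concrete, the sketch below merges two dictionary entries when the first word ends in Noon Saakinah (or Tanween) and the second begins with a triggering consonant. It is a simplified illustration under stated assumptions, not the procedure used in this work. It relies on the classical formulation in which Iqlaab turns the noon into M before B, and Idgham applies before Y, R, M, L, W, and N. Modelling Idgham as full replacement of N by the following consonant is a simplification. The phoneme symbols, the transliterated word labels, and the underscore convention for compound words are all hypothetical.

```python
# Consonants that trigger each rule when they start the second word.
IDGHAM_TRIGGERS = {"Y", "R", "M", "L", "W", "N"}
IQLAAB_TRIGGER = "B"


def merge_cross_word(entry1, entry2):
    """Return a compound (word, phonemes) entry, or None if no rule applies.

    Each entry is a (word_label, phoneme_list) pair. The final N of the first
    word is assumed to stand for Noon Saakinah or Tanween.
    """
    (word1, phones1), (word2, phones2) = entry1, entry2
    if not phones1 or not phones2 or phones1[-1] != "N":
        return None
    first = phones2[0]
    if first == IQLAAB_TRIGGER:
        new_last = "M"      # Iqlaab: N is changed to M before B
    elif first in IDGHAM_TRIGGERS:
        new_last = first    # Idgham (simplified): N assimilates fully
    else:
        return None
    compound_word = f"{word1}_{word2}"
    return compound_word, phones1[:-1] + [new_last] + phones2


# Hypothetical entries; "E" stands in for an assumed AIN phoneme symbol.
print(merge_cross_word(("min", ["M", "IH", "N"]),
                       ("baEd", ["B", "AE", "E", "D"])))
```

The compound entries produced this way would be added to the dictionary alongside the original single-word entries, so that word pairs without a triggering boundary are left unaffected.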

Conclusion
Handling Arabic pronunciation variation influences the performance of Arabic ASR systems. Two types of variation exist: within-word and cross-word. Handling within-word variation (Noon assimilation and shadda) using phonological rules (the knowledge-based approach) has no significant effect on system performance. On the other hand, better performance is achieved when cross-word variation is handled by phonological rules. Accordingly, handling within-word variation using a data-driven approach needs to be examined. More phonological rules for handling other cross-word variations will also be investigated.