Development of Quranic Reciter Identification System using MFCC and GMM Classifier

Received Oct 2, 2017; Revised Dec 2, 2017; Accepted Dec 16, 2017. Nowadays, many beautiful recitations of Al-Quran are available. Quranic recitation has its own characteristics, and the problem of identifying the reciter is similar to the speaker recognition/identification problem. The objective of this paper is to develop a Quranic reciter identification system using Mel-frequency Cepstral Coefficients (MFCC) and a Gaussian Mixture Model (GMM). A database of five Quranic reciters was developed and used in the training and testing phases. We carefully randomized the database across various surahs of the Quran so that the proposed system depends not on the recited verses but only on the reciter. Fifteen Quranic audio samples per reciter were collected from 5 reciters and randomized; 10 samples were used for training the GMM and 5 samples were used for testing. Results showed that the proposed system achieved a 100% recognition rate for the five reciters tested. When tested with samples from an unknown reciter, the proposed system was able to reject them.


INTRODUCTION
Al-Quran is the holy book of Muslims, written and recited in the Arabic language. Al-Quran is the most popular and most recited book of all time [1], [2]. Muslims should try their best to avoid mistakes when reciting the Quran, such as errors in the recitation rules (tajwid), missing words or verses, and misread vowel pronunciations, punctuation, and accents [3]. Recitation should follow the rules of pronunciation, intonation, and caesuras established by the Islamic prophet Muhammad (PBUH). The rules and guidance for reading the Quran are propagated from the prophet Muhammad to the Quranic reciter through a verified chain of transmission (sanad). Many non-Arabic people study and learn Al-Quran by listening to well-known Quranic reciters (qari). Although each reciter recites the same Quranic verses, the recitations differ because of each reciter's unique voice and characteristics. The problem of identifying the Quranic reciter is therefore similar to speaker recognition [4]. A typical speaker recognition system includes pre-processing, feature extraction, and classification [5]. Many features and classifiers have been used in speaker recognition research. Audio features include Mel-frequency Cepstral Coefficients (MFCC) [4], [6], linear-frequency cepstral coefficients (LFCC), and linear predictive coefficients (LPC). LFCC is similar to MFCC except that its frequencies are not warped by a nonlinear frequency scale, and LFCC has been found to perform better than MFCC in female trials [7]. As stated by [8], MFCC is the most commonly used feature in speaker recognition.
Given a set of feature vectors, a model is built for each speaker so that a vector from that speaker has a higher probability under its own model than under any other model. Several classifiers have been used, such as k- [9], Gaussian mixture model (GMM) [10], artificial neural network [4], and deep neural network (DNN) [11]. Of the various classifiers available, we selected GMM as our baseline for speaker recognition in this research. Although much research has been conducted on speaker recognition, very little has targeted Quranic reciter recognition. Recent research [12] stated that Quranic recitation has different characteristics from spoken English. Quranic recitation is predominantly voiced speech, which could potentially be exploited to build more efficient speaker models. Therefore, the objective of this research is to develop a Quranic reciter identification system using MFCC and GMM, and to evaluate its performance. The rest of the paper is organized as follows: Section 2 describes the typical components of a speaker recognition system. Section 3 explains the proposed Quranic reciter identification system. Section 4 evaluates its performance in terms of recognition rate, while Section 5 concludes this paper.

SPEAKER RECOGNITION
The flow chart of a basic speaker recognition model is shown in Figure 1. First, the audio signal goes through front-end processing, in which features that can uniquely represent the speaker information are extracted. Short-time spectral features are the most frequently used type of features [5]. The front-end may also include pre-processing modules, such as voice activity detection to remove silence from the input, or a channel compensation module to normalize the effect of the recording channel [5], [13]. Many methods can be used to extract features for verifying a speaker's identity; the two best known are linear predictive coding (LPC) and Mel-frequency cepstral coefficients (MFCC) [4], [6]. In this paper, MFCC is chosen for feature extraction since it gives the system higher accuracy. MFCC is the most popular method because it is easy to apply and can handle multiple speakers or multiple languages.
The feature vectors acquired from the previous step are then compared against a set of speaker models. The identity of the test speaker is associated with the identity of the highest-scoring model. A speaker model is a statistical model that represents speaker-dependent information and can be used to predict new data. Any modeling technique can be used, but the most popular are clustering, hidden Markov models, artificial neural networks, and Gaussian mixture models [4], [5], [14]. In this research, we used GMM as it is one of the most effective techniques in speaker recognition [5]. A GMM uses maximum log-likelihood estimation for pattern matching and is able to form smooth approximations of arbitrarily shaped densities. Figure 2 illustrates our proposed system for Quranic reciter identification. We used MFCC for feature extraction and GMM as the classifier because of their popularity and effectiveness for speaker recognition.

Mel-Frequency Cepstral Coefficients (MFCCs)
MFCCs use a non-linear frequency scale, the mel scale, which is based on auditory perception. A mel is a unit of measure of the perceived pitch or frequency of a tone. Equation (1) maps between the two scales, where $f_{mel}$ is the frequency in mels and $f_{Hz}$ is the frequency in Hz. MFCCs are often calculated using a filter bank of M filters, in which each filter has a triangular shape and the filters are spaced uniformly on the mel scale, as shown in Equation (2).
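As a minimal sketch of the mel scale and the triangular filter bank described above, the following Python code assumes the widely used mapping $f_{mel} = 2595 \log_{10}(1 + f_{Hz}/700)$, which may differ in its constants from the form intended in Equation (1). The function names, the FFT size, and the filter count are illustrative assumptions, not settings reported in this work.

```python
import math

def hz_to_mel(f_hz):
    """Common mel mapping (assumed form): 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced uniformly on the mel scale.

    Returns n_filters rows, each holding n_fft // 2 + 1 weights
    (one weight per non-negative FFT bin).
    """
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2.0)
    # n_filters + 2 edge frequencies, equally spaced in mel, mapped to FFT bins.
    edges_hz = [mel_to_hz(mel_max * i / (n_filters + 1))
                for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * f / sample_rate)) for f in edges_hz]
    bank = []
    for m in range(1, n_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        row = [0.0] * n_bins
        for k in range(left, centre):       # rising slope
            row[k] = (k - left) / (centre - left)
        for k in range(centre, right):      # falling slope, peak of 1 at centre
            row[k] = (right - k) / (right - centre)
        bank.append(row)
    return bank

# Illustrative call: 20 filters for 8000 Hz audio and a 512-point FFT.
bank = mel_filterbank(n_filters=20, n_fft=512, sample_rate=8000)
```

Because the edges are equally spaced in mels, the filters are narrow at low frequencies and progressively wider toward the Nyquist frequency.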
The log-energy mel spectrum $S[m]$ is then calculated as in Equation (3). Although the traditional cepstrum uses the inverse discrete Fourier transform (IDFT), the mel-frequency cepstrum is normally implemented using the discrete cosine transform (DCT), since $S[m]$ is even, as shown in Equation (4). Typically, the number of filters M ranges from 20 to 40, and the number of coefficients kept is 13. Some research has reported that the performance of speech recognition and speaker identification systems peaks with 32 to 35 filters [8].
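The two steps above, the log-energy mel spectrum followed by a DCT, can be sketched as follows. This is an illustrative outline that assumes an unnormalised type-II DCT, $c[n] = \sum_{m=0}^{M-1} S[m] \cos(\pi n (m + 1/2) / M)$; the function names and the energy floor are assumptions introduced here.

```python
import math

def log_mel_energies(power_spectrum, filterbank, floor=1e-10):
    """Log of the filter-weighted energy for each mel filter.

    power_spectrum: |X[k]|^2 for the non-negative FFT bins of one frame.
    filterbank: one row of weights per mel filter, same length as power_spectrum.
    floor: small constant (assumed) that avoids log(0) for silent frames.
    """
    energies = []
    for weights in filterbank:
        e = sum(w * p for w, p in zip(weights, power_spectrum))
        energies.append(math.log(max(e, floor)))
    return energies

def mfcc_from_log_energies(log_energies, n_coeffs=13):
    """Unnormalised DCT-II of the log mel spectrum, keeping n_coeffs terms."""
    n_filters = len(log_energies)
    return [
        sum(s * math.cos(math.pi * n * (m + 0.5) / n_filters)
            for m, s in enumerate(log_energies))
        for n in range(n_coeffs)
    ]
```

Keeping only the first 13 coefficients retains the smooth spectral envelope and discards the fine detail of the log mel spectrum.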

Gaussian Mixture Model (GMM)
GMM provides a probabilistic model of a speaker's voice. A Gaussian mixture distribution is a weighted sum of M component densities, each a Gaussian in which $\mu_i$ is the mean vector and $\Sigma_i$ is the covariance matrix of the i-th component. A GMM is characterized by the mean vectors, covariance matrices, and weights of all components, which together can be represented compactly as a single parameter set.
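For reference, the standard form of a Gaussian mixture density over D-dimensional feature vectors is written out below. The symbols $w_i$ for the mixture weights and $\lambda$ for the parameter set are notational assumptions made here.

```latex
p(\mathbf{x} \mid \lambda) = \sum_{i=1}^{M} w_i \, \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i),
\qquad \sum_{i=1}^{M} w_i = 1

\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) =
\frac{1}{(2\pi)^{D/2} \, |\boldsymbol{\Sigma}_i|^{1/2}}
\exp\left( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^{\top} \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) \right)

\lambda = \{ w_i, \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i \}, \quad i = 1, \dots, M
```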

Maximum Likelihood Estimation
Given a set of training samples X, the most popular method to train a GMM is maximum likelihood estimation, which seeks the model parameters that maximize the likelihood of X under the GMM. Maximum likelihood parameters are normally estimated using the expectation maximization (EM) algorithm. Among a set of speakers, each characterized by its own GMM parameters, a GMM system makes its prediction by returning the speaker that maximizes the a posteriori probability given an utterance X. If the prior probabilities of all speakers are equal, this reduces to selecting the speaker whose model gives the highest likelihood for X.
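The training and decision steps above can be sketched in plain Python as follows. This is a minimal illustration, not the Matlab implementation used in this work: it assumes diagonal covariance matrices, random initialisation from the training frames, a fixed number of EM iterations, and equal speaker priors. All function names and default values are hypothetical.

```python
import math
import random

def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (assumed here)."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def logsumexp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def component_log_probs(x, gmm):
    """log(w_i) + log N(x; mu_i, Sigma_i) for every component."""
    return [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]

def avg_log_likelihood(frames, gmm):
    """Average per-frame log-likelihood of an utterance under one GMM."""
    return sum(logsumexp(component_log_probs(x, gmm)) for x in frames) / len(frames)

def em_step(frames, gmm, var_floor=1e-3):
    dim = len(frames[0])
    # E-step: responsibility of each component for each frame.
    resp = []
    for x in frames:
        lp = component_log_probs(x, gmm)
        norm = logsumexp(lp)
        resp.append([math.exp(v - norm) for v in lp])
    # M-step: re-estimate weights, means, and (floored) variances.
    updated = []
    for i in range(len(gmm)):
        n_i = max(sum(r[i] for r in resp), 1e-10)
        mean = [sum(r[i] * x[d] for r, x in zip(resp, frames)) / n_i
                for d in range(dim)]
        var = [max(sum(r[i] * (x[d] - mean[d]) ** 2
                       for r, x in zip(resp, frames)) / n_i, var_floor)
               for d in range(dim)]
        updated.append((n_i / len(frames), mean, var))
    return updated

def train_gmm(frames, n_components, n_iter=20, seed=0):
    rng = random.Random(seed)
    dim = len(frames[0])
    g_mean = [sum(x[d] for x in frames) / len(frames) for d in range(dim)]
    g_var = [max(sum((x[d] - g_mean[d]) ** 2 for x in frames) / len(frames), 1e-3)
             for d in range(dim)]
    # Start from randomly chosen frames as means and the global variance.
    gmm = [(1.0 / n_components, list(m), list(g_var))
           for m in rng.sample(frames, n_components)]
    for _ in range(n_iter):
        gmm = em_step(frames, gmm)
    return gmm

def identify(frames, models):
    """With equal priors, pick the speaker whose GMM scores the utterance highest."""
    scores = {name: avg_log_likelihood(frames, g) for name, g in models.items()}
    return max(scores, key=scores.get), scores
```

Here `frames` would be the list of MFCC vectors of one audio sample, and `models` a dictionary mapping each reciter to a GMM trained on that reciter's training frames. Averaging the log-likelihood over frames does not change which model wins for a given utterance, but it keeps scores comparable across utterances of different lengths.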

RESULTS AND DISCUSSION
In this section, we present the experimental setup and Quranic audio database, the experiments with training samples, testing samples, and unknown samples, and the recognition rate evaluation. We carefully randomized the audio database so that the proposed system depends on the reciter's voice only and not on the recited verses.

Experimental Setup and Quranic Audio Database
A high-performance system was used for processing: a multicore system with an Intel Core i7-6700K 4.00 GHz processor (4 cores, 8 threads), 32 GB RAM, a 256 GB SSD, and a 2 TB hard disk, running the latest version of the Windows 10 64-bit operating system and Matlab 2017b with the Signal Processing and Neural Network Toolboxes.
The audio database in this research was downloaded in MP3 form from the internet, or taken from audio CDs, for five reciters: Abdul Basit (Reciter A), Abdurahman As-Sudais (Reciter B), Saud Ash-Shuraym (Reciter C), Sheikh Ali Abdulrahman (Reciter D), and Sheikh Said al-Ghamdi (Reciter E). Using Audacity software, the audio files were converted from MP3 to mono WAV files with an 8000 Hz sampling frequency and were cut into samples of 60 seconds each. The database is divided into two sets for each reciter: samples 01 to 10 for training and samples 11 to 15 for testing.
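The cutting step was performed in Audacity. As a hedged illustration only, an equivalent scripted segmentation of an already decoded mono signal could look like the sketch below. The function name and the choice to drop a trailing remainder shorter than 60 seconds are assumptions.

```python
def split_into_segments(samples, sample_rate=8000, seconds=60):
    """Cut a mono signal into consecutive fixed-length segments.

    Any trailing remainder shorter than one full segment is dropped
    (an assumption made for this sketch).
    """
    seg_len = sample_rate * seconds
    n_full = len(samples) // seg_len
    return [samples[i * seg_len:(i + 1) * seg_len] for i in range(n_full)]
```

At 8000 Hz, each 60-second segment holds 480,000 samples.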

Experiment with the Training Samples
In this experiment, samples 01 to 10 of each reciter were used to train the GMM. The same samples were then used to evaluate the recognition rate of the trained GMM. The log-likelihood was calculated for each reciter model and each training sample, and the reciter with the highest log-likelihood was selected as the recognized reciter. Table 2 shows the recognition rate for the training samples. Every sample of every reciter was identified correctly, so the recognition rate for the training phase was 100%.

Experiment with the Testing Samples
In this experiment, the GMM previously trained in Section 4.2 was used to test different samples. Samples 11 to 15 were used to evaluate the recognition rate of the trained GMM when tested with different samples from the same reciters. The log-likelihood was calculated for each reciter model and each testing sample, and the reciter with the highest log-likelihood was selected as the recognized reciter. Table 3 shows the recognition rate for the testing samples. Every sample of every reciter was identified correctly, so the recognition rate for the testing phase was 100% as well.
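The recognition rate used in these experiments is the percentage of samples whose recognized reciter matches the true reciter. A minimal sketch of that calculation is given below; the function name and the toy labels in the usage lines are illustrative and are not results from this work.

```python
from collections import defaultdict

def recognition_rates(true_labels, predicted_labels):
    """Return (overall %, {reciter: %}) from paired true and predicted labels."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for truth, guess in zip(true_labels, predicted_labels):
        totals[truth] += 1
        hits[truth] += int(truth == guess)
    per_reciter = {r: 100.0 * hits[r] / totals[r] for r in totals}
    overall = 100.0 * sum(hits.values()) / len(true_labels)
    return overall, per_reciter

# Toy labels only: one of four samples is misrecognized.
overall, per_reciter = recognition_rates(["A", "A", "B", "B"],
                                         ["A", "B", "B", "B"])
```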

Experiment with Unknown Samples
The last experiment evaluates whether the proposed system can detect an unknown speaker who is not in the database. The database was collected for five reciters only: Abdul Basit, Abdurahman As-Sudais, Saud Ash-Shuraym, Sheikh Ali Abdulrahman, and Sheikh Said al-Ghamdi. For this purpose, one unknown reciter named Fatih was tested using the GMM trained on the training samples. The maximum log-likelihood of the unknown speaker did not match the parameters of any reciter in the database. As a result, Fatih was recognized as an unknown speaker, as shown in Figure 4.
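The exact rejection rule is not spelled out in the text. One common approach, assumed here purely for illustration, is to compare the best log-likelihood score against a threshold and to label the sample as unknown when even the best-matching model scores below it. The function name, the threshold, and the toy scores are hypothetical.

```python
def identify_or_reject(scores, threshold):
    """Return the best-scoring reciter, or "unknown" if its score is below threshold.

    scores: {reciter: average log-likelihood of the sample under that reciter's GMM}
    threshold: assumed cut-off, e.g. chosen from scores of known reciters
               on held-out samples.
    """
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "unknown"

# Toy scores only, not measured values.
label = identify_or_reject({"A": -61.2, "B": -58.7, "C": -64.0}, threshold=-50.0)
```

The threshold trades off false rejections of known reciters against false acceptances of unknown ones, so in practice it would need to be tuned on data that is not used for training.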

Recognition Rate Evaluation
It can be concluded that all three experiments were conducted successfully. The results showed that the proposed system was able to verify and identify the trained reciters. Although each reciter recited different surahs across the samples, the system could still recognize the pattern of the reciter's recitation. Furthermore, the proposed system was also able to reject the unknown reciter tested. Table 5 shows the recognition rate for each experiment. This result is better than the result reported in [4], which obtained around 91% accuracy using MFCC and ANN. The better results achieved in this paper could be due to the use of GMM instead of ANN, and to the randomized Quranic audio database, which makes the proposed system depend not on the recited Quranic verses but on the characteristics of the individual reciter.

CONCLUSIONS AND FUTURE WORKS
This paper has presented the development of a Quranic reciter identification system using MFCC and GMM. MFCC was selected as the feature, while GMM was selected as the classifier. First, we built a Quranic audio database from five reciters reciting different surahs from Al-Quran. Altogether, 15 samples were collected for each reciter, of which 10 were used to train the GMM and 5 were used for testing. Furthermore, we used another, unknown reciter to evaluate the performance of the proposed system. Results showed that the proposed system achieved 100% accuracy in the training and testing phases. The unknown samples were also rejected at a 100% rate. Further research includes shorter utterances of the recited Quranic verses, different reciters, different features, and different classifiers.