AN EFFECTIVE ARABIC TEXT CLASSIFICATION APPROACH BASED ON KERNEL NAIVE BAYES CLASSIFIER



INTRODUCTION
Online and offline storage integrated with intelligent tools has made it easier to access information stored in electronic documents. These documents can be easily processed and classified with such tools. According to [1], text classification (TC) is the activity of labelling natural-language texts with thematic categories from a pre-specified set. TC of electronic documents therefore refers to grouping documents into their relevant categories based on their contents. The rapid growth of text information on the Internet and in Big Data environments has resulted in billions of electronic documents being created, edited and stored digitally [2].
Arabic is one of the most widely spoken Semitic languages, sharing many commonalities with other Semitic languages in terms of vocabulary, vowels, morphology and word order [3,4]. Arabic topic classification (ATC) is considered one of the most challenging research topics. This is probably because Arabic words exhibit extensive variation in meaning, in addition to problems that are specific to the Arabic language. Using Natural Language Processing (NLP) tools coupled with TC algorithms, it is possible to automatically identify the semantic content of electronic documents and group them according to their topics. Topic detection or classification is achieved by training a model on the corpus of every topic using term (word) vectors. An Arabic topic detection algorithm involves identifying and selecting the topics from Arabic documents. This is achieved through various tools such as the Vector Space Model (VSM) [5]. In this context, it is worth mentioning that single-document summarization can be performed using a VSM, where the sentences of an Arabic document are taken to generate an executive summary. Automatic text summarization is therefore an important tool in Arabic topic detection [6]. Text mining is mainly performed based on an a priori architectural design and term (word) vectors. Arabic topic identification and Arabic topic mining algorithms are two techniques used to identify and classify text documents based on specific topics in an online or offline manner [7].
In this paper, we introduce a new approach to classify Arabic text using the kernel naive Bayes classifier. The proposed kernel naive Bayes classifier is more effective than the baseline classifiers used in the literature on Arabic topic classification, including the traditional naive Bayes classifier. First, the term frequency-inverse document frequency (TF-IDF) technique is used to generate the textual features of a corpus of Arabic text documents. The generated features are then used with the kernel naive Bayes classifier to classify the documents of the testing set. A number of baseline classifiers are then employed for comparison. These classifiers include Naïve Bayes (NB), Bayes Nets (BN), K-Nearest Neighbours (KNN), Decision Tree (J48), Support Vector Machine (SVM) and Hidden Markov Model (HMM). For implementation, the open-source machine-learning tool Weka is used for data preprocessing, feature extraction and classification.
In the next section, a literature review of related work is presented. Section 3 introduces the proposed methodology. Section 4 discusses the experiments and evaluation of the proposed approach. Finally, the conclusions are summarized in Section 5.

RELATED WORK
Text classification is becoming a very important task given the huge amount of text information now available. Large-scale online and offline electronic documents in many different languages are increasing every day [5]. Arabic text documents are part of this growth, and there is an increasing need to retrieve, filter and mine them in many applications. The authors of [8] developed a text classification system for the Arabic language. This system compares the representation of documents by N-grams (unigrams and bigrams) and by single terms (bag of words) as a feature extraction method in the pre-processing step. Afterwards, TF-IDF is used to reduce dimensionality and a KNN classifier is applied to classify the Arabic text. The experimental results showed that representing documents with unigrams and bigrams outperformed the bag-of-words representation in terms of accuracy.
In [9], a statistical method called maximum entropy is used to classify Arabic news articles. The authors of [10] designed a multi-word term extraction method for feature extraction in Arabic. They used a hybrid method to extract multi-word terminology from an Arabic corpus. From a linguistic perspective, they used some linguistic information to extract and filter the candidate multi-word terms. In [11], SVM is used to classify Arabic text and the result is compared with that of a KNN classifier. An Arabic topic identification algorithm involves the identification and selection of topics from an Arabic document, which has been achieved through various tools including the Vector Space Model (VSM). Single-document summarization is performed using the VSM, which takes an Arabic document and the initial sentence of the file to generate an executive summary. Automatic text summarization is thus an important tool in Arabic topic detection. According to the study conducted by [11], the use of the VSM for identifying and selecting topics from Arabic documents helped to obtain the best results compared to other methods. Automatic text summarization is an effective method for selecting the Arabic characters and reducing the noise in the documents [6].
Moreover, Hmeidi et al. [12] studied the influence of raw text, the Khoja root-based stemmer and light stemming on the classification of Arabic text documents with standard classifiers, such as NB, SVM, KNN, J48 and Decision Table. The results showed that the SVM and NB classifiers with light stemming provide better classification accuracy than the other classifiers. The same conclusion was drawn by Al-Badarneh [13] and Ayedh et al. [14] using various pre-processing methods. Additionally, Al-Molegi et al. [15] and Khreisat [16] proposed approaches to classify Arabic text documents based on the combination of N-grams with similarity measures, including the Manhattan distance, the Euclidean distance and the Dice measure. The overall results illustrated that the combination of tri-grams with the Dice measure obtained better performance.
Al-Anzi et al. [17] presented a method based on Latent Semantic Indexing (LSI) with clustering techniques for Arabic text classification, grouping similar unlabelled documents into a pre-defined number of topics. The results revealed that this method is able to label the documents without any training data. In another work, Al-Anzi et al. [18] offered a technique for Arabic text classification based on LSI and cosine similarity. The results showed that the LSI features significantly outperform TF-IDF. These results also demonstrated that KNN with the cosine measure and SVM attained the best performance. Even though most works in the literature have already achieved good performance, Arabic is a rich language that requires effective text classification algorithms dealing with different aspects of the language, such as vocabulary, morphology and syntax. The work in [18] addressed some of the challenges of the Arabic language. Additionally, the authors in [19] used the conventional TF-IDF for Arabic text classification with a number of different machine learning classifiers.

PROPOSED METHODOLOGY
Our proposed methodology aims to develop an automatic ATC system based on a kernel Naïve Bayes classifier. It consists of three steps, as shown in Figure 1. The first step is a pre-processing step that extracts the words (word tokenization), eliminates the stop words (stop word removal) from the collection of documents and removes common affixes (prefixes and suffixes) from words (Arabic light stemming). The second step is a feature extraction and normalization step that converts the word strings into vectors and normalizes them. The third step is the classification step, which assigns each text document to one of a pre-defined set of classes (topics) using the proposed kernel Naïve Bayes classifier. In the following, we describe each step in more detail:

Pre-processing
This step is responsible for Arabic word tokenization, stop word removal and Arabic light stemming. In tokenization, each sentence in the documents is broken into tokens (word strings). Stop word removal is used to remove uninformative words, such as the Arabic words for "from" and "on". Removing the stop words from the documents increases the accuracy of the classification task.
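The three pre-processing operations can be sketched as follows. This is a minimal illustration, not the Weka pipeline used in the experiments: the stop-word set, the affix lists and the helper names (`tokenize`, `light_stem`, `preprocess`) are assumptions chosen for the example, and a real system would use complete linguistic resources.

```python
import re

# Illustrative resources only (assumptions): a real system would load a full
# Arabic stop-word list and a complete light-stemming affix table.
STOP_WORDS = {"من", "على", "في", "إلى"}
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل"]  # checked longest first
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "ية", "ة", "ه", "ي"]

DIACRITICS = re.compile(r"[\u064B-\u0652]")    # short-vowel marks
ARABIC_WORD = re.compile(r"[\u0621-\u064A]+")  # runs of Arabic letters


def tokenize(text):
    """Strip diacritics, then split the text into Arabic word tokens."""
    return ARABIC_WORD.findall(DIACRITICS.sub("", text))


def light_stem(word, min_len=3):
    """Remove at most one prefix and one suffix, keeping at least min_len letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


def preprocess(text):
    """Tokenize, drop stop words, then light-stem the remaining tokens."""
    return [light_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


tokens = preprocess("ذهب الولد من المدرسة")
```

The minimum-length guard is a common safeguard in light stemmers so that short words are not reduced to fragments; the threshold of three letters is likewise an assumption.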

Feature Extraction and Normalization
Feature extraction of word strings is based on the standard TF-IDF method, one of the most popular methods in several text domains. TF-IDF is effective for selecting significant words because it assigns high weights to terms that are frequent in a given document but relatively rare in the whole corpus. The classical TF-IDF formula is shown in Equation (1):
$$w_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_i}\right) \qquad (1)$$
where $w_{i,j}$ is the weight of word $i$ in document $j$, $tf_{i,j}$ is the frequency of word $i$ in document $j$, $N$ is the number of documents in the collection, and $df_i$ is the number of documents that contain word $i$. Word frequencies for a document (instance) should be normalized by dividing them by the document length.
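As a worked illustration of the classical TF-IDF weighting with length-normalized term frequencies, the following sketch computes the weights for a toy corpus. The function name `tfidf`, the natural logarithm and the two example documents are assumptions made for the example; Weka's own filter may differ in details such as the logarithm base.

```python
import math
from collections import Counter


def tfidf(documents):
    """documents: list of token lists; returns one {word: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: number of documents containing each word.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        length = len(tokens) or 1  # guard against empty documents
        weights.append({
            # Normalized term frequency times inverse document frequency.
            word: (count / length) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


docs = [["sport", "goal", "sport"], ["goal", "market"]]
weights = tfidf(docs)
```

In this toy corpus "goal" appears in every document, so its weight is zero, while "sport" and "market" each appear in only one document and receive positive weights.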

Classification
In this step, we perform two tasks. The first task is building the machine learning model from a selected dataset, called the training dataset. The second task is testing the built model on another, unseen dataset, called the testing dataset. The model used in our methodology is a KNB classifier, that is, a Naïve Bayes classifier with kernel density estimation (KDE). The following subsection explains the proposed KNB.

Kernel Naïve Bayes (KNB) Classifier
Suppose $X = (x_1, x_2, \ldots, x_n)$ is a set of word features of an Arabic document and $C$ is a set of Arabic topics ($c_j$), or classes. By the Naive Bayes assumption, the probability of a topic $c_j$ given the word features is proportional to $P(c_j) \prod_{i=1}^{n} P(x_i \mid c_j)$, where $P(x_i \mid c_j)$ is the probability that the feature value of word $i$ equals $x_i$ given topic $c_j$ (class $j$). These probabilities were estimated using KDE from a set of labeled training data $(X, C)$. KDE is a non-parametric way of estimating the probability density function of a population [20]. The probability $P(x_i \mid c_j)$ was estimated using Equation (4):
$$P(x_i \mid c_j) = \frac{1}{N h} \sum_{v=1}^{N} \mathrm{guKernel}\left(\frac{x_i - x_{vi}}{h}\right) \qquad (4)$$
where guKernel is a Gaussian kernel function with mean zero and variance 1, $N$ is the number of input data $X$ belonging to class $c_j$, $x_{vi}$ is the feature value of the word in the $i$-th position of the $v$-th input ($x_{1i}, x_{2i}, \ldots, x_{Ni}$) in class $j$, and $h$ is the bandwidth, or smoothing parameter. To estimate the conditional probabilities optimally, $h$ was optimized on the training dataset.
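The idea of a Naive Bayes classifier whose per-feature likelihoods come from a Gaussian kernel density estimate can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the Weka classifier used in the experiments: the class name `KernelNaiveBayes`, the fixed bandwidth of 0.5 (the paper optimizes h on the training data) and the toy data are all hypothetical.

```python
import math
from collections import defaultdict


def gaussian_kernel(u):
    """Standard normal density (mean 0, variance 1)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(x, samples, h):
    """Kernel density estimate at x from the training values of one feature."""
    return sum(gaussian_kernel((x - s) / h) for s in samples) / (len(samples) * h)


class KernelNaiveBayes:
    def __init__(self, bandwidth=0.5):
        self.h = bandwidth  # assumed fixed here; tune on training data in practice

    def fit(self, X, y):
        by_class = defaultdict(list)
        for row, label in zip(X, y):
            by_class[label].append(row)
        # Class priors P(c_j) from relative class frequencies.
        self.priors = {c: len(rows) / len(X) for c, rows in by_class.items()}
        # columns[c][i] holds the training values of feature i within class c.
        self.columns = {c: list(zip(*rows)) for c, rows in by_class.items()}
        return self

    def log_posterior(self, x, c):
        """Unnormalized log P(c | x) under the naive independence assumption."""
        score = math.log(self.priors[c])
        for xi, samples in zip(x, self.columns[c]):
            score += math.log(kde(xi, samples, self.h) + 1e-300)  # avoid log(0)
        return score

    def predict(self, x):
        return max(self.priors, key=lambda c: self.log_posterior(x, c))


X_train = [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]]
y_train = ["sport", "sport", "economy", "economy"]
model = KernelNaiveBayes().fit(X_train, y_train)
label = model.predict([0.05, 0.05])
```

Working in log space keeps the product of many small densities from underflowing, which matters when each document has thousands of word features.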

Dataset Collection
We created an Arabic topic mining corpus that contains 1897 documents belonging to 3 different topics: economy (625 documents), culture (639 documents) and sport (633 documents). The corpus contains 2478 unique words. Table 1 shows the statistics of the created corpus. The dataset was collected from multiple online newspapers at different times between May and June 2017 for our Arabic text classification experiment. Collecting the data at different times makes the corpus more diverse, which gives a fairer test of the classifier and a more accurate evaluation of the work.

Tool Description
The experiment was conducted using the Waikato Environment for Knowledge Analysis (WEKA) [21]. It is widely used for machine learning and data mining and was originally developed at the University of Waikato in New Zealand.

Evaluation Measures and Comparison Results
The following five measures are computed to evaluate the performance of the proposed KNB classifier, using counts of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN).
• Recall, or sensitivity, is the number of True Positives divided by the number of True Positives and False Negatives, and is defined as TP/ (TP + FN).
• Precision measures the proportion of the number of True Positives divided by the number of True Positives and False Positives and is defined as TP/ (TP + FP).
• Accuracy (ACC) is the proportion of the number of True Positives and True Negatives divided by the number of True Positives, False Negatives, True Negatives and False Positives, and is defined as (TP + TN)/ (TP + FN + TN + FP).
• F-measure combines precision and recall into their harmonic mean, and is defined as 2 × (precision × recall)/ (precision + recall).
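The measures listed above follow directly from the four counts, as the sketch below shows. The function name and the example counts are assumptions for illustration. The Matthews correlation coefficient (MCC), which is reported alongside the Table 4 results but not defined in the list, is included using its standard definition; the code assumes the denominators of recall and precision are non-zero.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute recall, precision, accuracy, F-measure and MCC from raw counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    # Standard MCC definition; defined as 0 when any marginal count is zero.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "recall": recall,
        "precision": precision,
        "accuracy": accuracy,
        "f_measure": f_measure,
        "mcc": mcc,
    }


scores = classification_metrics(tp=8, fp=2, tn=8, fn=2)
```

For a multi-class problem such as the three-topic corpus, these counts are typically computed per class and then averaged; the averaging scheme is not stated in the text.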
In our experiment, we randomly split the dataset into 70% for training and the remaining 30% for testing. Table 2 shows the number of instances in the training and testing datasets.
Table 2: The number of instances in the training and testing datasets.
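A random 70/30 split of this kind can be sketched as follows. The function name, the fixed seed and the toy input are assumptions for illustration; the experiments themselves were run in Weka, whose split procedure may differ.

```python
import random


def train_test_split(items, train_fraction=0.7, seed=42):
    """Shuffle a copy of the items and cut it into training and testing parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


train, test = train_test_split(range(10))
```

Fixing the seed makes the split repeatable, so different classifiers can be compared on exactly the same training and testing documents.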
During the evaluation phase, we computed the training and testing accuracy of the proposed KNB classifier and the NB classifier. The results show that KNB achieved improvements of 13.1123% in training accuracy and 9.4737% in testing accuracy compared to the traditional NB, as seen in Table 3 and Figure 2. Table 4 summarizes the results of KNB compared to the other baseline classifiers in terms of the evaluation measures used in our experiment. In addition to accuracy, the time taken to build the model for each classifier is also evaluated in our experiment. The comparison of these times is shown in Table 5. As shown in Table 5, the time taken to build the HMM model is the lowest at 0.01 second, whereas the time taken to build the J48 model is the highest at 24.53 seconds. Our KNB model takes 0.79 second to build, compared to 1.31 seconds for SVM and 0.82 second for NB. In general, the time taken to build our model on the training data is less than one second, which is close to the fastest classifiers in terms of training speed.

CONCLUSIONS AND FUTURE WORK
In this paper, we proposed a new approach for Arabic text classification using a Kernel Naïve Bayes (KNB) classifier. Several text pre-processing techniques are used, including word tokenization, stop word removal and Arabic light stemming. For feature extraction, the TF-IDF technique is applied to convert the Arabic words into vectors, which are normalized for the classification task. Experimental results on the collected dataset show that our approach based on the proposed classifier achieved the best accuracy among all baseline classifiers used in previous studies, with a model-building time close to the fastest of them.
The main conclusion is that Arabic text classification of electronic documents using the proposed KNB shows better performance than the other baseline machine learning classifiers. An open path for future research is to compare the performance of Arabic text classifiers combined with dimensionality reduction methods, such as feature selection methods and topic models.

Figure 1. Proposed approach methodology.
In the evaluation measures, TP counts the predicted topics that are correctly classified as actual topics, FP counts the predicted topics that are incorrectly classified as actual topics, TN counts the unpredicted topics that are correctly classified as actual topics, and FN counts the unpredicted topics that are incorrectly classified as actual topics.

Figure 2. Results of KNB and NB classifiers.

Table 1: Data collection statistics.

Table 3: The results of training and testing accuracy for KNB and NB classifiers.

Table 4: The comparison results of the proposed KNB classifier and other baseline classifiers.
As Table 4 shows, the proposed KNB classifier achieves the highest accuracy (91.2281%), owing to its ability to handle the non-linearity of Arabic text classification. The best results are highlighted in bold in Table 4. In contrast, the HMM classifier achieved the lowest accuracy (32.1053%), as well as the lowest results for the other measures. The best MCC is obtained by the proposed KNB classifier (0.869), whereas the worst is obtained by HMM (0). The reason is that KNB classifies more instances correctly than any baseline classifier in our study. The SVM classifier also achieved a good result (85.7895%) compared to the other classifiers, which agrees with previous studies that found SVM to be a good classifier for text classification. In more detail, the Precision, Recall and F-Measure of our method are the highest at 91.3%, 91.2% and 91.3%, respectively, whereas those of the SVM are 88.5%, 85.8% and 86.1%, respectively. Overall, the proposed KNB classifier attains the highest results in all measures of the study.

Table 5: The time taken to build each classifier model in our experiment.