Question to Question Similarity Analysis using Morphological, Syntactic, Semantic, and Lexical Features

In the digitally connected world we live in, people expect to get answers to their questions instantly. This fact has increased the burden on Question/Answer platforms such as Stack Overflow and many others. A promising solution to this problem is to detect whether a question being asked is similar to a question already in the database and present the answer of the detected question to the user. To address this challenge, we propose a novel Natural Language Processing (NLP) approach that detects whether two Arabic questions are similar using their extracted morphological, syntactic, semantic, and lexical features. Our approach involves several phases, including Arabic text processing, novel feature extraction, and text classification. To conduct our experiments, we used a real-world dataset consisting of 4,000 pairs of Arabic questions, on which our approach achieved 78.2% accuracy using an XGBoost model trained on the best features selected by the Random Forest feature selection technique. This high accuracy shows the ability of our approach to correctly detect the similarity between two Arabic questions.


I. INTRODUCTION
Question/Answer websites such as Stack Overflow [1], Quora [2], Mawdoo3 [3] and many more, greatly benefit from detecting questions similar to a newly asked question. Such information allows these websites to answer users faster, avoid duplicate answers, and spare the effort of answering the new question from scratch. Such a service would increase the usability of the platform as well as user satisfaction. The challenge is how to efficiently and accurately detect whether two questions are similar.
Semantic Text Similarity (STS) techniques [4] [5] are concerned with recognizing the similarity of two texts. STS has been widely utilized in Natural Language Processing (NLP) research areas such as text classification, text summarization, information retrieval, and word sense disambiguation. In this paper, we propose a novel approach to detect the similarity between two Arabic questions based on their morphological, syntactic, semantic, and lexical features. Our approach involves several steps, including Arabic text processing, novel feature extraction, and text classification.
Text classification tasks are considered supervised learning problems [6] [7], where the model is trained using labeled data. Several machine learning algorithms have been used to solve text classification problems, such as the Support Vector Machine (SVM), as in [8], [9], and [10], and the Decision Tree (DT), as in [11]. To solve this problem, we extracted novel features from a large labeled dataset of a real-world question/answer platform and trained the XGBoost machine learning model on different sets of these features.
The contribution of this paper is twofold. First, we extract novel features of Arabic questions, including morphological, syntactic, semantic, and lexical features. Second, we build an XGBoost machine learning model and train and test it on different sets of features. We used a real Arabic question corpus that consists of 4,000 pairs of Arabic questions shared with us from the Mawdoo3 website [3], the largest Arabic website with more than 50 million visitors monthly and more than 150,000 articles. To the best of our knowledge, this is the first research to apply text classification to detect similar Arabic questions using morphological, syntactic, semantic, and lexical features and the XGBoost supervised machine learning model on a real dataset of this size.
The remainder of this paper is structured as follows. Section II covers the related work. Section III introduces our methodology. Section IV describes the different evaluation metrics we used to report the performance of our model as well as the results obtained from our machine learning model. The last section concludes the paper with avenues of future work.
II. RELATED WORK
In this section, we review previous work related to Arabic text classification.
Al-Anzi and AbuZeina [12] showed that cosine similarity is a preferable measure for Arabic text classification. They also provided a comparison of eight text classification techniques. This research supports our decision to include the cosine lexical feature in our model. AL-Smadi et al. [13] proposed an approach for detecting similar news in Arabic tweets on the Twitter social media platform. Hamza et al. [14] built a taxonomy of Arabic question domains and proposed a technique for classifying Arabic questions to help question answering platforms retrieve answers more efficiently.
Siolas and d'Alché-Buc [15] proposed an approach to solve the text classification problem based on a priori semantic knowledge of words. They used two supervised classification algorithms, the Support Vector Machine (SVM) and the K-Nearest Neighbors (K-NN), and they found that SVM outperformed the K-NN.
All of the aforementioned research efforts are related to our text classification research. However, none of them solves the problem we address: given two Arabic questions, determine efficiently and accurately whether they are similar.
III. METHODOLOGY
The Question to Question (Q2Q) similarity detection task can be approached mainly in two ways: a string-based matching technique and a machine-learning-based technique. A string-based technique checks whether two questions contain the same words or words that are similar in meaning. Such an approach does not work in real life because it is tedious, it requires building a word-to-word similarity map, and it cannot be generalized to new questions with new words.
On the other hand, the learning-based technique utilizes machine learning to automatically classify whether two questions are similar. Such an approach is efficient and generalizable. In the learning-based approach, each pair of questions is an instance represented by a set of features. The main challenge of such an approach lies in carefully extracting features that help the learning model accurately classify whether two arbitrary questions are similar.
To determine if two questions are similar, we used a supervised learning model. For supervised learning, each instance is given a label of either "Yes", if the two questions are similar, or "No" if the two questions are different. Figure 1 overviews our machine learning approach which consists of five phases: (1) obtaining and cleansing the dataset (Dataset), described in Section III-A, (2) Feature Extraction, described in Section III-B, (3) Feature Selection, explained in Section III-C, (4) Model Training, described in Section IV, and (5) Model Testing and Reporting Results, discussed in Section IV.

A. Preparing and Cleansing the Dataset
To develop our machine-learning-based technique, we used a real-world dataset shared confidentially with us by Mawdoo3 [3] company, a leading Arabic content platform that allows users to ask and answer questions.
The dataset consists of 8,997 pairs of Arabic questions (17,994 questions) manually labeled by the Mawdoo3 company. Each pair of questions, an instance in the dataset, is labeled either with class "Yes", if the two questions are similar, or with class "No", if they are not. Table I shows the number of instances in each class. Based on the table, there are 3,579 pairs of similar questions, i.e., 40% of the dataset, and 5,418 pairs of dissimilar questions, i.e., 60% of the dataset. Table II shows a randomly selected pair of questions that belongs to the "Yes" class and another pair that belongs to the "No" class. In the first row of Table II, Question 1 asks about the birth city of the comprehensive thinker Al-Razi, whereas Question 2 asks about the city of his museum. Clearly, those two questions are not similar since they ask about two different things. On the other hand, Question 1 and Question 2 in the second row of Table II ask, in different ways, about the first country to adopt the Communist political ideology.
Since 60% of the instances in the original dataset belong to the "No" class, the decisions of the learning model would be biased toward the "No" label. To avoid such bias in the learning process, we only included 4,000 pairs of Arabic questions distributed evenly, i.e., 2,000 instances belonging to the "Yes" class and the same number belonging to the "No" class, as shown in Table III. This dataset was used to train and test our model.
Data cleansing is an important step in machine learning since it ensures that all instances have correct labels, removes duplicates, and corrects corrupt instances or missing labels. Therefore, after we collected our dataset, we processed it, removed unnecessary symbols such as ", (, ), and _, and added "?" to questions without question marks.
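The cleansing and balancing steps can be sketched as follows. This is an illustrative sketch, not the paper's actual code; `clean_question` and `balance_pairs` are hypothetical helper names.

```python
import random
import re

def clean_question(text):
    # Strip stray symbols (", (, ), _) and ensure a trailing question mark.
    text = re.sub(r'["()_]', "", text).strip()
    if not text.endswith("?"):
        text += "?"
    return text

def balance_pairs(pairs, per_class=2000, seed=0):
    # Down-sample each class ("Yes"/"No") to the same number of pairs.
    rng = random.Random(seed)
    yes = [p for p in pairs if p[2] == "Yes"]
    no = [p for p in pairs if p[2] == "No"]
    rng.shuffle(yes)
    rng.shuffle(no)
    return yes[:per_class] + no[:per_class]

# Tiny made-up example pairs, each labeled "Yes" (similar) or "No".
pairs = [("What is X", "Define X", "Yes"), ('Where is "Y"?', "Who is Z?", "No")]
cleaned = [(clean_question(a), clean_question(b), y) for a, b, y in pairs]
```

Down-sampling the majority class, as above, is the simplest way to obtain the even 2,000/2,000 split of Table III.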

B. Feature Extraction
Feature extraction is an important step in machine learning approach since all subsequent training, testing, and generalization steps depend on it. Therefore, we carefully designed and extracted novel features inherited from the NLP field. To that end, for each question, we extracted features that belong to four categories: morphological, syntactic, semantic, and lexical features.
This section describes the features that we extracted as well as the tools we used to extract them.

1) Extracting Lexical Features:
For the lexical features, we computed the Cosine and Jaccard similarities of the two questions. Both measure the distance between two objects represented as vectors or sets: if the distance is small, the two objects are considered similar; otherwise, they are not.
These measures are widely used in text classification tasks. Jaccard similarity depends on the overlapping (shared) words of the two questions, whereas for the Cosine similarity, we leveraged the word embedding technique to represent each question as a numerical vector and then calculated the Cosine similarity (the distance) between the two vectors.
Word embedding is a vector representation of a word and one of the most popular representations of text vocabulary. It can capture the context of a word in a text, such as semantic similarity, syntactic similarity, relations with other words, and more. There are many techniques to generate word embedding vectors, such as Word2Vec.
• Cosine Similarity. It measures the cosine of the angle between the two question vectors. Given the embedding vectors A and B of two questions, the Cosine similarity is calculated using Equation 1.

Cosine(A, B) = (A · B) / (||A|| ||B||) (1)

• Jaccard Similarity. It measures the similarity between two finite sets by calculating the number of overlapping words in the two questions over the number of unique words in them. The resulting values are between 0 and 1 (0 ≤ J(A,B) ≤ 1). Given questions A and B, the Jaccard similarity is calculated using Equation 2.

J(A, B) = |A ∩ B| / |A ∪ B| (2)
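The two lexical features can be sketched as follows. The three-dimensional vectors here are toy stand-ins for real Word2Vec embeddings (the values are made up for illustration), and averaging word vectors is one common way to obtain a question vector, not necessarily the paper's exact choice.

```python
import math

# Toy 3-dimensional vectors standing in for real Word2Vec embeddings.
EMBEDDINGS = {
    "capital": [0.9, 0.1, 0.0],
    "city":    [0.8, 0.2, 0.1],
    "country": [0.7, 0.1, 0.2],
}

def question_vector(tokens):
    # Represent a question as the average of its known word vectors.
    vecs = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vecs:
        return [0.0, 0.0, 0.0]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(u, v):
    # Equation 1: dot product over the product of the vector norms.
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def jaccard(q1_tokens, q2_tokens):
    # Equation 2: overlapping words over unique words of the two questions.
    a, b = set(q1_tokens), set(q2_tokens)
    return len(a & b) / len(a | b) if a | b else 0.0
```

For example, two three-word questions that share two words get a Jaccard similarity of 2/4 = 0.5, since two words overlap out of four unique words.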
2) Extracting Morphological, Semantic, and Syntactic Features: There are many data analysis tools to analyze and extract features from Arabic scripts, but the most widely used by researchers in the NLP community is MADAMIRA [20]. MADAMIRA is a comprehensive Arabic analyzer developed by combining two systems, MADA [21] and AMIRA [22]. We used MADAMIRA to extract the morphological, semantic, and syntactic features.
Arabic morphology [23] is a critical analysis task for NLP that focuses on the meaning and surface form of words. Arabic morphology is divided into three categories:
• Inflectional Morphology. Inflectional morphology determines the forms (grammatical categories) of words after changing the affixation, i.e., adding a morpheme, affix, or vowel to the words. There are two categories of inflectional morphology features: the verb category features, such as aspect, mood, and voice, and the subject category features, such as person, gender, and number.
• Cliticization Morphology. A clitic is a word or part of a word that does not stand by itself and depends on its neighboring words, such as "m" in the word "I'm". The cliticization morphology features are Proclitic 0, Proclitic 1, Proclitic 2, Proclitic 3, and Enclitic 0.
• Derivational Morphology. Derivational morphology is the process of creating new words from existing words, mainly by adding a prefix or suffix. By doing so, the grammatical category of a word can change from one category to another. An example of derivational morphology is the Part-of-Speech (PoS) feature, which is considered a syntactic feature.
In addition to the features extracted using MADAMIRA, we also extract the Named Entity Recognition (NER) semantic feature. NER, also called entity identification, is the process of locating and classifying the named entities mentioned in a text. Table IV presents the morphological, syntactic, and semantic features that we extracted from our dataset using MADAMIRA. Based on the table, the total number of extracted features is 19, divided as follows: 17 morphological features, 1 syntactic feature, the PoS, and 1 semantic feature, the NER.
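Since each of these categorical features is produced per question, a pair-level feature must combine the two questions' values. One plausible encoding, not necessarily the paper's exact one, is an agreement indicator per feature; the analyses below are hypothetical stand-ins for parsed MADAMIRA output.

```python
# A subset of the categorical feature names, as they might appear
# after parsing MADAMIRA's output (hypothetical field names).
FEATURE_NAMES = ["pos", "ner", "gen", "num", "asp"]

def pair_features(analysis_q1, analysis_q2):
    # For each categorical feature, emit 1 if the two questions agree, else 0.
    return {name: int(analysis_q1.get(name) == analysis_q2.get(name))
            for name in FEATURE_NAMES}

# Made-up analyses for two questions.
q1 = {"pos": "noun", "ner": "LOC", "gen": "m", "num": "s", "asp": "na"}
q2 = {"pos": "noun", "ner": "PER", "gen": "m", "num": "p", "asp": "na"}
features = pair_features(q1, q2)
```

The resulting dictionary of 0/1 agreement flags can be concatenated with the two lexical similarities to form the full feature vector of an instance.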

C. Features Selection
After extracting the features, we used Random Forest (RF) [24] to automatically select the most important ones among our 21 features. Feature selection helps build learning models faster, increases their performance, and reduces the possibility of overfitting [25].
Random Forest uses a measure called impurity to find the best feature to split the data on. When training a tree, it can compute how much each feature decreases the weighted impurity in the tree; the impurity decrease of each feature is then averaged over all trees, and the features are ranked according to this measure. Table V shows the importance values of the features, ordered descendingly, as obtained from running Random Forest on our dataset. From Table V, we chose the 10 most important features for building our learning models. The top 10 features out of the 21 extracted features are: Cosine Similarity, Jaccard Similarity, Gen, Stt, Cas, Prc0, Prc3, NER, Num, and BPC.
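This mean-decrease-in-impurity ranking can be sketched with scikit-learn; the synthetic data below stands in for the 21-feature question-pair dataset and is not the paper's actual pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the 21-feature question-pair dataset.
X, y = make_classification(n_samples=500, n_features=21, n_informative=8,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ holds the mean decrease in impurity per feature,
# averaged over all trees and normalized to sum to 1.
ranked = sorted(enumerate(rf.feature_importances_),
                key=lambda p: p[1], reverse=True)
top10 = [idx for idx, _ in ranked[:10]]
```

Training and testing can then be restricted to the columns in `top10`, mirroring the paper's use of the 10 highest-ranked features.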

D. The Model: Extreme Gradient Boosting (XGB)
Following the success of the Extreme Gradient Boosting (XGB) machine learning model, we used it in our research. XGB is an implementation of gradient boosted decision trees used in supervised learning tasks such as regression and classification [14]. It produces an ensemble of weak prediction models, mainly decision trees. Each time a weak learner is added, the loss value is computed, until the model achieves a satisfactory performance value. We trained our XGBoost model with the following parameters and values: n_estimators=20, learning_rate=0.2, max_features=2, max_depth=2, and random_state=0.
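The listed hyper-parameter names (including max_features) match scikit-learn's GradientBoostingClassifier signature, so a minimal sketch with that class is shown below on synthetic stand-in data; xgboost's XGBClassifier exposes an analogous scikit-learn-style interface. This is an illustrative setup, not the paper's actual training script.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 4,000 question-pair feature vectors.
X, y = make_classification(n_samples=4000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# The hyper-parameter values reported in the paper.
model = GradientBoostingClassifier(n_estimators=20, learning_rate=0.2,
                                   max_features=2, max_depth=2,
                                   random_state=0).fit(X_tr, y_tr)
accuracy = model.score(X_te, y_te)
```

The shallow trees (max_depth=2) and small ensemble (20 learners) keep each weak learner simple, relying on boosting to combine them into a strong classifier.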

B. Model Evaluation Metrics
To evaluate our machine learning model, we calculated and reported different measures: Precision, Recall, F-measure, and Accuracy. This section provides a brief description of each of them to help the reader understand them and their differences.

1) Precision: Precision measures the ability of a model to return only relevant instances, i.e., the ratio of correctly predicted positive instances to all predicted positive instances. It is calculated using Equation 3.

Precision = True Positives / (True Positives + False Positives) (3)

2) Recall: Recall measures the ability of a model to identify the correct instances over all relevant instances in the dataset. It is calculated using Equation 4.

Recall = True Positives / (True Positives + False Negatives) (4)

3) F-measure: F-measure is the harmonic mean of the precision and the recall. It is calculated using Equation 5.

F-measure = 2 × (Precision × Recall) / (Precision + Recall) (5)

4) Accuracy: Accuracy is the ratio of correctly predicted observations to the total number of observations. It is the simplest and most widely used measure to evaluate models. It is calculated using Equation 6, which is equivalent to Equation 7.

Accuracy = Number of Correct Predictions / Total Number of Predictions (6)

Accuracy = (TP + TN) / (TP + TN + FP + FN) (7)

After we extracted the features as explained in Section III-B, we used the 4,000 pairs of Arabic questions to train and test our XGBoost model. As shown in Table VI, we split the dataset into a 70% training set and a 30% testing set. The training set consists of 2,800 pairs of questions in which the number of instances belonging to the "Yes" class equals the number belonging to the "No" class. The testing set contains 1,200 instances, of which 601 are "Yes" instances and 599 are "No" instances.
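The four measures above follow directly from the four confusion counts, for example:

```python
def metrics(tp, fp, fn, tn):
    # Precision, recall, F-measure, and accuracy (Equations 3-7)
    # computed from the confusion-matrix counts.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f_measure, accuracy
```

For instance, a classifier with 50 true positives, 10 false positives, 25 false negatives, and 15 true negatives has precision 50/60, recall 50/75, and accuracy 65/100 = 0.65.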
As shown in Table VII, we trained and tested our XGBoost model using the dataset described in Table VI. We conducted three different experiments, each including a different set of features. In the first experiment, we included all the features, and our model achieved a 77.42% F-measure and 78.5% accuracy. In the second experiment, we included only the best 10 features determined by the Random Forest feature selection technique, and our model achieved a 76.88% F-measure with 78.2% accuracy. In the last experiment, we included only the cosine similarity feature, and our model still obtained good results: a 71.41% F-measure and 72.7% accuracy.
All of the results shown in Table VII demonstrate the ability of our model to detect similar Arabic questions with high accuracy. They also indicate the importance of the word embedding feature to the performance of our model. Moreover, since the performance of our model using all features is very close to its performance using the best 10 features, the 10 selected features are sufficient to describe the dataset without the need to spend extra time extracting the others.

V. CONCLUSION AND FUTURE WORK
A challenge facing Question/Answer platforms is detecting whether a question being asked is similar to a question already in the database, so they can serve their users faster and increase their satisfaction. Unfortunately, this problem remains a research challenge, especially for Arabic text. To address this challenge, we propose a novel NLP approach that detects whether two Arabic questions are similar using their morphological, syntactic, semantic, and lexical features. Moreover, we built an XGBoost machine learning model and trained it on different sets of features. To extract the morphological, semantic, and syntactic features, we leveraged MADAMIRA. To extract the lexical features, we calculated the cosine similarity on the word embedding vectors of the two questions as well as their Jaccard similarity. Using a real-world dataset shared with us by Mawdoo3 [3], which consists of 4,000 pairs of questions, our approach achieved 78.2% accuracy using the XGBoost classifier on the best features selected by the Random Forest feature selection technique.
Leveraging deep learning techniques to solve the Arabic question to question similarity problem and comparing the results with the traditional machine learning classifiers is an interesting avenue of future work.