Machine learning with multistage classifiers for identification of ectoparasite-infected mud crab genus Scylla

Mud crab farming has recently become a source of income for rural populations. However, parasites in mud crabs can shorten the crabs' life span. Parasites have been found in hundreds of mud crabs, particularly in the coastal waters of Terengganu, Malaysia. This study investigates the initial identification of parasite features, grouped by class, using machine learning techniques. We employed five classifiers, i.e., logistic regression (LR), k-nearest neighbors (kNN), Gaussian Naive Bayes (GNB), support vector machine (SVM), and linear discriminant analysis (LDA), and compared them to find the best classification performance. The classification process involves three stages. First, the samples are classified into two classes (normal and abnormal) regardless of their ventral types. Second, they are classified by sex (female or male) and maturity (mature or immature). Finally, the five classifiers are compared to identify the species of the parasite. The experimental results showed that GNB and LDA are the most effective classifiers for the initial classification of the rhizocephalan parasite within the mud crab genus Scylla.


INTRODUCTION
The demand for mud crab, genus Scylla, has increased rapidly, leading to growth in the mud crab cultivation business [1,2]. To keep mud crab populations free of parasites, it is crucial to understand the parasite profiles. Infected mud crabs could have a large impact on indigenous crab stocks, with significant deleterious economic consequences [3]. It is therefore imperative to analyze infections in mud crab species accurately and as early as possible. If reasonable measures of parasite control are not developed, mud crab production could be negatively affected. Loxothylacus ihlei Boschma and Sacculina beauforti are species of rhizocephalan. However, the parasites of mud crabs have attracted less attention from researchers than those of shellfish and other cultured species, owing to the lack of descriptions of their infections in commercially exploited species [4]. The analysis of mud crab parasites has not been properly implemented. In Asian countries, however, the improper use of mud crab culture methods has recently prompted such analysis [5,6]. At present, many methods and strategies (dissection microscopes, morphological keys, the shape of calcareous plates) have been explored to identify the true species of infected mud crabs [4]. Routine screening for infected species can generate huge volumes of samples, particularly during a suspected outbreak, and these samples must be identified rapidly and correctly. These methods, however, are still carried out manually in a biology laboratory [3].

TELKOMNIKA Telecommun Comput El Control
In this study, a semi-automatic method based on machine learning techniques was developed for the early detection of infected mud crabs. The process involves classifying the parasite features in the mud crab, genus Scylla, using multistage classifiers, namely logistic regression (LR), k-nearest neighbors (kNN), Gaussian Naive Bayes (GNB), support vector machine (SVM), and linear discriminant analysis (LDA). The scheme builds on evidence that an integrated ensemble method can learn interpretable multi-target models that predict several target classes simultaneously [7]. Ensemble-based methods have recently received great attention because of their reported superiority in generalization performance over single-method systems [8,9]. The aim of ensemble classification is to combine multiple models (classifiers or features) to solve a particular problem [10]. Ensemble methods can be divided into several categories, such as ensemble classifiers, ensemble features, and ensembles of both features and classifiers [11][12][13].

MUD CRAB, SCYLLA OLIVACEA
Scylla tranquebarica, Scylla paramamosain and Scylla olivacea are the mud crab species that have been recognized in Malaysia, and S. olivacea is the species most commonly found [4]. In Malaysia, mud crabs, locally known as ketam nipah or ketam bakau, inhabit mangrove forests and river mouths in estuarine surroundings [14]. From 2008 to 2015, inshore fisheries in Malaysia were consistently greater in amount and value than the deep-sea fisheries. Mud crabs were the most notable species in small-scale fisheries landings on the east and west coasts of Peninsular Malaysia and in East Malaysia. Mud crabs are highly sought after, both locally and overseas, and constitute a significant fishery sector in Malaysia. During previous sampling work, a survey found that several forms of abnormality were present in the mud crab species in Marudu Bay, Sabah, Malaysia [15]. Local people also prefer abnormal crabs because they are fuller in meat, as indicated by their higher body weight (BW) compared with normal mature mud crabs, and this practice has been widespread for five years. This is remarkable because most hosts infested with rhizocephalan parasites have lower growth rates and reduced feeding behavior, and the effect of consuming such crabs on human health is of vital concern.
Rhizocephalan parasites are well known to sterilize and feminize their hosts. The most distinctive feature of rhizocephalan infections is the appearance of a yellow sac-like structure known as the externa (which contains the female parasite's reproductive organs) on the host's outer abdominal cavity [15]. The parasite feeds on nutrients drawn from its host's hemolymph via the interna (a root-like structure), and infected crabs cannot moult after the externa has formed [16,17]. Once a virgin female externa is fertilized by a male, the infected crab cares for the parasite's fertilized eggs until they hatch. An infected male is feminized and shows morphological and behavioral changes typical of females, such as a decrease in chela size, widening of the abdomen, and egg-caring behavior [17,18]. Because it mostly targets crabs, this parasite infestation will result in overexploitation, threatening the worldwide fisheries industry. The sterilizing effect of the parasites, arising from changes in morphological characters and hormonal concentrations, reduces overall reproductive rates and thus reduces population densities over time.

MACHINE LEARNING CLASSIFICATION
Today, machine learning is the fastest-spreading field in computer science and encompasses areas as varied as information security, manufacturing, marketing, transportation and health care [19][20][21][22][23]. Machine learning is a subset of artificial intelligence that gives computers and computer systems the capacity to learn and improve from prior experience without explicit human programming. It is based on computer programs that can collect information and learn for themselves [20,21].
Arthur Samuel, a pioneer of machine learning, described it as a research field that enables computers to learn without explicit programming [24]. Machine learning applies specific algorithms to datasets for trend prediction, categorization, or delineation, and its methods have traditionally been applied to large, high-dimensional databases [23]. In this research, five machine learning classifiers (LR, kNN, GNB, SVM and LDA) are used to classify the parasite in the mud crab genus Scylla.

DATASET
Mud crabs of the genus Scylla are found in muddy mangrove zones and estuaries. They are widely regarded as an important resource for small-scale fishers throughout the Asia Pacific. Crabs of the genus Scylla are among the most traded seafood products in Asia, and their culture has therefore been practiced in most Asian countries for some years. The demand for mud crabs has grown quickly over the last decade, giving great potential for the development of the mud crab aquaculture industry.
S. olivacea, S. paramamosain and S. tranquebarica are the three species of mud crab that have been identified in Malaysia [25]. S. olivacea is the most common of these species [26]. S. paramamosain, unlike the other species, is found only within a restricted area. However, the taxonomy and species status of mud crabs in Malaysia have yet to be fully defined. In this study, only S. olivacea is taken for analysis and classification.
Identification of S. olivacea was based on the morphological characters described by Waiho et al. [6]. These are carapace width (CW), body weight (BW), abdomen width (AW), gonopod length (GL), externa diameter (ED) and pleopod length (PL). Figure 1 shows the ventral view of normal and abnormal S. olivacea. The other features considered are level of normality (normal or abnormal), sex (female or male) and maturity (mature or immature). In total, 1100 data points were used in this analysis and classification.

METHODOLOGY
A range of machine learning classifiers and their performance in classifying the data were investigated. The framework for classifying the parasite in mud crab, genus Scylla, is illustrated in Figure 2. It involves data collection, data preparation, and classification. There are four stages in the classification of the species (Figure 3). In the first stage, the mud crabs are classified into two groups, namely normal and abnormal. In this study, only the normal group is carried forward for further classification. In the second stage, the normal group is classified as male or female. In the next stage, these two groups are classified as either mature or immature. Lastly, the accuracy of each classifier at each stage is computed using (1) [28].
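Assuming (1) is the usual definition of accuracy, that is, the number of correctly classified samples divided by the total number of samples, it can be computed from a confusion matrix as in the minimal sketch below. The function name and the example counts are hypothetical and are not taken from the study.

```python
def accuracy_from_confusion(matrix):
    """Return the fraction of correct predictions in a square confusion matrix.

    matrix[i][j] counts the samples whose true class is i and whose
    predicted class is j, so correct predictions lie on the diagonal.
    """
    total = sum(sum(row) for row in matrix)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


# Hypothetical two-class counts (normal vs. abnormal), for illustration only.
example = [[48, 2],
           [5, 45]]
example_accuracy = accuracy_from_confusion(example)
```

The same function works for any number of classes, because it only sums the diagonal and the whole matrix.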

Data collection
A total of 1100 mud crab records were used in this study. To collect the data, the mud crabs, S. olivacea, were first screened. S. olivacea was then identified on the basis of morphological characters. These include the carapace width (CW), body weight (BW), abdomen width (AW), gonopod length (GL), externa diameter (ED) and pleopod length (PL). The outcome of this data collection step is a set of features that could support semi-automatic identification of the parasite.

Classification techniques
Machine learning classifiers, i.e., LDA, LR, kNN, GNB, and SVM, are used to classify and thus identify the parasite species that infect the mud crab. Through experimental investigation, these classifiers were used to find the classification accuracy for normality in male and female crabs of the genus Scylla. The accuracy and the confusion matrix were used to measure the performance of the classifiers.
Linear discriminant analysis (LDA), also called normal discriminant analysis (NDA) or discriminant function analysis, is a generalization of Fisher's linear discriminant. This method is used in statistics, pattern recognition, and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier or, more commonly, for dimensionality reduction before later classification. Fisher's linear discriminant analysis (FLDA) tries to find a projection matrix that projects the training data onto a low-dimensional space in which the between-class variance is maximized and the within-class variance is minimized [29]. There are five general steps for performing a linear discriminant analysis:
a. Compute the d-dimensional mean vectors for the different classes from the dataset.
b. Compute the scatter matrices (between-class and within-class scatter matrix).
c. Compute the eigenvectors ($e_1, e_2, \ldots, e_d$) and corresponding eigenvalues ($\lambda_1, \lambda_2, \ldots, \lambda_d$) for the scatter matrices.
d. Sort the eigenvectors by decreasing eigenvalues and choose the k eigenvectors with the largest eigenvalues to form a d × k dimensional matrix W (where every column represents an eigenvector).
e. Use this d × k eigenvector matrix to transform the samples onto the new subspace. This can be summarized by the matrix multiplication $Y = X \times W$ (where $X$ is an n × d dimensional matrix representing the n samples, and $Y$ holds the transformed n × k dimensional samples in the new subspace).
The k-nearest neighbors (kNN) classifier assigns any test vector to the class to which most of its k nearest neighbors belong, based on the distances between the test and training vectors in the feature space [24].
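The LDA steps a to e above can be illustrated for the simplest case of two classes and two features. In that case the leading eigenvector from steps c and d is proportional to $S_W^{-1}(m_a - m_b)$, where $S_W$ is the within-class scatter matrix and $m_a$, $m_b$ are the class means, so no general eigen-solver is needed. The sketch below relies on that simplification; the function names and the measurements are hypothetical and only illustrate the computation.

```python
def mean_vector(rows):
    """Step a: column-wise mean of a list of equal-length feature rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def within_class_scatter(groups):
    """Step b: sum over classes of (x - m)(x - m)^T for 2-feature samples."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows in groups:
        m = mean_vector(rows)
        for x in rows:
            d = [x[0] - m[0], x[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    return s


def fisher_direction(class_a, class_b):
    """Steps c and d for two classes: w = S_W^{-1} (m_a - m_b)."""
    s = within_class_scatter([class_a, class_b])
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    m_a, m_b = mean_vector(class_a), mean_vector(class_b)
    diff = [m_a[0] - m_b[0], m_a[1] - m_b[1]]
    # Closed-form inverse of the 2 x 2 scatter matrix applied to diff.
    return [(s[1][1] * diff[0] - s[0][1] * diff[1]) / det,
            (-s[1][0] * diff[0] + s[0][0] * diff[1]) / det]


def project(rows, w):
    """Step e with k = 1: y = x . w for every sample."""
    return [x[0] * w[0] + x[1] * w[1] for x in rows]


# Hypothetical (carapace width, abdomen width) pairs, for illustration only.
group_a = [[9.0, 3.0], [9.5, 3.4], [10.1, 3.3], [10.4, 3.9]]
group_b = [[9.2, 5.0], [9.8, 5.5], [10.0, 5.2], [10.6, 5.9]]
w = fisher_direction(group_a, group_b)
proj_a, proj_b = project(group_a, w), project(group_b, w)
```

On this toy data the two groups do not overlap after projection, so a single threshold on the projected value separates them.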
Although classification performance is directly related to the parameter k, there is no clear guidance on how to select k, except that it should be positive and not a multiple of the total number of classes [10]. The specific classification $\hat{Y}$ can be computed as follows:
$$\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i$$
where $N_k(x)$ is the set of the k training points closest to $x$. The kNN formula should be interpreted as averaging the categories of the k data points within a circle around a certain point. If k = 1, each training point is categorized correctly, since the formula then implies $\hat{Y}(x_i) = y_i$ for each $x_i$.
The Gaussian Naive Bayes (GNB) classifier applies Bayes' theorem to classification under the naive assumption that the variables are conditionally independent given the class [30]. The formula of GNB can be written as
$$P(X_i = x_i \mid Y = y_j) = \frac{1}{\sqrt{2\pi\sigma_{ij}^2}} \exp\left(-\frac{(x_i - \mu_{ij})^2}{2\sigma_{ij}^2}\right)$$
where $P(X_i \mid y)$ is a Gaussian density, $X_i$ is the i-th variable, $x_i$ is the value of the i-th variable, $Y$ is the class, $y_j$ is the sub class of $Y$ being sought, $\mu_{ij}$ is the mean of the i-th variable within $y_j$, and $\sigma_{ij}$ is the corresponding standard deviation.
The classification of the mud crabs is divided into three stages:
a. Stage 1: Classification by normality. At this stage, the mud crab dataset is separated into two classes, normal or abnormal, using the five classifiers.
b. Stage 2: Classification by sex. At this second stage, the classification continues from the normal and abnormal data obtained in the first stage. The data are used for training so that the groups are classified into two: male and female. The results are summarized in a confusion matrix.
c. Stage 3: Classification by maturity. At this stage, the male and female groups obtained from the second stage are classified into mature and immature. This last stage of the classification process is crucial, since only the mature group is expected to be free of the parasite.
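The Gaussian likelihood and the three stages above can be sketched together in plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the class `GaussianNaiveBayes`, the function `classify_multistage`, and all measurements are hypothetical, and the choice to pass only crabs predicted as normal on to stages 2 and 3 follows the methodology description.

```python
import math
from collections import defaultdict


class GaussianNaiveBayes:
    """Gaussian Naive Bayes: one mean and variance per class and feature."""

    def fit(self, rows, labels):
        grouped = defaultdict(list)
        for x, y in zip(rows, labels):
            grouped[y].append(x)
        self.priors, self.stats = {}, {}
        for y, members in grouped.items():
            n = len(members)
            columns = list(zip(*members))
            means = [sum(col) / n for col in columns]
            # Floor the variance so a constant feature cannot divide by zero.
            variances = [max(sum((v - m) ** 2 for v in col) / n, 1e-9)
                         for col, m in zip(columns, means)]
            self.priors[y] = n / len(rows)
            self.stats[y] = (means, variances)
        return self

    def log_posterior(self, x, y):
        """Unnormalized log posterior: log P(y) + sum_i log N(x_i)."""
        means, variances = self.stats[y]
        score = math.log(self.priors[y])
        for value, mean, var in zip(x, means, variances):
            score -= 0.5 * math.log(2 * math.pi * var)
            score -= (value - mean) ** 2 / (2 * var)
        return score

    def predict(self, x):
        return max(self.stats, key=lambda y: self.log_posterior(x, y))


# Hypothetical toy samples: ([CW, AW, ED], normality, sex, maturity).
SAMPLES = [
    ([11.0, 3.0, 0.0], "normal", "male", "mature"),
    ([10.5, 3.2, 0.1], "normal", "male", "mature"),
    ([7.0, 2.0, 0.0], "normal", "male", "immature"),
    ([6.5, 2.2, 0.1], "normal", "male", "immature"),
    ([11.2, 6.0, 0.0], "normal", "female", "mature"),
    ([10.8, 6.3, 0.1], "normal", "female", "mature"),
    ([7.2, 5.0, 0.1], "normal", "female", "immature"),
    ([6.8, 5.2, 0.0], "normal", "female", "immature"),
    ([10.0, 5.5, 1.5], "abnormal", "male", "mature"),
    ([9.5, 5.8, 1.8], "abnormal", "female", "mature"),
    ([9.0, 5.0, 1.2], "abnormal", "male", "mature"),
    ([10.2, 5.6, 1.6], "abnormal", "female", "mature"),
]

normal_only = [s for s in SAMPLES if s[1] == "normal"]
normal_features = [s[0] for s in normal_only]

# One model per stage; stages 2 and 3 are trained on normal crabs only.
stage1 = GaussianNaiveBayes().fit([s[0] for s in SAMPLES],
                                  [s[1] for s in SAMPLES])
stage2 = GaussianNaiveBayes().fit(normal_features, [s[2] for s in normal_only])
stage3 = GaussianNaiveBayes().fit(normal_features, [s[3] for s in normal_only])


def classify_multistage(x):
    """Stage 1 normality, then sex and maturity for normal crabs only."""
    result = {"normality": stage1.predict(x)}
    if result["normality"] == "normal":
        result["sex"] = stage2.predict(x)
        result["maturity"] = stage3.predict(x)
    return result
```

Calling `classify_multistage([10.9, 3.1, 0.05])` returns one label per stage, while a sample flagged as abnormal in stage 1 stops there. Any of the other four classifiers could replace `GaussianNaiveBayes`, provided it exposes the same `fit` and `predict` methods.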

EXPERIMENTAL RESULTS
To apply the five machine learning classifiers, the data set is split into training and test data. The training data set contains the known variables, and each model learns from it in order to generalize to unseen data. A confusion matrix was recorded for each stage of classification. The accuracy of each classifier is computed using (1). Based on the confusion matrices provided in Tables 1, 2, and 3, GNB and LDA achieved an accuracy of 100% at every stage, the highest rate of classification. The LR classifier has an accuracy ranging from 98.5% to 100%. The SVM classifier produced accuracies between 81.8% and 99.8%. Finally, the poorest classification accuracy in this study was obtained by kNN, with accuracies between 77.8% and 97.7%.
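The split into training and test data can be sketched as a simple shuffled hold-out. The split ratio is not stated here, so the 20% test fraction, the fixed seed, and the function name below are assumptions made only for illustration.

```python
import random


def train_test_split(rows, labels, test_fraction=0.2, seed=0):
    """Shuffle the samples and hold out a fraction of them for testing."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    indices = list(range(len(rows)))
    # A seeded generator makes the split reproducible.
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(rows) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(idx, seq):
        return [seq[i] for i in idx]

    return (pick(train_idx, rows), pick(test_idx, rows),
            pick(train_idx, labels), pick(test_idx, labels))


# Hypothetical data: 10 one-feature samples with alternating labels.
rows = [[float(i)] for i in range(10)]
labels = ["normal" if i % 2 == 0 else "abnormal" for i in range(10)]
x_train, x_test, y_train, y_test = train_test_split(rows, labels)
```

Each classifier would then be fitted on `x_train` and `y_train`, and its confusion matrix would be built from its predictions on `x_test`.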

CONCLUSION
This paper investigated the selection of an effective multistage classification model for the identification of ectoparasite-infected mud crab, S. olivacea. Five machine learning classifiers were evaluated on how correctly they classify the infestation of rhizocephalan parasite in the mud crab S. olivacea based on morphological characters. The multistage classifier used in this paper performed well in identifying ectoparasite-infected mud crabs compared with previous individual machine learning approaches, i.e., SVM and kNN. The experimental results showed that LDA and GNB produced the highest rate of classification, correctly assigning 100% of the samples. LR has an accuracy of 98.5-100%, while the SVM classifier produces an accuracy of 81.8-99.8%. The poorest classification accuracy is that of kNN, at 77.2-97.7%. In future work, the kNN classifier should be excluded from the analyses. We will also continue to develop other kinds of machine learning methods for the identification of ectoparasite-infected mud crabs and test them under other experimental setups.