A NEW SDM CLASSIFIER USING JACCARD MINING PROCEDURE CASE STUDY: RHEUMATIC FEVER DATA

In this paper, a new Statistical Data Mining (SDM) technique is proposed using the Jaccard Mining Procedure (JMP), contributing a novel classifier and predictor. The procedure applies a sequence of stages to the training data based on the Jaccard (J) distance matrix, linked with the Gini Index as a precision measure, to initiate a new classifier and a new predictor. The proposed SDM technique using JMP is applied to and examined on Rheumatic Fever data to demonstrate its applicability.


1. INTRODUCTION
Classification [3,4,5,7] is a data mining function that assigns items in a collection to target categories or classes. The goal of classification is to accurately predict the target class for each case in the data. For example, a classification model could be used to identify loan applicants as low, medium, or high credit risks.
A classification task begins with a data set in which the class assignments are known. For example, a classification model that predicts credit risk could be developed based on observed data for many loan applicants over a period of time. In addition to the historical credit rating, the data might track employment history, home ownership or rental, years of residence, number and type of investments, and so on. Credit rating would be the target, the other attributes would be the predictors, and the data for each customer would constitute a case.
The simplest type of classification problem is binary classification. In binary classification, the target attribute has only two possible values: for example, high credit rating or low credit rating. Multiclass targets have more than two values: for example, low, medium, high, or unknown credit rating.
Different classification algorithms [7] use different techniques for finding relationships. These relationships are summarized in a model, which can then be applied to a different data set in which the class assignments are unknown. Classification has many applications in customer segmentation, business modeling, marketing, credit analysis, and biomedical and drug response modeling.
An algorithm and a flowchart are not the same thing [6]: an algorithm is a detailed sequence of simple steps needed to solve a problem, whereas a flowchart is a graphical representation of an algorithm. We therefore introduce both the flowchart and the algorithm of our paper.
The paper is organized as follows. Section 2 presents the materials and methods, subdivided into Rheumatic Fever Data Characteristics, Jaccard distance (J), and Gini Index. Section 3 presents the JMP algorithm, the JMP flowchart, and the JMP systematic structure stages, which use the Jaccard distance, Jaccard classes, and Jaccard classifiers and link these stages with the Gini Index as a precision measure to initiate a new classifier and conclude a new predictor. Finally, the conclusion and future work are given in Section 4.

Rheumatic Fever Data: Characteristics
Rheumatic fever is a very common disease [2], and its symptoms differ from one patient to another even when the diagnosis is the same. We therefore obtained the following example of seven rheumatic fever patients from Tanta University Hospital, Egypt. All patients are 9-12 years old, with a history of arthritis that began at age 3-5 years. The disease has many symptoms; it usually starts at a young age and stays with the patient throughout life. Table (1) describes the seven patients by 8 symptoms (attributes) [1], which are used to decide the diagnosis for each patient (the decision attribute). Table (2) gives the coded training data, where {S, F, A, R, K, E, P, H} are the conditional attributes, {P1, P2, P3, …, P7} are the Rheumatic Fever training data objects, and the diagnosis attribute (D) is the decision attribute.

Gini Index
The Gini index [3] measures the impurity of the target attribute, so we use it as a precision measure in this paper. The Gini index of a data partition or set of training tuples A is defined as
$Gini(A) = 1 - \sum_{j} p_j^2$
where $p_j$ is the relative frequency of class j in A.
The Gini index considers a binary split for each attribute. Let's first consider the case where B is a discrete-valued attribute having v distinct values, {b1, b2, …, bv}, occurring in A. When considering a binary split, we compute a weighted sum of the impurity of each resulting partition.
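For example, if a binary split on B partitions A into A1 and A2, the Gini index of A given that partitioning is
$Gini_B(A) = \frac{|A_1|}{|A|}\,Gini(A_1) + \frac{|A_2|}{|A|}\,Gini(A_2)$
In general, for a split of A into k partitions A1, …, Ak, the Gini of the split is
$Gini_{split}(A) = \sum_{i=1}^{k} \frac{|A_i|}{|A|}\,Gini(A_i)$
The most important characteristics of the Gini Index are:
1) It varies between 0 and 1.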

6) Has difficulty when the number of classes is large.
7) Tends to favor tests that result in equal-sized partitions and purity in both partitions.
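
The Gini index and the weighted Gini of a split described above can be sketched in a few lines of Python. This is a minimal illustration that assumes the standard impurity form (one minus the sum of squared class frequencies); the function names `gini` and `gini_split` and the toy labels are hypothetical and are not taken from the paper's tables.

```python
from collections import Counter


def gini(labels):
    """Gini index of a set of class labels: 1 - sum_j p_j^2."""
    n = len(labels)
    if n == 0:
        return 0.0  # convention assumed here for an empty partition
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_split(partitions):
    """Weighted Gini of a split: sum_i |A_i| / |A| * Gini(A_i)."""
    total = sum(len(p) for p in partitions)
    if total == 0:
        return 0.0
    return sum(len(p) / total * gini(p) for p in partitions)


# Toy labels for illustration only (not the Rheumatic Fever data).
labels = ["yes", "yes", "no", "no"]
print(gini(labels))                                # 0.5
print(gini_split([["yes", "yes"], ["no", "no"]]))  # 0.0
```

A perfectly mixed two-class set gives 0.5, and a split that separates the classes completely gives a weighted Gini of 0.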

JMP Algorithm
This section introduces the steps of the JMP algorithm.

Input
Where,
• U is the universe of all attributes, A are the conditional attributes, and D is the decision attribute
• J is the Jaccard matrix

JMP Flow Chart
This section presents the flowchart of the JMP systematic structure stages, shown in Figure (1).

Jaccard Matrix
The first stage of our JMP is calculating the Jaccard matrix of the conditional attributes of the Rheumatic Fever Data in Table (2). The results are given in Table (3).
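
A minimal sketch of this stage is given here, assuming each conditional attribute is coded as a binary (0/1) column and that the matrix holds pairwise Jaccard distances. The function names and the toy columns are hypothetical; they are not the coded values of Table (2), and the handling of two all-zero columns is a convention chosen for this sketch.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity of two binary columns: |both 1| / |at least one 1|."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    # Assumed convention: two all-zero columns are treated as identical.
    return 1.0 if either == 0 else both / either


def jaccard_distance(a, b):
    """Jaccard distance: 1 - Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)


def jaccard_matrix(columns):
    """Pairwise Jaccard distances between named attribute columns."""
    names = list(columns)
    return {p: {q: jaccard_distance(columns[p], columns[q]) for q in names}
            for p in names}


# Toy binary columns for illustration only (not the data of Table (2)).
toy = {
    "S": [1, 1, 0, 1],
    "F": [1, 0, 0, 1],
    "A": [0, 0, 1, 0],
}
J = jaccard_matrix(toy)
print(J["S"]["A"])  # 1.0, the two columns never share a 1
```

The resulting matrix is symmetric with zeros on the diagonal, so only the entries above the diagonal need to be inspected in later stages.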

Jaccard Classes
The next stage of our classifier divides the values of the Jaccard matrix of Table (3) into three classes (J = 0, 0 < J ≤ 0.5, 0.5 < J ≤ 1) to obtain the classes of the conditional attributes. The output of this stage, the Jaccard classes of the conditional attributes, is given in Table (4).
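
The three-way division of Jaccard values can be sketched as follows. This is an illustration under the assumption that attribute pairs are grouped by the class of their Jaccard value; the function names and the toy matrix are hypothetical and do not reproduce Table (3) or Table (4).

```python
CLASS_LABELS = ("J = 0", "0 < J <= 0.5", "0.5 < J <= 1")


def jaccard_class(j):
    """Map a Jaccard value in [0, 1] to one of the three classes."""
    if not 0.0 <= j <= 1.0:
        raise ValueError("Jaccard value must lie in [0, 1]")
    if j == 0:
        return CLASS_LABELS[0]
    if j <= 0.5:
        return CLASS_LABELS[1]
    return CLASS_LABELS[2]


def classify_pairs(matrix):
    """Group unordered attribute pairs by the class of their Jaccard value."""
    classes = {label: [] for label in CLASS_LABELS}
    names = list(matrix)
    for i, p in enumerate(names):
        for q in names[i + 1:]:
            classes[jaccard_class(matrix[p][q])].append((p, q))
    return classes


# Toy symmetric matrix for illustration only (not the values of Table (3)).
toy_matrix = {
    "S": {"S": 0.0, "F": 0.0, "A": 0.4},
    "F": {"S": 0.0, "F": 0.0, "A": 0.8},
    "A": {"S": 0.4, "F": 0.8, "A": 0.0},
}
pair_classes = classify_pairs(toy_matrix)
print(pair_classes["0.5 < J <= 1"])  # [('F', 'A')]
```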

Best JMP Classifier
The choice of the best Jaccard classifier from Table (5) depends on the Gini Index of the conditional attributes in Table (6). Accordingly, the class 0.5 < J ≤ 1 is the best JMP classifier, as in Table (7).

JMP Predictor
This stage determines which of the best JMP classifier sets becomes the JMP predictor, which can be used with any prediction technique to predict the diagnosis for any test data. This is done by calculating the Gini index averages for each set of the best JMP classifier of Table (7), given in Table (8) and Chart (2).
JMP Predictor = {K, E}   Eq. (6)
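
The selection rule of this stage, choosing the set whose Gini average is closest to the Gini of the diagnosis attribute, can be sketched as below. The sketch assumes that the Gini average of a set is the arithmetic mean of its attributes' Gini indices; the function names, attribute names, and numbers are placeholders and are not the values of Table (8).

```python
def gini_average(attribute_gini, attributes):
    """Mean Gini index over the attributes of one candidate set (assumed form)."""
    values = [attribute_gini[a] for a in attributes]
    return sum(values) / len(values)


def pick_predictor(candidate_sets, attribute_gini, decision_gini):
    """Return the candidate set whose Gini average is closest to decision_gini."""
    return min(
        candidate_sets,
        key=lambda s: abs(gini_average(attribute_gini, s) - decision_gini),
    )


# Placeholder numbers for illustration only; these are NOT the values of Table (8).
attribute_gini = {"x1": 0.40, "x2": 0.48, "x3": 0.10, "x4": 0.20}
candidate_sets = [("x1", "x2"), ("x3", "x4")]
decision_gini = 0.45

best = pick_predictor(candidate_sets, attribute_gini, decision_gini)
print(best)  # ('x1', 'x2')
```

With ties, `min` returns the first candidate in list order, so the ordering of the candidate sets matters only when two averages are equally close.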

4. CONCLUSION
A new Statistical Data Mining (SDM) technique is initiated in this paper using the Jaccard Mining Procedure (JMP), contributing a novel classifier and predictor that depend on the Jaccard (J) distance matrix and the Gini Index measure. JMP was applied to a real-life application, the diagnosis of Rheumatic Fever data, to examine its applicability, and the result was very accurate with respect to the specialist's diagnosis of the data. JMP opens the way for other new SDM techniques using alternative distance measures and other accuracy measures according to the data type.

Table (1): Rheumatic Fever Data Description

It is very important to note [3] that the Jaccard coefficient is a measure of similarity between two variables and the Jaccard distance [3] is a measure of dissimilarity; both measure asymmetric information on binary and non-binary variables. The definitions of the Jaccard similarity coefficient and the Jaccard distance are as follows.
Def. 1: Jaccard similarity between binary variables A and B
$J_{Sim}(A, B) = \frac{P(A \cap B)}{P(A \cup B)}$   Eq. (1)
Def. 2: Jaccard distance between binary variables A and B
$J_{Dist.}(A, B) = 1 - J_{Sim}(A, B)$   Eq. (2)

Table (7): Best JMP Classifier

From Chart (2), the Gini average of {K, E} is the closest value to the diagnosis Gini, which indicates that {K, E} will be the JMP predictor attributes.

International Journal on Bioinformatics & Biosciences (IJBB) Vol. 4, No. 1, March 2014