K Means Clustering and Meanshift Analysis for Grouping the Data of Coal Term in Puslitbang tekMIRA

Indonesian government agencies under the Ministry of Energy and Mineral Resources have difficulty classifying the data dictionary of coal terms. This research groups the coal dictionary using the K-Means and Meanshift algorithms. The K-Means algorithm is used to obtain cluster values on character and word criteria. The data from the last iteration of the K-Means Euclidean distance calculation are then combined with the Meanshift algorithm. Meanshift calculates centroids by selecting different bandwidths. The grouping results of the K-Means and Meanshift algorithms show different centroids, which are used to find the optimum bandwidth value. The data dictionary in this research is sorted alphabetically.


Introduction
Data mining is the process of extracting or filtering data from a large collection of data. The data pass through a series of operations to obtain valuable information from them. According to Daryl Pregibon, data mining is a blend of statistics, artificial intelligence, and database research [1]. Data mining techniques explore data using classification, prediction, grouping, outlier detection, association rules, sequence analysis, time series analysis, and text mining, as well as newer techniques such as social network analysis and sentiment analysis [2]. The Research and Development Center for Mineral and Coal Technology (Puslitbang tekMIRA) is an Indonesian government institution under the Ministry of Energy and Mineral Resources. Puslitbang tekMIRA has a dictionary of coal terms containing one thousand entries. Finding a term in the coal dictionary takes three to five minutes.
To address this problem, a data mining method is used to classify the data dictionary. This research groups the coal terms of the data dictionary using the K-Means and Meanshift algorithms. K-Means has been used to categorise students by skills such as cognitive, communication, and relational skills [3], to evaluate student achievement levels for course content [4], and to group data based on user information created on SNS in order to make future recommendations to users [5]. It has also been used to group data based on user sentences by exploiting regularities among the data pursued by the user [6]. A genetic algorithm combined with K-Means has been used to calculate cluster centroids with heterogeneous populations, which leads to better results than using random numbers [7]. The Meanshift algorithm is used to detect the location of a tracked target accurately [8], which in turn simplifies calculating the accuracy of the tracking results [9]. It is also used in face detection and tracking systems [10]; if an imbalance occurs, it affects performance [11].
In this research, the algorithms group the coal terms of the dictionary into clusters based on character and word criteria. The clusters present the coal data dictionary according to these predetermined criteria. The clustering results of the K-Means and Meanshift algorithms are displayed as matplotlib plots. Matplotlib is a Python plotting package that produces production-quality graphs. It is designed to create both simple and complex plots with only a few commands [12].
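To make the two clustering criteria concrete, the following is a minimal sketch of how a term could be turned into a (character, word) pair. It assumes that the criteria are simple counts of letters and words, which the text does not state explicitly. The function name term_features and the sample terms are hypothetical and are not taken from the tekMIRA dictionary.

```python
def term_features(term):
    """Return (character_count, word_count) for one dictionary term.

    Assumption: the 'character' and 'word' criteria are plain counts;
    the paper does not spell out their exact definition.
    """
    words = term.split()
    # Count the characters of each word, so spaces are not included.
    char_count = sum(len(w) for w in words)
    return (char_count, len(words))


# Illustrative terms only; they are not taken from the tekMIRA dictionary.
sample_terms = ["anthracite", "ash content", "air dried basis"]
features = [term_features(t) for t in sample_terms]
```

Each resulting pair can then serve as one two-dimensional data point for the clustering steps.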

Research Method

K-Means
K-Means clustering is a method for grouping data, where K is a constant giving the number of clusters and Means refers to the average value of the data assigned to each cluster [7]. K-Means is thus an unsupervised data mining method: it models the data without supervision and classifies them using a partitioning scheme. The method divides large amounts of data into groups so that the data in a group share the same characteristics. Each group also has its own features [13]. Similarity is a metric for how alike the instances in a cluster are. K-Means is a simple algorithm for generating k clusters from a data set [14]. Algorithm 1 explains the k-means implementation.
Algorithm 1 (k-means clustering): randomly select k cluster centers c1, ..., ck; then repeat until the cluster centers no longer change: assign each data entity to the closest cluster center ci, and replace each cluster center with the average of cluster i [15].
In the K-Means formula (Equation 1), d is the distance, j the amount of data, c the centroid, and x the data. The Euclidean distance formula is given in Equation 2.
$$d_{ij} = \sqrt{\sum_{k} \left(x_{ki} - x_{kj}\right)^2} \qquad (2)$$
Here $d_{ij}$ is the distance between data point i and cluster center j, $x_{ki}$ is the value of data point i on attribute k, and $x_{kj}$ is the value of cluster center j on attribute k.
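As a worked instance of the Euclidean distance between a data point and a cluster center, the following minimal sketch sums the squared differences over all attributes and takes the square root. The function name and the sample values are hypothetical and serve only as an illustration.

```python
import math


def euclidean_distance(x_i, x_j):
    """Distance between data point x_i and cluster center x_j.

    Both arguments hold one value per attribute k.
    """
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))


point = (10, 2)   # hypothetical (character, word) values of one term
center = (7, 6)   # hypothetical cluster center
distance = euclidean_distance(point, center)  # sqrt(3**2 + 4**2)
```

With these values the attribute differences are 3 and 4, so the distance is 5.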
Next, recalculate the cluster centers using the current cluster membership. A cluster center is the average of all data objects in that cluster. If desired, the median of the cluster can be used instead, so the mean is not the only measure that can be used. Then reassign each object using the new cluster centers. If the cluster centers no longer change, the clustering process is complete. Otherwise, return to step 3 and repeat until the cluster centers no longer change [16].
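The assign-and-recalculate loop described above can be sketched as follows. This is a minimal illustration in plain Python, not the implementation used in this research. The function name kmeans, the max_iter safeguard, the handling of empty clusters, and the sample points are all assumptions.

```python
import math


def kmeans(points, centers, max_iter=100):
    """Plain K-Means: assign points to the nearest center, then update."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the closest center for every point.
        labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(old)  # assumption: keep an empty cluster's center
        if new_centers == centers:  # centers are stable: clustering is complete
            break
        centers = new_centers
    return centers, labels


# Hypothetical (character, word) points and starting centers.
points = [(1, 1), (2, 1), (9, 3), (10, 3)]
centers, labels = kmeans(points, centers=[(1, 1), (10, 3)])
```

In this example the loop stops on the second pass, because recalculating the means no longer moves either center.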

Meanshift
The Meanshift (mean shift) method locates the maxima (modes) of a density function from discrete data sampled from that function. Mean shift is used as a clustering method in which each mode represents one group [17]-[18]. The method classifies data by mode seeking: it moves each data point towards a dense region through iterations over the point's neighbourhood, which is built with a Gaussian kernel [19]. Starting from a data point, each iteration improves the estimate of the mode [20]. The Gaussian kernel is a differentiable multivariate kernel function, which is assumed in the actual calculations [21]. Bandwidth is a free parameter that influences the resulting density estimate.
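The mode-seeking iteration with a Gaussian kernel can be sketched as below. This is a minimal illustration under stated assumptions, not the implementation used in this research. The function name mean_shift_mode, the tolerance, the iteration cap, and the sample points are hypothetical. The search is assumed to start from one of the data points, so the kernel weights never sum to zero.

```python
import math


def mean_shift_mode(start, points, bandwidth, tol=1e-6, max_iter=500):
    """Shift `start` towards a density mode using a Gaussian kernel.

    Assumption: `start` is one of `points`, so the weights never sum to zero.
    """
    x = tuple(float(v) for v in start)
    for _ in range(max_iter):
        # Gaussian weight of every data point relative to the current estimate.
        weights = [
            math.exp(-math.dist(x, p) ** 2 / (2 * bandwidth ** 2))
            for p in points
        ]
        total = sum(weights)
        # The new estimate is the kernel-weighted mean of the data.
        new_x = tuple(
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(len(x))
        )
        if math.dist(new_x, x) < tol:  # the shift has become negligible
            return new_x
        x = new_x
    return x


# Hypothetical two-dimensional points forming two separate groups.
points = [(1.0, 1.0), (2.0, 1.0), (9.0, 3.0), (10.0, 3.0)]
narrow = mean_shift_mode(points[0], points, bandwidth=1.0)
wide = mean_shift_mode(points[0], points, bandwidth=20.0)
```

With the narrow bandwidth the estimate settles near the center of the first group, while the wide bandwidth smooths both groups into a single mode near the overall mean. This illustrates why the choice of bandwidth changes the resulting centroids.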

Results and Analysis
The coal dictionary data are grouped using the K-Means and Meanshift algorithms, and the grouping results are displayed using matplotlib plots. The sample data are the terms beginning with the vowels (A, I, U, E, O) in the coal dictionary. The data are first calculated using the K-Means algorithm. The results reported are those of the last iteration, i.e. the data that have become stable and no longer change.

Result K-Means

The letter A
The data come from the last iteration of the K-Means Euclidean distance calculation for the letter A. The data are clustered on the character and word criteria, using the highest and lowest values of the criteria as centroid values. Table 1 gives the cluster values once the clusters are fixed and no longer change.
Table 1. Data cluster of the letter A

The letter I
The data come from the last iteration of the K-Means Euclidean distance calculation for the letter I. The data are clustered on the character and word criteria, using the highest and lowest values of the criteria as centroid values. Table 2 gives the cluster values once the clusters are fixed and no longer change.
Table 2. Data cluster of the letter I

The letter U
The data come from the last iteration of the K-Means Euclidean distance calculation for the letter U. The data are clustered on the character and word criteria, using the highest and lowest values of the criteria as centroid values. Table 3 gives the cluster values once the clusters are fixed and no longer change.

The letter O
The data come from the last iteration of the K-Means Euclidean distance calculation for the letter O. The data are clustered on the character and word criteria, using the highest and lowest values of the criteria as centroid values. Table 5 gives the cluster values once the clusters are fixed and no longer change.

Meanshift

The letter A
Figure 1 shows the plot for the letter A data, i.e. the coal terms beginning with the letter A. For these terms, the bandwidth on the Meanshift plot is obtained at the value 6.76.

The letter I
Figure 2 shows the plot for the letter I data, i.e. the coal terms beginning with the letter I. For these terms, the bandwidth on the Meanshift plot is obtained at the value 18.95.

The letter U
Figure 3 shows the plot for the letter U data, i.e. the coal terms beginning with the letter U. For these terms, the bandwidth on the Meanshift plot is obtained at the value 5.75.

Conclusion
Grouping the coal dictionary using the K-Means algorithm produces cluster values on the character and word criteria. The last iteration of the Euclidean distance calculation yields different cluster values for each letter. The centroid values from the K-Means calculation are then combined with the Meanshift algorithm. The centroid calculations of the Meanshift algorithm result in different bandwidths: the letter A produces a bandwidth of 6.76, the letter I 18.95, the letter U 5.75, and the letter E 3.36. The grouping of the coal dictionary in this study facilitates grouping the terms under each letter. Future research could classify the coal data dictionary by the second, third, and subsequent letters, and combine other algorithms for grouping, such as the K-Nearest Neighbour (KNN) algorithm.

K Means Clustering and Meanshift Analysis for Grouping… (Rolly Maulana Awangga)

Table 3. Data cluster of the letter U
The data are clustered on the character and word criteria, using the highest and lowest values of the criteria as centroid values. Table 4 gives the cluster values once the clusters are fixed and no longer change.

Table 4. Data cluster of the letter E

Table 5. Data cluster of the letter O
Figure 1. Plot Mean Shift letter A