A Survey on Cleaning Dirty Data Using Machine Learning Paradigm for Big Data Analytics

ABSTRACT


INTRODUCTION
In 2016, IBM estimated that around 2.5 quintillion bytes of data were being produced each day, and that 90% of all existing data had been created in the preceding two years [1]. Much of this big data is generated by devices such as sensors and by newly emerging technologies, and the rate of data growth is likely to accelerate further. Cisco, meanwhile, forecast that by 2020 the volume of worldwide traffic crossing the Internet and IP WAN networks could reach 2.3 ZB per year [2].
The sheer volume and heterogeneity of big data call for investigation through big data analytics. Big data analytics helps to uncover hidden patterns, unknown relationships, current market trends, consumer preferences and other aspects of the data that can help institutions and companies make up-to-date, faster and better business decisions.
By now, most well-known companies have recognized the need to build big data analytics into their systems to deliver better products and services. With big data capabilities, a company can improve its products and services and raise productivity by gaining meaningful insights that move its work forward. Various tools are available on the market to handle big data, but these tools come with a few issues [3]. They are usually not integrated with data quality management [4]. Beyond using big data capabilities, an organization also needs to collect values that are free of mistakes, gaps and errors, yet this requirement is very often neglected. Data with such defects is usually known as dirty data, and cleaning it can be challenging for companies that want better results. Cleaning data manually requires experience, and humans tend to make mistakes. Machine learning is currently being adopted in different areas to process tasks automatically, such as [5,6]. Since machine learning can help automate such tasks, it should be possible to clean dirty data by training classification models.

BIG DATA ANALYTICS
The general procedure for obtaining insights from Big Data can be broken down into five main stages [7]-[9], as shown in Figure 1.
Data Acquisition: Timeliness is one of the important requirements when loading data [10]. The fundamental characteristics of Big Data, together with its exponential rate of growth, raise exceptional engineering issues in areas such as data acquisition and storage [7].
Data Mining and Cleansing: The most essential stage of processing big data is to apply a method that extracts the necessary data from the loaded unstructured Big Data and arranges it coherently in a standard, organized form that is easy to interpret. The data cleaning process helps to clean dirty data.
Data Aggregation and Integration: The cleaned data must then be aggregated, that is, gathered and expressed in summary form [11], [12]. This is followed by data integration, which organizes data from disparate sources through a combination of technical and business methods to obtain meaningful and valuable results [12].
Data Analysis and Modelling: From the viewpoint of Big Data, the goal is to generate business value through the analysis of data, and the approach may vary according to the technique and the form of the data. Meaningful reports are constructed and examined to help the business make better and faster decisions.
Data Interpretation: Data is presented in a form that users can understand, i.e. the results of analysis and modelling are presented so that decisions can be made by interpreting the outcomes and extracting knowledge. Data interpretation queries are grouped together and refer to the same table, diagram, graph or other form of data presentation.
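As a rough illustration of how the five stages above chain together, the following Python sketch runs a toy data set through one function per stage. The sensor readings, the function names and the decision to treat a repeated reading as a duplicate are all assumptions made for this example; they are not taken from the surveyed systems.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw sensor readings: (sensor id, reading as text).
RAW = [("s1", "20.5"), ("s1", "21.0"), ("s2", "n/a"), ("s2", "19.0"), ("s1", "21.0")]


def acquire():
    """Stage 1: load raw records (here simply from an in-memory list)."""
    return list(RAW)


def clean(rows):
    """Stage 2: drop unparseable readings and exact repeats, keeping order.

    Treating a repeated (sensor, value) pair as a duplicate is an assumption
    of this sketch; a real sensor may legitimately report the same value twice.
    """
    seen, cleaned = set(), []
    for sensor, text in rows:
        try:
            value = float(text)
        except ValueError:
            continue  # dirty value such as "n/a"
        if (sensor, value) not in seen:
            seen.add((sensor, value))
            cleaned.append((sensor, value))
    return cleaned


def aggregate(rows):
    """Stage 3: gather readings per sensor into summary form."""
    groups = defaultdict(list)
    for sensor, value in rows:
        groups[sensor].append(value)
    return dict(groups)


def analyse(groups):
    """Stage 4: compute a very simple model of each group (its mean)."""
    return {sensor: mean(values) for sensor, values in groups.items()}


def interpret(summary):
    """Stage 5: present the results in a form a user can read."""
    return [f"{sensor}: mean reading {value:.2f}" for sensor, value in sorted(summary.items())]


lines = interpret(analyse(aggregate(clean(acquire()))))
print("\n".join(lines))
```

Keeping each stage as a separate function makes it possible to swap the cleaning step for a learned model without touching the other stages.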

DATA QUALITY PROBLEMS
The data cleaning process becomes more complex when data comes from heterogeneous sources. Here, data quality problems have to be solved by data cleaning and data transformation. Despite the various viewpoints on the effect of data quality, in the end all such problems are likely to result in economic costs for organizations. Surveys of real cases give an idea of the economic cost of dirty data: a 2014 survey found that bad data costs organizations around $13.3 million annually and costs the US economy $3 trillion per year. The U.S. Postal Service also recognizes the cost of bad data: in 2013, an estimated 6.8 billion pieces of mail could not be delivered to the stated address, which added up to $1.5 billion in handling costs [13]. By some estimates, the problem of dirty data in organizations and companies has already reached epidemic proportions. The problem is just as prevalent, and potentially even more alarming, in health care and other sectors [14]. In the telecommunication industry, for instance, dirty data has numerous costs. First and foremost, Experian estimates an average business loss of 12% due to wrong records, which cause reduced productivity, wasted resources and, significantly, missed opportunities for cross-channel marketing. The Experian study also highlights that approximately one-third of respondents think they waste 10% or more of their marketing budget because of results obtained from inaccurate data. Experian further reports that 25% of survey participants do not currently measure the accuracy of data in their organization; this figure rises to 33% in telecoms and utilities companies and to 36% in organizations such as governments [15].
These measurements concern the inside of organizations, but in external matters such as marketing, marketers struggle with dirty data as well. According to BizReport.com, "…marketers are generating a large portion of poor-quality leads, including those with improper formatting and even inaccuracies. Bad prospect information can have negative consequences, including wasted media investment, squandered resources, and poor customer experience, which marketers simply can't afford." [16] In the medical field, errors can kill patients or cause long-lasting harm to their health. In 1999, the Institute of Medicine estimated [17], for instance, that at least 44,000 and as many as 98,000 people lose their lives each year because of medical errors in hospitals alone, at a cost of $17 to $29 billion annually in healthcare costs. Beyond health issues, dirty data can also create privacy issues for patients.

DATA QUALITY CRITERIA
Data quality is generally described as the capability of data to satisfy stated and implied needs when used under specified conditions [18]. Accuracy, completeness and consistency are the most commonly used dimensions of data quality [19], [20], alongside other dimensions such as accessibility, consistent representation, timeliness, understandability and relevancy [19]. Moreover, data quality is a combination of data content and data form: the content must contain accurate information, and the form must be collected and presented in a way that makes the data usable. Content and form are both significant considerations in reducing data mistakes, as they make clear that repairing dirty data requires more than simply supplying correct values.
Likewise, when developing a scheme to improve data quality, it is essential to identify the primary causes of dirty data. The causes are categorized into systematic and random errors. The basic sources of systematic errors include programming mistakes, wrong definitions of data types, incorrectly or badly defined rules, violations of data collection rules, and poor training. The sources of random errors include keying errors, illegible handwriting, data transcription problems, hardware failure or corruption, and errors or intentionally misleading statements on the part of users supplying key data. Human involvement in data entry usually results in errors, which can be typos, missing types, literal values, heterogeneous ontologies (i.e. data of a different nature), outdated values or violations of integrity constraints. See Figure 2 for an example, where a few data quality problems can be identified in the Wireless Service Facility Permits (City of San Francisco) database. The most common dimensions of dirty data, including data duplication, are therefore the following.
Inaccurate data refers to any field that contains wrong values. A correct data value is an accurate, meaningful, consistent and unambiguous representation.

Incomplete data is produced by data sets that are simply missing values. Such data is considered censored when the number of values in a set is known but the values themselves are unknown, and truncated when values in the set have been excluded.
Inconsistent data arises from data redundancy, i.e. the same data value is stored in different files, possibly in different formats.
Duplicate data consists of entries for the same data that a system user has added multiple times.
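As a minimal sketch of how the four dimensions above could be checked on a small table, the Python snippet below flags incomplete, inconsistent, inaccurate and duplicate records. The records, the field names, the canonical phone format and the five-digit ZIP rule are hypothetical. Inaccuracy is only approximated by a validity rule, since a real accuracy check needs trusted reference data.

```python
import re
from collections import Counter

# Hypothetical records; field names and values are illustrative only.
records = [
    {"id": "1", "name": "Alice", "phone": "415-555-0101", "zip": "94103"},
    {"id": "2", "name": "Bob", "phone": "(415) 555 0102", "zip": "94110"},
    {"id": "3", "name": "", "phone": "415-555-0103", "zip": "9411"},
    {"id": "4", "name": "Alice", "phone": "415-555-0101", "zip": "94103"},
]

PHONE_FORMAT = re.compile(r"^\d{3}-\d{3}-\d{4}$")  # assumed canonical format
ZIP_RULE = re.compile(r"^\d{5}$")                   # assumed validity rule


def profile(rows, key_fields):
    """Return the ids of rows that are incomplete, inconsistent, inaccurate or duplicated."""
    report = {"incomplete": [], "inconsistent": [], "inaccurate": [], "duplicate": []}
    seen = Counter()
    for row in rows:
        # Incomplete: at least one field is empty.
        if any(value.strip() == "" for value in row.values()):
            report["incomplete"].append(row["id"])
        # Inconsistent: the phone number is present but not in the canonical format.
        if row["phone"] and not PHONE_FORMAT.match(row["phone"]):
            report["inconsistent"].append(row["id"])
        # Inaccurate (approximation): the ZIP code violates the validity rule.
        if row["zip"] and not ZIP_RULE.match(row["zip"]):
            report["inaccurate"].append(row["id"])
        # Duplicate: the same key fields were already seen in an earlier row.
        key = tuple(row[field] for field in key_fields)
        seen[key] += 1
        if seen[key] > 1:
            report["duplicate"].append(row["id"])
    return report


report = profile(records, key_fields=("name", "phone", "zip"))
print(report)
```

Rule-based checks of this kind only catch defects that someone has anticipated, which is one reason the learning-based approaches discussed later are of interest.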

CLEANING TOOLS
Different vendors provide data cleansing solutions, including Talend, IBM, SAS, Oracle and Lavastorm Analytics. There are also free tools that work on data transformation [21], [22], such as OpenRefine, plyr and reshape2, although it is uncertain whether they can handle Big Data. Another well-known category is ETL tools, which provide complex data conversion techniques by merging and repairing data [23]. A summary of some commercially available tools for managing data quality is presented in Table 1. The "Vendor" field names the company offering the tool, and "Product" names the tool offered by the vendor for managing data quality. The "Website" column gives the website link of the company. The "like(s)" and "dislike(s)" are taken from customer comments on different websites, like [24], [25].

BIG DATA ANALYTICS DATA CLEANING CHALLENGES
Generally, the data gathered will not be in a form that is ready for analysis. Consider, for instance, data obtained from a telecommunication storage system, consisting of feedback collected from different agents and structured data from routers. Analyzing such unstructured data is challenging. An extraction procedure that retrieves the necessary data from the various sources and presents it in a structured form suitable for analysis is required. Data cleaning is an essential part of data analysis, and a challenging one [26]. Researchers from the database research community have outlined a number of challenges in obtaining useful data from big data [27], [28]. Data cleaning is challenging in every data analysis, but with the variety and volume of big data it becomes even more pronounced. Data quality must be assured for data visualization to be accurate and correct. To deal with this issue, organizations need to overcome some common challenges:

Scalability
Cleaning techniques must scale to the rapidly growing volumes of big data, which is quite challenging. Existing approaches include blocking for duplicate detection [29], [30], identification and linkage for data cleaning [30], sampling-based cleaning [31], and distributed data cleaning [32].
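One way to make duplicate detection scale is to group records by a cheap key and compare only records that fall into the same group, a technique commonly called blocking. The sketch below assumes a hypothetical list of customer names, uses the surname initial as the blocking key and an arbitrary similarity threshold of 0.85. It is not the method of any of the cited works.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical customer names; the blocking key and threshold are assumptions.
names = [
    "Jon Smith", "John Smith", "Jane Doe", "Jan Doe",
    "Maria Garcia", "Mario Garcia", "Sam Lee", "Sara Lopez",
]


def blocking_key(name):
    """Cheap key: first letter of the last token (the surname initial)."""
    return name.split()[-1][0].lower()


def candidate_pairs(values):
    """Group values into blocks and only pair values inside the same block."""
    blocks = defaultdict(list)
    for value in values:
        blocks[blocking_key(value)].append(value)
    for block in blocks.values():
        yield from combinations(block, 2)


def likely_duplicates(values, threshold=0.85):
    """Return same-block pairs whose string similarity reaches the threshold."""
    return [
        (a, b)
        for a, b in candidate_pairs(values)
        if SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold
    ]


pairs = list(candidate_pairs(names))
all_pairs = len(names) * (len(names) - 1) // 2
print(f"compared {len(pairs)} of {all_pairs} possible pairs")
print(likely_duplicates(names))
```

The saving comes from the number of comparisons: without blocking, every record is compared with every other record, which grows quadratically with the data size. The price is that true duplicates with different keys are never compared, so the choice of key matters.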

Semi Structured and Unstructured Data
Big data is usually a collection of varied data, which may include semi-structured data, e.g. in XML/JSON, and unstructured data, e.g. in word-processing files, e-mail and text fields in databases. Data quality problems in semi-structured and unstructured data remain largely unexplored [28,33].

User Engagement
Much research has involved humans in the deduplication of data sets, for instance through active learning. However, involving human experts in other data cleaning tasks [30], such as obtaining user feedback to determine data quality rules, is still to be explored.

Raising Privacy and Security Interests
A common task in data cleaning is to observe and examine the complete set of raw data values, which may be restricted in some domains, such as telecommunication, medicine and finance; this is a significant challenge [9]. For example, telecommunication data such as Internet connection login session logs collected over an extensive period of time can reveal an individual's location and behavior, as shown in Figure 3. [...] from 2016, and will reach 20.4 billion by 2020 [34]. This is why data cleansing operations may require huge processing power.

Machine Learning and Other Algorithms
Lastly, big data analytics is known to be still in the early stages of its development as a technical discipline. Hence, many machine learning algorithms are unable to scale to big data sets or to tolerate the noise and gaps produced by the real world [35]-[38]. Research is still under way to improve these algorithms so that they are better suited to real-world conditions, where data cleaning may involve millions or trillions of elements.

Manually
Currently, even with the help of histograms, conversion tables, rules and algorithms, human intervention is still required to identify and repair the data [30], [39].

MACHINE LEARNING PARADIGMS FOR BIG DATA CLEANING
Different types of learning paradigms are currently available in machine learning, but not all of them are applicable to every field. For instance, [40] presented a cleaning approach using data mining and SVM (a machine learning paradigm). Machine learning techniques can be used to train a system to complete a task with minimal human interaction. This may reduce the time and resources required to analyze dirty data and transform it into usable clean data. Machine learning techniques make a system intelligent by giving it the capability to learn. Data can be classified in three ways: with unsupervised, supervised and semi-supervised methods. The selection of algorithms must depend on the size, quality and nature of the data. Some common learning algorithms that can be used to clean data are shown in Figure 4.

Deep Learning
This technique performs data cleaning on learned data representations rather than on hand-crafted data features. Deep learning algorithms transform data into abstract representations that allow features to be learned. Hence, there is no need for a separate feature extraction step, as the features are learned directly from the data. Given the nature of big data, the ability to skip the feature extraction step is a great advantage.

Naïve Bayes Classifier Algorithm
This algorithm classifies instances on the assumption that, when an instance contains several attributes, the attributes used to label it are conditionally independent given the class. It is suitable for moderate or large training data sets.
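To make the conditional independence assumption concrete, the following sketch implements a small Naïve Bayes classifier for categorical attributes and uses it to suggest a value for a missing field. The class name, the training records and the use of add-one (Laplace) smoothing are choices made for this illustration, not details taken from the surveyed work.

```python
import math
from collections import Counter, defaultdict


class CategoricalNaiveBayes:
    """Naive Bayes for categorical attributes with add-one (Laplace) smoothing."""

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        self.total = len(labels)
        # value_counts[(label, attribute index)][value] -> number of occurrences
        self.value_counts = defaultdict(Counter)
        self.values = [set() for _ in rows[0]]
        for row, label in zip(rows, labels):
            for i, value in enumerate(row):
                self.value_counts[(label, i)][value] += 1
                self.values[i].add(value)
        return self

    def log_posterior(self, row, label):
        """log P(label) + sum_i log P(row[i] | label), up to a shared constant."""
        score = math.log(self.class_counts[label] / self.total)
        for i, value in enumerate(row):
            count = self.value_counts[(label, i)][value]
            denominator = self.class_counts[label] + len(self.values[i])
            score += math.log((count + 1) / denominator)
        return score

    def predict(self, row):
        return max(self.class_counts, key=lambda label: self.log_posterior(row, label))


# Hypothetical subscriber records: (device, usage) -> plan, where "plan" is the
# field to be filled in when it is missing.
rows = [
    ("mobile", "high"), ("mobile", "high"), ("mobile", "low"),
    ("router", "high"), ("router", "low"), ("router", "low"),
]
plans = ["premium", "premium", "basic", "premium", "basic", "basic"]

model = CategoricalNaiveBayes().fit(rows, plans)
print(model.predict(("mobile", "high")))
print(model.predict(("router", "low")))
```

Because each attribute contributes an independent term to the score, training only requires counting, which is what makes the method attractive for large data sets.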

K-Means Clustering Machine Learning Algorithm
K-Means produces tighter clusters than hierarchical clustering when the clusters are globular. With a large number of variables, K-Means also runs faster than hierarchical clustering.
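As a rough illustration of the algorithm, the sketch below implements the standard assignment and update loop (Lloyd's algorithm) on a handful of two-dimensional points. The points are hypothetical. The deterministic farthest-first seeding is an assumption made here so that the example is reproducible, whereas random seeding is more common in practice.

```python
import math


def farthest_first(points, k):
    """Deterministic seeding: start at points[0], then repeatedly add the point
    farthest from the centroids chosen so far (an assumption of this sketch)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    return centroids


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm: alternate the assignment step and the centroid update."""
    centroids = farthest_first(points, k)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, assignment


# Hypothetical records projected onto two numeric features.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, assignment = kmeans(points, k=2)
print(centroids)
print(assignment)
```

In a cleaning context, clusters like these could be used to group similar records before duplicates or outliers are inspected; that use is a suggestion of this example rather than a claim from the cited literature.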

Apriori Algorithm
The Apriori algorithm is easy to implement and can be parallelized easily. Its implementation relies on the large itemset property, i.e. every subset of a large (frequent) itemset must itself be large.
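The sketch below shows the level-wise candidate generation and pruning that the algorithm is built on. The transactions are hypothetical address records encoded as attribute=value items, and the minimum support of 3 is an arbitrary choice. Frequent combinations such as a city and its ZIP code could then serve as candidate consistency rules, which is an assumption of this example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for all itemsets with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Count the support of every candidate of size k.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join step: merge frequent k-itemsets into candidates of size k + 1.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k))
        }
        k += 1
    return frequent


# Hypothetical address records encoded as attribute=value items.
transactions = [
    {"city=SF", "zip=94103", "state=CA"},
    {"city=SF", "zip=94103", "state=CA"},
    {"city=SF", "zip=94110", "state=CA"},
    {"city=Oakland", "zip=94607", "state=CA"},
    {"city=SF", "zip=94103", "state=NY"},  # suspicious state value
]

frequent = apriori(transactions, min_support=3)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The prune step is where the itemset property pays off: a three-item candidate is discarded without counting as soon as one of its two-item subsets is known to be infrequent.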

Random Forest Machine Learning Algorithms
Random Forest is robust to noise, which makes it efficient and versatile for classification and regression tasks. Choosing its parameters is easy, since it is not sensitive to the parameters required to run it. Its trees can be grown in parallel, and it is efficient on large databases while achieving high classification accuracy.

CONCLUSION
In recent years, big data processing has probably brought the greatest revolution in computing. Cleaning massive volumes of data lies at the heart of big data analytics in every domain and enables better data investigation.
This paper gives an overview of the potential of data cleaning in big data analytics during the gathering, arranging and processing of information. It is important to understand the data quality criteria for dirty data in order to clean data sets without failure. A comparison of commercial tools is presented, based on comments from different customers. Most of the tools are mainly concerned with organizing data sets and cleaning messy data, and very few of their methods use machine learning. Moreover, they do not give much importance to big data characteristics, which may lead to major challenges when cleaning data. Many data repairing algorithms are available, yet a human expert is still required to judge intelligently whether the cleaning process is correct. Machine learning algorithms will probably replace most jobs in the world; with the fast evolution of big data and the accessibility of programming tools like Python and R, machine learning is becoming increasingly mainstream among data scientists. Machine learning applications are highly automated and self-modifying, and they continue to improve over time with minimal human intervention as they learn from more data.
This survey has prompted us to conduct additional real-world evaluations and to develop a modified framework for big data analytics by changing the structure of the cleaning phase to obtain clearer insights from data. It is expected to produce a new plan for the structure of data quality techniques that can be more efficient in big data analytics.