Efficiency of Flat File Database Approach in Data Storage and Data Extraction for Big Data

ABSTRACT


INTRODUCTION
The terms big data and big data analytics describe data sets and analytical techniques in applications so large and complex that they require advanced and unique technologies for data storage, management, analysis and visualization [1]. Big data also refers to the tools, applications, processes and procedures that allow organizations to create, manipulate and manage very large data sets and storage facilities. Such tools are required to handle big data issues such as analysis, capture, data curation, sharing, storage, transfer, visualization, querying, updating and information privacy. Most organizations and industries, such as healthcare and academic publishing, are looking for the best approach to handling big data. In academic publishing, for instance, a number of reports have pointed to the growing use of big data across economic sectors and its potential to bolster productivity, efficiency and growth [2]. This paper addresses the inefficiency of accessing publication data. The efficiency of accessing publication data is related to how the data is stored, so big data must be stored using a suitable approach. In the traditional approach, data is stored in a relational database, where it is represented in table form and a Database Management System (DBMS) is used to control and manipulate it [3] [4]. However, with this approach the time taken to fetch the data is considerably long. One solution to this problem is the XML approach. XML is an emerging standard for data representation and exchange over the Internet [5]. XML is widely used to store and manage huge amounts of data. It is chosen because of its simple syntax, ease of generation and parsing, ease of debugging, extensibility, etc. [6].
The XML approach is successful and is currently used by most industries, such as healthcare, education and business, especially where huge data is involved. A second database approach is JSON. JSON provides strengths similar to those of the XML approach. JSON is chosen because it is supported directly inside JavaScript, is best suited to JavaScript applications and provides significantly better performance than XML [7]. JSON is estimated to parse up to one hundred times faster than XML in modern browsers. The JSON format has been reported to outperform the XML approach in terms of the time taken to fetch or retrieve data from a database [7][13] [14]. However, academicians and researchers are still looking for the best database approach, especially where huge data is involved. This paper proposes the flat file (text file) as an alternative database approach to XML and JSON. Publication datasets are used for experimental purposes. The performance of the flat file approach is compared with that of the XML and JSON approaches in two aspects: query performance and CPU usage for the data retrieval process. The rest of this paper is organized as follows: Section 2 reviews related work. Section 3 describes the data models considered, namely relational database, XML, JSON and text file. Section 4 discusses the three database approaches based on experimental results and our experience in the development. Finally, a conclusion is given in Section 5.

RELATED WORKS
Based on past research, the most popular alternatives to the relational database are XML and JSON. XML stands for eXtensible Markup Language, a standard for data exchange issued by the World Wide Web Consortium (W3C) in 1998 [8]. With the rapid development of the Internet and web services, XML has been widely accepted as a standard data format for data interchange and storage [9]. XML approaches have been implemented for clinical data storage [10]. This technique is effective for managing clinical data and transforming it into a structured format. The advantages of the XML approach for clinical data are better scalability, flexibility and extensibility. The native XML approach has also been implemented in external and distributed databases [11]. The purpose of native XML is to minimize query retrieval time [12]. The XML approach has successfully handled huge data of around 100,000 records. In the chemical industry, XML is also used for the integration of chemical data [13]. XML was adopted there because the chemistry community has been slower to adopt the Internet as a central service for exchanging information. Chemical data involves a large number of data files, and the XML approach can improve the efficiency of query processing over them. XML is used to overcome the problems of sharing information between systems and of dealing with a large number of databases. Through the XML approach, different systems can share and exchange information easily. Its implementation in different domains shows that XML can handle large amounts of data. Query processing using XML is efficient compared to a relational database. However, the efficiency of query processing can still be improved by using another approach as an alternative database approach. Meanwhile, JSON is a lightweight data-interchange format that is easy for humans to read and write, and easy for machines to parse and generate [12].
Nowadays, more and more data is represented as JSON documents. JSON is becoming the universal standard data format for representing and exchanging information. The JSON approach has been implemented on data sets ranging from 1,000 to 25,000 records. The results show that the JSON approach is more efficient than XML in terms of storage and query retrieval [8] [15]. However, researchers are still looking for the best technique for handling huge data. In this paper, the text file database approach is introduced as a new approach to handling and managing huge data. A text file is a computer file that contains only text and has no special formatting such as bold text, italic text or images. Text files are simple, which is why they are commonly used for storing information. In this paper, the text file approach is compared with the XML and JSON approaches on huge data of around 50,000 records. This comparison is important to show the efficiency of each approach for handling huge data.

TYPES OF DATABASE MODEL
Four types of database approach are presented in this section. Currently, most data sources are stored using the traditional database approach, the relational database. Because of the limitations of this approach, many researchers are looking for an alternative database approach. Three alternatives are identified, namely XML, JSON and TXT, and their performance is compared in order to show which one is the better alternative database model. Figure 1 shows a diagram of the publication data. Based on Figure 1, publication data comes from different sources such as article, book, inproceeding, master thesis (msthesis), proceeding, website/URL (www) and PhD thesis (phdthesis). These data sources are collected and extracted into a relational database. Figure 2 shows how a publication data source in structured data format is extracted and stored in the relational database. Around 50,000 records are stored in the relational database. These records are split into four (4) segments: 1,000 records, 5,000 records, 10,000 records and 50,000 records. After segmentation, the records from the relational database are converted into three different data formats: XML, JSON and TXT. Each of these formats can be considered an alternative database approach. An algorithm is designed to convert data from the relational database into XML, JSON and flat file (text format). Section 3.1 demonstrates how data from the relational database is converted into XML, JSON and flat file (text format).

The Relational Database Approach
A relational database is a data abstraction that presents the data in a database as a set of tables [16]. Relational data is complex; it mimics the way people think by grouping similar objects together and breaking down complex objects into simpler ones. TABLE 1 until TABLE 7 show how the publication data is stored. Each table that contains publication data is divided into two parts: rows and columns. Columns represent attribute names and rows represent the data records (sometimes called tuples).

XML Approach
XML provides a standard for the semantic management of data. It is a formal meta-language facility for defining a markup language. The basic unit in an XML file is an entity, or chunk, that contains content and markup. Many excellent model-mapping schemas have been proposed for storing XML data in, and retrieving it from, a relational database [17]. The markup describes the content. More generally, markup consists of tags, attributes, comments and processing instructions for the content. In a start tag, the name and any additional information are surrounded by the "<" and ">" characters. Figure 3 shows the algorithm by which data from the relational database is converted into XML format.

Input: Tables
...
Assign each record to each attribute tag
1.5 Close XML tag of x
1.6 Repeat step 1.1 until end of records
2. Repeat step 1 until end of tables
3. Display data set A

After the algorithm in Figure 3 is implemented in program code and executed, the system produces an XML file, as shown in Figure 4, which represents the publication data in XML format. Similarly to a start tag, an end tag consists of the tag name surrounded by "</" and ">". XML is case sensitive, so start and end tag names must match exactly.
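The conversion loop described above (one element per record, one child tag per attribute, repeated until the end of the records) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `rows_to_xml`, the root tag `dataset` and the sample columns and rows are hypothetical, and only the standard `xml.etree.ElementTree` module is used.

```python
import xml.etree.ElementTree as ET


def rows_to_xml(table_name, columns, rows):
    """Build an XML tree with one element per record and one child per attribute."""
    root = ET.Element("dataset")  # hypothetical root tag, not taken from the paper
    for row in rows:
        # Open a tag for the record, named after its table (e.g. <article>).
        record = ET.SubElement(root, table_name)
        for column, value in zip(columns, row):
            # Assign each value of the record to its attribute tag.
            field = ET.SubElement(record, column)
            field.text = "" if value is None else str(value)
    return root


# Hypothetical sample data; the column names are illustrative only.
columns = ["title", "author", "year"]
rows = [("Data Models", "A. Author", 2001), ("XML Storage", "B. Writer", 2003)]

root = rows_to_xml("article", columns, rows)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Start and end tags are generated by the library, so they always match exactly, which is what the case-sensitivity rule requires.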

JSON Approach
In this approach, data is represented in array format. JSON is built on two structures. The first is a collection of name/value pairs. In various languages, this is realized as an object, record, structure, dictionary, hash table, keyed list or associative array. The second is an ordered list of values. In most languages, this is realized as an array, list or sequence. Each object begins with "{" and ends with "}". An array is an ordered collection of values that begins with "[" and ends with "]". Meanwhile, a value can be a string in double quotes, a number, true or false, an object or an array. Figure 5 presents the algorithm for converting the relational database to the JSON approach. In this algorithm, two inputs are required, which are ...
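As an illustration of the two JSON structures described in this section, the sketch below maps each relational row to an object (name/value pairs) and each table to an array of such objects. It is a hedged example rather than the algorithm of Figure 5: the function name `rows_to_json` and the sample columns and rows are hypothetical, and the standard `json` module does the serialization.

```python
import json


def rows_to_json(table_name, columns, rows):
    """Serialize rows as a JSON object holding an array of name/value records."""
    # Each row becomes an object: {"title": ..., "author": ..., "year": ...}.
    records = [dict(zip(columns, row)) for row in rows]
    # The table becomes an ordered array of those objects.
    return json.dumps({table_name: records}, indent=2)


# Hypothetical sample data; the column names are illustrative only.
columns = ["title", "author", "year"]
rows = [("Data Models", "A. Author", 2001), ("XML Storage", "B. Writer", 2003)]

json_text = rows_to_json("article", columns, rows)
parsed = json.loads(json_text)
print(json_text)
```

Unlike the flat file format, JSON keeps numbers as numbers, so `year` is read back as an integer.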

Flat File Approach
In this approach, data is represented in a flat file (text format). Flat files are plain text files stored on a computer. Data in a flat file is simple and can be ported to any program. The basic characteristics of a flat file are that data is stored as plain text (even numbers are plain text) and that each line of the file contains one record, or case, of the data set. Each line of a flat file contains several fields, which hold the values of the different variables in the data set. Fields within a record are separated by a special character, or delimiter. Each line after the header consists of two fields separated by a colon (the character ":" is the delimiter). Alternatively, white space (one or more spaces or tabs) can be used as the delimiter. Figure 7 shows the algorithm by which data from the relational database is converted into a flat file (text format).

Input
...
Assign M to C
6. Display data set (C)
Figure 7. Algorithm (Relational Database to Flat File (Text Format))

Figure 8 shows a list of data represented in flat file (text format). These data are extracted from the original data source, which is stored in the relational database.
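The flat file layout described in this section (a header line, one record per line, fields separated by ":") can be sketched in Python as shown below. This is an assumption-laden illustration, not the algorithm of Figure 7: the helper names `rows_to_flat_file` and `read_flat_file` and the sample data are hypothetical. The standard `csv` module is used so that a field containing the delimiter is quoted instead of being split.

```python
import csv
import io


def rows_to_flat_file(columns, rows, delimiter=":"):
    """Write a header line and one delimited line per record; return the text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(columns)  # header line with the attribute names
    writer.writerows(rows)    # one record per line
    return buffer.getvalue()


def read_flat_file(text, delimiter=":"):
    """Parse the flat file text back into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


# Hypothetical sample data; the second title contains the delimiter on purpose.
columns = ["title", "year"]
rows = [("Data Models", 2001), ("XML: A Primer", 2003)]

flat_text = rows_to_flat_file(columns, rows)
records = read_flat_file(flat_text)
print(flat_text)
```

When the file is read back, every value is a string, including the year, which matches the observation that even numbers are stored as plain text.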

EXPERIMENTAL RESULTS
In this section, we evaluate the performance of accessing data from XML and JSON. Four different queries are used in the experiments. The systems are built on a personal computer equipped with a 2.40 GHz Intel® Core™ i7-5500U CPU, 8.00 GB RAM and a 250 GB solid-state drive. The operating system is Microsoft Windows 10. The XML database (approach I) is queried using XPath, alongside the JSON database (approach II). We use the benchmark dataset DBLP [18]. The variation in query time with the size of the database is also studied. For each of the two database approaches, the query time and CPU usage for the queries of varying complexity are measured with databases containing 1,000, 5,000, 10,000 and 50,000 records respectively. For query retrieval, at each setting the query is run 10 times to calculate the average time and standard deviation [10].
The discussion is based on two experiments in the development of the databases and their application to the storage of structured data, from the perspectives of test data, efficiency and scalability, and extensibility. The performance of the two database approaches is evaluated using the benchmark dataset DBLP. The data contains 50,000 records. TABLE 8 shows the queries with different complexity and TABLE 9 shows the same queries written as SQL statements. The queries in TABLE 8 are:
Query I: List all the URLs which begin with the "db/journals" path
Query II: List all the titles of master theses which contain the "Data" keyword
Query III: List the titles of inproceedings where the author is "Regine Laleau, Mammar"
Query IV: Count the number of PhD theses published in each year
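To make the four queries concrete, the sketch below evaluates them over a small in-memory list of records, as they might look after parsing any of the three formats. It is illustrative only: the field names (`type`, `title`, `author`, `year`, `url`) and all sample values are assumptions, apart from the "db/journals" prefix, the "Data" keyword and the author string, which are taken from the query descriptions.

```python
from collections import Counter

# Hypothetical records; in the experiments these would come from the DBLP data.
records = [
    {"type": "article", "title": "Query Processing", "author": "A. Author",
     "year": "2001", "url": "db/journals/x/a1.html"},
    {"type": "msthesis", "title": "Data Mining Basics", "author": "B. Writer",
     "year": "2002", "url": "db/theses/m1.html"},
    {"type": "inproceeding", "title": "Formal Models", "author": "Regine Laleau, Mammar",
     "year": "2002", "url": "db/conf/c1.html"},
    {"type": "phdthesis", "title": "Graph Stores", "author": "C. Scholar",
     "year": "2001", "url": "db/theses/p1.html"},
    {"type": "phdthesis", "title": "Index Design", "author": "D. Student",
     "year": "2001", "url": "db/theses/p2.html"},
    {"type": "phdthesis", "title": "Text Search", "author": "E. Reader",
     "year": "2003", "url": "db/theses/p3.html"},
]

# Query I: URLs that begin with the "db/journals" path (prefix match).
query_1 = [r["url"] for r in records if r["url"].startswith("db/journals")]

# Query II: titles of master theses that contain the keyword "Data".
query_2 = [r["title"] for r in records
           if r["type"] == "msthesis" and "Data" in r["title"]]

# Query III: titles of inproceedings by the given author (exact match).
query_3 = [r["title"] for r in records
           if r["type"] == "inproceeding" and r["author"] == "Regine Laleau, Mammar"]

# Query IV: number of PhD theses published in each year (group and count).
query_4 = Counter(r["year"] for r in records if r["type"] == "phdthesis")

print(query_1, query_2, query_3, dict(query_4))
```

The queries rise in complexity from a prefix filter, through filters on two fields, to a grouped count.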

Data Extraction (XML, JSON and Flat File (text format))
In this section, data from the relational database is extracted and converted into three different data formats. The data size for each format is given in KB. TABLE 10 until TABLE 13 show the data size and query retrieval performance for the three formats: XML, JSON and flat file (text format). The data is split into four segments: 1,000 records, 5,000 records, 10,000 records and 50,000 records. These records are then converted into the different data formats. Based on TABLE 10 until TABLE 13, the flat file (text format) is smaller than the XML and JSON files. As a result, data retrieval from the flat file is also faster than from XML and JSON.

Data Retrieval (XML, JSON and TXT)
In this section, we evaluate the performance of searching the data in XML, JSON and flat file (text format). Four (4) different queries were executed, and each query was run 10 times. Figure 6 until Figure 9 depict the query retrieval performance in terms of the time taken to process the query, in milliseconds (ms). The data is split into four segments: 1,000 records, 5,000 records, 10,000 records and 50,000 records. The mean and standard deviation are calculated using the standard formulas.
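The measurement procedure described above (run each query 10 times, then report the mean and standard deviation in milliseconds) can be sketched as follows. This is a minimal harness under stated assumptions, not the code used in the experiments: the names `time_query` and `query_i` and the generated records are hypothetical, and the sample standard deviation (`statistics.stdev`) is assumed because the text does not say whether the sample or the population form was used.

```python
import statistics
import time


def time_query(query, runs=10):
    """Run `query` `runs` times and return (mean, standard deviation) in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        query()
        samples.append((time.perf_counter() - start) * 1000.0)
    # Sample standard deviation; an assumption, see the lead-in.
    return statistics.mean(samples), statistics.stdev(samples)


# Hypothetical data set of 1,000 records, half of them under "db/journals".
records = [
    {"url": f"db/journals/j{i}.html" if i % 2 == 0 else f"db/conf/c{i}.html"}
    for i in range(1000)
]


def query_i():
    """Query I: URLs that begin with the "db/journals" path."""
    return [r["url"] for r in records if r["url"].startswith("db/journals")]


mean_ms, stdev_ms = time_query(query_i)
print(f"mean = {mean_ms:.3f} ms, standard deviation = {stdev_ms:.3f} ms")
```

Repeating the call with record sets of 1,000, 5,000, 10,000 and 50,000 entries would give one mean and one standard deviation per segment.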

CPU Usage Performance (XML, JSON and Flat File)
The performance of the database approaches is evaluated using the benchmark dataset DBLP. The data contains 50,000 records. Figure 10 until Figure 13 show the CPU usage for the queries with different complexity. The results show that the flat file (text format) uses less CPU than XML and JSON. The difference is more significant when huge data is involved, especially at 50,000 records.
Figure 10: CPU Usage - Query I
Figure 11: CPU Usage - Query II
Figure 12: CPU Usage - Query III
Figure 13: CPU Usage - Query IV
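The text does not state how CPU usage was measured. One possible way to estimate it for a single query, shown here purely as an assumption, is to divide the CPU time consumed by the process by the elapsed wall-clock time. The function name `cpu_usage_percent` and the sample task are hypothetical.

```python
import time


def cpu_usage_percent(task):
    """Run `task` once; return (result, CPU time as a percentage of wall time)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = task()
    cpu_elapsed = time.process_time() - cpu_start
    wall_elapsed = time.perf_counter() - wall_start
    # Guard against a zero wall-clock interval on very fast tasks.
    percent = 100.0 * cpu_elapsed / wall_elapsed if wall_elapsed > 0 else 0.0
    return result, percent


def sample_task():
    """Hypothetical stand-in for a query: count the even numbers below 100,000."""
    return sum(1 for i in range(100000) if i % 2 == 0)


result, percent = cpu_usage_percent(sample_task)
print(f"result = {result}, CPU usage = {percent:.1f}%")
```

The figure applies to the current process only, and very short tasks may fall below the timer resolution on some platforms, so the task should be long enough or repeated before the value is interpreted.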

Efficiency and Scalability
The performance of the database approaches is evaluated using the benchmark dataset DBLP. The data contains 50,000 records. TABLE 8 shows the queries with different complexity, and TABLE 9 shows the SQL commands corresponding to the queries in TABLE 8. In terms of execution time for query retrieval, Figure 5 until Figure 8 show that the flat file approach outperforms the XML and JSON approaches. The execution time changes with the number of records. In terms of CPU usage, the flat file approach is also better than XML and JSON. Figure 9 until Figure 12 show that the CPU usage of the flat file approach is low compared to XML and JSON. The percentage of CPU used to execute a query remains considerably lower than for the XML and JSON approaches as the number of records grows. With the flat file approach, CPU usage increases steadily with the number of records.

Flexibility
In the relational database approach, data modelling is restricted by the number of columns permitted by the database management system. A flat file is more flexible in that there is no need to pre-define the required number of columns. In a flat file, data with a complex structure can always be added subsequently. Further, the schema does not need to be redesigned when the content changes, since the schema is general enough for any publication data. The flat file approach is portable and platform independent. It is both human readable and machine processable. The format also facilitates logical data management. An advantage of the flat file is its potential interoperability with other systems. For example, other systems can easily retrieve data from a flat file (text format). By using the flat file database approach, other systems can integrate with this format with minimal development effort.

CONCLUSION
A review of current database approaches indicates that the flat file approach is a viable alternative to relational database, XML and JSON, as it provides better performance for query retrieval and CPU usage while still retaining a certain degree of scalability and flexibility. In this research, the performance of the flat file, XML and JSON approaches is compared using the DBLP dataset. The flat file is found to be a flexible approach to handling huge numbers of records, but it falls short in terms of scalability and extensibility when compared to the XML and JSON approaches.
In the query retrieval experiment, four queries of different complexity were implemented. Based on the results, the execution time of the flat file approach is lower than that of the XML and JSON approaches. In terms of scalability, however, the approaches show execution time increasing steadily with the number of records. Further optimization is required to fully exploit the potential of XML and JSON databases and to improve the performance of the data search engine.
In the CPU usage experiment, the percentage of CPU used by the flat file approach is lower than that of the XML and JSON approaches. In terms of scalability, the flat file approach shows a steady increase in the percentage of CPU usage as the number of records grows. Meanwhile, for the XML and JSON approaches the percentage of CPU usage changes rapidly when a large number of records is processed. In these cases, the flat file approach is more practical for extracting large or huge numbers of records.
This study attempts to explore the vast opportunities of flat file technologies in the management of huge data. The prototype system developed was tested with a maximum of 50,000 records only. Further evaluation using larger datasets, or even multiple databases and data warehouses, should give more comprehensive and thorough findings on the query retrieval and CPU usage performance of the flat file, XML and JSON approaches.