Big Data Security Architecture using Split and Merge Method

ABSTRACT


INTRODUCTION
In this digital era, the demand for huge volumes of data is increasing rapidly, and so are the security concerns. Everyone wants data to be safe in every aspect and dimension. Breach of privacy and security is one of the major concerns in the digital world we live in today. Large amounts of data are generated in media, healthcare, technology, the private sector, science and sports, and security has to be maintained in every aspect. Big data refers to the management and analysis of huge amounts of data that exceed the capability and efficiency of traditional data processing in every dimension. Big data systems collect and analyse large amounts of data from heterogeneous sources to discover unprecedented new knowledge and understanding of scientific and business scenarios [1] [2]. Aiming to gain greater insight into patterns not generally discernible from smaller data sets, big data business intelligence enables visibility into associations and trends that would otherwise go unnoticed. Designing a foolproof security measure for such a wide volume and variety is an extremely intricate and cumbersome procedure. In this paper, we propose a scheme that encrypts data and also makes data access difficult for an intruder. This is achieved by dividing or splitting the concerned data, after which encryption is performed. This approach proves beneficial since it not only encrypts data but also focuses on data integrity. To an intruder it may seem that data integrity is lost since the data is divided, but the information about the file locations and pairing is known only to an authorized user. Hence a two-level security is established: one level scrambles or masks the data items, and the other protects its integrity under attack [4] [5]. The scheme comprises three stages: data splitting, encryption and decryption. Encryption is performed using the Secure Efficient Data Distribution algorithm, and decryption is performed using the Efficient Data Conflation algorithm.
Data splitting is achieved through different steps for different kinds of data. The goal of this mechanism is to increase both performance and security. Massive data breaches in the news across the world have led large multinational companies to make information security their topmost priority. This is not restricted to securing the actual information but also extends to authorizing each individual user of the company servers. Sensitive information is further declared confidential, and only a set of executive employees is granted access. With such security standards and protocols in place, security enforcement is undergoing a revolution to ensure top-level data protection. Since big data analytics has invariably become one of the most sought-after domains in the IT industry, big data security also assumes top priority. Big data analytics reveals hidden data patterns, unknown correlations, customer preferences and other useful information that aids the economic growth of a company. Hence every leading company is leaning on this technology to make better-informed decisions, improve operational efficiency, predict trends and improve overall performance. Big data has three properties: volume, variety and velocity. The variety arises from the presence of structured, unstructured and semi-structured data. Providing a security solution that encompasses all three types of data has been a challenging task, and the proposed solution attempts to decrease the difficulty of performing it. Most encryption algorithms provide single-level security by encrypting data using a random key. The proposed solution provides two-level security by encrypting the contents of a piece of information and also creating the illusion for an attacker that data integrity has been lost [5][6].

RELATED WORK
While the term "big data" is relatively new, the act of gathering and storing large amounts of information for eventual analysis is ages old. The concept gained momentum in the early 2000s when industry analyst Doug Laney articulated the now-mainstream definition of big data as the three Vs. Volume: organizations collect data from a variety of sources, including business transactions, social media and sensor or machine-to-machine data; in the past, storing it would have been a problem, but new technologies (such as Hadoop) have eased the burden. Velocity: data streams in at unprecedented speed and must be dealt with in a timely manner; RFID tags, sensors and smart metering are driving the need to handle torrents of data in near-real time. Variety: data comes in all types of formats, from structured, numeric data in traditional databases to unstructured text documents, email, video, audio, stock ticker data and financial transactions. Two additional dimensions can be considered. Variability: in addition to the increasing velocities and varieties of data, data flows can be highly inconsistent, with periodic peaks [12]; daily, seasonal and event-triggered peak data loads, such as something trending in social media, can be challenging to manage, even more so with unstructured data. Complexity: today's data comes from multiple sources, which makes it difficult to link, match, cleanse and transform data across systems; yet it is necessary to connect and correlate relationships, hierarchies and multiple data linkages, or the data can quickly spiral out of control. Keke Gai et al. note that future work would address securing data duplication in order to increase data availability, since the failure of any datacenter would cause data retrievals to fail. Ahmed Alahmadi, Mai Abdelhakim, Jian Ren and Tongtong Li, in IEEE Transactions on Information Forensics and Security, Vol. 9, No. 5, May 2014, emphasize a reliable AES-assisted DTV scheme for robust primary and secondary system operations under primary user emulation attacks. In that scheme, an AES-encrypted reference signal is generated at the TV transmitter and used as the sync bits of the DTV data frames [5] [6]. By allowing a shared secret between the transmitter and the receiver, the reference signal can be regenerated at the receiver and used to achieve accurate identification of authorized primary users. Moreover, when combined with analysis of the auto-correlation of the received signal, the presence of a malicious user can be detected accurately whether or not the primary user is present. The approach is practically feasible in the sense that it can effectively combat PUEA with no change in hardware or system structure except for a plug-in AES chip. Potentially, it can be applied directly to today's HDTV systems for more robust spectrum sharing. B. Eswara Reddy, Gandikota Ramu et al., in the IEEE 2nd International Conference on Big Data Security on Cloud, IEEE International Conference on High Performance and Smart Computing, and IEEE International Conference on Intelligent Data and Security, 2016, stress that the fulfilment of data privacy and identity privacy are the major challenges in an EHR integrity auditing system. In that study, a framework for secure auditing of EHRs in cloud servers was introduced [7] [8]. The framework uses the CP-ABE scheme with a two-authority key computation method. Therefore, the public auditor verifies the data without accessing the complete data and cannot reveal the patient identity in any case. The framework also allows auditors to perform auditing tasks simultaneously to enhance efficiency.
In this framework, neither the cloud nor the KGA can access the plaintext individually, so there is no chance of data being misused in the cloud, which is a very important feature for supporting a healthcare data auditing system [9][10]. The analysis shows that the framework is secure and efficient.

PROPOSED METHOD
The proposed system architecture is illustrated in Fig.-1. The system accepts the big data input file, identifies the sensitive data using attribute relation methods, splits the big file into two, masks the split files, transmits them to the destination, demasks them appropriately, and calculates the performance parameters after merging the split files to recover the original file. Our approach is designed to divide the sensitive data into two encrypted parts for distributed storage in two different physical locations. The inputs include the initial data, a string of data packets containing sensitive information. The outputs are two separate data packets that are transmitted to different physical locations. The newly generated data packets must perfectly hide the sensitive information so that an attacker cannot read it even with access to the data. In the masking phase, the inputs of the algorithm are the two split components of the sensitive data to be encrypted and a randomly generated key. The two components are initialized as sets R and C for representation. Each of the two sets is subjected to an XOR operation with the random key. This key is common for a given input file but varies across different input files. It must be maintained in a special registry and communicated securely to a legitimate user of the information [1] [2]. The XOR operation results in two output sets, α and β, which are then saved in different physical locations within the target file system. This is depicted in Figure 2.

Figure 2. Masking Phase
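As a minimal sketch of the masking phase, the following Python code splits a data packet into the two components R and C and XORs each with the per-file random key. The function names and the key-cycling scheme are our assumptions; the paper does not fix a key length, so the key is repeated across each component.

```python
import os

def xor_with_key(component: bytes, key: bytes) -> bytes:
    """XOR every byte of a component with the key, cycling the key as needed."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(component))

def mask(data: bytes, key: bytes):
    """Split the sensitive data into two components R and C, then XOR
    each with the per-file random key to obtain the output sets alpha and beta."""
    mid = len(data) // 2
    r, c = data[:mid], data[mid:]      # the two split components (sets R and C)
    alpha = xor_with_key(r, key)       # masked first component
    beta = xor_with_key(c, key)        # masked second component
    return alpha, beta

# Example: mask a small payload with a fresh 16-byte session key.
key = os.urandom(16)
alpha, beta = mask(b"patient-id:4711;diagnosis:confidential", key)
# alpha and beta would now be written to two different physical locations,
# and the key recorded in the key registry for legitimate users.
```

Because XOR is its own inverse, an authorized holder of the key can recover each component by reapplying the same operation; an attacker who finds only one stored component sees neither the plaintext nor a structurally complete file.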
The demasking phase is designed to enable users to recover the information by converging the two data components from their distributed storage locations. The inputs of this algorithm are the two data components, α and β, and the appropriate key, K. A log of all session keys must be maintained and tagged, and the right key instance must be used for the data to be decrypted. This is depicted in Figure 3. The proposed model achieves data confidentiality with minimum overhead, since an illegitimate user without the appropriate key cannot access the protected data. Even if an attacker obtains the key, he or she may not be able to retrieve the complete information, since the original data packet is split into components; the attacker is fooled into thinking the data has been altered. To retrieve the original data, both data components must be decrypted and merged, confirming that data integrity has not been lost.
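The demasking phase can be sketched as the inverse operation. In this self-contained sketch (function names are our assumptions), reapplying the session key K to α and β recovers the two components, which are then merged to restore the original packet:

```python
import os

def xor_with_key(component: bytes, key: bytes) -> bytes:
    """XOR every byte with the key, cycling the key; XOR is self-inverse."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(component))

def demask(alpha: bytes, beta: bytes, key: bytes) -> bytes:
    """Decrypt the two stored components with the tagged session key
    and merge them to restore the original data packet."""
    r = xor_with_key(alpha, key)   # recover first component
    c = xor_with_key(beta, key)    # recover second component
    return r + c                   # merging confirms integrity is intact

# Round trip: split and mask as described in the paper, then demask.
key = os.urandom(16)
data = b"account:12345;balance:9999"
mid = len(data) // 2
alpha = xor_with_key(data[:mid], key)
beta = xor_with_key(data[mid:], key)
restored = demask(alpha, beta, key)   # equals the original data
```

Note that both components and the correct key instance from the session-key log are required; possessing any single piece is insufficient to reconstruct the packet.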

RESULTS AND DISCUSSIONS
The proposed methods were experimented on text, image and audio files of 1 MB each; the algorithms were executed, and time versus size was extracted for the individual data sets. The approach provides flexibility around how the data is masked and ensures that the business rules of the enterprise application are not impacted. A text file with a .txt extension is chosen at random and split into two components, which are then encrypted using the algorithm described above. Figure 4 represents the results for the audio file with .avi extension processed using the proposed algorithm. Figure 6 represents the time-versus-size graph, which illustrates how the masking time behaves as data size increases: as shown, time is linearly proportional to size. Big data comprises heterogeneous data sets, and securing such data sets is a challenging task in the current digital era. To achieve the stated outcomes, organizations need to be able to handle data efficiently and quickly, and because this data often includes sensitive information, it needs security at different scales.
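The linear time-versus-size behaviour reported here is what one would expect from a byte-wise XOR mask. A rough, hypothetical measurement harness (not the authors' benchmark; names and sizes are our choices) could look like:

```python
import os
import time

def mask_component(component: bytes, key: bytes) -> bytes:
    # Byte-wise XOR with a cycled key: O(n) in the component size,
    # hence masking time grows linearly with input size.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(component))

key = os.urandom(16)
for size_kb in (256, 512, 1024):          # up to 1 MB, matching the experiments
    data = os.urandom(size_kb * 1024)
    start = time.perf_counter()
    mask_component(data, key)
    elapsed = time.perf_counter() - start
    print(f"{size_kb:5d} KB masked in {elapsed:.4f} s")
```

Plotting `elapsed` against `size_kb` for each file type would reproduce a time-versus-size curve of the kind shown in Figure 6.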

CONCLUSION
In the current business scenario, onsite-offshore and factory models are familiar, yet many organizations hesitate over security factors. The big data boom is growing day by day, bringing different varieties of data sources that need to be secured according to customer expectations and application-specific requirements. In this work we focused on the problem of securing the variety of data present on a big data platform and aimed to provide an approach that protects sensitive data. Addressing this goal, we proposed a novel approach, the split and merge method, and evaluated the proposed model by assessing its execution time and output size while operating on different input data sizes. The model has proven successful in performing the masking and demasking of different file formats.