Initial Encryption of large Searchable Data Sets using Hadoop

With the introduction and the widely use of external hosted infrastructures, secure storage of sensitive data becomes more and more important. There are systems available to store and query encrypted data in a database, but not all applications may start with empty tables rather than having sets of legacy data. Hence, there is a need to transform existing plaintext databases to encrypted form. Usually existing enterprise databases may contain terabytes of data. A single machine would require many months for the initial encryption of a large data set. We propose encrypting data in parallel using a Hadoop cluster which is a simple five step process including the Hadoop set up, target preparation, source data import, encrypting the data, and finally exporting it to the target. We evaluated our solution on real world data and report on performance and data consumption. The results show that encrypting data in parallel can be done in a very scalable manner. Using a parallelized encryption cluster compared to a single server machine reduces the encryption time from months down to days or even hours.


INTRODUCTION
Storing data externally with a hosted provider becomes more and more attractive to companies which want to save costs on their own physical infrastructure. Since the cloud service providers have access to all data, clearly, storing sensitive data externally asks for solutions providing respective privacy precautions such that only the data owner is able to access the data. Inspired by CryptDB [1], we developed a system to encrypt and store data on a database, all happening transparently to the application. Hence, it can be used in the same way traditional databases are accessed to execute a large set of queries over the encrypted data. Introducing such an encrypted database solution to an existing application landscape requires an initialization phase where all unencrypted data is transferred and encrypted into the new database.
Since most of the encryption operations are slow, encrypting a plaintext table is very time-consuming, especially when the table is large. Using tables enabled for adjustable encryption [13] having multiple encryption layers for one database entry adds significantly to the computation time for the initial encryption. We show how to make the initial table encryption an automated process and much more efficient using a Hadoop cluster for parallelized data processing.
The rest of the report is structured as follows. Section 2 describes the technical details. In Section 3, we discuss the performance test and its result, and conclude with Section 4.

IMPLEMENTATION
In this section, we will discuss the technical details of how to encrypt the database in parallel.

Search over Encrypted Data
Inspired by CryptDB [1], we implemented a JDBC driver for connecting to database containing encrypted data. The goal of the driver is to realize accessing encrypted data transparently to the client application, which means a large set of regular SQL queries can be used to search over encrypted data. SQL operations (equality search, range search, aggregations, etc.) require that data is stored with different encryption schemes. When encrypting a database, each plaintext data item is encrypted using multiple schemes, depending on the SQL operations to be executed on the data. In total there are five types of encryption schemes we use: holds. It supports SQL range queries on the ciphertext. We use the algorithm introduced by Boldyreva et al. [2]. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request [4] we support on demand re-encryption of a column to be on the same key as another column for join operations (see [13] for more details).
For the support of an initial maximum security and storage optimization, we use adjustable encryption [13]. This means a plaintext item x is encrypted in a layered approach with multiple encryption schemes such that, for instance, RND(DET(OPE(x))) holds. If during an SQL execution an in-between-layer is required, top layers are removed on demand to reveal the respective layer.
Obviously, even if stacked, we want the encryption schemes to keep their characteristics we listed above. Hence, schemes on a lower layer must have the same characteristics required for the layer stacked on top of it (e.g., OPE must be deterministic, if DET is stacked on top). We use four stack structures to support multiple SQL operations in one query (e.g., aggregation and a range condition). Moreover, we provide two stack structures for integers and one stack structure for strings. Table 1 gives an overview.
In Table 1, the row "Types" stands for the type of plaintext data the characteristics of the encryption schemes are actually applicable. For example, for integer or decimal values, the additive homomorphic encryption scheme is required, to actually enable aggregations; it is, however, not required for strings. For sorting items (with OPE) different paddings are used for integer and string values; moreover, in our use cases joins are only done with integer values such that we omit this layer for strings. Stack structure 1 is used for data retrieval to the client and group by selections only. Having AES-based encryption schemes in this stack is a very fast solution for decrypting result sets on the client.

Encryption
We use the implemented JDBC driver and a Hadoop cluster to encrypt the plaintext table. There are five steps in total for the complete process. The basic workflow is shown in Figure 2.

Import data to cluster
The third step is to import the plaintext data from the database to the cluster's file system called HDFS. To achieve this goal, we use Apache Sqoop. Sqoop is a tool for transferring data between Apache Hadoop and the relational database. The relational database can be from any database vendor. In our case, SAP HANA [9] is used. The imported data is stored in a CSV file typelike structure, which means each line in the file represents an entry of the table.
The performance of this step mainly depends on the database itself and the network between database and cluster. The data imported to the HDFS will be the input for next step.

Encryption
The forth step is to do the encryption on the cluster. It is a mapreduce process. As we have mentioned above, the imported data is in CSV file type, which means each line in the file represents an entry of the plaintext table. So in the map stage, the imported file is read line by line, and each line is encrypted using the encryption algorithms mentioned above. This step does not need the reduce stage. The output of this step is also CSV file(s) which is according to the encrypted table structure previously created.
The performance of this step mainly depends on the performance of the cluster.

Export data to database
The last step is to export the data from the cluster to the database. We implemented this step by two different methods.  (<ROW2>), etc., and obviously this will make the exporting faster. As a result, the performance of the first method highly depends on if the database supports the batch mode INSERT. Moreover, it also depends on the condition of the network. And for some large tables, this exporting process requires the cluster connecting to the database for a long time, which may result in the connection being closed during the process.
The second method we implemented deals with exporting the data in a naïve way. This method requires the cluster having the Hadoop NameNode Web Interface installed. After the previous step (map-reduce encryption) is finished, the database server can directly download the output file(s) from the NameNode Web Interface. Then it obviously depends on the database which commands to be used to import the data into the database. In our case, using SAP HANA, the IMPORT FROM FILE [10] command is used.
The performance of the second method depends on the network and the database IMPORT performance.

Workflow management
The previous steps (Section 2.2.2 to Section 2.2.5) are all scheduled and managed by Apache Oozie. Oozie is a workflow scheduler tool to manage the Hadoop jobs. The reason to use Oozie is that it can make the implementation "cleaner", easier to modify, and more reliable.

EVALUATION
In this section, we will discuss our experiments with the above described setup and its performance.

Environment setup
SAP HANA is used as database. Typically for this database, all data is stored in memory in tables using column-based storage.
The Hadoop cluster resides on an internal virtual hosting environment. There is one master, twelve slaves, and one Ambari server [8]. The master machine has 8 CPUs, and 31.6 GB memory. The slave machines have each 8 CPUs, and 16 GB memory. All have between 100 and 300 GB available space.

Plaintext Tables
We have chosen 9 plaintext tables in total to test the performance. All the tables are standard SAP ERP tables and mainly come from the application of sFIN [11]. The details are shown in Table 3.

Time
The tables listed in Table 3 are encrypted one by one. There is only one job running in parallel, and before a new job is started, the cluster is cleaned. For all columns we create encrypted columns (stacks) according to the type of the plaintext column given with Table 1. We report on both methods to export the data to database (see Section 2.2.4). The results are showed in Table 4.
From Table 4, it is clear that the direct export using INSERT statements is much faster if it comes to large tables rather than using method 2 with CSV files. Method 2 takes about 5 times as much for processing then method 1. A reason we see is that in method 2 all encrypted data has to be extracted and copied from Hadoop's file system to the database server where it then has to be processed a second time for importing it with the IMPORT command.

Encrypted Tables
After the encryption, there are 9 encrypted tables corresponding to the plaintext tables. Their information is shown in Table 5.

Scalability test
We test the scalability of our solution with a midsized table ORDERS. We report on the scalability of the encryption with respect to the size of the table as well as with the size of the Hadoop cluster, i.e., number of working nodes.

Different table size
Firstly we use a fixed number of nodes (1 master server, 12 slave nodes) in cluster and change the table's size by randomly selecting a predefined number of entries from table ORDERS. And for exporting, we use method 1. The results are shown in Table 5. A more intuitive result is showed in Figure 3. From it, it is clear that with the linear increasing table size, the computation time also increases linearly.

Different number of nodes in cluster
We also test the influence of the number of nodes in cluster on the computation time. Because in the cluster only slave nodes store the data and do the computation, we increased the number of slave nodes from one node to 12 nodes in total. The complete table ORDERS is used for each encryption, remaining stable in its size. For exporting, again method 1 is used. The results are shown in Table 6. A more intuitive result is showed in Figure 4. From it, we can see the number of slave nodes affects the computation time and correlates directly with the number of nodes used. The time roughly drops to a third by using 3 nodes instead of only one. And it drops roughly to 1/12 when using 12 nodes.

CONCLUSION
In this report, we mainly talk about how to encrypt tables in parallel. After the environment setup, the idea is to divide the process in three main steps: import data to cluster, encryption, and export data to database. For step "import data to cluster", Apache Sqoop is used. For step "encryption", Map-Reduce is used. In step "export data to database", there are two potential methods. The first one uses Sqoop too, and the second one is a naïve method where the data is copied to the database with a database-specific import command.
We have run tests using real world tables. The results show that an automated encryption solution is feasible and using a Hadoop cluster reduces encryption time drastically in a very scalable manner.