From Relational to a Column-based Database: A quasi-experiment

. This work reports the authors’ experience migrating a dataset from relational to a column-based database. The goal of the experiment is to identify and depict the several challenges faced by a group of advanced students in Software Engineering with little prior experience in database administration. As a first dive and considering the initial conditions, the authors consider this work a success in terms of learning.


Introduction
A NoSQL database environment is a non-relational and largely distributed database system that enables the ad-hoc organization and analysis of extremely high-volume and disparate data types.NoSQL databases are sometimes referred to as cloud databases, non-relational databases, big data databases and a myriad of other terms.They were developed in response to the sheer volume of data being generated, stored and analyzed by modern users (user-generated data) and their applications (machine-generated data) [Han et. al, 2011].The mainstream big data platforms adopt NoSQL to break and transcend the rigidity of normalized relational schemas [Chen and Chun, 2014].
While relational databases have been used for decades to store data -and they still represent a viable solution for many use cases -the NoSQL approach is chosen today for scalability and performance reasons.However the implementation of a NoSQL database may seem confusing or even overwhelming.
NoSQL databases are grouped into four primary product categories with different architectural characteristics: document databases, graph databases, key-value databases and wide column stores.Many NoSQL platforms are also tailored for specific purposes, and they may or may not work well with SQL technologies, which could be a necessity in some organizations.In addition, most NoSQL systems are not suitable replacements for relational databases in transaction processing applications, because they lack full ACID1 compliance for guaranteeing transactional integrity and data consistency.
Performing of experiments would allow to identify the required knowledge and the main issues of a relational to NoSQL migration process, besides the analysis regarding the selection of some kind of NoSQL alternative for a specific environment [Martínez and Aizemberg, 2015].
A quasi-experiment depicting the migration process from a relational to a No-SQL database is presented in this work.The paper is organized as follow.Section 2 introduces the main concepts related to No-SQL databases.Section 3 describes the experiment.Section 4 presents an analysis of the results.Finally, conclusions are discussed in Section 5.

NoSQL databases
The term NoSQL (Not only SQL) describes a broad set of databases lacking the properties of traditional relational databases, which are generally not queried by means of SQL (Structured Query Language).
By design, NoSQL databases and management systems are relation-less (or schemaless).They are not based on a single model and each database, depending on their targetfunctionality, adopt a different one.

Data Model
NoSQL databases vary widely by data model and have some distinct features on its own.There are different data models and functioning systems for NoSQL databases, as described following [Leavitt, 2010]:  Key / Value: A key-value database allows the user to store data in a schema-less manner, usually some kind of programming language datatype or an object.The data consists of two parts: 1) a string as the key, and 2) the actual data as the value.

BASE Properties
The NoSQL data model does not guarantee ACID properties but instead it guarantees BASE properties (Basically Available, Soft State, Eventual Consistency) [Singh, 2015].BASE brings a softer consistency model.Basically Available means the database assure system availability in terms of CAP theorem.Soft State establishes that the system state may change over a period of time even if no input is given.Finally, Eventual Consistency indicates that the system eventually become consistent with time if system is not feed with any input during that time.These kinds of databases prioritize availability over consistency [Nayak et. al, 2013][Singh, 2015].

The CAP Theorem
The CAP theorem states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees [Gilbert and Lynch, 2002]:  Consistency -All the servers in the system will have the same data so anyone using the system will get the same copy regardless of which server answers their request.
 Availability -The system will always respond to a request (even if it's not the latest data or consistent across the system or just a message saying the system isn't working).
 Partition Tolerance -The system continues to operate as a whole even if individual servers fail or can't be reached.
It is theoretically impossible to have all 3 requirements met, so a combination of 2 must be chosen and this is usually the deciding factor in what technology is used.

Experiment Design
The experiment design of this work belongs to the observational category.An experiment of this kind collects relevant data as a project develops, with relatively little control over the development process.Specifically, this paper adopts a project monitoring approach, collecting and storing the data that occurs through project development.
It is a passive model since the data will be whatever the project generates with no attempt to influence or redirect the development process or methods that are being used [Zelkowitz andWallace, 1997] [Dinardo, 2008].

Objectives and Research Questions
The objective of this work is to analyze the challenges related to the process of migrating a relational database to a NoSQL database.The migration is carried out by people with background knowledge on relational databases but without experience on NoSQL technologies.

Context, Experimental Units and Treatment
The experiment was carried out by a group of 4 Information System Engineering students, with background knowledge on relational data modeling and databases technologies.However, they have not previous knowledge on NoSQL databases.
The experiment involved a relational database composed by seven tables implemented in Oracle MySQL and hosted on a remote Azure Windows Server.
Cassandra (version 2.2.6) was the column-based database used in this experiment-Cassandra was originally developed by Facebook and open-sourced in 2008 [Chan, 2016].It can be defined as a "distributed storage system for managing structured data that is designed to scale to a very large size.It shares many design and implementation strategies with databases but does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format".Twitter, Digg and Rackspace, among others, have adopted Cassandra.It is important to state that we did not choose Cassandra out of a specific problem that needed to be solved, but for explicit interest on testing Cassandra's tools and behavior.
We knew beforehand that working on a simple, reduced context might not show significant improvement on performance issues.However, we consider appropriate to set this as a starting point for future similar projects, to get familiarized with this set of tools, environment and modelling approach.Upcoming iterations intend to add complexity and volume to this process.

Tasks and Material
The migration from a relational database to a column-oriented NoSQL database -Cassandra -was the task performed by means of the following steps.
Our starting point was defining technologies and setting them up for proper use.Two members of our team currently have a Microsoft Azure's student pass, so we could virtualize separately a Windows Server virtual machine (where we installed MySQL) and an Ubuntu virtual machine (where we installed Cassandra).Both gave us access through Putty terminal or remote desktop.
We started the whole migration process per se by setting the RDBS: our first task was conceptual modelling11 , and we ended up with the simplest schema: 7 tables, each of them with at most 10 columns and at most 2 foreign keys, as shown in Figure 1.Next step was populating these tables: for this task, we used random data (randomly generated but respecting its corresponding types and patterns).At the end of this stage, we had almost nine thousand rows in general, the most populated table having five thousand of them.All of them were populated following an import process from CSV file sources.
Third step would be then running all SQL queries considered important.In this case, we had to make up a whole set of queries given that we had no explicit problem to resolve, so we thought of seven queries that, under our consideration, cover all important corners of SQL behavior.Some of these queries are shown in Table 1.Up to this point, with relational MySQL created, running and tested it was mandatory to switch our focus to NoSQL Cassandra database.Differing from its root, it was immediate to us that a re-modelling process was necessary -mostly based on the facts that Cassandra claims to have cheaper "writes" than "reads", and its column families strongly based on the type of queries performed.Hence, the importance of having the queries previously defined.
The next step involved creating a keyspace (Simple Strategy and factor replication equal to one -to be noticed here the simplicity we chose for this project), and several column families (also known as tables) with, notably, redundant information -which is one of the key uses of Cassandra and column-oriented databases.Following the creation of the column families we had to populate them.For this task, we used all CSV files exported from the SQL queries run on MySQL previously.
Therefore, we made use of the COPY FROM command in order to import data to Cassandra.
Finally, the last step was to run the same type of queries we ran on MySQL (shown in Table 2), adjusted to fit CQL (Cassandra Query Language) syntax, and so we could compare the obtained results to the relational queries as well as both performances.

4
Result Analysis Even though it was not the main target of this quasi-experiment, we still conducted some analytical and technical comparisons regarding performance on both databases.For this task, we included different sets of queries such as:  Read-mostly query set. Read/write combined query set (approximately 50%-50%). Mixed query set (read, update, insert). Insert-mostly query set.
The reason of testing the former, under our consideration, is that they cover most typical modern applications quite well.Of course, they were run in a manner that did not allow data loss.
Being aware we would not find significant performance differences working on a single node, we still found out (or confirmed what theoretically was supposed to happen) several topics worth of quotation.Cassandra showed overall improvement, but remarkably, showed an outstanding throughput in insert-only queries.Regarding readonly and read/write queries it did not show a massive difference, although it was indeed better -which is probably due to the de-normalized data and the lack of joins, even being a write-oriented database competing with a read-oriented RDBS.Finally, to be noticed that mixed queries (reads & updates & inserts) showed better performance in CQL than those of read & write.We suspect at this point that the decreasing number of reads boosts Cassandra's overall performance.

Conclusions
As data-centric systems evolve, organizations increasingly find the need to evaluate new data stores to support changing applications and business requirements.The media hype around NoSQL databases and the commensurate lack of clarity in the market makes important for organizations to access to different implementations.In this paper, a quasi-experiment on migrating from relational to NoSQL database was presented.As we concluded it, we found ourselves having achieved our initial goals.Beyond the migration process our focus was elsewhere; above anything else, we wanted to have a first go into the NoSQL database world.
As we progressed, step by step, on the resolution of the proposed scenario, we took full consciousness of its deficits -given the fact we initially took off from a fictional problem.And that was because we started from scratch, with our minds already set to think following the relational paradigm.
The biggest problem we had found and the best experience we gained is that when one wants to use this kind of databases, it is mandatory to know, that the way of thinking must be changed and the database-structure must be modeled thinking on the application that supports and not thinking like the traditional SQL way.The tables must be created thinking on the queries needed by the application: the model it is not applicationindependent.
It's for this reason that we'd found ourselves making a lot of mistakes, struggling to implement some technologies and needing to learn a little bit more about other underlying concepts.These, in our opinion, are the most valuable lessons learnt: beyond every improvable aspect and without exploding all Cassandra's advantages, we could take a glance at these technologies -and get to know what they are and aren't capable of.
As a first dive and considering the initial conditions, we can call this test a success in terms of learning.We are now ready to retry this process, with a more accurate, realworld related, bigger problem, and adding other complex technologies.The starting point will be then further ahead.

Figure 2 .
Figure 2. Some of the query-based Column Families in Cassandra 12 .

Table 1 . SQL query examples.
"Show all countries where the sum of all of its employees' salaries is lower SELECT pais.Id_Pais, pais.Nombre, SUM(trab.Sueldo) FROM empleado empl, pais, ciudad ciu, provincia prov, direccion