1 Introduction

Relational databases have been the top choice of organizations over the last decade for storing, processing, and analyzing the data they generate. Relational databases store structured data and support the structured query language (SQL) for accessing the database [1]. When the data is structured and of low volume, relational databases can be used comfortably. Even unstructured or semi-structured data can be stored in relational databases after ETL tools convert it into a structured format. Recently, however, data has been growing at an exponential rate, exhibiting the huge volume, high velocity of generation, and wide variety that characterize big data [2]. As a result, it has become almost impossible for relational databases to store and process the huge amounts of big data generated by large organizations and social media. Moreover, the social networking and cloud computing paradigms have become popular, and they require data stores that can manage the massive amounts of data generated every second. In particular, the cost of storing and querying big data in relational databases is very high, and they cannot serve millions of users at the same time.

Relational databases are not well suited to big data storage and processing: big data comprises structured, semi-structured, and unstructured data, whereas relational databases store only structured data and provide very little support for unstructured and semi-structured data [3]. Considering the challenges big data poses for relational databases, modern large organizations with heavy data storage requirements are rapidly switching from existing relational databases to NoSQL databases. NoSQL databases organize data in different formats such as key-value, columnar, document, and graph. Cassandra stores data in a columnar format, which offers considerably fast write operations [4], whereas MongoDB [5] stores data in a document format. Each type of NoSQL database uses different data structures as its requirements demand.

Successfully handling the complexity of big data is a great challenge in converting relational databases to NoSQL databases. NoSQL databases are designed around application-specific access patterns and queries, without a normalization process, and they support neither join operations nor primary or foreign key relations. Migrating data from relational databases to NoSQL databases is therefore challenging for organizations, because the two rest on completely different technologies. To overcome this problem, we propose a framework, named DLoader, to migrate data from a relational database such as MySQL to NoSQL databases such as Cassandra and MongoDB. DLoader extracts data from the MySQL database, preprocesses the data by applying transformations and standardizations, and finally maps the columns of the MySQL tables to the fields of the NoSQL databases.

The rest of the paper is structured as follows: Sect. 2 reviews the related work; Sect. 3 describes the proposed work; Sect. 4 describes the experimentation with Cassandra and MongoDB. Lastly, Sect. 5 concludes the work and suggests future work.

2 Literature Review

NoSQL is a modern database technology for storing and processing big data. Most of the utilities developed so far convert one form of SQL database to another SQL database, and no specific utility is available for SQL-to-NoSQL conversion. A data adapter system was proposed to promote a hybrid database architecture including both SQL and NoSQL [6]. Many frameworks with different solutions, characteristics, and properties have been developed to solve the data migration problem, migrating data from one NoSQL database to another but not from SQL to NoSQL [7]. That work provided migration algorithms and migration schemes to migrate data between NoSQL databases in an actual operating environment. It is a challenging task for business organizations to migrate or transform their data from an existing relational database to NoSQL databases due to the complexity of relational data.

An approach consisting of two modules, a data transformation module and a data cleansing module, was proposed to transform a relational database into a big data database [8]. To handle the complexity of automatically transforming an existing relational database into a NoSQL database, a bi-fold transformation approach consisting of schema-to-schema and data-to-data transformations was proposed, which dealt with heterogeneous and complex data [9]. Heterogeneous data exchange and conversion of data across different kinds of database systems have been achieved through relational schema mapping [10]. That work specifically focused on exchanging heterogeneous XML data and on solving the problem of field attributes changing during migration.

A study revealed that NoSQL is faster than an RDBMS in the case of big data. A methodology was presented for data migration from MySQL to MongoDB as the NoSQL database [11]. Another study evaluated data insertion and retrieval speeds by comparing MongoDB as a NoSQL database with MySQL as a relational database, and showed that the NoSQL database is faster than MySQL for huge amounts of data under the parameters used in the study [12]. An approach that used data and query features to migrate data from relational to NoSQL databases preserved the features of the source data in the relational database as well as the queries for accessing that data, and used both for arranging the data in the target database [13, 14]. This system works for all NoSQL databases but requires metadata about the source and target databases.

A NoSQL layer can also be used as an interface residing between the source database and the application to migrate data from relational to NoSQL databases without changing the application code; this approach gives high performance for MongoDB [15]. Other approaches include a data adapter that integrates relational and NoSQL databases to support a hybrid database architecture, although data may not always be consistent in this approach [16]. A content management system for schema denormalization works with the Hadoop framework [17]. In another study, a document-oriented data schema covering all data types in databases was proposed, which further overcomes the issue of managing the relationships of a complex database. The two stages of that approach are designing the document-oriented data schema and migrating the ER model to the document-oriented data schema [18]. Data migrated from relational to NoSQL data models, or between different NoSQL platforms, needs to be validated in order to find errors introduced during transformation. The data can be validated using denormalized schema structures and Bloom filters, and other approaches can also be implemented to validate it [19].

3 Proposed Work

The concepts of relational databases and NoSQL databases and the differences between these two databases are discussed in the next two sub-sections. Thereafter, the proposed approach for transforming the data from relational databases to NoSQL databases is elaborated.

3.1 Relational Databases

Relational or structured databases are based on the relational model. Relational databases consist of tables, and tables consist of rows and columns. The data stored in these tables is related and can be accessed through the relationships between them, which can be one-to-one, one-to-many, or many-to-many. Data stored in relational databases is structured only. A table can contain any number of rows; all rows have the same structure but hold different data, and each row in a table is unique. A relational database management system (RDBMS) is used for creating and managing relational databases. There are mainly two types of keys in an RDBMS: the primary key and the foreign key. A primary key consists of one or more columns that uniquely identify every record in a table, and a foreign key consists of one or more columns of a table that refer to the primary key of another table. Relational databases also follow the ACID properties (atomicity, consistency, isolation, and durability) for transactions. Structured query language (SQL) is used for querying data in the relational model. These databases require a predefined schema, which is not easy to alter if the structure of the incoming data changes over time. Hence, the schema must be designed strictly, keeping future requirements in mind.
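As a minimal illustration of these concepts, the sketch below creates two related tables, one holding a primary key and one referencing it through a foreign key. It assumes the mysql-connector-python package and a local MySQL server; the connection settings and schema are hypothetical, chosen to echo the immunization dataset used later in the paper.

```python
import mysql.connector  # assumes the mysql-connector-python package

conn = mysql.connector.connect(host="localhost", user="root",
                               password="secret", database="demo")
cur = conn.cursor()

# The primary key uniquely identifies every row in Mother; the foreign key
# in Child references Mother's primary key (a one-to-many relationship).
cur.execute("""
    CREATE TABLE Mother (
        mother_id INT PRIMARY KEY,
        name      VARCHAR(100) NOT NULL
    )""")
cur.execute("""
    CREATE TABLE Child (
        child_id  INT PRIMARY KEY,
        name      VARCHAR(100) NOT NULL,
        mother_id INT,
        FOREIGN KEY (mother_id) REFERENCES Mother(mother_id)
    )""")
conn.commit()
conn.close()
```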

3.2 NoSQL Databases

NoSQL databases have been designed to facilitate big data storage and processing. NoSQL databases store data in various formats such as key-value, column store, document, and graph. They have their own query mechanisms and no single standard query language. Key-value databases consist of items, where each item contains a key and a value; a value can be retrieved simply by referring to its key. Columnar databases consist of tables that contain rows and dynamic columns. They have a flexible schema, and the number of columns can grow over time without creating a new schema. Column names need not be predefined, and hence the structure is not fixed. Columns in a row are kept in sorted order according to their keys, which include partition keys and clustering keys. Document-type NoSQL databases store data in documents, which resemble JSON objects; a group of documents is called a collection. Each document contains key-value pairs, where the values can be of different types. Graph databases store data in nodes and edges: nodes generally hold information about entities, whereas edges store information about the relationships between nodes. NoSQL databases have a denormalized structure, as they lack the primary and foreign key concept and hence do not support join operations. They provide features like horizontal scalability, data replication, data availability, and data consistency at low cost. As the amount of data grows, a NoSQL database can be scaled easily, and more data storage can be accommodated [20].
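As an illustration of the document model, a single child record might look like the following when stored in MongoDB; the field names are hypothetical, and the pymongo package is assumed.

```python
from pymongo import MongoClient  # assumes the pymongo package

# One denormalized document: a 1:1 relation is embedded as a sub-document
# and a 1:N relation as an array; no schema is declared up front.
child_doc = {
    "child_id": "C100123",
    "name": "Anitha",
    "date_of_birth": "2019-03-12",
    "mother": {"mother_id": "M55012", "name": "Lakshmi"},
    "immunizations": [
        {"vaccine": "BCG",  "date": "2019-03-13"},
        {"vaccine": "OPV1", "date": "2019-04-25"},
    ],
}

client = MongoClient("mongodb://localhost:27017")
client.demo.children.insert_one(child_doc)  # documents live in a collection
```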

3.3 SQL to NoSQL Transformation

Data stored in the tables of the relational database must be mapped column to column to the tables of the NoSQL database. We have used the Spark framework for migrating data from the SQL database to the NoSQL database [21]. A Hadoop cluster serves as a middle layer between the source and target databases. NoSQL databases such as Cassandra also support collection data types such as set, map, and list, and their tables are denormalized, so columns in the SQL database can also be mapped to collection data types (a sketch of such a mapping follows Fig. 1). The detailed process is described in the following sub-sections. The steps in migrating the data from a relational database to a NoSQL database are shown in Fig. 1.

Fig. 1 SQL to NoSQL data migration (flow diagram: data moves from the source database through the Spark SQL framework to the target database)
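The mapping of a one-to-many SQL relation onto a Cassandra collection column, mentioned above, might be expressed as in the sketch below, using the DataStax cassandra-driver package; the keyspace and table definition are hypothetical.

```python
from cassandra.cluster import Cluster  # assumes the cassandra-driver package

session = Cluster(["127.0.0.1"]).connect("immunization")  # hypothetical keyspace

# The rows of a child-to-dose relation can be folded into one denormalized
# row per child by mapping the repeated (vaccine, date) pairs to a map column.
session.execute("""
    CREATE TABLE IF NOT EXISTS child_doses (
        child_id   text PRIMARY KEY,
        child_name text,
        doses      map<text, date>
    )
""")
```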

3.3.1 Data Loading

The data in the relational database, stored in MySQL, follows a specific schema in which various tables are linked through primary and foreign key references, and each column in a table has a defined data type. We use timestamps when fetching data from the MySQL database to make sure that only recently added data is retrieved. JDBC is used to establish a connection with the source database. The data from the MySQL database is loaded into the Hadoop distributed file system (HDFS) [22] using the Spark SQL framework. HDFS is clustered storage consisting of one name node and N data nodes; this Hadoop cluster acts as the middle layer between the source database (MySQL) and the target database (NoSQL). SQL-like queries can also be run on the data in Spark data frames by creating temporary tables using Spark SQL, and Spark SQL supports join and aggregate functions as in SQL. The incoming data from the source database is dumped into HDFS, keeping the schema the same as that of the source database. The data is temporarily stored in HDFS as staging tables before the source data is processed.
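A minimal PySpark sketch of this loading step is given below. The host, credentials, table, timestamp column, and HDFS paths are placeholder assumptions, not the actual DLoader configuration; writing in delta format assumes the Delta Lake package is available to Spark.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DLoader-extract").getOrCreate()

# Incremental extraction: a pushed-down subquery on a timestamp column
# ensures that only rows added since the last run are pulled from MySQL.
children = (spark.read.format("jdbc")
            .option("url", "jdbc:mysql://mysql-host:3306/immunization")
            .option("dbtable",
                    "(SELECT * FROM child"
                    " WHERE created_at > '2020-01-01 00:00:00') AS t")
            .option("user", "dloader")
            .option("password", "secret")
            .option("driver", "com.mysql.cj.jdbc.Driver")
            .load())

# Stage the extracted rows in HDFS without altering the source schema.
children.write.format("delta").mode("overwrite").save("hdfs:///staging/child")
```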

3.3.2 Data Transformation and Mapping

HDFS holds the large amount of data dumped from the relational database (MySQL) in delta file format, without altering the schema of the source database. This data is loaded into a data frame of the Spark SQL framework, one table at a time. Transformations are applied to the table data contained in the data frame, keeping the target database schema in mind. Since the target NoSQL database does not support joins, the MySQL source tables are denormalized to enable migration of the data from the relational database (a sketch of this step follows Fig. 2). Every column of a table in the source database is transformed according to the column schema of the target database. The different concepts of relational databases are mapped to those of NoSQL databases as shown in Fig. 2.

Fig. 2 Concept mapping between relational database and NoSQL database (databases, tables, columns, fields, primary and partition keys, and so on)
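A minimal sketch of the denormalization step, assuming the staging tables written during loading; the paths and join keys below are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DLoader-denormalize").getOrCreate()

# Staging tables written in the loading step (paths are placeholders).
child = spark.read.format("delta").load("hdfs:///staging/child")
mother = spark.read.format("delta").load("hdfs:///staging/mother")
immun = spark.read.format("delta").load("hdfs:///staging/immunization")

# The target NoSQL stores cannot join, so the joins are performed once in
# Spark and the result is kept as a single denormalized staging table.
denorm = (child
          .join(mother, "mother_id", "left")
          .join(immun, "child_id", "left"))

denorm.write.format("delta").mode("overwrite") \
      .save("hdfs:///staging/child_denorm")
```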

Transformation includes renaming columns, changing column data types according to the target database, and altering tables by adding or removing columns according to the target database schema. Standardization includes ensuring consistent date formats, a uniform format for name values, uniform abbreviations (for example, for gender column values), a consistent format for mobile number values, and so on. Finally, the transformed data frame is loaded into the corresponding table of the target NoSQL database in append mode using Spark SQL. The algorithm for preprocessing and mapping the source data to the target is given below, followed by a sketch of these steps:

Algorithm 1 Data preprocessing and mapping

Input: Relational database
Output: NoSQL database

1. Establish a connection with the source database.
2. Create table objects for the source.
3. Dump the table objects into HDFS storage.
4. Load the data from all the tables into HDFS storage.
5. For each column of each table, if a transformation rule is applicable, transform the column according to the target table columns.
6. Standardize the content of the columns.
7. Map the HDFS table objects to target table objects.
8. Map the HDFS column objects to target column objects.
9. Construct each target table.
10. Construct the target object of each column.
11. Select a partition key and clustering key for each target table.
12. Compare the partition keys of the target tables to keep track of duplicate records.
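A minimal PySpark sketch of steps 5 and 6 and the final load is shown below, assuming the denormalized staging table from the previous step and the spark-cassandra-connector package; the column names, value formats, and keyspace are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("DLoader-transform").getOrCreate()
df = spark.read.format("delta").load("hdfs:///staging/child_denorm")

# Step 5: rename and cast columns to match the target schema.
df = (df.withColumnRenamed("dob", "date_of_birth")
        .withColumn("date_of_birth",
                    F.to_date("date_of_birth", "dd-MM-yyyy")))

# Step 6: standardize value formats (gender abbreviations, mobile numbers).
df = (df.withColumn("gender",
                    F.when(F.col("gender").isin("M", "Male", "male"), "M")
                     .otherwise("F"))
        .withColumn("mobile", F.regexp_replace("mobile", r"\D", "")))

# Final load into the target table in append mode (requires the
# spark-cassandra-connector package on the Spark classpath).
(df.write.format("org.apache.spark.sql.cassandra")
   .options(table="child_immunization", keyspace="immunization")
   .mode("append")
   .save())
```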

4 Experimentation

For experimentation purposes, a test bed comprising an HDFS-based cluster with one name node and three data nodes has been set up. The name node is an Intel Xeon server (3.3 GHz, 32 GB RAM, 4 cores, 2 TB HDD), and the data nodes are Intel quad-core workstations with 16 GB RAM and 1 TB HDD each. We generated our own dataset consisting of immunization data for children of the Tamil Nadu state. The population of Tamil Nadu was about 7.7 crore in 2020 [23], and around 30% of them are children, so the number of children in the state is approximately 2.3 crore. The immunization database consists of tables storing details of children, their mothers, and immunizations. The data was generated using a tool called DbGen [24] and MS Excel. DbGen is a Windows-based tool that can be configured with a schema to generate data.

The immunization data in the MySQL database is migrated using our proposed DLoader approach into two NoSQL databases, Cassandra and MongoDB. For a given number of children records, the associated mother records and immunization records are also migrated from the source database. When loading the same number of records, the database sizes of Cassandra and MongoDB differ. The numbers of records used in the dataset, along with the resulting sizes of the Cassandra and MongoDB databases, are shown in Table 1. The number of children records was varied from 10 lakh to 2.5 crore. It is observed that Cassandra compresses the data well and stores the same number of source records in 41% less storage space than MongoDB.

Table 1 Varied sizes of NoSQL databases for storing given relational data

As a second experiment, the performance of both NoSQL databases is analyzed in terms of response time and throughput. Load testing has been done using the JMeter tool [25]. The following two queries have been considered; illustrative forms of both are sketched after the list.

Q1. Retrieve the details of a particular child whose child_id is given.

Q2. Retrieve the immunization details of children in a particular routine immunization session.
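By way of illustration, Q1 and Q2 could be issued against the two target databases as follows; the host names, keyspace, collection, and field values are hypothetical, and the cassandra-driver and pymongo packages are assumed.

```python
from cassandra.cluster import Cluster   # assumes the cassandra-driver package
from pymongo import MongoClient         # assumes the pymongo package

# Q1 on Cassandra: child_id is the partition key, so this is a fast
# single-partition lookup.
session = Cluster(["cassandra-host"]).connect("immunization")
child = session.execute(
    "SELECT * FROM child_immunization WHERE child_id = %s",
    ("C100123",)).one()

# Q1 and Q2 on MongoDB: lookups on (ideally indexed) document fields.
client = MongoClient("mongodb://mongo-host:27017")
child = client.immunization.children.find_one({"child_id": "C100123"})
in_session = client.immunization.children.find(
    {"session_id": "RI-2020-07-14"})
```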

The data load is varied from 0.3 to 8 GB of the source MySQL database and migrated to both Cassandra and MongoDB. For each of the queries Q1 and Q2, the throughput and response time are computed and shown in Table 2. The queries were executed three times under different network loads; the table shows the average throughput and the maximum response time in seconds among the three runs for the Cassandra and MongoDB databases.

Table 2 Performance analysis of Cassandra and MongoDB

The performance of Cassandra and MongoDB under varying data loads for the two queries is shown in Fig. 3. The following observations are made from this experiment.

Fig. 3 Performance analysis of Cassandra and MongoDB (four plots of response time and throughput versus the number of records: response time trends upward and throughput trends downward for both databases)

For query Q1

  • For a 26-fold increase in Cassandra database size, there is only a ninefold increase in response time, and throughput decreases by 77%.

  • For a 26-fold increase in MongoDB size, there is only a 12-fold increase in response time, and throughput decreases by 77%.

For query Q2

  • For a 26-fold increase in Cassandra database size, there is a 12-fold increase in response time, and throughput decreases by 76%.

  • For a 26-fold increase in MongoDB size, there is an eightfold increase in response time, and throughput decreases by 88%.

The rate of increase in response time is higher for Cassandra than for MongoDB for both queries. However, for any database size, the response times of Cassandra for queries Q1 and Q2 are 16% and 50% lower, respectively, than those of MongoDB.

5 Conclusion

The proposed system DLoader for migrating data from SQL to NoSQL databases is a generalized system capable of migrating any SQL schema to a NoSQL database. The Spark framework, which provides fast performance through its built-in parallel processing, is used for data extraction, transformation, standardization, and loading. Spark cluster-based HDFS storage is used in the middle layer to handle large amounts of data at a time and to keep the data readily available during the transformation and mapping process. The proposed approach achieves column-to-column mapping from the source to the target database, and it has been tested for migrating relational data to the NoSQL databases Cassandra and MongoDB. Cassandra consumes less storage space and offers better performance as compared to MongoDB.