
1 Introduction

Big data is a buzzword that usually refers to data which cannot be processed by a single system due to its bulky size, large variety, and high speed of generation. Advances in IT technologies are the primary reason for the generation of big data. At any given time, only a fraction of big data is useful for most application domains. Hence, many experts and researchers have recommended using the cloud for big data to optimally manage and reduce the overall cost of operating such systems. Cloud computing is a model that provides three features to its users, namely dynamism, abstraction, and resource sharing. Generally, a storage structure is defined in the physical data model. A physical data model is a representation of data on the secondary storage device, and it also includes other data structures such as indexes. It further defines the constraints of the database system, such as the data types available to store data, the number of secondary indexes allowed, and so on. As shown in Fig. 1, a physical data model comprises the Message Format, File Structure, Physical Schema, and other entities. Considering the options provided by the physical schema, data in a table may be stored in one of two ways: row order or column order [1].

Fig. 1

Physical data model [1]

The physical schema defines the storage space required to organize the data on secondary storage devices. It also defines the number of indexes and limits the data structures that can be used to create an index. A mathematical model can be used to estimate the storage space required to store data. We found that 1.5 GB of blogging data with three secondary indexes (including a text-search index) was stored by MongoDB in 2.63 GB, which is 1.7 times the original size. Knowing the storage space requirement is critical because it influences the decision of how much storage to buy. Moreover, most cloud service providers limit access to storage by capping the number of IOPS an application may perform. Hence, it is in the best interest of application developers and designers to have detailed knowledge of the physical schema of a data model or database before deciding to host the data on the cloud. In Sect. 2, relevant works on physical schemas, data models, and past attempts to estimate storage size for different physical schemas are discussed. Subsequently, a mathematical model of the storage space requirement for JSON-based databases is proposed. In Sect. 4, a simulation of the derived model is discussed and its results are verified experimentally. Finally, the work is concluded.

2 Literature Review

A true benchmark in the field of large-scale database management systems was set by the relational model of E. F. Codd [2]. Only a few works discuss and suggest new models for evaluating the pros and cons of big data systems. Table 1 lists important trends relating to the evaluation of data models for the period from the early 1970s to the present.

Table 1 Findings and open problems
Table 2 Data models for big data applications
Table 3 Database solutions for big data applications

Three of every four companies have found it necessary to use or shift to big data solutions within the next two years [3]. These industries face the considerable challenge of researching and choosing a big data technology, as there is a large variety of solutions to choose from: 10+ data models (listed in Table 2) and 45+ DBMS systems (listed in Table 3) are available for various applications. However, no single solution fits all the purposes of an industry, so it eventually becomes necessary to combine one or more solutions into a single conglomerated system that solves all the business problems. For instance, the Oracle Big Data System provides its customers with both NoSQL and Hadoop cluster options alongside SQL. A major problem in choosing such technologies is that very few models, such as the Relational, Object-Oriented, and Object-Relational models, have been built on a strong mathematical foundation. Modeling storage is a nontrivial challenge and in many cases demands evaluation of designs. If the resource requirement cannot be justified, it becomes increasingly difficult to monitor the growth of the system's data, and performance can be adversely affected if the scalability issue is not tackled in the right way.

Many prominent tools and technologies have been proposed in the past to estimate the required storage space. MySQL provides a Perl script that estimates the storage space required to store a database on the cluster-based NDB storage engine, based on the space used by the InnoDB storage engine to store the same data [10]. The InnoDB storage engine uses the Barracuda file format. Neo4j, a graph-based database, also provides a calculator to estimate the storage space, main memory, and processing power required at a node to store and process the data [11]. The Neo4j calculator takes the number of nodes, the size of a single node, the number of edges, and the storage size of each edge as input to approximate the required storage space [11].

3 Storage Estimation Model for JSON-Based Databases

JSON has been one of the most influential formats in the migration from RDBMS to NoSQL [12]. JSON has found its place in many application domains with semi-structured and unstructured data [13,14,15,16]. Many databases and solutions have extended JSON to suit their needs; BSON, for example, is a communication and storage protocol used by MongoDB that is derived from JSON.

Fig. 2

A simple JSON document

Fig. 3

Physical schema of MongoDB (BSON)

Figure 2 depicts a JSON document with a single field, “name”, and its value, “Devang”. Figure 3 describes the storage schema of BSON, the communication and storage protocol used by MongoDB, which is derived from JSON. From the figures it is evident that BSON consumes considerably more storage than JSON, owing to the extra information it keeps about the data. This extra information can improve throughput by recording the type and size of each value, helping the I/O processor make smart decisions (where the relevant technologies are available and programmed to use it). Above all, it tells the I/O processor how many bytes to skip to reach the next document, making reads faster. Nevertheless, the increase in the required storage space cannot be ignored.
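
As a rough, self-contained illustration of this overhead (a sketch, not a measurement of MongoDB's on-disk footprint), the snippet below compares the byte count of the Fig. 2 document serialized as plain JSON with a hand-computed size following the published BSON layout; the byte accounting is per the BSON specification, and the figures are illustrative only.

```python
import json

# Hedged sketch: size of the Fig. 2 document as plain JSON vs. the BSON
# layout (int32 document length, 1-byte type tag, NUL-terminated key,
# int32 string length, NUL-terminated value, trailing 0x00).
doc = {"name": "Devang"}

json_bytes = len(json.dumps(doc, separators=(",", ":")).encode("utf-8"))

key, value = "name", "Devang"
bson_bytes = (
    4                   # int32: total document length
    + 1                 # type tag (0x02 = UTF-8 string)
    + len(key) + 1      # key as a NUL-terminated C-string
    + 4                 # int32: length of the value (including its NUL)
    + len(value) + 1    # value bytes plus NUL terminator
    + 1                 # trailing 0x00 that closes the document
)

print(f"JSON: {json_bytes} bytes, BSON: {bson_bytes} bytes, "
      f"overhead factor ≈ {bson_bytes / json_bytes:.2f}")
```

For this one-field document the per-document overhead is modest; it is the repetition of field names and metadata in every stored document that the model in the next section quantifies.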

We propose to derive a model that can help estimate the factor by which the storage size of JSON increases in comparison with the storage size required by CSV. Although the model is derived for JSON, it is applicable across all databases and solutions that use JSON or its derivatives (e.g., BSON, MessagePack, etc. [17]).

The storage estimation model is explained by considering the physical schemas of CSV and JSON. For the purpose of modeling the storage space requirement, we compare against flat-file storage in CSV as the raw storage size because, of all available formats, CSV has been the most commonly used physical schema in the literature, owing to its simplicity and the high level of human readability it offers [18,19,20,21].

Consider a source S, which emits data at regular intervals. This data may be stored in Table T with following properties:

  • A Table T consists of N columns and R rows.

  • The ith column of the table holds, on average, \(b_{i}\) bytes of data.

  • The average total number of bytes per row of the table is \(B = \sum _{i=1}^{N} b_{i}\).

  • The header of the ith column is \(c_{i}\) bytes in size.

For simplicity, we assume that the source releases data at regular intervals; more generally, the source can be considered to follow some distribution when generating data. Thus, the number of rows R of Table T can be approximated from this previously observed distribution. Because data generation is a characteristic of the source, the maximum number of bytes required to store each value can also be estimated; that is, the values \(b_{i}\) can be obtained from the source itself, and N, the number of columns to be stored in the table, is known from the structure of the data the source emits. Knowing \(b_{i}\), R, and N, we can compute B. Finally, the size of each column header \(c_{i}\) is known, since the developer or DBA decides the column names.
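
As a minimal sketch (assuming the source can be sampled into a small list of rows; the field names and values below are hypothetical placeholders, not taken from the dataset), the per-column averages \(b_{i}\), the header sizes \(c_{i}\), and B could be estimated as follows:

```python
# Minimal sketch, assuming the source can be sampled into a small list of
# rows; the field names and values below are hypothetical placeholders.
sample_rows = [
    {"pickup": "2016-01-01 00:00:01", "fare": "7.5", "passengers": "1"},
    {"pickup": "2016-01-01 00:10:42", "fare": "12.0", "passengers": "2"},
]

columns = list(sample_rows[0])                          # the N column names
c = {col: len(col.encode("utf-8")) for col in columns}  # header sizes c_i

# Average data size b_i per column over the sample, and B = sum of b_i.
b = {
    col: sum(len(str(row[col]).encode("utf-8")) for row in sample_rows)
    / len(sample_rows)
    for col in columns
}
B = sum(b.values())

print("c_i:", c)
print("b_i:", b)
print("B (average bytes of data per row):", B)
```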

CSV organizes the data in row order: the column names appear on the first line and all successive lines store the data. The amortized size of the column headers stored in a CSV file is therefore \(\sum _{i=1}^{N} c_{i}\) bytes, and since B bytes is the average size of a row, the data itself takes \(B \times R\) bytes. Hence, for a CSV store the size of the data is \(CSV\_Size = (B \times R) + \sum _{i=1}^{N} c_{i}\) bytes.

In JSON-based stores, each row is stored in the format {column1 name: value, column2 name: value, ...}, as shown in Fig. 3. Hence, the size of each row in such a physical schema is \((B + \sum _{i=1}^{N} c_{i})\) bytes. For R rows in the table, the size of the database is \(MC\_Size = R \times (B + \sum _{i=1}^{N} c_{i})\) bytes. Thus, the ratio of the storage size of a JSON-based store to that of CSV is \(\frac{R \times (B + \sum _{i=1}^{N} c_{i})}{(B \times R) + \sum _{i=1}^{N} c_{i}}\).
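
A minimal sketch of these two expressions and their ratio follows, using only the symbols defined above; the example column sizes are hypothetical, chosen purely for illustration.

```python
# Minimal sketch of CSV_Size, MC_Size, and their ratio, using the symbols
# defined above (R rows, per-column data sizes b_i, header sizes c_i).
# Separator and newline bytes are ignored, as in the model.
def csv_size(b, c, R):
    return sum(b) * R + sum(c)          # CSV_Size = B*R + sum(c_i)

def json_store_size(b, c, R):
    return R * (sum(b) + sum(c))        # MC_Size = R * (B + sum(c_i))

def size_ratio(b, c, R):
    return json_store_size(b, c, R) / csv_size(b, c, R)

# Hypothetical example: three columns averaging 10, 20, and 30 bytes of data,
# each with a 6-byte column name.
b_i, c_i = [10, 20, 30], [6, 6, 6]
for R in (1, 100, 10_000):
    print(R, round(size_ratio(b_i, c_i, R), 3))
```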

4 Experiment

The experimental evaluation was conducted with a simulation using a total column-header size of 136 bytes and an average row size of 474 bytes, for a varying number of rows, based on the NYC Taxi cab dataset [23], which is used for analyzing taxi traffic patterns to reduce pollution. To obtain the average size of the column headers, we created a dummy document with all values NULL or unset; this served as a reference, since we are only interested in an amortized comparison of the storage size requirement. The results obtained from the simulation are presented in Fig. 4. Figure 4 is a CDF whose corresponding PDF is exponential, which suggests that an exponential increase in MongoDB storage size can be observed when the size of the raw data increases linearly.
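
The simulation setting can be sketched as follows, assuming \(B = 474\) bytes and \(\sum c_{i} = 136\) bytes as stated above; the specific values of R are illustrative, since the exact points plotted in Fig. 4 are not given in the text.

```python
# Minimal sketch of the simulation setting described above: B = 474 bytes of
# data per row and sum(c_i) = 136 bytes of column names, with the number of
# rows R varied. The R values below are illustrative.
B, C = 474, 136

def ratio(R):
    return (R * (B + C)) / (B * R + C)

for R in (1, 10, 100, 1_000, 100_000):
    print(f"R = {R:>7}: JSON/CSV size ratio = {ratio(R):.3f}")

# As R grows, the ratio approaches (B + C) / B, about 1.29 for these inputs.
```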

Fig. 4

Simulation: ratio of MongoDB to CSV data size

Table 4 Ratio of MongoDB to CSV data size
Fig. 5

Experimental evaluation: ratio of MongoDB data size to CSV data size (in GB)

Table 5 MongoDB throughput (wall clock time)

The results of the simulation were verified by inserting the NYC Yellow Taxi dataset into the big data solution MongoDB (a JSON-based store) using the WiredTiger storage engine. MongoDB was used for the experiment because it is an open-source solution, it uses a JSON-like physical schema named BSON, and it is an extremely popular NoSQL data store [24]. When the data was stored in MongoDB, the stored size grew to 1.4 times the storage space used by CSV, as shown in Table 4. The results of the experiments are shown in Fig. 5, which confirms the trend suggested by the model. Thus, using the model and simple arithmetic we can devise a storage factor for estimating the storage space required by JSON and its derivatives (Table 5).

Above all, the experiment showed that MongoDB takes on average 10–13 min to import a CSV file of size 1.6 GB on a standard, non-commercial-grade 5400 RPM hard drive in a machine with 8 GB of RAM and a 6th-generation Intel Core i5 processor.

5 Conclusion

This paper has listed 14 data models and 45+ databases that provide a glimpse of the wide range of solutions available in the market for different big data applications. The authors have also proposed a model that determines the storage size of JSON-based stores from their physical schema. It has also been shown that the increase in disk utilization is due to the need to store schema and version information in the table so that semi-structured or unstructured data can be stored. This increased disk usage with respect to the raw size shows an exponential increase as the size of the data grows. In the near future, comprehensive research on unifying structured, semi-structured, and unstructured data from different data inception points needs to be carried out. This research should approach the problem from the perspective of storage and of achieving QoS with minimum resources, so that it assists decision makers in making an optimal choice for their application. Finally, the WiredTiger storage engine of MongoDB takes 1.4 times the space of a CSV file for the NYC Taxi Cab dataset, including a primary index. The proposed model deviated from the experimental values by 5 to 11%.