Abstract
Numerous technologies have been proposed for storing big data on Cloud platforms, but the choice among them is always application specific. Determining a suitable data model is a perplexing task, which makes it necessary for architects and designers to review their requirements before choosing a solution. This paper presents 14 data models available in the market to choose from. More than 45 database solutions are available, most of which can be categorized under one of these data models, each applicable to its own set of use cases (a few products could not be categorized under any of the 14). We have found that when schemaless information is stored, the size of data in the database exceeds the original size; metadata and the physical schema are the two factors responsible for this overhead. The mathematical model and experimental evaluation presented here show that MongoDB requires storage space many times the original size of the data. A storage space estimation equation for JSON-based solutions is suggested, which compares the storage requirement against the space required by CSV as a baseline. It may be used to estimate the approximate amount of storage space required by an application before buying storage in a Cloud environment.
1 Introduction
Big data is a buzzword that usually denotes enormous data which cannot be processed by a single system owing to its bulky size, large variety, and high speed of generation. Advancement in IT technologies is the primary reason for the generation of big data. At any given time, only a fraction of big data is useful to most application domains. Hence, many experts and researchers have recommended the use of the Cloud for big data, to optimally manage such systems and reduce their overall operating cost. Cloud computing is a model which caters three services to its users, namely dynamism, abstraction, and resource sharing. Generally, a storage structure is defined in the physical data model. A physical data model is a representation of data on the secondary storage device; it also includes other data structures such as indexes. It further defines the constraints of the database system, such as the data types available to store data and the number of secondary indexes allowed. As shown in Fig. 1, a physical data model comprises the message format, file structure, physical schema, and other entities. Depending on the options provided by the physical schema, data in a table may be stored in one of two ways: row order or column order [1].
The physical schema defines the storage space required to organize the data on secondary storage devices. It also defines the number of indexes and limits the data structures which can be used to create them. A mathematical model can be used to estimate the size of storage space required to store the data. We found that 1.5 GB of blogging data with three secondary indexes (including a text search index) was stored by MongoDB in 2.63 GB, 1.7 times the original size. Knowing the storage space requirement is critical because it impacts the decision of how much storage to buy. Moreover, most Cloud service providers limit access to storage by capping the number of IOPS an application may perform. Hence, it is in the best interest of application developers and designers to have detailed knowledge of the physical schema of a data model or database before deciding to host the data on the Cloud. In Sect. 2, relevant work on physical schemas, data models, and past attempts to estimate storage size for different physical schemas is discussed. Subsequently, a mathematical model of the storage space requirement for JSON-based databases is proposed. In Sect. 4, a simulation of the derived model is discussed and the results are verified experimentally. Finally, we conclude the work.
2 Literature Review
A true benchmark in the field of large-scale database management systems was set by the relational model of E. F. Codd [2]. Only a few works discuss and suggest new models for evaluating the pros and cons of big data systems. Table 1 lists important trends in the evaluation of data models from the early 1970s to the present.
Three of every four companies have found it necessary to use, or shift to, big data solutions within the next 2 years [3]. These industries face the great challenge of researching and choosing a big data technology from a large variety of solutions: 10+ data models (listed in Table 2) and 45+ DBMS systems (listed in Table 3) are available for various applications. However, no single solution fits all purposes of an industry, so it eventually becomes necessary to combine one or more solutions into a single conglomerated system that solves all the business problems. For instance, the Oracle Big Data System provides both NoSQL and Hadoop cluster options to its customers, along with SQL. A major problem in choosing such technologies is that very few models, such as the relational, object-oriented, and object-relational models, have been built on a strong mathematical foundation. Modeling storage is a nontrivial challenge and in many cases demands evaluation of designs. If the resource requirement cannot be justified, it becomes increasingly difficult to monitor the growth of the system's data, and performance may be adversely affected if the scalability issue is not tackled in the right way.
Several prominent tools and technologies have been proposed in the past to estimate the required storage space. MySQL provides a Perl script that estimates the storage space needed to store a database on its cluster-based storage engine, NDB, based on the space used by the InnoDB storage engine (which uses the Barracuda file organization) to store the same data [10]. Neo4j, a graph-based database, provides a calculator to estimate the storage space, main memory, and processing power required at a node to store and process the data; it takes the number of nodes, the size of a single node, the number of edges, and the storage size of each edge as input to approximate the required storage space [11].
3 Storage Estimation Model for JSON-Based Databases
JSON has been one of the most influential formats in the migration from RDBMS to NoSQL [12]. It has found its place in many application domains with semi-structured and unstructured data [13,14,15,16]. Many databases and solutions have extended JSON to suit their needs; BSON, for example, is a communication and storage protocol derived from JSON and used by MongoDB.
Figure 2 depicts a JSON document with a single field, “name”, and its value, “Devang”. Figure 3 describes the storage schema of BSON. From the figures, it is evident that BSON consumes much larger storage than JSON, owing to the extra information it keeps alongside the data. This extra information does help increase throughput by recording the type and size of each value, letting the I/O processor make smart decisions (where the relevant technologies are available and programmed to use it). Above all, it tells the I/O processor how many bytes to skip to find the next document, making reads faster. Nevertheless, one cannot ignore the increase in the amount of storage space required.
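The per-field overhead can be made concrete. The following sketch, based on the published BSON wire format (not on MongoDB internals), computes the encoded size of a flat document whose values are all strings and compares it with compact JSON:

```python
import json

def bson_size(doc):
    """Approximate BSON size of a flat, string-valued document.

    Per the BSON spec: a 4-byte int32 total length, then per element
    one type byte (0x02 for string), the field name as a NUL-terminated
    cstring, a 4-byte int32 value length, the UTF-8 value plus NUL,
    and finally a trailing 0x00 document terminator.
    """
    size = 4 + 1  # int32 document length + trailing terminator byte
    for name, value in doc.items():
        size += 1                        # element type byte
        size += len(name.encode()) + 1   # field name cstring
        size += 4                        # int32 string-length prefix
        size += len(value.encode()) + 1  # UTF-8 value + NUL
    return size

doc = {"name": "Devang"}
print(bson_size(doc))  # 22 bytes
print(len(json.dumps(doc, separators=(",", ":")).encode()))  # 17 bytes
```

Even for this one-field document, BSON's type byte, length prefixes, and terminators add roughly 30% over compact JSON, which is exactly the trade-off discussed above.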
We derive a model that estimates the factor by which the storage size of JSON increases in comparison with the storage size required by CSV. Although the model is derived for JSON, it is applicable across all databases and solutions that use JSON or its derivatives (e.g., BSON, MessagePack,Footnote 1 etc. [17]).
The storage estimation model is explained by considering the physical schemas of CSV and JSON. For modeling the storage space requirement, we take flat-file storage in CSV as the raw (baseline) storage size because, of all available formats, CSV has been the one most commonly used in the literature, owing to its simplicity and high human readability [18,19,20,21].
Consider a source S, which emits data at regular intervals. This data may be stored in Table T with following properties:
- A Table T consists of N columns and R rows.
- Each column of the table contains on average \(b_{i}\) bytes of data for the ith column.
- The total number of bytes per row of the table is on average \(B = \sum _{i=1}^{N} b_{i}\).
- Each column header is \(c_{i}\) bytes for the ith column.
For simplicity, we assume that the source releases data at regular intervals; more generally, the source can be considered to follow some distribution when generating data. Thus, the number of rows R of Table T can be approximated from this previously obtained distribution. Since generating the data is a characteristic of the source, the maximum number of bytes required to store each data item can also be estimated; hence the values of \(b_{i}\), and the number of columns N, are obtained from the source itself. Knowing \(b_{i}\), N, and R, we can compute B. Finally, the size \(c_{i}\) of each column header can be measured directly, since the developer or DBA decides the column names.
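In practice these parameters can be estimated from a small sample of the source's output. A minimal sketch (the column names and sample rows below are purely illustrative):

```python
# Estimate the model parameters b_i, B, and c_i from sample rows.
headers = ["trip_id", "fare"]          # column names chosen by the DBA
sample_rows = [["1001", "12.50"],      # illustrative sample emitted
               ["1002", "7.00"]]       # by the source

N = len(headers)
# b_i: average bytes of data in column i across the sample
b = [sum(len(row[i]) for row in sample_rows) / len(sample_rows)
     for i in range(N)]
B = sum(b)                             # average bytes per row
c = [len(h) for h in headers]          # c_i: bytes per column header
print(b, B, c)                         # [4.0, 4.5] 8.5 [7, 4]
```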
CSV organizes the data in row order: the column names appear on the first line, and all successive lines store the data. The amortized sizeFootnote 2 of the column headers stored in a CSV file is \(\sum _{i=1}^{N} c_{i}\), and since a row averages B bytes, the data takes \(B \times R\) bytes. Hence, for a CSV store the size of the data is \(CSV\_Size = (B \times R) + \sum _{i=1}^{N} c_{i}\) bytes.
In JSON-based stores, each row is stored in the format {column1 name: value, column2 name: value, ...} as shown in Fig. 3. Hence, the size of each row in such a physical schemaFootnote 3 is \((B + \sum _{i=1}^{N} c_{i})\) bytes, and for R rows in the table the size of the database is \(MC\_Size = R \times (B + \sum _{i=1}^{N} c_{i})\) bytes. Thus, the ratio of the storage size of a JSON-based store to that of CSV is \(\frac{R \times (B + \sum _{i=1}^{N} c_{i})}{(B \times R) + \sum _{i=1}^{N} c_{i}}\).
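The two size equations and their ratio translate directly into code (the parameter values below are illustrative, not taken from the experiment):

```python
def csv_size(B, R, c):
    # (B * R) bytes of data + one header line of column names
    return B * R + sum(c)

def json_size(B, R, c):
    # every row repeats the column names alongside the data
    return R * (B + sum(c))

B, R, c = 100, 1000, [20, 15]   # avg row bytes, rows, header sizes
ratio = json_size(B, R, c) / csv_size(B, R, c)
print(ratio)  # ~1.35: the JSON store needs ~35% more space than CSV
```

Note that as R grows, the header term in the CSV denominator becomes negligible and the ratio approaches \((B + \sum c_{i})/B\).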
4 Experiment
The experimental evaluation was conducted with a simulation using a total column-header size of 136 bytes and a row size of 474 bytes, for a varying number of rows, on the NYC Taxi Cab dataset [23], which is used for analyzing taxi traffic patterns in order to reduce pollution. To obtain the average column size, we created a dummy document with all values NULL or not set; this serves as a reference, since we are only after an amortized comparison of the storage size requirement. Figure 4 presents the results of the simulation as a CDF whose corresponding PDF is exponential, which suggests that an exponential increase in MongoDB storage size can be expected as the size of the raw data increases linearly.
The results of the simulation were verified by inserting the NYC Yellow Taxi data into the big data solution MongoDB (a JSON-based store) using the WiredTiger storage engine. MongoDB was chosen for the experiment because it is open source, uses a JSON-like physical schema named BSON, and is an extremely popular NoSQL data store [24]. On storing the data in MongoDB, the stored size grew to 1.4 times the storage space used by CSV, as shown in Table 4. The results of the experiments, shown in Fig. 5, confirm the trend suggested by the model. Thus, using the model and simple math we can devise a storage factor for estimating the size of storage space required by JSON and its derivatives (Table 5).
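As a quick sanity check, the model's limiting ratio \((B + \sum c_{i})/B\) can be computed from the Sect. 4 parameters (row size 474 bytes, total header size 136 bytes) and compared against the measured MongoDB/CSV factor of 1.4 from Table 4:

```python
B, C = 474, 136            # avg row bytes and total header bytes (Sect. 4)
predicted = (B + C) / B    # limiting JSON/CSV ratio for large R
observed = 1.4             # measured MongoDB-to-CSV factor (Table 4)
error = abs(observed - predicted) / observed
print(round(predicted, 3), round(error * 100, 1))  # 1.287 8.1
```

The roughly 8% deviation falls within the 5–11% range reported in the conclusion; the remainder is attributable to BSON's per-element type and length metadata, which the pure-JSON model does not count.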
Above all, the experiment showed that MongoDB takes, on average, 10–13 min to import a CSV file of 1.6 GB on a standard non-commercial-grade 5400 RPM hard drive, on a machine with 8 GB RAM and a 6th-generation Intel Core i5 processor.
5 Conclusion
This paper has listed 14 data models and 45+ databases, giving a glimpse of the wide range of solutions available in the market for different big data applications. We have also proposed a model that determines storage size from the physical schema of JSON-based stores. It has been shown that the increase in disk utilization is due to the need to store schema and version information in the table so as to allow semi-structured or unstructured data. This increase in disk usage relative to the raw size grows exponentially as the size of the data increases. In the near future, comprehensive research on uniting structured, semi-structured, and unstructured data from different data inception points needs to be carried out, from the perspective of storage and of achieving QoS with minimum resources, so that it assists decision makers in making an optimal choice for their application. Finally, the WiredTiger storage engine of MongoDB takes 1.4 times more space than a CSV file for the NYC Taxi Cab dataset, including a primary index, and the proposed model deviated from the experimental values by 5 to 11%.
Notes
- 1.
MessagePack is JSON-like but comparatively smaller in size [22].
- 2.
We use the term amortized because we do not consider the size of other characters such as commas, carriage returns, spaces for null values, and other special characters.
- 3.
We do not include commas, other special characters, and null values, since we are only after a rough estimate.
References
Whitehouse, O.: FEA Consolidated Reference Model Document (2005)
Codd, E.F.: A relational model of data for large shared data banks. Communications of the ACM 13(6), 377–387 (1970)
Gartner.com: Gartner report
Gibson, G.A., Vitter, J.S., Wilkes, J.: Strategic directions in storage i/o issues in large-scale computing. ACM Computing Surveys (CSUR) 28(4), 779–793 (1996)
Stonebraker, M., Hellerstein, J.: What goes around comes around. Readings in Database Systems 4 (2005)
Rabl, T., Gómez-Villamor, S., Sadoghi, M., Muntés-Mulero, V., Jacobsen, H.A., Mankovskii, S.: Solving big data challenges for enterprise application performance management. Proceedings of the VLDB Endowment 5(12), 1724–1735 (2012)
Demirkan, H., Delen, D.: Leveraging the capabilities of service-oriented decision support systems: Putting analytics and big data in cloud. Decision Support Systems 55(1), 412–421 (2013)
Chen, C.P., Zhang, C.Y.: Data-intensive applications, challenges, techniques and technologies: A survey on big data. Information Sciences 275, 314–347 (2014)
Kambatla, K., Kollias, G., Kumar, V., Grama, A.: Trends in big data analytics. Journal of Parallel and Distributed Computing 74(7), 2561–2573 (2014)
Ndbcluster size requirement estimator. https://dev.mysql.com/doc/refman/5.7/en/mysql-cluster-programs-ndb-size-pl.html, accessed: 2016-09-30
Hardware sizing calculator. https://neo4j.com/hardware-sizing/, accessed: 2016-09-30
Padhy, R.P., Patra, M.R., Satapathy, S.C.: Rdbms to nosql: reviewing some next-generation non-relational databases. International Journal of Advanced Engineering Science and Technologies 11(1), 15–30 (2011)
Aho, A.V., Sethi, R., Ullman, J.D.: Compilers: Principles, Techniques. Addison-Wesley, Boston (1986)
Bollacker, K., Evans, C., Paritosh, P., Sturge, T., Taylor, J.: Freebase: a collaboratively created graph database for structuring human knowledge. In: Proceedings of the 2008 ACM SIGMOD international conference on Management of data. pp. 1247–1250. AcM (2008)
World Wide Web Consortium, et al.: JSON-LD 1.0: a JSON-based serialization for linked data (2014)
Finn, R.D., Mistry, J., Tate, J., Coggill, P., Heger, A., Pollington, J.E., Gavin, O.L., Gunasekaran, P., Ceric, G., Forslund, K., et al.: The pfam protein families database. Nucleic acids research p. gkp985 (2009)
del Alba, L.: Data serialization comparison: Json, yaml, bson, messagepack. https://www.sitepoint.com/data-serialization-comparison-json-yaml-bson-messagepack/, accessed: 2016-09-26
Cook, K.B., Kazan, H., Zuberi, K., Morris, Q., Hughes, T.R.: Rbpdb: a database of rna-binding specificities. Nucleic acids research 39(suppl 1), D301–D308 (2011)
Cranford, K.: How to excel with sas. In: Proceedings of the 28 th Annual SCSUG Conference, Austin, Texas, September (2007)
Shafranovich, Y.: Common format and mime type for comma-separated values (csv) files (2005)
Sharma, T.C., Jain, M.: Weka approach for comparative study of classification algorithm. International Journal of Advanced Research in Computer and Communication Engineering 2(4), 1925–1931 (2013)
Messagepack. http://msgpack.org/index.html, accessed: 2016-09-26
NYC Taxi &amp; Limousine Commission: TLC yellow taxi trip record data. http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml, accessed: 2016-09-30
DB-engines.com: Dbms rankings 2017 (2016)
© 2018 Springer Nature Singapore Pte. Ltd.
Swami, D., Sahoo, B. (2018). Storage Size Estimation for Schemaless Big Data Applications: A JSON-Based Overview. In: Hu, YC., Tiwari, S., Mishra, K., Trivedi, M. (eds) Intelligent Communication and Computational Technologies. Lecture Notes in Networks and Systems, vol 19. Springer, Singapore. https://doi.org/10.1007/978-981-10-5523-2_29
Print ISBN: 978-981-10-5522-5
Online ISBN: 978-981-10-5523-2