1 Introduction

With the rise of big data, applications need to change their schemas more frequently, making schema evolution crucial. This demand has given rise to NoSQL databases, a new category designed to overcome the limitations of traditional relational databases in handling big data and real-time applications characterized by massive data generation (volume) and diverse data formats (variety). NoSQL is an umbrella term for numerous non-relational database types. The four popular categories of NoSQL are document-based, column-based, key-value-based, and graph-based [1, 2]. These four categories share a similar logical structure, a key followed by a value; however, they differ in data modeling, data architecture, query languages, and APIs. Typically, the performance of these four categories depends on the use case.

Unstructured data collected from sources like sensors, social media, and natural language processing (NLP) holds valuable insights [3]. To extract these insights, new data storage solutions like Hadoop and NoSQL databases have emerged [4]. These technologies are extensively applied in domains such as the Internet of Things and by companies such as Facebook, Google, and Netflix [5, 6]. The increasing adoption of NoSQL databases for big data is driven by their ability to manage massive volumes of data without a predefined schema. NoSQL databases excel in handling unstructured and semi-structured data, aligning with the variety criterion of big data. Horizontal scalability using sharding and replication [7] is another key aspect addressed by NoSQL databases, allowing data to be distributed across multiple nodes to accommodate large volumes. Schema flexibility and horizontal scalability ensure efficient storage and processing without compromising performance. The proposed work is aligned with the variety and volume criteria of big data.

Although NoSQL database flexibility enables rapid initial development, since the application does not need to define a specific structure in advance [6, 8], the schema decision should be made early because (a) the application's overall performance depends on the schema choice, and wrong choices can affect several aspects of application quality, such as data redundancy, navigation cost, data access cost, and maintainability; (b) it is challenging to fix a poorly designed data model after an application has been developed; and (c) with a poorly designed schema, some queries may require excessive execution time or cannot be executed at all. Therefore, it is preferable to spend time in advance designing a data model that is scalable, extensible, and maintainable throughout the application's lifetime.

The flexibility of NoSQL databases empowers developers and organizations to store and manipulate data according to their specific requirements. As a result, there can be numerous schema alternatives that model the same information [9]. Analyzing and comparing these alternatives manually can be complex and time-consuming [8, 10]. Thus, there is a need for an automated tool or model that can evaluate various factors and recommend an optimal schema from the available alternatives. Two existing approaches automate this task: Workload-Agnostic and Workload-Driven.

The Workload-Agnostic approach [11,12,13] creates the database schema without considering any specific workload or usage patterns. The objective is to offer flexibility and adaptability to handle a variety of queries and data. However, this approach may not optimize performance for particular query patterns or workloads because it does not consider their specific characteristics. In a Workload-Driven approach [10, 14,15,16,17], on the other hand, the database schema is created for a specific workload or usage pattern. The schema design is influenced by the types of queries expected to be executed frequently, the data access patterns, and the workload's performance requirements. The goal is to optimize the schema design to improve query performance, reduce latency, and improve the overall system's efficiency. In our study, we have chosen a workload-driven approach to design an automated model that considers the workload queries and anticipated data volume to provide an optimal schema solution. We design a schema that best meets the performance requirements and efficiency goals by analyzing the query characteristics of the workload.

This paper proposes an automated model that transforms a conceptual model into an optimal logical schema design with the aid of labels. It consists of three parts: model input, intermediate transformation, and final schema generation. The model input consists of the EER model and the application workload; the application queries and the estimated data volume comprise the application workload. The intermediate transformation includes the generation of query graphs and query labels: the application queries are first transformed into query graphs, and the query graphs are then transformed into query labels using the data volume. The generation of query labels involves three steps: Label Categorization, Action Association, and Prioritization. The final schema generation consists of two parts: (a) generation of a Schema Graph and label assignment, and (b) transformation into a logical schema. The EER model is first converted into a graph model named the schema graph. The derived query labels are then assigned to the edges of the schema graph. Finally, the schema graph and labels are used to transform the EER model into an optimized logical schema based on the actions defined for each label. The working of the proposed model is evaluated through a case study in the eCommerce domain. We have picked MongoDB because it is the most popular document store [18] and is used in various applications, including eCommerce and mobile applications.

In this paper, we have made the following significant contributions:

  (a) The paper uses the application workload to generate NoSQL document logical schemas from the conceptual model. The workload information is provided by the designer in terms of estimated total data volume and queries.

  (b) The proposed model uses query graphs, query labels, and schema graphs to transform conceptual inputs into logical schemas.

  (c) Query graphs are generated from workload queries and are used to analyze query characteristics. Query labels represent the investigated query characteristics.

  (d) The derived query labels and the schema graph are used to design the logical schema for NoSQL document stores.

  (e) To evaluate the proposed model, experiments are conducted through a case study in the eCommerce domain.

  (f) The results show that the proposed model reduces query response time and accelerates data retrieval for the workload queries.

The remainder of the paper is organized as follows. Section 2 reviews related work; Sect. 3 presents the proposed model in detail; Sect. 4 presents the experimental evaluation; and Sect. 5 concludes the paper.

2 Related work and motivation

In the realm of Big Data applications, the large volume, variety, and velocity of data often surpass the capabilities of traditional relational databases [19]. NoSQL databases, such as MongoDB, Cassandra, HBase, and Neo4j, have emerged as vital technologies to overcome these challenges. They offer flexible data models, horizontal scalability, and high-performance data processing, making them well-suited for managing heterogeneous data and massive amounts of data in distributed environments [5, 6]. Distributed databases also support supercomputing by providing the necessary infrastructure and capabilities for large-scale data processing and high-performance computing workloads [20,21,22,23].

Many tools are available in the market for data modeling of traditional databases such as relational databases [24, 25]. However, these tools cannot be applied directly to NoSQL databases due to data modeling differences (normalized versus denormalized formats, respectively). The authors of [26, 27] comprehensively analyze the design requirements of NoSQL databases. Uta et al. [28] present case studies on top-down, bottom-up, and reverse engineering approaches for schema management in NoSQL databases. According to Paola Gomez et al. [29], the performance of a NoSQL system is determined by the selection of an appropriate schema design among all the design options. Similarly, Mior [30] states that the performance of a NoSQL database depends on the choice of an appropriate schema design and proposes a manual cost-based model based on workload queries for the physical optimization of column-based data stores. However, choosing the most suitable schema among all the possible alternatives (schema optimization) is difficult to perform manually. From this initial study, we identify the following research gaps:

  (a) Unlike relational databases, NoSQL databases allow various data structure alternatives, and choosing among them remains an ongoing research problem. Numerous researchers are working in this field [9, 16, 17, 31, 32].

  (b) The inherent flexibility and schema-less nature of NoSQL databases give rise to multiple schema design alternatives. For example, consider a scenario in which two entities representing students (S) and their faculty (F) are related by a one-to-many relationship \((r_{1})\). Relationship \((r_{1})\) can be materialized by nesting or referencing information from the related entities. Hence, there are multiple ways of schema design (S1 to S8) to store this information in document stores, as shown in Fig. 1 (adapted from [9]); an illustrative sketch of two of these alternatives is given after the figure. The choice among these schema designs depends on many factors, like data retrieval cost, query access patterns, and user needs. Manual schema design, typically guided by trial-and-error or ad-hoc methods, is time-consuming and does not guarantee an optimal design among the various alternatives. A recent study [31] found that only 9% of database experts identified the optimal design among these possibilities. This evidence shows that the current manual way of database design does not yield the expected results, even for the minimal scenario taken as an example. Consequently, automation becomes crucial for streamlining this complex process, reducing time requirements, and selecting the most suitable schema design from the available options.

  (c) Numerous researchers have adopted different methodologies to convert conceptual to logical schema designs. We have studied the existing models and compared them based on common characteristics, namely conceptual schema, additional inputs, conversion methodology, target model, and automation, as shown in Table 1. We have categorized the existing work into Workload-Agnostic (WA) and Workload-Driven (WD) approaches. WA does not consider the application workload, meaning the schema is designed without considering the specific queries or operations the application will perform on the database. In contrast, WD considers the application workload for NoSQL schema design. These methodologies consider the specific workload requirements, such as the types of queries, patterns, or operations the application is expected to perform on the database. By considering the workload, the schema can be better optimized to support the application's specific needs and improve performance.

Fig. 1 Schema design alternatives (S1-S8) in Document stores for ER model
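To make these alternatives concrete, the sketch below shows two of the designs from Fig. 1 as they might appear in a document store: one that embeds Student documents inside their Faculty document, and one that keeps separate collections connected by a reference. The collection and field names are illustrative and are not taken from [9].

```python
# Two illustrative document-store designs for the Student (S) / Faculty (F)
# one-to-many relationship r1. All field names are hypothetical.

# Alternative 1: embed students inside the faculty document (denormalized).
faculty_embedded = {
    "_id": "F1",
    "name": "Computer Science",
    "students": [
        {"roll_no": "S1", "name": "Asha"},
        {"roll_no": "S2", "name": "Ravi"},
    ],
}

# Alternative 2: keep two collections and reference the faculty from each student.
faculty_referenced = {"_id": "F1", "name": "Computer Science"}
students_referenced = [
    {"_id": "S1", "name": "Asha", "faculty_id": "F1"},
    {"_id": "S2", "name": "Ravi", "faculty_id": "F1"},
]
```

Embedding favors queries that read a faculty together with its students, while referencing keeps student writes independent of the faculty document; the remaining alternatives in Fig. 1 combine these two mechanisms in different directions.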

Table 1 Comparison of related work for NoSQL data modeling

2.1 Workload-agnostic (WA) approach

Li [11] gives heuristics for converting a relational schema into an HBase (NoSQL column store) schema. Similarly, the authors of [12] design a heuristics-based method for converting RDB to document stores using relationship types. Imam et al. [33, 34] propose an automated schema suggestion model for document databases, and Imam et al. [13] give manual heuristics-based guidelines to translate the ER model to document stores using relationship type and cardinality. The drawback of these works is that they are workload-agnostic (WA), i.e., they do not consider the application workload. Therefore, they do not guarantee an optimal schema design and can hamper the application's performance.

2.2 Workload-driven (WD) approach

Chebotko [35] offers the first workload-driven (WD) design method for mapping the ER model to Cassandra (NoSQL column store). The mapping is done based on the application workflow by taking the ERQL queries. The proposed technique improves the performance of read operations while decreasing the performance of write operations. Tianyu Jia et al. [36] use graphs and DAGs during schema migration from a relational schema to MongoDB. The graph is generated with the help of tags, and relational logs are used to define the tags on the ER model. The authors use a threshold to calculate the tags, but the lack of information about how the threshold is chosen makes the approach difficult to reproduce. Mior et al. [10] propose a tool that recommends schema designs for column stores. They use the ER model along with the workload queries; the query frequencies and the volume of data in each candidate plan are analyzed to suggest the best solution. However, the work applies to column stores only. The authors of [14, 37] provide a logical mapping from the conceptual model (EER, Extended Entity-Relationship) using initial workload information (in terms of the estimated number of data instances and primary query operations). They develop several rules based on workload data to map entities and relationships from EER to MongoDB. However, they consider the workload in the form of data volume only, which is insufficient to design an optimal schema. Vincent Reniers et al. [38] use the MongoDB schema to generate workload queries and the ER model. The authors also consider various dimensions during schema generation, but the model and methodology are not described clearly and are not automated.

Similarly, Ali et al. [17, 39] design a schema recommendation model based on query patterns. The authors translate the workload queries into query path graphs, which are then translated into a logical schema using various rules designed by the authors. They perform embedding for document stores but do not consider referencing during denormalization. The authors of [40, 41] use a canonical representation to suggest a denormalized model using application queries. The proposed model only applies to document stores and is difficult for novice users to adopt because it requires the estimated storage space as an input.

Similarly, Paola et al. [9] study various data structuring alternatives using software product line strategies and feature models. They develop a set of structural metrics to analyze the characteristics of these alternatives. Their work aims to propose a model that enables the automatic generation of multiple suitable data structure alternatives based on an initial UML model. The challenge of this work is accurately analyzing data structuring alternatives and generating a comprehensive set of suitable options while considering factors like performance, scalability, and maintainability.

As summarized in Table 1, both WA and WD approaches have advantages and considerations. WA approaches offer flexibility and adaptability to varying workloads, but they do not guarantee an optimal schema among the various alternatives. Existing WD approaches can provide more targeted optimal schema solutions but require a good understanding of the application workload. The work done so far for the WD approach considers either workload queries [9, 10, 17, 36, 40] or estimated data volume [14, 37] as input. However, to generate an efficient NoSQL schema using a Workload-Driven (WD) approach, it is necessary to consider the application workload in terms of both workload queries and estimated data volumes. To fill this gap in the literature, we develop a schema generation model based on a workload-driven approach that considers the application workload in the form of both workload queries and data volume, specifically for document stores.

3 Schema generation for document stores using workload-driven approach

This section details the proposed schema generation model for document stores using the workload-driven approach. As shown in Fig. 2, the proposed model consists of three parts: model input, intermediate transformation, and final schema generation. The model begins with a conceptual model and an application workload as input and produces a logical schema as output via an intermediate transformation. The graphical flow diagram in Fig. 3 shows how the three parts of the model (shown in Fig. 2) work together. The model is intended to be used for document stores during the early stages of application development.

Fig. 2 Workload-driven approach for Document Stores

Fig. 3 Graphical flow model

3.1 Model input

The proposed model takes a conceptual model in the form of an EER model and an application workload in the form of workload queries and expected data volume as input. The details of the model input are described in this section.

3.1.1 Conceptual model

The conceptual model comprehensively captures the application requirements and workflow and represents the information in a high-level abstraction in terms of entities, relationships, and constraints. Conceptual modeling employs numerous techniques, including ER (Entity-Relationship), EER (Extended Entity-Relationship), and UML (Unified Modeling Language). However, EER provides a more expressive and flexible representation of the relationships between entities in a database than ER and UML [42]. Hence, we have taken the EER model of a real-world case study as the conceptual model, shown in Fig. 4.

Fig. 4 The EER schema for an e-commerce application

Definition 1

An EER model is defined as \(EER=(T,R)\) where \(T=\{{t}_{1},\dots ,{t}_{n}\}\) is a set of entities, and \(R=\{({t}_{i},{t}_{j})|{t}_{i},{t}_{j}\in T\}\) is a set of relationships. A relationship \(r=({t}_{i},{t}_{j})\) represents the mutual connection between entities \({t}_{i},{t}_{j}\). Both entities and relationships have a set of attributes.

We have taken a case study adapted from [14], based on the eCommerce domain, as a sample database. In this case study, customers can place orders for various items of different products. Suppliers supply the products, which belong to many categories. Each order is paid through credit card or cash. The case study is closely related to a real-world scenario, and it is straightforward to explain our work using this sample database. The EER model of the eCommerce case study is shown in Fig. 4. A brief description of the EER model follows:

  (a) It consists of eleven entities \((T)\): {Person, Category, Customer, Product, Order, Item, Carrier, Supplier, Bill, CreditCard, Payment}.

  (b) It has eight relationships \((R)\): {request, delivery, owner, reference, composite, catalog, commitment, furnishing}. Each relationship has its own attributes and a relationship cardinality (1:1, 1:N, N:1, M:N), indicating how many objects of one entity can be associated with objects of another entity.

  (c) It also contains relationships of special types, such as generalization or union. For example, Payment consists of two types, CreditCard and Bill. These special relationship types are treated as regular one-to-one relationships.

  (d) The EER model displays the average (avg) access frequencies, which serve as the estimated data volume of the application.

We use the EER model of this case study to illustrate the work throughout the paper.
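For the sketches used throughout Sect. 3, the EER model of Definition 1 can be held in a small in-memory structure. The snippet below is a partial, illustrative encoding of Fig. 4; the pairing of relationship names with entity pairs and the cardinalities shown are our reading of the figure for illustration only, not an exact transcription.

```python
# Partial, illustrative encoding of the eCommerce EER model EER = (T, R).
ENTITIES = {"Person", "Category", "Customer", "Product", "Order", "Item",
            "Carrier", "Supplier", "Bill", "CreditCard", "Payment"}

# Each relationship: (name, entity_i, entity_j, cardinality). Subset only;
# the name/pair/cardinality assignment is assumed for the sake of the example.
RELATIONSHIPS = [
    ("request",   "Customer", "Order",   "1:N"),
    ("composite", "Order",    "Item",    "1:N"),
    ("reference", "Item",     "Product", "N:1"),
    ("catalog",   "Category", "Product", "1:N"),
    ("delivery",  "Carrier",  "Order",   "1:N"),
    ("owner",     "Order",    "Payment", "1:1"),
]
```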

3.1.2 Application workload

NoSQL databases do not support joins; embedding or referencing takes their place. The selection between embedding and referencing during data modeling of document stores is the most challenging decision. Deciding when to embed a document or instead create a reference between separate documents in different collections is an application workload consideration. Additionally, if the application workload is known during the early data modeling stage, it results in an optimized schema design solution. Hence, the application workload, which includes the estimated database volume and queries, is taken as the model's input. We have taken the seven most common queries to cover two different scenarios of an eCommerce platform: (a) Customer (Q1 and Q2) and (b) Seller (Q3-Q7). The seven designed queries are shown in Table 2, and a short sketch contrasting embedding and referencing follows it.

Table 2 Workload queries
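To illustrate why the embed-versus-reference decision depends on the workload, the sketch below shows how a query that reads an order together with its items would be expressed under the two designs using PyMongo; the connection string, database, collection, and field names are placeholders.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
db = client["ecommerce_demo"]                      # hypothetical database

order_id = 42

# Design A: items embedded inside the order document -> single read, no join.
order_with_items = db.orders_embedded.find_one({"_id": order_id})

# Design B: items referenced from a separate collection -> requires a $lookup
# (MongoDB's join-like aggregation stage) at query time.
order_with_items = list(db.orders.aggregate([
    {"$match": {"_id": order_id}},
    {"$lookup": {
        "from": "items",            # referenced collection
        "localField": "item_ids",   # hypothetical array of item references
        "foreignField": "_id",
        "as": "items",
    }},
]))
```

Design A avoids the join for reads that need an order and its items, whereas Design B keeps fast-growing or frequently modified items out of the order document; the query labels introduced below formalize this trade-off.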

3.2 Intermediate transformation

The intermediate transformation generates the Query Graph \((\mathrm{QG})\) and the Query Labels \((\mathrm{QL})\). It transforms the application queries into query graphs, which are then used to generate query labels with the help of the application's estimated data volume. We employ five distinct query labels, OnetoOne Relation, Frequent Lookup, Doc Size, Frequent Modify, and Cardinality, to cover all possible data modeling scenarios [14]. The detailed working of this phase is discussed in this section.

3.2.1 Generate query graphs

Each workload query returns information regarding one or more EER model entities. The derived information from EER is represented as a Query Graph \(\left(\mathrm{QG}\right)\) [43, 44]. A \(QG\) is a sub-graph derived from the EER model.

Definition 2

A Query Graph \((\mathrm{QG}\subseteq \mathrm{EER})\) consisting of \((N,E)\) is defined for each query \({q}_{i}\in {Q}_{n}\) as follows: nodes \(N\subseteq T\) correspond to the entities \(T\) of the EER model mentioned in \({q}_{i}\), and edges \(E=({n}_{i},{n}_{j})\subseteq R\) correspond to the set of relationships \(R=({t}_{i},{t}_{j})\). The procedure to generate a \({\mathrm{QG}}_{i}\) for each query \({q}_{i}\in {Q}_{n}\) is as follows:

  1. List all the entities \(({t}_{n}\in T)\) in each query \({q}_{i}\in {Q}_{n}\).

  2. For each \({q}_{i}\in {Q}_{n}\), identify the starting entity \({t}_{i}\in T\) and add it as a node \({n}_{i}\) in \({QG}_{i}\). Traverse the relationships \(r=({t}_{i},{t}_{j})\in R\) in the EER model to determine the other entities \(({t}_{j}\in {t}_{n})\) connected to the starting entity \(({t}_{i})\). Add the relationship as an edge \((e)\) and the connected entity as a node \({n}_{j}\) to the query graph \(({QG}_{i})\), along with the cardinalities of the relationship \(r\).

  3. If any entity \({t}_{i}\in T\) belongs to a relationship of a special type, such as generalization, add the entity as a node \({n}_{j}\) and the relationship \(r=({t}_{i},{t}_{j})\in R\) as an edge \((e)\) connecting nodes \({n}_{i}\) and \({n}_{j}\) in \(({QG}_{i})\). Add cardinality 1:1 on both sides of the edge \(e=({n}_{i},{n}_{j})\).

  4. If any of the traversed entities already added to the query graph have relationships with other entities \(({t}_{k}\in {t}_{n})\), repeat step 2 to traverse these relationships and add the connected entities to the query graph \(({QG}_{i})\).

  5. Continue this process until all the entities \(({t}_{n}\in T)\) and their relationships are traversed.

  6. Repeat the above steps for each query \({q}_{i}\in {Q}_{n}\).

Figure 5 depicts the Query Graphs \((QG)\) for all seven input queries mentioned in Table 2.

Fig. 5 Query Graphs generated from workload queries
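A minimal sketch of the query-graph construction is given below. It simplifies steps 1-6 by taking the sub-graph of the EER model induced by the entities the query mentions, rather than traversing from a starting entity, and it reuses the illustrative EER encoding from Sect. 3.1.1.

```python
# Minimal sketch: build a query graph QG_i = (N, E) for one query, given the
# entities the query mentions and the EER relationships (name, t_i, t_j, card).
def build_query_graph(query_entities, eer_relationships):
    nodes = set(query_entities)
    edges = []
    # Keep every EER relationship whose two entities both appear in the query;
    # this yields the sub-graph of the EER model induced by the query.
    for name, ti, tj, card in eer_relationships:
        if ti in nodes and tj in nodes:
            edges.append((ti, tj, name, card))
    return {"nodes": nodes, "edges": edges}

# Example (hypothetical): a query that reads an order with its items and products.
qg = build_query_graph({"Order", "Item", "Product"}, RELATIONSHIPS)
# -> edges for Order-Item ("composite", 1:N) and Item-Product ("reference", N:1)
```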

3.2.2 Generate query labels

In document stores, each entity of the EER model is represented by a collection, and documents of key-value pairs represent the records contained within the entities. Relationships are replaced by embedding or referencing. There are three types of relationships between entities in the EER model: one-to-one (1:1), one-to-many or many-to-one (1:N or N:1), and many-to-many (M:N). According to the official MongoDB documentation [45], as shown in Table 3, only embedding should be used during data modeling of document stores for a one-to-one (1:1) relationship.

Table 3 Embed/ Reference based on the type of relationship

However, for the other two types, one-to-many or many-to-one (1:N or N:1) and many-to-many (M:N) relationships, we can either embed the related documents into a single collection or use referencing between distinct documents in different collections. Embedding the documents or making a reference across different collections is an application-specific decision that depends on data growth, the read-write ratio, and query types. Based on these factors, the proposed model resolves the trade-off between embedding and referencing in the form of Query Labels \((\mathrm{QL})\). The \(\mathrm{QG}\) and expected data volume are utilized to determine the \(\mathrm{QL}\). We use five labels: OnetoOne Relation, Frequent Lookup, Doc Size, Frequent Modify, and Cardinality (Table 5).

  (a) OnetoOne Relation: Each one-to-one relationship belonging to a \(QG\) is labeled 'OnetoOne Relation'.

  (b) Frequent Lookup: If two or more entities are repeatedly accessed together, a 'Frequent Lookup' label is added to these entities.

  (c) Doc Size: The expected monthly access frequency of entity pairs in the application workload can be used to forecast the future size of a document. If the document size is expected to exceed 16 MB, the 'Doc Size' label is assigned to those entity pairs.

  (d) Frequent Modify: When two or more entities are frequently inserted, updated, or deleted together, the 'Frequent Modify' label is assigned to these entities.

  (e) Cardinality: If the ratio gap in a many-to-many (M:N) relationship is high, the 'Cardinality' label is assigned to these entities.

The process of query label generation is broken down into three steps: Label Categorization, Action Association, and Prioritization, as shown in Fig. 6.

Fig. 6 Generation of query labels

Step 1 Label Categorization.

NoSQL databases are designed for high performance and scalability, and one of the ways they achieve this is by storing related data together in a single document. This allows for faster data retrieval, as the data needed for a particular query is more likely to be in a single location. By analyzing the query characteristics, it is possible to determine which entities and attributes need to be accessed together and to group them in MongoDB. We represent the query characteristics in the form of Query Labels \((\mathrm{QL})\). The term 'entity pair' is used in the process of Label Categorization and refers to the nodes (\({n}_{i},{n}_{j}\)) bounded by an edge \((e)\) (relationship) within a query graph. An entity pair represents a specific connection or association between two entities in the Query Graph and is utilized to label the edges based on the relationships among the different nodes.

Definition 3

For \(\mathrm{QG}=(N,E)\), an Entity Pair (\({n}_{i},{n}_{j}, r\)) is defined as a pair of nodes (\({n}_{i},{n}_{j}\)), where (\({n}_{i},{n}_{j}\))\(\in N\), bounded by a relationship \(r\in E\), the edge connecting the two nodes. The procedure to produce all possible entity pairs from \({\mathrm{QG}}_{i}|i=1\dots n\) is given below:

  1. Initialize an empty list named 'EntityPairs'.

  2. For each edge \(e\in E\) in \({QG}_{i}\), let \({n}_{i}\), \({n}_{j}\) be the source and target nodes of \(e\). Create an entity pair (\({n}_{i},{n}_{j}, r\)) and add the pair to the 'EntityPairs' list.

  3. Repeat for each query graph \({QG}_{i}\) in the query graph list (\({QG}_{n}\)).

  4. Return the 'EntityPairs' list, which contains all the unique entity pairs across the query graphs \(({QG}_{i}|i=1\dots n)\).

For the eCommerce case study, eight 'Entity pairs (EP)' are formed from QG, as shown in Fig. 7.

Fig. 7 Entity pairs (EP) based on query graphs

Along with this, we have taken 40% as the threshold value. This assumption is rooted in the 60-40 rule, which states that 60% of the profit involves 40% of the data [46]. This rule helps identify the most critical data and optimize the database's performance by focusing on that data. The details of Label Categorization are given in Algorithm 1. The theoretical explanation is as follows:

  (a) The 'OnetoOne Relation' label is assigned to all entity pairs having a one-to-one relationship in the query graphs \(\left({QG}_{i}|i=1,\dots ,7\right)\). For instance, the entity pairs Order-Payment, Supplier-Person, Payment-Bill, and Payment-CreditCard have a one-to-one relation in the case study, so the 'OnetoOne Relation' label is assigned to these entity pairs, as shown in Fig. 8.

  (b) The 'Frequent Lookup' label is assigned to relationships accessed repeatedly in the application workload. To assign the label, the access count is calculated for each distinct entity pair. A threshold value is then calculated, which gives the maximum number of entity pairs that can be assigned the 'Frequent Lookup' label; the threshold is 40% of the maximum access count among the entity pairs. The access counts are sorted, and the 'Frequent Lookup' label is assigned, beginning from the highest access count, to as many entity pairs as the threshold allows.

Fig. 8 'OnetoOne Relation' Label

For instance, in the preceding case study, we counted the total number of times the distinct entity pairs are accessed together. As shown in Fig. 9, the access counts of the entity pairs across the \(QG\)s are: Customer-Order and Product-Category are each accessed twice; Order-CreditCard, Customer-CreditCard, and Order-Carrier are each accessed once; Order-Item is accessed four times; and Item-Product is accessed five times. The maximum access count of an entity pair is 5, so the threshold (40% of 5) is 2. Therefore, the label is assigned to the entity pairs with the top two access counts (4 and 5). Hence, the entity pairs Order-Item and Item-Product are labeled 'Frequent Lookup' (a sketch of these threshold computations is given at the end of this step).

  (c) For the 'Doc Size' label, two values are needed: the estimated data volume of each distinct entity pair in the application workload, and the average document size. The estimated data volume is taken from the EER model, and the average size of a single document can be calculated from the attributes of an entity. For simplicity, we assume that each document has five attributes (key-value pairs) with a maximum size of 12 bytes each (the maximum key-value size in MongoDB), so the average size of each document is 60 bytes. To compute the 'Doc Size' label, the expected data volume of each distinct entity pair is multiplied by the average document size. Entity pairs whose resulting size exceeds 16 MB (\(16\times {10}^{6}\) bytes) are assigned the 'Doc Size' label.

Fig. 9 'Frequent Lookup' Label based on Query graphs

As shown in Table 4 for the case study, the distinct entity pairs accessed in the \(QG\)s are named in the first column, and the expected data volume (as mentioned in Fig. 4) for each distinct entity pair is shown in the second column. The third and fourth columns show the expected document size in bytes and MB, respectively, obtained by multiplying the values of the second column by 60 bytes (the assumed average document size). According to Table 4, the size of 'Order-Carrier' exceeds the threshold value; hence, as shown in Fig. 10, this entity pair is assigned the 'Doc Size' label.

Table 4 The calculation for the 'Doc Size' Label
Fig. 10 'Doc Size' Label

  (d) For the 'Frequent Modify' label, the queries that perform write operations (insert, delete, update) on the database are selected from the application workload. For the selected queries, the corresponding \(\mathrm{QG}\)s are fetched, and the distinct entity pairs (related entities) accessed together are counted. The threshold value is 40% of the number of frequently accessed entity pairs fetched from the \(\mathrm{QG}\)s. Then, from the EER model, the data volume of each entity pair is determined. The estimated data volumes are sorted, and the 'Frequent Modify' label is assigned, beginning from the highest estimated data volume, to as many entity pairs as the threshold allows.

Based on these calculations, the 'Frequent Modify' label is assigned. For instance, queries Q1, Q2, Q3, and Q4 perform write operations on the database. From the corresponding \({QG}_{i}|i=1,\dots ,4\) (Fig. 5), Customer-Order, Order-Item, Item-Product, Order-Payment, Supplier-Payment, Product-Category, and Order-Carrier are the seven entity pairs accessed, so the threshold (40% of 7) is 2. Figure 4 shows the estimated data volumes of these entity pairs: Order-Customer 90, Order-Payment 1, Order-Item 65, Item-Product 2541, Supplier-Payment 16, Product-Category 35, and Order-Carrier 283,440. Therefore, the 'Frequent Modify' label is assigned to the Item-Product and Order-Carrier entity pairs, which have the two highest values, 2541 and 283,440, respectively, as illustrated in Fig. 11.

  (e) The 'Cardinality' label is affixed to many-to-many relationships because the M:N ratio is evaluated during logical schema generation for this relationship type: if the ratio gap is high, one-way embedding is performed; otherwise, two-way embedding is performed. The provided case study does not include any M:N relationship. For the sake of understanding, however, if a book belongs to at most 5 categories (n) while a category contains up to 50,000 books (m), the M:N ratio gap is high, so one-way embedding is applied. If an author writes at most 3 books (n) and a book has at most 5 authors (m), the M:N ratio gap is low, so two-way embedding is used.

Fig. 11 'Frequent Modify' Label
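The threshold computations behind the 'Frequent Lookup' and 'Doc Size' labels can be sketched as follows; the 'Frequent Modify' label follows the same thresholding pattern over the entity pairs of the write queries. The access counts and data volumes are the illustrative figures quoted above, and the 40% threshold and 60-byte average document size are the assumptions stated earlier.

```python
import math

AVG_DOC_SIZE_BYTES = 60           # assumed: 5 key-value pairs x 12 bytes
MAX_DOC_SIZE_BYTES = 16 * 10**6   # 16 MB document size limit used in the text

# Illustrative access counts of entity pairs across the query graphs (Fig. 9).
access_counts = {
    ("Customer", "Order"): 2, ("Product", "Category"): 2,
    ("Order", "CreditCard"): 1, ("Customer", "CreditCard"): 1,
    ("Order", "Carrier"): 1, ("Order", "Item"): 4, ("Item", "Product"): 5,
}

def frequent_lookup_labels(counts, threshold_ratio=0.4):
    # Threshold = 40% of the maximum access count, i.e. how many pairs to label.
    k = math.floor(max(counts.values()) * threshold_ratio)
    ranked = sorted(counts, key=counts.get, reverse=True)
    return set(ranked[:k])

def doc_size_labels(volumes):
    # Label pairs whose projected document size exceeds 16 MB.
    return {pair for pair, vol in volumes.items()
            if vol * AVG_DOC_SIZE_BYTES > MAX_DOC_SIZE_BYTES}

print(frequent_lookup_labels(access_counts))
# prints the top-2 pairs, Order-Item and Item-Product (threshold = 40% of 5 = 2)

# Illustrative estimated data volumes (Fig. 4); only Order-Carrier crosses 16 MB.
volumes = {("Order", "Carrier"): 283_440, ("Item", "Product"): 2_541}
print(doc_size_labels(volumes))
```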

Step 2 Action Association.

While designing the schema from the EER model into MongoDB, embedding or referencing is performed on entity pairs. We associate actions with the Query Labels to determine when to embed or reference an entity pair. The association of actions follows the data modeling guidance given on the official MongoDB website [45]. According to this guidance, three actions can be performed on entity pairs: one-way embedding, two-way embedding, and referencing. The action associated with each Query Label is as follows: (i) For the 'OnetoOne' label, the entity pairs must always be embedded together using one-way embedding. (ii) For the 'Frequent Lookup' label, entity pairs frequently accessed together should always be embedded because they involve many read operations; hence, the data should be stored in the same location. (iii) For the 'Doc Size' label, the entities should always be referenced, because when a document grows beyond 16 MB, MongoDB must allocate a new memory location for the growing document and copy the old document to the new space, which involves many input/output operations and can affect MongoDB's performance. (iv) The 'Frequent Modify' label involves more write operations (insert, update, and delete); write-intensive entities should always be referenced. (v) For the 'Cardinality' label, one-way or two-way embedding is performed depending on the M:N ratio. The action associated with each label is displayed in Table 5.

Table 5 Complete information about Query Labels

Step 3 Prioritization of Labels.

If a relationship comprises more than one label, label prioritization addresses the trade-off between the actions associated with the labels. According to the summary outlined in [6], between embedding and referencing, the highest priority is assigned to referencing because write-heavy operations and large documents must be prioritized first. Hence, referencing is assigned the highest priority, i.e., 1. Between one-way and two-way embedding, one-way embedding is assigned priority 2, whereas two-way embedding is assigned priority 3, as shown in Table 5.
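The label-to-action mapping and the priority rule of Table 5 can be captured in a small lookup table; the sketch below assumes the actions and priorities exactly as stated above, with the action names chosen for illustration.

```python
# Action and priority per query label (1 = highest priority), following Table 5.
LABEL_RULES = {
    "OnetoOne Relation": ("one_way_embed", 2),
    "Frequent Lookup":   ("one_way_embed", 2),
    "Doc Size":          ("reference",     1),
    "Frequent Modify":   ("reference",     1),
    "Cardinality":       ("two_way_embed", 3),  # or one-way, depending on the M:N ratio
}

def resolve_action(labels):
    """Pick the action of the highest-priority (lowest number) label on an edge."""
    if not labels:
        return "separate_collection"
    action, _ = min((LABEL_RULES[l] for l in labels), key=lambda rule: rule[1])
    return action

# Example: an edge labeled both 'Frequent Lookup' and 'Doc Size' resolves to referencing.
assert resolve_action({"Frequent Lookup", "Doc Size"}) == "reference"
```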

Algorithm 1 Label Categorization

3.3 Final schema generation

Final schema generation is categorized into two parts: (i) generation of the Schema Graph \((\mathrm{SG})\) and label assignment, and (ii) transformation into the logical schema, as shown in Fig. 12. In the first part, the EER model is converted into a graph named the Schema Graph \((\mathrm{SG})\), and then the \(\mathrm{QL}\) are assigned to the \(\mathrm{SG}\). Based on the defined rules, the \(\mathrm{SG}\) is then transformed into a MongoDB logical schema.

Fig. 12 Final schema generation

3.3.1 Generation of schema graph and label assignment

The EER model is first converted into a graph model \((\mathrm{SG})\) by representing EER entities as nodes and relationships as edges. After that, the Query Labels \((\mathrm{QL})\) are assigned to Schema Graph \((\mathrm{SG})\).

Definition 4

A Schema Graph \(\left(\mathrm{SG}\right)=({N}_{G},{E}_{G})\) is represented by nodes \(({N}_{G}\in T)\) and edges \(({E}_{G}\in R)\). The number of nodes \(({N}_{G})\) must equal the number of entities \((T)\), and the number of edges \(({E}_{G})\) must equal the number of relationships \((R)\) in the EER model.

Algorithm 2 gives the detailed procedure of Schema Graph \((\mathrm{SG})\) generation. The algorithm iterates through the entities \((T)\) and relationships \((R)\) in the EER model. For each entity \((t\in T)\), a corresponding node \(({n}_{G})\) is created in the SG. Similarly, for each relationship \((r\in R)\), an edge \(({e}_{G})\) is added to the SG, connecting the start \(({t}_{i})\) and end \(({t}_{j})\) entities of the relationship \((r)\). This process continues until all entities and relationships in the EER model have been processed. The resulting Schema Graph provides a visual representation of the EER model. After creating the Schema Graph, the query labels are assigned to the edges of the \(\mathrm{SG}\). Figure 13 depicts the \(\mathrm{SG}\) generated from the EER model, consisting of 11 nodes and 11 edges, along with the assigned \(\mathrm{QL}\) on the edges.
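A minimal sketch of the Schema Graph construction and label assignment described by Algorithm 2 is shown below. It reuses the illustrative EER encoding from the earlier sketches and is a simplified stand-in for the algorithm, not a verbatim implementation.

```python
def build_schema_graph(entities, relationships, query_labels):
    """Create SG = (N_G, E_G) from the EER model and attach query labels to edges.

    query_labels maps an unordered entity pair to the set of labels derived in
    Sect. 3.2.2, e.g. {frozenset({"Order", "Item"}): {"Frequent Lookup"}}.
    """
    nodes = set(entities)                       # one node per EER entity
    edges = []
    for name, ti, tj, card in relationships:    # one edge per EER relationship
        labels = query_labels.get(frozenset({ti, tj}), set())
        edges.append({"pair": (ti, tj), "relationship": name,
                      "cardinality": card, "labels": labels})
    return {"nodes": nodes, "edges": edges}

# Illustrative usage with hypothetical label assignments.
ql = {frozenset({"Order", "Item"}): {"Frequent Lookup"},
      frozenset({"Order", "Payment"}): {"OnetoOne Relation"}}
sg = build_schema_graph(ENTITIES, RELATIONSHIPS, ql)
```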

Fig. 13 Schema Graph (SG) with query labels

Algorithm 2 Generation of the Schema Graph (SG)

3.3.2 Transformation into logical schema

A logical schema of document stores is derived from \(\mathrm{SG}\). The following rules are followed to generate a logical schema:

  (1) If a single \(\mathrm{QL}\) is assigned to an edge of the \(\mathrm{SG}\), the decision is based on the action associated with that \(\mathrm{QL}\), as discussed in Sect. 3.2.2.

  (2) If more than one \(\mathrm{QL}\) is assigned to an edge of the \(\mathrm{SG}\), the decision is based on the priority associated with the \(\mathrm{QL}\)s, and the action with the highest priority is taken.

  (3) If no \(\mathrm{QL}\) is assigned to an edge of the \(\mathrm{SG}\), a separate collection is created for each entity connected by that edge.

Figure 14 depicts the final logical schema for the case study after applying all phases of the proposed model. Since no label is assigned to the entity pairs Supplier-Product, Product-Category, Order-Customer, and Carrier-Customer, separate collections are created for each of these entities following rule 3. For the remaining entities, a label exists on the connecting edge, so by following rules 1 and 2, the entities are embedded or referenced in each other.

Fig. 14 Logical schema generated from Schema Graph (SG)

The procedure of the proposed model that encompasses all the phases is given as Algorithm 3. The algorithm designs the schema automatically by transforming the model's inputs into the logical schema of MongoDB. The Query Graph \(\left(\mathrm{QG}\right)\) is generated for each query \(({q}_{i}\in {Q}_{n})\) of the application workload (Line 1). The Query Labels \((\mathrm{QL})\) are generated using the Query Graphs \(\left(\mathrm{QG}\right)\) (Line 2). The EER model is converted into the Schema Graph \((\mathrm{SG})\) (Line 3) using Algorithm 2. Then, the calculated Query Labels \((\mathrm{QL})\) are assigned to the Schema Graph \((\mathrm{SG})\) (Line 4). The \(\mathrm{SG}\) is converted into the logical schema using the actions associated with the assigned Query Labels \((\mathrm{QL})\) (Lines 5-23). Each edge \({e}_{i}\in E\) is removed from the \(\mathrm{SG}\) in turn. If it carries a \(\mathrm{QL}\), the corresponding action is applied: for the OnetoOne Relation label, the entities are embedded in one another; for the Frequent Lookup label, with relationship type 1:1 or 1:N the child entity is embedded into the parent entity, and with relationship type N:1 the child is embedded into the parent as an array of embedded objects; for the Doc Size or Frequent Modify label, referencing is performed, and the child entity is referenced in the parent entity; for the Cardinality label, one-way or two-way embedding is performed depending on the M:N ratio gap. If an edge \(({e}_{i}\in E)\) carries more than one query label, the action is chosen based on the priority of the labels. If no label is assigned to the edge, a new collection is created. If the number of entities is N and the number of relationships is M, Algorithm 3 has a time complexity of \(O(N+M)\).
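The edge-processing loop of Algorithm 3 (Lines 5-23) can be sketched as follows, reusing resolve_action and the schema graph from the earlier sketches. The sketch only records the embed/reference decision per entity pair and leaves the concrete document layout to the designer.

```python
def derive_logical_schema(schema_graph):
    """Walk the SG edges and decide, per entity pair, how to materialize it."""
    decisions = []
    for edge in schema_graph["edges"]:
        action = resolve_action(edge["labels"])   # priority-based resolution
        if action == "separate_collection":
            # Rule 3: no label -> keep the two entities as separate collections.
            decisions.append((edge["pair"], "separate collections"))
        elif action == "reference":
            # Doc Size / Frequent Modify: reference the child from the parent.
            decisions.append((edge["pair"], "reference child in parent"))
        else:
            # OnetoOne / Frequent Lookup / Cardinality: embed according to the
            # cardinality (array of embedded objects for the many side).
            decisions.append((edge["pair"], f"embed ({edge['cardinality']})"))
    return decisions

for pair, decision in derive_logical_schema(sg):
    print(pair, "->", decision)
```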

Algorithm 3 Proposed schema generation model

4 Experimental evaluation

We evaluate our approach with an experiment in the e-commerce domain, as described in Sect. 3. The experiments validate our proposed model and demonstrate its positive effect on query processing time. The performance of our proposed model is compared with three existing models: (i) Conventional [37]: a workload-agnostic model whose logical schema is designed without an application workload by following the relationship constraints only, as shown in Fig. 15a; (ii) Optimized [37]: a workload-driven model, shown in Fig. 15b, whose logical schema is generated based solely on the expected data volume of the application workload; and (iii) Query Path Graph (QPG) [17]: a model that considers application query patterns for the transformation from conceptual to logical modeling, as shown in Fig. 15c. We chose these three existing models because they define and explore three distinct ways of MongoDB data modeling, and their work is comparable to ours. Table 6 gives a qualitative analysis of the four models based on several important factors: query response time, query speedup, write latency, read latency, number of pipeline stages, pipeline efficiency, storage space, scalability, throughput, and latency. The qualitative analysis indicates that our model performs well for all factors; however, experiments are conducted to substantiate this analysis with numbers. The following sections detail the experimental setup and the results that verify the qualitative analysis.

Fig. 15 Existing logical models used for performance comparison: a Conventional, b Optimized, c QPG

Table 6 Qualitative analysis of existing and proposed model

4.1 Experimental setup

The experimental analysis is performed on an Intel Core i7-1255U processor with 16 GB of RAM, a 3-level cache, and a 1 TB hard disk. The data is stored using MongoDB Atlas, a cloud-based database service provided by MongoDB. The experimental setup utilizes a dedicated MongoDB Atlas M20 cluster configuration with 4 GB RAM, 20 GB storage, and 2 vCPUs. The cluster is configured in the AWS Mumbai (ap-south-1) region running MongoDB version 6.0.6. It consists of a replica set with three nodes to ensure load balancing of read and write operations, data availability, and reduced query response time. Studio 3T, a graphical user interface (GUI)-based MongoDB IDE [47], is also used. The experimental setup is summarized in Table 7.

Table 7 Experimental setup

To compare query performance, we created four physical databases in MongoDB for the schemas shown in Figs. 14 and 15, named conventional, optimized, QPG, and proposed. We populated all four databases with identical data, as shown in Table 8. The seven queries (Q1-Q4 CRUD queries, Q5-Q7 aggregation pipelines) outlined in Sect. 3.1.2 are executed on each database. As stated in Table 9, along with the seven input queries, the performance is measured on eight additional queries (Q8-Q15), because queries evolve over time and new queries are continually added to the system. Hence, to measure the performance of a system, it is necessary to measure it on run-time queries as well.

Table 8 The eCommerce dataset, along with the number of records for each EER entity
Table 9 Additional run-time queries for performance evaluation

4.2 Experimental evaluation

To perform the experimental analysis, we measure various essential database performance parameters as listed in Table 6. Based on these parameters, the following comparisons are made between the proposed and existing models: (1) Query Response Time and Speedup (Sect. 4.2.1), (2) Read and Write Latency (Sect. 4.2.2), (3) Efficiency Improvement using the aggregation pipeline (Sect. 4.2.3), (4) Storage Space (Sect. 4.2.4), (5) Collection-wise Performance (Sect. 4.2.5), (6) Scalability (Sect. 4.2.6), and (7) Throughput and Latency (Sect. 4.2.7).

4.2.1 Query response time and speedup

Query response time refers to the time the database takes to process and respond to a request for information. It is an important performance parameter, as it affects the speed and efficiency of the database; in general, faster query response times are desirable, as they lead to better performance and user experience. We also calculate a unitless speedup factor, obtained for each query by dividing the query response time of the existing model by that of the proposed model. The overall result is summarized through the geometric mean (GM) of all 15 queries, because the GM is the appropriate, meaningful average for normalized unitless numbers [48]. The GM enables broad at-a-glance speedup comparisons between the existing and proposed models' performance. The formula to calculate the average query speedup factor is:

$$\mathrm{Average\;Query\;Speedup\;Factor}= \sqrt[N]{\prod_{n=1}^{N}{Q}_{n}}$$

where \({Q}_{n}\) is the speedup factor for the nth query:

$${Q}_{n}= \frac{{T}_{{n}_{\mathrm{EM}}}}{{T}_{{n}_{P}}}$$

where \({T}_{{n}_{P}}\) is the query execution time of the nth query under the proposed model, \({T}_{{n}_{\mathrm{EM}}}\) is the query execution time of the nth query under the existing model, and N is the total number of workload queries.
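A sketch of how the per-query response times and the geometric-mean speedup can be computed is given below. It assumes PyMongo, averages three runs per query as in our procedure, and uses placeholder pipelines, collection names, and timing values rather than measured results.

```python
import time
from math import prod
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["ecommerce_demo"]  # placeholder

def avg_response_time_ms(collection, pipeline, runs=3):
    """Average wall-clock time of an aggregation over several runs."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        list(db[collection].aggregate(pipeline))   # force full result retrieval
        times.append((time.perf_counter() - start) * 1000)
    return sum(times) / runs

def geometric_mean_speedup(existing_ms, proposed_ms):
    """Q_n = T_EM / T_P per query, summarized by the geometric mean."""
    speedups = [em / p for em, p in zip(existing_ms, proposed_ms)]
    return prod(speedups) ** (1 / len(speedups))

# Illustrative numbers only (not measured values).
existing = [12.0, 8.5, 30.2]
proposed = [9.0, 8.0, 15.1]
print(geometric_mean_speedup(existing, proposed))  # > 1 means the proposed model is faster
```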

We performed an extensive evaluation of the proposed model against the existing models on MongoDB. To conduct the experiment, we ran each of the 15 queries on the four models. Three runs were made for each query to avoid results distorted by MongoDB's caches, and the average of the three runs is taken as the query response time. The speedup factor for an individual query \(({q}_{i}\in {Q}_{n},\;\mathrm{where}\;i=1,\dots ,15)\) is then calculated by dividing the query response time of the existing model by the query response time of the proposed model. Table 10 details the complete numerical figures of query response time and the speedup factor for all 15 queries, whereas the graphical representation of query response time is shown in Fig. 16. The average speedup factor is shown in the last row of Table 10. The following observations are made from Fig. 16.

Table 10 Query execution time of workload queries
Fig. 16 Query response time comparison among different models

It can be observed that the response time of the proposed model for queries Q1 and Q2 is lower than that of all other models. For queries Q5, Q6, Q8, and Q9, the proposed model performs much better than the existing models. The improvement arises because these queries access a specific Order, including its Items: the conventional and optimized schemas nest Orders inside the Customer collection, whereas QPG has to make a reference from Orders to the separate Items collection, which is time-consuming. In contrast, the schema generated by our model reads the documents from the Order collection directly. For queries Q7, Q10, and Q13, the proposed model is better than QPG but performs worse than the conventional and optimized models because these queries access Customer records, including Orders. In both the conventional and optimized models, this information can be accessed directly from the Customer-rooted collection, but in QPG and the proposed model the Customer collection has to be linked with the Order collection. QPG takes even more time because Orders are both nested inside Customer and kept in a separate collection, so time is spent on both nesting and referencing from the Customer collection to the Order collection. For Q11, the performance of the proposed model is better than the conventional model but worse than the optimized and QPG models, while for Q12 it performs worse than all three existing models. For Q4, Q14, and Q15, all models have almost the same performance. Hence, we conclude that the performance of the proposed model is similar to or better than the existing models for both input and run-time queries.

4.2.2 Write and read latency

Read latency is the time taken to retrieve data from the database, while write latency is the time taken to store data in the database. Latency can be affected by factors such as the schema from which data is read or written, the workload on the database, and the type of storage used. In our case, to measure the effect of schema and workload on latency, the queries are divided into two categories: (1) write queries (Q1-Q4, Q14-Q15) and (2) read queries (Q5-Q13). We take the average query response time for each category. The results are shown in Table 11, and the graphical representation is shown in Fig. 17. The proposed model reduces write latency by factors of 1.14, 1.17, and 1.33 and read latency by factors of 1.19, 1.09, and 1.37 compared to the Conventional, Optimized, and QPG models, respectively. Hence, the proposed model outperforms all three existing models in terms of write and read latency, as lower latency means better performance.

Table 11 Write and Read latency for each schema model
Fig. 17 Write and Read Latency among Proposed and existing models

4.2.3 Efficiency improvement using aggregate pipeline

MongoDB uses an aggregation pipeline framework for complex query processing [49]. The aggregation pipeline processes documents from different collections and returns computed results. It is a powerful tool for data processing and analysis in MongoDB and can perform a wide variety of operations on data, including data transformation, aggregation, and analysis.

The efficiency \((\eta)\) of the aggregation pipeline in MongoDB depends on several factors, including the number of pipeline stages and the size and organization of the data being processed. Efficiency improvement \((\eta \%)\) measures the performance of the aggregation pipeline in terms of the number of aggregation stages required. To calculate \(\eta\), the execution plan is first analyzed, which gives the number of stages for each individual MongoDB query in all four models, as shown in Table 12. Then the total number of stages \((S=\sum_{i=1}^{n}{S}_{i})\) over the queries \({q}_{i}\in {Q}_{n},\;i=1,\dots ,15\) is calculated. The formula used to calculate the efficiency improvement is:

$$\text{Efficiency improvement}\;\left(\eta \%\right)=\left(\eta -100\right)\%,\quad \mathrm{where}\;\eta =\frac{{S}_{{\mathrm{EM}}_{i}}}{{S}_{P}}\times 100,\; i=1,2,3$$

where \({S}_{{\mathrm{EM}}_{i}}\) is the total number of stages in the ith existing model and \({S}_{P}\) is the total number of stages in the proposed model.
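A worked sketch of the efficiency-improvement computation is shown below; the stage totals used are placeholders rather than the values of Table 12.

```python
def efficiency_improvement(total_stages_existing, total_stages_proposed):
    """Return eta% = (S_EM / S_P * 100) - 100, the percentage stage reduction."""
    eta = total_stages_existing / total_stages_proposed * 100
    return eta - 100

# Hypothetical stage totals over the nine aggregation queries.
print(efficiency_improvement(47, 40))   # -> 17.5 (% improvement over this baseline)
```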

Table 12 Number of Pipeline stages for each aggregate query

As mentioned above, nine of the fifteen queries are aggregation queries; hence, these aggregation queries are used to analyze the efficiency improvement. Table 12 shows the number of stages used by each query and the total number of stages used by all nine queries. The efficiency improvement of the aggregation pipeline of the proposed method against the existing models is calculated by the formula mentioned above.

Table 13 shows that the percentage efficiency improvement of the proposed model is 17.5% against the conventional model, 15% against the optimized model, and 10% against the QPG model. This section concludes that our method provides better pipeline efficiency than the existing methods. The reason is that the number of aggregation pipeline stages depends on the number of documents fetched for the application query, and a schema designed based on the application queries results in fewer documents being scanned during query processing. Hence, we conclude that our model suggests a better logical schema for the application workload than the existing models.

Table 13 Aggregate pipeline efficiency of the proposed model against existing models

4.2.4 Storage space

In MongoDB, the size of a collection refers to the total amount of disk space consumed by the data within that specific collection. The storage requirements in MongoDB can vary greatly depending on factors such as data volume, data model, and query usage patterns. Due to these factors, the various models (Conventional, Optimized, QPG, and Proposed) introduce different collections and exhibit variations in storage space. Table 14 shows each model's storage space and the total number of documents in each collection. Due to the schema flexibility, the four models have different collections, and a collection in one model may or may not be present in another. For example, the "Order" collection is present only in the QPG and Proposed models, whereas the "Payment" collection is present only in the Conventional model.

Table 14 Storage space occupied by each collection, along with total space occupied by each model

Figure 18 represents the graphical visualization of storage space variations across different models. Figure 18a illustrates the space occupied by each collection individually, providing a collection-wise comparison among the models. On the other hand, Fig. 18b shows the total storage space occupied by all the collections within a specific model.

Fig. 18 Storage Space comparison: a Collection-wise storage comparison, b Total space occupied by different models

Figure 18a shows that all four models have the same disk space for the Carrier, Supplier, and Category collections, indicating that these collections have almost consistent storage requirements across the models.

  1. 1.

    The disk space occupied by the Customer collection gradually decreases from the Conventional model to the Proposed model, indicating storage optimizations in the latter models. This decrease in disk space is caused by a lower level of embedding in the proposed models when compared to the Conventional model.

  2. 2.

    The Product collection shows variations in disk space among the different models. Compared to the Conventional and QPG models, the Optimized and Proposed models have more disk space allocated for the Product collection. This difference in disk space can be attributed to the fact that the Optimized and Proposed models include Furnishing and Catalog embedded documents within the Product collection. These additional embedded documents contribute to the Optimized and Proposed models' higher disk space utilization when compared to the other models.

  3. 3.

    The Order collection is absent in the Conventional and Optimized models but present in the QPG and Proposed models, which have significantly more disk space.

4. Figure 18b shows that the total disk space for the Conventional, Optimized, and Proposed models is relatively similar, with the Proposed model showing a slight reduction. The QPG model consumes more total disk space than the other three models because the Order collection is stored both separately and within the Customer collection, which takes up a significant amount of disk space.

4.2.5 Collection-wise performance

Collection-level performance analysis using "MongoTop" is a valuable technique to track the time taken by the read and write activity of each collection in a MongoDB instance. It provides insights into the most active collections in terms of disk I/O operations, which helps identify performance bottlenecks and optimize database operations. "MongoTop" is a tool provided by MongoDB that continuously samples data over a specified duration and provides real-time reports on the activity of individual collections. We analyze three parameters (Total, Read, Write) to gain a deeper understanding of collection-wise performance. "Total" shows the total amount of time, in microseconds, spent performing both read and write operations on a particular collection; examining this metric lets us assess a collection's overall workload and activity level. "Read" indicates the amount of time, in microseconds, dedicated to read operations on a particular collection; analyzing it identifies heavily read-intensive collections and provides insight into data access patterns and usage characteristics. "Write" displays the amount of time, in microseconds, spent performing write operations on a collection; examining it identifies collections with significant write activity, enabling us to focus on optimizing write-intensive operations. Table 15 provides the Total, Read, and Write activity for each collection across the Conventional, Optimized, QPG, and Proposed models.
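For reference, mongotop is typically run from the command line (e.g., sampling every few seconds); the sketch below shows a rough pymongo equivalent that reads the underlying top admin command which mongotop reports on. The connection URI is a placeholder, and the field layout assumed here follows the command's usual totals structure.

```python
# Hedged sketch: per-collection read/write time (in microseconds) via the
# "top" admin command that mongotop is built on. The connection URI is a
# placeholder; the totals layout is assumed from the command's usual output.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
totals = client.admin.command("top")["totals"]

for namespace, stats in totals.items():
    if not isinstance(stats, dict):
        continue  # skip the informational "note" entry
    print(f"{namespace}: total={stats['total']['time']} us, "
          f"read={stats['readLock']['time']} us, "
          f"write={stats['writeLock']['time']} us")
```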

Table 15 Collection-wise performance comparison

Figure 19 shows the graphical representation of the information in Table 15 and has three parts: 19a, 19b, and 19c. Part 19a compares the four models (Conventional, Optimized, QPG, and Proposed) based on the total time spent on read and write operations for each collection, part 19b compares the collections of each model based on read time, and part 19c highlights the write time for each collection across the four models. A detailed explanation is given below:

Fig. 19

Collection-wise performance comparison: a time taken by both read and write operations, b time taken by read operations, and c time taken by write operations

1. Figure 19a shows that the Proposed model has the lowest total time across all collections compared to the other models. The collections in the Optimized and Conventional models show slightly higher total times than the Proposed model but remain relatively close. The QPG model stands out with significantly higher total times, primarily due to the Order collection, which substantially impacts the overall time.

2. Figure 19b shows that, for the Customer collection, the Conventional model has the highest read time, followed by the Optimized model; the QPG and Proposed models have significantly lower read times. As a result, the overall read performance of the Conventional and Optimized models depends largely on the Customer collection. The Proposed model generally shows reduced read times compared to the Conventional and Optimized models, suggesting improved read performance, whereas the QPG model exhibits higher read times for certain collections because the Order data is stored in multiple locations.

3. Figure 19c shows that the QPG model has the highest write time for the Customer collection, while the other models have relatively lower write times; the Order collection also has a high write time in the QPG model. The Proposed model generally demonstrates lower write times than the Conventional and Optimized models and, among the four, performs best in terms of write operations.

The graphs highlight the performance differences among the Conventional, Optimized, QPG, and Proposed models. The Conventional and Optimized models exhibit similar total, read, and write times, which are higher than those of the Proposed model, because the Customer collection in both models is heavily embedded with Order documents, leading to performance bottlenecks. The Proposed model addresses this issue by separating the Customer and Order collections, resulting in improved performance. The QPG model stands out with significantly higher read and write times, attributed to the duplication of the Order collection. Overall, the Proposed model consistently demonstrates lower read and write times across the different collections, indicating a superior schema design compared to the existing models.

4.2.6 Scalability

To evaluate scalability, it is essential to test the model's ability to handle increased data volumes while maintaining acceptable query performance. Expanding the data volume in each collection allows us to simulate real-world scenarios with larger datasets and observe how the model performs under such conditions. We can determine whether the model scales well with increased data volume by analyzing query performance metrics such as response time and speedup. This information is crucial for capacity planning and for optimizing the database infrastructure so that it can handle growing workloads without sacrificing performance. To analyze scalability, we increased the data volume, as shown in Table 16, to roughly double the size reported in Table 14. We then run the fifteen workload queries (Tables 2 and 9) on both CPU and GPU to compare the query performance of the proposed model with the existing models. The GPU is chosen for its ability to accelerate computations through parallel processing, which is why GPUs are vital components of supercomputers.
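As a simple illustration of how response times and speedup factors of this kind can be obtained, the sketch below times one query against two schema variants and takes the ratio of the existing model's time to the proposed model's time. The query, database names, collection names, and connection URI are assumed placeholders, not the actual workload or deployment used in the experiments.

```python
# Hedged sketch: timing a workload query against two schema variants and
# computing the speedup factor (existing time / proposed time). All names
# and the query itself are illustrative placeholders.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

def response_time_ms(db_name, collection, pipeline):
    """Wall-clock time, in milliseconds, to run one aggregation query."""
    coll = client[db_name][collection]
    start = time.perf_counter()
    list(coll.aggregate(pipeline))  # exhaust the cursor so all work is done
    return (time.perf_counter() - start) * 1000

pipeline = [{"$match": {"status": "delivered"}},
            {"$count": "delivered_orders"}]  # placeholder query

t_existing = response_time_ms("conventional_db", "Customer", pipeline)
t_proposed = response_time_ms("proposed_db", "Order", pipeline)
print(f"Speedup (SU) = {t_existing / t_proposed:.2f}")
```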

Table 16 Scaling up the data volume for performance analysis
4.2.6.1 Scalability for CPU

The results for the CPU are shown in Table 17, and the graphical representation is shown in Fig. 20. Based on these results, the following inferences can be drawn:

1. For write queries (Q1, Q2, Q3, Q4, Q14, Q15), the Proposed model consistently outperforms the other models (Conventional, Optimized, and QPG) with the lowest response times. It demonstrates significant improvements in query performance, indicating better optimization and efficiency in handling write operations.

2. Regarding the read queries Q5, Q8, Q12, and Q13, the Proposed and Optimized models perform better than the other two. For queries Q6, Q7, and Q9, the Conventional, QPG, and Proposed models achieve lower response times than the Optimized model. For Q10 and Q11, the Proposed model shows poorer query performance than the Conventional and Optimized models but performs better than the QPG model. Therefore, the Proposed model performs better than the existing models for seven of the ten read queries (Q5, Q6, Q7, Q8, Q9, Q12, Q13), indicating improved query optimization and data retrieval strategies.

3. The Proposed model also outperforms the existing models in terms of speedup (SU) for the increased data volume. Specifically, it performs best against the Conventional model, with an SU of 1.3, and it also outperforms the Optimized and QPG models, with SUs of 1.2.

Table 17 Impact of increased data volume on query response time and query speedup factor
Fig. 20

Impact of increased data volume on query response time for CPU

With increased data volume, the Proposed model consistently exhibits the best performance across both read and write queries, achieving the lowest response times and the highest speedup factors compared to the Conventional, Optimized, and QPG models. This indicates that the Proposed model maintains efficient query performance as the dataset grows.

4.2.6.2 Scalability for GPU

To conduct the experimental analysis, we used an Amazon EC2 P2.xlarge instance with one NVIDIA K80 GPU, four vCPUs, and 63 GB of RAM. This instance type is optimized for high-performance computing and provides parallel processing capabilities, with a software configuration similar to that shown in Table 7. The results for the GPU are shown in Table 18, and the graphical representation is shown in Fig. 21. Based on these results, the following inferences can be drawn:

1. For write queries (Q1, Q2, Q3, Q4, Q14, Q15), the Proposed model consistently outperforms the other models (Conventional, Optimized, and QPG) with the lowest response times. It demonstrates significant improvements in query performance, indicating better optimization and efficiency in handling write operations.

2. Regarding the read queries Q5, Q8, Q12, and Q13, the Proposed and Optimized models perform better than the other two. For queries Q6, Q7, and Q9, the Conventional, QPG, and Proposed models achieve lower response times than the Optimized model. For Q10 and Q11, the Proposed model shows poorer query performance than the Conventional and Optimized models but performs better than the QPG model. Therefore, the Proposed model performs better than the existing models for seven of the ten read queries (Q5, Q6, Q7, Q8, Q9, Q12, Q13), indicating improved query optimization and data retrieval strategies.

3. The Proposed model also outperforms the existing models in terms of speedup (SU) for the increased data volume. Specifically, it performs best against the Conventional model, with an SU of 1.3, and it also outperforms the Optimized and QPG models, with SUs of 1.2.

Table 18 Query response time and query speedup factor for GPU
Fig. 21

Query response time for GPU

  As the data volume on the GPU increases, the Proposed model consistently demonstrates superior performance in both read and write queries. It consistently achieves the lowest response times and a substantial speedup factor compared to the Conventional, Optimized, and QPG models.

4.2.6.3 Throughput and Latency through sharding

MongoDB achieves scalability through a horizontal scaling technique called sharding, which is specifically designed to meet the high data-volume demands of modern applications. Sharding [7] partitions data horizontally across multiple servers, or shards, distributing the data across a cluster of machines to increase storage capacity and improve query performance. To check the scalability of the proposed model, we experimented by distributing the data among different clusters (nodes), as shown in Table 19. Our proposed model is deployed on MongoDB Atlas, which provides sharding capabilities to distribute the data efficiently. The general process for distributing data among different clusters in MongoDB is as follows (a minimal pymongo sketch of these steps is given after the list):

(a) Set up the clusters: Create individual clusters to host the shards, each with its own set of servers running MongoDB instances. For the experiments, we created configurations of 2, 3, and 4 clusters with 3 nodes each.

(b) Enable sharding: Enable sharding on the clusters by configuring the config servers and enabling sharding for the relevant databases or collections.

(c) Define the shard key: The shard key determines how data is divided across the clusters. The main strategies are range-based, hash-based, and compound keys. We chose hashed keys for our work because they automatically distribute the documents uniformly across the shards.

(d) Distribute the data: Insert or migrate data into the sharded collections. MongoDB distributes the data across the shards based on the defined shard key. The sharded clusters ensure load balancing of read and write operations and uniform data distribution.
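As referenced above, the sketch below illustrates steps (b) through (d) with pymongo against a self-managed sharded cluster; the mongos URI, database, collection, and shard-key field are assumptions for illustration and do not reflect the actual Atlas deployment used in the experiments.

```python
# Hedged sketch: enabling sharding and creating a hashed shard key with
# pymongo, mirroring steps (b)-(d) above. The mongos URI, database,
# collection, and shard-key field are illustrative placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-router:27017")  # connect through mongos

# (b) Enable sharding for the database.
client.admin.command("enableSharding", "ecommerce")

# (c) Shard the collection on a hashed key so documents are spread
#     uniformly across the shards.
client.admin.command(
    "shardCollection", "ecommerce.Order",
    key={"customer_id": "hashed"},
)

# (d) Insert data; MongoDB distributes the chunks across the shards.
client["ecommerce"]["Order"].insert_many(
    [{"customer_id": i, "status": "new"} for i in range(10_000)]
)
```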

Table 19 Details of sharded clusters

After establishing the experimental setup, we conducted experiments to evaluate the scalability of the proposed model using sharding on the scaled data detailed in Table 16. Specifically, we focus on two key parameters across the data distributed on different nodes: (a) throughput (Sect. 4.2.6.4) and (b) latency (Sect. 4.2.6.5).

4.2.6.4 Throughput

Throughput refers to how much work or data the database system can process within a given time frame; it represents the rate at which MongoDB handles operations such as reads, writes, and queries. To maximize throughput in MongoDB, it is recommended to carefully design the database schema, optimize queries, use appropriate indexes, and scale the deployment horizontally by adding servers or shards as needed. To calculate the throughput of the data distributed on different nodes, we used Apache JMeter, a tool known for measuring metrics such as throughput and latency. Various test cases for CRUD operations were designed to compare the performance of the proposed model against the existing ones. Table 20 provides the measured throughput, in operations per minute (ops/min), for the different models on varying numbers of nodes; the graphical representation is shown in Fig. 22. For the Conventional model, the throughput ranges from 510 ops/min (3 nodes) to 1003 ops/min (12 nodes). The Optimized model achieves slightly better throughput, ranging from 514 ops/min (3 nodes) to 1128 ops/min (12 nodes). For the QPG model, the throughput ranges from 501 ops/min (3 nodes) to 1045 ops/min (12 nodes). For the Proposed model, the throughput is the highest, ranging from 518 ops/min (3 nodes) to 1169 ops/min (12 nodes). Overall, the results indicate that the Proposed model achieves the highest throughput, followed by the Optimized and QPG models. Furthermore, as the number of nodes increases, there is a general trend of improved throughput across all models, demonstrating the benefits of horizontal scaling.
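To make the metric concrete, the sketch below estimates operations per minute for a simple CRUD mix. It is an illustrative stand-in, not the Apache JMeter test plan used in the experiments; the URI, database, collection, and document shape are placeholders.

```python
# Hedged sketch: estimating throughput (operations per minute) for a simple
# CRUD mix against a cluster. This is an illustrative stand-in for the
# JMeter test plans used in the paper; all names are placeholders.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # e.g., a mongos router
coll = client["ecommerce"]["Order"]                # hypothetical collection

def throughput_ops_per_min(num_iterations=1000):
    start = time.perf_counter()
    for i in range(num_iterations):
        coll.insert_one({"order_id": i, "status": "new"})               # create
        coll.find_one({"order_id": i})                                  # read
        coll.update_one({"order_id": i}, {"$set": {"status": "paid"}})  # update
        coll.delete_one({"order_id": i})                                # delete
    elapsed_min = (time.perf_counter() - start) / 60
    return (num_iterations * 4) / elapsed_min  # four operations per iteration

print(f"Throughput: {throughput_ops_per_min():.0f} ops/min")
```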

Table 20 Throughput (ops/min) comparison on various distributed nodes
Fig. 22

Throughput (ops/min) on various distributed nodes

4.2.6.5 Latency

Latency refers to the delay between issuing a request and receiving a response from the server. While sharding can improve scalability and throughput, it can introduce additional latency due to the distributed nature of the system. The latency results, in milliseconds (ms), for data distributed over different nodes are presented in Table 21, and Fig. 23 provides a graphical representation. The Conventional model demonstrates low latency, ranging from 3.9 ms (3 nodes) to 27 ms (12 nodes). The Optimized model performs comparably across node configurations, with latency ranging from 3.5 ms (3 nodes) to 29.7 ms (12 nodes). The latency of the QPG model ranges from 4.1 ms (3 nodes) to 29.1 ms (12 nodes), which is comparable to the Conventional and Optimized models. The Proposed model's latency ranges from 3.3 ms (3 nodes) to 28.4 ms (12 nodes), similar to the other models. Overall, there is a modest rising trend in latency as the number of nodes grows, indicating that processing time may increase with the degree of distribution. In conclusion, all four models demonstrate relatively low and comparable latency across the different node configurations.

Table 21 Latency on various distributed nodes
Fig. 23

Latency comparison on distributed nodes

5 Conclusion

Designing a NoSQL database schema requires not only knowledge of the data but also an understanding of how the application needs to access it. This paper presents an automatic workload-driven model that derives the logical schema of a document-based NoSQL database from a conceptual model. The model takes as input the conceptual model and the application workload, consisting of the estimated data volume and the query workload. Query graphs are generated from the application workload to study the query characteristics, which are represented using query labels. These labels are then used to transform the conceptual model into a MongoDB logical schema.

The proposed model minimizes data-modeling difficulties for the popular MongoDB database. It does not rely on rules of thumb to select the appropriate schema, nor does it require expert help to design a logical schema. Therefore, the proposed work benefits novice programmers and helps them save time on schema design decisions during the early development phase of an application.

We employed three state-of-the-art schema generation models, termed Conventional, Optimized, and QPG, to validate the performance of the proposed model. Several parameters, including query response time, query speedup factor, read and write latency, aggregate pipeline efficiency improvement, storage space, collection-wise performance, and scalability, are used for the comparison. The experimental results show that the proposed model outperforms the Conventional, Optimized, and QPG models, achieving speedup factors of 1.2, 1.1, and 1.3 over the respective models. Additionally, the proposed model improved aggregate pipeline efficiency by 17.5%, 15%, and 10% compared to the Conventional, Optimized, and QPG models, respectively. The proposed model also showed good storage space utilization together with lower read and write latencies, and it maintained good performance when the data volume was doubled. Furthermore, by employing horizontal scaling techniques, the proposed model enables the system to handle growing data volumes efficiently, achieving high throughput and low latency, which highlights its effectiveness in distributed data scenarios. Based on these findings, the proposed model surpasses the existing models (Conventional, Optimized, and QPG) in multiple aspects, including query performance, storage space efficiency, aggregate pipeline efficiency, read-write latency, collection-wise performance, scalability, throughput, and latency. Therefore, the proposed model effectively tackles the challenges of managing the variety and volume of big data through a well-designed schema that improves system performance and supports scalability as datasets grow. As future work, we intend to extend this approach to other NoSQL categories, such as column-based and graph-based databases.