1 Introduction

With the rise of big data, applications need to change their schemas more frequently, making schema evolution crucial. This demand has given rise to NoSQL databases, a new category designed to overcome the limitations of traditional relational databases in handling big data and real-time applications characterized by massive data generation (volume) and diverse data formats (variety). NoSQL is an umbrella term for numerous non-relational database types. The four popular categories of NoSQL are document-based, column-based, key-value-based, and graph-based [1, 2]. These four categories share a similar logical structure, a key followed by a value; however, they differ in data modeling, data architecture, query languages, and APIs. Typically, the performance of these four categories depends on the use case.

Unstructured data collected from sources like sensors, social media, and natural language processing (NLP) holds valuable insights [3]. To extract these insights, new data storage solutions like Hadoop and NoSQL databases have emerged [4]. These technologies are extensively applied in domains such as the Internet of Things and by companies such as Facebook, Google, and Netflix [5, 6]. The increasing adoption of NoSQL databases for big data is driven by their ability to manage massive volumes of data without a predefined schema. NoSQL databases excel in handling unstructured and semi-structured data, aligning with the variety criterion of big data. Horizontal scalability using sharding and replication [7] is another key aspect addressed by NoSQL databases, allowing data to be distributed across multiple nodes to accommodate large volumes. Schema flexibility and horizontal scalability ensure efficient storage and processing without compromising performance. The proposed work is aligned with the variety and volume criteria of big data.

Although NoSQL database flexibility enables rapid initial development, since the application does not need to define a specific structure in advance [6, 8], the schema decision should be made early because (a) the application's overall performance depends on the schema choice, and wrong choices can affect several aspects of application quality, such as data redundancy, navigation cost, data access cost, and maintainability; (b) it is challenging to fix a poorly designed data model after an application has been developed; and (c) with a poorly designed schema, some queries may require excessive execution time or cannot be executed at all. Therefore, it is preferable to spend time in advance designing a data model that is scalable, extensible, and maintainable throughout the application's lifetime.

The flexibility of NoSQL databases empowers developers and organizations to store and manipulate data according to their specific requirements. As a result, there can be numerous schema alternatives that model the same information [9]. Analyzing and comparing these alternatives manually can be complex and time-consuming [8, 10]. Thus, there is a need for an automated tool or model that can evaluate various factors and recommend an optimal schema from the available alternatives. Two existing approaches automate this task: Workload-Agnostic and Workload-Driven.

The Workload-Agnostic approach [11,12,13] creates the database schema without considering any specific workload or usage patterns. The objective is to offer flexibility and adaptability to handle a variety of queries and data. However, this approach may not optimize performance for particular query patterns or workloads because it does not consider their specific characteristics. In a Workload-Driven approach [10, 14,15,16,17], on the other hand, the database schema is created for a specific workload or usage pattern. The schema design is influenced by the types of queries expected to be executed frequently, the data access patterns, and the workload's performance requirements. The goal is to optimize the schema design to improve query performance, reduce latency, and improve the overall system's efficiency. In our study, we have chosen a workload-driven approach to design an automated model that considers the workload queries and anticipated data volume to provide an optimal schema solution. We design a schema that best meets the performance requirements and efficiency goals by analyzing the query characteristics of the workload.

This paper proposes an automated model that transforms a conceptual model into an optimal logical schema design with the aid of labels. It consists of three parts: model input, intermediate transformation, and final schema generation. The model input consists of the EER model and the application workload; the application queries and the estimated data volume comprise the application workload. The intermediate transformation includes the generation of query graphs and query labels: the application queries are first transformed into query graphs, and the query graphs are then transformed into query labels using the data volume. The generation of query labels involves three steps: Label Categorization, Action Association, and Prioritization. The final schema generation consists of two parts: (a) generation of a Schema Graph and label assignment, and (b) transformation into a logical schema. The EER model is first converted into a graph model named the schema graph. The derived query labels are then assigned to the edges of the schema graph. Finally, the schema graph and labels are used to transform the EER model into an optimized logical schema based on the actions defined for each label. The working of the proposed model is evaluated through a case study in the eCommerce domain. We have picked MongoDB because it is the most popular document store [18] and is used in various applications, including eCommerce and mobile applications.

In this paper, we have made the following significant contributions:

  (a) The paper uses the application workload to generate NoSQL document logical schemas from the conceptual model. The workload information is provided by the designer in terms of estimated total data volume and queries.

  (b) The proposed model uses query graphs, query labels, and schema graphs to transform conceptual inputs into logical schemas.

  (c) Query graphs are generated from workload queries and are used to analyze query characteristics. Query labels represent the investigated query characteristics.

  (d) The derived query labels and the schema graph are used to design the logical schema for NoSQL document stores.

  (e) To evaluate the proposed model, experiments are conducted through a case study in the eCommerce domain.

  (f) The results show that the proposed model reduces query response time and accelerates data retrieval for the workload queries.

The remainder of the paper is organized as follows. Section 2 reviews related work; Sect. 3 presents the proposed model in detail; Sect. 4 presents the experimental evaluation; and Sect. 5 concludes the paper.

2 Related work and motivation

In the realm of Big Data applications, the large volume, variety, and velocity of data often surpass the capabilities of traditional relational databases [19]. NoSQL databases, such as MongoDB, Cassandra, HBase, and Neo4j, have emerged as vital technologies to overcome these challenges. They offer flexible data models, horizontal scalability, and high-performance data processing, making them well-suited for managing heterogeneous data and massive amounts of data in distributed environments [5, 6]. Distributed databases also support supercomputing by providing the necessary infrastructure and capabilities for large-scale data processing and high-performance computing workloads [20,21,22,23].

Many tools are available in the market for data modeling of traditional databases such as relational databases [24, 25]. However, these tools cannot be applied directly to NoSQL databases due to data modeling differences (normalized versus denormalized formats, respectively). The authors of [26, 27] comprehensively analyze the design requirements of NoSQL databases. Uta et al. [28] present case studies on top-down, bottom-up, and reverse engineering approaches for schema management in NoSQL databases. According to Paola Gomez et al. [29], the performance of a NoSQL system is determined by the selection of an appropriate schema design among all the design options. Similarly, Mior [30] states that the performance of a NoSQL database depends on the choice of an appropriate schema design and proposes a manual cost-based model based on workload queries for the physical optimization of column-based data stores. However, choosing the most suitable schema among all the possible alternatives (schema optimization) is difficult to perform manually. From this initial study, we identify the following research gaps:

  (a) Unlike relational databases, NoSQL databases allow various data structure alternatives, and choosing among them remains an ongoing research problem. Numerous researchers are working in this field [9, 16, 17, 31, 32].

  (b) The inherent flexibility and schema-less nature of NoSQL databases give rise to multiple schema design alternatives. For example, consider a scenario in which two entities representing students (S) and their faculty (F) are related by a one-to-many relationship \((r_{1})\). Relationship \((r_{1})\) can be materialized by nesting or referencing information from the related entities. Hence, there are multiple ways of schema design (S1 to S8) to store this information in document stores, as shown in Fig. 1 (adapted from [9]); an illustrative sketch of two of these alternatives is given after the figure. The choice among these schema designs depends on many factors, like data retrieval cost, query access patterns, and user needs. Manual schema design, typically guided by trial-and-error or ad-hoc methods, is time-consuming and does not guarantee an optimal design among the various alternatives. A recent study [31] found that only 9% of database experts identified the optimal design among these possibilities. This evidence shows that the current manual way of database design does not yield the expected results, even for the minimal scenario taken as an example. Consequently, automation becomes crucial for streamlining this complex process, reducing time requirements, and selecting the most suitable schema design from the available options.

  (c) Numerous researchers have adopted different methodologies to convert conceptual to logical schema designs. We have studied the existing models and compared them based on common characteristics, namely conceptual schema, additional inputs, conversion methodology, target model, and automation, as shown in Table 1. We have categorized the existing work into Workload-Agnostic (WA) and Workload-Driven (WD) approaches. WA does not consider the application workload, meaning the schema is designed without considering the specific queries or operations the application will perform on the database. In contrast, WD considers the application workload for NoSQL schema design. These methodologies consider the specific workload requirements, such as the types of queries, patterns, or operations the application is expected to perform on the database. By considering the workload, the schema can be better optimized to support the application's specific needs and improve performance.

Fig. 1 Schema design alternatives (S1-S8) in Document stores for ER model
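To make these alternatives concrete, the sketch below shows two of the designs from Fig. 1 as they might appear in a document store: one that embeds Student documents inside their Faculty document, and one that keeps separate collections connected by a reference. The collection and field names are illustrative and are not taken from [9].

```python
# Two illustrative document-store designs for the Student (S) / Faculty (F)
# one-to-many relationship r1. All field names are hypothetical.

# Alternative 1: embed students inside the faculty document (denormalized).
faculty_embedded = {
    "_id": "F1",
    "name": "Computer Science",
    "students": [
        {"roll_no": "S1", "name": "Asha"},
        {"roll_no": "S2", "name": "Ravi"},
    ],
}

# Alternative 2: keep two collections and reference the faculty from each student.
faculty_referenced = {"_id": "F1", "name": "Computer Science"}
students_referenced = [
    {"_id": "S1", "name": "Asha", "faculty_id": "F1"},
    {"_id": "S2", "name": "Ravi", "faculty_id": "F1"},
]
```

Embedding favors queries that read a faculty together with its students, while referencing keeps student writes independent of the faculty document; the remaining alternatives in Fig. 1 combine these two mechanisms in different directions.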

Table 1 Comparison of related work for NoSQL data modeling

2.1 Workload-agnostic (WA) approach

Li [11] gives heuristics for converting a relational schema into an HBase (NoSQL column store) schema. Similarly, the authors of [12] design a heuristics-based method for converting RDB to document stores using relationship types. Imam et al. [33, 34] propose an automated schema suggestion model for document databases, and Imam et al. [13] give manual heuristics-based guidelines to translate the ER model to document stores using relationship type and cardinality. The drawback of these works is that they are workload-agnostic (WA), i.e., they do not consider the application workload. Therefore, they do not guarantee an optimal schema design and can hamper the application's performance.

2.2 Workload-driven (WD) approach

Chebotko [35] offers the first workload-driven (WD) design method for mapping the ER model to Cassandra (NoSQL column store). The mapping is done based on the application workflow by taking the ERQL queries. The proposed technique improves the performance of read operations while decreasing the performance of write operations. Tianyu Jia et al. [36] use graphs and DAGs during schema migration from a relational schema to MongoDB. The graph is generated with the help of tags, and relational logs are used to define the tags on the ER model. The authors use a threshold to calculate the tags, but the lack of information about how the threshold is chosen makes the approach difficult to reproduce. Mior et al. [10] propose a tool that recommends schema designs for column stores. They use the ER model along with the workload queries; the query frequencies and the volume of data in each candidate plan are analyzed to suggest the best solution. However, the work applies to column stores only. The authors of [14, 37] provide a logical mapping from the conceptual model (EER, Extended Entity-Relationship) using initial workload information (in terms of the estimated number of data instances and primary query operations). They develop several rules based on workload data to map entities and relationships from EER to MongoDB. However, they consider the workload in the form of data volume only, which is insufficient to design an optimal schema. Vincent Reniers et al. [38] use the MongoDB schema to generate workload queries and the ER model. The authors also consider various dimensions during schema generation, but the model and methodology are not described clearly and are not automated.

Similarly, Ali et al. [17, 39] design a schema recommendation model based on query patterns. The authors translate the workload queries into query path graphs, which are then translated into a logical schema using various rules designed by the authors. They perform embedding for document stores but do not consider referencing during denormalization. The authors of [40, 41] use a canonical representation to suggest a denormalized model using application queries. The proposed model only applies to document stores and is difficult for novice users to adopt because it requires the estimated storage space as an input.

Similarly, Paola et al. [9] study various data structuring alternatives using software product line strategies and feature models. They develop a set of structural metrics to analyze the characteristics of these alternatives. Their work aims to propose a model that enables the automatic generation of multiple suitable data structure alternatives based on an initial UML model. The challenge of this work is accurately analyzing data structuring alternatives and generating a comprehensive set of suitable options while considering factors like performance, scalability, and maintainability.

As summarized in Table 1, both WA and WD approaches have advantages and considerations. WA approaches offer flexibility and adaptability to varying workloads, but they do not guarantee an optimal schema among the various alternatives. Existing WD approaches can provide more targeted optimal schema solutions but require a good understanding of the application workload. The work done so far for the WD approach considers either workload queries [9, 10, 17, 36, 40] or estimated data volume [14, 37] as input. However, to generate an efficient NoSQL schema using a Workload-Driven (WD) approach, it is necessary to consider the application workload in terms of both workload queries and estimated data volumes. To fill this gap in the literature, we develop a schema generation model based on a workload-driven approach that considers the application workload in the form of both workload queries and data volume, specifically for document stores.

3 Schema generation for document stores using workload-driven approach

This section details the proposed schema generation model for document stores using the workload-driven approach. As shown in Fig. 2, the proposed model consists of three parts: model input, intermediate transformation, and final schema generation. The model begins with a conceptual model and an application workload as input and produces a logical schema as output via an intermediate transformation. The graphical flow diagram in Fig. 3 shows how the three parts of the model (shown in Fig. 2) work together. The model is intended to be used for document stores during the early stages of application development.

Fig. 2 Workload-driven approach for Document Stores

Fig. 3 Graphical flow model

3.1 Model input

The proposed model takes a conceptual model in the form of an EER model and an application workload in the form of workload queries and expected data volume as input. The details of the model input are described in this section.

3.1.1 Conceptual model

The conceptual model comprehensively captures the application requirements and workflow and represents the information in a high-level abstraction in terms of entities, relationships, and constraints. Conceptual modeling employs numerous techniques, including ER (Entity-Relationship), EER (Extended Entity-Relationship), and UML (Unified Modeling Language). However, EER provides a more expressive and flexible representation of the relationships between entities in a database than ER and UML [42]. Hence, we have taken the EER model of a real-world case study as the conceptual model, shown in Fig. 4.

Fig. 4 The EER schema for an e-commerce application

Definition 1

An EER model is defined as \(EER=(T,R)\) where \(T=\{{t}_{1},\dots ,{t}_{n}\}\) is a set of entities, and \(R=\{({t}_{i},{t}_{j})|{t}_{i},{t}_{j}\in T\}\) is a set of relationships. A relationship \(r=({t}_{i},{t}_{j})\) represents the mutual connection between entities \({t}_{i},{t}_{j}\). Both entities and relationships have a set of attributes.

We have taken a case study adapted from [14], based on the eCommerce domain, as a sample database. In this case study, customers can place orders for various items of different products. Suppliers supply the products, which belong to many categories. Each order is paid through credit card or cash. The case study is closely related to a real-world scenario, and it is straightforward to explain our work using this sample database. The EER model of the eCommerce case study is shown in Fig. 4. A brief description of the EER model follows:

  (a) It consists of eleven entities \((T)\): {Person, Category, Customer, Product, Order, Item, Carrier, Supplier, Bill, CreditCard, Payment}.

  (b) It has eight relationships \((R)\): {request, delivery, owner, reference, composite, catalog, commitment, furnishing}. Each relationship has its own attributes and a relationship cardinality (1:1, 1:N, N:1, M:N), indicating how many objects of one entity can be associated with objects of another entity.

  (c) It also contains relationships of special types, such as generalization or union. For example, Payment consists of two types, CreditCard and Bill. These special relationship types are treated as regular one-to-one relationships.

  (d) The EER model displays the average (avg) access frequencies, which serve as the estimated data volume of the application.

We use the EER model of this case study to illustrate the work throughout the paper.
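For the sketches used throughout Sect. 3, the EER model of Definition 1 can be held in a small in-memory structure. The snippet below is a partial, illustrative encoding of Fig. 4; the pairing of relationship names with entity pairs and the cardinalities shown are our reading of the figure for illustration only, not an exact transcription.

```python
# Partial, illustrative encoding of the eCommerce EER model EER = (T, R).
ENTITIES = {"Person", "Category", "Customer", "Product", "Order", "Item",
            "Carrier", "Supplier", "Bill", "CreditCard", "Payment"}

# Each relationship: (name, entity_i, entity_j, cardinality). Subset only;
# the name/pair/cardinality assignment is assumed for the sake of the example.
RELATIONSHIPS = [
    ("request",   "Customer", "Order",   "1:N"),
    ("composite", "Order",    "Item",    "1:N"),
    ("reference", "Item",     "Product", "N:1"),
    ("catalog",   "Category", "Product", "1:N"),
    ("delivery",  "Carrier",  "Order",   "1:N"),
    ("owner",     "Order",    "Payment", "1:1"),
]
```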

3.1.2 Application workload

NoSQL databases do not support joins; embedding or referencing takes their place. The selection between embedding and referencing during data modeling of document stores is the most challenging decision. Deciding when to embed a document or instead create a reference between separate documents in different collections is an application workload consideration. Additionally, if the application workload is known during the early data modeling stage, it results in an optimized schema design solution. Hence, the application workload, which includes the estimated database volume and queries, is taken as the model's input. We have taken the seven most common queries to cover two different scenarios of an eCommerce platform: (a) Customer (Q1 and Q2) and (b) Seller (Q3-Q7). The seven designed queries are shown in Table 2, and a short sketch contrasting embedding and referencing follows it.

Table 2 Workload queries
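To illustrate why the embed-versus-reference decision depends on the workload, the sketch below shows how a query that reads an order together with its items would be expressed under the two designs using PyMongo; the connection string, database, collection, and field names are placeholders.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
db = client["ecommerce_demo"]                      # hypothetical database

order_id = 42

# Design A: items embedded inside the order document -> single read, no join.
order_with_items = db.orders_embedded.find_one({"_id": order_id})

# Design B: items referenced from a separate collection -> requires a $lookup
# (MongoDB's join-like aggregation stage) at query time.
order_with_items = list(db.orders.aggregate([
    {"$match": {"_id": order_id}},
    {"$lookup": {
        "from": "items",            # referenced collection
        "localField": "item_ids",   # hypothetical array of item references
        "foreignField": "_id",
        "as": "items",
    }},
]))
```

Design A avoids the join for reads that need an order and its items, whereas Design B keeps fast-growing or frequently modified items out of the order document; the query labels introduced below formalize this trade-off.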

3.2 Intermediate transformation

The intermediate transformation generates the Query Graph \((\mathrm{QG})\) and the Query Labels \((\mathrm{QL})\). It transforms the application queries into query graphs, which are then used to generate query labels with the help of the application's estimated data volume. We employ five distinct query labels, OnetoOne Relation, Frequent Lookup, Doc Size, Frequent Modify, and Cardinality, to cover all possible data modeling scenarios [14]. The detailed working of this phase is discussed in this section.

3.2.1 Generate query graphs

Each workload query returns information regarding one or more EER model entities. The derived information from EER is represented as a Query Graph \(\left(\mathrm{QG}\right)\) [43, 44]. A \(QG\) is a sub-graph derived from the EER model.

Definition 2

A Query Graph \((\mathrm{QG}\subseteq \mathrm{EER})\) consisting of \((N,E)\) is defined for each query \({q}_{i}\in {Q}_{n}\) as follows: nodes \(N\subseteq T\) correspond to the entities \(T\) of the EER model mentioned in \({q}_{i}\), and edges \(E=({n}_{i},{n}_{j})\subseteq R\) correspond to the set of relationships \(R=({t}_{i},{t}_{j})\). The procedure to generate a \({\mathrm{QG}}_{i}\) for each query \({q}_{i}\in {Q}_{n}\) is as follows:

  1. List all the entities \(({t}_{n}\in T)\) in each query \({q}_{i}\in {Q}_{n}\).

  2. For each \({q}_{i}\in {Q}_{n}\), identify the starting entity \({t}_{i}\in T\) and add it as a node \({n}_{i}\) in \({QG}_{i}\). Traverse the relationships \(r=({t}_{i},{t}_{j})\in R\) in the EER model to determine the other entities \(({t}_{j}\in {t}_{n})\) connected to the starting entity \(({t}_{i})\). Add the relationship as an edge \((e)\) and the connected entity as a node \({n}_{j}\) to the query graph \(({QG}_{i})\), along with the cardinalities of the relationship \(r\).

  3. If any entity \({t}_{i}\in T\) belongs to a relationship of a special type, such as generalization, add the entity as a node \({n}_{j}\) and the relationship \(r=({t}_{i},{t}_{j})\in R\) as an edge \((e)\) connecting nodes \({n}_{i}\) and \({n}_{j}\) in \(({QG}_{i})\). Add cardinality 1:1 on both sides of the edge \(e=({n}_{i},{n}_{j})\).

  4. If any of the traversed entities already added to the query graph have relationships with other entities \(({t}_{k}\in {t}_{n})\), repeat step 2 to traverse these relationships and add the connected entities to the query graph \(({QG}_{i})\).

  5. Continue this process until all the entities \(({t}_{n}\in T)\) and their relationships are traversed.

  6. Repeat the above steps for each query \({q}_{i}\in {Q}_{n}\).

Figure 5 depicts the Query Graphs \((QG)\) for all seven input queries mentioned in Table 2.

Fig. 5 Query Graphs generated from workload queries
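A minimal sketch of the query-graph construction is given below. It simplifies steps 1-6 by taking the sub-graph of the EER model induced by the entities the query mentions, rather than traversing from a starting entity, and it reuses the illustrative EER encoding from Sect. 3.1.1.

```python
# Minimal sketch: build a query graph QG_i = (N, E) for one query, given the
# entities the query mentions and the EER relationships (name, t_i, t_j, card).
def build_query_graph(query_entities, eer_relationships):
    nodes = set(query_entities)
    edges = []
    # Keep every EER relationship whose two entities both appear in the query;
    # this yields the sub-graph of the EER model induced by the query.
    for name, ti, tj, card in eer_relationships:
        if ti in nodes and tj in nodes:
            edges.append((ti, tj, name, card))
    return {"nodes": nodes, "edges": edges}

# Example (hypothetical): a query that reads an order with its items and products.
qg = build_query_graph({"Order", "Item", "Product"}, RELATIONSHIPS)
# -> edges for Order-Item ("composite", 1:N) and Item-Product ("reference", N:1)
```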

3.2.2 Generate query labels

In document stores, each entity of the EER model is represented by a collection, and documents of key-value pairs represent the records contained within the entities. Relationships are replaced by embedding or referencing. There are three types of relationships between entities in the EER model: one-to-one (1:1), one-to-many or many-to-one (1:N or N:1), and many-to-many (M:N). According to the official MongoDB documentation [45], as shown in Table 3, only embedding should be used during data modeling of document stores for a one-to-one (1:1) relationship.

Table 3 Embed/ Reference based on the type of relationship

However, for the other two types, one-to-many or many-to-one (1:N or N:1) and many-to-many (M:N) relationships, we can either embed the related documents into a single collection or use referencing between distinct documents in different collections. Embedding the documents or making a reference across different collections is an application-specific decision that depends on data growth, the read-write ratio, and query types. Based on these factors, the proposed model resolves the trade-off between embedding and referencing in the form of Query Labels \((\mathrm{QL})\). The \(\mathrm{QG}\) and expected data volume are utilized to determine the \(\mathrm{QL}\). We use five labels: OnetoOne Relation, Frequent Lookup, Doc Size, Frequent Modify, and Cardinality (Table 5).

  (a) OnetoOne Relation: Each one-to-one relationship belonging to a \(QG\) is labeled 'OnetoOne Relation'.

  (b) Frequent Lookup: If two or more entities are repeatedly accessed together, a 'Frequent Lookup' label is added to these entities.

  (c) Doc Size: The expected monthly access frequency of entity pairs in the application workload can be used to forecast the future size of a document. If the document size is expected to exceed 16 MB, the 'Doc Size' label is assigned to those entity pairs.

  (d) Frequent Modify: When two or more entities are frequently inserted, updated, or deleted together, the 'Frequent Modify' label is assigned to these entities.

  (e) Cardinality: If the ratio gap in a many-to-many (M:N) relationship is high, the 'Cardinality' label is assigned to these entities.

The process of query label generation is broken down into three steps: Label Categorization, Action Association, and Prioritization, as shown in Fig. 6.

Fig. 6 Generation of query labels

Step 1 Label Categorization.

NoSQL databases are designed for high performance and scalability, and one of the ways they achieve this is by storing related data together in a single document. This allows for faster data retrieval, as the data needed for a particular query is more likely to be in a single location. By analyzing the query characteristics, it is possible to determine which entities and attributes need to be accessed together and to group them in MongoDB. We represent the query characteristics in the form of Query Labels \((\mathrm{QL})\). The term 'entity pair' is used in the process of Label Categorization and refers to the nodes (\({n}_{i},{n}_{j}\)) bounded by an edge \((e)\) (relationship) within a query graph. An entity pair represents a specific connection or association between two entities in the Query Graph and is utilized to label the edges based on the relationships among the different nodes.

Definition 3

For \(\mathrm{QG}=(N,E)\), an Entity Pair (\({n}_{i},{n}_{j}, r\)) is defined as a pair of nodes (\({n}_{i},{n}_{j}\)), where (\({n}_{i},{n}_{j}\))\(\in N\), bounded by a relationship \(r\in E\), the edge connecting the two nodes. The procedure to produce all possible entity pairs from \({\mathrm{QG}}_{i}|i=1\dots n\) is given below:

  1. Initialize an empty list named 'EntityPairs'.

  2. For each edge \(e\in E\) in \({QG}_{i}\), let \({n}_{i}\), \({n}_{j}\) be the source and target nodes of \(e\). Create an entity pair (\({n}_{i},{n}_{j}, r\)) and add the pair to the 'EntityPairs' list.

  3. Repeat for each query graph \({QG}_{i}\) in the query graph list (\({QG}_{n}\)).

  4. Return the 'EntityPairs' list, which contains all the unique entity pairs across the query graphs \(({QG}_{i}|i=1\dots n)\).

For the eCommerce case study, eight 'Entity pairs (EP)' are formed from QG, as shown in Fig. 7.

Fig. 7 Entity pairs (EP) based on query graphs

Along with this, we have taken 40% as the threshold value. This assumption is rooted in the 60-40 rule, which states that 60% of the profit involves 40% of the data [46]. This rule helps identify the most critical data and optimize the database's performance by focusing on that data. The details of Label Categorization are given in Algorithm 1. The theoretical explanation is as follows:

  (a) The 'OnetoOne Relation' label is assigned to all entity pairs having a one-to-one relationship in the query graphs \(\left({QG}_{i}|i=1,\dots ,7\right)\). For instance, the entity pairs Order-Payment, Supplier-Person, Payment-Bill, and Payment-CreditCard have a one-to-one relation in the case study, so the 'OnetoOne Relation' label is assigned to these entity pairs, as shown in Fig. 8.

  (b) The 'Frequent Lookup' label is assigned to relationships accessed repeatedly in the application workload. To assign the label, the access count is calculated for each distinct entity pair. A threshold value is then calculated, which gives the maximum number of entity pairs that can be assigned the 'Frequent Lookup' label; the threshold is 40% of the maximum access count among the entity pairs. The access counts are sorted, and the 'Frequent Lookup' label is assigned, beginning from the highest access count, to as many entity pairs as the threshold allows.

Fig. 8 'OnetoOne Relation' Label

For instance, in the preceding case study, we counted the total number of times the distinct entity pairs are accessed together. As shown in Fig. 9, the access counts of the entity pairs across the \(QG\)s are: Customer-Order and Product-Category are each accessed twice; Order-CreditCard, Customer-CreditCard, and Order-Carrier are each accessed once; Order-Item is accessed four times; and Item-Product is accessed five times. The maximum access count of an entity pair is 5, so the threshold (40% of 5) is 2. Therefore, the label is assigned to the entity pairs with the top two access counts (4 and 5). Hence, the entity pairs Order-Item and Item-Product are labeled 'Frequent Lookup' (a sketch of these threshold computations is given at the end of this step).

  (c) For the 'Doc Size' label, two values are needed: the estimated data volume of each distinct entity pair in the application workload, and the average document size. The estimated data volume is taken from the EER model, and the average size of a single document can be calculated from the attributes of an entity. For simplicity, we assume that each document has five attributes (key-value pairs) with a maximum size of 12 bytes each (the maximum key-value size in MongoDB), so the average size of each document is 60 bytes. To compute the 'Doc Size' label, the expected data volume of each distinct entity pair is multiplied by the average document size. Entity pairs whose resulting size exceeds 16 MB (\(16\times {10}^{6}\) bytes) are assigned the 'Doc Size' label.

Fig. 9 'Frequent Lookup' Label based on Query graphs

As shown in Table 4 for the case study, the distinct entity pairs accessed in the \(QG\)s are named in the first column, and the expected data volume (as mentioned in Fig. 4) for each distinct entity pair is shown in the second column. The third and fourth columns show the expected document size in bytes and MB, respectively, obtained by multiplying the values of the second column by 60 bytes (the assumed average document size). According to Table 4, the size of 'Order-Carrier' exceeds the threshold value; hence, as shown in Fig. 10, this entity pair is assigned the 'Doc Size' label.

Table 4 The calculation for the 'Doc Size' Label
Fig. 10 'Doc Size' Label

  (d) For the 'Frequent Modify' label, the queries that perform write operations (insert, delete, update) on the database are selected from the application workload. For the selected queries, the corresponding \(\mathrm{QG}\)s are fetched, and the distinct entity pairs (related entities) accessed together are counted. The threshold value is 40% of the number of frequently accessed entity pairs fetched from the \(\mathrm{QG}\)s. Then, from the EER model, the data volume of each entity pair is determined. The estimated data volumes are sorted, and the 'Frequent Modify' label is assigned, beginning from the highest estimated data volume, to as many entity pairs as the threshold allows.

Based on these calculations, the 'Frequent Modify' label is assigned. For instance, queries Q1, Q2, Q3, and Q4 perform write operations on the database. From the corresponding \({QG}_{i}|i=1,\dots ,4\) (Fig. 5), Customer-Order, Order-Item, Item-Product, Order-Payment, Supplier-Payment, Product-Category, and Order-Carrier are the seven entity pairs accessed, so the threshold (40% of 7) is 2. Figure 4 shows the estimated data volumes of these entity pairs: Order-Customer 90, Order-Payment 1, Order-Item 65, Item-Product 2541, Supplier-Payment 16, Product-Category 35, and Order-Carrier 283,440. Therefore, the 'Frequent Modify' label is assigned to the Item-Product and Order-Carrier entity pairs, which have the two highest values, 2541 and 283,440, respectively, as illustrated in Fig. 11.

  (e) The 'Cardinality' label is affixed to many-to-many relationships because the M:N ratio is evaluated during logical schema generation for this relationship type: if the ratio gap is high, one-way embedding is performed; otherwise, two-way embedding is performed. The provided case study does not include any M:N relationship. For the sake of understanding, however, if a book belongs to at most 5 categories (n) while a category contains up to 50,000 books (m), the M:N ratio gap is high, so one-way embedding is applied. If an author writes at most 3 books (n) and a book has at most 5 authors (m), the M:N ratio gap is low, so two-way embedding is used.

Fig. 11 'Frequent Modify' Label
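The threshold computations behind the 'Frequent Lookup' and 'Doc Size' labels can be sketched as follows; the 'Frequent Modify' label follows the same thresholding pattern over the entity pairs of the write queries. The access counts and data volumes are the illustrative figures quoted above, and the 40% threshold and 60-byte average document size are the assumptions stated earlier.

```python
import math

AVG_DOC_SIZE_BYTES = 60           # assumed: 5 key-value pairs x 12 bytes
MAX_DOC_SIZE_BYTES = 16 * 10**6   # 16 MB document size limit used in the text

# Illustrative access counts of entity pairs across the query graphs (Fig. 9).
access_counts = {
    ("Customer", "Order"): 2, ("Product", "Category"): 2,
    ("Order", "CreditCard"): 1, ("Customer", "CreditCard"): 1,
    ("Order", "Carrier"): 1, ("Order", "Item"): 4, ("Item", "Product"): 5,
}

def frequent_lookup_labels(counts, threshold_ratio=0.4):
    # Threshold = 40% of the maximum access count, i.e. how many pairs to label.
    k = math.floor(max(counts.values()) * threshold_ratio)
    ranked = sorted(counts, key=counts.get, reverse=True)
    return set(ranked[:k])

def doc_size_labels(volumes):
    # Label pairs whose projected document size exceeds 16 MB.
    return {pair for pair, vol in volumes.items()
            if vol * AVG_DOC_SIZE_BYTES > MAX_DOC_SIZE_BYTES}

print(frequent_lookup_labels(access_counts))
# prints the top-2 pairs, Order-Item and Item-Product (threshold = 40% of 5 = 2)

# Illustrative estimated data volumes (Fig. 4); only Order-Carrier crosses 16 MB.
volumes = {("Order", "Carrier"): 283_440, ("Item", "Product"): 2_541}
print(doc_size_labels(volumes))
```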

Step 2 Action Association.

While designing the schema from the EER model into MongoDB, embedding or referencing is performed on entity pairs. We associate actions with the Query Labels to determine when to embed or reference an entity pair. The association of actions follows the data modeling guidance given on the official MongoDB website [45]. According to this guidance, three actions can be performed on entity pairs: one-way embedding, two-way embedding, and referencing. The action associated with each Query Label is as follows: (i) For the 'OnetoOne' label, the entity pairs must always be embedded together using one-way embedding. (ii) For the 'Frequent Lookup' label, entity pairs frequently accessed together should always be embedded because they involve many read operations; hence, the data should be stored in the same location. (iii) For the 'Doc Size' label, the entities should always be referenced, because when a document grows beyond 16 MB, MongoDB must allocate a new memory location for the growing document and copy the old document to the new space, which involves many input/output operations and can affect MongoDB's performance. (iv) The 'Frequent Modify' label involves more write operations (insert, update, and delete); write-intensive entities should always be referenced. (v) For the 'Cardinality' label, one-way or two-way embedding is performed depending on the M:N ratio. The action associated with each label is displayed in Table 5.

Table 5 Complete information about Query Labels

Step 3 Prioritization of Labels.

If a relationship comprises more than one label, label prioritization addresses the trade-off between the actions associated with the labels. According to the summary outlined in [6], between embedding and referencing, the highest priority is assigned to referencing because write-heavy operations and large documents must be prioritized first. Hence, referencing is assigned the highest priority, i.e., 1. Between one-way and two-way embedding, one-way embedding is assigned priority 2, whereas two-way embedding is assigned priority 3, as shown in Table 5.
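The label-to-action mapping and the priority rule of Table 5 can be captured in a small lookup table; the sketch below assumes the actions and priorities exactly as stated above, with the action names chosen for illustration.

```python
# Action and priority per query label (1 = highest priority), following Table 5.
LABEL_RULES = {
    "OnetoOne Relation": ("one_way_embed", 2),
    "Frequent Lookup":   ("one_way_embed", 2),
    "Doc Size":          ("reference",     1),
    "Frequent Modify":   ("reference",     1),
    "Cardinality":       ("two_way_embed", 3),  # or one-way, depending on the M:N ratio
}

def resolve_action(labels):
    """Pick the action of the highest-priority (lowest number) label on an edge."""
    if not labels:
        return "separate_collection"
    action, _ = min((LABEL_RULES[l] for l in labels), key=lambda rule: rule[1])
    return action

# Example: an edge labeled both 'Frequent Lookup' and 'Doc Size' resolves to referencing.
assert resolve_action({"Frequent Lookup", "Doc Size"}) == "reference"
```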

Algorithm 1 Label Categorization

3.3 Final schema generation

Final schema generation is categorized into two parts: (i) generation of the Schema Graph \((\mathrm{SG})\) and label assignment, and (ii) transformation into the logical schema, as shown in Fig. 12. In the first part, the EER model is converted into a graph named the Schema Graph \((\mathrm{SG})\), and then the \(\mathrm{QL}\) are assigned to the \(\mathrm{SG}\). Based on the defined rules, the \(\mathrm{SG}\) is then transformed into a MongoDB logical schema.

Fig. 12 Final schema generation

3.3.1 Generation of schema graph and label assignment

The EER model is first converted into a graph model \((\mathrm{SG})\) by representing EER entities as nodes and relationships as edges. After that, the Query Labels \((\mathrm{QL})\) are assigned to Schema Graph \((\mathrm{SG})\).

Definition 4

A Schema Graph \(\left(\mathrm{SG}\right)=({N}_{G},{E}_{G})\) is represented by nodes \(({N}_{G}\in T)\) and edges \(({E}_{G}\in R)\). The number of nodes \(({N}_{G})\) must equal the number of entities \((T)\), and the number of edges \(({E}_{G})\) must equal the number of relationships \((R)\) in the EER model.

Algorithm 2 gives the detailed procedure of Schema Graph \((\mathrm{SG})\) generation. The algorithm iterates through the entities \((T)\) and relationships \((R)\) in the EER model. For each entity \((t\in T)\), a corresponding node \(({n}_{G})\) is created in the SG. Similarly, for each relationship \((r\in R)\), an edge \(({e}_{G})\) is added to the SG, connecting the start \(({t}_{i})\) and end \(({t}_{j})\) entities of the relationship \((r)\). This process continues until all entities and relationships in the EER model have been processed. The resulting Schema Graph provides a visual representation of the EER model. After creating the Schema Graph, the query labels are assigned to the edges of the \(\mathrm{SG}\). Figure 13 depicts the \(\mathrm{SG}\) generated from the EER model, consisting of 11 nodes and 11 edges, along with the assigned \(\mathrm{QL}\) on the edges.
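A minimal sketch of the Schema Graph construction and label assignment described by Algorithm 2 is shown below. It reuses the illustrative EER encoding from the earlier sketches and is a simplified stand-in for the algorithm, not a verbatim implementation.

```python
def build_schema_graph(entities, relationships, query_labels):
    """Create SG = (N_G, E_G) from the EER model and attach query labels to edges.

    query_labels maps an unordered entity pair to the set of labels derived in
    Sect. 3.2.2, e.g. {frozenset({"Order", "Item"}): {"Frequent Lookup"}}.
    """
    nodes = set(entities)                       # one node per EER entity
    edges = []
    for name, ti, tj, card in relationships:    # one edge per EER relationship
        labels = query_labels.get(frozenset({ti, tj}), set())
        edges.append({"pair": (ti, tj), "relationship": name,
                      "cardinality": card, "labels": labels})
    return {"nodes": nodes, "edges": edges}

# Illustrative usage with hypothetical label assignments.
ql = {frozenset({"Order", "Item"}): {"Frequent Lookup"},
      frozenset({"Order", "Payment"}): {"OnetoOne Relation"}}
sg = build_schema_graph(ENTITIES, RELATIONSHIPS, ql)
```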

Fig. 13 Schema Graph (SG) with query labels

Algorithm 2 Generation of the Schema Graph (SG)

3.3.2 Transformation into logical schema

A logical schema of document stores is derived from \(\mathrm{SG}\). The following rules are followed to generate a logical schema:

  (1) If a single \(\mathrm{QL}\) is assigned to an edge of the \(\mathrm{SG}\), the decision is based on the action associated with that \(\mathrm{QL}\), as discussed in Sect. 3.2.2.

  (2) If more than one \(\mathrm{QL}\) is assigned to an edge of the \(\mathrm{SG}\), the decision is based on the priority associated with the \(\mathrm{QL}\)s, and the action with the highest priority is taken.

  (3) If no \(\mathrm{QL}\) is assigned to an edge of the \(\mathrm{SG}\), a separate collection is created for each entity connected by that edge.

Figure 14 depicts the final logical schema for the case study after applying all phases of the proposed model. Since no label is assigned to the entity pairs Supplier-Product, Product-Category, Order-Customer, and Carrier-Customer, separate collections are created for each of these entities following rule 3. For the remaining entities, a label exists on the connecting edge, so by following rules 1 and 2, the entities are embedded or referenced in each other.

Fig. 14 Logical schema generated from Schema Graph (SG)

The procedure of the proposed model that encompasses all the phases is given as Algorithm 3. The algorithm designs the schema automatically by transforming the model's inputs into the logical schema of MongoDB. The Query Graph \(\left(\mathrm{QG}\right)\) is generated for each query \(({q}_{i}\in {Q}_{n})\) of the application workload (Line 1). The Query Labels \((\mathrm{QL})\) are generated using the Query Graphs \(\left(\mathrm{QG}\right)\) (Line 2). The EER model is converted into the Schema Graph \((\mathrm{SG})\) (Line 3) using Algorithm 2. Then, the calculated Query Labels \((\mathrm{QL})\) are assigned to the Schema Graph \((\mathrm{SG})\) (Line 4). The \(\mathrm{SG}\) is converted into the logical schema using the actions associated with the assigned Query Labels \((\mathrm{QL})\) (Lines 5-23). Each edge \({e}_{i}\in E\) is removed from the \(\mathrm{SG}\) in turn. If it carries a \(\mathrm{QL}\), the corresponding action is applied: for the OnetoOne Relation label, the entities are embedded in one another; for the Frequent Lookup label, with relationship type 1:1 or 1:N the child entity is embedded into the parent entity, and with relationship type N:1 the child is embedded into the parent as an array of embedded objects; for the Doc Size or Frequent Modify label, referencing is performed, and the child entity is referenced in the parent entity; for the Cardinality label, one-way or two-way embedding is performed depending on the M:N ratio gap. If an edge \(({e}_{i}\in E)\) carries more than one query label, the action is chosen based on the priority of the labels. If no label is assigned to the edge, a new collection is created. If the number of entities is N and the number of relationships is M, Algorithm 3 has a time complexity of \(O(N+M)\).
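The edge-processing loop of Algorithm 3 (Lines 5-23) can be sketched as follows, reusing resolve_action and the schema graph from the earlier sketches. The sketch only records the embed/reference decision per entity pair and leaves the concrete document layout to the designer.

```python
def derive_logical_schema(schema_graph):
    """Walk the SG edges and decide, per entity pair, how to materialize it."""
    decisions = []
    for edge in schema_graph["edges"]:
        action = resolve_action(edge["labels"])   # priority-based resolution
        if action == "separate_collection":
            # Rule 3: no label -> keep the two entities as separate collections.
            decisions.append((edge["pair"], "separate collections"))
        elif action == "reference":
            # Doc Size / Frequent Modify: reference the child from the parent.
            decisions.append((edge["pair"], "reference child in parent"))
        else:
            # OnetoOne / Frequent Lookup / Cardinality: embed according to the
            # cardinality (array of embedded objects for the many side).
            decisions.append((edge["pair"], f"embed ({edge['cardinality']})"))
    return decisions

for pair, decision in derive_logical_schema(sg):
    print(pair, "->", decision)
```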

Algorithm 3 Proposed schema generation model

4 Experimental evaluation

We evaluate our approach with an experiment in the e-commerce domain, as described in Sect. 3. The experiments validate our proposed model and demonstrate its positive effect on query processing time. The performance of our proposed model is compared with three existing models: (i) Conventional [37]: a workload-agnostic model whose logical schema is designed without an application workload by following the relationship constraints only, as shown in Fig. 15a; (ii) Optimized [37]: a workload-driven model, shown in Fig. 15b, whose logical schema is generated based solely on the expected data volume of the application workload; and (iii) Query Path Graph (QPG) [17]: a model that considers application query patterns for the transformation from conceptual to logical modeling, as shown in Fig. 15c. We chose these three existing models because they define and explore three distinct ways of MongoDB data modeling, and their work is comparable to ours. Table 6 gives a qualitative analysis of the four models based on several important factors: query response time, query speedup, write latency, read latency, number of pipeline stages, pipeline efficiency, storage space, scalability, throughput, and latency. The qualitative analysis indicates that our model performs well for all factors; however, experiments are conducted to substantiate this analysis with numbers. The following sections detail the experimental setup and the results that verify the qualitative analysis.

Fig. 15 Existing logical models used for performance comparison: a Conventional, b Optimized, c QPG

Table 6 Qualitative analysis of existing and proposed model

4.1 Experimental setup

The experimental analysis is performed on an Intel Core i7-1255U processor with 16 GB of RAM, a 3-level cache, and a 1 TB hard disk. The data is stored using MongoDB Atlas, a cloud-based database service provided by MongoDB. The experimental setup utilizes a dedicated MongoDB Atlas M20 cluster configuration with 4 GB RAM, 20 GB storage, and 2 vCPUs. The cluster is configured in the AWS Mumbai (ap-south-1) region running MongoDB version 6.0.6. It consists of a replica set with three nodes to ensure load balancing of read and write operations, data availability, and reduced query response time. Studio 3T, a graphical user interface (GUI)-based MongoDB IDE [47], is also used. The experimental setup is summarized in Table 7.

Table 7 Experimental setup

To compare query performance, we created four physical databases in MongoDB for the schemas shown in Figs. 14 and 15, named conventional, optimized, QPG, and proposed. We populated all four databases with identical data, as shown in Table 8. The seven queries (Q1-Q4 CRUD queries, Q5-Q7 aggregation pipelines) outlined in Sect. 3.1.2 are executed on each database. As stated in Table 9, along with the seven input queries, the performance is measured on eight additional queries (Q8-Q15), because queries evolve over time and new queries are continually added to the system. Hence, to measure the performance of a system, it is necessary to measure it on run-time queries as well.

Table 8 The eCommerce dataset, along with the number of records for each EER entity
Table 9 Additional run-time queries for performance evaluation

4.2 Experimental evaluation

To perform the experimental analysis, we measure various essential database performance parameters as listed in Table 6. Based on these parameters, the following comparisons are made between the proposed and existing models: (1) Query Response Time and Speedup (Sect. 4.2.1), (2) Read and Write Latency (Sect. 4.2.2), (3) Efficiency Improvement using the aggregation pipeline (Sect. 4.2.3), (4) Storage Space (Sect. 4.2.4), (5) Collection-wise Performance (Sect. 4.2.5), (6) Scalability (Sect. 4.2.6), and (7) Throughput and Latency (Sect. 4.2.7).

4.2.1 Query response time and speedup

Query response time refers to the time the database takes to process and respond to a request for information. It is an important performance parameter, as it affects the speed and efficiency of the database; in general, faster query response times are desirable, as they lead to better performance and user experience. We also calculate a unitless speedup factor, obtained for each query by dividing the query response time of the existing model by that of the proposed model. The overall result is summarized through the geometric mean (GM) of all 15 queries, because the GM is the appropriate, meaningful average for normalized unitless numbers [48]. The GM enables broad at-a-glance speedup comparisons between the existing and proposed models' performance. The formula to calculate the average query speedup factor is:

$$\mathrm{Average\;Query\;Speedup\;Factor}= \sqrt[N]{\prod_{n=1}^{N}{Q}_{n}}$$

where \({Q}_{n}\) is the speedup factor for the nth query:

$${Q}_{n}= \frac{{T}_{{n}_{\mathrm{EM}}}}{{T}_{{n}_{P}}}$$

where \({T}_{{n}_{P}}\) is the query execution time of the nth query under the proposed model, \({T}_{{n}_{\mathrm{EM}}}\) is the query execution time of the nth query under the existing model, and N is the total number of workload queries.
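A sketch of how the per-query response times and the geometric-mean speedup can be computed is given below. It assumes PyMongo, averages three runs per query as in our procedure, and uses placeholder pipelines, collection names, and timing values rather than measured results.

```python
import time
from math import prod
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["ecommerce_demo"]  # placeholder

def avg_response_time_ms(collection, pipeline, runs=3):
    """Average wall-clock time of an aggregation over several runs."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        list(db[collection].aggregate(pipeline))   # force full result retrieval
        times.append((time.perf_counter() - start) * 1000)
    return sum(times) / runs

def geometric_mean_speedup(existing_ms, proposed_ms):
    """Q_n = T_EM / T_P per query, summarized by the geometric mean."""
    speedups = [em / p for em, p in zip(existing_ms, proposed_ms)]
    return prod(speedups) ** (1 / len(speedups))

# Illustrative numbers only (not measured values).
existing = [12.0, 8.5, 30.2]
proposed = [9.0, 8.0, 15.1]
print(geometric_mean_speedup(existing, proposed))  # > 1 means the proposed model is faster
```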

We performed an extensive evaluation of the proposed model against the existing models on MongoDB. To conduct the experiment, we ran each of the 15 queries on the four models. Three runs were made for each query to avoid results distorted by MongoDB's caches, and the average of the three runs is taken as the query response time. The speedup factor for an individual query \(({q}_{i}\in {Q}_{n},\;\mathrm{where}\;i=1,\dots ,15)\) is then calculated by dividing the query response time of the existing model by the query response time of the proposed model. Table 10 details the complete numerical figures of query response time and the speedup factor for all 15 queries, whereas the graphical representation of query response time is shown in Fig. 16. The average speedup factor is shown in the last row of Table 10. The following observations are made from Fig. 16.

Table 10 Query execution time of workload queries
Fig. 16 Query response time comparison among different models

It can be observed that the response time of the proposed model for queries Q1 and Q2 is lower than that of all other models. For queries Q5, Q6, Q8, and Q9, the proposed model performs much better than the existing models. The improvement arises because these queries access a specific Order, including its Items: the conventional and optimized schemas nest Orders inside the Customer collection, whereas QPG has to make a reference from Orders to the separate Items collection, which is time-consuming. In contrast, the schema generated by our model reads the documents from the Order collection directly. For queries Q7, Q10, and Q13, the proposed model is better than QPG but performs worse than the conventional and optimized models because these queries access Customer records, including Orders. In both the conventional and optimized models, this information can be accessed directly from the Customer-rooted collection, but in QPG and the proposed model the Customer collection has to be linked with the Order collection. QPG takes even more time because Orders are both nested inside Customer and kept in a separate collection, so time is spent on both nesting and referencing from the Customer collection to the Order collection. For Q11, the performance of the proposed model is better than the conventional model but worse than the optimized and QPG models, while for Q12 it performs worse than all three existing models. For Q4, Q14, and Q15, all models have almost the same performance. Hence, we conclude that the performance of the proposed model is similar to or better than the existing models for both input and run-time queries.

4.2.2 Write and read latency

Read latency is the time taken to retrieve data from the database, while write latency is the time taken to store data in the database. Latency can be affected by factors such as the schema from which data is read or written, the workload on the database, and the type of storage used. In our case, to measure the effect of schema and workload on latency, the queries are divided into two categories: (1) write queries (Q1-Q4, Q14-Q15) and (2) read queries (Q5-Q13). We take the average query response time for each category. The results are shown in Table 11, and the graphical representation is shown in Fig. 17. The proposed model reduces write latency by factors of 1.14, 1.17, and 1.33 and read latency by factors of 1.19, 1.09, and 1.37 compared to the Conventional, Optimized, and QPG models, respectively. Hence, the proposed model outperforms all three existing models in terms of write and read latency, as lower latency means better performance.

Table 11 Write and Read latency for each schema model
Fig. 17 Write and Read Latency among Proposed and existing models

4.2.3 Efficiency improvement using aggregate pipeline

MongoDB uses an aggregation pipeline framework for complex query processing [49]. The aggregation pipeline processes documents from different collections and returns computed results. It is a powerful tool for data processing and analysis in MongoDB and can perform a wide variety of operations on data, including data transformation, aggregation, and analysis.

The efficiency \((\eta)\) of the aggregation pipeline in MongoDB depends on several factors, including the number of pipeline stages and the size and organization of the data being processed. Efficiency improvement \((\eta \%)\) measures the performance of the aggregation pipeline in terms of the number of aggregation stages required. To calculate \(\eta\), the execution plan is first analyzed, which gives the number of stages for each individual MongoDB query in all four models, as shown in Table 12. Then the total number of stages \((S=\sum_{i=1}^{n}{S}_{i})\) over the queries \({q}_{i}\in {Q}_{n},\;i=1,\dots ,15\) is calculated. The formula used to calculate the efficiency improvement is:

$$\text{Efficiency improvement}\;\left(\eta \%\right)=\left(\eta -100\right)\%,\quad \mathrm{where}\;\eta =\frac{{S}_{{\mathrm{EM}}_{i}}}{{S}_{P}}\times 100,\; i=1,2,3$$

where \({S}_{{\mathrm{EM}}_{i}}\) is the total number of stages in the ith existing model and \({S}_{P}\) is the total number of stages in the proposed model.
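A worked sketch of the efficiency-improvement computation is shown below; the stage totals used are placeholders rather than the values of Table 12.

```python
def efficiency_improvement(total_stages_existing, total_stages_proposed):
    """Return eta% = (S_EM / S_P * 100) - 100, the percentage stage reduction."""
    eta = total_stages_existing / total_stages_proposed * 100
    return eta - 100

# Hypothetical stage totals over the nine aggregation queries.
print(efficiency_improvement(47, 40))   # -> 17.5 (% improvement over this baseline)
```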

Table 12 Number of Pipeline stages for each aggregate query

As mentioned above, nine of the fifteen queries are aggregation queries; hence, these aggregation queries are used to analyze the efficiency improvement. Table 12 shows the number of stages used by each query and the total number of stages used by all nine queries. The efficiency improvement of the aggregation pipeline of the proposed method against the existing models is calculated by the formula mentioned above.

Table 13 shows that the percentage efficiency improvement of the proposed model is 17.5% against the conventional model, 15% against the optimized model, and 10% against the QPG model. This section concludes that our method provides better pipeline efficiency than the existing methods. The reason is that the number of aggregation pipeline stages depends on the number of documents fetched for the application query, and a schema designed based on the application queries results in fewer documents being scanned during query processing. Hence, we conclude that our model suggests a better logical schema for the application workload than the existing models.

Table 13 Aggregate pipeline efficiency of the proposed model against existing models

4.2.4 Storage space

In MongoDB, the size of a collection refers to the total amount of disk space consumed by the data within that specific collection. The storage requirements in MongoDB can vary greatly depending on factors such as data volume, data model, and query usage patterns. Due to these factors, the various models (Conventional, Optimized, QPG, and Proposed) introduce different collections and exhibit variations in storage space. Table 14 shows each model's storage space and the total number of documents in each collection. Due to the schema flexibility, the four models have different collections, and a collection in one model may or may not be present in another. For example, the "Order" collection is present only in the QPG and Proposed models, whereas the "Payment" collection is present only in the Conventional model.

Table 14 Storage space occupied by each collection, along with total space occupied by each model

Figure 18 represents the graphical visualization of storage space variations across different models. Figure 18a illustrates the space occupied by each collection individually, providing a collection-wise comparison among the models. On the other hand, Fig. 18b shows the total storage space occupied by all the collections within a specific model.

Fig. 18 Storage Space comparison: a Collection-wise storage comparison, b Total space occupied by different models

Figure 18a shows that all four models have the same disk space for the Carrier, Supplier, and Category collections, indicating that these collections have almost consistent storage requirements across the models.

  1. 1.

    The disk space occupied by the Customer collection gradually decreases from the Conventional model to the Proposed model, indicating storage optimizations in the latter models. This decrease in disk space is caused by a lower level of embedding in the proposed models when compared to the Conventional model.

  2. 2.

    The Product collection shows variations in disk space among the different models. Compared to the Conventional and QPG models, the Optimized and Proposed models have more disk space allocated for the Product collection. This difference in disk space can be attributed to the fact that the Optimized and Proposed models include Furnishing and Catalog embedded documents within the Product collection. These additional embedded documents contribute to the Optimized and Proposed models' higher disk space utilization when compared to the other models.

  3. 3.

    The Order collection is absent in the Conventional and Optimized models but present in the QPG and Proposed models, which have significantly more disk space.

4. Figure 18b shows that the total disk space for the Conventional, Optimized, and Proposed models is relatively similar, with the Proposed model showing a slight reduction. The QPG model consumes more total disk space than the other three models because the Order collection is stored both separately and within the Customer collection, which takes up a significant amount of disk space.

4.2.5 Collection-wise performance

Collection-level performance analysis using "MongoTop" is a valuable technique to track the time taken by the read and write activity of each collection in a MongoDB instance. It provides insights into the most active collections in terms of disk I/O operations, which helps identify performance bottlenecks and optimize database operations. "MongoTop" is a tool provided by MongoDB that continuously samples data over a specified duration and provides real-time reports on the activity of individual collections. We analyze three parameters (Total, Read, Write) to gain a deeper understanding of collection-wise performance. "Total" shows the total amount of time, in microseconds, spent performing both read and write operations on a particular collection; examining this metric lets us assess a collection's overall workload and activity level. "Read" indicates the amount of time, in microseconds, dedicated to read operations on a particular collection; analyzing it identifies heavily read-intensive collections and provides insight into data access patterns and usage characteristics. "Write" displays the amount of time, in microseconds, spent performing write operations on a collection; examining it identifies collections with significant write activity, enabling us to focus on optimizing write-intensive operations. Table 15 provides the Total, Read, and Write activity for each collection across the Conventional, Optimized, QPG, and Proposed models.
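For reference, mongotop is typically run from the command line (e.g., sampling every few seconds); the sketch below shows a rough pymongo equivalent that reads the underlying top admin command which mongotop reports on. The connection URI is a placeholder, and the field layout assumed here follows the command's usual totals structure.

```python
# Hedged sketch: per-collection read/write time (in microseconds) via the
# "top" admin command that mongotop is built on. The connection URI is a
# placeholder; the totals layout is assumed from the command's usual output.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
totals = client.admin.command("top")["totals"]

for namespace, stats in totals.items():
    if not isinstance(stats, dict):
        continue  # skip the informational "note" entry
    print(f"{namespace}: total={stats['total']['time']} us, "
          f"read={stats['readLock']['time']} us, "
          f"write={stats['writeLock']['time']} us")
```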

Table 15 Collection-wise performance comparison

Figure 19 shows the graphical representation of the information in Table 15 and has three parts: 19a, 19b, and 19c. Part 19a compares the four models (Conventional, Optimized, QPG, and Proposed) based on the total time spent on read and write operations for each collection, part 19b compares the collections of each model based on read time, and part 19c highlights the write time for each collection across the four models. A detailed explanation is given below:

Fig. 19

Collection-wise performance comparison: a time taken by both read and write operations, b time taken by read operations, and c time taken by write operations

1. Figure 19a shows that the Proposed model has the lowest total time across all collections compared to the other models. The collections in the Optimized and Conventional models show slightly higher total times than the Proposed model but remain relatively close. The QPG model stands out with significantly higher total times, primarily due to the Order collection, which substantially impacts the overall time.

2. Figure 19b shows that, for the Customer collection, the Conventional model has the highest read time, followed by the Optimized model; the QPG and Proposed models have significantly lower read times. As a result, the overall read performance of the Conventional and Optimized models depends largely on the Customer collection. The Proposed model generally shows reduced read times compared to the Conventional and Optimized models, suggesting improved read performance, whereas the QPG model exhibits higher read times for certain collections because the Order data is stored in multiple locations.

3. Figure 19c shows that the QPG model has the highest write time for the Customer collection, while the other models have relatively lower write times; the Order collection also has a high write time in the QPG model. The Proposed model generally demonstrates lower write times than the Conventional and Optimized models and, among the four, performs best in terms of write operations.

The graphs highlight the performance differences among the Conventional, Optimized, QPG, and Proposed models. The Conventional and Optimized models exhibit similar total, read, and write times, which are higher than those of the Proposed model, because the Customer collection in both models is heavily embedded with Order documents, leading to performance bottlenecks. The Proposed model addresses this issue by separating the Customer and Order collections, resulting in improved performance. The QPG model stands out with significantly higher read and write times, attributed to the duplication of the Order collection. Overall, the Proposed model consistently demonstrates lower read and write times across the different collections, indicating a superior schema design compared to the existing models.

4.2.6 Scalability

To evaluate scalability, it is essential to test the model's ability to handle increased data volumes while maintaining acceptable query performance. Expanding the data volume in each collection allows us to simulate real-world scenarios with larger datasets and observe how the model performs under such conditions. We can determine whether the model scales well with increased data volume by analyzing query performance metrics such as response time and speedup. This information is crucial for capacity planning and for optimizing the database infrastructure so that it can handle growing workloads without sacrificing performance. To analyze scalability, we increased the data volume, as shown in Table 16, to roughly double the size reported in Table 14. We then run the fifteen workload queries (Tables 2 and 9) on both CPU and GPU to compare the query performance of the proposed model with the existing models. The GPU is chosen for its ability to accelerate computations through parallel processing, which is why GPUs are vital components of supercomputers.
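As a simple illustration of how response times and speedup factors of this kind can be obtained, the sketch below times one query against two schema variants and takes the ratio of the existing model's time to the proposed model's time. The query, database names, collection names, and connection URI are assumed placeholders, not the actual workload or deployment used in the experiments.

```python
# Hedged sketch: timing a workload query against two schema variants and
# computing the speedup factor (existing time / proposed time). All names
# and the query itself are illustrative placeholders.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

def response_time_ms(db_name, collection, pipeline):
    """Wall-clock time, in milliseconds, to run one aggregation query."""
    coll = client[db_name][collection]
    start = time.perf_counter()
    list(coll.aggregate(pipeline))  # exhaust the cursor so all work is done
    return (time.perf_counter() - start) * 1000

pipeline = [{"$match": {"status": "delivered"}},
            {"$count": "delivered_orders"}]  # placeholder query

t_existing = response_time_ms("conventional_db", "Customer", pipeline)
t_proposed = response_time_ms("proposed_db", "Order", pipeline)
print(f"Speedup (SU) = {t_existing / t_proposed:.2f}")
```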

Table 16 Scaling up the data volume for performance analysis
4.2.6.1 Scalability for CPU

The results for the CPU are shown in Table 17, and the graphical representation is shown in Fig. 20. Based on these results, the following inferences can be drawn:

1. For write queries (Q1, Q2, Q3, Q4, Q14, Q15), the Proposed model consistently outperforms the other models (Conventional, Optimized, and QPG) with the lowest response times. It demonstrates significant improvements in query performance, indicating better optimization and efficiency in handling write operations.

2. Regarding the read queries Q5, Q8, Q12, and Q13, the Proposed and Optimized models perform better than the other two. For queries Q6, Q7, and Q9, the Conventional, QPG, and Proposed models achieve lower response times than the Optimized model. For Q10 and Q11, the Proposed model shows poorer query performance than the Conventional and Optimized models but performs better than the QPG model. Therefore, the Proposed model performs better than the existing models for seven of the ten read queries (Q5, Q6, Q7, Q8, Q9, Q12, Q13), indicating improved query optimization and data retrieval strategies.

3. The Proposed model also outperforms the existing models in terms of speedup (SU) for the increased data volume. Specifically, it performs best against the Conventional model, with an SU of 1.3, and it also outperforms the Optimized and QPG models, with SUs of 1.2.

Table 17 Impact of increased data volume on query response time and query speedup factor
Fig. 20

Impact of increased data volume on query response time for CPU

With increased data volume, the Proposed model consistently exhibits the best performance across both read and write queries, achieving the lowest response times and the highest speedup factors compared to the Conventional, Optimized, and QPG models. This indicates that the Proposed model maintains efficient query performance as the dataset grows.

4.2.6.2 Scalability for GPU

To conduct the experimental analysis, we used an Amazon EC2 P2.xlarge instance with one NVIDIA K80 GPU, four vCPUs, and 63 GB of RAM. This instance type is optimized for high-performance computing and provides parallel processing capabilities, with a software configuration similar to that shown in Table 7. The results for the GPU are shown in Table 18, and the graphical representation is shown in Fig. 21. Based on these results, the following inferences can be drawn:

1. For write queries (Q1, Q2, Q3, Q4, Q14, Q15), the Proposed model consistently outperforms the other models (Conventional, Optimized, and QPG) with the lowest response times. It demonstrates significant improvements in query performance, indicating better optimization and efficiency in handling write operations.

2. Regarding the read queries Q5, Q8, Q12, and Q13, the Proposed and Optimized models perform better than the other two. For queries Q6, Q7, and Q9, the Conventional, QPG, and Proposed models achieve lower response times than the Optimized model. For Q10 and Q11, the Proposed model shows poorer query performance than the Conventional and Optimized models but performs better than the QPG model. Therefore, the Proposed model performs better than the existing models for seven of the ten read queries (Q5, Q6, Q7, Q8, Q9, Q12, Q13), indicating improved query optimization and data retrieval strategies.

3. The Proposed model also outperforms the existing models in terms of speedup (SU) for the increased data volume. Specifically, it performs best against the Conventional model, with an SU of 1.3, and it also outperforms the Optimized and QPG models, with SUs of 1.2.

Table 18 Query response time and query speedup factor for GPU
Fig. 21

Query response time for GPU

  As the data volume on the GPU increases, the Proposed model consistently demonstrates superior performance in both read and write queries. It consistently achieves the lowest response times and a substantial speedup factor compared to the Conventional, Optimized, and QPG models.

4.2.6.3 Throughput and Latency through sharding

MongoDB achieves scalability through a horizontal scaling technique called sharding, which is specifically designed to meet the high data-volume demands of modern applications. Sharding [7] partitions data horizontally across multiple servers, or shards, distributing the data across a cluster of machines to increase storage capacity and improve query performance. To check the scalability of the proposed model, we experimented by distributing the data among different clusters (nodes), as shown in Table 19. Our proposed model is deployed on MongoDB Atlas, which provides sharding capabilities to distribute the data efficiently. The general process for distributing data among different clusters in MongoDB is as follows (a minimal pymongo sketch of these steps is given after the list):

(a) Set up the clusters: Create individual clusters to host the shards, each with its own set of servers running MongoDB instances. For the experiments, we created configurations of 2, 3, and 4 clusters with 3 nodes each.

(b) Enable sharding: Enable sharding on the clusters by configuring the config servers and enabling sharding for the relevant databases or collections.

(c) Define the shard key: The shard key determines how data is divided across the clusters. The main strategies are range-based, hash-based, and compound keys. We chose hashed keys for our work because they automatically distribute the documents uniformly across the shards.

(d) Distribute the data: Insert or migrate data into the sharded collections. MongoDB distributes the data across the shards based on the defined shard key. The sharded clusters ensure load balancing of read and write operations and uniform data distribution.
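As referenced above, the sketch below illustrates steps (b) through (d) with pymongo against a self-managed sharded cluster; the mongos URI, database, collection, and shard-key field are assumptions for illustration and do not reflect the actual Atlas deployment used in the experiments.

```python
# Hedged sketch: enabling sharding and creating a hashed shard key with
# pymongo, mirroring steps (b)-(d) above. The mongos URI, database,
# collection, and shard-key field are illustrative placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-router:27017")  # connect through mongos

# (b) Enable sharding for the database.
client.admin.command("enableSharding", "ecommerce")

# (c) Shard the collection on a hashed key so documents are spread
#     uniformly across the shards.
client.admin.command(
    "shardCollection", "ecommerce.Order",
    key={"customer_id": "hashed"},
)

# (d) Insert data; MongoDB distributes the chunks across the shards.
client["ecommerce"]["Order"].insert_many(
    [{"customer_id": i, "status": "new"} for i in range(10_000)]
)
```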

Table 19 Details of sharded clusters

After establishing the experimental setup, we conducted experiments to evaluate the scalability of the proposed model using sharding on the scaled data detailed in Table 16. Specifically, we focus on two key parameters across the data distributed on different nodes: (a) throughput (Sect. 4.2.6.4) and (b) latency (Sect. 4.2.6.5).

4.2.6.4 Throughput

Throughput refers to how much work or data the database system can process within a given time frame; it represents the rate at which MongoDB handles operations such as reads, writes, and queries. To maximize throughput in MongoDB, it is recommended to carefully design the database schema, optimize queries, use appropriate indexes, and scale the deployment horizontally by adding servers or shards as needed. To calculate the throughput of the data distributed on different nodes, we used Apache JMeter, a tool known for measuring metrics such as throughput and latency. Various test cases for CRUD operations were designed to compare the performance of the proposed model against the existing ones. Table 20 provides the measured throughput, in operations per minute (ops/min), for the different models on varying numbers of nodes; the graphical representation is shown in Fig. 22. For the Conventional model, the throughput ranges from 510 ops/min (3 nodes) to 1003 ops/min (12 nodes). The Optimized model achieves slightly better throughput, ranging from 514 ops/min (3 nodes) to 1128 ops/min (12 nodes). For the QPG model, the throughput ranges from 501 ops/min (3 nodes) to 1045 ops/min (12 nodes). For the Proposed model, the throughput is the highest, ranging from 518 ops/min (3 nodes) to 1169 ops/min (12 nodes). Overall, the results indicate that the Proposed model achieves the highest throughput, followed by the Optimized and QPG models. Furthermore, as the number of nodes increases, there is a general trend of improved throughput across all models, demonstrating the benefits of horizontal scaling.
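To make the metric concrete, the sketch below estimates operations per minute for a simple CRUD mix. It is an illustrative stand-in, not the Apache JMeter test plan used in the experiments; the URI, database, collection, and document shape are placeholders.

```python
# Hedged sketch: estimating throughput (operations per minute) for a simple
# CRUD mix against a cluster. This is an illustrative stand-in for the
# JMeter test plans used in the paper; all names are placeholders.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # e.g., a mongos router
coll = client["ecommerce"]["Order"]                # hypothetical collection

def throughput_ops_per_min(num_iterations=1000):
    start = time.perf_counter()
    for i in range(num_iterations):
        coll.insert_one({"order_id": i, "status": "new"})               # create
        coll.find_one({"order_id": i})                                  # read
        coll.update_one({"order_id": i}, {"$set": {"status": "paid"}})  # update
        coll.delete_one({"order_id": i})                                # delete
    elapsed_min = (time.perf_counter() - start) / 60
    return (num_iterations * 4) / elapsed_min  # four operations per iteration

print(f"Throughput: {throughput_ops_per_min():.0f} ops/min")
```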

Table 20 Throughput (ops/min) comparison on various distributed nodes
Fig. 22

Throughput (ops/min) on various distributed nodes

4.2.6.5 Latency

Latency refers to the delay between issuing a request and receiving a response from the server. While sharding can improve scalability and throughput, it can introduce additional latency due to the distributed nature of the system. The latency results, in milliseconds (ms), for data distributed over different nodes are presented in Table 21, and Fig. 23 provides a graphical representation. The Conventional model demonstrates low latency, ranging from 3.9 ms (3 nodes) to 27 ms (12 nodes). The Optimized model performs comparably across node configurations, with latency ranging from 3.5 ms (3 nodes) to 29.7 ms (12 nodes). The latency of the QPG model ranges from 4.1 ms (3 nodes) to 29.1 ms (12 nodes), which is comparable to the Conventional and Optimized models. The Proposed model's latency ranges from 3.3 ms (3 nodes) to 28.4 ms (12 nodes), similar to the other models. Overall, there is a modest rising trend in latency as the number of nodes grows, indicating that processing time may increase with the degree of distribution. In conclusion, all four models demonstrate relatively low and comparable latency across the different node configurations.

Table 21 Latency on various distributed nodes
Fig. 23

Latency comparison on distributed nodes

5 Conclusion

Designing a NoSQL database schema requires not only knowledge of the data but also an understanding of how the application needs to access it. This paper presents an automatic workload-driven model that derives the logical schema of a document-based NoSQL database from a conceptual model. The model takes as input the conceptual model and the application workload, consisting of the estimated data volume and the query workload. Query graphs are generated from the application workload to study the query characteristics, which are represented using query labels. These labels are then used to transform the conceptual model into a MongoDB logical schema.

The proposed model minimizes data-modeling difficulties for the popular MongoDB database. It does not rely on rules of thumb to select the appropriate schema, nor does it require expert help to design a logical schema. Therefore, the proposed work benefits novice programmers and helps them save time on schema design decisions during the early development phase of an application.

We employed three state-of-the-art schema generation models, termed Conventional, Optimized, and QPG, to validate the performance of the proposed model. Several parameters, including query response time, query speedup factor, read and write latency, aggregate pipeline efficiency improvement, storage space, collection-wise performance, and scalability, are used for the comparison. The experimental results show that the proposed model outperforms the Conventional, Optimized, and QPG models, achieving speedup factors of 1.2, 1.1, and 1.3 over the respective models. Additionally, the proposed model improved aggregate pipeline efficiency by 17.5%, 15%, and 10% compared to the Conventional, Optimized, and QPG models, respectively. The proposed model also showed good storage space utilization together with lower read and write latencies, and it maintained good performance when the data volume was doubled. Furthermore, by employing horizontal scaling techniques, the proposed model enables the system to handle growing data volumes efficiently, achieving high throughput and low latency, which highlights its effectiveness in distributed data scenarios. Based on these findings, the proposed model surpasses the existing models (Conventional, Optimized, and QPG) in multiple aspects, including query performance, storage space efficiency, aggregate pipeline efficiency, read-write latency, collection-wise performance, scalability, throughput, and latency. Therefore, the proposed model effectively tackles the challenges of managing the variety and volume of big data through a well-designed schema that improves system performance and supports scalability as datasets grow. As future work, we intend to extend this approach to other NoSQL categories, such as column-based and graph-based databases.