1 Introduction

One of the key motivations behind the use of database systems is to integrate the operational data of an enterprise and to deliver unified, and thus controlled, access to that data. Data is retrieved by posing a query to the database system. A query is a language construct that retrieves a part of the data stored in the database [1]. When a user frames a query, the query processor frees the user from specifying the exact procedure for obtaining the required answer via various transparency mechanisms. It takes a query, parses it, optimizes it, and finally retrieves the intended result. Query processing thus transforms a high-level query into a low-level one, and this transformation must be complete and correct [2]. The main aim of query processing is to find the transformations that generate the query result most efficiently. In centralized databases, the query processor finds the most efficient equivalent relational algebra query, while in distributed systems it must also take the cost of data transfer into account.

The main contribution of this paper is to discuss various aspects of query processing and the modern-day challenges related to it. Section 1 illustrates existing query processing architectures; Sects. 2 and 3 then highlight query processing in centralized and distributed database systems, respectively. Section 4 examines multidatabase systems, and Sect. 5 summarizes the work done so far. The inherent challenges in modern-day query processing are discussed in the concluding Sect. 6.

1.1 Query Processing Architecture

The architecture of a query processing system determines the internal representation of a query, the enumeration of query execution plans, and the phase at which the system knows the cost of transformations. These factors are crucial in evaluating the complexity of the system. Many architectures have been proposed; two of them are the Haas et al. [3] and Carey et al. [4] architectures.

Haas Architecture: The main component is the parser, which translates the input query into an internal representation such as a query graph. Query rewrite transforms the query to enable optimizations. The query optimizer enumerates several plans that specify how the given query could be executed. Plan refinement and code generation convert the plan produced by the query optimizer into executable code. The catalog stores the information necessary for parsing, rewriting, and optimizing a query. The query execution engine provides the implementations of the operators used.

Carey Architecture: In addition to the components of the Haas architecture, the main component is the storage object manager, which provides the capabilities for modifying storage objects. Access methods give associative access to all files of objects and support versioning. Schema-related information in the databases is provided by the type manager to support a wide variety of applications. The various methods or operations on objects are supported by operator methods. Together, these components provide the dedicated functions related to query processing. The schematic Carey architecture for query processing is shown in Fig. 1, and Table 1 compares the two architectures.

Fig. 1 Query processing phases in Carey et al. architecture

Table 1 Comparison of Haas architecture and Carey architecture for query processing

2 Centralized Database Systems

Query processing has three steps: decomposition, optimization, and execution. Query decomposition is done by the query compiler, which goes through several phases such as parsing, semantic checking, and query rewrite [5]. The compiler validates the syntax of the input query and returns an appropriate error if it finds any discrepancies. In query-graph-model compilers, an internal representation of the query is formed. The compiler also checks for inconsistencies among the sub-parts of the input query and uses global semantics to transform the query into a format that is easier to optimize [6]. Recent advances in programming languages have led to the development of highly efficient query compilers [7].
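The sketch below is a deliberately tiny illustration of the parse, semantic-check, and rewrite phases described above. The grammar, the toy catalog, and all function names are hypothetical; real compilers such as those in [5, 7] are far more elaborate.

```python
# Minimal sketch of the compilation phases: parse -> semantic check -> rewrite.
# The one-table grammar and catalog below are illustrative only.

import re

CATALOG = {"emp": {"id", "name", "dept"}}  # toy catalog used for semantic checks

def parse(query: str) -> dict:
    """Parse a query of the form 'SELECT <cols> FROM <table>' into a tiny AST."""
    m = re.fullmatch(r"SELECT (.+) FROM (\w+)", query.strip(), re.IGNORECASE)
    if m is None:
        raise SyntaxError(f"cannot parse: {query!r}")
    cols = [c.strip() for c in m.group(1).split(",")]
    return {"select": cols, "from": m.group(2).lower()}

def check_semantics(ast: dict) -> None:
    """Validate the query against the catalog, returning an error on discrepancies."""
    table = ast["from"]
    if table not in CATALOG:
        raise NameError(f"unknown table: {table}")
    for col in ast["select"]:
        if col != "*" and col not in CATALOG[table]:
            raise NameError(f"unknown column: {col}")

def rewrite(ast: dict) -> dict:
    """Apply a trivial rewrite: expand '*' so later phases see explicit columns."""
    if ast["select"] == ["*"]:
        ast = {**ast, "select": sorted(CATALOG[ast["from"]])}
    return ast

ast = parse("SELECT * FROM emp")
check_semantics(ast)
print(rewrite(ast))  # {'select': ['dept', 'id', 'name'], 'from': 'emp'}
```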

Query optimization is the most critical phase in query processing; in centralized database systems it is a two-step process. The first step is logical optimization, which applies the classic transformation rules to the algebraic trees in order to reduce the volume of manipulated data. The second step is physical optimization, which determines an appropriate method for each join as well as the join ordering. These decisions are made by taking into account the sizes of the relations, the physical organization of the data, and the available access paths. Another approach is based on the response time of a query, defined as the time elapsed between the submission of a user query and the moment the first result is generated [2]. The response time is calculated as the sum of three separate costs: communication cost, input/output cost, and CPU cost.
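This decomposition is often written as a weighted sum. One common textbook formulation (the weights and counters below are illustrative and are not taken from this paper) is

$$\text{Total cost} = c_{CPU} \cdot \#insts \;+\; c_{I/O} \cdot \#I\!/\!Os \;+\; c_{MSG} \cdot \#msgs \;+\; c_{TR} \cdot \#bytes,$$

where the last two terms together form the communication cost. For response time, only the costs on the sequential (critical) path are summed, so work performed in parallel at different sites is not double-counted.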

There are generally two approaches to query optimization. In the top-down approach, a query expressed in relational calculus is broken down into smaller subqueries, while in the bottom-up approach, many plans are built from simpler sub-plans (Table 2).

Table 2 Comparison of bottom-up and top-down query optimization process
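As an illustration of the bottom-up style, the following sketch builds optimal plans for ever-larger subsets of relations out of optimal sub-plans, in the spirit of System R-style dynamic programming. The relation cardinalities and the naive cost model are hypothetical, chosen only to show the mechanics.

```python
# Bottom-up join enumeration: optimal plans for larger subsets are assembled
# from optimal plans for smaller subsets (dynamic programming over subsets).

from itertools import combinations

CARD = {"A": 1000, "B": 100, "C": 10}  # toy cardinalities per relation

def join_cost(left_card: int, right_card: int) -> int:
    """Naive cost model: cost of a join ~ product of input cardinalities."""
    return left_card * right_card

def best_plan(relations):
    # best[subset] = (cost, plan_tree, result_cardinality)
    best = {frozenset([r]): (0, r, CARD[r]) for r in relations}
    for size in range(2, len(relations) + 1):
        for subset in map(frozenset, combinations(relations, size)):
            for k in range(1, size):
                for left in map(frozenset, combinations(subset, k)):
                    right = subset - left
                    lc, lp, lcard = best[left]
                    rc, rp, rcard = best[right]
                    cost = lc + rc + join_cost(lcard, rcard)
                    card = max(lcard, rcard)  # crude output-size estimate
                    if subset not in best or cost < best[subset][0]:
                        best[subset] = (cost, (lp, rp), card)
    return best[frozenset(relations)]

print(best_plan(["A", "B", "C"]))  # the cheapest tree joins the small relations first
```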

3 Distributed Database Systems

A distributed DBMS is spread over a computer network in such a way that the distribution is transparent to the user. The participating databases must be logically related, and the only means of communication among sites is message passing [3]. In distributed databases, the communication cost plays the most critical role.

3.1 Challenges in Distributed Database Query Processing

Data distribution makes query processing a multifaceted procedure [2, 3, 8, 9].

Fragmentation and Replication: The distributed DBMS has to take into account all the fragments and the replicated copies relevant to the query.

Cost Model: The cost model selection is application specific. A database system can choose to follow the path of minimum resource consumption or minimizing the response time.

Search Space Exploration and Exploitation: The search space consists of all the equivalent query execution plans produced by applying the various transformation rules. The search space can grow exponentially as the number of relations involved in the query increases (a back-of-the-envelope illustration follows this list), so the search strategy has to be highly efficient.

Control Site Selection: The appropriate control site for final query processing should be selected carefully.

Where and When to Optimize: The timing and placement of the optimization process are deciding factors in the performance of a distributed DBMS. Alternate execution plans can be generated at compile time, with the optimizer selecting among the available options at run time. Phases such as parsing and rewriting can be performed at the client side, while QEP (query execution plan) refinement can be done at the server site. In crowdsourcing systems the complexity is relatively high [10]; the system must therefore relieve the user of this burden and take sole responsibility for compilation, generation of execution plans, and evaluation of the marketplace.
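To make the search-space growth concrete, the snippet below counts left-deep and bushy join trees for a few relation counts. The formulas are standard combinatorics and are not specific to any particular system.

```python
# Back-of-the-envelope illustration of how the plan search space explodes
# with the number of relations (see "Search Space Exploration" above).

from math import comb, factorial

def left_deep_plans(n: int) -> int:
    """Number of left-deep join orders of n relations (ignoring cross products)."""
    return factorial(n)

def bushy_plans(n: int) -> int:
    """Number of bushy join trees: n! leaf labelings times the (n-1)-th Catalan number."""
    catalan = comb(2 * (n - 1), n - 1) // n
    return factorial(n) * catalan

for n in (3, 5, 10):
    print(n, left_deep_plans(n), bushy_plans(n))
# 3 relations already allow 12 bushy trees; 10 relations allow ~1.76e10.
```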

3.2 Challenges in Distributed Database Query Execution Process

Once all the execution plans are generated, the best plans are selected and the rest are pruned at successive stages. The real challenge then is to execute the chosen plans efficiently [2, 3, 8, 9].

Selecting the data transfer mode: Typically, a row blocking technique is used, in which a whole block of data is transmitted rather than individual tuples.

Use of cached results: Whether to use data caches that temporarily store copies of data in order to exploit temporal locality.

Organization of sites on the network: The sites are typically arranged in a hierarchical way, so sending and receiving costs can vary significantly [11–13].

Optimization of Multicasts: Multicast communication can be done to increase the degree of parallelism.

Use of semi-joins: Semi-joins reduce the size of intermediate results by shipping only the join attributes needed to filter the other relation (their semantics match that of an EXISTS clause); a minimal sketch follows this list. However, this method pays off only where the join selectivity factor is good, i.e., where the semi-join eliminates a substantial fraction of tuples.

Execution Control: The control site could be a single site (centralized control), or a group of sites can act as control sites [13].
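The following is a minimal sketch of the semi-join tactic mentioned above, assuming two sites that can exchange data only by shipping values over the network; the table contents and names are hypothetical.

```python
# Semi-join reducer: ship the small projection first, then the reduced relation.

# Site 1 holds EMP(dept_id, name); Site 2 holds DEPT(dept_id, budget).
EMP  = [(1, "Ana"), (2, "Bo"), (2, "Cy"), (7, "Di")]
DEPT = [(1, 50_000), (2, 30_000), (3, 90_000)]

# Step 1: ship only the distinct join-attribute values of DEPT to site 1.
dept_ids = {d[0] for d in DEPT}            # a small projection crosses the network

# Step 2: semi-join at site 1 -- keep the EMP tuples with a matching dept_id.
emp_reduced = [e for e in EMP if e[0] in dept_ids]

# Step 3: ship the reduced EMP (3 tuples instead of 4) back for the full join.
result = [(e[0], e[1], d[1]) for e in emp_reduced for d in DEPT if e[0] == d[0]]
print(result)  # [(1, 'Ana', 50000), (2, 'Bo', 30000), (2, 'Cy', 30000)]
```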

The final challenge in query processing is to combine the various building blocks and build a full-fledged efficient query execution plan [2].

4 Multidatabase System

A multidatabase system (MDBS) is a collection of individual DBMSs that are fully autonomous and cooperate of their own will [2, 3]. In a multidatabase, the global conceptual schema is only a sub-part of the logical integrated schema formed from the individual schemas of the component databases.

4.1 Challenges in Multidatabase Query Processing

Query processing in a multidatabase system is highly complex and requires more effort than in a distributed DBMS [2, 3].

Computing Capability: The participating DBMSs may have varying computing capabilities, which hinders a uniform distribution of the query processing workload.

Cost Function: The participating DBMSs may have different cost functions. Hence, the local optimization differs for the same posed query.

Language Models: The participating DBMSs may have different language models, for example relational, object-oriented, or XML. This creates problems in translating the query for each component DBMS and in integrating the heterogeneous results.

Autonomy: The participating DBMSs exhibit autonomy, which can be defined across three dimensions: communication, design, and execution autonomy.

Semi-Join-Based Optimizations: Semi-join-based optimizations are extremely difficult, since the source and target tables reside on different component DBMSs. Retrieving the join attributes and shipping them from the target site to the source site, performing the join operation at the source site, and reflecting the result back at the target DBMS all add complexity.

Architectural Difficulties: The architecture of a multidatabase system is quite complex. The design of wrappers is difficult, since it has to account for the translation of queries, and with any change in the local schemas, the wrappers and the mediator have to be modified accordingly.

Data Distribution: The query needs to reach every possible source to obtain the maximally contained query. For systems that involve continuous processing of data streams, query optimization becomes much more complex, since data never stops arriving at the databases; a minimal sketch of window-based stream processing follows this list. In such cases, event processing middleware [14] is used to overcome the heterogeneity efficiently.

Integration of Results: After the results for an input query are obtained, they are sent back to the mediator to carry out the necessary translations. Integrating the heterogeneous results thus obtained is a daunting task.
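As a hypothetical sketch of the continuous-stream setting mentioned above: because the query never sees the whole input, results are produced per window rather than once at the end. The window size and event values below are illustrative only.

```python
# Windowed processing over a continuous stream: each arriving event updates
# a bounded buffer, and a per-window result is emitted immediately.

from collections import deque

def windowed_counts(stream, window=3):
    """Emit, for each arriving event, the count of events in the last `window`."""
    buf = deque(maxlen=window)
    for event in stream:
        buf.append(event)
        yield event, len(buf)

for event, count in windowed_counts(["a", "b", "c", "d"]):
    print(event, count)  # the count saturates at the window size: 1, 2, 3, 3
```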

5 Query Processing: Work Done So Far

Query processing has become a major area of research and study in various computing applications. Earlier, query processing was confined to single databases stored at a single geographical location. Advances in hardware and software systems have given modern query processing immense power over traditional query processing systems [15], and today's databases must adapt to the resulting variety of options. Modern query processors have to make decisions and must deliver advanced utilities and services; given the availability of huge amounts of data, they must also provide scalability [16]. Traditional query processors did not exploit parallelism and had neither the expertise nor the potential to perform operations any faster, whereas modern query processors must provide quick and effective results to meet the never-ending demands of users. Modern query processors are capable of processing time-continuous data [17, 18]. Traditional query processors selected a single query execution plan out of many possible ones, while modern query processors can change plans to deal with non-uniform data distributions. Modern query processing has certainly come a long way in efficiency, reliability, and performance. Table 3 summarizes the available surveys on query processing.

Table 3 Surveys on query processing

6 Analysis and Challenges

Recent groundbreaking developments in database systems have led to spectacular results in data management. Organizations these days find it difficult to choose approaches for managing their databases efficiently, conveniently, and in a controlled fashion. The performance and efficiency of a database system can be analyzed across various metrics.

6.1 Analysis of Impact Parameters

Response Time: When analyzing the performance parameters of a distributed system, all three costs, namely communication, I/O, and CPU time, play an equal role.

Cost: Various methodologies such as parallel and distributed processing have become practical. In distributed databases, the communication cost is much larger than the I/O and CPU costs.

Heterogeneous databases: Query processing optimization is done in two phases: a global optimization expressed in terms the component databases understand, and a localized optimization implemented individually by each component.

Modern-day database software: The database software available these days is much more stable and highly user friendly.

Advanced Hardware technology: In this modern era, which we can rightly call the digital age, the cost of hardware has gone down while its efficiency has gone up.

Network: Owing to the availability of fast networks, query processing approaches have to give equal weight to all three costs: communication, I/O, and CPU.

Bandwidth: Availability of larger bandwidth provides faster and more reliable connections between the distributed sites.

6.2 Inherent Challenges of Data

Turning to the challenges faced by databases, we have to admit that we now live in an “ocean” of data and must consider various aspects of data analysis.

Data Dimensions: The size of data is too large to be handled efficiently by conventional databases. Hence, the need of the hour is to enhance the features of present database systems so that they can incorporate as much data as possible.

Data Stride: The data is huge not only in dimensions; its pace is high too. Every day, millions upon millions of data feeds arrive at the system, making it impossible for systems to adapt at the same rate.

Data Diversity: One query processing approach may suit one type of dataset, while a different approach suits another.

Data Changeability: In this modern age, we need database systems that can handle the variability of data efficiently. The inherent interpretation of this massive amount of raw data depends on the underlying context; in natural language processing, for example, a single word may have different meanings.

Data Accuracy: The user will get inaccurate results if the principal database gathers data from diverse sources that might be incomplete. In such cases, the concept of query containment is used, and the user receives a subset of the complete answer.

Data Envisioning: One of the core tasks of a database system is to present the actual result of the query under processing in a readable and understandable format.

All of these factors demand careful examination, particularly by enterprises that have not yet adopted efficient query processing systems.