1 Introduction

The order in which to perform the join operation across tables is one of the most difficult aspects of query optimization and query plan development [1, 2]. Although different join orders produce the same query results, the order in which the query's tables are combined can have a significant impact on query execution time. Because the number of intermediate rows generated at runtime is unknown in advance, the optimizer relies on pre-calculated statistics such as data distribution and cardinality estimates to determine the cost of each candidate join order. Furthermore, the number of possible join orders increases exponentially with the number of tables, which makes it infeasible for the query optimizer to compute the costs of all combinations in order to select the best join order during query execution [1]. As a result, most optimizers use heuristics to reduce the search space, such as examining the structure of the query tree [3]. Recently, reinforcement learning and its reward system [4,5,6,7,8,9,10,11] have been utilized to learn models that predict the best join orders between tables over a snapshot of the database. These approaches use reinforcement learning, where the agent receives continuous feedback on actual response time or estimated cardinality, rather than relying on pre-calculated statistics, to maximize the reward, which results in improved query performance. A challenge for such models is the change in data distribution resulting from table deletions and insertions. A severe change in the data distribution renders the trained model obsolete, as the generated join orders would no longer be consistent with the new distribution.

Example 1

Consider the following query on the Store-Ordering database presented in Fig. 1. The Store-Ordering database consists of four entities: “STORE”, “ORDER”, “CASHIER”, and “CUSTOMER”, with cardinalities of 1,000, 1,000,000, 1,000, and 100,000 rows, respectively. To retrieve the store and cashier details of orders from the Store-Ordering database, we can use the following simple query in PostgreSQL:

figure a
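The listing itself is not reproduced here; a plausible version of such a query, with hypothetical table and column names, executed from Python against PostgreSQL, might look like the following sketch:

```python
import psycopg2

# Hypothetical connection settings and column names for the Store-Ordering schema.
conn = psycopg2.connect(dbname="store_ordering", user="postgres")

SQL = """
SELECT s.store_name, c.cashier_name, o.order_id
FROM store s
JOIN cashier c ON c.store_id = s.store_id
JOIN "order" o ON o.cashier_id = c.cashier_id
"""

with conn.cursor() as cur:
    # EXPLAIN ANALYZE reveals the join order PostgreSQL actually chose
    # and the real execution time of the resulting plan.
    cur.execute("EXPLAIN ANALYZE " + SQL)
    for (line,) in cur.fetchall():
        print(line)
```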

If the optimizer chooses to join the “Store” and “Order” tables first, the result is a cross-product, as there is no direct relationship between these tables. This approach generates a very large number of intermediate results (1,000 \(\times \) 1,000,000 = \(10^9\) rows), which significantly increases the execution time of the query. On the other hand, a better plan might join “Store” with “Cashier” first, followed by “Order”. This order is likely to generate fewer intermediate results because the existing relationships between “Store” and “Cashier”, and then between “Cashier” and “Order”, can be used to filter the data at an earlier stage. The optimal join order is often chosen by reinforcement learning (RL) optimizers [4], which aim to optimize query execution based on a deep understanding of data relationships and the current data distribution. The decisions of these RL optimizers are significantly influenced by the data distribution seen during the training phase, based on which they predict the optimal execution plan. For instance, if the data has not changed much since training, the model usually suggests the most efficient way to join the tables. However, if the data distribution changes, such as an increase in the cardinality of the Cashier table, the optimal join order may no longer be the same, and the new optimal order may become Store, Order, and then Cashier.

This paper introduces key contributions, specifically focusing on enhancing the adaptability of reinforcement learning models to changes in data distribution. The following key points outline the core advancements:

  1. New training strategy The paper proposes a novel architecture designed to enhance existing reinforcement learning models, enabling them to adapt dynamically to changes in data distribution.

  2. Smart use of existing models Instead of starting the training from scratch, this strategy utilizes the existing information about the database in the pre-trained model. By refining what is already known, the approach enhances the system’s ability to adapt to new data without the need for extensive training time.

The rest of the paper is organized as follows: The background information for concepts used in this paper is presented in Sect. 2. Then, in Sect. 3, previous work related to the proposed model is discussed. Section 4 details the proposed model. In Sect. 5, the results of the performance evaluation of the proposed models are presented. Finally, Sect. 6 concludes the paper.

Fig. 1
figure 1

Store-ordering database ERD

2 Background

2.1 Join ordering problem

“Join” means merging rows from two tables based on a common column. A query that joins N tables is executed through N-1 join operations, since a join works on only two tables at a time. The query optimizer must make the following two critical decisions:

  1. The selection of a join order.

  2. The choice of a join algorithm.

The optimal join order that the optimizer must select is our main concern because it greatly affects the query execution time. The number of alternative join combinations that the optimizer needs to explore and analyse grows exponentially with the number of tables [1]. Furthermore, because the number of intermediate rows (results) is not known until run time, the optimizer is forced to use pre-calculated statistics and cardinality estimation to estimate the cost.

According to the commutative property of joins, the operation A \(\bowtie \) B is identical to the operation B \(\bowtie \) A; the chosen order only determines which table is accessed first. Likewise, according to the associative property of joins, for tables A, B, and C, (A \(\bowtie \) B) \(\bowtie \) C is equivalent to A \(\bowtie \) (B \(\bowtie \) C). Together, these properties admit many equivalent orderings, and each permutation can have a different cost and performance impact, for instance, depending on the size of its intermediate results.
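To illustrate how quickly this search space grows, the following sketch (an illustration, not code from the paper) counts left-deep join orders, which already grow factorially with the number of tables; bushy plans add even more alternatives:

```python
from itertools import permutations

def count_left_deep_orders(num_tables):
    """Count left-deep join orders by enumerating all table permutations."""
    tables = [f"T{i}" for i in range(num_tables)]
    return sum(1 for _ in permutations(tables))

for n in range(2, 9):
    print(n, count_left_deep_orders(n))  # 2, 6, 24, 120, 720, 5040, 40320
```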

2.2 Reinforcement learning

Reinforcement learning is an area of machine learning in which an agent learns to take actions in different situations in order to maximize its cumulative reward. Reinforcement learning algorithms learn by taking actions and receiving rewards or penalties while interacting with the environment. Reinforcement learning’s key elements include agents, environments, states, actions, and a reward value [12], as shown in Fig. 2.

The environment receives the agent’s action and the current state as input and outputs a reward and the subsequent state. The reward function maps each state-action pair of the environment to a reward value (positive or negative), thereby defining the goal of the reinforcement learning problem [12].

Reinforcement learning problems are closely connected to optimal control problems, and both can be represented using the framework of Markov decision processes (MDPs). A reinforcement learning task can be represented as an MDP when it satisfies the Markov property, which implies that the future state is determined exclusively by the current state and does not depend on the history [12, 13]. The Markov decision process is described by the following 4-tuple (1) [12, 13].

$$\begin{aligned} \langle S, A, P(s, a), R(s, a) \rangle \end{aligned}$$
(1)
  1. S: The set of states that the agent encounters while interacting with its environment. The states are assumed to possess the Markov property.

  2. A: The set of actions available to the agent.

  3. P(s, a): The transition function, which gives the probability distribution over the next state resulting from taking action a in state s.

  4. R(s, a): The reward function, which gives the reward for taking action a in state s.

Fig. 2
figure 2

Simplified architecture of reinforcement learning

Neural networks can be used to approximate functions in reinforcement learning when the state space or action space is large [14]. Deep reinforcement learning results from applying reinforcement learning with deep neural networks: the agent is a deep neural network that learns to map state-action pairs to expected rewards. Depending on the observed outcome, the network is adjusted so that the chosen action is encouraged or discouraged for that input in the future.

2.3 Asynchronous reinforcement learning

Traditional reinforcement learning focuses on a single agent continuously updating its parameters. DQN and many of its improved variants use experience replay to store experiences, which limits them to off-policy algorithms [15]. To address this problem, a new kind of reinforcement learning based on asynchronous methods was proposed by Mnih et al. [16]: asynchronous deep reinforcement learning employs a multithreading approach that allows multiple agents to update parameters asynchronously in distinct exploration spaces. Inspired by this architecture, several algorithms [15, 17,18,19] were proposed to improve the performance of asynchronous reinforcement learning in the discrete-space path planning problem.

2.4 Offline-to-online reinforcement learning

Offline reinforcement learning refers to the scenario where an agent is trained using a pre-collected dataset of past interactions [20], allowing it to learn without the risks or costs associated with real-time environmental interactions. This approach faces a significant challenge: the difference in data distribution between the training dataset and the actual data encountered during online learning [21]. Such distribution shifts may result in poor performance because the agent may not have learned how to handle scenarios that were not sufficiently represented in the training set.

Online reinforcement learning (RL) is a process where an agent learns to make optimal decisions by interacting with an environment. As the agent takes actions, it gathers data about the results and rewards of those actions. The agent then uses this real-time feedback to continuously update and improve its policy.

Some research [21,22,23] looks at a hybrid method where the agent first learns from a set of samples that were previously collected and then continues its training online. It is beneficial to further improve these agents, initially trained offline, by interacting with the environment online. This approach helps the agent get better at handling changing situations because it can adapt and enhance its reactions based on immediate feedback.

3 Related work

Query optimization using reinforcement learning models is one of the primary trends in database research [4,5,6,7,8,9,10,11, 24]. To enhance query optimization, researchers have explored employing reinforcement learning approaches.

In [6], deep neural networks were utilized to learn state representations of queries in order to find the best plans. Instead of relying just on basic statistics to estimate costs, this paper investigated the idea of training a deep reinforcement learning network to predict query cardinalities.

The ReJOIN model [5] employs deep reinforcement learning approaches to solve the Join order selection problem. During the learning phase, ReJOIN used the traditional cost model based on cardinality estimation rather than real execution time.

The SkinnerDB system proposed in [25] uses reinforcement learning for query optimization. The proposed model learns the appropriate join order while running the query. The different join orders are divided into slices, with each slice of data being tested until the best join order is discovered and considered for the remaining data slices. When measuring query performance using regret bounds as a reward mechanism, the difference between actual execution time and the time for an optimal join order is taken into account.

Neo (neural optimizer) is a supervised learning model that guides a search algorithm over a huge and complex search space, as described in [7]. Neo assumes the existence of a sample workload, which is made up of a group of queries that are representative of the overall workload. Neo builds its query optimization model on top of existing optimizers and learns from incoming queries. Given a sample workload and the expert’s best query plans, Neo attempts to generalize a model that can infer the query plan with the shortest execution time.

Deep reinforcement learning is also applied to the query optimization problem in the hands-free query optimizer proposed in [26]. The fully observed optimizer (FOOP) presented in [27] is based on a reinforcement learning model in which the reward function is defined as the cost model of a conventional DBMS optimizer.

Fig. 3
figure 3

Architecture overview: the architecture includes four main phases—data preparation, model preparation, retraining, and replacement

Another model that incorporates reinforcement learning is presented in [9]. It proposes a learning-based join ordering strategy that bootstraps the reinforcement learning model from the plans given by the DBMS optimizer before fine-tuning it using real execution times.

Recently, a deep reinforcement learning-based query optimizer (RL_QOptimizer) has been presented [4]. The proposed method utilizes reinforcement learning and real-time feedback from the database to determine the best join order in query plans.

All previous models, with the exception of SkinnerDB, share the limitation that they train the model only once and evaluate its performance without taking changes to the data distribution into account. Data distribution refers to how data is spread within a dataset, and it can change due to factors such as insertions, deletions, and changes in cardinality (the number of distinct values). The distribution of data plays a crucial role in determining query execution plans and can greatly impact query performance. Consequently, the goal of this paper is to develop a novel architecture that handles the problem of changing data distribution by providing a new training strategy that adjusts to such changes.

4 Proposed architecture

One of the most difficult aspects of learning-based database management systems is that the data is always subject to significant changes, necessitating an adaptable model that can respond to these changes. The proposed architecture in Fig. 3 provides a practical solution by using an online approach: it adapts to data changes by periodically retraining reinforcement learning models in the background on a recent database snapshot. The proposed architecture uses the DQN model presented in [4], a ’Deep’ Q-Learning model that uses a neural network to approximate the Q-value function and is therefore suitable for large state and action spaces.

The network is fed a vector of zeros and ones that represents which joins are considered in a query. For example, if we need to find the best execution plan for the Store-Ordering database query in Example 1, which has two join conditions (Store \(\bowtie \) ORDER, ORDER \(\bowtie \) Cashier) out of three possible ones (Store \(\bowtie \) ORDER, ORDER \(\bowtie \) Cashier, ORDER \(\bowtie \) Customer), we can input the vector [1, 1, 0], meaning that the first and second joins are the candidates to choose between. The neural network then outputs a vector in which each entry is a reward value corresponding to one join condition. These values guide the agent in deciding which join condition to execute next, aiming to optimize the overall execution plan.
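A minimal sketch of this encoding and of greedy action selection is shown below; the join-condition names and the `q_network` (assumed to be a Keras model with one output per join condition) are illustrative, not the paper's code:

```python
import numpy as np

# Hypothetical fixed ordering of all join conditions in the Store-Ordering schema.
JOIN_CONDITIONS = ["STORE_ORDER", "ORDER_CASHIER", "ORDER_CUSTOMER"]

def encode_query(active_joins):
    """Encode a query as a 0/1 vector over all possible join conditions."""
    return np.array([1.0 if j in active_joins else 0.0 for j in JOIN_CONDITIONS])

def choose_next_join(q_network, state):
    """Pick the remaining join condition with the highest predicted value."""
    q_values = q_network.predict(state[np.newaxis, :], verbose=0)[0]
    q_values[state == 0] = -np.inf   # ignore joins not present in the query
    return int(np.argmax(q_values))

state = encode_query(["STORE_ORDER", "ORDER_CASHIER"])  # -> [1., 1., 0.]
```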

The DQN model uses a neural network to estimate the Q-value function and help the agent select the best action. The agent takes actions to maximize the total reward, which is calculated by adding the immediate reward to the discounted highest Q-value of the next state, using the following formula (2) [28, 29]:

$$\begin{aligned} Q(s, a) = r(s, a) + \gamma \max _{a'} Q(s', a') \end{aligned}$$
(2)

The first part of the Q-value calculation, r(s, a), represents the immediate reward for taking a specific action (a) in a given state (s). The second part, \(\gamma \max _{a'} Q(s', a')\), is the discounted estimate of the optimal future value: the highest potential Q-value in the next state (s'), scaled by the discount factor (\(\gamma \)).
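As a small worked example with assumed values (an immediate reward of \(-2.0\), i.e., a negated execution time of two seconds, a discount factor \(\gamma = 0.9\), and a best next-state Q-value of \(-5.0\)):

$$\begin{aligned} Q(s, a) = -2.0 + 0.9 \times (-5.0) = -6.5 \end{aligned}$$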

Algorithm 1
figure b

Online query optimization using Deep Q-Network

According to the suggested architecture in Fig. 3 and Algorithm 1, the model comprises four main phases. The preparation of the data comes first, followed by the preparation of the model, the retraining phase, and the replacement phase.

Fig. 4
figure 4

Background training overview: analysing execution plans, rewarding based on execution times, and updating neural weights for model replacement

During the data preparation phase, the system receives the training data from the log file, which represents all potential queries, and syncs the most recent database snapshot. In the model preparation phase, the system prepares for retraining by loading a copy of the previously trained model as a pre-trained model.

The model then begins its training process, as illustrated in Fig. 4. It examines each query in the log file, and the agent communicates with the DBMS (the environment) to obtain the actual execution time of each execution plan. In this phase, similar to the previously proposed model (RL_QOptimizer) [4], an encoded vector is used to represent the join conditions of a query. Each vector has a size of n, where n is the total number of possible join conditions in the database schema. A vector represents one query, and each of its entries is either 0 or 1, where 1 means that the corresponding join condition is included in the query currently being processed. Consider the Store-Ordering database query in Example 1, which has two join conditions (Store \(\bowtie \) ORDER, ORDER \(\bowtie \) Cashier). A vector such as [1, 1, 0] can be input, indicating the first and second join conditions.

The main goal of the join ordering problem is to find the best join sequence for a given query. States are represented by vectors of 0s and 1s, where each ’1’ indicates a join condition that can still be applied to the query. In each state, the agent selects from a set of possible actions, each corresponding to a join condition marked ’1’ in the state vector; in the example above, the possible actions are to apply the first or the second join condition next. After an action is selected, the query builder appends this condition to the join order and updates the state vector by setting the entry of the applied condition to zero. Since the goal is to optimize the total execution time of queries, the actual query execution time multiplied by \(-1\) is used as the reward. During the experiments, a PostgreSQL DBMS is employed to obtain the actual execution time of each query. A minimal sketch of this interaction loop is given below.
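The sketch below illustrates one such training episode under the stated assumptions; the `build_query` helper (which turns a partial join order into executable SQL), the connection handling, and the exploration rate are hypothetical details, not the paper's implementation:

```python
import time
import numpy as np

def execute_and_time(conn, sql):
    """Run a query on PostgreSQL (psycopg2 connection) and return its runtime in seconds."""
    start = time.perf_counter()
    with conn.cursor() as cur:
        cur.execute(sql)
        cur.fetchall()
    return time.perf_counter() - start

def run_episode(conn, q_network, state, build_query, epsilon=0.1):
    """Order the joins of one query greedily, with epsilon-greedy exploration."""
    order, transitions = [], []
    while state.sum() > 0:                      # join conditions still to be placed
        q_values = q_network.predict(state[np.newaxis, :], verbose=0)[0]
        q_values[state == 0] = -np.inf          # only joins present in the query
        if np.random.rand() < epsilon:
            action = int(np.random.choice(np.flatnonzero(state)))
        else:
            action = int(np.argmax(q_values))
        order.append(action)
        next_state = state.copy()
        next_state[action] = 0.0
        reward = -execute_and_time(conn, build_query(order))   # negated runtime
        transitions.append((state, action, reward, next_state))
        state = next_state
    return order, transitions
```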

The DQN model updates its neural network parameters based on how well each query performs. The model uses the Bellman equation to adjust its strategy (a minimal code sketch of this update follows the list of terms below):

$$\begin{aligned} Q(s,a) & = Q(s,a) + \alpha \left[ r(s,a) + \gamma \max _{a'} Q(s',a')\right. \\ & \quad \left. - Q(s,a) \right] \end{aligned}$$
  • \(Q(s, a)\) is the current value for taking action \(a\) in state \(s\).

  • \(\alpha \) is the learning rate.

  • \(\gamma \max _{a'} Q(s',a')\) is the discounted estimate of the best value achievable from the next state \(s'\); adding it to \(r(s,a)\) gives the TD target.

  • \(r(s,a)\) is the reward received after taking action \(a\) in state \(s\).

  • The expression \(r(s,a) + \gamma \max _{a'} Q(s',a') - Q(s,a)\) is the Temporal Difference (TD) error, i.e., the difference between the TD target and the current estimate \(Q(s,a)\).
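A minimal sketch of this update for a Keras Q-network is given below; the batch handling and the discount factor value are assumptions for illustration, and the learning rate \(\alpha \) corresponds to the optimizer's learning rate rather than an explicit variable:

```python
import numpy as np

GAMMA = 0.9  # hypothetical discount factor

def dqn_update(q_network, transitions):
    """One training step on the Bellman (TD) targets for a batch of transitions."""
    states = np.stack([t[0] for t in transitions])
    next_states = np.stack([t[3] for t in transitions])

    q_current = q_network.predict(states, verbose=0)
    q_next = q_network.predict(next_states, verbose=0)

    targets = q_current.copy()
    for i, (state, action, reward, next_state) in enumerate(transitions):
        if next_state.sum() == 0:        # terminal state: no joins left to place
            targets[i, action] = reward
        else:
            valid = next_state == 1.0    # only joins still applicable in the query
            targets[i, action] = reward + GAMMA * np.max(q_next[i][valid])

    # Training toward the TD targets performs the Q(s,a) update described above.
    q_network.fit(states, targets, epochs=1, verbose=0)
```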

In Phase 4, the proposed architecture ensures continuous optimization of query strategies by regularly monitoring the average response time of queries processed by the current model in the production environment. The current model is replaced under either of the following conditions, depending on the configuration:

  1. Average response time threshold If the queries’ average response time increases beyond a predefined threshold, indicating a decline in performance.

  2. Regular schedule On a regular schedule, regardless of performance, to ensure the model is periodically updated with the latest training data.

When either of these conditions is met, a newly trained model replaces the current model in production. This replacement involves updating the neural network weights to adapt to the latest query strategies, ensuring the system remains efficient and effective.
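A minimal sketch of such a replacement policy is shown below; the threshold, interval, and function names are hypothetical configuration choices, not values from the paper:

```python
import time

RESPONSE_TIME_FACTOR = 1.5      # hypothetical: replace if 50% slower than baseline
RETRAIN_INTERVAL = 24 * 3600    # hypothetical: scheduled replacement once a day

def should_replace(avg_response_time, baseline_response_time, last_replacement):
    """Decide whether the background-trained model should be promoted to production."""
    degraded = avg_response_time > RESPONSE_TIME_FACTOR * baseline_response_time
    scheduled = time.time() - last_replacement > RETRAIN_INTERVAL
    return degraded or scheduled

def replace_model(production_model, background_model):
    """Swap in the newly trained weights (Keras models with identical architecture)."""
    production_model.set_weights(background_model.get_weights())
```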

Fig. 5
figure 5

Simplified ER diagram of the IMDB database

The proposed architecture includes retraining the pre-trained model by optimizing its weights to adapt to changes in the data distribution. Instead of starting the training from scratch, the existing information about the database in the pre-trained model is utilized.

In addition, the proposed architecture utilizes the log file to initialize the retraining process. Consequently, the pre-trained model is updated on the workload queries actually issued by the database users. This should keep the model in a near-optimal state, resulting in efficient join order plans and, consequently, shorter response times for the users’ most frequent queries.
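A minimal sketch of this warm start, assuming the previous model was saved in Keras format and the workload log is a plain text file with one query per line (both hypothetical details), might be:

```python
from tensorflow import keras

def prepare_background_model(model_path, log_path):
    """Load the previously trained network and the logged workload queries."""
    pretrained = keras.models.load_model(model_path)  # reuse existing weights
    with open(log_path) as f:
        workload = [line.strip() for line in f if line.strip()]
    return pretrained, workload

# The returned model is then retrained on the logged workload against the latest
# database snapshot instead of being trained from scratch.
```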

Even though the proposed architecture provides a practical way to handle changes in data distribution, it faces a significant issue: its reliance on periodically updating itself with database snapshots. This can lead to suboptimal decisions when data changes quickly, due to potential delays until the next update cycle. To address this, the system is configurable to update either on a regular schedule or when a noticeable degradation in query response times occurs, i.e., when the average response time of queries increases beyond a certain threshold. By shortening the intervals between updates, delays are minimized and the model remains as up-to-date and responsive as possible.

5 Performance evaluation

The evaluation aims to compare the execution plans generated by the proposed online model with those generated by the original model [4] as a baseline, in order to demonstrate the effectiveness of the proposed model in terms of execution time. To do this, both models are used to generate plans for the same queries after the data distribution has been changed within the evaluation database. Then, the execution times of the plans generated by both models are collected and compared. The models are tested on a real database that is used as a benchmark dataset for the join-ordering problem [30].

5.1 Experiments setup

Experiments are carried out on a laptop running Ubuntu 18.04.3 LTS and equipped with an 8-core Intel Core i7-8550U CPU and 8 GB of RAM. The available memory per operator (\(work\_mem\)) was set to 512 MB, and the buffer size was set to 1 GB. The models were implemented in Python using the TensorFlow [31] and Keras [32] libraries. For the DQN model, the Adam optimizer [33] and the ReLU activation function were used.

A four-layer feed-forward neural network is utilized in the model, featuring 30 neurons in each hidden layer. The number of neurons in the input layer is the same as the number of possible joins in the database, and this number is also used for the output layer.
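Under these stated choices, a sketch of such a network in Keras is given below; interpreting the four layers as input, two hidden layers of 30 neurons, and output, with a linear output layer and mean squared error loss as additional assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_q_network(num_joins, hidden_units=30):
    """Feed-forward Q-network sized to the number of possible join conditions."""
    model = keras.Sequential([
        layers.Input(shape=(num_joins,)),
        layers.Dense(hidden_units, activation="relu"),
        layers.Dense(hidden_units, activation="relu"),
        layers.Dense(num_joins, activation="linear"),  # one Q-value per join condition
    ])
    model.compile(optimizer=keras.optimizers.Adam(), loss="mse")
    return model

q_network = build_q_network(num_joins=6)  # e.g., the six joins around "title" in IMDb
```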

IMDb (Internet Movie Database) [34] was used in the performance evaluation of the proposed architecture. It is a non-commercial online real-world database that contains a vast quantity of information about films, television shows, and home recordings, and it is used as a benchmark for the join ordering problem in [4, 30]. During the training process, only the tables shown in the diagram in Fig. 5 were used, where the average number of records per table is greater than 6 million. According to Figure 2 in [30], a typical query graph includes five relationships with the “title” table, which serves as the central table for the research workload. Therefore, similar to [4], all join operations related to the “title” table were chosen so that our results can be compared with those in [4]; these are: (movie_companies \(\bowtie \) title, title \(\bowtie \) kind_type, movie_info \(\bowtie \) title, aka_title \(\bowtie \) title, movie_link \(\bowtie \) title, title \(\bowtie \) complete_cast).

Fig. 6
figure 6

Comparison between the results of the RL model and the online RL model on the IMDb database

Fig. 7
figure 7

Comparison between the results of the RL model and the online RL model on the IMDb database after changing several tables

5.2 Experimental results

In the experiments, the proposed model was compared primarily with RL_QOptimizer for several reasons. Firstly, RL_QOptimizer is the model that the architecture was built upon. It has been demonstrated in [4] to outperform the traditional query optimizer in PostgreSQL as well as a well-known state-of-the-art learning-based optimizer, ReJOIN, establishing it as a robust baseline. This research aims to build upon the strengths of RL_QOptimizer by introducing a new training architecture that enhances its adaptability to changes in data distribution.

Two experiments were conducted to evaluate the performance of the proposed approach. The first experiment focused on randomly increasing the cardinality of the “AKA_TITLE” table fivefold, while the second experiment involved random modifications to the data distribution of multiple tables.

In the first experiment, the cardinality of the “AKA_TITLE” table was increased fivefold, and the online background training was then applied to the IMDb database. With the previously trained DQN model still in place, the average response time increased and queries began to time out. Consequently, the background model was retrained, and it was found that the retraining process required only 10% of the original model training time. This may be related to the fact that many queries do not use the “AKA_TITLE” table and therefore did not have to change their execution strategy. Focusing on queries that use “AKA_TITLE”, as shown in Fig. 6, the model that had been trained in the background was enhanced; as a result, Query 24 (Q24), which could not complete within the maximum time on the original model, was successfully executed on the newly trained one. Queries that do not include the “AKA_TITLE” table were left out of the sequence in Fig. 6, because the changes would not affect them.

In the second experiment, changes were applied to the data distribution in the IMDb database by using random insertions to increase the cardinality of the “COMPLETE_CAST”, “MOVIE_COMPANIES” and “AKA_TITLE” tables by factors of 4, 2 and 3, respectively. In addition, deletions were applied to “MOVIE_LINK” to reduce its cardinality to 25%. As shown in Fig. 7, the model that had been trained in the background was improved; as a result, Query 57 (Q57), which could not complete within the maximum time on the original model, was successfully executed on the newly trained one. Similarly to the first experiment, the retraining process required only a small fraction of the original model training time, approximately 16% in this particular experiment. Since the model was not trained from scratch, the retraining process was able to fine-tune the existing knowledge acquired during the initial training.

The results of the performance evaluation experiments demonstrate that the newly trained model performs better than the original model in determining the appropriate join orders for various queries, including challenging cases such as Q24 and Q57, which the original model could not even execute within the maximum execution time.

Let us use Query 57 as an example to highlight the challenges faced by the previous model. Query 57 aims to retrieve comprehensive information about movies from the IMDb database, focusing on data that links movies to their types, alternate titles, detailed information, and the companies associated with them, as shown:

figure c

In this query, the earlier model started with MOVIE_COMPANIES and did not account for its doubled cardinality. The original join order was: MOVIE_COMPANIES -> TITLE -> MOVIE_INFO -> AKA_TITLE -> KIND_TYPE. The enhanced model, however, modified the plan and improved performance by changing the order to TITLE -> AKA_TITLE -> KIND_TYPE -> MOVIE_COMPANIES -> MOVIE_INFO.

6 Conclusion

Determining the best join order is a difficult task due to the exponential growth of alternative join combinations as the number of tables increases, which makes it infeasible for the optimizer to consider all possible combinations. As a result, most optimizers use heuristics to narrow the search space, such as inspecting the query tree's structure. Reinforcement learning models can recommend query execution plans that address the join ordering problem, but they cannot adapt to data distribution changes such as insertions, updates, and deletions. This paper proposes an architecture that applies a new training strategy to reinforcement learning models in order to allow them to adapt to data distribution changes.

The results of the performance evaluation show that the new technique outperforms RL_QOptimizer in determining the appropriate join orders for queries involving tables whose cardinality has been modified. Moreover, the retraining process needs only a small fraction of the original model training time, thanks to the model's ability to reuse previously acquired knowledge. However, the approach faces challenges, including its reliance on periodic updates, which may be a limitation in environments with rapidly changing data. The experiments demonstrated the model's ability to adapt to changes effectively, but choosing the update schedule and replacement criteria carefully is crucial so that retraining takes place before system performance degrades. Overall, the architecture shows significant potential for improving query optimization in dynamic data environments.