
1 Introduction

Significant advances have been made in the area of deep learning-based decision-making, viz. deep reinforcement learning (DRL) [1,2,3,4]. These include DRL applications to tasks such as traditional games, e.g. Go [5, 6], real-time game playing [7, 8], self-driving vehicles [9], robotics [10, 11], computer vision [12, 13] and others [14,15,16]. The resounding success of DRL systems can be attributed to the use of deep learning for function approximation [17]. A majority of such techniques are single-entity based, i.e. they use a single RL agent or operator. In contrast stands multi-entity-based RL, which uses more than one entity. These agents or entities operate in a single shared environment, each aiming to optimize its own reward return. Beyond the above applications, multi-entity-based RL systems have been successfully applied to areas such as telecommunication and sensor networks [18, 19], financial systems [20, 21], cyber-physical systems [22, 23], sociology [24, 25], etc. As a success story of RL task solving, we highlight the Atari 2600 games suite [26], an important benchmark for assessing an RL algorithm's efficacy. The significant prowess of RL systems, in particular DRL systems, is reflected in the game scores shown in Fig. 1. It should be noted that the average human score for this game (Space Invaders) is 1668.7 [27].

Fig. 1  Atari 2600 Space Invaders game score benchmarking (state of the art) [28]

Multi-entity RL systems, or multi-agent RL systems, using Deep Q-Networks (DQNs) [9, 17, 29,30,31,32,33,34,35,36,37] have been used in the past. In these systems, reward and penalty data need to be shared between the agents or entities so that they learn through exploration or exploitation as deemed feasible during training. This reward sharing ensures cooperative learning, similar to ensemble learning, which in turn facilitates cooperative decision-making. Such a cooperative decision-making strategy has time and again been found to be more advantageous than single-entity-based strategies owing to its richer environment exposure, parallel processing, etc. The human immune system may be regarded as a marvel of multi-agent RL: millions of white blood cells (leucocytes) learn, work and adapt seemingly individually, yet all serve, optimize and ultimately benefit the same human body. Returning to the state of the art in multi-agent RL systems, three crucial factors decide their success: (1) the data-sharing scheme, (2) the inter-agent communication scheme and (3) the efficacy of the deep Q-network.

With the explosion of RL-based systems, many issues have come to the fore, e.g. training difficulties, resource hunger, fine-tuning issues, low throughput, etc. Ensemble learning [38,39,40] has come a long way and is being studied for potential application to this area. The parallel-processing approach of the brain, which is the basis for the ensemble approach, is a well-known success story of nature, and if this line of action is followed, further good results can be expected.

The rest of the paper is organized as follows. Section 2 discusses the significant works in the area. Section 3 touches upon recent trends. Section 4 discusses the major issues faced and the future scope in the area. Section 5 concludes the paper.

2 Related Work

Since deep learning [41,42,43,44,45,46] came to the fore, deep neural networks have been used for numerous machine learning tasks, many of which are closely related to RL, e.g. autonomous driving, robotics, game playing, finance management, etc. The main types of Deep Q-Networks (DQNs) are discussed below.

2.1 Deep Q-Networks

Mnih et al. [17] use a DQN to approximate the optimal Q-learning action-value function:

$$\begin{aligned} Q^*\left( s,a\right) =\max _{\pi {}}{E\left[ \sum _{k=0}^{\infty {}}{\gamma {}}^{k}r_{t+k}\big \vert {}s_t=s,a_t=a,\pi {}\right] } \end{aligned}$$
(1)

The above expression gives the maximum expected sum of rewards \(r_{t+k}\), discounted by the factor \(\gamma {}\) at every time step, that is achievable by a policy \(\pi {}=P(a\vert {}s)\) after observing state s and taking action a.

Before [17], RL algorithms were unstable or even divergent when a nonlinear function approximator such as a neural network was used to represent the action-value function Q. Subsequently, several techniques were developed for approximating the action-value function Q(s,a) with the help of Deep Q-Networks. The only input given to the DQN is the state information, and the output layer has a separate output for each action; each output corresponds to the predicted Q-value of one action available in that state. In [17], the DQN input is an (\(84 \times 84 \times 4\)) image. The DQN of [17] has four hidden layers: three convolutional layers followed by one fully connected (FC) or dense layer, all using the ReLU activation function. The output layer is also FC, with a single output for each action. The DQN learning update uses the loss:

$$\begin{aligned} L_i({\theta {}}_i)=E_{\left( s,a,r,s^{'}\right) \sim {} U(D)}\left[ {\left( r+\gamma {}\max _{a^{'}}{Q(s^{'},a^{'};{\theta {}}_i^-)} - Q(s,a;{\theta {}}_i)\right) }^2\right] \end{aligned}$$
(2)

where \(\gamma {}\) is the discount factor, \(\theta_{i}\) denotes the parameters of the online DQN at iteration i, and \({\theta}_i^-\) denotes the parameters of the target network used to compute the target at iteration i, which are held fixed and only periodically copied from \(\theta_{i}\).
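As a concrete illustration, the following is a minimal PyTorch sketch (not the authors' original implementation) of the network described above and of the loss in Eq. (2); the class and variable names, and the use of a `done` flag for terminal states, are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Convolutional Q-network as described above: three conv layers plus one
    FC hidden layer, with one output per action (architecture of [17])."""
    def __init__(self, n_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),          # one Q-value per action
        )

    def forward(self, x):                       # x: (batch, 4, 84, 84)
        return self.head(self.features(x))

def dqn_loss(online, target, batch, gamma=0.99):
    """Squared TD error of Eq. (2): the target uses the frozen parameters
    theta_i^- (the `target` network), the prediction uses theta_i (`online`)."""
    s, a, r, s_next, done = batch               # tensors sampled from replay memory
    q_sa = online(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target(s_next).max(dim=1).values
        y = r + gamma * (1.0 - done) * q_next   # no bootstrap on terminal states
    return ((y - q_sa) ** 2).mean()
```

During training, the parameters \({\theta}_i^-\) of the target network would be copied from the online network only every fixed number of steps, which is what stabilizes the update.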

For experience replay [47], the entity or DQN experience e\(_{t}\) is stored as the tuple:

$$\begin{aligned} e_t=(s_t,a_t,r_t,s_{t+1}) \end{aligned}$$
(3)

This consists of the state s\(_{t}\) observed at time step t, the action a\(_{t}\) taken at time step t, the reward r\(_{t}\) received at time step t, and the resulting state s\(_{t+1}\) at time step \({t+1}\). This experience is stored along with the past experiences:

$$\begin{aligned} D_t=[e_{1}, e_{2},\ldots ,e_{t}] \end{aligned}$$
(4)
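A minimal sketch of the replay memory of Eqs. (3) and (4) could look as follows; the fixed capacity and the extra `done` flag are practical assumptions rather than part of the original formulation.

```python
import random
from collections import deque, namedtuple

# Experience tuple e_t = (s_t, a_t, r_t, s_{t+1}) of Eq. (3);
# the `done` flag marks episode termination in practical implementations.
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

class ReplayMemory:
    """Fixed-size buffer D_t of Eq. (4) holding past experiences."""
    def __init__(self, capacity: int = 100_000):
        self.memory = deque(maxlen=capacity)

    def push(self, *args):
        self.memory.append(Experience(*args))

    def sample(self, batch_size: int):
        # Uniform sampling U(D), as used by the loss in Eq. (2).
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```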

Figure 2 shows the overview of the deep Q-Network-based learning scheme.

Fig. 2  Overview of deep Q-network-based reinforcement learning

2.2 Double Deep Q-Networks

The maximizing operation used in DQNs as propounded by Mnih et al. [17] uses the same values both to select and to evaluate an action. This makes it more likely that overestimated values are selected, resulting in overoptimistic value estimates. To overcome this problem, the work of Van Hasselt et al. [36] decoupled the selection and evaluation steps, in what came to be known as Double Q-learning. In this technique, two value functions are learned by randomly assigning each experience to update one of them, leading to two weight sets, viz. \(\theta {}\) and \(\theta {}\)’. Hence, by decoupling selection and evaluation in the original Q-learning, the target can be written as:

$$\begin{aligned} Y_t^Q\equiv {}R_{t+1} + \gamma {}Q(S_{t+1}, {\text {argmax}}_{a}{Q(S_{t+1},a;{\theta {}}_t);{\theta {}}_t}) \end{aligned}$$
(5)

The Double Q-learning target for the network then becomes:

$$\begin{aligned} Y_t^{{\text {Double}}Q}\equiv {}R_{t+1}+ \gamma {}Q(S_{t+1}, {\text {argmax}}_{a}{Q(S_{t+1},a;{\theta {}}_t);{\theta {}}_t^{'}}) \end{aligned}$$
(6)
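A hedged sketch of how the target of Eq. (6) might be computed, assuming the PyTorch setting used earlier; here the separately maintained target network plays the role of \({\theta}_t^{'}\), as is common in Double DQN implementations.

```python
import torch

def double_q_target(online, target, r, s_next, done, gamma=0.99):
    """Double Q-learning target of Eq. (6): the online network (theta_t)
    selects the action via argmax, the second network (theta_t') evaluates it."""
    with torch.no_grad():
        a_star = online(s_next).argmax(dim=1, keepdim=True)      # selection
        q_eval = target(s_next).gather(1, a_star).squeeze(1)     # evaluation
        return r + gamma * (1.0 - done) * q_eval
```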

2.3 Return-Based Deep Q-Networks

Meng et al. [32] introduced a framework combining the DQN with a return-based RL algorithm. The resulting DQN variant is called the Return-Based Deep Q-Network (R-DQN). The paper shows that the performance of conventional DQNs can be improved significantly by introducing the return-based algorithm, using a strategy based on two policy-discrepancy measurements. Experiments on several OpenAI Gym tasks and Atari 2600 games achieve state-of-the-art performance. Transitions are drawn from the replay memory, and the transition sequences are used by R-DQN to compute the state-value estimate and the TD error. The loss function is given as:

$$\begin{aligned} L\left( \theta _{j}\right) = (Y(x_{t}, a_{t})-Q(x_{t},a_{t};\theta _{j}))^{2} \end{aligned}$$
(7)

where \( \theta _{j}\) are the R-DQN parameters at step j.

Also, \( Y(x_{t}, a_{t})\) is given as:

$$\begin{aligned} Y\left( x_{t}, a_{t}\right) = r\left( x_{t}, a_{t}\right) +\gamma Z\left( x_{t+1}\right) +\sum _{s = t+1}^{t+k-1} \gamma ^{s-t} \left( \prod _{i = t+1}^{s}C_{i}\right) \delta _{s} \end{aligned}$$
(8)

where k is the number of transitions in the sequence, \(C_{i}\) are the correction coefficients obtained from the policy-discrepancy measurements, \(\delta _{s}\) is the TD error at step s, and \(Z\left( x_{t+1}\right) \) is the state-value estimate.

For the learning update, gradient descent is performed as:

$$\begin{aligned} \triangledown _{\theta _{j}}L\left( \theta _{j}\right) =\left( Y\left( x_{t}, a_{t}\right) - Q\left( x_{t},a_{t};\theta _{j}\right) \right) \triangledown _{\theta _{j}}Q\left( x_{t},a_{t};\theta _{j}\right) \end{aligned}$$
(9)

R-DQN also uses experience replay like its predecessors [17, 48]. There are two important differences between R-DQN [32] and DQN [17]: first, in R-DQN the policy \( \mu (\cdot \vert x)\) is stored for each state x; second, the R-DQN replay memory is sequential.
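The following sketch illustrates how the target of Eq. (8) could be assembled for a single pair \((x_t, a_t)\), assuming the correction coefficients and TD errors have already been computed as in [32]; all names are illustrative.

```python
def rdqn_target(r_t, z_next, deltas, C, gamma=0.99):
    """Sketch of the R-DQN target of Eq. (8) for one (x_t, a_t).

    r_t    : immediate reward r(x_t, a_t)
    z_next : state-value estimate Z(x_{t+1})
    deltas : TD errors [delta_{t+1}, ..., delta_{t+k-1}]
    C      : correction coefficients [C_{t+1}, ..., C_{t+k-1}]
    How the coefficients and TD errors are obtained follows [32] and is not
    reproduced here.
    """
    y = r_t + gamma * z_next
    trace = 1.0
    for j, (c, d) in enumerate(zip(C, deltas), start=1):   # s = t+1, ..., t+k-1
        trace *= c                                          # prod_{i=t+1}^{s} C_i
        y += (gamma ** j) * trace * d                       # gamma^{s-t} * trace * delta_s
    return y
```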

2.4 Other Notable DQN Variants

For dealing with non-stationarity issues in RL, Palmer et al. [49] proposed a technique called Lenient-DQN (LDQN), which applies lenient adjustments to policy updates drawn from experience. LDQN has been successfully applied to multi-entity-based RL tasks, and its performance compares favourably with that of hysteretic-DQN (HDQN) [50]. The leniency concept combined with experience replay has also been used in the weighted double Deep Q-Network (WDDQN) [51] to deal with the same set of problems; WDDQN is shown to perform better than DDQN in two multi-entity environments. Hong et al. [52] introduced the Deep Policy Inference Q-Network (DPIQN) for multi-agent system modelling, and subsequently the Deep Recurrent Policy Inference Q-Network (DRPIQN) for addressing issues arising from partial observability. DPIQN and DRPIQN have been experimentally demonstrated to perform better than their respective baselines, viz. DQN and DRQN [53].
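To convey the leniency idea only (not the exact LDQN algorithm of [49]), a tabular sketch might look as follows; the exponential decay schedule used for the leniency is an assumption for illustration.

```python
import math
import random

def lenient_update(Q, visits, s, a, target, alpha=0.1, k=2.0):
    """Tabular sketch of the leniency idea behind LDQN [49]: updates that
    would lower Q(s, a) are ignored with a probability (the leniency) that
    decays as (s, a) is visited more often. Q and visits are dictionaries
    keyed by (state, action) pairs."""
    delta = target - Q[(s, a)]
    leniency = math.exp(-k * visits[(s, a)])   # high leniency for rarely visited pairs
    visits[(s, a)] += 1
    if delta >= 0 or random.random() > leniency:
        Q[(s, a)] += alpha * delta             # negative updates applied only sometimes
```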

3 Recent Trends

Gupta et al. [54] examined three separate learning schemes for multi-entity learning systems: centralized, concurrent and parameter sharing. The centralized scheme learns a single policy that maps the joint observations of the entities to a joint action. The concurrent scheme trains the entities simultaneously using a shared reward. The parameter-sharing scheme trains a single policy whose parameters are shared by all entities, each still acting on its own observations. Based on these schemes, many multi-entity DQN-based schemes have been proposed. A comparatively rare direction is RL-based ensemble learning, as found in [55], wherein ensembles of Q-learning agents are used for time-series prediction. In that work, the Q-learning agents are given varied exposure; in other words, the number of epochs each Q-learning agent undertakes for learning differs. The disadvantage of the technique is that this non-uniform exposure may lead to sub-optimal performance. Naturally, the next step in this direction would be to use a DQN-based ensemble for solving RL tasks.
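As a purely illustrative sketch of such a DQN-based ensemble (and not the method of [55]), one could average the Q-values of several independently trained DQNs at action-selection time:

```python
import torch

def ensemble_act(q_networks, state, epsilon=0.05):
    """Select an action by averaging the Q-values of an ensemble of DQNs.

    q_networks : list of trained DQN modules (e.g. the DQN class sketched earlier)
    state      : tensor of shape (1, 4, 84, 84), i.e. a single stacked observation
    """
    with torch.no_grad():
        q_all = torch.stack([net(state) for net in q_networks])  # (members, 1, n_actions)
    n_actions = q_all.shape[-1]
    if torch.rand(1).item() < epsilon:                 # occasional exploration
        return torch.randint(n_actions, (1,)).item()
    return q_all.mean(dim=0).argmax(dim=-1).item()     # greedy w.r.t. the averaged Q-values
```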

4 Major Issues and Future Scope

In spite of their initial success, DQN-based systems are far from mature. They are still in their infancy and have so far been applied chiefly to tasks such as OpenAI Gym and other simulation tasks, the Atari 2600 platform and other games, etc. Implementing them in real-world systems remains a challenge. The main issues faced in this regard are high complexity, the need for extensive computation resources, training issues such as long training times and an excessive number of hyperparameters, fine-tuning issues, etc. It is well known that millions of dollars can be spent on a single DQN-based research project, e.g. by Google DeepMind for [17]. Also, the potential for misuse of the exploitation aspect of RL systems naturally carries over to DQN-based RL systems, e.g. when they are used for financial tasks.

The future scope for DQNs is rich with options. To name a few: with the advent of attention-based mechanisms [56, 57] applied to and incorporated into deep learning techniques, it will be interesting to see whether attention-based schemes (as present in techniques like Vision Transformers (ViTs) [58]) can be applied to deep Q-networks for solving RL tasks. It would be equally interesting to see parallelization in DQN-based RL task solving, just as multi-core processor technology gained a foothold once the Moore's Law curve for transistor-based processor hardware began to flatten.

5 Conclusion

In this paper, the important variants of deep Q-networks used for solving reinforcement learning (RL) tasks were discussed and their underlying processes outlined. The original Deep Q-Network of Mnih et al. was presented, followed by its notable successors up to the state of the art. Recent trends in this direction were highlighted, and the major issues faced in the area were discussed, along with an indication of the future scope for the benefit of readers. It is hoped that this survey will help in the understanding and advancement of the state of the art in Deep Q-Learning.

6 Conflict of Interest

The authors declare no conflict of interest.

7 Acknowledgement of Funding

The project has not received any type of funding.