1 Introduction

In 2016, AlphaGo [1], the computer program developed by Google's DeepMind team, defeated the 18-time world champion Lee Sedol in a five-game Go match and demonstrated the power of artificial intelligence (AI). In 2017, AlphaGo Zero [2], which adopted a deep reinforcement learning (DRL) algorithm, defeated the previous generation of AlphaGo after only a short period of training, dominating the match and revealing the power of DRL. Nowadays, DRL is widely applied in many fields, e.g., manufacturing, natural language processing, medical image processing, intelligent driving and video game AI design. Among these areas, AI design for video games is of particular interest to researchers, as it provides a convenient test platform for investigating the performance of machine learning algorithms in complex environments.

With the development of DRL, end-to-end feature extraction has replaced conventional manual feature extraction, so that an AI can complete game tasks without human intervention. In 2013, Mnih et al. employed DRL to play video games, and the DRL-based AI even surpassed top human players in some Atari games [3]. However, these Atari games are generally too simple compared with practical applications in real life. Before long, the DeepMind team released the StarCraft II Learning Environment (SC2LE) [4] as a new test platform for DRL studies. The SC2LE involves dynamic perception and estimation, imperfect-information gaming and multi-agent cooperation, resulting in a more practical test environment where decision making is much closer to real-life situations.

Although the latest StarCraft II AI has been able to beat top human players in certain circumstances [5], the training required a vast number of episodes to achieve satisfying performance, due to the limits of the employed conventional Actor Critic (AC) algorithm [6,7,8]: for example, the advantage estimator returns a single value of limited accuracy, and random action selection lacks effectiveness. In this paper, an improved design of the AC algorithm, namely Advanced Actor Critic (AAC), is proposed for better convergence of AI training. A distributional advantage estimator is employed to estimate the distribution of the advantage in a certain state. In addition, a normal constraint, i.e., a loss function for spatial action selection based on the normal distribution, as well as an exploration method based on confidence are introduced. As for the neural network framework, the fully convolutional network with long short-term memory (CNN-LSTM) [9] is selected, following the DeepMind approach [4]. Additionally, the sparse reward problem [10] is addressed by reward shaping. Finally, a combined model based on Rainbow [11] (a combination of Double Q-Learning [12], dueling DQN [13], Prioritized Experience Replay [14], distributional reward [15] and noisy net [16]) is adopted, in which the proposed algorithm is embedded. This approach is applied to the SC2LE mini-games to verify that the AI can learn faster, play better and be more robust with the AAC-based training program. The main contributions of this paper are to improve the conventional AC algorithm, which leads to better performance in complex environments, and to propose a regional updating action mode, instead of independent point updating, for intelligent control in high-dimensional action spaces.

2 Background

The SC2LE is provided by DeepMind [4]. In DRL, state, action and reward are the most important elements. In StarCraft II mini-games, these elements are represented by screen features and mini-map features.

2.1 Environment

Reward. The reward and penalty in each mini-game are shown in detail in Table 1.

Table 1. Reward and Penalty in mini-games.

State. In StarCraft II games, the environment is processed by the interface, which outputs several feature layers describing the current state (see Fig. 1).

Fig. 1. Feature layers of the current state, which include all the simplified information of the mini-games.

During training, the size of these feature images is their resolution. Generally, the height and width are both set to N, i.e., the observed and operated spatial pixels form an \(N\times N\) grid. DeepMind pointed out that \(N\ge 64\) [4] is required to perform any micro-operation in the game. For mini-games, however, \(N=32\) proves sufficient.

Among these \(N\times N\) pixels, each feature layer represents a specific feature, such as player ID, unit ID, fog of war, unit attributes, etc. Some layers are scalar-valued, while others are categorical.

In addition, the difference between the mini-map and the screen is that the screen covers only the part of the mini-map visible to the camera, but describes the units in more detail. After processing, the mini-map is transformed into an \(F_{minimap}\times N\times N\) tensor and the screen into an \(F_{screen}\times N\times N\) tensor, where \(F_{minimap}\) and \(F_{screen}\) are the numbers of mini-map and screen features, respectively.

Action. The instruction set of StarCraft II is very large. Through the interface, each action is divided into two parts (spatial action and non-spatial action) and output to the environment. The whole action is \(A=(A_{non-spatial},[A_{add},A_{spatial}])\). The spatial action represents the AI clicking on a position on the screen, and its value range is the pixel grid of the environment, that is, \(A_{spatial}=(x,y)\in N\times N\). The non-spatial action represents an operation instruction, such as select, move, attack, build, view, etc. Its value range is the full game instruction set of StarCraft II, and at the same time it is restricted to the actions available in the current environment, that is, \(A_{non-spatial}\in A_{available}\cap A_{all}\), where \(A_{available}\) contains the actions that can be performed in a certain state and \(A_{all}\) contains all action instructions in StarCraft II. In particular, non-spatial actions are mandatory while spatial actions are optional. For example, if a unit is selected by the AI, only the non-spatial action is output, which means the spatial action becomes unnecessary. \(A_{add}\) is the type of the \(A_{non-spatial}\) operation; when \(A_{non-spatial}\) and \(A_{spatial}\) are determined, \(A_{add}\) is generated automatically.
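As an illustration, the following is a minimal sketch of how such a composite action could be assembled from the policy outputs while masking unavailable instructions (the helper names are hypothetical, not the SC2LE API; \(A_{add}\) is omitted because it is generated automatically):

```python
import numpy as np

N = 32  # spatial resolution used for the mini-games

def sample_action(pi_non_spatial, pi_x, pi_y, available_ids, needs_spatial):
    """Assemble A = (A_non_spatial, [A_spatial]) from policy outputs.

    pi_non_spatial : probabilities over the full instruction set A_all
    pi_x, pi_y     : probabilities over the N spatial coordinates
    available_ids  : indices of A_available in the current state
    needs_spatial  : dict mapping action id -> True if the action takes a screen coordinate
    """
    # Restrict the non-spatial policy to A_available ∩ A_all and renormalize.
    mask = np.zeros_like(pi_non_spatial)
    mask[available_ids] = 1.0
    masked = pi_non_spatial * mask
    masked /= masked.sum()

    a_non_spatial = int(np.random.choice(len(masked), p=masked))

    # The spatial action is optional: it is only sampled when the chosen instruction needs it.
    if needs_spatial.get(a_non_spatial, False):
        x = int(np.random.choice(N, p=pi_x))
        y = int(np.random.choice(N, p=pi_y))
        return a_non_spatial, (x, y)
    return a_non_spatial, None
```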

2.2 Actor Critic and Fully-Convolution with LSTM

Actor Critic. The policy adopted by the agent induces a distribution over the state-action trajectories generated by interacting with the environment; which trajectories this distribution favours depends on the parameter \(\theta \). One such trajectory is \(\tau =(s_{1},a_{1},s_{2},a_{2},\ldots ,a_{t},s_{t})\), and the probability of obtaining it is \(p_{\theta }(\tau )=\prod _{t=1}^{T}p_{\theta }(a_{t}\mid s_{t})\); the environment transition probabilities do not depend on \(\theta \) and can therefore be ignored.

In order to get the best performance of the agent, the optimization goal is \(\theta ^{*}=\arg \underset{\theta }{\max }\,\bar{R}_{\theta }\), where \(\bar{R}_{\theta }=E_{\tau \sim p_{\theta }}[R(\tau )]\) is the expected cumulative reward. Its gradient, estimated over N sampled trajectories, is:

$$\begin{aligned} \triangledown \bar{R}_{\theta }=\frac{1}{N}\sum _{n=1}^{N}\sum _{t=1}^{T_{n}}A_{t}^{n}\triangledown log(p_{\theta } (a_{t}^{n}\mid s_{t}^{n})) \end{aligned}$$
(1)

Here \(A_{t}\) denotes the advantage estimated by the critic. In this paper, the spatial policy net and the non-spatial policy net are independent (\(\pi _{\theta }\) represents the non-spatial policy net, \(\pi _{\theta }'\) the spatial policy net, and \(\rho \) indicates whether the spatial action is valid or not), therefore the loss function in each episode is:

$$\begin{aligned} Loss_{\pi _{\theta }}=\sum _{t=1}^{n}A_{t}[log(\pi _{\theta } (a_{t}\mid s_{t}))+\rho log(\pi _{\theta }' (a_{t}\mid s_{t}))] \end{aligned}$$
(2)
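As an illustration only, a minimal PyTorch sketch of formula (2), assuming per-step log-probabilities, advantages and validity flags have been collected as tensors (the function and tensor names are hypothetical); the sum is negated so that minimizing the loss ascends the objective:

```python
import torch

def policy_loss(log_pi_non_spatial, log_pi_spatial, advantage, rho):
    """Episode loss following formula (2).

    log_pi_non_spatial : log pi_theta(a_t | s_t), shape (T,)
    log_pi_spatial     : log pi_theta'(a_t | s_t), shape (T,)
    advantage          : A_t from the critic, shape (T,)
    rho                : 1.0 if the spatial action was valid at step t, else 0.0
    """
    # The advantage acts as a fixed weight, so gradients flow only through the policies.
    adv = advantage.detach()
    per_step = adv * (log_pi_non_spatial + rho * log_pi_spatial)
    # Negated so that gradient descent on the loss performs ascent on expected reward.
    return -per_step.sum()
```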

Combining Actor Critic with CNN-LSTM. The framework of interaction between the learning algorithm and the StarCraft II environment is shown in Fig. 2.

Fig. 2. Interaction between the DRL-based AI and the StarCraft II environment. The inputs of the network are (1) the available actions, (2) the screen features and (3) the mini-map features. The outputs are (1) the X and Y of the spatial action, (2) the non-spatial action and (3) the advantage estimate. These outputs are all distributions.
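To make the figure concrete, the following is a minimal PyTorch sketch of the four output heads, assuming the shared CNN-LSTM body has already produced a flat feature vector (256 feature neurons and 51 atoms follow the settings reported in Sect. 6; the linear heads, the class name and the masking scheme are simplifications rather than the exact architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AACHeads(nn.Module):
    """Output heads of Fig. 2: spatial X/Y, non-spatial action and value distribution."""

    def __init__(self, n_actions, feat_dim=256, n=32, n_atoms=51):
        super().__init__()
        self.pi_x = nn.Linear(feat_dim, n)                    # distribution over X coordinates
        self.pi_y = nn.Linear(feat_dim, n)                    # distribution over Y coordinates
        self.pi_non_spatial = nn.Linear(feat_dim, n_actions)  # distribution over instructions
        self.value_dist = nn.Linear(feat_dim, n_atoms)        # distributional advantage head

    def forward(self, features, available_mask):
        # Mask unavailable instructions before the softmax so their probability is ~0.
        logits = self.pi_non_spatial(features).masked_fill(available_mask == 0, -1e9)
        return (F.softmax(self.pi_x(features), dim=-1),
                F.softmax(self.pi_y(features), dim=-1),
                F.softmax(logits, dim=-1),
                F.softmax(self.value_dist(features), dim=-1))
```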

3 Reprocessing of Environment Feature

Several feature layers are extracted from the environment of the StarCraft II games. However, some of these layers are scalar while others are categorical, which means the distribution of values is very uneven. In addition, some feature layers are almost useless in the mini-games. In order to train the neural network better and faster, it is necessary to reprocess these feature layers.

These layers can be divided into categorical value layers and scalar value layers. The categorical value layers include player ID, friend, visibility, creep, etc.; generally, these values are small. The scalar value layers include unit type, unit health, terrain level and armor value. The prior knowledge of a StarCraft II player can be exploited to choose the more helpful feature layers and discard the useless ones. Among these layers, the chosen mini-map feature layers are visibility, camera, player ID, player relative and selected, and the chosen screen feature layers are visibility, player ID, player relative, selected, HP and energy.

After the layer selection, the layers are processed numerically. In the categorical value layers, the values are standardized into [0, 1] to improve the training efficiency of the neural networks. In the scalar value layers, the threshold is set to half of the maximum value, and a sigmoid function is adopted to turn the value into a state indicator. For example, if a unit has more than 50% of its maximum hit points (HP), it is considered healthy, otherwise weak. After this reprocessing, the value ranges of all layers are compressed into [0, 1].
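A minimal NumPy sketch of this reprocessing, assuming the layer maxima are known; the sigmoid sharpness k is an assumed constant, not a value from the paper:

```python
import numpy as np

def preprocess_categorical(layer, max_value):
    """Scale small categorical values into [0, 1]."""
    return layer.astype(np.float32) / max_value

def preprocess_scalar(layer, max_value, k=0.2):
    """Map scalar layers through a sigmoid centred at half the maximum value.

    Values well above the threshold map close to 1, values well below it close to 0.
    """
    threshold = max_value / 2.0
    return 1.0 / (1.0 + np.exp(-k * (layer.astype(np.float32) - threshold)))
```

For example, preprocess_scalar(np.array([10.0, 80.0]), max_value=100) maps the two HP values close to 0 (weak) and close to 1 (healthy), in line with the 50% rule above.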

4 Advanced Actor Critic (AAC)

The AC algorithm is improved from several perspectives: the update data break the temporal correlation; the policy network is more robust; the evaluation network is more accurate. The improvements are explained as follows.

4.1 Distributional Advantage

In the AC algorithm, the critic network is responsible for outputting the advantage estimate under a certain policy, and the accuracy of this estimate determines the update efficiency of the policy net. Therefore, a distributional network is designed to capture more information. In this network, the maximum and minimum values of the support are determined by the background conditions, and the probability of each advantage atom is produced by the last convolutional layer followed by a softmax function. When updating, the sample \(\widehat{R}\) obtained from the Monte Carlo (MC) return or the temporal difference (TD) is transformed into a pulse function \(D(\widehat{R})\) (clamped onto the support to avoid invalid values), and the KL divergence is used to evaluate the loss of the advantage estimate (see Fig. 3).

In this case, the advantage \(A_{t}\) and the value loss function \(Loss_{V_{\theta }}\) can be formulated as follows:

$$\begin{aligned} A_{t}=E_{p(V_{t})\sim V_{\theta }}(V_{\theta }(s_{t})) \end{aligned}$$
(3)
$$\begin{aligned} Loss_{V_{\theta }}=D_{KL}[D(\widehat{R})\parallel V_{\theta }(s_{t})] \end{aligned}$$
(4)
Fig. 3. The process of converting the sample \(\widehat{R}\) into a pulse function and computing the KL divergence with the advantage estimate distribution.
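A minimal PyTorch sketch of formulas (3) and (4), assuming a fixed support of 51 atoms between illustrative bounds V_MIN and V_MAX; the sample \(\widehat{R}\) is clamped onto the support and converted to a one-hot pulse at the nearest atom before the KL divergence is computed (function names are hypothetical):

```python
import torch

V_MIN, V_MAX, N_ATOMS = -10.0, 30.0, 51        # support bounds are illustrative; 51 atoms as in Sect. 6
ATOMS = torch.linspace(V_MIN, V_MAX, N_ATOMS)  # atom values z_i
DELTA = (V_MAX - V_MIN) / (N_ATOMS - 1)

def pulse(r_hat):
    """Turn the scalar sample R_hat into a pulse distribution D(R_hat) on the support."""
    r = torch.clamp(torch.as_tensor(r_hat), V_MIN, V_MAX)  # clamping keeps the pulse on the support
    idx = int(torch.round((r - V_MIN) / DELTA))            # nearest atom (a two-hot projection would be smoother)
    d = torch.zeros(N_ATOMS)
    d[idx] = 1.0
    return d

def advantage_and_value_loss(value_probs, r_hat, eps=1e-8):
    """A_t = E[V_theta(s_t)] (formula (3)) and Loss_V = KL[D(R_hat) || V_theta(s_t)] (formula (4))."""
    a_t = (value_probs * ATOMS).sum()
    target = pulse(r_hat)
    loss = (target * (torch.log(target + eps) - torch.log(value_probs + eps))).sum()
    return a_t.detach(), loss
```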

4.2 Exploration Based on Confidence

One of the important problems in DRL is exploration. In mini-games, the exploration ability of the agent in the early stage significantly influences its performance. Without exploration, the agent keeps taking the same action as long as the reward is positive; thus, it might be trapped in a local optimum and ignore other actions with huge potential reward. Therefore, a good agent needs proper exploration ability. In contrast to the exploration method of the conventional AC algorithm, the maximum probability of the non-spatial policy, namely the confidence, is adopted to drive exploration. \(\omega \) represents the confidence threshold below which random exploration is conducted, and the AI additionally explores in an \(\epsilon \)-greedy fashion to prevent falling into a local optimum. The less confident the AI is, the more exploration is encouraged, and vice versa.

$$\begin{aligned} A_{non-spatial}=\left\{ \begin{matrix} random \, choice,\,\, \underset{a}{\max }\,\pi _{\theta }(a\mid s_{t})<\omega \,\, or \,\,\epsilon <0.1\\ \arg \underset{a}{\max }\, \pi _{\theta }(a\mid s_{t}),\,\, otherwise \end{matrix}\right. \end{aligned}$$
(5)
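A minimal NumPy sketch of formula (5); the probabilities are assumed to be restricted to the currently available instructions, and \(\omega \) is passed in as a hyperparameter:

```python
import numpy as np

def select_non_spatial(pi, available_ids, omega, epsilon_low=0.1):
    """Confidence-based exploration following formula (5).

    pi            : policy probabilities over the currently available instructions
    available_ids : the instruction ids those probabilities refer to
    omega         : confidence threshold below which the agent explores
    """
    epsilon = np.random.rand()
    confidence = float(pi.max())                     # max probability = confidence of the policy
    if confidence < omega or epsilon < epsilon_low:
        return int(np.random.choice(available_ids))  # explore: low confidence or epsilon-greedy
    return int(available_ids[int(pi.argmax())])      # exploit: most confident instruction
```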

4.3 Normal Constraint

In the environment of StarCraft II games, the most difficult problem is the huge continuous action space. In the AAC algorithm, the policy net consists of three parts: spatial policy net X, spatial policy net Y and the non-spatial policy net. The update mode of the non-spatial policy net is introduced in Sect. 2.2. Given the particularity of the action space in StarCraft II games, the updates of spatial policy nets X and Y introduce an additional term for the sake of convergence speed, namely the Normal Constraint (NC). To enhance exploration ability, a policy entropy term is usually added to the loss function of an AC network. In the case of a large continuous action space and sparse rewards, if the entropy weight is not well designed, the policy net easily degenerates to outputting the same probability everywhere. To solve this problem, the Normal Constraint is designed. In StarCraft II games, actions are selected in a continuous space, and the selection probabilities around a specific point should be similar. Therefore, when an action is updated, the nearby actions should be updated accordingly. In this way, the AI can understand the high-dimensional action space regionally. Finally, at convergence, the output of the spatial policy is a stable distribution that approximately follows a normal distribution (see Fig. 4, Fig. 5).

Fig. 4. How the Normal Constraint (NC) works.

The probability of the selected spatial action, i.e., the chosen coordinate in the environment, is used as the weight of the NC loss. A greater probability implies a more concentrated policy, and the probabilities of the surrounding coordinates are further enhanced or reduced accordingly during the update. Combined with formula (3), the NC loss function is:

$$\begin{aligned} Loss_{NC}=\sum _{t=1}^{n}A_{t}max(\pi _{\theta }')D_{KL}[NC(a_{t})\parallel \pi _{\theta }'(s_{t})] \end{aligned}$$
(6)
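A sketch of the per-step Normal Constraint loss in formula (6): the target NC(a_t) is taken here as a discretized normal distribution centred on the selected coordinate, with an assumed standard deviation σ (the exact construction of NC is not spelled out in the text, so this is one plausible reading):

```python
import torch

def normal_constraint_target(center, n=32, sigma=2.0):
    """Discretized normal distribution NC(a_t) over the N coordinates, centred at the chosen one."""
    coords = torch.arange(n, dtype=torch.float32)
    weights = torch.exp(-0.5 * ((coords - float(center)) / sigma) ** 2)
    return weights / weights.sum()

def nc_loss(advantage, pi_spatial, chosen_coord, eps=1e-8):
    """Per-step Loss_NC = A_t * max(pi') * KL[NC(a_t) || pi'(s_t)] for one axis (X or Y).

    advantage  : A_t as a tensor, pi_spatial : probabilities over the N coordinates of that axis
    """
    target = normal_constraint_target(chosen_coord, n=pi_spatial.shape[0])
    kl = (target * (torch.log(target + eps) - torch.log(pi_spatial + eps))).sum()
    return advantage.detach() * pi_spatial.max().detach() * kl
```

Summing this per-step term over the episode and over both the X and Y spatial policies gives formula (6).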

Finally, the whole network is updated according to formulas (2), (4) and (6).

4.4 Reward Shaping

The reward of StarCraft II is very sparse, which requires the agent to have a stronger exploration ability. During training, a small negative reward is added at every time step to motivate the AI to take actions. Besides, the reward values are standardized into the range [0, 1]. Such reward shaping is very effective in a sparse-reward environment.
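A minimal sketch of this reward shaping; the maximum reward and the per-step penalty used here are illustrative constants, not values reported in the paper:

```python
def shape_reward(raw_reward, max_reward=10.0, step_penalty=0.01):
    """Standardize the game reward into [0, 1] and add a small penalty at every step."""
    normalized = min(max(raw_reward / max_reward, 0.0), 1.0)  # standardize positive rewards into [0, 1]
    return normalized - step_penalty                          # the per-step negative reward motivates action
```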

5 Result

The agents based on the AAC network and the A3C network are applied to 4 mini-games, and the average score over every 100 episodes as well as the maximum score within 20K episodes are compared. In the initial stage of training, i.e., the observation episodes, the update of the policy net is stopped. Since the training of the policy net depends on the critic net, the critic net must be trained first during this observation stage.

Fig. 5. Comparison of the policy net with and without the Normal Constraint (the upper plot uses NC, the lower one does not).

Some of the hyperparameters are shown in Table 2; the Adam optimizer is adopted [17].

Table 2. Hyperparameters.

Fig. 6. Performance of AAC and A3C across the 2 mini-games that mainly focus on spatial actions.

Fig. 7. Performance of AAC and A3C across the 2 small-scale combat games.

Table 3. Best score in each mini-game.

6 Discussion

Mini-games in StarCraft II are divided into two categories according to their characteristics to test the performance of AAC and A3C: one category mainly focuses on spatial actions and the other consists of small-scale combat games (see Fig. 6, Fig. 7). After 20K episodes, the average and maximum scores achieved by the AAC-based agent are both higher than those of the A3C-based agent across the four mini-games (see Table 3).

The key to a better score lies in the performance of the critic in the Actor Critic algorithm, namely the accuracy of the advantage estimator. In a certain state (see Fig. 8), the critic net in AAC estimates the reward with an expected value of 16.4 while the true value is 25; in contrast, the critic net in A3C estimates the reward as 6.1 while the true value is 24. Thus, the distributional estimator in the critic net collects and computes the recent expected reward of a state, which contains more information and is closer to the true reward than the single-valued estimate in A3C. During the experiments, it is observed that the more atoms the distribution has, the more accurate the estimate is, because the distribution is more fine-grained. However, more atoms require more neurons in the critic net and more training time. In this paper, 256 feature neurons with 51 atoms are chosen [15].

The convergence speed of AAC and A3C is measured by the episode at which the agent achieves scores close to its best. At first glance, AAC appears to converge faster than A3C in all games except Move to Beacon. However, it should be noticed that the A3C-based agent achieved extremely poor scores in the middle of the experiment (from episode 5k to 6k); the reason is that the relatively small probability assigned to the spatial action leads to hesitation rather than confidence (see Fig. 8).

Fig. 8. Performance of AAC and A3C on Move to Beacon at episode 5k; (a) (b) (c) (d) are generated by AAC while (d) (e) (f) are generated by A3C.

In conclusion, around episode 5k, the A3C-based agent chose the correct target only indecisively, because the corresponding probability was the maximum but not dominant, which indicates that the policy net had not converged. As a result, the convergence speed depends not only on when the agent achieves a stable and good score, but also on the distribution of the policy net, which represents the confidence of the agent. In AAC, the Normal Constraint encourages and guides the agent in the right direction. Additionally, this positive feedback allows the agent to explore a certain area of the huge action space with more confidence, which speeds up convergence. As a result, the AAC-based agent converges faster than the A3C-based one across all games. Besides, the probability distribution of the policy net has not converged in the two small-scale combat games because the training episodes are not sufficient.

In future work, more episodes and more distribution atoms in the advantage estimator will be applied to more difficult mini-games, full games and other fields. In addition, whether AAC can outperform other AC algorithms in more complex environments will also be tested.