1 Introduction

Monte-Carlo Tree Search (MCTS) [1, 2] is a simulation-based search technique that Yannakakis and Togelius [3] list among the commonly used methods in games. Browne et al. [4] reviewed the evolution of MCTS, its variations and their successful applications up to 2012. Various techniques have been studied for enhancing MCTS, such as All Moves As First (AMAF) [5], Rapid Action Value Estimation (RAVE) [6], Generalized RAVE (GRAVE) [7], HRAVE [8] and the Move Average Sampling Technique (MAST) [9]. Powley et al. introduced the ICARUS (Information Capture And ReUse Strategy) framework for describing and combining such enhancements [10].

Moreover, MCTS has proved successful in many domains, including the game of Go [11, 12] and General (Video) Game Playing [13,14,15]. General Video Game Playing (GVGP) [14, 15] aims at creating agents that are capable of playing any unknown video game successfully without using prior knowledge or human intervention. The General Video Game AI (GVGAI) framework has been implemented for this research purpose [14]. Many successful General Video Game Playing agents are MCTS-based, such as YOLOBOT, the agent that won the GVGAI Single-Player Planning Championship in 2017, and MaastCTS2 [16], the agent that won it in 2016.

One of the difficulties of GVGP and the GVGAI competition is parameter tuning for AI agents. Examples are the exploration factor and the play-out depth of MCTS-based agents, or the population size, mutation probability and planning horizon of rolling horizon evolutionary agents. A common approach is to tune the parameters off-line on a given set of games. Gaina et al. [17] varied the population size and planning horizon of a vanilla Rolling Horizon Evolutionary Algorithm (RHEA) and compared their performance on a subset of games in the GVGAI framework. Bravi et al. [18] used Genetic Programming (GP) to evolve game-specific Upper Confidence Bound (UCB) alternatives, each of which outperformed the MCTS using standard UCB1 (Eq. 1) on at least one of the tested games. These UCB alternatives can be used to build a portfolio of MCTS agents to achieve robust game playing. However, such off-line tuning is computationally expensive and game dependent. Tuning the parameters of a given agent off-line for a new game is therefore sometimes not possible.

Recently, Sironi and Winands [19] have proposed on-line adaptive parameter tuning for MCTS agents and almost achieved the same performance as off-line tuning in General Game Playing. Their approach is based on the idea of interleaving parameter tuning with MCTS. Before each MCTS simulation a different combination of parameter values is selected to control the search. The reward obtained by the simulation is used to update statistics on the performance of such a combination of parameter values. These statistics are then used to choose which parameter values will control the next simulations. Four allocation strategies are proposed in [19] to decide which parameter values should be evaluated for each MCTS simulation and in which order.

In this work, we apply the same approach to tune on-line the parameters K (the UCB exploration factor) and D (the depth limit for the search) of a standard MCTS agent, sampleMCTS, of the GVGAI framework. First, we verify whether the on-line tuning approach can be applied successfully to GVGP by testing the most promising of the four allocation strategies presented in [19], Naïve Monte-Carlo (NMC). Second, we want to see whether the performance of the on-line tuning mechanism on GVGP can be further improved by using the principles of evolutionary algorithms. Therefore, we propose two more allocation strategies, one based on an Evolutionary Algorithm (EA) and one based on an N-tuple Bandit Evolutionary Algorithm (NTupleBanditEA). Finally, to validate the allocation strategies we evaluate the performance of sampleMCTS instances that use, as fixed parameter settings, the combinations of values that were used the most during game playing by each of the proposed allocation strategies.

This paper is structured as follows. Section 2 introduces the background, including GVGAI, MCTS and the on-line parameter tuning problem. Section 3 describes the approach and tuning algorithms. Section 4 presents the design of experiments and Sect. 5 analyzes the results. Finally, Sect. 6 concludes and proposes some future research.

2 Background

This section briefly introduces the General Video Game AI framework (Subsect. 2.1), the principles of Monte-Carlo Tree Search (Subsect. 2.2) and the formulation of the on-line parameter tuning problem (Subsect. 2.3).

2.1 General Video Game AI

The General Video Game AI (GVGAI) framework was initially designed and developed by the Games Intelligence Group at the University of Essex (UK), with the aim of using it as a research and competition framework for studying General Video Game Playing (GVGP). GVGAI consists of five tracks: two planning tracks (single- and two-player), where the forward model of every game is available [14, 15]; the single-player learning track, where no forward model is given [20]; the level generation track [21]; and the rule generation track [22]. The games are defined in the Video Game Description Language (VGDL) [23] and the framework is mainly written in Java. More about the tracks and competitions can be found on the GVGAI website. In this paper, we focus on a subset of the GVGAI single-player games. More about the game set is presented in Subsect. 4.1. Compared to the Arcade Learning Environment (ALE) [24], GVGAI has the advantage of being more extensible, meaning that it is much easier to add new games and variations of those games, and it also offers two-player games, which provide a greater range of challenges than single-player games. ALE currently has the advantage of offering commercial games, albeit from a few decades ago.

2.2 Monte-Carlo Tree Search

MCTS is a best-first search algorithm that incrementally builds a tree representation of the search space of a problem (e.g., a game) and estimates the values of states by performing simulations [1, 2]. An iteration of MCTS consists of four phases: (i) Selection: starting from the root node of the tree, a selection strategy is used to select the next actions to visit until a node is reached that is not fully expanded, i.e., at least one successor state has not been visited and its corresponding node has not been added to the tree yet; (ii) Expansion: in a node that is not fully expanded, an expansion strategy is used to choose one or more nodes that will be added to the tree; (iii) Play-out: starting from the last node added to the tree, a play-out strategy chooses which actions to simulate until either a given depth (the maximum play-out depth) or a terminal state is reached; and (iv) Back-propagation: at the end of the play-out, the result of the simulation is propagated back through all the nodes traversed in the tree and used to update the estimate of their value. These four phases are repeated until the search budget, in terms of time or state evaluations, has been exhausted. At this point, the best action in the root node is returned to be played in the real game. Different recommendation policies can be used to decide which action to return, for instance, recommending the one with the highest estimated average score or the one with the highest number of visits.
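To make the four phases concrete, the following minimal sketch shows one MCTS iteration. It is an illustration only, not the GVGAI sampleMCTS code: the forward model is assumed to be an object exposing legal_actions, next_state and is_terminal, the selection strategy is passed in as select_child (e.g., UCT with the UCB1 formula of Eq. 1), and evaluate scores the final play-out state.

```python
import random

class Node:
    """A tree node: a game state plus visit and reward statistics (sketch only)."""
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}                     # maps an action to a child Node
        self.visits, self.total_reward = 0, 0.0

def mcts_iteration(root, model, select_child, max_playout_depth, evaluate):
    # (i) Selection: descend the tree while the current node is fully expanded.
    node = root
    while not model.is_terminal(node.state) and \
            len(node.children) == len(model.legal_actions(node.state)):
        node = select_child(node)
    # (ii) Expansion: add one unvisited successor to the tree.
    if not model.is_terminal(node.state):
        untried = [a for a in model.legal_actions(node.state) if a not in node.children]
        action = random.choice(untried)
        node.children[action] = Node(model.next_state(node.state, action), parent=node)
        node = node.children[action]
    # (iii) Play-out: simulate random actions until the depth limit or a terminal state.
    state, depth = node.state, 0
    while not model.is_terminal(state) and depth < max_playout_depth:
        state = model.next_state(state, random.choice(model.legal_actions(state)))
        depth += 1
    reward = evaluate(state)                   # e.g. the heuristic of Eq. (2)
    # (iv) Back-propagation: update the statistics of every node on the selected path.
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent
```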

Many strategies have been proposed for the different phases of MCTS. The standard selection strategy is UCT (Upper Confidence bounds applied to Trees) [2]. UCT sees the problem of choosing an action in a certain node of the tree as a Multi-Armed Bandit (MAB) problem and uses the UCB1 [25] sampling strategy to select the action to visit next. UCT selects in node \(s\) the action \(a\) that maximizes the following formula:

$$\begin{aligned} UCB1(s,a) = Q(s,a) + K \times \sqrt{\frac{\ln N(s)}{N(s,a)}}, \end{aligned}$$
(1)

where \(Q(s,a)\) is the average result obtained from all the simulations in which action \(a\) was played in node (state) \(s\), \(N(s)\) is the number of times node \(s\) has been visited during the search and \(N(s,a)\) is the number of times action \(a\) has been selected whenever node \(s\) was visited. The constant \(K\) is used to control the balance between exploitation of good actions and exploration of less visited ones.
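As an illustration of Eq. 1, a selection function matching the Node sketch above could be written as follows; giving unvisited children infinite value is a common convention rather than part of the formula:

```python
import math

def uct_select(node, K=1.4):
    """Return the child of `node` that maximizes the UCB1 value of Eq. (1)."""
    def ucb1(child):
        if child.visits == 0:
            return float('inf')                  # make sure every action is tried once
        q = child.total_reward / child.visits    # Q(s, a): average simulation result
        return q + K * math.sqrt(math.log(node.visits) / child.visits)
    return max(node.children.values(), key=ucb1)
```

This function can be passed as the select_child strategy in the iteration sketch above.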

2.3 On-line Parameter Tuning

The parameters of an AI agent can be seen as a vector of integers and doubles (boolean parameters can be handled as integers with only two legal values). The tuning of parameters is therefore a problem of searching for (an) optimal numerical vector(s) in a given parameter search space. Given the combinatorial structure of the search space, Sironi and Winands [19] considered the tuning problem as a Combinatorial Multi-Armed Bandit (CMAB) [26]. The definition of the parameter tuning problem as a CMAB is given by the following three components:

  • A set of d parameters, \(P=\{P_{1},...,P_{d}\}\), where each parameter \(P_{i}\) can take \(m_{i}\) different values \(V_{i}=\{v_{i}^{1}, ..., v_{i}^{m_{i}}\}\).

  • A reward distribution \(R: V_{1} \times ... \times V_{d} \rightarrow \mathbb {R}\) that depends on the combination of values assigned to the parameters.

  • A function \(L: V_{1} \times ... \times V_{d} \rightarrow \{true, false\}\) that determines which combinations of parameter values are legal.

In this paper we use the same approach for on-line tuning that was presented in [19]. This approach consists of interleaving MCTS simulations with parameter tuning, as shown in Algorithm 1. Before each MCTS simulation the tuner \(\mathcal {T}\) selects a combination of values \(\mathbf {p}\) for the parameters. These values are set for the parameters of the agent \(AI_{MCTS}\), which performs an MCTS simulation of the game and returns the associated reward. This reward is used by the tuner to update the statistics for the combination of parameters \(\mathbf {p}\). Different allocation strategies can be used by the tuner to decide which parameter combination should be evaluated next depending on these statistics. In this paper we consider the most promising allocation strategy that was introduced in [19], Naïve Monte-Carlo (NMC), and propose two more, one based on an Evolutionary Algorithm and one based on an N-tuple Bandit Evolutionary Algorithm.

Algorithm 1. On-line parameter tuning interleaved with MCTS simulations.
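In code, the interleaving of Algorithm 1 reduces to a simple loop; the method names below (choose_combination, set_parameters, run_one_simulation, update_statistics) are hypothetical and only illustrate the interface between tuner and agent, they are not the code of [19]:

```python
def tune_while_searching(tuner, agent, budget_expired):
    """Interleave parameter tuning with MCTS simulations (sketch of Algorithm 1)."""
    while not budget_expired():
        p = tuner.choose_combination()        # allocation strategy picks, e.g., K and D
        agent.set_parameters(p)               # the next simulation is controlled by p
        reward = agent.run_one_simulation()   # one MCTS iteration on the current game state
        tuner.update_statistics(p, reward)    # feed the simulation result back to the tuner
    return agent.recommend_action()           # usual MCTS recommendation at the root
```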

3 Allocation Strategies

This section describes the three allocation strategies that are integrated into the GVGAI sampleMCTS agent: Naïve Monte-Carlo, an Evolutionary Algorithm and an N-tuple Bandit Evolutionary Algorithm. As a result, three Self-Adaptive MCTS (SA-MCTS) agents are created. The NMC strategy is the same as the one presented in [19]. Among the four strategies proposed in that paper, we decided to test this one for GVGAI because it gave the most promising results when used to tune parameters for GGP.

3.1 Naïve Monte-Carlo

Naïve Monte-Carlo (NMC) was first proposed by Ontañon [26] for Real-Time Strategy games. This approach proved suitable for dealing with combinatorial search spaces; thus, in [19] it was applied to the on-line parameter tuning problem, which is characterized by a combinatorial parameter space. NMC is based on the naïve assumption, i.e., the assumption that the reward associated with a combination of parameter values can be approximated by a linear combination of the rewards associated with each of the single values: \(R(\mathbf {p} = \langle p_{1},...,p_{d}\rangle ) \approx \sum _{i=1}^{d} R_{i}(p_{i})\).

Algorithm 2. The Naïve Monte-Carlo allocation strategy.

Algorithm 2 gives the pseudo-code for this strategy. NMC considers one global MAB (\(MAB_{g}\)) and d local MABs (\(MAB_{i}, i = 1 ... d\)), one for each parameter. Each arm of \(MAB_{g}\) corresponds to one of the legal combinations of parameter values that have been evaluated so far, while each arm in \(MAB_{i}\) corresponds to one of the legal values for parameter \(P_{i}\). This allocation strategy alternates between an exploration and an exploitation phase. For each MCTS simulation a combination of parameter values is selected by exploration with probability \(\epsilon _{0}\) and by exploitation with probability \((1-\epsilon _{0})\). If exploring, the next combination to be evaluated is selected by choosing each parameter value independently from the corresponding \(MAB_{i}\) (note that a combination that has never been visited before can be generated, so the search space is explored). If exploiting, the next combination is selected from the global MAB, \(MAB_{g}\) (in this case only previously seen combinations will be selected). \(MAB_{g}\) starts with no arms and a new arm is added whenever a new combination is generated using the local MABs. Whenever a combination of values is evaluated, the reward obtained by the corresponding MCTS simulation is used to update both the statistics associated with the global MAB and the ones associated with each of the local MABs. The SA-MCTS agent built using NMC as a tuner is denoted as SA-MCTS\(_{NMC}\).
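The sketch below illustrates the strategy with plain UCB1 bandits and the tuner interface used in Subsect. 2.3; it follows the description above rather than the exact implementation of [19]:

```python
import math, random

class Mab:
    """A UCB1 multi-armed bandit over a (possibly growing) set of arms."""
    def __init__(self, arms=()):
        self.stats = {a: [0, 0.0] for a in arms}          # arm -> [visits, total reward]
    def add_arm(self, arm):
        self.stats.setdefault(arm, [0, 0.0])
    def select(self, K):
        total = sum(n for n, _ in self.stats.values()) or 1
        def ucb1(arm):
            n, r = self.stats[arm]
            return float('inf') if n == 0 else r / n + K * math.sqrt(math.log(total) / n)
        return max(self.stats, key=ucb1)
    def update(self, arm, reward):
        self.stats[arm][0] += 1
        self.stats[arm][1] += reward

class NaiveMonteCarloTuner:
    def __init__(self, value_sets, eps0=0.75, K_l=0.7, K_g=0.7):
        self.local = [Mab(values) for values in value_sets]   # one local MAB per parameter
        self.global_mab = Mab()                               # arms are added on the fly
        self.eps0, self.K_l, self.K_g = eps0, K_l, K_g
    def choose_combination(self):
        if random.random() < self.eps0 or not self.global_mab.stats:
            # Exploration: choose each value independently from its local MAB;
            # this may generate a combination that has never been seen before.
            combo = tuple(mab.select(self.K_l) for mab in self.local)
            self.global_mab.add_arm(combo)
            return combo
        # Exploitation: choose a previously generated combination from the global MAB.
        return self.global_mab.select(self.K_g)
    def update_statistics(self, combo, reward):
        self.global_mab.update(combo, reward)
        for mab, value in zip(self.local, combo):
            mab.update(value, reward)
```

For the two parameters tuned in this paper, such a tuner would be created with the lists of legal values for K and D (Subsect. 4.2) and plugged into the tuning loop of Subsect. 2.3.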

3.2 Evolutionary Algorithm

Genetic Algorithms (GA) achieve good overall performance in General Video Game Playing [17]. However, in the case of on-line parameter tuning, we aim at using a simple algorithm as tuner, making the best use of the time budget either to evaluate more MCTS instances with different parameter settings or to evaluate a good MCTS instance more often. A combination of parameter values is considered to be an individual, where each single parameter is a gene. A simple Evolutionary Algorithm (EA) with \(\lambda \) individuals has been considered for evolving on-line the parameters of an MCTS agent, where the \(\mu \) elites of the previous generation are kept and \((\lambda -\mu )\) new individuals are reproduced. Each new individual is reproduced with probability \(p_c\) by uniformly random crossover between two elites selected uniformly at random, or otherwise by uniformly mutating one gene of a randomly selected elite. When evaluating a population, each individual (i.e., the corresponding parameter combination) is used to control a different MCTS simulation of the game and the outcome of the simulation is considered as the fitness of the individual. The SA-MCTS agent built using this EA as a tuner is referred to as SA-MCTS\(_{EA}\).
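A sketch of such a tuner, written against the same choose_combination/update_statistics interface as above (a hypothetical implementation, not the authors' code), could look as follows:

```python
import random

class EvolutionaryTuner:
    def __init__(self, value_sets, lam=50, mu=25, p_c=0.5):
        self.value_sets, self.lam, self.mu, self.p_c = value_sets, lam, mu, p_c
        self.population = [self._random_individual() for _ in range(lam)]
        self.fitness = {}                       # individual -> latest simulation reward
        self.pending = list(self.population)    # individuals still to be evaluated

    def _random_individual(self):
        return tuple(random.choice(values) for values in self.value_sets)

    def choose_combination(self):
        if not self.pending:                    # whole generation evaluated: evolve
            elites = sorted(self.population, key=self.fitness.get, reverse=True)[:self.mu]
            offspring = []
            while len(offspring) < self.lam - self.mu:
                if random.random() < self.p_c:  # uniform crossover of two random elites
                    a, b = random.sample(elites, 2)
                    child = tuple(random.choice(pair) for pair in zip(a, b))
                else:                           # mutate one gene of a random elite
                    child = list(random.choice(elites))
                    i = random.randrange(len(child))
                    child[i] = random.choice(self.value_sets[i])
                    child = tuple(child)
                offspring.append(child)
            self.population = elites + offspring
            self.fitness, self.pending = {}, list(self.population)
        return self.pending.pop()               # each individual controls one simulation

    def update_statistics(self, combo, reward):
        self.fitness[combo] = reward            # the simulation outcome is the fitness
```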

3.3 N-Tuple Bandit Evolutionary Algorithm

The N-Tuple Bandit Evolutionary Algorithm (NTupleBanditEA) was first proposed by Kunanusont et al. [27] for automatic game parameter tuning, where strong stochastic AI agents were used to evaluate the evolved games under uncertainty. It was later applied to the GVGAI framework for evolving game rules and parameter settings [28]. It makes use of all the statistics of the previously evaluated solutions and uses UCB1 to balance exploration and exploitation between evaluating a newly generated solution and re-evaluating an existing one. The detailed algorithm is not given in this paper due to lack of space; more can be found in [27]. We apply the NTupleBanditEA to optimize the parameters on-line by modeling each parameter to be tuned as a 1-tuple and considering their combinations as n-tuples. For this strategy we need to specify the number of neighbors generated at each iteration, n, and the exploration factor, denoted as \(K_{NEA}\) (cf. Algorithm 3 in [27]). The SA-MCTS agent built using this NTupleBanditEA as a tuner is referred to as SA-MCTS\(_{NEA}\).
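For illustration only, the sketch below captures the flavor of such a tuner: 1-tuple statistics per parameter, full-combination statistics, and UCB1-style estimates used to pick the next point among n mutated neighbors of the current one. It is a simplification under our own assumptions, not the algorithm of [27], which should be consulted for the actual procedure.

```python
import math, random

class NTupleBanditTuner:
    def __init__(self, value_sets, n_neighbors=5, K_nea=0.7):
        self.value_sets, self.n, self.K = value_sets, n_neighbors, K_nea
        self.one_tuples = [dict() for _ in value_sets]   # per-parameter value -> [visits, reward]
        self.full_tuple = {}                             # whole combination -> [visits, reward]
        self.total = 0
        self.current = tuple(random.choice(v) for v in value_sets)

    def _stat(self, table, key):
        return table.setdefault(key, [0, 0.0])

    def _estimate(self, combo):
        """Average UCB1-style value of the 1-tuple and full-tuple statistics of `combo`."""
        stats = [self._stat(t, v) for t, v in zip(self.one_tuples, combo)]
        stats.append(self._stat(self.full_tuple, combo))
        values = []
        for n, r in stats:
            explore = math.sqrt(math.log(self.total + 1) / max(n, 1))
            values.append((r / n if n > 0 else 0.0) + self.K * explore)
        return sum(values) / len(values)

    def choose_combination(self):
        return self.current                              # evaluate the current point once

    def update_statistics(self, combo, reward):
        self.total += 1
        for table, key in zip(self.one_tuples, combo):
            self._stat(table, key)[0] += 1
            self._stat(table, key)[1] += reward
        self._stat(self.full_tuple, combo)[0] += 1
        self._stat(self.full_tuple, combo)[1] += reward
        # Generate n neighbors by mutating one parameter and move to the candidate
        # with the highest estimate (exploitation of the model vs. exploration via UCB).
        candidates = [combo]
        for _ in range(self.n):
            i = random.randrange(len(combo))
            neighbor = list(combo)
            neighbor[i] = random.choice(self.value_sets[i])
            candidates.append(tuple(neighbor))
        self.current = max(candidates, key=self._estimate)
```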

4 Experimental Settings

In this section we introduce the design of the experiments, including the games used as test problems, the baseline AI agent to be tuned and the tested tuners.

4.1 Games

Each of the approaches described in Sect. 3 is tested on all 5 levels of 20 games (Table 1) of the GVGAI framework. The 20 games are the same as the ones used by Gaina et al. [17] for studying the parameters of a vanilla RHEA. Gaina et al. [17] uniformly randomly selected 20 games, on which the vanilla MCTS agent performs differently, from a merged list of games previously used by Nelson [29] and Bontrager et al. [30]. During game play, an agent has 40 ms per game tick to decide an action. In all the games, there is no draw. A game terminates if the agent wins or loses the game before 2,000 game ticks, or the game is forced to terminate as a loss for the agent after 2,000 game ticks. This is the same setting as in the GVGAI Single-Player Planning competitions. The only difference is that if the agent exceeds the 40 ms limit per game tick it will not be disqualified and can still apply its selected move.

Table 1. The 20 tested games. The authors of [17] have confirmed that some games were wrongly listed as deterministic games. Wait for Breakfast was listed as a deterministic game as it has negligible randomness (detailed in Subsect. 5.1).

4.2 Tuned Agent and Parameters

We consider the single-player sampleMCTS agent in the GVGAI framework as the \(AI_{MCTS}\) to be tuned, the performance of which mainly relies on two parameters, the maximum play-out depth D and the UCB1 exploration factor K. The heuristic used by the sampleMCTS agent for evaluating a game state is defined as follows:

$$\begin{aligned} Value(GameState)= \left\{ \begin{array}{l l} score(GameState) - 10,000,000.0 &{} \text { if a loss,} \\ score(GameState) + 10,000,000.0 &{} \text { if a win,}\\ score(GameState) &{} \text {otherwise.} \end{array} \right. \end{aligned}$$
(2)

Based on this sampleMCTS, the SA-MCTS agents are designed. These agents tune D and K on-line, considering 15 possible values for each parameter (i.e., 225 possible combinations of parameters). The same state evaluation heuristic (Eq. 2) is used by the self-adaptive agents. The SA-MCTS agents are compared to a baseline agent, the default sampleMCTS, with fixed values for D and K. The parameter settings are summarized as follows (a code sketch of the search space is given after the list):

  • sampleMCTS: \(D = 10\), \(K = 1.4\) (default setting in GVGAI);

  • SA-MCTS agents: \(D \in \{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15\}\), \(K \in \{0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0\}\).
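For concreteness, the discretized search space and the heuristic of Eq. 2 can be sketched as follows; the game-state accessor names are assumptions for illustration, not the exact GVGAI interface:

```python
# The two parameter grids used by the SA-MCTS agents: 15 x 15 = 225 combinations.
D_VALUES = list(range(1, 16))                              # 1, 2, ..., 15
K_VALUES = [round(0.6 + 0.1 * i, 1) for i in range(15)]    # 0.6, 0.7, ..., 2.0
SEARCH_SPACE = [(k, d) for k in K_VALUES for d in D_VALUES]
assert len(SEARCH_SPACE) == 225

BONUS = 10_000_000.0

def evaluate(game_state):
    """State evaluation heuristic of Eq. (2), written against assumed accessors."""
    score = game_state.get_score()
    if game_state.is_loss():
        return score - BONUS
    if game_state.is_win():
        return score + BONUS
    return score
```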

Table 2. Average win rate (%) and average game score over 5 levels of each game for the sampleMCTS agent and the SA-MCTS agents. The best values are in bold.

4.3 Tuning Strategies

Three SA-MCTS agents are considered in the experiments, one for each of the presented tuning strategies, NMC, EA and NTupleBanditEA. To distinguish them, in the tables and in subsequent sections they are denoted as SA-MCTS\(_{NMC}\), SA-MCTS\(_{EA}\) and SA-MCTS\(_{NEA}\), respectively. The following are the settings for the tuning strategies:

  • SA-MCTS\(_{NMC}\): \(\epsilon _{0}=0.75\) (i.e. select next combination using the local MABs with probability 0.75 and the global MAB with probability 0.25), \(\pi _{g} =\) UCB1 policy with exploration factor \(K_{g} = 0.7\), \(\pi _{l} =\) UCB1 policy with exploration factor \(K_{l} = 0.7\) (note that these two exploration factors are distinct from the UCB1 exploration factor to be tuned and used by the sampleMCTS agent).

  • SA-MCTS\(_{EA}\): population size \(\lambda =50\), elite size \(\mu =25\) (lower values for \(\mu \) were tested when applying this strategy to GGP and none of them outperformed \(\mu = 25\)), probability of generating an individual by crossover of two parents, \(p_{c}=0.5\).

  • SA-MCTS\(_{NEA}\): number of neighbors generated during evolution \(n=5\) (preliminary tests showed that higher values added no benefit), \(K_{NEA}=0.7\) (exploration constant for the UCB1 formula used to compute the value for a parameter combination [27]).

5 Results and Discussion

The results of the designed SA-MCTS agents and the baseline agent are analyzed and discussed in Subsect. 5.1. Subsection 5.2 illustrates the performance of static agents that use constant parameters, namely the most visited parameter combination during on-line tuning for each game.

5.1 On-line Tuning Performance

Each of the SA-MCTS agents, as well as the baseline sampleMCTS agent, has been run 500 times on each game (100 times per level over the 5 levels). During game play, an agent has 40 ms per game tick to decide an action. More about the games and settings has been presented in Subsect. 4.1. The win rate and average score of the sampleMCTS agent and the SA-MCTS agents on the 20 games are summarized in Tables 2a and b, respectively. In our setting, winning a game has the highest priority (Eq. 2), thus the win rate rather than the average score is used as the criterion for evaluating a tuner.

The SA-MCTS agents perform well overall on the games where the default sampleMCTS also performs well, except for Chopper and Survive Zombies. Moreover, in some of the games where sampleMCTS has poor performance, e.g. Chase, Escape, Crossfire and Wait for Breakfast (Fig. 1), the SA-MCTS agents significantly improve the win rate. These four games are detailed below.

Fig. 1. Screenshots of the games Chase, Escape, Crossfire and Wait for Breakfast.

Chase. (Figure 1(a)) The player must chase and kill scared goats that flee from the player. A dead goat immediately turns into a corpse and the player gains 1 point. A goat becomes angry as soon as it finds another goat's corpse, and starts to chase the player. The player wins the game if all scared goats are dead, but loses one point and immediately loses the game if it is caught by an angry goat. The game is very difficult because an angry goat never turns back into a scared one, and by default the game ends with a loss for the player after 2,000 game ticks. Thus, once a goat becomes angry, it will inevitably lead to a lost game for the player, but this negative reward is delayed until the end of the game. In our context, the game rules are not given to the agent, so the agent will not be aware of the defeat until being caught or until 2,000 game ticks have passed. The baseline agent, sampleMCTS, only wins 32 games out of 500, while the SA-MCTS agents win at least 72 games.

Escape. (Figure 1(b)) It is a puzzle game with a wide search space that requires long-term planning. The player (a rat) wins the game by taking the cheese. The player's score is 1 if it wins, \(-1\) otherwise. Sometimes, the player needs to push a block into a hole in order to clear a path to the cheese. Each of the 3 on-line tuning agents greatly improves the win rate (at least 22 times higher) compared to the baseline agent; in particular, SA-MCTS\(_{NEA}\) increases the win rate from \(0.2\%\) to \(13.0\%\). In the similarly tested puzzle game Bait, the player needs to push a block to fill a hole in order to build a path to the cheese. Interestingly, the win rate on Bait is not improved as much, though some significant improvements have been observed.

Crossfire. (Figure 1(c)) The player wins the game if it reaches the exit door without being hit by any shell, receiving a game score of 5. Once the player is hit by a shell, the game ends immediately with a loss for the player and a game score of \(-1\). The win rates achieved by the SA-MCTS agents are at least 2 times higher than that of the baseline agent.

Wait for Breakfast. (Figure 1(d)) In this game, all tables are empty when the game starts; at a random time a waiter (the NPC in black in Fig. 1(d)) serves a breakfast to the table with only one chair. The player (the avatar in green in Fig. 1(d)) wins the game only if it sits on that chair after the breakfast is served; otherwise (taking a wrong chair or taking the right chair before the breakfast is served), it loses the game. The serving time is defined as follows: at any game tick, if no breakfast has been served yet, the waiter serves a breakfast with probability 0.05, and the waiter serves at most one breakfast during a whole game. The probability that no breakfast has been served \(t\) game ticks after starting a game is therefore \(0.95^{t}\): in expectation the breakfast is served after 20 game ticks, and the probability that it has not been served by the end of a 2,000-tick game is below \(10^{-44}\). The randomness is thus negligible and this game can be considered as a deterministic game. The win rate is significantly improved by all 3 SA-MCTS agents, among which SA-MCTS\(_{NEA}\) increases the win rate from 15.4% to 44.0%.

Table 3. Average median number of iterations per tick for the sampleMCTS agent and the SA-MCTS agents. The number of forward model calls per iteration depends on the tuner and is sometimes not a constant. For space reasons, headers have been shortened as follows: SA-MCTS\(_{NMC}\) = NMC, SA-MCTS\(_{EA}\) = EA, SA-MCTS\(_{NEA}\) = NEA.
Table 4. Most visited combination of parameters per game for each of the SA-MCTS agents. Parameter combinations are expressed in the format [K, D], where K refers to the UCB1 exploration factor and D is the maximum play-out depth.

For reference, the average of the median number of iterations per game tick for each tested agent is given in Table 3. Note that during one iteration of any of the agents, the forward model is called multiple times.

The most visited combination of the UCB1 exploration factor K and the maximum play-out depth D per game (over 500 runs) for each of the SA-MCTS agents is listed in Table 4. Surprisingly, the most used play-out depth is 1. The SA-MCTS agents prefer the average instant reward over the average long-term reward. A possible reason is the heuristic (Eq. 2) used by the agents. Consider an MCTS agent with maximum play-out depth 10: even for a deterministic game, the number of possible game states after a play-out can increase at most exponentially with the depth, so the reward after a play-out can vary over a large range. If a play-out ends in a loss, the average reward of the parameter combinations with \(D=10\) decreases a lot due to the 10,000,000.0 penalty in score; if it ends in a win, the average reward increases thanks to the 10,000,000.0 bonus in score. In the games where an SA-MCTS agent gets a very low win rate, the parameter combinations with \(D=10\) are more likely to have a low overall average reward, so the tuner prefers a lower maximum play-out depth D, whereas in the games where an SA-MCTS agent gets a high win rate, parameter combinations with higher D are favored. For instance, in the game Plaque Attack, where all agents achieve a win rate higher than \(90\%\), the most visited maximum play-out depth D is 4, 15, and 15 for SA-MCTS\(_{NMC}\), SA-MCTS\(_{EA}\) and SA-MCTS\(_{NEA}\), respectively.

Table 5. Average win rate (%) and average game score over 5 levels for the sampleMCTS agent with default parameter values and the sampleMCTS agents with the most visited combination per game by each of the SA-MCTS agents. The best values are in bold.

5.2 On-line Tuning Validation

The most visited parameter combinations during on-line tuning are used as fixed settings for the sampleMCTS agent and tested on the same set of games. The parameters are kept fixed during game play, so no further tuning happens. We denote these instances of sampleMCTS as instance\(_{NMC}\), instance\(_{EA}\) and instance\(_{NEA}\). The average win rate and average final score of the sampleMCTS agent with default parameter values and of the sampleMCTS instances with the most visited combination per game by each of the SA-MCTS agents (cf. Table 4) are presented in Tables 5a and b, respectively.

In the game Escape (Fig. 1(b)), the win rate is significantly improved from \(0.2\%\) (baseline agent) to \(30.8\%\) by instance\(_{NEA}\), with parameters tuned by NEA, while the highest win rate of the on-line tuning agents is \(13.0\%\), obtained by SA-MCTS\(_{NEA}\) (shown in Table 2). In the game Wait for Breakfast (Fig. 1(d)), the win rate is significantly improved from \(15.4\%\) (baseline) to \(60.8\%\) by instance\(_{NMC}\) and instance\(_{EA}\), while the highest win rate obtained by the on-line tuning agents is \(44.0\%\), by SA-MCTS\(_{NEA}\) (shown in Table 2). However, some instances with constant parameter values perform much worse than the baseline agent in some games. For instance, in the game Aliens, the baseline agent (\(D=10\)) and instance\(_{EA}\) (\(D=15\)) win all the games, while instance\(_{NMC}\) and instance\(_{NEA}\) win only \(\sim 68\%\) of the games due to the maximum play-out depth \(D=1\). The same happens in the games Sea Quest and Survive Zombies. Due to the maximum play-out depth \(D=1\), instance\(_{NMC}\), instance\(_{EA}\) and instance\(_{NEA}\) lose the puzzle game Bait more often and lose almost all of the 500 runs of Chopper. Our SA-MCTS agents are thus more robust than the non-SA MCTS instances with constant parameter values.

6 Conclusion and Future Work

General Video Game Playing has attracted interest from researchers during recent years. On-line tuning of an agent for GVGP provides more adaptive and robust agents; however, it is rather difficult due to the real-time setting. In this paper, we have incorporated three different algorithms, Naïve Monte-Carlo, an Evolutionary Algorithm, and the N-Tuple Bandit Evolutionary Algorithm, to tune on-line the sampleMCTS agent of the GVGAI framework, thus creating three Self-Adaptive MCTS (SA-MCTS) agents. The SA-MCTS agents have been compared to the baseline MCTS agent, sampleMCTS, on 20 single-player GVGAI games. The SA-MCTS agents perform similarly to sampleMCTS on the games on which sampleMCTS performs well, and significantly improve the win rate in the games on which sampleMCTS performs poorly. The SA-MCTS agents achieve more robust results on the tested games. Additionally, the sampleMCTS instances using the parameter settings most sampled per game by each of the three SA-MCTS agents improve the win rate on Wait for Breakfast and Escape by factors of 4 and 150, respectively.

The approach used in this paper and the tested tuning strategies have proved successful in GVGAI; in particular, the NMC tuning strategy has also obtained promising results in GGP. On-line agent tuning for GVGP is important because successful approaches can rapidly improve and specialize the general abilities of the agents, leading to better performance across a wide range of games. The research also has applications outside of games, to any problem that has a fast simulation model and requires rapid and adaptive decision making.

This work can be extended in different directions. Applying the SA-MCTS agents to 2-player GVGAI games, where the time limit remains 40 ms, will be interesting. The heuristic (Eq. 2) of sampleMCTS is directly used as the reward for the tuners; the preference for a maximum play-out depth of 1 encourages us to explore a better reward function for the tuners. In this paper, we focus on a discrete search space: the search space of the UCB1 exploration factor is discretized by sampling it uniformly within a range with a fixed step, though that was just an experimental design choice, as all algorithms under test can work with any selection of parameter choices. An interesting future work is applying continuous parameter tuning using Evolution Strategies (ES), such as the Covariance Matrix Adaptation Evolution Strategy (CMA-ES), which does not rely on the assumption of a smooth problem. Another future work is tuning a more advanced agent, possibly with more parameters to be tuned. A potentially good choice is the winner of the 2016 GVGAI Single-Player Planning Championship, MaastCTS2 [16], which is also MCTS-based. There is also much work to be done in tuning the play-out policy. A particular challenge is to deal more efficiently with flat reward landscapes using methods that seek diverse points in the (game) state space in the absence of any differences in the explicit reward (e.g. Novelty Search). Tuning a non-MCTS-based agent, such as a Rolling Horizon Evolutionary Algorithm, will be interesting and challenging due to the real-time constraint on evaluating a population. In general, agents with more parameters to be optimized will provide a more challenging test for the sample efficiency of the tuning methods.