1 Introduction

Breakthrough is a fairly new game, invented in 2000 by Dan Troyka; it won the 2001 about.com \(8 \times 8\) Game Design Competition. It offers an interesting mix of strategy and tactics, which has made it a popular game for games researchers [1,2,3,4,5].

It is played on an \(8\times 8\) board. The first two rows are populated with White’s pieces and the last two rows with Black’s. The first row is White’s home row and the eighth row is White’s goal. White moves their pieces towards their goal, one square forward at a time, either straight or diagonally. White may not move to a square already occupied by a white piece and may only capture a black piece if it is a diagonal move. White cannot move straight forward to a square occupied by a black piece. Black moves similarly. The first player to reach their goal row or capture all the opponent’s pieces is the winner. A draw is impossible in Breakthrough.
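
To make the move rules concrete, the sketch below enumerates White's legal moves under a hypothetical board encoding (an \(8 \times 8\) array indexed [row][column], with row 0 as White's home row and values 0 = empty, 1 = White, 2 = Black); the encoding and function name are ours, not Wanderer's.

```python
def white_moves(board):
    """Generate White's legal moves: one row forward, straight or diagonal.
    Straight moves require an empty square; diagonal moves may enter an
    empty square or capture a black piece, but never a white one."""
    moves = []
    for row in range(7):                 # a piece on row 7 has already won
        for col in range(8):
            if board[row][col] != 1:     # not a white piece
                continue
            for dc in (-1, 0, 1):
                new_col = col + dc
                if not 0 <= new_col < 8:
                    continue
                target = board[row + 1][new_col]
                if (dc == 0 and target == 0) or (dc != 0 and target != 1):
                    moves.append(((row, col), (row + 1, new_col)))
    return moves
```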

Figure 1 shows two Breakthrough positions, both with White to move. The one on the left is the starting position and the one on the right is after 22 moves have been played. From the second position the white piece on g2 may move either to g3 or h3, the piece on f3 may either move to g4 or capture on e4, and the piece on f4 may make either diagonal move.

Fig. 1. Sample Breakthrough positions.

Starting with Wanderer, a strong MCTS/EPT (Monte-Carlo Tree Search with Early Playout Termination) program that plays Breakthrough [3], we created two new versions: Wan-TDL uses temporal difference learning (TDL) to train an evaluation function, and Wan-CNN uses convolutional neural networks (CNNs) to generate, for a given board configuration, a probability distribution over moves that imitates an expert Wanderer.

Machine learning techniques have long been used in game programming. In 1995 Tesauro used TDL to create a world class Backgammon playing program [7] and over the years many other game playing programs have used TDL in one form or another [8, 9]. We use a straightforward implementation of TDL to train an evaluation function that replaces the hand-tuned evaluation function in Wanderer in order to create Wan-TDL. In Sect. 2 we describe the various patterns (n-tuples) we worked with, our reasons for choosing those patterns, and the results associated with each pattern choice.

Last year DeepMind famously created AlphaGo, a program that uses CNNs and plays the game of Go at a level above even the world’s best human players [10, 11]. In Sect. 3 we explain how we apply these ideas to create Wan-CNN. First we created CNN policy networks that rank the possible moves from any board position. Because there are not enough publicly available games between strong human players to learn from, the training moves were generated by Wanderer. Then we created a new MCTS program in Python that uses the policy net in an MCTS setting.

Finally in Sect. 4 we summarize our results and discuss future work.

2 Temporal Difference Learning and Breakthrough

TDL is an unsupervised reinforcement learning technique whose history goes back at least to Samuel’s checkers program [12, 13]. Indeed, in 1959 Samuel captured the spirit of TDL when he said that a strong program to play a game (of checkers) can be produced using machine learning “...when given only the rules of the game, a sense of direction, and a redundant and incomplete list of parameters which are thought to have something to do with the game, but whose correct signs and relative weights are unknown and unspecified” [12]. It is in this spirit that we used TDL to program Breakthrough. To be a bit more precise, we employed TD(\(\lambda \)), a more recent and probably the most widely used version of TDL in game programming [14].

2.1 TDL Parameters

Corresponding to the parameters mentioned in the above quote, we use patterns of board squares on the Breakthrough board. This technique is usually referred to as the n-tuple approach; however, we feel the term pattern is more suggestive in this context.

We experimented with numerous pattern shapes, but all were simply small rectangular regions. For example, one pattern shape we used was the \(3\times 3\) rectangle. We considered every possible \(3\times 3\) rectangle on the board, for a total of 36 different rectangles. However, there is vertical symmetry on the Breakthrough board, e.g., the rectangle in the lower left corner is the same as the reflected rectangle in the lower right corner. This means there are actually only 18 unique rectangles.

Continuing this example, each rectangle contains 9 board squares and each board square can be occupied by White, occupied by Black, or empty. Hence each \(3\times 3\) rectangle defines a total of \(3^9 = 19,683\) distinct patterns. Since there are 18 unique rectangles, there are a total of \(18 \times 19,683=354,294\) different patterns and thus the same number of parameters. The board evaluation is then calculated simply as the sum of the weights of all the patterns that appear on the board. We store the weights for all of these patterns as a file of float variables. In this case all of the weights required approximately 2 MB of memory (see Table 1).
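
As a minimal illustration of the bookkeeping, the sketch below computes a base-3 index for each \(3 \times 3\) rectangle and sums the corresponding weights. The board encoding and the flat weight table are our own illustrative assumptions, and for simplicity every one of the 36 placements gets its own weight block here rather than sharing weights between mirror-image placements as described above.

```python
PATTERNS_PER_RECT = 3 ** 9   # 19,683 patterns per 3x3 rectangle

def pattern_index(board, top, left, rows=3, cols=3):
    """Base-3 index of the rectangle with the given top-left corner.
    Board squares are assumed to be coded 0 = empty, 1 = White, 2 = Black."""
    index = 0
    for r in range(top, top + rows):
        for c in range(left, left + cols):
            index = 3 * index + board[r][c]
    return index

def evaluate(board, weights):
    """Board evaluation: the sum of the weights of every 3x3 pattern on the board.
    `weights` is a flat table with one block of 19,683 entries per placement."""
    total = 0.0
    placements = [(r, c) for r in range(6) for c in range(6)]   # 36 placements on an 8x8 board
    for block, (top, left) in enumerate(placements):
        total += weights[block * PATTERNS_PER_RECT + pattern_index(board, top, left)]
    return total
```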

Besides the weights on the patterns there are three other parameters that control the behavior of TD(\(\lambda \)). TDL updates the parameter weights of patterns seen in previous board states based on the evaluation of the current board state. The trace decay parameter, \(\lambda \), determines how many previous board states get updated. In our experiments we always set \(\lambda =0\), that is, only parameters from the previous board state are updated.

The second parameter, \(\alpha \), sometimes called the learning rate, is a multiplicative constant on the update values and thus regulates how quickly parameters can change. After experimenting with different values and with schemes for varying \(\alpha \) over time, we settled on a constant value of \(\alpha =0.01\) for all of our experiments.

Finally, the standard method of ensuring sufficient exploration in the learning process is to use an \(\varepsilon \)-greedy policy where a random action is taken with probability \(\varepsilon \). We chose \(\varepsilon =0.1\) for all of our tests.
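
A minimal sketch of how these parameters enter one training step under our setting \(\lambda =0\) follows; the helper names (prev_features, value_of_move) are hypothetical, and prev_features is assumed to hold the weight-table indices of the patterns present in the previous position.

```python
import random

ALPHA = 0.01      # learning rate
EPSILON = 0.1     # probability of a random, exploratory move

def td0_update(weights, prev_features, prev_value, current_value):
    """TD(0): nudge the weights of the patterns active in the previous position
    toward the evaluation of the current position.  Since the evaluation is a
    plain sum of active weights, each active weight has gradient 1."""
    delta = current_value - prev_value
    for idx in prev_features:
        weights[idx] += ALPHA * delta

def choose_move(moves, value_of_move):
    """Epsilon-greedy move selection over the learned evaluation."""
    if random.random() < EPSILON:
        return random.choice(moves)
    return max(moves, key=value_of_move)
```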

One of the difficulties with TDL is determining if and when the weights have converged. The larger the pattern size the longer training takes because (1) there are more weights to train and (2) training games run more slowly due to the extra time required to compute the evaluation over so many weights. Empirically, we found that training with one hundred million games worked well, though that by no means guarantees that the weights have converged (or ever will). For the smallest pattern size this training took about an hour while for the largest it took about 2 days. Pattern sizes are discussed in detail in the next section.

2.2 TDL Wanderer

Our goal is to see how TDL performs in an MCTS setting. Wanderer is a very strong program that uses MCTS with EPT. Specifically, Wanderer playouts run for 6 moves and then an evaluation function is called to determine the status of the position. In our experiments we simply replaced the finely tuned, hand-coded evaluation with the one created using TDL. When testing, the first two moves of both programs are random. This, along with the nondeterminism inherent in MCTS, provides sufficient variety in the games.
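
A rough sketch of a playout with early termination, with the learned evaluation substituted for the hand-coded one, is given below; the state interface (is_terminal, result, legal_moves, apply) is hypothetical.

```python
import random

def ept_playout(state, evaluate, depth=6):
    """Play `depth` random moves, then score the position with the evaluation
    function instead of playing the game out to the end."""
    for _ in range(depth):
        if state.is_terminal():
            return state.result()                       # exact value if the game ended early
        state = state.apply(random.choice(state.legal_moves()))
    return evaluate(state)                              # TDL-trained evaluation replaces the hand-coded one
```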

As can be seen in Table 1 we experimented with eight different rectangular pattern configurations of type \(m \times n\), where m is the number of rows and n is the number of columns. If two sizes are mentioned, all rectangles of both sizes were used. For example, \(5 \times 3\) & \(3 \times 5\) means that every rectangle of size \(5 \times 3\) or \(3 \times 5\) that fits on the board was used, with the patterns of both shapes contributing equally to the evaluation. It is worth comparing the number of patterns in this second example with the \(3 \times 3\) case. Recalling the vertical symmetry, there are \(2 \times 6 = 12\) possible placements of \(3 \times 5\) rectangles and \(3 \times 4 = 12\) possible placements of \(5 \times 3\) rectangles. These rectangles contain 15 squares, each with 3 possible values, for a total of \((12 + 12) \times 3^{15} = 344,373,768\) weights compared with 354,294 in the \(3 \times 3\) case, making it nearly 1000 times bigger.

From a human point of view there are three main strategic concepts one must know to play the game well: (1) Pieces are important: do not give them away; (2) Pieces in your home row in columns b, c, f, and g should not move until absolutely necessary because they play an important defensive role; (3) Do not trade pieces on the opponent’s side of the board because that usually means you have given up an important tempo. We were surprised to see that even with the smallest patterns Wan-TDL was able to learn these concepts. We find this quite remarkable considering the abstract and global nature of these rules compared with the small, local nature of the patterns.

Still, \(3 \times 3\) patterns were insufficient to produce a player as strong as Wanderer, so we added more patterns. We did this in the obvious ways: by increasing the size of the rectangles, by allowing rectangles of multiple sizes, and by differentiating patterns according to the phase of the game. For eight variations of these ideas we ran tests against basic Wanderer. In each case we played 1000 games, with each player taking White half the time and both players given 10 secs/move. Table 1 summarizes our results and is explained in more detail below.

Table 1. Basic Wanderer vs. TDL with various patterns at 10 secs/move

We have included a column showing the actual size of the weights file to show that in some cases we are approaching the limits of reasonable file sizes. The win rate column shows the percentage of games won by Wan-TDL and the 95% confidence bounds.

Game progress means that a different collection of weights is used depending on the stage of the game. Stages are defined by how many pieces are on the board, giving a total of seven stages with piece-count cutoffs at 32, \({\ge }30\), \({\ge }28\), \({\ge }24\), \({\ge }20\), \({\ge }16\), and \({<}16\).
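
A small sketch of the stage assignment implied by these cutoffs (the function name is ours):

```python
def stage(piece_count):
    """Map the total number of pieces on the board to one of the seven
    game-progress stages (cutoffs 32, >=30, >=28, >=24, >=20, >=16, <16)."""
    cutoffs = [32, 30, 28, 24, 20, 16]
    for i, cutoff in enumerate(cutoffs):
        if piece_count >= cutoff:
            return i
    return len(cutoffs)      # final stage: fewer than 16 pieces
```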

Though it is difficult, if not impossible, to articulate what additional strategic concepts have been learned when using more patterns, we can see from the table that the program has clearly improved. It seems that patterns of sizes \(4 \times 3\) & \(3 \times 4\) provide the best results, especially relative to their size.

3 Convolutional Neural Networks and Breakthrough

Traditional MCTS induces a policy from a directed search of the state space. While MCTS is a flexible approach that has been shown to converge to an optimal policy in the limit [15], it seldom yields such a policy in scenarios that require real-time decision making.

Conversely, heuristic models such as CNNs have the benefit of producing predictions in far less time. The main drawback of these models is that their predictions may be unreliable, depending heavily on the architecture, how the model is trained, and what data it is trained on.

While these two methods have been used in different domains with varying levels of success, their combination has produced powerful agents that rival expert human players [10] or far more sophisticated programs [16] in the game of Go. However, the neural networks used in those agents were trained on data from expert human players. As Breakthrough is a fairly young game, there is no large body of publicly available human-generated data from which to train a neural network.

Related work by Guo et al. [17] remedied the data dearth problem in the context of Atari games, by training neural networks on data generated by an external agent that utilized MCTS. They found that this data was sufficient to produce neural networks that could successfully play games such as Breakout, Pong, and Q*bert. Based on these findings, we sought to investigate whether this same approach would hold true in the domain of Breakthrough.

3.1 Methods

We began by training two four-layer CNNs to play Breakthrough by classifying which move an expert Wanderer would make in a given position. The player neural network, which predicts the player’s moves, was trained only on moves made by winning players, while the opponent neural network, which predicts the opponent’s moves, was trained on all moves.

Training Data. Training data consisted of Breakthrough games played between two instantiations of Wanderer, allowing the training set to consist of moves from both players. Each search expanded 93,750,000 nodes, taking between 180 and 420 s. This specific number of nodes was chosen because experiments at higher node counts (up to 200,000,000) would typically (\({\sim }90\%\) of the time) converge to an optimal move at approximately 93,000,000 nodes. Limiting Wanderer to 93,750,000 nodes ensured that a dataset could be generated in a reasonable amount of time without sacrificing move quality.

Next, 80 instances of self-play were run in parallel for approximately two months, yielding 3,735,662 moves from 37,688 games. As Breakthrough boards exhibit vertical symmetry, reflected moves were added to the training set, doubling the original number of training examples for a final training set total of 7,471,324 moves (3,735,662 of which were winning moves). Horizontal symmetry was also exploited, as explained further in Sect. 3.1.

Data Pre-processing. For each game, moves were used to generate a sequence of board states, with each board state having an associated transition (move) to the next state. To take advantage of horizontal symmetry, we transformed the board representations so that the player’s pieces started in the first two rows while the opponent’s pieces started in the last two rows, and moves were made relative to the player’s location on the board. We will refer to this representation as the player point-of-view representation. This had the benefit of minimizing the amount of information the neural net had to learn (viz. color differentiation) and of cutting the number of possible classes in half (e.g., a1–a2 and h8–h7 map to the same move, relative to the player).

For neural network input, we transformed each board representation into a one-hot tensor with dimensions 8 \(\times \) 8 \(\times \) 4, whose four planes mark the player’s pieces, the opponent’s pieces, the empty squares, and a bias; the bias plane has an on-value at every position.
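
A minimal sketch of this encoding, assuming the board is already a NumPy array in the player point-of-view representation with 0 = empty, 1 = player, 2 = opponent (the constants are ours):

```python
import numpy as np

def encode_board(board):
    """One-hot 8x8x4 input tensor: planes for player pieces, opponent pieces,
    empty squares, and an all-ones bias plane."""
    x = np.zeros((8, 8, 4), dtype=np.float32)
    x[..., 0] = (board == 1)   # player pieces
    x[..., 1] = (board == 2)   # opponent pieces
    x[..., 2] = (board == 0)   # empty squares
    x[..., 3] = 1.0            # bias plane is on at every position
    return x
```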

For labeling of training examples, each move is mapped to a unique numerical index in the range 0 to 154, and this index serves as the label for the associated board representation. (With the player point-of-view representation there are 154 possible moves, determined by start square and destination square, plus a no-move class for positions where the game is over, giving 155 classes in total.)

Network Architecture. The network’s input layer takes in a \(b\times 8 \times 8 \times 4\) tensor where b is the mini-batch size, or the number of samples to be evaluated in parallel. Each hidden layer has 192 3 \(\times \) 3 filters, with the exception of the final hidden layer which has a single 1 \(\times \) 1 filter, and uses a rectified linear unit (ReLU) activation function [18]. As input is passed to each hidden layer, it is zero padded to the dimensions that ensure the resulting output volume has 8 \(\times \) 8 planes. Hidden layer weights were randomly initialized with 0 mean and standard deviation \(\sqrt{2/n_l}\) as described by He et al. [19, p. 4], where \(n_l\) is the number of incoming connections for that layer. The biases were initialized to 0.01 to ensure all ReLUs fired in the beginning [20]. The output layer is a fully-connected layer which uses a Softmax activation function [21] to produce the predicted probability distribution over the 155 possible moves. The output layer’s weights were initialized by the Xavier initialization algorithm [22] and biases were initialized to 0. The network was implemented in TensorFlow, Google’s machine learning framework, as TensorFlow offers a high level of performance, portability, and an interface that facilitates quick debugging and iterative development [23].
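
The tf.keras sketch below approximates the architecture just described; our actual TensorFlow implementation differs in its plumbing, so treat this as an illustration rather than the network code itself.

```python
import tensorflow as tf

NUM_MOVES = 155   # 154 player point-of-view moves plus a no-move class

def build_policy_net(hidden_layers=4, filters=192):
    """Approximation of the four-layer policy CNN described in the text."""
    he = tf.keras.initializers.HeNormal()               # mean 0, stddev ~ sqrt(2 / n_l)
    small_bias = tf.keras.initializers.Constant(0.01)   # keep ReLUs firing initially
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(8, 8, 4)))
    # All hidden layers but the last: 192 3x3 ReLU filters, zero-padded to keep 8x8 planes.
    for _ in range(hidden_layers - 1):
        model.add(tf.keras.layers.Conv2D(filters, 3, padding='same', activation='relu',
                                         kernel_initializer=he, bias_initializer=small_bias))
    # Final hidden layer: a single 1x1 filter.
    model.add(tf.keras.layers.Conv2D(1, 1, padding='same', activation='relu',
                                     kernel_initializer=he, bias_initializer=small_bias))
    # Fully connected softmax output over the 155 move classes,
    # Xavier-initialized weights and zero biases.
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(NUM_MOVES, activation='softmax',
                                    kernel_initializer='glorot_uniform',
                                    bias_initializer='zeros'))
    return model
```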

Network Training. The network was trained for five epochs via adaptive moment estimation, also known as Adam [24], using a cross-entropy loss with a learning rate of 0.1 and a mini-batch size of 128. Training took approximately one hour on an Nvidia GTX 1080 GPU.
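
Continuing the sketch above, the training step might look as follows; boards (an N \(\times \) 8 \(\times \) 8 \(\times \) 4 array) and labels (N move indices) are hypothetical arrays produced by the pre-processing step.

```python
model = build_policy_net()
# Adam with a cross-entropy loss, learning rate 0.1, mini-batches of 128, five epochs.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(boards, labels, batch_size=128, epochs=5)
```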

3.2 Results

Figure 2 shows the player policy network’s accuracy throughout training (the opponent policy network’s training showed a similar pattern). On a held-out test set, the player policy network achieves \(50.6\%\) accuracy, with the correct move appearing in the network’s top ten predictions \(95.3\%\) of the time. The opponent policy network achieves \(50.9\%\) accuracy, with the correct move appearing in the network’s top ten predictions \(96.0\%\) of the time. The four-layer network with 192 filters per layer represented a good tradeoff between accuracy and complexity: experiments with larger networks (up to 12 hidden layers and 512 filters) did not yield significantly higher accuracy and came at the expense of increased memory demands and computation time.

Fig. 2. Training accuracy over the number of training steps.

Simply using the policy net to suggest moves produces a player of nearly equivalent strength to Wanderer, achieving a \(44.0\% \pm 3.14\) win-rate against Wanderer at 10 seconds/move (approximately 800,000 playouts). This is impressive given that the policy net takes approximately 2 ms to generate a prediction.

3.3 Integration into MCTS

In our CNN/MCTS paradigm, Wan-CNN uses Synchronous Replace by Threshold expansion [16, p. 20] with \(T=0\) and a thread pool (minimal locking) for tree-level parallelism. Our implementation is basically UCT, with some differences in how moves are chosen, the expansion policy, and the rollout policy.

Selection. For move selection we use the PUCT variant described in [10]: at state s we choose the action \(a = {\mathop {\mathrm {argmax}}\limits _{a'}} \left( \frac{w_{a'}}{1+v_{a'}}+c_{ puct } P(a'|s)\frac{\sqrt{\sum _{b}w_b}}{1+v_{a'}}\right) \), where \(w_n\) and \(v_n\) are the total wins and visits passing through node n, \(\sum _{b}w_b\) is the total wins over all sibling nodes of n, and \(c_{ puct }\) is the exploration constant.
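
A sketch of this selection rule over a hypothetical child-node structure (wins, visits, and prior fields) follows; the value of c_puct is arbitrary.

```python
import math

def puct_select(children, c_puct=1.4):
    """Choose the child maximizing w/(1+v) + c_puct * P * sqrt(sum of sibling wins)/(1+v)."""
    total_wins = sum(child.wins for child in children)       # wins over all siblings
    def score(child):
        exploit = child.wins / (1 + child.visits)
        explore = c_puct * child.prior * math.sqrt(total_wins) / (1 + child.visits)
        return exploit + explore
    return max(children, key=score)
```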

Expansion and Pruning Policy. For a given iteration, the tree is searched as in normal MCTS. Once we find a node to expand, we evaluate it with the respective neural network (while this thread waits for results from the GPU, another search thread can continue its execution). Once the neural network predictions are returned, only children that meet specified threshold criteria are retained; all other nodes are taken out of consideration. This corresponds to the top two actions for the player and the top ten for the opponent. The neural net output is then used to set the prior values on a node’s children, with the probability of selecting a move mapped to that child’s initial number of losses. All visits are initialized to 100.
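
A sketch of this expansion step under our reading of the text (top_k is 2 for player nodes and 10 for opponent nodes; the node fields and the way priors seed the win/visit counts are illustrative assumptions):

```python
def expand_with_pruning(node, policy_probs, top_k, prior_visits=100):
    """Keep only the top_k children suggested by the policy network and seed
    their statistics from the predicted move probabilities."""
    ranked = sorted(node.children, key=lambda c: policy_probs[c.move], reverse=True)
    node.children = ranked[:top_k]                             # prune all other children
    for child in node.children:
        child.prior = policy_probs[child.move]                 # prior used by puct_select above
        child.visits = prior_visits                            # all visits initialized to 100
        child.wins = policy_probs[child.move] * prior_visits   # the child's initial "losses"
    return node.children
```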

Rollout Policy. Wan-CNN utilizes quasi-rollouts where, immediately after expansion, the highest-ranking child is marked for expansion. This continues at each subsequent level until a terminal state is reached and the resulting win/loss value is backpropagated up the tree.
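
A sketch of the quasi-rollout, reusing the hypothetical helpers above; policy_for is assumed to pick the player or opponent network as appropriate.

```python
def quasi_rollout(leaf, policy_for):
    """After each expansion, greedily follow the child the policy network ranks
    highest until a terminal state is reached, then return that state's
    win/loss value for backpropagation."""
    node = leaf
    while not node.state.is_terminal():
        probs = policy_for(node)                                  # player or opponent policy net
        top_k = 2 if node.is_player_to_move else 10
        children = expand_with_pruning(node, probs, top_k)
        node = max(children, key=lambda c: c.prior)               # highest-ranking child
    return node.state.result()
```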

3.4 Wan-CNN Performance

In games against Wanderer, Wan-CNN achieved a \(63.4\%\pm 2.99\) win-rate, meaning Wan-CNN’s performance exceeds Wanderer’s despite the comparatively poor execution speed of Python. This result was directly attributable to our move selection and expansion policies: without these modifications, Wan-CNN yielded a disappointing \(21.6\%\pm 4.31\) win-rate.

3.5 Limitations

As the policy network was trained on games played by Wanderer, it may be overfitting to Wanderer’s policy. In other words, its win-rate may be due to the fact that it was trained to emulate Wanderer’s playstyle. In the future, we plan to play it against a broader array of expert players to validate our findings.

Additionally, as the maximum search time of our tests was 10 s, our results may not hold at higher search times.

4 Conclusions and Future Work

Using an unsupervised learning paradigm (TDL) and starting with no domain knowledge, we were surprised that we could create a Breakthrough evaluation function on par with the mature, hand-tuned one in Wanderer, giving a \(46.6\%\) win rate when played against normal Wanderer.

Using a supervised learning paradigm (CNN), we trained four-layer CNNs to classify Breakthrough moves made by Wanderer. When these networks were used to directly dictate a policy, they rivaled Wanderer at search times of up to 10 seconds/move. This is impressive given the CNNs’ inference speed (2 ms). Integrating the neural network into a simple MCTS (no domain knowledge) yielded a program that exceeded the performance of normal Wanderer, achieving a \(63.4\%\pm 2.99\) win-rate.

Going forward, to improve Wan-TDL we plan to refine our game-progress idea, which selects patterns based on the state of the game. Other global properties, such as the left/right balance of pieces between the two sides or piece counts on black vs. white squares, could provide further benefit, keeping in mind the tradeoff between the number of patterns and the memory and training-time constraints.

We also plan to generate a neural network to act as a value function approximation and to compare it against both our simplified evaluation function and Wanderer’s full evaluation function.