Mastering the game of Go with deep neural networks and tree search

Silver, David; Huang, Aja; Maddison, Chris J.; Guez, Arthur; Sifre, Laurent; van den Driessche, George; Schrittwieser, Julian; Antonoglou, Ioannis; Panneershelvam, Veda; Lanctot, Marc; Dieleman, Sander; Grewe, Dominik; Nham, John; Kalchbrenner, Nal; Sutskever, Ilya; Lillicrap, Timothy; Leach, Madeleine; Kavukcuoglu, Koray; Graepel, Thore; Hassabis, Demis

doi:10.1038/nature16961

Article
Published: 27 January 2016

Volume 529, pages 484–489 (2016)

Abstract

The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. Here we introduce a new approach to computer Go that uses ‘value networks’ to evaluate board positions and ‘policy networks’ to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games, and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.

Main

All games of perfect information have an optimal value function, v^*(s), which determines the outcome of the game, from every board position or state s, under perfect play by all players. These games may be solved by recursively computing the optimal value function in a search tree containing approximately b^d possible sequences of moves, where b is the game’s breadth (number of legal moves per position) and d is its depth (game length). In large games, such as chess (b ≈ 35, d ≈ 80)¹ and especially Go (b ≈ 250, d ≈ 150)¹, exhaustive search is infeasible^2,3, but the effective search space can be reduced by two general principles. First, the depth of the search may be reduced by position evaluation: truncating the search tree at state s and replacing the subtree below s by an approximate value function v(s) ≈ v^*(s) that predicts the outcome from state s. This approach has led to superhuman performance in chess⁴, checkers⁵ and othello⁶, but it was believed to be intractable in Go due to the complexity of the game⁷. Second, the breadth of the search may be reduced by sampling actions from a policy p(a|s) that is a probability distribution over possible moves a in position s. For example, Monte Carlo rollouts⁸ search to maximum depth without branching at all, by sampling long sequences of actions for both players from a policy p. Averaging over such rollouts can provide an effective position evaluation, achieving superhuman performance in backgammon⁸ and Scrabble⁹, and weak amateur level play in Go¹⁰.
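To give a sense of scale for b^d, the following sketch (not part of the original article) converts the breadth and depth figures quoted above into orders of magnitude. The function name is illustrative, and the inputs are the approximate values from the text.

```python
import math


def log10_tree_size(breadth: int, depth: int) -> float:
    """Return log10(breadth ** depth), the order of magnitude of the
    number of move sequences in a full search tree."""
    return depth * math.log10(breadth)


# Approximate figures quoted in the text.
chess = log10_tree_size(35, 80)
go = log10_tree_size(250, 150)

print(f"chess: roughly 10^{chess:.1f} move sequences")
print(f"Go:    roughly 10^{go:.1f} move sequences")
```

Working in logarithms avoids building the huge integers themselves and makes the gap between the two games easy to read off.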

Monte Carlo tree search (MCTS)^11,12 uses Monte Carlo rollouts to estimate the value of each state in a search tree. As more simulations are executed, the search tree grows larger and the relevant values become more accurate. The policy used to select actions during search is also improved over time, by selecting children with higher values. Asymptotically, this policy converges to optimal play, and the evaluations converge to the optimal value function¹². The strongest current Go programs are based on MCTS, enhanced by policies that are trained to predict human expert moves¹³. These policies are used to narrow the search to a beam of high-probability actions, and to sample actions during rollouts. This approach has achieved strong amateur play^13,14,15. However, prior work has been limited to shallow policies^13,14,15 or value functions¹⁶ based on a linear combination of input features.

Recently, deep convolutional neural networks have achieved unprecedented performance in visual domains: for example, image classification¹⁷, face recognition¹⁸, and playing Atari games¹⁹. They use many layers of neurons, each arranged in overlapping tiles, to construct increasingly abstract, localized representations of an image²⁰. We employ a similar architecture for the game of Go. We pass in the board position as a 19 × 19 image and use convolutional layers to construct a representation of the position. We use these neural networks to reduce the effective depth and breadth of the search tree: evaluating positions using a value network, and sampling actions using a policy network.

We train the neural networks using a pipeline consisting of several stages of machine learning (Fig. 1). We begin by training a supervised learning (SL) policy network p_σ directly from expert human moves. This provides fast, efficient learning updates with immediate feedback and high-quality gradients. Similar to prior work^13,15, we also train a fast policy p_π that can rapidly sample actions during rollouts. Next, we train a reinforcement learning (RL) policy network p_ρ that improves the SL policy network by optimizing the final outcome of games of self-play. This adjusts the policy towards the correct goal of winning games, rather than maximizing predictive accuracy. Finally, we train a value network v_θ that predicts the winner of games played by the RL policy network against itself. Our program AlphaGo efficiently combines the policy and value networks with MCTS.

**Figure 1: Neural network training pipeline and architecture.**

Supervised learning of policy networks

For the first stage of the training pipeline, we build on prior work on predicting expert moves in the game of Go using supervised learning^13,21,22,23,24. The SL policy network p_σ(a|s) alternates between convolutional layers with weights σ, and rectifier nonlinearities. A final softmax layer outputs a probability distribution over all legal moves a. The input s to the policy network is a simple representation of the board state (see Extended Data Table 2). The policy network is trained on randomly sampled state-action pairs (s, a), using stochastic gradient ascent to maximize the likelihood of the human move a selected in state s:

$$\Delta\sigma \propto \frac{\partial \log p_\sigma(a \mid s)}{\partial \sigma}$$

We trained a 13-layer policy network, which we call the SL policy network, from 30 million positions from the KGS Go Server. The network predicted expert moves on a held out test set with an accuracy of 57.0% using all input features, and 55.7% using only raw board position and move history as inputs, compared to the state-of-the-art from other research groups of 44.4% at date of submission²⁴ (full results in Extended Data Table 3). Small improvements in accuracy led to large improvements in playing strength (Fig. 2a); larger networks achieve better accuracy but are slower to evaluate during search. We also trained a faster but less accurate rollout policy p_π(a|s), using a linear softmax of small pattern features (see Extended Data Table 4) with weights π; this achieved an accuracy of 24.2%, using just 2 μs to select an action, rather than 3 ms for the policy network.

**Figure 2: Strength and accuracy of policy and value networks.**

Reinforcement learning of policy networks

The second stage of the training pipeline aims at improving the policy network by policy gradient reinforcement learning (RL)^25,26. The RL policy network p_ρ is identical in structure to the SL policy network, and its weights ρ are initialized to the same values, ρ = σ. We play games between the current policy network p_ρ and a randomly selected previous iteration of the policy network. Randomizing from a pool of opponents in this way stabilizes training by preventing overfitting to the current policy. We use a reward function r(s) that is zero for all non-terminal time steps t < T. The outcome z_t = ± r(s_T) is the terminal reward at the end of the game from the perspective of the current player at time step t: +1 for winning and −1 for losing. Weights are then updated at each time step t by stochastic gradient ascent in the direction that maximizes expected outcome²⁵:

$$\Delta\rho \propto \frac{\partial \log p_\rho(a_t \mid s_t)}{\partial \rho} z_t$$

We evaluated the performance of the RL policy network in game play, sampling each move from its output probability distribution over actions. When played head-to-head, the RL policy network won more than 80% of games against the SL policy network. We also tested against the strongest open-source Go program, Pachi¹⁴, a sophisticated Monte Carlo search program, ranked at 2 amateur dan on KGS, that executes 100,000 simulations per move. Using no search at all, the RL policy network won 85% of games against Pachi. In comparison, the previous state-of-the-art, based only on supervised learning of convolutional networks, won 11% of games against Pachi²³ and 12% against a slightly weaker program, Fuego²⁴.
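The policy-gradient update described in this section can be illustrated with a minimal sketch. It assumes a bare softmax over three move logits in place of the 13-layer convolutional network, and the function names and step size are hypothetical. With z fixed to +1 for every sample, the same step is a likelihood-ascent update of the kind used for supervised training.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(logits, action, z, step_size=0.1):
    """One ascent step on z * log p(action).

    For a softmax, d log p(action) / d logit_k = 1[k == action] - p(k),
    so a win (z = +1) raises the played move's logit and a loss
    (z = -1) lowers it.
    """
    probs = softmax(logits)
    return [
        logit + step_size * z * ((1.0 if k == action else 0.0) - probs[k])
        for k, logit in enumerate(logits)
    ]


logits = [0.0, 0.0, 0.0]          # uniform policy over three moves
after_win = policy_gradient_step(logits, action=1, z=+1)
after_loss = policy_gradient_step(logits, action=1, z=-1)

print(softmax(after_win)[1], softmax(after_loss)[1])
```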

Reinforcement learning of value networks

The final stage of the training pipeline focuses on position evaluation, estimating a value function v^p(s) that predicts the outcome from position s of games played by using policy p for both players^28,29,30:

$$v^p(s) = \mathbb{E}\left[ z_t \mid s_t = s, a_{t \ldots T} \sim p \right]$$

Ideally, we would like to know the optimal value function under perfect play v^*(s); in practice, we instead estimate the value function for our strongest policy, using the RL policy network p_ρ. We approximate the value function using a value network v_θ(s) with weights θ, v_θ(s) ≈ v^{p_ρ}(s) ≈ v^*(s). This neural network has a similar architecture to the policy network, but outputs a single prediction instead of a probability distribution. We train the weights of the value network by regression on state-outcome pairs (s, z), using stochastic gradient descent to minimize the mean squared error (MSE) between the predicted value v_θ(s), and the corresponding outcome z:

$$\Delta\theta \propto \frac{\partial v_\theta(s)}{\partial \theta} \left( z - v_\theta(s) \right)$$

The naive approach of predicting game outcomes from data consisting of complete games leads to overfitting. The problem is that successive positions are strongly correlated, differing by just one stone, but the regression target is shared for the entire game. When trained on the KGS data set in this way, the value network memorized the game outcomes rather than generalizing to new positions, achieving a minimum MSE of 0.37 on the test set, compared to 0.19 on the training set. To mitigate this problem, we generated a new self-play data set consisting of 30 million distinct positions, each sampled from a separate game. Each game was played between the RL policy network and itself until the game terminated. Training on this data set led to MSEs of 0.226 and 0.234 on the training and test set respectively, indicating minimal overfitting. Figure 2b shows the position evaluation accuracy of the value network, compared to Monte Carlo rollouts using the fast rollout policy p_π; the value function was consistently more accurate. A single evaluation of v_θ(s) also approached the accuracy of Monte Carlo rollouts using the RL policy network p_ρ, but using 15,000 times less computation.
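The regression used to train the value network can be sketched on a toy model. The example below assumes a single tanh unit over two hand-picked features instead of a deep convolutional network; the feature vector, step size and function names are hypothetical. Each step moves the weights along (z − v) ∂v/∂θ, which descends the squared error.

```python
import math


def value(weights, features):
    """Toy value function: tanh of a linear score, so outputs lie in (-1, 1)."""
    return math.tanh(sum(w * x for w, x in zip(weights, features)))


def sgd_step(weights, features, z, step_size=0.1):
    """One gradient step on the squared error (z - v)^2 / 2.

    For v = tanh(w . x), dv/dw = (1 - v^2) * x.
    """
    v = value(weights, features)
    scale = step_size * (z - v) * (1.0 - v * v)
    return [w + scale * x for w, x in zip(weights, features)]


features = [1.0, -1.0]            # hypothetical encoding of one position
weights = [0.0, 0.0]
initial_error = (1.0 - value(weights, features)) ** 2

for _ in range(200):              # repeatedly fit a single won game, z = +1
    weights = sgd_step(weights, features, z=1.0)

final_error = (1.0 - value(weights, features)) ** 2
print(initial_error, final_error)
```

Fitting one position repeatedly, as done here, is exactly the kind of memorization the text warns about; the sketch only shows the direction of the update.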

Searching with policy and value networks

AlphaGo combines the policy and value networks in an MCTS algorithm (Fig. 3) that selects actions by lookahead search. Each edge (s, a) of the search tree stores an action value Q(s, a), visit count N(s, a), and prior probability P(s, a). The tree is traversed by simulation (that is, descending the tree in complete games without backup), starting from the root state. At each time step t of each simulation, an action a_t is selected from state s_t

$$a_t = \operatorname*{argmax}_a \left( Q(s_t, a) + u(s_t, a) \right)$$

so as to maximize action value plus a bonus

$$u(s, a) \propto \frac{P(s, a)}{1 + N(s, a)}$$

that is proportional to the prior probability but decays with repeated visits to encourage exploration. When the traversal reaches a leaf node s_L at step L, the leaf node may be expanded. The leaf position s_L is processed just once by the SL policy network p_σ. The output probabilities are stored as prior probabilities P for each legal action a, P(s, a) = p_σ(a|s). The leaf node is evaluated in two very different ways: first, by the value network v_θ(s_L); and second, by the outcome z_L of a random rollout played out until terminal step T using the fast rollout policy p_π; these evaluations are combined, using a mixing parameter λ, into a leaf evaluation V(s_L):

$$V(s_L) = (1 - \lambda) v_\theta(s_L) + \lambda z_L$$

**Figure 3: Monte Carlo tree search in AlphaGo.**

At the end of simulation, the action values and visit counts of all traversed edges are updated. Each edge accumulates the visit count and mean evaluation of all simulations passing through that edge

$$N(s, a) = \sum_{i=1}^{n} 1(s, a, i)$$

$$Q(s, a) = \frac{1}{N(s, a)} \sum_{i=1}^{n} 1(s, a, i) V(s_L^i)$$

where s_L^i is the leaf node from the ith simulation, and 1(s, a, i) indicates whether an edge (s, a) was traversed during the ith simulation. Once the search is complete, the algorithm chooses the most visited move from the root position.
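The bookkeeping described above can be sketched as follows, assuming the leaf evaluation is the mix (1 − λ) v_θ(s_L) + λ z_L and that each edge keeps a running total from which the mean is derived. The class and function names are hypothetical, and the handling of alternating player perspectives is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class Edge:
    prior: float          # P(s, a)
    visits: int = 0       # N(s, a)
    total: float = 0.0    # sum of leaf evaluations seen through this edge

    @property
    def q(self) -> float:
        """Mean evaluation Q(s, a); 0 for an unvisited edge (assumption)."""
        return self.total / self.visits if self.visits else 0.0


def leaf_evaluation(value_net_output, rollout_outcome, lam=0.5):
    """Mix the value network output with the rollout outcome."""
    return (1.0 - lam) * value_net_output + lam * rollout_outcome


def backup(path, leaf_value):
    """Update every edge traversed by one simulation."""
    for edge in path:
        edge.visits += 1
        edge.total += leaf_value


def most_visited(edges):
    """Final move choice: the root move with the largest visit count."""
    return max(edges, key=lambda move: edges[move].visits)


root = {"A": Edge(prior=0.6), "B": Edge(prior=0.4)}
backup([root["A"]], leaf_evaluation(0.2, 1.0))
backup([root["A"]], leaf_evaluation(-0.4, -1.0))
backup([root["B"]], leaf_evaluation(0.5, 1.0))

print(most_visited(root), root["A"].q, root["B"].q)
```

In this toy run move A is chosen even though B has the higher mean, because the final choice depends on visit counts rather than on Q.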

It is worth noting that the SL policy network p_σ performed better in AlphaGo than the stronger RL policy network p_ρ, presumably because humans select a diverse beam of promising moves, whereas RL optimizes for the single best move. However, the value function derived from the stronger RL policy network performed better in AlphaGo than a value function derived from the SL policy network.

Evaluating policy and value networks requires several orders of magnitude more computation than traditional search heuristics. To efficiently combine MCTS with deep neural networks, AlphaGo uses an asynchronous multi-threaded search that executes simulations on CPUs, and computes policy and value networks in parallel on GPUs. The final version of AlphaGo used 40 search threads, 48 CPUs, and 8 GPUs. We also implemented a distributed version of AlphaGo that exploited multiple machines, 40 search threads, 1,202 CPUs and 176 GPUs. The Methods section provides full details of asynchronous and distributed MCTS.

Evaluating the playing strength of AlphaGo

To evaluate AlphaGo, we ran an internal tournament among variants of AlphaGo and several other Go programs, including the strongest commercial programs Crazy Stone¹³ and Zen, and the strongest open source programs Pachi¹⁴ and Fuego¹⁵. All of these programs are based on high-performance MCTS algorithms. In addition, we included the open source program GnuGo, a Go program using state-of-the-art search methods that preceded MCTS. All programs were allowed 5 s of computation time per move.

The results of the tournament (see Fig. 4a) suggest that single-machine AlphaGo is many dan ranks stronger than any previous Go program, winning 494 out of 495 games (99.8%) against other Go programs. To provide a greater challenge to AlphaGo, we also played games with four handicap stones (that is, free moves for the opponent); AlphaGo won 77%, 86%, and 99% of handicap games against Crazy Stone, Zen and Pachi, respectively. The distributed version of AlphaGo was significantly stronger, winning 77% of games against single-machine AlphaGo and 100% of its games against other programs.

**Figure 4: Tournament evaluation of AlphaGo.**

We also assessed variants of AlphaGo that evaluated positions using just the value network (λ = 0) or just rollouts (λ = 1) (see Fig. 4b). Even without rollouts AlphaGo exceeded the performance of all other Go programs, demonstrating that value networks provide a viable alternative to Monte Carlo evaluation in Go. However, the mixed evaluation (λ = 0.5) performed best, winning ≥95% of games against other variants. This suggests that the two position-evaluation mechanisms are complementary: the value network approximates the outcome of games played by the strong but impractically slow p_ρ, while the rollouts can precisely score and evaluate the outcome of games played by the weaker but faster rollout policy p_π. Figure 5 visualizes the evaluation of a real game position by AlphaGo.

**Figure 5: How AlphaGo (black, to play) selected its move in an informal game against Fan Hui.**

Finally, we evaluated the distributed version of AlphaGo against Fan Hui, a professional 2 dan, and the winner of the 2013, 2014 and 2015 European Go championships. Over 5–9 October 2015 AlphaGo and Fan Hui competed in a formal five-game match. AlphaGo won the match 5 games to 0 (Fig. 6 and Extended Data Table 1). This is the first time that a computer Go program has defeated a human professional player, without handicap, in the full game of Go—a feat that was previously believed to be at least a decade away^3,7,31.

**Figure 6: Games from the match between AlphaGo and the European champion, Fan Hui.**

Discussion

In this work we have developed a Go program, based on a combination of deep neural networks and tree search, that plays at the level of the strongest human players, thereby achieving one of artificial intelligence’s “grand challenges”^31,32,33. We have developed, for the first time, effective move selection and position evaluation functions for Go, based on deep neural networks that are trained by a novel combination of supervised and reinforcement learning. We have introduced a new search algorithm that successfully combines neural network evaluations with Monte Carlo rollouts. Our program AlphaGo integrates these components together, at scale, in a high-performance tree search engine.

During the match against Fan Hui, AlphaGo evaluated thousands of times fewer positions than Deep Blue did in its chess match against Kasparov⁴; compensating by selecting those positions more intelligently, using the policy network, and evaluating them more precisely, using the value network—an approach that is perhaps closer to how humans play. Furthermore, while Deep Blue relied on a handcrafted evaluation function, the neural networks of AlphaGo are trained directly from gameplay purely through general-purpose supervised and reinforcement learning methods.

Go is exemplary in many ways of the difficulties faced by artificial intelligence^33,34: a challenging decision-making task, an intractable search space, and an optimal solution so complex it appears infeasible to directly approximate using a policy or value function. The previous major breakthrough in computer Go, the introduction of MCTS, led to corresponding advances in many other domains; for example, general game-playing, classical planning, partially observed planning, scheduling, and constraint satisfaction^35,36. By combining tree search with policy and value networks, AlphaGo has finally reached a professional level in Go, providing hope that human-level performance can now be achieved in other seemingly intractable artificial intelligence domains.

Methods

Problem setting

Many games of perfect information, such as chess, checkers, othello, backgammon and Go, may be defined as alternating Markov games³⁹. In these games, there is a state space 𝒮 (where state includes an indication of the current player to play); an action space 𝒜(s) defining the legal actions in any given state s ∈ 𝒮; a state transition function f(s, a, ξ) defining the successor state after selecting action a in state s and random input ξ (for example, dice); and finally a reward function rⁱ(s) describing the reward received by player i in state s. We restrict our attention to two-player zero-sum games, r¹(s) = −r²(s) = r(s), with deterministic state transitions, f(s, a, ξ) = f(s, a), and zero rewards except at a terminal time step T. The outcome of the game z_t = ±r(s_T) is the terminal reward at the end of the game from the perspective of the current player at time step t. A policy p(a|s) is a probability distribution over legal actions a ∈ 𝒜(s). A value function is the expected outcome if all actions for both players are selected according to policy p, that is, v^p(s) = 𝔼[z_t | s_t = s, a_{t…T} ∼ p]. Zero-sum games have a unique optimal value function v^*(s) that determines the outcome from state s following perfect play by both players,

$$v^*(s) = \begin{cases} z_T & \text{if } s = s_T, \\ \max_a -v^*(f(s, a)) & \text{otherwise.} \end{cases}$$

Prior work

The optimal value function can be computed recursively by minimax (or equivalently negamax) search⁴⁰. Most games are too large for exhaustive minimax tree search; instead, the game is truncated by using an approximate value function v(s) ≈ v^*(s) in place of terminal rewards. Depth-first minimax search with alpha–beta pruning⁴⁰ has achieved superhuman performance in chess⁴, checkers⁵ and othello⁶, but it has not been effective in Go⁷.
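As a small illustration of the negamax recursion, the sketch below computes the optimal value of a toy subtraction game rather than Go: players alternately remove one or two stones, and whoever takes the last stone wins. The game and the function name are chosen purely for illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def optimal_value(stones: int) -> int:
    """Optimal value from the perspective of the player to move.

    Terminal state: no stones remain, so the previous player took the
    last stone and the player to move has lost (-1). Otherwise the value
    is the best negated value over all successor states.
    """
    if stones == 0:
        return -1
    return max(-optimal_value(stones - take) for take in (1, 2) if take <= stones)


values = [optimal_value(n) for n in range(7)]
print(values)
```

Exhaustive recursion is feasible here only because the game has a handful of states; the text explains why the same approach breaks down for large games.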

Reinforcement learning can learn to approximate the optimal value function directly from games of self-play³⁹. The majority of prior work has focused on a linear combination v_θ(s) = φ(s) · θ of features φ(s) with weights θ. Weights were trained using temporal-difference learning⁴¹ in chess^42,43, checkers^44,45 and Go³⁰; or using linear regression in othello⁶ and Scrabble⁹. Temporal-difference learning has also been used to train a neural network to approximate the optimal value function, achieving superhuman performance in backgammon⁴⁶; and achieving weak kyu-level performance in small-board Go^28,29,47 using convolutional networks.

An alternative approach to minimax search is Monte Carlo tree search (MCTS)^11,12, which estimates the optimal value of interior nodes by a double approximation, Vⁿ(s) ≈ v^{Pⁿ}(s) ≈ v^*(s). The first approximation, Vⁿ(s) ≈ v^{Pⁿ}(s), uses n Monte Carlo simulations to estimate the value function of a simulation policy Pⁿ. The second approximation, v^{Pⁿ}(s) ≈ v^*(s), uses a simulation policy Pⁿ in place of minimax optimal actions. The simulation policy selects actions according to a search control function argmax_a (Qⁿ(s, a) + u(s, a)), such as UCT¹², that selects children with higher action values, Qⁿ(s, a) = −Vⁿ(f(s, a)), plus a bonus u(s, a) that encourages exploration; or in the absence of a search tree at state s, it samples actions from a fast rollout policy p_π(a|s). As more simulations are executed and the search tree grows deeper, the simulation policy becomes informed by increasingly accurate statistics. In the limit, both approximations become exact and MCTS (for example, with UCT) converges¹² to the optimal value function lim_{n→∞} Vⁿ(s) = lim_{n→∞} v^{Pⁿ}(s) = v^*(s). The strongest current Go programs are based on MCTS^13,14,15,36.

MCTS has previously been combined with a policy that is used to narrow the beam of the search tree to high-probability moves¹³; or to bias the bonus term towards high-probability moves⁴⁸. MCTS has also been combined with a value function that is used to initialize action values in newly expanded nodes¹⁶, or to mix Monte Carlo evaluation with minimax evaluation⁴⁹. By contrast, AlphaGo’s use of value functions is based on truncated Monte Carlo search algorithms^8,9, which terminate rollouts before the end of the game and use a value function in place of the terminal reward. AlphaGo’s position evaluation mixes full rollouts with truncated rollouts, resembling in some respects the well-known temporal-difference learning algorithm TD(λ). AlphaGo also differs from prior work by using slower but more powerful representations of the policy and value function; evaluating deep neural networks is several orders of magnitude slower than linear representations and must therefore occur asynchronously.

The performance of MCTS is to a large degree determined by the quality of the rollout policy. Prior work has focused on handcrafted patterns⁵⁰ or learning rollout policies by supervised learning¹³, reinforcement learning¹⁶, simulation balancing^51,52 or online adaptation^30,53; however, it is known that rollout-based position evaluation is frequently inaccurate⁵⁴. AlphaGo uses relatively simple rollouts, and instead addresses the challenging problem of position evaluation more directly using value networks.

Search algorithm

To efficiently integrate large neural networks into AlphaGo, we implemented an asynchronous policy and value MCTS algorithm (APV-MCTS). Each node s in the search tree contains edges (s, a) for all legal actions a ∈ 𝒜(s). Each edge stores a set of statistics,

$$\{ P(s, a), N_v(s, a), N_r(s, a), W_v(s, a), W_r(s, a), Q(s, a) \}$$

where P(s, a) is the prior probability, W_v(s, a) and W_r(s, a) are Monte Carlo estimates of total action value, accumulated over N_v(s, a) and N_r(s, a) leaf evaluations and rollout rewards, respectively, and Q(s, a) is the combined mean action value for that edge. Multiple simulations are executed in parallel on separate search threads. The APV-MCTS algorithm proceeds in the four stages outlined in Fig. 3.

Selection (Fig. 3a). The first in-tree phase of each simulation begins at the root of the search tree and finishes when the simulation reaches a leaf node at time step L. At each of these time steps, t < L, an action is selected according to the statistics in the search tree, a_t = argmax_a (Q(s_t, a) + u(s_t, a)), using a variant of the PUCT algorithm⁴⁸,

$$u(s, a) = c_{\mathrm{puct}} P(s, a) \frac{\sqrt{\sum_b N_r(s, b)}}{1 + N_r(s, a)}$$

where c_puct is a constant determining the level of exploration; this search control strategy initially prefers actions with high prior probability and low visit count, but asymptotically prefers actions with high action value.

Evaluation (Fig. 3c). The leaf position s_L is added to a queue for evaluation v_θ(s_L) by the value network, unless it has previously been evaluated. The second rollout phase of each simulation begins at leaf node s_L and continues until the end of the game. At each of these time steps, t ≥ L, actions are selected by both players according to the rollout policy, a_t ∼ p_π(·|s_t). When the game reaches a terminal state, the outcome z_t = ±r(s_T) is computed from the final score.

Backup (Fig. 3d). At each in-tree step t ≤ L of the simulation, the rollout statistics are updated as if it has lost n_vl games, N_r(s_t, a_t) ← N_r(s_t, a_t) + n_vl; W_r(s_t, a_t) ← W_r(s_t, a_t) − n_vl; this virtual loss⁵⁵ discourages other threads from simultaneously exploring the identical variation. At the end of the simulation, the rollout statistics are updated in a backward pass through each step t ≤ L, replacing the virtual losses by the outcome, N_r(s_t, a_t) ← N_r(s_t, a_t) − n_vl + 1; W_r(s_t, a_t) ← W_r(s_t, a_t) + n_vl + z_t. Asynchronously, a separate backward pass is initiated when the evaluation of the leaf position s_L completes. The output of the value network v_θ(s_L) is used to update value statistics in a second backward pass through each step t ≤ L, N_v(s_t, a_t) ← N_v(s_t, a_t) + 1, W_v(s_t, a_t) ← W_v(s_t, a_t) + v_θ(s_L). The overall evaluation of each state action is a weighted average of the Monte Carlo estimates,

$$Q(s, a) = (1 - \lambda) \frac{W_v(s, a)}{N_v(s, a)} + \lambda \frac{W_r(s, a)}{N_r(s, a)}$$

that mixes together the value network and rollout evaluations with weighting parameter λ. All updates are performed lock-free⁵⁶.

Expansion (Fig. 3b). When the visit count exceeds a threshold, N_r(s, a) > n_thr, the successor state s′ = f(s, a) is added to the search tree. The new node is initialized to {N_v(s′, a) = N_r(s′, a) = 0, W_v(s′, a) = W_r(s′, a) = 0, P(s′, a) = p_σ(a|s′)}, using a tree policy p_τ(a|s′) (similar to the rollout policy but with more features, see Extended Data Table 4) to provide placeholder prior probabilities for action selection. The position s′ is also inserted into a queue for asynchronous GPU evaluation by the policy network. Prior probabilities are computed by the SL policy network with a softmax temperature set to β; these replace the placeholder prior probabilities, P(s′, a) ← p_σ^β(a|s′), using an atomic update. The threshold n_thr is adjusted dynamically to ensure that the rate at which positions are added to the policy queue matches the rate at which the GPUs evaluate the policy network. Positions are evaluated by both the policy network and the value network using a mini-batch size of 1 to minimize end-to-end evaluation time.
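The per-edge statistics used in the four stages can be sketched in a single-threaded form. The sketch assumes a PUCT-style bonus c_puct · P · sqrt(Σ_b N_r(s, b)) / (1 + N_r(s, a)) and a mixed value Q = (1 − λ) W_v/N_v + λ W_r/N_r; the values chosen for N_VL and C_PUCT, the zero default for an empty average, and all names are illustrative assumptions. It does not model threads, queues or lock-free updates.

```python
import math
from dataclasses import dataclass

N_VL = 3       # virtual-loss count (illustrative value)
C_PUCT = 5.0   # exploration constant (illustrative value)


@dataclass
class EdgeStats:
    prior: float        # P(s, a)
    n_v: int = 0        # N_v: value-network evaluations
    n_r: int = 0        # N_r: rollouts, including pending virtual losses
    w_v: float = 0.0    # W_v: summed value-network outputs
    w_r: float = 0.0    # W_r: summed rollout outcomes

    def q(self, lam: float = 0.5) -> float:
        """Mixed mean action value; an empty average counts as 0 (assumption)."""
        mean_v = self.w_v / self.n_v if self.n_v else 0.0
        mean_r = self.w_r / self.n_r if self.n_r else 0.0
        return (1.0 - lam) * mean_v + lam * mean_r


def select(edges):
    """Pick the move maximizing Q plus the prior-weighted exploration bonus."""
    sqrt_total = math.sqrt(sum(e.n_r for e in edges.values()))

    def score(move):
        e = edges[move]
        return e.q() + C_PUCT * e.prior * sqrt_total / (1 + e.n_r)

    return max(edges, key=score)


def add_virtual_loss(edge):
    """Pretend the edge has just lost N_VL rollouts while a simulation is in flight."""
    edge.n_r += N_VL
    edge.w_r -= N_VL


def backup_rollout(edge, z):
    """Replace the virtual losses by the real rollout outcome z."""
    edge.n_r += 1 - N_VL
    edge.w_r += N_VL + z


def backup_value(edge, v):
    """Record one value-network evaluation v."""
    edge.n_v += 1
    edge.w_v += v


edges = {"A": EdgeStats(prior=0.8), "B": EdgeStats(prior=0.2)}
add_virtual_loss(edges["A"])        # a first simulation is exploring A
second_choice = select(edges)       # a second simulation is steered elsewhere
backup_rollout(edges["A"], z=1.0)   # the first simulation's rollout was a win
backup_value(edges["A"], v=0.2)
print(second_choice, edges["A"].q())
```

While the virtual loss is pending, move A looks like three straight losses, so the second selection goes to B despite its lower prior.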

We also implemented a distributed APV-MCTS algorithm. This architecture consists of a single master machine that executes the main search, many remote worker CPUs that execute asynchronous rollouts, and many remote worker GPUs that execute asynchronous policy and value network evaluations. The entire search tree is stored on the master, which only executes the in-tree phase of each simulation. The leaf positions are communicated to the worker CPUs, which execute the rollout phase of simulation, and to the worker GPUs, which compute network features and evaluate the policy and value networks. The prior probabilities of the policy network are returned to the master, where they replace placeholder prior probabilities at the newly expanded node. The rewards from rollouts and the value network outputs are each returned to the master, and backed up the originating search path.

At the end of search AlphaGo selects the action with maximum visit count; this is less sensitive to outliers than maximizing action value¹⁵. The search tree is reused at subsequent time steps: the child node corresponding to the played action becomes the new root node; the subtree below this child is retained along with all its statistics, while the remainder of the tree is discarded. The match version of AlphaGo continues searching during the opponent’s move. It extends the search if the action maximizing visit count and the action maximizing action value disagree. Time controls were otherwise shaped to use most time in the middle-game⁵⁷. AlphaGo resigns when its overall evaluation drops below an estimated 10% probability of winning the game, that is, max_a Q(s, a) < −0.8.
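The link between a win probability and an action value follows from outcomes of +1 and −1: an expected outcome of p · (+1) + (1 − p) · (−1) equals 2p − 1, so a 10% chance of winning corresponds to a value of −0.8. A minimal sketch of this arithmetic and of a resignation check, with hypothetical function names:

```python
def win_probability_to_value(p: float) -> float:
    """Expected outcome when a win scores +1 and a loss scores -1."""
    return 2.0 * p - 1.0


def should_resign(action_values, min_win_probability=0.10) -> bool:
    """Resign only if even the best root action falls below the threshold."""
    return max(action_values) < win_probability_to_value(min_win_probability)


threshold = win_probability_to_value(0.10)
print(threshold)
print(should_resign([-0.90, -0.85]), should_resign([-0.90, -0.50]))
```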

AlphaGo does not employ the all-moves-as-first¹⁰ or rapid action value estimation⁵⁸ heuristics used in the majority of Monte Carlo Go programs; when using policy networks as prior knowledge, these biased heuristics do not appear to give any additional benefit. In addition AlphaGo does not use progressive widening¹³, dynamic komi⁵⁹ or an opening book⁶⁰. The parameters used by AlphaGo in the Fan Hui match are listed in Extended Data Table 5.

Rollout policy

The rollout policy is a linear softmax policy based on fast, incrementally computed, local pattern-based features consisting of both ‘response’ patterns around the previous move that led to state s, and ‘non-response’ patterns around the candidate move a in state s. Each non-response pattern is a binary feature matching a specific 3 × 3 pattern centred on a, defined by the colour (black, white, empty) and liberty count (1, 2, ≥3) for each adjacent intersection. Each response pattern is a binary feature matching the colour and liberty count in a 12-point diamond-shaped pattern²¹ centred around the previous move. Additionally, a small number of handcrafted local features encode common-sense Go rules (see Extended Data Table 4). Similar to the policy network, the weights π of the rollout policy are trained from 8 million positions from human games on the Tygem server to maximize log likelihood by stochastic gradient descent. Rollouts execute at approximately 1,000 simulations per second per CPU thread on an empty board.

Our rollout policy p_π(a|s) contains less handcrafted knowledge than state-of-the-art Go programs¹³. Instead, we exploit the higher-quality action selection within MCTS, which is informed both by the search tree and the policy network. We introduce a new technique that caches all moves from the search tree and then plays similar moves during rollouts; a generalization of the ‘last good reply’ heuristic⁵³. At every step of the tree traversal, the most probable action is inserted into a hash table, along with the 3 × 3 pattern context (colour, liberty and stone counts) around both the previous move and the current move. At each step of the rollout, the pattern context is matched against the hash table; if a match is found then the stored move is played with high probability.

Symmetries

In previous work, the symmetries of Go have been exploited by using rotationally and reflectionally invariant filters in the convolutional layers²⁴,²⁸,²⁹. Although this may be effective in small neural networks, it actually hurts performance in larger networks, as it prevents the intermediate filters from identifying specific asymmetric patterns²³. Instead, we exploit symmetries at run-time by dynamically transforming each position s using the dihedral group of eight reflections and rotations, d₁(s), …, d₈(s). In an explicit symmetry ensemble, a mini-batch of all 8 positions is passed into the policy network or value network and computed in parallel. For the value network, the output values are simply averaged, $\bar{v}_\theta(s) = \frac{1}{8}\sum_{j=1}^{8} v_\theta(d_j(s))$. For the policy network, the planes of output probabilities are rotated/reflected back into the original orientation, and averaged together to provide an ensemble prediction, $\bar{p}_\sigma(\cdot|s) = \frac{1}{8}\sum_{j=1}^{8} d_j^{-1}(p_\sigma(\cdot|d_j(s)))$; this approach was used in our raw network evaluation (see Extended Data Table 3). By contrast, APV-MCTS makes use of an implicit symmetry ensemble that randomly selects a single rotation/reflection j ∈ [1, 8] for each evaluation. We compute exactly one evaluation for that orientation only; in each simulation we compute the value of leaf node s_L by v_θ(d_j(s_L)), and allow the search procedure to average over these evaluations. Similarly, we compute the policy network for a single, randomly selected rotation/reflection, $d_j^{-1}(p_\sigma(\cdot|d_j(s)))$.
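The explicit and implicit ensembles can be sketched on small grids as follows. This is an illustration under stated assumptions, not the AlphaGo implementation: the transforms are indexed 0 to 7 rather than 1 to 8, boards are plain lists of lists, and `corner_value` and `corner_policy` are deliberately asymmetric toy stand-ins for the value and policy networks.

```python
import random

def rot90(grid):
    """Rotate a square grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def flip(grid):
    """Reflect a grid left to right."""
    return [row[::-1] for row in grid]

def transform(grid, j):
    """Apply dihedral transform j: (j % 4) rotations, then a flip if j >= 4."""
    out = grid
    for _ in range(j % 4):
        out = rot90(out)
    return flip(out) if j >= 4 else out

def inverse_transform(grid, j):
    """Undo transform j: undo the flip first, then the rotations."""
    out = flip(grid) if j >= 4 else grid
    for _ in range((4 - j % 4) % 4):
        out = rot90(out)
    return out

def explicit_value(value_fn, s):
    # Average the value output over all eight orientations.
    return sum(value_fn(transform(s, j)) for j in range(8)) / 8

def explicit_policy(policy_fn, s):
    # Map each output plane back to the original orientation, then average.
    n = len(s)
    avg = [[0.0] * n for _ in range(n)]
    for j in range(8):
        p = inverse_transform(policy_fn(transform(s, j)), j)
        for r in range(n):
            for c in range(n):
                avg[r][c] += p[r][c] / 8
    return avg

def implicit_value(value_fn, s, rng=random):
    # One randomly chosen orientation per evaluation; the search averages them.
    return value_fn(transform(s, rng.randrange(8)))

# Toy stand-ins: both "networks" only look at the top-left corner.
corner_value = lambda g: float(g[0][0])
corner_policy = lambda g: [[1.0 if (r, c) == (0, 0) else 0.0 for c in range(len(g))]
                           for r in range(len(g))]
board = [[1, 0, 0], [0, 0, 0], [0, 0, 0]]
v_bar = explicit_value(corner_value, board)
p_bar = explicit_policy(corner_policy, board)
```

With these asymmetric toy functions the explicit ensemble spreads the policy mass evenly over the four corners, which shows how averaging over orientations removes an orientation bias that a single evaluation would retain.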

Policy network: classification

We trained the policy network p_σ to classify positions according to expert moves played in the KGS data set. This data set contains 29.4 million positions from 160,000 games played by KGS 6 to 9 dan human players; 35.4% of the games are handicap games. The data set was split into a test set (the first million positions) and a training set (the remaining 28.4 million positions). Pass moves were excluded from the data set. Each position consisted of a raw board description s and the move a selected by the human. We augmented the data set to include all eight reflections and rotations of each position. Symmetry augmentation and input features were pre-computed for each position. For each training step, we sampled a randomly selected mini-batch of m samples from the augmented KGS data set, and applied an asynchronous stochastic gradient descent update to maximize the log likelihood of the action, $\Delta\sigma = \frac{\alpha}{m}\sum_{k=1}^{m} \frac{\partial \log p_\sigma(a^k|s^k)}{\partial\sigma}$. The step size α was initialized to 0.003 and halved every 80 million training steps; we used no momentum terms and a mini-batch size of m = 16. Updates were applied asynchronously on 50 GPUs using DistBelief⁶¹; gradients older than 100 steps were discarded. Training took around 3 weeks for 340 million training steps.
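As a minimal sketch of the step-size schedule and the log-likelihood update, the example below uses a state-independent softmax over three moves in place of the convolutional network. The function names and the toy expert moves are assumptions; only the initial step size, the halving interval and the form of the update follow the description above.

```python
import math

def step_size(step, alpha0=0.003, halve_every=80_000_000):
    """Step size after a given number of training steps (halved every 80 million)."""
    return alpha0 * 0.5 ** (step // halve_every)

def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sl_update(logits, expert_moves, alpha):
    """One ascent step on the mean log likelihood of a mini-batch of expert moves."""
    m = len(expert_moves)
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a in expert_moves:
        for i, p in enumerate(probs):
            # For a softmax policy, d log p(a) / d logit_i = 1[i == a] - p_i.
            grad[i] += (1.0 if i == a else 0.0) - p
    return [x + alpha * g / m for x, g in zip(logits, grad)]

# Toy mini-batch in which the expert mostly plays move 2.
logits = [0.0, 0.0, 0.0]
new_logits = sl_update(logits, expert_moves=[2, 2, 1, 2], alpha=step_size(0))
final_alpha = step_size(340_000_000 - 1)  # step size during the last training step
```

Over 340 million steps the schedule halves the step size four times, so the final step size is one sixteenth of the initial value.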

Policy network: reinforcement learning

We further trained the policy network by policy gradient reinforcement learning²⁵,²⁶. Each iteration consisted of a mini-batch of n games played in parallel, between the current policy network p_ρ that is being trained, and an opponent that uses parameters ρ⁻ from a previous iteration, randomly sampled from a pool of opponents, so as to increase the stability of training. Weights were initialized to ρ = ρ⁻ = σ. Every 500 iterations, we added the current parameters ρ to the opponent pool. Each game i in the mini-batch was played out until termination at step Tⁱ, and then scored to determine the outcome $z_t^i = \pm r(s_{T^i})$ from each player’s perspective. The games were then replayed to determine the policy gradient update, $\Delta\rho = \frac{\alpha}{n}\sum_{i=1}^{n}\sum_{t=1}^{T^i} \frac{\partial \log p_\rho(a_t^i|s_t^i)}{\partial\rho}\,(z_t^i - v(s_t^i))$, using the REINFORCE algorithm²⁵ with baseline $v(s_t^i)$ for variance reduction. On the first pass through the training pipeline, the baseline was set to zero; on the second pass we used the value network v_θ(s) as a baseline; this provided a small performance boost. The policy network was trained in this way for 10,000 mini-batches of 128 games, using 50 GPUs, for one day.
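The REINFORCE update with a baseline can be sketched on the same kind of toy policy. Everything here is a simplification made for illustration: the policy is a state-independent softmax over three moves, the states are placeholder labels, and the default baseline is zero as on the first pass described above.

```python
import math

def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_update(logits, games, alpha, baseline=lambda state: 0.0):
    """One policy gradient step over a mini-batch of finished games.

    games: list of (steps, z), where steps is a list of (state, action) pairs
           for the learner's moves and z is the outcome (+1 win, -1 loss).
    baseline: function of the state, subtracted from z to reduce variance.
    """
    n = len(games)
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for steps, z in games:
        for state, a in steps:
            advantage = z - baseline(state)
            for i, p in enumerate(probs):
                # Gradient of log p(a) with respect to logit i, scaled by the advantage.
                grad[i] += ((1.0 if i == a else 0.0) - p) * advantage
    return [x + alpha * g / n for x, g in zip(logits, grad)]

# Toy mini-batch: move 0 was played in a win, move 1 in a loss.
games = [([("s1", 0), ("s2", 0)], +1), ([("s1", 1), ("s2", 1)], -1)]
rho = reinforce_update([0.0, 0.0, 0.0], games, alpha=0.1)
updated = softmax(rho)
```

Moves played in won games have their probability raised and moves played in lost games have it lowered; a non-zero baseline changes the variance of the estimate but not its expectation.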

Value network: regression

We trained a value network to approximate the value function of the RL policy network p_ρ. To avoid overfitting to the strongly correlated positions within games, we constructed a new data set of uncorrelated self-play positions. This data set consisted of over 30 million positions, each drawn from a unique game of self-play. Each game was generated in three phases by randomly sampling a time step U ~ unif{1, 450}, and sampling the first t = 1, …, U − 1 moves from the SL policy network, a_t ~ p_σ(·|s_t); then sampling one move uniformly at random from available moves, a_U ~ unif{1, 361} (repeatedly until a_U is legal); then sampling the remaining sequence of moves until the game terminates, t = U + 1, …, T, from the RL policy network, a_t ~ p_ρ(·|s_t). Finally, the game is scored to determine the outcome z_t = ±r(s_T). Only a single training example (s_{U+1}, z_{U+1}) is added to the data set from each game. This data provides unbiased samples of the value function $v^{p_\rho}(s_{U+1}) = \mathbb{E}[z_{U+1} \mid s_{U+1}, a_{U+1,\ldots,T} \sim p_\rho]$. During the first two phases of generation we sample from noisier distributions so as to increase the diversity of the data set. The training method was identical to SL policy network training, except that the parameter update was based on mean squared error between the predicted values and the observed rewards, $\Delta\theta = \frac{\alpha}{m}\sum_{k=1}^{m} (z^k - v_\theta(s^k))\,\frac{\partial v_\theta(s^k)}{\partial\theta}$. The value network was trained for 50 million mini-batches of 32 positions, using 50 GPUs, for one week.
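The three-phase generation procedure can be sketched with a toy game standing in for Go. The `Nim` class, the two policy functions and the choice to return `None` when a game ends before step U + 1 are all assumptions introduced for illustration; the source does not say how such games were handled. Choosing uniformly among legal moves is equivalent to the rejection sampling described above.

```python
import random

class Nim:
    """Toy stand-in for Go: players alternately take 1 or 2 stones; the last stone wins."""

    def __init__(self, stones=21):
        self.stones = stones
        self.to_play = 0

    def legal_moves(self):
        return [m for m in (1, 2) if m <= self.stones]

    def play(self, move):
        self.stones -= move
        self.to_play = 1 - self.to_play

    def terminal(self):
        return self.stones == 0

    def winner(self):
        # The player who just moved took the last stone.
        return 1 - self.to_play

def random_policy(game, rng):
    """Hypothetical stand-in for the SL policy: a uniformly random legal move."""
    return rng.choice(game.legal_moves())

def greedy_policy(game, rng):
    """Hypothetical stand-in for the RL policy: leave a multiple of 3 when possible."""
    return game.stones % 3 or 1

def generate_example(sl_policy, rl_policy, max_u, rng):
    game = Nim()
    u = rng.randint(1, max_u)                  # U ~ unif{1, max_u}
    for _ in range(u - 1):                     # phase 1: moves 1..U-1 from the SL policy
        if game.terminal():
            return None
        game.play(sl_policy(game, rng))
    if game.terminal():
        return None
    game.play(rng.choice(game.legal_moves()))  # phase 2: move U uniform over legal moves
    if game.terminal():
        return None
    state, player = (game.stones, game.to_play), game.to_play  # s_{U+1}
    while not game.terminal():                 # phase 3: remaining moves from the RL policy
        game.play(rl_policy(game, rng))
    z = 1 if game.winner() == player else -1   # outcome for the player to move at s_{U+1}
    return state, z

example = generate_example(random_policy, greedy_policy, max_u=1, rng=random.Random(0))
```

Only one (state, outcome) pair is kept per game, so successive training examples come from independent games rather than from correlated positions within one game.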

Features for policy/value network

Each position s was pre-processed into a set of 19 × 19 feature planes. The features that we use come directly from the raw representation of the game rules, indicating the status of each intersection of the Go board: stone colour, liberties (adjacent empty points of stone’s chain), captures, legality, turns since stone was played, and (for the value network only) the current colour to play. In addition, we use one simple tactical feature that computes the outcome of a ladder search⁷. All features were computed relative to the current colour to play; for example, the stone colour at each intersection was represented as either player or opponent rather than black or white. Each integer feature value is split into multiple 19 × 19 planes of binary values (one-hot encoding). For example, separate binary feature planes are used to represent whether an intersection has 1 liberty, 2 liberties, …, ≥8 liberties. The full set of feature planes is listed in Extended Data Table 2.
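The one-hot split of an integer feature into binary planes can be sketched as follows. The function name and the 3 × 3 toy board are assumptions for illustration; the plane layout (1, 2, …, ≥8) follows the liberties example above, with a value of 0 taken to mean that the feature is absent at that intersection.

```python
def one_hot_planes(values, num_planes):
    """Split a square grid of non-negative integers into binary planes.

    Plane k-1 is set where the value equals k; the last plane is set
    where the value is at least num_planes. Zeros set no plane.
    """
    size = len(values)
    planes = [[[0] * size for _ in range(size)] for _ in range(num_planes)]
    for r, row in enumerate(values):
        for c, v in enumerate(row):
            if v >= 1:
                planes[min(v, num_planes) - 1][r][c] = 1
    return planes

# Toy liberty counts on a 3 x 3 board (0 marks an empty intersection).
liberties = [
    [0, 2, 0],
    [3, 11, 1],
    [0, 0, 0],
]
planes = one_hot_planes(liberties, num_planes=8)
```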

Neural network architecture

The input to the policy network is a 19 × 19 × 48 image stack consisting of 48 feature planes. The first hidden layer zero pads the input into a 23 × 23 image, then convolves k filters of kernel size 5 × 5 with stride 1 with the input image and applies a rectifier nonlinearity. Each of the subsequent hidden layers 2 to 12 zero pads the respective previous hidden layer into a 21 × 21 image, then convolves k filters of kernel size 3 × 3 with stride 1, again followed by a rectifier nonlinearity. The final layer convolves 1 filter of kernel size 1 × 1 with stride 1, with a different bias for each position, and applies a softmax function. The match version of AlphaGo used k = 192 filters; Fig. 2b and Extended Data Table 3 additionally show the results of training with k = 128, 256 and 384 filters.
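The padding and kernel sizes above keep every layer at 19 × 19, which the following shape trace checks using the standard convolution output-size formula. The function names are illustrative, and the sketch tracks tensor shapes only; it does not build or train a network.

```python
def conv_out(size, kernel, pad, stride=1):
    """Spatial output size of a convolution with symmetric zero padding."""
    return (size + 2 * pad - kernel) // stride + 1

def policy_shapes(k=192, board=19, planes=48):
    """Shapes of the input, the 12 hidden layers and the final layer."""
    shapes = [(board, board, planes)]
    size = conv_out(board, kernel=5, pad=2)     # pad 19 -> 23, then 5 x 5 filters
    shapes.append((size, size, k))
    for _ in range(2, 13):                      # hidden layers 2 to 12
        size = conv_out(size, kernel=3, pad=1)  # pad 19 -> 21, then 3 x 3 filters
        shapes.append((size, size, k))
    size = conv_out(size, kernel=1, pad=0)      # final layer: one 1 x 1 filter
    shapes.append((size, size, 1))
    return shapes

shapes = policy_shapes()
```

The final 19 × 19 × 1 plane holds one score per intersection, and the softmax over these 361 scores gives the move distribution.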

The input to the value network is also a 19 × 19 × 48 image stack, with an additional binary feature plane describing the current colour to play. Hidden layers 2 to 11 are identical to the policy network, hidden layer 12 is an additional convolution layer, hidden layer 13 convolves 1 filter of kernel size 1 × 1 with stride 1, and hidden layer 14 is a fully connected linear layer with 256 rectifier units. The output layer is a fully connected linear layer with a single tanh unit.

Evaluation

We evaluated the relative strength of computer Go programs by running an internal tournament and measuring the Elo rating of each program. We estimate the probability that program a will beat program b by a logistic function $p(a \text{ beats } b) = \frac{1}{1 + \exp(c_{elo}(e(b) - e(a)))}$, and estimate the ratings e(·) by Bayesian logistic regression, computed by the BayesElo program³⁷ using the standard constant c_elo = 1/400. The scale was anchored to the BayesElo rating of professional Go player Fan Hui (2,908 at date of submission)⁶². All programs received a maximum of 5 s computation time per move; games were scored using Chinese rules with a komi of 7.5 points (extra points to compensate white for playing second). We also played handicap games where AlphaGo played white against existing Go programs; for these games we used a non-standard handicap system in which komi was retained but black was given additional stones on the usual handicap points. Using these rules, a handicap of K stones is equivalent to giving K − 1 free moves to black, rather than K − 1/2 free moves using standard no-komi handicap rules. We used these handicap rules because AlphaGo’s value network was trained specifically to use a komi of 7.5.
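The logistic win-probability model can be written as a short function. This is a sketch of the formula as stated in this section, with a hypothetical function name; it does not reproduce the Bayesian rating estimation performed by BayesElo.

```python
import math

def win_probability(rating_a, rating_b, c_elo=1 / 400):
    """Probability that program a beats program b under the logistic model."""
    return 1.0 / (1.0 + math.exp(c_elo * (rating_b - rating_a)))

# Equal ratings give an even game; a rating advantage raises the probability.
even = win_probability(2908, 2908)
stronger = win_probability(3108, 2908)
weaker = win_probability(2908, 3108)
```

The two probabilities for a given pairing sum to one, so the model depends only on the rating difference e(a) − e(b).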

With the exception of distributed AlphaGo, each computer Go program was executed on its own single machine, with identical specifications, using the latest available version and the best hardware configuration supported by that program (see Extended Data Table 6). In Fig. 4, approximate ranks of computer programs are based on the highest KGS rank achieved by that program; however, the KGS version may differ from the publicly available version.

The match against Fan Hui was arbitrated by an impartial referee. Five formal games and five informal games were played with 7.5 komi, no handicap, and Chinese rules. AlphaGo won these games 5–0 and 3–2 respectively (Fig. 6 and Extended Data Table 1). Time controls for formal games were 1 h main time plus three periods of 30 s byoyomi. Time controls for informal games were three periods of 30 s byoyomi. Time controls and playing conditions were chosen by Fan Hui in advance of the match; it was also agreed that the overall match outcome would be determined solely by the formal games. To approximately assess the relative rating of Fan Hui to computer Go programs, we appended the results of all ten games to our internal tournament results, ignoring differences in time controls.

References

Allis, L. V. Searching for Solutions in Games and Artificial Intelligence. PhD thesis, Univ. Limburg, Maastricht, The Netherlands (1994)
van den Herik, H., Uiterwijk, J. W. & van Rijswijck, J. Games solved: now and in the future. Artif. Intell. 134, 277–311 (2002)
Schaeffer, J. The games computers (and people) play. Advances in Computers 52, 189–266 (2000)
Campbell, M., Hoane, A. & Hsu, F. Deep Blue. Artif. Intell. 134, 57–83 (2002)
Schaeffer, J. et al. A world championship caliber checkers program. Artif. Intell. 53, 273–289 (1992)
Buro, M. From simple features to sophisticated evaluation functions. In 1st International Conference on Computers and Games, 126–145 (1999)
Müller, M. Computer Go. Artif. Intell. 134, 145–179 (2002)
Tesauro, G. & Galperin, G. On-line policy improvement using Monte-Carlo search. In Advances in Neural Information Processing, 1068–1074 (1996)
Sheppard, B. World-championship-caliber Scrabble. Artif. Intell. 134, 241–275 (2002)
Bouzy, B. & Helmstetter, B. Monte-Carlo Go developments. In 10th International Conference on Advances in Computer Games, 159–174 (2003)
Coulom, R. Efficient selectivity and backup operators in Monte-Carlo tree search. In 5th International Conference on Computers and Games, 72–83 (2006)
Kocsis, L. & Szepesvári, C. Bandit based Monte-Carlo planning. In 15th European Conference on Machine Learning, 282–293 (2006)
Coulom, R. Computing Elo ratings of move patterns in the game of Go. ICGA J. 30, 198–208 (2007)
Baudiš, P. & Gailly, J.-L. Pachi: State of the art open source Go program. In Advances in Computer Games, 24–38 (Springer, 2012)
Müller, M., Enzenberger, M., Arneson, B. & Segal, R. Fuego – an open-source framework for board games and Go engine based on Monte-Carlo tree search. IEEE Trans. Comput. Intell. AI in Games 2, 259–270 (2010)
Gelly, S. & Silver, D. Combining online and offline learning in UCT. In 17th International Conference on Machine Learning, 273–280 (2007)
Krizhevsky, A., Sutskever, I. & Hinton, G. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 1097–1105 (2012)
Lawrence, S., Giles, C. L., Tsoi, A. C. & Back, A. D. Face recognition: a convolutional neural-network approach. IEEE Trans. Neural Netw. 8, 98–113 (1997)
Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015)
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015)
Stern, D., Herbrich, R. & Graepel, T. Bayesian pattern ranking for move prediction in the game of Go. In International Conference of Machine Learning, 873–880 (2006)
Sutskever, I. & Nair, V. Mimicking Go experts with convolutional neural networks. In International Conference on Artificial Neural Networks, 101–110 (2008)
Maddison, C. J., Huang, A., Sutskever, I. & Silver, D. Move evaluation in Go using deep convolutional neural networks. 3rd International Conference on Learning Representations (2015)
Clark, C. & Storkey, A. J. Training deep convolutional neural networks to play go. In 32nd International Conference on Machine Learning, 1766–1774 (2015)
Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 8, 229–256 (1992)
Sutton, R., McAllester, D., Singh, S. & Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, 1057–1063 (2000)
Sutton, R. & Barto, A. Reinforcement Learning: an Introduction (MIT Press, 1998)
Schraudolph, N. N., Dayan, P. & Sejnowski, T. J. Temporal difference learning of position evaluation in the game of Go. Adv. Neural Inf. Process. Syst. 6, 817–824 (1994)
Enzenberger, M. Evaluation in Go by a neural network using soft segmentation. In 10th Advances in Computer Games Conference, 97–108 (2003)
Silver, D., Sutton, R. & Müller, M. Temporal-difference search in computer Go. Mach. Learn. 87, 183–219 (2012)
Levinovitz, A. The mystery of Go, the ancient game that computers still can’t win. Wired Magazine (2014)
Mechner, D. All Systems Go. The Sciences 38, 32–37 (1998)
Mandziuk, J. Computational intelligence in mind games. In Challenges for Computational Intelligence, 407–442 (2007)
Berliner, H. A chronology of computer chess and its literature. Artif. Intell. 10, 201–214 (1978)
Browne, C. et al. A survey of Monte-Carlo tree search methods. IEEE Trans. Comput. Intell. AI in Games 4, 1–43 (2012)
Gelly, S. et al. The grand challenge of computer Go: Monte Carlo tree search and extensions. Commun. ACM 55, 106–113 (2012)
Coulom, R. Whole-history rating: A Bayesian rating system for players of time-varying strength. In International Conference on Computers and Games, 113–124 (2008)
KGS. Rating system math. http://www.gokgs.com/help/rmath.html
Littman, M. L. Markov games as a framework for multi-agent reinforcement learning. In 11th International Conference on Machine Learning, 157–163 (1994)
Knuth, D. E. & Moore, R. W. An analysis of alpha-beta pruning. Artif. Intell. 6, 293–326 (1975)
Sutton, R. Learning to predict by the method of temporal differences. Mach. Learn. 3, 9–44 (1988)
Baxter, J., Tridgell, A. & Weaver, L. Learning to play chess using temporal differences. Mach. Learn. 40, 243–263 (2000)
Veness, J., Silver, D., Blair, A. & Uther, W. Bootstrapping from game tree search. In Advances in Neural Information Processing Systems (2009)
Samuel, A. L. Some studies in machine learning using the game of checkers II - recent progress. IBM J. Res. Develop. 11, 601–617 (1967)
Schaeffer, J., Hlynka, M. & Jussila, V. Temporal difference learning applied to a high-performance game-playing program. In 17th International Joint Conference on Artificial Intelligence, 529–534 (2001)
Tesauro, G. TD-gammon, a self-teaching backgammon program, achieves master-level play. Neural Comput. 6, 215–219 (1994)
Dahl, F. Honte, a Go-playing program using neural nets. In Machines that learn to play games, 205–223 (Nova Science, 1999)
Rosin, C. D. Multi-armed bandits with episode context. Ann. Math. Artif. Intell. 61, 203–230 (2011)
Lanctot, M., Winands, M. H. M., Pepels, T. & Sturtevant, N. R. Monte Carlo tree search with heuristic evaluations using implicit minimax backups. In IEEE Conference on Computational Intelligence and Games, 1–8 (2014)
Gelly, S., Wang, Y., Munos, R. & Teytaud, O. Modification of UCT with patterns in Monte-Carlo Go. Tech. Rep. 6062, INRIA (2006)
Silver, D. & Tesauro, G. Monte-Carlo simulation balancing. In 26th International Conference on Machine Learning, 119 (2009)
Huang, S.-C., Coulom, R. & Lin, S.-S. Monte-Carlo simulation balancing in practice. In 7th International Conference on Computers and Games, 81–92 (Springer-Verlag, 2011)
Baier, H. & Drake, P. D. The power of forgetting: improving the last-good-reply policy in Monte Carlo Go. IEEE Trans. Comput. Intell. AI in Games 2, 303–309 (2010)
Huang, S. & Müller, M. Investigating the limits of Monte-Carlo tree search methods in computer Go. In 8th International Conference on Computers and Games, 39–48 (2013)
Segal, R. B. On the scalability of parallel UCT. Computers and Games 6515, 36–47 (2011)
Enzenberger, M. & Müller, M. A lock-free multithreaded Monte-Carlo tree search algorithm. In 12th Advances in Computer Games Conference, 14–20 (2009)
Huang, S.-C., Coulom, R. & Lin, S.-S. Time management for Monte-Carlo tree search applied to the game of Go. In International Conference on Technologies and Applications of Artificial Intelligence, 462–466 (2010)
Gelly, S. & Silver, D. Monte-Carlo tree search and rapid action value estimation in computer Go. Artif. Intell. 175, 1856–1875 (2011)
Baudiš, P. Balancing MCTS by dynamically adjusting the komi value. ICGA J. 34, 131 (2011)
Baier, H. & Winands, M. H. Active opening book application for Monte-Carlo tree search in 19×19 Go. In Benelux Conference on Artificial Intelligence, 3–10 (2011)
Dean, J. et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, 1223–1231 (2012)
Go ratings. http://www.goratings.org

Acknowledgements

We thank Fan Hui for agreeing to play against AlphaGo; T. Manning for refereeing the match; R. Munos and T. Schaul for helpful discussions and advice; A. Cain and M. Cant for work on the visuals; P. Dayan, G. Wayne, D. Kumaran, D. Purves, H. van Hasselt, A. Barreto and G. Ostrovski for reviewing the paper; and the rest of the DeepMind team for their support, ideas and encouragement.

Author information

David Silver and Aja Huang: These authors contributed equally to this work.

Authors and Affiliations

Google DeepMind, 5 New Street Square, London, EC4A 3TW, UK
David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, Nal Kalchbrenner, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel & Demis Hassabis
Google, 1600 Amphitheatre Parkway, Mountain View, California, 94043, USA
John Nham & Ilya Sutskever

Contributions

A.H., G.v.d.D., J.S., I.A., M.La., A.G., T.G. and D.S. designed and implemented the search in AlphaGo. C.J.M., A.G., L.S., A.H., I.A., V.P., S.D., D.G., N.K., I.S., K.K. and D.S. designed and trained the neural networks in AlphaGo. J.S., J.N., A.H. and D.S. designed and implemented the evaluation framework for AlphaGo. D.S., M.Le., T.L., T.G., K.K. and D.H. managed and advised on the project. D.S., T.G., A.G. and D.H. wrote the paper.

Corresponding authors

Correspondence to David Silver or Demis Hassabis.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Extended data figures and tables

Extended Data Table 1: Details of match between AlphaGo and Fan Hui
Extended Data Table 2: Input features for neural networks
Extended Data Table 3: Supervised learning results for the policy network
Extended Data Table 4: Input features for rollout and tree policy
Extended Data Table 5: Parameters used by AlphaGo
Extended Data Table 6: Results of a tournament between different Go programs
Extended Data Table 7: Results of a tournament between different variants of AlphaGo
Extended Data Table 8: Results of a tournament between AlphaGo and distributed AlphaGo, testing scalability with hardware
Extended Data Table 9: Cross-table of win rates in per cent between programs
Extended Data Table 10: Cross-table of win rates in per cent between programs in the single-machine scalability study
Extended Data Table 11: Cross-table of win rates in per cent between programs in the distributed scalability study

Supplementary information

Supplementary Information

This zipped file contains game records for the 5 formal match games played between AlphaGo and Fan Hui. (ZIP 3 kb)

About this article

Cite this article

Silver, D., Huang, A., Maddison, C. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016). https://doi.org/10.1038/nature16961

Received: 11 November 2015
Accepted: 05 January 2016
Published: 27 January 2016
Issue Date: 28 January 2016
DOI: https://doi.org/10.1038/nature16961
Springer Nature Limited

Editorial Summary

AlphaGo computer beats Go champion

The victory in 1997 of the chess-playing computer Deep Blue in a six-game series against the then world champion Garry Kasparov was seen as a significant milestone in the development of artificial intelligence. An even greater challenge remained: the ancient game of Go. Despite decades of refinement, until recently the strongest computers were still playing Go at the level of human amateurs. Enter AlphaGo. Developed by Google DeepMind, this program uses deep neural networks to mimic expert players, and further improves its performance by learning from games played against itself. AlphaGo has achieved a 99% win rate against the strongest other Go programs, and defeated the reigning European champion Fan Hui 5–0 in a tournament match. This is the first time that a computer program has defeated a human professional player on a full 19 × 19 board in even games with no handicap.
