
Very recently, the news came out that Alphabet's DeepMind research team has extended its machine learning engine to play both shogi and chess. Apparently, after only a few hours of self-learning, meaning only playing against itself given the rules of the games, its performance in chess already surpassed that of current top engines such as Stockfish 8. I personally don't know yet how exactly the matches were set up, as in what conditions Stockfish was set to run under, since if calculation times are limited, engines can perform very poorly. In any case, this is a very impressive achievement, because even if it turns out that Stockfish could have been set up more optimally, a few additional hours of training would let AlphaZero surpass that level of play again, meaning AlphaZero is fundamentally stronger than any current standard chess engine based on heuristic evaluation functions.

Now, in light of this news, it would be great if someone could elaborate on the main differences in the workings of a machine-learned chess engine compared to the standard engines we are all accustomed to using. More concretely:

  1. Isn't the evaluation function that AlphaZero uses, trained by machine learning methods, in the end just another heuristic evaluation function? If yes, would it be fair to say that the fundamental difference between the two engines' evaluation functions is that Stockfish has an optimized evaluation function hand-tuned by humans, so the definition of the function to optimize is fixed, whereas for AlphaZero the target evaluation function is constantly being redefined through additional training (for instance through self-play), making the latter a far more dynamic approach?
  2. Ultimately, vaguely speaking, an engine like Stockfish applies its evaluation function to the tree of possible moves, deciding which branches to keep and which to drop; then, through a deeper concrete analysis of each branch, again via its evaluation function, it figures out which branch yields the highest value, and that becomes the principal variation (of course there are many advanced techniques around this process to prune this large tree efficiently). That is, for each position this extremely concrete routine has to be repeated for Stockfish to make a decision. In contrast, I imagine AlphaZero does something very different: it does not rely on a concrete analysis of the tree of possible moves at a given position; instead, its evaluation function essentially assigns a value to that position (which, intuitively, is like putting the current position in analogy to all the other positions it was trained on), without ever performing concrete analysis the way Stockfish, or even a human player, does. Is this at all a sound picture of the workings of AlphaZero or similarly trained machine learning engines?

  3. We know that the space of chess positions is large enough that any attempt at sampling all of them would be completely in vain even in principle (EXPTIME complexity). That would suggest that no amount of training through self-play could ever explore all positions, so how can the end result be good despite having explored only a small fraction of the space via self-play? What is the key idea at play here?

  4. My guess is that AlphaZero has a very optimal way of comparing any given position, even a new one, to a previously visited one in its training set; the closer the comparison, the more valid the evaluation one can draw from it. For instance, when it played the move Bg5 in game 5, it must have explored a similar structure during its training, i.e. it is able to recognize that this position is essentially equivalent to a (possibly completely) different one studied in its training, in analogy to how face recognition is achieved through machine learning, and as a result it concludes that Bg5 should be the best move, as it was in that (or those) other similar positions. Is this at all a correct guess? I have no idea how this comparison is done, as surely it's not possible to store all the trained positions and parse through them each time.

This is merely an attempt to get some insights into the workings of AlphaZero and how it comes to a decision given a position.


2 Answers

  • How does AlphaZero select a move in the search?

This is very obvious from the paper.

Each simulation proceeds by selecting in each state s a move a with low visit count, high move probability and high value ...

What does that mean? AlphaZero has trained move probabilities (end of page 2 in the paper) from a deep neural network. During the search, it favours moves with a high network probability, and also moves with a low visit count (to ensure the sampling space is explored). This is not a new concept; Monte Carlo Tree Search was in the literature before Google existed.
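
To make that concrete, here is a minimal sketch (in Python) of a PUCT-style selection rule of the kind the paper describes. The Node class, the constant c_puct, and all names are illustrative stand-ins, not DeepMind's code:

    import math
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        prior: float            # move probability from the neural network
        visit_count: int = 0    # how often the search has tried this move
        value_sum: float = 0.0  # sum of simulation values behind this move
        children: list = field(default_factory=list)

    def select_child(node: Node, c_puct: float = 1.5) -> Node:
        """PUCT-style rule: prefer moves with high average value (Q), high
        network prior, and low visit count -- the trade-off quoted above."""
        total = sum(c.visit_count for c in node.children) or 1

        def score(c: Node) -> float:
            q = c.value_sum / c.visit_count if c.visit_count else 0.0
            u = c_puct * c.prior * math.sqrt(total) / (1 + c.visit_count)
            return q + u

        return max(node.children, key=score)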

------ (Very) Rough example ------

We have a position, and we have two legal moves.

  • Move 1 is good and reasonable
  • Move 2 puts your own king in danger for no compensation

According to the paper, a trained deep model might estimate the probabilities as (0.90, 0.10). Let's say AlphaZero runs 4 iterations of Monte Carlo Tree Search. The iterations could look like this:

Iteration 1: Pick move 1 because it has the highest probability. Do a simulation from move 1.
Iteration 2: Pick move 1 because it has the highest probability. Do a simulation from move 1.
Iteration 3: Pick move 1 because it has the highest probability. Do a simulation from move 1.

Now we're at iteration 4. Although move 1 has the higher estimated probability, move 2 has not been searched yet ("low visit count" in the paper), so now AlphaZero would pick move 2 and run a simulation from it.

Both moves would be considered, but AlphaZero would put more of its computing resources into move 1 (a good thing).

AlphaZero would then pick the move with the best expected outcome.
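
Here is a runnable toy version of this example (in Python). The priors (0.90, 0.10) are the ones from above; the simulation values are invented, and with only two moves the low-visit-count bonus needs more than four iterations to kick in, but the dynamic is exactly the one described:

    import math

    priors = {"move1": 0.90, "move2": 0.10}   # network probabilities from above
    visits = {"move1": 0, "move2": 0}
    value_sum = {"move1": 0.0, "move2": 0.0}

    def simulate(move):
        # Stand-in for a real simulation / network evaluation:
        # move1 is good, move2 endangers the king for no compensation.
        return 0.5 if move == "move1" else -0.5

    def score(move, total):
        q = value_sum[move] / visits[move] if visits[move] else 0.0
        u = 1.5 * priors[move] * math.sqrt(total) / (1 + visits[move])
        return q + u

    for _ in range(40):
        total = sum(visits.values()) or 1
        chosen = max(priors, key=lambda m: score(m, total))
        value_sum[chosen] += simulate(chosen)
        visits[chosen] += 1

    # move1 soaks up nearly all the simulations; move2 is still tried at
    # least once thanks to the low-visit-count bonus.
    print(visits)                        # {'move1': 39, 'move2': 1}
    print(max(visits, key=visits.get))   # the move actually played: move1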

------------------ Your questions: ------------------

Isn't the evaluation function that AlphaZero uses, trained by machine learning methods, at the end just another heuristic evaluation function?

The trained evaluation function is a deep neural network; it is not a set of explicit heuristic rules like what you see in Stockfish. Nobody fully understands the trained network (not even the Google team); it just works. This is how neural networks generally work.

whereas for AlphaZero, the target evaluation function is constantly being redefined through additional training.

The evaluation function in AlphaZero is a set of trained neurons (biases + weights). The Google team used very powerful machines to train these parameters. Generally, the more resources you can invest in training a deep learning model, the better the parameters you get.
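
To make "a set of trained neurons" concrete, here is a drastically simplified sketch of a policy-plus-value evaluation function in NumPy. The real AlphaZero network is a deep residual convolutional network; the single hidden layer, the sizes, and the random (untrained) weights below are all placeholders:

    import numpy as np

    rng = np.random.default_rng(0)

    # Placeholder "trained parameters". In the real engine, arrays like
    # these are what training adjusts -- there are no hand-written rules.
    W1, b1 = 0.05 * rng.standard_normal((256, 768)), np.zeros(256)
    Wp, bp = 0.05 * rng.standard_normal((4672, 256)), np.zeros(4672)  # policy head
    Wv, bv = 0.05 * rng.standard_normal((1, 256)), np.zeros(1)        # value head

    def evaluate(board_encoding):
        """board_encoding: a 768-dim vector (12 piece types x 64 squares).
        Returns (probabilities over 4672 move slots, value in [-1, 1])."""
        h = np.maximum(0.0, W1 @ board_encoding + b1)   # hidden layer, ReLU
        logits = Wp @ h + bp
        policy = np.exp(logits - logits.max())
        policy /= policy.sum()                          # softmax over moves
        value = float(np.tanh(Wv @ h + bv)[0])          # scalar evaluation
        return policy, value

    policy, value = evaluate(rng.integers(0, 2, 768).astype(float))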

(2).

Stockfish uses alpha-beta search, while AlphaZero uses Monte Carlo Tree Search. They are two very different algorithms: alpha-beta maintains lower/upper bounds on a position's value and cuts off branches that cannot affect the result, while Monte Carlo runs simulations from the root to a leaf.
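
For contrast, here is a bare-bones alpha-beta search over a toy game tree (nested Python lists whose leaves are hand-assigned heuristic scores). Stockfish layers move ordering, transposition tables, and many pruning heuristics on top of this skeleton:

    def alpha_beta(node, alpha, beta, maximizing):
        """Propagate lower/upper bounds (alpha/beta) down the tree and cut
        off branches that provably cannot change the result."""
        if not isinstance(node, list):       # leaf: static heuristic score
            return node
        if maximizing:
            best = -float("inf")
            for child in node:
                best = max(best, alpha_beta(child, alpha, beta, False))
                alpha = max(alpha, best)
                if alpha >= beta:            # cutoff: opponent avoids this
                    break
            return best
        best = float("inf")
        for child in node:
            best = min(best, alpha_beta(child, alpha, beta, True))
            beta = min(beta, best)
            if alpha >= beta:
                break
        return best

    # Two candidate moves, each met by two replies. Once the reply scoring
    # 2 is seen, move 2 can never beat move 1, so the 9 is never examined.
    tree = [[3, 5], [2, 9]]
    print(alpha_beta(tree, -float("inf"), float("inf"), True))  # -> 3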

to have explored all positions, so then how can the end result be good despite having potentially explored a small fraction of positions of the space via self-play? What is the key idea here at play?

Google didn't claim they had solved chess; not even their powerful machines could possibly enumerate all chess positions. But they didn't have to: that's the point of machine learning, the model "learns" from the training set. In this case, the training set comes from self-play. The key idea is to play as many good-quality games against itself as possible, as quickly as possible.
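
Schematically, the self-play loop that produces this training set looks like the sketch below. Every helper here (new_game, mcts_search, sample_move, update_network) is a hypothetical stand-in for a real component, not DeepMind's actual code:

    def self_play_training(network, num_games):
        """Structure-only sketch of AlphaZero-style self-play training.
        All helpers are hypothetical stand-ins."""
        examples = []
        for _ in range(num_games):
            game, trajectory = new_game(), []
            while not game.over():
                counts = mcts_search(game, network)   # search guided by net
                trajectory.append((game.encode(), counts))
                game.play(sample_move(counts))
            z = game.result()                         # +1 / 0 / -1
            # Every position in the game is labelled with the final outcome.
            # Training on these labels is what lets the network generalize
            # to unseen positions, rather than memorizing positions.
            examples.extend((s, c, z) for s, c in trajectory)
        update_network(network, examples)             # gradient-descent step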

For instance, when it played the move Bg5 in game 5, it must have explored a similar structure during its training,

I don't think AlphaZero encountered the exact same position during its training games. I think a quick read about Monte Carlo Tree Search is a good idea:

https://chessprogramming.wikispaces.com/Monte-Carlo+Tree+Search

AlphaZero was able to play the move by reaching sufficient depth. The algorithm estimates the expected probability of winning for each move; apparently, the move Bg5 gave the highest expected probability.

You can think of it like this: AlphaZero was able to search so well that it saw the probability of winning was highest after Bg5, while Stockfish didn't consider the move seriously (and thus lost).


Shameless self-plug: even though this question is almost four years old, I want to mention my free book on

Neural Networks for Chess

It covers

  • the underlying neural network concepts
  • AlphaZero
  • AlphaGo
  • AlphaGoZero and
  • efficiently updatable neural networks (NNUE).

There is also a code example showing how to implement a miniature version of AlphaZero.
