In self-play reinforcement learning an agent plays games against itself and with the help of hindsight and retrospection improves its policy over time. Using this premise, AlphaZero famously managed to become the strongest known Go, Shogi, and Chess entity by training a deep neural network from data collected solely from self-play. AlphaZero couples this deep neural network with a Monte Carlo Tree Search algorithm that drastically improves the networks initial policy and state evaluation. When training AlphaZero relies on the final outcome of the game for the generation of training labels. By altering the learning target to instead make use of the improved state evaluation acquired after the tree search, the creation of training labels for ...