Christopher Cutlip
12/3/2014
Implementation and Analysis of Three Parallel Game Tree Search Algorithms
Introduction:
Game trees are very important to the field of artificial intelligence. They can be a very effective
tool, but they can also be so large that searching them becomes very time consuming.
Given the massive size of game trees, it is important to find a way to search them effectively
despite that size. In this project I will be implementing parallel game tree search
algorithms that are capable of searching large game trees more effectively than sequential algorithms.
These parallel algorithms, thanks to their greater processing power, can search a tree to a deeper depth,
resulting in more accurate answers, and can search the tree much faster than sequential algorithms. [2]
In order to emphasize how much faster and more effective these parallel algorithms are, I will
implement sequential game tree search algorithms and test their performance against the parallel
algorithms. I will also compare the parallel algorithms against each other. The sequential
algorithms will be minimax, and minimax with alpha beta pruning, move ordering, heuristic search, and
heuristic search with ordering. The parallel algorithms that we will use are Tree Splitting, Principal
Variation Splitting, and Young Brothers Wait Concept, and we will go into detail on these algorithms
later in the report.
The game tree search algorithms will be put to use as a connect four AI, searching a connect
four game tree. Connect four is a two player connection game in which players drop checker-like discs
into the top of a seven column, six row vertically suspended grid. The pieces fall straight down into
place. The objective is to connect four discs of your color either horizontally, vertically, or diagonally
before your opponent does. The deliverables of the project will be the code for the sequential and
parallel algorithms, as well as the performance results.
This is an interesting parallel processing problem given that game trees are very well suited
to parallel processing. Game trees can be extremely large and therefore can require a massive amount
of processing to search. Parallel processing can help do this work in less time
than a sequential program. Game trees can be searched efficiently by dividing up the children of the
root node to be searched by individual CPUs, or the multiple processes can share information, in a
shared hash table, about which nodes have been searched and which have not, and search only the
nodes that have not been searched yet. [2]
Those who might use parallel processing on game trees include computer game companies
looking to improve their games' artificial intelligence. An example is IBM's Deep Blue, a chess
playing computer that utilized parallel game tree search. Deep Blue beat numerous chess grandmasters
in the late 1990s, including Garry Kasparov, who was the world chess champion at the time. [6]
Terminology:
Processor Hierarchy: Categorizes the algorithm based on how rigid its processor tree is.
Static Processor Hierarchy: One or multiple processors are designated as the master processor(s) and
the rest are designated as slaves that report to the masters. These designations remain fixed throughout
the algorithm.
Dynamic Processor Hierarchy: The processors that are delegated as the master processors and those
delegated as the slave processors can change dynamically depending on which processors are busy or
idle.
Control Distribution: Determines whether the control of the algorithm is centralized on a small
number of master processors or distributed amongst all of the processors.
Branching Factor: The average number of children of the nodes in the tree.
Transposition: A game state, reached by one of many possible sequences of moves.
Transposition Table: A table that stores the results of searches that were previously completed. The
purpose of the transposition table is to store the data derived from the search so that when the game
state is encountered again, the algorithm does not need to search or derive that data all over again and
waste time.
Window: The open interval between the upper bound Beta and the lower bound Alpha.
Literature survey of work already done on topic:
The first major published contribution in parallel game tree algorithms came in 1978, when
Gerard Baudet described parallel aspiration search. In aspiration search we shrink the initial alpha
beta window to a small range. If the correct minimax value is in this smaller range, then it will be
returned while visiting fewer leaf nodes. In parallel aspiration search we divide up the alpha beta window
into p sections, where p is the number of processors we have. The processors each search their own
window and return their results. The next major contribution was made by Akl, Barnard, and Doran in
1982. They proposed the idea of a mandatory work first algorithm. The idea of this algorithm is to first
search the branches that would be searched if the tree were perfectly ordered. If the tree were perfectly
ordered, we would want to search the first, or leftmost, branch first to get the best alpha beta cutoff, which
would then be used to make short work of the rest of the branches. [1]
The idea is to get a good alpha beta bound first and then search the rest of the branches in
parallel. The next major contribution came from Newborn's 1988 algorithm, Unsynchronized Iteratively
Deepening Parallel Alpha-Beta Search. This algorithm was the first attempt to asynchronously start the
next level of an iteratively deepened search instead of synchronizing at the root of the tree. The
processors are given a certain time to search their section of the tree and return their results once the
time runs out. Given that the processors are not restricted to searching to a common depth, some
processors may search deeper than others, resulting in better results. In 1989 another major contribution
came in the form of Schaeffer's Dynamic Principal Variation Splitting algorithm. This algorithm allows for
dynamic processor trees in PV-Split, letting processors dynamically attach themselves to other
busy processors. [1]
In the event that one branch takes a long time to process and another branch does not, the
processor from the second branch can be dynamically allocated to help work on the first branch after it is
done with its own branch. In 1988 a major contribution was made when Felten and Otto implemented
the first parallel alpha beta algorithm to play chess on more than 32 processors, the WayCool
algorithm. The algorithm decided on the type of parallelism to apply at a node depending upon
whether there was a transposition table entry for that node. If there was a transposition table entry, then
the move stored would probably be the best move, and so it was worth waiting for a bound to be
returned from searching that transposition table move. [1]
Once the bound was returned, the other successors could be explored in parallel. If there was no
transposition table information for the node, then all sub trees could be searched in parallel. This marked
the beginning of massive parallelism in game trees. The most famous chess playing computer, IBM's
Deep Blue, used an algorithm that utilized 256 processors. In 1988 Hyatt introduced Dynamic Tree
Splitting in his Ph.D. thesis. In this algorithm one processor is given the root position and the other
processors must look for a processor that has work to steal. If a processor has work it needs done, then
DTS hands out sub trees for idle processors to work on. The sub trees are distributed to idle processors
according to a complex set of rules based on what type of nodes can be handed off as sub trees. This
algorithm is significant for being one of the best algorithms from a performance and scalability
standpoint, and it is one of the most complex game tree search algorithms to implement. [1]
Main Results:
Platform Used:
I used a school lab machine, Joey14, which is running Ubuntu 12.04. This machine has an x86_64 i7
CPU with 8 cores at 3.4 GHz and 16 GB of memory.
Performance Metrics:
Speedup: The improvement in time to complete a search: T1/Tp, where T1 is the time for the single-processor search and Tp is the time with p processors. (A brief worked example follows these definitions.)
Efficiency: Speedup(p)/p.
Scalability: The trend in search speed as the number of processors increases.
Nodes Searched: The number of nodes the algorithm searches.
Nodes Searched Per Second: The number of nodes the algorithm searches per second.
Time: The amount of time it takes to complete the game tree search.
Depth: The depth in the tree to which the algorithm searches.
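As a brief illustration with made-up numbers (not measurements from this project): if the sequential search takes T1 = 40 seconds and the same search on p = 4 processors takes Tp = 10 seconds, the speedup is 40/10 = 4.0 and the efficiency is 4.0/4 = 1.0. If the parallel run instead took 20 seconds, the speedup would be 2.0 and the efficiency 0.5.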
Strategy Used to Solve the Problem:
(Figure 1)
Citation: https://chessprogramming.wikispaces.com/Alpha-Beta
I implemented sequential minimax with alpha beta pruning and heuristic search as my baseline
algorithm, against which I will compare the performance of the parallel algorithms. In the minimax
algorithm all of the nodes in the game tree are game states, for example the state of a tic tac toe game.
With minimax we search to the leaf nodes of the game tree and determine the score of each leaf node,
which depends on whether the game state is a win, loss, or draw for the computer player. The tree
alternates turns between the two players: the computer player, who wants to maximize the node score, and
the human player, who wants to minimize it. Figure 1 shows that the min node picks seven given it is
the smallest number; this is the move the human would pick. The root node is a max node, and it
represents the computer player's current turn. The computer player will pick the node with the score of
seven to maximize its score, as seven is larger than six and three. The computer is essentially picking the
best move available to it, taking into account the fact that the human player is always going to pick the worst
move for the computer on their turn. [4]
function alphabeta(node, depth, α, β, maximizingPlayer)
    if depth = 0 or node is a terminal node
        return the heuristic value of node
    if maximizingPlayer
        for each child of node
            α := max(α, alphabeta(child, depth - 1, α, β, FALSE))
            if β ≤ α
                break (* β cut-off *)
        return α
    else
        for each child of node
            β := min(β, alphabeta(child, depth - 1, α, β, TRUE))
            if β ≤ α
                break (* α cut-off *)
        return β
(Alpha-Beta pseudocode: http://en.wikipedia.org/wiki/Alpha%E2%80%93beta_pruning)
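To give a sense of how this pseudocode maps onto actual C++ (the language my algorithms are written in), here is a minimal, self-contained sketch. The Node structure and the small hand-built tree are illustrative assumptions only; my connect four program generates game positions on the fly rather than storing an explicit tree of nodes.

#include <algorithm>
#include <climits>
#include <cstdio>
#include <vector>

// Minimal game-tree node for illustration: leaves carry a heuristic score,
// interior nodes carry their children.
struct Node {
    int score;
    std::vector<Node> children;
    bool isTerminal() const { return children.empty(); }
};

// Sequential alpha-beta, mirroring the pseudocode above.
int alphabeta(const Node &node, int depth, int alpha, int beta, bool maximizingPlayer) {
    if (depth == 0 || node.isTerminal())
        return node.score;
    if (maximizingPlayer) {
        for (const Node &child : node.children) {
            alpha = std::max(alpha, alphabeta(child, depth - 1, alpha, beta, false));
            if (beta <= alpha)
                break;                      // beta cut-off
        }
        return alpha;
    } else {
        for (const Node &child : node.children) {
            beta = std::min(beta, alphabeta(child, depth - 1, alpha, beta, true));
            if (beta <= alpha)
                break;                      // alpha cut-off
        }
        return beta;
    }
}

int main() {
    // A tiny tree in the spirit of Figure 1: the root is a max node whose
    // three min-node children resolve to 7, 6, and 3.
    Node root{0, { Node{0, {Node{7}, Node{8}}},
                   Node{0, {Node{6}, Node{9}}},
                   Node{0, {Node{3}, Node{4}}} }};
    std::printf("best score for the computer: %d\n",
                alphabeta(root, 4, INT_MIN, INT_MAX, true));   // prints 7
    return 0;
}

Run on that tiny tree the search returns seven, matching the Figure 1 example, and the break statements are the points where whole sub trees are pruned away.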
The alpha beta component of the algorithm disqualifies certain branches from ever being
searched. This reduces the amount of work needed to search to a certain depth, which in
turn allows the algorithm to search even deeper in the tree within a reasonable time frame, allowing
for better decision making. Looking at Figure 1, we can see that the root node already has a good move
available to it with a score of seven. Given the fact that the human player will choose the lowest score
at the second level, and it has already found a six, which is lower than seven, there is no need to search
the rest of that sub-tree. There is also an enhancement to alpha beta known as move ordering. With
move ordering we score the child nodes of the current node with a scoring heuristic and search the
child with the best score first. Doing this gives a better chance that the best, or a very good, move will be
found first, making it possible to cut off more branches via alpha beta. Another enhancement is
heuristic search, which utilizes heuristic scoring. Heuristic scoring scores nodes based upon
how favorable the game state is to the player. This not only results in the computer making better
moves, but the wider variety of possible scores means it is more likely that more branches will be cut
off the tree. When nodes are scored only the standard way, with a score for a win, loss, or draw, we end up
with a lot of tie scores in the search, which leads to fewer cut offs. [4]
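To make the move ordering step concrete, here is a small sketch. The ScoredMove record and its estimate field are assumptions made for illustration, not my program's real data structures: the candidate moves are scored cheaply with the heuristic and sorted so the most promising move is searched first.

#include <algorithm>
#include <vector>

// Hypothetical move record: the column to play and a cheap heuristic
// estimate of how good that move looks before any deep search.
struct ScoredMove {
    int column;
    int estimate;
};

// Order the candidate moves so the most promising one is searched first,
// which makes an early alpha-beta cut-off more likely.
void orderMoves(std::vector<ScoredMove> &moves) {
    std::sort(moves.begin(), moves.end(),
              [](const ScoredMove &a, const ScoredMove &b) {
                  return a.estimate > b.estimate;   // best estimate first
              });
}

int main() {
    std::vector<ScoredMove> moves = {{0, 1}, {3, 5}, {2, 3}, {6, 0}};
    orderMoves(moves);
    // The moves are now searched in the order: column 3, 2, 0, 6.
    return 0;
}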
The first parallel algorithm I implemented was an alpha beta algorithm with Tree Splitting. In
this algorithm the master processor generates all of the child nodes of the root. The child nodes are then
delegated to processors to be searched. Each processor performs a sequential alpha beta search on the
child node that it was assigned. Once each processor finishes its search, it returns the values to the
master processor, which then picks the best move. [1]
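A rough sketch of how this splitting can be organized with pthreads is shown below. The searchChild function is only a stand-in for a full sequential alpha-beta search of one root child (it returns canned scores so the sketch is self-contained), and the fixed seven root moves are a simplification; this illustrates the structure, not my exact implementation.

#include <pthread.h>
#include <cstdio>

const int NUM_CHILDREN = 7;   // connect four has at most 7 legal root moves

// Stand-in for a full sequential alpha-beta search of one root child.
static int searchChild(int column) {
    static const int cannedScores[NUM_CHILDREN] = {1, 3, 7, 9, 7, 2, 0};
    return cannedScores[column];
}

struct WorkerArg {
    int column;   // which root child this worker searches
    int score;    // result written back by the worker
};

static void *worker(void *p) {
    WorkerArg *arg = static_cast<WorkerArg *>(p);
    arg->score = searchChild(arg->column);   // independent sequential search
    return NULL;
}

int main() {
    pthread_t threads[NUM_CHILDREN];
    WorkerArg args[NUM_CHILDREN];

    // Master generates the root's children and hands one to each processor.
    for (int i = 0; i < NUM_CHILDREN; ++i) {
        args[i].column = i;
        pthread_create(&threads[i], NULL, worker, &args[i]);
    }

    // Master waits for all searches, then picks the best move.
    int bestColumn = 0, bestScore = -1000000;
    for (int i = 0; i < NUM_CHILDREN; ++i) {
        pthread_join(threads[i], NULL);
        if (args[i].score > bestScore) {
            bestScore = args[i].score;
            bestColumn = i;
        }
    }
    std::printf("best move: column %d (score %d)\n", bestColumn, bestScore);
    return 0;
}

The master thread spawns one worker per root child, joins them all, and picks the move with the best returned score, which is exactly the control flow described above.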
(Figure 2) Citation: [4]
The second parallel algorithm I implemented was an alpha beta algorithm with Principal
Variation Splitting. In Principal Variation Splitting each node is expressed as a thread which can run in
parallel, and each thread can spawn one child thread for each legal move. With this algorithm we have a
data dependency amongst threads that needs to be addressed, so we get the initial required
value from the principal variation. For the principal variation we search straight down the leftmost path to
the lowest depth, or the leaf of the tree, sequentially, and we evaluate that terminal node. Once we have
done that, we take the value we get from it and pass it up to the parent node, which then searches the rest
of its children in parallel using the alpha beta values generated from the principal variation. If we get new
bounds from searching this level, we pass them up to the next level and use the alpha beta bounds to
search the remaining children in that level in parallel. This continues all the way to the root of the tree,
where the generated alpha beta bounds are used to search the rest of the root's children and the best
move is chosen. Figure 2 shows the Principal Variation Splitting algorithm being run with two
processors. [1]
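Below is a hedged sketch of how this might look in code. The explicit Node tree, the seqAB search used for the non-PV siblings, and the thread-per-sibling scheme are all simplifying assumptions for illustration; the real implementation works on connect four positions and a fixed pool of CPUs rather than an explicit tree and one thread per sibling.

#include <pthread.h>
#include <algorithm>
#include <climits>
#include <cstdio>
#include <vector>

// Explicit toy game tree used only for this sketch.
struct Node {
    int score;
    std::vector<Node> children;
    bool isTerminal() const { return children.empty(); }
};

// Ordinary sequential alpha-beta, used for the non-PV siblings.
static int seqAB(const Node &n, int alpha, int beta, bool maximizing) {
    if (n.isTerminal()) return n.score;
    if (maximizing) {
        for (const Node &c : n.children) {
            alpha = std::max(alpha, seqAB(c, alpha, beta, false));
            if (beta <= alpha) break;
        }
        return alpha;
    }
    for (const Node &c : n.children) {
        beta = std::min(beta, seqAB(c, alpha, beta, true));
        if (beta <= alpha) break;
    }
    return beta;
}

struct SiblingArg { const Node *child; int alpha, beta; bool maximizing; int result; };

static void *siblingWorker(void *p) {
    SiblingArg *a = static_cast<SiblingArg *>(p);
    a->result = seqAB(*a->child, a->alpha, a->beta, a->maximizing);
    return NULL;
}

// PV-Split: walk the leftmost (principal variation) child recursively to get
// a bound, then search the remaining children of this node in parallel with
// the tightened window.
static int pvSplit(const Node &n, int alpha, int beta, bool maximizing) {
    if (n.isTerminal()) return n.score;

    // 1. Search the first (principal variation) child recursively.
    int first = pvSplit(n.children[0], alpha, beta, !maximizing);
    if (maximizing) alpha = std::max(alpha, first);
    else            beta  = std::min(beta, first);
    if (beta <= alpha) return maximizing ? alpha : beta;

    // 2. Search the remaining children in parallel using the new bounds.
    size_t rest = n.children.size() - 1;
    std::vector<pthread_t> tids(rest);
    std::vector<SiblingArg> args(rest);
    for (size_t i = 0; i < rest; ++i) {
        args[i] = {&n.children[i + 1], alpha, beta, !maximizing, 0};
        pthread_create(&tids[i], NULL, siblingWorker, &args[i]);
    }
    int best = first;
    for (size_t i = 0; i < rest; ++i) {
        pthread_join(tids[i], NULL);
        best = maximizing ? std::max(best, args[i].result)
                          : std::min(best, args[i].result);
    }
    return best;
}

int main() {
    // Same small tree as in the alpha-beta sketch: the root is a max node.
    Node root{0, { Node{0, {Node{7}, Node{8}}},
                   Node{0, {Node{6}, Node{9}}},
                   Node{0, {Node{3}, Node{4}}} }};
    std::printf("PV-Split result: %d\n", pvSplit(root, INT_MIN, INT_MAX, true));
    return 0;
}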
The third parallel algorithm I implemented was an alpha beta algorithm with Young Brothers
Wait Concept. In YBW, sibling nodes are termed "brothers", and the oldest brother, which is the leftmost node, is searched first. The younger brothers, which are searched in parallel, have to wait until the
oldest brother returns from its sequential search with alpha beta bounds. These alpha beta bounds are
then used by the other child nodes of the root to explore their sub trees. In YBW we want to find a good
alpha beta cut off first, before we run the other child nodes in parallel. This increases the alpha beta
cutoffs that occur and reduces the size of the tree that needs to be explored. [3]
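A minimal sketch of this waiting pattern at the root is shown below, again with pthreads and with a canned searchChild standing in for the real alpha-beta search (both assumptions made only so the sketch is self-contained): the oldest brother is searched sequentially with the full window, and only then are the younger brothers launched in parallel with the bound it produced.

#include <pthread.h>
#include <climits>
#include <cstdio>

const int NUM_CHILDREN = 7;

// Stand-in for a sequential alpha-beta search of one root child with the
// window (alpha, beta); canned scores keep the sketch self-contained.
static int searchChild(int column, int alpha, int beta) {
    static const int canned[NUM_CHILDREN] = {7, 3, 6, 9, 2, 5, 1};
    (void)alpha; (void)beta;     // a real search would prune with these
    return canned[column];
}

struct BrotherArg { int column, alpha, result; };

static void *youngerBrother(void *p) {
    BrotherArg *a = static_cast<BrotherArg *>(p);
    // Younger brothers reuse the bound established by the oldest brother.
    a->result = searchChild(a->column, a->alpha, INT_MAX);
    return NULL;
}

int main() {
    // 1. The oldest (leftmost) brother is searched first, sequentially.
    int alpha = searchChild(0, INT_MIN, INT_MAX);
    int best = alpha, bestColumn = 0;

    // 2. Only now are the younger brothers searched, in parallel.
    pthread_t tids[NUM_CHILDREN - 1];
    BrotherArg args[NUM_CHILDREN - 1];
    for (int i = 1; i < NUM_CHILDREN; ++i) {
        args[i - 1] = {i, alpha, 0};
        pthread_create(&tids[i - 1], NULL, youngerBrother, &args[i - 1]);
    }
    for (int i = 1; i < NUM_CHILDREN; ++i) {
        pthread_join(tids[i - 1], NULL);
        if (args[i - 1].result > best) { best = args[i - 1].result; bestColumn = i; }
    }
    std::printf("best move: column %d (score %d)\n", bestColumn, best);
    return 0;
}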
Given the size of a connect four game tree, searching all the way to the leaf nodes is not possible. That being
the case, for all of the algorithms we will restrict the search depth. The search depth will be
limited to the point where the algorithm searches as deeply as possible while still returning an answer
within five seconds. For scoring nodes, incomplete games will be labeled as ties. I have also created a
scoring heuristic that I will use. The heuristic is based on how many pieces a player will have lined up
if they make the move that is currently being looked at. [5]
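As an illustration of this kind of heuristic, here is a simplified sketch. The exact counting scheme below is an assumption for illustration, not my actual scoring function: after a candidate move is made, count the longest run of the mover's discs through the square just played and use that count as the move's score.

#include <algorithm>
#include <cstdio>

const int ROWS = 6, COLS = 7;

// Count the longest run of `player` discs that passes through (row, col),
// checking the four directions: horizontal, vertical, and both diagonals.
static int lineScore(const char board[ROWS][COLS], int row, int col, char player) {
    const int dirs[4][2] = {{0, 1}, {1, 0}, {1, 1}, {1, -1}};
    int best = 0;
    for (int d = 0; d < 4; ++d) {
        int count = 1;                                  // the disc just played
        for (int s = -1; s <= 1; s += 2) {              // walk both ways along the line
            int r = row + s * dirs[d][0], c = col + s * dirs[d][1];
            while (r >= 0 && r < ROWS && c >= 0 && c < COLS && board[r][c] == player) {
                ++count;
                r += s * dirs[d][0];
                c += s * dirs[d][1];
            }
        }
        best = std::max(best, count);
    }
    return best;   // a count of 4 or more means the move wins outright
}

int main() {
    // '.' empty, 'X' computer, 'O' human; row 0 is the bottom of the grid.
    char board[ROWS][COLS] = {
        {'.', 'X', 'X', 'X', '.', '.', '.'},
        {'.', '.', '.', 'O', '.', '.', '.'},
        {'.', '.', '.', '.', '.', '.', '.'},
        {'.', '.', '.', '.', '.', '.', '.'},
        {'.', '.', '.', '.', '.', '.', '.'},
        {'.', '.', '.', '.', '.', '.', '.'}};
    // Suppose the computer has just dropped a disc into column 4 (row 0).
    board[0][4] = 'X';
    std::printf("heuristic score of the move: %d\n", lineScore(board, 0, 4, 'X'));  // prints 4
    return 0;
}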
Results:
I have implemented all 8 algorithms: minimax, alpha beta, alpha beta with heuristic search, alpha
beta with move ordering, alpha beta with heuristic search and move ordering, Tree Splitting, Young
Brothers Wait, and Principal Variation Splitting. The algorithms were written in C/C++ utilizing
pthreads and compiled under g++. The algorithms were run on the platform described in "Platform
Used" and the results were gathered. The tests were run on Joey14 when no other users were on the
machine, so that I would not be sharing computing power with other users. We will now present,
compare, and analyze the results of our test runs. We will first look at all 8 of the algorithms and
compare the results of the sequential algorithms to the parallel algorithms.
We will then look at the parallel algorithms and compare their results. Below we can see Table
1; this data was gathered from running the first move of the game. The first move almost always
requires the most time, and in the case of my program the first move is a stable move. Game trees,
depending on what game they represent, can be fairly unbalanced or have certain moves that take an abnormal
amount of time. When I say the first move is stable, I mean none of the algorithms have an abnormal
run time on it that would create data that does not reflect the speed of the vast majority of moves the
algorithm can make.
Algorithm         Max_Depth   Nodes Searched   Time (s)   Nodes p/s        CPUs
Minimax           5           104,837,125      4          26,209,281.25    1
Alpha-Beta        8           7,789,540        1          7,789,540.00     1
AB-Ordering       9           34,369,653       2          17,184,826.50    1
AB Heuristic      12          19,915,506       5          3,983,101.20     1
AB-O Heuristic    12          17,788,510       4          4,447,127.50     1
Tree Splitting    11          35,162,637       4          8,790,659.25     7
YBW               11          19,842,570       4          4,960,642.50     7
PVS               13          72,796,113       5          14,559,222.60    7
Table 1
Sequential Algorithms vs Parallel Algorithms:
One of the goals of this project is to compare sequential game tree search algorithms to parallel
game tree search algorithms. It is for this reason we have implemented five sequential algorithms and
gathered performance data on them, as well as on our three parallel algorithms. I ran my connect four
game using each one of the search methods and found the deepest depth that the algorithm could search
to and still return a result within five seconds. I recorded the maximum depth, as the deeper the game
tree can be searched the better the answer. I also recorded nodes searched, as this can tell us how fast
the algorithm can search nodes, and nodes per second is another form of this. As seen in Table 1, I used
7 CPUs for the parallel algorithms, given that the branching factor for a connect four game tree is about 7.
This is due to the fact that there are at most 7 moves that can be made at any one time, since there are 7
columns.
Looking at Table 1 we see that minimax has the best nodes searched per second. This is due to
the fact that minimax is purely a brute force algorithm: it searches through the tree looking for the best
move without doing any kind of pruning or ordering, and there is no communication overhead. One of
the biggest improvements comes from minimax to alpha beta, with the max depth rising from 5 to 8. The
alpha beta pruning enhancement to minimax pruned an enormous number of nodes (minimax searched over
100 million nodes at depth 5, while alpha beta searched under 8 million at depth 8), greatly increasing
its performance. Alpha beta ordering provides a modest enhancement to alpha beta.
Alpha beta ordering orders the root's child nodes based upon the estimated scores of the nodes. This
helps to ensure that the best child node is searched first, which increases the chance of generating a good
alpha beta cut off early, which in turn results in more nodes being pruned. [4]
However, this does not always work, given that the values of the child nodes are just estimates, so the
best search order is not always generated. Alpha beta heuristic search gives a big boost in performance.
Given the wide variety of scores, versus the standard win, loss, or draw, more branches can be pruned off
the tree, resulting in a deeper search. The parallel algorithms use alpha beta heuristic search as their alpha
beta algorithm. Table 1 shows that Tree Splitting and Young Brothers Wait perform worse than the
sequential alpha beta heuristic algorithm, but Principal Variation Splitting performs better. Tree
Splitting is a fairly brute force algorithm, and the different branches of the tree do not share the alpha
beta cut off bounds, while the sequential alpha beta heuristic algorithm does. Looking at the table we
see alpha beta heuristic search searched about 20 million nodes whereas Tree Splitting searched 35 million
nodes. From this data it is clear that the alpha beta heuristic search produced a good cut off, probably
early on, that was then shared with the other branches, resulting in a massive number of nodes being
pruned.
Tree Splitting would have found this good cut off as well, but that cut off would not have been
shared with the other 6 branches, resulting in those branches using sub-par cut offs, which caused fewer
nodes to be pruned from the tree. Young Brothers Wait searched around the same number of nodes as
sequential alpha beta heuristic search, so why did it perform worse? I believe the reason it performed
worse is the sibling, or brother, nodes having to wait on the oldest brother to finish first. With
YBW we wait for the first child node to finish so that it can generate an alpha beta cut-off and
share that cut-off with the other nodes. The oldest brother must have produced a good cut-off, as YBW
searched almost as few nodes as alpha beta heuristic did. However, despite the good cut-off, the
algorithm still has to wait for the first brother to finish, which delays the algorithm. In this case it seems
the time saved by the good bound produced by YBW was offset by the time spent waiting for the oldest
brother to finish. Principal Variation Splitting performed better than sequential alpha beta heuristic
search. In PVS we search the first child node to get a good bound like in YBW, but in PVS we use
multiple CPUs to search the first child node. [1]
This significantly reduces the time it takes to search the first child node and generate the alpha
beta bounds. With PVS we get the benefit of the bound from the first child node and we usually don't
have to wait too long to get it. Another explanation for why the parallel algorithms performed as
poorly as they did when compared to sequential alpha beta heuristic search is that parallel search
algorithms need to be well optimized to perform well. Game trees are not perfectly balanced, ideal
trees. With alpha beta pruning, which all the algorithms use except minimax, we can end up with one
branch that takes 10 seconds to search while the other branches take 1 second or less, due to
pruning and tree imbalance. That being the case, all of your processors with the exception of one could
finish their branches almost immediately and be left idling, basically leaving you with a
sequential algorithm. There are more complex algorithms that could effectively bring idle processors onto
any large branch that needed extra help, but those were too complex and time consuming to implement. Another
issue is that connect four has a fairly low branching factor of 7, limiting the number of CPUs we can
use. Chess has a branching factor of about 38, allowing for more CPUs and better performance than
sequential algorithms simply out of sheer processing power.
(Graph 1) Parallel Algorithms Speedup: speedup versus number of processors (1 to 8) for Tree Splitting, YBW, and PVS.
(Graph 2) Parallel Algorithms Efficiency: efficiency versus number of processors (1 to 8) for Tree Splitting, YBW, and PVS.
Parallel Algorithms Comparison:
In this section we will compare the performance of the parallel algorithms. When gathering the
data for the parallel algorithms comparison, I again ran the opening move of my connect four program,
but this time I ran the tests with 1, 2, 4, and 7 CPUs. I used 7 CPUs because, given the branching factor of 7,
no more CPUs than that could be used by the algorithms I implemented. I also searched to deeper
depths in some cases to get a clearer separation in running time. Looking at Graph 1, we see that
Tree Splitting had the best speedup. This is due to the fact that Tree Splitting is more of a brute force
algorithm. Because alpha beta bounds are not shared between the root's children, there are fewer
alpha beta cutoffs and generally a more balanced tree, meaning the CPUs spend more of the program's
execution time working. This is counter to algorithms like YBW, where if the alpha beta bounds
from the first child are good, then the other 6 branches will have a large portion of their nodes pruned
off, to the point where those 6 branches may be much smaller than the first.
This would result in most of the YBW algorithm's time being spent in a sequential mode when
searching the first child and very little of it spent in parallel mode when searching the other children.
This of course would result in a lower speedup than what is seen in Tree Splitting. The fact that YBW
has to search the first child sequentially also reduces speedup in and of itself, given you are using 7 CPUs,
which are being factored into the speedup equation, and when you are searching the first branch you are
technically only using 1 CPU. Principal Variation Splitting also performed worse on speedup than Tree
Splitting. This is for the same reason as YBW: the generation of the first child's alpha beta bounds
can lead to an uneven tree. It is also possible that if the principal variation produced good alpha beta
bounds, then the fact that the rest of the first child is searched in parallel would not matter as much,
given that those nodes being run in parallel could be quickly pruned by the principal variation's alpha
beta bounds. The efficiency of the three algorithms is fairly similar, although we do see Tree Splitting
outperforming YBW and PVS. [1]
The efficiency is an indicator that helps us know if the processing power we are putting into the
algorithm is being used efficiently and effectively. For all three algorithms the efficiency drops at each
step as the CPU count increases, all the way up to 7 CPUs. This efficiency data lets us know that in all the
algorithms we have unbalanced trees, probably from alpha beta pruning as well as the nature of the game,
connect four. Looking at Tree Splitting, it has an efficiency near 50% at 4 CPUs. This tells us that some
branches are very long and others very short. Once we get down to four branches left, when a CPU finishes
it has nothing left to work on, so it is idle. One of these final four branches is so much longer than the
others that the idle time on the three CPUs waiting for the final branch to finish puts us at 50% efficiency. If
the algorithm took 10 seconds to run on 4 CPUs, that is 40 seconds of total CPU time, and at 50% efficiency
the 3 CPUs that finished early must have sat idle for about 20 seconds of that CPU time combined while the
final one finished. In YBW, we also have CPUs waiting on another CPU to finish the first child, so that adds
to the idle time and reduces efficiency for that algorithm as well. Efficiency for many different parallel game
tree search algorithms is not great, but I could have attained better efficiency in my project. Unfortunately
the main way to do that would be to implement a different algorithm, one where all 7 CPUs could work on
one branch if need be.
This would reduce idle time, increase efficiency, and increase speedup, as it seems a big reason
for poor performance is poorly balanced game trees. However, these kinds of algorithms are extremely
difficult to implement, so they did not seem to be a viable option. In regards to scalability, Tree
Splitting appears to be the most scalable, for the same reasons I mentioned when talking about speedup.
Also, looking at the speedup graph, we can see that as the number of CPUs increases, so does the
difference in speedup between the algorithms. This suggests that if we were working on a game
like chess, with a larger branching factor that would allow these algorithms to use more CPUs, Tree
Splitting would continue to get better and better speedup values. This also tells us that Tree
Splitting is fairly scalable, whereas YBW and PVS are not as scalable.
Conclusion:
In my project I have selected and implemented 8 algorithms, 5 sequential and 3 parallel. I have
run performance tests and gathered data on the algorithms. I have compared the sequential algorithms
to the parallel algorithms. In my comparison of the two algorithm types I found that the sequential
alpha beta heuristic search algorithm, which was also used inside my parallel algorithms, outperformed YBW
and Tree Splitting and was outperformed by PVS. I believe this was due to tree imbalance: many types
of game trees are poorly balanced, and connect four could be poorly balanced as well. I believe a lot of
the tree imbalance was also caused by the alpha beta algorithm, which all of the algorithms except
minimax used. Using an alpha beta algorithm can result in large portions of the search tree being pruned,
which can leave you with a very long branch and a bunch of small branches, which minimizes the
effectiveness of my parallel algorithms. I also compared the parallel algorithms with one another. I found
Tree Splitting to have the best scalability and speedup.
I believe this was due to the Tree Splitting game tree being more balanced, since there is less alpha beta
pruning given that the different branches do not share alpha beta values. I also found Principal Variation
Splitting to give the best overall performance, given that a good bound is first found in the first child and
passed to the other branches, which are run in parallel. The time lost due to searching the first child first
is minimized by the fact that we have multiple CPUs working together to search the first child,
unlike in YBW. The efficiency of the parallel algorithms was fairly similar. There were also some
limitations in my project, like the branching factor of connect four only being 7. If I had used a game like
chess, which would have been more complicated, I would have had a branching factor of about 38, which
would have resulted in my parallel algorithms being more effective. Further work that could be done
would be implementing an algorithm that is capable of searching any branch with multiple CPUs, like
Dynamic Tree Splitting. This would be very hard to do, but it could fix the problem of slow parallel
searches due to unbalanced search trees, which seemed to cause problems in my project. [1]
Bibliography
[1] Brockington, M. G. (1996). A taxonomy of parallel game-tree search algorithms.
[2] Chen, Y., Tan, Y., Zhang, Y., & Dulong, C. (2007). Performance analysis of two parallel game-tree search applications. Lecture Notes in Computer Science, 4699, 1105-1114.
[3] Leung, J. (2003). Parallel Game Algorithms.
[4] Elnaggar, A. A., Gadallah, M., Aziem, M. A., & El-Deeb, H. (2014). A Comparative Study of Game Tree Searching Methods.
[5] Borovska, P., & Lazarova, M. (2007, June). Efficiency of parallel minimax algorithm for game tree search. In Proceedings of the 2007 international conference on Computer systems and technologies (p. 14). ACM.
[6] Vučković, V. V. (2007). The method of the chess search algorithms parallelization using two-processor distributed system. Facta universitatis - series: Mathematics and Informatics, 22(2), 175-188.
Code References:
http://www.cplusplus.com/forum/general/67675/
http://www.codehelper.io/community/need-help/please-explain-this-code/