Christopher Cutlip
12/3/2014

Implementation and Analysis of Three Parallel Game Tree Search Algorithms

Introduction:
Game trees are very important to the field of artificial intelligence. They can be a very effective tool, but they can also be so large that searching them becomes very time consuming. Given this, it is important to find ways to search game trees effectively despite their massive size. In this project I will be implementing parallel game tree search algorithms that are capable of searching large game trees more effectively than sequential algorithms. These parallel algorithms, thanks to their greater processing power, can search a tree to a deeper depth, resulting in more accurate answers, and can search the tree much faster than sequential algorithms. [2] In order to show how much faster and more effective these parallel algorithms are, I will implement sequential game tree search algorithms and test their performance against the parallel algorithms. I will also compare the parallel algorithms against each other. The sequential algorithms will be minimax and minimax with alpha beta pruning, the latter enhanced with move ordering, heuristic search, and heuristic search combined with ordering. The parallel algorithms are Tree Splitting, Principal Variation Splitting, and the Young Brothers Wait Concept; these algorithms are described in detail later in the report. The game tree search algorithms will be put to use as a connect four AI, searching a connect four game tree. Connect four is a two-player connection game in which players drop checker-like discs into the top of a seven-column, six-row vertically suspended grid. The pieces fall straight down into place. The objective is to connect four discs of your color horizontally, vertically, or diagonally before your opponent does. The deliverables of the project are the code for the sequential and parallel algorithms, as well as the performance results.
This is an interesting parallel processing problem because game trees are very well suited to parallel processing. Game trees can be extremely large and can therefore require a massive amount of processing to search. Parallel processing can complete this work in less time than a sequential program. Game trees can be searched efficiently by dividing the children of the root node among individual CPUs, or the processes can share information in a shared hash table about which nodes have been searched and which have not, and search the nodes that have not been searched yet. [2] Those who might use parallel processing on game trees include computer game companies looking to improve their games' artificial intelligence. An example is IBM's Deep Blue, a chess playing computer that utilized parallel game tree search. Deep Blue beat numerous chess grandmasters in the late 1990s, including Garry Kasparov, who was the world chess champion at the time. [6]

Terminology:
Processor Hierarchy: Categorizes an algorithm based on how rigid its processor tree is.
Static Processor Hierarchy: One or more processors are designated as the master processor(s) and the rest are designated as slaves that report to the masters. These designations remain fixed throughout the algorithm.
Dynamic Processor Hierarchy: Which processors act as masters and which act as slaves can change dynamically depending on which processors are busy or idle.
Control Distribution: Determines whether control of the algorithm is centralized on a small number of master processors or distributed amongst all of the processors.
Branching Factor: The average number of children of the nodes in the tree.
Transposition: A game state that can be reached by more than one possible sequence of moves.
Transposition Table: A table that stores the results of searches that were previously completed. The purpose of the transposition table is to store the data derived from a search so that when the same game state is encountered again, the algorithm does not need to search or derive that data all over again and waste time.
Window: The open interval between the upper bound Beta and the lower bound Alpha.
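To make the transposition table idea concrete, below is a minimal C++ sketch of such a table, assuming the game state is summarized by a hypothetical 64-bit hash key (for example a Zobrist key). The names and fields are illustrative only and are not taken from any of the algorithms implemented in this project.

#include <cstdint>
#include <unordered_map>

// Minimal transposition table sketch (illustrative only).
// Key: a 64-bit hash of the game state (e.g. a Zobrist key, assumed here).
// Value: the result of a previous search of that state.
struct TTEntry {
    int depth;   // depth the stored result was searched to
    int score;   // score returned by that search
};

class TranspositionTable {
public:
    // Store a result for a game state.
    void store(uint64_t key, int depth, int score) {
        table[key] = TTEntry{depth, score};
    }
    // Look up a state; only reuse the entry if it was searched
    // at least as deeply as the current search requires.
    bool probe(uint64_t key, int depth, int& score) const {
        auto it = table.find(key);
        if (it == table.end() || it->second.depth < depth) return false;
        score = it->second.score;
        return true;
    }
private:
    std::unordered_map<uint64_t, TTEntry> table;
};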
Literature survey of work already done on topic:
The first major published contribution in parallel game tree algorithms came in 1978, when Gerard Baudet described parallel aspiration search. In aspiration search we shrink the initial alpha beta window to a small range; if the correct minimax value lies in this smaller range, it will be returned while visiting fewer leaf nodes. In parallel aspiration search we divide the alpha beta window into p sections, where p is the number of processors available. The processors each search their own window and return their results.
The next major contribution was made by Akl, Barnard, and Doran in 1982. They proposed the idea of a mandatory work first algorithm. The idea of this algorithm is to first search the branches that would be searched if the tree were perfectly ordered. If the tree were perfectly ordered, we would want to search the first, or leftmost, branch first to get the best alpha beta cutoff, which would then make short work of the rest of the branches. [1] The idea is to obtain a good alpha beta bound first and then search the rest of the branches in parallel.
The next major contribution came from Newborn's 1988 algorithm, Unsynchronized Iteratively Deepening Parallel Alpha Beta Search. This algorithm was the first attempt to asynchronously start the next level of an iteratively deepening search instead of synchronizing at the root of the tree. The processors are given a certain amount of time to search their section of the tree and return their results once the time runs out. Since the processors are not restricted to searching to a common depth, some processors may search deeper than others, resulting in better results.
In 1989 another major contribution came in the form of Schaeffer's Dynamic Principal Variation Split algorithm. This algorithm allows for dynamic processor trees in PV-Split, which lets processors dynamically attach themselves to other busy processors. [1] In the event that one branch takes a long time to process and another branch does not, the processor from the second branch can be dynamically reallocated to help work on the first branch after it is done with its own.
In 1988 a major contribution was made when Felten and Otto implemented WayCool, the first parallel alpha beta algorithm to play chess on more than 32 processors. The algorithm decided on the type of parallelism to apply at a node depending on whether there was a transposition table entry for that node. If there was a transposition table entry, then the stored move was probably the best move, so it was worth waiting for a bound to be returned from that transposition table move. [1] Once the bound was returned, the other successors could be explored in parallel.
If there was no transposition table information for the node, then all of the subtrees could be searched in parallel. This marked the beginning of massive parallelism in game trees. The most famous chess playing computer, IBM's Deep Blue, used an algorithm that utilized 256 processors.
In 1988 Hyatt introduced Dynamic Tree Splitting (DTS) in his Ph.D. thesis. In this algorithm one processor is given the root position and the other processors must look for a processor that has work to steal. If a processor has work that needs to be done, DTS hands out subtrees for idle processors to work on. The subtrees are distributed to idle processors according to a complex set of rules based on what type of nodes can be handed off as subtrees. This algorithm is significant for being one of the best algorithms from a performance and scalability standpoint, and it is one of the most complex game tree search algorithms to implement. [1]

Main Results:
Platform Used: I used a school lab machine, Joey14, which runs Ubuntu 12.04. This machine has an x86_64 i7 CPU with 8 cores at 3.4 GHz and 16 GB of memory.
Performance Metrics:
Speedup: The improvement in time to complete a search: Speedup(p) = T1/Tp, where T1 is the time on one processor and Tp is the time with p processors.
Efficiency: Speedup(p)/p.
Scalability: The trend in search speed as the number of processors increases.
Nodes Searched: The number of nodes the algorithm searches.
Nodes Searched Per Second: The number of nodes the algorithm searches per second.
Time: The amount of time it takes to complete the game tree search.
Depth: The depth in the tree to which the algorithm searches.

Strategy Used to Solve Problem:
(Figure 1: example minimax game tree) Citation: https://chessprogramming.wikispaces.com/Alpha-Beta
I implemented sequential minimax with alpha beta pruning and heuristic search as my baseline algorithm against which I will compare the performance of the parallel algorithms. In the minimax algorithm all of the nodes in the game tree are game states, for example the state of a tic tac toe game. With minimax we search to the leaf nodes of the game tree and determine the score of each leaf node, which depends on whether the game state is a win, loss, or draw for the computer player. The tree alternates turns between the two players: the computer player, who wants to maximize the node score, and the human player, who wants to minimize it. Figure 1 shows that the min node picks seven, given it is the smallest number; this is the move the human would pick. The root node is a max node, and it represents the computer player's current turn. The computer player will pick the node with the score of seven to maximize its score, as seven is larger than six and three. The computer is essentially picking the best move available to it, taking into account the fact that the human player is always going to pick the worst move for the computer on their turn. [4]

function alphabeta(node, depth, α, β, maximizingPlayer)
    if depth = 0 or node is a terminal node
        return the heuristic value of node
    if maximizingPlayer
        for each child of node
            α := max(α, alphabeta(child, depth - 1, α, β, FALSE))
            if β ≤ α
                break (* β cut-off *)
        return α
    else
        for each child of node
            β := min(β, alphabeta(child, depth - 1, α, β, TRUE))
            if β ≤ α
                break (* α cut-off *)
        return β
(Alpha Beta Pseudo Code: http://en.wikipedia.org/wiki/Alpha%E2%80%93beta_pruning)

The alpha beta component of the algorithm disqualifies certain branches from ever being searched.
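As a concrete counterpart to the cited pseudocode, here is a minimal C++ sketch of depth-limited alpha beta search. The GameState type, its children() method, and the evaluate() heuristic are hypothetical placeholders standing in for the connect four board, its legal moves, and the scoring heuristic described later in this report; this is a sketch of the standard technique, not the project's exact implementation.

#include <algorithm>
#include <vector>

// Hypothetical game-state interface (placeholder for the connect four board).
struct GameState {
    bool isTerminal() const;                 // win, loss, or draw has been reached
    std::vector<GameState> children() const; // states reachable in one move
};

int evaluate(const GameState& s);            // heuristic score, higher = better for the computer

// Depth-limited minimax with alpha beta pruning, following the cited pseudocode.
int alphabeta(const GameState& node, int depth, int alpha, int beta, bool maximizingPlayer) {
    if (depth == 0 || node.isTerminal())
        return evaluate(node);
    if (maximizingPlayer) {
        for (const GameState& child : node.children()) {
            alpha = std::max(alpha, alphabeta(child, depth - 1, alpha, beta, false));
            if (beta <= alpha) break;        // beta cut-off: the opponent will avoid this branch
        }
        return alpha;
    } else {
        for (const GameState& child : node.children()) {
            beta = std::min(beta, alphabeta(child, depth - 1, alpha, beta, true));
            if (beta <= alpha) break;        // alpha cut-off
        }
        return beta;
    }
}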
This pruning reduces the amount of work that needs to be done to search to a given depth, which in turn allows the algorithm to search even deeper in the tree within a reasonable time frame, allowing for better decision making. Looking at Figure 1, we can see that the root node already has a good move available to it with a score of seven. Given that the human player will choose the lowest score at the second level, and a six, which is lower than seven, has already been found there, there is no need to search the rest of that subtree.
There is also an enhancement to alpha beta known as move ordering. With move ordering we score the child nodes of the current node with a scoring heuristic and search the child with the best score first. Doing this gives a better chance that the best, or at least a very good, move will be found first, making it possible to cut off more branches via alpha beta. Another enhancement is heuristic search, which utilizes heuristic scoring. Heuristic scoring scores nodes based upon how favorable the game state is to the player. This not only results in the computer making better moves, but the wider variety of possible scores also means that more branches are likely to be cut off the tree. When nodes are only scored the standard way, with a score for a win, loss, or draw, we end up with a lot of tie scores in the search, which leads to fewer cutoffs. [4]
The first parallel algorithm I implemented was an alpha beta algorithm with Tree Splitting. In this algorithm the master processor generates all of the child nodes of the root. The child nodes are then delegated to processors to be searched. Each processor performs a sequential alpha beta search on the child node that it was assigned. Once each processor finishes its search, it returns its value to the master processor, which then picks the best move. [1] A sketch of this scheme appears after the algorithm descriptions below.
(Figure 2: Principal Variation Splitting with two processors) Citation: [4]
The second parallel algorithm I implemented was an alpha beta algorithm with Principal Variation Splitting. In Principal Variation Splitting each node is expressed as a thread which can run in parallel, and each thread spawns one child thread for each legal move. With this algorithm we have a data dependency amongst threads that needs to be addressed, so we get the initial required values from the principal variation. With the principal variation we search straight down to the lowest depth, or the leaf of the tree, sequentially, and we evaluate that terminal node. Once we have done that, we take the value we get from it and pass it up to the parent node, which then searches the rest of its children in parallel using the alpha beta values generated from the principal variation. If we get new bounds from searching this level, we pass them up to the next level and use the alpha beta bounds to search the remaining children at that level in parallel. This continues all the way to the root of the tree, where the generated alpha beta bounds are used to search the rest of the root's children and the best move is chosen. Figure 2 shows the Principal Variation Splitting algorithm being run with two processors. [1]
The third parallel algorithm I implemented was an alpha beta algorithm with the Young Brothers Wait Concept. In YBW, sibling nodes are termed "brothers", and the older brother, which is the leftmost node, is searched first. The younger brothers, which are searched in parallel, have to wait until the oldest brother returns from its sequential search with alpha beta bounds. These alpha beta bounds are then used by the other child nodes of the root to explore their subtrees. In YBW we want to find a good alpha beta cutoff first before we run the other child nodes in parallel. This increases the alpha beta cutoffs that occur and reduces the size of the tree that needs to be explored. [3]
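To make the parallel decomposition concrete, below is a minimal pthread sketch of the Tree Splitting scheme described above, reusing the hypothetical GameState and alphabeta() from the earlier sketch. It is an illustration of the idea under those assumptions, not the project's actual code; PVS and YBW differ mainly in that the first (leftmost) child is searched before the remaining children are launched, so that its alpha beta bounds can be shared.

#include <pthread.h>
#include <climits>
#include <vector>

// One unit of work: search a single child of the root with sequential alpha beta.
struct SplitJob {
    GameState child;   // child of the root assigned to this processor
    int depth;         // remaining search depth
    int score;         // result written back by the worker
};

void* searchChild(void* arg) {
    SplitJob* job = static_cast<SplitJob*>(arg);
    // Each worker runs an independent sequential search. Note that the alpha beta
    // bounds are NOT shared between workers, which is the weakness discussed in the results.
    job->score = alphabeta(job->child, job->depth, INT_MIN, INT_MAX, false);
    return nullptr;
}

// The master generates the root's children, hands one to each thread, waits for
// all of them, and returns the index of the best child (the move it would play).
int treeSplitBestMove(const GameState& root, int depth) {
    std::vector<GameState> children = root.children();
    std::vector<SplitJob> jobs(children.size());
    std::vector<pthread_t> threads(children.size());
    for (size_t i = 0; i < children.size(); ++i) {
        jobs[i] = SplitJob{children[i], depth - 1, 0};
        pthread_create(&threads[i], nullptr, searchChild, &jobs[i]);
    }
    int bestIndex = 0, best = INT_MIN;
    for (size_t i = 0; i < children.size(); ++i) {
        pthread_join(threads[i], nullptr);
        if (jobs[i].score > best) {          // root is a max node
            best = jobs[i].score;
            bestIndex = static_cast<int>(i);
        }
    }
    return bestIndex;
}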
Given the size of a connect four game tree, searching all the way to the leaf nodes is not possible. That being the case, for all of the algorithms the search depth is restricted. The search depth is limited to the point where the algorithm searches as deeply as possible while still returning an answer within five seconds. For scoring nodes, incomplete games are labeled as ties. I have also created a scoring heuristic that I use. The heuristic is based on how many pieces a player will have lined up if they make the move that is currently being looked at. [5]
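As an illustration, here is a minimal C++ sketch of one way such a "pieces lined up after this move" heuristic could look. The 6x7 board representation and the exact return value are assumptions made for the sketch rather than the precise scheme used in the project.

#include <algorithm>

const int ROWS = 6, COLS = 7;

// board[r][c]: 0 = empty, 1 = computer, 2 = human. Row 0 is the bottom of the grid.
// Hypothetical sketch: score a candidate move by the longest run of the mover's
// discs through the square the new disc lands on.
int lineUpScore(const int board[ROWS][COLS], int row, int col, int player) {
    const int dirs[4][2] = { {0, 1}, {1, 0}, {1, 1}, {1, -1} }; // horizontal, vertical, two diagonals
    int best = 1;
    for (const auto& d : dirs) {
        int run = 1;                               // the newly placed disc itself
        int r = row + d[0], c = col + d[1];
        while (r >= 0 && r < ROWS && c >= 0 && c < COLS && board[r][c] == player) {
            ++run; r += d[0]; c += d[1];           // walk forward along the direction
        }
        r = row - d[0]; c = col - d[1];
        while (r >= 0 && r < ROWS && c >= 0 && c < COLS && board[r][c] == player) {
            ++run; r -= d[0]; c -= d[1];           // walk backward along the direction
        }
        best = std::max(best, run);
    }
    return best;    // a value of 4 or more means the move wins outright
}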
Results:
I have implemented all 8 algorithms: minimax, alpha beta, alpha beta with heuristic search, alpha beta with move ordering, alpha beta with heuristic search and move ordering, Tree Splitting, Young Brothers Wait, and Principal Variation Splitting. The algorithms were written in C/C++ utilizing pthreads and compiled with g++. The algorithms were run on the platform described in "Platform Used" and the results were gathered. The tests were run on Joey14 when no other users were on the machine so that I would not be sharing computing power with other users. We will now present, compare, and analyze the results of our test runs. We will first look at all 8 of the algorithms and compare the results of the sequential algorithms to the parallel algorithms. We will then look at the parallel algorithms and compare their results.
Below we can see Table 1; this data was gathered from running the first move of the game. The first move almost always requires the most time, and in the case of my program the first move is a stable move. Game trees, depending on the game, can be fairly unbalanced or have certain moves that take an abnormal amount of time. When I say the first move is stable, I mean that none of the algorithms have an abnormal run time that would produce data that does not reflect the speed of the vast majority of moves the algorithm can make.

Algorithm        Max_Depth  Nodes Searched  Time (s)  Nodes p/s      CPUs
Minimax          5          104,837,125     4         26,209,281.25  1
Alpha-Beta       8          7,789,540       1         7,789,540.00   1
AB-Ordering      9          34,369,653      2         17,184,826.50  1
AB Heuristic     12         19,915,506      5         3,983,101.20   1
AB-O Heuristic   12         17,788,510      4         4,447,127.50   1
Tree Splitting   11         35,162,637      4         8,790,659.25   7
YBW              11         19,842,570      4         4,960,642.50   7
PVS              13         72,796,113      5         14,559,222.60  7
Table 1

Sequential Algorithms vs Parallel Algorithms:
One of the goals of this project is to compare sequential game tree search algorithms to parallel game tree search algorithms. It is for this reason that I implemented five sequential algorithms and gathered performance data on them, as well as on the three parallel algorithms. I ran my connect four game using each one of the search methods and found the deepest depth that each algorithm could search to and still return a result within five seconds. I recorded the maximum depth, as the deeper the game tree can be searched the better the answer. I also recorded nodes searched, as this tells us how fast the algorithm can search nodes, and nodes per second is another form of this.
As seen in Table 1, I used 7 CPUs for the parallel algorithms, since the branching factor of a connect four game tree is about 7. This is due to the fact that there are at most 7 moves that can be made at any one time, because there are 7 columns. Looking at Table 1 we see that minimax has the best nodes searched per second. This is because minimax is purely a brute force algorithm: it searches through the tree looking for the best move without doing any kind of pruning or ordering, and there is no communication overhead. One of the biggest improvements comes from moving from minimax to alpha beta, where the max depth goes from 5 to 8. The alpha beta pruning enhancement to minimax pruned nearly 100 million nodes from the tree, greatly increasing performance.
Alpha beta with move ordering provides a modest enhancement to alpha beta. Move ordering orders the root's child nodes based upon their estimated scores. This helps to ensure that the best child node is searched first, which increases the chance of generating a good alpha beta cutoff early, which in turn results in more nodes being pruned. [4] However, this does not always work, since the values of the child nodes are just estimates, so the best search order is not always generated. Alpha beta heuristic search gives a big boost in performance. Given the wide variety of scores, versus the standard win, lose, or draw, more branches can be pruned off the tree, resulting in a deeper search. The parallel algorithms use alpha beta heuristic search as their underlying alpha beta algorithm.
Table 1 shows that Tree Splitting and Young Brothers Wait perform worse than the sequential alpha beta heuristic algorithm, but Principal Variation Splitting performs better. Tree Splitting is a fairly brute force algorithm, and the different branches of the tree do not share the alpha beta cutoff bounds, while the sequential alpha beta heuristic algorithm does. Looking at the table, we see that alpha beta heuristic searched about 20 million nodes whereas Tree Splitting searched 35 million nodes. From this data it is clear that the alpha beta heuristic search produced a good cutoff, probably early on, that it then shared with the other branches, resulting in a massive number of nodes being pruned. Tree Splitting would have found this good cutoff as well, but that cutoff would not have been shared with the other 6 branches, so those branches used sub-par cutoffs and fewer nodes were pruned from the tree.
Young Brothers Wait searched around the same number of nodes as sequential alpha beta heuristic search, so why did it perform worse? I believe the reason it performed worse is that the sibling, or brother, nodes have to wait for the oldest brother to finish first. With YBW we wait for the first child node to finish so that it can generate an alpha beta cutoff and share that cutoff with the other nodes. The oldest brother must have produced a good cutoff, as YBW searched almost as few nodes as alpha beta heuristic did. However, despite the good cutoff, the algorithm still has to wait for the first brother to finish, which delays the algorithm. In this case it seems the time saved by the good bound produced by YBW was offset by the time spent waiting for the oldest brother to finish.
Principal Variation Splitting performed better than sequential alpha beta heuristic search. In PVS we search the first child node to get a good bound, as in YBW, but in PVS we use multiple CPUs to search the first child node. [1] This significantly reduces the time it takes to search the first child node and generate the alpha beta bounds. With PVS we get the benefit of the bound from the first child node, and we usually do not have to wait too long to get it.
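The timing trade-off discussed above comes down to when the shared bound becomes available. Below is a minimal pthread sketch of the YBW-style ordering, again reusing the hypothetical GameState, evaluate(), and alphabeta() from the earlier sketches: the eldest brother is searched sequentially to establish a bound, and only then are the younger brothers launched in parallel with that bound. In PVS the eldest child itself would additionally be searched by several CPUs rather than one. This is an illustrative sketch, not the project's exact implementation.

#include <pthread.h>
#include <climits>
#include <vector>

struct BrotherJob {
    GameState child;   // a younger brother of the root's eldest child
    int depth;         // remaining search depth
    int alpha;         // bound inherited from the eldest brother
    int score;         // result written back by the worker
};

void* searchYoungerBrother(void* arg) {
    BrotherJob* job = static_cast<BrotherJob*>(arg);
    // The younger brothers start with the eldest brother's bound already in place,
    // so many of their nodes can be cut off immediately.
    job->score = alphabeta(job->child, job->depth, job->alpha, INT_MAX, false);
    return nullptr;
}

int ybwRootSearch(const GameState& root, int depth) {
    std::vector<GameState> children = root.children();
    if (children.empty()) return evaluate(root);
    // 1. Search the eldest (leftmost) brother sequentially to establish a bound.
    int alpha = alphabeta(children[0], depth - 1, INT_MIN, INT_MAX, false);
    // 2. Only then launch the younger brothers in parallel, sharing that bound.
    std::vector<BrotherJob> jobs;
    std::vector<pthread_t> threads(children.size() > 1 ? children.size() - 1 : 0);
    for (size_t i = 1; i < children.size(); ++i)
        jobs.push_back(BrotherJob{children[i], depth - 1, alpha, 0});
    for (size_t i = 0; i < jobs.size(); ++i)
        pthread_create(&threads[i], nullptr, searchYoungerBrother, &jobs[i]);
    int best = alpha;
    for (size_t i = 0; i < jobs.size(); ++i) {
        pthread_join(threads[i], nullptr);
        if (jobs[i].score > best) best = jobs[i].score;  // root is a max node
    }
    return best;   // best score found; the child achieving it is the chosen move
}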
Another explanation for why the parallel algorithms performed as poorly as they did compared to sequential alpha beta heuristic search is that parallel search algorithms need to be well optimized to perform well. Game trees are not perfectly balanced, ideal trees. With alpha beta pruning, which all of the algorithms use except minimax, we can end up with one branch that takes 10 seconds to search while the other branches take 1 second or less, due to pruning and tree imbalance. In that case, all of the processors except one could finish their branches almost immediately and then sit idle, basically leaving you with a sequential algorithm. There are more complex algorithms that could effectively put extra processors onto any large branch that needed help, but those were too complex and time consuming to implement. Another issue is that connect four has a fairly low branching factor of 7, limiting the number of CPUs we can use. Chess has a branching factor of 38, allowing for more CPUs and better performance than sequential algorithms simply out of sheer processing power.

(Graph 1: Speedup of the parallel algorithms, Tree Splitting, YBW, and PVS, versus the number of processors)
(Graph 2: Efficiency of the parallel algorithms, Tree Splitting, YBW, and PVS, versus the number of processors)

Parallel Algorithms Comparison:
In this section we will compare the performance of the parallel algorithms. When gathering the data for the parallel algorithms comparison, I again ran the opening move of my connect four program, but this time I ran the tests with 1, 2, 4, and 7 CPUs. I used at most 7 CPUs because, given the branching factor of 7, no more CPUs than that could be used by the algorithms I implemented. I also searched to deeper depths in some cases to get a clearer separation in running time.
Looking at Graph 1, we see that Tree Splitting had the best speedup. This is due to the fact that Tree Splitting is more of a brute force algorithm: alpha beta bounds are not shared between the root's children, so there are fewer alpha beta cutoffs and generally a more balanced tree, meaning the CPUs spend more of the program's execution time working. This is in contrast to algorithms like YBW, where if the alpha beta bounds from the first child are good, then the other 6 branches will have a large portion of their nodes pruned off, to the point where those 6 branches may be much smaller than the first. This would result in most of the YBW algorithm's time being spent in a sequential mode searching the first child and very little of it spent in parallel mode searching the other children. This of course results in a lower speedup than what is seen in Tree Splitting. The fact that YBW has to search the first child sequentially also reduces speedup in and of itself, given that you are using 7 CPUs, which are being factored into the speedup equation, yet while you are searching the first branch you are technically only using 1 CPU. Principal Variation Splitting also had worse speedup than Tree Splitting, for the same reason as YBW: the generation of the first child's alpha beta bounds can lead to an uneven tree.
It is also possible that, if the principal variation produced good alpha beta bounds, then the fact that the rest of the first child is searched in parallel would not matter as much, given that the nodes being run in parallel could be quickly pruned by the principal variation's alpha beta bounds.
The efficiency of the three algorithms is fairly similar, although we do see Tree Splitting outperforming YBW and PVS. [1] Efficiency is an indicator that helps us know whether the processing power we are putting into the algorithm is being used efficiently and effectively. For all three algorithms the efficiency drops at each step as the number of CPUs increases to 7. This efficiency data tells us that in all of the algorithms we have unbalanced trees, probably from alpha beta pruning as well as from the nature of the game, connect four. Looking at Tree Splitting, it has an efficiency near 50% at 4 CPUs. This tells us that some branches are very long and others very short. Once we get down to the final four branches, when a CPU finishes it has nothing left to work on, so it sits idle. One of these final four branches is so much longer than the others that the idle time on the three CPUs waiting for the final branch to finish puts us at 50% efficiency. For example, if the algorithm took 10 seconds to run on 4 CPUs, we have 40 seconds of available CPU time, and at 50% efficiency roughly 20 seconds of that CPU time must have been spent idle by the three CPUs while the final one finished. In YBW, we also have CPUs waiting on another CPU to finish the first child, which adds to the idle time and reduces efficiency for that algorithm as well.
Efficiency for many different parallel game tree search algorithms is not great, but I could have attained better efficiency in my project. Unfortunately the main way to do that would be to implement a different algorithm, one where all 7 CPUs could work on a single branch if need be. This would reduce idle time, increase efficiency, and increase speedup, as it seems a big reason for the poor performance is poorly balanced game trees. However, these kinds of algorithms are extremely difficult to implement, so they did not seem to be a viable option.
In regards to scalability, Tree Splitting appears to be the most scalable, for the same reasons I mentioned when talking about speedup. Also, looking at the speedup graph, we can see that as the number of CPUs increases, so does the difference in speedup between the algorithms. This tells us that if we were working on a game like chess, with a larger branching factor that would allow these algorithms to use more CPUs, Tree Splitting would continue to get better and better speedup values. This also tells us that Tree Splitting is fairly scalable, whereas YBW and PVS are not as scalable.

Conclusion:
In my project I have selected and implemented 8 algorithms, 5 sequential and 3 parallel. I have run performance tests and gathered data on the algorithms. I have compared the sequential algorithms to the parallel algorithms. In my comparison of the two algorithm types I found that the sequential alpha beta heuristic search algorithm, which was also used inside my parallel algorithms, outperformed YBW and Tree Splitting and was outperformed by PVS. I believe this was due to tree imbalance: many types of game trees are poorly balanced, and connect four's could be poorly balanced as well. I believe a lot of the tree imbalance was also caused by the alpha beta pruning which all of the algorithms except minimax used.
Using an alpha beta algorithm can result in large portions of the search tree being pruned, which can leave you with one very long branch and a bunch of small branches, and this minimizes the effectiveness of my parallel algorithms. I also compared the parallel algorithms with one another. I found Tree Splitting to have the best scalability and speedup. I believe this was due to the Tree Splitting game tree being more balanced, since there is less alpha beta pruning given that the different branches do not share alpha beta values. I also found Principal Variation Splitting to give the best performance, given that a good bound is first found in the first child and passed to the other branches, which are then run in parallel. The time lost to searching the first child first is minimized by the fact that multiple CPUs work together to search the first child, unlike in YBW. The efficiency of the parallel algorithms was fairly similar. There were also some limitations in my project, like the branching factor of connect four only being 7. If I had used a game like chess, which would have been more complicated, I would have had a branching factor of 38, which would have resulted in my parallel algorithms being more effective. Further work that could be done would be implementing an algorithm that is capable of searching any branch with multiple CPUs, like Dynamic Tree Splitting. This would be very hard to do, but it could fix the problem of slow parallel searches due to unbalanced search trees, which seemed to cause problems in my project. [1]

Bibliography
[1] Brockington, M. G. (1996). A Taxonomy of Parallel Game-Tree Search Algorithms.
[2] Chen, Y., Tan, Y., Zhang, Y., & Dulong, C. (2007). Performance analysis of two parallel game-tree search applications. Lecture Notes in Computer Science, 4699, 1105-1114.
[3] Leung, J. (2003). Parallel Game Algorithms.
[4] Elnaggar, A. A., Gadallah, M., Aziem, M. A., & El-Deeb, H. (2014). A Comparative Study of Game Tree Searching Methods.
[5] Borovska, P., & Lazarova, M. (2007, June). Efficiency of parallel minimax algorithm for game tree search. In Proceedings of the 2007 International Conference on Computer Systems and Technologies (p. 14). ACM.
[6] Vučković, V. V. (2007). The method of the chess search algorithms parallelization using two-processor distributed system. Facta Universitatis, Series: Mathematics and Informatics, 22(2), 175-188.

Code References:
http://www.cplusplus.com/forum/general/67675/
http://www.codehelper.io/community/need-help/please-explain-this-code/