Galera Performance and Scaling Analysis
LIA Project Proposal

Esan Wit
Jan-Willem Selij

February 21, 2014

1 Introduction

The standard replication offered by MySQL is asynchronous, with a single master and multiple slaves. Although the slaves allow for read scalability, write scalability remains problematic. This is especially true with database constraints that must be checked between tables.

Galera[1] is marketed as clustering software that enables scalable MySQL databases. Galera claims to deliver consistency, read and write scalability, no slave lag, and improved availability. It is designed to work as middleware with a patched MySQL version to enable Write-Set Replication (WSREP) between multiple master nodes.

Galera uses a form of multi-master synchronous replication. This means that all nodes in the cluster can be read from and written to at the same time, and synchronous replication ensures that all nodes have the same view of the database. This allows every node to write data successfully while the cluster maintains a consistent state.

We intend to analyze the behavior of Galera in large-scale clusters. Current research and benchmarks are limited to relatively small clusters of 1 to 4 nodes.

2 Related Work

Benchmarks published by Codership for Galera use clusters of between 1 and 4 nodes[2]. Other work claims that more realistic clusters use 3 to 6 or more nodes[3]. Research performed by Tkachenko shows that Galera incurs only minor overhead at smaller node counts[4].

The limitations of the replication used by Galera have not been tested at higher node counts. Although research shows there is some overhead, the practical implications of this overhead and the actual limits are not discussed.

3 Research Questions

Our main goal is to determine the practical limitations of Galera in terms of performance and scalability. To determine this more objectively, the following sub-questions will be addressed:

• What is the performance of a normal-sized cluster?
• What is the performance of a cluster significantly larger than normal?
• What is the network overhead caused by the replication in the cluster?

4 Scope

Performance can be measured in many ways; we will focus on the number of transactions per second as our performance benchmark. This will also be plotted against the size of the transactions, as we suspect that network throughput may be a bottleneck.

5 Approach

We first plan to conduct a short literature study to determine normal cluster sizes and performance benchmarks. These will be used as a baseline. Subsequently we plan to increase the scale of the earlier experiments and analyze the cluster's behavior. If time permits, we also intend to analyze the effect of link latency and throughput between the nodes on cluster performance.

6 Materials

To perform tests on larger-scale clusters we require an environment in which multiple machines can be deployed. We expect to test cluster sizes between 3 and 32 nodes, although depending on the performance more nodes may be required to reach the actual limits.

To analyze the impact of network latency and throughput on cluster performance we need some form of control over the links between nodes, possibly by rate limiting machines on high-bandwidth links.

7 Planning

Week 1   Literature study and performance analysis of "normal" size cluster.
Week 2   Testing performance on larger clusters (2, 4, 8, 16 times larger).
Week 3   Analyze the influence of link latency and throughput on cluster performance. Begin report.
Week 4   Work on report and presentation.

References

[1] Codership. Codership Galera Cluster. http://www.codership.com/content/using-galera-cluster

[2] Codership. Codership benchmarks. http://www.codership.com/info/benchmarks

[3] Jay Janssen. Multicast replication in Percona XtraDB Cluster (PXC) and Galera. http://www.mysqlperformanceblog.com/2013/06/05/multicast-replication-in-percona-xtradb-cluster-pxc-and-galera/, 2013.

[4] Vadim Tkachenko. Benchmarking Galera replication overhead. http://www.mysqlperformanceblog.com/2011/10/13/benchmarking-galera-replication-overhead/, 2011.
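
As a minimal sketch of the metric described under Scope (transactions per second, recorded per transaction size), the measurement loop could look roughly like the Python below. The names `transactions_per_second`, `run_benchmark`, `run_transaction`, and the `clock` parameter are hypothetical and not part of Galera or MySQL. The `run_transaction` callable is an assumed stand-in for whatever client code commits one write of the given size to a cluster node.

```python
import time
from typing import Callable, Dict, Iterable


def transactions_per_second(completed: int, elapsed_seconds: float) -> float:
    """Throughput: completed transactions divided by wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completed / elapsed_seconds


def run_benchmark(
    run_transaction: Callable[[int], None],
    payload_sizes: Iterable[int],
    transactions_per_size: int = 1000,
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[int, float]:
    """Return a mapping {payload size: transactions per second}.

    run_transaction(size) is assumed to commit one transaction whose
    payload is `size` bytes; how it talks to the database is left open.
    """
    results: Dict[int, float] = {}
    for size in payload_sizes:
        start = clock()
        for _ in range(transactions_per_size):
            run_transaction(size)
        elapsed = clock() - start
        results[size] = transactions_per_second(transactions_per_size, elapsed)
    return results
```

Plotting the resulting mapping for each cluster size would give the throughput-versus-transaction-size curves mentioned under Scope. A real run would also need warm-up and repeated trials, which this sketch leaves out.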