One of the major limitations of galera cluster is deadlocks, it is quite common across multi-writer setups. This is mainly because of cluster wise optimistic locking. If your transaction per second is high and if you are updating same rows concurrently then deadlocks will be a major setback for your application.
Yes, one can retry the failed transaction and hope that it succeed next time, obviously this is a hard way and it can increase the response time. If application is not designed to retry the failed transaction then setting the wsrep_retry_autocommit on galera nodes would automatically retry the failed transaction.
If your table is bound to have more hotspots, then the frequency of the deadlock would be very high.
Dealing with deadlocks
Our setup is a classic 3 Node MariaDB Galera cluster all three nodes serving reads and writes equally from the ProxySQL Load Balancer.
Galera has few options to monitor and troubleshoot the deadlocks.
wsrep_local_bf_aborts
- Total number of local transactions that were aborted by slave transactions while in execution.
wsrep_local_cert_failures
- Total number of local transactions that failed certification test.
As seen above, during our peak transaction hours, deadlocks and certification failures are quite high in our systems.
With wsrep_debug=ON
and wsrep_log_conflicts=ON
configuration one can enable additional logging and get the queries that are failed / retried. In our case, we know our hotspot queries and these are straight forward queries so we decided to test by sending all the writes to one node and reads to the remaining nodes.
Fortunately we have ProxySQL 2.0 which has the native support to galera servers.
The Solution
We have put the ProxySQL to send write traffic to node 1 and read traffic to other nodes. This can be done by setting max_writers
to 1 in mysql_galera_hostgroup_table
and thus by setting up regex on mysql_query_rules we have achieved this without taking any downtime.
Post these changes, we have seen a drastic dip in the number of deadlocks occurred and there are no certification failure happening in the cluster
This is a real improvement and now our applications are not facing any deadlock or certification failure.
Verdict
If your database server can single handedly serve all the writes then running galera cluster with single node write would solve all your deadlock issues.