Official Mafiree Blogs

Keeping you informed about Databases

 Home  /  MySQL  
MariaDB Galera Bug [WSREP: BF lock wait long]
By Sukan   |   May 18, 2018   |   Posted in : MySQL
Our setup has a three-node MariaDB Galera cluster running with 10.2.5 version, our cluster is serving reads and writes on all the three nodes.  All our nodes have same hardware specification and application hits are split equally on these servers using ProxySQL. While everything is working fine, suddenly all the queries hang on random nodes and eventually the cluster stalls. Even though we have all deadlock detection mechanisms and other timeout settings, still cluster hangs and it brings the entire application down.
We have observed this behavior when we upgraded from MariaDB 10.1 to 10.2 and this bug is highly unpredictable. Even in low peak time with very minimal hits, we have faced this issue. The only option we had at the moment is to trigger an alert when we hit the error and schedule a shell script to force restart the affected node.
 
After detailed study of the environment, queries, hits etc we were able to reproduce this bug.
 
How is this bug triggered?
 
Consider the below table used to store `youtubeviews` status for different services.
 
Table Structure
Table Data

From the application perspective, `status` column is being changed every time. Thus having the possibility to fire multiple updates for the same ID column [Primary key]


 
In this situation, whenever there is an update for the same primary key on a transaction, it should throw a deadlock exception. But GC accepted the writes and triggered a situation which caused this bug.
The node would no longer accept writes even for another ID and it continues to be in hang state, here is a snippet of running processes inside the cluster.
 

+---------+-------------+-----------+------+---------+-------+---------------------------------------+------------------------------------------------+----------+

| Id      | User        | Host | db   | Command | Time | State                      | Info  | Progress |

+---------+-------------+-----------+------+---------+-------+---------------------------------------+------------------------------------------------+----------+

|       1 | system user |           | NULL | Sleep | 1495 | Update_rows_log_event::find_row(2630) | update youtubeviews set status=100 where id=5  |   0.000 |

|       2 | system user |           | NULL | Sleep | 73330 | wsrep aborter idle                    | NULL        | 0.000 |

| 1052481 | root        | localhost | prod | Query   | 1496 | query end                             | update youtubeviews set status=100 where id=5  |   0.000 |

| 1052973 | root        | localhost | prod | Killed  | 924 | query end                             | update youtubeviews set status=101 where id=11 |    0.000 |

| 1052974 | root        | localhost | NULL | Sleep   | 190 |                                       | NULL        | 0.000 |

| 1052980 | root        | localhost | prod | Query   | 0 | init                                  | show processlist        | 0.000 |

+---------+-------------+-----------+------+---------+-------+---------------------------------------+------------------------------------------------+----------+


You could see below error on the MariaDB error logs.


How to bring the cluster up?
The first response from the DBA is to manually kill the query which has stalled the cluster. Unfortunately this bug is so worse that killing the query doesn't bring up the cluster; in fact we will not able to restart of node as well.
One has to manually kill the affected MariaDB Cluster node and force restarting would help bringing back the cluster online.
 
Fixes:
 
Running Galera Cluster in a single node:
We can put the Galera cluster writes in to a single node and keep other nodes for reads. Having said that it would avoid hitting this error but again it is more or less similar to normal Master / Slave MariaDB.
 
Upgrade the cluster to 10.2.13+ version
As rightly pointed out by MariaDB in their release notes, upgrading cluster to 10.2.13 version is highly recommended as they have provided the fix for this bug.  [MDEV-12837]
The same test was applied to this newer version, understandingly when we fire updates for the same ID in different nodes below error is thrown and transaction gets rollbacked.
 

Cannot run query update youtubeviews set status=100 where id=5; ERROR : Deadlock: wsrep aborted transaction

NOTE: Since applications are connected to Galera cluster via ProxySQL, a rolling restart is all need for upgrading the cluster.
 
Downgrading the cluster to 10.1.x version
            As such cases, where upgradation not possible, downgrading the cluster to 10.1 can avoid from this bug. As in this 10.1 version, it immediately thrown an deadlock error once it hit this situation.
 



0 Comments


Leave a Comment

Name *

Email *

Comment *



Search All Blogs



Need Support?


Solutions      Services      Resources      About Mafiree

More than 3000 servers monitored

Certified database Administrators

24X7X365 Support

Happy Clients


ENQUIRE NOW
For Sales Related Queries, Please Call Our Sales Experts at

 +91-80-41155993


Meet Mafiree on social networks!

     

PCI Certificate nasscom member

Copyright © 2019 - All Rights Reserved - Mafiree