Official Mafiree Blogs

Keeping you informed about Databases

MariaDB Galera Bug [WSREP: BF lock wait long]
By Sukan   |   May 18, 2018   |   Posted in : MySQL
Our setup is a three-node MariaDB Galera cluster running version 10.2.5, serving reads and writes on all three nodes. All nodes have the same hardware specification, and application traffic is split equally across them using ProxySQL. While everything works fine, all queries suddenly hang on a random node and eventually the cluster stalls. Even though we have deadlock-detection mechanisms and other timeout settings in place, the cluster still hangs, bringing the entire application down.

We observed this behavior after upgrading from MariaDB 10.1 to 10.2, and the bug is highly unpredictable: we have hit it even during off-peak hours with very minimal traffic. The only option we had at the time was to trigger an alert when the error appeared and schedule a shell script to force-restart the affected node.
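That stop-gap can be sketched as a small watchdog script. This is a sketch, not our exact production script; the log path, tail depth, and service name are assumptions you would adapt to your environment:

```shell
#!/bin/sh
# Hypothetical watchdog: force-restart the node when the bug signature
# appears in the MariaDB error log. Paths and service name are assumptions.
ERROR_LOG=/var/log/mysql/error.log
PATTERN="BF lock wait long"

if tail -n 500 "$ERROR_LOG" | grep -q "$PATTERN"; then
    # A normal restart hangs as well once the node is stuck,
    # so the mysqld process has to be killed first.
    pkill -9 mysqld
    systemctl start mariadb
fi
```

Scheduled from cron every minute or so, this keeps the outage window short until a proper fix is applied.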

After a detailed study of the environment, the queries, and the traffic pattern, we were able to reproduce this bug.

How is this bug triggered?

Consider the table below, used to store YouTube view statuses for different services.

Table Structure

CREATE TABLE `youtubeviews` (
  `id` int NOT NULL AUTO_INCREMENT,
  `status` int NOT NULL,
  `token_id` varchar(50) DEFAULT NULL,
  `session_id` varchar(50) DEFAULT NULL,
  `response` int DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Table Data

From the application's perspective, the status column is updated constantly, so multiple updates can be fired for the same id [primary key] value.

In this situation, whenever concurrent transactions update the same primary key, the cluster should throw a deadlock exception. Instead, Galera Cluster accepted the writes and triggered the situation that causes this bug.
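The reproduction boils down to firing conflicting updates on the same row against two different cluster nodes at nearly the same time. The host names below are placeholders:

```shell
# Hypothetical reproduction: update the same primary key on two nodes
# concurrently. node1/node2 are placeholder host names for cluster nodes.
mysql -h node1 prod -e "UPDATE youtubeviews SET status = 100 WHERE id = 5" &
mysql -h node2 prod -e "UPDATE youtubeviews SET status = 101 WHERE id = 5" &
wait
```

On a healthy cluster one of the two writes should be aborted during certification; on the affected 10.2 versions this pattern can instead stall the node.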

The node no longer accepts writes, even for other IDs, and remains in a hung state. Here is a snippet of the running processes inside the cluster:
+---------+-------------+-----------+------+---------+-------+---------------------------------------+------------------------------------------------+----------+
| Id | User | Host | db | Command | Time | State | Info | Progress |
+---------+-------------+-----------+------+---------+-------+---------------------------------------+------------------------------------------------+----------+
| 1 | system user | | NULL | Sleep | 1495 | Update_rows_log_event::find_row(2630) | update youtubeviews set status=100 where id=5 | 0.000 |
| 2 | system user | | NULL | Sleep | 73330 | wsrep aborter idle | NULL | 0.000 |
| 1052481 | root | localhost | prod | Query | 1496 | query end | update youtubeviews set status=100 where id=5 | 0.000 |
| 1052973 | root | localhost | prod | Killed | 924 | query end | update youtubeviews set status=101 where id=11 | 0.000 |
| 1052974 | root | localhost | NULL | Sleep | 190 | | NULL | 0.000 |
| 1052980 | root | localhost | prod | Query | 0 | init | show processlist | 0.000 |
+---------+-------------+-----------+------+---------+-------+---------------------------------------+------------------------------------------------+----------+

In the MariaDB error log, you will see the `WSREP: BF lock wait long` error from the title of this post.

How to bring the cluster up?

The first response from a DBA is to manually kill the query that stalled the cluster. Unfortunately, this bug is so severe that killing the query does not bring the cluster back; in fact, we were not even able to restart the node normally.

One has to manually kill the affected MariaDB cluster node's process; a forced restart then brings the node back into the cluster.

Fix

Running Galera Cluster with writes on a single node

We pointed all Galera cluster writes to a single node and kept the other nodes for reads. This avoids the error, but the setup then behaves more or less like a normal MariaDB master/slave replication.
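With ProxySQL in front of the cluster, single-writer routing can be done by leaving only one node in the writer hostgroup. This is a sketch against the ProxySQL admin interface; the hostgroup IDs, host names, and admin credentials are assumptions:

```shell
# Hypothetical ProxySQL admin session: keep node1 as the only writer
# (hostgroup 0) and move node2/node3 to the reader hostgroup (1).
mysql -h 127.0.0.1 -P 6032 -u admin -padmin -e "
  UPDATE mysql_servers SET hostgroup_id = 1
    WHERE hostname IN ('node2', 'node3');
  LOAD MYSQL SERVERS TO RUNTIME;
  SAVE MYSQL SERVERS TO DISK;"
```

Because all same-row writes then certify on one node, the conflicting-update pattern that triggers the bug never reaches certification across nodes.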

Upgrade the cluster to version 10.2.13+

As pointed out in the MariaDB release notes, upgrading the cluster to 10.2.13 is highly recommended, as it carries the fix for this bug [MDEV-12837].

We applied the same test to this newer version; as expected, when we fire updates for the same ID on different nodes, the error below is thrown and the transaction is rolled back.

Cannot run query update youtubeviews set status=100 where id=5;
ERROR : Deadlock: wsrep aborted transaction
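Since the fixed version now fails fast instead of hanging, the client side should retry a write that was aborted by certification. A minimal sketch, assuming the mysql CLI and a temporary file for the error output:

```shell
# Hypothetical client-side retry for writes aborted by wsrep certification.
QUERY="UPDATE youtubeviews SET status = 100 WHERE id = 5"
for attempt in 1 2 3; do
    if mysql prod -e "$QUERY" 2>/tmp/upd.err; then
        break                                 # write applied, stop retrying
    fi
    grep -q "Deadlock" /tmp/upd.err || break  # retry only deadlock aborts
    sleep "$attempt"                          # linear backoff between attempts
done
```

A rolled-back certification conflict is expected behavior in Galera, so application code should always be prepared to replay such transactions.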

NOTE: Since applications connect to the Galera cluster via ProxySQL, a rolling restart is all that is needed to upgrade the cluster.

Downgrading the cluster to 10.1.x

In cases where an upgrade is not possible, downgrading the cluster to 10.1 also avoids this bug: in 10.1, a deadlock error is thrown immediately when this situation is hit.




1 Comment


Ponarasu      July 25, 2020 01:18 pm


Good article, nice information, thanks for your knowledge sharing.
