MariaDB Galera Bug [WSREP: BF lock wait long]
By Sukan   |   May 18, 2018   |   Posted in : MySQL
Our setup is a three-node MariaDB Galera cluster running version 10.2.5, serving reads and writes on all three nodes. All nodes have the same hardware specification, and application traffic is split equally across them using ProxySQL. Everything works fine until, suddenly, all queries hang on a random node and the cluster eventually stalls. Even with deadlock detection and the usual timeout settings in place, the cluster still hangs and brings the entire application down.
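For context, the even traffic split described above is the standard ProxySQL layout for such a cluster. A minimal sketch of registering the three nodes with equal weight on the ProxySQL admin interface follows (the hostgroup ID and hostnames are illustrative, not taken from our environment):

-- ProxySQL admin interface (sketch): put all three Galera nodes in one
-- hostgroup with equal weight so traffic is split evenly across them
INSERT INTO mysql_servers (hostgroup_id, hostname, port, weight)
VALUES (10, 'galera-node1', 3306, 1),
       (10, 'galera-node2', 3306, 1),
       (10, 'galera-node3', 3306, 1);
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;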

We first observed this behaviour after upgrading from MariaDB 10.1 to 10.2, and the bug is highly unpredictable: we have hit it even at off-peak times with minimal traffic. The only option we had at the time was to trigger an alert when the error appeared and schedule a shell script to force-restart the affected node.

After a detailed study of the environment, queries, traffic patterns and so on, we were able to reproduce this bug.

How is this bug triggered?

Consider the table below, which stores YouTube view status for different services.

Table Structure

CREATE TABLE `youtubeviews` (
  `id` int NOT NULL AUTO_INCREMENT,
  `status` int NOT NULL,
  `token_id` varchar(50) DEFAULT NULL,
  `session_id` varchar(50) DEFAULT NULL,
  `response` int DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Table Data
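A few illustrative rows in the same shape (all values are assumed for the example, not taken from production):

INSERT INTO youtubeviews (status, token_id, session_id, response)
VALUES (100, 'tok-001', 'sess-001', 200),
       (100, 'tok-002', 'sess-002', 200),
       (101, 'tok-003', 'sess-003', 200);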

From the application's perspective, the status column is updated constantly, so multiple updates can be fired for the same id value [the primary key].

In this situation, when two transactions update the same primary key on different nodes, one of them should be aborted with a deadlock error. Instead, the Galera cluster accepted both writes and triggered the situation that causes this bug.
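To illustrate, the conflicting pattern boils down to two sessions on two different nodes updating the same row at roughly the same time (the session and node labels below are ours; the id and status values follow the processlist further down):

-- Session A, connected to node 1
BEGIN;
UPDATE youtubeviews SET status = 100 WHERE id = 5;
COMMIT;

-- Session B, connected to node 2, at almost the same moment
BEGIN;
UPDATE youtubeviews SET status = 101 WHERE id = 5;
COMMIT;   -- expected outcome: one session gets "Deadlock: wsrep aborted transaction"

On a healthy cluster, certification aborts one of the two transactions with a deadlock error; on the affected 10.2.5 nodes, this is exactly the point where the stall appears instead.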

The node then no longer accepts writes, even for other IDs, and stays in this hung state. Here is a snippet of the running processes inside the cluster:
+---------+-------------+-----------+------+---------+-------+---------------------------------------+------------------------------------------------+----------+
| Id      | User        | Host      | db   | Command | Time  | State                                 | Info                                           | Progress |
+---------+-------------+-----------+------+---------+-------+---------------------------------------+------------------------------------------------+----------+
| 1       | system user |           | NULL | Sleep   | 1495  | Update_rows_log_event::find_row(2630) | update youtubeviews set status=100 where id=5  | 0.000    |
| 2       | system user |           | NULL | Sleep   | 73330 | wsrep aborter idle                    | NULL                                           | 0.000    |
| 1052481 | root        | localhost | prod | Query   | 1496  | query end                             | update youtubeviews set status=100 where id=5  | 0.000    |
| 1052973 | root        | localhost | prod | Killed  | 924   | query end                             | update youtubeviews set status=101 where id=11 | 0.000    |
| 1052974 | root        | localhost | NULL | Sleep   | 190   |                                       | NULL                                           | 0.000    |
| 1052980 | root        | localhost | prod | Query   | 0     | init                                  | show processlist                               | 0.000    |
+---------+-------------+-----------+------+---------+-------+---------------------------------------+------------------------------------------------+----------+

The MariaDB error log on this node reports the "WSREP: BF lock wait long" error.

How to bring the cluster up?

The first response from a DBA is to manually kill the query that has stalled the cluster. Unfortunately, this bug is severe enough that killing the query does not bring the cluster back; a normal restart of the node does not work either.
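For reference, the kill attempt would target the stalled thread from the processlist above (the thread ID is taken from that snippet); with this bug the statement simply has no effect:

-- try to kill the stalled UPDATE thread (ID taken from the processlist snippet)
KILL QUERY 1052481;
-- or drop the whole connection
KILL 1052481;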

One has to forcibly kill the MariaDB process on the affected node; a forced restart of that node then brings the cluster back online.

Fix

Running Galera cluster writes on a single node

We routed all Galera cluster writes to a single node and kept the other nodes for reads. This avoids hitting the error, but it effectively reduces the cluster to a conventional master/slave MariaDB setup.
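Since ProxySQL is already in front of the cluster, this can be done by keeping a single node in the writer hostgroup and routing SELECTs to the readers. A minimal sketch (the rule ID, hostgroup IDs and hostnames are illustrative):

-- only node 1 stays in the writer hostgroup; nodes 2 and 3 move to the reader hostgroup
DELETE FROM mysql_servers;
INSERT INTO mysql_servers (hostgroup_id, hostname, port)
VALUES (10, 'galera-node1', 3306),
       (20, 'galera-node2', 3306),
       (20, 'galera-node3', 3306);

-- route SELECTs to the reader hostgroup; everything else follows the user's default (writer) hostgroup
INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply)
VALUES (1, 1, '^SELECT', 20, 1);

LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;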

Upgrading the cluster to version 10.2.13 or later

As rightly pointed out by MariaDB in their release notes, upgrading the cluster to 10.2.13 is highly recommended, as it contains the fix for this bug [MDEV-12837].

We ran the same test against the newer version; as expected, when updates for the same ID are fired on different nodes, the error below is thrown and the transaction is rolled back.

Cannot run query update youtubeviews set status=100 where id=5;
ERROR : Deadlock: wsrep aborted transaction
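On the patched version these conflicts surface as clean certification failures, so the usual Galera guidance applies: applications should catch the deadlock error and retry the transaction. The conflicts can also be tracked with Galera's standard status counters:

-- transactions aborted by brute-force (BF) aborts and failed certifications
SHOW GLOBAL STATUS LIKE 'wsrep_local_bf_aborts';
SHOW GLOBAL STATUS LIKE 'wsrep_local_cert_failures';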

NOTE: Since applications connect to the Galera cluster through ProxySQL, a rolling restart is all that is needed to upgrade the cluster.

Downgrading the cluster to a 10.1.x version

In cases where an upgrade is not possible, downgrading the cluster to 10.1 also avoids this bug, since the 10.1 versions throw a deadlock error immediately when this situation is hit.



