
How PostgreSQL Detects Distributed Deadlocks - postgrespro pg_shardman


Tags

PostgreSQL , postgrespro , pg_shardman , distributed deadlock


Background

Deadlock detection on a single node is relatively easy to implement: check whether the waiting transactions form a cycle in the wait-for graph. For example, in the figure below, A waits for B, B waits for C, and C waits for A; the cycle means a deadlock.

[Figure: wait-for cycle A → B → C → A]
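
The same idea can be sketched directly in SQL. Below is a minimal example, assuming a hypothetical wait_for(waiter, holder) edge table (not part of PostgreSQL or pg_shardman): a walk through the graph that returns to its starting transaction is a deadlock cycle.

-- Hypothetical edge table: one row per "waiter waits for holder" edge.
CREATE TABLE wait_for (waiter int, holder int);
INSERT INTO wait_for VALUES (1, 2), (2, 3), (3, 1);  -- A→B, B→C, C→A

-- Walk the graph; a path that comes back to its start is a cycle.
WITH RECURSIVE walk(start, cur, path, cycle) AS (
    SELECT waiter, holder, ARRAY[waiter, holder], waiter = holder
    FROM wait_for
  UNION ALL
    SELECT w.start, e.holder, w.path || e.holder, e.holder = w.start
    FROM walk w JOIN wait_for e ON e.waiter = w.cur
    WHERE NOT w.cycle AND e.holder <> ALL(w.path[2:])  -- prune revisits
)
SELECT path FROM walk WHERE cycle;  -- e.g. {1,2,3,1}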

For distributed transactions, however, the transactions waiting on each other may sit on different data nodes. How can a distributed deadlock be detected?

To check for cycles, a global transaction number is required, or at least some identifier that marks the same transaction on every data node. Only then can a global wait-for graph be drawn and tested for cycles.
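
pg_shardman gets this identifier by encoding the coordinator into the 2PC GID and into application_name (both formats appear in the view source below). For illustration, assuming the documented gid layout pgfdw:$timestamp:$sys_id:$pid:$xid:$participants_count:$coord_count, any node can map a prepared transaction back to its coordinator:

-- Extract the coordinator's system id and backend pid from a
-- pg_shardman-style GID (fields 3 and 4 of the colon-separated string).
SELECT gid,
       split_part(gid, ':', 3)::bigint AS coordinator_sysid,
       split_part(gid, ':', 4)::int    AS coordinator_pid
FROM pg_prepared_xacts
WHERE gid LIKE 'pgfdw:%';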

On the other hand, detecting a distributed deadlock means reading lock-wait information on every data node, and those reads happen at different times. A cycle found in a single pass therefore does not prove a real deadlock: by the time you inspect the other nodes, a wait you saw earlier may already have disappeared (a "false deadlock"). Two passes are needed, and only when the same wait cycle (identical in every detail, including the GID) shows up in both passes is it treated as a genuine deadlock.

[Figure: the same wait cycle must appear in two consecutive polls to be confirmed as a deadlock]

postgrespro pg_shardman

pg_shardman has no built-in deadlock detector. Instead it exposes a UDF: the user calls this function, which loops forever, checks for distributed deadlocks, and cancels a randomly chosen query from any deadlock it finds.

monitor(check_timeout_sec int = 5, rm_node_timeout_sec int = 60)  

Monitor the cluster for distributed deadlocks and node failures. This function is intended to be executed at the shardlord; if launched at any other node, it is redirected to the shardlord. It starts an infinite loop which polls all cluster nodes, collecting local lock graphs from each of them. The poll period is specified by the check_timeout_sec parameter (default 5 seconds). The local lock graphs are combined into a global lock graph, which is analyzed for the presence of loops. A loop in the lock graph means a distributed deadlock. The monitor function tries to resolve the deadlock by canceling one or more backends involved in the deadlock loop (using the pg_cancel_backend function, which doesn't actually terminate the backend but tries to cancel its current query). The canceled backend is chosen randomly within the deadlock loop. Since not all deadlock members are hung in the 'active query' state, the cancel may need to be sent several times.
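
A minimal invocation, assuming the function lives in the shardman schema like the other objects shown below; it blocks forever, so run it from a dedicated session on the shardlord:

-- Poll every 5 seconds; drop an unreachable node after 60 seconds.
SELECT shardman.monitor(check_timeout_sec => 5, rm_node_timeout_sec => 60);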

Since the local graphs collected from all nodes do not form a consistent global snapshot, false positives are possible: the edges in a deadlock loop correspond to different moments in time. To prevent false deadlock detection, the monitor function doesn't react to a detected deadlock immediately. Instead, the deadlock loop found in the previous iteration is compared with the current one, and only if they are equal is the deadlock reported and resolution performed.
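
A sketch of that confirm-before-cancel loop in PL/pgSQL. This is not the actual implementation: detect_deadlock_loop() is a hypothetical helper that returns the current loop serialized as text (in the spirit of serialize_lock_graph below) or NULL, and for brevity the victim is assumed to be a local backend, whereas the real monitor routes the cancel to the node that owns the pid.

-- Hypothetical sketch: act only when the same loop is seen in two
-- consecutive polls.
DO $$
DECLARE
    prev_loop text := NULL;
    cur_loop  text;
    victim    text;
BEGIN
    LOOP
        cur_loop := detect_deadlock_loop();  -- hypothetical helper
        IF cur_loop IS NOT NULL AND cur_loop = prev_loop THEN
            -- Same loop twice in a row: treat it as a real deadlock.
            -- Pick a random "node:pid" member and cancel its query.
            SELECT m INTO victim
            FROM unnest(string_to_array(cur_loop, '->')) AS m
            ORDER BY random() LIMIT 1;
            PERFORM pg_cancel_backend(split_part(victim, ':', 2)::int);
        END IF;
        prev_loop := cur_loop;
        PERFORM pg_sleep(5);  -- check_timeout_sec
    END LOOP;
END $$;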

If some node is unreachable, the monitor function prints a corresponding error message and retries access until the rm_node_timeout_sec timeout expires. After that the node is removed from the cluster using the shardman.rm_node function. If the redundancy level is non-zero, primary partitions from the disabled node are replaced with replicas. Finally, pg_shardman performs recovery of the distributed transactions for which the failed node was the coordinator. This is done with the shardman.recover_xacts() function, which collects the status of a distributed transaction at all participants and tries to decide whether it should be committed or aborted. If rm_node_timeout_sec is NULL, monitor will not remove nodes.

References

https://github.com/postgrespro/pg_shardman/blob/native_partitioning/pg_shardman--0.0.3.sql

The objects that pg_shardman's distributed deadlock detection relies on:

-- Type to represent vertex in lock graph  
create type process as (node int, pid int);  
  
-- View to build lock graph which can be used to detect global deadlock.  
-- Application_name is assumed pgfdw:$system_id:$coord_pid  
-- gid is assumed pgfdw:$timestamp:$sys_id:$pid:$xid:$participants_count:$coord_count  
-- Currently we are oblivious about lock modes and report any wait -> hold edge  
-- on the same object and therefore might produce false loops. Furthermore,  
-- we have no idea about locking queues here. Probably it is better to use
-- pg_blocking_pids, but it seems to ignore prepared xacts.  
CREATE VIEW lock_graph(wait, hold) AS  
	-- local dependencies  
    -- If xact is already prepared, we take node and pid of the coordinator.  
	SELECT  
		ROW(shardman.get_my_id(),  
			wait.pid)::shardman.process,  
	 	CASE WHEN hold.pid IS NOT NULL THEN  
		    ROW(shardman.get_my_id(), hold.pid)::shardman.process  
		ELSE -- prepared  
			ROW(shardman.get_node_by_sysid(split_part(gid, ':', 3)::bigint),  
				split_part(gid, ':', 4)::int)::shardman.process  
		END  
     FROM pg_locks wait, pg_locks hold LEFT OUTER JOIN pg_prepared_xacts twopc  
			  ON twopc.transaction=hold.transactionid  
	WHERE  
		NOT wait.granted AND wait.pid IS NOT NULL AND hold.granted AND  
		-- waiter waits for the object the holder locks
		wait.database IS NOT DISTINCT FROM hold.database AND  
		wait.relation IS NOT DISTINCT FROM hold.relation AND  
		wait.page IS NOT DISTINCT FROM hold.page AND  
		wait.tuple IS NOT DISTINCT FROM hold.tuple AND  
		wait.virtualxid IS NOT DISTINCT FROM hold.virtualxid AND  
		wait.transactionid IS NOT DISTINCT FROM hold.transactionid AND -- waiting on xid  
		wait.classid IS NOT DISTINCT FROM hold.classid AND  
		wait.objid IS NOT DISTINCT FROM hold.objid AND  
		wait.objsubid IS NOT DISTINCT FROM hold.objsubid AND  
		 -- this is most probably a truism, but who knows
		(hold.pid IS NOT NULL OR twopc.gid IS NOT NULL)  
	UNION ALL  
	-- if this fdw backend is busy, potentially waiting, add edge coordinator -> fdw  
	SELECT ROW(shardman.get_node_by_sysid(split_part(application_name, ':', 2)::bigint),  
			   split_part(application_name,':',3)::int)::shardman.process,  
		   ROW(shardman.get_my_id(),  
			   pid)::shardman.process  
	FROM pg_stat_activity WHERE application_name LIKE 'pgfdw:%' AND wait_event<>'ClientRead'  
	UNION ALL  
	-- otherwise, coordinator itself is busy, potentially waiting, so add fdw ->  
	-- coordinator edge  
	SELECT ROW(shardman.get_my_id(),  
			   pid)::shardman.process,  
		   ROW(shardman.get_node_by_sysid(split_part(application_name,':',2)::bigint),  
			   split_part(application_name,':',3)::int)::shardman.process  
	FROM pg_stat_activity WHERE application_name LIKE 'pgfdw:%' AND wait_event='ClientRead';  
  
-- Pack lock graph into comma-separated string of edges like "2:17439->4:30046",  
-- i.e. pid 17439 on node 2 waits for pid 30046 on node 4  
CREATE FUNCTION serialize_lock_graph() RETURNS TEXT AS $$  
	SELECT COALESCE(  
		string_agg((wait).node || ':' || (wait).pid || '->' ||  
				   (hold).node || ':' || (hold).pid,  
				   ','),  
		'')  
	FROM shardman.lock_graph;  
$$ LANGUAGE sql;  
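
Called on a node (assuming the extension's shardman schema), the function returns the local edges in that compact form; the example from the comment above would come back as:

SELECT shardman.serialize_lock_graph();
-- '2:17439->4:30046'  (pid 17439 on node 2 waits for pid 30046 on node 4)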
