
A strange Greenplum network problem

gun_hap 2017-11-24 16:44:51 2005

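I recently ran into a Greenplum network problem and would like to ask how to resolve it. Any help from the experts here is appreciated, thanks!
Environment: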
CentOS Linux release 7.2.1511 (Core)
greenplum version: 4.3.16
master: 1 segment: 36
Hardware: physical servers with 40 CPU cores and 128 GB of memory each, gigabit network

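Symptom: the problem is intermittent, currently showing up on one or two days per week. While stored procedures are running in batch, the following failure appears in the master log: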
"ERROR","58M01","Interconnect encountered a network error, please check your network (seg25 slice2 gp-s0005:40001 pid=162649)","Failed to send packet (seq 51) to 10.159.173.3:43904 (pid 39171 cid 13) after 3577 retries in 3600 seconds",,,,"SQL statement ""insert into waterheater.XXX(fmacid,product_no,province,city_name,series,product_code,product_name) select B.fmacid,A.product_no,A.province,A.city_name,C.series,C.product_code,C.product_name from public.mystall A left join public.mac_info B on A.product_no = B.sern left join mymodel C on B.product_code = C.product_code where A.class2='电热水器'""
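To check whether every failure really targets the same destination, the detail field of the error can be parsed mechanically. The sketch below is only an illustration: the regular expression is derived from the single log line quoted above, and the names PACKET_ERROR, parse_packet_error and count_destinations are made up for this example. The 3600 seconds in the message looks like the interconnect transmit timeout (gp_interconnect_transmit_timeout, which defaults to 3600 as far as I know), but treat that as an assumption to verify against your own settings.

```python
import re
from collections import Counter

# Pattern inferred from the one "Failed to send packet" line in this post;
# other Greenplum versions may word the message differently.
PACKET_ERROR = re.compile(
    r"Failed to send packet \(seq (?P<seq>\d+)\) to "
    r"(?P<ip>\d+\.\d+\.\d+\.\d+):(?P<port>\d+) "
    r"\(pid (?P<pid>\d+) cid (?P<cid>\d+)\) "
    r"after (?P<retries>\d+) retries in (?P<seconds>\d+) seconds"
)


def parse_packet_error(line):
    """Return the fields of one interconnect send failure, or None."""
    m = PACKET_ERROR.search(line)
    if m is None:
        return None
    # Keep the IP as text; every other captured field is an integer.
    return {k: (v if k == "ip" else int(v)) for k, v in m.groupdict().items()}


def count_destinations(log_lines):
    """Count send failures per destination IP across many log lines."""
    counts = Counter()
    for line in log_lines:
        fields = parse_packet_error(line)
        if fields is not None:
            counts[fields["ip"]] += 1
    return counts


sample = ('"Failed to send packet (seq 51) to 10.159.173.3:43904 '
          '(pid 39171 cid 13) after 3577 retries in 3600 seconds"')
info = parse_packet_error(sample)
print(info["ip"], info["retries"], info["seconds"])
```

Feeding all master log lines to count_destinations would show at a glance whether 10.159.173.3 is the only receiving host that ever appears in these errors.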

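The analysis so far:
(1) Checked all Greenplum-related system parameters; no problems found.
(2) Tested all segment hosts with gpcheckperf; network throughput was normal for the whole test, about 100 MB/s everywhere.
One thing is odd, though: in every bulk write that failed, the failure was always on sending to 10.159.173.3, while traffic between the other segment hosts has never failed.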
"Failed to send packet (seq 51) to 10.159.173.3:43904 (pid 39171 cid 13) after 3577 retries in 3600 seconds",
(3) Checked select * from pg_stat_activity; no conflicting SQL.
(4) Looked at pg_locks: 1400 AccessShareLock entries in total, all with granted = t, and 74 ExclusiveLock entries. A sample is below; the rest look much the same.
haieredw=# select * from pg_locks where mode='ExclusiveLock';
locktype | database | relation | page | tuple | transactionid | classid | objid | objsubid | transaction | pid | mode | granted | mppsessionid | mppiswriter | gp_seg
transactionid | | | | | 14593745 | | | | 14593745 | 49858 | ExclusiveLock | t | 5380 | t | -1
transactionid | | | | | 14593687 | | | | 14593687 | 1191 | ExclusiveLock | t | 5370 | t | -1
transactionid | | | | | 2621190 | | | | 2621190 | 136780 | ExclusiveLock | t | 5380 | t |
..........

(5) Checked with gpstate -m/-s/-f and select * from gp_segment_configuration; no problems there either.

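(6) On the master I found that the problem SQL currently running has sess_id 5370. I used gpssh -f seg_host && ps -aux | grep con5370 to inspect the running statement on the segment hosts. Below is the output from two of the servers. On 10.159.173.3 there are 6 processes in the idle state, which completely baffles me; no other segment host in the cluster has any idle processes for this session.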
[sdw3] gpadmin 161237 0.0 0.0 542084 16448 ? SNsl 15:46 0:00 postgres: port 40000, gpadmin haieredw 10.159.173.55(22441) con5370 seg12 cmd8 slice4 MPPEXEC SELECT
[sdw3] gpadmin 161239 0.0 0.0 542088 14268 ? SNsl 15:46 0:00 postgres: port 40001, gpadmin haieredw 10.159.173.55(63130) con5370 seg13 cmd8 slice4 MPPEXEC SELECT
[sdw3] gpadmin 161241 0.0 0.0 542084 15692 ? SNsl 15:46 0:00 postgres: port 40002, gpadmin haieredw 10.159.173.55(58606) con5370 seg14 cmd8 slice4 MPPEXEC SELECT
[sdw3] gpadmin 161243 0.0 0.0 542088 16312 ? SNsl 15:46 0:00 postgres: port 40003, gpadmin haieredw 10.159.173.55(19578) con5370 seg15 cmd8 slice4 MPPEXEC SELECT
[sdw3] gpadmin 161245 0.0 0.0 542088 15804 ? SNsl 15:46 0:00 postgres: port 40004, gpadmin haieredw 10.159.173.55(53340) con5370 seg16 cmd8 slice4 MPPEXEC SELECT
[sdw3] gpadmin 161247 0.0 0.0 542080 16464 ? SNsl 15:46 0:00 postgres: port 40005, gpadmin haieredw 10.159.173.55(39318) con5370 seg17 cmd8 slice4 MPPEXEC SELECT
[sdw3] gpadmin 161312 0.0 0.0 542368 16484 ? SNsl 15:46 0:00 postgres: port 40000, gpadmin haieredw 10.159.173.55(22477) con5370 seg12 cmd8 slice3 MPPEXEC SELECT
[sdw3] gpadmin 161314 0.0 0.0 542372 14200 ? SNsl 15:46 0:00 postgres: port 40001, gpadmin haieredw 10.159.173.55(63166) con5370 seg13 cmd8 slice3 MPPEXEC SELECT
[sdw3] gpadmin 161316 0.0 0.0 542368 15608 ? SNsl 15:46 0:00 postgres: port 40002, gpadmin haieredw 10.159.173.55(58642) con5370 seg14 cmd8 slice3 MPPEXEC SELECT
[sdw3] gpadmin 161318 0.0 0.0 542372 16212 ? SNsl 15:46 0:00 postgres: port 40003, gpadmin haieredw 10.159.173.55(19614) con5370 seg15 cmd8 slice3 MPPEXEC SELECT
[sdw3] gpadmin 161320 0.0 0.0 542372 15704 ? SNsl 15:46 0:00 postgres: port 40004, gpadmin haieredw 10.159.173.55(53376) con5370 seg16 cmd8 slice3 MPPEXEC SELECT
[sdw3] gpadmin 161322 0.0 0.0 542364 16244 ? SNsl 15:46 0:00 postgres: port 40005, gpadmin haieredw 10.159.173.55(39354) con5370 seg17 cmd8 slice3 MPPEXEC SELECT
[sdw3] gpadmin 161324 0.0 0.0 552656 22872 ? SNsl 15:46 0:00 postgres: port 40000, gpadmin haieredw 10.159.173.55(22513) con5370 seg12 cmd8 slice2 MPPEXEC SELECT
[sdw3] gpadmin 161326 0.0 0.0 552660 20856 ? SNsl 15:46 0:00 postgres: port 40001, gpadmin haieredw 10.159.173.55(63202) con5370 seg13 cmd8 slice2 MPPEXEC SELECT
[sdw3] gpadmin 161328 0.0 0.0 552656 22252 ? SNsl 15:46 0:00 postgres: port 40002, gpadmin haieredw 10.159.173.55(58678) con5370 seg14 cmd8 slice2 MPPEXEC SELECT
[sdw3] gpadmin 161330 0.0 0.0 552660 22840 ? SNsl 15:46 0:00 postgres: port 40003, gpadmin haieredw 10.159.173.55(19650) con5370 seg15 cmd8 slice2 MPPEXEC SELECT
[sdw3] gpadmin 161332 0.0 0.0 552668 22520 ? SNsl 15:46 0:00 postgres: port 40004, gpadmin haieredw 10.159.173.55(53412) con5370 seg16 cmd8 slice2 MPPEXEC SELECT
[sdw3] gpadmin 161334 0.0 0.0 552652 22896 ? SNsl 15:46 0:00 postgres: port 40005, gpadmin haieredw 10.159.173.55(39390) con5370 seg17 cmd8 slice2 MPPEXEC SELECT
[sdw3] gpadmin 161336 0.0 0.0 541844 16852 ? SNsl 15:46 0:00 postgres: port 40000, gpadmin haieredw 10.159.173.55(22550) con5370 seg12 idle
[sdw3] gpadmin 161338 0.0 0.0 541848 14816 ? SNsl 15:46 0:00 postgres: port 40001, gpadmin haieredw 10.159.173.55(63239) con5370 seg13 idle
[sdw3] gpadmin 161340 0.0 0.0 541832 16356 ? SNsl 15:46 0:00 postgres: port 40002, gpadmin haieredw 10.159.173.55(58715) con5370 seg14 idle
[sdw3] gpadmin 161342 0.0 0.0 541836 16960 ? SNsl 15:46 0:00 postgres: port 40003, gpadmin haieredw 10.159.173.55(19687) con5370 seg15 idle
[sdw3] gpadmin 161344 0.0 0.0 541836 16376 ? SNsl 15:46 0:00 postgres: port 40004, gpadmin haieredw 10.159.173.55(53449) con5370 seg16 idle
[sdw3] gpadmin 161346 0.0 0.0 541840 16868 ? SNsl 15:46 0:00 postgres: port 40005, gpadmin haieredw 10.159.173.55(39427) con5370 seg17 idle
[sdw3] gpadmin 167233 0.0 0.0 112656 956 pts/30 S+ 16:08 0:00 grep --color=auto con5370
[sdw4] gpadmin 15600 0.0 0.0 542084 16380 ? SNsl 15:46 0:00 postgres: port 40000, gpadmin haieredw 10.159.173.55(47355) con5370 seg18 cmd8 slice4 MPPEXEC SELECT
[sdw4] gpadmin 15602 0.0 0.0 542076 16156 ? SNsl 15:46 0:00 postgres: port 40001, gpadmin haieredw 10.159.173.55(11008) con5370 seg19 cmd8 slice4 MPPEXEC SELECT
[sdw4] gpadmin 15604 0.0 0.0 542084 14308 ? SNsl 15:46 0:00 postgres: port 40002, gpadmin haieredw 10.159.173.55(22043) con5370 seg20 cmd8 slice4 MPPEXEC SELECT
[sdw4] gpadmin 15606 0.0 0.0 542084 15796 ? SNsl 15:46 0:00 postgres: port 40003, gpadmin haieredw 10.159.173.55(60511) con5370 seg21 cmd8 slice4 MPPEXEC SELECT
[sdw4] gpadmin 15608 0.0 0.0 542080 16404 ? SNsl 15:46 0:00 postgres: port 40004, gpadmin haieredw 10.159.173.55(27627) con5370 seg22 cmd8 slice4 MPPEXEC SELECT
[sdw4] gpadmin 15610 0.0 0.0 542088 16440 ? SNsl 15:46 0:00 postgres: port 40005, gpadmin haieredw 10.159.173.55(23382) con5370 seg23 cmd8 slice4 MPPEXEC SELECT
[sdw4] gpadmin 15612 0.0 0.0 542368 16228 ? SNsl 15:46 0:00 postgres: port 40000, gpadmin haieredw 10.159.173.55(47391) con5370 seg18 cmd8 slice3 MPPEXEC SELECT
[sdw4] gpadmin 15614 0.0 0.0 542368 16188 ? SNsl 15:46 0:00 postgres: port 40001, gpadmin haieredw 10.159.173.55(11044) con5370 seg19 cmd8 slice3 MPPEXEC SELECT
[sdw4] gpadmin 15616 0.0 0.0 542368 14208 ? SNsl 15:46 0:00 postgres: port 40002, gpadmin haieredw 10.159.173.55(22079) con5370 seg20 cmd8 slice3 MPPEXEC SELECT
[sdw4] gpadmin 15618 0.0 0.0 542368 15828 ? SNsl 15:46 0:00 postgres: port 40003, gpadmin haieredw 10.159.173.55(60547) con5370 seg21 cmd8 slice3 MPPEXEC SELECT
[sdw4] gpadmin 15620 0.0 0.0 542364 16148 ? SNsl 15:46 0:00 postgres: port 40004, gpadmin haieredw 10.159.173.55(27663) con5370 seg22 cmd8 slice3 MPPEXEC SELECT
[sdw4] gpadmin 15622 0.0 0.0 542372 16220 ? SNsl 15:46 0:00 postgres: port 40005, gpadmin haieredw 10.159.173.55(23418) con5370 seg23 cmd8 slice3 MPPEXEC SELECT
[sdw4] gpadmin 15624 0.0 0.0 552656 25508 ? SNsl 15:46 0:00 postgres: port 40000, gpadmin haieredw 10.159.173.55(47427) con5370 seg18 cmd8 slice2 MPPEXEC SELECT
[sdw4] gpadmin 15626 0.0 0.0 552656 25476 ? SNsl 15:46 0:00 postgres: port 40001, gpadmin haieredw 10.159.173.55(11080) con5370 seg19 cmd8 slice2 MPPEXEC SELECT
[sdw4] gpadmin 15628 0.0 0.0 552656 23496 ? SNsl 15:46 0:00 postgres: port 40002, gpadmin haieredw 10.159.173.55(22115) con5370 seg20 cmd8 slice2 MPPEXEC SELECT
[sdw4] gpadmin 15630 0.0 0.0 552656 25116 ? SNsl 15:46 0:00 postgres: port 40003, gpadmin haieredw 10.159.173.55(60583) con5370 seg21 cmd8 slice2 MPPEXEC SELECT
[sdw4] gpadmin 15632 0.0 0.0 552652 25408 ? SNsl 15:46 0:00 postgres: port 40004, gpadmin haieredw 10.159.173.55(27699) con5370 seg22 cmd8 slice2 MPPEXEC SELECT
[sdw4] gpadmin 15634 0.0 0.0 552660 25540 ? SNsl 15:46 0:00 postgres: port 40005, gpadmin haieredw 10.159.173.55(23454) con5370 seg23 cmd8 slice2 MPPEXEC SELECT
[sdw4] gpadmin 15636 0.1 0.0 542612 16840 ? SNsl 15:46 0:01 postgres: port 40000, gpadmin haieredw 10.159.173.55(47464) con5370 seg18 cmd8 slice1 MPPEXEC SELECT
[sdw4] gpadmin 15638 0.1 0.0 542604 16720 ? SNsl 15:46 0:01 postgres: port 40001, gpadmin haieredw 10.159.173.55(11117) con5370 seg19 cmd8 slice1 MPPEXEC SELECT
[sdw4] gpadmin 15640 0.1 0.0 542612 14836 ? SNsl 15:46 0:01 postgres: port 40002, gpadmin haieredw 10.159.173.55(22152) con5370 seg20 cmd8 slice1 MPPEXEC SELECT
[sdw4] gpadmin 15642 0.1 0.0 542604 16040 ? SNsl 15:46 0:01 postgres: port 40003, gpadmin haieredw 10.159.173.55(60620) con5370 seg21 cmd8 slice1 MPPEXEC SELECT
[sdw4] gpadmin 15644 0.1 0.0 542600 16668 ? SNsl 15:46 0:01 postgres: port 40004, gpadmin haieredw 10.159.173.55(27736) con5370 seg22 cmd8 slice1 MPPEXEC SELECT
[sdw4] gpadmin 15646 0.1 0.0 542608 16752 ? SNsl 15:46 0:01 postgres: port 40005, gpadmin haieredw 10.159.173.55(23491) con5370 seg23 cmd8 slice1 MPPEXEC SELECT
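A listing this long is easier to compare host by host once it is tallied. The sketch below counts the backends of one session per host and per slice (or idle). It assumes the gpssh output format shown above, with a [host] prefix and the process title at the end of each line; the names LINE and summarize are invented for this example.

```python
import re
from collections import Counter

# Matches "[host] ... conNNNN segNN cmdN sliceN ..." or "... conNNNN segNN idle".
# The grep process itself has no "segNN" after the session id, so it is skipped.
LINE = re.compile(
    r"^\[(?P<host>[^\]]+)\].*\bcon(?P<session>\d+) seg(?P<seg>\d+) "
    r"(?:cmd\d+ (?P<slice>slice\d+)\b.*|(?P<idle>idle))$"
)


def summarize(ps_lines):
    """Count session backends per (host, slice-or-idle) pair."""
    counts = Counter()
    for line in ps_lines:
        m = LINE.match(line.strip())
        if m is None:
            continue  # not a segment backend line
        state = m.group("slice") or "idle"
        counts[(m.group("host"), state)] += 1
    return counts


lines = [
    "[sdw3] gpadmin 161324 0.0 0.0 552656 22872 ? SNsl 15:46 0:00 postgres: port 40000, gpadmin haieredw 10.159.173.55(22513) con5370 seg12 cmd8 slice2 MPPEXEC SELECT",
    "[sdw3] gpadmin 161336 0.0 0.0 541844 16852 ? SNsl 15:46 0:00 postgres: port 40000, gpadmin haieredw 10.159.173.55(22550) con5370 seg12 idle",
    "[sdw3] gpadmin 167233 0.0 0.0 112656 956 pts/30 S+ 16:08 0:00 grep --color=auto con5370",
    "[sdw4] gpadmin 15636 0.1 0.0 542612 16840 ? SNsl 15:46 0:01 postgres: port 40000, gpadmin haieredw 10.159.173.55(47464) con5370 seg18 cmd8 slice1 MPPEXEC SELECT",
]
summary = summarize(lines)
for (host, state), n in sorted(summary.items()):
    print(host, state, n)
```

Applied to the full listing above, the tally should show six backends each for slice2, slice3 and slice4 on both hosts, six slice1 backends on sdw4, and on sdw3 six idle backends with no slice1 entries at all.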

All answers (1)
  • 樱桃味
    2019-07-17 21:45:05

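    I have run into this problem too. It mostly happens while a large volume of data (tens of millions of rows) is being synchronized. I have not found a solution.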
