Problem analysis
- Approach: separate read, write, and scan requests so that each type is served by its own internal queue group in the source code: WriteQueues, ReadQueues, and ScanQueues.
- Tuning parameters
  - hbase.regionserver.handler.count: defaults to 30; the number of server-side threads that handle user requests. In production this is usually raised to 100~200. Our production value: 128.
  - hbase.ipc.server.callqueue.handler.factor: defaults to 0; controls the number of call queues on the server. For example, with 0.1 the server creates handler.count * 0.1 = 30 * 0.1 = 3 queues. Our production value: 0.2.
  - hbase.ipc.server.callqueue.read.ratio: defaults to 0; the share of queues and handlers given to reads versus writes. For example, 0.5 means reads and writes each get half of the queues and half of the handlers. Our production value: 0.6.
  - hbase.ipc.server.callqueue.scan.ratio: defaults to 0; carves scan traffic out of the read share so that get and scan are isolated from each other. Our production value: 0.1.
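Put together, the production values quoted above would look like this in hbase-site.xml (a sketch; adjust the numbers to your own cluster):

```xml
<!-- hbase-site.xml: the production values quoted above -->
<property>
  <name>hbase.regionserver.handler.count</name>
  <value>128</value>
</property>
<property>
  <name>hbase.ipc.server.callqueue.handler.factor</name>
  <value>0.2</value>
</property>
<property>
  <name>hbase.ipc.server.callqueue.read.ratio</name>
  <value>0.6</value>
</property>
<property>
  <name>hbase.ipc.server.callqueue.scan.ratio</name>
  <value>0.1</value>
</property>
```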
hbase.ipc.server.callqueue.scan.ratio defaults to 0, so scan requests land directly in the read queues. Long-running scans then hold read handlers without releasing them, and normal read requests queue up behind them and trigger read-timeout alerts.
- Source-code walkthrough
Entry point: the HRegionServer constructor
1. rpcServices = createRpcServices();
2. rpcSchedulerFactory = getRpcSchedulerFactoryClass().asSubclass(RpcSchedulerFactory.class).getDeclaredConstructor().newInstance();
3. SimpleRpcSchedulerFactory.create creates a SimpleRpcScheduler.
4. The SimpleRpcScheduler constructor creates the RWQueueRpcExecutor:

```java
if (callqReadShare > 0) {
  // at least 1 read handler and 1 write handler
  callExecutor = new RWQueueRpcExecutor("default.RWQ", Math.max(2, handlerCount),
      maxQueueLength, priority, conf, server);
}
```
5. The RWQueueRpcExecutor constructor initializes the relevant parameters:
```java
public RWQueueRpcExecutor(final String name, final int handlerCount, final int maxQueueLength,
    final PriorityFunction priority, final Configuration conf, final Abortable abortable) {
  super(name, handlerCount, maxQueueLength, priority, conf, abortable);
  // Defaults to 0; the share of queues and handlers given to reads vs writes.
  // e.g. 0.5 means reads and writes each get half the queues and half the handlers.
  float callqReadShare = getReadShare(conf);
  // hbase.ipc.server.callqueue.scan.ratio = 0.2
  float callqScanShare = getScanShare(conf);

  // Write queues: 10 * 0.5 = 5 queues
  numWriteQueues = calcNumWriters(this.numCallQueues, callqReadShare);
  // Write handlers: 100 * 0.5 = 50 threads
  writeHandlersCount = Math.max(numWriteQueues, calcNumWriters(handlerCount, callqReadShare));

  // Read queues: numCallQueues minus the write queues, 10 - 5 = 5
  int readQueues = calcNumReaders(this.numCallQueues, callqReadShare);
  // Read handlers: 50 threads
  int readHandlers = Math.max(readQueues, calcNumReaders(handlerCount, callqReadShare));

  // Scan queues: 5 * 0.2 = 1
  int scanQueues = Math.max(0, (int) Math.floor(readQueues * callqScanShare));
  // Scan handlers: 50 * 0.2 = 10
  int scanHandlers = Math.max(0, (int) Math.floor(readHandlers * callqScanShare));

  if ((readQueues - scanQueues) > 0) {
    // Scan is part of read, so scan takes its share out of the read resources:
    // readQueues = 5 - 1 = 4, readHandlers = 50 - 10 = 40
    readQueues -= scanQueues;
    readHandlers -= scanHandlers;
  } else {
    scanQueues = 0;
    scanHandlers = 0;
  }

  numReadQueues = readQueues;
  readHandlersCount = readHandlers;
  numScanQueues = scanQueues;
  scanHandlersCount = scanHandlers;

  // Each queue group gets its own balancer; the strategy is to dispatch
  // each call to a randomly chosen queue within the group.
  this.writeBalancer = getBalancer(numWriteQueues);
  this.readBalancer = getBalancer(numReadQueues);
  this.scanBalancer = numScanQueues > 0 ? getBalancer(numScanQueues) : null;

  initializeQueues(numWriteQueues);
  initializeQueues(numReadQueues);
  initializeQueues(numScanQueues);
}
```
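The read/write split itself comes from two small helpers referenced above, calcNumWriters and calcNumReaders. The sketch below is my reading of their behavior in the HBase source (readers get the rounded share, writers get the remainder, with a floor of 1 each); treat the exact rounding as an assumption:

```java
// Minimal reproduction of the read/write split helpers used by
// RWQueueRpcExecutor. Assumption: readers get round(count * readShare)
// (at least 1), writers get whatever is left (at least 1).
public class CallQueueSplit {
    // Number of write queues/handlers out of `count`, given the read share.
    static int calcNumWriters(final int count, final float readShare) {
        return Math.max(1, count - Math.max(1, Math.round(count * readShare)));
    }

    // Number of read queues/handlers: whatever the writers leave over.
    static int calcNumReaders(final int count, final float readShare) {
        return count - calcNumWriters(count, readShare);
    }

    public static void main(String[] args) {
        // 10 call queues, read.ratio = 0.5  ->  5 write queues, 5 read queues
        System.out.println(calcNumWriters(10, 0.5f)); // 5
        System.out.println(calcNumReaders(10, 0.5f)); // 5
    }
}
```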
Summary
On the server side, the regionserver uses ReadQueues, WriteQueues, and ScanQueues in place of a single traditional thread pool for client read/write requests. Each queue group is consumed by a proportional share of the hbase.regionserver.handler.count handler threads. Load balancing within a group is random: for example, ReadQueues picks the next queue in getNextQueue via ThreadLocalRandom.current().nextInt(queueSize).
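The random balancing strategy can be sketched as follows. This is a simplified stand-in for HBase's internal queue balancer, not the actual class; only the ThreadLocalRandom call mirrors the source:

```java
import java.util.concurrent.ThreadLocalRandom;

// Simplified stand-in for the random queue balancer described above:
// each call is dispatched to one of `queueSize` queues picked at random.
public class RandomQueueBalancer {
    private final int queueSize;

    public RandomQueueBalancer(int queueSize) {
        this.queueSize = queueSize;
    }

    // Mirrors the strategy in the text: ThreadLocalRandom.current().nextInt(queueSize)
    public int getNextQueue() {
        return ThreadLocalRandom.current().nextInt(queueSize);
    }

    public static void main(String[] args) {
        // e.g. 4 read (get) queues, as in the worked example below
        RandomQueueBalancer readBalancer = new RandomQueueBalancer(4);
        for (int i = 0; i < 5; i++) {
            // always returns an index in [0, 4)
            System.out.println("dispatch to read queue " + readBalancer.getNextQueue());
        }
    }
}
```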
Reproducing the problem:
Alibaba's arthas can be used to analyze the thread states of the regionserver process.
- Queue-count calculation
handler.count is 100, i.e. the regionserver has 100 threads in total for handling read/write requests.
Total queues: 100 (hbase.regionserver.handler.count) × 0.1 (hbase.ipc.server.callqueue.handler.factor) = 10 read/write queues in total.
Write queues: 10 × 0.5 (hbase.ipc.server.callqueue.read.ratio) = 5.
Read queues: 10 × 0.5 = 5. Note that "read" covers both get and scan, which is why the scan.ratio parameter expresses scan's share of the read resources.
Get queues: 5 − 5 × 0.2 (hbase.ipc.server.callqueue.scan.ratio) = 4.
Scan queues: 5 × 0.2 = 1.
- Handler-thread calculation
The thread split follows the same ratios as the queue split. For example: write handler threads: 100 × 0.5 = 50; get handler threads: 100 × 0.4 = 40; scan handler threads: 100 × 0.1 = 10.
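The two calculations above can be checked end to end with a small sketch. The rounding below is a simplification of RWQueueRpcExecutor's helpers (an assumption on my part), but it reproduces the figures used in the text:

```java
// Worked example matching the figures in the text:
// handler.count=100, handler.factor=0.1, read.ratio=0.5, scan.ratio=0.2.
public class QueueMath {
    // Returns {writeQueues, getQueues, scanQueues,
    //          writeHandlers, getHandlers, scanHandlers}.
    static int[] split(int handlerCount, float factor, float readRatio, float scanRatio) {
        int numCallQueues = Math.max(1, Math.round(handlerCount * factor)); // 100 * 0.1 = 10
        int readQueues  = Math.round(numCallQueues * readRatio);            // 10 * 0.5 = 5
        int writeQueues = numCallQueues - readQueues;                       // 5
        int scanQueues  = (int) Math.floor(readQueues * scanRatio);         // scan is carved out of read: 1
        int getQueues   = readQueues - scanQueues;                          // 4
        // Handler threads follow the same ratios as the queues.
        int readHandlers  = Math.round(handlerCount * readRatio);           // 50
        int writeHandlers = handlerCount - readHandlers;                    // 50
        int scanHandlers  = (int) Math.floor(readHandlers * scanRatio);     // 10
        int getHandlers   = readHandlers - scanHandlers;                    // 40
        return new int[] { writeQueues, getQueues, scanQueues,
                           writeHandlers, getHandlers, scanHandlers };
    }

    public static void main(String[] args) {
        int[] s = split(100, 0.1f, 0.5f, 0.2f);
        System.out.println("queues   write/get/scan = " + s[0] + "/" + s[1] + "/" + s[2]);
        System.out.println("handlers write/get/scan = " + s[3] + "/" + s[4] + "/" + s[5]);
    }
}
```

Running this prints 5/4/1 queues and 50/40/10 handler threads for write/get/scan, matching the calculations above.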