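ReplicationMonitor is the HDFS thread that monitors block replication. Its main job is to compute the work for the DataNodes and to put blocks whose replication requests have timed out back into the queue of blocks awaiting scheduling. Its definition, together with the run() method at the core of the thread, is as follows: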

    /**
     * Periodically calls computeReplicationWork().
     */
    private class ReplicationMonitor implements Runnable {

      @Override
      public void run() {
        // Keep looping for as long as the namesystem is running
        while (namesystem.isRunning()) {
          try {
            // Process replication work only when active NN is out of safe mode.
            if (namesystem.isPopulatingReplQueues()) {
              // Compute the work for the DataNodes
              computeDatanodeWork();
              // Re-queue blocks whose replication requests have timed out
              processPendingReplications();
            }
            // Sleep for replicationRecheckInterval
            Thread.sleep(replicationRecheckInterval);
          } catch (Throwable t) {
            if (!namesystem.isRunning()) {
              LOG.info("Stopping ReplicationMonitor.");
              if (!(t instanceof InterruptedException)) {
                LOG.info("ReplicationMonitor received an exception"
                    + " while shutting down.", t);
              }
              break;
            } else if (!checkNSRunning && t instanceof InterruptedException) {
              LOG.info("Stopping ReplicationMonitor for testing.");
              break;
            }
            LOG.fatal("ReplicationMonitor thread received Runtime exception. ", t);
            terminate(1, t);
          }
        }
      }
    }

The logic of run() is straightforward. The while loop keeps going for as long as namesystem is running, and inside the loop replication work is processed only when the active NN is out of safe mode (the namesystem.isPopulatingReplQueues() check):
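1. Call computeDatanodeWork() to compute the work for the DataNodes;
2. Call processPendingReplications() to put blocks whose replication requests have timed out back into the queue of blocks awaiting scheduling;
3. Sleep for replicationRecheckInterval, then run the next iteration.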
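
The three steps above can be sketched outside HDFS as a plain periodic loop. This is only a minimal illustration, not the HDFS implementation: the class PeriodicMonitorSketch and its parameters (isRunning, readyForWork, work) are hypothetical stand-ins for namesystem.isRunning(), namesystem.isPopulatingReplQueues() and the two work methods, and the sketch simply exits on interruption instead of reproducing the shutdown and terminate() handling.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

/** Hypothetical stand-in for a periodic monitor thread; not an HDFS class. */
class PeriodicMonitorSketch implements Runnable {
    private final long recheckIntervalMs;       // pause between cycles
    private final BooleanSupplier isRunning;    // assumed analogue of namesystem.isRunning()
    private final BooleanSupplier readyForWork; // assumed analogue of isPopulatingReplQueues()
    private final Runnable work;                // the per-cycle work

    PeriodicMonitorSketch(long recheckIntervalMs, BooleanSupplier isRunning,
                          BooleanSupplier readyForWork, Runnable work) {
        this.recheckIntervalMs = recheckIntervalMs;
        this.isRunning = isRunning;
        this.readyForWork = readyForWork;
        this.work = work;
    }

    @Override
    public void run() {
        while (isRunning.getAsBoolean()) {
            try {
                // Do the work only when the gate is open, but always sleep.
                if (readyForWork.getAsBoolean()) {
                    work.run();
                }
                Thread.sleep(recheckIntervalMs);
            } catch (InterruptedException e) {
                // Simplification: restore the interrupt flag and leave the loop.
                Thread.currentThread().interrupt();
                break;
            }
        }
    }

    public static void main(String[] args) {
        AtomicInteger cycles = new AtomicInteger();
        // Stop after three cycles so that the demo ends by itself.
        PeriodicMonitorSketch monitor = new PeriodicMonitorSketch(
                1L, () -> cycles.get() < 3, () -> true, cycles::incrementAndGet);
        monitor.run();
        System.out.println("cycles run: " + cycles.get());
    }
}
```

In a real deployment the Runnable would be handed to its own Thread; calling run() directly here just keeps the demo deterministic.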
First, a word about replicationRecheckInterval. It is the interval at which the NameNode checks for new replication work, and it is initialized in the BlockManager constructor:
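
    this.replicationRecheckInterval = conf.getInt(
        DFSConfigKeys.DFS_NAMENODE_REPLICATION_INTERVAL_KEY,
        DFSConfigKeys.DFS_NAMENODE_REPLICATION_INTERVAL_DEFAULT) * 1000L;

The value is read from the dfs.namenode.replication.interval parameter, which is given in seconds (the code multiplies it by 1000L to get milliseconds). If the parameter is not configured, it defaults to 3 seconds.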
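
To make the unit conversion concrete, here is a small sketch that mimics the lookup with a plain Map in place of Hadoop's Configuration. The class RecheckIntervalSketch and its method are hypothetical names introduced only for illustration. The key string and the default of 3 are the ones quoted in the text.

```java
import java.util.Collections;
import java.util.Map;

/** Hypothetical helper; a Map stands in for Hadoop's Configuration object. */
class RecheckIntervalSketch {
    static final String INTERVAL_KEY = "dfs.namenode.replication.interval";
    static final int INTERVAL_DEFAULT_SECONDS = 3;

    /** Returns the recheck interval in milliseconds. */
    static long recheckIntervalMs(Map<String, String> conf) {
        String raw = conf.get(INTERVAL_KEY);
        // Fall back to the default when the key is not configured.
        int seconds = (raw == null) ? INTERVAL_DEFAULT_SECONDS : Integer.parseInt(raw.trim());
        // Same conversion as in the constructor: seconds to milliseconds.
        return seconds * 1000L;
    }

    public static void main(String[] args) {
        Map<String, String> unset = Collections.emptyMap();
        Map<String, String> tenSeconds = Collections.singletonMap(INTERVAL_KEY, "10");
        System.out.println(recheckIntervalMs(unset));
        System.out.println(recheckIntervalMs(tenSeconds));
    }
}
```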
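Next, look at computeDatanodeWork(), which computes the DataNode work. It works out the block replication and block invalidation work that can be scheduled on the DataNodes, which are informed of that work at their next heartbeat, and it returns the number of blocks scheduled for replication or removal. The code is as follows: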

    /**
     * Compute block replication and block invalidation work that can be scheduled
     * on data-nodes. The datanode will be informed of this work at the next
     * heartbeat.
     *
     * @return number of blocks scheduled for replication or removal.
     */
    int computeDatanodeWork() {
      // Blocks should not be replicated or removed if in safe mode.
      // It's OK to check safe mode here w/o holding lock, in the worst
      // case extra replications will be scheduled, and these will get
      // fixed up later.
      if (namesystem.isInSafeMode()) {
        return 0;
      }

      // Number of live DataNodes, obtained from the heartbeat manager
      final int numlive = heartbeatManager.getLiveDatanodeCount();
      // blocksReplWorkMultiplier: blocks to replicate per live DataNode per cycle;
      // blocksToProcess: blocks to schedule for replication cluster-wide this cycle
      final int blocksToProcess = numlive * this.blocksReplWorkMultiplier;
      // blocksInvalidateWorkPct: fraction of live DataNodes handled per cycle;
      // nodesToProcess: number of DataNodes to schedule invalidation work on
      final int nodesToProcess = (int) Math.ceil(numlive * this.blocksInvalidateWorkPct);

      // Compute the replication work found in this cycle
      int workFound = this.computeReplicationWork(blocksToProcess);

      // Update counters
      namesystem.writeLock();
      try {
        // Refresh the related state
        this.updateState();
        // Record the replication work as the scheduled replication block count
        this.scheduledReplicationBlocksCount = workFound;
      } finally {
        namesystem.writeUnlock();
      }
      // Compute the invalidation work and add it to workFound
      workFound += this.computeInvalidateWork(nodesToProcess);
      // Return the total amount of work
      return workFound;
    }

The processing logic of computeDatanodeWork() is roughly as follows:
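1. If namesystem is in safe mode, return 0 immediately;
2. Get the number of live DataNodes, numlive, from the heartbeat manager heartbeatManager;
3. Compute blocksToProcess, the number of blocks the cluster should replicate in this cycle: numlive multiplied by blocksReplWorkMultiplier, the number of blocks to replicate per DataNode per cycle. blocksReplWorkMultiplier is read from the parameter dfs.namenode.replication.work.multiplier.per.iteration and defaults to 2 if the parameter is not configured;
4. Compute nodesToProcess, the number of DataNodes on which invalidation (deletion) work is scheduled in this cycle: numlive multiplied by blocksInvalidateWorkPct, the fraction of live DataNodes handled per cycle, with the result rounded up. blocksInvalidateWorkPct is read from the parameter dfs.namenode.invalidate.work.pct.per.iteration and defaults to 0.32f if the parameter is not configured;
5. Call computeReplicationWork() with blocksToProcess to compute the replication work, workFound;
6. Acquire the namesystem write lock;
7. Call updateState() to refresh the related state;
8. Assign workFound to scheduledReplicationBlocksCount, the number of blocks scheduled for replication;
9. Release the namesystem write lock;
10. Call computeInvalidateWork() with nodesToProcess to compute the invalidation work, and add it to workFound;
11. Return the total work, workFound.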
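
As a worked example of steps 3 and 4, the sketch below reproduces the two formulas with the default values quoted above (a multiplier of 2 and a percentage of 0.32f). The class and method names are hypothetical; only the arithmetic mirrors the source.

```java
/** Hypothetical helper reproducing the two budget formulas from computeDatanodeWork(). */
class DatanodeWorkBudgetSketch {

    /** blocksToProcess = numlive * blocksReplWorkMultiplier */
    static int blocksToProcess(int numLive, int blocksReplWorkMultiplier) {
        return numLive * blocksReplWorkMultiplier;
    }

    /** nodesToProcess = ceil(numlive * blocksInvalidateWorkPct) */
    static int nodesToProcess(int numLive, float blocksInvalidateWorkPct) {
        return (int) Math.ceil(numLive * blocksInvalidateWorkPct);
    }

    public static void main(String[] args) {
        int numLive = 10; // assumed cluster size, for illustration only
        System.out.println("blocksToProcess = " + blocksToProcess(numLive, 2));     // 10 * 2 = 20
        System.out.println("nodesToProcess  = " + nodesToProcess(numLive, 0.32f));  // ceil(3.2) = 4
    }
}
```

Because of the ceiling, any cluster with at least one live DataNode gets a nodesToProcess of at least 1, so invalidation work is never starved on small clusters.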
Now look at computeReplicationWork(), which computes the replication work:

    /**
     * Scan blocks in {@link #neededReplications} and assign replication
     * work to data-nodes they belong to.
     *
     * The number of process blocks equals either twice the number of live
     * data-nodes or the number of under-replicated blocks whichever is less.
     *
     * @return number of blocks scheduled for replication during this iteration.
     */
    int computeReplicationWork(int blocksToProcess) {
      List<List<Block>> blocksToReplicate = null;
      namesystem.writeLock();
      try {
        // Choose the blocks to be replicated.
        // blocksToReplicate is a list of block lists; the index in the
        // outer list is the replication priority.
        blocksToReplicate = neededReplications
            .chooseUnderReplicatedBlocks(blocksToProcess);
      } finally {
        namesystem.writeUnlock();
      }
      // Assign the replication work for the chosen blocks, passing in the
      // per-priority list of block lists
      return computeReplicationWorkForBlocks(blocksToReplicate);
    }

computeReplicationWork() is short and its logic is clear:
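1. Acquire the namesystem write lock;
2. Call chooseUnderReplicatedBlocks() on neededReplications to pick blocksToProcess blocks awaiting replication and store them in blocksToReplicate. blocksToReplicate is a list of block lists, and the index in the outer list is the replication priority. For how chooseUnderReplicatedBlocks() picks these blocks, see the article 《HDFS源码分析之UnderReplicatedBlocks(二)》;
3. Release the namesystem write lock;
4. Call computeReplicationWorkForBlocks() to assign the replication work, passing in the list of block lists, whose outer index is the replication priority.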
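
The shape of these steps, selecting under the write lock and returning one list per priority level, can be sketched as follows. This is a simplified illustration under stated assumptions, not the UnderReplicatedBlocks implementation: the class PrioritySelectionSketch is hypothetical, block IDs are plain Long values, index 0 is assumed to be the most urgent level, and whatever bookkeeping the real chooseUnderReplicatedBlocks() keeps between calls is not modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical, simplified model of choosing blocks by priority under a write lock. */
class PrioritySelectionSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    // queues.get(p) holds the block IDs waiting at priority p (0 = most urgent, an assumption).
    private final List<List<Long>> queues = new ArrayList<>();

    PrioritySelectionSketch(int levels) {
        for (int i = 0; i < levels; i++) {
            queues.add(new ArrayList<>());
        }
    }

    void add(long blockId, int priority) {
        lock.writeLock().lock();
        try {
            queues.get(priority).add(blockId);
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Picks at most blocksToProcess blocks, most urgent level first. */
    List<List<Long>> choose(int blocksToProcess) {
        List<List<Long>> chosen = new ArrayList<>();
        lock.writeLock().lock();
        try {
            int remaining = blocksToProcess;
            for (List<Long> queue : queues) {
                int take = Math.min(remaining, queue.size());
                // Copy, so that callers can use the result after the lock is released.
                chosen.add(new ArrayList<>(queue.subList(0, take)));
                remaining -= take;
            }
        } finally {
            lock.writeLock().unlock();
        }
        return chosen; // outer index = priority level
    }

    public static void main(String[] args) {
        PrioritySelectionSketch sketch = new PrioritySelectionSketch(3);
        sketch.add(1L, 0);
        sketch.add(2L, 0);
        sketch.add(3L, 1);
        sketch.add(4L, 2);
        System.out.println(sketch.choose(3));
    }
}
```

Copying the selection while the lock is held and processing it afterwards keeps the critical section short, which matches the way the method above releases the write lock before calling computeReplicationWorkForBlocks().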