Recently I hit a nasty problem. On a 6-node distributed database, with a 10 TB disk per node, I dug through the HDFS local directories layer by layer, hunting for large files by hand, and finally found the culprit: a single dncp-block-verification.log.curr taking up 5.6 TB. As the question marks piled up in my head, I was genuinely indignant: how can this thing grow so big? Bigger than my actual data files?
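For anyone stuck on the same hunt, here is a minimal Python sketch of that search. The data directory `/data/hdfs/dn` is a made-up placeholder; substitute whatever your `dfs.datanode.data.dir` actually points to. In practice, running `du -sh` level by level gets you to the same answer.

```python
import os

# Hypothetical DataNode local data directory; point this at your
# dfs.datanode.data.dir location(s) instead.
DATA_DIR = "/data/hdfs/dn"

def largest_files(root, top_n=10):
    """Walk the local directory tree and return the top_n biggest files."""
    sizes = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                # Files can disappear while scanning; just skip them.
                continue
    return sorted(sizes, reverse=True)[:top_n]

if __name__ == "__main__":
    for size, path in largest_files(DATA_DIR):
        print(f"{size / 1024 ** 4:8.2f} TiB  {path}")
```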
It was only the second day of the holiday and the customer was already chasing me: "Have you settled on a solution yet?" So I hurried to reproduce the issue in a local virtual machine. The fix itself was already clear, just delete those two files, but this is a production environment and I didn't dare delete anything on a whim; better to play it safe.
Looking back, this is actually a bug in older versions of HDFS that has since been fixed in newer releases. The workaround was simply to stop the DataNode and delete those two oversized log files.
Here is the canonical fix as well:
One solution, although slightly drastic, is to disable the block scanner entirely, by setting into the HDFS DataNode configuration the key `dfs.datanode.scan.period.hours` to `0` (default is `504` in hours). The negative effect of this is that your DNs may not auto-detect corrupted block files (and would need to wait upon a future block reading client to detect them instead); this isn't a big deal if your average replication is 3-ish, but you can consider the change as a short term one until you upgrade to a release that fixes the issue. Note that this problem will not happen if you upgrade to the latest CDH 5.4.x or higher release versions, which includes the [HDFS-7430](https://issues.apache.org/jira/browse/HDFS-7430) rewrite changes and associated bug fixes. These changes have done away with the use of such a local file, thereby removing the problem.
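In other words, per the quoted advice, the short-term workaround amounts to adding the property below to `hdfs-site.xml` on each DataNode and restarting it. This is only a sketch of the quoted setting for the affected pre-HDFS-7430 releases, and the side effect noted above (no automatic detection of corrupt block files) applies.

```xml
<!-- hdfs-site.xml on each DataNode (pre-HDFS-7430 releases): -->
<!-- disable the periodic block scanner; the default is 504 hours. -->
<property>
  <name>dfs.datanode.scan.period.hours</name>
  <value>0</value>
</property>
```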