Mastering Hudi Series | Apache Hudi Index Implementation Analysis (5): The List-Based IndexFileFilter - Alibaba Cloud Developer Community


2024-03-12



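1. Introduction

The previous article analyzed the Tree-based index file filters. Hudi also provides List-based implementations, ListBasedIndexFileFilter and ListBasedGlobalIndexFileFilter, which are analyzed below.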

2. Analysis

ListBasedIndexFileFilter is the parent class of ListBasedGlobalIndexFileFilter. Both implement the getMatchingFilesAndPartition method of the IndexFileFilter interface.

2.1 ListBasedIndexFileFilter implementation

The core code of ListBasedIndexFileFilter's getMatchingFilesAndPartition method is as follows:

public Set<Pair<String, String>> getMatchingFilesAndPartition(String partitionPath, String recordKey) {
  // Look up the file list for this partition
  List<BloomIndexFileInfo> indexInfos = partitionToFileIndexInfo.get(partitionPath);
  Set<Pair<String, String>> toReturn = new HashSet<>();
  if (indexInfos != null) { // may be null, e.g. when the partition path contains no files
    // Iterate over the list
    for (BloomIndexFileInfo indexInfo : indexInfos) {
      if (shouldCompareWithFile(indexInfo, recordKey)) { // does this file need to be compared?
        toReturn.add(Pair.of(partitionPath, indexInfo.getFileId()));
      }
    }
  }
  return toReturn;
}

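As the code shows, the logic is simple: the method iterates over the partition's list of BloomIndexFileInfo entries and adds every file that needs to be compared to the result set. The core code of shouldCompareWithFile is as follows: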

protected boolean shouldCompareWithFile(BloomIndexFileInfo indexInfo, String recordKey) {
  // Compare when the file has no min/max recordKey, or when recordKey lies within [min, max]
  return !indexInfo.hasKeyRanges() || indexInfo.isKeyInRange(recordKey);
}

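The rule for deciding whether a comparison is needed is equally simple: a file must be compared if it has no recorded minimum and maximum record keys, or if the given recordKey falls between its minimum and maximum.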
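To make the rule concrete, here is a minimal, self-contained sketch. FileKeyRange is a hypothetical, simplified stand-in for Hudi's BloomIndexFileInfo, not the real class; in particular, the inclusive bounds and the use of String.compareTo are assumptions of this sketch rather than something the excerpt above confirms.

```java
/** Hypothetical, simplified stand-in for Hudi's BloomIndexFileInfo (not the real class). */
final class FileKeyRange {
    final String fileId;
    final String minRecordKey; // null when no key range was recorded for the file
    final String maxRecordKey; // null when no key range was recorded for the file

    FileKeyRange(String fileId, String minRecordKey, String maxRecordKey) {
        this.fileId = fileId;
        this.minRecordKey = minRecordKey;
        this.maxRecordKey = maxRecordKey;
    }

    /** True only when both ends of the range are known. */
    boolean hasKeyRanges() {
        return minRecordKey != null && maxRecordKey != null;
    }

    /** Inclusive on both ends (assumption); String.compareTo gives lexicographic order. */
    boolean isKeyInRange(String recordKey) {
        return minRecordKey.compareTo(recordKey) <= 0
            && maxRecordKey.compareTo(recordKey) >= 0;
    }

    /** Same rule as shouldCompareWithFile: no range means "cannot rule the file out". */
    static boolean shouldCompareWithFile(FileKeyRange file, String recordKey) {
        return !file.hasKeyRanges() || file.isKeyInRange(recordKey);
    }
}
```

With a file whose range is ["key_1", "key_5"], the key "key_3" is a candidate and "key_7" is not, while a file without a range is always a candidate. Because this sketch compares strings lexicographically, "key_10" also counts as inside ["key_1", "key_5"]; a range check of this kind can only exclude files, it never confirms that a key is actually present.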

2.2 ListBasedGlobalIndexFileFilter implementation

The core code of ListBasedGlobalIndexFileFilter's getMatchingFilesAndPartition method is as follows:

public Set<Pair<String, String>> getMatchingFilesAndPartition(String partitionPath, String recordKey) {
  Set<Pair<String, String>> toReturn = new HashSet<>();
  // Iterate over the files of every partition
  partitionToFileIndexInfo.forEach((partition, bloomIndexFileInfoList) -> bloomIndexFileInfoList.forEach(file -> {
    if (shouldCompareWithFile(file, recordKey)) { // does this file need to be compared?
      toReturn.add(Pair.of(partition, file.getFileId()));
    }
  }));
  return toReturn;
}

As the code shows, the partitionPath argument is no longer used; the method iterates over the files of every partition instead.
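The difference between the two filters can be reproduced with a small, self-contained sketch. All class names here (FileRange, ListFilterSketch, GlobalListFilterSketch) are hypothetical stand-ins, and Map.Entry replaces Hudi's Pair so that the example depends only on the JDK; it mirrors the logic of the two excerpts above rather than the actual Hudi classes.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for BloomIndexFileInfo: a file ID plus an optional key range. */
final class FileRange {
    final String fileId;
    final String minKey; // null when no range is known
    final String maxKey; // null when no range is known

    FileRange(String fileId, String minKey, String maxKey) {
        this.fileId = fileId;
        this.minKey = minKey;
        this.maxKey = maxKey;
    }
}

/** Partition-scoped filter, modeled on the logic shown in section 2.1. */
class ListFilterSketch {
    protected final Map<String, List<FileRange>> partitionToFiles = new HashMap<>();

    void addFile(String partition, FileRange file) {
        partitionToFiles.computeIfAbsent(partition, p -> new ArrayList<>()).add(file);
    }

    /** A file is a candidate when it has no range or the key lies inside the range. */
    protected boolean shouldCompareWithFile(FileRange f, String key) {
        boolean hasRange = f.minKey != null && f.maxKey != null;
        return !hasRange || (f.minKey.compareTo(key) <= 0 && f.maxKey.compareTo(key) >= 0);
    }

    Set<Map.Entry<String, String>> getMatchingFilesAndPartition(String partitionPath, String recordKey) {
        Set<Map.Entry<String, String>> result = new HashSet<>();
        List<FileRange> files = partitionToFiles.get(partitionPath);
        if (files != null) { // unknown partition: nothing to check
            for (FileRange f : files) {
                if (shouldCompareWithFile(f, recordKey)) {
                    result.add(new SimpleImmutableEntry<>(partitionPath, f.fileId));
                }
            }
        }
        return result;
    }
}

/** Global variant: inherits the range check, overrides the lookup, ignores partitionPath. */
class GlobalListFilterSketch extends ListFilterSketch {
    @Override
    Set<Map.Entry<String, String>> getMatchingFilesAndPartition(String partitionPath, String recordKey) {
        Set<Map.Entry<String, String>> result = new HashSet<>();
        partitionToFiles.forEach((partition, files) -> files.forEach(f -> {
            if (shouldCompareWithFile(f, recordKey)) {
                result.add(new SimpleImmutableEntry<>(partition, f.fileId));
            }
        }));
        return result;
    }
}
```

If both sketches are loaded with the same files, a lookup in the scoped variant returns candidates from the requested partition only, whereas the global variant returns the same candidates no matter which partitionPath is passed, including one that does not exist. The price is that the global variant performs a range check against every file in the table on each lookup.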

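3. Summary

ListBasedIndexFileFilter and ListBasedGlobalIndexFileFilter are the List-based index file filters. As with the Tree-based filters, the former is mainly used by HoodieBloomIndex and the latter by HoodieGlobalBloomIndex; likewise, the former works at partition granularity, while the latter is global.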
