[Spark Series 10] Spark LogicalPlan Statistics (Statistics in the Logical Plan Phase)

2022-05-29

Background

This article is based on Spark 3.0.1.

Analysis

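Statistics gathered at the logical plan stage matter a great deal for optimization, for example when deciding on a BroadcastHashJoin or applying dynamic partition pruning. This article looks at how Spark obtains these statistics.

We start with LogicalPlanStats: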

trait LogicalPlanStats { self: LogicalPlan =>
  /**
   * Returns the estimated statistics for the current logical plan node. Under the hood, this
   * method caches the return value, which is computed based on the configuration passed in the
   * first time. If the configuration changes, the cache can be invalidated by calling
   * [[invalidateStatsCache()]].
   */
  def stats: Statistics = statsCache.getOrElse {
    if (conf.cboEnabled) {
      statsCache = Option(BasicStatsPlanVisitor.visit(self))
    } else {
      statsCache = Option(SizeInBytesOnlyStatsPlanVisitor.visit(self))
    }
    statsCache.get
  }
  /** A cache for the estimated statistics, such that it will only be computed once. */
  protected var statsCache: Option[Statistics] = None
  /** Invalidates the stats cache. See [[stats]] for more information. */
  final def invalidateStatsCache(): Unit = {
    statsCache = None
    children.foreach(_.invalidateStatsCache())
  }
}
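The caching behaviour described in the Scaladoc above can be modelled in plain Scala without Spark. The following is only a minimal sketch: `StatsCacheSketch`, `Node`, `Stats` and the `computations` counter are hypothetical stand-ins invented for illustration, not Spark classes. It shows that the estimate is computed once per node and recomputed only after `invalidateStatsCache()` clears the cache recursively.

```scala
// Simplified stand-ins for illustration; these are NOT Spark's real classes.
object StatsCacheSketch {
  final case class Stats(sizeInBytes: BigInt)

  // Hypothetical plan node: `estimate` plays the role of the visitor call.
  class Node(val children: Seq[Node], estimate: Node => Stats) {
    var computations = 0 // how many times the estimate was actually computed
    private var statsCache: Option[Stats] = None

    // Same shape as LogicalPlanStats.stats: compute on first access, then reuse.
    def stats: Stats = statsCache.getOrElse {
      computations += 1
      statsCache = Some(estimate(this))
      statsCache.get
    }

    // Clears this node's cache and, recursively, the caches of its children.
    def invalidateStatsCache(): Unit = {
      statsCache = None
      children.foreach(_.invalidateStatsCache())
    }
  }

  def main(args: Array[String]): Unit = {
    val leaf = new Node(Nil, _ => Stats(100))
    val root = new Node(Seq(leaf), n => n.children.head.stats)
    root.stats
    root.stats
    println(root.computations) // 1: the second call hit the cache
    root.invalidateStatsCache()
    root.stats
    println(root.computations) // 2: recomputed after invalidation
  }
}
```

Because the cache is filled from whatever configuration is active on first access, a change to the CBO setting only takes effect for a plan after its cache has been invalidated.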

The stats method computes the statistics. If CBO is enabled it calls BasicStatsPlanVisitor.visit; otherwise it calls SizeInBytesOnlyStatsPlanVisitor.visit. We will look at SizeInBytesOnlyStatsPlanVisitor, because many of the methods in BasicStatsPlanVisitor delegate to it, and focus on its default method:

  override def default(p: LogicalPlan): Statistics = p match {
    case p: LeafNode => p.computeStats()
    case _: LogicalPlan => Statistics(sizeInBytes = p.children.map(_.stats.sizeInBytes).product)
  }
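To make the fallback rule in `default` concrete, here is a small self-contained model. `Plan`, `Leaf` and `Inner` are hypothetical types made up for this sketch; only the rule itself (a leaf reports its own size, any other node takes the product of its children's sizes) comes from the snippet above.

```scala
// Toy model of the fallback rule; the types are illustrative, not Spark's.
object DefaultStatsSketch {
  sealed trait Plan { def children: Seq[Plan] }
  final case class Leaf(sizeInBytes: BigInt) extends Plan {
    val children: Seq[Plan] = Nil
  }
  final case class Inner(children: Seq[Plan]) extends Plan

  // Leaves report their own size; every other node multiplies its children's sizes.
  def sizeInBytes(p: Plan): BigInt = p match {
    case Leaf(size) => size
    case Inner(cs)  => cs.map(sizeInBytes).product
  }

  def main(args: Array[String]): Unit = {
    val twoChildren = Inner(Seq(Leaf(1000), Leaf(2000)))
    println(sizeInBytes(twoChildren))             // 2000000
    println(sizeInBytes(Inner(Seq(twoChildren)))) // 2000000: one child, size passes through
  }
}
```

The product is a deliberately pessimistic upper bound (it is the size of a cross product), which is presumably why it serves only as the fallback and the visitor defines more specific rules for individual operators.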

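Statistics are propagated upward, layer by layer, from the leaf nodes. When a leaf node is matched, its computeStats method is called directly. The implementation differs between data source versions:

For v1, take HiveTableRelation as an example: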

override def computeStats(): Statistics = {
   tableMeta.stats.map(_.toPlanStats(output, conf.cboEnabled || conf.planStatsEnabled))
     .orElse(tableStats)
     .getOrElse {
     throw new IllegalStateException("table stats must be specified.")
   }
 }

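The statistics are read straight from the table metadata. If CBO or planStatsEnabled is turned on, the row count and the column statistics are included as well.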
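The effect of the flag passed to toPlanStats can be sketched as follows. This is a simplified model based on my reading of the behaviour, not the Spark implementation: `CatalogStats`, `PlanStats` and the fixed `assumedRowSize` are hypothetical, and Spark derives the row size from the column types and statistics rather than from a constant. The sketch also assumes that the richer statistics are returned only when a row count has actually been recorded for the table.

```scala
// Simplified model; the types and the fixed row size are assumptions for illustration.
object ToPlanStatsSketch {
  final case class CatalogStats(
      sizeInBytes: BigInt,
      rowCount: Option[BigInt],
      colStats: Map[String, String]) // column name -> opaque description of its stats

  final case class PlanStats(
      sizeInBytes: BigInt,
      rowCount: Option[BigInt] = None,
      attributeStats: Map[String, String] = Map.empty)

  def toPlanStats(
      c: CatalogStats,
      output: Seq[String],
      planStatsEnabled: Boolean,
      assumedRowSize: Int): PlanStats =
    if (planStatsEnabled && c.rowCount.isDefined) {
      // Keep only the stats of columns the relation actually outputs.
      val attrs = c.colStats.filter { case (name, _) => output.contains(name) }
      // Estimate the size as row count * row size.
      PlanStats(c.rowCount.get * assumedRowSize, c.rowCount, attrs)
    } else {
      // Size-only strategy: propagate nothing but sizeInBytes.
      PlanStats(c.sizeInBytes)
    }

  def main(args: Array[String]): Unit = {
    val cat = CatalogStats(BigInt(5000), Some(BigInt(100)),
      Map("id" -> "ndv=100", "name" -> "ndv=90"))
    println(toPlanStats(cat, Seq("id"), planStatsEnabled = true, assumedRowSize = 16))
    println(toPlanStats(cat, Seq("id"), planStatsEnabled = false, assumedRowSize = 16))
  }
}
```

With the flag on, the size becomes 100 rows * 16 bytes = 1600 and only the `id` column statistics survive; with the flag off, only the catalog's 5000 bytes are reported.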

For v2, take DataSourceV2Relation as an example:

  override def computeStats(): Statistics = {
    if (Utils.isTesting) {
      // when testing, throw an exception if this computeStats method is called because stats should
      // not be accessed before pushing the projection and filters to create a scan. otherwise, the
      // stats are not accurate because they are based on a full table scan of all columns.
      throw new IllegalStateException(
        s"BUG: computeStats called before pushdown on DSv2 relation: $name")
    } else {
      // when not testing, return stats because bad stats are better than failing a query
      table.asReadable.newScanBuilder(options) match {
        case r: SupportsReportStatistics =>
          val statistics = r.estimateStatistics()
          DataSourceV2Relation.transformV2Stats(statistics, None, conf.defaultSizeInBytes)
        case _ =>
          Statistics(sizeInBytes = conf.defaultSizeInBytes)
      }
    }
  }

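This calls table.asReadable.newScanBuilder directly. If the result implements SupportsReportStatistics, its estimateStatistics method is called; otherwise the default size is used. Table, SupportsRead and SupportsReportStatistics all belong to the DataSource V2 connector API as reworked in Spark 3. Let's look at ParquetScan, which by default inherits estimateStatistics from FileScan: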

  override def estimateStatistics(): Statistics = {
    new Statistics {
      override def sizeInBytes(): OptionalLong = {
        val compressionFactor = sparkSession.sessionState.conf.fileCompressionFactor
        val size = (compressionFactor * fileIndex.sizeInBytes).toLong
        OptionalLong.of(size)
      }
      override def numRows(): OptionalLong = OptionalLong.empty()
    }
  }
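To see what this estimate amounts to numerically, and why it matters for the broadcast join decision mentioned at the start, here is a small sketch of the arithmetic. The file size is made up, and `fitsBroadcast` is a hypothetical helper that assumes the check is simply "non-negative and at most the threshold". The threshold value assumes the 10 MB default of `spark.sql.autoBroadcastJoinThreshold`, and the compression factor corresponds to `spark.sql.sources.fileCompressionFactor`, whose default I take to be 1.0.

```scala
// Arithmetic-only sketch; the helper names and the sample sizes are assumptions.
object FileScanSizeSketch {
  // Same arithmetic as estimateStatistics above: factor * on-disk size, truncated to Long.
  def estimatedSize(onDiskBytes: Long, compressionFactor: Double): Long =
    (compressionFactor * onDiskBytes).toLong

  // Assumed broadcast check: a non-negative size at or below the threshold.
  def fitsBroadcast(sizeInBytes: Long, thresholdBytes: Long): Boolean =
    sizeInBytes >= 0 && sizeInBytes <= thresholdBytes

  def main(args: Array[String]): Unit = {
    val onDisk = 4L * 1024 * 1024     // 4 MiB of files on disk (made-up figure)
    val threshold = 10L * 1024 * 1024 // 10 MiB, assumed broadcast threshold

    println(fitsBroadcast(estimatedSize(onDisk, 1.0), threshold)) // true
    println(fitsBroadcast(estimatedSize(onDisk, 3.0), threshold)) // false
  }
}
```

Raising the compression factor for heavily compressed formats therefore makes the planner more conservative: the same 4 MiB of files is treated as about 12 MiB and no longer qualifies for a broadcast under these assumptions.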

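As this shows, v2 sources do not provide column statistics, at least not yet, whereas some v1 sources do. Collecting statistics for every column is, after all, expensive.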
