Spark DS V2下推对TableCatalog的依赖及实现原理-开发者社区-阿里云

SPARK中 DS V2 push down(下推)的一些说明

2022-08-06 580

版权

本文内容由阿里云实名注册用户自发贡献，版权归原作者所有，阿里云开发者社区不拥有其著作权，亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容，填写侵权投诉表单进行举报，一经查实，本社区将立刻删除涉嫌侵权内容。

简介： SPARK中 DS V2 push down(下推)的一些说明

背景

本文基于 SPARK 3.3.0

在之前的文章 SPARK中的FileSourceStrategy，DataSourceStrategy以及DataSourceV2Strategy规则我们有提到 DS V2 push down的功能，如JDBC 复杂下推，以及Parquet的聚合下推等等。其实这里面有个比较大的背景–就是TableCatalog类。

结论

先说结论，这些聚合下推的大前提是，在spark中已经配置了对应的catalog，如下：

spark.sql.catalog.h2=org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog

分析

在Rule V2ScanRelationPushDown一系列的规则中，第一个规则createScanBuilder:

private def createScanBuilder(plan: LogicalPlan) = plan.transform {
    case r: DataSourceV2Relation =>
      ScanBuilderHolder(r.output, r, r.table.asReadable.newScanBuilder(r.options))
  }

只有是DataSourceV2Relation类型，也就是 DS v2，才会转换为 ScanBuilderHolder，而后续的pushDownFilters,pushDownAggregates规则则是基于ScanBuilderHolder来做转换的（如果有遇到ScanBuilderHolder类型才会进行DS v2特有的规则转换），所以DataSourceV2Relation是从哪里来的是关键。

直接说重点：

在RULE ResolveRelations中会进行 UnresolvedRelation到DataSourceV2Relation或是UnresolvedCatalogRelation的转换：

object ResolveRelations extends Rule[LogicalPlan] {
  ...
def apply(plan: LogicalPlan)
        : LogicalPlan = plan.resolveOperatorsUpWithPruning(AlwaysProcess.fn, ruleId) {
      case i @ InsertIntoStatement(table, _, _, _, _, _) if i.query.resolved =>
        val relation = table match {
          case u: UnresolvedRelation if !u.isStreaming =>
            lookupRelation(u).getOrElse(u)
          case other => other
        }

这里的lookupRelation会根据是否有对应的Catalog的注册来判断是DS V1还是DS V2:

private def lookupRelation(
        u: UnresolvedRelation,
        timeTravelSpec: Option[TimeTravelSpec] = None): Option[LogicalPlan] = {
      lookupTempView(u.multipartIdentifier, u.isStreaming, timeTravelSpec.isDefined).orElse {
        expandIdentifier(u.multipartIdentifier) match {
          case CatalogAndIdentifier(catalog, ident) =>
            val key = catalog.name +: ident.namespace :+ ident.name
            AnalysisContext.get.relationCache.get(key).map(_.transform {
              case multi: MultiInstanceRelation =>
                val newRelation = multi.newInstance()
                newRelation.copyTagsFrom(multi)
                newRelation
            }).orElse {
              val table = CatalogV2Util.loadTable(catalog, ident, timeTravelSpec)
              val loaded = createRelation(catalog, ident, table, u.options, u.isStreaming)
              loaded.foreach(AnalysisContext.get.relationCache.update(key, _))
              loaded
            }
          case _ => None
        }
      }
    }
    ...
private def expandIdentifier(nameParts: Seq[String]): Seq[String] = {
    if (!isResolvingView || isReferredTempViewName(nameParts)) return nameParts
    if (nameParts.length == 1) {
      AnalysisContext.get.catalogAndNamespace :+ nameParts.head
    } else if (catalogManager.isCatalogRegistered(nameParts.head)) {
      nameParts
    } else {
      AnalysisContext.get.catalogAndNamespace.head +: nameParts
    }
  }
object CatalogAndIdentifier {
    import org.apache.spark.sql.connector.catalog.CatalogV2Implicits.MultipartIdentifierHelper
    private val globalTempDB = SQLConf.get.getConf(StaticSQLConf.GLOBAL_TEMP_DATABASE)
    def unapply(nameParts: Seq[String]): Option[(CatalogPlugin, Identifier)] = {
      assert(nameParts.nonEmpty)
      if (nameParts.length == 1) {
        Some((currentCatalog, Identifier.of(catalogManager.currentNamespace, nameParts.head)))
      } else if (nameParts.head.equalsIgnoreCase(globalTempDB)) {
        // Conceptually global temp views are in a special reserved catalog. However, the v2 catalog
        // API does not support view yet, and we have to use v1 commands to deal with global temp
        // views. To simplify the implementation, we put global temp views in a special namespace
        // in the session catalog. The special namespace has higher priority during name resolution.
        // For example, if the name of a custom catalog is the same with `GLOBAL_TEMP_DATABASE`,
        // this custom catalog can't be accessed.
        Some((catalogManager.v2SessionCatalog, nameParts.asIdentifier))
      } else {
        try {
          Some((catalogManager.catalog(nameParts.head), nameParts.tail.asIdentifier))
        } catch {
          case _: CatalogNotFoundException =>
            Some((currentCatalog, nameParts.asIdentifier))
        }
      }
    }
  }

expandIdentifier方法结合CatalogAndIdentifier.unapply方法，判断：

1.如果没有指定catalog，则默认catalog 为v2SessionCatalog,catalog的名称为"spark_catalog",这也是spark默认的sessionCatalog 名称,跳到步骤3

如以下SQL： select a from table

2.如果指定了catalog，且catalog已经注册了（如以spark.sql.catalog.h2=org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog），则catalog为指定的（如为JDBCTableCatalog,catalog的名称为"h2",跳到步骤3

如以下SQL：select a from h2.table

3.调用CatalogV2Util.loadTable方法也就是对应的Catalog的loadTable方法来获取对应的Table：

V2SessionCatalog catalog返回是的V1Table

JDBCTableCatalog catalog 返回的是JDBCTable

这样在下一步的createRelation 方法中就会根据不同的case转换为不同的relation：

private def createRelation(
        catalog: CatalogPlugin,
        ident: Identifier,
        table: Option[Table],
        options: CaseInsensitiveStringMap,
        isStreaming: Boolean): Option[LogicalPlan] = {
      ...
      case v1Table: V1Table if CatalogV2Util.isSessionCatalog(catalog) =>
          if (isStreaming) {
            if (v1Table.v1Table.tableType == CatalogTableType.VIEW) {
              throw QueryCompilationErrors.permanentViewNotSupportedByStreamingReadingAPIError(
                ident.quoted)
            }
            SubqueryAlias(
              catalog.name +: ident.asMultipartIdentifier,
              UnresolvedCatalogRelation(v1Table.v1Table, options, isStreaming = true))
          } else {
            v1SessionCatalog.getRelation(v1Table.v1Table, options)
          }
      ...
      case table =>
        ...
         } else {
            SubqueryAlias(
              catalog.name +: ident.asMultipartIdentifier,
              DataSourceV2Relation.create(table, Some(catalog), Some(ident), options))

如果是V1Table，则会转换为UnresolvedCatalogRelation,继而在 Rule FindDataSourceTable中转为LogicalRelation，这里就会涉及lookupDataSource,也就是注册的datasource（如：“org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider” 或者 "org.apache.spark.sql.execution.datasources.v2.parquet.ParquetDataSourceV2”（目前没有进行cast匹配））发生作用了（在providingInstance()方法中实现）

如果是其他的，则会转换为DataSourceV2Relation，继而在Rule V2ScanRelationPushDown中做一系列的下推优化

所以说对于JDBC的catalog来说，想要进行DS V2的优化,就得配置：

spark.sql.catalog.h2=org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog

如果想要对于其他DS v2的优化，如Parquet，就得实现对应的TableCatalog，再进行配置：

spark.sql.catalog.parquet=org.apache.spark.sql.execution.datasources.v2.jdbc.xxxx

关于TableCatalog

目前 jdbc的datasource和TableCatalog 在spark都是已经实现了：

## datasource
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider
## TableCatalog
org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog

如果想实现其他的datasource以及catalog，可以参考JDBC的实现(目前的JDBC的source实现还是基于 DS V1，最好是基于DS V2实现，如参考：ParquetDataSourceV2)。

在SPARK-28396也有这方面的讨论。

更进一步DS V2 Push Down的特性，参考技术前沿｜Spark 3.3.0 中 DS V2 Push-down 的重构与新特性

SPARK中 DS V2 push down(下推)的一些说明

背景

结论

分析

关于TableCatalog

热门文章

最新文章

相关电子书

探索云世界

热门

云计算

大数据

云原生

人工智能

数据库

开发与运维

活动广场

任务中心

训练营

直播

乘风者计划

下载

镜像站

技术资料

SPARK中 DS V2 push down(下推)的一些说明

背景

结论

分析

关于TableCatalog

热门文章

最新文章

相关电子书