Spark Monitoring Overview


Metrics

 

Overview

Spark's metrics are decoupled into different instances corresponding to Spark components. Within each instance, you can configure a set of sinks to which metrics are reported. The following instances are currently supported:

  • master: The Spark standalone master process.
  • applications: A component within the master that reports on various applications.
  • worker: A Spark standalone worker process.
  • executor: A Spark executor.
  • driver: The Spark driver process (the process in which a SparkContext is created).
  • shuffleService: The Spark shuffle service.
  • applicationMaster: The Spark ApplicationMaster when running on YARN.
  • mesos_cluster: The Spark cluster scheduler when running on Mesos.


Each instance can report to zero or more sinks. Sinks are contained in the org.apache.spark.metrics.sink package:


 

  • ConsoleSink: Logs metrics information to the console.
  • CSVSink: Exports metrics data to CSV files at regular intervals.
  • JmxSink: Registers metrics for viewing in a JMX console.
  • MetricsServlet: Adds a servlet within the existing Spark UI to serve metrics data as JSON.
  • PrometheusServlet: (Experimental) Adds a servlet within the existing Spark UI to serve metrics data in Prometheus format.
  • GraphiteSink: Sends metrics to a Graphite node.
  • Slf4jSink: Sends metrics to slf4j as log entries.
  • StatsdSink: Sends metrics to a StatsD node.
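
For reference, a minimal metrics.properties sketch that enables two of the sinks above (property names follow Spark's metrics.properties.template; the period and paths are illustrative and should be adjusted):

# syntax: [instance].sink|source.[name].[option]=[value]; "*" applies to all instances
*.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink
*.sink.console.period=10
*.sink.console.unit=seconds
# Experimental: serve Prometheus-format metrics through the Spark UI
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
master.sink.prometheusServlet.path=/metrics/master/prometheus
applications.sink.prometheusServlet.path=/metrics/applications/prometheus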

 

 

Job Configuration


 

Method 1: Use configuration files from the Docker image

monitoring:
  exposeDriverMetrics: true
  exposeExecutorMetrics: true
  prometheus:
    jmxExporterJar: "/opt/spark/jars/jmx_prometheus_javaagent-0.17.2.jar"
    port: 8090
    portName: http-metric
    configFile: "/opt/spark/metrics/conf/prometheus.yaml"

The configFile referenced above is the rule file for jmx_prometheus_javaagent; its contents (prometheus.yaml) are listed below, where the rules marked # [ADD] are additional rules:

lowercaseOutputName: true
attrNameSnakeCase: true
rules:
  - pattern: metrics<name=(\S+)\.(\S+)\.driver\.(BlockManager|DAGScheduler|jvm)\.(\S+)><>Value
    name: spark_driver_$3_$4
    type: GAUGE
    labels:
      app_namespace: "$1"
      app_id: "$2"
  # [ADD]
  - pattern: metrics<name=(\S+)\.(\S+)\.driver\.(BlockManager|DAGScheduler|jvm)\.(\S+)\.(\S+)><>Value
    name: spark_driver_$3_$4_$5
    type: GAUGE
    labels:
      app_namespace: "$1"
      app_id: "$2"
  - pattern: metrics<name=(\S+)\.(\S+)\.driver\.(\S+)\.StreamingMetrics\.streaming\.(\S+)><>Value
    name: spark_streaming_driver_$4
    type: GAUGE
    labels:
      app_namespace: "$1"
      app_id: "$2"
  - pattern: metrics<name=(\S+)\.(\S+)\.driver\.spark\.streaming\.(\S+)\.(\S+)><>Value
    name: spark_structured_streaming_driver_$4
    type: GAUGE
    labels:
      app_namespace: "$1"
      app_id: "$2"
      query_name: "$3"
  - pattern: metrics<name=(\S+)\.(\S+)\.(\S+)\.executor\.(\S+)><>Value
    name: spark_executor_$4
    type: GAUGE
    labels:
      app_namespace: "$1"
      app_id: "$2"
      executor_id: "$3"
  - pattern: metrics<name=(\S+)\.(\S+)\.driver\.DAGScheduler\.(.*)><>Count
    name: spark_driver_DAGScheduler_$3_count
    type: COUNTER
    labels:
      app_namespace: "$1"
      app_id: "$2"
  - pattern: metrics<name=(\S+)\.(\S+)\.driver\.HiveExternalCatalog\.(.*)><>Count
    name: spark_driver_HiveExternalCatalog_$3_count
    type: COUNTER
    labels:
      app_namespace: "$1"
      app_id: "$2"
  - pattern: metrics<name=(\S+)\.(\S+)\.driver\.CodeGenerator\.(.*)><>Count
    name: spark_driver_CodeGenerator_$3_count
    type: COUNTER
    labels:
      app_namespace: "$1"
      app_id: "$2"
  - pattern: metrics<name=(\S+)\.(\S+)\.driver\.LiveListenerBus\.(.*)><>Count
    name: spark_driver_LiveListenerBus_$3_count
    type: COUNTER
    labels:
      app_namespace: "$1"
      app_id: "$2"
  - pattern: metrics<name=(\S+)\.(\S+)\.driver\.LiveListenerBus\.(.*)><>Value
    name: spark_driver_LiveListenerBus_$3
    type: GAUGE
    labels:
      app_namespace: "$1"
      app_id: "$2"
  - pattern: metrics<name=(\S+)\.(\S+)\.(.*)\.executor\.(.*)><>Count
    name: spark_executor_$4_count
    type: COUNTER
    labels:
      app_namespace: "$1"
      app_id: "$2"
      executor_id: "$3"
  # [ADD]
  - pattern: metrics<name=(\S+)\.(\S+)\.([0-9]+)\.executor\.(.*)><>Value
    name: spark_executor_$4
    type: GAUGE
    labels:
      app_namespace: "$1"
      app_id: "$2"
      executor_id: "$3"
  - pattern: metrics<name=(\S+)\.(\S+)\.([0-9]+)\.(jvm|NettyBlockTransfer)\.(.*)><>Value
    name: spark_executor_$4_$5
    type: GAUGE
    labels:
      app_namespace: "$1"
      app_id: "$2"
      executor_id: "$3"
  - pattern: metrics<name=(\S+)\.(\S+)\.([0-9]+)\.HiveExternalCatalog\.(.*)><>Count
    name: spark_executor_HiveExternalCatalog_$4_count
    type: COUNTER
    labels:
      app_namespace: "$1"
      app_id: "$2"
      executor_id: "$3"
  - pattern: metrics<name=(\S+)\.(\S+)\.([0-9]+)\.CodeGenerator\.(.*)><>Count
    name: spark_executor_CodeGenerator_$4_count
    type: COUNTER
    labels:
      app_namespace: "$1"
      app_id: "$2"
      executor_id: "$3"
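
With the settings above, driver and executor pods should expose Prometheus-format metrics on the named container port http-metric (8090). Below is a minimal PodMonitor sketch for scraping them, assuming your Prometheus is managed by the Prometheus Operator and that driver/executor pods carry the spark-role label set by Spark on Kubernetes; the resource name and namespace are illustrative:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: spark-app-metrics        # illustrative name
  namespace: spark-jobs          # illustrative namespace
spec:
  namespaceSelector:
    any: true                    # scrape matching pods in all namespaces
  selector:
    matchExpressions:
      - key: spark-role          # label applied by Spark on Kubernetes to driver/executor pods
        operator: In
        values: ["driver", "executor"]
  podMetricsEndpoints:
    - port: http-metric          # matches monitoring.prometheus.portName above
      path: /metrics
      interval: 30s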

Method 2: Embed the configuration inline in the monitoring spec

monitoring:
  exposeDriverMetrics: true
  exposeExecutorMetrics: true
  metricsProperties: |
    *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
    # Enable JvmSource for instance master, worker, driver and executor
    master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
    worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
    driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
    executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
  prometheus:
    jmxExporterJar: "/opt/spark/jars/jmx_prometheus_javaagent-0.17.2.jar"
    port: 8090
    configuration: |
      lowercaseOutputName: true
      attrNameSnakeCase: true
      rules:
        # These come from the application driver if it's a streaming application
        # Example: default/streaming.driver.com.example.ClassName.StreamingMetrics.streaming.lastCompletedBatch_schedulingDelay
        - pattern: metrics<name=(\S+)\.(\S+)\.driver\.(\S+)\.StreamingMetrics\.streaming\.(\S+)><>Value
          name: spark_streaming_driver_$4
          labels:
            app_namespace: "$1"
            app_id: "$2"
        # These come from the application driver if it's a structured streaming application
        # Example: default/streaming.driver.spark.streaming.QueryName.inputRate-total
        - pattern: metrics<name=(\S+)\.(\S+)\.driver\.spark\.streaming\.(\S+)\.(\S+)><>Value
          name: spark_structured_streaming_driver_$4
          labels:
            app_namespace: "$1"
            app_id: "$2"
            query_name: "$3"
        # These come from the application executors
        # Example: default/spark-pi.0.executor.threadpool.activeTasks
        - pattern: metrics<name=(\S+)\.(\S+)\.(\S+)\.executor\.(\S+)><>Value
          name: spark_executor_$4
          type: GAUGE
          labels:
            app_namespace: "$1"
            app_id: "$2"
            executor_id: "$3"
        # These come from the application driver
        # Example: default/spark-pi.driver.DAGScheduler.stage.failedStages
        - pattern: metrics<name=(\S+)\.(\S+)\.driver\.(BlockManager|DAGScheduler|jvm)\.(\S+)><>Value
          name: spark_driver_$3_$4
          type: GAUGE
          labels:
            app_namespace: "$1"
            app_id: "$2"        
        # [ADD]
        - pattern: metrics<name=(\S+)\.(\S+)\.driver\.(BlockManager|DAGScheduler|jvm)\.(\S+)\.(\S+)><>Value
          name: spark_driver_$3_$4_$5
          type: GAUGE
          labels:
            app_namespace: "$1"
            app_id: "$2"        
        # These come from the application driver
        # Emulate timers for DAGScheduler like messageProcessingTime
        - pattern: metrics<name=(\S+)\.(\S+)\.driver\.DAGScheduler\.(.*)><>Count
          name: spark_driver_DAGScheduler_$3_count
          type: COUNTER
          labels:
            app_namespace: "$1"
            app_id: "$2"
        # HiveExternalCatalog is of type counter
        - pattern: metrics<name=(\S+)\.(\S+)\.driver\.HiveExternalCatalog\.(.*)><>Count
          name: spark_driver_HiveExternalCatalog_$3_count
          type: COUNTER
          labels:
            app_namespace: "$1"
            app_id: "$2"
        # These come from the application driver
        # Emulate histograms for CodeGenerator
        - pattern: metrics<name=(\S+)\.(\S+)\.driver\.CodeGenerator\.(.*)><>Count
          name: spark_driver_CodeGenerator_$3_count
          type: COUNTER
          labels:
            app_namespace: "$1"
            app_id: "$2"
        # These come from the application driver
        # Emulate timer (keep only count attribute) plus counters for LiveListenerBus
        - pattern: metrics<name=(\S+)\.(\S+)\.driver\.LiveListenerBus\.(.*)><>Count
          name: spark_driver_LiveListenerBus_$3_count
          type: COUNTER
          labels:
            app_namespace: "$1"
            app_id: "$2"
        # Get Gauge type metrics for LiveListenerBus
        - pattern: metrics<name=(\S+)\.(\S+)\.driver\.LiveListenerBus\.(.*)><>Value
          name: spark_driver_LiveListenerBus_$3
          type: GAUGE
          labels:
            app_namespace: "$1"
            app_id: "$2"
        # Executors counters
        - pattern: metrics<name=(\S+)\.(\S+)\.(.*)\.executor\.(.*)><>Count
          name: spark_executor_$4_count
          type: COUNTER
          labels:
            app_namespace: "$1"
            app_id: "$2"
            executor_id: "$3"
        # [ADD]
        - pattern: metrics<name=(\S+)\.(\S+)\.([0-9]+)\.executor\.(.*)><>Value
          name: spark_executor_$4
          type: GAUGE
          labels:
            app_namespace: "$1"
            app_id: "$2"
            executor_id: "$3"
        # These come from the application executors
        # Example: app-20160809000059-0000.0.jvm.threadpool.activeTasks
        - pattern: metrics<name=(\S+)\.(\S+)\.([0-9]+)\.(jvm|NettyBlockTransfer)\.(.*)><>Value
          name: spark_executor_$4_$5
          type: GAUGE
          labels:
            app_namespace: "$1"
            app_id: "$2"
            executor_id: "$3"
        - pattern: metrics<name=(\S+)\.(\S+)\.([0-9]+)\.HiveExternalCatalog\.(.*)><>Count
          name: spark_executor_HiveExternalCatalog_$4_count
          type: COUNTER
          labels:
            app_namespace: "$1"
            app_id: "$2"
            executor_id: "$3"
        # These come from the application driver
        # Emulate histograms for CodeGenerator
        - pattern: metrics<name=(\S+)\.(\S+)\.([0-9]+)\.CodeGenerator\.(.*)><>Count
          name: spark_executor_CodeGenerator_$4_count
          type: COUNTER
          labels:
            app_namespace: "$1"
            app_id: "$2"
            executor_id: "$3




Image Configuration


 

Create a metrics folder in the root directory (the Docker build context) with the following two files:


metrics.properties

*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
# Enable JvmSource for instance master, worker, driver and executor
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource

prometheus.yaml

---
lowercaseOutputName: true
attrNameSnakeCase: true
rules:
  - pattern: metrics<name=(\S+)\.(\S+)\.driver\.(BlockManager|DAGScheduler|jvm)\.(\S+)><>Value
    name: spark_driver_$3_$4
    type: GAUGE
    labels:
      app_namespace: "$1"
      app_id: "$2"
  # [ADD]
  - pattern: metrics<name=(\S+)\.(\S+)\.driver\.(BlockManager|DAGScheduler|jvm)\.(\S+)\.(\S+)><>Value
    name: spark_driver_$3_$4_$5
    type: GAUGE
    labels:
      app_namespace: "$1"
      app_id: "$2"
  - pattern: metrics<name=(\S+)\.(\S+)\.driver\.(\S+)\.StreamingMetrics\.streaming\.(\S+)><>Value
    name: spark_streaming_driver_$4
    type: GAUGE
    labels:
      app_namespace: "$1"
      app_id: "$2"
  - pattern: metrics<name=(\S+)\.(\S+)\.driver\.spark\.streaming\.(\S+)\.(\S+)><>Value
    name: spark_structured_streaming_driver_$4
    type: GAUGE
    labels:
      app_namespace: "$1"
      app_id: "$2"
      query_name: "$3"
  - pattern: metrics<name=(\S+)\.(\S+)\.(\S+)\.executor\.(\S+)><>Value
    name: spark_executor_$4
    type: GAUGE
    labels:
      app_namespace: "$1"
      app_id: "$2"
      executor_id: "$3"
  - pattern: metrics<name=(\S+)\.(\S+)\.driver\.DAGScheduler\.(.*)><>Count
    name: spark_driver_DAGScheduler_$3_count
    type: COUNTER
    labels:
      app_namespace: "$1"
      app_id: "$2"
  - pattern: metrics<name=(\S+)\.(\S+)\.driver\.HiveExternalCatalog\.(.*)><>Count
    name: spark_driver_HiveExternalCatalog_$3_count
    type: COUNTER
    labels:
      app_namespace: "$1"
      app_id: "$2"
  - pattern: metrics<name=(\S+)\.(\S+)\.driver\.CodeGenerator\.(.*)><>Count
    name: spark_driver_CodeGenerator_$3_count
    type: COUNTER
    labels:
      app_namespace: "$1"
      app_id: "$2"
  - pattern: metrics<name=(\S+)\.(\S+)\.driver\.LiveListenerBus\.(.*)><>Count
    name: spark_driver_LiveListenerBus_$3_count
    type: COUNTER
    labels:
      app_namespace: "$1"
      app_id: "$2"
  - pattern: metrics<name=(\S+)\.(\S+)\.driver\.LiveListenerBus\.(.*)><>Value
    name: spark_driver_LiveListenerBus_$3
    type: GAUGE
    labels:
      app_namespace: "$1"
      app_id: "$2"
  - pattern: metrics<name=(\S+)\.(\S+)\.(.*)\.executor\.(.*)><>Count
    name: spark_executor_$4_count
    type: COUNTER
    labels:
      app_namespace: "$1"
      app_id: "$2"
      executor_id: "$3"
  # [ADD]
  - pattern: metrics<name=(\S+)\.(\S+)\.([0-9]+)\.executor\.(.*)><>Value
    name: spark_executor_$4
    type: GAUGE
    labels:
      app_namespace: "$1"
      app_id: "$2"
      executor_id: "$3"
  - pattern: metrics<name=(\S+)\.(\S+)\.([0-9]+)\.(jvm|NettyBlockTransfer)\.(.*)><>Value
    name: spark_executor_$4_$5
    type: GAUGE
    labels:
      app_namespace: "$1"
      app_id: "$2"
      executor_id: "$3"
  - pattern: metrics<name=(\S+)\.(\S+)\.([0-9]+)\.HiveExternalCatalog\.(.*)><>Count
    name: spark_executor_HiveExternalCatalog_$4_count
    type: COUNTER
    labels:
      app_namespace: "$1"
      app_id: "$2"
      executor_id: "$3"
  - pattern: metrics<name=(\S+)\.(\S+)\.([0-9]+)\.CodeGenerator\.(.*)><>Count
    name: spark_executor_CodeGenerator_$4_count
    type: COUNTER
    labels:
      app_namespace: "$1"
      app_id: "$2"
      executor_id: "$3"

Add the following to the Dockerfile:

RUN mkdir -p /opt/spark/metrics/conf
COPY metrics/metrics.properties /opt/spark/metrics/conf
COPY metrics/prometheus.yaml /opt/spark/metrics/conf
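
If you are not using the operator's monitoring block, roughly the same effect can be achieved by attaching the agent yourself. A sketch of the equivalent spark-submit configuration, assuming the files are baked into the image at the paths above (the jmx_prometheus_javaagent argument format is <port>:<config file>):

spark.metrics.conf=/opt/spark/metrics/conf/metrics.properties
spark.driver.extraJavaOptions=-javaagent:/opt/spark/jars/jmx_prometheus_javaagent-0.17.2.jar=8090:/opt/spark/metrics/conf/prometheus.yaml
spark.executor.extraJavaOptions=-javaagent:/opt/spark/jars/jmx_prometheus_javaagent-0.17.2.jar=8090:/opt/spark/metrics/conf/prometheus.yaml

You can then verify the endpoint with curl http://<driver-pod-ip>:8090/metrics.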

 

List of Available Metrics Providers


Spark uses metrics of several types: gauge, counter, histogram, meter, and timer; see the Dropwizard library documentation for details. The following list of components and metrics reports the names and some details of the available metrics, grouped by component instance and source namespace. The metric types most commonly used in Spark instrumentation are gauges and counters. Counters can be recognized by their .count suffix. Timers, meters, and histograms are annotated in the list; the remaining list elements are metrics of type gauge. Most metrics are active as soon as their parent component instance is configured; some metrics additionally require enabling via extra configuration parameters, as detailed in the list.

Component instance = Driver

This is the component with the largest amount of instrumented metrics

  • namespace=BlockManager
  • disk.diskSpaceUsed_MB
  • memory.maxMem_MB
  • memory.maxOffHeapMem_MB
  • memory.maxOnHeapMem_MB
  • memory.memUsed_MB
  • memory.offHeapMemUsed_MB
  • memory.onHeapMemUsed_MB
  • memory.remainingMem_MB
  • memory.remainingOffHeapMem_MB
  • memory.remainingOnHeapMem_MB
  • namespace=HiveExternalCatalog
  • note: these metrics are conditional to a configuration parameter: spark.metrics.staticSources.enabled (default is true)
  • fileCacheHits.count
  • filesDiscovered.count
  • hiveClientCalls.count
  • parallelListingJobCount.count
  • partitionsFetched.count
  • namespace=CodeGenerator
  • note: these metrics are conditional to a configuration parameter: spark.metrics.staticSources.enabled (default is true)
  • compilationTime (histogram)
  • generatedClassSize (histogram)
  • generatedMethodSize (histogram)
  • sourceCodeSize (histogram)
  • namespace=DAGScheduler
  • job.activeJobs
  • job.allJobs
  • messageProcessingTime (timer)
  • stage.failedStages
  • stage.runningStages
  • stage.waitingStages
  • namespace=LiveListenerBus
  • listenerProcessingTime.org.apache.spark.HeartbeatReceiver (timer)
  • listenerProcessingTime.org.apache.spark.scheduler.EventLoggingListener (timer)
  • listenerProcessingTime.org.apache.spark.status.AppStatusListener (timer)
  • numEventsPosted.count
  • queue.appStatus.listenerProcessingTime (timer)
  • queue.appStatus.numDroppedEvents.count
  • queue.appStatus.size
  • queue.eventLog.listenerProcessingTime (timer)
  • queue.eventLog.numDroppedEvents.count
  • queue.eventLog.size
  • queue.executorManagement.listenerProcessingTime (timer)
  • namespace=appStatus (all metrics of type=counter)
  • note: Introduced in Spark 3.0. Conditional to a configuration parameter:spark.metrics.appStatusSource.enabled (default is false)
  • stages.failedStages.count
  • stages.skippedStages.count
  • stages.completedStages.count
  • tasks.blackListedExecutors.count // deprecated use excludedExecutors instead
  • tasks.excludedExecutors.count
  • tasks.completedTasks.count
  • tasks.failedTasks.count
  • tasks.killedTasks.count
  • tasks.skippedTasks.count
  • tasks.unblackListedExecutors.count // deprecated use unexcludedExecutors instead
  • tasks.unexcludedExecutors.count
  • jobs.succeededJobs
  • jobs.failedJobs
  • jobDuration
  • namespace=AccumulatorSource
  • note: User-configurable sources to attach accumulators to metric system
  • DoubleAccumulatorSource
  • LongAccumulatorSource
  • namespace=spark.streaming
  • note: This applies to Spark Structured Streaming only. Conditional to a configuration parameter: spark.sql.streaming.metricsEnabled=true (default is false)
  • eventTime-watermark
  • inputRate-total
  • latency
  • processingRate-total
  • states-rowsTotal
  • states-usedBytes
  • namespace=JVMCPU
  • jvmCpuTime
  • namespace=executor
  • note: These metrics are available in the driver in local mode only.
  • A full list of available metrics in this namespace can be found in the corresponding entry for the Executor component instance.
  • namespace=ExecutorMetrics
  • note: these metrics are conditional to a configuration parameter: spark.metrics.executorMetricsSource.enabled (default is true)
  • This source contains memory-related metrics. A full list of available metrics in this namespace can be found in the corresponding entry for the Executor component instance.
  • namespace=ExecutorAllocationManager
  • note: these metrics are only emitted when using dynamic allocation. Conditional to a configuration parameter spark.dynamicAllocation.enabled (default is false)
  • executors.numberExecutorsToAdd
  • executors.numberExecutorsPendingToRemove
  • executors.numberAllExecutors
  • executors.numberTargetExecutors
  • executors.numberMaxNeededExecutors
  • executors.numberExecutorsGracefullyDecommissioned.count
  • executors.numberExecutorsDecommissionUnfinished.count
  • executors.numberExecutorsExitedUnexpectedly.count
  • executors.numberExecutorsKilledByDriver.count
  • namespace=plugin.
  • Optional namespace(s). Metrics in this namespace are defined by user-supplied code, and configured using the Spark plugin API. See “Advanced Instrumentation” below for how to load custom plugins into Spark.

Component instance = Executor

These metrics are exposed by Spark executors.

  • namespace=executor (metrics are of type counter or gauge)
  • notes: spark.executor.metrics.fileSystemSchemes (default: file,hdfs) determines the exposed file system metrics.
  • bytesRead.count
  • bytesWritten.count
  • cpuTime.count
  • deserializeCpuTime.count
  • deserializeTime.count
  • diskBytesSpilled.count
  • filesystem.file.largeRead_ops
  • filesystem.file.read_bytes
  • filesystem.file.read_ops
  • filesystem.file.write_bytes
  • filesystem.file.write_ops
  • filesystem.hdfs.largeRead_ops
  • filesystem.hdfs.read_bytes
  • filesystem.hdfs.read_ops
  • filesystem.hdfs.write_bytes
  • filesystem.hdfs.write_ops
  • jvmGCTime.count
  • memoryBytesSpilled.count
  • recordsRead.count
  • recordsWritten.count
  • resultSerializationTime.count
  • resultSize.count
  • runTime.count
  • shuffleBytesWritten.count
  • shuffleFetchWaitTime.count
  • shuffleLocalBlocksFetched.count
  • shuffleLocalBytesRead.count
  • shuffleRecordsRead.count
  • shuffleRecordsWritten.count
  • shuffleRemoteBlocksFetched.count
  • shuffleRemoteBytesRead.count
  • shuffleRemoteBytesReadToDisk.count
  • shuffleTotalBytesRead.count
  • shuffleWriteTime.count
  • succeededTasks.count
  • threadpool.activeTasks
  • threadpool.completeTasks
  • threadpool.currentPool_size
  • threadpool.maxPool_size
  • threadpool.startedTasks
  • namespace=ExecutorMetrics
  • notes:
  • These metrics are conditional to a configuration parameter: spark.metrics.executorMetricsSource.enabled (default value is true)
  • ExecutorMetrics are updated as part of heartbeat processes scheduled for the executors and for the driver at regular intervals: spark.executor.heartbeatInterval (default value is 10 seconds)
  • An optional faster polling mechanism is available for executor memory metrics, it can be activated by setting a polling interval (in milliseconds) using the configuration parameter spark.executor.metrics.pollingInterval
  • JVMHeapMemory
  • JVMOffHeapMemory
  • OnHeapExecutionMemory
  • OnHeapStorageMemory
  • OnHeapUnifiedMemory
  • OffHeapExecutionMemory
  • OffHeapStorageMemory
  • OffHeapUnifiedMemory
  • DirectPoolMemory
  • MappedPoolMemory
  • MinorGCCount
  • MinorGCTime
  • MajorGCCount
  • MajorGCTime
  • “ProcessTree*” metric counters:
  • ProcessTreeJVMVMemory
  • ProcessTreeJVMRSSMemory
  • ProcessTreePythonVMemory
  • ProcessTreePythonRSSMemory
  • ProcessTreeOtherVMemory
  • ProcessTreeOtherRSSMemory
  • note: “ProcessTree” metrics are collected only under certain conditions. The conditions are the logical AND of the following: /proc filesystem exists, spark.executor.processTreeMetrics.enabled=true. “ProcessTree” metrics report 0 when those conditions are not met.
  • namespace=JVMCPU
  • jvmCpuTime
  • namespace=NettyBlockTransfer
  • shuffle-client.usedDirectMemory
  • shuffle-client.usedHeapMemory
  • shuffle-server.usedDirectMemory
  • shuffle-server.usedHeapMemory
  • namespace=HiveExternalCatalog
  • note: these metrics are conditional to a configuration parameter: spark.metrics.staticSources.enabled (default is true)
  • fileCacheHits.count
  • filesDiscovered.count
  • hiveClientCalls.count
  • parallelListingJobCount.count
  • partitionsFetched.count
  • namespace=CodeGenerator
  • note: these metrics are conditional to a configuration parameter: spark.metrics.staticSources.enabled (default is true)
  • compilationTime (histogram)
  • generatedClassSize (histogram)
  • generatedMethodSize (histogram)
  • sourceCodeSize (histogram)
  • namespace=plugin.
  • Optional namespace(s). Metrics in this namespace are defined by user-supplied code, and configured using the Spark plugin API. See “Advanced Instrumentation” below for how to load custom plugins into Spark.

Source = JVM Source

Notes:

  • Activate this source by setting the relevant metrics.properties file entry or the configuration parameter:spark.metrics.conf.*.source.jvm.class=org.apache.spark.metrics.source.JvmSource
  • These metrics are conditional to a configuration parameter: spark.metrics.staticSources.enabled (default is true)
  • This source is available for driver and executor instances and is also available for other instances.
  • This source provides information on JVM metrics using the Dropwizard/Codahale Metric Sets for JVM instrumentation and in particular the metric sets BufferPoolMetricSet, GarbageCollectorMetricSet and MemoryUsageGaugeSet.

Component instance = applicationMaster

Note: applies when running on YARN

  • numContainersPendingAllocate
  • numExecutorsFailed
  • numExecutorsRunning
  • numLocalityAwareTasks
  • numReleasedContainers

Component instance = mesos_cluster

Note: applies when running on mesos

  • waitingDrivers
  • launchedDrivers
  • retryDrivers

Component instance = master

Note: applies when running in Spark standalone as master

  • workers
  • aliveWorkers
  • apps
  • waitingApps

Component instance = ApplicationSource

Note: applies when running in Spark standalone as master

  • status
  • runtime_ms
  • cores

Component instance = worker

Note: applies when running in Spark standalone as worker

  • executors
  • coresUsed
  • memUsed_MB
  • coresFree
  • memFree_MB

Component instance = shuffleService

Note: applies to the shuffle service

  • blockTransferRate (meter) - rate of blocks being transferred
  • blockTransferMessageRate (meter) - rate of block transfer messages, i.e. if batch fetches are enabled, this represents number of batches rather than number of blocks
  • blockTransferRateBytes (meter)
  • blockTransferAvgTime_1min (gauge - 1-minute moving average)
  • numActiveConnections.count
  • numRegisteredConnections.count
  • numCaughtExceptions.count
  • openBlockRequestLatencyMillis (histogram)
  • registerExecutorRequestLatencyMillis (histogram)
  • registeredExecutorsSize
  • shuffle-server.usedDirectMemory
  • shuffle-server.usedHeapMemory

Reference:


https://spark.apache.org/docs/3.2.2/monitoring.html