Metrics
Overview
Spark's metrics are decoupled into different instances corresponding to Spark components. Within each instance you can configure a set of sinks to which metrics are reported. The following instances are currently supported:
- master: The Spark standalone master process.
- applications: A component within the master that reports on the various applications.
- worker: A Spark standalone worker process.
- executor: A Spark executor.
- driver: The Spark driver process (the process in which the SparkContext is created).
- shuffleService: The Spark shuffle service.
- applicationMaster: The Spark ApplicationMaster when running on YARN.
- mesos_cluster: The Spark cluster scheduler when running on Mesos.
Each instance can report to zero or more sinks. Sinks are contained in the org.apache.spark.metrics.sink package (a minimal wiring example follows the list below):
- ConsoleSink: Logs metrics information to the console.
- CSVSink: Exports metrics data to CSV files at regular intervals.
- JmxSink: Registers metrics for viewing in a JMX console.
- MetricsServlet: Adds a servlet within the existing Spark UI to serve metrics data as JSON.
- PrometheusServlet: (Experimental) Adds a servlet within the existing Spark UI to serve metrics data in Prometheus format.
- GraphiteSink: Sends metrics to a Graphite node.
- Slf4jSink: Sends metrics to slf4j as log entries.
- StatsdSink: Sends metrics to a StatsD node.
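To show how instances and sinks are wired together, here is a minimal sketch using Spark's standard metrics.properties syntax (entries follow the instance.sink.[name].[option] pattern). It is not part of this article's setup, and the /tmp/spark-metrics directory is a hypothetical example path:

# Send every instance's metrics to the console, polled every 10 seconds
*.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink
*.sink.console.period=10
*.sink.console.unit=seconds

# Additionally write the master instance's metrics to CSV files
master.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
master.sink.csv.period=10
master.sink.csv.unit=seconds
master.sink.csv.directory=/tmp/spark-metrics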
Job configuration
Method 1: Use a configuration file baked into the Docker image (the referenced prometheus.yaml is the file added to the image in the "Image configuration" section below)
monitoring:
  exposeDriverMetrics: true
  exposeExecutorMetrics: true
  prometheus:
    jmxExporterJar: "/opt/spark/jars/jmx_prometheus_javaagent-0.17.2.jar"
    port: 8090
    portName: http-metric
    configFile: "/opt/spark/metrics/conf/prometheus.yaml"
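For orientation, the monitoring block above is a fragment of a spark-on-k8s-operator SparkApplication manifest. The sketch below shows one possible placement; everything outside the monitoring section (name, namespace, image, main class, resources) is a hypothetical placeholder rather than part of this article:

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi            # hypothetical application name
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: my-registry/spark:3.2.2-metrics   # hypothetical image built as in "Image configuration" below
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.2.2.jar
  sparkVersion: "3.2.2"
  driver:
    cores: 1
    memory: "1g"
  executor:
    instances: 2
    cores: 1
    memory: "1g"
  monitoring:
    exposeDriverMetrics: true
    exposeExecutorMetrics: true
    prometheus:
      jmxExporterJar: "/opt/spark/jars/jmx_prometheus_javaagent-0.17.2.jar"
      port: 8090
      portName: http-metric
      configFile: "/opt/spark/metrics/conf/prometheus.yaml"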
Method 2: Embed the configuration directly in the manifest
monitoring:
  exposeDriverMetrics: true
  exposeExecutorMetrics: true
  metricsProperties: |
    *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
    # Enable JvmSource for instance master, worker, driver and executor
    master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
    worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
    driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
    executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
  prometheus:
    jmxExporterJar: "/opt/spark/jars/jmx_prometheus_javaagent-0.17.2.jar"
    port: 8090
    configuration: |
      lowercaseOutputName: true
      attrNameSnakeCase: true
      rules:
        # These come from the application driver if it's a streaming application
        # Example: default/streaming.driver.com.example.ClassName.StreamingMetrics.streaming.lastCompletedBatch_schedulingDelay
        - pattern: metrics<name=(\S+)\.(\S+)\.driver\.(\S+)\.StreamingMetrics\.streaming\.(\S+)><>Value
          name: spark_streaming_driver_$4
          labels:
            app_namespace: "$1"
            app_id: "$2"
        # These come from the application driver if it's a structured streaming application
        # Example: default/streaming.driver.spark.streaming.QueryName.inputRate-total
        - pattern: metrics<name=(\S+)\.(\S+)\.driver\.spark\.streaming\.(\S+)\.(\S+)><>Value
          name: spark_structured_streaming_driver_$4
          labels:
            app_namespace: "$1"
            app_id: "$2"
            query_name: "$3"
        # These come from the application executors
        # Example: default/spark-pi.0.executor.threadpool.activeTasks
        - pattern: metrics<name=(\S+)\.(\S+)\.(\S+)\.executor\.(\S+)><>Value
          name: spark_executor_$4
          type: GAUGE
          labels:
            app_namespace: "$1"
            app_id: "$2"
            executor_id: "$3"
        # These come from the application driver
        # Example: default/spark-pi.driver.DAGScheduler.stage.failedStages
        - pattern: metrics<name=(\S+)\.(\S+)\.driver\.(BlockManager|DAGScheduler|jvm)\.(\S+)><>Value
          name: spark_driver_$3_$4
          type: GAUGE
          labels:
            app_namespace: "$1"
            app_id: "$2"
        # [ADD]
        - pattern: metrics<name=(\S+)\.(\S+)\.driver\.(BlockManager|DAGScheduler|jvm)\.(\S+).(\S+)><>Value
          name: spark_driver_$3_$4_$5
          type: GAUGE
          labels:
            app_namespace: "$1"
            app_id: "$2"
        # These come from the application driver
        # Emulate timers for DAGScheduler like messagePRocessingTime
        - pattern: metrics<name=(\S+)\.(\S+)\.driver\.DAGScheduler\.(.*)><>Count
          name: spark_driver_DAGScheduler_$3_count
          type: COUNTER
          labels:
            app_namespace: "$1"
            app_id: "$2"
        # HiveExternalCatalog is of type counter
        - pattern: metrics<name=(\S+)\.(\S+)\.driver\.HiveExternalCatalog\.(.*)><>Count
          name: spark_driver_HiveExternalCatalog_$3_count
          type: COUNTER
          labels:
            app_namespace: "$1"
            app_id: "$2"
        # These come from the application driver
        # Emulate histograms for CodeGenerator
        - pattern: metrics<name=(\S+)\.(\S+)\.driver\.CodeGenerator\.(.*)><>Count
          name: spark_driver_CodeGenerator_$3_count
          type: COUNTER
          labels:
            app_namespace: "$1"
            app_id: "$2"
        # These come from the application driver
        # Emulate timer (keep only count attribute) plus counters for LiveListenerBus
        - pattern: metrics<name=(\S+)\.(\S+)\.driver\.LiveListenerBus\.(.*)><>Count
          name: spark_driver_LiveListenerBus_$3_count
          type: COUNTER
          labels:
            app_namespace: "$1"
            app_id: "$2"
        # Get Gauge type metrics for LiveListenerBus
        - pattern: metrics<name=(\S+)\.(\S+)\.driver\.LiveListenerBus\.(.*)><>Value
          name: spark_driver_LiveListenerBus_$3
          type: GAUGE
          labels:
            app_namespace: "$1"
            app_id: "$2"
        # Executors counters
        - pattern: metrics<name=(\S+)\.(\S+)\.(.*)\.executor\.(.*)><>Count
          name: spark_executor_$4_count
          type: COUNTER
          labels:
            app_namespace: "$1"
            app_id: "$2"
            executor_id: "$3"
        # [ADD]
        - pattern: metrics<name=(\S+)\.(\S+)\.([0-9]+)\.executor\.(.*)><>Value
          name: spark_executor_$4
          type: GAUGE
          labels:
            app_namespace: "$1"
            app_id: "$2"
            executor_id: "$3"
        # These come from the application executors
        # Example: app-20160809000059-0000.0.jvm.threadpool.activeTasks
        - pattern: metrics<name=(\S+)\.(\S+)\.([0-9]+)\.(jvm|NettyBlockTransfer)\.(.*)><>Value
          name: spark_executor_$4_$5
          type: GAUGE
          labels:
            app_namespace: "$1"
            app_id: "$2"
            executor_id: "$3"
        - pattern: metrics<name=(\S+)\.(\S+)\.([0-9]+)\.HiveExternalCatalog\.(.*)><>Count
          name: spark_executor_HiveExternalCatalog_$4_count
          type: COUNTER
          labels:
            app_namespace: "$1"
            app_id: "$2"
            executor_id: "$3"
        # These come from the application driver
        # Emulate histograms for CodeGenerator
        - pattern: metrics<name=(\S+)\.(\S+)\.([0-9]+)\.CodeGenerator\.(.*)><>Count
          name: spark_executor_CodeGenerator_$4_count
          type: COUNTER
          labels:
            app_namespace: "$1"
            app_id: "$2"
            executor_id: "$3"
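Once the JMX exporter agent exposes metrics on port 8090, Prometheus still needs to discover the driver and executor pods. The operator is commonly expected to add the conventional prometheus.io/* annotations to these pods when monitoring.prometheus is set (verify this against your operator version); assuming those annotations, a standard annotation-based scrape job is sketched below:

scrape_configs:
  - job_name: spark-applications        # hypothetical job name
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # scrape the port given in the prometheus.io/port annotation (8090 here)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
        target_label: __address__
      # honour a custom metrics path if prometheus.io/path is set
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: '(.+)'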
Image configuration
Create a metrics folder in the root directory with the following two files:
metrics.properties
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink

# Enable JvmSource for instance master, worker, driver and executor
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
prometheus.yaml
---
lowercaseOutputName: true
attrNameSnakeCase: true
rules:
  - pattern: metrics<name=(\S+)\.(\S+)\.driver\.(BlockManager|DAGScheduler|jvm)\.(\S+)><>Value
    name: spark_driver_$3_$4
    type: GAUGE
    labels:
      app_namespace: "$1"
      app_id: "$2"
  # [ADD]
  - pattern: metrics<name=(\S+)\.(\S+)\.driver\.(BlockManager|DAGScheduler|jvm)\.(\S+).(\S+)><>Value
    name: spark_driver_$3_$4_$5
    type: GAUGE
    labels:
      app_namespace: "$1"
      app_id: "$2"
  - pattern: metrics<name=(\S+)\.(\S+)\.driver\.(\S+)\.StreamingMetrics\.streaming\.(\S+)><>Value
    name: spark_streaming_driver_$4
    type: GAUGE
    labels:
      app_namespace: "$1"
      app_id: "$2"
  - pattern: metrics<name=(\S+)\.(\S+)\.driver\.spark\.streaming\.(\S+)\.(\S+)><>Value
    name: spark_structured_streaming_driver_$4
    type: GAUGE
    labels:
      app_namespace: "$1"
      app_id: "$2"
      query_name: "$3"
  - pattern: metrics<name=(\S+)\.(\S+)\.(\S+)\.executor\.(\S+)><>Value
    name: spark_executor_$4
    type: GAUGE
    labels:
      app_namespace: "$1"
      app_id: "$2"
      executor_id: "$3"
  - pattern: metrics<name=(\S+)\.(\S+)\.driver\.DAGScheduler\.(.*)><>Count
    name: spark_driver_DAGScheduler_$3_count
    type: COUNTER
    labels:
      app_namespace: "$1"
      app_id: "$2"
  - pattern: metrics<name=(\S+)\.(\S+)\.driver\.HiveExternalCatalog\.(.*)><>Count
    name: spark_driver_HiveExternalCatalog_$3_count
    type: COUNTER
    labels:
      app_namespace: "$1"
      app_id: "$2"
  - pattern: metrics<name=(\S+)\.(\S+)\.driver\.CodeGenerator\.(.*)><>Count
    name: spark_driver_CodeGenerator_$3_count
    type: COUNTER
    labels:
      app_namespace: "$1"
      app_id: "$2"
  - pattern: metrics<name=(\S+)\.(\S+)\.driver\.LiveListenerBus\.(.*)><>Count
    name: spark_driver_LiveListenerBus_$3_count
    type: COUNTER
    labels:
      app_namespace: "$1"
      app_id: "$2"
  - pattern: metrics<name=(\S+)\.(\S+)\.driver\.LiveListenerBus\.(.*)><>Value
    name: spark_driver_LiveListenerBus_$3
    type: GAUGE
    labels:
      app_namespace: "$1"
      app_id: "$2"
  - pattern: metrics<name=(\S+)\.(\S+)\.(.*)\.executor\.(.*)><>Count
    name: spark_executor_$4_count
    type: COUNTER
    labels:
      app_namespace: "$1"
      app_id: "$2"
      executor_id: "$3"
  # [ADD]
  - pattern: metrics<name=(\S+)\.(\S+)\.([0-9]+)\.executor\.(.*)><>Value
    name: spark_executor_$4
    type: GAUGE
    labels:
      app_namespace: "$1"
      app_id: "$2"
      executor_id: "$3"
  - pattern: metrics<name=(\S+)\.(\S+)\.([0-9]+)\.(jvm|NettyBlockTransfer)\.(.*)><>Value
    name: spark_executor_$4_$5
    type: GAUGE
    labels:
      app_namespace: "$1"
      app_id: "$2"
      executor_id: "$3"
  - pattern: metrics<name=(\S+)\.(\S+)\.([0-9]+)\.HiveExternalCatalog\.(.*)><>Count
    name: spark_executor_HiveExternalCatalog_$4_count
    type: COUNTER
    labels:
      app_namespace: "$1"
      app_id: "$2"
      executor_id: "$3"
  - pattern: metrics<name=(\S+)\.(\S+)\.([0-9]+)\.CodeGenerator\.(.*)><>Count
    name: spark_executor_CodeGenerator_$4_count
    type: COUNTER
    labels:
      app_namespace: "$1"
      app_id: "$2"
      executor_id: "$3"
Add the following to the Dockerfile:
RUN mkdir -p /opt/spark/metrics/conf
COPY metrics/metrics.properties /opt/spark/metrics/conf
COPY metrics/prometheus.yaml /opt/spark/metrics/conf
List of available metrics providers
Metrics used by Spark are of multiple types: gauge, counter, histogram, meter and timer; see the Dropwizard library documentation for details. The following list of components and metrics reports the names and some details of the available metrics, grouped per component instance and source namespace. The most common types of metrics used in Spark instrumentation are gauges and counters. Counters can be recognized by their .count suffix. Timers, meters and histograms are annotated in the list; the remaining list elements are metrics of type gauge. The large majority of metrics are active as soon as their parent component instance is configured; some metrics also need to be enabled via additional configuration parameters, with the details reported in the list.
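Several of the namespaces listed below are only emitted when such an additional configuration parameter is enabled. As a hedged illustration (not taken from this article), these parameters can be supplied through the operator's sparkConf section; the spec wrapper shown here is assumed, while the property names are standard Spark configuration keys:

spec:
  sparkConf:
    # appStatus namespace (Spark 3.0+), disabled by default
    "spark.metrics.appStatusSource.enabled": "true"
    # spark.streaming namespace for Structured Streaming, disabled by default
    "spark.sql.streaming.metricsEnabled": "true"
    # HiveExternalCatalog and CodeGenerator static sources, enabled by default
    "spark.metrics.staticSources.enabled": "true"
    # ExecutorMetrics source, enabled by default
    "spark.metrics.executorMetricsSource.enabled": "true"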
Component instance = Driver
This is the component with the largest amount of instrumented metrics
- namespace=BlockManager
- disk.diskSpaceUsed_MB
- memory.maxMem_MB
- memory.maxOffHeapMem_MB
- memory.maxOnHeapMem_MB
- memory.memUsed_MB
- memory.offHeapMemUsed_MB
- memory.onHeapMemUsed_MB
- memory.remainingMem_MB
- memory.remainingOffHeapMem_MB
- memory.remainingOnHeapMem_MB
- namespace=HiveExternalCatalog
- note: these metrics are conditional to a configuration parameter: spark.metrics.staticSources.enabled (default is true)
- fileCacheHits.count
- filesDiscovered.count
- hiveClientCalls.count
- parallelListingJobCount.count
- partitionsFetched.count
- namespace=CodeGenerator
- note: these metrics are conditional to a configuration parameter: spark.metrics.staticSources.enabled (default is true)
- compilationTime (histogram)
- generatedClassSize (histogram)
- generatedMethodSize (histogram)
- sourceCodeSize (histogram)
- namespace=DAGScheduler
- job.activeJobs
- job.allJobs
- messageProcessingTime (timer)
- stage.failedStages
- stage.runningStages
- stage.waitingStages
- namespace=LiveListenerBus
- listenerProcessingTime.org.apache.spark.HeartbeatReceiver (timer)
- listenerProcessingTime.org.apache.spark.scheduler.EventLoggingListener (timer)
- listenerProcessingTime.org.apache.spark.status.AppStatusListener (timer)
- numEventsPosted.count
- queue.appStatus.listenerProcessingTime (timer)
- queue.appStatus.numDroppedEvents.count
- queue.appStatus.size
- queue.eventLog.listenerProcessingTime (timer)
- queue.eventLog.numDroppedEvents.count
- queue.eventLog.size
- queue.executorManagement.listenerProcessingTime (timer)
- namespace=appStatus (all metrics of type=counter)
- note: Introduced in Spark 3.0. Conditional to a configuration parameter: spark.metrics.appStatusSource.enabled (default is false)
- stages.failedStages.count
- stages.skippedStages.count
- stages.completedStages.count
- tasks.blackListedExecutors.count // deprecated use excludedExecutors instead
- tasks.excludedExecutors.count
- tasks.completedTasks.count
- tasks.failedTasks.count
- tasks.killedTasks.count
- tasks.skippedTasks.count
- tasks.unblackListedExecutors.count // deprecated use unexcludedExecutors instead
- tasks.unexcludedExecutors.count
- jobs.succeededJobs
- jobs.failedJobs
- jobDuration
- namespace=AccumulatorSource
- note: User-configurable sources to attach accumulators to metric system
- DoubleAccumulatorSource
- LongAccumulatorSource
- namespace=spark.streaming
- note: This applies to Spark Structured Streaming only. Conditional to a configuration parameter: spark.sql.streaming.metricsEnabled=true (default is false)
- eventTime-watermark
- inputRate-total
- latency
- processingRate-total
- states-rowsTotal
- states-usedBytes
- namespace=JVMCPU
- jvmCpuTime
- namespace=executor
- note: These metrics are available in the driver in local mode only.
- A full list of available metrics in this namespace can be found in the corresponding entry for the Executor component instance.
- namespace=ExecutorMetrics
- note: these metrics are conditional to a configuration parameter: spark.metrics.executorMetricsSource.enabled (default is true)
- This source contains memory-related metrics. A full list of available metrics in this namespace can be found in the corresponding entry for the Executor component instance.
- namespace=ExecutorAllocationManager
- note: these metrics are only emitted when using dynamic allocation. Conditional to a configuration parameter: spark.dynamicAllocation.enabled (default is false)
- executors.numberExecutorsToAdd
- executors.numberExecutorsPendingToRemove
- executors.numberAllExecutors
- executors.numberTargetExecutors
- executors.numberMaxNeededExecutors
- executors.numberExecutorsGracefullyDecommissioned.count
- executors.numberExecutorsDecommissionUnfinished.count
- executors.numberExecutorsExitedUnexpectedly.count
- executors.numberExecutorsKilledByDriver.count
- namespace=plugin.<Plugin Class Name>
- Optional namespace(s). Metrics in this namespace are defined by user-supplied code, and configured using the Spark plugin API. See “Advanced Instrumentation” below for how to load custom plugins into Spark.
Component instance = Executor
These metrics are exposed by Spark executors.
- namespace=executor (metrics are of type counter or gauge)
- notes: spark.executor.metrics.fileSystemSchemes (default: file,hdfs) determines the exposed file system metrics.
- bytesRead.count
- bytesWritten.count
- cpuTime.count
- deserializeCpuTime.count
- deserializeTime.count
- diskBytesSpilled.count
- filesystem.file.largeRead_ops
- filesystem.file.read_bytes
- filesystem.file.read_ops
- filesystem.file.write_bytes
- filesystem.file.write_ops
- filesystem.hdfs.largeRead_ops
- filesystem.hdfs.read_bytes
- filesystem.hdfs.read_ops
- filesystem.hdfs.write_bytes
- filesystem.hdfs.write_ops
- jvmGCTime.count
- memoryBytesSpilled.count
- recordsRead.count
- recordsWritten.count
- resultSerializationTime.count
- resultSize.count
- runTime.count
- shuffleBytesWritten.count
- shuffleFetchWaitTime.count
- shuffleLocalBlocksFetched.count
- shuffleLocalBytesRead.count
- shuffleRecordsRead.count
- shuffleRecordsWritten.count
- shuffleRemoteBlocksFetched.count
- shuffleRemoteBytesRead.count
- shuffleRemoteBytesReadToDisk.count
- shuffleTotalBytesRead.count
- shuffleWriteTime.count
- succeededTasks.count
- threadpool.activeTasks
- threadpool.completeTasks
- threadpool.currentPool_size
- threadpool.maxPool_size
- threadpool.startedTasks
- namespace=ExecutorMetrics
- notes:
- These metrics are conditional to a configuration parameter: spark.metrics.executorMetricsSource.enabled (default value is true)
- ExecutorMetrics are updated as part of heartbeat processes scheduled for the executors and for the driver at regular intervals: spark.executor.heartbeatInterval (default value is 10 seconds)
- An optional faster polling mechanism is available for executor memory metrics, it can be activated by setting a polling interval (in milliseconds) using the configuration parameter spark.executor.metrics.pollingInterval (see the configuration sketch after this component instance's list)
- JVMHeapMemory
- JVMOffHeapMemory
- OnHeapExecutionMemory
- OnHeapStorageMemory
- OnHeapUnifiedMemory
- OffHeapExecutionMemory
- OffHeapStorageMemory
- OffHeapUnifiedMemory
- DirectPoolMemory
- MappedPoolMemory
- MinorGCCount
- MinorGCTime
- MajorGCCount
- MajorGCTime
- “ProcessTree*” metric counters:
- ProcessTreeJVMVMemory
- ProcessTreeJVMRSSMemory
- ProcessTreePythonVMemory
- ProcessTreePythonRSSMemory
- ProcessTreeOtherVMemory
- ProcessTreeOtherRSSMemory
- note: “ProcessTree” metrics are collected only under certain conditions. The conditions are the logical AND of the following: /proc filesystem exists, spark.executor.processTreeMetrics.enabled=true. “ProcessTree” metrics report 0 when those conditions are not met.
- namespace=JVMCPU
- jvmCpuTime
- namespace=NettyBlockTransfer
- shuffle-client.usedDirectMemory
- shuffle-client.usedHeapMemory
- shuffle-server.usedDirectMemory
- shuffle-server.usedHeapMemory
- namespace=HiveExternalCatalog
- note: these metrics are conditional to a configuration parameter: spark.metrics.staticSources.enabled (default is true)
- fileCacheHits.count
- filesDiscovered.count
- hiveClientCalls.count
- parallelListingJobCount.count
- partitionsFetched.count
- namespace=CodeGenerator
- note: these metrics are conditional to a configuration parameter: spark.metrics.staticSources.enabled (default is true)
- compilationTime (histogram)
- generatedClassSize (histogram)
- generatedMethodSize (histogram)
- sourceCodeSize (histogram)
- namespace=plugin.<Plugin Class Name>
- Optional namespace(s). Metrics in this namespace are defined by user-supplied code, and configured using the Spark plugin API. See “Advanced Instrumentation” below for how to load custom plugins into Spark.
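As referenced in the ExecutorMetrics notes above, executor-side collection can be tuned with a few Spark properties. This is a hedged sketch with illustrative values, again expressed as operator sparkConf entries (the spec wrapper is assumed):

spec:
  sparkConf:
    # collect ProcessTree* memory metrics (requires a /proc filesystem)
    "spark.executor.processTreeMetrics.enabled": "true"
    # poll executor memory metrics every 5 seconds instead of only at heartbeats
    "spark.executor.metrics.pollingInterval": "5000"
    # filesystem schemes exposed under the executor namespace (default: file,hdfs)
    "spark.executor.metrics.fileSystemSchemes": "file,hdfs"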
Source = JVM Source
Notes:
- Activate this source by setting the relevant metrics.properties file entry or the configuration parameter: spark.metrics.conf.*.source.jvm.class=org.apache.spark.metrics.source.JvmSource
- These metrics are conditional to a configuration parameter: spark.metrics.staticSources.enabled (default is true)
- This source is available for driver and executor instances and is also available for other instances.
- This source provides information on JVM metrics using the Dropwizard/Codahale Metric Sets for JVM instrumentation and in particular the metric sets BufferPoolMetricSet, GarbageCollectorMetricSet and MemoryUsageGaugeSet.
Component instance = applicationMaster
Note: applies when running on YARN
- numContainersPendingAllocate
- numExecutorsFailed
- numExecutorsRunning
- numLocalityAwareTasks
- numReleasedContainers
Component instance = mesos_cluster
Note: applies when running on mesos
- waitingDrivers
- launchedDrivers
- retryDrivers
Component instance = master
Note: applies when running in Spark standalone as master
- workers
- aliveWorkers
- apps
- waitingApps
Component instance = ApplicationSource
Note: applies when running in Spark standalone as master
- status
- runtime_ms
- cores
Component instance = worker
Note: applies when running in Spark standalone as worker
- executors
- coresUsed
- memUsed_MB
- coresFree
- memFree_MB
Component instance = shuffleService
Note: applies to the shuffle service
- blockTransferRate (meter) - rate of blocks being transferred
- blockTransferMessageRate (meter) - rate of block transfer messages, i.e. if batch fetches are enabled, this represents number of batches rather than number of blocks
- blockTransferRateBytes (meter)
- blockTransferAvgTime_1min (gauge - 1-minute moving average)
- numActiveConnections.count
- numRegisteredConnections.count
- numCaughtExceptions.count
- openBlockRequestLatencyMillis (histogram)
- registerExecutorRequestLatencyMillis (histogram)
- registeredExecutorsSize
- shuffle-server.usedDirectMemory
- shuffle-server.usedHeapMemory
Reference:
https://spark.apache.org/docs/3.2.2/monitoring.html