Open Sourcing Kafka Monitor

Summary:

https://engineering.linkedin.com/blog/2016/05/open-sourcing-kafka-monitor

 

 

https://github.com/linkedin/kafka-monitor

https://github.com/Microsoft/Availability-Monitor-for-Kafka

 

 

Design Overview

Kafka Monitor makes it easy to develop and execute long-running Kafka-specific system tests in real clusters and to monitor an existing Kafka deployment against SLAs provided by users.

Developers can create new tests by composing reusable modules to emulate various scenarios (e.g. GC pauses, broker hard-kills, rolling bounces, disk failures, etc.) and collect metrics; users can run Kafka Monitor tests that execute these scenarios on a user-defined schedule on a test cluster or production cluster and validate that Kafka still functions as expected in these scenarios. Kafka Monitor is modeled as a manager for a collection of tests and services in order to achieve these goals.

A given Kafka Monitor instance runs in a single Java process and can spawn multiple tests/services in the same process. The diagram below demonstrates the relations between service, test and Kafka Monitor instance, as well as how Kafka Monitor interacts with a Kafka cluster and user.

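[Diagram: relations between services, tests and a Kafka Monitor instance, and how Kafka Monitor interacts with the Kafka cluster and the user]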

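What makes this platform interesting is that it goes beyond plain monitoring.

It also includes a complete test framework: you can define arbitrary tests, and each test is composed of services (i.e. components) such as the following: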

  • Produce service, which produces messages to Kafka and measures metrics such as produce rate and availability.
  • Consume service, which consumes messages from Kafka and measures metrics including message loss rate, message duplicate rate and end-to-end latency. This service depends on the produce service to provide messages that embed a message sequence number and timestamp.
  • Broker bounce service, which bounces a given broker at some pre-defined schedule.
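The way the consume service can derive its metrics from the embedded sequence number and timestamp can be sketched roughly as follows. This is a minimal illustration and not Kafka Monitor's actual code: the `Record` and `ConsumeStats` names are hypothetical, a single partition is assumed, and any record that arrives with an already-seen or older sequence number is simply counted as a duplicate.

```python
from dataclasses import dataclass


@dataclass
class Record:
    seq: int          # sequence number embedded by the produce service
    produced_ms: int  # producer-side timestamp in milliseconds


class ConsumeStats:
    """Hypothetical per-partition tracker for loss, duplicates and latency."""

    def __init__(self):
        self.next_expected = 0
        self.consumed = 0
        self.lost = 0
        self.duplicated = 0
        self.latencies_ms = []

    def on_record(self, record, now_ms):
        # End-to-end latency: consume time minus the embedded produce time.
        self.latencies_ms.append(now_ms - record.produced_ms)
        if record.seq == self.next_expected:
            self.consumed += 1
            self.next_expected += 1
        elif record.seq > self.next_expected:
            # A gap means the sequence numbers in between never arrived.
            self.lost += record.seq - self.next_expected
            self.consumed += 1
            self.next_expected = record.seq + 1
        else:
            # Sequence number already passed: treat it as a duplicate.
            self.duplicated += 1

    def loss_rate(self):
        total = self.consumed + self.lost
        return self.lost / total if total else 0.0


stats = ConsumeStats()
for seq, now_ms in [(0, 105), (1, 110), (1, 112), (4, 130)]:
    stats.on_record(Record(seq=seq, produced_ms=100), now_ms)
```

In this run, sequence numbers 2 and 3 are counted as lost and the second copy of sequence number 1 is counted as a duplicate.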

With the three services above, you can compose a test that exercises broker bounces.
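As a rough sketch of that composition, a test can be modeled as a named group of services that are started and stopped together. The `Service` and `MonitorTest` classes below are hypothetical stand-ins written for illustration; they do not reproduce the project's real interfaces, and the services here only track a running flag rather than talking to Kafka.

```python
class Service:
    """Hypothetical stand-in for a Kafka Monitor service."""

    def __init__(self, name):
        self.name = name
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        self.running = False


class MonitorTest:
    """Hypothetical test: a named collection of services run together."""

    def __init__(self, name, services):
        self.name = name
        self.services = list(services)

    def start(self):
        for service in self.services:
            service.start()

    def stop(self):
        # Stop in reverse order of start.
        for service in reversed(self.services):
            service.stop()

    def is_running(self):
        return all(service.running for service in self.services)


broker_bounce_test = MonitorTest(
    "broker-bounce",
    [Service("produce"), Service("consume"), Service("broker-bounce")],
)
broker_bounce_test.start()
```

Because the services are independent objects, the same produce and consume services could in principle be reused in other tests with a different fault-injection service.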

image 

Alternatively, building on the case above, two Kafka Monitor instances can be used to test replication between multiple datacenters.

 

Kafka Monitor Usage at LinkedIn

Monitoring Kafka Cluster Deployments

In early 2016 we deployed Kafka Monitor to monitor availability and end-to-end latency of every Kafka cluster at LinkedIn. This project wiki goes into the details of how these metrics are measured. These basic but critical metrics have been extremely useful to actively monitor the SLAs provided by our Kafka cluster deployment.

 

Validate Client Libraries Using End-to-End Workflows

As an earlier blog post explains, we have a client library that wraps around the vanilla Apache Kafka producer and consumer to provide various features that are not available in Apache Kafka, such as Avro encoding, auditing and support for large messages. We also have a REST client that allows non-Java applications to produce to and consume from Kafka. It is important to validate the functionality of these client libraries with each new Kafka release. Kafka Monitor allows users to plug in custom client libraries to be used in its end-to-end workflow. We have deployed Kafka Monitor instances that use our wrapper client and REST client in tests, to validate that their performance and functionality meet the requirements for every new release of these client libraries and Apache Kafka.

 

Certify New Internal Releases of Apache Kafka

We generally run off Apache Kafka trunk and cut a new internal release every quarter or so to pick up new features from Apache Kafka. A significant benefit of running off trunk is that deploying Kafka in LinkedIn’s production cluster has often detected problems in Apache Kafka trunk that can be fixed before official Apache Kafka releases.

Given the risk of running off Apache Kafka trunk, we take extra care to certify every internal release in a test cluster (which accepts traffic mirrored from production cluster(s)) for a few weeks before deploying the new release in production. For example, we do rolling bounces or hard-kill brokers, while checking JMX metrics to verify that there is exactly one controller and no offline partitions, in order to validate Kafka’s availability under failover scenarios. In the past, these steps were manual, which was very time-consuming and did not scale well with the number of events and types of scenarios we want to test. We are switching to Kafka Monitor to automate this process and cover more failover scenarios on a continual basis.
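The controller and offline-partition check described above could be automated along these lines. This is only a sketch: `ActiveControllerCount` and `OfflinePartitionsCount` are the standard Kafka controller metrics, but the `check_failover_health` function and the dictionary shape it takes are assumptions made for illustration, and collecting the values from each broker over JMX is left out.

```python
def check_failover_health(metrics_by_broker):
    """Return a list of problems found in per-broker controller metrics.

    metrics_by_broker is assumed to map a broker id to a dict holding that
    broker's ActiveControllerCount and OfflinePartitionsCount values.
    """
    problems = []

    # Exactly one broker in the cluster should report itself as controller.
    controllers = sum(
        m["ActiveControllerCount"] for m in metrics_by_broker.values()
    )
    if controllers != 1:
        problems.append(
            f"expected exactly 1 active controller, found {controllers}"
        )

    # No partition should be without an active leader.
    offline = sum(
        m["OfflinePartitionsCount"] for m in metrics_by_broker.values()
    )
    if offline != 0:
        problems.append(f"found {offline} offline partitions")

    return problems


healthy = {
    1: {"ActiveControllerCount": 1, "OfflinePartitionsCount": 0},
    2: {"ActiveControllerCount": 0, "OfflinePartitionsCount": 0},
}
split_brain = {
    1: {"ActiveControllerCount": 1, "OfflinePartitionsCount": 3},
    2: {"ActiveControllerCount": 1, "OfflinePartitionsCount": 0},
}
healthy_problems = check_failover_health(healthy)
split_brain_problems = check_failover_health(split_brain)
```

Running such a check repeatedly while brokers are being bounced gives a pass or fail signal for each failover scenario instead of a manual inspection.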
