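Users can define SLOs manually on top of Prometheus metrics, but the process is fairly tedious. Alibaba Cloud Service Mesh (ASM) can generate SLOs together with the matching alerting rules, which simplifies the process to writing a YAML configuration for a custom resource. This article describes how to use ASM to define an SLO at the application service level.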
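Articles in this series:
Enabling SLOs for application services in ASM (1): Overview of service level objectives (SLOs)
https://developer.aliyun.com/article/1114965
Enabling SLOs for application services in ASM (2): SLO definitions in a service mesh
https://developer.aliyun.com/article/1115135
Enabling SLOs for application services in ASM (3): Using ASM to define application-service-level SLOs
https://developer.aliyun.com/article/1115152
Enabling SLOs for application services in ASM (4): Importing the generated rules into Prometheus to enforce SLOs
https://developer.aliyun.com/article/1115171
Enabling SLOs for application services in ASM (5): Viewing SLOs in Grafana
https://developer.aliyun.com/article/1115187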
Prerequisites
- An ASM instance has been created, and its version is 1.15.3 or later. For details, see Create an ASM instance.
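Define the SLO configuration
The sample configuration below generates a service availability SLO for the httpbin service in the default namespace, with a target of 99% over a 30-day period, and configures alerts at two severity levels, Page and Ticket. For more on customizing the configuration file, see: https://developer.aliyun.com/article/1115135?spm=a2c6h.13262185.profile.58.6c9a35fe5AiH8r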
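To make these numbers concrete, the following Python sketch works through the arithmetic behind a 99% objective over a 30-day period. The function names are illustrative only and are not part of ASM; the sketch assumes the error budget is simply the complement of the objective.

```python
from datetime import timedelta


def error_budget(objective_percent: float) -> float:
    """Fraction of requests that may fail, e.g. 99 -> roughly 0.01."""
    return 1 - objective_percent / 100


def budget_as_full_outage(objective_percent: float, period: timedelta) -> timedelta:
    """Length of a total outage that would use up the whole budget."""
    return period * error_budget(objective_percent)


def budget_remaining(period_error_ratio: float, objective_percent: float) -> float:
    """Share of the error budget still unspent, given the error ratio
    observed over the whole SLO period (hypothetical input)."""
    return 1 - period_error_ratio / error_budget(objective_percent)


if __name__ == "__main__":
    period = timedelta(days=30)
    print("error budget:", round(error_budget(99), 4))
    print("equivalent full outage:", budget_as_full_outage(99, period))
    print("remaining at 0.4% errors:", round(budget_remaining(0.004, 99), 4))
```

In other words, a 99% target over 30 days tolerates the equivalent of a little over seven hours of complete unavailability before the budget is exhausted.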
Save the following YAML configuration as prometheusservicelevel.yaml:

apiVersion: istio.alibabacloud.com/v1beta1
kind: ServiceLevelObjective
metadata:
  name: asm-slo-default-httpbin
  namespace: default  # namespace of the custom resource
spec:
  service: httpbin  # name of the target service
  period: 30d  # SLO period
  slos:
  - name: asm-slo  # SLO name
    objective: "99"  # target value
    sli:
      plugin:
        id: availability  # plugin type to use
    alerting:
      name: asm-alert  # name of the alerting rule

Then, using the kubeconfig of the ASM instance, run the following kubectl command to deploy it to the mesh:

kubectl apply -f prometheusservicelevel.yaml
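Automatically generate Prometheus rules
After the command completes, run the following command to view the result:

# In this example, replace the contents of the braces with default and httpbin
kubectl get prometheusservicelevel asm-slo-{namespace of the target service}-{name of the target service} -o yaml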
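The generated status field looks similar to the following:

status:
  ......
  status: success
  prometheusRules:  # the generated Prometheus rule file

The prometheusRules field contains the Prometheus rules in YAML format. The rules generated from the configuration above look like this: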

groups:
- name: asm-slo-sli-recordings-httpbin-asm-slo
  rules:
  - record: slo:sli_error:ratio_rate5m
    expr: "(\n(\n sum(rate(istio_requests_total{ destination_service_name=\"httpbin\",destination_service_namespace=\"default\",response_code=~\"(5..|429)\" }[5m])) \n / \n (sum(rate(istio_requests_total{ destination_service_name=\"httpbin\",destination_service_namespace=\"default\" }[5m])) > 0)\n) OR on() vector(0)\n)"
    labels:
      asm_slo: asm-slo
      slo_id: httpbin-asm-slo
      slo_service: httpbin
      slo_window: 5m
  - record: slo:sli_error:ratio_rate30m
    expr: "(\n(\n sum(rate(istio_requests_total{ destination_service_name=\"httpbin\",destination_service_namespace=\"default\",response_code=~\"(5..|429)\" }[30m])) \n / \n (sum(rate(istio_requests_total{ destination_service_name=\"httpbin\",destination_service_namespace=\"default\" }[30m])) > 0)\n) OR on() vector(0)\n)"
    labels:
      asm_slo: asm-slo
      slo_id: httpbin-asm-slo
      slo_service: httpbin
      slo_window: 30m
  - record: slo:sli_error:ratio_rate1h
    expr: "(\n(\n sum(rate(istio_requests_total{ destination_service_name=\"httpbin\",destination_service_namespace=\"default\",response_code=~\"(5..|429)\" }[1h])) \n / \n (sum(rate(istio_requests_total{ destination_service_name=\"httpbin\",destination_service_namespace=\"default\" }[1h])) > 0)\n) OR on() vector(0)\n)"
    labels:
      asm_slo: asm-slo
      slo_id: httpbin-asm-slo
      slo_service: httpbin
      slo_window: 1h
  - record: slo:sli_error:ratio_rate2h
    expr: "(\n(\n sum(rate(istio_requests_total{ destination_service_name=\"httpbin\",destination_service_namespace=\"default\",response_code=~\"(5..|429)\" }[2h])) \n / \n (sum(rate(istio_requests_total{ destination_service_name=\"httpbin\",destination_service_namespace=\"default\" }[2h])) > 0)\n) OR on() vector(0)\n)"
    labels:
      asm_slo: asm-slo
      slo_id: httpbin-asm-slo
      slo_service: httpbin
      slo_window: 2h
  - record: slo:sli_error:ratio_rate6h
    expr: "(\n(\n sum(rate(istio_requests_total{ destination_service_name=\"httpbin\",destination_service_namespace=\"default\",response_code=~\"(5..|429)\" }[6h])) \n / \n (sum(rate(istio_requests_total{ destination_service_name=\"httpbin\",destination_service_namespace=\"default\" }[6h])) > 0)\n) OR on() vector(0)\n)"
    labels:
      asm_slo: asm-slo
      slo_id: httpbin-asm-slo
      slo_service: httpbin
      slo_window: 6h
  - record: slo:sli_error:ratio_rate1d
    expr: "(\n(\n sum(rate(istio_requests_total{ destination_service_name=\"httpbin\",destination_service_namespace=\"default\",response_code=~\"(5..|429)\" }[1d])) \n / \n (sum(rate(istio_requests_total{ destination_service_name=\"httpbin\",destination_service_namespace=\"default\" }[1d])) > 0)\n) OR on() vector(0)\n)"
    labels:
      asm_slo: asm-slo
      slo_id: httpbin-asm-slo
      slo_service: httpbin
      slo_window: 1d
  - record: slo:sli_error:ratio_rate3d
    expr: "(\n(\n sum(rate(istio_requests_total{ destination_service_name=\"httpbin\",destination_service_namespace=\"default\",response_code=~\"(5..|429)\" }[3d])) \n / \n (sum(rate(istio_requests_total{ destination_service_name=\"httpbin\",destination_service_namespace=\"default\" }[3d])) > 0)\n) OR on() vector(0)\n)"
    labels:
      asm_slo: asm-slo
      slo_id: httpbin-asm-slo
      slo_service: httpbin
      slo_window: 3d
  - record: slo:sli_error:ratio_rate30d
    expr: |
      sum_over_time(slo:sli_error:ratio_rate5m{asm_slo="asm-slo", slo_id="httpbin-asm-slo", slo_service="httpbin"}[30d])
      / ignoring (slo_window)
      count_over_time(slo:sli_error:ratio_rate5m{asm_slo="asm-slo", slo_id="httpbin-asm-slo", slo_service="httpbin"}[30d])
    labels:
      slo_window: 30d
- name: asm-slo-meta-recordings-httpbin-asm-slo
  rules:
  - record: slo:objective:ratio
    expr: vector(0.99)
    labels:
      asm_slo: asm-slo
      slo_id: httpbin-asm-slo
      slo_service: httpbin
  - record: slo:error_budget:ratio
    expr: vector(1-0.99)
    labels:
      asm_slo: asm-slo
      slo_id: httpbin-asm-slo
      slo_service: httpbin
  - record: slo:time_period:days
    expr: vector(30)
    labels:
      asm_slo: asm-slo
      slo_id: httpbin-asm-slo
      slo_service: httpbin
  - record: slo:current_burn_rate:ratio
    expr: |
      slo:sli_error:ratio_rate5m{asm_slo="asm-slo", slo_id="httpbin-asm-slo", slo_service="httpbin"}
      / on(slo_id, asm_slo, slo_service) group_left
      slo:error_budget:ratio{asm_slo="asm-slo", slo_id="httpbin-asm-slo", slo_service="httpbin"}
    labels:
      asm_slo: asm-slo
      slo_id: httpbin-asm-slo
      slo_service: httpbin
  - record: slo:period_burn_rate:ratio
    expr: |
      slo:sli_error:ratio_rate30d{asm_slo="asm-slo", slo_id="httpbin-asm-slo", slo_service="httpbin"}
      / on(slo_id, asm_slo, slo_service) group_left
      slo:error_budget:ratio{asm_slo="asm-slo", slo_id="httpbin-asm-slo", slo_service="httpbin"}
    labels:
      asm_slo: asm-slo
      slo_id: httpbin-asm-slo
      slo_service: httpbin
  - record: slo:period_error_budget_remaining:ratio
    expr: 1 - slo:period_burn_rate:ratio{asm_slo="asm-slo", slo_id="httpbin-asm-slo", slo_service="httpbin"}
    labels:
      asm_slo: asm-slo
      slo_id: httpbin-asm-slo
      slo_service: httpbin
  - record: asm_slo_info
    expr: vector(1)
    labels:
      asm_slo: asm-slo
      slo_id: httpbin-asm-slo
      slo_mode: cli-gen-prom
      slo_objective: "99"
      slo_service: httpbin
      slo_spec: prometheus/v1
      slo_version: dev
- name: asm-slo-alerts-httpbin-asm-slo
  rules:
  - alert: asm-alert
    expr: |
      (
          (slo:sli_error:ratio_rate5m{asm_slo="asm-slo", slo_id="httpbin-asm-slo", slo_service="httpbin"} > (14.4 * 0.01))
          and ignoring (slo_window)
          (slo:sli_error:ratio_rate1h{asm_slo="asm-slo", slo_id="httpbin-asm-slo", slo_service="httpbin"} > (14.4 * 0.01))
      )
      or ignoring (slo_window)
      (
          (slo:sli_error:ratio_rate30m{asm_slo="asm-slo", slo_id="httpbin-asm-slo", slo_service="httpbin"} > (6 * 0.01))
          and ignoring (slo_window)
          (slo:sli_error:ratio_rate6h{asm_slo="asm-slo", slo_id="httpbin-asm-slo", slo_service="httpbin"} > (6 * 0.01))
      )
    labels:
      slo_severity: page
    annotations:
      summary: '{{$labels.slo_service}} {{$labels.asm_slo}} SLO error budget burn rate is over expected.'
      title: (page) {{$labels.slo_service}} {{$labels.asm_slo}} SLO error budget burn rate is too fast.
  - alert: asm-alert
    expr: |
      (
          (slo:sli_error:ratio_rate2h{asm_slo="asm-slo", slo_id="httpbin-asm-slo", slo_service="httpbin"} > (3 * 0.01))
          and ignoring (slo_window)
          (slo:sli_error:ratio_rate1d{asm_slo="asm-slo", slo_id="httpbin-asm-slo", slo_service="httpbin"} > (3 * 0.01))
      )
      or ignoring (slo_window)
      (
          (slo:sli_error:ratio_rate6h{asm_slo="asm-slo", slo_id="httpbin-asm-slo", slo_service="httpbin"} > (1 * 0.01))
          and ignoring (slo_window)
          (slo:sli_error:ratio_rate3d{asm_slo="asm-slo", slo_id="httpbin-asm-slo", slo_service="httpbin"} > (1 * 0.01))
      )
    labels:
      slo_severity: ticket
    annotations:
      summary: '{{$labels.slo_service}} {{$labels.asm_slo}} SLO error budget burn rate is over expected.'
      title: (ticket) {{$labels.slo_service}} {{$labels.asm_slo}} SLO error budget burn rate is too fast.

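The two alert rules above share one pattern: each fires when both a short window and a long window exceed the same multiple of the error budget (0.01 here). The following Python sketch, which is not ASM code, evaluates the same conditions offline. The window pairs and factors are copied from the alert expressions; the input dictionary of error ratios keyed by window and the function names are hypothetical.

```python
ERROR_BUDGET = 0.01  # matches the literal used in the alert expressions

# (short window, long window, burn-rate factor), as in the rules above
PAGE_WINDOWS = [("5m", "1h", 14.4), ("30m", "6h", 6)]
TICKET_WINDOWS = [("2h", "1d", 3), ("6h", "3d", 1)]


def fires(ratios, windows, budget=ERROR_BUDGET):
    """True if any pair has BOTH windows above factor * budget."""
    return any(
        ratios[short] > factor * budget and ratios[long] > factor * budget
        for short, long, factor in windows
    )


def firing_severities(ratios):
    """Return the severities whose condition holds; both may fire at once."""
    result = []
    if fires(ratios, PAGE_WINDOWS):
        result.append("page")
    if fires(ratios, TICKET_WINDOWS):
        result.append("ticket")
    return result


if __name__ == "__main__":
    # Hypothetical readings of slo:sli_error:ratio_rate<window>
    sample = {"5m": 0.2, "30m": 0.2, "1h": 0.15, "2h": 0.02,
              "6h": 0.02, "1d": 0.005, "3d": 0.005}
    print(firing_severities(sample))
```

Requiring the long window to agree with the short one keeps a brief spike from raising an alert, while the short window lets the alert clear soon after the error rate recovers.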
Save the result; the next step imports it into Prometheus.