Application Scenarios Series (1): Circuit Breaking under Traffic Management

Summary: This article takes an in-depth look at how the circuit breaker behaves in different scenarios.

1. Introduction

In many customer cases that adopt the Istio service mesh, circuit breaking is one of the most common traffic-management scenarios. Before enabling the mesh, some customers had already implemented circuit breaking in Java services using Resilience4j. By comparison, Istio provides circuit breaking at the network level, with no need to integrate it into each service's application code.

Understanding in depth how the circuit breaker behaves in different scenarios is a key prerequisite before applying it to a production environment.

2. Overview

To enable circuit breaking, create a destination rule that configures circuit breaking for the target service. The parameters related to circuit breaking are defined under connectionPool; the relevant configuration parameters are:

  • tcp.maxConnections: the maximum number of HTTP1/TCP connections to the destination host. Defaults to 2³²-1.
  • http.http1MaxPendingRequests: the maximum number of requests that can be queued while waiting for a ready connection from the connection pool. Defaults to 1024.
  • http.http2MaxRequests: the maximum number of active requests to the backend service. Defaults to 1024.
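
For reference, a minimal DestinationRule sketch that sets all three parameters together might look like the following (the values here are purely illustrative; the rules applied later in this article introduce them one at a time):

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: circuit-breaker-sample-server
spec:
  host: circuit-breaker-sample-server
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 5          # max HTTP1/TCP connections to the destination host
      http:
        http1MaxPendingRequests: 1 # max requests queued while waiting for a connection
        http2MaxRequests: 10       # max active requests to the backend service (illustrative value)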

The meaning of these parameters is clear in a simple scenario, such as one client and one target service instance (in a Kubernetes environment, an instance corresponds to a pod). In a production environment, however, the more likely scenarios are:

  • one client instance and multiple target service instances
  • multiple client instances and a single target service instance
  • multiple instances of both the client and the target service

We created two Python scripts: one to represent the target service, and the other a client that calls the service. The server script is a simple Flask application that exposes an API endpoint which sleeps for 5 seconds and then returns a "hello world!" string. The sample code is shown below:

#! /usr/bin/env python3
from flask import Flask
import time

app = Flask(__name__)

@app.route('/hello')
def get():
    time.sleep(5)
    return 'hello world!'

if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0', port=9080, threaded=True)

The client script calls the server endpoint in batches of 10, i.e. 10 parallel requests, and then sleeps for a while before sending the next batch of 10. It does this in an infinite loop. To make sure that, when we run multiple client pods, they all send their batches at the same time, the batches are triggered by the system clock (at the 0th, 20th and 40th second of every minute).

#! /usr/bin/env python3
import requests
import time
import sys
from datetime import datetime
import _thread

def timedisplay(t):
  return t.strftime("%H:%M:%S")

def get(url):
  try:
    stime = datetime.now()
    start = time.time()
    response = requests.get(url)
    etime = datetime.now()
    end = time.time()
    elapsed = end-start
    sys.stderr.write("Status: " + str(response.status_code) + ", Start: " + timedisplay(stime) + ", End: " + timedisplay(etime) + ", Elapsed Time: " + str(elapsed)+"\n")
    sys.stdout.flush()
  except Exception as myexception:
    sys.stderr.write("Exception: " + str(myexception)+"\n")
    sys.stdout.flush()

time.sleep(30)

while True:
  sc = int(datetime.now().strftime('%S'))
  time_range = [0, 20, 40]

  if sc not in time_range:
    time.sleep(1)
    continue

  sys.stderr.write("\n----------Info----------\n")
  sys.stdout.flush()

  # Send 10 requests in parallel
  for i in range(10):
    _thread.start_new_thread(get, ("http://circuit-breaker-sample-server:9080/hello", ))

  time.sleep(2)

3. Deploying the Sample Application

Deploy the sample application with the following YAML:

##################################################################################################
#  circuit-breaker-sample-server services
##################################################################################################
apiVersion: v1
kind: Service
metadata:
  name: circuit-breaker-sample-server
  labels:
    app: circuit-breaker-sample-server
    service: circuit-breaker-sample-server
spec:
  ports:
  - port: 9080
    name: http
  selector:
    app: circuit-breaker-sample-server
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: circuit-breaker-sample-server
  labels:
    app: circuit-breaker-sample-server
    version: v1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: circuit-breaker-sample-server
      version: v1
  template:
    metadata:
      labels:
        app: circuit-breaker-sample-server
        version: v1
    spec:
      containers:
      - name: circuit-breaker-sample-server
        image: registry.cn-hangzhou.aliyuncs.com/acs/istio-samples:circuit-breaker-sample-server.v1
        imagePullPolicy: Always
        ports:
        - containerPort: 9080
---
##################################################################################################
#  circuit-breaker-sample-client services
##################################################################################################
apiVersion: apps/v1
kind: Deployment
metadata:
  name: circuit-breaker-sample-client
  labels:
    app: circuit-breaker-sample-client
    version: v1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: circuit-breaker-sample-client
      version: v1
  template:
    metadata:
      labels:
        app: circuit-breaker-sample-client
        version: v1
    spec:
      containers:
      - name: circuit-breaker-sample-client
        image: registry.cn-hangzhou.aliyuncs.com/acs/istio-samples:circuit-breaker-sample-client.v1
        imagePullPolicy: Always
        

After startup, you can see that a pod has been started for both the client and the server, similar to the following:

> kubectl get po |grep circuit       
circuit-breaker-sample-client-d4f64d66d-fwrh4   2/2     Running   0             1m22s
circuit-breaker-sample-server-6d6ddb4b-gcthv    2/2     Running   0             1m22s
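
To watch the client's request results, you can follow the client container's log (the Deployment and container names below come from the YAML above):

kubectl logs -f deployment/circuit-breaker-sample-client -c circuit-breaker-sample-client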

With no destination rule limits defined, the server can handle the 10 concurrent client requests, so the server always responds with 200. The client-side log should look similar to the following:

----------Info----------
Status: 200, Start: 02:39:20, End: 02:39:25, Elapsed Time: 5.016539812088013
Status: 200, Start: 02:39:20, End: 02:39:25, Elapsed Time: 5.012614488601685
Status: 200, Start: 02:39:20, End: 02:39:25, Elapsed Time: 5.015984535217285
Status: 200, Start: 02:39:20, End: 02:39:25, Elapsed Time: 5.015599012374878
Status: 200, Start: 02:39:20, End: 02:39:25, Elapsed Time: 5.012874364852905
Status: 200, Start: 02:39:20, End: 02:39:25, Elapsed Time: 5.018714904785156
Status: 200, Start: 02:39:20, End: 02:39:25, Elapsed Time: 5.010422468185425
Status: 200, Start: 02:39:20, End: 02:39:25, Elapsed Time: 5.012431621551514
Status: 200, Start: 02:39:20, End: 02:39:25, Elapsed Time: 5.011001348495483
Status: 200, Start: 02:39:20, End: 02:39:25, Elapsed Time: 5.01432466506958

4. Enabling the Circuit Breaking Rule

To enable a circuit-breaking rule with the service mesh, all you need to do is define a corresponding DestinationRule for the target service.
Create and apply the following destination rule:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: circuit-breaker-sample-server
spec:
  host: circuit-breaker-sample-server
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 5

It limits the number of TCP connections to the target service to 5. Let's see how it works in different scenarios.

5. Scenario 1: One Client Instance and One Target Service Instance

In this scenario, both the client and the target service have a single pod. When we start the client pod and monitor its log (restarting the client is recommended to get cleaner statistics), we see something like the following:

----------Info----------
Status: 200, Start: 02:49:40, End: 02:49:45, Elapsed Time: 5.0167787075042725
Status: 200, Start: 02:49:40, End: 02:49:45, Elapsed Time: 5.011920690536499
Status: 200, Start: 02:49:40, End: 02:49:45, Elapsed Time: 5.017078161239624
Status: 200, Start: 02:49:40, End: 02:49:45, Elapsed Time: 5.018405437469482
Status: 200, Start: 02:49:40, End: 02:49:45, Elapsed Time: 5.018689393997192
Status: 200, Start: 02:49:40, End: 02:49:50, Elapsed Time: 10.018936395645142
Status: 200, Start: 02:49:40, End: 02:49:50, Elapsed Time: 10.016417503356934
Status: 200, Start: 02:49:40, End: 02:49:50, Elapsed Time: 10.019930601119995
Status: 200, Start: 02:49:40, End: 02:49:50, Elapsed Time: 10.022735834121704
Status: 200, Start: 02:49:40, End: 02:49:55, Elapsed Time: 15.02303147315979

We can see that all requests succeed. However, only 5 requests in each batch have a response time of about 5 seconds; the rest are much slower (mostly 10 seconds or more). This means that using only tcp.maxConnections causes excess requests to be queued while they wait for connections to be freed. As mentioned earlier, the number of requests that can be queued defaults to 2³²-1.
To get true circuit-breaking (i.e. fail-fast) behavior, we also need to set http.http1MaxPendingRequests to limit the number of requests that can be queued. Its default value is 1024. Interestingly, if we set it to 0, it falls back to the default, so we have to set it to at least 1. Let's update the destination rule to allow only 1 pending request:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: circuit-breaker-sample-server
spec:
  host: circuit-breaker-sample-server
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 5
      http:
        http1MaxPendingRequests: 1

Restart the client pod (be sure to restart the client, otherwise the statistics will be skewed) and keep watching the logs.
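
One way to restart the client, assuming the Deployment name used above, is to roll the Deployment (deleting the pod works as well, since the Deployment recreates it):

kubectl rollout restart deployment/circuit-breaker-sample-client

After the restart, the client log should look similar to the following: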

----------Info----------
Status: 503, Start: 02:56:40, End: 02:56:40, Elapsed Time: 0.005339622497558594
Status: 503, Start: 02:56:40, End: 02:56:40, Elapsed Time: 0.007254838943481445
Status: 503, Start: 02:56:40, End: 02:56:40, Elapsed Time: 0.0044133663177490234
Status: 503, Start: 02:56:40, End: 02:56:40, Elapsed Time: 0.008964776992797852
Status: 200, Start: 02:56:40, End: 02:56:45, Elapsed Time: 5.018309116363525
Status: 200, Start: 02:56:40, End: 02:56:45, Elapsed Time: 5.017424821853638
Status: 200, Start: 02:56:40, End: 02:56:45, Elapsed Time: 5.019804954528809
Status: 200, Start: 02:56:40, End: 02:56:45, Elapsed Time: 5.01643180847168
Status: 200, Start: 02:56:40, End: 02:56:45, Elapsed Time: 5.025975227355957
Status: 200, Start: 02:56:40, End: 02:56:50, Elapsed Time: 10.01716136932373

We can see that 4 requests were throttled immediately, 5 requests were sent to the target service, and 1 request was queued. This is the expected behavior.
We can also confirm that the client-side Istio proxy has 5 active connections established to the target service's pod:

kubectl exec $(kubectl get pod --selector app=circuit-breaker-sample-client --output jsonpath='{.items[0].metadata.name}') -c istio-proxy -- curl -X POST http://localhost:15000/clusters | grep circuit-breaker-sample-server | grep cx_active

outbound|9080||circuit-breaker-sample-server.default.svc.cluster.local::172.20.192.124:9080::cx_active::5

6. Scenario 2: One Client Instance and Multiple Target Service Instances

Now let's run the test in a scenario with one client instance and multiple target service pods.
First, scale the target service deployment to multiple replicas (e.g. 3):

kubectl scale deployment/circuit-breaker-sample-server  --replicas=3

There are two possibilities to verify here:

  1. the connection limit is applied at the pod level: at most 5 connections to each pod of the target service; or
  2. it is applied at the service level: at most 5 connections in total, regardless of how many pods the target service has.

In case (1) we should see no throttling or queuing, because the maximum number of allowed connections would be 15 (3 pods, 5 connections each). Since we only send 10 requests at a time, all of them should succeed and return in about 5 seconds.
In case (2) we should see roughly the same behavior as in scenario 1 above.
Let's start the client pod again and monitor the log:

----------Info----------
Status: 503, Start: 03:06:20, End: 03:06:20, Elapsed Time: 0.011791706085205078
Status: 503, Start: 03:06:20, End: 03:06:20, Elapsed Time: 0.0032286643981933594
Status: 503, Start: 03:06:20, End: 03:06:20, Elapsed Time: 0.012153387069702148
Status: 503, Start: 03:06:20, End: 03:06:20, Elapsed Time: 0.011871814727783203
Status: 200, Start: 03:06:20, End: 03:06:25, Elapsed Time: 5.012892484664917
Status: 200, Start: 03:06:20, End: 03:06:25, Elapsed Time: 5.013102769851685
Status: 200, Start: 03:06:20, End: 03:06:25, Elapsed Time: 5.016939163208008
Status: 200, Start: 03:06:20, End: 03:06:25, Elapsed Time: 5.014261484146118
Status: 200, Start: 03:06:20, End: 03:06:25, Elapsed Time: 5.01246190071106
Status: 200, Start: 03:06:20, End: 03:06:30, Elapsed Time: 10.021712064743042

We still see similar throttling and queuing, which means that increasing the number of target service instances does not raise the limit on the client side. From this we conclude that the limit is applied at the service level.
After running for a while, we can see the number of connections the client-side Istio proxy has established to each pod of the target service:

outbound|9080||circuit-breaker-sample-server.default.svc.cluster.local::172.20.192.124:9080::cx_active::2
outbound|9080||circuit-breaker-sample-server.default.svc.cluster.local::172.20.192.158:9080::cx_active::2
outbound|9080||circuit-breaker-sample-server.default.svc.cluster.local::172.20.192.26:9080::cx_active::2

The client proxy has 2 active connections to each pod of the target service, for a total of 6 rather than 5. As mentioned in both the Envoy and Istio documentation, the proxy allows some leeway in the number of connections.

7. Scenario 3: Multiple Client Instances and One Target Service Instance

In this scenario we have multiple client pods and only one pod for the target service.
Scale the replicas accordingly:

kubectl scale deployment/circuit-breaker-sample-server --replicas=1 
kubectl scale deployment/circuit-breaker-sample-client --replicas=3

Since all the Istio proxies operate independently on local information, without coordinating with each other, the expectation for this test is that each client pod exhibits the behavior of scenario 1: from each pod, 5 requests are sent to the target service immediately, 1 request is queued, and the rest are throttled.
Let's look at the logs and see what actually happened:

Client 1

----------Info----------
Status: 503, Start: 03:10:40, End: 03:10:40, Elapsed Time: 0.008828878402709961
Status: 503, Start: 03:10:40, End: 03:10:40, Elapsed Time: 0.010806798934936523
Status: 503, Start: 03:10:40, End: 03:10:40, Elapsed Time: 0.012855291366577148
Status: 503, Start: 03:10:40, End: 03:10:40, Elapsed Time: 0.004465818405151367
Status: 503, Start: 03:10:40, End: 03:10:40, Elapsed Time: 0.007823944091796875
Status: 503, Start: 03:10:40, End: 03:10:40, Elapsed Time: 0.06221342086791992
Status: 503, Start: 03:10:40, End: 03:10:40, Elapsed Time: 0.06922149658203125
Status: 503, Start: 03:10:40, End: 03:10:40, Elapsed Time: 0.06859922409057617
Status: 200, Start: 03:10:40, End: 03:10:45, Elapsed Time: 5.015282392501831
Status: 200, Start: 03:10:40, End: 03:10:50, Elapsed Time: 9.378434181213379

Client 2

----------Info----------
Status: 503, Start: 03:11:00, End: 03:11:00, Elapsed Time: 0.007795810699462891
Status: 503, Start: 03:11:00, End: 03:11:00, Elapsed Time: 0.00595545768737793
Status: 503, Start: 03:11:00, End: 03:11:00, Elapsed Time: 0.013380765914916992
Status: 503, Start: 03:11:00, End: 03:11:00, Elapsed Time: 0.004278898239135742
Status: 503, Start: 03:11:00, End: 03:11:00, Elapsed Time: 0.010999202728271484
Status: 200, Start: 03:11:00, End: 03:11:05, Elapsed Time: 5.015426874160767
Status: 200, Start: 03:11:00, End: 03:11:05, Elapsed Time: 5.0184690952301025
Status: 200, Start: 03:11:00, End: 03:11:05, Elapsed Time: 5.019806146621704
Status: 200, Start: 03:11:00, End: 03:11:05, Elapsed Time: 5.0175628662109375
Status: 200, Start: 03:11:00, End: 03:11:05, Elapsed Time: 5.031521558761597

Client 3

----------Info----------
Status: 503, Start: 03:13:20, End: 03:13:20, Elapsed Time: 0.012019157409667969
Status: 503, Start: 03:13:20, End: 03:13:20, Elapsed Time: 0.012546539306640625
Status: 503, Start: 03:13:20, End: 03:13:20, Elapsed Time: 0.013760805130004883
Status: 503, Start: 03:13:20, End: 03:13:20, Elapsed Time: 0.014089822769165039
Status: 503, Start: 03:13:20, End: 03:13:20, Elapsed Time: 0.014792442321777344
Status: 503, Start: 03:13:20, End: 03:13:20, Elapsed Time: 0.015463829040527344
Status: 503, Start: 03:13:20, End: 03:13:20, Elapsed Time: 0.01661539077758789
Status: 200, Start: 03:13:20, End: 03:13:20, Elapsed Time: 0.02904224395751953
Status: 200, Start: 03:13:20, End: 03:13:20, Elapsed Time: 0.03912043571472168
Status: 200, Start: 03:13:20, End: 03:13:20, Elapsed Time: 0.06436014175415039

The results show that the number of 503s on each client increased. The system only allows 5 concurrent requests in total from all three client pods. Checking the client-side proxy logs for clues, we observe two different kinds of log entries for the throttled (503) requests. Note that RESPONSE_FLAGS takes two values: UO and URX.

  • UO: upstream overflow (circuit breaking)
  • URX: the request was rejected because the upstream retry limit (HTTP) or the maximum connect attempts (TCP) was reached
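
These entries come from the Envoy access log of the client's istio-proxy sidecar (this assumes access logging is enabled for the mesh). One way to pull out the throttled requests, reusing the pod-selection pattern from earlier, is:

kubectl logs $(kubectl get pod --selector app=circuit-breaker-sample-client --output jsonpath='{.items[0].metadata.name}') -c istio-proxy | grep '"response_code":"503"'
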
{"authority":"circuit-breaker-sample-server:9080","bytes_received":"0","bytes_sent":"81","downstream_local_address":"192.168.142.207:9080","downstream_remote_address":"172.20.192.31:44610","duration":"0","istio_policy_status":"-","method":"GET","path":"/hello","protocol":"HTTP/1.1","request_id":"d9d87600-cd01-421f-8a6f-dc0ee0ac8ccd","requested_server_name":"-","response_code":"503","response_flags":"UO","route_name":"default","start_time":"2023-02-28T03:14:00.095Z","trace_id":"-","upstream_cluster":"outbound|9080||circuit-breaker-sample-server.default.svc.cluster.local","upstream_host":"-","upstream_local_address":"-","upstream_service_time":"-","upstream_transport_failure_reason":"-","user_agent":"python-requests/2.21.0","x_forwarded_for":"-"}

{"authority":"circuit-breaker-sample-server:9080","bytes_received":"0","bytes_sent":"81","downstream_local_address":"192.168.142.207:9080","downstream_remote_address":"172.20.192.31:43294","duration":"58","istio_policy_status":"-","method":"GET","path":"/hello","protocol":"HTTP/1.1","request_id":"931d080a-3413-4e35-91f4-0c906e7ee565","requested_server_name":"-","response_code":"503","response_flags":"URX","route_name":"default","start_time":"2023-02-28T03:12:20.995Z","trace_id":"-","upstream_cluster":"outbound|9080||circuit-breaker-sample-server.default.svc.cluster.local","upstream_host":"172.20.192.84:9080","upstream_local_address":"172.20.192.31:58742","upstream_service_time":"57","upstream_transport_failure_reason":"-","user_agent":"python-requests/2.21.0","x_forwarded_for":"-"}

The requests with the UO flag were throttled locally by the client-side proxy. The requests with the URX flag were rejected by the target service's proxy. The values of the other fields in the logs, such as DURATION, UPSTREAM_HOST and UPSTREAM_CLUSTER, also confirm this.
To verify further, let's also check the proxy logs on the target service side:
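
The server-side entries can be pulled in the same way from the target pod's sidecar (again assuming mesh access logging is enabled):

kubectl logs $(kubectl get pod --selector app=circuit-breaker-sample-server --output jsonpath='{.items[0].metadata.name}') -c istio-proxy | grep '"response_code":"503"'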

{"authority":"circuit-breaker-sample-server:9080","bytes_received":"0","bytes_sent":"81","downstream_local_address":"172.20.192.84:9080","downstream_remote_address":"172.20.192.31:59510","duration":"0","istio_policy_status":"-","method":"GET","path":"/hello","protocol":"HTTP/1.1","request_id":"7684cbb0-8f1c-44bf-b591-40c3deff6b0b","requested_server_name":"outbound_.9080_._.circuit-breaker-sample-server.default.svc.cluster.local","response_code":"503","response_flags":"UO","route_name":"default","start_time":"2023-02-28T03:14:00.095Z","trace_id":"-","upstream_cluster":"inbound|9080||","upstream_host":"-","upstream_local_address":"-","upstream_service_time":"-","upstream_transport_failure_reason":"-","user_agent":"python-requests/2.21.0","x_forwarded_for":"-"}
{"authority":"circuit-breaker-sample-server:9080","bytes_received":"0","bytes_sent":"81","downstream_local_address":"172.20.192.84:9080","downstream_remote_address":"172.20.192.31:58218","duration":"0","istio_policy_status":"-","method":"GET","path":"/hello","protocol":"HTTP/1.1","request_id":"2aa351fa-349d-4283-a5ea-dc74ecbdff8c","requested_server_name":"outbound_.9080_._.circuit-breaker-sample-server.default.svc.cluster.local","response_code":"503","response_flags":"UO","route_name":"default","start_time":"2023-02-28T03:12:20.996Z","trace_id":"-","upstream_cluster":"inbound|9080||","upstream_host":"-","upstream_local_address":"-","upstream_service_time":"-","upstream_transport_failure_reason":"-","user_agent":"python-requests/2.21.0","x_forwarded_for":"-"}

As expected, the 503 response codes appear here as well; these are what produced the "response_code":"503" and "response_flags":"URX" entries on the client-side proxy.
To sum up: the client-side proxies send requests according to their connection limit (at most 5 connections per pod) and queue or throttle (with the UO response flag) the excess requests. All three client proxies together can send up to 15 concurrent requests at the start of a batch. However, only 5 of them succeed, because the target service's proxy throttles with the same configuration (at most 5 connections). The target service proxy accepts only 5 requests and throttles the rest; these throttled requests appear in the client proxies' logs with the URX response flag.
[Figure: diagram of the request flow described above]

8. Scenario 4: Multiple Client Instances and Multiple Target Service Instances

The last, and probably the most common, scenario: multiple client pods and multiple target service pods.
As we increase the number of target service replicas, we should see the overall success rate of requests increase, because each target-side proxy can allow 5 concurrent requests.

  • If we increase the replicas to 2, we should see 10 of the 30 requests generated by the 3 client proxies in one batch succeed (the scale command for this is shown after the list). We would still observe throttling on both the client-side and target-side proxies.
  • If we increase the replicas to 3, we should see 15 successful requests.
  • If we increase the number to 4, we should still see only 15 successful requests. Why? Because the limit on the client-side proxy applies to the target service as a whole, regardless of how many replicas it has. So each client proxy can make at most 5 concurrent requests to the target service, no matter how many replicas there are.
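
For example, the first case can be reproduced by scaling the target service with the same kubectl scale pattern used earlier, and then watching each client's log for the number of 200 responses per batch:

kubectl scale deployment/circuit-breaker-sample-server --replicas=2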

9. Summary

  • On the client side:

Each client-side proxy applies the limit independently. If the limit is 100, each client proxy can have 100 outstanding requests before it applies local throttling. If N clients are calling the target service, there can be up to 100*N outstanding requests in total.
The client-side proxy's limit applies to the target service as a whole, not to individual replicas of the target service. Even if the target service has 200 active pods, the limit is still 100.

  • On the target service side:

Each target-side proxy also applies the limit. If the service has 50 active pods, each pod can have up to 100 outstanding requests from the client-side proxies before the proxy starts throttling and returning 503.

If you are interested in Alibaba Cloud Service Mesh (ASM), or have any questions about the content above, you are welcome to scan the DingTalk QR code below or search for group number 30421250 to join the ASM user group and explore service mesh technology together.
