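The Gateway with Inference Extension component is an enhanced component built on the Kubernetes community's Gateway API and its Inference Extension specification. It supports Kubernetes Layer 4 and Layer 7 routing and adds a set of capabilities for generative AI inference scenarios. It simplifies the management of generative AI inference services and improves load balancing across multiple inference workloads.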

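When you use ACK Gateway with Inference Extension (ACK GIE) in different scenarios, you may need to tune the gateway and the inference extension to match your business and high-availability requirements. This article gives configuration recommendations for ACK GIE in real production scenarios.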

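I. Gateway configuration recommendations

Configuration overview

1. Configure the gateway load balancer through Service annotations

ACK Gateway is exposed through a Service of type LoadBalancer, and Alibaba Cloud creates a CLB instance for it. You can configure that CLB instance through Service annotations to make request forwarding more stable. The recommended load balancer configuration is as follows: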

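annotations:
    # Enable CLB connection draining (graceful shutdown)
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-connection-drain: 'on'
    # Connection draining timeout in seconds. For LLM inference services, set it to about twice the expected end-to-end (e2e) latency so that in-flight requests are not interrupted
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-connection-drain-timeout: '240'
    # CLB billing method. PayByCLCU is recommended to avoid running into bandwidth limits
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-instance-charge-type: PayByCLCU
    # CLB network type, public or private: set it according to your needs
    service.beta.kubernetes.io/alicloud-loadbalancer-address-type: internet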

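2. Configure graceful shutdown for Envoy Gateway

ACK Gateway can use an EnvoyProxy resource to adjust the gateway's own graceful shutdown settings, so that a gateway pod drains its connections thoroughly before it stops.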

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: custom-proxy-config
spec:
  shutdown:
    drainTimeout: 360s
...

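Set drainTimeout slightly longer than the CLB connection draining timeout, so that the gateway pod stays up for the whole CLB draining window and connections are not cut because the pod exited first.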
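The relationship between the expected latency, the CLB connection draining timeout, and the Envoy drainTimeout can be sketched in Python. This is only an illustration: the helper name `suggest_drain_timeouts` and the 1.5x margin on the Envoy side are assumptions, chosen so that an expected end-to-end latency of 120 seconds reproduces the 240 / 360s values used in this article. Neither is part of ACK or Envoy Gateway.

```python
import math


def suggest_drain_timeouts(expected_e2e_latency_s: float, envoy_margin: float = 1.5) -> dict:
    """Derive both drain timeouts (in seconds) from the expected e2e latency.

    envoy_margin is a hypothetical safety factor (> 1) that keeps the Envoy
    drain window longer than the CLB connection draining window.
    """
    if expected_e2e_latency_s <= 0:
        raise ValueError("expected_e2e_latency_s must be positive")
    if envoy_margin <= 1:
        raise ValueError("envoy_margin must be greater than 1")
    # CLB side: about twice the expected end-to-end latency.
    clb_timeout = math.ceil(2 * expected_e2e_latency_s)
    # Envoy side: somewhat longer than the CLB timeout.
    envoy_timeout = math.ceil(clb_timeout * envoy_margin)
    return {"clb_drain_timeout_s": clb_timeout, "envoy_drain_timeout_s": envoy_timeout}


print(suggest_drain_timeouts(120))
```

The first value would go into the connection-drain-timeout annotation and the second into spec.shutdown.drainTimeout.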

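3. Resources, replica count, and HPA

Envoy starts one worker thread per CPU core allocated to it, and two worker threads are usually enough. As a starting point, give Envoy a limit of 2 CPU cores and 4 Gi of memory. The memory limit is higher than for ordinary traffic because buffering LLM inference requests needs more memory.

Determine the number of Envoy replicas from load testing (start with 3 to 4 replicas), or configure an HPA.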

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  minReplicas: 3
  maxReplicas: 20

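For LLM inference requests, QPS is bounded mainly by backend GPU resources, so the gateway is rarely the bottleneck. Start with 3 to 4 replicas and validate against your actual workload.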
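To reason about how an HPA with a 70% CPU target and a 3 to 20 replica range reacts to load, the sketch below applies the scaling rule documented for the Kubernetes HorizontalPodAutoscaler: desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), clamped to the replica bounds. It deliberately ignores the controller's tolerance band and stabilization window, and the function name is illustrative.

```python
import math


def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float = 70,
                     min_replicas: int = 3, max_replicas: int = 20) -> int:
    """Approximate the HPA scaling decision for a CPU utilization target."""
    if target_utilization <= 0:
        raise ValueError("target_utilization must be positive")
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    # The HPA never goes below minReplicas or above maxReplicas.
    return max(min_replicas, min(max_replicas, raw))


# 3 replicas running at 140% of their CPU request against a 70% target.
print(desired_replicas(3, 140))
```

With the defaults above, low utilization never scales the gateway below 3 replicas, which keeps the replica count compatible with a PodDisruptionBudget that requires 2 available pods.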

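4. Configure a PodDisruptionBudget

A PodDisruptionBudget for the gateway ensures that when gateway pods are evicted, not too many are removed at once. This avoids service interruption and improves availability.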

...
  provider:
    type: Kubernetes
    kubernetes:
      envoyPDB:
        minAvailable: 2
...
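How many gateway pods a voluntary disruption (for example, a node drain) may evict at the same time follows from the number of healthy pods and minAvailable. A minimal sketch, assuming minAvailable is given as an absolute number; `allowed_disruptions` is a hypothetical helper, not a Kubernetes API:

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Pods that may be evicted right now without violating minAvailable."""
    if healthy_pods < 0 or min_available < 0:
        raise ValueError("pod counts must be non-negative")
    return max(0, healthy_pods - min_available)


# 3 gateway replicas with minAvailable: 2 allow one eviction at a time.
print(allowed_disruptions(3, 2))
# With only 2 replicas and minAvailable: 2, evictions are blocked entirely.
print(allowed_disruptions(2, 2))
```

The second case is worth checking before a cluster upgrade: if the replica count equals minAvailable, node drains cannot evict any gateway pod.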

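5. Spread gateway replicas across nodes

By customizing the pod anti-affinity of the gateway Deployment, you can spread gateway replicas across different nodes as far as possible, which improves the gateway's availability.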

...
spec:
  provider:
    type: Kubernetes
    kubernetes:
      envoyDeployment:
        pod:
          affinity:
            podAntiAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                podAffinityTerm:
                  labelSelector:
                    matchLabels:
                      gateway.envoyproxy.io/owning-gateway-name: qwen3-gateway
                      gateway.envoyproxy.io/owning-gateway-namespace: prod
                  topologyKey: kubernetes.io/hostname
...

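6. Control timeouts and connection limits with BackendTrafficPolicy

LLM inference requests usually take a long time, and many requests can be pending at the gateway. In production, consider raising the maximum request timeout, the maximum number of pending requests, and the maximum number of parallel requests through a BackendTrafficPolicy.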

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: backend-timeout
  namespace: prod
spec:
  circuitBreaker:
    maxParallelRequests: 10000
    maxParallelRetries: 1024
    maxPendingRequests: 10000
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: qwen3-gateway
  timeout:
    http:
      requestTimeout: 24h

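7. Configure observability

(1) Log collection

ACK Gateway writes request access logs in JSON format to the container's standard output. To collect and analyze them, see the ACK help center topic on collecting ACK cluster container logs (log collection deployed as a DaemonSet).

Reference collection configuration: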

apiVersion: telemetry.alibabacloud.com/v1alpha1
kind: ClusterAliyunPipelineConfig
metadata:
  name: gie-access-log-config
spec:
  config:
    aggregators: []
    global: {}
    inputs:
      - Type: input_container_stdio
        CollectingContainersMeta: false
        IgnoringStderr: false
        IgnoringStdout: false
        ContainerFilters:
          K8sNamespaceRegex: ^(envoy-gateway-system)$
          K8sContainerRegex: ^(envoy)$
          IncludeK8sLabel:
            app.kubernetes.io/component: proxy
            app.kubernetes.io/managed-by: envoy-gateway
            app.kubernetes.io/name: envoy
    processors:
      - Type: processor_parse_json_native
        SourceKey: content
        KeepingSourceWhenParseFail: true
    flushers:
      - Type: flusher_sls
        Logstore: gie-access-log
  project:
    name: k8s-log-xxxxxxxx
  logstores:
    - name: gie-access-log

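(2) Metrics collection

ACK Gateway exposes request metrics. For generative AI requests, it also exposes generative AI service metrics that follow the OpenTelemetry GenAI standard.

The metrics are described below.

Basic metrics

Metric	Type	Description	Unit	Labels
envoy_server_live	GAUGE	Whether the server is currently live (1) or not (0).	-	-
envoy_server_uptime	GAUGE	Total time the server has been running.	seconds	-
envoy_server_memory_allocated	GAUGE	Total memory currently allocated by the server.	bytes	-
envoy_server_memory_heap_size	GAUGE	Heap memory size used by the server.	bytes	-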

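Cluster metrics

Metric	Type	Description	Unit	Labels
envoy_cluster_membership_healthy	GAUGE	Number of healthy hosts in the cluster.	-	envoy_cluster_name: upstream cluster name
envoy_cluster_membership_total	GAUGE	Total number of hosts in the cluster.	-	envoy_cluster_name: upstream cluster name
envoy_cluster_upstream_cx_active	GAUGE	Number of currently active connections to upstream hosts.	-	envoy_cluster_name: upstream cluster name
envoy_cluster_upstream_rq_total	COUNTER	Total number of requests sent to upstream hosts.	-	envoy_cluster_name: upstream cluster name
envoy_cluster_upstream_cx_rx_bytes_total	COUNTER	Bytes received from upstream hosts.	bytes	envoy_cluster_name: upstream cluster name
envoy_cluster_upstream_cx_tx_bytes_total	COUNTER	Bytes sent to upstream hosts.	bytes	envoy_cluster_name: upstream cluster name
envoy_cluster_upstream_rq_time_bucket	HISTOGRAM	Distribution of request latency to the upstream cluster, counted per latency bucket.	milliseconds	envoy_cluster_name: upstream cluster name; le: upper latency bound of the bucket, in milliseconds
envoy_cluster_upstream_rq_xx	COUNTER	Total number of requests for which the upstream cluster returned a given HTTP status code class.	requests	envoy_cluster_name: upstream cluster name; envoy_response_code_class: HTTP response code class (1: 1xx, 2: 2xx, 3: 3xx, 4: 4xx, 5: 5xx)
envoy_cluster_upstream_cx_total	COUNTER	Total number of upstream connections.	-	envoy_cluster_name: upstream cluster name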

HTTP metrics

Metric	Type	Description	Unit	Labels
envoy_http_downstream_cx_rx_bytes_total	COUNTER	Bytes received from downstream clients.	bytes	envoy_http_conn_manager_prefix: the http_conn_manager prefix
envoy_http_downstream_cx_tx_bytes_total	COUNTER	Bytes sent to downstream clients.	bytes	envoy_http_conn_manager_prefix: the http_conn_manager prefix
envoy_http_downstream_rq_total	COUNTER	Total number of HTTP requests sent by downstream clients.	requests	envoy_http_conn_manager_prefix: the http_conn_manager prefix
envoy_http_downstream_cx_total	COUNTER	Total number of downstream HTTP connections.	-	envoy_http_conn_manager_prefix: the http_conn_manager prefix
envoy_http_downstream_rq_time_bucket	HISTOGRAM	Histogram of downstream HTTP request latency.	milliseconds	envoy_http_conn_manager_prefix: the http_conn_manager prefix; le: upper latency bound of the bucket, in milliseconds

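Listener and TCP metrics

Metric	Type	Description	Unit	Labels
envoy_listener_downstream_cx_active	GAUGE	Number of currently active downstream connections.	-	envoy_listener_address: listener address
envoy_tcp_downstream_cx_total	COUNTER	Total number of downstream TCP connections.	-	-
envoy_tcp_downstream_cx_rx_bytes_total	COUNTER	Bytes received on downstream TCP connections.	bytes	-
envoy_tcp_downstream_cx_tx_bytes_total	COUNTER	Bytes sent on downstream TCP connections.	bytes	-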

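Gen AI metrics

Metric	Type	Description	Unit	Labels
gen_ai.client.token.usage	HISTOGRAM	Number of tokens consumed by the Gen AI client.	-	gen_ai.operation.name: operation name; gen_ai.system: name of the system the client requests; gen_ai.token.type: token type, input or output; gen_ai.request.model: requested model name; server.port: service port; gen_ai.response.model: model name in the response; server.address: address of the Gen AI server
gen_ai.client.operation.duration	HISTOGRAM	Duration of the Gen AI client operation.	-	gen_ai.operation.name: operation name; gen_ai.system: name of the system the client requests; gen_ai.request.model: requested model name; gen_ai.response.model: model name in the response; gen_ai.error.type: error type; server.port: service port; server.address: address of the Gen AI server
gen_ai.server.request.duration	HISTOGRAM	Duration of the request on the Gen AI server.	-	gen_ai.operation.name: operation name; gen_ai.system: name of the system the client requests; gen_ai.request.model: requested model name; gen_ai.response.model: model name in the response; gen_ai.error.type: error type; server.port: service port; server.address: address of the Gen AI server
gen_ai.server.time.per.output.token	HISTOGRAM	Time the Gen AI server spends per output token.	-	gen_ai.operation.name: operation name; gen_ai.system: name of the system the client requests; gen_ai.request.model: requested model name; gen_ai.response.model: model name in the response; server.port: service port; server.address: address of the Gen AI server
gen_ai.server.time.to.first.token	HISTOGRAM	Time the Gen AI server takes to generate the first token.	-	gen_ai.operation.name: operation name; gen_ai.system: name of the system the client requests; gen_ai.request.model: requested model name; gen_ai.response.model: model name in the response; server.port: service port; server.address: address of the Gen AI server

Alibaba Cloud Managed Service for Prometheus will provide an integration and dashboards for the ACK Gateway metrics.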
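The `_bucket` histogram metrics listed above are cumulative: each `le` bucket counts all observations less than or equal to its bound. To show how a percentile is read from such data, the sketch below estimates a quantile by linear interpolation inside the matching bucket, in the style of Prometheus's `histogram_quantile`. It is a simplified illustration with made-up bucket counts, not the gateway's or Prometheus's actual implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (le, count) buckets.

    buckets must be sorted by le and end with an infinite bound (+Inf).
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                return prev_le  # fall back to the highest finite bound
            if count == prev_count:
                return le
            # Linear interpolation inside the bucket that contains the rank.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return prev_le


# Hypothetical cumulative bucket counts, bounds in the metric's own unit.
latency_buckets = [(1, 10), (5, 60), (10, 90), (math.inf, 100)]
print(histogram_quantile(0.5, latency_buckets))
```

The result is expressed in the same unit as the `le` bounds, so no unit conversion happens in this helper.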

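8. Configure HTTPS access

ACK Gateway supports HTTPS access through a TLS listener. For the steps, see the ACK help center topic on accessing services through Gateway with Inference Extension.

How to configure

For ACK Gateway itself, most custom deployment settings can be made through the EnvoyProxy resource.

A complete EnvoyProxy example follows. Adjust it to your deployment requirements.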

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: custom-proxy-config
  namespace: prod
spec:
  shutdown: # graceful shutdown
    drainTimeout: 360s
  provider:
    type: Kubernetes
    kubernetes:
      envoyPDB: # gateway PDB
        minAvailable: 2
      envoyDeployment:
        replicas: 3
        pod:
          affinity: # pod anti-affinity to spread replicas across nodes
            podAntiAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                podAffinityTerm:
                  labelSelector:
                    matchLabels:
                      gateway.envoyproxy.io/owning-gateway-name: qwen3-gateway
                      gateway.envoyproxy.io/owning-gateway-namespace: prod
                  topologyKey: kubernetes.io/hostname
        container: # pod resources
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
            limits:
              cpu: 2
              memory: 4Gi
      envoyService:
        annotations: # Service annotations for the gateway load balancer
          service.beta.kubernetes.io/alibaba-cloud-loadbalancer-connection-drain: 'on'
          service.beta.kubernetes.io/alibaba-cloud-loadbalancer-connection-drain-timeout: '300'
          service.beta.kubernetes.io/alibaba-cloud-loadbalancer-instance-charge-type: PayByCLCU
          service.beta.kubernetes.io/alicloud-loadbalancer-address-type: internet
  bootstrap:
    type: JSONPatch
    jsonPatches: # configuration used to generate the Gen AI metrics
    - op: add
      path: /stats_config
      value:
        stats_tags:
          - tag_name: gen_ai.operation.name
            regex: "(\\|gen_ai.operation.name=([^|]*))"
          - tag_name: gen_ai.system
            regex: "(\\|gen_ai.system=([^|]*))"
          - tag_name: gen_ai.token.type
            regex: "(\\|gen_ai.token.type=([^|]*))"
          - tag_name: gen_ai.request.model
            regex: "(\\|gen_ai.request.model=([^|]*))"
          - tag_name: gen_ai.response.model
            regex: "(\\|gen_ai.response.model=([^|]*))"
          - tag_name: gen_ai.error.type
            regex: "(\\|gen_ai.error.type=([^|]*))"
          - tag_name: server.port
            regex: "(\\|server.port=([^|]*))"
          - tag_name: server.address
            regex: "(\\|server.address=([^|]*))"
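To see what the stats_tags patterns do, recall how Envoy tag extraction works: the first capture group is the part removed from the stat name, and the second (nested) capture group becomes the tag value. The Python sketch below replays that behavior with the same patterns. The sample stat name, with tags embedded as `|key=value` segments, is an assumption inferred from the regular expressions and not a documented format.

```python
import re

GEN_AI_TAGS = [
    "gen_ai.operation.name", "gen_ai.system", "gen_ai.token.type",
    "gen_ai.request.model", "gen_ai.response.model", "gen_ai.error.type",
    "server.port", "server.address",
]


def extract_tags(stat_name: str):
    """Return (stat name without tag segments, extracted tags)."""
    tags = {}
    for tag in GEN_AI_TAGS:
        # Same shape as the configured regex: (\|<tag>=([^|]*))
        match = re.search(rf"(\|{tag}=([^|]*))", stat_name)
        if match:
            tags[tag] = match.group(2)  # second group: the tag value
            # First group: the segment stripped from the stat name.
            stat_name = stat_name[:match.start(1)] + stat_name[match.end(1):]
    return stat_name, tags


# Hypothetical raw stat name carrying three tags.
raw = ("gen_ai.client.token.usage|gen_ai.operation.name=chat"
       "|gen_ai.token.type=input|gen_ai.request.model=qwen3-8b")
print(extract_tags(raw))
```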

A Gateway applies the custom configuration by referencing the EnvoyProxy resource in its infrastructure field:

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: qwen3-gateway
spec:
  gatewayClassName: inference-gateway
  infrastructure:
    parametersRef:
      group: gateway.envoyproxy.io
      kind: EnvoyProxy
      name: custom-proxy-config
  listeners:
    - name: llm-gw
      protocol: HTTP
      port: 8081

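II. Inference extension configuration recommendations

Configuration overview

The inference extension (EPP) does not forward requests or track request state. It only provides capabilities that enhance the gateway for inference services, such as routing decisions.

1. Configure inference extension resources and replica count

At runtime, the inference extension interacts with the backend inference engines and refreshes their state at a high frequency, so the number of inference engine replicas significantly affects its CPU usage.

When capabilities such as prefix-cache-aware load balancing are enabled, the inference extension has to keep track of each inference engine's cache state for request prefixes, so its memory usage increases noticeably.

For production deployments, size the inference extension according to the actual number of inference replicas and the features enabled. Start with a limit of 4 CPU cores and 4 Gi of memory, then adjust based on observed resource usage.

Because the inference extension does not forward requests itself and poses no availability risk to the gateway, 2 replicas are enough to guarantee redundancy.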

spec:
  replicas: 2
  template:
...
      containers:
        - name: inference-gateway-ext-proc
          resources:
            limits:
              cpu: '4'
              memory: 4Gi
            requests:
              cpu: 500m
              memory: 1Gi

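2. Graceful shutdown

The inference extension is currently involved only in the request phase, so its graceful shutdown period does not need to be extended. Keep terminationGracePeriodSeconds at the recommended default (130).

3. Spread replicas across nodes

As with the gateway, use pod anti-affinity to spread inference extension replicas across nodes and improve overall availability.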

spec:
...
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  inference-pool: qwen3-8b-pool # InferencePool name
                  inference-pool-namespace: prod # InferencePool namespace
              topologyKey: kubernetes.io/hostname
...

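4. Configure a PodDisruptionBudget

As with the gateway, a PodDisruptionBudget further improves availability. By default, ACK Gateway deploys a PDB with minAvailable set to 1 for the inference extension. You can modify the PDB as needed.

How to configure

You can use a custom ConfigMap to specify patches for the inference extension's Deployment, Service, and PodDisruptionBudget resources.

Customizing the inference extension deployment requires component version 1.4.0-aliyun.2 or later.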

apiVersion: v1
data:
  deployment: |-
    spec:
      replicas: 2
      template:
        spec:
          affinity:
            podAntiAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                podAffinityTerm:
                  labelSelector:
                    matchLabels:
                      inference-pool: qwen3-8b-pool
                      inference-pool-namespace: prod
                  topologyKey: kubernetes.io/hostname
          containers:
            - name: inference-gateway-ext-proc
              resources:
                limits:
                  cpu: '4'
                  memory: 4Gi
                requests:
                  cpu: 500m
                  memory: 1Gi
kind: ConfigMap
metadata:
  name: custom-epp
  namespace: prod

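The keys that can be patched are deployment, service, and podDisruptionBudget.

Each inference extension instance is bound to an InferencePool. To change the inference extension configuration, add an annotation to the InferencePool resource that names the ConfigMap to apply.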

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  annotations:
    inference.networking.x-k8s.io/epp-overlay: custom-epp
  name: qwen3-8b-pool
  namespace: prod
spec:
  extensionRef:
    group: ''
    kind: Service
    name: qwen3-8b-ext-proc
  selector:
    app: qwen3-8b
  targetPortNumber: 8000
