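[Azure App Service] Web App in Practice: Reproducing Connection Exhaustion and SNAT Exhaustion with .NET Code

Problem description:

In the previous post, I explained two App Service concepts that are easy to confuse:

Outbound Connection: TCP connection resources on the worker instance. Exhaustion typically surfaces as a SocketException.
SNAT Port: a source port allocated on the public side by the outbound load balancer. Each instance is usually estimated at 128 ports (an estimate; the actual value may be higher). Exhaustion typically surfaces as connection timeouts.

Concepts alone stay abstract, so I built a .NET demo that splits the problem into four small experiments:

Experiment 1: Connection exhaustion (new HttpClient() on every request)
Experiment 2: Connection optimization (reuse through IHttpClientFactory)
Experiment 3: SNAT exhaustion (connection pool disabled + Connection: close)
Experiment 4: SNAT optimization (singleton HttpClient + MaxConnectionsPerServer ≤ 128)

Solution:

Experiment 1: Quickly exhausting the outbound connections of an App Service instance

The anti-pattern is simple: every request calls new HttpClient(), with no reuse and no disposal.

Each request then brings its own handler and connection pool, so under a burst of concurrent requests the TCP connection resources on the worker pile up quickly.

Code snippet for Experiment 1: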

    // BAD: a new HttpClient per request; handlers and sockets accumulate
    app.MapGet("/api/demo/connection-bad", async (
        int count, int concurrency, string? url) =>
    {
        return await Runner.RunAsync(count, concurrency, async _ =>
        {
            var client = new HttpClient(); // created every time, never disposed
            using var resp = await client.GetAsync(url);
            resp.EnsureSuccessStatusCode();
        });
    });
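
The snippets in this post call Runner.RunAsync, whose implementation is not shown. The following is only a minimal sketch of what such a helper might look like, assuming it runs `count` operations with at most `concurrency` in flight and reports success and failure counts. The `RunResult` record and its fields are hypothetical names introduced for illustration; the demo's actual return type may differ.

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical result shape; the article does not show what Runner returns.
public sealed record RunResult(
    int Total, int Succeeded, int Failed, long ElapsedMs, string[] SampleErrors);

public static class Runner
{
    // Runs `work` `count` times with at most `concurrency` calls in flight.
    public static async Task<RunResult> RunAsync(
        int count, int concurrency, Func<int, Task> work)
    {
        if (count < 0) throw new ArgumentOutOfRangeException(nameof(count));
        if (concurrency < 1) throw new ArgumentOutOfRangeException(nameof(concurrency));

        using var gate = new SemaphoreSlim(concurrency); // caps in-flight calls
        var errors = new ConcurrentQueue<string>();
        int succeeded = 0;
        var sw = Stopwatch.StartNew();

        var tasks = Enumerable.Range(0, count).Select(async i =>
        {
            await gate.WaitAsync();
            try
            {
                await work(i);
                Interlocked.Increment(ref succeeded);
            }
            catch (Exception ex)
            {
                // Record the failure instead of aborting the whole run.
                errors.Enqueue($"{ex.GetType().Name}: {ex.Message}");
            }
            finally
            {
                gate.Release();
            }
        }).ToArray();

        await Task.WhenAll(tasks);
        sw.Stop();

        return new RunResult(
            count, succeeded, errors.Count, sw.ElapsedMilliseconds,
            errors.Distinct().Take(5).ToArray()); // keep only a few distinct messages
    }
}
```

Catching exceptions per operation, rather than letting Task.WhenAll throw, lets one run report how many requests failed and with which messages, which is what the experiments need in order to compare the bad and good variants.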

Exception message:

HttpRequestException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. (blog.mylubu.com:443) --> SocketException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.


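Experiment 2: Connection optimization: reuse through a singleton `HttpClient` / `IHttpClientFactory`

The idea is to keep only a small number of long-lived connections and let requests reuse them.

Reuse HttpClient or use IHttpClientFactory;
use PooledConnectionLifetime to recycle connections periodically so that DNS changes are picked up;
use MaxConnectionsPerServer to cap the number of physical connections to the same destination.

Code snippet for Experiment 2: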

    // GOOD: register once in DI
    builder.Services.AddHttpClient("pooled", c => c.Timeout = TimeSpan.FromSeconds(30))
        .ConfigurePrimaryHttpMessageHandler(() => new SocketsHttpHandler
        {
            PooledConnectionLifetime = TimeSpan.FromMinutes(2), // recycle so DNS changes are picked up
            MaxConnectionsPerServer = 20, // bounded connection pool
        });

    app.MapGet("/api/demo/connection-good", async (
        int count, int concurrency, string? url, IHttpClientFactory factory) =>
    {
        var client = factory.CreateClient("pooled"); // backed by the factory's pooled handler
        return await Runner.RunAsync(count, concurrency, async _ =>
        {
            using var resp = await client.GetAsync(url);
            resp.EnsureSuccessStatusCode();
        });
    });

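Key optimizations (vs. Experiment 1)

No more new HttpClient(): IHttpClientFactory.CreateClient("pooled") returns a client backed by the shared, pooled handler.
PooledConnectionLifetime = 2 min: connections are recycled periodically, which avoids stale DNS entries.
MaxConnectionsPerServer = 20 (adjustable in the demo's parameter panel): keeps the number of concurrent physical connections to a single destination at a safe level.
Result: N HTTP requests map to at most 20 physical TCP connections, and sockets no longer leak.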


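Experiment 3: Exhausting the SNAT ports of an App Service instance

Connection optimization addresses local resources on the worker, but SNAT is a separate layer of limits.

As long as every HTTP request is a new TCP flow, the outbound load balancer must keep allocating new SNAT ports.

A single App Service instance is usually estimated at 128 SNAT ports; once they are exhausted, new connections hang until they time out.

This anti-pattern disables the connection pool and sends Connection: close, forcing every request to open a new TCP connection.

Code snippet for Experiment 3: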

    // BAD: pooling disabled + Connection: close => every request is a brand-new TCP flow
    app.MapGet("/api/demo/snat-bad", async (
        int count, int concurrency, string? url) =>
    {
        return await Runner.RunAsync(count, concurrency, async _ =>
        {
            using var handler = new SocketsHttpHandler
            {
                PooledConnectionLifetime = TimeSpan.Zero, // disable connection pooling
            };
            using var client = new HttpClient(handler);
            client.DefaultRequestHeaders.ConnectionClose = true; // force the connection to close
            using var resp = await client.GetAsync(url);
            resp.EnsureSuccessStatusCode();
        });
    });

Exception message:

HttpRequestException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. (blog.mylubu.com:443) --> SocketException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

Result: the SNAT port usage observed during the test exceeded 128.

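Experiment 4: SNAT optimization: Keep-Alive reuse + `MaxConnectionsPerServer ≤ 128`

The fix is just as direct: keep the connection pool and let requests reuse existing TCP connections.

Remove Connection: close and keep Keep-Alive;
enable the connection pool by no longer setting PooledConnectionLifetime to Zero;
cap MaxConnectionsPerServer so that the number of physical connections to the same destination stays below the safe SNAT level.

Code snippet for Experiment 4: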

    // GOOD: register a singleton HttpClient; all requests share one connection pool
    // (SharedHttpClient is the demo's wrapper type; its definition is not shown here)
    builder.Services.AddSingleton<SharedHttpClient>(_ =>
    {
        var handler = new SocketsHttpHandler
        {
            PooledConnectionLifetime = TimeSpan.FromMinutes(2), // pooling enabled
            PooledConnectionIdleTimeout = TimeSpan.FromSeconds(30), // reclaim idle connections
            MaxConnectionsPerServer = 20, // <= 128
        };
        return new SharedHttpClient(new HttpClient(handler));
    });

    app.MapGet("/api/demo/snat-good", async (
        int count, int concurrency, string? url, SharedHttpClient shared) =>
    {
        return await Runner.RunAsync(count, concurrency, async _ =>
        {
            // ConnectionClose is not set => keep-alive reuse
            using var resp = await shared.Client.GetAsync(url);
            resp.EnsureSuccessStatusCode();
        });
    });

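Key optimizations (vs. Experiment 3)

Connection: close removed: keep-alive stays on, so the connection is not torn down after each response.
Connection pool enabled: PooledConnectionLifetime = 2 min (instead of Zero).
PooledConnectionIdleTimeout = 30 s added: idle connections are reclaimed after the timeout, while active connections are kept.
MaxConnectionsPerServer = 20 (adjustable in the demo's parameter panel): a hard cap far below the estimated safe level of 128 SNAT ports, so the limit is never hit.
HttpClient registered as a singleton: the whole process shares one instance, and all requests reuse the same connection pool.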


Summary:

Watch the App Service Connections metric during the experiments above: once connections are reused, the metric drops quickly and visibly.
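
Besides the portal metric, it can help to look at connection states on the machine itself. The sketch below is not part of the original demo. It assumes an environment where the process may enumerate TCP connections (for example a local machine or a container; some hosting sandboxes may not permit this), and it uses IPGlobalProperties from System.Net.NetworkInformation to count connections per TCP state.

```csharp
using System;
using System.Linq;
using System.Net.NetworkInformation;

// Snapshot of the TCP connections visible on this host.
var connections = IPGlobalProperties.GetIPGlobalProperties().GetActiveTcpConnections();

// Count connections per TCP state (Established, TimeWait, ...), most common first.
var byState = connections
    .GroupBy(c => c.State)
    .Select(g => (State: g.Key, Count: g.Count()))
    .OrderByDescending(x => x.Count)
    .ToList();

foreach (var (state, count) in byState)
    Console.WriteLine($"{state,-12} {count}");

// Optionally narrow to one remote port, e.g. HTTPS.
int toPort443 = connections.Count(c => c.RemoteEndPoint.Port == 443);
Console.WriteLine($"Connections to remote port 443: {toPort443}");
```

Running this before and after a load test gives a rough picture: with the anti-patterns one would expect many short-lived connections (often left in TimeWait), while with pooling the number of connections to the target should stay near the configured MaxConnectionsPerServer.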

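Frequently asked questions (FAQ):

Q: Why do Experiments 1 and 3 both fail, yet for different root causes?

A: In Experiment 1 the application repeatedly creates HttpClient instances without disposing them, so local sockets, ephemeral ports, and handles on the worker pile up quickly. In Experiment 3 the connection pool is disabled and Connection: close is forced, so every request becomes a new TCP flow and the SNAT ports toward the same destination are consumed rapidly. The former is mostly a local resource leak on the worker; the latter is mostly SNAT port exhaustion at the outbound load balancer.

Q: Does a singleton HttpClient alone guarantee that SNAT problems go away?

A: Not necessarily. That is the point of Experiment 3: even if you "reuse HttpClient", once pooling is disabled or Connection: close is added, each request still opens a new TCP connection underneath, and the SNAT ports still run out. What really matters is connection pooling + Keep-Alive + a sensible MaxConnectionsPerServer.

Q: What value should MaxConnectionsPerServer be set to?

A: There is no fixed value. In my experience, start by keeping it within a safe level per destination service. For a single public endpoint, begin load testing from conservative values such as 20 or 50; do not jump straight to several hundred. The SNAT ports of a single App Service instance are usually estimated at 128, so the number of concurrent physical connections to one destination should stay clearly below that.

Q: When is a NAT Gateway or a Private Endpoint needed?

A: If the code already reuses connections but the workload genuinely needs a large number of concurrent public outbound connections, VNet Integration + NAT Gateway moves outbound traffic onto a dedicated port pool. If the target is a service that supports private access, such as Azure SQL, Storage, or Redis, a Private Endpoint is the more thorough fix, because traffic then flows over the private network and no longer consumes public SNAT ports.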

References

SNAT with App Service: https://4lowtherabbit.github.io/blogs/2019/10/SNAT/

SNAT ports vs. outbound connections in App Service (Web App): which one limits which? https://www.cnblogs.com/lulight/p/20239022

