[Azure Redis] Redis client hits a 15-minute timeout exception

2024-08-25


Problem description

A client that connects to Azure Redis through Lettuce (lettuce.io) ran into Timeout exceptions that lasted as long as 15 minutes.

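Answer

Azure Redis is a PaaS service, so platform upgrade operations can trigger a failover. If the Redis client is deployed on a Linux server, the client may then be unable to reconnect for as long as 15 minutes.

Default TCP settings in some Linux versions can cause Redis server connections to fail for 13 minutes or more. These defaults can prevent the client application from detecting closed connections, and from restoring them automatically when the connection was not closed gracefully.

Reconnection can fail in this way when the network connection is interrupted or when the Redis server goes offline for unplanned maintenance.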

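This is a known issue in the Lettuce community: when the server side drops the connection without sending an RST, Lettuce needs 15+ minutes to recover on its own. https://github.com/lettuce-io/lettuce-core/issues/2082

The approach currently known to work is to lower the Linux tcp_retries2 parameter: https://docs.azure.cn/zh-cn/azure-cache-for-redis/cache-best-practices-connection#tcp-settings-for-linux-hosted-client-applications

The Lettuce community has also suggested some workarounds: https://github.com/lettuce-io/lettuce-core/issues/2082#issuecomment-1407609439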

Appendix: Connection does not re-establish for 15 minutes when running on Linux

Connection stalls lasting for 15 minutes like this are often caused by very optimistic default TCP settings in some Linux distros (confirmed on CentOS so far). When a server stops responding without gracefully closing the connection, the client TCP stack will continue retransmitting packets for 15 minutes before declaring the connection dead and allowing the StackExchange.Redis reconnect logic to kick in.
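To see where a figure of roughly 15 minutes can come from, the following is a minimal arithmetic sketch of exponential retransmission backoff. It assumes an initial retransmission timeout of 0.2 s that doubles on every retry and is capped at 120 s; those two constants and the function name are assumptions for illustration, and the real timeout on a given host depends on the measured round-trip time.

```python
# Rough estimate of how long a Linux TCP stack keeps retransmitting on a dead
# connection for a given tcp_retries2 value.
# Assumptions (not stated in the article): the retransmission timeout starts
# at 0.2 s, doubles after every retry, and is capped at 120 s.

RTO_MIN_SECONDS = 0.2
RTO_MAX_SECONDS = 120.0


def retransmit_timeout_seconds(tcp_retries2,
                               rto_min=RTO_MIN_SECONDS,
                               rto_max=RTO_MAX_SECONDS):
    """Sum the backed-off waits for attempts 0..tcp_retries2 (inclusive)."""
    total = 0.0
    for attempt in range(tcp_retries2 + 1):
        total += min(rto_min * (2 ** attempt), rto_max)
    return total


if __name__ == "__main__":
    for retries in (15, 8, 5):
        seconds = retransmit_timeout_seconds(retries)
        print(f"tcp_retries2={retries}: about {seconds:.1f} s "
              f"({seconds / 60:.1f} min)")
```

Under these assumptions, a tcp_retries2 value of 15 works out to about 924.6 seconds (a little over 15 minutes), while a value of 5 works out to about 12.6 seconds, which is why lowering the setting shortens the stall so dramatically.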

With Azure Cache for Redis, it's fairly easy to reproduce this by rebooting nodes as mentioned above. In this case, the machine goes down abruptly and the Redis server isn't able to transmit a FIN packet to the client. The client TCP stack continues retransmitting on the same socket hoping the server will come back up. Even when the node has rebooted and come back, it has no record of that connection so it continues ignoring the client. If the client gave up and created a NEW connection, it would be able to resume communication with the server much sooner than 15 minutes.

As you found, there are TCP settings you can change on the client machine to force it to time out the connection sooner and allow for reconnect. In addition to tcp_retries2, you can try tuning the keepalive settings as discussed in lettuce-io/lettuce-core#1428 (comment). It should be safe to reduce these timeouts to more realistic durations machine-wide unless you have systems that actually depend on the unusually long retransmits.
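The same kind of tuning can also be applied per socket rather than machine-wide. The sketch below uses Python's standard socket module only to illustrate the options involved; it is not Lettuce or StackExchange.Redis code. The helper name and all of the numeric values are assumptions chosen for illustration. Keepalive probes cover idle connections, while TCP_USER_TIMEOUT (Linux only) bounds how long sent data may stay unacknowledged.

```python
import socket


def enable_fast_failure_detection(sock, idle_s=15, interval_s=5, probes=3,
                                  user_timeout_ms=30000):
    """Enable TCP keepalive and, where the platform offers it, TCP_USER_TIMEOUT.

    Hypothetical helper: the default values are illustrative, not recommendations.
    Returns a dict of the options that were actually applied.
    """
    applied = {}
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied["SO_KEEPALIVE"] = 1

    # These constants are available on Linux; the getattr guard lets the
    # sketch run on platforms where some of them are missing.
    wanted = (
        ("TCP_KEEPIDLE", idle_s),             # idle seconds before first probe
        ("TCP_KEEPINTVL", interval_s),        # seconds between probes
        ("TCP_KEEPCNT", probes),              # failed probes before giving up
        ("TCP_USER_TIMEOUT", user_timeout_ms),  # max ms data may stay unacked
    )
    for name, value in wanted:
        option = getattr(socket, name, None)
        if option is None:
            continue
        try:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
            applied[name] = value
        except OSError:
            # The constant exists but the kernel rejected it; skip it.
            pass
    return applied


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print(enable_fast_failure_detection(s))
```

A client library would normally expose these through its own configuration rather than raw socket calls, so treat this only as a map of which kernel options are in play.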

An additional approach is using the ForceReconnect pattern recommended in the Azure best practices. If you're seeing issues like this, it's perfectly appropriate to trigger reconnect on RedisTimeoutExceptions in addition to RedisConnectionExceptions. Just don't be too aggressive with it because an overloaded server can also result in persistent RedisTimeoutExceptions. Recreating connections in that situation can cause additional server load and a cascade failure.
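The caution about not reconnecting too aggressively can be sketched as a small gate that only allows a reconnect when errors have persisted for a while and the previous reconnect is not too recent. This is a language-neutral illustration in Python, not the StackExchange.Redis or Azure sample code; the class name ReconnectGate and the 60 s and 30 s defaults are assumptions for illustration.

```python
import time


class ReconnectGate:
    """Decide when persistent errors justify rebuilding a connection.

    Hypothetical helper. A reconnect is allowed only if errors have been
    observed for at least error_threshold_s without an intervening success,
    and the last reconnect happened at least min_interval_s ago.
    """

    def __init__(self, min_interval_s=60.0, error_threshold_s=30.0,
                 clock=time.monotonic):
        self.min_interval_s = min_interval_s
        self.error_threshold_s = error_threshold_s
        self._clock = clock              # injectable so the logic is testable
        self._first_error_at = None      # start of the current error streak
        self._last_reconnect_at = None

    def record_success(self):
        """A successful command ends the current error streak."""
        self._first_error_at = None

    def record_error(self):
        """Record a timeout or connection error; True means reconnect now."""
        now = self._clock()
        if self._first_error_at is None:
            self._first_error_at = now
            return False
        if now - self._first_error_at < self.error_threshold_s:
            return False                 # errors have not persisted long enough
        if (self._last_reconnect_at is not None
                and now - self._last_reconnect_at < self.min_interval_s):
            return False                 # reconnected too recently; back off
        self._last_reconnect_at = now
        self._first_error_at = None
        return True
```

An application would call record_error() from its timeout and connection exception handlers and record_success() after a successful command, recreating the client connection only when record_error() returns True. The minimum interval is what keeps an overloaded server from being hit with a storm of new connections.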

Unfortunately there's not much the StackExchange.Redis library can do about this situation, because the Linux TCP stack is hiding the lost connection. Detecting the stall at the library level would require making assumptions that would almost certainly lead to false positives in some scenarios. Instead, it's better for the client application to implement some detection/reconnection logic based on what it knows about its load and latency patterns.
