1. Input data
{"problem_id": "003", "time_range": "2025-08-29 10:14:20 ~ 2025-08-29 10:19:20", "candidate_root_causes": ["ad.Failure", "ad.LargeGc", "ad.memory", "ad.cpu","ad.networkLatency", "cart.Failure", "cart.cpu", "checkout.cpu", "checkout.Failure", "image-provider.cpu", "image-provider.memory", "image-provider.networkLatency", "inventory.Failure", "inventory.cpu", "inventory.memory", "inventory.networkLatency", "load-generator.cpu", "load-generator.FloodHomepage", "payment.Failure", "payment.Unreachable", "payment.cpu", "payment.memory", "payment.networkLatency", "product-catalog.Failure", "product-catalog.cpu", "product-catalog.memory", "product-catalog.networkLatency", "recommendation.CacheFailure", "recommendation.Failure", "recommendation.cpu", "recommendation.memory", "recommendation.networkLatency", "system.NodeKiller"], "alarm_rules": ["frontend_avg_rt"]}
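The input record can also be read programmatically before you touch the UI. The sketch below is illustrative only: it assumes the record arrives as a JSON string with the fields shown above, that `time_range` always uses the `start ~ end` layout, and it shortens the candidate list for brevity. The helper name `parse_time_range` is hypothetical.

```python
import json
from datetime import datetime

# Shortened copy of the input record (candidate list trimmed for brevity).
record_text = (
    '{"problem_id": "003", '
    '"time_range": "2025-08-29 10:14:20 ~ 2025-08-29 10:19:20", '
    '"candidate_root_causes": ["checkout.cpu", "checkout.Failure", "cart.cpu"], '
    '"alarm_rules": ["frontend_avg_rt"]}'
)


def parse_time_range(time_range):
    """Split 'start ~ end' into two datetime objects (hypothetical helper)."""
    start_text, end_text = [part.strip() for part in time_range.split("~")]
    fmt = "%Y-%m-%d %H:%M:%S"
    return datetime.strptime(start_text, fmt), datetime.strptime(end_text, fmt)


record = json.loads(record_text)
start, end = parse_time_range(record["time_range"])
window_seconds = (end - start).total_seconds()

# Group candidates by service, e.g. {"checkout": ["cpu", "Failure"], ...}
by_service = {}
for candidate in record["candidate_root_causes"]:
    service, fault = candidate.split(".", 1)
    by_service.setdefault(service, []).append(fault)

print(start, end, window_seconds, by_service)
```

Grouping the candidates by service makes it easier to check, once a suspicious service is found, which fault types are allowed answers for it.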
2. Page operations
2.1 View error data
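- Open one of the links below to enter the Trace Analysis page of Cloud Monitor 2.0 (云监控2.0):
- The Cloud Monitor 2.0 workspace used by competition leaderboard A: https://sls.aliyun.com/doc/playground/tianchi2025.html
- The Cloud Monitor 2.0 workspace used by competition leaderboard B: https://sls.aliyun.com/doc/playground/tianchi2025b.html
- Click Application Monitoring (应用监控) in the left sidebar, then Trace Analysis (调用链分析) in the top bar.
- Copy the fault window 2025-08-29 10:14:20 ~ 2025-08-29 10:19:20 verbatim, paste it into the time input box in the upper-right corner of the page, and press Enter to confirm.
- In the Span list, click the sort arrow on the duration column, then click the Details button in the Actions column to inspect the spans with long durations: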
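Following the documentation, click the trace with the longest black line. It shows that the span with a long self time is checkout SERVER, and the corresponding host name is checkout-5d79bbcb9-mvnkr.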
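The idea of "self time" can be made concrete with a small sketch. It does not use the Cloud Monitor API. The span records, field names, and durations below are all hypothetical, and the calculation assumes that child spans run sequentially without overlapping, so a span's self time is its duration minus the total duration of its direct children.

```python
from collections import defaultdict

# Hypothetical spans for one trace; durations are made-up illustration values.
spans = [
    {"span_id": "a", "parent_id": None, "name": "frontend SERVER", "duration_ms": 1200},
    {"span_id": "b", "parent_id": "a", "name": "checkout SERVER", "duration_ms": 1100},
    {"span_id": "c", "parent_id": "b", "name": "payment SERVER", "duration_ms": 50},
    {"span_id": "d", "parent_id": "b", "name": "cart SERVER", "duration_ms": 30},
]


def self_times(spans):
    """Return {span_id: duration minus the summed duration of direct children}."""
    child_total = defaultdict(float)
    for span in spans:
        if span["parent_id"] is not None:
            child_total[span["parent_id"]] += span["duration_ms"]
    return {
        span["span_id"]: span["duration_ms"] - child_total[span["span_id"]]
        for span in spans
    }


times = self_times(spans)
slowest = max(spans, key=lambda span: times[span["span_id"]])
print(slowest["name"], times[slowest["span_id"]])
```

In this made-up trace the frontend span has the longest total duration, but most of that time is spent waiting on checkout, which is why ranking by self time rather than total duration points at the checkout span.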
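2.3 Intelligent analysis
On the Trace details page, click the magic-wand button to the right of "Anomaly detected" (检测到异常) to open Copilot and ask it questions: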
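2.4 Locate the performance problem
In the left sidebar, click Container Insights (容器洞察), hover to expand the Resource Center (资源中心) menu, and click Pod List (Pod列表):
At the top of the page, click + to expand the query bar. Open the key menu and select name; open the value menu and select the host that showed the anomaly earlier, checkout-5d79bbcb9-mvnkr; then click Confirm:
In the Pod name list, hover to expand the actions list, click the eye button, and in the pop-up page click Open Entity (打开实体):
On the entity details page, under the CPU Resource section, click the period-over-period comparison (同比环比) button on the CPU Usage chart:
Expand the drop-down, select 1 hour, and click Query:
Comparing against the data from 1 hour earlier shows that CPU usage rose sharply from 24.889% to 99.287%, which can be treated as anomalous.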
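The comparison above can be written out as simple arithmetic. This is a minimal sketch: the two readings come from the walkthrough, while the function name `compare_cpu` and both thresholds (`min_ratio`, `min_current_pct`) are hypothetical values chosen for illustration, not rules defined by the platform or the competition.

```python
def compare_cpu(previous_pct, current_pct, min_ratio=2.0, min_current_pct=90.0):
    """Compare two CPU usage readings given in percent.

    The thresholds are illustrative assumptions: the reading is flagged only
    when usage has at least doubled AND is close to saturation.
    """
    delta = current_pct - previous_pct   # change in percentage points
    ratio = current_pct / previous_pct   # relative growth factor
    is_anomalous = ratio >= min_ratio and current_pct >= min_current_pct
    return delta, ratio, is_anomalous


delta, ratio, is_anomalous = compare_cpu(24.889, 99.287)
print(f"+{delta:.3f} percentage points, {ratio:.2f}x, anomalous={is_anomalous}")
```

Requiring both a large relative increase and a high absolute value avoids flagging a pod that merely went from, say, 1% to 3%.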
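3. Conclusion
Combining the trace view with the Copilot analysis, and examining the span with high exclusive (self) time together with its performance metrics, the root cause can be identified as a CPU load fault in checkout.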
{"problem_id": "003", "root_causes": ["checkout.cpu"]}
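Before submitting, it may be worth checking that the answer is one of the allowed candidates. The sketch below assumes the output format shown above and the `service.fault` naming used in `candidate_root_causes`. The helper `build_answer` is hypothetical, and the candidate list is shortened.

```python
import json

# Shortened candidate list; the full list is in the input record.
candidates = ["checkout.cpu", "checkout.Failure", "cart.cpu", "payment.cpu"]


def build_answer(problem_id, service, fault, candidates):
    """Build the answer record, rejecting root causes outside the candidate list."""
    root_cause = f"{service}.{fault}"
    if root_cause not in candidates:
        raise ValueError(f"{root_cause} is not a candidate root cause")
    return json.dumps({"problem_id": problem_id, "root_causes": [root_cause]})


answer = build_answer("003", "checkout", "cpu", candidates)
print(answer)
```

The check catches simple slips such as writing `checkout.CPU` or naming a fault type that is not offered for that service.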