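Background
Alibaba Cloud Lens builds unified observability for cloud products on top of SLS (Simple Log Service). It supports one-click enabling of collection for instance logs (important logs, detailed logs, job operational logs) and global logs (audit logs, billing logs, error logs, metrics).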
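| Log category | Subcategory | Monitoring scenarios |
| --- | --- | --- |
| Instance logs | Detailed logs (paid) | Access traffic monitoring; access exception monitoring |
| | Important logs (free) | Consumer group monitoring; Logtail collection monitoring |
| | Job operational logs (free) | Data transformation (new version) monitoring; Scheduled SQL job monitoring |
| Global logs | Audit logs (free) | Resource operation monitoring |
| | Error logs (free) | Quota exceeded monitoring; access exception monitoring; operation exception monitoring |
| | Metrics (free) | Access traffic monitoring; access exception monitoring; resource quota watermark monitoring |
| | Billing logs (free) | Resource usage tracking |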
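For a description of each log type, see the CloudLens log index table.
Use case
This article describes how to use the global error logs and metrics in CloudLens for SLS to monitor Project resource quota watermarks and quota-exceeded events, and how to submit a request to raise a resource quota.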
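Prerequisites
1. CloudLens for SLS is activated, and the global error logs and metrics are enabled.
2. The global monitoring logs are stored in the same Project.
To build real-time resource quota watermark monitoring, the global monitoring logs (error logs and metrics) must be stored in the same Project. To keep the monitoring logs from consuming the quota of a business Project, choose a dedicated target Project in a fixed region, for example in the Hangzhou region: log-service-{user ID}-cn-hangzhou.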
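CloudLens for SLS quota monitoring dashboard
Resource quota warning overview
The dashboard provides an overview of resource quota warnings (watermark above 80%) and the distribution of quota-exceeded events.
Real-time watermark details for key Project resource quotas
This view covers the real-time watermarks of some basic resource quotas and of the data read/write resource quotas of each Project.
Project resource quota exceeded details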
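Monitoring practice
1. Quota monitoring items by category:
| Category | Monitoring item | Description |
| --- | --- | --- |
| Real-time watermark monitoring | Basic resource quota watermark monitoring | |
| | Data read/write quota watermark monitoring | |
| Quota exceeded monitoring | Resource quota exceeded count monitoring | |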
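2. The advanced monitoring items are broken down as follows:
| Category | Scenario | Monitoring item | Description |
| --- | --- | --- | --- |
| Basic resource quotas | LogStore | Real-time watermark monitoring | |
| | | Quota exceeded monitoring | |
| | Machine group | Watermark monitoring | |
| | | Quota exceeded monitoring | |
| | Logtail configuration | Watermark monitoring | |
| | | Quota exceeded monitoring | |
| Data read/write resource quotas | Project write traffic | Watermark monitoring | |
| | | Quota exceeded monitoring | |
| | Project write count | Watermark monitoring | |
| | | Quota exceeded monitoring | |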
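Basic monitoring
Basic resource quota watermark monitoring
1. Confirm the alert SQL: every 15 minutes, check whether the watermark of the LogStore count, machine group count, or Logtail configuration count has reached the alert threshold.
Note: a query returns 100 rows by default. To return the full result set, append a limit clause to the end of the SQL. For example, limit 1000 returns up to 1,000 rows.
The query SQL is as follows. (Note: an alert compares at most 1,000 rows of the result against the trigger condition, so it is best to filter on the watermark inside the alert SQL first, as done here with logstore_ratio > 80 or machine_group_ratio > 80 or logtail_config_ratio > 80.)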
* | select Project, region, logstore_ratio, machine_group_ratio, logtail_config_ratio
from (
  SELECT A.id as Project,
    A.region as region,
    round(COALESCE(SUM(B.count_logstore), 0) / cast(json_extract(A.quota, '$.logstore') as double) * 100, 3) as logstore_ratio,
    cast(json_extract(A.quota, '$.logstore') as double) as quota_logstore,
    round(COALESCE(SUM(C.count_machine_group), 0) / cast(json_extract(A.quota, '$.machine_group') as double) * 100, 3) as machine_group_ratio,
    cast(json_extract(A.quota, '$.machine_group') as double) as quota_machine_group,
    round(COALESCE(SUM(D.count_logtail_config), 0) / cast(json_extract(A.quota, '$.config') as double) * 100, 3) as logtail_config_ratio,
    cast(json_extract(A.quota, '$.config') as double) as quota_logtail_config
  FROM "resource.sls.cmdb.project" as A
    LEFT JOIN (SELECT project, COUNT(*) AS count_logstore FROM "resource.sls.cmdb.logstore" as B GROUP BY project) AS B ON A.id = B.project
    LEFT JOIN (SELECT project, COUNT(*) AS count_machine_group FROM "resource.sls.cmdb.machine_group" as C GROUP BY project) AS C ON A.id = C.project
    LEFT JOIN (SELECT project, COUNT(*) AS count_logtail_config FROM "resource.sls.cmdb.logtail_config" as D GROUP BY project) AS D ON A.id = D.project
  group by A.id, A.quota, A.region
)
where quota_logstore is not null
  and quota_machine_group is not null
  and quota_logtail_config is not null
  and (logstore_ratio > 80 or machine_group_ratio > 80 or logtail_config_ratio > 80)
limit 10000
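2. Alert configuration
Configure the trigger conditions and the alert policy according to your business scenario:
- If the LogStore count, machine group count, or Logtail configuration count of any Project exceeds 90% of its quota, the alert severity is Critical.
- If the LogStore count, machine group count, or Logtail configuration count of any Project exceeds 80% of its quota, the alert severity is Medium.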
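To make the arithmetic behind the watermark query concrete, the following Python sketch recomputes the three ratios for a single Project outside SLS. The function names `quota_watermarks` and `needs_alert` and the sample numbers are hypothetical; only the quota JSON keys (`logstore`, `machine_group`, `config`), the zero default for missing counts, and the 80% filter are taken from the query.

```python
import json

def quota_watermarks(quota_json, counts):
    """Return usage percentages per resource, or None if any quota is missing.

    quota_json: the Project's quota field as a JSON string.
    counts: current resource counts, keyed like the quota JSON.
    """
    quota = json.loads(quota_json)
    ratios = {}
    for key in ("logstore", "machine_group", "config"):
        limit = quota.get(key)
        if limit is None:
            return None  # mirrors the "quota_... is not null" filter
        used = counts.get(key, 0)  # mirrors COALESCE(..., 0)
        ratios[key] = round(used / float(limit) * 100, 3)
    return ratios

def needs_alert(ratios, threshold=80):
    """True when at least one watermark is above the threshold."""
    return ratios is not None and any(r > threshold for r in ratios.values())

# Hypothetical Project: 170 of 200 LogStores, 20 of 200 machine groups.
sample = quota_watermarks(
    '{"logstore": 200, "machine_group": 200, "config": 200}',
    {"logstore": 170, "machine_group": 20},
)
print(sample, needs_alert(sample))
```

A Project whose quota JSON lacks one of the three keys is skipped entirely, which matches the combined `is not null` conditions in the query.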
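Data read/write quota watermark monitoring
1. Confirm the alert SQL: every minute, check whether the watermark of the Project write traffic or write count has reached the alert threshold.
Note: a query returns 100 rows by default. To return the full result set, append a limit clause to the end of the SQL. For example, limit 1000 returns up to 1,000 rows.
Query SQL: (Note: an alert compares at most 1,000 rows of the result against the trigger condition, so it is best to filter on write traffic and write count inside the alert SQL first, as done here with where inflow_ratio > 80 or write_cnt_ratio > 80.)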
(*) | select Project, region, inflow_ratio, write_cnt_ratio
from (
  SELECT cmdb.id as Project,
    cmdb.region as region,
    round(COALESCE(M.name1, 0) / round(cast(json_extract(cmdb.quota, '$.inflow_per_min') as double) / 1000000000, 3) * 100, 3) as inflow_ratio,
    round(COALESCE(M.name2, 0) / cast(json_extract(cmdb.quota, '$.write_cnt_per_min') as double) * 100, 3) as write_cnt_ratio
  from "resource.sls.cmdb.project" as cmdb
    LEFT JOIN (
      select project, round(MAX(name1) / 1000000000, 3) as name1, MAX(name2) as name2
      from (
        SELECT __time_nano__ as time,
          element_at(split_to_map(__labels__, '|', '#$#'), 'project') as project,
          sum(CASE WHEN __name__ = 'logstore_origin_inflow_bytes' THEN __value__ ELSE NULL END) AS name1,
          sum(CASE WHEN __name__ = 'logstore_write_count' THEN __value__ ELSE NULL END) AS name2
        FROM "internal-monitor-metric.prom"
        where __name__ in ('logstore_origin_inflow_bytes', 'logstore_write_count')
          and regexp_like(element_at(split_to_map(__labels__, '|', '#$#'), 'project'), '.*')
        group by project, time
      )
      group by project
    ) AS M ON cmdb.id = M.project
)
where inflow_ratio > 80 or write_cnt_ratio > 80
limit 10000
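2. Alert configuration
Set the query time range to the last 5 minutes (relative), then configure the trigger conditions and the alert policy according to your business scenario:
- If the write traffic or write count of any Project exceeds 90% of its quota, the alert severity is Critical.
- If the write traffic or write count of any Project exceeds 80% of its quota, the alert severity is Medium.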
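The write-traffic query works in two aggregation steps: it first sums the metric points of all LogStores in a Project at each timestamp, then takes the maximum of those sums over the query window and compares it with the per-minute quota. The Python sketch below reproduces that shape under simplifying assumptions: the function name `peak_inflow_ratio`, the tuple layout of `samples`, and the sample values are hypothetical, and the intermediate rounding to three decimals that the query applies after dividing by 1,000,000,000 is omitted.

```python
from collections import defaultdict

def peak_inflow_ratio(samples, quota_bytes_per_min):
    """Return {project: percent of the write-traffic quota used at the busiest timestamp}.

    samples: iterable of (timestamp, project, inflow_bytes), one point per LogStore.
    quota_bytes_per_min: {project: per-minute inflow quota in bytes}.
    """
    per_timestamp = defaultdict(float)
    for ts, project, value in samples:
        # Step 1: sum all LogStores of a Project at the same timestamp.
        per_timestamp[(project, ts)] += value

    peak = defaultdict(float)
    for (project, _ts), total in per_timestamp.items():
        # Step 2: keep the busiest timestamp in the window.
        peak[project] = max(peak[project], total)

    return {
        project: round(value / quota_bytes_per_min[project] * 100, 3)
        for project, value in peak.items()
    }

# Two LogStores write 3 GB and 1 GB in the same minute; the quota is 5 GB per minute.
points = [(1, "demo", 3e9), (1, "demo", 1e9), (2, "demo", 2e9)]
print(peak_inflow_ratio(points, {"demo": 5e9}))
```

Because the peak rather than the average is used, a single busy minute inside the 5-minute window is enough to raise the watermark.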
Resource quota exceeded count monitoring
1. Confirm the alert SQL: every 15 minutes, check whether any quota has been exceeded.
Query SQL:
((* and (ErrorCode: ExceedQuota or ErrorCode: QuotaExceed or ErrorCode: ProjectQuotaExceed or ErrorCode: WriteQuotaExceed or ErrorCode: ShardWriteQuotaExceed or ErrorCode: ShardReadQuotaExceed))) | SELECT Project,
  CASE
    WHEN ErrorMsg like '%Project write quota exceed: inflow%' then 'Project write traffic exceeded'
    WHEN ErrorMsg like '%Project write quota exceed: qps%' then 'Project write count exceeded'
    WHEN ErrorMsg like '%dashboard quota exceed%' then 'Dashboard quota exceeded'
    WHEN ErrorMsg like '%config count%' then 'Logtail configuration quota exceeded'
    WHEN ErrorMsg like '%machine group count%' then 'Machine group quota exceeded'
    WHEN ErrorMsg like '%Alert count %' then 'Alert quota exceeded'
    WHEN ErrorMsg like '%logstore count %' then 'LogStore quota exceeded'
    WHEN ErrorMsg like '%shard count%' then 'Shard count exceeded'
    WHEN ErrorMsg like '%shard write bytes%' then 'Shard write quota exceeded'
    WHEN ErrorMsg like '%shard write quota%' then 'Shard write quota exceeded'
    WHEN ErrorMsg like '%user can only run%' then 'SQL analysis concurrency exceeded'
    ELSE ErrorMsg
  END AS ErrorMsg,
  COUNT(1) AS count
GROUP BY Project, ErrorMsg
Limit 1000
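2. Alert configuration
Configure the trigger conditions and the alert policy according to your business scenario:
- If any quota is exceeded 10 times, the alert severity is Critical.
- If any quota is exceeded once, the alert severity is Medium.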
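The classification and counting done by the quota-exceeded query can be sketched in Python as below. This is an illustration only: the names `classify`, `count_exceeded`, and `severity` are hypothetical, the category labels are English paraphrases, only a subset of the message patterns is included, the sample error messages are invented, and reading "10 times" as "10 or more" is an assumption.

```python
from collections import Counter

# (substring, category) pairs, checked in order like the CASE WHEN branches.
PATTERNS = [
    ("Project write quota exceed: inflow", "Project write traffic exceeded"),
    ("Project write quota exceed: qps", "Project write count exceeded"),
    ("config count", "Logtail configuration quota exceeded"),
    ("machine group count", "Machine group quota exceeded"),
    ("logstore count ", "LogStore quota exceeded"),
]

def classify(error_msg):
    """Map a raw error message to a category; unknown messages pass through."""
    for needle, label in PATTERNS:
        if needle in error_msg:
            return label
    return error_msg  # same idea as the ELSE branch

def count_exceeded(records):
    """records: iterable of (project, error_msg). Count events per (project, category)."""
    return Counter((project, classify(msg)) for project, msg in records)

def severity(count):
    """Assumed thresholds: 10 or more events is critical, at least one is medium."""
    if count >= 10:
        return "critical"
    if count >= 1:
        return "medium"
    return None

# Invented sample records for two Projects.
records = [("p1", "Project write quota exceed: inflow")] * 10
records.append(("p2", "machine group count exceeds limit"))
for key, n in count_exceeded(records).items():
    print(key, n, severity(n))
```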
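Advanced monitoring
The following items are finer-grained breakdowns of the basic monitoring items. They are usually not needed, but you can refer to them if you want more precise alerting.
LogStore monitoring
Watermark monitoring
1. Confirm the alert SQL: every 15 minutes, check whether the watermark of the LogStore count has reached the alert threshold.
Note: a query returns 100 rows by default. To return the full result set, append a limit clause to the end of the SQL. For example, limit 1000 returns up to 1,000 rows.
Query SQL: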
* | select Project, region, round(count_logstore / quota_logstore * 100, 3) as logstore_ratio
from (
  SELECT A.id as Project,
    A.region as region,
    COALESCE(SUM(B.count_logstore), 0) AS count_logstore,
    cast(json_extract(A.quota, '$.logstore') as double) as quota_logstore
  FROM "resource.sls.cmdb.project" as A
    LEFT JOIN (SELECT project, COUNT(*) AS count_logstore FROM "resource.sls.cmdb.logstore" as B GROUP BY project) AS B ON A.id = B.project
  group by A.id, A.quota, A.region
)
where quota_logstore is not null
order by logstore_ratio desc
limit 1000
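2. Alert configuration
Configure the trigger conditions and the alert policy according to your business scenario:
- If the LogStore count of any Project exceeds 90% of its quota, the alert severity is Critical.
- If the LogStore count of any Project exceeds 80% of its quota, the alert severity is Medium.
Note: when multiple trigger conditions are configured, they are evaluated from top to bottom, so logstore_ratio > 90 must be placed above logstore_ratio > 80.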
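The effect of the top-to-bottom evaluation order can be seen in a small Python sketch. The helper `first_matching_severity` is hypothetical and only models "first matching condition wins"; it is not the SLS alert engine.

```python
def first_matching_severity(ratio, rules):
    """Evaluate (threshold, severity) rules top to bottom; the first match wins."""
    for threshold, level in rules:
        if ratio > threshold:
            return level
    return None

correct_order = [(90, "critical"), (80, "medium")]
wrong_order = [(80, "medium"), (90, "critical")]

print(first_matching_severity(95, correct_order))  # critical
print(first_matching_severity(95, wrong_order))    # medium: the 90% rule is never reached
```

With the 80% condition listed first, every ratio above 90 also satisfies it, so the Critical condition can never fire.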
Quota exceeded monitoring
1. Confirm the alert SQL: every 15 minutes, check whether the LogStore quota has been exceeded.
Query SQL:
* and (ErrorCode: ExceedQuota or ErrorCode: QuotaExceed or ErrorCode: ProjectQuotaExceed or ErrorCode: WriteQuotaExceed) | SELECT Project, COUNT(1) AS count where ErrorMsg like '%logstore count %' GROUP BY Project ORDER BY count DESC LIMIT 1000
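2. Alert configuration
Configure the trigger conditions and the alert policy according to your business scenario:
- If the LogStore quota of any Project is exceeded 10 times, the alert severity is Critical.
- If the LogStore quota of any Project is exceeded once, the alert severity is Medium.
Machine group monitoring
Watermark monitoring
1. Confirm the alert SQL: every 15 minutes, check whether the watermark of the machine group count has reached the alert threshold.
Note: a query returns 100 rows by default. To return the full result set, append a limit clause to the end of the SQL. For example, limit 1000 returns up to 1,000 rows.
Query SQL: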
* | select Project, region, round(count_machine_group / quota_machine_group * 100, 3) as machine_group_ratio
from (
  SELECT A.id as Project,
    A.region as region,
    COALESCE(SUM(B.count_machine_group), 0) AS count_machine_group,
    cast(json_extract(A.quota, '$.machine_group') as double) as quota_machine_group
  FROM "resource.sls.cmdb.project" as A
    LEFT JOIN (SELECT project, COUNT(*) AS count_machine_group FROM "resource.sls.cmdb.machine_group" as B GROUP BY project) AS B ON A.id = B.project
  group by A.id, A.quota, A.region
)
where quota_machine_group is not null
order by machine_group_ratio desc
limit 1000
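2. Alert configuration
Configure the trigger conditions and the alert policy according to your business scenario:
- If the machine group count of any Project exceeds 90% of its quota, the alert severity is Critical.
- If the machine group count of any Project exceeds 80% of its quota, the alert severity is Medium.
Quota exceeded monitoring
1. Confirm the alert SQL: every 15 minutes, check whether the machine group quota has been exceeded.
Query SQL: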
* and (ErrorCode: ExceedQuota or ErrorCode: QuotaExceed or ErrorCode: ProjectQuotaExceed or ErrorCode: WriteQuotaExceed) | SELECT Project, COUNT(1) AS count where ErrorMsg like '%machine group count%' GROUP BY Project ORDER BY count DESC LIMIT 1000
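2. Alert configuration
Configure the trigger conditions and the alert policy according to your business scenario:
- If the machine group quota of any Project is exceeded 10 times, the alert severity is Critical.
- If the machine group quota of any Project is exceeded once, the alert severity is Medium.
Logtail configuration monitoring
Watermark monitoring
1. Confirm the alert SQL: every 15 minutes, check whether the watermark of the Logtail configuration count has reached the alert threshold.
Note: a query returns 100 rows by default. To return the full result set, append a limit clause to the end of the SQL. For example, limit 1000 returns up to 1,000 rows.
Query SQL: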
* | select Project, region, round(count_logtail_config / quota_logtail_config * 100, 3) as logtail_config_ratio
from (
  SELECT A.id as Project,
    A.region as region,
    COALESCE(SUM(B.count_logtail_config), 0) AS count_logtail_config,
    cast(json_extract(A.quota, '$.config') as double) as quota_logtail_config
  FROM "resource.sls.cmdb.project" as A
    LEFT JOIN (SELECT project, COUNT(*) AS count_logtail_config FROM "resource.sls.cmdb.logtail_config" as B GROUP BY project) AS B ON A.id = B.project
  group by A.id, A.quota, A.region
)
where quota_logtail_config is not null
order by logtail_config_ratio desc
limit 1000
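2. Alert configuration
Configure the trigger conditions and the alert policy according to your business scenario:
- If the Logtail configuration count of any Project exceeds 90% of its quota, the alert severity is Critical.
- If the Logtail configuration count of any Project exceeds 80% of its quota, the alert severity is Medium.
Quota exceeded monitoring
1. Confirm the alert SQL: every 15 minutes, check whether the Logtail configuration quota has been exceeded.
Query SQL: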
* and (ErrorCode: ExceedQuota or ErrorCode: QuotaExceed or ErrorCode: ProjectQuotaExceed or ErrorCode: WriteQuotaExceed) | SELECT Project, COUNT(1) AS count where ErrorMsg like '%config count%' GROUP BY Project ORDER BY count DESC LIMIT 1000
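2. Alert configuration
Configure the trigger conditions and the alert policy according to your business scenario:
- If the Logtail configuration quota of any Project is exceeded 10 times, the alert severity is Critical.
- If the Logtail configuration quota of any Project is exceeded once, the alert severity is Medium.
Project write traffic monitoring
Watermark monitoring
1. Confirm the alert SQL: every minute, check whether the watermark of the Project write traffic over the last 5 minutes (relative) has reached the alert threshold.
Query SQL: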
(*) | SELECT Project, region, round(count_inflow / cast(quota_inflow as double) * 100, 3) as inflow_ratio
FROM (
  SELECT cmdb.id as Project,
    cmdb.region as region,
    COALESCE(M.name1, 0) as count_inflow,
    round(cast(json_extract(cmdb.quota, '$.inflow_per_min') as double) / 1000000000, 3) as quota_inflow
  from "resource.sls.cmdb.project" as cmdb
    LEFT JOIN (
      select project, round(MAX(name1) / 1000000000, 3) as name1
      from (
        SELECT __time_nano__ as time,
          element_at(split_to_map(__labels__, '|', '#$#'), 'project') as project,
          sum(CASE WHEN __name__ = 'logstore_origin_inflow_bytes' THEN __value__ ELSE NULL END) AS name1
        FROM "internal-monitor-metric.prom"
        where __name__ = 'logstore_origin_inflow_bytes'
          and regexp_like(element_at(split_to_map(__labels__, '|', '#$#'), 'project'), '.*')
        group by project, time
      )
      group by project
    ) AS M ON cmdb.id = M.project
)
order by inflow_ratio desc
limit 1000
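2. Alert configuration
Configure the trigger conditions and the alert policy according to your business scenario:
- If the write traffic of any Project exceeds 90% of its quota, the alert severity is Critical.
- If the write traffic of any Project exceeds 80% of its quota, the alert severity is Medium.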
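The write-traffic and write-count queries obtain the Project name from the `__labels__` field with `split_to_map(__labels__, '|', '#$#')`, which splits the string into entries on `|` and each entry into key and value on `#$#`. As a rough illustration of that parsing step, here is a Python equivalent. The sample label string is hypothetical, and the actual set of labels emitted by SLS may differ.

```python
def split_to_map(labels, entry_delim="|", kv_delim="#$#"):
    """Parse 'k1#$#v1|k2#$#v2' into {'k1': 'v1', 'k2': 'v2'}."""
    result = {}
    for entry in labels.split(entry_delim):
        key, sep, value = entry.partition(kv_delim)
        if sep:  # skip fragments that carry no key/value delimiter
            result[key] = value
    return result

# Hypothetical label string for one metric point.
labels = "project#$#demo-project|logstore#$#app-log"
print(split_to_map(labels)["project"])  # demo-project
```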
Quota exceeded monitoring
1. Confirm the alert SQL: every 15 minutes, check whether the Project write traffic quota has been exceeded.
Query SQL:
* and (ErrorCode: ExceedQuota or ErrorCode: QuotaExceed or ErrorCode: ProjectQuotaExceed or ErrorCode: WriteQuotaExceed) | SELECT Project, COUNT(1) AS count where ErrorMsg like '%Project write quota exceed: inflow%' GROUP BY Project ORDER BY count DESC LIMIT 1000
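2. Alert configuration
Configure the trigger conditions and the alert policy according to your business scenario:
- If the write traffic quota of any Project is exceeded 10 times, the alert severity is Critical.
- If the write traffic quota of any Project is exceeded once, the alert severity is Medium.
Project write count monitoring
Watermark monitoring
1. Confirm the alert SQL: every minute, check whether the watermark of the Project write count over the last 5 minutes (relative) has reached the alert threshold.
Query SQL: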
(*) | SELECT Project, region, round(count_write_cnt / cast(quota_write_cnt as double) * 100, 3) as write_cnt_ratio
FROM (
  SELECT cmdb.id as Project,
    cmdb.region as region,
    COALESCE(M.name1, 0) as count_write_cnt,
    cast(json_extract(cmdb.quota, '$.write_cnt_per_min') as bigint) as quota_write_cnt
  from "resource.sls.cmdb.project" as cmdb
    LEFT JOIN (
      select project, MAX(name1) as name1
      from (
        SELECT __time_nano__ as time,
          element_at(split_to_map(__labels__, '|', '#$#'), 'project') as project,
          sum(CASE WHEN __name__ = 'logstore_write_count' THEN __value__ ELSE NULL END) AS name1
        FROM "internal-monitor-metric.prom"
        where __name__ = 'logstore_write_count'
          and regexp_like(element_at(split_to_map(__labels__, '|', '#$#'), 'project'), '.*')
        group by project, time
      )
      group by project
    ) AS M ON cmdb.id = M.project
)
order by write_cnt_ratio desc
limit 1000
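2. Alert configuration
Configure the trigger conditions and the alert policy according to your business scenario:
- If the write count of any Project exceeds 90% of its quota, the alert severity is Critical.
- If the write count of any Project exceeds 80% of its quota, the alert severity is Medium.
Quota exceeded monitoring
1. Confirm the alert SQL: every 15 minutes, check whether the Project write count quota has been exceeded.
Query SQL: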
* and (ErrorCode: ExceedQuota or ErrorCode: QuotaExceed or ErrorCode: ProjectQuotaExceed or ErrorCode: WriteQuotaExceed) | SELECT Project, COUNT(1) AS count where ErrorMsg like '%Project write quota exceed: qps%' GROUP BY Project ORDER BY count DESC LIMIT 1000
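2. Alert configuration
Configure the trigger conditions and the alert policy according to your business scenario:
- If the write count quota of any Project is exceeded 10 times, the alert severity is Critical.
- If the write count quota of any Project is exceeded once, the alert severity is Medium.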
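Requesting a resource quota adjustment
Procedure
1. Log on to the Simple Log Service console.
2. In the Projects section, click the target Project.
3. Click the icon.
4. Click Manage next to Resource Quotas.
5. In the Resource Quotas panel, adjust the quota of the target resource, and then click Save.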