基于生产环境 SQL Server 业务数据库的特性,需要针对 SQL Server 生产环境几组 AlwaysOn AG 的 Primary Replica 部署如下监控:
1. Windows集群节点不正常
2. 任一业务数据库没加入可用性组
3. AG中任一数据库状态异常
4. 任一数据库延时超过阈值
1. primary replica 检查Windows集群状态不正常的节点数(>0 告警)
123SELECT
COUNT
(*)
as
NotHealtyCount
FROM
sys.dm_hadr_cluster_members
WHERE
member_state_desc !=
'UP'
;
2. primary replica 检查副本可用数据库个数 与 需要加入AG的业务数据库数 对比 (<后者告警)
12SELECT
COUNT
(*)
as
DBCount
FROM
sys.availability_databases_cluster;
3. primary replica 检查任一数据库状态异常(>0 告警)
123SELECT
COUNT
(*)
as
NotHealtyCount
FROM
[master].[sys].[dm_hadr_database_replica_states]
WHERE
[synchronization_health_desc] != N
'HEALTHY'
4. primary replica 检查任一数据库延时(>900 告警)
12345678910111213141516171819202122232425262728293031;
WITH
AG_Stats
AS
(
SELECT
AR.replica_server_name,
HARS.role_desc,
Db_name(DRS.database_id) [DBName],
DRS.last_commit_time
FROM
sys.dm_hadr_database_replica_states DRS
INNER
JOIN
sys.availability_replicas AR
ON
DRS.replica_id = AR.replica_id
INNER
JOIN
sys.dm_hadr_availability_replica_states HARS
ON
AR.group_id = HARS.group_id
AND
AR.replica_id = HARS.replica_id
),
Pri_CommitTime
AS
(
SELECTreplica_server_name
, DBName
, last_commit_time
FROMAG_Stats
WHERErole_desc =
'PRIMARY'
),
Sec_CommitTime
AS
(
SELECTreplica_server_name
, DBName
, last_commit_time
FROMAG_Stats
WHERErole_desc =
'SECONDARY'
)
SELECT
ISNULL
(DATEDIFF(ss,s.last_commit_time,p.last_commit_time),0)
AS
[Sync_Lag_Secs]
FROM
Pri_CommitTime p
LEFT
JOIN
Sec_CommitTime s
ON
[s].[DBName] = [p].[DBName]
Zabbix参数配置示例:
UserParameter=MSSQL.SyncHealth,"C:\Program Files\Microsoft SQL Server\Client SDK\ODBC\110\Tools\Binn\SQLCMD.EXE" -d Master -U zabbix -P zabbix -h -1 -W -Q "SELECT COUNT (*) as NotHealtyCount
FROM [master].[sys].[dm_hadr_database_replica_states]
WHERE [synchronization_health_desc] != N''HEALTHY''
"
建议,将1、3、4项配置为模板,2项需要单独配置4个参数。
本文转自UltraSQL51CTO博客,原文链接:http://blog.51cto.com/ultrasql/2050161 ,如需转载请自行联系原作者