Percona Server Audit log导致性能下降

本文涉及的产品
云数据库 RDS MySQL,集群系列 2核4GB
推荐场景:
搭建个人博客
RDS MySQL Serverless 基础系列,0.5-2RCU 50GB
RDS MySQL Serverless 高可用系列,价值2615元额度,1个月
简介: 审计日志的重要性不言而喻,本文分析修改审计日志的代码后导致性能下降的排查过程

Audit log内容如下

{"audit_record":{"name":"Query","record":"138837743_2020-07-12T01:37:44","timestamp":"2020-07-12T02:02:38Z","command_class":"select","connection_id":"266","status":0,"sqltext":"select * from t","examined_row":"3","sent_row":"3","Query_time":"0.003730","Lock_time":"0.000417","user":"root[root] @  [172.16.128.16]","host":"","os_user":"","ip":"172.16.128.16","db":"test"}}

本地时间

# date
2020年 07月 12日 星期日 10:02:48 CST

对比发现,audit log输出的时间是UTC时间,且audit log没有参数控制,此时我们只能查看此部分源码

Audit log时间源码

# cat plugin/audit_log/audit_log.cc
...
static char *make_timestamp(char *buf, size_t buf_len, time_t t) noexcept {
  tm tm;

  memset(&tm, 0, sizeof(tm));
  strftime(buf, buf_len, "%FT%TZ", gmtime_r(&t, &tm));

  return buf;
}

static char *make_record_id(char *buf, size_t buf_len) noexcept {
  tm tm;
  size_t len;

  memset(&tm, 0, sizeof(tm));
  len = snprintf(buf, buf_len, "%llu_", next_record_id());

  strftime(buf + len, buf_len - len, "%FT%T", gmtime_r(&log_file_time, &tm));

  return buf;
}
...

修改后的源码

...
static char *make_timestamp(char *buf, size_t buf_len, time_t t) noexcept {
  tm tm;

  memset(&tm, 0, sizeof(tm));

  /* opt_log_timestamps is log_timestamps value; if 0, use UTC; if 1, use local time */
  if (opt_log_timestamps == 0)
    strftime(buf, buf_len, "%FT%TZ", gmtime_r(&t, &tm));
  else if (opt_log_timestamps == 1) {
    strftime(buf, buf_len, "%FT%TZ", localtime_r(&t, &tm)); 
  }

  return buf;
}

static char *make_record_id(char *buf, size_t buf_len) noexcept {
  tm tm;
  size_t len;

  memset(&tm, 0, sizeof(tm));

  len = snprintf(buf, buf_len, "%llu_", next_record_id());

  /* opt_log_timestamps is log_timestamps value; if 0, use UTC; if 1, use local time */
  if (opt_log_timestamps == 0)
    strftime(buf + len, buf_len - len, "%FT%T", gmtime_r(&log_file_time, &tm));
  else if (opt_log_timestamps == 1) {
    strftime(buf + len, buf_len - len, "%FT%T", localtime_r(&log_file_time, &tm));
  }

  return buf;
}
...

Audit log的输出

{"audit_record":{"name":"Query","record":"95013738_2020-07-12T10:36:53","timestamp":"2020-07-12T10:39:51Z","command_class":"select","connection_id":"135","status":0,"sqltext":"select * from t","examined_row":"3","sent_row":"3","Query_time":"0.003619","Lock_time":"0.000404","user":"root[root] @  [127.0.0.1]","host":"","os_user":"","ip":"127.0.0.1","db":"test"}}

可以看到输出的时间为本地的时间,修改成功。

性能测试
修改源码前:

[ 22s ] thds: 128 tps: 7222.79 qps: 144761.69 (r/w/o: 101429.98/28881.14/14450.57) lat (ms,95%): 28.16 err/s: 0.00 reconn/s: 0.00
[ 23s ] thds: 128 tps: 7130.34 qps: 142764.90 (r/w/o: 99932.83/28571.38/14260.69) lat (ms,95%): 29.19 err/s: 0.00 reconn/s: 0.00
[ 24s ] thds: 128 tps: 7271.79 qps: 145345.76 (r/w/o: 101731.03/29076.15/14538.58) lat (ms,95%): 28.16 err/s: 0.00 reconn/s: 0.00
[ 25s ] thds: 128 tps: 7183.13 qps: 144028.59 (r/w/o: 100835.81/28823.52/14369.26) lat (ms,95%): 29.19 err/s: 0.00 reconn/s: 0.00
[ 26s ] thds: 128 tps: 7169.93 qps: 143019.60 (r/w/o: 100097.02/28580.72/14341.86) lat (ms,95%): 28.16 err/s: 0.00 reconn/s: 0.00
[ 27s ] thds: 128 tps: 7241.19 qps: 145009.80 (r/w/o: 101563.66/28960.76/14485.38) lat (ms,95%): 28.16 err/s: 0.00 reconn/s: 0.00

修改源码后:

[ 13s ] thds: 128 tps: 2315.74 qps: 46590.68 (r/w/o: 32650.27/9305.94/4634.47) lat (ms,95%): 70.55 err/s: 0.00 reconn/s: 0.00
[ 14s ] thds: 128 tps: 2351.17 qps: 46765.46 (r/w/o: 32684.42/9384.69/4696.35) lat (ms,95%): 71.83 err/s: 0.00 reconn/s: 0.00
[ 15s ] thds: 128 tps: 2340.04 qps: 46699.79 (r/w/o: 32695.55/9323.16/4681.08) lat (ms,95%): 71.83 err/s: 0.00 reconn/s: 0.00
[ 16s ] thds: 128 tps: 2322.84 qps: 46816.87 (r/w/o: 32808.81/9360.37/4647.69) lat (ms,95%): 70.55 err/s: 0.00 reconn/s: 0.00
[ 17s ] thds: 128 tps: 2344.95 qps: 46729.09 (r/w/o: 32681.36/9354.82/4692.91) lat (ms,95%): 71.83 err/s: 0.00 reconn/s: 0.00

可以发现,修改代码后性能下降非常严重;show processlist中出现了freeing items状态,官方文档的解释是:已经执行了一个命令,在这种状态下完成的项目的一些释放涉及查询缓存,这个状态通常在后面cleaning up。

mysql> show processlist;
...
| 254 | root | 172.16.128.16:37634 | sbtest | Query   |    0 | freeing items              | SELECT c FROM sbtest7 WHERE id=4571228                                                               |         1 |             1 |
| 255 | root | 172.16.128.16:37716 | sbtest | Query   |    0 | freeing items              | SELECT c FROM sbtest8 WHERE id=5539222                                                               |         1 |             1 |
| 256 | root | 172.16.128.16:37656 | sbtest | Query   |    0 | freeing items              | COMMIT                                                                                               |         0 |             0 |
| 257 | root | 172.16.128.16:37706 | sbtest | Query   |    0 | Sending to client          | SELECT c FROM sbtest8 WHERE id BETWEEN 5043151 AND 5043250                                           |        66 |             0 |
| 258 | root | 172.16.128.16:37698 | sbtest | Query   |    0 | freeing items              | SELECT c FROM sbtest16 WHERE id=5011651                                                              |         1 |             1 |
| 259 | root | 172.16.128.16:37642 | sbtest | Query   |    0 | freeing items              | COMMIT                                                                                               |         0 |             0 |
| 260 | root | 172.16.128.16:37700 | sbtest | Query   |    0 | freeing items              | SELECT c FROM sbtest16 WHERE id=5027535                                                              |         1 |             1 |
| 261 | root | 172.16.128.16:37724 | sbtest | Query   |    0 | freeing items              | SELECT c FROM sbtest4 WHERE id=5041139                                                               |         1 |             1 |
| 262 | root | 172.16.128.16:37702 | sbtest | Query   |    0 | freeing items              | SELECT DISTINCT c FROM sbtest6 WHERE id BETWEEN 5007762 AND 5007861 ORDER BY c                       |       100 |           300 |
| 263 | root | 172.16.128.16:37640 | sbtest | Query   |    0 | freeing items              | SELECT c FROM sbtest13 WHERE id=5035171                                                              |         1 |             1 |
| 264 | root | 172.16.128.16:37636 | sbtest | Query   |    0 | freeing items              | COMMIT                                                                                               |         0 |             0 |
| 265 | root | 127.0.0.1:48524     | NULL   | Query   |    0 | starting                   | show processlist                                                                                     |         0 |             0 |
+-----+------+---------------------+--------+---------+------+----------------------------+------------------------------------------------------------------------------------------------------+-----------+---------------+
129 rows in set (0.00 sec)

但是基于这个state无法确定问题出现的原因,还是需要使用pstack+pt-pmp工具进行分析。

Audit log栈帧分析

# cat pmp.log
     97 pthread_cond_wait,native_cond_wait(thr_cond.h:162),my_cond_wait(thr_cond.h:162),inline_mysql_cond_wait(thr_cond.h:162),Commit_stage_manager::enroll_for(thr_cond.h:162),MYSQL_BIN_LOG::change_stage(binlog.cc:8478),MYSQL_BIN_LOG::ordered_commit(binlog.cc:8758),MYSQL_BIN_LOG::commit(binlog.cc:8165),ha_commit_trans(handler.cc:1756),trans_commit(transaction.cc:241),mysql_execute_command(sql_parse.cc:4060),mysql_parse(sql_parse.cc:5306),dispatch_command(sql_parse.cc:1776),do_command(sql_parse.cc:1274),handle_connection(connection_handler_per_thread.cc:302),pfs_spawn_thread(pfs.cc:2854),start_thread(libpthread.so.0),clone(libc.so.6)
     26 __io_getevents_0_4(libaio.so.1),LinuxAIOHandler::collect(os0file.cc:2322),LinuxAIOHandler::poll(os0file.cc:2463),os_aio_linux_handler(os0file.cc:2512),os_aio_handler(os0file.cc:2512),fil_aio_wait(fil0fil.cc:7711),io_handler_thread(srv0start.cc:282),void(srv0start.cc:282),__invoke<void(srv0start.cc:282),__call<void,(srv0start.cc:282),operator()<>(srv0start.cc:282),operator()<void(srv0start.cc:282),__invoke_impl<void,(srv0start.cc:282),__invoke<Runnable,(srv0start.cc:282),_M_invoke<0,(srv0start.cc:282),operator()(srv0start.cc:282),std::thread::_State_impl<std::thread::_Invoker<std::tuple<Runnable,(srv0start.cc:282),execute_native_thread_routine,start_thread(libpthread.so.0),clone(libc.so.6)
     11 __lll_lock_wait_private(libc.so.6),_L_lock_2546(libc.so.6),__tz_convert(libc.so.6),make_record_id(audit_log.cc:172),audit_log_general_record(audit_log.cc:579),audit_log_notify(audit_log.cc:1176),plugins_dispatch(sql_audit.cc:1331),event_class_dispatch(sql_audit.cc:1331),mysql_audit_notify(sql_audit.cc:405),dispatch_command(sql_error.h:382),do_command(sql_parse.cc:1274),handle_connection(connection_handler_per_thread.cc:302),pfs_spawn_thread(pfs.cc:2854),start_thread(libpthread.so.0),clone(libc.so.6)
      9 __lll_lock_wait_private(libc.so.6),_L_lock_2546(libc.so.6),__tz_convert(libc.so.6),make_timestamp(audit_log.cc:154),audit_log_general_record(audit_log.cc:579),audit_log_notify(audit_log.cc:1176),plugins_dispatch(sql_audit.cc:1331),event_class_dispatch(sql_audit.cc:1331),mysql_audit_notify(sql_audit.cc:405),dispatch_command(sql_error.h:382),do_command(sql_parse.cc:1274),handle_connection(connection_handler_per_thread.cc:302),pfs_spawn_thread(pfs.cc:2854),start_thread(libpthread.so.0),clone(libc.so.6)
      3 pthread_cond_wait,wait(sync0types.h:541),os_event::wait_low(sync0types.h:541),os_event_wait_low(os0event.cc:608),srv_worker_thread(srv0srv.cc:2805),void(srv0srv.cc:2805),__invoke<void(srv0srv.cc:2805),__call<void>(srv0srv.cc:2805),operator()<>(srv0srv.cc:2805),operator()<void(srv0srv.cc:2805),__invoke_impl<void,(srv0srv.cc:2805),__invoke<Runnable,(srv0srv.cc:2805),_M_invoke<0,(srv0srv.cc:2805),operator()(srv0srv.cc:2805),std::thread::_State_impl<std::thread::_Invoker<std::tuple<Runnable,(srv0srv.cc:2805),execute_native_thread_routine,start_thread(libpthread.so.0),clone(libc.so.6)

通过栈帧发现,源码修改make_record_id函数和make_timestamp函数均出现了时区转换,__tz_convert函数是证据,且线程之间存在mutex锁互相等待,导致并发上不去。时区转换的方法是借鉴slow log的方式,也就是源码log.cc文件的使用方法,那么是不是slow log也会有这个问题呢?此时设置long_query_time=0一探究竟。

slow log为0性能

...
[ 907s ] thds: 128 tps: 984.01 qps: 19624.25 (r/w/o: 13726.18/3938.05/1960.03) lat (ms,95%): 173.58 err/s: 0.00 reconn/s: 0.00
[ 908s ] thds: 128 tps: 974.95 qps: 19510.00 (r/w/o: 13667.30/3891.80/1950.90) lat (ms,95%): 179.94 err/s: 0.00 reconn/s: 0.00
[ 909s ] thds: 128 tps: 971.06 qps: 19444.19 (r/w/o: 13616.83/3880.24/1947.12) lat (ms,95%): 173.58 err/s: 0.00 reconn/s: 0.00
[ 910s ] thds: 128 tps: 972.99 qps: 19457.73 (r/w/o: 13614.81/3895.95/1946.97) lat (ms,95%): 173.58 err/s: 0.00 reconn/s: 0.00
[ 911s ] thds: 128 tps: 976.94 qps: 19536.70 (r/w/o: 13687.09/3901.74/1947.87) lat (ms,95%): 173.58 err/s: 0.00 reconn/s: 0.00
[ 912s ] thds: 128 tps: 971.05 qps: 19487.00 (r/w/o: 13641.70/3898.20/1947.10) lat (ms,95%): 176.73 err/s: 0.00 reconn/s: 0.00
[ 913s ] thds: 128 tps: 977.00 qps: 19422.95 (r/w/o: 13574.96/3894.99/1952.99) lat (ms,95%): 176.73 err/s: 0.00 reconn/s: 0.00
...

可以发现设置long_query_time为0时,性能损耗极大,使用pstack+pt-pmp查看

slow log栈帧分析

# cat pmp1.log
    100 __lll_lock_wait(libpthread.so.0),_L_lock_883(libpthread.so.0),pthread_mutex_lock(libpthread.so.0),native_mutex_lock(thr_mutex.h:189),my_mutex_lock(thr_mutex.h:189),inline_mysql_mutex_lock(thr_mutex.h:189),File_query_log::write_slow(thr_mutex.h:189),Log_to_file_event_handler::log_slow(log.cc:1250),Query_logger::slow_log_write(log.cc:1340),dispatch_command(sql_parse.cc:2181),do_command(sql_parse.cc:1274),handle_connection(connection_handler_per_thread.cc:302),pfs_spawn_thread(pfs.cc:2854),start_thread(libpthread.so.0),clone(libc.so.6)
     26 __io_getevents_0_4(libaio.so.1),LinuxAIOHandler::collect(os0file.cc:2322),LinuxAIOHandler::poll(os0file.cc:2463),os_aio_linux_handler(os0file.cc:2512),os_aio_handler(os0file.cc:2512),fil_aio_wait(fil0fil.cc:7711),io_handler_thread(srv0start.cc:282),void(srv0start.cc:282),__invoke<void(srv0start.cc:282),__call<void,(srv0start.cc:282),operator()<>(srv0start.cc:282),operator()<void(srv0start.cc:282),__invoke_impl<void,(srv0start.cc:282),__invoke<Runnable,(srv0start.cc:282),_M_invoke<0,(srv0start.cc:282),operator()(srv0start.cc:282),std::thread::_State_impl<std::thread::_Invoker<std::tuple<Runnable,(srv0start.cc:282),execute_native_thread_routine,start_thread(libpthread.so.0),clone(libc.so.6)

通过栈帧没有发现时区转换的相关信息,但是写slow log时mutex锁竞争非常激烈,且show processlist中输出的logging slow query状态,官方文档解释:线程正在向慢查询日志写入语句。

mysql> show processlist;
...
| 257 | root | 172.16.128.16:38826 | sbtest | Query   |    0 | logging slow query         | SELECT c FROM sbtest14 WHERE id=4989397                                                              |         1 |             1 |
| 258 | root | 172.16.128.16:38828 | sbtest | Query   |    0 | logging slow query         | SELECT SUM(k) FROM sbtest8 WHERE id BETWEEN 5610316 AND 5610415                                      |         1 |           100 |
| 259 | root | 172.16.128.16:38830 | sbtest | Query   |    0 | logging slow query         | SELECT SUM(k) FROM sbtest10 WHERE id BETWEEN 5046370 AND 5046469                                     |         1 |           100 |
| 260 | root | 172.16.128.16:38832 | sbtest | Query   |    0 | logging slow query         | BEGIN                                                                                                |         0 |             0 |
| 261 | root | 172.16.128.16:38834 | sbtest | Query   |    0 | logging slow query         | SELECT c FROM sbtest2 WHERE id=5041977                                                               |         1 |             1 |
| 262 | root | 172.16.128.16:38836 | sbtest | Query   |    0 | logging slow query         | SELECT c FROM sbtest16 WHERE id=5026767                                                              |         1 |             1 |
| 263 | root | 172.16.128.16:38838 | sbtest | Query   |    0 | logging slow query         | SELECT c FROM sbtest9 WHERE id=4988059                                                               |         1 |             1 |
| 264 | root | 127.0.0.1:48528     | NULL   | Query   |    0 | starting                   | show processlist                                                                                     |         0 |             0 |
+-----+------+---------------------+--------+---------+------+----------------------------+------------------------------------------------------------------------------------------------------+-----------+---------------+
129 rows in set (0.00 sec)
相关实践学习
日志服务之使用Nginx模式采集日志
本文介绍如何通过日志服务控制台创建Nginx模式的Logtail配置快速采集Nginx日志并进行多维度分析。
目录
相关文章
|
6月前
|
Linux 应用服务中间件 PHP
性能工具之linux常见日志统计分析命令
通过本文的介绍,我相信同学们一定会发现 linux三剑客强大之处。在命令行中,它还能够接受,和执行外部的 AWK 程序文件,可以对文本信息进行非常复杂的处理,可以说“只有想不到的,没有它做不到的。
191 1
|
1月前
|
XML JSON Java
Logback 与 log4j2 性能对比:谁才是日志框架的性能王者?
【10月更文挑战第5天】在Java开发中,日志框架是不可或缺的工具,它们帮助我们记录系统运行时的信息、警告和错误,对于开发人员来说至关重要。在众多日志框架中,Logback和log4j2以其卓越的性能和丰富的功能脱颖而出,成为开发者们的首选。本文将深入探讨Logback与log4j2在性能方面的对比,通过详细的分析和实例,帮助大家理解两者之间的性能差异,以便在实际项目中做出更明智的选择。
232 3
|
2月前
|
缓存 监控 算法
分析慢日志文件来优化 PHP 脚本的性能
分析慢日志文件来优化 PHP 脚本的性能
|
4月前
|
SQL JSON 数据处理
5% 消耗,6 倍性能:揭秘新一代 iLogtail SPL 日志处理引擎与 Logstash 的 PK
在本文中,我们将深入探讨为何选择 iLogtail,以及它在 SPL 数据处理方面相较于 Logstash 有何独特优势。通过对比这两款工具的架构、性能以及功能,我们希望能够揭示 iLogtail 如何在日益复杂的日志处理需求中脱颖而出,帮助您做出明智的技术选择。
40421 22
|
4月前
|
存储 Linux
linux /www/server/cron内log文件占用空间过大,/www/server/cron是什么内容,/www/server/cron是否可以删除
linux /www/server/cron内log文件占用空间过大,/www/server/cron是什么内容,/www/server/cron是否可以删除
72 1
|
5月前
|
监控 网络安全
zookeeper的日志报will be dropped if server is in r-o mode如何解决
【6月更文挑战第26天】zookeeper的日志报will be dropped if server is in r-o mode如何解决
199 2
|
4月前
|
运维 中间件 数据库
浅析JAVA日志中的性能实践与原理解释问题之元信息打印会导致性能急剧下降问题如何解决
浅析JAVA日志中的性能实践与原理解释问题之元信息打印会导致性能急剧下降问题如何解决
|
4月前
|
开发框架 缓存 Java
浅析JAVA日志中的性能实践与原理解释问题之"Garbage Free"技术的实现方式问题如何解决
浅析JAVA日志中的性能实践与原理解释问题之"Garbage Free"技术的实现方式问题如何解决
|
4月前
|
缓存 自然语言处理 Java
浅析JAVA日志中的性能实践与原理解释问题之减少看得见的业务开销问题如何解决
浅析JAVA日志中的性能实践与原理解释问题之减少看得见的业务开销问题如何解决
|
4月前
|
缓存 Java
浅析JAVA日志中的性能实践与原理解释问题之AsyncAppender的配置方式的问题是如何解决的
浅析JAVA日志中的性能实践与原理解释问题之AsyncAppender的配置方式的问题是如何解决的