Using MySQL 5.7 Generated (Virtual) Columns to Improve Query Performance


In this blog post, we'll look at ways you can use MySQL 5.7 generated columns (or virtual columns) to improve query performance.

Background

About two years ago I published a blog post about Generated (Virtual) Columns in MySQL 5.7. Since then, it has been one of my favorite features in the MySQL 5.7 release. The reason is simple: with the help of virtual columns, we can create fine-grained indexes that can significantly improve query performance. I'm going to show you some tricks that can potentially fix slow reporting queries with GROUP BY and ORDER BY.

The Problem

Recently I was working with a customer who was struggling with this query:

SELECT 
CONCAT(verb, ' - ', replace(url,'.xml','')) AS 'API Call', 
COUNT(*) as 'No. of API Calls', 
AVG(ExecutionTime) as 'Avg. Execution Time', 
COUNT(distinct AccountId) as 'No. Of Accounts', 
COUNT(distinct ParentAccountId) as 'No. Of Parents' 
FROM ApiLog 
WHERE ts between '2017-10-01 00:00:00' and '2017-12-31 23:59:59' 
GROUP BY CONCAT(verb, ' - ', replace(url,'.xml','')) 
HAVING COUNT(*) >= 1 ;

The query was running for more than an hour and used up all the space in the tmp directory (with sort files).

The table looked like this:


CREATE TABLE `ApiLog` (
`Id` int(11) NOT NULL AUTO_INCREMENT,
`ts` timestamp DEFAULT CURRENT_TIMESTAMP,
`ServerName` varchar(50)  NOT NULL default '',
`ServerIP` varchar(50)  NOT NULL default '',
`ClientIP` varchar(50)  NOT NULL default '',
`ExecutionTime` int(11) NOT NULL default 0,
`URL` varchar(3000) COLLATE utf8mb4_unicode_ci NOT NULL,
`Verb` varchar(16)  NOT NULL,
`AccountId` int(11) NOT NULL,
`ParentAccountId` int(11) NOT NULL,
`QueryString` varchar(3000) NOT NULL,
`Request` text NOT NULL,
`RequestHeaders` varchar(2000) NOT NULL,
`Response` text NOT NULL,
`ResponseHeaders` varchar(2000) NOT NULL,
`ResponseCode` varchar(4000) NOT NULL,
... // other fields removed for simplicity
PRIMARY KEY (`Id`),
KEY `index_timestamp` (`ts`),
... // other indexes removed for simplicity
) ENGINE=InnoDB;

We found out the query was not using an index on the timestamp field ("ts"):

mysql> explain SELECT CONCAT(verb, ' - ', replace(url,'.xml','')) AS 'API Call', COUNT(*)  as 'No. of API Calls',  avg(ExecutionTime) as 'Avg. Execution Time', count(distinct AccountId) as 'No. Of Accounts',  count(distinct ParentAccountId) as 'No. Of Parents'  FROM ApiLog  WHERE ts between '2017-10-01 00:00:00' and '2017-12-31 23:59:59'  GROUP BY CONCAT(verb, ' - ', replace(url,'.xml',''))  HAVING COUNT(*)  >= 1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ApiLog
   partitions: NULL
         type: ALL
possible_keys: ts
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 22255292
     filtered: 50.00
        Extra: Using where; Using filesort
1 row in set, 1 warning (0.00 sec)

The reason for that is simple: the number of rows matching the filter condition was too large for an index scan to be efficient (or at least the optimizer thinks that):

mysql> select count(*) from ApiLog WHERE ts between '2017-10-01 00:00:00' and '2017-12-31 23:59:59' ;
+----------+
| count(*) |
+----------+
|  7948800 |
+----------+
1 row in set (2.68 sec)

Total number of rows: 21998514. The query needs to scan 36% of the total rows (7948800 / 21998514). (Translator's note: when the estimated number of rows to scan exceeds roughly 20% to 30% of the table, the optimizer usually switches to a full table scan even though an index exists.)

In this case, we have a number of approaches (sketched below):

Create a combined index on the timestamp column + GROUP BY fields

Create a covered index (including all fields that are selected)

Create an index on just the GROUP BY fields

Create an index for a loose index scan
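
For illustration, those candidate indexes would look roughly like the sketches below. They are hypothetical (the url(255) prefix length is my assumption, chosen to stay under the index key-length limit); as explained next, none of them actually works for this query:

alter table ApiLog add key ts_verb_url (ts, Verb, URL(255));       -- timestamp + GROUP BY fields
alter table ApiLog add key covered_try (Verb, URL(255), ts, ExecutionTime, AccountId, ParentAccountId);  -- covered
alter table ApiLog add key verb_url_try (Verb, URL(255));          -- GROUP BY fields only / loose index scan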

However, if we look closer at the GROUP BY part of the query, we quickly realize that none of those solutions will work. Here is our GROUP BY part:

GROUP BY CONCAT(verb, ' - ', replace(url,'.xml',''))

There are two problems here:

The query groups on a calculated expression, so MySQL can't simply scan an index on verb + url. It first needs to concatenate the two fields and then group on the resulting string. That means the index won't be used.

The URL is declared as "varchar(3000) COLLATE utf8mb4_unicode_ci NOT NULL" and can't be indexed in full, even with the innodb_large_prefix=1 option (the default in MySQL 5.7), because at four bytes per utf8mb4 character the full column exceeds the 3072-byte key limit. We can only build a prefix (partial) index, which won't help the GROUP BY optimization.

Here, I'm trying to add a full index on the URL with innodb_large_prefix=1:


mysql> alter table ApiLog add key verb_url(verb, url);
ERROR 1071 (42000): Specified key was too long; max key length is 3072 bytes
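
A prefix (partial) index, by contrast, is accepted. The statement below is only a sketch to make the point (the 500-character prefix is an arbitrary value that fits within the 3072-byte limit, since 500 utf8mb4 characters take at most 2000 bytes); as discussed above, it still can't drive this GROUP BY:

alter table ApiLog add key verb_url_prefix (verb, url(500));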

Well, changing "GROUP BY CONCAT(verb, ' - ', replace(url,'.xml',''))" to "GROUP BY verb, url" could help (assuming that we somehow trim the field definition from varchar(3000) to something smaller, which may or may not be possible). However, it would change the results, as it would no longer strip the .xml extension from the URL field.

The Solution

The good news is that in MySQL 5.7 we have generated (virtual) columns. So we can create a virtual column on top of "CONCAT(verb, ' - ', replace(url,'.xml',''))". The best part: we do not have to GROUP BY the full string (potentially > 3000 bytes). We can use an MD5 hash (or a longer hash, e.g., SHA1/SHA2) for the purposes of the GROUP BY.

Here is the solution:


alter table ApiLog add verb_url_hash varbinary(16) GENERATED ALWAYS AS (unhex(md5(CONCAT(verb, ' - ', replace(url,'.xml',''))))) VIRTUAL;
alter table ApiLog add key (verb_url_hash);
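
To sanity-check the new column, we can compare it against the original expression. This is just an illustrative query (the aliases and LIMIT are mine); the two hex values must always match, because the virtual column is computed from exactly the same expression:

select HEX(verb_url_hash) AS stored_hash,
       HEX(unhex(md5(CONCAT(verb, ' - ', replace(url,'.xml',''))))) AS computed_hash
from ApiLog limit 3;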

So what we did here is:

Declared a virtual column of type varbinary(16)

Defined it as the MD5 hash of CONCAT(verb, ' - ', replace(url,'.xml','')), wrapped in UNHEX() to convert the 32 hexadecimal characters into 16 binary bytes

Created an index on top of the virtual column
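
To see why varbinary(16) is the right size: MD5() returns 32 hexadecimal characters, and UNHEX() packs them into 16 raw bytes. A quick demonstration (the string literal is arbitrary):

mysql> select LENGTH(MD5('GET - /users')), LENGTH(UNHEX(MD5('GET - /users')));
+-----------------------------+------------------------------------+
| LENGTH(MD5('GET - /users')) | LENGTH(UNHEX(MD5('GET - /users'))) |
+-----------------------------+------------------------------------+
|                          32 |                                 16 |
+-----------------------------+------------------------------------+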

Now we can change the query to GROUP BY the verb_url_hash column:

mysql> explain SELECT CONCAT(verb, ' - ', replace(url,'.xml',''))
AS 'API Call', COUNT(*)  as 'No. of API Calls',
avg(ExecutionTime) as 'Avg. Execution Time',
count(distinct AccountId) as 'No. Of Accounts',
count(distinct ParentAccountId) as 'No. Of Parents'
FROM ApiLog
WHERE ts between '2017-10-01 00:00:00' and '2017-12-31 23:59:59'
GROUP BY verb_url_hash
HAVING COUNT(*)  >= 1;
ERROR 1055 (42000): Expression #1 of SELECT list is not in
GROUP BY clause and contains nonaggregated column 'ApiLog.ApiLog.Verb'
which is not functionally dependent on columns in GROUP BY clause;
this is incompatible with sql_mode=only_full_group_by

MySQL 5.7 has strict SQL mode (including ONLY_FULL_GROUP_BY) enabled by default; we can relax it for this query only.

Now the explain plan looks much better:


mysql> select @@sql_mode;
+-------------------------------------------------------------------------------------------------------------------------------------------+
| @@sql_mode                                                                                                                                  |
+-------------------------------------------------------------------------------------------------------------------------------------------+
| ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION |
+-------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
mysql> set sql_mode='STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION';
Query OK, 0 rows affected (0.00 sec)
mysql> explain SELECT CONCAT(verb, ' - ', replace(url,'.xml','')) AS 'API Call', COUNT(*)  as 'No. of API Calls',  avg(ExecutionTime) as 'Avg. Execution Time', count(distinct AccountId) as 'No. Of Accounts',  count(distinct ParentAccountId) as 'No. Of Parents'  FROM ApiLog  WHERE ts between '2017-10-01 00:00:00' and '2017-12-31 23:59:59'  GROUP BY verb_url_hash HAVING COUNT(*)  >= 1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ApiLog
   partitions: NULL
         type: index
possible_keys: ts,verb_url_hash
          key: verb_url_hash
      key_len: 19
          ref: NULL
         rows: 22008891
     filtered: 50.00
        Extra: Using where
1 row in set, 1 warning (0.00 sec)

MySQL can now avoid the filesort, which is much faster. It will still eventually have to scan the whole table in the order of the index, but the response time is significantly better: ~38 seconds as opposed to more than an hour.
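
Alternatively, if relaxing sql_mode is undesirable, MySQL 5.7's ANY_VALUE() function can suppress the ONLY_FULL_GROUP_BY check for a single expression. A sketch (safe here because, barring MD5 collisions, every row within a verb_url_hash group has the same verb and url):

SELECT ANY_VALUE(CONCAT(verb, ' - ', replace(url,'.xml',''))) AS 'API Call',
       COUNT(*) as 'No. of API Calls',
       AVG(ExecutionTime) as 'Avg. Execution Time',
       COUNT(distinct AccountId) as 'No. Of Accounts',
       COUNT(distinct ParentAccountId) as 'No. Of Parents'
FROM ApiLog
WHERE ts between '2017-10-01 00:00:00' and '2017-12-31 23:59:59'
GROUP BY verb_url_hash
HAVING COUNT(*) >= 1;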

Covered Index

Now we can attempt to create a covered index, which will be quite large:

mysql> alter table ApiLog add key covered_index (`verb_url_hash`,`ts`,`ExecutionTime`,`AccountId`,`ParentAccountId`, verb, url);
Query OK, 0 rows affected (1 min 29.71 sec)
Records: 0  Duplicates: 0  Warnings: 0

We had to add "verb" and "url" to the index, so beforehand I had to remove the COLLATE utf8mb4_unicode_ci from the table definition (otherwise the key would again exceed the 3072-byte limit). Now explain shows that we're using the index:

mysql> explain SELECT  CONCAT(verb, ' - ', replace(url,'.xml','')) AS 'API Call',  COUNT(*) as 'No. of API Calls',  AVG(ExecutionTime) as 'Avg. Execution Time',  COUNT(distinct AccountId) as 'No. Of Accounts',  COUNT(distinct ParentAccountId) as 'No. Of Parents'  FROM ApiLog  WHERE ts between '2017-10-01 00:00:00' and '2017-12-31 23:59:59'  GROUP BY verb_url_hash  HAVING COUNT(*) >= 1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ApiLog
   partitions: NULL
         type: index
possible_keys: ts,verb_url_hash,covered_index
          key: covered_index
      key_len: 3057
          ref: NULL
         rows: 22382136
     filtered: 50.00
        Extra: Using where; Using index
1 row in set, 1 warning (0.00 sec)

The response time dropped to ~12 seconds! However, the index size is significantly larger compared to just verb_url_hash (16 bytes per record).
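
To quantify that trade-off, InnoDB's persistent statistics can report each index's size in pages. A sketch (it assumes persistent statistics are enabled, which is the default in 5.7; stat_value for stat_name = 'size' is the number of pages):

select index_name,
       round(stat_value * @@innodb_page_size / 1024 / 1024) as size_mb
from mysql.innodb_index_stats
where table_name = 'ApiLog' and stat_name = 'size';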

Conclusion

Generated columns in MySQL 5.7 provide a valuable way to improve query performance. If you have an interesting case, please share it in the comments.

Originally published: 2018-02-25
Author: 星耀队@知数堂
Source: Yunqi Community partner 老叶茶馆; for more information, follow the 老叶茶馆 WeChat official account.
