The Importance of Manually Gathering Statistics After Loading Large Volumes of Data

1. Create the test table TBL_STAT and its index, without inserting any rows

SQL> create table TBL_STAT as select * from dba_objects where 1<>1;
Table created.

SQL> create index idx_tbl_stat on tbl_stat (object_id);
Index created.

SQL> select count(*) from tbl_stat;
  COUNT(*)
----------
         0

2. Check the execution plan for a query on TBL_STAT
SQL> explain plan for select object_name from tbl_stat where object_id = 1;
Explained.

SQL> select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 2448091186
------------------------------------------------------------------------------
| Id  | Operation         | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |     1 |    79 |     2   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| TBL_STAT |     1 |    79 |     2   (0)| 00:00:01 |
------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("OBJECT_ID"=1)
Note
-----
   - dynamic sampling used for this statement
17 rows selected.
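Even though the predicate is on the indexed column, the query uses a full table scan. The table has no statistics yet, so the optimizer falls back on dynamic sampling.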
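
One way to confirm that the table has no statistics at this point is to look at the data dictionary. The query below is a sketch that assumes you are connected as the owner of TBL_STAT; NUM_ROWS and LAST_ANALYZED in USER_TABLES are NULL until statistics have been gathered.

```sql
-- Assumes the current schema owns TBL_STAT.
-- NUM_ROWS and LAST_ANALYZED stay NULL until statistics are gathered.
select table_name, num_rows, blocks, last_analyzed
  from user_tables
 where table_name = 'TBL_STAT';
```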

3. Manually gather statistics for TBL_STAT
SQL> exec dbms_stats.gather_table_stats(ownname=>'DCSOPEN', tabname=>'TBL_STAT',  estimate_percent=>100);
PL/SQL procedure successfully completed.

4. Query TBL_STAT again
SQL> explain plan for select object_name from tbl_stat where object_id = 1;
Explained.

SQL> select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 3529113932
--------------------------------------------------------------------------------------------
| Id  | Operation                   | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |              |     1 |    79 |     1   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| TBL_STAT     |     1 |    79 |     1   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | IDX_TBL_STAT |     1 |       |     1   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("OBJECT_ID"=1)
14 rows selected.
This time the plan uses an index range scan, which shows that gathering statistics lets Oracle choose the appropriate access path.

5. Insert about 1 million test rows
SQL> begin
  2    for i in 1 .. 10 loop
  3      insert into tbl_stat select * from dba_objects;
  4      commit;
  5    end loop;
  6  end;
  7  /
PL/SQL procedure successfully completed.

SQL> select count(*) from tbl_stat;
  COUNT(*)
----------
   1190725

6. Check the execution plan for a query on TBL_STAT
SQL> explain plan for select object_name from tbl_stat where object_id = 1;
Explained.

SQL> select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 3529113932
--------------------------------------------------------------------------------------------
| Id  | Operation                   | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |              |     1 |    79 |     1   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| TBL_STAT     |     1 |    79 |     1   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | IDX_TBL_STAT |     1 |       |     1   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("OBJECT_ID"=1)
14 rows selected.
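After the insert the plan is still an index range scan, and the estimated row count is still 1, because the statistics have not been refreshed since the table was empty.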
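
The stale statistics behind this plan can be seen in the dictionary. The sketch below assumes you are connected as the table owner, that table monitoring is active (the default from 10g when STATISTICS_LEVEL is TYPICAL or ALL), that your version exposes the STALE_STATS column, and that you are allowed to execute DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO (this may need the ANALYZE ANY privilege).

```sql
-- Push the in-memory DML counters to the dictionary so the views are current.
exec dbms_stats.flush_database_monitoring_info;

-- NUM_ROWS is the value recorded at the last gather (0 here), not the live count.
select table_name, num_rows, last_analyzed, stale_stats
  from user_tab_statistics
 where table_name = 'TBL_STAT';

-- Rows inserted, updated and deleted since statistics were last gathered.
select table_name, inserts, updates, deletes
  from user_tab_modifications
 where table_name = 'TBL_STAT';
```

If NUM_ROWS is far from the result of select count(*), the optimizer is costing plans against a table that no longer exists in that shape.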

7. Create a second test table, TBL_STAT_2, and its index
SQL> create table tbl_stat_2 as select * from tbl_stat;
Table created.

SQL> create index idx_tbl_stat_2 on tbl_stat_2 (object_id);
Index created.

SQL> select count(*) from tbl_stat_2;
  COUNT(*)
----------
   1190725

8. Check the execution plan for a join between TBL_STAT and TBL_STAT_2
SQL> explain plan for select a.object_name, b.object_name from tbl_stat a, tbl_stat_2 b where a.object_Id = b.object_id;
Explained.

SQL> select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 752230886
----------------------------------------------------------------------------------------------
| Id  | Operation                   | Name           | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                |     1 |   158 |    27   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| TBL_STAT_2     |    25 |  1975 |    25   (0)| 00:00:01 |
|   2 |   NESTED LOOPS              |                |     1 |   158 |    27   (0)| 00:00:01 |
|   3 |    TABLE ACCESS FULL        | TBL_STAT       |     1 |    79 |     2   (0)| 00:00:01 |
|*  4 |    INDEX RANGE SCAN         | IDX_TBL_STAT_2 |    25 |       |     2   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("A"."OBJECT_ID"="B"."OBJECT_ID")
Note
-----
   - dynamic sampling used for this statement
20 rows selected.
Here TBL_STAT is read with a full table scan and TBL_STAT_2 through an index range scan, and the two tables are joined with nested loops.

SQL> explain plan for select a.object_name, b.object_name from tbl_stat_2 a, tbl_stat b where a.object_Id = b.object_id;
Explained.

SQL> select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 752230886
----------------------------------------------------------------------------------------------
| Id  | Operation                   | Name           | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                |     1 |   158 |    27   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| TBL_STAT_2     |    25 |  1975 |    25   (0)| 00:00:01 |
|   2 |   NESTED LOOPS              |                |     1 |   158 |    27   (0)| 00:00:01 |
|   3 |    TABLE ACCESS FULL        | TBL_STAT       |     1 |    79 |     2   (0)| 00:00:01 |
|*  4 |    INDEX RANGE SCAN         | IDX_TBL_STAT_2 |    25 |       |     2   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("A"."OBJECT_ID"="B"."OBJECT_ID")
Note
-----
   - dynamic sampling used for this statement
20 rows selected.
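Even with the two tables swapped in the FROM clause, the plan still uses a full table scan on TBL_STAT and an index range scan on TBL_STAT_2. Because no statistics have been gathered since the insert, the estimated row counts for both tables are far from the actual values.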
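
To see how far the estimates are from reality, estimated and actual row counts can be compared side by side. This is a sketch, assuming Oracle 10g Release 2 or later, SERVEROUTPUT switched off in SQL*Plus, and read access to the V$ views that DBMS_XPLAN.DISPLAY_CURSOR depends on. It uses a single-table count rather than the join so that the test query itself stays cheap.

```sql
-- Run the statement with row-source statistics collection enabled.
select /*+ gather_plan_statistics */ count(*)
  from tbl_stat
 where object_id > 0;

-- Show the plan of the last statement run in this session,
-- with estimated (E-Rows) and actual (A-Rows) row counts per step.
select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));
```

A large gap between E-Rows and A-Rows on the same plan line is a sign that the statistics no longer describe the data.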

9. Manually gather statistics for TBL_STAT
SQL> exec dbms_stats.gather_table_stats(ownname=>'DCSOPEN', tabname=>'TBL_STAT',  estimate_percent=>100);
PL/SQL procedure successfully completed.

SQL> explain plan for select a.object_name, b.object_name from tbl_stat_2 a, tbl_stat b where a.object_Id = b.object_id;
Explained.

SQL> select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 1789047457
-----------------------------------------------------------------------------------------
| Id  | Operation          | Name       | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |            |    29M|  3038M|       | 15552   (2)| 00:03:07 |
|*  1 |  HASH JOIN         |            |    29M|  3038M|    47M| 15552   (2)| 00:03:07 |
|   2 |   TABLE ACCESS FULL| TBL_STAT   |  1190K|    34M|       |  3790   (1)| 00:00:46 |
|   3 |   TABLE ACCESS FULL| TBL_STAT_2 |  1299K|    97M|       |  3645   (1)| 00:00:44 |
-----------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("A"."OBJECT_ID"="B"."OBJECT_ID")
Note
-----
   - dynamic sampling used for this statement
19 rows selected.
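The estimates for TBL_STAT and TBL_STAT_2 are now on the order of a million rows each, and the tables are joined with a hash join over two full table scans. TBL_STAT_2 still has no statistics, so its estimate (1299K) comes from dynamic sampling.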

10. Manually gather statistics for TBL_STAT_2
SQL> exec dbms_stats.gather_table_stats(ownname=>'DCSOPEN', tabname=>'TBL_STAT_2',  estimate_percent=>100);
PL/SQL procedure successfully completed.

SQL> explain plan for select a.object_name, b.object_name from tbl_stat_2 a, tbl_stat b where a.object_Id = b.object_id;
Explained.

SQL> select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 2620555949
-----------------------------------------------------------------------------------------
| Id  | Operation          | Name       | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |            |    29M|  1703M|       | 12327   (2)| 00:02:28 |
|*  1 |  HASH JOIN         |            |    29M|  1703M|    47M| 12327   (2)| 00:02:28 |
|   2 |   TABLE ACCESS FULL| TBL_STAT_2 |  1190K|    34M|       |  3644   (1)| 00:00:44 |
|   3 |   TABLE ACCESS FULL| TBL_STAT   |  1190K|    34M|       |  3790   (1)| 00:00:46 |
-----------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("A"."OBJECT_ID"="B"."OBJECT_ID")
15 rows selected.
The estimated row count for TBL_STAT_2 now also matches the actual count, and the two tables are still joined with a hash join.

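Summary
1. Gathering table statistics is important work. Oracle 10g and later run an automatic statistics-gathering job, but statistics can also be gathered manually.
2. In this example, after about 1 million rows were loaded into TBL_STAT without gathering statistics, the join with TBL_STAT_2 used nested loops. That join method suits a small row source driving lookups into a larger table, but here the two tables are about the same size, each with more than 1 million rows, so a nested loops plan would perform over a million index probes into TBL_STAT_2. Once statistics were gathered, the join switched to a hash join, which shows that Oracle then knew the real data volume and based the plan on it. So after loading a large amount of data into a table, gather statistics manually; otherwise, until the automatic job runs, the optimizer may pick a poor plan.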
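
The recommendation above can be folded into the load script itself. A minimal sketch, assuming the script runs as the table owner; CASCADE => TRUE and DBMS_STATS.AUTO_SAMPLE_SIZE are choices made here for illustration, not the settings used in the test above, which gathered with estimate_percent => 100.

```sql
begin
  -- Bulk load, as in step 5.
  for i in 1 .. 10 loop
    insert into tbl_stat select * from dba_objects;
    commit;
  end loop;

  -- Refresh table and index statistics right after the load,
  -- instead of waiting for the automatic job.
  dbms_stats.gather_table_stats(
    ownname          => user,
    tabname          => 'TBL_STAT',
    estimate_percent => dbms_stats.auto_sample_size,
    cascade          => true);
end;
/
```

Keeping the gather call in the same script as the load means the two cannot drift apart when the job is rescheduled or rerun.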
