PostgreSQL 多值列的选择性 - Statistics, Cardinality, Selectivity, Estimate

本文涉及的产品
云数据库 RDS MySQL,集群系列 2核4GB
推荐场景:
搭建个人博客
RDS SQL Server Serverless,2-4RCU 50GB 3个月
推荐场景:
云原生数据库 PolarDB 分布式版,标准版 2核8GB
简介:

标签

PostgreSQL , 多值列 , 选择性评估 , Statistics , Cardinality , Selectivity , Estimate


背景

基于成本的优化器,选择性估算是重要的环节,对于单值列,通过表的行数,数据分布柱状图、高频值、唯一值比例、空值比例等统计信息(pg_class, pg_stats),以及用户输入的条件,可以估算得到输入条件的选择性。

对于多个条件的估算,之前PG给了比较暴力的AND,OR的叠加选择性计算。

PG 10开始,支持多列统计信息定义,从而提高了多列条件的选择性评估精准度。

《PostgreSQL 10 黑科技 - 自定义统计信息》

但是对于多值列,以及一些特殊的类型或操作符条件,过滤性的评估依旧有改进空间。

例如:

JSON类型的操作,范围类型的操作,空间类型的评估等。

例如用于计算“包含”的选择性如下:

postgres=# select oprname,oprleft::regtype,oprright::regtype,oprresult::regtype,oprrest from pg_operator where oprrest::text ~ 'contsel';  
 oprname | oprleft  | oprright | oprresult |   oprrest      
---------+----------+----------+-----------+--------------  
 <@      | polygon  | polygon  | boolean   | contsel  
 @>      | polygon  | polygon  | boolean   | contsel  
 <@      | box      | box      | boolean   | contsel  
 @>      | box      | box      | boolean   | contsel  
 <@      | point    | box      | boolean   | contsel  
 @>      | box      | point    | boolean   | contsel  
 <@      | point    | polygon  | boolean   | contsel  
 @>      | polygon  | point    | boolean   | contsel  
 <@      | point    | circle   | boolean   | contsel  
 @>      | circle   | point    | boolean   | contsel  
 <@      | circle   | circle   | boolean   | contsel  
 @>      | circle   | circle   | boolean   | contsel  
 &&      | anyarray | anyarray | boolean   | arraycontsel  
 @>      | anyarray | anyarray | boolean   | arraycontsel  
 <@      | anyarray | anyarray | boolean   | arraycontsel  
 @       | polygon  | polygon  | boolean   | contsel  
 ~       | polygon  | polygon  | boolean   | contsel  
 @       | box      | box      | boolean   | contsel  
 ~       | box      | box      | boolean   | contsel  
 @       | circle   | circle   | boolean   | contsel  
 ~       | circle   | circle   | boolean   | contsel  
 @>      | tsquery  | tsquery  | boolean   | contsel  
 <@      | tsquery  | tsquery  | boolean   | contsel  
 @@      | text     | text     | boolean   | contsel  
 @@      | text     | tsquery  | boolean   | contsel  
 -|-     | anyrange | anyrange | boolean   | contsel  
 @>      | jsonb    | jsonb    | boolean   | contsel  
 ?       | jsonb    | text     | boolean   | contsel  
 ?|      | jsonb    | text[]   | boolean   | contsel  
 ?&      | jsonb    | text[]   | boolean   | contsel  
 <@      | jsonb    | jsonb    | boolean   | contsel  
(31 rows)  

本文为转载文章,分析了json类型选择时,如何计算选择性。

(目前此类为写死的选择性值,代码中已经给出了原因)。

src/backend/utils/adt/geo_selfuncs.c:contsel(PG_FUNCTION_ARGS)

/*  
 *      Selectivity functions for geometric operators.  These are bogus -- unless  
 *      we know the actual key distribution in the index, we can't make a good  
 *      prediction of the selectivity of these operators.  
 *  
 *      Note: the values used here may look unreasonably small.  Perhaps they  
 *      are.  For now, we want to make sure that the optimizer will make use  
 *      of a geometric index if one is available, so the selectivity had better  
 *      be fairly small.  
 *  
 *      In general, GiST needs to search multiple subtrees in order to guarantee  
 *      that all occurrences of the same key have been found.  Because of this,  
 *      the estimated cost for scanning the index ought to be higher than the  
 *      output selectivity would indicate.  gistcostestimate(), over in selfuncs.c,  
 *      ought to be adjusted accordingly --- but until we can generate somewhat  
 *      realistic numbers here, it hardly matters...  
 */  
  
.........................  
  
/*  
 *      contsel -- How likely is a box to contain (be contained by) a given box?  
 *  
 * This is a tighter constraint than "overlap", so produce a smaller  
 * estimate than areasel does.  
 */  
  
Datum  
contsel(PG_FUNCTION_ARGS)  
{  
        PG_RETURN_FLOAT8(0.001);  
}  

数组,由于数组支持了柱状图,所以“包含”选择性计算时,是基于统计信息的。

src/backend/utils/adt/array_selfuncs.c:arraycontsel(PG_FUNCTION_ARGS)

/*  
 * arraycontsel -- restriction selectivity for array @>, &&, <@ operators  
 */  
Datum  
arraycontsel(PG_FUNCTION_ARGS)  
{  
        PlannerInfo *root = (PlannerInfo *) PG_GETARG_POINTER(0);  
        Oid                     operator = PG_GETARG_OID(1);  
        List       *args = (List *) PG_GETARG_POINTER(2);  
        int                     varRelid = PG_GETARG_INT32(3);  
        VariableStatData vardata;  
        Node       *other;  
        bool            varonleft;  
        Selectivity selec;  
        Oid                     element_typeid;  
  
        /*  
         * If expression is not (variable op something) or (something op  
         * variable), then punt and return a default estimate.  
         */  
        if (!get_restriction_variable(root, args, varRelid,  
                                                                  &vardata, &other, &varonleft))  
                PG_RETURN_FLOAT8(DEFAULT_SEL(operator));  
  
        /*  
         * Can't do anything useful if the something is not a constant, either.  
         */  
        if (!IsA(other, Const))  
        {  
                ReleaseVariableStats(vardata);  
                PG_RETURN_FLOAT8(DEFAULT_SEL(operator));  
        }  
  
        /*  
         * The "&&", "@>" and "<@" operators are strict, so we can cope with a  
         * NULL constant right away.  
         */  
        if (((Const *) other)->constisnull)  
        {  
                ReleaseVariableStats(vardata);  
                PG_RETURN_FLOAT8(0.0);  
        }  
  
        /*  
         * If var is on the right, commute the operator, so that we can assume the  
         * var is on the left in what follows.  
         */  
        if (!varonleft)  
        {  
                if (operator == OID_ARRAY_CONTAINS_OP)  
                        operator = OID_ARRAY_CONTAINED_OP;  
                else if (operator == OID_ARRAY_CONTAINED_OP)  
                        operator = OID_ARRAY_CONTAINS_OP;  
        }  
  
        /*  
         * OK, there's a Var and a Const we're dealing with here.  We need the  
         * Const to be an array with same element type as column, else we can't do  
         * anything useful.  (Such cases will likely fail at runtime, but here  
         * we'd rather just return a default estimate.)  
         */  
        element_typeid = get_base_element_type(((Const *) other)->consttype);  
        if (element_typeid != InvalidOid &&  
                element_typeid == get_base_element_type(vardata.vartype))  
        {  
                selec = calc_arraycontsel(&vardata, ((Const *) other)->constvalue,  
                                                                  element_typeid, operator);  
        }  
        else  
        {  
                selec = DEFAULT_SEL(operator);  
        }  
  
        ReleaseVariableStats(vardata);  
  
        CLAMP_PROBABILITY(selec);  
  
        PG_RETURN_FLOAT8((float8) selec);  
}  

原文

正文

Table of Contents

  • Statistics, cardinality, selectivity
  • Search on JSONB
    • Dataset
    • Operators and indexing for JSONB
    • Selectivity on JSONB
    • Diving in the code
  • Functional indexes
    • Creating the function and the index
    • Search using a function
    • Another example and selectivity calculation
  • Consequences of a bad estimate
  • Last word

Statistics, cardinality, selectivity

SQL is a declarative language. It is a language where the user asks what he wants. Without specifying how the computer should proceed to get the results.

It is the DBMS that must find “how” to perform the operation by ensuring:

  • Return the right result
  • Ideally, as soon as possible

“As soon as possible” means:

  • Minimize disk access
  • Give priority to sequential readings (especially important for mechanical disks)
  • Reduce the number of CPU operations
  • Reduce memory footprint

To do this, a DBMS has an optimizer whose role is to find the best execution plan.

PostgreSQL has an optimizer based on a cost mechanism. Without going into details, each operation has a unit cost (reading a sequential block, CPU processing of a record …). Postgres calculates the cost of several execution plans (if the query is simple) and chooses the least expensive.

How can postgres estimate the cost of a plan? By estimating the cost of each node of the plan based on statistics. PostgreSQL analyzes tables to obtain a statistical sample (this operation is normally performed by the autovacuum daemon).

Some words of vocabulary:

Cardinality: In set theory, it is the number of elements in a set. In databases, it will be the number of rows in a table or after applying a predicate.

Selectivity: Fraction of records returned after applying a predicate. For example, a table containing people and about one third of them are children. The selectivity of the predicate person = 'child' will be 0.33.

If this table contains 300 people (this is the cardinality of the “people” set), we can estimate the number of children because we know that the predicate person = 'child' is 0.33:

300 * 0.33 = 99  

These estimates can be obtained with EXPLAIN which displays the execution plan.

Example (simplified):

explain (analyze, timing off) select * from t1 WHERE c1=1;  
                                  QUERY PLAN  
------------------------------------------------------------------------------  
 Seq Scan on t1  (cost=0.00..5.75 rows=100 ...) (actual rows=100 ...)  
   Filter: (c1 = 1)  
   Rows Removed by Filter: 200  
(cost=0.00..5.75 rows=100 …) : Indicates the estimated cost and the estimated number of records (rows).  

(actual rows=100 …) : Indicates the number of records obtained.

PostgreSQL documentation provides examples of estimation calculations : Row Estimation Examples

It is quite easy to understand how to obtain estimates from scalar data types.

How are things going for particular types? For example JSON?

Search on JSONB

Dataset

As in previous articles, I used the stackoverflow dataset. I created a new table by aggregating data from multiple tables into a JSON object:

CREATE TABLE json_stack AS  
SELECT t.post_id,  
       row_to_json(t,  
                   TRUE)::jsonb json  
FROM  
  (SELECT posts.id post_id,  
          posts.owneruserid,  
          users.id,  
          title,  
          tags,  
          BODY,  
          displayname,  
          websiteurl,  
          LOCATION,  
          aboutme,  
          age  
   FROM posts  
   JOIN users ON posts.owneruserid = users.id) t;  

The processing is quite long because the two tables involved total nearly 40GB.

So I get a 40GB table that looks like this:

 \dt+ json_stack  
                      List of relations  
 Schema |    Name    | Type  |  Owner   | Size  | Description  
--------+------------+-------+----------+-------+-------------  
 public | json_stack | table | postgres | 40 GB |  
(1 row)  
  
  
\d json_stack  
             Table "public.json_stack"  
 Column  |  Type   | Collation | Nullable | Default  
---------+---------+-----------+----------+---------  
 post_id | integer |           |          |  
 json    | jsonb   |           |          |  
  
  
select post_id,jsonb_pretty(json) from json_stack  
    where json_displayname(json) = 'anayrat' limit 1;  
 post_id  |  
----------+-----------------------------------------------------------------------------------------  
 26653490 | {  
          |     "id": 4197886,  
          |     "age": null,  
          |     "body": "

I have an issue with date filter. I follow [...] | "tags": "", | "title": "Logstash date filter failed parsing", | "aboutme": "

Sysadmin, Postgres DBA

\n", | "post_id": 26653490, | "location": "Valence", | "websiteurl": "https://blog.anayrat.info", | "displayname": "anayrat", | "owneruserid": 4197886 | }

Operators and indexing for JSONB

PostgreSQL provides several operators for querying JSONB 1. We will use the operator @>.

It is also possible to index JSONB using GIN indexes:

create index ON json_stack using gin (json );  

Finally, here is an example of query:

explain (analyze,buffers)  select * from json_stack  
    where json @>  '{"displayname":"anayrat"}'::jsonb;  
                                  QUERY PLAN  
---------------------------------------------------------------------------------------  
 Bitmap Heap Scan on json_stack  
                (cost=286.95..33866.98 rows=33283 width=1011)  
                (actual time=0.099..0.102 rows=2 loops=1)  
   Recheck Cond: (json @> '{"displayname": "anayrat"}'::jsonb)  
   Heap Blocks: exact=2  
   Buffers: shared hit=17  
   ->  Bitmap Index Scan on json_stack_json_idx  
                          (cost=0.00..278.62 rows=33283 width=0)  
                          (actual time=0.092..0.092 rows=2 loops=1)  
         Index Cond: (json @> '{"displayname": "anayrat"}'::jsonb)  
         Buffers: shared hit=15  
 Planning time: 0.088 ms  
 Execution time: 0.121 ms  
(9 rows)  

Reading this plan we see that postgres is completely wrong. He estimates getting 33,283 lines, but the query returns only two rows. The error factor is around 15,000!

Selectivity on JSONB

What is the cardinality of the table? The information is contained in the system catalog:

select reltuples from pg_class where relname = 'json_stack';  
  reltuples  
-------------  
 3.32833e+07  

What is the estimated selectivity?

select 33283 / 3.32833e+07;  
        ?column?  
------------------------  
 0.00099999098647069251  

Arround 0.001.

Diving in the code

I had fun taking out the debugger GDB to find out where this number could come from. I ended up arriving in this function:

[...]  
79 /*  
80  *  contsel -- How likely is a box to contain (be contained by) a given box?  
81  *  
82  * This is a tighter constraint than "overlap", so produce a smaller  
83  * estimate than areasel does.  
84  */  
85  
86 Datum  
87 contsel(PG_FUNCTION_ARGS)  
88 {  
89     PG_RETURN_FLOAT8(0.001);  
90 }  
[...]  

The selectivity depends on the type of the operator. Let’s look in the system catalog:

select oprname,typname,oprrest from pg_operator op  
    join pg_type typ ON op.oprleft= typ.oid where oprname = '@>';  
 oprname | typname  |   oprrest  
---------+----------+--------------  
 @>      | polygon  | contsel  
 @>      | box      | contsel  
 @>      | box      | contsel  
 @>      | path     | -  
 @>      | polygon  | contsel  
 @>      | circle   | contsel  
 @>      | _aclitem | -  
 @>      | circle   | contsel  
 @>      | anyarray | arraycontsel  
 @>      | tsquery  | contsel  
 @>      | anyrange | rangesel  
 @>      | anyrange | rangesel  
 @>      | jsonb    | contsel  

There are several types, in fact the operator @> means (roughly): “Does the object on the left contain the right element?”. It is used for different types: geometry, array …

In our case, does the left JSONB object contain the ''{" displayname ":" anayrat "}'' element?

A JSON object is a special type. Determining the selectivity of an element would be quite complex. The comment is quite explicit:

 25 /*  
 26  *  Selectivity functions for geometric operators.  These are bogus -- unless  
 27  *  we know the actual key distribution in the index, we can't make a good  
 28  *  prediction of the selectivity of these operators.  
 29  *  
 30  *  Note: the values used here may look unreasonably small.  Perhaps they  
 31  *  are.  For now, we want to make sure that the optimizer will make use  
 32  *  of a geometric index if one is available, so the selectivity had better  
 33  *  be fairly small.  
[...]  

It is therefore not possible (currently) to determine the selectivity of JSONB objects.

But all is not lost

Functional indexes

PostgreSQL permits to creates so-called functional indexes. We create an index on a fonction.

You’re going to say, “Yes, but we do not need it.” In your example, postgres is already using an index.

That’s right, the difference is that postgres collects statistics about this index. As if the result of the function was a new column.

Creating the function and the index

It is very simple :

CREATE or replace FUNCTION json_displayname (jsonb )  
RETURNS text  
AS $$  
select $1->>'displayname'  
$$  
LANGUAGE SQL IMMUTABLE PARALLEL SAFE  
;  
  
create index ON json_stack (json_displayname(json));  

Search using a function

To use the index we just created, use it in the query:

explain (analyze,verbose,buffers) select * from json_stack  
        where json_displayname(json) = 'anayrat';  
                        QUERY PLAN  
----------------------------------------------------------------------------  
 Index Scan using json_stack_json_displayname_idx on public.json_stack  
            (cost=0.56..371.70 rows=363 width=1011)  
            (actual time=0.021..0.023 rows=2 loops=1)  
   Output: post_id, json  
   Index Cond: ((json_stack.json ->> 'displayname'::text) = 'anayrat'::text)  
   Buffers: shared hit=7  
 Planning time: 0.107 ms  
 Execution time: 0.037 ms  
(6 rows)  

This time postgres estimates to get 363 rows, which is much closer to the final result (2).

Another example and selectivity calculation

This time we will search on the “age” field of the JSON object:

explain (analyze,buffers)  select * from json_stack  
      where json @>  '{"age":27}'::jsonb;  
                      QUERY PLAN  
---------------------------------------------------------------------  
 Bitmap Heap Scan on json_stack  
        (cost=286.95..33866.98 rows=33283 width=1011)  
        (actual time=667.411..12723.906 rows=804630 loops=1)  
   Recheck Cond: (json @> '{"age": 27}'::jsonb)  
   Rows Removed by Index Recheck: 2211190  
   Heap Blocks: exact=391448 lossy=344083  
   Buffers: shared hit=576350 read=881510  
   I/O Timings: read=2947.458  
   ->  Bitmap Index Scan on json_stack_json_idx  
        (cost=0.00..278.62 rows=33283 width=0)  
        (actual time=562.648..562.648 rows=804644 loops=1)  
         Index Cond: (json @> '{"age": 27}'::jsonb)  
         Buffers: shared hit=9612 read=5140  
         I/O Timings: read=11.195  
 Planning time: 0.073 ms  
 Execution time: 12809.392 ms  
(12 lignes)  
  
set work_mem = '100MB';  
  
explain (analyze,buffers)  select * from json_stack  
      where json @>  '{"age":27}'::jsonb;  
                      QUERY PLAN  
---------------------------------------------------------------------  
 Bitmap Heap Scan on json_stack  
        (cost=286.95..33866.98 rows=33283 width=1011)  
        (actual time=748.968..5720.628 rows=804630 loops=1)  
   Recheck Cond: (json @> '{"age": 27}'::jsonb)  
   Rows Removed by Index Recheck: 14  
   Heap Blocks: exact=735531  
   Buffers: shared hit=123417 read=780542  
   I/O Timings: read=1550.124  
   ->  Bitmap Index Scan on json_stack_json_idx  
        (cost=0.00..278.62 rows=33283 width=0)  
        (actual time=545.553..545.553 rows=804644 loops=1)  
         Index Cond: (json @> '{"age": 27}'::jsonb)  
         Buffers: shared hit=9612 read=5140  
         I/O Timings: read=11.265  
 Planning time: 0.079 ms  
 Execution time: 5796.219 ms  
(12 lignes)  

In this example we see that postgres still estimates 33,283 records. Out he gets 804 644. This time he is too much optimistic.

P.S: In my example you will see that I run the same query by modifying work_mem. This is to prevent the bitmap from being lossy

As seen above we can create a function:

CREATE or replace FUNCTION json_age (jsonb )  
RETURNS text  
AS $$  
select $1->>'age'  
$$  
LANGUAGE SQL IMMUTABLE PARALLEL SAFE  
;  
  
create index ON json_stack (json_age(json));  

Again the estimate is much better:

explain (analyze,buffers)   select * from json_stack  
      where json_age(json) = '27';  
                         QUERY PLAN  
------------------------------------------------------------------------  
 Index Scan using json_stack_json_age_idx on json_stack  
  (cost=0.56..733177.05 rows=799908 width=1011)  
  (actual time=0.042..2355.179 rows=804630 loops=1)  
   Index Cond: ((json ->> 'age'::text) = '27'::text)  
   Buffers: shared read=737720  
   I/O Timings: read=1431.275  
 Planning time: 0.087 ms  
 Execution time: 2410.269 ms  

Postgres estimates to get 799,908 records. we will check it.

As I said, Postgres has statistics information based on a sample of data. This information is stored in a readable system catalog with the pg_stats view. With a functional index, Postgres sees it as a new column.

schemaname             | public  
tablename              | json_stack_json_age_idx  
attname                | json_age  
[...]  
most_common_vals       | {28,27,29,31,26,30,32,25,33,34,36,24,[...]}  
most_common_freqs      | {0.0248,0.0240333,0.0237333,0.0236333,0.0234,0.0229333,[...]}  
[...]  

The column most_common_vals contains the most common values and the column most_common_freqs the corresponding selectivity.

So for age = 27 we have a selectivity of 0.0240333.

这里使用了表达式时,PostgreSQL 评估为一对一输出,即输入一个A值返回一个与A相关的值,所以计算选择性可以使用原始列的柱状图。

《PostgreSQL 11 preview - 表达式索引柱状图buckets\STATISTICS\default_statistics_target可设置》

https://www.postgresql.org/docs/10/static/catalog-pg-proc.html

Then we just have to multiply the selectivity by the cardinality of the table:

select n_live_tup from pg_stat_all_tables where relname ='json_stack';  
 n_live_tup  
------------  
   33283258  
  
  
select 0.0240333 * 33283258;  
    ?column?  
----------------  
 799906.5244914  

Okay, estimate is much better. But is it serious if postgres is wrong? In the two queries above we see that postgres uses an index and that the result is obtained quickly.

Consequences of a bad estimate

How can a bad estimate be a problem?

When that leads to the choice of a bad plan.

For example, this aggregation query that counts the number of posts by age:

explain (analyze,buffers)  select json->'age',count(json->'age')  
                          from json_stack group by json->'age' ;  
                             QUERY PLAN  
--------------------------------------------------------------------------------------  
 Finalize GroupAggregate  
  (cost=10067631.49..14135810.84 rows=33283256 width=40)  
  (actual time=364151.518..411524.862 rows=86 loops=1)  
   Group Key: ((json -> 'age'::text))  
   Buffers: shared hit=1949354 read=1723941, temp read=1403174 written=1403189  
   I/O Timings: read=155401.828  
   ->  Gather Merge  
        (cost=10067631.49..13581089.91 rows=27736046 width=40)  
        (actual time=364151.056..411524.589 rows=256 loops=1)  
         Workers Planned: 2  
         Workers Launched: 2  
         Buffers: shared hit=1949354 read=1723941, temp read=1403174 written=1403189  
         I/O Timings: read=155401.828  
         ->  Partial GroupAggregate  
            (cost=10066631.46..10378661.98 rows=13868023 width=40)  
            (actual time=363797.836..409187.566 rows=85 loops=3)  
               Group Key: ((json -> 'age'::text))  
               Buffers: shared hit=5843962 read=5177836,  
                        temp read=4212551 written=4212596  
               I/O Timings: read=478460.123  
               ->  Sort  
                  (cost=10066631.46..10101301.52 rows=13868023 width=1042)  
                  (actual time=299775.029..404358.743 rows=11094533 loops=3)  
                     Sort Key: ((json -> 'age'::text))  
                     Sort Method: external merge  Disk: 11225392kB  
                     Buffers: shared hit=5843962 read=5177836,  
                              temp read=4212551 written=4212596  
                     I/O Timings: read=478460.123  
                     ->  Parallel Seq Scan on json_stack  
                        (cost=0.00..4791997.29 rows=13868023 width=1042)  
                        (actual time=0.684..202361.133 rows=11094533 loops=3)  
                           Buffers: shared hit=5843864 read=5177836  
                           I/O Timings: read=478460.123  
 Planning time: 0.080 ms  
 Execution time: 411688.165 ms  

Postgres expects to get 33,283,256 records instead of 86. It also performed a very expensive sort since it generated more than 33GB (11GB * 3 loops) of temporary files.

The same query using the json_age function:

explain (analyze,buffers)   select json_age(json),count(json_age(json))  
                              from json_stack group by json_age(json);  
                                             QUERY PLAN  
--------------------------------------------------------------------------------------  
 Finalize GroupAggregate  
  (cost=4897031.22..4897033.50 rows=83 width=40)  
  (actual time=153985.585..153985.667 rows=86 loops=1)  
   Group Key: ((json ->> 'age'::text))  
   Buffers: shared hit=1938334 read=1736761  
   I/O Timings: read=106883.908  
   ->  Sort  
      (cost=4897031.22..4897031.64 rows=166 width=40)  
      (actual time=153985.581..153985.598 rows=256 loops=1)  
         Sort Key: ((json ->> 'age'::text))  
         Sort Method: quicksort  Memory: 37kB  
         Buffers: shared hit=1938334 read=1736761  
         I/O Timings: read=106883.908  
         ->  Gather  
            (cost=4897007.46..4897025.10 rows=166 width=40)  
            (actual time=153985.264..153985.360 rows=256 loops=1)  
               Workers Planned: 2  
               Workers Launched: 2  
               Buffers: shared hit=1938334 read=1736761  
               I/O Timings: read=106883.908  
               ->  Partial HashAggregate  
                  (cost=4896007.46..4896008.50 rows=83 width=40)  
                  (actual time=153976.620..153976.635 rows=85 loops=3)  
                     Group Key: (json ->> 'age'::text)  
                     Buffers: shared hit=5811206 read=5210494  
                     I/O Timings: read=320684.515  
                     ->  Parallel Seq Scan on json_stack  
                     (cost=0.00..4791997.29 rows=13868023 width=1042)  
                     (actual time=0.090..148691.566 rows=11094533 loops=3)  
                           Buffers: shared hit=5811206 read=5210494  
                           I/O Timings: read=320684.515  
 Planning time: 0.118 ms  
 Execution time: 154086.685 ms  

Here postgres sorts later on a lot less lines. The execution time is significantly reduced and we save especially 33GB of temporary files.

Last word

Statistics are essential for choosing the best execution plan. Currently Postgres has advanced features for JSON Unfortunately there is no possibility to add statistics on the JSONB type. Note that PostgreSQL 10 provides the infrastructure to extend statistics. Hopefully in the future it will be possible to extend them for special types.

In the meantime, it is possible to work around this limitation by using functional indexes.

1、https://www.postgresql.org/docs/current/static/functions-json.html
2、A bitmap node becomes lossy when postgres can not make a bitmap of all tuples. It thus passes in so-called “lossy” mode where the bitmap is no longer on the tuple but for the entire block. This requires reading more blocks and doing a “recheck” which consists in filtering obtained tuples. ^
3、The documentation is very complete and provides many examples: use of operators, indexing. ^

Related

PostgreSQL 10 : ICU & Abbreviated Keys

PostgreSQL 10 : Performances improvements

PGDay : How does Full Text Search works?

PostgreSQL 10 and Logical replication - Setup

PostgreSQL 10 and Logical replication - Overview

相关实践学习
使用PolarDB和ECS搭建门户网站
本场景主要介绍基于PolarDB和ECS实现搭建门户网站。
阿里云数据库产品家族及特性
阿里云智能数据库产品团队一直致力于不断健全产品体系,提升产品性能,打磨产品功能,从而帮助客户实现更加极致的弹性能力、具备更强的扩展能力、并利用云设施进一步降低企业成本。以云原生+分布式为核心技术抓手,打造以自研的在线事务型(OLTP)数据库Polar DB和在线分析型(OLAP)数据库Analytic DB为代表的新一代企业级云原生数据库产品体系, 结合NoSQL数据库、数据库生态工具、云原生智能化数据库管控平台,为阿里巴巴经济体以及各个行业的企业客户和开发者提供从公共云到混合云再到私有云的完整解决方案,提供基于云基础设施进行数据从处理、到存储、再到计算与分析的一体化解决方案。本节课带你了解阿里云数据库产品家族及特性。
目录
相关文章
|
SQL Oracle 关系型数据库
Oracle-分析函数之排序后顺序号row_number()
Oracle-分析函数之排序后顺序号row_number()
121 0
|
存储 SQL 移动开发
mysql之常用函数、聚合函数以及合并(union&union all)
1) UNION:将所有的查询结果合并到一起,然后去除掉相同的记录 2) UNION ALL:将所有的查询结果合并到一起,不会去除掉相同的记录 前提条件:结果集列数个数相同,列的类型还要相同或是兼容 使用场景:在项目统计报表模块,用来合并数据
|
关系型数据库 定位技术 数据库
【DB吐槽大会】第50期 - PG GiST距离排序操作符和过滤无法同时使用索引
大家好,这里是DB吐槽大会,第50期 - PG GiST距离排序操作符和过滤无法同时使用索引
|
弹性计算 关系型数据库 测试技术
PostgreSQL 分区表如何支持多列唯一约束 - 枚举、hash哈希 分区, 多列唯一, insert into on conflict, update, upsert, merge insert
标签 PostgreSQL , 分区表 , native partition , 唯一 , 非分区键唯一 , 组合唯一 , insert into on conflict , upsert , merge insert 背景 PG 11开始支持HASH分区,10的分区如果要支持hash分区,可以通过枚举绕道实现。 《PostgreSQL 9.x, 10, 11 hash分区表 用法举例
3160 0
|
SQL 关系型数据库 数据库
PostgreSQL 设计优化case - 大宽表任意字段组合查询索引如何选择(btree, gin, rum) - (含单个索引列数超过32列的方法)
标签 PostgreSQL , adhoc查询 , 大宽表 , 任意字段组合查询 , 索引 , btree , gin , rum 背景 大宽表,任意字段组合查询,透视。是实时分析系统中的常见需求: 1、实时写入。
2711 0
|
关系型数据库 PostgreSQL
PostgreSQL 快速返回表上某列的唯一值(枚举值) - pg_stats.most_common_vals
标签 PostgreSQL , 统计信息 , 唯一值 , 枚举值 背景 PostgreSQL的列统计信息中包含一项高频词,同时包含一项唯一值个数。 pg_stats.n_distinct pg_stats.most_common_vals 同时PostgreSQL允许用户自定义统计信息柱状图BUCKET的个数。
1667 0
|
关系型数据库 PostgreSQL
PostgreSQL sharding : citus 系列6 - count(distinct xx) 加速 (use 估值插件 hll|hyperloglog)
标签 PostgreSQL , hll , hyperloglog , distinct , 加速 , citus.count_distinct_error_rate 背景 在分布式数据库中,计算count(distinct xxx),需要对distinct 的字段, 1、去重, 2、重分布去重后的数据,(这一步,如果distinct值特别多,那么就会比较耗时) 3、然后再去重, 4、最后count (xxx), 5、求所有节点的count SUM。
1847 0

相关产品

  • 云原生数据库 PolarDB
  • 云数据库 RDS PostgreSQL 版