PolarDB for PostgreSQL 采用iprange和SPGiST index 实现超光速 - 全球化业务根据来源IP智能DNS路由就近分配本地机房访问, 提升用户体验-阿里云开发者社区

背景

《如何获得IP地址对应的地理信息库, 实现智能DNS解析? 就近路由提升全球化部署业务的访问性能》

上一篇信息提到了如何获取IP地址段的地理信息库, 本篇信息将使用PolarDB for PostgreSQL来加速根据来源IP快速找到对应的IP地址段, 将用到PolarDB for PostgreSQL的SPGiST索引和inet数据类型.

相比于把IP地址段存储为2个int8字段作between and的匹配, SPGiST索引和inet数据类型至少可以提升20倍性能.

https://www.postgresql.org/docs/15/functions-net.html

inet >>= inet → boolean  
Does subnet contain or equal subnet?  
inet '192.168.1/24' >>= inet '192.168.1/24' → t

详情

1、将数据导入PolarDB for PostgreSQL, 使用inet类型存储地址段, 并创建spgist索引.

create table ip2geo (id serial primary key, ip inet, province text, city text);  

copy ip2geo(ip,province,city) from '/Users/digoal/c.csv' (format csv);  
COPY 8617

create index idx_1 on ip2geo using spgist (ip);

postgres=# select ip,host(ip), masklen(ip) from ip2geo limit 10;  
     ip      |   host   | masklen   
-------------+----------+---------  
 1.0.1.0/24  | 1.0.1.0  |      24  
 1.0.2.0/23  | 1.0.2.0  |      23  
 1.0.8.0/21  | 1.0.8.0  |      21  
 1.0.32.0/19 | 1.0.32.0 |      19  
 1.1.0.0/24  | 1.1.0.0  |      24  
 1.1.2.0/23  | 1.1.2.0  |      23  
 1.1.4.0/22  | 1.1.4.0  |      22  
 1.1.8.0/24  | 1.1.8.0  |      24  
 1.1.9.0/24  | 1.1.9.0  |      24  
 1.1.10.0/23 | 1.1.10.0 |      23  
(10 rows)

2、IP地址段包含查询例子

postgres=# select * FROM ip2geo where  ip >>= '1.88.0.10/32' ;  
 id |     ip      | province |   city   
----+-------------+----------+----------
 53 | 1.88.0.0/14 | 北京市   | 歌华宽带
(1 row)

postgres=# select * FROM ip2geo where  ip >>= '1.88.0.0/24' ; 
 id |     ip      | province |   city   
----+-------------+----------+----------
 53 | 1.88.0.0/14 | 北京市   | 歌华宽带
(1 row)

3、对比索引扫描的性能提升, 相比全表扫描性能相差25倍:

postgres=# explain (analyze,verbose,timing,costs,buffers) select * FROM ip2geo where  ip >>= '1.88.0.0/24' ;  
                                                      QUERY PLAN                                                        
----------------------------------------------------------------------------------------------------------------------  
 Index Scan using idx_1 on public.ip2geo  (cost=0.15..2.37 rows=1 width=35) (actual time=0.019..0.020 rows=1 loops=1)
   Output: id, ip, province, city
   Index Cond: (ip2geo.ip >>= '1.88.0.0/24'::inet)
   Buffers: shared hit=4
 Planning Time: 0.057 ms
 Execution Time: 0.031 ms
(6 rows)  
  
postgres=# set enable_indexscan=off;  
SET  
postgres=# set enable_bitmapscan=off;  
SET  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * FROM ip2geo where  ip >>= '1.88.0.0/24' ;  
                                                QUERY PLAN                                                  
----------------------------------------------------------------------------------------------------------  
 Seq Scan on public.ip2geo  (cost=0.00..175.71 rows=1 width=35) (actual time=0.013..0.783 rows=1 loops=1)
   Output: id, ip, province, city
   Filter: (ip2geo.ip >>= '1.88.0.0/24'::inet)
   Rows Removed by Filter: 8616
   Buffers: shared hit=68
 Planning Time: 0.056 ms
 Execution Time: 0.793 ms 
(7 rows)

4、压力测试方法, 随机从地址库中取一条记录并生成这个地址段内的随机IP地址.

create or replace function getipaddr(int default ceil(8617*random())) returns inet as $$  
  select ip + (floor(random()*(2^(32-masklen(ip)))))::int8 from ip2geo where id=$1;  
$$ language sql strict immutable;

postgres=# explain (analyze,verbose,timing,costs,buffers) select * FROM ip2geo where  ip >>=  getipaddr(); 
                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 Seq Scan on public.ip2geo  (cost=0.00..175.71 rows=1 width=35) (actual time=0.098..0.955 rows=1 loops=1)
   Output: id, ip, province, city
   Filter: (ip2geo.ip >>= '43.243.11.49/22'::inet)
   Rows Removed by Filter: 8616
   Buffers: shared hit=68
 Planning:
   Buffers: shared hit=14
 Planning Time: 0.370 ms
 Execution Time: 0.962 ms
(9 rows)

postgres=# explain (analyze,verbose,timing,costs,buffers) select * FROM ip2geo where  ip >>=  getipaddr();  
                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 Seq Scan on public.ip2geo  (cost=0.00..175.71 rows=1 width=35) (actual time=0.087..1.285 rows=1 loops=1)
   Output: id, ip, province, city
   Filter: (ip2geo.ip >>= '43.236.136.57/22'::inet)
   Rows Removed by Filter: 8616
   Buffers: shared hit=68
 Planning:
   Buffers: shared hit=1
 Planning Time: 0.244 ms
 Execution Time: 1.293 ms
(9 rows)

postgres=# explain (analyze,verbose,timing,costs,buffers) select * FROM ip2geo where  ip >>=  getipaddr(); 
                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 Seq Scan on public.ip2geo  (cost=0.00..175.71 rows=1 width=35) (actual time=0.780..0.890 rows=1 loops=1)
   Output: id, ip, province, city
   Filter: (ip2geo.ip >>= '203.19.72.14/24'::inet)
   Rows Removed by Filter: 8616
   Buffers: shared hit=68
 Planning:
   Buffers: shared hit=1
 Planning Time: 0.199 ms
 Execution Time: 0.899 ms
(9 rows)

5、使用prepared statement, 随机地址段包含匹配查询

alter function getipaddr(int) volatile;  
  
create or replace function dyn_pre() returns setof ip2geo as $$  
declare  
  v inet;   
begin  
  v := getipaddr();  
  return query execute format('execute p(%L)', v);  
  exception when others then  
    execute format('prepare p(inet) as select * from ip2geo where ip >>= $1');  
    return query execute format('execute p(%L)', v);  
end;  
$$ language plpgsql strict;

postgres=# select dyn_pre();  
               dyn_pre                 
-------------------------------------  
 (43.227.192.0/22,浙江省杭州市,电信)  
(1 row)  
  
postgres=# select dyn_pre();  
           dyn_pre              
------------------------------  
 (103.25.64.0/22,上海市,电信)  
(1 row)  
  
postgres=# select dyn_pre();  
         dyn_pre           
-------------------------  
 (45.119.232.0/22,中国,)  
(1 row)  
  
postgres=# select dyn_pre();  
               dyn_pre                  
--------------------------------------  
 (103.205.252.0/22,江苏省宿迁市,联通)  
(1 row)  
  
postgres=# select dyn_pre();  
         dyn_pre           
-------------------------  
 (103.87.4.0/22,北京市,)  
(1 row)

6、压力测试

vi test.sql  
select dyn_pre();  
  
  
pgbench -M simple -n -r -P 1 -f ./test.sql -c 12 -j 12 -T 120

除去获取随机IP的时间, 在2018款macbook pro i5的机器上, 实际约8万qps.

PolarDB for PostgreSQL 作为智能DNS的数据搜索引擎, 节省几十倍的成本, 同时提升终端用户就近访问的体验, 特别适合例如“社交、游戏、多媒体、云盘、多地办公等等全球化或者全国部署业务”.

为什么spgist索引比btree combine 2字段索引范围搜索更高效?

spgist索引不管搜索什么范围, 搜索到目标数据基本上只需要扫描几个数据块.

而使用btree, 由于是2个字段符合搜索, 必然的会出现数据在驱动列大范围的匹配到后, 再通过第二个字段二次过滤的情况. 扫描的数据更多了, 效率立马就下降了.

测试过程:

1、创建inet转int8的函数

create or replace function inet2int8 (inet) returns int8 as $$    
  select (v[1]::bit(8)||v[2]::bit(8)||v[3]::bit(8)||v[4]::bit(8))::bit(32)::int8 from ( values ((regexp_split_to_array(host($1),'\.'))::int[]) ) t (v);    
$$ language sql strict;    
    
    
postgres=# select inet2int8('203.88.32.0');    
 inet2int8      
------------    
 3411550208    
(1 row)

2、将ip2geo拆成int8存储

create table ip2int8geo (id serial primary key, f int8,t int8, province text, city text);    
    
insert into ip2int8geo (f,t,province,city) select inet2int8(network(ip)), inet2int8(network(ip)) + 2^(32-masklen(ip)) - 1, province, city from ip2geo;    
INSERT 0 8617

3、创建from to 两个字段的combine索引

create index idx_2 on ip2int8geo (f,t);

4、创建获取随机IP INT8的函数用于测试

create or replace function genrandomipint8(int) returns int8 as $$    
  select f + ceil((t-f)*random()) from ip2int8geo where id=$1;     
$$ language sql strict;     
    
-- 这个是驱动列靠前的值, 搜索较快    
postgres=# select genrandomipint8(10);    
 genrandomipint8     
-----------------    
        16845351    
(1 row)    
    
-- 这个是驱动列靠后的值, 明显看出btree的大范围过滤    
postgres=# select genrandomipint8(8000);    
 genrandomipint8     
-----------------    
      3411550659    
(1 row)

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from ip2int8geo where f <= 16845351 and t >= 16845351;    
                                                        QUERY PLAN                                                            
--------------------------------------------------------------------------------------------------------------------------    
 Index Scan using idx_2 on public.ip2int8geo  (cost=0.29..2.50 rows=1 width=44) (actual time=0.006..0.007 rows=1 loops=1)    
   Output: id, f, t, province, city    
   Index Cond: ((ip2int8geo.f <= 16845351) AND (ip2int8geo.t >= 16845351))    
   Buffers: shared hit=3    
 Planning:    
   Buffers: shared hit=3    
 Planning Time: 0.114 ms    
 Execution Time: 0.018 ms    
(8 rows)    
    
    
    
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from ip2int8geo where f <= 3411550659 and t >= 3411550659;    
                                                          QUERY PLAN                                                              
------------------------------------------------------------------------------------------------------------------------------    
 Index Scan using idx_2 on public.ip2int8geo  (cost=0.29..167.31 rows=568 width=44) (actual time=0.438..0.440 rows=1 loops=1)    
   Output: id, f, t, province, city    
   Index Cond: ((ip2int8geo.f <= '3411550659'::bigint) AND (ip2int8geo.t >= '3411550659'::bigint))    
   Buffers: shared hit=33    
 Planning Time: 0.133 ms    
 Execution Time: 0.469 ms    
(6 rows)    
    
postgres=# select * from ip2int8geo where f <= 3411550659 and t >= 3411550659;    
  id  |     f      |     t      |   province   |        city             
------+------------+------------+--------------+---------------------    
 8000 | 3411550208 | 3411558399 | 广东省深圳市 | 天威有线宽带(关内))    
(1 row)

5、创建一个函数, 用于作btree索引的压力测试

create or replace function test_getip2int8geo () returns setof ip2int8geo as $$    
declare    
  i int8;    
begin    
  i := genrandomipint8( ceil(random()*8617)::int );    
  return query select * from ip2int8geo where f <= i and t >= i;    
end;    
$$ language plpgsql strict;    
    
    
    
postgres=# select * from test_getip2int8geo();    
  id  |     f      |     t      | province |  city      
------+------------+------------+----------+--------    
 3798 | 1736744960 | 1736745983 | 台湾省   | 台北市    
(1 row)    
Time: 1.058 ms    
    
postgres=# select * from test_getip2int8geo();    
  id  |     f     |     t     | province |  city      
------+-----------+-----------+----------+--------    
 1385 | 771539968 | 771540991 | 北京市   | 鹏博士    
(1 row)    
Time: 0.615 ms

6、使用spgist索引, 数据不管靠前还是靠后, 扫描的数据块都差不多, 性能基本都一样.

postgres=# select * from ip2geo offset 7999 limit 1;    
       ip       |   province   |        city             
----------------+--------------+---------------------    
 203.88.32.0/19 | 广东省深圳市 | 天威有线宽带(关内))    
(1 row)    
    
    
postgres=# explain (analyze,verbose,timing,costs,buffers) select * FROM ip2geo where  ip >>= '203.88.45.200/19' ;    
                                                      QUERY PLAN                                                          
----------------------------------------------------------------------------------------------------------------------    
 Index Scan using idx_1 on public.ip2geo  (cost=0.15..4.60 rows=3 width=31) (actual time=0.031..0.031 rows=1 loops=1)    
   Output: ip, province, city    
   Index Cond: (ip2geo.ip >>= '203.88.45.200/19'::inet)    
   Buffers: shared hit=4    
 Planning Time: 0.066 ms    
 Execution Time: 0.046 ms    
(6 rows)    
    
    
postgres=# select * FROM ip2geo where  ip >>= '203.88.45.200/19' ;    
       ip       |   province   |        city             
----------------+--------------+---------------------    
 203.88.32.0/19 | 广东省深圳市 | 天威有线宽带(关内))    
(1 row)

记录越多, btree combine扫描过滤性越差, 与spgist索引的差距就会越大. 例如, 我们可以使用以下100万条测试case来证明这个结论.
扫描的数据块数量相差上百倍.

create sequence seq INCREMENT by 1000;  
  
create table test (f int, t int);   
  
insert into test select n , n+999 from (select nextval('seq') n from generate_series(1,1000000) ) t ;  
  
postgres=#  select * from test limit 10;  
  f   |   t     
------+-------  
    1 |  1000  
 1001 |  2000  
 2001 |  3000  
 3001 |  4000  
 4001 |  5000  
 5001 |  6000  
 6001 |  7000  
 7001 |  8000  
 8001 |  9000  
 9001 | 10000  
(10 rows)  
  
  
postgres=# select min(f), max(t) from test;  
 min |    max       
-----+------------  
   1 | 1000000000  
(1 row)  
  
create index idx_test on test (f,t);  
  
explain (analyze,verbose,timing,costs,buffers) select * from test where f <= 500000000 and t>=500000000;  
  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test where f <= 500000000 and t>=500000000;  
                                                             QUERY PLAN                                                                
-------------------------------------------------------------------------------------------------------------------------------------  
 Index Only Scan using idx_test on public.test  (cost=0.42..9021.81 rows=250000 width=8) (actual time=16.605..16.608 rows=1 loops=1)  
   Output: f, t  
   Index Cond: ((test.f <= 500000000) AND (test.t >= 500000000))  
   Heap Fetches: 0  
   Buffers: shared hit=1370  
 Planning Time: 0.098 ms  
 Execution Time: 16.629 ms  
(7 rows)  
  
  
  
create index idx_test_1 on test using spgist (int4range(f,t+1));  
or
create index idx_test_2 on test using gist (int4range(f,t+1));  
  
explain (analyze,verbose,timing,costs,buffers) select * from test where int4range(f,t+1) @> 500000000;  
  
  
vi t1.sql  
\set id random(1,1000000000)  
select * from test where f <= :id and t >= :id;  
  
vi t2.sql  
\set id random(1,1000000000)  
select * from test where int4range(f,t+1) @> :id;
  
  
pgbench -M prepared -n -r -P 1 -f ./t1.sql -c 12 -j 12 -T 120  
  
pgbench -M prepared -n -r -P 1 -f ./t2.sql -c 12 -j 12 -T 120

https://gis.stackexchange.com/questions/374091/when-to-use-gist-and-when-to-use-sp-gist-index