Spatio-Temporal Database Practice (with a New York Taxi Data Pivot Analysis) - PostGIS + TimescaleDB => PostgreSQL


Tags

PostgreSQL, spatio-temporal database, time-series database, spatial database, timescaleDB, PostGIS


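Background

In practice, much business data has time-series characteristics: it is written sequentially along the time dimension, and it is queried and aggregated heavily over time ranges.

Examples include application feed data, time-series data produced by IoT devices (weather sensors, vehicle trajectories, and so on), and real-time data in the financial industry.

PostgreSQL's UDFs and BRIN (block range) indexes are well suited to time-series data. Two examples:

"Implementing On-Demand Chunking in PostgreSQL (a plpgsql schemaless implementation of the TimescaleDB extension's automatic chunking)"

"PostgreSQL Time-Series Best Practices - Database Design for a Securities Trading System - Alibaba Cloud RDS PostgreSQL Best Practices"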


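The PostgreSQL ecosystem has in fact produced a dedicated time-series extension: timescaleDB. (Its improvements include SQL optimizer changes (merge append support, which makes time-slice aggregation very efficient), a rotate interface, and automatic chunking.)

timescaleDB has also attracted considerable investor attention, having raised USD 50 million, which indirectly suggests that time-series databases will be in strong demand.

Advantages of timescaleDB

First, timescaleDB chunks data automatically and transparently to the user, so write performance does not degrade even when the data volume becomes very large. (This mainly concerns disks with low IOPS; on disks with good IOPS, plain PostgreSQL also performs acceptably after large amounts of data have been written.)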


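Second, timescale improves the SQL optimizer by adding a merge append execution node. When grouping by small time slices, it does not need to hash or group over the entire data range; it computes chunk by chunk, which makes it very efficient.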
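To make the merge append idea concrete, here is a minimal Python sketch. It is not TimescaleDB's implementation. It assumes each chunk can already return its rows in time order (for instance from an index on the time column), so a k-way merge can stop as soon as enough rows have been produced instead of sorting the whole data range. The chunk contents and the helper name `newest_first` are hypothetical.

```python
import heapq
from itertools import islice


def newest_first(chunks, limit):
    """Merge per-chunk streams that are each sorted newest-first; stop after `limit` rows."""
    merged = heapq.merge(*chunks, reverse=True)  # lazy k-way merge, no global sort
    return list(islice(merged, limit))


# Hypothetical chunks: (minute, value) rows, each chunk already ordered newest-first
chunk_a = [(50, "a3"), (30, "a2"), (10, "a1")]
chunk_b = [(60, "b3"), (40, "b2"), (20, "b1")]

top3 = newest_first([chunk_a, chunk_b], 3)
print(top3)
```

Because the merge is lazy, the cost of an `ORDER BY time DESC LIMIT n` style query depends on `n` and the number of chunks rather than on the total row count.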

Finally, timescale adds a set of APIs that make writing, maintaining, and querying time-series data efficient and easy to manage.

The API reference:

http://docs.timescale.com/v0.8/api

Deploying timescaleDB

CentOS 7.x x64 is used as the example.

1. Install PostgreSQL first

See "PostgreSQL on Linux Best Deployment Manual".

export USE_NAMED_POSIX_SEMAPHORES=1  
LIBS=-lpthread CFLAGS="-O3" ./configure --prefix=/home/digoal/pgsql10 --with-segsize=8 --with-wal-segsize=256  
LIBS=-lpthread CFLAGS="-O3" make world -j 64  
LIBS=-lpthread CFLAGS="-O3" make install-world  

2. Install cmake3

# cmake3 comes from the EPEL repository, so enable EPEL first (e.g. yum install -y epel-release)  
  
yum install -y cmake3  
  
ln -s /usr/bin/cmake3 /usr/bin/cmake  

3. Build timescaleDB

git clone https://github.com/timescale/timescaledb/  
  
cd timescaledb  
git checkout release-0.8.0  
  
# or download the release tarball instead  
  
wget https://github.com/timescale/timescaledb/archive/0.8.0.tar.gz  
  
  
  
export PATH=/home/digoal/pgsql10/bin:$PATH  
export LD_LIBRARY_PATH=/home/digoal/pgsql10/lib:$LD_LIBRARY_PATH  
  
# Bootstrap the build system  
./bootstrap  
  
cd ./build && make  
  
make install  
  
  
[  2%] Built target sqlupdatefile  
[  4%] Built target sqlfile  
[100%] Built target timescaledb  
Install the project...  
-- Install configuration: "Release"  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb.control  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.8.0.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.7.1--0.8.0.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.1.0--0.2.0.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.2.0--0.3.0.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.3.0--0.4.0.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.4.0--0.4.1.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.4.1--0.4.2.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.4.2--0.5.0.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.5.0--0.6.0.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.6.0--0.6.1.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.6.1--0.7.0.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.6.1--0.7.1.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.7.0--0.7.1.sql  
-- Installing: /home/dege.zzz/pgsql10/lib/timescaledb.so  

4. Configure postgresql.conf so that the timescale library is loaded automatically when the database starts.

vi $PGDATA/postgresql.conf  
shared_preload_libraries = 'timescaledb'  
  
pg_ctl restart -m fast  

5. Create the extension in each database that needs timescaledb.

psql  
psql (10.1)  
Type "help" for help.  
  
postgres=# create extension timescaledb ;  

6. timescaledb-related parameters

timescaledb.constraint_aware_append     
timescaledb.disable_optimizations       
timescaledb.optimize_non_hypertables    
timescaledb.restoring    
  
postgres=# show timescaledb.constraint_aware_append ;  
 timescaledb.constraint_aware_append   
-------------------------------------  
 on  
(1 row)  
  
postgres=# show timescaledb.disable_optimizations ;  
 timescaledb.disable_optimizations   
-----------------------------------  
 off  
(1 row)  
  
postgres=# show timescaledb.optimize_non_hypertables ;  
 timescaledb.optimize_non_hypertables   
--------------------------------------  
 off  
(1 row)  
  
postgres=# show timescaledb.restoring ;  
 timescaledb.restoring   
-----------------------  
 off  
(1 row)  

timescaleDB Example 1 - New York taxi data pivot analysis

The first example uses real-life New York City taxicab data:

http://docs.timescale.com/v0.8/tutorials/tutorial-hello-nyc

The data is real and comes from

http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

1. Download the sample data

wget https://timescaledata.blob.core.windows.net/datasets/nyc_data.tar.gz  

2. Extract it

tar -zxvf nyc_data.tar.gz   

3. Create the tables. The script also calls the create_hypertable API, which converts a regular table into a hypertable.

psql -f nyc_data.sql  

An excerpt from nyc_data.sql:

cat nyc_data.sql  
  
-- Taxi ride data: duration, fare, distance, pickup and drop-off coordinates, timestamps, passenger count, and so on.  
  
CREATE TABLE "rides"(  
    vendor_id TEXT,  
    pickup_datetime TIMESTAMP WITHOUT TIME ZONE NOT NULL,  
    dropoff_datetime TIMESTAMP WITHOUT TIME ZONE NOT NULL,  
    passenger_count NUMERIC,  
    trip_distance NUMERIC,  
    pickup_longitude  NUMERIC,  
    pickup_latitude   NUMERIC,  
    rate_code         INTEGER,  
    dropoff_longitude NUMERIC,  
    dropoff_latitude  NUMERIC,  
    payment_type INTEGER,  
    fare_amount NUMERIC,  
    extra NUMERIC,  
    mta_tax NUMERIC,  
    tip_amount NUMERIC,  
    tolls_amount NUMERIC,  
    improvement_surcharge NUMERIC,  
    total_amount NUMERIC  
);  

This statement converts rides into a hypertable (time-series storage):

SELECT create_hypertable('rides', 'pickup_datetime', 'payment_type', 2, create_default_indexes=>FALSE);  

Create the indexes:

CREATE INDEX ON rides (vendor_id, pickup_datetime desc);  
CREATE INDEX ON rides (pickup_datetime desc, vendor_id);  
CREATE INDEX ON rides (rate_code, pickup_datetime DESC);  
CREATE INDEX ON rides (passenger_count, pickup_datetime desc);  

4. Load the test data

psql -c "\COPY rides FROM nyc_data_rides.csv CSV"  
COPY 10906858  

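5. Run some test queries against rides now that it is a hypertable; they perform better than on a regular PostgreSQL table.

Average fare per day for rides with 2 or more passengers: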

-- Average fare amount of rides with 2+ passengers by day  
  
SELECT date_trunc('day', pickup_datetime) as day, avg(fare_amount)  
  FROM rides  
  WHERE passenger_count > 1 AND pickup_datetime < '2016-01-08'  
  GROUP BY day ORDER BY day;  
  
        day         |         avg  
--------------------+---------------------  
2016-01-01 00:00:00 | 13.3990821679715529  
2016-01-02 00:00:00 | 13.0224687415181399  
2016-01-03 00:00:00 | 13.5382068607068607  
2016-01-04 00:00:00 | 12.9618895561740149  
2016-01-05 00:00:00 | 12.6614611935518309  
2016-01-06 00:00:00 | 12.5775245695086098  
2016-01-07 00:00:00 | 12.5868802584437019  
(7 rows)  

6. Some queries are more than 20 times faster

Number of rides per day:

-- Total number of rides by day for first 5 days  
  
SELECT date_trunc('day', pickup_datetime) as day, COUNT(*) FROM rides  
  GROUP BY day ORDER BY day  
  LIMIT 5;  
  
        day         | count  
--------------------+--------  
2016-01-01 00:00:00 | 345037  
2016-01-02 00:00:00 | 312831  
2016-01-03 00:00:00 | 302878  
2016-01-04 00:00:00 | 316171  
2016-01-05 00:00:00 | 343251  
(5 rows)  

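timescale adds a merge append execution optimization, so fine-grained aggregation over time slices is very efficient, and the larger the data set, the bigger the gain.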

For example, TimescaleDB introduces a time-based "merge append" optimization to minimize the number of
groups which must be processed to execute the following (given its knowledge that time is already ordered).

For our 100M row table, this results in query latency that is 396x faster than PostgreSQL (82ms vs. 32566ms).

SELECT date_trunc('minute', time) AS minute, max(usage_user)  
  FROM cpu  
  WHERE time < '2017-01-01'  
  GROUP BY minute  
  ORDER BY minute DESC  
  LIMIT 5;  

7. Run some timescaleDB-specific functions such as time_bucket, which also benefit from timescaleDB's built-in optimizations.

Using 5-minute buckets, output the number of rides in each bucket.

-- Number of rides by 5 minute intervals  
--   (using the TimescaleDB "time_bucket" function)  
  
SELECT time_bucket('5 minute', pickup_datetime) as five_min, count(*)  
  FROM rides  
  WHERE pickup_datetime < '2016-01-01 02:00'  
  GROUP BY five_min ORDER BY five_min;  
  
      five_min       | count  
---------------------+-------  
 2016-01-01 00:00:00 |   703  
 2016-01-01 00:05:00 |  1482  
 2016-01-01 00:10:00 |  1959  
 2016-01-01 00:15:00 |  2200  
 2016-01-01 00:20:00 |  2285  
 2016-01-01 00:25:00 |  2291  
 2016-01-01 00:30:00 |  2349  
 2016-01-01 00:35:00 |  2328  
 2016-01-01 00:40:00 |  2440  
 2016-01-01 00:45:00 |  2372  
 2016-01-01 00:50:00 |  2388  
 2016-01-01 00:55:00 |  2473  
 2016-01-01 01:00:00 |  2395  
 2016-01-01 01:05:00 |  2510  
 2016-01-01 01:10:00 |  2412  
 2016-01-01 01:15:00 |  2482  
 2016-01-01 01:20:00 |  2428  
 2016-01-01 01:25:00 |  2433  
 2016-01-01 01:30:00 |  2337  
 2016-01-01 01:35:00 |  2366  
 2016-01-01 01:40:00 |  2325  
 2016-01-01 01:45:00 |  2257  
 2016-01-01 01:50:00 |  2316  
 2016-01-01 01:55:00 |  2250  
(24 rows)  

8. Run some analytical SQL

Number of trips per rate type (rate_code joined to its description):

-- Join rides with rates to get more information on rate_code  
  
SELECT rates.description, COUNT(vendor_id) as num_trips FROM rides  
  JOIN rates on rides.rate_code = rates.rate_code  
  WHERE pickup_datetime < '2016-01-08'  
  GROUP BY rates.description ORDER BY rates.description;  
  
      description      | num_trips  
-----------------------+-----------  
 JFK                   |     54832  
 Nassau or Westchester |       967  
 Newark                |      4126  
 group ride            |        17  
 negotiated fare       |      7193  
 standard rate         |   2266401  
(6 rows)  

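Statistics for January 2016 rides billed at the JFK and Newark (EWR) rates: trip count, average duration, average total and tip, minimum, average, and maximum distance, and average passenger count.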

-- Analysis of all JFK and EWR rides in Jan 2016  
  
SELECT rates.description, COUNT(vendor_id) as num_trips,  
    AVG(dropoff_datetime - pickup_datetime) as avg_trip_duration, AVG(total_amount) as avg_total,  
    AVG(tip_amount) as avg_tip, MIN(trip_distance) as min_distance, AVG(trip_distance) as avg_distance, MAX(trip_distance) as max_distance,  
    AVG(passenger_count) as avg_passengers  
  FROM rides  
  JOIN rates on rides.rate_code = rates.rate_code  
  WHERE rides.rate_code in (2,3) AND pickup_datetime < '2016-02-01'  
  GROUP BY rates.description ORDER BY rates.description;  
  
 description | num_trips | avg_trip_duration |      avg_total      |      avg_tip       | min_distance |    avg_distance     | max_distance |   avg_passengers  
-------------+-----------+-------------------+---------------------+--------------------+--------------+---------------------+--------------+--------------------  
 JFK         |    225019 | 00:45:46.822517   | 64.3278115181384683 | 7.3334228220728027 |         0.00 | 17.2602816651038357 |       221.00 | 1.7333869584346211  
 Newark      |     16822 | 00:35:16.157472   | 86.4633688027582927 | 9.5461657353465700 |         0.00 | 16.2706122934252764 |       177.23 | 1.7435501129473309  
(2 rows)  

9. Automatic chunking and query plans

postgres=# \d+ rides  
                                                     Table "public.rides"  
        Column         |            Type             | Collation | Nullable | Default | Storage  | Stats target | Description   
-----------------------+-----------------------------+-----------+----------+---------+----------+--------------+-------------  
 vendor_id             | text                        |           |          |         | extended |              |   
 pickup_datetime       | timestamp without time zone |           | not null |         | plain    |              |   
 dropoff_datetime      | timestamp without time zone |           | not null |         | plain    |              |   
 passenger_count       | numeric                     |           |          |         | main     |              |   
 trip_distance         | numeric                     |           |          |         | main     |              |   
 pickup_longitude      | numeric                     |           |          |         | main     |              |   
 pickup_latitude       | numeric                     |           |          |         | main     |              |   
 rate_code             | integer                     |           |          |         | plain    |              |   
 dropoff_longitude     | numeric                     |           |          |         | main     |              |   
 dropoff_latitude      | numeric                     |           |          |         | main     |              |   
 payment_type          | integer                     |           |          |         | plain    |              |   
 fare_amount           | numeric                     |           |          |         | main     |              |   
 extra                 | numeric                     |           |          |         | main     |              |   
 mta_tax               | numeric                     |           |          |         | main     |              |   
 tip_amount            | numeric                     |           |          |         | main     |              |   
 tolls_amount          | numeric                     |           |          |         | main     |              |   
 improvement_surcharge | numeric                     |           |          |         | main     |              |   
 total_amount          | numeric                     |           |          |         | main     |              |   
Indexes:  
    "rides_passenger_count_pickup_datetime_idx" btree (passenger_count, pickup_datetime DESC)  
    "rides_pickup_datetime_vendor_id_idx" btree (pickup_datetime DESC, vendor_id)  
    "rides_rate_code_pickup_datetime_idx" btree (rate_code, pickup_datetime DESC)  
    "rides_vendor_id_pickup_datetime_idx" btree (vendor_id, pickup_datetime DESC)  
Child tables: _timescaledb_internal._hyper_1_1_chunk,  
              _timescaledb_internal._hyper_1_2_chunk,  
              _timescaledb_internal._hyper_1_3_chunk,  
              _timescaledb_internal._hyper_1_4_chunk  
  
-- Constraints on one of the chunks:  
Check constraints:  
    "constraint_1" CHECK (pickup_datetime >= '2015-12-31 00:00:00'::timestamp without time zone AND pickup_datetime < '2016-01-30 00:00:00'::timestamp without time zone)  
    "constraint_2" CHECK (_timescaledb_internal.get_partition_hash(payment_type) >= 1073741823)  
Inherits: rides  
-- Peek behind the scenes  
  
postgres=#  select count(*) from rides;  
  count     
----------  
 10906858  
(1 row)  
  
Time: 376.247 ms  
postgres=# explain select count(*) from rides;  
                                                 QUERY PLAN                                                   
------------------------------------------------------------------------------------------------------------  
 Finalize Aggregate  (cost=254662.23..254662.24 rows=1 width=8)  
   ->  Gather  (cost=254661.71..254662.22 rows=5 width=8)  
         Workers Planned: 5  
         ->  Partial Aggregate  (cost=253661.71..253661.72 rows=1 width=8)  
               ->  Append  (cost=0.00..247468.57 rows=2477258 width=0)  
                     ->  Parallel Seq Scan on rides  (cost=0.00..0.00 rows=1 width=0)  
                     ->  Parallel Seq Scan on _hyper_1_1_chunk  (cost=0.00..77989.57 rows=863657 width=0)  
                     ->  Parallel Seq Scan on _hyper_1_2_chunk  (cost=0.00..150399.01 rows=1331101 width=0)  
                     ->  Parallel Seq Scan on _hyper_1_3_chunk  (cost=0.00..6549.75 rows=112675 width=0)  
                     ->  Parallel Seq Scan on _hyper_1_4_chunk  (cost=0.00..12530.24 rows=169824 width=0)  
(10 rows)  
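The chunk constraints shown above (a time range from 2015-12-31 to 2016-01-30 plus a hash threshold of 1073741823) suggest how a row is routed to a chunk. The following Python sketch models that routing under stated assumptions: the default "1 month" interval behaves as 30 days aligned to the Unix epoch, hash values fall in [0, 2^31 - 1], and `zlib.crc32` stands in for TimescaleDB's internal `get_partition_hash`, whose actual algorithm is not shown in this article. The helper names are hypothetical.

```python
from datetime import datetime, timedelta
import zlib

EPOCH = datetime(1970, 1, 1)
CHUNK_INTERVAL = timedelta(days=30)  # assumption: "1 month" default acts as 30 days
HASH_SPACE = 2**31 - 1               # assumption: hash values lie in [0, 2^31 - 1]


def chunk_time_range(ts, interval=CHUNK_INTERVAL):
    """Return the [start, end) time slice containing ts, aligned to the epoch."""
    n = (ts - EPOCH) // interval
    start = EPOCH + n * interval
    return start, start + interval


def hash_partition(value, number_partitions=2):
    """Map a partitioning-column value to a partition index (stand-in hash)."""
    h = zlib.crc32(str(value).encode()) & 0x7FFFFFFF
    width = HASH_SPACE // number_partitions
    return min(h // width, number_partitions - 1)


start, end = chunk_time_range(datetime(2016, 1, 15, 8, 30))
print(start, end)                 # the time slice a mid-January pickup falls into
print(HASH_SPACE // 2)            # boundary between the two hash partitions
print(hash_partition(1), hash_partition(2))
```

Under these assumptions a mid-January pickup lands in the same time slice as constraint_1, and the boundary between the two hash partitions equals the threshold in constraint_2.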

10. Chunks can also be queried directly

postgres=# select count(*) from  _timescaledb_internal._hyper_1_1_chunk;  
  count    
---------  
 3454961  
(1 row)  

Chunking is completely transparent to the user.

Chunk metadata lives in dedicated schemas:

postgres=# \dn  
         List of schemas  
         Name          |  Owner     
-----------------------+----------  
 _timescaledb_cache    | postgres  
 _timescaledb_catalog  | postgres  
 _timescaledb_internal | postgres  
 public                | postgres  
(4 rows)  

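timescaleDB + PostGIS combined - a spatio-temporal database

By combining the timescaleDB time-series extension with the PostGIS spatial extension, PostgreSQL handles spatio-temporal data well.

1. Create the PostGIS extension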

create extension postgis;  

2. Add geometry columns

http://postgis.net/docs/manual-2.4/AddGeometryColumn.html

postgres=# SELECT AddGeometryColumn ('public','rides','pickup_geom',2163,'POINT',2);  
                   addgeometrycolumn                      
--------------------------------------------------------  
 public.rides.pickup_geom  SRID:2163 TYPE:POINT DIMS:2   
(1 row)  
  
postgres=# SELECT AddGeometryColumn ('public','rides','dropoff_geom',2163,'POINT',2);  
                    addgeometrycolumn                      
---------------------------------------------------------  
 public.rides.dropoff_geom  SRID:2163 TYPE:POINT DIMS:2   
(1 row)  
  
postgres=#   
postgres=# \d+ rides  
                                                     Table "public.rides"  
        Column         |            Type             | Collation | Nullable | Default | Storage  | Stats target | Description   
-----------------------+-----------------------------+-----------+----------+---------+----------+--------------+-------------  
 vendor_id             | text                        |           |          |         | extended |              |   
 pickup_datetime       | timestamp without time zone |           | not null |         | plain    |              |   
 dropoff_datetime      | timestamp without time zone |           | not null |         | plain    |              |   
 passenger_count       | numeric                     |           |          |         | main     |              |   
 trip_distance         | numeric                     |           |          |         | main     |              |   
 pickup_longitude      | numeric                     |           |          |         | main     |              |   
 pickup_latitude       | numeric                     |           |          |         | main     |              |   
 rate_code             | integer                     |           |          |         | plain    |              |   
 dropoff_longitude     | numeric                     |           |          |         | main     |              |   
 dropoff_latitude      | numeric                     |           |          |         | main     |              |   
 payment_type          | integer                     |           |          |         | plain    |              |   
 fare_amount           | numeric                     |           |          |         | main     |              |   
 extra                 | numeric                     |           |          |         | main     |              |   
 mta_tax               | numeric                     |           |          |         | main     |              |   
 tip_amount            | numeric                     |           |          |         | main     |              |   
 tolls_amount          | numeric                     |           |          |         | main     |              |   
 improvement_surcharge | numeric                     |           |          |         | main     |              |   
 total_amount          | numeric                     |           |          |         | main     |              |   
 pickup_geom           | geometry(Point,2163)        |           |          |         | main     |              |   
 dropoff_geom          | geometry(Point,2163)        |           |          |         | main     |              |   
Indexes:  
    "rides_passenger_count_pickup_datetime_idx" btree (passenger_count, pickup_datetime DESC)  
    "rides_pickup_datetime_vendor_id_idx" btree (pickup_datetime DESC, vendor_id)  
    "rides_rate_code_pickup_datetime_idx" btree (rate_code, pickup_datetime DESC)  
    "rides_vendor_id_pickup_datetime_idx" btree (vendor_id, pickup_datetime DESC)  
Child tables: _timescaledb_internal._hyper_1_1_chunk,  
              _timescaledb_internal._hyper_1_2_chunk,  
              _timescaledb_internal._hyper_1_3_chunk,  
              _timescaledb_internal._hyper_1_4_chunk  

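3. Populate the geometry columns. (The source data is stored in two numeric columns, longitude and latitude. Strictly speaking the update is optional: PostgreSQL supports expression indexes, so a spatial expression index can be built directly on those two columns.)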

-- Generate the geometry points and write to table  
--   (Note: These calculations might take a few mins)  
  
UPDATE rides SET pickup_geom = ST_Transform(ST_SetSRID(ST_MakePoint(pickup_longitude,pickup_latitude),4326),2163);  
UPDATE rides SET dropoff_geom = ST_Transform(ST_SetSRID(ST_MakePoint(dropoff_longitude,dropoff_latitude),4326),2163);  
  
  
vacuum full rides;  

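4. Spatio-temporal analysis example.

Count how many rides were picked up within 400 meters of (lat, long) (40.7589,-73.9851), per 30-minute bucket (based on the pickup location).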

-- Number of rides on New Years Eve originating within  
--   400m of Times Square, by 30 min buckets  
--   Note: Times Square is at (lat, long) (40.7589,-73.9851)  
  
SELECT time_bucket('30 minutes', pickup_datetime) AS thirty_min, COUNT(*) AS near_times_sq  
  FROM rides  
  WHERE ST_Distance(pickup_geom, ST_Transform(ST_SetSRID(ST_MakePoint(-73.9851,40.7589),4326),2163)) < 400  
    AND pickup_datetime < '2016-01-01 14:00'  
  GROUP BY thirty_min ORDER BY thirty_min;  
  
     thirty_min      | near_times_sq  
---------------------+--------------  
 2016-01-01 00:00:00 |      74  
 2016-01-01 00:30:00 |     102  
 2016-01-01 01:00:00 |     120  
 2016-01-01 01:30:00 |      98  
 2016-01-01 02:00:00 |     112  
 2016-01-01 02:30:00 |     109  
 2016-01-01 03:00:00 |     163  
 2016-01-01 03:30:00 |     181  
 2016-01-01 04:00:00 |     214  
 2016-01-01 04:30:00 |     185  
 2016-01-01 05:00:00 |     158  
 2016-01-01 05:30:00 |     113  
 2016-01-01 06:00:00 |     102  
 2016-01-01 06:30:00 |      91  
 2016-01-01 07:00:00 |      88  
 2016-01-01 07:30:00 |      58  
 2016-01-01 08:00:00 |      72  
 2016-01-01 08:30:00 |      94  
 2016-01-01 09:00:00 |     115  
 2016-01-01 09:30:00 |     118  
 2016-01-01 10:00:00 |     135  
 2016-01-01 10:30:00 |     160  
 2016-01-01 11:00:00 |     212  
 2016-01-01 11:30:00 |     229  
 2016-01-01 12:00:00 |     244  
 2016-01-01 12:30:00 |     230  
 2016-01-01 13:00:00 |     235  
 2016-01-01 13:30:00 |     238  
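The query above combines a spatial filter with a time bucket. As a rough illustration of that logic outside the database, the Python sketch below filters pickups by distance and counts them per 30-minute bucket. Note the swapped-in technique: it uses the haversine great-circle distance on raw latitude and longitude rather than the projected SRID 2163 distance that the SQL uses, so results near the 400 m boundary can differ slightly. The sample rows and helper names are hypothetical.

```python
import math
from collections import Counter
from datetime import datetime

EARTH_RADIUS_M = 6371000.0
TIMES_SQUARE = (40.7589, -73.9851)  # (lat, long), as in the query above


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, long) points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def bucket_30min(ts):
    """Floor a timestamp to the start of its 30-minute bucket."""
    return ts.replace(minute=ts.minute - ts.minute % 30, second=0, microsecond=0)


def rides_near(rides, centre, radius_m):
    """Count rides per 30-minute bucket whose pickup lies within radius_m of centre."""
    counts = Counter()
    for ts, lat, lon in rides:
        if haversine_m(lat, lon, centre[0], centre[1]) < radius_m:
            counts[bucket_30min(ts)] += 1
    return dict(sorted(counts.items()))


# Hypothetical pickups: (pickup_datetime, latitude, longitude)
sample = [
    (datetime(2016, 1, 1, 0, 5), 40.7589, -73.9851),   # at Times Square
    (datetime(2016, 1, 1, 0, 20), 40.7600, -73.9851),  # a little over 100 m north
    (datetime(2016, 1, 1, 0, 40), 40.7589, -73.9851),  # falls in the next bucket
    (datetime(2016, 1, 1, 0, 45), 40.6413, -73.7781),  # near JFK airport, far away
]
result = rides_near(sample, TIMES_SQUARE, 400)
print(result)
```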

Example 2 - sensor data and weather data

http://docs.timescale.com/v0.8/tutorials/other-sample-datasets

These are not covered further here.

Common timescaleDB APIs

http://docs.timescale.com/v0.8/api

1. Create a hypertable

create_hypertable()

Required Arguments

 Name             | Description
------------------+----------------------------------------------
 main_table       | Identifier of table to convert to hypertable
 time_column_name | Name of the column containing time values

Optional Arguments

 Name                   | Description
------------------------+----------------------------------------------
 partitioning_column    | Name of an additional column to partition by. If provided, number_partitions must be set.
 number_partitions      | Number of hash partitions to use for partitioning_column when this optional argument is supplied. Must be > 0.
 chunk_time_interval    | Interval in event time that each chunk covers. Must be > 0. Default is 1 month.
 create_default_indexes | Boolean whether to create default indexes on time/partitioning columns. Default is TRUE.
 if_not_exists          | Boolean whether to print warning if table already converted to hypertable or raise exception. Default is FALSE.
 partitioning_func      | The function to use for calculating a value's partition.

2. Add an additional partitioning dimension

Both hash and interval partitioning are supported.

add_dimension()

Required Arguments

 Name        | Description
-------------+----------------------------------------------
 main_table  | Identifier of hypertable to add the dimension to.
 column_name | Name of the column to partition by.

Optional Arguments

 Name              | Description
-------------------+----------------------------------------------
 number_partitions | Number of hash partitions to use on column_name. Must be > 0.
 interval_length   | Interval that each chunk covers. Must be > 0.
 partitioning_func | The function to use for calculating a value's partition (see create_hypertable instructions).

3. Drop chunks

Drop chunks older than a given point in time or a given interval.

drop_chunks()

Required Arguments

 Name       | Description
------------+----------------------------------------------
 older_than | Timestamp of cut-off point for data to be dropped, i.e., anything older than this should be removed.

Optional Arguments

 Name        | Description
-------------+----------------------------------------------
 table_name  | Hypertable name from which to drop chunks. If not supplied, all hypertables are affected.
 schema_name | Schema name of the hypertable from which to drop chunks. Defaults to public.
 cascade     | Boolean on whether to CASCADE the drop on chunks, therefore removing dependent objects on chunks to be removed. Defaults to FALSE.
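Since drop_chunks works on whole chunks rather than individual rows, it helps to see which chunks a given cut-off would select. The Python sketch below models that selection under an assumption: a chunk is dropped only when its entire time range lies before the cut-off (the exact boundary comparison used by TimescaleDB is not stated in this article). The chunk catalogue and the helper name `chunks_to_drop` are hypothetical.

```python
from datetime import datetime


def chunks_to_drop(chunks, older_than):
    """Return names of chunks whose whole [start, end) range ends at or before the cut-off."""
    return [name for name, (start, end) in chunks.items() if end <= older_than]


# Hypothetical chunk catalogue: name -> [start, end) time range, 30 days each
chunks = {
    "_hyper_1_1_chunk": (datetime(2015, 12, 31), datetime(2016, 1, 30)),
    "_hyper_1_2_chunk": (datetime(2016, 1, 30), datetime(2016, 2, 29)),
    "_hyper_1_3_chunk": (datetime(2016, 2, 29), datetime(2016, 3, 30)),
}
cutoff = datetime(2016, 3, 1)
dropped = chunks_to_drop(chunks, cutoff)
print(dropped)
```

In this model the third chunk survives even though it holds one day of data older than the cut-off, which is why the chunk interval effectively sets the granularity of retention.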

4. Set the chunk time interval

set_chunk_time_interval()

Required Arguments

 Name                | Description
---------------------+----------------------------------------------
 main_table          | Identifier of hypertable to update interval for.
 chunk_time_interval | Interval in event time that each new chunk covers. Must be > 0.

5. Analytic function - first value

first()

Required Arguments

 Name  | Description
-------+----------------------------------------------
 value | The value to return (anyelement)
 time  | The timestamp to use for comparison (TIMESTAMP/TIMESTAMPTZ or integer type)

For example, find the earliest temperature reading uploaded by each sensor.

SELECT device_id, first(temp, time)  
  FROM metrics  
  GROUP BY device_id;  

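The same result can be obtained with a recursive query:

"Several Applications of Recursive SQL in PostgreSQL - How Geeks and Ordinary People Think"

6. Analytic function - last value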

last()

Required Arguments

 Name  | Description
-------+----------------------------------------------
 value | The value to return (anyelement)
 time  | The timestamp to use for comparison (TIMESTAMP/TIMESTAMPTZ or integer type)

For example, find the latest temperature reading of each sensor in every 5-minute interval.

SELECT device_id, time_bucket('5 minutes', time) as interval,  
  last(temp, time)  
  FROM metrics  
  WHERE time > now () - interval '1 day'  
  GROUP BY device_id, interval  
  ORDER BY interval DESC;  

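The same result can be obtained with a recursive query:

"Several Applications of Recursive SQL in PostgreSQL - How Geeks and Ordinary People Think"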
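The semantics of first() and last() can be summarised as "the value that accompanies the smallest (or largest) time within each group". The Python sketch below spells that out in a single pass over unordered rows; it only illustrates the semantics, not how TimescaleDB computes the aggregates. The rows and the helper name `first_last` are hypothetical.

```python
def first_last(rows):
    """For each device, return (value at earliest time, value at latest time)."""
    earliest, latest = {}, {}
    for device_id, time, temp in rows:
        if device_id not in earliest or time < earliest[device_id][0]:
            earliest[device_id] = (time, temp)
        if device_id not in latest or time > latest[device_id][0]:
            latest[device_id] = (time, temp)
    return {d: (earliest[d][1], latest[d][1]) for d in earliest}


# Hypothetical rows: (device_id, time, temp), deliberately not in time order
rows = [
    ("dev1", 3, 21.5),
    ("dev1", 1, 20.0),
    ("dev2", 2, 18.0),
    ("dev1", 2, 20.7),
    ("dev2", 5, 19.2),
]
summary = first_last(rows)
print(summary)
```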

7. Analytic function - histogram

histogram()

Required Arguments

 Name     | Description
----------+----------------------------------------------
 value    | A set of values to partition into a histogram
 min      | The histogram’s lower bound used in bucketing
 max      | The histogram’s upper bound used in bucketing
 nbuckets | The integer value for the number of histogram buckets (partitions)

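For example:

Battery levels from 20 to 60 are split evenly into 5 buckets. The function returns an array of 5+2 values giving the number of records in each bucket; the first and last elements count the records that fall below and above the bounds.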

SELECT device_id, histogram(battery_level, 20, 60, 5)  
  FROM readings  
  GROUP BY device_id  
  LIMIT 10;  
  
 device_id  |          histogram  
------------+------------------------------  
 demo000000 | {0,0,0,7,215,206,572}  
 demo000001 | {0,12,173,112,99,145,459}  
 demo000002 | {0,0,187,167,68,229,349}  
 demo000003 | {197,209,127,221,106,112,28}  
 demo000004 | {0,0,0,0,0,39,961}  
 demo000005 | {12,225,171,122,233,80,157}  
 demo000006 | {0,78,176,170,8,40,528}  
 demo000007 | {0,0,0,126,239,245,390}  
 demo000008 | {0,0,311,345,116,228,0}  
 demo000009 | {295,92,105,50,8,8,442}  
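The shape of the returned array (nbuckets + 2 elements) is easier to see in a small model. The Python sketch below assumes the bucketing follows the same convention as PostgreSQL's width_bucket: values below min go to slot 0, values at or above max go to the last slot, and the range [min, max) is split into equal-width buckets. It illustrates the semantics only; the input values are hypothetical.

```python
def histogram(values, lo, hi, nbuckets):
    """Count values into nbuckets equal-width buckets over [lo, hi), plus two overflow slots."""
    counts = [0] * (nbuckets + 2)
    for v in values:
        if v < lo:
            idx = 0                      # below the lower bound
        elif v >= hi:
            idx = nbuckets + 1           # at or above the upper bound
        else:
            pos = int((v - lo) * nbuckets / (hi - lo))
            idx = min(pos, nbuckets - 1) + 1
        counts[idx] += 1
    return counts


levels = [10, 20, 25, 59, 60, 100]       # hypothetical battery levels
result = histogram(levels, 20, 60, 5)
print(result)
```

Here 10 lands in the underflow slot, 20 and 25 share the first bucket (20 to 28), 59 falls in the fifth bucket (52 to 60), and 60 and 100 are counted in the overflow slot.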

8. Analytic function - time buckets

Similar to date_trunc but more powerful: it truncates timestamps to an arbitrary interval, which is convenient for users.

time_bucket()

Required Arguments

 Name         | Description
--------------+----------------------------------------------
 bucket_width | A PostgreSQL time interval for how long each bucket is (interval)
 time         | The timestamp to bucket (timestamp/timestamptz/date)

Optional Arguments

 Name   | Description
--------+----------------------------------------------
 offset | The time interval to offset all buckets by (interval)
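The arithmetic behind time_bucket is a floor to a multiple of the bucket width. The Python sketch below models it, including the optional offset, under two assumptions that are not stated in this article: buckets are aligned to the Unix epoch, and an offset is applied by shifting the timestamp back, bucketing, and shifting the result forward again. The function here is a model written for illustration, not the extension's code.

```python
from datetime import datetime, timedelta

ORIGIN = datetime(1970, 1, 1)  # assumption: bucket boundaries are aligned to this origin


def time_bucket(bucket_width, ts, offset=timedelta(0)):
    """Floor ts to the start of its bucket; offset shifts every bucket boundary."""
    shifted = ts - offset
    n = (shifted - ORIGIN) // bucket_width
    return ORIGIN + n * bucket_width + offset


five = timedelta(minutes=5)
a = time_bucket(five, datetime(2016, 1, 1, 0, 7, 30))
b = time_bucket(five, datetime(2016, 1, 1, 0, 7, 30), offset=timedelta(minutes=2))
c = time_bucket(timedelta(minutes=45), datetime(2016, 1, 1, 1, 10))
print(a, b, c)
```

The 45-minute case is the kind of width that date_trunc cannot express, since date_trunc only truncates to named units such as minute, hour, or day.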

9. Data overview functions - hypertable size

hypertable_relation_size_pretty()

SELECT * FROM hypertable_relation_size_pretty('conditions');  
  
 table_size | index_size | toast_size | total_size  
------------+------------+------------+------------  
 1171 MB    | 1608 MB    | 176 kB     | 2779 MB  

10. Data overview functions - chunk sizes

chunk_relation_size_pretty()

SELECT * FROM chunk_relation_size_pretty('conditions');  
  
                chunk_table                 | table_size | index_size | total_size  
---------------------------------------------+------------+------------+------------  
 "_timescaledb_internal"."_hyper_1_1_chunk"  | 28 MB      | 36 MB      | 64 MB  
 "_timescaledb_internal"."_hyper_1_2_chunk"  | 57 MB      | 78 MB      | 134 MB  
 ...  

11. Data overview functions - index sizes

indexes_relation_size_pretty()

SELECT * FROM indexes_relation_size_pretty('conditions');  
  
             index_name_              | total_size  
--------------------------------------+------------  
 public.conditions_device_id_time_idx | 1143 MB  
 public.conditions_time_idx           | 465 MB  

12. Dump the time-series metadata

https://raw.githubusercontent.com/timescale/timescaledb/master/scripts/dump_meta_data.sql

psql [your connect flags] -d your_timescale_db < dump_meta_data.sql > dumpfile.txt  

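Summary

timescaleDB is a very convenient extension for time-series data: it hides the chunking logic (transparent to the user) and provides a large set of API functions as well as performance optimizations. It works very well for time-series workloads.

Combined with the PostGIS extension, PostgreSQL becomes even more capable at spatio-temporal processing.

References

"Implementing On-Demand Chunking in PostgreSQL (a plpgsql schemaless implementation of the TimescaleDB extension's automatic chunking)" (original title: 《PostgreSQL 按需切片的实现(TimescaleDB插件自动切片功能的plpgsql schemaless实现)》)

"What Are the Characteristics of a Time-Series Database? An Introduction to TimescaleDB" (original title: 《时序数据库有哪些特点? TimescaleDB时序数据库介绍》)

"PostgreSQL on Linux Best Deployment Manual" (original title: 《PostgreSQL on Linux 最佳部署手册》)

http://docs.timescale.com/v0.8/tutorials/tutorial-hello-nyc

http://docs.timescale.com/v0.8/introduction

"Several Applications of Recursive SQL in PostgreSQL - How Geeks and Ordinary People Think" (original title: 《PostgreSQL 递归SQL的几个应用 - 极客与正常人的思维》)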
