时空数据库实践(含 纽约TAXI数据透视分析) - PostGIS + TimescaleDB => PostgreSQL

本文涉及的产品
RDS MySQL Serverless 基础系列,0.5-2RCU 50GB
云数据库 Tair(兼容Redis),内存型 2GB
Redis 开源版,标准版 2GB
推荐场景:
搭建游戏排行榜
简介:

标签

PostgreSQL , 时空数据库 , 时序数据库 , 空间数据库 , timescaleDB , PostGIS


背景

现实社会中,很多业务产生的数据具有时序数据属性(在时间维度上顺序写入,同时包括大量时间区间查询统计的需求)。

例如业务的FEED数据,物联网产生的时序数据(如气象传感器、车辆轨迹、等),金融行业的实时数据等等。

PostgreSQL的UDF和BRIN(块级索引)很适合时序数据的处理。具体有以下的两个例子

《PostgreSQL 按需切片的实现(TimescaleDB插件自动切片功能的plpgsql schemaless实现)》

《PostgreSQL 时序最佳实践 - 证券交易系统数据库设计 - 阿里云RDS PostgreSQL最佳实践》

pic

实际上PostgreSQL生态中,衍生了一个时序插件:timescaleDB。专门用于处理时序数据。(timescale的改进,包括SQL优化器的改进(支持merge append,时间片聚合非常高效),rotate接口,自动分片等)

同时timescaleDB也非常受投资者的关注,已获5000万美金的投资,也间接说明时序数据库在未来是非常受用户欢迎的。

timescaleDB的优势

首先,timescaleDB是自动切片的,对用户无感知,在数据量非常庞大的时候,写入性能不衰减。(主要指IOPS较低的磁盘,如果IOPS较好的磁盘PG在写入大量数据后性能也是OK的。)

pic

其次,timescale改进了SQL优化器,增加了merge append的执行节点,同时在对小时间片进行group by时,可以不用HASH或GROUP整个数据范围,而是分片计算,使得效率非常高。

最后,timescale增加了一些API,使得用户在时序数据的写入、维护、查询都非常的高效、同时易于维护。

API如下

http://docs.timescale.com/v0.8/api

部署timescaleDB

以CentOS 7.x x64为例。

1、首先要安装好PostgreSQL

参考 《PostgreSQL on Linux 最佳部署手册》

export USE_NAMED_POSIX_SEMAPHORES=1  
LIBS=-lpthread CFLAGS="-O3" ./configure --prefix=/home/digoal/pgsql10 --with-segsize=8 --with-wal-segsize=256  
LIBS=-lpthread CFLAGS="-O3" make world -j 64  
LIBS=-lpthread CFLAGS="-O3" make install-world  

2、其次需要安装cmake3

epel  
  
yum install -y cmake3  
  
ln -s /usr/bin/cmake3 /usr/bin/cmake  

3、编译timescaleDB

git clone https://github.com/timescale/timescaledb/  
  
cd timescaledb  
git checkout release-0.8.0  
  
或  
  
wget https://github.com/timescale/timescaledb/archive/0.8.0.tar.gz  
  
  
  
export PATH=/home/digoal/pgsql10/bin:$PATH  
export LD_LIBRARY_PATH=/home/digoal/pgsql10/lib:$LD_LIBRARY_PATH  
  
# Bootstrap the build system  
./bootstrap  
  
cd ./build && make  
  
make install  
  
  
[  2%] Built target sqlupdatefile  
[  4%] Built target sqlfile  
[100%] Built target timescaledb  
Install the project...  
-- Install configuration: "Release"  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb.control  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.8.0.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.7.1--0.8.0.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.1.0--0.2.0.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.2.0--0.3.0.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.3.0--0.4.0.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.4.0--0.4.1.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.4.1--0.4.2.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.4.2--0.5.0.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.5.0--0.6.0.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.6.0--0.6.1.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.6.1--0.7.0.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.6.1--0.7.1.sql  
-- Installing: /home/dege.zzz/pgsql10/share/extension/timescaledb--0.7.0--0.7.1.sql  
-- Installing: /home/dege.zzz/pgsql10/lib/timescaledb.so  

4、配置postgresql.conf,在数据库启动时自动加载timescale lib库。

vi $PGDATA/postgresql.conf  
shared_preload_libraries = 'timescaledb'  
  
pg_ctl restart -m fast  

5、对需要使用timescaledb的数据库,创建插件.

psql  
psql (10.1)  
Type "help" for help.  
  
postgres=# create extension timescaledb ;  

6、timescaledb的相关参数

timescaledb.constraint_aware_append     
timescaledb.disable_optimizations       
timescaledb.optimize_non_hypertables    
timescaledb.restoring    
  
postgres=# show timescaledb.constraint_aware_append ;  
 timescaledb.constraint_aware_append   
-------------------------------------  
 on  
(1 row)  
  
postgres=# show timescaledb.disable_optimizations ;  
 timescaledb.disable_optimizations   
-----------------------------------  
 off  
(1 row)  
  
postgres=# show timescaledb.optimize_non_hypertables ;  
 timescaledb.optimize_non_hypertables   
--------------------------------------  
 off  
(1 row)  
  
postgres=# show timescaledb.restoring ;  
 timescaledb.restoring   
-----------------------  
 off  
(1 row)  

timescaleDB使用例子1 - 纽约TAXI数据透视分析

第一个例子是real-life New York City taxicab data ,

http://docs.timescale.com/v0.8/tutorials/tutorial-hello-nyc

数据为真实的数据,来自

http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

1、下载样本数据

wget https://timescaledata.blob.core.windows.net/datasets/nyc_data.tar.gz  

2、解压

tar -zxvf nyc_data.tar.gz   

3、建表,其中包括将普通表转换为时序存储表的API create_hypertable 的使用。

psql -f nyc_data.sql  

截取一些nyc_data.sql的内容如下:

cat nyc_data.sql  
  
-- 打车数据: 包括时长、计费、路程、上车、下车经纬度、时间、人数等等。  
  
CREATE TABLE "rides"(  
    vendor_id TEXT,  
    pickup_datetime TIMESTAMP WITHOUT TIME ZONE NOT NULL,  
    dropoff_datetime TIMESTAMP WITHOUT TIME ZONE NOT NULL,  
    passenger_count NUMERIC,  
    trip_distance NUMERIC,  
    pickup_longitude  NUMERIC,  
    pickup_latitude   NUMERIC,  
    rate_code         INTEGER,  
    dropoff_longitude NUMERIC,  
    dropoff_latitude  NUMERIC,  
    payment_type INTEGER,  
    fare_amount NUMERIC,  
    extra NUMERIC,  
    mta_tax NUMERIC,  
    tip_amount NUMERIC,  
    tolls_amount NUMERIC,  
    improvement_surcharge NUMERIC,  
    total_amount NUMERIC  
);  

这句话,将rides转换为时序表存储

SELECT create_hypertable('rides', 'pickup_datetime', 'payment_type', 2, create_default_indexes=>FALSE);  

创建索引

CREATE INDEX ON rides (vendor_id, pickup_datetime desc);  
CREATE INDEX ON rides (pickup_datetime desc, vendor_id);  
CREATE INDEX ON rides (rate_code, pickup_datetime DESC);  
CREATE INDEX ON rides (passenger_count, pickup_datetime desc);  

4、导入测试数据

psql -c "\COPY rides FROM nyc_data_rides.csv CSV"  
COPY 10906858  

5、对已转换为时序存储表的rides执行一些测试SQL,性能比PostgreSQL普通表要好。

每天同车超过2人的交易,平均计费多少?

-- Average fare amount of rides with 2+ passengers by day  
  
SELECT date_trunc('day', pickup_datetime) as day, avg(fare_amount)  
  FROM rides  
  WHERE passenger_count > 1 AND pickup_datetime < '2016-01-08'  
  GROUP BY day ORDER BY day;  
  
        day         |         avg  
--------------------+---------------------  
2016-01-01 00:00:00 | 13.3990821679715529  
2016-01-02 00:00:00 | 13.0224687415181399  
2016-01-03 00:00:00 | 13.5382068607068607  
2016-01-04 00:00:00 | 12.9618895561740149  
2016-01-05 00:00:00 | 12.6614611935518309  
2016-01-06 00:00:00 | 12.5775245695086098  
2016-01-07 00:00:00 | 12.5868802584437019  
(7 rows)  

6、某些查询的性能甚至超过20倍

每天有多少笔交易。

-- Total number of rides by day for first 5 days  
  
SELECT date_trunc('day', pickup_datetime) as day, COUNT(*) FROM rides  
  GROUP BY day ORDER BY day  
  LIMIT 5;  
  
        day         | count  
--------------------+--------  
2016-01-01 00:00:00 | 345037  
2016-01-02 00:00:00 | 312831  
2016-01-03 00:00:00 | 302878  
2016-01-04 00:00:00 | 316171  
2016-01-05 00:00:00 | 343251  
(5 rows)  

timescale增加了merge append的执行优化,因此在时间片上按小粒度聚合,效率非常高,数据量越大,性能提升的效果越明显。

For example, TimescaleDB introduces a time-based "merge append" optimization to minimize the number of
groups which must be processed to execute the following (given its knowledge that time is already ordered).

For our 100M row table, this results in query latency that is 396x faster than PostgreSQL (82ms vs. 32566ms).

SELECT date_trunc('minute', time) AS minute, max(usage_user)  
  FROM cpu  
  WHERE time < '2017-01-01'  
  GROUP BY minute  
  ORDER BY minute DESC  
  LIMIT 5;  

7、执行一些timescaleDB特有的功能函数,例如time_bucket,这里同样会用到timescaleDB内置的一些加速算法。

每5分钟间隔为一个BUCKET,输出每个间隔产生了多少笔订单。

-- Number of rides by 5 minute intervals  
--   (using the TimescaleDB "time_bucket" function)  
  
SELECT time_bucket('5 minute', pickup_datetime) as five_min, count(*)  
  FROM rides  
  WHERE pickup_datetime < '2016-01-01 02:00'  
  GROUP BY five_min ORDER BY five_min;  
  
      five_min       | count  
---------------------+-------  
 2016-01-01 00:00:00 |   703  
 2016-01-01 00:05:00 |  1482  
 2016-01-01 00:10:00 |  1959  
 2016-01-01 00:15:00 |  2200  
 2016-01-01 00:20:00 |  2285  
 2016-01-01 00:25:00 |  2291  
 2016-01-01 00:30:00 |  2349  
 2016-01-01 00:35:00 |  2328  
 2016-01-01 00:40:00 |  2440  
 2016-01-01 00:45:00 |  2372  
 2016-01-01 00:50:00 |  2388  
 2016-01-01 00:55:00 |  2473  
 2016-01-01 01:00:00 |  2395  
 2016-01-01 01:05:00 |  2510  
 2016-01-01 01:10:00 |  2412  
 2016-01-01 01:15:00 |  2482  
 2016-01-01 01:20:00 |  2428  
 2016-01-01 01:25:00 |  2433  
 2016-01-01 01:30:00 |  2337  
 2016-01-01 01:35:00 |  2366  
 2016-01-01 01:40:00 |  2325  
 2016-01-01 01:45:00 |  2257  
 2016-01-01 01:50:00 |  2316  
 2016-01-01 01:55:00 |  2250  
(24 rows)  

8、执行一些统计分析SQL

每个城市的打车交易量。

-- Join rides with rates to get more information on rate_code  
  
SELECT rates.description, COUNT(vendor_id) as num_trips FROM rides  
  JOIN rates on rides.rate_code = rates.rate_code  
  WHERE pickup_datetime < '2016-01-08'  
  GROUP BY rates.description ORDER BY rates.description;  
  
      description      | num_trips  
-----------------------+-----------  
 JFK                   |     54832  
 Nassau or Westchester |       967  
 Newark                |      4126  
 group ride            |        17  
 negotiated fare       |      7193  
 standard rate         |   2266401  
(6 rows)  

某些城市2016年1月的打车统计(最长、短距离、平均人数、时长等)

-- Analysis of all JFK and EWR rides in Jan 2016  
  
SELECT rates.description, COUNT(vendor_id) as num_trips,  
    AVG(dropoff_datetime - pickup_datetime) as avg_trip_duration, AVG(total_amount) as avg_total,  
    AVG(tip_amount) as avg_tip, MIN(trip_distance) as min_distance, AVG(trip_distance) as avg_distance, MAX(trip_distance) as max_distance,  
    AVG(passenger_count) as avg_passengers  
  FROM rides  
  JOIN rates on rides.rate_code = rates.rate_code  
  WHERE rides.rate_code in (2,3) AND pickup_datetime < '2016-02-01'  
  GROUP BY rates.description ORDER BY rates.description;  
  
 description | num_trips | avg_trip_duration |      avg_total      |      avg_tip       | min_distance |    avg_distance     | max_distance |   avg_passengers  
-------------+-----------+-------------------+---------------------+--------------------+--------------+---------------------+--------------+--------------------  
 JFK         |    225019 | 00:45:46.822517   | 64.3278115181384683 | 7.3334228220728027 |         0.00 | 17.2602816651038357 |       221.00 | 1.7333869584346211  
 Newark      |     16822 | 00:35:16.157472   | 86.4633688027582927 | 9.5461657353465700 |         0.00 | 16.2706122934252764 |       177.23 | 1.7435501129473309  
(2 rows)  

9、数据自动分片与执行计划

postgres=# \d+ rides  
                                                     Table "public.rides"  
        Column         |            Type             | Collation | Nullable | Default | Storage  | Stats target | Description   
-----------------------+-----------------------------+-----------+----------+---------+----------+--------------+-------------  
 vendor_id             | text                        |           |          |         | extended |              |   
 pickup_datetime       | timestamp without time zone |           | not null |         | plain    |              |   
 dropoff_datetime      | timestamp without time zone |           | not null |         | plain    |              |   
 passenger_count       | numeric                     |           |          |         | main     |              |   
 trip_distance         | numeric                     |           |          |         | main     |              |   
 pickup_longitude      | numeric                     |           |          |         | main     |              |   
 pickup_latitude       | numeric                     |           |          |         | main     |              |   
 rate_code             | integer                     |           |          |         | plain    |              |   
 dropoff_longitude     | numeric                     |           |          |         | main     |              |   
 dropoff_latitude      | numeric                     |           |          |         | main     |              |   
 payment_type          | integer                     |           |          |         | plain    |              |   
 fare_amount           | numeric                     |           |          |         | main     |              |   
 extra                 | numeric                     |           |          |         | main     |              |   
 mta_tax               | numeric                     |           |          |         | main     |              |   
 tip_amount            | numeric                     |           |          |         | main     |              |   
 tolls_amount          | numeric                     |           |          |         | main     |              |   
 improvement_surcharge | numeric                     |           |          |         | main     |              |   
 total_amount          | numeric                     |           |          |         | main     |              |   
Indexes:  
    "rides_passenger_count_pickup_datetime_idx" btree (passenger_count, pickup_datetime DESC)  
    "rides_pickup_datetime_vendor_id_idx" btree (pickup_datetime DESC, vendor_id)  
    "rides_rate_code_pickup_datetime_idx" btree (rate_code, pickup_datetime DESC)  
    "rides_vendor_id_pickup_datetime_idx" btree (vendor_id, pickup_datetime DESC)  
Child tables: _timescaledb_internal._hyper_1_1_chunk,  
              _timescaledb_internal._hyper_1_2_chunk,  
              _timescaledb_internal._hyper_1_3_chunk,  
              _timescaledb_internal._hyper_1_4_chunk  
  
其中一个分片的约束如下  
Check constraints:  
    "constraint_1" CHECK (pickup_datetime >= '2015-12-31 00:00:00'::timestamp without time zone AND pickup_datetime < '2016-01-30 00:00:00'::timestamp without time zone)  
    "constraint_2" CHECK (_timescaledb_internal.get_partition_hash(payment_type) >= 1073741823)  
Inherits: rides  
-- Peek behind the scenes  
  
postgres=#  select count(*) from rides;  
  count     
----------  
 10906858  
(1 row)  
  
Time: 376.247 ms  
postgres=# explain select count(*) from rides;  
                                                 QUERY PLAN                                                   
------------------------------------------------------------------------------------------------------------  
 Finalize Aggregate  (cost=254662.23..254662.24 rows=1 width=8)  
   ->  Gather  (cost=254661.71..254662.22 rows=5 width=8)  
         Workers Planned: 5  
         ->  Partial Aggregate  (cost=253661.71..253661.72 rows=1 width=8)  
               ->  Append  (cost=0.00..247468.57 rows=2477258 width=0)  
                     ->  Parallel Seq Scan on rides  (cost=0.00..0.00 rows=1 width=0)  
                     ->  Parallel Seq Scan on _hyper_1_1_chunk  (cost=0.00..77989.57 rows=863657 width=0)  
                     ->  Parallel Seq Scan on _hyper_1_2_chunk  (cost=0.00..150399.01 rows=1331101 width=0)  
                     ->  Parallel Seq Scan on _hyper_1_3_chunk  (cost=0.00..6549.75 rows=112675 width=0)  
                     ->  Parallel Seq Scan on _hyper_1_4_chunk  (cost=0.00..12530.24 rows=169824 width=0)  
(10 rows)  

10、也可以直接查分片

postgres=# select count(*) from  _timescaledb_internal._hyper_1_1_chunk;  
  count    
---------  
 3454961  
(1 row)  

分片对用户完全透明

分片元数据:

postgres=# \dn  
         List of schemas  
         Name          |  Owner     
-----------------------+----------  
 _timescaledb_cache    | postgres  
 _timescaledb_catalog  | postgres  
 _timescaledb_internal | postgres  
 public                | postgres  
(4 rows)  

timescaleDB + PostGIS 双剑合璧 - 时空数据库

结合时序数据库timescaleDB插件,空间数据库PostGIS插件。PostgreSQL可以很好的处理空间数据。

1、创建空间数据库PostGIS创建

create extension postgis;  

2、添加空间类型字段

http://postgis.net/docs/manual-2.4/AddGeometryColumn.html

postgres=# SELECT AddGeometryColumn ('public','rides','pickup_geom',2163,'POINT',2);  
                   addgeometrycolumn                      
--------------------------------------------------------  
 public.rides.pickup_geom  SRID:2163 TYPE:POINT DIMS:2   
(1 row)  
  
postgres=# SELECT AddGeometryColumn ('public','rides','dropoff_geom',2163,'POINT',2);  
                    addgeometrycolumn                      
---------------------------------------------------------  
 public.rides.dropoff_geom  SRID:2163 TYPE:POINT DIMS:2   
(1 row)  
  
postgres=#   
postgres=# \d+ rides  
                                                     Table "public.rides"  
        Column         |            Type             | Collation | Nullable | Default | Storage  | Stats target | Description   
-----------------------+-----------------------------+-----------+----------+---------+----------+--------------+-------------  
 vendor_id             | text                        |           |          |         | extended |              |   
 pickup_datetime       | timestamp without time zone |           | not null |         | plain    |              |   
 dropoff_datetime      | timestamp without time zone |           | not null |         | plain    |              |   
 passenger_count       | numeric                     |           |          |         | main     |              |   
 trip_distance         | numeric                     |           |          |         | main     |              |   
 pickup_longitude      | numeric                     |           |          |         | main     |              |   
 pickup_latitude       | numeric                     |           |          |         | main     |              |   
 rate_code             | integer                     |           |          |         | plain    |              |   
 dropoff_longitude     | numeric                     |           |          |         | main     |              |   
 dropoff_latitude      | numeric                     |           |          |         | main     |              |   
 payment_type          | integer                     |           |          |         | plain    |              |   
 fare_amount           | numeric                     |           |          |         | main     |              |   
 extra                 | numeric                     |           |          |         | main     |              |   
 mta_tax               | numeric                     |           |          |         | main     |              |   
 tip_amount            | numeric                     |           |          |         | main     |              |   
 tolls_amount          | numeric                     |           |          |         | main     |              |   
 improvement_surcharge | numeric                     |           |          |         | main     |              |   
 total_amount          | numeric                     |           |          |         | main     |              |   
 pickup_geom           | geometry(Point,2163)        |           |          |         | main     |              |   
 dropoff_geom          | geometry(Point,2163)        |           |          |         | main     |              |   
Indexes:  
    "rides_passenger_count_pickup_datetime_idx" btree (passenger_count, pickup_datetime DESC)  
    "rides_pickup_datetime_vendor_id_idx" btree (pickup_datetime DESC, vendor_id)  
    "rides_rate_code_pickup_datetime_idx" btree (rate_code, pickup_datetime DESC)  
    "rides_vendor_id_pickup_datetime_idx" btree (vendor_id, pickup_datetime DESC)  
Child tables: _timescaledb_internal._hyper_1_1_chunk,  
              _timescaledb_internal._hyper_1_2_chunk,  
              _timescaledb_internal._hyper_1_3_chunk,  
              _timescaledb_internal._hyper_1_4_chunk  

3、将数据更新到geometry字段(实际存储为两个自动,分别表示经度和纬度。实际上不更新也没关系,因为PG支持表达式索引,完全可以使用这两个字段,创建表达式空间索引)。

-- Generate the geometry points and write to table  
--   (Note: These calculations might take a few mins)  
  
UPDATE rides SET pickup_geom = ST_Transform(ST_SetSRID(ST_MakePoint(pickup_longitude,pickup_latitude),4326),2163);  
UPDATE rides SET dropoff_geom = ST_Transform(ST_SetSRID(ST_MakePoint(dropoff_longitude,dropoff_latitude),4326),2163);  
  
  
vacuum full rides;  

4、时空分析举例。

在(lat, long) (40.7589,-73.9851)附近400米范围内,每30分钟有多少辆车被叫(以上车位置来计算)。

-- Number of rides on New Years Eve originating within  
--   400m of Times Square, by 30 min buckets  
--   Note: Times Square is at (lat, long) (40.7589,-73.9851)  
  
SELECT time_bucket('30 minutes', pickup_datetime) AS thirty_min, COUNT(*) AS near_times_sq  
  FROM rides  
  WHERE ST_Distance(pickup_geom, ST_Transform(ST_SetSRID(ST_MakePoint(-73.9851,40.7589),4326),2163)) < 400  
    AND pickup_datetime < '2016-01-01 14:00'  
  GROUP BY thirty_min ORDER BY thirty_min;  
  
     thirty_min      | near_times_sq  
---------------------+--------------  
 2016-01-01 00:00:00 |      74  
 2016-01-01 00:30:00 |     102  
 2016-01-01 01:00:00 |     120  
 2016-01-01 01:30:00 |      98  
 2016-01-01 02:00:00 |     112  
 2016-01-01 02:30:00 |     109  
 2016-01-01 03:00:00 |     163  
 2016-01-01 03:30:00 |     181  
 2016-01-01 04:00:00 |     214  
 2016-01-01 04:30:00 |     185  
 2016-01-01 05:00:00 |     158  
 2016-01-01 05:30:00 |     113  
 2016-01-01 06:00:00 |     102  
 2016-01-01 06:30:00 |      91  
 2016-01-01 07:00:00 |      88  
 2016-01-01 07:30:00 |      58  
 2016-01-01 08:00:00 |      72  
 2016-01-01 08:30:00 |      94  
 2016-01-01 09:00:00 |     115  
 2016-01-01 09:30:00 |     118  
 2016-01-01 10:00:00 |     135  
 2016-01-01 10:30:00 |     160  
 2016-01-01 11:00:00 |     212  
 2016-01-01 11:30:00 |     229  
 2016-01-01 12:00:00 |     244  
 2016-01-01 12:30:00 |     230  
 2016-01-01 13:00:00 |     235  
 2016-01-01 13:30:00 |     238  

实例2 - 传感器数据、天气数据

http://docs.timescale.com/v0.8/tutorials/other-sample-datasets

不再赘述。

timescaleDB 常用API

http://docs.timescale.com/v0.8/api

1、创建时序表

create_hypertable()

Required Arguments

Name Description
main_table Identifier of table to convert to hypertable
time_column_name Name of the column containing time values

Optional Arguments

Name Description
partitioning_column Name of an additional column to partition by. If provided, number_partitions must be set.
number_partitions Number of hash partitions to use for partitioning_column when this optional argument is supplied. Must be > 0.
chunk_time_interval Interval in event time that each chunk covers. Must be > 0. Default is 1 month.
create_default_indexes Boolean whether to create default indexes on time/partitioning columns. Default is TRUE.
if_not_exists Boolean whether to print warning if table already converted to hypertable or raise exception. Default is FALSE.
partitioning_func The function to use for calculating a value's partition.

2、添加多级分片字段

支持hash和interval分片

add_dimension()

Required Arguments

Name Description
main_table Identifier of hypertable to add the dimension to.
column_name Name of the column to partition by.

Optional Arguments

Name Description
number_partitions Number of hash partitions to use on column_name. Must be > 0.
interval_length Interval that each chunk covers. Must be > 0.
partitioning_func The function to use for calculating a value's partition (see create_hypertable instructions).

3、删除分片

删除指定 时间点、多久 之前的分片

drop_chunks()

Required Arguments

Name Description
older_than Timestamp of cut-off point for data to be dropped, i.e., anything older than this should be removed.

Optional Arguments

Name Description
table_name Hypertable name from which to drop chunks. If not supplied, all hypertables are affected.
schema_name Schema name of the hypertable from which to drop chunks. Defaults to public.
cascade Boolean on whether to CASCADE the drop on chunks, therefore removing dependent objects on chunks to be removed. Defaults to FALSE.

4、设置分片时间区间

set_chunk_time_interval()

Required Arguments

Name Description
main_table Identifier of hypertable to update interval for.
chunk_time_interval Interval in event time that each new chunk covers. Must be > 0.

5、分析函数 - 第一条

first()

Required Arguments

Name Description
value The value to return (anyelement)
time The timestamp to use for comparison (TIMESTAMP/TIMESTAMPTZ or integer type)

例如,查找所有传感器的最早上传的温度值。

SELECT device_id, first(temp, time)  
  FROM metrics  
  GROUP BY device_id;  

使用递归亦可实现:

《PostgrSQL 递归SQL的几个应用 - 极客与正常人的思维》

6、分析函数 - 最后一条

last()

Required Arguments

Name Description
value The value to return (anyelement)
time The timestamp to use for comparison (TIMESTAMP/TIMESTAMPTZ or integer type)

例如,查找每5分钟时间区间内,每个传感器的最新温度值

SELECT device_id, time_bucket('5 minutes', time) as interval,  
  last(temp, time)  
  FROM metrics  
  WHERE time > now () - interval '1 day'  
  GROUP BY device_id, interval  
  ORDER BY interval DESC;  

使用递归亦可实现:

《PostgrSQL 递归SQL的几个应用 - 极客与正常人的思维》

7、分析函数 - 柱状图

histogram()

Required Arguments

Name Description
value A set of values to partition into a histogram
min The histogram’s lower bound used in bucketing
max The histogram’s upper bound used in bucketing
nbuckets The integer value for the number of histogram buckets (partitions)

例如,

电池电量20到60,均分为5个BUCKET区间,返回5+2个值的数组(表示每个bucket区间的记录数),头尾分为别为边界外的记录数有多少。

SELECT device_id, histogram(battery_level, 20, 60, 5)  
  FROM readings  
  GROUP BY device_id  
  LIMIT 10;  
  
 device_id  |          histogram  
------------+------------------------------  
 demo000000 | {0,0,0,7,215,206,572}  
 demo000001 | {0,12,173,112,99,145,459}  
 demo000002 | {0,0,187,167,68,229,349}  
 demo000003 | {197,209,127,221,106,112,28}  
 demo000004 | {0,0,0,0,0,39,961}  
 demo000005 | {12,225,171,122,233,80,157}  
 demo000006 | {0,78,176,170,8,40,528}  
 demo000007 | {0,0,0,126,239,245,390}  
 demo000008 | {0,0,311,345,116,228,0}  
 demo000009 | {295,92,105,50,8,8,442}  

8、分析函数 - 时间区间

类似date_trunc,但是更强大,可以用任意interval进行时间截断。方便用户使用。

time_bucket()

Required Arguments

Name Description
bucket_width A PostgreSQL time interval for how long each bucket is (interval)
time The timestamp to bucket (timestamp/timestamptz/date)

Optional Arguments

Name Description
offset The time interval to offset all buckets by (interval)

9、数据概貌查看函数 - 时序表概貌

hypertable_relation_size_pretty()

SELECT * FROM hypertable_relation_size_pretty('conditions');  
  
 table_size | index_size | toast_size | total_size  
------------+------------+------------+------------  
 1171 MB    | 1608 MB    | 176 kB     | 2779 MB  

10、数据概貌查看函数 - 分片大小

chunk_relation_size_pretty()

SELECT * FROM chunk_relation_size_pretty('conditions');  
  
                chunk_table                 | table_size | index_size | total_size  
---------------------------------------------+------------+------------+------------  
 "_timescaledb_internal"."_hyper_1_1_chunk"  | 28 MB      | 36 MB      | 64 MB  
 "_timescaledb_internal"."_hyper_1_2_chunk"  | 57 MB      | 78 MB      | 134 MB  
 ...  

11、数据概貌查看函数 - 索引大小

indexes_relation_size_pretty()

SELECT * FROM indexes_relation_size_pretty('conditions');  
  
             index_name_              | total_size  
--------------------------------------+------------  
 public.conditions_device_id_time_idx | 1143 MB  
 public.conditions_time_idx           | 465 MB  

12、导出时序元数据

https://raw.githubusercontent.com/timescale/timescaledb/master/scripts/dump_meta_data.sql

psql [your connect flags] -d your_timescale_db < dump_meta_data.sql > dumpfile.txt  

小结

timescaleDB是一个非常好用的时序数据处理插件,隐藏了分片逻辑(对用户透明),同时提供了大量的API函数接口,以及性能优化。在时序场景使用很赞。

结合PostGIS插件,PostgreSQL在时空处理这块,如虎添翼。

参考

《PostgreSQL 按需切片的实现(TimescaleDB插件自动切片功能的plpgsql schemaless实现)》

《时序数据库有哪些特点? TimescaleDB时序数据库介绍》

《PostgreSQL on Linux 最佳部署手册》

http://docs.timescale.com/v0.8/tutorials/tutorial-hello-nyc

http://docs.timescale.com/v0.8/introduction

《PostgrSQL 递归SQL的几个应用 - 极客与正常人的思维》

相关实践学习
使用PolarDB和ECS搭建门户网站
本场景主要介绍基于PolarDB和ECS实现搭建门户网站。
阿里云数据库产品家族及特性
阿里云智能数据库产品团队一直致力于不断健全产品体系,提升产品性能,打磨产品功能,从而帮助客户实现更加极致的弹性能力、具备更强的扩展能力、并利用云设施进一步降低企业成本。以云原生+分布式为核心技术抓手,打造以自研的在线事务型(OLTP)数据库Polar DB和在线分析型(OLAP)数据库Analytic DB为代表的新一代企业级云原生数据库产品体系, 结合NoSQL数据库、数据库生态工具、云原生智能化数据库管控平台,为阿里巴巴经济体以及各个行业的企业客户和开发者提供从公共云到混合云再到私有云的完整解决方案,提供基于云基础设施进行数据从处理、到存储、再到计算与分析的一体化解决方案。本节课带你了解阿里云数据库产品家族及特性。
目录
相关文章
|
21天前
|
SQL 存储 关系型数据库
PostgreSQL窗口函数避坑指南:如何让复杂分析查询提速300%?
本文基于真实企业级案例,深入剖析PostgreSQL窗口函数的执行原理与性能陷阱,提供8大优化策略。通过定制索引、分区裁剪、内存调优及并行处理等手段,将分钟级查询压缩至秒级响应。结合CTE分阶段计算与物化视图技术,解决海量数据分析中的瓶颈问题。某金融客户实践表明,风险分析查询从47秒降至0.8秒,效率提升5800%。文章附带代码均在PostgreSQL 15中验证,助您高效优化SQL性能。
|
3月前
|
存储 缓存 数据库
数据库数据删除策略:硬删除vs软删除的最佳实践指南
在项目开发中,“删除”操作常见但方式多样,主要分为硬删除与软删除。硬删除直接从数据库移除数据,操作简单、高效,但不可恢复;适用于临时或敏感数据。软删除通过标记字段保留数据,支持恢复和审计,但增加查询复杂度与数据量;适合需追踪历史或可恢复的场景。两者各有优劣,实际开发中常结合使用以满足不同需求。
211 4
|
4月前
|
人工智能 前端开发 JavaScript
代码采纳率从 22% 到 33%,通义灵码辅助数据库智能编码实践
通义灵码本质上是一个AI agent,它已经进行了大量的优化。然而,为了更完美或有效地调用模型的潜在能力,我们在使用时仍需掌握一些技巧。通常,大多数人在使用通义灵码时会直接上手,这是 AI agent 的一个优势,即 zero shot 使用,无需任何上下文即可直接使用通义灵码的能力。
|
1月前
|
存储 SQL Java
数据存储使用文件还是数据库,哪个更合适?
数据库和文件系统各有优劣:数据库读写性能较低、结构 rigid,但具备计算能力和数据一致性保障;文件系统灵活易管理、读写高效,但缺乏计算能力且无法保证一致性。针对仅需高效存储与灵活管理的场景,文件系统更优,但其计算短板可通过开源工具 SPL(Structured Process Language)弥补。SPL 提供独立计算语法及高性能文件格式(如集文件、组表),支持复杂计算与多源混合查询,甚至可替代数据仓库。此外,SPL 易集成、支持热切换,大幅提升开发运维效率,是后数据库时代文件存储的理想补充方案。
|
4月前
|
数据库 Python
【YashanDB知识库】python驱动查询gbk字符集崖山数据库CLOB字段,数据被驱动截断
【YashanDB知识库】python驱动查询gbk字符集崖山数据库CLOB字段,数据被驱动截断
|
3月前
|
人工智能 关系型数据库 分布式数据库
让数据与AI贴得更近,阿里云瑶池数据库系列产品焕新升级
4月9日阿里云AI势能大会上,阿里云瑶池数据库发布重磅新品及一系列产品能力升级。「推理加速服务」Tair KVCache全新上线,实现KVCache动态分层存储,显著提高内存资源利用率,为大模型推理降本提速。
|
5月前
|
SQL 数据建模 BI
【YashanDB 知识库】用 yasldr 配置 Bulkload 模式作单线程迁移 300G 的业务数据到分布式数据库,迁移任务频繁出错
问题描述 详细版本:YashanDB Server Enterprise Edition Release 23.2.4.100 x86_64 6db1237 影响范围: 离线数据迁移场景,影响业务数据入库。 外场将部分 NewCIS 的报表业务放到分布式数据库,验证 SQL 性能水平。 操作系统环境配置: 125G 内存 32C CPU 2T 的 HDD 磁盘 问题出现的步骤/操作: 1、部署崖山分布式数据库 1mm 1cn 3dn 单线启动 yasldr 数据迁移任务,设置 32 线程的 bulk load 模式 2、观察 yasldr.log 是否出现如下错
|
5月前
|
存储 数据挖掘 数据处理
2600 万表流计算分析如何做到? 时序数据库 TDengine 助力数百家超市智能化转型
在生鲜超市的高效运营中,实时数据分析至关重要。万象云鼎的“云鲜生”通过智能秤+网关+软件系统的组合,实现了销售数据的精准管理与优化。而在数据处理方面,TDengine 的流计算能力成为了这一方案的核心支撑。本文详细分享了“云鲜生”如何利用 TDengine 高效存储和分析海量销售数据,在优化超市运营、提升用户体验的同时,解决高基数分组、高并发查询等技术挑战。
121 1
|
4月前
|
存储 监控 数据挖掘
消防行业如何借助时序数据库 TDengine 打造高效的数据监控与分析系统
本篇文章来自“2024,我想和 TDengine 谈谈”征文活动的优秀投稿,深入探讨了如何在消防行业中运用 TDengine 进行业务建模。文章重点介绍了如何通过 TDengine 的超级表、标签设计和高效查询功能,有效管理消防监控系统中的时序数据。作者详细阐述了实时监控、报警系统以及历史数据分析在消防行业中的应用,展示了 TDengine 在数据压缩、保留策略和分布式架构下的强大优势。
103 0

相关产品

  • 云原生数据库 PolarDB
  • 推荐镜像

    更多