使用INT4/INT类型替换INT8/BIGINT类型能够节省多少磁盘空间？-阿里云开发者社区

使用INT4/INT类型替换INT8/BIGINT类型能够节省多少磁盘空间？

最近有人在IRC，Slack和Reddit上讨论使用int4/integer替代int8/bigint能够少4个字节。事实并非如此，来解释下。

进行下测试，我的环境64位，Ryzen处理器：

$ select version();
                                              version                                               
────────────────────────────────────────────────────────────────────────────────────────────────────
 PostgreSQL 15devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.2.0-7ubuntu2) 11.2.0, 64-bit
(1 row)

$ create table test8 (id int8);
CREATE TABLE
$ create table test4 (id int4);
CREATE TABLE
$ insert into test8 select generate_series(1,1000000) i;
INSERT 0 1000000
$ insert into test4 select generate_series(1,1000000) i;
INSERT 0 1000000
$ \dt+ test*
                                  List of relations
 Schema │ Name  │ Type  │ Owner  │ Persistence │ Access method │ Size  │ Description
────────┼───────┼───────┼────────┼─────────────┼───────────────┼───────┼─────────────
 public │ test4 │ table │ depesz │ permanent   │ heap          │ 35 MB │
 public │ test8 │ table │ depesz │ permanent   │ heap          │ 35 MB │
(2 rows)

创建单列表，插入1百万行记录。结果表大小一样大。

接着在32位环境上测试：

postgres=# select version();
                                                         version                                                         
-------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 13.5 (Debian 13.5-0+deb11u1) on i686-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 32-bit
(1 row)
...
postgres=# \dt+ test*
                           List of relations
 Schema | Name  | Type  |  Owner   | Persistence | Size  | Description 
--------+-------+-------+----------+-------------+-------+-------------
 public | test4 | table | postgres | permanent   | 31 MB | 
 public | test8 | table | postgres | permanent   | 35 MB | 
(2 rows)

有趣的是，这里表大小减少了4M。为什么？是什么让它以这种方式工作？

答案是性能。由于性能原因PG将数据对齐到架构相关的大小，也就是说在64位系统上对齐8字节。究竟什么是对齐？这意味着分配的最小单位是8字节。如果技术上可行，PG不会将单个值拆分位多个8字节的块。这意味着，如果你表中只有4字节的列，无论如何都会使用8字节。如果有2个int4的列，他们将8字节对齐，仅使用这一个8字节。但是如果有int4，下一个列超过4字节，PG不会将另一个列拆分为“前一个8字节的4个字节的一部分，下一个中的一部分”，一切都将转到下一个8字节块中。

我们通过简单测试来看，仍在64位机器上：

$ create table test as select i::int4 as i1, i::int4 as i2 from generate_series(1,1000000) i;
SELECT 1000000
$ \dt+ test
                                 List of relations
 Schema │ Name │ Type  │ Owner  │ Persistence │ Access method │ Size  │ Description
────────┼──────┼───────┼────────┼─────────────┼───────────────┼───────┼─────────────
 public │ test │ table │ depesz │ permanent   │ heap          │ 35 MB │
(1 row)

在这里你可以看到，当我用两个 int4 列创建表时，它使用的空间与单个 int8 相同。现在让我们看看当我创建一个包含 int4 + int8 列的表时会发生什么。如果我的数学是正确的，它应该使用大约43MB：

$ create table test48 as select i::int4 as i1, i::int8 as i2 from generate_series(1,1000000) i;
SELECT 1000000
$ \dt+ test48
                                  List of relations
 Schema │  Name  │ Type  │ Owner  │ Persistence │ Access method │ Size  │ Description
────────┼────────┼───────┼────────┼─────────────┼───────────────┼───────┼─────────────
 public │ test48 │ table │ depesz │ permanent   │ heap          │ 42 MB │
(1 row)

它稍微少一点，但正如所料，它显然接近我的预期。您还可以看到，对于更短类型也是如此。INT2（2字节）或者BOOL（1字节）仍占用整个8字节，除非他们可以容纳上一列/下一列的8字节块：

$ create table test2 as select (i % 32000)::int2 as i2 from generate_series(1,1000000) i;
SELECT 1000000
$ \dt+ test2
                                  List of relations
 Schema │ Name  │ Type  │ Owner  │ Persistence │ Access method │ Size  │ Description
────────┼───────┼───────┼────────┼─────────────┼───────────────┼───────┼─────────────
 public │ test2 │ table │ depesz │ permanent   │ heap          │ 35 MB │
(1 row)
$ create table testb as select 'true'::bool as b from generate_series(1,1000000) i;
SELECT 1000000
$ \dt+ testb
                                  List of relations
 Schema │ Name  │ Type  │ Owner  │ Persistence │ Access method │ Size  │ Description
────────┼───────┼───────┼────────┼─────────────┼───────────────┼───────┼─────────────
 public │ testb │ table │ depesz │ permanent   │ heap          │ 35 MB │
(1 row)

他们都是35MB。为什么呢？还是性能。我不知道底层细节，但根据我的理解，处理器以与架构相关的块大小处理数据。64位处理器在64位上工作，意味着如果对int4（8字节块的一部分）执行某些操作，则必须添加操作以将其他32位归零。

差异太小不易测试，并会因负载的随机波动而相形见绌，但它确实存在。在重负载的机器上可能会有所区别。

因此可以通过切换到更小的数据类型来使用更少的磁盘空间，但您必须非常小心表中列的顺序。当设计表时，这可能非常容易但一旦部署应用程序就不再合理，现在只是更改架构以适应新功能。

还有一个因素要考虑。注意，表中放入了100万行和8MB的数据，表大小35MB，剩下的是什么？除了通常可见的列之外，PG中每行都有一些额外的系统列，而且他们大小非零，可以看到：

$ SELECT
    a.attname,
    t.typname,
    t.typlen,
    a.attnum
FROM
    pg_attribute a
    JOIN pg_type t ON a.atttypid = t.oid
WHERE
    a.attrelid = 'test8'::regclass
ORDER BY
    attnum;
 attname  │ typname │ typlen │ attnum
──────────┼─────────┼────────┼────────
 tableoid │ oid     │      4 │     -6
 cmax     │ cid     │      4 │     -5
 xmax     │ xid     │      4 │     -4
 cmin     │ cid     │      4 │     -3
 xmin     │ xid     │      4 │     -2
 ctid     │ tid     │      6 │     -1
 id       │ int8    │      8 │      1
(7 rows

对于每一行，都有tableoid、cmax、xmax、cmin和ctid（tableoid和ctid在数据文件中并不存在），然后再表格中有“真实列”：

https://paste.depesz.com/s/77i

可以看到这些信息：

$ select tableoid, cmax, xmax, cmin, xmin, ctid, id from test8 limit 1;
 tableoid │ cmax │ xmax │ cmin │ xmin │ ctid  │ id
──────────┼──────┼──────┼──────┼──────┼───────┼────
   307696 │    0 │    0 │    0 │ 1773 │ (0,1) │  1
(1 row)

可以在手册中找到有关列的含义描述。从另一方面说，如果在表中添加了一个4字节的列，那么在它旁边添加另一个4字节的列（磁盘空间方面）基本上是不会增加的。

看一个真实的表：

=> select 'col_' || a.attnum, a.atttypid::regtype, a.attlen from pg_attribute a where a.attrelid = 'accounts'::regclass and attnum > 0;
 ?column? │          atttypid           │ attlen 
══════════╪═════════════════════════════╪════════
 col_1    │ bigint                      │      8
 col_2    │ text                        │     -1
 col_3    │ timestamp without time zone │      8
 col_4    │ timestamp without time zone │      8
 col_5    │ text                        │     -1
 col_6    │ timestamp without time zone │      8
 col_7    │ bigint                      │      8
 col_8    │ text                        │     -1
 col_9    │ bigint                      │      8
 col_10   │ bigint                      │      8
 col_11   │ bigint                      │      8
 col_12   │ bigint                      │      8
 col_13   │ text                        │     -1
 col_14   │ text                        │     -1
 col_15   │ text                        │     -1
 col_16   │ bigint                      │      8
 col_17   │ bigint                      │      8
 col_18   │ boolean                     │      1
 col_19   │ text                        │     -1
 col_20   │ text                        │     -1
 col_21   │ text                        │     -1
 col_22   │ text                        │     -1
 col_23   │ text                        │     -1
 col_24   │ text                        │     -1
 col_25   │ boolean                     │      1
 col_26   │ boolean                     │      1
 col_27   │ text                        │     -1
 col_28   │ text                        │     -1
 col_29   │ text                        │     -1
 col_30   │ text                        │     -1
 col_31   │ text                        │     -1
 col_32   │ bigint                      │      8
 col_33   │ bigint                      │      8
 col_34   │ text                        │     -1
 col_35   │ bigint                      │      8
 col_36   │ text                        │     -1
 col_37   │ text                        │     -1
 col_38   │ text                        │     -1
 col_39   │ text                        │     -1
 col_40   │ bigint                      │      8
 col_41   │ text                        │     -1
 col_42   │ bigint                      │      8
 col_43   │ bigint                      │      8
(43 rows)

attnum > 0表示过滤掉系统列，-1的attlen表示数据长度可变，具体取决于实际有多少数据。

由于都是texts，估算每行大小比较复杂，但是假设他们每个只占8个字节块，鉴于列的排列方式，只有第25和26列可以放入单个8字节块中。所以，总行大小（无系统列）42 * 8 bytes = 336 bytes

现在，将每个Int8更改为int4，可以将第9-12列合并为2个8字节块，将第16-17列合并一个，对于32，33和42，43也是如此。总计296字节。这意味着我们将每行节省40字节。注意，假设23个文本列都不会适应超过8字节，猜猜它的可能性。

可以简单的看下行的实际宽度：

=> explain select * from accounts;
                            QUERY PLAN                             
═══════════════════════════════════════════════════════════════════
 Seq Scan on accounts  (cost=0.00..3979.23 rows=100323 width=1113)
(1 row)

每行超过1kb，在此节省40字节或多或少等于舍入误差。所以在某些情况下，使用int4/int2是有益的。可以通过使用较小的数据类型来节省一些磁盘空间，但是差异并没有那么大。需要仔细规划。

下面脚本有助于PG列找到更好的对齐方式：

https://github.com/NikolayS/postgres_dba/blob/master/sql/p1_alignment_padding.sql

原文

https://www.depesz.com/2022/02/13/how-much-disk-space-you-can-save-by-using-int4-int-instead-of-int8-bigint/

使用INT4/INT类型替换INT8/BIGINT类型能够节省多少磁盘空间？

原文

热门文章

最新文章

相关电子书

热门

活动广场

任务中心

开发者评测

高校计划

乘风者计划

训练营

阿里云MVP

话题

直播

下载

镜像站

技术资料

插件

使用INT4/INT类型替换INT8/BIGINT类型能够节省多少磁盘空间？

原文

热门文章

最新文章

相关电子书