ZFS (sync, async) R/W IOPS / throughput performance tuning - Alibaba Cloud Developer Community

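This article discusses techniques for tuning ZFS read/write IOPS and throughput. Reads and writes are each considered in both their synchronous and asynchronous forms.

Factors that affect performance

1. The performance of the underlying devices directly determines synchronous read/write IOPS and throughput. Asynchronous read/write performance also depends on the cache (ARC, L2ARC) devices and their configuration.

2. The redundancy level chosen for each vdev affects IOPS and throughput.

Because a zpool spreads I/O evenly across its vdevs, the more vdevs a pool has, the better its IOPS and throughput.

Within a single vdev, write performance ranks mirror > raidz1 > raidz2 > raidz3.

Read performance depends on the number of disks that actually hold data: raidz1 (3 disks) = raidz2 (4 disks) = raidz3 (5 disks) > mirror (n disks).

3. I/O alignment of the underlying devices affects IOPS.

ashift is specified when the zpool (or a new vdev) is created and cannot be changed afterwards.

Devices within the same vdev should ideally have the same sector size. If they differ, use the largest sector size as the ashift, or place the mismatched block devices in separate vdevs.

For example, suppose sda and sdb have 512-byte sectors while sdc and sdd have 4K sectors: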

zpool create -o ashift=9 zp1 mirror sda sdb
zpool add -o ashift=12 zp1 mirror sdc sdd
       ashift
           Pool  sector  size exponent, to the power of 2 (internally referred to as "ashift"). I/O operations will be
           aligned to the specified size boundaries. Additionally, the minimum (disk) write size will be  set  to  the
           specified  size,  so  this  represents a space vs. performance trade-off. The typical case for setting this
           property is when performance is important and the underlying disks use 4KiB sectors but report 512B sectors
           to the OS (for compatibility reasons); in that case, set ashift=12 (which is 1<<12 = 4096).

           For  optimal  performance,  the  pool sector size should be greater than or equal to the sector size of the
           underlying disks. Since the property cannot be changed after pool creation, if in a given  pool,  you  ever
           want to use drives that report 4KiB sectors, you must set ashift=12 at pool creation time.

           Keep in mind is that the ashift is vdev specific and is not a pool global.  This means that when adding new
           vdevs to an existing pool you may need to specify the ashift.
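
Since ashift is the base-2 exponent of the sector size, it can be derived from whatever physical sector size the device reports. The sketch below is only an illustration: ashift_for is a hypothetical helper name, not a ZFS command, and it assumes the input is a power of two. On Linux the input could come from something like blockdev --getpbsz, bearing in mind that some 4KiB drives report 512.

```shell
#!/bin/sh
# ashift_for SIZE: print log2(SIZE) for a power-of-two sector size in bytes.
# Hypothetical helper for illustration; not part of ZFS.
ashift_for() {
    size=$1
    exp=0
    while [ "$size" -gt 1 ]; do
        size=$((size / 2))
        exp=$((exp + 1))
    done
    echo "$exp"
}

ashift_for 512     # 512-byte sectors -> 9
ashift_for 4096    # 4KiB sectors     -> 12
ashift_for 8192    # 8KiB             -> 13
```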

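The ZFS on Linux source includes a table of sector sizes for some common devices:

https://github.com/zfsonlinux/zfs/blob/master/cmd/zpool/zpool_vdev.c#L108

If you do not know the sector size of the underlying device, you can set ashift to 13 (8K) to guarantee alignment.

For example: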

# zpool create -o ashift=13 zp1 scsi-36c81f660eb17fb001b2c5fec6553ff5e
# zpool create -o ashift=9 zp2 scsi-36c81f660eb17fb001b2c5ff465cff3ed
# zfs create -o mountpoint=/data01 zp1/data01
# zfs create -o mountpoint=/data02 zp2/data02

# date +%F%T; dd if=/dev/zero of=/data01/test.img bs=1024K count=8192 oflag=sync,noatime,nonblock; date +%F%T;
2014-06-2609:57:35
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB) copied, 46.4277 s, 185 MB/s
2014-06-2609:58:22

# date +%F%T; dd if=/dev/zero of=/data02/test.img bs=1024K count=8192 oflag=sync,noatime,nonblock; date +%F%T;
2014-06-2609:58:32
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB) copied, 43.9984 s, 195 MB/s
2014-06-2609:59:16

# zpool list
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
zp1   3.62T  8.01G  3.62T     0%  1.00x  ONLINE  -
zp2   3.62T  8.00G  3.62T     0%  1.00x  ONLINE  -

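With large files there is no visible difference. With small files, any file smaller than the size implied by ashift wastes space, lowers small-file write efficiency, and increases cache usage.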
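
The space cost of a large ashift for small files can be estimated by rounding the file size up to the next multiple of 2^ashift. This is a deliberately simplified model (it ignores metadata, compression and raidz padding), and alloc_size is a hypothetical helper name used only for illustration.

```shell
#!/bin/sh
# alloc_size BYTES ASHIFT: round BYTES up to a multiple of 2^ASHIFT.
# Simplified model of the minimum on-disk write size; hypothetical helper.
alloc_size() {
    bytes=$1
    sector=$((1 << $2))
    echo $(( (bytes + sector - 1) / sector * sector ))
}

# A 1000-byte file under three ashift settings:
alloc_size 1000 9     # 1024
alloc_size 1000 12    # 4096
alloc_size 1000 13    # 8192
```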

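4. Underlying device mode: JBOD or passthrough is recommended, so that the RAID controller's own logic is bypassed.

5. ZFS parameters directly affect IOPS and throughput.

5.1

For database-style workloads (large files accessed in small, scattered chunks), choose a recordsize greater than or equal to the database block size. For example, PostgreSQL uses an 8K block_size, so a zfs recordsize of at least 8KB is recommended. In general, changing recordsize is not advised; the default of 128K satisfies most needs.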

       recordsize=size
           Specifies a suggested block size for files in the file system. This property is  designed  solely  for  use
           with  database  workloads  that  access  files  in  fixed-size records. ZFS automatically tunes block sizes
           according to internal algorithms optimized for typical access patterns.

           For databases that create very large files but access them in small random chunks, these algorithms may  be
           suboptimal.  Specifying a recordsize greater than or equal to the record size of the database can result in
           significant performance gains. Use of this property for general purpose file systems is  strongly  discour-
           aged, and may adversely affect performance.

           The  size  specified  must  be  a  power  of two greater than or equal to 512 and less than or equal to 128
           Kbytes.

           Changing the file system’s recordsize affects only files created afterward; existing files are  unaffected.

           This property can also be referred to by its shortened column name, recsize.
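
The constraint quoted above (a power of two between 512 bytes and 128 Kbytes) can be checked before calling zfs set. The sketch below follows only the limits stated in this man page excerpt; other ZFS versions may accept a different range. valid_recordsize is a hypothetical helper name.

```shell
#!/bin/sh
# valid_recordsize BYTES: succeed if BYTES is a power of two in [512, 131072],
# the range given in the man page excerpt above. Hypothetical helper.
valid_recordsize() {
    n=$1
    [ "$n" -ge 512 ] && [ "$n" -le 131072 ] && [ $((n & (n - 1))) -eq 0 ]
}

for size in 512 8192 131072 12000 262144; do
    if valid_recordsize "$size"; then
        echo "$size: ok"
    else
        echo "$size: rejected"
    fi
done
```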

Test:

# zpool create -o ashift=12 zp1 scsi-36c81f660eb17fb001b2c5fec6553ff5e
# zfs create -o mountpoint=/data01 -o recordsize=8K -o atime=off zp1/data01
# zfs create -o mountpoint=/data02 -o recordsize=128K -o atime=off zp1/data02
# zfs create -o mountpoint=/data03 -o recordsize=512 -o atime=off zp1/data03
Disable data caching (cache metadata only) so that the ARC does not influence the results.
# zfs set primarycache=metadata zp1/data01
# zfs set primarycache=metadata zp1/data02
# zfs set primarycache=metadata zp1/data03
# mkdir -p /data01/pgdata
# mkdir -p /data02/pgdata
# mkdir -p /data03/pgdata
# chown postgres:postgres /data0*/pgdata

pg_test_fsync results: recordsize=512 is the worst; 8K and 128K are about the same.

512
Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
        fdatasync                         252.052 ops/sec    3967 usecs/op
        fsync                             248.701 ops/sec    4021 usecs/op
Non-Sync'ed 8kB writes:
        write                            7615.510 ops/sec     131 usecs/op
8K
        fdatasync                         329.874 ops/sec    3031 usecs/op
        fsync                             329.008 ops/sec    3039 usecs/op
Non-Sync'ed 8kB writes:
        write                           83849.214 ops/sec      12 usecs/op
128K
        fdatasync                         329.207 ops/sec    3038 usecs/op
        fsync                             328.739 ops/sec    3042 usecs/op
Non-Sync'ed 8kB writes:
        write                           76100.311 ops/sec      13 usecs/op

5.2

Compression speed and compression ratio cannot both be maximized. LZ4 is generally recommended as a good compromise between the two.

       compression=on | off | lzjb | gzip | gzip-N | zle | lz4

           Controls the compression algorithm used for this dataset. The lzjb compression algorithm is  optimized  for
           performance  while  providing  decent data compression. Setting compression to on uses the lzjb compression
           algorithm.

           The gzip compression algorithm uses the same compression as the gzip(1) command. You can specify  the  gzip
           level  by using the value gzip-N where N is an integer from 1 (fastest) to 9 (best compression ratio). Cur-
           rently, gzip is equivalent to gzip-6 (which is also the default for gzip(1)).

           The zle (zero-length encoding) compression algorithm is a fast and simple algorithm to  eliminate  runs  of
           zeroes.

           The lz4 compression algorithm is a high-performance replacement for the lzjb algorithm. It features signif-
           icantly faster compression and decompression, as well as a moderately higher compression ratio  than  lzjb,
           but  can  only  be  used  on  pools with the lz4_compress feature set to enabled. See zpool-features(5) for
           details on ZFS feature flags and the lz4_compress feature.

           This property can also be referred to by its shortened column name compress. Changing this property affects
           only newly-written data.

Test: with and without compression, performance is about the same.

# zfs set compression=lz4 zp1/data02
# date +%F%T; dd if=/dev/zero of=/data02/test.img ibs=1024K obs=8K count=100 oflag=nonblock,sync,noatime; date +%F%T
2014-06-2610:59:16
100+0 records in
12800+0 records out
104857600 bytes (105 MB) copied, 38.9054 s, 2.7 MB/s
2014-06-2610:59:55

# zfs set compression=off zp1/data02
# date +%F%T; dd if=/dev/zero of=/data02/test.img ibs=1024K obs=8K count=100 oflag=nonblock,sync,noatime; date +%F%T
2014-06-2611:00:08
100+0 records in
12800+0 records out
104857600 bytes (105 MB) copied, 38.8295 s, 2.7 MB/s
2014-06-2611:00:46

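With compression enabled, pay attention to a few ZFS kernel module parameters. Depending on the settings, the L2ARC may not be able to cache compressed buffers.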

# modinfo zfs|grep compre
parm:           zfs_sync_pass_dont_compress:Don't compress starting in this pass (int)
parm:           zfs_mdcomp_disable:Disable meta data compression (int)
parm:           l2arc_nocompress:Skip compressing L2ARC buffers (int)

zio.c:int zfs_sync_pass_dont_compress = 5; /* don't compress starting in this pass */
zio.c:          if (pass >= zfs_sync_pass_dont_compress)
zio.c:module_param(zfs_sync_pass_dont_compress, int, 0644);
zio.c:MODULE_PARM_DESC(zfs_sync_pass_dont_compress,

static int
zio_write_bp_init(zio_t *zio)
{
                if (pass >= zfs_sync_pass_dont_compress)
                        compress = ZIO_COMPRESS_OFF;

arc.c
int l2arc_nocompress = B_FALSE;                 /* don't compress bufs */
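
The current values of module parameters like these can be read back at runtime from /sys/module/zfs/parameters, the same directory this article greps elsewhere. A minimal sketch follows; zfs_param is a hypothetical helper name, and the optional second argument (an alternative directory) exists only to make the function easy to try on a host without the zfs module.

```shell
#!/bin/sh
# zfs_param NAME [DIR]: print the value of a ZFS module parameter.
# DIR defaults to /sys/module/zfs/parameters. Hypothetical helper.
zfs_param() {
    dir=${2:-/sys/module/zfs/parameters}
    if [ -r "$dir/$1" ]; then
        cat "$dir/$1"
    else
        echo "parameter $1 not found in $dir" >&2
        return 1
    fi
}

# Only meaningful where the zfs module is loaded; failures are tolerated here.
zfs_param zfs_sync_pass_dont_compress || true
zfs_param l2arc_nocompress || true
```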

5.3

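Number of copies of each file's data. Setting this is generally not recommended unless neither the vdevs nor the underlying block devices provide any redundancy. It also affects write IOPS.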

       copies=1 | 2 | 3

           Controls  the  number of copies of data stored for this dataset. These copies are in addition to any redun-
           dancy provided by the pool, for example, mirroring or RAID-Z. The copies are stored on different disks,  if
           possible.  The  space  used  by multiple copies is charged to the associated file and dataset, changing the
           used property and counting against quotas and reservations.

           Changing this property only affects newly-written data. Therefore, set this property at  file  system  cre-
           ation time by using the -o copies=N option.

5.4

Data block checksumming. It has some impact on IOPS, but disabling it is strongly discouraged.

       checksum=on | off | fletcher2 | fletcher4 | sha256

           Controls the checksum used to verify data integrity. The default value is on, which  automatically  selects
           an appropriate algorithm (currently, fletcher4, but this may change in future releases). The value off dis-
           ables integrity checking on user data. Disabling checksums is NOT a recommended practice.

           Changing this property affects only newly-written data.

5.5 Whether to update file access timestamps. Turning this off is generally recommended, unless the application relies on access times.

       atime=on | off

           Controls  whether the access time for files is updated when they are read. Turning this property off avoids
           producing write traffic when reading files and can result in significant performance gains, though it might
           confuse mailers and other similar utilities. The default value is on.  See also relatime below.

5.6

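Primary cache (ARC) configuration.

all means both data and metadata use the ARC; none means the ARC is not used at all, which is equivalent to having no cache; metadata means only metadata is cached.

Enabling the cache greatly improves read performance, while write performance drops slightly (the difference is small).

The main effect is on reads: with the ARC disabled, read performance is very poor.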

       primarycache=all | none | metadata

           Controls what is cached in the primary cache (ARC). If this property is set to all, then both user data and
           metadata is cached. If this property is set to none, then neither user data nor metadata is cached. If this
           property is set to metadata, then only metadata is cached. The default value is all.

The limits on cache usage can be adjusted through ZFS kernel module parameters.

/sys/module/zfs/parameters/zfs_arc_grow_retry:5
/sys/module/zfs/parameters/zfs_arc_max:0
/sys/module/zfs/parameters/zfs_arc_memory_throttle_disable:1
/sys/module/zfs/parameters/zfs_arc_meta_limit:0
/sys/module/zfs/parameters/zfs_arc_meta_prune:1048576
/sys/module/zfs/parameters/zfs_arc_min:0
/sys/module/zfs/parameters/zfs_arc_min_prefetch_lifespan:1000
/sys/module/zfs/parameters/zfs_arc_p_aggressive_disable:1
/sys/module/zfs/parameters/zfs_arc_p_dampener_disable:1
/sys/module/zfs/parameters/zfs_arc_shrink_shift:5
parm:           zfs_arc_min:Min arc size (ulong)
parm:           zfs_arc_max:Max arc size (ulong)
parm:           zfs_arc_meta_limit:Meta limit for arc size (ulong)
parm:           zfs_arc_meta_prune:Bytes of meta data to prune (int)
parm:           zfs_arc_grow_retry:Seconds before growing arc size (int)
parm:           zfs_arc_p_aggressive_disable:disable aggressive arc_p grow (int)
parm:           zfs_arc_p_dampener_disable:disable arc_p adapt dampener (int)
parm:           zfs_arc_shrink_shift:log2(fraction of arc to reclaim) (int)
parm:           zfs_arc_memory_throttle_disable:disable memory throttle (int)
parm:           zfs_arc_min_prefetch_lifespan:Min life of prefetch block (int)
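
As an illustration of adjusting one of these limits, the sketch below builds the module option line that caps the ARC at a given number of GiB. The 8 GiB figure is an arbitrary example, not a recommendation, and arc_max_option is a hypothetical helper name. On Linux such a line is typically placed in a file under /etc/modprobe.d/ so that it applies when the zfs module loads; check your distribution's conventions before relying on that.

```shell
#!/bin/sh
# arc_max_option GIB: print a modprobe options line capping the ARC at GIB GiB.
# Hypothetical helper; the output is meant for a file under /etc/modprobe.d/.
arc_max_option() {
    gib=$1
    echo "options zfs zfs_arc_max=$((gib * 1024 * 1024 * 1024))"
}

arc_max_option 8    # options zfs zfs_arc_max=8589934592
```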

Kernel module parameters that limit the memory used for dirty data:

# modinfo zfs|grep dirty
parm:           zfs_vdev_async_write_active_max_dirty_percent:Async write concurrency max threshold (int)
parm:           zfs_vdev_async_write_active_min_dirty_percent:Async write concurrency min threshold (int)
parm:           zfs_dirty_data_max_percent:percent of ram can be dirty (int)
parm:           zfs_dirty_data_max_max_percent:zfs_dirty_data_max upper bound as % of RAM (int)
parm:           zfs_delay_min_dirty_percent:transaction delay threshold (int)
parm:           zfs_dirty_data_max:determines the dirty space limit (ulong)
parm:           zfs_dirty_data_max_max:zfs_dirty_data_max upper bound in bytes (ulong)
parm:           zfs_dirty_data_sync:sync txg when this much dirty data (ulong)
# grep ".*" /sys/module/zfs/parameters/*|grep dirty
/sys/module/zfs/parameters/zfs_delay_min_dirty_percent:60
/sys/module/zfs/parameters/zfs_dirty_data_max:3361508147
/sys/module/zfs/parameters/zfs_dirty_data_max_max:8403770368
/sys/module/zfs/parameters/zfs_dirty_data_max_max_percent:25
/sys/module/zfs/parameters/zfs_dirty_data_max_percent:10
/sys/module/zfs/parameters/zfs_dirty_data_sync:67108864
/sys/module/zfs/parameters/zfs_vdev_async_write_active_max_dirty_percent:60
/sys/module/zfs/parameters/zfs_vdev_async_write_active_min_dirty_percent:30
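
The byte values above appear to follow from the percentage parameters. Assuming zfs_dirty_data_max_max is 25% of physical RAM, the RAM size works out to 33615081472 bytes (an inferred figure, not one stated in the output), and 10% of that reproduces zfs_dirty_data_max. pct_of is a hypothetical helper name.

```shell
#!/bin/sh
# pct_of TOTAL PERCENT: integer PERCENT% of TOTAL. Hypothetical helper.
pct_of() {
    echo $(( $1 * $2 / 100 ))
}

# Inferred physical RAM: zfs_dirty_data_max_max (8403770368) divided by 25%.
ram=$((8403770368 * 4))

pct_of "$ram" 10    # 3361508147, matches zfs_dirty_data_max
pct_of "$ram" 25    # 8403770368, matches zfs_dirty_data_max_max
```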

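Test: for asynchronous I/O, writes with primarycache=metadata are slightly faster at first; once the cache fills up, primarycache=metadata and primarycache=all generally converge to the same speed.

The more block devices the zpool has, the more obvious the difference. Use zpool iostat -v 1 to observe it.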

# zpool create -o ashift=12 -o autoreplace=on zp1 scsi-36c81f660eb17fb001b2c5fec6553ff5e scsi-36c81f660eb17fb001b2c5ff465cff3ed scsi-36c81f660eb17fb001b2c5ffa662f3df2 scsi-36c81f660eb17fb001b2c5fff66848a6c scsi-36c81f660eb17fb001b2c600466cb5810 scsi-36c81f660eb17fb001b2c60096714bcf2 scsi-36c81f660eb17fb001b2c600e6761a9bd scsi-36c81f660eb17fb001b2c601267a63fcc scsi-36c81f660eb17fb001b2c601867f2c341  scsi-36c81f660eb17fb001b2c601e685414b5 scsi-36c81f660eb17fb001b2c602368a21621 scsi-36c81f660eb17fb001b2c602a690a4ed8

# zfs create -o mountpoint=/data01 -o atime=off -o primarycache=metadata zp1/data01
# dd if=/dev/zero of=/data01/test.img bs=1024K count=819200
^C185116+0 records in
185116+0 records out
194108194816 bytes (194 GB) copied, 113.589 s, 1.7 GB/s

# zfs destroy zp1/data01
# zfs create -o mountpoint=/data01 -o atime=off -o primarycache=all zp1/data01
# dd if=/dev/zero of=/data01/test.img bs=1024K count=819200
^C147262+0 records in
147262+0 records out
154415398912 bytes (154 GB) copied, 90.1703 s, 1.7 GB/s

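Read performance test: with data caching in the ARC turned off, performance is very poor. It is not yet clear whether tuning ZFS kernel module parameters can improve reads that go directly to the block devices.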

# zfs set primarycache=metadata zp1/data01
# cp /data01/test.img /data01/test.img1
# zpool iostat -v 1
                                             capacity     operations    bandwidth
pool                                      alloc   free   read  write   read  write
----------------------------------------  -----  -----  -----  -----  -----  -----
zp1                                       80.5G  43.4T    289    592  35.9M  64.5M
  scsi-36c81f660eb17fb001b2c5fec6553ff5e  6.72G  3.62T     23     44  3.00M  5.49M
  scsi-36c81f660eb17fb001b2c5ff465cff3ed  6.69G  3.62T     24     44  3.12M  5.49M
  scsi-36c81f660eb17fb001b2c5ffa662f3df2  6.71G  3.62T     24     49  3.00M  5.76M
  scsi-36c81f660eb17fb001b2c5fff66848a6c  6.72G  3.62T     23     44  3.00M  5.01M
  scsi-36c81f660eb17fb001b2c600466cb5810  6.70G  3.62T     24     62  3.12M  5.54M
  scsi-36c81f660eb17fb001b2c60096714bcf2  6.69G  3.62T     21     54  2.75M  5.15M
  scsi-36c81f660eb17fb001b2c600e6761a9bd  6.71G  3.62T     27     53  3.37M  5.35M
  scsi-36c81f660eb17fb001b2c601267a63fcc  6.71G  3.62T     21     46  2.75M  4.90M
  scsi-36c81f660eb17fb001b2c601867f2c341  6.68G  3.62T     22     46  2.87M  5.02M
  scsi-36c81f660eb17fb001b2c601e685414b5  6.74G  3.62T     25     54  3.24M  5.90M
  scsi-36c81f660eb17fb001b2c602368a21621  6.71G  3.62T     23     43  3.00M  5.49M
  scsi-36c81f660eb17fb001b2c602a690a4ed8  6.69G  3.62T     21     42  2.75M  5.37M
cache                                         -      -      -      -      -      -
  pcie-shannon-6819246149b014-part1       5.14M   800G      0      1      0  68.9K
----------------------------------------  -----  -----  -----  -----  -----  -----

With the ARC enabled, read performance improves. Note that read IOPS per disk rise to 300+, compared with only 20+ before the ARC was enabled.

# zfs set primarycache=all zp1/data01
# cp /data01/test.img /data01/test.img1
cp: overwrite `/data01/test.img1'? y
                                             capacity     operations    bandwidth
pool                                      alloc   free   read  write   read  write
----------------------------------------  -----  -----  -----  -----  -----  -----
zp1                                       82.8G  43.4T  3.54K  4.01K   449M   476M
  scsi-36c81f660eb17fb001b2c5fec6553ff5e  6.91G  3.62T    318    318  39.6M  39.6M
  scsi-36c81f660eb17fb001b2c5ff465cff3ed  6.89G  3.62T    286    328  35.6M  40.2M
  scsi-36c81f660eb17fb001b2c5ffa662f3df2  6.91G  3.62T    304    335  37.9M  39.6M
  scsi-36c81f660eb17fb001b2c5fff66848a6c  6.92G  3.62T    299    335  37.3M  40.3M
  scsi-36c81f660eb17fb001b2c600466cb5810  6.89G  3.62T    288    322  35.5M  37.1M
  scsi-36c81f660eb17fb001b2c60096714bcf2  6.89G  3.62T    300    337  37.3M  39.4M
  scsi-36c81f660eb17fb001b2c600e6761a9bd  6.90G  3.62T    305    330  37.9M  39.0M
  scsi-36c81f660eb17fb001b2c601267a63fcc  6.90G  3.62T    294    343  36.8M  40.1M
  scsi-36c81f660eb17fb001b2c601867f2c341  6.88G  3.62T    300    373  36.8M  39.5M
  scsi-36c81f660eb17fb001b2c601e685414b5  6.94G  3.62T    321    374  39.7M  40.4M
  scsi-36c81f660eb17fb001b2c602368a21621  6.90G  3.62T    292    365  36.4M  39.6M
  scsi-36c81f660eb17fb001b2c602a690a4ed8  6.89G  3.62T    308    339  38.2M  41.2M
cache                                         -      -      -      -      -      -
  pcie-shannon-6819246149b014-part1        454M   800G      0    649      0  79.5M
----------------------------------------  -----  -----  -----  -----  -----  -----

5.7

Secondary cache (L2ARC) configuration, i.e. the cache devices in a zpool.

If you use an L2ARC, an SSD is recommended for it.

       secondarycache=all | none | metadata

           Controls what is cached in the secondary cache (L2ARC). If this property is set to all, then both user data
           and metadata is cached. If this property is set to none, then neither user data nor metadata is cached.  If
           this property is set to metadata, then only metadata is cached. The default value is all.

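L2ARC data is taken from the ARC's MRU and MFU lists, so if the ARC is disabled, the L2ARC will not hold any cached data either.

Therefore, to use the L2ARC, make sure both the ARC and the L2ARC are enabled.

The L2ARC does not store dirty data, so for workloads whose active data changes frequently, the L2ARC is of almost no use.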

5.8

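Block deduplication. In most scenarios it brings little benefit, and with a large dataset it consumes a great deal of memory. It also hurts IOPS and throughput.

Enabling it is generally not recommended.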

      dedup=on | off | verify | sha256[,verify]

           Controls  whether  deduplication is in effect for a dataset. The default value is off. The default checksum
           used for deduplication is sha256 (subject to change). When dedup is enabled, the dedup  checksum  algorithm
           overrides the checksum property. Setting the value to verify is equivalent to specifying sha256,verify.

           If  the  property  is set to verify, then, whenever two blocks have the same signature, ZFS will do a byte-
           for-byte comparison with the existing block to ensure that the contents are identical.

5.9

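ZIL usage. For synchronous write requests, latency means the pool's log (ZIL) devices are used, and throughput means they are not used (strongly discouraged).

If you run PostgreSQL with asynchronous transaction commit, it matters little whether a ZIL device is used.

A ZIL device needs very good IOPS capability in order to deliver good synchronous write IOPS.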

       logbias = latency | throughput
           Provide  a hint to ZFS about handling of synchronous requests in this dataset. If logbias is set to latency
           (the default), ZFS will use pool log devices (if configured) to handle the requests at low latency. If log-
           bias  is  set  to  throughput, ZFS will not use configured pool log devices. ZFS will instead optimize syn-
           chronous operations for global pool throughput and efficient use of resources.

First, we test fsync performance on a zpool made of 12 ordinary mechanical hard disks with an SSD ZIL device.

# zfs get all|grep logbias
zp1         logbias               latency                default
zp1/data01  logbias               latency                default

> pg_test_fsync -f /data01/pgdata/1
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
        open_datasync                                n/a*
        fdatasync                        7285.416 ops/sec     137 usecs/op
        fsync                            7359.841 ops/sec     136 usecs/op
        fsync_writethrough                            n/a
        open_sync                                    n/a*
* This file system and its mount options do not support direct
I/O, e.g. ext4 in journaled mode.

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
        open_datasync                                n/a*
        fdatasync                        5396.851 ops/sec     185 usecs/op
        fsync                            4323.672 ops/sec     231 usecs/op
        fsync_writethrough                            n/a
        open_sync                                    n/a*
* This file system and its mount options do not support direct
I/O, e.g. ext4 in journaled mode.

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
         1 * 16kB open_sync write                    n/a*
         2 *  8kB open_sync writes                   n/a*
         4 *  4kB open_sync writes                   n/a*
         8 *  2kB open_sync writes                   n/a*
        16 *  1kB open_sync writes                   n/a*

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
        write, fsync, close              5859.650 ops/sec     171 usecs/op
        write, close, fsync              6626.115 ops/sec     151 usecs/op

Non-Sync'ed 8kB writes:
        write                           82388.939 ops/sec      12 usecs/op

Note that utilization of the SSD holding the ZIL is not at 100% at this point; it stays at a fairly low level.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00   12.39    5.75    0.00   81.86
dfa               0.00     0.00    0.00 7401.00     0.00 177624.00    24.00     0.24    0.03   0.03  24.10

zpool iostat shows that the fsync calls are using the ZIL device.

# zpool iostat -v 1
                                             capacity     operations    bandwidth
pool                                      alloc   free   read  write   read  write
----------------------------------------  -----  -----  -----  -----  -----  -----
zp1                                        160G  43.3T      0  7.23K      0  86.7M
  scsi-36c81f660eb17fb001b2c5fec6553ff5e  13.4G  3.61T      0      0      0      0
  scsi-36c81f660eb17fb001b2c5ff465cff3ed  13.4G  3.61T      0      0      0      0
  scsi-36c81f660eb17fb001b2c5ffa662f3df2  13.3G  3.61T      0      0      0      0
  scsi-36c81f660eb17fb001b2c5fff66848a6c  13.4G  3.61T      0      0      0      0
  scsi-36c81f660eb17fb001b2c600466cb5810  13.3G  3.61T      0      0      0      0
  scsi-36c81f660eb17fb001b2c60096714bcf2  13.3G  3.61T      0      0      0      0
  scsi-36c81f660eb17fb001b2c600e6761a9bd  13.3G  3.61T      0      0      0      0
  scsi-36c81f660eb17fb001b2c601267a63fcc  13.3G  3.61T      0      0      0      0
  scsi-36c81f660eb17fb001b2c601867f2c341  13.3G  3.61T      0      0      0      0
  scsi-36c81f660eb17fb001b2c601e685414b5  13.4G  3.61T      0      0      0      0
  scsi-36c81f660eb17fb001b2c602368a21621  13.3G  3.61T      0      0      0      0
  scsi-36c81f660eb17fb001b2c602a690a4ed8  13.3G  3.61T      0      0      0      0
logs                                          -      -      -      -      -      -
  pcie-shannon-6819246149b014-part2        976M  1.03G      0  7.23K      0  86.7M
cache                                         -      -      -      -      -      -
  pcie-shannon-6819246149b014-part1       2.03M   800G      0      0      0      0
----------------------------------------  -----  -----  -----  -----  -----  -----

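Next, change this dataset's logbias to throughput, i.e. stop using the ZIL device. fsync performance drops immediately.

At this point IOPS utilization of the vdev block devices is actually below 20%. This does not occur on FreeBSD; it is a ZFS on Linux issue. It was reported to Brian, whose reply follows.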

Thanks,

I've opened a new issue so we can track this.

https://github.com/zfsonlinux/zfs/issues/2431

The next step is somebody is going to have to profile the Linux case to 
see what's going on.  It seems like we're blocking somewhere in the 
stack unnecessarily.  Unfortunately, all the developers are swamped so 
I'm not sure when someone will get a chance to look at this.  If your 
interested in getting some additional profiling data I'd suggest 
starting with getting a call graph of fsync() using ftrace.  That should 
show us where the time is going.

http://lwn.net/Articles/370423/

Thanks,
Brian

# zfs set logbias=throughput zp1/data01
> pg_test_fsync -f /data01/pgdata/1
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
        open_datasync                                n/a*
        fdatasync                         330.846 ops/sec    3023 usecs/op
        fsync                             329.942 ops/sec    3031 usecs/op
        fsync_writethrough                            n/a
        open_sync                                    n/a*
* This file system and its mount options do not support direct
I/O, e.g. ext4 in journaled mode.

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
        open_datasync                                n/a*
        fdatasync                         329.407 ops/sec    3036 usecs/op
        fsync                             329.606 ops/sec    3034 usecs/op
        fsync_writethrough                            n/a
        open_sync                                    n/a*
* This file system and its mount options do not support direct
I/O, e.g. ext4 in journaled mode.

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
         1 * 16kB open_sync write                    n/a*
         2 *  8kB open_sync writes                   n/a*
         4 *  4kB open_sync writes                   n/a*
         8 *  2kB open_sync writes                   n/a*
        16 *  1kB open_sync writes                   n/a*

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
        write, fsync, close               324.344 ops/sec    3083 usecs/op
        write, close, fsync               329.272 ops/sec    3037 usecs/op

Non-Sync'ed 8kB writes:
        write                           84914.324 ops/sec      12 usecs/op
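
To put the two runs side by side, the ops/sec figures from the single 8kB write tests (logbias=latency versus logbias=throughput) can be divided directly. The numbers are copied from the outputs above; ratio is a hypothetical helper name.

```shell
#!/bin/sh
# ratio A B: print A / B to one decimal place. Hypothetical helper.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# ops/sec with logbias=latency divided by ops/sec with logbias=throughput
ratio 7285.416 330.846    # fdatasync, one 8kB write
ratio 7359.841 329.942    # fsync, one 8kB write
```

Both ratios come out at roughly 22x in favor of logbias=latency on this hardware.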

How does fsync perform if the zpool is built directly on the SSD? It is basically the same as the earlier setup of mechanical-disk vdevs plus an SSD ZIL.

# zpool destroy zp1
# zpool create -o ashift=12 zp1 pcie-shannon-6819246149b014-part1
# zfs create -o mountpoint=/data01 zp1/data01
# mkdir /data01/pgdata
# chown postgres:postgres /data01/pgdata
> pg_test_fsync -f /data01/pgdata/1
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
        open_datasync                                n/a*
        fdatasync                        6604.779 ops/sec     151 usecs/op
        fsync                            7086.614 ops/sec     141 usecs/op
        fsync_writethrough                            n/a
        open_sync                                    n/a*
* This file system and its mount options do not support direct
I/O, e.g. ext4 in journaled mode.

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
        open_datasync                                n/a*
        fdatasync                        5760.927 ops/sec     174 usecs/op
        fsync                            5677.560 ops/sec     176 usecs/op
        fsync_writethrough                            n/a
        open_sync                                    n/a*
* This file system and its mount options do not support direct
I/O, e.g. ext4 in journaled mode.

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
         1 * 16kB open_sync write                    n/a*
         2 *  8kB open_sync writes                   n/a*
         4 *  4kB open_sync writes                   n/a*
         8 *  2kB open_sync writes                   n/a*
        16 *  1kB open_sync writes                   n/a*

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
        write, fsync, close              6561.159 ops/sec     152 usecs/op
        write, close, fsync              6530.990 ops/sec     153 usecs/op

Non-Sync'ed 8kB writes:
        write                           81261.194 ops/sec      12 usecs/op

What about using ext4 directly, without ZFS?

In this case utilization of the underlying block device rises noticeably.

# mkfs.ext4 /dev/disk/by-id/pcie-shannon-6819246149b014-part2
# mount /dev/disk/by-id/pcie-shannon-6819246149b014-part2 /mnt
# chmod 777 /mnt
Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
        open_datasync                   38533.583 ops/sec      26 usecs/op
        fdatasync                       29027.342 ops/sec      34 usecs/op
        fsync                           26695.490 ops/sec      37 usecs/op
        fsync_writethrough                            n/a
        open_sync                       43047.350 ops/sec      23 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
        open_datasync                   23826.738 ops/sec      42 usecs/op
        fdatasync                       31193.925 ops/sec      32 usecs/op
        fsync                           29445.494 ops/sec      34 usecs/op
        fsync_writethrough                            n/a
        open_sync                       22241.529 ops/sec      45 usecs/op

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
         1 * 16kB open_sync write       34597.675 ops/sec      29 usecs/op
         2 *  8kB open_sync writes      22051.151 ops/sec      45 usecs/op
         4 *  4kB open_sync writes      11751.948 ops/sec      85 usecs/op
         8 *  2kB open_sync writes        804.951 ops/sec    1242 usecs/op
        16 *  1kB open_sync writes        403.788 ops/sec    2477 usecs/op

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
        write, fsync, close             18227.669 ops/sec      55 usecs/op
        write, close, fsync             18158.735 ops/sec      55 usecs/op

Non-Sync'ed 8kB writes:
        write                           288696.375 ops/sec       3 usecs/op
iostat shows that SSD utilization is higher now.
dfa               0.00     0.00    0.00 55244.00     0.00 441952.00     8.00     1.30    0.02   0.01  78.10

ZVOL + ext4 performance

# zfs create -V 10G zp1/data02
# mkfs.ext4 /dev/zd0
# mount /dev/zd0 /tmp
# chmod 777 /tmp
The results are also unsatisfactory.
> pg_test_fsync -f /tmp/1
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
        open_datasync                    5221.004 ops/sec     192 usecs/op
        fdatasync                        4770.779 ops/sec     210 usecs/op
        fsync                            2523.113 ops/sec     396 usecs/op
        fsync_writethrough                            n/a
        open_sync                        5527.120 ops/sec     181 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
        open_datasync                    2740.871 ops/sec     365 usecs/op
        fdatasync                        3774.486 ops/sec     265 usecs/op
        fsync                            1927.523 ops/sec     519 usecs/op
        fsync_writethrough                            n/a
        open_sync                        2747.225 ops/sec     364 usecs/op

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
         1 * 16kB open_sync write        4751.333 ops/sec     210 usecs/op
         2 *  8kB open_sync writes       2729.912 ops/sec     366 usecs/op
         4 *  4kB open_sync writes       1387.512 ops/sec     721 usecs/op
         8 *  2kB open_sync writes        734.417 ops/sec    1362 usecs/op
        16 *  1kB open_sync writes        364.665 ops/sec    2742 usecs/op

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
        write, fsync, close              3134.067 ops/sec     319 usecs/op
        write, close, fsync              3486.530 ops/sec     287 usecs/op

Non-Sync'ed 8kB writes:
        write                           293944.412 ops/sec       3 usecs/op

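Comparing the cases above, ZFS does not deliver the fsync capability of the underlying device, while using the block device directly with ext4 is clearly better. It is unclear whether this is an efficiency problem of ZFS on Linux or whether some ZFS kernel parameters need tuning. I will run the same test on FreeBSD later to see whether it behaves the same way.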
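
To put numbers on the comparison, the ratios between two runs can be computed from the pg_test_fsync figures. The sketch below assumes that the table immediately before the ZVOL test is the run on the block device with ext4, and it uses rows whose labels are unambiguous in both tables. The helper name `speedup` is hypothetical.

```shell
# Ratio of ops/sec between two pg_test_fsync runs.
# $1 = label, $2 = ops/sec in the faster run, $3 = ops/sec in the slower run
speedup() {
    awk -v label="$1" -v a="$2" -v b="$3" \
        'BEGIN { printf "%s: %.1fx\n", label, a / b }'
}

# Values copied from the tables above.
# Assumption: first value = block device + ext4, second value = ZVOL + ext4.
speedup "1*16kB open_sync"    34597.675 4751.333
speedup "write, fsync, close" 18227.669 3134.067
speedup "write, close, fsync" 18158.735 3486.530
```

With these inputs the script reports gaps of roughly 5x to 7x for the synchronous cases, while the non-synced 8kB write rate is almost the same in both runs.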

On FreeBSD the performance is very good and essentially reaches the limit of the block device. See:

http://blog.163.com/digoal@126/blog/static/16387704020145264116819/

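5.10 Handling of synchronous calls (the sync property). Disabling sync is not recommended, because data may be lost after a crash. Some applications, such as databases, expect that data has really reached non-volatile storage once fsync returns. With sync disabled, ZFS no longer meets that expectation.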

       sync=standard | always | disabled
           Controls  the  behavior  of  synchronous  requests  (e.g. fsync, O_DSYNC).  standard is the POSIX specified
           behavior of ensuring all synchronous requests are written to stable storage and all devices are flushed  to
           ensure  data  is  not  cached  by device controllers (this is the default). always causes every file system
           transaction to be written and flushed before its system call returns. This has a large performance penalty.
           disabled disables synchronous requests. File system transactions are only committed to stable storage peri-
           odically. This option will give the highest performance.  However, it is very dangerous  as  ZFS  would  be
           ignoring  the  synchronous  transaction  demands  of applications such as databases or NFS.  Administrators
           should only use this option when the risks are understood.
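
Because the property accepts only three values, a small wrapper can guard against typos and make it easy to restore `sync=standard` after a benchmark. This is a minimal sketch: `set_sync` and the `DRY_RUN` variable are hypothetical names, and by default the function only prints the `zfs set` command instead of running it.

```shell
# Validate the requested mode, then print (default) or run `zfs set sync=...`.
# $1 = dataset (e.g. zp1/data01), $2 = standard | always | disabled
set_sync() {
    dataset=$1
    mode=$2
    case "$mode" in
        standard|always|disabled) ;;
        *) echo "invalid sync mode: $mode" >&2; return 1 ;;
    esac
    if [ "${DRY_RUN:-1}" = 1 ]; then
        # Dry run: show the command without touching the pool.
        echo "zfs set sync=$mode $dataset"
    else
        zfs set "sync=$mode" "$dataset"
    fi
}

# Print the command that restores the default behavior.
set_sync zp1/data01 standard
```

Running it with `DRY_RUN=0` applies the change; `zfs get sync zp1/data01` then shows the current value.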

The following test shows performance with sync disabled. We strongly advise against this setting; the results are given for reference only.

# zfs set sync=disabled zp1/data01
# zfs get all|grep cache
zp1         primarycache          all                    default
zp1         secondarycache        all                    default
zp1/data01  primarycache          all                    default
zp1/data01  secondarycache        all                    default
> pg_test_fsync -f /data01/pgdata/1
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
        open_datasync                                n/a*
        fdatasync                       109380.512 ops/sec       9 usecs/op
        fsync                           115186.570 ops/sec       9 usecs/op
        fsync_writethrough                            n/a
        open_sync                                    n/a*
* This file system and its mount options do not support direct
I/O, e.g. ext4 in journaled mode.

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
        open_datasync                                n/a*
        fdatasync                       60158.540 ops/sec      17 usecs/op
        fsync                           60352.231 ops/sec      17 usecs/op
        fsync_writethrough                            n/a
        open_sync                                    n/a*
* This file system and its mount options do not support direct
I/O, e.g. ext4 in journaled mode.

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
         1 * 16kB open_sync write                    n/a*
         2 *  8kB open_sync writes                   n/a*
         4 *  4kB open_sync writes                   n/a*
         8 *  2kB open_sync writes                   n/a*
        16 *  1kB open_sync writes                   n/a*

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
        write, fsync, close             75829.757 ops/sec      13 usecs/op
        write, close, fsync             75501.094 ops/sec      13 usecs/op

Non-Sync'ed 8kB writes:
        write                           94328.592 ops/sec      11 usecs/op

With sync disabled, the result has little to do with the cache: even with the cache turned off as well, performance stays just as high.

# zfs set primarycache=none zp1/data01
> pg_test_fsync -f /data01/pgdata/1
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
        open_datasync                                n/a*
        fdatasync                       115321.769 ops/sec       9 usecs/op
        fsync                           115119.262 ops/sec       9 usecs/op
        fsync_writethrough                            n/a
        open_sync                                    n/a*
* This file system and its mount options do not support direct
I/O, e.g. ext4 in journaled mode.

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
        open_datasync                                n/a*
        fdatasync                       60296.171 ops/sec      17 usecs/op
        fsync                           60201.468 ops/sec      17 usecs/op
        fsync_writethrough                            n/a
        open_sync                                    n/a*
* This file system and its mount options do not support direct
I/O, e.g. ext4 in journaled mode.

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
         1 * 16kB open_sync write                    n/a*
         2 *  8kB open_sync writes                   n/a*
         4 *  4kB open_sync writes                   n/a*
         8 *  2kB open_sync writes                   n/a*
        16 *  1kB open_sync writes                   n/a*

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
        write, fsync, close             75542.879 ops/sec      13 usecs/op
        write, close, fsync             75654.249 ops/sec      13 usecs/op

Non-Sync'ed 8kB writes:
        write                           95557.532 ops/sec      10 usecs/op
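
The two sync=disabled runs can be compared row by row to check how little the cache setting matters. The sketch below computes the relative change from the primarycache=all run to the primarycache=none run. The helper name `pct_change` is hypothetical, and the values are copied from the tables above.

```shell
# Relative change in ops/sec between two runs.
# $1 = label, $2 = ops/sec with primarycache=all, $3 = ops/sec with primarycache=none
pct_change() {
    awk -v label="$1" -v before="$2" -v after="$3" \
        'BEGIN { printf "%s: %.1f%%\n", label, (after - before) / before * 100 }'
}

pct_change "fdatasync (one 8kB write)" 109380.512 115321.769
pct_change "fsync (one 8kB write)"     115186.570 115119.262
pct_change "non-synced 8kB write"       94328.592  95557.532
```

All three rows differ by only a few percent, which is within what one would expect from run-to-run noise in a 5-second test.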

6. ZFS kernel module parameters also have a large impact on performance.

See

http://blog.163.com/digoal@126/blog/static/16387704020145253599111/

[References]

1. zfs source

2. man zpool

3. man zfs

4. man zdb

5. http://blog.163.com/digoal@126/blog/static/1638770402014525103556357/

6. http://blog.163.com/digoal@126/blog/static/1638770402014525111238683/

7. http://blog.163.com/digoal@126/blog/static/16387704020145253599111/

8. http://fixunix.com/solaris-rss/579853-choosing-stripsize-lun-recordsize-zfs-postgresql.html