ZFS case : top shows CPU 100% sy, triggered when free memory runs out.

Introduction:
Recently a system has repeatedly hit sudden load spikes into the hundreds, which then drop back down.
Matching the database logs against the spike times shows, at each of those moments, a flood of entries like the following:
"UPDATE waiting",2015-01-09 01:38:47 CST,979/7,2927976054,LOG,00000,"process 26366 still waiting for ExclusiveLock on extension of relation 686062002 of database 35078604 after 1117.676 ms",,,,,,"
"INSERT waiting",2015-01-09 01:38:36 CST,541/8,2927976307,LOG,00000,"process 25936 still waiting for ExclusiveLock on extension of relation 686062002 of database 35078604 after 1219.762 ms",,,,,,"
"INSERT waiting",2015-01-09 01:38:48 CST,1018/64892,2929458056,LOG,00000,"process 26439 still waiting for ExclusiveLock on extension of relation 686061993 of database 35078604 after 1000.105 ms",
.........

The waits are for block-extension locks on a handful of objects; resolving the OIDs:
select 686062002::regclass;
          regclass           
-----------------------------
 pg_toast.pg_toast_686061993
(1 row)
select relname from pg_class where reltoastrelid=686062002;
               relname               
-------------------------------------
 tbl_xxx_20150109
(1 row)
Time: 4.643 ms
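
While the waits are happening, the blocked sessions can also be listed from pg_locks (a rough sketch; the 'extend' locktype and the catalogs are standard PostgreSQL 9.2+, the psql invocation and column choices here are only illustrative):
# psql -c "SELECT l.pid, l.relation::regclass AS rel, now() - a.query_start AS waited, a.query
           FROM pg_locks l JOIN pg_stat_activity a ON a.pid = l.pid
           WHERE l.locktype = 'extend' AND NOT l.granted;"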


At the same time, dmesg is filled with messages like:
postgres: page allocation failure. order:1, mode:0x20
Pid: 20427, comm: postgres Tainted: P           ---------------    2.6.32-504.el6.x86_64 #1
Call Trace:
 <IRQ>  [<ffffffff8113438a>] ? __alloc_pages_nodemask+0x74a/0x8d0
 [<ffffffff810eaa90>] ? handle_IRQ_event+0x60/0x170
 [<ffffffff81173332>] ? kmem_getpages+0x62/0x170
 [<ffffffff81173f4a>] ? fallback_alloc+0x1ba/0x270
 [<ffffffff8117399f>] ? cache_grow+0x2cf/0x320
 [<ffffffff81173cc9>] ? ____cache_alloc_node+0x99/0x160
 [<ffffffff81174c4b>] ? kmem_cache_alloc+0x11b/0x190
 [<ffffffff8144c768>] ? sk_prot_alloc+0x48/0x1c0
 [<ffffffff8144d992>] ? sk_clone+0x22/0x2e0
 [<ffffffff814a1b76>] ? inet_csk_clone+0x16/0xd0
 [<ffffffff814bb713>] ? tcp_create_openreq_child+0x23/0x470
 [<ffffffff814b8ecd>] ? tcp_v4_syn_recv_sock+0x4d/0x310
 [<ffffffff814bb4b6>] ? tcp_check_req+0x226/0x460
 [<ffffffff814b890b>] ? tcp_v4_do_rcv+0x35b/0x490
 [<ffffffffa0207557>] ? ipv4_confirm+0x87/0x1d0 [nf_conntrack_ipv4]
 [<ffffffff814ba1a2>] ? tcp_v4_rcv+0x522/0x900
 [<ffffffff81496d10>] ? ip_local_deliver_finish+0x0/0x2d0
 [<ffffffff81496ded>] ? ip_local_deliver_finish+0xdd/0x2d0
 [<ffffffff81497078>] ? ip_local_deliver+0x98/0xa0
 [<ffffffff8149653d>] ? ip_rcv_finish+0x12d/0x440
 [<ffffffff81496ac5>] ? ip_rcv+0x275/0x350
 [<ffffffff8145c88b>] ? __netif_receive_skb+0x4ab/0x750
 [<ffffffff81460588>] ? netif_receive_skb+0x58/0x60
 [<ffffffff81460690>] ? napi_skb_finish+0x50/0x70
 [<ffffffff81461f69>] ? napi_gro_receive+0x39/0x50
 [<ffffffffa01a7d91>] ? igb_poll+0x981/0x1010 [igb]
 [<ffffffff814b59c0>] ? tcp_delack_timer+0x0/0x270
 [<ffffffff814b3af9>] ? tcp_send_ack+0xd9/0x120
 [<ffffffff81462083>] ? net_rx_action+0x103/0x2f0
 [<ffffffff8107d8b1>] ? __do_softirq+0xc1/0x1e0
 [<ffffffff810eaa90>] ? handle_IRQ_event+0x60/0x170
 [<ffffffff8107d90f>] ? __do_softirq+0x11f/0x1e0
 [<ffffffff8100c30c>] ? call_softirq+0x1c/0x30
 [<ffffffff8100fc15>] ? do_softirq+0x65/0xa0
 [<ffffffff8107d765>] ? irq_exit+0x85/0x90
 [<ffffffff81533b45>] ? do_IRQ+0x75/0xf0
 [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
 <EOI>  [<ffffffff8116f5f9>] ? compaction_alloc+0x269/0x4b0
 [<ffffffff8116f552>] ? compaction_alloc+0x1c2/0x4b0
 [<ffffffff811799fa>] ? migrate_pages+0xaa/0x480
 [<ffffffff8100b9ce>] ? common_interrupt+0xe/0x13
 [<ffffffff8116f390>] ? compaction_alloc+0x0/0x4b0
 [<ffffffff8116e9ea>] ? compact_zone+0x61a/0xba0
 [<ffffffff8116f01c>] ? compact_zone_order+0xac/0x100
 [<ffffffff8116f151>] ? try_to_compact_pages+0xe1/0x120
 [<ffffffff81133b6a>] ? __alloc_pages_direct_compact+0xda/0x1b0
 [<ffffffff81134055>] ? __alloc_pages_nodemask+0x415/0x8d0
 [<ffffffff8116c79a>] ? alloc_pages_vma+0x9a/0x150
 [<ffffffff8118845d>] ? do_huge_pmd_anonymous_page+0x14d/0x3b0
 [<ffffffff8114fdb0>] ? handle_mm_fault+0x2f0/0x300
 [<ffffffff8104d0d8>] ? __do_page_fault+0x138/0x480
 [<ffffffff8152ae5e>] ? mutex_lock+0x1e/0x50
 [<ffffffff8152ffbe>] ? do_page_fault+0x3e/0xa0
 [<ffffffff8152d375>] ? page_fault+0x25/0x30


This is a log table with 4 indexes. One of its variable-length columns stores fairly long values (hence TOAST storage is used), for example:
DxxxxxxxxxxxxzwLlyyDd7xGd7^7xxwLDxyD@5xHB7^if5^vv4&DJCEL7xxxCFyhsxxxd4x~j2%$BB%ChkzHlzzvxBwqn5^DDCFexzwC@zyLDz
zC~zyDDzyCbAyyh3M~v5^DDCHvBBy%j0%iL4^fJB%K1xxxB%G1wz~h2M%B4%qn5&7xxwyPs!$xJ!Dd7xCb3^DFCGLnzyzlzyP7zyCJ7x)Lx^xxxxxxxy73$rLB&DND
zL5zy~xxyCt4xPj4%DJCE~DzyP#zyLPzyypxxx3&~DB^P1zzC5zye5wzz10MCb3^Gp4^DLCEiNywi$yzvxBwL73&$F7%7xzwG5zyy5wyah4MbzB%C1DzL9zyf5yyG9z
y!1zyLJxyCt4xPP5%nLB&xxxx
&7xxwjHzzi#yyi$yzi$yzmHIPm^K@CbAzzh5MDBCGLxxxxwz~h5M$JB%DxDzGlzyH5zyL7zzylzyC9AyLxxx7xxwC5AyzNxxxxxx5%Pp0
^~d0&6NzwK1wyzN1M%xxxx^P74^DJCGD7yyvBByiF4&Pt0%~d0&6hwwvDByiF4&et5xxxxxx17%$$1^DBCFGfwxzh2M!j1%qv5^DLCF!NzyH5zzvJBy%h4
%aD4&%v4^61zwDnyyK5wyzN0Mxxxx5$71zwCd7xCv0&fj5&(h1%yNc%mf7%71zwxxx%a@4%rpd^a5d$71zwCt5x!l1$~^l^LDx&K1wzmh5MxxxxxxxxxxGr4&Dvd&$jl
%LBx%K1wzPh5M$Ll^yn1&Ht6+fxxxxxxwC@5xqnxxx&~90^fj5+Oh5%71zwDJ7xH~j&yDh$G@5^7xxxx^Gp4@DLCEf1yw7b0~bn0^%^i&HDzyebz
y6)zyLBzyfdyyvxBwH#5%GP5^nvd&$LB^DN5zqHA^%P1^nLlEDL5%$@n^i#4^$J7%nPn+bzF@Ct4xD~1&GB4^add%7xxw!7zybBxyC@5xqn5xxxxx!DLCEvBxxxd6&!vL@7xx
w)dyECn5&)B1zxxxxxOz)D&HND&C9DOOl1yzpD&xxxxxx1^6hzwLrzyf3zxxx&L73+Gp4@DLCEiNywi$yzvxBw$h5%Hdf&(l9%zh0%nHB^D1zz~7zyvzB
yHl6%jl4!!Fg^rLB%DhDzC5xxxxb7j&aH4^)txxxDzvxBw$v0%DJCEzNxxxxzyzr1y~Lzx(vA&(@zyrtCyy5DxxxHl5OHbDO$3DxxxyyN0MD~1%afd&71z
wDd7xj17xxxxx$)7^7N2wq8*=

A new table is created every day, so block extension happens continuously. In theory extension is fast and should not cause the behavior above; moreover, at the times the problem occurred, data volume and concurrency were both normal.

For this kind of wait there is an earlier article about extend-lock waits hit during bulk loading.
That issue is unrelated to the performance case in this article.

It looked like a ZFS problem, and the final diagnosis confirmed it:
free memory keeps shrinking, and the moment it reaches 0 the load spikes.
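To confirm that pattern over time, a simple sampling loop can log free memory, ARC size and load together (a minimal sketch; the /proc paths are the standard ones, the 10-second interval is arbitrary):
#!/bin/bash
# sample free memory, ZFS ARC size and 1-minute load every 10 seconds
while true; do
    free_kb=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)
    arc_b=$(awk '$1=="size" {print $3}' /proc/spl/kstat/zfs/arcstats)
    load1=$(cut -d' ' -f1 /proc/loadavg)
    echo "$(date '+%F %T') MemFree=${free_kb}kB arc_size=${arc_b}B load1=${load1}"
    sleep 10
done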

Environment:
CentOS 6.x x64
2.6.32-504.el6.x86_64

ZFS version:
zfs-0.6.3-1.1.el6.x86_64
libzfs2-0.6.3-1.1.el6.x86_64
zfs-dkms-0.6.3-1.1.el6.noarch

Server memory: 384GB

Database settings: shared_buffers=20GB, maintenance_work_mem=2GB, autovacuum_max_workers=6.
Ignoring work_mem, the database can use at most about 32GB (20GB shared_buffers + 6 x 2GB maintenance_work_mem).
That still leaves 300+ GB for the OS and ZFS.

The ZFS module parameters are:
cd /sys/module/zfs/parameters
# grep '' *|sort 
l2arc_feed_again:1
l2arc_feed_min_ms:200
l2arc_feed_secs:1
l2arc_headroom:2
l2arc_headroom_boost:200
l2arc_nocompress:0
l2arc_noprefetch:1
l2arc_norw:0
l2arc_write_boost:8388608
l2arc_write_max:8388608
metaslab_debug_load:0
metaslab_debug_unload:0
spa_asize_inflation:24
spa_config_path:/etc/zfs/zpool.cache
zfetch_array_rd_sz:1048576
zfetch_block_cap:256
zfetch_max_streams:8
zfetch_min_sec_reap:2
zfs_arc_grow_retry:5
zfs_arc_max:10240000000
zfs_arc_memory_throttle_disable:1
zfs_arc_meta_limit:0
zfs_arc_meta_prune:1048576
zfs_arc_min:0
zfs_arc_min_prefetch_lifespan:1000
zfs_arc_p_aggressive_disable:1
zfs_arc_p_dampener_disable:1
zfs_arc_shrink_shift:5
zfs_autoimport_disable:0
zfs_dbuf_state_index:0
zfs_deadman_enabled:1
zfs_deadman_synctime_ms:1000000
zfs_dedup_prefetch:1
zfs_delay_min_dirty_percent:60
zfs_delay_scale:500000
zfs_dirty_data_max:10240000000
zfs_dirty_data_max_max:101595342848
zfs_dirty_data_max_max_percent:25
zfs_dirty_data_max_percent:10
zfs_dirty_data_sync:67108864
zfs_disable_dup_eviction:0
zfs_expire_snapshot:300
zfs_flags:1
zfs_free_min_time_ms:1000
zfs_immediate_write_sz:32768
zfs_mdcomp_disable:0
zfs_nocacheflush:0
zfs_nopwrite_enabled:1
zfs_no_scrub_io:0
zfs_no_scrub_prefetch:0
zfs_pd_blks_max:100
zfs_prefetch_disable:0
zfs_read_chunk_size:1048576
zfs_read_history:0
zfs_read_history_hits:0
zfs_recover:0
zfs_resilver_delay:2
zfs_resilver_min_time_ms:3000
zfs_scan_idle:50
zfs_scan_min_time_ms:1000
zfs_scrub_delay:4
zfs_send_corrupt_data:0
zfs_sync_pass_deferred_free:2
zfs_sync_pass_dont_compress:5
zfs_sync_pass_rewrite:2
zfs_top_maxinflight:32
zfs_txg_history:0
zfs_txg_timeout:5
zfs_vdev_aggregation_limit:131072
zfs_vdev_async_read_max_active:3
zfs_vdev_async_read_min_active:1
zfs_vdev_async_write_active_max_dirty_percent:60
zfs_vdev_async_write_active_min_dirty_percent:30
zfs_vdev_async_write_max_active:10
zfs_vdev_async_write_min_active:1
zfs_vdev_cache_bshift:16
zfs_vdev_cache_max:16384
zfs_vdev_cache_size:0
zfs_vdev_max_active:1000
zfs_vdev_mirror_switch_us:10000
zfs_vdev_read_gap_limit:32768
zfs_vdev_scheduler:noop
zfs_vdev_scrub_max_active:2
zfs_vdev_scrub_min_active:1
zfs_vdev_sync_read_max_active:10
zfs_vdev_sync_read_min_active:10
zfs_vdev_sync_write_max_active:10
zfs_vdev_sync_write_min_active:10
zfs_vdev_write_gap_limit:4096
zfs_zevent_cols:80
zfs_zevent_console:0
zfs_zevent_len_max:768
zil_replay_disable:0
zil_slog_limit:1048576
zio_bulk_flags:0
zio_delay_max:30000
zio_injection_enabled:0
zio_requeue_io_start_cut_in_line:1
zvol_inhibit_dev:0
zvol_major:230
zvol_max_discard_blocks:16384
zvol_threads:32

These parameters are documented in:
man /usr/share/man/man5/zfs-module-parameters.5.gz

zpool properties:
# zpool get all zp1
NAME  PROPERTY               VALUE                  SOURCE
zp1   size                   40T                    -
zp1   capacity               2%                     -
zp1   altroot                -                      default
zp1   health                 ONLINE                 -
zp1   guid                   15254203672861282738   default
zp1   version                -                      default
zp1   bootfs                 -                      default
zp1   delegation             on                     default
zp1   autoreplace            off                    default
zp1   cachefile              -                      default
zp1   failmode               wait                   default
zp1   listsnapshots          off                    default
zp1   autoexpand             off                    default
zp1   dedupditto             0                      default
zp1   dedupratio             1.00x                  -
zp1   free                   39.0T                  -
zp1   allocated              995G                   -
zp1   readonly               off                    -
zp1   ashift                 12                     local
zp1   comment                -                      default
zp1   expandsize             0                      -
zp1   freeing                0                      default
zp1   feature@async_destroy  enabled                local
zp1   feature@empty_bpobj    active                 local
zp1   feature@lz4_compress   active                 local


ZFS dataset properties:
# zfs get all zp1/data_a0
NAME         PROPERTY              VALUE                  SOURCE
zp1/data_a0  type                  filesystem             -
zp1/data_a0  creation              Thu Dec 18 10:30 2014  -
zp1/data_a0  used                  98.8G                  -
zp1/data_a0  available             34.1T                  -
zp1/data_a0  referenced            98.8G                  -
zp1/data_a0  compressratio         1.00x                  -
zp1/data_a0  mounted               yes                    -
zp1/data_a0  quota                 none                   default
zp1/data_a0  reservation           none                   default
zp1/data_a0  recordsize            128K                   default
zp1/data_a0  mountpoint            /data_a0               local
zp1/data_a0  sharenfs              off                    default
zp1/data_a0  checksum              on                     default
zp1/data_a0  compression           off                    local
zp1/data_a0  atime                 off                    inherited from zp1
zp1/data_a0  devices               on                     default
zp1/data_a0  exec                  on                     default
zp1/data_a0  setuid                on                     default
zp1/data_a0  readonly              off                    default
zp1/data_a0  zoned                 off                    default
zp1/data_a0  snapdir               hidden                 default
zp1/data_a0  aclinherit            restricted             default
zp1/data_a0  canmount              on                     default
zp1/data_a0  xattr                 sa                     local
zp1/data_a0  copies                1                      default
zp1/data_a0  version               5                      -
zp1/data_a0  utf8only              off                    -
zp1/data_a0  normalization         none                   -
zp1/data_a0  casesensitivity       sensitive              -
zp1/data_a0  vscan                 off                    default
zp1/data_a0  nbmand                off                    default
zp1/data_a0  sharesmb              off                    default
zp1/data_a0  refquota              none                   default
zp1/data_a0  refreservation        none                   default
zp1/data_a0  primarycache          metadata               local
zp1/data_a0  secondarycache        all                    local
zp1/data_a0  usedbysnapshots       0                      -
zp1/data_a0  usedbydataset         98.8G                  -
zp1/data_a0  usedbychildren        0                      -
zp1/data_a0  usedbyrefreservation  0                      -
zp1/data_a0  logbias               latency                default
zp1/data_a0  dedup                 off                    default
zp1/data_a0  mlslabel              none                   default
zp1/data_a0  sync                  standard               default
zp1/data_a0  refcompressratio      1.00x                  -
zp1/data_a0  written               98.8G                  -
zp1/data_a0  logicalused           98.7G                  -
zp1/data_a0  logicalreferenced     98.7G                  -
zp1/data_a0  snapdev               hidden                 default
zp1/data_a0  acltype               off                    default
zp1/data_a0  context               none                   default
zp1/data_a0  fscontext             none                   default
zp1/data_a0  defcontext            none                   default
zp1/data_a0  rootcontext           none                   default
zp1/data_a0  relatime              off                    default


Fixing this probably has to start with the ARC:
ARC internals reference: man zfs-module-parameters
ARC tuning cases: see the two quoted write-ups further below.
On a large-memory machine it is advisable to lower the ARC shrink shift so that each shrink reclaims only about 100MB:
       zfs_arc_shrink_shift (int)
                   log2(fraction of arc to reclaim)
                   Default value: 5.

The default is 5, i.e. 1/32. With 384GB that can be as much as 12GB, and shrinking 12GB of ARC in one pass hangs for a long time.
It is better to bring this down to roughly 100MB per pass; setting zfs_arc_shrink_shift=11 gives 1/2048, which for 384GB is about 187.5MB.
A published case study (quoted verbatim below) describes the same ARC-shrink latency pattern:
Description: Semi-regular spikes in I/O latency on an SSD postgres server.
Analysis: The customer reported multi-second I/O latency for a server with flash memory-based solid state disks (SSDs). Since this SSD type was new in production, it was feared that there may be a new drive or firmware problem causing high latency. ZFS latency counters, measured at the VFS interface, confirmed that I/O latency was dismal, sometimes reaching 10 seconds for I/O. The DTrace-based iosnoop tool (DTraceToolkit) was used to trace at the block device level, however, no seriously slow I/O was observed from the SSDs. I plotted the iosnoop traces using R for evidence of queueing behind TXG flushes, but they didn’t support that theory either.
This was difficult to investigate since the slow I/O was intermittent, sometimes only occurring once per hour. Instead of a typical interactive investigation, I developed various ways to log activity from DTrace and kstats, so that clues for the issue could be examined afterwards from the logs. This included capturing which processes were executed using execsnoop, and dumping ZFS metrics from kstat, including arcstats. This showed that various maintenance processes were executing during the hour, and, the ZFS ARC, which was around 210 Gbytes, would sometimes drop by around 6 Gbytes. Having worked performance issues with shrinking ARCs before, I developed a DTrace script to trace ARC reaping along with process execution, and found that it was a match with a cp(1) command. This was part of the maintenance task, which was copying a 30 Gbyte file, hitting the ARC limit and triggering an ARC shrink. Shrinking involves holding ARC hash locks, which can cause latency, especially when shrinking 6 Gbytes worth of buffers. The zfs:zfs_arc_shrink_shift tunable was adjusted to reduce the shrink size, which also made them more frequent. The worst-case I/O improved from 10s to 100ms.

ARC shrink shift
Every second a process runs which checks if data can be removed from the ARC and evicts it. Default max 1/32nd of the ARC can be evicted at a time. This is limited because evicting large amounts of data from ARC stalls all other processes. Back when 8GB was a lot of memory 1/32nd meant 256MB max at a time. When you have 196GB of memory 1/32nd is 6.3GB, which can cause up to 20-30 seconds of unresponsiveness (depending on the record size).
This 1/32nd needs to be changed to make sure the max is set to ~100-200MB again, by adding the following to /etc/system:
set zfs:zfs_arc_shrink_shift=11
(where 11 is 1/2^11 or 1/2048th, 10 is 1/2^10 or 1/1024th etc. Change depending on amount of RAM in your system).
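
The shrink-size arithmetic can be checked directly on the box; the sketch below simply reproduces the "arc size >> zfs_arc_shrink_shift" fraction described in the man page excerpt above:
# arc_size=$(awk '$1=="size" {print $3}' /proc/spl/kstat/zfs/arcstats)
# shrink_shift=$(cat /sys/module/zfs/parameters/zfs_arc_shrink_shift)
# echo "one shrink pass may evict up to $(( arc_size >> shrink_shift )) bytes"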


Combining the ARC behavior with the asynchronous dirty-write delay mechanism, the tuning targets are as follows:
       zfs_vdev_async_write_active_min_dirty_percent (int)
                   When  the  pool  has  less  than  zfs_vdev_async_write_active_min_dirty_percent  dirty  data,   use
                   zfs_vdev_async_write_min_active to limit active async writes.  If the dirty data is between min and
                   max, the active I/O limit is linearly interpolated. See the section "ZFS I/O SCHEDULER".
                   Default value: 30.
       zfs_vdev_async_write_active_max_dirty_percent (int)
                   When  the  pool  has  more  than  zfs_vdev_async_write_active_max_dirty_percent  dirty  data,   use
                   zfs_vdev_async_write_max_active to limit active async writes.  If the dirty data is between min and
                   max, the active I/O limit is linearly interpolated. See the section "ZFS I/O SCHEDULER".
                   Default value: 60.
       zfs_vdev_async_write_max_active (int)
                    Maximum asynchronous write I/Os active to each device.  See the section "ZFS I/O SCHEDULER".
                   Default value: 10.
       zfs_vdev_async_write_min_active (int)
                   Minimum asynchronous write I/Os active to each device.  See the section "ZFS I/O SCHEDULER".
                   Default value: 1.

The diagram below shows how asynchronous dirty writes are ramped up and throttled. Lowering zfs_vdev_async_write_active_min_dirty_percent shrinks the region where writes run at the minimum rate,
and lowering zfs_vdev_async_write_active_max_dirty_percent makes the maximum write rate kick in earlier, so dirty data gets flushed faster.
The trade-off is possible I/O contention with synchronous writes.
              |              o---------| <-- zfs_vdev_async_write_max_active
         ^    |             /^         |
         |    |            / |         |
       active |           /  |         |
        I/O   |          /   |         |
       count  |         /    |         |
              |        /     |         |
              |-------o      |         | <-- zfs_vdev_async_write_min_active
             0|_______^______|_________|
              0%      |      |       100% of zfs_dirty_data_max
                      |      |
                      |      ‘-- zfs_vdev_async_write_active_max_dirty_percent
                      ‘--------- zfs_vdev_async_write_active_min_dirty_percent
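
As a way to read the diagram, the scheduler's linear interpolation can be sketched in a few lines of shell (the percentages below are the values chosen later in this case, the active counts are the defaults; the real logic lives inside the ZFS I/O scheduler, this is only an illustration):
#!/bin/bash
# estimate the active async-write I/O limit for a given dirty percentage
dirty_pct=20              # current dirty data as % of zfs_dirty_data_max (example value)
min_pct=10; max_pct=30    # zfs_vdev_async_write_active_{min,max}_dirty_percent
min_act=1;  max_act=10    # zfs_vdev_async_write_{min,max}_active
if   (( dirty_pct <= min_pct )); then limit=$min_act
elif (( dirty_pct >= max_pct )); then limit=$max_act
else limit=$(( min_act + (max_act - min_act) * (dirty_pct - min_pct) / (max_pct - min_pct) ))
fi
echo "active async write limit: $limit"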

On the other hand, we also need to cap the ARC max (note: the ARC max, not a dirty-data max).
The database already takes a large share of memory, and an uncapped ZFS ARC will grow without restraint.
Some articles suggest limiting the ARC to 40% of total memory. (Total memory is 384GB; PostgreSQL shared_buffers takes 20GB.)
So what should the value actually be?

Check the current state, with the database already running:
# free
             total       used       free     shared    buffers     cached
Mem:     396856808  228812456  168044352   21633868      58744   45380060

The system has about 168GB free.
The ARC is already using about 20GB:
# cat /proc/spl/kstat/zfs/arcstats |grep size
size                            4    19751851104

So, out of the current free memory, if another 48GB is reserved for the OS and the database, ZFS still has 120GB available.
Adding the ~20GB already in use, ZFS can use about 140GB in total.
Set the ARC max to 140GB (roughly 40% of total memory):
# echo 140000000000 > /sys/module/zfs/parameters/zfs_arc_max
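
After the echo, the new target can be confirmed from arcstats (c_max is the ARC's target maximum size):
# awk '$1=="c_max" {print $3}' /proc/spl/kstat/zfs/arcstats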

Next, tune the dirty-data parameters.
Lower zfs_dirty_data_max to 1/5 of the ARC max = 28000000000 (this can be changed at runtime).

Parameters to speed up asynchronous writes:
zfs_vdev_async_write_active_min_dirty_percent=10
zfs_vdev_async_write_active_max_dirty_percent=30  (must stay below zfs_delay_min_dirty_percent)
zfs_delay_min_dirty_percent=60


Adjust the values at runtime first; it is also recommended to persist them as module boot parameters:
# cd /sys/module/zfs/parameters/
# echo 140000000000 >zfs_arc_max
# echo 28000000000 >zfs_dirty_data_max
# echo 10 > zfs_vdev_async_write_active_min_dirty_percent
# echo 30 > zfs_vdev_async_write_active_max_dirty_percent
# echo 60 > zfs_delay_min_dirty_percent
# echo 11 > zfs_arc_shrink_shift

ZFS module boot parameters:
# vi /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=140000000000
options zfs zfs_dirty_data_max=28000000000
options zfs zfs_vdev_async_write_active_min_dirty_percent=10
options zfs zfs_vdev_async_write_active_max_dirty_percent=30
options zfs zfs_delay_min_dirty_percent=60
options zfs zfs_arc_shrink_shift=11
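
Either way, the live values can be double-checked with the same grep '' trick used for the full parameter dump above (bash brace expansion assumed):
# grep '' /sys/module/zfs/parameters/{zfs_arc_max,zfs_dirty_data_max,zfs_arc_shrink_shift,zfs_delay_min_dirty_percent}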


Observation period...
Same story as before: memory still runs out, and then the CPU spikes just the same.
The processes' own memory consumption, however, looks normal:
# ps -e --width=1024 -o pid,%mem,rss,size,sz,vsz,cmd --sort rss
rss        RSS      resident set size, the non-swapped physical memory that a task has used (in kiloBytes).
                    (alias rssize, rsz).
size       SZ       approximate amount of swap space that would be required if the process were to dirty all writable
                    pages and then be swapped out. This number is very rough!
sz         SZ       size in physical pages of the core image of the process. This includes text, data, and stack
                    space. Device mappings are currently excluded; this is subject to change. See vsz and rss.
vsz        VSZ      virtual memory size of the process in KiB (1024-byte units). Device mappings are currently
                    excluded; this is subject to change. (alias vsize).
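
A quick way to rule out the backends themselves is to sum their RSS (a rough check only: shared_buffers is counted once per backend, so the total overstates real usage):
# ps -C postgres -o rss= | awk '{sum+=$1} END {printf "total postgres RSS: %.1f GB\n", sum/1024/1024}'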


Meanwhile sar shows kbcached growing steadily while kbmemfree shrinks:
06:10:01 AM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
06:20:01 AM 219447748 177409060     44.70     23196  22965260  27351828      6.75
06:30:01 AM 219304016 177552792     44.74     24628  23080756  27348820      6.75
06:40:01 AM 218698000 178158808     44.89     26276  23638736  27365764      6.75
06:50:01 AM 218454732 178402076     44.95     27588  23852552  27365664      6.75
07:00:01 AM 218211060 178645748     45.02     28840  24066384  27365736      6.75
07:10:01 AM 218006588 178850220     45.07     30144  24231036  27366528      6.75
07:20:01 AM 217784072 179072736     45.12     31424  24412084  27365496      6.75
07:30:01 AM 217128620 179728188     45.29     32752  24970064  27370048      6.75
07:40:01 AM 216704964 180151844     45.39     34372  25331396  27369700      6.75
07:50:01 AM 216372456 180484352     45.48     35740  25610760  27371348      6.75
08:00:01 AM 216028392 180828416     45.57     37060  25890136  27393748      6.76
08:10:01 AM 214706196 182150612     45.90     38808  27120088  27400288      6.76
08:20:01 AM 213981920 182874888     46.08     42712  27798924  27413000      6.76
08:30:01 AM 213551104 183305704     46.19     44268  28193028  27411516      6.76


Tune how aggressively the kernel reclaims the dentry/inode caches:
vfs_cache_pressure
------------------

This percentage value controls the tendency of the kernel to reclaim
the memory which is used for caching of directory and inode objects.

At the default value of vfs_cache_pressure=100 the kernel will attempt to
reclaim dentries and inodes at a "fair" rate with respect to pagecache and
swapcache reclaim.  Decreasing vfs_cache_pressure causes the kernel to prefer
to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will
never reclaim dentries and inodes due to memory pressure and this can easily
lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100
causes the kernel to prefer to reclaim dentries and inodes.

Increasing vfs_cache_pressure significantly beyond 100 may have negative
performance impact. Reclaim code needs to take various locks to find freeable
directory and inode objects. With vfs_cache_pressure=1000, it will look for
ten times more freeable objects than there are.

Even with it set to 1, the cache apparently keeps growing regardless.
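For reference, the runtime change is just a sysctl write (1 being the value tried above):
# sysctl -w vm.vfs_cache_pressure=1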

Since dirty data is not involved here, there is no need to touch the kernel dirty-data parameters either:
# cat /proc/meminfo |grep -i -E "dirt|back"
Dirty:                 0 kB
Writeback:             0 kB
WritebackTmp:          0 kB


==============================================================

dirty_background_bytes

Contains the amount of dirty memory at which the background kernel
flusher threads will start writeback.

If dirty_background_bytes is written, dirty_background_ratio becomes a function
of its value (dirty_background_bytes / the amount of dirtyable system memory).

==============================================================

dirty_background_ratio

Contains, as a percentage of total system memory, the number of pages at which
the background kernel flusher threads will start writing out dirty data.

==============================================================

dirty_bytes

Contains the amount of dirty memory at which a process generating disk writes
will itself start writeback.

If dirty_bytes is written, dirty_ratio becomes a function of its value
(dirty_bytes / the amount of dirtyable system memory).

Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any
value lower than this limit will be ignored and the old configuration will be
retained.

==============================================================

dirty_expire_centisecs

This tunable is used to define when dirty data is old enough to be eligible
for writeout by the kernel flusher threads.  It is expressed in 100'ths
of a second.  Data which has been dirty in-memory for longer than this
interval will be written out next time a flusher thread wakes up.

==============================================================

dirty_ratio

Contains, as a percentage of total system memory, the number of pages at which
a process which is generating disk writes will itself start writing out dirty
data.

==============================================================

dirty_writeback_centisecs

The kernel flusher threads will periodically wake up and write `old' data
out to disk.  This tunable expresses the interval between those wakeups, in
100'ths of a second.

Setting this to zero disables periodic writeback altogether.


For now, add a script that drops the caches automatically during idle hours.
From /usr/share/doc/kernel-doc-2.6.32/Documentation/sysctl/vm.txt:
drop_caches

Writing to this will cause the kernel to drop clean caches, dentries and
inodes from memory, causing that memory to become free.

To free pagecache:
        echo 1 > /proc/sys/vm/drop_caches
To free dentries and inodes:
        echo 2 > /proc/sys/vm/drop_caches
To free pagecache, dentries and inodes:
        echo 3 > /proc/sys/vm/drop_caches

As this is a non-destructive operation and dirty objects are not freeable, the
user should run `sync' first.

crontab -e
30 4 * * * /usr/local/bin/free.sh >>/tmp/free.log 2>&1

# cat /usr/local/bin/free.sh
#!/bin/bash

. /root/.bash_profile
. /etc/profile

echo "`date +%F%T` start drop cache."
free
# flush dirty pages first; drop_caches only frees clean pagecache, dentries and inodes
sync
echo 3 > /proc/sys/vm/drop_caches
echo "`date +%F%T` end drop cache."
free


The final set of tuning changes is as follows; after applying them the load returned to normal.
Reduce the dirty-data thresholds and increase the dirty-data flush frequency.
Switch the ARC to caching metadata only, no data pages.
sysctl -w vm.zone_reclaim_mode=1
sysctl -w vm.dirty_background_bytes=102400000
sysctl -w vm.dirty_bytes=102400000
sysctl -w vm.dirty_expire_centisecs=10
sysctl -w vm.dirty_writeback_centisecs=10
sysctl -w vm.swappiness=0
sysctl -w vm.vfs_cache_pressure=80

# vi /etc/sysctl.conf
vm.zone_reclaim_mode=1
vm.dirty_background_bytes=102400000
vm.dirty_bytes=102400000
vm.dirty_expire_centisecs=10
vm.dirty_writeback_centisecs=10
vm.swappiness=0
vm.vfs_cache_pressure=80
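
The persisted settings can be loaded from /etc/sysctl.conf without a reboot:
# sysctl -p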


# cd /sys/module/zfs/parameters/
# cat zfs_arc_max 
10240000000

Looking at the ARC statistics in /proc/spl/kstat/zfs/arcstats, metadata uses less than 2GB, so a 10GB cap is roughly right.
If it turns out to be too small, it can be raised later.
meta_size                       4    1952531968
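
The relevant counters were pulled from arcstats along these lines (field names as they appear in the file):
# awk '$1=="size" || $1=="meta_size" {print $1, $3}' /proc/spl/kstat/zfs/arcstats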

# cat /etc/modprobe.d/zfs.conf 
options zfs zfs_arc_max=10240000000
options zfs zfs_dirty_data_max=800000000
options zfs zfs_vdev_async_write_active_min_dirty_percent=10
options zfs zfs_vdev_async_write_active_max_dirty_percent=30
options zfs zfs_delay_min_dirty_percent=60
options zfs zfs_arc_shrink_shift=11

primarycache is set to metadata because Linux already has its own page cache, so there is no point caching the same data twice.
ZFS has the same double-caching problem as PostgreSQL (shared_buffers vs the OS cache), unless direct I/O is used.
# zfs set primarycache=metadata zp1
# zfs set primarycache=metadata zp1/data_a0
# zfs set primarycache=metadata zp1/data_a1
# zfs set primarycache=metadata zp1/data_b0
# zfs set primarycache=metadata zp1/data_b1
# zfs set primarycache=metadata zp1/data_c0
# zfs set primarycache=metadata zp1/data_c1
# zfs set primarycache=metadata zp1/data_ssd0
# zfs set primarycache=metadata zp1/data_ssd1

Set the recordsize to match the database block size (recordsize only affects newly written files):
# zfs set recordsize=16k zp1/data_a0    (dataset holding WAL, wal_block_size=16kB)
# zfs set recordsize=8k zp1/data_a0     (dataset holding data files, block_size=8kB)
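
The resulting dataset properties can be verified afterwards with zfs get, for example:
# zfs get recordsize,primarycache zp1/data_a0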


[References]
1. http://constantin.glez.de/blog/2010/04/ten-ways-easily-improve-oracle-solaris-zfs-filesystem-performance
2. /proc/spl/*
