bcache / 如何使用bcache构建LVM,软RAID / 如何优化bcache
作者
digoal
日期
2016-09-19
标签
bcache , mdadm , lvm2 , 软RAID
bcache 背景知识
本小章节转载自 http://www.sysnote.org/2014/06/20/bcache-analysis/
1. 简介
bcache是linux内核块设备层cache,类似于flashcache使用ssd作为hdd的缓存方案,相比于flashcache,bcache更加灵活,支持ssd作为多块hdd的共享缓存,并且还支持多块ssd(还未完善),能够在运行中动态增加,删除缓存设备和后端设备。
从3.10开始,bcache进入内核主线。
bcache支持writeback、writethrough、writearoud三种策略,默认是wriththrough,可以动态修改,缓存替换方式支持lru、fifo和random三种。下面从几个方面介绍bcache的实现机制。
2.总体结构
bcache的整体结构如图所示。
bcache中是以cache set来划分不同存储集合,一个cache set中包含一个或多个缓存设备(一般是ssd),一个或多个后端设备(一般是hdd)。
bcache对外输出给用户使用的是/dev/bcache这种设备,每个bcache设备都与一个后端物理盘一一对应。
用户对不同bcache设备的io会缓存在ssd中,刷脏数据的时候就会写到各自对应的后端设备上。
因此,bcache扩容很容易,只要注册一个新的物理设备即可。
3. bcache关键结构
3.1 bucket
缓存设备会按照bucket大小划分成很多bucket,bucket的大小最好是设置成与缓存设备ssd的擦除大小一致,一般建议128k~2M+,默认是512k。
每个bucket有个优先级编号(16 bit的priority),每次hit都会增加,然后所有的bucket的优先级编号都会周期性地减少,不常用的会被回收,这个优先级编号主要是用来实现lru替换的。
bucket还有8bit的generation,用来invalidate bucket用的。
bucket内空间是追加分配的,只记录当前分配到哪个偏移了,下一次分配的时候从当前记录位置往后分配。另外在选择bucket来缓存数据时有两个优先原则:
1)优先考虑io连续性,即使io可能来自于不同的生产者;
2)其次考虑相关性,同一个进程产生的数据尽量缓存到相同的bucket里。
3.2 bkey
bucket的管理是使用b+树索引,而b+树节点中的关键结构就是bkey,bkey就是记录缓存设备缓存数据和后端设备数据的映射关系的,其结构如下。
struct bkey {
uint64_t high;
uint64_t low;
uint64_t ptr[];
}
其中:
- KEY_INODE:表示一个后端设备的id编号(后端设备在cache set中一般以bdev0,bdev1这种方式出现)
- KEY_SIZE:表示该bkey所对应缓存数据的大小
- KEY_DIRTY:表示该块缓存数据是否是脏数据
- KEY_PTRS:表示cache设备的个数(多个ptr是用来支持多个cache设备的,多个cache设备只对脏数据和元数据做镜像)
- KEY_OFFSET:bkey所缓存的hdd上的那段数据区域的结束地址
- PTR_DEV:cache设备
- PTR_OFFSET:在缓存设备中缓存的数据的起始地址
- PTR_GEN:对应cache的bucket的迭代数(版本)
3.3 bset
一个bset是一个bkey的数组,在内存中的bset是一段连续的内存,并且以bkey排序的(bkey之间进行比较的时候是先比较KEY_INODE,如果KEY_INODE相同,再比较KEY_OFFSET)。
bset在磁盘上(缓存设备)有很多,但是内存中一个btree node只有4个bset。
4. bcache中的b+树
4.1 btree结构
bcache中以b+tree来维护索引,一个btree node里包含4个bset,每个bset中是排序的bkey。
图中做了简化,一个btree node只画了两个bset。
每个btree node以一个bkey来标识,该bkey是其子节点中所有bkey中的最大值,不同于标准的b+树那样,父节点存放子节点的地址指针,bcache中的b+树中非叶子节点中存放的bkey用于查找其子节点(并不是存的地址指针),而是根据bkey计算hash,再到hash表中取查找btree node。
叶子节点中的bkey存放的就是实际的映射了(根据这些key可以找到缓存数据以及在hdd上的位置)。
bkey插入到b+树中的过程与标准的b+树的插入过程类似,这里不做细讲。
4.2 btree插入bkey时overlapping的处理
收到新的写io时,这个io对应hdd上的数据可能有部分已经缓存在ssd上了,这个时候为这个io创建的bkey就需要处理这种overlapping的问题。
Btree node 是log structured,磁盘上的btree node有可能有overlap的情况,因为是在不同时候写入的。
但是内存中的btree node不会有overlap,因为插入bkey时如果和内存中的bkey有overlap,就会解决overlap的问题;
另外,从磁盘上读出btree node时就会把bsets中的bkey做归并排序,就会检查overlap的问题并进行解决。
下图给出了一个示例情况。
出现这种overlapping的情况,原来的bkey会做修改(ssd和hdd上的偏移都会修改,还有数据大小)。
插入key或者查找key时,使用待处理的key的start来遍历bset中的bkey(bset中的bkey排序了),找到可能存在overlapping的第一个key,如下图描述的两种情况。
要处理的overlapping情况有以下几种:
读数据的时候也有可能出现这种部分命中,部分miss的情况,本质上和插入key处理overlapping的一样,只不过命中的部分从cache设备中读,miss的部分是从hdd读,并且会重新插入一个新的key来缓存这部分miss的数据。
4.3 btree node split
b+树节点有最大大小限制,在新创建一个btree node时就在内存中分配了一段连续的空间来存放bset及bkey(按照指定的内存pages数),如果超过这个pages数,btree node就会分裂。
在进行key的插入时,递归进行插入,到叶子节点上如果需要分裂,则会以3/5节点大小分裂成2个节点,把分裂成的两个节点的最大key通过op传到上层函数调用中,叶子节点这层函数退出后,在上一层的函数里就会把op中的key添加到keys中(这样就修改了父节点的指针),如果这一层还需要分裂,跟之前的分裂过程一致,分裂后的节点都会立即持久化到ssd中。
5. writeback
bcache支持三种缓存策略:writeback,writethrough,writearoud。
- writethrough就是既写ssd也写hdd,这样读的时候如果命中的话就可以从ssd中读,适应于读多写少的场景;
- writearoud就是绕过ssd直接读写hdd,个人感觉这种方式没啥意义;
- writeback就是ssd做写缓存,所有的写入都是先写缓存,然后会在后台刷脏数据,这里主要介绍writeback。
上文提到bucket内是以追加的方式来分配空间的,一个bucket里面缓存的数据可能对应hdd上的不同位置,甚至有可能在不同的hdd。
不同于flashcache以set为单位刷脏数据(等同于bcache中的bucket),bcache中以bkey为单位来writeback,而不是以bucket为单位。
每个cache set都有一个writeback_keys,记录需要writeback的bkeys。
当满足刷脏数据的条件时(脏数据比例),就会遍历整个b+树,查找dirty的bkey,放到writeback_keys中(writeback_keys有个大小限制),然后按照磁盘偏移排序,再进行刷脏数据的动作,数据写到hdd后,会把对应的bkey从writeback_keys移除,并去掉该bkey的dirty标记。
这样的好处在于刷脏数据的时候可以尽量考虑hdd上的连续性,减少磁头的移动,而如果以bucket为单位刷,一个bucket可能缓存hdd上不同位置的数据,比较随机,刷脏数据的效率不高。
既然bucket内的空间分配是按照追加的方式,每次都是从后面开发分配,而刷脏数据又不是以bucket为单位,那么就会出现bucket中有空洞的情况(bucket中间有些数据已经刷到磁盘上了),导致bucket已用空间不多,但是很多空闲空间比较分散,从后面分配又空间不足。
对于这种情况,bcache中又专门的垃圾回收机制(gc),会把这种情况的bucket给回收掉,从而可以重新使用这个bucket。
writeback方式虽然性能比较高,但是会出现意外宕机情况下的恢复问题。
对于这种情况,bcache能够很好地处理。在引入journal之前(后面会介绍),在数据写入到缓存ssd中后,io并没有返回,而是等到相应的btree node也持久化后,再返回写成功,这样意味着每次写io都会写元数据,宕机的情况下就能够根据元数据恢复出来,不过这种方式就导致每次都要写元数据,而元数据的io都比较小,对于ssd来说,小io有写放大的问题,导致了效率低下。
引入journal后,就解决了这个性能问题。
6. journal
journal不是为了一致性恢复用的,而是为了提高性能。在writeback一节提到没有journal之前每次写操作都会更新元数据(一个bset),为了减少这个开销,引入journal,journal就是插入的keys的log,按照插入时间排序,只用记录叶子节点上bkey的更新,非叶子节点在分裂的时候就已经持久化了。
这样每次写操作在数据写入后就只用记录一下log,在崩溃恢复的时候就可以根据这个log重新插入key。
7. garbage colletction
gc的目的是为了重用buckets。
初始化cache时会在bch_moving_init_cache_set中会初始化一个判断bkey是否可以gc的函数moving_pred,该函数就是判断该key所对应的bucket的GC_SECTORS_USED是否小于cache->gc_move_threshold,如果是则可以gc,否则不能。
gc一般是由invalidate buckets触发的。
bcache使用一个moving_gc_keys(key buf,以红黑树来维护)来存放可以gc的keys,gc时会扫描整个btree,判断哪些bkey是可以gc的,把能够gc的bkey加到moving_gc_keys中,然后就根据这个可以gc的key先从ssd中读出数据,然后同样根据key中记录的hdd上的偏移写到hdd中,成功后把该key从moving_gc_keys中移除。
8. 总结
bcache相对于flashcache要复杂很多,使用b+树来维护索引就可见一斑。
虽然bcache已经进入了内核主线,但是目前使用bcache的人还是比较少的,离商用还有一段距离。
bcache还不太稳定,时不时有些bug出现,而且其稳定性需要比较长的时间来检测。
而flashcache来说就稳定很多,而且有facebook作为其最大的维护者,国内外还有很多公司也在使用,成熟度是公认的。
一、bcache 术语
bcache 设备种类
1. backing 设备
指SSD后面的设备,通常是机械盘。
2. cache 设备
指缓存设备,通常为SSD。
使用bcache分为三个步骤:
1. 创建backing和cache设备
2. 注册backing和cache设备
3. 绑定backing和cache设备
所有设备在使用之前都需要注册。
一个cache设备可以绑定到多个backing设备,但是一个backing设备不可以绑定多个cache设备。
cache模式
writethrough [writeback] writearound none
bucket size
cache设备(ssd)被格式为多个bucket,每个bucket用来缓存一部分backing设备的block。
cache设备将以bucket为最小单位,将数据同步到backing设备,或重用bucket。
使用bucket的好处是减少离散的写操作。
block size
表示cache设备数据块的大小,should match hardware sector size.
ssd通常为4K。
二、bcache 安装
需要安装bcache、bcache-tools包。
需要将块设备的ioscheduler改成deadline的。
三、bcache 设备的部署
1. 环境说明
ssd /dev/dfa
disk /dev/sdb, ...... 若干
2. 创建gpt分区表
#parted -s /dev/dfa mklabel gpt
#parted -s /dev/sdb mklabel gpt
#parted -s /dev/sdc mklabel gpt
3. 分区,注意对齐
#parted -s /dev/dfa mkpart primary 1MiB xxxxGB
#parted -s /dev/dfa mkpart primary xxxxGB yyyyGB
#parted -s /dev/sdb mkpart primary 1MiB zzzzGB
#parted -s /dev/sdc mkpart primary 1MiB zzzzGB
这里将ssd分成2个区,目的是演示多个cache设备的情况,实际上一个SSD不需要分成多个区来使用。
4. 创建backing或cache设备的命令
#make-bcache --help
Usage: make-bcache [options] device
-C, --cache Format a cache device
-B, --bdev Format a backing device
-b, --bucket bucket size
-w, --block block size (hard sector size of SSD, often 2k)
-o, --data-offset data offset in sectors
--cset-uuid UUID for the cache set
--writeback enable writeback
--discard enable discards
--cache_replacement_policy=(lru|fifo)
-h, --help display this help and exit
5. 创建cache设备
指定cache设备的扇区大小,以及bucket大小
#make-bcache -C -b 1MiB -w 4KiB --discard --cache_replacement_policy=lru /dev/dfa1
UUID: 8ef3535a-c42f-49ca-a5ed-2d4207524e22
Set UUID: a01f921f-a91b-46ad-b682-2f59d0be4717
version: 0
nbuckets: 1525878
block_size: 8 单位(扇区,512字节)
bucket_size: 2048 单位(扇区,512字节)
nr_in_set: 1
nr_this_dev: 0
first_bucket: 1
说明
discard
Boolean; if on a discard/TRIM will be issued to each bucket before it is
reused. Defaults to off, since SATA TRIM is an unqueued command (and thus
slow).
block_size
Minimum granularity of writes - should match hardware sector size.
6. 创建backing设备,(data-offset可以用来指定偏移量,达到对齐的目的)。
backing设备与cache设备的block_size必须设置为一样的,所以以两者大的为准即可。
#make-bcache -B --writeback -w 4KiB /dev/sdb1 --wipe-bcache
UUID: d813d8ab-6541-4296-a77e-e35d18d2d6ec
Set UUID: 8033e49c-270d-4a8f-b5e9-f331ac77bf80
version: 1
block_size: 8 单位(扇区,512字节)
data_offset: 16 单位(扇区,512字节)
7. 注册cache, backing设备
echo /dev/dfa1 > /sys/fs/bcache/register
echo /dev/sdb1 > /sys/fs/bcache/register
8. 观察
#ll /sys/block/dfa/dfa1/bcache/
block_size bucket_size clear_stats io_errors nbuckets priority_stats written
btree_written cache_replacement_policy discard metadata_written physical_block_size set/
#ll /sys/block/sdb/sdb1/bcache/
attach dirty_bytes no_cache_wt_pages sequential_cutoff stripe_size writeback_percent writeback_running
cache_mode dirty_data page_cache_enable state tdc/ writeback_rate wt_torture_test
clear_stats drop_page_cache partial_stripes_expensive stats_day/ winto_keys_debug writeback_rate_debug
dc_high_latency_filter_ms io_stats_read readahead stats_five_minute/ writeback_debug writeback_rate_d_term
dc_high_latency_stats io_stats_write read_via_page_cache stats_hour/ writeback_delay writeback_rate_min
detach io_stats_writeback_detail running stats_total/ writeback_flush_enable writeback_rate_p_term_inverse
device/ label sequential_bios stop writeback_metadata writeback_rate_update_seconds
如果注册了cache设备,可以看到bcache的cache set UUID,对应创建cache设备是返回的Set UUID:
#ll /sys/fs/bcache/
total 0
drwxr-xr-x 7 root root 0 Sep 18 16:39 a01f921f-a91b-46ad-b682-2f59d0be4717
drwxr-xr-x 2 root root 0 Sep 18 16:41 bdevs
--w------- 1 root root 4096 Sep 18 16:38 register
--w------- 1 root root 4096 Sep 18 16:41 register_quiet
注册了backing设备,则可以看到对应的bcache设备
#lsblk
sdb 8:16 0 xxT 0 disk
`-sdb1 8:17 0 xxT 0 part
`-bcache0 251:0 0 xxT 0 disk
9. 将cache设备绑定到backing设备
完成这一步,ssd缓存才生效
指定cache设备的UUID(通过#ll /sys/fs/bcache/得到),写入backing设备对应的attach
# echo a01f921f-a91b-46ad-b682-2f59d0be4717 > /sys/block/sdb/sdb1/bcache/attach
10. 检查bcache设备状态
#cat /sys/block/sdb/sdb1/bcache/state
clean
11. 检查 或 修改缓存模式
#cat /sys/block/sdb/sdb1/bcache/cache_mode
writethrough [writeback] writearound none
#echo writethrough > /sys/block/sdb/sdb1/bcache/cache_mode
#cat /sys/block/sdb/sdb1/bcache/cache_mode
[writethrough] writeback writearound none
#echo writeback > /sys/block/sdb/sdb1/bcache/cache_mode
#cat /sys/block/sdb/sdb1/bcache/cache_mode
writethrough [writeback] writearound none
12. 创建文件系统
#mkfs.ext4 /dev/bcache0 -m 0 -O extent,uninit_bg -E lazy_itable_init=1 -T largefile -L sdb1
13. mount 文件系统
mount -o defaults,noatime,nodiratime,nodelalloc,barrier=0 LABEL=sdb1 /disk1
四、bcache 调优
1. backing设备块对齐,如果backing设备是RAID设备,可以将--data-offset设置为raid 条带大小的倍数,避免写放大。
make-bcache --data-offset
如果考虑未来RAID的扩展,则建议这样计算data-offset的值
For example: If you have a 64k stripe size, then the following offset
would provide alignment for many common RAID5 data spindle counts:
64k * 2*2*2*3*3*5*7 bytes = 161280k
That space is wasted, but for only 157.5MB you can grow your RAID 5
volume to the following data-spindle counts without re-aligning:
3,4,5,6,7,8,9,10,12,14,15,18,20,21 ...
2. 调整backing设备的连续IO阈值,表示bcache0设备的连续写IO大于4MB时,大于4MB的部分不会过SSD设备,也不会缓存到ssd,而是直接写backing设备。
echo 4M > /sys/block/bcache0/bcache/sequential_cutoff
3. 如何防止cache设备成为瓶颈
bcache会跟踪每个IO,如果IO的时间超过阈值,则旁路cache设备,直接读写backing设备。
如果你的SSD足够强大,可以不跟踪,减少跟踪的开销。
# echo 0 > /sys/fs/bcache//congested_read_threshold_us
# echo 0 > /sys/fs/bcache//congested_write_threshold_us
关闭旁路的另一个好处是,所有的离散读写都会经过cache设备,从而不会导致cache missing。
默认情况下当读请求超过2ms,写请求超过20ms时,旁路cache设备。
The default is 2000 us (2 milliseconds) for reads, and 20000 for writes.
五、bcache 自启动脚本
重启后,需要重新注册设备,如果修改了bcache的一些配置,也需要重新修改,例如。
echo /dev/dfa1 > /sys/fs/bcache/register
echo /dev/sdb1 > /sys/fs/bcache/register
......
重启后,不需要重新创建cache, backing设备,不需要重新绑定backing和cache设备。
六、bcache 性能测试
使用fio测试bcache设备的性能。
yum install -y libaio
git clone https://github.com/axboe/fio
cd fio
./configure --prefix=/home/digoal/fiohome
make -j 32
make install
export PATH=/home/digoal/fiohome/bin:$PATH
测试性能
假设bcache设备的挂载点为/disk1
fio -filename=/disk1/testdir -direct=1 -iodepth 1 -thread -rw=write -ioengine=libaio -bs=32K -size=16G -numjobs=128 -runtime=60 -group_reporting -name=mytest
fio -filename=/disk1/testdir -direct=1 -iodepth 1 -thread -rw=read -ioengine=libaio -bs=32K -size=16G -numjobs=128 -runtime=60 -group_reporting -name=mytest
fio -filename=/disk1/testdir -direct=1 -iodepth 1 -thread -rw=randwrite -ioengine=libaio -bs=32K -size=16G -numjobs=128 -runtime=60 -group_reporting -name=mytest
fio -filename=/disk1/testdir -direct=1 -iodepth 1 -thread -rw=randread -ioengine=libaio -bs=32K -size=16G -numjobs=128 -runtime=60 -group_reporting -name=mytest
七、bcache 维护
添加backing, cache设备
1. 添加cache设备
#make-bcache -C -b 1MiB -w 4KiB --discard --cache_replacement_policy=lru /dev/dfa2
UUID: cbe1760b-43bf-47f0-94b1-cd1136576873
Set UUID: 826b8b21-1f40-4a1d-ad2b-84f1ecbb4c45
version: 0
nbuckets: 1525878
block_size: 8
bucket_size: 2048
nr_in_set: 1
nr_this_dev: 0
first_bucket: 1
2. 注册cache设备
echo /dev/dfa2 > /sys/fs/bcache/register
3. 添加backing设备
# make-bcache -B --writeback -w 4KiB /dev/sdc1 --wipe-bcache
UUID: e406b0b2-69f9-4f4c-8b18-2d314ce6ed35
Set UUID: f8877b48-3c59-40a7-919e-029ce2d3249d
version: 1
block_size: 8
data_offset: 16
4. 注册backing设备
echo /dev/sdc1 > /sys/fs/bcache/register
5. 绑定cache与backing设备
echo 826b8b21-1f40-4a1d-ad2b-84f1ecbb4c45 > /sys/block/sdc/sdc1/bcache/attach
使用以上方法,将所有的backding盘都操作一下(mklabel gpt, mkpart, make-bcache -B, register, 绑定)。
现在变成这样的。
#lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdb 8:16 0 xxT 0 disk
`-sdb1 8:17 0 xxT 0 part
`-bcache0 251:0 0 xxT 0 disk /disk1
sdc 8:32 0 xxT 0 disk
`-sdc1 8:33 0 xxT 0 part
`-bcache1 251:1 0 xxT 0 disk
......
dfa 252:0 0 zzT 0 disk
|-dfa1 252:1 0 yyT 0 part
| |-bcache0 251:0 0 xxT 0 disk /disk1
| |-bcache1 251:1 0 xxT 0 disk
| |...
| `-bcache5 251:5 0 xxT 0 disk
`-dfa2 252:2 0 yyT 0 part
|-bcache6 251:6 0 xxT 0 disk
|-bcache7 251:7 0 xxT 0 disk
|...
`-bcache11 251:11 0 xxT 0 disk
12块机械盘分别绑定到两个cache设备上。
注意
如果有多个SSD设备都需要作为一个backing设备的cache设备的话,可以使用lvm将ssd做成条带,从而提升cache设备的整体IO能力和带宽能力。
然后再将lvm设备作为cache设备即可。
如果是多个backding设备,则可以像以上的方法一样,不同的backing设备绑定不同的cache设备。
删除backing, cache设备
步骤如下
1. umount 挂载点
2. 如果有在bcache设备上建立了软RAID或者逻辑卷,首先要解除这层关系
lvremove
vgremove
pvremove
或
mdadm -S md设备
3. 删除cache设备前,必须确保没有任何与之绑定的backing设备,解除backing与cache设备的绑定 (detach)
echo 1 > /sys/block/sdX/sdX[Y]/bcache/detach
echo /sys/block/bcache/bcache/detach
4. 停止 backing设备
detach cache设备后,我们还需要这一步,才能删除backing设备。
echo 1 > /sys/block/sdX/sdX[Y]/bcache/stop
5. unregister cache 设备
echo 1 > /sys/fs/bcache//stop
echo 1 > /sys/fs/bcache/[SSD bcache UUID]/unregister
6. wipe-cache -a 清理块设备的头信息
wipe-cache -a /dev/dfa1
wipe-cache -a /dev/sd[b-m]1
八、软raid on bcache
使用bcache盘,构建软RAID的例子
1. 4块bcache盘,创建raid5
#mdadm --create --verbose /dev/md0 -c 4M --level=5 --raid-devices=4 /dev/bcache[0-3]
2. 4块bcache盘,创建raid5
#mdadm --create --verbose /dev/md1 -c 4M --level=5 --raid-devices=4 /dev/bcache[4-7]
3. 4块bcache盘,创建raid5
#mdadm --create --verbose /dev/md2 -c 4M --level=5 --raid-devices=4 /dev/bcache[89] /dev/bcache1[01]
4. 2块软raid盘,创建raid0
#mdadm --create --verbose /dev/md3 -c 4M --level=0 --raid-devices=2 /dev/md[01]
5. 创建文件系统
stride即条带的大小(单位512字节)
stripe-width即条带的宽度,等于stride*实际的数据盘数量(扣掉校验盘和mirror盘)
#/home/digoal/e2fsprogs/sbin/mkfs.ext4 /dev/md3 -b 4096 -m 0 -O extent,uninit_bg -E lazy_itable_init=1,stripe-width=192,stride=32 -T largefile -L md3
#/home/digoal/e2fsprogs/sbin/mkfs.ext4 /dev/md2 -b 4096 -m 0 -O extent,uninit_bg -E lazy_itable_init=1,stripe-width=96,stride=32 -T largefile -L md2
6. 加载文件系统
stripe等于创建时指定的stripe-width
#mount -o defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback,stripe=192 LABEL=md3 /disk1
#mount -o defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback,stripe=96 LABEL=md2 /disk2
九、lvm on bcache
使用bcache设备构建逻辑卷。
1. 添加bcache lvm支持
vi /etc/lvm.conf
types = [ "bcache", 16 ]
2. 创建PV
pvcreate /dev/bcache[0-9]
pvcreate /dev/bcache1[01]
3. 创建VG
设置一个PE的大小
vgcreate -s 128M vgdata01 /dev/bcache[0-9] /dev/bcache1[01]
#vgs
VG #PV #LV #SN Attr VSize VFree
vgdata01 12 0 0 wz--n- 87.33t 87.33t
4. 创建 lvm raid时触发BUG,LVM版本太老
#lvcreate --type raid5 -i 11 -I 4M -l 100%VG -n lv01 vgdata01
WARNING: RAID segment types are considered Tech Preview
For more information on Tech Preview features, visit:
https://access.redhat.com/support/offerings/techpreview/
Rounding size (715392 extents) up to stripe boundary size (715396 extents)
Volume group "vgdata01" has insufficient free space (715392 extents): 715396 required.
man lvmcreate
Sistina Software UK LVM TOOLS 2.02.87(2)-RHEL6 (2011-10-12)
dmesg
[32634.575210] device-mapper: raid: Supplied region_size (1024 sectors) below minimum (8943)
[32634.583958] device-mapper: table: 253:24: raid: Supplied region size is too small
[32634.592008] device-mapper: ioctl: error adding target to table
原因,测试环境为CentOS 6.3以前的版本,所以LVM2版本很老,存在BUG
https://bugzilla.redhat.com/show_bug.cgi?id=837927
RAID: Fix problems with creating, extending and converting large RAID LVs
MD's bitmaps can handle 2^21 regions at most. The RAID code has always
used a region_size of 1024 sectors. That means the size of a RAID LV was
limited to 1TiB. (The user can adjust the region_size when creating a
RAID LV, which can affect the maximum size.) Thus, creating, extending or
converting to a RAID LV greater than 1TiB would result in a failure to
load the new device-mapper table.
Again, the size of the RAID LV is not limited by how much space is allocated
for the metadata area, but by the limitations of the MD bitmap. Therefore,
we must adjust the 'region_size' to ensure that the number of regions does
not exceed the limit. I've added code to do this when extending a RAID LV
(which covers 'create' and 'extend' operations) and when up-converting -
specifically from linear to RAID1.
Fix verified in the latest rpms.
2.6.32-348.el6.x86_64
lvm2-2.02.98-6.el6 BUILT: Thu Dec 20 07:00:04 CST 2012
lvm2-libs-2.02.98-6.el6 BUILT: Thu Dec 20 07:00:04 CST 2012
lvm2-cluster-2.02.98-6.el6 BUILT: Thu Dec 20 07:00:04 CST 2012
udev-147-2.43.el6 BUILT: Thu Oct 11 05:59:38 CDT 2012
device-mapper-1.02.77-6.el6 BUILT: Thu Dec 20 07:00:04 CST 2012
device-mapper-libs-1.02.77-6.el6 BUILT: Thu Dec 20 07:00:04 CST 2012
device-mapper-event-1.02.77-6.el6 BUILT: Thu Dec 20 07:00:04 CST 2012
device-mapper-event-libs-1.02.77-6.el6 BUILT: Thu Dec 20 07:00:04 CST 2012
cmirror-2.02.98-6.el6 BUILT: Thu Dec 20 07:00:04 CST 2012
5. 创建普通lvm正常(-I 大点,对于OLAP系统更好)
# lvcreate -i 12 -I 4M -l 100%VG -n lv01 vgdata01
Logical volume "lv01" created
# lvs
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
lv01 vgdata01 -wi-a- 87.33t
# /home/digoal/e2fsprogs/sbin/mkfs.ext4 /dev/mapper/vgdata01-lv01 -b 4096 -m 0 -O extent,uninit_bg -E lazy_itable_init=1,stripe-width=24,stride=2 -T largefile -L lv01
# mount -o defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback,stripe=24 LABEL=lv01 /disk1
6. 使用新版本LVM2解决BUG
更新LVM2版本
# tar -zxvf LVM2.2.02.165.tgz
# cd LVM2.2.02.165
# sudo
# ./configure --prefix=/home/digoal/lvm2 ; make -j 32 ; make install
#export LD_LIBRARY_PATH=/home/digoal/lvm2/lib:$LD_LIBRARY_PATH
#export PATH=/home/digoal/lvm2/sbin:$PATH
#export MANPATH=/home/digoal/lvm2/share/man:$MANPATH
# /home/digoal/lvm2/sbin/pvs
PV VG Fmt Attr PSize PFree
/dev/bcache0 vgdata01 lvm2 a-- 7.28t 7.28t
/dev/bcache1 vgdata01 lvm2 a-- 7.28t 7.28t
/dev/bcache10 vgdata01 lvm2 a-- 7.28t 7.28t
/dev/bcache11 vgdata01 lvm2 a-- 7.28t 7.28t
/dev/bcache2 vgdata01 lvm2 a-- 7.28t 7.28t
/dev/bcache3 vgdata01 lvm2 a-- 7.28t 7.28t
/dev/bcache4 vgdata01 lvm2 a-- 7.28t 7.28t
/dev/bcache5 vgdata01 lvm2 a-- 7.28t 7.28t
/dev/bcache6 vgdata01 lvm2 a-- 7.28t 7.28t
/dev/bcache7 vgdata01 lvm2 a-- 7.28t 7.28t
/dev/bcache8 vgdata01 lvm2 a-- 7.28t 7.28t
/dev/bcache9 vgdata01 lvm2 a-- 7.28t 7.28t
# man /home/digoal/lvm2/share/man/man8/lvcreate.8
6.1. 创建2个raid5逻辑卷,分别使用8块,4块bcache盘。
lvcreate -i 表示实际的数据盘数量(需要扣除校验盘,mirror盘)。
lvcreate -I 单位KB,表示写多少内容后开始写下一个数据盘,即表示条带的大小。
mkfs.ext4 stride=16表示条带大小,单位扇区(512字节),stripe-width=112 表示条带宽度(=stride * -i)。
#/home/digoal/lvm2/sbin/lvcreate --type raid5 -i 7 -I 64 -l 100%PVS -n lv01 vgdata01 /dev/bcache[0-7]
Rounding size 58.22 TiB (476928 extents) up to stripe boundary size 58.22 TiB (476931 extents).
Logical volume "lv01" created.
#/home/digoal/e2fsprogs/sbin/mkfs.ext4 /dev/mapper/vgdata01-lv01 -b 4096 -m 0 -O extent,uninit_bg -E lazy_itable_init=1,stripe-width=112,stride=16 -T largefile -L lv01
#mount -o defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback,stripe=112 LABEL=lv01 /disk1
/home/digoal/lvm2/sbin/lvcreate --type raid5 -i 3 -I 64 -l 100%PVS -n lv02 vgdata01 /dev/bcache[89] /dev/bcache1[01]
/home/digoal/e2fsprogs/sbin/mkfs.ext4 /dev/mapper/vgdata01-lv02 -b 4096 -m 0 -O extent,uninit_bg -E lazy_itable_init=1,stripe-width=48,stride=16 -T largefile -L lv02
mount -o defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback,stripe=48 LABEL=lv02 /disk2
6.2. 创建raid10逻辑卷
/home/digoal/lvm2/sbin/lvcreate --type raid10 -i 6 -I 128 -l 100%VG -n lv01 vgdata01
/home/digoal/e2fsprogs/sbin/mkfs.ext4 /dev/mapper/vgdata01-lv01 -b 4096 -m 0 -O extent,uninit_bg -E lazy_itable_init=1,stripe-width=192,stride=32 -T largefile -L lv01
mount -o defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback,stripe=192 LABEL=lv01 /disk1
十、配置参数或模块参数参考
有些值可以被修改,达到调整的目的。
1. SYSFS - BACKING DEVICE
Available at /sys/block//bcache, /sys/block/bcache*/bcache and
(if attached) /sys/fs/bcache//bdev*
attach
Echo the UUID of a cache set to this file to enable caching.
cache_mode
Can be one of either writethrough, writeback, writearound or none.
clear_stats
Writing to this file resets the running total stats (not the day/hour/5 minute
decaying versions).
detach
Write to this file to detach from a cache set. If there is dirty data in the
cache, it will be flushed first.
dirty_data
Amount of dirty data for this backing device in the cache. Continuously
updated unlike the cache set's version, but may be slightly off.
label
Name of underlying device.
readahead
Size of readahead that should be performed. Defaults to 0. If set to e.g.
1M, it will round cache miss reads up to that size, but without overlapping
existing cache entries.
running
1 if bcache is running (i.e. whether the /dev/bcache device exists, whether
it's in passthrough mode or caching).
sequential_cutoff
A sequential IO will bypass the cache once it passes this threshold; the
most recent 128 IOs are tracked so sequential IO can be detected even when
it isn't all done at once.
sequential_merge
If non zero, bcache keeps a list of the last 128 requests submitted to compare
against all new requests to determine which new requests are sequential
continuations of previous requests for the purpose of determining sequential
cutoff. This is necessary if the sequential cutoff value is greater than the
maximum acceptable sequential size for any single request.
state
The backing device can be in one of four different states:
no cache: Has never been attached to a cache set.
clean: Part of a cache set, and there is no cached dirty data.
dirty: Part of a cache set, and there is cached dirty data.
inconsistent: The backing device was forcibly run by the user when there was
dirty data cached but the cache set was unavailable; whatever data was on the
backing device has likely been corrupted.
stop
Write to this file to shut down the bcache device and close the backing
device.
writeback_delay
When dirty data is written to the cache and it previously did not contain
any, waits some number of seconds before initiating writeback. Defaults to
30.
writeback_percent
If nonzero, bcache tries to keep around this percentage of the cache dirty by
throttling background writeback and using a PD controller to smoothly adjust
the rate.
writeback_rate
Rate in sectors per second - if writeback_percent is nonzero, background
writeback is throttled to this rate. Continuously adjusted by bcache but may
also be set by the user.
writeback_running
If off, writeback of dirty data will not take place at all. Dirty data will
still be added to the cache until it is mostly full; only meant for
benchmarking. Defaults to on.
2. SYSFS - BACKING DEVICE STATS:
There are directories with these numbers for a running total, as well as
versions that decay over the past day, hour and 5 minutes; they're also
aggregated in the cache set directory as well.
bypassed
Amount of IO (both reads and writes) that has bypassed the cache
cache_hits
cache_misses
cache_hit_ratio
Hits and misses are counted per individual IO as bcache sees them; a
partial hit is counted as a miss.
cache_bypass_hits
cache_bypass_misses
Hits and misses for IO that is intended to skip the cache are still counted,
but broken out here.
cache_miss_collisions
Counts instances where data was going to be inserted into the cache from a
cache miss, but raced with a write and data was already present (usually 0
since the synchronization for cache misses was rewritten)
cache_readaheads
Count of times readahead occurred.
3. SYSFS - CACHE SET:
Available at /sys/fs/bcache/
average_key_size
Average data per key in the btree.
bdev<0..n>
Symlink to each of the attached backing devices.
block_size
Block size of the cache devices.
btree_cache_size
Amount of memory currently used by the btree cache
bucket_size
Size of buckets
cache<0..n>
Symlink to each of the cache devices comprising this cache set.
cache_available_percent
Percentage of cache device which doesn't contain dirty data, and could
potentially be used for writeback. This doesn't mean this space isn't used
for clean cached data; the unused statistic (in priority_stats) is typically
much lower.
clear_stats
Clears the statistics associated with this cache
dirty_data
Amount of dirty data is in the cache (updated when garbage collection runs).
flash_vol_create
Echoing a size to this file (in human readable units, k/M/G) creates a thinly
provisioned volume backed by the cache set.
io_error_halflife
io_error_limit
These determines how many errors we accept before disabling the cache.
Each error is decayed by the half life (in # ios). If the decaying count
reaches io_error_limit dirty data is written out and the cache is disabled.
journal_delay_ms
Journal writes will delay for up to this many milliseconds, unless a cache
flush happens sooner. Defaults to 100.
root_usage_percent
Percentage of the root btree node in use. If this gets too high the node
will split, increasing the tree depth.
stop
Write to this file to shut down the cache set - waits until all attached
backing devices have been shut down.
tree_depth
Depth of the btree (A single node btree has depth 0).
unregister
Detaches all backing devices and closes the cache devices; if dirty data is
present it will disable writeback caching and wait for it to be flushed.
4. SYSFS - CACHE SET INTERNAL:
This directory also exposes timings for a number of internal operations, with
separate files for average duration, average frequency, last occurrence and max
duration: garbage collection, btree read, btree node sorts and btree splits.
active_journal_entries
Number of journal entries that are newer than the index.
btree_nodes
Total nodes in the btree.
btree_used_percent
Average fraction of btree in use.
bset_tree_stats
Statistics about the auxiliary search trees
btree_cache_max_chain
Longest chain in the btree node cache's hash table
cache_read_races
Counts instances where while data was being read from the cache, the bucket
was reused and invalidated - i.e. where the pointer was stale after the read
completed. When this occurs the data is reread from the backing device.
trigger_gc
Writing to this file forces garbage collection to run.
5. SYSFS - CACHE DEVICE:
Available at /sys/block//bcache
block_size
Minimum granularity of writes - should match hardware sector size.
btree_written
Sum of all btree writes, in (kilo/mega/giga) bytes
bucket_size
Size of buckets
cache_replacement_policy
One of either lru, fifo or random.
discard
Boolean; if on a discard/TRIM will be issued to each bucket before it is
reused. Defaults to off, since SATA TRIM is an unqueued command (and thus
slow).
freelist_percent
Size of the freelist as a percentage of nbuckets. Can be written to to
increase the number of buckets kept on the freelist, which lets you
artificially reduce the size of the cache at runtime. Mostly for testing
purposes (i.e. testing how different size caches affect your hit rate), but
since buckets are discarded when they move on to the freelist will also make
the SSD's garbage collection easier by effectively giving it more reserved
space.
io_errors
Number of errors that have occurred, decayed by io_error_halflife.
metadata_written
Sum of all non data writes (btree writes and all other metadata).
nbuckets
Total buckets in this cache
priority_stats
Statistics about how recently data in the cache has been accessed.
This can reveal your working set size. Unused is the percentage of
the cache that doesn't contain any data. Metadata is bcache's
metadata overhead. Average is the average priority of cache buckets.
Next is a list of quantiles with the priority threshold of each.
written
Sum of all data that has been written to the cache; comparison with
btree_written gives the amount of write inflation in bcache.
十一、参考
http://www.sysnote.org/2014/06/20/bcache-analysis/
http://www.sysnote.org/2014/05/29/bcache-use/
http://blog.csdn.net/liangchen0322/article/details/50322635
https://wiki.archlinux.org/index.php/Bcache
https://github.com/axboe/fio
http://elf8848.iteye.com/blog/2168876
http://www.atatech.org/articles/62373
https://www.kernel.org/doc/Documentation/bcache.txt
http://raid.wiki.kernel.org/
http://wushank.blog.51cto.com/3489095/1114437
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Logical_Volume_Manager_Administration/raid_volumes.html
https://sourceware.org/lvm2/
https://linux.die.net/man/1/fio