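Overview of the Principles
File systems generally manage space with one of two techniques: a bitmap or a tree.
A bitmap divides all the space the file system manages into blocks (Windows calls them clusters), and each block maps to one bit whose two states mean free or allocated. Together these bits form the bitmap. To allocate space, the file system searches the bitmap for a run of free bits and marks them allocated; to release space, it resets the corresponding bits to free.
The tree approach describes free or allocated space with extent records. A tree of free space, for example, might be a table of records of the form "{start of free range, length of free range}". To allocate, the file system looks for a suitable free extent according to some priority rule and then adjusts the tree; to release, it rebuilds the affected records and restructures the tree.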
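Both approaches work well enough on small file systems. This is especially true of the bitmap: with the common 4K block size, a 1T file system needs a 32M bitmap (1T/4K/8), which fits in memory comfortably, and freeing space costs little. The tree approach is computationally heavier than the bitmap, but on a small file system the number of fragments usually stays within an acceptable range, so the overall performance impact is also small.
As file systems grow, the cost of both approaches becomes visible. On a 16T file system that still uses 4K blocks, the bitmap grows to 512M (16T/4K/8). Loading all of it into memory is a heavy burden, and it is hard to write it all back to disk promptly, so coping with events such as sudden power loss consumes a great deal of resources (and complicates the design). If a file is allocated in a very scattered way, allocating or freeing it touches many bitmap locations with random IO, and IO performance becomes very poor. Finding a single free spot in such a large space means scanning the whole 512M bitmap in the worst case, which is also inefficient.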
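The bitmap sizes quoted above follow from one bit per block. A minimal Python sketch of that arithmetic (the function name `bitmap_bytes` is hypothetical, and sizes are assumed to be binary units):

```python
def bitmap_bytes(fs_bytes: int, block_bytes: int = 4096) -> int:
    """Size of a one-bit-per-block bitmap, rounded up to whole bytes."""
    blocks = -(-fs_bytes // block_bytes)  # ceiling division
    return -(-blocks // 8)                # eight blocks per bitmap byte


TIB = 1 << 40
MIB = 1 << 20

for size_tib in (1, 16):
    size = bitmap_bytes(size_tib * TIB)
    print(f"{size_tib} TiB file system, 4 KiB blocks: {size // MIB} MiB bitmap")
```

With 4 KiB blocks the bitmap grows linearly with capacity: 32 MiB at 1 TiB and 512 MiB at 16 TiB.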
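The tree approach is more balanced. It usually consumes less memory than a bitmap, and adjacent records can be merged, so searching for free space does not slow down noticeably as the file system fills up. However, freeing and allocating space keep restructuring the tree, and the larger the tree, the heavier that load. Worst of all, neither allocation nor release can guarantee that the affected records are contiguous, so the inefficiency of random disk IO remains (although a well-designed tree structure leaves plenty of room to hide these drawbacks).
These two common approaches were, of course, only the lead-in to ZFS.
ZFS manages space with neither a bitmap nor a tree. It uses a third method that very few file systems adopt.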
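Layer 1: the space to be managed is divided evenly into no more than 200 segments (a constant defined in the source code) of equal size, and the segment size must be 2^N. In other words, the total space is divided by 200, the result is rounded up to 2^N, and the whole space is cut into pieces of that size. Each piece is called a metaslab. For example, a total space of 150G divided by 200 gives 768M, which rounds up to 2^30, that is 1GB, yielding 150 metaslabs of 1GB each.
The metaslab is the first step of divide and conquer: each metaslab has its own space management records. (The benefit is that allocations and releases rarely need to be considered globally.)
In ZFS the allocation state of a metaslab corresponds to one object, which can simply be thought of as a file (a metafile in the MOS) whose content is the allocation state. If nothing has been allocated in the metaslab (all of its space is free), no space record table and no metafile are allocated for it.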
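The sizing rule for layer 1 can be sketched in a few lines of Python. This is an illustration, not the ZFS implementation: the names `metaslab_shift` and `metaslab_count` are hypothetical, the target of 200 is taken from the text, and the rounding is assumed to pick the smallest power of two strictly greater than the quotient. Real ZFS versions may apply additional limits.

```python
METASLABS_PER_VDEV = 200  # target count described in the text (assumed)


def metaslab_shift(asize: int, target: int = METASLABS_PER_VDEV) -> int:
    """log2 of the metaslab size: the smallest power of two strictly
    greater than asize // target."""
    return (asize // target).bit_length()


def metaslab_count(asize: int, shift: int) -> int:
    """Number of whole metaslabs that fit; any remainder is not counted."""
    return asize >> shift


GIB = 1 << 30
shift = metaslab_shift(150 * GIB)
print(f"metaslab size 2^{shift}, count {metaslab_count(150 * GIB, shift)}")
```

For the 150G example this gives a shift of 30 (1GB metaslabs) and 150 metaslabs, matching the description above.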
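Layer 2: each metaslab keeps a histogram-style table that records how many contiguous free fragments of each size it contains: how many of 1 sector, how many of 1K, of 2K, 4K, 8K, 16K, and so on up to 1G, 2G, etc. This lets the allocator quickly decide whether the metaslab holds a fragment that fits the request well.
Layer 3: each metaslab records its own allocation log as a running ledger (the space map), and every release or allocation is appended to the end of that ledger. When the metaslab needs to allocate or release space, ZFS first reads the ledger in chronological order; once it has been read, the real allocation map of the metaslab exists in memory, and allocation then happens in memory. At a transaction boundary, the in-memory summary (layer 2) and the ledger entries, which were merged along the way while in memory, are written back to disk. (Space map entries can be merged where appropriate, and old records may already have been merged, so the ledger does not grow especially large.)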
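The advantages of this design include at least the following:
1. The space management structures of the whole file system stay compact; they do not become bloated because the file system is very large or nearly full. The bookkeeping matches the real complexity: space management is very simple when the file system is new, and it becomes simple again as the file system fills up (allocations merge and there are fewer records), which is reasonable.
2. Divide and conquer: allocations are preferentially made inside a single metaslab, which amounts to using a small contiguous region of disk to handle the current transaction efficiently. (Compare this with a bitmap: the user may need only 1M of space, yet the file system driver has to read the whole bitmap, and the unrelated parts are almost pure overhead.)
3. Ledger-style records simplify the tree operations. ZFS performs all tree operations in memory; on disk, apart from the fragment summary table, there are only sequentially written records. This reduces the number of disk IO round trips (random data changes triggered by a tree are the most damaging) and improves file system performance.
The one apparent drawback (note by 张宇, www.datahf.net: inferred from theory only, not verified by detailed experiments) is that the tree must be rebuilt in memory whenever a metaslab is used for allocation, so under light workloads the computational load is higher than in a traditional file system. Under light workloads, however, a slightly higher computational load hardly matters, since the CPU and memory are probably idle anyway.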
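That covers the theory. The rest of this article illustrates it with concrete command output.
The following commands create a zpool that, simply put, consists of two vdevs, one of 30G and one of 60G.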
qemu-img create -f vmdk case4.1.vmdk 30G
qemu-img create -f vmdk case4.2.vmdk 60G
qemu-nbd -c /dev/nbd0 case4.1.vmdk
qemu-nbd -c /dev/nbd1 case4.2.vmdk
zpool create -f case4 /dev/nbd0 /dev/nbd1
I. Use zdb -l to list the labels of the two vdevs; the only purpose is to obtain metaslab_array and metaslab_shift. The output shows:
vdev0 (case4.1.vmdk, 30G), /dev/nbd0:
metaslab_array: 37 ---- the metaslab allocation table of vdev0 is managed by metafile 37.
metaslab_shift: 28 ---- the metaslab size of vdev0 is 2^28 = 256M.
vdev1 (case4.2.vmdk, 60G), /dev/nbd1:
metaslab_array: 34 ---- the metaslab allocation table of vdev1 is managed by metafile 34.
metaslab_shift: 29 ---- the metaslab size of vdev1 is 2^29 = 512M.
Command output:
[root@localhost case4]# zdb -l /dev/nbd0
--------------------------------------------
LABEL 0
--------------------------------------------
version: 5000
name: 'case4'
state: 0
txg: 4
pool_guid: 4712723554953817788
errata: 0
hostname: 'localhost'
top_guid: 10025130926649767584
guid: 10025130926649767584
vdev_children: 2
vdev_tree:
type: 'disk'
id: 0
guid: 10025130926649767584
path: '/dev/nbd0'
whole_disk: 0
metaslab_array: 37
metaslab_shift: 28
ashift: 9
asize: 32207536128
is_log: 0
create_txg: 4
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
--------------------------------------------
LABEL 1
--------------------------------------------
version: 5000
name: 'case4'
state: 0
txg: 4
pool_guid: 4712723554953817788
errata: 0
hostname: 'localhost'
top_guid: 10025130926649767584
guid: 10025130926649767584
vdev_children: 2
vdev_tree:
type: 'disk'
id: 0
guid: 10025130926649767584
path: '/dev/nbd0'
whole_disk: 0
metaslab_array: 37
metaslab_shift: 28
ashift: 9
asize: 32207536128
is_log: 0
create_txg: 4
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
--------------------------------------------
LABEL 2
--------------------------------------------
version: 5000
name: 'case4'
state: 0
txg: 4
pool_guid: 4712723554953817788
errata: 0
hostname: 'localhost'
top_guid: 10025130926649767584
guid: 10025130926649767584
vdev_children: 2
vdev_tree:
type: 'disk'
id: 0
guid: 10025130926649767584
path: '/dev/nbd0'
whole_disk: 0
metaslab_array: 37
metaslab_shift: 28
ashift: 9
asize: 32207536128
is_log: 0
create_txg: 4
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
--------------------------------------------
LABEL 3
--------------------------------------------
version: 5000
name: 'case4'
state: 0
txg: 4
pool_guid: 4712723554953817788
errata: 0
hostname: 'localhost'
top_guid: 10025130926649767584
guid: 10025130926649767584
vdev_children: 2
vdev_tree:
type: 'disk'
id: 0
guid: 10025130926649767584
path: '/dev/nbd0'
whole_disk: 0
metaslab_array: 37
metaslab_shift: 28
ashift: 9
asize: 32207536128
is_log: 0
create_txg: 4
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
Label of the second vdev (partial; LABEL 1, 2, and 3 are omitted):
[root@localhost case4]# zdb -l /dev/nbd1
--------------------------------------------
LABEL 0
--------------------------------------------
version: 5000
name: 'case4'
state: 0
txg: 4
pool_guid: 4712723554953817788
errata: 0
hostname: 'localhost'
top_guid: 5318385741477692907
guid: 5318385741477692907
vdev_children: 2
vdev_tree:
type: 'disk'
id: 1
guid: 5318385741477692907
path: '/dev/nbd1'
whole_disk: 0
metaslab_array: 34
metaslab_shift: 29
ashift: 9
asize: 64419790848
is_log: 0
create_txg: 4
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
II. Use zdb -m to list a summary of every metaslab; this is effectively an interpretation of the contents of metafiles 37 and 34.
The output shows:
vdev0 (case4.1.vmdk, 30G), /dev/nbd0:
metaslab 0 (covering the 256M at index 0): its allocation map is managed by metafile 38
metaslab 23 (covering the 256M at index 23): its allocation map is managed by metafile 39
the remaining metaslabs have nothing allocated
vdev1 (case4.2.vmdk, 60G), /dev/nbd1:
metaslab 0 (covering the 512M at index 0): its allocation map is managed by metafile 36
metaslab 23 (covering the 512M at index 23): its allocation map is managed by metafile 35
the remaining metaslabs have nothing allocated
[root@localhost case4]# zdb -m case4
Metaslabs:
vdev 0
metaslabs 119 offset spacemap free
--------------- ------------------- --------------- -------------
metaslab 0 offset 0 spacemap 38 free 256M
metaslab 1 offset 10000000 spacemap 0 free 256M
metaslab 2 offset 20000000 spacemap 0 free 256M
metaslab 3 offset 30000000 spacemap 0 free 256M
metaslab 4 offset 40000000 spacemap 0 free 256M
metaslab 5 offset 50000000 spacemap 0 free 256M
metaslab 6 offset 60000000 spacemap 0 free 256M
metaslab 7 offset 70000000 spacemap 0 free 256M
metaslab 8 offset 80000000 spacemap 0 free 256M
metaslab 9 offset 90000000 spacemap 0 free 256M
metaslab 10 offset a0000000 spacemap 0 free 256M
metaslab 11 offset b0000000 spacemap 0 free 256M
metaslab 12 offset c0000000 spacemap 0 free 256M
metaslab 13 offset d0000000 spacemap 0 free 256M
metaslab 14 offset e0000000 spacemap 0 free 256M
metaslab 15 offset f0000000 spacemap 0 free 256M
metaslab 16 offset 100000000 spacemap 0 free 256M
metaslab 17 offset 110000000 spacemap 0 free 256M
metaslab 18 offset 120000000 spacemap 0 free 256M
metaslab 19 offset 130000000 spacemap 0 free 256M
metaslab 20 offset 140000000 spacemap 0 free 256M
metaslab 21 offset 150000000 spacemap 0 free 256M
metaslab 22 offset 160000000 spacemap 0 free 256M
metaslab 23 offset 170000000 spacemap 39 free 256M
metaslab 24 offset 180000000 spacemap 0 free 256M
metaslab 25 offset 190000000 spacemap 0 free 256M
metaslab 26 offset 1a0000000 spacemap 0 free 256M
metaslab 27 offset 1b0000000 spacemap 0 free 256M
metaslab 28 offset 1c0000000 spacemap 0 free 256M
metaslab 29 offset 1d0000000 spacemap 0 free 256M
metaslab 30 offset 1e0000000 spacemap 0 free 256M
metaslab 31 offset 1f0000000 spacemap 0 free 256M
metaslab 32 offset 200000000 spacemap 0 free 256M
metaslab 33 offset 210000000 spacemap 0 free 256M
metaslab 34 offset 220000000 spacemap 0 free 256M
metaslab 35 offset 230000000 spacemap 0 free 256M
metaslab 36 offset 240000000 spacemap 0 free 256M
metaslab 37 offset 250000000 spacemap 0 free 256M
metaslab 38 offset 260000000 spacemap 0 free 256M
metaslab 39 offset 270000000 spacemap 0 free 256M
metaslab 40 offset 280000000 spacemap 0 free 256M
metaslab 41 offset 290000000 spacemap 0 free 256M
metaslab 42 offset 2a0000000 spacemap 0 free 256M
metaslab 43 offset 2b0000000 spacemap 0 free 256M
metaslab 44 offset 2c0000000 spacemap 0 free 256M
metaslab 45 offset 2d0000000 spacemap 0 free 256M
metaslab 46 offset 2e0000000 spacemap 0 free 256M
metaslab 47 offset 2f0000000 spacemap 0 free 256M
metaslab 48 offset 300000000 spacemap 0 free 256M
metaslab 49 offset 310000000 spacemap 0 free 256M
metaslab 50 offset 320000000 spacemap 0 free 256M
metaslab 51 offset 330000000 spacemap 0 free 256M
metaslab 52 offset 340000000 spacemap 0 free 256M
metaslab 53 offset 350000000 spacemap 0 free 256M
metaslab 54 offset 360000000 spacemap 0 free 256M
metaslab 55 offset 370000000 spacemap 0 free 256M
metaslab 56 offset 380000000 spacemap 0 free 256M
metaslab 57 offset 390000000 spacemap 0 free 256M
metaslab 58 offset 3a0000000 spacemap 0 free 256M
metaslab 59 offset 3b0000000 spacemap 0 free 256M
metaslab 60 offset 3c0000000 spacemap 0 free 256M
metaslab 61 offset 3d0000000 spacemap 0 free 256M
metaslab 62 offset 3e0000000 spacemap 0 free 256M
metaslab 63 offset 3f0000000 spacemap 0 free 256M
metaslab 64 offset 400000000 spacemap 0 free 256M
metaslab 65 offset 410000000 spacemap 0 free 256M
metaslab 66 offset 420000000 spacemap 0 free 256M
metaslab 67 offset 430000000 spacemap 0 free 256M
metaslab 68 offset 440000000 spacemap 0 free 256M
metaslab 69 offset 450000000 spacemap 0 free 256M
metaslab 70 offset 460000000 spacemap 0 free 256M
metaslab 71 offset 470000000 spacemap 0 free 256M
metaslab 72 offset 480000000 spacemap 0 free 256M
metaslab 73 offset 490000000 spacemap 0 free 256M
metaslab 74 offset 4a0000000 spacemap 0 free 256M
metaslab 75 offset 4b0000000 spacemap 0 free 256M
metaslab 76 offset 4c0000000 spacemap 0 free 256M
metaslab 77 offset 4d0000000 spacemap 0 free 256M
metaslab 78 offset 4e0000000 spacemap 0 free 256M
metaslab 79 offset 4f0000000 spacemap 0 free 256M
metaslab 80 offset 500000000 spacemap 0 free 256M
metaslab 81 offset 510000000 spacemap 0 free 256M
metaslab 82 offset 520000000 spacemap 0 free 256M
metaslab 83 offset 530000000 spacemap 0 free 256M
metaslab 84 offset 540000000 spacemap 0 free 256M
metaslab 85 offset 550000000 spacemap 0 free 256M
metaslab 86 offset 560000000 spacemap 0 free 256M
metaslab 87 offset 570000000 spacemap 0 free 256M
metaslab 88 offset 580000000 spacemap 0 free 256M
metaslab 89 offset 590000000 spacemap 0 free 256M
metaslab 90 offset 5a0000000 spacemap 0 free 256M
metaslab 91 offset 5b0000000 spacemap 0 free 256M
metaslab 92 offset 5c0000000 spacemap 0 free 256M
metaslab 93 offset 5d0000000 spacemap 0 free 256M
metaslab 94 offset 5e0000000 spacemap 0 free 256M
metaslab 95 offset 5f0000000 spacemap 0 free 256M
metaslab 96 offset 600000000 spacemap 0 free 256M
metaslab 97 offset 610000000 spacemap 0 free 256M
metaslab 98 offset 620000000 spacemap 0 free 256M
metaslab 99 offset 630000000 spacemap 0 free 256M
metaslab 100 offset 640000000 spacemap 0 free 256M
metaslab 101 offset 650000000 spacemap 0 free 256M
metaslab 102 offset 660000000 spacemap 0 free 256M
metaslab 103 offset 670000000 spacemap 0 free 256M
metaslab 104 offset 680000000 spacemap 0 free 256M
metaslab 105 offset 690000000 spacemap 0 free 256M
metaslab 106 offset 6a0000000 spacemap 0 free 256M
metaslab 107 offset 6b0000000 spacemap 0 free 256M
metaslab 108 offset 6c0000000 spacemap 0 free 256M
metaslab 109 offset 6d0000000 spacemap 0 free 256M
metaslab 110 offset 6e0000000 spacemap 0 free 256M
metaslab 111 offset 6f0000000 spacemap 0 free 256M
metaslab 112 offset 700000000 spacemap 0 free 256M
metaslab 113 offset 710000000 spacemap 0 free 256M
metaslab 114 offset 720000000 spacemap 0 free 256M
metaslab 115 offset 730000000 spacemap 0 free 256M
metaslab 116 offset 740000000 spacemap 0 free 256M
metaslab 117 offset 750000000 spacemap 0 free 256M
metaslab 118 offset 760000000 spacemap 0 free 256M
vdev 1
metaslabs 119 offset spacemap free
--------------- ------------------- --------------- -------------
metaslab 0 offset 0 spacemap 36 free 511M
metaslab 1 offset 20000000 spacemap 0 free 512M
metaslab 2 offset 40000000 spacemap 0 free 512M
metaslab 3 offset 60000000 spacemap 0 free 512M
metaslab 4 offset 80000000 spacemap 0 free 512M
metaslab 5 offset a0000000 spacemap 0 free 512M
metaslab 6 offset c0000000 spacemap 0 free 512M
metaslab 7 offset e0000000 spacemap 0 free 512M
metaslab 8 offset 100000000 spacemap 0 free 512M
metaslab 9 offset 120000000 spacemap 0 free 512M
metaslab 10 offset 140000000 spacemap 0 free 512M
metaslab 11 offset 160000000 spacemap 0 free 512M
metaslab 12 offset 180000000 spacemap 0 free 512M
metaslab 13 offset 1a0000000 spacemap 0 free 512M
metaslab 14 offset 1c0000000 spacemap 0 free 512M
metaslab 15 offset 1e0000000 spacemap 0 free 512M
metaslab 16 offset 200000000 spacemap 0 free 512M
metaslab 17 offset 220000000 spacemap 0 free 512M
metaslab 18 offset 240000000 spacemap 0 free 512M
metaslab 19 offset 260000000 spacemap 0 free 512M
metaslab 20 offset 280000000 spacemap 0 free 512M
metaslab 21 offset 2a0000000 spacemap 0 free 512M
metaslab 22 offset 2c0000000 spacemap 0 free 512M
metaslab 23 offset 2e0000000 spacemap 35 free 512M
metaslab 24 offset 300000000 spacemap 0 free 512M
metaslab 25 offset 320000000 spacemap 0 free 512M
metaslab 26 offset 340000000 spacemap 0 free 512M
metaslab 27 offset 360000000 spacemap 0 free 512M
metaslab 28 offset 380000000 spacemap 0 free 512M
metaslab 29 offset 3a0000000 spacemap 0 free 512M
metaslab 30 offset 3c0000000 spacemap 0 free 512M
metaslab 31 offset 3e0000000 spacemap 0 free 512M
metaslab 32 offset 400000000 spacemap 0 free 512M
metaslab 33 offset 420000000 spacemap 0 free 512M
metaslab 34 offset 440000000 spacemap 0 free 512M
metaslab 35 offset 460000000 spacemap 0 free 512M
metaslab 36 offset 480000000 spacemap 0 free 512M
metaslab 37 offset 4a0000000 spacemap 0 free 512M
metaslab 38 offset 4c0000000 spacemap 0 free 512M
metaslab 39 offset 4e0000000 spacemap 0 free 512M
metaslab 40 offset 500000000 spacemap 0 free 512M
metaslab 41 offset 520000000 spacemap 0 free 512M
metaslab 42 offset 540000000 spacemap 0 free 512M
metaslab 43 offset 560000000 spacemap 0 free 512M
metaslab 44 offset 580000000 spacemap 0 free 512M
metaslab 45 offset 5a0000000 spacemap 0 free 512M
metaslab 46 offset 5c0000000 spacemap 0 free 512M
metaslab 47 offset 5e0000000 spacemap 0 free 512M
metaslab 48 offset 600000000 spacemap 0 free 512M
metaslab 49 offset 620000000 spacemap 0 free 512M
metaslab 50 offset 640000000 spacemap 0 free 512M
metaslab 51 offset 660000000 spacemap 0 free 512M
metaslab 52 offset 680000000 spacemap 0 free 512M
metaslab 53 offset 6a0000000 spacemap 0 free 512M
metaslab 54 offset 6c0000000 spacemap 0 free 512M
metaslab 55 offset 6e0000000 spacemap 0 free 512M
metaslab 56 offset 700000000 spacemap 0 free 512M
metaslab 57 offset 720000000 spacemap 0 free 512M
metaslab 58 offset 740000000 spacemap 0 free 512M
metaslab 59 offset 760000000 spacemap 0 free 512M
metaslab 60 offset 780000000 spacemap 0 free 512M
metaslab 61 offset 7a0000000 spacemap 0 free 512M
metaslab 62 offset 7c0000000 spacemap 0 free 512M
metaslab 63 offset 7e0000000 spacemap 0 free 512M
metaslab 64 offset 800000000 spacemap 0 free 512M
metaslab 65 offset 820000000 spacemap 0 free 512M
metaslab 66 offset 840000000 spacemap 0 free 512M
metaslab 67 offset 860000000 spacemap 0 free 512M
metaslab 68 offset 880000000 spacemap 0 free 512M
metaslab 69 offset 8a0000000 spacemap 0 free 512M
metaslab 70 offset 8c0000000 spacemap 0 free 512M
metaslab 71 offset 8e0000000 spacemap 0 free 512M
metaslab 72 offset 900000000 spacemap 0 free 512M
metaslab 73 offset 920000000 spacemap 0 free 512M
metaslab 74 offset 940000000 spacemap 0 free 512M
metaslab 75 offset 960000000 spacemap 0 free 512M
metaslab 76 offset 980000000 spacemap 0 free 512M
metaslab 77 offset 9a0000000 spacemap 0 free 512M
metaslab 78 offset 9c0000000 spacemap 0 free 512M
metaslab 79 offset 9e0000000 spacemap 0 free 512M
metaslab 80 offset a00000000 spacemap 0 free 512M
metaslab 81 offset a20000000 spacemap 0 free 512M
metaslab 82 offset a40000000 spacemap 0 free 512M
metaslab 83 offset a60000000 spacemap 0 free 512M
metaslab 84 offset a80000000 spacemap 0 free 512M
metaslab 85 offset aa0000000 spacemap 0 free 512M
metaslab 86 offset ac0000000 spacemap 0 free 512M
metaslab 87 offset ae0000000 spacemap 0 free 512M
metaslab 88 offset b00000000 spacemap 0 free 512M
metaslab 89 offset b20000000 spacemap 0 free 512M
metaslab 90 offset b40000000 spacemap 0 free 512M
metaslab 91 offset b60000000 spacemap 0 free 512M
metaslab 92 offset b80000000 spacemap 0 free 512M
metaslab 93 offset ba0000000 spacemap 0 free 512M
metaslab 94 offset bc0000000 spacemap 0 free 512M
metaslab 95 offset be0000000 spacemap 0 free 512M
metaslab 96 offset c00000000 spacemap 0 free 512M
metaslab 97 offset c20000000 spacemap 0 free 512M
metaslab 98 offset c40000000 spacemap 0 free 512M
metaslab 99 offset c60000000 spacemap 0 free 512M
metaslab 100 offset c80000000 spacemap 0 free 512M
metaslab 101 offset ca0000000 spacemap 0 free 512M
metaslab 102 offset cc0000000 spacemap 0 free 512M
metaslab 103 offset ce0000000 spacemap 0 free 512M
metaslab 104 offset d00000000 spacemap 0 free 512M
metaslab 105 offset d20000000 spacemap 0 free 512M
metaslab 106 offset d40000000 spacemap 0 free 512M
metaslab 107 offset d60000000 spacemap 0 free 512M
metaslab 108 offset d80000000 spacemap 0 free 512M
metaslab 109 offset da0000000 spacemap 0 free 512M
metaslab 110 offset dc0000000 spacemap 0 free 512M
metaslab 111 offset de0000000 spacemap 0 free 512M
metaslab 112 offset e00000000 spacemap 0 free 512M
metaslab 113 offset e20000000 spacemap 0 free 512M
metaslab 114 offset e40000000 spacemap 0 free 512M
metaslab 115 offset e60000000 spacemap 0 free 512M
metaslab 116 offset e80000000 spacemap 0 free 512M
metaslab 117 offset ea0000000 spacemap 0 free 512M
metaslab 118 offset ec0000000 spacemap 0 free 512M
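The offset column in the listing above is simply the metaslab index shifted left by metaslab_shift. A small Python sketch of that relationship (the helper names are hypothetical; the shift values 28 and 29 come from the zdb -l labels):

```python
def metaslab_start(index: int, ms_shift: int) -> int:
    """Byte offset at which metaslab `index` begins inside its vdev."""
    return index << ms_shift


def metaslab_index(offset: int, ms_shift: int) -> int:
    """Index of the metaslab that contains byte `offset` of the vdev."""
    return offset >> ms_shift


# metaslab_shift values taken from the labels: vdev 0 -> 28, vdev 1 -> 29
for vdev, shift in ((0, 28), (1, 29)):
    start = metaslab_start(23, shift)
    print(f"vdev {vdev}: metaslab 23 starts at {start:x}")
    # Any offset inside the metaslab maps back to the same index.
    assert metaslab_index(start + 0x1000, shift) == 23
```

This reproduces the offsets 170000000 and 2e0000000 listed for metaslab 23 on the two vdevs.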
Taking vdev0 as an example, read metafile 37 to verify its contents:
First parse the dnode of metafile 37. According to the dnode, the content area (DVA) is at 1:13f000:200:
[root@localhost case4]# zdb -ddddd case4 37
Dataset mos [META], ID 0, cr_txg 4, 759K, 39 objects,
rootbp DVA[0]=<1:13d600:800> DVA[1]=<0:9d800:800> DVA[2]=<1:2e0049c00:800>
[L0 DMU objset] fletcher4 uncompressed LE contiguous unique triple
size=800L/800P birth=16L/16P fill=39
cksum=6d0c48364:c1aa77fe98b:ac992e22f766b:66eb89954167a31
Object lvl iblk dblk dsize lsize %full type
37 1 16K 512 1.50K 512 100.00 object array
dnode flags: USED_BYTES
dnode maxblkid: 0
Indirect blocks:
0 L0 1:13f000:200 200L/200P F=1 B=16/16
segment [0000000000000000, 0000000000000200) size 512
Read the DVA region that its Indirect blocks entry points to:
[root@localhost case4]# zdb -R case4 1:13f000:200
Found vdev: /dev/nbd1
1:13f000:200
0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef
000000: 0000000000000026 0000000000000000 &...............
000010: 0000000000000000 0000000000000000 ................
000020: 0000000000000000 0000000000000000 ................
000030: 0000000000000000 0000000000000000 ................
000040: 0000000000000000 0000000000000000 ................
000050: 0000000000000000 0000000000000000 ................
000060: 0000000000000000 0000000000000000 ................
000070: 0000000000000000 0000000000000000 ................
000080: 0000000000000000 0000000000000000 ................
000090: 0000000000000000 0000000000000000 ................
0000a0: 0000000000000000 0000000000000000 ................
0000b0: 0000000000000000 0000000000000027 ........'.......
0000c0: 0000000000000000 0000000000000000 ................
0000d0: 0000000000000000 0000000000000000 ................
0000e0: 0000000000000000 0000000000000000 ................
0000f0: 0000000000000000 0000000000000000 ................
000100: 0000000000000000 0000000000000000 ................
000110: 0000000000000000 0000000000000000 ................
000120: 0000000000000000 0000000000000000 ................
000130: 0000000000000000 0000000000000000 ................
000140: 0000000000000000 0000000000000000 ................
000150: 0000000000000000 0000000000000000 ................
000160: 0000000000000000 0000000000000000 ................
000170: 0000000000000000 0000000000000000 ................
000180: 0000000000000000 0000000000000000 ................
000190: 0000000000000000 0000000000000000 ................
0001a0: 0000000000000000 0000000000000000 ................
0001b0: 0000000000000000 0000000000000000 ................
0001c0: 0000000000000000 0000000000000000 ................
0001d0: 0000000000000000 0000000000000000 ................
0001e0: 0000000000000000 0000000000000000 ................
0001f0: 0000000000000000 0000000000000000 ................
As the dump shows, this is a plain array of 64-bit integers, with metaslab[0] = 0x26 (38) and metaslab[23] = 0x27 (39); it is the metaslab allocation table of vdev0.
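Reading such an object array programmatically is straightforward. The sketch below is an illustration under stated assumptions: `parse_object_array` is a hypothetical helper, the block is rebuilt by hand from the two non-zero words in the dump instead of being read from disk, and the words are assumed to be little-endian as the block pointer reports (LE).

```python
import struct


def parse_object_array(block: bytes) -> dict:
    """Map metaslab index -> space map object number, skipping zero slots."""
    count = len(block) // 8
    words = struct.unpack(f"<{count}Q", block[:count * 8])
    return {index: obj for index, obj in enumerate(words) if obj != 0}


# Rebuild the 512-byte block from the dump: 0x26 at 0x00 and 0x27 at 0xb8.
block = bytearray(512)
struct.pack_into("<Q", block, 0x00, 0x26)
struct.pack_into("<Q", block, 0xB8, 0x27)

print(parse_object_array(bytes(block)))
```

The byte offset 0xb8 divided by 8 gives index 23, so the result maps metaslab 0 to object 38 and metaslab 23 to object 39.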
III. Read the space map
Taking vdev0 as an example, read the contents of metaslab[0] (the interleaved explanatory lines are annotations, not zdb output):
[root@localhost case4]# zdb -mmmm case4 /dev/nbd0 0
Metaslabs:
vdev 0
metaslabs 119 offset spacemap free
--------------- ------------------- --------------- -------------
metaslab 0 offset 0 spacemap 38 free 256M
segments 8 maxsize 255M freepct 99%
In-memory histogram: //summary of free fragments held in memory; it is not stored on disk
9: 1 *
10: 0
11: 1 *
12: 1 *
13: 1 *
14: 1 *
15: 1 *
16: 0
17: 1 *
18: 0
19: 0
20: 0
21: 0
22: 0
23: 0
24: 0
25: 0
26: 0
27: 1 *
On-disk histogram: fragmentation 0
//summary of free fragments stored on disk
11: 1 * //one contiguous free fragment in the 2^11 size class
12: 0
13: 0
14: 0
15: 0
16: 0
17: 0
18: 0
19: 0
20: 0
21: 0
22: 0
23: 0
24: 0
25: 0
26: 0
27: 1 * //one contiguous free fragment in the 2^27 size class
[ 0] ALLOC: txg 4, pass 1
Meaning: a debug record: allocation, txg 4, sync pass 1
[ 1] A range: 0000000000-000003d200 size: 03d200
Meaning: an allocation record: <range>, <size>
[ 2] FREE: txg 4, pass 1
Meaning: a debug record: release, txg 4, sync pass 1
[ 3] F range: 0000024e00-0000025600 size: 000800
Meaning: a release record: <range>, <size>
The remaining entries follow the same pattern.
[ 4] ALLOC: txg 4, pass 2
[ 5] A range: 000003d200-0000040600 size: 003400
[ 6] ALLOC: txg 5, pass 2
[ 7] A range: 0000040600-0000068e00 size: 028800
[ 8] ALLOC: txg 5, pass 3
[ 9] A range: 0000068e00-000006fe00 size: 007000
[ 10] ALLOC: txg 16, pass 1
[ 11] A range: 000006fe00-000009e000 size: 02e200
[ 12] FREE: txg 16, pass 1
[ 13] F range: 0000000600-0000000800 size: 000200
[ 14] F range: 0000000e00-0000024e00 size: 024000
[ 15] F range: 0000025600-0000026200 size: 000c00
[ 16] F range: 0000064600-0000068e00 size: 004800
[ 17] F range: 0000098400-0000098c00 size: 000800
[ 18] ALLOC: txg 16, pass 2
[ 19] A range: 000009e000-00000a6400 size: 008400
[ 20] FREE: txg 16, pass 2
[ 21] F range: 0000034a00-000003d200 size: 008800
[ 22] F range: 000003d400-000003f400 size: 002000
[ 23] F range: 000003f600-0000040600 size: 001000
[ 24] ALLOC: txg 16, pass 3
[ 25] A range: 00000a6400-00000aa400 size: 004000
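The log above can be replayed by hand to check the segment count and the in-memory histogram that zdb reports. The following is a minimal Python sketch, not ZFS code: `replay` and `histogram` are hypothetical helpers, the records are copied from the A/F range lines above, and the metaslab is assumed to start fully free with a size of 2^28 bytes.

```python
# Range records copied from the zdb -mmmm output above: (type, start, end).
LOG = [
    ("A", 0x000000, 0x03d200), ("F", 0x024e00, 0x025600),
    ("A", 0x03d200, 0x040600), ("A", 0x040600, 0x068e00),
    ("A", 0x068e00, 0x06fe00), ("A", 0x06fe00, 0x09e000),
    ("F", 0x000600, 0x000800), ("F", 0x000e00, 0x024e00),
    ("F", 0x025600, 0x026200), ("F", 0x064600, 0x068e00),
    ("F", 0x098400, 0x098c00), ("A", 0x09e000, 0x0a6400),
    ("F", 0x034a00, 0x03d200), ("F", 0x03d400, 0x03f400),
    ("F", 0x03f600, 0x040600), ("A", 0x0a6400, 0x0aa400),
]
METASLAB_SIZE = 1 << 28  # metaslab_shift 28 for vdev 0


def replay(log, size):
    """Replay the log in order and return the sorted list of free segments."""
    free = [(0, size)]  # assumption: an untouched metaslab is one free segment
    for kind, start, end in log:
        if kind == "A":
            # Cut [start, end) out of any free segment that overlaps it.
            remaining = []
            for lo, hi in free:
                if end <= lo or hi <= start:
                    remaining.append((lo, hi))
                    continue
                if lo < start:
                    remaining.append((lo, start))
                if end < hi:
                    remaining.append((end, hi))
            free = remaining
        else:
            free.append((start, end))
        # Keep segments sorted and merge neighbours that touch.
        merged = []
        for lo, hi in sorted(free):
            if merged and merged[-1][1] >= lo:
                merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
            else:
                merged.append((lo, hi))
        free = merged
    return free


def histogram(segments):
    """Count free segments per power-of-two bucket, floor(log2(size))."""
    buckets = {}
    for lo, hi in segments:
        bucket = (hi - lo).bit_length() - 1
        buckets[bucket] = buckets.get(bucket, 0) + 1
    return buckets


free = replay(LOG, METASLAB_SIZE)
print(len(free), "free segments")
print(sorted(histogram(free).items()))
print(hex(METASLAB_SIZE - sum(hi - lo for lo, hi in free)), "bytes allocated")
```

Under these assumptions the replay ends with 8 free segments, one each in buckets 9, 11, 12, 13, 14, 15, 17, and 27, which agrees with the `segments 8` line and the in-memory histogram printed by zdb.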
The data above has two parts: the histogram of free fragments, stored in the bonus area of the dnode, and the space map itself, stored in the content area of the corresponding metafile.
Verify this with zdb:
[root@localhost case4]# zdb -uuu case4
Uberblock:
magic = 0000000000bab10c
version = 5000
txg = 16
guid_sum = 1609496149371726663
timestamp = 1476716014 UTC = Mon Oct 17 10:53:34 2016
rootbp = DVA[0]=<1:13d600:800> DVA[1]=<0:9d800:800>
DVA[2]=<1:2e0049c00:800> [L0 DMU objset] fletcher4 uncompressed
LE contiguous unique triple size=800L/800P birth=16L/16P fill=39
cksum=6d0c48364:c1aa77fe98b:ac992e22f766b:66eb89954167a31
This gives the blkptr_t in the objset_phys_t of the meta object set (MOS), DVA: <1:13d600:800>
Read its contents:
[root@localhost case4]# zdb -R case4 1:13d600:800
Found vdev: /dev/nbd1
1:13d600:800
0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef
000000: 0100000003010e0a 0000000000000020 ........ .......
000010: 0000000000000001 0000000000018000 ................
000020: 0000000000000000 0000000000000000 ................
000030: 0000000000000000 0000000000000000 ................
000040: 0000000000000020 00000000000004cc ...............
000050: 0000000100000020 00000000000009cb ...............
000060: 0000000000000020 0000000000b80000 ...............
000070: 800a0702001f001f 0000000000000000 ................
000080: 0000000000000000 0000000000000000 ................
000090: 0000000000000010 000000000000001f ................
0000a0: 0000003fc89656a4 00018a7e56e571ed .V..?....q.V~...
0000b0: 076504429388c8ac f58fd3b44082deb5 ....B.e....@....
0000c0: 0000000100000020 0000000000000a11 ...............
0000d0: 0000000000000020 0000000000000512 ...............
0000e0: 0000000100000020 0000000001700274 .......t.p.....
0000f0: 800a0702001f001f 0000000000000000 ................
000100: 0000000000000000 0000000000000000 ................
000110: 0000000000000010 0000000000000008 ................
000120: 0000000fc467e7bb 0000deee52adaf6b ..g.....k..R....
000130: 0635fd0f5dc8e276 c90e402d95a65d44 v..]..5.D]..-@..
000140: 0000000000000000 0000000000000000 ................
... the rest is all zeros and is omitted
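This shows that the MOS dnode array consists of 2 fragments of 32 sectors each, located at <0:4cc sectors:32 sectors> and <1:a11 sectors:32 sectors>.
Each dnode is 512 bytes, so each fragment holds 32 dnodes, and dnode 0x26 (38) is the dnode at index 6 of the second fragment, at sector a11 + 6 on vdev 1. Converted to a standard byte-based DVA this is <1:142e00:200>, which can be read with: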
[root@localhost case4]# zdb -R case4 1:142e00:200
Found vdev: /dev/nbd1
1:142e00:200
0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef
000000: 0100000701010e08 0000000001400008 ..........@.....
000010: 0000000000000000 0000000000003000 .........0......
000020: 0000000000000000 0000000000000000 ................
000030: 0000000000000000 0000000000000000 ................
000040: 0000000100000008 00000000000009f9 ................
000050: 0000000000000008 00000000000004fa ................
000060: 0000000100000008 000000000170025c ........\.p.....
000070: 8008070200070007 0000000000000000 ................
000080: 0000000000000000 0000000000000000 ................
000090: 0000000000000010 0000000000000001 ................
0000a0: 0000000554eb8757 000014dd9a66d378 W..T....x.f.....
0000b0: 0028e1062768aab4 35752d75087c35c8 ..h'..(..5|.u-u5
0000c0: 0000000000000026 00000000000000d0 &...............
0000d0: 0000000000074600 0000000000000000 .F..............
0000e0: 0000000000000000 0000000000000000 ................
0000f0: 0000000000000000 0000000000000000 ................
000100: 0000000000000000 0000000000000000 ................
000110: 0000000000000001 0000000000000000 ................
000120: 0000000000000000 0000000000000000 ................
000130: 0000000000000000 0000000000000000 ................
000140: 0000000000000000 0000000000000000 ................
000150: 0000000000000000 0000000000000000 ................
000160: 0000000000000000 0000000000000000 ................
000170: 0000000000000000 0000000000000000 ................
000180: 0000000000000000 0000000000000000 ................
000190: 0000000000000001 0000000000000000 ................
0001a0: 0000000000000000 0000000000000000 ................
0001b0: 0000000000000000 0000000000000000 ................
0001c0: 0000000000000000 0000000000000000 ................
0001d0: 0000000000000000 0000000000000000 ................
0001e0: 0000000000000000 0000000000000000 ................
0001f0: 0000000000000000 0000000000000000 ................
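The address used in the command above can be derived with a few lines of arithmetic. This Python sketch assumes what the dumps show: 512-byte sectors, 512-byte dnodes, and two 32-sector dnode blocks whose first copies start at sector 0x4cc on vdev 0 and sector 0xa11 on vdev 1. The helper `dnode_dva` is a hypothetical name.

```python
SECTOR = 512
DNODE_SIZE = 512
DNODES_PER_BLOCK = 32  # a 32-sector (16K) block holds 32 dnodes of 512 bytes

# (vdev, start sector) of the first copy of each MOS dnode block, from the dump
MOS_DNODE_BLOCKS = [(0, 0x4CC), (1, 0xA11)]


def dnode_dva(obj: int):
    """Return (vdev, byte offset, size) of the dnode for object `obj`."""
    vdev, start_sector = MOS_DNODE_BLOCKS[obj // DNODES_PER_BLOCK]
    slot = obj % DNODES_PER_BLOCK
    return vdev, start_sector * SECTOR + slot * DNODE_SIZE, DNODE_SIZE


vdev, offset, size = dnode_dva(38)
print(f"<{vdev}:{offset:x}:{size:x}>")
```

For object 38 this prints `<1:142e00:200>`, the DVA passed to zdb -R above.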
The source code structure corresponding to the space map bonus:
typedef struct space_map_phys {
	uint64_t	smp_object;	/* on-disk space map object */
	uint64_t	smp_objsize;	/* size of the object */
	uint64_t	smp_alloc;	/* space allocated from the map */
	uint64_t	smp_pad[5];	/* reserved */

	/*
	 * The smp_histogram maintains a histogram of free regions. Each
	 * bucket, smp_histogram[i], contains the number of free regions
	 * whose size is:
	 * 2^(i+sm_shift) <= size of free region in bytes < 2^(i+sm_shift+1)
	 */
	uint64_t	smp_histogram[SPACE_MAP_HISTOGRAM_SIZE];
} space_map_phys_t;
/*
Mapped onto offset 0xC0 of the dump above, that is:
typedef struct space_map_phys {
uint64_t object ID: 0x26;
uint64_t space used by this object: 0xD0;
uint64_t space allocated in this metaslab: 0x74600;
uint64_t smp_pad[5]; // reserved
uint64_t smp_histogram[SPACE_MAP_HISTOGRAM_SIZE];
// array of 64-bit counters: number of free fragments in the 2^9 bucket, the 2^10 bucket, and so on
// the 1 at 0x110 means there is one free fragment in the 2^11 bucket
// the 1 at 0x190 means there is one free fragment in the 2^27 bucket
// all other buckets hold no free fragments
} space_map_phys_t;
*/
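The mapping from dump offsets to histogram buckets can be checked with a short Python sketch. It is an illustration only: `parse_space_map_phys` is a hypothetical helper, the 320-byte bonus is rebuilt by hand from the non-zero words of the dump (the bonus starts at 0xC0 inside the dnode), and sm_shift is assumed to be 9 to match the 512-byte sectors.

```python
import struct

SM_SHIFT = 9      # assumed: 512-byte sectors (ashift 9)
BONUS_BASE = 0xC0  # offset of the bonus area inside the dumped dnode


def parse_space_map_phys(bonus: bytes, sm_shift: int = SM_SHIFT):
    """Return (smp_object, smp_objsize, smp_alloc, {log2 bucket: count})."""
    smp_object, smp_objsize, smp_alloc = struct.unpack_from("<3Q", bonus, 0)
    hist_offset = 8 * 8  # three fields plus smp_pad[5], 8 bytes each
    slots = (len(bonus) - hist_offset) // 8
    hist = struct.unpack_from(f"<{slots}Q", bonus, hist_offset)
    buckets = {i + sm_shift: n for i, n in enumerate(hist) if n}
    return smp_object, smp_objsize, smp_alloc, buckets


# Rebuild the bonus from the dump: header words at 0xC0, counters at 0x110/0x190.
bonus = bytearray(320)
struct.pack_into("<3Q", bonus, 0, 0x26, 0xD0, 0x74600)
struct.pack_into("<Q", bonus, 0x110 - BONUS_BASE, 1)
struct.pack_into("<Q", bonus, 0x190 - BONUS_BASE, 1)

print(parse_space_map_phys(bytes(bonus)))
```

The counters at 0x110 and 0x190 land in histogram slots 2 and 18, that is buckets 2^11 and 2^27, consistent with the on-disk histogram printed by zdb.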
The above is the summary table of free fragments. Now look at the space map (the allocation/release ledger):
Use zdb to read the contents of metafile 38:
[root@localhost case4]# zdb -ddddd case4 38
Dataset mos [META], ID 0, cr_txg 4, 759K, 39 objects,
rootbp DVA[0]=<1:13d600:800> DVA[1]=<0:9d800:800> DVA[2]=<1:2e0049c00:800>
[L0 DMU objset] fletcher4 uncompressed LE contiguous unique triple
size=800L/800P birth=16L/16P fill=39
cksum=6d0c48364:c1aa77fe98b:ac992e22f766b:66eb89954167a31
Object lvl iblk dblk dsize lsize %full type
38 1 16K 4K 12.0K 4K 100.00 SPA space map
320 bonus SPA space map header
dnode flags: USED_BYTES
dnode maxblkid: 0
Indirect blocks:
0 L0 1:13f200:1000 1000L/1000P F=1 B=16/16
segment [0000000000000000, 0000000000001000) size 4K
Its content is at <1:13f200:1000>; the real size, shown in space_map_phys above, is 0xD0. Reading it gives:
[root@localhost case4]# zdb -R case4 1:13f200:200
Found vdev: /dev/nbd1
1:13f200:200
0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef
000000: 8004000000000004 00000000000001e8 ................
000010: 9004000000000004 0000000001278003 ..........'.....
000020: 8008000000000004 0000000001e90019 ................
000030: 8008000000000005 0000000002030143 ........C.......
000040: 800c000000000005 0000000003470037 ........7.G.....
000050: 8004000000000010 00000000037f0170 ........p.......
000060: 9004000000000010 0000000000038000 ................
000070: 000000000007811f 00000000012b8005 ..........+.....
000080: 0000000003238023 0000000004c28003 #.#.............
000090: 8008000000000010 0000000004f00041 ........A.......
0000a0: 9008000000000010 0000000001a58043 ........C.......
0000b0: 0000000001ea800f 0000000001fb8007 ................
0000c0: 800c000000000010 000000000532001f ..........2.....
The rest is omitted; it is all unused data and reads as zeros
This is an array of 64-bit integers in which each 64-bit word is one space map entry, that is, one ledger line. The decoding rules come from the source code:
/*
* debug entry
*
* 1 3 10 50
* ,---+--------+------------+---------------------------------.
* | 1 | action | syncpass | txg (lower bits) |
* `---+--------+------------+---------------------------------'
* 63 62 60 59 50 49 0
*
*
* non-debug entry
*
* 1 47 1 15
* ,-----------------------------------------------------------.
* | 0 | offset (sm_shift units) | type | run |
* `-----------------------------------------------------------'
* 63 62 17 16 15 0
*/
typedef enum {
	SM_ALLOC,
	SM_FREE
} maptype_t;
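Decoded as follows (matching the zdb output shown earlier):
Notes:
offset: the position, in units of sectors
run: the size; add 1 and multiply by the sector size to get the real number of bytes
Entry 1: 0x8004000000000004
The top bit is 1, so this is a debug entry; action is 0 (allocation), the sync pass is 1, and txg is 4
Entry 2: 0x00000000000001e8
The top bit is 0, so this is a non-debug entry; type is 0 (allocation), the offset is 0, and the run value is 0x1E8, that is (0x1E8+1)*512 bytes = 0x03d200 bytes
Entry 3: 0x9004000000000004
The top bit is 1, so this is a debug entry; action is 1 (release), the sync pass is 1, and txg is 4
Entry 4: 0x0000000001278003
The top bit is 0, so this is a non-debug entry; type is 1 (release), the offset is 0x127, that is 0x127*512 = 0x24E00, and the run value is 3, that is 4*512 = 0x800 bytes
Replaying these space map entries in order reconstructs the space usage map of this metaslab.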
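The decoding rules above translate directly into bit operations. The following Python sketch is an illustration, not code from ZFS: `decode_entry` is a hypothetical name, the bit positions follow the layout comment quoted above, and the sector size is assumed to be 512 bytes.

```python
SECTOR = 512  # assumed sector size (sm_shift = 9)
KIND = {0: "ALLOC", 1: "FREE"}


def decode_entry(word: int):
    """Decode one 64-bit space map word into a readable tuple."""
    if word >> 63:
        # debug entry: | 1 | action (3 bits) | syncpass (10 bits) | txg (50 bits) |
        action = (word >> 60) & 0x7
        syncpass = (word >> 50) & 0x3FF
        txg = word & ((1 << 50) - 1)
        return ("debug", KIND[action], syncpass, txg)
    # non-debug entry: | 0 | offset (47 bits) | type (1 bit) | run (15 bits) |
    offset = (word >> 16) & ((1 << 47) - 1)
    kind = KIND[(word >> 15) & 0x1]
    run = (word & 0x7FFF) + 1  # the stored value is run - 1
    return ("range", kind, offset * SECTOR, run * SECTOR)


for word in (0x8004000000000004, 0x00000000000001E8,
             0x9004000000000004, 0x0000000001278003):
    print(f"{word:016x} ->", decode_entry(word))
```

Applied to the first four words of the dump, this yields the same allocation and release records as the annotated zdb output: an ALLOC debug entry for txg 4, an allocation of 0x3d200 bytes at offset 0, a FREE debug entry, and a release of 0x800 bytes at offset 0x24e00.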