A Full Anatomy of the Linux Block Layer - v0.1
perftrace@gmail.com
1 Preface
Articles describing the block layer are scattered across many sites on the web, and the classic books, not updated in time, inevitably lag behind the latest code, for example on the block layer's multi-queue support. So it is time for a dedicated write-up on the Linux block layer. This article is based on kernel 4.17.2.
Many of the topics here could each be pulled out into a standalone article, so the amount of information is substantial. If you find you cannot read it straight through, read selectively or in stages, digesting a section at a time; after all, the article was not written in one sitting either. The reference links given at the end are excellent material: if your English allows, skim them without dwelling on details, since most of their content has already been folded into this article.
2 Overall Logic
An operating system kernel is a genuinely complex thing; opening straight with code would likely put everyone off. So we first lay out the overall framework and start from the logic, going from abstract to concrete. Let's go.
The block layer is the interface through which filesystems access storage devices; it is the bridge connecting filesystems and drivers.
In terms of code, the block layer can itself be divided into two layers: the bio layer and the request layer, as shown in the figure below.
2.1 The bio Layer
A filesystem ultimately calls generic_make_request, passing the request, already built as a bio structure, into the bio layer. The function itself returns no status; when the I/O completes, the function stored in bio->bi_end_io is invoked asynchronously. The bio layer essentially is this one generic_make_request function, so it is a very thin layer.
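To make that interface concrete, here is a minimal, hypothetical sketch of handing one page of I/O to the bio layer on 4.17-era APIs; my_end_io and submit_one_page are illustrative names, not kernel functions:

#include <linux/bio.h>
#include <linux/blkdev.h>

/* Hypothetical completion callback: invoked asynchronously once the
 * request layer and the driver have finished this I/O. */
static void my_end_io(struct bio *bio)
{
	if (bio->bi_status)		/* BLK_STS_OK or an error code */
		pr_err("I/O failed\n");
	bio_put(bio);
}

/* Read one page starting at `sector` of `bdev` (illustrative only). */
static void submit_one_page(struct block_device *bdev,
			    struct page *page, sector_t sector)
{
	struct bio *bio = bio_alloc(GFP_KERNEL, 1);	/* room for 1 bio_vec */

	bio_set_dev(bio, bdev);			/* target device */
	bio->bi_iter.bi_sector = sector;	/* starting sector */
	bio->bi_opf = REQ_OP_READ;		/* operation type */
	bio_add_page(bio, page, PAGE_SIZE, 0);	/* attach the data buffer */
	bio->bi_end_io = my_end_io;		/* async completion hook */

	generic_make_request(bio);	/* hand the bio to the bio layer */
}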
Below the bio layer sits the request layer, which, as the figure shows, comes in single-queue and multi-queue variants.
2.2 The Request Layer
The request layer has single-queue and multi-queue modes; multi-queue is the product of the kernel's evolution, and in the future it may well be all that remains. On the single-queue path generic_make_request calls blk_queue_bio; on the multi-queue path it calls blk_mq_make_request.
2.2.1 Single Queue
The single queue mainly targets traditional mechanical disks: the actuator arm can only be in one place at a time, so a single queue suffices. Put another way, a single queue is exactly what a rotating disk wants. To exploit the characteristics of mechanical disks, the single queue has three key tasks:
- Collect multiple contiguous operations into one request, making full use of the hardware. The code checks the queue to see whether an existing request can accept a new bio; if it can, the scheduler approves the merge, otherwise merging with some other request is considered later. Requests thus become large and contiguous.
- Sort the requests to reduce seek time, without unduly delaying important requests. We cannot know how important each request is, nor how much time a seek will waste, so this is handled by the queue's I/O scheduler, e.g. deadline, cfq, or noop.
- Get requests to the low-level driver: kick them down when they are ready, with a mechanism to be notified on completion. The driver registers a request_fn() via blk_init_queue_node(); it is called when new requests appear on the queue. The driver collects requests to process with blk_peek_request(); after finishing a request the driver normally keeps fetching requests itself rather than having request_fn() called again, and blk_finish_request() is called as each request completes. A sketch of such a strategy routine follows.
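As a rough illustration of the third task, a hypothetical legacy strategy routine under 4.17 APIs; my_request_fn, my_lock, and the elided transfer code are assumptions, not kernel code:

#include <linux/blkdev.h>

/* Hypothetical request_fn for a legacy single-queue driver; it is
 * called (with q->queue_lock held) when requests land on the queue. */
static void my_request_fn(struct request_queue *q)
{
	struct request *rq;

	/* blk_fetch_request() = blk_peek_request() + blk_start_request() */
	while ((rq = blk_fetch_request(q)) != NULL) {
		/* ... transfer blk_rq_pos(rq) / blk_rq_bytes(rq) ... */

		/* complete the whole request; blk_finish_request() runs
		 * internally once everything is accounted for */
		__blk_end_request_all(rq, BLK_STS_OK);
	}
}

/* Registered at probe time, e.g.:
 *	q = blk_init_queue(my_request_fn, &my_lock);
 */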
Some devices can accept several requests at once, taking new requests before earlier ones complete. Requests are tagged, so that when one completes, the right request can be finished. As devices evolved to do more of the scheduling work internally, the motivation for multi-queue kept growing.
2.2.2 Multi-Queue
The other motivation for multi-queue is that systems have more and more cores; funneling everything into a single queue becomes a direct performance bottleneck.
If a queue is allocated per NUMA node or per CPU, the pressure of feeding requests into a queue drops dramatically. But if the hardware accepts one submission at a time, multi-queue must merge things back together at the end.
We know the cfq scheduler also keeps multiple queues internally, but the idea is different: cfq associates requests with priorities, whereas multi-queue binds queues to the hardware.
The multi-queue request layer has two kinds of hardware-related queues: software staging queues (also called submission queues) and hardware dispatch queues.
A software staging queue is represented by struct blk_mq_ctx and allocated according to the CPU topology, one per CPU or per NUMA node; requests are added to these queues. The queues are managed by a dedicated multi-queue scheduler, commonly bfq, kyber, or mq-deadline. Software queues belonging to different CPUs are not aggregated across CPUs.
Hardware dispatch queues are allocated according to the target hardware: there may be just one, or as many as 2048. The request layer is responsible for feeding the low-level driver: it allocates one struct blk_mq_hw_ctx (hardware context) per hardware queue, and each request is finally passed to the driver together with its hardware context. This queue must control the rate of submission to the device driver to avoid overload. Ideally a request is dispatched from its software queue to a hardware queue running on the same CPU, improving cache locality.
Another difference from single queue is that multi-queue preallocates its request structures. Each request structure carries an integer tag that identifies the request to the device, so it can be matched up again on completion.
Rather than providing a request_fn(), a multi-queue driver supplies an operations structure, blk_mq_ops, which defines its functions; the most important is queue_rq(), alongside others for timeout handling, polling, request initialization, and so on. When the scheduler decides a request is ready and should no longer sit on a queue, it calls queue_rq(), pushing the request out of the request layer; in single queue, by contrast, the driver pulls requests from the queue. queue_rq() may put the request on an internal FIFO or process it directly. It can also refuse the request by returning BLK_STS_RESOURCE, which leaves it on the staging queue. Apart from BLK_STS_RESOURCE and BLK_STS_OK, any other return value signals an error. A sketch follows.
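A minimal, hypothetical queue_rq() sketch showing the BLK_STS_OK / BLK_STS_RESOURCE contract described above; my_hw_full and my_hw_submit are stand-ins for real device logic:

#include <linux/blk-mq.h>

/* Hypothetical hardware helpers, stand-ins for real device logic. */
static bool my_hw_full(void *hw) { return false; }
static void my_hw_submit(void *hw, struct request *rq) { }

/* Hypothetical .queue_rq: push one request to the hardware, or ask the
 * block layer to keep it on the staging queue and retry later. */
static blk_status_t my_queue_rq(struct blk_mq_hw_ctx *hctx,
				const struct blk_mq_queue_data *bd)
{
	struct request *rq = bd->rq;

	if (my_hw_full(hctx->driver_data))
		return BLK_STS_RESOURCE;	/* request stays staged */

	blk_mq_start_request(rq);		/* mark the request in flight */
	my_hw_submit(hctx->driver_data, rq);
	return BLK_STS_OK;			/* accepted by the hardware */
}

static const struct blk_mq_ops my_mq_ops = {
	.queue_rq = my_queue_rq,
};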
2.2.2.1 Multi-Queue Scheduling
A multi-queue driver does not need a scheduler configured; it then works much like the single-queue noop scheduler: requests are handed to the driver whenever blk_mq_run_hw_queue() or blk_mq_delay_run_hw_queue() is called. The multi-queue scheduling entry points are defined in the operations set elevator_mq_ops, chiefly insert_requests() and dispatch_request(). insert_requests() inserts requests into the staging queues, and dispatch_request() picks one request to feed a given hardware queue. The kernel is flexible here: insert_requests() may be omitted, in which case requests are simply appended at the tail; without dispatch_request(), requests are taken from whichever staging queue and thrown at the hardware queue, which hurts performance (though with a single hardware queue it hardly matters).
As mentioned above, three multi-queue schedulers are in common use: mq-deadline, bfq, and kyber.
mq-deadline has an insert_request() function that ignores the staging queues and inserts requests directly into two global, time-ordered queues, one for reads and one for writes; its dispatch_request() returns a request based on age, size, and starvation. Note that this function name differs from the one in elevator_mq_ops: it is missing an 's'.
The bfq scheduler, short for Budget Fair Queueing, is nominally an upgrade of cfq, but here it resembles mq-deadline more: it does not use per-CPU staging queues, and with multiple queues a single spinlock is contended by all CPUs.
The Kyber I/O scheduler does use the per-CPU or per-node staging queues. It provides no insert_requests() function, taking the default behavior, and its dispatch_request() maintains internal queues per hardware context. This scheduler was only first discussed in early 2017, so many details may still settle down over time; we will leap over it for now.
2.2.2.2 Multi-Queue Topology
Finally, let us look at a diagram that could hardly be clearer:
The figure shows more software staging queues than hardware dispatch queues.
In fact there are three possible situations:
- More software staging queues than hardware dispatch queues
Two or more software staging queues are assigned to one hardware queue; at dispatch time, requests are pulled from all the associated software queues.
- Fewer software staging queues than hardware dispatch queues
In this case the software queues are mapped sequentially onto the hardware queues.
- Equal numbers of software staging and hardware dispatch queues
This is a straightforward 1:1 mapping. A simplified mapping sketch follows.
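A deliberately simplified sketch of such a mapping; the kernel's default blk_mq_map_queues() is topology-aware, while this modulo version only illustrates the three ratios above:

/* One software queue per CPU, nr_hw hardware queues:
 * nr_cpus > nr_hw: several software queues share one hardware queue;
 * nr_cpus == nr_hw: a 1:1 mapping;
 * nr_cpus < nr_hw: some hardware queues simply go unused. */
static unsigned int map_swq_to_hwq(unsigned int cpu, unsigned int nr_hw)
{
	return cpu % nr_hw;
}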
2.2.3 When Will Multi-Queue Replace Single Queue?
It may still take some time, since nothing new is perfect. Red Hat found performance problems with mq-deadline in internal storage testing, and other companies have also seen regressions in their tests. Still, it is only a matter of time, and not very much of it.
Unfortunately, most books describe only the single-queue world, its schedulers included. Fortunately, this write-up covers multi-queue; feel free to share it with your friends.
2.3 bio based driver
Earlier on, to get around the kernel's single-queue bottleneck, a driver had to bypass the request layer. A driver that uses the request layer is called a request-based driver; one that skips it is called a bio-based driver. How is the layer skipped?
A device driver can register a make_request_fn by calling blk_queue_make_request; the make_request_fn then handles bios directly. generic_make_request calls the device's designated make_request_fn for each bio, thereby skipping the request layer beneath the bio layer; some devices, SSDs for instance, have no need for the request layer's merging and sorting. A sketch follows.
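A hypothetical bio-based driver skeleton under those APIs; my_make_request is an illustrative name, and a real driver would remap or complete the bio according to its own logic:

#include <linux/blkdev.h>

/* Handle each bio directly, bypassing the request layer. */
static blk_qc_t my_make_request(struct request_queue *q, struct bio *bio)
{
	/* ... remap and resubmit to a lower device, or complete it ... */
	bio_endio(bio);			/* signal completion */
	return BLK_QC_T_NONE;
}

/* At setup time, roughly:
 *	q = blk_alloc_queue(GFP_KERNEL);
 *	blk_queue_make_request(q, my_make_request);
 */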
In fact, this mechanism was not designed for fast SSDs: it was originally meant for MD RAID, which processes requests and forwards them to the real underlying devices.
Moreover, while bio-based drivers exist to dodge the kernel's single-queue bottleneck, they bring a problem of their own: each driver has to handle, and reinvent, everything by itself, so the code is not generic. That is what the kernel's multi-queue machinery addresses, and bio-based drivers are likely to die out gradually (see the reference "blk-mq: new multi-queue block IO queueing mechanism").
Overall, the bio layer is quite thin: it only builds I/O requests into bio structures and passes them to the appropriate make_request_fn(). The request layer is much thicker, housing the schedulers, request merging, and more.
3 Request Dispatch Logic
3.1 Multi-Queue
3.1.1 Request Submission
The make_request function used by multi-queue is blk_mq_make_request. When the device has a single hardware queue or the request is asynchronous, plugging is used to avoid or greatly reduce contention; for synchronous requests the driver performs no plugging.
The make_request function performs request merging: if the device allows plugging, it is responsible for searching the plug list for a suitable merge candidate; finally it maps the request onto the software queue of the current CPU. The submission path involves no I/O-scheduler callbacks.
make_request sends synchronous requests to the corresponding hardware queue immediately; asynchronous or flush (batched) requests are delayed so that later dispatch can merge them more efficiently.
So make_request treats synchronous and asynchronous requests somewhat differently.
3.1.2 Request Dispatch
If the I/O request is synchronous (plugging not being allowed for it in multi-queue), dispatch is performed in the context of the same request's submission.
If it is asynchronous or a flush, dispatch may be performed in the context of a later request associated with the same hardware queue, or by scheduled delayed work.
In multi-queue this is implemented by blk_mq_run_hw_queue: synchronous requests are dispatched to the driver immediately, while asynchronous ones are delayed. For the synchronous case it calls the internal __blk_mq_run_hw_queue, which first joins the software queues associated with the current hardware queue onto the existing dispatch list, then walks the collected entries, dispatching each request into the driver, where queue_rq finally handles it.
The code for all of this lives in blk_mq_make_request.
The multi-queue flow is shown below:
A high-resolution version of the blk_mq diagram:
https://github.com/kernel-z/filesystem/blob/master/blk_mq.png
3.2 Single Queue
On the single-queue path, generic_make_request calls blk_queue_bio to process the bio. This is the most important function in the legacy block layer and deserves close study; it is also extremely rich, and you are bound to get lost on a first read.
blk_queue_bio performs the elevator scheduling: there are front merges and back merges, and if no merge is possible a new request is created. Finally, blk_account_io_start is called to record that I/O handling has begun; much of the I/O monitoring and statistics starts from that function.
The logic is far simpler than the code. First check whether the bio can merge into the process's plug list; if not, check whether it can merge into the block layer's request queue. If neither merge is possible, produce a new request for the bio. The path then splits on whether plugging is possible: if it is, check whether the plug list needs flushing; if no flush is needed, just hang the request on the plug list and return. If plugging is not possible, add the request to the request queue and call __blk_run_queue, which invokes rq->request_fn (specified by the device driver; for SCSI it is scsi_request_fn) and leaves the block layer. That is the overall logic of blk_queue_bio, illustrated below:
A high-resolution version:
https://github.com/kernel-z/filesystem/blob/master/blk_single.png
Requests on the plug list are flushed not only via a later blk_queue_bio call chain (blk_flush_plug_list), but also by the process scheduler:
schedule->
sched_submit_work ->
blk_schedule_flush_plug()->
blk_flush_plug_list(plug, true) ->
queue_unplugged->
blk_run_queue_async
which wakes the kblockd workqueue to perform the unplug.
Requests on the plug list are first flushed into the request queue, and ultimately everything is issued by __blk_run_queue, which calls ->request_fn; that function varies by driver (for SCSI it is scsi_request_fn). A usage sketch of plugging follows.
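For reference, a sketch of how a submitter batches bios under one plug; submit_batch is an illustrative name:

#include <linux/blkdev.h>

/* Requests pile up on current->plug and are flushed to the request
 * queue (via blk_flush_plug_list) either at blk_finish_plug() below
 * or when the task schedules out, as in the call chain above. */
static void submit_batch(struct bio **bios, int nr)
{
	struct blk_plug plug;
	int i;

	blk_start_plug(&plug);		/* current->plug = &plug */
	for (i = 0; i < nr; i++)
		generic_make_request(bios[i]);
	blk_finish_plug(&plug);		/* flush the plugged requests */
}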
3.2.1 Function Summary
Insert: __elv_add_request
Split: blk_queue_split
Merge:
bio_attempt_front_merge / bio_attempt_back_merge, blk_attempt_plug_merge
Issue I/O: __blk_run_queue
4 Block Layer Initialization Analysis (SCSI)
At initialization time the driver must determine from the hardware whether it can use multi-queue. This fixes both the block layer entry function used to enqueue requests (blk_queue_bio or blk_mq_make_request) and the function that finally issues requests out of the block layer (scsi_request_fn or scsi_queue_rq).
4.1 SCSI as an Example
4.1.1 scsi_alloc_sdev
While probing for SCSI devices the driver calls scsi_alloc_sdev, which allocates and initializes a scsi_device and returns a pointer to it; the scsi_device stores the host, channel, id, and lun, and is added to the appropriate lists.
The function makes the following choice:
if (shost_use_blk_mq(shost))
	sdev->request_queue = scsi_mq_alloc_queue(sdev);
else
	sdev->request_queue = scsi_old_alloc_queue(sdev);
If the device can use multi-queue, scsi_mq_alloc_queue is called; otherwise the single-queue scsi_old_alloc_queue is used. The parameter sdev is the scsi_device.
scsi_mq_alloc_queue calls blk_mq_init_queue, which ends up registering blk_mq_make_request.
The initialization logic is shown below; the diagram was too wide horizontally, so it has been turned on its side:
A high-resolution version:
https://github.com/kernel-z/filesystem/blob/master/scsi-init.png
What follows explains the concrete structures and functions in the code; combined with the narrative above, it should make the block layer easier to understand.
5 Key Structures
5.1 request
The request structure represents a request to operate on a block device; it is placed on a request_queue and processed when the time is right.
It is defined in include/linux/blkdev.h:
struct request {
struct request_queue *q; //the queue this request belongs to
struct blk_mq_ctx *mq_ctx;
int cpu;
unsigned int cmd_flags; /* op and common flags */
req_flags_t rq_flags;
int internal_tag;
/* the following two fields are internal, NEVER access directly */
unsigned int __data_len; /* total data len */
int tag;
sector_t __sector; /* sector cursor */
struct bio *bio;
struct bio *biotail;
struct list_head queuelist; //linkage on the request queue's list
/*
* The hash is used inside the scheduler, and killed once the
* request reaches the dispatch list. The ipi_list is only used
* to queue the request for softirq completion, which is long
* after the request has been unhashed (and even removed from
* the dispatch list).
*/
union {
struct hlist_node hash; /* merge hash */
struct list_head ipi_list;
};
/*
* The rb_node is only used inside the io scheduler, requests
* are pruned when moved to the dispatch queue. So let the
* completion_data share space with the rb_node.
*/
union {
struct rb_node rb_node; /* sort/lookup */
struct bio_vec special_vec;
void *completion_data;
int error_count; /* for legacy drivers, don't use */
};
/*
* Three pointers are available for the IO schedulers, if they need
* more they have to dynamically allocate it. Flush requests are
* never put on the IO scheduler. So let the flush fields share
* space with the elevator data.
*/
union {
struct {
struct io_cq *icq;
void *priv[2];
} elv;
struct {
unsigned int seq;
struct list_head list;
rq_end_io_fn *saved_end_io;
} flush;
};
struct gendisk *rq_disk;
struct hd_struct *part;
unsigned long start_time;
struct blk_issue_stat issue_stat;
/* Number of scatter-gather DMA addr+len pairs after
* physical address coalescing is performed.
*/
unsigned short nr_phys_segments;
#if defined(CONFIG_BLK_DEV_INTEGRITY)
unsigned short nr_integrity_segments;
#endif
unsigned short write_hint;
unsigned short ioprio;
unsigned int timeout;
void *special; /* opaque pointer available for LLD use */
unsigned int extra_len; /* length of alignment and padding */
/*
* On blk-mq, the lower bits of ->gstate (generation number and
* state) carry the MQ_RQ_* state value and the upper bits the
* generation number which is monotonically incremented and used to
* distinguish the reuse instances.
*
* ->gstate_seq allows updates to ->gstate and other fields
* (currently ->deadline) during request start to be read
* atomically from the timeout path, so that it can operate on a
* coherent set of information.
*/
seqcount_t gstate_seq;
u64 gstate;
/*
* ->aborted_gstate is used by the timeout to claim a specific
* recycle instance of this request. See blk_mq_timeout_work().
*/
struct u64_stats_sync aborted_gstate_sync;
u64 aborted_gstate;
/* access through blk_rq_set_deadline, blk_rq_deadline */
unsigned long __deadline;
struct list_head timeout_list;
union {
struct __call_single_data csd;
u64 fifo_time;
};
/*
* completion callback.
*/
rq_end_io_fn *end_io;
void *end_io_data;
/* for bidi */
struct request *next_rq;
#ifdef CONFIG_BLK_CGROUP
struct request_list *rl; /* rl this rq is alloced from */
unsigned long long start_time_ns;
unsigned long long io_start_time_ns; /* when passed to hardware */
#endif
};
This represents an I/O request at the block device driver level; after conversion by the I/O scheduling layer, the request is sent to the block device driver for processing.
5.2 request_queue
Every block device has a queue; when the device needs to be operated on, requests are placed on that queue. Block I/O cannot complete immediately and is comparatively slow, so all requests are parked on the queue and handled when the time is right.
It is defined in include/linux/blkdev.h:
struct request_queue {
/*
* Together with queue_head for cacheline sharing
*/
struct list_head queue_head;//list of pending requests
struct request *last_merge;//the request in the queue most likely to merge next
struct elevator_queue *elevator;//pointer to the elevator object
int nr_rqs[2]; /* # allocated [a]sync rqs */
int nr_rqs_elvpriv; /* # allocated rqs w/ elvpriv */
atomic_t shared_hctx_restart;
struct blk_queue_stats *stats;
struct rq_wb *rq_wb;
/*
* If blkcg is not used, @q->root_rl serves all requests. If blkcg
* is used, root blkg allocates from @q->root_rl and all other
* blkgs from their own blkg->rl. Which one to use should be
* determined using bio_request_list().
*/
struct request_list root_rl;
request_fn_proc *request_fn;//the driver's strategy-routine entry point
make_request_fn *make_request_fn;
poll_q_fn *poll_fn;
prep_rq_fn *prep_rq_fn;
unprep_rq_fn *unprep_rq_fn;
softirq_done_fn *softirq_done_fn;
rq_timed_out_fn *rq_timed_out_fn;
dma_drain_needed_fn *dma_drain_needed;
lld_busy_fn *lld_busy_fn;
/* Called just after a request is allocated */
init_rq_fn *init_rq_fn;
/* Called just before a request is freed */
exit_rq_fn *exit_rq_fn;
/* Called from inside blk_get_request() */
void (*initialize_rq_fn)(struct request *rq);
const struct blk_mq_ops *mq_ops;
unsigned int *mq_map;
/* sw queues */
struct blk_mq_ctx __percpu *queue_ctx;
unsigned int nr_queues;
unsigned int queue_depth;
/* hw dispatch queues */
struct blk_mq_hw_ctx **queue_hw_ctx;
unsigned int nr_hw_queues;
/*
* Dispatch queue sorting
*/
sector_t end_sector;
struct request *boundary_rq;
/*
* Delayed queue handling
*/
struct delayed_work delay_work;
struct backing_dev_info *backing_dev_info;
/*
* The queue owner gets to use this for whatever they like.
* ll_rw_blk doesn't touch it.
*/
void *queuedata;
/*
* various queue flags, see QUEUE_* below
*/
unsigned long queue_flags;
/*
* ida allocated id for this queue. Used to index queues from
* ioctx.
*/
int id;
/*
* queue needs bounce pages for pages above this limit
*/
gfp_t bounce_gfp;
/*
* protects queue structures from reentrancy. ->__queue_lock should
* _never_ be used directly, it is queue private. always use
* ->queue_lock.
*/
spinlock_t __queue_lock;
spinlock_t *queue_lock;
/*
* queue kobject
*/
struct kobject kobj;
/*
* mq queue kobject
*/
struct kobject mq_kobj;
#ifdef CONFIG_BLK_DEV_INTEGRITY
struct blk_integrity integrity;
#endif /* CONFIG_BLK_DEV_INTEGRITY */
#ifdef CONFIG_PM
struct device *dev;
int rpm_status;
unsigned int nr_pending;
#endif
/*
* queue settings
*/
unsigned long nr_requests; /* Max # of requests */
unsigned int nr_congestion_on;
unsigned int nr_congestion_off;
unsigned int nr_batching;
unsigned int dma_drain_size;
void *dma_drain_buffer;
unsigned int dma_pad_mask;
unsigned int dma_alignment;
struct blk_queue_tag *queue_tags;
struct list_head tag_busy_list;
unsigned int nr_sorted;
unsigned int in_flight[2];
/*
* Number of active block driver functions for which blk_drain_queue()
* must wait. Must be incremented around functions that unlock the
* queue_lock internally, e.g. scsi_request_fn().
*/
unsigned int request_fn_active;
unsigned int rq_timeout;
int poll_nsec;
struct blk_stat_callback *poll_cb;
struct blk_rq_stat poll_stat[BLK_MQ_POLL_STATS_BKTS];
struct timer_list timeout;
struct work_struct timeout_work;
struct list_head timeout_list;
struct list_head icq_list;
#ifdef CONFIG_BLK_CGROUP
DECLARE_BITMAP (blkcg_pols, BLKCG_MAX_POLS);
struct blkcg_gq *root_blkg;
struct list_head blkg_list;
#endif
struct queue_limits limits;
/*
* Zoned block device information for request dispatch control.
* nr_zones is the total number of zones of the device. This is always
* 0 for regular block devices. seq_zones_bitmap is a bitmap of nr_zones
* bits which indicates if a zone is conventional (bit clear) or
* sequential (bit set). seq_zones_wlock is a bitmap of nr_zones
* bits which indicates if a zone is write locked, that is, if a write
* request targeting the zone was dispatched. All three fields are
* initialized by the low level device driver (e.g. scsi/sd.c).
* Stacking drivers (device mappers) may or may not initialize
* these fields.
*/
unsigned int nr_zones;
unsigned long *seq_zones_bitmap;
unsigned long *seq_zones_wlock;
/*
* sg stuff
*/
unsigned int sg_timeout;
unsigned int sg_reserved_size;
int node;
#ifdef CONFIG_BLK_DEV_IO_TRACE
struct blk_trace *blk_trace;
struct mutex blk_trace_mutex;
#endif
/*
* for flush operations
*/
struct blk_flush_queue *fq;
struct list_head requeue_list;
spinlock_t requeue_lock;
struct delayed_work requeue_work;
struct mutex sysfs_lock;
int bypass_depth;
atomic_t mq_freeze_depth;
#if defined(CONFIG_BLK_DEV_BSG)
bsg_job_fn *bsg_job_fn;
struct bsg_class_device bsg_dev;
#endif
#ifdef CONFIG_BLK_DEV_THROTTLING
/* Throttle data */
struct throtl_data *td;
#endif
struct rcu_head rcu_head;
wait_queue_head_t mq_freeze_wq;
struct percpu_ref q_usage_counter;
struct list_head all_q_node;
struct blk_mq_tag_set *tag_set;
struct list_head tag_set_list;
struct bio_set *bio_split;
#ifdef CONFIG_BLK_DEBUG_FS
struct dentry *debugfs_dir;
struct dentry *sched_debugfs_dir;
#endif
bool mq_sysfs_init_done;
size_t cmd_size;
void *rq_alloc_data;
struct work_struct release_work;
#define BLK_MAX_WRITE_HINTS 5
u64 write_hints[BLK_MAX_WRITE_HINTS];
};
This structure is exceptionally large, nearly rivaling sk_buff.
It maintains the queue of block-layer I/O requests: every request is inserted into this queue, and each disk device has exactly one queue (even if it has several partitions).
A request_queue holds multiple requests, and each request may hold several bios; request merging means adding multiple bios into one request according to various rules. A sketch of walking this hierarchy follows.
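A small sketch of walking that hierarchy with the kernel's own iterator; my_show_segments is an illustrative name:

#include <linux/blkdev.h>

/* Iterate over every data segment of a request: rq_for_each_segment()
 * walks each bio from rq->bio to rq->biotail and each bio_vec within. */
static void my_show_segments(struct request *rq)
{
	struct req_iterator iter;
	struct bio_vec bvec;

	rq_for_each_segment(bvec, rq, iter)
		pr_info("page=%p len=%u off=%u\n",
			bvec.bv_page, bvec.bv_len, bvec.bv_offset);
}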
5.3 elevator_queue
The elevator scheduling queue; every request queue has one.
struct elevator_queue
{
struct elevator_type *type;
void *elevator_data;
struct kobject kobj;
struct mutex sysfs_lock;
unsigned int registered:1;
unsigned int uses_mq:1;
DECLARE_HASHTABLE(hash, ELV_HASH_BITS);
};
5.4 elevator_type
An elevator type is essentially a scheduling-algorithm type, such as AS or deadline.
struct elevator_type
{
/* managed by elevator core */
struct kmem_cache *icq_cache;
/* fields provided by elevator implementation */
union {
struct elevator_ops sq;
struct elevator_mq_ops mq;
} ops;
size_t icq_size; /* see iocontext.h */
size_t icq_align; /* ditto */
struct elv_fs_entry *elevator_attrs;
char elevator_name[ELV_NAME_MAX];
const char *elevator_alias;
struct module *elevator_owner;
bool uses_mq;
#ifdef CONFIG_BLK_DEBUG_FS
const struct blk_mq_debugfs_attr *queue_debugfs_attrs;
const struct blk_mq_debugfs_attr *hctx_debugfs_attrs;
#endif
/* managed by elevator core */
char icq_cache_name[ELV_NAME_MAX + 6]; /* elvname + "_io_cq" */
struct list_head list;
};
5.4.1 iosched_cfq
For example, the cfq scheduler's type structure, which specifies all of that scheduler's functions.
static struct elevator_type iosched_cfq = {
.ops.sq = {
.elevator_merge_fn = cfq_merge,
.elevator_merged_fn = cfq_merged_request,
.elevator_merge_req_fn = cfq_merged_requests,
.elevator_allow_bio_merge_fn = cfq_allow_bio_merge,
.elevator_allow_rq_merge_fn = cfq_allow_rq_merge,
.elevator_bio_merged_fn = cfq_bio_merged,
.elevator_dispatch_fn = cfq_dispatch_requests,
.elevator_add_req_fn = cfq_insert_request,
.elevator_activate_req_fn = cfq_activate_request,
.elevator_deactivate_req_fn = cfq_deactivate_request,
.elevator_completed_req_fn = cfq_completed_request,
.elevator_former_req_fn = elv_rb_former_request,
.elevator_latter_req_fn = elv_rb_latter_request,
.elevator_init_icq_fn = cfq_init_icq,
.elevator_exit_icq_fn = cfq_exit_icq,
.elevator_set_req_fn = cfq_set_request,
.elevator_put_req_fn = cfq_put_request,
.elevator_may_queue_fn = cfq_may_queue,
.elevator_init_fn = cfq_init_queue,
.elevator_exit_fn = cfq_exit_queue,
.elevator_registered_fn = cfq_registered_queue,
},
.icq_size = sizeof(struct cfq_io_cq),
.icq_align = __alignof__(struct cfq_io_cq),
.elevator_attrs = cfq_attrs,
.elevator_name = "cfq",
.elevator_owner = THIS_MODULE,
};
5.5 gendisk
Now the disk data structure, gendisk (defined in <include/linux/genhd.h>): the kernel's representation of a single disk drive, and the most important data structure in the block I/O subsystem.
struct gendisk {
/* major, first_minor and minors are input parameters only,
* don't use directly. Use disk_devt() and disk_max_parts().
*/
int major; /* major number of driver */
int first_minor;
int minors; /* maximum number of minors, =1 for
* disks that can't be partitioned. */
char disk_name[DISK_NAME_LEN]; /* name of major driver */
char *(*devnode)(struct gendisk *gd, umode_t *mode);
unsigned int events; /* supported events */
unsigned int async_events; /* async events, subset of all */
/* Array of pointers to partitions indexed by partno.
* Protected with matching bdev lock but stat and other
* non-critical accesses use RCU. Always access through
* helpers.
*/
struct disk_part_tbl __rcu *part_tbl;
struct hd_struct part0;
const struct block_device_operations *fops;
struct request_queue *queue;
void *private_data;
int flags;
struct rw_semaphore lookup_sem;
struct kobject *slave_dir;
struct timer_rand_state *random;
atomic_t sync_io; /* RAID */
struct disk_events *ev;
#ifdef CONFIG_BLK_DEV_INTEGRITY
struct kobject integrity_kobj;
#endif /* CONFIG_BLK_DEV_INTEGRITY */
int node_id;
struct badblocks *bb;
struct lockdep_map lockdep_map;
};
The structure holds the major number, the minor numbers (marking the partitions), the drive name (shown in /proc/partitions and sysfs), the device operations (block_device_operations), the device's request queue, drive state and capacity, a private_data pointer for the driver's internal data, and so on.
Related functions: alloc_disk allocates a disk, and del_gendisk removes one.
Allocating a gendisk structure does not make the disk usable by the system. The structure must also be initialized, and add_disk called. Once add_disk has been called the disk is "live": its methods can be invoked at any time, and the kernel is free to touch the device from that moment. In fact, the first call may well happen before add_disk even returns; the kernel will read the first few blocks looking for a partition table. So do not call add_disk until the driver is fully initialized and ready to respond to requests on the disk. A minimal registration sketch follows.
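A minimal registration sketch under those rules; my_major, my_fops, and the ready-to-use queue q are assumptions supplied by the rest of a driver:

#include <linux/module.h>
#include <linux/genhd.h>
#include <linux/blkdev.h>

static int my_major;	/* assumed to come from register_blkdev() */
static const struct block_device_operations my_fops = {
	.owner = THIS_MODULE,
};

static struct gendisk *my_register_disk(struct request_queue *q,
					sector_t capacity_sectors)
{
	struct gendisk *disk = alloc_disk(16);	/* up to 16 minors */

	if (!disk)
		return NULL;
	disk->major = my_major;
	disk->first_minor = 0;
	disk->fops = &my_fops;
	disk->queue = q;
	snprintf(disk->disk_name, DISK_NAME_LEN, "myblk0");
	set_capacity(disk, capacity_sectors);	/* size in 512-byte sectors */

	add_disk(disk);		/* the disk is "live" from this point on */
	return disk;
}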
5.6 hd_struct
The disk partition structure.
struct hd_struct {
sector_t start_sect;
/*
* nr_sects is protected by sequence counter. One might extend a
* partition while IO is happening to it and update of nr_sects
* can be non-atomic on 32bit machines with 64bit sector_t.
*/
sector_t nr_sects;
seqcount_t nr_sects_seq;
sector_t alignment_offset;
unsigned int discard_alignment;
struct device __dev;
struct kobject *holder_dir;
int policy, partno;
struct partition_meta_info *info;
#ifdef CONFIG_FAIL_MAKE_REQUEST
int make_it_fail;
#endif
unsigned long stamp;
atomic_t in_flight[2];
#ifdef CONFIG_SMP
struct disk_stats __percpu *dkstats;
#else
struct disk_stats dkstats;
#endif
struct percpu_ref ref;
struct rcu_head rcu_head;
};
5.7 bio
Before kernel 2.4, buffer heads were used: every I/O request was broken into 512-byte blocks, which made a high-performance I/O subsystem impossible. One important piece of work in 2.5 was high-performance I/O support, which produced today's bio structure.
The bio structure carries the actual data of a request: a request contains one or more bios, and at the bottom the device is in fact operated on bio by bio, as they are handed to the driver.
The code merges a bio into an existing request structure, or creates a new request if necessary; the bio contains all the information the driver needs to carry out the request.
A bio holds multiple pages, which correspond to a contiguous region of the disk. Since files are not stored contiguously on disk, a file I/O is very likely split into several bios before being submitted to the block device.
It is defined in include/linux/blk_types.h; unfortunately, the structure has changed quite a bit over time, and in particular no longer matches the LDD book.
/*
* main unit of I/O for the block layer and lower layers (ie drivers and
* stacking drivers)
*/
struct bio {
struct bio *bi_next; /* request queue link */
struct gendisk *bi_disk;
unsigned int bi_opf; /* bottom bits req flags,
* top bits REQ_OP. Use
* accessors.
*/
unsigned short bi_flags; /* status, etc and bvec pool number */
unsigned short bi_ioprio;
unsigned short bi_write_hint;
blk_status_t bi_status;
u8 bi_partno;
/* Number of segments in this BIO after
* physical address coalescing is performed.
*/
unsigned int bi_phys_segments;
/*
* To keep track of the max segment size, we account for the
* sizes of the first and last mergeable segments in this bio.
*/
unsigned int bi_seg_front_size;
unsigned int bi_seg_back_size;
struct bvec_iter bi_iter;
atomic_t __bi_remaining;
bio_end_io_t *bi_end_io;
void *bi_private;
#ifdef CONFIG_BLK_CGROUP
/*
* Optional ioc and css associated with this bio. Put on bio
* release. Read comment on top of bio_associate_current().
*/
struct io_context *bi_ioc;
struct cgroup_subsys_state *bi_css;
#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
void *bi_cg_private;
struct blk_issue_stat bi_issue_stat;
#endif
#endif
union {
#if defined(CONFIG_BLK_DEV_INTEGRITY)
struct bio_integrity_payload *bi_integrity; /* data integrity */
#endif
};
unsigned short bi_vcnt; /* how many bio_vec's */
/*
* Everything starting with bi_max_vecs will be preserved by bio_reset()
*/
unsigned short bi_max_vecs; /* max bvl_vecs we can hold */
atomic_t __bi_cnt; /* pin count */
struct bio_vec *bi_io_vec; /* the actual vec list */
struct bio_set *bi_pool;
/*
* We can inline a number of vecs at the end of the bio, to avoid
* double allocations for a small number of bio_vecs. This member
* MUST obviously be kept at the very end of the bio.
*/
struct bio_vec bi_inline_vecs[0];
};
5.8 bio_vec
The bio_vec structure lives in include/linux/bvec.h:
struct bio_vec {
struct page *bv_page; //the physical page the buffer resides in
unsigned int bv_len; //size in bytes
unsigned int bv_offset;//offset within the page, in bytes
};
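A short sketch of iterating a bio's segments with these structures; my_count_bytes is an illustrative name:

#include <linux/bio.h>

/* Sum the lengths of all bio_vec segments in a bio. */
static unsigned int my_count_bytes(struct bio *bio)
{
	struct bio_vec bvec;
	struct bvec_iter iter;
	unsigned int bytes = 0;

	bio_for_each_segment(bvec, bio, iter)
		bytes += bvec.bv_len;

	return bytes;	/* matches bio->bi_iter.bi_size for a fresh bio */
}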
5.9 Multi-Queue Structures
5.9.1 blk_mq_ctx
Represents the software staging queues.
struct blk_mq_ctx {
struct {
spinlock_t lock;
struct list_head rq_list;
} ____cacheline_aligned_in_smp;
unsigned int cpu;
unsigned int index_hw;
/* incremented at dispatch time */
unsigned long rq_dispatched[2];
unsigned long rq_merged;
/* incremented at completion time */
unsigned long ____cacheline_aligned_in_smp rq_completed[2];
struct request_queue *queue;
struct kobject kobj;
} ____cacheline_aligned_in_smp;
5.9.2 blk_mq_hw_ctx
The multi-queue hardware queue (hardware context). Its mapping to blk_mq_ctx is implemented by map_queues in blk_mq_ops, and the same mapping is also kept in the request_queue's mq_map.
/**
* struct blk_mq_hw_ctx - State for a hardware queue facing the hardware block device
*/
struct blk_mq_hw_ctx {
struct {
spinlock_t lock;
struct list_head dispatch;
unsigned long state; /* BLK_MQ_S_* flags */
} ____cacheline_aligned_in_smp;
struct delayed_work run_work;
cpumask_var_t cpumask;
int next_cpu;
int next_cpu_batch;
unsigned long flags; /* BLK_MQ_F_* flags */
void *sched_data;
struct request_queue *queue;
struct blk_flush_queue *fq;
void *driver_data;
struct sbitmap ctx_map;
struct blk_mq_ctx *dispatch_from;
struct blk_mq_ctx **ctxs;
unsigned int nr_ctx;
wait_queue_entry_t dispatch_wait;
atomic_t wait_index;
struct blk_mq_tags *tags;
struct blk_mq_tags *sched_tags;
unsigned long queued;
unsigned long run;
#define BLK_MQ_MAX_DISPATCH_ORDER 7
unsigned long dispatched[BLK_MQ_MAX_DISPATCH_ORDER];
unsigned int numa_node;
unsigned int queue_num;
atomic_t nr_active;
unsigned int nr_expired;
struct hlist_node cpuhp_dead;
struct kobject kobj;
unsigned long poll_considered;
unsigned long poll_invoked;
unsigned long poll_success;
#ifdef CONFIG_BLK_DEBUG_FS
struct dentry *debugfs_dir;
struct dentry *sched_debugfs_dir;
#endif
/* Must be the last member - see also blk_mq_hw_ctx_size(). */
struct srcu_struct srcu[0];
};
5.9.3 blk_mq_tag_set
The tag set, describing the tags and hardware-queue parameters a driver shares across its queues:
struct blk_mq_tag_set {
unsigned int *mq_map;
const struct blk_mq_ops *ops;
unsigned int nr_hw_queues;
unsigned int queue_depth; /* max hw supported */
unsigned int reserved_tags;
unsigned int cmd_size; /* per-request extra data */
int numa_node;
unsigned int timeout;
unsigned int flags; /* BLK_MQ_F_* */
void *driver_data;
struct blk_mq_tags **tags;
struct mutex tag_list_lock;
struct list_head tag_list;
};
5.10 Operations Structures
5.10.1 elevator_ops
The scheduler operations set.
struct elevator_ops
{
elevator_merge_fn *elevator_merge_fn;
elevator_merged_fn *elevator_merged_fn;
elevator_merge_req_fn *elevator_merge_req_fn;
elevator_allow_bio_merge_fn *elevator_allow_bio_merge_fn;
elevator_allow_rq_merge_fn *elevator_allow_rq_merge_fn;
elevator_bio_merged_fn *elevator_bio_merged_fn;
elevator_dispatch_fn *elevator_dispatch_fn;
elevator_add_req_fn *elevator_add_req_fn;
elevator_activate_req_fn *elevator_activate_req_fn;
elevator_deactivate_req_fn *elevator_deactivate_req_fn;
elevator_completed_req_fn *elevator_completed_req_fn;
elevator_request_list_fn *elevator_former_req_fn;
elevator_request_list_fn *elevator_latter_req_fn;
elevator_init_icq_fn *elevator_init_icq_fn; /* see iocontext.h */
elevator_exit_icq_fn *elevator_exit_icq_fn; /* ditto */
elevator_set_req_fn *elevator_set_req_fn;
elevator_put_req_fn *elevator_put_req_fn;
elevator_may_queue_fn *elevator_may_queue_fn;
elevator_init_fn *elevator_init_fn;
elevator_exit_fn *elevator_exit_fn;
elevator_registered_fn *elevator_registered_fn;
};
5.10.2 elevator_mq_ops
The multi-queue scheduler operations set.
struct elevator_mq_ops {
int (*init_sched)(struct request_queue *, struct elevator_type *);
void (*exit_sched)(struct elevator_queue *);
int (*init_hctx)(struct blk_mq_hw_ctx *, unsigned int);
void (*exit_hctx)(struct blk_mq_hw_ctx *, unsigned int);
bool (*allow_merge)(struct request_queue *, struct request *, struct bio *);
bool (*bio_merge)(struct blk_mq_hw_ctx *, struct bio *);
int (*request_merge)(struct request_queue *q, struct request **, struct bio *);
void (*request_merged)(struct request_queue *, struct request *, enum elv_merge);
void (*requests_merged)(struct request_queue *, struct request *, struct request *);
void (*limit_depth)(unsigned int, struct blk_mq_alloc_data *);
void (*prepare_request)(struct request *, struct bio *bio);
void (*finish_request)(struct request *);
void (*insert_requests)(struct blk_mq_hw_ctx *, struct list_head *, bool);
struct request *(*dispatch_request)(struct blk_mq_hw_ctx *);
bool (*has_work)(struct blk_mq_hw_ctx *);
void (*completed_request)(struct request *);
void (*started_request)(struct request *);
void (*requeue_request)(struct request *);
struct request *(*former_request)(struct request_queue *, struct request *);
struct request *(*next_request)(struct request_queue *, struct request *);
void (*init_icq)(struct io_cq *);
void (*exit_icq)(struct io_cq *);
};
5.10.3 Multi-Queue
5.10.3.1 blk_mq_ops
The multi-queue operations structure: the bridge between the multi-queue block layer and the block device, and a very important one.
struct blk_mq_ops {
/*
* Queue request
*/
queue_rq_fn *queue_rq;//hands one queued request to the driver.
/*
* Reserve budget before queue request, once .queue_rq is
* run, it is driver's responsibility to release the
* reserved budget. Also we have to handle failure case
* of .get_budget for avoiding I/O deadlock.
*/
get_budget_fn *get_budget;
put_budget_fn *put_budget;
/*
* Called on request timeout
*/
timeout_fn *timeout;
/*
* Called to poll for completion of a specific tag.
*/
poll_fn *poll;
softirq_done_fn *complete;
/*
* Called when the block layer side of a hardware queue has been
* set up, allowing the driver to allocate/init matching structures.
* Ditto for exit/teardown.
*/
init_hctx_fn *init_hctx;
exit_hctx_fn *exit_hctx;
/*
* Called for every command allocated by the block layer to allow
* the driver to set up driver specific data.
*
* Tag greater than or equal to queue_depth is for setting up
* flush request.
*
* Ditto for exit/teardown.
*/
init_request_fn *init_request;
exit_request_fn *exit_request;
/* Called from inside blk_get_request() */
void (*initialize_rq_fn)(struct request *rq);
map_queues_fn *map_queues;//the blk_mq_ctx-to-blk_mq_hw_ctx mapping
#ifdef CONFIG_BLK_DEBUG_FS
/*
* Used by the debugfs implementation to show driver-specific
* information about a request.
*/
void (*show_rq)(struct seq_file *m, struct request *rq);
#endif
};
For example, SCSI's multi-queue operations set:
5.10.3.1.1 scsi_mq_ops
The multi-queue operations used by the current SCSI driver; the old single-queue handler is scsi_request_fn.
static const struct blk_mq_ops scsi_mq_ops = {
.get_budget = scsi_mq_get_budget,
.put_budget = scsi_mq_put_budget,
.queue_rq = scsi_queue_rq,
.complete = scsi_softirq_done,
.timeout = scsi_timeout,
#ifdef CONFIG_BLK_DEBUG_FS
.show_rq = scsi_show_rq,
#endif
.init_request = scsi_mq_init_request,
.exit_request = scsi_mq_exit_request,
.initialize_rq_fn = scsi_initialize_rq,
.map_queues = scsi_map_queues,
};
6 Key Functions
6.1 Multi-Queue
6.1.1 blk_mq_flush_plug_list
The multi-queue function that flushes requests from the plug list; it is called by blk_flush_plug_list.
6.1.2 blk_mq_make_request: The Multi-Queue Entry Point
This function is the counterpart of the single-queue blk_queue_bio: it is the multi-queue entry point, and its body reflects the whole multi-queue I/O flow.
The overall logic closely mirrors the single-queue blk_queue_bio.
If the bio can be merged into the process's plug list, it is merged and the function returns. Otherwise blk_mq_sched_bio_merge tries to merge it into the request queue. In either place, the merge may be a front merge or a back merge.
If the bio cannot be merged, a request structure is generated from it. The new request then follows one of several branches, the conditions involving flush, sync, plug, and so on, but all of them end up calling blk_mq_run_hw_queue to issue the I/O to the device.
static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
{
const int is_sync = op_is_sync(bio->bi_opf);
const int is_flush_fua = op_is_flush(bio->bi_opf);
struct blk_mq_alloc_data data = { .flags = 0 };
struct request *rq;
unsigned int request_count = 0;
struct blk_plug *plug;
struct request *same_queue_rq = NULL;
blk_qc_t cookie;
unsigned int wb_acct;
blk_queue_bounce(q, &bio);
blk_queue_split(q, &bio);//split the bio according to the device's hardware limits
if (!bio_integrity_prep(bio))
return BLK_QC_T_NONE;
if (!is_flush_fua && !blk_queue_nomerges(q) &&
blk_attempt_plug_merge(q, bio, &request_count, &same_queue_rq))//merge into the process's plug list
return BLK_QC_T_NONE;
if (blk_mq_sched_bio_merge(q, bio))//merge into the request queue; return on success
return BLK_QC_T_NONE;
wb_acct = wbt_wait(q->rq_wb, bio, NULL);
trace_block_getrq(q, bio, bio->bi_opf);
rq = blk_mq_get_request(q, bio, bio->bi_opf, &data);//no merge possible: create a new request
if (unlikely(!rq)) {
__wbt_done(q->rq_wb, wb_acct);
if (bio->bi_opf & REQ_NOWAIT)
bio_wouldblock_error(bio);
return BLK_QC_T_NONE;
}
wbt_track(&rq->issue_stat, wb_acct);
cookie = request_to_qc_t(data.hctx, rq);
plug = current->plug;
if (unlikely(is_flush_fua)) {//a flush operation
blk_mq_put_ctx(data.ctx);
blk_mq_bio_to_request(rq, bio);//build the request from the bio, then send it down to the hardware queue
/* bypass scheduler for flush rq */
blk_insert_flush(rq);
blk_mq_run_hw_queue(data.hctx, true);//issue the I/O to the device
} else if (plug && q->nr_hw_queues == 1) {//plugging possible and exactly one hardware queue
struct request *last = NULL;
blk_mq_put_ctx(data.ctx);
blk_mq_bio_to_request(rq, bio);
/*
* @request_count may become stale because of schedule
* out, so check the list again.
*/
if (list_empty(&plug->mq_list))
request_count = 0;
else if (blk_queue_nomerges(q))
request_count = blk_plug_queued_count(q);
if (!request_count)
trace_block_plug(q);
else
last = list_entry_rq(plug->mq_list.prev);
if (request_count >= BLK_MAX_REQUEST_COUNT || (last &&
blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
blk_flush_plug_list(plug, false);
trace_block_plug(q);
}
list_add_tail(&rq->queuelist, &plug->mq_list);
} else if (plug && !blk_queue_nomerges(q)) {
blk_mq_bio_to_request(rq, bio);
/*
* We do limited plugging. If the bio can be merged, do that.
* Otherwise the existing request in the plug list will be
* issued. So the plug list will have one request at most
* The plug list might get flushed before this. If that happens,
* the plug list is empty, and same_queue_rq is invalid.
*/
if (list_empty(&plug->mq_list))
same_queue_rq = NULL;
if (same_queue_rq)
list_del_init(&same_queue_rq->queuelist);
list_add_tail(&rq->queuelist, &plug->mq_list);
blk_mq_put_ctx(data.ctx);
if (same_queue_rq) {
data.hctx = blk_mq_map_queue(q,
same_queue_rq->mq_ctx->cpu);
blk_mq_try_issue_directly(data.hctx, same_queue_rq,
&cookie);
}
} else if (q->nr_hw_queues > 1 && is_sync) {
blk_mq_put_ctx(data.ctx);
blk_mq_bio_to_request(rq, bio);
blk_mq_try_issue_directly(data.hctx, rq, &cookie);
} else if (q->elevator) {
blk_mq_put_ctx(data.ctx);
blk_mq_bio_to_request(rq, bio);
blk_mq_sched_insert_request(rq, false, true, true);
} else {
blk_mq_put_ctx(data.ctx);
blk_mq_bio_to_request(rq, bio);
blk_mq_queue_io(data.hctx, data.ctx, rq);
blk_mq_run_hw_queue(data.hctx, true);
}
return cookie;
}
6.2 Single Queue
6.2.1 blk_flush_plug_list
The single-queue counterpart of blk_mq_flush_plug_list: it flushes the bios plugged on the process onto the scheduler queue via __elv_add_request, and calls __blk_run_queue to issue the I/O.
6.2.2 blk_queue_bio: The Single-Queue Entry Point
This is the single-queue request handling function, responsible for putting bios onto the queue; it is called by generic_make_request. If multi-queue one day fully replaces single queue, this function will be history.
A bio submitted through this function is handled in one of the following ways:
- If the process's I/O is in the plugged state, try to merge the bio into the process's plugged list, i.e. current->plug.list.
- If the process's I/O is in the unplugged state, use the I/O scheduler's code to find a suitable request and merge the bio into it.
- If the bio cannot be merged into any existing request, fall into the logic that allocates a free request just for this bio.
Whether on the plugged list or in the I/O scheduler, a merge is either a front merge or a back merge:
back merges are done by bio_attempt_back_merge,
front merges by bio_attempt_front_merge.
static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
{
struct blk_plug *plug;//the plug structure
int where = ELEVATOR_INSERT_SORT;
struct request *req, *free;
unsigned int request_count = 0;
unsigned int wb_acct;
/*
* low level driver can indicate that it wants pages above a
* certain limit bounced to low memory (ie for highmem, or even
* ISA dma in theory)
*/
blk_queue_bounce(q, &bio);
blk_queue_split(q, &bio);//split the bio according to the request queue's limits.max_sectors and limits.max_segments so it fits the device; these are set in blk_set_default_limits.
if (!bio_integrity_prep(bio))//verify the bio's integrity
return BLK_QC_T_NONE;
if (op_is_flush(bio->bi_opf)) {//REQ_PREFLUSH or REQ_FUA bios need special handling
spin_lock_irq(q->queue_lock);
where = ELEVATOR_INSERT_FLUSH;
goto get_rq;
}
/*
* Check if we can merge with the plugged list before grabbing
* any locks.
*/
if (!blk_queue_nomerges(q)) {//can the queue merge at all? governed by QUEUE_FLAG_NOMERGES
if (blk_attempt_plug_merge(q, bio, &request_count, NULL)) //try to merge the bio into the process plug list, then return; the plugged queue is handled when something later triggers a flush.
return BLK_QC_T_NONE;
} else
request_count = blk_plug_queued_count(q);//just count the requests on the plug list.
spin_lock_irq(q->queue_lock);
switch (elv_merge(q, &req, bio)) {//single-queue I/O scheduling: enter the elevator.
case ELEVATOR_BACK_MERGE://back merge: fold the bio into an existing request, then finish with blk_account_io_start
if (!bio_attempt_back_merge(q, req, bio)) //the back-merge helper
break;
elv_bio_merged(q, req, bio);
free = attempt_back_merge(q, req);
if (free)
__blk_put_request(q, free);
else
elv_merged_request(q, req, ELEVATOR_BACK_MERGE);
goto out_unlock;
case ELEVATOR_FRONT_MERGE://front merge: fold the bio into an existing request, then finish with blk_account_io_start
if (!bio_attempt_front_merge(q, req, bio))
break;
elv_bio_merged(q, req, bio);
free = attempt_front_merge(q, req);
if (free)
__blk_put_request(q, free);
else
elv_merged_request(q, req, ELEVATOR_FRONT_MERGE);
goto out_unlock;
default:
break;
}
get_rq:
wb_acct = wbt_wait(q->rq_wb, bio, q->queue_lock);
/*
* Grab a free request. This is might sleep but can not fail.
* Returns with the queue unlocked.
*/
blk_queue_enter_live(q);
req = get_request(q, bio->bi_opf, bio, 0); //no merge was possible on either the plug list or the request queue, so create a new request.
if (IS_ERR(req)) {
blk_queue_exit(q);
__wbt_done(q->rq_wb, wb_acct);
if (PTR_ERR(req) == -ENOMEM)
bio->bi_status = BLK_STS_RESOURCE;
else
bio->bi_status = BLK_STS_IOERR;
bio_endio(bio);
goto out_unlock;
}
wbt_track(&req->issue_stat, wb_acct);
/*
* After dropping the lock and possibly sleeping here, our request
* may now be mergeable after it had proven unmergeable (above).
* We don't worry about that case for efficiency. It won't happen
* often, and the elevators are able to handle it.
*/
blk_init_request_from_bio(req, bio);//initialize the request from the bio.
if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags))
req->cpu = raw_smp_processor_id();
plug = current->plug;
if (plug) {
/*
* If this is the first request added after a plug, fire
* of a plug trace.
*
* @request_count may become stale because of schedule
* out, so check plug list again.
*/
if (!request_count || list_empty(&plug->list))
trace_block_plug(q);
else {
struct request *last = list_entry_rq(plug->list.prev);
if (request_count >= BLK_MAX_REQUEST_COUNT ||
blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE) {
blk_flush_plug_list(plug, false);//if the count or size of the requests exceeds the limits, flush the plugged I/O; the second argument false means we triggered the flush ourselves rather than the scheduler. This calls __elv_add_request to insert the requests into the elevator queue.
trace_block_plug(q);
}
}
list_add_tail(&req->queuelist, &plug->list);//add the request to the plug list
blk_account_io_start(req, true);//start the I/O statistics accounting for the queue.
} else {
spin_lock_irq(q->queue_lock);
add_acct_request(q, req, where);//calls blk_account_io_start and __elv_add_request, putting the request on the request queue ready for processing.
__blk_run_queue(q);//not plugged, so call __blk_run_queue to kick off the I/O.
out_unlock:
spin_unlock_irq(q->queue_lock);
}
return BLK_QC_T_NONE;
}
6.3 Initialization Functions
6.3.1 blk_mq_init_queue
This function initializes the software (staging) queues and the hardware (dispatch) queues, and performs the mapping between them.
It also installs blk_mq_make_request by calling blk_queue_make_request.
struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
{
struct request_queue *uninit_q, *q;
uninit_q = blk_alloc_queue_node(GFP_KERNEL, set->numa_node, NULL);
if (!uninit_q)
return ERR_PTR(-ENOMEM);
q = blk_mq_init_allocated_queue(set, uninit_q);
if (IS_ERR(q))
blk_cleanup_queue(uninit_q);
return q;
}
6.3.2 blk_mq_init_request
This function invokes the driver's .init_request callback.
static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
unsigned int hctx_idx, int node)
{
int ret;
if (set->ops->init_request) {
ret = set->ops->init_request(set, rq, hctx_idx, node);
if (ret)
return ret;
}
seqcount_init(&rq->gstate_seq);
u64_stats_init(&rq->aborted_gstate_sync);
/*
* start gstate with gen 1 instead of 0, otherwise it will be equal
* to aborted_gstate, and be identified timed out by
* blk_mq_terminate_expired.
*/
WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
return 0;
}
6.3.3 blk_init_queue
The queue initialization function; it calls blk_init_queue_node.
struct request_queue *blk_init_queue(request_fn_proc *rfn, spinlock_t *lock)
{
return blk_init_queue_node(rfn, lock, NUMA_NO_NODE);
}
It calls blk_init_queue_node, and blk_init_queue_node in turn calls blk_init_allocated_queue.
struct request_queue *
blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
{
struct request_queue *q;
q = blk_alloc_queue_node(GFP_KERNEL, node_id, lock);
if (!q)
return NULL;
q->request_fn = rfn;
if (blk_init_allocated_queue(q) < 0) {
blk_cleanup_queue(q);
return NULL;
}
return q;
}
6.3.4 blk_queue_make_request
blk_queue_make_request installs a queue's make_request entry; for multi-queue it is used to install blk_mq_make_request.
6.4 The Key Bridging Function
6.4.1 generic_make_request
This function bridges the layers above and below, which is why a large descriptive comment sits at its definition to help developers understand it.
generic_make_request is the entry point of the bio layer: it hands the bio over to the block layer, feeding the bio structure into a request queue. With single queue it calls blk_queue_bio; with multi-queue it calls blk_mq_make_request.
/**
* generic_make_request - hand a buffer to its device driver for I/O
* @bio: The bio describing the location in memory and on the device.
*
* generic_make_request() is used to make I/O requests of block
* devices. It is passed a &struct bio, which describes the I/O that needs
* to be done.
*
* generic_make_request() does not return any status. The
* success/failure status of the request, along with notification of
* completion, is delivered asynchronously through the bio->bi_end_io
* function described (one day) else where.
*
* The caller of generic_make_request must make sure that bi_io_vec
* are set to describe the memory buffer, and that bi_dev and bi_sector are
* set to describe the device address, and the
* bi_end_io and optionally bi_private are set to describe how
* completion notification should be signaled.
*
* generic_make_request and the drivers it calls may use bi_next if this
* bio happens to be merged with someone else, and may resubmit the bio to
* a lower device by calling into generic_make_request recursively, which
* means the bio should NOT be touched after the call to ->make_request_fn.
*/
blk_qc_t generic_make_request(struct bio *bio)
{
/*
* bio_list_on_stack[0] contains bios submitted by the current
* make_request_fn.
* bio_list_on_stack[1] contains bios that were submitted before
* the current make_request_fn, but that haven't been processed
* yet.
*/
struct bio_list bio_list_on_stack[2];
blk_mq_req_flags_t flags = 0;
struct request_queue *q = bio->bi_disk->queue;//get the queue of the device this bio targets
blk_qc_t ret = BLK_QC_T_NONE;
if (bio->bi_opf & REQ_NOWAIT)//a REQ_NOWAIT bio: set flags accordingly
flags = BLK_MQ_REQ_NOWAIT;
if (blk_queue_enter(q, flags) < 0) {//can the queue handle the request?
if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
bio_wouldblock_error(bio);
else
bio_io_error(bio);
return ret;
}
if (!generic_make_request_checks(bio))//validate the bio
goto out;
/*
* We only want one ->make_request_fn to be active at a time, else
* stack usage with stacked devices could be a problem. So use
* current->bio_list to keep a list of requests submited by a
* make_request_fn function. current->bio_list is also used as a
* flag to say if generic_make_request is currently active in this
* task or not. If it is NULL, then no make_request is active. If
* it is non-NULL, then a make_request is active, and new requests
* should be added at the tail
*/
if (current->bio_list) {//current is the task_struct describing the process; its bio_list holds bios of stacked block devices (MD). For an MD device, just queue the bio here and return.
bio_list_add(&current->bio_list[0], bio);
goto out;
}
/* following loop may be a bit non-obvious, and so deserves some
* explanation.
* Before entering the loop, bio->bi_next is NULL (as all callers
* ensure that) so we have a list with a single bio.
* We pretend that we have just taken it off a longer list, so
* we assign bio_list to a pointer to the bio_list_on_stack,
* thus initialising the bio_list of new bios to be
* added. ->make_request() may indeed add some more bios
* through a recursive call to generic_make_request. If it
* did, we find a non-NULL value in bio_list and re-enter the loop
* from the top. In this case we really did just take the bio
* of the top of the list (no pretending) and so remove it from
* bio_list, and call into ->make_request() again.
*/
BUG_ON(bio->bi_next);
bio_list_init(&bio_list_on_stack[0]);//initialize the list of bios being submitted this time
current->bio_list = bio_list_on_stack;//point task_struct->bio_list at it; it is set back to NULL at the end of the function.
do {//loop over the bios, calling make_request_fn on each one
bool enter_succeeded = true;
if (unlikely(q != bio->bi_disk->queue)) {//does this bio's queue match the one of the previous make_request_fn submission?
if (q)
blk_queue_exit(q);//drop the queue reference; the inverse of blk_queue_enter
q = bio->bi_disk->queue; //take the queue from the next bio
flags = 0;
if (bio->bi_opf & REQ_NOWAIT)
flags = BLK_MQ_REQ_NOWAIT;
if (blk_queue_enter(q, flags) < 0) {
enter_succeeded = false;
q = NULL;
}
}
if (enter_succeeded) {//the queue was entered successfully
struct bio_list lower, same;
/* Create a fresh bio_list for all subordinate requests */
bio_list_on_stack[1] = bio_list_on_stack[0];//bios submitted by the previous make_request_fn move to bio_list_on_stack[1].
bio_list_init(&bio_list_on_stack[0]);//initialize bio_list_on_stack[0] for the bios submitted this round.
ret = q->make_request_fn(q, bio);//the key call: ->make_request_fn
/* sort new bios into those for a lower level
* and those for the same level
*/
bio_list_init(&lower);//initialize the two bio lists
bio_list_init(&same);
while ((bio = bio_list_pop(&bio_list_on_stack[0])) != NULL)//walk the bios submitted this round.
if (q == bio->bi_disk->queue)
bio_list_add(&same, bio);
else
bio_list_add(&lower, bio);
/* now assemble so we handle the lowest level first */
bio_list_merge(&bio_list_on_stack[0], &lower);//merge them back, lowest level first.
bio_list_merge(&bio_list_on_stack[0], &same);
bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);
} else {
if (unlikely(!blk_queue_dying(q) &&
(bio->bi_opf & REQ_NOWAIT)))
bio_wouldblock_error(bio);
else
bio_io_error(bio);
}
bio = bio_list_pop(&bio_list_on_stack[0]);//take the next bio and continue
} while (bio);
current->bio_list = NULL; /* deactivate */
out:
if (q)
blk_queue_exit(q);
return ret;
}
7 References
An article that doesn't give its reference links is just hooliganism.
https://lwn.net/Articles/736534/
https://lwn.net/Articles/738449/
https://www.thomas-krenn.com/en/wiki/Linux_Multi-Queue_Block_IO_Queueing_Mechanism_(blk-mq)
https://miuv.blog/2017/10/21/linux-block-mq-simple-walkthrough/
https://hyunyoung2.github.io/2016/09/14/Multi_Queue/
http://ari-ava.blogspot.com/2014/07/opw-linux-block-io-layer-part-4-multi.html
Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems
blk-mq: new multi-queue block IO queueing mechanism