path: root/block/bfq-iosched.c
Commit message | Author | Age | Files | Lines
* block, bfq: prevent soft_rt_next_start from being stuck at infinity | Davide Sapienza | 2018-05-31 | 1 | -27/+2

BFQ can deem a bfq_queue as soft real-time only if the queue
- periodically becomes completely idle, i.e., empty and with no still-outstanding I/O request;
- after becoming idle, gets new I/O only after a special reference time soft_rt_next_start.

In this respect, after commit "block, bfq: consider also past I/O in soft real-time detection", the value of soft_rt_next_start can never decrease. This causes a problem with the following special updating case for soft_rt_next_start: to prevent queues that are not completely idle from being wrongly detected as soft real-time (when they become non-empty again), soft_rt_next_start is temporarily set to infinity for empty queues with still outstanding I/O requests. But, if such an update is actually performed, then, because of the above commit, soft_rt_next_start will be stuck at infinity forever, and the queue will have no more chance to be considered soft real-time.

On slow systems, this problem does cause actual soft real-time applications to be occasionally not detected as such.

This commit addresses this issue by eliminating the pushing of soft_rt_next_start to infinity, and by changing the way non-empty queues are prevented from being wrongly detected as soft real-time. Simply, a queue that becomes non-empty again can now be detected as soft real-time only if it has no outstanding I/O request.

Signed-off-by: Davide Sapienza <>
Signed-off-by: Paolo Valente <>
Signed-off-by: Jens Axboe <>
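
The detection rule this entry ends up with can be sketched as a small Python model. This is an illustration of the commit message only, not the kernel's C code; the `Queue` class and its field `outstanding` are hypothetical names, and times are plain numbers.

```python
from dataclasses import dataclass


@dataclass
class Queue:
    # Earliest arrival time compatible with soft real-time behaviour.
    soft_rt_next_start: float
    # Requests dispatched but not yet completed (hypothetical field name).
    outstanding: int


def may_be_soft_rt(q: Queue, now: float) -> bool:
    """A queue that becomes non-empty again qualifies only if it has no
    outstanding request and its new I/O arrives after the reference time."""
    return q.outstanding == 0 and now > q.soft_rt_next_start
```

Because the outstanding-request check is made at detection time, the reference time never has to be pushed to infinity, so it cannot get stuck there.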
* block, bfq: increase weight-raising duration for interactive apps | Davide Sapienza | 2018-05-31 | 1 | -11/+15

The maximum possible duration of the weight-raising period for interactive applications is limited to 13 seconds, as this is the time needed to load the largest application that we considered when tuning weight raising.

Unfortunately, in such an evaluation, we did not consider the case of very slow virtual machines. For example, on a QEMU/KVM virtual machine
- running in a slow PC;
- with a virtual disk stacked on a slow low-end 5400rpm HDD;
- serving a heavy I/O workload, such as the sequential reading of several files;
mplayer takes 23 seconds to start, if constantly weight-raised.

To address this issue, this commit conservatively sets the upper limit for weight-raising duration to 25 seconds.

Signed-off-by: Davide Sapienza <>
Signed-off-by: Paolo Valente <>
Signed-off-by: Jens Axboe <>
* block, bfq: remove slow-system class | Paolo Valente | 2018-05-31 | 1 | -95/+42

BFQ computes the duration of weight raising for interactive applications automatically, using some reference parameters. In particular, BFQ uses the best durations (see comments in the code for how these durations have been assessed) for two classes of systems: slow and fast ones. Examples of slow systems are old phones or systems using micro HDDs. Fast systems are all the remaining ones. Using these parameters, BFQ computes the actual duration of the weight raising, for the system at hand, as a function of the relative speed of the system w.r.t. the speed of a reference system, belonging to the same class of systems as the system at hand.

This slow vs fast differentiation proved to be useful in the past, but happens to have little meaning with current hardware. Even worse, it does cause problems in virtual systems, where the speed of the system can vary frequently, and so widely as to confuse the class-detection mechanism, and, as we have verified experimentally, to cause BFQ to compute nonsensical weight-raising durations.

This commit addresses this issue by removing the slow class and the class-detection mechanism.

Signed-off-by: Paolo Valente <>
Signed-off-by: Jens Axboe <>
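
The "function of the relative speed" mentioned above can be illustrated with a minimal Python sketch, assuming a single reference system and a simple inverse-proportional rule. All names are hypothetical. The 13-second upper bound is the limit quoted elsewhere in this log (later raised to 25 seconds); the 3-second lower bound is an assumption made for the example.

```python
def wr_duration_ms(ref_rate: int, ref_duration_ms: int, peak_rate: int,
                   min_ms: int = 3000, max_ms: int = 13000) -> int:
    """Scale the reference duration by how slow the device is relative to
    the reference system, then clamp the result.

    ref_rate and peak_rate must use the same (arbitrary) unit.
    min_ms is an assumed lower bound; max_ms is a parameter so that either
    the 13 s or the 25 s limit from this log can be plugged in.
    """
    if peak_rate <= 0:
        raise ValueError("peak_rate must be positive")
    duration = ref_duration_ms * ref_rate // peak_rate
    return max(min_ms, min(max_ms, duration))
```

With a single reference system there is no class to detect, so a device whose measured speed swings widely only moves the result along one curve.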
* block, bfq: add description of weight-raising heuristics | Paolo Valente | 2018-05-31 | 1 | -24/+56

A description of how weight raising works is missing in BFQ sources. In addition, the code for handling weight raising is scattered across a few functions. This makes it rather hard to understand the mechanism and its rationale. This commit adds such a description at the beginning of the main source file.

Signed-off-by: Paolo Valente <>
Signed-off-by: Jens Axboe <>
* block, bfq: remove the removal of 'next' rq in bfq_requests_merged | Filippo Muzzini | 2018-05-31 | 1 | -7/+0

Since bfq_finish_request() is always called on the request 'next', after bfq_requests_merged() is finished, and bfq_finish_request() removes 'next' from its bfq_queue if needed, it isn't necessary to do such a removal in advance in bfq_requests_merged().

This commit removes such a useless 'next' removal.

Signed-off-by: Filippo Muzzini <>
Signed-off-by: Paolo Valente <>
Signed-off-by: Jens Axboe <>
* block, bfq: remove wrong check in bfq_requests_merged | Paolo Valente | 2018-05-31 | 1 | -6/+20

The request rq passed to the function bfq_requests_merged is always in a bfq_queue, so the check !RB_EMPTY_NODE(&rq->rb_node) at the beginning of bfq_requests_merged always succeeds, and the control flow systematically skips to the end of the function. This implies that the body of the function is never executed, i.e., the repositioning of rq is never performed.

On the opposite end, a check is missing in the body of the function: 'next' must be removed only if it is inside a bfq_queue.

This commit removes the wrong check on rq, and adds the missing check on 'next'. In addition, this commit adds comments on bfq_requests_merged.

Signed-off-by: Filippo Muzzini <>
Signed-off-by: Paolo Valente <>
Signed-off-by: Jens Axboe <>
* block, bfq: remove wrong lock in bfq_requests_merged | Filippo Muzzini | 2018-05-31 | 1 | -2/+0

In bfq_requests_merged(), there is a deadlock because the lock on bfqq->bfqd->lock is held by the calling function, but the code of this function tries to grab the lock again.

This deadlock is currently hidden by another bug (fixed by next commit for this source file), which causes the body of bfq_requests_merged() to be never executed.

This commit removes the deadlock by removing the lock/unlock pair.

Signed-off-by: Filippo Muzzini <>
Signed-off-by: Paolo Valente <>
Signed-off-by: Jens Axboe <>
* bfq-iosched: update shallow depth to smallest one used | Jens Axboe | 2018-05-10 | 1 | -3/+14

If our shallow depth is smaller than the wake batching of sbitmap, we can introduce hangs. Ensure that sbitmap knows how low we'll go.

Acked-by: Paolo Valente <>
Reviewed-by: Omar Sandoval <>
Signed-off-by: Jens Axboe <>
* bfq-iosched: remove unused variable | Jens Axboe | 2018-05-10 | 1 | -9/+7

bfqd->sb_shift was intended as a cache for the sbitmap queue shift, but we don't need it, as it never changes. Kill it with fire.

Acked-by: Paolo Valente <>
Reviewed-by: Omar Sandoval <>
Signed-off-by: Jens Axboe <>
* bfq: calculate shallow depths at init time | Jens Axboe | 2018-05-10 | 1 | -47/+50

It doesn't change, so don't put it in the per-IO hot path.

Acked-by: Paolo Valente <>
Reviewed-by: Omar Sandoval <>
Signed-off-by: Jens Axboe <>
* bfq-iosched: don't worry about reserved tags in limit_depth | Jens Axboe | 2018-05-10 | 1 | -8/+1

Reserved tags are used for error handling; we don't need to care about them for regular IO. The core won't call us for these anyway.

Acked-by: Paolo Valente <>
Reviewed-by: Omar Sandoval <>
Signed-off-by: Jens Axboe <>
* block, bfq: postpone rq preparation to insert or merge | Paolo Valente | 2018-05-10 | 1 | -29/+57

When invoked for an I/O request rq, the prepare_request hook of bfq increments reference counters in the destination bfq_queue for rq. In this respect, after this hook has been invoked, rq may still be transformed into a request with no icq attached, i.e., for bfq, a request not associated with any bfq_queue. No further hook is invoked to signal this transformation to bfq (in general, to the destination elevator for rq). This leads bfq into an inconsistent state, because bfq has no chance to correctly lower these counters back. This inconsistency may in its turn cause incorrect scheduling and hangs. It certainly causes memory leaks, by making it impossible for bfq to free the involved bfq_queue.

On the bright side, no transformation can still happen for rq after rq has been inserted into bfq, or merged with another, already inserted, request. Exploiting this fact, this commit addresses the above issue by delaying the preparation of an I/O request to when the request is inserted or merged.

This change also gives a performance bonus: a lock-contention point gets removed. To prepare a request, bfq needs to hold its scheduler lock. After postponing request preparation to insertion or merging, no lock needs to be grabbed any longer in the prepare_request hook, while the lock already taken to perform insertion or merging is used to prepare the request as well.

Tested-by: Oleksandr Natalenko <>
Tested-by: Bart Van Assche <>
Signed-off-by: Paolo Valente <>
Signed-off-by: Jens Axboe <>
* block: consolidate struct request timestamp fields | Omar Sandoval | 2018-05-09 | 1 | -2/+2

Currently, struct request has four timestamp fields:
- A start time, set at get_request time, in jiffies, used for iostats
- An I/O start time, set at start_request time, in ktime nanoseconds, used for blk-stats (i.e., wbt, kyber, hybrid polling)
- Another start time and another I/O start time, used for cfq and bfq

These can all be consolidated into one start time and one I/O start time, both in ktime nanoseconds, shaving off up to 16 bytes from struct request depending on the kernel config.

Signed-off-by: Omar Sandoval <>
Signed-off-by: Jens Axboe <>
* bfq-iosched: ensure to clear bic/bfqq pointers when preparing request | Jens Axboe | 2018-04-17 | 1 | -1/+9

Even if we don't have an IO context attached to a request, we still need to clear the priv[0..1] pointers, as they could be pointing to previously used bic/bfqq structures. If we don't do so, we'll either corrupt memory on dispatching a request, or cause an imbalance in counters.

Inspired by a fix from Kees.

Reported-by: Oleksandr Natalenko <>
Reported-by: Kees Cook <>
Cc:
Fixes: aee69d78dec0 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
Signed-off-by: Jens Axboe <>
* block, bfq: lower-bound the estimated peak rate to 1 | Paolo Valente | 2018-03-26 | 1 | -1/+24

If a storage device handled by BFQ happens to be slower than 7.5 KB/s for a certain amount of time (in the order of a second), then the estimated peak rate of the device, maintained in BFQ, becomes equal to 0. The reason is the limited precision with which the rate is represented (details on the range of representable values in the comments introduced by this commit). This leads to a division-by-zero error where the estimated peak rate is used as divisor. Such a type of failure has been reported in [1].

This commit addresses this issue by:
1. Lower-bounding the estimated peak rate to 1
2. Adding and improving comments on the range of rates representable

[1]

Signed-off-by: Konstantin Khlebnikov <>
Signed-off-by: Paolo Valente <>
Signed-off-by: Jens Axboe <>
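
The precision problem can be reproduced with a few lines of integer arithmetic. The sketch below is a Python model, not kernel code; it assumes the rate is stored as sectors per microsecond in fixed point with a 16-bit shift and 512-byte sectors. Those encoding details are assumptions (the commit message only says that precision is limited), although they do reproduce the 7.5 KB/s threshold quoted above. All function names are hypothetical.

```python
RATE_SHIFT = 16          # assumed fixed-point shift
SECTOR_BYTES = 512       # assumed sector size
USEC_PER_SEC = 1_000_000


def fixed_point_rate(bytes_per_second: int) -> int:
    """Rate in sectors/usec, left-shifted by RATE_SHIFT, truncated to an int."""
    sectors_per_second = bytes_per_second // SECTOR_BYTES
    return (sectors_per_second << RATE_SHIFT) // USEC_PER_SEC


def bounded_rate(bytes_per_second: int) -> int:
    """Same estimate, lower-bounded to 1 so it is always a safe divisor."""
    return max(fixed_point_rate(bytes_per_second), 1)


def service_time_usec(sectors: int, rate: int) -> int:
    """Time needed to transfer `sectors` at `rate`; fails if rate is 0."""
    return (sectors << RATE_SHIFT) // rate
```

At 7680 B/s (7.5 KiB/s) the unbounded estimate truncates to 0, so `service_time_usec` would raise `ZeroDivisionError`; with the bound applied the divisor is 1 and the division is defined.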
* block, bfq: add requeue-request hook | Paolo Valente | 2018-02-07 | 1 | -25/+82

Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device be re-inserted into the active I/O scheduler for that device. As a consequence, I/O schedulers may get the same request inserted again, even several times, without a finish_request invoked on that request before each re-insertion.

This fact is the cause of the failure reported in [1]. For an I/O scheduler, every re-insertion of the same re-prepared request is equivalent to the insertion of a new request. For schedulers like mq-deadline or kyber, this fact causes no harm. In contrast, it confuses a stateful scheduler like BFQ, which keeps state for an I/O request, until the finish_request hook is invoked on the request. In particular, BFQ may get stuck, waiting forever for the number of request dispatches, of the same request, to be balanced by an equal number of request completions (while there will be one completion for that request). In this state, BFQ may refuse to serve I/O requests from other bfq_queues. The hang reported in [1] then follows.

However, the above re-prepared requests undergo a requeue, thus the requeue_request hook of the active elevator is invoked for these requests, if set. This commit then addresses the above issue by properly implementing the hook requeue_request in BFQ.

[1]

Reported-by: Ivan Kozik <>
Reported-by: Alban Browaeys <>
Tested-by: Mike Galbraith <>
Signed-off-by: Paolo Valente <>
Signed-off-by: Serena Ziviani <>
Signed-off-by: Jens Axboe <>
* block, bfq: limit sectors served with interactive weight raising | Paolo Valente | 2018-01-18 | 1 | -9/+72

To maximise responsiveness, BFQ raises the weight, and performs device idling, for bfq_queues associated with processes deemed as interactive. In particular, weight raising has a maximum duration, equal to the time needed to start a large application. If a weight-raised process goes on doing I/O beyond this maximum duration, it loses weight-raising.

This mechanism is evidently vulnerable to the following false positives: I/O-bound applications that will go on doing I/O for much longer than the duration of weight-raising. These applications have basically no benefit from being weight-raised at the beginning of their I/O. On the opposite end, while being weight-raised, these applications
a) unjustly steal throughput from applications that may truly need low latency;
b) make BFQ uselessly perform device idling; device idling results in loss of device throughput with most flash-based storage, and may increase latencies when used purposelessly.

This commit adds a countermeasure to reduce both the above problems. To introduce this countermeasure, we provide the following extra piece of information (full details in the comments added by this commit). During the start-up of the large application used as a reference to set the duration of weight-raising, involved processes transfer at most ~110K sectors each. Accordingly, a process initially deemed as interactive has no right to be weight-raised any longer, once it has transferred 110K sectors or more.

Based on this consideration, this commit early-ends weight-raising for a bfq_queue if the latter happens to have received an amount of service at least equal to 110K sectors (actually, a little bit more, to keep a safety margin). I/O-bound applications that reach a high throughput, such as file copy, get to this threshold much before the allowed weight-raising period finishes. Thus this early ending of weight-raising reduces the amount of time during which these applications cause the problems described above.

Tested-by: Oleksandr Natalenko <>
Tested-by: Holger Hoffstätte <>
Signed-off-by: Paolo Valente <>
Signed-off-by: Jens Axboe <>
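
The two ending conditions described in this entry (time-based and service-based) can be sketched as one predicate. This is a Python illustration with hypothetical names; the 110,000-sector default is the round figure from the commit message, whereas the message notes that the actual limit is slightly larger.

```python
MAX_SERVICE_FROM_WR = 110_000  # sectors; illustrative value from the text


def keep_weight_raising(elapsed_ms: int, wr_duration_ms: int,
                        sectors_served: int,
                        max_service: int = MAX_SERVICE_FROM_WR) -> bool:
    """Weight raising continues only while both budgets still hold:
    the time budget and the amount-of-service budget."""
    within_time = elapsed_ms < wr_duration_ms
    within_service = sectors_served < max_service
    return within_time and within_service
```

A file copy that moves 110K sectors in the first second loses weight raising immediately, long before a multi-second time budget would expire.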
* block, bfq: limit tags for writes and async I/O | Paolo Valente | 2018-01-18 | 1 | -0/+77

Asynchronous I/O can easily starve synchronous I/O (both sync reads and sync writes), by consuming all request tags. Similarly, storms of synchronous writes, such as those that sync(2) may trigger, can starve synchronous reads. In their turn, these two problems may also cause BFQ to lose control on latency for interactive and soft real-time applications. For example, on a PLEXTOR PX-256M5S SSD, LibreOffice Writer takes 0.6 seconds to start if the device is idle, but it takes more than 45 seconds (!) if there are sequential writes in the background.

This commit addresses this issue by limiting the maximum percentage of tags that asynchronous I/O requests and synchronous write requests can consume. In particular, this commit grants a higher threshold to synchronous writes, to prevent the latter from being starved by asynchronous I/O.

According to the above test, LibreOffice Writer now starts in about 1.2 seconds on average, regardless of the background workload, and apart from some rare outlier. To check this improvement, run, e.g.,
sudo ./ bfq 5 5 seq 10 "lowriter --terminate_after_init"
for the comm_startup_lat benchmark in the S suite [1].

[1]

Tested-by: Oleksandr Natalenko <>
Tested-by: Holger Hoffstätte <>
Signed-off-by: Paolo Valente <>
Signed-off-by: Jens Axboe <>
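
The tag-limiting policy can be sketched as follows. This Python model is illustrative only: the function name and the two percentages are hypothetical, since the commit message states only that synchronous writes get a higher threshold than asynchronous I/O and that synchronous reads are what must not be starved.

```python
def tag_limit(total_tags: int, is_sync: bool, is_write: bool,
              async_pct: int = 50, sync_write_pct: int = 75) -> int:
    """Maximum number of tags a request class may hold at once.

    The percentages are placeholders chosen for the example; the only
    constraint taken from the commit message is async_pct < sync_write_pct.
    """
    if is_sync and not is_write:
        return total_tags            # sync reads are not limited
    pct = sync_write_pct if is_sync else async_pct
    return max(1, total_tags * pct // 100)   # never drop below one tag
```

Whatever the background load, the tags above the larger threshold remain available to synchronous reads, which is what lets an application start-up proceed.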
* block, bfq: fix occurrences of request finish method's old name | Chiara Bruschi | 2018-01-10 | 1 | -13/+13

Commit '7b9e93616399' ("blk-mq-sched: unify request finished methods") changed the old name of current bfq_finish_request method, but left it unchanged elsewhere in the code (related comments, part of function name bfq_put_rq_priv_body).

This commit fixes all occurrences of the old name of this method by changing them into the current name.

Fixes: 7b9e93616399 ("blk-mq-sched: unify request finished methods")
Reviewed-by: Paolo Valente <>
Signed-off-by: Federico Motta <>
Signed-off-by: Chiara Bruschi <>
Signed-off-by: Jens Axboe <>
* bfq-iosched: don't call bfqg_and_blkg_put for !CONFIG_BFQ_GROUP_IOSCHED | Jens Axboe | 2018-01-09 | 1 | -1/+1

It's not available if we don't have group io scheduling set, and there's no need to call it.

Fixes: 0d52af590552 ("block, bfq: release oom-queue ref to root group on exit")
Signed-off-by: Jens Axboe <>
* block, bfq: release oom-queue ref to root group on exit | Paolo Valente | 2018-01-09 | 1 | -0/+3

On scheduler init, a reference to the root group, and a reference to its corresponding blkg are taken for the oom queue. Yet these references are not released on scheduler exit, which prevents these objects from being freed. This commit adds the missing reference releases.

Reported-by: Davide Ferrari <>
Tested-by: Holger Hoffstätte <>
Signed-off-by: Paolo Valente <>
Signed-off-by: Jens Axboe <>
* block, bfq: remove batches of confusing ifdefs | Paolo Valente | 2018-01-05 | 1 | -55/+72

Commit a33801e8b473 ("block, bfq: move debug blkio stats behind CONFIG_DEBUG_BLK_CGROUP") introduced two batches of confusing ifdefs: one reported in [1], plus a similar one in another function. This commit removes both batches, in the way suggested in [1].

[1]

Fixes: a33801e8b473 ("block, bfq: move debug blkio stats behind CONFIG_DEBUG_BLK_CGROUP")
Reported-by: Linus Torvalds <>
Tested-by: Luca Miccio <>
Signed-off-by: Paolo Valente <>
Signed-off-by: Jens Axboe <>
* block, bfq: consider also past I/O in soft real-time detection | Paolo Valente | 2018-01-05 | 1 | -34/+81

BFQ privileges the I/O of soft real-time applications, such as video players, to guarantee to these applications a high bandwidth and a low latency. In this respect, it is not easy to correctly detect when an application is soft real-time. A particularly nasty false positive is that of an I/O-bound application that occasionally happens to meet all requirements to be deemed as soft real-time. After being detected as soft real-time, such an application monopolizes the device. Fortunately, BFQ will realize soon that the application is actually not soft real-time and suspend every privilege. Yet, the application may happen again to be wrongly detected as soft real-time, and so on.

As highlighted by our tests, this problem causes BFQ to occasionally fail to guarantee a high responsiveness, in the presence of heavy background I/O workloads. The reason is that the background workload happens to be detected as soft real-time, more or less frequently, during the execution of the interactive task under test. To give an idea, because of this problem, LibreOffice Writer occasionally takes 8 seconds, instead of 3, to start up, if there are sequential reads and writes in the background, on a Kingston SSDNow V300.

This commit addresses this issue by leveraging the following facts.

The reason why some applications are detected as soft real-time despite all BFQ checks to avoid false positives, is simply that, during high CPU or storage-device load, I/O-bound applications may happen to do I/O slowly enough to meet all soft real-time requirements, and pass all BFQ extra checks. Yet, this happens only for limited time periods: slow-speed time intervals are usually interspersed between other time intervals during which these applications do I/O at a very high speed.

To exploit these facts, this commit introduces a little change, in the detection of soft real-time behavior, to systematically consider also the recent past: the higher the speed was in the recent past, the later next I/O should arrive for the application to be considered as soft real-time. At the beginning of a slow-speed interval, the minimum arrival time allowed for the next I/O usually happens to still be so high as to fall *after* the end of the slow-speed period itself. As a consequence, the application does not risk being deemed as soft real-time during the slow-speed interval. Then, during the next high-speed interval, the application cannot, evidently, be deemed as soft real-time (exactly because of its speed), and so on.

This extra filtering proved to be rather effective: in the above test, the frequency of false positives became so low that the start-up time was 3 seconds in all iterations (apart from occasional outliers, caused by page-cache-management issues, which are out of the scope of this commit, and cannot be solved by an I/O scheduler).

Tested-by: Lee Tibbert <>
Signed-off-by: Paolo Valente <>
Signed-off-by: Angelo Ruocco <>
Signed-off-by: Jens Axboe <>
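
One way to read "consider also the recent past" is that the reference arrival time is never allowed to move backwards. The Python sketch below models that reading; the formula for the new candidate time (idle time plus service divided by a maximum soft real-time rate) and all names are assumptions made for illustration, not a transcription of the kernel code.

```python
def next_soft_rt_start(prev_next_start: float, last_idle_time: float,
                       service_since_idle: float,
                       max_softrt_rate: float) -> float:
    """Earliest time at which new I/O is still compatible with soft
    real-time behaviour.

    candidate: the time by which a queue consuming at most max_softrt_rate
    would have been entitled to the service it has just received.
    Taking the max with the previous value keeps the memory of past
    high-speed intervals.
    """
    candidate = last_idle_time + service_since_idle / max_softrt_rate
    return max(prev_next_start, candidate)
```

After a fast interval the previous value sits far in the future, so a short slow interval produces a smaller candidate that is simply ignored.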
* block, bfq: remove superfluous check in queue-merging setup | Angelo Ruocco | 2018-01-05 | 1 | -31/+5

When two or more processes do I/O in such a way that their requests are sequential with respect to one another, BFQ merges the bfq_queues associated with the processes. This way the overall I/O pattern becomes sequential, and thus there is a boost in throughput. These cooperating processes usually start or restart to do I/O shortly after each other. So, in order to avoid merging non-cooperating processes, BFQ ensures that none of these queues has been in weight raising for too long.

In this respect, from commit "block, bfq-sq, bfq-mq: let a queue be merged only shortly after being created", BFQ checks whether any queue (and not only weight-raised ones) has been doing I/O continuously for too long to be merged.

This new additional check makes the first one useless: a queue that has been doing I/O for long enough, if being weight-raised, is also a queue in weight raising for too long to be merged. Accordingly, this commit removes the first check.

Signed-off-by: Angelo Ruocco <>
Signed-off-by: Paolo Valente <>
Signed-off-by: Jens Axboe <>
* block, bfq: let a queue be merged only shortly after starting I/O | Paolo Valente | 2018-01-05 | 1 | -11/+46

In BFQ and CFQ, two processes are said to be cooperating if they do I/O in such a way that the union of their I/O requests yields a sequential I/O pattern. To get such a sequential I/O pattern out of the non-sequential pattern of each cooperating process, BFQ and CFQ merge the queues associated with these processes. In more detail, cooperating processes, and thus their associated queues, usually start, or restart, to do I/O shortly after each other. This is the case, e.g., for the I/O threads of KVM/QEMU and of the dump utility. Based on this assumption, this commit allows a bfq_queue to be merged only during a short time interval (100ms) after it starts, or re-starts, to do I/O.

This filtering provides two important benefits. First, it greatly reduces the probability that two non-cooperating processes have their queues merged by mistake, if they just happen to do I/O close to each other for a short time interval. These spurious merges cause loss of service guarantees. A low-weight bfq_queue may unjustly get more than its expected share of the throughput: if such a low-weight queue is merged with a high-weight queue, then the I/O for the low-weight queue is served as if the queue had a high weight. This may damage other high-weight queues unexpectedly. For instance, because of this issue, lxterminal occasionally took 7.5 seconds to start, instead of 6.5 seconds, when some sequential readers and writers did I/O in the background on a FUJITSU MHX2300BT HDD. The reason is that the bfq_queues associated with some of the readers or the writers were merged with the high-weight queues of some processes that had to do some urgent but little I/O. The readers then exploited the inherited high weight for all or most of their I/O, during the start-up of terminal. The filtering introduced by this commit eliminated any outlier caused by spurious queue merges in our start-up time tests.

This filtering also provides a little boost of the throughput sustainable by BFQ: 3-4%, depending on the CPU. The reason is that, once a bfq_queue cannot be merged any longer, this commit makes BFQ stop updating the data needed to handle merging for the queue.

Signed-off-by: Paolo Valente <>
Signed-off-by: Angelo Ruocco <>
Signed-off-by: Jens Axboe <>
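
The time filter itself is small enough to show in full. A minimal Python sketch, with hypothetical names and millisecond timestamps; the 100 ms window is the figure given in the commit message.

```python
MERGE_WINDOW_MS = 100  # window stated in the commit message


def may_be_merged(now_ms: int, io_start_ms: int,
                  window_ms: int = MERGE_WINDOW_MS) -> bool:
    """A queue is a merge candidate only shortly after it starts,
    or re-starts, to do I/O."""
    return now_ms - io_start_ms <= window_ms


def needs_merge_bookkeeping(now_ms: int, io_start_ms: int) -> bool:
    """Once the window has passed, merge-related data for the queue no
    longer needs updating, which is where the throughput gain comes from."""
    return may_be_merged(now_ms, io_start_ms)
```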
* block, bfq: check low_latency flag in bfq_bfqq_save_state() | Angelo Ruocco | 2018-01-05 | 1 | -1/+2

A just-created bfq_queue will certainly be deemed as interactive on the arrival of its first I/O request, if the low_latency flag is set. Yet, if the queue is merged with another queue on the arrival of its first I/O request, it will not have the chance to be flagged as interactive. Nevertheless, if the queue is then split soon enough, it has to be flagged as interactive after the split.

To handle this early-merge scenario correctly, BFQ saves the state of the queue, on the merge, as if the latter had already been deemed interactive. So, if the queue is split soon, it will get weight-raised, because the previous state of the queue is resumed on the split.

Unfortunately, in the act of saving the state of the newly-created queue, BFQ doesn't check whether the low_latency flag is set, and this causes early-merged queues to be then weight-raised, on queue splits, even if low_latency is off. This commit addresses this problem by adding the missing check.

Signed-off-by: Angelo Ruocco <>
Signed-off-by: Paolo Valente <>
Signed-off-by: Jens Axboe <>
* block, bfq: add missing rq_pos_tree update on rq removal | Paolo Valente | 2018-01-05 | 1 | -0/+2

If two processes do I/O close to each other, then BFQ merges the bfq_queues associated with these processes, to get a more sequential I/O, and thus a higher throughput. In this respect, to detect whether two processes are doing I/O close to each other, BFQ keeps a list of the head-of-line I/O requests of all active bfq_queues. The list is ordered by initial sectors, and implemented through a red-black tree (rq_pos_tree).

Unfortunately, the update of the rq_pos_tree was incomplete, because the tree was not updated on the removal of the head-of-line I/O request of a bfq_queue, in case the queue did not remain empty. This commit adds the missing update.

Signed-off-by: Paolo Valente <>
Signed-off-by: Angelo Ruocco <>
Signed-off-by: Jens Axboe <>
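
The role of the position index, and the update this entry adds, can be modelled in Python with a sorted list standing in for the red-black tree. The class `PosIndex`, the function `remove_head_request`, and the assumption that pending requests are kept head-first in a plain list are all hypothetical simplifications.

```python
import bisect


class PosIndex:
    """Sorted list of (start_sector, queue_id), one entry per active queue."""

    def __init__(self):
        self.entries = []

    def remove(self, queue_id):
        self.entries = [e for e in self.entries if e[1] != queue_id]

    def set_head(self, queue_id, sector):
        self.remove(queue_id)
        bisect.insort(self.entries, (sector, queue_id))

    def closest(self, sector):
        """Queue whose head-of-line request starts nearest to `sector`."""
        if not self.entries:
            return None
        i = bisect.bisect_left(self.entries, (sector,))
        neighbours = self.entries[max(i - 1, 0):i + 1]
        return min(neighbours, key=lambda e: abs(e[0] - sector))[1]


def remove_head_request(index, queue_id, pending_sectors):
    """Drop the head request; keep the index in sync in both outcomes."""
    pending_sectors.pop(0)
    if pending_sectors:
        # The case the commit fixes: the queue is still non-empty, so its
        # entry must move to the position of the new head request.
        index.set_head(queue_id, pending_sectors[0])
    else:
        index.remove(queue_id)
```

Without the `set_head` call in the non-empty branch, the index would keep pointing at a sector the queue has already left, and lookups for nearby queues would use stale positions.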
* block, bfq: increase threshold to deem I/O as random | Paolo Valente | 2018-01-05 | 1 | -1/+1

If two processes do I/O close to each other, i.e., are cooperating processes in BFQ (and CFQ's) nomenclature, then BFQ merges their associated bfq_queues, so as to get sequential I/O from the union of the I/O requests of the processes, and thus reach a higher throughput. A merged queue is then split if its I/O stops being sequential. In this respect, BFQ deems the I/O of a bfq_queue as (mostly) sequential only if less than 4 I/O requests are random, out of the last 32 requests inserted into the queue.

Unfortunately, extensive testing (with the interleaved_io benchmark of the S suite [1], and with real applications spawning cooperating processes) has clearly shown that, with such a low threshold, only a rather low I/O throughput may be reached when several cooperating processes do I/O. In particular, the outcome of each test run was bimodal: if queue merging occurred and was stable during the test, then the throughput was close to the peak rate of the storage device, otherwise the throughput was arbitrarily low (usually around 1/10 of the peak rate with a rotational device). The probability to get the unlucky outcomes grew with the number of cooperating processes: it was already significant with 5 processes, and close to one with 7 or more processes.

The cause of the low throughput in the unlucky runs was that the merged queues containing the I/O of these cooperating processes were soon split, because they contained more random I/O requests than those tolerated by the 4/32 threshold, but
- that I/O would have however allowed the storage device to reach peak throughput or almost peak throughput;
- in contrast, the I/O of these processes, if served individually (from separate queues) yielded a rather low throughput.

So we repeated our tests with increasing values of the threshold, until we found the minimum value (19) for which we obtained maximum throughput, reliably, with at least up to 9 cooperating processes. Then we checked that the use of that higher threshold value did not cause any regression for any other benchmark in the suite [1]. This commit raises the threshold to such a higher value.

[1]

Signed-off-by: Angelo Ruocco <>
Signed-off-by: Paolo Valente <>
Signed-off-by: Jens Axboe <>
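
The "random out of the last 32 requests" test can be modelled with a 32-bit history word and a population count. This Python sketch uses hypothetical names, and the exact boundary (whether a count equal to the threshold still counts as sequential) is an assumption; only the window size and the two threshold values, 4 and 19, come from the commit message.

```python
WINDOW_BITS = 32
MASK = (1 << WINDOW_BITS) - 1


def record(history: int, is_random: bool) -> int:
    """Shift in one bit per inserted request; keep only the last 32."""
    return ((history << 1) | int(is_random)) & MASK


def build_history(flags) -> int:
    """Fold a sequence of per-request 'random' flags into a history word."""
    history = 0
    for flag in flags:
        history = record(history, flag)
    return history


def random_count(history: int) -> int:
    return bin(history).count("1")


def is_seeky(history: int, threshold: int) -> bool:
    """Assumed rule: the queue is seeky when the count exceeds the threshold."""
    return random_count(history) > threshold
```

A merged queue with roughly one random request in three (11 of 32) is split under the old threshold of 4 but stays merged under the new threshold of 19.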
* block, bfq: move debug blkio stats behind CONFIG_DEBUG_BLK_CGROUP | Luca Miccio | 2017-11-14 | 1 | -7/+7

BFQ currently creates, and updates, its own instance of the whole set of blkio statistics that cfq creates. Yet, from the comments of Tejun Heo in [1], it turned out that most of these statistics are meant/useful only for debugging. This commit makes BFQ create the latter, debugging statistics only if the option CONFIG_DEBUG_BLK_CGROUP is set.

By doing so, this commit also enables BFQ to enjoy a high performance boost. The reason is that, if CONFIG_DEBUG_BLK_CGROUP is not set, then BFQ has to update far fewer statistics, and, in particular, not the heaviest to update. To give an idea of the benefits, if CONFIG_DEBUG_BLK_CGROUP is not set, then, on an Intel i7-4850HQ, and with 8 threads doing random I/O in parallel on null_blk (configured with 0 latency), the throughput of BFQ grows from 310 to 400 KIOPS (+30%). We have measured similar or even much higher boosts with other CPUs: e.g., +45% with an ARM Cortex-A53 Octa-core. Our results have been obtained and can be reproduced very easily with the script in [1].

[1]

Suggested-by: Tejun Heo <>
Suggested-by: Ulf Hansson <>
Tested-by: Lee Tibbert <>
Tested-by: Oleksandr Natalenko <>
Signed-off-by: Luca Miccio <>
Signed-off-by: Paolo Valente <>
Signed-off-by: Jens Axboe <>
* block, bfq: update blkio stats outside the scheduler lock | Paolo Valente | 2017-11-14 | 1 | -11/+99

bfq invokes various blkg_*stats_* functions to update the statistics contained in the special files blkio.bfq.* in the blkio controller groups, i.e., the I/O accounting related to the proportional-share policy provided by bfq. The execution of these functions takes a considerable percentage, about 40%, of the total per-request execution time of bfq (i.e., of the sum of the execution time of all the bfq functions that have to be executed to process an I/O request from its creation to its destruction). This reduces the request-processing rate sustainable by bfq noticeably, even on a multicore CPU. In fact, the bfq functions that invoke blkg_*stats_* functions cannot be executed in parallel with the rest of the code of bfq, because both are executed under the same per-device scheduler lock.

To reduce this slowdown, this commit moves, wherever possible, the invocation of these functions (more precisely, of the bfq functions that invoke blkg_*stats_* functions) outside the critical sections protected by the scheduler lock.

With this change, and with all blkio.bfq.* statistics enabled, the throughput grows, e.g., from 250 to 310 KIOPS (+25%) on an Intel i7-4850HQ, in case of 8 threads doing random I/O in parallel on null_blk, with the latter configured with 0 latency. We obtained the same or higher throughput boosts, up to +30%, with other processors (some figures are reported in the documentation). For our tests, we used the script [1], with which our results can be easily reproduced.

NOTE. This commit still protects the invocation of blkg_*stats_* functions with the request_queue lock, because the group these functions are invoked on may otherwise disappear before or while these functions are executed. Fortunately, tests without even this lock show, by difference, that the serialization caused by this lock has a little impact (at most ~5% of throughput reduction).

[1]

Tested-by: Lee Tibbert <>
Tested-by: Oleksandr Natalenko <>
Signed-off-by: Paolo Valente <>
Signed-off-by: Luca Miccio <>
Signed-off-by: Jens Axboe <>
* block, bfq: add missing invocations of bfqg_stats_update_io_add/remove [Luca Miccio, 2017-11-14, files: 1, lines: -3/+18]

    bfqg_stats_update_io_add and bfqg_stats_update_io_remove are to be invoked, respectively, when an I/O request enters and when an I/O request exits the scheduler. Unfortunately, bfq does not fully comply with this scheme, because it does not invoke these functions for requests that are inserted into or extracted from its priority dispatch list. This commit fixes this mistake.

    Tested-by: Lee Tibbert <>
    Tested-by: Oleksandr Natalenko <>
    Signed-off-by: Paolo Valente <>
    Signed-off-by: Luca Miccio <>
    Signed-off-by: Jens Axboe <>
* block, bfq: fix unbalanced decrements of burst size [Paolo Valente, 2017-10-09, files: 1, lines: -2/+57]

    The commit "block, bfq: decrease burst size when queues in burst exit" introduced the decrement of burst_size on the removal of a bfq_queue from the burst list. Unfortunately, this decrement can happen to be performed even when burst size is already equal to 0, because of unbalanced decrements. A description follows of the cause of these unbalanced decrements, namely a wrong assumption, and of how this wrong assumption leads to unbalanced decrements.

    The wrong assumption is that a bfq_queue can exit only if the process associated with the bfq_queue has exited. This is false, because a bfq_queue, say Q, may exit also as a consequence of a merge with another bfq_queue. In this case, Q exits because the I/O of its associated process has been redirected to another bfq_queue.

    The decrement unbalance occurs because Q may then be re-created after a split, and added back to the current burst list, *without* incrementing burst_size. burst_size is not incremented because Q is not a new bfq_queue added to the burst list, but a bfq_queue only temporarily removed from the list, and, before the commit "bfq-sq, bfq-mq: decrease burst size when queues in burst exit", burst_size was not decremented when Q was removed.

    This commit addresses this issue by just checking whether the exiting bfq_queue is a merged bfq_queue, and, in that case, not decrementing burst_size. Unfortunately, this still leaves room for unbalanced decrements, in the following rarer case: on a split, the bfq_queue happens to be inserted into a different burst list from the one it was removed from when merged. If this happens, the number of elements in the new burst list becomes higher than burst_size (by one). When the bfq_queue then exits, it is of course not in a merged state any longer, thus burst_size is decremented, which results in an unbalanced decrement.

    To handle this sporadic, unlucky case in a simple way, this commit also checks that burst_size is larger than 0 before decrementing it.

    Finally, this commit removes a useless, extra check: the check that the bfq_queue is sync, performed before checking whether the bfq_queue is in the burst list. This extra check is redundant, because only sync bfq_queues can be inserted into the burst list.

    Fixes: 7cb04004fa37 ("block, bfq: decrease burst size when queues in burst exit")
    Reported-by: Philip Müller <>
    Signed-off-by: Paolo Valente <>
    Signed-off-by: Angelo Ruocco <>
    Tested-by: Philip Müller <>
    Tested-by: Oleksandr Natalenko <>
    Tested-by: Lee Tibbert <>
    Signed-off-by: Jens Axboe <>
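The two guards described in this commit message (skip the decrement for merged queues, and never decrement below zero) can be sketched as follows. This is a minimal user-space sketch, not the kernel code: the structs, the field names and burst_remove_on_exit() are hypothetical stand-ins introduced only for illustration.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the scheduler and queue state. */
struct sched_data {
    int burst_size;         /* number of queues counted in the current burst */
};

struct queue {
    bool in_burst_list;     /* queue is currently linked in the burst list */
    bool merged;            /* queue exits because of a merge, not a process exit */
};

/*
 * Remove q from the burst list when it exits.
 * Returns true only if burst_size was actually decremented.
 */
static bool burst_remove_on_exit(struct sched_data *sd, struct queue *q)
{
    if (!q->in_burst_list)
        return false;
    q->in_burst_list = false;

    /* Skip the decrement for merged queues, and never go below zero. */
    if (!q->merged && sd->burst_size > 0) {
        sd->burst_size--;
        return true;
    }
    return false;
}
```

The second condition is what absorbs the rarer case in which a queue re-enters a different burst list after a split: the counter may lag the list length by one, but it cannot become negative.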
* block,bfq: Disable writeback throttling [Luca Miccio, 2017-10-09, files: 1, lines: -1/+2]

    Similarly to CFQ, BFQ has its write-throttling heuristics, and it is better not to combine them with further write-throttling heuristics of a different nature. So this commit disables write-back throttling for a device if BFQ is used as I/O scheduler for that device.

    Signed-off-by: Luca Miccio <>
    Signed-off-by: Paolo Valente <>
    Tested-by: Oleksandr Natalenko <>
    Tested-by: Lee Tibbert <>
    Signed-off-by: Jens Axboe <>
* block, bfq: decrease burst size when queues in burst exit [Paolo Valente, 2017-10-03, files: 1, lines: -9/+3]

    If many queues belonging to the same group happen to be created shortly after each other, then the concurrent processes associated with these queues typically have a common goal, and they get it done as soon as possible if not hampered by device idling. Examples are processes spawned by git grep, or by systemd during boot. As for device idling, this mechanism is currently necessary for weight raising to succeed in its goal: privileging I/O. In view of these facts, BFQ does not provide the above queues with either weight raising or device idling.

    On the other hand, a burst of queue creations may be caused also by the start-up of a complex application. In this case, these queues usually need to be served one after the other, and as quickly as possible, to maximise responsiveness. Therefore, in this case the best strategy is to weight-raise all the queues created during the burst, i.e., the exact opposite of the strategy for the above case.

    To distinguish between the two cases, BFQ uses an empirical burst-size threshold, found through extensive tests and monitoring of daily usage. Only large bursts, i.e., bursts with a size above this threshold, are considered as generated by a high number of parallel processes. In this respect, upstart-based boot proved to be rather hard to detect as generating a large burst of queue creations, because with upstart most of the queues created in a burst exit *before* the next queues in the same burst are created. To address this issue, I changed the burst-detection mechanism so as to not decrease the size of the current burst even if one of the queues in the burst is eliminated.

    Unfortunately, this missing decrease causes false positives on very fast systems: on the start-up of a complex application, such as libreoffice writer, so many queues are created, served and exited shortly after each other, that a large burst of queue creations is wrongly detected as occurring. These false positives just disappear if the size of a burst is decreased when one of the queues in the burst exits. This commit restores the missing burst-size decrease, relying on the fact that upstart is apparently unlikely to be used on systems running this and future versions of the kernel.

    Signed-off-by: Paolo Valente <>
    Signed-off-by: Mauro Andreolini <>
    Signed-off-by: Angelo Ruocco <>
    Tested-by: Mirko Montanari <>
    Tested-by: Oleksandr Natalenko <>
    Tested-by: Lee Tibbert <>
    Signed-off-by: Jens Axboe <>
* block, bfq: let early-merged queues be weight-raised on split too [Paolo Valente, 2017-10-03, files: 1, lines: -5/+23]

    A just-created bfq_queue, say Q, may happen to be merged with another bfq_queue on the very first invocation of the function __bfq_insert_request. In such a case, even if Q would clearly deserve interactive weight raising (as it has just been created), the function bfq_add_request does not get invoked for Q, and thus weight raising is not activated for Q. As a consequence, when the state of Q is saved for a possible future restore, after a split of Q from the other bfq_queue(s), such a state happens to be (unjustly) non-weight-raised. Then the bfq_queue will not enjoy any weight raising on the split, even if it should still be in an interactive weight-raising period when the split occurs.

    This commit solves this problem as follows, for a just-created bfq_queue that is being early-merged: it stores directly, in the saved state of the bfq_queue, the weight-raising state that would have been assigned to the bfq_queue if not early-merged.

    Signed-off-by: Paolo Valente <>
    Tested-by: Angelo Ruocco <>
    Tested-by: Mirko Montanari <>
    Tested-by: Oleksandr Natalenko <>
    Tested-by: Lee Tibbert <>
    Signed-off-by: Jens Axboe <>
* block, bfq: check and switch back to interactive wr also on queue split [Paolo Valente, 2017-10-03, files: 1, lines: -38/+49]

    As already explained in the message of commit "block, bfq: fix wrong init of saved start time for weight raising", if a soft real-time weight-raising period happens to be nested in a larger interactive weight-raising period, then BFQ restores the interactive weight raising at the end of the soft real-time weight raising. In particular, BFQ checks whether the latter has ended only on request dispatches.

    Unfortunately, the above scheme fails to restore interactive weight raising in the following corner case: if a bfq_queue, say Q,

    1) Is merged with another bfq_queue while it is in a nested soft real-time weight-raising period. The weight-raising state of Q is then saved, and not considered any longer until a split occurs.

    2) Is split from the other bfq_queue(s) at a time instant when its soft real-time weight raising is already finished.

    On the split, while resuming the previous, soft real-time weight-raised state of the bfq_queue Q, BFQ checks whether the current soft real-time weight-raising period is actually over. If so, BFQ switches weight raising off for Q, *without* checking whether the soft real-time period was actually nested in a not-yet-finished interactive weight-raising period.

    This commit addresses this issue by adding the above missing check in bfq_queue splits, and restoring interactive weight raising if needed.

    Signed-off-by: Paolo Valente <>
    Tested-by: Angelo Ruocco <>
    Tested-by: Mirko Montanari <>
    Tested-by: Oleksandr Natalenko <>
    Tested-by: Lee Tibbert <>
    Signed-off-by: Jens Axboe <>
* block, bfq: fix wrong init of saved start time for weight raising [Paolo Valente, 2017-10-03, files: 1, lines: -19/+31]

    This commit fixes a bug that causes bfq to fail to guarantee a high responsiveness on some drives, if there is heavy random read+write I/O in the background. More precisely, such a failure allowed this bug to be found [1], but the bug may well cause other yet unreported anomalies.

    BFQ raises the weight of the bfq_queues associated with soft real-time applications, to privilege the I/O, and thus reduce latency, for these applications. This mechanism is named soft-real-time weight raising in BFQ. A soft real-time period may happen to be nested into an interactive weight raising period, i.e., it may happen that, when a bfq_queue switches to a soft real-time weight-raised state, the bfq_queue is already being weight-raised because deemed interactive too. In this case, BFQ saves in a special variable wr_start_at_switch_to_srt, the time instant when the interactive weight-raising period started for the bfq_queue, i.e., the time instant when BFQ started to deem the bfq_queue interactive. This value is then used to check whether the interactive weight-raising period would still be in progress when the soft real-time weight-raising period ends. If so, interactive weight raising is restored for the bfq_queue. This restore is useful, in particular, because it prevents bfq_queues from losing their interactive weight raising prematurely, as a consequence of spurious, short-lived soft real-time weight-raising periods caused by wrong detections as soft real-time.

    If, instead, a bfq_queue switches to soft-real-time weight raising while it *is not* already in an interactive weight-raising period, then the variable wr_start_at_switch_to_srt has no meaning during the following soft real-time weight-raising period. Unfortunately the handling of this case is wrong in BFQ: not only is the variable not flagged somehow as meaningless, but it is also set to the time when the switch to soft real-time weight-raising occurs. This may cause an interactive weight-raising period to be considered mistakenly as still in progress, and thus a spurious interactive weight-raising period to start for the bfq_queue, at the end of the soft-real-time weight-raising period. In particular the spurious interactive weight-raising period will be considered as still in progress, if the soft-real-time weight-raising period does not last very long. The bfq_queue will then be wrongly privileged and, if I/O bound, will unjustly steal bandwidth from truly interactive or soft real-time bfq_queues, harming responsiveness and low latency.

    This commit fixes this issue by just setting wr_start_at_switch_to_srt to minus infinity (farthest past time instant according to jiffies macros): when the soft-real-time weight-raising period ends, certainly no interactive weight-raising period will be considered as still in progress.

    [1] Background I/O Type: Random - Background I/O mix: Reads and writes - Application to start: LibreOffice Writer in

    Signed-off-by: Paolo Valente <>
    Signed-off-by: Angelo Ruocco <>
    Tested-by: Oleksandr Natalenko <>
    Tested-by: Lee Tibbert <>
    Tested-by: Mirko Montanari <>
    Signed-off-by: Jens Axboe <>
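The effect of initialising the saved start time to "minus infinity" can be illustrated with wraparound-safe tick comparisons. The sketch below is a user-space approximation under stated assumptions: tick_before(), farthest_past() and interactive_in_progress() are hypothetical helpers, and the offset of roughly half the signed range is an assumption about what "farthest past time instant" means for a free-running counter.

```c
#include <limits.h>
#include <stdbool.h>

/* Wraparound-safe "a is before b" for a free-running tick counter. */
static bool tick_before(unsigned long a, unsigned long b)
{
    return (long)(a - b) < 0;
}

/*
 * Hypothetical "minus infinity": an instant far enough in the past
 * that adding any realistic duration still yields a past instant.
 */
static unsigned long farthest_past(unsigned long now)
{
    return now - ((LONG_MAX >> 1) - 1);
}

/*
 * True if an interactive period that started at wr_start and lasts
 * dur ticks would still be running at time now.
 */
static bool interactive_in_progress(unsigned long now,
                                    unsigned long wr_start,
                                    unsigned long dur)
{
    return tick_before(now, wr_start + dur);
}
```

With the start time set to the switch instant, a short soft real-time period ends while the check still reports an interactive period in progress; with farthest_past(), the same check reports that no such period exists.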
* Merge branch 'for-4.14/block-postmerge' of git:// [Torvalds, 2017-09-09, files: 1, lines: -26/+49]
|\
| |   Pull followup block layer updates from Jens Axboe:
| |
| |   "I ended up splitting the main pull request for this series into two, mainly because of clashes between NVMe fixes that went into 4.13 after the for-4.14 branches were split off. This pull request is mostly NVMe, but not exclusively. In detail, it contains:
| |
| |    - Two pull requests for NVMe changes from Christoph. Nothing new on the feature front, basically just fixes all over the map for the core bits, transport, rdma, etc.
| |    - Series from Bart, cleaning up various bits in the BFQ scheduler.
| |    - Series of bcache fixes, which has been lingering for a release or two. Coly sent this in, but patches from various people in this area.
| |    - Set of patches for BFQ from Paolo himself, updating both documentation and fixing some corner cases in performance.
| |    - Series from Omar, attempting to now get the 4k loop support correct. Our confidence level is higher this time.
| |    - Series from Shaohua for loop as well, improving O_DIRECT performance and fixing a use-after-free"
| |
| |   * 'for-4.14/block-postmerge' of git:// (74 commits)
| |     bcache: initialize dirty stripes in flash_dev_run()
| |     loop: set physical block size to logical block size
| |     bcache: fix bch_hprint crash and improve output
| |     bcache: Update continue_at() documentation
| |     bcache: silence static checker warning
| |     bcache: fix for gc and write-back race
| |     bcache: increase the number of open buckets
| |     bcache: Correct return value for sysfs attach errors
| |     bcache: correct cache_dirty_target in __update_writeback_rate()
| |     bcache: gc does not work when triggering by manual command
| |     bcache: Don't reinvent the wheel but use existing llist API
| |     bcache: do not subtract sectors_to_gc for bypassed IO
| |     bcache: fix sequential large write IO bypass
| |     bcache: Fix leak of bdev reference
| |     block/loop: remove unused field
| |     block/loop: fix use after free
| |     bfq: Use icq_to_bic() consistently
| |     bfq: Suppress compiler warnings about comparisons
| |     bfq: Check kstrtoul() return value
| |     bfq: Declare local functions static
| |     ...
| * bfq: Use icq_to_bic() consistently [Bart Van Assche, 2017-09-01, files: 1, lines: -1/+1]

    Some code uses icq_to_bic() to convert an io_cq pointer to a bfq_io_cq pointer while other code uses a direct cast. Convert the code that uses a direct cast such that it uses icq_to_bic().

    Acked-by: Paolo Valente <>
    Signed-off-by: Bart Van Assche <>
    Signed-off-by: Jens Axboe <>
| * bfq: Suppress compiler warnings about comparisons [Bart Van Assche, 2017-09-01, files: 1, lines: -10/+10]

    This patch prevents the following warnings from being reported when building with W=1:

    block/bfq-iosched.c: In function 'bfq_back_seek_max_store':
    block/bfq-iosched.c:4860:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
       if (__data < (MIN)) \
                  ^
    block/bfq-iosched.c:4876:1: note: in expansion of macro 'STORE_FUNCTION'
     STORE_FUNCTION(bfq_back_seek_max_store, &bfqd->bfq_back_max, 0, INT_MAX, 0);
     ^~~~~~~~~~~~~~
    block/bfq-iosched.c: In function 'bfq_slice_idle_store':
    block/bfq-iosched.c:4860:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
       if (__data < (MIN)) \
                  ^
    block/bfq-iosched.c:4879:1: note: in expansion of macro 'STORE_FUNCTION'
     STORE_FUNCTION(bfq_slice_idle_store, &bfqd->bfq_slice_idle, 0, INT_MAX, 2);
     ^~~~~~~~~~~~~~
    block/bfq-iosched.c: In function 'bfq_slice_idle_us_store':
    block/bfq-iosched.c:4892:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
       if (__data < (MIN)) \
                  ^
    block/bfq-iosched.c:4899:1: note: in expansion of macro 'USEC_STORE_FUNCTION'
     USEC_STORE_FUNCTION(bfq_slice_idle_us_store, &bfqd->bfq_slice_idle, 0,
     ^~~~~~~~~~~~~~~~~~~

    Acked-by: Paolo Valente <>
    Signed-off-by: Bart Van Assche <>
    Signed-off-by: Jens Axboe <>
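One common way to avoid this class of warning, which is not necessarily the one this patch uses, is to compare against run-time values rather than against a literal lower bound of 0. A minimal sketch, with clamp_ulong() as a hypothetical helper:

```c
/*
 * Clamp val to [min, max]. The bounds are ordinary variables here, so
 * the compiler never sees a literal "unsigned < 0" comparison of the
 * kind that -Wtype-limits reports.
 */
static unsigned long clamp_ulong(unsigned long val,
                                 unsigned long min,
                                 unsigned long max)
{
    if (val < min)
        return min;
    if (val > max)
        return max;
    return val;
}
```

Calling clamp_ulong(value, 0, INT_MAX) performs the same range check as the macro body quoted in the warnings, without spelling out the always-false comparison at the call site.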
| * bfq: Check kstrtoul() return value [Bart Van Assche, 2017-09-01, files: 1, lines: -15/+37]

    Make sysfs writes fail for invalid numbers instead of storing uninitialized data copied from the stack. This patch removes all uninitialized_var() occurrences from the BFQ source code.

    Acked-by: Paolo Valente <>
    Signed-off-by: Bart Van Assche <>
    Signed-off-by: Jens Axboe <>
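The pattern behind this fix (propagate the parse error and leave the stored value untouched) can be sketched in user space with strtoul(). This is an analogue only: parse_ulong() and attr_store() are hypothetical names, and strtoul() stands in for the kernel's kstrtoul(), whose exact behaviour differs.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Parse a base-10 unsigned long. Returns 0 on success, or a negative
 * errno value on failure, in which case *out is left untouched.
 * A single trailing newline is accepted, as is typical for sysfs writes.
 */
static int parse_ulong(const char *s, unsigned long *out)
{
    char *end;
    unsigned long v;

    if (*s < '0' || *s > '9')
        return -EINVAL;
    errno = 0;
    v = strtoul(s, &end, 10);
    if (errno == ERANGE)
        return -ERANGE;
    if (*end == '\n')
        end++;
    if (*end != '\0')
        return -EINVAL;
    *out = v;
    return 0;
}

/* Hypothetical attribute store: fail the write instead of storing garbage. */
static int attr_store(unsigned long *attr, const char *buf)
{
    unsigned long v;
    int ret = parse_ulong(buf, &v);

    if (ret)
        return ret;     /* v was never initialised, so do not use it */
    *attr = v;
    return 0;
}
```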
| * bfq: Annotate fall-through in a switch statement [Bart Van Assche, 2017-09-01, files: 1, lines: -0/+1]

    This patch prevents gcc 7 from issuing a warning about fall-through when building with W=1.

    Acked-by: Paolo Valente <>
    Signed-off-by: Bart Van Assche <>
    Signed-off-by: Jens Axboe <>
| * block, bfq: make lookup_next_entity push up vtime on expirations [Paolo Valente, 2017-08-31, files: 1, lines: -2/+2]

    To provide a very smooth service, bfq starts to serve a bfq_queue only if the queue is 'eligible', i.e., if the same queue would have started to be served in the ideal, perfectly fair system that bfq simulates internally. This is obtained by associating each queue with a virtual start time, and by computing a special system virtual time quantity: a queue is eligible only if the system virtual time has reached the virtual start time of the queue. Finally, bfq guarantees that, when a new queue must be set in service, there is always at least one eligible entity for each active parent entity in the scheduler. To provide this guarantee, the function __bfq_lookup_next_entity pushes up, for each parent entity on which it is invoked, the system virtual time to the minimum among the virtual start times of the entities in the active tree for the parent entity (more precisely, the push up occurs if the system virtual time happens to be lower than all such virtual start times).

    There is however a circumstance in which __bfq_lookup_next_entity cannot push up the system virtual time for a parent entity, even if the system virtual time is lower than the virtual start times of all the child entities in the active tree. It happens if one of the child entities is in service. In fact, in such a case, there is already an eligible entity, the in-service one, even if it may not be present in the active tree (because in-service entities may be removed from the active tree).

    Unfortunately, in the last re-design of the hierarchical-scheduling engine, the reset of the pointer to the in-service entity for a given parent entity--reset to be done as a consequence of the expiration of the in-service entity--always happens after the function __bfq_lookup_next_entity has been invoked. This causes the function to think that there is still an entity in service for the parent entity, and then that the system virtual time cannot be pushed up, even if actually such a no-more-in-service entity has already been properly reinserted into the active tree (or in some other tree if no longer active). Yet, the system virtual time *had* to be pushed up, to be ready to correctly choose the next queue to serve. Because of the lack of this push up, bfq may wrongly set in service a queue that had been speculatively pre-computed as the possible next-in-service queue, but that would no longer be the one to serve after the expiration and the reinsertion into the active trees of the previously in-service entities.

    This commit addresses this issue by making __bfq_lookup_next_entity properly push up the system virtual time if an expiration is occurring.

    Signed-off-by: Paolo Valente <>
    Tested-by: Lee Tibbert <>
    Tested-by: Oleksandr Natalenko <>
    Signed-off-by: Jens Axboe <>
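The push-up rule and the expiration exception described in this message can be sketched as follows. This is a simplified model, not the kernel function: the struct, its fields and maybe_push_up_vtime() are hypothetical, and the active tree is reduced to its minimum virtual start time.

```c
#include <stdbool.h>

/* Hypothetical, simplified service-tree state for one parent entity. */
struct service_tree {
    unsigned long long vtime;       /* system virtual time */
    unsigned long long min_start;   /* smallest virtual start time in the active tree */
};

/*
 * Push vtime up to min_start unless an entity is genuinely still in
 * service. During an expiration the in-service pointer is stale, so
 * it must not block the push-up.
 */
static void maybe_push_up_vtime(struct service_tree *st,
                                bool has_in_service,
                                bool expiration)
{
    bool really_in_service = has_in_service && !expiration;

    if (!really_in_service && st->vtime < st->min_start)
        st->vtime = st->min_start;
}
```

After the push-up, at least one entity in the active tree has a virtual start time not above the system virtual time, which is the eligibility guarantee the message refers to.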
* | bfq: Re-enable auto-loading when built as a module [Ben Hutchings, 2017-08-29, files: 1, lines: -0/+1]
|/
    The block core requests modules with the "-iosched" name suffix, but bfq no longer has that suffix. Add an alias.

    Fixes: ea25da48086d ("block, bfq: split bfq-iosched.c into multiple ...")
    Reviewed-by: Ming Lei <>
    Signed-off-by: Ben Hutchings <>
    Signed-off-by: Jens Axboe <>
* block, scheduler: convert xxx_var_store to void [weiping zhang, 2017-08-28, files: 1, lines: -16/+17]

    The last parameter "count" is never used in xxx_var_store, so convert these functions to void.

    Signed-off-by: weiping zhang <>
    Signed-off-by: Jens Axboe <>
* block, bfq: fix error handle in bfq_init [weiping zhang, 2017-08-23, files: 1, lines: -1/+3]

    If elv_register fails, bfq_pool should be freed.

    Signed-off-by: weiping zhang <>
    Signed-off-by: Jens Axboe <>
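The cleanup-on-failure pattern this fix applies can be sketched with a goto-based unwind. Everything here is a hypothetical user-space stand-in: pool plays the role of bfq_pool, register_sched() the role of the registration call, and malloc()/free() replace the kernel allocator.

```c
#include <stdlib.h>

static void *pool;  /* hypothetical stand-in for bfq_pool */

/* Hypothetical registration step; fails when should_fail is non-zero. */
static int register_sched(int should_fail)
{
    return should_fail ? -1 : 0;
}

/* Initialise in order, and undo completed steps in reverse on failure. */
static int sched_init(int should_fail)
{
    int ret;

    pool = malloc(64);
    if (!pool)
        return -1;

    ret = register_sched(should_fail);
    if (ret)
        goto err_free_pool;

    return 0;

err_free_pool:
    /* Registration failed: release what was allocated before it. */
    free(pool);
    pool = NULL;
    return ret;
}
```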
* block, bfq: boost throughput with flash-based non-queueing devices [Paolo Valente, 2017-08-11, files: 1, lines: -10/+19]

    When a queue associated with a process remains empty, there are cases where throughput gets boosted if the device is idled to await the arrival of a new I/O request for that queue. Currently, BFQ assumes that one of these cases is when the device has no internal queueing (regardless of the properties of the I/O being served). Unfortunately, this condition has proved to be too general. So, this commit refines it as "the device has no internal queueing and is rotational".

    This refinement provides a significant throughput boost with random I/O, on flash-based storage without internal queueing. For example, on a HiKey board, throughput increases by up to 125%, growing, e.g., from 6.9MB/s to 15.6MB/s with two or three random readers in parallel.

    Signed-off-by: Paolo Valente <>
    Signed-off-by: Luca Miccio <>
    Signed-off-by: Jens Axboe <>
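The refinement quoted above changes only one term of the idling decision. A minimal sketch of that term before and after, using a hypothetical device_props struct and hypothetical function names (the full decision in BFQ involves further conditions not shown here):

```c
#include <stdbool.h>

/* Hypothetical device properties relevant to this one condition. */
struct device_props {
    bool rotational;                /* HDD-like device */
    bool has_internal_queueing;     /* device queues requests internally */
};

/* Condition before the refinement: no internal queueing is enough. */
static bool idling_case_old(const struct device_props *d)
{
    return !d->has_internal_queueing;
}

/* Refined condition: no internal queueing and rotational. */
static bool idling_case_new(const struct device_props *d)
{
    return d->rotational && !d->has_internal_queueing;
}
```

For flash storage without internal queueing, the old condition selects idling and the refined one does not, which is the case where the message reports the throughput gain.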
* block,bfq: refactor device-idling logic [Paolo Valente, 2017-08-11, files: 1, lines: -56/+61]

    The logic that decides whether to idle the device is scattered across three functions. Almost all of the logic is in the function bfq_bfqq_may_idle, but (1) part of the decision is made in bfq_update_idle_window, and (2) the function bfq_bfqq_must_idle may switch off idling regardless of the output of bfq_bfqq_may_idle. In addition, both bfq_update_idle_window and bfq_bfqq_must_idle make their decisions as a function of parameters that are used, for similar purposes, also in bfq_bfqq_may_idle. This commit addresses these issues by moving all the logic into bfq_bfqq_may_idle.

    Signed-off-by: Paolo Valente <>
    Signed-off-by: Jens Axboe <>
* bfq: dispatch request to prevent queue stalling after the request completion [Hou Tao, 2017-07-12, files: 1, lines: -0/+3]

    There are mq devices (e.g., virtio-blk, nbd and loopback) which don't invoke blk_mq_run_hw_queues() after the completion of a request. If bfq is enabled on these devices and the slice_idle attribute or strict_guarantees attribute is set as zero, it is possible that after a request completion the remaining requests of a busy bfq queue will be stalled in the bfq scheduler until a new request arrives.

    To fix the scheduler latency problem, we need to check whether or not all issued requests have completed and dispatch more requests to the driver if there is no request in the driver.

    The problem can be reproduced by running the following script on a virtio-blk device with nr_hw_queues as 1:

    #!/bin/sh

    dev=vdb
    # mount point for dev
    mp=/tmp/mnt
    cd $mp

    job=strict.job
    cat <<EOF > $job
    [global]
    direct=1
    bs=4k
    size=256M
    rw=write
    ioengine=libaio
    iodepth=128
    runtime=5
    time_based

    [1]

    [2]
    new_group
    EOF

    echo bfq > /sys/block/$dev/queue/scheduler
    echo 1 > /sys/block/$dev/queue/iosched/strict_guarantees
    fio $job

    Signed-off-by: Hou Tao <>
    Reviewed-by: Paolo Valente <>
    Signed-off-by: Jens Axboe <>
* block, bfq: don't change ioprio class for a bfq_queue on a service tree [Paolo Valente, 2017-07-03, files: 1, lines: -4/+10]

    On each deactivation or re-scheduling (after being served) of a bfq_queue, BFQ invokes the function __bfq_entity_update_weight_prio(), to perform pending updates of ioprio, weight and ioprio class for the bfq_queue. BFQ also invokes this function on I/O-request dispatches, to raise or lower weights more quickly when needed, thereby improving latency. However, the entity representing the bfq_queue may be on the active (sub)tree of a service tree when this happens, and, although with a very low probability, the bfq_queue may happen to also have a pending change of its ioprio class. If both conditions hold when __bfq_entity_update_weight_prio() is invoked, then the entity moves to a sort of hybrid state: the new service tree for the entity, as returned by bfq_entity_service_tree(), differs from the service tree on which the entity still is. The functions that handle activations and deactivations of entities do not cope with such a hybrid state (and would need to become more complex to cope).

    This commit addresses this issue by just making __bfq_entity_update_weight_prio() not perform also a possible pending change of ioprio class, when invoked on an I/O-request dispatch for a bfq_queue. Such a change is thus postponed to when __bfq_entity_update_weight_prio() is invoked on deactivation or re-scheduling of the bfq_queue.

    Reported-by: Marco Piazza <>
    Reported-by: Laurentiu Nicola <>
    Signed-off-by: Paolo Valente <>
    Tested-by: Marco Piazza <>
    Signed-off-by: Jens Axboe <>
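The postponement described in this message can be sketched with a flag that tells the update routine whether it may also apply a pending class change. The struct, its fields and update_weight_prio() are hypothetical simplifications; in particular, keeping the pending marker set until the class has caught up is an assumption about how the deferred change gets picked up later.

```c
#include <stdbool.h>

/* Hypothetical, simplified per-queue scheduling entity. */
struct entity {
    int weight, new_weight;
    int ioprio_class, new_ioprio_class;
    bool prio_changed;      /* some update is pending */
};

/*
 * Apply pending updates. On a dispatch, the caller passes
 * update_class_too = false, so the entity never ends up belonging to
 * a different service tree than the one it is currently queued on.
 */
static void update_weight_prio(struct entity *e, bool update_class_too)
{
    if (!e->prio_changed)
        return;

    e->weight = e->new_weight;
    if (update_class_too)
        e->ioprio_class = e->new_ioprio_class;

    /* Clear the marker only once no class change is left pending. */
    if (e->ioprio_class == e->new_ioprio_class)
        e->prio_changed = false;
}
```

A dispatch-time call thus updates the weight immediately, while the class change waits for the next deactivation or re-scheduling call with update_class_too set.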