summaryrefslogtreecommitdiffstats
path: root/block/blk-wbt.c
Commit message (Collapse)AuthorAgeFilesLines
* block: get rid of struct blk_issue_statOmar Sandoval2018-05-091-7/+5
| | | | | | | | | | | | | | | struct blk_issue_stat squashes three things into one u64: - The time the driver started working on a request - The original size of the request (for the io.low controller) - Flags for writeback throttling It turns out that on x86_64, we have a 4 byte hole in struct request which we can fill with the non-timestamp fields from blk_issue_stat, simplifying things quite a bit. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: pass struct request instead of struct blk_issue_stat to wbtOmar Sandoval2018-05-091-27/+26
| | | | | | | | | | issue_stat is going to go away, so first make writeback throttling take the containing request, update the internal wbt helpers accordingly, and change rwb->sync_cookie to be the request pointer instead of the issue_stat pointer. No functional change. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: move some wbt helpers to blk-wbt.cOmar Sandoval2018-05-091-0/+20
| | | | | | | | | A few helpers are only used from blk-wbt.c, so move them there, and put wbt_track() behind the CONFIG_BLK_WBT typedef. This is in preparation for changing how the wbt flags are tracked. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-wbt: throttle discards like background writesJens Axboe2018-05-081-17/+26
| | | | | | | | | | Throttle discards like we would any background write. Discards should be background activity, so if they are impacting foreground IO, then we will throttle them down. Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-wbt: pass in enum wbt_flags to get_rq_wait()Jens Axboe2018-05-081-10/+15
| | | | | | | | | | This is in preparation for having more write queues, in which case we would have needed to pass in more information than just a simple 'is_kswapd' boolean. Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-wbt: account any writing command as a writeJens Axboe2018-05-081-1/+1
| | | | | | | | | | | We currently special case WRITE and FLUSH, but we should really just include any command with the write bit set. This ensures that we account DISCARD. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-wbt: account flush requests correctlyJens Axboe2018-02-061-1/+9
| | | | | | | | | | | | | | | | | | Mikulas reported a workload that saw bad performance, and figured out what it was due to various other types of requests being accounted as reads. Flush requests, for instance. Due to the high latency of those, we heavily throttle the writes to keep the latencies in balance. But they really should be accounted as writes. Fix this by checking the exact type of the request. If it's a read, account as a read, if it's a write or a flush, account as a write. Any other request we disregard. Previously everything would have been mistakenly accounted as reads. Reported-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-wbt: fix comments typoweiping zhang2017-11-231-1/+1
| | | | | Signed-off-by: weiping zhang <zhangweiping@didichuxing.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-wbt: move wbt_clear_stat to common place in wbt_doneweiping zhang2017-11-231-2/+1
| | | | | | | | wbt_done call wbt_clear_stat no matter current stat was tracked or not, move it to common place. Signed-off-by: weiping zhang <zhangweiping@didichuxing.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-wbt: remove duplicated setting in wbt_initweiping zhang2017-11-231-2/+0
| | | | | | | | rwb->wc and rwb->queue_depth were overwritten by wbt_set_write_cache and wbt_set_queue_depth, remove the default setting. Signed-off-by: weiping zhang <zhangweiping@didichuxing.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge branch 'for-4.15/block' of git://git.kernel.dk/linux-blockLinus Torvalds2017-11-141-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull core block layer updates from Jens Axboe: "This is the main pull request for block storage for 4.15-rc1. Nothing out of the ordinary in here, and no API changes or anything like that. Just various new features for drivers, core changes, etc. In particular, this pull request contains: - A patch series from Bart, closing the whole on blk/scsi-mq queue quescing. - A series from Christoph, building towards hidden gendisks (for multipath) and ability to move bio chains around. - NVMe - Support for native multipath for NVMe (Christoph). - Userspace notifications for AENs (Keith). - Command side-effects support (Keith). - SGL support (Chaitanya Kulkarni) - FC fixes and improvements (James Smart) - Lots of fixes and tweaks (Various) - bcache - New maintainer (Michael Lyle) - Writeback control improvements (Michael) - Various fixes (Coly, Elena, Eric, Liang, et al) - lightnvm updates, mostly centered around the pblk interface (Javier, Hans, and Rakesh). - Removal of unused bio/bvec kmap atomic interfaces (me, Christoph) - Writeback series that fix the much discussed hundreds of millions of sync-all units. This goes all the way, as discussed previously (me). - Fix for missing wakeup on writeback timer adjustments (Yafang Shao). - Fix laptop mode on blk-mq (me). - {mq,name} tupple lookup for IO schedulers, allowing us to have alias names. This means you can use 'deadline' on both !mq and on mq (where it's called mq-deadline). (me). - blktrace race fix, oopsing on sg load (me). - blk-mq optimizations (me). - Obscure waitqueue race fix for kyber (Omar). - NBD fixes (Josef). - Disable writeback throttling by default on bfq, like we do on cfq (Luca Miccio). - Series from Ming that enable us to treat flush requests on blk-mq like any other request. This is a really nice cleanup. - Series from Ming that improves merging on blk-mq with schedulers, getting us closer to flipping the switch on scsi-mq again. - BFQ updates (Paolo). - blk-mq atomic flags memory ordering fixes (Peter Z). - Loop cgroup support (Shaohua). - Lots of minor fixes from lots of different folks, both for core and driver code" * 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits) nvme: fix visibility of "uuid" ns attribute blk-mq: fixup some comment typos and lengths ide: ide-atapi: fix compile error with defining macro DEBUG blk-mq: improve tag waiting setup for non-shared tags brd: remove unused brd_mutex blk-mq: only run the hardware queue if IO is pending block: avoid null pointer dereference on null disk fs: guard_bio_eod() needs to consider partitions xtensa/simdisk: fix compile error nvme: expose subsys attribute to sysfs nvme: create 'slaves' and 'holders' entries for hidden controllers block: create 'slaves' and 'holders' entries for hidden gendisks nvme: also expose the namespace identification sysfs files for mpath nodes nvme: implement multipath access to nvme subsystems nvme: track shared namespaces nvme: introduce a nvme_ns_ids structure nvme: track subsystems block, nvme: Introduce blk_mq_req_flags_t block, scsi: Make SCSI quiesce and resume work reliably block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag ...
| * block,bfq: Disable writeback throttlingLuca Miccio2017-10-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Similarly to CFQ, BFQ has its write-throttling heuristics, and it is better not to combine them with further write-throttling heuristics of a different nature. So this commit disables write-back throttling for a device if BFQ is used as I/O scheduler for that device. Signed-off-by: Luca Miccio <lucmiccio@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Lee Tibbert <lee.tibbert@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns ↵Mark Rutland2017-10-251-1/+1
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | to READ_ONCE()/WRITE_ONCE() Please do not apply this to mainline directly, instead please re-run the coccinelle script shown below and apply its output. For several reasons, it is desirable to use {READ,WRITE}_ONCE() in preference to ACCESS_ONCE(), and new code is expected to use one of the former. So far, there's been no reason to change most existing uses of ACCESS_ONCE(), as these aren't harmful, and changing them results in churn. However, for some features, the read/write distinction is critical to correct operation. To distinguish these cases, separate read/write accessors must be used. This patch migrates (most) remaining ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following coccinelle script: ---- // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and // WRITE_ONCE() // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch virtual patch @ depends on patch @ expression E1, E2; @@ - ACCESS_ONCE(E1) = E2 + WRITE_ONCE(E1, E2) @ depends on patch @ expression E; @@ - ACCESS_ONCE(E) + READ_ONCE(E) ---- Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: davem@davemloft.net Cc: linux-arch@vger.kernel.org Cc: mpe@ellerman.id.au Cc: shuah@kernel.org Cc: snitzer@redhat.com Cc: thor.thayer@linux.intel.com Cc: tj@kernel.org Cc: viro@zeniv.linux.org.uk Cc: will.deacon@arm.com Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list namingIngo Molnar2017-06-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | So I've noticed a number of instances where it was not obvious from the code whether ->task_list was for a wait-queue head or a wait-queue entry. Furthermore, there's a number of wait-queue users where the lists are not for 'tasks' but other entities (poll tables, etc.), in which case the 'task_list' name is actively confusing. To clear this all up, name the wait-queue head and entry list structure fields unambiguously: struct wait_queue_head::task_list => ::head struct wait_queue_entry::task_list => ::entry For example, this code: rqw->wait.task_list.next != &wait->task_list ... is was pretty unclear (to me) what it's doing, while now it's written this way: rqw->wait.head.next != &wait->entry ... which makes it pretty clear that we are iterating a list until we see the head. Other examples are: list_for_each_entry_safe(pos, next, &x->task_list, task_list) { list_for_each_entry(wq, &fence->wait.task_list, task_list) { ... where it's unclear (to me) what we are iterating, and during review it's hard to tell whether it's trying to walk a wait-queue entry (which would be a bug), while now it's written as: list_for_each_entry_safe(pos, next, &x->head, entry) { list_for_each_entry(wq, &fence->wait.head, entry) { Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/wait: Rename wait_queue_t => wait_queue_entry_tIngo Molnar2017-06-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | Rename: wait_queue_t => wait_queue_entry_t 'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue", but in reality it's a queue *entry*. The 'real' queue is the wait queue head, which had to carry the name. Start sorting this out by renaming it to 'wait_queue_entry_t'. This also allows the real structure name 'struct __wait_queue' to lose its double underscore and become 'struct wait_queue_entry', which is the more canonical nomenclature for such data types. Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* blk-stat: kill blk_stat_rq_ddir()Jens Axboe2017-04-211-1/+6
| | | | | | | | No point in providing and exporting this helper. There's just one (real) user of it, just use rq_data_dir(). Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: Make writeback throttling defaults consistent for SQ devicesJan Kara2017-04-191-0/+19
| | | | | | | | | | | | When CFQ is used as an elevator, it disables writeback throttling because they don't play well together. Later when a different elevator is chosen for the device, writeback throttling doesn't get enabled again as it should. Make sure CFQ enables writeback throttling (if it should be enabled by default) when we switch from it to another IO scheduler. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: Fix list corruption of blk stats callback listJan Kara2017-04-111-8/+4
| | | | | | | | | | | | | | | When CFQ calls wbt_disable_default(), it will call blk_stat_remove_callback() to stop gathering IO statistics for the purposes of writeback throttling. Later, when request_queue is unregistered, wbt_exit() will call blk_stat_remove_callback() again which will try to delete callback from the list again and possibly cause list corruption. Fix the problem by making wbt_disable_default() called wbt_exit() which is properly guarded against being called multiple times. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-stat: convert to callback-based statistics reportingOmar Sandoval2017-03-211-32/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, statistics are gathered in ~0.13s windows, and users grab the statistics whenever they need them. This is not ideal for both in-tree users: 1. Writeback throttling wants its own dynamically sized window of statistics. Since the blk-stats statistics are reset after every window and the wbt windows don't line up with the blk-stats windows, wbt doesn't see every I/O. 2. Polling currently grabs the statistics on every I/O. Again, depending on how the window lines up, we may miss some I/Os. It's also unnecessary overhead to get the statistics on every I/O; the hybrid polling heuristic would be just as happy with the statistics from the previous full window. This reworks the blk-stats infrastructure to be callback-based: users register a callback that they want called at a given time with all of the statistics from the window during which the callback was active. Users can dynamically bucketize the statistics. wbt and polling both currently use read vs. write, but polling can be extended to further subdivide based on request size. The callbacks are kept on an RCU list, and each callback has percpu stats buffers. There will only be a few users, so the overhead on the I/O completion side is low. The stats flushing is also simplified considerably: since the timer function is responsible for clearing the statistics, we don't have to worry about stale statistics. wbt is a trivial conversion. After the conversion, the windowing problem mentioned above is fixed. For polling, we register an extra callback that caches the previous window's statistics in the struct request_queue for the hybrid polling heuristic to use. Since we no longer have a single stats buffer for the request queue, this also removes the sysfs and debugfs stats entries. To replace those, we add a debugfs entry for the poll statistics. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-stat: use READ and WRITE instead of BLK_STAT_{READ,WRITE}Omar Sandoval2017-03-211-6/+6
| | | | | | | | | The stats buckets will become generic soon, so make the existing users use the common READ and WRITE definitions instead of one internal to blk-stat. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: Use pointer to backing_dev_info from request_queueJan Kara2017-02-021-4/+4
| | | | | | | | | | | We will want to have struct backing_dev_info allocated separately from struct request_queue. As the first step add pointer to backing_dev_info to request_queue and convert all users touching it. No functional changes in this patch. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: Avoid that sparse complains about context imbalance in __wbt_wait()Bart Van Assche2017-01-021-5/+6
| | | | | | | | This patch does not change any functionality. Fixes: e34cbd307477 ("blk-wbt: add general throttling mechanism") Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: Make wbt_wait() definition consistent with declarationBart Van Assche2017-01-021-1/+1
| | | | | | Fixes: e34cbd307477 ("blk-wbt: add general throttling mechanism") Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-wbt: don't throttle discard or write zeroesChristoph Hellwig2016-12-091-3/+2
| | | | | | | | Both of these are metadata only commands that are not issued by the writeback code and not directly relevant to the writeback bandwith. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: add support for REQ_OP_WRITE_ZEROESChaitanya Kulkarni2016-12-011-2/+3
| | | | | | | | | | | | | | | This adds a new block layer operation to zero out a range of LBAs. This allows to implement zeroing for devices that don't use either discard with a predictable zero pattern or WRITE SAME of zeroes. The prominent example of that is NVMe with the Write Zeroes command, but in the future, this should also help with improving the way zeroing discards work. For this operation, suitable entry is exported in sysfs which indicate the number of maximum bytes allowed in one write zeroes operation by the device. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@hgst.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-wbt: allow wbt to be enabled always through sysfsJens Axboe2016-11-281-1/+2
| | | | | | | | | | | Currently there's no way to enable wbt if it's not enabled in the kernel config by default for a device. Allow a write to the 'wbt_lat_usec' queue sysfs file to enable wbt. This is useful for both the kernel config case, but also if the device is CFQ managed and it was turned off by default. Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-wbt: cleanup disable-by-default for CFQJens Axboe2016-11-281-2/+8
| | | | | | | | | Make it clear that we are disabling wbt for the specified queued, if it was enabled by default. This is in preparation for allowing users to re-enable wbt, and not have it disabled automatically again. Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-wbt: allow reset of default latency through sysfsJens Axboe2016-11-281-4/+13
| | | | | | | | Allow a write of '-1' to reset the default latency target for a given device. This removes knowledge of the different default settings for rotational vs non-rotational from user space. Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-wbt: fix old-style function declarationArnd Bergmann2016-11-161-1/+1
| | | | | | | | | | | | The newly added driver causes a harmless warning in some configurations: block/blk-wbt.c:250:1: error: ‘inline’ is not at beginning of declaration [-Werror=old-style-declaration] static bool inline stat_sample_valid(struct blk_rq_stat *stat) This makes it use the expected format for the declaration. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-wbt: use BLK_STAT_{READ,WRITE} instead of 0/1Jens Axboe2016-11-111-6/+6
| | | | | | Since we have proper enums for the stats directions, use them. Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-wbt: remove stat opsJens Axboe2016-11-111-10/+5
| | | | | | | Again a leftover from when the throttling code was generic. Now that we just have the block user, get rid of the stat ops and indirections. Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-wbt: store queue instead of bdiJens Axboe2016-11-111-8/+12
| | | | | | | The bdi was a leftover from when the code was block layer agnostic. Now that we just support a block layer user, store the queue directly. Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-wbt: add general throttling mechanismJens Axboe2016-11-101-0/+735
We can hook this up to the block layer, to help throttle buffered writes. wbt registers a few trace points that can be used to track what is happening in the system: wbt_lat: 259:0: latency 2446318 wbt_stat: 259:0: rmean=2446318, rmin=2446318, rmax=2446318, rsamples=1, wmean=518866, wmin=15522, wmax=5330353, wsamples=57 wbt_step: 259:0: step down: step=1, window=72727272, background=8, normal=16, max=32 This shows a sync issue event (wbt_lat) that exceeded it's time. wbt_stat dumps the current read/write stats for that window, and wbt_step shows a step down event where we now scale back writes. Each trace includes the device, 259:0 in this case. Signed-off-by: Jens Axboe <axboe@fb.com>