path: root/block/blk-core.c
Commit message    Author    Age    Files    Lines
* block: Move bdi_unregister() to del_gendisk()Jan Kara2017-03-021-1/+0
Commit 6cd18e711dd8 "block: destroy bdi before blockdev is unregistered." moved bdi unregistration (at that time through bdi_destroy()) from blk_release_queue() to blk_cleanup_queue() because it needs to happen before the blk_unregister_region() call in del_gendisk() for MD. SCSI, however, frees up the device number from sd_remove(), called through a maze of callbacks from device_del() in __scsi_remove_device(), before blk_cleanup_queue(), so races similar to those described in 6cd18e711dd8 can happen for SCSI as well, as reported by Omar [1].

Moving bdi_unregister() to del_gendisk() works for MD and fixes the problem for SCSI, since del_gendisk() gets called from sd_remove() before the device number is freed. This also makes device_add_disk() (which calls bdi_register_owner()) more symmetric with del_gendisk().

[1] http://marc.info/?l=linux-block&m=148554717109098&w=2

Tested-by: Lekshmi Pillai <lekshmicpillai@in.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Tested-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
* Merge branch 'for-4.11/next' into for-4.11/linus-mergeJens Axboe2017-02-171-160/+165
|\ | | | | | | Signed-off-by: Jens Axboe <axboe@fb.com>
| * block: optionally merge discontiguous discard bios into a single requestChristoph Hellwig2017-02-081-0/+27
In the plug merging code, add a new merge strategy that merges discard bios into a request until the maximum number of discard ranges (or the maximum discard size) is reached. I/O scheduler merging is not wired up yet but might also be useful, although not for fast devices like NVMe, which are the only user for now. Note that for now we don't support limiting the size of each discard range; if needed, that can be added later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
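For orientation, the plug-merge decision this introduces boils down to asking whether a discard request still has room for one more range. The sketch below is illustrative only: the req_discard_ranges() helper is hypothetical (the real code tracks ranges via the request's segment count), and queue_max_discard_segments() is the per-queue limit this series adds.

    /* Hypothetical sketch: may this discard bio be appended to rq as one more range? */
    static bool discard_can_merge(struct request_queue *q, struct request *rq,
                                  struct bio *bio)
    {
            if (req_op(rq) != REQ_OP_DISCARD || bio_op(bio) != REQ_OP_DISCARD)
                    return false;
            /* each merged bio contributes one more discontiguous range/segment */
            return req_discard_ranges(rq) + 1 <= queue_max_discard_segments(q);
    }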
| * block: enumify ELEVATOR_*_MERGEChristoph Hellwig2017-02-081-36/+40
Switch these constants to an enum, and let the compiler ensure that all callers of blk_try_merge and elv_merge handle all potential values.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
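Reconstructed from the description, the resulting enum looks roughly like this (the exact values are assumed to mirror the old #define constants):

    enum elv_merge {
            ELEVATOR_NO_MERGE       = 0,
            ELEVATOR_FRONT_MERGE    = 1,
            ELEVATOR_BACK_MERGE     = 2,
            ELEVATOR_DISCARD_MERGE  = 3,
    };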
| * block: free merged request in the callerJens Axboe2017-02-031-3/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we end up doing a request-to-request merge when we have completed a bio-to-request merge, we free the request from deep down in that path. For blk-mq-sched, the merge path has to hold the appropriate lock, but we don't need it for freeing the request. And in fact holding the lock is problematic, since we are now calling the mq sched put_rq_private() hook with the lock held. Other call paths do not hold this lock. Fix this inconsistency by ensuring that the caller frees a merged request. Then we can do it outside of the lock, making it both more efficient and fixing the blk-mq-sched problem of invoking parts of the scheduler with an unknown lock state. Reported-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Omar Sandoval <osandov@fb.com>
| * block: use same block debugfs directory for blk-mq and blktraceOmar Sandoval2017-02-021-0/+9
| | | | | | | | | | | | | | | | | | | | When I added the blk-mq debugging information to debugfs, I didn't notice that blktrace also creates a "block" directory in debugfs. Make them use the same dentry, now created in the core block code. Based on a patch from Jens. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * scsi, block: fix duplicate bdi name registration crashesDan Williams2017-02-021-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Warnings of the following form occur because scsi reuses a devt number while the block layer still has it referenced as the name of the bdi [1]: WARNING: CPU: 1 PID: 93 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x62/0x80 sysfs: cannot create duplicate filename '/devices/virtual/bdi/8:192' [..] Call Trace: dump_stack+0x86/0xc3 __warn+0xcb/0xf0 warn_slowpath_fmt+0x5f/0x80 ? kernfs_path_from_node+0x4f/0x60 sysfs_warn_dup+0x62/0x80 sysfs_create_dir_ns+0x77/0x90 kobject_add_internal+0xb2/0x350 kobject_add+0x75/0xd0 device_add+0x15a/0x650 device_create_groups_vargs+0xe0/0xf0 device_create_vargs+0x1c/0x20 bdi_register+0x90/0x240 ? lockdep_init_map+0x57/0x200 bdi_register_owner+0x36/0x60 device_add_disk+0x1bb/0x4e0 ? __pm_runtime_use_autosuspend+0x5c/0x70 sd_probe_async+0x10d/0x1c0 async_run_entry_fn+0x39/0x170 This is a brute-force fix to pass the devt release information from sd_probe() to the locations where we register the bdi, device_add_disk(), and unregister the bdi, blk_cleanup_queue(). Thanks to Omar for the quick reproducer script [2]. This patch survives where an unmodified kernel fails in a few seconds. [1]: https://marc.info/?l=linux-scsi&m=147116857810716&w=4 [2]: http://marc.info/?l=linux-block&m=148554717109098&w=2 Cc: James Bottomley <James.Bottomley@hansenpartnership.com> Cc: Bart Van Assche <bart.vanassche@sandisk.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Jan Kara <jack@suse.cz> Reported-by: Omar Sandoval <osandov@osandov.com> Tested-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * block: Get rid of blk_get_backing_dev_info()Jan Kara2017-02-021-14/+0
| | | | | | | | | | | | | | | | blk_get_backing_dev_info() is now a simple dereference. Remove that function and simplify some code around that. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
| * block: Make blk_get_backing_dev_info() safe without open bdevJan Kara2017-02-021-5/+3
Currently blk_get_backing_dev_info() is not safe to call when the block device is not open, as bdev->bd_disk is NULL in that case. However, inode_to_bdi() uses this function and may be called from the flusher worker or other writeback-related functions without the bdev being open, which leads to crashes such as:

[113031.075540] Unable to handle kernel paging request for data at address 0x00000000
[113031.075614] Faulting instruction address: 0xc0000000003692e0
0:mon> t
[c0000000fb65f900] c00000000036cb6c writeback_sb_inodes+0x30c/0x590
[c0000000fb65fa10] c00000000036ced4 __writeback_inodes_wb+0xe4/0x150
[c0000000fb65fa70] c00000000036d33c wb_writeback+0x30c/0x450
[c0000000fb65fb40] c00000000036e198 wb_workfn+0x268/0x580
[c0000000fb65fc50] c0000000000f3470 process_one_work+0x1e0/0x590
[c0000000fb65fce0] c0000000000f38c8 worker_thread+0xa8/0x660
[c0000000fb65fd80] c0000000000fc4b0 kthread+0x110/0x130
[c0000000fb65fe30] c0000000000098f0 ret_from_kernel_thread+0x5c/0x6c

Signed-off-by: Jens Axboe <axboe@fb.com>
| * block: Dynamically allocate and refcount backing_dev_infoJan Kara2017-02-021-7/+5
| | | | | | | | | | | | | | | | | | | | | | Instead of storing backing_dev_info inside struct request_queue, allocate it dynamically, reference count it, and free it when the last reference is dropped. Currently only request_queue holds the reference but in the following patch we add other users referencing backing_dev_info. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
| * block: Use pointer to backing_dev_info from request_queueJan Kara2017-02-021-13/+14
| | | | | | | | | | | | | | | | | | | | | | We will want to have struct backing_dev_info allocated separately from struct request_queue. As the first step add pointer to backing_dev_info to request_queue and convert all users touching it. No functional changes in this patch. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
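A minimal sketch of the structural change described above (surrounding members elided; the comment reflects the follow-up patches in this series):

    struct request_queue {
            /* ... */
            struct backing_dev_info *backing_dev_info;  /* was an embedded struct; becomes
                                                           dynamically allocated and refcounted
                                                           in the next patch */
            /* ... */
    };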
| * block: fold cmd_type into the REQ_OP_ spaceChristoph Hellwig2017-01-311-4/+2
Instead of keeping two levels of indirection for request types, fold it all into the operations. The little caveat here is that previously cmd_type only applied to struct request, while the request and bio op fields were set to plain REQ_OP_READ/WRITE even for passthrough operations. Instead, this patch adds new REQ_OP_* values for SCSI passthrough and driver-private requests, although it has to add two for each so that we can communicate the data in/out nature of the request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
| * block: introduce blk_rq_is_passthroughChristoph Hellwig2017-01-311-4/+4
This can be used to check for fs vs non-fs requests and basically removes all knowledge of BLOCK_PC specifics from the block layer, as well as preparing for the removal of the cmd_type field in struct request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
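cmd_type still exists when this helper is introduced (it is folded away by the "fold cmd_type into the REQ_OP_ space" change listed above), so the helper presumably amounts to little more than a cmd_type test; a sketch:

    static inline bool blk_rq_is_passthrough(struct request *rq)
    {
            /* anything that is not a regular filesystem request is passthrough */
            return rq->cmd_type != REQ_TYPE_FS;
    }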
| * block: don't assign cmd_flags in __blk_rq_prep_cloneChristoph Hellwig2017-01-271-1/+0
| | | | | | | | | | | | | | | | | | These days we have the proper flags set since request allocation time. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * block: split scsi_request out of struct requestChristoph Hellwig2017-01-271-31/+0
| | | | | | | | | | | | | | | | | | | | And require all drivers that want to support BLOCK_PC to allocate it as the first thing of their private data. To support this the legacy IDE and BSG code is switched to set cmd_size on their queues to let the block layer allocate the additional space. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * block: allow specifying size for extra command dataChristoph Hellwig2017-01-271-12/+47
This mirrors the blk-mq capability to allocate extra driver-specific data behind struct request by setting a cmd_size field, as well as having a constructor / destructor for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
| * block: simplify blk_init_allocated_queueChristoph Hellwig2017-01-271-23/+15
Return an errno value instead of the passed-in queue so that callers don't have to keep track of two queues, and move the assignment of the request_fn and lock to the caller, as passing them as arguments doesn't simplify anything. While we're at it, also remove two pointless NULL assignments, given that the request structure is zeroed on allocation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
| * block: fix elevator init checkChristoph Hellwig2017-01-271-22/+4
We can't initialize the elevator fields for flushes, as flushes share space in struct request with the elevator data. But currently we can't communicate that a request is a flush through blk_get_request, as we can only pass READ or WRITE, and the low-level code looks at the possibly NULL bio to check for a flush. Fix this by allowing any block op and flags to be passed, and by checking for the flush flags in __get_request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
* | block: queue lock must be acquired when iterating over rlsTahsin Erdogan2017-02-011-0/+2
blk_set_queue_dying() does not acquire the queue lock before it calls blk_queue_for_each_rl(). This allows a racing blkg_destroy() to remove blkg->q_node from the linked list and have blk_queue_for_each_rl() loop infinitely over the removed blkg->q_node list node.

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
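The fix presumably wraps the existing request-list wakeup loop in the queue lock; roughly (the loop body here is from memory, not copied from the patch):

    spin_lock_irq(q->queue_lock);
    blk_queue_for_each_rl(rl, q) {
            if (rl->rq_pool) {
                    /* wake anyone waiting for a request so they can notice the dying queue */
                    wake_up(&rl->wait[BLK_RW_SYNC]);
                    wake_up(&rl->wait[BLK_RW_ASYNC]);
            }
    }
    spin_unlock_irq(q->queue_lock);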
* block: cleanup remaining manual checks for PREFLUSH|FUAJens Axboe2017-01-271-1/+1
| | | | | | Use op_is_flush() where applicable. Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq-sched: add flush insertion into blk_mq_sched_insert_request()Jens Axboe2017-01-271-1/+1
| | | | | | | | | | Instead of letting the caller check this and handle the details of inserting a flush request, put the logic in the scheduler insertion function. This fixes direct flush insertion outside of the usual make_request_fn calls, like from dm via blk_insert_cloned_request(). Signed-off-by: Jens Axboe <axboe@fb.com>
* block: add a op_is_flush helperChristoph Hellwig2017-01-271-4/+4
This centralizes the checks for bios that need to go into the flush state machine.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
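The helper itself is a one-liner over the existing flush-related flags; roughly:

    static inline bool op_is_flush(unsigned int op)
    {
            /* true for any bio/request that must pass through the flush machinery */
            return op & (REQ_FUA | REQ_PREFLUSH);
    }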
* blk-mq-sched: add framework for MQ capable IO schedulersJens Axboe2017-01-171-1/+3
This adds a set of hooks that intercept the blk-mq path of allocating/inserting/issuing/completing requests, allowing us to develop a scheduler within that framework. We reuse the existing elevator scheduler API on the registration side, but augment that with the scheduler flagging support for the blk-mq interface, and with a separate set of ops hooks for MQ devices. We split driver and scheduler tags, so we can run the scheduling independently of device queue depth.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
* block: move rq_ioc() to blk.hJens Axboe2017-01-171-16/+0
| | | | | | | | | | We want to use it outside of blk-core.c. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Omar Sandoval <osandov@fb.com>
* Merge branch 'for-4.10' of ↵Linus Torvalds2016-12-131-1/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata

Pull libata updates from Tejun Heo:

 - Adam added opt-in ATA command priority support.

 - There are machines which hide multiple nvme devices behind an ahci BAR. Dan Williams proposed a solution to force-switch the mode, but it was deemed too hackish. People are gonna discuss the proper way to handle the situation in nvme standard meetings. For now, detect and warn about the situation.

 - Low level driver specific changes.

Christoph Hellwig pipes in about the hidden nvme warning:

 "I wish that was the case. We've pretty much agreed that we'll want to implement it as a virtual PCIe root bridge, similar to Intel's other 'innovation' VMD that we work around that way.

  But Intel management has apparently decided that they don't want to spend more cycles on this now that Lenovo has an optional BIOS that doesn't force this broken mode anymore, and no one outside of Intel has enough information to implement something like this.

  So for now I guess this warning is it, until Intel reconsiders and spends resources on fixing up the damage their Chipset people caused"

* 'for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
  ahci: warn about remapped NVMe devices
  ahci-remap.h: add ahci remapping definitions
  nvme: move NVMe class code to pci_ids.h
  pata: imx: support controller modes up to PIO4
  pata: imx: add support of setting timings for PIO modes
  pata: imx: set controller PIO mode with .set_piomode callback
  pata: imx: sort headers out
  ata: set ncq_prio_enabled iff device has support
  ata: ATA Command Priority Disabled By Default
  ata: Enabling ATA Command Priorities
  block: Add iocontext priority to request
  ahci: qoriq: added ls1046a platform support
| * block: Add iocontext priority to requestAdam Manzanares2016-10-191-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch adds an association between iocontext ioprio and the ioprio of a request. This is done to enable request based drivers the ability to act on priority information stored in the request. An example being ATA devices that support command priorities. If the ATA driver discovers that the device supports command priorities and the request has valid priority information indicating the request is high priority, then a high priority command can be sent to the device. This should improve tail latencies for high priority IO on any device that queues requests internally and can make use of the priority information stored in the request. The ioprio of the request is set in blk_rq_set_prio which takes the request and the ioc as arguments. If the ioc is valid in blk_rq_set_prio then the iopriority of the request is set as the iopriority of the ioc. In init_request_from_bio a check is made to see if the ioprio of the bio is valid and if so then the request prio comes from the bio. Signed-off-by: Adam Manzananares <adam.manzanares@wdc.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Tejun Heo <tj@kernel.org>
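From the description above, blk_rq_set_prio() amounts to roughly the following (a sketch grounded in the commit text, not a verbatim copy of the patch):

    static inline void blk_rq_set_prio(struct request *rq, struct io_context *ioc)
    {
            /* inherit the io context's priority if one is attached */
            if (ioc)
                    rq->ioprio = ioc->ioprio;
    }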
* | block: improve handling of the magic discard payloadChristoph Hellwig2016-12-091-32/+2
Instead of allocating a single unused biovec for discard requests, send them down without any payload. Instead we allow the driver to add a "special" payload using a biovec embedded into struct request (unioned over other fields never used while in the driver), and overloading the number of segments for this case.

This has a couple of advantages:

 - we don't have to allocate the bio_vec
 - the amount of special casing for discard requests in the block layer is significantly reduced
 - using this same scheme for other request types is trivial, which will be important for implementing the new WRITE_ZEROES op on devices where it actually requires a payload (e.g. SCSI)
 - we can get rid of playing games with the request length, as we'll never touch it and completions will work just fine
 - it will allow us to support ranged discard operations in the future by merging non-contiguous discard bios into a single request
 - last but not least it removes a lot of code

This patch is the common base for my WIP series for ranged discards and for removing discard_zeroes_data in favor of always using REQ_OP_WRITE_ZEROES, so it would be good to get it in quickly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
* | block: fix unintended fallthrough in generic_make_request_checks()Nicolai Stange2016-12-051-0/+1
Since commit e73c23ff736e ("block: add async variant of blkdev_issue_zeroout"), messages like the following show up:

  EXT4-fs (dm-1): Delayed block allocation failed for inode 2368848 at logical offset 0 with max blocks 1 with error 95
  EXT4-fs (dm-1): This should not happen!! Data will be lost

Due to the following fallthrough introduced with commit 2d253440b5af ("block: Define zoned block device operations"), generic_make_request_checks() would accept a REQ_OP_WRITE_SAME bio only if the block device supports "write same" *and* is a zoned one:

  switch (bio_op(bio)) {
  [...]
  case REQ_OP_WRITE_SAME:
          if (!bdev_write_same(bio->bi_bdev))
                  goto not_supported;
  case REQ_OP_ZONE_REPORT:
  case REQ_OP_ZONE_RESET:
          if (!bdev_is_zoned(bio->bi_bdev))
                  goto not_supported;
          break;
  [...]
  }

Thus, although the bio setup as done by __blkdev_issue_write_same() from commit e73c23ff736e ("block: add async variant of blkdev_issue_zeroout") would succeed, its actual submission would not, resulting in the EOPNOTSUPP == 95.

Fix this by removing the fallthrough which, due to the lack of an explicit comment, seems to be unintended anyway.

Fixes: e73c23ff736e ("block: add async variant of blkdev_issue_zeroout")
Fixes: 2d253440b5af ("block: Define zoned block device operations")
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
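With the fix applied, the switch gains the missing break, so the checks read roughly:

  case REQ_OP_WRITE_SAME:
          if (!bdev_write_same(bio->bi_bdev))
                  goto not_supported;
          break;          /* no longer falls through into the zoned-only checks */
  case REQ_OP_ZONE_REPORT:
  case REQ_OP_ZONE_RESET:
          if (!bdev_is_zoned(bio->bi_bdev))
                  goto not_supported;
          break;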
* | block: add support for REQ_OP_WRITE_ZEROESChaitanya Kulkarni2016-12-011-0/+4
This adds a new block layer operation to zero out a range of LBAs. It allows implementing zeroing for devices that don't use either discard with a predictable zero pattern or WRITE SAME of zeroes. The prominent example is NVMe with the Write Zeroes command, but in the future this should also help improve the way zeroing discards work. For this operation, a suitable entry is exported in sysfs which indicates the maximum number of bytes the device allows in one write zeroes operation.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@hgst.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
* | block: apply blk_partition_remap to REQ_OP_ZONE_RESETShaun Tancheff2016-11-211-1/+6
If a ZBC device is partitioned and operations are performed on the partition, the zone information is rebased to the partition; however, a zone reset is not mapped from the partition to the device as other operations are. This causes the API (report zones / reset zone) to be unbalanced in this regard. Checking for the zone reset op code explicitly balances the API.

Signed-off-by: Shaun Tancheff <shaun.tancheff@seagate.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
* | block: deal with stale req count of plug listMing Lei2016-11-161-1/+4
In both the legacy and mq paths, the request count of the plug list is computed before allocating the request, so the number can be stale when we fall back to a sleeping allocation; the newly introduced wbt can sleep too. This patch deals with the case by checking whether the plug list has become empty, and fixes the KASAN report of 'BUG: KASAN: stack-out-of-bounds' introduced by Shaohua's patches for dispatching big requests.

Fixes: 600271d900002 ("blk-mq: immediately dispatch big size request")
Fixes: 50d24c34403c6 ("block: immediately dispatch big size request")
Cc: Shaohua Li <shli@fb.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
* | block: move poll code to blk-mqJens Axboe2016-11-111-46/+0
| | | | | | | | | | | | | | | | The poll code is blk-mq specific, let's move it to blk-mq.c. This is a prep patch for improving the polling code. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
* | block: hook up writeback throttlingJens Axboe2016-11-101-2/+15
Enable throttling of buffered writeback to make it a lot smoother and have far less impact on other system activity. Background writeback should be, by definition, background activity. The fact that we flush huge bundles of it at a time means that it potentially has heavy impacts on foreground workloads, which isn't ideal. We can't easily limit the sizes of writes that we do, since that would impact file system layout in the presence of delayed allocation. So just throttle back buffered writeback, unless someone is waiting for it.

The algorithm for when to throttle takes its inspiration from the CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors the minimum latencies of requests over a window of time. In that window of time, if the minimum latency of any request exceeds a given target, then a scale count is incremented and the queue depth is shrunk. The next monitoring window is shrunk accordingly. Unlike CoDel, if we hit a window that exhibits good behavior, then we simply increment the scale count and re-calculate the limits for that scale value. This prevents us from oscillating between a close-to-ideal value and max all the time, instead remaining in the windows where we get good behavior.

Unlike CoDel, blk-wb allows the scale count to go negative. This happens if we primarily have writes going on. Unlike positive scale counts, this doesn't change the size of the monitoring window. When the heavy writers finish, blk-wb quickly snaps back to its stable state of a zero scale count.

The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency target to be met. It defaults to 2 msec for non-rotational storage, and 75 msec for rotational storage. Setting this value to '0' disables blk-wb. Generally, a user would not have to touch this setting.

We don't enable WBT on devices that are managed with CFQ and have a non-root block cgroup attached. If we have a proportional share setup on this particular disk, then the wbt throttling will interfere with that. We don't have a strong need for wbt for that case, since we will rely on CFQ doing that for us.

Signed-off-by: Jens Axboe <axboe@fb.com>
* | block: add scalable completion tracking of requestsJens Axboe2016-11-101-2/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For legacy block, we simply track them in the request queue. For blk-mq, we track them on a per-sw queue basis, which we can then sum up through the hardware queues and finally to a per device state. The stats are tracked in, roughly, 0.1s interval windows. Add sysfs files to display the stats. The feature is off by default, to avoid any extra overhead. In-kernel users of it can turn it on by setting QUEUE_FLAG_STATS in the queue flags. We currently don't turn it on if someone just reads any of the stats files, that is something we could add as well. Signed-off-by: Jens Axboe <axboe@fb.com>
* | block: immediately dispatch big size requestShaohua Li2016-11-031-1/+3
Currently the block plug holds up to 16 non-mergeable requests. This makes sense if the request size is small, e.g., to reduce lock contention. But if the request size is big enough, we don't need to worry about lock contention. Holding such a request makes no sense and it lowers disk utilization. In practice, this improves throughput by 10% for my raid5 sequential write workload.

The size (128k) is arbitrary right now, but it makes sure lock contention stays small. This probably could be more intelligent, e.g., checking the average size of held requests. Since this is mainly for sequential IO, it's probably not worth it.

V2: check the last request instead of the first request, so as long as there is one big request we flush the plug.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
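The plug-flush condition after this change is roughly the following sketch (the exact name of the 128k threshold constant is an assumption):

    /* flush the plug when it is full, or when the last queued request is already big */
    if (request_count >= BLK_MAX_REQUEST_COUNT ||
        (last && blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
            blk_flush_plug_list(plug, false);
            trace_block_plug(q);
    }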
* | block: better op and flags encodingChristoph Hellwig2016-10-281-40/+20
Now that we don't need the common flags to overflow outside the range of a 32-bit type, we can encode them the same way for both the bio and request fields. In addition, this allows us to place the operation first (and make some room for more ops while we're at it) and to stop having to shift the operation values around. It also allows passing around only one value in the block layer instead of two (and eventually also in the file systems, but we can do that later), and thus cleans up a lot of code.

Last but not least, this allows decreasing the size of the cmd_flags field in struct request to 32 bits. Various functions passing this value could also be updated, but I'd like to avoid the churn for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
* | block: split out request-only flags into a new namespaceChristoph Hellwig2016-10-281-34/+37
A lot of the REQ_* flags are only used on struct request, and are only of use to the block layer and a few drivers that dig into struct request internals. This patch adds a new req_flags_t rq_flags field to struct request for them, and thus dramatically shrinks the number of common request flags. It also removes the unfortunate situation where we had to fit the fields from the same enum into 32 bits for struct bio and 64 bits for struct request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
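A sketch of the split described above (member layout abridged; the request-only flags live in the new RQF_* namespace):

    typedef __u32 __bitwise req_flags_t;

    struct request {
            /* ... */
            req_flags_t rq_flags;   /* request-only RQF_* flags, split out of the shared REQ_* space */
            /* ... */
    };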
* | block: Define zoned block device operationsShaun Tancheff2016-10-181-0/+4
Define REQ_OP_ZONE_REPORT and REQ_OP_ZONE_RESET for handling zones of host-managed and host-aware zoned block devices. With these two new operations, the total number of operations defined reaches 8 and still fits within the 3-bit definition of REQ_OP_BITS.

Signed-off-by: Shaun Tancheff <shaun.tancheff@seagate.com>
Signed-off-by: Damien Le Moal <damien.lemoal@hgst.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
* block: add poll_considered statisticStephen Bates2016-09-141-2/+6
In order to help determine the effectiveness of polling in a running system, it is useful to know the ratio of how often the poll function is called vs how often the completion is checked. For this reason we add a poll_considered variable and add it to the sysfs entry for io_poll.

Signed-off-by: Stephen Bates <sbates@raithlin.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq: turn hctx->run_work into a regular work structJens Axboe2016-08-291-1/+1
| | | | | | | We don't need the larger delayed work struct, since we always run it immediately. Signed-off-by: Jens Axboe <axboe@fb.com>
* block: add kblockd_schedule_work_on()Jens Axboe2016-08-291-0/+6
| | | | | | Add a helper to schedule a regular struct work on a particular CPU. Signed-off-by: Jens Axboe <axboe@fb.com>
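The helper is presumably a thin wrapper around queue_work_on() using the existing kblockd workqueue; roughly:

    int kblockd_schedule_work_on(int cpu, struct work_struct *work)
    {
            /* run block-layer deferred work on a specific CPU */
            return queue_work_on(cpu, kblockd_workqueue, work);
    }
    EXPORT_SYMBOL(kblockd_schedule_work_on);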
* block: Fix race triggered by blk_set_queue_dying()Bart Van Assche2016-08-161-1/+3
| | | | | | | | | | | | blk_set_queue_dying() can be called while another thread is submitting I/O or changing queue flags, e.g. through dm_stop_queue(). Hence protect the QUEUE_FLAG_DYING flag change with locking. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Mike Snitzer <snitzer@redhat.com> Cc: stable <stable@vger.kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: rename bio bi_rw to bi_opfJens Axboe2016-08-071-13/+13
Since commit 63a4cc24867d, bio->bi_rw contains flags in the lower portion and the op code in the higher portion. This means that old code that relies on manually setting bi_rw is most likely going to be broken. Instead of letting that brokenness linger, rename the member to force old and out-of-tree code to break at compile time instead of at runtime. No intended functional changes in this commit.

Signed-off-by: Jens Axboe <axboe@fb.com>
* Merge branch 'for-4.8/drivers' of git://git.kernel.dk/linux-blockLinus Torvalds2016-07-261-72/+25
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block driver updates from Jens Axboe: "This branch also contains core changes. I've come to the conclusion that from 4.9 and forward, I'll be doing just a single branch. We often have dependencies between core and drivers, and it's hard to always split them up appropriately without pulling core into drivers when that happens. That said, this contains: - separate secure erase type for the core block layer, from Christoph. - set of discard fixes, from Christoph. - bio shrinking fixes from Christoph, as a followup up to the op/flags change in the core branch. - map and append request fixes from Christoph. - NVMeF (NVMe over Fabrics) code from Christoph. This is pretty exciting! - nvme-loop fixes from Arnd. - removal of ->driverfs_dev from Dan, after providing a device_add_disk() helper. - bcache fixes from Bhaktipriya and Yijing. - cdrom subchannel read fix from Vchannaiah. - set of lightnvm updates from Wenwei, Matias, Johannes, and Javier. - set of drbd updates and fixes from Fabian, Lars, and Philipp. - mg_disk error path fix from Bart. - user notification for failed device add for loop, from Minfei. - NVMe in general: + NVMe delay quirk from Guilherme. + SR-IOV support and command retry limits from Keith. + fix for memory-less NUMA node from Masayoshi. + use UINT_MAX for discard sectors, from Minfei. + cancel IO fixes from Ming. + don't allocate unused major, from Neil. + error code fixup from Dan. + use constants for PSDT/FUSE from James. + variable init fix from Jay. + fabrics fixes from Ming, Sagi, and Wei. + various fixes" * 'for-4.8/drivers' of git://git.kernel.dk/linux-block: (115 commits) nvme/pci: Provide SR-IOV support nvme: initialize variable before logical OR'ing it block: unexport various bio mapping helpers scsi/osd: open code blk_make_request target: stop using blk_make_request block: simplify and export blk_rq_append_bio block: ensure bios return from blk_get_request are properly initialized virtio_blk: use blk_rq_map_kern memstick: don't allow REQ_TYPE_BLOCK_PC requests block: shrink bio size again block: simplify and cleanup bvec pool handling block: get rid of bio_rw and READA block: don't ignore -EOPNOTSUPP blkdev_issue_write_same block: introduce BLKDEV_DISCARD_ZERO to fix zeroout NVMe: don't allocate unused nvme_major nvme: avoid crashes when node 0 is memoryless node. nvme: Limit command retries loop: Make user notify for adding loop device failed nvme-loop: fix nvme-loop Kconfig dependencies nvmet: fix return value check in nvmet_subsys_alloc() ...
| * scsi/osd: open code blk_make_requestChristoph Hellwig2016-07-201-57/+0
I wish the OSD code could simply use the blk_rq_map_* helpers like everyone else, but the complex nature of deciding if we have DATA IN and/or DATA OUT buffers might make this impossible (at least for a mere human like me). But using blk_rq_append_bio at least allows sharing the setup code between requests with or without data buffers, and given that this is the last user of blk_make_request, it allows getting rid of that somewhat awkward interface.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Boaz Harrosh <ooo@electrozaur.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
| * block: simplify and export blk_rq_append_bioChristoph Hellwig2016-07-201-1/+1
The target SCSI passthrough backend is much better served with the low-level blk_rq_append_bio construct than the helpers built on top of it, so export it. Also use the opportunity to remove the pointless request_queue argument and make the code flow a little more readable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
| * block: ensure bios return from blk_get_request are properly initializedChristoph Hellwig2016-07-201-5/+7
| | | | | | | | | | | | | | | | | | | | | | | | blk_get_request is used for BLOCK_PC and similar passthrough requests. Currently we always need to call blk_rq_set_block_pc or an open coded version of it to allow appending bios using the request mapping helpers later on, which is a somewhat awkward API. Instead move the initialization part of blk_rq_set_block_pc into blk_get_request, so that we always have a safe to use request. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * block: Export blk_pollSagi Grimberg2016-07-051-0/+1
The new NVMe over fabrics target will make use of this from a module.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
| * block: add a separate operation type for secure eraseChristoph Hellwig2016-06-091-10/+17
Add a separate operation type for secure erase instead of overloading the discard support with the REQ_SECURE flag. Use the opportunity to rename the queue flag as well, and remove the dead checks for this flag in the RAID 1 and RAID 10 drivers, which don't claim support for secure erase.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
* | cfq-iosched: temporarily boost queue priority for idle classesJens Axboe2016-06-091-0/+5
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we're queuing REQ_PRIO IO and the task is running at an idle IO class, then temporarily boost the priority. This prevents livelocks due to priority inversion, when a low priority task is holding file system resources while attempting to do IO. An example of that is shown below. An ioniced idle task is holding the directory mutex, while a normal priority task is trying to do a directory lookup. [478381.198925] ------------[ cut here ]------------ [478381.200315] INFO: task ionice:1168369 blocked for more than 120 seconds. [478381.201324] Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1 [478381.202278] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [478381.203462] ionice D ffff8803692736a8 0 1168369 1 0x00000080 [478381.203466] ffff8803692736a8 ffff880399c21300 ffff880276adcc00 ffff880369273698 [478381.204589] ffff880369273fd8 0000000000000000 7fffffffffffffff 0000000000000002 [478381.205752] ffffffff8177d5e0 ffff8803692736c8 ffffffff8177cea7 0000000000000000 [478381.206874] Call Trace: [478381.207253] [<ffffffff8177d5e0>] ? bit_wait_io_timeout+0x80/0x80 [478381.208175] [<ffffffff8177cea7>] schedule+0x37/0x90 [478381.208932] [<ffffffff8177f5fc>] schedule_timeout+0x1dc/0x250 [478381.209805] [<ffffffff81421c17>] ? __blk_run_queue+0x37/0x50 [478381.210706] [<ffffffff810ca1c5>] ? ktime_get+0x45/0xb0 [478381.211489] [<ffffffff8177c407>] io_schedule_timeout+0xa7/0x110 [478381.212402] [<ffffffff810a8c2b>] ? prepare_to_wait+0x5b/0x90 [478381.213280] [<ffffffff8177d616>] bit_wait_io+0x36/0x50 [478381.214063] [<ffffffff8177d325>] __wait_on_bit+0x65/0x90 [478381.214961] [<ffffffff8177d5e0>] ? bit_wait_io_timeout+0x80/0x80 [478381.215872] [<ffffffff8177d47c>] out_of_line_wait_on_bit+0x7c/0x90 [478381.216806] [<ffffffff810a89f0>] ? wake_atomic_t_function+0x40/0x40 [478381.217773] [<ffffffff811f03aa>] __wait_on_buffer+0x2a/0x30 [478381.218641] [<ffffffff8123c557>] ext4_bread+0x57/0x70 [478381.219425] [<ffffffff8124498c>] __ext4_read_dirblock+0x3c/0x380 [478381.220467] [<ffffffff8124665d>] ext4_dx_find_entry+0x7d/0x170 [478381.221357] [<ffffffff8114c49e>] ? find_get_entry+0x1e/0xa0 [478381.222208] [<ffffffff81246bd4>] ext4_find_entry+0x484/0x510 [478381.223090] [<ffffffff812471a2>] ext4_lookup+0x52/0x160 [478381.223882] [<ffffffff811c401d>] lookup_real+0x1d/0x60 [478381.224675] [<ffffffff811c4698>] __lookup_hash+0x38/0x50 [478381.225697] [<ffffffff817745bd>] lookup_slow+0x45/0xab [478381.226941] [<ffffffff811c690e>] link_path_walk+0x7ae/0x820 [478381.227880] [<ffffffff811c6a42>] path_init+0xc2/0x430 [478381.228677] [<ffffffff813e6e26>] ? security_file_alloc+0x16/0x20 [478381.229776] [<ffffffff811c8c57>] path_openat+0x77/0x620 [478381.230767] [<ffffffff81185c6e>] ? page_add_file_rmap+0x2e/0x70 [478381.232019] [<ffffffff811cb253>] do_filp_open+0x43/0xa0 [478381.233016] [<ffffffff8108c4a9>] ? creds_are_invalid+0x29/0x70 [478381.234072] [<ffffffff811c0cb0>] do_open_execat+0x70/0x170 [478381.235039] [<ffffffff811c1bf8>] do_execveat_common.isra.36+0x1b8/0x6e0 [478381.236051] [<ffffffff811c214c>] do_execve+0x2c/0x30 [478381.236809] [<ffffffff811ca392>] ? 
getname+0x12/0x20 [478381.237564] [<ffffffff811c23be>] SyS_execve+0x2e/0x40 [478381.238338] [<ffffffff81780a1d>] stub_execve+0x6d/0xa0 [478381.239126] ------------[ cut here ]------------ [478381.239915] ------------[ cut here ]------------ [478381.240606] INFO: task python2.7:1168375 blocked for more than 120 seconds. [478381.242673] Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1 [478381.243653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [478381.244902] python2.7 D ffff88005cf8fb98 0 1168375 1168248 0x00000080 [478381.244904] ffff88005cf8fb98 ffff88016c1f0980 ffffffff81c134c0 ffff88016c1f11a0 [478381.246023] ffff88005cf8ffd8 ffff880466cd0cbc ffff88016c1f0980 00000000ffffffff [478381.247138] ffff880466cd0cc0 ffff88005cf8fbb8 ffffffff8177cea7 ffff88005cf8fcc8 [478381.248252] Call Trace: [478381.248630] [<ffffffff8177cea7>] schedule+0x37/0x90 [478381.249382] [<ffffffff8177d08e>] schedule_preempt_disabled+0xe/0x10 [478381.250465] [<ffffffff8177e892>] __mutex_lock_slowpath+0x92/0x100 [478381.251409] [<ffffffff8177e91b>] mutex_lock+0x1b/0x2f [478381.252199] [<ffffffff817745ae>] lookup_slow+0x36/0xab [478381.253023] [<ffffffff811c690e>] link_path_walk+0x7ae/0x820 [478381.253877] [<ffffffff811aeb41>] ? try_charge+0xc1/0x700 [478381.254690] [<ffffffff811c6a42>] path_init+0xc2/0x430 [478381.255525] [<ffffffff813e6e26>] ? security_file_alloc+0x16/0x20 [478381.256450] [<ffffffff811c8c57>] path_openat+0x77/0x620 [478381.257256] [<ffffffff8115b2fb>] ? lru_cache_add_active_or_unevictable+0x2b/0xa0 [478381.258390] [<ffffffff8117b623>] ? handle_mm_fault+0x13f3/0x1720 [478381.259309] [<ffffffff811cb253>] do_filp_open+0x43/0xa0 [478381.260139] [<ffffffff811d7ae2>] ? __alloc_fd+0x42/0x120 [478381.260962] [<ffffffff811b95ac>] do_sys_open+0x13c/0x230 [478381.261779] [<ffffffff81011393>] ? syscall_trace_enter_phase1+0x113/0x170 [478381.262851] [<ffffffff811b96c2>] SyS_open+0x22/0x30 [478381.263598] [<ffffffff81780532>] system_call_fastpath+0x12/0x17 [478381.264551] ------------[ cut here ]------------ [478381.265377] ------------[ cut here ]------------ Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com>