summaryrefslogtreecommitdiffstats
path: root/fs/btrfs/delayed-inode.c
Commit message (Collapse)AuthorAgeFilesLines
* btrfs: make the delalloc block rsv per inodeJosef Bacik2017-11-011-45/+1
| | | | | | | | | | | | | | | | | The way we handle delalloc metadata reservations has gotten progressively more complicated over the years. There is so much cruft and weirdness around keeping the reserved count and outstanding counters consistent and handling the error cases that it's impossible to understand. Fix this by making the delalloc block rsv per-inode. This way we can calculate the actual size of the outstanding metadata reservations every time we make a change, and then reserve the delta based on that amount. This greatly simplifies the code everywhere, and makes the error handling in btrfs_delalloc_reserve_metadata far less terrifying. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: increase ctx->pos for delayed dir indexJosef Bacik2017-08-181-0/+1
| | | | | | | | | | | Our dir_context->pos is supposed to hold the next position we're supposed to look. If we successfully insert a delayed dir index we could end up with a duplicate entry because we don't increase ctx->pos after doing the dir_emit. Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: convert btrfs_delayed_item.refs from atomic_t to refcount_tElena Reshetova2017-04-181-9/+9
| | | | | | | | | | | | | | refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: convert btrfs_delayed_node.refs from atomic_t to refcount_tElena Reshetova2017-04-181-14/+14
| | | | | | | | | | | | | | refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Make btrfs_i_size_write take btrfs_inodeNikolay Borisov2017-02-281-1/+1
| | | | | Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: fix over-80 lines introduced by previous cleanupsDavid Sterba2017-02-141-2/+3
| | | | | | | This goes as a separate patch because fixing that inside the patches caused too many many conflicts. Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Make btrfs_inode_delayed_dir_index_count take btrfs_inodeNikolay Borisov2017-02-141-3/+3
| | | | | Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Make btrfs_commit_inode_delayed_items take btrfs_inodeNikolay Borisov2017-02-141-2/+2
| | | | | Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Make btrfs_commit_inode_delayed_inode take btrfs_inodeNikolay Borisov2017-02-141-3/+3
| | | | | Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Make btrfs_remove_delayed_node take btrfs_inodeNikolay Borisov2017-02-141-3/+3
| | | | | Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Make btrfs_kill_delayed_inode_items take btrfs_inodeNikolay Borisov2017-02-141-2/+2
| | | | | Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Make btrfs_delayed_delete_inode_ref take btrfs_inodeNikolay Borisov2017-02-141-3/+3
| | | | | Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Make btrfs_delete_delayed_dir_index take btrfs_inodeNikolay Borisov2017-02-141-3/+3
| | | | | Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Make btrfs_insert_delayed_dir_index take btrfs_inodeNikolay Borisov2017-02-141-3/+3
| | | | | Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Make btrfs_delayed_inode_reserve_metadata take btrfs_inodeNikolay Borisov2017-02-141-8/+8
| | | | | Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Make btrfs_get_or_create_delayed_node take btrfs_inodeNikolay Borisov2017-02-141-6/+5
| | | | | Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Make btrfs_get_delayed_node take btrfs_inodeNikolay Borisov2017-02-141-9/+8
| | | | | | | | | This function is internal to btrfs and doesn't really deal with any VFS members, as such it needn't take a struct inode refrence but btrfs_inode. Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Make btrfs_ino take a struct btrfs_inodeNikolay Borisov2017-02-141-7/+7
| | | | | | | | | | | | | | | | Currently btrfs_ino takes a struct inode and this causes a lot of internal btrfs functions which consume this ino to take a VFS inode, rather than btrfs' own struct btrfs_inode. In order to fix this "leak" of VFS structs into the internals of btrfs first it's necessary to eliminate all uses of struct inode for the purpose of inode. This patch does that by using BTRFS_I to convert an inode to btrfs_inode. With this problem eliminated subsequent patches will start eliminating the passing of struct inode altogether, eventually resulting in a lot cleaner code. Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> [ fix btrfs_get_extent tracepoint prototype ] Signed-off-by: David Sterba <dsterba@suse.com>
* Btrfs: ACCESS_ONCE cleanupSeraphime Kirkovski2017-02-141-2/+2
| | | | | | | | | This replaces ACCESS_ONCE macro with the corresponding READ|WRITE macros Signed-off-by: Seraphime Kirkovski <kirkseraph@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: limit async_work allocation and worker func durationMaxim Patlasov2016-12-131-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem statement: unprivileged user who has read-write access to more than one btrfs subvolume may easily consume all kernel memory (eventually triggering oom-killer). Reproducer (./mkrmdir below essentially loops over mkdir/rmdir): [root@kteam1 ~]# cat prep.sh DEV=/dev/sdb mkfs.btrfs -f $DEV mount $DEV /mnt for i in `seq 1 16` do mkdir /mnt/$i btrfs subvolume create /mnt/SV_$i ID=`btrfs subvolume list /mnt |grep "SV_$i$" |cut -d ' ' -f 2` mount -t btrfs -o subvolid=$ID $DEV /mnt/$i chmod a+rwx /mnt/$i done [root@kteam1 ~]# sh prep.sh [maxim@kteam1 ~]$ for i in `seq 1 16`; do ./mkrmdir /mnt/$i 2000 2000 & done [root@kteam1 ~]# for i in `seq 1 4`; do grep "kmalloc-128" /proc/slabinfo | grep -v dma; sleep 60; done kmalloc-128 10144 10144 128 32 1 : tunables 0 0 0 : slabdata 317 317 0 kmalloc-128 9992352 9992352 128 32 1 : tunables 0 0 0 : slabdata 312261 312261 0 kmalloc-128 24226752 24226752 128 32 1 : tunables 0 0 0 : slabdata 757086 757086 0 kmalloc-128 42754240 42754240 128 32 1 : tunables 0 0 0 : slabdata 1336070 1336070 0 The huge numbers above come from insane number of async_work-s allocated and queued by btrfs_wq_run_delayed_node. The problem is caused by btrfs_wq_run_delayed_node() queuing more and more works if the number of delayed items is above BTRFS_DELAYED_BACKGROUND. The worker func (btrfs_async_run_delayed_root) processes at least BTRFS_DELAYED_BATCH items (if they are present in the list). So, the machinery works as expected while the list is almost empty. As soon as it is getting bigger, worker func starts to process more than one item at a time, it takes longer, and the chances to have async_works queued more than needed is getting higher. The problem above is worsened by another flaw of delayed-inode implementation: if async_work was queued in a throttling branch (number of items >= BTRFS_DELAYED_WRITEBACK), corresponding worker func won't quit until the number of items < BTRFS_DELAYED_BACKGROUND / 2. So, it is possible that the func occupies CPU infinitely (up to 30sec in my experiments): while the func is trying to drain the list, the user activity may add more and more items to the list. The patch fixes both problems in straightforward way: refuse queuing too many works in btrfs_wq_run_delayed_node and bail out of worker func if at least BTRFS_DELAYED_WRITEBACK items are processed. Changed in v2: remove support of thresh == NO_THRESHOLD. Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com> Signed-off-by: Chris Mason <clm@fb.com> Cc: stable@vger.kernel.org # v3.15+
* btrfs: remove root parameter from transaction commit/end routinesJeff Mahoney2016-12-061-2/+2
| | | | | | | | | | Now we only use the root parameter to print the root objectid in a tracepoint. We can use the root parameter from the transaction handle for that. It's also used to join the transaction with async commits, so we remove the comment that it's just for checking. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: take an fs_info directly when the root is not used otherwiseJeff Mahoney2016-12-061-42/+42
| | | | | | | | | There are loads of functions in btrfs that accept a root parameter but only use it to obtain an fs_info pointer. Let's convert those to just accept an fs_info pointer directly. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: root->fs_info cleanup, access fs_info->delayed_root directlyJeff Mahoney2016-12-061-17/+6
| | | | | | | | This results in btrfs_assert_delayed_root_empty and btrfs_destroy_delayed_inode taking an fs_info instead of a root. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: root->fs_info cleanup, add fs_info convenience variablesJeff Mahoney2016-12-061-16/+21
| | | | | | | | | In routines where someptr->fs_info is referenced multiple times, we introduce a convenience variable. This makes the code considerably more readable. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: root->fs_info cleanup, btrfs_calc_{trans,trunc}_metadata_sizeJeff Mahoney2016-12-061-2/+2
| | | | | Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: increment ctx->pos for every emitted or skipped dirent in readdirJeff Mahoney2016-11-301-2/+1
| | | | | | | | | | | | | | | | | | | | If we process the last item in the leaf and hit an I/O error while reading the next leaf, we return -EIO without having adjusted the position. Since we have emitted dirents, getdents() will return the byte count to the user instead of the error. Subsequent callers will emit the last successful dirent again, and return -EIO again, with the same result. Callers loop forever. Instead, if we always increment ctx->pos after emitting or skipping the dirent, we'll be sure that we won't hit the same one again. When we go to process the next leaf, we won't have emitted any dirents and the -EIO will be returned to the user properly. We also don't need to track if we've emitted a dirent already or if we've changed the position yet. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: unsplit printed stringsJeff Mahoney2016-09-261-10/+7
| | | | | | | | | | | CodingStyle chapter 2: "[...] never break user-visible strings such as printk messages, because that breaks the ability to grep for them." This patch unsplits user-visible strings. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: squash lines for simple wrapper functionsMasahiro Yamada2016-09-261-4/+1
| | | | | | | | Remove unneeded variables and assignments. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* Btrfs: add a flags field to btrfs_fs_infoJosef Bacik2016-09-261-1/+2
| | | | | | | | | | | We have a lot of random ints in btrfs_fs_info that can be put into flags. This is mostly equivalent with the exception of how we deal with quota going on or off, now instead we set a flag when we are turning it on or off and deal with that appropriately, rather than just having a pending state that the current quota_enabled gets set to. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: btrfs_abort_transaction, drop root parameterJeff Mahoney2016-07-261-1/+1
| | | | | | | | | __btrfs_abort_transaction doesn't use its root parameter except to obtain an fs_info pointer. We can obtain that from trans->root->fs_info for now and from trans->fs_info in a later patch. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Fix slab accounting flagsNikolay Borisov2016-07-261-1/+1
| | | | | | | | | | | | | | | | | | | BTRFS is using a variety of slab caches to satisfy internal needs. Those slab caches are always allocated with the SLAB_RECLAIM_ACCOUNT, meaning allocations from the caches are going to be accounted as SReclaimable. At the same time btrfs is not registering any shrinkers whatsoever, thus preventing memory from the slabs to be shrunk. This means those caches are not in fact reclaimable. To fix this remove the SLAB_RECLAIM_ACCOUNT on all caches apart from the inode cache, since this one is being freed by the generic VFS super_block shrinker. Also set the transaction related caches as SLAB_TEMPORARY, to better document the lifetime of the objects (it just translates to SLAB_RECLAIM_ACCOUNT). Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* Btrfs: change delayed reservation fallback behaviorJosef Bacik2016-07-071-41/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | We reserve space for the inode update when we first reserve space for writing to a file. However there are lots of ways that we can use this reservation and not have it for subsequent ordered extents. Previously we'd fall through and try to reserve metadata bytes for this, then we'd just steal the full reservation from the delalloc_block_rsv, and if that didn't have enough space we'd steal the full reservation from the global reserve. The problem with this is we can easily just return ENOSPC and fallback to updating the inode item directly. In the worst case (assuming 4k nodesize) we'd steal 64kib from the global reserve if we fall all the way through, however if we just fallback and update the inode directly we'd only steal 4k * BTRFS_PATH_MAX in the worst case which is 32kib. We would have also just added the extent item for the inode so we likely will have already cow'ed down most of the way to the leaf containing the inode item, so we are more often than not only need one or two nodesize's worth of reservations. Given the reservation for the extent itself is also a worst case we will likely already have space to cover the inode update. This change will make us behave better in the theoretical worst case, and much better in the case that we don't have our reservation and cannot reserve more metadata. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
* Btrfs: fix callers of btrfs_block_rsv_migrateJosef Bacik2016-07-071-4/+4
| | | | | | | | | | | | | So btrfs_block_rsv_migrate just unconditionally calls block_rsv_migrate_bytes. Not only this but it unconditionally changes the size of the block_rsv. This isn't a bug strictly speaking, but it makes truncate block rsv's look funny because every time we migrate bytes over its size grows, even though we only want it to be a specific size. So collapse this into one function that takes an update_size argument and make truncate and evict not update the size for consistency sake. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
* Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodesOmar Sandoval2016-06-251-5/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit fe742fd4f90f ("Revert "btrfs: switch to ->iterate_shared()"") backed out the conversion to ->iterate_shared() for Btrfs because the delayed inode handling in btrfs_real_readdir() is racy. However, we can still do readdir in parallel if there are no delayed nodes. This is a temporary fix which upgrades the shared inode lock to an exclusive lock only when we have delayed items until we come up with a more complete solution. While we're here, rename the btrfs_{get,put}_delayed_items functions to make it very clear that they're just for readdir. Tested with xfstests and by doing a parallel kernel build: while make tinyconfig && make -j4 && git clean dqfx; do : done along with a bunch of parallel finds in another shell: while true; do for ((i=0; i<4; i++)); do find . >/dev/null & done wait done Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: GFP_NOFS does not GFP_HIGHMEMDavid Sterba2016-05-101-1/+1
| | | | | | Masking HIGHMEM out of NOFS does not make sense. Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Print Warning only if ENOSPC_DEBUG is enabledAshish Samant2016-03-141-1/+6
| | | | | | | | | Dont print warning for ENOSPC error unless ENOSPC_DEBUG is enabled. Use btrfs_debug if it is enabled. Signed-off-by: Ashish Samant <ashish.samant@oracle.com> [ preserve the WARN_ON ] Signed-off-by: David Sterba <dsterba@suse.com>
* Merge branch 'cleanups-4.6' into for-chris-4.6David Sterba2016-02-261-2/+1
|\
| * btrfs: drop null testing before destroy functionsKinglong Mee2016-02-181-2/+1
| | | | | | | | | | | | | | | | | | | | Cleanup. kmem_cache_destroy has support NULL argument checking, so drop the double null testing before calling it. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: properly set the termination value of ctx->pos in readdirDavid Sterba2016-02-111-1/+2
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The value of ctx->pos in the last readdir call is supposed to be set to INT_MAX due to 32bit compatibility, unless 'pos' is intentially set to a larger value, then it's LLONG_MAX. There's a report from PaX SIZE_OVERFLOW plugin that "ctx->pos++" overflows (https://forums.grsecurity.net/viewtopic.php?f=1&t=4284), on a 64bit arch, where the value is 0x7fffffffffffffff ie. LLONG_MAX before the increment. We can get to that situation like that: * emit all regular readdir entries * still in the same call to readdir, bump the last pos to INT_MAX * next call to readdir will not emit any entries, but will reach the bump code again, finds pos to be INT_MAX and sets it to LLONG_MAX Normally this is not a problem, but if we call readdir again, we'll find 'pos' set to LLONG_MAX and the unconditional increment will overflow. The report from Victor at (http://thread.gmane.org/gmane.comp.file-systems.btrfs/49500) with debugging print shows that pattern: Overflow: e Overflow: 7fffffff Overflow: 7fffffffffffffff PAX: size overflow detected in function btrfs_real_readdir fs/btrfs/inode.c:5760 cicus.935_282 max, count: 9, decl: pos; num: 0; context: dir_context; CPU: 0 PID: 2630 Comm: polkitd Not tainted 4.2.3-grsec #1 Hardware name: Gigabyte Technology Co., Ltd. H81ND2H/H81ND2H, BIOS F3 08/11/2015 ffffffff81901608 0000000000000000 ffffffff819015e6 ffffc90004973d48 ffffffff81742f0f 0000000000000007 ffffffff81901608 ffffc90004973d78 ffffffff811cb706 0000000000000000 ffff8800d47359e0 ffffc90004973ed8 Call Trace: [<ffffffff81742f0f>] dump_stack+0x4c/0x7f [<ffffffff811cb706>] report_size_overflow+0x36/0x40 [<ffffffff812ef0bc>] btrfs_real_readdir+0x69c/0x6d0 [<ffffffff811dafc8>] iterate_dir+0xa8/0x150 [<ffffffff811e6d8d>] ? __fget_light+0x2d/0x70 [<ffffffff811dba3a>] SyS_getdents+0xba/0x1c0 Overflow: 1a [<ffffffff811db070>] ? iterate_dir+0x150/0x150 [<ffffffff81749b69>] entry_SYSCALL_64_fastpath+0x12/0x83 The jump from 7fffffff to 7fffffffffffffff happens when new dir entries are not yet synced and are processed from the delayed list. Then the code could go to the bump section again even though it might not emit any new dir entries from the delayed list. The fix avoids entering the "bump" section again once we've finished emitting the entries, both for synced and delayed entries. References: https://forums.grsecurity.net/viewtopic.php?f=1&t=4284 Reported-by: Victor <services@swwu.com> CC: stable@vger.kernel.org Signed-off-by: David Sterba <dsterba@suse.com> Tested-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: zero out delayed node upon allocationAlexandru Moise2016-01-071-6/+1
| | | | | | | | It's slightly cleaner to zero-out the delayed node upon allocation than to do it by hand in btrfs_init_delayed_node() for a few members Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: comment the rest of implicit barriers before waitqueue_activeDavid Sterba2015-10-101-0/+4
| | | | | | | There are atomic operations that imply the barrier for waitqueue_active mixed in an if-condition. Signed-off-by: David Sterba <dsterba@suse.com>
* Btrfs: fill ->last_trans for delayed inode in btrfs_fill_inode.Yang Dongsheng2015-04-261-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We need to fill inode when we found a node for it in delayed_nodes_tree. But we did not fill the ->last_trans currently, it will cause the test of xfstest/generic/311 fail. Scenario of the 311 is shown as below: Problem: (1). test_fd = open(fname, O_RDWR|O_DIRECT) (2). pwrite(test_fd, buf, 4096, 0) (3). close(test_fd) (4). drop_all_caches() <-------- "echo 3 > /proc/sys/vm/drop_caches" (5). test_fd = open(fname, O_RDWR|O_DIRECT) (6). fsync(test_fd); <-------- we did not get the correct log entry for the file Reason: When we re-open this file in (5), we would find a node in delayed_nodes_tree and fill the inode we are lookup with the information. But the ->last_trans is not filled, then the fsync() will check the ->last_trans and found it's 0 then say this inode is already in our tree which is commited, not recording the extents for it. Fix: This patch fill the ->last_trans properly and set the runtime_flags if needed in this situation. Then we can get the log entries we expected after (6) and generic/311 passed. Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaoxie@huawei.com> Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: delayed-inode: replace root args iff only fs_info usedDaniel Dressler2015-02-161-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | This is the second independent patch of a larger project to cleanup btrfs's internal usage of btrfs_root. Many functions take btrfs_root only to grab the fs_info struct. By requiring a root these functions cause programmer overhead. That these functions can accept any valid root is not obvious until inspection. This patch reduces the specificity of such functions to accept the fs_info directly. These patches can be applied independently and thus are not being submitted as a patch series. There should be about 26 patches by the project's completion. Each patch will cleanup between 1 and 34 functions apiece. Each patch covers a single file's functions. This patch affects the following function(s): 1) btrfs_wq_run_delayed_node Signed-off-by: Daniel Dressler <danieru.dressler@gmail.com> Signed-off-by: David Sterba <dsterba@suse.cz>
* Btrfs: Add code to support file creation timechandan r2015-02-021-0/+10
| | | | | | | | | | | This patch adds a new member to the 'struct btrfs_inode' structure to hold the file creation time. Signed-off-by: chandan <chandanrmail@gmail.com> [refreshed, removed btrfs_inode_otime] Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: kill btrfs_inode_*time helpersDavid Sterba2015-02-021-16/+12
| | | | | | | They just opencode taking address of the timespec member. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: don't delay inode ref updates during log replayChris Mason2015-01-021-0/+8
| | | | | | | | | | | | | | | | | | | | | | Commit 1d52c78afbb (Btrfs: try not to ENOSPC on log replay) added a check to skip delayed inode updates during log replay because it confuses the enospc code. But the delayed processing will end up ignoring delayed refs from log replay because the inode itself wasn't put through the delayed code. This can end up triggering a warning at commit time: WARNING: CPU: 2 PID: 778 at fs/btrfs/delayed-inode.c:1410 btrfs_assert_delayed_root_empty+0x32/0x34() Which is repeated for each commit because we never process the delayed inode ref update. The fix used here is to change btrfs_delayed_delete_inode_ref to return an error if we're currently in log replay. The caller will do the ref deletion immediately and everything will work properly. Signed-off-by: Chris Mason <clm@fb.com> cc: stable@vger.kernel.org # v3.18 and any stable series that picked 1d52c78afbbf80b58299e076a159617d6b42fe3c
* btrfs: kill the key type accessor helpersDavid Sterba2014-09-171-4/+4
| | | | | | | | | btrfs_set_key_type and btrfs_key_type are used inconsistently along with open coded variants. Other members of btrfs_key are accessed directly without any helpers anyway. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: fix task hang under heavy compressed writeLiu Bo2014-08-241-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This has been reported and discussed for a long time, and this hang occurs in both 3.15 and 3.16. Btrfs now migrates to use kernel workqueue, but it introduces this hang problem. Btrfs has a kind of work queued as an ordered way, which means that its ordered_func() must be processed in the way of FIFO, so it usually looks like -- normal_work_helper(arg) work = container_of(arg, struct btrfs_work, normal_work); work->func() <---- (we name it work X) for ordered_work in wq->ordered_list ordered_work->ordered_func() ordered_work->ordered_free() The hang is a rare case, first when we find free space, we get an uncached block group, then we go to read its free space cache inode for free space information, so it will file a readahead request btrfs_readpages() for page that is not in page cache __do_readpage() submit_extent_page() btrfs_submit_bio_hook() btrfs_bio_wq_end_io() submit_bio() end_workqueue_bio() <--(ret by the 1st endio) queue a work(named work Y) for the 2nd also the real endio() So the hang occurs when work Y's work_struct and work X's work_struct happens to share the same address. A bit more explanation, A,B,C -- struct btrfs_work arg -- struct work_struct kthread: worker_thread() pick up a work_struct from @worklist process_one_work(arg) worker->current_work = arg; <-- arg is A->normal_work worker->current_func(arg) normal_work_helper(arg) A = container_of(arg, struct btrfs_work, normal_work); A->func() A->ordered_func() A->ordered_free() <-- A gets freed B->ordered_func() submit_compressed_extents() find_free_extent() load_free_space_inode() ... <-- (the above readhead stack) end_workqueue_bio() btrfs_queue_work(work C) B->ordered_free() As if work A has a high priority in wq->ordered_list and there are more ordered works queued after it, such as B->ordered_func(), its memory could have been freed before normal_work_helper() returns, which means that kernel workqueue code worker_thread() still has worker->current_work pointer to be work A->normal_work's, ie. arg's address. Meanwhile, work C is allocated after work A is freed, work C->normal_work and work A->normal_work are likely to share the same address(I confirmed this with ftrace output, so I'm not just guessing, it's rare though). When another kthread picks up work C->normal_work to process, and finds our kthread is processing it(see find_worker_executing_work()), it'll think work C as a collision and skip then, which ends up nobody processing work C. So the situation is that our kthread is waiting forever on work C. Besides, there're other cases that can lead to deadlock, but the real problem is that all btrfs workqueue shares one work->func, -- normal_work_helper, so this makes each workqueue to have its own helper function, but only a wraper pf normal_work_helper. With this patch, I no long hit the above hang. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: free delayed node outside of root->inode_lockJeff Mahoney2014-06-091-2/+5
| | | | | | | | | | On heavy workloads, we're seeing soft lockup warnings on root->inode_lock in __btrfs_release_delayed_node. The low hanging fruit is to reduce the size of the critical section. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: Cleanup the "_struct" suffix in btrfs_workequeueQu Wenruo2014-03-101-2/+2
| | | | | | | | | | | | | | Since the "_struct" suffix is mainly used for distinguish the differnt btrfs_work between the original and the newly created one, there is no need using the suffix since all btrfs_workers are changed into btrfs_workqueue. Also this patch fixed some codes whose code style is changed due to the too long "_struct" suffix. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>