summaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAgeFilesLines
* block: replace bi_bdev with a gendisk pointer and partitions indexChristoph Hellwig2017-08-2328-52/+55
| | | | | | | | | | | | | | | | | | | | This way we don't need a block_device structure to submit I/O. The block_device has different life time rules from the gendisk and request_queue and is usually only available when the block device node is open. Other callers need to explicitly create one (e.g. the lightnvm passthrough code, or the new nvme multipathing code). For the actual I/O path all that we need is the gendisk, which exists once per block device. But given that the block layer also does partition remapping we additionally need a partition index, which is used for said remapping in generic_make_request. Note that all the block drivers generally want request_queue or sometimes the gendisk, so this removes a layer of indirection all over the stack. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: cache the partition index in struct block_deviceChristoph Hellwig2017-08-231-0/+1
| | | | | Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* btrfs: index check-integrity state hash by a dev_tChristoph Hellwig2017-08-231-18/+13
| | | | | | | | | | We won't have the struct block_device available in the bio soon, so switch to the numerical dev_t instead of the block_device pointer for looking up the check-integrity state. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blktrace: add an option to allow displaying cgroup pathShaohua Li2017-07-291-0/+19
| | | | | | | | | | | | By default we output cgroup id in blktrace. This adds an option to display cgroup path. Since get cgroup path is a relativly heavy operation, we don't enable it by default. with the option enabled, blktrace will output something like this: dd-1353 [007] d..2 293.015252: 8,0 /test/level D R 24 + 8 [dd] Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* kernfs: add exportfs operationsShaohua Li2017-07-291-0/+56
| | | | | | | | | | | | Now we have the facilities to implement exportfs operations. The idea is cgroup can export the fhandle info to userspace, then userspace uses fhandle to find the cgroup name. Another example is userspace can get fhandle for a cgroup and BPF uses the fhandle to filter info for the cgroup. Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* kernfs: introduce kernfs_node_idShaohua Li2017-07-293-9/+9
| | | | | | | | | | | inode number and generation can identify a kernfs node. We are going to export the identification by exportfs operations, so put ino and generation into a separate structure. It's convenient when later patches use the identification. Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* kernfs: don't set dentry->d_fsdataShaohua Li2017-07-296-31/+27
| | | | | | | | | | | | | | When working on adding exportfs operations in kernfs, I found it's hard to initialize dentry->d_fsdata in the exportfs operations. Looks there is no way to do it without race condition. Look at the kernfs code closely, there is no point to set dentry->d_fsdata. inode->i_private already points to kernfs_node, and we can get inode from a dentry. So this patch just delete the d_fsdata usage. Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* kernfs: add an API to get kernfs node from inode numberShaohua Li2017-07-293-1/+69
| | | | | | | | | | | | | | | Add an API to get kernfs node from inode number. We will need this to implement exportfs operations. This API will be used in blktrace too later, so it should be as fast as possible. To make the API lock free, kernfs node is freed in RCU context. And we depend on kernfs_node count/ino number to filter out stale kernfs nodes. Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* kernfs: implement i_generationShaohua Li2017-07-292-1/+10
| | | | | | | | | | | | | | | | | | | | Set i_generation for kernfs inode. This is required to implement exportfs operations. The generation is 32-bit, so it's possible the generation wraps up and we find stale files. To reduce the posssibility, we don't reuse inode numer immediately. When the inode number allocation wraps, we increase generation number. In this way generation/inode number consist of a 64-bit number which is unlikely duplicated. This does make the idr tree more sparse and waste some memory. Since idr manages 32-bit keys, idr uses a 6-level radix tree, each level covers 6 bits of the key. In a 100k inode kernfs, the worst case will have around 300k radix tree node. Each node is 576bytes, so the tree will use about ~150M memory. Sounds not too bad, if this really is a problem, we should find better data structure. Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* kernfs: use idr instead of ida to manage inode numberShaohua Li2017-07-291-5/+12
| | | | | | | | | | | kernfs uses ida to manage inode number. The problem is we can't get kernfs_node from inode number with ida. Switching to use idr, next patch will add an API to get kernfs_node from inode number. Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge tag 'nfs-for-4.13-3' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds2017-07-282-6/+12
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NFS client fixes from Anna Schumaker: "More NFS client bugfixes for 4.13. Most of these fix locking bugs that Ben and Neil noticed, but I also have a patch to fix one more access bug that was reported after last week. Stable fixes: - Fix a race where CB_NOTIFY_LOCK fails to wake a waiter - Invalidate file size when taking a lock to prevent corruption Other fixes: - Don't excessively generate tiny writes with fallocate - Use the raw NFS access mask in nfs4_opendata_access()" * tag 'nfs-for-4.13-3' of git://git.linux-nfs.org/projects/anna/linux-nfs: NFSv4.1: Fix a race where CB_NOTIFY_LOCK fails to wake a waiter NFS: Optimize fallocate by refreshing mapping when needed. NFS: invalidate file size when taking a lock. NFS: Use raw NFS access mask in nfs4_opendata_access()
| * NFSv4.1: Fix a race where CB_NOTIFY_LOCK fails to wake a waiterBenjamin Coddington2017-07-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | nfs4_retry_setlk() sets the task's state to TASK_INTERRUPTIBLE within the same region protected by the wait_queue's lock after checking for a notification from CB_NOTIFY_LOCK callback. However, after releasing that lock, a wakeup for that task may race in before the call to freezable_schedule_timeout_interruptible() and set TASK_WAKING, then freezable_schedule_timeout_interruptible() will set the state back to TASK_INTERRUPTIBLE before the task will sleep. The result is that the task will sleep for the entire duration of the timeout. Since we've already set TASK_INTERRUPTIBLE in the locked section, just use freezable_schedule_timout() instead. Fixes: a1d617d8f134 ("nfs: allow blocking locks to be awoken by lock callbacks") Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Cc: stable@vger.kernel.org # v4.9+ Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * NFS: Optimize fallocate by refreshing mapping when needed.NeilBrown2017-07-271-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | posix_fallocate() will allocate space in an NFS file by considering the last byte of every 4K block. If it is before EOF, it will read the byte and if it is zero, a zero is written out. If it is after EOF, the zero is unconditionally written. For the blocks beyond EOF, if NFS believes its cache is valid, it will expand these writes to write full pages, and then will merge the pages. This results if (typically) 1MB writes. If NFS believes its cache is not valid (particularly if NFS_INO_INVALID_DATA or NFS_INO_REVAL_PAGECACHE are set - see nfs_write_pageuptodate()), it will send the individual 1-byte writes. This results in (typically) 256 times as many RPC requests, and can be substantially slower. Currently nfs_revalidate_mapping() is only used when reading a file or mmapping a file, as these are times when the content needs to be up-to-date. Writes don't generally need the cache to be up-to-date, but writes beyond EOF can benefit, particularly in the posix_fallocate() case. So this patch calls nfs_revalidate_mapping() when writing beyond EOF - i.e. when there is a gap between the end of the file and the start of the write. If the cache is thought to be out of date (as happens after taking a file lock), this will cause a GETATTR, and the two flags mentioned above will be cleared. With this, posix_fallocate() on a newly locked file does not generate excessive tiny writes. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * NFS: invalidate file size when taking a lock.NeilBrown2017-07-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Prior to commit ca0daa277aca ("NFS: Cache aggressively when file is open for writing"), NFS would revalidate, or invalidate, the file size when taking a lock. Since that commit it only invalidates the file content. If the file size is changed on the server while wait for the lock, the client will have an incorrect understanding of the file size and could corrupt data. This particularly happens when writing beyond the (supposed) end of file and can be easily be demonstrated with posix_fallocate(). If an application opens an empty file, waits for a write lock, and then calls posix_fallocate(), glibc will determine that the underlying filesystem doesn't support fallocate (assuming version 4.1 or earlier) and will write out a '0' byte at the end of each 4K page in the region being fallocated that is after the end of the file. NFS will (usually) detect that these writes are beyond EOF and will expand them to cover the whole page, and then will merge the pages. Consequently, NFS will write out large blocks of zeroes beyond where it thought EOF was. If EOF had moved, the pre-existing part of the file will be over-written. Locking should have protected against this, but it doesn't. This patch restores the use of nfs_zap_caches() which invalidated the cached attributes. When posix_fallocate() asks for the file size, the request will go to the server and get a correct answer. cc: stable@vger.kernel.org (v4.8+) Fixes: ca0daa277aca ("NFS: Cache aggressively when file is open for writing") Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * NFS: Use raw NFS access mask in nfs4_opendata_access()Anna Schumaker2017-07-261-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit bd8b2441742b ("NFS: Store the raw NFS access mask in the inode's access cache") changed how the access results are stored after an access() call. An NFS v4 OPEN might have access bits returned with the opendata, so we should use the NFS4_ACCESS values when determining the return value in nfs4_opendata_access(). Fixes: bd8b2441742b ("NFS: Store the raw NFS access mask in the inode's access cache") Reported-by: Eryu Guan <eguan@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Tested-by: Takashi Iwai <tiwai@suse.de>
* | Merge tag 'xfs-4.13-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds2017-07-286-3/+39
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull xfs fixes from Darrick Wong: - fix firstfsb variables that we left uninitialized, which could lead to locking problems. - check for NULL metadata buffer pointers before using them. - don't allow btree cursor manipulation if the btree block is corrupt. Better to just shut down. - fix infinite loop problems in quotacheck. - fix buffer overrun when validating directory blocks. - fix deadlock problem in bunmapi. * tag 'xfs-4.13-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: fix multi-AG deadlock in xfs_bunmapi xfs: check that dir block entries don't off the end of the buffer xfs: fix quotacheck dquot id overflow infinite loop xfs: check _alloc_read_agf buffer pointer before using xfs: set firstfsb to NULLFSBLOCK before feeding it to _bmapi_write xfs: check _btree_check_block value
| * | xfs: fix multi-AG deadlock in xfs_bunmapiChristoph Hellwig2017-07-261-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Just like in the allocator we must avoid touching multiple AGs out of order when freeing blocks, as freeing still locks the AGF and can cause the same AB-BA deadlocks as in the allocation path. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * | xfs: check that dir block entries don't off the end of the bufferDarrick J. Wong2017-07-251-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | When we're checking the entries in a directory buffer, make sure that the entry length doesn't push us off the end of the buffer. Found via xfs/388 writing ones to the length fields. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
| * | xfs: fix quotacheck dquot id overflow infinite loopBrian Foster2017-07-241-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a dquot has an id of U32_MAX, the next lookup index increment overflows the uint32_t back to 0. This starts the lookup sequence over from the beginning, repeats indefinitely and results in a livelock. Update xfs_qm_dquot_walk() to explicitly check for the lookup overflow and exit the loop. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * | xfs: check _alloc_read_agf buffer pointer before usingDarrick J. Wong2017-07-202-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In some circumstances, _alloc_read_agf can return an error code of zero but also a null AGF buffer pointer. Check for this and jump out. Fixes-coverity-id: 1415250 Fixes-coverity-id: 1415320 Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
| * | xfs: set firstfsb to NULLFSBLOCK before feeding it to _bmapi_writeDarrick J. Wong2017-07-202-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We must initialize the firstfsb parameter to _bmapi_write so that it doesn't incorrectly treat stack garbage as a restriction on which AGs it can search for free space. Fixes-coverity-id: 1402025 Fixes-coverity-id: 1415167 Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
| * | xfs: check _btree_check_block valueDarrick J. Wong2017-07-201-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Check the _btree_check_block return value for the firstrec and lastrec functions, since we have the ability to signal that the repositioning did not succeed. Fixes-coverity-id: 114067 Fixes-coverity-id: 114068 Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
* | | Merge branch 'for-4.13-part3' of ↵Linus Torvalds2017-07-283-10/+8
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Fixes addressing problems reported by users, and there's one more regression fix" * 'for-4.13-part3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: round down size diff when shrinking/growing device Btrfs: fix early ENOSPC due to delalloc btrfs: fix lockup in find_free_extent with read-only block groups Btrfs: fix dir item validation when replaying xattr deletes
| * | | btrfs: round down size diff when shrinking/growing deviceNikolay Borisov2017-07-241-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Further testing showed that the fix introduced in 7dfb8be11b5d ("btrfs: Round down values which are written for total_bytes_size") was insufficient and it could still lead to discrepancies between the total_bytes in the super block and the device total bytes. So this patch also ensures that the difference between old/new sizes when shrinking/growing is also rounded down. This ensure that we won't be subtracting/adding a non-sectorsize multiples to the superblock/device total sizees. Fixes: 7dfb8be11b5d ("btrfs: Round down values which are written for total_bytes_size") Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | Btrfs: fix early ENOSPC due to delallocOmar Sandoval2017-07-241-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a lot of metadata is reserved for outstanding delayed allocations, we rely on shrink_delalloc() to reclaim metadata space in order to fulfill reservation tickets. However, shrink_delalloc() has a shortcut where if it determines that space can be overcommitted, it will stop early. This made sense before the ticketed enospc system, but now it means that shrink_delalloc() will often not reclaim enough space to fulfill any tickets, leading to an early ENOSPC. (Reservation tickets don't care about being able to overcommit, they need every byte accounted for.) Fix it by getting rid of the shortcut so that shrink_delalloc() reclaims all of the metadata it is supposed to. This fixes early ENOSPCs we were seeing when doing a btrfs receive to populate a new filesystem, as well as early ENOSPCs Christoph saw when doing a big cp -r onto Btrfs. Fixes: 957780eb2788 ("Btrfs: introduce ticketed enospc infrastructure") Tested-by: Christoph Anton Mitterer <mail@christoph.anton.mitterer.name> Cc: stable@vger.kernel.org Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | btrfs: fix lockup in find_free_extent with read-only block groupsJeff Mahoney2017-07-241-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we have a block group that is all of the following: 1) uncached in memory 2) is read-only 3) has a disk cache state that indicates we need to recreate the cache AND the file system has enough free space fragmentation such that the request for an extent of a given size can't be honored; AND have a single CPU core; AND it's the block group with the highest starting offset such that there are no opportunities (like reading from disk) for the loop to yield the CPU; We can end up with a lockup. The root cause is simple. Once we're in the position that we've read in all of the other block groups directly and none of those block groups can honor the request, there are no more opportunities to sleep. We end up trying to start a caching thread which never gets run if we only have one core. This *should* present as a hung task waiting on the caching thread to make some progress, but it doesn't. Instead, it degrades into a busy loop because of the placement of the read-only check. During the first pass through the loop, block_group->cached will be set to BTRFS_CACHE_STARTED and have_caching_bg will be set. Then we hit the read-only check and short circuit the loop. We're not yet in LOOP_CACHING_WAIT, so we skip that loop back before going through the loop again for other raid groups. Then we move to LOOP_CACHING_WAIT state. During the this pass through the loop, ->cached will still be BTRFS_CACHE_STARTED, which means it's not cached, so we'll enter cache_block_group, do a lot of nothing, and return, and also set have_caching_bg again. Then we hit the read-only check and short circuit the loop. The same thing happens as before except now we DO trigger the LOOP_CACHING_WAIT && have_caching_bg check and loop back up to the top. We do this forever. There are two fixes in this patch since they address the same underlying bug. The first is to add a cond_resched to the end of the loop to ensure that the caching thread always has an opportunity to run. This will fix the soft lockup issue, but find_free_extent will still loop doing nothing until the thread has completed. The second is to move the read-only check to the top of the loop. We're never going to return an allocation within a read-only block group so we may as well skip it early. The check for ->cached == BTRFS_CACHE_ERROR would cause the same problem except that BTRFS_CACHE_ERROR is considered a "done" state and we won't re-set have_caching_bg again. Many thanks to Stephan Kulow <coolo@suse.de> for his excellent help in the testing process. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | Btrfs: fix dir item validation when replaying xattr deletesFilipe Manana2017-07-191-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We were passing an incorrect slot number to the function that validates directory items when we are replaying xattr deletes from a log tree. The correct slot is stored at variable 'i' and not at 'path->slots[0]', so the call to the validation function was only correct for the first iteration of the loop, when 'i == path->slots[0]'. After this fix, the fstest generic/066 passes again. Fixes: 8ee8c2d62d5f ("btrfs: Verify dir_item in replay_xattr_deletes") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | | | Merge tag 'jfs-4.13' of git://github.com/kleikamp/linux-shaggyLinus Torvalds2017-07-253-12/+20
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull JFS fixes from David Kleikamp. * tag 'jfs-4.13' of git://github.com/kleikamp/linux-shaggy: jfs: preserve i_mode if __jfs_set_acl() fails jfs: Don't clear SGID when inheriting ACLs jfs: atomically read inode size
| * | | | jfs: preserve i_mode if __jfs_set_acl() failsErnesto A. Fernández2017-07-181-4/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When changing a file's acl mask, __jfs_set_acl() will first set the group bits of i_mode to the value of the mask, and only then set the actual extended attribute representing the new acl. If the second part fails (due to lack of space, for example) and the file had no acl attribute to begin with, the system will from now on assume that the mask permission bits are actual group permission bits, potentially granting access to the wrong users. Prevent this by only changing the inode mode after the acl has been set. Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
| * | | | jfs: Don't clear SGID when inheriting ACLsJan Kara2017-07-181-7/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit set, DIR1 is expected to have SGID bit set (and owning group equal to the owning group of 'DIR0'). However when 'DIR0' also has some default ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on 'DIR1' to get cleared if user is not member of the owning group. Fix the problem by moving posix_acl_update_mode() out of __jfs_set_acl() into jfs_set_acl(). That way the function will not be called when inheriting ACLs which is what we want as it prevents SGID bit clearing and the mode has been properly set by posix_acl_create() anyway. Fixes: 073931017b49d9458aa351605b43a7e34598caef CC: stable@vger.kernel.org CC: jfs-discussion@lists.sourceforge.net Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
| * | | | jfs: atomically read inode sizeFabian Frederick2017-02-092-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | See i_size_read() comments in include/linux/fs.h Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
* | | | | Merge tag 'nfs-for-4.13-2' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds2017-07-217-33/+69
|\ \ \ \ \ | | |_|_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NFS client bugfixes from Anna Schumaker: "Stable bugfix: - Fix error reporting regression Bugfixes: - Fix setting filelayout ds address race - Fix subtle access bug when using ACLs - Fix setting mnt3_counts array size - Fix a couple of pNFS commit races" * tag 'nfs-for-4.13-2' of git://git.linux-nfs.org/projects/anna/linux-nfs: NFS/filelayout: Fix racy setting of fl->dsaddr in filelayout_check_deviceid() NFS: Be more careful about mapping file permissions NFS: Store the raw NFS access mask in the inode's access cache NFSv3: Convert nfs3_proc_access() to use nfs_access_set_mask() NFS: Refactor NFS access to kernel access mask calculation net/sunrpc/xprt_sock: fix regression in connection error reporting. nfs: count correct array for mnt3_counts array size Revert commit 722f0b891198 ("pNFS: Don't send COMMITs to the DSes if...") pNFS/flexfiles: Handle expired layout segments in ff_layout_initiate_commit() NFS: Fix another COMMIT race in pNFS NFS: Fix a COMMIT race in pNFS mount: copy the port field into the cloned nfs_server structure. NFS: Don't run wake_up_bit() when nobody is waiting... nfs: add export operations
| * | | | NFS/filelayout: Fix racy setting of fl->dsaddr in filelayout_check_deviceid()Trond Myklebust2017-07-211-2/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We must set fl->dsaddr once, and once only, even if there are multiple processes calling filelayout_check_deviceid() for the same layout segment. Reported-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * | | | NFS: Be more careful about mapping file permissionsTrond Myklebust2017-07-211-8/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When mapping a directory, we want the MAY_WRITE permissions to reflect whether or not we have permission to modify, add and delete the directory entries. MAY_EXEC must map to lookup permissions. On the other hand, for files, we want MAY_WRITE to reflect a permission to modify and extend the file. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * | | | NFS: Store the raw NFS access mask in the inode's access cacheTrond Myklebust2017-07-211-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * | | | NFSv3: Convert nfs3_proc_access() to use nfs_access_set_mask()Trond Myklebust2017-07-211-9/+2
| | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * | | | NFS: Refactor NFS access to kernel access mask calculationTrond Myklebust2017-07-211-8/+23
| | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * | | | nfs: count correct array for mnt3_counts array sizeEryu Guan2017-07-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Array size of mnt3_counts should be the size of array mnt3_procedures, not mnt_procedures, though they're same in size right now. Found this by code inspection. Fixes: 1c5876ddbdb4 ("sunrpc: move p_count out of struct rpc_procinfo") Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Eryu Guan <eguan@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * | | | Revert commit 722f0b891198 ("pNFS: Don't send COMMITs to the DSes if...")Trond Myklebust2017-07-191-7/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Doing the test without taking any locks is racy, and so really it makes more sense to do it in the flexfiles code (which is the only case that cares). Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * | | | pNFS/flexfiles: Handle expired layout segments in ff_layout_initiate_commit()Trond Myklebust2017-07-191-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the layout has expired due to a fencing event, then we should not attempt to commit to the DS. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * | | | NFS: Fix another COMMIT race in pNFSTrond Myklebust2017-07-191-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We must make sure that cinfo->ds->ncommitting is in sync with the commit list, since it is checked as part of pnfs_commit_list(). Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * | | | NFS: Fix a COMMIT race in pNFSTrond Myklebust2017-07-191-3/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We must make sure that cinfo->ds->nwritten is in sync with the commit list, since it is checked as part of pnfs_scan_commit_lists(). Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * | | | mount: copy the port field into the cloned nfs_server structure.Steve Dickson2017-07-191-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Doing this copy eliminates the "port=0" entry in the /proc/mounts entries Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=69241 Signed-off-by: Steve Dickson <steved@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * | | | NFS: Don't run wake_up_bit() when nobody is waiting...Trond Myklebust2017-07-131-1/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | "perf lock" shows fairly heavy contention for the bit waitqueue locks when doing an I/O heavy workload. Use a bit to tell whether or not there has been contention for a lock so that we can optimise away the bit waitqueue options in those cases. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * | | | nfs: add export operationsPeng Tao2017-07-134-1/+182
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This support for opening files on NFS by file handle, both through the open_by_handle syscall, and for re-exporting NFS (for example using a different version). The support is very basic for now, as each open by handle will have to do an NFSv4 open operation on the wire. In the future this will hopefully be mitigated by an open file cache, as well as various optimizations in NFS for this specific case. Signed-off-by: Peng Tao <tao.peng@primarydata.com> [hch: incorporated various changes, resplit the patches, new changelog] Signed-off-by: Christoph Hellwig <hch@lst.de>
* | | | | Merge branch 'overlayfs-linus' of ↵Linus Torvalds2017-07-217-42/+88
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs fixes from Miklos Szeredi: "This fixes a crash with SELinux and several other old and new bugs" * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: check for bad and whiteout index on lookup ovl: do not cleanup directory and whiteout index entries ovl: fix xattr get and set with selinux ovl: remove unneeded check for IS_ERR() ovl: fix origin verification of index dir ovl: mark parent impure on ovl_link() ovl: fix random return value on mount
| * | | | | ovl: check for bad and whiteout index on lookupAmir Goldstein2017-07-201-5/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Index should always be of the same file type as origin, except for the case of a whiteout index. A whiteout index should only exist if all lower aliases have been unlinked, which means that finding a lower origin on lookup whose index is a whiteout should be treated as a lookup error. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | | | | ovl: do not cleanup directory and whiteout index entriesAmir Goldstein2017-07-202-5/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Directory index entries are going to be used for looking up redirected upper dirs by lower dir fh when decoding an overlay file handle of a merge dir. Whiteout index entries are going to be used as an indication that an exported overlay file handle should be treated as stale (i.e. after unlink of the overlay inode). We don't know the verification rules for directory and whiteout index entries, because they have not been implemented yet, so fail to mount overlay rw if those entries are found to avoid corrupting an index that was created by a newer kernel. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | | | | ovl: fix xattr get and set with selinuxMiklos Szeredi2017-07-204-23/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | inode_doinit_with_dentry() in SELinux wants to read the upper inode's xattr to get security label, and ovl_xattr_get() calls ovl_dentry_real(), which depends on dentry->d_inode, but d_inode is null and not initialized yet at this point resulting in an Oops. Fix by getting the upperdentry info from the inode directly in this case. Reported-by: Eryu Guan <eguan@redhat.com> Fixes: 09d8b586731b ("ovl: move __upperdentry to ovl_inode") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | | | | ovl: remove unneeded check for IS_ERR()Amir Goldstein2017-07-131-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ovl_workdir_create() returns a valid index dentry or NULL. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>