* Add linux-next specific files for 20180911 [next-20180911] (Stephen Rothwell, 2018-09-11, 5 files, -0/+5315)

    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>

* Merge branch 'akpm/master' (Stephen Rothwell, 2018-09-11, 2 files, -3/+5)

  * drivers/media/platform/sti/delta/delta-ipc.c: fix read buffer overflow (Andi Kleen, 2018-09-11, 1 file, -2/+2)

    The single caller passes a string to delta_ipc_open(), which copies it
    with a fixed size larger than the string. So it copies some random data
    after the original string in the ro segment. If the string was at the
    end of a page it may fault. Just copy the string with a normal strcpy
    after clearing the field.

    Found by an LTO build (which errors out) because the compiler inlines
    the functions, can resolve the string sizes, and triggers the
    compile-time checks in memcpy:

        In function `memcpy',
            inlined from `delta_ipc_open.constprop' at linux/drivers/media/platform/sti/delta/delta-ipc.c:178:0,
            inlined from `delta_mjpeg_ipc_open' at linux/drivers/media/platform/sti/delta/delta-mjpeg-dec.c:227:0,
            inlined from `delta_mjpeg_decode' at linux/drivers/media/platform/sti/delta/delta-mjpeg-dec.c:403:0:
        /home/andi/lsrc/linux/include/linux/string.h:337:0: error: call to `__read_overflow2' declared with attribute error: detected read beyond size of object passed as 2nd parameter
          __read_overflow2();

    Link: http://lkml.kernel.org/r/20171222001212.1850-1-andi@firstfloor.org
    Signed-off-by: Andi Kleen <ak@linux.intel.com>
    Cc: Hugues FRUCHET <hugues.fruchet@st.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
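    A minimal sketch of the shape of the fix (the struct and field names
    here are illustrative, not the actual delta-ipc definitions):

        struct ipc_open_msg {                   /* hypothetical layout */
                char name[32];
        };

        static void fill_open_msg(struct ipc_open_msg *msg, const char *name)
        {
                /* Clear the whole field so no stale bytes go out... */
                memset(msg->name, 0, sizeof(msg->name));
                /* ...then copy only the terminated string, instead of
                 * memcpy()ing sizeof(msg->name) bytes from a shorter
                 * read-only literal (the caller passes a short constant
                 * string, per the commit message). */
                strcpy(msg->name, name);
        }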
  * vfs: replace current_kernel_time64 with ktime equivalent (Arnd Bergmann, 2018-09-11, 1 file, -1/+3)

    current_time() is the last remaining caller of current_kernel_time64(),
    which is a wrapper around ktime_get_coarse_real_ts64(). Call the latter
    directly, for consistency with the rest of the kernel that is moving to
    the ktime_get_ family of time accessors, as now documented in
    Documentation/core-api/timekeeping.rst.

    An open question is whether we may want to actually call the more
    accurate ktime_get_real_ts64() for file systems that save
    high-resolution timestamps in their on-disk format. This would add a
    small overhead to each update of the inode stamps but lead to inode
    timestamps actually having a usable resolution better than one jiffy
    (1 to 10 milliseconds normally). Experiments on a variety of hardware
    platforms show a typical time of around 100 CPU cycles to read the
    cycle counter and calculate the accurate time from that. On old
    platforms without a cycle counter, this can be significantly higher,
    up to several microseconds to access a hardware clock, but those have
    become very rare by now.

    I traced the original addition of the current_kernel_time() call to
    set the nanosecond fields back to linux-2.5.48, where Andi Kleen added
    a patch with subject "nanosecond stat timefields". Andi explains that
    the motivation was to introduce as little overhead as possible back
    then. At that time, reading the clock hardware was also more
    expensive, since most architectures did not have a cycle counter.

    One side effect of having more accurate inode timestamps would be
    having to write out the inode every time that mtime/ctime/atime get
    touched on most systems, whereas many file systems today only write it
    when the timestamps have changed, i.e. at most once per jiffy, unless
    something else changes as well. That change would certainly be noticed
    in some workloads, which is enough reason not to do it without a good
    reason, regardless of the cost of reading the time.

    One thing we could still consider, however, would be to round the
    timestamps from current_time() to multiples of NSEC_PER_JIFFY, e.g.
    full milliseconds, rather than having six or seven meaningless but
    confusing digits at the end of the timestamp.

    Link: http://lkml.kernel.org/r/20180726130820.4174359-1-arnd@arndb.de
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
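    After the change, current_time() plausibly looks like the sketch below
    (assuming the pre-existing timespec64_trunc() granularity helper; not
    a verbatim copy of fs/inode.c):

        struct timespec64 current_time(struct inode *inode)
        {
                struct timespec64 now;

                /* Coarse clock: returns the timestamp of the last timer
                 * tick, without touching the clock hardware. */
                ktime_get_coarse_real_ts64(&now);

                /* Round down to the filesystem's advertised timestamp
                 * granularity. */
                return timespec64_trunc(now, inode->i_sb->s_time_gran);
        }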
* Merge branch 'akpm-current/current' (Stephen Rothwell, 2018-09-11, 94 files, -901/+1296)

  * bfs: add sanity check at bfs_fill_super() (Tetsuo Handa, 2018-09-11, 1 file, -3/+6)

    syzbot is reporting too large a memory allocation at bfs_fill_super()
    [1]. Since the file system image is corrupted such that
    bfs_sb->s_start == 0, bfs_fill_super() is trying to allocate 8MB of
    contiguous memory. Fix this by adding a sanity check on
    bfs_sb->s_start, plus __GFP_NOWARN and printf().

    [1] https://syzkaller.appspot.com/bug?id=16a87c236b951351374a84c8a32f40edbc034e96

    Link: http://lkml.kernel.org/r/1525862104-3407-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Reported-by: syzbot <syzbot+71c6b5d68e91149fc8a4@syzkaller.appspotmail.com>
    Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
    Cc: Tigran Aivazian <aivazian.tigran@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
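    A sketch of what such a check could look like in bfs_fill_super()
    (the bounds are illustrative and may differ from the actual patch;
    printf() here is bfs's own printk wrapper from fs/bfs/bfs.h):

        /* Reject images whose start block lies past the end block. */
        if (le32_to_cpu(bfs_sb->s_start) > le32_to_cpu(bfs_sb->s_end)) {
                printf("Superblock is corrupted\n");
                goto out1;
        }

        /* A corrupt s_start also inflates imap_len; allocate without
         * the page-allocation warning so the syzbot splat goes away. */
        info->si_imap = kzalloc(imap_len, GFP_KERNEL | __GFP_NOWARN);
        if (!info->si_imap)
                goto out1;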
  * reiserfs: propagate errors from fill_with_dentries() properly (Jann Horn, 2018-09-11, 1 file, -0/+7)

    fill_with_dentries() failed to propagate errors up to
    reiserfs_for_each_xattr() properly. Plumb them through.

    Note that reiserfs_for_each_xattr() is only used by
    reiserfs_delete_xattrs() and reiserfs_chown_xattrs(). The result of
    reiserfs_delete_xattrs() is discarded anyway; the only difference
    there is whether a warning is printed to dmesg. The result of
    reiserfs_chown_xattrs() does matter because it can block chowning of
    the file to which the xattrs belong; but either way, the resulting
    state can have misaligned ownership, so my patch doesn't improve
    things greatly.

    Credit for making me look at this code goes to Al Viro, who pointed
    out that the ->actor calling convention is suboptimal and should be
    changed.

    Link: http://lkml.kernel.org/r/20180802163335.83312-1-jannh@google.com
    Signed-off-by: Jann Horn <jannh@google.com>
    Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
    Cc: Jeff Mahoney <jeffm@suse.com>
    Cc: Eric Biggers <ebiggers@google.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
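    A hedged sketch of the plumbing (field names may differ from the
    actual patch; the key point is that a non-zero return from a
    dir_context actor stops the walk, so the error must also be recorded
    somewhere the caller can see it):

        static int fill_with_dentries(struct dir_context *ctx, const char *name,
                                      int namelen, loff_t offset, u64 ino,
                                      unsigned int d_type)
        {
                struct reiserfs_dentry_buf *dbuf =
                        container_of(ctx, struct reiserfs_dentry_buf, ctx);
                struct dentry *dentry;

                dentry = lookup_one_len(name, dbuf->xadir, namelen);
                if (IS_ERR(dentry)) {
                        dbuf->err = PTR_ERR(dentry);    /* record for the caller */
                        return PTR_ERR(dentry);         /* stop the iteration   */
                }
                /* (store dentry in dbuf->dentries[] as before) */
                return 0;
        }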
  * hfs: prevent btree data loss on ENOSPC (Ernesto A. Fernández, 2018-09-11, 4 files, -15/+45)

    Inserting a new record in a btree may require splitting several of its
    nodes. If we hit ENOSPC halfway through, the new nodes will be left
    orphaned and their records will be lost. This could mean lost inodes
    or extents.

    From now on, check the available disk space before making any changes.
    This still leaves the potential problem of corruption on ENOMEM.

    There is no need to reserve space before deleting a catalog record, as
    we do for hfsplus. This difference is because hfs index nodes have
    fixed-length keys.

    Link: http://lkml.kernel.org/r/68e77996d38baf789e8dc108ca6d210c5987f2ea.1535682464.git.ernesto.mnd.fernandez@gmail.com
    Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
  * hfs: fix BUG on bnode parent update (Ernesto A. Fernández, 2018-09-11, 1 file, -0/+1)

    hfs_brec_update_parent() may hit BUG_ON() if the first record of both
    a leaf node and its parent are changed, and if this forces the parent
    to be split. It is not possible for this to happen on a valid hfs
    filesystem because the index nodes have fixed-length keys.

    For reasons unknown to me, the hfs module does have support for a
    number of hfsplus features. A corrupt btree header may report
    variable-length keys and trigger this BUG, so it's better to fix it.

    Link: http://lkml.kernel.org/r/cf9b02d57f806217a2b1bf5db8c3e39730d8f603.1535682463.git.ernesto.mnd.fernandez@gmail.com
    Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
  * hfs: prevent btree data loss on root split (Ernesto A. Fernández, 2018-09-11, 1 file, -0/+4)

    This bug is triggered whenever hfs_brec_update_parent() needs to split
    the root node. The height of the btree is not increased, which leaves
    the new node orphaned and its records lost. It is not possible for
    this to happen on a valid hfs filesystem because the index nodes have
    fixed-length keys.

    For reasons unknown to me, the hfs module does have support for a
    number of hfsplus features. A corrupt btree header may report
    variable-length keys and trigger this bug, so it's better to fix it.

    Link: http://lkml.kernel.org/r/9750b1415685c4adca10766895f6d5ef12babdb0.1535682463.git.ernesto.mnd.fernandez@gmail.com
    Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
  * hfsplus: prevent btree data loss on ENOSPC (Ernesto A. Fernández, 2018-09-11, 5 files, -15/+67)

    Inserting or deleting a record in a btree may require splitting
    several of its nodes. If we hit ENOSPC halfway through, the new nodes
    will be left orphaned and their records will be lost. This could mean
    lost inodes, extents or xattrs.

    Henceforth, check the available disk space before making any changes.
    This still leaves the potential problem of corruption on ENOMEM.

    The patch can be tested with xfstests generic/027.

    Link: http://lkml.kernel.org/r/4ecd0a1d09756aadb173809e18285e807994e90b.1535682462.git.ernesto.mnd.fernandez@gmail.com
    Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
  * hfsplus: fix BUG on bnode parent update (Ernesto A. Fernández, 2018-09-11, 1 file, -0/+1)

    Creating, renaming or deleting a file may hit BUG_ON() if the first
    record of both a leaf node and its parent are changed, and if this
    forces the parent to be split. This bug is triggered by xfstests
    generic/027, somewhat rarely; here is a more reliable reproducer:

        truncate -s 50M fs.iso
        mkfs.hfsplus fs.iso
        mount fs.iso /mnt
        i=1000
        while [ $i -le 2400 ]; do
            touch /mnt/$i &>/dev/null
            ((++i))
        done
        i=2400
        while [ $i -ge 1000 ]; do
            mv /mnt/$i /mnt/$(perl -e "print $i x61") &>/dev/null
            ((--i))
        done

    The issue is that a newly created bnode is being put twice. Reset
    new_node to NULL in hfs_brec_update_parent() before reaching "goto
    again".

    Link: http://lkml.kernel.org/r/5ee1db09b60373a15890f6a7c835d00e76bf601d.1535682461.git.ernesto.mnd.fernandez@gmail.com
    Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
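    Roughly where the fix lands (control flow heavily simplified from
    fs/hfsplus/brec.c):

        again:
                /* ... update or split the parent, which may hand the
                 * freshly created new_node off to the tree ... */
                if (new_node) {
                        hfs_bnode_put(new_node);
                        /* The fix: forget the reference we just dropped,
                         * so a second pass through "goto again" cannot
                         * put the same bnode twice. */
                        new_node = NULL;
                }
                if (!rec && node->parent)
                        goto again;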
  * hfsplus: prevent btree data loss on root split (Ernesto A. Fernández, 2018-09-11, 1 file, -0/+4)

    Creating, renaming or deleting a file may cause catalog corruption and
    data loss. This bug is randomly triggered by xfstests generic/027, but
    here is a faster reproducer:

        truncate -s 50M fs.iso
        mkfs.hfsplus fs.iso
        mount fs.iso /mnt
        i=100
        while [ $i -le 150 ]; do
            touch /mnt/$i &>/dev/null
            ((++i))
        done
        i=100
        while [ $i -le 150 ]; do
            mv /mnt/$i /mnt/$(perl -e "print $i x82") &>/dev/null
            ((++i))
        done
        umount /mnt
        fsck.hfsplus -n fs.iso

    The bug is triggered whenever hfs_brec_update_parent() needs to split
    the root node. The height of the btree is not increased, which leaves
    the new node orphaned and its records lost.

    Link: http://lkml.kernel.org/r/26d882184fc43043a810114258f45277752186c7.1535682461.git.ernesto.mnd.fernandez@gmail.com
    Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
  * init/do_mounts.c: add root=PARTLABEL=<name> support (Nikolaus Voss, 2018-09-11, 1 file, -0/+31)

    Support referencing the root partition label from GPT as an argument
    to the root= option on the kernel command line, in analogy to
    referencing the partition uuid as root=PARTUUID=<uuid>.

    Specifying the partition label instead of the uuid is often much
    easier, e.g. in embedded environments when there is an A/B rootfs
    partition scheme for interruptible firmware updates (i.e.
    rootfsA/rootfsB). The partition label can be queried with the blkid
    command.

    Link: http://lkml.kernel.org/r/20180822060904.828E510665E@pc-niv.weinmann.com
    Signed-off-by: Nikolaus Voss <nikolaus.voss@loewensteinmedical.de>
    Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
    Cc: Dominik Brodowski <linux@dominikbrodowski.net>
    Cc: Sasha Levin <Alexander.Levin@microsoft.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
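    A sketch of the plumbing, mirroring the existing root=PARTUUID=
    branch in init/do_mounts.c (the lookup helper's name is an assumption
    for illustration, not necessarily what the patch adds):

        /* In the root= device-name parser: */
        if (strncmp(name, "PARTLABEL=", 10) == 0) {
                dev_t devt;

                /* Look the GPT partition up by its label rather than by
                 * its UUID (helper name assumed). */
                devt = blk_lookup_devt_by_partlabel(name + 10);
                if (!devt)
                        goto fail;      /* no matching partition */
                return devt;
        }

    Booting the A slot of the example above would then just be
    root=PARTLABEL=rootfsA.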
  * checkpatch: remove GCC_BINARY_CONSTANT warning (Christophe Leroy, 2018-09-11, 1 file, -11/+0)

    This warning was there to avoid the use of 0bxxx values, as they are
    not supported by gcc prior to v4.3. Since commit cafa0010cd51f ("Raise
    the minimum required gcc version to 4.6"), it's not an issue anymore,
    and using such values can increase the readability of code.

    Joe said:

    : Seems sensible as the other compilers also support binary literals
    : from relatively old versions.
    :
    : http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3472.pdf
    : https://software.intel.com/en-us/articles/c14-features-supported-by-intel-c-compiler

    Link: http://lkml.kernel.org/r/392eeae782302ee8812a3c932a602035deed1609.1535351453.git.christophe.leroy@c-s.fr
    Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
    Cc: Joe Perches <joe@perches.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
  * lib/parser.c: switch match_u64int() over to use match_strdup() (Eric Biggers, 2018-09-11, 1 file, -4/+1)

    This simplifies the code. No change in behavior.

    Link: http://lkml.kernel.org/r/20180830194814.192880-1-ebiggers@kernel.org
    Signed-off-by: Eric Biggers <ebiggers@google.com>
    Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
  * lib/parser.c: switch match_strdup() over to use kmemdup_nul() (Eric Biggers, 2018-09-11, 1 file, -5/+1)

    This simplifies the code. No change in behavior.

    Link: http://lkml.kernel.org/r/20180830194436.188867-1-ebiggers@kernel.org
    Signed-off-by: Eric Biggers <ebiggers@google.com>
    Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
  * lib/bitmap.c: simplify bitmap_print_to_pagebuf() (Rasmus Villemoes, 2018-09-11, 1 file, -5/+2)

    len is guaranteed to lie in [1, PAGE_SIZE]. If scnprintf() is called
    with a buffer size of 1, it is guaranteed to return 0. So in the
    extremely unlikely case of having just one byte remaining in the page,
    let's just call scnprintf() anyway. The only difference is that this
    will write a '\0' to that final byte in the page, but that's an
    improvement: we now guarantee that after the call, buf is a properly
    terminated C string of length exactly the return value.

    Link: http://lkml.kernel.org/r/20180818131623.8755-8-linux@rasmusvillemoes.dk
    Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
    Cc: Yury Norov <ynorov@caviumnetworks.com>
    Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Cc: Sudeep Holla <sudeep.holla@arm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
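    The simplified tail of bitmap_print_to_pagebuf() then plausibly
    reduces to something like this (a sketch, not the verbatim patch;
    %*pb/%*pbl are the kernel's bitmap printf extensions):

        int bitmap_print_to_pagebuf(bool list, char *buf,
                                    const unsigned long *maskp, int nmaskbits)
        {
                /* Space left in the page buf points into: in [1, PAGE_SIZE]. */
                ptrdiff_t len = PAGE_SIZE - offset_in_page(buf);

                /* scnprintf() handles len == 1 fine: it writes only the
                 * terminating '\0' and returns 0. */
                return list ? scnprintf(buf, len, "%*pbl\n", nmaskbits, maskp)
                            : scnprintf(buf, len, "%*pb\n", nmaskbits, maskp);
        }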
  * lib-bitmapc-fix-remaining-space-computation-in-bitmap_print_to_pagebuf-fix-fix (Andrew Morton, 2018-09-11, 1 file, -0/+1)

    Include mm.h for offset_in_page().

    Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
    Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Cc: Sudeep Holla <sudeep.holla@arm.com>
    Cc: Yury Norov <ynorov@caviumnetworks.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
  * lib-bitmapc-fix-remaining-space-computation-in-bitmap_print_to_pagebuf-fix (Andrew Morton, 2018-09-11, 1 file, -1/+1)

    Use offset_in_page(), per Andy.

    Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
    Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Cc: Sudeep Holla <sudeep.holla@arm.com>
    Cc: Yury Norov <ynorov@caviumnetworks.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
  * lib/bitmap.c: fix remaining space computation in bitmap_print_to_pagebuf (Rasmus Villemoes, 2018-09-11, 1 file, -4/+5)

    For various alignments of buf, the current expression computes

        4096    ok
        4095    ok
        8190
        8189
        ...
        4097

    i.e., if the caller has already written two bytes into the page
    buffer, len is 8190 rather than 4094, because PTR_ALIGN aligns up to
    the next boundary. So if the printed version of the bitmap is huge,
    scnprintf() ends up writing beyond the page boundary.

    I don't think any current callers actually write anything before
    bitmap_print_to_pagebuf, but the API seems to be designed to allow it.

    Link: http://lkml.kernel.org/r/20180818131623.8755-7-linux@rasmusvillemoes.dk
    Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
    Cc: Yury Norov <ynorov@caviumnetworks.com>
    Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Cc: Sudeep Holla <sudeep.holla@arm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
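    In other words, a before/after sketch (the exact expressions are
    reconstructed from the description; the follow-up fixes above settle
    on offset_in_page()):

        /* Before: aligning buf up to the next page boundary means a
         * non-aligned buf yields more than a page of "remaining" space
         * (e.g. buf at page+2 gives 8190): */
        len = PTR_ALIGN(buf + PAGE_SIZE - 1, PAGE_SIZE) - buf;

        /* After: the space genuinely left in the page buf points into: */
        len = PAGE_SIZE - offset_in_page(buf);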
  * linux/bitmap.h: relax comment on compile-time constant nbits (Rasmus Villemoes, 2018-09-11, 1 file, -2/+2)

    It's not clear what's so horrible about emitting a function call to
    handle a run-time-sized bitmap. Moreover, gcc also emits a function
    call for a compile-time-constant-but-huge nbits, so the comment isn't
    even accurate.

    Link: http://lkml.kernel.org/r/20180818131623.8755-6-linux@rasmusvillemoes.dk
    Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
    Cc: Yury Norov <ynorov@caviumnetworks.com>
    Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Cc: Sudeep Holla <sudeep.holla@arm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
  * linux/bitmap.h: fix type of nbits in bitmap_shift_right() (Rasmus Villemoes, 2018-09-11, 1 file, -1/+1)

    Most of the other bitmap API, including the out-of-line version
    __bitmap_shift_right(), takes an unsigned nbits. This was accidentally
    left out from 2fbad29917c98.

    Link: http://lkml.kernel.org/r/20180818131623.8755-5-linux@rasmusvillemoes.dk
    Fixes: 2fbad29917c98 ("lib: bitmap: change bitmap_shift_right to take unsigned parameters")
    Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Reported-by: Yury Norov <ynorov@caviumnetworks.com>
    Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
    Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Cc: Sudeep Holla <sudeep.holla@arm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
  * linux/bitmap.h: remove redundant uses of small_const_nbits() (Rasmus Villemoes, 2018-09-11, 1 file, -18/+6)

    In the _zero, _fill and _copy functions, the small_const_nbits branch
    is redundant. If nbits is small and const, gcc knows full well that
    BITS_TO_LONGS(nbits) is 1, so len is also a compile-time constant
    (sizeof(long)), and calling memset or memcpy with a length argument of
    sizeof(long) makes gcc generate the expected code anyway:

        #include <string.h>

        void a(unsigned long *x) { memset(x, 0, 8); }
        void b(unsigned long *x) { memset(x, 0xff, 8); }
        void c(unsigned long *x, const unsigned long *y) { memcpy(x, y, 8); }

    turns into

        0000000000000000 <a>:
           0:   48 c7 07 00 00 00 00    movq   $0x0,(%rdi)
           7:   c3                      retq
           8:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
           f:   00

        0000000000000010 <b>:
          10:   48 c7 07 ff ff ff ff    movq   $0xffffffffffffffff,(%rdi)
          17:   c3                      retq
          18:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
          1f:   00

        0000000000000020 <c>:
          20:   48 8b 06                mov    (%rsi),%rax
          23:   48 89 07                mov    %rax,(%rdi)
          26:   c3                      retq

    Link: http://lkml.kernel.org/r/20180818131623.8755-4-linux@rasmusvillemoes.dk
    Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
    Cc: Yury Norov <ynorov@caviumnetworks.com>
    Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Cc: Sudeep Holla <sudeep.holla@arm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
  * linux/bitmap.h: handle constant zero-size bitmaps correctly (Rasmus Villemoes, 2018-09-11, 1 file, -1/+6)

    The static inlines in bitmap.h do not handle a compile-time constant
    nbits==0 correctly (they dereference the passed src or dst pointers,
    despite only 0 words being valid to access). I had the 0-day buildbot
    chew on a patch [1] that would cause build failures for such cases
    without complaining, suggesting that we don't have any such users
    currently, at least for the 70 .config/arch combinations that was
    built.

    Should any turn up, make sure they use the out-of-line versions, which
    do handle nbits==0 correctly. This is of course not the most
    efficient, but it's much less churn than teaching all the static
    inlines an "if (zero_const_nbits())", and since we don't have any
    current instances, this doesn't affect existing code at all.

    [1] lkml.kernel.org/r/20180815085539.27485-1-linux@rasmusvillemoes.dk

    Link: http://lkml.kernel.org/r/20180818131623.8755-3-linux@rasmusvillemoes.dk
    Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
    Cc: Yury Norov <ynorov@caviumnetworks.com>
    Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Cc: Sudeep Holla <sudeep.holla@arm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
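    Concretely, the dispatch predicate just needs to exclude zero; a
    sketch of the idea:

        /* A compile-time-constant nbits == 0 no longer counts as "small":
         * it falls through to the out-of-line helpers, which loop zero
         * times and never touch the src/dst pointers. */
        #define small_const_nbits(nbits) \
                (__builtin_constant_p(nbits) && (nbits) <= BITS_PER_LONG && (nbits) > 0)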
  * lib/bitmap.c: remove wrong documentation (Rasmus Villemoes, 2018-09-11, 1 file, -5/+0)

    This promise is violated in a number of places, e.g. already in the
    second function below this paragraph. Since I don't think anybody
    relies on this being true, and since actually honouring it would hurt
    performance and code size in various places, just remove the
    paragraph.

    Link: http://lkml.kernel.org/r/20180818131623.8755-2-linux@rasmusvillemoes.dk
    Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
    Cc: Yury Norov <ynorov@caviumnetworks.com>
    Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Cc: Sudeep Holla <sudeep.holla@arm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
  * fs/buffer.c: add debug print for __getblk_gfp() stall problem (Tetsuo Handa, 2018-09-11, 3 files, -2/+61)

    Among syzbot's unresolved hung task reports, 18 out of 65 contain the
    __getblk_gfp() line in the backtrace. Since there is a comment block
    that says that __getblk_gfp() will lock up the machine if the
    try_to_free_buffers() attempt from grow_dev_page() keeps failing,
    let's start by checking whether syzbot is hitting that case. This
    change will be removed after the bug is fixed.

    Link: http://lkml.kernel.org/r/9b9fcdda-c347-53ee-fdbb-8a7d11cf430e@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jeff Layton <jlayton@redhat.com>
    Cc: <syzkaller-bugs@googlegroups.com>
    Cc: Tim Chen <tim.c.chen@linux.intel.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
  * mm/page_owner: align with pageblock_nr_pages (zhong jiang, 2018-09-11, 1 file, -1/+1)

    When pfn_valid(pfn) returns false, pfn should be aligned to
    pageblock_nr_pages rather than MAX_ORDER_NR_PAGES in
    init_pages_in_zone(), because the skipped 2M may contain valid pfns;
    otherwise, the early-allocated count will not be accurate.

    Link: http://lkml.kernel.org/r/1468938136-24228-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang <zhongjiang@huawei.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
  * mm/page_owner: align with pageblock_nr_pages (Marc-André Lureau, 2018-09-11, 1 file, -1/+1)

    Currently, init_pages_in_zone() walks the zone in pageblock_nr_pages
    steps, but skips invalid ranges in MAX_ORDER_NR_PAGES steps. A
    MAX_ORDER_NR_PAGES range can contain holes when CONFIG_HOLES_IN_ZONE
    is set, and MAX_ORDER_NR_PAGES is likely to differ from
    pageblock_nr_pages, so skipping a whole MAX_ORDER_NR_PAGES range can
    leak the second 2M of memory in it.

    Meanwhile, the change makes the code consistent, because the entire
    function is based on pageblock_nr_pages steps.

    Link: http://lkml.kernel.org/r/1512395284-13588-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang <zhongjiang@huawei.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
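    The change itself is essentially one line inside the zone walk of
    init_pages_in_zone() (a sketch; surrounding loop omitted):

        if (!pfn_valid(pfn)) {
                /* Skip to the next pageblock, not the next MAX_ORDER
                 * block: the rest of this MAX_ORDER block may still
                 * contain valid pageblocks that must be counted. */
                pfn = ALIGN(pfn + 1, pageblock_nr_pages);
                continue;
        }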
  * mm: don't expose page to fast gup before it's ready (Yu Zhao, 2018-09-11, 2 files, -3/+3)

    We don't want to expose a page before it's properly set up. During
    page setup, we may call page_add_new_anon_rmap(), which uses a
    non-atomic bit op. If the page is exposed before that's done, we could
    overwrite page flags that are set by get_user_pages_fast() or its
    callers. Here is a non-fatal scenario (there might be other fatal
    problems that I didn't look into):

        CPU 1                           CPU 2
        set_pte_at()                    get_user_pages_fast()
        page_add_new_anon_rmap()          gup_pte_range()
          __SetPageSwapBacked()             SetPageReferenced()

    Fix the problem by delaying set_pte_at() until the page is ready.

    I didn't observe the race directly. But I did get a few crashes when
    trying to access the mem_cgroup of pages returned by
    get_user_pages_fast(). Those pages were charged and they showed a
    valid mem_cgroup in kdumps. So this led me to think the problem came
    from the premature set_pte_at().

    I think the fact that nobody complained about this problem is because
    the race only happens when using ksm+swap, and it might not cause any
    fatal problem even so. Nevertheless, it's nice to have set_pte_at()
    done consistently after rmap is added and the page is charged.

    Link: http://lkml.kernel.org/r/20180108225632.16332-1-yuzhao@google.com
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
  * mm: add strictlimit knob (Maxim Patlasov, 2018-09-11, 2 files, -0/+43)

    The "strictlimit" feature was introduced to enforce per-bdi dirty
    limits for FUSE, which sets bdi max_ratio to 1% by default:
    http://article.gmane.org/gmane.linux.kernel.mm/105809

    However, the feature can be useful for other relatively slow or
    untrusted BDIs like USB flash drives and DVD+RW. The patch adds a knob
    to enable the feature:

        echo 1 > /sys/class/bdi/X:Y/strictlimit

    Being enabled, the feature enforces the bdi max_ratio limit even if
    the global (10%) dirty limit is not reached. Of course, the effect is
    not visible until /sys/class/bdi/X:Y/max_ratio is decreased to some
    reasonable value.

    Jan said:

    : In principle I have nothing against this and the usecase sounds
    : reasonable (in fact I believe the lack of a feature like this is one
    : of reasons why desktop automounters usually mount USB devices with
    : 'sync' mount option). So feel free to add:

    Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
    Cc: Theodore Ts'o <tytso@mit.edu>
    Cc: "Artem S. Tashkinov" <t.artem@lycos.com>
    Cc: Mel Gorman <mel@csn.ul.ie>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Wu Fengguang <fengguang.wu@intel.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
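    The sysfs side plausibly mirrors the existing max_ratio attribute in
    mm/backing-dev.c; a sketch (function name and DEVICE_ATTR wiring
    assumed, not the verbatim patch):

        static ssize_t strictlimit_store(struct device *dev,
                        struct device_attribute *attr,
                        const char *buf, size_t count)
        {
                struct backing_dev_info *bdi = dev_get_drvdata(dev);
                unsigned int val;
                int ret;

                ret = kstrtouint(buf, 10, &val);
                if (ret < 0)
                        return ret;

                /* BDI_CAP_STRICTLIMIT is the pre-existing capability bit
                 * the FUSE-only strictlimit mode already uses. */
                if (val)
                        bdi->capabilities |= BDI_CAP_STRICTLIMIT;
                else
                        bdi->capabilities &= ~BDI_CAP_STRICTLIMIT;
                return count;
        }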
  * list_lru-prefetch-neighboring-list-entries-before-acquiring-lock-fix (Andrew Morton, 2018-09-11, 1 file, -3/+2)

    Include prefetch.h; remove the unlocked list_empty() test, per Dave.

    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Waiman Long <longman@redhat.com>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
  * mm/list_lru.c: prefetch neighboring list entries before acquiring lock (Waiman Long, 2018-09-11, 1 file, -1/+9)

    list_lru_del() removes the given item from the LRU list. The operation
    looks simple, but it involves writing into the cachelines of the two
    neighboring list entries in order to get the deletion done. That can
    take a while if the cachelines aren't there yet, thus prolonging the
    lock hold time.

    To reduce the lock hold time, the cachelines of the two neighboring
    list entries are now prefetched before acquiring the list_lru_node's
    lock.

    Using a multi-threaded test program that created a large number of
    dentries and then killed them, the execution time was reduced from
    38.5s to 36.6s after applying the patch on a 2-socket 36-core
    72-thread x86-64 system.

    Link: http://lkml.kernel.org/r/1511965054-6328-1-git-send-email-longman@redhat.com
    Signed-off-by: Waiman Long <longman@redhat.com>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Dave Chinner <david@fromorbit.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
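    Combined with the -fix above, the hot path looks roughly like this
    (a sketch with the memcg-aware lookup details omitted, not the
    verbatim patch):

        #include <linux/prefetch.h>

        bool list_lru_del(struct list_lru *lru, struct list_head *item)
        {
                int nid = page_to_nid(virt_to_page(item));
                struct list_lru_node *nlru = &lru->node[nid];

                /* Warm the two neighbour cachelines that list_del_init()
                 * will write, so those store misses are not taken while
                 * holding the lock. */
                prefetchw(item->prev);
                prefetchw(item->next);

                spin_lock(&nlru->lock);
                if (!list_empty(item)) {
                        list_del_init(item);
                        nlru->nr_items--;
                        spin_unlock(&nlru->lock);
                        return true;
                }
                spin_unlock(&nlru->lock);
                return false;
        }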
  * mm: fix race between swapoff and mincore (Huang Ying, 2018-09-11, 1 file, -2/+10)

    Since commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB
    trunks"), the address_space associated with the swap device is freed
    after swapoff. So swap_address_space() users which touch the
    address_space need some kind of mechanism to prevent it from being
    freed during the access.

    When mincore processes an unmapped range for swapped shmem pages, it
    doesn't hold a lock to prevent the swap device from being swapped off.
    So the following race is possible:

        CPU1                                CPU2
        do_mincore()                        swapoff()
          walk_page_range()
            mincore_unmapped_range()
              __mincore_unmapped_range
                mincore_page
                  as = swap_address_space()
                  ...                         exit_swap_address_space()
                  ...                           kvfree(spaces)
                  find_get_page(as)

    The address space may be accessed after being freed. To fix the race,
    get_swap_device()/put_swap_device() is used to enclose find_get_page()
    to check whether the swap entry is valid and to prevent the swap
    device from being swapped off during the access.

    Link: http://lkml.kernel.org/r/20180313012036.1597-1-ying.huang@intel.com
    Fixes: 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks")
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: Hugh Dickins <hughd@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
  * mm, swap: fix race between swapoff and some swap operations (Huang Ying, 2018-09-11, 2 files, -1/+24)

    - Add more comments to get_swap_device() to make it clearer about
      possible swapoff or swapoff+swapon.

    Link: http://lkml.kernel.org/r/20180223060010.954-1-ying.huang@intel.com
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Cc: Hugh Dickins <hughd@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
  * mm, swap: fix race between swapoff and some swap operations (Huang Ying, 2018-09-11, 4 files, -37/+123)

    When swapin is performed, after getting the swap entry information
    from the page table, the system will swap in the swap entry without
    any lock held to prevent the swap device from being swapped off. This
    may cause the race below:

        CPU 1                           CPU 2
        -----                           -----
        do_swap_page
          swapin_readahead
            __read_swap_cache_async
                                        swapoff
              swapcache_prepare           p->swap_map = NULL
                __swap_duplicate
                  p->swap_map[?]   /* !!! NULL pointer access */

    Because swapoff is usually done only at system shutdown, the race may
    not hit many people in practice. But it is still a race that needs to
    be fixed.

    To fix the race, get_swap_device() is added to check whether the
    specified swap entry is valid in its swap device. If so, it will keep
    the swap entry valid by preventing the swap device from being swapped
    off, until put_swap_device() is called.

    Because swapoff() is a very rare code path, to make the normal path
    run as fast as possible, disabling preemption + stop_machine() is used
    to implement get/put_swap_device() instead of a reference count. From
    get_swap_device() to put_swap_device(), preemption is disabled, so
    stop_machine() in swapoff() will wait until put_swap_device() is
    called.

    In addition to the swap_map, cluster_info, etc. data structures in
    struct swap_info_struct, the swap cache radix tree will be freed after
    swapoff, so this patch fixes the race between swap cache lookup and
    swapoff too. Races between some other swap cache usages protected via
    disabling preemption and swapoff are fixed too, by calling
    stop_machine() between clearing PageSwapCache() and freeing the swap
    cache data structure.

    An alternative implementation could replace disabling preemption with
    rcu_read_lock_sched() and stop_machine() with synchronize_sched().

    Link: http://lkml.kernel.org/r/20180213014220.2464-1-ying.huang@intel.com
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Tim Chen <tim.c.chen@linux.intel.com>
    Cc: Shaohua Li <shli@fb.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Jérôme Glisse <jglisse@redhat.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Dave Jiang <dave.jiang@intel.com>
    Cc: Aaron Lu <aaron.lu@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
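    The guard pattern callers end up with, as a sketch (simplified; the
    real swap-cache lookup also distinguishes other failure cases):

        struct swap_info_struct *si;
        struct page *page;

        si = get_swap_device(entry);    /* disables preemption; returns
                                         * NULL if the entry's device has
                                         * been swapped off */
        if (!si)
                return NULL;

        page = find_get_page(swap_address_space(entry), swp_offset(entry));

        put_swap_device(si);            /* re-enables preemption, letting
                                         * swapoff()'s stop_machine() run */
        return page;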
  * mm-move-mirrored-memory-specific-code-outside-of-memmap_init_zone-v2 (Pavel Tatashin, 2018-09-11, 1 file, -1/+2)

    Uninline overlap_memmap_init().

    Link: http://lkml.kernel.org/r/20180726193509.3326-4-pasha.tatashin@oracle.com
    Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
  * mm: move mirrored memory specific code outside of memmap_init_zone (Pavel Tatashin, 2018-09-11, 1 file, -40/+33)

    memmap_init_zone() is getting complex, because it is called from
    different contexts (hotplug and boot) and also because it must handle
    some architecture quirks. One of them is mirrored memory.

    Move the code that decides whether to skip mirrored memory outside of
    memmap_init_zone(), into a separate function.

    Link: http://lkml.kernel.org/r/20180724235520.10200-4-pasha.tatashin@oracle.com
    Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
    Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
    Cc: Baoquan He <bhe@redhat.com>
    Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Ingo Molnar <mingo@kernel.org>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jérôme Glisse <jglisse@redhat.com>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Souptick Joarder <jrdr.linux@gmail.com>
    Cc: Steven Sistare <steven.sistare@oracle.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Wei Yang <richard.weiyang@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
  * mm-calculate-deferred-pages-after-skipping-mirrored-memory-fix (Andrew Morton, 2018-09-11, 1 file, -1/+2)

    Make defer_init() non-inline, __meminit.

    Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
    Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
  * mm-calculate-deferred-pages-after-skipping-mirrored-memory-v2 (Pavel Tatashin, 2018-09-11, 1 file, -0/+4)

    v2: add a comment about defer_init()'s lack of locking.

    Link: http://lkml.kernel.org/r/20180726193509.3326-3-pasha.tatashin@oracle.com
    Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
    Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
  * mm: calculate deferred pages after skipping mirrored memory (Pavel Tatashin, 2018-09-11, 1 file, -20/+20)

    update_defer_init() should be called only when a struct page is about
    to be initialized, because it counts the number of initialized struct
    pages, but there we may skip struct pages if there is some mirrored
    memory. So move update_defer_init() after the check for mirrored
    memory.

    Also, rename update_defer_init() to defer_init() and invert the
    returned boolean to emphasize that this is a boolean function that
    tells whether the rest of memmap initialization should be deferred.

    Make this function self-contained: do not pass the number of already
    initialized pages in this zone, by using static counters.

    I found this bug by reading the code. The effect is that fewer than
    expected struct pages are initialized early in boot, and it is
    possible that in some corner cases we may fail to boot when mirrored
    pages are used. The deferred-on-demand code should somewhat mitigate
    this. But this still brings some inconsistencies compared to booting
    without mirrored pages, so it is better to fix.

    Link: http://lkml.kernel.org/r/20180724235520.10200-3-pasha.tatashin@oracle.com
    Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
    Cc: Baoquan He <bhe@redhat.com>
    Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Ingo Molnar <mingo@kernel.org>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jérôme Glisse <jglisse@redhat.com>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Souptick Joarder <jrdr.linux@gmail.com>
    Cc: Steven Sistare <steven.sistare@oracle.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Wei Yang <richard.weiyang@gmail.com>
    Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
  * mm: make memmap_init a proper function (Pavel Tatashin, 2018-09-11, 2 files, -5/+5)

    memmap_init() is sometimes a macro, sometimes a function, based on
    __HAVE_ARCH_MEMMAP_INIT. It is only a function on ia64. Make
    memmap_init() a weak function instead, and let ia64 redefine it.

    Link: http://lkml.kernel.org/r/20180724235520.10200-2-pasha.tatashin@oracle.com
    Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
    Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
    Reviewed-by: Oscar Salvador <osalvador@suse.de>
    Cc: Steven Sistare <steven.sistare@oracle.com>
    Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jérôme Glisse <jglisse@redhat.com>
    Cc: Souptick Joarder <jrdr.linux@gmail.com>
    Cc: Baoquan He <bhe@redhat.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Wei Yang <richard.weiyang@gmail.com>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Ingo Molnar <mingo@kernel.org>
    Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
    Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
  * mm/z3fold.c: fix wrong handling of headless pages (Jongseok Kim, 2018-09-11, 1 file, -2/+6)

    During the processing of headless pages in z3fold_reclaim_page(),
    there was a problem that the zhdr pointed to another page, or a page
    was already released in z3fold_free(). So the wrong page was encoded
    in headless, or test_bit() did not work properly in
    z3fold_reclaim_page(). This patch fixes these problems.

    Link: http://lkml.kernel.org/r/1530853846-30215-1-git-send-email-ks77sj@gmail.com
    Signed-off-by: Jongseok Kim <ks77sj@gmail.com>
    Cc: Vitaly Wool <vitalywool@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
  * mm: page_alloc: reduce unnecessary binary search in early_pfn_valid() (Jia He, 2018-09-11, 1 file, -2/+7)

    b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns where
    possible") optimized the loop in memmap_init_zone(). But there is
    still some room for improvement. E.g. in early_pfn_valid(), if pfn and
    pfn+1 are in the same memblock region, we can record the last returned
    memblock region index and check whether pfn++ is still in the same
    region.

    Currently it only improves the performance on arm/arm64 and will have
    no impact on other arches.

    For the performance improvement, after this set, I can see the time
    overhead of memmap_init() reduced from 27956us to 13537us on my armv8a
    server (QDF2400 with 96G memory, pagesize 64k).

    Link: http://lkml.kernel.org/r/1530867675-9018-7-git-send-email-hejianet@gmail.com
    Signed-off-by: Jia He <jia.he@hxt-semitech.com>
    Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
    Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
    Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
    Cc: Daniel Vacek <neelx@redhat.com>
    Cc: Eugeniu Rosca <erosca@de.adit-jv.com>
    Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
    Cc: James Morse <james.morse@arm.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Kemi Wang <kemi.wang@intel.com>
    Cc: Laura Abbott <labbott@redhat.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Nikolay Borisov <nborisov@suse.com>
    Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
    Cc: Petr Tesarik <ptesarik@suse.com>
    Cc: Philip Derrin <philip@cog.systems>
    Cc: Russell King <linux@armlinux.org.uk>
    Cc: Steve Capper <steve.capper@arm.com>
    Cc: Vladimir Murzin <vladimir.murzin@arm.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Wei Yang <richard.weiyang@gmail.com>
    Cc: Will Deacon <will.deacon@arm.com>
    Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
  * mm/memblock: introduce pfn_valid_region() (Jia He, 2018-09-11, 1 file, -0/+24)

    b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns where
    possible") optimized the loop in memmap_init_zone(). But there is
    still some room for improvement. E.g. in early_pfn_valid(), we can
    record the last returned memblock region. If the current pfn and the
    last pfn are in the same memory region, we needn't do the unnecessary
    binary searches, because memblock_is_nomap gives the same result for
    the whole memory region.

    Link: http://lkml.kernel.org/r/1530867675-9018-6-git-send-email-hejianet@gmail.com
    Signed-off-by: Jia He <jia.he@hxt-semitech.com>
    Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
    Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
    Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
    Cc: Daniel Vacek <neelx@redhat.com>
    Cc: Eugeniu Rosca <erosca@de.adit-jv.com>
    Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
    Cc: James Morse <james.morse@arm.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Kemi Wang <kemi.wang@intel.com>
    Cc: Laura Abbott <labbott@redhat.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Nikolay Borisov <nborisov@suse.com>
    Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
    Cc: Petr Tesarik <ptesarik@suse.com>
    Cc: Philip Derrin <philip@cog.systems>
    Cc: Russell King <linux@armlinux.org.uk>
    Cc: Steve Capper <steve.capper@arm.com>
    Cc: Vladimir Murzin <vladimir.murzin@arm.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Wei Yang <richard.weiyang@gmail.com>
    Cc: Will Deacon <will.deacon@arm.com>
    Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
  * mm-memblock-introduce-memblock_search_pfn_regions-fix (Andrew Morton, 2018-09-11, 1 file, -4/+1)

    Simplify code.

    Cc: Jia He <jia.he@hxt-semitech.com>
    Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
  * mm/memblock: introduce memblock_search_pfn_regions() (Jia He, 2018-09-11, 2 files, -0/+11)

    This helper finds the memory region index of the input pfn.

    Link: http://lkml.kernel.org/r/1530867675-9018-5-git-send-email-hejianet@gmail.com
    Signed-off-by: Jia He <jia.he@hxt-semitech.com>
    Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
    Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
    Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
    Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
    Cc: Daniel Vacek <neelx@redhat.com>
    Cc: Eugeniu Rosca <erosca@de.adit-jv.com>
    Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
    Cc: James Morse <james.morse@arm.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Kemi Wang <kemi.wang@intel.com>
    Cc: Laura Abbott <labbott@redhat.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Nikolay Borisov <nborisov@suse.com>
    Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
    Cc: Petr Tesarik <ptesarik@suse.com>
    Cc: Philip Derrin <philip@cog.systems>
    Cc: Russell King <linux@armlinux.org.uk>
    Cc: Steve Capper <steve.capper@arm.com>
    Cc: Vladimir Murzin <vladimir.murzin@arm.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Wei Yang <richard.weiyang@gmail.com>
    Cc: Will Deacon <will.deacon@arm.com>
    Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
  * mm-page_alloc-reduce-unnecessary-binary-search-in-memblock_next_valid_pfn-fix-fix (Andrew Morton, 2018-09-11, 1 file, -1/+1)

    Fix bogus fix.

    Cc: Jia He <jia.he@hxt-semitech.com>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
  * mm-page_alloc-reduce-unnecessary-binary-search-in-memblock_next_valid_pfn-fix (Andrew Morton, 2018-09-11, 2 files, -4/+4)

    s/ulong/unsigned long/; make early_region_idx local to
    memblock_next_valid_pfn().

    Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
    Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
    Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
    Cc: Daniel Vacek <neelx@redhat.com>
    Cc: Eugeniu Rosca <erosca@de.adit-jv.com>
    Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
    Cc: James Morse <james.morse@arm.com>
    Cc: Jia He <jia.he@hxt-semitech.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Kemi Wang <kemi.wang@intel.com>
    Cc: Laura Abbott <labbott@redhat.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Nikolay Borisov <nborisov@suse.com>
    Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
    Cc: Petr Tesarik <ptesarik@suse.com>
    Cc: Philip Derrin <philip@cog.systems>
    Cc: Russell King <linux@armlinux.org.uk>
    Cc: Steve Capper <steve.capper@arm.com>
    Cc: Vladimir Murzin <vladimir.murzin@arm.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Wei Yang <richard.weiyang@gmail.com>
    Cc: Will Deacon <will.deacon@arm.com>
    Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
  * mm: page_alloc: reduce unnecessary binary search in memblock_next_valid_pfn() (Jia He, 2018-09-11, 1 file, -8/+29)

    b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns where
    possible") optimized the loop in memmap_init_zone(). But there is
    still some room for improvement. E.g. if pfn and pfn+1 are in the same
    memblock region, we can simply do pfn++ instead of doing the binary
    search in memblock_next_valid_pfn(). Furthermore, if the pfn lies in a
    gap between two memory regions, skip to the next region directly when
    possible.

    Attached is the memblock region information from my server:

        [    0.000000] Zone ranges:
        [    0.000000]   DMA32    [mem 0x0000000000200000-0x00000000ffffffff]
        [    0.000000]   Normal   [mem 0x0000000100000000-0x00000017ffffffff]
        [    0.000000] Movable zone start for each node
        [    0.000000] Early memory node ranges
        [    0.000000]   node 0: [mem 0x0000000000200000-0x000000000021ffff]
        [    0.000000]   node 0: [mem 0x0000000000820000-0x000000000307ffff]
        [    0.000000]   node 0: [mem 0x0000000003080000-0x000000000308ffff]
        [    0.000000]   node 0: [mem 0x0000000003090000-0x00000000031fffff]
        [    0.000000]   node 0: [mem 0x0000000003200000-0x00000000033fffff]
        [    0.000000]   node 0: [mem 0x0000000003410000-0x00000000034fffff]
        [    0.000000]   node 0: [mem 0x0000000003500000-0x000000000351ffff]
        [    0.000000]   node 0: [mem 0x0000000003520000-0x000000000353ffff]
        [    0.000000]   node 0: [mem 0x0000000003540000-0x0000000003e3ffff]
        [    0.000000]   node 0: [mem 0x0000000003e40000-0x0000000003e7ffff]
        [    0.000000]   node 0: [mem 0x0000000003e80000-0x0000000003ecffff]
        [    0.000000]   node 0: [mem 0x0000000003ed0000-0x0000000003ed5fff]
        [    0.000000]   node 0: [mem 0x0000000003ed6000-0x0000000006eeafff]
        [    0.000000]   node 0: [mem 0x0000000006eeb000-0x000000000710ffff]
        [    0.000000]   node 0: [mem 0x0000000007110000-0x0000000007f0ffff]
        [    0.000000]   node 0: [mem 0x0000000007f10000-0x0000000007faffff]
        [    0.000000]   node 0: [mem 0x0000000007fb0000-0x000000000806ffff]
        [    0.000000]   node 0: [mem 0x0000000008070000-0x00000000080affff]
        [    0.000000]   node 0: [mem 0x00000000080b0000-0x000000000832ffff]
        [    0.000000]   node 0: [mem 0x0000000008330000-0x000000000836ffff]
        [    0.000000]   node 0: [mem 0x0000000008370000-0x000000000838ffff]
        [    0.000000]   node 0: [mem 0x0000000008390000-0x00000000083a9fff]
        [    0.000000]   node 0: [mem 0x00000000083aa000-0x00000000083bbfff]
        [    0.000000]   node 0: [mem 0x00000000083bc000-0x00000000083fffff]
        [    0.000000]   node 0: [mem 0x0000000008400000-0x000000000841ffff]
        [    0.000000]   node 0: [mem 0x0000000008420000-0x000000000843ffff]
        [    0.000000]   node 0: [mem 0x0000000008440000-0x000000000865ffff]
        [    0.000000]   node 0: [mem 0x0000000008660000-0x000000000869ffff]
        [    0.000000]   node 0: [mem 0x00000000086a0000-0x00000000086affff]
        [    0.000000]   node 0: [mem 0x00000000086b0000-0x00000000086effff]
        [    0.000000]   node 0: [mem 0x00000000086f0000-0x0000000008b6ffff]
        [    0.000000]   node 0: [mem 0x0000000008b70000-0x0000000008bbffff]
        [    0.000000]   node 0: [mem 0x0000000008bc0000-0x0000000008edffff]
        [    0.000000]   node 0: [mem 0x0000000008ee0000-0x0000000008ee0fff]
        [    0.000000]   node 0: [mem 0x0000000008ee1000-0x0000000008ee2fff]
        [    0.000000]   node 0: [mem 0x0000000008ee3000-0x000000000decffff]
        [    0.000000]   node 0: [mem 0x000000000ded0000-0x000000000defffff]
        [    0.000000]   node 0: [mem 0x000000000df00000-0x000000000fffffff]
        [    0.000000]   node 0: [mem 0x0000000010800000-0x0000000017feffff]
        [    0.000000]   node 0: [mem 0x000000001c000000-0x000000001c00ffff]
        [    0.000000]   node 0: [mem 0x000000001c010000-0x000000001c7fffff]
        [    0.000000]   node 0: [mem 0x000000001c810000-0x000000007efbffff]
        [    0.000000]   node 0: [mem 0x000000007efc0000-0x000000007efdffff]
        [    0.000000]   node 0: [mem 0x000000007efe0000-0x000000007efeffff]
        [    0.000000]   node 0: [mem 0x000000007eff0000-0x000000007effffff]
        [    0.000000]   node 0: [mem 0x000000007f000000-0x00000017ffffffff]
        [    0.000000] Initmem setup node 0 [mem 0x0000000000200000-0x00000017ffffffff]
        [    0.000000] On node 0 totalpages: 25145296
        [    0.000000]   DMA32 zone: 16376 pages used for memmap
        [    0.000000]   DMA32 zone: 0 pages reserved
        [    0.000000]   DMA32 zone: 1028048 pages, LIFO batch:31
        [    0.000000]   Normal zone: 376832 pages used for memmap
        [    0.000000]   Normal zone: 24117248 pages, LIFO batch:31

    Link: http://lkml.kernel.org/r/1530867675-9018-4-git-send-email-hejianet@gmail.com
    Signed-off-by: Jia He <jia.he@hxt-semitech.com>
    Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
    Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
    Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
    Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
    Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
    Cc: Daniel Vacek <neelx@redhat.com>
    Cc: Eugeniu Rosca <erosca@de.adit-jv.com>
    Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
    Cc: James Morse <james.morse@arm.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Kemi Wang <kemi.wang@intel.com>
    Cc: Laura Abbott <labbott@redhat.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Nikolay Borisov <nborisov@suse.com>
    Cc: Petr Tesarik <ptesarik@suse.com>
    Cc: Philip Derrin <philip@cog.systems>
    Cc: Russell King <linux@armlinux.org.uk>
    Cc: Steve Capper <steve.capper@arm.com>
    Cc: Vladimir Murzin <vladimir.murzin@arm.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Wei Yang <richard.weiyang@gmail.com>
    Cc: Will Deacon <will.deacon@arm.com>
    Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
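    A sketch of the resulting helper with both fast paths, incorporating
    the -fix patches above (semantics simplified: here the function
    returns pfn itself when valid, or the first pfn of the next region
    when pfn falls in a hole; the kernel's actual arm/arm64 variant
    differs in such details, and page-aligned regions are assumed):

        unsigned long __init_memblock memblock_next_valid_pfn(unsigned long pfn)
        {
                struct memblock_type *type = &memblock.memory;
                struct memblock_region *regions = type->regions;
                static int early_region_idx __initdata = -1;
                unsigned int left = 0, right = type->cnt, mid;

                /* Fast path: pfn is usually in the region found last time. */
                if (early_region_idx >= 0 &&
                    pfn >= PFN_DOWN(regions[early_region_idx].base) &&
                    pfn < PFN_DOWN(regions[early_region_idx].base +
                                   regions[early_region_idx].size))
                        return pfn;

                /* Slow path: binary-search the region containing pfn and
                 * remember it for the next call. */
                do {
                        mid = (left + right) / 2;
                        if (pfn < PFN_DOWN(regions[mid].base))
                                right = mid;
                        else if (pfn >= PFN_DOWN(regions[mid].base +
                                                 regions[mid].size))
                                left = mid + 1;
                        else {
                                early_region_idx = mid;
                                return pfn;
                        }
                } while (left < right);

                /* pfn is in a hole: skip straight to the next region. */
                if (right == type->cnt)
                        return -1UL;
                early_region_idx = right;
                return PFN_DOWN(regions[right].base);
        }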