Commit message  Author  Date  Files, Lines
* Add linux-next specific files for 20170515 (next-20170515)  Stephen Rothwell  2017-05-15  5 files, -0/+2512
|   Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
* Merge branch 'akpm/master'  Stephen Rothwell  2017-05-15  11 files, -57/+120
|\
| * lib/crc-ccitt: add CCITT-FALSE CRC16 variant  Andrey Vostrikov  2017-05-15  2 files, -1/+64
| |   In support of a soon-to-be-published MFD driver that uses serdev to talk to a supervisory processor speaking the CCITT-FALSE CRC16 variant in its protocol. Tested successfully on an i.MX6 ARM platform.
| |   Link: http://lkml.kernel.org/r/20170413142932.27287-1-andrew.smirnov@gmail.com
| |   Signed-off-by: Andrey Vostrikov <andrey.vostrikov@cogentembedded.com>
| |   Signed-off-by: Andrey Smirnov <andrew.smirnov@gmail.com>
| |   Tested-by: Chris Healy <cphealy@gmail.com>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
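| |   For illustration, a bitwise sketch of the CCITT-FALSE variant (polynomial 0x1021, initial value 0xFFFF, no bit reflection); the crc_ccitt_false() this patch adds is table-driven, so this is only the reference form:
| |       u16 crc_ccitt_false_byte(u16 crc, u8 data)
| |       {
| |               int i;
| |
| |               crc ^= (u16)data << 8;
| |               for (i = 0; i < 8; i++)
| |                       crc = (crc & 0x8000) ? (crc << 1) ^ 0x1021 : crc << 1;
| |               return crc;
| |       }
| |       /* start from crc = 0xffff and fold in each message byte in turn */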
| * mm: adaptive hash table scaling  Pavel Tatashin  2017-05-15  4 files, -2/+22
| |   Allow hash tables to scale with memory, but at a slower pace: when HASH_ADAPT is provided, every time memory quadruples the hash table sizes only double instead of quadrupling as well. The algorithm starts working only once memory reaches a certain size, currently set to 64G.
| |   This is an example of the dentry hash table size, before and after the fix, for various memory configurations:
| |       MEMORY   SCALE      HASH_SIZE
| |               old  new    old      new
| |       8G      13   13     8M       8M
| |       16G     13   13     16M      16M
| |       32G     13   13     32M      32M
| |       64G     13   13     64M      64M
| |       128G    13   14     128M     64M
| |       256G    13   14     256M     128M
| |       512G    13   15     512M     128M
| |       1024G   13   15     1024M    256M
| |       2048G   13   16     2048M    256M
| |       4096G   13   16     4096M    512M
| |       8192G   13   17     8192M    512M
| |       16384G  13   17     16384M   1024M
| |       32768G  13   18     32768M   1024M
| |       65536G  13   18     65536M   2048M
| |   Link: http://lkml.kernel.org/r/1488432825-92126-5-git-send-email-pasha.tatashin@oracle.com
| |   Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
| |   Cc: David Miller <davem@davemloft.net>
| |   Cc: Al Viro <viro@zeniv.linux.org.uk>
| |   Cc: Babu Moger <babu.moger@oracle.com>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
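| |   A hedged sketch of the scaling rule, assuming it lives in alloc_large_system_hash() (names hypothetical): above the 64G threshold, bump the scale shift once per quadrupling of memory, so the table only doubles each time:
| |       #define ADAPT_SCALE_BASE   (64ul << (30 - PAGE_SHIFT))  /* 64G, in pages */
| |       #define ADAPT_SCALE_SHIFT  2                            /* one bump per 4x memory */
| |
| |       static unsigned long adapted_scale(unsigned long nr_kernel_pages, unsigned long scale)
| |       {
| |               unsigned long adapt;
| |
| |               /* a larger scale yields a smaller table: entries ~ nr_pages >> scale */
| |               for (adapt = ADAPT_SCALE_BASE; adapt < nr_kernel_pages;
| |                    adapt <<= ADAPT_SCALE_SHIFT)
| |                       scale++;
| |               return scale;
| |       }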
| * mm: update callers to use HASH_ZERO flag  Pavel Tatashin  2017-05-15  5 files, -40/+12
| |   Update the dcache, inode, pid, mountpoint, and mount hash tables to use HASH_ZERO, and remove the initialization after the allocations. In places where HASH_EARLY was used, such as __pv_init_lock_hash(), a zeroed hash table was already assumed, because memblock zeroes the memory.
| |   CPU: SPARC M6, Memory: 7T
| |   Before fix:
| |       Dentry cache hash table entries: 1073741824
| |       Inode-cache hash table entries: 536870912
| |       Mount-cache hash table entries: 16777216
| |       Mountpoint-cache hash table entries: 16777216
| |       ftrace: allocating 20414 entries in 40 pages
| |       Total time: 11.798s
| |   After fix:
| |       Dentry cache hash table entries: 1073741824
| |       Inode-cache hash table entries: 536870912
| |       Mount-cache hash table entries: 16777216
| |       Mountpoint-cache hash table entries: 16777216
| |       ftrace: allocating 20414 entries in 40 pages
| |       Total time: 3.198s
| |   CPU: Intel Xeon E5-2630, Memory: 2.2T:
| |   Before fix:
| |       Dentry cache hash table entries: 536870912
| |       Inode-cache hash table entries: 268435456
| |       Mount-cache hash table entries: 8388608
| |       Mountpoint-cache hash table entries: 8388608
| |       CPU: Physical Processor ID: 0
| |       Total time: 3.245s
| |   After fix:
| |       Dentry cache hash table entries: 536870912
| |       Inode-cache hash table entries: 268435456
| |       Mount-cache hash table entries: 8388608
| |       Mountpoint-cache hash table entries: 8388608
| |       CPU: Physical Processor ID: 0
| |       Total time: 3.244s
| |   Link: http://lkml.kernel.org/r/1488432825-92126-4-git-send-email-pasha.tatashin@oracle.com
| |   Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
| |   Reviewed-by: Babu Moger <babu.moger@oracle.com>
| |   Cc: David Miller <davem@davemloft.net>
| |   Cc: Al Viro <viro@zeniv.linux.org.uk>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
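| |   A before/after sketch of the caller-side change, modeled on the dcache case (illustrative, not the exact hunk):
| |       /* before: allocate, then walk the table to initialize each bucket */
| |       dentry_hashtable = alloc_large_system_hash("Dentry cache",
| |                                                  sizeof(struct hlist_bl_head),
| |                                                  dhash_entries, 13, HASH_EARLY,
| |                                                  &d_hash_shift, &d_hash_mask, 0, 0);
| |       for (loop = 0; loop < (1U << d_hash_shift); loop++)
| |               INIT_HLIST_BL_HEAD(dentry_hashtable + loop);
| |
| |       /* after: ask the allocator for zeroed memory and drop the loop */
| |       dentry_hashtable = alloc_large_system_hash("Dentry cache",
| |                                                  sizeof(struct hlist_bl_head),
| |                                                  dhash_entries, 13,
| |                                                  HASH_EARLY | HASH_ZERO,
| |                                                  &d_hash_shift, &d_hash_mask, 0, 0);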
| * mm: zero hash tables in allocator  Pavel Tatashin  2017-05-15  2 files, -3/+10
| |   Add a new flag, HASH_ZERO, which when provided guarantees that the hash table returned by alloc_large_system_hash() is zeroed. In most cases that is what the caller needs. Use the page-level allocator's __GFP_ZERO flag to zero the memory; it uses memset(), which is an efficient way to zero memory and is optimized for most platforms.
| |   Link: http://lkml.kernel.org/r/1488432825-92126-3-git-send-email-pasha.tatashin@oracle.com
| |   Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
| |   Reviewed-by: Babu Moger <babu.moger@oracle.com>
| |   Cc: David Miller <davem@davemloft.net>
| |   Cc: Al Viro <viro@zeniv.linux.org.uk>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
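| |   A sketch of the allocator-side handling (hedged; variable names assumed):
| |       /* inside alloc_large_system_hash(), non-HASH_EARLY path */
| |       gfp_t gfp_flags = GFP_ATOMIC | __GFP_NOWARN;
| |
| |       if (flags & HASH_ZERO)
| |               gfp_flags |= __GFP_ZERO;        /* page allocator zeroes for us */
| |       table = (void *)__get_free_pages(gfp_flags, order);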
| * sparc64: NG4 memset 32 bits overflow  Pavel Tatashin  2017-05-15  1 file, -13/+13
| |   Early in boot, Linux patches memset and memcpy to branch to platform-optimized versions of these routines. The NG4 (Niagara 4) versions are currently used on all platforms starting from T4. Recently, M7-optimized routines were added into UEK4, but not into mainline yet; so even with M7-optimized routines, NG4 is still going to be used on T4, T5, M5, and M6 processors.
| |   While investigating how to improve the initialization time of dentry_hashtable, which is 8G long on an M6 ldom with 7T of main memory, I noticed that memset() does not reset all the memory in this array. After studying the code, I realized that the NG4memset() branches use the %icc register instead of %xcc for compares, so if the length value exceeds 32 bits, which is true for an 8G array, these routines fail to work properly.
| |   The fix is to replace all %icc with %xcc in these routines. (An alternative is to use %ncc, but that is misleading, as the code already has sparcv9-only instructions and cannot be compiled on 32-bit.)
| |   It is important to fix this bug, because even an older T4-4 can have 2T of memory, and there are large memory-proportional data structures in the kernel which can be larger than 4G in size. The failure of memset() is silent and the corruption is hard to detect.
| |   Link: http://lkml.kernel.org/r/1488432825-92126-2-git-send-email-pasha.tatashin@oracle.com
| |   Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
| |   Reviewed-by: Babu Moger <babu.moger@oracle.com>
| |   Cc: David Miller <davem@davemloft.net>
| |   Cc: Al Viro <viro@zeniv.linux.org.uk>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
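| |   The failure mode expressed in C terms (the actual fix is a mechanical %icc to %xcc substitution in the sparc assembly):
| |       unsigned long len = 8UL << 30;          /* 8G dentry hash table */
| |       unsigned int len32 = (unsigned int)len; /* what a 32-bit %icc compare sees */
| |
| |       /* len32 == 0, so a "length small/zero?" branch taken on the truncated
| |        * value returns early and silently skips most of the memset */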
| * imx7: fix Kconfig warning and build errors  Randy Dunlap  2017-05-15  1 file, -0/+1
| |   Fix Kconfig warning in drivers/soc/imx/Kconfig and subsequent build errors elsewhere when CONFIG_PM is not enabled.
| |       warning: (IMX7_PM_DOMAINS) selects PM_GENERIC_DOMAINS which has unmet direct dependencies (PM)
| |   This warning causes multiple build errors in drivers/base/power/domain.c:
| |       ../drivers/base/power/domain.c: In function 'genpd_queue_power_off_work':
| |       ../drivers/base/power/domain.c:279:13: error: 'pm_wq' undeclared (first use in this function)
| |         queue_work(pm_wq, &genpd->power_off_work);
| |       ../drivers/base/power/domain.c: In function 'genpd_dev_pm_qos_notifier':
| |       ../drivers/base/power/domain.c:462:25: error: 'struct dev_pm_info' has no member named 'ignore_children'
| |         if (!dev || dev->power.ignore_children)
| |       ../drivers/base/power/domain.c: In function 'rtpm_status_str':
| |       ../drivers/base/power/domain.c:2207:16: error: 'struct dev_pm_info' has no member named 'runtime_error'
| |         if (dev->power.runtime_error)
| |       ../drivers/base/power/domain.c:2209:21: error: 'struct dev_pm_info' has no member named 'disable_depth'
| |         else if (dev->power.disable_depth)
| |       ../drivers/base/power/domain.c:2211:21: error: 'struct dev_pm_info' has no member named 'runtime_status'
| |         else if (dev->power.runtime_status < ARRAY_SIZE(status_lookup))
| |           p = status_lookup[dev->power.runtime_status];
| |   Link: http://lkml.kernel.org/r/47fc2167-bf0a-866a-e075-8247a10aa6e0@infradead.org
| |   Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
| |   Tested-by: Fabio Estevam <fabio.estevam@nxp.com>
| |   Cc: Shawn Guo <shawnguo@kernel.org>
| |   Cc: Sascha Hauer <kernel@pengutronix.de>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|/
* Merge branch 'akpm-current/current'  Stephen Rothwell  2017-05-15  59 files, -289/+2038
|\
| * ipc/sem: avoid indexing past end of sem_array  Kees Cook  2017-05-14  1 file, -5/+7
| |   This changes the struct + trailing data pattern to using a void *, so that the end of sem_array is found without possibly indexing past the end, which can upset some static analyzers. Mostly, this ends up avoiding a cast between different non-void types, which the future randstruct GCC plugin was warning about.
| |   Link: http://lkml.kernel.org/r/20170508222345.GA52073@beast
| |   Signed-off-by: Kees Cook <keescook@chromium.org>
| |   Cc: Davidlohr Bueso <dave@stgolabs.net>
| |   Cc: Manfred Spraul <manfred@colorfullife.com>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
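| |   A schematic before/after of the pattern (names illustrative, not the exact patch):
| |       /* before: find the trailing struct sem array by indexing past the end */
| |       struct sem *sems = (struct sem *)&sma[1];
| |
| |       /* after: locate the trailing data via void * arithmetic, with no
| |        * out-of-bounds index and no cast between distinct non-void types */
| |       struct sem *sems = (void *)sma + sizeof(*sma);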
| * fault-inject: add /proc/<pid>/fail-nth  Akinobu Mita  2017-05-14  2 files, -1/+3
| |   The fail-nth interface is only created in /proc/self/task/<current-tid>/. This change also adds it in /proc/<pid>/, which makes shell-based tools a bit simpler:
| |       $ bash -c "builtin echo 100 > /proc/self/fail-nth && exec ls /"
| |   Link: http://lkml.kernel.org/r/1491490561-10485-6-git-send-email-akinobu.mita@gmail.com
| |   Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
| |   Cc: Dmitry Vyukov <dvyukov@google.com>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * fault-inject: don't convert unsigned type value as signed int  Akinobu Mita  2017-05-14  1 file, -1/+1
| |   This fixes fault-inject-simplify-access-check-for-fail-nth.patch in the -mm tree, which by mistake partially reverts the change made by fault-inject-parse-as-natural-1-based-value-for-fail-nth-write-interface.patch.
| |   Link: http://lkml.kernel.org/r/1492444483-9239-1-git-send-email-akinobu.mita@gmail.com
| |   Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
| |   Reported-by: Dmitry Vyukov <dvyukov@google.com>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * fault-inject: simplify access check for fail-nth  Akinobu Mita  2017-05-14  1 file, -14/+8
| |   The fail-nth file is created with mode 0666, and access is permitted if and only if the task is current. This file is owned by the current user, so we can create it with 0644 and allow the owner to write to it. This makes it possible to watch the status of task->fail_nth from other processes.
| |   Link: http://lkml.kernel.org/r/1491490561-10485-5-git-send-email-akinobu.mita@gmail.com
| |   Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
| |   Cc: Dmitry Vyukov <dvyukov@google.com>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * fault-inject: make fail-nth read/write interface symmetric  Akinobu Mita  2017-05-14  2 files, -14/+13
| |   The read interface for fail-nth looks a bit odd. A read from this file returns "NYYYY..." or "YYYYY..." (which is surprising when you cat the file), because there is no EOF condition. The first character indicates whether current->fail_nth is zero, and then current->fail_nth is reset to zero. Simply returning the task->fail_nth value is more natural to understand.
| |   Link: http://lkml.kernel.org/r/1491490561-10485-4-git-send-email-akinobu.mita@gmail.com
| |   Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
| |   Cc: Dmitry Vyukov <dvyukov@google.com>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * fault-inject: parse as natural 1-based value for fail-nth write interface  Akinobu Mita  2017-05-14  3 files, -10/+8
| |   The value written to the fail-nth file is parsed as 0-based. Parsing it as 1-based is more natural to understand and makes it possible to cancel the previous setup by simply writing '0'. This change also converts task->fail_nth from signed to unsigned int.
| |   Link: http://lkml.kernel.org/r/1491490561-10485-3-git-send-email-akinobu.mita@gmail.com
| |   Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
| |   Cc: Dmitry Vyukov <dvyukov@google.com>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * fault-inject: automatically detect the number base for fail-nth write interface  Akinobu Mita  2017-05-14  1 file, -1/+1
| |   Automatically detect the number base when writing to the fail-nth file, instead of always parsing the value as a decimal number.
| |   Link: http://lkml.kernel.org/r/1491490561-10485-2-git-send-email-akinobu.mita@gmail.com
| |   Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
| |   Cc: Dmitry Vyukov <dvyukov@google.com>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * fault-inject-support-systematic-fault-injection-fix  Andrew Morton  2017-05-14  1 file, -1/+1
| |   fix build
| |   Cc: Akinobu Mita <akinobu.mita@gmail.com>
| |   Cc: Dmitry Vyukov <dvyukov@google.com>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * fault-inject: support systematic fault injection  Dmitry Vyukov  2017-05-14  5 files, -0/+142
| |   Add a /proc/self/task/<current-tid>/fail-nth file that allows failing the 0th, 1st, 2nd and so on call systematically.
| |   Excerpt from the added documentation:
| |   ===
| |   Write to this file of integer N makes N-th call in the current task fail (N is 0-based). Read from this file returns a single char 'Y' or 'N' that says if the fault setup with a previous write to this file was injected or not, and disables the fault if it wasn't yet injected. Note that this file enables all types of faults (slab, futex, etc). This setting takes precedence over all other generic settings like probability, interval, times, etc. But per-capability settings (e.g. fail_futex/ignore-private) take precedence over it. This feature is intended for systematic testing of faults in a single system call. See an example below.
| |   ===
| |   Why add a new setting:
| |   1. Existing settings are global rather than per-task, so parallel testing is not possible.
| |   2. attr->interval is close, but it depends on attr->count, which is not reset to 0, so interval does not work as expected.
| |   3. Trying to model this with existing settings requires manipulating all of probability, interval, times, space, task-filter and the unexposed count and per-task make-it-fail files.
| |   4. Existing settings are per-failure-type, and the set of failure types is potentially expanding.
| |   5. make-it-fail can't be changed by an unprivileged user, and aggressive stress testing is better done as an unprivileged user. Similarly, this would require opening the debugfs files to the unprivileged user, as he would need to reopen at least the times file (not possible to pre-open before dropping privs).
| |   The proposed interface solves all of the above (see the example). We want to integrate this into the syzkaller fuzzer. A prototype found 10 kernel bugs in its first day of usage: https://groups.google.com/forum/#!searchin/syzkaller/%22FAULT_INJECTION%22%7Csort:relevance
| |   Link: http://lkml.kernel.org/r/20170328130128.101773-1-dvyukov@google.com
| |   Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
| |   Cc: Akinobu Mita <akinobu.mita@gmail.com>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
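| |   A hedged usage sketch in the spirit of the documentation's example (error handling trimmed; note that later patches in this series change the 0-based parsing and the Y/N read format):
| |       #include <fcntl.h>
| |       #include <stdio.h>
| |       #include <unistd.h>
| |       #include <sys/syscall.h>
| |
| |       int main(void)
| |       {
| |               char path[64], status;
| |               int fd;
| |
| |               snprintf(path, sizeof(path), "/proc/self/task/%ld/fail-nth",
| |                        (long)syscall(SYS_gettid));
| |               fd = open(path, O_RDWR);
| |               if (fd < 0)
| |                       return 1;
| |               write(fd, "5", 1);      /* arm: fail the 5th call from now */
| |               /* ... exercise the syscall sequence under test ... */
| |               read(fd, &status, 1);   /* 'Y' = fault injected, 'N' = not reached */
| |               close(fd);
| |               return status != 'Y';
| |       }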
| * kernel-reboot-add-devm_register_reboot_notifier-fix  Andrew Morton  2017-05-14  1 file, -1/+2
| |   move `struct device' forward declaration to top-of-file
| |   Cc: Andrey Smirnov <andrew.smirnov@gmail.com>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * kernel/reboot.c: add devm_register_reboot_notifier()  Andrey Smirnov  2017-05-14  2 files, -0/+30
| |   Add a devm_* wrapper around register_reboot_notifier() to simplify device-specific reboot notifier registration/unregistration.
| |   Link: http://lkml.kernel.org/r/20170320171753.1705-1-andrew.smirnov@gmail.com
| |   Signed-off-by: Andrey Smirnov <andrew.smirnov@gmail.com>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
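| |   Presumably a standard devres wrapper; a sketch of what the registration path may look like, modeled on other devm_* helpers (not the exact patch):
| |       static void devm_unregister_reboot_notifier(struct device *dev, void *res)
| |       {
| |               unregister_reboot_notifier(*(struct notifier_block **)res);
| |       }
| |
| |       int devm_register_reboot_notifier(struct device *dev, struct notifier_block *nb)
| |       {
| |               struct notifier_block **rcnb;
| |               int ret;
| |
| |               rcnb = devres_alloc(devm_unregister_reboot_notifier,
| |                                   sizeof(*rcnb), GFP_KERNEL);
| |               if (!rcnb)
| |                       return -ENOMEM;
| |
| |               ret = register_reboot_notifier(nb);
| |               if (!ret) {
| |                       *rcnb = nb;
| |                       devres_add(dev, rcnb);  /* auto-unregister on device detach */
| |               } else {
| |                       devres_free(rcnb);
| |               }
| |               return ret;
| |       }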
| * kcmp: add KCMP_EPOLL_TFD mode to compare epoll target files  Cyrill Gorcunov  2017-05-14  4 files, -0/+112
| |   With the current epoll architecture, target files are addressed with a file_struct and a file descriptor number, where the latter is not unique. Moreover, files can be transferred from another process via unix socket, added into the queue and then closed, so we won't find this descriptor in the task's fdinfo list. Thus, to checkpoint and restore such processes, CRIU needs to find out where exactly the target file is present, in order to add it into the epoll queue. For this sake one can use the kcmp call, where some particular target file from the queue is compared with an arbitrary file passed as an argument. Because epoll target files can have the same file descriptor number but different file_structs, a caller should explicitly specify the offset within.
| |   To test if some particular file matches an entry inside epoll, one has to:
| |   - fill a kcmp_epoll_slot structure with the epoll file descriptor, target file number and target file offset (if only one target is present, it should be 0)
| |   - call kcmp as kcmp(pid1, pid2, KCMP_EPOLL_TFD, fd, &kcmp_epoll_slot)
| |   - the kernel fetches the file pointer matching file descriptor @fd of pid1, looks up the file struct in the epoll queue of pid2, and returns the traditional 0/1/2 result for sorting purposes
| |   Link: http://lkml.kernel.org/r/20170424154423.511592110@gmail.com
| |   Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
| |   Acked-by: Andrey Vagin <avagin@openvz.org>
| |   Cc: Al Viro <viro@zeniv.linux.org.uk>
| |   Cc: Pavel Emelyanov <xemul@virtuozzo.com>
| |   Cc: Michael Kerrisk <mtk.manpages@gmail.com>
| |   Cc: Jason Baron <jbaron@akamai.com>
| |   Cc: Andy Lutomirski <luto@amacapital.net>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
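| |   A minimal user-space sketch of the comparison call described by the list above (the slot layout mirrors the description; header placement assumed):
| |       #include <sys/types.h>
| |       #include <unistd.h>
| |       #include <sys/syscall.h>
| |       #include <linux/kcmp.h>         /* KCMP_EPOLL_TFD, struct kcmp_epoll_slot */
| |
| |       static int epoll_has_target(pid_t pid1, pid_t pid2, int epfd, int tfd, int fd)
| |       {
| |               struct kcmp_epoll_slot slot = {
| |                       .efd  = epfd,   /* epoll file descriptor */
| |                       .tfd  = tfd,    /* target file number inside the epoll */
| |                       .toff = 0,      /* offset among same-numbered targets */
| |               };
| |
| |               /* 0 => same file; 1/2 => ordering, as with other kcmp modes */
| |               return syscall(SYS_kcmp, pid1, pid2, KCMP_EPOLL_TFD, fd, &slot);
| |       }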
| * procfs: fdinfo: extend information about epoll target files  Cyrill Gorcunov  2017-05-14  2 files, -3/+11
| |   Since it is possible to have the same number in the tfd field (say a file was added, closed, then another file dup'ed to the same number and added back), it is impossible to distinguish such target files solely by their numbers. Strictly speaking, regular applications don't need to recognize these targets at all, but for checkpoint/restore sake we need to collect the targets to be able to push them back on the restore stage in a proper order.
| |   Thus let's add the file position, inode and device number where this target lies. These three fields can be used as a primary key for sorting, and together with kcmp they help CRIU find an exact file target (from the whole set of processes being checkpointed).
| |   Link: http://lkml.kernel.org/r/20170424154423.436491881@gmail.com
| |   Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
| |   Acked-by: Andrei Vagin <avagin@virtuozzo.com>
| |   Cc: Al Viro <viro@zeniv.linux.org.uk>
| |   Cc: Pavel Emelyanov <xemul@virtuozzo.com>
| |   Cc: Michael Kerrisk <mtk.manpages@gmail.com>
| |   Cc: Jason Baron <jbaron@akamai.com>
| |   Cc: Andy Lutomirski <luto@amacapital.net>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
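| |   Illustrative fdinfo output after the change (values invented; the new pos/ino/sdev triple is assumed to be appended to each existing tfd line):
| |       $ cat /proc/<pid>/fdinfo/<epfd>
| |       pos:    0
| |       flags:  02
| |       mnt_id: 9
| |       tfd:        5 events:       1d data: ffffffffffffffff  pos:0 ino:61af sdev:7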
| * scripts/gdb: add lx-fdtdump command  Peter Griffin  2017-05-14  2 files, -0/+80
| |   lx-fdtdump dumps the flattened device tree passed to the kernel from the bootloader to the filename specified as the command argument. If no argument is provided, it defaults to fdtdump.dtb. This then allows further post-processing on the machine running GDB. The fdt header is also printed in the GDB console. For example:
| |       (gdb) lx-fdtdump
| |       fdt_magic: 0xD00DFEED
| |       fdt_totalsize: 0xC108
| |       off_dt_struct: 0x38
| |       off_dt_strings: 0x3804
| |       off_mem_rsvmap: 0x28
| |       version: 17
| |       last_comp_version: 16
| |       Dumped fdt to fdtdump.dtb
| |       >fdtdump fdtdump.dtb | less
| |   This command is useful, as the bootloader can often re-write parts of the device tree, and this can sometimes cause the kernel to not boot.
| |   Link: http://lkml.kernel.org/r/1481280065-5336-2-git-send-email-kbingham@kernel.org
| |   Signed-off-by: Peter Griffin <peter.griffin@linaro.org>
| |   Signed-off-by: Kieran Bingham <kbingham@kernel.org>
| |   Cc: Jason Wessel <jason.wessel@windriver.com>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * bfs: fix sanity checks for empty files  Rakesh Pandit  2017-05-14  1 file, -1/+1
| |   Mount fails if the file system image has empty files, because of a sanity check while reading the superblock. For empty files, the disk offset to end of file (i_eoffset) is cpu_to_le32(-1). The sanity check comparison, which compares the disk offset with the file system size, isn't valid for this value, and hence is skipped for it with this patch.
| |   Steps to reproduce:
| |       $ dd if=/dev/zero of=bfs-image count=204800
| |       $ mkfs.bfs bfs-image
| |       $ mkdir bfs-mount-point
| |       $ sudo mount -t bfs -o loop bfs-image bfs-mount-point/
| |       $ cd bfs-mount-point/
| |       $ sudo touch a
| |       $ cd ..
| |       $ sudo umount bfs-mount-point/
| |       $ sudo mount -t bfs -o loop bfs-image bfs-mount-point/
| |       mount: /dev/loop0: can't read superblock
| |       $ dmesg
| |       [25526.689580] BFS-fs: bfs_fill_super(): Inode 0x00000003 corrupted
| |   Tigran said:
| |   : If you had created the filesystem with the proper mkfs under SCO UnixWare
| |   : 7 you (probably) wouldn't encounter this issue. But since commercial
| |   : Unix-es are now part of history and the only proper way is the Linux
| |   : mkfs.bfs utility, your patch is fine.
| |   Link: http://lkml.kernel.org/r/20170505201625.GA3097@hercules.tuxera.com
| |   Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
| |   Acked-by: Tigran Aivazian <aivazian.tigran@gmail.com>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
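| |   A hedged sketch of the relaxed check in bfs_fill_super() (names illustrative, not the exact hunk):
| |       /* an empty file stores i_eoffset as -1 (all ones);
| |        * don't range-check that sentinel against the fs size */
| |       if (i_eoff != (u32)-1 && i_eoff > disk_size_bytes)
| |               goto corrupted;         /* the "Inode 0x%08x corrupted" path */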
| * uapi: fix linux/sysctl.h userspace compilation errors  Dmitry V. Levin  2017-05-14  1 file, -0/+4
| |   Include <stddef.h> (guarded by #ifndef __KERNEL__) to fix the following linux/sysctl.h userspace compilation errors:
| |       /usr/include/linux/sysctl.h:38:2: error: unknown type name 'size_t'
| |         size_t *oldlenp;
| |       /usr/include/linux/sysctl.h:40:2: error: unknown type name 'size_t'
| |         size_t newlen;
| |   This also fixes userspace compilation of uapi headers that include linux/sysctl.h, e.g. linux/netfilter.h.
| |   Link: http://lkml.kernel.org/r/20170222230652.GA14373@altlinux.org
| |   Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
| |   Cc: Alexey Dobriyan <adobriyan@gmail.com>
| |   Cc: "Eric W. Biederman" <ebiederm@xmission.com>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
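| |   The fix is presumably the usual guarded include at the top of the uapi header (sketch):
| |       /* include/uapi/linux/sysctl.h */
| |       #ifndef __KERNEL__
| |       #include <stddef.h>     /* size_t for userspace builds */
| |       #endif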
| * kdump, vmcoreinfo: report actual value of phys_base  HATAYAMA Daisuke  2017-05-14  2 files, -0/+3
| |   Currently, the VMCOREINFO note reports the virtual address assigned to the symbol phys_base. But this doesn't make sense, because to refer to phys_base it is necessary to already have the value of phys_base that we are about to look up.
| |   Userland tools related to kdump, such as makedumpfile and the crash utility, have so far made some effort to calculate phys_base for crash dump formats generated by mechanisms running outside the Linux kernel, such as virtual machine hypervisors (e.g. qemu dump, which ordinary users invoke via virsh dump) or dumpers implemented in vendor-specific firmware. That is, they find a kernel datum whose virtual and physical addresses are both available via the note information and calculate phys_base from it. However, such a data structure is not one prepared for the phys_base purpose, and there is no guarantee that other crash dump mechanisms include information from which phys_base can be calculated similarly.
| |   To get VMCOREINFO from a vmcore, it's easy to use the strings and grep commands; VMCOREINFO consists of simple strings:
| |       $ strings vmcore-3.10.0-121.el7.x86_64 | grep -E ".*VMCOREINFO.*" -A 100
| |       VMCOREINFO
| |       OSRELEASE=3.10.0-121.el7.x86_64
| |       PAGESIZE=4096
| |       ...
| |   This is also useful to get the value of phys_base of a kdump 2nd kernel contained in a vmcore produced by the above-mentioned external crash dump mechanisms; the kdump 2nd kernel is an inherently relocated kernel.
| |   This commit doesn't remove the VMCOREINFO_SYMBOL(phys_base) line, because makedumpfile refers to it, and removing it would break old versions of makedumpfile.
| |   Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
| |   Cc: Eric W. Biederman <ebiederm@xmission.com>
| |   Cc: Vivek Goyal <vgoyal@redhat.com>
| |   Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
| |   Cc: Dave Anderson <anderson@redhat.com>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
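| |   A sketch of the likely shape of the change, using the existing vmcoreinfo helpers (hedged; the SYMBOL line stays for old makedumpfile):
| |       VMCOREINFO_SYMBOL(phys_base);   /* virtual address of the variable (kept) */
| |       VMCOREINFO_NUMBER(phys_base);   /* its actual value, newly exported */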
| * kdump: protect vmcoreinfo data under the crash memory  Xunlei Pang  2017-05-14  6 files, -2/+74
| |   Currently, vmcoreinfo data is updated at boot time via subsys_initcall(); it carries the risk of being modified by some wrong code while the system is running. As a result, the dumped vmcore may contain wrong vmcoreinfo. Later on, when using utilities such as "crash" or "makedumpfile" to parse this vmcore, we will probably get "Segmentation fault" or other unexpected errors.
| |   E.g. 1) wrong code overwrites vmcoreinfo_data; 2) it further crashes the system; 3) kdump is triggered, and then we obviously fail to recognize the crash context correctly due to the corrupted vmcoreinfo.
| |   Now, except for vmcoreinfo, all the crash data is well protected (including the cpu note, which is fully updated in the crash path, so its correctness is guaranteed). Given that vmcoreinfo data is a large chunk prepared for kdump, we had better protect it as well.
| |   To solve this, we relocate and copy vmcoreinfo_data to the crash memory when kdump is loaded via the kexec syscalls. Because the whole crash memory is protected by the existing arch_kexec_protect_crashkres() mechanism, we naturally protect vmcoreinfo_data from write (even read) access under the kernel direct mapping after kdump is loaded. Since kdump is usually loaded at a very early stage after boot, we can trust the correctness of the copied vmcoreinfo data.
| |   On the other hand, we still need to operate on the safe vmcoreinfo copy when a crash happens, to generate vmcoreinfo_note again; we rely on vmap() to map out a new kernel virtual address and use this new one instead in the subsequent crash_save_vmcoreinfo().
| |   BTW, we do not touch vmcoreinfo_note, because it will be fully regenerated from the protected vmcoreinfo_data after the crash, which is surely correct, just like the cpu crash note.
| |   Link: http://lkml.kernel.org/r/1493281021-20737-3-git-send-email-xlpang@redhat.com
| |   Signed-off-by: Xunlei Pang <xlpang@redhat.com>
| |   Tested-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
| |   Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
| |   Cc: Dave Young <dyoung@redhat.com>
| |   Cc: Eric Biederman <ebiederm@xmission.com>
| |   Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
| |   Cc: Juergen Gross <jgross@suse.com>
| |   Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * powerpc/fadump: use the correct VMCOREINFO_NOTE_SIZE for phdr  Xunlei Pang  2017-05-14  3 files, -5/+2
| |   vmcoreinfo_max_size stands for the size of vmcoreinfo_data; the correct one to use here is vmcoreinfo_note, whose total size is VMCOREINFO_NOTE_SIZE. As explained in commit 77019967f06b ("kdump: fix exported size of vmcoreinfo note"), this should not affect the actual function, but we had better fix it; the change is safe and backward compatible.
| |   After this, we can get rid of the variable vmcoreinfo_max_size and use the corresponding macros directly; fewer variables means more safety for vmcoreinfo operation.
| |   Link: http://lkml.kernel.org/r/1493281021-20737-2-git-send-email-xlpang@redhat.com
| |   Signed-off-by: Xunlei Pang <xlpang@redhat.com>
| |   Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
| |   Reviewed-by: Dave Young <dyoung@redhat.com>
| |   Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
| |   Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
| |   Cc: Eric Biederman <ebiederm@xmission.com>
| |   Cc: Juergen Gross <jgross@suse.com>
| |   Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * kexec: move vmcoreinfo out of the kernel's .bss section  Xunlei Pang  2017-05-14  8 files, -21/+29
| |   As Eric said, "what we need to do is move the variable vmcoreinfo_note out of the kernel's .bss section. And modify the code to regenerate and keep this information in something like the control page. Definitely something like this needs a page all to itself, and ideally far away from any other kernel data structures. I clearly was not watching closely the day someone decided to keep this silly thing in the kernel's .bss section."
| |   This patch allocates extra pages for these vmcoreinfo_XXX variables. One advantage is that it enhances the safety of vmcoreinfo, because vmcoreinfo is now kept far away from other kernel data structures.
| |   Link: http://lkml.kernel.org/r/1493281021-20737-1-git-send-email-xlpang@redhat.com
| |   Signed-off-by: Xunlei Pang <xlpang@redhat.com>
| |   Tested-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
| |   Reviewed-by: Juergen Gross <jgross@suse.com>
| |   Suggested-by: Eric Biederman <ebiederm@xmission.com>
| |   Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
| |   Cc: Dave Young <dyoung@redhat.com>
| |   Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
| |   Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * fs, epoll: short circuit fetching events if thread has been killed  David Rientjes  2017-05-14  1 file, -0/+10
| |   We've encountered zombies that are waiting for a thread to exit and that loop in ep_poll() almost endlessly, although there is a pending SIGKILL as a result of a group exit. This happens because we always find ep_events_available() and fetch more events, and never get to the signal_pending() check that would break from the loop and return -EINTR.
| |   Special-case fatal signals and break immediately, so that we guarantee a timely exit instead of looping to fetch more events and delaying it.
| |   It would also be possible to simply move the check for signal_pending() above the check for ep_events_available(), but there have been no reports of delayed signal handling, other than SIGKILL preventing zombies from exiting, that would be fixed by this.
| |   It fixes an issue for us where we have witnessed zombies sticking around for at least O(minutes); but considering the code has been like this forever, and nobody else has complained as far as I can find, I would simply queue it up for 4.12.
| |   Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705031722350.76784@chino.kir.corp.google.com
| |   Signed-off-by: David Rientjes <rientjes@google.com>
| |   Cc: Alexander Viro <viro@zeniv.linux.org.uk>
| |   Cc: Jan Kara <jack@suse.cz>
| |   Cc: Davide Libenzi <davidel@xmailserver.org>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
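| |   A sketch of the short circuit in ep_poll()'s fetch loop (hedged, but this is the natural shape of the check):
| |       if (fatal_signal_pending(current)) {
| |               /* do not loop for more available events; die promptly */
| |               res = -EINTR;
| |               break;
| |       }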
| * mm/vmstat.c: walk the zone in pageblock_nr_pages steps  zhong jiang  2017-05-14  1 file, -1/+1
| |   When walking the zone we can run into holes. We should not align to MAX_ORDER_NR_PAGES there, since that can skip over normal memory. In addition, pagetypeinfo_showmixedcount_print() reflects fragmentation, and we want more accurate data from it; therefore, walk the zone in pageblock_nr_pages steps.
| |   Link: http://lkml.kernel.org/r/1469502526-24486-2-git-send-email-zhongjiang@huawei.com
| |   Signed-off-by: zhong jiang <zhongjiang@huawei.com>
| |   Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
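| |   A sketch of the intended walk (hedged; surrounding code abbreviated):
| |       /* step pageblock-wise; skip holes without jumping a whole MAX_ORDER block */
| |       for (pfn = zone->zone_start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
| |               if (!pfn_valid(pfn))
| |                       continue;
| |               page = pfn_to_page(pfn);
| |               /* ... classify this pageblock's migratetype ... */
| |       }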
| * mm/page_owner: align with pageblock_nr_pages  zhong jiang  2017-05-14  1 file, -1/+1
| |   When pfn_valid(pfn) returns false in init_pages_in_zone(), pfn should be aligned to pageblock_nr_pages rather than MAX_ORDER_NR_PAGES, because the skipped 2M may contain valid pfns; otherwise the early-allocated count will not be accurate.
| |   Link: http://lkml.kernel.org/r/1468938136-24228-1-git-send-email-zhongjiang@huawei.com
| |   Signed-off-by: zhong jiang <zhongjiang@huawei.com>
| |   Cc: Michal Hocko <mhocko@kernel.org>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm: vmscan: do not pass reclaimed slab to vmpressure  Vinayak Menon  2017-05-14  1 file, -5/+12
| |   During global reclaim, the nr_reclaimed passed to vmpressure includes the pages reclaimed from slab, but the corresponding scanned slab pages are not passed. This has an impact on the vmpressure values. While moving from kernel version 3.18 to 4.4, a difference is seen in the vmpressure values for the same workload, resulting in different behaviour of the vmpressure consumer. One such case is a vmpressure-based lowmemorykiller. It is observed that vmpressure events are received late and in smaller numbers, resulting in tasks not being killed at the right time. In this use case, the number of critical vmpressure events received is around 50% lower on 4.4 than on 3.18.
| |   The following numbers show the impact on reclaim activity due to the changed behaviour of lowmemorykiller on a 4GB device. The test launches a number of apps in sequence and repeats it multiple times. The difference in reclaim behaviour comes from fewer kills, and kills happening late, resulting in more swapping and page cache reclaim.
| |                                v4.4        v3.18
| |       pgpgin                   163016456   145617236
| |       pgpgout                  4366220     4188004
| |       workingset_refault       29857868    26781854
| |       workingset_activate      6293946     5634625
| |       pswpin                   1327601     1133912
| |       pswpout                  3593842     3229602
| |       pgalloc_dma              99520618    94402970
| |       pgalloc_normal           104046854   98124798
| |       pgfree                   203772640   192600737
| |       pgmajfault               2126962     1851836
| |       pgsteal_kswapd_dma       19732899    18039462
| |       pgsteal_kswapd_normal    19945336    17977706
| |       pgsteal_direct_dma       206757      131376
| |       pgsteal_direct_normal    236783      138247
| |       pageoutrun               116622      108370
| |       allocstall               7220        4684
| |       compact_stall            931         856
| |   The lowmemorykiller example above is just to indicate the difference in vmpressure events between 4.4 and 3.18.
| |   Do not consider reclaimed slab pages for the vmpressure calculation. The pages reclaimed from slab can be excluded because the freeing of a page by slab shrinking depends on each slab's object population, making the cost model (i.e. scan:free) different from that of the LRU. Also, not every shrinker accounts for the pages it reclaims. Ideally, the pages reclaimed from slab should be passed to vmpressure along with the scanned count, otherwise higher vmpressure levels can be triggered even when there is reclaim progress. But accounting only the reclaimed slab pages without the scanned ones, i.e. adding something which does not fit the cost model, just adds noise to the vmpressure values.
| |   Fixes: 6b4f7799c6a5 ("mm: vmscan: invoke slab shrinkers from shrink_zone()")
| |   Link: http://lkml.kernel.org/r/1486641577-11685-2-git-send-email-vinmenon@codeaurora.org
| |   Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
| |   Acked-by: Minchan Kim <minchan@kernel.org>
| |   Cc: Johannes Weiner <hannes@cmpxchg.org>
| |   Cc: Mel Gorman <mgorman@techsingularity.net>
| |   Cc: Vlastimil Babka <vbabka@suse.cz>
| |   Cc: Michal Hocko <mhocko@suse.com>
| |   Cc: Rik van Riel <riel@redhat.com>
| |   Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
| |   Cc: Anton Vorontsov <anton.vorontsov@linaro.org>
| |   Cc: Shiraz Hashim <shashim@codeaurora.org>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
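| |   A hedged sketch of where the numbers diverge in shrink_node() (exact placement assumed):
| |       /* remember LRU-only progress before folding in slab reclaim */
| |       nr_node_reclaimed = sc->nr_reclaimed - nr_reclaimed;
| |       if (reclaim_state) {
| |               sc->nr_reclaimed += reclaim_state->reclaimed_slab;
| |               reclaim_state->reclaimed_slab = 0;
| |       }
| |       /* vmpressure's scan:free cost model only fits LRU pages */
| |       vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
| |                  sc->nr_scanned - nr_scanned, nr_node_reclaimed);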
| * mm/page_alloc: return 0 in case this node has no page within the zone  Wei Yang  2017-05-14  1 file, -0/+5
| |   The whole memory space is divided into several zones, and a node may have no pages in some of them. In that case, __absent_pages_in_range() would return 0, since the range it is searching is empty. This happens more often on nodes with a higher memory range when there are more nodes, which is a trend for future architectures.
| |   This patch checks the zone range after clamping and adjustment, and returns 0 if the range is empty.
| |   Link: http://lkml.kernel.org/r/20170206154314.15705-1-richard.weiyang@gmail.com
| |   Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
| |   Cc: Vlastimil Babka <vbabka@suse.cz>
| |   Cc: Mel Gorman <mgorman@techsingularity.net>
| |   Cc: Michal Hocko <mhocko@suse.com>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
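| |   A sketch of the check (hedged; actual variable names may differ):
| |       /* after clamping the zone range to the node range ... */
| |       if (zone_start_pfn >= zone_end_pfn)
| |               return 0;       /* this node has no pages within the zone */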
| * mm/vmstat.c: standardize file operations variable names  Anshuman Khandual  2017-05-14  1 file, -8/+8
| |   Standardize the file operations variable names related to all four memory management /proc interface files. Also change all the symbol permissions (S_IRUGO) into octal permissions (0444), as they drew complaints from checkpatch.pl. This does not create any functional change to the interface.
| |   Link: http://lkml.kernel.org/r/20170427030632.8588-1-khandual@linux.vnet.ibm.com
| |   Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * zram: compare all the entries with same checksum for deduplication  Joonsoo Kim  2017-05-14  1 file, -12/+47
| |   Until now, we compared just one entry with the same checksum when checking for duplication, since that is the simplest way to implement it. However, for completeness, checking all the entries is better, so this patch implements comparing all the entries with the same checksum. Since this event should be rare, there should be no performance loss.
| |   Link: http://lkml.kernel.org/r/1494556204-25796-5-git-send-email-iamjoonsoo.kim@lge.com
| |   Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
| |   Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
| |   Acked-by: Minchan Kim <minchan@kernel.org>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * zram: make deduplication feature optional  Joonsoo Kim  2017-05-14  8 files, -12/+144
| |   The benefit of deduplication depends on the workload, so it's not preferable to always enable it. Therefore, make it optional via Kconfig and a device param; the default is 'off'. This option will be beneficial for users who use zram as a blockdev and store build output on it.
| |   Link: http://lkml.kernel.org/r/1494556204-25796-4-git-send-email-iamjoonsoo.kim@lge.com
| |   Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
| |   Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
| |   Acked-by: Minchan Kim <minchan@kernel.org>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * zram: implement deduplication in zram  Joonsoo Kim  2017-05-14  6 files, -7/+278
| |   Implement a deduplication feature in zram. The purpose of this work is naturally to save the amount of memory used by zram. Android is one of the biggest users of zram as swap, and it's really important there to save memory. There is a paper reporting that the duplication ratio of Android's memory content is rather high [1]. And there is similar work on zswap, which reports that experiments have shown that around 10-15% of pages stored in zswap are duplicates, and deduplicating them provides some benefit [2].
| |   There is also a different kind of workload that uses zram as a blockdev and stores build output on it to reduce the wear-out problem of a real blockdev. In this workload, the deduplication hit rate is very high due to temporary files and intermediate object files. A detailed analysis is at the bottom of this description.
| |   Anyway, if we can detect duplicated content and avoid storing it at different memory locations, we can save memory. This patch tries to do that.
| |   The implementation is simple and intuitive, but I should note one implementation detail. To check duplication, this patch uses a checksum of the page, and collisions of this checksum are possible. There would be many ways to handle this situation, but this patch chooses to allow entries with a duplicated checksum to be added to the hash, while not comparing all entries with a duplicated checksum when checking for duplication. I guess that checksum collision is a quite rare event and we don't need to pay much attention to such a case; therefore, I decided on the simplest way to implement the feature. If there is a different opinion, I can accept it and go that way.
| |   Following are the results with this patch.
| |   Test result #1 (Swap): Android Marshmallow, emulator, x86_64, backported to kernel v3.18
| |       orig_data_size: 145297408
| |       compr_data_size: 32408125
| |       mem_used_total: 32276480
| |       dup_data_size: 3188134
| |       meta_data_size: 1444272
| |   The last two metrics added to mm_stat are related to this work. The first, dup_data_size, is the amount of memory saved by not storing duplicated pages. The second, meta_data_size, is the amount of data structure required to support deduplication. If dup > meta, we can judge that the patch improves memory usage. In Android, we can save 5% of memory usage with this work.
| |   Test result #2 (Blockdev): build the kernel and store output to ext4 FS on zram
| |       <no-dedup>
| |       Elapsed time: 249 s
| |       mm_stat: 430845952 191014886 196898816 0 196898816 28320 0 0 0
| |       <dedup>
| |       Elapsed time: 250 s
| |       mm_stat: 430505984 190971334 148365312 0 148365312 28404 0 47287038 3945792
| |   There is no performance degradation, and 23% of memory is saved.
| |   Test result #3 (Blockdev): copy android build output dir (out/host) to ext4 FS on zram
| |       <no-dedup>
| |       Elapsed time: out/host: 88 s
| |       mm_stat: 8834420736 3658184579 3834208256 0 3834208256 32889 0 0 0
| |       <dedup>
| |       Elapsed time: out/host: 100 s
| |       mm_stat: 8832929792 3657329322 2832015360 0 2832015360 32609 0 952568877 80880336
| |   It shows a performance degradation of roughly 13% and saves 24% of memory. Maybe that is due to the overhead of calculating the checksum and the comparison.
| |   Test result #4 (Blockdev): copy android build output dir (out/target/common) to ext4 FS on zram
| |       <no-dedup>
| |       Elapsed time: out/host: 203 s
| |       mm_stat: 4041678848 2310355010 2346577920 0 2346582016 500 4 0 0
| |       <dedup>
| |       Elapsed time: out/host: 201 s
| |       mm_stat: 4041666560 2310488276 1338150912 0 1338150912 476 0 989088794 24564336
| |   Memory is saved by 42% and performance is the same. Even with the overhead of checksum calculation and comparison, the large hit ratio compensates for it, since a hit leads to fewer compression attempts.
| |   I checked the detailed reasons for the savings on the kernel build workload; there are several cases where deduplication happens:
| |   1) *.cmd - Build commands are usually similar within one directory, so the contents of these files are very similar. On my system, more than 789 lines in fs/ext4/.namei.o.cmd and fs/ext4/.inode.o.cmd are the same, out of 944 and 938 lines of the files, respectively.
| |   2) intermediate object files - built-in.o and temporary object files have similar contents. More than 50% of fs/ext4/ext4.o is the same as fs/ext4/built-in.o.
| |   3) vmlinux - .tmp_vmlinux1, .tmp_vmlinux2 and arch/x86/boot/compressed/vmlinux.bin have similar contents.
| |   The Android test has a similar case: some object files (.class and .so) are similar to one another (./host/linux-x86/lib/libartd.so and ./host/linux-x86/lib/libartd-compiler.so).
| |   Anyway, the benefit seems to be largely dependent on the workload, so the following patch makes this feature optional. However, this feature can help some use cases, so it deserves to be merged.
| |   [1]: MemScope: Analyzing Memory Duplication on Android Systems, dl.acm.org/citation.cfm?id=2797023
| |   [2]: zswap: Optimize compressed pool memory utilization, lkml.kernel.org/r/1341407574.7551.1471584870761.JavaMail.weblogic@epwas3p2
| |   Link: http://lkml.kernel.org/r/1494556204-25796-3-git-send-email-iamjoonsoo.kim@lge.com
| |   Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
| |   Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
| |   Acked-by: Minchan Kim <minchan@kernel.org>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
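| |   A hedged sketch of the dedup store path; every name here is hypothetical, since the indirection layer comes from the zram_entry patch below:
| |       u32 checksum = zram_calc_checksum(mem);         /* hash of the page content */
| |       struct zram_entry *entry;
| |
| |       entry = zram_hash_lookup(zram, checksum);
| |       if (entry && zram_same_content(entry, mem)) {
| |               entry->refcount++;                      /* share one zsmalloc handle */
| |               return entry;
| |       }
| |       return zram_entry_alloc(zram, checksum, mem);   /* fall back to a fresh store */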
| * zram: introduce zram_entry to prepare dedup functionality  Joonsoo Kim  2017-05-14  2 files, -35/+66
| |   The following patch will implement deduplication functionality in zram, and it requires an indirection layer to manage the life cycle of zsmalloc handles. To prepare for that, this patch introduces zram_entry, which can be used to manage the life cycle of a zsmalloc handle. Many lines change due to renaming, but the core change is just the simple introduction of the new data structure.
| |   Link: http://lkml.kernel.org/r/1494556204-25796-2-git-send-email-iamjoonsoo.kim@lge.com
| |   Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
| |   Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
| |   Acked-by: Minchan Kim <minchan@kernel.org>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * ksm: fix use after free with merge_across_nodes = 0  Andrea Arcangeli  2017-05-14  1 file, -11/+55
| |   If merge_across_nodes was manually set to 0 (not the default value) by the admin or a tuned profile on NUMA systems triggering cross-node page migrations, a stable_node use-after-free could materialize.
| |   If the chain is collapsed, stable_node would point to the old chain that was already freed. stable_node_dup would be the stable_node dup, now converted to a regular stable_node and indexed in the rbtree in replacement of the freed stable_node chain (no longer a dup).
| |   This special case, where the chain is collapsed in the NUMA replacement path, is now detected by setting stable_node to NULL in the chain_prune callee if it decides to collapse the chain. This tells the NUMA replacement code that, even if stable_node and stable_node_dup are different, this is not a chain if stable_node is NULL, as the stable_node_dup was converted to a regular stable_node and the chain was collapsed.
| |   It is generally safer for the callee to force the caller's stable_node to NULL the moment it becomes stale, so any other mistake like this would result in an instant oops, which is easier to debug than a use after free. Otherwise the replace logic would act as if stable_node were a valid chain, when in fact it was freed. Notably, stable_node_chain_add_dup(page_node, stable_node) would run on a stale stable_node.
| |   Andrey Ryabinin found the source of the use after free in chain_prune().
| |   Link: http://lkml.kernel.org/r/20170512193805.8807-2-aarcange@redhat.com
| |   Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
| |   Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
| |   Reported-by: Evgheni Dereveanchin <ederevea@redhat.com>
| |   Cc: Petr Holasek <pholasek@redhat.com>
| |   Cc: Hugh Dickins <hughd@google.com>
| |   Cc: Davidlohr Bueso <dave@stgolabs.net>
| |   Cc: Arjan van de Ven <arjan@linux.intel.com>
| |   Cc: Gavin Guo <gavin.guo@canonical.com>
| |   Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
| |   Cc: Mel Gorman <mgorman@techsingularity.net>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * ksm: introduce ksm_max_page_sharing per page deduplication limit  Andrea Arcangeli  2017-05-14  2 files, -66/+730
| |   Without a max deduplication limit for each KSM page, the list of rmap_items associated with each stable_node can grow infinitely large. During the rmap walk, each entry can take up to ~10usec to process because of IPIs for the TLB flushing (both for the primary MMU and the secondary MMUs with the MMU notifier). With only 16GB of address space shared in the same KSM page, that would amount to dozens of seconds of kernel runtime.
| |   A ~256 max deduplication factor will reduce the latencies of the rmap walks on KSM pages to the order of a few msec. Just doing the cond_resched() during the rmap walks is not enough; the list size must have a limit too, otherwise the caller could get blocked in (schedule-friendly) kernel computations for seconds, unexpectedly.
| |   There's room for optimization to significantly reduce the IPI delivery cost during page_referenced(), but at least for page migration in the KSM case (used by hard NUMA bindings, compaction and NUMA balancing), it may be inevitable to send lots of IPIs if each rmap_item->mm is active on a different CPU and there are lots of CPUs. Even if we ignore the IPI delivery cost, we still have to walk the whole KSM rmap list, so we can't allow millions or billions (unlimited) of entries in the KSM stable_node rmap_item lists.
| |   The limit is enforced efficiently by adding a second dimension to the stable rbtree. So there are three types of stable_nodes: the regular ones (identical to before, living in the first flat dimension of the stable rbtree), the "chains" and the "dups". Every "chain" and all "dups" linked into a "chain" enforce the invariant that they represent the same write-protected memory content, even if each "dup" will be pointed to by a different KSM page copy of that content. This way the stable rbtree lookup computational complexity is unaffected compared to an unlimited max_sharing_limit. It is still enforced that there cannot be KSM page content duplicates in the stable rbtree itself.
| |   Adding the second dimension to the stable rbtree only after the max_page_sharing limit hits provides a zero memory footprint increase on 64-bit archs. The memory overhead of the per-KSM-page stable_tree and per-virtual-mapping rmap_item is unchanged. Only after the max_page_sharing limit hits do we need to allocate a stable_tree "chain" and rb_replace() the "regular" stable_node with the newly allocated stable_node "chain". After that, we simply add the "regular" stable_node to the chain as a stable_node "dup" by linking hlist_dup into the stable_node_chain->hlist. This way the "regular" (flat) stable_node is converted to a stable_node "dup" living in the second dimension of the stable rbtree.
| |   During stable rbtree lookups, a stable_node "chain" is identified by stable_node->rmap_hlist_len == STABLE_NODE_CHAIN (aka is_stable_node_chain()). When dropping stable_nodes, a stable_node "dup" is identified by stable_node->head == STABLE_NODE_DUP_HEAD (aka is_stable_node_dup()).
| |   STABLE_NODE_DUP_HEAD must be a unique valid pointer, never used elsewhere in any stable_node->head/node, to avoid clashes with the stable_node->node.rb_parent_color pointer, and different from &migrate_nodes. So the second field of &migrate_nodes is picked and verified as always safe with a BUILD_BUG_ON, in case the list_head implementation changes in the future.
| |   STABLE_NODE_DUP is picked as a random negative value for stable_node->rmap_hlist_len; rmap_hlist_len cannot become negative when it's a "regular" stable_node or a stable_node "dup". The stable_node_chain->nid is irrelevant. The stable_node_chain->kpfn is aliased in a union with a time field used to rate-limit the stable_node_chain->hlist prunes.
| |   The garbage collection of the stable_node_chain happens lazily during stable rbtree lookups (as for all other kinds of stable_nodes), or while disabling KSM with "echo 2 >/sys/kernel/mm/ksm/run" while collecting the entire stable rbtree.
| |   While the "regular" stable_nodes and the stable_node "dups" must wait for their underlying tree_page to be freed before they can be freed themselves, the stable_node "chains" can be freed immediately if the stable_node->hlist turns empty. This is because the "chains" are never pointed to by any page->mapping; they are effectively stable-rbtree-internal, self-contained KSM metadata.
| |   [akpm@linux-foundation.org: fix non-NUMA build]
| |   Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
| |   Tested-by: Petr Holasek <pholasek@redhat.com>
| |   Cc: Hugh Dickins <hughd@google.com>
| |   Cc: Davidlohr Bueso <dave@stgolabs.net>
| |   Cc: Arjan van de Ven <arjan@linux.intel.com>
| |   Cc: Evgheni Dereveanchin <ederevea@redhat.com>
| |   Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
| |   Cc: Gavin Guo <gavin.guo@canonical.com>
| |   Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
| |   Cc: Mel Gorman <mgorman@techsingularity.net>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
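| |   Hedged usage sketch of the new knob (the name follows the patch title; the default limit is presumably the ~256 factor mentioned above):
| |       # cap how many rmap_items may share a single KSM page
| |       echo 256 > /sys/kernel/mm/ksm/max_page_sharing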
| * mm/nobootmem.c: return 0 when start_pfn equals end_pfn  Wei Yang  2017-05-14  1 file, -1/+1
| |   When start_pfn equals end_pfn, __free_pages_memory() has no effect and __free_memory_core() will finally return (end_pfn - start_pfn) = 0. This patch returns 0 directly when start_pfn equals end_pfn.
| |   Link: http://lkml.kernel.org/r/20170502131115.6650-1-richard.weiyang@gmail.com
| |   Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm/vmscan.c: fix unsequenced modification and access warning  Nick Desaulniers  2017-05-14  1 file, -7/+6
| |   Clang and its -Wunsequenced emits a warning:
| |       mm/vmscan.c:2961:25: error: unsequenced modification and access to 'gfp_mask' [-Wunsequenced]
| |                       .gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
| |                                   ^
| |   While it is not clear to me whether the initialization code violates the specification (6.7.8 par 19 (ISO/IEC 9899) looks like it disagrees), the code is quite confusing and worth cleaning up anyway. Fix this by reusing sc.gfp_mask rather than the updated input gfp_mask parameter.
| |   Link: http://lkml.kernel.org/r/20170510154030.10720-1-nick.desaulniers@gmail.com
| |   Signed-off-by: Nick Desaulniers <nick.desaulniers@gmail.com>
| |   Acked-by: Michal Hocko <mhocko@suse.com>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
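| |   A sketch of the fix (hedged; other initializers elided): set the field once, then consistently read sc.gfp_mask:
| |       struct scan_control sc = {
| |               .nr_to_reclaim = SWAP_CLUSTER_MAX,
| |               .gfp_mask = current_gfp_context(gfp_mask),  /* no inner assignment */
| |               .order = order,
| |       };
| |
| |       /* later users read the struct field rather than the stale parameter */
| |       throttle_direct_reclaim(sc.gfp_mask, zonelist, nodemask);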
| * mm/mmap.c: mark protection_map as __ro_after_init  Daniel Micay  2017-05-14  1 file, -1/+1
| |   The protection map is only modified by per-arch init code, so it can be protected from writes after the init code runs. This change was extracted from PaX, where it's part of KERNEXEC.
| |   Link: http://lkml.kernel.org/r/20170510174441.26163-1-danielmicay@gmail.com
| |   Signed-off-by: Daniel Micay <danielmicay@gmail.com>
| |   Acked-by: Kees Cook <keescook@chromium.org>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm, sparsemem: break out of loops early  Dave Hansen  2017-05-14  3 files, -14/+52
| |   There are a number of places where we loop over NR_MEM_SECTIONS, looking for section_present() on each section. But when we have very large physical address spaces (large MAX_PHYSMEM_BITS), NR_MEM_SECTIONS becomes very large, making the loops quite long.
| |   With MAX_PHYSMEM_BITS=46 and a section size of 128MB, the current loops are 512k iterations, which we barely notice on modern hardware. But raising MAX_PHYSMEM_BITS higher (as we will see on systems that support 5-level paging) makes this 64x longer, and we start to notice, especially on slower systems like simulators. A 10-second delay for 512k iterations is annoying; a 640-second delay is crippling.
| |   This does not help if we have extremely sparse physical address spaces, but those are quite rare. We expect that most of the "slow" systems where this matters will also be quite small and non-sparse.
| |   To fix this, we track the highest section we've ever encountered. This lets us know when we will *never* see another section_present(), and lets us break out of the loops earlier. Doing the whole for_each_present_section_nr() macro is probably overkill, but it will ensure that any future loop iterations that we grow are more likely to be correct.
| |   Kirill said "It shaved almost 40 seconds from boot time in qemu with 5-level paging enabled for me".
| |   Link: http://lkml.kernel.org/r/20170504174434.C45A4735@viggo.jf.intel.com
| |   Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
| |   Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
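| |   A hedged sketch of the tracking and the loop helper, modeled on the description (next_present_section_nr() assumed to return -1 past the end):
| |       static int __highest_present_section_nr;
| |
| |       static void section_mark_present(struct mem_section *ms)
| |       {
| |               int section_nr = __section_nr(ms);
| |
| |               if (section_nr > __highest_present_section_nr)
| |                       __highest_present_section_nr = section_nr;
| |               ms->section_mem_map |= SECTION_MARKED_PRESENT;
| |       }
| |
| |       #define for_each_present_section_nr(start, section_nr)         \
| |               for (section_nr = next_present_section_nr((start) - 1);\
| |                    section_nr != -1 &&                               \
| |                    section_nr <= __highest_present_section_nr;       \
| |                    section_nr = next_present_section_nr(section_nr))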
| * mm/slub.c: wrap kmem_cache->cpu_partial in config CONFIG_SLUB_CPU_PARTIAL  Wei Yang  2017-05-14  2 files, -31/+51
| |   kmem_cache->cpu_partial is only used when CONFIG_SLUB_CPU_PARTIAL is set, so wrapping it in that config option saves some space on 32-bit arches. This patch wraps kmem_cache->cpu_partial in CONFIG_SLUB_CPU_PARTIAL and wraps its sysfs handling too.
| |   Link: http://lkml.kernel.org/r/20170502144533.10729-4-richard.weiyang@gmail.com
| |   Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
| |   Cc: Christoph Lameter <cl@linux.com>
| |   Cc: Pekka Enberg <penberg@kernel.org>
| |   Cc: David Rientjes <rientjes@google.com>
| |   Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm-slub-wrap-cpu_slab-partial-in-config_slub_cpu_partial-fix  Andrew Morton  2017-05-14  1 file, -4/+6
| |   avoid strange 80-col tricks
| |   Cc: Christoph Lameter <cl@linux.com>
| |   Cc: David Rientjes <rientjes@google.com>
| |   Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
| |   Cc: Pekka Enberg <penberg@kernel.org>
| |   Cc: Wei Yang <richard.weiyang@gmail.com>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm/slub.c: wrap cpu_slab->partial in CONFIG_SLUB_CPU_PARTIAL  Wei Yang  2017-05-14  2 files, -7/+28
| |   cpu_slab's partial field is only used when CONFIG_SLUB_CPU_PARTIAL is set, which means we can save a pointer's worth of space per cpu for every slub cache when it is not. This patch wraps cpu_slab->partial in CONFIG_SLUB_CPU_PARTIAL and wraps its sysfs use too.
| |   Link: http://lkml.kernel.org/r/20170502144533.10729-3-richard.weiyang@gmail.com
| |   Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
| |   Cc: Christoph Lameter <cl@linux.com>
| |   Cc: Pekka Enberg <penberg@kernel.org>
| |   Cc: David Rientjes <rientjes@google.com>
| |   Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
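| |   A sketch of the wrapped field (hedged; other members of the real struct omitted):
| |       struct kmem_cache_cpu {
| |               void **freelist;        /* pointer to next available object */
| |               unsigned long tid;      /* globally unique transaction id */
| |               struct page *page;      /* the slab we are allocating from */
| |       #ifdef CONFIG_SLUB_CPU_PARTIAL
| |               struct page *partial;   /* partially allocated frozen slabs */
| |       #endif
| |       };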
| * mm/slub.c: pack red_left_pad with another int to save a word  Wei Yang  2017-05-14  1 file, -1/+1
| |   Patch series "try to save some memory for kmem_cache in some cases", v2.
| |   kmem_cache is a frequently used data structure in the kernel. During code reading, I found that maybe we could save some space in some cases:
| |   1. On a 64-bit arch, type int will occupy a whole word if it doesn't sit well.
| |   2. cpu_slab->partial is only used when CONFIG_SLUB_CPU_PARTIAL is set.
| |   3. cpu_partial is only used when CONFIG_SLUB_CPU_PARTIAL is set; wrapping it saves some space on 32-bit arches.
| |   This patch (of 3):
| |   On a 64-bit arch, structs are 8-byte aligned, so an int will occupy a whole word if it doesn't sit well. This patch packs red_left_pad with reserved to save 8 bytes for struct kmem_cache on a 64-bit arch.
| |   Link: http://lkml.kernel.org/r/20170502144533.10729-2-richard.weiyang@gmail.com
| |   Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
| |   Cc: Christoph Lameter <cl@linux.com>
| |   Cc: Pekka Enberg <penberg@kernel.org>
| |   Cc: David Rientjes <rientjes@google.com>
| |   Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm/slub: reset cpu_slab's pointer in deactivate_slab()  Wei Yang  2017-05-14  1 file, -13/+8
| |   Each time a slab is deactivated, the page and freelist pointers should be reset. This patch merges these two operations into deactivate_slab().
| |   Link: http://lkml.kernel.org/r/20170507031215.3130-2-richard.weiyang@gmail.com
| |   Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
| |   Cc: Christoph Lameter <cl@linux.com>
| |   Cc: Pekka Enberg <penberg@kernel.org>
| |   Cc: David Rientjes <rientjes@google.com>
| |   Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
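| |   A hedged before/after sketch of a call site (the extra parameter is an assumption about the new signature):
| |       /* before: every caller reset the cpu slab by hand */
| |       deactivate_slab(s, c->page, c->freelist);
| |       c->page = NULL;
| |       c->freelist = NULL;
| |
| |       /* after: pass the cpu slab in and let deactivate_slab() reset it */
| |       deactivate_slab(s, c->page, c->freelist, c);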