From afaef01c001537fa97a25092d7f54d764dc7d8c1 Mon Sep 17 00:00:00 2001 From: Alexander Popov Date: Fri, 17 Aug 2018 01:16:58 +0300 Subject: x86/entry: Add STACKLEAK erasing the kernel stack at the end of syscalls The STACKLEAK feature (initially developed by PaX Team) has the following benefits: 1. Reduces the information that can be revealed through kernel stack leak bugs. The idea of erasing the thread stack at the end of syscalls is similar to CONFIG_PAGE_POISONING and memzero_explicit() in kernel crypto, which all comply with FDP_RIP.2 (Full Residual Information Protection) of the Common Criteria standard. 2. Blocks some uninitialized stack variable attacks (e.g. CVE-2017-17712, CVE-2010-2963). That kind of bugs should be killed by improving C compilers in future, which might take a long time. This commit introduces the code filling the used part of the kernel stack with a poison value before returning to userspace. Full STACKLEAK feature also contains the gcc plugin which comes in a separate commit. The STACKLEAK feature is ported from grsecurity/PaX. More information at: https://grsecurity.net/ https://pax.grsecurity.net/ This code is modified from Brad Spengler/PaX Team's code in the last public patch of grsecurity/PaX based on our understanding of the code. Changes or omissions from the original code are ours and don't reflect the original grsecurity/PaX code. Performance impact: Hardware: Intel Core i7-4770, 16 GB RAM Test #1: building the Linux kernel on a single core 0.91% slowdown Test #2: hackbench -s 4096 -l 2000 -g 15 -f 25 -P 4.2% slowdown So the STACKLEAK description in Kconfig includes: "The tradeoff is the performance impact: on a single CPU system kernel compilation sees a 1% slowdown, other systems and workloads may vary and you are advised to test this feature on your expected workload before deploying it". Signed-off-by: Alexander Popov Acked-by: Thomas Gleixner Reviewed-by: Dave Hansen Acked-by: Ingo Molnar Signed-off-by: Kees Cook --- Documentation/x86/x86_64/mm.txt | 2 ++ 1 file changed, 2 insertions(+) (limited to 'Documentation') diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt index 5432a96d31ffd..600bc2afa27d6 100644 --- a/Documentation/x86/x86_64/mm.txt +++ b/Documentation/x86/x86_64/mm.txt @@ -24,6 +24,7 @@ ffffffffa0000000 - fffffffffeffffff (1520 MB) module mapping space [fixmap start] - ffffffffff5fffff kernel-internal fixmap range ffffffffff600000 - ffffffffff600fff (=4 kB) legacy vsyscall ABI ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole +STACKLEAK_POISON value in this last hole: ffffffffffff4111 Virtual memory map with 5 level page tables: @@ -50,6 +51,7 @@ ffffffffa0000000 - fffffffffeffffff (1520 MB) module mapping space [fixmap start] - ffffffffff5fffff kernel-internal fixmap range ffffffffff600000 - ffffffffff600fff (=4 kB) legacy vsyscall ABI ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole +STACKLEAK_POISON value in this last hole: ffffffffffff4111 Architecture defines a 64-bit virtual address. Implementations can support less. Currently supported are 48- and 57-bit virtual addresses. Bits 63 -- cgit v1.2.3 From ed535a2dae1836d15c71e250475952881265d244 Mon Sep 17 00:00:00 2001 From: Alexander Popov Date: Fri, 17 Aug 2018 01:17:02 +0300 Subject: doc: self-protection: Add information about STACKLEAK feature Add information about STACKLEAK feature to the "Memory poisoning" section of self-protection.rst. Signed-off-by: Alexander Popov Signed-off-by: Kees Cook --- Documentation/security/self-protection.rst | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) (limited to 'Documentation') diff --git a/Documentation/security/self-protection.rst b/Documentation/security/self-protection.rst index e1ca698e00063..f584fb74b4ff2 100644 --- a/Documentation/security/self-protection.rst +++ b/Documentation/security/self-protection.rst @@ -302,11 +302,11 @@ sure structure holes are cleared. Memory poisoning ---------------- -When releasing memory, it is best to poison the contents (clear stack on -syscall return, wipe heap memory on a free), to avoid reuse attacks that -rely on the old contents of memory. This frustrates many uninitialized -variable attacks, stack content exposures, heap content exposures, and -use-after-free attacks. +When releasing memory, it is best to poison the contents, to avoid reuse +attacks that rely on the old contents of memory. E.g., clear stack on a +syscall return (``CONFIG_GCC_PLUGIN_STACKLEAK``), wipe heap memory on a +free. This frustrates many uninitialized variable attacks, stack content +exposures, heap content exposures, and use-after-free attacks. Destination tracking -------------------- -- cgit v1.2.3 From 964c9dff0091893a9a74a88edf984c6da0b779f7 Mon Sep 17 00:00:00 2001 From: Alexander Popov Date: Fri, 17 Aug 2018 01:17:03 +0300 Subject: stackleak: Allow runtime disabling of kernel stack erasing Introduce CONFIG_STACKLEAK_RUNTIME_DISABLE option, which provides 'stack_erasing' sysctl. It can be used in runtime to control kernel stack erasing for kernels built with CONFIG_GCC_PLUGIN_STACKLEAK. Suggested-by: Ingo Molnar Signed-off-by: Alexander Popov Tested-by: Laura Abbott Signed-off-by: Kees Cook --- Documentation/sysctl/kernel.txt | 18 ++++++++++++++++++ include/linux/stackleak.h | 6 ++++++ kernel/stackleak.c | 38 ++++++++++++++++++++++++++++++++++++++ kernel/sysctl.c | 15 ++++++++++++++- scripts/gcc-plugins/Kconfig | 8 ++++++++ 5 files changed, 84 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt index 37a679501ddc6..1b8775298cf7a 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.txt @@ -89,6 +89,7 @@ show up in /proc/sys/kernel: - shmmni - softlockup_all_cpu_backtrace - soft_watchdog +- stack_erasing - stop-a [ SPARC only ] - sysrq ==> Documentation/admin-guide/sysrq.rst - sysctl_writes_strict @@ -987,6 +988,23 @@ detect a hard lockup condition. ============================================================== +stack_erasing + +This parameter can be used to control kernel stack erasing at the end +of syscalls for kernels built with CONFIG_GCC_PLUGIN_STACKLEAK. + +That erasing reduces the information which kernel stack leak bugs +can reveal and blocks some uninitialized stack variable attacks. +The tradeoff is the performance impact: on a single CPU system kernel +compilation sees a 1% slowdown, other systems and workloads may vary. + + 0: kernel stack erasing is disabled, STACKLEAK_METRICS are not updated. + + 1: kernel stack erasing is enabled (default), it is performed before + returning to the userspace at the end of syscalls. + +============================================================== + tainted: Non-zero if the kernel has been tainted. Numeric values, which can be diff --git a/include/linux/stackleak.h b/include/linux/stackleak.h index b911b973d328b..3d5c3271a9a8c 100644 --- a/include/linux/stackleak.h +++ b/include/linux/stackleak.h @@ -22,6 +22,12 @@ static inline void stackleak_task_init(struct task_struct *t) t->prev_lowest_stack = t->lowest_stack; # endif } + +#ifdef CONFIG_STACKLEAK_RUNTIME_DISABLE +int stack_erasing_sysctl(struct ctl_table *table, int write, + void __user *buffer, size_t *lenp, loff_t *ppos); +#endif + #else /* !CONFIG_GCC_PLUGIN_STACKLEAK */ static inline void stackleak_task_init(struct task_struct *t) { } #endif diff --git a/kernel/stackleak.c b/kernel/stackleak.c index f66239572c898..e42892926244d 100644 --- a/kernel/stackleak.c +++ b/kernel/stackleak.c @@ -12,6 +12,41 @@ #include +#ifdef CONFIG_STACKLEAK_RUNTIME_DISABLE +#include +#include + +static DEFINE_STATIC_KEY_FALSE(stack_erasing_bypass); + +int stack_erasing_sysctl(struct ctl_table *table, int write, + void __user *buffer, size_t *lenp, loff_t *ppos) +{ + int ret = 0; + int state = !static_branch_unlikely(&stack_erasing_bypass); + int prev_state = state; + + table->data = &state; + table->maxlen = sizeof(int); + ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos); + state = !!state; + if (ret || !write || state == prev_state) + return ret; + + if (state) + static_branch_disable(&stack_erasing_bypass); + else + static_branch_enable(&stack_erasing_bypass); + + pr_warn("stackleak: kernel stack erasing is %s\n", + state ? "enabled" : "disabled"); + return ret; +} + +#define skip_erasing() static_branch_unlikely(&stack_erasing_bypass) +#else +#define skip_erasing() false +#endif /* CONFIG_STACKLEAK_RUNTIME_DISABLE */ + asmlinkage void stackleak_erase(void) { /* It would be nice not to have 'kstack_ptr' and 'boundary' on stack */ @@ -20,6 +55,9 @@ asmlinkage void stackleak_erase(void) unsigned int poison_count = 0; const unsigned int depth = STACKLEAK_SEARCH_DEPTH / sizeof(unsigned long); + if (skip_erasing()) + return; + /* Check that 'lowest_stack' value is sane */ if (unlikely(kstack_ptr - boundary >= THREAD_SIZE)) kstack_ptr = boundary; diff --git a/kernel/sysctl.c b/kernel/sysctl.c index cc02050fd0c49..3ae223f7b5dfa 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -91,7 +91,9 @@ #ifdef CONFIG_CHR_DEV_SG #include #endif - +#ifdef CONFIG_STACKLEAK_RUNTIME_DISABLE +#include +#endif #ifdef CONFIG_LOCKUP_DETECTOR #include #endif @@ -1232,6 +1234,17 @@ static struct ctl_table kern_table[] = { .extra1 = &zero, .extra2 = &one, }, +#endif +#ifdef CONFIG_STACKLEAK_RUNTIME_DISABLE + { + .procname = "stack_erasing", + .data = NULL, + .maxlen = sizeof(int), + .mode = 0600, + .proc_handler = stack_erasing_sysctl, + .extra1 = &zero, + .extra2 = &one, + }, #endif { } }; diff --git a/scripts/gcc-plugins/Kconfig b/scripts/gcc-plugins/Kconfig index b0a015ef52685..0d5c799688f0a 100644 --- a/scripts/gcc-plugins/Kconfig +++ b/scripts/gcc-plugins/Kconfig @@ -182,4 +182,12 @@ config STACKLEAK_METRICS can be useful for estimating the STACKLEAK performance impact for your workloads. +config STACKLEAK_RUNTIME_DISABLE + bool "Allow runtime disabling of kernel stack erasing" + depends on GCC_PLUGIN_STACKLEAK + help + This option provides 'stack_erasing' sysctl, which can be used in + runtime to control kernel stack erasing for kernels built with + CONFIG_GCC_PLUGIN_STACKLEAK. + endif -- cgit v1.2.3 From 303d22c5fc37cebe2bd0b3c15ac3db6950db60c6 Mon Sep 17 00:00:00 2001 From: Miguel Ojeda Date: Mon, 3 Sep 2018 18:32:11 +0200 Subject: Compiler Attributes: add Doc/process/programming-language.rst Tested-by: Sedat Dilek # on top of v4.19-rc5, clang 7 Reviewed-by: Nick Desaulniers Reviewed-by: Luc Van Oostenryck Signed-off-by: Miguel Ojeda --- Documentation/process/index.rst | 1 + Documentation/process/programming-language.rst | 45 ++++++++++++++++++++++++++ 2 files changed, 46 insertions(+) create mode 100644 Documentation/process/programming-language.rst (limited to 'Documentation') diff --git a/Documentation/process/index.rst b/Documentation/process/index.rst index 9ae3e317bddf9..00558b6d2649f 100644 --- a/Documentation/process/index.rst +++ b/Documentation/process/index.rst @@ -23,6 +23,7 @@ Below are the essential guides that every developer should read. code-of-conduct development-process submitting-patches + programming-language coding-style maintainer-pgp-guide email-clients diff --git a/Documentation/process/programming-language.rst b/Documentation/process/programming-language.rst new file mode 100644 index 0000000000000..e5f5f065dc24e --- /dev/null +++ b/Documentation/process/programming-language.rst @@ -0,0 +1,45 @@ +.. _programming_language: + +Programming Language +==================== + +The kernel is written in the C programming language [c-language]_. +More precisely, the kernel is typically compiled with ``gcc`` [gcc]_ +under ``-std=gnu89`` [gcc-c-dialect-options]_: the GNU dialect of ISO C90 +(including some C99 features). + +This dialect contains many extensions to the language [gnu-extensions]_, +and many of them are used within the kernel as a matter of course. + +There is some support for compiling the kernel with ``clang`` [clang]_ +and ``icc`` [icc]_ for several of the architectures, although at the time +of writing it is not completed, requiring third-party patches. + +Attributes +---------- + +One of the common extensions used throughout the kernel are attributes +[gcc-attribute-syntax]_. Attributes allow to introduce +implementation-defined semantics to language entities (like variables, +functions or types) without having to make significant syntactic changes +to the language (e.g. adding a new keyword) [n2049]_. + +In some cases, attributes are optional (i.e. a compiler not supporting them +should still produce proper code, even if it is slower or does not perform +as many compile-time checks/diagnostics). + +The kernel defines pseudo-keywords (e.g. ``__pure``) instead of using +directly the GNU attribute syntax (e.g. ``__attribute__((__pure__))``) +in order to feature detect which ones can be used and/or to shorten the code. + +Please refer to ``include/linux/compiler_attributes.h`` for more information. + +.. [c-language] http://www.open-std.org/jtc1/sc22/wg14/www/standards +.. [gcc] https://gcc.gnu.org +.. [clang] https://clang.llvm.org +.. [icc] https://software.intel.com/en-us/c-compilers +.. [gcc-c-dialect-options] https://gcc.gnu.org/onlinedocs/gcc/C-Dialect-Options.html +.. [gnu-extensions] https://gcc.gnu.org/onlinedocs/gcc/C-Extensions.html +.. [gcc-attribute-syntax] https://gcc.gnu.org/onlinedocs/gcc/Attribute-Syntax.html +.. [n2049] http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2049.pdf + -- cgit v1.2.3 From d2266bbfa9e3e32e3b642965088ca461bd24a94f Mon Sep 17 00:00:00 2001 From: Feng Tang Date: Wed, 3 Oct 2018 00:49:21 +0800 Subject: x86/earlyprintk: Add a force option for pciserial device The "pciserial" earlyprintk variant helps much on many modern x86 platforms, but unfortunately there are still some platforms with PCI UART devices which have the wrong PCI class code. In that case, the current class code check does not allow for them to be used for logging. Add a sub-option "force" which overrides the class code check and thus the use of such device can be enforced. [ bp: massage formulations. ] Suggested-by: Borislav Petkov Signed-off-by: Feng Tang Signed-off-by: Borislav Petkov Cc: "H. Peter Anvin" Cc: "Stuart R . Anderson" Cc: Bjorn Helgaas Cc: David Rientjes Cc: Feng Tang Cc: Frederic Weisbecker Cc: Greg Kroah-Hartman Cc: H Peter Anvin Cc: Ingo Molnar Cc: Jiri Kosina Cc: Jonathan Corbet Cc: Kai-Heng Feng Cc: Kate Stewart Cc: Konrad Rzeszutek Wilk Cc: Peter Zijlstra Cc: Philippe Ombredanne Cc: Thomas Gleixner Cc: Thymo van Beers Cc: alan@linux.intel.com Cc: linux-doc@vger.kernel.org Link: http://lkml.kernel.org/r/20181002164921.25833-1-feng.tang@intel.com --- Documentation/admin-guide/kernel-parameters.txt | 6 ++++- arch/x86/kernel/early_printk.c | 29 ++++++++++++++++--------- 2 files changed, 24 insertions(+), 11 deletions(-) (limited to 'Documentation') diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 92eb1f42240d7..34e6800dea0ee 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -1063,7 +1063,7 @@ earlyprintk=serial[,0x...[,baudrate]] earlyprintk=ttySn[,baudrate] earlyprintk=dbgp[debugController#] - earlyprintk=pciserial,bus:device.function[,baudrate] + earlyprintk=pciserial[,force],bus:device.function[,baudrate] earlyprintk=xdbc[xhciController#] earlyprintk is useful when the kernel crashes before @@ -1095,6 +1095,10 @@ The sclp output can only be used on s390. + The optional "force" to "pciserial" enables use of a + PCI device even when its classcode is not of the + UART class. + edac_report= [HW,EDAC] Control how to report EDAC event Format: {"on" | "off" | "force"} on: enable EDAC to report H/W event. May be overridden diff --git a/arch/x86/kernel/early_printk.c b/arch/x86/kernel/early_printk.c index 5e801c8c8ce7c..374a52fa52969 100644 --- a/arch/x86/kernel/early_printk.c +++ b/arch/x86/kernel/early_printk.c @@ -213,8 +213,9 @@ static unsigned int mem32_serial_in(unsigned long addr, int offset) * early_pci_serial_init() * * This function is invoked when the early_printk param starts with "pciserial" - * The rest of the param should be ",B:D.F,baud" where B, D & F describe the - * location of a PCI device that must be a UART device. + * The rest of the param should be "[force],B:D.F,baud", where B, D & F describe + * the location of a PCI device that must be a UART device. "force" is optional + * and overrides the use of an UART device with a wrong PCI class code. */ static __init void early_pci_serial_init(char *s) { @@ -224,17 +225,23 @@ static __init void early_pci_serial_init(char *s) u32 classcode, bar0; u16 cmdreg; char *e; + int force = 0; - - /* - * First, part the param to get the BDF values - */ if (*s == ',') ++s; if (*s == 0) return; + /* Force the use of an UART device with wrong class code */ + if (!strncmp(s, "force,", 6)) { + force = 1; + s += 6; + } + + /* + * Part the param to get the BDF values + */ bus = (u8)simple_strtoul(s, &e, 16); s = e; if (*s != ':') @@ -253,7 +260,7 @@ static __init void early_pci_serial_init(char *s) s++; /* - * Second, find the device from the BDF + * Find the device from the BDF */ cmdreg = read_pci_config(bus, slot, func, PCI_COMMAND); classcode = read_pci_config(bus, slot, func, PCI_CLASS_REVISION); @@ -264,8 +271,10 @@ static __init void early_pci_serial_init(char *s) */ if (((classcode >> 16 != PCI_CLASS_COMMUNICATION_MODEM) && (classcode >> 16 != PCI_CLASS_COMMUNICATION_SERIAL)) || - (((classcode >> 8) & 0xff) != 0x02)) /* 16550 I/F at BAR0 */ - return; + (((classcode >> 8) & 0xff) != 0x02)) /* 16550 I/F at BAR0 */ { + if (!force) + return; + } /* * Determine if it is IO or memory mapped @@ -289,7 +298,7 @@ static __init void early_pci_serial_init(char *s) } /* - * Lastly, initialize the hardware + * Initialize the hardware */ if (*s) { if (strcmp(s, "nocfg") == 0) -- cgit v1.2.3 From 01e194bbd64a7c65af428561fa97653be8b74564 Mon Sep 17 00:00:00 2001 From: Sergei Shtylyov Date: Sat, 22 Sep 2018 22:43:24 +0300 Subject: dt-bindings: pwm: renesas: tpu: Fix "compatible" prop description The "compatible" property description contradicts even the example given: it only says that there must be a single value while the example has the fallback value too -- which makes much more sense. Moreover, the generic property value is misdocumented as being R-Car (and RZ/G1) specific... Fixes: 382457e562bb ("pwm: renesas-tpu: Add DT support") Fixes: 3ba111a01822 ("dt-bindings: pwm: renesas-tpu: Document r8a774[35] support") Signed-off-by: Sergei Shtylyov Signed-off-by: Thierry Reding --- Documentation/devicetree/bindings/pwm/renesas,tpu-pwm.txt | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/pwm/renesas,tpu-pwm.txt b/Documentation/devicetree/bindings/pwm/renesas,tpu-pwm.txt index d53a16715da6a..d276693cf3014 100644 --- a/Documentation/devicetree/bindings/pwm/renesas,tpu-pwm.txt +++ b/Documentation/devicetree/bindings/pwm/renesas,tpu-pwm.txt @@ -2,13 +2,14 @@ Required Properties: - - compatible: should be one of the following. + - compatible: must contain one or more of the following: - "renesas,tpu-r8a73a4": for R8A73A4 (R-Mobile APE6) compatible PWM controller. - "renesas,tpu-r8a7740": for R8A7740 (R-Mobile A1) compatible PWM controller. - "renesas,tpu-r8a7743": for R8A7743 (RZ/G1M) compatible PWM controller. - "renesas,tpu-r8a7745": for R8A7745 (RZ/G1E) compatible PWM controller. - "renesas,tpu-r8a7790": for R8A7790 (R-Car H2) compatible PWM controller. - - "renesas,tpu": for generic R-Car and RZ/G1 TPU PWM controller. + - "renesas,tpu": for the generic TPU PWM controller; this is a fallback for + the entries listed above. - reg: Base address and length of each memory resource used by the PWM controller hardware module. -- cgit v1.2.3 From 2360fc6a704e8dc1f8183a048980052bfe64388b Mon Sep 17 00:00:00 2001 From: Sergei Shtylyov Date: Mon, 1 Oct 2018 22:57:39 +0300 Subject: dt-bindings: pwm: renesas: pwm-rcar: Document R8A779{7|8}0 bindings Document the R-Car V3{M|H} (R8A779{7|8}0) SoC in the Renesas R-Car PWM bindings. R8A77970's hardware is a generic R-Car gen3 PWM, while R8A77980 has an extra error injection register... Signed-off-by: Sergei Shtylyov Reviewed-by: Geert Uytterhoeven Reviewed-by: Simon Horman Signed-off-by: Thierry Reding --- Documentation/devicetree/bindings/pwm/renesas,pwm-rcar.txt | 2 ++ 1 file changed, 2 insertions(+) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/pwm/renesas,pwm-rcar.txt b/Documentation/devicetree/bindings/pwm/renesas,pwm-rcar.txt index e1ef6afbe3a74..e68a34e57ff60 100644 --- a/Documentation/devicetree/bindings/pwm/renesas,pwm-rcar.txt +++ b/Documentation/devicetree/bindings/pwm/renesas,pwm-rcar.txt @@ -12,6 +12,8 @@ Required Properties: - "renesas,pwm-r8a7795": for R-Car H3 - "renesas,pwm-r8a7796": for R-Car M3-W - "renesas,pwm-r8a77965": for R-Car M3-N + - "renesas,pwm-r8a77970": for R-Car V3M + - "renesas,pwm-r8a77980": for R-Car V3H - "renesas,pwm-r8a77990": for R-Car E3 - "renesas,pwm-r8a77995": for R-Car D3 - reg: base address and length of the registers block for the PWM. -- cgit v1.2.3 From 136d822a9e4e947609d2fbf8d3fa5f15b8c20970 Mon Sep 17 00:00:00 2001 From: Sergei Shtylyov Date: Sat, 22 Sep 2018 22:57:29 +0300 Subject: dt-bindings: pwm: renesas: tpu: Document R8A779{7|8}0 bindings Document the R-Car V3{M|H} (R8A779{7|8}0) SoC in the Renesas TPU bindings; the TPU hardware in those is the Renesas standard 4-channel timer pulse unit. Signed-off-by: Sergei Shtylyov Reviewed-by: Geert Uytterhoeven Reviewed-by: Simon Horman Signed-off-by: Thierry Reding --- Documentation/devicetree/bindings/pwm/renesas,tpu-pwm.txt | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/pwm/renesas,tpu-pwm.txt b/Documentation/devicetree/bindings/pwm/renesas,tpu-pwm.txt index d276693cf3014..af5f4c7455973 100644 --- a/Documentation/devicetree/bindings/pwm/renesas,tpu-pwm.txt +++ b/Documentation/devicetree/bindings/pwm/renesas,tpu-pwm.txt @@ -8,6 +8,10 @@ Required Properties: - "renesas,tpu-r8a7743": for R8A7743 (RZ/G1M) compatible PWM controller. - "renesas,tpu-r8a7745": for R8A7745 (RZ/G1E) compatible PWM controller. - "renesas,tpu-r8a7790": for R8A7790 (R-Car H2) compatible PWM controller. + - "renesas,tpu-r8a77970": for R8A77970 (R-Car V3M) compatible PWM + controller. + - "renesas,tpu-r8a77980": for R8A77980 (R-Car V3H) compatible PWM + controller. - "renesas,tpu": for the generic TPU PWM controller; this is a fallback for the entries listed above. -- cgit v1.2.3 From a4675881ed71f73f5ea96afda6a1ddc1526fde4d Mon Sep 17 00:00:00 2001 From: Biju Das Date: Thu, 4 Oct 2018 17:17:19 +0100 Subject: dt-bindings: pwm: rcar: Add r8a7744 support Document RZ/G1N (R8A7744) SoC bindings. Signed-off-by: Biju Das Reviewed-by: Chris Paterson Reviewed-by: Simon Horman Signed-off-by: Thierry Reding --- Documentation/devicetree/bindings/pwm/renesas,pwm-rcar.txt | 1 + 1 file changed, 1 insertion(+) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/pwm/renesas,pwm-rcar.txt b/Documentation/devicetree/bindings/pwm/renesas,pwm-rcar.txt index e68a34e57ff60..bdbf1e4070e2e 100644 --- a/Documentation/devicetree/bindings/pwm/renesas,pwm-rcar.txt +++ b/Documentation/devicetree/bindings/pwm/renesas,pwm-rcar.txt @@ -3,6 +3,7 @@ Required Properties: - compatible: should be "renesas,pwm-rcar" and one of the following. - "renesas,pwm-r8a7743": for RZ/G1M + - "renesas,pwm-r8a7744": for RZ/G1N - "renesas,pwm-r8a7745": for RZ/G1E - "renesas,pwm-r8a7778": for R-Car M1A - "renesas,pwm-r8a7779": for R-Car H1 -- cgit v1.2.3 From d5a7060022cd443cc6a054bc73383acac6a8f53d Mon Sep 17 00:00:00 2001 From: Biju Das Date: Thu, 4 Oct 2018 17:21:34 +0100 Subject: dt-bindings: pwm: renesas-tpu: Document r8a7744 support Document r8a7744 specific compatible strings. No driver change is needed as the fallback compatible string "renesas,tpu" activates the right code in the driver. Signed-off-by: Biju Das Reviewed-by: Chris Paterson Reviewed-by: Simon Horman Signed-off-by: Thierry Reding --- Documentation/devicetree/bindings/pwm/renesas,tpu-pwm.txt | 1 + 1 file changed, 1 insertion(+) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/pwm/renesas,tpu-pwm.txt b/Documentation/devicetree/bindings/pwm/renesas,tpu-pwm.txt index af5f4c7455973..848a92b53d810 100644 --- a/Documentation/devicetree/bindings/pwm/renesas,tpu-pwm.txt +++ b/Documentation/devicetree/bindings/pwm/renesas,tpu-pwm.txt @@ -6,6 +6,7 @@ Required Properties: - "renesas,tpu-r8a73a4": for R8A73A4 (R-Mobile APE6) compatible PWM controller. - "renesas,tpu-r8a7740": for R8A7740 (R-Mobile A1) compatible PWM controller. - "renesas,tpu-r8a7743": for R8A7743 (RZ/G1M) compatible PWM controller. + - "renesas,tpu-r8a7744": for R8A7744 (RZ/G1N) compatible PWM controller. - "renesas,tpu-r8a7745": for R8A7745 (RZ/G1E) compatible PWM controller. - "renesas,tpu-r8a7790": for R8A7790 (R-Car H2) compatible PWM controller. - "renesas,tpu-r8a77970": for R8A77970 (R-Car V3M) compatible PWM -- cgit v1.2.3 From 786e0cfa9d6fc1cc8856c1864c988d233649348c Mon Sep 17 00:00:00 2001 From: Fabrizio Castro Date: Tue, 21 Aug 2018 17:03:46 +0100 Subject: dt-bindings: pwm: rcar: Add r8a774a1 support Document RZ/G2M (R8A774A1) SoC bindings. Signed-off-by: Fabrizio Castro Reviewed-by: Biju Das Reviewed-by: Simon Horman Reviewed-by: Rob Herring Signed-off-by: Thierry Reding --- Documentation/devicetree/bindings/pwm/renesas,pwm-rcar.txt | 1 + 1 file changed, 1 insertion(+) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/pwm/renesas,pwm-rcar.txt b/Documentation/devicetree/bindings/pwm/renesas,pwm-rcar.txt index bdbf1e4070e2e..7f31fe7e20934 100644 --- a/Documentation/devicetree/bindings/pwm/renesas,pwm-rcar.txt +++ b/Documentation/devicetree/bindings/pwm/renesas,pwm-rcar.txt @@ -5,6 +5,7 @@ Required Properties: - "renesas,pwm-r8a7743": for RZ/G1M - "renesas,pwm-r8a7744": for RZ/G1N - "renesas,pwm-r8a7745": for RZ/G1E + - "renesas,pwm-r8a774a1": for RZ/G2M - "renesas,pwm-r8a7778": for R-Car M1A - "renesas,pwm-r8a7779": for R-Car H1 - "renesas,pwm-r8a7790": for R-Car H2 -- cgit v1.2.3 From 3c413e7e39224d171ef697277cd4629b59103101 Mon Sep 17 00:00:00 2001 From: Vignesh R Date: Tue, 16 Oct 2018 11:34:01 +0530 Subject: dt-bindings: pwm: tiecap: Add TI AM654 SoC specific compatible Add a new compatible string "ti,am654-ecap" to support PWM ECAP IP of TI AM654 SoC. Signed-off-by: Vignesh R Reviewed-by: Rob Herring Signed-off-by: Thierry Reding --- Documentation/devicetree/bindings/pwm/pwm-tiecap.txt | 1 + 1 file changed, 1 insertion(+) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/pwm/pwm-tiecap.txt b/Documentation/devicetree/bindings/pwm/pwm-tiecap.txt index 06a363d9ccef9..b9a1d7402128b 100644 --- a/Documentation/devicetree/bindings/pwm/pwm-tiecap.txt +++ b/Documentation/devicetree/bindings/pwm/pwm-tiecap.txt @@ -7,6 +7,7 @@ Required properties: for da850 - compatible = "ti,da850-ecap", "ti,am3352-ecap", "ti,am33xx-ecap"; for dra746 - compatible = "ti,dra746-ecap", "ti,am3352-ecap"; for 66ak2g - compatible = "ti,k2g-ecap", "ti,am3352-ecap"; + for am654 - compatible = "ti,am654-ecap", "ti,am3352-ecap"; - #pwm-cells: should be 3. See pwm.txt in this directory for a description of the cells format. The PWM channel index ranges from 0 to 4. The only third cell flag supported by this binding is PWM_POLARITY_INVERTED. -- cgit v1.2.3 From d8a22773a12c6d78ee758c9e530f3a488bb7cb29 Mon Sep 17 00:00:00 2001 From: Sascha Hauer Date: Fri, 7 Sep 2018 14:36:45 +0200 Subject: ubifs: Enable authentication support With the preparations all being done this patch now enables authentication support for UBIFS. Authentication is enabled when the newly introduced auth_key and auth_hash_name mount options are passed. auth_key provides the key which is used for authentication whereas auth_hash_name provides the hashing algorithm used for this FS. Passing these options make authentication mandatory and only UBIFS images that can be authenticated with the given key are allowed. Signed-off-by: Sascha Hauer Signed-off-by: Richard Weinberger --- Documentation/filesystems/ubifs.txt | 7 +++++++ fs/ubifs/Kconfig | 10 ++++++++++ fs/ubifs/super.c | 36 +++++++++++++++++++++++++++++++++++- 3 files changed, 52 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/ubifs.txt b/Documentation/filesystems/ubifs.txt index a0a61d2f389f4..acc80442a3bbe 100644 --- a/Documentation/filesystems/ubifs.txt +++ b/Documentation/filesystems/ubifs.txt @@ -91,6 +91,13 @@ chk_data_crc do not skip checking CRCs on data nodes compr=none override default compressor and set it to "none" compr=lzo override default compressor and set it to "lzo" compr=zlib override default compressor and set it to "zlib" +auth_key= specify the key used for authenticating the filesystem. + Passing this option makes authentication mandatory. + The passed key must be present in the kernel keyring + and must be of type 'logon' +auth_hash_name= The hash algorithm used for authentication. Used for + both hashing and for creating HMACs. Typical values + include "sha256" or "sha512" Quick usage instructions diff --git a/fs/ubifs/Kconfig b/fs/ubifs/Kconfig index 853c77579b4e9..529856fbccd0e 100644 --- a/fs/ubifs/Kconfig +++ b/fs/ubifs/Kconfig @@ -86,3 +86,13 @@ config UBIFS_FS_SECURITY the extended attribute support in advance. If you are not using a security module, say N. + +config UBIFS_FS_AUTHENTICATION + bool "UBIFS authentication support" + select CRYPTO_HMAC + help + Enable authentication support for UBIFS. This feature offers protection + against offline changes for both data and metadata of the filesystem. + If you say yes here you should also select a hashing algorithm such as + sha256, these are not selected automatically since there are many + different options. diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c index e2964ce81dee8..1fac1133dadd2 100644 --- a/fs/ubifs/super.c +++ b/fs/ubifs/super.c @@ -579,7 +579,9 @@ static int init_constants_early(struct ubifs_info *c) c->ranges[UBIFS_REF_NODE].len = UBIFS_REF_NODE_SZ; c->ranges[UBIFS_TRUN_NODE].len = UBIFS_TRUN_NODE_SZ; c->ranges[UBIFS_CS_NODE].len = UBIFS_CS_NODE_SZ; - c->ranges[UBIFS_AUTH_NODE].len = UBIFS_AUTH_NODE_SZ; + c->ranges[UBIFS_AUTH_NODE].min_len = UBIFS_AUTH_NODE_SZ; + c->ranges[UBIFS_AUTH_NODE].max_len = UBIFS_AUTH_NODE_SZ + + UBIFS_MAX_HMAC_LEN; c->ranges[UBIFS_INO_NODE].min_len = UBIFS_INO_NODE_SZ; c->ranges[UBIFS_INO_NODE].max_len = UBIFS_MAX_INO_NODE_SZ; @@ -935,6 +937,8 @@ static int check_volume_empty(struct ubifs_info *c) * Opt_no_chk_data_crc: do not check CRCs when reading data nodes * Opt_override_compr: override default compressor * Opt_assert: set ubifs_assert() action + * Opt_auth_key: The key name used for authentication + * Opt_auth_hash_name: The hash type used for authentication * Opt_err: just end of array marker */ enum { @@ -946,6 +950,8 @@ enum { Opt_no_chk_data_crc, Opt_override_compr, Opt_assert, + Opt_auth_key, + Opt_auth_hash_name, Opt_ignore, Opt_err, }; @@ -958,6 +964,8 @@ static const match_table_t tokens = { {Opt_chk_data_crc, "chk_data_crc"}, {Opt_no_chk_data_crc, "no_chk_data_crc"}, {Opt_override_compr, "compr=%s"}, + {Opt_auth_key, "auth_key=%s"}, + {Opt_auth_hash_name, "auth_hash_name=%s"}, {Opt_ignore, "ubi=%s"}, {Opt_ignore, "vol=%s"}, {Opt_assert, "assert=%s"}, @@ -1081,6 +1089,16 @@ static int ubifs_parse_options(struct ubifs_info *c, char *options, kfree(act); break; } + case Opt_auth_key: + c->auth_key_name = kstrdup(args[0].from, GFP_KERNEL); + if (!c->auth_key_name) + return -ENOMEM; + break; + case Opt_auth_hash_name: + c->auth_hash_name = kstrdup(args[0].from, GFP_KERNEL); + if (!c->auth_hash_name) + return -ENOMEM; + break; case Opt_ignore: break; default: @@ -1260,6 +1278,19 @@ static int mount_ubifs(struct ubifs_info *c) c->mounting = 1; + if (c->auth_key_name) { + if (IS_ENABLED(CONFIG_UBIFS_FS_AUTHENTICATION)) { + err = ubifs_init_authentication(c); + if (err) + goto out_free; + } else { + ubifs_err(c, "auth_key_name, but UBIFS is built without" + " authentication support"); + err = -EINVAL; + goto out_free; + } + } + err = ubifs_read_superblock(c); if (err) goto out_free; @@ -1577,7 +1608,10 @@ static void ubifs_umount(struct ubifs_info *c) free_wbufs(c); free_orphans(c); ubifs_lpt_free(c, 0); + ubifs_exit_authentication(c); + kfree(c->auth_key_name); + kfree(c->auth_hash_name); kfree(c->cbuf); kfree(c->rcvrd_mst_node); kfree(c->mst_node); -- cgit v1.2.3 From e453fa60e086786fe89ba15ee8fef80bc2e6ecc3 Mon Sep 17 00:00:00 2001 From: Sascha Hauer Date: Fri, 7 Sep 2018 14:36:46 +0200 Subject: Documentation: ubifs: Add authentication whitepaper Signed-off-by: Sascha Hauer Signed-off-by: Richard Weinberger --- Documentation/filesystems/ubifs-authentication.md | 426 ++++++++++++++++++++++ 1 file changed, 426 insertions(+) create mode 100644 Documentation/filesystems/ubifs-authentication.md (limited to 'Documentation') diff --git a/Documentation/filesystems/ubifs-authentication.md b/Documentation/filesystems/ubifs-authentication.md new file mode 100644 index 0000000000000..028b3e2e25f97 --- /dev/null +++ b/Documentation/filesystems/ubifs-authentication.md @@ -0,0 +1,426 @@ +% UBIFS Authentication +% sigma star gmbh +% 2018 + +# Introduction + +UBIFS utilizes the fscrypt framework to provide confidentiality for file +contents and file names. This prevents attacks where an attacker is able to +read contents of the filesystem on a single point in time. A classic example +is a lost smartphone where the attacker is unable to read personal data stored +on the device without the filesystem decryption key. + +At the current state, UBIFS encryption however does not prevent attacks where +the attacker is able to modify the filesystem contents and the user uses the +device afterwards. In such a scenario an attacker can modify filesystem +contents arbitrarily without the user noticing. One example is to modify a +binary to perform a malicious action when executed [DMC-CBC-ATTACK]. Since +most of the filesystem metadata of UBIFS is stored in plain, this makes it +fairly easy to swap files and replace their contents. + +Other full disk encryption systems like dm-crypt cover all filesystem metadata, +which makes such kinds of attacks more complicated, but not impossible. +Especially, if the attacker is given access to the device multiple points in +time. For dm-crypt and other filesystems that build upon the Linux block IO +layer, the dm-integrity or dm-verity subsystems [DM-INTEGRITY, DM-VERITY] +can be used to get full data authentication at the block layer. +These can also be combined with dm-crypt [CRYPTSETUP2]. + +This document describes an approach to get file contents _and_ full metadata +authentication for UBIFS. Since UBIFS uses fscrypt for file contents and file +name encryption, the authentication system could be tied into fscrypt such that +existing features like key derivation can be utilized. It should however also +be possible to use UBIFS authentication without using encryption. + + +## MTD, UBI & UBIFS + +On Linux, the MTD (Memory Technology Devices) subsystem provides a uniform +interface to access raw flash devices. One of the more prominent subsystems that +work on top of MTD is UBI (Unsorted Block Images). It provides volume management +for flash devices and is thus somewhat similar to LVM for block devices. In +addition, it deals with flash-specific wear-leveling and transparent I/O error +handling. UBI offers logical erase blocks (LEBs) to the layers on top of it +and maps them transparently to physical erase blocks (PEBs) on the flash. + +UBIFS is a filesystem for raw flash which operates on top of UBI. Thus, wear +leveling and some flash specifics are left to UBI, while UBIFS focuses on +scalability, performance and recoverability. + + + + +------------+ +*******+ +-----------+ +-----+ + | | * UBIFS * | UBI-BLOCK | | ... | + | JFFS/JFFS2 | +*******+ +-----------+ +-----+ + | | +-----------------------------+ +-----------+ +-----+ + | | | UBI | | MTD-BLOCK | | ... | + +------------+ +-----------------------------+ +-----------+ +-----+ + +------------------------------------------------------------------+ + | MEMORY TECHNOLOGY DEVICES (MTD) | + +------------------------------------------------------------------+ + +-----------------------------+ +--------------------------+ +-----+ + | NAND DRIVERS | | NOR DRIVERS | | ... | + +-----------------------------+ +--------------------------+ +-----+ + + Figure 1: Linux kernel subsystems for dealing with raw flash + + + +Internally, UBIFS maintains multiple data structures which are persisted on +the flash: + +- *Index*: an on-flash B+ tree where the leaf nodes contain filesystem data +- *Journal*: an additional data structure to collect FS changes before updating + the on-flash index and reduce flash wear. +- *Tree Node Cache (TNC)*: an in-memory B+ tree that reflects the current FS + state to avoid frequent flash reads. It is basically the in-memory + representation of the index, but contains additional attributes. +- *LEB property tree (LPT)*: an on-flash B+ tree for free space accounting per + UBI LEB. + +In the remainder of this section we will cover the on-flash UBIFS data +structures in more detail. The TNC is of less importance here since it is never +persisted onto the flash directly. More details on UBIFS can also be found in +[UBIFS-WP]. + + +### UBIFS Index & Tree Node Cache + +Basic on-flash UBIFS entities are called *nodes*. UBIFS knows different types +of nodes. Eg. data nodes (`struct ubifs_data_node`) which store chunks of file +contents or inode nodes (`struct ubifs_ino_node`) which represent VFS inodes. +Almost all types of nodes share a common header (`ubifs_ch`) containing basic +information like node type, node length, a sequence number, etc. (see +`fs/ubifs/ubifs-media.h`in kernel source). Exceptions are entries of the LPT +and some less important node types like padding nodes which are used to pad +unusable content at the end of LEBs. + +To avoid re-writing the whole B+ tree on every single change, it is implemented +as *wandering tree*, where only the changed nodes are re-written and previous +versions of them are obsoleted without erasing them right away. As a result, +the index is not stored in a single place on the flash, but *wanders* around +and there are obsolete parts on the flash as long as the LEB containing them is +not reused by UBIFS. To find the most recent version of the index, UBIFS stores +a special node called *master node* into UBI LEB 1 which always points to the +most recent root node of the UBIFS index. For recoverability, the master node +is additionally duplicated to LEB 2. Mounting UBIFS is thus a simple read of +LEB 1 and 2 to get the current master node and from there get the location of +the most recent on-flash index. + +The TNC is the in-memory representation of the on-flash index. It contains some +additional runtime attributes per node which are not persisted. One of these is +a dirty-flag which marks nodes that have to be persisted the next time the +index is written onto the flash. The TNC acts as a write-back cache and all +modifications of the on-flash index are done through the TNC. Like other caches, +the TNC does not have to mirror the full index into memory, but reads parts of +it from flash whenever needed. A *commit* is the UBIFS operation of updating the +on-flash filesystem structures like the index. On every commit, the TNC nodes +marked as dirty are written to the flash to update the persisted index. + + +### Journal + +To avoid wearing out the flash, the index is only persisted (*commited*) when +certain conditions are met (eg. `fsync(2)`). The journal is used to record +any changes (in form of inode nodes, data nodes etc.) between commits +of the index. During mount, the journal is read from the flash and replayed +onto the TNC (which will be created on-demand from the on-flash index). + +UBIFS reserves a bunch of LEBs just for the journal called *log area*. The +amount of log area LEBs is configured on filesystem creation (using +`mkfs.ubifs`) and stored in the superblock node. The log area contains only +two types of nodes: *reference nodes* and *commit start nodes*. A commit start +node is written whenever an index commit is performed. Reference nodes are +written on every journal update. Each reference node points to the position of +other nodes (inode nodes, data nodes etc.) on the flash that are part of this +journal entry. These nodes are called *buds* and describe the actual filesystem +changes including their data. + +The log area is maintained as a ring. Whenever the journal is almost full, +a commit is initiated. This also writes a commit start node so that during +mount, UBIFS will seek for the most recent commit start node and just replay +every reference node after that. Every reference node before the commit start +node will be ignored as they are already part of the on-flash index. + +When writing a journal entry, UBIFS first ensures that enough space is +available to write the reference node and buds part of this entry. Then, the +reference node is written and afterwards the buds describing the file changes. +On replay, UBIFS will record every reference node and inspect the location of +the referenced LEBs to discover the buds. If these are corrupt or missing, +UBIFS will attempt to recover them by re-reading the LEB. This is however only +done for the last referenced LEB of the journal. Only this can become corrupt +because of a power cut. If the recovery fails, UBIFS will not mount. An error +for every other LEB will directly cause UBIFS to fail the mount operation. + + + | ---- LOG AREA ---- | ---------- MAIN AREA ------------ | + + -----+------+-----+--------+---- ------+-----+-----+--------------- + \ | | | | / / | | | \ + / CS | REF | REF | | \ \ DENT | INO | INO | / + \ | | | | / / | | | \ + ----+------+-----+--------+--- -------+-----+-----+---------------- + | | ^ ^ + | | | | + +------------------------+ | + | | + +-------------------------------+ + + + Figure 2: UBIFS flash layout of log area with commit start nodes + (CS) and reference nodes (REF) pointing to main area + containing their buds + + +### LEB Property Tree/Table + +The LEB property tree is used to store per-LEB information. This includes the +LEB type and amount of free and *dirty* (old, obsolete content) space [1] on +the LEB. The type is important, because UBIFS never mixes index nodes with data +nodes on a single LEB and thus each LEB has a specific purpose. This again is +useful for free space calculations. See [UBIFS-WP] for more details. + +The LEB property tree again is a B+ tree, but it is much smaller than the +index. Due to its smaller size it is always written as one chunk on every +commit. Thus, saving the LPT is an atomic operation. + + +[1] Since LEBs can only be appended and never overwritten, there is a +difference between free space ie. the remaining space left on the LEB to be +written to without erasing it and previously written content that is obsolete +but can't be overwritten without erasing the full LEB. + + +# UBIFS Authentication + +This chapter introduces UBIFS authentication which enables UBIFS to verify +the authenticity and integrity of metadata and file contents stored on flash. + + +## Threat Model + +UBIFS authentication enables detection of offline data modification. While it +does not prevent it, it enables (trusted) code to check the integrity and +authenticity of on-flash file contents and filesystem metadata. This covers +attacks where file contents are swapped. + +UBIFS authentication will not protect against rollback of full flash contents. +Ie. an attacker can still dump the flash and restore it at a later time without +detection. It will also not protect against partial rollback of individual +index commits. That means that an attacker is able to partially undo changes. +This is possible because UBIFS does not immediately overwrites obsolete +versions of the index tree or the journal, but instead marks them as obsolete +and garbage collection erases them at a later time. An attacker can use this by +erasing parts of the current tree and restoring old versions that are still on +the flash and have not yet been erased. This is possible, because every commit +will always write a new version of the index root node and the master node +without overwriting the previous version. This is further helped by the +wear-leveling operations of UBI which copies contents from one physical +eraseblock to another and does not atomically erase the first eraseblock. + +UBIFS authentication does not cover attacks where an attacker is able to +execute code on the device after the authentication key was provided. +Additional measures like secure boot and trusted boot have to be taken to +ensure that only trusted code is executed on a device. + + +## Authentication + +To be able to fully trust data read from flash, all UBIFS data structures +stored on flash are authenticated. That is: + +- The index which includes file contents, file metadata like extended + attributes, file length etc. +- The journal which also contains file contents and metadata by recording changes + to the filesystem +- The LPT which stores UBI LEB metadata which UBIFS uses for free space accounting + + +### Index Authentication + +Through UBIFS' concept of a wandering tree, it already takes care of only +updating and persisting changed parts from leaf node up to the root node +of the full B+ tree. This enables us to augment the index nodes of the tree +with a hash over each node's child nodes. As a result, the index basically also +a Merkle tree. Since the leaf nodes of the index contain the actual filesystem +data, the hashes of their parent index nodes thus cover all the file contents +and file metadata. When a file changes, the UBIFS index is updated accordingly +from the leaf nodes up to the root node including the master node. This process +can be hooked to recompute the hash only for each changed node at the same time. +Whenever a file is read, UBIFS can verify the hashes from each leaf node up to +the root node to ensure the node's integrity. + +To ensure the authenticity of the whole index, the UBIFS master node stores a +keyed hash (HMAC) over its own contents and a hash of the root node of the index +tree. As mentioned above, the master node is always written to the flash whenever +the index is persisted (ie. on index commit). + +Using this approach only UBIFS index nodes and the master node are changed to +include a hash. All other types of nodes will remain unchanged. This reduces +the storage overhead which is precious for users of UBIFS (ie. embedded +devices). + + + +---------------+ + | Master Node | + | (hash) | + +---------------+ + | + v + +-------------------+ + | Index Node #1 | + | | + | branch0 branchn | + | (hash) (hash) | + +-------------------+ + | ... | (fanout: 8) + | | + +-------+ +------+ + | | + v v + +-------------------+ +-------------------+ + | Index Node #2 | | Index Node #3 | + | | | | + | branch0 branchn | | branch0 branchn | + | (hash) (hash) | | (hash) (hash) | + +-------------------+ +-------------------+ + | ... | ... | + v v v + +-----------+ +----------+ +-----------+ + | Data Node | | INO Node | | DENT Node | + +-----------+ +----------+ +-----------+ + + + Figure 3: Coverage areas of index node hash and master node HMAC + + + +The most important part for robustness and power-cut safety is to atomically +persist the hash and file contents. Here the existing UBIFS logic for how +changed nodes are persisted is already designed for this purpose such that +UBIFS can safely recover if a power-cut occurs while persisting. Adding +hashes to index nodes does not change this since each hash will be persisted +atomically together with its respective node. + + +### Journal Authentication + +The journal is authenticated too. Since the journal is continuously written +it is necessary to also add authentication information frequently to the +journal so that in case of a powercut not too much data can't be authenticated. +This is done by creating a continuous hash beginning from the commit start node +over the previous reference nodes, the current reference node, and the bud +nodes. From time to time whenever it is suitable authentication nodes are added +between the bud nodes. This new node type contains a HMAC over the current state +of the hash chain. That way a journal can be authenticated up to the last +authentication node. The tail of the journal which may not have a authentication +node cannot be authenticated and is skipped during journal replay. + +We get this picture for journal authentication: + + ,,,,,,,, + ,......,........................................... + ,. CS , hash1.----. hash2.----. + ,. | , . |hmac . |hmac + ,. v , . v . v + ,.REF#0,-> bud -> bud -> bud.-> auth -> bud -> bud.-> auth ... + ,..|...,........................................... + , | , + , | ,,,,,,,,,,,,,,, + . | hash3,----. + , | , |hmac + , v , v + , REF#1 -> bud -> bud,-> auth ... + ,,,|,,,,,,,,,,,,,,,,,, + v + REF#2 -> ... + | + V + ... + +Since the hash also includes the reference nodes an attacker cannot reorder or +skip any journal heads for replay. An attacker can only remove bud nodes or +reference nodes from the end of the journal, effectively rewinding the +filesystem at maximum back to the last commit. + +The location of the log area is stored in the master node. Since the master +node is authenticated with a HMAC as described above, it is not possible to +tamper with that without detection. The size of the log area is specified when +the filesystem is created using `mkfs.ubifs` and stored in the superblock node. +To avoid tampering with this and other values stored there, a HMAC is added to +the superblock struct. The superblock node is stored in LEB 0 and is only +modified on feature flag or similar changes, but never on file changes. + + +### LPT Authentication + +The location of the LPT root node on the flash is stored in the UBIFS master +node. Since the LPT is written and read atomically on every commit, there is +no need to authenticate individual nodes of the tree. It suffices to +protect the integrity of the full LPT by a simple hash stored in the master +node. Since the master node itself is authenticated, the LPTs authenticity can +be verified by verifying the authenticity of the master node and comparing the +LTP hash stored there with the hash computed from the read on-flash LPT. + + +## Key Management + +For simplicity, UBIFS authentication uses a single key to compute the HMACs +of superblock, master, commit start and reference nodes. This key has to be +available on creation of the filesystem (`mkfs.ubifs`) to authenticate the +superblock node. Further, it has to be available on mount of the filesystem +to verify authenticated nodes and generate new HMACs for changes. + +UBIFS authentication is intended to operate side-by-side with UBIFS encryption +(fscrypt) to provide confidentiality and authenticity. Since UBIFS encryption +has a different approach of encryption policies per directory, there can be +multiple fscrypt master keys and there might be folders without encryption. +UBIFS authentication on the other hand has an all-or-nothing approach in the +sense that it either authenticates everything of the filesystem or nothing. +Because of this and because UBIFS authentication should also be usable without +encryption, it does not share the same master key with fscrypt, but manages +a dedicated authentication key. + +The API for providing the authentication key has yet to be defined, but the +key can eg. be provided by userspace through a keyring similar to the way it +is currently done in fscrypt. It should however be noted that the current +fscrypt approach has shown its flaws and the userspace API will eventually +change [FSCRYPT-POLICY2]. + +Nevertheless, it will be possible for a user to provide a single passphrase +or key in userspace that covers UBIFS authentication and encryption. This can +be solved by the corresponding userspace tools which derive a second key for +authentication in addition to the derived fscrypt master key used for +encryption. + +To be able to check if the proper key is available on mount, the UBIFS +superblock node will additionally store a hash of the authentication key. This +approach is similar to the approach proposed for fscrypt encryption policy v2 +[FSCRYPT-POLICY2]. + + +# Future Extensions + +In certain cases where a vendor wants to provide an authenticated filesystem +image to customers, it should be possible to do so without sharing the secret +UBIFS authentication key. Instead, in addition the each HMAC a digital +signature could be stored where the vendor shares the public key alongside the +filesystem image. In case this filesystem has to be modified afterwards, +UBIFS can exchange all digital signatures with HMACs on first mount similar +to the way the IMA/EVM subsystem deals with such situations. The HMAC key +will then have to be provided beforehand in the normal way. + + +# References + +[CRYPTSETUP2] http://www.saout.de/pipermail/dm-crypt/2017-November/005745.html + +[DMC-CBC-ATTACK] http://www.jakoblell.com/blog/2013/12/22/practical-malleability-attack-against-cbc-encrypted-luks-partitions/ + +[DM-INTEGRITY] https://www.kernel.org/doc/Documentation/device-mapper/dm-integrity.txt + +[DM-VERITY] https://www.kernel.org/doc/Documentation/device-mapper/verity.txt + +[FSCRYPT-POLICY2] https://www.spinics.net/lists/linux-ext4/msg58710.html + +[UBIFS-WP] http://www.linux-mtd.infradead.org/doc/ubifs_whitepaper.pdf -- cgit v1.2.3 From 3511ba7d4ca6f39e2d060bb94e42a41ad1fee7bf Mon Sep 17 00:00:00 2001 From: Keiji Hayashibara Date: Wed, 24 Oct 2018 18:34:29 +0900 Subject: spi: uniphier: fix incorrect property items This commit fixes incorrect property because it was different from the actual. The parameters of '#address-cells' and '#size-cells' were removed, and 'interrupts', 'pinctrl-names' and 'pinctrl-0' were added. Fixes: 4dcd5c2781f3 ("spi: add DT bindings for UniPhier SPI controller") Signed-off-by: Keiji Hayashibara Signed-off-by: Mark Brown --- Documentation/devicetree/bindings/spi/spi-uniphier.txt | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/spi/spi-uniphier.txt b/Documentation/devicetree/bindings/spi/spi-uniphier.txt index 504a4ecfc7b16..b04e66a52de5d 100644 --- a/Documentation/devicetree/bindings/spi/spi-uniphier.txt +++ b/Documentation/devicetree/bindings/spi/spi-uniphier.txt @@ -5,18 +5,20 @@ UniPhier SoCs have SCSSI which supports SPI single channel. Required properties: - compatible: should be "socionext,uniphier-scssi" - reg: address and length of the spi master registers - - #address-cells: must be <1>, see spi-bus.txt - - #size-cells: must be <0>, see spi-bus.txt - - clocks: A phandle to the clock for the device. - - resets: A phandle to the reset control for the device. + - interrupts: a single interrupt specifier + - pinctrl-names: should be "default" + - pinctrl-0: pin control state for the default mode + - clocks: a phandle to the clock for the device + - resets: a phandle to the reset control for the device Example: spi0: spi@54006000 { compatible = "socionext,uniphier-scssi"; reg = <0x54006000 0x100>; - #address-cells = <1>; - #size-cells = <0>; + interrupts = <0 39 4>; + pinctrl-names = "default"; + pinctrl-0 = <&pinctrl_spi0>; clocks = <&peri_clk 11>; resets = <&peri_rst 11>; }; -- cgit v1.2.3 From 70025f84e5b79627a6739533c4fe7cef5b605886 Mon Sep 17 00:00:00 2001 From: David Howells Date: Tue, 9 Oct 2018 17:46:51 +0100 Subject: KEYS: Provide key type operations for asymmetric key ops [ver #2] Provide five new operations in the key_type struct that can be used to provide access to asymmetric key operations. These will be implemented for the asymmetric key type in a later patch and may refer to a key retained in RAM by the kernel or a key retained in crypto hardware. int (*asym_query)(const struct kernel_pkey_params *params, struct kernel_pkey_query *info); int (*asym_eds_op)(struct kernel_pkey_params *params, const void *in, void *out); int (*asym_verify_signature)(struct kernel_pkey_params *params, const void *in, const void *in2); Since encrypt, decrypt and sign are identical in their interfaces, they're rolled together in the asym_eds_op() operation and there's an operation ID in the params argument to distinguish them. Verify is different in that we supply the data and the signature instead and get an error value (or 0) as the only result on the expectation that this may well be how a hardware crypto device may work. Signed-off-by: David Howells Tested-by: Marcel Holtmann Reviewed-by: Marcel Holtmann Reviewed-by: Denis Kenzior Tested-by: Denis Kenzior Signed-off-by: James Morris --- Documentation/security/keys/core.rst | 106 +++++++++++++++++++++++++++++++++++ include/linux/key-type.h | 11 ++++ include/linux/keyctl.h | 46 +++++++++++++++ include/uapi/linux/keyctl.h | 5 ++ 4 files changed, 168 insertions(+) create mode 100644 include/linux/keyctl.h (limited to 'Documentation') diff --git a/Documentation/security/keys/core.rst b/Documentation/security/keys/core.rst index 9ce7256c6edba..c144978479d58 100644 --- a/Documentation/security/keys/core.rst +++ b/Documentation/security/keys/core.rst @@ -1483,6 +1483,112 @@ The structure has a number of fields, some of which are mandatory: attempted key link operation. If there is no match, -EINVAL is returned. + * ``int (*asym_eds_op)(struct kernel_pkey_params *params, + const void *in, void *out);`` + ``int (*asym_verify_signature)(struct kernel_pkey_params *params, + const void *in, const void *in2);`` + + These methods are optional. If provided the first allows a key to be + used to encrypt, decrypt or sign a blob of data, and the second allows a + key to verify a signature. + + In all cases, the following information is provided in the params block:: + + struct kernel_pkey_params { + struct key *key; + const char *encoding; + const char *hash_algo; + char *info; + __u32 in_len; + union { + __u32 out_len; + __u32 in2_len; + }; + enum kernel_pkey_operation op : 8; + }; + + This includes the key to be used; a string indicating the encoding to use + (for instance, "pkcs1" may be used with an RSA key to indicate + RSASSA-PKCS1-v1.5 or RSAES-PKCS1-v1.5 encoding or "raw" if no encoding); + the name of the hash algorithm used to generate the data for a signature + (if appropriate); the sizes of the input and output (or second input) + buffers; and the ID of the operation to be performed. + + For a given operation ID, the input and output buffers are used as + follows:: + + Operation ID in,in_len out,out_len in2,in2_len + ======================= =============== =============== =============== + kernel_pkey_encrypt Raw data Encrypted data - + kernel_pkey_decrypt Encrypted data Raw data - + kernel_pkey_sign Raw data Signature - + kernel_pkey_verify Raw data - Signature + + asym_eds_op() deals with encryption, decryption and signature creation as + specified by params->op. Note that params->op is also set for + asym_verify_signature(). + + Encrypting and signature creation both take raw data in the input buffer + and return the encrypted result in the output buffer. Padding may have + been added if an encoding was set. In the case of signature creation, + depending on the encoding, the padding created may need to indicate the + digest algorithm - the name of which should be supplied in hash_algo. + + Decryption takes encrypted data in the input buffer and returns the raw + data in the output buffer. Padding will get checked and stripped off if + an encoding was set. + + Verification takes raw data in the input buffer and the signature in the + second input buffer and checks that the one matches the other. Padding + will be validated. Depending on the encoding, the digest algorithm used + to generate the raw data may need to be indicated in hash_algo. + + If successful, asym_eds_op() should return the number of bytes written + into the output buffer. asym_verify_signature() should return 0. + + A variety of errors may be returned, including EOPNOTSUPP if the operation + is not supported; EKEYREJECTED if verification fails; ENOPKG if the + required crypto isn't available. + + + * ``int (*asym_query)(const struct kernel_pkey_params *params, + struct kernel_pkey_query *info);`` + + This method is optional. If provided it allows information about the + public or asymmetric key held in the key to be determined. + + The parameter block is as for asym_eds_op() and co. but in_len and out_len + are unused. The encoding and hash_algo fields should be used to reduce + the returned buffer/data sizes as appropriate. + + If successful, the following information is filled in:: + + struct kernel_pkey_query { + __u32 supported_ops; + __u32 key_size; + __u16 max_data_size; + __u16 max_sig_size; + __u16 max_enc_size; + __u16 max_dec_size; + }; + + The supported_ops field will contain a bitmask indicating what operations + are supported by the key, including encryption of a blob, decryption of a + blob, signing a blob and verifying the signature on a blob. The following + constants are defined for this:: + + KEYCTL_SUPPORTS_{ENCRYPT,DECRYPT,SIGN,VERIFY} + + The key_size field is the size of the key in bits. max_data_size and + max_sig_size are the maximum raw data and signature sizes for creation and + verification of a signature; max_enc_size and max_dec_size are the maximum + raw data and signature sizes for encryption and decryption. The + max_*_size fields are measured in bytes. + + If successful, 0 will be returned. If the key doesn't support this, + EOPNOTSUPP will be returned. + + Request-Key Callback Service ============================ diff --git a/include/linux/key-type.h b/include/linux/key-type.h index 05d8fb5a06c49..bc9af551fc838 100644 --- a/include/linux/key-type.h +++ b/include/linux/key-type.h @@ -17,6 +17,9 @@ #ifdef CONFIG_KEYS +struct kernel_pkey_query; +struct kernel_pkey_params; + /* * key under-construction record * - passed to the request_key actor if supplied @@ -155,6 +158,14 @@ struct key_type { */ struct key_restriction *(*lookup_restriction)(const char *params); + /* Asymmetric key accessor functions. */ + int (*asym_query)(const struct kernel_pkey_params *params, + struct kernel_pkey_query *info); + int (*asym_eds_op)(struct kernel_pkey_params *params, + const void *in, void *out); + int (*asym_verify_signature)(struct kernel_pkey_params *params, + const void *in, const void *in2); + /* internal fields */ struct list_head link; /* link in types list */ struct lock_class_key lock_class; /* key->sem lock class */ diff --git a/include/linux/keyctl.h b/include/linux/keyctl.h new file mode 100644 index 0000000000000..c7c48c79ce0e9 --- /dev/null +++ b/include/linux/keyctl.h @@ -0,0 +1,46 @@ +/* keyctl kernel bits + * + * Copyright (C) 2016 Red Hat, Inc. All Rights Reserved. + * Written by David Howells (dhowells@redhat.com) + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public Licence + * as published by the Free Software Foundation; either version + * 2 of the Licence, or (at your option) any later version. + */ + +#ifndef __LINUX_KEYCTL_H +#define __LINUX_KEYCTL_H + +#include + +struct kernel_pkey_query { + __u32 supported_ops; /* Which ops are supported */ + __u32 key_size; /* Size of the key in bits */ + __u16 max_data_size; /* Maximum size of raw data to sign in bytes */ + __u16 max_sig_size; /* Maximum size of signature in bytes */ + __u16 max_enc_size; /* Maximum size of encrypted blob in bytes */ + __u16 max_dec_size; /* Maximum size of decrypted blob in bytes */ +}; + +enum kernel_pkey_operation { + kernel_pkey_encrypt, + kernel_pkey_decrypt, + kernel_pkey_sign, + kernel_pkey_verify, +}; + +struct kernel_pkey_params { + struct key *key; + const char *encoding; /* Encoding (eg. "oaep" or "raw" for none) */ + const char *hash_algo; /* Digest algorithm used (eg. "sha1") or NULL if N/A */ + char *info; /* Modified info string to be released later */ + __u32 in_len; /* Input data size */ + union { + __u32 out_len; /* Output buffer size (enc/dec/sign) */ + __u32 in2_len; /* 2nd input data size (verify) */ + }; + enum kernel_pkey_operation op : 8; +}; + +#endif /* __LINUX_KEYCTL_H */ diff --git a/include/uapi/linux/keyctl.h b/include/uapi/linux/keyctl.h index 0f3cb13db8e93..1d1e9f2877afb 100644 --- a/include/uapi/linux/keyctl.h +++ b/include/uapi/linux/keyctl.h @@ -82,4 +82,9 @@ struct keyctl_kdf_params { __u32 __spare[8]; }; +#define KEYCTL_SUPPORTS_ENCRYPT 0x01 +#define KEYCTL_SUPPORTS_DECRYPT 0x02 +#define KEYCTL_SUPPORTS_SIGN 0x04 +#define KEYCTL_SUPPORTS_VERIFY 0x08 + #endif /* _LINUX_KEYCTL_H */ -- cgit v1.2.3 From 00d60fd3b93219ea854220f0fd264b86398cbc53 Mon Sep 17 00:00:00 2001 From: David Howells Date: Tue, 9 Oct 2018 17:46:59 +0100 Subject: KEYS: Provide keyctls to drive the new key type ops for asymmetric keys [ver #2] Provide five keyctl functions that permit userspace to make use of the new key type ops for accessing and driving asymmetric keys. (*) Query an asymmetric key. long keyctl(KEYCTL_PKEY_QUERY, key_serial_t key, unsigned long reserved, struct keyctl_pkey_query *info); Get information about an asymmetric key. The information is returned in the keyctl_pkey_query struct: __u32 supported_ops; A bit mask of flags indicating which ops are supported. This is constructed from a bitwise-OR of: KEYCTL_SUPPORTS_{ENCRYPT,DECRYPT,SIGN,VERIFY} __u32 key_size; The size in bits of the key. __u16 max_data_size; __u16 max_sig_size; __u16 max_enc_size; __u16 max_dec_size; The maximum sizes in bytes of a blob of data to be signed, a signature blob, a blob to be encrypted and a blob to be decrypted. reserved must be set to 0. This is intended for future use to hand over one or more passphrases needed unlock a key. If successful, 0 is returned. If the key is not an asymmetric key, EOPNOTSUPP is returned. (*) Encrypt, decrypt, sign or verify a blob using an asymmetric key. long keyctl(KEYCTL_PKEY_ENCRYPT, const struct keyctl_pkey_params *params, const char *info, const void *in, void *out); long keyctl(KEYCTL_PKEY_DECRYPT, const struct keyctl_pkey_params *params, const char *info, const void *in, void *out); long keyctl(KEYCTL_PKEY_SIGN, const struct keyctl_pkey_params *params, const char *info, const void *in, void *out); long keyctl(KEYCTL_PKEY_VERIFY, const struct keyctl_pkey_params *params, const char *info, const void *in, const void *in2); Use an asymmetric key to perform a public-key cryptographic operation a blob of data. The parameter block pointed to by params contains a number of integer values: __s32 key_id; __u32 in_len; __u32 out_len; __u32 in2_len; For a given operation, the in and out buffers are used as follows: Operation ID in,in_len out,out_len in2,in2_len ======================= =============== =============== =========== KEYCTL_PKEY_ENCRYPT Raw data Encrypted data - KEYCTL_PKEY_DECRYPT Encrypted data Raw data - KEYCTL_PKEY_SIGN Raw data Signature - KEYCTL_PKEY_VERIFY Raw data - Signature info is a string of key=value pairs that supply supplementary information. The __spare space in the parameter block must be set to 0. This is intended, amongst other things, to allow the passing of passphrases required to unlock a key. If successful, encrypt, decrypt and sign all return the amount of data written into the output buffer. Verification returns 0 on success. Signed-off-by: David Howells Tested-by: Marcel Holtmann Reviewed-by: Marcel Holtmann Reviewed-by: Denis Kenzior Tested-by: Denis Kenzior Signed-off-by: James Morris --- Documentation/security/keys/core.rst | 111 ++++++++++++ include/uapi/linux/keyctl.h | 25 +++ security/keys/Makefile | 1 + security/keys/compat.c | 18 ++ security/keys/internal.h | 39 +++++ security/keys/keyctl.c | 24 +++ security/keys/keyctl_pkey.c | 323 +++++++++++++++++++++++++++++++++++ 7 files changed, 541 insertions(+) create mode 100644 security/keys/keyctl_pkey.c (limited to 'Documentation') diff --git a/Documentation/security/keys/core.rst b/Documentation/security/keys/core.rst index c144978479d58..9521c4207f014 100644 --- a/Documentation/security/keys/core.rst +++ b/Documentation/security/keys/core.rst @@ -859,6 +859,7 @@ The keyctl syscall functions are: and either the buffer length or the OtherInfo length exceeds the allowed length. + * Restrict keyring linkage:: long keyctl(KEYCTL_RESTRICT_KEYRING, key_serial_t keyring, @@ -890,6 +891,116 @@ The keyctl syscall functions are: applicable to the asymmetric key type. + * Query an asymmetric key:: + + long keyctl(KEYCTL_PKEY_QUERY, + key_serial_t key_id, unsigned long reserved, + struct keyctl_pkey_query *info); + + Get information about an asymmetric key. The information is returned in + the keyctl_pkey_query struct:: + + __u32 supported_ops; + __u32 key_size; + __u16 max_data_size; + __u16 max_sig_size; + __u16 max_enc_size; + __u16 max_dec_size; + __u32 __spare[10]; + + ``supported_ops`` contains a bit mask of flags indicating which ops are + supported. This is constructed from a bitwise-OR of:: + + KEYCTL_SUPPORTS_{ENCRYPT,DECRYPT,SIGN,VERIFY} + + ``key_size`` indicated the size of the key in bits. + + ``max_*_size`` indicate the maximum sizes in bytes of a blob of data to be + signed, a signature blob, a blob to be encrypted and a blob to be + decrypted. + + ``__spare[]`` must be set to 0. This is intended for future use to hand + over one or more passphrases needed unlock a key. + + If successful, 0 is returned. If the key is not an asymmetric key, + EOPNOTSUPP is returned. + + + * Encrypt, decrypt, sign or verify a blob using an asymmetric key:: + + long keyctl(KEYCTL_PKEY_ENCRYPT, + const struct keyctl_pkey_params *params, + const char *info, + const void *in, + void *out); + + long keyctl(KEYCTL_PKEY_DECRYPT, + const struct keyctl_pkey_params *params, + const char *info, + const void *in, + void *out); + + long keyctl(KEYCTL_PKEY_SIGN, + const struct keyctl_pkey_params *params, + const char *info, + const void *in, + void *out); + + long keyctl(KEYCTL_PKEY_VERIFY, + const struct keyctl_pkey_params *params, + const char *info, + const void *in, + const void *in2); + + Use an asymmetric key to perform a public-key cryptographic operation a + blob of data. For encryption and verification, the asymmetric key may + only need the public parts to be available, but for decryption and signing + the private parts are required also. + + The parameter block pointed to by params contains a number of integer + values:: + + __s32 key_id; + __u32 in_len; + __u32 out_len; + __u32 in2_len; + + ``key_id`` is the ID of the asymmetric key to be used. ``in_len`` and + ``in2_len`` indicate the amount of data in the in and in2 buffers and + ``out_len`` indicates the size of the out buffer as appropriate for the + above operations. + + For a given operation, the in and out buffers are used as follows:: + + Operation ID in,in_len out,out_len in2,in2_len + ======================= =============== =============== =============== + KEYCTL_PKEY_ENCRYPT Raw data Encrypted data - + KEYCTL_PKEY_DECRYPT Encrypted data Raw data - + KEYCTL_PKEY_SIGN Raw data Signature - + KEYCTL_PKEY_VERIFY Raw data - Signature + + ``info`` is a string of key=value pairs that supply supplementary + information. These include: + + ``enc=`` The encoding of the encrypted/signature blob. This + can be "pkcs1" for RSASSA-PKCS1-v1.5 or + RSAES-PKCS1-v1.5; "pss" for "RSASSA-PSS"; "oaep" for + "RSAES-OAEP". If omitted or is "raw", the raw output + of the encryption function is specified. + + ``hash=`` If the data buffer contains the output of a hash + function and the encoding includes some indication of + which hash function was used, the hash function can be + specified with this, eg. "hash=sha256". + + The ``__spare[]`` space in the parameter block must be set to 0. This is + intended, amongst other things, to allow the passing of passphrases + required to unlock a key. + + If successful, encrypt, decrypt and sign all return the amount of data + written into the output buffer. Verification returns 0 on success. + + Kernel Services =============== diff --git a/include/uapi/linux/keyctl.h b/include/uapi/linux/keyctl.h index 1d1e9f2877afb..f45ee0f69c0c2 100644 --- a/include/uapi/linux/keyctl.h +++ b/include/uapi/linux/keyctl.h @@ -61,6 +61,11 @@ #define KEYCTL_INVALIDATE 21 /* invalidate a key */ #define KEYCTL_GET_PERSISTENT 22 /* get a user's persistent keyring */ #define KEYCTL_DH_COMPUTE 23 /* Compute Diffie-Hellman values */ +#define KEYCTL_PKEY_QUERY 24 /* Query public key parameters */ +#define KEYCTL_PKEY_ENCRYPT 25 /* Encrypt a blob using a public key */ +#define KEYCTL_PKEY_DECRYPT 26 /* Decrypt a blob using a public key */ +#define KEYCTL_PKEY_SIGN 27 /* Create a public key signature */ +#define KEYCTL_PKEY_VERIFY 28 /* Verify a public key signature */ #define KEYCTL_RESTRICT_KEYRING 29 /* Restrict keys allowed to link to a keyring */ /* keyctl structures */ @@ -87,4 +92,24 @@ struct keyctl_kdf_params { #define KEYCTL_SUPPORTS_SIGN 0x04 #define KEYCTL_SUPPORTS_VERIFY 0x08 +struct keyctl_pkey_query { + __u32 supported_ops; /* Which ops are supported */ + __u32 key_size; /* Size of the key in bits */ + __u16 max_data_size; /* Maximum size of raw data to sign in bytes */ + __u16 max_sig_size; /* Maximum size of signature in bytes */ + __u16 max_enc_size; /* Maximum size of encrypted blob in bytes */ + __u16 max_dec_size; /* Maximum size of decrypted blob in bytes */ + __u32 __spare[10]; +}; + +struct keyctl_pkey_params { + __s32 key_id; /* Serial no. of public key to use */ + __u32 in_len; /* Input data size */ + union { + __u32 out_len; /* Output buffer size (encrypt/decrypt/sign) */ + __u32 in2_len; /* 2nd input data size (verify) */ + }; + __u32 __spare[7]; +}; + #endif /* _LINUX_KEYCTL_H */ diff --git a/security/keys/Makefile b/security/keys/Makefile index ef1581b337a3d..9cef54064f608 100644 --- a/security/keys/Makefile +++ b/security/keys/Makefile @@ -22,6 +22,7 @@ obj-$(CONFIG_PROC_FS) += proc.o obj-$(CONFIG_SYSCTL) += sysctl.o obj-$(CONFIG_PERSISTENT_KEYRINGS) += persistent.o obj-$(CONFIG_KEY_DH_OPERATIONS) += dh.o +obj-$(CONFIG_ASYMMETRIC_KEY_TYPE) += keyctl_pkey.o # # Key types diff --git a/security/keys/compat.c b/security/keys/compat.c index e87c89c0177c1..9482df601dc33 100644 --- a/security/keys/compat.c +++ b/security/keys/compat.c @@ -141,6 +141,24 @@ COMPAT_SYSCALL_DEFINE5(keyctl, u32, option, return keyctl_restrict_keyring(arg2, compat_ptr(arg3), compat_ptr(arg4)); + case KEYCTL_PKEY_QUERY: + if (arg3 != 0) + return -EINVAL; + return keyctl_pkey_query(arg2, + compat_ptr(arg4), + compat_ptr(arg5)); + + case KEYCTL_PKEY_ENCRYPT: + case KEYCTL_PKEY_DECRYPT: + case KEYCTL_PKEY_SIGN: + return keyctl_pkey_e_d_s(option, + compat_ptr(arg2), compat_ptr(arg3), + compat_ptr(arg4), compat_ptr(arg5)); + + case KEYCTL_PKEY_VERIFY: + return keyctl_pkey_verify(compat_ptr(arg2), compat_ptr(arg3), + compat_ptr(arg4), compat_ptr(arg5)); + default: return -EOPNOTSUPP; } diff --git a/security/keys/internal.h b/security/keys/internal.h index 9f8208dc0e558..74cb0ff42fedb 100644 --- a/security/keys/internal.h +++ b/security/keys/internal.h @@ -298,6 +298,45 @@ static inline long compat_keyctl_dh_compute( #endif #endif +#ifdef CONFIG_ASYMMETRIC_KEY_TYPE +extern long keyctl_pkey_query(key_serial_t, + const char __user *, + struct keyctl_pkey_query __user *); + +extern long keyctl_pkey_verify(const struct keyctl_pkey_params __user *, + const char __user *, + const void __user *, const void __user *); + +extern long keyctl_pkey_e_d_s(int, + const struct keyctl_pkey_params __user *, + const char __user *, + const void __user *, void __user *); +#else +static inline long keyctl_pkey_query(key_serial_t id, + const char __user *_info, + struct keyctl_pkey_query __user *_res) +{ + return -EOPNOTSUPP; +} + +static inline long keyctl_pkey_verify(const struct keyctl_pkey_params __user *params, + const char __user *_info, + const void __user *_in, + const void __user *_in2) +{ + return -EOPNOTSUPP; +} + +static inline long keyctl_pkey_e_d_s(int op, + const struct keyctl_pkey_params __user *params, + const char __user *_info, + const void __user *_in, + void __user *_out) +{ + return -EOPNOTSUPP; +} +#endif + /* * Debugging key validation */ diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c index 1ffe60bb2845f..18619690ce77a 100644 --- a/security/keys/keyctl.c +++ b/security/keys/keyctl.c @@ -1747,6 +1747,30 @@ SYSCALL_DEFINE5(keyctl, int, option, unsigned long, arg2, unsigned long, arg3, (const char __user *) arg3, (const char __user *) arg4); + case KEYCTL_PKEY_QUERY: + if (arg3 != 0) + return -EINVAL; + return keyctl_pkey_query((key_serial_t)arg2, + (const char __user *)arg4, + (struct keyctl_pkey_query *)arg5); + + case KEYCTL_PKEY_ENCRYPT: + case KEYCTL_PKEY_DECRYPT: + case KEYCTL_PKEY_SIGN: + return keyctl_pkey_e_d_s( + option, + (const struct keyctl_pkey_params __user *)arg2, + (const char __user *)arg3, + (const void __user *)arg4, + (void __user *)arg5); + + case KEYCTL_PKEY_VERIFY: + return keyctl_pkey_verify( + (const struct keyctl_pkey_params __user *)arg2, + (const char __user *)arg3, + (const void __user *)arg4, + (const void __user *)arg5); + default: return -EOPNOTSUPP; } diff --git a/security/keys/keyctl_pkey.c b/security/keys/keyctl_pkey.c new file mode 100644 index 0000000000000..783978842f13a --- /dev/null +++ b/security/keys/keyctl_pkey.c @@ -0,0 +1,323 @@ +/* Public-key operation keyctls + * + * Copyright (C) 2016 Red Hat, Inc. All Rights Reserved. + * Written by David Howells (dhowells@redhat.com) + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public Licence + * as published by the Free Software Foundation; either version + * 2 of the Licence, or (at your option) any later version. + */ + +#include +#include +#include +#include +#include +#include +#include +#include "internal.h" + +static void keyctl_pkey_params_free(struct kernel_pkey_params *params) +{ + kfree(params->info); + key_put(params->key); +} + +enum { + Opt_err = -1, + Opt_enc, /* "enc=" eg. "enc=oaep" */ + Opt_hash, /* "hash=" eg. "hash=sha1" */ +}; + +static const match_table_t param_keys = { + { Opt_enc, "enc=%s" }, + { Opt_hash, "hash=%s" }, + { Opt_err, NULL } +}; + +/* + * Parse the information string which consists of key=val pairs. + */ +static int keyctl_pkey_params_parse(struct kernel_pkey_params *params) +{ + unsigned long token_mask = 0; + substring_t args[MAX_OPT_ARGS]; + char *c = params->info, *p, *q; + int token; + + while ((p = strsep(&c, " \t"))) { + if (*p == '\0' || *p == ' ' || *p == '\t') + continue; + token = match_token(p, param_keys, args); + if (__test_and_set_bit(token, &token_mask)) + return -EINVAL; + q = args[0].from; + if (!q[0]) + return -EINVAL; + + switch (token) { + case Opt_enc: + params->encoding = q; + break; + + case Opt_hash: + params->hash_algo = q; + break; + + default: + return -EINVAL; + } + } + + return 0; +} + +/* + * Interpret parameters. Callers must always call the free function + * on params, even if an error is returned. + */ +static int keyctl_pkey_params_get(key_serial_t id, + const char __user *_info, + struct kernel_pkey_params *params) +{ + key_ref_t key_ref; + void *p; + int ret; + + memset(params, 0, sizeof(*params)); + params->encoding = "raw"; + + p = strndup_user(_info, PAGE_SIZE); + if (IS_ERR(p)) + return PTR_ERR(p); + params->info = p; + + ret = keyctl_pkey_params_parse(params); + if (ret < 0) + return ret; + + key_ref = lookup_user_key(id, 0, KEY_NEED_SEARCH); + if (IS_ERR(key_ref)) + return PTR_ERR(key_ref); + params->key = key_ref_to_ptr(key_ref); + + if (!params->key->type->asym_query) + return -EOPNOTSUPP; + + return 0; +} + +/* + * Get parameters from userspace. Callers must always call the free function + * on params, even if an error is returned. + */ +static int keyctl_pkey_params_get_2(const struct keyctl_pkey_params __user *_params, + const char __user *_info, + int op, + struct kernel_pkey_params *params) +{ + struct keyctl_pkey_params uparams; + struct kernel_pkey_query info; + int ret; + + memset(params, 0, sizeof(*params)); + params->encoding = "raw"; + + if (copy_from_user(&uparams, _params, sizeof(uparams)) != 0) + return -EFAULT; + + ret = keyctl_pkey_params_get(uparams.key_id, _info, params); + if (ret < 0) + return ret; + + ret = params->key->type->asym_query(params, &info); + if (ret < 0) + return ret; + + switch (op) { + case KEYCTL_PKEY_ENCRYPT: + case KEYCTL_PKEY_DECRYPT: + if (uparams.in_len > info.max_enc_size || + uparams.out_len > info.max_dec_size) + return -EINVAL; + break; + case KEYCTL_PKEY_SIGN: + case KEYCTL_PKEY_VERIFY: + if (uparams.in_len > info.max_sig_size || + uparams.out_len > info.max_data_size) + return -EINVAL; + break; + default: + BUG(); + } + + params->in_len = uparams.in_len; + params->out_len = uparams.out_len; + return 0; +} + +/* + * Query information about an asymmetric key. + */ +long keyctl_pkey_query(key_serial_t id, + const char __user *_info, + struct keyctl_pkey_query __user *_res) +{ + struct kernel_pkey_params params; + struct kernel_pkey_query res; + long ret; + + memset(¶ms, 0, sizeof(params)); + + ret = keyctl_pkey_params_get(id, _info, ¶ms); + if (ret < 0) + goto error; + + ret = params.key->type->asym_query(¶ms, &res); + if (ret < 0) + goto error; + + ret = -EFAULT; + if (copy_to_user(_res, &res, sizeof(res)) == 0 && + clear_user(_res->__spare, sizeof(_res->__spare)) == 0) + ret = 0; + +error: + keyctl_pkey_params_free(¶ms); + return ret; +} + +/* + * Encrypt/decrypt/sign + * + * Encrypt data, decrypt data or sign data using a public key. + * + * _info is a string of supplementary information in key=val format. For + * instance, it might contain: + * + * "enc=pkcs1 hash=sha256" + * + * where enc= specifies the encoding and hash= selects the OID to go in that + * particular encoding if required. If enc= isn't supplied, it's assumed that + * the caller is supplying raw values. + * + * If successful, the amount of data written into the output buffer is + * returned. + */ +long keyctl_pkey_e_d_s(int op, + const struct keyctl_pkey_params __user *_params, + const char __user *_info, + const void __user *_in, + void __user *_out) +{ + struct kernel_pkey_params params; + void *in, *out; + long ret; + + ret = keyctl_pkey_params_get_2(_params, _info, op, ¶ms); + if (ret < 0) + goto error_params; + + ret = -EOPNOTSUPP; + if (!params.key->type->asym_eds_op) + goto error_params; + + switch (op) { + case KEYCTL_PKEY_ENCRYPT: + params.op = kernel_pkey_encrypt; + break; + case KEYCTL_PKEY_DECRYPT: + params.op = kernel_pkey_decrypt; + break; + case KEYCTL_PKEY_SIGN: + params.op = kernel_pkey_sign; + break; + default: + BUG(); + } + + in = memdup_user(_in, params.in_len); + if (IS_ERR(in)) { + ret = PTR_ERR(in); + goto error_params; + } + + ret = -ENOMEM; + out = kmalloc(params.out_len, GFP_KERNEL); + if (!out) + goto error_in; + + ret = params.key->type->asym_eds_op(¶ms, in, out); + if (ret < 0) + goto error_out; + + if (copy_to_user(_out, out, ret) != 0) + ret = -EFAULT; + +error_out: + kfree(out); +error_in: + kfree(in); +error_params: + keyctl_pkey_params_free(¶ms); + return ret; +} + +/* + * Verify a signature. + * + * Verify a public key signature using the given key, or if not given, search + * for a matching key. + * + * _info is a string of supplementary information in key=val format. For + * instance, it might contain: + * + * "enc=pkcs1 hash=sha256" + * + * where enc= specifies the signature blob encoding and hash= selects the OID + * to go in that particular encoding. If enc= isn't supplied, it's assumed + * that the caller is supplying raw values. + * + * If successful, 0 is returned. + */ +long keyctl_pkey_verify(const struct keyctl_pkey_params __user *_params, + const char __user *_info, + const void __user *_in, + const void __user *_in2) +{ + struct kernel_pkey_params params; + void *in, *in2; + long ret; + + ret = keyctl_pkey_params_get_2(_params, _info, KEYCTL_PKEY_VERIFY, + ¶ms); + if (ret < 0) + goto error_params; + + ret = -EOPNOTSUPP; + if (!params.key->type->asym_verify_signature) + goto error_params; + + in = memdup_user(_in, params.in_len); + if (IS_ERR(in)) { + ret = PTR_ERR(in); + goto error_params; + } + + in2 = memdup_user(_in2, params.in2_len); + if (IS_ERR(in2)) { + ret = PTR_ERR(in2); + goto error_in; + } + + params.op = kernel_pkey_verify; + ret = params.key->type->asym_verify_signature(¶ms, in, in2); + + kfree(in2); +error_in: + kfree(in); +error_params: + keyctl_pkey_params_free(¶ms); + return ret; +} -- cgit v1.2.3 From 5a30771832aab228e0863e414f9182f86797429e Mon Sep 17 00:00:00 2001 From: David Howells Date: Tue, 9 Oct 2018 17:47:07 +0100 Subject: KEYS: Provide missing asymmetric key subops for new key type ops [ver #2] Provide the missing asymmetric key subops for new key type ops. This include query, encrypt, decrypt and create signature. Verify signature already exists. Also provided are accessor functions for this: int query_asymmetric_key(const struct key *key, struct kernel_pkey_query *info); int encrypt_blob(struct kernel_pkey_params *params, const void *data, void *enc); int decrypt_blob(struct kernel_pkey_params *params, const void *enc, void *data); int create_signature(struct kernel_pkey_params *params, const void *data, void *enc); The public_key_signature struct gains an encoding field to carry the encoding for verify_signature(). Signed-off-by: David Howells Tested-by: Marcel Holtmann Reviewed-by: Marcel Holtmann Reviewed-by: Denis Kenzior Tested-by: Denis Kenzior Signed-off-by: James Morris --- Documentation/crypto/asymmetric-keys.txt | 24 ++++++-- crypto/asymmetric_keys/asymmetric_keys.h | 3 + crypto/asymmetric_keys/asymmetric_type.c | 43 +++++++++++++++ crypto/asymmetric_keys/signature.c | 95 ++++++++++++++++++++++++++++++++ include/crypto/public_key.h | 13 ++++- include/keys/asymmetric-subtype.h | 9 +++ 6 files changed, 180 insertions(+), 7 deletions(-) (limited to 'Documentation') diff --git a/Documentation/crypto/asymmetric-keys.txt b/Documentation/crypto/asymmetric-keys.txt index 5969bf42562a8..deb656ef008b2 100644 --- a/Documentation/crypto/asymmetric-keys.txt +++ b/Documentation/crypto/asymmetric-keys.txt @@ -183,6 +183,10 @@ and looks like the following: void (*describe)(const struct key *key, struct seq_file *m); void (*destroy)(void *payload); + int (*query)(const struct kernel_pkey_params *params, + struct kernel_pkey_query *info); + int (*eds_op)(struct kernel_pkey_params *params, + const void *in, void *out); int (*verify_signature)(const struct key *key, const struct public_key_signature *sig); }; @@ -207,12 +211,22 @@ There are a number of operations defined by the subtype: asymmetric key will look after freeing the fingerprint and releasing the reference on the subtype module. - (3) verify_signature(). + (3) query(). - Optional. These are the entry points for the key usage operations. - Currently there is only the one defined. If not set, the caller will be - given -ENOTSUPP. The subtype may do anything it likes to implement an - operation, including offloading to hardware. + Mandatory. This is a function for querying the capabilities of a key. + + (4) eds_op(). + + Optional. This is the entry point for the encryption, decryption and + signature creation operations (which are distinguished by the operation ID + in the parameter struct). The subtype may do anything it likes to + implement an operation, including offloading to hardware. + + (5) verify_signature(). + + Optional. This is the entry point for signature verification. The + subtype may do anything it likes to implement an operation, including + offloading to hardware. ========================== diff --git a/crypto/asymmetric_keys/asymmetric_keys.h b/crypto/asymmetric_keys/asymmetric_keys.h index ca8e9ac34ce62..7be1ccf4fa9f2 100644 --- a/crypto/asymmetric_keys/asymmetric_keys.h +++ b/crypto/asymmetric_keys/asymmetric_keys.h @@ -16,3 +16,6 @@ extern struct asymmetric_key_id *asymmetric_key_hex_to_key_id(const char *id); extern int __asymmetric_key_hex_to_key_id(const char *id, struct asymmetric_key_id *match_id, size_t hexlen); + +extern int asymmetric_key_eds_op(struct kernel_pkey_params *params, + const void *in, void *out); diff --git a/crypto/asymmetric_keys/asymmetric_type.c b/crypto/asymmetric_keys/asymmetric_type.c index 26539e9a8bda4..69a0788a7de5d 100644 --- a/crypto/asymmetric_keys/asymmetric_type.c +++ b/crypto/asymmetric_keys/asymmetric_type.c @@ -18,6 +18,7 @@ #include #include #include +#include #include "asymmetric_keys.h" MODULE_LICENSE("GPL"); @@ -538,6 +539,45 @@ out: return ret; } +int asymmetric_key_eds_op(struct kernel_pkey_params *params, + const void *in, void *out) +{ + const struct asymmetric_key_subtype *subtype; + struct key *key = params->key; + int ret; + + pr_devel("==>%s()\n", __func__); + + if (key->type != &key_type_asymmetric) + return -EINVAL; + subtype = asymmetric_key_subtype(key); + if (!subtype || + !key->payload.data[0]) + return -EINVAL; + if (!subtype->eds_op) + return -ENOTSUPP; + + ret = subtype->eds_op(params, in, out); + + pr_devel("<==%s() = %d\n", __func__, ret); + return ret; +} + +static int asymmetric_key_verify_signature(struct kernel_pkey_params *params, + const void *in, const void *in2) +{ + struct public_key_signature sig = { + .s_size = params->in2_len, + .digest_size = params->in_len, + .encoding = params->encoding, + .hash_algo = params->hash_algo, + .digest = (void *)in, + .s = (void *)in2, + }; + + return verify_signature(params->key, &sig); +} + struct key_type key_type_asymmetric = { .name = "asymmetric", .preparse = asymmetric_key_preparse, @@ -548,6 +588,9 @@ struct key_type key_type_asymmetric = { .destroy = asymmetric_key_destroy, .describe = asymmetric_key_describe, .lookup_restriction = asymmetric_lookup_restriction, + .asym_query = query_asymmetric_key, + .asym_eds_op = asymmetric_key_eds_op, + .asym_verify_signature = asymmetric_key_verify_signature, }; EXPORT_SYMBOL_GPL(key_type_asymmetric); diff --git a/crypto/asymmetric_keys/signature.c b/crypto/asymmetric_keys/signature.c index 28198314bc39f..ad95a58c66427 100644 --- a/crypto/asymmetric_keys/signature.c +++ b/crypto/asymmetric_keys/signature.c @@ -16,7 +16,9 @@ #include #include #include +#include #include +#include #include "asymmetric_keys.h" /* @@ -36,6 +38,99 @@ void public_key_signature_free(struct public_key_signature *sig) } EXPORT_SYMBOL_GPL(public_key_signature_free); +/** + * query_asymmetric_key - Get information about an aymmetric key. + * @params: Various parameters. + * @info: Where to put the information. + */ +int query_asymmetric_key(const struct kernel_pkey_params *params, + struct kernel_pkey_query *info) +{ + const struct asymmetric_key_subtype *subtype; + struct key *key = params->key; + int ret; + + pr_devel("==>%s()\n", __func__); + + if (key->type != &key_type_asymmetric) + return -EINVAL; + subtype = asymmetric_key_subtype(key); + if (!subtype || + !key->payload.data[0]) + return -EINVAL; + if (!subtype->query) + return -ENOTSUPP; + + ret = subtype->query(params, info); + + pr_devel("<==%s() = %d\n", __func__, ret); + return ret; +} +EXPORT_SYMBOL_GPL(query_asymmetric_key); + +/** + * encrypt_blob - Encrypt data using an asymmetric key + * @params: Various parameters + * @data: Data blob to be encrypted, length params->data_len + * @enc: Encrypted data buffer, length params->enc_len + * + * Encrypt the specified data blob using the private key specified by + * params->key. The encrypted data is wrapped in an encoding if + * params->encoding is specified (eg. "pkcs1"). + * + * Returns the length of the data placed in the encrypted data buffer or an + * error. + */ +int encrypt_blob(struct kernel_pkey_params *params, + const void *data, void *enc) +{ + params->op = kernel_pkey_encrypt; + return asymmetric_key_eds_op(params, data, enc); +} +EXPORT_SYMBOL_GPL(encrypt_blob); + +/** + * decrypt_blob - Decrypt data using an asymmetric key + * @params: Various parameters + * @enc: Encrypted data to be decrypted, length params->enc_len + * @data: Decrypted data buffer, length params->data_len + * + * Decrypt the specified data blob using the private key specified by + * params->key. The decrypted data is wrapped in an encoding if + * params->encoding is specified (eg. "pkcs1"). + * + * Returns the length of the data placed in the decrypted data buffer or an + * error. + */ +int decrypt_blob(struct kernel_pkey_params *params, + const void *enc, void *data) +{ + params->op = kernel_pkey_decrypt; + return asymmetric_key_eds_op(params, enc, data); +} +EXPORT_SYMBOL_GPL(decrypt_blob); + +/** + * create_signature - Sign some data using an asymmetric key + * @params: Various parameters + * @data: Data blob to be signed, length params->data_len + * @enc: Signature buffer, length params->enc_len + * + * Sign the specified data blob using the private key specified by params->key. + * The signature is wrapped in an encoding if params->encoding is specified + * (eg. "pkcs1"). If the encoding needs to know the digest type, this can be + * passed through params->hash_algo (eg. "sha1"). + * + * Returns the length of the data placed in the signature buffer or an error. + */ +int create_signature(struct kernel_pkey_params *params, + const void *data, void *enc) +{ + params->op = kernel_pkey_sign; + return asymmetric_key_eds_op(params, data, enc); +} +EXPORT_SYMBOL_GPL(create_signature); + /** * verify_signature - Initiate the use of an asymmetric key to verify a signature * @key: The asymmetric key to verify against diff --git a/include/crypto/public_key.h b/include/crypto/public_key.h index e0b681a717bac..3a1047a0195c3 100644 --- a/include/crypto/public_key.h +++ b/include/crypto/public_key.h @@ -14,6 +14,8 @@ #ifndef _LINUX_PUBLIC_KEY_H #define _LINUX_PUBLIC_KEY_H +#include + /* * Cryptographic data for the public-key subtype of the asymmetric key type. * @@ -40,6 +42,7 @@ struct public_key_signature { u8 digest_size; /* Number of bytes in digest */ const char *pkey_algo; const char *hash_algo; + const char *encoding; }; extern void public_key_signature_free(struct public_key_signature *sig); @@ -65,8 +68,14 @@ extern int restrict_link_by_key_or_keyring_chain(struct key *trust_keyring, const union key_payload *payload, struct key *trusted); -extern int verify_signature(const struct key *key, - const struct public_key_signature *sig); +extern int query_asymmetric_key(const struct kernel_pkey_params *, + struct kernel_pkey_query *); + +extern int encrypt_blob(struct kernel_pkey_params *, const void *, void *); +extern int decrypt_blob(struct kernel_pkey_params *, const void *, void *); +extern int create_signature(struct kernel_pkey_params *, const void *, void *); +extern int verify_signature(const struct key *, + const struct public_key_signature *); int public_key_verify_signature(const struct public_key *pkey, const struct public_key_signature *sig); diff --git a/include/keys/asymmetric-subtype.h b/include/keys/asymmetric-subtype.h index e0a9c23688728..9ce2f0fae57e3 100644 --- a/include/keys/asymmetric-subtype.h +++ b/include/keys/asymmetric-subtype.h @@ -17,6 +17,8 @@ #include #include +struct kernel_pkey_query; +struct kernel_pkey_params; struct public_key_signature; /* @@ -34,6 +36,13 @@ struct asymmetric_key_subtype { /* Destroy a key of this subtype */ void (*destroy)(void *payload_crypto, void *payload_auth); + int (*query)(const struct kernel_pkey_params *params, + struct kernel_pkey_query *info); + + /* Encrypt/decrypt/sign data */ + int (*eds_op)(struct kernel_pkey_params *params, + const void *in, void *out); + /* Verify the signature on a key of this subtype (optional) */ int (*verify_signature)(const struct key *key, const struct public_key_signature *sig); -- cgit v1.2.3 From 3c58b2362ba828ee2970c66c6a6fd7b04fde4413 Mon Sep 17 00:00:00 2001 From: David Howells Date: Tue, 9 Oct 2018 17:47:46 +0100 Subject: KEYS: Implement PKCS#8 RSA Private Key parser [ver #2] Implement PKCS#8 RSA Private Key format [RFC 5208] parser for the asymmetric key type. For the moment, this will only support unencrypted DER blobs. PEM and decryption can be added later. PKCS#8 keys can be loaded like this: openssl pkcs8 -in private_key.pem -topk8 -nocrypt -outform DER | \ keyctl padd asymmetric foo @s Signed-off-by: David Howells Tested-by: Marcel Holtmann Reviewed-by: Marcel Holtmann Reviewed-by: Denis Kenzior Tested-by: Denis Kenzior Signed-off-by: James Morris --- Documentation/crypto/asymmetric-keys.txt | 2 + crypto/asymmetric_keys/Kconfig | 10 ++ crypto/asymmetric_keys/Makefile | 13 +++ crypto/asymmetric_keys/pkcs8.asn1 | 24 ++++ crypto/asymmetric_keys/pkcs8_parser.c | 184 +++++++++++++++++++++++++++++++ 5 files changed, 233 insertions(+) create mode 100644 crypto/asymmetric_keys/pkcs8.asn1 create mode 100644 crypto/asymmetric_keys/pkcs8_parser.c (limited to 'Documentation') diff --git a/Documentation/crypto/asymmetric-keys.txt b/Documentation/crypto/asymmetric-keys.txt index deb656ef008b2..8763866b11cfd 100644 --- a/Documentation/crypto/asymmetric-keys.txt +++ b/Documentation/crypto/asymmetric-keys.txt @@ -248,6 +248,8 @@ Examples of blob formats for which parsers could be implemented include: - X.509 ASN.1 stream. - Pointer to TPM key. - Pointer to UEFI key. + - PKCS#8 private key [RFC 5208]. + - PKCS#5 encrypted private key [RFC 2898]. During key instantiation each parser in the list is tried until one doesn't return -EBADMSG. diff --git a/crypto/asymmetric_keys/Kconfig b/crypto/asymmetric_keys/Kconfig index f3702e533ff41..66a7dad7ed3d1 100644 --- a/crypto/asymmetric_keys/Kconfig +++ b/crypto/asymmetric_keys/Kconfig @@ -31,6 +31,16 @@ config X509_CERTIFICATE_PARSER data and provides the ability to instantiate a crypto key from a public key packet found inside the certificate. +config PKCS8_PRIVATE_KEY_PARSER + tristate "PKCS#8 private key parser" + depends on ASYMMETRIC_PUBLIC_KEY_SUBTYPE + select ASN1 + select OID_REGISTRY + help + This option provides support for parsing PKCS#8 format blobs for + private key data and provides the ability to instantiate a crypto key + from that data. + config PKCS7_MESSAGE_PARSER tristate "PKCS#7 message parser" depends on X509_CERTIFICATE_PARSER diff --git a/crypto/asymmetric_keys/Makefile b/crypto/asymmetric_keys/Makefile index d4b2e1b2dc650..c38424f55b08d 100644 --- a/crypto/asymmetric_keys/Makefile +++ b/crypto/asymmetric_keys/Makefile @@ -29,6 +29,19 @@ $(obj)/x509_cert_parser.o: \ $(obj)/x509.asn1.o: $(obj)/x509.asn1.c $(obj)/x509.asn1.h $(obj)/x509_akid.asn1.o: $(obj)/x509_akid.asn1.c $(obj)/x509_akid.asn1.h +# +# PKCS#8 private key handling +# +obj-$(CONFIG_PKCS8_PRIVATE_KEY_PARSER) += pkcs8_key_parser.o +pkcs8_key_parser-y := \ + pkcs8.asn1.o \ + pkcs8_parser.o + +$(obj)/pkcs8_parser.o: $(obj)/pkcs8.asn1.h +$(obj)/pkcs8-asn1.o: $(obj)/pkcs8.asn1.c $(obj)/pkcs8.asn1.h + +clean-files += pkcs8.asn1.c pkcs8.asn1.h + # # PKCS#7 message handling # diff --git a/crypto/asymmetric_keys/pkcs8.asn1 b/crypto/asymmetric_keys/pkcs8.asn1 new file mode 100644 index 0000000000000..702c41a3c7137 --- /dev/null +++ b/crypto/asymmetric_keys/pkcs8.asn1 @@ -0,0 +1,24 @@ +-- +-- This is the unencrypted variant +-- +PrivateKeyInfo ::= SEQUENCE { + version Version, + privateKeyAlgorithm PrivateKeyAlgorithmIdentifier, + privateKey PrivateKey, + attributes [0] IMPLICIT Attributes OPTIONAL +} + +Version ::= INTEGER ({ pkcs8_note_version }) + +PrivateKeyAlgorithmIdentifier ::= AlgorithmIdentifier ({ pkcs8_note_algo }) + +PrivateKey ::= OCTET STRING ({ pkcs8_note_key }) + +Attributes ::= SET OF Attribute + +Attribute ::= ANY + +AlgorithmIdentifier ::= SEQUENCE { + algorithm OBJECT IDENTIFIER ({ pkcs8_note_OID }), + parameters ANY OPTIONAL +} diff --git a/crypto/asymmetric_keys/pkcs8_parser.c b/crypto/asymmetric_keys/pkcs8_parser.c new file mode 100644 index 0000000000000..5f6a7ecc9765e --- /dev/null +++ b/crypto/asymmetric_keys/pkcs8_parser.c @@ -0,0 +1,184 @@ +/* PKCS#8 Private Key parser [RFC 5208]. + * + * Copyright (C) 2016 Red Hat, Inc. All Rights Reserved. + * Written by David Howells (dhowells@redhat.com) + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public Licence + * as published by the Free Software Foundation; either version + * 2 of the Licence, or (at your option) any later version. + */ + +#define pr_fmt(fmt) "PKCS8: "fmt +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "pkcs8.asn1.h" + +struct pkcs8_parse_context { + struct public_key *pub; + unsigned long data; /* Start of data */ + enum OID last_oid; /* Last OID encountered */ + enum OID algo_oid; /* Algorithm OID */ + u32 key_size; + const void *key; +}; + +/* + * Note an OID when we find one for later processing when we know how to + * interpret it. + */ +int pkcs8_note_OID(void *context, size_t hdrlen, + unsigned char tag, + const void *value, size_t vlen) +{ + struct pkcs8_parse_context *ctx = context; + + ctx->last_oid = look_up_OID(value, vlen); + if (ctx->last_oid == OID__NR) { + char buffer[50]; + + sprint_oid(value, vlen, buffer, sizeof(buffer)); + pr_info("Unknown OID: [%lu] %s\n", + (unsigned long)value - ctx->data, buffer); + } + return 0; +} + +/* + * Note the version number of the ASN.1 blob. + */ +int pkcs8_note_version(void *context, size_t hdrlen, + unsigned char tag, + const void *value, size_t vlen) +{ + if (vlen != 1 || ((const u8 *)value)[0] != 0) { + pr_warn("Unsupported PKCS#8 version\n"); + return -EBADMSG; + } + return 0; +} + +/* + * Note the public algorithm. + */ +int pkcs8_note_algo(void *context, size_t hdrlen, + unsigned char tag, + const void *value, size_t vlen) +{ + struct pkcs8_parse_context *ctx = context; + + if (ctx->last_oid != OID_rsaEncryption) + return -ENOPKG; + + ctx->pub->pkey_algo = "rsa"; + return 0; +} + +/* + * Note the key data of the ASN.1 blob. + */ +int pkcs8_note_key(void *context, size_t hdrlen, + unsigned char tag, + const void *value, size_t vlen) +{ + struct pkcs8_parse_context *ctx = context; + + ctx->key = value; + ctx->key_size = vlen; + return 0; +} + +/* + * Parse a PKCS#8 private key blob. + */ +static struct public_key *pkcs8_parse(const void *data, size_t datalen) +{ + struct pkcs8_parse_context ctx; + struct public_key *pub; + long ret; + + memset(&ctx, 0, sizeof(ctx)); + + ret = -ENOMEM; + ctx.pub = kzalloc(sizeof(struct public_key), GFP_KERNEL); + if (!ctx.pub) + goto error; + + ctx.data = (unsigned long)data; + + /* Attempt to decode the private key */ + ret = asn1_ber_decoder(&pkcs8_decoder, &ctx, data, datalen); + if (ret < 0) + goto error_decode; + + ret = -ENOMEM; + pub = ctx.pub; + pub->key = kmemdup(ctx.key, ctx.key_size, GFP_KERNEL); + if (!pub->key) + goto error_decode; + + pub->keylen = ctx.key_size; + pub->key_is_private = true; + return pub; + +error_decode: + kfree(ctx.pub); +error: + return ERR_PTR(ret); +} + +/* + * Attempt to parse a data blob for a key as a PKCS#8 private key. + */ +static int pkcs8_key_preparse(struct key_preparsed_payload *prep) +{ + struct public_key *pub; + + pub = pkcs8_parse(prep->data, prep->datalen); + if (IS_ERR(pub)) + return PTR_ERR(pub); + + pr_devel("Cert Key Algo: %s\n", pub->pkey_algo); + pub->id_type = "PKCS8"; + + /* We're pinning the module by being linked against it */ + __module_get(public_key_subtype.owner); + prep->payload.data[asym_subtype] = &public_key_subtype; + prep->payload.data[asym_key_ids] = NULL; + prep->payload.data[asym_crypto] = pub; + prep->payload.data[asym_auth] = NULL; + prep->quotalen = 100; + return 0; +} + +static struct asymmetric_key_parser pkcs8_key_parser = { + .owner = THIS_MODULE, + .name = "pkcs8", + .parse = pkcs8_key_preparse, +}; + +/* + * Module stuff + */ +static int __init pkcs8_key_init(void) +{ + return register_asymmetric_key_parser(&pkcs8_key_parser); +} + +static void __exit pkcs8_key_exit(void) +{ + unregister_asymmetric_key_parser(&pkcs8_key_parser); +} + +module_init(pkcs8_key_init); +module_exit(pkcs8_key_exit); + +MODULE_DESCRIPTION("PKCS#8 certificate parser"); +MODULE_LICENSE("GPL"); -- cgit v1.2.3 From cb028e49129f352d2f2645727ceeeacdd64761f4 Mon Sep 17 00:00:00 2001 From: Douglas Anderson Date: Thu, 25 Oct 2018 15:21:29 -0700 Subject: dt-bindings: drm/panel: simple: Add no-hpd property Some eDP panels that are designed to be always connected to a board use their HPD signal to signal that they've finished powering on and they're ready to be talked to. However, for various reasons it's possible that the HPD signal from the panel isn't actually hooked up. In the case where the HPD isn't hooked up you can look at the timing diagram on the panel datasheet and insert a delay for the maximum amount of time that the HPD might take to come up. Let's add a property in the device tree for this concept. Reviewed-by: Sean Paul Reviewed-by: Rob Herring Signed-off-by: Douglas Anderson Signed-off-by: Sean Paul Link: https://patchwork.freedesktop.org/patch/msgid/20181025222134.174583-1-dianders@chromium.org --- Documentation/devicetree/bindings/display/panel/simple-panel.txt | 3 +++ 1 file changed, 3 insertions(+) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/display/panel/simple-panel.txt b/Documentation/devicetree/bindings/display/panel/simple-panel.txt index 45a457ad38f0f..b2b872c710f24 100644 --- a/Documentation/devicetree/bindings/display/panel/simple-panel.txt +++ b/Documentation/devicetree/bindings/display/panel/simple-panel.txt @@ -11,6 +11,9 @@ Optional properties: - ddc-i2c-bus: phandle of an I2C controller used for DDC EDID probing - enable-gpios: GPIO pin to enable or disable the panel - backlight: phandle of the backlight device attached to the panel +- no-hpd: This panel is supposed to communicate that it's ready via HPD + (hot plug detect) signal, but the signal isn't hooked up so we should + hardcode the max delay from the panel spec when powering up the panel. Example: -- cgit v1.2.3 From c0e9ab24723cbc0be32993f02b34477929555e0d Mon Sep 17 00:00:00 2001 From: Douglas Anderson Date: Thu, 25 Oct 2018 15:21:33 -0700 Subject: dt-bindings: drm/panel: simple: Innolux TV123WAM is actually P120ZDG-BF1 As far as I can tell the bindings that were added in commit 9c04400f7ea6 ("dt-bindings: drm/panel: Document Innolux TV123WAM panel bindings") weren't actually for Innolux TV123WAM but were actually for Innolux P120ZDG-BF1. As far as I can tell the Innolux TV123WAM isn't a real panel and but it's a mosh between the TI TV123WAM and the Innolux P120ZDG-BF1. Let's unmosh. Here's my evidence: * Searching for TV123WAM on the Internet turns up a TI panel. While it's possible that an Innolux panel has the same model number as the TI Panel, it seems a little doubtful. Looking up the datasheet from the TI Panel shows that it's 1920 x 1280 and 259.2 mm x 172.8 mm. * As far as I know, the patch adding the Innolux Panel was supposed to be for the board that's sitting in front of me as I type this (support for that board is not yet upstream). On the back of that panel I see Innolux P120ZDZ-EZ1 rev B1. * Someone pointed me at a datasheet that's supposed to be for the panel in front of me (sorry, I can't share the datasheet). That datasheet has the string "p120zdg-bf1" * If I search for "P120ZDG-BF1" on the Internet I get hits for panels that are 2160x1440. They don't have datasheets, but the fact that the resolution matches is a good sign. While we doing the rename, also mention that no-hpd can be used with this panel. See the previous patch in this series ("drm/panel: simple: Add "no-hpd" delay for Innolux TV123WAM"). Fixes: 9c04400f7ea6 ("dt-bindings: drm/panel: Document Innolux TV123WAM panel bindings") Cc: Sandeep Panda Reviewed-by: Rob Herring Reviewed-by: Andrzej Hajda Signed-off-by: Douglas Anderson Signed-off-by: Sean Paul Link: https://patchwork.freedesktop.org/patch/msgid/20181025222134.174583-5-dianders@chromium.org --- .../bindings/display/panel/innolux,p120zdg-bf1.txt | 22 ++++++++++++++++++++++ .../bindings/display/panel/innolux,tv123wam.txt | 20 -------------------- 2 files changed, 22 insertions(+), 20 deletions(-) create mode 100644 Documentation/devicetree/bindings/display/panel/innolux,p120zdg-bf1.txt delete mode 100644 Documentation/devicetree/bindings/display/panel/innolux,tv123wam.txt (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/display/panel/innolux,p120zdg-bf1.txt b/Documentation/devicetree/bindings/display/panel/innolux,p120zdg-bf1.txt new file mode 100644 index 0000000000000..513f03466abac --- /dev/null +++ b/Documentation/devicetree/bindings/display/panel/innolux,p120zdg-bf1.txt @@ -0,0 +1,22 @@ +Innolux P120ZDG-BF1 12.02 inch eDP 2K display panel + +This binding is compatible with the simple-panel binding, which is specified +in simple-panel.txt in this directory. + +Required properties: +- compatible: should be "innolux,p120zdg-bf1" +- power-supply: regulator to provide the supply voltage + +Optional properties: +- enable-gpios: GPIO pin to enable or disable the panel +- backlight: phandle of the backlight device attached to the panel +- no-hpd: If HPD isn't hooked up; add this property. + +Example: + panel_edp: panel-edp { + compatible = "innolux,p120zdg-bf1"; + enable-gpios = <&msmgpio 31 GPIO_ACTIVE_LOW>; + power-supply = <&pm8916_l2>; + backlight = <&backlight>; + no-hpd; + }; diff --git a/Documentation/devicetree/bindings/display/panel/innolux,tv123wam.txt b/Documentation/devicetree/bindings/display/panel/innolux,tv123wam.txt deleted file mode 100644 index a9b35265fa13a..0000000000000 --- a/Documentation/devicetree/bindings/display/panel/innolux,tv123wam.txt +++ /dev/null @@ -1,20 +0,0 @@ -Innolux TV123WAM 12.3 inch eDP 2K display panel - -This binding is compatible with the simple-panel binding, which is specified -in simple-panel.txt in this directory. - -Required properties: -- compatible: should be "innolux,tv123wam" -- power-supply: regulator to provide the supply voltage - -Optional properties: -- enable-gpios: GPIO pin to enable or disable the panel -- backlight: phandle of the backlight device attached to the panel - -Example: - panel_edp: panel-edp { - compatible = "innolux,tv123wam"; - enable-gpios = <&msmgpio 31 GPIO_ACTIVE_LOW>; - power-supply = <&pm8916_l2>; - backlight = <&backlight>; - }; -- cgit v1.2.3 From 2e5dfc99f2e61c42083ba742395e7a7b353513d1 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 30 Oct 2018 10:41:21 +1100 Subject: vfs: combine the clone and dedupe into a single remap_file_range Combine the clone_file_range and dedupe_file_range operations into a single remap_file_range file operation dispatch since they're fundamentally the same operation. The differences between the two can be made in the prep functions. Signed-off-by: Darrick J. Wong Reviewed-by: Amir Goldstein Reviewed-by: Christoph Hellwig Signed-off-by: Dave Chinner --- Documentation/filesystems/porting | 5 +++++ Documentation/filesystems/vfs.txt | 20 +++++++++++------ fs/btrfs/ctree.h | 8 +++---- fs/btrfs/file.c | 3 +-- fs/btrfs/ioctl.c | 45 ++++++++++++++++++++------------------- fs/cifs/cifsfs.c | 22 +++++++++++-------- fs/nfs/nfs4file.c | 10 ++++++--- fs/ocfs2/file.c | 24 +++++++-------------- fs/overlayfs/file.c | 30 +++++++++++++++----------- fs/read_write.c | 18 ++++++++-------- fs/xfs/xfs_file.c | 23 ++++++-------------- include/linux/fs.h | 25 ++++++++++++++++++---- 12 files changed, 127 insertions(+), 106 deletions(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting index 7b7b845c490a4..e6d4466268dd1 100644 --- a/Documentation/filesystems/porting +++ b/Documentation/filesystems/porting @@ -622,3 +622,8 @@ in your dentry operations instead. alloc_file_clone(file, flags, ops) does not affect any caller's references. On success you get a new struct file sharing the mount/dentry with the original, on failure - ERR_PTR(). +-- +[mandatory] + ->clone_file_range() and ->dedupe_file_range have been replaced with + ->remap_file_range(). See Documentation/filesystems/vfs.txt for more + information. diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index a6c6a8af48a29..6f5babfee27b7 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -883,8 +883,9 @@ struct file_operations { unsigned (*mmap_capabilities)(struct file *); #endif ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int); - int (*clone_file_range)(struct file *, loff_t, struct file *, loff_t, u64); - int (*dedupe_file_range)(struct file *, loff_t, struct file *, loff_t, u64); + int (*remap_file_range)(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, + u64 len, unsigned int remap_flags); int (*fadvise)(struct file *, loff_t, loff_t, int); }; @@ -960,11 +961,16 @@ otherwise noted. copy_file_range: called by the copy_file_range(2) system call. - clone_file_range: called by the ioctl(2) system call for FICLONERANGE and - FICLONE commands. - - dedupe_file_range: called by the ioctl(2) system call for FIDEDUPERANGE - command. + remap_file_range: called by the ioctl(2) system call for FICLONERANGE and + FICLONE and FIDEDUPERANGE commands to remap file ranges. An + implementation should remap len bytes at pos_in of the source file into + the dest file at pos_out. Implementations must handle callers passing + in len == 0; this means "remap to the end of the source file". The + return value should be zero if all bytes were remapped, or the usual + negative error code if the remapping did not succeed completely. + The remap_flags parameter accepts REMAP_FILE_* flags. If + REMAP_FILE_DEDUP is set then the implementation must only remap if the + requested file ranges have identical contents. fadvise: possibly called by the fadvise64() system call. diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 2cddfe7806a41..124a05662fc29 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3218,9 +3218,6 @@ void btrfs_get_block_group_info(struct list_head *groups_list, struct btrfs_ioctl_space_info *space); void btrfs_update_ioctl_balance_args(struct btrfs_fs_info *fs_info, struct btrfs_ioctl_balance_args *bargs); -int btrfs_dedupe_file_range(struct file *src_file, loff_t src_loff, - struct file *dst_file, loff_t dst_loff, - u64 olen); /* file.c */ int __init btrfs_auto_defrag_init(void); @@ -3250,8 +3247,9 @@ int btrfs_dirty_pages(struct inode *inode, struct page **pages, size_t num_pages, loff_t pos, size_t write_bytes, struct extent_state **cached); int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end); -int btrfs_clone_file_range(struct file *file_in, loff_t pos_in, - struct file *file_out, loff_t pos_out, u64 len); +int btrfs_remap_file_range(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, u64 len, + unsigned int remap_flags); /* tree-defrag.c */ int btrfs_defrag_leaves(struct btrfs_trans_handle *trans, diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 2be00e873e92a..9a963f061393e 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -3269,8 +3269,7 @@ const struct file_operations btrfs_file_operations = { #ifdef CONFIG_COMPAT .compat_ioctl = btrfs_compat_ioctl, #endif - .clone_file_range = btrfs_clone_file_range, - .dedupe_file_range = btrfs_dedupe_file_range, + .remap_file_range = btrfs_remap_file_range, }; void __cold btrfs_auto_defrag_exit(void) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index d60b6caf09e85..bfd99c66723e2 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -3627,26 +3627,6 @@ out_unlock: return ret; } -int btrfs_dedupe_file_range(struct file *src_file, loff_t src_loff, - struct file *dst_file, loff_t dst_loff, - u64 olen) -{ - struct inode *src = file_inode(src_file); - struct inode *dst = file_inode(dst_file); - u64 bs = BTRFS_I(src)->root->fs_info->sb->s_blocksize; - - if (WARN_ON_ONCE(bs < PAGE_SIZE)) { - /* - * Btrfs does not support blocksize < page_size. As a - * result, btrfs_cmp_data() won't correctly handle - * this situation without an update. - */ - return -EINVAL; - } - - return btrfs_extent_same(src, src_loff, olen, dst, dst_loff); -} - static int clone_finish_inode_update(struct btrfs_trans_handle *trans, struct inode *inode, u64 endoff, @@ -4348,9 +4328,30 @@ out_unlock: return ret; } -int btrfs_clone_file_range(struct file *src_file, loff_t off, - struct file *dst_file, loff_t destoff, u64 len) +int btrfs_remap_file_range(struct file *src_file, loff_t off, + struct file *dst_file, loff_t destoff, u64 len, + unsigned int remap_flags) { + if (remap_flags & ~(REMAP_FILE_DEDUP | REMAP_FILE_ADVISORY)) + return -EINVAL; + + if (remap_flags & REMAP_FILE_DEDUP) { + struct inode *src = file_inode(src_file); + struct inode *dst = file_inode(dst_file); + u64 bs = BTRFS_I(src)->root->fs_info->sb->s_blocksize; + + if (WARN_ON_ONCE(bs < PAGE_SIZE)) { + /* + * Btrfs does not support blocksize < page_size. As a + * result, btrfs_cmp_data() won't correctly handle + * this situation without an update. + */ + return -EINVAL; + } + + return btrfs_extent_same(src, off, len, dst, destoff); + } + return btrfs_clone_files(dst_file, src_file, off, len, destoff); } diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c index 7065426b3280f..e8144d0dcde2e 100644 --- a/fs/cifs/cifsfs.c +++ b/fs/cifs/cifsfs.c @@ -975,8 +975,9 @@ const struct inode_operations cifs_symlink_inode_ops = { .listxattr = cifs_listxattr, }; -static int cifs_clone_file_range(struct file *src_file, loff_t off, - struct file *dst_file, loff_t destoff, u64 len) +static int cifs_remap_file_range(struct file *src_file, loff_t off, + struct file *dst_file, loff_t destoff, u64 len, + unsigned int remap_flags) { struct inode *src_inode = file_inode(src_file); struct inode *target_inode = file_inode(dst_file); @@ -986,6 +987,9 @@ static int cifs_clone_file_range(struct file *src_file, loff_t off, unsigned int xid; int rc; + if (remap_flags & ~REMAP_FILE_ADVISORY) + return -EINVAL; + cifs_dbg(FYI, "clone range\n"); xid = get_xid(); @@ -1134,7 +1138,7 @@ const struct file_operations cifs_file_ops = { .llseek = cifs_llseek, .unlocked_ioctl = cifs_ioctl, .copy_file_range = cifs_copy_file_range, - .clone_file_range = cifs_clone_file_range, + .remap_file_range = cifs_remap_file_range, .setlease = cifs_setlease, .fallocate = cifs_fallocate, }; @@ -1153,7 +1157,7 @@ const struct file_operations cifs_file_strict_ops = { .llseek = cifs_llseek, .unlocked_ioctl = cifs_ioctl, .copy_file_range = cifs_copy_file_range, - .clone_file_range = cifs_clone_file_range, + .remap_file_range = cifs_remap_file_range, .setlease = cifs_setlease, .fallocate = cifs_fallocate, }; @@ -1172,7 +1176,7 @@ const struct file_operations cifs_file_direct_ops = { .splice_write = iter_file_splice_write, .unlocked_ioctl = cifs_ioctl, .copy_file_range = cifs_copy_file_range, - .clone_file_range = cifs_clone_file_range, + .remap_file_range = cifs_remap_file_range, .llseek = cifs_llseek, .setlease = cifs_setlease, .fallocate = cifs_fallocate, @@ -1191,7 +1195,7 @@ const struct file_operations cifs_file_nobrl_ops = { .llseek = cifs_llseek, .unlocked_ioctl = cifs_ioctl, .copy_file_range = cifs_copy_file_range, - .clone_file_range = cifs_clone_file_range, + .remap_file_range = cifs_remap_file_range, .setlease = cifs_setlease, .fallocate = cifs_fallocate, }; @@ -1209,7 +1213,7 @@ const struct file_operations cifs_file_strict_nobrl_ops = { .llseek = cifs_llseek, .unlocked_ioctl = cifs_ioctl, .copy_file_range = cifs_copy_file_range, - .clone_file_range = cifs_clone_file_range, + .remap_file_range = cifs_remap_file_range, .setlease = cifs_setlease, .fallocate = cifs_fallocate, }; @@ -1227,7 +1231,7 @@ const struct file_operations cifs_file_direct_nobrl_ops = { .splice_write = iter_file_splice_write, .unlocked_ioctl = cifs_ioctl, .copy_file_range = cifs_copy_file_range, - .clone_file_range = cifs_clone_file_range, + .remap_file_range = cifs_remap_file_range, .llseek = cifs_llseek, .setlease = cifs_setlease, .fallocate = cifs_fallocate, @@ -1239,7 +1243,7 @@ const struct file_operations cifs_dir_ops = { .read = generic_read_dir, .unlocked_ioctl = cifs_ioctl, .copy_file_range = cifs_copy_file_range, - .clone_file_range = cifs_clone_file_range, + .remap_file_range = cifs_remap_file_range, .llseek = generic_file_llseek, .fsync = cifs_dir_fsync, }; diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c index 4288a6ecaf756..ae5780ce41dcf 100644 --- a/fs/nfs/nfs4file.c +++ b/fs/nfs/nfs4file.c @@ -180,8 +180,9 @@ static long nfs42_fallocate(struct file *filep, int mode, loff_t offset, loff_t return nfs42_proc_allocate(filep, offset, len); } -static int nfs42_clone_file_range(struct file *src_file, loff_t src_off, - struct file *dst_file, loff_t dst_off, u64 count) +static int nfs42_remap_file_range(struct file *src_file, loff_t src_off, + struct file *dst_file, loff_t dst_off, u64 count, + unsigned int remap_flags) { struct inode *dst_inode = file_inode(dst_file); struct nfs_server *server = NFS_SERVER(dst_inode); @@ -190,6 +191,9 @@ static int nfs42_clone_file_range(struct file *src_file, loff_t src_off, bool same_inode = false; int ret; + if (remap_flags & ~REMAP_FILE_ADVISORY) + return -EINVAL; + /* check alignment w.r.t. clone_blksize */ ret = -EINVAL; if (bs) { @@ -262,7 +266,7 @@ const struct file_operations nfs4_file_operations = { .copy_file_range = nfs4_copy_file_range, .llseek = nfs4_file_llseek, .fallocate = nfs42_fallocate, - .clone_file_range = nfs42_clone_file_range, + .remap_file_range = nfs42_remap_file_range, #else .llseek = nfs_file_llseek, #endif diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c index 9fa35cb6f6e0b..0b757a24567c8 100644 --- a/fs/ocfs2/file.c +++ b/fs/ocfs2/file.c @@ -2527,24 +2527,18 @@ out: return offset; } -static int ocfs2_file_clone_range(struct file *file_in, +static int ocfs2_remap_file_range(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, - u64 len) + u64 len, + unsigned int remap_flags) { - return ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out, - len, false); -} + if (remap_flags & ~(REMAP_FILE_DEDUP | REMAP_FILE_ADVISORY)) + return -EINVAL; -static int ocfs2_file_dedupe_range(struct file *file_in, - loff_t pos_in, - struct file *file_out, - loff_t pos_out, - u64 len) -{ return ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out, - len, true); + len, remap_flags & REMAP_FILE_DEDUP); } const struct inode_operations ocfs2_file_iops = { @@ -2586,8 +2580,7 @@ const struct file_operations ocfs2_fops = { .splice_read = generic_file_splice_read, .splice_write = iter_file_splice_write, .fallocate = ocfs2_fallocate, - .clone_file_range = ocfs2_file_clone_range, - .dedupe_file_range = ocfs2_file_dedupe_range, + .remap_file_range = ocfs2_remap_file_range, }; const struct file_operations ocfs2_dops = { @@ -2633,8 +2626,7 @@ const struct file_operations ocfs2_fops_no_plocks = { .splice_read = generic_file_splice_read, .splice_write = iter_file_splice_write, .fallocate = ocfs2_fallocate, - .clone_file_range = ocfs2_file_clone_range, - .dedupe_file_range = ocfs2_file_dedupe_range, + .remap_file_range = ocfs2_remap_file_range, }; const struct file_operations ocfs2_dops_no_plocks = { diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c index 986313da0c889..fffb36fd5920a 100644 --- a/fs/overlayfs/file.c +++ b/fs/overlayfs/file.c @@ -489,26 +489,31 @@ static ssize_t ovl_copy_file_range(struct file *file_in, loff_t pos_in, OVL_COPY); } -static int ovl_clone_file_range(struct file *file_in, loff_t pos_in, - struct file *file_out, loff_t pos_out, u64 len) +static int ovl_remap_file_range(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, + u64 len, unsigned int remap_flags) { - return ovl_copyfile(file_in, pos_in, file_out, pos_out, len, 0, - OVL_CLONE); -} + enum ovl_copyop op; + + if (remap_flags & ~(REMAP_FILE_DEDUP | REMAP_FILE_ADVISORY)) + return -EINVAL; + + if (remap_flags & REMAP_FILE_DEDUP) + op = OVL_DEDUPE; + else + op = OVL_CLONE; -static int ovl_dedupe_file_range(struct file *file_in, loff_t pos_in, - struct file *file_out, loff_t pos_out, u64 len) -{ /* * Don't copy up because of a dedupe request, this wouldn't make sense * most of the time (data would be duplicated instead of deduplicated). */ - if (!ovl_inode_upper(file_inode(file_in)) || - !ovl_inode_upper(file_inode(file_out))) + if (op == OVL_DEDUPE && + (!ovl_inode_upper(file_inode(file_in)) || + !ovl_inode_upper(file_inode(file_out)))) return -EPERM; return ovl_copyfile(file_in, pos_in, file_out, pos_out, len, 0, - OVL_DEDUPE); + op); } const struct file_operations ovl_file_operations = { @@ -525,6 +530,5 @@ const struct file_operations ovl_file_operations = { .compat_ioctl = ovl_compat_ioctl, .copy_file_range = ovl_copy_file_range, - .clone_file_range = ovl_clone_file_range, - .dedupe_file_range = ovl_dedupe_file_range, + .remap_file_range = ovl_remap_file_range, }; diff --git a/fs/read_write.c b/fs/read_write.c index 734c5661fb690..766bdcb381f34 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1588,9 +1588,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, * Try cloning first, this is supported by more file systems, and * more efficient if both clone and copy are supported (e.g. NFS). */ - if (file_in->f_op->clone_file_range) { - ret = file_in->f_op->clone_file_range(file_in, pos_in, - file_out, pos_out, len); + if (file_in->f_op->remap_file_range) { + ret = file_in->f_op->remap_file_range(file_in, pos_in, + file_out, pos_out, len, 0); if (ret == 0) { ret = len; goto done; @@ -1849,7 +1849,7 @@ int do_clone_file_range(struct file *file_in, loff_t pos_in, (file_out->f_flags & O_APPEND)) return -EBADF; - if (!file_in->f_op->clone_file_range) + if (!file_in->f_op->remap_file_range) return -EOPNOTSUPP; ret = remap_verify_area(file_in, pos_in, len, false); @@ -1860,8 +1860,8 @@ int do_clone_file_range(struct file *file_in, loff_t pos_in, if (ret) return ret; - ret = file_in->f_op->clone_file_range(file_in, pos_in, - file_out, pos_out, len); + ret = file_in->f_op->remap_file_range(file_in, pos_in, + file_out, pos_out, len, 0); if (!ret) { fsnotify_access(file_in); fsnotify_modify(file_out); @@ -2006,7 +2006,7 @@ int vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos, goto out_drop_write; ret = -EINVAL; - if (!dst_file->f_op->dedupe_file_range) + if (!dst_file->f_op->remap_file_range) goto out_drop_write; if (len == 0) { @@ -2014,8 +2014,8 @@ int vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos, goto out_drop_write; } - ret = dst_file->f_op->dedupe_file_range(src_file, src_pos, - dst_file, dst_pos, len); + ret = dst_file->f_op->remap_file_range(src_file, src_pos, dst_file, + dst_pos, len, REMAP_FILE_DEDUP); out_drop_write: mnt_drop_write_file(dst_file); diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 61a5ad2600e86..2ad94d508f80f 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -920,27 +920,19 @@ out_unlock: } STATIC int -xfs_file_clone_range( +xfs_file_remap_range( struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, - u64 len) + u64 len, + unsigned int remap_flags) { - return xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out, - len, false); -} + if (remap_flags & ~(REMAP_FILE_DEDUP | REMAP_FILE_ADVISORY)) + return -EINVAL; -STATIC int -xfs_file_dedupe_range( - struct file *file_in, - loff_t pos_in, - struct file *file_out, - loff_t pos_out, - u64 len) -{ return xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out, - len, true); + len, remap_flags & REMAP_FILE_DEDUP); } STATIC int @@ -1175,8 +1167,7 @@ const struct file_operations xfs_file_operations = { .fsync = xfs_file_fsync, .get_unmapped_area = thp_get_unmapped_area, .fallocate = xfs_file_fallocate, - .clone_file_range = xfs_file_clone_range, - .dedupe_file_range = xfs_file_dedupe_range, + .remap_file_range = xfs_file_remap_range, }; const struct file_operations xfs_dir_file_operations = { diff --git a/include/linux/fs.h b/include/linux/fs.h index 55729e1c2e759..888cef35c7d70 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1721,6 +1721,24 @@ struct block_device_operations; #define NOMMU_VMFLAGS \ (NOMMU_MAP_READ | NOMMU_MAP_WRITE | NOMMU_MAP_EXEC) +/* + * These flags control the behavior of the remap_file_range function pointer. + * If it is called with len == 0 that means "remap to end of source file". + * See Documentation/filesystems/vfs.txt for more details about this call. + * + * REMAP_FILE_DEDUP: only remap if contents identical (i.e. deduplicate) + */ +#define REMAP_FILE_DEDUP (1 << 0) + +/* + * These flags signal that the caller is ok with altering various aspects of + * the behavior of the remap operation. The changes must be made by the + * implementation; the vfs remap helper functions can take advantage of them. + * Flags in this category exist to preserve the quirky behavior of the hoisted + * btrfs clone/dedupe ioctls. + * There are no flags yet, but subsequent commits will add some. + */ +#define REMAP_FILE_ADVISORY (0) struct iov_iter; @@ -1759,10 +1777,9 @@ struct file_operations { #endif ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int); - int (*clone_file_range)(struct file *, loff_t, struct file *, loff_t, - u64); - int (*dedupe_file_range)(struct file *, loff_t, struct file *, loff_t, - u64); + int (*remap_file_range)(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, + u64 len, unsigned int remap_flags); int (*fadvise)(struct file *, loff_t, loff_t, int); } __randomize_layout; -- cgit v1.2.3 From 42ec3d4c02187a18e27ff94b409ec27234bf2ffd Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 30 Oct 2018 10:41:49 +1100 Subject: vfs: make remap_file_range functions take and return bytes completed Change the remap_file_range functions to take a number of bytes to operate upon and return the number of bytes they operated on. This is a requirement for allowing fs implementations to return short clone/dedupe results to the user, which will enable us to obey resource limits in a graceful manner. A subsequent patch will enable copy_file_range to signal to the ->clone_file_range implementation that it can handle a short length, which will be returned in the function's return value. For now the short return is not implemented anywhere so the behavior won't change -- either copy_file_range manages to clone the entire range or it tries an alternative. Neither clone ioctl can take advantage of this, alas. Signed-off-by: Darrick J. Wong Reviewed-by: Amir Goldstein Signed-off-by: Dave Chinner --- Documentation/filesystems/vfs.txt | 10 ++++---- fs/btrfs/ctree.h | 6 ++--- fs/btrfs/ioctl.c | 13 +++++++---- fs/cifs/cifsfs.c | 6 ++--- fs/ioctl.c | 10 +++++++- fs/nfs/nfs4file.c | 6 ++--- fs/nfsd/vfs.c | 8 +++++-- fs/ocfs2/file.c | 16 ++++++------- fs/ocfs2/refcounttree.c | 2 +- fs/ocfs2/refcounttree.h | 2 +- fs/overlayfs/copy_up.c | 6 ++--- fs/overlayfs/file.c | 12 +++++----- fs/read_write.c | 49 +++++++++++++++++++++------------------ fs/xfs/xfs_file.c | 9 ++++--- fs/xfs/xfs_reflink.c | 4 ++-- fs/xfs/xfs_reflink.h | 2 +- include/linux/fs.h | 27 +++++++++++---------- mm/filemap.c | 2 +- 18 files changed, 108 insertions(+), 82 deletions(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index 6f5babfee27b7..1bd2919deaca0 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -883,9 +883,9 @@ struct file_operations { unsigned (*mmap_capabilities)(struct file *); #endif ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int); - int (*remap_file_range)(struct file *file_in, loff_t pos_in, - struct file *file_out, loff_t pos_out, - u64 len, unsigned int remap_flags); + loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, + loff_t len, unsigned int remap_flags); int (*fadvise)(struct file *, loff_t, loff_t, int); }; @@ -966,8 +966,8 @@ otherwise noted. implementation should remap len bytes at pos_in of the source file into the dest file at pos_out. Implementations must handle callers passing in len == 0; this means "remap to the end of the source file". The - return value should be zero if all bytes were remapped, or the usual - negative error code if the remapping did not succeed completely. + return value should the number of bytes remapped, or the usual + negative error code if errors occurred before any bytes were remapped. The remap_flags parameter accepts REMAP_FILE_* flags. If REMAP_FILE_DEDUP is set then the implementation must only remap if the requested file ranges have identical contents. diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 124a05662fc29..771a961d77ad4 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3247,9 +3247,9 @@ int btrfs_dirty_pages(struct inode *inode, struct page **pages, size_t num_pages, loff_t pos, size_t write_bytes, struct extent_state **cached); int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end); -int btrfs_remap_file_range(struct file *file_in, loff_t pos_in, - struct file *file_out, loff_t pos_out, u64 len, - unsigned int remap_flags); +loff_t btrfs_remap_file_range(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, + loff_t len, unsigned int remap_flags); /* tree-defrag.c */ int btrfs_defrag_leaves(struct btrfs_trans_handle *trans, diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index bfd99c66723e2..b0c513e10977b 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -4328,10 +4328,12 @@ out_unlock: return ret; } -int btrfs_remap_file_range(struct file *src_file, loff_t off, - struct file *dst_file, loff_t destoff, u64 len, +loff_t btrfs_remap_file_range(struct file *src_file, loff_t off, + struct file *dst_file, loff_t destoff, loff_t len, unsigned int remap_flags) { + int ret; + if (remap_flags & ~(REMAP_FILE_DEDUP | REMAP_FILE_ADVISORY)) return -EINVAL; @@ -4349,10 +4351,11 @@ int btrfs_remap_file_range(struct file *src_file, loff_t off, return -EINVAL; } - return btrfs_extent_same(src, off, len, dst, destoff); + ret = btrfs_extent_same(src, off, len, dst, destoff); + } else { + ret = btrfs_clone_files(dst_file, src_file, off, len, destoff); } - - return btrfs_clone_files(dst_file, src_file, off, len, destoff); + return ret < 0 ? ret : len; } static long btrfs_ioctl_default_subvol(struct file *file, void __user *argp) diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c index e8144d0dcde2e..5ca71c6c8be2f 100644 --- a/fs/cifs/cifsfs.c +++ b/fs/cifs/cifsfs.c @@ -975,8 +975,8 @@ const struct inode_operations cifs_symlink_inode_ops = { .listxattr = cifs_listxattr, }; -static int cifs_remap_file_range(struct file *src_file, loff_t off, - struct file *dst_file, loff_t destoff, u64 len, +static loff_t cifs_remap_file_range(struct file *src_file, loff_t off, + struct file *dst_file, loff_t destoff, loff_t len, unsigned int remap_flags) { struct inode *src_inode = file_inode(src_file); @@ -1029,7 +1029,7 @@ static int cifs_remap_file_range(struct file *src_file, loff_t off, unlock_two_nondirectories(src_inode, target_inode); out: free_xid(xid); - return rc; + return rc < 0 ? rc : len; } ssize_t cifs_file_copychunk_range(unsigned int xid, diff --git a/fs/ioctl.c b/fs/ioctl.c index 2005529af5608..72537b68c272d 100644 --- a/fs/ioctl.c +++ b/fs/ioctl.c @@ -223,6 +223,7 @@ static long ioctl_file_clone(struct file *dst_file, unsigned long srcfd, u64 off, u64 olen, u64 destoff) { struct fd src_file = fdget(srcfd); + loff_t cloned; int ret; if (!src_file.file) @@ -230,7 +231,14 @@ static long ioctl_file_clone(struct file *dst_file, unsigned long srcfd, ret = -EXDEV; if (src_file.file->f_path.mnt != dst_file->f_path.mnt) goto fdput; - ret = vfs_clone_file_range(src_file.file, off, dst_file, destoff, olen); + cloned = vfs_clone_file_range(src_file.file, off, dst_file, destoff, + olen); + if (cloned < 0) + ret = cloned; + else if (olen && cloned != olen) + ret = -EINVAL; + else + ret = 0; fdput: fdput(src_file); return ret; diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c index ae5780ce41dcf..46d691ba04bc8 100644 --- a/fs/nfs/nfs4file.c +++ b/fs/nfs/nfs4file.c @@ -180,8 +180,8 @@ static long nfs42_fallocate(struct file *filep, int mode, loff_t offset, loff_t return nfs42_proc_allocate(filep, offset, len); } -static int nfs42_remap_file_range(struct file *src_file, loff_t src_off, - struct file *dst_file, loff_t dst_off, u64 count, +static loff_t nfs42_remap_file_range(struct file *src_file, loff_t src_off, + struct file *dst_file, loff_t dst_off, loff_t count, unsigned int remap_flags) { struct inode *dst_inode = file_inode(dst_file); @@ -244,7 +244,7 @@ out_unlock: inode_unlock(src_inode); } out: - return ret; + return ret < 0 ? ret : count; } #endif /* CONFIG_NFS_V4_2 */ diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index b53e76391e525..ac6cb6101cbec 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -541,8 +541,12 @@ __be32 nfsd4_set_nfs4_label(struct svc_rqst *rqstp, struct svc_fh *fhp, __be32 nfsd4_clone_file_range(struct file *src, u64 src_pos, struct file *dst, u64 dst_pos, u64 count) { - return nfserrno(vfs_clone_file_range(src, src_pos, dst, dst_pos, - count)); + loff_t cloned; + + cloned = vfs_clone_file_range(src, src_pos, dst, dst_pos, count); + if (count && cloned != count) + cloned = -EINVAL; + return nfserrno(cloned < 0 ? cloned : 0); } ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst, diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c index 9809b0e5746f5..fbaeafe44b5fa 100644 --- a/fs/ocfs2/file.c +++ b/fs/ocfs2/file.c @@ -2527,18 +2527,18 @@ out: return offset; } -static int ocfs2_remap_file_range(struct file *file_in, - loff_t pos_in, - struct file *file_out, - loff_t pos_out, - u64 len, - unsigned int remap_flags) +static loff_t ocfs2_remap_file_range(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, + loff_t len, unsigned int remap_flags) { + int ret; + if (remap_flags & ~(REMAP_FILE_DEDUP | REMAP_FILE_ADVISORY)) return -EINVAL; - return ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out, - len, remap_flags); + ret = ocfs2_reflink_remap_range(file_in, pos_in, file_out, pos_out, + len, remap_flags); + return ret < 0 ? ret : len; } const struct inode_operations ocfs2_file_iops = { diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c index df9781567ec05..6a42c04ac0ab0 100644 --- a/fs/ocfs2/refcounttree.c +++ b/fs/ocfs2/refcounttree.c @@ -4824,7 +4824,7 @@ int ocfs2_reflink_remap_range(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, - u64 len, + loff_t len, unsigned int remap_flags) { struct inode *inode_in = file_inode(file_in); diff --git a/fs/ocfs2/refcounttree.h b/fs/ocfs2/refcounttree.h index d2c5f526edff4..eb65c1d0843c2 100644 --- a/fs/ocfs2/refcounttree.h +++ b/fs/ocfs2/refcounttree.h @@ -119,7 +119,7 @@ int ocfs2_reflink_remap_range(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, - u64 len, + loff_t len, unsigned int remap_flags); #endif /* OCFS2_REFCOUNTTREE_H */ diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c index 1cc797a08a5b5..8750b72355169 100644 --- a/fs/overlayfs/copy_up.c +++ b/fs/overlayfs/copy_up.c @@ -125,6 +125,7 @@ static int ovl_copy_up_data(struct path *old, struct path *new, loff_t len) struct file *new_file; loff_t old_pos = 0; loff_t new_pos = 0; + loff_t cloned; int error = 0; if (len == 0) @@ -141,11 +142,10 @@ static int ovl_copy_up_data(struct path *old, struct path *new, loff_t len) } /* Try to use clone_file_range to clone up within the same fs */ - error = do_clone_file_range(old_file, 0, new_file, 0, len); - if (!error) + cloned = do_clone_file_range(old_file, 0, new_file, 0, len); + if (cloned == len) goto out; /* Couldn't clone, so now we try to copy the data */ - error = 0; /* FIXME: copy up sparse files efficiently */ while (len) { diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c index fffb36fd5920a..6c3fec6168e9a 100644 --- a/fs/overlayfs/file.c +++ b/fs/overlayfs/file.c @@ -434,14 +434,14 @@ enum ovl_copyop { OVL_DEDUPE, }; -static ssize_t ovl_copyfile(struct file *file_in, loff_t pos_in, +static loff_t ovl_copyfile(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, - u64 len, unsigned int flags, enum ovl_copyop op) + loff_t len, unsigned int flags, enum ovl_copyop op) { struct inode *inode_out = file_inode(file_out); struct fd real_in, real_out; const struct cred *old_cred; - ssize_t ret; + loff_t ret; ret = ovl_real_fdget(file_out, &real_out); if (ret) @@ -489,9 +489,9 @@ static ssize_t ovl_copy_file_range(struct file *file_in, loff_t pos_in, OVL_COPY); } -static int ovl_remap_file_range(struct file *file_in, loff_t pos_in, - struct file *file_out, loff_t pos_out, - u64 len, unsigned int remap_flags) +static loff_t ovl_remap_file_range(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, + loff_t len, unsigned int remap_flags) { enum ovl_copyop op; diff --git a/fs/read_write.c b/fs/read_write.c index b61bd3fc71545..356641afa4874 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1589,10 +1589,13 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, * more efficient if both clone and copy are supported (e.g. NFS). */ if (file_in->f_op->remap_file_range) { - ret = file_in->f_op->remap_file_range(file_in, pos_in, - file_out, pos_out, len, 0); - if (ret == 0) { - ret = len; + loff_t cloned; + + cloned = file_in->f_op->remap_file_range(file_in, pos_in, + file_out, pos_out, + min_t(loff_t, MAX_RW_COUNT, len), 0); + if (cloned > 0) { + ret = cloned; goto done; } } @@ -1686,11 +1689,12 @@ out2: return ret; } -static int remap_verify_area(struct file *file, loff_t pos, u64 len, bool write) +static int remap_verify_area(struct file *file, loff_t pos, loff_t len, + bool write) { struct inode *inode = file_inode(file); - if (unlikely(pos < 0)) + if (unlikely(pos < 0 || len < 0)) return -EINVAL; if (unlikely((loff_t) (pos + len) < 0)) @@ -1721,7 +1725,7 @@ static int remap_verify_area(struct file *file, loff_t pos, u64 len, bool write) static int generic_remap_check_len(struct inode *inode_in, struct inode *inode_out, loff_t pos_out, - u64 *len, + loff_t *len, unsigned int remap_flags) { u64 blkmask = i_blocksize(inode_in) - 1; @@ -1747,7 +1751,7 @@ static int generic_remap_check_len(struct inode *inode_in, */ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, - u64 *len, unsigned int remap_flags) + loff_t *len, unsigned int remap_flags) { struct inode *inode_in = file_inode(file_in); struct inode *inode_out = file_inode(file_out); @@ -1843,12 +1847,12 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in, } EXPORT_SYMBOL(generic_remap_file_range_prep); -int do_clone_file_range(struct file *file_in, loff_t pos_in, - struct file *file_out, loff_t pos_out, u64 len) +loff_t do_clone_file_range(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, loff_t len) { struct inode *inode_in = file_inode(file_in); struct inode *inode_out = file_inode(file_out); - int ret; + loff_t ret; if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode)) return -EISDIR; @@ -1881,19 +1885,19 @@ int do_clone_file_range(struct file *file_in, loff_t pos_in, ret = file_in->f_op->remap_file_range(file_in, pos_in, file_out, pos_out, len, 0); - if (!ret) { - fsnotify_access(file_in); - fsnotify_modify(file_out); - } + if (ret < 0) + return ret; + fsnotify_access(file_in); + fsnotify_modify(file_out); return ret; } EXPORT_SYMBOL(do_clone_file_range); -int vfs_clone_file_range(struct file *file_in, loff_t pos_in, - struct file *file_out, loff_t pos_out, u64 len) +loff_t vfs_clone_file_range(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, loff_t len) { - int ret; + loff_t ret; file_start_write(file_out); ret = do_clone_file_range(file_in, pos_in, file_out, pos_out, len); @@ -1999,10 +2003,11 @@ out_error: } EXPORT_SYMBOL(vfs_dedupe_file_range_compare); -int vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos, - struct file *dst_file, loff_t dst_pos, u64 len) +loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos, + struct file *dst_file, loff_t dst_pos, + loff_t len) { - s64 ret; + loff_t ret; ret = mnt_want_write_file(dst_file); if (ret) @@ -2051,7 +2056,7 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same) int i; int ret; u16 count = same->dest_count; - int deduped; + loff_t deduped; if (!(file->f_mode & FMODE_READ)) return -EINVAL; diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 20314eb4677a0..38fde4e117148 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -919,20 +919,23 @@ out_unlock: return error; } -STATIC int +STATIC loff_t xfs_file_remap_range( struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, - u64 len, + loff_t len, unsigned int remap_flags) { + int ret; + if (remap_flags & ~(REMAP_FILE_DEDUP | REMAP_FILE_ADVISORY)) return -EINVAL; - return xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out, + ret = xfs_reflink_remap_range(file_in, pos_in, file_out, pos_out, len, remap_flags); + return ret < 0 ? ret : len; } STATIC int diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c index 2d7dd8b28d7c2..3dbe5fb7e9c06 100644 --- a/fs/xfs/xfs_reflink.c +++ b/fs/xfs/xfs_reflink.c @@ -1296,7 +1296,7 @@ xfs_reflink_remap_prep( loff_t pos_in, struct file *file_out, loff_t pos_out, - u64 *len, + loff_t *len, unsigned int remap_flags) { struct inode *inode_in = file_inode(file_in); @@ -1387,7 +1387,7 @@ xfs_reflink_remap_range( loff_t pos_in, struct file *file_out, loff_t pos_out, - u64 len, + loff_t len, unsigned int remap_flags) { struct inode *inode_in = file_inode(file_in); diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h index 6f82d628bf175..c3c46c276fe18 100644 --- a/fs/xfs/xfs_reflink.h +++ b/fs/xfs/xfs_reflink.h @@ -28,7 +28,7 @@ extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset, xfs_off_t count); extern int xfs_reflink_recover_cow(struct xfs_mount *mp); extern int xfs_reflink_remap_range(struct file *file_in, loff_t pos_in, - struct file *file_out, loff_t pos_out, u64 len, + struct file *file_out, loff_t pos_out, loff_t len, unsigned int remap_flags); extern int xfs_reflink_inode_has_shared_extents(struct xfs_trans *tp, struct xfs_inode *ip, bool *has_shared); diff --git a/include/linux/fs.h b/include/linux/fs.h index c5435ca811329..c72d8c3c065a8 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1777,9 +1777,9 @@ struct file_operations { #endif ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int); - int (*remap_file_range)(struct file *file_in, loff_t pos_in, - struct file *file_out, loff_t pos_out, - u64 len, unsigned int remap_flags); + loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, + loff_t len, unsigned int remap_flags); int (*fadvise)(struct file *, loff_t, loff_t, int); } __randomize_layout; @@ -1844,19 +1844,22 @@ extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *, loff_t, size_t, unsigned int); extern int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, - u64 *count, unsigned int remap_flags); -extern int do_clone_file_range(struct file *file_in, loff_t pos_in, - struct file *file_out, loff_t pos_out, u64 len); -extern int vfs_clone_file_range(struct file *file_in, loff_t pos_in, - struct file *file_out, loff_t pos_out, u64 len); + loff_t *count, + unsigned int remap_flags); +extern loff_t do_clone_file_range(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, + loff_t len); +extern loff_t vfs_clone_file_range(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, + loff_t len); extern int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff, struct inode *dest, loff_t destoff, loff_t len, bool *is_same); extern int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same); -extern int vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos, - struct file *dst_file, loff_t dst_pos, - u64 len); +extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos, + struct file *dst_file, loff_t dst_pos, + loff_t len); struct super_operations { @@ -2986,7 +2989,7 @@ extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *); extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *); extern int generic_remap_checks(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, - uint64_t *count, unsigned int remap_flags); + loff_t *count, unsigned int remap_flags); extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *); extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *); extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *); diff --git a/mm/filemap.c b/mm/filemap.c index 410dc58f7b166..e9091d731f846 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2994,7 +2994,7 @@ EXPORT_SYMBOL(generic_write_checks); */ int generic_remap_checks(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, - uint64_t *req_count, unsigned int remap_flags) + loff_t *req_count, unsigned int remap_flags) { struct inode *inode_in = file_in->f_mapping->host; struct inode *inode_out = file_out->f_mapping->host; -- cgit v1.2.3 From eca3654e3cc7d93e9734d0fa96cfb15c7f356244 Mon Sep 17 00:00:00 2001 From: "Darrick J. Wong" Date: Tue, 30 Oct 2018 10:42:10 +1100 Subject: vfs: enable remap callers that can handle short operations Plumb in a remap flag that enables the filesystem remap handler to shorten remapping requests for callers that can handle it. Now copy_file_range can report partial success (in case we run up against alignment problems, resource limits, etc.). We also enable CAN_SHORTEN for fideduperange to maintain existing userspace-visible behavior where xfs/btrfs shorten the dedupe range to avoid stale post-eof data exposure. Signed-off-by: Darrick J. Wong Reviewed-by: Amir Goldstein Signed-off-by: Dave Chinner --- Documentation/filesystems/vfs.txt | 4 +++- fs/read_write.c | 28 ++++++++++++++++++++-------- include/linux/fs.h | 5 +++-- mm/filemap.c | 11 +++++++---- 4 files changed, 33 insertions(+), 15 deletions(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index 1bd2919deaca0..5f71a252e2e0f 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -970,7 +970,9 @@ otherwise noted. negative error code if errors occurred before any bytes were remapped. The remap_flags parameter accepts REMAP_FILE_* flags. If REMAP_FILE_DEDUP is set then the implementation must only remap if the - requested file ranges have identical contents. + requested file ranges have identical contents. If REMAP_CAN_SHORTEN is + set, the caller is ok with the implementation shortening the request + length to satisfy alignment or EOF requirements (or any other reason). fadvise: possibly called by the fadvise64() system call. diff --git a/fs/read_write.c b/fs/read_write.c index ea30666013b0d..c0bcc1a206502 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1593,7 +1593,8 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, cloned = file_in->f_op->remap_file_range(file_in, pos_in, file_out, pos_out, - min_t(loff_t, MAX_RW_COUNT, len), 0); + min_t(loff_t, MAX_RW_COUNT, len), + REMAP_FILE_CAN_SHORTEN); if (cloned > 0) { ret = cloned; goto done; @@ -1721,6 +1722,8 @@ static int remap_verify_area(struct file *file, loff_t pos, loff_t len, * can't meaningfully compare post-EOF contents. * * For clone we only link a partial EOF block above the destination file's EOF. + * + * Shorten the request if possible. */ static int generic_remap_check_len(struct inode *inode_in, struct inode *inode_out, @@ -1729,16 +1732,24 @@ static int generic_remap_check_len(struct inode *inode_in, unsigned int remap_flags) { u64 blkmask = i_blocksize(inode_in) - 1; + loff_t new_len = *len; if ((*len & blkmask) == 0) return 0; - if (remap_flags & REMAP_FILE_DEDUP) - *len &= ~blkmask; - else if (pos_out + *len < i_size_read(inode_out)) - return -EINVAL; + if ((remap_flags & REMAP_FILE_DEDUP) || + pos_out + *len < i_size_read(inode_out)) + new_len &= ~blkmask; - return 0; + if (new_len == *len) + return 0; + + if (remap_flags & REMAP_FILE_CAN_SHORTEN) { + *len = new_len; + return 0; + } + + return (remap_flags & REMAP_FILE_DEDUP) ? -EBADE : -EINVAL; } /* @@ -2014,7 +2025,8 @@ loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos, { loff_t ret; - WARN_ON_ONCE(remap_flags & ~(REMAP_FILE_DEDUP)); + WARN_ON_ONCE(remap_flags & ~(REMAP_FILE_DEDUP | + REMAP_FILE_CAN_SHORTEN)); ret = mnt_want_write_file(dst_file); if (ret) @@ -2115,7 +2127,7 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same) deduped = vfs_dedupe_file_range_one(file, off, dst_file, info->dest_offset, len, - 0); + REMAP_FILE_CAN_SHORTEN); if (deduped == -EBADE) info->status = FILE_DEDUPE_RANGE_DIFFERS; else if (deduped < 0) diff --git a/include/linux/fs.h b/include/linux/fs.h index 544ab5083b488..34c22d6950113 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1727,8 +1727,10 @@ struct block_device_operations; * See Documentation/filesystems/vfs.txt for more details about this call. * * REMAP_FILE_DEDUP: only remap if contents identical (i.e. deduplicate) + * REMAP_FILE_CAN_SHORTEN: caller can handle a shortened request */ #define REMAP_FILE_DEDUP (1 << 0) +#define REMAP_FILE_CAN_SHORTEN (1 << 1) /* * These flags signal that the caller is ok with altering various aspects of @@ -1736,9 +1738,8 @@ struct block_device_operations; * implementation; the vfs remap helper functions can take advantage of them. * Flags in this category exist to preserve the quirky behavior of the hoisted * btrfs clone/dedupe ioctls. - * There are no flags yet, but subsequent commits will add some. */ -#define REMAP_FILE_ADVISORY (0) +#define REMAP_FILE_ADVISORY (REMAP_FILE_CAN_SHORTEN) struct iov_iter; diff --git a/mm/filemap.c b/mm/filemap.c index e9091d731f846..1775d4ad3317c 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -3045,8 +3045,7 @@ int generic_remap_checks(struct file *file_in, loff_t pos_in, bcount = ALIGN(size_in, bs) - pos_in; } else { if (!IS_ALIGNED(count, bs)) - return -EINVAL; - + count = ALIGN_DOWN(count, bs); bcount = count; } @@ -3056,10 +3055,14 @@ int generic_remap_checks(struct file *file_in, loff_t pos_in, pos_out < pos_in + bcount) return -EINVAL; - /* For now we don't support changing the length. */ - if (*req_count != count) + /* + * We shortened the request but the caller can't deal with that, so + * bounce the request back to userspace. + */ + if (*req_count != count && !(remap_flags & REMAP_FILE_CAN_SHORTEN)) return -EINVAL; + *req_count = count; return 0; } -- cgit v1.2.3 From e2d00e62f24b907d49b83802ae91a6fa6254a48c Mon Sep 17 00:00:00 2001 From: Lorenzo Colitti Date: Mon, 29 Oct 2018 09:30:29 +0900 Subject: Documentation: ip-sysctl.txt: Document tcp_fwmark_accept This patch documents the tcp_fwmark_accept sysctl that was added in 3.15. Signed-off-by: Lorenzo Colitti Signed-off-by: David S. Miller --- Documentation/networking/ip-sysctl.txt | 11 +++++++++++ 1 file changed, 11 insertions(+) (limited to 'Documentation') diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 163b5ff1073cd..32b21571adfeb 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -316,6 +316,17 @@ tcp_frto - INTEGER By default it's enabled with a non-zero value. 0 disables F-RTO. +tcp_fwmark_accept - BOOLEAN + If set, incoming connections to listening sockets that do not have a + socket mark will set the mark of the accepting socket to the fwmark of + the incoming SYN packet. This will cause all packets on that connection + (starting from the first SYNACK) to be sent with that fwmark. The + listening socket's mark is unchanged. Listening sockets that already + have a fwmark set via setsockopt(SOL_SOCKET, SO_MARK, ...) are + unaffected. + + Default: 0 + tcp_invalid_ratelimit - INTEGER Limit the maximal rate for sending duplicate acknowledgments in response to incoming TCP packets that are for an existing -- cgit v1.2.3 From 204c881e96e435606451e8a167cdb5a12fafd32a Mon Sep 17 00:00:00 2001 From: Viresh Kumar Date: Mon, 29 Oct 2018 15:13:05 +0530 Subject: dt-bindings: arm: Explain capacities-dmips-mhz calculations in example The example contains two values for the capacity currently, 446 in text and 578 in code. The numbers are all correct but can confuse some of the readers. This patch tries to explain how the numbers are calculated to avoid same confusion going forward. Acked-by: Daniel Lezcano Signed-off-by: Viresh Kumar Signed-off-by: Rob Herring --- Documentation/devicetree/bindings/arm/cpu-capacity.txt | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/arm/cpu-capacity.txt b/Documentation/devicetree/bindings/arm/cpu-capacity.txt index 9b5685a1d15d9..84262cdb8d29a 100644 --- a/Documentation/devicetree/bindings/arm/cpu-capacity.txt +++ b/Documentation/devicetree/bindings/arm/cpu-capacity.txt @@ -59,9 +59,11 @@ mhz values (normalized w.r.t. the highest value found while parsing the DT). =========================================== Example 1 (ARM 64-bit, 6-cpu system, two clusters): -capacities-dmips-mhz are scaled w.r.t. 1024 (cpu@0 and cpu@1) -supposing cluster0@max-freq=1100 and custer1@max-freq=850, -final capacities are 1024 for cluster0 and 446 for cluster1 +The capacities-dmips-mhz or DMIPS/MHz values (scaled to 1024) +are 1024 and 578 for cluster0 and cluster1. Further normalization +is done by the operating system based on cluster0@max-freq=1100 and +custer1@max-freq=850, final capacities are 1024 for cluster0 and +446 for cluster1 (576*850/1100). cpus { #address-cells = <2>; -- cgit v1.2.3 From 69819c7fc836e17a870a23a54b885a44afc89d85 Mon Sep 17 00:00:00 2001 From: "A.s. Dong" Date: Tue, 30 Oct 2018 15:28:11 +0000 Subject: dt-bindings: i2c: i2c-imx-lpi2c: add imx8qxp compatible string Add imx8qxp compatible string Reviewed-by: Rob Herring Signed-off-by: Dong Aisheng Signed-off-by: Wolfram Sang --- Documentation/devicetree/bindings/i2c/i2c-imx-lpi2c.txt | 1 + 1 file changed, 1 insertion(+) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/i2c/i2c-imx-lpi2c.txt b/Documentation/devicetree/bindings/i2c/i2c-imx-lpi2c.txt index 091c8dfd32291..b245363d6d60a 100644 --- a/Documentation/devicetree/bindings/i2c/i2c-imx-lpi2c.txt +++ b/Documentation/devicetree/bindings/i2c/i2c-imx-lpi2c.txt @@ -3,6 +3,7 @@ Required properties: - compatible : - "fsl,imx7ulp-lpi2c" for LPI2C compatible with the one integrated on i.MX7ULP soc + - "fsl,imx8qxp-lpi2c" for LPI2C compatible with the one integrated on i.MX8QXP soc - reg : address and length of the lpi2c master registers - interrupts : lpi2c interrupt - clocks : lpi2c clock specifier -- cgit v1.2.3 From 0085b4191f3e30ec94beef4103679f2828d3c54f Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Tue, 30 Oct 2018 00:41:28 +0900 Subject: kconfig: remove silentoldconfig target As commit 911a91c39cab ("kconfig: rename silentoldconfig to syncconfig") announced, it is time for the removal. Signed-off-by: Masahiro Yamada Acked-by: Jeff Kirsher --- Documentation/networking/ice.rst | 2 +- scripts/kconfig/Makefile | 9 +-------- 2 files changed, 2 insertions(+), 9 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/ice.rst b/Documentation/networking/ice.rst index 1e4948c9e9897..4d118b827bbb7 100644 --- a/Documentation/networking/ice.rst +++ b/Documentation/networking/ice.rst @@ -20,7 +20,7 @@ Enabling the driver The driver is enabled via the standard kernel configuration system, using the make command:: - make oldconfig/silentoldconfig/menuconfig/etc. + make oldconfig/menuconfig/etc. The driver is located in the menu structure at: diff --git a/scripts/kconfig/Makefile b/scripts/kconfig/Makefile index 5d37a604ecc84..63b609243d037 100644 --- a/scripts/kconfig/Makefile +++ b/scripts/kconfig/Makefile @@ -68,14 +68,7 @@ PHONY += $(simple-targets) $(simple-targets): $(obj)/conf $< $(silent) --$@ $(Kconfig) -PHONY += silentoldconfig savedefconfig defconfig - -# We do not expect manual invokcation of "silentoldcofig" (or "syncconfig"). -silentoldconfig: syncconfig - @echo " WARNING: \"silentoldconfig\" has been renamed to \"syncconfig\"" - @echo " and is now an internal implementation detail." - @echo " What you want is probably \"oldconfig\"." - @echo " \"silentoldconfig\" will be removed after Linux 4.19" +PHONY += savedefconfig defconfig savedefconfig: $(obj)/conf $< $(silent) --$@=defconfig $(Kconfig) -- cgit v1.2.3 From 3f80babd9ca477d19a027e0b25b95973cd7f7357 Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Tue, 30 Oct 2018 01:03:54 +0900 Subject: kbuild: remove unused cc-fullversion variable The last user of cc-fullversion was removed by commit f2910f0e6835 ("powerpc: remove old GCC version checks"). Signed-off-by: Masahiro Yamada --- Documentation/kbuild/makefiles.txt | 15 --------------- scripts/Kbuild.include | 4 ---- 2 files changed, 19 deletions(-) (limited to 'Documentation') diff --git a/Documentation/kbuild/makefiles.txt b/Documentation/kbuild/makefiles.txt index 7b6a2b2bdc98d..8da26c6dd886a 100644 --- a/Documentation/kbuild/makefiles.txt +++ b/Documentation/kbuild/makefiles.txt @@ -537,21 +537,6 @@ more details, with real examples. The third parameter may be a text as in this example, but it may also be an expanded variable or a macro. - cc-fullversion - cc-fullversion is useful when the exact version of gcc is needed. - One typical use-case is when a specific GCC version is broken. - cc-fullversion points out a more specific version than cc-version does. - - Example: - #arch/powerpc/Makefile - $(Q)if test "$(cc-fullversion)" = "040200" ; then \ - echo -n '*** GCC-4.2.0 cannot compile the 64-bit powerpc ' ; \ - false ; \ - fi - - In this example for a specific GCC version the build will error out - explaining to the user why it stops. - cc-cross-prefix cc-cross-prefix is used to check if there exists a $(CC) in path with one of the listed prefixes. The first prefix where there exist a diff --git a/scripts/Kbuild.include b/scripts/Kbuild.include index ca21a35fa244e..19a63db62fb9f 100644 --- a/scripts/Kbuild.include +++ b/scripts/Kbuild.include @@ -147,10 +147,6 @@ cc-name = $(shell $(CC) -v 2>&1 | grep -q "clang version" && echo clang || echo # cc-version cc-version = $(shell $(CONFIG_SHELL) $(srctree)/scripts/gcc-version.sh $(CC)) -# cc-fullversion -cc-fullversion = $(shell $(CONFIG_SHELL) \ - $(srctree)/scripts/gcc-version.sh -p $(CC)) - # cc-ifversion # Usage: EXTRA_CFLAGS += $(call cc-ifversion, -lt, 0402, -O1) cc-ifversion = $(shell [ $(cc-version) $(1) $(2) ] && echo $(3) || echo $(4)) -- cgit v1.2.3 From d47748e5ae5af6572e520cc9767bbe70c22ea498 Mon Sep 17 00:00:00 2001 From: Miklos Szeredi Date: Thu, 1 Nov 2018 21:31:39 +0100 Subject: ovl: automatically enable redirect_dir on metacopy=on Current behavior is to automatically disable metacopy if redirect_dir is not enabled and proceed with the mount. If "metacopy=on" mount option was given, then this behavior can confuse the user: no mount failure, yet metacopy is disabled. This patch makes metacopy=on imply redirect_dir=on. The converse is also true: turning off full redirect with redirect_dir= {off|follow|nofollow} will disable metacopy. If both metacopy=on and redirect_dir={off|follow|nofollow} is specified, then mount will fail, since there's no way to correctly resolve the conflict. Reported-by: Daniel Walsh Fixes: d5791044d2e5 ("ovl: Provide a mount option metacopy=on/off...") Cc: # v4.19 Signed-off-by: Miklos Szeredi --- Documentation/filesystems/overlayfs.txt | 6 ++++++ fs/overlayfs/super.c | 36 ++++++++++++++++++++++++++------- 2 files changed, 35 insertions(+), 7 deletions(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/overlayfs.txt b/Documentation/filesystems/overlayfs.txt index 51c136c821bfb..eef7d9d259e85 100644 --- a/Documentation/filesystems/overlayfs.txt +++ b/Documentation/filesystems/overlayfs.txt @@ -286,6 +286,12 @@ pointed by REDIRECT. This should not be possible on local system as setting "trusted." xattrs will require CAP_SYS_ADMIN. But it should be possible for untrusted layers like from a pen drive. +Note: redirect_dir={off|nofollow|follow(*)} conflicts with metacopy=on, and +results in an error. + +(*) redirect_dir=follow only conflicts with metacopy=on if upperdir=... is +given. + Sharing and copying layers -------------------------- diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c index 22ffb23ea44d9..0116735cc3214 100644 --- a/fs/overlayfs/super.c +++ b/fs/overlayfs/super.c @@ -472,6 +472,7 @@ static int ovl_parse_opt(char *opt, struct ovl_config *config) { char *p; int err; + bool metacopy_opt = false, redirect_opt = false; config->redirect_mode = kstrdup(ovl_redirect_mode_def(), GFP_KERNEL); if (!config->redirect_mode) @@ -516,6 +517,7 @@ static int ovl_parse_opt(char *opt, struct ovl_config *config) config->redirect_mode = match_strdup(&args[0]); if (!config->redirect_mode) return -ENOMEM; + redirect_opt = true; break; case OPT_INDEX_ON: @@ -548,6 +550,7 @@ static int ovl_parse_opt(char *opt, struct ovl_config *config) case OPT_METACOPY_ON: config->metacopy = true; + metacopy_opt = true; break; case OPT_METACOPY_OFF: @@ -572,13 +575,32 @@ static int ovl_parse_opt(char *opt, struct ovl_config *config) if (err) return err; - /* metacopy feature with upper requires redirect_dir=on */ - if (config->upperdir && config->metacopy && !config->redirect_dir) { - pr_warn("overlayfs: metadata only copy up requires \"redirect_dir=on\", falling back to metacopy=off.\n"); - config->metacopy = false; - } else if (config->metacopy && !config->redirect_follow) { - pr_warn("overlayfs: metadata only copy up requires \"redirect_dir=follow\" on non-upper mount, falling back to metacopy=off.\n"); - config->metacopy = false; + /* + * This is to make the logic below simpler. It doesn't make any other + * difference, since config->redirect_dir is only used for upper. + */ + if (!config->upperdir && config->redirect_follow) + config->redirect_dir = true; + + /* Resolve metacopy -> redirect_dir dependency */ + if (config->metacopy && !config->redirect_dir) { + if (metacopy_opt && redirect_opt) { + pr_err("overlayfs: conflicting options: metacopy=on,redirect_dir=%s\n", + config->redirect_mode); + return -EINVAL; + } + if (redirect_opt) { + /* + * There was an explicit redirect_dir=... that resulted + * in this conflict. + */ + pr_info("overlayfs: disabling metacopy due to redirect_dir=%s\n", + config->redirect_mode); + config->metacopy = false; + } else { + /* Automatically enable redirect otherwise. */ + config->redirect_follow = config->redirect_dir = true; + } } return 0; -- cgit v1.2.3 From b5f2954d30c77649bce9c27e7a0a94299d9cfdf8 Mon Sep 17 00:00:00 2001 From: Dennis Zhou Date: Thu, 1 Nov 2018 17:24:10 -0400 Subject: blkcg: revert blkcg cleanups series This reverts a series committed earlier due to null pointer exception bug report in [1]. It seems there are edge case interactions that I did not consider and will need some time to understand what causes the adverse interactions. The original series can be found in [2] with a follow up series in [3]. [1] https://www.spinics.net/lists/cgroups/msg20719.html [2] https://lore.kernel.org/lkml/20180911184137.35897-1-dennisszhou@gmail.com/ [3] https://lore.kernel.org/lkml/20181020185612.51587-1-dennis@kernel.org/ This reverts the following commits: d459d853c2ed, b2c3fa546705, 101246ec02b5, b3b9f24f5fcc, e2b0989954ae, f0fcb3ec89f3, c839e7a03f92, bdc2491708c4, 74b7c02a9bc1, 5bf9a1f3b4ef, a7b39b4e961c, 07b05bcc3213, 49f4c2dc2b50, 27e6fa996c53 Signed-off-by: Dennis Zhou Signed-off-by: Jens Axboe --- Documentation/admin-guide/cgroup-v2.rst | 8 +- block/bfq-cgroup.c | 4 +- block/bfq-iosched.c | 2 +- block/bio.c | 174 +++++++++----------------------- block/blk-cgroup.c | 123 +++++++--------------- block/blk-core.c | 1 - block/blk-iolatency.c | 26 ++++- block/blk-throttle.c | 13 ++- block/bounce.c | 4 +- block/cfq-iosched.c | 4 +- drivers/block/loop.c | 5 +- drivers/md/raid0.c | 2 +- fs/buffer.c | 10 +- fs/ext4/page-io.c | 2 +- include/linux/bio.h | 26 ++--- include/linux/blk-cgroup.h | 145 +++++++++----------------- include/linux/blk_types.h | 1 + include/linux/cgroup.h | 2 - include/linux/writeback.h | 5 +- kernel/cgroup/cgroup.c | 48 ++------- kernel/trace/blktrace.c | 4 +- mm/page_io.c | 2 +- 22 files changed, 208 insertions(+), 403 deletions(-) (limited to 'Documentation') diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index caf36105a1c7b..184193bcb262a 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1857,10 +1857,8 @@ following two functions. wbc_init_bio(@wbc, @bio) Should be called for each bio carrying writeback data and - associates the bio with the inode's owner cgroup and the - corresponding request queue. This must be called after - a queue (device) has been associated with the bio and - before submission. + associates the bio with the inode's owner cgroup. Can be + called anytime between bio allocation and submission. wbc_account_io(@wbc, @page, @bytes) Should be called for each data segment being written out. @@ -1879,7 +1877,7 @@ the configuration, the bio may be executed at a lower priority and if the writeback session is holding shared resources, e.g. a journal entry, may lead to priority inversion. There is no one easy solution for the problem. Filesystems can try to work around specific problem -cases by skipping wbc_init_bio() or using bio_associate_create_blkg() +cases by skipping wbc_init_bio() or using bio_associate_blkcg() directly. diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c index d9a7916ff0ab6..9fe5952d117d5 100644 --- a/block/bfq-cgroup.c +++ b/block/bfq-cgroup.c @@ -642,7 +642,7 @@ void bfq_bic_update_cgroup(struct bfq_io_cq *bic, struct bio *bio) uint64_t serial_nr; rcu_read_lock(); - serial_nr = __bio_blkcg(bio)->css.serial_nr; + serial_nr = bio_blkcg(bio)->css.serial_nr; /* * Check whether blkcg has changed. The condition may trigger @@ -651,7 +651,7 @@ void bfq_bic_update_cgroup(struct bfq_io_cq *bic, struct bio *bio) if (unlikely(!bfqd) || likely(bic->blkcg_serial_nr == serial_nr)) goto out; - bfqg = __bfq_bic_change_cgroup(bfqd, bic, __bio_blkcg(bio)); + bfqg = __bfq_bic_change_cgroup(bfqd, bic, bio_blkcg(bio)); /* * Update blkg_path for bfq_log_* functions. We cache this * path, and update it here, for the following diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c index 6075100f03a50..3a27d31fcda60 100644 --- a/block/bfq-iosched.c +++ b/block/bfq-iosched.c @@ -4384,7 +4384,7 @@ static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd, rcu_read_lock(); - bfqg = bfq_find_set_group(bfqd, __bio_blkcg(bio)); + bfqg = bfq_find_set_group(bfqd, bio_blkcg(bio)); if (!bfqg) { bfqq = &bfqd->oom_bfqq; goto out; diff --git a/block/bio.c b/block/bio.c index bbfeb4ee2892f..4a5a036268fb8 100644 --- a/block/bio.c +++ b/block/bio.c @@ -609,9 +609,7 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src) bio->bi_iter = bio_src->bi_iter; bio->bi_io_vec = bio_src->bi_io_vec; - bio_clone_blkg_association(bio, bio_src); - - blkcg_bio_issue_init(bio); + bio_clone_blkcg_association(bio, bio_src); } EXPORT_SYMBOL(__bio_clone_fast); @@ -1956,151 +1954,69 @@ EXPORT_SYMBOL(bioset_init_from_src); #ifdef CONFIG_BLK_CGROUP -/** - * bio_associate_blkg - associate a bio with the a blkg - * @bio: target bio - * @blkg: the blkg to associate - * - * This tries to associate @bio with the specified blkg. Association failure - * is handled by walking up the blkg tree. Therefore, the blkg associated can - * be anything between @blkg and the root_blkg. This situation only happens - * when a cgroup is dying and then the remaining bios will spill to the closest - * alive blkg. - * - * A reference will be taken on the @blkg and will be released when @bio is - * freed. - */ -int bio_associate_blkg(struct bio *bio, struct blkcg_gq *blkg) -{ - if (unlikely(bio->bi_blkg)) - return -EBUSY; - bio->bi_blkg = blkg_tryget_closest(blkg); - return 0; -} - -/** - * __bio_associate_blkg_from_css - internal blkg association function - * - * This in the core association function that all association paths rely on. - * A blkg reference is taken which is released upon freeing of the bio. - */ -static int __bio_associate_blkg_from_css(struct bio *bio, - struct cgroup_subsys_state *css) -{ - struct request_queue *q = bio->bi_disk->queue; - struct blkcg_gq *blkg; - int ret; - - rcu_read_lock(); - - if (!css || !css->parent) - blkg = q->root_blkg; - else - blkg = blkg_lookup_create(css_to_blkcg(css), q); - - ret = bio_associate_blkg(bio, blkg); - - rcu_read_unlock(); - return ret; -} - -/** - * bio_associate_blkg_from_css - associate a bio with a specified css - * @bio: target bio - * @css: target css - * - * Associate @bio with the blkg found by combining the css's blkg and the - * request_queue of the @bio. This falls back to the queue's root_blkg if - * the association fails with the css. - */ -int bio_associate_blkg_from_css(struct bio *bio, - struct cgroup_subsys_state *css) -{ - if (unlikely(bio->bi_blkg)) - return -EBUSY; - return __bio_associate_blkg_from_css(bio, css); -} -EXPORT_SYMBOL_GPL(bio_associate_blkg_from_css); - #ifdef CONFIG_MEMCG /** - * bio_associate_blkg_from_page - associate a bio with the page's blkg + * bio_associate_blkcg_from_page - associate a bio with the page's blkcg * @bio: target bio * @page: the page to lookup the blkcg from * - * Associate @bio with the blkg from @page's owning memcg and the respective - * request_queue. If cgroup_e_css returns NULL, fall back to the queue's - * root_blkg. - * - * Note: this must be called after bio has an associated device. + * Associate @bio with the blkcg from @page's owning memcg. This works like + * every other associate function wrt references. */ -int bio_associate_blkg_from_page(struct bio *bio, struct page *page) +int bio_associate_blkcg_from_page(struct bio *bio, struct page *page) { - struct cgroup_subsys_state *css; - int ret; + struct cgroup_subsys_state *blkcg_css; - if (unlikely(bio->bi_blkg)) + if (unlikely(bio->bi_css)) return -EBUSY; if (!page->mem_cgroup) return 0; - - rcu_read_lock(); - - css = cgroup_e_css(page->mem_cgroup->css.cgroup, &io_cgrp_subsys); - - ret = __bio_associate_blkg_from_css(bio, css); - - rcu_read_unlock(); - return ret; + blkcg_css = cgroup_get_e_css(page->mem_cgroup->css.cgroup, + &io_cgrp_subsys); + bio->bi_css = blkcg_css; + return 0; } #endif /* CONFIG_MEMCG */ /** - * bio_associate_create_blkg - associate a bio with a blkg from q - * @q: request_queue where bio is going + * bio_associate_blkcg - associate a bio with the specified blkcg * @bio: target bio + * @blkcg_css: css of the blkcg to associate + * + * Associate @bio with the blkcg specified by @blkcg_css. Block layer will + * treat @bio as if it were issued by a task which belongs to the blkcg. * - * Associate @bio with the blkg found from the bio's css and the request_queue. - * If one is not found, bio_lookup_blkg creates the blkg. This falls back to - * the queue's root_blkg if association fails. + * This function takes an extra reference of @blkcg_css which will be put + * when @bio is released. The caller must own @bio and is responsible for + * synchronizing calls to this function. */ -int bio_associate_create_blkg(struct request_queue *q, struct bio *bio) +int bio_associate_blkcg(struct bio *bio, struct cgroup_subsys_state *blkcg_css) { - struct cgroup_subsys_state *css; - int ret = 0; - - /* someone has already associated this bio with a blkg */ - if (bio->bi_blkg) - return ret; - - rcu_read_lock(); - - css = blkcg_css(); - - ret = __bio_associate_blkg_from_css(bio, css); - - rcu_read_unlock(); - return ret; + if (unlikely(bio->bi_css)) + return -EBUSY; + css_get(blkcg_css); + bio->bi_css = blkcg_css; + return 0; } +EXPORT_SYMBOL_GPL(bio_associate_blkcg); /** - * bio_reassociate_blkg - reassociate a bio with a blkg from q - * @q: request_queue where bio is going + * bio_associate_blkg - associate a bio with the specified blkg * @bio: target bio + * @blkg: the blkg to associate * - * When submitting a bio, multiple recursive calls to make_request() may occur. - * This causes the initial associate done in blkcg_bio_issue_check() to be - * incorrect and reference the prior request_queue. This performs reassociation - * when this situation happens. + * Associate @bio with the blkg specified by @blkg. This is the queue specific + * blkcg information associated with the @bio, a reference will be taken on the + * @blkg and will be freed when the bio is freed. */ -int bio_reassociate_blkg(struct request_queue *q, struct bio *bio) +int bio_associate_blkg(struct bio *bio, struct blkcg_gq *blkg) { - if (bio->bi_blkg) { - blkg_put(bio->bi_blkg); - bio->bi_blkg = NULL; - } - - return bio_associate_create_blkg(q, bio); + if (unlikely(bio->bi_blkg)) + return -EBUSY; + if (!blkg_try_get(blkg)) + return -ENODEV; + bio->bi_blkg = blkg; + return 0; } /** @@ -2113,6 +2029,10 @@ void bio_disassociate_task(struct bio *bio) put_io_context(bio->bi_ioc); bio->bi_ioc = NULL; } + if (bio->bi_css) { + css_put(bio->bi_css); + bio->bi_css = NULL; + } if (bio->bi_blkg) { blkg_put(bio->bi_blkg); bio->bi_blkg = NULL; @@ -2120,16 +2040,16 @@ void bio_disassociate_task(struct bio *bio) } /** - * bio_clone_blkg_association - clone blkg association from src to dst bio + * bio_clone_blkcg_association - clone blkcg association from src to dst bio * @dst: destination bio * @src: source bio */ -void bio_clone_blkg_association(struct bio *dst, struct bio *src) +void bio_clone_blkcg_association(struct bio *dst, struct bio *src) { - if (src->bi_blkg) - bio_associate_blkg(dst, src->bi_blkg); + if (src->bi_css) + WARN_ON(bio_associate_blkcg(dst, src->bi_css)); } -EXPORT_SYMBOL_GPL(bio_clone_blkg_association); +EXPORT_SYMBOL_GPL(bio_clone_blkcg_association); #endif /* CONFIG_BLK_CGROUP */ static void __init biovec_init_slabs(void) diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c index 992da5592c6ed..c630e02836a80 100644 --- a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -84,37 +84,6 @@ static void blkg_free(struct blkcg_gq *blkg) kfree(blkg); } -static void __blkg_release(struct rcu_head *rcu) -{ - struct blkcg_gq *blkg = container_of(rcu, struct blkcg_gq, rcu_head); - - percpu_ref_exit(&blkg->refcnt); - - /* release the blkcg and parent blkg refs this blkg has been holding */ - css_put(&blkg->blkcg->css); - if (blkg->parent) - blkg_put(blkg->parent); - - wb_congested_put(blkg->wb_congested); - - blkg_free(blkg); -} - -/* - * A group is RCU protected, but having an rcu lock does not mean that one - * can access all the fields of blkg and assume these are valid. For - * example, don't try to follow throtl_data and request queue links. - * - * Having a reference to blkg under an rcu allows accesses to only values - * local to groups like group stats and group rate limits. - */ -static void blkg_release(struct percpu_ref *ref) -{ - struct blkcg_gq *blkg = container_of(ref, struct blkcg_gq, refcnt); - - call_rcu(&blkg->rcu_head, __blkg_release); -} - /** * blkg_alloc - allocate a blkg * @blkcg: block cgroup the new blkg is associated with @@ -141,6 +110,7 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q, blkg->q = q; INIT_LIST_HEAD(&blkg->q_node); blkg->blkcg = blkcg; + atomic_set(&blkg->refcnt, 1); /* root blkg uses @q->root_rl, init rl only for !root blkgs */ if (blkcg != &blkcg_root) { @@ -247,11 +217,6 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, blkg_get(blkg->parent); } - ret = percpu_ref_init(&blkg->refcnt, blkg_release, 0, - GFP_NOWAIT | __GFP_NOWARN); - if (ret) - goto err_cancel_ref; - /* invoke per-policy init */ for (i = 0; i < BLKCG_MAX_POLS; i++) { struct blkcg_policy *pol = blkcg_policy[i]; @@ -284,8 +249,6 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, blkg_put(blkg); return ERR_PTR(ret); -err_cancel_ref: - percpu_ref_exit(&blkg->refcnt); err_put_congested: wb_congested_put(wb_congested); err_put_css: @@ -296,7 +259,7 @@ err_free_blkg: } /** - * __blkg_lookup_create - lookup blkg, try to create one if not there + * blkg_lookup_create - lookup blkg, try to create one if not there * @blkcg: blkcg of interest * @q: request_queue of interest * @@ -305,11 +268,12 @@ err_free_blkg: * that all non-root blkg's have access to the parent blkg. This function * should be called under RCU read lock and @q->queue_lock. * - * Returns the blkg or the closest blkg if blkg_create fails as it walks - * down from root. + * Returns pointer to the looked up or created blkg on success, ERR_PTR() + * value on error. If @q is dead, returns ERR_PTR(-EINVAL). If @q is not + * dead and bypassing, returns ERR_PTR(-EBUSY). */ -struct blkcg_gq *__blkg_lookup_create(struct blkcg *blkcg, - struct request_queue *q) +struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg, + struct request_queue *q) { struct blkcg_gq *blkg; @@ -321,7 +285,7 @@ struct blkcg_gq *__blkg_lookup_create(struct blkcg *blkcg, * we shouldn't allow anything to go through for a bypassing queue. */ if (unlikely(blk_queue_bypass(q))) - return q->root_blkg; + return ERR_PTR(blk_queue_dying(q) ? -ENODEV : -EBUSY); blkg = __blkg_lookup(blkcg, q, true); if (blkg) @@ -329,58 +293,23 @@ struct blkcg_gq *__blkg_lookup_create(struct blkcg *blkcg, /* * Create blkgs walking down from blkcg_root to @blkcg, so that all - * non-root blkgs have access to their parents. Returns the closest - * blkg to the intended blkg should blkg_create() fail. + * non-root blkgs have access to their parents. */ while (true) { struct blkcg *pos = blkcg; struct blkcg *parent = blkcg_parent(blkcg); - struct blkcg_gq *ret_blkg = q->root_blkg; - - while (parent) { - blkg = __blkg_lookup(parent, q, false); - if (blkg) { - /* remember closest blkg */ - ret_blkg = blkg; - break; - } + + while (parent && !__blkg_lookup(parent, q, false)) { pos = parent; parent = blkcg_parent(parent); } blkg = blkg_create(pos, q, NULL); - if (IS_ERR(blkg)) - return ret_blkg; - if (pos == blkcg) + if (pos == blkcg || IS_ERR(blkg)) return blkg; } } -/** - * blkg_lookup_create - find or create a blkg - * @blkcg: target block cgroup - * @q: target request_queue - * - * This looks up or creates the blkg representing the unique pair - * of the blkcg and the request_queue. - */ -struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg, - struct request_queue *q) -{ - struct blkcg_gq *blkg = blkg_lookup(blkcg, q); - unsigned long flags; - - if (unlikely(!blkg)) { - spin_lock_irqsave(q->queue_lock, flags); - - blkg = __blkg_lookup_create(blkcg, q); - - spin_unlock_irqrestore(q->queue_lock, flags); - } - - return blkg; -} - static void blkg_destroy(struct blkcg_gq *blkg) { struct blkcg *blkcg = blkg->blkcg; @@ -424,7 +353,7 @@ static void blkg_destroy(struct blkcg_gq *blkg) * Put the reference taken at the time of creation so that when all * queues are gone, group can be destroyed. */ - percpu_ref_kill(&blkg->refcnt); + blkg_put(blkg); } /** @@ -451,6 +380,29 @@ static void blkg_destroy_all(struct request_queue *q) q->root_rl.blkg = NULL; } +/* + * A group is RCU protected, but having an rcu lock does not mean that one + * can access all the fields of blkg and assume these are valid. For + * example, don't try to follow throtl_data and request queue links. + * + * Having a reference to blkg under an rcu allows accesses to only values + * local to groups like group stats and group rate limits. + */ +void __blkg_release_rcu(struct rcu_head *rcu_head) +{ + struct blkcg_gq *blkg = container_of(rcu_head, struct blkcg_gq, rcu_head); + + /* release the blkcg and parent blkg refs this blkg has been holding */ + css_put(&blkg->blkcg->css); + if (blkg->parent) + blkg_put(blkg->parent); + + wb_congested_put(blkg->wb_congested); + + blkg_free(blkg); +} +EXPORT_SYMBOL_GPL(__blkg_release_rcu); + /* * The next function used by blk_queue_for_each_rl(). It's a bit tricky * because the root blkg uses @q->root_rl instead of its own rl. @@ -1796,7 +1748,8 @@ void blkcg_maybe_throttle_current(void) blkg = blkg_lookup(blkcg, q); if (!blkg) goto out; - if (!blkg_tryget(blkg)) + blkg = blkg_try_get(blkg); + if (!blkg) goto out; rcu_read_unlock(); diff --git a/block/blk-core.c b/block/blk-core.c index 26a5dac80ed98..ce12515f9b9b9 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2435,7 +2435,6 @@ blk_qc_t generic_make_request(struct bio *bio) if (q) blk_queue_exit(q); q = bio->bi_disk->queue; - bio_reassociate_blkg(q, bio); flags = 0; if (bio->bi_opf & REQ_NOWAIT) flags = BLK_MQ_REQ_NOWAIT; diff --git a/block/blk-iolatency.c b/block/blk-iolatency.c index 35c48d7b8f78e..bb240a0c1309f 100644 --- a/block/blk-iolatency.c +++ b/block/blk-iolatency.c @@ -480,12 +480,34 @@ static void blkcg_iolatency_throttle(struct rq_qos *rqos, struct bio *bio, spinlock_t *lock) { struct blk_iolatency *blkiolat = BLKIOLATENCY(rqos); - struct blkcg_gq *blkg = bio->bi_blkg; + struct blkcg *blkcg; + struct blkcg_gq *blkg; + struct request_queue *q = rqos->q; bool issue_as_root = bio_issue_as_root_blkg(bio); if (!blk_iolatency_enabled(blkiolat)) return; + rcu_read_lock(); + blkcg = bio_blkcg(bio); + bio_associate_blkcg(bio, &blkcg->css); + blkg = blkg_lookup(blkcg, q); + if (unlikely(!blkg)) { + if (!lock) + spin_lock_irq(q->queue_lock); + blkg = blkg_lookup_create(blkcg, q); + if (IS_ERR(blkg)) + blkg = NULL; + if (!lock) + spin_unlock_irq(q->queue_lock); + } + if (!blkg) + goto out; + + bio_issue_init(&bio->bi_issue, bio_sectors(bio)); + bio_associate_blkg(bio, blkg); +out: + rcu_read_unlock(); while (blkg && blkg->parent) { struct iolatency_grp *iolat = blkg_to_lat(blkg); if (!iolat) { @@ -706,7 +728,7 @@ static void blkiolatency_timer_fn(struct timer_list *t) * We could be exiting, don't access the pd unless we have a * ref on the blkg. */ - if (!blkg_tryget(blkg)) + if (!blkg_try_get(blkg)) continue; iolat = blkg_to_lat(blkg); diff --git a/block/blk-throttle.c b/block/blk-throttle.c index 4bda70e8db48a..db1a3a2ae0061 100644 --- a/block/blk-throttle.c +++ b/block/blk-throttle.c @@ -2115,11 +2115,21 @@ static inline void throtl_update_latency_buckets(struct throtl_data *td) } #endif +static void blk_throtl_assoc_bio(struct throtl_grp *tg, struct bio *bio) +{ +#ifdef CONFIG_BLK_DEV_THROTTLING_LOW + /* fallback to root_blkg if we fail to get a blkg ref */ + if (bio->bi_css && (bio_associate_blkg(bio, tg_to_blkg(tg)) == -ENODEV)) + bio_associate_blkg(bio, bio->bi_disk->queue->root_blkg); + bio_issue_init(&bio->bi_issue, bio_sectors(bio)); +#endif +} + bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg, struct bio *bio) { struct throtl_qnode *qn = NULL; - struct throtl_grp *tg = blkg_to_tg(blkg); + struct throtl_grp *tg = blkg_to_tg(blkg ?: q->root_blkg); struct throtl_service_queue *sq; bool rw = bio_data_dir(bio); bool throttled = false; @@ -2138,6 +2148,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg, if (unlikely(blk_queue_bypass(q))) goto out_unlock; + blk_throtl_assoc_bio(tg, bio); blk_throtl_update_idletime(tg); sq = &tg->service_queue; diff --git a/block/bounce.c b/block/bounce.c index ec0d99995f5f0..418677dcec60f 100644 --- a/block/bounce.c +++ b/block/bounce.c @@ -276,9 +276,7 @@ static struct bio *bounce_clone_bio(struct bio *bio_src, gfp_t gfp_mask, } } - bio_clone_blkg_association(bio, bio_src); - - blkcg_bio_issue_init(bio); + bio_clone_blkcg_association(bio, bio_src); return bio; } diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c index 6a3d87dd3c1ac..ed41aa978c4ab 100644 --- a/block/cfq-iosched.c +++ b/block/cfq-iosched.c @@ -3759,7 +3759,7 @@ static void check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio) uint64_t serial_nr; rcu_read_lock(); - serial_nr = __bio_blkcg(bio)->css.serial_nr; + serial_nr = bio_blkcg(bio)->css.serial_nr; rcu_read_unlock(); /* @@ -3824,7 +3824,7 @@ cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct cfq_io_cq *cic, struct cfq_group *cfqg; rcu_read_lock(); - cfqg = cfq_lookup_cfqg(cfqd, __bio_blkcg(bio)); + cfqg = cfq_lookup_cfqg(cfqd, bio_blkcg(bio)); if (!cfqg) { cfqq = &cfqd->oom_cfqq; goto out; diff --git a/drivers/block/loop.c b/drivers/block/loop.c index abad6d15f9563..ea9debf59b225 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -77,7 +77,6 @@ #include #include #include -#include #include "loop.h" @@ -1761,8 +1760,8 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx, /* always use the first bio's css */ #ifdef CONFIG_BLK_CGROUP - if (cmd->use_aio && rq->bio && rq->bio->bi_blkg) { - cmd->css = &bio_blkcg(rq->bio)->css; + if (cmd->use_aio && rq->bio && rq->bio->bi_css) { + cmd->css = rq->bio->bi_css; css_get(cmd->css); } else #endif diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c index f3fb5bb8c82a1..ac1cffd2a09b0 100644 --- a/drivers/md/raid0.c +++ b/drivers/md/raid0.c @@ -542,7 +542,7 @@ static void raid0_handle_discard(struct mddev *mddev, struct bio *bio) !discard_bio) continue; bio_chain(discard_bio, bio); - bio_clone_blkg_association(discard_bio, bio); + bio_clone_blkcg_association(discard_bio, bio); if (mddev->gendisk) trace_block_bio_remap(bdev_get_queue(rdev->bdev), discard_bio, disk_devt(mddev->gendisk), diff --git a/fs/buffer.c b/fs/buffer.c index 109f551968662..6f1ae3ac97896 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -3060,6 +3060,11 @@ static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh, */ bio = bio_alloc(GFP_NOIO, 1); + if (wbc) { + wbc_init_bio(wbc, bio); + wbc_account_io(wbc, bh->b_page, bh->b_size); + } + bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9); bio_set_dev(bio, bh->b_bdev); bio->bi_write_hint = write_hint; @@ -3079,11 +3084,6 @@ static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh, op_flags |= REQ_PRIO; bio_set_op_attrs(bio, op, op_flags); - if (wbc) { - wbc_init_bio(wbc, bio); - wbc_account_io(wbc, bh->b_page, bh->b_size); - } - submit_bio(bio); return 0; } diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c index 2aa62d58d8dd8..db7590178dfcf 100644 --- a/fs/ext4/page-io.c +++ b/fs/ext4/page-io.c @@ -374,13 +374,13 @@ static int io_submit_init_bio(struct ext4_io_submit *io, bio = bio_alloc(GFP_NOIO, BIO_MAX_PAGES); if (!bio) return -ENOMEM; + wbc_init_bio(io->io_wbc, bio); bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9); bio_set_dev(bio, bh->b_bdev); bio->bi_end_io = ext4_end_bio; bio->bi_private = ext4_get_io_end(io->io_end); io->io_bio = bio; io->io_next_block = bh->b_blocknr; - wbc_init_bio(io->io_wbc, bio); return 0; } diff --git a/include/linux/bio.h b/include/linux/bio.h index b47c7f716731f..056fb627edb3e 100644 --- a/include/linux/bio.h +++ b/include/linux/bio.h @@ -503,31 +503,23 @@ do { \ disk_devt((bio)->bi_disk) #if defined(CONFIG_MEMCG) && defined(CONFIG_BLK_CGROUP) -int bio_associate_blkg_from_page(struct bio *bio, struct page *page); +int bio_associate_blkcg_from_page(struct bio *bio, struct page *page); #else -static inline int bio_associate_blkg_from_page(struct bio *bio, - struct page *page) { return 0; } +static inline int bio_associate_blkcg_from_page(struct bio *bio, + struct page *page) { return 0; } #endif #ifdef CONFIG_BLK_CGROUP +int bio_associate_blkcg(struct bio *bio, struct cgroup_subsys_state *blkcg_css); int bio_associate_blkg(struct bio *bio, struct blkcg_gq *blkg); -int bio_associate_blkg_from_css(struct bio *bio, - struct cgroup_subsys_state *css); -int bio_associate_create_blkg(struct request_queue *q, struct bio *bio); -int bio_reassociate_blkg(struct request_queue *q, struct bio *bio); void bio_disassociate_task(struct bio *bio); -void bio_clone_blkg_association(struct bio *dst, struct bio *src); +void bio_clone_blkcg_association(struct bio *dst, struct bio *src); #else /* CONFIG_BLK_CGROUP */ -static inline int bio_associate_blkg_from_css(struct bio *bio, - struct cgroup_subsys_state *css) -{ return 0; } -static inline int bio_associate_create_blkg(struct request_queue *q, - struct bio *bio) { return 0; } -static inline int bio_reassociate_blkg(struct request_queue *q, struct bio *bio) -{ return 0; } +static inline int bio_associate_blkcg(struct bio *bio, + struct cgroup_subsys_state *blkcg_css) { return 0; } static inline void bio_disassociate_task(struct bio *bio) { } -static inline void bio_clone_blkg_association(struct bio *dst, - struct bio *src) { } +static inline void bio_clone_blkcg_association(struct bio *dst, + struct bio *src) { } #endif /* CONFIG_BLK_CGROUP */ #ifdef CONFIG_HIGHMEM diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h index 1e76ceebeb5dc..6d766a19f2bbb 100644 --- a/include/linux/blk-cgroup.h +++ b/include/linux/blk-cgroup.h @@ -126,7 +126,7 @@ struct blkcg_gq { struct request_list rl; /* reference count */ - struct percpu_ref refcnt; + atomic_t refcnt; /* is this blkg online? protected by both blkcg and q locks */ bool online; @@ -184,8 +184,6 @@ extern struct cgroup_subsys_state * const blkcg_root_css; struct blkcg_gq *blkg_lookup_slowpath(struct blkcg *blkcg, struct request_queue *q, bool update_hint); -struct blkcg_gq *__blkg_lookup_create(struct blkcg *blkcg, - struct request_queue *q); struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg, struct request_queue *q); int blkcg_init_queue(struct request_queue *q); @@ -232,59 +230,22 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol, char *input, struct blkg_conf_ctx *ctx); void blkg_conf_finish(struct blkg_conf_ctx *ctx); -/** - * blkcg_css - find the current css - * - * Find the css associated with either the kthread or the current task. - * This may return a dying css, so it is up to the caller to use tryget logic - * to confirm it is alive and well. - */ -static inline struct cgroup_subsys_state *blkcg_css(void) -{ - struct cgroup_subsys_state *css; - - css = kthread_blkcg(); - if (css) - return css; - return task_css(current, io_cgrp_id); -} static inline struct blkcg *css_to_blkcg(struct cgroup_subsys_state *css) { return css ? container_of(css, struct blkcg, css) : NULL; } -/** - * __bio_blkcg - internal version of bio_blkcg for bfq and cfq - * - * DO NOT USE. - * There is a flaw using this version of the function. In particular, this was - * used in a broken paradigm where association was called on the given css. It - * is possible though that the returned css from task_css() is in the process - * of dying due to migration of the current task. So it is improper to assume - * *_get() is going to succeed. Both BFQ and CFQ rely on this logic and will - * take additional work to handle more gracefully. - */ -static inline struct blkcg *__bio_blkcg(struct bio *bio) -{ - if (bio && bio->bi_blkg) - return bio->bi_blkg->blkcg; - return css_to_blkcg(blkcg_css()); -} - -/** - * bio_blkcg - grab the blkcg associated with a bio - * @bio: target bio - * - * This returns the blkcg associated with a bio, NULL if not associated. - * Callers are expected to either handle NULL or know association has been - * done prior to calling this. - */ static inline struct blkcg *bio_blkcg(struct bio *bio) { - if (bio && bio->bi_blkg) - return bio->bi_blkg->blkcg; - return NULL; + struct cgroup_subsys_state *css; + + if (bio && bio->bi_css) + return css_to_blkcg(bio->bi_css); + css = kthread_blkcg(); + if (css) + return css_to_blkcg(css); + return css_to_blkcg(task_css(current, io_cgrp_id)); } static inline bool blk_cgroup_congested(void) @@ -490,35 +451,26 @@ static inline int blkg_path(struct blkcg_gq *blkg, char *buf, int buflen) */ static inline void blkg_get(struct blkcg_gq *blkg) { - percpu_ref_get(&blkg->refcnt); + WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0); + atomic_inc(&blkg->refcnt); } /** - * blkg_tryget - try and get a blkg reference + * blkg_try_get - try and get a blkg reference * @blkg: blkg to get * * This is for use when doing an RCU lookup of the blkg. We may be in the midst * of freeing this blkg, so we can only use it if the refcnt is not zero. */ -static inline bool blkg_tryget(struct blkcg_gq *blkg) +static inline struct blkcg_gq *blkg_try_get(struct blkcg_gq *blkg) { - return percpu_ref_tryget(&blkg->refcnt); + if (atomic_inc_not_zero(&blkg->refcnt)) + return blkg; + return NULL; } -/** - * blkg_tryget_closest - try and get a blkg ref on the closet blkg - * @blkg: blkg to get - * - * This walks up the blkg tree to find the closest non-dying blkg and returns - * the blkg that it did association with as it may not be the passed in blkg. - */ -static inline struct blkcg_gq *blkg_tryget_closest(struct blkcg_gq *blkg) -{ - while (!percpu_ref_tryget(&blkg->refcnt)) - blkg = blkg->parent; - return blkg; -} +void __blkg_release_rcu(struct rcu_head *rcu); /** * blkg_put - put a blkg reference @@ -526,7 +478,9 @@ static inline struct blkcg_gq *blkg_tryget_closest(struct blkcg_gq *blkg) */ static inline void blkg_put(struct blkcg_gq *blkg) { - percpu_ref_put(&blkg->refcnt); + WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0); + if (atomic_dec_and_test(&blkg->refcnt)) + call_rcu(&blkg->rcu_head, __blkg_release_rcu); } /** @@ -579,36 +533,25 @@ static inline struct request_list *blk_get_rl(struct request_queue *q, rcu_read_lock(); - if (bio && bio->bi_blkg) { - blkcg = bio->bi_blkg->blkcg; - if (blkcg == &blkcg_root) - goto rl_use_root; - - blkg_get(bio->bi_blkg); - rcu_read_unlock(); - return &bio->bi_blkg->rl; - } + blkcg = bio_blkcg(bio); - blkcg = css_to_blkcg(blkcg_css()); + /* bypass blkg lookup and use @q->root_rl directly for root */ if (blkcg == &blkcg_root) - goto rl_use_root; + goto root_rl; + /* + * Try to use blkg->rl. blkg lookup may fail under memory pressure + * or if either the blkcg or queue is going away. Fall back to + * root_rl in such cases. + */ blkg = blkg_lookup(blkcg, q); if (unlikely(!blkg)) - blkg = __blkg_lookup_create(blkcg, q); - - if (blkg->blkcg == &blkcg_root || !blkg_tryget(blkg)) - goto rl_use_root; + goto root_rl; + blkg_get(blkg); rcu_read_unlock(); return &blkg->rl; - - /* - * Each blkg has its own request_list, however, the root blkcg - * uses the request_queue's root_rl. This is to avoid most - * overhead for the root blkcg. - */ -rl_use_root: +root_rl: rcu_read_unlock(); return &q->root_rl; } @@ -854,26 +797,32 @@ static inline bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg struct bio *bio) { return false; } #endif - -static inline void blkcg_bio_issue_init(struct bio *bio) -{ - bio_issue_init(&bio->bi_issue, bio_sectors(bio)); -} - static inline bool blkcg_bio_issue_check(struct request_queue *q, struct bio *bio) { + struct blkcg *blkcg; struct blkcg_gq *blkg; bool throtl = false; rcu_read_lock(); + blkcg = bio_blkcg(bio); + + /* associate blkcg if bio hasn't attached one */ + bio_associate_blkcg(bio, &blkcg->css); - bio_associate_create_blkg(q, bio); - blkg = bio->bi_blkg; + blkg = blkg_lookup(blkcg, q); + if (unlikely(!blkg)) { + spin_lock_irq(q->queue_lock); + blkg = blkg_lookup_create(blkcg, q); + if (IS_ERR(blkg)) + blkg = NULL; + spin_unlock_irq(q->queue_lock); + } throtl = blk_throtl_bio(q, blkg, bio); if (!throtl) { + blkg = blkg ?: q->root_blkg; /* * If the bio is flagged with BIO_QUEUE_ENTERED it means this * is a split bio and we would have already accounted for the @@ -885,8 +834,6 @@ static inline bool blkcg_bio_issue_check(struct request_queue *q, blkg_rwstat_add(&blkg->stat_ios, bio->bi_opf, 1); } - blkcg_bio_issue_init(bio); - rcu_read_unlock(); return !throtl; } @@ -983,7 +930,6 @@ static inline int blkcg_activate_policy(struct request_queue *q, static inline void blkcg_deactivate_policy(struct request_queue *q, const struct blkcg_policy *pol) { } -static inline struct blkcg *__bio_blkcg(struct bio *bio) { return NULL; } static inline struct blkcg *bio_blkcg(struct bio *bio) { return NULL; } static inline struct blkg_policy_data *blkg_to_pd(struct blkcg_gq *blkg, @@ -999,7 +945,6 @@ static inline void blk_put_rl(struct request_list *rl) { } static inline void blk_rq_set_rl(struct request *rq, struct request_list *rl) { } static inline struct request_list *blk_rq_rl(struct request *rq) { return &rq->q->root_rl; } -static inline void blkcg_bio_issue_init(struct bio *bio) { } static inline bool blkcg_bio_issue_check(struct request_queue *q, struct bio *bio) { return true; } diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index 093a818c5b684..1dcf652ba0aa3 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -178,6 +178,7 @@ struct bio { * release. Read comment on top of bio_associate_current(). */ struct io_context *bi_ioc; + struct cgroup_subsys_state *bi_css; struct blkcg_gq *bi_blkg; struct bio_issue bi_issue; #endif diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index b8bcbdeb2eac4..32c553556bbdc 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -93,8 +93,6 @@ extern struct css_set init_css_set; bool css_has_online_children(struct cgroup_subsys_state *css); struct cgroup_subsys_state *css_from_id(int id, struct cgroup_subsys *ss); -struct cgroup_subsys_state *cgroup_e_css(struct cgroup *cgroup, - struct cgroup_subsys *ss); struct cgroup_subsys_state *cgroup_get_e_css(struct cgroup *cgroup, struct cgroup_subsys *ss); struct cgroup_subsys_state *css_tryget_online_from_dir(struct dentry *dentry, diff --git a/include/linux/writeback.h b/include/linux/writeback.h index 738a0c24874f0..fdfd04e348f69 100644 --- a/include/linux/writeback.h +++ b/include/linux/writeback.h @@ -246,8 +246,7 @@ static inline void wbc_attach_fdatawrite_inode(struct writeback_control *wbc, * * @bio is a part of the writeback in progress controlled by @wbc. Perform * writeback specific initialization. This is used to apply the cgroup - * writeback context. Must be called after the bio has been associated with - * a device. + * writeback context. */ static inline void wbc_init_bio(struct writeback_control *wbc, struct bio *bio) { @@ -258,7 +257,7 @@ static inline void wbc_init_bio(struct writeback_control *wbc, struct bio *bio) * regular writeback instead of writing things out itself. */ if (wbc->wb) - bio_associate_blkg_from_css(bio, wbc->wb->blkcg_css); + bio_associate_blkcg(bio, wbc->wb->blkcg_css); } #else /* CONFIG_CGROUP_WRITEBACK */ diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 4c1cf0969a80e..4a3dae2a82830 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -492,7 +492,7 @@ static struct cgroup_subsys_state *cgroup_tryget_css(struct cgroup *cgrp, } /** - * cgroup_e_css_by_mask - obtain a cgroup's effective css for the specified ss + * cgroup_e_css - obtain a cgroup's effective css for the specified subsystem * @cgrp: the cgroup of interest * @ss: the subsystem of interest (%NULL returns @cgrp->self) * @@ -501,8 +501,8 @@ static struct cgroup_subsys_state *cgroup_tryget_css(struct cgroup *cgrp, * enabled. If @ss is associated with the hierarchy @cgrp is on, this * function is guaranteed to return non-NULL css. */ -static struct cgroup_subsys_state *cgroup_e_css_by_mask(struct cgroup *cgrp, - struct cgroup_subsys *ss) +static struct cgroup_subsys_state *cgroup_e_css(struct cgroup *cgrp, + struct cgroup_subsys *ss) { lockdep_assert_held(&cgroup_mutex); @@ -522,35 +522,6 @@ static struct cgroup_subsys_state *cgroup_e_css_by_mask(struct cgroup *cgrp, return cgroup_css(cgrp, ss); } -/** - * cgroup_e_css - obtain a cgroup's effective css for the specified subsystem - * @cgrp: the cgroup of interest - * @ss: the subsystem of interest - * - * Find and get the effective css of @cgrp for @ss. The effective css is - * defined as the matching css of the nearest ancestor including self which - * has @ss enabled. If @ss is not mounted on the hierarchy @cgrp is on, - * the root css is returned, so this function always returns a valid css. - * - * The returned css is not guaranteed to be online, and therefore it is the - * callers responsiblity to tryget a reference for it. - */ -struct cgroup_subsys_state *cgroup_e_css(struct cgroup *cgrp, - struct cgroup_subsys *ss) -{ - struct cgroup_subsys_state *css; - - do { - css = cgroup_css(cgrp, ss); - - if (css) - return css; - cgrp = cgroup_parent(cgrp); - } while (cgrp); - - return init_css_set.subsys[ss->id]; -} - /** * cgroup_get_e_css - get a cgroup's effective css for the specified subsystem * @cgrp: the cgroup of interest @@ -633,11 +604,10 @@ EXPORT_SYMBOL_GPL(of_css); * * Should be called under cgroup_[tree_]mutex. */ -#define for_each_e_css(css, ssid, cgrp) \ - for ((ssid) = 0; (ssid) < CGROUP_SUBSYS_COUNT; (ssid)++) \ - if (!((css) = cgroup_e_css_by_mask(cgrp, \ - cgroup_subsys[(ssid)]))) \ - ; \ +#define for_each_e_css(css, ssid, cgrp) \ + for ((ssid) = 0; (ssid) < CGROUP_SUBSYS_COUNT; (ssid)++) \ + if (!((css) = cgroup_e_css(cgrp, cgroup_subsys[(ssid)]))) \ + ; \ else /** @@ -1036,7 +1006,7 @@ static struct css_set *find_existing_css_set(struct css_set *old_cset, * @ss is in this hierarchy, so we want the * effective css from @cgrp. */ - template[i] = cgroup_e_css_by_mask(cgrp, ss); + template[i] = cgroup_e_css(cgrp, ss); } else { /* * @ss is not in this hierarchy, so we don't want @@ -3053,7 +3023,7 @@ static int cgroup_apply_control(struct cgroup *cgrp) return ret; /* - * At this point, cgroup_e_css_by_mask() results reflect the new csses + * At this point, cgroup_e_css() results reflect the new csses * making the following cgroup_update_dfl_csses() properly update * css associations of all tasks in the subtree. */ diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c index fac0ddf8a8e22..2868d85f1fb1d 100644 --- a/kernel/trace/blktrace.c +++ b/kernel/trace/blktrace.c @@ -764,9 +764,9 @@ blk_trace_bio_get_cgid(struct request_queue *q, struct bio *bio) if (!bt || !(blk_tracer_flags.val & TRACE_BLK_OPT_CGROUP)) return NULL; - if (!bio->bi_blkg) + if (!bio->bi_css) return NULL; - return cgroup_get_kernfs_id(bio_blkcg(bio)->css.cgroup); + return cgroup_get_kernfs_id(bio->bi_css->cgroup); } #else static union kernfs_node_id * diff --git a/mm/page_io.c b/mm/page_io.c index 573d3663d8462..aafd19ec1db46 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -339,7 +339,7 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc, goto out; } bio->bi_opf = REQ_OP_WRITE | REQ_SWAP | wbc_to_write_flags(wbc); - bio_associate_blkg_from_page(bio, page); + bio_associate_blkcg_from_page(bio, page); count_swpout_vm_event(page); set_page_writeback(page); unlock_page(page); -- cgit v1.2.3 From 732e5dca374d764bd9711c8c121b0e18ff92985e Mon Sep 17 00:00:00 2001 From: Guo Ren Date: Sat, 3 Nov 2018 00:51:29 +0800 Subject: dt-bindings: timer: C-SKY Multi-processor timer Dt-bingdings doc for C-SKY SMP system setting. Signed-off-by: Guo Ren Reviewed-by: Rob Herring Signed-off-by: Daniel Lezcano --- .../devicetree/bindings/timer/csky,mptimer.txt | 42 ++++++++++++++++++++++ 1 file changed, 42 insertions(+) create mode 100644 Documentation/devicetree/bindings/timer/csky,mptimer.txt (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/timer/csky,mptimer.txt b/Documentation/devicetree/bindings/timer/csky,mptimer.txt new file mode 100644 index 0000000000000..15cfec08fbb8e --- /dev/null +++ b/Documentation/devicetree/bindings/timer/csky,mptimer.txt @@ -0,0 +1,42 @@ +============================ +C-SKY Multi-processors Timer +============================ + +C-SKY multi-processors timer is designed for C-SKY SMP system and the +regs is accessed by cpu co-processor 4 registers with mtcr/mfcr. + + - PTIM_CTLR "cr<0, 14>" Control reg to start reset timer. + - PTIM_TSR "cr<1, 14>" Interrupt cleanup status reg. + - PTIM_CCVR "cr<3, 14>" Current counter value reg. + - PTIM_LVR "cr<6, 14>" Window value reg to triger next event. + +============================== +timer node bindings definition +============================== + + Description: Describes SMP timer + + PROPERTIES + + - compatible + Usage: required + Value type: + Definition: must be "csky,mptimer" + - clocks + Usage: required + Value type: + Definition: must be input clk node + - interrupts + Usage: required + Value type: + Definition: must be timer irq num defined by soc + +Examples: +--------- + + timer: timer { + compatible = "csky,mptimer"; + clocks = <&dummy_apb_clk>; + interrupts = <16>; + interrupt-parent = <&intc>; + }; -- cgit v1.2.3 From ab1e77c3f590536b2659c097980efe6e8f713921 Mon Sep 17 00:00:00 2001 From: Guo Ren Date: Sat, 3 Nov 2018 00:51:31 +0800 Subject: dt-bindings: timer: gx6605s SOC timer Dt-bindings doc for gx6605s SOC's system timer. Signed-off-by: Guo Ren Reviewed-by: Rob Herring Signed-off-by: Daniel Lezcano --- .../bindings/timer/csky,gx6605s-timer.txt | 42 ++++++++++++++++++++++ 1 file changed, 42 insertions(+) create mode 100644 Documentation/devicetree/bindings/timer/csky,gx6605s-timer.txt (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/timer/csky,gx6605s-timer.txt b/Documentation/devicetree/bindings/timer/csky,gx6605s-timer.txt new file mode 100644 index 0000000000000..6b04344f4beaa --- /dev/null +++ b/Documentation/devicetree/bindings/timer/csky,gx6605s-timer.txt @@ -0,0 +1,42 @@ +================= +gx6605s SOC Timer +================= + +The timer is used in gx6605s soc as system timer and the driver +contain clk event and clk source. + +============================== +timer node bindings definition +============================== + + Description: Describes gx6605s SOC timer + + PROPERTIES + + - compatible + Usage: required + Value type: + Definition: must be "csky,gx6605s-timer" + - reg + Usage: required + Value type: + Definition: in soc from cpu view + - clocks + Usage: required + Value type: phandle + clock specifier cells + Definition: must be input clk node + - interrupt + Usage: required + Value type: + Definition: must be timer irq num defined by soc + +Examples: +--------- + + timer0: timer@20a000 { + compatible = "csky,gx6605s-timer"; + reg = <0x0020a000 0x400>; + clocks = <&dummy_apb_clk>; + interrupts = <10>; + interrupt-parent = <&intc>; + }; -- cgit v1.2.3 From b068621a53f92f42400581d69ad0e84c56620b0a Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Sun, 4 Nov 2018 14:03:56 -0800 Subject: Documentation/x86: Fix typo in zero-page.txt Signed-off-by: Randy Dunlap Cc: Jonathan Corbet Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: linux-doc@vger.kernel.org Link: http://lkml.kernel.org/r/f259b21b-1f2b-f215-00d2-23388bed2530@infradead.org Signed-off-by: Ingo Molnar --- Documentation/x86/zero-page.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/x86/zero-page.txt b/Documentation/x86/zero-page.txt index 97b7adbceda48..68aed077f7b62 100644 --- a/Documentation/x86/zero-page.txt +++ b/Documentation/x86/zero-page.txt @@ -25,7 +25,7 @@ Offset Proto Name Meaning 0C8/004 ALL ext_cmd_line_ptr cmd_line_ptr high 32bits 140/080 ALL edid_info Video mode setup (struct edid_info) 1C0/020 ALL efi_info EFI 32 information (struct efi_info) -1E0/004 ALL alk_mem_k Alternative mem check, in KB +1E0/004 ALL alt_mem_k Alternative mem check, in KB 1E4/004 ALL scratch Scratch field for the kernel setup code 1E8/001 ALL e820_entries Number of entries in e820_table (below) 1E9/001 ALL eddbuf_entries Number of entries in eddbuf (below) -- cgit v1.2.3 From 058ad7b6aa5204d3af878415c7b946748ab34f7a Mon Sep 17 00:00:00 2001 From: Fabrizio Castro Date: Thu, 4 Oct 2018 09:53:10 +0100 Subject: dt-bindings: arm: Fix RZ/G2E part number Fix RZ/G2E part number from its description. Signed-off-by: Fabrizio Castro Reviewed-by: Biju Das Reviewed-by: Rob Herring Signed-off-by: Simon Horman --- Documentation/devicetree/bindings/arm/shmobile.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/arm/shmobile.txt b/Documentation/devicetree/bindings/arm/shmobile.txt index f5e0f82fd5031..58c4256d37a39 100644 --- a/Documentation/devicetree/bindings/arm/shmobile.txt +++ b/Documentation/devicetree/bindings/arm/shmobile.txt @@ -27,7 +27,7 @@ SoCs: compatible = "renesas,r8a77470" - RZ/G2M (R8A774A1) compatible = "renesas,r8a774a1" - - RZ/G2E (RA8774C0) + - RZ/G2E (R8A774C0) compatible = "renesas,r8a774c0" - R-Car M1A (R8A77781) compatible = "renesas,r8a7778" -- cgit v1.2.3 From 4c0608f4a0e76dfb82d3accd20081f4bf47ed143 Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Tue, 30 Oct 2018 09:45:55 -0400 Subject: XArray: Regularise xa_reserve The xa_reserve() function was a little unusual in that it attempted to be callable for all kinds of locking scenarios. Make it look like the other APIs with __xa_reserve, xa_reserve_bh and xa_reserve_irq variants. Signed-off-by: Matthew Wilcox --- Documentation/core-api/xarray.rst | 13 +++++++ include/linux/xarray.h | 80 ++++++++++++++++++++++++++++++++++++++- lib/test_xarray.c | 6 +++ lib/xarray.c | 18 ++++----- 4 files changed, 105 insertions(+), 12 deletions(-) (limited to 'Documentation') diff --git a/Documentation/core-api/xarray.rst b/Documentation/core-api/xarray.rst index a4e705108f428..65c77a81b6891 100644 --- a/Documentation/core-api/xarray.rst +++ b/Documentation/core-api/xarray.rst @@ -105,6 +105,15 @@ may result in the entry being marked at some, but not all of the other indices. Storing into one index may result in the entry retrieved by some, but not all of the other indices changing. +Sometimes you need to ensure that a subsequent call to :c:func:`xa_store` +will not need to allocate memory. The :c:func:`xa_reserve` function +will store a reserved entry at the indicated index. Users of the normal +API will see this entry as containing ``NULL``. If you do not need to +use the reserved entry, you can call :c:func:`xa_release` to remove the +unused entry. If another user has stored to the entry in the meantime, +:c:func:`xa_release` will do nothing; if instead you want the entry to +become ``NULL``, you should use :c:func:`xa_erase`. + Finally, you can remove all entries from an XArray by calling :c:func:`xa_destroy`. If the XArray entries are pointers, you may wish to free the entries first. You can do this by iterating over all present @@ -167,6 +176,9 @@ Takes xa_lock internally: * :c:func:`xa_alloc` * :c:func:`xa_alloc_bh` * :c:func:`xa_alloc_irq` + * :c:func:`xa_reserve` + * :c:func:`xa_reserve_bh` + * :c:func:`xa_reserve_irq` * :c:func:`xa_destroy` * :c:func:`xa_set_mark` * :c:func:`xa_clear_mark` @@ -177,6 +189,7 @@ Assumes xa_lock held on entry: * :c:func:`__xa_erase` * :c:func:`__xa_cmpxchg` * :c:func:`__xa_alloc` + * :c:func:`__xa_reserve` * :c:func:`__xa_set_mark` * :c:func:`__xa_clear_mark` diff --git a/include/linux/xarray.h b/include/linux/xarray.h index d9514928ddacb..c2cb0426c60c3 100644 --- a/include/linux/xarray.h +++ b/include/linux/xarray.h @@ -291,7 +291,6 @@ void *xa_load(struct xarray *, unsigned long index); void *xa_store(struct xarray *, unsigned long index, void *entry, gfp_t); void *xa_cmpxchg(struct xarray *, unsigned long index, void *old, void *entry, gfp_t); -int xa_reserve(struct xarray *, unsigned long index, gfp_t); void *xa_store_range(struct xarray *, unsigned long first, unsigned long last, void *entry, gfp_t); bool xa_get_mark(struct xarray *, unsigned long index, xa_mark_t); @@ -455,6 +454,7 @@ void *__xa_store(struct xarray *, unsigned long index, void *entry, gfp_t); void *__xa_cmpxchg(struct xarray *, unsigned long index, void *old, void *entry, gfp_t); int __xa_alloc(struct xarray *, u32 *id, u32 max, void *entry, gfp_t); +int __xa_reserve(struct xarray *, unsigned long index, gfp_t); void __xa_set_mark(struct xarray *, unsigned long index, xa_mark_t); void __xa_clear_mark(struct xarray *, unsigned long index, xa_mark_t); @@ -621,6 +621,84 @@ static inline int xa_alloc_irq(struct xarray *xa, u32 *id, u32 max, void *entry, return err; } +/** + * xa_reserve() - Reserve this index in the XArray. + * @xa: XArray. + * @index: Index into array. + * @gfp: Memory allocation flags. + * + * Ensures there is somewhere to store an entry at @index in the array. + * If there is already something stored at @index, this function does + * nothing. If there was nothing there, the entry is marked as reserved. + * Loading from a reserved entry returns a %NULL pointer. + * + * If you do not use the entry that you have reserved, call xa_release() + * or xa_erase() to free any unnecessary memory. + * + * Context: Any context. Takes and releases the xa_lock. + * May sleep if the @gfp flags permit. + * Return: 0 if the reservation succeeded or -ENOMEM if it failed. + */ +static inline +int xa_reserve(struct xarray *xa, unsigned long index, gfp_t gfp) +{ + int ret; + + xa_lock(xa); + ret = __xa_reserve(xa, index, gfp); + xa_unlock(xa); + + return ret; +} + +/** + * xa_reserve_bh() - Reserve this index in the XArray. + * @xa: XArray. + * @index: Index into array. + * @gfp: Memory allocation flags. + * + * A softirq-disabling version of xa_reserve(). + * + * Context: Any context. Takes and releases the xa_lock while + * disabling softirqs. + * Return: 0 if the reservation succeeded or -ENOMEM if it failed. + */ +static inline +int xa_reserve_bh(struct xarray *xa, unsigned long index, gfp_t gfp) +{ + int ret; + + xa_lock_bh(xa); + ret = __xa_reserve(xa, index, gfp); + xa_unlock_bh(xa); + + return ret; +} + +/** + * xa_reserve_irq() - Reserve this index in the XArray. + * @xa: XArray. + * @index: Index into array. + * @gfp: Memory allocation flags. + * + * An interrupt-disabling version of xa_reserve(). + * + * Context: Process context. Takes and releases the xa_lock while + * disabling interrupts. + * Return: 0 if the reservation succeeded or -ENOMEM if it failed. + */ +static inline +int xa_reserve_irq(struct xarray *xa, unsigned long index, gfp_t gfp) +{ + int ret; + + xa_lock_irq(xa); + ret = __xa_reserve(xa, index, gfp); + xa_unlock_irq(xa); + + return ret; +} + /* Everything below here is the Advanced API. Proceed with caution. */ /* diff --git a/lib/test_xarray.c b/lib/test_xarray.c index 126127658b494..e5294b20b52fd 100644 --- a/lib/test_xarray.c +++ b/lib/test_xarray.c @@ -373,6 +373,12 @@ static noinline void check_reserve(struct xarray *xa) xa_erase_index(xa, 12345678); XA_BUG_ON(xa, !xa_empty(xa)); + /* And so does xa_insert */ + xa_reserve(xa, 12345678, GFP_KERNEL); + XA_BUG_ON(xa, xa_insert(xa, 12345678, xa_mk_value(12345678), 0) != 0); + xa_erase_index(xa, 12345678); + XA_BUG_ON(xa, !xa_empty(xa)); + /* Can iterate through a reserved entry */ xa_store_index(xa, 5, GFP_KERNEL); xa_reserve(xa, 6, GFP_KERNEL); diff --git a/lib/xarray.c b/lib/xarray.c index e7be4e47c6a91..9cab8cfef8a8e 100644 --- a/lib/xarray.c +++ b/lib/xarray.c @@ -1488,7 +1488,7 @@ void *__xa_cmpxchg(struct xarray *xa, unsigned long index, EXPORT_SYMBOL(__xa_cmpxchg); /** - * xa_reserve() - Reserve this index in the XArray. + * __xa_reserve() - Reserve this index in the XArray. * @xa: XArray. * @index: Index into array. * @gfp: Memory allocation flags. @@ -1496,33 +1496,29 @@ EXPORT_SYMBOL(__xa_cmpxchg); * Ensures there is somewhere to store an entry at @index in the array. * If there is already something stored at @index, this function does * nothing. If there was nothing there, the entry is marked as reserved. - * Loads from @index will continue to see a %NULL pointer until a - * subsequent store to @index. + * Loading from a reserved entry returns a %NULL pointer. * * If you do not use the entry that you have reserved, call xa_release() * or xa_erase() to free any unnecessary memory. * - * Context: Process context. Takes and releases the xa_lock, IRQ or BH safe - * if specified in XArray flags. May sleep if the @gfp flags permit. + * Context: Any context. Expects the xa_lock to be held on entry. May + * release the lock, sleep and reacquire the lock if the @gfp flags permit. * Return: 0 if the reservation succeeded or -ENOMEM if it failed. */ -int xa_reserve(struct xarray *xa, unsigned long index, gfp_t gfp) +int __xa_reserve(struct xarray *xa, unsigned long index, gfp_t gfp) { XA_STATE(xas, xa, index); - unsigned int lock_type = xa_lock_type(xa); void *curr; do { - xas_lock_type(&xas, lock_type); curr = xas_load(&xas); if (!curr) xas_store(&xas, XA_ZERO_ENTRY); - xas_unlock_type(&xas, lock_type); - } while (xas_nomem(&xas, gfp)); + } while (__xas_nomem(&xas, gfp)); return xas_error(&xas); } -EXPORT_SYMBOL(xa_reserve); +EXPORT_SYMBOL(__xa_reserve); #ifdef CONFIG_XARRAY_MULTI static void xas_set_range(struct xa_state *xas, unsigned long first, -- cgit v1.2.3 From 84e5acb76dacb8ebd648a86a53907ce0dd616534 Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Fri, 26 Oct 2018 14:41:29 -0400 Subject: XArray: Add xa_store_bh() and xa_store_irq() These convenience wrappers disable interrupts while taking the spinlock. A number of drivers would otherwise have to open-code these functions. Signed-off-by: Matthew Wilcox --- Documentation/core-api/xarray.rst | 5 +++- include/linux/xarray.h | 52 +++++++++++++++++++++++++++++++++++++++ 2 files changed, 56 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/core-api/xarray.rst b/Documentation/core-api/xarray.rst index 65c77a81b6891..8a6e2087de777 100644 --- a/Documentation/core-api/xarray.rst +++ b/Documentation/core-api/xarray.rst @@ -167,6 +167,8 @@ Takes RCU read lock: Takes xa_lock internally: * :c:func:`xa_store` + * :c:func:`xa_store_bh` + * :c:func:`xa_store_irq` * :c:func:`xa_insert` * :c:func:`xa_erase` * :c:func:`xa_erase_bh` @@ -247,7 +249,8 @@ Sharing the XArray with interrupt context is also possible, either using :c:func:`xa_lock_irqsave` in both the interrupt handler and process context, or :c:func:`xa_lock_irq` in process context and :c:func:`xa_lock` in the interrupt handler. Some of the more common patterns have helper -functions such as :c:func:`xa_erase_bh` and :c:func:`xa_erase_irq`. +functions such as :c:func:`xa_store_bh`, :c:func:`xa_store_irq`, +:c:func:`xa_erase_bh` and :c:func:`xa_erase_irq`. Sometimes you need to protect access to the XArray with a mutex because that lock sits above another mutex in the locking hierarchy. That does diff --git a/include/linux/xarray.h b/include/linux/xarray.h index 4c839c17a99be..52d9732e4ec41 100644 --- a/include/linux/xarray.h +++ b/include/linux/xarray.h @@ -426,6 +426,58 @@ static inline int __xa_insert(struct xarray *xa, unsigned long index, return -EEXIST; } +/** + * xa_store_bh() - Store this entry in the XArray. + * @xa: XArray. + * @index: Index into array. + * @entry: New entry. + * @gfp: Memory allocation flags. + * + * This function is like calling xa_store() except it disables softirqs + * while holding the array lock. + * + * Context: Any context. Takes and releases the xa_lock while + * disabling softirqs. + * Return: The entry which used to be at this index. + */ +static inline void *xa_store_bh(struct xarray *xa, unsigned long index, + void *entry, gfp_t gfp) +{ + void *curr; + + xa_lock_bh(xa); + curr = __xa_store(xa, index, entry, gfp); + xa_unlock_bh(xa); + + return curr; +} + +/** + * xa_store_irq() - Erase this entry from the XArray. + * @xa: XArray. + * @index: Index into array. + * @entry: New entry. + * @gfp: Memory allocation flags. + * + * This function is like calling xa_store() except it disables interrupts + * while holding the array lock. + * + * Context: Process context. Takes and releases the xa_lock while + * disabling interrupts. + * Return: The entry which used to be at this index. + */ +static inline void *xa_store_irq(struct xarray *xa, unsigned long index, + void *entry, gfp_t gfp) +{ + void *curr; + + xa_lock_irq(xa); + curr = __xa_store(xa, index, entry, gfp); + xa_unlock_irq(xa); + + return curr; +} + /** * xa_erase_bh() - Erase this entry from the XArray. * @xa: XArray. -- cgit v1.2.3 From d9c480435add8257f9069941f0e6196647f6d746 Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Mon, 5 Nov 2018 16:15:56 -0500 Subject: XArray: Handle NULL pointers differently for allocation For allocating XArrays, it makes sense to distinguish beteen erasing an entry and storing NULL. Storing NULL keeps the index allocated with a NULL pointer associated with it while xa_erase() frees the index. Some existing IDR users rely on this ability. Signed-off-by: Matthew Wilcox --- Documentation/core-api/xarray.rst | 28 +++++++++++++++++++--------- lib/xarray.c | 13 ++++++++++--- 2 files changed, 29 insertions(+), 12 deletions(-) (limited to 'Documentation') diff --git a/Documentation/core-api/xarray.rst b/Documentation/core-api/xarray.rst index 8a6e2087de777..616ac406bf865 100644 --- a/Documentation/core-api/xarray.rst +++ b/Documentation/core-api/xarray.rst @@ -119,18 +119,27 @@ Finally, you can remove all entries from an XArray by calling to free the entries first. You can do this by iterating over all present entries in the XArray using the :c:func:`xa_for_each` iterator. -ID assignment -------------- +Allocating XArrays +------------------ + +If you use :c:func:`DEFINE_XARRAY_ALLOC` to define the XArray, or +initialise it by passing ``XA_FLAGS_ALLOC`` to :c:func:`xa_init_flags`, +the XArray changes to track whether entries are in use or not. You can call :c:func:`xa_alloc` to store the entry at any unused index in the XArray. If you need to modify the array from interrupt context, you can use :c:func:`xa_alloc_bh` or :c:func:`xa_alloc_irq` to disable -interrupts while allocating the ID. Unlike :c:func:`xa_store`, allocating -a ``NULL`` pointer does not delete an entry. Instead it reserves an -entry like :c:func:`xa_reserve` and you can release it using either -:c:func:`xa_erase` or :c:func:`xa_release`. To use ID assignment, the -XArray must be defined with :c:func:`DEFINE_XARRAY_ALLOC`, or initialised -by passing ``XA_FLAGS_ALLOC`` to :c:func:`xa_init_flags`, +interrupts while allocating the ID. + +Using :c:func:`xa_store`, :c:func:`xa_cmpxchg` or :c:func:`xa_insert` +will mark the entry as being allocated. Unlike a normal XArray, storing +``NULL`` will mark the entry as being in use, like :c:func:`xa_reserve`. +To free an entry, use :c:func:`xa_erase` (or :c:func:`xa_release` if +you only want to free the entry if it's ``NULL``). + +You cannot use ``XA_MARK_0`` with an allocating XArray as this mark +is used to track whether an entry is free or not. The other marks are +available for your use. Memory allocation ----------------- @@ -338,7 +347,8 @@ to :c:func:`xas_retry`, and retry the operation if it returns ``true``. - :c:func:`xa_is_zero` - Zero entries appear as ``NULL`` through the Normal API, but occupy an entry in the XArray which can be used to reserve the index for - future use. + future use. This is used by allocating XArrays for allocated entries + which are ``NULL``. Other internal entries may be added in the future. As far as possible, they will be handled by :c:func:`xas_retry`. diff --git a/lib/xarray.c b/lib/xarray.c index a9d28013f9dc1..c3e2084aa3139 100644 --- a/lib/xarray.c +++ b/lib/xarray.c @@ -1382,10 +1382,12 @@ void *__xa_store(struct xarray *xa, unsigned long index, void *entry, gfp_t gfp) if (WARN_ON_ONCE(xa_is_internal(entry))) return XA_ERROR(-EINVAL); + if (xa_track_free(xa) && !entry) + entry = XA_ZERO_ENTRY; do { curr = xas_store(&xas, entry); - if (xa_track_free(xa) && entry) + if (xa_track_free(xa)) xas_clear_mark(&xas, XA_FREE_MARK); } while (__xas_nomem(&xas, gfp)); @@ -1446,6 +1448,8 @@ void *__xa_cmpxchg(struct xarray *xa, unsigned long index, if (WARN_ON_ONCE(xa_is_internal(entry))) return XA_ERROR(-EINVAL); + if (xa_track_free(xa) && !entry) + entry = XA_ZERO_ENTRY; do { curr = xas_load(&xas); @@ -1453,7 +1457,7 @@ void *__xa_cmpxchg(struct xarray *xa, unsigned long index, curr = NULL; if (curr == old) { xas_store(&xas, entry); - if (xa_track_free(xa) && entry) + if (xa_track_free(xa)) xas_clear_mark(&xas, XA_FREE_MARK); } } while (__xas_nomem(&xas, gfp)); @@ -1487,8 +1491,11 @@ int __xa_reserve(struct xarray *xa, unsigned long index, gfp_t gfp) do { curr = xas_load(&xas); - if (!curr) + if (!curr) { xas_store(&xas, XA_ZERO_ENTRY); + if (xa_track_free(xa)) + xas_clear_mark(&xas, XA_FREE_MARK); + } } while (__xas_nomem(&xas, gfp)); return xas_error(&xas); -- cgit v1.2.3 From 804dfaf01bcc9daa4298c608ba9018abf616ec48 Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Mon, 5 Nov 2018 16:37:15 -0500 Subject: XArray: Fix Documentation Minor fixes. Signed-off-by: Matthew Wilcox --- Documentation/core-api/xarray.rst | 6 +++++- include/linux/xarray.h | 4 ++-- lib/xarray.c | 10 +++++----- 3 files changed, 12 insertions(+), 8 deletions(-) (limited to 'Documentation') diff --git a/Documentation/core-api/xarray.rst b/Documentation/core-api/xarray.rst index 616ac406bf865..dbe96cb5558ef 100644 --- a/Documentation/core-api/xarray.rst +++ b/Documentation/core-api/xarray.rst @@ -74,7 +74,8 @@ using :c:func:`xa_load`. xa_store will overwrite any entry with the new entry and return the previous entry stored at that index. You can use :c:func:`xa_erase` instead of calling :c:func:`xa_store` with a ``NULL`` entry. There is no difference between an entry that has never -been stored to and one that has most recently had ``NULL`` stored to it. +been stored to, one that has been erased and one that has most recently +had ``NULL`` stored to it. You can conditionally replace an entry at an index by using :c:func:`xa_cmpxchg`. Like :c:func:`cmpxchg`, it will only succeed if @@ -114,6 +115,9 @@ unused entry. If another user has stored to the entry in the meantime, :c:func:`xa_release` will do nothing; if instead you want the entry to become ``NULL``, you should use :c:func:`xa_erase`. +If all entries in the array are ``NULL``, the :c:func:`xa_empty` function +will return ``true``. + Finally, you can remove all entries from an XArray by calling :c:func:`xa_destroy`. If the XArray entries are pointers, you may wish to free the entries first. You can do this by iterating over all present diff --git a/include/linux/xarray.h b/include/linux/xarray.h index 52d9732e4ec41..564892e19f8ca 100644 --- a/include/linux/xarray.h +++ b/include/linux/xarray.h @@ -487,7 +487,7 @@ static inline void *xa_store_irq(struct xarray *xa, unsigned long index, * the third argument. The XArray does not need to allocate memory, so * the user does not need to provide GFP flags. * - * Context: Process context. Takes and releases the xa_lock while + * Context: Any context. Takes and releases the xa_lock while * disabling softirqs. * Return: The entry which used to be at this index. */ @@ -622,7 +622,7 @@ static inline int xa_alloc(struct xarray *xa, u32 *id, u32 max, void *entry, * Updates the @id pointer with the index, then stores the entry at that * index. A concurrent lookup will not see an uninitialised @id. * - * Context: Process context. Takes and releases the xa_lock while + * Context: Any context. Takes and releases the xa_lock while * disabling softirqs. May sleep if the @gfp flags permit. * Return: 0 on success, -ENOMEM if memory allocation fails or -ENOSPC if * there is no more space in the XArray. diff --git a/lib/xarray.c b/lib/xarray.c index c3e2084aa3139..7946380cd6c93 100644 --- a/lib/xarray.c +++ b/lib/xarray.c @@ -610,8 +610,8 @@ static int xas_expand(struct xa_state *xas, void *head) * (see the xa_cmpxchg() implementation for an example). * * Return: If the slot already existed, returns the contents of this slot. - * If the slot was newly created, returns NULL. If it failed to create the - * slot, returns NULL and indicates the error in @xas. + * If the slot was newly created, returns %NULL. If it failed to create the + * slot, returns %NULL and indicates the error in @xas. */ static void *xas_create(struct xa_state *xas) { @@ -1640,7 +1640,7 @@ EXPORT_SYMBOL(__xa_alloc); * @index: Index of entry. * @mark: Mark number. * - * Attempting to set a mark on a NULL entry does not succeed. + * Attempting to set a mark on a %NULL entry does not succeed. * * Context: Any context. Expects xa_lock to be held on entry. */ @@ -1710,7 +1710,7 @@ EXPORT_SYMBOL(xa_get_mark); * @index: Index of entry. * @mark: Mark number. * - * Attempting to set a mark on a NULL entry does not succeed. + * Attempting to set a mark on a %NULL entry does not succeed. * * Context: Process context. Takes and releases the xa_lock. */ @@ -1879,7 +1879,7 @@ static unsigned int xas_extract_marked(struct xa_state *xas, void **dst, * * The @filter may be an XArray mark value, in which case entries which are * marked with that mark will be copied. It may also be %XA_PRESENT, in - * which case all entries which are not NULL will be copied. + * which case all entries which are not %NULL will be copied. * * The entries returned may not represent a snapshot of the XArray at a * moment in time. For example, if another thread stores to index 5, then -- cgit v1.2.3 From 003aedaed65d4f71f3f122ea1e079c648bab113e Mon Sep 17 00:00:00 2001 From: Sakari Ailus Date: Tue, 9 Oct 2018 07:29:14 -0400 Subject: media: docs: Document metadata format in struct v4l2_format The format fields in struct v4l2_format were otherwise documented but the meta field was missing. Document it. Reported-by: Hans Verkuil Signed-off-by: Sakari Ailus Acked-by: Hans Verkuil Signed-off-by: Mauro Carvalho Chehab --- Documentation/media/uapi/v4l/dev-meta.rst | 2 +- Documentation/media/uapi/v4l/vidioc-g-fmt.rst | 5 +++++ 2 files changed, 6 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/media/uapi/v4l/dev-meta.rst b/Documentation/media/uapi/v4l/dev-meta.rst index f7ac8d0d3af14..b65dc078abeb8 100644 --- a/Documentation/media/uapi/v4l/dev-meta.rst +++ b/Documentation/media/uapi/v4l/dev-meta.rst @@ -40,7 +40,7 @@ To use the :ref:`format` ioctls applications set the ``type`` field of the the desired operation. Both drivers and applications must set the remainder of the :c:type:`v4l2_format` structure to 0. -.. _v4l2-meta-format: +.. c:type:: v4l2_meta_format .. tabularcolumns:: |p{1.4cm}|p{2.2cm}|p{13.9cm}| diff --git a/Documentation/media/uapi/v4l/vidioc-g-fmt.rst b/Documentation/media/uapi/v4l/vidioc-g-fmt.rst index 3ead350e099f9..9ea494a8facab 100644 --- a/Documentation/media/uapi/v4l/vidioc-g-fmt.rst +++ b/Documentation/media/uapi/v4l/vidioc-g-fmt.rst @@ -132,6 +132,11 @@ The format as returned by :ref:`VIDIOC_TRY_FMT ` must be identical - ``sdr`` - Definition of a data format, see :ref:`pixfmt`, used by SDR capture and output devices. + * - + - struct :c:type:`v4l2_meta_format` + - ``meta`` + - Definition of a metadata format, see :ref:`meta-formats`, used by + metadata capture devices. * - - __u8 - ``raw_data``\ [200] -- cgit v1.2.3 From d52888aa2753e3063a9d3a0c9f72f94aa9809c15 Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Fri, 26 Oct 2018 15:28:54 +0300 Subject: x86/mm: Move LDT remap out of KASLR region on 5-level paging On 5-level paging the LDT remap area is placed in the middle of the KASLR randomization region and it can overlap with the direct mapping, the vmalloc or the vmap area. The LDT mapping is per mm, so it cannot be moved into the P4D page table next to the CPU_ENTRY_AREA without complicating PGD table allocation for 5-level paging. The 4 PGD slot gap just before the direct mapping is reserved for hypervisors, so it cannot be used. Move the direct mapping one slot deeper and use the resulting gap for the LDT remap area. The resulting layout is the same for 4 and 5 level paging. [ tglx: Massaged changelog ] Fixes: f55f0501cbf6 ("x86/pti: Put the LDT in its own PGD if PTI is on") Signed-off-by: Kirill A. Shutemov Signed-off-by: Thomas Gleixner Reviewed-by: Andy Lutomirski Cc: bp@alien8.de Cc: hpa@zytor.com Cc: dave.hansen@linux.intel.com Cc: peterz@infradead.org Cc: boris.ostrovsky@oracle.com Cc: jgross@suse.com Cc: bhe@redhat.com Cc: willy@infradead.org Cc: linux-mm@kvack.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20181026122856.66224-2-kirill.shutemov@linux.intel.com --- Documentation/x86/x86_64/mm.txt | 34 +++++++++++++++++---------------- arch/x86/include/asm/page_64_types.h | 12 +++++++----- arch/x86/include/asm/pgtable_64_types.h | 4 +--- arch/x86/xen/mmu_pv.c | 6 +++--- 4 files changed, 29 insertions(+), 27 deletions(-) (limited to 'Documentation') diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt index 73aaaa3da4369..804f9426ed17b 100644 --- a/Documentation/x86/x86_64/mm.txt +++ b/Documentation/x86/x86_64/mm.txt @@ -34,23 +34,24 @@ __________________|____________|__________________|_________|___________________ ____________________________________________________________|___________________________________________________________ | | | | ffff800000000000 | -128 TB | ffff87ffffffffff | 8 TB | ... guard hole, also reserved for hypervisor - ffff880000000000 | -120 TB | ffffc7ffffffffff | 64 TB | direct mapping of all physical memory (page_offset_base) - ffffc80000000000 | -56 TB | ffffc8ffffffffff | 1 TB | ... unused hole + ffff880000000000 | -120 TB | ffff887fffffffff | 0.5 TB | LDT remap for PTI + ffff888000000000 | -119.5 TB | ffffc87fffffffff | 64 TB | direct mapping of all physical memory (page_offset_base) + ffffc88000000000 | -55.5 TB | ffffc8ffffffffff | 0.5 TB | ... unused hole ffffc90000000000 | -55 TB | ffffe8ffffffffff | 32 TB | vmalloc/ioremap space (vmalloc_base) ffffe90000000000 | -23 TB | ffffe9ffffffffff | 1 TB | ... unused hole ffffea0000000000 | -22 TB | ffffeaffffffffff | 1 TB | virtual memory map (vmemmap_base) ffffeb0000000000 | -21 TB | ffffebffffffffff | 1 TB | ... unused hole ffffec0000000000 | -20 TB | fffffbffffffffff | 16 TB | KASAN shadow memory - fffffc0000000000 | -4 TB | fffffdffffffffff | 2 TB | ... unused hole - | | | | vaddr_end for KASLR - fffffe0000000000 | -2 TB | fffffe7fffffffff | 0.5 TB | cpu_entry_area mapping - fffffe8000000000 | -1.5 TB | fffffeffffffffff | 0.5 TB | LDT remap for PTI - ffffff0000000000 | -1 TB | ffffff7fffffffff | 0.5 TB | %esp fixup stacks __________________|____________|__________________|_________|____________________________________________________________ | - | Identical layout to the 47-bit one from here on: + | Identical layout to the 56-bit one from here on: ____________________________________________________________|____________________________________________________________ | | | | + fffffc0000000000 | -4 TB | fffffdffffffffff | 2 TB | ... unused hole + | | | | vaddr_end for KASLR + fffffe0000000000 | -2 TB | fffffe7fffffffff | 0.5 TB | cpu_entry_area mapping + fffffe8000000000 | -1.5 TB | fffffeffffffffff | 0.5 TB | ... unused hole + ffffff0000000000 | -1 TB | ffffff7fffffffff | 0.5 TB | %esp fixup stacks ffffff8000000000 | -512 GB | ffffffeeffffffff | 444 GB | ... unused hole ffffffef00000000 | -68 GB | fffffffeffffffff | 64 GB | EFI region mapping space ffffffff00000000 | -4 GB | ffffffff7fffffff | 2 GB | ... unused hole @@ -83,7 +84,7 @@ Notes: __________________|____________|__________________|_________|___________________________________________________________ | | | | 0000800000000000 | +64 PB | ffff7fffffffffff | ~16K PB | ... huge, still almost 64 bits wide hole of non-canonical - | | | | virtual memory addresses up to the -128 TB + | | | | virtual memory addresses up to the -64 PB | | | | starting offset of kernel mappings. __________________|____________|__________________|_________|___________________________________________________________ | @@ -91,23 +92,24 @@ __________________|____________|__________________|_________|___________________ ____________________________________________________________|___________________________________________________________ | | | | ff00000000000000 | -64 PB | ff0fffffffffffff | 4 PB | ... guard hole, also reserved for hypervisor - ff10000000000000 | -60 PB | ff8fffffffffffff | 32 PB | direct mapping of all physical memory (page_offset_base) - ff90000000000000 | -28 PB | ff9fffffffffffff | 4 PB | LDT remap for PTI + ff10000000000000 | -60 PB | ff10ffffffffffff | 0.25 PB | LDT remap for PTI + ff11000000000000 | -59.75 PB | ff90ffffffffffff | 32 PB | direct mapping of all physical memory (page_offset_base) + ff91000000000000 | -27.75 PB | ff9fffffffffffff | 3.75 PB | ... unused hole ffa0000000000000 | -24 PB | ffd1ffffffffffff | 12.5 PB | vmalloc/ioremap space (vmalloc_base) ffd2000000000000 | -11.5 PB | ffd3ffffffffffff | 0.5 PB | ... unused hole ffd4000000000000 | -11 PB | ffd5ffffffffffff | 0.5 PB | virtual memory map (vmemmap_base) ffd6000000000000 | -10.5 PB | ffdeffffffffffff | 2.25 PB | ... unused hole ffdf000000000000 | -8.25 PB | fffffdffffffffff | ~8 PB | KASAN shadow memory - fffffc0000000000 | -4 TB | fffffdffffffffff | 2 TB | ... unused hole - | | | | vaddr_end for KASLR - fffffe0000000000 | -2 TB | fffffe7fffffffff | 0.5 TB | cpu_entry_area mapping - fffffe8000000000 | -1.5 TB | fffffeffffffffff | 0.5 TB | ... unused hole - ffffff0000000000 | -1 TB | ffffff7fffffffff | 0.5 TB | %esp fixup stacks __________________|____________|__________________|_________|____________________________________________________________ | | Identical layout to the 47-bit one from here on: ____________________________________________________________|____________________________________________________________ | | | | + fffffc0000000000 | -4 TB | fffffdffffffffff | 2 TB | ... unused hole + | | | | vaddr_end for KASLR + fffffe0000000000 | -2 TB | fffffe7fffffffff | 0.5 TB | cpu_entry_area mapping + fffffe8000000000 | -1.5 TB | fffffeffffffffff | 0.5 TB | ... unused hole + ffffff0000000000 | -1 TB | ffffff7fffffffff | 0.5 TB | %esp fixup stacks ffffff8000000000 | -512 GB | ffffffeeffffffff | 444 GB | ... unused hole ffffffef00000000 | -68 GB | fffffffeffffffff | 64 GB | EFI region mapping space ffffffff00000000 | -4 GB | ffffffff7fffffff | 2 GB | ... unused hole diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h index cd0cf1c568b4c..8f657286d599a 100644 --- a/arch/x86/include/asm/page_64_types.h +++ b/arch/x86/include/asm/page_64_types.h @@ -33,12 +33,14 @@ /* * Set __PAGE_OFFSET to the most negative possible address + - * PGDIR_SIZE*16 (pgd slot 272). The gap is to allow a space for a - * hypervisor to fit. Choosing 16 slots here is arbitrary, but it's - * what Xen requires. + * PGDIR_SIZE*17 (pgd slot 273). + * + * The gap is to allow a space for LDT remap for PTI (1 pgd slot) and space for + * a hypervisor (16 slots). Choosing 16 slots for a hypervisor is arbitrary, + * but it's what Xen requires. */ -#define __PAGE_OFFSET_BASE_L5 _AC(0xff10000000000000, UL) -#define __PAGE_OFFSET_BASE_L4 _AC(0xffff880000000000, UL) +#define __PAGE_OFFSET_BASE_L5 _AC(0xff11000000000000, UL) +#define __PAGE_OFFSET_BASE_L4 _AC(0xffff888000000000, UL) #ifdef CONFIG_DYNAMIC_MEMORY_LAYOUT #define __PAGE_OFFSET page_offset_base diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h index 04edd2d58211a..84bd9bdc1987f 100644 --- a/arch/x86/include/asm/pgtable_64_types.h +++ b/arch/x86/include/asm/pgtable_64_types.h @@ -111,9 +111,7 @@ extern unsigned int ptrs_per_p4d; */ #define MAXMEM (1UL << MAX_PHYSMEM_BITS) -#define LDT_PGD_ENTRY_L4 -3UL -#define LDT_PGD_ENTRY_L5 -112UL -#define LDT_PGD_ENTRY (pgtable_l5_enabled() ? LDT_PGD_ENTRY_L5 : LDT_PGD_ENTRY_L4) +#define LDT_PGD_ENTRY -240UL #define LDT_BASE_ADDR (LDT_PGD_ENTRY << PGDIR_SHIFT) #define LDT_END_ADDR (LDT_BASE_ADDR + PGDIR_SIZE) diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c index 0d7b3ae4960bb..a5d7ed1253370 100644 --- a/arch/x86/xen/mmu_pv.c +++ b/arch/x86/xen/mmu_pv.c @@ -1905,7 +1905,7 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn) init_top_pgt[0] = __pgd(0); /* Pre-constructed entries are in pfn, so convert to mfn */ - /* L4[272] -> level3_ident_pgt */ + /* L4[273] -> level3_ident_pgt */ /* L4[511] -> level3_kernel_pgt */ convert_pfn_mfn(init_top_pgt); @@ -1925,8 +1925,8 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn) addr[0] = (unsigned long)pgd; addr[1] = (unsigned long)l3; addr[2] = (unsigned long)l2; - /* Graft it onto L4[272][0]. Note that we creating an aliasing problem: - * Both L4[272][0] and L4[511][510] have entries that point to the same + /* Graft it onto L4[273][0]. Note that we creating an aliasing problem: + * Both L4[273][0] and L4[511][510] have entries that point to the same * L2 (PMD) tables. Meaning that if you modify it in __va space * it will be also modified in the __ka space! (But if you just * modify the PMD table to point to other PTE's or none, then you -- cgit v1.2.3 From 781f0766cc41a9dd2e5d118ef4b1d5d89430257b Mon Sep 17 00:00:00 2001 From: Kai-Heng Feng Date: Fri, 19 Oct 2018 16:14:50 +0800 Subject: USB: Wait for extra delay time after USB_PORT_FEAT_RESET for quirky hub Devices connected under Terminus Technology Inc. Hub (1a40:0101) may fail to work after the system resumes from suspend: [ 206.063325] usb 3-2.4: reset full-speed USB device number 4 using xhci_hcd [ 206.143691] usb 3-2.4: device descriptor read/64, error -32 [ 206.351671] usb 3-2.4: device descriptor read/64, error -32 Info for this hub: T: Bus=03 Lev=01 Prnt=01 Port=01 Cnt=01 Dev#= 2 Spd=480 MxCh= 4 D: Ver= 2.00 Cls=09(hub ) Sub=00 Prot=01 MxPS=64 #Cfgs= 1 P: Vendor=1a40 ProdID=0101 Rev=01.11 S: Product=USB 2.0 Hub C: #Ifs= 1 Cfg#= 1 Atr=e0 MxPwr=100mA I: If#= 0 Alt= 0 #EPs= 1 Cls=09(hub ) Sub=00 Prot=00 Driver=hub Some expirements indicate that the USB devices connected to the hub are innocent, it's the hub itself is to blame. The hub needs extra delay time after it resets its port. Hence wait for extra delay, if the device is connected to this quirky hub. Signed-off-by: Kai-Heng Feng Cc: stable Acked-by: Alan Stern Signed-off-by: Greg Kroah-Hartman --- Documentation/admin-guide/kernel-parameters.txt | 2 ++ drivers/usb/core/hub.c | 14 +++++++++++--- drivers/usb/core/quirks.c | 6 ++++++ include/linux/usb/quirks.h | 3 +++ 4 files changed, 22 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 81d1d5a747280..19f4423e70d91 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4713,6 +4713,8 @@ prevent spurious wakeup); n = USB_QUIRK_DELAY_CTRL_MSG (Device needs a pause after every control message); + o = USB_QUIRK_HUB_SLOW_RESET (Hub needs extra + delay after resetting its port); Example: quirks=0781:5580:bk,0a5c:5834:gij usbhid.mousepoll= diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c index c6077d582d296..d9bd7576786a7 100644 --- a/drivers/usb/core/hub.c +++ b/drivers/usb/core/hub.c @@ -2794,6 +2794,7 @@ static int hub_port_reset(struct usb_hub *hub, int port1, int i, status; u16 portchange, portstatus; struct usb_port *port_dev = hub->ports[port1 - 1]; + int reset_recovery_time; if (!hub_is_superspeed(hub->hdev)) { if (warm) { @@ -2885,11 +2886,18 @@ static int hub_port_reset(struct usb_hub *hub, int port1, done: if (status == 0) { - /* TRSTRCY = 10 ms; plus some extra */ if (port_dev->quirks & USB_PORT_QUIRK_FAST_ENUM) usleep_range(10000, 12000); - else - msleep(10 + 40); + else { + /* TRSTRCY = 10 ms; plus some extra */ + reset_recovery_time = 10 + 40; + + /* Hub needs extra delay after resetting its port. */ + if (hub->hdev->quirks & USB_QUIRK_HUB_SLOW_RESET) + reset_recovery_time += 100; + + msleep(reset_recovery_time); + } if (udev) { struct usb_hcd *hcd = bus_to_hcd(udev->bus); diff --git a/drivers/usb/core/quirks.c b/drivers/usb/core/quirks.c index 178d6c6063c02..4d7d948eae636 100644 --- a/drivers/usb/core/quirks.c +++ b/drivers/usb/core/quirks.c @@ -128,6 +128,9 @@ static int quirks_param_set(const char *val, const struct kernel_param *kp) case 'n': flags |= USB_QUIRK_DELAY_CTRL_MSG; break; + case 'o': + flags |= USB_QUIRK_HUB_SLOW_RESET; + break; /* Ignore unrecognized flag characters */ } } @@ -380,6 +383,9 @@ static const struct usb_device_id usb_quirk_list[] = { { USB_DEVICE(0x1a0a, 0x0200), .driver_info = USB_QUIRK_LINEAR_UFRAME_INTR_BINTERVAL }, + /* Terminus Technology Inc. Hub */ + { USB_DEVICE(0x1a40, 0x0101), .driver_info = USB_QUIRK_HUB_SLOW_RESET }, + /* Corsair K70 RGB */ { USB_DEVICE(0x1b1c, 0x1b13), .driver_info = USB_QUIRK_DELAY_INIT }, diff --git a/include/linux/usb/quirks.h b/include/linux/usb/quirks.h index b7a99ce56bc9a..a1be64c9940fb 100644 --- a/include/linux/usb/quirks.h +++ b/include/linux/usb/quirks.h @@ -66,4 +66,7 @@ /* Device needs a pause after every control message. */ #define USB_QUIRK_DELAY_CTRL_MSG BIT(13) +/* Hub needs extra delay after resetting its port. */ +#define USB_QUIRK_HUB_SLOW_RESET BIT(14) + #endif /* __LINUX_USB_QUIRKS_H */ -- cgit v1.2.3 From 8d72ee3266f08fd3490011ee7b95ffc31e66fd6d Mon Sep 17 00:00:00 2001 From: Viresh Kumar Date: Mon, 29 Oct 2018 12:47:40 +0530 Subject: Documentation: cpu-freq: Frequencies aren't always sorted The order in which the frequencies are displayed in cpufreq stats depends on the order in which the frequencies were sorted in the frequency table provided to cpufreq core by the cpufreq driver. They can be completely unsorted as well. The documentation's claim that the stats will be sorted in descending order is hence incorrect and here is an attempt to fix it. Reported-by: Pavel Signed-off-by: Viresh Kumar Signed-off-by: Rafael J. Wysocki --- Documentation/cpu-freq/cpufreq-stats.txt | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/cpu-freq/cpufreq-stats.txt b/Documentation/cpu-freq/cpufreq-stats.txt index a873855c811d6..14378cecb1723 100644 --- a/Documentation/cpu-freq/cpufreq-stats.txt +++ b/Documentation/cpu-freq/cpufreq-stats.txt @@ -86,9 +86,11 @@ transitions. This will give a fine grained information about all the CPU frequency transitions. The cat output here is a two dimensional matrix, where an entry (row i, column j) represents the count of number of transitions from -Freq_i to Freq_j. Freq_i is in descending order with increasing rows and -Freq_j is in descending order with increasing columns. The output here also -contains the actual freq values for each row and column for better readability. +Freq_i to Freq_j. Freq_i rows and Freq_j columns follow the sorting order in +which the driver has provided the frequency table initially to the cpufreq core +and so can be sorted (ascending or descending) or unsorted. The output here +also contains the actual freq values for each row and column for better +readability. If the transition table is bigger than PAGE_SIZE, reading this will return an -EFBIG error. -- cgit v1.2.3 From e531efa1e92b888acce45785b3ae69278fa712c0 Mon Sep 17 00:00:00 2001 From: Zhao Wei Liew Date: Thu, 1 Nov 2018 01:32:46 +0000 Subject: Documentation: cpufreq: Correct a typo Fix a typo in the admin-guide documentation for cpufreq. Signed-off-by: Zhao Wei Liew Acked-by: Viresh Kumar Signed-off-by: Rafael J. Wysocki --- Documentation/admin-guide/pm/cpufreq.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst index 47153e64dfb53..7eca9026a9ed2 100644 --- a/Documentation/admin-guide/pm/cpufreq.rst +++ b/Documentation/admin-guide/pm/cpufreq.rst @@ -150,7 +150,7 @@ data structures necessary to handle the given policy and, possibly, to add a governor ``sysfs`` interface to it. Next, the governor is started by invoking its ``->start()`` callback. -That callback it expected to register per-CPU utilization update callbacks for +That callback is expected to register per-CPU utilization update callbacks for all of the online CPUs belonging to the given policy with the CPU scheduler. The utilization update callbacks will be invoked by the CPU scheduler on important events, like task enqueue and dequeue, on every iteration of the -- cgit v1.2.3 From 406e7f986b2e4499295351bcfff5c93b0d34022a Mon Sep 17 00:00:00 2001 From: Geert Uytterhoeven Date: Wed, 7 Nov 2018 14:45:24 +0100 Subject: Documentation: ABI: led-trigger-pattern: Fix typos - Spelling s/brigntess/brightness/, - Double "use". Signed-off-by: Geert Uytterhoeven Acked-by: Pavel Machek Signed-off-by: Jacek Anaszewski --- Documentation/ABI/testing/sysfs-class-led-trigger-pattern | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/ABI/testing/sysfs-class-led-trigger-pattern b/Documentation/ABI/testing/sysfs-class-led-trigger-pattern index fb3d1e03b8819..1e5d172e06462 100644 --- a/Documentation/ABI/testing/sysfs-class-led-trigger-pattern +++ b/Documentation/ABI/testing/sysfs-class-led-trigger-pattern @@ -37,8 +37,8 @@ Description: 0-| / \/ \/ +---0----1----2----3----4----5----6------------> time (s) - 2. To make the LED go instantly from one brigntess value to another, - we should use use zero-time lengths (the brightness must be same as + 2. To make the LED go instantly from one brightness value to another, + we should use zero-time lengths (the brightness must be same as the previous tuple's). So the format should be: "brightness_1 duration_1 brightness_1 0 brightness_2 duration_2 brightness_2 0 ...". For example: -- cgit v1.2.3 From aeaf6a4b2d9ed4373e39a64c1c10cb1d93dd6bd1 Mon Sep 17 00:00:00 2001 From: Sudeep Holla Date: Wed, 7 Nov 2018 17:40:58 +0000 Subject: dt-bindings: cpufreq: remove stale arm_big_little_dt entry Most of the ARM platforms used v2 OPP bindings to support big-little configurations. This arm_big_little_dt binding is incomplete and was never used. Commit f174e49e4906 (cpufreq: remove unused arm_big_little_dt driver) removed the driver supporting this binding, but the binding was left unnoticed, so let's get rid of it now. Fixes: f174e49e4906 (cpufreq: remove unused arm_big_little_dt driver) Signed-off-by: Sudeep Holla Acked-by: Viresh Kumar Signed-off-by: Rafael J. Wysocki --- .../bindings/cpufreq/arm_big_little_dt.txt | 65 ---------------------- 1 file changed, 65 deletions(-) delete mode 100644 Documentation/devicetree/bindings/cpufreq/arm_big_little_dt.txt (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/cpufreq/arm_big_little_dt.txt b/Documentation/devicetree/bindings/cpufreq/arm_big_little_dt.txt deleted file mode 100644 index 2aa06ac0fac59..0000000000000 --- a/Documentation/devicetree/bindings/cpufreq/arm_big_little_dt.txt +++ /dev/null @@ -1,65 +0,0 @@ -Generic ARM big LITTLE cpufreq driver's DT glue ------------------------------------------------ - -This is DT specific glue layer for generic cpufreq driver for big LITTLE -systems. - -Both required and optional properties listed below must be defined -under node /cpus/cpu@x. Where x is the first cpu inside a cluster. - -FIXME: Cpus should boot in the order specified in DT and all cpus for a cluster -must be present contiguously. Generic DT driver will check only node 'x' for -cpu:x. - -Required properties: -- operating-points: Refer to Documentation/devicetree/bindings/opp/opp.txt - for details - -Optional properties: -- clock-latency: Specify the possible maximum transition latency for clock, - in unit of nanoseconds. - -Examples: - -cpus { - #address-cells = <1>; - #size-cells = <0>; - - cpu@0 { - compatible = "arm,cortex-a15"; - reg = <0>; - next-level-cache = <&L2>; - operating-points = < - /* kHz uV */ - 792000 1100000 - 396000 950000 - 198000 850000 - >; - clock-latency = <61036>; /* two CLK32 periods */ - }; - - cpu@1 { - compatible = "arm,cortex-a15"; - reg = <1>; - next-level-cache = <&L2>; - }; - - cpu@100 { - compatible = "arm,cortex-a7"; - reg = <100>; - next-level-cache = <&L2>; - operating-points = < - /* kHz uV */ - 792000 950000 - 396000 750000 - 198000 450000 - >; - clock-latency = <61036>; /* two CLK32 periods */ - }; - - cpu@101 { - compatible = "arm,cortex-a7"; - reg = <101>; - next-level-cache = <&L2>; - }; -}; -- cgit v1.2.3 From 4f145f14f6b98b5aa0dd91bdae518b3f24f74b37 Mon Sep 17 00:00:00 2001 From: Eugeniu Rosca Date: Mon, 20 Aug 2018 16:49:10 +0200 Subject: dt-bindings: can: rcar_can: document r8a77965 support Document the support for rcar_can on R8A77965 SoC devices. Add R8A77965 to the list of SoCs which require the "assigned-clocks" and "assigned-clock-rates" properties (thanks, Sergei). Signed-off-by: Eugeniu Rosca Reviewed-by: Simon Horman Reviewed-by: Kieran Bingham Reviewed-by: Rob Herring Signed-off-by: Marc Kleine-Budde --- Documentation/devicetree/bindings/net/can/rcar_can.txt | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/can/rcar_can.txt b/Documentation/devicetree/bindings/net/can/rcar_can.txt index cc4372842bf37..47fc68148f38e 100644 --- a/Documentation/devicetree/bindings/net/can/rcar_can.txt +++ b/Documentation/devicetree/bindings/net/can/rcar_can.txt @@ -14,6 +14,7 @@ Required properties: "renesas,can-r8a7794" if CAN controller is a part of R8A7794 SoC. "renesas,can-r8a7795" if CAN controller is a part of R8A7795 SoC. "renesas,can-r8a7796" if CAN controller is a part of R8A7796 SoC. + "renesas,can-r8a77965" if CAN controller is a part of R8A77965 SoC. "renesas,rcar-gen1-can" for a generic R-Car Gen1 compatible device. "renesas,rcar-gen2-can" for a generic R-Car Gen2 or RZ/G1 compatible device. @@ -29,11 +30,10 @@ Required properties: - pinctrl-0: pin control group to be used for this controller. - pinctrl-names: must be "default". -Required properties for "renesas,can-r8a7795" and "renesas,can-r8a7796" -compatible: -In R8A7795 and R8A7796 SoCs, "clkp2" can be CANFD clock. This is a div6 clock -and can be used by both CAN and CAN FD controller at the same time. It needs to -be scaled to maximum frequency if any of these controllers use it. This is done +Required properties for R8A7795, R8A7796 and R8A77965: +For the denoted SoCs, "clkp2" can be CANFD clock. This is a div6 clock and can +be used by both CAN and CAN FD controller at the same time. It needs to be +scaled to maximum frequency if any of these controllers use it. This is done using the below properties: - assigned-clocks: phandle of clkp2(CANFD) clock. -- cgit v1.2.3 From 868b7c0f43e61f227bf3d7f7d6134bb3c67bb0e8 Mon Sep 17 00:00:00 2001 From: Fabrizio Castro Date: Mon, 10 Sep 2018 11:43:14 +0100 Subject: dt-bindings: can: rcar_can: Add r8a774a1 support Document RZ/G2M (r8a774a1) SoC specific bindings. Signed-off-by: Fabrizio Castro Signed-off-by: Chris Paterson Reviewed-by: Biju Das Reviewed-by: Rob Herring Reviewed-by: Simon Horman Signed-off-by: Marc Kleine-Budde --- Documentation/devicetree/bindings/net/can/rcar_can.txt | 18 +++++++++++++----- 1 file changed, 13 insertions(+), 5 deletions(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/can/rcar_can.txt b/Documentation/devicetree/bindings/net/can/rcar_can.txt index 47fc68148f38e..9936b9ee67c36 100644 --- a/Documentation/devicetree/bindings/net/can/rcar_can.txt +++ b/Documentation/devicetree/bindings/net/can/rcar_can.txt @@ -5,6 +5,7 @@ Required properties: - compatible: "renesas,can-r8a7743" if CAN controller is a part of R8A7743 SoC. "renesas,can-r8a7744" if CAN controller is a part of R8A7744 SoC. "renesas,can-r8a7745" if CAN controller is a part of R8A7745 SoC. + "renesas,can-r8a774a1" if CAN controller is a part of R8A774A1 SoC. "renesas,can-r8a7778" if CAN controller is a part of R8A7778 SoC. "renesas,can-r8a7779" if CAN controller is a part of R8A7779 SoC. "renesas,can-r8a7790" if CAN controller is a part of R8A7790 SoC. @@ -18,15 +19,21 @@ Required properties: "renesas,rcar-gen1-can" for a generic R-Car Gen1 compatible device. "renesas,rcar-gen2-can" for a generic R-Car Gen2 or RZ/G1 compatible device. - "renesas,rcar-gen3-can" for a generic R-Car Gen3 compatible device. + "renesas,rcar-gen3-can" for a generic R-Car Gen3 or RZ/G2 + compatible device. When compatible with the generic version, nodes must list the SoC-specific version corresponding to the platform first followed by the generic version. - reg: physical base address and size of the R-Car CAN register map. - interrupts: interrupt specifier for the sole interrupt. -- clocks: phandles and clock specifiers for 3 CAN clock inputs. -- clock-names: 3 clock input name strings: "clkp1", "clkp2", "can_clk". +- clocks: phandles and clock specifiers for 2 CAN clock inputs for RZ/G2 + devices. + phandles and clock specifiers for 3 CAN clock inputs for every other + SoC. +- clock-names: 2 clock input name strings for RZ/G2: "clkp1", "can_clk". + 3 clock input name strings for every other SoC: "clkp1", "clkp2", + "can_clk". - pinctrl-0: pin control group to be used for this controller. - pinctrl-names: must be "default". @@ -42,8 +49,9 @@ using the below properties: Optional properties: - renesas,can-clock-select: R-Car CAN Clock Source Select. Valid values are: <0x0> (default) : Peripheral clock (clkp1) - <0x1> : Peripheral clock (clkp2) - <0x3> : Externally input clock + <0x1> : Peripheral clock (clkp2) (not supported by + RZ/G2 devices) + <0x3> : External input clock Example ------- -- cgit v1.2.3 From f164d0204b1156a7e0d8d1622c1a8d25752befec Mon Sep 17 00:00:00 2001 From: Lukas Wunner Date: Sat, 27 Oct 2018 10:36:54 +0200 Subject: can: hi311x: Use level-triggered interrupt If the hi3110 shares the SPI bus with another traffic-intensive device and packets are received in high volume (by a separate machine sending with "cangen -g 0 -i -x"), reception stops after a few minutes and the counter in /proc/interrupts stops incrementing. Bus state is "active". Bringing the interface down and back up reconvenes the reception. The issue is not observed when the hi3110 is the sole device on the SPI bus. Using a level-triggered interrupt makes the issue go away and lets the hi3110 successfully receive 2 GByte over the course of 5 days while a ks8851 Ethernet chip on the same SPI bus handles 6 GByte of traffic. Unfortunately the hi3110 datasheet is mum on the trigger type. The pin description on page 3 only specifies the polarity (active high): http://www.holtic.com/documents/371-hi-3110_v-rev-kpdf.do Cc: Mathias Duckeck Cc: Akshay Bhat Cc: Casey Fitzpatrick Signed-off-by: Lukas Wunner Cc: linux-stable Signed-off-by: Marc Kleine-Budde --- Documentation/devicetree/bindings/net/can/holt_hi311x.txt | 2 +- drivers/net/can/spi/hi311x.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/can/holt_hi311x.txt b/Documentation/devicetree/bindings/net/can/holt_hi311x.txt index 903a78da65be2..3a9926f999370 100644 --- a/Documentation/devicetree/bindings/net/can/holt_hi311x.txt +++ b/Documentation/devicetree/bindings/net/can/holt_hi311x.txt @@ -17,7 +17,7 @@ Example: reg = <1>; clocks = <&clk32m>; interrupt-parent = <&gpio4>; - interrupts = <13 IRQ_TYPE_EDGE_RISING>; + interrupts = <13 IRQ_TYPE_LEVEL_HIGH>; vdd-supply = <®5v0>; xceiver-supply = <®5v0>; }; diff --git a/drivers/net/can/spi/hi311x.c b/drivers/net/can/spi/hi311x.c index 53e320c92a8be..ddaf46239e39e 100644 --- a/drivers/net/can/spi/hi311x.c +++ b/drivers/net/can/spi/hi311x.c @@ -760,7 +760,7 @@ static int hi3110_open(struct net_device *net) { struct hi3110_priv *priv = netdev_priv(net); struct spi_device *spi = priv->spi; - unsigned long flags = IRQF_ONESHOT | IRQF_TRIGGER_RISING; + unsigned long flags = IRQF_ONESHOT | IRQF_TRIGGER_HIGH; int ret; ret = open_candev(net); -- cgit v1.2.3 From ab214c48387aaaadcf8246b4b00d8b78286fde1b Mon Sep 17 00:00:00 2001 From: Vignesh R Date: Fri, 9 Nov 2018 16:44:10 +0530 Subject: dt-bindings: i2c: omap: Add new compatible for AM654 SoCs AM654 SoCs have same I2C IP as OMAP SoCs. Add new compatible to handle AM654 SoCs. While at that reformat the existing compatible list for older SoCs to list one valid compatible per line. Signed-off-by: Vignesh R Reviewed-by: Rob Herring Reviewed-by: Grygorii Strashko Signed-off-by: Wolfram Sang --- Documentation/devicetree/bindings/i2c/i2c-omap.txt | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/i2c/i2c-omap.txt b/Documentation/devicetree/bindings/i2c/i2c-omap.txt index 7e49839d41249..4b90ba9f31b70 100644 --- a/Documentation/devicetree/bindings/i2c/i2c-omap.txt +++ b/Documentation/devicetree/bindings/i2c/i2c-omap.txt @@ -1,8 +1,12 @@ I2C for OMAP platforms Required properties : -- compatible : Must be "ti,omap2420-i2c", "ti,omap2430-i2c", "ti,omap3-i2c" - or "ti,omap4-i2c" +- compatible : Must be + "ti,omap2420-i2c" for OMAP2420 SoCs + "ti,omap2430-i2c" for OMAP2430 SoCs + "ti,omap3-i2c" for OMAP3 SoCs + "ti,omap4-i2c" for OMAP4+ SoCs + "ti,am654-i2c", "ti,omap4-i2c" for AM654 SoCs - ti,hwmods : Must be "i2c", n being the instance number (1-based) - #address-cells = <1>; - #size-cells = <0>; -- cgit v1.2.3 From c71bcdcb42a7493348d3b45dee8139843bf45efc Mon Sep 17 00:00:00 2001 From: Ajay Gupta Date: Fri, 26 Oct 2018 09:36:58 -0700 Subject: i2c: add i2c bus driver for NVIDIA GPU Latest NVIDIA GPU card has USB Type-C interface. There is a Type-C controller which can be accessed over I2C. This driver adds I2C bus driver to communicate with Type-C controller. I2C client driver will be part of USB Type-C UCSI driver. Signed-off-by: Ajay Gupta Reviewed-by: Andy Shevchenko [wsa: kept Makefile sorting] Signed-off-by: Wolfram Sang --- Documentation/i2c/busses/i2c-nvidia-gpu | 18 ++ MAINTAINERS | 7 + drivers/i2c/busses/Kconfig | 9 + drivers/i2c/busses/Makefile | 1 + drivers/i2c/busses/i2c-nvidia-gpu.c | 368 ++++++++++++++++++++++++++++++++ 5 files changed, 403 insertions(+) create mode 100644 Documentation/i2c/busses/i2c-nvidia-gpu create mode 100644 drivers/i2c/busses/i2c-nvidia-gpu.c (limited to 'Documentation') diff --git a/Documentation/i2c/busses/i2c-nvidia-gpu b/Documentation/i2c/busses/i2c-nvidia-gpu new file mode 100644 index 0000000000000..31884d2b2eb5a --- /dev/null +++ b/Documentation/i2c/busses/i2c-nvidia-gpu @@ -0,0 +1,18 @@ +Kernel driver i2c-nvidia-gpu + +Datasheet: not publicly available. + +Authors: + Ajay Gupta + +Description +----------- + +i2c-nvidia-gpu is a driver for I2C controller included in NVIDIA Turing +and later GPUs and it is used to communicate with Type-C controller on GPUs. + +If your 'lspci -v' listing shows something like the following, + +01:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad9 (rev a1) + +then this driver should support the I2C controller of your GPU. diff --git a/MAINTAINERS b/MAINTAINERS index 259fe9445d2fb..1835d3ffd7ea3 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -6861,6 +6861,13 @@ L: linux-acpi@vger.kernel.org S: Maintained F: drivers/i2c/i2c-core-acpi.c +I2C CONTROLLER DRIVER FOR NVIDIA GPU +M: Ajay Gupta +L: linux-i2c@vger.kernel.org +S: Maintained +F: Documentation/i2c/busses/i2c-nvidia-gpu +F: drivers/i2c/busses/i2c-nvidia-gpu.c + I2C MUXES M: Peter Rosin L: linux-i2c@vger.kernel.org diff --git a/drivers/i2c/busses/Kconfig b/drivers/i2c/busses/Kconfig index 77dc94b440117..f2c6819712013 100644 --- a/drivers/i2c/busses/Kconfig +++ b/drivers/i2c/busses/Kconfig @@ -224,6 +224,15 @@ config I2C_NFORCE2_S4985 This driver can also be built as a module. If so, the module will be called i2c-nforce2-s4985. +config I2C_NVIDIA_GPU + tristate "NVIDIA GPU I2C controller" + depends on PCI + help + If you say yes to this option, support will be included for the + NVIDIA GPU I2C controller which is used to communicate with the GPU's + Type-C controller. This driver can also be built as a module called + i2c-nvidia-gpu. + config I2C_SIS5595 tristate "SiS 5595" depends on PCI diff --git a/drivers/i2c/busses/Makefile b/drivers/i2c/busses/Makefile index 18b26af82b1c5..5f0cb6915969a 100644 --- a/drivers/i2c/busses/Makefile +++ b/drivers/i2c/busses/Makefile @@ -19,6 +19,7 @@ obj-$(CONFIG_I2C_ISCH) += i2c-isch.o obj-$(CONFIG_I2C_ISMT) += i2c-ismt.o obj-$(CONFIG_I2C_NFORCE2) += i2c-nforce2.o obj-$(CONFIG_I2C_NFORCE2_S4985) += i2c-nforce2-s4985.o +obj-$(CONFIG_I2C_NVIDIA_GPU) += i2c-nvidia-gpu.o obj-$(CONFIG_I2C_PIIX4) += i2c-piix4.o obj-$(CONFIG_I2C_SIS5595) += i2c-sis5595.o obj-$(CONFIG_I2C_SIS630) += i2c-sis630.o diff --git a/drivers/i2c/busses/i2c-nvidia-gpu.c b/drivers/i2c/busses/i2c-nvidia-gpu.c new file mode 100644 index 0000000000000..744f5e42636b2 --- /dev/null +++ b/drivers/i2c/busses/i2c-nvidia-gpu.c @@ -0,0 +1,368 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Nvidia GPU I2C controller Driver + * + * Copyright (C) 2018 NVIDIA Corporation. All rights reserved. + * Author: Ajay Gupta + */ +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +/* I2C definitions */ +#define I2C_MST_CNTL 0x00 +#define I2C_MST_CNTL_GEN_START BIT(0) +#define I2C_MST_CNTL_GEN_STOP BIT(1) +#define I2C_MST_CNTL_CMD_READ (1 << 2) +#define I2C_MST_CNTL_CMD_WRITE (2 << 2) +#define I2C_MST_CNTL_BURST_SIZE_SHIFT 6 +#define I2C_MST_CNTL_GEN_NACK BIT(28) +#define I2C_MST_CNTL_STATUS GENMASK(30, 29) +#define I2C_MST_CNTL_STATUS_OKAY (0 << 29) +#define I2C_MST_CNTL_STATUS_NO_ACK (1 << 29) +#define I2C_MST_CNTL_STATUS_TIMEOUT (2 << 29) +#define I2C_MST_CNTL_STATUS_BUS_BUSY (3 << 29) +#define I2C_MST_CNTL_CYCLE_TRIGGER BIT(31) + +#define I2C_MST_ADDR 0x04 + +#define I2C_MST_I2C0_TIMING 0x08 +#define I2C_MST_I2C0_TIMING_SCL_PERIOD_100KHZ 0x10e +#define I2C_MST_I2C0_TIMING_TIMEOUT_CLK_CNT 16 +#define I2C_MST_I2C0_TIMING_TIMEOUT_CLK_CNT_MAX 255 +#define I2C_MST_I2C0_TIMING_TIMEOUT_CHECK BIT(24) + +#define I2C_MST_DATA 0x0c + +#define I2C_MST_HYBRID_PADCTL 0x20 +#define I2C_MST_HYBRID_PADCTL_MODE_I2C BIT(0) +#define I2C_MST_HYBRID_PADCTL_I2C_SCL_INPUT_RCV BIT(14) +#define I2C_MST_HYBRID_PADCTL_I2C_SDA_INPUT_RCV BIT(15) + +struct gpu_i2c_dev { + struct device *dev; + void __iomem *regs; + struct i2c_adapter adapter; + struct i2c_board_info *gpu_ccgx_ucsi; +}; + +static void gpu_enable_i2c_bus(struct gpu_i2c_dev *i2cd) +{ + u32 val; + + /* enable I2C */ + val = readl(i2cd->regs + I2C_MST_HYBRID_PADCTL); + val |= I2C_MST_HYBRID_PADCTL_MODE_I2C | + I2C_MST_HYBRID_PADCTL_I2C_SCL_INPUT_RCV | + I2C_MST_HYBRID_PADCTL_I2C_SDA_INPUT_RCV; + writel(val, i2cd->regs + I2C_MST_HYBRID_PADCTL); + + /* enable 100KHZ mode */ + val = I2C_MST_I2C0_TIMING_SCL_PERIOD_100KHZ; + val |= (I2C_MST_I2C0_TIMING_TIMEOUT_CLK_CNT_MAX + << I2C_MST_I2C0_TIMING_TIMEOUT_CLK_CNT); + val |= I2C_MST_I2C0_TIMING_TIMEOUT_CHECK; + writel(val, i2cd->regs + I2C_MST_I2C0_TIMING); +} + +static int gpu_i2c_check_status(struct gpu_i2c_dev *i2cd) +{ + unsigned long target = jiffies + msecs_to_jiffies(1000); + u32 val; + + do { + val = readl(i2cd->regs + I2C_MST_CNTL); + if (!(val & I2C_MST_CNTL_CYCLE_TRIGGER)) + break; + if ((val & I2C_MST_CNTL_STATUS) != + I2C_MST_CNTL_STATUS_BUS_BUSY) + break; + usleep_range(500, 600); + } while (time_is_after_jiffies(target)); + + if (time_is_before_jiffies(target)) { + dev_err(i2cd->dev, "i2c timeout error %x\n", val); + return -ETIME; + } + + val = readl(i2cd->regs + I2C_MST_CNTL); + switch (val & I2C_MST_CNTL_STATUS) { + case I2C_MST_CNTL_STATUS_OKAY: + return 0; + case I2C_MST_CNTL_STATUS_NO_ACK: + return -EIO; + case I2C_MST_CNTL_STATUS_TIMEOUT: + return -ETIME; + default: + return 0; + } +} + +static int gpu_i2c_read(struct gpu_i2c_dev *i2cd, u8 *data, u16 len) +{ + int status; + u32 val; + + val = I2C_MST_CNTL_GEN_START | I2C_MST_CNTL_CMD_READ | + (len << I2C_MST_CNTL_BURST_SIZE_SHIFT) | + I2C_MST_CNTL_CYCLE_TRIGGER | I2C_MST_CNTL_GEN_NACK; + writel(val, i2cd->regs + I2C_MST_CNTL); + + status = gpu_i2c_check_status(i2cd); + if (status < 0) + return status; + + val = readl(i2cd->regs + I2C_MST_DATA); + switch (len) { + case 1: + data[0] = val; + break; + case 2: + put_unaligned_be16(val, data); + break; + case 3: + put_unaligned_be16(val >> 8, data); + data[2] = val; + break; + case 4: + put_unaligned_be32(val, data); + break; + default: + break; + } + return status; +} + +static int gpu_i2c_start(struct gpu_i2c_dev *i2cd) +{ + writel(I2C_MST_CNTL_GEN_START, i2cd->regs + I2C_MST_CNTL); + return gpu_i2c_check_status(i2cd); +} + +static int gpu_i2c_stop(struct gpu_i2c_dev *i2cd) +{ + writel(I2C_MST_CNTL_GEN_STOP, i2cd->regs + I2C_MST_CNTL); + return gpu_i2c_check_status(i2cd); +} + +static int gpu_i2c_write(struct gpu_i2c_dev *i2cd, u8 data) +{ + u32 val; + + writel(data, i2cd->regs + I2C_MST_DATA); + + val = I2C_MST_CNTL_CMD_WRITE | (1 << I2C_MST_CNTL_BURST_SIZE_SHIFT); + writel(val, i2cd->regs + I2C_MST_CNTL); + + return gpu_i2c_check_status(i2cd); +} + +static int gpu_i2c_master_xfer(struct i2c_adapter *adap, + struct i2c_msg *msgs, int num) +{ + struct gpu_i2c_dev *i2cd = i2c_get_adapdata(adap); + int status, status2; + int i, j; + + /* + * The controller supports maximum 4 byte read due to known + * limitation of sending STOP after every read. + */ + for (i = 0; i < num; i++) { + if (msgs[i].flags & I2C_M_RD) { + /* program client address before starting read */ + writel(msgs[i].addr, i2cd->regs + I2C_MST_ADDR); + /* gpu_i2c_read has implicit start */ + status = gpu_i2c_read(i2cd, msgs[i].buf, msgs[i].len); + if (status < 0) + goto stop; + } else { + u8 addr = i2c_8bit_addr_from_msg(msgs + i); + + status = gpu_i2c_start(i2cd); + if (status < 0) { + if (i == 0) + return status; + goto stop; + } + + status = gpu_i2c_write(i2cd, addr); + if (status < 0) + goto stop; + + for (j = 0; j < msgs[i].len; j++) { + status = gpu_i2c_write(i2cd, msgs[i].buf[j]); + if (status < 0) + goto stop; + } + } + } + status = gpu_i2c_stop(i2cd); + if (status < 0) + return status; + + return i; +stop: + status2 = gpu_i2c_stop(i2cd); + if (status2 < 0) + dev_err(i2cd->dev, "i2c stop failed %d\n", status2); + return status; +} + +static const struct i2c_adapter_quirks gpu_i2c_quirks = { + .max_read_len = 4, + .flags = I2C_AQ_COMB_WRITE_THEN_READ, +}; + +static u32 gpu_i2c_functionality(struct i2c_adapter *adap) +{ + return I2C_FUNC_I2C | I2C_FUNC_SMBUS_EMUL; +} + +static const struct i2c_algorithm gpu_i2c_algorithm = { + .master_xfer = gpu_i2c_master_xfer, + .functionality = gpu_i2c_functionality, +}; + +/* + * This driver is for Nvidia GPU cards with USB Type-C interface. + * We want to identify the cards using vendor ID and class code only + * to avoid dependency of adding product id for any new card which + * requires this driver. + * Currently there is no class code defined for UCSI device over PCI + * so using UNKNOWN class for now and it will be updated when UCSI + * over PCI gets a class code. + * There is no other NVIDIA cards with UNKNOWN class code. Even if the + * driver gets loaded for an undesired card then eventually i2c_read() + * (initiated from UCSI i2c_client) will timeout or UCSI commands will + * timeout. + */ +#define PCI_CLASS_SERIAL_UNKNOWN 0x0c80 +static const struct pci_device_id gpu_i2c_ids[] = { + { PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID, PCI_ANY_ID, PCI_ANY_ID, + PCI_CLASS_SERIAL_UNKNOWN << 8, 0xffffff00}, + { } +}; +MODULE_DEVICE_TABLE(pci, gpu_i2c_ids); + +static int gpu_populate_client(struct gpu_i2c_dev *i2cd, int irq) +{ + struct i2c_client *ccgx_client; + + i2cd->gpu_ccgx_ucsi = devm_kzalloc(i2cd->dev, + sizeof(*i2cd->gpu_ccgx_ucsi), + GFP_KERNEL); + if (!i2cd->gpu_ccgx_ucsi) + return -ENOMEM; + + strlcpy(i2cd->gpu_ccgx_ucsi->type, "ccgx-ucsi", + sizeof(i2cd->gpu_ccgx_ucsi->type)); + i2cd->gpu_ccgx_ucsi->addr = 0x8; + i2cd->gpu_ccgx_ucsi->irq = irq; + ccgx_client = i2c_new_device(&i2cd->adapter, i2cd->gpu_ccgx_ucsi); + if (!ccgx_client) + return -ENODEV; + + return 0; +} + +static int gpu_i2c_probe(struct pci_dev *pdev, const struct pci_device_id *id) +{ + struct gpu_i2c_dev *i2cd; + int status; + + i2cd = devm_kzalloc(&pdev->dev, sizeof(*i2cd), GFP_KERNEL); + if (!i2cd) + return -ENOMEM; + + i2cd->dev = &pdev->dev; + dev_set_drvdata(&pdev->dev, i2cd); + + status = pcim_enable_device(pdev); + if (status < 0) { + dev_err(&pdev->dev, "pcim_enable_device failed %d\n", status); + return status; + } + + pci_set_master(pdev); + + i2cd->regs = pcim_iomap(pdev, 0, 0); + if (!i2cd->regs) { + dev_err(&pdev->dev, "pcim_iomap failed\n"); + return -ENOMEM; + } + + status = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_MSI); + if (status < 0) { + dev_err(&pdev->dev, "pci_alloc_irq_vectors err %d\n", status); + return status; + } + + gpu_enable_i2c_bus(i2cd); + + i2c_set_adapdata(&i2cd->adapter, i2cd); + i2cd->adapter.owner = THIS_MODULE; + strlcpy(i2cd->adapter.name, "NVIDIA GPU I2C adapter", + sizeof(i2cd->adapter.name)); + i2cd->adapter.algo = &gpu_i2c_algorithm; + i2cd->adapter.quirks = &gpu_i2c_quirks; + i2cd->adapter.dev.parent = &pdev->dev; + status = i2c_add_adapter(&i2cd->adapter); + if (status < 0) + goto free_irq_vectors; + + status = gpu_populate_client(i2cd, pdev->irq); + if (status < 0) { + dev_err(&pdev->dev, "gpu_populate_client failed %d\n", status); + goto del_adapter; + } + + return 0; + +del_adapter: + i2c_del_adapter(&i2cd->adapter); +free_irq_vectors: + pci_free_irq_vectors(pdev); + return status; +} + +static void gpu_i2c_remove(struct pci_dev *pdev) +{ + struct gpu_i2c_dev *i2cd = dev_get_drvdata(&pdev->dev); + + i2c_del_adapter(&i2cd->adapter); + pci_free_irq_vectors(pdev); +} + +static int gpu_i2c_resume(struct device *dev) +{ + struct gpu_i2c_dev *i2cd = dev_get_drvdata(dev); + + gpu_enable_i2c_bus(i2cd); + return 0; +} + +UNIVERSAL_DEV_PM_OPS(gpu_i2c_driver_pm, NULL, gpu_i2c_resume, NULL); + +static struct pci_driver gpu_i2c_driver = { + .name = "nvidia-gpu", + .id_table = gpu_i2c_ids, + .probe = gpu_i2c_probe, + .remove = gpu_i2c_remove, + .driver = { + .pm = &gpu_i2c_driver_pm, + }, +}; + +module_pci_driver(gpu_i2c_driver); + +MODULE_AUTHOR("Ajay Gupta "); +MODULE_DESCRIPTION("Nvidia GPU I2C controller Driver"); +MODULE_LICENSE("GPL v2"); -- cgit v1.2.3 From 7243ec72b9024b1ae98ed571db559a7cf0c5a0f4 Mon Sep 17 00:00:00 2001 From: Douglas Anderson Date: Fri, 12 Oct 2018 14:36:32 -0700 Subject: dt-bindings: phy-qcom-qmp: Fix several mistakes from prior commits Digging through the "phy-qcom-qmp" showed me many inconsistencies between the bindings and the reality of the driver. Let's fix them all. * In commit 2d66eab18375 ("dt-bindings: phy: qmp: Add support for QMP phy in IPQ8074") we probably should have explicitly listed that there are no clocks for this PHY and also added the reset names in alphabetical order. You can see that there are no clocks in the driver where "clk_list" is NULL. * In commit 8587b220f05e ("dt-bindings: phy-qcom-qmp: Update bindings for QMP V3 USB PHY") we probably should have listed the resets for this new PHY and also removed the "(Optional)" marking for the "cfg" reset since PHYs that need "cfg" really do need it. It's just that not all PHYs need it. * In commit 7f0802074120 ("dt-bindings: phy-qcom-qmp: Update bindings for sdm845") we forgot to update one instance of the string "qcom,qmp-v3-usb3-phy" to be "qcom,sdm845-qmp-usb3-phy". Let's fix that. We should also have added "qcom,sdm845-qmp-usb3-uni-phy" to the clock-names and reset-names lists. * In commit 99c7c7364b71 ("dt-bindings: phy-qcom-qmp: Add UFS phy compatible string for sdm845") we should have added the set of clocks and resets for "qcom,sdm845-qmp-ufs-phy". These were taken from the driver. * Cleanup the wording for what properties child nodes have to make it more obvious which types of PHYs need clocks and resets. This was sorta implicit in the "-names" description but I found myself confused. * As per the code not all "pcie qmp phys" have resets. Specifically note that the "has_lane_rst" property in the driver is false for "ipq8074-qmp-pcie-phy". Thus make it clear exactly which PHYs need child nodes with resets. Signed-off-by: Douglas Anderson Reviewed-by: Evan Green Reviewed-by: Rob Herring Signed-off-by: Kishon Vijay Abraham I --- .../devicetree/bindings/phy/qcom-qmp-phy.txt | 31 ++++++++++++++++------ 1 file changed, 23 insertions(+), 8 deletions(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/phy/qcom-qmp-phy.txt b/Documentation/devicetree/bindings/phy/qcom-qmp-phy.txt index adf20b2bdf715..fbc198d5dd39e 100644 --- a/Documentation/devicetree/bindings/phy/qcom-qmp-phy.txt +++ b/Documentation/devicetree/bindings/phy/qcom-qmp-phy.txt @@ -40,24 +40,36 @@ Required properties: "ref" for 19.2 MHz ref clk, "com_aux" for phy common block aux clock, "ref_aux" for phy reference aux clock, + + For "qcom,ipq8074-qmp-pcie-phy": no clocks are listed. For "qcom,msm8996-qmp-pcie-phy" must contain: "aux", "cfg_ahb", "ref". For "qcom,msm8996-qmp-usb3-phy" must contain: "aux", "cfg_ahb", "ref". - For "qcom,qmp-v3-usb3-phy" must contain: + For "qcom,sdm845-qmp-usb3-phy" must contain: + "aux", "cfg_ahb", "ref", "com_aux". + For "qcom,sdm845-qmp-usb3-uni-phy" must contain: "aux", "cfg_ahb", "ref", "com_aux". + For "qcom,sdm845-qmp-ufs-phy" must contain: + "ref", "ref_aux". - resets: a list of phandles and reset controller specifier pairs, one for each entry in reset-names. - reset-names: "phy" for reset of phy block, "common" for phy common block reset, - "cfg" for phy's ahb cfg block reset (Optional). + "cfg" for phy's ahb cfg block reset. + + For "qcom,ipq8074-qmp-pcie-phy" must contain: + "phy", "common". For "qcom,msm8996-qmp-pcie-phy" must contain: - "phy", "common", "cfg". + "phy", "common", "cfg". For "qcom,msm8996-qmp-usb3-phy" must contain - "phy", "common". - For "qcom,ipq8074-qmp-pcie-phy" must contain: - "phy", "common". + "phy", "common". + For "qcom,sdm845-qmp-usb3-phy" must contain: + "phy", "common". + For "qcom,sdm845-qmp-usb3-uni-phy" must contain: + "phy", "common". + For "qcom,sdm845-qmp-ufs-phy": no resets are listed. - vdda-phy-supply: Phandle to a regulator supply to PHY core block. - vdda-pll-supply: Phandle to 1.8V regulator supply to PHY refclk pll block. @@ -79,9 +91,10 @@ Required properties for child node: - #phy-cells: must be 0 +Required properties child node of pcie and usb3 qmp phys: - clocks: a list of phandles and clock-specifier pairs, one for each entry in clock-names. - - clock-names: Must contain following for pcie and usb qmp phys: + - clock-names: Must contain following: "pipe" for pipe clock specific to each lane. - clock-output-names: Name of the PHY clock that will be the parent for the above pipe clock. @@ -91,9 +104,11 @@ Required properties for child node: (or) "pcie20_phy1_pipe_clk" +Required properties for child node of PHYs with lane reset, AKA: + "qcom,msm8996-qmp-pcie-phy" - resets: a list of phandles and reset controller specifier pairs, one for each entry in reset-names. - - reset-names: Must contain following for pcie qmp phys: + - reset-names: Must contain following: "lane" for reset specific to each lane. Example: -- cgit v1.2.3 From 6c4b88288abf908d6fe9fc71fdfeb69cb4135193 Mon Sep 17 00:00:00 2001 From: Ding Tao Date: Mon, 12 Nov 2018 11:27:11 -0800 Subject: Input: dt-bindings - fix a typo in file input-reset.txt Replace sysrq-reset-seq with keyset. Signed-off-by: Ding Tao Signed-off-by: Dmitry Torokhov --- Documentation/devicetree/bindings/input/input-reset.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/input/input-reset.txt b/Documentation/devicetree/bindings/input/input-reset.txt index 2bb2626fdb78b..1ca6cc5ebf8ed 100644 --- a/Documentation/devicetree/bindings/input/input-reset.txt +++ b/Documentation/devicetree/bindings/input/input-reset.txt @@ -12,7 +12,7 @@ The /chosen node should contain a 'linux,sysrq-reset-seq' child node to define a set of keys. Required property: -sysrq-reset-seq: array of Linux keycodes, one keycode per cell. +keyset: array of Linux keycodes, one keycode per cell. Optional property: timeout-ms: duration keys must be pressed together in milliseconds before -- cgit v1.2.3 From 7150ceaacb27f7b3bf494e72cd4be4e11612dfff Mon Sep 17 00:00:00 2001 From: David Howells Date: Mon, 12 Nov 2018 22:33:22 +0000 Subject: rxrpc: Fix life check The life-checking function, which is used by kAFS to make sure that a call is still live in the event of a pending signal, only samples the received packet serial number counter; it doesn't actually provoke a change in the counter, rather relying on the server to happen to give us a packet in the time window. Fix this by adding a function to force a ping to be transmitted. kAFS then keeps track of whether there's been a stall, and if so, uses the new function to ping the server, resetting the timeout to allow the reply to come back. If there's a stall, a ping and the call is *still* stalled in the same place after another period, then the call will be aborted. Fixes: bc5e3a546d55 ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals") Fixes: f4d15fb6f99a ("rxrpc: Provide functions for allowing cleaner handling of signals") Signed-off-by: David Howells Signed-off-by: David S. Miller --- Documentation/networking/rxrpc.txt | 17 +++++++++++------ fs/afs/rxrpc.c | 11 ++++++++++- include/net/af_rxrpc.h | 3 ++- include/trace/events/rxrpc.h | 2 ++ net/rxrpc/af_rxrpc.c | 27 +++++++++++++++++++++++---- 5 files changed, 48 insertions(+), 12 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/rxrpc.txt b/Documentation/networking/rxrpc.txt index 605e00cdd6beb..89f1302d593a5 100644 --- a/Documentation/networking/rxrpc.txt +++ b/Documentation/networking/rxrpc.txt @@ -1056,18 +1056,23 @@ The kernel interface functions are as follows: u32 rxrpc_kernel_check_life(struct socket *sock, struct rxrpc_call *call); + void rxrpc_kernel_probe_life(struct socket *sock, + struct rxrpc_call *call); - This returns a number that is updated when ACKs are received from the peer - (notably including PING RESPONSE ACKs which we can elicit by sending PING - ACKs to see if the call still exists on the server). The caller should - compare the numbers of two calls to see if the call is still alive after - waiting for a suitable interval. + The first function returns a number that is updated when ACKs are received + from the peer (notably including PING RESPONSE ACKs which we can elicit by + sending PING ACKs to see if the call still exists on the server). The + caller should compare the numbers of two calls to see if the call is still + alive after waiting for a suitable interval. This allows the caller to work out if the server is still contactable and if the call is still alive on the server whilst waiting for the server to process a client operation. - This function may transmit a PING ACK. + The second function causes a ping ACK to be transmitted to try to provoke + the peer into responding, which would then cause the value returned by the + first function to change. Note that this must be called in TASK_RUNNING + state. (*) Get reply timestamp. diff --git a/fs/afs/rxrpc.c b/fs/afs/rxrpc.c index 59970886690f1..a7b44863d502e 100644 --- a/fs/afs/rxrpc.c +++ b/fs/afs/rxrpc.c @@ -576,6 +576,7 @@ static long afs_wait_for_call_to_complete(struct afs_call *call, { signed long rtt2, timeout; long ret; + bool stalled = false; u64 rtt; u32 life, last_life; @@ -609,12 +610,20 @@ static long afs_wait_for_call_to_complete(struct afs_call *call, life = rxrpc_kernel_check_life(call->net->socket, call->rxcall); if (timeout == 0 && - life == last_life && signal_pending(current)) + life == last_life && signal_pending(current)) { + if (stalled) break; + __set_current_state(TASK_RUNNING); + rxrpc_kernel_probe_life(call->net->socket, call->rxcall); + timeout = rtt2; + stalled = true; + continue; + } if (life != last_life) { timeout = rtt2; last_life = life; + stalled = false; } timeout = schedule_timeout(timeout); diff --git a/include/net/af_rxrpc.h b/include/net/af_rxrpc.h index de587948042a4..1adefe42c0a68 100644 --- a/include/net/af_rxrpc.h +++ b/include/net/af_rxrpc.h @@ -77,7 +77,8 @@ int rxrpc_kernel_retry_call(struct socket *, struct rxrpc_call *, struct sockaddr_rxrpc *, struct key *); int rxrpc_kernel_check_call(struct socket *, struct rxrpc_call *, enum rxrpc_call_completion *, u32 *); -u32 rxrpc_kernel_check_life(struct socket *, struct rxrpc_call *); +u32 rxrpc_kernel_check_life(const struct socket *, const struct rxrpc_call *); +void rxrpc_kernel_probe_life(struct socket *, struct rxrpc_call *); u32 rxrpc_kernel_get_epoch(struct socket *, struct rxrpc_call *); bool rxrpc_kernel_get_reply_time(struct socket *, struct rxrpc_call *, ktime_t *); diff --git a/include/trace/events/rxrpc.h b/include/trace/events/rxrpc.h index 573d5b901fb11..5b50fe4906d2e 100644 --- a/include/trace/events/rxrpc.h +++ b/include/trace/events/rxrpc.h @@ -181,6 +181,7 @@ enum rxrpc_timer_trace { enum rxrpc_propose_ack_trace { rxrpc_propose_ack_client_tx_end, rxrpc_propose_ack_input_data, + rxrpc_propose_ack_ping_for_check_life, rxrpc_propose_ack_ping_for_keepalive, rxrpc_propose_ack_ping_for_lost_ack, rxrpc_propose_ack_ping_for_lost_reply, @@ -380,6 +381,7 @@ enum rxrpc_tx_point { #define rxrpc_propose_ack_traces \ EM(rxrpc_propose_ack_client_tx_end, "ClTxEnd") \ EM(rxrpc_propose_ack_input_data, "DataIn ") \ + EM(rxrpc_propose_ack_ping_for_check_life, "ChkLife") \ EM(rxrpc_propose_ack_ping_for_keepalive, "KeepAlv") \ EM(rxrpc_propose_ack_ping_for_lost_ack, "LostAck") \ EM(rxrpc_propose_ack_ping_for_lost_reply, "LostRpl") \ diff --git a/net/rxrpc/af_rxrpc.c b/net/rxrpc/af_rxrpc.c index 64362d078da84..a2522f9d71e26 100644 --- a/net/rxrpc/af_rxrpc.c +++ b/net/rxrpc/af_rxrpc.c @@ -375,16 +375,35 @@ EXPORT_SYMBOL(rxrpc_kernel_end_call); * getting ACKs from the server. Returns a number representing the life state * which can be compared to that returned by a previous call. * - * If this is a client call, ping ACKs will be sent to the server to find out - * whether it's still responsive and whether the call is still alive on the - * server. + * If the life state stalls, rxrpc_kernel_probe_life() should be called and + * then 2RTT waited. */ -u32 rxrpc_kernel_check_life(struct socket *sock, struct rxrpc_call *call) +u32 rxrpc_kernel_check_life(const struct socket *sock, + const struct rxrpc_call *call) { return call->acks_latest; } EXPORT_SYMBOL(rxrpc_kernel_check_life); +/** + * rxrpc_kernel_probe_life - Poke the peer to see if it's still alive + * @sock: The socket the call is on + * @call: The call to check + * + * In conjunction with rxrpc_kernel_check_life(), allow a kernel service to + * find out whether a call is still alive by pinging it. This should cause the + * life state to be bumped in about 2*RTT. + * + * The must be called in TASK_RUNNING state on pain of might_sleep() objecting. + */ +void rxrpc_kernel_probe_life(struct socket *sock, struct rxrpc_call *call) +{ + rxrpc_propose_ACK(call, RXRPC_ACK_PING, 0, 0, true, false, + rxrpc_propose_ack_ping_for_check_life); + rxrpc_send_ack_packet(call, true, NULL); +} +EXPORT_SYMBOL(rxrpc_kernel_probe_life); + /** * rxrpc_kernel_get_epoch - Retrieve the epoch value from a call. * @sock: The socket the call is on -- cgit v1.2.3 From 3841840449817ba6cf3e636008bc4e1061a03388 Mon Sep 17 00:00:00 2001 From: Juergen Gross Date: Tue, 20 Nov 2018 08:25:28 +0100 Subject: x86/boot: Mostly revert commit ae7e1238e68f2a ("Add ACPI RSDP address to setup_header") Peter Anvin pointed out that commit: ae7e1238e68f2a ("x86/boot: Add ACPI RSDP address to setup_header") should be reverted as setup_header should only contain items set by the legacy BIOS. So revert said commit. Instead of fully reverting the dependent commit of: e7b66d16fe4172 ("x86/acpi, x86/boot: Take RSDP address for boot params if available") just remove the setup_header reference in order to replace it by a boot_params in a followup patch. Suggested-by: "H. Peter Anvin" Signed-off-by: Juergen Gross Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: boris.ostrovsky@oracle.com Cc: bp@alien8.de Cc: daniel.kiper@oracle.com Cc: sstabellini@kernel.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/20181120072529.5489-2-jgross@suse.com Signed-off-by: Ingo Molnar --- Documentation/x86/boot.txt | 32 +------------------------------- arch/x86/boot/header.S | 6 +----- arch/x86/include/asm/x86_init.h | 2 -- arch/x86/include/uapi/asm/bootparam.h | 4 ---- arch/x86/kernel/acpi/boot.c | 2 +- arch/x86/kernel/head32.c | 1 - arch/x86/kernel/head64.c | 2 -- arch/x86/kernel/setup.c | 17 ----------------- 8 files changed, 3 insertions(+), 63 deletions(-) (limited to 'Documentation') diff --git a/Documentation/x86/boot.txt b/Documentation/x86/boot.txt index 7727db8f94bce..5e9b826b5f62f 100644 --- a/Documentation/x86/boot.txt +++ b/Documentation/x86/boot.txt @@ -61,18 +61,6 @@ Protocol 2.12: (Kernel 3.8) Added the xloadflags field and extension fields to struct boot_params for loading bzImage and ramdisk above 4G in 64bit. -Protocol 2.13: (Kernel 3.14) Support 32- and 64-bit flags being set in - xloadflags to support booting a 64-bit kernel from 32-bit - EFI - -Protocol 2.14: (Kernel 4.20) Added acpi_rsdp_addr holding the physical - address of the ACPI RSDP table. - The bootloader updates version with: - 0x8000 | min(kernel-version, bootloader-version) - kernel-version being the protocol version supported by - the kernel and bootloader-version the protocol version - supported by the bootloader. - **** MEMORY LAYOUT The traditional memory map for the kernel loader, used for Image or @@ -209,7 +197,6 @@ Offset Proto Name Meaning 0258/8 2.10+ pref_address Preferred loading address 0260/4 2.10+ init_size Linear memory required during initialization 0264/4 2.11+ handover_offset Offset of handover entry point -0268/8 2.14+ acpi_rsdp_addr Physical address of RSDP table (1) For backwards compatibility, if the setup_sects field contains 0, the real value is 4. @@ -322,7 +309,7 @@ Protocol: 2.00+ Contains the magic number "HdrS" (0x53726448). Field name: version -Type: modify +Type: read Offset/size: 0x206/2 Protocol: 2.00+ @@ -330,12 +317,6 @@ Protocol: 2.00+ e.g. 0x0204 for version 2.04, and 0x0a11 for a hypothetical version 10.17. - Up to protocol version 2.13 this information is only read by the - bootloader. From protocol version 2.14 onwards the bootloader will - write the used protocol version or-ed with 0x8000 to the field. The - used protocol version will be the minimum of the supported protocol - versions of the bootloader and the kernel. - Field name: realmode_swtch Type: modify (optional) Offset/size: 0x208/4 @@ -763,17 +744,6 @@ Offset/size: 0x264/4 See EFI HANDOVER PROTOCOL below for more details. -Field name: acpi_rsdp_addr -Type: write -Offset/size: 0x268/8 -Protocol: 2.14+ - - This field can be set by the boot loader to tell the kernel the - physical address of the ACPI RSDP table. - - A value of 0 indicates the kernel should fall back to the standard - methods to locate the RSDP. - **** THE IMAGE CHECKSUM diff --git a/arch/x86/boot/header.S b/arch/x86/boot/header.S index 4c881c850125c..850b8762e8896 100644 --- a/arch/x86/boot/header.S +++ b/arch/x86/boot/header.S @@ -300,7 +300,7 @@ _start: # Part 2 of the header, from the old setup.S .ascii "HdrS" # header signature - .word 0x020e # header version number (>= 0x0105) + .word 0x020d # header version number (>= 0x0105) # or else old loadlin-1.5 will fail) .globl realmode_swtch realmode_swtch: .word 0, 0 # default_switch, SETUPSEG @@ -558,10 +558,6 @@ pref_address: .quad LOAD_PHYSICAL_ADDR # preferred load addr init_size: .long INIT_SIZE # kernel initialization size handover_offset: .long 0 # Filled in by build.c -acpi_rsdp_addr: .quad 0 # 64-bit physical pointer to the - # ACPI RSDP table, added with - # version 2.14 - # End of setup header ##################################################### .section ".entrytext", "ax" diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h index 0f842104862c3..b85a7c54c6a13 100644 --- a/arch/x86/include/asm/x86_init.h +++ b/arch/x86/include/asm/x86_init.h @@ -303,6 +303,4 @@ extern void x86_init_noop(void); extern void x86_init_uint_noop(unsigned int unused); extern bool x86_pnpbios_disabled(void); -void x86_verify_bootdata_version(void); - #endif diff --git a/arch/x86/include/uapi/asm/bootparam.h b/arch/x86/include/uapi/asm/bootparam.h index 22f89d040dddc..a06cbf0197446 100644 --- a/arch/x86/include/uapi/asm/bootparam.h +++ b/arch/x86/include/uapi/asm/bootparam.h @@ -16,9 +16,6 @@ #define RAMDISK_PROMPT_FLAG 0x8000 #define RAMDISK_LOAD_FLAG 0x4000 -/* version flags */ -#define VERSION_WRITTEN 0x8000 - /* loadflags */ #define LOADED_HIGH (1<<0) #define KASLR_FLAG (1<<1) @@ -89,7 +86,6 @@ struct setup_header { __u64 pref_address; __u32 init_size; __u32 handover_offset; - __u64 acpi_rsdp_addr; } __attribute__((packed)); struct sys_desc_table { diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c index 92c76bf97ad82..fb3b1f3a5aba5 100644 --- a/arch/x86/kernel/acpi/boot.c +++ b/arch/x86/kernel/acpi/boot.c @@ -1776,5 +1776,5 @@ void __init arch_reserve_mem_area(acpi_physical_address addr, size_t size) u64 x86_default_get_root_pointer(void) { - return boot_params.hdr.acpi_rsdp_addr; + return 0; } diff --git a/arch/x86/kernel/head32.c b/arch/x86/kernel/head32.c index 76fa3b8365985..ec6fefbfd3c04 100644 --- a/arch/x86/kernel/head32.c +++ b/arch/x86/kernel/head32.c @@ -37,7 +37,6 @@ asmlinkage __visible void __init i386_start_kernel(void) cr4_init_shadow(); sanitize_boot_params(&boot_params); - x86_verify_bootdata_version(); x86_early_init_platform_quirks(); diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c index 7663a8eb602bc..16b1cbd3a61e2 100644 --- a/arch/x86/kernel/head64.c +++ b/arch/x86/kernel/head64.c @@ -457,8 +457,6 @@ void __init x86_64_start_reservations(char *real_mode_data) if (!boot_params.hdr.version) copy_bootdata(__va(real_mode_data)); - x86_verify_bootdata_version(); - x86_early_init_platform_quirks(); switch (boot_params.hdr.hardware_subarch) { diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index b74e7bfed6ab4..d494b9bfe618c 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -1280,23 +1280,6 @@ void __init setup_arch(char **cmdline_p) unwind_init(); } -/* - * From boot protocol 2.14 onwards we expect the bootloader to set the - * version to "0x8000 | ". In case we find a version >= 2.14 - * without the 0x8000 we assume the boot loader supports 2.13 only and - * reset the version accordingly. The 0x8000 flag is removed in any case. - */ -void __init x86_verify_bootdata_version(void) -{ - if (boot_params.hdr.version & VERSION_WRITTEN) - boot_params.hdr.version &= ~VERSION_WRITTEN; - else if (boot_params.hdr.version >= 0x020e) - boot_params.hdr.version = 0x020d; - - if (boot_params.hdr.version < 0x020e) - boot_params.hdr.acpi_rsdp_addr = 0; -} - #ifdef CONFIG_X86_32 static struct resource video_ram_resource = { -- cgit v1.2.3 From 544b03da39e2d7b4961d3163976ed4bfb1fac509 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 19 Nov 2018 11:07:18 +0000 Subject: Documentation/security-bugs: Postpone fix publication in exceptional cases At the request of the reporter, the Linux kernel security team offers to postpone the publishing of a fix for up to 5 business days from the date of a report. While it is generally undesirable to keep a fix private after it has been developed, this short window is intended to allow distributions to package the fix into their kernel builds and permits early inclusion of the security team in the case of a co-ordinated disclosure with other parties. Unfortunately, discussions with major Linux distributions and cloud providers has revealed that 5 business days is not sufficient to achieve either of these two goals. As an example, cloud providers need to roll out KVM security fixes to a global fleet of hosts with sufficient early ramp-up and monitoring. An end-to-end timeline of less than two weeks dramatically cuts into the amount of early validation and increases the chance of guest-visible regressions. The consequence of this timeline mismatch is that security issues are commonly fixed without the involvement of the Linux kernel security team and are instead analysed and addressed by an ad-hoc group of developers across companies contributing to Linux. In some cases, mainline (and therefore the official stable kernels) can be left to languish for extended periods of time. This undermines the Linux kernel security process and puts upstream developers in a difficult position should they find themselves involved with an undisclosed security problem that they are unable to report due to restrictions from their employer. To accommodate the needs of these users of the Linux kernel and encourage them to engage with the Linux security team when security issues are first uncovered, extend the maximum period for which fixes may be delayed to 7 calendar days, or 14 calendar days in exceptional cases, where the logistics of QA and large scale rollouts specifically need to be accommodated. This brings parity with the linux-distros@ maximum embargo period of 14 calendar days. Cc: Paolo Bonzini Cc: David Woodhouse Cc: Amit Shah Cc: Laura Abbott Acked-by: Kees Cook Co-developed-by: Thomas Gleixner Co-developed-by: David Woodhouse Signed-off-by: Thomas Gleixner Signed-off-by: David Woodhouse Signed-off-by: Will Deacon Reviewed-by: Tyler Hicks Acked-by: Peter Zijlstra Signed-off-by: Greg Kroah-Hartman --- Documentation/admin-guide/security-bugs.rst | 21 +++++++++++---------- 1 file changed, 11 insertions(+), 10 deletions(-) (limited to 'Documentation') diff --git a/Documentation/admin-guide/security-bugs.rst b/Documentation/admin-guide/security-bugs.rst index 164bf71149fdf..30187d49dc2c7 100644 --- a/Documentation/admin-guide/security-bugs.rst +++ b/Documentation/admin-guide/security-bugs.rst @@ -32,16 +32,17 @@ Disclosure and embargoed information The security list is not a disclosure channel. For that, see Coordination below. -Once a robust fix has been developed, our preference is to release the -fix in a timely fashion, treating it no differently than any of the other -thousands of changes and fixes the Linux kernel project releases every -month. - -However, at the request of the reporter, we will postpone releasing the -fix for up to 5 business days after the date of the report or after the -embargo has lifted; whichever comes first. The only exception to that -rule is if the bug is publicly known, in which case the preference is to -release the fix as soon as it's available. +Once a robust fix has been developed, the release process starts. Fixes +for publicly known bugs are released immediately. + +Although our preference is to release fixes for publicly undisclosed bugs +as soon as they become available, this may be postponed at the request of +the reporter or an affected party for up to 7 calendar days from the start +of the release process, with an exceptional extension to 14 calendar days +if it is agreed that the criticality of the bug requires more time. The +only valid reason for deferring the publication of a fix is to accommodate +the logistics of QA and large scale rollouts which require release +coordination. Whilst embargoed information may be shared with trusted individuals in order to develop a fix, such information will not be published alongside -- cgit v1.2.3 From 2a5bf23d5b795d5df33dc284e8f5cf8b6a5b4042 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 20 Nov 2018 18:08:42 +0100 Subject: perf/x86/intel: Fix regression by default disabling perfmon v4 interrupt handling Kyle Huey reported that 'rr', a replay debugger, broke due to the following commit: af3bdb991a5c ("perf/x86/intel: Add a separate Arch Perfmon v4 PMI handler") Rework the 'disable_counter_freezing' __setup() parameter such that we can explicitly enable/disable it and switch to default disabled. To this purpose, rename the parameter to "perf_v4_pmi=" which is a much better description and allows requiring a bool argument. [ mingo: Improved the changelog some more. ] Reported-by: Kyle Huey Signed-off-by: Peter Zijlstra (Intel) Cc: Alexander Shishkin Cc: Arnaldo Carvalho de Melo Cc: Jiri Olsa Cc: Kan Liang Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Robert O'Callahan Cc: Stephane Eranian Cc: Thomas Gleixner Cc: Vince Weaver Cc: acme@kernel.org Link: http://lkml.kernel.org/r/20181120170842.GZ2131@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar --- Documentation/admin-guide/kernel-parameters.txt | 3 ++- arch/x86/events/intel/core.c | 12 ++++++++---- 2 files changed, 10 insertions(+), 5 deletions(-) (limited to 'Documentation') diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 81d1d5a747280..5463d5a4d85c1 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -856,7 +856,8 @@ causing system reset or hang due to sending INIT from AP to BSP. - disable_counter_freezing [HW] + perf_v4_pmi= [X86,INTEL] + Format: Disable Intel PMU counter freezing feature. The feature only exists starting from Arch Perfmon v4 (Skylake and newer). diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 273c62e815463..af8bea9d4006a 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -2306,14 +2306,18 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status) return handled; } -static bool disable_counter_freezing; +static bool disable_counter_freezing = true; static int __init intel_perf_counter_freezing_setup(char *s) { - disable_counter_freezing = true; - pr_info("Intel PMU Counter freezing feature disabled\n"); + bool res; + + if (kstrtobool(s, &res)) + return -EINVAL; + + disable_counter_freezing = !res; return 1; } -__setup("disable_counter_freezing", intel_perf_counter_freezing_setup); +__setup("perf_v4_pmi=", intel_perf_counter_freezing_setup); /* * Simplified handler for Arch Perfmon v4: -- cgit v1.2.3 From 48d7f160b10711f014bf07b574c73452646c9fdd Mon Sep 17 00:00:00 2001 From: Stephen Boyd Date: Mon, 5 Nov 2018 11:40:10 -0800 Subject: dt-bindings: clk: Introduce 'protected-clocks' property Add a generic clk property for clks which are not intended to be used by the OS due to security restrictions put in place by firmware. For example, on some Qualcomm firmwares reading or writing certain clk registers causes the entire system to reboot, but on other firmwares reading and writing those same registers is required to make devices like QSPI work. Rather than adding one-off properties each time a new set of clks appears to be protected, let's add a generic clk property to describe any set of clks that shouldn't be touched by the OS. This way we never need to register the clks or use them in certain firmware configurations. Cc: Rob Herring Cc: Bjorn Andersson Cc: Taniya Das Signed-off-by: Stephen Boyd Reviewed-by: Bjorn Andersson Reviewed-by: Rob Herring Signed-off-by: Stephen Boyd --- .../devicetree/bindings/clock/clock-bindings.txt | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/clock/clock-bindings.txt b/Documentation/devicetree/bindings/clock/clock-bindings.txt index 2ec489eebe723..b646bbcf7f924 100644 --- a/Documentation/devicetree/bindings/clock/clock-bindings.txt +++ b/Documentation/devicetree/bindings/clock/clock-bindings.txt @@ -168,3 +168,19 @@ a shared clock is forbidden. Configuration of common clocks, which affect multiple consumer devices can be similarly specified in the clock provider node. + +==Protected clocks== + +Some platforms or firmwares may not fully expose all the clocks to the OS, such +as in situations where those clks are used by drivers running in ARM secure +execution levels. Such a configuration can be specified in device tree with the +protected-clocks property in the form of a clock specifier list. This property should +only be specified in the node that is providing the clocks being protected: + + clock-controller@a000f000 { + compatible = "vendor,clk95; + reg = <0xa000f000 0x1000> + #clocks-cells = <1>; + ... + protected-clocks = , ; + }; -- cgit v1.2.3 From ea2fc769719f9cc8e26845e8ffa2a719e2724645 Mon Sep 17 00:00:00 2001 From: Ezequiel Garcia Date: Fri, 9 Nov 2018 10:16:41 -0500 Subject: media: Revert "media: dt-bindings: Document the Rockchip VPU bindings" This reverts commit e4183d3256e3cd668e899d06af66da5aac3a51af. The commit was picked by mistake, as the Rockchip VPU driver is not ready for inclusion yet, and it's still under discussion. Signed-off-by: Ezequiel Garcia Signed-off-by: Hans Verkuil Signed-off-by: Mauro Carvalho Chehab --- .../devicetree/bindings/media/rockchip-vpu.txt | 29 ---------------------- 1 file changed, 29 deletions(-) delete mode 100644 Documentation/devicetree/bindings/media/rockchip-vpu.txt (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/media/rockchip-vpu.txt b/Documentation/devicetree/bindings/media/rockchip-vpu.txt deleted file mode 100644 index 35dc464ad7c86..0000000000000 --- a/Documentation/devicetree/bindings/media/rockchip-vpu.txt +++ /dev/null @@ -1,29 +0,0 @@ -device-tree bindings for rockchip VPU codec - -Rockchip (Video Processing Unit) present in various Rockchip platforms, -such as RK3288 and RK3399. - -Required properties: -- compatible: value should be one of the following - "rockchip,rk3288-vpu"; - "rockchip,rk3399-vpu"; -- interrupts: encoding and decoding interrupt specifiers -- interrupt-names: should be "vepu" and "vdpu" -- clocks: phandle to VPU aclk, hclk clocks -- clock-names: should be "aclk" and "hclk" -- power-domains: phandle to power domain node -- iommus: phandle to a iommu node - -Example: -SoC-specific DT entry: - vpu: video-codec@ff9a0000 { - compatible = "rockchip,rk3288-vpu"; - reg = <0x0 0xff9a0000 0x0 0x800>; - interrupts = , - ; - interrupt-names = "vepu", "vdpu"; - clocks = <&cru ACLK_VCODEC>, <&cru HCLK_VCODEC>; - clock-names = "aclk", "hclk"; - power-domains = <&power RK3288_PD_VIDEO>; - iommus = <&vpu_mmu>; - }; -- cgit v1.2.3 From e7b9fb4f545b1f7885e7c642643828f93d3d79c9 Mon Sep 17 00:00:00 2001 From: Fabio Estevam Date: Fri, 23 Nov 2018 15:46:50 -0200 Subject: dt-bindings: dsa: Fix typo in "probed" The correct form is "can be probed", so fix the typo. Signed-off-by: Fabio Estevam Reviewed-by: Andrew Lunn Signed-off-by: David S. Miller --- Documentation/devicetree/bindings/net/dsa/dsa.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/dsa/dsa.txt b/Documentation/devicetree/bindings/net/dsa/dsa.txt index 3ceeb8de11963..35694c0c376b9 100644 --- a/Documentation/devicetree/bindings/net/dsa/dsa.txt +++ b/Documentation/devicetree/bindings/net/dsa/dsa.txt @@ -7,7 +7,7 @@ limitations. Current Binding --------------- -Switches are true Linux devices and can be probes by any means. Once +Switches are true Linux devices and can be probed by any means. Once probed, they register to the DSA framework, passing a node pointer. This node is expected to fulfil the following binding, and may contain additional properties as required by the device it is -- cgit v1.2.3 From a7c3a0d5f8d8cd5cdb32c06d4d68f5b4e4d2104b Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Fri, 23 Nov 2018 07:18:12 -0500 Subject: media: mediactl docs: Fix licensing message Right now, it mentions two SPDX headers that don't exist inside the Kernel: GFDL-1.1-or-later And an exception: no-invariant-sections While it would be trivial to add the first one, there's no way, currently, to distinguish, with SPDX, between a free and a non-free document under GFDL. Free documents with GFDL should not have invariant sections. There's an open issue at SPDX tree waiting for it to be solved. While we don't have this issue closed, let's just replace by a free-text license, and add a TODO note to remind us to revisit it later. Reviewed-by: Tomasz Figa Signed-off-by: Mauro Carvalho Chehab --- .../uapi/mediactl/media-ioc-request-alloc.rst | 26 +++++++++++++++++++++- .../uapi/mediactl/media-request-ioc-queue.rst | 26 +++++++++++++++++++++- .../uapi/mediactl/media-request-ioc-reinit.rst | 26 +++++++++++++++++++++- Documentation/media/uapi/mediactl/request-api.rst | 26 +++++++++++++++++++++- .../media/uapi/mediactl/request-func-close.rst | 26 +++++++++++++++++++++- .../media/uapi/mediactl/request-func-ioctl.rst | 26 +++++++++++++++++++++- .../media/uapi/mediactl/request-func-poll.rst | 26 +++++++++++++++++++++- 7 files changed, 175 insertions(+), 7 deletions(-) (limited to 'Documentation') diff --git a/Documentation/media/uapi/mediactl/media-ioc-request-alloc.rst b/Documentation/media/uapi/mediactl/media-ioc-request-alloc.rst index 0f8b31874002c..de131f00c2496 100644 --- a/Documentation/media/uapi/mediactl/media-ioc-request-alloc.rst +++ b/Documentation/media/uapi/mediactl/media-ioc-request-alloc.rst @@ -1,4 +1,28 @@ -.. SPDX-License-Identifier: GPL-2.0 OR GFDL-1.1-or-later WITH no-invariant-sections +.. This file is dual-licensed: you can use it either under the terms +.. of the GPL or the GFDL 1.1+ license, at your option. Note that this +.. dual licensing only applies to this file, and not this project as a +.. whole. +.. +.. a) This file is free software; you can redistribute it and/or +.. modify it under the terms of the GNU General Public License as +.. published by the Free Software Foundation; either version 2 of +.. the License, or (at your option) any later version. +.. +.. This file is distributed in the hope that it will be useful, +.. but WITHOUT ANY WARRANTY; without even the implied warranty of +.. MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +.. GNU General Public License for more details. +.. +.. Or, alternatively, +.. +.. b) Permission is granted to copy, distribute and/or modify this +.. document under the terms of the GNU Free Documentation License, +.. Version 1.1 or any later version published by the Free Software +.. Foundation, with no Invariant Sections, no Front-Cover Texts +.. and no Back-Cover Texts. A copy of the license is included at +.. Documentation/media/uapi/fdl-appendix.rst. +.. +.. TODO: replace it to GPL-2.0 OR GFDL-1.1-or-later WITH no-invariant-sections .. _media_ioc_request_alloc: diff --git a/Documentation/media/uapi/mediactl/media-request-ioc-queue.rst b/Documentation/media/uapi/mediactl/media-request-ioc-queue.rst index 6dd2d7fea7144..5d2604345e191 100644 --- a/Documentation/media/uapi/mediactl/media-request-ioc-queue.rst +++ b/Documentation/media/uapi/mediactl/media-request-ioc-queue.rst @@ -1,4 +1,28 @@ -.. SPDX-License-Identifier: GPL-2.0 OR GFDL-1.1-or-later WITH no-invariant-sections +.. This file is dual-licensed: you can use it either under the terms +.. of the GPL or the GFDL 1.1+ license, at your option. Note that this +.. dual licensing only applies to this file, and not this project as a +.. whole. +.. +.. a) This file is free software; you can redistribute it and/or +.. modify it under the terms of the GNU General Public License as +.. published by the Free Software Foundation; either version 2 of +.. the License, or (at your option) any later version. +.. +.. This file is distributed in the hope that it will be useful, +.. but WITHOUT ANY WARRANTY; without even the implied warranty of +.. MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +.. GNU General Public License for more details. +.. +.. Or, alternatively, +.. +.. b) Permission is granted to copy, distribute and/or modify this +.. document under the terms of the GNU Free Documentation License, +.. Version 1.1 or any later version published by the Free Software +.. Foundation, with no Invariant Sections, no Front-Cover Texts +.. and no Back-Cover Texts. A copy of the license is included at +.. Documentation/media/uapi/fdl-appendix.rst. +.. +.. TODO: replace it to GPL-2.0 OR GFDL-1.1-or-later WITH no-invariant-sections .. _media_request_ioc_queue: diff --git a/Documentation/media/uapi/mediactl/media-request-ioc-reinit.rst b/Documentation/media/uapi/mediactl/media-request-ioc-reinit.rst index febe888494c8d..ec61960c81ce9 100644 --- a/Documentation/media/uapi/mediactl/media-request-ioc-reinit.rst +++ b/Documentation/media/uapi/mediactl/media-request-ioc-reinit.rst @@ -1,4 +1,28 @@ -.. SPDX-License-Identifier: GPL-2.0 OR GFDL-1.1-or-later WITH no-invariant-sections +.. This file is dual-licensed: you can use it either under the terms +.. of the GPL or the GFDL 1.1+ license, at your option. Note that this +.. dual licensing only applies to this file, and not this project as a +.. whole. +.. +.. a) This file is free software; you can redistribute it and/or +.. modify it under the terms of the GNU General Public License as +.. published by the Free Software Foundation; either version 2 of +.. the License, or (at your option) any later version. +.. +.. This file is distributed in the hope that it will be useful, +.. but WITHOUT ANY WARRANTY; without even the implied warranty of +.. MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +.. GNU General Public License for more details. +.. +.. Or, alternatively, +.. +.. b) Permission is granted to copy, distribute and/or modify this +.. document under the terms of the GNU Free Documentation License, +.. Version 1.1 or any later version published by the Free Software +.. Foundation, with no Invariant Sections, no Front-Cover Texts +.. and no Back-Cover Texts. A copy of the license is included at +.. Documentation/media/uapi/fdl-appendix.rst. +.. +.. TODO: replace it to GPL-2.0 OR GFDL-1.1-or-later WITH no-invariant-sections .. _media_request_ioc_reinit: diff --git a/Documentation/media/uapi/mediactl/request-api.rst b/Documentation/media/uapi/mediactl/request-api.rst index 5f4a23029c487..945113dcb2185 100644 --- a/Documentation/media/uapi/mediactl/request-api.rst +++ b/Documentation/media/uapi/mediactl/request-api.rst @@ -1,4 +1,28 @@ -.. SPDX-License-Identifier: GPL-2.0 OR GFDL-1.1-or-later WITH no-invariant-sections +.. This file is dual-licensed: you can use it either under the terms +.. of the GPL or the GFDL 1.1+ license, at your option. Note that this +.. dual licensing only applies to this file, and not this project as a +.. whole. +.. +.. a) This file is free software; you can redistribute it and/or +.. modify it under the terms of the GNU General Public License as +.. published by the Free Software Foundation; either version 2 of +.. the License, or (at your option) any later version. +.. +.. This file is distributed in the hope that it will be useful, +.. but WITHOUT ANY WARRANTY; without even the implied warranty of +.. MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +.. GNU General Public License for more details. +.. +.. Or, alternatively, +.. +.. b) Permission is granted to copy, distribute and/or modify this +.. document under the terms of the GNU Free Documentation License, +.. Version 1.1 or any later version published by the Free Software +.. Foundation, with no Invariant Sections, no Front-Cover Texts +.. and no Back-Cover Texts. A copy of the license is included at +.. Documentation/media/uapi/fdl-appendix.rst. +.. +.. TODO: replace it to GPL-2.0 OR GFDL-1.1-or-later WITH no-invariant-sections .. _media-request-api: diff --git a/Documentation/media/uapi/mediactl/request-func-close.rst b/Documentation/media/uapi/mediactl/request-func-close.rst index 098d7f2b95482..dcf3f35bcf176 100644 --- a/Documentation/media/uapi/mediactl/request-func-close.rst +++ b/Documentation/media/uapi/mediactl/request-func-close.rst @@ -1,4 +1,28 @@ -.. SPDX-License-Identifier: GPL-2.0 OR GFDL-1.1-or-later WITH no-invariant-sections +.. This file is dual-licensed: you can use it either under the terms +.. of the GPL or the GFDL 1.1+ license, at your option. Note that this +.. dual licensing only applies to this file, and not this project as a +.. whole. +.. +.. a) This file is free software; you can redistribute it and/or +.. modify it under the terms of the GNU General Public License as +.. published by the Free Software Foundation; either version 2 of +.. the License, or (at your option) any later version. +.. +.. This file is distributed in the hope that it will be useful, +.. but WITHOUT ANY WARRANTY; without even the implied warranty of +.. MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +.. GNU General Public License for more details. +.. +.. Or, alternatively, +.. +.. b) Permission is granted to copy, distribute and/or modify this +.. document under the terms of the GNU Free Documentation License, +.. Version 1.1 or any later version published by the Free Software +.. Foundation, with no Invariant Sections, no Front-Cover Texts +.. and no Back-Cover Texts. A copy of the license is included at +.. Documentation/media/uapi/fdl-appendix.rst. +.. +.. TODO: replace it to GPL-2.0 OR GFDL-1.1-or-later WITH no-invariant-sections .. _request-func-close: diff --git a/Documentation/media/uapi/mediactl/request-func-ioctl.rst b/Documentation/media/uapi/mediactl/request-func-ioctl.rst index ff7b072a69991..11a22f8878439 100644 --- a/Documentation/media/uapi/mediactl/request-func-ioctl.rst +++ b/Documentation/media/uapi/mediactl/request-func-ioctl.rst @@ -1,4 +1,28 @@ -.. SPDX-License-Identifier: GPL-2.0 OR GFDL-1.1-or-later WITH no-invariant-sections +.. This file is dual-licensed: you can use it either under the terms +.. of the GPL or the GFDL 1.1+ license, at your option. Note that this +.. dual licensing only applies to this file, and not this project as a +.. whole. +.. +.. a) This file is free software; you can redistribute it and/or +.. modify it under the terms of the GNU General Public License as +.. published by the Free Software Foundation; either version 2 of +.. the License, or (at your option) any later version. +.. +.. This file is distributed in the hope that it will be useful, +.. but WITHOUT ANY WARRANTY; without even the implied warranty of +.. MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +.. GNU General Public License for more details. +.. +.. Or, alternatively, +.. +.. b) Permission is granted to copy, distribute and/or modify this +.. document under the terms of the GNU Free Documentation License, +.. Version 1.1 or any later version published by the Free Software +.. Foundation, with no Invariant Sections, no Front-Cover Texts +.. and no Back-Cover Texts. A copy of the license is included at +.. Documentation/media/uapi/fdl-appendix.rst. +.. +.. TODO: replace it to GPL-2.0 OR GFDL-1.1-or-later WITH no-invariant-sections .. _request-func-ioctl: diff --git a/Documentation/media/uapi/mediactl/request-func-poll.rst b/Documentation/media/uapi/mediactl/request-func-poll.rst index 85191254f381a..2609fd54d519c 100644 --- a/Documentation/media/uapi/mediactl/request-func-poll.rst +++ b/Documentation/media/uapi/mediactl/request-func-poll.rst @@ -1,4 +1,28 @@ -.. SPDX-License-Identifier: GPL-2.0 OR GFDL-1.1-or-later WITH no-invariant-sections +.. This file is dual-licensed: you can use it either under the terms +.. of the GPL or the GFDL 1.1+ license, at your option. Note that this +.. dual licensing only applies to this file, and not this project as a +.. whole. +.. +.. a) This file is free software; you can redistribute it and/or +.. modify it under the terms of the GNU General Public License as +.. published by the Free Software Foundation; either version 2 of +.. the License, or (at your option) any later version. +.. +.. This file is distributed in the hope that it will be useful, +.. but WITHOUT ANY WARRANTY; without even the implied warranty of +.. MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +.. GNU General Public License for more details. +.. +.. Or, alternatively, +.. +.. b) Permission is granted to copy, distribute and/or modify this +.. document under the terms of the GNU Free Documentation License, +.. Version 1.1 or any later version published by the Free Software +.. Foundation, with no Invariant Sections, no Front-Cover Texts +.. and no Back-Cover Texts. A copy of the license is included at +.. Documentation/media/uapi/fdl-appendix.rst. +.. +.. TODO: replace it to GPL-2.0 OR GFDL-1.1-or-later WITH no-invariant-sections .. _request-func-poll: -- cgit v1.2.3 From fa1202ef224391b6f5b26cdd44cc50495e8fab54 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Sun, 25 Nov 2018 19:33:45 +0100 Subject: x86/speculation: Add command line control for indirect branch speculation Add command line control for user space indirect branch speculation mitigations. The new option is: spectre_v2_user= The initial options are: - on: Unconditionally enabled - off: Unconditionally disabled -auto: Kernel selects mitigation (default off for now) When the spectre_v2= command line argument is either 'on' or 'off' this implies that the application to application control follows that state even if a contradicting spectre_v2_user= argument is supplied. Originally-by: Tim Chen Signed-off-by: Thomas Gleixner Reviewed-by: Ingo Molnar Cc: Peter Zijlstra Cc: Andy Lutomirski Cc: Linus Torvalds Cc: Jiri Kosina Cc: Tom Lendacky Cc: Josh Poimboeuf Cc: Andrea Arcangeli Cc: David Woodhouse Cc: Andi Kleen Cc: Dave Hansen Cc: Casey Schaufler Cc: Asit Mallick Cc: Arjan van de Ven Cc: Jon Masters Cc: Waiman Long Cc: Greg KH Cc: Dave Stewart Cc: Kees Cook Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20181125185005.082720373@linutronix.de --- Documentation/admin-guide/kernel-parameters.txt | 32 +++++- arch/x86/include/asm/nospec-branch.h | 10 ++ arch/x86/kernel/cpu/bugs.c | 133 +++++++++++++++++++++--- 3 files changed, 156 insertions(+), 19 deletions(-) (limited to 'Documentation') diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 19f4423e70d91..b6e5b33b9d751 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4194,9 +4194,13 @@ spectre_v2= [X86] Control mitigation of Spectre variant 2 (indirect branch speculation) vulnerability. + The default operation protects the kernel from + user space attacks. - on - unconditionally enable - off - unconditionally disable + on - unconditionally enable, implies + spectre_v2_user=on + off - unconditionally disable, implies + spectre_v2_user=off auto - kernel detects whether your CPU model is vulnerable @@ -4206,6 +4210,12 @@ CONFIG_RETPOLINE configuration option, and the compiler with which the kernel was built. + Selecting 'on' will also enable the mitigation + against user space to user space task attacks. + + Selecting 'off' will disable both the kernel and + the user space protections. + Specific mitigations can also be selected manually: retpoline - replace indirect branches @@ -4215,6 +4225,24 @@ Not specifying this option is equivalent to spectre_v2=auto. + spectre_v2_user= + [X86] Control mitigation of Spectre variant 2 + (indirect branch speculation) vulnerability between + user space tasks + + on - Unconditionally enable mitigations. Is + enforced by spectre_v2=on + + off - Unconditionally disable mitigations. Is + enforced by spectre_v2=off + + auto - Kernel selects the mitigation depending on + the available CPU features and vulnerability. + Default is off. + + Not specifying this option is equivalent to + spectre_v2_user=auto. + spec_store_bypass_disable= [HW] Control Speculative Store Bypass (SSB) Disable mitigation (Speculative Store Bypass vulnerability) diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h index c202a64edd957..be0b0aa780e20 100644 --- a/arch/x86/include/asm/nospec-branch.h +++ b/arch/x86/include/asm/nospec-branch.h @@ -3,6 +3,8 @@ #ifndef _ASM_X86_NOSPEC_BRANCH_H_ #define _ASM_X86_NOSPEC_BRANCH_H_ +#include + #include #include #include @@ -226,6 +228,12 @@ enum spectre_v2_mitigation { SPECTRE_V2_IBRS_ENHANCED, }; +/* The indirect branch speculation control variants */ +enum spectre_v2_user_mitigation { + SPECTRE_V2_USER_NONE, + SPECTRE_V2_USER_STRICT, +}; + /* The Speculative Store Bypass disable variants */ enum ssb_mitigation { SPEC_STORE_BYPASS_NONE, @@ -303,6 +311,8 @@ do { \ preempt_enable(); \ } while (0) +DECLARE_STATIC_KEY_FALSE(switch_to_cond_stibp); + #endif /* __ASSEMBLY__ */ /* diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c index 4f5a6319dca6d..3a223cce1fac1 100644 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -54,6 +54,9 @@ static u64 __ro_after_init x86_spec_ctrl_mask = SPEC_CTRL_IBRS; u64 __ro_after_init x86_amd_ls_cfg_base; u64 __ro_after_init x86_amd_ls_cfg_ssbd_mask; +/* Control conditional STIPB in switch_to() */ +DEFINE_STATIC_KEY_FALSE(switch_to_cond_stibp); + void __init check_bugs(void) { identify_boot_cpu(); @@ -199,6 +202,9 @@ static void x86_amd_ssb_disable(void) static enum spectre_v2_mitigation spectre_v2_enabled __ro_after_init = SPECTRE_V2_NONE; +static enum spectre_v2_user_mitigation spectre_v2_user __ro_after_init = + SPECTRE_V2_USER_NONE; + #ifdef RETPOLINE static bool spectre_v2_bad_module; @@ -237,6 +243,104 @@ enum spectre_v2_mitigation_cmd { SPECTRE_V2_CMD_RETPOLINE_AMD, }; +enum spectre_v2_user_cmd { + SPECTRE_V2_USER_CMD_NONE, + SPECTRE_V2_USER_CMD_AUTO, + SPECTRE_V2_USER_CMD_FORCE, +}; + +static const char * const spectre_v2_user_strings[] = { + [SPECTRE_V2_USER_NONE] = "User space: Vulnerable", + [SPECTRE_V2_USER_STRICT] = "User space: Mitigation: STIBP protection", +}; + +static const struct { + const char *option; + enum spectre_v2_user_cmd cmd; + bool secure; +} v2_user_options[] __initdata = { + { "auto", SPECTRE_V2_USER_CMD_AUTO, false }, + { "off", SPECTRE_V2_USER_CMD_NONE, false }, + { "on", SPECTRE_V2_USER_CMD_FORCE, true }, +}; + +static void __init spec_v2_user_print_cond(const char *reason, bool secure) +{ + if (boot_cpu_has_bug(X86_BUG_SPECTRE_V2) != secure) + pr_info("spectre_v2_user=%s forced on command line.\n", reason); +} + +static enum spectre_v2_user_cmd __init +spectre_v2_parse_user_cmdline(enum spectre_v2_mitigation_cmd v2_cmd) +{ + char arg[20]; + int ret, i; + + switch (v2_cmd) { + case SPECTRE_V2_CMD_NONE: + return SPECTRE_V2_USER_CMD_NONE; + case SPECTRE_V2_CMD_FORCE: + return SPECTRE_V2_USER_CMD_FORCE; + default: + break; + } + + ret = cmdline_find_option(boot_command_line, "spectre_v2_user", + arg, sizeof(arg)); + if (ret < 0) + return SPECTRE_V2_USER_CMD_AUTO; + + for (i = 0; i < ARRAY_SIZE(v2_user_options); i++) { + if (match_option(arg, ret, v2_user_options[i].option)) { + spec_v2_user_print_cond(v2_user_options[i].option, + v2_user_options[i].secure); + return v2_user_options[i].cmd; + } + } + + pr_err("Unknown user space protection option (%s). Switching to AUTO select\n", arg); + return SPECTRE_V2_USER_CMD_AUTO; +} + +static void __init +spectre_v2_user_select_mitigation(enum spectre_v2_mitigation_cmd v2_cmd) +{ + enum spectre_v2_user_mitigation mode = SPECTRE_V2_USER_NONE; + bool smt_possible = IS_ENABLED(CONFIG_SMP); + + if (!boot_cpu_has(X86_FEATURE_IBPB) && !boot_cpu_has(X86_FEATURE_STIBP)) + return; + + if (cpu_smt_control == CPU_SMT_FORCE_DISABLED || + cpu_smt_control == CPU_SMT_NOT_SUPPORTED) + smt_possible = false; + + switch (spectre_v2_parse_user_cmdline(v2_cmd)) { + case SPECTRE_V2_USER_CMD_AUTO: + case SPECTRE_V2_USER_CMD_NONE: + goto set_mode; + case SPECTRE_V2_USER_CMD_FORCE: + mode = SPECTRE_V2_USER_STRICT; + break; + } + + /* Initialize Indirect Branch Prediction Barrier */ + if (boot_cpu_has(X86_FEATURE_IBPB)) { + setup_force_cpu_cap(X86_FEATURE_USE_IBPB); + pr_info("Spectre v2 mitigation: Enabling Indirect Branch Prediction Barrier\n"); + } + + /* If enhanced IBRS is enabled no STIPB required */ + if (spectre_v2_enabled == SPECTRE_V2_IBRS_ENHANCED) + return; + +set_mode: + spectre_v2_user = mode; + /* Only print the STIBP mode when SMT possible */ + if (smt_possible) + pr_info("%s\n", spectre_v2_user_strings[mode]); +} + static const char * const spectre_v2_strings[] = { [SPECTRE_V2_NONE] = "Vulnerable", [SPECTRE_V2_RETPOLINE_GENERIC] = "Mitigation: Full generic retpoline", @@ -385,12 +489,6 @@ specv2_set_mode: setup_force_cpu_cap(X86_FEATURE_RSB_CTXSW); pr_info("Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch\n"); - /* Initialize Indirect Branch Prediction Barrier if supported */ - if (boot_cpu_has(X86_FEATURE_IBPB)) { - setup_force_cpu_cap(X86_FEATURE_USE_IBPB); - pr_info("Spectre v2 mitigation: Enabling Indirect Branch Prediction Barrier\n"); - } - /* * Retpoline means the kernel is safe because it has no indirect * branches. Enhanced IBRS protects firmware too, so, enable restricted @@ -407,23 +505,21 @@ specv2_set_mode: pr_info("Enabling Restricted Speculation for firmware calls\n"); } + /* Set up IBPB and STIBP depending on the general spectre V2 command */ + spectre_v2_user_select_mitigation(cmd); + /* Enable STIBP if appropriate */ arch_smt_update(); } static bool stibp_needed(void) { - if (spectre_v2_enabled == SPECTRE_V2_NONE) - return false; - /* Enhanced IBRS makes using STIBP unnecessary. */ if (spectre_v2_enabled == SPECTRE_V2_IBRS_ENHANCED) return false; - if (!boot_cpu_has(X86_FEATURE_STIBP)) - return false; - - return true; + /* Check for strict user mitigation mode */ + return spectre_v2_user == SPECTRE_V2_USER_STRICT; } static void update_stibp_msr(void *info) @@ -844,10 +940,13 @@ static char *stibp_state(void) if (spectre_v2_enabled == SPECTRE_V2_IBRS_ENHANCED) return ""; - if (x86_spec_ctrl_base & SPEC_CTRL_STIBP) - return ", STIBP"; - else - return ""; + switch (spectre_v2_user) { + case SPECTRE_V2_USER_NONE: + return ", STIBP: disabled"; + case SPECTRE_V2_USER_STRICT: + return ", STIBP: forced"; + } + return ""; } static char *ibpb_state(void) -- cgit v1.2.3 From 9137bb27e60e554dab694eafa4cca241fa3a694f Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Sun, 25 Nov 2018 19:33:53 +0100 Subject: x86/speculation: Add prctl() control for indirect branch speculation Add the PR_SPEC_INDIRECT_BRANCH option for the PR_GET_SPECULATION_CTRL and PR_SET_SPECULATION_CTRL prctls to allow fine grained per task control of indirect branch speculation via STIBP and IBPB. Invocations: Check indirect branch speculation status with - prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, 0, 0, 0); Enable indirect branch speculation with - prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_ENABLE, 0, 0); Disable indirect branch speculation with - prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_DISABLE, 0, 0); Force disable indirect branch speculation with - prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_FORCE_DISABLE, 0, 0); See Documentation/userspace-api/spec_ctrl.rst. Signed-off-by: Tim Chen Signed-off-by: Thomas Gleixner Reviewed-by: Ingo Molnar Cc: Peter Zijlstra Cc: Andy Lutomirski Cc: Linus Torvalds Cc: Jiri Kosina Cc: Tom Lendacky Cc: Josh Poimboeuf Cc: Andrea Arcangeli Cc: David Woodhouse Cc: Andi Kleen Cc: Dave Hansen Cc: Casey Schaufler Cc: Asit Mallick Cc: Arjan van de Ven Cc: Jon Masters Cc: Waiman Long Cc: Greg KH Cc: Dave Stewart Cc: Kees Cook Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20181125185005.866780996@linutronix.de --- Documentation/userspace-api/spec_ctrl.rst | 9 +++++ arch/x86/include/asm/nospec-branch.h | 1 + arch/x86/kernel/cpu/bugs.c | 67 +++++++++++++++++++++++++++++++ arch/x86/kernel/process.c | 5 +++ include/linux/sched.h | 9 +++++ include/uapi/linux/prctl.h | 1 + tools/include/uapi/linux/prctl.h | 1 + 7 files changed, 93 insertions(+) (limited to 'Documentation') diff --git a/Documentation/userspace-api/spec_ctrl.rst b/Documentation/userspace-api/spec_ctrl.rst index 32f3d55c54b75..c4dbe6f7cdae8 100644 --- a/Documentation/userspace-api/spec_ctrl.rst +++ b/Documentation/userspace-api/spec_ctrl.rst @@ -92,3 +92,12 @@ Speculation misfeature controls * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, PR_SPEC_ENABLE, 0, 0); * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, PR_SPEC_DISABLE, 0, 0); * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, PR_SPEC_FORCE_DISABLE, 0, 0); + +- PR_SPEC_INDIR_BRANCH: Indirect Branch Speculation in User Processes + (Mitigate Spectre V2 style attacks against user processes) + + Invocations: + * prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, 0, 0, 0); + * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_ENABLE, 0, 0); + * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_DISABLE, 0, 0); + * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_FORCE_DISABLE, 0, 0); diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h index d4d35baf04308..2adbe7b047fa2 100644 --- a/arch/x86/include/asm/nospec-branch.h +++ b/arch/x86/include/asm/nospec-branch.h @@ -232,6 +232,7 @@ enum spectre_v2_mitigation { enum spectre_v2_user_mitigation { SPECTRE_V2_USER_NONE, SPECTRE_V2_USER_STRICT, + SPECTRE_V2_USER_PRCTL, }; /* The Speculative Store Bypass disable variants */ diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c index 9cab538e10f1d..74359fff87fdb 100644 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -566,6 +566,8 @@ void arch_smt_update(void) case SPECTRE_V2_USER_STRICT: update_stibp_strict(); break; + case SPECTRE_V2_USER_PRCTL: + break; } mutex_unlock(&spec_ctrl_mutex); @@ -752,12 +754,50 @@ static int ssb_prctl_set(struct task_struct *task, unsigned long ctrl) return 0; } +static int ib_prctl_set(struct task_struct *task, unsigned long ctrl) +{ + switch (ctrl) { + case PR_SPEC_ENABLE: + if (spectre_v2_user == SPECTRE_V2_USER_NONE) + return 0; + /* + * Indirect branch speculation is always disabled in strict + * mode. + */ + if (spectre_v2_user == SPECTRE_V2_USER_STRICT) + return -EPERM; + task_clear_spec_ib_disable(task); + task_update_spec_tif(task); + break; + case PR_SPEC_DISABLE: + case PR_SPEC_FORCE_DISABLE: + /* + * Indirect branch speculation is always allowed when + * mitigation is force disabled. + */ + if (spectre_v2_user == SPECTRE_V2_USER_NONE) + return -EPERM; + if (spectre_v2_user == SPECTRE_V2_USER_STRICT) + return 0; + task_set_spec_ib_disable(task); + if (ctrl == PR_SPEC_FORCE_DISABLE) + task_set_spec_ib_force_disable(task); + task_update_spec_tif(task); + break; + default: + return -ERANGE; + } + return 0; +} + int arch_prctl_spec_ctrl_set(struct task_struct *task, unsigned long which, unsigned long ctrl) { switch (which) { case PR_SPEC_STORE_BYPASS: return ssb_prctl_set(task, ctrl); + case PR_SPEC_INDIRECT_BRANCH: + return ib_prctl_set(task, ctrl); default: return -ENODEV; } @@ -790,11 +830,34 @@ static int ssb_prctl_get(struct task_struct *task) } } +static int ib_prctl_get(struct task_struct *task) +{ + if (!boot_cpu_has_bug(X86_BUG_SPECTRE_V2)) + return PR_SPEC_NOT_AFFECTED; + + switch (spectre_v2_user) { + case SPECTRE_V2_USER_NONE: + return PR_SPEC_ENABLE; + case SPECTRE_V2_USER_PRCTL: + if (task_spec_ib_force_disable(task)) + return PR_SPEC_PRCTL | PR_SPEC_FORCE_DISABLE; + if (task_spec_ib_disable(task)) + return PR_SPEC_PRCTL | PR_SPEC_DISABLE; + return PR_SPEC_PRCTL | PR_SPEC_ENABLE; + case SPECTRE_V2_USER_STRICT: + return PR_SPEC_DISABLE; + default: + return PR_SPEC_NOT_AFFECTED; + } +} + int arch_prctl_spec_ctrl_get(struct task_struct *task, unsigned long which) { switch (which) { case PR_SPEC_STORE_BYPASS: return ssb_prctl_get(task); + case PR_SPEC_INDIRECT_BRANCH: + return ib_prctl_get(task); default: return -ENODEV; } @@ -974,6 +1037,8 @@ static char *stibp_state(void) return ", STIBP: disabled"; case SPECTRE_V2_USER_STRICT: return ", STIBP: forced"; + case SPECTRE_V2_USER_PRCTL: + return ""; } return ""; } @@ -986,6 +1051,8 @@ static char *ibpb_state(void) return ", IBPB: disabled"; case SPECTRE_V2_USER_STRICT: return ", IBPB: always-on"; + case SPECTRE_V2_USER_PRCTL: + return ""; } } return ""; diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index afbe2eb4a1c6d..7d31192296a87 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -450,6 +450,11 @@ static unsigned long speculation_ctrl_update_tif(struct task_struct *tsk) set_tsk_thread_flag(tsk, TIF_SSBD); else clear_tsk_thread_flag(tsk, TIF_SSBD); + + if (task_spec_ib_disable(tsk)) + set_tsk_thread_flag(tsk, TIF_SPEC_IB); + else + clear_tsk_thread_flag(tsk, TIF_SPEC_IB); } /* Return the updated threadinfo flags*/ return task_thread_info(tsk)->flags; diff --git a/include/linux/sched.h b/include/linux/sched.h index a51c13c2b1a03..d607db5fcc6a8 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1453,6 +1453,8 @@ static inline bool is_percpu_thread(void) #define PFA_SPREAD_SLAB 2 /* Spread some slab caches over cpuset */ #define PFA_SPEC_SSB_DISABLE 3 /* Speculative Store Bypass disabled */ #define PFA_SPEC_SSB_FORCE_DISABLE 4 /* Speculative Store Bypass force disabled*/ +#define PFA_SPEC_IB_DISABLE 5 /* Indirect branch speculation restricted */ +#define PFA_SPEC_IB_FORCE_DISABLE 6 /* Indirect branch speculation permanently restricted */ #define TASK_PFA_TEST(name, func) \ static inline bool task_##func(struct task_struct *p) \ @@ -1484,6 +1486,13 @@ TASK_PFA_CLEAR(SPEC_SSB_DISABLE, spec_ssb_disable) TASK_PFA_TEST(SPEC_SSB_FORCE_DISABLE, spec_ssb_force_disable) TASK_PFA_SET(SPEC_SSB_FORCE_DISABLE, spec_ssb_force_disable) +TASK_PFA_TEST(SPEC_IB_DISABLE, spec_ib_disable) +TASK_PFA_SET(SPEC_IB_DISABLE, spec_ib_disable) +TASK_PFA_CLEAR(SPEC_IB_DISABLE, spec_ib_disable) + +TASK_PFA_TEST(SPEC_IB_FORCE_DISABLE, spec_ib_force_disable) +TASK_PFA_SET(SPEC_IB_FORCE_DISABLE, spec_ib_force_disable) + static inline void current_restore_flags(unsigned long orig_flags, unsigned long flags) { diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index c0d7ea0bf5b62..b17201edfa09a 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -212,6 +212,7 @@ struct prctl_mm_map { #define PR_SET_SPECULATION_CTRL 53 /* Speculation control variants */ # define PR_SPEC_STORE_BYPASS 0 +# define PR_SPEC_INDIRECT_BRANCH 1 /* Return and control values for PR_SET/GET_SPECULATION_CTRL */ # define PR_SPEC_NOT_AFFECTED 0 # define PR_SPEC_PRCTL (1UL << 0) diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h index c0d7ea0bf5b62..b17201edfa09a 100644 --- a/tools/include/uapi/linux/prctl.h +++ b/tools/include/uapi/linux/prctl.h @@ -212,6 +212,7 @@ struct prctl_mm_map { #define PR_SET_SPECULATION_CTRL 53 /* Speculation control variants */ # define PR_SPEC_STORE_BYPASS 0 +# define PR_SPEC_INDIRECT_BRANCH 1 /* Return and control values for PR_SET/GET_SPECULATION_CTRL */ # define PR_SPEC_NOT_AFFECTED 0 # define PR_SPEC_PRCTL (1UL << 0) -- cgit v1.2.3 From 7cc765a67d8e04ef7d772425ca5a2a1e2b894c15 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Sun, 25 Nov 2018 19:33:54 +0100 Subject: x86/speculation: Enable prctl mode for spectre_v2_user Now that all prerequisites are in place: - Add the prctl command line option - Default the 'auto' mode to 'prctl' - When SMT state changes, update the static key which controls the conditional STIBP evaluation on context switch. - At init update the static key which controls the conditional IBPB evaluation on context switch. Signed-off-by: Thomas Gleixner Reviewed-by: Ingo Molnar Cc: Peter Zijlstra Cc: Andy Lutomirski Cc: Linus Torvalds Cc: Jiri Kosina Cc: Tom Lendacky Cc: Josh Poimboeuf Cc: Andrea Arcangeli Cc: David Woodhouse Cc: Tim Chen Cc: Andi Kleen Cc: Dave Hansen Cc: Casey Schaufler Cc: Asit Mallick Cc: Arjan van de Ven Cc: Jon Masters Cc: Waiman Long Cc: Greg KH Cc: Dave Stewart Cc: Kees Cook Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20181125185005.958421388@linutronix.de --- Documentation/admin-guide/kernel-parameters.txt | 7 ++++- arch/x86/kernel/cpu/bugs.c | 41 +++++++++++++++++++------ 2 files changed, 38 insertions(+), 10 deletions(-) (limited to 'Documentation') diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index b6e5b33b9d751..a9b98a4e87899 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4236,9 +4236,14 @@ off - Unconditionally disable mitigations. Is enforced by spectre_v2=off + prctl - Indirect branch speculation is enabled, + but mitigation can be enabled via prctl + per thread. The mitigation control state + is inherited on fork. + auto - Kernel selects the mitigation depending on the available CPU features and vulnerability. - Default is off. + Default is prctl. Not specifying this option is equivalent to spectre_v2_user=auto. diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c index 74359fff87fdb..d0137d10f9a6e 100644 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -255,11 +255,13 @@ enum spectre_v2_user_cmd { SPECTRE_V2_USER_CMD_NONE, SPECTRE_V2_USER_CMD_AUTO, SPECTRE_V2_USER_CMD_FORCE, + SPECTRE_V2_USER_CMD_PRCTL, }; static const char * const spectre_v2_user_strings[] = { [SPECTRE_V2_USER_NONE] = "User space: Vulnerable", [SPECTRE_V2_USER_STRICT] = "User space: Mitigation: STIBP protection", + [SPECTRE_V2_USER_PRCTL] = "User space: Mitigation: STIBP via prctl", }; static const struct { @@ -270,6 +272,7 @@ static const struct { { "auto", SPECTRE_V2_USER_CMD_AUTO, false }, { "off", SPECTRE_V2_USER_CMD_NONE, false }, { "on", SPECTRE_V2_USER_CMD_FORCE, true }, + { "prctl", SPECTRE_V2_USER_CMD_PRCTL, false }, }; static void __init spec_v2_user_print_cond(const char *reason, bool secure) @@ -324,12 +327,15 @@ spectre_v2_user_select_mitigation(enum spectre_v2_mitigation_cmd v2_cmd) smt_possible = false; switch (spectre_v2_parse_user_cmdline(v2_cmd)) { - case SPECTRE_V2_USER_CMD_AUTO: case SPECTRE_V2_USER_CMD_NONE: goto set_mode; case SPECTRE_V2_USER_CMD_FORCE: mode = SPECTRE_V2_USER_STRICT; break; + case SPECTRE_V2_USER_CMD_AUTO: + case SPECTRE_V2_USER_CMD_PRCTL: + mode = SPECTRE_V2_USER_PRCTL; + break; } /* Initialize Indirect Branch Prediction Barrier */ @@ -340,6 +346,9 @@ spectre_v2_user_select_mitigation(enum spectre_v2_mitigation_cmd v2_cmd) case SPECTRE_V2_USER_STRICT: static_branch_enable(&switch_mm_always_ibpb); break; + case SPECTRE_V2_USER_PRCTL: + static_branch_enable(&switch_mm_cond_ibpb); + break; default: break; } @@ -352,6 +361,12 @@ spectre_v2_user_select_mitigation(enum spectre_v2_mitigation_cmd v2_cmd) if (spectre_v2_enabled == SPECTRE_V2_IBRS_ENHANCED) return; + /* + * If SMT is not possible or STIBP is not available clear the STIPB + * mode. + */ + if (!smt_possible || !boot_cpu_has(X86_FEATURE_STIBP)) + mode = SPECTRE_V2_USER_NONE; set_mode: spectre_v2_user = mode; /* Only print the STIBP mode when SMT possible */ @@ -552,6 +567,15 @@ static void update_stibp_strict(void) on_each_cpu(update_stibp_msr, NULL, 1); } +/* Update the static key controlling the evaluation of TIF_SPEC_IB */ +static void update_indir_branch_cond(void) +{ + if (sched_smt_active()) + static_branch_enable(&switch_to_cond_stibp); + else + static_branch_disable(&switch_to_cond_stibp); +} + void arch_smt_update(void) { /* Enhanced IBRS implies STIBP. No update required. */ @@ -567,6 +591,7 @@ void arch_smt_update(void) update_stibp_strict(); break; case SPECTRE_V2_USER_PRCTL: + update_indir_branch_cond(); break; } @@ -1038,7 +1063,8 @@ static char *stibp_state(void) case SPECTRE_V2_USER_STRICT: return ", STIBP: forced"; case SPECTRE_V2_USER_PRCTL: - return ""; + if (static_key_enabled(&switch_to_cond_stibp)) + return ", STIBP: conditional"; } return ""; } @@ -1046,14 +1072,11 @@ static char *stibp_state(void) static char *ibpb_state(void) { if (boot_cpu_has(X86_FEATURE_IBPB)) { - switch (spectre_v2_user) { - case SPECTRE_V2_USER_NONE: - return ", IBPB: disabled"; - case SPECTRE_V2_USER_STRICT: + if (static_key_enabled(&switch_mm_always_ibpb)) return ", IBPB: always-on"; - case SPECTRE_V2_USER_PRCTL: - return ""; - } + if (static_key_enabled(&switch_mm_cond_ibpb)) + return ", IBPB: conditional"; + return ", IBPB: disabled"; } return ""; } -- cgit v1.2.3 From 6b3e64c237c072797a9ec918654a60e3a46488e2 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Sun, 25 Nov 2018 19:33:55 +0100 Subject: x86/speculation: Add seccomp Spectre v2 user space protection mode If 'prctl' mode of user space protection from spectre v2 is selected on the kernel command-line, STIBP and IBPB are applied on tasks which restrict their indirect branch speculation via prctl. SECCOMP enables the SSBD mitigation for sandboxed tasks already, so it makes sense to prevent spectre v2 user space to user space attacks as well. The Intel mitigation guide documents how STIPB works: Setting bit 1 (STIBP) of the IA32_SPEC_CTRL MSR on a logical processor prevents the predicted targets of indirect branches on any logical processor of that core from being controlled by software that executes (or executed previously) on another logical processor of the same core. Ergo setting STIBP protects the task itself from being attacked from a task running on a different hyper-thread and protects the tasks running on different hyper-threads from being attacked. While the document suggests that the branch predictors are shielded between the logical processors, the observed performance regressions suggest that STIBP simply disables the branch predictor more or less completely. Of course the document wording is vague, but the fact that there is also no requirement for issuing IBPB when STIBP is used points clearly in that direction. The kernel still issues IBPB even when STIBP is used until Intel clarifies the whole mechanism. IBPB is issued when the task switches out, so malicious sandbox code cannot mistrain the branch predictor for the next user space task on the same logical processor. Signed-off-by: Jiri Kosina Signed-off-by: Thomas Gleixner Reviewed-by: Ingo Molnar Cc: Peter Zijlstra Cc: Andy Lutomirski Cc: Linus Torvalds Cc: Tom Lendacky Cc: Josh Poimboeuf Cc: Andrea Arcangeli Cc: David Woodhouse Cc: Tim Chen Cc: Andi Kleen Cc: Dave Hansen Cc: Casey Schaufler Cc: Asit Mallick Cc: Arjan van de Ven Cc: Jon Masters Cc: Waiman Long Cc: Greg KH Cc: Dave Stewart Cc: Kees Cook Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20181125185006.051663132@linutronix.de --- Documentation/admin-guide/kernel-parameters.txt | 9 ++++++++- arch/x86/include/asm/nospec-branch.h | 1 + arch/x86/kernel/cpu/bugs.c | 17 ++++++++++++++++- 3 files changed, 25 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index a9b98a4e87899..f405281bb202c 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4241,9 +4241,16 @@ per thread. The mitigation control state is inherited on fork. + seccomp + - Same as "prctl" above, but all seccomp + threads will enable the mitigation unless + they explicitly opt out. + auto - Kernel selects the mitigation depending on the available CPU features and vulnerability. - Default is prctl. + + Default mitigation: + If CONFIG_SECCOMP=y then "seccomp", otherwise "prctl" Not specifying this option is equivalent to spectre_v2_user=auto. diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h index 2adbe7b047fa2..032b6009baab4 100644 --- a/arch/x86/include/asm/nospec-branch.h +++ b/arch/x86/include/asm/nospec-branch.h @@ -233,6 +233,7 @@ enum spectre_v2_user_mitigation { SPECTRE_V2_USER_NONE, SPECTRE_V2_USER_STRICT, SPECTRE_V2_USER_PRCTL, + SPECTRE_V2_USER_SECCOMP, }; /* The Speculative Store Bypass disable variants */ diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c index d0137d10f9a6e..c9e3049605346 100644 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -256,12 +256,14 @@ enum spectre_v2_user_cmd { SPECTRE_V2_USER_CMD_AUTO, SPECTRE_V2_USER_CMD_FORCE, SPECTRE_V2_USER_CMD_PRCTL, + SPECTRE_V2_USER_CMD_SECCOMP, }; static const char * const spectre_v2_user_strings[] = { [SPECTRE_V2_USER_NONE] = "User space: Vulnerable", [SPECTRE_V2_USER_STRICT] = "User space: Mitigation: STIBP protection", [SPECTRE_V2_USER_PRCTL] = "User space: Mitigation: STIBP via prctl", + [SPECTRE_V2_USER_SECCOMP] = "User space: Mitigation: STIBP via seccomp and prctl", }; static const struct { @@ -273,6 +275,7 @@ static const struct { { "off", SPECTRE_V2_USER_CMD_NONE, false }, { "on", SPECTRE_V2_USER_CMD_FORCE, true }, { "prctl", SPECTRE_V2_USER_CMD_PRCTL, false }, + { "seccomp", SPECTRE_V2_USER_CMD_SECCOMP, false }, }; static void __init spec_v2_user_print_cond(const char *reason, bool secure) @@ -332,10 +335,16 @@ spectre_v2_user_select_mitigation(enum spectre_v2_mitigation_cmd v2_cmd) case SPECTRE_V2_USER_CMD_FORCE: mode = SPECTRE_V2_USER_STRICT; break; - case SPECTRE_V2_USER_CMD_AUTO: case SPECTRE_V2_USER_CMD_PRCTL: mode = SPECTRE_V2_USER_PRCTL; break; + case SPECTRE_V2_USER_CMD_AUTO: + case SPECTRE_V2_USER_CMD_SECCOMP: + if (IS_ENABLED(CONFIG_SECCOMP)) + mode = SPECTRE_V2_USER_SECCOMP; + else + mode = SPECTRE_V2_USER_PRCTL; + break; } /* Initialize Indirect Branch Prediction Barrier */ @@ -347,6 +356,7 @@ spectre_v2_user_select_mitigation(enum spectre_v2_mitigation_cmd v2_cmd) static_branch_enable(&switch_mm_always_ibpb); break; case SPECTRE_V2_USER_PRCTL: + case SPECTRE_V2_USER_SECCOMP: static_branch_enable(&switch_mm_cond_ibpb); break; default: @@ -591,6 +601,7 @@ void arch_smt_update(void) update_stibp_strict(); break; case SPECTRE_V2_USER_PRCTL: + case SPECTRE_V2_USER_SECCOMP: update_indir_branch_cond(); break; } @@ -833,6 +844,8 @@ void arch_seccomp_spec_mitigate(struct task_struct *task) { if (ssb_mode == SPEC_STORE_BYPASS_SECCOMP) ssb_prctl_set(task, PR_SPEC_FORCE_DISABLE); + if (spectre_v2_user == SPECTRE_V2_USER_SECCOMP) + ib_prctl_set(task, PR_SPEC_FORCE_DISABLE); } #endif @@ -864,6 +877,7 @@ static int ib_prctl_get(struct task_struct *task) case SPECTRE_V2_USER_NONE: return PR_SPEC_ENABLE; case SPECTRE_V2_USER_PRCTL: + case SPECTRE_V2_USER_SECCOMP: if (task_spec_ib_force_disable(task)) return PR_SPEC_PRCTL | PR_SPEC_FORCE_DISABLE; if (task_spec_ib_disable(task)) @@ -1063,6 +1077,7 @@ static char *stibp_state(void) case SPECTRE_V2_USER_STRICT: return ", STIBP: forced"; case SPECTRE_V2_USER_PRCTL: + case SPECTRE_V2_USER_SECCOMP: if (static_key_enabled(&switch_to_cond_stibp)) return ", STIBP: conditional"; } -- cgit v1.2.3 From 55a974021ec952ee460dc31ca08722158639de72 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Sun, 25 Nov 2018 19:33:56 +0100 Subject: x86/speculation: Provide IBPB always command line options Provide the possibility to enable IBPB always in combination with 'prctl' and 'seccomp'. Add the extra command line options and rework the IBPB selection to evaluate the command instead of the mode selected by the STIPB switch case. Signed-off-by: Thomas Gleixner Reviewed-by: Ingo Molnar Cc: Peter Zijlstra Cc: Andy Lutomirski Cc: Linus Torvalds Cc: Jiri Kosina Cc: Tom Lendacky Cc: Josh Poimboeuf Cc: Andrea Arcangeli Cc: David Woodhouse Cc: Tim Chen Cc: Andi Kleen Cc: Dave Hansen Cc: Casey Schaufler Cc: Asit Mallick Cc: Arjan van de Ven Cc: Jon Masters Cc: Waiman Long Cc: Greg KH Cc: Dave Stewart Cc: Kees Cook Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20181125185006.144047038@linutronix.de --- Documentation/admin-guide/kernel-parameters.txt | 12 +++++++++ arch/x86/kernel/cpu/bugs.c | 34 +++++++++++++++++-------- 2 files changed, 35 insertions(+), 11 deletions(-) (limited to 'Documentation') diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index f405281bb202c..05a252e5178d7 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4241,11 +4241,23 @@ per thread. The mitigation control state is inherited on fork. + prctl,ibpb + - Like "prctl" above, but only STIBP is + controlled per thread. IBPB is issued + always when switching between different user + space processes. + seccomp - Same as "prctl" above, but all seccomp threads will enable the mitigation unless they explicitly opt out. + seccomp,ibpb + - Like "seccomp" above, but only STIBP is + controlled per thread. IBPB is issued + always when switching between different + user space processes. + auto - Kernel selects the mitigation depending on the available CPU features and vulnerability. diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c index c9e3049605346..500278f5308ee 100644 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -256,7 +256,9 @@ enum spectre_v2_user_cmd { SPECTRE_V2_USER_CMD_AUTO, SPECTRE_V2_USER_CMD_FORCE, SPECTRE_V2_USER_CMD_PRCTL, + SPECTRE_V2_USER_CMD_PRCTL_IBPB, SPECTRE_V2_USER_CMD_SECCOMP, + SPECTRE_V2_USER_CMD_SECCOMP_IBPB, }; static const char * const spectre_v2_user_strings[] = { @@ -271,11 +273,13 @@ static const struct { enum spectre_v2_user_cmd cmd; bool secure; } v2_user_options[] __initdata = { - { "auto", SPECTRE_V2_USER_CMD_AUTO, false }, - { "off", SPECTRE_V2_USER_CMD_NONE, false }, - { "on", SPECTRE_V2_USER_CMD_FORCE, true }, - { "prctl", SPECTRE_V2_USER_CMD_PRCTL, false }, - { "seccomp", SPECTRE_V2_USER_CMD_SECCOMP, false }, + { "auto", SPECTRE_V2_USER_CMD_AUTO, false }, + { "off", SPECTRE_V2_USER_CMD_NONE, false }, + { "on", SPECTRE_V2_USER_CMD_FORCE, true }, + { "prctl", SPECTRE_V2_USER_CMD_PRCTL, false }, + { "prctl,ibpb", SPECTRE_V2_USER_CMD_PRCTL_IBPB, false }, + { "seccomp", SPECTRE_V2_USER_CMD_SECCOMP, false }, + { "seccomp,ibpb", SPECTRE_V2_USER_CMD_SECCOMP_IBPB, false }, }; static void __init spec_v2_user_print_cond(const char *reason, bool secure) @@ -321,6 +325,7 @@ spectre_v2_user_select_mitigation(enum spectre_v2_mitigation_cmd v2_cmd) { enum spectre_v2_user_mitigation mode = SPECTRE_V2_USER_NONE; bool smt_possible = IS_ENABLED(CONFIG_SMP); + enum spectre_v2_user_cmd cmd; if (!boot_cpu_has(X86_FEATURE_IBPB) && !boot_cpu_has(X86_FEATURE_STIBP)) return; @@ -329,17 +334,20 @@ spectre_v2_user_select_mitigation(enum spectre_v2_mitigation_cmd v2_cmd) cpu_smt_control == CPU_SMT_NOT_SUPPORTED) smt_possible = false; - switch (spectre_v2_parse_user_cmdline(v2_cmd)) { + cmd = spectre_v2_parse_user_cmdline(v2_cmd); + switch (cmd) { case SPECTRE_V2_USER_CMD_NONE: goto set_mode; case SPECTRE_V2_USER_CMD_FORCE: mode = SPECTRE_V2_USER_STRICT; break; case SPECTRE_V2_USER_CMD_PRCTL: + case SPECTRE_V2_USER_CMD_PRCTL_IBPB: mode = SPECTRE_V2_USER_PRCTL; break; case SPECTRE_V2_USER_CMD_AUTO: case SPECTRE_V2_USER_CMD_SECCOMP: + case SPECTRE_V2_USER_CMD_SECCOMP_IBPB: if (IS_ENABLED(CONFIG_SECCOMP)) mode = SPECTRE_V2_USER_SECCOMP; else @@ -351,12 +359,15 @@ spectre_v2_user_select_mitigation(enum spectre_v2_mitigation_cmd v2_cmd) if (boot_cpu_has(X86_FEATURE_IBPB)) { setup_force_cpu_cap(X86_FEATURE_USE_IBPB); - switch (mode) { - case SPECTRE_V2_USER_STRICT: + switch (cmd) { + case SPECTRE_V2_USER_CMD_FORCE: + case SPECTRE_V2_USER_CMD_PRCTL_IBPB: + case SPECTRE_V2_USER_CMD_SECCOMP_IBPB: static_branch_enable(&switch_mm_always_ibpb); break; - case SPECTRE_V2_USER_PRCTL: - case SPECTRE_V2_USER_SECCOMP: + case SPECTRE_V2_USER_CMD_PRCTL: + case SPECTRE_V2_USER_CMD_AUTO: + case SPECTRE_V2_USER_CMD_SECCOMP: static_branch_enable(&switch_mm_cond_ibpb); break; default: @@ -364,7 +375,8 @@ spectre_v2_user_select_mitigation(enum spectre_v2_mitigation_cmd v2_cmd) } pr_info("mitigation: Enabling %s Indirect Branch Prediction Barrier\n", - mode == SPECTRE_V2_USER_STRICT ? "always-on" : "conditional"); + static_key_enabled(&switch_mm_always_ibpb) ? + "always-on" : "conditional"); } /* If enhanced IBRS is enabled no STIPB required */ -- cgit v1.2.3 From ce8c80c536dac9f325a051b30bf7730ee505eddc Mon Sep 17 00:00:00 2001 From: Catalin Marinas Date: Mon, 19 Nov 2018 11:27:28 +0000 Subject: arm64: Add workaround for Cortex-A76 erratum 1286807 On the affected Cortex-A76 cores (r0p0 to r3p0), if a virtual address for a cacheable mapping of a location is being accessed by a core while another core is remapping the virtual address to a new physical page using the recommended break-before-make sequence, then under very rare circumstances TLBI+DSB completes before a read using the translation being invalidated has been observed by other observers. The workaround repeats the TLBI+DSB operation and is shared with the Qualcomm Falkor erratum 1009 Reviewed-by: Suzuki K Poulose Signed-off-by: Catalin Marinas --- Documentation/arm64/silicon-errata.txt | 1 + arch/arm64/Kconfig | 25 +++++++++++++++++++++++++ arch/arm64/include/asm/tlbflush.h | 4 ++-- arch/arm64/kernel/cpu_errata.c | 20 +++++++++++++++++--- 4 files changed, 45 insertions(+), 5 deletions(-) (limited to 'Documentation') diff --git a/Documentation/arm64/silicon-errata.txt b/Documentation/arm64/silicon-errata.txt index 76ccded8b74c0..8f95776211447 100644 --- a/Documentation/arm64/silicon-errata.txt +++ b/Documentation/arm64/silicon-errata.txt @@ -57,6 +57,7 @@ stable kernels. | ARM | Cortex-A73 | #858921 | ARM64_ERRATUM_858921 | | ARM | Cortex-A55 | #1024718 | ARM64_ERRATUM_1024718 | | ARM | Cortex-A76 | #1188873 | ARM64_ERRATUM_1188873 | +| ARM | Cortex-A76 | #1286807 | ARM64_ERRATUM_1286807 | | ARM | MMU-500 | #841119,#826419 | N/A | | | | | | | Cavium | ThunderX ITS | #22375, #24313 | CAVIUM_ERRATUM_22375 | diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 787d7850e0643..ea2ab0330e3a1 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -497,6 +497,24 @@ config ARM64_ERRATUM_1188873 If unsure, say Y. +config ARM64_ERRATUM_1286807 + bool "Cortex-A76: Modification of the translation table for a virtual address might lead to read-after-read ordering violation" + default y + select ARM64_WORKAROUND_REPEAT_TLBI + help + This option adds workaround for ARM Cortex-A76 erratum 1286807 + + On the affected Cortex-A76 cores (r0p0 to r3p0), if a virtual + address for a cacheable mapping of a location is being + accessed by a core while another core is remapping the virtual + address to a new physical page using the recommended + break-before-make sequence, then under very rare circumstances + TLBI+DSB completes before a read using the translation being + invalidated has been observed by other observers. The + workaround repeats the TLBI+DSB operation. + + If unsure, say Y. + config CAVIUM_ERRATUM_22375 bool "Cavium erratum 22375, 24313" default y @@ -566,9 +584,16 @@ config QCOM_FALKOR_ERRATUM_1003 is unchanged. Work around the erratum by invalidating the walk cache entries for the trampoline before entering the kernel proper. +config ARM64_WORKAROUND_REPEAT_TLBI + bool + help + Enable the repeat TLBI workaround for Falkor erratum 1009 and + Cortex-A76 erratum 1286807. + config QCOM_FALKOR_ERRATUM_1009 bool "Falkor E1009: Prematurely complete a DSB after a TLBI" default y + select ARM64_WORKAROUND_REPEAT_TLBI help On Falkor v1, the CPU may prematurely complete a DSB following a TLBI xxIS invalidate maintenance operation. Repeat the TLBI operation diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h index c3c0387aee18f..5dfd23897dea9 100644 --- a/arch/arm64/include/asm/tlbflush.h +++ b/arch/arm64/include/asm/tlbflush.h @@ -41,14 +41,14 @@ ALTERNATIVE("nop\n nop", \ "dsb ish\n tlbi " #op, \ ARM64_WORKAROUND_REPEAT_TLBI, \ - CONFIG_QCOM_FALKOR_ERRATUM_1009) \ + CONFIG_ARM64_WORKAROUND_REPEAT_TLBI) \ : : ) #define __TLBI_1(op, arg) asm ("tlbi " #op ", %0\n" \ ALTERNATIVE("nop\n nop", \ "dsb ish\n tlbi " #op ", %0", \ ARM64_WORKAROUND_REPEAT_TLBI, \ - CONFIG_QCOM_FALKOR_ERRATUM_1009) \ + CONFIG_ARM64_WORKAROUND_REPEAT_TLBI) \ : : "r" (arg)) #define __TLBI_N(op, arg, n, ...) __TLBI_##n(op, arg) diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c index a509e35132d22..6ad715d67df89 100644 --- a/arch/arm64/kernel/cpu_errata.c +++ b/arch/arm64/kernel/cpu_errata.c @@ -570,6 +570,20 @@ static const struct midr_range arm64_harden_el2_vectors[] = { #endif +#ifdef CONFIG_ARM64_WORKAROUND_REPEAT_TLBI + +static const struct midr_range arm64_repeat_tlbi_cpus[] = { +#ifdef CONFIG_QCOM_FALKOR_ERRATUM_1009 + MIDR_RANGE(MIDR_QCOM_FALKOR_V1, 0, 0, 0, 0), +#endif +#ifdef CONFIG_ARM64_ERRATUM_1286807 + MIDR_RANGE(MIDR_CORTEX_A76, 0, 0, 3, 0), +#endif + {}, +}; + +#endif + const struct arm64_cpu_capabilities arm64_errata[] = { #if defined(CONFIG_ARM64_ERRATUM_826319) || \ defined(CONFIG_ARM64_ERRATUM_827319) || \ @@ -695,11 +709,11 @@ const struct arm64_cpu_capabilities arm64_errata[] = { .matches = is_kryo_midr, }, #endif -#ifdef CONFIG_QCOM_FALKOR_ERRATUM_1009 +#ifdef CONFIG_ARM64_WORKAROUND_REPEAT_TLBI { - .desc = "Qualcomm Technologies Falkor erratum 1009", + .desc = "Qualcomm erratum 1009, ARM erratum 1286807", .capability = ARM64_WORKAROUND_REPEAT_TLBI, - ERRATA_MIDR_REV(MIDR_QCOM_FALKOR_V1, 0, 0), + ERRATA_MIDR_RANGE_LIST(arm64_repeat_tlbi_cpus), }, #endif #ifdef CONFIG_ARM64_ERRATUM_858921 -- cgit v1.2.3 From e0c274472d5d27f277af722e017525e0b33784cd Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Fri, 30 Nov 2018 14:09:58 -0800 Subject: psi: make disabling/enabling easier for vendor kernels Mel Gorman reports a hackbench regression with psi that would prohibit shipping the suse kernel with it default-enabled, but he'd still like users to be able to opt in at little to no cost to others. With the current combination of CONFIG_PSI and the psi_disabled bool set from the commandline, this is a challenge. Do the following things to make it easier: 1. Add a config option CONFIG_PSI_DEFAULT_DISABLED that allows distros to enable CONFIG_PSI in their kernel but leave the feature disabled unless a user requests it at boot-time. To avoid double negatives, rename psi_disabled= to psi=. 2. Make psi_disabled a static branch to eliminate any branch costs when the feature is disabled. In terms of numbers before and after this patch, Mel says: : The following is a comparision using CONFIG_PSI=n as a baseline against : your patch and a vanilla kernel : : 4.20.0-rc4 4.20.0-rc4 4.20.0-rc4 : kconfigdisable-v1r1 vanilla psidisable-v1r1 : Amean 1 1.3100 ( 0.00%) 1.3923 ( -6.28%) 1.3427 ( -2.49%) : Amean 3 3.8860 ( 0.00%) 4.1230 * -6.10%* 3.8860 ( -0.00%) : Amean 5 6.8847 ( 0.00%) 8.0390 * -16.77%* 6.7727 ( 1.63%) : Amean 7 9.9310 ( 0.00%) 10.8367 * -9.12%* 9.9910 ( -0.60%) : Amean 12 16.6577 ( 0.00%) 18.2363 * -9.48%* 17.1083 ( -2.71%) : Amean 18 26.5133 ( 0.00%) 27.8833 * -5.17%* 25.7663 ( 2.82%) : Amean 24 34.3003 ( 0.00%) 34.6830 ( -1.12%) 32.0450 ( 6.58%) : Amean 30 40.0063 ( 0.00%) 40.5800 ( -1.43%) 41.5087 ( -3.76%) : Amean 32 40.1407 ( 0.00%) 41.2273 ( -2.71%) 39.9417 ( 0.50%) : : It's showing that the vanilla kernel takes a hit (as the bisection : indicated it would) and that disabling PSI by default is reasonably : close in terms of performance for this particular workload on this : particular machine so; Link: http://lkml.kernel.org/r/20181127165329.GA29728@cmpxchg.org Signed-off-by: Johannes Weiner Tested-by: Mel Gorman Reported-by: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/admin-guide/kernel-parameters.txt | 4 ++++ include/linux/psi.h | 3 ++- init/Kconfig | 9 ++++++++ kernel/sched/psi.c | 30 +++++++++++++++++-------- kernel/sched/stats.h | 8 +++---- 5 files changed, 40 insertions(+), 14 deletions(-) (limited to 'Documentation') diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 675170c360788..5d6ba930d4f46 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3505,6 +3505,10 @@ before loading. See Documentation/blockdev/ramdisk.txt. + psi= [KNL] Enable or disable pressure stall information + tracking. + Format: + psmouse.proto= [HW,MOUSE] Highest PS2 mouse protocol extension to probe for; one of (bare|imps|exps|lifebook|any). psmouse.rate= [HW,MOUSE] Set desired mouse report rate, in reports diff --git a/include/linux/psi.h b/include/linux/psi.h index 8e0725aac0aa8..7006008d5b72f 100644 --- a/include/linux/psi.h +++ b/include/linux/psi.h @@ -1,6 +1,7 @@ #ifndef _LINUX_PSI_H #define _LINUX_PSI_H +#include #include #include @@ -9,7 +10,7 @@ struct css_set; #ifdef CONFIG_PSI -extern bool psi_disabled; +extern struct static_key_false psi_disabled; void psi_init(void); diff --git a/init/Kconfig b/init/Kconfig index a4112e95724a0..cf5b5a0dcbc2f 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -509,6 +509,15 @@ config PSI Say N if unsure. +config PSI_DEFAULT_DISABLED + bool "Require boot parameter to enable pressure stall information tracking" + default n + depends on PSI + help + If set, pressure stall information tracking will be disabled + per default but can be enabled through passing psi_enable=1 + on the kernel commandline during boot. + endmenu # "CPU/Task time and stats accounting" config CPU_ISOLATION diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 3d7355d7c3e38..fe24de3fbc938 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -136,8 +136,18 @@ static int psi_bug __read_mostly; -bool psi_disabled __read_mostly; -core_param(psi_disabled, psi_disabled, bool, 0644); +DEFINE_STATIC_KEY_FALSE(psi_disabled); + +#ifdef CONFIG_PSI_DEFAULT_DISABLED +bool psi_enable; +#else +bool psi_enable = true; +#endif +static int __init setup_psi(char *str) +{ + return kstrtobool(str, &psi_enable) == 0; +} +__setup("psi=", setup_psi); /* Running averages - we need to be higher-res than loadavg */ #define PSI_FREQ (2*HZ+1) /* 2 sec intervals */ @@ -169,8 +179,10 @@ static void group_init(struct psi_group *group) void __init psi_init(void) { - if (psi_disabled) + if (!psi_enable) { + static_branch_enable(&psi_disabled); return; + } psi_period = jiffies_to_nsecs(PSI_FREQ); group_init(&psi_system); @@ -549,7 +561,7 @@ void psi_memstall_enter(unsigned long *flags) struct rq_flags rf; struct rq *rq; - if (psi_disabled) + if (static_branch_likely(&psi_disabled)) return; *flags = current->flags & PF_MEMSTALL; @@ -579,7 +591,7 @@ void psi_memstall_leave(unsigned long *flags) struct rq_flags rf; struct rq *rq; - if (psi_disabled) + if (static_branch_likely(&psi_disabled)) return; if (*flags) @@ -600,7 +612,7 @@ void psi_memstall_leave(unsigned long *flags) #ifdef CONFIG_CGROUPS int psi_cgroup_alloc(struct cgroup *cgroup) { - if (psi_disabled) + if (static_branch_likely(&psi_disabled)) return 0; cgroup->psi.pcpu = alloc_percpu(struct psi_group_cpu); @@ -612,7 +624,7 @@ int psi_cgroup_alloc(struct cgroup *cgroup) void psi_cgroup_free(struct cgroup *cgroup) { - if (psi_disabled) + if (static_branch_likely(&psi_disabled)) return; cancel_delayed_work_sync(&cgroup->psi.clock_work); @@ -637,7 +649,7 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to) struct rq_flags rf; struct rq *rq; - if (psi_disabled) { + if (static_branch_likely(&psi_disabled)) { /* * Lame to do this here, but the scheduler cannot be locked * from the outside, so we move cgroups from inside sched/. @@ -673,7 +685,7 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res) { int full; - if (psi_disabled) + if (static_branch_likely(&psi_disabled)) return -EOPNOTSUPP; update_stats(group); diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h index 4904c46770007..aa0de240fb419 100644 --- a/kernel/sched/stats.h +++ b/kernel/sched/stats.h @@ -66,7 +66,7 @@ static inline void psi_enqueue(struct task_struct *p, bool wakeup) { int clear = 0, set = TSK_RUNNING; - if (psi_disabled) + if (static_branch_likely(&psi_disabled)) return; if (!wakeup || p->sched_psi_wake_requeue) { @@ -86,7 +86,7 @@ static inline void psi_dequeue(struct task_struct *p, bool sleep) { int clear = TSK_RUNNING, set = 0; - if (psi_disabled) + if (static_branch_likely(&psi_disabled)) return; if (!sleep) { @@ -102,7 +102,7 @@ static inline void psi_dequeue(struct task_struct *p, bool sleep) static inline void psi_ttwu_dequeue(struct task_struct *p) { - if (psi_disabled) + if (static_branch_likely(&psi_disabled)) return; /* * Is the task being migrated during a wakeup? Make sure to @@ -128,7 +128,7 @@ static inline void psi_ttwu_dequeue(struct task_struct *p) static inline void psi_task_tick(struct rq *rq) { - if (psi_disabled) + if (static_branch_likely(&psi_disabled)) return; if (unlikely(rq->curr->flags & PF_MEMSTALL)) -- cgit v1.2.3 From a3d7e01da06013dc580641a1da57c3b482d58157 Mon Sep 17 00:00:00 2001 From: Florian Fainelli Date: Wed, 28 Nov 2018 13:40:04 -0800 Subject: net: dsa: Fix tagging attribute location While introducing the DSA tagging protocol attribute, it was added to the DSA slave network devices, but those actually see untagged traffic (that is their whole purpose). Correct this mistake by putting the tagging sysfs attribute under the DSA master network device where this is the information that we need. While at it, also correct the sysfs documentation mistake that missed the "dsa/" directory component of the attribute. Fixes: 98cdb4807123 ("net: dsa: Expose tagging protocol to user-space") Signed-off-by: Florian Fainelli Reviewed-by: Andrew Lunn Signed-off-by: David S. Miller --- Documentation/ABI/testing/sysfs-class-net-dsa | 2 +- net/dsa/master.c | 34 ++++++++++++++++++++++++++- net/dsa/slave.c | 28 ---------------------- 3 files changed, 34 insertions(+), 30 deletions(-) (limited to 'Documentation') diff --git a/Documentation/ABI/testing/sysfs-class-net-dsa b/Documentation/ABI/testing/sysfs-class-net-dsa index f240221e071ef..985d84c585c66 100644 --- a/Documentation/ABI/testing/sysfs-class-net-dsa +++ b/Documentation/ABI/testing/sysfs-class-net-dsa @@ -1,4 +1,4 @@ -What: /sys/class/net//tagging +What: /sys/class/net//dsa/tagging Date: August 2018 KernelVersion: 4.20 Contact: netdev@vger.kernel.org diff --git a/net/dsa/master.c b/net/dsa/master.c index c90ee3227deab..5e8c9bef78bd2 100644 --- a/net/dsa/master.c +++ b/net/dsa/master.c @@ -158,8 +158,31 @@ static void dsa_master_ethtool_teardown(struct net_device *dev) cpu_dp->orig_ethtool_ops = NULL; } +static ssize_t tagging_show(struct device *d, struct device_attribute *attr, + char *buf) +{ + struct net_device *dev = to_net_dev(d); + struct dsa_port *cpu_dp = dev->dsa_ptr; + + return sprintf(buf, "%s\n", + dsa_tag_protocol_to_str(cpu_dp->tag_ops)); +} +static DEVICE_ATTR_RO(tagging); + +static struct attribute *dsa_slave_attrs[] = { + &dev_attr_tagging.attr, + NULL +}; + +static const struct attribute_group dsa_group = { + .name = "dsa", + .attrs = dsa_slave_attrs, +}; + int dsa_master_setup(struct net_device *dev, struct dsa_port *cpu_dp) { + int ret; + /* If we use a tagging format that doesn't have an ethertype * field, make sure that all packets from this point on get * sent to the tag format's receive function. @@ -168,11 +191,20 @@ int dsa_master_setup(struct net_device *dev, struct dsa_port *cpu_dp) dev->dsa_ptr = cpu_dp; - return dsa_master_ethtool_setup(dev); + ret = dsa_master_ethtool_setup(dev); + if (ret) + return ret; + + ret = sysfs_create_group(&dev->dev.kobj, &dsa_group); + if (ret) + dsa_master_ethtool_teardown(dev); + + return ret; } void dsa_master_teardown(struct net_device *dev) { + sysfs_remove_group(&dev->dev.kobj, &dsa_group); dsa_master_ethtool_teardown(dev); dev->dsa_ptr = NULL; diff --git a/net/dsa/slave.c b/net/dsa/slave.c index 7d0c19e7edcf0..aec78f5aca72d 100644 --- a/net/dsa/slave.c +++ b/net/dsa/slave.c @@ -1058,27 +1058,6 @@ static struct device_type dsa_type = { .name = "dsa", }; -static ssize_t tagging_show(struct device *d, struct device_attribute *attr, - char *buf) -{ - struct net_device *dev = to_net_dev(d); - struct dsa_port *dp = dsa_slave_to_port(dev); - - return sprintf(buf, "%s\n", - dsa_tag_protocol_to_str(dp->cpu_dp->tag_ops)); -} -static DEVICE_ATTR_RO(tagging); - -static struct attribute *dsa_slave_attrs[] = { - &dev_attr_tagging.attr, - NULL -}; - -static const struct attribute_group dsa_group = { - .name = "dsa", - .attrs = dsa_slave_attrs, -}; - static void dsa_slave_phylink_validate(struct net_device *dev, unsigned long *supported, struct phylink_link_state *state) @@ -1374,14 +1353,8 @@ int dsa_slave_create(struct dsa_port *port) goto out_phy; } - ret = sysfs_create_group(&slave_dev->dev.kobj, &dsa_group); - if (ret) - goto out_unreg; - return 0; -out_unreg: - unregister_netdev(slave_dev); out_phy: rtnl_lock(); phylink_disconnect_phy(p->dp->pl); @@ -1405,7 +1378,6 @@ void dsa_slave_destroy(struct net_device *slave_dev) rtnl_unlock(); dsa_slave_notify(slave_dev, DSA_PORT_UNREGISTER); - sysfs_remove_group(&slave_dev->dev.kobj, &dsa_group); unregister_netdev(slave_dev); phylink_destroy(dp->pl); free_percpu(p->stats64); -- cgit v1.2.3 From 52ea899637c746984d657b508da6e3f2686adfca Mon Sep 17 00:00:00 2001 From: Peter Hutterer Date: Wed, 5 Dec 2018 10:42:21 +1000 Subject: Input: add `REL_WHEEL_HI_RES` and `REL_HWHEEL_HI_RES` This event code represents scroll reports from high-resolution wheels and is modelled after the approach Windows uses. The value 120 is one detent (wheel click) of movement. Mice with higher-resolution scrolling can send fractions of 120 which must be accumulated in userspace. Userspace can either wait for a full 120 to accumulate or scroll by fractions of one logical scroll movement as the events come in. 120 was picked as magic number because it has a high number of integer fractions that can be used by high-resolution wheels. For more information see https://docs.microsoft.com/en-us/previous-versions/windows/hardware/design/dn613912(v=vs.85) These new axes obsolete REL_WHEEL and REL_HWHEEL. The legacy axes are emulated by the kernel but the most accurate (and most granular) data is available through the new axes. Signed-off-by: Peter Hutterer Acked-by: Dmitry Torokhov Verified-by: Harry Cutts Signed-off-by: Benjamin Tissoires --- Documentation/input/event-codes.rst | 21 ++++++++++++++++++++- include/uapi/linux/input-event-codes.h | 2 ++ 2 files changed, 22 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/input/event-codes.rst b/Documentation/input/event-codes.rst index a8c0873beb952..b24b5343f5eb3 100644 --- a/Documentation/input/event-codes.rst +++ b/Documentation/input/event-codes.rst @@ -190,7 +190,26 @@ A few EV_REL codes have special meanings: * REL_WHEEL, REL_HWHEEL: - These codes are used for vertical and horizontal scroll wheels, - respectively. + respectively. The value is the number of detents moved on the wheel, the + physical size of which varies by device. For high-resolution wheels + this may be an approximation based on the high-resolution scroll events, + see REL_WHEEL_HI_RES. These event codes are legacy codes and + REL_WHEEL_HI_RES and REL_HWHEEL_HI_RES should be preferred where + available. + +* REL_WHEEL_HI_RES, REL_HWHEEL_HI_RES: + + - High-resolution scroll wheel data. The accumulated value 120 represents + movement by one detent. For devices that do not provide high-resolution + scrolling, the value is always a multiple of 120. For devices with + high-resolution scrolling, the value may be a fraction of 120. + + If a vertical scroll wheel supports high-resolution scrolling, this code + will be emitted in addition to REL_WHEEL or REL_HWHEEL. The REL_WHEEL + and REL_HWHEEL may be an approximation based on the high-resolution + scroll events. There is no guarantee that the high-resolution data + is a multiple of 120 at the time of an emulated REL_WHEEL or REL_HWHEEL + event. EV_ABS ------ diff --git a/include/uapi/linux/input-event-codes.h b/include/uapi/linux/input-event-codes.h index 3eb5a4c3d60a9..265ef20286605 100644 --- a/include/uapi/linux/input-event-codes.h +++ b/include/uapi/linux/input-event-codes.h @@ -716,6 +716,8 @@ * the situation described above. */ #define REL_RESERVED 0x0a +#define REL_WHEEL_HI_RES 0x0b +#define REL_HWHEEL_HI_RES 0x0c #define REL_MAX 0x0f #define REL_CNT (REL_MAX+1) -- cgit v1.2.3