summaryrefslogtreecommitdiffstats
path: root/net/ipv4
Commit message (Collapse)AuthorAgeFilesLines
* tcp: fix potential double free issue for fastopen_reqWei Wang2017-03-021-3/+12
| | | | | | | | | | | | | | | | | | tp->fastopen_req could potentially be double freed if a malicious user does the following: 1. Enable TCP_FASTOPEN_CONNECT sockopt and do a connect() on the socket. 2. Call connect() with AF_UNSPEC to disconnect the socket. 3. Make this socket a listening socket by calling listen(). 4. Accept incoming connections and generate child sockets. All child sockets will get a copy of the pointer of fastopen_req. 5. Call close() on all sockets. fastopen_req will get freed multiple times. Fixes: 19f6d3f3c842 ("net/tcp-fastopen: Add new API support") Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp/dccp: block BH for SYN processingEric Dumazet2017-03-011-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SYN processing really was meant to be handled from BH. When I got rid of BH blocking while processing socket backlog in commit 5413d1babe8f ("net: do not block BH while processing socket backlog"), I forgot that a malicious user could transition to TCP_LISTEN from a state that allowed (SYN) packets to be parked in the socket backlog while socket is owned by the thread doing the listen() call. Sure enough syzkaller found this and reported the bug ;) ================================= [ INFO: inconsistent lock state ] 4.10.0+ #60 Not tainted --------------------------------- inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage. syz-executor0/5090 [HC0[0]:SC0[0]:HE1:SE1] takes: (&(&hashinfo->ehash_locks[i])->rlock){+.?...}, at: [<ffffffff83a6a370>] spin_lock include/linux/spinlock.h:299 [inline] (&(&hashinfo->ehash_locks[i])->rlock){+.?...}, at: [<ffffffff83a6a370>] inet_ehash_insert+0x240/0xad0 net/ipv4/inet_hashtables.c:407 {IN-SOFTIRQ-W} state was registered at: mark_irqflags kernel/locking/lockdep.c:2923 [inline] __lock_acquire+0xbcf/0x3270 kernel/locking/lockdep.c:3295 lock_acquire+0x241/0x580 kernel/locking/lockdep.c:3753 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151 spin_lock include/linux/spinlock.h:299 [inline] inet_ehash_insert+0x240/0xad0 net/ipv4/inet_hashtables.c:407 reqsk_queue_hash_req net/ipv4/inet_connection_sock.c:753 [inline] inet_csk_reqsk_queue_hash_add+0x1b7/0x2a0 net/ipv4/inet_connection_sock.c:764 tcp_conn_request+0x25cc/0x3310 net/ipv4/tcp_input.c:6399 tcp_v4_conn_request+0x157/0x220 net/ipv4/tcp_ipv4.c:1262 tcp_rcv_state_process+0x802/0x4130 net/ipv4/tcp_input.c:5889 tcp_v4_do_rcv+0x56b/0x940 net/ipv4/tcp_ipv4.c:1433 tcp_v4_rcv+0x2e12/0x3210 net/ipv4/tcp_ipv4.c:1711 ip_local_deliver_finish+0x4ce/0xc40 net/ipv4/ip_input.c:216 NF_HOOK include/linux/netfilter.h:257 [inline] ip_local_deliver+0x1ce/0x710 net/ipv4/ip_input.c:257 dst_input include/net/dst.h:492 [inline] ip_rcv_finish+0xb1d/0x2110 net/ipv4/ip_input.c:396 NF_HOOK include/linux/netfilter.h:257 [inline] ip_rcv+0xd90/0x19c0 net/ipv4/ip_input.c:487 __netif_receive_skb_core+0x1ad1/0x3400 net/core/dev.c:4179 __netif_receive_skb+0x2a/0x170 net/core/dev.c:4217 netif_receive_skb_internal+0x1d6/0x430 net/core/dev.c:4245 napi_skb_finish net/core/dev.c:4602 [inline] napi_gro_receive+0x4e6/0x680 net/core/dev.c:4636 e1000_receive_skb drivers/net/ethernet/intel/e1000/e1000_main.c:4033 [inline] e1000_clean_rx_irq+0x5e0/0x1490 drivers/net/ethernet/intel/e1000/e1000_main.c:4489 e1000_clean+0xb9a/0x2910 drivers/net/ethernet/intel/e1000/e1000_main.c:3834 napi_poll net/core/dev.c:5171 [inline] net_rx_action+0xe70/0x1900 net/core/dev.c:5236 __do_softirq+0x2fb/0xb7d kernel/softirq.c:284 invoke_softirq kernel/softirq.c:364 [inline] irq_exit+0x19e/0x1d0 kernel/softirq.c:405 exiting_irq arch/x86/include/asm/apic.h:658 [inline] do_IRQ+0x81/0x1a0 arch/x86/kernel/irq.c:250 ret_from_intr+0x0/0x20 native_safe_halt+0x6/0x10 arch/x86/include/asm/irqflags.h:53 arch_safe_halt arch/x86/include/asm/paravirt.h:98 [inline] default_idle+0x8f/0x410 arch/x86/kernel/process.c:271 arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:262 default_idle_call+0x36/0x60 kernel/sched/idle.c:96 cpuidle_idle_call kernel/sched/idle.c:154 [inline] do_idle+0x348/0x440 kernel/sched/idle.c:243 cpu_startup_entry+0x18/0x20 kernel/sched/idle.c:345 start_secondary+0x344/0x440 arch/x86/kernel/smpboot.c:272 verify_cpu+0x0/0xfc irq event stamp: 1741 hardirqs last enabled at (1741): [<ffffffff84d49d77>] __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:160 [inline] hardirqs last enabled at (1741): [<ffffffff84d49d77>] _raw_spin_unlock_irqrestore+0xf7/0x1a0 kernel/locking/spinlock.c:191 hardirqs last disabled at (1740): [<ffffffff84d4a732>] __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:108 [inline] hardirqs last disabled at (1740): [<ffffffff84d4a732>] _raw_spin_lock_irqsave+0xa2/0x110 kernel/locking/spinlock.c:159 softirqs last enabled at (1738): [<ffffffff84d4deff>] __do_softirq+0x7cf/0xb7d kernel/softirq.c:310 softirqs last disabled at (1571): [<ffffffff84d4b92c>] do_softirq_own_stack+0x1c/0x30 arch/x86/entry/entry_64.S:902 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&(&hashinfo->ehash_locks[i])->rlock); <Interrupt> lock(&(&hashinfo->ehash_locks[i])->rlock); *** DEADLOCK *** 1 lock held by syz-executor0/5090: #0: (sk_lock-AF_INET6){+.+.+.}, at: [<ffffffff83406b43>] lock_sock include/net/sock.h:1460 [inline] #0: (sk_lock-AF_INET6){+.+.+.}, at: [<ffffffff83406b43>] sock_setsockopt+0x233/0x1e40 net/core/sock.c:683 stack backtrace: CPU: 1 PID: 5090 Comm: syz-executor0 Not tainted 4.10.0+ #60 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:15 [inline] dump_stack+0x292/0x398 lib/dump_stack.c:51 print_usage_bug+0x3ef/0x450 kernel/locking/lockdep.c:2387 valid_state kernel/locking/lockdep.c:2400 [inline] mark_lock_irq kernel/locking/lockdep.c:2602 [inline] mark_lock+0xf30/0x1410 kernel/locking/lockdep.c:3065 mark_irqflags kernel/locking/lockdep.c:2941 [inline] __lock_acquire+0x6dc/0x3270 kernel/locking/lockdep.c:3295 lock_acquire+0x241/0x580 kernel/locking/lockdep.c:3753 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151 spin_lock include/linux/spinlock.h:299 [inline] inet_ehash_insert+0x240/0xad0 net/ipv4/inet_hashtables.c:407 reqsk_queue_hash_req net/ipv4/inet_connection_sock.c:753 [inline] inet_csk_reqsk_queue_hash_add+0x1b7/0x2a0 net/ipv4/inet_connection_sock.c:764 dccp_v6_conn_request+0xada/0x11b0 net/dccp/ipv6.c:380 dccp_rcv_state_process+0x51e/0x1660 net/dccp/input.c:606 dccp_v6_do_rcv+0x213/0x350 net/dccp/ipv6.c:632 sk_backlog_rcv include/net/sock.h:896 [inline] __release_sock+0x127/0x3a0 net/core/sock.c:2052 release_sock+0xa5/0x2b0 net/core/sock.c:2539 sock_setsockopt+0x60f/0x1e40 net/core/sock.c:1016 SYSC_setsockopt net/socket.c:1782 [inline] SyS_setsockopt+0x2fb/0x3a0 net/socket.c:1765 entry_SYSCALL_64_fastpath+0x1f/0xc2 RIP: 0033:0x4458b9 RSP: 002b:00007fe8b26c2b58 EFLAGS: 00000292 ORIG_RAX: 0000000000000036 RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00000000004458b9 RDX: 000000000000001a RSI: 0000000000000001 RDI: 0000000000000006 RBP: 00000000006e2110 R08: 0000000000000010 R09: 0000000000000000 R10: 00000000208c3000 R11: 0000000000000292 R12: 0000000000708000 R13: 0000000020000000 R14: 0000000000001000 R15: 0000000000000000 Fixes: 5413d1babe8f ("net: do not block BH while processing socket backlog") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: route: add missing nla_policy entry for RTA_MARK attributeLiping Zhang2017-03-011-0/+1
| | | | | | | This will add stricter validating for RTA_MARK attribute. Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2017-02-282-3/+5
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: 1) Don't save TIPC header values before the header has been validated, from Jon Paul Maloy. 2) Fix memory leak in RDS, from Zhu Yanjun. 3) We miss to initialize the UID in the flow key in some paths, from Julian Anastasov. 4) Fix latent TOS masking bug in the routing cache removal from years ago, also from Julian. 5) We forget to set the sockaddr port in sctp_copy_local_addr_list(), fix from Xin Long. 6) Missing module ref count drop in packet scheduler actions, from Roman Mashak. 7) Fix RCU annotations in rht_bucket_nested, from Herbert Xu. 8) Fix use after free which happens because L2TP's ipv4 support returns non-zero values from it's backlog_rcv function which ipv4 interprets as protocol values. Fix from Paul Hüber. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (35 commits) qed: Don't use attention PTT for configuring BW qed: Fix race with multiple VFs l2tp: avoid use-after-free caused by l2tp_ip_backlog_recv xfrm: provide correct dst in xfrm_neigh_lookup rhashtable: Fix RCU dereference annotation in rht_bucket_nested rhashtable: Fix use before NULL check in bucket_table_free net sched actions: do not overwrite status of action creation. rxrpc: Kernel calls get stuck in recvmsg net sched actions: decrement module reference count after table flush. lib: Allow compile-testing of parman ipv6: check sk sk_type and protocol early in ip_mroute_set/getsockopt sctp: set sin_port for addr param when checking duplicate address net/mlx4_en: fix overflow in mlx4_en_init_timestamp() netfilter: nft_set_bitmap: incorrect bitmap size net: s2io: fix typo argumnet argument net: vxge: fix typo argumnet argument netfilter: nf_ct_expect: Change __nf_ct_expect_check() return value. ipv4: mask tos for input route ipv4: add missing initialization for flowi4_uid lib: fix spelling mistake: "actualy" -> "actually" ...
| * ipv4: mask tos for input routeJulian Anastasov2017-02-261-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | Restore the lost masking of TOS in input route code to allow ip rules to match it properly. Problem [1] noticed by Shmulik Ladkani <shmulik.ladkani@gmail.com> [1] http://marc.info/?t=137331755300040&r=1&w=2 Fixes: 89aef8921bfb ("ipv4: Delete routing cache.") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv4: add missing initialization for flowi4_uidJulian Anastasov2017-02-262-3/+4
| | | | | | | | | | | | | | | | | | | | | | Avoid matching of random stack value for uid when rules are looked up on input route or when RP filter is used. Problem should affect only setups that use ip rules with uid range. Fixes: 622ec2c9d524 ("net: core: add UID to flows, rules, and routes") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
* | lib/vsprintf.c: remove %Z supportAlexey Dobriyan2017-02-274-4/+4
|/ | | | | | | | | | | | | | | | | | Now that %z is standartised in C99 there is no reason to support %Z. Unlike %L it doesn't even make format strings smaller. Use BUILD_BUG_ON in a couple ATM drivers. In case anyone didn't notice lib/vsprintf.o is about half of SLUB which is in my opinion is quite an achievement. Hopefully this patch inspires someone else to trim vsprintf.c more. Link: http://lkml.kernel.org/r/20170103230126.GA30170@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* tcp: account for ts offset only if tsecr not zeroAlexey Kodanev2017-02-221-1/+2
| | | | | | | | | We can get SYN with zero tsecr, don't apply offset in this case. Fixes: ee684b6f2830 ("tcp: send packets with a socket timestamp") Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: setup timestamp offset when write_seq already setAlexey Kodanev2017-02-221-6/+10
| | | | | | | | | | | | | | | | | | | | | | | | | Found that when randomized tcp offsets are enabled (by default) TCP client can still start new connections without them. Later, if server does active close and re-uses sockets in TIME-WAIT state, new SYN from client can be rejected on PAWS check inside tcp_timewait_state_process(), because either tw_ts_recent or rcv_tsval doesn't really have an offset set. Here is how to reproduce it with LTP netstress tool: netstress -R 1 & netstress -H 127.0.0.1 -lr 1000000 -a1 [...] < S seq 1956977072 win 43690 TS val 295618 ecr 459956970 > . ack 1956911535 win 342 TS val 459967184 ecr 1547117608 < R seq 1956911535 win 0 length 0 +1. < S seq 1956977072 win 43690 TS val 296640 ecr 459956970 > S. seq 657450664 ack 1956977073 win 43690 TS val 459968205 ecr 296640 Fixes: 95a22caee396 ("tcp: randomize tcp timestamp offsets for each connection") Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: Revert "tcp: tcp_probe: use spin_lock_bh()"Eric Dumazet2017-02-211-2/+2
| | | | | | | | | | | | | | | | This reverts commit e70ac171658679ecf6bea4bbd9e9325cd6079d2b. jtcp_rcv_established() is in fact called with hard irq being disabled. Initial bug report from Ricardo Nabinger Sanchez [1] still needs to be investigated, but does not look like a TCP bug. [1] https://www.spinics.net/lists/netdev/msg420960.html Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: kernel test robot <xiaolong.ye@intel.com> Cc: Ricardo Nabinger Sanchez <rnsanchez@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ip: fix IP_CHECKSUM handlingPaolo Abeni2017-02-211-4/+4
| | | | | | | | | | | | | | | | | | | | The skbs processed by ip_cmsg_recv() are not guaranteed to be linear e.g. when sending UDP packets over loopback with MSGMORE. Using csum_partial() on [potentially] the whole skb len is dangerous; instead be on the safe side and use skb_checksum(). Thanks to syzkaller team to detect the issue and provide the reproducer. v1 -> v2: - move the variable declaration in a tighter scope Fixes: ad6f939ab193 ("ip: Add offset parameter to ip_cmsg_recv") Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: use page_ref_inc() in tcp_sendmsg()Eric Dumazet2017-02-171-1/+1
| | | | | | | | sk_page_frag_refill() allocates either a compound page or an order-0 page. We can use page_ref_inc() which is slightly faster than get_page() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: accommodate sequence number to a peer's shrunk receive window caused by ↵Cui, Cheng2017-02-171-2/+5
| | | | | | | | | | | | | | | | | precision loss in window scaling Prevent sending out a left-shifted sequence number from a Linux sender in response to a peer's shrunk receive-window caused by losing least significant bits in window-scaling. Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Cheng Cui <Cheng.Cui@netapp.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'master' of ↵David S. Miller2017-02-168-10/+134
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2017-02-16 1) Make struct xfrm_input_afinfo const, nothing writes to it. From Florian Westphal. 2) Remove all places that write to the afinfo policy backend and make the struct const then. From Florian Westphal. 3) Prepare for packet consuming gro callbacks and add ESP GRO handlers. ESP packets can be decapsulated at the GRO layer then. It saves a round through the stack for each ESP packet. Please note that this has a merge coflict between commit 63fca65d0863 ("net: add confirm_neigh method to dst_ops") from net-next and 3d7d25a68ea5 ("xfrm: policy: remove garbage_collect callback") a2817d8b279b ("xfrm: policy: remove family field") from ipsec-next. The conflict can be solved as it is done in linux-next. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * esp: Add a software GRO codepathSteffen Klassert2017-02-155-1/+129
| | | | | | | | | | | | | | | | | | | | | | | | This patch adds GRO ifrastructure and callbacks for ESP on ipv4 and ipv6. In case the GRO layer detects an ESP packet, the esp{4,6}_gro_receive() function does a xfrm state lookup and calls the xfrm input layer if it finds a matching state. The packet will be decapsulated and reinjected it into layer 2. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * net: Add a skb_gro_flush_final helper.Steffen Klassert2017-02-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Add a skb_gro_flush_final helper to prepare for consuming skbs in call_gro_receive. We will extend this helper to not touch the skb if the skb is consumed by a gro callback with a followup patch. We need this to handle the upcomming IPsec ESP callbacks as they reinject the skb to the napi_gro_receive asynchronous. The handler is used in all gro_receive functions that can call the ESP gro handlers. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * xfrm: policy: make policy backend constFlorian Westphal2017-02-091-1/+1
| | | | | | | | | | Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * xfrm: policy: remove family fieldFlorian Westphal2017-02-091-4/+1
| | | | | | | | | | | | | | Only needed it to register the policy backend at init time. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * xfrm: policy: remove garbage_collect callbackFlorian Westphal2017-02-091-1/+1
| | | | | | | | | | | | | | | | | | Just call xfrm_garbage_collect_deferred() directly. This gets rid of a write to afinfo in register/unregister and allows to constify afinfo later on. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * xfrm: input: constify xfrm_input_afinfoFlorian Westphal2017-02-091-2/+1
| | | | | | | | | | | | | | | | | | | | Nothing writes to these structures (the module owner was not used). While at it, size xfrm_input_afinfo[] by the highest existing xfrm family (INET6), not AF_MAX. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2017-02-162-8/+8
|\ \
| * | tcp: tcp_probe: use spin_lock_bh()Eric Dumazet2017-02-141-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | tcp_rcv_established() can now run in process context. We need to disable BH while acquiring tcp probe spinlock, or risk a deadlock. Fixes: 5413d1babe8f ("net: do not block BH while processing socket backlog") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Ricardo Nabinger Sanchez <rnsanchez@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | NET: Fix /proc/net/arp for AX.25Ralf Baechle2017-02-131-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When sending ARP requests over AX.25 links the hwaddress in the neighbour cache are not getting initialized. For such an incomplete arp entry ax2asc2 will generate an empty string resulting in /proc/net/arp output like the following: $ cat /proc/net/arp IP address HW type Flags HW address Mask Device 192.168.122.1 0x1 0x2 52:54:00:00:5d:5f * ens3 172.20.1.99 0x3 0x0 * bpq0 The missing field will confuse the procfs parsing of arp(8) resulting in incorrect output for the device such as the following: $ arp Address HWtype HWaddress Flags Mask Iface gateway ether 52:54:00:00:5d:5f C ens3 172.20.1.99 (incomplete) ens3 This changes the content of /proc/net/arp to: $ cat /proc/net/arp IP address HW type Flags HW address Mask Device 172.20.1.99 0x3 0x0 * * bpq0 192.168.122.1 0x1 0x2 52:54:00:00:5d:5f * ens3 To do so it change ax2asc to put the string "*" in buf for a NULL address argument. Finally the HW address field is left aligned in a 17 character field (the length of an ethernet HW address in the usual hex notation) for readability. Signed-off-by: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: rename dst_neigh_output back to neigh_outputJulian Anastasov2017-02-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After the dst->pending_confirm flag was removed, we do not need anymore to provide dst arg to dst_neigh_output. So, rename it to neigh_output as before commit 5110effee8fd ("net: Do delayed neigh confirmation."). Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2017-02-112-0/+3
|\| |
| * | igmp, mld: Fix memory leak in igmpv3/mld_del_delrec()Hangbin Liu2017-02-091-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In function igmpv3/mld_add_delrec() we allocate pmc and put it in idev->mc_tomb, so we should free it when we don't need it in del_delrec(). But I removed kfree(pmc) incorrectly in latest two patches. Now fix it. Fixes: 24803f38a5c0 ("igmp: do not remove igmp souce list info when ...") Fixes: 1666d49e1d41 ("mld: do not remove mld souce list info when ...") Reported-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | ping: fix a null pointer dereferenceWANG Cong2017-02-081-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Andrey reported a kernel crash: general protection fault: 0000 [#1] SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 2 PID: 3880 Comm: syz-executor1 Not tainted 4.10.0-rc6+ #124 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 task: ffff880060048040 task.stack: ffff880069be8000 RIP: 0010:ping_v4_push_pending_frames net/ipv4/ping.c:647 [inline] RIP: 0010:ping_v4_sendmsg+0x1acd/0x23f0 net/ipv4/ping.c:837 RSP: 0018:ffff880069bef8b8 EFLAGS: 00010206 RAX: dffffc0000000000 RBX: ffff880069befb90 RCX: 0000000000000000 RDX: 0000000000000018 RSI: ffff880069befa30 RDI: 00000000000000c2 RBP: ffff880069befbb8 R08: 0000000000000008 R09: 0000000000000000 R10: 0000000000000002 R11: 0000000000000000 R12: ffff880069befab0 R13: ffff88006c624a80 R14: ffff880069befa70 R15: 0000000000000000 FS: 00007f6f7c716700(0000) GS:ffff88006de00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000004a6f28 CR3: 000000003a134000 CR4: 00000000000006e0 Call Trace: inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:744 sock_sendmsg_nosec net/socket.c:635 [inline] sock_sendmsg+0xca/0x110 net/socket.c:645 SYSC_sendto+0x660/0x810 net/socket.c:1687 SyS_sendto+0x40/0x50 net/socket.c:1655 entry_SYSCALL_64_fastpath+0x1f/0xc2 This is because we miss a check for NULL pointer for skb_peek() when the queue is empty. Other places already have the same check. Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind") Reported-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | ipv4: fib: Add events for FIB replace and appendIdo Schimmel2017-02-101-13/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The FIB notification chain currently uses the NLM_F_{REPLACE,APPEND} flags to signal routes being replaced or appended. Instead of using netlink flags for in-kernel notifications we can simply introduce two new events in the FIB notification chain. This has the added advantage of making the API cleaner, thereby making it clear that these events should be supported by listeners of the notification chain. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> CC: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | ipv4: fib: Send notification before deleting FIB aliasIdo Schimmel2017-02-101-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a FIB alias is replaced following NLM_F_REPLACE, the ENTRY_ADD notification is sent after the reference on the previous FIB info was dropped. This is problematic as potential listeners might need to access it in their notification blocks. Solve this by sending the notification prior to the deletion of the replaced FIB alias. This is consistent with ENTRY_DEL notifications. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> CC: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | ipv4: fib: Send deletion notification with actual FIB alias typeIdo Schimmel2017-02-101-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a FIB alias is removed, a notification is sent using the type passed from user space - can be RTN_UNSPEC - instead of the actual type of the removed alias. This is problematic for listeners of the FIB notification chain, as several FIB aliases can exist with matching parameters, but the type. Solve this by passing the actual type of the removed FIB alias. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> CC: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | ipv4: fib: Only flush FIB aliases belonging to currently flushed tableIdo Schimmel2017-02-101-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In case the MAIN table is flushed and its trie is shared with the LOCAL table, then we might be flushing FIB aliases belonging to the latter. This can lead to FIB_ENTRY_DEL notifications sent with the wrong table ID. The above doesn't affect current listeners, as the table ID is ignored during entry deletion, but this will change later in the patchset. When flushing a particular table, skip any aliases belonging to a different one. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> CC: Alexander Duyck <alexander.h.duyck@intel.com> CC: Patrick McHardy <kaber@trash.net> Reviewed-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | ipv4: fib: Notify about nexthop status changesIdo Schimmel2017-02-081-0/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a multipath route is hit the kernel doesn't consider nexthops that are DEAD or LINKDOWN when IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN is set. Devices that offload multipath routes need to be made aware of nexthop status changes. Otherwise, the device will keep forwarding packets to non-functional nexthops. Add the FIB_EVENT_NH_{ADD,DEL} events to the fib notification chain, which notify capable devices when they should add or delete a nexthop from their tables. Cc: Roopa Prabhu <roopa@cumulusnetworks.com> Cc: David Ahern <dsa@cumulusnetworks.com> Cc: Andy Gospodarek <andy@greyhouse.net> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Andy Gospodarek <gospo@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | gro_cells: move to net/core/gro_cells.cEric Dumazet2017-02-081-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have many gro cells users, so lets move the code to avoid duplication. This creates a CONFIG_GRO_CELLS option. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2017-02-074-2/+19
|\| | | | | | | | | | | | | | | | | | | | | | | The conflict was an interaction between a bug fix in the netvsc driver in 'net' and an optimization of the RX path in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | udp: properly cope with csum errorsEric Dumazet2017-02-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Dmitry reported that UDP sockets being destroyed would trigger the WARN_ON(atomic_read(&sk->sk_rmem_alloc)); in inet_sock_destruct() It turns out we do not properly destroy skb(s) that have wrong UDP checksum. Thanks again to syzkaller team. Fixes : 7c13f97ffde6 ("udp: do fwd memory scheduling on dequeue") Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | tcp: avoid infinite loop in tcp_splice_read()Eric Dumazet2017-02-061-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Splicing from TCP socket is vulnerable when a packet with URG flag is received and stored into receive queue. __tcp_splice_read() returns 0, and sk_wait_data() immediately returns since there is the problematic skb in queue. This is a nice way to burn cpu (aka infinite loop) and trigger soft lockups. Again, this gem was found by syzkaller tool. Fixes: 9c55e01c0cc8 ("[TCP]: Splice receive support.") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | netlabel: out of bound access in cipso_v4_validate()Eric Dumazet2017-02-041-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | syzkaller found another out of bound access in ip_options_compile(), or more exactly in cipso_v4_validate() Fixes: 20e2a8648596 ("cipso: handle CIPSO options correctly when NetLabel is disabled") Fixes: 446fda4f2682 ("[NetLabel]: CIPSOv4 engine") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Paul Moore <paul@paul-moore.com> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | ipv4: keep skb->dst around in presence of IP optionsEric Dumazet2017-02-041-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Andrey Konovalov got crashes in __ip_options_echo() when a NULL skb->dst is accessed. ipv4_pktinfo_prepare() should not drop the dst if (evil) IP options are present. We could refine the test to the presence of ts_needtime or srr, but IP options are not often used, so let's be conservative. Thanks to syzkaller team for finding this bug. Fixes: d826eb14ecef ("ipv4: PKTINFO doesnt need dst reference") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: use dst_confirm_neigh for UDP, RAW, ICMP, L2TPJulian Anastasov2017-02-074-3/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When same struct dst_entry can be used for many different neighbours we can not use it for pending confirmations. The datagram protocols can use MSG_CONFIRM to confirm the neighbour. When used with MSG_PROBE we do not reach the code where neighbour is confirmed, so we have to do the same slow lookup by using the dst_confirm_neigh() helper. When MSG_PROBE is not used, ip_append_data/ip6_append_data will set the skb flag dst_pending_confirm. Reported-by: YueHaibing <yuehaibing@huawei.com> Fixes: 5110effee8fd ("net: Do delayed neigh confirmation.") Fixes: f2bb4bedf35d ("ipv4: Cache output routes in fib_info nexthops.") Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: add confirm_neigh method to dst_opsJulian Anastasov2017-02-071-0/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add confirm_neigh method to dst_ops and use it from IPv4 and IPv6 to lookup and confirm the neighbour. Its usage via the new helper dst_confirm_neigh() should be restricted to MSG_PROBE users for performance reasons. For XFRM prefer the last tunnel address, if present. With help from Steffen Klassert. Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | tcp: replace dst_confirm with sk_dst_confirmJulian Anastasov2017-02-073-14/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When same struct dst_entry can be used for many different neighbours we can not use it for pending confirmations. Use the new sk_dst_confirm() helper to propagate the indication from received packets to sock_confirm_neigh(). Reported-by: YueHaibing <yuehaibing@huawei.com> Fixes: 5110effee8fd ("net: Do delayed neigh confirmation.") Fixes: f2bb4bedf35d ("ipv4: Cache output routes in fib_info nexthops.") Tested-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: add dst_pending_confirm flag to skbuffJulian Anastasov2017-02-071-1/+4
| |/ |/| | | | | | | | | | | | | | | | | | | | | | | Add new skbuff flag to allow protocols to confirm neighbour. When same struct dst_entry can be used for many different neighbours we can not use it for pending confirmations. Add sock_confirm_neigh() helper to confirm the neighbour and use it for IPv4, IPv6 and VRF before dst_neigh_output. Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller2017-02-039-47/+31
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for your net-next tree, they are: 1) Stash ctinfo 3-bit field into pointer to nf_conntrack object from sk_buff so we only access one single cacheline in the conntrack hotpath. Patchset from Florian Westphal. 2) Don't leak pointer to internal structures when exporting x_tables ruleset back to userspace, from Willem DeBruijn. This includes new helper functions to copy data to userspace such as xt_data_to_user() as well as conversions of our ip_tables, ip6_tables and arp_tables clients to use it. Not surprinsingly, ebtables requires an ad-hoc update. There is also a new field in x_tables extensions to indicate the amount of bytes that we copy to userspace. 3) Add nf_log_all_netns sysctl: This new knob allows you to enable logging via nf_log infrastructure for all existing netnamespaces. Given the effort to provide pernet syslog has been discontinued, let's provide a way to restore logging using netfilter kernel logging facilities in trusted environments. Patch from Michal Kubecek. 4) Validate SCTP checksum from conntrack helper, from Davide Caratti. 5) Merge UDPlite conntrack and NAT helpers into UDP, this was mostly a copy&paste from the original helper, from Florian Westphal. 6) Reset netfilter state when duplicating packets, also from Florian. 7) Remove unnecessary check for broadcast in IPv6 in pkttype match and nft_meta, from Liping Zhang. 8) Add missing code to deal with loopback packets from nft_meta when used by the netdev family, also from Liping. 9) Several cleanups on nf_tables, one to remove unnecessary check from the netlink control plane path to add table, set and stateful objects and code consolidation when unregister chain hooks, from Gao Feng. 10) Fix harmless reference counter underflow in IPVS that, however, results in problems with the introduction of the new refcount_t type, from David Windsor. 11) Enable LIBCRC32C from nf_ct_sctp instead of nf_nat_sctp, from Davide Caratti. 12) Missing documentation on nf_tables uapi header, from Liping Zhang. 13) Use rb_entry() helper in xt_connlimit, from Geliang Tang. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * | netfilter: allow logging from non-init namespacesMichal Kubeček2017-02-022-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 69b34fb996b2 ("netfilter: xt_LOG: add net namespace support for xt_LOG") disabled logging packets using the LOG target from non-init namespaces. The motivation was to prevent containers from flooding kernel log of the host. The plan was to keep it that way until syslog namespace implementation allows containers to log in a safe way. However, the work on syslog namespace seems to have hit a dead end somewhere in 2013 and there are users who want to use xt_LOG in all network namespaces. This patch allows to do so by setting /proc/sys/net/netfilter/nf_log_all_netns to a nonzero value. This sysctl is only accessible from init_net so that one cannot switch the behaviour from inside a container. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: add and use nf_ct_set helperFlorian Westphal2017-02-023-6/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | Add a helper to assign a nf_conn entry and the ctinfo bits to an sk_buff. This avoids changing code in followup patch that merges skb->nfct and skb->nfctinfo into skb->_nfct. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | skbuff: add and use skb_nfct helperFlorian Westphal2017-02-024-8/+8
| | | | | | | | | | | | | | | | | | | | | | | | Followup patch renames skb->nfct and changes its type so add a helper to avoid intrusive rename change later. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: reset netfilter state when duplicating packetFlorian Westphal2017-02-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We should also toss nf_bridge_info, if any -- packet is leaving via ip_local_out, also, this skb isn't bridged -- it is a locally generated copy. Also this avoids the need to touch this later when skb->nfct is replaced with 'unsigned long _nfct' in followup patch. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: conntrack: no need to pass ctinfo to error handlerFlorian Westphal2017-02-021-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is never accessed for reading and the only places that write to it are the icmp(6) handlers, which also set skb->nfct (and skb->nfctinfo). The conntrack core specifically checks for attached skb->nfct after ->error() invocation and returns early in this case. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | xtables: extend matches and targets with .usersizeWillem de Bruijn2017-01-091-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In matches and targets that define a kernel-only tail to their xt_match and xt_target data structs, add a field .usersize that specifies up to where data is to be shared with userspace. Performed a search for comment "Used internally by the kernel" to find relevant matches and targets. Manually inspected the structs to derive a valid offsetof. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | arptables: use match, target and data copy_to_user helpersWillem de Bruijn2017-01-091-10/+5
| | | | | | | | | | | | | | | | | | | | | | | | Convert arptables to copying entries, matches and targets one by one, using the xt_match_to_user and xt_target_to_user helper functions. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>