summaryrefslogtreecommitdiff
path: root/net/ipv6
AgeCommit message (Collapse)Author
2016-03-04rtnl: RTM_GETNETCONF: fix wrong return valueAnton Protopopov
[ Upstream commit a97eb33ff225f34a8124774b3373fd244f0e83ce ] An error response from a RTM_GETNETCONF request can return the positive error value EINVAL in the struct nlmsgerr that can mislead userspace. Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-03-04ipv6: fix a lockdep splatEric Dumazet
[ Upstream commit 44c3d0c1c0a880354e9de5d94175742e2c7c9683 ] Silence lockdep false positive about rcu_dereference() being used in the wrong context. First one should use rcu_dereference_protected() as we own the spinlock. Second one should be a normal assignation, as no barrier is needed. Fixes: 18367681a10bd ("ipv6 flowlabel: Convert np->ipv6_fl_list to RCU.") Reported-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-03-04ipv6: addrconf: Fix recursive spin lock callsubashab@codeaurora.org
[ Upstream commit 16186a82de1fdd868255448274e64ae2616e2640 ] A rcu stall with the following backtrace was seen on a system with forwarding, optimistic_dad and use_optimistic set. To reproduce, set these flags and allow ipv6 autoconf. This occurs because the device write_lock is acquired while already holding the read_lock. Back trace below - INFO: rcu_preempt self-detected stall on CPU { 1} (t=2100 jiffies g=3992 c=3991 q=4471) <6> Task dump for CPU 1: <2> kworker/1:0 R running task 12168 15 2 0x00000002 <2> Workqueue: ipv6_addrconf addrconf_dad_work <6> Call trace: <2> [<ffffffc000084da8>] el1_irq+0x68/0xdc <2> [<ffffffc000cc4e0c>] _raw_write_lock_bh+0x20/0x30 <2> [<ffffffc000bc5dd8>] __ipv6_dev_ac_inc+0x64/0x1b4 <2> [<ffffffc000bcbd2c>] addrconf_join_anycast+0x9c/0xc4 <2> [<ffffffc000bcf9f0>] __ipv6_ifa_notify+0x160/0x29c <2> [<ffffffc000bcfb7c>] ipv6_ifa_notify+0x50/0x70 <2> [<ffffffc000bd035c>] addrconf_dad_work+0x314/0x334 <2> [<ffffffc0000b64c8>] process_one_work+0x244/0x3fc <2> [<ffffffc0000b7324>] worker_thread+0x2f8/0x418 <2> [<ffffffc0000bb40c>] kthread+0xe0/0xec v2: do addrconf_dad_kick inside read lock and then acquire write lock for ipv6_ifa_notify as suggested by Eric Fixes: 7fd2561e4ebdd ("net: ipv6: Add a sysctl to make optimistic addresses useful candidates") Cc: Eric Dumazet <edumazet@google.com> Cc: Erik Kline <ek@google.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-03-04net/ipv6: add sysctl option accept_ra_min_hop_limitHangbin Liu
[ Upstream commit 8013d1d7eafb0589ca766db6b74026f76b7f5cb4 ] Commit 6fd99094de2b ("ipv6: Don't reduce hop limit for an interface") disabled accept hop limit from RA if it is smaller than the current hop limit for security stuff. But this behavior kind of break the RFC definition. RFC 4861, 6.3.4. Processing Received Router Advertisements A Router Advertisement field (e.g., Cur Hop Limit, Reachable Time, and Retrans Timer) may contain a value denoting that it is unspecified. In such cases, the parameter should be ignored and the host should continue using whatever value it is already using. If the received Cur Hop Limit value is non-zero, the host SHOULD set its CurHopLimit variable to the received value. So add sysctl option accept_ra_min_hop_limit to let user choose the minimum hop limit value they can accept from RA. And set default to 1 to meet RFC standards. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-03-04ipv6/udp: use sticky pktinfo egress ifindex on connect()Paolo Abeni
[ Upstream commit 1cdda91871470f15e79375991bd2eddc6e86ddb1 ] Currently, the egress interface index specified via IPV6_PKTINFO is ignored by __ip6_datagram_connect(), so that RFC 3542 section 6.7 can be subverted when the user space application calls connect() before sendmsg(). Fix it by initializing properly flowi6_oif in connect() before performing the route lookup. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-03-04ipv6: enforce flowi6_oif usage in ip6_dst_lookup_tail()Paolo Abeni
[ Upstream commit 6f21c96a78b835259546d8f3fb4edff0f651d478 ] The current implementation of ip6_dst_lookup_tail basically ignore the egress ifindex match: if the saddr is set, ip6_route_output() purposefully ignores flowi6_oif, due to the commit d46a9d678e4c ("net: ipv6: Dont add RT6_LOOKUP_F_IFACE flag if saddr set"), if the saddr is 'any' the first route lookup in ip6_dst_lookup_tail fails, but upon failure a second lookup will be performed with saddr set, thus ignoring the ifindex constraint. This commit adds an output route lookup function variant, which allows the caller to specify lookup flags, and modify ip6_dst_lookup_tail() to enforce the ifindex match on the second lookup via said helper. ip6_route_output() becames now a static inline function build on top of ip6_route_output_flags(); as a side effect, out-of-tree modules need now a GPL license to access the output route lookup functionality. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-01-31xfrm: dst_entries_init() per-net dst_opsDan Streetman
[ Upstream commit a8a572a6b5f2a79280d6e302cb3c1cb1fbaeb3e8 ] Remove the dst_entries_init/destroy calls for xfrm4 and xfrm6 dst_ops templates; their dst_entries counters will never be used. Move the xfrm dst_ops initialization from the common xfrm/xfrm_policy.c to xfrm4/xfrm4_policy.c and xfrm6/xfrm6_policy.c, and call dst_entries_init and dst_entries_destroy for each net namespace. The ipv4 and ipv6 xfrms each create dst_ops template, and perform dst_entries_init on the templates. The template values are copied to each net namespace's xfrm.xfrm*_dst_ops. The problem there is the dst_ops pcpuc_entries field is a percpu counter and cannot be used correctly by simply copying it to another object. The result of this is a very subtle bug; changes to the dst entries counter from one net namespace may sometimes get applied to a different net namespace dst entries counter. This is because of how the percpu counter works; it has a main count field as well as a pointer to the percpu variables. Each net namespace maintains its own main count variable, but all point to one set of percpu variables. When any net namespace happens to change one of the percpu variables to outside its small batch range, its count is moved to the net namespace's main count variable. So with multiple net namespaces operating concurrently, the dst_ops entries counter can stray from the actual value that it should be; if counts are consistently moved from one net namespace to another (which my testing showed is likely), then one net namespace winds up with a negative dst_ops count while another winds up with a continually increasing count, eventually reaching its gc_thresh limit, which causes all new traffic on the net namespace to fail with -ENOBUFS. Signed-off-by: Dan Streetman <dan.streetman@canonical.com> Signed-off-by: Dan Streetman <ddstreet@ieee.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-01-31ipv6: update skb->csum when CE mark is propagatedEric Dumazet
[ Upstream commit 34ae6a1aa0540f0f781dd265366036355fdc8930 ] When a tunnel decapsulates the outer header, it has to comply with RFC 6080 and eventually propagate CE mark into inner header. It turns out IP6_ECN_set_ce() does not correctly update skb->csum for CHECKSUM_COMPLETE packets, triggering infamous "hw csum failure" messages and stack traces. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-01-31udp: disallow UFO for sockets with SO_NO_CHECK optionMichal Kubeček
[ Upstream commit 40ba330227ad00b8c0cdf2f425736ff9549cc423 ] Commit acf8dd0a9d0b ("udp: only allow UFO for packets from SOCK_DGRAM sockets") disallows UFO for packets sent from raw sockets. We need to do the same also for SOCK_DGRAM sockets with SO_NO_CHECK options, even if for a bit different reason: while such socket would override the CHECKSUM_PARTIAL set by ip_ufo_append_data(), gso_size is still set and bad offloading flags warning is triggered in __skb_gso_segment(). In the IPv6 case, SO_NO_CHECK option is ignored but we need to disallow UFO for packets sent by sockets with UDP_NO_CHECK6_TX option. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Tested-by: Shannon Nelson <shannon.nelson@intel.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-01-31ipv6: tcp: add rcu locking in tcp_v6_send_synack()Eric Dumazet
[ Upstream commit 3e4006f0b86a5ae5eb0e8215f9a9e1db24506977 ] When first SYNACK is sent, we already hold rcu_read_lock(), but this is not true if a SYNACK is retransmitted, as a timer (soft) interrupt does not hold rcu_read_lock() Fixes: 45f6fad84cc30 ("ipv6: add complete rcu protection around np->opt") Reported-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-01-31addrconf: always initialize sysctl table dataWANG Cong
[ Upstream commit 5449a5ca9bc27dd51a462de7ca0b1cd861cd2bd0 ] When sysctl performs restrict writes, it allows to write from a middle position of a sysctl file, which requires us to initialize the table data before calling proc_dostring() for the write case. Fixes: 3d1bec99320d ("ipv6: introduce secret_stable to ipv6_devconf") Reported-by: Sasha Levin <sasha.levin@oracle.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Tested-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-01-31ipv6/addrlabel: fix ip6addrlbl_get()Andrey Ryabinin
[ Upstream commit e459dfeeb64008b2d23bdf600f03b3605dbb8152 ] ip6addrlbl_get() has never worked. If ip6addrlbl_hold() succeeded, ip6addrlbl_get() will exit with '-ESRCH'. If ip6addrlbl_hold() failed, ip6addrlbl_get() will use about to be free ip6addrlbl_entry pointer. Fix this by inverting ip6addrlbl_hold() check. Fixes: 2a8cc6c89039 ("[IPV6] ADDRCONF: Support RFC3484 configurable address selection policy table.") Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Cong Wang <cwang@twopensource.com> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-01-22ipv6: automatically enable stable privacy mode if stable_secret setHannes Frederic Sowa
[ Upstream commit 9b29c6962b70f232cde4076b1020191e1be0889d ] Bjørn reported that while we switch all interfaces to privacy stable mode when setting the secret, we don't set this mode for new interfaces. This does not make sense, so change this behaviour. Fixes: 622c81d57b392cc ("ipv6: generation of stable privacy addresses for link-local and autoconf") Reported-by: Bjørn Mork <bjorn@mork.no> Cc: Bjørn Mork <bjorn@mork.no> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-01-22net: fix IP early demux racesEric Dumazet
[ Upstream commit 5037e9ef9454917b047f9f3a19b4dd179fbf7cd4 ] David Wilder reported crashes caused by dst reuse. <quote David> I am seeing a crash on a distro V4.2.3 kernel caused by a double release of a dst_entry. In ipv4_dst_destroy() the call to list_empty() finds a poisoned next pointer, indicating the dst_entry has already been removed from the list and freed. The crash occurs 18 to 24 hours into a run of a network stress exerciser. </quote> Thanks to his detailed report and analysis, we were able to understand the core issue. IP early demux can associate a dst to skb, after a lookup in TCP/UDP sockets. When socket cache is not properly set, we want to store into sk->sk_dst_cache the dst for future IP early demux lookups, by acquiring a stable refcount on the dst. Problem is this acquisition is simply using an atomic_inc(), which works well, unless the dst was queued for destruction from dst_release() noticing dst refcount went to zero, if DST_NOCACHE was set on dst. We need to make sure current refcount is not zero before incrementing it, or risk double free as David reported. This patch, being a stable candidate, adds two new helpers, and use them only from IP early demux problematic paths. It might be possible to merge in net-next skb_dst_force() and skb_dst_force_safe(), but I prefer having the smallest patch for stable kernels : Maybe some skb_dst_force() callers do not expect skb->dst can suddenly be cleared. Can probably be backported back to linux-3.6 kernels Reported-by: David J. Wilder <dwilder@us.ibm.com> Tested-by: David J. Wilder <dwilder@us.ibm.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-01-22net: add validation for the socket syscall protocol argumentHannes Frederic Sowa
[ Upstream commit 79462ad02e861803b3840cc782248c7359451cd9 ] 郭永刚 reported that one could simply crash the kernel as root by using a simple program: int socket_fd; struct sockaddr_in addr; addr.sin_port = 0; addr.sin_addr.s_addr = INADDR_ANY; addr.sin_family = 10; socket_fd = socket(10,3,0x40000000); connect(socket_fd , &addr,16); AF_INET, AF_INET6 sockets actually only support 8-bit protocol identifiers. inet_sock's skc_protocol field thus is sized accordingly, thus larger protocol identifiers simply cut off the higher bits and store a zero in the protocol fields. This could lead to e.g. NULL function pointer because as a result of the cut off inet_num is zero and we call down to inet_autobind, which is NULL for raw sockets. kernel: Call Trace: kernel: [<ffffffff816db90e>] ? inet_autobind+0x2e/0x70 kernel: [<ffffffff816db9a4>] inet_dgram_connect+0x54/0x80 kernel: [<ffffffff81645069>] SYSC_connect+0xd9/0x110 kernel: [<ffffffff810ac51b>] ? ptrace_notify+0x5b/0x80 kernel: [<ffffffff810236d8>] ? syscall_trace_enter_phase2+0x108/0x200 kernel: [<ffffffff81645e0e>] SyS_connect+0xe/0x10 kernel: [<ffffffff81779515>] tracesys_phase2+0x84/0x89 I found no particular commit which introduced this problem. CVE: CVE-2015-8543 Cc: Cong Wang <cwang@twopensource.com> Reported-by: 郭永刚 <guoyonggang@360.cn> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-01-22ipv6: keep existing flags when setting IFA_F_OPTIMISTICBjørn Mork
[ Upstream commit 9a1ec4612c9bfc94d4185e3459055a37a685e575 ] Commit 64236f3f3d74 ("ipv6: introduce IFA_F_STABLE_PRIVACY flag") failed to update the setting of the IFA_F_OPTIMISTIC flag, causing the IFA_F_STABLE_PRIVACY flag to be lost if IFA_F_OPTIMISTIC is set. Cc: Erik Kline <ek@google.com> Cc: Fernando Gont <fgont@si6networks.com> Cc: Lorenzo Colitti <lorenzo@google.com> Cc: YOSHIFUJI Hideaki/吉藤英明 <hideaki.yoshifuji@miraclelinux.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Fixes: 64236f3f3d74 ("ipv6: introduce IFA_F_STABLE_PRIVACY flag") Signed-off-by: Bjørn Mork <bjorn@mork.no> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-01-22gre6: allow to update all parameters via rtnlNicolas Dichtel
[ Upstream commit 6a61d4dbf4f54b5683e0f1e58d873cecca7cb977 ] Parameters were updated only if the kernel was unable to find the tunnel with the new parameters, ie only if core pamareters were updated (keys, addr, link, type). Now it's possible to update ttl, hoplimit, flowinfo and flags. Fixes: c12b395a4664 ("gre: Support GRE over IPv6") Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14ipv6: add complete rcu protection around np->optEric Dumazet
[ Upstream commit 45f6fad84cc305103b28d73482b344d7f5b76f39 ] This patch addresses multiple problems : UDP/RAW sendmsg() need to get a stable struct ipv6_txoptions while socket is not locked : Other threads can change np->opt concurrently. Dmitry posted a syzkaller (http://github.com/google/syzkaller) program desmonstrating use-after-free. Starting with TCP/DCCP lockless listeners, tcp_v6_syn_recv_sock() and dccp_v6_request_recv_sock() also need to use RCU protection to dereference np->opt once (before calling ipv6_dup_options()) This patch adds full RCU protection to np->opt Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14ipv6: distinguish frag queues by device for multicast and link-local packetsMichal Kubeček
[ Upstream commit 264640fc2c5f4f913db5c73fa3eb1ead2c45e9d7 ] If a fragmented multicast packet is received on an ethernet device which has an active macvlan on top of it, each fragment is duplicated and received both on the underlying device and the macvlan. If some fragments for macvlan are processed before the whole packet for the underlying device is reassembled, the "overlapping fragments" test in ip6_frag_queue() discards the whole fragment queue. To resolve this, add device ifindex to the search key and require it to match reassembling multicast packets and packets to link-local addresses. Note: similar patch has been already submitted by Yoshifuji Hideaki in http://patchwork.ozlabs.org/patch/220979/ but got lost and forgotten for some reason. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14net: ip6mr: fix static mfc/dev leaks on table destructionNikolay Aleksandrov
[ Upstream commit 4c6980462f32b4f282c5d8e5f7ea8070e2937725 ] Similar to ipv4, when destroying an mrt table the static mfc entries and the static devices are kept, which leads to devices that can never be destroyed (because of refcnt taken) and leaked memory. Make sure that everything is cleaned up on netns destruction. Fixes: 8229efdaef1e ("netns: ip6mr: enable namespace support in ipv6 multicast forwarding code") CC: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Reviewed-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-14snmp: Remove duplicate OUTMCAST stat incrementNeil Horman
[ Upstream commit 41033f029e393a64e81966cbe34d66c6cf8a2e7e ] the OUTMCAST stat is double incremented, getting bumped once in the mcast code itself, and again in the common ip output path. Remove the mcast bump, as its not needed Validated by the reporter, with good results Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reported-by: Claus Jensen <claus.jensen@microsemi.com> CC: Claus Jensen <claus.jensen@microsemi.com> CC: David Miller <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-09ipv6: clean up dev_snmp6 proc entry when we fail to initialize inet6_devSabrina Dubroca
[ Upstream commit 2a189f9e57650e9f310ddf4aad75d66c1233a064 ] In ipv6_add_dev, when addrconf_sysctl_register fails, we do not clean up the dev_snmp6 entry that we have already registered for this device. Call snmp6_unregister_dev in this case. Fixes: a317a2f19da7d ("ipv6: fail early when creating netdev named all or default") Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-09sit: fix sit0 percpu double allocationsEric Dumazet
[ Upstream commit 4ece9009774596ee3df0acba65a324b7ea79387c ] sit0 device allocates its percpu storage twice : - One time in ipip6_tunnel_init() - One time in ipip6_fb_tunnel_init() Thus we leak 48 bytes per possible cpu per network namespace dismantle. ipip6_fb_tunnel_init() can be much simpler and does not return an error, and should be called after register_netdev() Note that ipip6_tunnel_clone_6rd() also needs to be called after register_netdev() (calling ipip6_tunnel_init()) Fixes: ebe084aafb7e ("sit: Use ipip6_tunnel_init as the ndo_init function.") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-03ipv6: fix multipath route replace error recoveryRoopa Prabhu
[ Upstream commit 6b9ea5a64ed5eeb3f68f2e6fcce0ed1179801d1e ] Problem: The ecmp route replace support for ipv6 in the kernel, deletes the existing ecmp route too early, ie when it installs the first nexthop. If there is an error in installing the subsequent nexthops, its too late to recover the already deleted existing route leaving the fib in an inconsistent state. This patch reduces the possibility of this by doing the following: a) Changes the existing multipath route add code to a two stage process: build rt6_infos + insert them ip6_route_add rt6_info creation code is moved into ip6_route_info_create. b) This ensures that most errors are caught during building rt6_infos and we fail early c) Separates multipath add and del code. Because add needs the special two stage mode in a) and delete essentially does not care. d) In any event if the code fails during inserting a route again, a warning is printed (This should be unlikely) Before the patch: $ip -6 route show 3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024 /* Try replacing the route with a duplicate nexthop */ $ip -6 route change 3000:1000:1000:1000::2/128 nexthop via fe80::202:ff:fe00:b dev swp49s0 nexthop via fe80::202:ff:fe00:d dev swp49s1 nexthop via fe80::202:ff:fe00:d dev swp49s1 RTNETLINK answers: File exists $ip -6 route show /* previously added ecmp route 3000:1000:1000:1000::2 dissappears from * kernel */ After the patch: $ip -6 route show 3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024 /* Try replacing the route with a duplicate nexthop */ $ip -6 route change 3000:1000:1000:1000::2/128 nexthop via fe80::202:ff:fe00:b dev swp49s0 nexthop via fe80::202:ff:fe00:d dev swp49s1 nexthop via fe80::202:ff:fe00:d dev swp49s1 RTNETLINK answers: File exists $ip -6 route show 3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024 Fixes: 27596472473a ("ipv6: fix ECMP route replacement") Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-03net/ipv6: Correct PIM6 mrt_lock handlingRichard Laing
[ Upstream commit 25b4a44c19c83d98e8c0807a7ede07c1f28eab8b ] In the IPv6 multicast routing code the mrt_lock was not being released correctly in the MFC iterator, as a result adding or deleting a MIF would cause a hang because the mrt_lock could not be acquired. This fix is a copy of the code for the IPv4 case and ensures that the lock is released correctly. Signed-off-by: Richard Laing <richard.laing@alliedtelesis.co.nz> Acked-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-03ipv6: fix exthdrs offload registration in out_rt pathDaniel Borkmann
[ Upstream commit e41b0bedba0293b9e1e8d1e8ed553104b9693656 ] We previously register IPPROTO_ROUTING offload under inet6_add_offload(), but in error path, we try to unregister it with inet_del_offload(). This doesn't seem correct, it should actually be inet6_del_offload(), also ipv6_exthdrs_offload_exit() from that commit seems rather incorrect (it also uses rthdr_offload twice), but it got removed entirely later on. Fixes: 3336288a9fea ("ipv6: Switch to using new offload infrastructure.") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-03ip6_gre: release cached dst on tunnel removalhuaibin Wang
[ Upstream commit d4257295ba1b389c693b79de857a96e4b7cd8ac0 ] When a tunnel is deleted, the cached dst entry should be released. This problem may prevent the removal of a netns (seen with a x-netns IPv6 gre tunnel): unregister_netdevice: waiting for lo to become free. Usage count = 3 CC: Dmitry Kozlov <xeb@mail.ru> Fixes: c12b395a4664 ("gre: Support GRE over IPv6") Signed-off-by: huaibin Wang <huaibin.wang@6wind.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-29inet: fix possible request socket leakEric Dumazet
[ Upstream commit 3257d8b12f954c462d29de6201664a846328a522 ] In commit b357a364c57c9 ("inet: fix possible panic in reqsk_queue_unlink()"), I missed fact that tcp_check_req() can return the listener socket in one case, and that we must release the request socket refcount or we leak it. Tested: Following packetdrill test template shows the issue 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +0 < S 0:0(0) win 2920 <mss 1460,sackOK,nop,nop> +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK> +.002 < . 1:1(0) ack 21 win 2920 +0 > R 21:21(0) Fixes: b357a364c57c9 ("inet: fix possible panic in reqsk_queue_unlink()") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-29Revert "sit: Add gro callbacks to sit_offload"Herbert Xu
[ Upstream commit fdbf5b097bbd9693a86c0b8bfdd071a9a2117cfc ] This patch reverts 19424e052fb44da2f00d1a868cbb51f3e9f4bbb5 ("sit: Add gro callbacks to sit_offload") because it generates packets that cannot be handled even by our own GSO. Reported-by: Wolfgang Walter <linux@stwm.de> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-29ipv6: lock socket in ip6_datagram_connect()Eric Dumazet
[ Upstream commit 03645a11a570d52e70631838cb786eb4253eb463 ] ip6_datagram_connect() is doing a lot of socket changes without socket being locked. This looks wrong, at least for udp_lib_rehash() which could corrupt lists because of concurrent udp_sk(sk)->udp_portaddr_hash accesses. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-29ipv6: Make MLD packets to only be processed locallyAngga
[ Upstream commit 4c938d22c88a9ddccc8c55a85e0430e9c62b1ac5 ] Before commit daad151263cf ("ipv6: Make ipv6_is_mld() inline and use it from ip6_mc_input().") MLD packets were only processed locally. After the change, a copy of MLD packet goes through ip6_mr_input, causing MRT6MSG_NOCACHE message to be generated to user space. Make MLD packet only processed locally. Fixes: daad151263cf ("ipv6: Make ipv6_is_mld() inline and use it from ip6_mc_input().") Signed-off-by: Hermin Anggawijaya <hermin.anggawijaya@alliedtelesis.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-07-10ip: report the original address of ICMP messagesJulian Anastasov
[ Upstream commit 34b99df4e6256ddafb663c6de0711dceceddfe0e ] ICMP messages can trigger ICMP and local errors. In this case serr->port is 0 and starting from Linux 4.0 we do not return the original target address to the error queue readers. Add function to define which errors provide addr_offset. With this fix my ping command is not silent anymore. Fixes: c247f0534cc5 ("ip: fix error queue empty skb handling") Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-06-10Revert "ipv6: Fix protocol resubmission"David S. Miller
This reverts commit 0243508edd317ff1fa63b495643a7c192fbfcd92. It introduces new regressions. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-08ipv6: Fix protocol resubmissionJosh Hunt
UDP encapsulation is broken on IPv6. This is because the logic to resubmit the nexthdr is inverted, checking for a ret value > 0 instead of < 0. Also, the resubmit label is in the wrong position since we already get the nexthdr value when performing decapsulation. In addition the skb pull is no longer necessary either. This changes the return value check to look for < 0, using it for the nexthdr on the next iteration, and moves the resubmit label to the proper location. With these changes the v6 code now matches what we do in the v4 ip input code wrt resubmitting when decapsulating. Signed-off-by: Josh Hunt <johunt@akamai.com> Acked-by: "Tom Herbert" <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-08ipv6: fix possible use after free of dev statsRobert Shearman
The memory pointed to by idev->stats.icmpv6msgdev, idev->stats.icmpv6dev and idev->stats.ipv6 can each be used in an RCU read context without taking a reference on idev. For example, through IP6_*_STATS_* calls in ip6_rcv. These memory blocks are freed without waiting for an RCU grace period to elapse. This could lead to the memory being written to after it has been freed. Fix this by using call_rcu to free the memory used for stats, as well as idev after an RCU grace period has elapsed. Signed-off-by: Robert Shearman <rshearma@brocade.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-01vti6: Add pmtu handling to vti6_xmit.Steffen Klassert
We currently rely on the PMTU discovery of xfrm. However if a packet is localy sent, the PMTU mechanism of xfrm tries to to local socket notification what might not work for applications like ping that don't check for this. So add pmtu handling to vti6_xmit to report MTU changes immediately. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-31udp: fix behavior of wrong checksumsEric Dumazet
We have two problems in UDP stack related to bogus checksums : 1) We return -EAGAIN to application even if receive queue is not empty. This breaks applications using edge trigger epoll() 2) Under UDP flood, we can loop forever without yielding to other processes, potentially hanging the host, especially on non SMP. This patch is an attempt to make things better. We might in the future add extra support for rt applications wanting to better control time spent doing a recv() in a hostile environment. For example we could validate checksums before queuing packets in socket receive queue. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-28Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2015-05-28 1) Fix a race in xfrm_state_lookup_byspi, we need to take the refcount before we release xfrm_state_lock. From Li RongQing. 2) Fix IV generation on ESN state. We used just the low order sequence numbers for IV generation on ESN, as a result the IV can repeat on the same state. Fix this by using the high order sequence number bits too and make sure to always initialize the high order bits with zero. These patches are serious stable candidates. Fixes from Herbert Xu. 3) Fix the skb->mark handling on vti. We don't reset skb->mark in skb_scrub_packet anymore, so vti must care to restore the original value back after it was used to lookup the vti policy and state. Fixes from Alexander Duyck. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-28ip_vti/ip6_vti: Preserve skb->mark after rcv_cb callAlexander Duyck
The vti6_rcv_cb and vti_rcv_cb calls were leaving the skb->mark modified after completing the function. This resulted in the original skb->mark value being lost. Since we only need skb->mark to be set for xfrm_policy_check we can pull the assignment into the rcv_cb calls and then just restore the original mark after xfrm_policy_check has been completed. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2015-05-28ip_vti/ip6_vti: Do not touch skb->mark on xmitAlexander Duyck
Instead of modifying skb->mark we can simply modify the flowi_mark that is generated as a result of the xfrm_decode_session. By doing this we don't need to actually touch the skb->mark and it can be preserved as it passes out through the tunnel. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2015-05-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contain Netfilter fixes for your net tree, they are: 1) Fix a race in nfnetlink_log and nfnetlink_queue that can lead to a crash. This problem is due to wrong order in the per-net registration and netlink socket events. Patch from Francesco Ruggeri. 2) Make sure that counters that userspace pass us are higher than 0 in all the x_tables frontends. Discovered via Trinity, patch from Dave Jones. 3) Revert a patch for br_netfilter to rely on the conntrack status bits. This breaks stateless IPv6 NAT transformations. Patch from Florian Westphal. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-20ipv6: fix ECMP route replacementMichal Kubeček
When replacing an IPv6 multipath route with "ip route replace", i.e. NLM_F_CREATE | NLM_F_REPLACE, fib6_add_rt2node() replaces only first matching route without fixing its siblings, resulting in corrupted siblings linked list; removing one of the siblings can then end in an infinite loop. IPv6 ECMP implementation is a bit different from IPv4 so that route replacement cannot work in exactly the same way. This should be a reasonable approximation: 1. If the new route is ECMP-able and there is a matching ECMP-able one already, replace it and all its siblings (if any). 2. If the new route is ECMP-able and no matching ECMP-able route exists, replace first matching non-ECMP-able (if any) or just add the new one. 3. If the new route is not ECMP-able, replace first matching non-ECMP-able route (if any) or add the new route. We also need to remove the NLM_F_REPLACE flag after replacing old route(s) by first nexthop of an ECMP route so that each subsequent nexthop does not replace previous one. Fixes: 51ebd3181572 ("ipv6: add support of equal cost multipath (ECMP)") Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-20ipv6: do not delete previously existing ECMP routes if add failsMichal Kubeček
If adding a nexthop of an IPv6 multipath route fails, comment in ip6_route_multipath() says we are going to delete all nexthops already added. However, current implementation deletes even the routes it hasn't even tried to add yet. For example, running ip route add 1234:5678::/64 \ nexthop via fe80::aa dev dummy1 \ nexthop via fe80::bb dev dummy1 \ nexthop via fe80::cc dev dummy1 twice results in removing all routes first command added. Limit the second (delete) run to nexthops that succeeded in the first (add) run. Fixes: 51ebd3181572 ("ipv6: add support of equal cost multipath (ECMP)") Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-20netfilter: ensure number of counters is >0 in do_replace()Dave Jones
After improving setsockopt() coverage in trinity, I started triggering vmalloc failures pretty reliably from this code path: warn_alloc_failed+0xe9/0x140 __vmalloc_node_range+0x1be/0x270 vzalloc+0x4b/0x50 __do_replace+0x52/0x260 [ip_tables] do_ipt_set_ctl+0x15d/0x1d0 [ip_tables] nf_setsockopt+0x65/0x90 ip_setsockopt+0x61/0xa0 raw_setsockopt+0x16/0x60 sock_common_setsockopt+0x14/0x20 SyS_setsockopt+0x71/0xd0 It turns out we don't validate that the num_counters field in the struct we pass in from userspace is initialized. The same problem also exists in ebtables, arptables, ipv6, and the compat variants. Signed-off-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-05-19net/ipv6/udp: Fix ipv6 multicast socket filter regressionHenning Rogge
Commit <5cf3d46192fc> ("udp: Simplify__udp*_lib_mcast_deliver") simplified the filter for incoming IPv6 multicast but removed the check of the local socket address and the UDP destination address. This patch restores the filter to prevent sockets bound to a IPv6 multicast IP to receive other UDP traffic link unicast. Signed-off-by: Henning Rogge <hrogge@gmail.com> Fixes: 5cf3d46192fc ("udp: Simplify__udp*_lib_mcast_deliver") Cc: "David S. Miller" <davem@davemloft.net> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17tcp/ipv6: fix flow label setting in TIME_WAIT stateFlorent Fourcot
commit 1d13a96c74fc ("ipv6: tcp: fix flowlabel value in ACK messages send from TIME_WAIT") added the flow label in the last TCP packets. Unfortunately, it was not casted properly. This patch replace the buggy shift with be32_to_cpu/cpu_to_be32. Fixes: 1d13a96c74fc ("ipv6: tcp: fix flowlabel value in ACK messages") Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14ipv6: Fix udp checksums with raw socketsVlad Yasevich
It was reported that trancerout6 would cause a kernel to crash when trying to compute checksums on raw UDP packets. The cause was the check in __ip6_append_data that would attempt to use partial checksums on the packet. However, raw sockets do not initialize partial checksum fields so partial checksums can't be used. Solve this the same way IPv4 does it. raw sockets pass transhdrlen value of 0 to ip_append_data which causes the checksum to be computed in software. Use the same check in ip6_append_data (check transhdrlen). Reported-by: Wolfgang Walter <linux@stwm.de> CC: Wolfgang Walter <linux@stwm.de> CC: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13esp6: Use high-order sequence number bits for IV generationHerbert Xu
I noticed we were only using the low-order bits for IV generation when ESN is enabled. This is very bad because it means that the IV can repeat. We must use the full 64 bits. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2015-05-09ipv6: Fixed source specific default route handling.Markus Stenberg
If there are only IPv6 source specific default routes present, the host gets -ENETUNREACH on e.g. connect() because ip6_dst_lookup_tail calls ip6_route_output first, and given source address any, it fails, and ip6_route_get_saddr is never called. The change is to use the ip6_route_get_saddr, even if the initial ip6_route_output fails, and then doing ip6_route_output _again_ after we have appropriate source address available. Note that this is '99% fix' to the problem; a correct fix would be to do route lookups only within addrconf.c when picking a source address, and never call ip6_route_output before source address has been populated. Signed-off-by: Markus Stenberg <markus.stenberg@iki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-24inet: fix possible panic in reqsk_queue_unlink()Eric Dumazet
[ 3897.923145] BUG: unable to handle kernel NULL pointer dereference at 0000000000000080 [ 3897.931025] IP: [<ffffffffa9f27686>] reqsk_timer_handler+0x1a6/0x243 There is a race when reqsk_timer_handler() and tcp_check_req() call inet_csk_reqsk_queue_unlink() on the same req at the same time. Before commit fa76ce7328b2 ("inet: get rid of central tcp/dccp listener timer"), listener spinlock was held and race could not happen. To solve this bug, we change reqsk_queue_unlink() to not assume req must be found, and we return a status, to conditionally release a refcount on the request sock. This also means tcp_check_req() in non fastopen case might or not consume req refcount, so tcp_v6_hnd_req() & tcp_v4_hnd_req() have to properly handle this. (Same remark for dccp_check_req() and its callers) inet_csk_reqsk_queue_drop() is now too big to be inlined, as it is called 4 times in tcp and 3 times in dccp. Fixes: fa76ce7328b2 ("inet: get rid of central tcp/dccp listener timer") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>