summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2013-08-20openvswitch: Add vxlan tunneling support.Pravin B Shelar
Following patch adds vxlan vport type for openvswitch using vxlan api. So now there is vxlan dependency for openvswitch. CC: Jesse Gross <jesse@nicira.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Jesse Gross <jesse@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-20vxlan: Factor out vxlan send api.Pravin B Shelar
Following patch allows more code sharing between vxlan and ovs-vxlan. Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-20vxlan: Extend vxlan handlers for openvswitch.Pravin B Shelar
Following patch adds data field to vxlan socket and export vxlan handler api. vh->data is required to store private data per vxlan handler. Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2013-08-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Fix SKB leak in 8139cp, from Dave Jones. 2) Fix use of *_PAGES interfaces with mlx5 firmware, from Moshe Lazar. 3) RCU conversion of macvtap introduced two races, fixes by Eric Dumazet 4) Synchronize statistic flows in bnx2x driver to prevent corruption, from Dmitry Kravkov 5) Undo optimization in IP tunneling, we were using the inner IP header in some cases to inherit the IP ID, but that isn't correct in some circumstances. From Pravin B Shelar 6) Use correct struct size when parsing netlink attributes in rtnl_bridge_getlink(). From Asbjoern Sloth Toennesen 7) Length verifications in tun_get_user() are bogus, from Weiping Pan and Dan Carpenter 8) Fix bad merge resolution during 3.11 networking development in openvswitch, albeit a harmless one which added some unreachable code. From Jesse Gross 9) Wrong size used in flexible array allocation in openvswitch, from Pravin B Shelar 10) Clear out firmware capability flags the be2net driver isn't ready to handle yet, from Sarveshwar Bandi 11) Revert DMA mapping error checking addition to cxgb3 driver, it's buggy. From Alexey Kardashevskiy 12) Fix regression in packet scheduler rate limiting when working with a link layer of ATM. From Jesper Dangaard Brouer 13) Fix several errors in TCP Cubic congestion control, in particular overflow errors in timestamp calculations. From Eric Dumazet and Van Jacobson 14) In ipv6 routing lookups, we need to backtrack if subtree traversal don't result in a match. From Hannes Frederic Sowa 15) ipgre_header() returns incorrect packet offset. Fix from Timo Teräs 16) Get "low latency" out of the new MIB counter names. From Eliezer Tamir 17) State check in ndo_dflt_fdb_del() is inverted, from Sridhar Samudrala 18) Handle TCP Fast Open properly in netfilter conntrack, from Yuchung Cheng 19) Wrong memcpy length in pcan_usb driver, from Stephane Grosjean 20) Fix dealock in TIPC, from Wang Weidong and Ding Tianhong 21) call_rcu() call to destroy SCTP transport is done too early and might result in an oops. From Daniel Borkmann 22) Fix races in genetlink family dumps, from Johannes Berg 23) Flags passed into macvlan by the user need to be validated properly, from Michael S Tsirkin 24) Fix skge build on 32-bit, from Stephen Hemminger 25) Handle malformed TCP headers properly in xt_TCPMSS, from Pablo Neira Ayuso 26) Fix handling of stacked vlans in vlan_dev_real_dev(), from Nikolay Aleksandrov 27) Eliminate MTU calculation overflows in esp{4,6}, from Daniel Borkmann 28) neigh_parms need to be setup before calling the ->ndo_neigh_setup() method. From Veaceslav Falico 29) Kill out-of-bounds prefetch in fib_trie, from Eric Dumazet 30) Don't dereference MLD query message if the length isn't value in the bridge multicast code, from Linus Lüssing 31) Fix VXLAN IGMP join regression due to an inverted check, from Cong Wang * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (70 commits) net/mlx5_core: Support MANAGE_PAGES and QUERY_PAGES firmware command changes tun: signedness bug in tun_get_user() qlcnic: Fix diagnostic interrupt test for 83xx adapters qlcnic: Fix beacon state return status handling qlcnic: Fix set driver version command net: tg3: fix NULL pointer dereference in tg3_io_error_detected and tg3_io_slot_reset net_sched: restore "linklayer atm" handling drivers/net/ethernet/via/via-velocity.c: update napi implementation Revert "cxgb3: Check and handle the dma mapping errors" be2net: Clear any capability flags that driver is not interested in. openvswitch: Reset tunnel key between input and output. openvswitch: Use correct type while allocating flex array. openvswitch: Fix bad merge resolution. tun: compare with 0 instead of total_len rtnetlink: rtnl_bridge_getlink: Call nlmsg_find_attr() with ifinfomsg header ethernet/arc/arc_emac - fix NAPI "work > weight" warning ip_tunnel: Do not use inner ip-header-id for tunnel ip-header-id. bnx2x: prevent crash in shutdown flow with CNIC bnx2x: fix PTE write access error bnx2x: fix memory leak in VF ...
2013-08-16Fix TLB gather virtual address range invalidation corner casesLinus Torvalds
Ben Tebulin reported: "Since v3.7.2 on two independent machines a very specific Git repository fails in 9/10 cases on git-fsck due to an SHA1/memory failures. This only occurs on a very specific repository and can be reproduced stably on two independent laptops. Git mailing list ran out of ideas and for me this looks like some very exotic kernel issue" and bisected the failure to the backport of commit 53a59fc67f97 ("mm: limit mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT"). That commit itself is not actually buggy, but what it does is to make it much more likely to hit the partial TLB invalidation case, since it introduces a new case in tlb_next_batch() that previously only ever happened when running out of memory. The real bug is that the TLB gather virtual memory range setup is subtly buggered. It was introduced in commit 597e1c3580b7 ("mm/mmu_gather: enable tlb flush range in generic mmu_gather"), and the range handling was already fixed at least once in commit e6c495a96ce0 ("mm: fix the TLB range flushed when __tlb_remove_page() runs out of slots"), but that fix was not complete. The problem with the TLB gather virtual address range is that it isn't set up by the initial tlb_gather_mmu() initialization (which didn't get the TLB range information), but it is set up ad-hoc later by the functions that actually flush the TLB. And so any such case that forgot to update the TLB range entries would potentially miss TLB invalidates. Rather than try to figure out exactly which particular ad-hoc range setup was missing (I personally suspect it's the hugetlb case in zap_huge_pmd(), which didn't have the same logic as zap_pte_range() did), this patch just gets rid of the problem at the source: make the TLB range information available to tlb_gather_mmu(), and initialize it when initializing all the other tlb gather fields. This makes the patch larger, but conceptually much simpler. And the end result is much more understandable; even if you want to play games with partial ranges when invalidating the TLB contents in chunks, now the range information is always there, and anybody who doesn't want to bother with it won't introduce subtle bugs. Ben verified that this fixes his problem. Reported-bisected-and-tested-by: Ben Tebulin <tebulin@googlemail.com> Build-testing-by: Stephen Rothwell <sfr@canb.auug.org.au> Build-testing-by: Richard Weinberger <richard.weinberger@gmail.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-15net/mlx5_core: Support MANAGE_PAGES and QUERY_PAGES firmware command changesMoshe Lazer
In the previous QUERY_PAGES command version we used one command to get the required amount of boot, init and post init pages. The new version uses the op_mod field to specify whether the query is for the required amount of boot, init or post init pages. In addition the output field size for the required amount of pages increased from 16 to 32 bits. In MANAGE_PAGES command the input_num_entries and output_num_entries fields sizes changed from 16 to 32 bits and the PAS tables offset changed to 0x10. In the pages request event the num_pages field also changed to 32 bits. In the HCA-capabilities-layout the size and location of max_qp_mcg field has been changed to support 24 bits. This patch isn't compatible with firmware versions < 5; however, it turns out that the first GA firmware we will publish will not support previous versions so this should be OK. Signed-off-by: Moshe Lazer <moshel@mellanox.com> Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-15net_sched: restore "linklayer atm" handlingJesper Dangaard Brouer
commit 56b765b79 ("htb: improved accuracy at high rates") broke the "linklayer atm" handling. tc class add ... htb rate X ceil Y linklayer atm The linklayer setting is implemented by modifying the rate table which is send to the kernel. No direct parameter were transferred to the kernel indicating the linklayer setting. The commit 56b765b79 ("htb: improved accuracy at high rates") removed the use of the rate table system. To keep compatible with older iproute2 utils, this patch detects the linklayer by parsing the rate table. It also supports future versions of iproute2 to send this linklayer parameter to the kernel directly. This is done by using the __reserved field in struct tc_ratespec, to convey the choosen linklayer option, but only using the lower 4 bits of this field. Linklayer detection is limited to speeds below 100Mbit/s, because at high rates the rtab is gets too inaccurate, so bad that several fields contain the same values, this resembling the ATM detect. Fields even start to contain "0" time to send, e.g. at 1000Mbit/s sending a 96 bytes packet cost "0", thus the rtab have been more broken than we first realized. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-15ip6tnl: add x-netns supportNicolas Dichtel
This patch allows to switch the netns when packet is encapsulated or decapsulated. In other word, the encapsulated packet is received in a netns, where the lookup is done to find the tunnel. Once the tunnel is found, the packet is decapsulated and injecting into the corresponding interface which stands to another netns. When one of the two netns is removed, the tunnel is destroyed. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-15ipip: add x-netns supportNicolas Dichtel
This patch allows to switch the netns when packet is encapsulated or decapsulated. In other word, the encapsulated packet is received in a netns, where the lookup is done to find the tunnel. Once the tunnel is found, the packet is decapsulated and injecting into the corresponding interface which stands to another netns. When one of the two netns is removed, the tunnel is destroyed. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-14Merge branch 'akpm' (patches from Andrew Morton)Linus Torvalds
Merge a bunch of fixes from Andrew Morton. * emailed patches from Andrew Morton <akpm@linux-foundation.org>: fs/proc/task_mmu.c: fix buffer overflow in add_page_map() arch: *: Kconfig: add "kernel/Kconfig.freezer" to "arch/*/Kconfig" ocfs2: fix null pointer dereference in ocfs2_dir_foreach_blk_id() x86 get_unmapped_area(): use proper mmap base for bottom-up direction ocfs2: fix NULL pointer dereference in ocfs2_duplicate_clusters_by_page ocfs2: Revert 40bd62e to avoid regression in extended allocation drivers/rtc/rtc-stmp3xxx.c: provide timeout for potentially endless loop polling a HW bit hugetlb: fix lockdep splat caused by pmd sharing aoe: adjust ref of head for compound page tails microblaze: fix clone syscall mm: save soft-dirty bits on file pages mm: save soft-dirty bits on swapped pages memcg: don't initialize kmem-cache destroying work for root caches
2013-08-13x86 get_unmapped_area(): use proper mmap base for bottom-up directionRadu Caragea
When the stack is set to unlimited, the bottomup direction is used for mmap-ings but the mmap_base is not used and thus effectively renders ASLR for mmapings along with PIE useless. Cc: Michel Lespinasse <walken@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Adrian Sendroiu <molecula2788@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-13microblaze: fix clone syscallMichal Simek
Fix inadvertent breakage in the clone syscall ABI for Microblaze that was introduced in commit f3268edbe6fe ("microblaze: switch to generic fork/vfork/clone"). The Microblaze syscall ABI for clone takes the parent tid address in the 4th argument; the third argument slot is used for the stack size. The incorrectly-used CLONE_BACKWARDS type assigned parent tid to the 3rd slot. This commit restores the original ABI so that existing userspace libc code will work correctly. All kernel versions from v3.8-rc1 were affected. Signed-off-by: Michal Simek <michal.simek@xilinx.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-13mm: save soft-dirty bits on file pagesCyrill Gorcunov
Andy reported that if file page get reclaimed we lose the soft-dirty bit if it was there, so save _PAGE_BIT_SOFT_DIRTY bit when page address get encoded into pte entry. Thus when #pf happens on such non-present pte we can restore it back. Reported-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-13mm: save soft-dirty bits on swapped pagesCyrill Gorcunov
Andy Lutomirski reported that if a page with _PAGE_SOFT_DIRTY bit set get swapped out, the bit is getting lost and no longer available when pte read back. To resolve this we introduce _PTE_SWP_SOFT_DIRTY bit which is saved in pte entry for the page being swapped out. When such page is to be read back from a swap cache we check for bit presence and if it's there we clear it and restore the former _PAGE_SOFT_DIRTY bit back. One of the problem was to find a place in pte entry where we can save the _PTE_SWP_SOFT_DIRTY bit while page is in swap. The _PAGE_PSE was chosen for that, it doesn't intersect with swap entry format stored in pte. Reported-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-13ipv6: make unsolicited report intervals configurable for mldHannes Frederic Sowa
Commit cab70040dfd95ee32144f02fade64f0cb94f31a0 ("net: igmp: Reduce Unsolicited report interval to 1s when using IGMPv3") and 2690048c01f32bf45d1c1e1ab3079bc10ad2aea7 ("net: igmp: Allow user-space configuration of igmp unsolicited report interval") by William Manley made igmp unsolicited report intervals configurable per interface and corrected the interval of unsolicited igmpv3 report messages resendings to 1s. Same needs to be done for IPv6: MLDv1 (RFC2710 7.10.): 10 seconds MLDv2 (RFC3810 9.11.): 1 second Both intervals are configurable via new procfs knobs mldv1_unsolicited_report_interval and mldv2_unsolicited_report_interval. (also added .force_mld_version to ipv6_devconf_dflt to bring structs in line without semantic changes) v2: a) Joined documentation update for IPv4 and IPv6 MLD/IGMP unsolicited_report_interval procfs knobs. b) incorporate stylistic feedback from William Manley v3: a) add new DEVCONF_* values to the end of the enum (thanks to David Miller) Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: William Manley <william.manley@youview.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-13Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Docbook fixes that make 99% of the diffstat, plus a oneliner fix" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Ensure update_cfs_shares() is called for parents of continuously-running tasks sched: Fix some kernel-doc warnings
2013-08-13ip_tunnel: Do not use inner ip-header-id for tunnel ip-header-id.Pravin B Shelar
Using inner-id for tunnel id is not safe in some rare cases. E.g. packets coming from multiple sources entering same tunnel can have same id. Therefore on tunnel packet receive we could have packets from two different stream but with same source and dst IP with same ip-id which could confuse ip packet reassembly. Following patch reverts optimization from commit 490ab08127 (IP_GRE: Fix IP-Identification.) CC: Jarno Rajahalme <jrajahalme@nicira.com> CC: Ansis Atteka <aatteka@nicira.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-13Merge branch 'for-davem' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next John W. Linville says: ==================== This is a batch of updates intended for 3.12. It is mostly driver stuff, although Johannes Berg and Simon Wunderlich make a good showing with mac80211 bits (particularly some work on 5/10 MHz channel support). The usual suspects are mostly represented. There are lots of updates to iwlwifi, ath9k, ath10k, mwifiex, rt2x00, wil6210, as usual. The bcma bus gets some love this time, as do cw1200, iwl4965, and a few other bits here and there. I don't think there is much unusual here, FWIW. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-13pptp: fix byte order warningsstephen hemminger
Pptp driver has lots of byte order warnings from sparse. This was because the on-the-wire header is in network byte order (obviously) but the definition did not reflect that. Also, the address structure to user space actually put the call id in host order. Rather than break ABI compatibility, just acknowledge the existing design. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-13sched: fix the theoretical signal_wake_up() vs schedule() raceOleg Nesterov
This is only theoretical, but after try_to_wake_up(p) was changed to check p->state under p->pi_lock the code like __set_current_state(TASK_INTERRUPTIBLE); schedule(); can miss a signal. This is the special case of wait-for-condition, it relies on try_to_wake_up/schedule interaction and thus it does not need mb() between __set_current_state() and if(signal_pending). However, this __set_current_state() can move into the critical section protected by rq->lock, now that try_to_wake_up() takes another lock we need to ensure that it can't be reordered with "if (signal_pending(current))" check inside that section. The patch is actually one-liner, it simply adds smp_wmb() before spin_lock_irq(rq->lock). This is what try_to_wake_up() already does by the same reason. We turn this wmb() into the new helper, smp_mb__before_spinlock(), for better documentation and to allow the architectures to change the default implementation. While at it, kill smp_mb__after_lock(), it has no callers. Perhaps we can also add smp_mb__before/after_spinunlock() for prepare_to_wait(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-12Merge branch 'master' of ↵John W. Linville
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem Conflicts: drivers/net/ethernet/broadcom/Kconfig
2013-08-10Merge tag 'nfs-for-3.11-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client bugfixes from Trond Myklebust: - Stable patch for lockd to fix Oopses due to inappropriate calls to utsname()->nodename - Stable patches for sunrpc to fix Oopses on shutdown when using AF_LOCAL sockets with rpcbind - Fix memory leak and error checking issues in nfs4_proc_lookup_mountpoint - Fix a regression with the sync mount option failing to work for nfs4 mounts - Fix a writeback performance issue when doing cache invalidation - Remove an incorrect call to nfs_setsecurity in nfs_fhget * tag 'nfs-for-3.11-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFSv4: Fix up nfs4_proc_lookup_mountpoint NFS: Remove unnecessary call to nfs_setsecurity in nfs_fhget() NFSv4: Fix the sync mount option for nfs4 mounts NFS: Fix writeback performance issue on cache invalidation SUNRPC: If the rpcbind channel is disconnected, fail the call to unregister SUNRPC: Don't auto-disconnect from the local rpcbind socket LOCKD: Don't call utsname()->nodename from nlmclnt_setlockargs
2013-08-10Merge tag 'staging-3.11-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging Pull staging driver fixes from Greg KH: "Here are 3 small fixes for staging/IIO drivers for 3.11-rc5. Nothing huge, two IIO driver fixes, and a zcache fix. All of these have been in linux-next for a while" * tag 'staging-3.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: staging: zcache: fix "zcache=" kernel parameter iio: ti_am335x_adc: Fix wrong samples received on 1st read iio:trigger: Fix use_count race condition
2013-08-10net: attempt high order allocations in sock_alloc_send_pskb()Eric Dumazet
Adding paged frags skbs to af_unix sockets introduced a performance regression on large sends because of additional page allocations, even if each skb could carry at least 100% more payload than before. We can instruct sock_alloc_send_pskb() to attempt high order allocations. Most of the time, it does a single page allocation instead of 8. I added an additional parameter to sock_alloc_send_pskb() to let other users to opt-in for this new feature on followup patches. Tested: Before patch : $ netperf -t STREAM_STREAM STREAM STREAM TEST Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 2304 212992 212992 10.00 46861.15 After patch : $ netperf -t STREAM_STREAM STREAM STREAM TEST Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 2304 212992 212992 10.00 57981.11 Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-10af_unix: improve STREAM behavior with fragmented memoryEric Dumazet
unix_stream_sendmsg() currently uses order-2 allocations, and we had numerous reports this can fail. The __GFP_REPEAT flag present in sock_alloc_send_pskb() is not helping. This patch extends the work done in commit eb6a24816b247c ("af_unix: reduce high order page allocations) for datagram sockets. This opens the possibility of zero copy IO (splice() and friends) The trick is to not use skb_pull() anymore in recvmsg() path, and instead add a @consumed field in UNIXCB() to track amount of already read payload in the skb. There is a performance regression for large sends because of extra page allocations that will be addressed in a follow-up patch, allowing sock_alloc_send_pskb() to attempt high order page allocations. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-10tcp: add server ip to encrypt cookie in fast openYuchung Cheng
Encrypt the cookie with both server and client IPv4 addresses, such that multi-homed server will grant different cookies based on both the source and destination IPs. No client change is needed since cookie is opaque to the client. Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-09Merge tag 'pm+acpi-3.11-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI and power management fixes from Rafael Wysocki: - ACPI-based memory hotplug stopped working after a recent change, because it's not possible to associate sufficiently many "physical" devices with one ACPI device object due to an artificial limit. Fix from Rafael J Wysocki removes that limit and makes memory hotplug work again. - A change made in 3.9 uncovered a bug in the ACPI processor driver preventing NUMA nodes from being put offline due to an ordering issue. Fix from Yasuaki Ishimatsu changes the ordering to make things work again. - One of the recent ACPI video commits (that hasn't been reverted so far) uncovered a bug in the code handling quirky BIOSes that caused some Asus machines to boot with backlight completely off which made it quite difficult to use them afterward. Fix from Felipe Contreras improves the quirk to cover this particular case correctly. - A cpufreq user space interface change made in 3.10 inadvertently renamed the ignore_nice_load sysfs attribute to ignore_nice which resulted in some confusion. Fix from Viresh Kumar changes the name back to ignore_nice_load. - An initialization ordering change made in 3.9 broke cpufreq on loongson2 boards. Fix from Aaro Koskinen restores the correct initialization ordering there. - Fix breakage resulting from a mistake made in 3.9 and causing the detection of some graphics adapters (that were detected correctly before) to fail. There are two objects representing the same PCIe port in the affected systems' ACPI tables and both appear as "enabled" and we are expected to guess which one to use. We used to choose the right one before by pure luck, but when we tried to address another similar corner case, the luck went away. This time we try to make our guessing a bit more educated which is reported to work on those systems. - The /proc/acpi/wakeup interface code is missing some locking which may lead to breakage if that file is written or read during hotplug of wakeup devices. That should be rare but still possible, so it's better to start using the appropriate locking there. * tag 'pm+acpi-3.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI: Try harder to resolve _ADR collisions for bridges cpufreq: rename ignore_nice as ignore_nice_load cpufreq: loongson2: fix regression related to clock management ACPI / processor: move try_offline_node() after acpi_unmap_lsapic() ACPI: Drop physical_node_id_bitmap from struct acpi_device ACPI / PM: Walk physical_node_list under physical_node_lock ACPI / video: improve quirk check in acpi_video_bqc_quirk()
2013-08-09Merge branch 'v4l_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media Pull media fixes from Mauro Carvalho Chehab: "Some driver fixes (em28xx, coda, usbtv, s5p, hdpvr and ml86v7667) and a fix for media DocBook" * 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: [media] em28xx: fix assignment of the eeprom data [media] hdpvr: fix iteration over uninitialized lists in hdpvr_probe() [media] usbtv: fix dependency [media] usbtv: Throw corrupted frames away [media] usbtv: Fix deinterlacing [media] v4l2: added missing mutex.h include to v4l2-ctrls.h [media] DocBook: upgrade media_api DocBook version to 4.2 [media] ml86v7667: fix compile warning: 'ret' set but not used [media] s5p-g2d: Fix registration failure [media] media: coda: Fix DT driver data pointer for i.MX27 [media] s5p-mfc: Fix input/output format reporting
2013-08-09Revert "net: sctp: convert sctp_checksum_disable module param into sctp sysctl"David S. Miller
This reverts commit cda5f98e36576596b9230483ec52bff3cc97eb21. As per Vlad's request. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-09Merge branch 'for-john' of ↵John W. Linville
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
2013-08-09Merge branch 'master' of ↵John W. Linville
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless
2013-08-09net: rename busy poll MIB counterEliezer Tamir
Rename mib counter from "low latency" to "busy poll" v1 also moved the counter to the ip MIB (suggested by Shawn Bohrer) Eric Dumazet suggested that the current location is better. So v2 just renames the counter to fit the new naming convention. Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-09net: sctp: trivial: update bug report in header commentDaniel Borkmann
With the restructuring of the lksctp.org site, we only allow bug reports through the SCTP mailing list linux-sctp@vger.kernel.org, not via SF, as SF is only used for web hosting and nothing more. While at it, also remove the obvious statement that bugs will be fixed and incooperated into the kernel. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-09net: sctp: convert sctp_checksum_disable module param into sctp sysctlDaniel Borkmann
Get rid of the last module parameter for SCTP and make this configurable via sysctl for SCTP like all the rest of SCTP's configuration knobs. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-09net: igmp: Allow user-space configuration of igmp unsolicited report intervalWilliam Manley
Adds the new procfs knobs: /proc/sys/net/ipv4/conf/*/igmpv2_unsolicited_report_interval /proc/sys/net/ipv4/conf/*/igmpv3_unsolicited_report_interval Which will allow userspace configuration of the IGMP unsolicited report interval (see below) in milliseconds. The defaults are 10000ms for IGMPv2 and 1000ms for IGMPv3 in accordance with RFC2236 and RFC3376. Background: If an IGMP join packet is lost you will not receive data sent to the multicast group so if no data arrives from that multicast group in a period of time after the IGMP join a second IGMP join will be sent. The delay between joins is the "IGMP Unsolicited Report Interval". Prior to this patch this value was hard coded in the kernel to 10s for IGMPv2 and 1s for IGMPv3. 10s is unsuitable for some use-cases, such as IPTV as it can cause channel change to be slow in the presence of packet loss. This patch allows the value to be overridden from userspace for both IGMPv2 and IGMPv3 such that it can be tuned accoding to the network. Tested with Wireshark and a simple program to join a (non-existent) multicast group. The distribution of timings for the second join differ based upon setting the procfs knobs. igmpvX_unsolicited_report_interval is intended to follow the pattern established by force_igmp_version, and while a procfs entry has been added a corresponding sysctl knob has not as it is my understanding that sysctl is deprecated[1]. [1]: http://lwn.net/Articles/247243/ Signed-off-by: William Manley <william.manley@youview.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-09net: igmp: Don't flush routing cache when force_igmp_version is modifiedWilliam Manley
The procfs knob /proc/sys/net/ipv4/conf/*/force_igmp_version allows the IGMP protocol version to use to be explicitly set. As a side effect this caused the routing cache to be flushed as it was declared as a DEVINET_SYSCTL_FLUSHING_ENTRY. Flushing is unnecessary and this patch makes it so flushing does not occur. Requested by Hannes Frederic Sowa as he was reviewing other patches adding procfs entries. Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: William Manley <william.manley@youview.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-08net: add SNMP counters tracking incoming ECN bitsEric Dumazet
With GRO/LRO processing, there is a problem because Ip[6]InReceives SNMP counters do not count the number of frames, but number of aggregated segments. Its probably too late to change this now. This patch adds four new counters, tracking number of frames, regardless of LRO/GRO, and on a per ECN status basis, for IPv4 and IPv6. Ip[6]NoECTPkts : Number of packets received with NOECT Ip[6]ECT1Pkts : Number of packets received with ECT(1) Ip[6]ECT0Pkts : Number of packets received with ECT(0) Ip[6]CEPkts : Number of packets received with Congestion Experienced lph37:~# nstat | egrep "Pkts|InReceive" IpInReceives 1634137 0.0 Ip6InReceives 3714107 0.0 Ip6InNoECTPkts 19205 0.0 Ip6InECT0Pkts 52651828 0.0 IpExtInNoECTPkts 33630 0.0 IpExtInECT0Pkts 15581379 0.0 IpExtInCEPkts 6 0.0 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-08userns: limit the maximum depth of user_namespace->parent chainOleg Nesterov
Ensure that user_namespace->parent chain can't grow too much. Currently we use the hardroded 32 as limit. Reported-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-08Merge tag 'regmap-v3.11-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap Pull regmap fixes from Mark Brown: "Two things here, one is a fix for a nasty issue where we were failing to sync the last register in a block when using raw writes and the other fixes a missing header for the !REGMAP stubs so that we don't rely on implicit includes in that case" * tag 'regmap-v3.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap: regmap: Add missing header for !CONFIG_REGMAP stubs regmap: cache: Make sure to sync the last register in a block
2013-08-07net: move zerocopy_sg_from_iovec() to net/core/datagram.cJason Wang
To let it be reused and reduce code duplication. Also document this function. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-07net: move iov_pages() to net/core/iovec.cJason Wang
To let it be reused and reduce code duplication. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-07ip_tunnel: embed hash list headstephen hemminger
The IP tunnel hash heads can be embedded in the per-net structure since it is a fixed size. Reduce the size so that the total structure fits in a page size. The original size was overly large, even NETDEV_HASHBITS is only 8 bits! Also, add some white space for readability. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Acked-by: Pravin B Shelar <pshelar@nicira.com>. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-07SUNRPC: If the rpcbind channel is disconnected, fail the call to unregisterTrond Myklebust
If rpcbind causes our connection to the AF_LOCAL socket to close after we've registered a service, then we want to be careful about reconnecting since the mount namespace may have changed. By simply refusing to reconnect the AF_LOCAL socket in the case of unregister, we avoid the need to somehow save the mount namespace. While this may lead to some services not unregistering properly, it should be safe. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Nix <nix@esperi.org.uk> Cc: Jeff Layton <jlayton@redhat.com> Cc: stable@vger.kernel.org # 3.9.x
2013-08-07ACPI: Try harder to resolve _ADR collisions for bridgesRafael J. Wysocki
In theory, under a given ACPI namespace node there should be only one child device object with _ADR whose value matches a given bus address exactly. In practice, however, there are systems in which multiple child device objects under a given parent have _ADR matching exactly the same address. In those cases we use _STA to determine which of the multiple matching devices is enabled, since some systems are known to indicate which ACPI device object to associate with the given physical (usually PCI) device this way. Unfortunately, as it turns out, there are systems in which many device objects under the same parent have _ADR matching exactly the same bus address and none of them has _STA, in which case they all should be regarded as enabled according to the spec. Still, if those device objects are supposed to represent bridges (e.g. this is the case for device objects corresponding to PCIe ports), we can try harder and skip the ones that have no child device objects in the ACPI namespace. With luck, we can avoid using device objects that we are not expected to use this way. Although this only works for bridges whose children also have ACPI namespace representation, it is sufficient to address graphics adapter detection issues on some systems, so rework the code finding a matching device ACPI handle for a given bus address to implement this idea. Introduce a new function, acpi_find_child(), taking three arguments: the ACPI handle of the device's parent, a bus address suitable for the device's bus type and a bool indicating if the device is a bridge and make it work as outlined above. Reimplement the function currently used for this purpose, acpi_get_child(), as a call to acpi_find_child() with the last argument set to 'false' and make the PCI subsystem use acpi_find_child() with the bridge information passed as the last argument to it. [Lan Tianyu notices that it is not sufficient to use pci_is_bridge() for that, because the device's subordinate pointer hasn't been set yet at this point, so use hdr_type instead.] This change fixes a regression introduced inadvertently by commit 33f767d (ACPI: Rework acpi_get_child() to be more efficient) which overlooked the fact that for acpi_walk_namespace() "post-order" means "after all children have been visited" rather than "on the way back", so for device objects without children and for namespace walks of depth 1, as in the acpi_get_child() case, the "post-order" callbacks ordering is actually the same as the ordering of "pre-order" ones. Since that commit changed the namespace walk in acpi_get_child() to terminate after finding the first matching object instead of going through all of them and returning the last one, it effectively changed the result returned by that function in some rare cases and that led to problems (the switch from a "pre-order" to a "post-order" callback was supposed to prevent that from happening, but it was ineffective). As it turns out, the systems where the change made by commit 33f767d actually matters are those where there are multiple ACPI device objects representing the same PCIe port (which effectively is a bridge). Moreover, only one of them, and the one we are expected to use, has child device objects in the ACPI namespace, so the regression can be addressed as described above. References: https://bugzilla.kernel.org/show_bug.cgi?id=60561 Reported-by: Peter Wu <lekensteyn@gmail.com> Tested-by: Vladimir Lalov <mail@vlalov.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Cc: 3.9+ <stable@vger.kernel.org> # 3.9+
2013-08-07Merge tag 'trace-fixes-3.11-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "Oleg Nesterov has been working hard in closing all the holes that can lead to race conditions between deleting an event and accessing an event debugfs file. This included a fix to the debugfs system (acked by Greg Kroah-Hartman). We think that all the holes have been patched and hopefully we don't find more. I haven't marked all of them for stable because I need to examine them more to figure out how far back some of the changes need to go. Along the way, some other fixes have been made. Alexander Z Lam fixed some logic where the wrong buffer was being modifed. Andrew Vagin found a possible corruption for machines that actually allocate cpumask, as a reference to one was being zeroed out by mistake. Dhaval Giani found a bad prototype when tracing is not configured. And I not only had some changes to help Oleg, but also finally fixed a long standing bug that Dave Jones and others have been hitting, where a module unload and reload can cause the function tracing accounting to get screwed up" * tag 'trace-fixes-3.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Fix reset of time stamps during trace_clock changes tracing: Make TRACE_ITER_STOP_ON_FREE stop the correct buffer tracing: Fix trace_dump_stack() proto when CONFIG_TRACING is not set tracing: Fix fields of struct trace_iterator that are zeroed by mistake tracing/uprobes: Fail to unregister if probe event files are in use tracing/kprobes: Fail to unregister if probe event files are in use tracing: Add comment to describe special break case in probe_remove_event_call() tracing: trace_remove_event_call() should fail if call/file is in use debugfs: debugfs_remove_recursive() must not rely on list_empty(d_subdirs) ftrace: Check module functions being traced on reload ftrace: Consolidate some duplicate code for updating ftrace ops tracing: Change remove_event_file_dir() to clear "d_subdirs"->i_private tracing: Introduce remove_event_file_dir() tracing: Change f_start() to take event_mutex and verify i_private != NULL tracing: Change event_filter_read/write to verify i_private != NULL tracing: Change event_enable/disable_read() to verify i_private != NULL tracing: Turn event/id->i_private into call->event.type
2013-08-06regmap: Add missing header for !CONFIG_REGMAP stubsMateusz Krawczuk
regmap.h requires linux/err.h if CONFIG_REGMAP is not defined. Without it I get error. CC drivers/media/platform/exynos4-is/fimc-reg.o In file included from drivers/media/platform/exynos4-is/fimc-reg.c:14:0: include/linux/regmap.h: In function ‘regmap_write’: include/linux/regmap.h:525:10: error: ‘EINVAL’ undeclared (first use in this function) include/linux/regmap.h:525:10: note: each undeclared identifier is reported only once for each function it appears in Signed-off-by: Mateusz Krawczuk <m.krawczuk@partner.samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: Mark Brown <broonie@linaro.org> Cc: stable@kernel.org
2013-08-06ACPI: Drop physical_node_id_bitmap from struct acpi_deviceRafael J. Wysocki
The physical_node_id_bitmap in struct acpi_device is only used for looking up the first currently unused dependent phyiscal node ID by acpi_bind_one(). It is not really necessary, however, because acpi_bind_one() walks the entire physical_node_list of the given device object for sanity checking anyway and if that list is always sorted by node_id, it is straightforward to find the first gap between the currently used node IDs and use that number as the ID of the new list node. This also removes the artificial limit of the maximum number of dependent physical devices per ACPI device object, which now depends only on the capacity of unsigend int. As a result, it fixes a regression introduced by commit e2ff394 (ACPI / memhotplug: Bind removable memory blocks to ACPI device nodes) that caused acpi_memory_enable_device() to fail when the number of 128 MB blocks within one removable memory module was greater than 32. Reported-and-tested-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Toshi Kani <toshi.kani@hp.com> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
2013-08-05sctp: Pack dst_cookie into 1st cacheline hole for 64bit hostfan.du
As dst_cookie is used in fast path sctp_transport_dst_check. Before: struct sctp_transport { struct list_head transports; /* 0 16 */ atomic_t refcnt; /* 16 4 */ __u32 dead:1; /* 20:31 4 */ __u32 rto_pending:1; /* 20:30 4 */ __u32 hb_sent:1; /* 20:29 4 */ __u32 pmtu_pending:1; /* 20:28 4 */ /* XXX 28 bits hole, try to pack */ __u32 sack_generation; /* 24 4 */ /* XXX 4 bytes hole, try to pack */ struct flowi fl; /* 32 64 */ /* --- cacheline 1 boundary (64 bytes) was 32 bytes ago --- */ union sctp_addr ipaddr; /* 96 28 */ After: struct sctp_transport { struct list_head transports; /* 0 16 */ atomic_t refcnt; /* 16 4 */ __u32 dead:1; /* 20:31 4 */ __u32 rto_pending:1; /* 20:30 4 */ __u32 hb_sent:1; /* 20:29 4 */ __u32 pmtu_pending:1; /* 20:28 4 */ /* XXX 28 bits hole, try to pack */ __u32 sack_generation; /* 24 4 */ u32 dst_cookie; /* 28 4 */ struct flowi fl; /* 32 64 */ /* --- cacheline 1 boundary (64 bytes) was 32 bytes ago --- */ union sctp_addr ipaddr; /* 96 28 */ Signed-off-by: Fan Du <fan.du@windriver.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-05mlx5: remove health handler pluginEli Cohen
Remove this code, per Dave Miller's request, since it is not being used anywhere in the kernel. Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>