summaryrefslogtreecommitdiff
path: root/include/linux/netdevice.h
AgeCommit message (Collapse)Author
2014-01-07net-gre-gro: Add GRE support to the GRO stackJerry Chu
This patch built on top of Commit 299603e8370a93dd5d8e8d800f0dff1ce2c53d36 ("net-gro: Prepare GRO stack for the upcoming tunneling support") to add the support of the standard GRE (RFC1701/RFC2784/RFC2890) to the GRO stack. It also serves as an example for supporting other encapsulation protocols in the GRO stack in the future. The patch supports version 0 and all the flags (key, csum, seq#) but will flush any pkt with the S (seq#) flag. This is because the S flag is not support by GSO, and a GRO pkt may end up in the forwarding path, thus requiring GSO support to break it up correctly. Currently the "packet_offload" structure only contains L3 (ETH_P_IP/ ETH_P_IPV6) GRO offload support so the encapped pkts are limited to IP pkts (i.e., w/o L2 hdr). But support for other protocol type can be easily added, so is the support for GRE variations like NVGRE. The patch also support csum offload. Specifically if the csum flag is on and the h/w is capable of checksumming the payload (CHECKSUM_COMPLETE), the code will take advantage of the csum computed by the h/w when validating the GRE csum. Note that commit 60769a5dcd8755715c7143b4571d5c44f01796f1 "ipv4: gre: add GRO capability" already introduces GRO capability to IPv4 GRE tunnels, using the gro_cells infrastructure. But GRO is done after GRE hdr has been removed (i.e., decapped). The following patch applies GRO when pkts first come in (before hitting the GRE tunnel code). There is some performance advantage for applying GRO as early as possible. Also this approach is transparent to other subsystem like Open vSwitch where GRE decap is handled outside of the IP stack hence making it harder for the gro_cells stuff to apply. On the other hand, some NICs are still not capable of hashing on the inner hdr of a GRE pkt (RSS). In that case the GRO processing of pkts from the same remote host will all happen on the same CPU and the performance may be suboptimal. I'm including some rough preliminary performance numbers below. Note that the performance will be highly dependent on traffic load, mix as usual. Moreover it also depends on NIC offload features hence the following is by no means a comprehesive study. Local testing and tuning will be needed to decide the best setting. All tests spawned 50 copies of netperf TCP_STREAM and ran for 30 secs. (super_netperf 50 -H 192.168.1.18 -l 30) An IP GRE tunnel with only the key flag on (e.g., ip tunnel add gre1 mode gre local 10.246.17.18 remote 10.246.17.17 ttl 255 key 123) is configured. The GRO support for pkts AFTER decap are controlled through the device feature of the GRE device (e.g., ethtool -K gre1 gro on/off). 1.1 ethtool -K gre1 gro off; ethtool -K eth0 gro off thruput: 9.16Gbps CPU utilization: 19% 1.2 ethtool -K gre1 gro on; ethtool -K eth0 gro off thruput: 5.9Gbps CPU utilization: 15% 1.3 ethtool -K gre1 gro off; ethtool -K eth0 gro on thruput: 9.26Gbps CPU utilization: 12-13% 1.4 ethtool -K gre1 gro on; ethtool -K eth0 gro on thruput: 9.26Gbps CPU utilization: 10% The following tests were performed on a different NIC that is capable of csum offload. I.e., the h/w is capable of computing IP payload csum (CHECKSUM_COMPLETE). 2.1 ethtool -K gre1 gro on (hence will use gro_cells) 2.1.1 ethtool -K eth0 gro off; csum offload disabled thruput: 8.53Gbps CPU utilization: 9% 2.1.2 ethtool -K eth0 gro off; csum offload enabled thruput: 8.97Gbps CPU utilization: 7-8% 2.1.3 ethtool -K eth0 gro on; csum offload disabled thruput: 8.83Gbps CPU utilization: 5-6% 2.1.4 ethtool -K eth0 gro on; csum offload enabled thruput: 8.98Gbps CPU utilization: 5% 2.2 ethtool -K gre1 gro off 2.2.1 ethtool -K eth0 gro off; csum offload disabled thruput: 5.93Gbps CPU utilization: 9% 2.2.2 ethtool -K eth0 gro off; csum offload enabled thruput: 5.62Gbps CPU utilization: 8% 2.2.3 ethtool -K eth0 gro on; csum offload disabled thruput: 7.69Gbps CPU utilization: 8% 2.2.4 ethtool -K eth0 gro on; csum offload enabled thruput: 8.96Gbps CPU utilization: 5-6% Signed-off-by: H.K. Jerry Chu <hkchu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c net/ipv6/ip6_tunnel.c net/ipv6/ip6_vti.c ipv6 tunnel statistic bug fixes conflicting with consolidation into generic sw per-cpu net stats. qlogic conflict between queue counting bug fix and the addition of multiple MAC address support. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== netfilter/IPVS updates for net-next The following patchset contains Netfilter updates for your net-next tree, they are: * Add full port randomization support. Some crazy researchers found a way to reconstruct the secure ephemeral ports that are allocated in random mode by sending off-path bursts of UDP packets to overrun the socket buffer of the DNS resolver to trigger retransmissions, then if the timing for the DNS resolution done by a client is larger than usual, then they conclude that the port that received the burst of UDP packets is the one that was opened. It seems a bit aggressive method to me but it seems to work for them. As a result, Daniel Borkmann and Hannes Frederic Sowa came up with a new NAT mode to fully randomize ports using prandom. * Add a new classifier to x_tables based on the socket net_cls set via cgroups. These includes two patches to prepare the field as requested by Zefan Li. Also from Daniel Borkmann. * Use prandom instead of get_random_bytes in several locations of the netfilter code, from Florian Westphal. * Allow to use the CTA_MARK_MASK in ctnetlink when mangling the conntrack mark, also from Florian Westphal. * Fix compilation warning due to unused variable in IPVS, from Geert Uytterhoeven. * Add support for UID/GID via nfnetlink_queue, from Valentina Giusti. * Add IPComp extension to x_tables, from Fan Du. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-04net: unify the pcpu_tstats and br_cpu_netstats as oneLi RongQing
They are same, so unify them as one, pcpu_sw_netstats. Define pcpu_sw_netstat in netdevice.h, remove pcpu_tstats from if_tunnel and remove br_cpu_netstats from br_private.h Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-03net: netprio: rename config to be more consistent with cgroup configsDaniel Borkmann
While we're at it and introduced CGROUP_NET_CLASSID, lets also make NETPRIO_CGROUP more consistent with the rest of cgroups and rename it into CONFIG_CGROUP_NET_PRIO so that for networking, we now have CONFIG_CGROUP_NET_{PRIO,CLASSID}. This not only makes the CONFIG option consistent among networking cgroups, but also among cgroups CONFIG conventions in general as the vast majority has a prefix of CONFIG_CGROUP_<SUBSYS>. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Zefan Li <lizefan@huawei.com> Cc: cgroups@vger.kernel.org Acked-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-01-02ipv4: fix tunneled VM traffic over hw VXLAN/GRE GSO NICWei-Chun Chao
VM to VM GSO traffic is broken if it goes through VXLAN or GRE tunnel and the physical NIC on the host supports hardware VXLAN/GRE GSO offload (e.g. bnx2x and next-gen mlx4). Two issues - (VXLAN) VM traffic has SKB_GSO_DODGY and SKB_GSO_UDP_TUNNEL with SKB_GSO_TCP/UDP set depending on the inner protocol. GSO header integrity check fails in udp4_ufo_fragment if inner protocol is TCP. Also gso_segs is calculated incorrectly using skb->len that includes tunnel header. Fix: robust check should only be applied to the inner packet. (VXLAN & GRE) Once GSO header integrity check passes, NULL segs is returned and the original skb is sent to hardware. However the tunnel header is already pulled. Fix: tunnel header needs to be restored so that hardware can perform GSO properly on the original packet. Signed-off-by: Wei-Chun Chao <weichunc@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-01net: core functions cleanupstephen hemminger
The following functions are not used outside of net/core/dev.c and should be declared static. call_netdevice_notifiers_info __dev_remove_offload netdev_has_any_upper_dev __netdev_adjacent_dev_remove __netdev_adjacent_dev_link_lists __netdev_adjacent_dev_unlink_lists __netdev_adjacent_dev_unlink __netdev_adjacent_dev_link_neighbour __netdev_adjacent_dev_unlink_neighbour And the following are never used and should be deleted netdev_lower_dev_get_private_rcu __netdev_find_adj_rcu Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-31vlan: Fix header ops passthru when doing TX VLAN offload.David S. Miller
When the vlan code detects that the real device can do TX VLAN offloads in hardware, it tries to arrange for the real device's header_ops to be invoked directly. But it does so illegally, by simply hooking the real device's header_ops up to the VLAN device. This doesn't work because we will end up invoking a set of header_ops routines which expect a device type which matches the real device, but will see a VLAN device instead. Fix this by providing a pass-thru set of header_ops which will arrange to pass the proper real device instead. To facilitate this add a dev_rebuild_header(). There are implementations which provide a ->cache and ->create but not a ->rebuild (f.e. PLIP). So we need a helper function just like dev_hard_header() to avoid crashes. Use this helper in the one existing place where the header_ops->rebuild was being invoked, the neighbour code. With lots of help from Florian Westphal. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/intel/i40e/i40e_main.c drivers/net/macvtap.c Both minor merge hassles, simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-17net: remove dead code for add/del multiplestephen hemminger
These function to manipulate multiple addresses are not used anywhere in current net-next tree. Some out of tree code maybe using these but too bad; they should submit their code upstream.. Also, make __hw_addr_flush local since only used by dev_addr_lists.c Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-14bonding: create bond_first_slave_rcu()dingtianhong
The bond_first_slave_rcu() will be used to instead of bond_first_slave() in rcu_read_lock(). According to the Jay Vosburgh's suggestion, the struct netdev_adjacent should hide from users who wanted to use it directly. so I package a new function to get the first slave of the bond. Suggested-by: Nikolay Aleksandrov <nikolay@redhat.com> Suggested-by: Jay Vosburgh <fubar@us.ibm.com> Suggested-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: Ding Tianhong <dingtianhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-12net: make neigh_priv_len in struct net_device 16bit instead of 8bitSebastian Siewior
neigh_priv_len is defined as u8. With all debug enabled struct ipoib_neigh has 200 bytes. The largest part is sk_buff_head with 96 bytes and here the spinlock with 72 bytes. The size value still fits in this u8 leaving some room for more. On -RT struct ipoib_neigh put on weight and has 392 bytes. The main reason is sk_buff_head with 288 and the fatty here is spinlock with 192 bytes. This does no longer fit into into neigh_priv_len and gcc complains. This patch changes neigh_priv_len from being 8bit to 16bit. Since the following element (dev_id) is 16bit followed by a spinlock which is aligned, the struct remains with a total size of 3200 (allmodconfig) / 2048 (with as much debug off as possible) bytes on x86-64. On x86-32 the struct is 1856 (allmodconfig) / 1216 (with as much debug off as possible) bytes long. The numbers were gained with and without the patch to prove that this change does not increase the size of the struct. Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-12net-gro: Prepare GRO stack for the upcoming tunneling supportJerry Chu
This patch modifies the GRO stack to avoid the use of "network_header" and associated macros like ip_hdr() and ipv6_hdr() in order to allow an arbitary number of IP hdrs (v4 or v6) to be used in the encapsulation chain. This lays the foundation for various IP tunneling support (IP-in-IP, GRE, VXLAN, SIT,...) to be added later. With this patch, the GRO stack traversing now is mostly based on skb_gro_offset rather than special hdr offsets saved in skb (e.g., skb->network_header). As a result all but the top layer (i.e., the the transport layer) must have hdrs of the same length in order for a pkt to be considered for aggregation. Therefore when adding a new encap layer (e.g., for tunneling), one must check and skip flows (e.g., by setting NAPI_GRO_CB(p)->same_flow to 0) that have a different hdr length. Note that unlike the network header, the transport header can and will continue to be set by the GRO code since there will be at most one "transport layer" in the encap chain. Signed-off-by: H.K. Jerry Chu <hkchu@google.com> Suggested-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-11tipc: remove TIPC usage of field af_packet_priv in struct net_deviceYing Xue
TIPC is currently using the field 'af_packet_priv' in struct net_device as a handle to find the bearer instance associated to the given network device. But, by doing so it is blocking other networking cleanups, such as the one discussed here: http://patchwork.ozlabs.org/patch/178044/ This commit removes this usage from TIPC. Instead, we introduce a new field, 'tipc_ptr', to the net_device structure, to serve this purpose. When TIPC bearer is enabled, the bearer object is associated to 'tipc_ptr'. When a TIPC packet arrives in the recv_msg() upcall from a networking device, the bearer object can now be obtained from 'tipc_ptr'. When a bearer is disabled, the bearer object is detached from its underlying network device by setting 'tipc_ptr' to NULL. Additionally, an RCU lock is used to protect the new pointer. Henceforth, the existing tipc_net_lock is used in write mode to serialize write accesses to this pointer, while the new RCU lock is applied on the read side to ensure that the pointer is 100% valid within its wrapped area for all readers. Signed-off-by: Ying Xue <ying.xue@windriver.com> Cc: Patrick McHardy <kaber@trash.net> Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-06net: introduce dev_consume_skb_any()Eric Dumazet
Some network drivers use dev_kfree_skb_any() and dev_kfree_skb_irq() helpers to free skbs, both for dropped packets and TX completed ones. We need to separate the two causes to get better diagnostics given by dropwatch or "perf record -e skb:kfree_skb" This patch provides two new helpers, dev_consume_skb_any() and dev_consume_skb_irq() to be used for consumed skbs. __dev_kfree_skb_irq() is slightly optimized to remove one atomic_dec_and_test() in fast path, and use this_cpu_{r|w} accessors. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-15Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree updates from Jiri Kosina: "Usual earth-shaking, news-breaking, rocket science pile from trivial.git" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (23 commits) doc: usb: Fix typo in Documentation/usb/gadget_configs.txt doc: add missing files to timers/00-INDEX timekeeping: Fix some trivial typos in comments mm: Fix some trivial typos in comments irq: Fix some trivial typos in comments NUMA: fix typos in Kconfig help text mm: update 00-INDEX doc: Documentation/DMA-attributes.txt fix typo DRM: comment: `halve' -> `half' Docs: Kconfig: `devlopers' -> `developers' doc: typo on word accounting in kprobes.c in mutliple architectures treewide: fix "usefull" typo treewide: fix "distingush" typo mm/Kconfig: Grammar s/an/a/ kexec: Typo s/the/then/ Documentation/kvm: Update cpuid documentation for steal time and pv eoi treewide: Fix common typo in "identify" __page_to_pfn: Fix typo in comment Correct some typos for word frequency clk: fixed-factor: Fix a trivial typo ...
2013-11-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds
Pull networking updates from David Miller: 1) The addition of nftables. No longer will we need protocol aware firewall filtering modules, it can all live in userspace. At the core of nftables is a, for lack of a better term, virtual machine that executes byte codes to inspect packet or metadata (arriving interface index, etc.) and make verdict decisions. Besides support for loading packet contents and comparing them, the interpreter supports lookups in various datastructures as fundamental operations. For example sets are supports, and therefore one could create a set of whitelist IP address entries which have ACCEPT verdicts attached to them, and use the appropriate byte codes to do such lookups. Since the interpreted code is composed in userspace, userspace can do things like optimize things before giving it to the kernel. Another major improvement is the capability of atomically updating portions of the ruleset. In the existing netfilter implementation, one has to update the entire rule set in order to make a change and this is very expensive. Userspace tools exist to create nftables rules using existing netfilter rule sets, but both kernel implementations will need to co-exist for quite some time as we transition from the old to the new stuff. Kudos to Patrick McHardy, Pablo Neira Ayuso, and others who have worked so hard on this. 2) Daniel Borkmann and Hannes Frederic Sowa made several improvements to our pseudo-random number generator, mostly used for things like UDP port randomization and netfitler, amongst other things. In particular the taus88 generater is updated to taus113, and test cases are added. 3) Support 64-bit rates in HTB and TBF schedulers, from Eric Dumazet and Yang Yingliang. 4) Add support for new 577xx tigon3 chips to tg3 driver, from Nithin Sujir. 5) Fix two fatal flaws in TCP dynamic right sizing, from Eric Dumazet, Neal Cardwell, and Yuchung Cheng. 6) Allow IP_TOS and IP_TTL to be specified in sendmsg() ancillary control message data, much like other socket option attributes. From Francesco Fusco. 7) Allow applications to specify a cap on the rate computed automatically by the kernel for pacing flows, via a new SO_MAX_PACING_RATE socket option. From Eric Dumazet. 8) Make the initial autotuned send buffer sizing in TCP more closely reflect actual needs, from Eric Dumazet. 9) Currently early socket demux only happens for TCP sockets, but we can do it for connected UDP sockets too. Implementation from Shawn Bohrer. 10) Refactor inet socket demux with the goal of improving hash demux performance for listening sockets. With the main goals being able to use RCU lookups on even request sockets, and eliminating the listening lock contention. From Eric Dumazet. 11) The bonding layer has many demuxes in it's fast path, and an RCU conversion was started back in 3.11, several changes here extend the RCU usage to even more locations. From Ding Tianhong and Wang Yufen, based upon suggestions by Nikolay Aleksandrov and Veaceslav Falico. 12) Allow stackability of segmentation offloads to, in particular, allow segmentation offloading over tunnels. From Eric Dumazet. 13) Significantly improve the handling of secret keys we input into the various hash functions in the inet hashtables, TCP fast open, as well as syncookies. From Hannes Frederic Sowa. The key fundamental operation is "net_get_random_once()" which uses static keys. Hannes even extended this to ipv4/ipv6 fragmentation handling and our generic flow dissector. 14) The generic driver layer takes care now to set the driver data to NULL on device removal, so it's no longer necessary for drivers to explicitly set it to NULL any more. Many drivers have been cleaned up in this way, from Jingoo Han. 15) Add a BPF based packet scheduler classifier, from Daniel Borkmann. 16) Improve CRC32 interfaces and generic SKB checksum iterators so that SCTP's checksumming can more cleanly be handled. Also from Daniel Borkmann. 17) Add a new PMTU discovery mode, IP_PMTUDISC_INTERFACE, which forces using the interface MTU value. This helps avoid PMTU attacks, particularly on DNS servers. From Hannes Frederic Sowa. 18) Use generic XPS for transmit queue steering rather than internal (re-)implementation in virtio-net. From Jason Wang. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits) random32: add test cases for taus113 implementation random32: upgrade taus88 generator to taus113 from errata paper random32: move rnd_state to linux/random.h random32: add prandom_reseed_late() and call when nonblocking pool becomes initialized random32: add periodic reseeding random32: fix off-by-one in seeding requirement PHY: Add RTL8201CP phy_driver to realtek xtsonic: add missing platform_set_drvdata() in xtsonic_probe() macmace: add missing platform_set_drvdata() in mace_probe() ethernet/arc/arc_emac: add missing platform_set_drvdata() in arc_emac_probe() ipv6: protect for_each_sk_fl_rcu in mem_check with rcu_read_lock_bh vlan: Implement vlan_dev_get_egress_qos_mask as an inline. ixgbe: add warning when max_vfs is out of range. igb: Update link modes display in ethtool netfilter: push reasm skb through instead of original frag skbs ip6_output: fragment outgoing reassembled skb properly MAINTAINERS: mv643xx_eth: take over maintainership from Lennart net_sched: tbf: support of 64bit rates ixgbe: deleting dfwd stations out of order can cause null ptr deref ixgbe: fix build err, num_rx_queues is only available with CONFIG_RPS ...
2013-11-07net: Add layer 2 hardware acceleration operations for macvlan devicesJohn Fastabend
Add a operations structure that allows a network interface to export the fact that it supports package forwarding in hardware between physical interfaces and other mac layer devices assigned to it (such as macvlans). This operaions structure can be used by virtual mac devices to bypass software switching so that forwarding can be done in hardware more efficiently. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: Andy Gospodarek <andy@greyhouse.net> CC: "David S. Miller" <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-07net: make ndev->irq signed for error handlingDan Carpenter
There is a bug in cpsw_probe() where we do: ndev->irq = platform_get_irq(pdev, 0); if (ndev->irq < 0) { The problem is that "ndev->irq" is unsigned so the error handling doesn't work. I have changed it to a regular int. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-07Merge tag 'driver-core-3.13-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core / sysfs patches from Greg KH: "Here's the big driver core / sysfs update for 3.13-rc1. There's lots of dev_groups updates for different subsystems, as they all get slowly migrated over to the safe versions of the attribute groups (removing userspace races with the creation of the sysfs files.) Also in here are some kobject updates, devres expansions, and the first round of Tejun's sysfs reworking to enable it to be used by other subsystems as a backend for an in-kernel filesystem. All of these have been in linux-next for a while with no reported issues" * tag 'driver-core-3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (83 commits) sysfs: rename sysfs_assoc_lock and explain what it's about sysfs: use generic_file_llseek() for sysfs_file_operations sysfs: return correct error code on unimplemented mmap() mdio_bus: convert bus code to use dev_groups device: Make dev_WARN/dev_WARN_ONCE print device as well as driver name sysfs: separate out dup filename warning into a separate function sysfs: move sysfs_hash_and_remove() to fs/sysfs/dir.c sysfs: remove unused sysfs_get_dentry() prototype sysfs: honor bin_attr.attr.ignore_lockdep sysfs: merge sysfs_elem_bin_attr into sysfs_elem_attr devres: restore zeroing behavior of devres_alloc() sysfs: fix sysfs_write_file for bin file input: gameport: convert bus code to use dev_groups input: serio: remove bus usage of dev_attrs input: serio: use DEVICE_ATTR_RO() i2o: convert bus code to use dev_groups memstick: convert bus code to use dev_groups tifm: convert bus code to use dev_groups virtio: convert bus code to use dev_groups ipack: convert bus code to use dev_groups ...
2013-11-03net: extend net_device allocation to vmalloc()Eric Dumazet
Joby Poriyath provided a xen-netback patch to reduce the size of xenvif structure as some netdev allocation could fail under memory pressure/fragmentation. This patch is handling the problem at the core level, allowing any netdev structures to use vmalloc() if kmalloc() failed. As vmalloc() adds overhead on a critical network path, add __GFP_REPEAT to kzalloc() flags to do this fallback only when really needed. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Joby Poriyath <joby.poriyath@citrix.com> Cc: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-29net: add might_sleep() call to napi_disableJacob Keller
napi_disable uses an msleep() call to wait for outstanding napi work to be finished after setting the disable bit. It does not always sleep incase there was no outstanding work. This resulted in a rare bug in ixgbe_down operation where a napi_disable call took place inside of a local_bh_disable()d context. In order to enable easier detection of future sleep while atomic BUGs, this patch adds a might_sleep() call, so that every use of napi_disable during atomic context will be visible. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com> Cc: Alexander Duyck <alexander.duyck@intel.com> Cc: Hyong-Youb Kim <hykim@myri.com> Cc: Amir Vadai <amirv@mellanox.com> Cc: Dmitry Kravkov <dmitry@broadcom.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2013-10-18net: Delete trailing semi-colon from definition of netdev_WARN()Ben Hutchings
Macro definitions should not normally end with a semi-colon, as this makes it dangerous to use them an if...else statement. Happily this has not happened yet. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-14treewide: Fix common typo in "identify"Maxime Jayat
Correct common misspelling of "identify" as "indentify" throughout the kernel Signed-off-by: Maxime Jayat <maxime@artisandeveloppeur.fr> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-10-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: include/linux/netdevice.h net/core/sock.c Trivial merge issues. Removal of "extern" for functions declaration in netdevice.h at the same time "const" was added to an argument. Two parallel line additions in net/core/sock.c Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-07net: Separate the close_list and the unreg_list v2Eric W. Biederman
Separate the unreg_list and the close_list in dev_close_many preventing dev_close_many from permuting the unreg_list. The permutations of the unreg_list have resulted in cases where the loopback device is accessed it has been freed in code such as dst_ifdown. Resulting in subtle memory corruption. This is the second bug from sharing the storage between the close_list and the unreg_list. The issues that crop up with sharing are apparently too subtle to show up in normal testing or usage, so let's forget about being clever and use two separate lists. v2: Make all callers pass in a close_list to dev_close_many Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-07netif_set_xps_queue: make cpu mask constMichael S. Tsirkin
virtio wants to pass in cpumask_of(cpu), make parameter const to avoid build warnings. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-30Merge branch '20130926_include_linux_networking_externs' of ↵David S. Miller
git://repo.or.cz/linux-2.6/trivial-mods Conflicts: include/linux/netdevice.h More extern removals from Joe Perches. Minor conflict with the dev_notify_flags changes which added a new argument to __dev_notify_flags(). Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-30dev: update __dev_notify_flags() to send rtnl msgNicolas Dichtel
This patch only prepares the next one, there is no functional change. Now, __dev_notify_flags() can also be used to notify flags changes via rtnetlink. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-26[networking]device.h: Remove extern from function prototypesJoe Perches
There are a mix of function prototypes with and without extern in the kernel sources. Standardize on not using extern for function prototypes. Function prototypes don't need to be written with extern. extern is assumed by the compiler. Its use is as unnecessary as using auto to declare automatic/local variables in a block. Signed-off-by: Joe Perches <joe@perches.com>
2013-09-26sysfs: make attr namespace interface less convolutedTejun Heo
sysfs ns (namespace) implementation became more convoluted than necessary while trying to hide ns information from visible interface. The relatively recent attr ns support is a good example. * attr ns tag is determined by sysfs_ops->namespace() callback while dir tag is determined by kobj_type->namespace(). The placement is arbitrary. * Instead of performing operations with explicit ns tag, the namespace callback is routed through sysfs_attr_ns(), sysfs_ops->namespace(), class_attr_namespace(), class_attr->namespace(). It's not simpler in any sense. The only thing this convolution does is traversing the whole stack backwards. The namespace callbacks are unncessary because the operations involved are inherently synchronous. The information can be provided in in straight-forward top-down direction and reversing that direction is unnecessary and against basic design principles. This backward interface is unnecessarily convoluted and hinders properly separating out sysfs from driver model / kobject for proper layering. This patch updates attr ns support such that * sysfs_ops->namespace() and class_attr->namespace() are dropped. * sysfs_{create|remove}_file_ns(), which take explicit @ns param, are added and sysfs_{create|remove}_file() are now simple wrappers around the ns aware functions. * ns handling is dropped from sysfs_chmod_file(). Nobody uses it at this point. sysfs_chmod_file_ns() can be added later if necessary. * Explicit @ns is propagated through class_{create|remove}_file_ns() and netdev_class_{create|remove}_file_ns(). * driver/net/bonding which is currently the only user of attr namespace is updated to use netdev_class_{create|remove}_file_ns() with @bh->net as the ns tag instead of using the namespace callback. This patch should be an equivalent conversion without any functional difference. It makes the code easier to follow, reduces lines of code a bit and helps proper separation and layering. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Kay Sievers <kay@vrfy.org> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-26net: add a possibility to get private from netdev_adjacent->listVeaceslav Falico
It will be useful to get first/last element. CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-26net: add for_each iterators through neighbour lower link's privateVeaceslav Falico
Add a possibility to iterate through netdev_adjacent's private, currently only for lower neighbours. Add both RCU and RTNL/other locking variants of iterators, and make the non-rcu variant to be safe from removal. CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-26net: add netdev_adjacent->private and allow to use itVeaceslav Falico
Currently, even though we can access any linked device, we can't attach anything to it, which is vital to properly manage them. To fix this, add a new void *private to netdev_adjacent and functions setting/getting it (per link), so that we can save, per example, bonding's slave structures there, per slave device. netdev_master_upper_dev_link_private(dev, upper_dev, private) links dev to upper dev and populates the neighbour link only with private. netdev_lower_dev_get_private{,_rcu}() returns the private, if found. CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-26net: add adj_list to save only neighboursVeaceslav Falico
Currently, we distinguish neighbours (first-level linked devices) from non-neighbours by the neighbour bool in the netdev_adjacent. This could be quite time-consuming in case we would like to traverse *only* through neighbours - cause we'd have to traverse through all devices and check for this flag, and in a (quite common) scenario where we have lots of vlans on top of bridge, which is on top of a bond - the bonding would have to go through all those vlans to get its upper neighbour linked devices. This situation is really unpleasant, cause there are already a lot of cases when a device with slaves needs to go through them in hot path. To fix this, introduce a new upper/lower device lists structure - adj_list, which contains only the neighbours. It works always in pair with the all_adj_list structure (renamed from upper/lower_dev_list), i.e. both of them contain the same links, only that all_adj_list contains also non-neighbour device links. It's really a small change visible, currently, only for __netdev_adjacent_dev_insert/remove(), and doesn't change the main linked logic at all. Also, add some comments a fix a name collision in netdev_for_each_upper_dev_rcu() and rework the naming by the following rules: netdev_(all_)(upper|lower)_* If "all_" is present, then we work with the whole list of upper/lower devices, otherwise - only with direct neighbours. Uninline functions - to get better stack traces. CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> CC: Cong Wang <amwang@redhat.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-15vxlan: Fix sparse warningsJoseph Gasparakis
This patch fixes sparse warnings when incorrectly handling the port number and using int instead of unsigned int iterating through &vn->sock_list[]. Keeping the port as __be16 also makes things clearer wrt endianess. Also, it was pointed out that vxlan_get_rx_port() had unnecessary checks which got removed. Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-06net: add documentation for BQL helpersFlorian Fainelli
Provide a kernel-doc comment documentation for the BQL helpers: - netdev_sent_queue - netdev_completed_queue - netdev_reset_queue Similarly to how it is done for the other functions, the documentation only covers the function operating on struct net_device and not struct netdev_queue. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-05vxlan: Notify drivers for listening UDP port changesJoseph Gasparakis
This patch adds two more ndo ops: ndo_add_rx_vxlan_port() and ndo_del_rx_vxlan_port(). Drivers can get notifications through the above functions about changes of the UDP listening port of VXLAN. Also, when physical ports come up, now they can call vxlan_get_rx_port() in order to obtain the port number(s) of the existing VXLAN interface in case they already up before them. This information about the listening UDP port would be used for VXLAN related offloads. A big thank you to John Fastabend (john.r.fastabend@intel.com) for his input and his suggestions on this patch set. CC: John Fastabend <john.r.fastabend@intel.com> CC: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29net: add netdev_for_each_upper_dev_rcu()Veaceslav Falico
The new macro netdev_for_each_upper_dev_rcu(dev, upper, iter) iterates through the dev->upper_dev_list starting from the first element, using the netdev_upper_get_next_dev_rcu(dev, &iter). Must be called under RCU read lock. CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> CC: Cong Wang <amwang@redhat.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29net: add lower_dev_list to net_device and make a full meshVeaceslav Falico
This patch adds lower_dev_list list_head to net_device, which is the same as upper_dev_list, only for lower devices, and begins to use it in the same way as the upper list. It also changes the way the whole adjacent device lists work - now they contain *all* of upper/lower devices, not only the first level. The first level devices are distinguished by the bool neighbour field in netdev_adjacent, also added by this patch. There are cases when a device can be added several times to the adjacent list, the simplest would be: /---- eth0.10 ---\ eth0- --- bond0 \---- eth0.20 ---/ where both bond0 and eth0 'see' each other in the adjacent lists two times. To avoid duplication of netdev_adjacent structures ref_nr is being kept as the number of times the device was added to the list. The 'full view' is achieved by adding, on link creation, all of the upper_dev's upper_dev_list devices as upper devices to all of the lower_dev's lower_dev_list devices (and to the lower_dev itself), and vice versa. On unlink they are removed using the same logic. I've tested it with thousands vlans/bonds/bridges, everything works ok and no observable lags even on a huge number of interfaces. Memory footprint for 128 devices interconnected with each other via both upper and lower (which is impossible, but for the comparison) lists would be: 128*128*2*sizeof(netdev_adjacent) = 1.5MB but in the real world we usualy have at most several devices with slaves and a lot of vlans, so the footprint will be much lower. CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Alexander Duyck <alexander.h.duyck@intel.com> CC: Cong Wang <amwang@redhat.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Merge net into net-next to setup some infrastructure Eric Dumazet needs for usbnet changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-01net: rename CONFIG_NET_LL_RX_POLL to CONFIG_NET_RX_BUSY_POLLCong Wang
Eliezer renames several *ll_poll to *busy_poll, but forgets CONFIG_NET_LL_RX_POLL, so in case of confusion, rename it too. Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-30net: add ndo to get id of physical port of the deviceJiri Pirko
This patch adds a ndo for getting physical port of the device. Driver which is aware of being virtual function of some physical port should implement this ndo. This is applicable not only for IOV, but for other solutions (NPAR, multichannel) as well. Basically if there is possible to have multiple netdevs on the single hw port. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-24net: Make devnet_rename_seq staticThomas Gleixner
No users outside net/core/dev.c. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-23net: convert resend IGMP to notifier eventJiri Pirko
Until now, bond_resend_igmp_join_requests() looks for vlans attached to bonding device, bridge where bonding act as port manually. It does not care of other scenarios, like stacked bonds or team device above. Make this more generic and use netdev notifier to propagate the event to upper devices and to actually call ip_mc_rejoin_groups(). Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-10net: rename ll methods to busy-pollEliezer Tamir
Rename ndo_ll_poll to ndo_busy_poll. Rename sk_mark_ll to sk_mark_napi_id. Rename skb_mark_ll to skb_mark_napi_id. Correct all useres of these functions. Update comments and defines in include/net/busy_poll.h Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/freescale/fec_main.c drivers/net/ethernet/renesas/sh_eth.c net/ipv4/gre.c The GRE conflict is between a bug fix (kfree_skb --> kfree_skb_list) and the splitting of the gre.c code into seperate files. The FEC conflict was two sets of changes adding ethtool support code in an "!CONFIG_M5272" CPP protected block. Finally the sh_eth.c conflict was between one commit add bits set in the .eesr_err_check mask whilst another commit removed the .tx_error_check member and assignments. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-26net: fix kernel deadlock with interface rename and netdev name retrieval.Nicolas Schichan
When the kernel (compiled with CONFIG_PREEMPT=n) is performing the rename of a network interface, it can end up waiting for a workqueue to complete. If userland is able to invoke a SIOCGIFNAME ioctl or a SO_BINDTODEVICE getsockopt in between, the kernel will deadlock due to the fact that read_secklock_begin() will spin forever waiting for the writer process (the one doing the interface rename) to update the devnet_rename_seq sequence. This patch fixes the problem by adding a helper (netdev_get_name()) and using it in the code handling the SIOCGIFNAME ioctl and SO_BINDTODEVICE setsockopt. The netdev_get_name() helper uses raw_seqcount_begin() to avoid spinning forever, waiting for devnet_rename_seq->sequence to become even. cond_resched() is used in the contended case, before retrying the access to give the writer process a chance to finish. The use of raw_seqcount_begin() will incur some unneeded work in the reader process in the contended case, but this is better than deadlocking the system. Signed-off-by: Nicolas Schichan <nschichan@freebox.fr> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-13net/core: Add VF link state controlRony Efraim
Add netlink directives and ndo entry to allow for controling VF link, which can be in one of three states: Auto - VF link state reflects the PF link state (default) Up - VF link state is up, traffic from VF to VF works even if the actual PF link is down Down - VF link state is down, no traffic from/to this VF, can be of use while configuring the VF Signed-off-by: Rony Efraim <ronye@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-13net-rps: fixes for rps flow limitWillem de Bruijn
Caught by sparse: - __rcu: missing annotation to sd->flow_limit - __user: direct access in cpumask_scnprintf Also - add endline character when printing bitmap if room in buffer - avoid bucket overflow by reducing FLOW_LIMIT_HISTORY The last item warrants some explanation. The hashtable buckets are subject to overflow if FLOW_LIMIT_HISTORY is larger than or equal to bucket size, since all packets may end up in a single bucket. The current (rather arbitrary) history value of 256 happens to match the buffer size (u8). As a result, with a single flow, the first 128 packets are accepted (correct), the second 128 packets dropped (correct) and then the history[] array has filled, so that each subsequent new packet causes an increment in the bucket for new_flow plus a decrement for old_flow: a steady state. This is fine if packets are dropped, as the steady state goes away as soon as a mix of traffic reappears. But, because the 256th packet overflowed the bucket to 0: no packets are dropped. Instead of explicitly adding an overflow check, this patch changes FLOW_LIMIT_HISTORY to never be able to overflow a single bucket. Reported-by: Fengguang Wu <fengguang.wu@intel.com> (first item) Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>