path: root/net/core/dev.c
2013-03-29  net: add a synchronize_net() in netdev_rx_handler_unregister()  (Eric Dumazet)
commit 35d48903e97819 (bonding: fix rx_handler locking) added a race in the bonding driver, reported by Steven Rostedt who did a very good diagnosis:

<quoting Steven>

I'm currently debugging a crash in an old 3.0-rt kernel that one of our customers is seeing. The bug happens with a stress test that loads and unloads the bonding module in a loop (I don't know all the details as I'm not the one that is directly interacting with the customer). But the bug looks to be something that may still be present and possibly present in mainline too. It will just be much harder to trigger it in mainline.

In -rt, interrupts are threads, and can schedule in and out just like any other thread. Note, mainline now supports interrupt threads so this may be easily reproducible in mainline as well. I don't have the ability to tell the customer to try mainline or other kernels, so my hands are somewhat tied to what I can do.

But according to a core dump, I tracked down that the eth irq thread crashed in bond_handle_frame() here:

    slave = bond_slave_get_rcu(skb->dev);
    bond = slave->bond; <--- BUG

The slave returned was NULL and accessing slave->bond caused a NULL pointer dereference.

Looking at the code that unregisters the handler:

    void netdev_rx_handler_unregister(struct net_device *dev)
    {
        ASSERT_RTNL();
        RCU_INIT_POINTER(dev->rx_handler, NULL);
        RCU_INIT_POINTER(dev->rx_handler_data, NULL);
    }

Which is basically:

    dev->rx_handler = NULL;
    dev->rx_handler_data = NULL;

And looking at __netif_receive_skb() we have:

    rx_handler = rcu_dereference(skb->dev->rx_handler);
    if (rx_handler) {
        if (pt_prev) {
            ret = deliver_skb(skb, pt_prev, orig_dev);
            pt_prev = NULL;
        }
        switch (rx_handler(&skb)) {

My question to all of you is, what stops this interrupt from happening while the bonding module is unloading? What happens if the interrupt triggers and we have this:

    CPU0                              CPU1
    ----                              ----
    rx_handler = skb->dev->rx_handler
                                      netdev_rx_handler_unregister() {
                                          dev->rx_handler = NULL;
                                          dev->rx_handler_data = NULL;
    rx_handler()
      bond_handle_frame() {
        slave = skb->dev->rx_handler;
        bond = slave->bond; <-- NULL pointer dereference!!!

What protection am I missing in the bond release handler that would prevent the above from happening?

</quoting Steven>

We can fix this bug in two ways. The first is adding a test in bond_handle_frame() and others to check if rx_handler_data is NULL. The second is adding a synchronize_net() in netdev_rx_handler_unregister() to make sure that an rcu protected reader is guaranteed to see a non NULL rx_handler_data.

The second way is better, as it avoids an extra test in the fast path.

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jiri Pirko <jpirko@redhat.com>
Cc: Paul E. McKenney <paulmck@us.ibm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
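Below is a minimal sketch of the fixed unregister path as described above; the comment wording is illustrative rather than a verbatim copy of the upstream patch.

    void netdev_rx_handler_unregister(struct net_device *dev)
    {
        ASSERT_RTNL();
        RCU_INIT_POINTER(dev->rx_handler, NULL);
        /* A reader that already fetched a non-NULL rx_handler inside its
         * rcu_read_lock() section is guaranteed to finish before
         * synchronize_net() returns, so it still sees a valid
         * rx_handler_data below.
         */
        synchronize_net();
        RCU_INIT_POINTER(dev->rx_handler_data, NULL);
    }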
2013-03-29  net: core: Remove redundant call to 'nf_reset' in 'dev_forward_skb'  (Shmulik Ladkani)
'nf_reset' is called just prior to calling 'netif_rx'. No need to call it twice.

Reported-by: Igor Michailov <rgohita@gmail.com>
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-24  net: remove a WARN_ON() in net_enable_timestamp()  (Eric Dumazet)
The WARN_ON(in_interrupt()) in net_enable_timestamp() can give a false positive in the socket clone path, run from softirq context:

    [ 3641.624425] WARNING: at net/core/dev.c:1532 net_enable_timestamp+0x7b/0x80()
    [ 3641.668811] Call Trace:
    [ 3641.671254]  <IRQ>  [<ffffffff80286817>] warn_slowpath_common+0x87/0xc0
    [ 3641.677871]  [<ffffffff8028686a>] warn_slowpath_null+0x1a/0x20
    [ 3641.683683]  [<ffffffff80742f8b>] net_enable_timestamp+0x7b/0x80
    [ 3641.689668]  [<ffffffff80732ce5>] sk_clone_lock+0x425/0x450
    [ 3641.695222]  [<ffffffff8078db36>] inet_csk_clone_lock+0x16/0x170
    [ 3641.701213]  [<ffffffff807ae449>] tcp_create_openreq_child+0x29/0x820
    [ 3641.707663]  [<ffffffff807d62e2>] ? ipt_do_table+0x222/0x670
    [ 3641.713354]  [<ffffffff807aaf5b>] tcp_v4_syn_recv_sock+0xab/0x3d0
    [ 3641.719425]  [<ffffffff807af63a>] tcp_check_req+0x3da/0x530
    [ 3641.724979]  [<ffffffff8078b400>] ? inet_hashinfo_init+0x60/0x80
    [ 3641.730964]  [<ffffffff807ade6f>] ? tcp_v4_rcv+0x79f/0xbe0
    [ 3641.736430]  [<ffffffff807ab9bd>] tcp_v4_do_rcv+0x38d/0x4f0
    [ 3641.741985]  [<ffffffff807ae14a>] tcp_v4_rcv+0xa7a/0xbe0

It's safe at this point because the parent socket owns a reference on netstamp_needed, so we can't have a 0 -> 1 transition, which would require taking a mutex.

Instead of refining the check, let's remove it, as all known callers are safe. If this ever changes in the future, static_key_slow_inc() will complain anyway.

Reported-by: Laurent Chavey <chavey@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-12  net/core: move vlan_depth out of while loop in skb_network_protocol()  (Li RongQing)
[ Bug added in commit 05e8ef4ab2d8087d (net: factor out skb_mac_gso_segment() from skb_gso_segment()) ]

Move vlan_depth out of the while loop; otherwise vlan_depth is always ETH_HLEN and can never be increased, which leads to an infinite loop when a frame has two vlan headers.

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
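A simplified sketch of the corrected helper (assuming the single-tag 802.1Q logic of that era); the key point is that vlan_depth is initialized once, outside the loop, so each iteration advances past one VLAN header instead of re-reading the first one forever.

    __be16 skb_network_protocol(struct sk_buff *skb)
    {
        __be16 type = skb->protocol;
        int vlan_depth = ETH_HLEN;    /* declared outside the loop so it accumulates */

        while (type == htons(ETH_P_8021Q)) {
            struct vlan_hdr *vh;

            if (unlikely(!pskb_may_pull(skb, vlan_depth + VLAN_HLEN)))
                return 0;

            vh = (struct vlan_hdr *)(skb->data + vlan_depth);
            type = vh->h_vlan_encapsulated_proto;
            vlan_depth += VLAN_HLEN;
        }

        return type;
    }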
2013-03-08  bridging: fix rx_handlers return code  (Cristian Bercaru)
The frames for which rx_handlers return RX_HANDLER_CONSUMED are no longer counted as dropped. They are counted as successfully received by 'netif_receive_skb'. This allows network interface drivers to correctly update their RX-OK and RX-DRP counters based on the result of 'netif_receive_skb'. Signed-off-by: Cristian Bercaru <B43982@freescale.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-06  net: reduce net_rx_action() latency to 2 HZ  (Eric Dumazet)
We should use time_after_eq() to get maximum latency of two ticks, instead of three. Bug added in commit 24f8b2385 (net: increase receive packet quantum) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
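A sketch of the relevant check in net_rx_action(), assuming the surrounding loop structure of that era: with a limit of jiffies + 2, time_after_eq() stops the loop once the second tick is reached, whereas plain time_after() let it run into a third tick.

    /* sd is the per-cpu softnet_data */
    unsigned long time_limit = jiffies + 2;
    int budget = netdev_budget;

    while (!list_empty(&sd->poll_list)) {
        /* punt back to softirq as soon as the second tick is reached */
        if (unlikely(budget <= 0 || time_after_eq(jiffies, time_limit)))
            goto softnet_break;
        /* ... poll one napi instance and decrement budget ... */
    }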
2013-03-06  net: fix new kernel-doc warnings in net core  (Randy Dunlap)
Fix new kernel-doc warnings in net/core/dev.c:

    Warning(net/core/dev.c:4788): No description found for parameter 'new_carrier'
    Warning(net/core/dev.c:4788): Excess function parameter 'new_carries' description in 'dev_change_carrier'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-27  hlist: drop the node parameter from iterators  (Sasha Levin)
I'm not sure why, but the hlist for-each-entry iterators were conceived differently from the list ones:

    list_for_each_entry(pos, head, member)

The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

Why did they need an extra pos parameter? I'm not quite sure. Not only do they not really need it, it also prevents the iterator from looking exactly like the list iterator, which is unfortunate.

Besides the semantic patch, there was some manual work required:

 - Fix up the actual hlist iterators in linux/list.h
 - Fix up the declaration of other iterators based on the hlist ones.
 - A very small number of places were using the 'node' parameter; these were modified to use 'obj->member' instead.
 - Coccinelle didn't handle the hlist_for_each_entry_safe iterator properly, so those had to be fixed up manually.

The semantic patch, which is mostly the work of Peter Senna Tschudin, is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@
    -T b;
    <+... when != b
    (
    hlist_for_each_entry(a, - b, c, d) S
    |
    hlist_for_each_entry_continue(a, - b, c) S
    |
    hlist_for_each_entry_from(a, - b, c) S
    |
    hlist_for_each_entry_rcu(a, - b, c, d) S
    |
    hlist_for_each_entry_rcu_bh(a, - b, c, d) S
    |
    hlist_for_each_entry_continue_rcu_bh(a, - b, c) S
    |
    for_each_busy_worker(a, c, - b, d) S
    |
    ax25_uid_for_each(a, - b, c) S
    |
    ax25_for_each(a, - b, c) S
    |
    inet_bind_bucket_for_each(a, - b, c) S
    |
    sctp_for_each_hentry(a, - b, c) S
    |
    sk_for_each(a, - b, c) S
    |
    sk_for_each_rcu(a, - b, c) S
    |
    sk_for_each_from -(a, b) +(a) S
    + sk_for_each_from(a) S
    |
    sk_for_each_safe(a, - b, c, d) S
    |
    sk_for_each_bound(a, - b, c) S
    |
    hlist_for_each_entry_safe(a, - b, c, d, e) S
    |
    hlist_for_each_entry_continue_rcu(a, - b, c) S
    |
    nr_neigh_for_each(a, - b, c) S
    |
    nr_neigh_for_each_safe(a, - b, c, d) S
    |
    nr_node_for_each(a, - b, c) S
    |
    nr_node_for_each_safe(a, - b, c, d) S
    |
    - for_each_gfn_sp(a, c, d, b) S
    + for_each_gfn_sp(a, c, d) S
    |
    - for_each_gfn_indirect_valid_sp(a, c, d, b) S
    + for_each_gfn_indirect_valid_sp(a, c, d) S
    |
    for_each_host(a, - b, c) S
    |
    for_each_host_safe(a, - b, c, d) S
    |
    for_each_mesh_entry(a, - b, c, d) S
    )
    ...+>

[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
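An illustrative before/after of what the conversion looks like at a call site (the struct, bucket and process() names here are made up for the example, not taken from the patch):

    struct item {
        int value;
        struct hlist_node node;
    };

    struct hlist_head bucket;   /* hypothetical hash bucket */
    struct item *it;

    /* before: a scratch 'struct hlist_node *pos' had to be declared and threaded through */
    struct hlist_node *pos;
    hlist_for_each_entry(it, pos, &bucket, node)
        process(it);

    /* after: the iterator mirrors list_for_each_entry() */
    hlist_for_each_entry(it, &bucket, node)
        process(it);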
2013-02-22  net: Fix locking bug in netif_set_xps_queue  (Alexander Duyck)
Smatch found a locking bug in netif_set_xps_queue in which we were not releasing the lock in the case of an allocation failure. This change corrects that so that we release the xps_map_mutex before returning -ENOMEM in the case of an allocation failure. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
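A hedged sketch of the corrected error path, with the surrounding map-allocation loop elided; the essential change is releasing xps_map_mutex before bailing out with -ENOMEM.

    mutex_lock(&xps_map_mutex);
    /* ... allocate the per-CPU maps for new_dev_maps ... */
    if (!new_dev_maps) {
        mutex_unlock(&xps_map_mutex);   /* previously missing on this path */
        return -ENOMEM;
    }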
2013-02-19  net: fix a build failure when !CONFIG_PROC_FS  (Cong Wang)
When !CONFIG_PROC_FS, dev_mcast_init() is not defined; we can just merge dev_mcast_init() into dev_proc_init().

Reported-by: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-19  net: move procfs code to net/core/net-procfs.c  (Cong Wang)
Similar to net/core/net-sysfs.c, group procfs code to a single unit. Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-18  net: proc: change proc_net_remove to remove_proc_entry  (Gao feng)
proc_net_remove is only used to remove proc entries that live under /proc/net; it's not a general function for removing proc entries of a netns. If we want to remove proc entries under /proc/net/stat/, we still need to call remove_proc_entry.

This patch uses remove_proc_entry to replace proc_net_remove, so that proc_net_remove can be removed after this patch.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-18  net: proc: change proc_net_fops_create to proc_create  (Gao feng)
Right now, some modules such as bonding use proc_create to create proc entries under /proc/net/, while other modules such as ipv4 use proc_net_fops_create. This is a little chaotic. This patch changes all users of proc_net_fops_create to proc_create, so that proc_net_fops_create can be removed after this patch.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-18  net: move ioctl functions into a separated file  (Cong Wang)
They well deserve a separate unit of their own.

Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-15  net: use skb_reset_mac_len() in dev_gro_receive()  (Eric Dumazet)
We no longer need to use mac_len; let's clean things up.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-15  v4 GRE: Add TCP segmentation offload for GRE  (Pravin B Shelar)
The following patch adds a GRE protocol offload handler so that skb_gso_segment() can segment GRE packets.

An SKB GSO CB is added to keep track of the total header length so that skb_segment can push the entire header; e.g. in the GRE case, skb_segment needs to push the inner and outer headers to every segment.

A new NETIF_F_GRE_GSO feature is added for devices which support HW GRE TSO offload. Currently no device supports it, therefore GRE GSO always falls back to software GSO.

[ Compute pkt_len before ip_local_out() invocation. -DaveM ]

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-15  net: factor out skb_mac_gso_segment() from skb_gso_segment()  (Pravin B Shelar)
This function will be used in next GRE_GSO patch. This patch does not change any functionality. It only exports skb_mac_gso_segment() function. [ Use skb_reset_mac_len() -DaveM ] Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-14  net: Don't write to current task flags on every packet received.  (David S. Miller)
Even for non-pfmalloc SKBs, __netif_receive_skb() will do a tsk_restore_flags() on current unconditionally. Make __netif_receive_skb() a shim around the existing code, renamed to __netif_receive_skb_core(). Let __netif_receive_skb() wrap the __netif_receive_skb_core() call with the task flag modifications, if necessary. Signed-off-by: David S. Miller <davem@davemloft.net>
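A sketch of the resulting shim, assuming the pfmemalloc handling described above; only packets from memalloc reserves pay for the task-flag save/restore.

    static int __netif_receive_skb(struct sk_buff *skb)
    {
        int ret;

        if (sk_memalloc_socks() && skb_pfmemalloc(skb)) {
            unsigned long pflags = current->flags;

            /* Only pfmemalloc skbs need PF_MEMALLOC set around the core path. */
            current->flags |= PF_MEMALLOC;
            ret = __netif_receive_skb_core(skb, true);
            tsk_restore_flags(current, pflags, PF_MEMALLOC);
        } else {
            ret = __netif_receive_skb_core(skb, false);
        }

        return ret;
    }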
2013-02-06  net: reset mac header in dev_start_xmit()  (Eric Dumazet)
On 64 bit arches there is an off-by-one error in qdisc_pkt_len_init() because mac_header is not set in the xmit path: skb_mac_header() returns an out-of-bound value that was harmless because hdr_len is an 'unsigned int'. On 32 bit arches, the error is abysmal.

This patch is also a prereq for "macvlan: add multicast filter".

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-06  net: adjust skb_gso_segment() for calling in rx path  (Cong Wang)
skb_gso_segment() is almost always called in tx path, except for openvswitch. It calls this function when it receives the packet and tries to queue it to user-space. In this special case, the ->ip_summed check inside skb_gso_segment() is no longer true, as ->ip_summed value has different meanings on rx path. This patch adjusts skb_gso_segment() so that we can at least avoid such warnings on checksum. Cc: Jesse Gross <jesse@nicira.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-06  netpoll: protect napi_poll and poll_controller during dev_[open|close]  (Neil Horman)
Ivan Vecera was recently backporting commit 9c13cb8bb477a83b9a3c9e5a5478a4e21294a760 to a RHEL kernel, and I noticed that, while this patch protects the tg3 driver from having its ndo_poll_controller routine called during device initialization, it does nothing for the driver during shutdown. I.e. it would be entirely possible to have the ndo_poll_controller method (or subsequently the ndo_poll routine) called for a driver in the netpoll path on CPU A while in parallel on CPU B the ndo_close or ndo_open routine could be called. Given that the two latter routines tend to initialize and free many data structures that the former two rely on, the result can easily be data corruption or various other crashes. Furthermore, it seems that this is potentially a problem with all net drivers that support netpoll, and so this should ideally be fixed in a common path.

As Ben H pointed out to me, we can't perform dev_open/dev_close in atomic context, so I've come up with this solution. We can use a mutex to sleep in the open/close paths and just do a mutex_trylock in the napi poll path, abandoning the poll attempt if we're locked, as we'll just retry the poll on the next send anyway.

I've tested this here by flooding netconsole with messages on a system whose nic driver I modified to periodically return NETDEV_TX_BUSY, so that the netpoll tx workqueue would be forced to send frames and poll the device. While this was going on I rapidly ifdown/up'ed the interface and watched for any problems. I've not found any.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Ivan Vecera <ivecera@redhat.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: Ben Hutchings <bhutchings@solarflare.com>
CC: Francois Romieu <romieu@fr.zoreil.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
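A rough sketch of the trylock idea in the netpoll poll path; the mutex field name and helper names are illustrative, not necessarily those used by the final patch.

    static void netpoll_poll_dev(struct net_device *dev)
    {
        struct netpoll_info *ni = rcu_dereference_bh(dev->npinfo);

        /* dev_open()/dev_close() take this mutex and may sleep; the netpoll
         * path runs in atomic context, so it only tries the lock and skips
         * this poll attempt if open/close is in progress.  The poll will be
         * retried on the next netpoll transmit anyway.
         */
        if (!ni || !mutex_trylock(&ni->dev_lock))
            return;

        /* ... call ndo_poll_controller() / poll the napi instances ... */

        mutex_unlock(&ni->dev_lock);
    }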
2013-02-06  net: core: Remove unnecessary alloc/OOM messages  (Joe Perches)
alloc failures already get standardized OOM messages and a dump_stack. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-29  net: disallow drivers with buggy VLAN accel to register_netdevice()  (Michał Mirosław)
Instead of jumping around bugs that are easily fixed, just don't let them in: affected drivers should be either fixed or have NETIF_F_HW_VLAN_FILTER removed from their advertised features.

A quick grep in drivers/net shows two drivers that have NETIF_F_HW_VLAN_FILTER but not ndo_vlan_rx_add/kill_vid(), but those are false positives (the features are commented out). OTOH, two drivers have ndo_vlan_rx_add/kill_vid() implemented but don't advertise NETIF_F_HW_VLAN_FILTER. Those are:

    +ethernet/cisco/enic/enic_main.c
    +ethernet/qlogic/qlcnic/qlcnic_main.c

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-28  net: fix possible wrong checksum generation  (Eric Dumazet)
Pravin Shelar mentioned that GSO could potentially generate a wrong TX checksum if the skb has fragments that are overwritten by the user between the checksum computation and transmit. He suggested linearizing skbs, but this extra copy can be avoided for normal tcp skbs cooked by tcp_sendmsg().

This patch introduces a new SKB_GSO_SHARED_FRAG flag, set in skb_shinfo(skb)->gso_type if at least one frag can be modified by the user. Typical sources of such possible overwrites are {vm}splice(), sendfile(), and macvtap/tun/virtio_net drivers.

Tested:

    $ netperf -H 7.7.8.84
    MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
    Recv   Send    Send
    Socket Socket  Message  Elapsed
    Size   Size    Size     Time     Throughput
    bytes  bytes   bytes    secs.    10^6bits/sec

     87380  16384  16384    10.00    3959.52

    $ netperf -H 7.7.8.84 -t TCP_SENDFILE
    TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
    Recv   Send    Send
    Socket Socket  Message  Elapsed
    Size   Size    Size     Time     Throughput
    bytes  bytes   bytes    secs.    10^6bits/sec

     87380  16384  16384    10.00    3216.80

Performance of the SENDFILE case is impacted by the extra allocation and copy, and because we use order-0 pages, while the TCP_STREAM case uses bigger pages.

Reported-by: Pravin Shelar <pshelar@nicira.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-21  net: move rx and tx hash functions to net/core/flow_dissector.c  (Cong Wang)
__skb_tx_hash() and __skb_get_rxhash() both calculate a hash value based on some fields in the skb, and are mostly used for selecting queues by device drivers. Meanwhile, net/core/dev.c is bloating.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-16  net_sched: fix qdisc_pkt_len_init()  (Eric Dumazet)
commit 1def9238d4aa2 (net_sched: more precise pkt_len computation) does a wrong computation of mac + network headers length, as it includes the padding before the frame. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuval Mintz <yuvalmin@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-15  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (David S. Miller)
Conflicts: Documentation/networking/ip-sysctl.txt drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c Both conflicts were simply overlapping context. A build fix for qlcnic is in here too, simply removing the added devinit annotations which no longer exist. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-11  net, wireless: overwrite default_ethtool_ops  (Stanislaw Gruszka)
Since:

    commit 2c60db037034d27f8c636403355d52872da92f81
    Author: Eric Dumazet <edumazet@google.com>
    Date:   Sun Sep 16 09:17:26 2012 +0000

        net: provide a default dev->ethtool_ops

the wireless core does not correctly assign ethtool_ops. After the alloc_netdev*() call, some cfg80211 drivers provide their own ethtool_ops, but some do not. For them, the wireless core provides a generic cfg80211_ethtool_ops, which is assigned in the NETDEV_REGISTER notify call:

    if (!dev->ethtool_ops)
        dev->ethtool_ops = &cfg80211_ethtool_ops;

But after Eric's commit, dev->ethtool_ops is no longer NULL (on cfg80211 drivers without custom ethtool_ops), but points to &default_ethtool_ops.

In order to fix the problem, provide a function which will overwrite default_ethtool_ops and use it in the wireless core.

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-11  net: Export __netdev_pick_tx so that it can be used in modules  (Alexander Duyck)
When testing with FCoE enabled we discovered that I had not exported __netdev_pick_tx. As a result ixgbe doesn't build with the RFC patches applied because ixgbe_select_queue was calling the function. This change corrects that build issue by correctly exporting __netdev_pick_tx so it can be used by modules. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-10  net: Add support for XPS without sysfs being defined  (Alexander Duyck)
This patch makes it so that we can support transmit packet steering without sysfs needing to be enabled. The reason for making this change is to make it so that a driver can make use of the XPS even while the sysfs portion of the interface is not present. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-10  net: Rewrite netif_set_xps_queues to address several issues  (Alexander Duyck)
This change is meant to address several issues I found within the netif_set_xps_queues function. If the allocation of one of the maps to be assigned to new_dev_maps failed we could end up with the device map in an inconsistent state since we had already worked through a number of CPUs and removed or added the queue. To address that I split the process into several steps. The first of which is just the allocation of updated maps for CPUs that will need larger maps to store the queue. By doing this we can fail gracefully without actually altering the contents of the current device map. The second issue I found was the fact that we were always allocating a new device map even if we were not adding any queues. I have updated the code so that we only allocate a new device map if we are adding queues, otherwise if we are not adding any queues to CPUs we just skip to the removal process. The last change I made was to reuse the code from remove_xps_queue to remove the queue from the CPU. By making this change we can be consistent in how we go about adding and removing the queues from the CPUs. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-10  net: Rewrite netif_reset_xps_queue to allow for better code reuse  (Alexander Duyck)
This patch does a minor refactor on netif_reset_xps_queue to address a few items I noticed. First is the fact that we are doing removal of queues in both netif_reset_xps_queue and netif_set_xps_queue. Since there is no need to have the code in two places I am pushing it out into a separate function and will come back in another patch and reuse the code in netif_set_xps_queue. The second item this change addresses is the fact that the Tx queues were not getting their numa_node value cleared as a part of the XPS queue reset. This patch resolves that by resetting the numa_node value if the dev_maps value is set. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-10  net: Add functions netif_reset_xps_queue and netif_set_xps_queue  (Alexander Duyck)
This patch adds two functions, netif_reset_xps_queue and netif_set_xps_queue. The main idea behind these two functions is to provide a mechanism through which drivers can update their defaults in regards to XPS. Currently no such mechanism exists and as a result we cannot use XPS for things such as ATR which would require a basic configuration to start in which the Tx queues are mapped to CPUs via a 1:1 mapping. With this change I am making it possible for drivers such as ixgbe to be able to use the XPS feature by controlling the default configuration. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-10  net: Split core bits of netdev_pick_tx into __netdev_pick_tx  (Alexander Duyck)
This change splits the core bits of netdev_pick_tx into a separate function. The main idea behind this is to make this code accessible to select queue functions when they decide to process the standard path instead of their own custom path in their select queue routine. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
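After the split, netdev_pick_tx() is roughly a thin wrapper around the core helper; a simplified sketch (the real function also caps the returned index against real_num_tx_queues):

    struct netdev_queue *netdev_pick_tx(struct net_device *dev,
                                        struct sk_buff *skb)
    {
        const struct net_device_ops *ops = dev->netdev_ops;
        int queue_index = 0;

        if (dev->real_num_tx_queues != 1) {
            if (ops->ndo_select_queue)
                queue_index = ops->ndo_select_queue(dev, skb);
            else
                queue_index = __netdev_pick_tx(dev, skb);
        }

        skb_set_queue_mapping(skb, queue_index);
        return netdev_get_tx_queue(dev, queue_index);
    }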
2013-01-10  net_sched: more precise pkt_len computation  (Eric Dumazet)
One long-standing problem with TSO/GSO/GRO packets is that skb->len doesn't represent a precise amount of bytes on the wire: headers are only accounted for the first segment. For TCP, that's typically 66 bytes per 1448-byte segment missing, an error of 4.5 % for a normal MSS value.

As consequences:

1) TBF/CBQ/HTB/NETEM/... can send more bytes than the assigned limits.
2) Device stats are slightly underestimated as well.

Fix this by taking account of headers in the qdisc_skb_cb(skb)->pkt_len computation. Packet schedulers should use qdisc pkt_len instead of skb->len for their bandwidth limitations, and TSO-enabled device drivers could use pkt_len if their statistics are not hardware assisted, and if they don't scratch skb->cb[] first word.

Both egress and ingress paths work, thanks to commit fda55eca5a (net: introduce skb_transport_header_was_set()): if GRO built a GSO packet, it also set the transport header for us.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Paolo Valente <paolo.valente@unimore.it>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
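The idea in a hedged sketch of qdisc_pkt_len_init() (details such as the transport-header contribution and the gso_segs fallback are simplified away): for GSO packets, add the per-segment header bytes that skb->len does not carry for the trailing segments.

    static void qdisc_pkt_len_init(struct sk_buff *skb)
    {
        const struct skb_shared_info *shinfo = skb_shinfo(skb);

        qdisc_skb_cb(skb)->pkt_len = skb->len;

        /* skb->len counts mac+network+transport headers only once; for a
         * GSO packet, charge them for every additional segment as well.
         */
        if (shinfo->gso_size && shinfo->gso_segs > 1) {
            unsigned int hdr_len;

            hdr_len = skb_transport_header(skb) - skb_mac_header(skb);
            /* + TCP/UDP header length, omitted in this sketch */
            qdisc_skb_cb(skb)->pkt_len += (shinfo->gso_segs - 1) * hdr_len;
        }
    }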
2013-01-08  net: init perm_addr in register_netdevice()  (Jiri Pirko)
Benefit from the fact that dev->addr_assign_type is set to NET_ADDR_PERM in case the device has permanent address. This also fixes the problem that many drivers do not set perm_addr at all. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-08  net: introduce skb_transport_header_was_set()  (Eric Dumazet)
We have skb_mac_header_was_set() helper to tell if mac_header was set on a skb. We would like the same for transport_header. __netif_receive_skb() doesn't reset the transport header if already set by GRO layer. Note that network stacks usually reset the transport header anyway, after pulling the network header, so this change only allows a followup patch to have more precise qdisc pkt_len computation for GSO packets at ingress side. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
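As used in __netif_receive_skb() after this change, in a short sketch: a transport header already set by GRO is preserved, and only unset headers are reset.

    /* Keep a transport header already set by the GRO layer;
     * only reset it when nothing set it earlier.
     */
    if (!skb_transport_header_was_set(skb))
        skb_reset_transport_header(skb);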
2013-01-04  net: remove no longer used netdev_set_bond_master() and netdev_set_master()  (Jiri Pirko)
Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-04  net: introduce upper device lists  (Jiri Pirko)
These lists are meant to store pointers to all upper devices. Eventually they will replace the dev->master pointer, which is used for bonding, bridge and team but cannot be used for vlan or macvlan, where multiple upper devices may be present. In case the upper link is a replacement for dev->master, it is marked with a "master" flag. The new upper device list resolves this limitation.

Also, the information stored in the lists is used for preventing looping setups like "bond->somethingelse->samebond".

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-03  net: add address assign type "SET"  (Jiri Pirko)
This is the way to indicate that the mac address of a device has been set by dev_set_mac_address().

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-03  net: call add_device_randomness() only after successful mac change  (Jiri Pirko)
Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-28  net: add change_carrier netdev op  (Jiri Pirko)
This allows a driver to register a change_carrier callback which will be called whenever the user wants to change the carrier state. This is useful for devices like dummy, gre, team and so on.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
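A sketch of the corresponding core helper that dispatches to the new op; the error codes shown are the obvious choices and are an assumption of this sketch rather than a quote of the patch.

    int dev_change_carrier(struct net_device *dev, bool new_carrier)
    {
        const struct net_device_ops *ops = dev->netdev_ops;

        if (!ops->ndo_change_carrier)
            return -EOPNOTSUPP;     /* driver did not register the op */
        if (!netif_device_present(dev))
            return -ENODEV;
        return ops->ndo_change_carrier(dev, new_carrier);
    }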
2012-12-21  net: devnet_rename_seq should be a seqcount  (Eric Dumazet)
Using a seqlock for devnet_rename_seq is not a good idea, as device_rename() can sleep. As we hold RTNL, we don't need protection for writers; we only need a seqcount so that readers can catch a change done by a writer.

Bug added in commit c91f6df2db4972d3 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name)

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
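The reader side then follows the usual seqcount retry pattern; a sketch, not the exact SO_BINDTODEVICE/dev_ifname code:

    unsigned int seq;
    char name[IFNAMSIZ];

retry:
    seq = read_seqcount_begin(&devnet_rename_seq);
    /* ... copy dev->name into the local buffer ... */
    if (read_seqcount_retry(&devnet_rename_seq, seq))
        goto retry;     /* a rename raced with us, re-read the name */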
2012-12-12  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next  (Linus Torvalds)
Pull networking changes from David Miller:

 1) Allow to dump, monitor, and change the bridge multicast database using netlink. From Cong Wang.
 2) RFC 5961 TCP blind data injection attack mitigation, from Eric Dumazet.
 3) Networking user namespace support from Eric W. Biederman.
 4) tuntap/virtio-net multiqueue support by Jason Wang.
 5) Support for checksum offload of encapsulated packets (basically, tunneled traffic can still be checksummed by HW). From Joseph Gasparakis.
 6) Allow BPF filter access to VLAN tags, from Eric Dumazet and Daniel Borkmann.
 7) Bridge port parameters over netlink and BPDU blocking support from Stephen Hemminger.
 8) Improve data access patterns during inet socket demux by rearranging socket layout, from Eric Dumazet.
 9) TIPC protocol updates and cleanups from Ying Xue, Paul Gortmaker, and Jon Maloy.
 10) Update TCP socket hash sizing to be more in line with current day realities. The existing heuristics were chosen a decade ago. From Eric Dumazet.
 11) Fix races, queue bloat, and excessive wakeups in ATM and associated drivers, from Krzysztof Mazur and David Woodhouse.
 12) Support DOVE (Distributed Overlay Virtual Ethernet) extensions in VXLAN driver, from David Stevens.
 13) Add "oops_only" mode to netconsole, from Amerigo Wang.
 14) Support set and query of VEB/VEPA bridge mode via PF_BRIDGE, also allow DCB netlink to work on namespaces other than the initial namespace. From John Fastabend.
 15) Support PTP in the Tigon3 driver, from Matt Carlson.
 16) tun/vhost zero copy fixes and improvements, plus turn it on by default, from Michael S. Tsirkin.
 17) Support per-association statistics in SCTP, from Michele Baldessari.

And many, many driver updates, cleanups, and improvements. Too numerous to mention individually.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits)
  net/mlx4_en: Add support for destination MAC in steering rules
  net/mlx4_en: Use generic etherdevice.h functions.
  net: ethtool: Add destination MAC address to flow steering API
  bridge: add support of adding and deleting mdb entries
  bridge: notify mdb changes via netlink
  ndisc: Unexport ndisc_{build,send}_skb().
  uapi: add missing netconf.h to export list
  pkt_sched: avoid requeues if possible
  solos-pci: fix double-free of TX skb in DMA mode
  bnx2: Fix accidental reversions.
  bna: Driver Version Updated to 3.1.2.1
  bna: Firmware update
  bna: Add RX State
  bna: Rx Page Based Allocation
  bna: TX Intr Coalescing Fix
  bna: Tx and Rx Optimizations
  bna: Code Cleanup and Enhancements
  ath9k: check pdata variable before dereferencing it
  ath5k: RX timestamp is reported at end of frame
  ath9k_htc: RX timestamp is reported at end of frame
  ...
2012-12-11  net: gro: dev_gro_receive() cleanup  (Eric Dumazet)
__napi_gro_receive() is inlined from two call sites for no good reason. Let's move the prep stuff into a function of its own, called only if/when needed. This saves 300 bytes on x86:

    # size net/core/dev.o.after net/core/dev.o.before
       text    data     bss     dec     hex filename
      51968    1238    1040   54246    d3e6 net/core/dev.o.before
      51664    1238    1040   53942    d2b6 net/core/dev.o.after

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-09  net: Handle encapsulated offloads before fragmentation or handing to lower dev  (Alexander Duyck)
This change allows the VXLAN to enable Tx checksum offloading even on devices that do not support encapsulated checksum offloads. The advantage to this is that it allows for the lower device to change due to routing table changes without impacting features on the VXLAN itself. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-07  net: gro: fix possible panic in skb_gro_receive()  (Eric Dumazet)
commit 2e71a6f8084e (net: gro: selective flush of packets) added a bug for skbs using frag_list. This part of the GRO stack is rarely used, as it needs skbs that do not use a page fragment for their skb->head. Most drivers do use a page fragment, but some of them use GFP_KERNEL allocations for the initial fill of their RX ring buffer.

napi_gro_flush() overwrites skb->prev, which was used for these skbs to point to the last skb in the frag_list. Fix this using a separate field in struct napi_gro_cb to point to the last fragment.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-07  net: call notifiers for mtu change even if iface is not up  (Jiri Pirko)
Do the same thing as in set mac. Call notifiers every time. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-04  net: dev_change_net_namespace: send a KOBJ_REMOVED/KOBJ_ADD  (Serge Hallyn)
When a new nic is created in namespace ns1, the kernel sends a KOBJ_ADD uevent to ns1. When the nic is moved to ns2, we only send a KOBJ_MOVE to ns2, and nothing to ns1.

This patch changes that behavior so that when moving a nic from ns1 to ns2, we send a KOBJ_REMOVED to ns1 and a KOBJ_ADD to ns2. (The KOBJ_MOVE is still sent to ns2.)

The effects of this can be seen when starting and stopping containers in an upstart based host. Lxc will create a pair of veth nics, the kernel sends KOBJ_ADD, and upstart starts network-instance jobs for each. When one nic is moved to the container, because no KOBJ_REMOVED event is received, the network-instance job for that veth never goes away. This was reported at https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1065589 With this patch the network-instance jobs properly go away.

The other oddness solved here is that if a nic is passed into a running upstart-based container, without this patch no network-instance job is started in the container. But when the container creates a new nic itself (ip link add new type veth) then network-interface jobs are created. With this patch, behavior comes in line with a regular host.

v2: also send KOBJ_ADD to the new netns. There will then be a _MOVE event from the device_rename() call, but that should be innocuous.

Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-29  core: make GRO methods static.  (Rami Rosen)
This patch changes three methods to be static and removes their EXPORT_SYMBOLs in core/dev.c and their external declaration in netdevice.h. The methods, dev_gro_receive(), napi_frags_finish() and napi_skb_finish(), which are in the GRO rx path, are not used outside core/dev.c. Signed-off-by: Rami Rosen <ramirose@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>