summaryrefslogtreecommitdiff
path: root/net/ipv4
AgeCommit message (Collapse)Author
2011-03-22ipv4: optimize route adding on secondary promotionJulian Anastasov
Optimize the calling of fib_add_ifaddr for all secondary addresses after the promoted one to start from their place, not from the new place of the promoted secondary. It will save some CPU cycles because we are sure the promoted secondary was first for the subnet and all next secondaries do not change their place. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-22ipv4: remove the routes on secondary promotionJulian Anastasov
The secondary address promotion relies on fib_sync_down_addr to remove all routes created for the secondary addresses when the old primary address is deleted. It does not happen for cases when the primary address is also in another subnet. Fix that by deleting local and broadcast routes for all secondaries while they are on device list and by faking that all addresses from this subnet are to be deleted. It relies on fib_del_ifaddr being able to ignore the IPs from the concerned subnet while checking for duplication. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-22ipv4: fix route deletion for IPs on many subnetsJulian Anastasov
Alex Sidorenko reported for problems with local routes left after IP addresses are deleted. It happens when same IPs are used in more than one subnet for the device. Fix fib_del_ifaddr to restrict the checks for duplicate local and broadcast addresses only to the IFAs that use our primary IFA or another primary IFA with same address. And we expect the prefsrc to be matched when the routes are deleted because it is possible they to differ only by prefsrc. This patch prevents local and broadcast routes to be leaked until their primary IP is deleted finally from the box. As the secondary address promotion needs to delete the routes for all secondaries that used the old primary IFA, add option to ignore these secondaries from the checks and to assume they are already deleted, so that we can safely delete the route while these IFAs are still on the device list. Reported-by: Alex Sidorenko <alexandre.sidorenko@hp.com> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-22ipv4: match prefsrc when deleting routesJulian Anastasov
fib_table_delete forgets to match the routes by prefsrc. Callers can specify known IP in fc_prefsrc and we should remove the exact route. This is needed for cases when same local or broadcast addresses are used in different subnets and the routes differ only in prefsrc. All callers that do not provide fc_prefsrc will ignore the route prefsrc as before and will delete the first occurence. That is how the ip route del default magic works. Current callers are: - ip_rt_ioctl where rtentry_to_fib_config provides fc_prefsrc only when the provided device name matches IP label with colon. - inet_rtm_delroute where RTA_PREFSRC is optional too - fib_magic which deals with routes when deleting addresses and where the fc_prefsrc is always set with the primary IP for the concerned IFA. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-20netfilter: ipt_CLUSTERIP: fix buffer overflowVasiliy Kulikov
'buffer' string is copied from userspace. It is not checked whether it is zero terminated. This may lead to overflow inside of simple_strtoul(). Changli Gao suggested to copy not more than user supplied 'size' bytes. It was introduced before the git epoch. Files "ipt_CLUSTERIP/*" are root writable only by default, however, on some setups permissions might be relaxed to e.g. network admin user. Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Acked-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>
2011-03-20netfilter: xtables: fix reentrancyEric Dumazet
commit f3c5c1bfd4308 (make ip_tables reentrant) introduced a race in handling the stackptr restore, at the end of ipt_do_table() We should do it before the call to xt_info_rdunlock_bh(), or we allow cpu preemption and another cpu overwrites stackptr of original one. A second fix is to change the underflow test to check the origptr value instead of 0 to detect underflow, or else we allow a jump from different hooks. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Jan Engelhardt <jengelh@medozas.de> Signed-off-by: Patrick McHardy <kaber@trash.net>
2011-03-15net_sched: fix ip_tos2prioDan Siemon
ECN support incorrectly maps ECN BESTEFFORT packets to TC_PRIO_FILLER (1) instead of TC_PRIO_BESTEFFORT (0) This means ECN enabled flows are placed in pfifo_fast/prio low priority band, giving ECN enabled flows [ECT(0) and CE codepoints] higher drop probabilities. This is rather unfortunate, given we would like ECN being more widely used. Ref : http://www.coverfire.com/archives/2011/03/13/pfifo_fast-and-ecn/ Signed-off-by: Dan Siemon <dan@coverfire.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Dave Täht <d@taht.net> Cc: Jonathan Morton <chromatix99@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-15Merge branch 'master' of ↵David S. Miller
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
2011-03-15Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-next-2.6 Conflicts: Documentation/feature-removal-schedule.txt
2011-03-15netfilter: ipt_addrtype: rename to xt_addrtypeFlorian Westphal
Followup patch will add ipv6 support. ipt_addrtype.h is retained for compatibility reasons, but no longer used by the kernel. Signed-off-by: Florian Westphal <fwestphal@astaro.com> Signed-off-by: Patrick McHardy <kaber@trash.net>
2011-03-15netfilter: ip_tables: fix infoleak to userspaceVasiliy Kulikov
Structures ipt_replace, compat_ipt_replace, and xt_get_revision are copied from userspace. Fields of these structs that are zero-terminated strings are not checked. When they are used as argument to a format string containing "%s" in request_module(), some sensitive information is leaked to userspace via argument of spawned modprobe process. The first and the third bugs were introduced before the git epoch; the second was introduced in 2722971c (v2.6.17-rc1). To trigger the bug one should have CAP_NET_ADMIN. Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Signed-off-by: Patrick McHardy <kaber@trash.net>
2011-03-15netfilter: arp_tables: fix infoleak to userspaceVasiliy Kulikov
Structures ipt_replace, compat_ipt_replace, and xt_get_revision are copied from userspace. Fields of these structs that are zero-terminated strings are not checked. When they are used as argument to a format string containing "%s" in request_module(), some sensitive information is leaked to userspace via argument of spawned modprobe process. The first bug was introduced before the git epoch; the second is introduced by 6b7d31fc (v2.6.15-rc1); the third is introduced by 6b7d31fc (v2.6.15-rc1). To trigger the bug one should have CAP_NET_ADMIN. Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Signed-off-by: Patrick McHardy <kaber@trash.net>
2011-03-14tcp_cubic: fix low utilization of CUBIC with HyStartSangtae Ha
HyStart sets the initial exit point of slow start. Suppose that HyStart exits at 0.5BDP in a BDP network and no history exists. If the BDP of a network is large, CUBIC's initial cwnd growth may be too conservative to utilize the link. CUBIC increases the cwnd 20% per RTT in this case. Signed-off-by: Sangtae Ha <sangtae.ha@gmail.com> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-14tcp_cubic: make the delay threshold of HyStart less sensitiveSangtae Ha
Make HyStart less sensitive to abrupt delay variations due to buffer bloat. Signed-off-by: Sangtae Ha <sangtae.ha@gmail.com> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Reported-by: Lucas Nussbaum <lucas.nussbaum@loria.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-14tcp_cubic: enable high resolution ack time if neededstephen hemminger
This is a refined version of an earlier patch by Lucas Nussbaum. Cubic needs RTT values in milliseconds. If HZ < 1000 then the values will be too coarse. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Reported-by: Lucas Nussbaum <lucas.nussbaum@loria.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-14tcp_cubic: fix clock dependencystephen hemminger
The hystart code was written with assumption that HZ=1000. Replace the use of jiffies with bictcp_clock as a millisecond real time clock. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Reported-by: Lucas Nussbaum <lucas.nussbaum@loria.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-14tcp_cubic: make ack train delta value a parameterstephen hemminger
Make the spacing between ACK's that indicates a train a tuneable value like other hystart values. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-14tcp_cubic: fix comparison of jiffiesstephen hemminger
Jiffies wraps around therefore the correct way to compare is to use cast to signed value. Note: cubic is not using full jiffies value on 64 bit arch because using full unsigned long makes struct bictcp grow too large for the available ca_priv area. Includes correction from Sangtae Ha to improve ack train detection. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-14tcp: fix RTT for quick packets in congestion controlstephen hemminger
In the congestion control interface, the callback for each ACK includes an estimated round trip time in microseconds. Some algorithms need high resolution (Vegas style) but most only need jiffie resolution. If RTT is not accurate (like a retransmission) -1 is used as a flag value. When doing coarse resolution if RTT is less than a a jiffie then 0 should be returned rather than no estimate. Otherwise algorithms that expect good ack's to trigger slow start (like CUBIC Hystart) will be confused. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-13inetpeer: should use call_rcu() variantEric Dumazet
After commit 7b46ac4e77f3224a (inetpeer: Don't disable BH for initial fast RCU lookup.), we should use call_rcu() to wait proper RCU grace period. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-13esp4: Add support for IPsec extended sequence numbersSteffen Klassert
This patch adds IPsec extended sequence numbers support to esp4. We use the authencesn crypto algorithm to handle esp with separate encryption/authentication algorithms. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-13xfrm: Use separate low and high order bits of the sequence numbers in ↵Steffen Klassert
xfrm_skb_cb To support IPsec extended sequence numbers, we split the output sequence numbers of xfrm_skb_cb in low and high order 32 bits and we add the high order 32 bits to the input sequence numbers. All users are updated accordingly. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-13ipv4: Fix PMTU update.Hiroaki SHIMODA
On current net-next-2.6, when Linux receives ICMP Type: 3, Code: 4 (Destination unreachable (Fragmentation needed)), icmp_unreach -> ip_rt_frag_needed (peer->pmtu_expires is set here) -> tcp_v4_err -> do_pmtu_discovery -> ip_rt_update_pmtu (peer->pmtu_expires is already set, so check_peer_pmtu is skipped.) -> check_peer_pmtu check_peer_pmtu is skipped and MTU is not updated. To fix this, let check_peer_pmtu execute unconditionally. And some minor fixes 1) Avoid potential peer->pmtu_expires set to be zero. 2) In check_peer_pmtu, argument of time_before is reversed. 3) check_peer_pmtu expects peer->pmtu_orig is initialized as zero, but not initialized. Signed-off-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-12net: Put fl4_* macros to struct flowi4 and use them again.David S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-12ipv4: Kill fib_semantic_match declaration from fib_lookup.hDavid S. Miller
This function no longer exists. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-12net: Use flowi4 and flowi6 in xfrm layer.David S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-12ipv4: Use flowi4 in UDPDavid S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-12netfilter: Use flowi4 in nf_nat_standalone.cDavid S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-12ipv4: Use flowi4 in ipmr code.David S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-12ipv4: Use flowi4 in FIB layer.David S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-12ipv4: Use flowi4 in public route lookup interfaces.David S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-12ipv4: Use struct flowi4 internally in routing lookups.David S. Miller
We will change the externally visible APIs next. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-12ipv4: Pass ipv4 flow objects into fib_lookup() paths.David S. Miller
To start doing these conversions, we need to add some temporary flow4_* macros which will eventually go away when all the protocol code paths are changed to work on AF specific flowi objects. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-12net: Break struct flowi out into AF specific instances.David S. Miller
Now we have struct flowi4, flowi6, and flowidn for each address family. And struct flowi is just a union of them all. It might have been troublesome to convert flow_cache_uli_match() but as it turns out this function is completely unused and therefore can be simply removed. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-12net: Make flowi ports AF dependent.David S. Miller
Create two sets of port member accessors, one set prefixed by fl4_* and the other prefixed by fl6_* This will let us to create AF optimal flow instances. It will work because every context in which we access the ports, we have to be fully aware of which AF the flowi is anyways. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-12net: Put flowi_* prefix on AF independent members of struct flowiDavid S. Miller
I intend to turn struct flowi into a union of AF specific flowi structs. There will be a common structure that each variant includes first, much like struct sock_common. This is the first step to move in that direction. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-12ipv4: Create and use route lookup helpers.David S. Miller
The idea here is this minimizes the number of places one has to edit in order to make changes to how flows are defined and used. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-10ipv4: Kill flowi arg to fib_select_multipath()David S. Miller
Completely unused. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-10ipv4: Remove unnecessary test from ip_mkroute_input()David S. Miller
fl->oif will always be zero on the input path, so there is no reason to test for that. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-10ipv4: Remove redundant RCU locking in ip_check_mc().David S. Miller
All callers are under rcu_read_lock() protection already. Rename to ip_check_mc_rcu() to make it even more clear. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-10Merge branch 'master' of ↵David S. Miller
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/bnx2x/bnx2x_cmn.c
2011-03-10Merge branch 'master' of /home/davem/src/GIT/linux-2.6/David S. Miller
2011-03-10tcp: mark tcp_congestion_ops read_mostlyStephen Hemminger
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-09ipv4: Optimize flow initialization in fib_validate_source().David S. Miller
Like in commit 44713b67db10c774f14280c129b0d5fd13c70cf2 ("ipv4: Optimize flow initialization in output route lookup." we can optimize the on-stack flow setup to only initialize the members which are actually used. Otherwise we bzero the entire structure, then initialize explicitly the first half of it. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-09ipv4: Optimize flow initialization in input route lookup.David S. Miller
Like in commit 44713b67db10c774f14280c129b0d5fd13c70cf2 ("ipv4: Optimize flow initialization in output route lookup." we can optimize the on-stack flow setup to only initialize the members which are actually used. Otherwise we bzero the entire structure, then initialize explicitly the first half of it. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-10net: don't allow CAP_NET_ADMIN to load non-netdev kernel modulesVasiliy Kulikov
Since a8f80e8ff94ecba629542d9b4b5f5a8ee3eb565c any process with CAP_NET_ADMIN may load any module from /lib/modules/. This doesn't mean that CAP_NET_ADMIN is a superset of CAP_SYS_MODULE as modules are limited to /lib/modules/**. However, CAP_NET_ADMIN capability shouldn't allow anybody load any module not related to networking. This patch restricts an ability of autoloading modules to netdev modules with explicit aliases. This fixes CVE-2011-1019. Arnd Bergmann suggested to leave untouched the old pre-v2.6.32 behavior of loading netdev modules by name (without any prefix) for processes with CAP_SYS_MODULE to maintain the compatibility with network scripts that use autoloading netdev modules by aliases like "eth0", "wlan0". Currently there are only three users of the feature in the upstream kernel: ipip, ip_gre and sit. root@albatros:~# capsh --drop=$(seq -s, 0 11),$(seq -s, 13 34) -- root@albatros:~# grep Cap /proc/$$/status CapInh: 0000000000000000 CapPrm: fffffff800001000 CapEff: fffffff800001000 CapBnd: fffffff800001000 root@albatros:~# modprobe xfs FATAL: Error inserting xfs (/lib/modules/2.6.38-rc6-00001-g2bf4ca3/kernel/fs/xfs/xfs.ko): Operation not permitted root@albatros:~# lsmod | grep xfs root@albatros:~# ifconfig xfs xfs: error fetching interface information: Device not found root@albatros:~# lsmod | grep xfs root@albatros:~# lsmod | grep sit root@albatros:~# ifconfig sit sit: error fetching interface information: Device not found root@albatros:~# lsmod | grep sit root@albatros:~# ifconfig sit0 sit0 Link encap:IPv6-in-IPv4 NOARP MTU:1480 Metric:1 root@albatros:~# lsmod | grep sit sit 10457 0 tunnel4 2957 1 sit For CAP_SYS_MODULE module loading is still relaxed: root@albatros:~# grep Cap /proc/$$/status CapInh: 0000000000000000 CapPrm: ffffffffffffffff CapEff: ffffffffffffffff CapBnd: ffffffffffffffff root@albatros:~# ifconfig xfs xfs: error fetching interface information: Device not found root@albatros:~# lsmod | grep xfs xfs 745319 0 Reference: https://lkml.org/lkml/2011/2/24/203 Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Signed-off-by: Michael Tokarev <mjt@tls.msk.ru> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Kees Cook <kees.cook@canonical.com> Signed-off-by: James Morris <jmorris@namei.org>
2011-03-09tcp: ioctl type SIOCOUTQNSD returns amount of data not sentMario Schuknecht
In contrast to SIOCOUTQ which returns the amount of data sent but not yet acknowledged plus data not yet sent this patch only returns the data not sent. For various methods of live streaming bitrate control it may be helpful to know how much data are in the tcp outqueue are not sent yet. Signed-off-by: Mario Schuknecht <m.schuknecht@dresearch.de> Signed-off-by: Steffen Sledz <sledz@dresearch.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-09ipv4: Lookup multicast routes by rtable using helper.David S. Miller
Create a common helper for this operation, since we do it identically in three spots. Suggested by Eric Dumazet. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-09ipv4: Fix erroneous uses of ifa_address.David S. Miller
In usual cases ifa_address == ifa_local, but in the case where SIOCSIFDSTADDR sets the destination address on a point-to-point link, ifa_address gets set to that destination address. Therefore we should use ifa_local when we want the local interface address. There were two cases where the selection was done incorrectly: 1) When devinet_ioctl() does matching, it checks ifa_address even though gifconf correct reported ifa_local to the user 2) IN_DEV_ARP_NOTIFY handling sends a gratuitous ARP using ifa_address instead of ifa_local. Reported-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-08inetpeer: Don't disable BH for initial fast RCU lookup.David S. Miller
If modifications on other cpus are ok, then modifications to the tree during lookup done by the local cpu are ok too. Signed-off-by: David S. Miller <davem@davemloft.net>