linux-toradex.git/net/netfilter, branch v4.9.14

netfilter: conntrack: refine gc worker heuristics, redux

2017-03-12T05:41:53+00:00

commit e5072053b09642b8ff417d47da05b84720aea3ee upstream.

This further refines the changes made to conntrack gc_worker in
commit e0df8cae6c16 ("netfilter: conntrack: refine gc worker heuristics").

The main idea of that change was to reduce the scan interval when evictions
take place.

However, on the reporters' setup, there are 1-2 million conntrack entries
in total and roughly 8k new (and closing) connections per second.

In this case we'll always evict at least one entry per gc cycle and scan
interval is always at 1 jiffy because of this test:

 } else if (expired_count) {
     gc_work->next_gc_run /= 2U;
     next_run = msecs_to_jiffies(1);

being true almost all the time.

Given we scan ~10k entries per run its clearly wrong to reduce interval
based on nonzero eviction count, it will only waste cpu cycles since a vast
majorities of conntracks are not timed out.

Thus only look at the ratio (scanned entries vs. evicted entries) to make
a decision on whether to reduce or not.

Because evictor is supposed to only kick in when system turns idle after
a busy period, pick a high ratio -- this makes it 50%.  We thus keep
the idea of increasing scan rate when its likely that table contains many
expired entries.

In order to not let timed-out entries hang around for too long
(important when using event logging, in which case we want to timely
destroy events), we now scan the full table within at most
GC_MAX_SCAN_JIFFIES (16 seconds) even in worst-case scenario where all
timed-out entries sit in same slot.

I tested this with a vm under synflood (with
sysctl net.netfilter.nf_conntrack_tcp_timeout_syn_recv=3).

While flood is ongoing, interval now stays at its max rate
(GC_MAX_SCAN_JIFFIES / GC_MAX_BUCKETS_DIV -> 125ms).

With feedback from Nicolas Dichtel.

Reported-by: Denys Fedoryshchenko 
Cc: Nicolas Dichtel 
Fixes: b87a2f9199ea82eaadc ("netfilter: conntrack: add gc worker to remove timed-out entries")
Signed-off-by: Florian Westphal 
Tested-by: Nicolas Dichtel 
Acked-by: Nicolas Dichtel 
Tested-by: Denys Fedoryshchenko 
Signed-off-by: Pablo Neira Ayuso 
Signed-off-by: Greg Kroah-Hartman

netfilter: conntrack: remove GC_MAX_EVICTS break

2017-03-12T05:41:53+00:00

commit 524b698db06b9b6da7192e749f637904e2f62d7b upstream.

Instead of breaking loop and instant resched, don't bother checking
this in first place (the loop calls cond_resched for every bucket anyway).

Suggested-by: Nicolas Dichtel 
Signed-off-by: Florian Westphal 
Acked-by: Nicolas Dichtel 
Signed-off-by: Pablo Neira Ayuso 
Signed-off-by: Greg Kroah-Hartman

netfilter: nf_ct_helper: warn when not applying default helper assignment

2017-02-26T10:10:52+00:00

commit dfe75ff8ca74f54b0fa5a326a1aa9afa485ed802 upstream.

Commit 3bb398d925 ("netfilter: nf_ct_helper: disable automatic helper
assignment") is causing behavior regressions in firewalls, as traffic
handled by conntrack helpers is now by default not passed through even
though it was before due to missing CT targets (which were not necessary
before this commit).

The default had to be switched off due to security reasons [1] [2] and
therefore should stay the way it is, but let's be friendly to firewall
admins and issue a warning the first time we're in situation where packet
would be likely passed through with the old default but we're likely going
to drop it on the floor now.

Rewrite the code a little bit as suggested by Linus, so that we avoid
spaghettiing the code even more -- namely the whole decision making
process regarding helper selection (either automatic or not) is being
separated, so that the whole logic can be simplified and code (condition)
duplication reduced.

[1] https://cansecwest.com/csw12/conntrack-attack.pdf
[2] https://home.regit.org/netfilter-en/secure-use-of-helpers/

Signed-off-by: Jiri Kosina 
Signed-off-by: Pablo Neira Ayuso 
Signed-off-by: Greg Kroah-Hartman

netfilter: nft_range: add the missing NULL pointer check

2016-11-24T13:43:35+00:00

Otherwise, kernel panic will happen if the user does not specify
the related attributes.

Fixes: 0f3cd9b36977 ("netfilter: nf_tables: add range expression")
Signed-off-by: Liping Zhang 
Signed-off-by: Pablo Neira Ayuso

netfilter: nf_tables: fix inconsistent element expiration calculation

2016-11-24T13:43:34+00:00

As Liping Zhang reports, after commit a8b1e36d0d1d ("netfilter: nft_dynset:
fix element timeout for HZ != 1000"), priv->timeout was stored in jiffies,
while set->timeout was stored in milliseconds. This is inconsistent and
incorrect.

Firstly, we already call msecs_to_jiffies in nft_set_elem_init, so
priv->timeout will be converted to jiffies twice.

Secondly, if the user did not specify the NFTA_DYNSET_TIMEOUT attr,
set->timeout will be used, but we forget to call msecs_to_jiffies
when do update elements.

Fix this by using jiffies internally for traditional sets and doing the
conversions to/from msec when interacting with userspace - as dynset
already does.

This is preferable to doing the conversions, when elements are inserted or
updated, because this can happen very frequently on busy dynsets.

Fixes: a8b1e36d0d1d ("netfilter: nft_dynset: fix element timeout for HZ != 1000")
Reported-by: Liping Zhang 
Signed-off-by: Anders K. Pedersen 
Acked-by: Liping Zhang 
Signed-off-by: Pablo Neira Ayuso

netfilter: nat: switch to new rhlist interface

2016-11-24T13:43:34+00:00

I got offlist bug report about failing connections and high cpu usage.
This happens because we hit 'elasticity' checks in rhashtable that
refuses bucket list exceeding 16 entries.

The nat bysrc hash unfortunately needs to insert distinct objects that
share same key and are identical (have same source tuple), this cannot
be avoided.

Switch to the rhlist interface which is designed for this.

The nulls_base is removed here, I don't think its needed:

A (unlikely) false positive results in unneeded port clash resolution,
a false negative results in packet drop during conntrack confirmation,
when we try to insert the duplicate into main conntrack hash table.

Tested by adding multiple ip addresses to host, then adding
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE

... and then creating multiple connections, from same source port but
different addresses:

for i in $(seq 2000 2032);do nc -p 1234 192.168.7.1 $i > /dev/null  & done

(all of these then get hashed to same bysource slot)

Then, to test that nat conflict resultion is working:

nc -s 10.0.0.1 -p 1234 192.168.7.1 2000
nc -s 10.0.0.2 -p 1234 192.168.7.1 2000

tcp  .. src=10.0.0.1 dst=192.168.7.1 sport=1234 dport=2000 src=192.168.7.1 dst=192.168.7.10 sport=2000 dport=1024 [ASSURED]
tcp  .. src=10.0.0.2 dst=192.168.7.1 sport=1234 dport=2000 src=192.168.7.1 dst=192.168.7.10 sport=2000 dport=1025 [ASSURED]
tcp  .. src=192.168.7.10 dst=192.168.7.1 sport=1234 dport=2000 src=192.168.7.1 dst=192.168.7.10 sport=2000 dport=1234 [ASSURED]
tcp  .. src=192.168.7.10 dst=192.168.7.1 sport=1234 dport=2001 src=192.168.7.1 dst=192.168.7.10 sport=2001 dport=1234 [ASSURED]
[..]

-> nat altered source ports to 1024 and 1025, respectively.
This can also be confirmed on destination host which shows
ESTAB      0      0   192.168.7.1:2000      192.168.7.10:1024
ESTAB      0      0   192.168.7.1:2000      192.168.7.10:1025
ESTAB      0      0   192.168.7.1:2000      192.168.7.10:1234

Cc: Herbert Xu 
Fixes: 870190a9ec907 ("netfilter: nat: convert nat bysrc hash to rhashtable")
Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso

netfilter: nat: fix cmp return value

2016-11-24T13:43:33+00:00

The comparator works like memcmp, i.e. 0 means objects are equal.
In other words, when objects are distinct they are treated as identical,
when they are distinct they are allegedly the same.

The first case is rare (distinct objects are unlikely to get hashed to
same bucket).

The second case results in unneeded port conflict resolutions attempts.

Fixes: 870190a9ec907 ("netfilter: nat: convert nat bysrc hash to rhashtable")
Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso

netfilter: nft_hash: validate maximum value of u32 netlink hash attribute

2016-11-24T13:40:03+00:00

Use the function nft_parse_u32_check() to fetch the value and validate
the u32 attribute into the hash len u8 field.

This patch revisits 4da449ae1df9 ("netfilter: nft_exthdr: Add size check
on u8 nft_exthdr attributes").

Fixes: cb1b69b0b15b ("netfilter: nf_tables: add hash expression")
Signed-off-by: Laura Garcia Liebana 
Signed-off-by: Pablo Neira Ayuso

netfilter: nf_tables: fix oops when inserting an element into a verdict map

2016-11-08T22:53:39+00:00

Dalegaard says:
 The following ruleset, when loaded with 'nft -f bad.txt'
 ----snip----
 flush ruleset
 table ip inlinenat {
   map sourcemap {
     type ipv4_addr : verdict;
   }

   chain postrouting {
     ip saddr vmap @sourcemap accept
   }
 }
 add chain inlinenat test
 add element inlinenat sourcemap { 100.123.10.2 : jump test }
 ----snip----

 results in a kernel oops:
 BUG: unable to handle kernel paging request at 0000000000001344
 IP: [] nf_tables_check_loops+0x114/0x1f0 [nf_tables]
 [...]
 Call Trace:
  [] ? nft_data_init+0x13e/0x1a0 [nf_tables]
  [] nft_validate_register_store+0x60/0xb0 [nf_tables]
  [] nft_add_set_elem+0x545/0x5e0 [nf_tables]
  [] ? nft_table_lookup+0x30/0x60 [nf_tables]
  [] ? nla_strcmp+0x40/0x50
  [] nf_tables_newsetelem+0x11e/0x210 [nf_tables]
  [] ? nla_validate+0x60/0x80
  [] nfnetlink_rcv+0x354/0x5a7 [nfnetlink]

Because we forget to fill the net pointer in bind_ctx, so dereferencing
it may cause kernel crash.

Reported-by: Dalegaard 
Signed-off-by: Liping Zhang 
Signed-off-by: Pablo Neira Ayuso

netfilter: conntrack: refine gc worker heuristics

2016-11-08T22:53:38+00:00

Nicolas Dichtel says:
  After commit b87a2f9199ea ("netfilter: conntrack: add gc worker to
  remove timed-out entries"), netlink conntrack deletion events may be
  sent with a huge delay.

Nicolas further points at this line:

  goal = min(nf_conntrack_htable_size / GC_MAX_BUCKETS_DIV, GC_MAX_BUCKETS);

and indeed, this isn't optimal at all.  Rationale here was to ensure that
we don't block other work items for too long, even if
nf_conntrack_htable_size is huge.  But in order to have some guarantee
about maximum time period where a scan of the full conntrack table
completes we should always use a fixed slice size, so that once every
N scans the full table has been examined at least once.

We also need to balance this vs. the case where the system is either idle
(i.e., conntrack table (almost) empty) or very busy (i.e. eviction happens
from packet path).

So, after some discussion with Nicolas:

1. want hard guarantee that we scan entire table at least once every X s
-> need to scan fraction of table (get rid of upper bound)

2. don't want to eat cycles on idle or very busy system
-> increase interval if we did not evict any entries

3. don't want to block other worker items for too long
-> make fraction really small, and prefer small scan interval instead

4. Want reasonable short time where we detect timed-out entry when
system went idle after a burst of traffic, while not doing scans
all the time.
-> Store next gc scan in worker, increasing delays when no eviction
happened and shrinking delay when we see timed out entries.

The old gc interval is turned into a max number, scans can now happen
every jiffy if stale entries are present.

Longest possible time period until an entry is evicted is now 2 minutes
in worst case (entry expires right after it was deemed 'not expired').

Reported-by: Nicolas Dichtel 
Signed-off-by: Florian Westphal 
Acked-by: Nicolas Dichtel 
Signed-off-by: Pablo Neira Ayuso