<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/net/openvswitch/flow_table.c, branch v4.16-rc4</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>openvswitch: Optimize operations for OvS flow_stats.</title>
<updated>2017-07-19T20:49:39+00:00</updated>
<author>
<name>Tonghao Zhang</name>
<email>xiangxia.m.yue@gmail.com</email>
</author>
<published>2017-07-18T06:28:06+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=c4b2bf6b4a35348fe6d1eb06928eb68d7b9d99a9'/>
<id>c4b2bf6b4a35348fe6d1eb06928eb68d7b9d99a9</id>
<content type='text'>
When flow_free() is called to free a flow, cpumask_next() is invoked
once per possible CPU (cpu_possible_mask, e.g. 128 by default), which
costs noticeable CPU time when flows are freed frequently. That is the
case when all packets are punted to userspace via upcall and OvS sends
them back over netlink to ovs_packet_cmd_execute(), which calls
flow_free().

The test topology is shown below: VM01 sends TCP packets to VM02, and
OvS forwards the packets. During the test, perf was used to profile
the system.

VM01 --- OvS-VM --- VM02

Without this patch, perf-top reports flow_free() at 3.02% CPU usage:

	4.23%  [kernel]            [k] _raw_spin_unlock_irqrestore
	3.62%  [kernel]            [k] __do_softirq
	3.16%  [kernel]            [k] __memcpy
	3.02%  [kernel]            [k] flow_free
	2.42%  libc-2.17.so        [.] __memcpy_ssse3_back
	2.18%  [kernel]            [k] copy_user_generic_unrolled
	2.17%  [kernel]            [k] find_next_bit

With this patch applied, flow_free() no longer appears in the perf-top
output:

	4.11%  [kernel]            [k] _raw_spin_unlock_irqrestore
	3.79%  [kernel]            [k] __do_softirq
	3.46%  [kernel]            [k] __memcpy
	2.73%  libc-2.17.so        [.] __memcpy_ssse3_back
	2.25%  [kernel]            [k] copy_user_generic_unrolled
	1.89%  libc-2.17.so        [.] _int_malloc
	1.53%  ovs-vswitchd        [.] xlate_actions

With this patch, the TCP throughput between the VMs (without the
Megaflow Cache + Microflow Cache) goes from 1.18 Gb/sec up to
1.30 Gb/sec, roughly a 10% improvement.

This patch adds a cpumask, cpu_used_mask, which records the CPUs the
flow has been used on. Only the flow_stats of those CPUs are then
checked; it is unnecessary to scan every possible CPU when getting,
clearing, and updating the flow_stats. Adding cpu_used_mask to
struct sw_flow does not increase its cacheline count.
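
A hedged sketch of the bookkeeping, using illustrative names (only
cpumask_set_cpu(), for_each_cpu(), and the cpu_used_mask field itself
are taken from the patch; the surrounding structure is made up):

	/* Illustrative flow that records which CPUs touched its stats. */
	struct example_flow {
		struct cpumask cpu_used_mask;	     /* CPUs with live stats */
		struct example_stats __rcu *stats[]; /* slot per possible CPU */
	};

	static void example_stats_update(struct example_flow *flow)
	{
		/* Remember the executing CPU so later walks can skip the rest. */
		cpumask_set_cpu(smp_processor_id(), &amp;flow-&gt;cpu_used_mask);
		/* ... update packet/byte counters in this CPU's slot ... */
	}

	static void example_stats_clear(struct example_flow *flow)
	{
		int cpu;

		/* Before: for_each_possible_cpu() walked every CPU (e.g. 128).
		 * After: only CPUs the flow actually ran on are visited.
		 */
		for_each_cpu(cpu, &amp;flow-&gt;cpu_used_mask) {
			/* ... lock and zero flow-&gt;stats[cpu] ... */
		}
	}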

Signed-off-by: Tonghao Zhang &lt;xiangxia.m.yue@gmail.com&gt;
Acked-by: Pravin B Shelar &lt;pshelar@ovn.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>openvswitch: use percpu flow stats</title>
<updated>2016-09-19T02:14:01+00:00</updated>
<author>
<name>Thadeu Lima de Souza Cascardo</name>
<email>cascardo@redhat.com</email>
</author>
<published>2016-09-15T22:11:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=db74a3335e0f645e3139c80bcfc90feb01d8e304'/>
<id>db74a3335e0f645e3139c80bcfc90feb01d8e304</id>
<content type='text'>
Instead of keeping flow stats per NUMA node, keep them per CPU. When
using megaflows, the stats lock can be a scalability bottleneck.

On an E5-2690 12-core system, typical throughput went from ~4Mpps to
~15Mpps when forwarding between two 40GbE ports with a single flow
configured on the datapath.

This has been tested on a system with possible CPUs 0-7,16-23. After
module removal, there was no corruption in the slab cache.
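
A hedged sketch of the per-CPU fast path this enables; the field and
variable names below are illustrative, not the patch's exact code:

	/* Each CPU updates its own slot, so the stats spinlock is
	 * effectively uncontended on the packet path.
	 */
	struct flow_stats *stats;

	stats = rcu_dereference(flow-&gt;stats[smp_processor_id()]);
	spin_lock(&amp;stats-&gt;lock);
	stats-&gt;packet_count++;
	stats-&gt;byte_count += skb-&gt;len;
	spin_unlock(&amp;stats-&gt;lock);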

Signed-off-by: Thadeu Lima de Souza Cascardo &lt;cascardo@redhat.com&gt;
Cc: pravin shelar &lt;pshelar@ovn.org&gt;
Acked-by: Pravin B Shelar &lt;pshelar@ovn.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>openvswitch: fix flow stats accounting when node 0 is not possible</title>
<updated>2016-09-19T02:14:01+00:00</updated>
<author>
<name>Thadeu Lima de Souza Cascardo</name>
<email>cascardo@redhat.com</email>
</author>
<published>2016-09-15T22:11:52+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=40773966ccf1985a1b2bb570a03cbeaf1cbd4e00'/>
<id>40773966ccf1985a1b2bb570a03cbeaf1cbd4e00</id>
<content type='text'>
On a system where only node 1 is possible, all statistics are
accounted on node 0, as the flow then has a single writer.

However, when getting and clearing the statistics, node 0 is not going
to be considered, as it's not a possible node.
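
The shape of the fix, as an illustrative sketch only (the exact loop
in the patch may differ):

	int node;

	/* Always read slot 0, where a single-writer flow accounts,
	 * even when node 0 is not a possible node.
	 */
	/* ... accumulate flow-&gt;stats[0] ... */
	for_each_node(node) {
		if (node == 0)
			continue;	/* handled above */
		/* ... accumulate flow-&gt;stats[node] ... */
	}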

Tested that statistics are not zero on a system with only node 1
possible. Also compile-tested with CONFIG_NUMA off.

Signed-off-by: Thadeu Lima de Souza Cascardo &lt;cascardo@redhat.com&gt;
Acked-by: Pravin B Shelar &lt;pshelar@ovn.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net</title>
<updated>2015-10-20T13:08:27+00:00</updated>
<author>
<name>David S. Miller</name>
<email>davem@davemloft.net</email>
</author>
<published>2015-10-20T13:08:27+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=26440c835f8b1a491e2704118ac55bf87334366c'/>
<id>26440c835f8b1a491e2704118ac55bf87334366c</id>
<content type='text'>
Conflicts:
	drivers/net/usb/asix_common.c
	net/ipv4/inet_connection_sock.c
	net/switchdev/switchdev.c

In the inet_connection_sock.c case the request socket hashing scheme
is completely different in net-next.

The other two conflicts were overlapping changes.

Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>openvswitch: add tunnel protocol to sw_flow_key</title>
<updated>2015-10-07T11:17:59+00:00</updated>
<author>
<name>Jiri Benc</name>
<email>jbenc@redhat.com</email>
</author>
<published>2015-10-05T11:09:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=00a93babd06aaad31d23384cda576ede0f586a8c'/>
<id>00a93babd06aaad31d23384cda576ede0f586a8c</id>
<content type='text'>
Store the tunnel protocol (AF_INET or AF_INET6) in sw_flow_key. This
field now also acts as an indicator of whether the flow contains
tunnel data: previously that was indicated by tun_key.u.ipv4.dst being
set, but with the IPv6 addresses in a union with the IPv4 ones this no
longer works.

The new field was added to a hole in sw_flow_key.
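
A hedged sketch of the resulting check; the helper name is
hypothetical:

	/* Test the protocol field rather than an address that becomes
	 * ambiguous once IPv4 and IPv6 share a union.
	 */
	static bool example_flow_has_tunnel(const struct sw_flow_key *key)
	{
		return key-&gt;tun_proto != 0;	/* AF_INET or AF_INET6 if set */
	}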

Signed-off-by: Jiri Benc &lt;jbenc@redhat.com&gt;
Acked-by: Pravin B Shelar &lt;pshelar@nicira.com&gt;
Acked-by: Thomas Graf &lt;tgraf@suug.ch&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>ovs: do not allocate memory from offline numa node</title>
<updated>2015-10-05T13:42:03+00:00</updated>
<author>
<name>Konstantin Khlebnikov</name>
<email>khlebnikov@yandex-team.ru</email>
</author>
<published>2015-10-02T10:18:22+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=598c12d0ba6de9060f04999746eb1e015774044b'/>
<id>598c12d0ba6de9060f04999746eb1e015774044b</id>
<content type='text'>
When openvswitch tries to allocate memory from offline NUMA node 0:
stats = kmem_cache_alloc_node(flow_stats_cache, GFP_KERNEL | __GFP_ZERO, 0)
it trips VM_BUG_ON(nid &lt; 0 || nid &gt;= MAX_NUMNODES || !node_online(nid))
[ replaced with VM_WARN_ON(!node_online(nid)) recently ] in linux/gfp.h.
This patch disables NUMA affinity in that case.
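
A minimal sketch of the idea (the patch's actual expression may be
written inline):

	int node = node_online(0) ? 0 : NUMA_NO_NODE;

	stats = kmem_cache_alloc_node(flow_stats_cache,
				      GFP_KERNEL | __GFP_ZERO, node);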

Signed-off-by: Konstantin Khlebnikov &lt;khlebnikov@yandex-team.ru&gt;
Acked-by: Pravin B Shelar &lt;pshelar@nicira.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>openvswitch: Zero flows on allocation.</title>
<updated>2015-09-23T00:33:41+00:00</updated>
<author>
<name>Jesse Gross</name>
<email>jesse@nicira.com</email>
</author>
<published>2015-09-22T03:21:20+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=ae5f2fb1d51fa128a460bcfbe3c56d7ab8bf6a43'/>
<id>ae5f2fb1d51fa128a460bcfbe3c56d7ab8bf6a43</id>
<content type='text'>
When support for megaflows was introduced, OVS needed to start
installing flows with a mask applied to them. Since masking is an
expensive operation, OVS also had an optimization that would only
take the parts of the flow keys that were covered by a non-zero
mask. The values stored in the remaining pieces should not matter
because they are masked out.

While this works fine for the purposes of matching (which must always
look at the mask), serialization to netlink can be problematic. Since
the flow and the mask are serialized separately, the uninitialized
portions of the flow can be encoded with whatever values happen to be
present.

In terms of functionality, this has little effect since these fields
will be masked out by definition. However, it leaks kernel memory to
userspace, which is a potential security vulnerability. It is also
possible that other code paths could look at the masked key and get
uninitialized data, although this does not currently appear to be an
issue in practice.

This removes the mask optimization for flows that are being installed.
This was always intended to be the case, as the mask optimizations
were really targeting per-packet flow operations.
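
A minimal sketch of the distinction, with illustrative names and a
word-sized key for brevity:

	/* Mask the full key when installing a flow so no uninitialized
	 * bytes survive to be serialized to userspace; keep the
	 * range-limited walk for the per-packet fast path.
	 */
	static void example_mask_key(u64 *dst, const u64 *src,
				     const u64 *mask, size_t key_words,
				     size_t range_start, size_t range_end,
				     bool full)
	{
		size_t i = full ? 0 : range_start;
		size_t end = full ? key_words : range_end;

		for (; i &lt; end; i++)
			dst[i] = src[i] &amp; mask[i];
	}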

Fixes: 03f0d916 ("openvswitch: Mega flow implementation")
Signed-off-by: Jesse Gross &lt;jesse@nicira.com&gt;
Acked-by: Pravin B Shelar &lt;pshelar@nicira.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>ip_tunnels: add IPv6 addresses to ip_tunnel_key</title>
<updated>2015-08-20T22:42:36+00:00</updated>
<author>
<name>Jiri Benc</name>
<email>jbenc@redhat.com</email>
</author>
<published>2015-08-20T11:56:23+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=c1ea5d672aaff08da337dee735dbb548e3415585'/>
<id>c1ea5d672aaff08da337dee735dbb548e3415585</id>
<content type='text'>
Add the IPv6 addresses as a union with the IPv4 ones. When using IPv4,
the newly introduced padding after the IPv4 addresses needs to be
zeroed out.
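
Roughly the layout involved, as an illustrative sketch (see
ip_tunnels.h for the real struct):

	struct example_tunnel_key {
		union {
			struct {
				__be32 src;
				__be32 dst;
			} ipv4;		/* shorter than the IPv6 arm */
			struct {
				struct in6_addr src;
				struct in6_addr dst;
			} ipv6;
		} u;
	};

	/* When using IPv4, zero the whole union first so the bytes
	 * beyond the IPv4 addresses carry no stale data.
	 */
	memset(&amp;key.u, 0, sizeof(key.u));
	key.u.ipv4.src = saddr;
	key.u.ipv4.dst = daddr;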

Signed-off-by: Jiri Benc &lt;jbenc@redhat.com&gt;
Acked-by: Thomas Graf &lt;tgraf@suug.ch&gt;
Acked-by: Alexei Starovoitov &lt;ast@plumgrid.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net</title>
<updated>2015-07-23T07:41:16+00:00</updated>
<author>
<name>David S. Miller</name>
<email>davem@davemloft.net</email>
</author>
<published>2015-07-23T07:41:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=c5e40ee287db61a79af1746954ee03ebbf1ff8a3'/>
<id>c5e40ee287db61a79af1746954ee03ebbf1ff8a3</id>
<content type='text'>
Conflicts:
	net/bridge/br_mdb.c

The br_mdb.c conflict was a function call that was removed to fix a
bug in 'net' but whose signature was changed in 'net-next'.

Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>openvswitch: allocate nr_node_ids flow_stats instead of num_possible_nodes</title>
<updated>2015-07-22T05:26:03+00:00</updated>
<author>
<name>Chris J Arges</name>
<email>chris.j.arges@canonical.com</email>
</author>
<published>2015-07-21T17:36:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=bac541e46319c2dc6ff9fd5ad6df59cd38686bdb'/>
<id>bac541e46319c2dc6ff9fd5ad6df59cd38686bdb</id>
<content type='text'>
Some architectures, such as POWER, can have a NUMA node_possible_map
that contains sparse entries. This causes memory corruption in
openvswitch, since it allocates flow_cache with a multiple of
num_possible_nodes() and assumes the node variable returned by
for_each_node will index into flow-&gt;stats[node].

Use nr_node_ids to allocate a maximal sparse array instead of
num_possible_nodes().
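
A hedged sketch of the sizing change (close to, though not guaranteed
to match, the patched line):

	/* nr_node_ids covers the highest possible node id, so sparse
	 * node maps (e.g. on POWER) index flow-&gt;stats[node] safely.
	 */
	flow_cache = kmem_cache_create("sw_flow",
				       sizeof(struct sw_flow) +
				       nr_node_ids * sizeof(struct flow_stats *),
				       0, 0, NULL);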

The crash was noticed after commit 3af229f2 was applied, as it changed
node_possible_map to match node_online_map at boot.
Fixes: 3af229f2071f5b5cb31664be6109561fbe19c861

Signed-off-by: Chris J Arges &lt;chris.j.arges@canonical.com&gt;
Acked-by: Pravin B Shelar &lt;pshelar@nicira.com&gt;
Acked-by: Nishanth Aravamudan &lt;nacc@linux.vnet.ibm.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
</feed>
