<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/include/linux/kvm_host.h, branch v5.1-rc3</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>KVM: Expose the initial start value in grow_halt_poll_ns() as a module parameter</title>
<updated>2019-02-20T21:48:50+00:00</updated>
<author>
<name>Nir Weiner</name>
<email>nir.weiner@oracle.com</email>
</author>
<published>2019-01-27T10:17:15+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=49113d360bdeb4dd916fb6bffbcc3e157422b6fd'/>
<id>49113d360bdeb4dd916fb6bffbcc3e157422b6fd</id>
<content type='text'>
The hard-coded value 10000 in grow_halt_poll_ns() is the initial start
value used when raising vcpu-&gt;halt_poll_ns; in effect, it sets the
timeout for the first polling session.
This value has a significant effect on how tolerant we are to outliers.
In the common case, a higher value is better: we spend more time in the
polling busy loop, handle events/interrupts faster, and get better
performance.
But on outliers it puts us in a busy loop that does nothing.
Even if the shrink factor is zero, we still waste time on the first
iteration.
The optimal value varies between workloads; it depends on the outlier
rate and the length of polling sessions.
As this value has a significant effect on the dynamic halt-polling
algorithm, it should be configurable and exposed.
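
A minimal sketch of how such a parameter could be wired up, assuming it
is named halt_poll_ns_grow_start (the identifier follows the description
above and is not verified against the patch):

	/* the start value to grow halt_poll_ns from (assumed name) */
	unsigned int halt_poll_ns_grow_start = 10000; /* 10 us */
	module_param(halt_poll_ns_grow_start, uint, 0644);

	static void grow_halt_poll_ns(struct kvm_vcpu *vcpu)
	{
		unsigned int val = vcpu-&gt;halt_poll_ns;
		unsigned int grow_start = READ_ONCE(halt_poll_ns_grow_start);

		/* halt_poll_ns_grow is the existing grow factor */
		val *= halt_poll_ns_grow;
		if (val &lt; grow_start)
			val = grow_start;
		vcpu-&gt;halt_poll_ns = val;
	}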

Reviewed-by: Boris Ostrovsky &lt;boris.ostrovsky@oracle.com&gt;
Reviewed-by: Liran Alon &lt;liran.alon@oracle.com&gt;
Signed-off-by: Nir Weiner &lt;nir.weiner@oracle.com&gt;
Signed-off-by: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
</content>
</entry>
<entry>
<title>KVM: Move the memslot update in-progress flag to bit 63</title>
<updated>2019-02-20T21:48:37+00:00</updated>
<author>
<name>Sean Christopherson</name>
<email>sean.j.christopherson@intel.com</email>
</author>
<published>2019-02-05T21:01:18+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=164bf7e56c5a73f2f819c39ba7e0f20e0f97dc7b'/>
<id>164bf7e56c5a73f2f819c39ba7e0f20e0f97dc7b</id>
<content type='text'>
...now that KVM won't explode by moving it out of bit 0.  Using bit 63
eliminates the need to jump over bit 0, e.g. when calculating a new
memslots generation or when propagating the memslots generation to an
MMIO spte.
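
As a sketch, the move amounts to redefining the flag at the top bit
(the identifier follows the previous patch's description and is an
assumption):

	/* was bit 0; bit 63 keeps the generation's counter bits contiguous */
	#define KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS	BIT_ULL(63)

	/* a new generation can now be computed without jumping over bit 0 */
	u64 gen = slots-&gt;generation &amp; ~KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS;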

Signed-off-by: Sean Christopherson &lt;sean.j.christopherson@intel.com&gt;
Signed-off-by: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
</content>
</entry>
<entry>
<title>KVM: Explicitly define the "memslot update in-progress" bit</title>
<updated>2019-02-20T21:48:34+00:00</updated>
<author>
<name>Sean Christopherson</name>
<email>sean.j.christopherson@intel.com</email>
</author>
<published>2019-02-05T21:01:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=361209e054a2c9f34da090ee1ee4c1e8bfe76a64'/>
<id>361209e054a2c9f34da090ee1ee4c1e8bfe76a64</id>
<content type='text'>
KVM uses bit 0 of the memslots generation as an "update in-progress"
flag, which is used by x86 to prevent caching MMIO access while the
memslots are changing.  Although the intended behavior is flag-like,
e.g. MMIO sptes intentionally drop the in-progress bit so as to avoid
caching data from in-flux memslots, the implementation oftentimes treats
the bit as part of the generation number itself, e.g. incrementing the
generation increments twice, once to set the flag and once to clear it.

Prior to commit 4bd518f1598d ("KVM: use separate generations for
each address space"), incorporating the "update in-progress" bit into
the generation number largely made sense, e.g. "real" generations are
even, "bogus" generations are odd, most code doesn't need to be aware of
the bit, etc...

Now that unique memslots generation numbers are assigned to each address
space, hiding the in-progress status inside the generation number
results in a wide variety of subtle code, e.g. kvm_create_vm() jumps
over bit 0 when initializing the memslots generation without any hint as
to why.

Explicitly define the flag and convert as much code as possible (which
isn't much) to actually treat it like a flag.  This paves the way for
eventually using a different bit for "update in-progress" so that it can
be a flag in truth instead of an awkward extension to the generation
number.
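
A sketch of what the explicit definition enables (the identifier is an
assumption based on this description):

	/* bit 0 of the memslots generation, treated as a flag */
	#define KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS	BIT_ULL(0)

	/* set while the memslots are changing ... */
	slots-&gt;generation = gen | KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS;

	/* ... and masked out of MMIO sptes so they never match an
	 * in-flux generation */
	gen &amp;= ~KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS;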

Signed-off-by: Sean Christopherson &lt;sean.j.christopherson@intel.com&gt;
Signed-off-by: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
</content>
</entry>
<entry>
<title>KVM: Call kvm_arch_memslots_updated() before updating memslots</title>
<updated>2019-02-20T21:48:32+00:00</updated>
<author>
<name>Sean Christopherson</name>
<email>sean.j.christopherson@intel.com</email>
</author>
<published>2019-02-05T20:54:17+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=152482580a1b0accb60676063a1ac57b2d12daf6'/>
<id>152482580a1b0accb60676063a1ac57b2d12daf6</id>
<content type='text'>
kvm_arch_memslots_updated() is currently an x86-specific hook for
handling MMIO generation wraparound.  x86 stashes 19 bits of
the memslots generation number in its MMIO sptes in order to avoid
full page fault walks for repeat faults on emulated MMIO addresses.
Because only 19 bits are used, wrapping the MMIO generation number is
possible, if unlikely.  kvm_arch_memslots_updated() alerts x86 that
the generation has changed so that it can invalidate all MMIO sptes in
case the effective MMIO generation has wrapped so as to avoid using a
stale spte, e.g. a (very) old spte that was created with generation==0.

Given that the purpose of kvm_arch_memslots_updated() is to prevent
consuming stale entries, it needs to be called before the new generation
is propagated to memslots.  Invalidating the MMIO sptes after updating
memslots means that there is a window where a vCPU could dereference
the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO
spte that was created with (pre-wrap) generation==0.
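
In other words, the update sequence should look roughly like the
following simplified sketch of install_new_memslots() (details such as
the in-progress flag are elided):

	/* compute the generation for the new memslots */
	gen = old_memslots-&gt;generation + 1;

	/* let the arch invalidate stale MMIO sptes first ... */
	kvm_arch_memslots_updated(kvm, gen);

	/* ... and only then publish the new generation, so a vCPU can
	 * never pair a stale spte with the new generation */
	slots-&gt;generation = gen;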

Fixes: e59dbe09f8e6 ("KVM: Introduce kvm_arch_memslots_updated()")
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Sean Christopherson &lt;sean.j.christopherson@intel.com&gt;
Signed-off-by: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
</content>
</entry>
<entry>
<title>kvm: Change offset in kvm_write_guest_offset_cached to unsigned</title>
<updated>2018-12-21T10:28:22+00:00</updated>
<author>
<name>Jim Mattson</name>
<email>jmattson@google.com</email>
</author>
<published>2018-12-14T22:34:43+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=7a86dab8cf2f0fdf508f3555dddfc236623bff60'/>
<id>7a86dab8cf2f0fdf508f3555dddfc236623bff60</id>
<content type='text'>
Since the offset is added directly to the hva from the
gfn_to_hva_cache, a negative offset could result in an out-of-bounds
write. The existing BUG_ON only checks for addresses beyond the end of
the gfn_to_hva_cache, not for addresses before the start of the
gfn_to_hva_cache.

Note that all current call sites have non-negative offsets.
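
The fix is simply a type change in the prototype, roughly:

	/* offset is now unsigned, so it cannot point below the start
	 * of the cached region */
	int kvm_write_guest_offset_cached(struct kvm *kvm,
					  struct gfn_to_hva_cache *ghc,
					  void *data, unsigned int offset,
					  unsigned long len);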

Fixes: 4ec6e8636256 ("kvm: Introduce kvm_write_guest_offset_cached()")
Reported-by: Cfir Cohen &lt;cfir@google.com&gt;
Signed-off-by: Jim Mattson &lt;jmattson@google.com&gt;
Reviewed-by: Cfir Cohen &lt;cfir@google.com&gt;
Reviewed-by: Peter Shier &lt;pshier@google.com&gt;
Reviewed-by: Krish Sadhukhan &lt;krish.sadhukhan@oracle.com&gt;
Reviewed-by: Sean Christopherson &lt;sean.j.christopherson@intel.com&gt;
Signed-off-by: Radim Krčmář &lt;rkrcmar@redhat.com&gt;
</content>
</entry>
<entry>
<title>kvm: introduce manual dirty log reprotect</title>
<updated>2018-12-14T11:34:19+00:00</updated>
<author>
<name>Paolo Bonzini</name>
<email>pbonzini@redhat.com</email>
</author>
<published>2018-10-23T00:36:47+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=2a31b9db153530df4aa02dac8c32837bf5f47019'/>
<id>2a31b9db153530df4aa02dac8c32837bf5f47019</id>
<content type='text'>
There are two problems with KVM_GET_DIRTY_LOG.  First, and less important,
it can take kvm-&gt;mmu_lock for an extended period of time.  Second, its user
can actually see many false positives in some cases.  The latter is due
to a benign race like this:

  1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
     them.
  2. The guest modifies the pages, causing them to be marked dirty.
  3. Userspace actually copies the pages.
  4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
     they were not written to since (3).

This is especially a problem for large guests, where the time between
(1) and (3) can be substantial.  This patch introduces a new
capability which, when enabled, makes KVM_GET_DIRTY_LOG not
write-protect the pages it returns.  Instead, userspace has to
explicitly clear the dirty log bits just before using the content
of the page.  The new KVM_CLEAR_DIRTY_LOG ioctl can also operate at a
64-page granularity rather than requiring a full memslot to be synced;
this way, the mmu_lock is taken for small amounts of time, and only a
small amount of time passes between the write protection of pages and
the sending of their content.
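
From userspace the flow might look like the following rough sketch
(vm_fd, slot and bitmap are assumed to be set up elsewhere):

	/* opt in to manual reprotection */
	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_MANUAL_DIRTY_LOG_PROTECT,
	};
	ioctl(vm_fd, KVM_ENABLE_CAP, &amp;cap);

	/* fetch the dirty log; pages are no longer write-protected here */
	struct kvm_dirty_log log = { .slot = slot, .dirty_bitmap = bitmap };
	ioctl(vm_fd, KVM_GET_DIRTY_LOG, &amp;log);

	/* copy the page contents, then re-protect just this
	 * 64-page range */
	struct kvm_clear_dirty_log clear = {
		.slot = slot,
		.first_page = 0,
		.num_pages = 64,
		.dirty_bitmap = bitmap,
	};
	ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &amp;clear);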

Signed-off-by: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
</content>
</entry>
<entry>
<title>kvm: rename last argument to kvm_get_dirty_log_protect</title>
<updated>2018-12-14T11:34:18+00:00</updated>
<author>
<name>Paolo Bonzini</name>
<email>pbonzini@redhat.com</email>
</author>
<published>2018-10-23T00:18:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=8fe65a8299f9e1f40cb95308ab7b3c4ad80bf801'/>
<id>8fe65a8299f9e1f40cb95308ab7b3c4ad80bf801</id>
<content type='text'>
Once manual dirty log reprotection is enabled, kvm_get_dirty_log_protect's
pointer argument will always be false on exit, because no TLB flush is
needed until the manual re-protection operation.  Rename it from "is_dirty"
to "flush", which more accurately tells the caller what to do with it.
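
After the rename the prototype reads, roughly:

	/* *flush tells the caller whether it still has to flush TLBs */
	int kvm_get_dirty_log_protect(struct kvm *kvm,
				      struct kvm_dirty_log *log, bool *flush);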

Signed-off-by: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
</content>
</entry>
<entry>
<title>kvm: make KVM_CAP_ENABLE_CAP_VM architecture agnostic</title>
<updated>2018-12-14T11:34:18+00:00</updated>
<author>
<name>Paolo Bonzini</name>
<email>pbonzini@redhat.com</email>
</author>
<published>2017-02-16T09:40:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=e5d83c74a5800c2a1fa3ba982c1c4b2b39ae6db2'/>
<id>e5d83c74a5800c2a1fa3ba982c1c4b2b39ae6db2</id>
<content type='text'>
The first such capability to be handled in virt/kvm/ will be manual
dirty page reprotection.
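
The shape of the change, as a sketch: a generic handler in
virt/kvm/kvm_main.c dispatches capabilities it does not know to the
architecture code (the function names are assumptions based on this
description):

	static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
						   struct kvm_enable_cap *cap)
	{
		switch (cap-&gt;cap) {
		/* generic capabilities will be handled here */
		default:
			return kvm_vm_ioctl_enable_cap(kvm, cap);
		}
	}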

Signed-off-by: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
</content>
</entry>
<entry>
<title>kvm: x86: make kvm_{load|put}_guest_fpu() static</title>
<updated>2018-09-19T22:51:43+00:00</updated>
<author>
<name>Sebastian Andrzej Siewior</name>
<email>bigeasy@linutronix.de</email>
</author>
<published>2018-09-12T13:33:45+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=822f312d47f0200dc0999c9f006fe94aa43bd0bd'/>
<id>822f312d47f0200dc0999c9f006fe94aa43bd0bd</id>
<content type='text'>
The functions
	kvm_load_guest_fpu()
	kvm_put_guest_fpu()

are only used locally, so make them static. This also requires moving
both functions, because they are used before their definitions.
These functions were exported (via EXPORT_SYMBOL) until commit
e5bb40251a920 ("KVM: Drop kvm_{load,put}_guest_fpu() exports").

Signed-off-by: Sebastian Andrzej Siewior &lt;bigeasy@linutronix.de&gt;
Signed-off-by: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
</content>
</entry>
<entry>
<title>mm, oom: distinguish blockable mode for mmu notifiers</title>
<updated>2018-08-22T17:52:44+00:00</updated>
<author>
<name>Michal Hocko</name>
<email>mhocko@suse.com</email>
</author>
<published>2018-08-22T04:52:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=93065ac753e4443840a057bfef4be71ec766fde9'/>
<id>93065ac753e4443840a057bfef4be71ec766fde9</id>
<content type='text'>
There are several blockable mmu notifiers which might sleep in
mmu_notifier_invalidate_range_start, and that is a problem for the
oom_reaper, because it needs to guarantee forward progress and therefore
cannot depend on any sleepable locks.

Currently we simply back off and mark an oom victim with blockable mmu
notifiers as done after a short sleep.  That can result in selecting a new
oom victim prematurely because the previous one still hasn't torn its
memory down yet.

We can do much better, though.  Even if mmu notifiers use sleepable locks,
there is no reason to automatically assume those locks are held.  Moreover,
the majority of notifiers only care about a portion of the address space,
and there is absolutely no reason to fail when we are unmapping an
unrelated range.  Many notifiers really do block and wait for HW; those
are harder to handle, and there we have to bail out.

This patch handles the low-hanging fruit:
__mmu_notifier_invalidate_range_start gets a blockable flag, and callbacks
are not allowed to sleep if the flag is set to false.  This is achieved by
using trylock instead of the sleepable lock for most callbacks and
continuing as long as we do not block down the call chain.
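
For most notifiers this boils down to a pattern like the following
sketch; the callback now returns an error code and receives a blockable
flag (ctx and its lock are made-up example names):

	static int ctx_invalidate_range_start(struct mmu_notifier *mn,
					      struct mm_struct *mm,
					      unsigned long start,
					      unsigned long end,
					      bool blockable)
	{
		struct ctx *ctx = container_of(mn, struct ctx, mn);

		if (blockable)
			mutex_lock(&amp;ctx-&gt;lock);
		else if (!mutex_trylock(&amp;ctx-&gt;lock))
			return -EAGAIN; /* no progress; let the caller retry */

		/* ... invalidate the affected range ... */

		mutex_unlock(&amp;ctx-&gt;lock);
		return 0;
	}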

I think we can improve that even further, because there is a common
pattern of doing a range lookup first and then acting on the result.  The
first part can be done without a sleeping lock in most cases, AFAICS.

The oom_reaper side then simply retries if there is at least one notifier
which couldn't make any progress in !blockable mode.  A retry loop is
already implemented to wait for the mmap_sem, and this is basically the
same thing.

The simplest way for driver developers to test this code path is to wrap
userspace code which uses these notifiers into a memcg and set the hard
limit to hit the oom.  This can be done, e.g., by having the test fault in
all of the mmu-notifier-managed memory and then setting the hard limit to
something really small.  The expected outcome is a proper process teardown.

[akpm@linux-foundation.org: coding style fixes]
[akpm@linux-foundation.org: minor code simplification]
Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org
Signed-off-by: Michal Hocko &lt;mhocko@suse.com&gt;
Acked-by: Christian König &lt;christian.koenig@amd.com&gt; # AMD notifiers
Acked-by: Leon Romanovsky &lt;leonro@mellanox.com&gt; # mlx and umem_odp
Reported-by: David Rientjes &lt;rientjes@google.com&gt;
Cc: "David (ChunMing) Zhou" &lt;David1.Zhou@amd.com&gt;
Cc: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
Cc: Alex Deucher &lt;alexander.deucher@amd.com&gt;
Cc: David Airlie &lt;airlied@linux.ie&gt;
Cc: Jani Nikula &lt;jani.nikula@linux.intel.com&gt;
Cc: Joonas Lahtinen &lt;joonas.lahtinen@linux.intel.com&gt;
Cc: Rodrigo Vivi &lt;rodrigo.vivi@intel.com&gt;
Cc: Doug Ledford &lt;dledford@redhat.com&gt;
Cc: Jason Gunthorpe &lt;jgg@ziepe.ca&gt;
Cc: Mike Marciniszyn &lt;mike.marciniszyn@intel.com&gt;
Cc: Dennis Dalessandro &lt;dennis.dalessandro@intel.com&gt;
Cc: Sudeep Dutt &lt;sudeep.dutt@intel.com&gt;
Cc: Ashutosh Dixit &lt;ashutosh.dixit@intel.com&gt;
Cc: Dimitri Sivanich &lt;sivanich@sgi.com&gt;
Cc: Boris Ostrovsky &lt;boris.ostrovsky@oracle.com&gt;
Cc: Juergen Gross &lt;jgross@suse.com&gt;
Cc: "Jérôme Glisse" &lt;jglisse@redhat.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Felix Kuehling &lt;felix.kuehling@amd.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
</feed>
