linux-toradex.git/kernel/resource.c, branch v6.0-rc6

resource: Introduce alloc_free_mem_region()

2022-07-22T00:19:25+00:00

The core of devm_request_free_mem_region() is a helper that searches for
free space in iomem_resource and performs __request_region_locked() on
the result of that search. The policy choices of the implementation
conform to what CONFIG_DEVICE_PRIVATE users want which is memory that is
immediately marked busy, and a preference to search for the first-fit
free range in descending order from the top of the physical address
space.

CXL has a need for a similar allocator, but with the following tweaks:

1/ Search for free space in ascending order

2/ Search for free space relative to a given CXL window

3/ 'insert' rather than 'request' the new resource given downstream
   drivers from the CXL Region driver (like the pmem or dax drivers) are
   responsible for request_mem_region() when they activate the memory
   range.

Rework __request_free_mem_region() into get_free_mem_region() which
takes a set of GFR_* (Get Free Region) flags to control the allocation
policy (ascending vs descending), and "busy" policy (insert_resource()
vs request_region()).

As part of the consolidation of the legacy GFR_REQUEST_REGION case with
the new default of just inserting a new resource into the free space
some minor cleanups like not checking for NULL before calling
devres_free() (which does its own check) is included.

Suggested-by: Jason Gunthorpe 
Link: https://lore.kernel.org/linux-cxl/20220420143406.GY2120790@nvidia.com/
Cc: Matthew Wilcox 
Cc: Christoph Hellwig 
Reviewed-by: Jonathan Cameron 
Link: https://lore.kernel.org/r/165784333333.1758207.13703329337805274043.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams

cxl/acpi: Track CXL resources in iomem_resource

2022-07-21T15:40:01+00:00

Recall that CXL capable address ranges, on ACPI platforms, are published
in the CEDT.CFMWS (CXL Early Discovery Table: CXL Fixed Memory Window
Structures). These windows represent both the actively mapped capacity
and the potential address space that can be dynamically assigned to a
new CXL decode configuration (region / interleave-set).

CXL endpoints like DDR DIMMs can be mapped at any physical address
including 0 and legacy ranges.

There is an expectation and requirement that the /proc/iomem interface
and the iomem_resource tree in the kernel reflect the full set of
platform address ranges. I.e. that every address range that platform
firmware and bus drivers enumerate be reflected as an iomem_resource
entry. The hard requirement to do this for CXL arises from the fact that
facilities like CONFIG_DEVICE_PRIVATE expect to be able to treat empty
iomem_resource ranges as free for software to use as proxy address
space. Without CXL publishing its potential address ranges in
iomem_resource, the CONFIG_DEVICE_PRIVATE mechanism may inadvertently
steal capacity reserved for runtime provisioning of new CXL regions.

So, iomem_resource needs to know about both active and potential CXL
resource ranges. The active CXL resources might already be reflected in
iomem_resource as "System RAM". insert_resource_expand_to_fit() handles
re-parenting "System RAM" underneath a CXL window.

The "_expand_to_fit()" behavior handles cases where a CXL window is not
a strict superset of an existing entry in the iomem_resource tree. The
"_expand_to_fit()" behavior is acceptable from the perspective of
resource allocation. The expansion happens because a conflicting
resource range is already populated, which means the resource boundary
expansion does not result in any additional free CXL address space being
made available. CXL address space allocation is always bounded by the
orginal unexpanded address range.

However, the potential for expansion does mean that something like
walk_iomem_res_desc(IORES_DESC_CXL...) can only return fuzzy answers on
corner case platforms that cause the resource tree to expand a CXL
window resource over a range that is not decoded by CXL. This would be
an odd platform configuration, but if it becomes a problem in practice
the CXL subsytem could just publish an API that returns definitive
answers.

Cc: Andrew Morton 
Cc: David Hildenbrand 
Cc: Jason Gunthorpe 
Cc: Tony Luck 
Cc: Christoph Hellwig 
Reviewed-by: Jonathan Cameron 
Acked-by: Greg Kroah-Hartman 
Link: https://lore.kernel.org/r/165784325943.1758207.5310344844375305118.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams

kernel/resource: fix kfree() of bootmem memory again

2022-03-24T02:00:35+00:00

Since commit ebff7d8f270d ("mem hotunplug: fix kfree() of bootmem
memory"), we could get a resource allocated during boot via
alloc_resource().  And it's required to release the resource using
free_resource().  Howerver, many people use kfree directly which will
result in kernel BUG.  In order to fix this without fixing every call
site, just leak a couple of bytes in such corner case.

Link: https://lkml.kernel.org/r/20220217083619.19305-1-linmiaohe@huawei.com
Fixes: ebff7d8f270d ("mem hotunplug: fix kfree() of bootmem memory")
Signed-off-by: Miaohe Lin 
Suggested-by: David Hildenbrand 
Cc: Dan Williams 
Cc: Alistair Popple 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

proc: remove PDE_DATA() completely

2022-01-22T06:33:37+00:00

Remove PDE_DATA() completely and replace it with pde_data().

[akpm@linux-foundation.org: fix naming clash in drivers/nubus/proc.c]
[akpm@linux-foundation.org: now fix it properly]

Link: https://lkml.kernel.org/r/20211124081956.87711-2-songmuchun@bytedance.com
Signed-off-by: Muchun Song 
Acked-by: Christian Brauner 
Cc: Alexey Dobriyan 
Cc: Alexey Gladkov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

kernel/resource: disallow access to exclusive system RAM regions

2021-11-09T18:02:52+00:00

virtio-mem dynamically exposes memory inside a device memory region as
system RAM to Linux, coordinating with the hypervisor which parts are
actually "plugged" and consequently usable/accessible.

On the one hand, the virtio-mem driver adds/removes whole memory blocks,
creating/removing busy IORESOURCE_SYSTEM_RAM resources, on the other
hand, it logically (un)plugs memory inside added memory blocks,
dynamically either exposing them to the buddy or hiding them from the
buddy and marking them PG_offline.

In contrast to physical devices, like a DIMM, the virtio-mem driver is
required to actually make use of any of the device-provided memory,
because it performs the handshake with the hypervisor.  virtio-mem
memory cannot simply be access via /dev/mem without a driver.

There is no safe way to:
a) Access plugged memory blocks via /dev/mem, as they might contain
   unplugged holes or might get silently unplugged by the virtio-mem
   driver and consequently turned inaccessible.
b) Access unplugged memory blocks via /dev/mem because the virtio-mem
   driver is required to make them actually accessible first.

The virtio-spec states that unplugged memory blocks MUST NOT be written,
and only selected unplugged memory blocks MAY be read.  We want to make
sure, this is the case in sane environments -- where the virtio-mem driver
was loaded.

We want to make sure that in a sane environment, nobody "accidentially"
accesses unplugged memory inside the device managed region.  For example,
a user might spot a memory region in /proc/iomem and try accessing it via
/dev/mem via gdb or dumping it via something else.  By the time the mmap()
happens, the memory might already have been removed by the virtio-mem
driver silently: the mmap() would succeeed and user space might
accidentially access unplugged memory.

So once the driver was loaded and detected the device along the
device-managed region, we just want to disallow any access via /dev/mem to
it.

In an ideal world, we would mark the whole region as busy ("owned by a
driver") and exclude it; however, that would be wrong, as we don't really
have actual system RAM at these ranges added to Linux ("busy system RAM").
Instead, we want to mark such ranges as "not actual busy system RAM but
still soft-reserved and prepared by a driver for future use."

Let's teach iomem_is_exclusive() to reject access to any range with
"IORESOURCE_SYSTEM_RAM | IORESOURCE_EXCLUSIVE", even if not busy and even
if "iomem=relaxed" is set.  Introduce EXCLUSIVE_SYSTEM_RAM to make it
easier for applicable drivers to depend on this setting in their Kconfig.

For now, there are no applicable ranges and we'll modify virtio-mem next
to properly set IORESOURCE_EXCLUSIVE on the parent resource container it
creates to contain all actual busy system RAM added via
add_memory_driver_managed().

Link: https://lkml.kernel.org/r/20210920142856.17758-3-david@redhat.com
Signed-off-by: David Hildenbrand 
Reviewed-by: Dan Williams 
Cc: Andy Shevchenko 
Cc: Arnd Bergmann 
Cc: Greg Kroah-Hartman 
Cc: Hanjun Guo 
Cc: Jason Wang 
Cc: "Michael S. Tsirkin" 
Cc: "Rafael J. Wysocki" 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

kernel/resource: clean up and optimize iomem_is_exclusive()

2021-11-09T18:02:52+00:00

Patch series "virtio-mem: disallow mapping virtio-mem memory via /dev/mem", v5.

Let's add the basic infrastructure to exclude some physical memory regions
marked as "IORESOURCE_SYSTEM_RAM" completely from /dev/mem access, even
though they are not marked IORESOURCE_BUSY and even though "iomem=relaxed"
is set.  Resource IORESOURCE_EXCLUSIVE for that purpose instead of adding
new flags to express something similar to "soft-busy" or "not busy yet,
but already prepared by a driver and not to be mapped by user space".

Use it for virtio-mem, to disallow mapping any virtio-mem memory via
/dev/mem to user space after the virtio-mem driver was loaded.

This patch (of 3):

We end up traversing subtrees of ranges we are not interested in; let's
optimize this case, skipping such subtrees, cleaning up the function a
bit.

For example, in the following configuration (/proc/iomem):

  00000000-00000fff : Reserved
  00001000-00057fff : System RAM
  00058000-00058fff : Reserved
  00059000-0009cfff : System RAM
  0009d000-000fffff : Reserved
     000a0000-000bffff : PCI Bus 0000:00
     000c0000-000c3fff : PCI Bus 0000:00
     000c4000-000c7fff : PCI Bus 0000:00
     000c8000-000cbfff : PCI Bus 0000:00
     000cc000-000cffff : PCI Bus 0000:00
     000d0000-000d3fff : PCI Bus 0000:00
     000d4000-000d7fff : PCI Bus 0000:00
     000d8000-000dbfff : PCI Bus 0000:00
     000dc000-000dffff : PCI Bus 0000:00
     000e0000-000e3fff : PCI Bus 0000:00
     000e4000-000e7fff : PCI Bus 0000:00
     000e8000-000ebfff : PCI Bus 0000:00
     000ec000-000effff : PCI Bus 0000:00
     000f0000-000fffff : PCI Bus 0000:00
       000f0000-000fffff : System ROM
  00100000-3fffffff : System RAM
  40000000-403fffff : Reserved
     40000000-403fffff : pnp 00:00
  40400000-80a79fff : System RAM
  ...

We don't have to look at any children of "0009d000-000fffff : Reserved"
if we can just skip these 15 items directly because the parent range is
not of interest.

Link: https://lkml.kernel.org/r/20210920142856.17758-1-david@redhat.com
Link: https://lkml.kernel.org/r/20210920142856.17758-2-david@redhat.com
Signed-off-by: David Hildenbrand 
Reviewed-by: Dan Williams 
Cc: Arnd Bergmann 
Cc: Greg Kroah-Hartman 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: "Rafael J. Wysocki" 
Cc: Hanjun Guo 
Cc: Andy Shevchenko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

kernel/resource: fix return code check in __request_free_mem_region

2021-05-15T02:41:32+00:00

Splitting an earlier version of a patch that allowed calling
__request_region() while holding the resource lock into a series of
patches required changing the return code for the newly introduced
__request_region_locked().

Unfortunately this change was not carried through to a subsequent commit
56fd94919b8b ("kernel/resource: fix locking in request_free_mem_region")
in the series.  This resulted in a use-after-free due to freeing the
struct resource without properly releasing it.  Fix this by correcting the
return code check so that the struct is not freed if the request to add it
was successful.

Link: https://lkml.kernel.org/r/20210512073528.22334-1-apopple@nvidia.com
Fixes: 56fd94919b8b ("kernel/resource: fix locking in request_free_mem_region")
Signed-off-by: Alistair Popple 
Reported-by: kernel test robot 
Reviewed-by: David Hildenbrand 
Cc: Balbir Singh 
Cc: Dan Williams 
Cc: Daniel Vetter 
Cc: Greg Kroah-Hartman 
Cc: Jerome Glisse 
Cc: John Hubbard 
Cc: Muchun Song 
Cc: Oliver Sang 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

kernel/resource: fix locking in request_free_mem_region

2021-05-07T07:26:33+00:00

request_free_mem_region() is used to find an empty range of physical
addresses for hotplugging ZONE_DEVICE memory.  It does this by iterating
over the range of possible addresses using region_intersects() to see if
the range is free before calling request_mem_region() to allocate the
region.

However the resource_lock is dropped between these two calls meaning by
the time request_mem_region() is called in request_free_mem_region()
another thread may have already reserved the requested region.  This
results in unexpected failures and a message in the kernel log from
hitting this condition:

        /*
         * mm/hmm.c reserves physical addresses which then
         * become unavailable to other users.  Conflicts are
         * not expected.  Warn to aid debugging if encountered.
         */
        if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
                pr_warn("Unaddressable device %s %pR conflicts with %pR",
                        conflict->name, conflict, res);

These unexpected failures can be corrected by holding resource_lock across
the two calls.  This also requires memory allocation to be performed prior
to taking the lock.

Link: https://lkml.kernel.org/r/20210419070109.4780-3-apopple@nvidia.com
Signed-off-by: Alistair Popple 
Reviewed-by: David Hildenbrand 
Cc: Balbir Singh 
Cc: Daniel Vetter 
Cc: Dan Williams 
Cc: Greg Kroah-Hartman 
Cc: Jerome Glisse 
Cc: John Hubbard 
Cc: Muchun Song 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

kernel/resource: refactor __request_region to allow external locking

2021-05-07T07:26:33+00:00

Refactor the portion of __request_region() done whilst holding the
resource_lock into a separate function to allow callers to hold the lock.

Link: https://lkml.kernel.org/r/20210419070109.4780-2-apopple@nvidia.com
Signed-off-by: Alistair Popple 
Reviewed-by: David Hildenbrand 
Cc: Balbir Singh 
Cc: Daniel Vetter 
Cc: Dan Williams 
Cc: Greg Kroah-Hartman 
Cc: Jerome Glisse 
Cc: John Hubbard 
Cc: Muchun Song 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

kernel/resource: allow region_intersects users to hold resource_lock

2021-05-07T07:26:33+00:00

Introduce a version of region_intersects() that can be called with the
resource_lock already held.

This will be used in a future fix to __request_free_mem_region().

[akpm@linux-foundation.org: make __region_intersects static]

Link: https://lkml.kernel.org/r/20210419070109.4780-1-apopple@nvidia.com
Signed-off-by: Alistair Popple 
Cc: David Hildenbrand 
Cc: Daniel Vetter 
Cc: Dan Williams 
Cc: Greg Kroah-Hartman 
Cc: John Hubbard 
Cc: Jerome Glisse 
Cc: Balbir Singh 
Cc: Muchun Song 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds