linux-toradex.git/arch/x86/include/asm/pgtable_64.h, branch v4.4.6

mm: remove remaining references to NUMA hinting bits and helpers

2015-02-13T02:54:08+00:00

This patch removes the NUMA PTE bits and associated helpers.  As a
side-effect it increases the maximum possible swap space on x86-64.

One potential source of problems is races between the marking of PTEs
PROT_NONE, NUMA hinting faults and migration.  It must be guaranteed that
a PTE being protected is not faulted in parallel, seen as a pte_none and
corrupting memory.  The base case is safe but transhuge has problems in
the past due to an different migration mechanism and a dependance on page
lock to serialise migrations and warrants a closer look.

task_work hinting update			parallel fault
------------------------			--------------
change_pmd_range
  change_huge_pmd
    __pmd_trans_huge_lock
      pmdp_get_and_clear
						__handle_mm_fault
						pmd_none
						  do_huge_pmd_anonymous_page
						  read? pmd_lock blocks until hinting complete, fail !pmd_none test
						  write? __do_huge_pmd_anonymous_page acquires pmd_lock, checks pmd_none
      pmd_modify
      set_pmd_at

task_work hinting update			parallel migration
------------------------			------------------
change_pmd_range
  change_huge_pmd
    __pmd_trans_huge_lock
      pmdp_get_and_clear
						__handle_mm_fault
						  do_huge_pmd_numa_page
						    migrate_misplaced_transhuge_page
						    pmd_lock waits for updates to complete, recheck pmd_same
      pmd_modify
      set_pmd_at

Both of those are safe and the case where a transhuge page is inserted
during a protection update is unchanged.  The case where two processes try
migrating at the same time is unchanged by this series so should still be
ok.  I could not find a case where we are accidentally depending on the
PTE not being cleared and flushed.  If one is missed, it'll manifest as
corruption problems that start triggering shortly after this series is
merged and only happen when NUMA balancing is enabled.

Signed-off-by: Mel Gorman 
Tested-by: Sasha Levin 
Cc: Aneesh Kumar K.V 
Cc: Benjamin Herrenschmidt 
Cc: Dave Jones 
Cc: Hugh Dickins 
Cc: Ingo Molnar 
Cc: Kirill Shutemov 
Cc: Linus Torvalds 
Cc: Paul Mackerras 
Cc: Rik van Riel 
Cc: Mark Brown 
Cc: Stephen Rothwell 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

x86: drop _PAGE_FILE and pte_file()-related helpers

2015-02-10T22:30:33+00:00

We've replaced remap_file_pages(2) implementation with emulation.  Nobody
creates non-linear mapping anymore.

Signed-off-by: Kirill A. Shutemov 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2014-10-14T00:22:41+00:00

Pull x86 mm updates from Ingo Molnar:
 "This tree includes the following changes:

   - fix memory hotplug
   - fix hibernation bootup memory layout assumptions
   - fix hyperv numa guest kernel messages
   - remove dead code
   - update documentation"

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Update memory map description to list hypervisor-reserved area
  x86/mm, hibernate: Do not assume the first e820 area to be RAM
  x86/mm/numa: Drop dead code and rename setup_node_data() to setup_alloc_data()
  x86/mm/hotplug: Modify PGD entry when removing memory
  x86/mm/hotplug: Pass sync_global_pgds() a correct argument in remove_pagetable()
  x86: Remove set_pmd_pfn

x86/mm/hotplug: Modify PGD entry when removing memory

2014-09-16T06:55:09+00:00

When hot-adding/removing memory, sync_global_pgds() is called
for synchronizing PGD to PGD entries of all processes MM.  But
when hot-removing memory, sync_global_pgds() does not work
correctly.

At first, sync_global_pgds() checks whether target PGD is none
or not.  And if PGD is none, the PGD is skipped.  But when
hot-removing memory, PGD may be none since PGD may be cleared by
free_pud_table().  So when sync_global_pgds() is called after
hot-removing memory, sync_global_pgds() should not skip PGD even
if the PGD is none.  And sync_global_pgds() must clear PGD
entries of all processes MM.

Currently sync_global_pgds() does not clear PGD entries of all
processes MM when hot-removing memory.  So when hot adding
memory which is same memory range as removed memory after
hot-removing memory, following call traces are shown:

 kernel BUG at arch/x86/mm/init_64.c:206!
 ...
 [] kernel_physical_mapping_init+0x1b2/0x1d2
 [] init_memory_mapping+0x1d4/0x380
 [] arch_add_memory+0x3d/0xd0
 [] add_memory+0xb9/0x1b0
 [] acpi_memory_device_add+0x1af/0x28e
 [] acpi_bus_device_attach+0x8c/0xf0
 [] acpi_ns_walk_namespace+0xc8/0x17f
 [] ? acpi_bus_type_and_status+0xb7/0xb7
 [] ? acpi_bus_type_and_status+0xb7/0xb7
 [] acpi_walk_namespace+0x95/0xc5
 [] acpi_bus_scan+0x9a/0xc2
 [] acpi_scan_bus_device_check+0x8b/0x12e
 [] acpi_scan_device_check+0x13/0x15
 [] acpi_os_execute_deferred+0x25/0x32
 [] process_one_work+0x17b/0x460
 [] worker_thread+0x11b/0x400
 [] ? rescuer_thread+0x400/0x400
 [] kthread+0xcf/0xe0
 [] ? kthread_create_on_node+0x140/0x140
 [] ret_from_fork+0x7c/0xb0
 [] ? kthread_create_on_node+0x140/0x140

This patch clears PGD entries of all processes MM when
sync_global_pgds() is called after hot-removing memory

Signed-off-by: Yasuaki Ishimatsu 
Acked-by: Toshi Kani 
Signed-off-by: Andrew Morton 
Cc: Tang Chen 
Cc: Gu Zheng 
Cc: Zhang Yanfei 
Cc: Linus Torvalds 
Signed-off-by: Ingo Molnar

x86/xen: don't copy bogus duplicate entries into kernel page tables

2014-09-10T14:23:42+00:00

When RANDOMIZE_BASE (KASLR) is enabled; or the sum of all loaded
modules exceeds 512 MiB, then loading modules fails with a warning
(and hence a vmalloc allocation failure) because the PTEs for the
newly-allocated vmalloc address space are not zero.

  WARNING: CPU: 0 PID: 494 at linux/mm/vmalloc.c:128
           vmap_page_range_noflush+0x2a1/0x360()

This is caused by xen_setup_kernel_pagetables() copying
level2_kernel_pgt into level2_fixmap_pgt, overwriting many non-present
entries.

Without KASLR, the normal kernel image size only covers the first half
of level2_kernel_pgt and module space starts after that.

L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[  0..255]->kernel
                                                  [256..511]->module
                          [511]->level2_fixmap_pgt[  0..505]->module

This allows 512 MiB of of module vmalloc space to be used before
having to use the corrupted level2_fixmap_pgt entries.

With KASLR enabled, the kernel image uses the full PUD range of 1G and
module space starts in the level2_fixmap_pgt. So basically:

L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[0..511]->kernel
                          [511]->level2_fixmap_pgt[0..505]->module

And now no module vmalloc space can be used without using the corrupt
level2_fixmap_pgt entries.

Fix this by properly converting the level2_fixmap_pgt entries to MFNs,
and setting level1_fixmap_pgt as read-only.

A number of comments were also using the the wrong L3 offset for
level2_kernel_pgt.  These have been corrected.

Signed-off-by: Stefan Bader 
Signed-off-by: David Vrabel 
Reviewed-by: Boris Ostrovsky 
Cc: stable@vger.kernel.org

mm: x86 pgtable: drop unneeded preprocessor ifdef

2014-06-04T23:54:05+00:00

_PAGE_BIT_FILE (bit 6) is always less than _PAGE_BIT_PROTNONE (bit 8), so
drop redundant #ifdef.

Signed-off-by: Cyrill Gorcunov 
Cc: Linus Torvalds 
Cc: Mel Gorman 
Cc: Peter Anvin 
Cc: Ingo Molnar 
Cc: Steven Noonan 
Cc: Rik van Riel 
Cc: David Vrabel 
Cc: Peter Zijlstra 
Cc: Pavel Emelyanov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels

2014-06-04T23:53:55+00:00

_PAGE_NUMA is currently an alias of _PROT_PROTNONE to trap NUMA hinting
faults on x86.  Care is taken such that _PAGE_NUMA is used only in
situations where the VMA flags distinguish between NUMA hinting faults
and prot_none faults.  This decision was x86-specific and conceptually
it is difficult requiring special casing to distinguish between PROTNONE
and NUMA ptes based on context.

Fundamentally, we only need the _PAGE_NUMA bit to tell the difference
between an entry that is really unmapped and a page that is protected
for NUMA hinting faults as if the PTE is not present then a fault will
be trapped.

Swap PTEs on x86-64 use the bits after _PAGE_GLOBAL for the offset.
This patch shrinks the maximum possible swap size and uses the bit to
uniquely distinguish between NUMA hinting ptes and swap ptes.

Signed-off-by: Mel Gorman 
Cc: David Vrabel 
Cc: Ingo Molnar 
Cc: Peter Anvin 
Cc: Fengguang Wu 
Cc: Linus Torvalds 
Cc: Steven Noonan 
Cc: Rik van Riel 
Cc: Peter Zijlstra 
Cc: Andrea Arcangeli 
Cc: Dave Hansen 
Cc: Srikar Dronamraju 
Cc: Cyrill Gorcunov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2013-02-22T02:06:55+00:00

Pull x86 mm changes from Peter Anvin:
 "This is a huge set of several partly interrelated (and concurrently
  developed) changes, which is why the branch history is messier than
  one would like.

  The *really* big items are two humonguous patchsets mostly developed
  by Yinghai Lu at my request, which completely revamps the way we
  create initial page tables.  In particular, rather than estimating how
  much memory we will need for page tables and then build them into that
  memory -- a calculation that has shown to be incredibly fragile -- we
  now build them (on 64 bits) with the aid of a "pseudo-linear mode" --
  a #PF handler which creates temporary page tables on demand.

  This has several advantages:

  1. It makes it much easier to support things that need access to data
     very early (a followon patchset uses this to load microcode way
     early in the kernel startup).

  2. It allows the kernel and all the kernel data objects to be invoked
     from above the 4 GB limit.  This allows kdump to work on very large
     systems.

  3. It greatly reduces the difference between Xen and native (Xen's
     equivalent of the #PF handler are the temporary page tables created
     by the domain builder), eliminating a bunch of fragile hooks.

  The patch series also gets us a bit closer to W^X.

  Additional work in this pull is the 64-bit get_user() work which you
  were also involved with, and a bunch of cleanups/speedups to
  __phys_addr()/__pa()."

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (105 commits)
  x86, mm: Move reserving low memory later in initialization
  x86, doc: Clarify the use of asm("%edx") in uaccess.h
  x86, mm: Redesign get_user with a __builtin_choose_expr hack
  x86: Be consistent with data size in getuser.S
  x86, mm: Use a bitfield to mask nuisance get_user() warnings
  x86/kvm: Fix compile warning in kvm_register_steal_time()
  x86-32: Add support for 64bit get_user()
  x86-32, mm: Remove reference to alloc_remap()
  x86-32, mm: Remove reference to resume_map_numa_kva()
  x86-32, mm: Rip out x86_32 NUMA remapping code
  x86/numa: Use __pa_nodebug() instead
  x86: Don't panic if can not alloc buffer for swiotlb
  mm: Add alloc_bootmem_low_pages_nopanic()
  x86, 64bit, mm: hibernate use generic mapping_init
  x86, 64bit, mm: Mark data/bss/brk to nx
  x86: Merge early kernel reserve for 32bit and 64bit
  x86: Add Crash kernel low reservation
  x86, kdump: Remove crashkernel range find limit for 64bit
  memblock: Add memblock_mem_size()
  x86, boot: Not need to check setup_header version for setup_data
  ...

x86/mm: Convert update_mmu_cache() and update_mmu_cache_pmd() to functions

2013-01-24T15:12:13+00:00

Converting macros to functions unhide type problems before
changes will be integrated and trigger problems on other
architectures.

Signed-off-by: Kirill A. Shutemov 
Acked-by: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Ingo Molnar

x86: Move some contents of page_64_types.h into pgtable_64.h and page_64.h

2012-11-17T00:40:34+00:00

This patch is meant to clean-up the fact that we have several functions in
page_64_types.h which really don't belong there.  I found this issue when I
had tried to replace __phys_addr with an inline function.  It resulted in the
realmode bits generating compile warnings about types.  In order to resolve
that I am relocating the address translation to page_64.h since this is in
keeping with where these functions are located in 32 bit.

In addtion I have relocated several functions defined in init_64.c to
pgtable_64.h as this seems to be where most of the functions related to
memory initialization were already located.

[ hpa: added missing #include  to apic_numachip.c,
  as reported by Yinghai Lu. ]

Signed-off-by: Alexander Duyck 
Link: http://lkml.kernel.org/r/20121116215244.8521.31505.stgit@ahduyck-cp1.jf.intel.com
Signed-off-by: H. Peter Anvin 
Cc: Yinghai Lu 
Cc: Daniel J Blueman