linux-toradex.git/fs/proc/task_mmu.c, branch v4.4.27

proc: revert /proc//maps [stack:TID] annotation

2016-09-15T06:27:46+00:00

[ Upstream commit 65376df582174ffcec9e6471bf5b0dd79ba05e4a ]

Commit b76437579d13 ("procfs: mark thread stack correctly in
proc//maps") added [stack:TID] annotation to /proc//maps.

Finding the task of a stack VMA requires walking the entire thread list,
turning this into quadratic behavior: a thousand threads means a
thousand stacks, so the rendering of /proc//maps needs to look at a
million combinations.

The cost is not in proportion to the usefulness as described in the
patch.

Drop the [stack:TID] annotation to make /proc//maps (and
/proc//numa_maps) usable again for higher thread counts.

The [stack] annotation inside /proc//task//maps is retained, as
identifying the stack VMA there is an O(1) operation.

Siddesh said:
 "The end users needed a way to identify thread stacks programmatically and
  there wasn't a way to do that.  I'm afraid I no longer remember (or have
  access to the resources that would aid my memory since I changed
  employers) the details of their requirement.  However, I did do this on my
  own time because I thought it was an interesting project for me and nobody
  really gave any feedback then as to its utility, so as far as I am
  concerned you could roll back the main thread maps information since the
  information is available in the thread-specific files"

Signed-off-by: Johannes Weiner 
Cc: "Kirill A. Shutemov" 
Cc: Siddhesh Poyarekar 
Cc: Shaohua Li 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

numa: fix /proc//numa_maps for THP

2016-05-04T21:48:49+00:00

commit 28093f9f34cedeaea0f481c58446d9dac6dd620f upstream.

In gather_pte_stats() a THP pmd is cast into a pte, which is wrong
because the layouts may differ depending on the architecture.  On s390
this will lead to inaccurate numa_maps accounting in /proc because of
misguided pte_present() and pte_dirty() checks on the fake pte.

On other architectures pte_present() and pte_dirty() may work by chance,
but there may be an issue with direct-access (dax) mappings w/o
underlying struct pages when HAVE_PTE_SPECIAL is set and THP is
available.  In vm_normal_page() the fake pte will be checked with
pte_special() and because there is no "special" bit in a pmd, this will
always return false and the VM_PFNMAP | VM_MIXEDMAP checking will be
skipped.  On dax mappings w/o struct pages, an invalid struct page
pointer would then be returned that can crash the kernel.

This patch fixes the numa_maps THP handling by introducing new "_pmd"
variants of the can_gather_numa_stats() and vm_normal_page() functions.

Signed-off-by: Gerald Schaefer 
Cc: Naoya Horiguchi 
Cc: "Kirill A . Shutemov" 
Cc: Konstantin Khlebnikov 
Cc: Michal Hocko 
Cc: Vlastimil Babka 
Cc: Jerome Marchand 
Cc: Johannes Weiner 
Cc: Dave Hansen 
Cc: Mel Gorman 
Cc: Dan Williams 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: Michael Holzheu 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

numa: fix /proc//numa_maps for hugetlbfs on s390

2016-02-25T20:01:22+00:00

commit 5c2ff95e41c9290d16556cd02e35b25d81be8fe0 upstream.

When working with hugetlbfs ptes (which are actually pmds) is not valid to
directly use pte functions like pte_present() because the hardware bit
layout of pmds and ptes can be different.  This is the case on s390.
Therefore we have to convert the hugetlbfs ptes first into a valid pte
encoding with huge_ptep_get().

Currently the /proc//numa_maps code uses hugetlbfs ptes without
huge_ptep_get().  On s390 this leads to the following two problems:

1) The pte_present() function returns false (instead of true) for
   PROT_NONE hugetlb ptes. Therefore PROT_NONE vmas are missing
   completely in the "numa_maps" output.

2) The pte_dirty() function always returns false for all hugetlb ptes.
   Therefore these pages are reported as "mapped=xxx" instead of
   "dirty=xxx".

Therefore use huge_ptep_get() to correctly convert the hugetlb ptes.

Signed-off-by: Michael Holzheu 
Reviewed-by: Gerald Schaefer 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm: clear_soft_dirty_pmd() requires THP

2015-11-06T03:34:48+00:00

Don't build clear_soft_dirty_pmd() if transparent huge pages are not
enabled.

Signed-off-by: Laurent Dufour 
Reviewed-by: Aneesh Kumar K.V 
Cc: Pavel Emelyanov 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: clear pte in clear_soft_dirty()

2015-11-06T03:34:48+00:00

As mentioned in the commit 56eecdb912b5 ("mm: Use ptep/pmdp_set_numa()
for updating _PAGE_NUMA bit"), architectures like ppc64 don't do tlb
flush in set_pte/pmd functions.

So when dealing with existing pte in clear_soft_dirty, the pte must be
cleared before being modified.

Signed-off-by: Laurent Dufour 
Reviewed-by: Aneesh Kumar K.V 
Cc: Pavel Emelyanov 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status

2015-11-06T03:34:48+00:00

Currently there's no easy way to get per-process usage of hugetlb pages,
which is inconvenient because userspace applications which use hugetlb
typically want to control their processes on the basis of how much memory
(including hugetlb) they use.  So this patch simply provides easy access
to the info via /proc/PID/status.

Signed-off-by: Naoya Horiguchi 
Acked-by: Joern Engel 
Acked-by: David Rientjes 
Acked-by: Michal Hocko 
Cc: Mike Kravetz 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: hugetlb: proc: add hugetlb-related fields to /proc/PID/smaps

2015-11-06T03:34:48+00:00

Currently /proc/PID/smaps provides no usage info for vma(VM_HUGETLB),
which is inconvenient when we want to know per-task or per-vma base
hugetlb usage.  To solve this, this patch adds new fields for hugetlb
usage like below:

  Size:              20480 kB
  Rss:                   0 kB
  Pss:                   0 kB
  Shared_Clean:          0 kB
  Shared_Dirty:          0 kB
  Private_Clean:         0 kB
  Private_Dirty:         0 kB
  Referenced:            0 kB
  Anonymous:             0 kB
  AnonHugePages:         0 kB
  Shared_Hugetlb:    18432 kB
  Private_Hugetlb:    2048 kB
  Swap:                  0 kB
  KernelPageSize:     2048 kB
  MMUPageSize:        2048 kB
  Locked:                0 kB
  VmFlags: rd wr mr mw me de ht

[hughd@google.com: fix Private_Hugetlb alignment ]
Signed-off-by: Naoya Horiguchi 
Acked-by: Joern Engel 
Acked-by: David Rientjes 
Acked-by: Michal Hocko 
Cc: Mike Kravetz 
Signed-off-by: Hugh Dickins 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: add architecture primitives for software dirty bit clearing

2015-10-14T12:32:05+00:00

There are primitives to create and query the software dirty bits
in a pte or pmd. But the clearing of the software dirty bits is done
in common code with x86 specific page table functions.

Add the missing architecture primitives to clear the software dirty
bits to allow the feature to be used on non-x86 systems, e.g. the
s390 architecture.

Acked-by: Cyrill Gorcunov 
Signed-off-by: Martin Schwidefsky

mm: introduce idle page tracking

2015-09-10T20:29:01+00:00

Knowing the portion of memory that is not used by a certain application or
memory cgroup (idle memory) can be useful for partitioning the system
efficiently, e.g.  by setting memory cgroup limits appropriately.
Currently, the only means to estimate the amount of idle memory provided
by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
access bit for all pages mapped to a particular process by writing 1 to
clear_refs, wait for some time, and then count smaps:Referenced.  However,
this method has two serious shortcomings:

 - it does not count unmapped file pages
 - it affects the reclaimer logic

To overcome these drawbacks, this patch introduces two new page flags,
Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
A page's Idle flag can only be set from userspace by setting bit in
/sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
and it is cleared whenever the page is accessed either through page tables
(it is cleared in page_referenced() in this case) or using the read(2)
system call (mark_page_accessed()). Thus by setting the Idle flag for
pages of a particular workload, which can be found e.g.  by reading
/proc/PID/pagemap, waiting for some time to let the workload access its
working set, and then reading the bitmap file, one can estimate the amount
of pages that are not used by the workload.

The Young page flag is used to avoid interference with the memory
reclaimer.  A page's Young flag is set whenever the Access bit of a page
table entry pointing to the page is cleared by writing to the bitmap file.
If page_referenced() is called on a Young page, it will add 1 to its
return value, therefore concealing the fact that the Access bit was
cleared.

Note, since there is no room for extra page flags on 32 bit, this feature
uses extended page flags when compiled on 32 bit.

[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: kpageidle requires an MMU]
[akpm@linux-foundation.org: decouple from page-flags rework]
Signed-off-by: Vladimir Davydov 
Reviewed-by: Andres Lagar-Cavilla 
Cc: Minchan Kim 
Cc: Raghavendra K T 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Greg Thelen 
Cc: Michel Lespinasse 
Cc: David Rientjes 
Cc: Pavel Emelyanov 
Cc: Cyrill Gorcunov 
Cc: Jonathan Corbet 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: /proc/pid/smaps:: show proportional swap share of the mapping

2015-09-08T22:35:28+00:00

We want to know per-process workingset size for smart memory management
on userland and we use swap(ex, zram) heavily to maximize memory
efficiency so workingset includes swap as well as RSS.

On such system, if there are lots of shared anonymous pages, it's really
hard to figure out exactly how many each process consumes memory(ie, rss
+ wap) if the system has lots of shared anonymous memory(e.g, android).

This patch introduces SwapPss field on /proc//smaps so we can get
more exact workingset size per process.

Bongkyu tested it. Result is below.

1. 50M used swap
SwapTotal: 461976 kB
SwapFree: 411192 kB

$ adb shell cat /proc/*/smaps | grep "SwapPss:" | awk '{sum += $2} END {print sum}';
48236
$ adb shell cat /proc/*/smaps | grep "Swap:" | awk '{sum += $2} END {print sum}';
141184

2. 240M used swap
SwapTotal: 461976 kB
SwapFree: 216808 kB

$ adb shell cat /proc/*/smaps | grep "SwapPss:" | awk '{sum += $2} END {print sum}';
230315
$ adb shell cat /proc/*/smaps | grep "Swap:" | awk '{sum += $2} END {print sum}';
1387744

[akpm@linux-foundation.org: simplify kunmap_atomic() call]
Signed-off-by: Minchan Kim 
Reported-by: Bongkyu Kim 
Tested-by: Bongkyu Kim 
Cc: Hugh Dickins 
Cc: Sergey Senozhatsky 
Cc: Jonathan Corbet 
Cc: Jerome Marchand 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds