linux-toradex.git/fs/proc, branch v4.6-rc6

numa: fix /proc//numa_maps for THP

2016-04-29T02:34:04+00:00

In gather_pte_stats() a THP pmd is cast into a pte, which is wrong
because the layouts may differ depending on the architecture.  On s390
this will lead to inaccurate numa_maps accounting in /proc because of
misguided pte_present() and pte_dirty() checks on the fake pte.

On other architectures pte_present() and pte_dirty() may work by chance,
but there may be an issue with direct-access (dax) mappings w/o
underlying struct pages when HAVE_PTE_SPECIAL is set and THP is
available.  In vm_normal_page() the fake pte will be checked with
pte_special() and because there is no "special" bit in a pmd, this will
always return false and the VM_PFNMAP | VM_MIXEDMAP checking will be
skipped.  On dax mappings w/o struct pages, an invalid struct page
pointer would then be returned that can crash the kernel.

This patch fixes the numa_maps THP handling by introducing new "_pmd"
variants of the can_gather_numa_stats() and vm_normal_page() functions.

Signed-off-by: Gerald Schaefer 
Cc: Naoya Horiguchi 
Cc: "Kirill A . Shutemov" 
Cc: Konstantin Khlebnikov 
Cc: Michal Hocko 
Cc: Vlastimil Babka 
Cc: Jerome Marchand 
Cc: Johannes Weiner 
Cc: Dave Hansen 
Cc: Mel Gorman 
Cc: Dan Williams 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: Michael Holzheu 
Cc: 	[4.3+]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros

2016-04-04T17:41:08+00:00

PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized.  And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE.  And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special.  They are
not.

The changes are pretty straight-forward:

 -  << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> ;

 -  >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> ;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -> get_page();

 - page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below.  For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov 
Acked-by: Michal Hocko 
Signed-off-by: Linus Torvalds

Merge branch 'for-4.6-ns' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

2016-03-21T17:05:13+00:00

Pull cgroup namespace support from Tejun Heo:
 "These are changes to implement namespace support for cgroup which has
  been pending for quite some time now.  It is very straight-forward and
  only affects what part of cgroup hierarchies are visible.

  After unsharing, mounting a cgroup fs will be scoped to the cgroups
  the task belonged to at the time of unsharing and the cgroup paths
  exposed to userland would be adjusted accordingly"

* 'for-4.6-ns' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: fix and restructure error handling in copy_cgroup_ns()
  cgroup: fix alloc_cgroup_ns() error handling in copy_cgroup_ns()
  Add FS_USERNS_FLAG to cgroup fs
  cgroup: Add documentation for cgroup namespaces
  cgroup: mount cgroupns-root when inside non-init cgroupns
  kernfs: define kernfs_node_dentry
  cgroup: cgroup namespace setns support
  cgroup: introduce cgroup namespaces
  sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
  kernfs: Add API to generate relative kernfs path

Merge branch 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2016-03-21T02:08:56+00:00

Pull x86 protection key support from Ingo Molnar:
 "This tree adds support for a new memory protection hardware feature
  that is available in upcoming Intel CPUs: 'protection keys' (pkeys).

  There's a background article at LWN.net:

      https://lwn.net/Articles/643797/

  The gist is that protection keys allow the encoding of
  user-controllable permission masks in the pte.  So instead of having a
  fixed protection mask in the pte (which needs a system call to change
  and works on a per page basis), the user can map a (handful of)
  protection mask variants and can change the masks runtime relatively
  cheaply, without having to change every single page in the affected
  virtual memory range.

  This allows the dynamic switching of the protection bits of large
  amounts of virtual memory, via user-space instructions.  It also
  allows more precise control of MMU permission bits: for example the
  executable bit is separate from the read bit (see more about that
  below).

  This tree adds the MM infrastructure and low level x86 glue needed for
  that, plus it adds a high level API to make use of protection keys -
  if a user-space application calls:

        mmap(..., PROT_EXEC);

  or

        mprotect(ptr, sz, PROT_EXEC);

  (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
  this special case, and will set a special protection key on this
  memory range.  It also sets the appropriate bits in the Protection
  Keys User Rights (PKRU) register so that the memory becomes unreadable
  and unwritable.

  So using protection keys the kernel is able to implement 'true'
  PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
  PROT_READ as well.  Unreadable executable mappings have security
  advantages: they cannot be read via information leaks to figure out
  ASLR details, nor can they be scanned for ROP gadgets - and they
  cannot be used by exploits for data purposes either.

  We know about no user-space code that relies on pure PROT_EXEC
  mappings today, but binary loaders could start making use of this new
  feature to map binaries and libraries in a more secure fashion.

  There is other pending pkeys work that offers more high level system
  call APIs to manage protection keys - but those are not part of this
  pull request.

  Right now there's a Kconfig that controls this feature
  (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
  (like most x86 CPU feature enablement code that has no runtime
  overhead), but it's not user-configurable at the moment.  If there's
  any serious problem with this then we can make it configurable and/or
  flip the default"

* 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
  x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
  mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
  x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
  mm/core, x86/mm/pkeys: Add execute-only protection keys support
  x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
  x86/mm/pkeys: Allow kernel to modify user pkey rights register
  x86/fpu: Allow setting of XSAVE state
  x86/mm: Factor out LDT init from context init
  mm/core, x86/mm/pkeys: Add arch_validate_pkey()
  mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
  x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
  x86/mm/pkeys: Add Kconfig prompt to existing config option
  x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
  x86/mm/pkeys: Dump PKRU with other kernel registers
  mm/core, x86/mm/pkeys: Differentiate instruction fetches
  x86/mm/pkeys: Optimize fault handling in access_error()
  mm/core: Do not enforce PKEY permissions on remote mm access
  um, pkeys: Add UML arch_*_access_permitted() methods
  mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
  x86/mm/gup: Simplify get_user_pages() PTE bit handling
  ...

proc-vmcore: wrong data type casting fix

2016-03-17T22:09:34+00:00

On i686 PAE enabled machine the contiguous physical area could be large
and it can cause trimming down variables in below calculation in
read_vmcore() and mmap_vmcore():

	tsz = min_t(size_t, m->offset + m->size - *fpos, buflen);

That is, the types being used is like below on i686:
m->offset: unsigned long long int
m->size:   unsigned long long int
*fpos:     loff_t (long long int)
buflen:    size_t (unsigned int)

So casting (m->offset + m->size - *fpos) by size_t means truncating a
given value by 4GB.

Suppose (m->offset + m->size - *fpos) being truncated to 0, buflen >0
then we will get tsz = 0.  It is of course not an expected result.
Similarly we could also get other truncated values less than buflen.
Then the real size passed down is not correct any more.

If (m->offset + m->size - *fpos) is above 4GB, read_vmcore or
mmap_vmcore use the min_t result with truncated values being compared to
buflen.  Then, fpos proceeds with the wrong value so that we reach below
bugs:

1) read_vmcore will refuse to continue so makedumpfile fails.
2) mmap_vmcore will trigger BUG_ON() in remap_pfn_range().

Use unsigned long long in min_t instead so that the variables in are not
truncated.

Signed-off-by: Baoquan He 
Signed-off-by: Dave Young 
Cc: HATAYAMA Daisuke 
Cc: Vivek Goyal 
Cc: Jianyu Zhan 
Cc: Minfei Huang 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

proc/base: make prompt shell start from new line after executing "cat /proc/$pid/wchan"

2016-03-17T22:09:34+00:00

It is not elegant that prompt shell does not start from new line after
executing "cat /proc/$pid/wchan".  Make prompt shell start from new
line.

Signed-off-by: Minfei Huang 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

procfs: add conditional compilation check

2016-03-17T22:09:34+00:00

`proc_timers_operations` is only used when CONFIG_CHECKPOINT_RESTORE is
enabled.

Signed-off-by: Eric Engestrom 
Acked-by: Cyrill Gorcunov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

proc: add /proc//timerslack_ns interface

2016-03-17T22:09:34+00:00

This patch provides a proc/PID/timerslack_ns interface which exposes a
task's timerslack value in nanoseconds and allows it to be changed.

This allows power/performance management software to set timer slack for
other threads according to its policy for the thread (such as when the
thread is designated foreground vs.  background activity)

If the value written is non-zero, slack is set to that value.  Otherwise
sets it to the default for the thread.

This interface checks that the calling task has permissions to to use
PTRACE_MODE_ATTACH_FSCREDS on the target task, so that we can ensure
arbitrary apps do not change the timer slack for other apps.

Signed-off-by: John Stultz 
Acked-by: Kees Cook 
Cc: Arjan van de Ven 
Cc: Thomas Gleixner 
Cc: Oren Laadan 
Cc: Ruchi Kandoi 
Cc: Rom Lemarchand 
Cc: Android Kernel Team 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/page_alloc.c: calculate 'available' memory in a separate function

2016-03-17T22:09:34+00:00

Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory
statistics protocol, corresponding to 'Available' in /proc/meminfo.

It indicates to the hypervisor how big the balloon can be inflated
without pushing the guest system to swap.  This metric would be very
useful in VM orchestration software to improve memory management of
different VMs under overcommit.

This patch (of 2):

Factor out calculation of the available memory counter into a separate
exportable function, in order to be able to use it in other parts of the
kernel.

In particular, it appears a relevant metric to report to the hypervisor
via virtio-balloon statistics interface (in a followup patch).

Signed-off-by: Igor Redko 
Signed-off-by: Denis V. Lunev 
Reviewed-by: Roman Kagan 
Cc: Michael S. Tsirkin 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

/proc/kpageflags: return KPF_SLAB for slab tail pages

2016-03-17T22:09:34+00:00

Currently /proc/kpageflags returns just KPF_COMPOUND_TAIL for slab tail
pages, which is inconvenient when grasping how slab pages are
distributed (userspace always needs to check which kind of tail pages by
itself).  This patch sets KPF_SLAB for such pages.

With this patch:

  $ grep Slab /proc/meminfo ; tools/vm/page-types -b slab
  Slab:              64880 kB
               flags      page-count       MB  symbolic-flags                     long-symbolic-flags
  0x0000000000000080           16220       63  _______S__________________________________ slab
               total           16220       63

16220 pages equals to 64880 kB, so returned result is consistent with the
global counter.

Signed-off-by: Naoya Horiguchi 
Reviewed-by: Vladimir Davydov 
Cc: Konstantin Khlebnikov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds