From 1e9877902dc7e11d2be038371c6fbf2dfcd469d7 Mon Sep 17 00:00:00 2001 From: Dave Hansen Date: Fri, 12 Feb 2016 13:01:54 -0800 Subject: mm/gup: Introduce get_user_pages_remote() For protection keys, we need to understand whether protections should be enforced in software or not. In general, we enforce protections when working on our own task, but not when on others. We call these "current" and "remote" operations. This patch introduces a new get_user_pages() variant: get_user_pages_remote() Which is a replacement for when get_user_pages() is called on non-current tsk/mm. We also introduce a new gup flag: FOLL_REMOTE which can be used for the "__" gup variants to get this new behavior. The uprobes is_trap_at_addr() location holds mmap_sem and calls get_user_pages(current->mm) on an instruction address. This makes it a pretty unique gup caller. Being an instruction access and also really originating from the kernel (vs. the app), I opted to consider this a 'remote' access where protection keys will not be enforced. Without protection keys, this patch should not change any behavior. Signed-off-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andrea Arcangeli Cc: Andrew Morton Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Brian Gerst Cc: Dave Hansen Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Kirill A. Shutemov Cc: Linus Torvalds Cc: Naoya Horiguchi Cc: Peter Zijlstra Cc: Rik van Riel Cc: Srikar Dronamraju Cc: Vlastimil Babka Cc: jack@suse.cz Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20160212210154.3F0E51EA@viggo.jf.intel.com Signed-off-by: Ingo Molnar --- include/linux/mm.h | 5 +++++ 1 file changed, 5 insertions(+) (limited to 'include/linux') diff --git a/include/linux/mm.h b/include/linux/mm.h index b1d4b8c7f7cd..faf3b709eead 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1225,6 +1225,10 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start, unsigned long nr_pages, unsigned int foll_flags, struct page **pages, struct vm_area_struct **vmas, int *nonblocking); +long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm, + unsigned long start, unsigned long nr_pages, + int write, int force, struct page **pages, + struct vm_area_struct **vmas); long get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start, unsigned long nr_pages, int write, int force, struct page **pages, @@ -2170,6 +2174,7 @@ static inline struct page *follow_page(struct vm_area_struct *vma, #define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */ #define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */ #define FOLL_MLOCK 0x1000 /* lock present pages */ +#define FOLL_REMOTE 0x2000 /* we are working on non-current tsk/mm */ typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr, void *data); -- cgit v1.2.3 From cde70140fed8429acf7a14e2e2cbd3e329036653 Mon Sep 17 00:00:00 2001 From: Dave Hansen Date: Fri, 12 Feb 2016 13:01:55 -0800 Subject: mm/gup: Overload get_user_pages() functions The concept here was a suggestion from Ingo. The implementation horrors are all mine. This allows get_user_pages(), get_user_pages_unlocked(), and get_user_pages_locked() to be called with or without the leading tsk/mm arguments. We will give a compile-time warning about the old style being __deprecated and we will also WARN_ON() if the non-remote version is used for a remote-style access. Doing this, folks will get nice warnings and will not break the build. This should be nice for -next and will hopefully let developers fix up their own code instead of maintainers needing to do it at merge time. The way we do this is hideous. It uses the __VA_ARGS__ macro functionality to call different functions based on the number of arguments passed to the macro. There's an additional hack to ensure that our EXPORT_SYMBOL() of the deprecated symbols doesn't trigger a warning. We should be able to remove this mess as soon as -rc1 hits in the release after this is merged. Signed-off-by: Dave Hansen Cc: Al Viro Cc: Alexander Kuleshov Cc: Andrea Arcangeli Cc: Andrew Morton Cc: Dan Williams Cc: Dave Hansen Cc: Dominik Dingel Cc: Geliang Tang Cc: Jan Kara Cc: Johannes Weiner Cc: Kirill A. Shutemov Cc: Konstantin Khlebnikov Cc: Leon Romanovsky Cc: Linus Torvalds Cc: Masahiro Yamada Cc: Mateusz Guzik Cc: Maxime Coquelin Cc: Michal Hocko Cc: Naoya Horiguchi Cc: Oleg Nesterov Cc: Paul Gortmaker Cc: Peter Zijlstra Cc: Srikar Dronamraju Cc: Thomas Gleixner Cc: Vladimir Davydov Cc: Vlastimil Babka Cc: Xie XiuQi Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20160212210155.73222EE1@viggo.jf.intel.com Signed-off-by: Ingo Molnar --- include/linux/mm.h | 74 ++++++++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 64 insertions(+), 10 deletions(-) (limited to 'include/linux') diff --git a/include/linux/mm.h b/include/linux/mm.h index faf3b709eead..4c7317828fda 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1229,24 +1229,78 @@ long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm, unsigned long start, unsigned long nr_pages, int write, int force, struct page **pages, struct vm_area_struct **vmas); -long get_user_pages(struct task_struct *tsk, struct mm_struct *mm, - unsigned long start, unsigned long nr_pages, - int write, int force, struct page **pages, - struct vm_area_struct **vmas); -long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm, - unsigned long start, unsigned long nr_pages, - int write, int force, struct page **pages, - int *locked); +long get_user_pages6(unsigned long start, unsigned long nr_pages, + int write, int force, struct page **pages, + struct vm_area_struct **vmas); +long get_user_pages_locked6(unsigned long start, unsigned long nr_pages, + int write, int force, struct page **pages, int *locked); long __get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm, unsigned long start, unsigned long nr_pages, int write, int force, struct page **pages, unsigned int gup_flags); -long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm, - unsigned long start, unsigned long nr_pages, +long get_user_pages_unlocked5(unsigned long start, unsigned long nr_pages, int write, int force, struct page **pages); int get_user_pages_fast(unsigned long start, int nr_pages, int write, struct page **pages); +/* suppress warnings from use in EXPORT_SYMBOL() */ +#ifndef __DISABLE_GUP_DEPRECATED +#define __gup_deprecated __deprecated +#else +#define __gup_deprecated +#endif +/* + * These macros provide backward-compatibility with the old + * get_user_pages() variants which took tsk/mm. These + * functions/macros provide both compile-time __deprecated so we + * can catch old-style use and not break the build. The actual + * functions also have WARN_ON()s to let us know at runtime if + * the get_user_pages() should have been the "remote" variant. + * + * These are hideous, but temporary. + * + * If you run into one of these __deprecated warnings, look + * at how you are calling get_user_pages(). If you are calling + * it with current/current->mm as the first two arguments, + * simply remove those arguments. The behavior will be the same + * as it is now. If you are calling it on another task, use + * get_user_pages_remote() instead. + * + * Any questions? Ask Dave Hansen + */ +long +__gup_deprecated +get_user_pages8(struct task_struct *tsk, struct mm_struct *mm, + unsigned long start, unsigned long nr_pages, + int write, int force, struct page **pages, + struct vm_area_struct **vmas); +#define GUP_MACRO(_1, _2, _3, _4, _5, _6, _7, _8, get_user_pages, ...) \ + get_user_pages +#define get_user_pages(...) GUP_MACRO(__VA_ARGS__, \ + get_user_pages8, x, \ + get_user_pages6, x, x, x, x, x)(__VA_ARGS__) + +__gup_deprecated +long get_user_pages_locked8(struct task_struct *tsk, struct mm_struct *mm, + unsigned long start, unsigned long nr_pages, + int write, int force, struct page **pages, + int *locked); +#define GUPL_MACRO(_1, _2, _3, _4, _5, _6, _7, _8, get_user_pages_locked, ...) \ + get_user_pages_locked +#define get_user_pages_locked(...) GUPL_MACRO(__VA_ARGS__, \ + get_user_pages_locked8, x, \ + get_user_pages_locked6, x, x, x, x)(__VA_ARGS__) + +__gup_deprecated +long get_user_pages_unlocked7(struct task_struct *tsk, struct mm_struct *mm, + unsigned long start, unsigned long nr_pages, + int write, int force, struct page **pages); +#define GUPU_MACRO(_1, _2, _3, _4, _5, _6, _7, get_user_pages_unlocked, ...) \ + get_user_pages_unlocked +#define get_user_pages_unlocked(...) GUPU_MACRO(__VA_ARGS__, \ + get_user_pages_unlocked7, x, \ + get_user_pages_unlocked5, x, x, x, x)(__VA_ARGS__) + /* Container for pinned pfns / pages */ struct frame_vector { unsigned int nr_allocated; /* Number of frames we have space for */ -- cgit v1.2.3 From 63c17fb8e5a46a16e10e82005748837fd11a2024 Mon Sep 17 00:00:00 2001 From: Dave Hansen Date: Fri, 12 Feb 2016 13:02:08 -0800 Subject: mm/core, x86/mm/pkeys: Store protection bits in high VMA flags vma->vm_flags is an 'unsigned long', so has space for 32 flags on 32-bit architectures. The high 32 bits are unused on 64-bit platforms. We've steered away from using the unused high VMA bits for things because we would have difficulty supporting it on 32-bit. Protection Keys are not available in 32-bit mode, so there is no concern about supporting this feature in 32-bit mode or on 32-bit CPUs. This patch carves out 4 bits from the high half of vma->vm_flags and allows architectures to set config option to make them available. Sparse complains about these constants unless we explicitly call them "UL". Signed-off-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andrew Morton Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Brian Gerst Cc: Dan Williams Cc: Dave Hansen Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Jan Kara Cc: Kirill A. Shutemov Cc: Konstantin Khlebnikov Cc: Linus Torvalds Cc: Mel Gorman Cc: Michal Hocko Cc: Oleg Nesterov Cc: Peter Zijlstra Cc: Rik van Riel Cc: Sasha Levin Cc: Valentin Rothberg Cc: Vladimir Davydov Cc: Vlastimil Babka Cc: Xie XiuQi Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20160212210208.81AF00D5@viggo.jf.intel.com Signed-off-by: Ingo Molnar --- include/linux/mm.h | 11 +++++++++++ 1 file changed, 11 insertions(+) (limited to 'include/linux') diff --git a/include/linux/mm.h b/include/linux/mm.h index 4c7317828fda..54d173bcc327 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -170,6 +170,17 @@ extern unsigned int kobjsize(const void *objp); #define VM_NOHUGEPAGE 0x40000000 /* MADV_NOHUGEPAGE marked this vma */ #define VM_MERGEABLE 0x80000000 /* KSM may merge identical pages */ +#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS +#define VM_HIGH_ARCH_BIT_0 32 /* bit only usable on 64-bit architectures */ +#define VM_HIGH_ARCH_BIT_1 33 /* bit only usable on 64-bit architectures */ +#define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit architectures */ +#define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit architectures */ +#define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0) +#define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1) +#define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2) +#define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3) +#endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */ + #if defined(CONFIG_X86) # define VM_PAT VM_ARCH_1 /* PAT reserves whole VMA at once (x86) */ #elif defined(CONFIG_PPC) -- cgit v1.2.3 From 8f62c883222c9e3c06d60b5e55e307a3d1f18257 Mon Sep 17 00:00:00 2001 From: Dave Hansen Date: Fri, 12 Feb 2016 13:02:10 -0800 Subject: x86/mm/pkeys: Add arch-specific VMA protection bits Lots of things seem to do: vma->vm_page_prot = vm_get_page_prot(flags); and the ptes get created right from things we pull out of ->vm_page_prot. So it is very convenient if we can store the protection key in flags and vm_page_prot, just like the existing permission bits (_PAGE_RW/PRESENT). It greatly reduces the amount of plumbing and arch-specific hacking we have to do in generic code. This also takes the new PROT_PKEY{0,1,2,3} flags and turns *those* in to VM_ flags for vma->vm_flags. The protection key values are stored in 4 places: 1. "prot" argument to system calls 2. vma->vm_flags, filled from the mmap "prot" 3. vma->vm_page prot, filled from vma->vm_flags 4. the PTE itself. The pseudocode for these for steps are as follows: mmap(PROT_PKEY*) vma->vm_flags = ... | arch_calc_vm_prot_bits(mmap_prot); vma->vm_page_prot = ... | arch_vm_get_page_prot(vma->vm_flags); pte = pfn | vma->vm_page_prot Note that this provides a new definitions for x86: arch_vm_get_page_prot() Signed-off-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andrew Morton Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Brian Gerst Cc: Dave Hansen Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Rik van Riel Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20160212210210.FE483A42@viggo.jf.intel.com Signed-off-by: Ingo Molnar --- include/linux/mm.h | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'include/linux') diff --git a/include/linux/mm.h b/include/linux/mm.h index 54d173bcc327..3056369bab1d 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -183,6 +183,13 @@ extern unsigned int kobjsize(const void *objp); #if defined(CONFIG_X86) # define VM_PAT VM_ARCH_1 /* PAT reserves whole VMA at once (x86) */ +#if defined (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) +# define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0 +# define VM_PKEY_BIT0 VM_HIGH_ARCH_0 /* A protection key is a 4-bit value */ +# define VM_PKEY_BIT1 VM_HIGH_ARCH_1 +# define VM_PKEY_BIT2 VM_HIGH_ARCH_2 +# define VM_PKEY_BIT3 VM_HIGH_ARCH_3 +#endif #elif defined(CONFIG_PPC) # define VM_SAO VM_ARCH_1 /* Strong Access Ordering (powerpc) */ #elif defined(CONFIG_PARISC) -- cgit v1.2.3 From 1b2ee1266ea647713dbaf44825967c180dfc8d76 Mon Sep 17 00:00:00 2001 From: Dave Hansen Date: Fri, 12 Feb 2016 13:02:21 -0800 Subject: mm/core: Do not enforce PKEY permissions on remote mm access We try to enforce protection keys in software the same way that we do in hardware. (See long example below). But, we only want to do this when accessing our *own* process's memory. If GDB set PKRU[6].AD=1 (disable access to PKEY 6), then tried to PTRACE_POKE a target process which just happened to have some mprotect_pkey(pkey=6) memory, we do *not* want to deny the debugger access to that memory. PKRU is fundamentally a thread-local structure and we do not want to enforce it on access to _another_ thread's data. This gets especially tricky when we have workqueues or other delayed-work mechanisms that might run in a random process's context. We can check that we only enforce pkeys when operating on our *own* mm, but delayed work gets performed when a random user context is active. We might end up with a situation where a delayed-work gup fails when running randomly under its "own" task but succeeds when running under another process. We want to avoid that. To avoid that, we use the new GUP flag: FOLL_REMOTE and add a fault flag: FAULT_FLAG_REMOTE. They indicate that we are walking an mm which is not guranteed to be the same as current->mm and should not be subject to protection key enforcement. Thanks to Jerome Glisse for pointing out this scenario. Signed-off-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Alexey Kardashevskiy Cc: Andrea Arcangeli Cc: Andrew Morton Cc: Andy Lutomirski Cc: Andy Lutomirski Cc: Arnd Bergmann Cc: Benjamin Herrenschmidt Cc: Boaz Harrosh Cc: Borislav Petkov Cc: Brian Gerst Cc: Dan Williams Cc: Dave Chinner Cc: Dave Hansen Cc: David Gibson Cc: Denys Vlasenko Cc: Dominik Dingel Cc: Dominik Vogt Cc: Eric B Munson Cc: Geliang Tang Cc: Guan Xuetao Cc: H. Peter Anvin Cc: Heiko Carstens Cc: Hugh Dickins Cc: Jan Kara Cc: Jason Low Cc: Jerome Marchand Cc: Joerg Roedel Cc: Kirill A. Shutemov Cc: Konstantin Khlebnikov Cc: Laurent Dufour Cc: Linus Torvalds Cc: Martin Schwidefsky Cc: Matthew Wilcox Cc: Mel Gorman Cc: Michael Ellerman Cc: Michal Hocko Cc: Mikulas Patocka Cc: Minchan Kim Cc: Oleg Nesterov Cc: Paul Mackerras Cc: Peter Zijlstra Cc: Rik van Riel Cc: Sasha Levin Cc: Shachar Raindel Cc: Vlastimil Babka Cc: Xie XiuQi Cc: iommu@lists.linux-foundation.org Cc: linux-arch@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Cc: linux-s390@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: Ingo Molnar --- include/linux/mm.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/linux') diff --git a/include/linux/mm.h b/include/linux/mm.h index 3056369bab1d..2aaa0f0d67ea 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -251,6 +251,7 @@ extern pgprot_t protection_map[16]; #define FAULT_FLAG_KILLABLE 0x10 /* The fault task is in SIGKILL killable region */ #define FAULT_FLAG_TRIED 0x20 /* Second try */ #define FAULT_FLAG_USER 0x40 /* The fault originated in userspace */ +#define FAULT_FLAG_REMOTE 0x80 /* faulting for non current tsk/mm */ /* * vm_fault is filled by the the pagefault handler and passed to the vma's -- cgit v1.2.3 From d61172b4b695b821388cdb6088a41d431bcbb93b Mon Sep 17 00:00:00 2001 From: Dave Hansen Date: Fri, 12 Feb 2016 13:02:24 -0800 Subject: mm/core, x86/mm/pkeys: Differentiate instruction fetches As discussed earlier, we attempt to enforce protection keys in software. However, the code checks all faults to ensure that they are not violating protection key permissions. It was assumed that all faults are either write faults where we check PKRU[key].WD (write disable) or read faults where we check the AD (access disable) bit. But, there is a third category of faults for protection keys: instruction faults. Instruction faults never run afoul of protection keys because they do not affect instruction fetches. So, plumb the PF_INSTR bit down in to the arch_vma_access_permitted() function where we do the protection key checks. We also add a new FAULT_FLAG_INSTRUCTION. This is because handle_mm_fault() is not passed the architecture-specific error_code where we keep PF_INSTR, so we need to encode the instruction fetch information in to the arch-generic fault flags. Signed-off-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andrew Morton Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Brian Gerst Cc: Dave Hansen Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Rik van Riel Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20160212210224.96928009@viggo.jf.intel.com Signed-off-by: Ingo Molnar --- include/linux/mm.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/linux') diff --git a/include/linux/mm.h b/include/linux/mm.h index 2aaa0f0d67ea..7955c3eb83db 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -252,6 +252,7 @@ extern pgprot_t protection_map[16]; #define FAULT_FLAG_TRIED 0x20 /* Second try */ #define FAULT_FLAG_USER 0x40 /* The fault originated in userspace */ #define FAULT_FLAG_REMOTE 0x80 /* faulting for non current tsk/mm */ +#define FAULT_FLAG_INSTRUCTION 0x100 /* The fault was during an instruction fetch */ /* * vm_fault is filled by the the pagefault handler and passed to the vma's -- cgit v1.2.3 From e6bfb70959a0ca6ddedb29e779a293c6f71ed0e7 Mon Sep 17 00:00:00 2001 From: Dave Hansen Date: Fri, 12 Feb 2016 13:02:31 -0800 Subject: mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This plumbs a protection key through calc_vm_flag_bits(). We could have done this in calc_vm_prot_bits(), but I did not feel super strongly which way to go. It was pretty arbitrary which one to use. Signed-off-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andrea Arcangeli Cc: Andrew Morton Cc: Andy Lutomirski Cc: Arve Hjønnevåg Cc: Benjamin Herrenschmidt Cc: Borislav Petkov Cc: Brian Gerst Cc: Chen Gang Cc: Dan Williams Cc: Dave Chinner Cc: Dave Hansen Cc: David Airlie Cc: Denys Vlasenko Cc: Eric W. Biederman Cc: Geliang Tang Cc: Greg Kroah-Hartman Cc: H. Peter Anvin Cc: Kirill A. Shutemov Cc: Konstantin Khlebnikov Cc: Leon Romanovsky Cc: Linus Torvalds Cc: Masahiro Yamada Cc: Maxime Coquelin Cc: Mel Gorman Cc: Michael Ellerman Cc: Oleg Nesterov Cc: Paul Gortmaker Cc: Paul Mackerras Cc: Peter Zijlstra Cc: Rik van Riel Cc: Riley Andrews Cc: Vladimir Davydov Cc: devel@driverdev.osuosl.org Cc: linux-api@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Cc: linuxppc-dev@lists.ozlabs.org Link: http://lkml.kernel.org/r/20160212210231.E6F1F0D6@viggo.jf.intel.com Signed-off-by: Ingo Molnar --- include/linux/mman.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'include/linux') diff --git a/include/linux/mman.h b/include/linux/mman.h index 16373c8f5f57..33e17f6a327a 100644 --- a/include/linux/mman.h +++ b/include/linux/mman.h @@ -35,7 +35,7 @@ static inline void vm_unacct_memory(long pages) */ #ifndef arch_calc_vm_prot_bits -#define arch_calc_vm_prot_bits(prot) 0 +#define arch_calc_vm_prot_bits(prot, pkey) 0 #endif #ifndef arch_vm_get_page_prot @@ -70,12 +70,12 @@ static inline int arch_validate_prot(unsigned long prot) * Combine the mmap "prot" argument into "vm_flags" used internally. */ static inline unsigned long -calc_vm_prot_bits(unsigned long prot) +calc_vm_prot_bits(unsigned long prot, unsigned long pkey) { return _calc_vm_trans(prot, PROT_READ, VM_READ ) | _calc_vm_trans(prot, PROT_WRITE, VM_WRITE) | _calc_vm_trans(prot, PROT_EXEC, VM_EXEC) | - arch_calc_vm_prot_bits(prot); + arch_calc_vm_prot_bits(prot, pkey); } /* -- cgit v1.2.3 From 66d375709d2c891acc639538fd3179fa0cbb0daf Mon Sep 17 00:00:00 2001 From: Dave Hansen Date: Fri, 12 Feb 2016 13:02:32 -0800 Subject: mm/core, x86/mm/pkeys: Add arch_validate_pkey() The syscall-level code is passed a protection key and need to return an appropriate error code if the protection key is bogus. We will be using this in subsequent patches. Note that this also begins a series of arch-specific calls that we need to expose in otherwise arch-independent code. We create a linux/pkeys.h header where we will put *all* the stubs for these functions. Signed-off-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andrew Morton Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Brian Gerst Cc: Dave Hansen Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Rik van Riel Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20160212210232.774EEAAB@viggo.jf.intel.com Signed-off-by: Ingo Molnar --- include/linux/pkeys.h | 25 +++++++++++++++++++++++++ 1 file changed, 25 insertions(+) create mode 100644 include/linux/pkeys.h (limited to 'include/linux') diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h new file mode 100644 index 000000000000..55e465f93a28 --- /dev/null +++ b/include/linux/pkeys.h @@ -0,0 +1,25 @@ +#ifndef _LINUX_PKEYS_H +#define _LINUX_PKEYS_H + +#include +#include + +#ifdef CONFIG_ARCH_HAS_PKEYS +#include +#else /* ! CONFIG_ARCH_HAS_PKEYS */ +#define arch_max_pkey() (1) +#endif /* ! CONFIG_ARCH_HAS_PKEYS */ + +/* + * This is called from mprotect_pkey(). + * + * Returns true if the protection keys is valid. + */ +static inline bool validate_pkey(int pkey) +{ + if (pkey < 0) + return false; + return (pkey < arch_max_pkey()); +} + +#endif /* _LINUX_PKEYS_H */ -- cgit v1.2.3 From 8459429693395ca9e8d18101300b120ad9171795 Mon Sep 17 00:00:00 2001 From: Dave Hansen Date: Fri, 12 Feb 2016 13:02:36 -0800 Subject: x86/mm/pkeys: Allow kernel to modify user pkey rights register The Protection Key Rights for User memory (PKRU) is a 32-bit user-accessible register. It contains two bits for each protection key: one to write-disable (WD) access to memory covered by the key and another to access-disable (AD). Userspace can read/write the register with the RDPKRU and WRPKRU instructions. But, the register is saved and restored with the XSAVE family of instructions, which means we have to treat it like a floating point register. The kernel needs to write to the register if it wants to implement execute-only memory or if it implements a system call to change PKRU. To do this, we need to create a 'pkru_state' buffer, read the old contents in to it, modify it, and then tell the FPU code that there is modified data in there so it can (possibly) move the buffer back in to the registers. This uses the fpu__xfeature_set_state() function that we defined in the previous patch. Signed-off-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andrew Morton Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Brian Gerst Cc: Dave Hansen Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Rik van Riel Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20160212210236.0BE13217@viggo.jf.intel.com Signed-off-by: Ingo Molnar --- include/linux/pkeys.h | 5 +++++ 1 file changed, 5 insertions(+) (limited to 'include/linux') diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h index 55e465f93a28..fc325b367bd0 100644 --- a/include/linux/pkeys.h +++ b/include/linux/pkeys.h @@ -4,6 +4,11 @@ #include #include +#define PKEY_DISABLE_ACCESS 0x1 +#define PKEY_DISABLE_WRITE 0x2 +#define PKEY_ACCESS_MASK (PKEY_DISABLE_ACCESS |\ + PKEY_DISABLE_WRITE) + #ifdef CONFIG_ARCH_HAS_PKEYS #include #else /* ! CONFIG_ARCH_HAS_PKEYS */ -- cgit v1.2.3 From 62b5f7d013fc455b8db26cf01e421f4c0d264b92 Mon Sep 17 00:00:00 2001 From: Dave Hansen Date: Fri, 12 Feb 2016 13:02:40 -0800 Subject: mm/core, x86/mm/pkeys: Add execute-only protection keys support Protection keys provide new page-based protection in hardware. But, they have an interesting attribute: they only affect data accesses and never affect instruction fetches. That means that if we set up some memory which is set as "access-disabled" via protection keys, we can still execute from it. This patch uses protection keys to set up mappings to do just that. If a user calls: mmap(..., PROT_EXEC); or mprotect(ptr, sz, PROT_EXEC); (note PROT_EXEC-only without PROT_READ/WRITE), the kernel will notice this, and set a special protection key on the memory. It also sets the appropriate bits in the Protection Keys User Rights (PKRU) register so that the memory becomes unreadable and unwritable. I haven't found any userspace that does this today. With this facility in place, we expect userspace to move to use it eventually. Userspace _could_ start doing this today. Any PROT_EXEC calls get converted to PROT_READ inside the kernel, and would transparently be upgraded to "true" PROT_EXEC with this code. IOW, userspace never has to do any PROT_EXEC runtime detection. This feature provides enhanced protection against leaking executable memory contents. This helps thwart attacks which are attempting to find ROP gadgets on the fly. But, the security provided by this approach is not comprehensive. The PKRU register which controls access permissions is a normal user register writable from unprivileged userspace. An attacker who can execute the 'wrpkru' instruction can easily disable the protection provided by this feature. The protection key that is used for execute-only support is permanently dedicated at compile time. This is fine for now because there is currently no API to set a protection key other than this one. Despite there being a constant PKRU value across the entire system, we do not set it unless this feature is in use in a process. That is to preserve the PKRU XSAVE 'init state', which can lead to faster context switches. PKRU *is* a user register and the kernel is modifying it. That means that code doing: pkru = rdpkru() pkru |= 0x100; mmap(..., PROT_EXEC); wrpkru(pkru); could lose the bits in PKRU that enforce execute-only permissions. To avoid this, we suggest avoiding ever calling mmap() or mprotect() when the PKRU value is expected to be unstable. Signed-off-by: Dave Hansen Reviewed-by: Thomas Gleixner Cc: Andrea Arcangeli Cc: Andrew Morton Cc: Andy Lutomirski Cc: Andy Lutomirski Cc: Aneesh Kumar K.V Cc: Borislav Petkov Cc: Borislav Petkov Cc: Brian Gerst Cc: Chen Gang Cc: Dan Williams Cc: Dave Chinner Cc: Dave Hansen Cc: David Hildenbrand Cc: Denys Vlasenko Cc: H. Peter Anvin Cc: Kees Cook Cc: Kirill A. Shutemov Cc: Konstantin Khlebnikov Cc: Linus Torvalds Cc: Mel Gorman Cc: Oleg Nesterov Cc: Peter Zijlstra Cc: Piotr Kwapulinski Cc: Rik van Riel Cc: Stephen Smalley Cc: Vladimir Murzin Cc: Will Deacon Cc: keescook@google.com Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20160212210240.CB4BB5CA@viggo.jf.intel.com Signed-off-by: Ingo Molnar --- include/linux/pkeys.h | 3 +++ 1 file changed, 3 insertions(+) (limited to 'include/linux') diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h index fc325b367bd0..1d405a2b7272 100644 --- a/include/linux/pkeys.h +++ b/include/linux/pkeys.h @@ -13,6 +13,9 @@ #include #else /* ! CONFIG_ARCH_HAS_PKEYS */ #define arch_max_pkey() (1) +#define execute_only_pkey(mm) (0) +#define arch_override_mprotect_pkey(vma, prot, pkey) (0) +#define PKEY_DEDICATED_EXECUTE_ONLY 0 #endif /* ! CONFIG_ARCH_HAS_PKEYS */ /* -- cgit v1.2.3