From 3f0e131221eb951c45c93d1cce9db73889be2a5e Mon Sep 17 00:00:00 2001
From: Dan Streetman
Date: Wed, 9 Sep 2015 15:35:16 -0700
Subject: zpool: add zpool_has_pool()

This series makes creation of the zpool and compressor dynamic, so that
they can be changed at runtime.  This makes using/configuring zswap
easier; previously, zswap had to be configured at boot time using boot
params.

This uses a single list to track the zpool and compressor together,
although Seth had mentioned an alternative, which is to track the zpools
and compressors using separate lists.  In the most common case (a single
zpool and a single compressor) using one list is slightly simpler than
using two lists, and for the uncommon case of multiple zpools and/or
compressors, using one list is slightly less simple (and probably uses
slightly more memory) than using two lists.

This patch (of 4):

Add a zpool_has_pool() function that indicates whether the specified
type of zpool (i.e. zsmalloc or zbud) is available.  This allows checking
whether a pool is available without actually trying to allocate it,
similar to crypto_has_alg().

This is used by a following patch to zswap that enables dynamic runtime
creation of zswap zpools.

Signed-off-by: Dan Streetman
Acked-by: Seth Jennings
Cc: Sergey Senozhatsky
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 include/linux/zpool.h | 2 ++
 1 file changed, 2 insertions(+)

(limited to 'include/linux')

diff --git a/include/linux/zpool.h b/include/linux/zpool.h
index c924a28d9805..42f8ec992452 100644
--- a/include/linux/zpool.h
+++ b/include/linux/zpool.h
@@ -36,6 +36,8 @@ enum zpool_mapmode {
 	ZPOOL_MM_DEFAULT = ZPOOL_MM_RW
 };
 
+bool zpool_has_pool(char *type);
+
 struct zpool *zpool_create_pool(char *type, char *name,
 			gfp_t gfp, const struct zpool_ops *ops);
-- 
cgit v1.2.3

From 2fc045247089ad4ed611ec20cc3a736c0212bf1a Mon Sep 17 00:00:00 2001
From: Vladimir Davydov
Date: Wed, 9 Sep 2015 15:35:28 -0700
Subject: memcg: add page_cgroup_ino helper

This patchset introduces a new user API for tracking user memory pages
that have not been used for a given period of time.  The purpose of this
is to provide userspace with the means of tracking a workload's working
set, i.e. the set of pages that are actively used by the workload.
Knowing the working set size can be useful for partitioning the system
more efficiently, e.g. by tuning memory cgroup limits appropriately, or
for job placement within a compute cluster.

==== USE CASES ====

The unified cgroup hierarchy has memory.low and memory.high knobs, which
are defined as the low and high boundaries for the workload working set
size.  However, the working set size of a workload may be unknown or
change over time.  With this patch set, one can periodically estimate
the amount of memory unused by each cgroup and tune their memory.low and
memory.high parameters accordingly, therefore optimizing the overall
memory utilization.

Another use case is balancing workloads within a compute cluster.
Knowing how much memory is not really used by a workload unit may help
make a better decision when considering migrating the unit to another
node within the cluster.

Also, as noted by Minchan, this would be useful for per-process reclaim
(https://lwn.net/Articles/545668/).  With idle tracking, a smart
userspace memory manager could reclaim only idle pages.

==== USER API ====

The user API consists of two new files:

 * /sys/kernel/mm/page_idle/bitmap.  This file implements a bitmap where
   each bit corresponds to a page, indexed by PFN.
   When the bit is set, the corresponding page is idle.  A page is
   considered idle if it has not been accessed since it was marked idle.
   To mark a page idle one should set the bit corresponding to the page
   by writing to the file.  A value written to the file is OR-ed with
   the current bitmap value.  Only user memory pages can be marked idle;
   for other page types the input is silently ignored.  Writing to this
   file beyond the max PFN results in the ENXIO error.  Only available
   when CONFIG_IDLE_PAGE_TRACKING is set.

   This file can be used to estimate the number of pages that are not
   used by a particular workload as follows:

   1. mark all pages of interest idle by setting the corresponding bits
      in /sys/kernel/mm/page_idle/bitmap
   2. wait until the workload accesses its working set
   3. read /sys/kernel/mm/page_idle/bitmap and count the number of bits
      set

 * /proc/kpagecgroup.  This file contains a 64-bit inode number of the
   memory cgroup each page is charged to, indexed by PFN.  Only
   available when CONFIG_MEMCG is set.

   This file can be used to find all pages (including unmapped file
   pages) accounted to a particular cgroup.  Using
   /sys/kernel/mm/page_idle/bitmap, one can then estimate the cgroup
   working set size.

For an example of using these files to estimate the number of unused
memory pages per memory cgroup, please see the script attached below.

==== REASONING ====

The reason to introduce the new user API instead of using
/proc/PID/{clear_refs,smaps} is that the latter has two serious
drawbacks:

 - it does not count unmapped file pages
 - it affects the reclaimer logic

The new API attempts to overcome them both.  For more details on how it
is achieved, please see the comment to patch 6.

==== PATCHSET STRUCTURE ====

The patch set is organized as follows:

 - patch 1 adds the page_cgroup_ino() helper for the sake of
   /proc/kpagecgroup, and patches 2-3 do related cleanup
 - patch 4 adds /proc/kpagecgroup, which reports the cgroup ino each
   page is charged to
 - patch 5 introduces a new mmu notifier callback, clear_young, which is
   a lightweight version of clear_flush_young; it is used in patch 6
 - patch 6 implements the idle page tracking feature, including the
   userspace API, /sys/kernel/mm/page_idle/bitmap
 - patch 7 exports the idle flag via /proc/kpageflags

==== SIMILAR WORKS ====

Originally, the patch for tracking idle memory was proposed back in 2011
by Michel Lespinasse (see http://lwn.net/Articles/459269/).  The main
difference between Michel's patch and this one is that Michel
implemented a kernel space daemon for estimating idle memory size per
cgroup, while this patch only provides userspace with a minimal API for
doing the job, leaving the rest up to userspace.  However, they both
share the same idea of Idle/Young page flags to avoid affecting the
reclaimer logic.

==== PERFORMANCE EVALUATION ====

SPECjvm2008 (https://www.spec.org/jvm2008/) was used to evaluate the
performance impact introduced by this patch set.

Three runs were carried out:

 - base: kernel without the patch
 - patched: patched kernel, the feature is not used
 - patched-active: patched kernel, a daemon with a 1-minute period is
   used for tracking idle memory

For tracking idle memory, the idlememstat utility was used:
https://github.com/locker/idlememstat

testcase         base             patched          patched-active

compiler       537.40 ( 0.00)%  532.26 (-0.96)%  538.31 ( 0.17)%
compress       305.47 ( 0.00)%  301.08 (-1.44)%  300.71 (-1.56)%
crypto         284.32 ( 0.00)%  282.21 (-0.74)%  284.87 ( 0.19)%
derby          411.05 ( 0.00)%  413.44 ( 0.58)%  412.07 ( 0.25)%
mpegaudio      189.96 ( 0.00)%  190.87 ( 0.48)%  189.42 (-0.28)%
scimark.large   46.85 ( 0.00)%   46.41 (-0.94)%   47.83 ( 2.09)%
scimark.small  412.91 ( 0.00)%  415.41 ( 0.61)%  421.17 ( 2.00)%
serial         204.23 ( 0.00)%  213.46 ( 4.52)%  203.17 (-0.52)%
startup         36.76 ( 0.00)%   35.49 (-3.45)%   35.64 (-3.05)%
sunflow        115.34 ( 0.00)%  115.08 (-0.23)%  117.37 ( 1.76)%
xml            620.55 ( 0.00)%  619.95 (-0.10)%  620.39 (-0.03)%

composite      211.50 ( 0.00)%  211.15 (-0.17)%  211.67 ( 0.08)%

time idlememstat:

17.20user 65.16system 2:15:23elapsed 1%CPU (0avgtext+0avgdata 8476maxresident)k
448inputs+40outputs (1major+36052minor)pagefaults 0swaps

==== SCRIPT FOR COUNTING IDLE PAGES PER CGROUP ====

#! /usr/bin/python
#

import os
import stat
import errno
import struct

CGROUP_MOUNT = "/sys/fs/cgroup/memory"
BUFSIZE = 8 * 1024  # must be a multiple of 8


def get_hugepage_size():
    with open("/proc/meminfo", "r") as f:
        for s in f:
            k, v = s.split(":")
            if k == "Hugepagesize":
                return int(v.split()[0]) * 1024

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")
HUGEPAGE_SIZE = get_hugepage_size()


def set_idle():
    f = open("/sys/kernel/mm/page_idle/bitmap", "wb", BUFSIZE)
    while True:
        try:
            f.write(struct.pack("Q", pow(2, 64) - 1))
        except IOError as err:
            if err.errno == errno.ENXIO:
                break
            raise
    f.close()


def count_idle():
    f_flags = open("/proc/kpageflags", "rb", BUFSIZE)
    f_cgroup = open("/proc/kpagecgroup", "rb", BUFSIZE)

    with open("/sys/kernel/mm/page_idle/bitmap", "rb", BUFSIZE) as f:
        while f.read(BUFSIZE):
            pass  # update idle flag

    idlememsz = {}
    while True:
        s1, s2 = f_flags.read(8), f_cgroup.read(8)
        if not s1 or not s2:
            break

        flags, = struct.unpack('Q', s1)
        cgino, = struct.unpack('Q', s2)

        unevictable = (flags >> 18) & 1
        huge = (flags >> 22) & 1
        idle = (flags >> 25) & 1

        if idle and not unevictable:
            idlememsz[cgino] = idlememsz.get(cgino, 0) + \
                (HUGEPAGE_SIZE if huge else PAGE_SIZE)

    f_flags.close()
    f_cgroup.close()
    return idlememsz


if __name__ == "__main__":
    print "Setting the idle flag for each page..."
    set_idle()

    raw_input("Wait until the workload accesses its working set, "
              "then press Enter")

    print "Counting idle pages..."
    idlememsz = count_idle()

    for dir, subdirs, files in os.walk(CGROUP_MOUNT):
        ino = os.stat(dir)[stat.ST_INO]
        print dir + ": " + str(idlememsz.get(ino, 0) / 1024) + " kB"

==== END SCRIPT ====

This patch (of 8):

Add a page_cgroup_ino() helper to memcg.  This function returns the
inode number of the closest online ancestor of the memory cgroup a page
is charged to.  It is required for exporting information about which
page is charged to which cgroup to userspace, which will be introduced
by a following patch.
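
For illustration, a minimal sketch of the kind of consumer this helper
enables.  The pfn_to_cgroup_ino() wrapper below is hypothetical, not part
of this patch; the real /proc/kpagecgroup reader only arrives in patch 4:

/*
 * Hypothetical sketch: map a PFN to the cgroup inode number reported
 * by the new helper.
 */
static u64 pfn_to_cgroup_ino(unsigned long pfn)
{
	struct page *page;

	if (!pfn_valid(pfn))
		return 0;

	page = pfn_to_page(pfn);
	/* ino of the closest online ancestor cgroup, or 0 if not charged */
	return (u64)page_cgroup_ino(page);
}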

Signed-off-by: Vladimir Davydov
Reviewed-by: Andres Lagar-Cavilla
Cc: Minchan Kim
Cc: Raghavendra K T
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Greg Thelen
Cc: Michel Lespinasse
Cc: David Rientjes
Cc: Pavel Emelyanov
Cc: Cyrill Gorcunov
Cc: Jonathan Corbet
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 include/linux/memcontrol.h | 1 +
 1 file changed, 1 insertion(+)

(limited to 'include/linux')

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d92b80b63c5c..f56c818e56bc 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -345,6 +345,7 @@ static inline bool mm_match_cgroup(struct mm_struct *mm,
 }
 
 struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page);
+ino_t page_cgroup_ino(struct page *page);
 
 static inline bool mem_cgroup_disabled(void)
 {
-- 
cgit v1.2.3

From e993d905c81e2c0f669f2f8e8327df86738baebe Mon Sep 17 00:00:00 2001
From: Vladimir Davydov
Date: Wed, 9 Sep 2015 15:35:35 -0700
Subject: memcg: zap try_get_mem_cgroup_from_page

It is only used in mem_cgroup_try_charge, so fold it in and zap it.

Signed-off-by: Vladimir Davydov
Reviewed-by: Andres Lagar-Cavilla
Cc: Minchan Kim
Cc: Raghavendra K T
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Greg Thelen
Cc: Michel Lespinasse
Cc: David Rientjes
Cc: Pavel Emelyanov
Cc: Cyrill Gorcunov
Cc: Jonathan Corbet
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 include/linux/memcontrol.h | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

(limited to 'include/linux')

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f56c818e56bc..ad800e62cb7a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -305,11 +305,9 @@ struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
 
 bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
-
-struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
-
 struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);
+
 static inline
 struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
 	return css ? container_of(css, struct mem_cgroup, css) : NULL;
@@ -556,11 +554,6 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
 	return &zone->lruvec;
 }
 
-static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
-{
-	return NULL;
-}
-
 static inline bool mm_match_cgroup(struct mm_struct *mm,
 		struct mem_cgroup *memcg)
 {
-- 
cgit v1.2.3

From 1d7715c676a1566c2e4c3e77d16b1f9bb4909025 Mon Sep 17 00:00:00 2001
From: Vladimir Davydov
Date: Wed, 9 Sep 2015 15:35:41 -0700
Subject: mmu-notifier: add clear_young callback

In the scope of the idle memory tracking feature, which is introduced by
the following patch, we need to clear the referenced/accessed bit not
only in primary, but also in secondary ptes.  The latter is required in
order to estimate the working set size (wss) of KVM VMs.  At the same
time we want to avoid flushing the tlb, because it is quite expensive
and it won't really affect the final result.

Currently, there is no function for clearing the pte young bit that
would meet our requirements, so this patch introduces one.  To achieve
that we have to add a new mmu-notifier callback, clear_young, since
there is no method for testing-and-clearing a secondary pte without
flushing the tlb.  The new method is not mandatory and is currently only
implemented by KVM.
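
As an illustration, a hedged sketch of the shape of a driver-side
implementation; my_clear_young() and my_test_and_clear_secondary_young()
are assumed names, and the only real implementation at this point is
KVM's:

/*
 * Hypothetical sketch, not part of this patch.  Unlike clear_flush_young,
 * the callback may skip the secondary tlb flush.
 */
static int my_clear_young(struct mmu_notifier *mn, struct mm_struct *mm,
			  unsigned long start, unsigned long end)
{
	/* test-and-clear accessed bits for [start, end), no tlb flush */
	return my_test_and_clear_secondary_young(start, end);
}

static const struct mmu_notifier_ops my_mmu_notifier_ops = {
	.clear_young	= my_clear_young,
};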

Signed-off-by: Vladimir Davydov
Reviewed-by: Andres Lagar-Cavilla
Acked-by: Paolo Bonzini
Cc: Minchan Kim
Cc: Raghavendra K T
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Greg Thelen
Cc: Michel Lespinasse
Cc: David Rientjes
Cc: Pavel Emelyanov
Cc: Cyrill Gorcunov
Cc: Jonathan Corbet
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 include/linux/mmu_notifier.h | 44 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

(limited to 'include/linux')

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 61cd67f4d788..a5b17137c683 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -65,6 +65,16 @@ struct mmu_notifier_ops {
 			   unsigned long start,
 			   unsigned long end);
 
+	/*
+	 * clear_young is a lightweight version of clear_flush_young.  Like the
+	 * latter, it is supposed to test-and-clear the young/accessed bitflag
+	 * in the secondary pte, but it may omit flushing the secondary tlb.
+	 */
+	int (*clear_young)(struct mmu_notifier *mn,
+			   struct mm_struct *mm,
+			   unsigned long start,
+			   unsigned long end);
+
 	/*
 	 * test_young is called to check the young/accessed bitflag in
 	 * the secondary pte. This is used to know if the page is
@@ -203,6 +213,9 @@ extern void __mmu_notifier_release(struct mm_struct *mm);
 extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
 					  unsigned long start,
 					  unsigned long end);
+extern int __mmu_notifier_clear_young(struct mm_struct *mm,
+				      unsigned long start,
+				      unsigned long end);
 extern int __mmu_notifier_test_young(struct mm_struct *mm,
 				     unsigned long address);
 extern void __mmu_notifier_change_pte(struct mm_struct *mm,
@@ -231,6 +244,15 @@ static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
 	return 0;
 }
 
+static inline int mmu_notifier_clear_young(struct mm_struct *mm,
+					   unsigned long start,
+					   unsigned long end)
+{
+	if (mm_has_notifiers(mm))
+		return __mmu_notifier_clear_young(mm, start, end);
+	return 0;
+}
+
 static inline int mmu_notifier_test_young(struct mm_struct *mm,
 					  unsigned long address)
 {
@@ -311,6 +333,28 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
 	__young;							\
 })
 
+#define ptep_clear_young_notify(__vma, __address, __ptep)		\
+({									\
+	int __young;							\
+	struct vm_area_struct *___vma = __vma;				\
+	unsigned long ___address = __address;				\
+	__young = ptep_test_and_clear_young(___vma, ___address, __ptep);\
+	__young |= mmu_notifier_clear_young(___vma->vm_mm, ___address,	\
+					    ___address + PAGE_SIZE);	\
+	__young;							\
+})
+
+#define pmdp_clear_young_notify(__vma, __address, __pmdp)		\
+({									\
+	int __young;							\
+	struct vm_area_struct *___vma = __vma;				\
+	unsigned long ___address = __address;				\
+	__young = pmdp_test_and_clear_young(___vma, ___address, __pmdp);\
+	__young |= mmu_notifier_clear_young(___vma->vm_mm, ___address,	\
+					    ___address + PMD_SIZE);	\
+	__young;							\
+})
+
 #define ptep_clear_flush_notify(__vma, __address, __ptep)		\
 ({									\
 	unsigned long ___addr = __address & PAGE_MASK;			\
-- 
cgit v1.2.3

From 33c3fc71c8cfa3cc3a98beaa901c069c177dc295 Mon Sep 17 00:00:00 2001
From: Vladimir Davydov
Date: Wed, 9 Sep 2015 15:35:45 -0700
Subject: mm: introduce idle page tracking

Knowing the portion of memory that is not used by a certain application
or memory cgroup (idle memory) can be useful for partitioning the system
efficiently, e.g. by setting memory cgroup limits appropriately.

Currently, the only means to estimate the amount of idle memory provided
by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
access bit for all pages mapped to a particular process by writing 1 to
clear_refs, wait for some time, and then count smaps:Referenced.
However, this method has two serious shortcomings:

 - it does not count unmapped file pages
 - it affects the reclaimer logic

To overcome these drawbacks, this patch introduces two new page flags,
Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
A page's Idle flag can only be set from userspace by setting the bit in
/sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
and it is cleared whenever the page is accessed either through page
tables (it is cleared in page_referenced() in this case) or using the
read(2) system call (mark_page_accessed()).  Thus by setting the Idle
flag for pages of a particular workload, which can be found e.g. by
reading /proc/PID/pagemap, waiting for some time to let the workload
access its working set, and then reading the bitmap file, one can
estimate the number of pages that are not used by the workload.

The Young page flag is used to avoid interference with the memory
reclaimer.  A page's Young flag is set whenever the Access bit of a page
table entry pointing to the page is cleared by writing to the bitmap
file.  If page_referenced() is called on a Young page, it will add 1 to
its return value, therefore concealing the fact that the Access bit was
cleared.

Note, since there is no room for extra page flags on 32 bit, this
feature uses extended page flags when compiled on 32 bit.

[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: kpageidle requires an MMU]
[akpm@linux-foundation.org: decouple from page-flags rework]
Signed-off-by: Vladimir Davydov
Reviewed-by: Andres Lagar-Cavilla
Cc: Minchan Kim
Cc: Raghavendra K T
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Greg Thelen
Cc: Michel Lespinasse
Cc: David Rientjes
Cc: Pavel Emelyanov
Cc: Cyrill Gorcunov
Cc: Jonathan Corbet
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 include/linux/mmu_notifier.h |   2 +
 include/linux/page-flags.h   |  11 +++++
 include/linux/page_ext.h     |   4 ++
 include/linux/page_idle.h    | 110 +++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 127 insertions(+)
 create mode 100644 include/linux/page_idle.h

(limited to 'include/linux')

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index a5b17137c683..a1a210d59961 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -471,6 +471,8 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
 #define ptep_clear_flush_young_notify ptep_clear_flush_young
 #define pmdp_clear_flush_young_notify pmdp_clear_flush_young
+#define ptep_clear_young_notify ptep_test_and_clear_young
+#define pmdp_clear_young_notify pmdp_test_and_clear_young
 #define ptep_clear_flush_notify ptep_clear_flush
 #define pmdp_huge_clear_flush_notify pmdp_huge_clear_flush
 #define pmdp_huge_get_and_clear_notify pmdp_huge_get_and_clear

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 41c93844fb1d..416509e26d6d 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -108,6 +108,10 @@ enum pageflags {
 #endif
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	PG_compound_lock,
+#endif
+#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
+	PG_young,
+	PG_idle,
 #endif
 	__NR_PAGEFLAGS,
 
@@ -289,6 +293,13 @@ PAGEFLAG_FALSE(HWPoison)
 #define __PG_HWPOISON 0
 #endif
 
+#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
+TESTPAGEFLAG(Young, young)
+SETPAGEFLAG(Young, young)
+TESTCLEARFLAG(Young, young)
+PAGEFLAG(Idle, idle)
+#endif
+
 /*
  * On an anonymous page mapped into a user virtual memory area,
  * page->mapping points to its anon_vma, not to a struct address_space;

diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index c42981cd99aa..17f118a82854 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -26,6 +26,10 @@ enum page_ext_flags {
 	PAGE_EXT_DEBUG_POISON,		/* Page is poisoned */
 	PAGE_EXT_DEBUG_GUARD,
 	PAGE_EXT_OWNER,
+#if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
+	PAGE_EXT_YOUNG,
+	PAGE_EXT_IDLE,
+#endif
 };
 
 /*

diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
new file mode 100644
index 000000000000..bf268fa92c5b
--- /dev/null
+++ b/include/linux/page_idle.h
@@ -0,0 +1,110 @@
+#ifndef _LINUX_MM_PAGE_IDLE_H
+#define _LINUX_MM_PAGE_IDLE_H
+
+#include <linux/bitops.h>
+#include <linux/page-flags.h>
+#include <linux/page_ext.h>
+
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+
+#ifdef CONFIG_64BIT
+static inline bool page_is_young(struct page *page)
+{
+	return PageYoung(page);
+}
+
+static inline void set_page_young(struct page *page)
+{
+	SetPageYoung(page);
+}
+
+static inline bool test_and_clear_page_young(struct page *page)
+{
+	return TestClearPageYoung(page);
+}
+
+static inline bool page_is_idle(struct page *page)
+{
+	return PageIdle(page);
+}
+
+static inline void set_page_idle(struct page *page)
+{
+	SetPageIdle(page);
+}
+
+static inline void clear_page_idle(struct page *page)
+{
+	ClearPageIdle(page);
+}
+#else /* !CONFIG_64BIT */
+/*
+ * If there is not enough space to store Idle and Young bits in page flags, use
+ * page ext flags instead.
+ */
+extern struct page_ext_operations page_idle_ops;
+
+static inline bool page_is_young(struct page *page)
+{
+	return test_bit(PAGE_EXT_YOUNG, &lookup_page_ext(page)->flags);
+}
+
+static inline void set_page_young(struct page *page)
+{
+	set_bit(PAGE_EXT_YOUNG, &lookup_page_ext(page)->flags);
+}
+
+static inline bool test_and_clear_page_young(struct page *page)
+{
+	return test_and_clear_bit(PAGE_EXT_YOUNG,
+				  &lookup_page_ext(page)->flags);
+}
+
+static inline bool page_is_idle(struct page *page)
+{
+	return test_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags);
+}
+
+static inline void set_page_idle(struct page *page)
+{
+	set_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags);
+}
+
+static inline void clear_page_idle(struct page *page)
+{
+	clear_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags);
+}
+#endif /* CONFIG_64BIT */
+
+#else /* !CONFIG_IDLE_PAGE_TRACKING */
+
+static inline bool page_is_young(struct page *page)
+{
+	return false;
+}
+
+static inline void set_page_young(struct page *page)
+{
+}
+
+static inline bool test_and_clear_page_young(struct page *page)
+{
+	return false;
+}
+
+static inline bool page_is_idle(struct page *page)
+{
+	return false;
+}
+
+static inline void set_page_idle(struct page *page)
+{
+}
+
+static inline void clear_page_idle(struct page *page)
+{
+}
+
+#endif /* CONFIG_IDLE_PAGE_TRACKING */
+
+#endif /* _LINUX_MM_PAGE_IDLE_H */
-- 
cgit v1.2.3

From 8a5e5e02fc83aaf67053ab53b359af08c6c49aaf Mon Sep 17 00:00:00 2001
From: Vasily Kulikov
Date: Wed, 9 Sep 2015 15:36:00 -0700
Subject: include/linux/poison.h: fix LIST_POISON{1,2} offset

Poison pointer values should be small enough to fit in
non-mmap'able/hardly-mmap'able space.  E.g. on x86 the "poison pointer
space" is located starting from 0x0.
Given unprivileged users cannot mmap anything below mmap_min_addr, it
should be safe to use poison pointers lower than mmap_min_addr.

The current poison pointer values of LIST_POISON{1,2} might be too big
for mmap_min_addr values equal to or less than 1 MB (a common case;
e.g. Ubuntu uses only 0x10000).  There is little point in using such a
big value given that the "poison pointer space" below 1 MB is not yet
exhausted.  Changing it to a smaller value solves the problem for small
mmap_min_addr setups.

The values are suggested by Solar Designer:
http://www.openwall.com/lists/oss-security/2015/05/02/6

Signed-off-by: Vasily Kulikov
Cc: Solar Designer
Cc: Thomas Gleixner
Cc: "Kirill A. Shutemov"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 include/linux/poison.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

(limited to 'include/linux')

diff --git a/include/linux/poison.h b/include/linux/poison.h
index 2110a81c5e2a..253c9b4198ef 100644
--- a/include/linux/poison.h
+++ b/include/linux/poison.h
@@ -19,8 +19,8 @@
  * under normal circumstances, used to verify that nobody uses
  * non-initialized list entries.
  */
-#define LIST_POISON1  ((void *) 0x00100100 + POISON_POINTER_DELTA)
-#define LIST_POISON2  ((void *) 0x00200200 + POISON_POINTER_DELTA)
+#define LIST_POISON1  ((void *) 0x100 + POISON_POINTER_DELTA)
+#define LIST_POISON2  ((void *) 0x200 + POISON_POINTER_DELTA)
 
 /********** include/linux/timer.h **********/
 /*
-- 
cgit v1.2.3

From 8b839635e73575990e92cce1f19f5b1d7febd3fa Mon Sep 17 00:00:00 2001
From: Vasily Kulikov
Date: Wed, 9 Sep 2015 15:36:03 -0700
Subject: include/linux/poison.h: remove not-used poison pointer macros

Signed-off-by: Vasily Kulikov
Cc: Solar Designer
Cc: Thomas Gleixner
Cc: "Kirill A. Shutemov"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 include/linux/poison.h | 7 -------
 1 file changed, 7 deletions(-)

(limited to 'include/linux')

diff --git a/include/linux/poison.h b/include/linux/poison.h
index 253c9b4198ef..317e16de09e5 100644
--- a/include/linux/poison.h
+++ b/include/linux/poison.h
@@ -69,10 +69,6 @@
 #define ATM_POISON_FREE		0x12
 #define ATM_POISON		0xdeadbeef
 
-/********** net/ **********/
-#define NEIGHBOR_DEAD		0xdeadbeef
-#define NETFILTER_LINK_POISON	0xdead57ac
-
 /********** kernel/mutexes **********/
 #define MUTEX_DEBUG_INIT	0x11
 #define MUTEX_DEBUG_FREE	0x22
@@ -83,7 +79,4 @@
 /********** security/ **********/
 #define KEY_DESTROY		0xbd
 
-/********** sound/oss/ **********/
-#define OSS_POISON_FREE		0xAB
-
 #endif
-- 
cgit v1.2.3

From 515a9adce0f0c3d2ef20f869c12902d03851a273 Mon Sep 17 00:00:00 2001
From: "Jason A. Donenfeld"
Date: Wed, 9 Sep 2015 15:36:12 -0700
Subject: include/linux/printk.h: include pr_fmt in pr_debug_ratelimited

The other two implementations of pr_debug_ratelimited include pr_fmt,
along with every other pr_* function.  But pr_debug_ratelimited forgot
to add it in the CONFIG_DYNAMIC_DEBUG implementation.  This patch
unifies the behavior.

Signed-off-by: Jason A. Donenfeld
Cc: Steven Rostedt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 include/linux/printk.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

(limited to 'include/linux')

diff --git a/include/linux/printk.h b/include/linux/printk.h
index a6298b27ac99..6545d911054f 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -404,10 +404,10 @@ do {								\
 	static DEFINE_RATELIMIT_STATE(_rs,			\
 				      DEFAULT_RATELIMIT_INTERVAL,	\
 				      DEFAULT_RATELIMIT_BURST);	\
-	DEFINE_DYNAMIC_DEBUG_METADATA(descriptor, fmt);		\
+	DEFINE_DYNAMIC_DEBUG_METADATA(descriptor, pr_fmt(fmt));	\
 	if (unlikely(descriptor.flags & _DPRINTK_FLAGS_PRINT) &&	\
 	    __ratelimit(&_rs))					\
-		__dynamic_pr_debug(&descriptor, fmt, ##__VA_ARGS__);	\
+		__dynamic_pr_debug(&descriptor, pr_fmt(fmt), ##__VA_ARGS__);	\
 } while (0)
 #elif defined(DEBUG)
 #define pr_debug_ratelimited(fmt, ...)				\
-- 
cgit v1.2.3

From cdf17449af1d9b596742c260134edd6c1fac2792 Mon Sep 17 00:00:00 2001
From: Linus Walleij
Date: Wed, 9 Sep 2015 15:37:11 -0700
Subject: hexdump: do not print debug dumps for !CONFIG_DEBUG

print_hex_dump_debug() is likely supposed to be analogous to pr_debug()
or dev_dbg() & friends.  Currently it will adhere to dynamic debug, but
will not stub out prints if CONFIG_DEBUG is not set.  Let's make it do
the right thing, because I am tired of having my dmesg buffer full of
hex dumps on production systems.

Signed-off-by: Linus Walleij
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 include/linux/printk.h | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

(limited to 'include/linux')

diff --git a/include/linux/printk.h b/include/linux/printk.h
index 6545d911054f..9729565c25ff 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -456,11 +456,17 @@ static inline void print_hex_dump_bytes(const char *prefix_str, int prefix_type,
 			  groupsize, buf, len, ascii)	\
 	dynamic_hex_dump(prefix_str, prefix_type, rowsize,	\
 			 groupsize, buf, len, ascii)
-#else
+#elif defined(DEBUG)
 #define print_hex_dump_debug(prefix_str, prefix_type, rowsize,	\
 			     groupsize, buf, len, ascii)	\
 	print_hex_dump(KERN_DEBUG, prefix_str, prefix_type, rowsize,	\
 		       groupsize, buf, len, ascii)
-#endif /* defined(CONFIG_DYNAMIC_DEBUG) */
+#else
+static inline void print_hex_dump_debug(const char *prefix_str, int prefix_type,
+					int rowsize, int groupsize,
+					const void *buf, size_t len, bool ascii)
+{
+}
+#endif
 
 #endif
-- 
cgit v1.2.3

From b40bdb7fb2b8359d5dfe19a91c147465c3d0359b Mon Sep 17 00:00:00 2001
From: Kees Cook
Date: Wed, 9 Sep 2015 15:37:16 -0700
Subject: lib/string_helpers: rename "esc" arg to "only"

To further clarify the purpose of the "esc" argument, rename it to
"only" to reflect that it is a limit, not a list of additional
characters to escape.
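
A quick usage sketch (illustrative values, not part of this patch),
showing how the renamed argument acts as a limit:

	char out[16];
	int len;

	/* escape only '\t' and '\n', in hex form; other bytes pass through */
	len = string_escape_str("a\tb\nc", out, sizeof(out), ESCAPE_HEX, "\t\n");
	/* out now holds the 11 bytes "a\x09b\x0ac" (not NUL-terminated) */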

Signed-off-by: Kees Cook
Suggested-by: Rasmus Villemoes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 include/linux/string_helpers.h | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

(limited to 'include/linux')

diff --git a/include/linux/string_helpers.h b/include/linux/string_helpers.h
index 71f711db4500..dabe643eb5fa 100644
--- a/include/linux/string_helpers.h
+++ b/include/linux/string_helpers.h
@@ -48,24 +48,24 @@ static inline int string_unescape_any_inplace(char *buf)
 #define ESCAPE_HEX	0x20
 
 int string_escape_mem(const char *src, size_t isz, char *dst, size_t osz,
-		unsigned int flags, const char *esc);
+		unsigned int flags, const char *only);
 
 static inline int string_escape_mem_any_np(const char *src, size_t isz,
-		char *dst, size_t osz, const char *esc)
+		char *dst, size_t osz, const char *only)
 {
-	return string_escape_mem(src, isz, dst, osz, ESCAPE_ANY_NP, esc);
+	return string_escape_mem(src, isz, dst, osz, ESCAPE_ANY_NP, only);
 }
 
 static inline int string_escape_str(const char *src, char *dst, size_t sz,
-		unsigned int flags, const char *esc)
+		unsigned int flags, const char *only)
 {
-	return string_escape_mem(src, strlen(src), dst, sz, flags, esc);
+	return string_escape_mem(src, strlen(src), dst, sz, flags, only);
 }
 
 static inline int string_escape_str_any_np(const char *src, char *dst,
-		size_t sz, const char *esc)
+		size_t sz, const char *only)
 {
-	return string_escape_str(src, dst, sz, ESCAPE_ANY_NP, esc);
+	return string_escape_str(src, dst, sz, ESCAPE_ANY_NP, only);
 }
 
 #endif
-- 
cgit v1.2.3

From 90f023030e26ce8f981b3e688cb79329d8d07cc3 Mon Sep 17 00:00:00 2001
From: Frederic Weisbecker
Date: Wed, 9 Sep 2015 15:38:22 -0700
Subject: kmod: use system_unbound_wq instead of khelper

We need to launch the usermodehelper kernel threads with the widest
affinity, and this is partly why we use khelper.  This workqueue has
unbound properties and thus a wide affinity inherited by all its
children.

Now khelper also has special properties that we aren't much interested
in: ordered and singlethread.  There is really no need for ordering, as
all we do is create kernel threads; this can be done concurrently.  And
singlethread is a useless limitation as well.

The workqueue engine already provides generic unbound workqueues that
don't share these useless properties and handle parallel jobs well.

The only worrisome specific is their affinity to the node of the current
CPU.  It's fine for creating the usermodehelper kernel threads, but
those inherit this affinity for longer jobs such as requesting modules.

This patch proposes to use these node-affine unbound workqueues,
assuming that a node is sufficient to handle several parallel
usermodehelper requests.
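
For context, a minimal sketch of the resulting pattern; helper_work is
an illustrative stand-in, the real work_struct lives in struct
subprocess_info in kernel/kmod.c:

	static struct work_struct helper_work;

	/* unbound: wide affinity, concurrent execution, no ordering */
	queue_work(system_unbound_wq, &helper_work);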

Signed-off-by: Frederic Weisbecker
Cc: Rik van Riel
Reviewed-by: Oleg Nesterov
Cc: Christoph Lameter
Cc: Tejun Heo
Cc: Rusty Russell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 include/linux/kmod.h | 2 --
 1 file changed, 2 deletions(-)

(limited to 'include/linux')

diff --git a/include/linux/kmod.h b/include/linux/kmod.h
index 0555cc66a15b..fcfd2bf14d3f 100644
--- a/include/linux/kmod.h
+++ b/include/linux/kmod.h
@@ -85,8 +85,6 @@ enum umh_disable_depth {
 	UMH_DISABLED,
 };
 
-extern void usermodehelper_init(void);
-
 extern int __usermodehelper_disable(enum umh_disable_depth depth);
 extern void __usermodehelper_set_disable_depth(enum umh_disable_depth depth);
-- 
cgit v1.2.3

From 37607102c4426cf92aeb5da1b1d9a79ba6d95e3f Mon Sep 17 00:00:00 2001
From: Andy Shevchenko
Date: Wed, 9 Sep 2015 15:38:33 -0700
Subject: seq_file: provide an analogue of print_hex_dump()

This introduces a new helper and switches current users to use it.  All
patches are compile tested.  kmemleak is tested via its own test suite.

This patch (of 6):

The new seq_hex_dump() is a complete analogue of print_hex_dump().  We
already have a few users of this functionality; the new helper allows
them to reduce their codebase.

Signed-off-by: Andy Shevchenko
Cc: Alexander Viro
Cc: Joe Perches
Cc: Tadeusz Struk
Cc: Helge Deller
Cc: Ingo Tuchscherer
Cc: Catalin Marinas
Cc: Vladimir Kondratiev
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 include/linux/seq_file.h | 4 ++++
 1 file changed, 4 insertions(+)

(limited to 'include/linux')

diff --git a/include/linux/seq_file.h b/include/linux/seq_file.h
index d4c7271382cb..adeadbd6d7bf 100644
--- a/include/linux/seq_file.h
+++ b/include/linux/seq_file.h
@@ -122,6 +122,10 @@ int seq_write(struct seq_file *seq, const void *data, size_t len);
 __printf(2, 3) int seq_printf(struct seq_file *, const char *, ...);
 __printf(2, 0) int seq_vprintf(struct seq_file *, const char *, va_list args);
 
+void seq_hex_dump(struct seq_file *m, const char *prefix_str, int prefix_type,
+		  int rowsize, int groupsize, const void *buf, size_t len,
+		  bool ascii);
+
 int seq_path(struct seq_file *, const struct path *, const char *);
 int seq_file_path(struct seq_file *, struct file *, const char *);
 int seq_dentry(struct seq_file *, struct dentry *, const char *);
-- 
cgit v1.2.3

From a43cac0d9dc2073ff2245a171429ddbe1accece7 Mon Sep 17 00:00:00 2001
From: Dave Young
Date: Wed, 9 Sep 2015 15:38:51 -0700
Subject: kexec: split kexec_file syscall code to kexec_file.c

Split the kexec_file syscall related code into another file,
kernel/kexec_file.c, so that the #ifdef CONFIG_KEXEC_FILE in kexec.c can
be dropped.

Shared variables and functions are moved to kernel/kexec_internal.h per
suggestion from Vivek and Petr.

[akpm@linux-foundation.org: fix bisectability]
[akpm@linux-foundation.org: declare the various arch_kexec functions]
[akpm@linux-foundation.org: fix build]
Signed-off-by: Dave Young
Cc: Eric W. Biederman
Cc: Vivek Goyal
Cc: Petr Tesarik
Cc: Theodore Ts'o
Cc: Josh Boyer
Cc: David Howells
Cc: Geert Uytterhoeven
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 include/linux/kexec.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

(limited to 'include/linux')

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index b63218f68c4b..ab150ade0d18 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -318,6 +318,17 @@ int crash_shrink_memory(unsigned long new_size);
 size_t crash_get_memory_size(void);
 void crash_free_reserved_phys_range(unsigned long begin, unsigned long end);
 
+int __weak arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
+					 unsigned long buf_len);
+void * __weak arch_kexec_kernel_image_load(struct kimage *image);
+int __weak arch_kimage_file_post_load_cleanup(struct kimage *image);
+int __weak arch_kexec_kernel_verify_sig(struct kimage *image, void *buf,
+					unsigned long buf_len);
+int __weak arch_kexec_apply_relocations_add(const Elf_Ehdr *ehdr,
+					    Elf_Shdr *sechdrs, unsigned int relsec);
+int __weak arch_kexec_apply_relocations(const Elf_Ehdr *ehdr, Elf_Shdr *sechdrs,
+					unsigned int relsec);
+
 #else /* !CONFIG_KEXEC */
 struct pt_regs;
 struct task_struct;
-- 
cgit v1.2.3

From 2965faa5e03d1e71e9ff9aa143fff39e0a77543a Mon Sep 17 00:00:00 2001
From: Dave Young
Date: Wed, 9 Sep 2015 15:38:55 -0700
Subject: kexec: split kexec_load syscall from kexec core code

There are two kexec load syscalls: kexec_load and kexec_file_load.
kexec_file_load has already been split out into kernel/kexec_file.c.  In
this patch I split the kexec_load syscall code into kernel/kexec.c, and
add a new kconfig option, KEXEC_CORE, so we can disable kexec_load and
use kexec_file_load only, or vice versa.

The original requirement is from Ted Ts'o: he wants the kexec kernel
signature to be checked with CONFIG_KEXEC_VERIFY_SIG enabled.  But
kexec-tools using the kexec_load syscall can bypass that check.  Vivek
Goyal proposed creating a common kconfig option so users can compile in
only one syscall for loading the kexec kernel.  KEXEC/KEXEC_FILE selects
KEXEC_CORE so that old config files still work.

Because some generic code needs CONFIG_KEXEC_CORE, I updated all the
architecture Kconfigs with the new option KEXEC_CORE and let KEXEC
select KEXEC_CORE in the arch Kconfig.  Also updated the general kernel
code related to the kexec_load syscall.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Dave Young
Cc: Eric W. Biederman
Cc: Vivek Goyal
Cc: Petr Tesarik
Cc: Theodore Ts'o
Cc: Josh Boyer
Cc: David Howells
Cc: Geert Uytterhoeven
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 include/linux/kexec.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

(limited to 'include/linux')

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index ab150ade0d18..d140b1e9faa7 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -16,7 +16,7 @@
 #include
 
-#ifdef CONFIG_KEXEC
+#ifdef CONFIG_KEXEC_CORE
 #include
 #include
 #include
@@ -329,13 +329,13 @@ int __weak arch_kexec_apply_relocations_add(const Elf_Ehdr *ehdr,
 int __weak arch_kexec_apply_relocations(const Elf_Ehdr *ehdr, Elf_Shdr *sechdrs,
 					unsigned int relsec);
 
-#else /* !CONFIG_KEXEC */
+#else /* !CONFIG_KEXEC_CORE */
 struct pt_regs;
 struct task_struct;
 static inline void crash_kexec(struct pt_regs *regs) { }
 static inline int kexec_should_crash(struct task_struct *p) { return 0; }
 #define kexec_in_progress false
-#endif /* CONFIG_KEXEC */
+#endif /* CONFIG_KEXEC_CORE */
 
 #endif /* !defined(__ASSEBMLY__) */
-- 
cgit v1.2.3

From 1fcfd8db7f82fa1f533a6f0e4155614ff4144d56 Mon Sep 17 00:00:00 2001
From: Oleg Nesterov
Date: Wed, 9 Sep 2015 15:39:29 -0700
Subject: mm, mpx: add "vm_flags_t vm_flags" arg to do_mmap_pgoff()

Add the additional "vm_flags_t vm_flags" argument to do_mmap_pgoff(),
rename it to do_mmap(), and re-introduce do_mmap_pgoff() as a simple
wrapper on top of do_mmap().  Perhaps we should update the callers of
do_mmap_pgoff() and kill it later.

This way mpx_mmap() can simply call do_mmap(vm_flags => VM_MPX) and not
play with vm internals.

After this change mmap_region() has a single user outside of mmap.c:
arch/tile/mm/elf.c:arch_setup_additional_pages().  It would be nice to
change arch/tile/ and unexport mmap_region().

[kirill@shutemov.name: fix build]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Oleg Nesterov
Acked-by: Dave Hansen
Tested-by: Dave Hansen
Signed-off-by: Kirill A. Shutemov
Cc: "H. Peter Anvin"
Cc: Andy Lutomirski
Cc: Ingo Molnar
Cc: Minchan Kim
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 include/linux/mm.h | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

(limited to 'include/linux')

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f25a957bf0ab..fda728e3c27d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1873,11 +1873,19 @@ extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned lo
 extern unsigned long mmap_region(struct file *file, unsigned long addr,
 	unsigned long len, vm_flags_t vm_flags, unsigned long pgoff);
-extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
+extern unsigned long do_mmap(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot, unsigned long flags,
-	unsigned long pgoff, unsigned long *populate);
+	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t);
 
+static inline unsigned long
+do_mmap_pgoff(struct file *file, unsigned long addr,
+	unsigned long len, unsigned long prot, unsigned long flags,
+	unsigned long pgoff, unsigned long *populate)
+{
+	return do_mmap(file, addr, len, prot, flags, 0, pgoff, populate);
+}
+
 #ifdef CONFIG_MMU
 extern int __mm_populate(unsigned long addr, unsigned long len,
 			 int ignore_errors);
-- 
cgit v1.2.3
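
To illustrate the mpx_mmap() simplification described above, a hedged
sketch of the call the new argument enables (illustrative only; the
authoritative version lives in arch/x86/mm/mpx.c, and caller context is
omitted):

/*
 * Illustrative sketch, not the actual mpx code.  The caller is assumed
 * to hold mm->mmap_sem for write.
 */
static unsigned long mpx_mmap_sketch(unsigned long len)
{
	unsigned long populate;

	/* map an anonymous region whose vmas are marked VM_MPX up front */
	return do_mmap(NULL, 0, len, PROT_READ | PROT_WRITE,
		       MAP_ANONYMOUS | MAP_PRIVATE, VM_MPX, 0, &populate);
}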