From c5f88bd29ab42d5d1e77085b5f69d5c6da20324e Mon Sep 17 00:00:00 2001 From: Vegard Nossum Date: Tue, 2 Aug 2016 14:02:22 -0700 Subject: mm: fail prefaulting if page table allocation fails I ran into this: BUG: sleeping function called from invalid context at mm/page_alloc.c:3784 in_atomic(): 0, irqs_disabled(): 0, pid: 1434, name: trinity-c1 2 locks held by trinity-c1/1434: #0: (&mm->mmap_sem){......}, at: [] __do_page_fault+0x1ce/0x8f0 #1: (rcu_read_lock){......}, at: [] filemap_map_pages+0xd6/0xdd0 CPU: 0 PID: 1434 Comm: trinity-c1 Not tainted 4.7.0+ #58 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 Call Trace: dump_stack+0x65/0x84 panic+0x185/0x2dd ___might_sleep+0x51c/0x600 __might_sleep+0x90/0x1a0 __alloc_pages_nodemask+0x5b1/0x2160 alloc_pages_current+0xcc/0x370 pte_alloc_one+0x12/0x90 __pte_alloc+0x1d/0x200 alloc_set_pte+0xe3e/0x14a0 filemap_map_pages+0x42b/0xdd0 handle_mm_fault+0x17d5/0x28b0 __do_page_fault+0x310/0x8f0 trace_do_page_fault+0x18d/0x310 do_async_page_fault+0x27/0xa0 async_page_fault+0x28/0x30 The important bits from the above is that filemap_map_pages() is calling into the page allocator while holding rcu_read_lock (sleeping is not allowed inside RCU read-side critical sections). According to Kirill Shutemov, the prefaulting code in do_fault_around() is supposed to take care of this, but missing error handling means that the allocation failure can go unnoticed. We don't need to return VM_FAULT_OOM (or any other error) here, since we can just let the normal fault path try again. Fixes: 7267ec008b5c ("mm: postpone page table allocation until we have page to map") Link: http://lkml.kernel.org/r/1469708107-11868-1-git-send-email-vegard.nossum@oracle.com Signed-off-by: Vegard Nossum Acked-by: Kirill A. Shutemov Cc: "Hillf Danton" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'mm') diff --git a/mm/memory.c b/mm/memory.c index 4425b6059339..04004834e985 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3133,6 +3133,8 @@ static int do_fault_around(struct fault_env *fe, pgoff_t start_pgoff) if (pmd_none(*fe->pmd)) { fe->prealloc_pte = pte_alloc_one(fe->vma->vm_mm, fe->address); + if (!fe->prealloc_pte) + goto out; smp_wmb(); /* See comment in __pte_alloc() */ } -- cgit v1.2.3 From 1a8018fb4c6976559c3f04bcf760822381be501d Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Tue, 2 Aug 2016 14:02:25 -0700 Subject: mm: move swap-in anonymous page into active list Every swap-in anonymous page starts from inactive lru list's head. It should be activated unconditionally when VM decide to reclaim because page table entry for the page always usually has marked accessed bit. Thus, their window size for getting a new referece is 2 * NR_inactive + NR_active while others is NR_inactive + NR_active. It's not fair that it has more chance to be referenced compared to other newly allocated page which starts from active lru list's head. Johannes: : The page can still have a valid copy on the swap device, so prefering to : reclaim that page over a fresh one could make sense. But as you point : out, having it start inactive instead of active actually ends up giving it : *more* LRU time, and that seems to be without justification. Rik: : The reason newly read in swap cache pages start on the inactive list is : that we do some amount of read-around, and do not know which pages will : get used. : : However, immediately activating the ones that DO get used, like your patch : does, is the right thing to do. Link: http://lkml.kernel.org/r/1469762740-17860-1-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim Acked-by: Johannes Weiner Acked-by: Rik van Riel Cc: Nadav Amit Cc: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory.c | 1 + 1 file changed, 1 insertion(+) (limited to 'mm') diff --git a/mm/memory.c b/mm/memory.c index 04004834e985..83be99d9d8a1 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2642,6 +2642,7 @@ int do_swap_page(struct fault_env *fe, pte_t orig_pte) if (page == swapcache) { do_page_add_anon_rmap(page, vma, fe->address, exclusive); mem_cgroup_commit_charge(page, memcg, true, false); + activate_page(page); } else { /* ksm created a completely new copy */ page_add_new_anon_rmap(page, vma, fe->address, false); mem_cgroup_commit_charge(page, memcg, false, false); -- cgit v1.2.3 From 649920c6ab93429b94bc7c1aa7c0e8395351be32 Mon Sep 17 00:00:00 2001 From: Jia He Date: Tue, 2 Aug 2016 14:02:31 -0700 Subject: mm/hugetlb: avoid soft lockup in set_max_huge_pages() In powerpc servers with large memory(32TB), we watched several soft lockups for hugepage under stress tests. The call traces are as follows: 1. get_page_from_freelist+0x2d8/0xd50 __alloc_pages_nodemask+0x180/0xc20 alloc_fresh_huge_page+0xb0/0x190 set_max_huge_pages+0x164/0x3b0 2. prep_new_huge_page+0x5c/0x100 alloc_fresh_huge_page+0xc8/0x190 set_max_huge_pages+0x164/0x3b0 This patch fixes such soft lockups. It is safe to call cond_resched() there because it is out of spin_lock/unlock section. Link: http://lkml.kernel.org/r/1469674442-14848-1-git-send-email-hejianet@gmail.com Signed-off-by: Jia He Reviewed-by: Naoya Horiguchi Acked-by: Michal Hocko Acked-by: Dave Hansen Cc: Mike Kravetz Cc: "Kirill A. Shutemov" Cc: Paul Gortmaker Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/hugetlb.c | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'mm') diff --git a/mm/hugetlb.c b/mm/hugetlb.c index f904246a8fd5..619e00d82c5d 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2216,6 +2216,10 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count, * and reducing the surplus. */ spin_unlock(&hugetlb_lock); + + /* yield cpu to avoid soft lockup */ + cond_resched(); + if (hstate_is_gigantic(h)) ret = alloc_fresh_gigantic_page(h, nodes_allowed); else -- cgit v1.2.3 From 4e666314d286765a9e61818b488c7372326654ec Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Tue, 2 Aug 2016 14:02:34 -0700 Subject: mm, hugetlb: fix huge_pte_alloc BUG_ON Zhong Jiang has reported a BUG_ON from huge_pte_alloc hitting when he runs his database load with memory online and offline running in parallel. The reason is that huge_pmd_share might detect a shared pmd which is currently migrated and so it has migration pte which is !pte_huge. There doesn't seem to be any easy way to prevent from the race and in fact seeing the migration swap entry is not harmful. Both callers of huge_pte_alloc are prepared to handle them. copy_hugetlb_page_range will copy the swap entry and make it COW if needed. hugetlb_fault will back off and so the page fault is retries if the page is still under migration and waits for its completion in hugetlb_fault. That means that the BUG_ON is wrong and we should update it. Let's simply check that all present ptes are pte_huge instead. Link: http://lkml.kernel.org/r/20160721074340.GA26398@dhcp22.suse.cz Signed-off-by: Michal Hocko Reported-by: zhongjiang Acked-by: Naoya Horiguchi Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/hugetlb.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 619e00d82c5d..ef968306fd5b 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4310,7 +4310,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, pte = (pte_t *)pmd_alloc(mm, pud, addr); } } - BUG_ON(pte && !pte_none(*pte) && !pte_huge(*pte)); + BUG_ON(pte && pte_present(*pte) && !pte_huge(*pte)); return pte; } -- cgit v1.2.3 From d6507ff5331c002430cc20ab25922479453baae7 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Tue, 2 Aug 2016 14:02:37 -0700 Subject: memcg: put soft limit reclaim out of way if the excess tree is empty We've had a report about soft lockups caused by lock bouncing in the soft reclaim path: BUG: soft lockup - CPU#0 stuck for 22s! [kav4proxy-kavic:3128] RIP: 0010:[] [] _raw_spin_lock+0x18/0x20 Call Trace: mem_cgroup_soft_limit_reclaim+0x25a/0x280 shrink_zones+0xed/0x200 do_try_to_free_pages+0x74/0x320 try_to_free_pages+0x112/0x180 __alloc_pages_slowpath+0x3ff/0x820 __alloc_pages_nodemask+0x1e9/0x200 alloc_pages_vma+0xe1/0x290 do_wp_page+0x19f/0x840 handle_pte_fault+0x1cd/0x230 do_page_fault+0x1fd/0x4c0 page_fault+0x25/0x30 There are no memcgs created so there cannot be any in the soft limit excess obviously: [...] memory 0 1 1 so all this just seems to be mem_cgroup_largest_soft_limit_node trying to get spin_lock_irq(&mctz->lock) just to find out that the soft limit excess tree is empty. This is just pointless wasting of cycles and cache line bouncing during heavy parallel reclaim on large machines. The particular machine wasn't very healthy and most probably suffering from a memory leak which just caused the memory reclaim to trash heavily. But bouncing on the lock certainly didn't help... Fix this by optimistic lockless check and bail out early if the tree is empty. This is theoretically racy but that shouldn't matter all that much. First of all soft limit is a best effort feature and it is slowly getting deprecated and its usage should be really scarce. Bouncing on a lock without a good reason is surely much bigger problem, especially on large CPU machines. Link: http://lkml.kernel.org/r/1470073277-1056-1-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko Acked-by: Vladimir Davydov Acked-by: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memcontrol.c | 9 +++++++++ 1 file changed, 9 insertions(+) (limited to 'mm') diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c265212bec8c..66beca1ad92f 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2559,6 +2559,15 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, return 0; mctz = soft_limit_tree_node(pgdat->node_id); + + /* + * Do not even bother to check the largest node if the root + * is empty. Do it lockless to prevent lock bouncing. Races + * are acceptable as soft limit is best effort anyway. + */ + if (RB_EMPTY_ROOT(&mctz->rb_root)) + return 0; + /* * This loop can run a while, specially if mem_cgroup's continuously * keep exceeding their soft limit and putting the system under -- cgit v1.2.3 From 4a3d308d6674fabf213bce9c1a661ef43a85e515 Mon Sep 17 00:00:00 2001 From: Andrey Ryabinin Date: Tue, 2 Aug 2016 14:02:40 -0700 Subject: mm/kasan: fix corruptions and false positive reports Once an object is put into quarantine, we no longer own it, i.e. object could leave the quarantine and be reallocated. So having set_track() call after the quarantine_put() may corrupt slab objects. BUG kmalloc-4096 (Not tainted): Poison overwritten ----------------------------------------------------------------------------- Disabling lock debugging due to kernel taint INFO: 0xffff8804540de850-0xffff8804540de857. First byte 0xb5 instead of 0x6b ... INFO: Freed in qlist_free_all+0x42/0x100 age=75 cpu=3 pid=24492 __slab_free+0x1d6/0x2e0 ___cache_free+0xb6/0xd0 qlist_free_all+0x83/0x100 quarantine_reduce+0x177/0x1b0 kasan_kmalloc+0xf3/0x100 kasan_slab_alloc+0x12/0x20 kmem_cache_alloc+0x109/0x3e0 mmap_region+0x53e/0xe40 do_mmap+0x70f/0xa50 vm_mmap_pgoff+0x147/0x1b0 SyS_mmap_pgoff+0x2c7/0x5b0 SyS_mmap+0x1b/0x30 do_syscall_64+0x1a0/0x4e0 return_from_SYSCALL_64+0x0/0x7a INFO: Slab 0xffffea0011503600 objects=7 used=7 fp=0x (null) flags=0x8000000000004080 INFO: Object 0xffff8804540de848 @offset=26696 fp=0xffff8804540dc588 Redzone ffff8804540de840: bb bb bb bb bb bb bb bb ........ Object ffff8804540de848: 6b 6b 6b 6b 6b 6b 6b 6b b5 52 00 00 f2 01 60 cc kkkkkkkk.R....`. Similarly, poisoning after the quarantine_put() leads to false positive use-after-free reports: BUG: KASAN: use-after-free in anon_vma_interval_tree_insert+0x304/0x430 at addr ffff880405c540a0 Read of size 8 by task trinity-c0/3036 CPU: 0 PID: 3036 Comm: trinity-c0 Not tainted 4.7.0-think+ #9 Call Trace: dump_stack+0x68/0x96 kasan_report_error+0x222/0x600 __asan_report_load8_noabort+0x61/0x70 anon_vma_interval_tree_insert+0x304/0x430 anon_vma_chain_link+0x91/0xd0 anon_vma_clone+0x136/0x3f0 anon_vma_fork+0x81/0x4c0 copy_process.part.47+0x2c43/0x5b20 _do_fork+0x16d/0xbd0 SyS_clone+0x19/0x20 do_syscall_64+0x1a0/0x4e0 entry_SYSCALL64_slow_path+0x25/0x25 Fix this by putting an object in the quarantine after all other operations. Fixes: 80a9201a5965 ("mm, kasan: switch SLUB to stackdepot, enable memory quarantine for SLUB") Link: http://lkml.kernel.org/r/1470062715-14077-1-git-send-email-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin Reported-by: Dave Jones Reported-by: Vegard Nossum Reported-by: Sasha Levin Acked-by: Alexander Potapenko Cc: Dmitry Vyukov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/kasan/kasan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/kasan/kasan.c b/mm/kasan/kasan.c index b6f99e81bfeb..3019cecc0833 100644 --- a/mm/kasan/kasan.c +++ b/mm/kasan/kasan.c @@ -543,9 +543,9 @@ bool kasan_slab_free(struct kmem_cache *cache, void *object) switch (alloc_info->state) { case KASAN_STATE_ALLOC: alloc_info->state = KASAN_STATE_QUARANTINE; - quarantine_put(free_info, cache); set_track(&free_info->track, GFP_NOWAIT); kasan_poison_slab_free(cache, object); + quarantine_put(free_info, cache); return true; case KASAN_STATE_QUARANTINE: case KASAN_STATE_FREE: -- cgit v1.2.3 From 4b3ec5a3f4b1d5c9d64b9ab704042400d050d432 Mon Sep 17 00:00:00 2001 From: Andrey Ryabinin Date: Tue, 2 Aug 2016 14:02:43 -0700 Subject: mm/kasan: don't reduce quarantine in atomic contexts Currently we call quarantine_reduce() for ___GFP_KSWAPD_RECLAIM (implied by __GFP_RECLAIM) allocation. So, basically we call it on almost every allocation. quarantine_reduce() sometimes is heavy operation, and calling it with disabled interrupts may trigger hard LOCKUP: NMI watchdog: Watchdog detected hard LOCKUP on cpu 2irq event stamp: 1411258 Call Trace: dump_stack+0x68/0x96 watchdog_overflow_callback+0x15b/0x190 __perf_event_overflow+0x1b1/0x540 perf_event_overflow+0x14/0x20 intel_pmu_handle_irq+0x36a/0xad0 perf_event_nmi_handler+0x2c/0x50 nmi_handle+0x128/0x480 default_do_nmi+0xb2/0x210 do_nmi+0x1aa/0x220 end_repeat_nmi+0x1a/0x1e <> __kernel_text_address+0x86/0xb0 print_context_stack+0x7b/0x100 dump_trace+0x12b/0x350 save_stack_trace+0x2b/0x50 set_track+0x83/0x140 free_debug_processing+0x1aa/0x420 __slab_free+0x1d6/0x2e0 ___cache_free+0xb6/0xd0 qlist_free_all+0x83/0x100 quarantine_reduce+0x177/0x1b0 kasan_kmalloc+0xf3/0x100 Reduce the quarantine_reduce iff direct reclaim is allowed. Fixes: 55834c59098d("mm: kasan: initial memory quarantine implementation") Link: http://lkml.kernel.org/r/1470062715-14077-2-git-send-email-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin Reported-by: Dave Jones Acked-by: Alexander Potapenko Cc: Dmitry Vyukov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/kasan/kasan.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'mm') diff --git a/mm/kasan/kasan.c b/mm/kasan/kasan.c index 3019cecc0833..c99ef40ebdfa 100644 --- a/mm/kasan/kasan.c +++ b/mm/kasan/kasan.c @@ -565,7 +565,7 @@ void kasan_kmalloc(struct kmem_cache *cache, const void *object, size_t size, unsigned long redzone_start; unsigned long redzone_end; - if (flags & __GFP_RECLAIM) + if (gfpflags_allow_blocking(flags)) quarantine_reduce(); if (unlikely(object == NULL)) @@ -596,7 +596,7 @@ void kasan_kmalloc_large(const void *ptr, size_t size, gfp_t flags) unsigned long redzone_start; unsigned long redzone_end; - if (flags & __GFP_RECLAIM) + if (gfpflags_allow_blocking(flags)) quarantine_reduce(); if (unlikely(ptr == NULL)) -- cgit v1.2.3 From f7376aed6c032aab820fa36806a89e16e353a0d9 Mon Sep 17 00:00:00 2001 From: Andrey Ryabinin Date: Tue, 2 Aug 2016 14:02:46 -0700 Subject: mm/kasan, slub: don't disable interrupts when object leaves quarantine SLUB doesn't require disabled interrupts to call ___cache_free(). Link: http://lkml.kernel.org/r/1470062715-14077-3-git-send-email-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin Acked-by: Alexander Potapenko Cc: Dmitry Vyukov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/kasan/quarantine.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) (limited to 'mm') diff --git a/mm/kasan/quarantine.c b/mm/kasan/quarantine.c index 65793f150d1f..4852625ff851 100644 --- a/mm/kasan/quarantine.c +++ b/mm/kasan/quarantine.c @@ -147,10 +147,14 @@ static void qlink_free(struct qlist_node *qlink, struct kmem_cache *cache) struct kasan_alloc_meta *alloc_info = get_alloc_info(cache, object); unsigned long flags; - local_irq_save(flags); + if (IS_ENABLED(CONFIG_SLAB)) + local_irq_save(flags); + alloc_info->state = KASAN_STATE_FREE; ___cache_free(cache, object, _THIS_IP_); - local_irq_restore(flags); + + if (IS_ENABLED(CONFIG_SLAB)) + local_irq_restore(flags); } static void qlist_free_all(struct qlist_head *q, struct kmem_cache *cache) -- cgit v1.2.3 From 47b5c2a0f021e90a79845d1a1353780e5edd0bce Mon Sep 17 00:00:00 2001 From: Andrey Ryabinin Date: Tue, 2 Aug 2016 14:02:49 -0700 Subject: mm/kasan: get rid of ->alloc_size in struct kasan_alloc_meta Size of slab object already stored in cache->object_size. Note, that kmalloc() internally rounds up size of allocation, so object_size may be not equal to alloc_size, but, usually we don't need to know the exact size of allocated object. In case if we need that information, we still can figure it out from the report. The dump of shadow memory allows to identify the end of allocated memory, and thereby the exact allocation size. Link: http://lkml.kernel.org/r/1470062715-14077-4-git-send-email-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin Cc: Alexander Potapenko Cc: Dmitry Vyukov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/kasan/kasan.c | 1 - mm/kasan/kasan.h | 3 +-- mm/kasan/report.c | 8 +++----- 3 files changed, 4 insertions(+), 8 deletions(-) (limited to 'mm') diff --git a/mm/kasan/kasan.c b/mm/kasan/kasan.c index c99ef40ebdfa..388e812ccaca 100644 --- a/mm/kasan/kasan.c +++ b/mm/kasan/kasan.c @@ -584,7 +584,6 @@ void kasan_kmalloc(struct kmem_cache *cache, const void *object, size_t size, get_alloc_info(cache, object); alloc_info->state = KASAN_STATE_ALLOC; - alloc_info->alloc_size = size; set_track(&alloc_info->track, flags); } } diff --git a/mm/kasan/kasan.h b/mm/kasan/kasan.h index 31972cdba433..aa175460c8f9 100644 --- a/mm/kasan/kasan.h +++ b/mm/kasan/kasan.h @@ -75,8 +75,7 @@ struct kasan_track { struct kasan_alloc_meta { struct kasan_track track; - u32 state : 2; /* enum kasan_state */ - u32 alloc_size : 30; + u32 state; }; struct qlist_node { diff --git a/mm/kasan/report.c b/mm/kasan/report.c index 861b9776841a..d67a7e020905 100644 --- a/mm/kasan/report.c +++ b/mm/kasan/report.c @@ -136,7 +136,9 @@ static void kasan_object_err(struct kmem_cache *cache, struct page *page, struct kasan_free_meta *free_info; dump_stack(); - pr_err("Object at %p, in cache %s\n", object, cache->name); + pr_err("Object at %p, in cache %s size: %d\n", object, cache->name, + cache->object_size); + if (!(cache->flags & SLAB_KASAN)) return; switch (alloc_info->state) { @@ -144,15 +146,11 @@ static void kasan_object_err(struct kmem_cache *cache, struct page *page, pr_err("Object not allocated yet\n"); break; case KASAN_STATE_ALLOC: - pr_err("Object allocated with size %u bytes.\n", - alloc_info->alloc_size); pr_err("Allocation:\n"); print_track(&alloc_info->track); break; case KASAN_STATE_FREE: case KASAN_STATE_QUARANTINE: - pr_err("Object freed, allocated with size %u bytes\n", - alloc_info->alloc_size); free_info = get_free_info(cache, object); pr_err("Allocation:\n"); print_track(&alloc_info->track); -- cgit v1.2.3 From b3cbd9bf77cd1888114dbee1653e79aa23fd4068 Mon Sep 17 00:00:00 2001 From: Andrey Ryabinin Date: Tue, 2 Aug 2016 14:02:52 -0700 Subject: mm/kasan: get rid of ->state in struct kasan_alloc_meta The state of object currently tracked in two places - shadow memory, and the ->state field in struct kasan_alloc_meta. We can get rid of the latter. The will save us a little bit of memory. Also, this allow us to move free stack into struct kasan_alloc_meta, without increasing memory consumption. So now we should always know when the last time the object was freed. This may be useful for long delayed use-after-free bugs. As a side effect this fixes following UBSAN warning: UBSAN: Undefined behaviour in mm/kasan/quarantine.c:102:13 member access within misaligned address ffff88000d1efebc for type 'struct qlist_node' which requires 8 byte alignment Link: http://lkml.kernel.org/r/1470062715-14077-5-git-send-email-aryabinin@virtuozzo.com Reported-by: kernel test robot Signed-off-by: Andrey Ryabinin Cc: Alexander Potapenko Cc: Dmitry Vyukov Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/kasan/kasan.c | 61 +++++++++++++++++++++++---------------------------- mm/kasan/kasan.h | 12 ++-------- mm/kasan/quarantine.c | 2 -- mm/kasan/report.c | 23 +++++-------------- mm/slab.c | 4 +++- mm/slub.c | 1 + 6 files changed, 39 insertions(+), 64 deletions(-) (limited to 'mm') diff --git a/mm/kasan/kasan.c b/mm/kasan/kasan.c index 388e812ccaca..92750e3b0083 100644 --- a/mm/kasan/kasan.c +++ b/mm/kasan/kasan.c @@ -442,11 +442,6 @@ void kasan_poison_object_data(struct kmem_cache *cache, void *object) kasan_poison_shadow(object, round_up(cache->object_size, KASAN_SHADOW_SCALE_SIZE), KASAN_KMALLOC_REDZONE); - if (cache->flags & SLAB_KASAN) { - struct kasan_alloc_meta *alloc_info = - get_alloc_info(cache, object); - alloc_info->state = KASAN_STATE_INIT; - } } static inline int in_irqentry_text(unsigned long ptr) @@ -510,6 +505,17 @@ struct kasan_free_meta *get_free_info(struct kmem_cache *cache, return (void *)object + cache->kasan_info.free_meta_offset; } +void kasan_init_slab_obj(struct kmem_cache *cache, const void *object) +{ + struct kasan_alloc_meta *alloc_info; + + if (!(cache->flags & SLAB_KASAN)) + return; + + alloc_info = get_alloc_info(cache, object); + __memset(alloc_info, 0, sizeof(*alloc_info)); +} + void kasan_slab_alloc(struct kmem_cache *cache, void *object, gfp_t flags) { kasan_kmalloc(cache, object, cache->object_size, flags); @@ -529,34 +535,27 @@ static void kasan_poison_slab_free(struct kmem_cache *cache, void *object) bool kasan_slab_free(struct kmem_cache *cache, void *object) { + s8 shadow_byte; + /* RCU slabs could be legally used after free within the RCU period */ if (unlikely(cache->flags & SLAB_DESTROY_BY_RCU)) return false; - if (likely(cache->flags & SLAB_KASAN)) { - struct kasan_alloc_meta *alloc_info; - struct kasan_free_meta *free_info; + shadow_byte = READ_ONCE(*(s8 *)kasan_mem_to_shadow(object)); + if (shadow_byte < 0 || shadow_byte >= KASAN_SHADOW_SCALE_SIZE) { + pr_err("Double free"); + dump_stack(); + return true; + } - alloc_info = get_alloc_info(cache, object); - free_info = get_free_info(cache, object); + kasan_poison_slab_free(cache, object); - switch (alloc_info->state) { - case KASAN_STATE_ALLOC: - alloc_info->state = KASAN_STATE_QUARANTINE; - set_track(&free_info->track, GFP_NOWAIT); - kasan_poison_slab_free(cache, object); - quarantine_put(free_info, cache); - return true; - case KASAN_STATE_QUARANTINE: - case KASAN_STATE_FREE: - pr_err("Double free"); - dump_stack(); - break; - default: - break; - } - } - return false; + if (unlikely(!(cache->flags & SLAB_KASAN))) + return false; + + set_track(&get_alloc_info(cache, object)->free_track, GFP_NOWAIT); + quarantine_put(get_free_info(cache, object), cache); + return true; } void kasan_kmalloc(struct kmem_cache *cache, const void *object, size_t size, @@ -579,13 +578,9 @@ void kasan_kmalloc(struct kmem_cache *cache, const void *object, size_t size, kasan_unpoison_shadow(object, size); kasan_poison_shadow((void *)redzone_start, redzone_end - redzone_start, KASAN_KMALLOC_REDZONE); - if (cache->flags & SLAB_KASAN) { - struct kasan_alloc_meta *alloc_info = - get_alloc_info(cache, object); - alloc_info->state = KASAN_STATE_ALLOC; - set_track(&alloc_info->track, flags); - } + if (cache->flags & SLAB_KASAN) + set_track(&get_alloc_info(cache, object)->alloc_track, flags); } EXPORT_SYMBOL(kasan_kmalloc); diff --git a/mm/kasan/kasan.h b/mm/kasan/kasan.h index aa175460c8f9..9b7b31e25fd2 100644 --- a/mm/kasan/kasan.h +++ b/mm/kasan/kasan.h @@ -59,13 +59,6 @@ struct kasan_global { * Structures to keep alloc and free tracks * */ -enum kasan_state { - KASAN_STATE_INIT, - KASAN_STATE_ALLOC, - KASAN_STATE_QUARANTINE, - KASAN_STATE_FREE -}; - #define KASAN_STACK_DEPTH 64 struct kasan_track { @@ -74,8 +67,8 @@ struct kasan_track { }; struct kasan_alloc_meta { - struct kasan_track track; - u32 state; + struct kasan_track alloc_track; + struct kasan_track free_track; }; struct qlist_node { @@ -86,7 +79,6 @@ struct kasan_free_meta { * Otherwise it might be used for the allocator freelist. */ struct qlist_node quarantine_link; - struct kasan_track track; }; struct kasan_alloc_meta *get_alloc_info(struct kmem_cache *cache, diff --git a/mm/kasan/quarantine.c b/mm/kasan/quarantine.c index 4852625ff851..7fd121d13b88 100644 --- a/mm/kasan/quarantine.c +++ b/mm/kasan/quarantine.c @@ -144,13 +144,11 @@ static void *qlink_to_object(struct qlist_node *qlink, struct kmem_cache *cache) static void qlink_free(struct qlist_node *qlink, struct kmem_cache *cache) { void *object = qlink_to_object(qlink, cache); - struct kasan_alloc_meta *alloc_info = get_alloc_info(cache, object); unsigned long flags; if (IS_ENABLED(CONFIG_SLAB)) local_irq_save(flags); - alloc_info->state = KASAN_STATE_FREE; ___cache_free(cache, object, _THIS_IP_); if (IS_ENABLED(CONFIG_SLAB)) diff --git a/mm/kasan/report.c b/mm/kasan/report.c index d67a7e020905..f437398b685a 100644 --- a/mm/kasan/report.c +++ b/mm/kasan/report.c @@ -133,7 +133,6 @@ static void kasan_object_err(struct kmem_cache *cache, struct page *page, void *object, char *unused_reason) { struct kasan_alloc_meta *alloc_info = get_alloc_info(cache, object); - struct kasan_free_meta *free_info; dump_stack(); pr_err("Object at %p, in cache %s size: %d\n", object, cache->name, @@ -141,23 +140,11 @@ static void kasan_object_err(struct kmem_cache *cache, struct page *page, if (!(cache->flags & SLAB_KASAN)) return; - switch (alloc_info->state) { - case KASAN_STATE_INIT: - pr_err("Object not allocated yet\n"); - break; - case KASAN_STATE_ALLOC: - pr_err("Allocation:\n"); - print_track(&alloc_info->track); - break; - case KASAN_STATE_FREE: - case KASAN_STATE_QUARANTINE: - free_info = get_free_info(cache, object); - pr_err("Allocation:\n"); - print_track(&alloc_info->track); - pr_err("Deallocation:\n"); - print_track(&free_info->track); - break; - } + + pr_err("Allocated:\n"); + print_track(&alloc_info->alloc_track); + pr_err("Freed:\n"); + print_track(&alloc_info->free_track); } static void print_address_description(struct kasan_access_info *info) diff --git a/mm/slab.c b/mm/slab.c index 09771ed3e693..ca135bd47c35 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -2604,9 +2604,11 @@ static void cache_init_objs(struct kmem_cache *cachep, } for (i = 0; i < cachep->num; i++) { + objp = index_to_obj(cachep, page, i); + kasan_init_slab_obj(cachep, objp); + /* constructor could break poison info */ if (DEBUG == 0 && cachep->ctor) { - objp = index_to_obj(cachep, page, i); kasan_unpoison_object_data(cachep, objp); cachep->ctor(objp); kasan_poison_object_data(cachep, objp); diff --git a/mm/slub.c b/mm/slub.c index 74e7c8c30db8..26eb6a99540e 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1384,6 +1384,7 @@ static void setup_object(struct kmem_cache *s, struct page *page, void *object) { setup_object_debug(s, page, object); + kasan_init_slab_obj(s, object); if (unlikely(s->ctor)) { kasan_unpoison_object_data(s, object); s->ctor(object); -- cgit v1.2.3 From 7e088978933ee186533355ae03a9dc1de99cf6c7 Mon Sep 17 00:00:00 2001 From: Andrey Ryabinin Date: Tue, 2 Aug 2016 14:02:55 -0700 Subject: kasan: improve double-free reports Currently we just dump stack in case of double free bug. Let's dump all info about the object that we have. [aryabinin@virtuozzo.com: change double free message per Alexander] Link: http://lkml.kernel.org/r/1470153654-30160-1-git-send-email-aryabinin@virtuozzo.com Link: http://lkml.kernel.org/r/1470062715-14077-6-git-send-email-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin Cc: Alexander Potapenko Cc: Dmitry Vyukov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/kasan/kasan.c | 3 +-- mm/kasan/kasan.h | 2 ++ mm/kasan/report.c | 54 ++++++++++++++++++++++++++++++++++++++---------------- 3 files changed, 41 insertions(+), 18 deletions(-) (limited to 'mm') diff --git a/mm/kasan/kasan.c b/mm/kasan/kasan.c index 92750e3b0083..88af13c00d3c 100644 --- a/mm/kasan/kasan.c +++ b/mm/kasan/kasan.c @@ -543,8 +543,7 @@ bool kasan_slab_free(struct kmem_cache *cache, void *object) shadow_byte = READ_ONCE(*(s8 *)kasan_mem_to_shadow(object)); if (shadow_byte < 0 || shadow_byte >= KASAN_SHADOW_SCALE_SIZE) { - pr_err("Double free"); - dump_stack(); + kasan_report_double_free(cache, object, shadow_byte); return true; } diff --git a/mm/kasan/kasan.h b/mm/kasan/kasan.h index 9b7b31e25fd2..e5c2181fee6f 100644 --- a/mm/kasan/kasan.h +++ b/mm/kasan/kasan.h @@ -99,6 +99,8 @@ static inline bool kasan_report_enabled(void) void kasan_report(unsigned long addr, size_t size, bool is_write, unsigned long ip); +void kasan_report_double_free(struct kmem_cache *cache, void *object, + s8 shadow); #if defined(CONFIG_SLAB) || defined(CONFIG_SLUB) void quarantine_put(struct kasan_free_meta *info, struct kmem_cache *cache); diff --git a/mm/kasan/report.c b/mm/kasan/report.c index f437398b685a..24c1211fe9d5 100644 --- a/mm/kasan/report.c +++ b/mm/kasan/report.c @@ -116,6 +116,26 @@ static inline bool init_task_stack_addr(const void *addr) sizeof(init_thread_union.stack)); } +static DEFINE_SPINLOCK(report_lock); + +static void kasan_start_report(unsigned long *flags) +{ + /* + * Make sure we don't end up in loop. + */ + kasan_disable_current(); + spin_lock_irqsave(&report_lock, *flags); + pr_err("==================================================================\n"); +} + +static void kasan_end_report(unsigned long *flags) +{ + pr_err("==================================================================\n"); + add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE); + spin_unlock_irqrestore(&report_lock, *flags); + kasan_enable_current(); +} + static void print_track(struct kasan_track *track) { pr_err("PID = %u\n", track->pid); @@ -129,8 +149,7 @@ static void print_track(struct kasan_track *track) } } -static void kasan_object_err(struct kmem_cache *cache, struct page *page, - void *object, char *unused_reason) +static void kasan_object_err(struct kmem_cache *cache, void *object) { struct kasan_alloc_meta *alloc_info = get_alloc_info(cache, object); @@ -147,6 +166,18 @@ static void kasan_object_err(struct kmem_cache *cache, struct page *page, print_track(&alloc_info->free_track); } +void kasan_report_double_free(struct kmem_cache *cache, void *object, + s8 shadow) +{ + unsigned long flags; + + kasan_start_report(&flags); + pr_err("BUG: Double free or freeing an invalid pointer\n"); + pr_err("Unexpected shadow byte: 0x%hhX\n", shadow); + kasan_object_err(cache, object); + kasan_end_report(&flags); +} + static void print_address_description(struct kasan_access_info *info) { const void *addr = info->access_addr; @@ -160,8 +191,7 @@ static void print_address_description(struct kasan_access_info *info) struct kmem_cache *cache = page->slab_cache; object = nearest_obj(cache, page, (void *)info->access_addr); - kasan_object_err(cache, page, object, - "kasan: bad access detected"); + kasan_object_err(cache, object); return; } dump_page(page, "kasan: bad access detected"); @@ -226,19 +256,13 @@ static void print_shadow_for_address(const void *addr) } } -static DEFINE_SPINLOCK(report_lock); - static void kasan_report_error(struct kasan_access_info *info) { unsigned long flags; const char *bug_type; - /* - * Make sure we don't end up in loop. - */ - kasan_disable_current(); - spin_lock_irqsave(&report_lock, flags); - pr_err("==================================================================\n"); + kasan_start_report(&flags); + if (info->access_addr < kasan_shadow_to_mem((void *)KASAN_SHADOW_START)) { if ((unsigned long)info->access_addr < PAGE_SIZE) @@ -259,10 +283,8 @@ static void kasan_report_error(struct kasan_access_info *info) print_address_description(info); print_shadow_for_address(info->first_bad_addr); } - pr_err("==================================================================\n"); - add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE); - spin_unlock_irqrestore(&report_lock, flags); - kasan_enable_current(); + + kasan_end_report(&flags); } void kasan_report(unsigned long addr, size_t size, -- cgit v1.2.3 From c3cee372282cb6bcdf19ac1457581d5dd5ecb554 Mon Sep 17 00:00:00 2001 From: Alexander Potapenko Date: Tue, 2 Aug 2016 14:02:58 -0700 Subject: kasan: avoid overflowing quarantine size on low memory systems If the total amount of memory assigned to quarantine is less than the amount of memory assigned to per-cpu quarantines, |new_quarantine_size| may overflow. Instead, set it to zero. [akpm@linux-foundation.org: cleanup: use WARN_ONCE return value] Link: http://lkml.kernel.org/r/1470063563-96266-1-git-send-email-glider@google.com Fixes: 55834c59098d ("mm: kasan: initial memory quarantine implementation") Signed-off-by: Alexander Potapenko Reported-by: Dmitry Vyukov Cc: Andrey Ryabinin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/kasan/quarantine.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) (limited to 'mm') diff --git a/mm/kasan/quarantine.c b/mm/kasan/quarantine.c index 7fd121d13b88..b6728a33a4ac 100644 --- a/mm/kasan/quarantine.c +++ b/mm/kasan/quarantine.c @@ -198,7 +198,7 @@ void quarantine_put(struct kasan_free_meta *info, struct kmem_cache *cache) void quarantine_reduce(void) { - size_t new_quarantine_size; + size_t new_quarantine_size, percpu_quarantines; unsigned long flags; struct qlist_head to_free = QLIST_INIT; size_t size_to_free = 0; @@ -216,7 +216,12 @@ void quarantine_reduce(void) */ new_quarantine_size = (READ_ONCE(totalram_pages) << PAGE_SHIFT) / QUARANTINE_FRACTION; - new_quarantine_size -= QUARANTINE_PERCPU_SIZE * num_online_cpus(); + percpu_quarantines = QUARANTINE_PERCPU_SIZE * num_online_cpus(); + if (WARN_ONCE(new_quarantine_size < percpu_quarantines, + "Too little memory, disabling global KASAN quarantine.\n")) + new_quarantine_size = 0; + else + new_quarantine_size -= percpu_quarantines; WRITE_ONCE(quarantine_size, new_quarantine_size); last = global_quarantine.head; -- cgit v1.2.3 From b5afba2974f9ebbaaa11b5a633e55db9be3cc363 Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Tue, 2 Aug 2016 14:03:04 -0700 Subject: mm: vmscan: fix memcg-aware shrinkers not called on global reclaim We must call shrink_slab() for each memory cgroup on both global and memcg reclaim in shrink_node_memcg(). Commit d71df22b55099 accidentally changed that so that now shrink_slab() is only called with memcg != NULL on memcg reclaim. As a result, memcg-aware shrinkers (including dentry/inode) are never invoked on global reclaim. Fix that. Fixes: b2e18757f2c9 ("mm, vmscan: begin reclaiming pages on a per-node basis") Link: http://lkml.kernel.org/r/1470056590-7177-1-git-send-email-vdavydov@virtuozzo.com Signed-off-by: Vladimir Davydov Acked-by: Johannes Weiner Acked-by: Michal Hocko Cc: Mel Gorman Cc: Hillf Danton Cc: Vlastimil Babka Cc: Joonsoo Kim Cc: Minchan Kim Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/vmscan.c b/mm/vmscan.c index 650d26832569..374d95d04178 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2561,7 +2561,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) shrink_node_memcg(pgdat, memcg, sc, &lru_pages); node_lru_pages += lru_pages; - if (!global_reclaim(sc)) + if (memcg) shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->nr_scanned - scanned, lru_pages); -- cgit v1.2.3 From bd721ea73e1f965569b40620538c942001f76294 Mon Sep 17 00:00:00 2001 From: Fabian Frederick Date: Tue, 2 Aug 2016 14:03:33 -0700 Subject: treewide: replace obsolete _refok by __ref There was only one use of __initdata_refok and __exit_refok __init_refok was used 46 times against 82 for __ref. Those definitions are obsolete since commit 312b1485fb50 ("Introduce new section reference annotations tags: __ref, __refdata, __refconst") This patch removes the following compatibility definitions and replaces them treewide. /* compatibility defines */ #define __init_refok __ref #define __initdata_refok __refdata #define __exit_refok __ref I can also provide separate patches if necessary. (One patch per tree and check in 1 month or 2 to remove old definitions) [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/1466796271-3043-1-git-send-email-fabf@skynet.be Signed-off-by: Fabian Frederick Cc: Ingo Molnar Cc: Sam Ravnborg Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 4 ++-- mm/slab.c | 2 +- mm/sparse-vmemmap.c | 2 +- mm/sparse.c | 2 +- 4 files changed, 5 insertions(+), 5 deletions(-) (limited to 'mm') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index ea759b935360..39a372a2a1d6 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5276,7 +5276,7 @@ void __init setup_per_cpu_pageset(void) setup_zone_pageset(zone); } -static noinline __init_refok +static noinline __ref int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages) { int i; @@ -5903,7 +5903,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat) } } -static void __init_refok alloc_node_mem_map(struct pglist_data *pgdat) +static void __ref alloc_node_mem_map(struct pglist_data *pgdat) { unsigned long __maybe_unused start = 0; unsigned long __maybe_unused offset = 0; diff --git a/mm/slab.c b/mm/slab.c index ca135bd47c35..261147ba156f 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -1877,7 +1877,7 @@ static struct array_cache __percpu *alloc_kmem_cache_cpus( return cpu_cache; } -static int __init_refok setup_cpu_cache(struct kmem_cache *cachep, gfp_t gfp) +static int __ref setup_cpu_cache(struct kmem_cache *cachep, gfp_t gfp) { if (slab_state >= FULL) return enable_cpucache(cachep, gfp); diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c index 68885dcbaf40..574c67b663fe 100644 --- a/mm/sparse-vmemmap.c +++ b/mm/sparse-vmemmap.c @@ -36,7 +36,7 @@ * Uses the main allocators if they are available, else bootmem. */ -static void * __init_refok __earlyonly_bootmem_alloc(int node, +static void * __ref __earlyonly_bootmem_alloc(int node, unsigned long size, unsigned long align, unsigned long goal) diff --git a/mm/sparse.c b/mm/sparse.c index 36d7bbb80e49..1e168bf2779a 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -59,7 +59,7 @@ static inline void set_section_nid(unsigned long section_nr, int nid) #endif #ifdef CONFIG_SPARSEMEM_EXTREME -static struct mem_section noinline __init_refok *sparse_index_alloc(int nid) +static noinline struct mem_section __ref *sparse_index_alloc(int nid) { struct mem_section *section = NULL; unsigned long array_size = SECTIONS_PER_ROOT * -- cgit v1.2.3 From ba093a6d9397da8eafcfbaa7d95bd34255da39a0 Mon Sep 17 00:00:00 2001 From: Kees Cook Date: Tue, 2 Aug 2016 14:04:54 -0700 Subject: mm: refuse wrapped vm_brk requests The vm_brk() alignment calculations should refuse to overflow. The ELF loader depending on this, but it has been fixed now. No other unsafe callers have been found. Link: http://lkml.kernel.org/r/1468014494-25291-3-git-send-email-keescook@chromium.org Signed-off-by: Kees Cook Reported-by: Hector Marco-Gisbert Cc: Ismael Ripoll Ripoll Cc: Alexander Viro Cc: "Kirill A. Shutemov" Cc: Oleg Nesterov Cc: Chen Gang Cc: Michal Hocko Cc: Konstantin Khlebnikov Cc: Andrea Arcangeli Cc: Andrey Ryabinin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/mmap.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) (limited to 'mm') diff --git a/mm/mmap.c b/mm/mmap.c index d44bee96a5fe..ca9d91bca0d6 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2653,16 +2653,18 @@ static inline void verify_mm_writelocked(struct mm_struct *mm) * anonymous maps. eventually we may be able to do some * brk-specific accounting here. */ -static int do_brk(unsigned long addr, unsigned long len) +static int do_brk(unsigned long addr, unsigned long request) { struct mm_struct *mm = current->mm; struct vm_area_struct *vma, *prev; - unsigned long flags; + unsigned long flags, len; struct rb_node **rb_link, *rb_parent; pgoff_t pgoff = addr >> PAGE_SHIFT; int error; - len = PAGE_ALIGN(len); + len = PAGE_ALIGN(request); + if (len < request) + return -ENOMEM; if (!len) return 0; -- cgit v1.2.3