summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2012-11-09mm: bugfix: set current->reclaim_state to NULL while returning from kswapd()Takamori Yamaguchi
In kswapd(), set current->reclaim_state to NULL before returning, as current->reclaim_state holds reference to variable on kswapd()'s stack. In rare cases, while returning from kswapd() during memory offlining, __free_slab() and freepages() can access the dangling pointer of current->reclaim_state. Signed-off-by: Takamori Yamaguchi <takamori.yamaguchi@jp.sony.com> Signed-off-by: Aaditya Kumar <aaditya.kumar@ap.sony.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-26Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: "This fixes a couple of nasty page table initialization bugs which were causing kdump regressions. A clean rearchitecturing of the code is in the works - meanwhile these are reverts that restore the best-known-working state of the kernel. There's also EFI fixes and other small fixes." * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, mm: Undo incorrect revert in arch/x86/mm/init.c x86: efi: Turn off efi_enabled after setup on mixed fw/kernel x86, mm: Find_early_table_space based on ranges that are actually being mapped x86, mm: Use memblock memory loop instead of e820_RAM x86, mm: Trim memory in memblock to be page aligned x86/irq/ioapic: Check for valid irq_cfg pointer in smp_irq_move_cleanup_interrupt x86/efi: Fix oops caused by incorrect set_memory_uc() usage x86-64: Fix page table accounting Revert "x86/mm: Fix the size calculation of mapping tables" MAINTAINERS: Add EFI git repository location
2012-10-25mm, numa: avoid setting zone_reclaim_mode unless a node is sufficiently distantDavid Rientjes
Commit 957f822a0ab9 ("mm, numa: reclaim from all nodes within reclaim distance") caused zone_reclaim_mode to be set for all systems where two nodes are within RECLAIM_DISTANCE of each other. This is the opposite of what we actually want: zone_reclaim_mode should be set if two nodes are sufficiently distant. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Julian Wollrath <jwollrath@web.de> Tested-by: Julian Wollrath <jwollrath@web.de> Cc: Hugh Dickins <hughd@google.com> Cc: Patrik Kullman <patrik.kullman@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-25mm/mmu_notifier: allocate mmu_notifier in advanceGavin Shan
While allocating mmu_notifier with parameter GFP_KERNEL, swap would start to work in case of tight available memory. Eventually, that would lead to a deadlock while the swap deamon swaps anonymous pages. It was caused by commit e0f3c3f78da29b ("mm/mmu_notifier: init notifier if necessary"). ================================= [ INFO: inconsistent lock state ] 3.7.0-rc1+ #518 Not tainted --------------------------------- inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. kswapd0/35 [HC0[0]:SC0[0]:HE1:SE1] takes: (&mapping->i_mmap_mutex){+.+.?.}, at: page_referenced+0x9c/0x2e0 {RECLAIM_FS-ON-W} state was registered at: mark_held_locks+0x86/0x150 lockdep_trace_alloc+0x67/0xc0 kmem_cache_alloc_trace+0x33/0x230 do_mmu_notifier_register+0x87/0x180 mmu_notifier_register+0x13/0x20 kvm_dev_ioctl+0x428/0x510 do_vfs_ioctl+0x98/0x570 sys_ioctl+0x91/0xb0 system_call_fastpath+0x16/0x1b irq event stamp: 825 hardirqs last enabled at (825): _raw_spin_unlock_irq+0x30/0x60 hardirqs last disabled at (824): _raw_spin_lock_irq+0x19/0x80 softirqs last enabled at (0): copy_process+0x630/0x17c0 softirqs last disabled at (0): (null) ... Simply back out the above commit, which was a small performance optimization. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Reported-by: Andrea Righi <andrea@betterlinux.com> Tested-by: Andrea Righi <andrea@betterlinux.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Avi Kivity <avi@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Sagi Grimberg <sagig@mellanox.co.il> Cc: Haggai Eran <haggaie@mellanox.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-25mm/page_alloc.c:alloc_contig_range(): return early for err pathBob Liu
If start_isolate_page_range() failed, unset_migratetype_isolate() has been done inside it. Signed-off-by: Bob Liu <lliubbo@gmail.com> Cc: Ni zhan Chen <nizhan.chen@gmail.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-25mm: fix XFS oops due to dirty pages without buffers on s390Jan Kara
On s390 any write to a page (even from kernel itself) sets architecture specific page dirty bit. Thus when a page is written to via buffered write, HW dirty bit gets set and when we later map and unmap the page, page_remove_rmap() finds the dirty bit and calls set_page_dirty(). Dirtying of a page which shouldn't be dirty can cause all sorts of problems to filesystems. The bug we observed in practice is that buffers from the page get freed, so when the page gets later marked as dirty and writeback writes it, XFS crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called from xfs_count_page_state(). Similar problem can also happen when zero_user_segment() call from xfs_vm_writepage() (or block_write_full_page() for that matter) set the hardware dirty bit during writeback, later buffers get freed, and then page unmapped. Fix the issue by ignoring s390 HW dirty bit for page cache pages of mappings with mapping_cap_account_dirty(). This is safe because for such mappings when a page gets marked as writeable in PTE it is also marked dirty in do_wp_page() or do_page_fault(). When the dirty bit is cleared by clear_page_dirty_for_io(), the page gets writeprotected in page_mkclean(). So pagecache page is writeable if and only if it is dirty. Thanks to Hugh Dickins for pointing out mapping has to have mapping_cap_account_dirty() for things to work and proposing a cleaned up variant of the patch. The patch has survived about two hours of running fsx-linux on tmpfs while heavily swapping and several days of running on out build machines where the original problem was triggered. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: <stable@vger.kernel.org> [3.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-24x86, mm: Trim memory in memblock to be page alignedYinghai Lu
We will not map partial pages, so need to make sure memblock allocation will not allocate those bytes out. Also we will use for_each_mem_pfn_range() to loop to map memory range to keep them consistent. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/CAE9FiQVZirvaBMFYRfXMmWEcHbKSicQEHz4VAwUv0xFCk51ZNw@mail.gmail.com Acked-by: Jacob Shin <jacob.shin@amd.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: <stable@vger.kernel.org>
2012-10-19Merge tag 'fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc Pull ARM soc fixes from Olof Johansson: "A set of fixes and some minor cleanups for -rc2: - A series from Arnd that fixes warnings in drivers and other code included by ARM defconfigs. Most have been acked by corresponding maintainers (and seem quite hard to argue not picking up anyway in the few exception cases). - A few misc patches from the list for integrator/vt8500/i.MX - A batch of fixes to OMAP platforms, fixing: - boot problems on beaglebone, - regression fixes for local timers - clockdomain locking fixes - a few boot/sparse warnings - For Tegra: - Clock rate calculation overflow fix - Revert a change that removed timer clocks and a fix for symbol name clashes - For Renesas: - IO accessor / annotation cleanups to remove warnings - For Kirkwood/Dove/mvebu: - Fixes for device trees for Dove (some minor cleanups, some fixes) - Fixes for the mvebu gpio driver - Fix build problem for Feroceon due to missing ifdefs - Fix lsxl DTS files" * tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (31 commits) ARM: kirkwood: fix buttons on lsxl boards ARM: kirkwood: fix LEDs names for lsxl boards ARM: Kirkwood: fix disabling CACHE_FEROCEON_L2 gpio: mvebu: Add missing breaks in mvebu_gpio_irq_set_type ARM: dove: Add crypto engine to DT ARM: dove: Remove watchdog from DT ARM: dove: Restructure SoC device tree descriptor ARM: dove: Fix clock names of sata and gbe ARM: dove: Fix tauros2 device tree init ARM: dove: Add pcie clock support ARM: OMAP2+: Allow kernel to boot even if GPMC fails to reserve memory ARM: OMAP: clockdomain: Fix locking on _clkdm_clk_hwmod_enable / disable ARM: s3c: mark s3c2440_clk_add as __init_refok spi/s3c64xx: use correct dma_transfer_direction type ARM: OMAP4: devices: fixup OMAP4 DMIC platform device error message ARM: OMAP2+: clock data: Add dev-id for the omap-gpmc dummy fck ARM: OMAP: resolve sparse warning concerning debug_card_init() ARM: OMAP4: Fix twd_local_timer_register regression ARM: tegra: add tegra_timer clock ARM: tegra: rename tegra system timer ...
2012-10-19Merge branch 'testing/driver-warnings' of ↵Olof Johansson
git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc into fixes A collection of warning fixes on non-ARM code from Arnd Bergmann: * 'testing/driver-warnings' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: ARM: s3c: mark s3c2440_clk_add as __init_refok spi/s3c64xx: use correct dma_transfer_direction type pcmcia: sharpsl: don't discard sharpsl_pcmcia_ops USB: EHCI: mark ehci_orion_conf_mbus_windows __devinit mm/slob: use min_t() to compare ARCH_SLAB_MINALIGN SCSI: ARM: make fas216_dumpinfo function conditional SCSI: ARM: ncr5380/oak uses no interrupts
2012-10-19Merge branch 'akpm' (Fixes from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "Seven fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (7 patches) lib/dma-debug.c: fix __hash_bucket_find() mm: compaction: correct the nr_strict va isolated check for CMA firmware/memmap: avoid type conflicts with the generic memmap_init() pidns: remove recursion from free_pid_ns() drivers/video/backlight/lm3639_bl.c: return proper error in lm3639_bled_mode_store() error paths kernel/sys.c: fix stack memory content leak via UNAME26 linux/coredump.h needs asm/siginfo.h
2012-10-19mm: compaction: correct the nr_strict va isolated check for CMAMel Gorman
Thierry reported that the "iron out" patch for isolate_freepages_block() had problems due to the strict check being too strict with "mm: compaction: Iron out isolate_freepages_block() and isolate_freepages_range() -fix1". It's possible that more pages than necessary are isolated but the check still fails and I missed that this fix was not picked up before RC1. This same problem has been identified in 3.7-RC1 by Tony Prisk and should be addressed by the following patch. Signed-off-by: Mel Gorman <mgorman@suse.de> Tested-by: Tony Prisk <linux@prisktech.co.nz> Reported-by: Thierry Reding <thierry.reding@avionic-design.de> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Richard Davies <richard@arachsys.com> Cc: Shaohua Li <shli@kernel.org> Cc: Avi Kivity <avi@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-19remap_file_pages: correctly handle the case of a NULL vm_ops pointerLinus Torvalds
In commit 0b173bc4daa8 ("mm: kill vma flag VM_CAN_NONLINEAR") we replaced the VM_CAN_NONLINEAR test with checking whether the mapping has a '->remap_pages()' vm operation, but there is no guarantee that there it even has a vm_ops pointer at all. Add the appropriate test for NULL vm_ops. Reported-by: Sasha Levin <levinsasha928@gmail.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-16mm, mempolicy: fix printing stack contents in numa_mapsDavid Rientjes
When reading /proc/pid/numa_maps, it's possible to return the contents of the stack where the mempolicy string should be printed if the policy gets freed from beneath us. This happens because mpol_to_str() may return an error the stack-allocated buffer is then printed without ever being stored. There are two possible error conditions in mpol_to_str(): - if the buffer allocated is insufficient for the string to be stored, and - if the mempolicy has an invalid mode. The first error condition is not triggered in any of the callers to mpol_to_str(): at least 50 bytes is always allocated on the stack and this is sufficient for the string to be written. A future patch should convert this into BUILD_BUG_ON() since we know the maximum strlen possible, but that's not -rc material. The second error condition is possible if a race occurs in dropping a reference to a task's mempolicy causing it to be freed during the read(). The slab poison value is then used for the mode and mpol_to_str() returns -EINVAL. This race is only possible because get_vma_policy() believes that mm->mmap_sem protects task->mempolicy, which isn't true. The exit path does not hold mm->mmap_sem when dropping the reference or setting task->mempolicy to NULL: it uses task_lock(task) instead. Thus, it's required for the caller of a task mempolicy to hold task_lock(task) while grabbing the mempolicy and reading it. Callers with a vma policy store their mempolicy earlier and can simply increment the reference count so it's guaranteed not to be freed. Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-15mm: huge_memory: Fix build error.Ralf Baechle
Certain configurations won't implicitly pull in <linux/pagemap.h> resulting in the following build error: mm/huge_memory.c: In function 'release_pte_page': mm/huge_memory.c:1697:2: error: implicit declaration of function 'unlock_page' [-Werror=implicit-function-declaration] mm/huge_memory.c: In function '__collapse_huge_page_isolate': mm/huge_memory.c:1757:3: error: implicit declaration of function 'trylock_page' [-Werror=implicit-function-declaration] cc1: some warnings being treated as errors Reported-by: David Daney <david.daney@cavium.com> Signed-off-by: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-13Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull third pile of VFS updates from Al Viro: "Stuff from Jeff Layton, mostly. Sanitizing interplay between audit and namei, removing a lot of insanity from audit_inode() mess and getting things ready for his ESTALE patchset." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: procfs: don't need a PATH_MAX allocation to hold a string representation of an int vfs: embed struct filename inside of names_cache allocation if possible audit: make audit_inode take struct filename vfs: make path_openat take a struct filename pointer vfs: turn do_path_lookup into wrapper around struct filename variant audit: allow audit code to satisfy getname requests from its names_list vfs: define struct filename and have getname() return it vfs: unexport getname and putname symbols acct: constify the name arg to acct_on vfs: allocate page instead of names_cache buffer in mount_block_root audit: overhaul __audit_inode_child to accomodate retrying audit: optimize audit_compare_dname_path audit: make audit_compare_dname_path use parent_len helper audit: remove dirlen argument to audit_compare_dname_path audit: set the name_len in audit_inode for parent lookups audit: add a new "type" field to audit_names struct audit: reverse arguments to audit_inode_child audit: no need to walk list in audit_inode if name is NULL audit: pass in dentry to audit_copy_inode wherever possible audit: remove unnecessary NULL ptr checks from do_path_lookup
2012-10-12vfs: make path_openat take a struct filename pointerJeff Layton
...and fix up the callers. For do_file_open_root, just declare a struct filename on the stack and fill out the .name field. For do_filp_open, make it also take a struct filename pointer, and fix up its callers to call it appropriately. For filp_open, add a variant that takes a struct filename pointer and turn filp_open into a wrapper around it. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12vfs: define struct filename and have getname() return itJeff Layton
getname() is intended to copy pathname strings from userspace into a kernel buffer. The result is just a string in kernel space. It would however be quite helpful to be able to attach some ancillary info to the string. For instance, we could attach some audit-related info to reduce the amount of audit-related processing needed. When auditing is enabled, we could also call getname() on the string more than once and not need to recopy it from userspace. This patchset converts the getname()/putname() interfaces to return a struct instead of a string. For now, the struct just tracks the string in kernel space and the original userland pointer for it. Later, we'll add other information to the struct as it becomes convenient. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12Merge branch 'slab/urgent' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux Pull SLAB fix from Pekka Enberg: "This contains a lockdep false positive fix from Jiri Kosina I missed from the previous pull request." * 'slab/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: mm, slab: release slab_mutex earlier in kmem_cache_destroy()
2012-10-12Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull pile 2 of vfs updates from Al Viro: "Stuff in this one - assorted fixes, lglock tidy-up, death to lock_super(). There'll be a VFS pile tomorrow (with patches from Jeff Layton, sanitizing getname() and related parts of audit and preparing for ESTALE fixes), but I'd rather push the stuff in this one ASAP - some of the bugs closed here are quite unpleasant." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: vfs: bogus warnings in fs/namei.c consitify do_mount() arguments lglock: add DEFINE_STATIC_LGLOCK() lglock: make the per_cpu locks static lglock: remove unused DEFINE_LGLOCK_LOCKDEP() MAX_LFS_FILESIZE definition for 64bit needs LL... tmpfs,ceph,gfs2,isofs,reiserfs,xfs: fix fh_len checking vfs: drop lock/unlock super ufs: drop lock/unlock super sysv: drop lock/unlock super hpfs: drop lock/unlock super fat: drop lock/unlock super ext3: drop lock/unlock super exofs: drop lock/unlock super dup3: Return an error when oldfd == newfd. fs: handle failed audit_log_start properly fs: prevent use after free in auditing when symlink following was denied
2012-10-12Merge branch 'writeback-for-next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux Pull writeback fixes from Fengguang Wu: "Three trivial writeback fixes" * 'writeback-for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux: CPU hotplug, writeback: Don't call writeback_set_ratelimit() too often during hotplug writeback: correct comment for move_expired_inodes() backing-dev: use kstrto* in preference to simple_strtoul
2012-10-10mm, slab: release slab_mutex earlier in kmem_cache_destroy()Jiri Kosina
Commit 1331e7a1bbe1 ("rcu: Remove _rcu_barrier() dependency on __stop_machine()") introduced slab_mutex -> cpu_hotplug.lock dependency through kmem_cache_destroy() -> rcu_barrier() -> _rcu_barrier() -> get_online_cpus(). Lockdep thinks that this might actually result in ABBA deadlock, and reports it as below: === [ cut here ] === ====================================================== [ INFO: possible circular locking dependency detected ] 3.6.0-rc5-00004-g0d8ee37 #143 Not tainted ------------------------------------------------------- kworker/u:2/40 is trying to acquire lock: (rcu_sched_state.barrier_mutex){+.+...}, at: [<ffffffff810f2126>] _rcu_barrier+0x26/0x1e0 but task is already holding lock: (slab_mutex){+.+.+.}, at: [<ffffffff81176e15>] kmem_cache_destroy+0x45/0xe0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (slab_mutex){+.+.+.}: [<ffffffff810ae1e2>] validate_chain+0x632/0x720 [<ffffffff810ae5d9>] __lock_acquire+0x309/0x530 [<ffffffff810ae921>] lock_acquire+0x121/0x190 [<ffffffff8155d4cc>] __mutex_lock_common+0x5c/0x450 [<ffffffff8155d9ee>] mutex_lock_nested+0x3e/0x50 [<ffffffff81558cb5>] cpuup_callback+0x2f/0xbe [<ffffffff81564b83>] notifier_call_chain+0x93/0x140 [<ffffffff81076f89>] __raw_notifier_call_chain+0x9/0x10 [<ffffffff8155719d>] _cpu_up+0xba/0x14e [<ffffffff815572ed>] cpu_up+0xbc/0x117 [<ffffffff81ae05e3>] smp_init+0x6b/0x9f [<ffffffff81ac47d6>] kernel_init+0x147/0x1dc [<ffffffff8156ab44>] kernel_thread_helper+0x4/0x10 -> #1 (cpu_hotplug.lock){+.+.+.}: [<ffffffff810ae1e2>] validate_chain+0x632/0x720 [<ffffffff810ae5d9>] __lock_acquire+0x309/0x530 [<ffffffff810ae921>] lock_acquire+0x121/0x190 [<ffffffff8155d4cc>] __mutex_lock_common+0x5c/0x450 [<ffffffff8155d9ee>] mutex_lock_nested+0x3e/0x50 [<ffffffff81049197>] get_online_cpus+0x37/0x50 [<ffffffff810f21bb>] _rcu_barrier+0xbb/0x1e0 [<ffffffff810f22f0>] rcu_barrier_sched+0x10/0x20 [<ffffffff810f2309>] rcu_barrier+0x9/0x10 [<ffffffff8118c129>] deactivate_locked_super+0x49/0x90 [<ffffffff8118cc01>] deactivate_super+0x61/0x70 [<ffffffff811aaaa7>] mntput_no_expire+0x127/0x180 [<ffffffff811ab49e>] sys_umount+0x6e/0xd0 [<ffffffff81569979>] system_call_fastpath+0x16/0x1b -> #0 (rcu_sched_state.barrier_mutex){+.+...}: [<ffffffff810adb4e>] check_prev_add+0x3de/0x440 [<ffffffff810ae1e2>] validate_chain+0x632/0x720 [<ffffffff810ae5d9>] __lock_acquire+0x309/0x530 [<ffffffff810ae921>] lock_acquire+0x121/0x190 [<ffffffff8155d4cc>] __mutex_lock_common+0x5c/0x450 [<ffffffff8155d9ee>] mutex_lock_nested+0x3e/0x50 [<ffffffff810f2126>] _rcu_barrier+0x26/0x1e0 [<ffffffff810f22f0>] rcu_barrier_sched+0x10/0x20 [<ffffffff810f2309>] rcu_barrier+0x9/0x10 [<ffffffff81176ea1>] kmem_cache_destroy+0xd1/0xe0 [<ffffffffa04c3154>] nf_conntrack_cleanup_net+0xe4/0x110 [nf_conntrack] [<ffffffffa04c31aa>] nf_conntrack_cleanup+0x2a/0x70 [nf_conntrack] [<ffffffffa04c42ce>] nf_conntrack_net_exit+0x5e/0x80 [nf_conntrack] [<ffffffff81454b79>] ops_exit_list+0x39/0x60 [<ffffffff814551ab>] cleanup_net+0xfb/0x1b0 [<ffffffff8106917b>] process_one_work+0x26b/0x4c0 [<ffffffff81069f3e>] worker_thread+0x12e/0x320 [<ffffffff8106f73e>] kthread+0x9e/0xb0 [<ffffffff8156ab44>] kernel_thread_helper+0x4/0x10 other info that might help us debug this: Chain exists of: rcu_sched_state.barrier_mutex --> cpu_hotplug.lock --> slab_mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(slab_mutex); lock(cpu_hotplug.lock); lock(slab_mutex); lock(rcu_sched_state.barrier_mutex); *** DEADLOCK *** === [ cut here ] === This is actually a false positive. Lockdep has no way of knowing the fact that the ABBA can actually never happen, because of special semantics of cpu_hotplug.refcount and its handling in cpu_hotplug_begin(); the mutual exclusion there is not achieved through mutex, but through cpu_hotplug.refcount. The "neither cpu_up() nor cpu_down() will proceed past cpu_hotplug_begin() until everyone who called get_online_cpus() will call put_online_cpus()" semantics is totally invisible to lockdep. This patch therefore moves the unlock of slab_mutex so that rcu_barrier() is being called with it unlocked. It has two advantages: - it slightly reduces hold time of slab_mutex; as it's used to protect the cachep list, it's not necessary to hold it over kmem_cache_free() call any more - it silences the lockdep false positive warning, as it avoids lockdep ever learning about slab_mutex -> cpu_hotplug.lock dependency Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-10-09tmpfs,ceph,gfs2,isofs,reiserfs,xfs: fix fh_len checkingHugh Dickins
Fuzzing with trinity oopsed on the 1st instruction of shmem_fh_to_dentry(), u64 inum = fid->raw[2]; which is unhelpfully reported as at the end of shmem_alloc_inode(): BUG: unable to handle kernel paging request at ffff880061cd3000 IP: [<ffffffff812190d0>] shmem_alloc_inode+0x40/0x40 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Call Trace: [<ffffffff81488649>] ? exportfs_decode_fh+0x79/0x2d0 [<ffffffff812d77c3>] do_handle_open+0x163/0x2c0 [<ffffffff812d792c>] sys_open_by_handle_at+0xc/0x10 [<ffffffff83a5f3f8>] tracesys+0xe1/0xe6 Right, tmpfs is being stupid to access fid->raw[2] before validating that fh_len includes it: the buffer kmalloc'ed by do_sys_name_to_handle() may fall at the end of a page, and the next page not be present. But some other filesystems (ceph, gfs2, isofs, reiserfs, xfs) are being careless about fh_len too, in fh_to_dentry() and/or fh_to_parent(), and could oops in the same way: add the missing fh_len checks to those. Reported-by: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Sage Weil <sage@inktank.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-09mm/slob: use min_t() to compare ARCH_SLAB_MINALIGNArnd Bergmann
The definition of ARCH_SLAB_MINALIGN is architecture dependent and can be either of type size_t or int. Comparing that value with ARCH_KMALLOC_MINALIGN can cause harmless warnings on platforms where they are different. Since both are always small positive integer numbers, using the size_t type to compare them is safe and gets rid of the warning. Without this patch, building ARM collie_defconfig results in: mm/slob.c: In function '__kmalloc_node': mm/slob.c:431:152: warning: comparison of distinct pointer types lacks a cast [enabled by default] mm/slob.c: In function 'kfree': mm/slob.c:484:153: warning: comparison of distinct pointer types lacks a cast [enabled by default] mm/slob.c: In function 'ksize': mm/slob.c:503:153: warning: comparison of distinct pointer types lacks a cast [enabled by default] Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org>
2012-10-09mm: thp: Use more portable PMD clearing sequenece in zap_huge_pmd().David Miller
Invalidation sequences are handled in various ways on various architectures. One way, which sparc64 uses, is to let the set_*_at() functions accumulate pending flushes into a per-cpu array. Then the flush_tlb_range() et al. calls process the pending TLB flushes. In this regime, the __tlb_remove_*tlb_entry() implementations are essentially NOPs. The canonical PTE zap in mm/memory.c is: ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); tlb_remove_tlb_entry(tlb, pte, addr); With a subsequent tlb_flush_mmu() if needed. Mirror this in the THP PMD zapping using: orig_pmd = pmdp_get_and_clear(tlb->mm, addr, pmd); page = pmd_page(orig_pmd); tlb_remove_pmd_tlb_entry(tlb, pmd, addr); And we properly accomodate TLB flush mechanims like the one described above. Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: Add and use update_mmu_cache_pmd() in transparent huge page code.David Miller
The transparent huge page code passes a PMD pointer in as the third argument of update_mmu_cache(), which expects a PTE pointer. This never got noticed because X86 implements update_mmu_cache() as a macro and thus we don't get any type checking, and X86 is the only architecture which supports transparent huge pages currently. Before other architectures can support transparent huge pages properly we need to add a new interface which will take a PMD pointer as the third argument rather than a PTE pointer. [akpm@linux-foundation.org: implement update_mm_cache_pmd() for s390] Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09memory-hotplug: suppress "Trying to free nonexistent resource ↵Yasuaki Ishimatsu
<XXXXXXXXXXXXXXXX-YYYYYYYYYYYYYYYY>" warning When our x86 box calls __remove_pages(), release_mem_region() shows many warnings. And x86 box cannot unregister iomem_resource. "Trying to free nonexistent resource <XXXXXXXXXXXXXXXX-YYYYYYYYYYYYYYYY>" release_mem_region() has been changed to be called in each PAGES_PER_SECTION by commit de7f0cba9678 ("memory hotplug: release memory regions in PAGES_PER_SECTION chunks"). Because powerpc registers iomem_resource in each PAGES_PER_SECTION chunk. But when I hot add memory on x86 box, iomem_resource is register in each _CRS not PAGES_PER_SECTION chunk. So x86 box unregisters iomem_resource. The patch fixes the problem. Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Jiang Liu <liuj97@gmail.com> Cc: Len Brown <len.brown@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Christoph Lameter <cl@linux.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Nathan Fontenot <nfont@austin.ibm.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: document PageHuge somewhatAndrew Morton
Acked-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: use %pK for /proc/vmallocinfoKees Cook
In the paranoid case of sysctl kernel.kptr_restrict=2, mask the kernel virtual addresses in /proc/vmallocinfo too. Signed-off-by: Kees Cook <keescook@chromium.org> Reported-by: Brad Spengler <spender@grsecurity.net> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm, thp: fix mlock statisticsDavid Rientjes
NR_MLOCK is only accounted in single page units: there's no logic to handle transparent hugepages. This patch checks the appropriate number of pages to adjust the statistics by so that the correct amount of memory is reflected. Currently: $ grep Mlocked /proc/meminfo Mlocked: 19636 kB #define MAP_SIZE (4 << 30) /* 4GB */ void *ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0); mlock(ptr, MAP_SIZE); $ grep Mlocked /proc/meminfo Mlocked: 29844 kB munlock(ptr, MAP_SIZE); $ grep Mlocked /proc/meminfo Mlocked: 19636 kB And with this patch: $ grep Mlock /proc/meminfo Mlocked: 19636 kB mlock(ptr, MAP_SIZE); $ grep Mlock /proc/meminfo Mlocked: 4213664 kB munlock(ptr, MAP_SIZE); $ grep Mlock /proc/meminfo Mlocked: 19636 kB Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Hugh Dickens <hughd@google.com> Acked-by: Hugh Dickins <hughd@google.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm, thp: fix mapped pages avoiding unevictable list on mlockDavid Rientjes
When a transparent hugepage is mapped and it is included in an mlock() range, follow_page() incorrectly avoids setting the page's mlock bit and moving it to the unevictable lru. This is evident if you try to mlock(), munlock(), and then mlock() a range again. Currently: #define MAP_SIZE (4 << 30) /* 4GB */ void *ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0); mlock(ptr, MAP_SIZE); $ grep -E "Unevictable|Inactive\(anon" /proc/meminfo Inactive(anon): 6304 kB Unevictable: 4213924 kB munlock(ptr, MAP_SIZE); Inactive(anon): 4186252 kB Unevictable: 19652 kB mlock(ptr, MAP_SIZE); Inactive(anon): 4198556 kB Unevictable: 21684 kB Notice that less than 2MB was added to the unevictable list; this is because these pages in the range are not transparent hugepages since the 4GB range was allocated with mmap() and has no specific alignment. If posix_memalign() were used instead, unevictable would not have grown at all on the second mlock(). The fix is to call mlock_vma_page() so that the mlock bit is set and the page is added to the unevictable list. With this patch: mlock(ptr, MAP_SIZE); Inactive(anon): 4056 kB Unevictable: 4213940 kB munlock(ptr, MAP_SIZE); Inactive(anon): 4198268 kB Unevictable: 19636 kB mlock(ptr, MAP_SIZE); Inactive(anon): 4008 kB Unevictable: 4213940 kB Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Hugh Dickins <hughd@google.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09memory-hotplug: update memory block's state and notify userspaceWen Congyang
remove_memory() will be called when hot removing a memory device. But even if offlining memory, we cannot notice it. So the patch updates the memory block's state and sends notification to userspace. Additionally, the memory device may contain more than one memory block. If the memory block has been offlined, __offline_pages() will fail. So we should try to offline one memory block at a time. Thus remove_memory() also check each memory block's state. So there is no need to check the memory block's state before calling remove_memory(). Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Jiang Liu <liuj97@gmail.com> Cc: Len Brown <len.brown@intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09memory-hotplug: preparation to notify memory block's state at memory hot removeWen Congyang
remove_memory() is called in two cases: 1. echo offline >/sys/devices/system/memory/memoryXX/state 2. hot remove a memory device In the 1st case, the memory block's state is changed and the notification that memory block's state changed is sent to userland after calling remove_memory(). So user can notice memory block is changed. But in the 2nd case, the memory block's state is not changed and the notification is not also sent to userspcae even if calling remove_memory(). So user cannot notice memory block is changed. For adding the notification at memory hot remove, the patch just prepare as follows: 1st case uses offline_pages() for offlining memory. 2nd case uses remove_memory() for offlining memory and changing memory block's state and notifing the information. The patch does not implement notification to remove_memory(). Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Jiang Liu <liuj97@gmail.com> Cc: Len Brown <len.brown@intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: avoid section mismatch warning for memblock_type_nameRaghavendra D Prabhu
Following section mismatch warning is thrown during build; WARNING: vmlinux.o(.text+0x32408f): Section mismatch in reference from the function memblock_type_name() to the variable .meminit.data:memblock The function memblock_type_name() references the variable __meminitdata memblock. This is often because memblock_type_name lacks a __meminitdata annotation or the annotation of memblock is wrong. This is because memblock_type_name makes reference to memblock variable with attribute __meminitdata. Hence, the warning (even if the function is inline). [akpm@linux-foundation.org: remove inline] Signed-off-by: Raghavendra D Prabhu <rprabhu@wnohang.net> Cc: Tejun Heo <tj@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09cma: decrease cc.nr_migratepages after reclaiming pagelistMinchan Kim
reclaim_clean_pages_from_list() reclaims clean pages before migration so cc.nr_migratepages should be updated. Currently, there is no problem but it can be wrong if we try to use the value in future. Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09CMA: migrate mlocked pagesMinchan Kim
Presently CMA cannot migrate mlocked pages so it ends up failing to allocate contiguous memory space. This patch makes mlocked pages be migrated out. Of course, it can affect realtime processes but in CMA usecase, contiguous memory allocation failing is far worse than access latency to an mlocked page being variable while CMA is running. If someone wants to make the system realtime, he shouldn't enable CMA because stalls can still happen at random times. [akpm@linux-foundation.org: tweak comment text, per Mel] Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm/memory.c: fix typo in commentRobert P. J. Day
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: remove unevictable_pgs_mlockfreedHugh Dickins
Simply remove UNEVICTABLE_MLOCKFREED and unevictable_pgs_mlockfreed line from /proc/vmstat: Johannes and Mel point out that it was very unlikely to have been used by any tool, and of course we can restore it easily enough if that turns out to be wrong. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michel Lespinasse <walken@google.com> Cc: Ying Han <yinghan@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09memory-hotplug: fix zone stat mismatchMinchan Kim
During memory-hotplug, I found NR_ISOLATED_[ANON|FILE] are increasing, causing the kernel to hang. When the system doesn't have enough free pages, it enters reclaim but never reclaim any pages due to too_many_isolated()==true and loops forever. The cause is that when we do memory-hotadd after memory-remove, __zone_pcp_update() clears a zone's ZONE_STAT_ITEMS in setup_pageset() although the vm_stat_diff of all CPUs still have values. In addtion, when we offline all pages of the zone, we reset them in zone_pcp_reset without draining so we loss some zone stat item. Reviewed-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: revert 0def08e3 ("mm/mempolicy.c: check return code of check_range")Minchan Kim
Revert commit 0def08e3acc2 because check_range can't fail in migrate_to_node with considering current usecases. Quote from Johannes : I think it makes sense to revert. Not because of the semantics, but I : just don't see how check_range() could even fail for this callsite: : : 1. we pass mm->mmap->vm_start in there, so we should not fail due to : find_vma() : : 2. we pass MPOL_MF_DISCONTIG_OK, so the discontig checks do not apply : and so can not fail : : 3. we pass MPOL_MF_MOVE | MPOL_MF_MOVE_ALL, the page table loops will : continue until addr == end, so we never fail with -EIO And I added a new VM_BUG_ON for checking migrate_to_node's future usecase which might pass to MPOL_MF_STRICT. Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vasiliy Kulikov <segooon@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: wrap calls to set_pte_at_notify with invalidate_range_start and ↵Haggai Eran
invalidate_range_end In order to allow sleeping during invalidate_page mmu notifier calls, we need to avoid calling when holding the PT lock. In addition to its direct calls, invalidate_page can also be called as a substitute for a change_pte call, in case the notifier client hasn't implemented change_pte. This patch drops the invalidate_page call from change_pte, and instead wraps all calls to change_pte with invalidate_range_start and invalidate_range_end calls. Note that change_pte still cannot sleep after this patch, and that clients implementing change_pte should not take action on it in case the number of outstanding invalidate_range_start calls is larger than one, otherwise they might miss a later invalidation. Signed-off-by: Haggai Eran <haggaie@mellanox.com> Cc: Andrea Arcangeli <andrea@qumranet.com> Cc: Sagi Grimberg <sagig@mellanox.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Or Gerlitz <ogerlitz@mellanox.com> Cc: Haggai Eran <haggaie@mellanox.com> Cc: Shachar Raindel <raindel@mellanox.com> Cc: Liran Liss <liranl@mellanox.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Avi Kivity <avi@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: move all mmu notifier invocations to be done outside the PT lockSagi Grimberg
In order to allow sleeping during mmu notifier calls, we need to avoid invoking them under the page table spinlock. This patch solves the problem by calling invalidate_page notification after releasing the lock (but before freeing the page itself), or by wrapping the page invalidation with calls to invalidate_range_begin and invalidate_range_end. To prevent accidental changes to the invalidate_range_end arguments after the call to invalidate_range_begin, the patch introduces a convention of saving the arguments in consistently named locals: unsigned long mmun_start; /* For mmu_notifiers */ unsigned long mmun_end; /* For mmu_notifiers */ ... mmun_start = ... mmun_end = ... mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); ... mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); The patch changes code to use this convention for all calls to mmu_notifier_invalidate_range_start/end, except those where the calls are close enough so that anyone who glances at the code can see the values aren't changing. This patchset is a preliminary step towards on-demand paging design to be added to the RDMA stack. Why do we want on-demand paging for Infiniband? Applications register memory with an RDMA adapter using system calls, and subsequently post IO operations that refer to the corresponding virtual addresses directly to HW. Until now, this was achieved by pinning the memory during the registration calls. The goal of on demand paging is to avoid pinning the pages of registered memory regions (MRs). This will allow users the same flexibility they get when swapping any other part of their processes address spaces. Instead of requiring the entire MR to fit in physical memory, we can allow the MR to be larger, and only fit the current working set in physical memory. Why should anyone care? What problems are users currently experiencing? This can make programming with RDMA much simpler. Today, developers that are working with more data than their RAM can hold need either to deregister and reregister memory regions throughout their process's life, or keep a single memory region and copy the data to it. On demand paging will allow these developers to register a single MR at the beginning of their process's life, and let the operating system manage which pages needs to be fetched at a given time. In the future, we might be able to provide a single memory access key for each process that would provide the entire process's address as one large memory region, and the developers wouldn't need to register memory regions at all. Is there any prospect that any other subsystems will utilise these infrastructural changes? If so, which and how, etc? As for other subsystems, I understand that XPMEM wanted to sleep in MMU notifiers, as Christoph Lameter wrote at http://lkml.indiana.edu/hypermail/linux/kernel/0802.1/0460.html and perhaps Andrea knows about other use cases. Scheduling in mmu notifications is required since we need to sync the hardware with the secondary page tables change. A TLB flush of an IO device is inherently slower than a CPU TLB flush, so our design works by sending the invalidation request to the device, and waiting for an interrupt before exiting the mmu notifier handler. Avi said: kvm may be a buyer. kvm::mmu_lock, which serializes guest page faults, also protects long operations such as destroying large ranges. It would be good to convert it into a spinlock, but as it is used inside mmu notifiers, this cannot be done. (there are alternatives, such as keeping the spinlock and using a generation counter to do the teardown in O(1), which is what the "may" is doing up there). [akpm@linux-foundation.orgpossible speed tweak in hugetlb_cow(), cleanups] Signed-off-by: Andrea Arcangeli <andrea@qumranet.com> Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Haggai Eran <haggaie@mellanox.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Or Gerlitz <ogerlitz@mellanox.com> Cc: Haggai Eran <haggaie@mellanox.com> Cc: Shachar Raindel <raindel@mellanox.com> Cc: Liran Liss <liranl@mellanox.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Avi Kivity <avi@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09hugetlb: do not use vma_hugecache_offset() for vma_prio_tree_foreachMichal Hocko
Commit 0c176d52b0b2 ("mm: hugetlb: fix pgoff computation when unmapping page from vma") fixed pgoff calculation but it has replaced it by vma_hugecache_offset() which is not approapriate for offsets used for vma_prio_tree_foreach() because that one expects index in page units rather than in huge_page_shift. Johannes said: : The resulting index may not be too big, but it can be too small: assume : hpage size of 2M and the address to unmap to be 0x200000. This is regular : page index 512 and hpage index 1. If you have a VMA that maps the file : only starting at the second huge page, that VMAs vm_pgoff will be 512 but : you ask for offset 1 and miss it even though it does map the page of : interest. hugetlb_cow() will try to unmap, miss the vma, and retry the : cow until the allocation succeeds or the skipped vma(s) go away. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Hillf Danton <dhillf@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm, numa: reclaim from all nodes within reclaim distanceDavid Rientjes
RECLAIM_DISTANCE represents the distance between nodes at which it is deemed too costly to allocate from; it's preferred to try to reclaim from a local zone before falling back to allocating on a remote node with such a distance. To do this, zone_reclaim_mode is set if the distance between any two nodes on the system is greather than this distance. This, however, ends up causing the page allocator to reclaim from every zone regardless of its affinity. What we really want is to reclaim only from zones that are closer than RECLAIM_DISTANCE. This patch adds a nodemask to each node that represents the set of nodes that are within this distance. During the zone iteration, if the bit for a zone's node is set for the local node, then reclaim is attempted; otherwise, the zone is skipped. [akpm@linux-foundation.org: fix CONFIG_NUMA=n build] Signed-off-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: remove free_page_mlockHugh Dickins
We should not be seeing non-0 unevictable_pgs_mlockfreed any longer. So remove free_page_mlock() from the page freeing paths: __PG_MLOCKED is already in PAGE_FLAGS_CHECK_AT_FREE, so free_pages_check() will now be checking it, reporting "BUG: Bad page state" if it's ever found set. Comment UNEVICTABLE_MLOCKFREED and unevictable_pgs_mlockfreed always 0. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michel Lespinasse <walken@google.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: use clear_page_mlock() in page_remove_rmap()Hugh Dickins
We had thought that pages could no longer get freed while still marked as mlocked; but Johannes Weiner posted this program to demonstrate that truncating an mlocked private file mapping containing COWed pages is still mishandled: #include <sys/types.h> #include <sys/mman.h> #include <sys/stat.h> #include <stdlib.h> #include <unistd.h> #include <fcntl.h> #include <stdio.h> int main(void) { char *map; int fd; system("grep mlockfreed /proc/vmstat"); fd = open("chigurh", O_CREAT|O_EXCL|O_RDWR); unlink("chigurh"); ftruncate(fd, 4096); map = mmap(NULL, 4096, PROT_WRITE, MAP_PRIVATE, fd, 0); map[0] = 11; mlock(map, sizeof(fd)); ftruncate(fd, 0); close(fd); munlock(map, sizeof(fd)); munmap(map, 4096); system("grep mlockfreed /proc/vmstat"); return 0; } The anon COWed pages are not caught by truncation's clear_page_mlock() of the pagecache pages; but unmap_mapping_range() unmaps them, so we ought to look out for them there in page_remove_rmap(). Indeed, why should truncation or invalidation be doing the clear_page_mlock() when removing from pagecache? mlock is a property of mapping in userspace, not a property of pagecache: an mlocked unmapped page is nonsensical. Reported-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ying Han <yinghan@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: remove vma arg from page_evictableHugh Dickins
page_evictable(page, vma) is an irritant: almost all its callers pass NULL for vma. Remove the vma arg and use mlocked_vma_newpage(vma, page) explicitly in the couple of places it's needed. But in those places we don't even need page_evictable() itself! They're dealing with a freshly allocated anonymous page, which has no "mapping" and cannot be mlocked yet. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michel Lespinasse <walken@google.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: fix invalidate_complete_page2() lock orderingHugh Dickins
In fuzzing with trinity, lockdep protested "possible irq lock inversion dependency detected" when isolate_lru_page() reenabled interrupts while still holding the supposedly irq-safe tree_lock: invalidate_inode_pages2 invalidate_complete_page2 spin_lock_irq(&mapping->tree_lock) clear_page_mlock isolate_lru_page spin_unlock_irq(&zone->lru_lock) isolate_lru_page() is correct to enable interrupts unconditionally: invalidate_complete_page2() is incorrect to call clear_page_mlock() while holding tree_lock, which is supposed to nest inside lru_lock. Both truncate_complete_page() and invalidate_complete_page() call clear_page_mlock() before taking tree_lock to remove page from radix_tree. I guess invalidate_complete_page2() preferred to test PageDirty (again) under tree_lock before committing to the munlock; but since the page has already been unmapped, its state is already somewhat inconsistent, and no worse if clear_page_mlock() moved up. Reported-by: Sasha Levin <levinsasha928@gmail.com> Deciphered-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michel Lespinasse <walken@google.com> Cc: Ying Han <yinghan@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09memcg: move mem_cgroup_is_root upwardsMichal Hocko
kmem code uses this function and it is better to not use forward declarations for static inline functions as some (older) compilers don't like it: gcc version 4.3.4 [gcc-4_3-branch revision 152973] (SUSE Linux) mm/memcontrol.c:421: warning: `mem_cgroup_is_root' declared inline after being called mm/memcontrol.c:421: warning: previous declaration of `mem_cgroup_is_root' was here Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@parallels.com> Cc: Sachin Kamat <sachin.kamat@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09memcg: cleanup kmem tcp ifdefsMichal Hocko
TCP kmem accounting is currently guarded by CONFIG_MEMCG_KMEM ifdefs but the code is not used if !CONFIG_INET so we should rather test for both. The same applies to net/sock.h, net/ip.h and net/tcp_memcontrol.h but let's keep those outside of any ifdefs because it is considered safer wrt. future maintainability. Tested with - CONFIG_INET && CONFIG_MEMCG_KMEM - !CONFIG_INET && CONFIG_MEMCG_KMEM - CONFIG_INET && !CONFIG_MEMCG_KMEM - !CONFIG_INET && !CONFIG_MEMCG_KMEM Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org> Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: fix-up zone present pagesJianguo Wu
I think zone->present_pages indicates pages that buddy system can management, it should be: zone->present_pages = spanned pages - absent pages - bootmem pages, but is now: zone->present_pages = spanned pages - absent pages - memmap pages. spanned pages: total size, including holes. absent pages: holes. bootmem pages: pages used in system boot, managed by bootmem allocator. memmap pages: pages used by page structs. This may cause zone->present_pages less than it should be. For example, numa node 1 has ZONE_NORMAL and ZONE_MOVABLE, it's memmap and other bootmem will be allocated from ZONE_MOVABLE, so ZONE_NORMAL's present_pages should be spanned pages - absent pages, but now it also minus memmap pages(free_area_init_core), which are actually allocated from ZONE_MOVABLE. When offlining all memory of a zone, this will cause zone->present_pages less than 0, because present_pages is unsigned long type, it is actually a very large integer, it indirectly caused zone->watermark[WMARK_MIN] becomes a large integer(setup_per_zone_wmarks()), than cause totalreserve_pages become a large integer(calculate_totalreserve_pages()), and finally cause memory allocating failure when fork process(__vm_enough_memory()). [root@localhost ~]# dmesg -bash: fork: Cannot allocate memory I think the bug described in http://marc.info/?l=linux-mm&m=134502182714186&w=2 is also caused by wrong zone present pages. This patch intends to fix-up zone->present_pages when memory are freed to buddy system on x86_64 and IA64 platforms. Signed-off-by: Jianguo Wu <wujianguo@huawei.com> Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Reported-by: Petr Tesarik <ptesarik@suse.cz> Tested-by: Petr Tesarik <ptesarik@suse.cz> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>