summaryrefslogtreecommitdiff
path: root/mm/hugetlb.c
AgeCommit message (Collapse)Author
2014-02-15mm: hugetlbfs: fix hugetlbfs optimizationAndrea Arcangeli
commit 27c73ae759774e63313c1fbfeb17ba076cea64c5 upstream. Commit 7cb2ef56e6a8 ("mm: fix aio performance regression for database caused by THP") can cause dereference of a dangling pointer if split_huge_page runs during PageHuge() if there are updates to the tail_page->private field. Also it is repeating compound_head twice for hugetlbfs and it is running compound_head+compound_trans_head for THP when a single one is needed in both cases. The new code within the PageSlab() check doesn't need to verify that the THP page size is never bigger than the smallest hugetlbfs page size, to avoid memory corruption. A longstanding theoretical race condition was found while fixing the above (see the change right after the skip_unlock label, that is relevant for the compound_lock path too). By re-establishing the _mapcount tail refcounting for all compound pages, this also fixes the below problem: echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages BUG: Bad page state in process bash pfn:59a01 page:ffffea000139b038 count:0 mapcount:10 mapping: (null) index:0x0 page flags: 0x1c00000000008000(tail) Modules linked in: CPU: 6 PID: 2018 Comm: bash Not tainted 3.12.0+ #25 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Call Trace: dump_stack+0x55/0x76 bad_page+0xd5/0x130 free_pages_prepare+0x213/0x280 __free_pages+0x36/0x80 update_and_free_page+0xc1/0xd0 free_pool_huge_page+0xc2/0xe0 set_max_huge_pages.part.58+0x14c/0x220 nr_hugepages_store_common.isra.60+0xd0/0xf0 nr_hugepages_store+0x13/0x20 kobj_attr_store+0xf/0x20 sysfs_write_file+0x189/0x1e0 vfs_write+0xc5/0x1f0 SyS_write+0x55/0xb0 system_call_fastpath+0x16/0x1b Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Tested-by: Khalid Aziz <khalid.aziz@oracle.com> Cc: Pravin Shelar <pshelar@nicira.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [Khalid Aziz: Backported to 3.4] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-07-27futex: Take hugepages into account when generating futex_keyZhang Yi
commit 13d60f4b6ab5b702dc8d2ee20999f98a93728aec upstream. The futex_keys of process shared futexes are generated from the page offset, the mapping host and the mapping index of the futex user space address. This should result in an unique identifier for each futex. Though this is not true when futexes are located in different subpages of an hugepage. The reason is, that the mapping index for all those futexes evaluates to the index of the base page of the hugetlbfs mapping. So a futex at offset 0 of the hugepage mapping and another one at offset PAGE_SIZE of the same hugepage mapping have identical futex_keys. This happens because the futex code blindly uses page->index. Steps to reproduce the bug: 1. Map a file from hugetlbfs. Initialize pthread_mutex1 at offset 0 and pthread_mutex2 at offset PAGE_SIZE of the hugetlbfs mapping. The mutexes must be initialized as PTHREAD_PROCESS_SHARED because PTHREAD_PROCESS_PRIVATE mutexes are not affected by this issue as their keys solely depend on the user space address. 2. Lock mutex1 and mutex2 3. Create thread1 and in the thread function lock mutex1, which results in thread1 blocking on the locked mutex1. 4. Create thread2 and in the thread function lock mutex2, which results in thread2 blocking on the locked mutex2. 5. Unlock mutex2. Despite the fact that mutex2 got unlocked, thread2 still blocks on mutex2 because the futex_key points to mutex1. To solve this issue we need to take the normal page index of the page which contains the futex into account, if the futex is in an hugetlbfs mapping. In other words, we calculate the normal page mapping index of the subpage in the hugetlbfs mapping. Mappings which are not based on hugetlbfs are not affected and still use page->index. Thanks to Mel Gorman who provided a patch for adding proper evaluation functions to the hugetlbfs code to avoid exposing hugetlbfs specific details to the futex code. [ tglx: Massaged changelog ] Signed-off-by: Zhang Yi <zhang.yi20@zte.com.cn> Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn> Tested-by: Ma Chenggong <ma.chenggong@zte.com.cn> Reviewed-by: 'Mel Gorman' <mgorman@suse.de> Acked-by: 'Darren Hart' <dvhart@linux.intel.com> Cc: 'Peter Zijlstra' <peterz@infradead.org> Link: http://lkml.kernel.org/r/000101ce71a6%24a83c5880%24f8b50980%24@com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-06-19mm: migration: add migrate_entry_wait_huge()Naoya Horiguchi
commit 30dad30922ccc733cfdbfe232090cf674dc374dc upstream. When we have a page fault for the address which is backed by a hugepage under migration, the kernel can't wait correctly and do busy looping on hugepage fault until the migration finishes. As a result, users who try to kick hugepage migration (via soft offlining, for example) occasionally experience long delay or soft lockup. This is because pte_offset_map_lock() can't get a correct migration entry or a correct page table lock for hugepage. This patch introduces migration_entry_wait_huge() to solve this. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Andi Kleen <andi@firstfloor.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-04-25hugetlbfs: add swap entry check in follow_hugetlb_page()Naoya Horiguchi
commit 9cc3a5bd40067b9a0fbd49199d0780463fc2140f upstream. With applying the previous patch "hugetlbfs: stop setting VM_DONTDUMP in initializing vma(VM_HUGETLB)" to reenable hugepage coredump, if a memory error happens on a hugepage and the affected processes try to access the error hugepage, we hit VM_BUG_ON(atomic_read(&page->_count) <= 0) in get_page(). The reason for this bug is that coredump-related code doesn't recognise "hugepage hwpoison entry" with which a pmd entry is replaced when a memory error occurs on a hugepage. In other words, physical address information is stored in different bit layout between hugepage hwpoison entry and pmd entry, so follow_hugetlb_page() which is called in get_dump_page() returns a wrong page from a given address. The expected behavior is like this: absent is_swap_pte FOLL_DUMP Expected behavior ------------------------------------------------------------------- true false false hugetlb_fault false true false hugetlb_fault false false false return page true false true skip page (to avoid allocation) false true true hugetlb_fault false false true return page With this patch, we can call hugetlb_fault() and take proper actions (we wait for migration entries, fail with VM_FAULT_HWPOISON_LARGE for hwpoisoned entries,) and as the result we can dump all hugepages except for hwpoisoned ones. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-03-27mm/hugetlb: fix total hugetlbfs pages count when using memory overcommit ↵Wanpeng Li
accouting commit d00285884c0892bb1310df96bce6056e9ce9b9d9 upstream. hugetlb_total_pages is used for overcommit calculations but the current implementation considers only the default hugetlb page size (which is either the first defined hugepage size or the one specified by default_hugepagesz kernel boot parameter). If the system is configured for more than one hugepage size, which is possible since commit a137e1cc6d6e ("hugetlbfs: per mount huge page sizes") then the overcommit estimation done by __vm_enough_memory() (resp. shown by meminfo_proc_show) is not precise - there is an impression of more available/allowed memory. This can lead to an unexpected ENOMEM/EFAULT resp. SIGSEGV when memory is accounted. Testcase: boot: hugepagesz=1G hugepages=1 the default overcommit ratio is 50 before patch: egrep 'CommitLimit' /proc/meminfo CommitLimit: 55434168 kB after patch: egrep 'CommitLimit' /proc/meminfo CommitLimit: 54909880 kB [akpm@linux-foundation.org: coding-style tweak] Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-10-17hugetlb: do not use vma_hugecache_offset() for vma_prio_tree_foreachMichal Hocko
commit 36e4f20af833d1ce196e6a4ade05dc26c44652d1 upstream. Commit 0c176d52b0b2 ("mm: hugetlb: fix pgoff computation when unmapping page from vma") fixed pgoff calculation but it has replaced it by vma_hugecache_offset() which is not approapriate for offsets used for vma_prio_tree_foreach() because that one expects index in page units rather than in huge_page_shift. Johannes said: : The resulting index may not be too big, but it can be too small: assume : hpage size of 2M and the address to unmap to be 0x200000. This is regular : page index 512 and hpage index 1. If you have a VMA that maps the file : only starting at the second huge page, that VMAs vm_pgoff will be 512 but : you ask for offset 1 and miss it even though it does map the page of : interest. hugetlb_cow() will try to unmap, miss the vma, and retry the : cow until the allocation succeeds or the skipped vma(s) go away. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Hillf Danton <dhillf@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-10-17mm: hugetlb: fix pgoff computation when unmapping page from vmaHillf Danton
commit 0c176d52b0b2619f231b2bbf329b90c028134f58 upstream. The computation for pgoff is incorrect, at least with (vma->vm_pgoff >> PAGE_SHIFT) involved. It is fixed with the available method if HPAGE_SIZE is concerned in page cache lookup. [akpm@linux-foundation.org: use vma_hugecache_offset() directly, per Michal] Signed-off-by: Hillf Danton <dhillf@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-10mm: hugetlbfs: close race during teardown of hugetlbfs shared page tablesMel Gorman
commit d833352a4338dc31295ed832a30c9ccff5c7a183 upstream. If a process creates a large hugetlbfs mapping that is eligible for page table sharing and forks heavily with children some of whom fault and others which destroy the mapping then it is possible for page tables to get corrupted. Some teardowns of the mapping encounter a "bad pmd" and output a message to the kernel log. The final teardown will trigger a BUG_ON in mm/filemap.c. This was reproduced in 3.4 but is known to have existed for a long time and goes back at least as far as 2.6.37. It was probably was introduced in 2.6.20 by [39dde65c: shared page table for hugetlb page]. The messages look like this; [ ..........] Lots of bad pmd messages followed by this [ 127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7). [ 127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7). [ 127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7). [ 127.186778] ------------[ cut here ]------------ [ 127.186781] kernel BUG at mm/filemap.c:134! [ 127.186782] invalid opcode: 0000 [#1] SMP [ 127.186783] CPU 7 [ 127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod [ 127.186801] [ 127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR [ 127.186804] RIP: 0010:[<ffffffff810ed6ce>] [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160 [ 127.186809] RSP: 0000:ffff8804144b5c08 EFLAGS: 00010002 [ 127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0 [ 127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00 [ 127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003 [ 127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8 [ 127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8 [ 127.186815] FS: 00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000 [ 127.186816] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0 [ 127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0) [ 127.186821] Stack: [ 127.186822] ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b [ 127.186824] ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98 [ 127.186825] ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000 [ 127.186827] Call Trace: [ 127.186829] [<ffffffff810ed83b>] delete_from_page_cache+0x3b/0x80 [ 127.186832] [<ffffffff811bc925>] truncate_hugepages+0x115/0x220 [ 127.186834] [<ffffffff811bca43>] hugetlbfs_evict_inode+0x13/0x30 [ 127.186837] [<ffffffff811655c7>] evict+0xa7/0x1b0 [ 127.186839] [<ffffffff811657a3>] iput_final+0xd3/0x1f0 [ 127.186840] [<ffffffff811658f9>] iput+0x39/0x50 [ 127.186842] [<ffffffff81162708>] d_kill+0xf8/0x130 [ 127.186843] [<ffffffff81162812>] dput+0xd2/0x1a0 [ 127.186845] [<ffffffff8114e2d0>] __fput+0x170/0x230 [ 127.186848] [<ffffffff81236e0e>] ? rb_erase+0xce/0x150 [ 127.186849] [<ffffffff8114e3ad>] fput+0x1d/0x30 [ 127.186851] [<ffffffff81117db7>] remove_vma+0x37/0x80 [ 127.186853] [<ffffffff81119182>] do_munmap+0x2d2/0x360 [ 127.186855] [<ffffffff811cc639>] sys_shmdt+0xc9/0x170 [ 127.186857] [<ffffffff81410a39>] system_call_fastpath+0x16/0x1b [ 127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff <0f> 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0 [ 127.186868] RIP [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160 [ 127.186870] RSP <ffff8804144b5c08> [ 127.186871] ---[ end trace 7cbac5d1db69f426 ]--- The bug is a race and not always easy to reproduce. To reproduce it I was doing the following on a single socket I7-based machine with 16G of RAM. $ hugeadm --pool-pages-max DEFAULT:13G $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall $ for i in `seq 1 9000`; do ./hugetlbfs-test; done On my particular machine, it usually triggers within 10 minutes but enabling debug options can change the timing such that it never hits. Once the bug is triggered, the machine is in trouble and needs to be rebooted. The machine will respond but processes accessing proc like "ps aux" will hang due to the BUG_ON. shutdown will also hang and needs a hard reset or a sysrq-b. The basic problem is a race between page table sharing and teardown. For the most part page table sharing depends on i_mmap_mutex. In some cases, it is also taking the mm->page_table_lock for the PTE updates but with shared page tables, it is the i_mmap_mutex that is more important. Unfortunately it appears to be also insufficient. Consider the following situation Process A Process B --------- --------- hugetlb_fault shmdt LockWrite(mmap_sem) do_munmap unmap_region unmap_vmas unmap_single_vma unmap_hugepage_range Lock(i_mmap_mutex) Lock(mm->page_table_lock) huge_pmd_unshare/unmap tables <--- (1) Unlock(mm->page_table_lock) Unlock(i_mmap_mutex) huge_pte_alloc ... Lock(i_mmap_mutex) ... vma_prio_walk, find svma, spte ... Lock(mm->page_table_lock) ... share spte ... Unlock(mm->page_table_lock) ... Unlock(i_mmap_mutex) ... hugetlb_no_page <--- (2) free_pgtables unlink_file_vma hugetlb_free_pgd_range remove_vma_list In this scenario, it is possible for Process A to share page tables with Process B that is trying to tear them down. The i_mmap_mutex on its own does not prevent Process A walking Process B's page tables. At (1) above, the page tables are not shared yet so it unmaps the PMDs. Process A sets up page table sharing and at (2) faults a new entry. Process B then trips up on it in free_pgtables. This patch fixes the problem by adding a new function __unmap_hugepage_range_final that is only called when the VMA is about to be destroyed. This function clears VM_MAYSHARE during unmap_hugepage_range() under the i_mmap_mutex. This makes the VMA ineligible for sharing and avoids the race. Superficially this looks like it would then be vunerable to truncate and madvise issues but hugetlbfs has its own truncate handlers so does not use unmap_mapping_range() and does not support madvise(DONTNEED). This should be treated as a -stable candidate if it is merged. Test program is as follows. The test case was mostly written by Michal Hocko with a few minor changes to reproduce this bug. ==== CUT HERE ==== static size_t huge_page_size = (2UL << 20); static size_t nr_huge_page_A = 512; static size_t nr_huge_page_B = 5632; unsigned int get_random(unsigned int max) { struct timeval tv; gettimeofday(&tv, NULL); srandom(tv.tv_usec); return random() % max; } static void play(void *addr, size_t size) { unsigned char *start = addr, *end = start + size, *a; start += get_random(size/2); /* we could itterate on huge pages but let's give it more time. */ for (a = start; a < end; a += 4096) *a = 0; } int main(int argc, char **argv) { key_t key = IPC_PRIVATE; size_t sizeA = nr_huge_page_A * huge_page_size; size_t sizeB = nr_huge_page_B * huge_page_size; int shmidA, shmidB; void *addrA = NULL, *addrB = NULL; int nr_children = 300, n = 0; if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) { perror("shmget:"); return 1; } if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) { perror("shmat"); return 1; } if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) { perror("shmget:"); return 1; } if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) { perror("shmat"); return 1; } fork_child: switch(fork()) { case 0: switch (n%3) { case 0: play(addrA, sizeA); break; case 1: play(addrB, sizeB); break; case 2: break; } break; case -1: perror("fork:"); break; default: if (++n < nr_children) goto fork_child; play(addrA, sizeA); break; } shmdt(addrA); shmdt(addrB); do { wait(NULL); } while (--n > 0); shmctl(shmidA, IPC_RMID, NULL); shmctl(shmidB, IPC_RMID, NULL); return 0; } [akpm@linux-foundation.org: name the declaration's args, fix CONFIG_HUGETLBFS=n build] Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [bwh: Backported to 3.2: - Adjust context - Drop the mmu_gather * parameters] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-02mm/hugetlb: fix warning in alloc_huge_page/dequeue_huge_page_vmaKonstantin Khlebnikov
commit b1c12cbcd0a02527c180a862e8971e249d3b347d upstream. Stable note: Not tracked in Bugzilla. [get|put]_mems_allowed() is extremely expensive and severely impacted page allocator performance. This is part of a series of patches that reduce page allocator overhead. Fix a gcc warning (and bug?) introduced in cc9a6c877 ("cpuset: mm: reduce large amounts of memory barrier related damage v3") Local variable "page" can be uninitialized if the nodemask from vma policy does not intersects with nodemask from cpuset. Even if it doesn't happens it is better to initialize this variable explicitly than to introduce a kernel oops in a weird corner case. mm/hugetlb.c: In function `alloc_huge_page': mm/hugetlb.c:1135:5: warning: `page' may be used uninitialized in this function Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-02cpuset: mm: reduce large amounts of memory barrier related damage v3Mel Gorman
commit cc9a6c8776615f9c194ccf0b63a0aa5628235545 upstream. Stable note: Not tracked in Bugzilla. [get|put]_mems_allowed() is extremely expensive and severely impacted page allocator performance. This is part of a series of patches that reduce page allocator overhead. Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing cpuset's mems") wins a super prize for the largest number of memory barriers entered into fast paths for one commit. [get|put]_mems_allowed is incredibly heavy with pairs of full memory barriers inserted into a number of hot paths. This was detected while investigating at large page allocator slowdown introduced some time after 2.6.32. The largest portion of this overhead was shown by oprofile to be at an mfence introduced by this commit into the page allocator hot path. For extra style points, the commit introduced the use of yield() in an implementation of what looks like a spinning mutex. This patch replaces the full memory barriers on both read and write sides with a sequence counter with just read barriers on the fast path side. This is much cheaper on some architectures, including x86. The main bulk of the patch is the retry logic if the nodemask changes in a manner that can cause a false failure. While updating the nodemask, a check is made to see if a false failure is a risk. If it is, the sequence number gets bumped and parallel allocators will briefly stall while the nodemask update takes place. In a page fault test microbenchmark, oprofile samples from __alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The actual results were 3.3.0-rc3 3.3.0-rc3 rc3-vanilla nobarrier-v2r1 Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%) Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%) Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%) Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%) Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%) Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%) Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%) Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%) Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%) Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%) Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%) Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%) Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%) Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%) Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%) MMTests Statistics: duration Sys Time Running Test (seconds) 135.68 132.17 User+Sys Time Running Test (seconds) 164.2 160.13 Total Elapsed Time (seconds) 123.46 120.87 The overall improvement is small but the System CPU time is much improved and roughly in correlation to what oprofile reported (these performance figures are without profiling so skew is expected). The actual number of page faults is noticeably improved. For benchmarks like kernel builds, the overall benefit is marginal but the system CPU time is slightly reduced. To test the actual bug the commit fixed I opened two terminals. The first ran within a cpuset and continually ran a small program that faulted 100M of anonymous data. In a second window, the nodemask of the cpuset was continually randomised in a loop. Without the commit, the program would fail every so often (usually within 10 seconds) and obviously with the commit everything worked fine. With this patch applied, it also worked fine so the fix should be functionally equivalent. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Miao Xie <miaox@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> [bwh: Forward-ported from 3.0 to 3.2: apply the upstream changes to get_any_partial()] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-25hugepages: fix use after free bug in "quota" handlingDavid Gibson
commit 90481622d75715bfcb68501280a917dbfe516029 upstream. hugetlbfs_{get,put}_quota() are badly named. They don't interact with the general quota handling code, and they don't much resemble its behaviour. Rather than being about maintaining limits on on-disk block usage by particular users, they are instead about maintaining limits on in-memory page usage (including anonymous MAP_PRIVATE copied-on-write pages) associated with a particular hugetlbfs filesystem instance. Worse, they work by having callbacks to the hugetlbfs filesystem code from the low-level page handling code, in particular from free_huge_page(). This is a layering violation of itself, but more importantly, if the kernel does a get_user_pages() on hugepages (which can happen from KVM amongst others), then the free_huge_page() can be delayed until after the associated inode has already been freed. If an unmount occurs at the wrong time, even the hugetlbfs superblock where the "quota" limits are stored may have been freed. Andrew Barry proposed a patch to fix this by having hugepages, instead of storing a pointer to their address_space and reaching the superblock from there, had the hugepages store pointers directly to the superblock, bumping the reference count as appropriate to avoid it being freed. Andrew Morton rejected that version, however, on the grounds that it made the existing layering violation worse. This is a reworked version of Andrew's patch, which removes the extra, and some of the existing, layering violation. It works by introducing the concept of a hugepage "subpool" at the lower hugepage mm layer - that is a finite logical pool of hugepages to allocate from. hugetlbfs now creates a subpool for each filesystem instance with a page limit set, and a pointer to the subpool gets added to each allocated hugepage, instead of the address_space pointer used now. The subpool has its own lifetime and is only freed once all pages in it _and_ all other references to it (i.e. superblocks) are gone. subpools are optional - a NULL subpool pointer is taken by the code to mean that no subpool limits are in effect. Previous discussion of this bug found in: "Fix refcounting in hugetlbfs quota handling.". See: https://lkml.org/lkml/2011/8/11/28 or http://marc.info/?l=linux-mm&m=126928970510627&w=1 v2: Fixed a bug spotted by Hillf Danton, and removed the extra parameter to alloc_huge_page() - since it already takes the vma, it is not necessary. Signed-off-by: Andrew Barry <abarry@cray.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [bwh: Backported to 3.2: adjust context to apply after commit c50ac050811d6485616a193eb0f37bfbd191cc89 'hugetlb: fix resv_map leak in error path', backported in 3.2.20] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-06-10mm: fix vma_resv_map() NULL pointerDave Hansen
commit 4523e1458566a0e8ecfaff90f380dd23acc44d27 upstream. hugetlb_reserve_pages() can be used for either normal file-backed hugetlbfs mappings, or MAP_HUGETLB. In the MAP_HUGETLB, semi-anonymous mode, there is not a VMA around. The new call to resv_map_put() assumed that there was, and resulted in a NULL pointer dereference: BUG: unable to handle kernel NULL pointer dereference at 0000000000000030 IP: vma_resv_map+0x9/0x30 PGD 141453067 PUD 1421e1067 PMD 0 Oops: 0000 [#1] PREEMPT SMP ... Pid: 14006, comm: trinity-child6 Not tainted 3.4.0+ #36 RIP: vma_resv_map+0x9/0x30 ... Process trinity-child6 (pid: 14006, threadinfo ffff8801414e0000, task ffff8801414f26b0) Call Trace: resv_map_put+0xe/0x40 hugetlb_reserve_pages+0xa6/0x1d0 hugetlb_file_setup+0x102/0x2c0 newseg+0x115/0x360 ipcget+0x1ce/0x310 sys_shmget+0x5a/0x60 system_call_fastpath+0x16/0x1b This was reported by Dave Jones, but was reproducible with the libhugetlbfs test cases, so shame on me for not running them in the first place. With this, the oops is gone, and the output of libhugetlbfs's run_tests.py is identical to plain 3.4 again. [ Marked for stable, since this was introduced by commit c50ac050811d ("hugetlb: fix resv_map leak in error path") which was also marked for stable ] Reported-by: Dave Jones <davej@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-06-10hugetlb: fix resv_map leak in error pathDave Hansen
commit c50ac050811d6485616a193eb0f37bfbd191cc89 upstream. When called for anonymous (non-shared) mappings, hugetlb_reserve_pages() does a resv_map_alloc(). It depends on code in hugetlbfs's vm_ops->close() to release that allocation. However, in the mmap() failure path, we do a plain unmap_region() without the remove_vma() which actually calls vm_ops->close(). This is a decent fix. This leak could get reintroduced if new code (say, after hugetlb_reserve_pages() in hugetlbfs_file_mmap()) decides to return an error. But, I think it would have to unroll the reservation anyway. Christoph's test case: http://marc.info/?l=linux-mm&m=133728900729735 Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> [Christoph Lameter: I have rediffed the patch against 2.6.32 and 3.2.0.] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-05-20hugetlb: prevent BUG_ON in hugetlb_fault() -> hugetlb_cow()Chris Metcalf
commit 4998a6c0edce7fae9c0a5463f6ec3fa585258ee7 upstream. Commit 66aebce747eaf ("hugetlb: fix race condition in hugetlb_fault()") added code to avoid a race condition by elevating the page refcount in hugetlb_fault() while calling hugetlb_cow(). However, one code path in hugetlb_cow() includes an assertion that the page count is 1, whereas it may now also have the value 2 in this path. The consensus is that this BUG_ON has served its purpose, so rather than extending it to cover both cases, we just remove it. Signed-off-by: Chris Metcalf <cmetcalf@tilera.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Hillf Danton <dhillf@gmail.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-04-22hugetlb: fix race condition in hugetlb_fault()Chris Metcalf
commit 66aebce747eaf9bc456bf1f1b217d8db843031d0 upstream. The race is as follows: Suppose a multi-threaded task forks a new process (on cpu A), thus bumping up the ref count on all the pages. While the fork is occurring (and thus we have marked all the PTEs as read-only), another thread in the original process (on cpu B) tries to write to a huge page, taking an access violation from the write-protect and calling hugetlb_cow(). Now, suppose the fork() fails. It will undo the COW and decrement the ref count on the pages, so the ref count on the huge page drops back to 1. Meanwhile hugetlb_cow() also decrements the ref count by one on the original page, since the original address space doesn't need it any more, having copied a new page to replace the original page. This leaves the ref count at zero, and when we call unlock_page(), we panic. fork on CPU A fault on CPU B ============= ============== ... down_write(&parent->mmap_sem); down_write_nested(&child->mmap_sem); ... while duplicating vmas if error break; ... up_write(&child->mmap_sem); up_write(&parent->mmap_sem); ... down_read(&parent->mmap_sem); ... lock_page(page); handle COW page_mapcount(old_page) == 2 alloc and prepare new_page ... handle error page_remove_rmap(page); put_page(page); ... fold new_page into pte page_remove_rmap(page); put_page(page); ... oops ==> unlock_page(page); up_read(&parent->mmap_sem); The solution is to take an extra reference to the page while we are holding the lock on it. Signed-off-by: Chris Metcalf <cmetcalf@tilera.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2011-12-29mm: hugetlb: fix non-atomic enqueue of huge pageHillf Danton
If a huge page is enqueued under the protection of hugetlb_lock, then the operation is atomic and safe. Signed-off-by: Hillf Danton <dhillf@gmail.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: <stable@vger.kernel.org> [2.6.37+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-12-09thp: set compound tail page _count to zeroYouquan Song
Commit 70b50f94f1644 ("mm: thp: tail page refcounting fix") keeps all page_tail->_count zero at all times. But the current kernel does not set page_tail->_count to zero if a 1GB page is utilized. So when an IOMMU 1GB page is used by KVM, it wil result in a kernel oops because a tail page's _count does not equal zero. kernel BUG at include/linux/mm.h:386! invalid opcode: 0000 [#1] SMP Call Trace: gup_pud_range+0xb8/0x19d get_user_pages_fast+0xcb/0x192 ? trace_hardirqs_off+0xd/0xf hva_to_pfn+0x119/0x2f2 gfn_to_pfn_memslot+0x2c/0x2e kvm_iommu_map_pages+0xfd/0x1c1 kvm_iommu_map_memslots+0x7c/0xbd kvm_iommu_map_guest+0xaa/0xbf kvm_vm_ioctl_assigned_device+0x2ef/0xa47 kvm_vm_ioctl+0x36c/0x3a2 do_vfs_ioctl+0x49e/0x4e4 sys_ioctl+0x5a/0x7c system_call_fastpath+0x16/0x1b RIP gup_huge_pud+0xf2/0x159 Signed-off-by: Youquan Song <youquan.song@intel.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-15hugetlb: release pages in the error path of hugetlb_cow()Hillf Danton
If we fail to prepare an anon_vma, the {new, old}_page should be released, or they will leak. Signed-off-by: Hillf Danton <dhillf@gmail.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-25mm: hugetlb: fix coding style issuesChris Forbes
Fix coding style issues flagged by checkpatch.pl Signed-off-by: Chris Forbes <chrisf@ijw.co.nz> Acked-by: Eric B Munson <emunson@mgebm.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-25hugetlb: add phys addr to struct huge_bootmem_pageBecky Bruce
This is needed on HIGHMEM systems - we don't always have a virtual address so store the physical address and map it in as needed. [akpm@linux-foundation.org: cleanup] Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-15mm: fix negative commitlimit when gigantic hugepages are allocatedRafael Aquini
When 1GB hugepages are allocated on a system, free(1) reports less available memory than what really is installed in the box. Also, if the total size of hugepages allocated on a system is over half of the total memory size, CommitLimit becomes a negative number. The problem is that gigantic hugepages (order > MAX_ORDER) can only be allocated at boot with bootmem, thus its frames are not accounted to 'totalram_pages'. However, they are accounted to hugetlb_total_pages() What happens to turn CommitLimit into a negative number is this calculation, in fs/proc/meminfo.c: allowed = ((totalram_pages - hugetlb_total_pages()) * sysctl_overcommit_ratio / 100) + total_swap_pages; A similar calculation occurs in __vm_enough_memory() in mm/mmap.c. Also, every vm statistic which depends on 'totalram_pages' will render confusing values, as if system were 'missing' some part of its memory. Impact of this bug: When gigantic hugepages are allocated and sysctl_overcommit_memory == OVERCOMMIT_NEVER. In a such situation, __vm_enough_memory() goes through the mentioned 'allowed' calculation and might end up mistakenly returning -ENOMEM, thus forcing the system to start reclaiming pages earlier than it would be ususal, and this could cause detrimental impact to overall system's performance, depending on the workload. Besides the aforementioned scenario, I can only think of this causing annoyances with memory reports from /proc/meminfo and free(1). [akpm@linux-foundation.org: standardize comment layout] Reported-by: Russ Anderson <rja@sgi.com> Signed-off-by: Rafael Aquini <aquini@linux.com> Acked-by: Russ Anderson <rja@sgi.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-06mm: fix ENOSPC returned by handle_mm_fault()Hugh Dickins
Al Viro observes that in the hugetlb case, handle_mm_fault() may return a value of the kind ENOSPC when its caller is expecting a value of the kind VM_FAULT_SIGBUS: fix alloc_huge_page()'s failure returns. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26mm: don't access vm_flags as 'int'KOSAKI Motohiro
The type of vma->vm_flags is 'unsigned long'. Neither 'int' nor 'unsigned int'. This patch fixes such misuse. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [ Changed to use a typedef - we'll extend it to cover more cases later, since there has been discussion about making it a 64-bit type.. - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25mm: Convert i_mmap_lock to a mutexPeter Zijlstra
Straightforward conversion of i_mmap_lock to a mutex. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Hugh Dickins <hughd@google.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Miller <davem@davemloft.net> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Cc: Tony Luck <tony.luck@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-26Merge branch 'master' into for-nextJiri Kosina
Fast-forwarded to current state of Linus' tree as there are patches to be applied for files that didn't exist on the old branch.
2011-04-10treewide: remove extra semicolonsJustin P. Mattock
Signed-off-by: Justin P. Mattock <justinmattock@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2011-03-31Fix common misspellingsLucas De Marchi
Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-22hugetlbfs: correct handling of negative input to /proc/sys/vm/nr_hugepagesPetr Holasek
When the user inserts a negative value into /proc/sys/vm/nr_hugepages it will cause the kernel to allocate as many hugepages as possible and to then update /proc/meminfo to reflect this. This changes the behavior so that the negative input will result in nr_hugepages value being unchanged. Signed-off-by: Petr Holasek <pholasek@redhat.com> Signed-off-by: Anton Arapov <anton@redhat.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Eric B Munson <emunson@mgebm.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13hugetlb: fix handling of parse errors in sysfsEric B Munson
When parsing changes to the huge page pool sizes made from userspace via the sysfs interface, bogus input values are being covered up by nr_hugepages_store_common and nr_overcommit_hugepages_store returning 0 when strict_strtoul returns an error. This can cause an infinite loop in the nr_hugepages_store code. This patch changes the return value for these functions to -EINVAL when strict_strtoul returns an error. Signed-off-by: Eric B Munson <emunson@mgebm.net> Reported-by: CAI Qian <caiqian@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Eric B Munson <emunson@mgebm.net> Cc: Michal Hocko <mhocko@suse.cz> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13hugetlb: do not allow pagesize >= MAX_ORDER pool adjustmentEric B Munson
Huge pages with order >= MAX_ORDER must be allocated at boot via the kernel command line, they cannot be allocated or freed once the kernel is up and running. Currently we allow values to be written to the sysfs and sysctl files controling pool size for these huge page sizes. This patch makes the store functions for nr_hugepages and nr_overcommit_hugepages return -EINVAL when the pool for a page size >= MAX_ORDER is changed. [akpm@linux-foundation.org: avoid multiple return paths in nr_hugepages_store_common()] [caiqian@redhat.com: add checking in hugetlb_overcommit_handler()] Signed-off-by: Eric B Munson <emunson@mgebm.net> Reported-by: CAI Qian <caiqian@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13hugetlb: check the return value of string conversion in sysctl handlerMichal Hocko
proc_doulongvec_minmax may fail if the given buffer doesn't represent a valid number. If we provide something invalid we will initialize the resulting value (nr_overcommit_huge_pages in this case) to a random value from the stack. The issue was introduced by a3d0c6aa when the default handler has been replaced by the helper function where we do not check the return value. Reproducer: echo "" > /proc/sys/vm/nr_overcommit_hugepages [akpm@linux-foundation.org: correctly propagate proc_doulongvec_minmax return code] Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: CAI Qian <caiqian@redhat.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13mm/hugetlb.c: fix error-path memory leak in nr_hugepages_store_common()Jesper Juhl
The NODEMASK_ALLOC macro may dynamically allocate memory for its second argument ('nodes_allowed' in this context). In nr_hugepages_store_common() we may abort early if strict_strtoul() fails, but in that case we do not free the memory already allocated to 'nodes_allowed', causing a memory leak. This patch closes the leak by freeing the memory in the error path. [akpm@linux-foundation.org: use NODEMASK_FREE, per Minchan Kim] Signed-off-by: Jesper Juhl <jj@chaosbits.net> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13thp: clear_copy_huge_pageAndrea Arcangeli
Move the copy/clear_huge_page functions to common code to share between hugetlb.c and huge_memory.c. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-02mm/hugetlb.c: avoid double unlock_page() in hugetlb_fault()Dean Nelson
Have hugetlb_fault() call unlock_page(page) only if it had previously called lock_page(page). Setting CONFIG_DEBUG_VM=y and then running the libhugetlbfs test suite, resulted in the tripping of VM_BUG_ON(!PageLocked(page)) in unlock_page() having been called by hugetlb_fault() when page == pagecache_page. This patch remedied the problem. Signed-off-by: Dean Nelson <dnelson@redhat.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26mm/hugetlb.c: add missing spin_lock() to hugetlb_cow()Dean Nelson
Add missing spin_lock() of the page_table_lock before an error return in hugetlb_cow(). Callers of hugtelb_cow() expect it to be held upon return. Signed-off-by: Dean Nelson <dnelson@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-08Encode huge page size for VM_FAULT_HWPOISON errorsAndi Kleen
This fixes a problem introduced with the hugetlb hwpoison handling The user space SIGBUS signalling wants to know the size of the hugepage that caused a HWPOISON fault. Unfortunately the architecture page fault handlers do not have easy access to the struct page. Pass the information out in the fault error code instead. I added a separate VM_FAULT_HWPOISON_LARGE bit for this case and encode the hpage index in some free upper bits of the fault code. The small page hwpoison keeps stays with the VM_FAULT_HWPOISON name to minimize changes. Also add code to hugetlb.h to convert that index into a page shift. Will be used in a further patch. Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: fengguang.wu@intel.com Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08hugepage: move is_hugepage_on_freelist inside ifdef to avoid warningAndi Kleen
Fixes warning reported by Stephen Rothwell mm/hugetlb.c:2950: warning: 'is_hugepage_on_freelist' defined but not used for the !CONFIG_MEMORY_FAILURE case. Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08HWPOSION, hugetlb: recover from free hugepage error when !MF_COUNT_INCREASEDNaoya Horiguchi
Currently error recovery for free hugepage works only for MF_COUNT_INCREASED. This patch enables !MF_COUNT_INCREASED case. Free hugepages can be handled directly by alloc_huge_page() and dequeue_hwpoisoned_huge_page(), and both of them are protected by hugetlb_lock, so there is no race between them. Note that this patch defines the refcount of HWPoisoned hugepage dequeued from freelist is 1, deviated from present 0, thereby we can avoid race between unpoison and memory failure on free hugepage. This is reasonable because unlikely to free buddy pages, free hugepage is governed by hugetlbfs even after error handling finishes. And it also makes unpoison code added in the later patch cleaner. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08hugetlb: move refcounting in hugepage allocation inside hugetlb_lockNaoya Horiguchi
Currently alloc_huge_page() raises page refcount outside hugetlb_lock. but it causes race when dequeue_hwpoison_huge_page() runs concurrently with alloc_huge_page(). To avoid it, this patch moves set_page_refcounted() in hugetlb_lock. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08HWPOISON, hugetlb: add free check to dequeue_hwpoison_huge_page()Naoya Horiguchi
This check is necessary to avoid race between dequeue and allocation, which can cause a free hugepage to be dequeued twice and get kernel unstable. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08hugetlb: hugepage migration coreNaoya Horiguchi
This patch extends page migration code to support hugepage migration. One of the potential users of this feature is soft offlining which is triggered by memory corrected errors (added by the next patch.) Todo: - there are other users of page migration such as memory policy, memory hotplug and memocy compaction. They are not ready for hugepage support for now. ChangeLog since v4: - define migrate_huge_pages() - remove changes on isolation/putback_lru_page() ChangeLog since v2: - refactor isolate/putback_lru_page() to handle hugepage - add comment about race on unmap_and_move_huge_page() ChangeLog since v1: - divide migration code path for hugepage - define routine checking migration swap entry for hugetlb - replace "goto" with "if/else" in remove_migration_pte() Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08hugetlb: redefine hugepage copy functionsNaoya Horiguchi
This patch modifies hugepage copy functions to have only destination and source hugepages as arguments for later use. The old ones are renamed from copy_{gigantic,huge}_page() to copy_user_{gigantic,huge}_page(). This naming convention is consistent with that between copy_highpage() and copy_user_highpage(). ChangeLog since v4: - add blank line between local declaration and code - remove unnecessary might_sleep() ChangeLog since v2: - change copy_huge_page() from macro to inline dummy function to avoid compile warning when !CONFIG_HUGETLB_PAGE. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08hugetlb: add allocate function for hugepage migrationNaoya Horiguchi
We can't use existing hugepage allocation functions to allocate hugepage for page migration, because page migration can happen asynchronously with the running processes and page migration users should call the allocation function with physical addresses (not virtual addresses) as arguments. ChangeLog since v3: - unify alloc_buddy_huge_page() and alloc_buddy_huge_page_node() ChangeLog since v2: - remove unnecessary get/put_mems_allowed() (thanks to David Rientjes) ChangeLog since v1: - add comment on top of alloc_huge_page_no_vma() Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Reviewed-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-08hugetlb: fix metadata corruption in hugetlb_fault()Naoya Horiguchi
Since the PageHWPoison() check is for avoiding hwpoisoned page remained in pagecache mapping to the process, it should be done in "found in pagecache" branch, not in the common path. Otherwise, metadata corruption occurs if memory failure happens between alloc_huge_page() and lock_page() because page fault fails with metadata changes remained (such as refcount, mapcount, etc.) This patch moves the check to "found in pagecache" branch and fix the problem. ChangeLog since v2: - remove retry check in "new allocation" path. - make description more detailed - change patch name from "HWPOISON, hugetlb: move PG_HWPoison bit check" Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-09-23hugetlb, rmap: fix confusing page locking in hugetlb_cow()Naoya Horiguchi
The "if (!trylock_page)" block in the avoidcopy path of hugetlb_cow() looks confusing and is buggy. Originally this trylock_page() was intended to make sure that old_page is locked even when old_page != pagecache_page, because then only pagecache_page is locked. This patch fixes it by moving page locking into hugetlb_fault(). Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-23hugetlb, rmap: use hugepage_add_new_anon_rmap() in hugetlb_cow()Naoya Horiguchi
Obviously, setting anon_vma for COWed hugepage should be done by hugepage_add_new_anon_rmap() to scan vmas faster. This patch fixes it. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-12Merge branch 'hwpoison' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6 * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: hugetlb: add missing unlock in avoidcopy path in hugetlb_cow() hwpoison: rename CONFIG HWPOISON, hugetlb: support hwpoison injection for hugepage HWPOISON, hugetlb: detect hwpoison in hugetlb code HWPOISON, hugetlb: isolate corrupted hugepage HWPOISON, hugetlb: maintain mce_bad_pages in handling hugepage error HWPOISON, hugetlb: set/clear PG_hwpoison bits on hugepage HWPOISON, hugetlb: enable error handling path for hugepage hugetlb, rmap: add reverse mapping for hugepage hugetlb: move definition of is_vm_hugetlb_page() to hugepage_inline.h Fix up trivial conflicts in mm/memory-failure.c
2010-08-11hugetlb: add missing unlock in avoidcopy path in hugetlb_cow()Naoya Horiguchi
This patch fixes possible deadlock in hugepage lock_page() by adding missing unlock_page(). libhugetlbfs test will hit this bug when the next patch in this patchset ("hugetlb, HWPOISON: move PG_HWPoison bit check") is applied. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Acked-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-08-11HWPOISON, hugetlb: support hwpoison injection for hugepageNaoya Horiguchi
This patch enables hwpoison injection through debug/hwpoison interfaces, with which we can test memory error handling for free or reserved hugepages (which cannot be tested by madvise() injector). [AK: Export PageHuge too for the injection module] Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andrew Morton <akpm@linux-foundation.org> Acked-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-08-11HWPOISON, hugetlb: detect hwpoison in hugetlb codeNaoya Horiguchi
This patch enables to block access to hwpoisoned hugepage and also enables to block unmapping for it. Dependency: "HWPOISON, hugetlb: enable error handling path for hugepage" Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andrew Morton <akpm@linux-foundation.org> Acked-by: Fengguang Wu <fengguang.wu@intel.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andi Kleen <ak@linux.intel.com>