summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2012-09-19net: Allow driver to limit number of GSO segments per skbBen Hutchings
[ Upstream commit 30b678d844af3305cda5953467005cebb5d7b687 ] A peer (or local user) may cause TCP to use a nominal MSS of as little as 88 (actual MSS of 76 with timestamps). Given that we have a sufficiently prodigious local sender and the peer ACKs quickly enough, it is nevertheless possible to grow the window for such a connection to the point that we will try to send just under 64K at once. This results in a single skb that expands to 861 segments. In some drivers with TSO support, such an skb will require hundreds of DMA descriptors; a substantial fraction of a TX ring or even more than a full ring. The TX queue selected for the skb may stall and trigger the TX watchdog repeatedly (since the problem skb will be retried after the TX reset). This particularly affects sfc, for which the issue is designated as CVE-2012-3412. Therefore: 1. Add the field net_device::gso_max_segs holding the device-specific limit. 2. In netif_skb_features(), if the number of segments is too high then mask out GSO features to force fall back to software GSO. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-09-19mmc: card: Skip secure erase on MoviNAND; causes unrecoverable corruption.Ian Chen
commit 3550ccdb9d8d350e526b809bf3dd92b550a74fe1 upstream. For several MoviNAND eMMC parts, there are known issues with secure erase and secure trim. For these specific MoviNAND devices, we skip these operations. Specifically, there is a bug in the eMMC firmware that causes unrecoverable corruption when the MMC is erased with MMC_CAP_ERASE enabled. References: http://forum.xda-developers.com/showthread.php?t=1644364 https://plus.google.com/111398485184813224730/posts/21pTYfTsCkB#111398485184813224730/posts/21pTYfTsCkB Signed-off-by: Ian Chen <ian.cy.chen@samsung.com> Reviewed-by: Namjae Jeon <linkinjeon@gmail.com> Acked-by: Jaehoon Chung <jh80.chung@samsung.com> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Chris Ball <cjb@laptop.org> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-09-19ARM: Orion: Set eth packet size csum offload limitArnaud Patard (Rtp)
commit 58569aee5a1a5dcc25c34a0a2ed9a377874e6b05 upstream. The mv643xx ethernet controller limits the packet size for the TX checksum offloading. This patch sets this limits for Kirkwood and Dove which have smaller limits that the default. As a side note, this patch is an updated version of a patch sent some years ago: http://lists.infradead.org/pipermail/linux-arm-kernel/2010-June/017320.html which seems to have been lost. Signed-off-by: Arnaud Patard <arnaud.patard@rtp-net.org> Signed-off-by: Jason Cooper <jason@lakedaemon.net> [bwh: Backported to 3.2: adjust for the extra two parameters of orion_ge0{0,1}_init()] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-09-12USB: add USB_VENDOR_AND_INTERFACE_INFO() macroGustavo Padovan
commit d81a5d1956731c453b85c141458d4ff5d6cc5366 upstream. A lot of Broadcom Bluetooth devices provides vendor specific interface class and we are getting flooded by patches adding new device support. This change will help us enable support for any other Broadcom with vendor specific device that arrives in the future. Only the product id changes for those devices, so this macro would be perfect for us: { USB_VENDOR_AND_INTERFACE_INFO(0x0a5c, 0xff, 0x01, 0x01) } Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk> Acked-by: Henrik Rydberg <rydberg@bitmath.se> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-19Input: eeti_ts: pass gpio value instead of IRQArnd Bergmann
commit 4eef6cbfcc03b294d9d334368a851b35b496ce53 upstream. The EETI touchscreen asserts its IRQ line as soon as it has data in its internal buffers. The line is automatically deasserted once all data has been read via I2C. Hence, the driver has to monitor the GPIO line and cannot simply rely on the interrupt handler reception. In the current implementation of the driver, irq_to_gpio() is used to determine the GPIO number from the i2c_client's IRQ value. As irq_to_gpio() is not available on all platforms, this patch changes this and makes the driver ignore the passed in IRQ. Instead, a GPIO is added to the platform_data struct and gpio_to_irq is used to derive the IRQ from that GPIO. If this fails, bail out. The driver is only able to work in environments where the touchscreen GPIO can be mapped to an IRQ. Without this patch, building raumfeld_defconfig results in: drivers/input/touchscreen/eeti_ts.c: In function 'eeti_ts_irq_active': drivers/input/touchscreen/eeti_ts.c:65:2: error: implicit declaration of function 'irq_to_gpio' [-Werror=implicit-function-declaration] Signed-off-by: Daniel Mack <zonque@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Sven Neumann <s.neumann@raumfeld.com> Cc: linux-input@vger.kernel.org Cc: Haojian Zhuang <haojian.zhuang@gmail.com> [bwh: Backported to 3.2: raumfeld_controller_i2c_board_info.irq was initialised using gpio_to_irq(), but this doesn't seem to matter] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-19ARM: pxa: remove irq_to_gpio from ezx-pcap driverArnd Bergmann
commit 59ee93a528b94ef4e81a08db252b0326feff171f upstream. The irq_to_gpio function was removed from the pxa platform in linux-3.2, and this driver has been broken since. There is actually no in-tree user of this driver that adds this platform device, but the driver can and does get enabled on some platforms. Without this patch, building ezx_defconfig results in: drivers/mfd/ezx-pcap.c: In function 'pcap_isr_work': drivers/mfd/ezx-pcap.c:205:2: error: implicit declaration of function 'irq_to_gpio' [-Werror=implicit-function-declaration] Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Haojian Zhuang <haojian.zhuang@gmail.com> Cc: Samuel Ortiz <sameo@linux.intel.com> Cc: Daniel Ribeiro <drwyrm@gmail.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-10mm: hugetlbfs: close race during teardown of hugetlbfs shared page tablesMel Gorman
commit d833352a4338dc31295ed832a30c9ccff5c7a183 upstream. If a process creates a large hugetlbfs mapping that is eligible for page table sharing and forks heavily with children some of whom fault and others which destroy the mapping then it is possible for page tables to get corrupted. Some teardowns of the mapping encounter a "bad pmd" and output a message to the kernel log. The final teardown will trigger a BUG_ON in mm/filemap.c. This was reproduced in 3.4 but is known to have existed for a long time and goes back at least as far as 2.6.37. It was probably was introduced in 2.6.20 by [39dde65c: shared page table for hugetlb page]. The messages look like this; [ ..........] Lots of bad pmd messages followed by this [ 127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7). [ 127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7). [ 127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7). [ 127.186778] ------------[ cut here ]------------ [ 127.186781] kernel BUG at mm/filemap.c:134! [ 127.186782] invalid opcode: 0000 [#1] SMP [ 127.186783] CPU 7 [ 127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod [ 127.186801] [ 127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR [ 127.186804] RIP: 0010:[<ffffffff810ed6ce>] [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160 [ 127.186809] RSP: 0000:ffff8804144b5c08 EFLAGS: 00010002 [ 127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0 [ 127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00 [ 127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003 [ 127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8 [ 127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8 [ 127.186815] FS: 00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000 [ 127.186816] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0 [ 127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0) [ 127.186821] Stack: [ 127.186822] ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b [ 127.186824] ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98 [ 127.186825] ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000 [ 127.186827] Call Trace: [ 127.186829] [<ffffffff810ed83b>] delete_from_page_cache+0x3b/0x80 [ 127.186832] [<ffffffff811bc925>] truncate_hugepages+0x115/0x220 [ 127.186834] [<ffffffff811bca43>] hugetlbfs_evict_inode+0x13/0x30 [ 127.186837] [<ffffffff811655c7>] evict+0xa7/0x1b0 [ 127.186839] [<ffffffff811657a3>] iput_final+0xd3/0x1f0 [ 127.186840] [<ffffffff811658f9>] iput+0x39/0x50 [ 127.186842] [<ffffffff81162708>] d_kill+0xf8/0x130 [ 127.186843] [<ffffffff81162812>] dput+0xd2/0x1a0 [ 127.186845] [<ffffffff8114e2d0>] __fput+0x170/0x230 [ 127.186848] [<ffffffff81236e0e>] ? rb_erase+0xce/0x150 [ 127.186849] [<ffffffff8114e3ad>] fput+0x1d/0x30 [ 127.186851] [<ffffffff81117db7>] remove_vma+0x37/0x80 [ 127.186853] [<ffffffff81119182>] do_munmap+0x2d2/0x360 [ 127.186855] [<ffffffff811cc639>] sys_shmdt+0xc9/0x170 [ 127.186857] [<ffffffff81410a39>] system_call_fastpath+0x16/0x1b [ 127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff <0f> 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0 [ 127.186868] RIP [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160 [ 127.186870] RSP <ffff8804144b5c08> [ 127.186871] ---[ end trace 7cbac5d1db69f426 ]--- The bug is a race and not always easy to reproduce. To reproduce it I was doing the following on a single socket I7-based machine with 16G of RAM. $ hugeadm --pool-pages-max DEFAULT:13G $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall $ for i in `seq 1 9000`; do ./hugetlbfs-test; done On my particular machine, it usually triggers within 10 minutes but enabling debug options can change the timing such that it never hits. Once the bug is triggered, the machine is in trouble and needs to be rebooted. The machine will respond but processes accessing proc like "ps aux" will hang due to the BUG_ON. shutdown will also hang and needs a hard reset or a sysrq-b. The basic problem is a race between page table sharing and teardown. For the most part page table sharing depends on i_mmap_mutex. In some cases, it is also taking the mm->page_table_lock for the PTE updates but with shared page tables, it is the i_mmap_mutex that is more important. Unfortunately it appears to be also insufficient. Consider the following situation Process A Process B --------- --------- hugetlb_fault shmdt LockWrite(mmap_sem) do_munmap unmap_region unmap_vmas unmap_single_vma unmap_hugepage_range Lock(i_mmap_mutex) Lock(mm->page_table_lock) huge_pmd_unshare/unmap tables <--- (1) Unlock(mm->page_table_lock) Unlock(i_mmap_mutex) huge_pte_alloc ... Lock(i_mmap_mutex) ... vma_prio_walk, find svma, spte ... Lock(mm->page_table_lock) ... share spte ... Unlock(mm->page_table_lock) ... Unlock(i_mmap_mutex) ... hugetlb_no_page <--- (2) free_pgtables unlink_file_vma hugetlb_free_pgd_range remove_vma_list In this scenario, it is possible for Process A to share page tables with Process B that is trying to tear them down. The i_mmap_mutex on its own does not prevent Process A walking Process B's page tables. At (1) above, the page tables are not shared yet so it unmaps the PMDs. Process A sets up page table sharing and at (2) faults a new entry. Process B then trips up on it in free_pgtables. This patch fixes the problem by adding a new function __unmap_hugepage_range_final that is only called when the VMA is about to be destroyed. This function clears VM_MAYSHARE during unmap_hugepage_range() under the i_mmap_mutex. This makes the VMA ineligible for sharing and avoids the race. Superficially this looks like it would then be vunerable to truncate and madvise issues but hugetlbfs has its own truncate handlers so does not use unmap_mapping_range() and does not support madvise(DONTNEED). This should be treated as a -stable candidate if it is merged. Test program is as follows. The test case was mostly written by Michal Hocko with a few minor changes to reproduce this bug. ==== CUT HERE ==== static size_t huge_page_size = (2UL << 20); static size_t nr_huge_page_A = 512; static size_t nr_huge_page_B = 5632; unsigned int get_random(unsigned int max) { struct timeval tv; gettimeofday(&tv, NULL); srandom(tv.tv_usec); return random() % max; } static void play(void *addr, size_t size) { unsigned char *start = addr, *end = start + size, *a; start += get_random(size/2); /* we could itterate on huge pages but let's give it more time. */ for (a = start; a < end; a += 4096) *a = 0; } int main(int argc, char **argv) { key_t key = IPC_PRIVATE; size_t sizeA = nr_huge_page_A * huge_page_size; size_t sizeB = nr_huge_page_B * huge_page_size; int shmidA, shmidB; void *addrA = NULL, *addrB = NULL; int nr_children = 300, n = 0; if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) { perror("shmget:"); return 1; } if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) { perror("shmat"); return 1; } if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) { perror("shmget:"); return 1; } if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) { perror("shmat"); return 1; } fork_child: switch(fork()) { case 0: switch (n%3) { case 0: play(addrA, sizeA); break; case 1: play(addrB, sizeB); break; case 2: break; } break; case -1: perror("fork:"); break; default: if (++n < nr_children) goto fork_child; play(addrA, sizeA); break; } shmdt(addrA); shmdt(addrB); do { wait(NULL); } while (--n > 0); shmctl(shmidA, IPC_RMID, NULL); shmctl(shmidB, IPC_RMID, NULL); return 0; } [akpm@linux-foundation.org: name the declaration's args, fix CONFIG_HUGETLBFS=n build] Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [bwh: Backported to 3.2: - Adjust context - Drop the mmu_gather * parameters] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-10random: add new get_random_bytes_arch() functionTheodore Ts'o
commit c2557a303ab6712bb6e09447df828c557c710ac9 upstream. Create a new function, get_random_bytes_arch() which will use the architecture-specific hardware random number generator if it is present. Change get_random_bytes() to not use the HW RNG, even if it is avaiable. The reason for this is that the hw random number generator is fast (if it is present), but it requires that we trust the hardware manufacturer to have not put in a back door. (For example, an increasing counter encrypted by an AES key known to the NSA.) It's unlikely that Intel (for example) was paid off by the US Government to do this, but it's impossible for them to prove otherwise
2012-08-10random: create add_device_randomness() interfaceLinus Torvalds
commit a2080a67abe9e314f9e9c2cc3a4a176e8a8f8793 upstream. Add a new interface, add_device_randomness() for adding data to the random pool that is likely to differ between two devices (or possibly even per boot). This would be things like MAC addresses or serial numbers, or the read-out of the RTC. This does *not* add any actual entropy to the pool, but it initializes the pool to different values for devices that might otherwise be identical and have very little entropy available to them (particularly common in the embedded world). [ Modified by tytso to mix in a timestamp, since there may be some variability caused by the time needed to detect/configure the hardware in question. ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-10random: make 'add_interrupt_randomness()' do something saneTheodore Ts'o
commit 775f4b297b780601e61787b766f306ed3e1d23eb upstream. We've been moving away from add_interrupt_randomness() for various reasons: it's too expensive to do on every interrupt, and flooding the CPU with interrupts could theoretically cause bogus floods of entropy from a somewhat externally controllable source. This solves both problems by limiting the actual randomness addition to just once a second or after 64 interrupts, whicever comes first. During that time, the interrupt cycle data is buffered up in a per-cpu pool. Also, we make sure the the nonblocking pool used by urandom is initialized before we start feeding the normal input pool. This assures that /dev/urandom is returning unpredictable data as soon as possible. (Based on an original patch by Linus, but significantly modified by tytso.) Tested-by: Eric Wustrow <ewust@umich.edu> Reported-by: Eric Wustrow <ewust@umich.edu> Reported-by: Nadia Heninger <nadiah@cs.ucsd.edu> Reported-by: Zakir Durumeric <zakir@umich.edu> Reported-by: J. Alex Halderman <jhalderm@umich.edu>. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-10sched: Fix race in task_group()Peter Zijlstra
commit 8323f26ce3425460769605a6aece7a174edaa7d1 upstream Stefan reported a crash on a kernel before a3e5d1091c1 ("sched: Don't call task_group() too many times in set_task_rq()"), he found the reason to be that the multiple task_group() invocations in set_task_rq() returned different values. Looking at all that I found a lack of serialization and plain wrong comments. The below tries to fix it using an extra pointer which is updated under the appropriate scheduler locks. Its not pretty, but I can't really see another way given how all the cgroup stuff works. Reported-and-tested-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins Signed-off-by: Ingo Molnar <mingo@kernel.org> (backported to previous file names and layout) Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-02block: add blk_queue_dead()Tejun Heo
commit 34f6055c80285e4efb3f602a9119db75239744dc upstream. There are a number of QUEUE_FLAG_DEAD tests. Add blk_queue_dead() macro and use it. This patch doesn't introduce any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-02workqueue: perform cpu down operations from low priority cpu_notifier()Tejun Heo
commit 6575820221f7a4dd6eadecf7bf83cdd154335eda upstream. Currently, all workqueue cpu hotplug operations run off CPU_PRI_WORKQUEUE which is higher than normal notifiers. This is to ensure that workqueue is up and running while bringing up a CPU before other notifiers try to use workqueue on the CPU. Per-cpu workqueues are supposed to remain working and bound to the CPU for normal CPU_DOWN_PREPARE notifiers. This holds mostly true even with workqueue offlining running with higher priority because workqueue CPU_DOWN_PREPARE only creates a bound trustee thread which runs the per-cpu workqueue without concurrency management without explicitly detaching the existing workers. However, if the trustee needs to create new workers, it creates unbound workers which may wander off to other CPUs while CPU_DOWN_PREPARE notifiers are in progress. Furthermore, if the CPU down is cancelled, the per-CPU workqueue may end up with workers which aren't bound to the CPU. While reliably reproducible with a convoluted artificial test-case involving scheduling and flushing CPU burning work items from CPU down notifiers, this isn't very likely to happen in the wild, and, even when it happens, the effects are likely to be hidden by the following successful CPU down. Fix it by using different priorities for up and down notifiers - high priority for up operations and low priority for down operations. Workqueue cpu hotplug operations will soon go through further cleanup. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-02sched/nohz: Fix rq->cpu_load calculations some morePeter Zijlstra
commit 5aaa0b7a2ed5b12692c9ffb5222182bd558d3146 upstream. Follow up on commit 556061b00 ("sched/nohz: Fix rq->cpu_load[] calculations") since while that fixed the busy case it regressed the mostly idle case. Add a callback from the nohz exit to also age the rq->cpu_load[] array. This closes the hole where either there was no nohz load balance pass during the nohz, or there was a 'significant' amount of idle time between the last nohz balance and the nohz exit. So we'll update unconditionally from the tick to not insert any accidental 0 load periods while busy, and we try and catch up from nohz idle balance and nohz exit. Both these are still prone to missing a jiffy, but that has always been the case. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: pjt@google.com Cc: Venkatesh Pallipadi <venki@google.com> Link: http://lkml.kernel.org/n/tip-kt0trz0apodbf84ucjfdbr1a@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> [bwh: Backported to 3.2: adjust filenames and context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-02cpuset: mm: reduce large amounts of memory barrier related damage v3Mel Gorman
commit cc9a6c8776615f9c194ccf0b63a0aa5628235545 upstream. Stable note: Not tracked in Bugzilla. [get|put]_mems_allowed() is extremely expensive and severely impacted page allocator performance. This is part of a series of patches that reduce page allocator overhead. Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing cpuset's mems") wins a super prize for the largest number of memory barriers entered into fast paths for one commit. [get|put]_mems_allowed is incredibly heavy with pairs of full memory barriers inserted into a number of hot paths. This was detected while investigating at large page allocator slowdown introduced some time after 2.6.32. The largest portion of this overhead was shown by oprofile to be at an mfence introduced by this commit into the page allocator hot path. For extra style points, the commit introduced the use of yield() in an implementation of what looks like a spinning mutex. This patch replaces the full memory barriers on both read and write sides with a sequence counter with just read barriers on the fast path side. This is much cheaper on some architectures, including x86. The main bulk of the patch is the retry logic if the nodemask changes in a manner that can cause a false failure. While updating the nodemask, a check is made to see if a false failure is a risk. If it is, the sequence number gets bumped and parallel allocators will briefly stall while the nodemask update takes place. In a page fault test microbenchmark, oprofile samples from __alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The actual results were 3.3.0-rc3 3.3.0-rc3 rc3-vanilla nobarrier-v2r1 Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%) Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%) Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%) Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%) Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%) Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%) Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%) Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%) Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%) Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%) Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%) Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%) Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%) Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%) Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%) MMTests Statistics: duration Sys Time Running Test (seconds) 135.68 132.17 User+Sys Time Running Test (seconds) 164.2 160.13 Total Elapsed Time (seconds) 123.46 120.87 The overall improvement is small but the System CPU time is much improved and roughly in correlation to what oprofile reported (these performance figures are without profiling so skew is expected). The actual number of page faults is noticeably improved. For benchmarks like kernel builds, the overall benefit is marginal but the system CPU time is slightly reduced. To test the actual bug the commit fixed I opened two terminals. The first ran within a cpuset and continually ran a small program that faulted 100M of anonymous data. In a second window, the nodemask of the cpuset was continually randomised in a loop. Without the commit, the program would fail every so often (usually within 10 seconds) and obviously with the commit everything worked fine. With this patch applied, it also worked fine so the fix should be functionally equivalent. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Miao Xie <miaox@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> [bwh: Forward-ported from 3.0 to 3.2: apply the upstream changes to get_any_partial()] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-02mm: compaction: introduce sync-light migration for use by compactionMel Gorman
commit a6bc32b899223a877f595ef9ddc1e89ead5072b8 upstream. Stable note: Not tracked in Buzilla. This was part of a series that reduced interactivity stalls experienced when THP was enabled. These stalls were particularly noticable when copying data to a USB stick but the experiences for users varied a lot. This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT mode that avoids writing back pages to backing storage. Async compaction maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT. For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is used. This avoids sync compaction stalling for an excessive length of time, particularly when copying files to a USB stick where there might be a large number of dirty pages backed by a filesystem that does not support ->writepages. [aarcange@redhat.com: This patch is heavily based on Andrea's work] [akpm@linux-foundation.org: fix fs/nfs/write.c build] [akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build] Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Dave Jones <davej@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Andy Isaacson <adi@hexapodia.org> Cc: Nai Xia <nai.xia@gmail.com> Cc: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-02mm: compaction: make isolate_lru_page() filter-aware againMel Gorman
commit c82449352854ff09e43062246af86bdeb628f0c3 upstream. Stable note: Not tracked in Bugzilla. A fix aimed at preserving page aging information by reducing LRU list churning had the side-effect of reducing THP allocation success rates. This was part of a series to restore the success rates while preserving the reclaim fix. Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware") noted that compaction does not migrate dirty or writeback pages and that is was meaningless to pick the page and re-add it to the LRU list. This had to be partially reverted because some dirty pages can be migrated by compaction without blocking. This patch updates "mm: compaction: make isolate_lru_page" by skipping over pages that migration has no possibility of migrating to minimise LRU disruption. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel<riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Dave Jones <davej@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Andy Isaacson <adi@hexapodia.org> Cc: Nai Xia <nai.xia@gmail.com> Cc: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-08-02mm: compaction: determine if dirty pages can be migrated without blocking ↵Mel Gorman
within ->migratepage commit b969c4ab9f182a6e1b2a0848be349f99714947b0 upstream. Stable note: Not tracked in Bugzilla. A fix aimed at preserving page aging information by reducing LRU list churning had the side-effect of reducing THP allocation success rates. This was part of a series to restore the success rates while preserving the reclaim fix. Asynchronous compaction is used when allocating transparent hugepages to avoid blocking for long periods of time. Due to reports of stalling, there was a debate on disabling synchronous compaction but this severely impacted allocation success rates. Part of the reason was that many dirty pages are skipped in asynchronous compaction by the following check; if (PageDirty(page) && !sync && mapping->a_ops->migratepage != migrate_page) rc = -EBUSY; This skips over all mapping aops using buffer_migrate_page() even though it is possible to migrate some of these pages without blocking. This patch updates the ->migratepage callback with a "sync" parameter. It is the responsibility of the callback to fail gracefully if migration would block. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Dave Jones <davej@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Andy Isaacson <adi@hexapodia.org> Cc: Nai Xia <nai.xia@gmail.com> Cc: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-25net: remove skb_orphan_try()Eric Dumazet
commit 62b1a8ab9b3660bb820d8dfe23148ed6cda38574 upstream. Orphaning skb in dev_hard_start_xmit() makes bonding behavior unfriendly for applications sending big UDP bursts : Once packets pass the bonding device and come to real device, they might hit a full qdisc and be dropped. Without orphaning, the sender is automatically throttled because sk->sk_wmemalloc reaches sk->sk_sndbuf (assuming sk_sndbuf is not too big) We could try to defer the orphaning adding another test in dev_hard_start_xmit(), but all this seems of little gain, now that BQL tends to make packets more likely to be parked in Qdisc queues instead of NIC TX ring, in cases where performance matters. Reverts commits : fc6055a5ba31 net: Introduce skb_orphan_try() 87fd308cfc6b net: skb_tx_hash() fix relative to skb_orphan_try() and removes SKBTX_DRV_NEEDS_SK_REF flag Reported-and-bisected-by: Jean-Michel Hautbois <jhautbois@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Oliver Hartkopp <socketcan@hartkopp.net> Acked-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: David S. Miller <davem@davemloft.net> [bwh: Backported to 3.2: - Adjust context - SKBTX_WIFI_STATUS is not defined] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-25timekeeping: Provide hrtimer update functionThomas Gleixner
This is a backport of f6c06abfb3972ad4914cef57d8348fcb2932bc3b To finally fix the infamous leap second issue and other race windows caused by functions which change the offsets between the various time bases (CLOCK_MONOTONIC, CLOCK_REALTIME and CLOCK_BOOTTIME) we need a function which atomically gets the current monotonic time and updates the offsets of CLOCK_REALTIME and CLOCK_BOOTTIME with minimalistic overhead. The previous patch which provides ktime_t offsets allows us to make this function almost as cheap as ktime_get() which is going to be replaced in hrtimer_interrupt(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Prarit Bhargava <prarit@redhat.com> Signed-off-by: John Stultz <johnstul@us.ibm.com> Link: http://lkml.kernel.org/r/1341960205-56738-7-git-send-email-johnstul@us.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [John Stultz: Backported to 3.2] Cc: Prarit Bhargava <prarit@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linux Kernel <linux-kernel@vger.kernel.org> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-25hrtimer: Provide clock_was_set_delayed()John Stultz
commit f55a6faa384304c89cfef162768e88374d3312cb upstream. clock_was_set() cannot be called from hard interrupt context because it calls on_each_cpu(). For fixing the widely reported leap seconds issue it is necessary to call it from hard interrupt context, i.e. the timer tick code, which does the timekeeping updates. Provide a new function which denotes it in the hrtimer cpu base structure of the cpu on which it is called and raise the hrtimer softirq. We then execute the clock_was_set() notificiation from softirq context in run_hrtimer_softirq(). The hrtimer softirq is rarely used, so polling the flag there is not a performance issue. [ tglx: Made it depend on CONFIG_HIGH_RES_TIMERS. We really should get rid of all this ifdeffery ASAP ] Signed-off-by: John Stultz <johnstul@us.ibm.com> Reported-by: Jan Engelhardt <jengelh@inai.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Prarit Bhargava <prarit@redhat.com> Link: http://lkml.kernel.org/r/1341960205-56738-2-git-send-email-johnstul@us.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-25ntp: Fix leap-second hrtimer livelockJohn Stultz
This is a backport of 6b43ae8a619d17c4935c3320d2ef9e92bdeed05d This should have been backported when it was commited, but I mistook the problem as requiring the ntp_lock changes that landed in 3.4 in order for it to occur. Unfortunately the same issue can happen (with only one cpu) as follows: do_adjtimex() write_seqlock_irq(&xtime_lock); process_adjtimex_modes() process_adj_status() ntp_start_leap_timer() hrtimer_start() hrtimer_reprogram() tick_program_event() clockevents_program_event() ktime_get() seq = req_seqbegin(xtime_lock); [DEADLOCK] This deadlock will no always occur, as it requires the leap_timer to force a hrtimer_reprogram which only happens if its set and there's no sooner timer to expire. NOTE: This patch, being faithful to the original commit, introduces a bug (we don't update wall_to_monotonic), which will be resovled by backporting a following fix. Original commit message below: Since commit 7dffa3c673fbcf835cd7be80bb4aec8ad3f51168 the ntp subsystem has used an hrtimer for triggering the leapsecond adjustment. However, this can cause a potential livelock. Thomas diagnosed this as the following pattern: CPU 0 CPU 1 do_adjtimex() spin_lock_irq(&ntp_lock); process_adjtimex_modes(); timer_interrupt() process_adj_status(); do_timer() ntp_start_leap_timer(); write_lock(&xtime_lock); hrtimer_start(); update_wall_time(); hrtimer_reprogram(); ntp_tick_length() tick_program_event() spin_lock(&ntp_lock); clockevents_program_event() ktime_get() seq = req_seqbegin(xtime_lock); This patch tries to avoid the problem by reverting back to not using an hrtimer to inject leapseconds, and instead we handle the leapsecond processing in the second_overflow() function. The downside to this change is that on systems that support highres timers, the leap second processing will occur on a HZ tick boundary, (ie: ~1-10ms, depending on HZ) after the leap second instead of possibly sooner (~34us in my tests w/ x86_64 lapic). This patch applies on top of tip/timers/core. CC: Sasha Levin <levinsasha928@gmail.com> CC: Thomas Gleixner <tglx@linutronix.de> Reported-by: Sasha Levin <levinsasha928@gmail.com> Diagnoised-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Sasha Levin <levinsasha928@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linux Kernel <linux-kernel@vger.kernel.org> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-25NFC: Export nfc.h to userlandSamuel Ortiz
commit dbd4fcaf8d664fab4163b1f8682e41ad8bff3444 upstream. The netlink commands and attributes, along with the socket structure definitions need to be exported. Signed-off-by: Samuel Ortiz <sameo@linux.intel.com> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-25memory hotplug: fix invalid memory access caused by stale kswapd pointerJiang Liu
commit d8adde17e5f858427504725218c56aef90e90fc7 upstream. kswapd_stop() is called to destroy the kswapd work thread when all memory of a NUMA node has been offlined. But kswapd_stop() only terminates the work thread without resetting NODE_DATA(nid)->kswapd to NULL. The stale pointer will prevent kswapd_run() from creating a new work thread when adding memory to the memory-less NUMA node again. Eventually the stale pointer may cause invalid memory access. An example stack dump as below. It's reproduced with 2.6.32, but latest kernel has the same issue. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff81051a94>] exit_creds+0x12/0x78 PGD 0 Oops: 0000 [#1] SMP last sysfs file: /sys/devices/system/memory/memory391/state CPU 11 Modules linked in: cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq microcode fuse loop dm_mod tpm_tis rtc_cmos i2c_i801 rtc_core tpm serio_raw pcspkr sg tpm_bios igb i2c_core iTCO_wdt rtc_lib mptctl iTCO_vendor_support button dca bnx2 usbhid hid uhci_hcd ehci_hcd usbcore sd_mod crc_t10dif edd ext3 mbcache jbd fan ide_pci_generic ide_core ata_generic ata_piix libata thermal processor thermal_sys hwmon mptsas mptscsih mptbase scsi_transport_sas scsi_mod Pid: 7949, comm: sh Not tainted 2.6.32.12-qiuxishi-5-default #92 Tecal RH2285 RIP: 0010:exit_creds+0x12/0x78 RSP: 0018:ffff8806044f1d78 EFLAGS: 00010202 RAX: 0000000000000000 RBX: ffff880604f22140 RCX: 0000000000019502 RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000 RBP: ffff880604f22150 R08: 0000000000000000 R09: ffffffff81a4dc10 R10: 00000000000032a0 R11: ffff880006202500 R12: 0000000000000000 R13: 0000000000c40000 R14: 0000000000008000 R15: 0000000000000001 FS: 00007fbc03d066f0(0000) GS:ffff8800282e0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 000000060f029000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process sh (pid: 7949, threadinfo ffff8806044f0000, task ffff880603d7c600) Stack: ffff880604f22140 ffffffff8103aac5 ffff880604f22140 ffffffff8104d21e ffff880006202500 0000000000008000 0000000000c38000 ffffffff810bd5b1 0000000000000000 ffff880603d7c600 00000000ffffdd29 0000000000000003 Call Trace: __put_task_struct+0x5d/0x97 kthread_stop+0x50/0x58 offline_pages+0x324/0x3da memory_block_change_state+0x179/0x1db store_mem_state+0x9e/0xbb sysfs_write_file+0xd0/0x107 vfs_write+0xad/0x169 sys_write+0x45/0x6e system_call_fastpath+0x16/0x1b Code: ff 4d 00 0f 94 c0 84 c0 74 08 48 89 ef e8 1f fd ff ff 5b 5d 31 c0 41 5c c3 53 48 8b 87 20 06 00 00 48 89 fb 48 8b bf 18 06 00 00 <8b> 00 48 c7 83 18 06 00 00 00 00 00 00 f0 ff 0f 0f 94 c0 84 c0 RIP exit_creds+0x12/0x78 RSP <ffff8806044f1d78> CR2: 0000000000000000 [akpm@linux-foundation.org: add pglist_data.kswapd locking comments] Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-25PCI: EHCI: fix crash during suspend on ASUS computersAlan Stern
commit dbf0e4c7257f8d684ec1a3c919853464293de66e upstream. Quite a few ASUS computers experience a nasty problem, related to the EHCI controllers, when going into system suspend. It was observed that the problem didn't occur if the controllers were not put into the D3 power state before starting the suspend, and commit 151b61284776be2d6f02d48c23c3625678960b97 (USB: EHCI: fix crash during suspend on ASUS computers) was created to do this. It turned out this approach messed up other computers that didn't have the problem -- it prevented USB wakeup from working. Consequently commit c2fb8a3fa25513de8fedb38509b1f15a5bbee47b (USB: add NO_D3_DURING_SLEEP flag and revert 151b61284776be2) was merged; it reverted the earlier commit and added a whitelist of known good board names. Now we know the actual cause of the problem. Thanks to AceLan Kao for tracking it down. According to him, an engineer at ASUS explained that some of their BIOSes contain a bug that was added in an attempt to work around a problem in early versions of Windows. When the computer goes into S3 suspend, the BIOS tries to verify that the EHCI controllers were first quiesced by the OS. Nothing's wrong with this, but the BIOS does it by checking that the PCI COMMAND registers contain 0 without checking the controllers' power state. If the register isn't 0, the BIOS assumes the controller needs to be quiesced and tries to do so. This involves making various MMIO accesses to the controller, which don't work very well if the controller is already in D3. The end result is a system hang or memory corruption. Since the value in the PCI COMMAND register doesn't matter once the controller has been suspended, and since the value will be restored anyway when the controller is resumed, we can work around the BIOS bug simply by setting the register to 0 during system suspend. This patch (as1590) does so and also reverts the second commit mentioned above, which is now unnecessary. In theory we could do this for every PCI device. However to avoid introducing new problems, the patch restricts itself to EHCI host controllers. Finally the affected systems can suspend with USB wakeup working properly. Reference: https://bugzilla.kernel.org/show_bug.cgi?id=37632 Reference: https://bugzilla.kernel.org/show_bug.cgi?id=42728 Based-on-patch-by: AceLan Kao <acelan.kao@canonical.com> Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Tested-by: Dâniel Fraga <fragabr@gmail.com> Tested-by: Javier Marcet <jmarcet@gmail.com> Tested-by: Andrey Rahmatullin <wrar@wrar.name> Tested-by: Oleksij Rempel <bug-track@fisher-privat.net> Tested-by: Pavel Pisa <pisa@cmp.felk.cvut.cz> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-25sched/nohz: Rewrite and fix load-avg computation -- againPeter Zijlstra
commit 5167e8d5417bf5c322a703d2927daec727ea40dd upstream. Thanks to Charles Wang for spotting the defects in the current code: - If we go idle during the sample window -- after sampling, we get a negative bias because we can negate our own sample. - If we wake up during the sample window we get a positive bias because we push the sample to a known active period. So rewrite the entire nohz load-avg muck once again, now adding copious documentation to the code. Reported-and-tested-by: Doug Smythies <dsmythies@telus.net> Reported-and-tested-by: Charles Wang <muming.wq@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1340373782.18025.74.camel@twins [ minor edits ] Signed-off-by: Ingo Molnar <mingo@kernel.org> [bwh: Backported to 3.2: adjust filenames, context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-25hugepages: fix use after free bug in "quota" handlingDavid Gibson
commit 90481622d75715bfcb68501280a917dbfe516029 upstream. hugetlbfs_{get,put}_quota() are badly named. They don't interact with the general quota handling code, and they don't much resemble its behaviour. Rather than being about maintaining limits on on-disk block usage by particular users, they are instead about maintaining limits on in-memory page usage (including anonymous MAP_PRIVATE copied-on-write pages) associated with a particular hugetlbfs filesystem instance. Worse, they work by having callbacks to the hugetlbfs filesystem code from the low-level page handling code, in particular from free_huge_page(). This is a layering violation of itself, but more importantly, if the kernel does a get_user_pages() on hugepages (which can happen from KVM amongst others), then the free_huge_page() can be delayed until after the associated inode has already been freed. If an unmount occurs at the wrong time, even the hugetlbfs superblock where the "quota" limits are stored may have been freed. Andrew Barry proposed a patch to fix this by having hugepages, instead of storing a pointer to their address_space and reaching the superblock from there, had the hugepages store pointers directly to the superblock, bumping the reference count as appropriate to avoid it being freed. Andrew Morton rejected that version, however, on the grounds that it made the existing layering violation worse. This is a reworked version of Andrew's patch, which removes the extra, and some of the existing, layering violation. It works by introducing the concept of a hugepage "subpool" at the lower hugepage mm layer - that is a finite logical pool of hugepages to allocate from. hugetlbfs now creates a subpool for each filesystem instance with a page limit set, and a pointer to the subpool gets added to each allocated hugepage, instead of the address_space pointer used now. The subpool has its own lifetime and is only freed once all pages in it _and_ all other references to it (i.e. superblocks) are gone. subpools are optional - a NULL subpool pointer is taken by the code to mean that no subpool limits are in effect. Previous discussion of this bug found in: "Fix refcounting in hugetlbfs quota handling.". See: https://lkml.org/lkml/2011/8/11/28 or http://marc.info/?l=linux-mm&m=126928970510627&w=1 v2: Fixed a bug spotted by Hillf Danton, and removed the extra parameter to alloc_huge_page() - since it already takes the vma, it is not necessary. Signed-off-by: Andrew Barry <abarry@cray.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [bwh: Backported to 3.2: adjust context to apply after commit c50ac050811d6485616a193eb0f37bfbd191cc89 'hugetlb: fix resv_map leak in error path', backported in 3.2.20] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-12aio: make kiocb->private NUll in init_sync_kiocb()Junxiao Bi
commit 2dfd06036ba7ae8e7be2daf5a2fff1dac42390bf upstream. Ocfs2 uses kiocb.*private as a flag of unsigned long size. In commit a11f7e6 ocfs2: serialize unaligned aio, the unaligned io flag is involved in it to serialize the unaligned aio. As *private is not initialized in init_sync_kiocb() of do_sync_write(), this unaligned io flag may be unexpectly set in an aligned dio. And this will cause OCFS2_I(inode)->ip_unaligned_aio decreased to -1 in ocfs2_dio_end_io(), thus the following unaligned dio will hang forever at ocfs2_aiodio_wait() in ocfs2_file_aio_write(). Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Joel Becker <jlbec@evilplan.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-12splice: fix racy pipe->buffers usesEric Dumazet
commit 047fe3605235888f3ebcda0c728cb31937eadfe6 upstream. Dave Jones reported a kernel BUG at mm/slub.c:3474! triggered by splice_shrink_spd() called from vmsplice_to_pipe() commit 35f3d14dbbc5 (pipe: add support for shrinking and growing pipes) added capability to adjust pipe->buffers. Problem is some paths don't hold pipe mutex and assume pipe->buffers doesn't change for their duration. Fix this by adding nr_pages_max field in struct splice_pipe_desc, and use it in place of pipe->buffers where appropriate. splice_shrink_spd() loses its struct pipe_inode_info argument. Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Tom Herbert <therbert@google.com> Tested-by: Dave Jones <davej@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> [bwh: Backported to 3.2: - Adjust context in vmsplice_to_pipe() - Update one more call to splice_shrink_spd(), from skb_splice_bits()] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-06-19swap: fix shmem swapping when more than 8 areasHugh Dickins
commit 9b15b817f3d62409290fd56fe3cbb076a931bb0a upstream. Minchan Kim reports that when a system has many swap areas, and tmpfs swaps out to the ninth or more, shmem_getpage_gfp()'s attempts to read back the page cannot locate it, and the read fails with -ENOMEM. Whoops. Yes, I blindly followed read_swap_header()'s pte_to_swp_entry( swp_entry_to_pte()) technique for determining maximum usable swap offset, without stopping to realize that that actually depends upon the pte swap encoding shifting swap offset to the higher bits and truncating it there. Whereas our radix_tree swap encoding leaves offset in the lower bits: it's swap "type" (that is, index of swap area) that was truncated. Fix it by reducing the SWP_TYPE_SHIFT() in swapops.h, and removing the broken radix_to_swp_entry(swp_to_radix_entry()) from read_swap_header(). This does not reduce the usable size of a swap area any further, it leaves it as claimed when making the original commit: no change from 3.0 on x86_64, nor on i386 without PAE; but 3.0's 512GB is reduced to 128GB per swapfile on i386 with PAE. It's not a change I would have risked five years ago, but with x86_64 supported for ten years, I believe it's appropriate now. Hmm, and what if some architecture implements its swap pte with offset encoded below type? That would equally break the maximum usable swap offset check. Happily, they all follow the same tradition of encoding offset above type, but I'll prepare a check on that for next. Reported-and-Reviewed-and-Tested-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-06-19USB: add NO_D3_DURING_SLEEP flag and revert 151b61284776be2Alan Stern
commit c2fb8a3fa25513de8fedb38509b1f15a5bbee47b upstream. This patch (as1558) fixes a problem affecting several ASUS computers: The machine crashes or corrupts memory when going into suspend if the ehci-hcd driver is bound to any controllers. Users have been forced to unbind or unload ehci-hcd before putting their systems to sleep. After extensive testing, it was determined that the machines don't like going into suspend when any EHCI controllers are in the PCI D3 power state. Presumably this is a firmware bug, but there's nothing we can do about it except to avoid putting the controllers in D3 during system sleep. The patch adds a new flag to indicate whether the problem is present, and avoids changing the controller's power state if the flag is set. Runtime suspend is unaffected; this matters only for system suspend. However as a side effect, the controller will not respond to remote wakeup requests while the system is asleep. Hence USB wakeup is not functional -- but of course, this is already true in the current state of affairs. A similar patch has already been applied as commit 151b61284776be2d6f02d48c23c3625678960b97 (USB: EHCI: fix crash during suspend on ASUS computers). The patch supersedes that one and reverts it. There are two differences: The old patch added the flag at the USB level; this patch adds it at the PCI level. The old patch applied to all chipsets with the same vendor, subsystem vendor, and product IDs; this patch makes an exception for a known-good system (based on DMI information). Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Tested-by: Dâniel Fraga <fragabr@gmail.com> Tested-by: Andrey Rahmatullin <wrar@wrar.name> Tested-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-06-10kbuild: install kernel-page-flags.hUlrich Drepper
commit 9295b7a07c859a42346221b5839be0ae612333b0 upstream. Programs using /proc/kpageflags need to know about the various flags. The <linux/kernel-page-flags.h> provides them and the comments in the file indicate that it is supposed to be used by user-level code. But the file is not installed. Install the headers and mark the unstable flags as out-of-bounds. The page-type tool is also adjusted to not duplicate the definitions Signed-off-by: Ulrich Drepper <drepper@gmail.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [bwh: Backported to 3.2: adjust context; drop change to missing tools/vm/] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-06-10skb: avoid unnecessary reallocations in __skb_cowFelix Fietkau
[ Upstream commit 617c8c11236716dcbda877e764b7bf37c6fd8063 ] At the beginning of __skb_cow, headroom gets set to a minimum of NET_SKB_PAD. This causes unnecessary reallocations if the buffer was not cloned and the headroom is just below NET_SKB_PAD, but still more than the amount requested by the caller. This was showing up frequently in my tests on VLAN tx, where vlan_insert_tag calls skb_cow_head(skb, VLAN_HLEN). Locally generated packets should have enough headroom, and for forward paths, we already have NET_SKB_PAD bytes of headroom, so we don't need to add any extra space here. Signed-off-by: Felix Fietkau <nbd@openwrt.org> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-06-10Revert "net: maintain namespace isolation between vlan and real device"David S. Miller
[ Upstream commit 59b9997baba5242997ddc7bd96b1391f5275a5a4 ] This reverts commit 8a83a00b0735190384a348156837918271034144. It causes regressions for S390 devices, because it does an unconditional DST drop on SKBs for vlans and the QETH device needs the neighbour entry hung off the DST for certain things on transmit. Arnd can't remember exactly why he even needed this change. Conflicts: drivers/net/macvlan.c net/8021q/vlan_dev.c net/core/dev.c Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-06-10set fake_rtable's dst to NULL to avoid kernel OopsPeter Huang (Peng)
[ Upstream commit a881e963c7fe1f226e991ee9bbe8907acda93294 ] bridge: set fake_rtable's dst to NULL to avoid kernel Oops when bridge is deleted before tap/vif device's delete, kernel may encounter an oops because of NULL reference to fake_rtable's dst. Set fake_rtable's dst to NULL before sending packets out can solve this problem. v4 reformat, change br_drop_fake_rtable(skb) to {} v3 enrich commit header v2 introducing new flag DST_FAKE_RTABLE to dst_entry struct. [ Use "do { } while (0)" for nop br_drop_fake_rtable() implementation -DaveM ] Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Peter Huang <peter.huangpeng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-05-31block: don't mark buffers beyond end of disk as mappedJeff Moyer
commit 080399aaaf3531f5b8761ec0ac30ff98891e8686 upstream. Hi, We have a bug report open where a squashfs image mounted on ppc64 would exhibit errors due to trying to read beyond the end of the disk. It can easily be reproduced by doing the following: [root@ibm-p750e-02-lp3 ~]# ls -l install.img -rw-r--r-- 1 root root 142032896 Apr 30 16:46 install.img [root@ibm-p750e-02-lp3 ~]# mount -o loop ./install.img /mnt/test [root@ibm-p750e-02-lp3 ~]# dd if=/dev/loop0 of=/dev/null dd: reading `/dev/loop0': Input/output error 277376+0 records in 277376+0 records out 142016512 bytes (142 MB) copied, 0.9465 s, 150 MB/s In dmesg, you'll find the following: squashfs: version 4.0 (2009/01/31) Phillip Lougher [ 43.106012] attempt to access beyond end of device [ 43.106029] loop0: rw=0, want=277410, limit=277408 [ 43.106039] Buffer I/O error on device loop0, logical block 138704 [ 43.106053] attempt to access beyond end of device [ 43.106057] loop0: rw=0, want=277412, limit=277408 [ 43.106061] Buffer I/O error on device loop0, logical block 138705 [ 43.106066] attempt to access beyond end of device [ 43.106070] loop0: rw=0, want=277414, limit=277408 [ 43.106073] Buffer I/O error on device loop0, logical block 138706 [ 43.106078] attempt to access beyond end of device [ 43.106081] loop0: rw=0, want=277416, limit=277408 [ 43.106085] Buffer I/O error on device loop0, logical block 138707 [ 43.106089] attempt to access beyond end of device [ 43.106093] loop0: rw=0, want=277418, limit=277408 [ 43.106096] Buffer I/O error on device loop0, logical block 138708 [ 43.106101] attempt to access beyond end of device [ 43.106104] loop0: rw=0, want=277420, limit=277408 [ 43.106108] Buffer I/O error on device loop0, logical block 138709 [ 43.106112] attempt to access beyond end of device [ 43.106116] loop0: rw=0, want=277422, limit=277408 [ 43.106120] Buffer I/O error on device loop0, logical block 138710 [ 43.106124] attempt to access beyond end of device [ 43.106128] loop0: rw=0, want=277424, limit=277408 [ 43.106131] Buffer I/O error on device loop0, logical block 138711 [ 43.106135] attempt to access beyond end of device [ 43.106139] loop0: rw=0, want=277426, limit=277408 [ 43.106143] Buffer I/O error on device loop0, logical block 138712 [ 43.106147] attempt to access beyond end of device [ 43.106151] loop0: rw=0, want=277428, limit=277408 [ 43.106154] Buffer I/O error on device loop0, logical block 138713 [ 43.106158] attempt to access beyond end of device [ 43.106162] loop0: rw=0, want=277430, limit=277408 [ 43.106166] attempt to access beyond end of device [ 43.106169] loop0: rw=0, want=277432, limit=277408 ... [ 43.106307] attempt to access beyond end of device [ 43.106311] loop0: rw=0, want=277470, limit=2774 Squashfs manages to read in the end block(s) of the disk during the mount operation. Then, when dd reads the block device, it leads to block_read_full_page being called with buffers that are beyond end of disk, but are marked as mapped. Thus, it would end up submitting read I/O against them, resulting in the errors mentioned above. I fixed the problem by modifying init_page_buffers to only set the buffer mapped if it fell inside of i_size. Cheers, Jeff Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Nick Piggin <npiggin@kernel.dk> -- Changes from v1->v2: re-used max_block, as suggested by Nick Piggin. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-05-31mmc: sdio: avoid spurious calls to interrupt handlersNicolas Pitre
commit bbbc4c4d8c5face097d695f9bf3a39647ba6b7e7 upstream. Commit 06e8935feb ("optimized SDIO IRQ handling for single irq") introduced some spurious calls to SDIO function interrupt handlers, such as when the SDIO IRQ thread is started, or the safety check performed upon a system resume. Let's add a flag to perform the optimization only when a real interrupt is signaled by the host driver and we know there is no point confirming it. Reported-by: Sujit Reddy Thumma <sthumma@codeaurora.org> Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Chris Ball <cjb@laptop.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-05-31USB: move usb_translate_errors to linux/usb.hJohan Hovold
commit 2c4d6bf295ae10ffcd84f0df6cb642598eb66603 upstream. Move usb_translate_errors from usb core to linux/usb.h as it is meant to be accessed from drivers. Signed-off-by: Johan Hovold <jhovold@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-05-31block: fix buffer overflow when printing partition UUIDsTejun Heo
commit 05c69d298c96703741cac9a5cbbf6c53bd55a6e2 upstream. 6d1d8050b4bc8 "block, partition: add partition_meta_info to hd_struct" added part_unpack_uuid() which assumes that the passed in buffer has enough space for sprintfing "%pU" - 37 characters including '\0'. Unfortunately, b5af921ec0233 "init: add support for root devices specified by partition UUID" supplied 33 bytes buffer to the function leading to the following panic with stackprotector enabled. Kernel panic - not syncing: stack-protector: Kernel stack corrupted in: ffffffff81b14c7e [<ffffffff815e226b>] panic+0xba/0x1c6 [<ffffffff81b14c7e>] ? printk_all_partitions+0x259/0x26xb [<ffffffff810566bb>] __stack_chk_fail+0x1b/0x20 [<ffffffff81b15c7e>] printk_all_paritions+0x259/0x26xb [<ffffffff81aedfe0>] mount_block_root+0x1bc/0x27f [<ffffffff81aee0fa>] mount_root+0x57/0x5b [<ffffffff81aee23b>] prepare_namespace+0x13d/0x176 [<ffffffff8107eec0>] ? release_tgcred.isra.4+0x330/0x30 [<ffffffff81aedd60>] kernel_init+0x155/0x15a [<ffffffff81087b97>] ? schedule_tail+0x27/0xb0 [<ffffffff815f4d24>] kernel_thread_helper+0x5/0x10 [<ffffffff81aedc0b>] ? start_kernel+0x3c5/0x3c5 [<ffffffff815f4d20>] ? gs_change+0x13/0x13 Increase the buffer size, remove the dangerous part_unpack_uuid() and use snprintf() directly from printk_all_partitions(). Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Szymon Gruszczynski <sz.gruszczynski@googlemail.com> Cc: Will Drewry <wad@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-05-31KVM: Ensure all vcpus are consistent with in-kernel irqchip settingsAvi Kivity
(cherry picked from commit 3e515705a1f46beb1c942bb8043c16f8ac7b1e9e) If some vcpus are created before KVM_CREATE_IRQCHIP, then irqchip_in_kernel() and vcpu->arch.apic will be inconsistent, leading to potential NULL pointer dereferences. Fix by: - ensuring that no vcpus are installed when KVM_CREATE_IRQCHIP is called - ensuring that a vcpu has an apic if it is installed after KVM_CREATE_IRQCHIP This is somewhat long winded because vcpu->arch.apic is created without kvm->lock held. Based on earlier patch by Michael Ellerman. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-05-20mtd: map.h: fix arm cross-build failureArtem Bityutskiy
commit 4a42243886b87cd28a39b192161767c2af851a55 upstream. This patch fixes the following build failure: In file included from include/linux/mtd/qinfo.h:4:0, from include/linux/mtd/pfow.h:7, from drivers/mtd/lpddr/lpddr_cmds.c:27: include/linux/mtd/map.h: In function 'inline_map_read': include/linux/mtd/map.h:409:3: error: implicit declaration of function 'BUILD_BUG_ON' [-Werror=implicit-function-declaration] Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-05-20usbnet: fix skb traversing races during unlink(v2)Ming Lei
commit 5b6e9bcdeb65634b4ad604eb4536404bbfc62cfa upstream. Commit 4231d47e6fe69f061f96c98c30eaf9fb4c14b96d(net/usbnet: avoid recursive locking in usbnet_stop()) fixes the recursive locking problem by releasing the skb queue lock before unlink, but may cause skb traversing races: - after URB is unlinked and the queue lock is released, the refered skb and skb->next may be moved to done queue, even be released - in skb_queue_walk_safe, the next skb is still obtained by next pointer of the last skb - so maybe trigger oops or other problems This patch extends the usage of entry->state to describe 'start_unlink' state, so always holding the queue(rx/tx) lock to change the state if the referd skb is in rx or tx queue because we need to know if the refered urb has been started unlinking in unlink_urbs. The other part of this patch is based on Huajun's patch: always traverse from head of the tx/rx queue to get skb which is to be unlinked but not been started unlinking. Signed-off-by: Huajun Li <huajun.li.lee@gmail.com> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Cc: Oliver Neukum <oneukum@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-05-11Fix __read_seqcount_begin() to use ACCESS_ONCE for sequence value readLinus Torvalds
commit 2f624278626677bfaf73fef97f86b37981621f5c upstream. We really need to use a ACCESS_ONCE() on the sequence value read in __read_seqcount_begin(), because otherwise the compiler might end up reloading the value in between the test and the return of it. As a result, it might end up returning an odd value (which means that a write is in progress). If the reader is then fast enough that that odd value is still the current one when the read_seqcount_retry() is done, we might end up with a "successful" read sequence, even despite the concurrent write being active. In practice this probably never really happens - there just isn't anything else going on around the read of the sequence count, and the common case is that we end up having a read barrier immediately afterwards. So the code sequence in which gcc might decide to reaload from memory is small, and there's no reason to believe it would ever actually do the reload. But if the compiler ever were to decide to do so, it would be incredibly annoying to debug. Let's just make sure. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-05-11efi: Add new variable attributesMatthew Garrett
commit 41b3254c93acc56adc3c4477fef7c9512d47659e upstream. More recent versions of the UEFI spec have added new attributes for variables. Add them. Signed-off-by: Matthew Garrett <mjg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-05-11pipes: add a "packetized pipe" mode for writingLinus Torvalds
commit 9883035ae7edef3ec62ad215611cb8e17d6a1a5d upstream. The actual internal pipe implementation is already really about individual packets (called "pipe buffers"), and this simply exposes that as a special packetized mode. When we are in the packetized mode (marked by O_DIRECT as suggested by Alan Cox), a write() on a pipe will not merge the new data with previous writes, so each write will get a pipe buffer of its own. The pipe buffer is then marked with the PIPE_BUF_FLAG_PACKET flag, which in turn will tell the reader side to break the read at that boundary (and throw away any partial packet contents that do not fit in the read buffer). End result: as long as you do writes less than PIPE_BUF in size (so that the pipe doesn't have to split them up), you can now treat the pipe as a packet interface, where each read() system call will read one packet at a time. You can just use a sufficiently big read buffer (PIPE_BUF is sufficient, since bigger than that doesn't guarantee atomicity anyway), and the return value of the read() will naturally give you the size of the packet. NOTE! We do not support zero-sized packets, and zero-sized reads and writes to a pipe continue to be no-ops. Also note that big packets will currently be split at write time, but that the size at which that happens is not really specified (except that it's bigger than PIPE_BUF). Currently that limit is the system page size, but we might want to explicitly support bigger packets some day. The main user for this is going to be the autofs packet interface, allowing us to stop having to care so deeply about exact packet sizes (which have had bugs with 32/64-bit compatibility modes). But user space can create packetized pipes with "pipe2(fd, O_DIRECT)", which will fail with an EINVAL on kernels that do not support this interface. Tested-by: Michael Tokarev <mjt@tls.msk.ru> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: David Miller <davem@davemloft.net> Cc: Ian Kent <raven@themaw.net> Cc: Thomas Meyer <thomas@m3y3r.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-05-11USB: EHCI: fix crash during suspend on ASUS computersAlan Stern
commit 151b61284776be2d6f02d48c23c3625678960b97 upstream. This patch (as1545) fixes a problem affecting several ASUS computers: The machine crashes or corrupts memory when going into suspend if the ehci-hcd driver is bound to any controllers. Users have been forced to unbind or unload ehci-hcd before putting their systems to sleep. After extensive testing, it was determined that the machines don't like going into suspend when any EHCI controllers are in the PCI D3 power state. Presumably this is a firmware bug, but there's nothing we can do about it except to avoid putting the controllers in D3 during system sleep. The patch adds a new flag to indicate whether the problem is present, and avoids changing the controller's power state if the flag is set. Runtime suspend is unaffected; this matters only for system suspend. However as a side effect, the controller will not respond to remote wakeup requests while the system is asleep. Hence USB wakeup is not functional -- but of course, this is already true in the current state of affairs. This fixes Bugzilla #42728. Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Tested-by: Steven Rostedt <rostedt@goodmis.org> Tested-by: Andrey Rahmatullin <wrar@wrar.name> Tested-by: Oleksij Rempel (fishor) <bug-track@fisher-privat.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-05-11tcp: avoid order-1 allocations on wifi and tx pathEric Dumazet
[ This combines upstream commit a21d45726acacc963d8baddf74607d9b74e2b723 and the follow-on bug fix commit 22b4a4f22da4b39c6f7f679fd35f3d35c91bf851 ] Marc Merlin reported many order-1 allocations failures in TX path on its wireless setup, that dont make any sense with MTU=1500 network, and non SG capable hardware. After investigation, it turns out TCP uses sk_stream_alloc_skb() and used as a convention skb_tailroom(skb) to know how many bytes of data payload could be put in this skb (for non SG capable devices) Note : these skb used kmalloc-4096 (MTU=1500 + MAX_HEADER + sizeof(struct skb_shared_info) being above 2048) Later, mac80211 layer need to add some bytes at the tail of skb (IEEE80211_ENCRYPT_TAILROOM = 18 bytes) and since no more tailroom is available has to call pskb_expand_head() and request order-1 allocations. This patch changes sk_stream_alloc_skb() so that only sk->sk_prot->max_header bytes of headroom are reserved, and use a new skb field, avail_size to hold the data payload limit. This way, order-0 allocations done by TCP stack can leave more than 2 KB of tailroom and no more allocation is performed in mac80211 layer (or any layer needing some tailroom) avail_size is unioned with mark/dropcount, since mark will be set later in IP stack for output packets. Therefore, skb size is unchanged. Reported-by: Marc MERLIN <marc@merlins.org> Tested-by: Marc MERLIN <marc@merlins.org> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> [bwh: Correct commit hash for follow-on bug fix] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-05-11tcp: allow splice() to build full TSO packetsEric Dumazet
[ This combines upstream commit 2f53384424251c06038ae612e56231b96ab610ee and the follow-on bug fix commit 35f9c09fe9c72eb8ca2b8e89a593e1c151f28fc2 ] vmsplice()/splice(pipe, socket) call do_tcp_sendpages() one page at a time, adding at most 4096 bytes to an skb. (assuming PAGE_SIZE=4096) The call to tcp_push() at the end of do_tcp_sendpages() forces an immediate xmit when pipe is not already filled, and tso_fragment() try to split these skb to MSS multiples. 4096 bytes are usually split in a skb with 2 MSS, and a remaining sub-mss skb (assuming MTU=1500) This makes slow start suboptimal because many small frames are sent to qdisc/driver layers instead of big ones (constrained by cwnd and packets in flight of course) In fact, applications using sendmsg() (adding an additional memory copy) instead of vmsplice()/splice()/sendfile() are a bit faster because of this anomaly, especially if serving small files in environments with large initial [c]wnd. Call tcp_push() only if MSG_MORE is not set in the flags parameter. This bit is automatically provided by splice() internals but for the last page, or on all pages if user specified SPLICE_F_MORE splice() flag. In some workloads, this can reduce number of sent logical packets by an order of magnitude, making zero-copy TCP actually faster than one-copy :) Reported-by: Tom Herbert <therbert@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: H.K. Jerry Chu <hkchu@google.com> Cc: Maciej Żenczykowski <maze@google.com> Cc: Mahesh Bandewar <maheshb@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-05-11net: fix /proc/net/dev regressionEric Dumazet
[ Upstream commit 2def16ae6b0c77571200f18ba4be049b03d75579 ] Commit f04565ddf52 (dev: use name hash for dev_seq_ops) added a second regression, as some devices are missing from /proc/net/dev if many devices are defined. When seq_file buffer is filled, the last ->next/show() method is canceled (pos value is reverted to value prior ->next() call) Problem is after above commit, we dont restart the lookup at right position in ->start() method. Fix this by removing the internal 'pos' pointer added in commit, since we need to use the 'loff_t *pos' provided by seq_file layer. This also reverts commit 5cac98dd0 (net: Fix corruption in /proc/*/net/dev_mcast), since its not needed anymore. Reported-by: Ben Greear <greearb@candelatech.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Mihai Maruseac <mmaruseac@ixiacom.com> Tested-by: Ben Greear <greearb@candelatech.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-05-11KVM: unmap pages from the iommu when slots are removedAlex Williamson
commit 32f6daad4651a748a58a3ab6da0611862175722f upstream. We've been adding new mappings, but not destroying old mappings. This can lead to a page leak as pages are pinned using get_user_pages, but only unpinned with put_page if they still exist in the memslots list on vm shutdown. A memslot that is destroyed while an iommu domain is enabled for the guest will therefore result in an elevated page reference count that is never cleared. Additionally, without this fix, the iommu is only programmed with the first translation for a gpa. This can result in peer-to-peer errors if a mapping is destroyed and replaced by a new mapping at the same gpa as the iommu will still be pointing to the original, pinned memory address. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>