summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2012-03-29crypto: user - Fix size of netlink dump messageSteffen Klassert
The default netlink message size limit might be exceeded when dumping a lot of algorithms to userspace. As a result, not all of the instantiated algorithms dumped to userspace. So calculate an upper bound on the message size and call netlink_dump_start() with that value. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-03-29crypto: user - Fix lookup of algorithms with IV generatorSteffen Klassert
We lookup algorithms with crypto_alg_mod_lookup() when instantiating via crypto_add_alg(). However, algorithms that are wrapped by an IV genearator (e.g. aead or genicv type algorithms) need special care. The userspace process hangs until it gets a timeout when we use crypto_alg_mod_lookup() to lookup these algorithms. So export the lookup functions for these algorithms and use them in crypto_add_alg(). Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2012-03-22Merge tag 'stable/for-linus-3.4-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen Pull xen updates from Konrad Rzeszutek Wilk: "which has three neat features: - PV multiconsole support, so that there can be hvc1, hvc2, etc; This can be used in HVM and in PV mode. - P-state and C-state power management driver that uploads said power management data to the hypervisor. It also inhibits cpufreq scaling drivers to load so that only the hypervisor can make power management decisions - fixing a weird perf bug. There is one thing in the Kconfig that you won't like: "default y if (X86_ACPI_CPUFREQ = y || X86_POWERNOW_K8 = y)" (note, that it all depends on CONFIG_XEN which depends on CONFIG_PARAVIRT which by default is off). I've a fix to convert that boolean expression into "default m" which I am going to post after the cpufreq git pull - as the two patches to make this work depend on a fix in Dave Jones's tree. - Function Level Reset (FLR) support in the Xen PCI backend. Fixes: - Kconfig dependencies for Xen PV keyboard and video - Compile warnings and constify fixes - Change over to use percpu_xxx instead of this_cpu_xxx" Fix up trivial conflicts in drivers/tty/hvc/hvc_xen.c due to changes to a removed commit. * tag 'stable/for-linus-3.4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: xen kconfig: relax INPUT_XEN_KBDDEV_FRONTEND deps xen/acpi-processor: C and P-state driver that uploads said data to hypervisor. xen: constify all instances of "struct attribute_group" xen/xenbus: ignore console/0 hvc_xen: introduce HVC_XEN_FRONTEND hvc_xen: implement multiconsole support hvc_xen: support PV on HVM consoles xenbus: don't free other end details too early xen/enlighten: Expose MWAIT and MWAIT_LEAF if hypervisor OKs it. xen/setup/pm/acpi: Remove the call to boot_option_idle_override. xenbus: address compiler warnings xen: use this_cpu_xxx replace percpu_xxx funcs xen/pciback: Support pci_reset_function, aka FLR or D3 support. pci: Introduce __pci_reset_function_locked to be used when holding device_lock. xen: Utilize the restore_msi_irqs hook.
2012-03-22Merge tag 'stable/for-linus-3.4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm Pull cleancache changes from Konrad Rzeszutek Wilk: "This has some patches for the cleancache API that should have been submitted a _long_ time ago. They are basically cleanups: - rename of flush to invalidate - moving reporting of statistics into debugfs - use __read_mostly as necessary. Oh, and also the MAINTAINERS file change. The files (except the MAINTAINERS file) have been in #linux-next for months now. The late addition of MAINTAINERS file is a brain-fart on my side - didn't realize I needed that just until I was typing this up - and I based that patch on v3.3 - so the tree is on top of v3.3." * tag 'stable/for-linus-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm: MAINTAINERS: Adding cleancache API to the list. mm: cleancache: Use __read_mostly as appropiate. mm: cleancache: report statistics via debugfs instead of sysfs. mm: zcache/tmem/cleancache: s/flush/invalidate/ mm: cleancache: s/flush/invalidate/
2012-03-22Merge branch 'drm-radeon-sitn-support' of ↵Linus Torvalds
git://people.freedesktop.org/~airlied/linux Pull radeon southern islands / trinity support from Dave Airlie: "This is support from AMD for their newest GPU and APUs. The products called RadeonHD 7xxx, and the Trinity APU series. This did come in a bit late, due to some over-complicated AMD internal review process, which from the outside seems unnecessary once the company has decided it wants to support open source. However as I said previously I'd rather not put the people who've got this hw for 3 months now being forced to use fglrx on it if there is open code. Its pretty well self contained and just plugs into the driver in various places." * 'drm-radeon-sitn-support' of git://people.freedesktop.org/~airlied/linux: (48 commits) drm/radeon/kms: update duallink checks for DCE6 drm/radeon/kms: add trinity pci ids drm/radeon/kms: add radeon_asic struct for trinity drm/radeon/kms: add support for ucode loading on trinity (v2) drm/radeon/kms/vm: set vram base offset properly for TN drm/radeon/kms: Update evergreen functions for trinity drm/radeon/kms: cayman gpu init updates for trinity drm/radeon/kms: Add checks for TN in the DP bridge code drm/radeon/kms/DCE6.1: ss is not supported on the internal pplls drm/radeon/kms: disable PPLL0 on DCE6.1 when not in use drm/radeon/kms: Adjust pll picker for DCE6.1 drm/radeon/kms: DCE6.1 disp eng pll updates drm/radeon/kms: DCE6.1 watermark updates for TN drm/radeon/kms: no support for internal thermal sensor on TN yet drm/radeon/kms: add trinity (TN) chip family drm/radeon/kms: Add SI pci ids drm/radeon: Update radeon_info_ioctl for SI. (v2) drm/radeon/kms: add radeon_asic struct for SI drm/radeon/kms: add support for compute rings in CS ioctl on SI drm/radeon/kms: fill in startup/shutdown callbacks for SI ...
2012-03-22Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linuxLinus Torvalds
Pull drm main changes from Dave Airlie: "This is the main drm pull request, I'm probably going to send two more smaller ones, will explain below. This contains a patch that is also in the fbdev tree, but it should be the same patch, it added an API for hot unplugging framebuffer devices, and I need that API for a new driver. It also contains some changes to the i2c tree which Jean has acked, and one change to moorestown platform stuff in x86. Highlights: - new drivers: UDL driver for USB displaylink devices, kms only, should support correct hotplug operations. - core: i2c speedups + better hotplug support, EDID overriding via firmware interface - allows user to load a firmware for a broken monitor/kvm from userspace, it even has documentation for it. - exynos: new HDMI audio + hdmi 1.4 + virtual output driver - gma500: code cleanup - radeon: cleanups, CS optimisations, streamout support and pageflip fix - nouveau: NVD9 displayport support + more reclocking work - i915: re-enabling GMBUS, finish gpu patch (might help hibernation who knows), missed irq fixes, stencil tiling fixes, interlaced support, aliasesd PPGTT support for SNB/IVB, swizzling for SNB/IVB, semaphore fixes As well as the usual bunch of cleanups and fixes all over the place. I've got two things I'd like to merge a bit later: a) AMD support for all their new radeonhd 7000 series GPU and APUs. AMD dropped this a bit late due to insane internal review processes, (please AMD just follow Intel and let open source guys ship stuff early) however I don't want to penalise people who own this hardware (since its been on sale for 3-4 months and GPU hw doesn't exactly have a lifetime in years) and consign them to using closed drivers for longer than necessary. The changes are well contained and just plug into the driver new gpu functionality so they should be fairly regression proof. I just want to give them a bit of a run on the hw AMD kindly sent me. b) drm prime/dma-buf interface code. This is just infrastructure code to expose the dma-buf stuff to drm drivers and to userspace. I'm not planning on pushing any driver support in this cycle (except maybe exynos), but I'd like to get the infrastructure code in so for the next cycle I can start getting the driver support into the individual drivers. We have started driver support for i915, nouveau and udl along with I think exynos and omap in staging. However this code relies on the dma-buf tree being pulled into your tree first since it needs the latest interfaces from that tree. I'll push to get that tree sent asap. (oh and any warnings you see in i915 are gcc's fault from what anyone can see)." Fix up trivial conflicts in arch/x86/platform/mrst/mrst.c due to the new msic_thermal_platform_data() thermal function being added next to the tc35876x_platform_data() i2c device function.. * 'drm-next' of git://people.freedesktop.org/~airlied/linux: (326 commits) drm/i915: use DDC_ADDR instead of hard-coding it drm/radeon: use DDC_ADDR instead of hard-coding it drm: remove unneeded redefinition of DDC_ADDR drm/exynos: added virtual display driver. drm: allow loading an EDID as firmware to override broken monitor drm/exynos: enable hdmi audio feature drm/exynos: add default pixel format for plane drm/exynos: cleanup exynos_hdmi.h drm/exynos: add is_local member in exynos_drm_subdrv struct drm/exynos: add subdrv open/close functions drm/exynos: remove module of exynos drm subdrv drm/exynos: release pending pageflip events when closed drm/exynos: added new funtion to get/put dma address. drm/exynos: update gem and buffer framework. drm/exynos: added mode_fixup feature and code clean. drm/exynos: add HDMI version 1.4 support drm/exynos: remove exynos_mixer.h gma500: Fix mmap frambuffer drm/radeon: Drop radeon_gem_object_(un)pin. drm/radeon: Restrict offset for legacy display engine. ...
2012-03-22Merge tag 'sound-3.4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound Pull updates of sound stuff from Takashi Iwai: "Here is the first big update chunk of sound stuff for 3.4-rc1. In the common sound infrastructure, there are a few changes for dynamic PCM support (used in ASoC) and a few clean-ups. Majority of changes are found, as usual, in HD-audio and ASoC. Some highlights of HD-audio changes: - All the long-standing static quirk codes for Realtek codec were finally removed by fixing and extending the Realtek auto-parser. - The mute-LED control is standardized over all HD-audio codec drivers using the extended vmaster hook. - The vmaster slave mixer elements are initialized to 0dB as default so that the user won't be annoyed by the silent output after updates, e.g. due to the additions of new elements. - Other many fix-ups for the misc HD-audio devices. In the ASoC side, this is a very active release, including a quite a few framework enhancements. Some highlights: - Support for widgets not associated with a CODEC, an important part of the dynamic PCM framework. - A library factoring out the common code shared by dmaengine based DMA drivers contributed by Lars-Peter Clausen. This will save a lot of code and make it much easier to deploy enhancements to dmaengine. - Support for binary controls, used for providing runtime configuration of algorithm coefficients. - A new DAPM widget type for regulator supplies allowing drivers for devices that can power down unused supplies while active to do without any per-driver code. - DAPM widgets for DAIs, initially giving a speed boost for playback startup and shutdown and also the basis for CODEC<->CODEC DAI link support. - Support for specifying the number of significant bits on audio interfaces, useful for allowing applications to know how much effort to put into generating data for a larger sample format. - Conversion of the FSI driver used on some SH processors to DMAEngine. - Conversion of EP93xx drivers to DMAEngine. - New CODEC drivers for Maxim MAX9768 and Wolfson Microelectronics WM2200. - Move audmux driver from arc/arm to sound/soc - McBSP move from arch/ to sound/ and updates Also, a few small updates and fixes for other drivers like au88x0, ymfpci, USB 6fire, USB usx2yaudio are included." * tag 'sound-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (446 commits) ASoC: wm8994: Provide VMID mode control and fix default sequence ASoC: wm8994: Add missing break in resume ASoC: wm_hubs: Don't actively manage LINEOUT_VMID_BUF ASoC: pxa-ssp: atomically set stream active masks ASoC: fsl: p1022ds: tell the WM8776 codec driver that it's the master ASoC: Samsung: Added to support mono recording ALSA: hda - Fix build with CONFIG_PM=n ALSA: au88x0 - Avoid possible Oops at unbinding ALSA: usb-audio - Fix build error by consitification of rate list ASoC: core: Fix obscure leak of runtime array ALSA: pcm - Avoid GFP_ATOMIC in snd_pcm_link() ALSA: pcm: Constify the list in snd_pcm_hw_constraint_list ASoC: wm8996: Add 44.1kHz support ALSA: hda - Fix build of patch_sigmatel.c without CONFIG_SND_HDA_POWER_SAVE ASoC: mx27vis-aic32x4: Convert it to platform driver ALSA: hda - fix printing of high HDMI sample rates ALSA: ymfpci - Fix legacy registers on S3/S4 resume ALSA: control - Fixe a trailing white space error ALSA: hda - Add expose_enum_ctl flag to snd_hda_add_vmaster_hook() ALSA: hda - Add "Mute-LED Mode" enum control ...
2012-03-22Merge tag 'scsi-misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6 SCSI updates from James Bottomley: "The update includes the usual assortment of driver updates (lpfc, qla2xxx, qla4xxx, bfa, bnx2fc, bnx2i, isci, fcoe, hpsa) plus a huge amount of infrastructure work in the SAS library and transport class as well as an iSCSI update. There's also a new SCSI based virtio driver." * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (177 commits) [SCSI] qla4xxx: Update driver version to 5.02.00-k15 [SCSI] qla4xxx: trivial cleanup [SCSI] qla4xxx: Fix sparse warning [SCSI] qla4xxx: Add support for multiple session per host. [SCSI] qla4xxx: Export CHAP index as sysfs attribute [SCSI] scsi_transport: Export CHAP index as sysfs attribute [SCSI] qla4xxx: Add support to display CHAP list and delete CHAP entry [SCSI] iscsi_transport: Add support to display CHAP list and delete CHAP entry [SCSI] pm8001: fix endian issue with code optimization. [SCSI] pm8001: Fix possible racing condition. [SCSI] pm8001: Fix bogus interrupt state flag issue. [SCSI] ipr: update PCI ID definitions for new adapters [SCSI] qla2xxx: handle default case in qla2x00_request_firmware() [SCSI] isci: improvements in driver unloading routine [SCSI] isci: improve phy event warnings [SCSI] isci: debug, provide state-enum-to-string conversions [SCSI] scsi_transport_sas: 'enable' phys on reset [SCSI] libsas: don't recover end devices attached to disabled phys [SCSI] libsas: fixup target_port_protocols for expanders that don't report sata [SCSI] libsas: set attached device type and target protocols for local phys ...
2012-03-22Merge branch 'for-next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending Pull SCSI target updates from Nicholas Bellinger: "This contains the usual set of updates and bugfixes to target-core + existing fabric module code, along with a handful of the patches destined for v3.3 stable. It also contains the necessary target-core infrastructure pieces required to run using tcm_qla2xxx.ko WWPNs with the new Qlogic Fibre Channel fabric module currently queued in target-pending/for-next-merge, and coming for round 2. The highlights for this series include: - Add target_submit_tmr() helper function for fabric task management (andy) - Convert tcm_fc to use target_submit_tmr() (andy) - Replace target core various cmd flags with a transport state (hch) - Convert loopback to use workqueue submission (hch) - Convert target core to use array_zalloc for tpg_lun_list (joern) - Convert target core to use array_zalloc for device_list (joern) - Add target core support for TMR_ABORT_TASK (nab) - Add target core se_sess->sess_kref + get/put helpers (nab) - Add target core se_node_acl->acl_kref for ->acl_free_comp usage (nab) - Convert iscsi-target to use target_put_session + sess_kref (nab) - Fix tcm_fc fc_exch memory leak in ft_send_resp_status (nab) - Fix ib_srpt srpt_handle_cmd send_ioctx->ioctx_kref leak on exception (nab) - Fix target core up handling of short INQUIRY buffers (roland) - Untangle target-core front-end and back-end meanings of max_sectors attribute (roland) - Set loopback residual field for SCSI commands (roland) - Fix target-core 16-bit target ports for SET TARGET PORT GROUPS emulation (roland) Thanks again to Andy, Christoph, Joern, Roland, and everyone who has contributed this round!" * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (64 commits) ib_srpt: Fix srpt_handle_cmd send_ioctx->ioctx_kref leak on exception loopback: Fix transport_generic_allocate_tasks error handling iscsi-target: remove improper externs iscsi-target: Remove unused variables in iscsi_target_parameters.c target: remove obvious warnings target: Use array_zalloc for device_list target: Use array_zalloc for tpg_lun_list target: Fix sense code for unsupported SERVICE ACTION IN target: Remove hack to make READ CAPACITY(10) lie if thin provisioning is enabled target: Bump core version to v4.1.0-rc2-ml + fabric versions tcm_fc: Fix fc_exch memory leak in ft_send_resp_status target: Drop unused legacy target_core_fabric_ops API callers iscsi-target: Convert to use target_put_session + sess_kref target: Convert se_node_acl->acl_group removal to use ->acl_kref target: Add se_node_acl->acl_kref for ->acl_free_comp usage target: Add se_node_acl->acl_free_comp for NodeACL release path target: Add se_sess->sess_kref + get/put helpers target: Convert session_lock to irqsave target: Fix typo in drivers/target iscsi-target: Fix dynamic -> explict NodeACL pointer reference ...
2012-03-22Merge tag 'md-3.4' of git://neil.brown.name/mdLinus Torvalds
Pull md updates for 3.4 from Neil Brown: "Mostly tidying up code in preparation for some bigger changes next time. A few bug fixes tagged for -stable. Main functionality change is that some RAID10 arrays can now grow to use extra space that may have been made available on the individual devices." Fixed up trivial conflicts with the k[un]map_atomic() cleanups in drivers/md/bitmap.c. * tag 'md-3.4' of git://neil.brown.name/md: (22 commits) md: Add judgement bb->unacked_exist in function md_ack_all_badblocks(). md: fix clearing of the 'changed' flags for the bad blocks list. md/bitmap: discard CHUNK_BLOCK_SHIFT macro md/bitmap: remove unnecessary indirection when allocating. md/bitmap: remove some pointless locking. md/bitmap: change a 'goto' to a normal 'if' construct. md/bitmap: move printing of bitmap status to bitmap.c md/bitmap: remove some unused noise from bitmap.h md/raid10 - support resizing some RAID10 arrays. md/raid1: handle merge_bvec_fn in member devices. md/raid10: handle merge_bvec_fn in member devices. md: add proper merge_bvec handling to RAID0 and Linear. md: tidy up rdev_for_each usage. md/raid1,raid10: avoid deadlock during resync/recovery. md/bitmap: ensure to load bitmap when creating via sysfs. md: don't set md arrays to readonly on shutdown. md: allow re-add to failed arrays. md/raid5: use atomic_dec_return() instead of atomic_dec() and atomic_read(). md: Use existed macros instead of numbers md/raid5: removed unused 'added_devices' variable. ...
2012-03-22Merge branch 'x86-platform-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 platform changes from Ingo Molnar. Removes the Moorestown platform that nobody ever used. * 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/platform: Move APIC ID validity check into platform APIC code x86/olpc/xo15/sci: Enable lid close wakeup control x86/geode/net5501: Add platform driver for Soekris Engineering net5501 x86/geode/alix2: Supplement driver to include GPIO button support x86/mid/powerbtn: Use MSIC read/write instead of ipc_scu x86/mid/thermal: Turn off thermistor x86/mid/thermal: Add msic_thermal alias x86/mid/thermal: Convert to use Intel MSIC API x86/mid/scu_ipc: Remove Moorestown support x86/mid: Kill off Moorestown x86/mrst: Add msic_thermal platform support x86/config: Select MSIC MFD driver on Intel Medfield platform x86/mid: Remove Intel Moorestown x86/mrst: Set ISA bus type for fake MP IRQs x86/ioapic: Use legacy_pic to set correct gsi-irq mapping
2012-03-22Merge branch 'x86-mce-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull MCE changes from Ingo Molnar. * 'x86-mce-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce: Fix return value of mce_chrdev_read() when erst is disabled x86/mce: Convert static array of pointers to per-cpu variables x86/mce: Replace hard coded hex constants with symbolic defines x86/mce: Recognise machine check bank signature for data path error x86/mce: Handle "action required" errors x86/mce: Add mechanism to safely save information in MCE handler x86/mce: Create helper function to save addr/misc when needed HWPOISON: Add code to handle "action required" errors. HWPOISON: Clean up memory_failure() vs. __memory_failure()
2012-03-22Merge branch 'x86-eficross-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86/eficross (booting 32/64-bit kernel from 64/32-bit EFI) from Ingo Molnar * 'x86-eficross-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, efi: Allow basic init with mixed 32/64-bit efi/kernel x86, efi: Add basic error handling x86, efi: Cleanup config table walking x86, efi: Convert printk to pr_*() x86, efi: Refactor efi_init() a bit
2012-03-22Merge branch 'x86-asm-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86/asm changes from Ingo Molnar * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86: Include probe_roms.h in probe_roms.c x86/32: Print control and debug registers for kerenel context x86: Tighten dependencies of CPU_SUP_*_32 x86/numa: Improve internode cache alignment x86: Fix the NMI nesting comments x86-64: Improve insn scheduling in SAVE_ARGS_IRQ x86-64: Fix CFI annotations for NMI nesting code bitops: Add missing parentheses to new get_order macro bitops: Optimise get_order() bitops: Adjust the comment on get_order() to describe the size==0 case x86/spinlocks: Eliminate TICKET_MASK x86-64: Handle byte-wise tail copying in memcpy() without a loop x86-64: Fix memcpy() to support sizes of 4Gb and above x86-64: Fix memset() to support sizes of 4Gb and above x86-64: Slightly shorten copy_page()
2012-03-22Merge branch 'akpm' (Andrew's patch-bomb)Linus Torvalds
Merge first batch of patches from Andrew Morton: "A few misc things and all the MM queue" * emailed from Andrew Morton <akpm@linux-foundation.org>: (92 commits) memcg: avoid THP split in task migration thp: add HPAGE_PMD_* definitions for !CONFIG_TRANSPARENT_HUGEPAGE memcg: clean up existing move charge code mm/memcontrol.c: remove unnecessary 'break' in mem_cgroup_read() mm/memcontrol.c: remove redundant BUG_ON() in mem_cgroup_usage_unregister_event() mm/memcontrol.c: s/stealed/stolen/ memcg: fix performance of mem_cgroup_begin_update_page_stat() memcg: remove PCG_FILE_MAPPED memcg: use new logic for page stat accounting memcg: remove PCG_MOVE_LOCK flag from page_cgroup memcg: simplify move_account() check memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat) memcg: kill dead prev_priority stubs memcg: remove PCG_CACHE page_cgroup flag memcg: let css_get_next() rely upon rcu_read_lock() cgroup: revert ss_id_lock to spinlock idr: make idr_get_next() good for rcu_read_lock() memcg: remove unnecessary thp check in page stat accounting memcg: remove redundant returns memcg: enum lru_list lru ...
2012-03-21Merge branch 'next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc Pull powerpc merge from Benjamin Herrenschmidt: "Here's the powerpc batch for this merge window. It is going to be a bit more nasty than usual as in touching things outside of arch/powerpc mostly due to the big iSeriesectomy :-) We finally got rid of the bugger (legacy iSeries support) which was a PITA to maintain and that nobody really used anymore. Here are some of the highlights: - Legacy iSeries is gone. Thanks Stephen ! There's still some bits and pieces remaining if you do a grep -ir series arch/powerpc but they are harmless and will be removed in the next few weeks hopefully. - The 'fadump' functionality (Firmware Assisted Dump) replaces the previous (equivalent) "pHyp assisted dump"... it's a rewrite of a mechanism to get the hypervisor to do crash dumps on pSeries, the new implementation hopefully being much more reliable. Thanks Mahesh Salgaonkar. - The "EEH" code (pSeries PCI error handling & recovery) got a big spring cleaning, motivated by the need to be able to implement a new backend for it on top of some new different type of firwmare. The work isn't complete yet, but a good chunk of the cleanups is there. Note that this adds a field to struct device_node which is not very nice and which Grant objects to. I will have a patch soon that moves that to a powerpc private data structure (hopefully before rc1) and we'll improve things further later on (hopefully getting rid of the need for that pointer completely). Thanks Gavin Shan. - I dug into our exception & interrupt handling code to improve the way we do lazy interrupt handling (and make it work properly with "edge" triggered interrupt sources), and while at it found & fixed a wagon of issues in those areas, including adding support for page fault retry & fatal signals on page faults. - Your usual random batch of small fixes & updates, including a bunch of new embedded boards, both Freescale and APM based ones, etc..." I fixed up some conflicts with the generalized irq-domain changes from Grant Likely, hopefully correctly. * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (141 commits) powerpc/ps3: Do not adjust the wrapper load address powerpc: Remove the rest of the legacy iSeries include files powerpc: Remove the remaining CONFIG_PPC_ISERIES pieces init: Remove CONFIG_PPC_ISERIES powerpc: Remove FW_FEATURE ISERIES from arch code tty/hvc_vio: FW_FEATURE_ISERIES is no longer selectable powerpc/spufs: Fix double unlocks powerpc/5200: convert mpc5200 to use of_platform_populate() powerpc/mpc5200: add options to mpc5200_defconfig powerpc/mpc52xx: add a4m072 board support powerpc/mpc5200: update mpc5200_defconfig to fit for charon board Documentation/powerpc/mpc52xx.txt: Checkpatch cleanup powerpc/44x: Add additional device support for APM821xx SoC and Bluestone board powerpc/44x: Add support PCI-E for APM821xx SoC and Bluestone board MAINTAINERS: Update PowerPC 4xx tree powerpc/44x: The bug fixed support for APM821xx SoC and Bluestone board powerpc: document the FSL MPIC message register binding powerpc: add support for MPIC message register API powerpc/fsl: Added aliased MSIIR register address to MSI node in dts powerpc/85xx: mpc8548cds - add 36-bit dts ...
2012-03-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmwLinus Torvalds
Pull gfs2 changes from Steven Whitehouse. * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw: GFS2: Change truncate page allocation to be GFP_NOFS GFS2: call gfs2_write_alloc_required for each chunk GFS2: Clean up log flush header writing GFS2: Remove a __GFP_NOFAIL allocation GFS2: Flush pending glock work when evicting an inode GFS2: make sure rgrps are up to date in func gfs2_blk2rgrpd GFS2: Eliminate sd_rindex_mutex GFS2: Unlock rindex mutex on glock error GFS2: Make bd_cmp() static GFS2: Sort the ordered write list GFS2: FITRIM ioctl support GFS2: Move two functions from log.c to lops.c GFS2: glock statistics gathering
2012-03-21thp: add HPAGE_PMD_* definitions for !CONFIG_TRANSPARENT_HUGEPAGENaoya Horiguchi
These macros will be used in a later patch, where all usages are expected to be optimized away without #ifdef CONFIG_TRANSPARENT_HUGEPAGE. But to detect unexpected usages, we convert the existing BUG() to BUILD_BUG(). [akpm@linux-foundation.org: fix build in mm/pgtable-generic.c] Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <dhillf@gmail.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21memcg: fix performance of mem_cgroup_begin_update_page_stat()KAMEZAWA Hiroyuki
mem_cgroup_begin_update_page_stat() should be very fast because it's called very frequently. Now, it needs to look up page_cgroup and its memcg....this is slow. This patch adds a global variable to check "any memcg is moving or not". With this, the caller doesn't need to visit page_cgroup and memcg. Here is a test result. A test program makes page faults onto a file, MAP_SHARED and makes each page's page_mapcount(page) > 1, and free the range by madvise() and page fault again. This program causes 26214400 times of page fault onto a file(size was 1G.) and shows shows the cost of mem_cgroup_begin_update_page_stat(). Before this patch for mem_cgroup_begin_update_page_stat() [kamezawa@bluextal test]$ time ./mmap 1G real 0m21.765s user 0m5.999s sys 0m15.434s 27.46% mmap mmap [.] reader 21.15% mmap [kernel.kallsyms] [k] page_fault 9.17% mmap [kernel.kallsyms] [k] filemap_fault 2.96% mmap [kernel.kallsyms] [k] __do_fault 2.83% mmap [kernel.kallsyms] [k] __mem_cgroup_begin_update_page_stat After this patch [root@bluextal test]# time ./mmap 1G real 0m21.373s user 0m6.113s sys 0m15.016s In usual path, calls to __mem_cgroup_begin_update_page_stat() goes away. Note: we may be able to remove this optimization in future if we can get pointer to memcg directly from struct page. [akpm@linux-foundation.org: don't return a void] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Greg Thelen <gthelen@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21memcg: remove PCG_FILE_MAPPEDKAMEZAWA Hiroyuki
With the new lock scheme for updating memcg's page stat, we don't need a flag PCG_FILE_MAPPED which was duplicated information of page_mapped(). [hughd@google.com: cosmetic fix] [hughd@google.com: add comment to MEM_CGROUP_CHARGE_TYPE_MAPPED case in __mem_cgroup_uncharge_common()] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Greg Thelen <gthelen@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21memcg: use new logic for page stat accountingKAMEZAWA Hiroyuki
Now, page-stat-per-memcg is recorded into per page_cgroup flag by duplicating page's status into the flag. The reason is that memcg has a feature to move a page from a group to another group and we have race between "move" and "page stat accounting", Under current logic, assume CPU-A and CPU-B. CPU-A does "move" and CPU-B does "page stat accounting". When CPU-A goes 1st, CPU-A CPU-B update "struct page" info. move_lock_mem_cgroup(memcg) see pc->flags copy page stat to new group overwrite pc->mem_cgroup. move_unlock_mem_cgroup(memcg) move_lock_mem_cgroup(mem) set pc->flags update page stat accounting move_unlock_mem_cgroup(mem) stat accounting is guarded by move_lock_mem_cgroup() and "move" logic (CPU-A) doesn't see changes in "struct page" information. But it's costly to have the same information both in 'struct page' and 'struct page_cgroup'. And, there is a potential problem. For example, assume we have PG_dirty accounting in memcg. PG_..is a flag for struct page. PCG_ is a flag for struct page_cgroup. (This is just an example. The same problem can be found in any kind of page stat accounting.) CPU-A CPU-B TestSet PG_dirty (delay) TestClear PG_dirty if (TestClear(PCG_dirty)) memcg->nr_dirty-- if (TestSet(PCG_dirty)) memcg->nr_dirty++ Here, memcg->nr_dirty = +1, this is wrong. This race was reported by Greg Thelen <gthelen@google.com>. Now, only FILE_MAPPED is supported but fortunately, it's serialized by page table lock and this is not real bug, _now_, If this potential problem is caused by having duplicated information in struct page and struct page_cgroup, we may be able to fix this by using original 'struct page' information. But we'll have a problem in "move account" Assume we use only PG_dirty. CPU-A CPU-B TestSet PG_dirty (delay) move_lock_mem_cgroup() if (PageDirty(page)) new_memcg->nr_dirty++ pc->mem_cgroup = new_memcg; move_unlock_mem_cgroup() move_lock_mem_cgroup() memcg = pc->mem_cgroup new_memcg->nr_dirty++ accounting information may be double-counted. This was original reason to have PCG_xxx flags but it seems PCG_xxx has another problem. I think we need a bigger lock as move_lock_mem_cgroup(page) TestSetPageDirty(page) update page stats (without any checks) move_unlock_mem_cgroup(page) This fixes both of problems and we don't have to duplicate page flag into page_cgroup. Please note: move_lock_mem_cgroup() is held only when there are possibility of "account move" under the system. So, in most path, status update will go without atomic locks. This patch introduces mem_cgroup_begin_update_page_stat() and mem_cgroup_end_update_page_stat() both should be called at modifying 'struct page' information if memcg takes care of it. as mem_cgroup_begin_update_page_stat() modify page information mem_cgroup_update_page_stat() => never check any 'struct page' info, just update counters. mem_cgroup_end_update_page_stat(). This patch is slow because we need to call begin_update_page_stat()/ end_update_page_stat() regardless of accounted will be changed or not. A following patch adds an easy optimization and reduces the cost. [akpm@linux-foundation.org: s/lock/locked/] [hughd@google.com: fix deadlock by avoiding stat lock when anon] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Greg Thelen <gthelen@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21memcg: remove PCG_MOVE_LOCK flag from page_cgroupKAMEZAWA Hiroyuki
PCG_MOVE_LOCK is used for bit spinlock to avoid race between overwriting pc->mem_cgroup and page statistics accounting per memcg. This lock helps to avoid the race but the race is very rare because moving tasks between cgroup is not a usual job. So, it seems using 1bit per page is too costly. This patch changes this lock as per-memcg spinlock and removes PCG_MOVE_LOCK. If smaller lock is required, we'll be able to add some hashes but I'd like to start from this. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21memcg: kill dead prev_priority stubsKonstantin Khlebnikov
This code was removed in 25edde033291 ("vmscan: kill prev_priority completely") Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21memcg: remove PCG_CACHE page_cgroup flagKAMEZAWA Hiroyuki
We record 'the page is cache' with the PCG_CACHE bit in page_cgroup. Here, "CACHE" means anonymous user pages (and SwapCache). This doesn't include shmem. Considering callers, at charge/uncharge, the caller should know what the page is and we don't need to record it by using one bit per page. This patch removes PCG_CACHE bit and make callers of mem_cgroup_charge_statistics() to specify what the page is. About page migration: Mapping of the used page is not touched during migra tion (see page_remove_rmap) so we can rely on it and push the correct charge type down to __mem_cgroup_uncharge_common from end_migration for unused page. The force flag was misleading was abused for skipping the needless page_mapped() / PageCgroupMigration() check, as we know the unused page is no longer mapped and cleared the migration flag just a few lines up. But doing the checks is no biggie and it's not worth adding another flag just to skip them. [akpm@linux-foundation.org: checkpatch fixes] [hughd@google.com: fix PageAnon uncharging] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Ying Han <yinghan@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21cgroup: revert ss_id_lock to spinlockHugh Dickins
Commit c1e2ee2dc436 ("memcg: replace ss->id_lock with a rwlock") has now been seen to cause the unfair behavior we should have expected from converting a spinlock to an rwlock: softlockup in cgroup_mkdir(), whose get_new_cssid() is waiting for the wlock, while there are 19 tasks using the rlock in css_get_next() to get on with their memcg workload (in an artificial test, admittedly). Yet lib/idr.c was made suitable for RCU way back: revert that commit, restoring ss->id_lock to a spinlock. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21memcg: replace MEM_CONT by MEM_RES_CTLRHugh Dickins
Correct an #endif comment in memcontrol.h from MEM_CONT to MEM_RES_CTLR. Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21page_alloc.c: remove add_from_early_node_map()Kautuk Consul
add_from_early_node_map() is unused. Signed-off-by: Kautuk Consul <consul.kautuk@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21hugetlbfs: fix alignment of huge page requestsSteven Truelove
When calling shmget() with SHM_HUGETLB, shmget aligns the request size to PAGE_SIZE, but this is not sufficient. Modify hugetlb_file_setup() to align requests to the huge page size, and to accept an address argument so that all alignment checks can be performed in hugetlb_file_setup(), rather than in its callers. Change newseg() and mmap_pgoff() to match the new prototype and eliminate a now redundant alignment check. [akpm@linux-foundation.org: fix build] Signed-off-by: Steven Truelove <steven.truelove@utoronto.ca> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21mm, counters: remove task argument to sync_mm_rss() and __sync_task_rss_stat()David Rientjes
sync_mm_rss() can only be used for current to avoid race conditions in iterating and clearing its per-task counters. Remove the task argument for it and its helper function, __sync_task_rss_stat(), to avoid thinking it can be used safely for anything other than current. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21hugepages: fix use after free bug in "quota" handlingDavid Gibson
hugetlbfs_{get,put}_quota() are badly named. They don't interact with the general quota handling code, and they don't much resemble its behaviour. Rather than being about maintaining limits on on-disk block usage by particular users, they are instead about maintaining limits on in-memory page usage (including anonymous MAP_PRIVATE copied-on-write pages) associated with a particular hugetlbfs filesystem instance. Worse, they work by having callbacks to the hugetlbfs filesystem code from the low-level page handling code, in particular from free_huge_page(). This is a layering violation of itself, but more importantly, if the kernel does a get_user_pages() on hugepages (which can happen from KVM amongst others), then the free_huge_page() can be delayed until after the associated inode has already been freed. If an unmount occurs at the wrong time, even the hugetlbfs superblock where the "quota" limits are stored may have been freed. Andrew Barry proposed a patch to fix this by having hugepages, instead of storing a pointer to their address_space and reaching the superblock from there, had the hugepages store pointers directly to the superblock, bumping the reference count as appropriate to avoid it being freed. Andrew Morton rejected that version, however, on the grounds that it made the existing layering violation worse. This is a reworked version of Andrew's patch, which removes the extra, and some of the existing, layering violation. It works by introducing the concept of a hugepage "subpool" at the lower hugepage mm layer - that is a finite logical pool of hugepages to allocate from. hugetlbfs now creates a subpool for each filesystem instance with a page limit set, and a pointer to the subpool gets added to each allocated hugepage, instead of the address_space pointer used now. The subpool has its own lifetime and is only freed once all pages in it _and_ all other references to it (i.e. superblocks) are gone. subpools are optional - a NULL subpool pointer is taken by the code to mean that no subpool limits are in effect. Previous discussion of this bug found in: "Fix refcounting in hugetlbfs quota handling.". See: https://lkml.org/lkml/2011/8/11/28 or http://marc.info/?l=linux-mm&m=126928970510627&w=1 v2: Fixed a bug spotted by Hillf Danton, and removed the extra parameter to alloc_huge_page() - since it already takes the vma, it is not necessary. Signed-off-by: Andrew Barry <abarry@cray.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21hugetlb: cleanup hugetlb.hDavid Gibson
Make a couple of small cleanups to linux/include/hugetlb.h. The set_file_hugepages() function, which was not used anywhere is removed, and the hugetlbfs_config and hugetlbfs_inode_info structures with its HUGETLBFS_I helper function are moved into inode.c, the only place they were used. These structures are really linked to the hugetlbfs filesystem specifically not to hugepage mm handling in general, so they belong in the filesystem code not in a generally available header. It would be nice to move the hugetlbfs_sb_info (superblock) structure in there as well, but it's currently needed in a number of places via the hstate_vma() and hstate_inode(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Cc: Hugh Dickins <hughd@google.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Andrew Barry <abarry@cray.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21cpuset: mm: reduce large amounts of memory barrier related damage v3Mel Gorman
Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing cpuset's mems") wins a super prize for the largest number of memory barriers entered into fast paths for one commit. [get|put]_mems_allowed is incredibly heavy with pairs of full memory barriers inserted into a number of hot paths. This was detected while investigating at large page allocator slowdown introduced some time after 2.6.32. The largest portion of this overhead was shown by oprofile to be at an mfence introduced by this commit into the page allocator hot path. For extra style points, the commit introduced the use of yield() in an implementation of what looks like a spinning mutex. This patch replaces the full memory barriers on both read and write sides with a sequence counter with just read barriers on the fast path side. This is much cheaper on some architectures, including x86. The main bulk of the patch is the retry logic if the nodemask changes in a manner that can cause a false failure. While updating the nodemask, a check is made to see if a false failure is a risk. If it is, the sequence number gets bumped and parallel allocators will briefly stall while the nodemask update takes place. In a page fault test microbenchmark, oprofile samples from __alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The actual results were 3.3.0-rc3 3.3.0-rc3 rc3-vanilla nobarrier-v2r1 Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%) Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%) Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%) Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%) Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%) Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%) Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%) Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%) Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%) Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%) Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%) Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%) Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%) Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%) Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%) MMTests Statistics: duration Sys Time Running Test (seconds) 135.68 132.17 User+Sys Time Running Test (seconds) 164.2 160.13 Total Elapsed Time (seconds) 123.46 120.87 The overall improvement is small but the System CPU time is much improved and roughly in correlation to what oprofile reported (these performance figures are without profiling so skew is expected). The actual number of page faults is noticeably improved. For benchmarks like kernel builds, the overall benefit is marginal but the system CPU time is slightly reduced. To test the actual bug the commit fixed I opened two terminals. The first ran within a cpuset and continually ran a small program that faulted 100M of anonymous data. In a second window, the nodemask of the cpuset was continually randomised in a loop. Without the commit, the program would fail every so often (usually within 10 seconds) and obviously with the commit everything worked fine. With this patch applied, it also worked fine so the fix should be functionally equivalent. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Miao Xie <miaox@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21mm, memcg: pass charge order to oom killerDavid Rientjes
The oom killer typically displays the allocation order at the time of oom as a part of its diangostic messages (for global, cpuset, and mempolicy ooms). The memory controller may also pass the charge order to the oom killer so it can emit the same information. This is useful in determining how large the memory allocation is that triggered the oom killer. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21mm: drain percpu lru add/rotate page-vectors on cpu hot-unplugKonstantin Khlebnikov
This cpu hotplug hook was accidentally removed in commit 00a62ce91e55 ("mm: fix Committed_AS underflow on large NR_CPUS environment") The visible effect of this accident: some pages are borrowed in per-cpu page-vectors. Truncate can deal with it, but these pages cannot be reused while this cpu is offline. So this is like a temporary memory leak. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Eric B Munson <ebmunson@us.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21thp: allow a hwpoisoned head page to be put back to LRUDean Nelson
Andrea Arcangeli pointed out to me that a check in __memory_failure() which was intended to prevent THP tail pages from being checked for the absence of the PG_lru flag (something that is always the case), was also preventing THP head pages from being checked. A THP head page could actually benefit from the call to shake_page() by ending up being put back to a LRU, provided it had been waiting in a pagevec array. Andrea suggested that the "!PageTransCompound(p)" in the if-statement should be replaced by a "!PageTransTail(p)", thus allowing THP head pages to be checked and possibly shaken. Signed-off-by: Dean Nelson <dnelson@redhat.com> Cc: Jin Dongming <jin.dongming@np.css.fujitsu.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21mm, oom: force oom kill on sysrq+fDavid Rientjes
The oom killer chooses not to kill a thread if: - an eligible thread has already been oom killed and has yet to exit, and - an eligible thread is exiting but has yet to free all its memory and is not the thread attempting to currently allocate memory. SysRq+F manually invokes the global oom killer to kill a memory-hogging task. This is normally done as a last resort to free memory when no progress is being made or to test the oom killer itself. For both uses, we always want to kill a thread and never defer. This patch causes SysRq+F to always kill an eligible thread and can be used to force a kill even if another oom killed thread has failed to exit. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Pekka Enberg <penberg@kernel.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21procfs: mark thread stack correctly in proc/<pid>/mapsSiddhesh Poyarekar
Stack for a new thread is mapped by userspace code and passed via sys_clone. This memory is currently seen as anonymous in /proc/<pid>/maps, which makes it difficult to ascertain which mappings are being used for thread stacks. This patch uses the individual task stack pointers to determine which vmas are actually thread stacks. For a multithreaded program like the following: #include <pthread.h> void *thread_main(void *foo) { while(1); } int main() { pthread_t t; pthread_create(&t, NULL, thread_main, NULL); pthread_join(t, NULL); } proc/PID/maps looks like the following: 00400000-00401000 r-xp 00000000 fd:0a 3671804 /home/siddhesh/a.out 00600000-00601000 rw-p 00000000 fd:0a 3671804 /home/siddhesh/a.out 019ef000-01a10000 rw-p 00000000 00:00 0 [heap] 7f8a44491000-7f8a44492000 ---p 00000000 00:00 0 7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0 7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0 7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0 7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348 /lib64/ld-2.14.90.so 7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0 7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0 7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348 /lib64/ld-2.14.90.so 7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348 /lib64/ld-2.14.90.so 7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0 7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0 [stack] 7fff627ff000-7fff62800000 r-xp 00000000 00:00 0 [vdso] ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall] Here, one could guess that 7f8a44492000-7f8a44c92000 is a stack since the earlier vma that has no permissions (7f8a44e3d000-7f8a4503d000) but that is not always a reliable way to find out which vma is a thread stack. Also, /proc/PID/maps and /proc/PID/task/TID/maps has the same content. With this patch in place, /proc/PID/task/TID/maps are treated as 'maps as the task would see it' and hence, only the vma that that task uses as stack is marked as [stack]. All other 'stack' vmas are marked as anonymous memory. /proc/PID/maps acts as a thread group level view, where all thread stack vmas are marked as [stack:TID] where TID is the process ID of the task that uses that vma as stack, while the process stack is marked as [stack]. So /proc/PID/maps will look like this: 00400000-00401000 r-xp 00000000 fd:0a 3671804 /home/siddhesh/a.out 00600000-00601000 rw-p 00000000 fd:0a 3671804 /home/siddhesh/a.out 019ef000-01a10000 rw-p 00000000 00:00 0 [heap] 7f8a44491000-7f8a44492000 ---p 00000000 00:00 0 7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0 [stack:1442] 7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0 7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0 7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348 /lib64/ld-2.14.90.so 7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0 7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0 7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348 /lib64/ld-2.14.90.so 7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348 /lib64/ld-2.14.90.so 7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0 7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0 [stack] 7fff627ff000-7fff62800000 r-xp 00000000 00:00 0 [vdso] ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall] Thus marking all vmas that are used as stacks by the threads in the thread group along with the process stack. The task level maps will however like this: 00400000-00401000 r-xp 00000000 fd:0a 3671804 /home/siddhesh/a.out 00600000-00601000 rw-p 00000000 fd:0a 3671804 /home/siddhesh/a.out 019ef000-01a10000 rw-p 00000000 00:00 0 [heap] 7f8a44491000-7f8a44492000 ---p 00000000 00:00 0 7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0 [stack] 7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482 /lib64/libc-2.14.90.so 7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0 7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938 /lib64/libpthread-2.14.90.so 7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0 7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348 /lib64/ld-2.14.90.so 7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0 7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0 7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348 /lib64/ld-2.14.90.so 7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348 /lib64/ld-2.14.90.so 7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0 7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0 7fff627ff000-7fff62800000 r-xp 00000000 00:00 0 [vdso] ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall] where only the vma that is being used as a stack by *that* task is marked as [stack]. Analogous changes have been made to /proc/PID/smaps, /proc/PID/numa_maps, /proc/PID/task/TID/smaps and /proc/PID/task/TID/numa_maps. Relevant snippets from smaps and numa_maps: [siddhesh@localhost ~ ]$ pgrep a.out 1441 [siddhesh@localhost ~ ]$ cat /proc/1441/smaps | grep "\[stack" 7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0 [stack:1442] 7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0 [stack] [siddhesh@localhost ~ ]$ cat /proc/1441/task/1442/smaps | grep "\[stack" 7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0 [stack] [siddhesh@localhost ~ ]$ cat /proc/1441/task/1441/smaps | grep "\[stack" 7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0 [stack] [siddhesh@localhost ~ ]$ cat /proc/1441/numa_maps | grep "stack" 7f8a44492000 default stack:1442 anon=2 dirty=2 N0=2 7fff6273a000 default stack anon=3 dirty=3 N0=3 [siddhesh@localhost ~ ]$ cat /proc/1441/task/1442/numa_maps | grep "stack" 7f8a44492000 default stack anon=2 dirty=2 N0=2 [siddhesh@localhost ~ ]$ cat /proc/1441/task/1441/numa_maps | grep "stack" 7fff6273a000 default stack anon=3 dirty=3 N0=3 [akpm@linux-foundation.org: checkpatch fixes] [akpm@linux-foundation.org: fix build] Signed-off-by: Siddhesh Poyarekar <siddhesh.poyarekar@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jamie Lokier <jamie@shareable.org> Cc: Mike Frysinger <vapier@gentoo.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21rmap: remove __anon_vma_link() declarationXiao Guangrong
This declaration is not used anymore, remove it. Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21mm: replace PAGE_MIGRATION with IS_ENABLED(CONFIG_MIGRATION)Konstantin Khlebnikov
Since commit 2a11c8ea20bf ("kconfig: Introduce IS_ENABLED(), IS_BUILTIN() and IS_MODULE()") there is a generic grep-friendly method for checking config options in C expressions. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21pagemap: export KPF_THPNaoya Horiguchi
This flag shows that a given page is a subpage of a transparent hugepage. It helps us debug and test the kernel by showing physical address of thp. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21thp: optimize away unnecessary page table lockingNaoya Horiguchi
Currently when we check if we can handle thp as it is or we need to split it into regular sized pages, we hold page table lock prior to check whether a given pmd is mapping thp or not. Because of this, when it's not "huge pmd" we suffer from unnecessary lock/unlock overhead. To remove it, this patch introduces a optimized check function and replace several similar logics with it. [akpm@linux-foundation.org: checkpatch fixes] Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: David Rientjes <rientjes@google.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21vmscan: only defer compaction for failed order and higherRik van Riel
Currently a failed order-9 (transparent hugepage) compaction can lead to memory compaction being temporarily disabled for a memory zone. Even if we only need compaction for an order 2 allocation, eg. for jumbo frames networking. The fix is relatively straightforward: keep track of the highest order at which compaction is succeeding, and only defer compaction for orders at which compaction is failing. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21vmscan: kswapd carefully call compactionRik van Riel
With CONFIG_COMPACTION enabled, kswapd does not try to free contiguous free pages, even when it is woken for a higher order request. This could be bad for eg. jumbo frame network allocations, which are done from interrupt context and cannot compact memory themselves. Higher than before allocation failure rates in the network receive path have been observed in kernels with compaction enabled. Teach kswapd to defragment the memory zones in a node, but only if required and compaction is not deferred in a zone. [akpm@linux-foundation.org: reduce scope of zones_need_compaction] Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21mm: make swapin readahead skip over holesRik van Riel
Ever since abandoning the virtual scan of processes, for scalability reasons, swap space has been a little more fragmented than before. This can lead to the situation where a large memory user is killed, swap space ends up full of "holes" and swapin readahead is totally ineffective. On my home system, after killing a leaky firefox it took over an hour to page just under 2GB of memory back in, slowing the virtual machines down to a crawl. This patch makes swapin readahead simply skip over holes, instead of stopping at them. This allows the system to swap things back in at rates of several MB/second, instead of a few hundred kB/second. The checks done in valid_swaphandles are already done in read_swap_cache_async as well, allowing us to remove a fair amount of code. [akpm@linux-foundation.org: fix it for page_cluster >= 32] Signed-off-by: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Adrian Drzewiecki <z@drze.net> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21mm: make get_mm_counter static-inlineKonstantin Khlebnikov
Make get_mm_counter() always static inline, it is simple enough for that. And remove unused set_mm_counter() bloat-o-meter: add/remove: 0/1 grow/shrink: 4/12 up/down: 99/-341 (-242) function old new delta try_to_unmap_one 886 952 +66 sys_remap_file_pages 1214 1230 +16 dup_mm 1684 1700 +16 do_exit 2277 2278 +1 zap_page_range 208 205 -3 unmap_region 304 296 -8 static.oom_kill_process 554 546 -8 try_to_unmap_file 1716 1700 -16 getrusage 925 909 -16 flush_old_exec 1704 1688 -16 static.dump_header 416 390 -26 acct_update_integrals 218 187 -31 do_task_stat 2986 2954 -32 get_mm_counter 34 - -34 xacct_add_tsk 371 334 -37 task_statm 172 118 -54 task_mem 383 323 -60 try_to_unmap_one() grows because update_hiwater_rss() now completely inline. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read modeAndrea Arcangeli
In some cases it may happen that pmd_none_or_clear_bad() is called with the mmap_sem hold in read mode. In those cases the huge page faults can allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a false positive from pmd_bad() that will not like to see a pmd materializing as trans huge. It's not khugepaged causing the problem, khugepaged holds the mmap_sem in write mode (and all those sites must hold the mmap_sem in read mode to prevent pagetables to go away from under them, during code review it seems vm86 mode on 32bit kernels requires that too unless it's restricted to 1 thread per process or UP builds). The race is only with the huge pagefaults that can convert a pmd_none() into a pmd_trans_huge(). Effectively all these pmd_none_or_clear_bad() sites running with mmap_sem in read mode are somewhat speculative with the page faults, and the result is always undefined when they run simultaneously. This is probably why it wasn't common to run into this. For example if the madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page fault, the hugepage will not be zapped, if the page fault runs first it will be zapped. Altering pmd_bad() not to error out if it finds hugepmds won't be enough to fix this, because zap_pmd_range would then proceed to call zap_pte_range (which would be incorrect if the pmd become a pmd_trans_huge()). The simplest way to fix this is to read the pmd in the local stack (regardless of what we read, no need of actual CPU barriers, only compiler barrier needed), and be sure it is not changing under the code that computes its value. Even if the real pmd is changing under the value we hold on the stack, we don't care. If we actually end up in zap_pte_range it means the pmd was not none already and it was not huge, and it can't become huge from under us (khugepaged locking explained above). All we need is to enforce that there is no way anymore that in a code path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad can run into a hugepmd. The overhead of a barrier() is just a compiler tweak and should not be measurable (I only added it for THP builds). I don't exclude different compiler versions may have prevented the race too by caching the value of *pmd on the stack (that hasn't been verified, but it wouldn't be impossible considering pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines and there's no external function called in between pmd_trans_huge and pmd_none_or_clear_bad). if (pmd_trans_huge(*pmd)) { if (next-addr != HPAGE_PMD_SIZE) { VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem)); split_huge_page_pmd(vma->vm_mm, pmd); } else if (zap_huge_pmd(tlb, vma, pmd, addr)) continue; /* fall through */ } if (pmd_none_or_clear_bad(pmd)) Because this race condition could be exercised without special privileges this was reported in CVE-2012-1179. The race was identified and fully explained by Ulrich who debugged it. I'm quoting his accurate explanation below, for reference. ====== start quote ======= mapcount 0 page_mapcount 1 kernel BUG at mm/huge_memory.c:1384! At some point prior to the panic, a "bad pmd ..." message similar to the following is logged on the console: mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7). The "bad pmd ..." message is logged by pmd_clear_bad() before it clears the page's PMD table entry. 143 void pmd_clear_bad(pmd_t *pmd) 144 { -> 145 pmd_ERROR(*pmd); 146 pmd_clear(pmd); 147 } After the PMD table entry has been cleared, there is an inconsistency between the actual number of PMD table entries that are mapping the page and the page's map count (_mapcount field in struct page). When the page is subsequently reclaimed, __split_huge_page() detects this inconsistency. 1381 if (mapcount != page_mapcount(page)) 1382 printk(KERN_ERR "mapcount %d page_mapcount %d\n", 1383 mapcount, page_mapcount(page)); -> 1384 BUG_ON(mapcount != page_mapcount(page)); The root cause of the problem is a race of two threads in a multithreaded process. Thread B incurs a page fault on a virtual address that has never been accessed (PMD entry is zero) while Thread A is executing an madvise() system call on a virtual address within the same 2 MB (huge page) range. virtual address space .---------------------. | | | | .-|---------------------| | | | | | |<-- B(fault) | | | 2 MB | |/////////////////////|-. huge < |/////////////////////| > A(range) page | |/////////////////////|-' | | | | | | '-|---------------------| | | | | '---------------------' - Thread A is executing an madvise(..., MADV_DONTNEED) system call on the virtual address range "A(range)" shown in the picture. sys_madvise // Acquire the semaphore in shared mode. down_read(&current->mm->mmap_sem) ... madvise_vma switch (behavior) case MADV_DONTNEED: madvise_dontneed zap_page_range unmap_vmas unmap_page_range zap_pud_range zap_pmd_range // // Assume that this huge page has never been accessed. // I.e. content of the PMD entry is zero (not mapped). // if (pmd_trans_huge(*pmd)) { // We don't get here due to the above assumption. } // // Assume that Thread B incurred a page fault and .---------> // sneaks in here as shown below. | // | if (pmd_none_or_clear_bad(pmd)) | { | if (unlikely(pmd_bad(*pmd))) | pmd_clear_bad | { | pmd_ERROR | // Log "bad pmd ..." message here. | pmd_clear | // Clear the page's PMD entry. | // Thread B incremented the map count | // in page_add_new_anon_rmap(), but | // now the page is no longer mapped | // by a PMD entry (-> inconsistency). | } | } | v - Thread B is handling a page fault on virtual address "B(fault)" shown in the picture. ... do_page_fault __do_page_fault // Acquire the semaphore in shared mode. down_read_trylock(&mm->mmap_sem) ... handle_mm_fault if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) // We get here due to the above assumption (PMD entry is zero). do_huge_pmd_anonymous_page alloc_hugepage_vma // Allocate a new transparent huge page here. ... __do_huge_pmd_anonymous_page ... spin_lock(&mm->page_table_lock) ... page_add_new_anon_rmap // Here we increment the page's map count (starts at -1). atomic_set(&page->_mapcount, 0) set_pmd_at // Here we set the page's PMD entry which will be cleared // when Thread A calls pmd_clear_bad(). ... spin_unlock(&mm->page_table_lock) The mmap_sem does not prevent the race because both threads are acquiring it in shared mode (down_read). Thread B holds the page_table_lock while the page's map count and PMD table entry are updated. However, Thread A does not synchronize on that lock. ====== end quote ======= [akpm@linux-foundation.org: checkpatch fixes] Reported-by: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Jones <davej@redhat.com> Acked-by: Larry Woodman <lwoodman@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: <stable@vger.kernel.org> [2.6.38+] Cc: Mark Salter <msalter@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile 1 from Al Viro: "This is _not_ all; in particular, Miklos' and Jan's stuff is not there yet." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (64 commits) ext4: initialization of ext4_li_mtx needs to be done earlier debugfs-related mode_t whack-a-mole hfsplus: add an ioctl to bless files hfsplus: change finder_info to u32 hfsplus: initialise userflags qnx4: new helper - try_extent() qnx4: get rid of qnx4_bread/qnx4_getblk take removal of PF_FORKNOEXEC to flush_old_exec() trim includes in inode.c um: uml_dup_mmap() relies on ->mmap_sem being held, but activate_mm() doesn't hold it um: embed ->stub_pages[] into mmu_context gadgetfs: list_for_each_safe() misuse ocfs2: fix leaks on failure exits in module_init ecryptfs: make register_filesystem() the last potential failure exit ntfs: forgets to unregister sysctls on register_filesystem() failure logfs: missing cleanup on register_filesystem() failure jfs: mising cleanup on register_filesystem() failure make configfs_pin_fs() return root dentry on success configfs: configfs_create_dir() has parent dentry in dentry->d_parent configfs: sanitize configfs_create() ...
2012-03-21Merge branch 'vm' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull munmap/truncate race fixes from Al Viro: "Fixes for racy use of unmap_vmas() on truncate-related codepaths" * 'vm' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: VM: make zap_page_range() callers that act on a single VMA use separate helper VM: make unmap_vmas() return void VM: don't bother with feeding upper limit to tlb_finish_mmu() in exit_mmap() VM: make zap_page_range() return void VM: can't go through the inner loop in unmap_vmas() more than once... VM: unmap_page_range() can return void
2012-03-21Merge branch 'next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull security subsystem updates for 3.4 from James Morris: "The main addition here is the new Yama security module from Kees Cook, which was discussed at the Linux Security Summit last year. Its purpose is to collect miscellaneous DAC security enhancements in one place. This also marks a departure in policy for LSM modules, which were previously limited to being standalone access control systems. Chromium OS is using Yama, and I believe there are plans for Ubuntu, at least. This patchset also includes maintenance updates for AppArmor, TOMOYO and others." Fix trivial conflict in <net/sock.h> due to the jumo_label->static_key rename. * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (38 commits) AppArmor: Fix location of const qualifier on generated string tables TOMOYO: Return error if fails to delete a domain AppArmor: add const qualifiers to string arrays AppArmor: Add ability to load extended policy TOMOYO: Return appropriate value to poll(). AppArmor: Move path failure information into aa_get_name and rename AppArmor: Update dfa matching routines. AppArmor: Minor cleanup of d_namespace_path to consolidate error handling AppArmor: Retrieve the dentry_path for error reporting when path lookup fails AppArmor: Add const qualifiers to generated string tables AppArmor: Fix oops in policy unpack auditing AppArmor: Fix error returned when a path lookup is disconnected KEYS: testing wrong bit for KEY_FLAG_REVOKED TOMOYO: Fix mount flags checking order. security: fix ima kconfig warning AppArmor: Fix the error case for chroot relative path name lookup AppArmor: fix mapping of META_READ to audit and quiet flags AppArmor: Fix underflow in xindex calculation AppArmor: Fix dropping of allowed operations that are force audited AppArmor: Add mising end of structure test to caps unpacking ...
2012-03-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6Linus Torvalds
Pull crypto update from Herbert Xu: "* sha512 bug fixes (already in your tree). * SHA224/SHA384 AEAD support in caam. * X86-64 optimised version of Camellia. * Tegra AES support. * Bulk algorithm registration interface to make driver registration easier. * padata race fixes. * Misc fixes." * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (31 commits) padata: Fix race on sequence number wrap padata: Fix race in the serialization path crypto: camellia - add assembler implementation for x86_64 crypto: camellia - rename camellia.c to camellia_generic.c crypto: camellia - fix checkpatch warnings crypto: camellia - rename camellia module to camellia_generic crypto: tcrypt - add more camellia tests crypto: testmgr - add more camellia test vectors crypto: camellia - simplify key setup and CAMELLIA_ROUNDSM macro crypto: twofish-x86_64/i586 - set alignmask to zero crypto: blowfish-x86_64 - set alignmask to zero crypto: serpent-sse2 - combine ablk_*_init functions crypto: blowfish-x86_64 - use crypto_[un]register_algs crypto: twofish-x86_64-3way - use crypto_[un]register_algs crypto: serpent-sse2 - use crypto_[un]register_algs crypto: serpent-sse2 - remove dead code from serpent_sse2_glue.c::serpent_sse2_init() crypto: twofish-x86 - Remove dead code from twofish_glue_3way.c::init() crypto: In crypto_add_alg(), 'exact' wants to be initialized to 0 crypto: caam - fix gcc 4.6 warning crypto: Add bulk algorithm registration interface ...