summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2011-10-31vmscan: fix shrinker callback bug in fs/super.cMikulas Patocka
The callback must not return -1 when nr_to_scan is zero. Fix the bug in fs/super.c and add this requirement to the callback specification. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31mm: neaten warn_alloc_failedJoe Perches
Add __attribute__((format (printf...) to the function to validate format and arguments. Use vsprintf extension %pV to avoid any possible message interleaving. Coalesce format string. Convert printks/pr_warning to pr_warn. [akpm@linux-foundation.org: use the __printf() macro] Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31include/asm-generic/page.h: calculate virt_to_page and page_to_virt via ↵Sonic Zhang
predefined macro On NOMMU architectures, if physical memory doesn't start from 0, ARCH_PFN_OFFSET is defined to generate page index in mem_map array. Because virtual address is equal to physical address, PAGE_OFFSET is always 0. virt_to_page and page_to_virt should not index page by PAGE_OFFSET directly. Signed-off-by: Sonic Zhang <sonic.zhang@analog.com> Cc: Greg Ungerer <gerg@snapgear.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31thp: mremap support and TLB optimizationAndrea Arcangeli
This adds THP support to mremap (decreases the number of split_huge_page() calls). Here are also some benchmarks with a proggy like this: === #define _GNU_SOURCE #include <sys/mman.h> #include <stdlib.h> #include <stdio.h> #include <string.h> #include <sys/time.h> #define SIZE (5UL*1024*1024*1024) int main() { static struct timeval oldstamp, newstamp; long diffsec; char *p, *p2, *p3, *p4; if (posix_memalign((void **)&p, 2*1024*1024, SIZE)) perror("memalign"), exit(1); if (posix_memalign((void **)&p2, 2*1024*1024, SIZE)) perror("memalign"), exit(1); if (posix_memalign((void **)&p3, 2*1024*1024, 4096)) perror("memalign"), exit(1); memset(p, 0xff, SIZE); memset(p2, 0xff, SIZE); memset(p3, 0x77, 4096); gettimeofday(&oldstamp, NULL); p4 = mremap(p, SIZE, SIZE, MREMAP_FIXED|MREMAP_MAYMOVE, p3); gettimeofday(&newstamp, NULL); diffsec = newstamp.tv_sec - oldstamp.tv_sec; diffsec = newstamp.tv_usec - oldstamp.tv_usec + 1000000 * diffsec; printf("usec %ld\n", diffsec); if (p == MAP_FAILED || p4 != p3) //if (p == MAP_FAILED) perror("mremap"), exit(1); if (memcmp(p4, p2, SIZE)) printf("mremap bug\n"), exit(1); printf("ok\n"); return 0; } === THP on Performance counter stats for './largepage13' (3 runs): 69195836 dTLB-loads ( +- 3.546% ) (scaled from 50.30%) 60708 dTLB-load-misses ( +- 11.776% ) (scaled from 52.62%) 676266476 dTLB-stores ( +- 5.654% ) (scaled from 69.54%) 29856 dTLB-store-misses ( +- 4.081% ) (scaled from 89.22%) 1055848782 iTLB-loads ( +- 4.526% ) (scaled from 80.18%) 8689 iTLB-load-misses ( +- 2.987% ) (scaled from 58.20%) 7.314454164 seconds time elapsed ( +- 0.023% ) THP off Performance counter stats for './largepage13' (3 runs): 1967379311 dTLB-loads ( +- 0.506% ) (scaled from 60.59%) 9238687 dTLB-load-misses ( +- 22.547% ) (scaled from 61.87%) 2014239444 dTLB-stores ( +- 0.692% ) (scaled from 60.40%) 3312335 dTLB-store-misses ( +- 7.304% ) (scaled from 67.60%) 6764372065 iTLB-loads ( +- 0.925% ) (scaled from 79.00%) 8202 iTLB-load-misses ( +- 0.475% ) (scaled from 70.55%) 9.693655243 seconds time elapsed ( +- 0.069% ) grep thp /proc/vmstat thp_fault_alloc 35849 thp_fault_fallback 0 thp_collapse_alloc 3 thp_collapse_alloc_failed 0 thp_split 0 thp_split 0 confirms no thp split despite plenty of hugepages allocated. The measurement of only the mremap time (so excluding the 3 long memset and final long 10GB memory accessing memcmp): THP on usec 14824 usec 14862 usec 14859 THP off usec 256416 usec 255981 usec 255847 With an older kernel without the mremap optimizations (the below patch optimizes the non THP version too). THP on usec 392107 usec 390237 usec 404124 THP off usec 444294 usec 445237 usec 445820 I guess with a threaded program that sends more IPI on large SMP it'd create an even larger difference. All debug options are off except DEBUG_VM to avoid skewing the results. The only problem for native 2M mremap like it happens above both the source and destination address must be 2M aligned or the hugepmd can't be moved without a split but that is an hardware limitation. [akpm@linux-foundation.org: coding-style nitpicking] Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Johannes Weiner <jweiner@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31memblock: add memblock_start_of_DRAM()Sam Ravnborg
SPARC32 require access to the start address. Add a new helper memblock_start_of_DRAM() to give access to the address of the first memblock - which contains the lowest address. The awkward name was chosen to match the already present memblock_end_of_DRAM(). Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Yinghai Lu <yinghai@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31mm: avoid null pointer access in vm_struct via /proc/vmallocinfoMitsuo Hayasaka
The /proc/vmallocinfo shows information about vmalloc allocations in vmlist that is a linklist of vm_struct. It, however, may access pages field of vm_struct where a page was not allocated. This results in a null pointer access and leads to a kernel panic. Why this happens: In __vmalloc_node_range() called from vmalloc(), newly allocated vm_struct is added to vmlist at __get_vm_area_node() and then, some fields of vm_struct such as nr_pages and pages are set at __vmalloc_area_node(). In other words, it is added to vmlist before it is fully initialized. At the same time, when the /proc/vmallocinfo is read, it accesses the pages field of vm_struct according to the nr_pages field at show_numa_info(). Thus, a null pointer access happens. The patch adds the newly allocated vm_struct to the vmlist *after* it is fully initialized. So, it can avoid accessing the pages field with unallocated page when show_numa_info() is called. Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Cc: Namhyung Kim <namhyung@gmail.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: <stable@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31lib/string.c: introduce memchr_inv()Akinobu Mita
memchr_inv() is mainly used to check whether the whole buffer is filled with just a specified byte. The function name and prototype are stolen from logfs and the implementation is from SLUB. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Matt Mackall <mpm@selenic.com> Acked-by: Joern Engel <joern@logfs.org> Cc: Marcin Slusarz <marcin.slusarz@gmail.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31mm: vmscan: immediately reclaim end-of-LRU dirty pages when writeback completesMel Gorman
When direct reclaim encounters a dirty page, it gets recycled around the LRU for another cycle. This patch marks the page PageReclaim similar to deactivate_page() so that the page gets reclaimed almost immediately after the page gets cleaned. This is to avoid reclaiming clean pages that are younger than a dirty page encountered at the end of the LRU that might have been something like a use-once page. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <jweiner@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Alex Elder <aelder@sgi.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31mm: vmscan: do not writeback filesystem pages in direct reclaimMel Gorman
Testing from the XFS folk revealed that there is still too much I/O from the end of the LRU in kswapd. Previously it was considered acceptable by VM people for a small number of pages to be written back from reclaim with testing generally showing about 0.3% of pages reclaimed were written back (higher if memory was low). That writing back a small number of pages is ok has been heavily disputed for quite some time and Dave Chinner explained it well; It doesn't have to be a very high number to be a problem. IO is orders of magnitude slower than the CPU time it takes to flush a page, so the cost of making a bad flush decision is very high. And single page writeback from the LRU is almost always a bad flush decision. To complicate matters, filesystems respond very differently to requests from reclaim according to Christoph Hellwig; xfs tries to write it back if the requester is kswapd ext4 ignores the request if it's a delayed allocation btrfs ignores the request As a result, each filesystem has different performance characteristics when under memory pressure and there are many pages being dirtied. In some cases, the request is ignored entirely so the VM cannot depend on the IO being dispatched. The objective of this series is to reduce writing of filesystem-backed pages from reclaim, play nicely with writeback that is already in progress and throttle reclaim appropriately when writeback pages are encountered. The assumption is that the flushers will always write pages faster than if reclaim issues the IO. A secondary goal is to avoid the problem whereby direct reclaim splices two potentially deep call stacks together. There is a potential new problem as reclaim has less control over how long before a page in a particularly zone or container is cleaned and direct reclaimers depend on kswapd or flusher threads to do the necessary work. However, as filesystems sometimes ignore direct reclaim requests already, it is not expected to be a serious issue. Patch 1 disables writeback of filesystem pages from direct reclaim entirely. Anonymous pages are still written. Patch 2 removes dead code in lumpy reclaim as it is no longer able to synchronously write pages. This hurts lumpy reclaim but there is an expectation that compaction is used for hugepage allocations these days and lumpy reclaim's days are numbered. Patches 3-4 add warnings to XFS and ext4 if called from direct reclaim. With patch 1, this "never happens" and is intended to catch regressions in this logic in the future. Patch 5 disables writeback of filesystem pages from kswapd unless the priority is raised to the point where kswapd is considered to be in trouble. Patch 6 throttles reclaimers if too many dirty pages are being encountered and the zones or backing devices are congested. Patch 7 invalidates dirty pages found at the end of the LRU so they are reclaimed quickly after being written back rather than waiting for a reclaimer to find them I consider this series to be orthogonal to the writeback work but it is worth noting that the writeback work affects the viability of patch 8 in particular. I tested this on ext4 and xfs using fs_mark, a simple writeback test based on dd and a micro benchmark that does a streaming write to a large mapping (exercises use-once LRU logic) followed by streaming writes to a mix of anonymous and file-backed mappings. The command line for fs_mark when botted with 512M looked something like ./fs_mark -d /tmp/fsmark-2676 -D 100 -N 150 -n 150 -L 25 -t 1 -S0 -s 10485760 The number of files was adjusted depending on the amount of available memory so that the files created was about 3xRAM. For multiple threads, the -d switch is specified multiple times. The test machine is x86-64 with an older generation of AMD processor with 4 cores. The underlying storage was 4 disks configured as RAID-0 as this was the best configuration of storage I had available. Swap is on a separate disk. Dirty ratio was tuned to 40% instead of the default of 20%. Testing was run with and without monitors to both verify that the patches were operating as expected and that any performance gain was real and not due to interference from monitors. Here is a summary of results based on testing XFS. 512M1P-xfs Files/s mean 32.69 ( 0.00%) 34.44 ( 5.08%) 512M1P-xfs Elapsed Time fsmark 51.41 48.29 512M1P-xfs Elapsed Time simple-wb 114.09 108.61 512M1P-xfs Elapsed Time mmap-strm 113.46 109.34 512M1P-xfs Kswapd efficiency fsmark 62% 63% 512M1P-xfs Kswapd efficiency simple-wb 56% 61% 512M1P-xfs Kswapd efficiency mmap-strm 44% 42% 512M-xfs Files/s mean 30.78 ( 0.00%) 35.94 (14.36%) 512M-xfs Elapsed Time fsmark 56.08 48.90 512M-xfs Elapsed Time simple-wb 112.22 98.13 512M-xfs Elapsed Time mmap-strm 219.15 196.67 512M-xfs Kswapd efficiency fsmark 54% 56% 512M-xfs Kswapd efficiency simple-wb 54% 55% 512M-xfs Kswapd efficiency mmap-strm 45% 44% 512M-4X-xfs Files/s mean 30.31 ( 0.00%) 33.33 ( 9.06%) 512M-4X-xfs Elapsed Time fsmark 63.26 55.88 512M-4X-xfs Elapsed Time simple-wb 100.90 90.25 512M-4X-xfs Elapsed Time mmap-strm 261.73 255.38 512M-4X-xfs Kswapd efficiency fsmark 49% 50% 512M-4X-xfs Kswapd efficiency simple-wb 54% 56% 512M-4X-xfs Kswapd efficiency mmap-strm 37% 36% 512M-16X-xfs Files/s mean 60.89 ( 0.00%) 65.22 ( 6.64%) 512M-16X-xfs Elapsed Time fsmark 67.47 58.25 512M-16X-xfs Elapsed Time simple-wb 103.22 90.89 512M-16X-xfs Elapsed Time mmap-strm 237.09 198.82 512M-16X-xfs Kswapd efficiency fsmark 45% 46% 512M-16X-xfs Kswapd efficiency simple-wb 53% 55% 512M-16X-xfs Kswapd efficiency mmap-strm 33% 33% Up until 512-4X, the FSmark improvements were statistically significant. For the 4X and 16X tests the results were within standard deviations but just barely. The time to completion for all tests is improved which is an important result. In general, kswapd efficiency is not affected by skipping dirty pages. 1024M1P-xfs Files/s mean 39.09 ( 0.00%) 41.15 ( 5.01%) 1024M1P-xfs Elapsed Time fsmark 84.14 80.41 1024M1P-xfs Elapsed Time simple-wb 210.77 184.78 1024M1P-xfs Elapsed Time mmap-strm 162.00 160.34 1024M1P-xfs Kswapd efficiency fsmark 69% 75% 1024M1P-xfs Kswapd efficiency simple-wb 71% 77% 1024M1P-xfs Kswapd efficiency mmap-strm 43% 44% 1024M-xfs Files/s mean 35.45 ( 0.00%) 37.00 ( 4.19%) 1024M-xfs Elapsed Time fsmark 94.59 91.00 1024M-xfs Elapsed Time simple-wb 229.84 195.08 1024M-xfs Elapsed Time mmap-strm 405.38 440.29 1024M-xfs Kswapd efficiency fsmark 79% 71% 1024M-xfs Kswapd efficiency simple-wb 74% 74% 1024M-xfs Kswapd efficiency mmap-strm 39% 42% 1024M-4X-xfs Files/s mean 32.63 ( 0.00%) 35.05 ( 6.90%) 1024M-4X-xfs Elapsed Time fsmark 103.33 97.74 1024M-4X-xfs Elapsed Time simple-wb 204.48 178.57 1024M-4X-xfs Elapsed Time mmap-strm 528.38 511.88 1024M-4X-xfs Kswapd efficiency fsmark 81% 70% 1024M-4X-xfs Kswapd efficiency simple-wb 73% 72% 1024M-4X-xfs Kswapd efficiency mmap-strm 39% 38% 1024M-16X-xfs Files/s mean 42.65 ( 0.00%) 42.97 ( 0.74%) 1024M-16X-xfs Elapsed Time fsmark 103.11 99.11 1024M-16X-xfs Elapsed Time simple-wb 200.83 178.24 1024M-16X-xfs Elapsed Time mmap-strm 397.35 459.82 1024M-16X-xfs Kswapd efficiency fsmark 84% 69% 1024M-16X-xfs Kswapd efficiency simple-wb 74% 73% 1024M-16X-xfs Kswapd efficiency mmap-strm 39% 40% All FSMark tests up to 16X had statistically significant improvements. For the most part, tests are completing faster with the exception of the streaming writes to a mixture of anonymous and file-backed mappings which were slower in two cases In the cases where the mmap-strm tests were slower, there was more swapping due to dirty pages being skipped. The number of additional pages swapped is almost identical to the fewer number of pages written from reclaim. In other words, roughly the same number of pages were reclaimed but swapping was slower. As the test is a bit unrealistic and stresses memory heavily, the small shift is acceptable. 4608M1P-xfs Files/s mean 29.75 ( 0.00%) 30.96 ( 3.91%) 4608M1P-xfs Elapsed Time fsmark 512.01 492.15 4608M1P-xfs Elapsed Time simple-wb 618.18 566.24 4608M1P-xfs Elapsed Time mmap-strm 488.05 465.07 4608M1P-xfs Kswapd efficiency fsmark 93% 86% 4608M1P-xfs Kswapd efficiency simple-wb 88% 84% 4608M1P-xfs Kswapd efficiency mmap-strm 46% 45% 4608M-xfs Files/s mean 27.60 ( 0.00%) 28.85 ( 4.33%) 4608M-xfs Elapsed Time fsmark 555.96 532.34 4608M-xfs Elapsed Time simple-wb 659.72 571.85 4608M-xfs Elapsed Time mmap-strm 1082.57 1146.38 4608M-xfs Kswapd efficiency fsmark 89% 91% 4608M-xfs Kswapd efficiency simple-wb 88% 82% 4608M-xfs Kswapd efficiency mmap-strm 48% 46% 4608M-4X-xfs Files/s mean 26.00 ( 0.00%) 27.47 ( 5.35%) 4608M-4X-xfs Elapsed Time fsmark 592.91 564.00 4608M-4X-xfs Elapsed Time simple-wb 616.65 575.07 4608M-4X-xfs Elapsed Time mmap-strm 1773.02 1631.53 4608M-4X-xfs Kswapd efficiency fsmark 90% 94% 4608M-4X-xfs Kswapd efficiency simple-wb 87% 82% 4608M-4X-xfs Kswapd efficiency mmap-strm 43% 43% 4608M-16X-xfs Files/s mean 26.07 ( 0.00%) 26.42 ( 1.32%) 4608M-16X-xfs Elapsed Time fsmark 602.69 585.78 4608M-16X-xfs Elapsed Time simple-wb 606.60 573.81 4608M-16X-xfs Elapsed Time mmap-strm 1549.75 1441.86 4608M-16X-xfs Kswapd efficiency fsmark 98% 98% 4608M-16X-xfs Kswapd efficiency simple-wb 88% 82% 4608M-16X-xfs Kswapd efficiency mmap-strm 44% 42% Unlike the other tests, the fsmark results are not statistically significant but the min and max times are both improved and for the most part, tests completed faster. There are other indications that this is an improvement as well. For example, in the vast majority of cases, there were fewer pages scanned by direct reclaim implying in many cases that stalls due to direct reclaim are reduced. KSwapd is scanning more due to skipping dirty pages which is unfortunate but the CPU usage is still acceptable In an earlier set of tests, I used blktrace and in almost all cases throughput throughout the entire test was higher. However, I ended up discarding those results as recording blktrace data was too heavy for my liking. On a laptop, I plugged in a USB stick and ran a similar tests of tests using it as backing storage. A desktop environment was running and for the entire duration of the tests, firefox and gnome terminal were launching and exiting to vaguely simulate a user. 1024M-xfs Files/s mean 0.41 ( 0.00%) 0.44 ( 6.82%) 1024M-xfs Elapsed Time fsmark 2053.52 1641.03 1024M-xfs Elapsed Time simple-wb 1229.53 768.05 1024M-xfs Elapsed Time mmap-strm 4126.44 4597.03 1024M-xfs Kswapd efficiency fsmark 84% 85% 1024M-xfs Kswapd efficiency simple-wb 92% 81% 1024M-xfs Kswapd efficiency mmap-strm 60% 51% 1024M-xfs Avg wait ms fsmark 5404.53 4473.87 1024M-xfs Avg wait ms simple-wb 2541.35 1453.54 1024M-xfs Avg wait ms mmap-strm 3400.25 3852.53 The mmap-strm results were hurt because firefox launching had a tendency to push the test out of memory. On the postive side, firefox launched marginally faster with the patches applied. Time to completion for many tests was faster but more importantly - the "Avg wait" time as measured by iostat was far lower implying the system would be more responsive. It was also the case that "Avg wait ms" on the root filesystem was lower. I tested it manually and while the system felt slightly more responsive while copying data to a USB stick, it was marginal enough that it could be my imagination. This patch: do not writeback filesystem pages in direct reclaim. When kswapd is failing to keep zones above the min watermark, a process will enter direct reclaim in the same manner kswapd does. If a dirty page is encountered during the scan, this page is written to backing storage using mapping->writepage. This causes two problems. First, it can result in very deep call stacks, particularly if the target storage or filesystem are complex. Some filesystems ignore write requests from direct reclaim as a result. The second is that a single-page flush is inefficient in terms of IO. While there is an expectation that the elevator will merge requests, this does not always happen. Quoting Christoph Hellwig; The elevator has a relatively small window it can operate on, and can never fix up a bad large scale writeback pattern. This patch prevents direct reclaim writing back filesystem pages by checking if current is kswapd. Anonymous pages are still written to swap as there is not the equivalent of a flusher thread for anonymous pages. If the dirty pages cannot be written back, they are placed back on the LRU lists. There is now a direct dependency on dirty page balancing to prevent too many pages in the system being dirtied which would prevent reclaim making forward progress. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Alex Elder <aelder@sgi.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31mm: add comments to explain mm_struct fieldsChristoph Lameter
Add comments to explain the page statistics field in the mm_struct. [akpm@linux-foundation.org: add missing ;] Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31mm: distinguish between mlocked and pinned pagesChristoph Lameter
Some kernel components pin user space memory (infiniband and perf) (by increasing the page count) and account that memory as "mlocked". The difference between mlocking and pinning is: A. mlocked pages are marked with PG_mlocked and are exempt from swapping. Page migration may move them around though. They are kept on a special LRU list. B. Pinned pages cannot be moved because something needs to directly access physical memory. They may not be on any LRU list. I recently saw an mlockalled process where mm->locked_vm became bigger than the virtual size of the process (!) because some memory was accounted for twice: Once when the page was mlocked and once when the Infiniband layer increased the refcount because it needt to pin the RDMA memory. This patch introduces a separate counter for pinned pages and accounts them seperately. Signed-off-by: Christoph Lameter <cl@linux.com> Cc: Mike Marciniszyn <infinipath@qlogic.com> Cc: Roland Dreier <roland@kernel.org> Cc: Sean Hefty <sean.hefty@intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31oom: fix race while temporarily setting current's oom_score_adjDavid Rientjes
test_set_oom_score_adj() was introduced in 72788c385604 ("oom: replace PF_OOM_ORIGIN with toggling oom_score_adj") to temporarily elevate current's oom_score_adj for ksm and swapoff without requiring an additional per-process flag. Using that function to both set oom_score_adj to OOM_SCORE_ADJ_MAX and then reinstate the previous value is racy since it's possible that userspace can set the value to something else itself before the old value is reinstated. That results in userspace setting current's oom_score_adj to a different value and then the kernel immediately setting it back to its previous value without notification. To fix this, a new compare_swap_oom_score_adj() function is introduced with the same semantics as the compare and swap CAS instruction, or CMPXCHG on x86. It is used to reinstate the previous value of oom_score_adj if and only if the present value is the same as the old value. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ying Han <yinghan@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31oom: remove oom_disable_countDavid Rientjes
This removes mm->oom_disable_count entirely since it's unnecessary and currently buggy. The counter was intended to be per-process but it's currently decremented in the exit path for each thread that exits, causing it to underflow. The count was originally intended to prevent oom killing threads that share memory with threads that cannot be killed since it doesn't lead to future memory freeing. The counter could be fixed to represent all threads sharing the same mm, but it's better to remove the count since: - it is possible that the OOM_DISABLE thread sharing memory with the victim is waiting on that thread to exit and will actually cause future memory freeing, and - there is no guarantee that a thread is disabled from oom killing just because another thread sharing its mm is oom disabled. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Ying Han <yinghan@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31mm: zone_reclaim: make isolate_lru_page() filter-awareMinchan Kim
In __zone_reclaim case, we don't want to shrink mapped page. Nonetheless, we have isolated mapped page and re-add it into LRU's head. It's unnecessary CPU overhead and makes LRU churning. Of course, when we isolate the page, the page might be mapped but when we try to migrate the page, the page would be not mapped. So it could be migrated. But race is rare and although it happens, it's no big deal. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31mm: compaction: make isolate_lru_page() filter-awareMinchan Kim
In async mode, compaction doesn't migrate dirty or writeback pages. So, it's meaningless to pick the page and re-add it to lru list. Of course, when we isolate the page in compaction, the page might be dirty or writeback but when we try to migrate the page, the page would be not dirty, writeback. So it could be migrated. But it's very unlikely as isolate and migration cycle is much faster than writeout. So, this patch helps cpu overhead and prevent unnecessary LRU churning. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31mm: change isolate mode from #define to bitwise typeMinchan Kim
Change ISOLATE_XXX macro with bitwise isolate_mode_t type. Normally, macro isn't recommended as it's type-unsafe and making debugging harder as symbol cannot be passed throught to the debugger. Quote from Johannes " Hmm, it would probably be cleaner to fully convert the isolation mode into independent flags. INACTIVE, ACTIVE, BOTH is currently a tri-state among flags, which is a bit ugly." This patch moves isolate mode from swap.h to mmzone.h by memcontrol.h Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31Cross Memory AttachChristopher Yeoh
The basic idea behind cross memory attach is to allow MPI programs doing intra-node communication to do a single copy of the message rather than a double copy of the message via shared memory. The following patch attempts to achieve this by allowing a destination process, given an address and size from a source process, to copy memory directly from the source process into its own address space via a system call. There is also a symmetrical ability to copy from the current process's address space into a destination process's address space. - Use of /proc/pid/mem has been considered, but there are issues with using it: - Does not allow for specifying iovecs for both src and dest, assuming preadv or pwritev was implemented either the area read from or written to would need to be contiguous. - Currently mem_read allows only processes who are currently ptrace'ing the target and are still able to ptrace the target to read from the target. This check could possibly be moved to the open call, but its not clear exactly what race this restriction is stopping (reason appears to have been lost) - Having to send the fd of /proc/self/mem via SCM_RIGHTS on unix domain socket is a bit ugly from a userspace point of view, especially when you may have hundreds if not (eventually) thousands of processes that all need to do this with each other - Doesn't allow for some future use of the interface we would like to consider adding in the future (see below) - Interestingly reading from /proc/pid/mem currently actually involves two copies! (But this could be fixed pretty easily) As mentioned previously use of vmsplice instead was considered, but has problems. Since you need the reader and writer working co-operatively if the pipe is not drained then you block. Which requires some wrapping to do non blocking on the send side or polling on the receive. In all to all communication it requires ordering otherwise you can deadlock. And in the example of many MPI tasks writing to one MPI task vmsplice serialises the copying. There are some cases of MPI collectives where even a single copy interface does not get us the performance gain we could. For example in an MPI_Reduce rather than copy the data from the source we would like to instead use it directly in a mathops (say the reduce is doing a sum) as this would save us doing a copy. We don't need to keep a copy of the data from the source. I haven't implemented this, but I think this interface could in the future do all this through the use of the flags - eg could specify the math operation and type and the kernel rather than just copying the data would apply the specified operation between the source and destination and store it in the destination. Although we don't have a "second user" of the interface (though I've had some nibbles from people who may be interested in using it for intra process messaging which is not MPI). This interface is something which hardware vendors are already doing for their custom drivers to implement fast local communication. And so in addition to this being useful for OpenMPI it would mean the driver maintainers don't have to fix things up when the mm changes. There was some discussion about how much faster a true zero copy would go. Here's a link back to the email with some testing I did on that: http://marc.info/?l=linux-mm&m=130105930902915&w=2 There is a basic man page for the proposed interface here: http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt This has been implemented for x86 and powerpc, other architecture should mainly (I think) just need to add syscall numbers for the process_vm_readv and process_vm_writev. There are 32 bit compatibility versions for 64-bit kernels. For arch maintainers there are some simple tests to be able to quickly verify that the syscalls are working correctly here: http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgz Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: <linux-man@vger.kernel.org> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31include/linux/dmar.h: forward-declare struct acpi_dmar_headerAndrew Morton
x86_64 allnoconfig: In file included from arch/x86/kernel/pci-dma.c:3: include/linux/dmar.h:248: warning: 'struct acpi_dmar_header' declared inside parameter list include/linux/dmar.h:248: warning: its scope is only this definition or declaration, which is probably not what you want Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31dma-mapping: fix sync_single_range_* DMA debuggingClemens Ladisch
Commit 5fd75a7850b5 (dma-mapping: remove unnecessary sync_single_range_* in dma_map_ops) unified not only the dma_map_ops but also the corresponding debug_dma_sync_* calls. This led to spurious WARN()ings like the following because the DMA debug code was no longer able to detect the DMA buffer base address without the separate offset parameter: WARNING: at lib/dma-debug.c:911 check_sync+0xce/0x446() firewire_ohci 0000:04:00.0: DMA-API: device driver tries to sync DMA memory it has not allocated [device address=0x00000000cedaa400] [size=1024 bytes] Call Trace: ... [<ffffffff811326a5>] check_sync+0xce/0x446 [<ffffffff81132ad9>] debug_dma_sync_single_for_device+0x39/0x3b [<ffffffffa01d6e6a>] ohci_queue_iso+0x4f3/0x77d [firewire_ohci] ... To fix this, unshare the sync_single_* and sync_single_range_* implementations so that we are able to call the correct debug_dma_sync_* functions. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Clemens Ladisch <clemens@ladisch.de> Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (27 commits) vlan: allow nested vlan_do_receive() ipv6: fix route lookup in addrconf_prefix_rcv() bonding: eliminate bond_close race conditions qlcnic: fix beacon and LED test. qlcnic: Updated License file qlcnic: updated reset sequence qlcnic: reset loopback mode if promiscous mode setting fails. qlcnic: skip IDC ack check in fw reset path. i825xx: Fix incorrect dependency for BVME6000_NET ipv6: fix route error binding peer in func icmp6_dst_alloc ipv6: fix error propagation in ip6_ufo_append_data() stmmac: update normal descriptor structure (v2) stmmac: fix NULL pointer dereference in capabilities fixup (v2) stmmac: fix a bug while checking the HW cap reg (v2) be2net: Changing MAC Address of a VF was broken. be2net: Refactored be_cmds.c file. bnx2x: update driver version to 1.70.30-0 bnx2x: use FW 7.0.29.0 bnx2x: Enable changing speed when port type is PORT_DA bnx2x: Fix 54618se LED behavior ...
2011-10-30Merge branch 'i2c-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging * 'i2c-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging: i2c: Functions for byte-swapped smbus_write/read_word_data i2c-algo-pca: Return standard fault codes i2c-algo-bit: Return standard fault codes i2c-algo-bit: Be verbose on bus testing failure i2c-algo-bit: Let user test buses without failing i2c/scx200_acb: Fix section mismatch warning in scx200_pci_drv i2c: I2C_ELEKTOR should depend on HAS_IOPORT
2011-10-30Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommuLinus Torvalds
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (33 commits) iommu/core: Remove global iommu_ops and register_iommu iommu/msm: Use bus_set_iommu instead of register_iommu iommu/omap: Use bus_set_iommu instead of register_iommu iommu/vt-d: Use bus_set_iommu instead of register_iommu iommu/amd: Use bus_set_iommu instead of register_iommu iommu/core: Use bus->iommu_ops in the iommu-api iommu/core: Convert iommu_found to iommu_present iommu/core: Add bus_type parameter to iommu_domain_alloc Driver core: Add iommu_ops to bus_type iommu/core: Define iommu_ops and register_iommu only with CONFIG_IOMMU_API iommu/amd: Fix wrong shift direction iommu/omap: always provide iommu debug code iommu/core: let drivers know if an iommu fault handler isn't installed iommu/core: export iommu_set_fault_handler() iommu/omap: Fix build error with !IOMMU_SUPPORT iommu/omap: Migrate to the generic fault report mechanism iommu/core: Add fault reporting mechanism iommu/core: Use PAGE_SIZE instead of hard-coded value iommu/core: use the existing IS_ALIGNED macro iommu/msm: ->unmap() should return order of unmapped page ... Fixup trivial conflicts in drivers/iommu/Makefile: "move omap iommu to dedicated iommu folder" vs "Rename the DMAR and INTR_REMAP config options" just happened to touch lines next to each other.
2011-10-30Merge branch 'kvm-updates/3.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm * 'kvm-updates/3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm: (75 commits) KVM: SVM: Keep intercepting task switching with NPT enabled KVM: s390: implement sigp external call KVM: s390: fix register setting KVM: s390: fix return value of kvm_arch_init_vm KVM: s390: check cpu_id prior to using it KVM: emulate lapic tsc deadline timer for guest x86: TSC deadline definitions KVM: Fix simultaneous NMIs KVM: x86 emulator: convert push %sreg/pop %sreg to direct decode KVM: x86 emulator: switch lds/les/lss/lfs/lgs to direct decode KVM: x86 emulator: streamline decode of segment registers KVM: x86 emulator: simplify OpMem64 decode KVM: x86 emulator: switch src decode to decode_operand() KVM: x86 emulator: qualify OpReg inhibit_byte_regs hack KVM: x86 emulator: switch OpImmUByte decode to decode_imm() KVM: x86 emulator: free up some flag bits near src, dst KVM: x86 emulator: switch src2 to generic decode_operand() KVM: x86 emulator: expand decode flags to 64 bits KVM: x86 emulator: split dst decode to a generic decode_operand() KVM: x86 emulator: move memop, memopp into emulation context ...
2011-10-30Merge branch 'fbdev-next' of git://github.com/schandinat/linux-2.6Linus Torvalds
* 'fbdev-next' of git://github.com/schandinat/linux-2.6: (270 commits) video: platinumfb: Add __devexit_p at necessary place drivers/video: fsl-diu-fb: merge diu_pool into fsl_diu_data drivers/video: fsl-diu-fb: merge diu_hw into fsl_diu_data drivers/video: fsl-diu-fb: only DIU modes 0 and 1 are supported drivers/video: fsl-diu-fb: remove unused panel operating mode support drivers/video: fsl-diu-fb: use an enum for the AOI index drivers/video: fsl-diu-fb: add several new video modes drivers/video: fsl-diu-fb: remove broken screen blanking support drivers/video: fsl-diu-fb: move some definitions out of the header file drivers/video: fsl-diu-fb: fix some ioctls video: da8xx-fb: Increased resolution configuration of revised LCDC IP OMAPDSS: picodlp: add missing #include <linux/module.h> fb: fix au1100fb bitrot. mx3fb: fix NULL pointer dereference in screen blanking. video: irq: Remove IRQF_DISABLED smscufx: change edid data to u8 instead of char OMAPDSS: DISPC: zorder support for DSS overlays OMAPDSS: DISPC: VIDEO3 pipeline support OMAPDSS/OMAP_VOUT: Fix incorrect OMAP3-alpha compatibility setting video/omap: fix build dependencies ... Fix up conflicts in: - drivers/staging/xgifb/XGI_main_26.c Changes to XGIfb_pan_var() - drivers/video/omap/{lcd_apollon.c,lcd_ldp.c,lcd_overo.c} Removed (or in the case of apollon.c, merged into the generic DSS panel in drivers/video/omap2/displays/panel-generic-dpi.c)
2011-10-30i2c: Functions for byte-swapped smbus_write/read_word_dataJonathan Cameron
Reimplemented at least 17 times discounting error mangling cases where it could be used. Signed-off-by: Jonathan Cameron <jic23@cam.ac.uk> Signed-off-by: Jean Delvare <khali@linux-fr.org>
2011-10-30KVM: s390: implement sigp external callChristian Ehrhardt
Implement sigp external call, which might be required for guests that issue an external call instead of an emergency signal for IPI. This fixes an issue with "KVM: unknown SIGP: 0x02" when booting such an SMP guest. Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-10-30vlan: allow nested vlan_do_receive()Eric Dumazet
commit 2425717b27eb (net: allow vlan traffic to be received under bond) broke ARP processing on vlan on top of bonding. +-------+ eth0 --| bond0 |---bond0.103 eth1 --| | +-------+ 52870.115435: skb_gro_reset_offset <-napi_gro_receive 52870.115435: dev_gro_receive <-napi_gro_receive 52870.115435: napi_skb_finish <-napi_gro_receive 52870.115435: netif_receive_skb <-napi_skb_finish 52870.115435: get_rps_cpu <-netif_receive_skb 52870.115435: __netif_receive_skb <-netif_receive_skb 52870.115436: vlan_do_receive <-__netif_receive_skb 52870.115436: bond_handle_frame <-__netif_receive_skb 52870.115436: vlan_do_receive <-__netif_receive_skb 52870.115436: arp_rcv <-__netif_receive_skb 52870.115436: kfree_skb <-arp_rcv Packet is dropped in arp_rcv() because its pkt_type was set to PACKET_OTHERHOST in the first vlan_do_receive() call, since no eth0.103 exists. We really need to change pkt_type only if no more rx_handler is about to be called for the packet. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Reviewed-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-29Merge branch 'devicetree/merge' of git://git.secretlab.ca/git/linux-2.6Linus Torvalds
* 'devicetree/merge' of git://git.secretlab.ca/git/linux-2.6: ARM: mark empty gpio.h files empty gpio: Fix ARM versatile-express build failure of: include errno.h
2011-10-29Merge branch 'spi/next' of git://git.secretlab.ca/git/linux-2.6Linus Torvalds
* 'spi/next' of git://git.secretlab.ca/git/linux-2.6: drivercore: Add helper macro for platform_driver boilerplate spi: irq: Remove IRQF_DISABLED OMAP: SPI: Fix the trying to free nonexistent resource error spi/spi-ep93xx: add module.h include spi/tegra: fix compilation error in spi-tegra.c spi: spi-dw: fix all sparse warnings spi/spi-pl022: Call pl022_dma_remove(pl022) only if enable_dma is true spi/spi-pl022: calculate_effective_freq() must set rate <= requested rate spi/spi-pl022: Don't allocate more sg than required. spi/spi-pl022: Use GFP_ATOMIC for allocation from tasklet spi/spi-pl022: Resolve formatting issues
2011-10-29Merge branch 'gpio/next' of git://git.secretlab.ca/git/linux-2.6Linus Torvalds
* 'gpio/next' of git://git.secretlab.ca/git/linux-2.6: h8300: Move gpio.h to gpio-internal.h gpio: pl061: add DT binding support gpio: fix build error in include/asm-generic/gpio.h gpiolib: Ensure struct gpio is always defined irq: Add EXPORT_SYMBOL_GPL to function of irq generic-chip gpio-ml-ioh: Use NUMA_NO_NODE not GFP_KERNEL gpio-pch: Use NUMA_NO_NODE not GFP_KERNEL gpio: langwell: ensure alternate function is cleared gpio-pch: Support interrupt function gpio-pch: Save register value in suspend() gpio-pch: modify gpio_nums and mask gpio-pch: support ML7223 IOH n-Bus gpio-pch: add spinlock in suspend/resume processing gpio-pch: Delete invalid "restore" code in suspend() gpio-ml-ioh: Fix suspend/resume issue gpio-ml-ioh: Support interrupt function gpio-ml-ioh: Delete unnecessary code gpio/mxc: add chained_irq_enter/exit() to mx3_gpio_irq_handler() gpio/nomadik: use genirq core to track enablement gpio/nomadik: disable clocks when unused
2011-10-29of: include errno.hKalle Valo
When compiling ath6kl for beagleboard (omap2plus_defconfig plus CONFIG_ATH6KL, CONFIG_OF disable) with current linux-next compilation fails: include/linux/of.h:269: error: 'ENOSYS' undeclared (first use in this function) include/linux/of.h:276: error: 'ENOSYS' undeclared (first use in this function) include/linux/of.h:289: error: 'ENOSYS' undeclared (first use in this function) Fix this by including errno.h from of.h. Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2011-10-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (204 commits) [SCSI] qla4xxx: export address/port of connection (fix udev disk names) [SCSI] ipr: Fix BUG on adapter dump timeout [SCSI] megaraid_sas: Fix instance access in megasas_reset_timer [SCSI] hpsa: change confusing message to be more clear [SCSI] iscsi class: fix vlan configuration [SCSI] qla4xxx: fix data alignment and use nl helpers [SCSI] iscsi class: fix link local mispelling [SCSI] iscsi class: Replace iscsi_get_next_target_id with IDA [SCSI] aacraid: use lower snprintf() limit [SCSI] lpfc 8.3.27: Change driver version to 8.3.27 [SCSI] lpfc 8.3.27: T10 additions for SLI4 [SCSI] lpfc 8.3.27: Fix queue allocation failure recovery [SCSI] lpfc 8.3.27: Change algorithm for getting physical port name [SCSI] lpfc 8.3.27: Changed worst case mailbox timeout [SCSI] lpfc 8.3.27: Miscellanous logic and interface fixes [SCSI] megaraid_sas: Changelog and version update [SCSI] megaraid_sas: Add driver workaround for PERC5/1068 kdump kernel panic [SCSI] megaraid_sas: Add multiple MSI-X vector/multiple reply queue support [SCSI] megaraid_sas: Add support for MegaRAID 9360/9380 12GB/s controllers [SCSI] megaraid_sas: Clear FUSION_IN_RESET before enabling interrupts ...
2011-10-28Merge branch 'for-linus' of git://ceph.newdream.net/git/ceph-clientLinus Torvalds
* 'for-linus' of git://ceph.newdream.net/git/ceph-client: libceph: fix double-free of page vector ceph: fix 32-bit ino numbers libceph: force resend of osd requests if we skip an osdmap ceph: use kernel DNS resolver ceph: fix ceph_monc_init memory leak ceph: let the set_layout ioctl set single traits Revert "ceph: don't truncate dirty pages in invalidate work thread" ceph: replace leading spaces with tabs libceph: warn on msg allocation failures libceph: don't complain on msgpool alloc failures libceph: always preallocate mon connection libceph: create messenger with client ceph: document ioctls ceph: implement (optional) max read size ceph: rename rsize -> rasize ceph: make readpages fully async
2011-10-28Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (549 commits) ALSA: hda - Fix ADC input-amp handling for Cx20549 codec ALSA: hda - Keep EAPD turned on for old Conexant chips ALSA: hda/realtek - Fix missing volume controls with ALC260 ASoC: wm8940: Properly set codec->dapm.bias_level ALSA: hda - Fix pin-config for ASUS W90V ALSA: hda - Fix surround/CLFE headphone and speaker pins order ALSA: hda - Fix typo ALSA: Update the sound git tree URL ALSA: HDA: Add new revision for ALC662 ASoC: max98095: Convert codec->hw_write to snd_soc_write ASoC: keep pointer to resource so it can be freed ASoC: sgtl5000: Fix wrong mask in some snd_soc_update_bits calls ASoC: wm8996: Fix wrong mask for setting WM8996_AIF_CLOCKING_2 ASoC: da7210: Add support for line out and DAC ASoC: da7210: Add support for DAPM ALSA: hda/realtek - Fix DAC assignments of multiple speakers ASoC: Use SGTL5000_LINREG_VDDD_MASK instead of hardcoded mask value ASoC: Set sgtl5000->ldo in ldo_regulator_register ASoC: wm8996: Use SND_SOC_DAPM_AIF_OUT for AIF2 Capture ASoC: wm8994: Use SND_SOC_DAPM_AIF_OUT for AIF3 Capture ...
2011-10-28Merge branch 'next-rebase' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci * 'next-rebase' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci: PCI: Clean-up MPS debug output pci: Clamp pcie_set_readrq() when using "performance" settings PCI: enable MPS "performance" setting to properly handle bridge MPS PCI: Workaround for Intel MPS errata PCI: Add support for PASID capability PCI: Add implementation for PRI capability PCI: Export ATS functions to modules PCI: Move ATS implementation into own file PCI / PM: Remove unnecessary error variable from acpi_dev_run_wake() PCI hotplug: acpiphp: Prevent deadlock on PCI-to-PCI bridge remove PCI / PM: Extend PME polling to all PCI devices PCI quirk: mmc: Always check for lower base frequency quirk for Ricoh 1180:e823 PCI: Make pci_setup_bridge() non-static for use by arch code x86: constify PCI raw ops structures PCI: Add quirk for known incorrect MPSS PCI: Add Solarflare vendor ID and SFC4000 device IDs
2011-10-28Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc: (83 commits) mmc: fix compile error when CONFIG_BLOCK is not enabled mmc: core: Cleanup eMMC4.5 conditionals mmc: omap_hsmmc: if multiblock reads are broken, disable them mmc: core: add workaround for controllers with broken multiblock reads mmc: core: Prevent too long response times for suspend mmc: recognise SDIO cards with SDIO_CCCR_REV 3.00 mmc: sd: Handle SD3.0 cards not supporting UHS-I bus speed mode mmc: core: support HPI send command mmc: core: Add cache control for eMMC4.5 device mmc: core: Modify the timeout value for writing power class mmc: core: new discard feature support at eMMC v4.5 mmc: core: mmc sanitize feature support for v4.5 mmc: dw_mmc: modify DATA register offset mmc: sdhci-pci: add flag for devices that can support runtime PM mmc: omap_hsmmc: ensure pbias configuration is always done mmc: core: Add Power Off Notify Feature eMMC 4.5 mmc: sdhci-s3c: fix potential NULL dereference mmc: replace printk with appropriate display macro mmc: core: Add default timeout value for CMD6 mmc: sdhci-pci: add runtime pm support ...
2011-10-28Merge branch 'devel-stable' of ↵Linus Torvalds
http://ftp.arm.linux.org.uk/pub/linux/arm/kernel/git-cur/linux-2.6-arm * 'devel-stable' of http://ftp.arm.linux.org.uk/pub/linux/arm/kernel/git-cur/linux-2.6-arm: (178 commits) ARM: 7139/1: fix compilation with CONFIG_ARM_ATAG_DTB_COMPAT and large TEXT_OFFSET ARM: gic, local timers: use the request_percpu_irq() interface ARM: gic: consolidate PPI handling ARM: switch from NO_MACH_MEMORY_H to NEED_MACH_MEMORY_H ARM: mach-s5p64x0: remove mach/memory.h ARM: mach-s3c64xx: remove mach/memory.h ARM: plat-mxc: remove mach/memory.h ARM: mach-prima2: remove mach/memory.h ARM: mach-zynq: remove mach/memory.h ARM: mach-bcmring: remove mach/memory.h ARM: mach-davinci: remove mach/memory.h ARM: mach-pxa: remove mach/memory.h ARM: mach-ixp4xx: remove mach/memory.h ARM: mach-h720x: remove mach/memory.h ARM: mach-vt8500: remove mach/memory.h ARM: mach-s5pc100: remove mach/memory.h ARM: mach-tegra: remove mach/memory.h ARM: plat-tcc: remove mach/memory.h ARM: mach-mmp: remove mach/memory.h ARM: mach-cns3xxx: remove mach/memory.h ... Fix up mostly pretty trivial conflicts in: - arch/arm/Kconfig - arch/arm/include/asm/localtimer.h - arch/arm/kernel/Makefile - arch/arm/mach-shmobile/board-ap4evb.c - arch/arm/mach-u300/core.c - arch/arm/mm/dma-mapping.c - arch/arm/mm/proc-v7.S - arch/arm/plat-omap/Kconfig largely due to some CONFIG option renaming (ie CONFIG_PM_SLEEP -> CONFIG_ARM_CPU_SUSPEND for the arm-specific suspend code etc) and addition of NEED_MACH_MEMORY_H next to HAVE_IDE.
2011-10-28Merge branch 'for-next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue: (21 commits) leases: fix write-open/read-lease race nfs: drop unnecessary locking in llseek ext4: replace cut'n'pasted llseek code with generic_file_llseek_size vfs: add generic_file_llseek_size vfs: do (nearly) lockless generic_file_llseek direct-io: merge direct_io_walker into __blockdev_direct_IO direct-io: inline the complete submission path direct-io: separate map_bh from dio direct-io: use a slab cache for struct dio direct-io: rearrange fields in dio/dio_submit to avoid holes direct-io: fix a wrong comment direct-io: separate fields only used in the submission path from struct dio vfs: fix spinning prevention in prune_icache_sb vfs: add a comment to inode_permission() vfs: pass all mask flags check_acl and posix_acl_permission vfs: add hex format for MAY_* flag values vfs: indicate that the permission functions take all the MAY_* flags compat: sync compat_stats with statfs. vfs: add "device" tag to /proc/self/mountstats cleanup: vfs: small comment fix for block_invalidatepage ... Fix up trivial conflict in fs/gfs2/file.c (llseek changes)
2011-10-28Merge branch '3.2-without-smb2' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
* '3.2-without-smb2' of git://git.samba.org/sfrench/cifs-2.6: (52 commits) Fix build break when freezer not configured Add definition for share encryption CIFS: Make cifs_push_locks send as many locks at once as possible CIFS: Send as many mandatory unlock ranges at once as possible CIFS: Implement caching mechanism for posix brlocks CIFS: Implement caching mechanism for mandatory brlocks CIFS: Fix DFS handling in cifs_get_file_info CIFS: Fix error handling in cifs_readv_complete [CIFS] Fixup trivial checkpatch warning [CIFS] Show nostrictsync and noperm mount options in /proc/mounts cifs, freezer: add wait_event_freezekillable and have cifs use it cifs: allow cifs_max_pending to be readable under /sys/module/cifs/parameters cifs: tune bdi.ra_pages in accordance with the rsize cifs: allow for larger rsize= options and change defaults cifs: convert cifs_readpages to use async reads cifs: add cifs_async_readv cifs: fix protocol definition for READ_RSP cifs: add a callback function to receive the rest of the frame cifs: break out 3rd receive phase into separate function cifs: find mid earlier in receive codepath ...
2011-10-28vfs: add generic_file_llseek_sizeAndi Kleen
Add a generic_file_llseek variant to the VFS that allows passing in the maximum file size of the file system, instead of always using maxbytes from the superblock. This can be used to eliminate some cut'n'paste seek code in ext4. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-10-28vfs: do (nearly) lockless generic_file_llseekAndi Kleen
The i_mutex lock use of generic _file_llseek hurts. Independent processes accessing the same file synchronize over a single lock, even though they have no need for synchronization at all. Under high utilization this can cause llseek to scale very poorly on larger systems. This patch does some rethinking of the llseek locking model: First the 64bit f_pos is not necessarily atomic without locks on 32bit systems. This can already cause races with read() today. This was discussed on linux-kernel in the past and deemed acceptable. The patch does not change that. Let's look at the different seek variants: SEEK_SET: Doesn't really need any locking. If there's a race one writer wins, the other loses. For 32bit the non atomic update races against read() stay the same. Without a lock they can also happen against write() now. The read() race was deemed acceptable in past discussions, and I think if it's ok for read it's ok for write too. => Don't need a lock. SEEK_END: This behaves like SEEK_SET plus it reads the maximum size too. Reading the maximum size would have the 32bit atomic problem. But luckily we already have a way to read the maximum size without locking (i_size_read), so we can just use that instead. Without i_mutex there is no synchronization with write() anymore, however since the write() update is atomic on 64bit it just behaves like another racy SEEK_SET. On non atomic 32bit it's the same as SEEK_SET. => Don't need a lock, but need to use i_size_read() SEEK_CUR: This has a read-modify-write race window on the same file. One could argue that any application doing unsynchronized seeks on the same file is already broken. But for the sake of not adding a regression here I'm using the file->f_lock to synchronize this. Using this lock is much better than the inode mutex because it doesn't synchronize between processes. => So still need a lock, but can use a f_lock. This patch implements this new scheme in generic_file_llseek. I dropped generic_file_llseek_unlocked and changed all callers. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-10-28vfs: add hex format for MAY_* flag valuesAneesh Kumar K.V
We are going to add more flags and having them in hex format make it simpler Acked-by: J. Bruce Fields <bfields@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-10-28Merge branch 'drm-core-next' of git://people.freedesktop.org/~airlied/linuxLinus Torvalds
* 'drm-core-next' of git://people.freedesktop.org/~airlied/linux: (290 commits) Revert "drm/ttm: add a way to bo_wait for either the last read or last write" Revert "drm/radeon/kms: add a new gem_wait ioctl with read/write flags" vmwgfx: Don't pass unused arguments to do_dirty functions vmwgfx: Emulate depth 32 framebuffers drm/radeon: Lower the severity of the radeon lockup messages. drm/i915/dp: Fix eDP on PCH DP on CPT/PPT drm/i915/dp: Introduce is_cpu_edp() drm/i915: use correct SPD type value drm/i915: fix ILK+ infoframe support drm/i915: add DP test request handling drm/i915: read full receiver capability field during DP hot plug drm/i915/dp: Remove eDP special cases from bandwidth checks drm/i915/dp: Fix the math in intel_dp_link_required drm/i915/panel: Always record the backlight level again (but cleverly) i915: Move i915_read/write out of line drm/i915: remove transcoder PLL mashing from mode_set per specs drm/i915: if transcoder disable fails, say which drm/i915: set watermarks for third pipe on IVB drm/i915: export a CPT mode set verification function drm/i915: fix transcoder PLL select masking ...
2011-10-28Merge branch 'x86-rdrand-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip * 'x86-rdrand-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, random: Verify RDRAND functionality and allow it to be disabled x86, random: Architectural inlines to get random integers with RDRAND random: Add support for architectural random hooks Fix up trivial conflicts in drivers/char/random.c: the architectural random hooks touched "get_random_int()" that was simplified to use MD5 and not do the keyptr thing any more (see commit 6e5714eaf77d: "net: Compute protocol sequence numbers and fragment IDs using MD5").
2011-10-27Fix build break when freezer not configuredSteve French
fs/cifs/transport.c: In function 'wait_for_response': fs/cifs/transport.c:328: error: implicit declaration of function 'wait_event_freezekillable' Caused by commit f06ac72e9291 ("cifs, freezer: add wait_event_freezekillable and have cifs use it"). In this config, CONFIG_FREEZER is not set. Reviewed-by: Shirish Pargaonkar <shirishp@us.ibm.com> CC: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2011-10-27Revert "drm/ttm: add a way to bo_wait for either the last read or last write"Dave Airlie
This reverts commit dfadbbdb57b3f2bb33e14f129a43047c6f0caefa. Further upstream discussion between Marek and Thomas decided this wasn't fully baked and needed further work, so revert it before it hits mainline. Signed-off-by: Dave Airlie <airlied@redhat.com>
2011-10-27Revert "drm/radeon/kms: add a new gem_wait ioctl with read/write flags"Dave Airlie
This reverts commit d3ed74027f1dd197b7e08247a40d3bf9be1852b0. Further upstream discussion between Thomas and Marek decided this needed more work and driver specifics. So revert before it goes upstream. Signed-off-by: Dave Airlie <airlied@redhat.com>
2011-10-27mmc: fix compile error when CONFIG_BLOCK is not enabledNamjae Jeon
'DISK_NAME_LEN' is undeclared when CONFIG_BLOCK is disabled; its use was introduced via genhd.h by the general purpose partition patch. To fix, we just add our own MAX_MMC_PART_NAME_LEN macro instead of using DISK_NAME_LEN. Reported-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Namjae Jeon <linkinjeon@gmail.com> Acked-by: Randy Dunlap <rdunlap@xenotime.net> Acked-by: Andrei Warkentin <andreiw@vmware.com> Signed-off-by: Chris Ball <cjb@laptop.org>
2011-10-27mmc: core: add workaround for controllers with broken multiblock readsPaul Walmsley
Due to hardware bugs, some MMC host controllers don't support multiple-block reads[1]. To resolve, add a new MMC capability flag, MMC_CAP2_NO_MULTI_READ, which can be set by affected host controller drivers. When this capability is set, all reads will be issued one sector at a time. 1. See for example Advisory 2.1.1.128 "MMC: Multiple Block Read Operation Issue" in _OMAP3530/3525/3515/3503 Silicon Errata_ Revision F (October 2010) (SPRZ278F), available from http://focus.ti.com/lit/er/sprz278f/sprz278f.pdf Signed-off-by: Paul Walmsley <paul@pwsan.com> Cc: Dave Hylands <dhylands@gmail.com> Tested-by: Steve Sakoman <sakoman@gmail.com> Signed-off-by: Chris Ball <cjb@laptop.org>
2011-10-27Merge branch 'topic/asoc' into for-linusTakashi Iwai