summaryrefslogtreecommitdiff
path: root/lib
AgeCommit message (Collapse)Author
2014-09-26mm: filemap: move radix tree hole searching hereJohannes Weiner
commit e7b563bb2a6f4d974208da46200784b9c5b5a47e upstream. The radix tree hole searching code is only used for page cache, for example the readahead code trying to get a a picture of the area surrounding a fault. It sufficed to rely on the radix tree definition of holes, which is "empty tree slot". But this is about to change, though, as shadow page descriptors will be stored in the page cache after the actual pages get evicted from memory. Move the functions over to mm/filemap.c and make them native page cache operations, where they can later be adapted to handle the new definition of "page cache hole". Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-09-26lib: radix-tree: add radix_tree_delete_item()Johannes Weiner
commit 53c59f262d747ea82e7414774c59a489501186a0 upstream. Provide a function that does not just delete an entry at a given index, but also allows passing in an expected item. Delete only if that item is still located at the specified index. This is handy when lockless tree traversals want to delete entries as well because they don't have to do an second, locked lookup to verify the slot has not changed under them before deleting the entry. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-09-26lib/plist: add plist_requeueDan Streetman
commit a75f232ce0fe38bd01301899ecd97ffd0254316a upstream. Add plist_requeue(), which moves the specified plist_node after all other same-priority plist_nodes in the list. This is essentially an optimized plist_del() followed by plist_add(). This is needed by swap, which (with the next patch in this set) uses a plist of available swap devices. When a swap device (either a swap partition or swap file) are added to the system with swapon(), the device is added to a plist, ordered by the swap device's priority. When swap needs to allocate a page from one of the swap devices, it takes the page from the first swap device on the plist, which is the highest priority swap device. The swap device is left in the plist until all its pages are used, and then removed from the plist when it becomes full. However, as described in man 2 swapon, swap must allocate pages from swap devices with the same priority in round-robin order; to do this, on each swap page allocation, swap uses a page from the first swap device in the plist, and then calls plist_requeue() to move that swap device entry to after any other same-priority swap devices. The next swap page allocation will again use a page from the first swap device in the plist and requeue it, and so on, resulting in round-robin usage of equal-priority swap devices. Also add plist_test_requeue() test function, for use by plist_test() to test plist_requeue() function. Signed-off-by: Dan Streetman <ddstreet@ieee.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Shaohua Li <shli@fusionio.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> Cc: Weijie Yang <weijieut@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Bob Liu <bob.liu@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-08-19lib/btree.c: fix leak of whole btree nodesMinfei Huang
commit c75b53af2f0043aff500af0a6f878497bef41bca upstream. I use btree from 3.14-rc2 in my own module. When the btree module is removed, a warning arises: kmem_cache_destroy btree_node: Slab cache still has objects CPU: 13 PID: 9150 Comm: rmmod Tainted: GF O 3.14.0-rc2 #1 Hardware name: Inspur NF5270M3/NF5270M3, BIOS CHEETAH_2.1.3 09/10/2013 Call Trace: dump_stack+0x49/0x5d kmem_cache_destroy+0xcf/0xe0 btree_module_exit+0x10/0x12 [btree] SyS_delete_module+0x198/0x1f0 system_call_fastpath+0x16/0x1b The cause is that it doesn't release the last btree node, when height = 1 and fill = 1. [akpm@linux-foundation.org: remove unneeded test of NULL] Signed-off-by: Minfei Huang <huangminfei@ucloud.cn> Cc: Joern Engel <joern@logfs.org> Cc: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-07-17lz4: add overrun checks to lz4_uncompress_unknownoutputsize()Greg Kroah-Hartman
commit 4a3a99045177369700c60d074c0e525e8093b0fc upstream. Jan points out that I forgot to make the needed fixes to the lz4_uncompress_unknownoutputsize() function to mirror the changes done in lz4_decompress() with regards to potential pointer overflows. The only in-kernel user of this function is the zram code, which only takes data from a valid compressed buffer that it made itself, so it's not a big issue. But due to external kernel modules using this function, it's better to be safe here. Reported-by: Jan Beulich <JBeulich@suse.com> Cc: "Don A. Bailey" <donb@securitymouse.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-07-02lz4: fix another possible overrunGreg Kroah-Hartman
commit 4148c1f67abf823099b2d7db6851e4aea407f5ee upstream. There is one other possible overrun in the lz4 code as implemented by Linux at this point in time (which differs from the upstream lz4 codebase, but will get synced at in a future kernel release.) As pointed out by Don, we also need to check the overflow in the data itself. While we are at it, replace the odd error return value with just a "simple" -1 value as the return value is never used for anything other than a basic "did this work or not" check. Reported-by: "Don A. Bailey" <donb@securitymouse.com> Reported-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-07-02idr: fix overflow bug during maximum ID calculation at maximum heightLai Jiangshan
commit 3afb69cb5572b3c8c898c00880803cf1a49852c4 upstream. idr_replace() open-codes the logic to calculate the maximum valid ID given the height of the idr tree; unfortunately, the open-coded logic doesn't account for the fact that the top layer may have unused slots and over-shifts the limit to zero when the tree is at its maximum height. The following test code shows it fails to replace the value for id=((1<<27)+42): static void test5(void) { int id; DEFINE_IDR(test_idr); #define TEST5_START ((1<<27)+42) /* use the highest layer */ printk(KERN_INFO "Start test5\n"); id = idr_alloc(&test_idr, (void *)1, TEST5_START, 0, GFP_KERNEL); BUG_ON(id != TEST5_START); TEST_BUG_ON(idr_replace(&test_idr, (void *)2, TEST5_START) != (void *)1); idr_destroy(&test_idr); printk(KERN_INFO "End of test5\n"); } Fix the bug by using idr_max() which correctly takes into account the maximum allowed shift. sub_alloc() shares the same problem and may incorrectly fail with -EAGAIN; however, this bug doesn't affect correct operation because idr_get_empty_slot(), which already uses idr_max(), retries with the increased @id in such cases. [tj@kernel.org: Updated patch description.] Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-07-02lz4: ensure length does not wrapGreg Kroah-Hartman
commit 206204a1162b995e2185275167b22468c00d6b36 upstream. Given some pathologically compressed data, lz4 could possibly decide to wrap a few internal variables, causing unknown things to happen. Catch this before the wrapping happens and abort the decompression. Reported-by: "Don A. Bailey" <donb@securitymouse.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-07-02lzo: properly check for overrunsGreg Kroah-Hartman
commit 206a81c18401c0cde6e579164f752c4b147324ce upstream. The lzo decompressor can, if given some really crazy data, possibly overrun some variable types. Modify the checking logic to properly detect overruns before they happen. Reported-by: "Don A. Bailey" <donb@securitymouse.com> Tested-by: "Don A. Bailey" <donb@securitymouse.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-06-23netlink: rate-limit leftover bytes warning and print process nameMichal Schmidt
[ Upstream commit bfc5184b69cf9eeb286137640351c650c27f118a ] Any process is able to send netlink messages with leftover bytes. Make the warning rate-limited to prevent too much log spam. The warning is supposed to help find userspace bugs, so print the triggering command name to implicate the buggy program. [v2: Use pr_warn_ratelimited instead of printk_ratelimited.] Signed-off-by: Michal Schmidt <mschmidt@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-05-15lib/percpu_counter.c: fix bad percpu counter state during suspendJens Axboe
commit e39435ce68bb4685288f78b1a7e24311f7ef939f upstream. I got a bug report yesterday from Laszlo Ersek in which he states that his kvm instance fails to suspend. Laszlo bisected it down to this commit 1cf7e9c68fe8 ("virtio_blk: blk-mq support") where virtio-blk is converted to use the blk-mq infrastructure. After digging a bit, it became clear that the issue was with the queue drain. blk-mq tracks queue usage in a percpu counter, which is incremented on request alloc and decremented when the request is freed. The initial hunt was for an inconsistency in blk-mq, but everything seemed fine. In fact, the counter only returned crazy values when suspend was in progress. When a CPU is unplugged, the percpu counters merges that CPU state with the general state. blk-mq takes care to register a hotcpu notifier with the appropriate priority, so we know it runs after the percpu counter notifier. However, the percpu counter notifier only merges the state when the CPU is fully gone. This leaves a state transition where the CPU going away is no longer in the online mask, yet it still holds private values. This means that in this state, percpu_counter_sum() returns invalid results, and the suspend then hangs waiting for abs(dead-cpu-value) requests to complete which of course will never happen. Fix this by clearing the state earlier, so we never have a case where the CPU isn't in online mask but still holds private state. This bug has been there since forever, I guess we don't have a lot of users where percpu counters needs to be reliable during the suspend cycle. Signed-off-by: Jens Axboe <axboe@fb.com> Reported-by: Laszlo Ersek <lersek@redhat.com> Tested-by: Laszlo Ersek <lersek@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-04-18netlink: don't compare the nul-termination in nla_strcmpPablo Neira
[ Upstream commit 8b7b932434f5eee495b91a2804f5b64ebb2bc835 ] nla_strcmp compares the string length plus one, so it's implicitly including the nul-termination in the comparison. int nla_strcmp(const struct nlattr *nla, const char *str) { int len = strlen(str) + 1; ... d = memcmp(nla_data(nla), str, len); However, if NLA_STRING is used, userspace can send us a string without the nul-termination. This is a problem since the string comparison will not match as the last byte may be not the nul-termination. Fix this by skipping the comparison of the nul-termination if the attribute data is nul-terminated. Suggested by Thomas Graf. Cc: Florian Westphal <fw@strlen.de> Cc: Thomas Graf <tgraf@suug.ch> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-12mm: do not walk all of system memory during show_memMel Gorman
commit c78e93630d15b5f5774213aad9bdc9f52473a89b upstream. It has been reported on very large machines that show_mem is taking almost 5 minutes to display information. This is a serious problem if there is an OOM storm. The bulk of the cost is in show_mem doing a very expensive PFN walk to give us the following information Total RAM: Also available as totalram_pages Highmem pages: Also available as totalhigh_pages Reserved pages: Can be inferred from the zone structure Shared pages: PFN walk required Unshared pages: PFN walk required Quick pages: Per-cpu walk required Only the shared/unshared pages requires a full PFN walk but that information is useless. It is also inaccurate as page pins of unshared pages would be accounted for as shared. Even if the information was accurate, I'm struggling to think how the shared/unshared information could be useful for debugging OOM conditions. Maybe it was useful before rmap existed when reclaiming shared pages was costly but it is less relevant today. The PFN walk could be optimised a bit but why bother as the information is useless. This patch deletes the PFN walker and infers the total RAM, highmem and reserved pages count from struct zone. It omits the shared/unshared page usage on the grounds that it is useless. It also corrects the reporting of HighMem as HighMem/MovableOnly as ZONE_MOVABLE has similar problems to HighMem with respect to lowmem/highmem exhaustion. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-02-20x86, hweight: Fix BUG when booting with CONFIG_GCOV_PROFILE_ALL=yPeter Oberparleiter
commit 6583327c4dd55acbbf2a6f25e775b28b3abf9a42 upstream. Commit d61931d89b, "x86: Add optimized popcnt variants" introduced compile flag -fcall-saved-rdi for lib/hweight.c. When combined with options -fprofile-arcs and -O2, this flag causes gcc to generate broken constructor code. As a result, a 64 bit x86 kernel compiled with CONFIG_GCOV_PROFILE_ALL=y prints message "gcov: could not create file" and runs into sproadic BUGs during boot. The gcc people indicate that these kinds of problems are endemic when using ad hoc calling conventions. It is therefore best to treat any file compiled with ad hoc calling conventions as an isolated environment and avoid things like profiling or coverage analysis, since those subsystems assume a "normal" calling conventions. This patch avoids the bug by excluding lib/hweight.o from coverage profiling. Reported-by: Meelis Roos <mroos@linux.ee> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/52F3A30C.7050205@linux.vnet.ibm.com Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13iscsi-target: Fix connection reset hang with percpu_ida_allocNicholas Bellinger
commit 555b270e25b0279b98083518a85f4b1da144a181 upstream. This patch addresses a bug where connection reset would hang indefinately once percpu_ida_alloc() was starved for tags, due to the fact that it always assumed uninterruptible sleep mode. So now make percpu_ida_alloc() check for signal_pending_state() for making interruptible sleep optional, and convert iscsit_allocate_cmd() to set TASK_INTERRUPTIBLE for GFP_KERNEL, or TASK_RUNNING for GFP_ATOMIC. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13percpu_ida: Make percpu_ida_alloc + callers accept task state bitmaskKent Overstreet
commit 6f6b5d1ec56acdeab0503d2b823f6f88a0af493e upstream. This patch changes percpu_ida_alloc() + callers to accept task state bitmask for prepare_to_wait() for code like target/iscsi that needs it for interruptible sleep, that is provided in a subsequent patch. It now expects TASK_UNINTERRUPTIBLE when the caller is able to sleep waiting for a new tag, or TASK_RUNNING when the caller cannot sleep, and is forced to return a negative value when no tags are available. v2 changes: - Include blk-mq + tcm_fc + vhost/scsi + target/iscsi changes - Drop signal_pending_state() call v3 changes: - Only call prepare_to_wait() + finish_wait() when != TASK_RUNNING (PeterZ) Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-08random32: fix off-by-one in seeding requirementDaniel Borkmann
[ Upstream commit 51c37a70aaa3f95773af560e6db3073520513912 ] For properly initialising the Tausworthe generator [1], we have a strict seeding requirement, that is, s1 > 1, s2 > 7, s3 > 15. Commit 697f8d0348 ("random32: seeding improvement") introduced a __seed() function that imposes boundary checks proposed by the errata paper [2] to properly ensure above conditions. However, we're off by one, as the function is implemented as: "return (x < m) ? x + m : x;", and called with __seed(X, 1), __seed(X, 7), __seed(X, 15). Thus, an unwanted seed of 1, 7, 15 would be possible, whereas the lower boundary should actually be of at least 2, 8, 16, just as GSL does. Fix this, as otherwise an initialization with an unwanted seed could have the effect that Tausworthe's PRNG properties cannot not be ensured. Note that this PRNG is *not* used for cryptography in the kernel. [1] http://www.iro.umontreal.ca/~lecuyer/myftp/papers/tausme.ps [2] http://www.iro.umontreal.ca/~lecuyer/myftp/papers/tausme2.ps Joint work with Hannes Frederic Sowa. Fixes: 697f8d0348a6 ("random32: seeding improvement") Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Florian Weimer <fweimer@redhat.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-04vsprintf: check real user/group id for %pKRyan Mallon
commit 312b4e226951f707e120b95b118cbc14f3d162b2 upstream. Some setuid binaries will allow reading of files which have read permission by the real user id. This is problematic with files which use %pK because the file access permission is checked at open() time, but the kptr_restrict setting is checked at read() time. If a setuid binary opens a %pK file as an unprivileged user, and then elevates permissions before reading the file, then kernel pointer values may be leaked. This happens for example with the setuid pppd application on Ubuntu 12.04: $ head -1 /proc/kallsyms 00000000 T startup_32 $ pppd file /proc/kallsyms pppd: In file /proc/kallsyms: unrecognized option 'c1000000' This will only leak the pointer value from the first line, but other setuid binaries may leak more information. Fix this by adding a check that in addition to the current process having CAP_SYSLOG, that effective user and group ids are equal to the real ids. If a setuid binary reads the contents of a file which uses %pK then the pointer values will be printed as NULL if the real user is unprivileged. Update the sysctl documentation to reflect the changes, and also correct the documentation to state the kptr_restrict=0 is the default. This is a only temporary solution to the issue. The correct solution is to do the permission check at open() time on files, and to replace %pK with a function which checks the open() time permission. %pK uses in printk should be removed since no sane permission check can be done, and instead protected by using dmesg_restrict. Signed-off-by: Ryan Mallon <rmallon@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Joe Perches <joe@perches.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-31lib/scatterlist.c: don't flush_kernel_dcache_page on slab pageMing Lei
Commit b1adaf65ba03 ("[SCSI] block: add sg buffer copy helper functions") introduces two sg buffer copy helpers, and calls flush_kernel_dcache_page() on pages in SG list after these pages are written to. Unfortunately, the commit may introduce a potential bug: - Before sending some SCSI commands, kmalloc() buffer may be passed to block layper, so flush_kernel_dcache_page() can see a slab page finally - According to cachetlb.txt, flush_kernel_dcache_page() is only called on "a user page", which surely can't be a slab page. - ARCH's implementation of flush_kernel_dcache_page() may use page mapping information to do optimization so page_mapping() will see the slab page, then VM_BUG_ON() is triggered. Aaro Koskinen reported the bug on ARM/kirkwood when DEBUG_VM is enabled, and this patch fixes the bug by adding test of '!PageSlab(miter->page)' before calling flush_kernel_dcache_page(). Signed-off-by: Ming Lei <ming.lei@canonical.com> Reported-by: Aaro Koskinen <aaro.koskinen@iki.fi> Tested-by: Simon Baatz <gmbnomis@gmail.com> Cc: Russell King - ARM Linux <linux@arm.linux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Cc: Aaro Koskinen <aaro.koskinen@iki.fi> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Tejun Heo <tj@kernel.org> Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: <stable@vger.kernel.org> [3.2+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-10-29Kconfig: make KOBJECT_RELEASE debugging require timer debuggingLinus Torvalds
Without the timer debugging, the delayed kobject release will just result in undebuggable oopses if it triggers any latent bugs. That doesn't actually help debugging at all. So make DEBUG_KOBJECT_RELEASE depend on DEBUG_OBJECTS_TIMERS to avoid having people enable one without the other. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-10-16percpu_refcount: export symbolsMatias Bjorling
Export the interface to be used within modules. Signed-off-by: Matias Bjorling <m@bjorling.me> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-10-10kobject: show debug info on delayed kobject releaseFengguang Wu
Useful for locating buggy drivers on kernel oops. It may add dozens of new lines to boot dmesg. DEBUG_KOBJECT_RELEASE is hopefully only enabled in debug kernels (like maybe the Fedora rawhide one, or at developers), so being a bit more verbose is likely ok. Acked-by: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-10-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking changes from David Miller: 1) Multiply in netfilter IPVS can overflow when calculating destination weight. From Simon Kirby. 2) Use after free fixes in IPVS from Julian Anastasov. 3) SFC driver bug fixes from Daniel Pieczko. 4) Memory leak in pcan_usb_core failure paths, from Alexey Khoroshilov. 5) Locking and encapsulation fixes to serial line CAN driver, from Andrew Naujoks. 6) Duplex and VF handling fixes to bnx2x driver from Yaniv Rosner, Eilon Greenstein, and Ariel Elior. 7) In lapb, if no other packets are outstanding, T1 timeouts actually stall things and no packet gets sent. Fix from Josselin Costanzi. 8) ICMP redirects should not make it to the socket error queues, from Duan Jiong. 9) Fix bugs in skge DMA mapping error handling, from Nikulas Patocka. 10) Fix setting of VLAN priority field on via-rhine driver, from Roget Luethi. 11) Fix TX stalls and VLAN promisc programming in be2net driver from Ajit Khaparde. 12) Packet padding doesn't get handled correctly in new usbnet SG support code, from Ming Lei. 13) Fix races in netdevice teardown wrt. network namespace closing. From Eric W. Biederman. 14) Fix potential missed initialization of net_secret if not TCP connections are openned. From Eric Dumazet. 15) Cinterion PLXX product ID in qmi_wwan driver is wrong, from Aleksander Morgado. 16) skb_cow_head() can change skb->data and thus packet header pointers, don't use stale ip_hdr reference in ip_tunnel code. 17) Backend state transition handling fixes in xen-netback, from Paul Durrant. 18) Packet offset for AH protocol is handled wrong in flow dissector, from Eric Dumazet. 19) Taking down an fq packet scheduler instance can leave stale packets in the queues, fix from Eric Dumazet. 20) Fix performance regressions introduced by TCP Small Queues. From Eric Dumazet. 21) IPV6 GRE tunneling code calculates max_headroom incorrectly, from Hannes Frederic Sowa. 22) Multicast timer handlers in ipv4 and ipv6 can be the last and final reference to the ipv4/ipv6 specific network device state, so use the reference put that will check and release the object if the reference hits zero. From Salam Noureddine. 23) Fix memory corruption in ip_tunnel driver, and use skb_push() instead of __skb_push() so that similar bugs are less hard to find. From Steffen Klassert. 24) Add forgotten hookup of rtnl_ops in SIT and ip6tnl drivers, from Nicolas Dichtel. 25) fq scheduler doesn't accurately rate limit in certain circumstances, from Eric Dumazet. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (103 commits) pkt_sched: fq: rate limiting improvements ip6tnl: allow to use rtnl ops on fb tunnel sit: allow to use rtnl ops on fb tunnel ip_tunnel: Remove double unregister of the fallback device ip_tunnel_core: Change __skb_push back to skb_push ip_tunnel: Add fallback tunnels to the hash lists ip_tunnel: Fix a memory corruption in ip_tunnel_xmit qlcnic: Fix SR-IOV configuration ll_temac: Reset dma descriptors indexes on ndo_open skbuff: size of hole is wrong in a comment ipv6 mcast: use in6_dev_put in timer handlers instead of __in6_dev_put ipv4 igmp: use in_dev_put in timer handlers instead of __in_dev_put ethernet: moxa: fix incorrect placement of __initdata tag ipv6: gre: correct calculation of max_headroom powerpc/83xx: gianfar_ptp: select 1588 clock source through dts file Revert "powerpc/83xx: gianfar_ptp: select 1588 clock source through dts file" bonding: Fix broken promiscuity reference counting issue tcp: TSQ can use a dynamic limit dm9601: fix IFF_ALLMULTI handling pkt_sched: fq: qdisc dismantle fixes ...
2013-09-28lockref: use arch_mutex_cpu_relax() in CMPXCHG_LOOP()Heiko Carstens
Make use of arch_mutex_cpu_relax() so architectures can override the default cpu_relax() semantics. This is especially useful for s390, where cpu_relax() means that we yield() the current (virtual) cpu and therefore is very expensive, and would contradict the whole purpose of the lockless cmpxchg loop. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2013-09-27sysfs: Allow mounting without CONFIG_NETEric W. Biederman
In kobj_ns_current_may_mount the default should be to allow the mount. The test is only for a single kobj_ns_type at a time, and unless there is a reason to prevent it the mounting sysfs should be allowed. Subsystems that are not registered can't have are not involved so can't have a reason to prevent mounting sysfs. This is a bug-fix to commit 7dc5dbc879bd ("sysfs: Restrict mounting sysfs") that came in via the userns tree during the 3.12 merge window. Reported-and-tested-by: James Hogan <james.hogan@imgtec.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-27lockref: allow relaxed cmpxchg64 variant for lockless updatesWill Deacon
The 64-bit cmpxchg operation on the lockref is ordered by virtue of hazarding between the cmpxchg operation and the reference count manipulation. On weakly ordered memory architectures (such as ARM), it can be of great benefit to omit the barrier instructions where they are not needed. This patch moves the lockless lockref code over to a cmpxchg64_relaxed operation, which doesn't provide barrier semantics. If the operation isn't defined, we simply #define it as the usual 64-bit cmpxchg macro. Cc: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-20lib: introduce upper case hex ascii helpersAndre Naujoks
To be able to use the hex ascii functions in case sensitive environments the array hex_asc_upper[] and the needed functions for hex_byte_pack_upper() are introduced. Signed-off-by: Andre Naujoks <nautsch2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-20lockref: use cmpxchg64 explicitly for lockless updatesWill Deacon
The cmpxchg() function tends not to support 64-bit arguments on 32-bit architectures. This could be either due to use of unsigned long arguments (like on ARM) or lack of instruction support (cmpxchgq on x86). However, these architectures may implement a specific cmpxchg64() function to provide 64-bit cmpxchg support instead. Since the lockref code requires a 64-bit cmpxchg and relies on the architecture selecting ARCH_USE_CMPXCHG_LOCKREF, move to using cmpxchg64 instead of cmpxchg and allow 32-bit architectures to make use of the lockless lockref implementation. Cc: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-13Merge branch 'genirq' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull generic hardirq option removal from Martin Schwidefsky: "All architectures now use generic hardirqs, s390 has been last to switch. With that the code under !CONFIG_GENERIC_HARDIRQS and the related HAVE_GENERIC_HARDIRQS and GENERIC_HARDIRQS config options can be removed. Yay!" * 'genirq' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: Remove GENERIC_HARDIRQ config option
2013-09-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6Linus Torvalds
Pull crypto fixes from Herbert Xu: "This fixes a 7+ year race condition in the crypto API that causes sporadic crashes when multiple threads load the same algorithm. It also fixes the crct10dif algorithm again to prevent boot failures on systems where the initramfs tool ignores module softdeps" * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: crypto: crct10dif - Add fallback for broken initrds crypto: api - Fix race condition in larval lookup
2013-09-13Remove GENERIC_HARDIRQ config optionMartin Schwidefsky
After the last architecture switched to generic hard irqs the config options HAVE_GENERIC_HARDIRQS & GENERIC_HARDIRQS and the related code for !CONFIG_GENERIC_HARDIRQS can be removed. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2013-09-12Merge branch 'for-next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending Pull SCSI target updates from Nicholas Bellinger: "Lots of activity again this round for I/O performance optimizations (per-cpu IDA pre-allocation for vhost + iscsi/target), and the addition of new fabric independent features to target-core (COMPARE_AND_WRITE + EXTENDED_COPY). The main highlights include: - Support for iscsi-target login multiplexing across individual network portals - Generic Per-cpu IDA logic (kent + akpm + clameter) - Conversion of vhost to use per-cpu IDA pre-allocation for descriptors, SGLs and userspace page pointer list - Conversion of iscsi-target + iser-target to use per-cpu IDA pre-allocation for descriptors - Add support for generic COMPARE_AND_WRITE (AtomicTestandSet) emulation for virtual backend drivers - Add support for generic EXTENDED_COPY (CopyOffload) emulation for virtual backend drivers. - Add support for fast memory registration mode to iser-target (Vu) The patches to add COMPARE_AND_WRITE and EXTENDED_COPY support are of particular significance, which make us the first and only open source target to support the full set of VAAI primitives. Currently Linux clients are lacking upstream support to actually utilize these primitives. However, with server side support now in place for folks like MKP + ZAB working on the client, this logic once reserved for the highest end of storage arrays, can now be run in VMs on their laptops" * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (50 commits) target/iscsi: Bump versions to v4.1.0 target: Update copyright ownership/year information to 2013 iscsi-target: Bump default TCP listen backlog to 256 target: Fix >= v3.9+ regression in PR APTPL + ALUA metadata write-out iscsi-target; Bump default CmdSN Depth to 64 iscsi-target: Remove unnecessary wait_for_completion in iscsi_get_thread_set iscsi-target: Add thread_set->ts_activate_sem + use common deallocate iscsi-target: Fix race with thread_pre_handler flush_signals + ISCSI_THREAD_SET_DIE target: remove unused including <linux/version.h> iser-target: introduce fast memory registration mode (FRWR) iser-target: generalize rdma memory registration and cleanup iser-target: move rdma wr processing to a shared function target: Enable global EXTENDED_COPY setup/release target: Add Third Party Copy (3PC) bit in INQUIRY response target: Enable EXTENDED_COPY setup in spc_parse_cdb target: Add support for EXTENDED_COPY copy offload emulation target: Avoid non-existent tg_pt_gp_mem in target_alua_state_check target: Add global device list for EXTENDED_COPY target: Make helpers non static for EXTENDED_COPY command setup target: Make spc_parse_naa_6h_vendor_specific non static ...
2013-09-12crypto: crct10dif - Add fallback for broken initrdsHerbert Xu
Unfortunately, even with a softdep some distros fail to include the necessary modules in the initrd. Therefore this patch adds a fallback path to restore existing behaviour where we cannot load the new crypto crct10dif algorithm. In order to do this, the underlying crct10dif has been split out from the crypto implementation so that it can be used on the fallback path. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-09-11lz4: fix compression/decompression signedness mismatchSergey Senozhatsky
LZ4 compression and decompression functions require different in signedness input/output parameters: unsigned char for compression and signed char for decompression. Change decompression API to require "(const) unsigned char *". Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Kyungsik Lee <kyungsik.lee@lge.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Yann Collet <yann.collet.73@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11lib/radix-tree.c: make radix_tree_node_alloc() work correctly within interruptJan Kara
With users of radix_tree_preload() run from interrupt (block/blk-ioc.c is one such possible user), the following race can happen: radix_tree_preload() ... radix_tree_insert() radix_tree_node_alloc() if (rtp->nr) { ret = rtp->nodes[rtp->nr - 1]; <interrupt> ... radix_tree_preload() ... radix_tree_insert() radix_tree_node_alloc() if (rtp->nr) { ret = rtp->nodes[rtp->nr - 1]; And we give out one radix tree node twice. That clearly results in radix tree corruption with different results (usually OOPS) depending on which two users of radix tree race. We fix the problem by making radix_tree_node_alloc() always allocate fresh radix tree nodes when in interrupt. Using preloading when in interrupt doesn't make sense since all the allocations have to be atomic anyway and we cannot steal nodes from process-context users because some users rely on radix_tree_insert() succeeding after radix_tree_preload(). in_interrupt() check is somewhat ugly but we cannot simply key off passed gfp_mask as that is acquired from root_gfp_mask() and thus the same for all preload users. Another part of the fix is to avoid node preallocation in radix_tree_preload() when passed gfp_mask doesn't allow waiting. Again, preallocation in such case doesn't make sense and when preallocation would happen in interrupt we could possibly leak some allocated nodes. However, some users of radix_tree_preload() require following radix_tree_insert() to succeed. To avoid unexpected effects for these users, radix_tree_preload() only warns if passed gfp mask doesn't allow waiting and we provide a new function radix_tree_maybe_preload() for those users which get different gfp mask from different call sites and which are prepared to handle radix_tree_insert() failure. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <jaxboe@fusionio.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11rbtree: allow tests to run as builtinCody P Schafer
No reason require rbtree test code to be a module, allow it to be builtin (streamlines my development process) Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Reviewed-by: Seth Jennings <sjenning@linux.vnet.ibm.com> Cc: David Woodhouse <David.Woodhouse@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11rbtree_test: add test for postorder iterationCody P Schafer
Just check that we examine all nodes in the tree for the postorder iteration. Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Reviewed-by: Seth Jennings <sjenning@linux.vnet.ibm.com> Cc: David Woodhouse <David.Woodhouse@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11rbtree: add postorder iteration functionsCody P Schafer
Postorder iteration yields all of a node's children prior to yielding the node itself, and this particular implementation also avoids examining the leaf links in a node after that node has been yielded. In what I expect will be its most common usage, postorder iteration allows the deletion of every node in an rbtree without modifying the rbtree nodes (no _requirement_ that they be nulled) while avoiding referencing child nodes after they have been "deleted" (most commonly, freed). I have only updated zswap to use this functionality at this point, but numerous bits of code (most notably in the filesystem drivers) use a hand rolled postorder iteration that NULLs child links as it traverses the tree. Each of those instances could be replaced with this common implementation. 1 & 2 add rbtree postorder iteration functions. 3 adds testing of the iteration to the rbtree runtime tests 4 allows building the rbtree runtime tests as builtins 5 updates zswap. This patch: Add postorder iteration functions for rbtree. These are useful for safely freeing an entire rbtree without modifying the tree at all. Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Reviewed-by: Seth Jennings <sjenning@linux.vnet.ibm.com> Cc: David Woodhouse <David.Woodhouse@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11lib/decompressors: fix "no limit" output buffer lengthAlexandre Courbot
When decompressing into memory, the output buffer length is set to some arbitrarily high value (0x7fffffff) to indicate the output is, virtually, unlimited in size. The problem with this is that some platforms have their physical memory at high physical addresses (0x80000000 or more), and that the output buffer address and its "unlimited" length cannot be added without overflowing. An example of this can be found in inflate_fast(): /* next_out is the output buffer address */ out = strm->next_out - OFF; /* avail_out is the output buffer size. end will overflow if the output * address is >= 0x80000104 */ end = out + (strm->avail_out - 257); This has huge consequences on the performance of kernel decompression, since the following exit condition of inflate_fast() will be always true: } while (in < last && out < end); Indeed, "end" has overflowed and is now always lower than "out". As a result, inflate_fast() will return after processing one single byte of input data, and will thus need to be called an unreasonably high number of times. This probably went unnoticed because kernel decompression is fast enough even with this issue. Nonetheless, adjusting the output buffer length in such a way that the above pointer arithmetic never overflows results in a kernel decompression that is about 3 times faster on affected machines. Signed-off-by: Alexandre Courbot <acourbot@nvidia.com> Tested-by: Jon Medhurst <tixy@linaro.org> Cc: Stephen Warren <swarren@wwwdotorg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11lib/crc32: update the comments of crc32_{be,le}_generic()Gu Zheng
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11lib/genalloc.c: correct dev_get_gen_pool documentationEmilio López
The documentation mentions a "name" parameter, which does not exist. This commit removes such mention from the function documentation. Signed-off-by: Emilio López <emilio@elopez.com.ar> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11lib/genalloc.c: convert kmalloc_node(...GFP_ZERO...) to kzalloc_node(...)Joe Perches
Use the helper function instead of __GFP_ZERO. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11lib/genalloc.c: fix overflow of ending address of memory chunkJoonyoung Shim
In struct gen_pool_chunk, end_addr means the end address of memory chunk (inclusive), but in the implementation it is treated as address + size of memory chunk (exclusive), so it points to the address plus one instead of correct ending address. The ending address of memory chunk plus one will cause overflow on the memory chunk including the last address of memory map, e.g. when starting address is 0xFFF00000 and size is 0x100000 on 32bit machine, ending address will be 0x100000000. Use correct ending address like starting address + size - 1. [akpm@linux-foundation.org: add comment to struct gen_pool_chunk:end_addr] Signed-off-by: Joonyoung Shim <jy0922.shim@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-10Merge tag 'dm-3.12-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device-mapper updates from Mike Snitzer: "Add the ability to collect I/O statistics on user-defined regions of a device-mapper device. This dm-stats code required the reintroduction of a div64_u64_rem() helper, but as a separate method that doesn't slow down div64_u64() -- especially on 32-bit systems. Allow the error target to replace request-based DM devices (e.g. multipath) in addition to bio-based DM devices. Various other small code fixes and improvements to thin-provisioning, DM cache and the DM ioctl interface" * tag 'dm-3.12-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm stripe: silence a couple sparse warnings dm: add statistics support dm thin: always return -ENOSPC if no_free_space is set dm ioctl: cleanup error handling in table_load dm ioctl: increase granularity of type_lock when loading table dm ioctl: prevent rename to empty name or uuid dm thin: set pool read-only if breaking_sharing fails block allocation dm thin: prefix pool error messages with pool device name dm: allow error target to replace bio-based and request-based targets math64: New separate div64_u64_rem helper dm space map: optimise sm_ll_dec and sm_ll_inc dm btree: prefetch child nodes when walking tree for a dm_btree_del dm btree: use pop_frame in dm_btree_del to cleanup code dm cache: eliminate holes in cache structure dm cache: fix stacking of geometry limits dm thin: fix stacking of geometry limits dm thin: add data block size limits to Documentation dm cache: add data block size limits to code and Documentation dm cache: document metadata device is exclussive to a cache dm: stop using WQ_NON_REENTRANT
2013-09-10Merge tag 'md/3.12' of git://neil.brown.name/mdLinus Torvalds
Pull md update from Neil Brown: "Headline item is multithreading for RAID5 so that more IO/sec can be supported on fast (SSD) devices. Also TILE-Gx SIMD suppor for RAID6 calculations and an assortment of bug fixes" * tag 'md/3.12' of git://neil.brown.name/md: raid5: only wakeup necessary threads md/raid5: flush out all pending requests before proceeding with reshape. md/raid5: use seqcount to protect access to shape in make_request. raid5: sysfs entry to control worker thread number raid5: offload stripe handle to workqueue raid5: fix stripe release order raid5: make release_stripe lockless md: avoid deadlock when dirty buffers during md_stop. md: Don't test all of mddev->flags at once. md: Fix apparent cut-and-paste error in super_90_validate raid6/test: replace echo -e with printf RAID: add tilegx SIMD implementation of raid6 md: fix safe_mode buglet. md: don't call md_allow_write in get_bitmap_file.
2013-09-09idr: Percpu idaKent Overstreet
Percpu frontend for allocating ids. With percpu allocation (that works), it's impossible to guarantee it will always be possible to allocate all nr_tags - typically, some will be stuck on a remote percpu freelist where the current job can't get to them. We do guarantee that it will always be possible to allocate at least (nr_tags / 2) tags - this is done by keeping track of which and how many cpus have tags on their percpu freelists. On allocation failure if enough cpus have tags that there could potentially be (nr_tags / 2) tags stuck on remote percpu freelists, we then pick a remote cpu at random to steal from. Note that there's no cpu hotplug notifier - we don't care, because steal_tags() will eventually get the down cpu's tags. We _could_ satisfy more allocations if we had a notifier - but we'll still meet our guarantees and it's absolutely not a correctness issue, so I don't think it's worth the extra code. From akpm: "It looks OK to me (that's as close as I get to an ack :)) v6 changes: - Add #include <linux/cpumask.h> to include/linux/percpu_ida.h to make alpha/arc builds happy (Fengguang) - Move second (cpu >= nr_cpu_ids) check inside of first check scope in steal_tags() (akpm + nab) v5 changes: - Change percpu_ida->cpus_have_tags to cpumask_t (kmo + akpm) - Add comment for percpu_ida_cpu->lock + ->nr_free (kmo + akpm) - Convert steal_tags() to use cpumask_weight() + cpumask_next() + cpumask_first() + cpumask_clear_cpu() (kmo + akpm) - Add comment for alloc_global_tags() (kmo + akpm) - Convert percpu_ida_alloc() to use cpumask_set_cpu() (kmo + akpm) - Convert percpu_ida_free() to use cpumask_set_cpu() (kmo + akpm) - Drop percpu_ida->cpus_have_tags allocation in percpu_ida_init() (kmo + akpm) - Drop percpu_ida->cpus_have_tags kfree in percpu_ida_destroy() (kmo + akpm) - Add comment for percpu_ida_alloc @ gfp (kmo + akpm) - Move to percpu_ida.c + percpu_ida.h (kmo + akpm + nab) v4 changes: - Fix tags.c reference in percpu_ida_init (akpm) Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2013-09-09Merge tag 'arc-v3.12-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc Pull ARC changes from Vineet Gupta: - ARC MM changes: - preparation for MMUv4 (accomodate new PTE bits, new cmds) - Rework the ASID allocation algorithm to remove asid-mm reverse map - Boilerplate code consolidation in Exception Handlers - Disable FRAME_POINTER for ARC - Unaligned Access Emulation for Big-Endian from Noam - Bunch of fixes (udelay, missing accessors) from Mischa * tag 'arc-v3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc: ARC: fix new Section mismatches in build (post __cpuinit cleanup) Kconfig.debug: Add FRAME_POINTER anti-dependency for ARC ARC: Fix __udelay calculation ARC: remove console_verbose() from setup_arch() ARC: Add read*_relaxed to asm/io.h ARC: Handle un-aligned user space access in BE. ARC: [ASID] Track ASID allocation cycles/generations ARC: [ASID] activate_mm() == switch_mm() ARC: [ASID] get_new_mmu_context() to conditionally allocate new ASID ARC: [ASID] Refactor the TLB paranoid debug code ARC: [ASID] Remove legacy/unused debug code ARC: No need to flush the TLB in early boot ARC: MMUv4 preps/3 - Abstract out TLB Insert/Delete ARC: MMUv4 preps/2 - Reshuffle PTE bits ARC: MMUv4 preps/1 - Fold PTE K/U access flags ARC: Code cosmetics (Nothing semantical) ARC: Entry Handler tweaks: Optimize away redundant IRQ_DISABLE_SAVE ARC: Exception Handlers Code consolidation ARC: Add some .gitignore entries
2013-09-07lockref: add ability to mark lockrefs "dead"Linus Torvalds
The only actual current lockref user (dcache) uses zero reference counts even for perfectly live dentries, because it's a cache: there may not be any users, but that doesn't mean that we want to throw away the dentry. At the same time, the dentry cache does have a notion of a truly "dead" dentry that we must not even increment the reference count of, because we have pruned it and it is not valid. Currently that distinction is not visible in the lockref itself, and the dentry cache validation uses "lockref_get_or_lock()" to either get a new reference to a dentry that already had existing references (and thus cannot be dead), or get the dentry lock so that we can then verify the dentry and increment the reference count under the lock if that verification was successful. That's all somewhat complicated. This adds the concept of being "dead" to the lockref itself, by simply using a count that is negative. This allows a usage scenario where we can increment the refcount of a dentry without having to validate it, and pushing the special "we killed it" case into the lockref code. The dentry code itself doesn't actually use this yet, and it's probably too late in the merge window to do that code (the dentry_kill() code with its "should I decrement the count" logic really is pretty complex code), but let's introduce the concept at the lockref level now. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-07lockref: fix docbook argument namesLinus Torvalds
The code got rewritten, but the comments got copied as-is from older versions, and as a result the argument name in the comment didn't actually match the code any more. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-07Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace changes from Eric Biederman: "This is an assorted mishmash of small cleanups, enhancements and bug fixes. The major theme is user namespace mount restrictions. nsown_capable is killed as it encourages not thinking about details that need to be considered. A very hard to hit pid namespace exiting bug was finally tracked and fixed. A couple of cleanups to the basic namespace infrastructure. Finally there is an enhancement that makes per user namespace capabilities usable as capabilities, and an enhancement that allows the per userns root to nice other processes in the user namespace" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: userns: Kill nsown_capable it makes the wrong thing easy capabilities: allow nice if we are privileged pidns: Don't have unshare(CLONE_NEWPID) imply CLONE_THREAD userns: Allow PR_CAPBSET_DROP in a user namespace. namespaces: Simplify copy_namespaces so it is clear what is going on. pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeup sysfs: Restrict mounting sysfs userns: Better restrictions on when proc and sysfs can be mounted vfs: Don't copy mount bind mounts of /proc/<pid>/ns/mnt between namespaces kernel/nsproxy.c: Improving a snippet of code. proc: Restrict mounting the proc filesystem vfs: Lock in place mounts from more privileged users