summaryrefslogtreecommitdiff
path: root/lib
AgeCommit message (Collapse)Author
2014-01-21mm, show_mem: remove SHOW_MEM_FILTER_PAGE_COUNTMel Gorman
Commit 4b59e6c47309 ("mm, show_mem: suppress page counts in non-blockable contexts") introduced SHOW_MEM_FILTER_PAGE_COUNT to suppress PFN walks on large memory machines. Commit c78e93630d15 ("mm: do not walk all of system memory during show_mem") avoided a PFN walk in the generic show_mem helper which removes the requirement for SHOW_MEM_FILTER_PAGE_COUNT in that case. This patch removes PFN walkers from the arch-specific implementations that report on a per-node or per-zone granularity. ARM and unicore32 still do a PFN walk as they report memory usage on each bank which is a much finer granularity where the debugging information may still be of use. As the remaining arches doing PFN walks have relatively small amounts of memory, this patch simply removes SHOW_MEM_FILTER_PAGE_COUNT. [akpm@linux-foundation.org: fix parisc] Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: James Bottomley <jejb@parisc-linux.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21dma-debug: introduce debug_dma_assert_idle()Dan Williams
Record actively mapped pages and provide an api for asserting a given page is dma inactive before execution proceeds. Placing debug_dma_assert_idle() in cow_user_page() flagged the violation of the dma-api in the NET_DMA implementation (see commit 77873803363c "net_dma: mark broken"). The implementation includes the capability to count, in a limited way, repeat mappings of the same page that occur without an intervening unmap. This 'overlap' counter is limited to the few bits of tag space in a radix tree. This mechanism is added to mitigate false negative cases where, for example, a page is dma mapped twice and debug_dma_assert_idle() is called after the page is un-mapped once. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Vinod Koul <vinod.koul@intel.com> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Cc: James Bottomley <JBottomley@Parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-20Merge tag 'usb-3.14-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Pull USB updates from Greg KH: "Here's the big USB pull request for 3.14-rc1 Lots of little things all over the place, and the usual USB gadget updates, and XHCI fixes (some for an issue reported by a lot of people). USB PHY updates as well as chipidea updates and fixes. All of these have been in the linux-next tree with no reported issues" * tag 'usb-3.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (318 commits) usb: chipidea: udc: using MultO at TD as real mult value for ISO-TX usb: chipidea: need to mask INT_STATUS when write otgsc usb: chipidea: put hw_phymode_configure before ci_usb_phy_init usb: chipidea: Fix Internal error: : 808 [#1] ARM related to STS flag usb: chipidea: imx: set CI_HDRC_IMX28_WRITE_FIX for imx28 usb: chipidea: add freescale imx28 special write register method usb: ehci: add freescale imx28 special write register method usb: core: check for valid id_table when using the RefId feature usb: cdc-wdm: resp_count can be 0 even if WDM_READ is set usb: core: bail out if user gives an unknown RefId when using new_id usb: core: allow a reference device for new_id usb: core: add sanity checks when using bInterfaceClass with new_id USB: image: correct spelling mistake in comment USB: c67x00: correct spelling mistakes in comments usb: delete non-required instances of include <linux/init.h> usb:hub set hub->change_bits when over-current happens Revert "usb: chipidea: imx: set CI_HDRC_IMX28_WRITE_FIX for imx28" xhci: Set scatter-gather limit to avoid failed block writes. xhci: Avoid infinite loop when sg urb requires too many trbs usb: gadget: remove unused variable in gr_queue_int() ...
2014-01-20Merge tag 'driver-core-3.14-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core / sysfs patches from Greg KH: "Here's the big driver core and sysfs patch set for 3.14-rc1. There's a lot of work here moving sysfs logic out into a "kernfs" to allow other subsystems to also have a virtual filesystem with the same attributes of sysfs (handle device disconnect, dynamic creation / removal as needed / unneeded, etc) This is primarily being done for the cgroups filesystem, but the goal is to also move debugfs to it when it is ready, solving all of the known issues in that filesystem as well. The code isn't completed yet, but all should be stable now (there is a big section that was reverted due to problems found when testing) There's also some other smaller fixes, and a driver core addition that allows for a "collection" of objects, that the DRM people will be using soon (it's in this tree to make merges after -rc1 easier) All of this has been in linux-next with no reported issues" * tag 'driver-core-3.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (113 commits) kernfs: associate a new kernfs_node with its parent on creation kernfs: add struct dentry declaration in kernfs.h kernfs: fix get_active failure handling in kernfs_seq_*() Revert "kernfs: fix get_active failure handling in kernfs_seq_*()" Revert "kernfs: replace kernfs_node->u.completion with kernfs_root->deactivate_waitq" Revert "kernfs: remove KERNFS_ACTIVE_REF and add kernfs_lockdep()" Revert "kernfs: remove KERNFS_REMOVED" Revert "kernfs: restructure removal path to fix possible premature return" Revert "kernfs: invoke kernfs_unmap_bin_file() directly from __kernfs_remove()" Revert "kernfs: remove kernfs_addrm_cxt" Revert "kernfs: make kernfs_get_active() block if the node is deactivated but not removed" Revert "kernfs: implement kernfs_{de|re}activate[_self]()" Revert "kernfs, sysfs, driver-core: implement kernfs_remove_self() and its wrappers" Revert "pci: use device_remove_file_self() instead of device_schedule_callback()" Revert "scsi: use device_remove_file_self() instead of device_schedule_callback()" Revert "s390: use device_remove_file_self() instead of device_schedule_callback()" Revert "sysfs, driver-core: remove unused {sysfs|device}_schedule_callback_owner()" Revert "kernfs: remove unnecessary NULL check in __kernfs_remove()" kernfs: remove unnecessary NULL check in __kernfs_remove() drivers/base: provide an infrastructure for componentised subsystems ...
2014-01-20Merge branch 'core-debug-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core debug changes from Ingo Molnar: "Currently there are two methods to set the panic_timeout: via 'panic=X' boot commandline option, or via /proc/sys/kernel/panic. This tree adds a third panic_timeout configuration method: configuration via Kconfig, via CONFIG_PANIC_TIMEOUT=X - useful to distros that generally want their kernel defaults to come with the .config. CONFIG_PANIC_TIMEOUT defaults to 0, which was the previous default value of panic_timeout. Doing that unearthed a few arch trickeries regarding arch-special panic_timeout values and related complications - hopefully all resolved to the satisfaction of everyone" * 'core-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: powerpc: Clean up panic_timeout usage MIPS: Remove panic_timeout settings panic: Make panic_timeout configurable
2014-01-17percpu_counter: unbreak __percpu_counter_add()Hugh Dickins
Commit 74e72f894d56 ("lib/percpu_counter.c: fix __percpu_counter_add()") looked very plausible, but its arithmetic was badly wrong: obvious once you see the fix, but maddening to get there from the weird tmpfs ENOSPCs Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Ming Lei <tom.leiming@gmail.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Shaohua Li <shli@fusionio.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Fan Du <fan.du@windriver.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-15lib/percpu_counter.c: fix __percpu_counter_add()Ming Lei
__percpu_counter_add() may be called in softirq/hardirq handler (such as, blk_mq_queue_exit() is typically called in hardirq/softirq handler), so we need to call this_cpu_add()(irq safe helper) to update percpu counter, otherwise counts may be lost. This fixes the problem that 'rmmod null_blk' hangs in blk_cleanup_queue() because of miscounting of request_queue->mq_usage_counter. This patch is the v1 of previous one of "lib/percpu_counter.c: disable local irq when updating percpu couter", and takes Andrew's approach which may be more efficient for ARCHs(x86, s390) that have optimized this_cpu_add(). Signed-off-by: Ming Lei <tom.leiming@gmail.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Shaohua Li <shli@fusionio.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Fan Du <fan.du@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-08kobject: Fix source code comment spellingBart Van Assche
Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-04Revert "kobject: introduce kobj_completion"Greg Kroah-Hartman
This reverts commit eee031649707db3c9920d9498f8d03819b74fc23. Jeff writes: I have no objections to reverting it. There were concerns from Al Viro that it'd be tough to get right by callers and I had assumed it got dropped after that. I had planned on using it in my btrfs sysfs exports patchset but came up with a better way. Cc: Jeff Mahoney <jeffm@suse.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-24Merge 3.13-rc5 into staging-nextGreg Kroah-Hartman
We want these fixes here to handle some merge issues. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-16Merge branch 3.13-rc4 into usb-nextGreg Kroah-Hartman
2013-12-11kernfs: s/sysfs_dirent/kernfs_node/ and rename its friends accordinglyTejun Heo
kernfs has just been separated out from sysfs and we're already in full conflict mode. Nothing can make the situation any worse. Let's take the chance to name things properly. This patch performs the following renames. * s/sysfs_elem_dir/kernfs_elem_dir/ * s/sysfs_elem_symlink/kernfs_elem_symlink/ * s/sysfs_elem_attr/kernfs_elem_file/ * s/sysfs_dirent/kernfs_node/ * s/sd/kn/ in kernfs proper * s/parent_sd/parent/ * s/target_sd/target/ * s/dir_sd/parent/ * s/to_sysfs_dirent()/rb_to_kn()/ * misc renames of local vars when they conflict with the above Because md, mic and gpio dig into sysfs details, this patch ends up modifying them. All are sysfs_dirent renames and trivial. While we can avoid these by introducing a dummy wrapping struct sysfs_dirent around kernfs_node, given the limited usage outside kernfs and sysfs proper, I don't think such workaround is called for. This patch is strictly rename only and doesn't introduce any functional difference. - mic / gpio renames were missing. Spotted by kbuild test robot. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Neil Brown <neilb@suse.de> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> Cc: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-08kobject: fix memory leak in kobject_set_name_vargsMaurizio Lombardi
If the call to kvasprintf fails then the old name of the object will be leaked, this patch fixes the bug by restoring the old name before returning ENOMEM. Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-08lib/scatterlist: export sg_miter_skip()Ming Lei
sg_copy_buffer() can't meet demand for some drrivers(such usb mass storage), so we have to use the sg_miter_* APIs to access sg buffer, then need export sg_miter_skip() for these drivers. The API is needed for converting to sg_miter_* APIs in USB storage driver for accessing sg buffer. Acked-by: Andrew Morton <akpm@linux-foundation.org> Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Ming Lei <ming.lei@canonical.com> Reviewed-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-07kobject: remove kset from sysfs immediately in kset_unregister()Bjorn Helgaas
There's no "unlink from sysfs" interface for ksets, so I think callers of kset_unregister() expect the kset to be removed from sysfs immediately, without waiting for the last reference to be released. This patch makes the sysfs removal happen immediately, so the caller may create a new kset with the same name as soon as kset_unregister() returns. Without this, every caller has to call "kobject_del(&kset->kobj)" first unless it knows it will never create a new kset with the same name. This sometimes shows up on module unload and reload, where the reload fails because it tries to create a kobject with the same name as one from the original load that still exists. CONFIG_DEBUG_KOBJECT_RELEASE=y makes this problem easier to hit. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-07kobject: delay kobject release for random timeBjorn Helgaas
When CONFIG_DEBUG_KOBJECT_RELEASE=y, delay kobject release functions for a random time between 1 and 8 seconds, which effectively changes the order in which they're called. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-02KEYS: Fix multiple key add into associative arrayDavid Howells
If sufficient keys (or keyrings) are added into a keyring such that a node in the associative array's tree overflows (each node has a capacity N, currently 16) and such that all N+1 keys have the same index key segment for that level of the tree (the level'th nibble of the index key), then assoc_array_insert() calls ops->diff_objects() to indicate at which bit position the two index keys vary. However, __key_link_begin() passes a NULL object to assoc_array_insert() with the intention of supplying the correct pointer later before we commit the change. This means that keyring_diff_objects() is given a NULL pointer as one of its arguments which it does not expect. This results in an oops like the attached. With the previous patch to fix the keyring hash function, this can be forced much more easily by creating a keyring and only adding keyrings to it. Add any other sort of key and a different insertion path is taken - all 16+1 objects must want to cluster in the same node slot. This can be tested by: r=`keyctl newring sandbox @s` for ((i=0; i<=16; i++)); do keyctl newring ring$i $r; done This should work fine, but oopses when the 17th keyring is added. Since ops->diff_objects() is always called with the first pointer pointing to the object to be inserted (ie. the NULL pointer), we can fix the problem by changing the to-be-inserted object pointer to point to the index key passed into assoc_array_insert() instead. Whilst we're at it, we also switch the arguments so that they are the same as for ->compare_object(). BUG: unable to handle kernel NULL pointer dereference at 0000000000000088 IP: [<ffffffff81191ee4>] hash_key_type_and_desc+0x18/0xb0 ... RIP: 0010:[<ffffffff81191ee4>] hash_key_type_and_desc+0x18/0xb0 ... Call Trace: [<ffffffff81191f9d>] keyring_diff_objects+0x21/0xd2 [<ffffffff811f09ef>] assoc_array_insert+0x3b6/0x908 [<ffffffff811929a7>] __key_link_begin+0x78/0xe5 [<ffffffff81191a2e>] key_create_or_update+0x17d/0x36a [<ffffffff81192e0a>] SyS_add_key+0x123/0x183 [<ffffffff81400ddb>] tracesys+0xdd/0xe2 Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Stephen Gallagher <sgallagh@redhat.com>
2013-11-29sysfs, kernfs: introduce kernfs_create_dir[_ns]()Tejun Heo
Introduce kernfs interface to manipulate a directory which takes and returns sysfs_dirents. create_dir() is renamed to kernfs_create_dir_ns() and its argumantes and return value are updated. create_dir() usages are replaced with kernfs_create_dir_ns() and sysfs_create_subdir() usages are replaced with kernfs_create_dir(). Dup warnings are handled explicitly by sysfs users of the kernfs interface. sysfs_enable_ns() is renamed to kernfs_enable_ns(). This patch doesn't introduce any behavior changes. v2: Dummy implementation for !CONFIG_SYSFS updated to return -ENOSYS. v3: kernfs_enable_ns() added. v4: Refreshed on top of "sysfs: drop kobj_ns_type handling, take #2" so that this patch removes sysfs_enable_ns(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-29Merge 3.13-rc2 into driver-core-nextGreg Kroah-Hartman
We want those fixes in here as well. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-27lockref: include mutex.h rather than reinvent arch_mutex_cpu_relaxWill Deacon
arch_mutex_cpu_relax is already conditionally defined in mutex.h, so simply include that header rather than replicate the code here. Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-27sysfs: drop kobj_ns_type handling, take #2Tejun Heo
The way namespace tags are implemented in sysfs is more complicated than necessary. As each tag is a pointer value and required to be non-NULL under a namespace enabled parent, there's no need to record separately what type each tag is. If multiple namespace types are needed, which currently aren't, we can simply compare the tag to a set of allowed tags in the superblock assuming that the tags, being pointers, won't have the same value across multiple types. This patch rips out kobj_ns_type handling from sysfs. sysfs now has an enable switch to turn on namespace under a node. If enabled, all children are required to have non-NULL namespace tags and filtered against the super_block's tag. kobject namespace determination is now performed in lib/kobject.c::create_dir() making sysfs_read_ns_type() unnecessary. The sanity checks are also moved. create_dir() is restructured to ease such addition. This removes most kobject namespace knowledge from sysfs proper which will enable proper separation and layering of sysfs. This is the second try. The first one was cb26a311578e ("sysfs: drop kobj_ns_type handling") which tried to automatically enable namespace if there are children with non-NULL namespace tags; however, it was broken for symlinks as they should inherit the target's tag iff namespace is enabled in the parent. This led to namespace filtering enabled incorrectly for wireless net class devices through phy80211 symlinks and thus network configuration failure. a1212d278c05 ("Revert "sysfs: drop kobj_ns_type handling"") reverted the commit. This shouldn't introduce any behavior changes, for real. v2: Dummy implementation of sysfs_enable_ns() for !CONFIG_SYSFS was missing and caused build failure. Reported by kbuild test robot. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Kay Sievers <kay@vrfy.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-26panic: Make panic_timeout configurableJason Baron
The panic_timeout value can be set via the command line option 'panic=x', or via /proc/sys/kernel/panic, however that is not sufficient when the panic occurs before we are able to set up these values. Thus, add a CONFIG_PANIC_TIMEOUT so that we can set the desired value from the .config. The default panic_timeout value continues to be 0 - wait forever. Also adds set_arch_panic_timeout(new_timeout, arch_default_timeout), which is intended to be used by arches in arch_setup(). The idea being that the new_timeout is only set if the user hasn't changed from the arch_default_timeout. Signed-off-by: Jason Baron <jbaron@akamai.com> Cc: benh@kernel.crashing.org Cc: paulus@samba.org Cc: ralf@linux-mips.org Cc: mpe@ellerman.id.au Cc: felipe.contreras@gmail.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1a1674daec27c534df409697025ac568ebcee91e.1385418410.git.jbaron@akamai.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-22Merge branch 'for-next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending Pull SCSI target updates from Nicholas Bellinger: "Things have been quiet this round with mostly bugfixes, percpu conversions, and other minor iscsi-target conformance testing changes. The highlights include: - Add demo_mode_discovery attribute for iscsi-target (Thomas) - Convert tcm_fc(FCoE) to use percpu-ida pre-allocation - Add send completion interrupt coalescing for ib_isert - Convert target-core to use percpu-refcounting for se_lun - Fix mutex_trylock usage bug in iscsit_increment_maxcmdsn - tcm_loop updates (Hannes) - target-core ALUA cleanups + prep for v3.14 SCSI Referrals support (Hannes) v3.14 is currently shaping to be a busy development cycle in target land, with initial support for T10 Referrals and T10 DIF currently on the roadmap" * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (40 commits) iscsi-target: chap auth shouldn't match username with trailing garbage iscsi-target: fix extract_param to handle buffer length corner case iscsi-target: Expose default_erl as TPG attribute target_core_configfs: split up ALUA supported states target_core_alua: Make supported states configurable target_core_alua: Store supported ALUA states target_core_alua: Rename ALUA_ACCESS_STATE_OPTIMIZED target_core_alua: spellcheck target core: rename (ex,im)plict -> (ex,im)plicit percpu-refcount: Add percpu-refcount.o to obj-y iscsi-target: Do not reject non-immediate CmdSNs exceeding MaxCmdSN iscsi-target: Convert iscsi_session statistics to atomic_long_t target: Convert se_device statistics to atomic_long_t target: Fix delayed Task Aborted Status (TAS) handling bug iscsi-target: Reject unsupported multi PDU text command sequence ib_isert: Avoid duplicate iscsit_increment_maxcmdsn call iscsi-target: Fix mutex_trylock usage in iscsit_increment_maxcmdsn target: Core does not need blkdev.h target: Pass through I/O topology for block backstores iser-target: Avoid using FRMR for single dma entry requests ...
2013-11-21Merge branch 'for-linus2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull security subsystem updates from James Morris: "In this patchset, we finally get an SELinux update, with Paul Moore taking over as maintainer of that code. Also a significant update for the Keys subsystem, as well as maintenance updates to Smack, IMA, TPM, and Apparmor" and since I wanted to know more about the updates to key handling, here's the explanation from David Howells on that: "Okay. There are a number of separate bits. I'll go over the big bits and the odd important other bit, most of the smaller bits are just fixes and cleanups. If you want the small bits accounting for, I can do that too. (1) Keyring capacity expansion. KEYS: Consolidate the concept of an 'index key' for key access KEYS: Introduce a search context structure KEYS: Search for auth-key by name rather than target key ID Add a generic associative array implementation. KEYS: Expand the capacity of a keyring Several of the patches are providing an expansion of the capacity of a keyring. Currently, the maximum size of a keyring payload is one page. Subtract a small header and then divide up into pointers, that only gives you ~500 pointers on an x86_64 box. However, since the NFS idmapper uses a keyring to store ID mapping data, that has proven to be insufficient to the cause. Whatever data structure I use to handle the keyring payload, it can only store pointers to keys, not the keys themselves because several keyrings may point to a single key. This precludes inserting, say, and rb_node struct into the key struct for this purpose. I could make an rbtree of records such that each record has an rb_node and a key pointer, but that would use four words of space per key stored in the keyring. It would, however, be able to use much existing code. I selected instead a non-rebalancing radix-tree type approach as that could have a better space-used/key-pointer ratio. I could have used the radix tree implementation that we already have and insert keys into it by their serial numbers, but that means any sort of search must iterate over the whole radix tree. Further, its nodes are a bit on the capacious side for what I want - especially given that key serial numbers are randomly allocated, thus leaving a lot of empty space in the tree. So what I have is an associative array that internally is a radix-tree with 16 pointers per node where the index key is constructed from the key type pointer and the key description. This means that an exact lookup by type+description is very fast as this tells us how to navigate directly to the target key. I made the data structure general in lib/assoc_array.c as far as it is concerned, its index key is just a sequence of bits that leads to a pointer. It's possible that someone else will be able to make use of it also. FS-Cache might, for example. (2) Mark keys as 'trusted' and keyrings as 'trusted only'. KEYS: verify a certificate is signed by a 'trusted' key KEYS: Make the system 'trusted' keyring viewable by userspace KEYS: Add a 'trusted' flag and a 'trusted only' flag KEYS: Separate the kernel signature checking keyring from module signing These patches allow keys carrying asymmetric public keys to be marked as being 'trusted' and allow keyrings to be marked as only permitting the addition or linkage of trusted keys. Keys loaded from hardware during kernel boot or compiled into the kernel during build are marked as being trusted automatically. New keys can be loaded at runtime with add_key(). They are checked against the system keyring contents and if their signatures can be validated with keys that are already marked trusted, then they are marked trusted also and can thus be added into the master keyring. Patches from Mimi Zohar make this usable with the IMA keyrings also. (3) Remove the date checks on the key used to validate a module signature. X.509: Remove certificate date checks It's not reasonable to reject a signature just because the key that it was generated with is no longer valid datewise - especially if the kernel hasn't yet managed to set the system clock when the first module is loaded - so just remove those checks. (4) Make it simpler to deal with additional X.509 being loaded into the kernel. KEYS: Load *.x509 files into kernel keyring KEYS: Have make canonicalise the paths of the X.509 certs better to deduplicate The builder of the kernel now just places files with the extension ".x509" into the kernel source or build trees and they're concatenated by the kernel build and stuffed into the appropriate section. (5) Add support for userspace kerberos to use keyrings. KEYS: Add per-user_namespace registers for persistent per-UID kerberos caches KEYS: Implement a big key type that can save to tmpfs Fedora went to, by default, storing kerberos tickets and tokens in tmpfs. We looked at storing it in keyrings instead as that confers certain advantages such as tickets being automatically deleted after a certain amount of time and the ability for the kernel to get at these tokens more easily. To make this work, two things were needed: (a) A way for the tickets to persist beyond the lifetime of all a user's sessions so that cron-driven processes can still use them. The problem is that a user's session keyrings are deleted when the session that spawned them logs out and the user's user keyring is deleted when the UID is deleted (typically when the last log out happens), so neither of these places is suitable. I've added a system keyring into which a 'persistent' keyring is created for each UID on request. Each time a user requests their persistent keyring, the expiry time on it is set anew. If the user doesn't ask for it for, say, three days, the keyring is automatically expired and garbage collected using the existing gc. All the kerberos tokens it held are then also gc'd. (b) A key type that can hold really big tickets (up to 1MB in size). The problem is that Active Directory can return huge tickets with lots of auxiliary data attached. We don't, however, want to eat up huge tracts of unswappable kernel space for this, so if the ticket is greater than a certain size, we create a swappable shmem file and dump the contents in there and just live with the fact we then have an inode and a dentry overhead. If the ticket is smaller than that, we slap it in a kmalloc()'d buffer" * 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (121 commits) KEYS: Fix keyring content gc scanner KEYS: Fix error handling in big_key instantiation KEYS: Fix UID check in keyctl_get_persistent() KEYS: The RSA public key algorithm needs to select MPILIB ima: define '_ima' as a builtin 'trusted' keyring ima: extend the measurement list to include the file signature kernel/system_certificate.S: use real contents instead of macro GLOBAL() KEYS: fix error return code in big_key_instantiate() KEYS: Fix keyring quota misaccounting on key replacement and unlink KEYS: Fix a race between negating a key and reading the error set KEYS: Make BIG_KEYS boolean apparmor: remove the "task" arg from may_change_ptraced_domain() apparmor: remove parent task info from audit logging apparmor: remove tsk field from the apparmor_audit_struct apparmor: fix capability to not use the current task, during reporting Smack: Ptrace access check mode ima: provide hash algo info in the xattr ima: enable support for larger default filedata hash algorithms ima: define kernel parameter 'ima_template=' to change configured default ima: add Kconfig default measurement list template ...
2013-11-19percpu-refcount: Add percpu-refcount.o to obj-yRandy Dunlap
Drop percpu_ida.o from lib-y since it is also listed in obj-y and it doesn't need to be listed in both places. Move percpu-refcount.o from lib-y to obj-y to fix build errors in target_core_mod: ERROR: "percpu_ref_cancel_init" [drivers/target/target_core_mod.ko] undefined! ERROR: "percpu_ref_kill_and_confirm" [drivers/target/target_core_mod.ko] undefined! ERROR: "percpu_ref_init" [drivers/target/target_core_mod.ko] undefined! Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2013-11-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: "Mostly these are fixes for fallout due to merge window changes, as well as cures for problems that have been with us for a much longer period of time" 1) Johannes Berg noticed two major deficiencies in our genetlink registration. Some genetlink protocols we passing in constant counts for their ops array rather than something like ARRAY_SIZE(ops) or similar. Also, some genetlink protocols were using fixed IDs for their multicast groups. We have to retain these fixed IDs to keep existing userland tools working, but reserve them so that other multicast groups used by other protocols can not possibly conflict. In dealing with these two problems, we actually now use less state management for genetlink operations and multicast groups. 2) When configuring interface hardware timestamping, fix several drivers that simply do not validate that the hwtstamp_config value is one the driver actually supports. From Ben Hutchings. 3) Invalid memory references in mwifiex driver, from Amitkumar Karwar. 4) In dev_forward_skb(), set the skb->protocol in the right order relative to skb_scrub_packet(). From Alexei Starovoitov. 5) Bridge erroneously fails to use the proper wrapper functions to make calls to netdev_ops->ndo_vlan_rx_{add,kill}_vid. Fix from Toshiaki Makita. 6) When detaching a bridge port, make sure to flush all VLAN IDs to prevent them from leaking, also from Toshiaki Makita. 7) Put in a compromise for TCP Small Queues so that deep queued devices that delay TX reclaim non-trivially don't have such a performance decrease. One particularly problematic area is 802.11 AMPDU in wireless. From Eric Dumazet. 8) Fix crashes in tcp_fastopen_cache_get(), we can see NULL socket dsts here. Fix from Eric Dumzaet, reported by Dave Jones. 9) Fix use after free in ipv6 SIT driver, from Willem de Bruijn. 10) When computing mergeable buffer sizes, virtio-net fails to take the virtio-net header into account. From Michael Dalton. 11) Fix seqlock deadlock in ip4_datagram_connect() wrt. statistic bumping, this one has been with us for a while. From Eric Dumazet. 12) Fix NULL deref in the new TIPC fragmentation handling, from Erik Hugne. 13) 6lowpan bit used for traffic classification was wrong, from Jukka Rissanen. 14) macvlan has the same issue as normal vlans did wrt. propagating LRO disabling down to the real device, fix it the same way. From Michal Kubecek. 15) CPSW driver needs to soft reset all slaves during suspend, from Daniel Mack. 16) Fix small frame pacing in FQ packet scheduler, from Eric Dumazet. 17) The xen-netfront RX buffer refill timer isn't properly scheduled on partial RX allocation success, from Ma JieYue. 18) When ipv6 ping protocol support was added, the AF_INET6 protocol initialization cleanup path on failure was borked a little. Fix from Vlad Yasevich. 19) If a socket disconnects during a read/recvmsg/recvfrom/etc that blocks we can do the wrong thing with the msg_name we write back to userspace. From Hannes Frederic Sowa. There is another fix in the works from Hannes which will prevent future problems of this nature. 20) Fix route leak in VTI tunnel transmit, from Fan Du. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (106 commits) genetlink: make multicast groups const, prevent abuse genetlink: pass family to functions using groups genetlink: add and use genl_set_err() genetlink: remove family pointer from genl_multicast_group genetlink: remove genl_unregister_mc_group() hsr: don't call genl_unregister_mc_group() quota/genetlink: use proper genetlink multicast APIs drop_monitor/genetlink: use proper genetlink multicast APIs genetlink: only pass array to genl_register_family_with_ops() tcp: don't update snd_nxt, when a socket is switched from repair mode atm: idt77252: fix dev refcnt leak xfrm: Release dst if this dst is improper for vti tunnel netlink: fix documentation typo in netlink_set_err() be2net: Delete secondary unicast MAC addresses during be_close be2net: Fix unconditional enabling of Rx interface options net, virtio_net: replace the magic value ping: prevent NULL pointer dereference on write to msg_name bnx2x: Prevent "timeout waiting for state X" bnx2x: prevent CFC attention bnx2x: Prevent panic during DMAE timeout ...
2013-11-15Merge tag 'stable/for-linus-3.13-rc0-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull Xen updates from Konrad Rzeszutek Wilk: "This has tons of fixes and two major features which are concentrated around the Xen SWIOTLB library. The short <blurb> is that the tracing facility (just one function) has been added to SWIOTLB to make it easier to track I/O progress. Additionally under Xen and ARM (32 & 64) the Xen-SWIOTLB driver "is used to translate physical to machine and machine to physical addresses of foreign[guest] pages for DMA operations" (Stefano) when booting under hardware without proper IOMMU. There are also bug-fixes, cleanups, compile warning fixes, etc. The commit times for some of the commits is a bit fresh - that is b/c we wanted to make sure we have the Ack's from the ARM folks - which with the string of back-to-back conferences took a bit of time. Rest assured - the code has been stewing in #linux-next for some time. Features: - SWIOTLB has tracing added when doing bounce buffer. - Xen ARM/ARM64 can use Xen-SWIOTLB. This work allows Linux to safely program real devices for DMA operations when running as a guest on Xen on ARM, without IOMMU support. [*1] - xen_raw_printk works with PVHVM guests if needed. Bug-fixes: - Make memory ballooning work under HVM with large MMIO region. - Inform hypervisor of MCFG regions found in ACPI DSDT. - Remove deprecated IRQF_DISABLED. - Remove deprecated __cpuinit. [*1]: "On arm and arm64 all Xen guests, including dom0, run with second stage translation enabled. As a consequence when dom0 programs a device for a DMA operation is going to use (pseudo) physical addresses instead machine addresses. This work introduces two trees to track physical to machine and machine to physical mappings of foreign pages. Local pages are assumed mapped 1:1 (physical address == machine address). It enables the SWIOTLB-Xen driver on ARM and ARM64, so that Linux can translate physical addresses to machine addresses for dma operations when necessary. " (Stefano)" * tag 'stable/for-linus-3.13-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (32 commits) xen/arm: pfn_to_mfn and mfn_to_pfn return the argument if nothing is in the p2m arm,arm64/include/asm/io.h: define struct bio_vec swiotlb-xen: missing include dma-direction.h pci-swiotlb-xen: call pci_request_acs only ifdef CONFIG_PCI arm: make SWIOTLB available xen: delete new instances of added __cpuinit xen/balloon: Set balloon's initial state to number of existing RAM pages xen/mcfg: Call PHYSDEVOP_pci_mmcfg_reserved for MCFG areas. xen: remove deprecated IRQF_DISABLED x86/xen: remove deprecated IRQF_DISABLED swiotlb-xen: fix error code returned by xen_swiotlb_map_sg_attrs swiotlb-xen: static inline xen_phys_to_bus, xen_bus_to_phys, xen_virt_to_bus and range_straddles_page_boundary grant-table: call set_phys_to_machine after mapping grant refs arm,arm64: do not always merge biovec if we are running on Xen swiotlb: print a warning when the swiotlb is full swiotlb-xen: use xen_dma_map/unmap_page, xen_dma_sync_single_for_cpu/device xen: introduce xen_dma_map/unmap_page and xen_dma_sync_single_for_cpu/device tracing/events: Fix swiotlb tracepoint creation swiotlb-xen: use xen_alloc/free_coherent_pages xen: introduce xen_alloc/free_coherent_pages ...
2013-11-15kfifo: kfifo_copy_{to,from}_user: fix copied bytes calculationLars-Peter Clausen
'copied' and 'len' are in bytes, while 'ret' is in elements, so we need to multiply 'ret' with the size of one element to get the correct result. Signed-off-by: Lars-Peter Clausen <lars@metafoo.de> Cc: Stefani Seibold <stefani@seibold.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15llists-move-llist_reverse_order-from-raid5-to-llistc-fixAndrew Morton
fix comment typo, per Jan Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15llists: move llist_reverse_order from raid5 to llist.cChristoph Hellwig
Make this useful helper available for other users. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15vsprintf: ignore %n againKees Cook
This ignores %n in printf again, as was originally documented. Implementing %n poses a greater security risk than utility, so it should stay ignored. To help anyone attempting to use %n, a warning will be emitted if it is encountered. Based on an earlier patch by Joe Perches. Because %n was designed to write to pointers on the stack, it has been frequently used as an attack vector when bugs are found that leak user-controlled strings into functions that ultimately process format strings. While this class of bug can still be turned into an information leak, removing %n eliminates the common method of elevating such a bug into an arbitrary kernel memory writing primitive, significantly reducing the danger of this class of bug. For seq_file users that need to know the length of a written string for padding, please see seq_setwidth() and seq_pad() instead. Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Joe Perches <joe@perches.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15lockref: use BLOATED_SPINLOCKS to avoid explicit config dependenciesPeter Zijlstra
Avoid the fragile Kconfig construct guestimating spinlock_t sizes; use a friendly compile-time test to determine this. [kirill.shutemov@linux.intel.com: drop CONFIG_CMPXCHG_LOCKREF] Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-14random32: use msecs_to_jiffies for reseed timerDaniel Borkmann
Use msecs_to_jiffies, for these calculations as different HZ considerations are taken into account for conversion of the timer shot, and also it makes the code more readable. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-14random32: add __init prefix to prandom_start_seed_timerDaniel Borkmann
We only call that in functions annotated with __init, so add __init prefix in prandom_start_seed_timer() as well, so that the kernel can make use of this hint and we can possibly free up resources after it's usage. And since it's an internal function rename it to __prandom_start_seed_timer(). Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-14Merge branch 'core-locking-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core locking changes from Ingo Molnar: "The biggest changes: - add lockdep support for seqcount/seqlocks structures, this unearthed both bugs and required extra annotation. - move the various kernel locking primitives to the new kernel/locking/ directory" * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits) block: Use u64_stats_init() to initialize seqcounts locking/lockdep: Mark __lockdep_count_forward_deps() as static lockdep/proc: Fix lock-time avg computation locking/doc: Update references to kernel/mutex.c ipv6: Fix possible ipv6 seqlock deadlock cpuset: Fix potential deadlock w/ set_mems_allowed seqcount: Add lockdep functionality to seqcount/seqlock structures net: Explicitly initialize u64_stats_sync structures for lockdep locking: Move the percpu-rwsem code to kernel/locking/ locking: Move the lglocks code to kernel/locking/ locking: Move the rwsem code to kernel/locking/ locking: Move the rtmutex code to kernel/locking/ locking: Move the semaphore core to kernel/locking/ locking: Move the spinlock code to kernel/locking/ locking: Move the lockdep code to kernel/locking/ locking: Move the mutex code to kernel/locking/ hung_task debugging: Add tracepoint to report the hang x86/locking/kconfig: Update paravirt spinlock Kconfig description lockstat: Report avg wait and hold times lockdep, x86/alternatives: Drop ancient lockdep fixup message ...
2013-11-14Merge branch 'for-3.13/core' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block IO core updates from Jens Axboe: "This is the pull request for the core changes in the block layer for 3.13. It contains: - The new blk-mq request interface. This is a new and more scalable queueing model that marries the best part of the request based interface we currently have (which is fully featured, but scales poorly) and the bio based "interface" which the new drivers for high IOPS devices end up using because it's much faster than the request based one. The bio interface has no block layer support, since it taps into the stack much earlier. This means that drivers end up having to implement a lot of functionality on their own, like tagging, timeout handling, requeue, etc. The blk-mq interface provides all these. Some drivers even provide a switch to select bio or rq and has code to handle both, since things like merging only works in the rq model and hence is faster for some workloads. This is a huge mess. Conversion of these drivers nets us a substantial code reduction. Initial results on converting SCSI to this model even shows an 8x improvement on single queue devices. So while the model was intended to work on the newer multiqueue devices, it has substantial improvements for "classic" hardware as well. This code has gone through extensive testing and development, it's now ready to go. A pull request is coming to convert virtio-blk to this model will be will be coming as well, with more drivers scheduled for 3.14 conversion. - Two blktrace fixes from Jan and Chen Gang. - A plug merge fix from Alireza Haghdoost. - Conversion of __get_cpu_var() from Christoph Lameter. - Fix for sector_div() with 64-bit divider from Geert Uytterhoeven. - A fix for a race between request completion and the timeout handling from Jeff Moyer. This is what caused the merge conflict with blk-mq/core, in case you are looking at that. - A dm stacking fix from Mike Snitzer. - A code consolidation fix and duplicated code removal from Kent Overstreet. - A handful of block bug fixes from Mikulas Patocka, fixing a loop crash and memory corruption on blk cg. - Elevator switch bug fix from Tomoki Sekiyama. A heads-up that I had to rebase this branch. Initially the immutable bio_vecs had been queued up for inclusion, but a week later, it became clear that it wasn't fully cooked yet. So the decision was made to pull this out and postpone it until 3.14. It was a straight forward rebase, just pruning out the immutable series and the later fixes of problems with it. The rest of the patches applied directly and no further changes were made" * 'for-3.13/core' of git://git.kernel.dk/linux-block: (31 commits) block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO block: Do not call sector_div() with a 64-bit divisor kernel: trace: blktrace: remove redundent memcpy() in compat_blk_trace_setup() block: Consolidate duplicated bio_trim() implementations block: Use rw_copy_check_uvector() block: Enable sysfs nomerge control for I/O requests in the plug list block: properly stack underlying max_segment_size to DM device elevator: acquire q->sysfs_lock in elevator_change() elevator: Fix a race in elevator switching and md device initialization block: Replace __get_cpu_var uses bdi: test bdi_init failure block: fix a probe argument to blk_register_region loop: fix crash if blk_alloc_queue fails blk-core: Fix memory corruption if blkcg_init_queue fails block: fix race between request completion and timeout handling blktrace: Send BLK_TN_PROCESS events to all running traces blk-mq: don't disallow request merges for req->special being set blk-mq: mq plug list breakage blk-mq: fix for flush deadlock ...
2013-11-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds
Pull networking updates from David Miller: 1) The addition of nftables. No longer will we need protocol aware firewall filtering modules, it can all live in userspace. At the core of nftables is a, for lack of a better term, virtual machine that executes byte codes to inspect packet or metadata (arriving interface index, etc.) and make verdict decisions. Besides support for loading packet contents and comparing them, the interpreter supports lookups in various datastructures as fundamental operations. For example sets are supports, and therefore one could create a set of whitelist IP address entries which have ACCEPT verdicts attached to them, and use the appropriate byte codes to do such lookups. Since the interpreted code is composed in userspace, userspace can do things like optimize things before giving it to the kernel. Another major improvement is the capability of atomically updating portions of the ruleset. In the existing netfilter implementation, one has to update the entire rule set in order to make a change and this is very expensive. Userspace tools exist to create nftables rules using existing netfilter rule sets, but both kernel implementations will need to co-exist for quite some time as we transition from the old to the new stuff. Kudos to Patrick McHardy, Pablo Neira Ayuso, and others who have worked so hard on this. 2) Daniel Borkmann and Hannes Frederic Sowa made several improvements to our pseudo-random number generator, mostly used for things like UDP port randomization and netfitler, amongst other things. In particular the taus88 generater is updated to taus113, and test cases are added. 3) Support 64-bit rates in HTB and TBF schedulers, from Eric Dumazet and Yang Yingliang. 4) Add support for new 577xx tigon3 chips to tg3 driver, from Nithin Sujir. 5) Fix two fatal flaws in TCP dynamic right sizing, from Eric Dumazet, Neal Cardwell, and Yuchung Cheng. 6) Allow IP_TOS and IP_TTL to be specified in sendmsg() ancillary control message data, much like other socket option attributes. From Francesco Fusco. 7) Allow applications to specify a cap on the rate computed automatically by the kernel for pacing flows, via a new SO_MAX_PACING_RATE socket option. From Eric Dumazet. 8) Make the initial autotuned send buffer sizing in TCP more closely reflect actual needs, from Eric Dumazet. 9) Currently early socket demux only happens for TCP sockets, but we can do it for connected UDP sockets too. Implementation from Shawn Bohrer. 10) Refactor inet socket demux with the goal of improving hash demux performance for listening sockets. With the main goals being able to use RCU lookups on even request sockets, and eliminating the listening lock contention. From Eric Dumazet. 11) The bonding layer has many demuxes in it's fast path, and an RCU conversion was started back in 3.11, several changes here extend the RCU usage to even more locations. From Ding Tianhong and Wang Yufen, based upon suggestions by Nikolay Aleksandrov and Veaceslav Falico. 12) Allow stackability of segmentation offloads to, in particular, allow segmentation offloading over tunnels. From Eric Dumazet. 13) Significantly improve the handling of secret keys we input into the various hash functions in the inet hashtables, TCP fast open, as well as syncookies. From Hannes Frederic Sowa. The key fundamental operation is "net_get_random_once()" which uses static keys. Hannes even extended this to ipv4/ipv6 fragmentation handling and our generic flow dissector. 14) The generic driver layer takes care now to set the driver data to NULL on device removal, so it's no longer necessary for drivers to explicitly set it to NULL any more. Many drivers have been cleaned up in this way, from Jingoo Han. 15) Add a BPF based packet scheduler classifier, from Daniel Borkmann. 16) Improve CRC32 interfaces and generic SKB checksum iterators so that SCTP's checksumming can more cleanly be handled. Also from Daniel Borkmann. 17) Add a new PMTU discovery mode, IP_PMTUDISC_INTERFACE, which forces using the interface MTU value. This helps avoid PMTU attacks, particularly on DNS servers. From Hannes Frederic Sowa. 18) Use generic XPS for transmit queue steering rather than internal (re-)implementation in virtio-net. From Jason Wang. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits) random32: add test cases for taus113 implementation random32: upgrade taus88 generator to taus113 from errata paper random32: move rnd_state to linux/random.h random32: add prandom_reseed_late() and call when nonblocking pool becomes initialized random32: add periodic reseeding random32: fix off-by-one in seeding requirement PHY: Add RTL8201CP phy_driver to realtek xtsonic: add missing platform_set_drvdata() in xtsonic_probe() macmace: add missing platform_set_drvdata() in mace_probe() ethernet/arc/arc_emac: add missing platform_set_drvdata() in arc_emac_probe() ipv6: protect for_each_sk_fl_rcu in mem_check with rcu_read_lock_bh vlan: Implement vlan_dev_get_egress_qos_mask as an inline. ixgbe: add warning when max_vfs is out of range. igb: Update link modes display in ethtool netfilter: push reasm skb through instead of original frag skbs ip6_output: fragment outgoing reassembled skb properly MAINTAINERS: mv643xx_eth: take over maintainership from Lennart net_sched: tbf: support of 64bit rates ixgbe: deleting dfwd stations out of order can cause null ptr deref ixgbe: fix build err, num_rx_queues is only available with CONFIG_RPS ...
2013-11-13lib/genalloc: add a helper function for DMA buffer allocationNicolin Chen
When using pool space for DMA buffer, there might be duplicated calling of gen_pool_alloc() and gen_pool_virt_to_phys() in each implementation. Thus it's better to add a simple helper function, a compatible one to the common dma_alloc_coherent(), to save some code. Signed-off-by: Nicolin Chen <b42378@freescale.com> Cc: "Hans J. Koch" <hjk@hansjkoch.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Eric Miao <eric.y.miao@gmail.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Haojian Zhuang <haojian.zhuang@gmail.com> Cc: Jaroslav Kysela <perex@perex.cz> Cc: Kevin Hilman <khilman@deeprootsystems.com> Cc: Liam Girdwood <lgirdwood@gmail.com> Cc: Mark Brown <broonie@kernel.org> Cc: Mauro Carvalho Chehab <m.chehab@samsung.com> Cc: Rob Herring <rob.herring@calxeda.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Sekhar Nori <nsekhar@ti.com> Cc: Takashi Iwai <tiwai@suse.de> Cc: Vinod Koul <vinod.koul@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13lib/digsig.c: use ERR_CAST inlined function instead of ERR_PTR(PTR_ERR(...))Duan Jiong
Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Cc: James Morris <james.l.morris@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13lib/vsprintf.c: document formats for dentry and struct fileOlof Johansson
Looks like these were added to Documentation/printk-formats.txt but not the in-file table. Signed-off-by: Olof Johansson <olof@lixom.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13lib/debugobjects.c: remove unnecessary work pending testXie XiuQi
Remove unnecessary work pending test before calling schedule_work(). It has been tested in queue_work_on() already. No functional changed. Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com> Reviewed-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13vsprintf: check real user/group id for %pKRyan Mallon
Some setuid binaries will allow reading of files which have read permission by the real user id. This is problematic with files which use %pK because the file access permission is checked at open() time, but the kptr_restrict setting is checked at read() time. If a setuid binary opens a %pK file as an unprivileged user, and then elevates permissions before reading the file, then kernel pointer values may be leaked. This happens for example with the setuid pppd application on Ubuntu 12.04: $ head -1 /proc/kallsyms 00000000 T startup_32 $ pppd file /proc/kallsyms pppd: In file /proc/kallsyms: unrecognized option 'c1000000' This will only leak the pointer value from the first line, but other setuid binaries may leak more information. Fix this by adding a check that in addition to the current process having CAP_SYSLOG, that effective user and group ids are equal to the real ids. If a setuid binary reads the contents of a file which uses %pK then the pointer values will be printed as NULL if the real user is unprivileged. Update the sysctl documentation to reflect the changes, and also correct the documentation to state the kptr_restrict=0 is the default. This is a only temporary solution to the issue. The correct solution is to do the permission check at open() time on files, and to replace %pK with a function which checks the open() time permission. %pK uses in printk should be removed since no sane permission check can be done, and instead protected by using dmesg_restrict. Signed-off-by: Ryan Mallon <rmallon@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Joe Perches <joe@perches.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13percpu: add test module for various percpu operationsGreg Thelen
Tests various percpu operations. Enable with CONFIG_PERCPU_TEST=m. Signed-off-by: Greg Thelen <gthelen@google.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13mm: do not walk all of system memory during show_memMel Gorman
It has been reported on very large machines that show_mem is taking almost 5 minutes to display information. This is a serious problem if there is an OOM storm. The bulk of the cost is in show_mem doing a very expensive PFN walk to give us the following information Total RAM: Also available as totalram_pages Highmem pages: Also available as totalhigh_pages Reserved pages: Can be inferred from the zone structure Shared pages: PFN walk required Unshared pages: PFN walk required Quick pages: Per-cpu walk required Only the shared/unshared pages requires a full PFN walk but that information is useless. It is also inaccurate as page pins of unshared pages would be accounted for as shared. Even if the information was accurate, I'm struggling to think how the shared/unshared information could be useful for debugging OOM conditions. Maybe it was useful before rmap existed when reclaiming shared pages was costly but it is less relevant today. The PFN walk could be optimised a bit but why bother as the information is useless. This patch deletes the PFN walker and infers the total RAM, highmem and reserved pages count from struct zone. It omits the shared/unshared page usage on the grounds that it is useless. It also corrects the reporting of HighMem as HighMem/MovableOnly as ZONE_MOVABLE has similar problems to HighMem with respect to lowmem/highmem exhaustion. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-12Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler changes from Ingo Molnar: "The main changes in this cycle are: - (much) improved CONFIG_NUMA_BALANCING support from Mel Gorman, Rik van Riel, Peter Zijlstra et al. Yay! - optimize preemption counter handling: merge the NEED_RESCHED flag into the preempt_count variable, by Peter Zijlstra. - wait.h fixes and code reorganization from Peter Zijlstra - cfs_bandwidth fixes from Ben Segall - SMP load-balancer cleanups from Peter Zijstra - idle balancer improvements from Jason Low - other fixes and cleanups" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (129 commits) ftrace, sched: Add TRACE_FLAG_PREEMPT_RESCHED stop_machine: Fix race between stop_two_cpus() and stop_cpus() sched: Remove unnecessary iteration over sched domains to update nr_busy_cpus sched: Fix asymmetric scheduling for POWER7 sched: Move completion code from core.c to completion.c sched: Move wait code from core.c to wait.c sched: Move wait.c into kernel/sched/ sched/wait: Fix __wait_event_interruptible_lock_irq_timeout() sched: Avoid throttle_cfs_rq() racing with period_timer stopping sched: Guarantee new group-entities always have weight sched: Fix hrtimer_cancel()/rq->lock deadlock sched: Fix cfs_bandwidth misuse of hrtimer_expires_remaining sched: Fix race on toggling cfs_bandwidth_used sched: Remove extra put_online_cpus() inside sched_setaffinity() sched/rt: Fix task_tick_rt() comment sched/wait: Fix build breakage sched/wait: Introduce prepare_to_wait_event() sched/wait: Add ___wait_cond_timeout() to wait_event*_timeout() too sched: Remove get_online_cpus() usage sched: Fix race in migrate_swap_stop() ...
2013-11-11random32: add test cases for taus113 implementationDaniel Borkmann
We generated a battery of 100 test cases from GSL taus113 implemention and compare the results from a particular seed and a particular iteration with our implementation in the kernel. We have verified on 32 and 64 bit machines that our taus113 kernel implementation gives same results as GSL taus113 implementation: [ 0.147370] prandom: seed boundary self test passed [ 0.148078] prandom: 100 self tests passed This is a Kconfig option that is disabled on default, just like the crc32 init selftests in order to not unnecessary slow down boot process. We also refactored out prandom_seed_very_weak() as it's now used in multiple places in order to reduce redundant code. GSL code we used for generating test cases: int i, j; srand(time(NULL)); for (i = 0; i < 100; ++i) { int iteration = 500 + (rand() % 500); gsl_rng_default_seed = rand() + 1; gsl_rng *r = gsl_rng_alloc(gsl_rng_taus113); printf("\t{ %lu, ", gsl_rng_default_seed); for (j = 0; j < iteration - 1; ++j) gsl_rng_get(r); printf("%u, %lu },\n", iteration, gsl_rng_get(r)); gsl_rng_free(r); } Joint work with Hannes Frederic Sowa. Cc: Florian Weimer <fweimer@redhat.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-11random32: upgrade taus88 generator to taus113 from errata paperDaniel Borkmann
Since we use prandom*() functions quite often in networking code i.e. in UDP port selection, netfilter code, etc, upgrade the PRNG from Pierre L'Ecuyer's original paper "Maximally Equidistributed Combined Tausworthe Generators", Mathematics of Computation, 65, 213 (1996), 203--213 to the version published in his errata paper [1]. The Tausworthe generator is a maximally-equidistributed generator, that is fast and has good statistical properties [1]. The version presented there upgrades the 3 state LFSR to a 4 state LFSR with increased periodicity from about 2^88 to 2^113. The algorithm is presented in [1] by the very same author who also designed the original algorithm in [2]. Also, by increasing the state, we make it a bit harder for attackers to "guess" the PRNGs internal state. See also discussion in [3]. Now, as we use this sort of weak initialization discussed in [3] only between core_initcall() until late_initcall() time [*] for prandom32*() users, namely in prandom_init(), it is less relevant from late_initcall() onwards as we overwrite seeds through prandom_reseed() anyways with a seed source of higher entropy, that is, get_random_bytes(). In other words, a exhaustive keysearch of 96 bit would be needed. Now, with the help of this patch, this state-search increases further to 128 bit. Initialization needs to make sure that s1 > 1, s2 > 7, s3 > 15, s4 > 127. taus88 and taus113 algorithm is also part of GSL. I added a test case in the next patch to verify internal behaviour of this patch with GSL and ran tests with the dieharder 3.31.1 RNG test suite: $ dieharder -g 052 -a -m 10 -s 1 -S 4137730333 #taus88 $ dieharder -g 054 -a -m 10 -s 1 -S 4137730333 #taus113 With this seed configuration, in order to compare both, we get the following differences: algorithm taus88 taus113 rands/second [**] 1.61e+08 1.37e+08 sts_serial(4, 1st run) WEAK PASSED sts_serial(9, 2nd run) WEAK PASSED rgb_lagged_sum(31) WEAK PASSED We took out diehard_sums test as according to the authors it is considered broken and unusable [4]. Despite that and the slight decrease in performance (which is acceptable), taus113 here passes all 113 tests (only rgb_minimum_distance_5 in WEAK, the rest PASSED). In general, taus/taus113 is considered "very good" by the authors of dieharder [5]. The papers [1][2] states a single warm-up step is sufficient by running quicktaus once on each state to ensure proper initialization of ~s_{0}: Our selection of (s) according to Table 1 of [1] row 1 holds the condition L - k <= r - s, that is, (32 32 32 32) - (31 29 28 25) <= (25 27 15 22) - (18 2 7 13) with r = k - q and q = (6 2 13 3) as also stated by the paper. So according to [2] we are safe with one round of quicktaus for initialization. However we decided to include the warm-up phase of the PRNG as done in GSL in every case as a safety net. We also use the warm up phase to make the output of the RNG easier to verify by the GSL output. In prandom_init(), we also mix random_get_entropy() into it, just like drivers/char/random.c does it, jiffies ^ random_get_entropy(). random-get_entropy() is get_cycles(). xor is entropy preserving so it is fine if it is not implemented by some architectures. Note, this PRNG is *not* used for cryptography in the kernel, but rather as a fast PRNG for various randomizations i.e. in the networking code, or elsewhere for debugging purposes, for example. [*]: In order to generate some "sort of pseduo-randomness", since get_random_bytes() is not yet available for us, we use jiffies and initialize states s1 - s3 with a simple linear congruential generator (LCG), that is x <- x * 69069; and derive s2, s3, from the 32bit initialization from s1. So the above quote from [3] accounts only for the time from core to late initcall, not afterwards. [**] Single threaded run on MacBook Air w/ Intel Core i5-3317U [1] http://www.iro.umontreal.ca/~lecuyer/myftp/papers/tausme2.ps [2] http://www.iro.umontreal.ca/~lecuyer/myftp/papers/tausme.ps [3] http://thread.gmane.org/gmane.comp.encryption.general/12103/ [4] http://code.google.com/p/dieharder/source/browse/trunk/libdieharder/diehard_sums.c?spec=svn490&r=490#20 [5] http://www.phy.duke.edu/~rgb/General/dieharder.php Joint work with Hannes Frederic Sowa. Cc: Florian Weimer <fweimer@redhat.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-11random32: add prandom_reseed_late() and call when nonblocking pool becomes ↵Hannes Frederic Sowa
initialized The Tausworthe PRNG is initialized at late_initcall time. At that time the entropy pool serving get_random_bytes is not filled sufficiently. This patch adds an additional reseeding step as soon as the nonblocking pool gets marked as initialized. On some machines it might be possible that late_initcall gets called after the pool has been initialized. In this situation we won't reseed again. (A call to prandom_seed_late blocks later invocations of early reseed attempts.) Joint work with Daniel Borkmann. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-11random32: add periodic reseedingHannes Frederic Sowa
The current Tausworthe PRNG is never reseeded with truly random data after the first attempt in late_initcall. As this PRNG is used for some critical random data as e.g. UDP port randomization we should try better and reseed the PRNG once in a while with truly random data from get_random_bytes(). When we reseed with prandom_seed we now make also sure to throw the first output away. This suffices the reseeding procedure. The delay calculation is based on a proposal from Eric Dumazet. Joint work with Daniel Borkmann. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-11random32: fix off-by-one in seeding requirementDaniel Borkmann
For properly initialising the Tausworthe generator [1], we have a strict seeding requirement, that is, s1 > 1, s2 > 7, s3 > 15. Commit 697f8d0348 ("random32: seeding improvement") introduced a __seed() function that imposes boundary checks proposed by the errata paper [2] to properly ensure above conditions. However, we're off by one, as the function is implemented as: "return (x < m) ? x + m : x;", and called with __seed(X, 1), __seed(X, 7), __seed(X, 15). Thus, an unwanted seed of 1, 7, 15 would be possible, whereas the lower boundary should actually be of at least 2, 8, 16, just as GSL does. Fix this, as otherwise an initialization with an unwanted seed could have the effect that Tausworthe's PRNG properties cannot not be ensured. Note that this PRNG is *not* used for cryptography in the kernel. [1] http://www.iro.umontreal.ca/~lecuyer/myftp/papers/tausme.ps [2] http://www.iro.umontreal.ca/~lecuyer/myftp/papers/tausme2.ps Joint work with Hannes Frederic Sowa. Fixes: 697f8d0348a6 ("random32: seeding improvement") Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Florian Weimer <fweimer@redhat.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>