summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2018-03-22fs: Teach path_connected to handle nfs filesystems with multiple roots.Eric W. Biederman
commit 95dd77580ccd66a0da96e6d4696945b8cea39431 upstream. On nfsv2 and nfsv3 the nfs server can export subsets of the same filesystem and report the same filesystem identifier, so that the nfs client can know they are the same filesystem. The subsets can be from disjoint directory trees. The nfsv2 and nfsv3 filesystems provides no way to find the common root of all directory trees exported form the server with the same filesystem identifier. The practical result is that in struct super s_root for nfs s_root is not necessarily the root of the filesystem. The nfs mount code sets s_root to the root of the first subset of the nfs filesystem that the kernel mounts. This effects the dcache invalidation code in generic_shutdown_super currently called shrunk_dcache_for_umount and that code for years has gone through an additional list of dentries that might be dentry trees that need to be freed to accomodate nfs. When I wrote path_connected I did not realize nfs was so special, and it's hueristic for avoiding calling is_subdir can fail. The practical case where this fails is when there is a move of a directory from the subtree exposed by one nfs mount to the subtree exposed by another nfs mount. This move can happen either locally or remotely. With the remote case requiring that the move directory be cached before the move and that after the move someone walks the path to where the move directory now exists and in so doing causes the already cached directory to be moved in the dcache through the magic of d_splice_alias. If someone whose working directory is in the move directory or a subdirectory and now starts calling .. from the initial mount of nfs (where s_root == mnt_root), then path_connected as a heuristic will not bother with the is_subdir check. As s_root really is not the root of the nfs filesystem this heuristic is wrong, and the path may actually not be connected and path_connected can fail. The is_subdir function might be cheap enough that we can call it unconditionally. Verifying that will take some benchmarking and the result may not be the same on all kernels this fix needs to be backported to. So I am avoiding that for now. Filesystems with snapshots such as nilfs and btrfs do something similar. But as the directory tree of the snapshots are disjoint from one another and from the main directory tree rename won't move things between them and this problem will not occur. Cc: stable@vger.kernel.org Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Fixes: 397d425dc26d ("vfs: Test for and handle paths that are unreachable from their mnt_root") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-22regulator: isl9305: fix array sizeVincent Stehlé
[ Upstream commit 0c08aaf873174c95e674cf21ffcd041c589d2e5b ] ISL9305_MAX_REGULATOR is the last index used to access the init_data[] array, so we need to add one to this last index to obtain the necessary array size. This fixes the following smatch error: drivers/regulator/isl9305.c:160 isl9305_i2c_probe() error: buffer overflow 'pdata->init_data' 3 <= 3 Fixes: dec38b5ce6a9edb4 ("regulator: isl9305: Add Intersil ISL9305/H driver") Signed-off-by: Vincent Stehlé <vincent.stehle@laposte.net> Cc: Mark Brown <broonie@kernel.org> Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-22regulator: core: Limit propagation of parent voltage count and listMatthias Kaehlcke
[ Upstream commit fd086045559d90cd7854818b4c60a7119eda6231 ] Commit 26988efe11b1 ("regulator: core: Allow to get voltage count and list from parent") introduces the propagation of the parent voltage count and list for regulators that don't provide this information themselves. The goal is to support simple switch regulators, however as a side effect normal continuous regulators can leak details of their supplies and provide consumers with inconsistent information. Limit the propagation of the voltage count and list to switch regulators. Fixes: 26988efe11b1 ("regulator: core: Allow to get voltage count and list from parent") Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Reviewed-by: Javier Martinez Canillas <javier@osg.samsung.com> Tested-by: Javier Martinez Canillas <javier@osg.samsung.com> Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-22mm: Fix false-positive VM_BUG_ON() in page_cache_{get,add}_speculative()Kirill A. Shutemov
[ Upstream commit 591a3d7c09fa08baff48ad86c2347dbd28a52753 ] 0day testing by Fengguang Wu triggered this crash while running Trinity: kernel BUG at include/linux/pagemap.h:151! ... CPU: 0 PID: 458 Comm: trinity-c0 Not tainted 4.11.0-rc2-00251-g2947ba0 #1 ... Call Trace: __get_user_pages_fast() get_user_pages_fast() get_futex_key() futex_requeue() do_futex() SyS_futex() do_syscall_64() entry_SYSCALL64_slow_path() It' VM_BUG_ON() due to false-negative in_atomic(). We call page_cache_get_speculative() with disabled local interrupts. It should be atomic enough. So let's check for disabled interrupts in the VM_BUG_ON() condition too, to resolve this. ( This got triggered by the conversion of the x86 GUP code to the generic GUP code. ) Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: LKP <lkp@01.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20170324114709.pcytvyb3d6ajux33@black.fi.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-18usb: quirks: add control message delay for 1b1c:1b20Danilo Krummrich
commit cb88a0588717ba6c756cb5972d75766b273a6817 upstream. Corsair Strafe RGB keyboard does not respond to usb control messages sometimes and hence generates timeouts. Commit de3af5bf259d ("usb: quirks: add delay init quirk for Corsair Strafe RGB keyboard") tried to fix those timeouts by adding USB_QUIRK_DELAY_INIT. Unfortunately, even with this quirk timeouts of usb_control_msg() can still be seen, but with a lower frequency (approx. 1 out of 15): [ 29.103520] usb 1-8: string descriptor 0 read error: -110 [ 34.363097] usb 1-8: can't set config #1, error -110 Adding further delays to different locations where usb control messages are issued just moves the timeouts to other locations, e.g.: [ 35.400533] usbhid 1-8:1.0: can't add hid device: -110 [ 35.401014] usbhid: probe of 1-8:1.0 failed with error -110 The only way to reliably avoid those issues is having a pause after each usb control message. In approx. 200 boot cycles no more timeouts were seen. Addionaly, keep USB_QUIRK_DELAY_INIT as it turned out to be necessary to have the delay in hub_port_connect() after hub_port_init(). The overall boot time seems not to be influenced by these additional delays, even on fast machines and lightweight distributions. Fixes: de3af5bf259d ("usb: quirks: add delay init quirk for Corsair Strafe RGB keyboard") Cc: stable@vger.kernel.org Signed-off-by: Danilo Krummrich <danilokrummrich@dk-develop.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-18netfilter: x_tables: pack percpu counter allocationsFlorian Westphal
commit ae0ac0ed6fcf5af3be0f63eb935f483f44a402d2 upstream. instead of allocating each xt_counter individually, allocate 4k chunks and then use these for counter allocation requests. This should speed up rule evaluation by increasing data locality, also speeds up ruleset loading because we reduce calls to the percpu allocator. As Eric points out we can't use PAGE_SIZE, page_allocator would fail on arches with 64k page size. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-18netfilter: x_tables: pass xt_counters struct to counter allocatorFlorian Westphal
commit f28e15bacedd444608e25421c72eb2cf4527c9ca upstream. Keeps some noise away from a followup patch. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-18netfilter: x_tables: pass xt_counters struct instead of packet counterFlorian Westphal
commit 4d31eef5176df06f218201bc9c0ce40babb41660 upstream. On SMP we overload the packet counter (unsigned long) to contain percpu offset. Hide this from callers and pass xt_counters address instead. Preparation patch to allocate the percpu counters in page-sized batch chunks. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-18x86/retpoline: Support retpoline builds with ClangDavid Woodhouse
commit 87358710c1fb4f1bf96bbe2349975ff9953fc9b2 upstream. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: arjan.van.de.ven@intel.com Cc: bp@alien8.de Cc: dave.hansen@intel.com Cc: jmattson@google.com Cc: karahmed@amazon.de Cc: kvm@vger.kernel.org Cc: pbonzini@redhat.com Cc: rkrcmar@redhat.com Link: http://lkml.kernel.org/r/1519037457-7643-5-git-send-email-dwmw@amazon.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-18nospec: Include <asm/barrier.h> dependencyDan Williams
commit eb6174f6d1be16b19cfa43dac296bfed003ce1a6 upstream. The nospec.h header expects the per-architecture header file <asm/barrier.h> to optionally define array_index_mask_nospec(). Include that dependency to prevent inadvertent fallback to the default array_index_mask_nospec() implementation. The default implementation may not provide a full mitigation on architectures that perform data value speculation. Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arch@vger.kernel.org Link: http://lkml.kernel.org/r/151881605404.17395.1341935530792574707.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-18nospec: Kill array_index_nospec_mask_check()Dan Williams
commit 1d91c1d2c80cb70e2e553845e278b87a960c04da upstream. There are multiple problems with the dynamic sanity checking in array_index_nospec_mask_check(): * It causes unnecessary overhead in the 32-bit case since integer sized @index values will no longer cause the check to be compiled away like in the 64-bit case. * In the 32-bit case it may trigger with user controllable input when the expectation is that should only trigger during development of new kernel enabling. * The macro reuses the input parameter in multiple locations which is broken if someone passes an expression like 'index++' to array_index_nospec(). Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arch@vger.kernel.org Link: http://lkml.kernel.org/r/151881604278.17395.6605847763178076520.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-18workqueue: Allow retrieval of current task's work structLukas Wunner
commit 27d4ee03078aba88c5e07dcc4917e8d01d046f38 upstream. Introduce a helper to retrieve the current task's work struct if it is a workqueue worker. This allows us to fix a long-standing deadlock in several DRM drivers wherein the ->runtime_suspend callback waits for a specific worker to finish and that worker in turn calls a function which waits for runtime suspend to finish. That function is invoked from multiple call sites and waiting for runtime suspend to finish is the correct thing to do except if it's executing in the context of the worker. Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Dave Airlie <airlied@redhat.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lyude Paul <lyude@redhat.com> Signed-off-by: Lukas Wunner <lukas@wunner.de> Link: https://patchwork.freedesktop.org/patch/msgid/2d8f603074131eb87e588d2b803a71765bd3a2fd.1518338788.git.lukas@wunner.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-11nospec: Allow index argument to have const-qualified typeRasmus Villemoes
commit b98c6a160a057d5686a8c54c79cc6c8c94a7d0c8 upstream. The last expression in a statement expression need not be a bare variable, quoting gcc docs The last thing in the compound statement should be an expression followed by a semicolon; the value of this subexpression serves as the value of the entire construct. and we already use that in e.g. the min/max macros which end with a ternary expression. This way, we can allow index to have const-qualified type, which will in some cases avoid the need for introducing a local copy of index of non-const qualified type. That, in turn, can prevent readers not familiar with the internals of array_index_nospec from wondering about the seemingly redundant extra variable, and I think that's worthwhile considering how confusing the whole _nospec business is. The expression _i&_mask has type unsigned long (since that is the type of _mask, and the BUILD_BUG_ONs guarantee that _i will get promoted to that), so in order not to change the type of the whole expression, add a cast back to typeof(_i). Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arch@vger.kernel.org Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/151881604837.17395.10812767547837568328.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-11dax: fix vma_is_fsdax() helperDan Williams
commit 230f5a8969d8345fc9bbe3683f068246cf1be4b8 upstream. Gerd reports that ->i_mode may contain other bits besides S_IFCHR. Use S_ISCHR() instead. Otherwise, get_user_pages_longterm() may fail on device-dax instances when those are meant to be explicitly allowed. Fixes: 2bb6d2837083 ("mm: introduce get_user_pages_longterm") Cc: <stable@vger.kernel.org> Reported-by: Gerd Rausch <gerd.rausch@oracle.com> Acked-by: Jane Chu <jane.chu@oracle.com> Reported-by: Haozhong Zhang <haozhong.zhang@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-28libnvdimm, dax: fix 1GB-aligned namespaces vs physical misalignmentDan Williams
commit 41fce90f26333c4fa82e8e43b9ace86c4e8a0120 upstream. The following namespace configuration attempt: # ndctl create-namespace -e namespace0.0 -m devdax -a 1G -f libndctl: ndctl_dax_enable: dax0.1: failed to enable Error: namespace0.0: failed to enable failed to reconfigure namespace: No such device or address ...fails when the backing memory range is not physically aligned to 1G: # cat /proc/iomem | grep Persistent 210000000-30fffffff : Persistent Memory (legacy) In the above example the 4G persistent memory range starts and ends on a 256MB boundary. We handle this case correctly when needing to handle cases that violate section alignment (128MB) collisions against "System RAM", and we simply need to extend that padding/truncation for the 1GB alignment use case. Cc: <stable@vger.kernel.org> Fixes: 315c562536c4 ("libnvdimm, pfn: add 'align' attribute...") Reported-and-tested-by: Jane Chu <jane.chu@oracle.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-28mm: introduce get_user_pages_longtermDan Williams
commit 2bb6d2837083de722bfdc369cb0d76ce188dd9b4 upstream. Patch series "introduce get_user_pages_longterm()", v2. Here is a new get_user_pages api for cases where a driver intends to keep an elevated page count indefinitely. This is distinct from usages like iov_iter_get_pages where the elevated page counts are transient. The iov_iter_get_pages cases immediately turn around and submit the pages to a device driver which will put_page when the i/o operation completes (under kernel control). In the longterm case userspace is responsible for dropping the page reference at some undefined point in the future. This is untenable for filesystem-dax case where the filesystem is in control of the lifetime of the block / page and needs reasonable limits on how long it can wait for pages in a mapping to become idle. Fixing filesystems to actually wait for dax pages to be idle before blocks from a truncate/hole-punch operation are repurposed is saved for a later patch series. Also, allowing longterm registration of dax mappings is a future patch series that introduces a "map with lease" semantic where the kernel can revoke a lease and force userspace to drop its page references. I have also tagged these for -stable to purposely break cases that might assume that longterm memory registrations for filesystem-dax mappings were supported by the kernel. The behavior regression this policy change implies is one of the reasons we maintain the "dax enabled. Warning: EXPERIMENTAL, use at your own risk" notification when mounting a filesystem in dax mode. It is worth noting the device-dax interface does not suffer the same constraints since it does not support file space management operations like hole-punch. This patch (of 4): Until there is a solution to the dma-to-dax vs truncate problem it is not safe to allow long standing memory registrations against filesytem-dax vmas. Device-dax vmas do not have this problem and are explicitly allowed. This is temporary until a "memory registration with layout-lease" mechanism can be implemented for the affected sub-systems (RDMA and V4L2). [akpm@linux-foundation.org: use kcalloc()] Link: http://lkml.kernel.org/r/151068939435.7446.13560129395419350737.stgit@dwillia2-desk3.amr.corp.intel.com Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings") Signed-off-by: Dan Williams <dan.j.williams@intel.com> Suggested-by: Christoph Hellwig <hch@lst.de> Cc: Doug Ledford <dledford@redhat.com> Cc: Hal Rosenstock <hal.rosenstock@gmail.com> Cc: Inki Dae <inki.dae@samsung.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Joonyoung Shim <jy0922.shim@samsung.com> Cc: Kyungmin Park <kyungmin.park@samsung.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Sean Hefty <sean.hefty@intel.com> Cc: Seung-Woo Kim <sw0312.kim@samsung.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-25PCI: Change pci_host_common_probe() visibilityMarc Gonzalez
commit de5bbdd01cf9ee3cd4586b5a970d3ea015c6d7e3 upstream. pci_host_common_probe() is defined when CONFIG_PCI_HOST_COMMON=y; therefore the function declaration should match that. drivers/pci/host/pcie-tango.c:300:9: error: implicit declaration of function 'pci_host_common_probe' Signed-off-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-25crypto: hash - prevent using keyed hashes without setting keyEric Biggers
commit 9fa68f620041be04720d0cbfb1bd3ddfc6310b24 upstream. Currently, almost none of the keyed hash algorithms check whether a key has been set before proceeding. Some algorithms are okay with this and will effectively just use a key of all 0's or some other bogus default. However, others will severely break, as demonstrated using "hmac(sha3-512-generic)", the unkeyed use of which causes a kernel crash via a (potentially exploitable) stack buffer overflow. A while ago, this problem was solved for AF_ALG by pairing each hash transform with a 'has_key' bool. However, there are still other places in the kernel where userspace can specify an arbitrary hash algorithm by name, and the kernel uses it as unkeyed hash without checking whether it is really unkeyed. Examples of this include: - KEYCTL_DH_COMPUTE, via the KDF extension - dm-verity - dm-crypt, via the ESSIV support - dm-integrity, via the "internal hash" mode with no key given - drbd (Distributed Replicated Block Device) This bug is especially bad for KEYCTL_DH_COMPUTE as that requires no privileges to call. Fix the bug for all users by adding a flag CRYPTO_TFM_NEED_KEY to the ->crt_flags of each hash transform that indicates whether the transform still needs to be keyed or not. Then, make the hash init, import, and digest functions return -ENOKEY if the key is still needed. The new flag also replaces the 'has_key' bool which algif_hash was previously using, thereby simplifying the algif_hash implementation. Reported-by: syzbot <syzkaller@googlegroups.com> Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-25crypto: hash - annotate algorithms taking optional keyEric Biggers
commit a208fa8f33031b9e0aba44c7d1b7e68eb0cbd29e upstream. We need to consistently enforce that keyed hashes cannot be used without setting the key. To do this we need a reliable way to determine whether a given hash algorithm is keyed or not. AF_ALG currently does this by checking for the presence of a ->setkey() method. However, this is actually slightly broken because the CRC-32 algorithms implement ->setkey() but can also be used without a key. (The CRC-32 "key" is not actually a cryptographic key but rather represents the initial state. If not overridden, then a default initial state is used.) Prepare to fix this by introducing a flag CRYPTO_ALG_OPTIONAL_KEY which indicates that the algorithm has a ->setkey() method, but it is not required to be called. Then set it on all the CRC-32 algorithms. The same also applies to the Adler-32 implementation in Lustre. Also, the cryptd and mcryptd templates have to pass through the flag from their underlying algorithm. Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-25Provide a function to create a NUL-terminated string from unterminated dataDavid Howells
commit f35157417215ec138c920320c746fdb3e04ef1d5 upstream. Provide a function, kmemdup_nul(), that will create a NUL-terminated string from an unterminated character array where the length is known in advance. This is better than kstrndup() in situations where we already know the string length as the strnlen() in kstrndup() is superfluous. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-25ptr_ring: fail early if queue occupies more than KMALLOC_MAX_SIZEJason Wang
commit 6e6e41c3112276288ccaf80c70916779b84bb276 upstream. To avoid slab to warn about exceeded size, fail early if queue occupies more than KMALLOC_MAX_SIZE. Reported-by: syzbot+e4d4f9ddd4295539735d@syzkaller.appspotmail.com Fixes: 2e0ab8ca83c12 ("ptr_ring: array based FIFO for pointers") Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22x86: fix build warnign with 32-bit PAEArnd Bergmann
I ran into a 4.9 build warning in randconfig testing, starting with the KAISER patches: arch/x86/kernel/ldt.c: In function 'alloc_ldt_struct': arch/x86/include/asm/pgtable_types.h:208:24: error: large integer implicitly truncated to unsigned type [-Werror=overflow] #define __PAGE_KERNEL (__PAGE_KERNEL_EXEC | _PAGE_NX) ^ arch/x86/kernel/ldt.c:81:6: note: in expansion of macro '__PAGE_KERNEL' __PAGE_KERNEL); ^~~~~~~~~~~~~ I originally ran into this last year when the patches were part of linux-next, and tried to work around it by using the proper 'pteval_t' types consistently, but that caused additional problems. This takes a much simpler approach, and makes the argument type of the dummy helper always 64-bit, which is wide enough for any page table layout and won't hurt since this call is just an empty stub anyway. Fixes: 8f0baadf2bea ("kaiser: merged update") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22nospec: Move array_index_nospec() parameter checking into separate macroWill Deacon
commit 8fa80c503b484ddc1abbd10c7cb2ab81f3824a50 upstream. For architectures providing their own implementation of array_index_mask_nospec() in asm/barrier.h, attempting to use WARN_ONCE() to complain about out-of-range parameters using WARN_ON() results in a mess of mutually-dependent include files. Rather than unpick the dependencies, simply have the core code in nospec.h perform the checking for us. Signed-off-by: Will Deacon <will.deacon@arm.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1517840166-15399-1-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22compiler-gcc.h: Introduce __optimize function attributeGeert Uytterhoeven
commit df5d45aa08f848b79caf395211b222790534ccc7 upstream. Create a new function attribute __optimize, which allows to specify an optimization level on a per-function basis. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22jbd2: fix sphinx kernel-doc build warningsTobin C. Harding
commit f69120ce6c024aa634a8fc25787205e42f0ccbe6 upstream. Sphinx emits various (26) warnings when building make target 'htmldocs'. Currently struct definitions contain duplicate documentation, some as kernel-docs and some as standard c89 comments. We can reduce duplication while cleaning up the kernel docs. Move all kernel-docs to right above each struct member. Use the set of all existing comments (kernel-doc and c89). Add documentation for missing struct members and function arguments. Signed-off-by: Tobin C. Harding <me@tobin.cc> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-17mtd: cfi: convert inline functions to macrosArnd Bergmann
commit 9e343e87d2c4c707ef8fae2844864d4dde3a2d13 upstream. The map_word_() functions, dating back to linux-2.6.8, try to perform bitwise operations on a 'map_word' structure. This may have worked with compilers that were current then (gcc-3.4 or earlier), but end up being rather inefficient on any version I could try now (gcc-4.4 or higher). Specifically we hit a problem analyzed in gcc PR81715 where we fail to reuse the stack space for local variables. This can be seen immediately in the stack consumption for cfi_staa_erase_varsize() and other functions that (with CONFIG_KASAN) can be up to 2200 bytes. Changing the inline functions into macros brings this down to 1280 bytes. Without KASAN, the same problem exists, but the stack consumption is lower to start with, my patch shrinks it from 920 to 496 bytes on with arm-linux-gnueabi-gcc-5.4, and saves around 1KB in .text size for cfi_cmdset_0020.c, as it avoids copying map_word structures for each call to one of these helpers. With the latest gcc-8 snapshot, the problem is fixed in upstream gcc, but nobody uses that yet, so we should still work around it in mainline kernels and probably backport the workaround to stable kernels as well. We had a couple of other functions that suffered from the same gcc bug, and all of those had a simpler workaround involving dummy variables in the inline function. Unfortunately that did not work here, the macro hack was the best I could come up with. It would also be helpful to have someone to a little performance testing on the patch, to see how much it helps in terms of CPU utilitzation. Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81715 Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Richard Weinberger <richard@nod.at> Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-13x86/retpoline: Avoid retpolines for built-in __init functionsDavid Woodhouse
(cherry picked from commit 66f793099a636862a71c59d4a6ba91387b155e0c) There's no point in building init code with retpolines, since it runs before any potentially hostile userspace does. And before the retpoline is actually ALTERNATIVEd into place, for much of it. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: karahmed@amazon.de Cc: peterz@infradead.org Cc: bp@alien8.de Link: https://lkml.kernel.org/r/1517484441-1420-2-git-send-email-dwmw@amazon.co.uk Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-13vfs, fdtable: Prevent bounds-check bypass via speculative executionDan Williams
(cherry picked from commit 56c30ba7b348b90484969054d561f711ba196507) 'fd' is a user controlled value that is used as a data dependency to read from the 'fdt->fd' array. In order to avoid potential leaks of kernel memory values, block speculative execution of the instruction stream that could issue reads based on an invalid 'file *' returned from __fcheck_files. Co-developed-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-arch@vger.kernel.org Cc: kernel-hardening@lists.openwall.com Cc: gregkh@linuxfoundation.org Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: torvalds@linux-foundation.org Cc: alan@linux.intel.com Link: https://lkml.kernel.org/r/151727418500.33451.17392199002892248656.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-13array_index_nospec: Sanitize speculative array de-referencesDan Williams
(cherry picked from commit f3804203306e098dae9ca51540fcd5eb700d7f40) array_index_nospec() is proposed as a generic mechanism to mitigate against Spectre-variant-1 attacks, i.e. an attack that bypasses boundary checks via speculative execution. The array_index_nospec() implementation is expected to be safe for current generation CPUs across multiple architectures (ARM, x86). Based on an original implementation by Linus Torvalds, tweaked to remove speculative flows by Alexei Starovoitov, and tweaked again by Linus to introduce an x86 assembly implementation for the mask generation. Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org> Co-developed-by: Alexei Starovoitov <ast@kernel.org> Suggested-by: Cyril Novikov <cnovikov@lynx.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-arch@vger.kernel.org Cc: kernel-hardening@lists.openwall.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Russell King <linux@armlinux.org.uk> Cc: gregkh@linuxfoundation.org Cc: torvalds@linux-foundation.org Cc: alan@linux.intel.com Link: https://lkml.kernel.org/r/151727414229.33451.18411580953862676575.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-13module/retpoline: Warn about missing retpoline in moduleAndi Kleen
(cherry picked from commit caf7501a1b4ec964190f31f9c3f163de252273b8) There's a risk that a kernel which has full retpoline mitigations becomes vulnerable when a module gets loaded that hasn't been compiled with the right compiler or the right option. To enable detection of that mismatch at module load time, add a module info string "retpoline" at build time when the module was compiled with retpoline support. This only covers compiled C source, but assembler source or prebuilt object files are not checked. If a retpoline enabled kernel detects a non retpoline protected module at load time, print a warning and report it in the sysfs vulnerability file. [ tglx: Massaged changelog ] Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: David Woodhouse <dwmw2@infradead.org> Cc: gregkh@linuxfoundation.org Cc: torvalds@linux-foundation.org Cc: jeyu@kernel.org Cc: arjan@linux.intel.com Link: https://lkml.kernel.org/r/20180125235028.31211-1-andi@firstfloor.org Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-03tty: fix data race between tty_init_dev and flush of bufGaurav Kohli
commit b027e2298bd588d6fa36ed2eda97447fb3eac078 upstream. There can be a race, if receive_buf call comes before tty initialization completes in n_tty_open and tty->disc_data may be NULL. CPU0 CPU1 ---- ---- 000|n_tty_receive_buf_common() n_tty_open() -001|n_tty_receive_buf2() tty_ldisc_open.isra.3() -002|tty_ldisc_receive_buf(inline) tty_ldisc_setup() Using ldisc semaphore lock in tty_init_dev till disc_data initializes completely. Signed-off-by: Gaurav Kohli <gkohli@codeaurora.org> Reviewed-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-03net/mlx5: Define interface bits for fencing UMR wqeMax Gurtovoy
commit 1410a90ae449061b7e1ae19d275148f36948801b upstream. HW can implement UMR wqe re-transmission in various ways. Thus, add HCA cap to distinguish the needed fence for UMR to make sure that the wqe wouldn't fail on mkey checks. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Acked-by: Leon Romanovsky <leon@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Doug Ledford <dledford@redhat.com> Cc: Marta Rybczynska <mrybczyn@kalray.eu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-31bpf: avoid false sharing of map refcount with max_entriesDaniel Borkmann
[ upstream commit be95a845cc4402272994ce290e3ad928aff06cb9 ] In addition to commit b2157399cc98 ("bpf: prevent out-of-bounds speculation") also change the layout of struct bpf_map such that false sharing of fast-path members like max_entries is avoided when the maps reference counter is altered. Therefore enforce them to be placed into separate cachelines. pahole dump after change: struct bpf_map { const struct bpf_map_ops * ops; /* 0 8 */ struct bpf_map * inner_map_meta; /* 8 8 */ void * security; /* 16 8 */ enum bpf_map_type map_type; /* 24 4 */ u32 key_size; /* 28 4 */ u32 value_size; /* 32 4 */ u32 max_entries; /* 36 4 */ u32 map_flags; /* 40 4 */ u32 pages; /* 44 4 */ u32 id; /* 48 4 */ int numa_node; /* 52 4 */ bool unpriv_array; /* 56 1 */ /* XXX 7 bytes hole, try to pack */ /* --- cacheline 1 boundary (64 bytes) --- */ struct user_struct * user; /* 64 8 */ atomic_t refcnt; /* 72 4 */ atomic_t usercnt; /* 76 4 */ struct work_struct work; /* 80 32 */ char name[16]; /* 112 16 */ /* --- cacheline 2 boundary (128 bytes) --- */ /* size: 128, cachelines: 2, members: 17 */ /* sum members: 121, holes: 1, sum holes: 7 */ }; Now all entries in the first cacheline are read only throughout the life time of the map, set up once during map creation. Overall struct size and number of cachelines doesn't change from the reordering. struct bpf_map is usually first member and embedded in map structs in specific map implementations, so also avoid those members to sit at the end where it could potentially share the cacheline with first map values e.g. in the array since remote CPUs could trigger map updates just as well for those (easily dirtying members like max_entries intentionally as well) while having subsequent values in cache. Quoting from Google's Project Zero blog [1]: Additionally, at least on the Intel machine on which this was tested, bouncing modified cache lines between cores is slow, apparently because the MESI protocol is used for cache coherence [8]. Changing the reference counter of an eBPF array on one physical CPU core causes the cache line containing the reference counter to be bounced over to that CPU core, making reads of the reference counter on all other CPU cores slow until the changed reference counter has been written back to memory. Because the length and the reference counter of an eBPF array are stored in the same cache line, this also means that changing the reference counter on one physical CPU core causes reads of the eBPF array's length to be slow on other physical CPU cores (intentional false sharing). While this doesn't 'control' the out-of-bounds speculation through masking the index as in commit b2157399cc98, triggering a manipulation of the map's reference counter is really trivial, so lets not allow to easily affect max_entries from it. Splitting to separate cachelines also generally makes sense from a performance perspective anyway in that fast-path won't have a cache miss if the map gets pinned, reused in other progs, etc out of control path, thus also avoids unintentional false sharing. [1] https://googleprojectzero.blogspot.ch/2018/01/reading-privileged-memory-with-side.html Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-31mm: fix 100% CPU kswapd busyloop on unreclaimable nodesJohannes Weiner
commit c73322d098e4b6f5f0f0fa1330bf57e218775539 upstream. Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and cleanups". Jia reported a scenario in which the kswapd of a node indefinitely spins at 100% CPU usage. We have seen similar cases at Facebook. The kernel's current method of judging its ability to reclaim a node (or whether to back off and sleep) is based on the amount of scanned pages in proportion to the amount of reclaimable pages. In Jia's and our scenarios, there are no reclaimable pages in the node, however, and the condition for backing off is never met. Kswapd busyloops in an attempt to restore the watermarks while having nothing to work with. This series reworks the definition of an unreclaimable node based not on scanning but on whether kswapd is able to actually reclaim pages in MAX_RECLAIM_RETRIES (16) consecutive runs. This is the same criteria the page allocator uses for giving up on direct reclaim and invoking the OOM killer. If it cannot free any pages, kswapd will go to sleep and leave further attempts to direct reclaim invocations, which will either make progress and re-enable kswapd, or invoke the OOM killer. Patch #1 fixes the immediate problem Jia reported, the remainder are smaller fixlets, cleanups, and overall phasing out of the old method. Patch #6 is the odd one out. It's a nice cleanup to get_scan_count(), and directly related to #5, but in itself not relevant to the series. If the whole series is too ambitious for 4.11, I would consider the first three patches fixes, the rest cleanups. This patch (of 9): Jia He reports a problem with kswapd spinning at 100% CPU when requesting more hugepages than memory available in the system: $ echo 4000 >/proc/sys/vm/nr_hugepages top - 13:42:59 up 3:37, 1 user, load average: 1.09, 1.03, 1.01 Tasks: 1 total, 1 running, 0 sleeping, 0 stopped, 0 zombie %Cpu(s): 0.0 us, 12.5 sy, 0.0 ni, 85.5 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st KiB Mem: 31371520 total, 30915136 used, 456384 free, 320 buffers KiB Swap: 6284224 total, 115712 used, 6168512 free. 48192 cached Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 76 root 20 0 0 0 0 R 100.0 0.000 217:17.29 kswapd3 At that time, there are no reclaimable pages left in the node, but as kswapd fails to restore the high watermarks it refuses to go to sleep. Kswapd needs to back away from nodes that fail to balance. Up until commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes") kswapd had such a mechanism. It considered zones whose theoretically reclaimable pages it had reclaimed six times over as unreclaimable and backed away from them. This guard was erroneously removed as the patch changed the definition of a balanced node. However, simply restoring this code wouldn't help in the case reported here: there *are* no reclaimable pages that could be scanned until the threshold is met. Kswapd would stay awake anyway. Introduce a new and much simpler way of backing off. If kswapd runs through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single page, make it back off from the node. This is the same number of shots direct reclaim takes before declaring OOM. Kswapd will go to sleep on that node until a direct reclaimer manages to reclaim some pages, thus proving the node reclaimable again. [hannes@cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count] Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org [shakeelb@google.com: fix condition for throttle_direct_reclaim] Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: Jia He <hejianet@gmail.com> Tested-by: Jia He <hejianet@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Dmitry Shmidt <dimitrysh@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-31Revert "module: Add retpoline tag to VERMAGIC"Greg Kroah-Hartman
commit 5132ede0fe8092b043dae09a7cc32b8ae7272baa upstream. This reverts commit 6cfb521ac0d5b97470883ff9b7facae264b7ab12. Turns out distros do not want to make retpoline as part of their "ABI", so this patch should not have been merged. Sorry Andi, this was my fault, I suggested it when your original patch was the "correct" way of doing this instead. Reported-by: Jiri Kosina <jikos@kernel.org> Fixes: 6cfb521ac0d5 ("module: Add retpoline tag to VERMAGIC") Acked-by: Andi Kleen <ak@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: rusty@rustcorp.com.au Cc: arjan.van.de.ven@intel.com Cc: jeyu@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-31drivers: base: cacheinfo: fix x86 with CONFIG_OF enabledSudeep Holla
commit fac51482577d5e05bbb0efa8d602a3c2111098bf upstream. With CONFIG_OF enabled on x86, we get the following error on boot: " Failed to find cpu0 device node Unable to detect cache hierarchy from DT for CPU 0 " and the cacheinfo fails to get populated in the corresponding sysfs entries. This is because cache_setup_of_node looks for of_node for setting up the shared cpu_map without checking that it's already populated in the architecture specific callback. In order to indicate that the shared cpu_map is already populated, this patch introduces a boolean `cpu_map_populated` in struct cpu_cacheinfo that can be used by the generic code to skip cache_shared_cpu_map_setup. This patch also sets that boolean for x86. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Mian Yousaf Kaukab <yousaf.kaukab@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-23module: Add retpoline tag to VERMAGICAndi Kleen
commit 6cfb521ac0d5b97470883ff9b7facae264b7ab12 upstream. Add a marker for retpoline to the module VERMAGIC. This catches the case when a non RETPOLINE compiled module gets loaded into a retpoline kernel, making it insecure. It doesn't handle the case when retpoline has been runtime disabled. Even in this case the match of the retcompile status will be enforced. This implies that even with retpoline run time disabled all modules loaded need to be recompiled. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: David Woodhouse <dwmw@amazon.co.uk> Cc: rusty@rustcorp.com.au Cc: arjan.van.de.ven@intel.com Cc: jeyu@kernel.org Cc: torvalds@linux-foundation.org Link: https://lkml.kernel.org/r/20180116205228.4890-1-andi@firstfloor.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-17objtool, modules: Discard objtool annotation sections for modulesJosh Poimboeuf
commit e390f9a9689a42f477a6073e2e7df530a4c1b740 upstream. The '__unreachable' and '__func_stack_frame_non_standard' sections are only used at compile time. They're discarded for vmlinux but they should also be discarded for modules. Since this is a recurring pattern, prefix the section names with ".discard.". It's a nice convention and vmlinux.lds.h already discards such sections. Also remove the 'a' (allocatable) flag from the __unreachable section since it doesn't make sense for a discarded section. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Jessica Yu <jeyu@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: d1091c7fa3d5 ("objtool: Improve detection of BUG() and other dead ends") Link: http://lkml.kernel.org/r/20170301180444.lhd53c5tibc4ns77@treble Signed-off-by: Ingo Molnar <mingo@kernel.org> [dwmw2: Remove the unreachable part in backporting since it's not here yet] Signed-off-by: David Woodhouse <dwmw@amazon.co.ku> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-17sysfs/cpu: Add vulnerability folderThomas Gleixner
commit 87590ce6e373d1a5401f6539f0c59ef92dd924a9 upstream. As the meltdown/spectre problem affects several CPU architectures, it makes sense to have common way to express whether a system is affected by a particular vulnerability or not. If affected the way to express the mitigation should be common as well. Create /sys/devices/system/cpu/vulnerabilities folder and files for meltdown, spectre_v1 and spectre_v2. Allow architectures to override the show function. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linuxfoundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: David Woodhouse <dwmw@amazon.co.uk> Link: https://lkml.kernel.org/r/20180107214913.096657732@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-17bpf: prevent out-of-bounds speculationAlexei Starovoitov
commit b2157399cc9898260d6031c5bfe45fe137c1fbe7 upstream. Under speculation, CPUs may mis-predict branches in bounds checks. Thus, memory accesses under a bounds check may be speculated even if the bounds check fails, providing a primitive for building a side channel. To avoid leaking kernel data round up array-based maps and mask the index after bounds check, so speculated load with out of bounds index will load either valid value from the array or zero from the padded area. Unconditionally mask index for all array types even when max_entries are not rounded to power of 2 for root user. When map is created by unpriv user generate a sequence of bpf insns that includes AND operation to make sure that JITed code includes the same 'index & index_mask' operation. If prog_array map is created by unpriv user replace bpf_tail_call(ctx, map, index); with if (index >= max_entries) { index &= map->index_mask; bpf_tail_call(ctx, map, index); } (along with roundup to power 2) to prevent out-of-bounds speculation. There is secondary redundant 'if (index >= max_entries)' in the interpreter and in all JITs, but they can be optimized later if necessary. Other array-like maps (cpumap, devmap, sockmap, perf_event_array, cgroup_array) cannot be used by unpriv, so no changes there. That fixes bpf side of "Variant 1: bounds check bypass (CVE-2017-5753)" on all architectures with and without JIT. v2->v3: Daniel noticed that attack potentially can be crafted via syscall commands without loading the program, so add masking to those paths as well. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Jiri Slaby <jslaby@suse.cz> [ Backported to 4.9 - gregkh ] Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-17net: stmmac: enable EEE in MII, GMII or RGMII onlyJerome Brunet
[ Upstream commit 879626e3a52630316d817cbda7cec9a5446d1d82 ] Note in the databook - Section 4.4 - EEE : " The EEE feature is not supported when the MAC is configured to use the TBI, RTBI, SMII, RMII or SGMII single PHY interface. Even if the MAC supports multiple PHY interfaces, you should activate the EEE mode only when the MAC is operating with GMII, MII, or RGMII interface." Applying this restriction solves a stability issue observed on Amlogic gxl platforms operating with RMII interface and the internal PHY. Fixes: 83bf79b6bb64 ("stmmac: disable at run-time the EEE if not supported") Signed-off-by: Jerome Brunet <jbrunet@baylibre.com> Tested-by: Arnaud Patard <arnaud.patard@rtp-net.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-17sh_eth: fix SH7757 GEther initializationSergei Shtylyov
[ Upstream commit 5133550296d43236439494aa955bfb765a89f615 ] Renesas SH7757 has 2 Fast and 2 Gigabit Ether controllers, while the 'sh_eth' driver can only reset and initialize TSU of the first controller pair. Shimoda-san tried to solve that adding the 'needs_init' member to the 'struct sh_eth_plat_data', however the platform code still never sets this flag. I think that we can infer this information from the 'devno' variable (set to 'platform_device::id') and reset/init the Ether controller pair only for an even 'devno'; therefore 'sh_eth_plat_data::needs_init' can be removed... Fixes: 150647fb2c31 ("net: sh_eth: change the condition of initialization") Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-10kernel: make groups_sort calling a responsibility group_info allocatorsThiago Rafael Becker
commit bdcf0a423ea1c40bbb40e7ee483b50fc8aa3d758 upstream. In testing, we found that nfsd threads may call set_groups in parallel for the same entry cached in auth.unix.gid, racing in the call of groups_sort, corrupting the groups for that entry and leading to permission denials for the client. This patch: - Make groups_sort globally visible. - Move the call to groups_sort to the modifiers of group_info - Remove the call to groups_sort from set_groups Link: http://lkml.kernel.org/r/20171211151420.18655-1-thiago.becker@gmail.com Signed-off-by: Thiago Rafael Becker <thiago.becker@gmail.com> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com> Reviewed-by: NeilBrown <neilb@suse.com> Acked-by: "J. Bruce Fields" <bfields@fieldses.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-10fscache: Fix the default for fscache_maybe_release_page()David Howells
commit 98801506552593c9b8ac11021b0cdad12cab4f6b upstream. Fix the default for fscache_maybe_release_page() for when the cookie isn't valid or the page isn't cached. It mustn't return false as that indicates the page cannot yet be freed. The problem with the default is that if, say, there's no cache, but a network filesystem's pages are using up almost all the available memory, a system can OOM because the filesystem ->releasepage() op will not allow them to be released as fscache_maybe_release_page() incorrectly prevents it. This can be tested by writing a sequence of 512MiB files to an AFS mount. It does not affect NFS or CIFS because both of those wrap the call in a check of PG_fscache and it shouldn't bother Ceph as that only has PG_private set whilst writeback is in progress. This might be an issue for 9P, however. Note that the pages aren't entirely stuck. Removing a file or unmounting will clear things because that uses ->invalidatepage() instead. Fixes: 201a15428bd5 ("FS-Cache: Handle pages pending storage that get evicted under OOM conditions") Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05KPTI: Rename to PAGE_TABLE_ISOLATIONKees Cook
This renames CONFIG_KAISER to CONFIG_PAGE_TABLE_ISOLATION. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05kaiser: vmstat show NR_KAISERTABLE as nr_overheadHugh Dickins
The kaiser update made an interesting choice, never to free any shadow page tables. Contention on global spinlock was worrying, particularly with it held across page table scans when freeing. Something had to be done: I was going to add refcounting; but simply never to free them is an appealing choice, minimizing contention without complicating the code (the more a page table is found already, the less the spinlock is used). But leaking pages in this way is also a worry: can we get away with it? At the very least, we need a count to show how bad it actually gets: in principle, one might end up wasting about 1/256 of memory that way (1/512 for when direct-mapped pages have to be user-mapped, plus 1/512 for when they are user-mapped from the vmalloc area on another occasion (but we don't have vmalloc'ed stacks, so only large ldts are vmalloc'ed). Add per-cpu stat NR_KAISERTABLE: including 256 at startup for the shared pgd entries, and 1 for each intermediate page table added thereafter for user-mapping - but leave out the 1 per mm, for its shadow pgd, because that distracts from the monotonic increase. Shown in /proc/vmstat as nr_overhead (0 if kaiser not enabled). In practice, it doesn't look so bad so far: more like 1/12000 after nine hours of gtests below; and movable pageblock segregation should tend to cluster the kaiser tables into a subset of the address space (if not, they will be bad for compaction too). But production may tell a different story: keep an eye on this number, and bring back lighter freeing if it gets out of control (maybe a shrinker). ["nr_overhead" should of course say "nr_kaisertable", if it needs to stay; but for the moment we are being coy, preferring that when Joe Blow notices a new line in his /proc/vmstat, he does not get too curious about what this "kaiser" stuff might be.] Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05kaiser: cleanups while trying for gold linkHugh Dickins
While trying to get our gold link to work, four cleanups: matched the gdt_page declaration to its definition; in fiddling unsuccessfully with PERCPU_INPUT(), lined up backslashes; lined up the backslashes according to convention in percpu-defs.h; deleted the unused irq_stack_pointer addition to irq_stack_union. Sad to report that aligning backslashes does not appear to help gold align to 8192: but while these did not help, they are worth keeping. Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05kaiser: stack map PAGE_SIZE at THREAD_SIZE-PAGE_SIZEHugh Dickins
Kaiser only needs to map one page of the stack; and kernel/fork.c did not build on powerpc (no __PAGE_KERNEL). It's all cleaner if linux/kaiser.h provides kaiser_map_thread_stack() and kaiser_unmap_thread_stack() wrappers around asm/kaiser.h's kaiser_add_mapping() and kaiser_remove_mapping(). And use linux/kaiser.h in init/main.c to avoid the #ifdefs there. Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05kaiser: merged updateDave Hansen
Merged fixes and cleanups, rebased to 4.9.51 tree (no 5-level paging). Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05KAISER: Kernel Address IsolationRichard Fellner
This patch introduces our implementation of KAISER (Kernel Address Isolation to have Side-channels Efficiently Removed), a kernel isolation technique to close hardware side channels on kernel address information. More information about the patch can be found on: https://github.com/IAIK/KAISER From: Richard Fellner <richard.fellner@student.tugraz.at> From: Daniel Gruss <daniel.gruss@iaik.tugraz.at> Subject: [RFC, PATCH] x86_64: KAISER - do not map kernel in user mode Date: Thu, 4 May 2017 14:26:50 +0200 Link: http://marc.info/?l=linux-kernel&m=149390087310405&w=2 Kaiser-4.10-SHA1: c4b1831d44c6144d3762ccc72f0c4e71a0c713e5 To: <linux-kernel@vger.kernel.org> To: <kernel-hardening@lists.openwall.com> Cc: <clementine.maurice@iaik.tugraz.at> Cc: <moritz.lipp@iaik.tugraz.at> Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at> Cc: Richard Fellner <richard.fellner@student.tugraz.at> Cc: Ingo Molnar <mingo@kernel.org> Cc: <kirill.shutemov@linux.intel.com> Cc: <anders.fogh@gdata-adan.de> After several recent works [1,2,3] KASLR on x86_64 was basically considered dead by many researchers. We have been working on an efficient but effective fix for this problem and found that not mapping the kernel space when running in user mode is the solution to this problem [4] (the corresponding paper [5] will be presented at ESSoS17). With this RFC patch we allow anybody to configure their kernel with the flag CONFIG_KAISER to add our defense mechanism. If there are any questions we would love to answer them. We also appreciate any comments! Cheers, Daniel (+ the KAISER team from Graz University of Technology) [1] http://www.ieee-security.org/TC/SP2013/papers/4977a191.pdf [2] https://www.blackhat.com/docs/us-16/materials/us-16-Fogh-Using-Undocumented-CPU-Behaviour-To-See-Into-Kernel-Mode-And-Break-KASLR-In-The-Process.pdf [3] https://www.blackhat.com/docs/us-16/materials/us-16-Jang-Breaking-Kernel-Address-Space-Layout-Randomization-KASLR-With-Intel-TSX.pdf [4] https://github.com/IAIK/KAISER [5] https://gruss.cc/files/kaiser.pdf [patch based also on https://raw.githubusercontent.com/IAIK/KAISER/master/KAISER/0001-KAISER-Kernel-Address-Isolation.patch] Signed-off-by: Richard Fellner <richard.fellner@student.tugraz.at> Signed-off-by: Moritz Lipp <moritz.lipp@iaik.tugraz.at> Signed-off-by: Daniel Gruss <daniel.gruss@iaik.tugraz.at> Signed-off-by: Michael Schwarz <michael.schwarz@iaik.tugraz.at> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>