| Age | Commit message (Collapse) | Author |
|
commit fc2798502f860b18f3c7121e4dc659d3d9d28d74 upstream.
These interfaces:
pcibios_resource_to_bus(struct pci_dev *dev, *bus_region, *resource)
pcibios_bus_to_resource(struct pci_dev *dev, *resource, *bus_region)
took a pci_dev, but they really depend only on the pci_bus. And we want to
use them in resource allocation paths where we have the bus but not a
device, so this patch converts them to take the pci_bus instead of the
pci_dev:
pcibios_resource_to_bus(struct pci_bus *bus, *bus_region, *resource)
pcibios_bus_to_resource(struct pci_bus *bus, *resource, *bus_region)
In fact, with standard PCI-PCI bridges, they only depend on the host
bridge, because that's the only place address translation occurs, but
we aren't going that far yet.
[bhelgaas: changelog]
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: Dirk Behme <dirk.behme@gmail.com>
[lizf: Backported to 3.4:
- make changes to pci_host_bridge() instead of find_pci_root_bus()
- adjust context]
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit bba0bdd7ad4713d82338bcd9b72d57e9335a664b upstream.
SCSI transport drivers and SCSI LLDs block a SCSI device if the
transport layer is not operational. This means that in this state
no requests should be processed, even if the REQ_PREEMPT flag has
been set. This patch avoids that a rescan shortly after a cable
pull sporadically triggers the following kernel oops:
BUG: unable to handle kernel paging request at ffffc9001a6bc084
IP: [<ffffffffa04e08f2>] mlx4_ib_post_send+0xd2/0xb30 [mlx4_ib]
Process rescan-scsi-bus (pid: 9241, threadinfo ffff88053484a000, task ffff880534aae100)
Call Trace:
[<ffffffffa0718135>] srp_post_send+0x65/0x70 [ib_srp]
[<ffffffffa071b9df>] srp_queuecommand+0x1cf/0x3e0 [ib_srp]
[<ffffffffa0001ff1>] scsi_dispatch_cmd+0x101/0x280 [scsi_mod]
[<ffffffffa0009ad1>] scsi_request_fn+0x411/0x4d0 [scsi_mod]
[<ffffffff81223b37>] __blk_run_queue+0x27/0x30
[<ffffffff8122a8d2>] blk_execute_rq_nowait+0x82/0x110
[<ffffffff8122a9c2>] blk_execute_rq+0x62/0xf0
[<ffffffffa000b0e8>] scsi_execute+0xe8/0x190 [scsi_mod]
[<ffffffffa000b2f3>] scsi_execute_req+0xa3/0x130 [scsi_mod]
[<ffffffffa000c1aa>] scsi_probe_lun+0x17a/0x450 [scsi_mod]
[<ffffffffa000ce86>] scsi_probe_and_add_lun+0x156/0x480 [scsi_mod]
[<ffffffffa000dc2f>] __scsi_scan_target+0xdf/0x1f0 [scsi_mod]
[<ffffffffa000dfa3>] scsi_scan_host_selected+0x183/0x1c0 [scsi_mod]
[<ffffffffa000edfb>] scsi_scan+0xdb/0xe0 [scsi_mod]
[<ffffffffa000ee13>] store_scan+0x13/0x20 [scsi_mod]
[<ffffffff811c8d9b>] sysfs_write_file+0xcb/0x160
[<ffffffff811589de>] vfs_write+0xce/0x140
[<ffffffff81158b53>] sys_write+0x53/0xa0
[<ffffffff81464592>] system_call_fastpath+0x16/0x1b
[<00007f611c9d9300>] 0x7f611c9d92ff
Reported-by: Max Gurtuvoy <maxg@mellanox.com>
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <JBottomley@Odin.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 6ee8e25fc3e916193bce4ebb43d5439e1e2144ab upstream.
Commit e9fd702a58c4 ("audit: convert audit watches to use fsnotify
instead of inotify") broke handling of renames in audit. Audit code
wants to update inode number of an inode corresponding to watched name
in a directory. When something gets renamed into a directory to a
watched name, inotify previously passed moved inode to audit code
however new fsnotify code passes directory inode where the change
happened. That confuses audit and it starts watching parent directory
instead of a file in a directory.
This can be observed for example by doing:
cd /tmp
touch foo bar
auditctl -w /tmp/foo
touch foo
mv bar foo
touch foo
In audit log we see events like:
type=CONFIG_CHANGE msg=audit(1423563584.155:90): auid=1000 ses=2 op="updated rules" path="/tmp/foo" key=(null) list=4 res=1
...
type=PATH msg=audit(1423563584.155:91): item=2 name="bar" inode=1046884 dev=08:0 2 mode=0100644 ouid=0 ogid=0 rdev=00:00 nametype=DELETE
type=PATH msg=audit(1423563584.155:91): item=3 name="foo" inode=1046842 dev=08:0 2 mode=0100644 ouid=0 ogid=0 rdev=00:00 nametype=DELETE
type=PATH msg=audit(1423563584.155:91): item=4 name="foo" inode=1046884 dev=08:0 2 mode=0100644 ouid=0 ogid=0 rdev=00:00 nametype=CREATE
...
and that's it - we see event for the first touch after creating the
audit rule, we see events for rename but we don't see any event for the
last touch. However we start seeing events for unrelated stuff
happening in /tmp.
Fix the problem by passing moved inode as data in the FS_MOVED_FROM and
FS_MOVED_TO events instead of the directory where the change happens.
This doesn't introduce any new problems because noone besides
audit_watch.c cares about the passed value:
fs/notify/fanotify/fanotify.c cares only about FSNOTIFY_EVENT_PATH events.
fs/notify/dnotify/dnotify.c doesn't care about passed 'data' value at all.
fs/notify/inotify/inotify_fsnotify.c uses 'data' only for FSNOTIFY_EVENT_PATH.
kernel/audit_tree.c doesn't care about passed 'data' at all.
kernel/audit_watch.c expects moved inode as 'data'.
Fixes: e9fd702a58c49db ("audit: convert audit watches to use fsnotify instead of inotify")
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 074f9dd55f9cab1b82690ed7e44bcf38b9616ce0 upstream.
Currently the USB stack assumes that all host controller drivers are
capable of receiving wakeup requests from downstream devices.
However, this isn't true for the isp1760-hcd driver, which means that
it isn't safe to do a runtime suspend of any device attached to a
root-hub port if the device requires wakeup.
This patch adds a "cant_recv_wakeups" flag to the usb_hcd structure
and sets the flag in isp1760-hcd. The core is modified to prevent a
direct child of the root hub from being put into runtime suspend with
wakeup enabled if the flag is set.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Tested-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Greg Kroah-Hartman <greg@kroah.com>
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 5efd2ea8c9f4f12916ffc8ba636792ce052f6911 upstream.
the following error pops up during "testusb -a -t 10"
| musb-hdrc musb-hdrc.1.auto: dma_pool_free buffer-128, f134e000/be842000 (bad dma)
hcd_buffer_create() creates a few buffers, the smallest has 32 bytes of
size. ARCH_KMALLOC_MINALIGN is set to 64 bytes. This combo results in
hcd_buffer_alloc() returning memory which is 32 bytes aligned and it
might by identified by buffer_offset() as another buffer. This means the
buffer which is on a 32 byte boundary will not get freed, instead it
tries to free another buffer with the error message.
This patch fixes the issue by creating the smallest DMA buffer with the
size of ARCH_KMALLOC_MINALIGN (or 32 in case ARCH_KMALLOC_MINALIGN is
smaller). This might be 32, 64 or even 128 bytes. The next three pools
will have the size 128, 512 and 2048.
In case the smallest pool is 128 bytes then we have only three pools
instead of four (and zero the first entry in the array).
The last pool size is always 2048 bytes which is the assumed PAGE_SIZE /
2 of 4096. I doubt it makes sense to continue using PAGE_SIZE / 2 where
we would end up with 8KiB buffer in case we have 16KiB pages.
Instead I think it makes sense to have a common size(s) and extend them
if there is need to.
There is a BUILD_BUG_ON() now in case someone has a minalign of more than
128 bytes.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 4161b4505f1690358ac0a9ee59845a7887336b21 upstream.
When ak4114 work calls its callback and the callback invokes
ak4114_reinit(), it stalls due to flush_delayed_work(). For avoiding
this, control the reentrance by introducing a refcount. Also
flush_delayed_work() is replaced with cancel_delayed_work_sync().
The exactly same bug is present in ak4113.c and fixed as well.
Reported-by: Pavel Hofman <pavel.hofman@ivitera.com>
Acked-by: Jaroslav Kysela <perex@perex.cz>
Tested-by: Pavel Hofman <pavel.hofman@ivitera.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
[lizf: Backported to 3.4: snd_ak4113_reinit() and snd_ak4114_reinit()
used flush_delayed_work_sync() instead of flush_delayed_work()]
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 946e51f2bf37f1656916eb75bd0742ba33983c28 upstream.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[bwh: Backported to 3.2:
- Apply name changes in all the different places we use d_alias and d_child
- Move the WARN_ON() in __d_free() to d_free() as we don't have dentry_free()]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
[lizf: Backported to 3.4:
- adjust context
- need one more name change in debugfs]
|
|
commit 9de7922bc709eee2f609cd01d98aaedc4cf5ea74 upstream.
Commit 6f4c618ddb0 ("SCTP : Add paramters validity check for
ASCONF chunk") added basic verification of ASCONF chunks, however,
it is still possible to remotely crash a server by sending a
special crafted ASCONF chunk, even up to pre 2.6.12 kernels:
skb_over_panic: text:ffffffffa01ea1c3 len:31056 put:30768
head:ffff88011bd81800 data:ffff88011bd81800 tail:0x7950
end:0x440 dev:<NULL>
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:129!
[...]
Call Trace:
<IRQ>
[<ffffffff8144fb1c>] skb_put+0x5c/0x70
[<ffffffffa01ea1c3>] sctp_addto_chunk+0x63/0xd0 [sctp]
[<ffffffffa01eadaf>] sctp_process_asconf+0x1af/0x540 [sctp]
[<ffffffff8152d025>] ? _read_unlock_bh+0x15/0x20
[<ffffffffa01e0038>] sctp_sf_do_asconf+0x168/0x240 [sctp]
[<ffffffffa01e3751>] sctp_do_sm+0x71/0x1210 [sctp]
[<ffffffff8147645d>] ? fib_rules_lookup+0xad/0xf0
[<ffffffffa01e6b22>] ? sctp_cmp_addr_exact+0x32/0x40 [sctp]
[<ffffffffa01e8393>] sctp_assoc_bh_rcv+0xd3/0x180 [sctp]
[<ffffffffa01ee986>] sctp_inq_push+0x56/0x80 [sctp]
[<ffffffffa01fcc42>] sctp_rcv+0x982/0xa10 [sctp]
[<ffffffffa01d5123>] ? ipt_local_in_hook+0x23/0x28 [iptable_filter]
[<ffffffff8148bdc9>] ? nf_iterate+0x69/0xb0
[<ffffffff81496d10>] ? ip_local_deliver_finish+0x0/0x2d0
[<ffffffff8148bf86>] ? nf_hook_slow+0x76/0x120
[<ffffffff81496d10>] ? ip_local_deliver_finish+0x0/0x2d0
[<ffffffff81496ded>] ip_local_deliver_finish+0xdd/0x2d0
[<ffffffff81497078>] ip_local_deliver+0x98/0xa0
[<ffffffff8149653d>] ip_rcv_finish+0x12d/0x440
[<ffffffff81496ac5>] ip_rcv+0x275/0x350
[<ffffffff8145c88b>] __netif_receive_skb+0x4ab/0x750
[<ffffffff81460588>] netif_receive_skb+0x58/0x60
This can be triggered e.g., through a simple scripted nmap
connection scan injecting the chunk after the handshake, for
example, ...
-------------- INIT[ASCONF; ASCONF_ACK] ------------->
<----------- INIT-ACK[ASCONF; ASCONF_ACK] ------------
-------------------- COOKIE-ECHO -------------------->
<-------------------- COOKIE-ACK ---------------------
------------------ ASCONF; UNKNOWN ------------------>
... where ASCONF chunk of length 280 contains 2 parameters ...
1) Add IP address parameter (param length: 16)
2) Add/del IP address parameter (param length: 255)
... followed by an UNKNOWN chunk of e.g. 4 bytes. Here, the
Address Parameter in the ASCONF chunk is even missing, too.
This is just an example and similarly-crafted ASCONF chunks
could be used just as well.
The ASCONF chunk passes through sctp_verify_asconf() as all
parameters passed sanity checks, and after walking, we ended
up successfully at the chunk end boundary, and thus may invoke
sctp_process_asconf(). Parameter walking is done with
WORD_ROUND() to take padding into account.
In sctp_process_asconf()'s TLV processing, we may fail in
sctp_process_asconf_param() e.g., due to removal of the IP
address that is also the source address of the packet containing
the ASCONF chunk, and thus we need to add all TLVs after the
failure to our ASCONF response to remote via helper function
sctp_add_asconf_response(), which basically invokes a
sctp_addto_chunk() adding the error parameters to the given
skb.
When walking to the next parameter this time, we proceed
with ...
length = ntohs(asconf_param->param_hdr.length);
asconf_param = (void *)asconf_param + length;
... instead of the WORD_ROUND()'ed length, thus resulting here
in an off-by-one that leads to reading the follow-up garbage
parameter length of 12336, and thus throwing an skb_over_panic
for the reply when trying to sctp_addto_chunk() next time,
which implicitly calls the skb_put() with that length.
Fix it by using sctp_walk_params() [ which is also used in
INIT parameter processing ] macro in the verification *and*
in ASCONF processing: it will make sure we don't spill over,
that we walk parameters WORD_ROUND()'ed. Moreover, we're being
more defensive and guard against unknown parameter types and
missized addresses.
Joint work with Vlad Yasevich.
Fixes: b896b82be4ae ("[SCTP] ADDIP: Support for processing incoming ASCONF_ACK chunks.")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Vlad Yasevich <vyasevich@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit b69040d8e39f20d5215a03502a8e8b4c6ab78395 upstream.
When receiving a e.g. semi-good formed connection scan in the
form of ...
-------------- INIT[ASCONF; ASCONF_ACK] ------------->
<----------- INIT-ACK[ASCONF; ASCONF_ACK] ------------
-------------------- COOKIE-ECHO -------------------->
<-------------------- COOKIE-ACK ---------------------
---------------- ASCONF_a; ASCONF_b ----------------->
... where ASCONF_a equals ASCONF_b chunk (at least both serials
need to be equal), we panic an SCTP server!
The problem is that good-formed ASCONF chunks that we reply with
ASCONF_ACK chunks are cached per serial. Thus, when we receive a
same ASCONF chunk twice (e.g. through a lost ASCONF_ACK), we do
not need to process them again on the server side (that was the
idea, also proposed in the RFC). Instead, we know it was cached
and we just resend the cached chunk instead. So far, so good.
Where things get nasty is in SCTP's side effect interpreter, that
is, sctp_cmd_interpreter():
While incoming ASCONF_a (chunk = event_arg) is being marked
!end_of_packet and !singleton, and we have an association context,
we do not flush the outqueue the first time after processing the
ASCONF_ACK singleton chunk via SCTP_CMD_REPLY. Instead, we keep it
queued up, although we set local_cork to 1. Commit 2e3216cd54b1
changed the precedence, so that as long as we get bundled, incoming
chunks we try possible bundling on outgoing queue as well. Before
this commit, we would just flush the output queue.
Now, while ASCONF_a's ASCONF_ACK sits in the corked outq, we
continue to process the same ASCONF_b chunk from the packet. As
we have cached the previous ASCONF_ACK, we find it, grab it and
do another SCTP_CMD_REPLY command on it. So, effectively, we rip
the chunk->list pointers and requeue the same ASCONF_ACK chunk
another time. Since we process ASCONF_b, it's correctly marked
with end_of_packet and we enforce an uncork, and thus flush, thus
crashing the kernel.
Fix it by testing if the ASCONF_ACK is currently pending and if
that is the case, do not requeue it. When flushing the output
queue we may relink the chunk for preparing an outgoing packet,
but eventually unlink it when it's copied into the skb right
before transmission.
Joint work with Vlad Yasevich.
Fixes: 2e3216cd54b1 ("sctp: Follow security requirement of responding with 1 packet")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 33692f27597fcab536d7cbbcc8f52905133e4aa7 upstream.
The core VM already knows about VM_FAULT_SIGBUS, but cannot return a
"you should SIGSEGV" error, because the SIGSEGV case was generally
handled by the caller - usually the architecture fault handler.
That results in lots of duplication - all the architecture fault
handlers end up doing very similar "look up vma, check permissions, do
retries etc" - but it generally works. However, there are cases where
the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV.
In particular, when accessing the stack guard page, libsigsegv expects a
SIGSEGV. And it usually got one, because the stack growth is handled by
that duplicated architecture fault handler.
However, when the generic VM layer started propagating the error return
from the stack expansion in commit fee7e49d4514 ("mm: propagate error
from stack expansion even for guard page"), that now exposed the
existing VM_FAULT_SIGBUS result to user space. And user space really
expected SIGSEGV, not SIGBUS.
To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those
duplicate architecture fault handlers about it. They all already have
the code to handle SIGSEGV, so it's about just tying that new return
value to the existing code, but it's all a bit annoying.
This is the mindless minimal patch to do this. A more extensive patch
would be to try to gather up the mostly shared fault handling logic into
one generic helper routine, and long-term we really should do that
cleanup.
Just from this patch, you can generally see that most architectures just
copied (directly or indirectly) the old x86 way of doing things, but in
the meantime that original x86 model has been improved to hold the VM
semaphore for shorter times etc and to handle VM_FAULT_RETRY and other
"newer" things, so it would be a good idea to bring all those
improvements to the generic case and teach other architectures about
them too.
Reported-and-tested-by: Takashi Iwai <tiwai@suse.de>
Tested-by: Jan Engelhardt <jengelh@inai.de>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots"
Cc: linux-arch@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2:
- Adjust filenames, context
- Drop arc, metag, nios2 and lustre changes
- For sh, patch both 32-bit and 64-bit implementations to use goto bad_area
- For s390, pass int_code and trans_exc_code as arguments to do_no_context()
and do_sigsegv()]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
[lizf: Backported to 3.4:
- adjust context in arch/power/mm/fault.c
- apply the original change in upstream commit for s390]
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit cd83ce9e6195aa3ea15ab4db92892802c20df5d0 upstream.
This patch adds a usb quirk to support devices with interupt endpoints
and bInterval values expressed as microframes. The quirk causes the
parse endpoint function to modify the reported bInterval to a standards
conforming value.
There is currently code in the endpoint parser that checks for
bIntervals that are outside of the valid range (1-16 for USB 2+ high
speed and super speed interupt endpoints). In this case, the code assumes
the bInterval is being reported in 1ms frames. As well, the correction
is only applied if the original bInterval value is out of the 1-16 range.
With this quirk applied to the device, the bInterval will be
accurately adjusted from microframes to an exponent.
Signed-off-by: James P Michels III <james.p.michels@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit b14bf2d0c0358140041d1c1805a674376964d0e0 upstream.
Some buggy JMicron USB-ATA bridges don't know how to translate the FUA
bit in READs or WRITEs. This patch adds an entry in unusual_devs.h
and a blacklist flag to tell the sd driver not to use FUA.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Reported-by: Michael Büsch <m@bues.ch>
Tested-by: Michael Büsch <m@bues.ch>
Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com>
CC: Matthew Dharm <mdharm-usb@one-eyed-alien.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 72dd299d5039a336493993dcc63413cf31d0e662 upstream.
Ronny reports: https://bugzilla.kernel.org/show_bug.cgi?id=87101
"Since commit 8a4aeec8d "libata/ahci: accommodate tag ordered
controllers" the access to the harddisk on the first SATA-port is
failing on its first access. The access to the harddisk on the
second port is working normal.
When reverting the above commit, access to both harddisks is working
fine again."
Maintain tag ordered submission as the default, but allow sata_sil24 to
continue with the old behavior.
Cc: Tejun Heo <tj@kernel.org>
Reported-by: Ronny Hegewald <Ronny.Hegewald@online.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 39ef311204941ddd01ea2950d6220c8ccc710d15 upstream.
device_create_groups lets callers create devices as well as associated
sysfs attributes with a single call. This avoids race conditions seen
if sysfs attributes on new devices are created later.
[fixed up comment block placement and add checks for printk buffer
formats - gregkh]
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Cc: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit f2f37f58b1b933b06d6d84e80a31a1b500fb0db2 upstream.
To make it easier for driver subsystems to work with attribute groups,
create the ATTRIBUTE_GROUPS macro to remove some of the repetitive
typing for the most common use for attribute groups.
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 2d6d7f98284648c5ed113fe22a132148950b140f upstream.
Tejun, while reviewing the code, spotted the following race condition
between the dirtying and truncation of a page:
__set_page_dirty_nobuffers() __delete_from_page_cache()
if (TestSetPageDirty(page))
page->mapping = NULL
if (PageDirty())
dec_zone_page_state(page, NR_FILE_DIRTY);
dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
if (page->mapping)
account_page_dirtied(page)
__inc_zone_page_state(page, NR_FILE_DIRTY);
__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE.
Dirtiers usually lock out truncation, either by holding the page lock
directly, or in case of zap_pte_range(), by pinning the mapcount with
the page table lock held. The notable exception to this rule, though,
is do_wp_page(), for which this race exists. However, do_wp_page()
already waits for a locked page to unlock before setting the dirty bit,
in order to prevent a race where clear_page_dirty() misses the page bit
in the presence of dirty ptes. Upgrade that wait to a fully locked
set_page_dirty() to also cover the situation explained above.
Afterwards, the code in set_page_dirty() dealing with a truncation race
is no longer needed. Remove it.
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[lizf: Backported to 3.4:
- adjust context
- use VM_BUG_ON() instead of VM_BUG_ON_PAGE()]
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 7a3ef208e662f4b63d43a23f61a64a129c525bbc upstream.
Constantly forking task causes unlimited grow of anon_vma chain. Each
next child allocates new level of anon_vmas and links vma to all
previous levels because pages might be inherited from any level.
This patch adds heuristic which decides to reuse existing anon_vma
instead of forking new one. It adds counter anon_vma->degree which
counts linked vmas and directly descending anon_vmas and reuses anon_vma
if counter is lower than two. As a result each anon_vma has either vma
or at least two descending anon_vmas. In such trees half of nodes are
leafs with alive vmas, thus count of anon_vmas is no more than two times
bigger than count of vmas.
This heuristic reuses anon_vmas as few as possible because each reuse
adds false aliasing among vmas and rmap walker ought to scan more ptes
when it searches where page is might be mapped.
Link: http://lkml.kernel.org/r/20120816024610.GA5350@evergreen.ssec.wisc.edu
Fixes: 5beb49305251 ("mm: change anon_vma linking to fix multi-process server scalability issue")
[akpm@linux-foundation.org: fix typo, per Rik]
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Reported-by: Daniel Forrest <dan.forrest@ssec.wisc.edu>
Tested-by: Michal Hocko <mhocko@suse.cz>
Tested-by: Jerome Marchand <jmarchan@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 6ada1fc0e1c4775de0e043e1bd3ae9d065491aa5 upstream.
An unvalidated user input is multiplied by a constant, which can result in
an undefined behaviour for large values. While this is validated later,
we should avoid triggering undefined behaviour.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
[jstultz: include trivial milisecond->microsecond correction noticed
by Andy]
Signed-off-by: John Stultz <john.stultz@linaro.org>
[lizf: Backported to 3.4: adjust filename]
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit fee7e49d45149fba60156f5b59014f764d3e3728 upstream.
Jay Foad reports that the address sanitizer test (asan) sometimes gets
confused by a stack pointer that ends up being outside the stack vma
that is reported by /proc/maps.
This happens due to an interaction between RLIMIT_STACK and the guard
page: when we do the guard page check, we ignore the potential error
from the stack expansion, which effectively results in a missing guard
page, since the expected stack expansion won't have been done.
And since /proc/maps explicitly ignores the guard page (commit
d7824370e263: "mm: fix up some user-visible effects of the stack guard
page"), the stack pointer ends up being outside the reported stack area.
This is the minimal patch: it just propagates the error. It also
effectively makes the guard page part of the stack limit, which in turn
measn that the actual real stack is one page less than the stack limit.
Let's see if anybody notices. We could teach acct_stack_growth() to
allow an extra page for a grow-up/grow-down stack in the rlimit test,
but I don't want to add more complexity if it isn't needed.
Reported-and-tested-by: Jay Foad <jay.foad@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 9bffdca8c64a72ac54c47a552734ab457bc720d4 upstream.
Use dev_to_virtio wrapper in virtio to make code clearly.
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit c291ee622165cb2c8d4e7af63fffd499354a23be upstream.
Since the rework of the sparse interrupt code to actually free the
unused interrupt descriptors there exists a race between the /proc
interfaces to the irq subsystem and the code which frees the interrupt
descriptor.
CPU0 CPU1
show_interrupts()
desc = irq_to_desc(X);
free_desc(desc)
remove_from_radix_tree();
kfree(desc);
raw_spinlock_irq(&desc->lock);
/proc/interrupts is the only interface which can actively corrupt
kernel memory via the lock access. /proc/stat can only read from freed
memory. Extremly hard to trigger, but possible.
The interfaces in /proc/irq/N/ are not affected by this because the
removal of the proc file is serialized in procfs against concurrent
readers/writers. The removal happens before the descriptor is freed.
For architectures which have CONFIG_SPARSE_IRQ=n this is a non issue
as the descriptor is never freed. It's merely cleared out with the irq
descriptor lock held. So any concurrent proc access will either see
the old correct value or the cleared out ones.
Protect the lookup and access to the irq descriptor in
show_interrupts() with the sparse_irq_lock.
Provide kstat_irqs_usr() which is protecting the lookup and access
with sparse_irq_lock and switch /proc/stat to use it.
Document the existing kstat_irqs interfaces so it's clear that the
caller needs to take care about protection. The users of these
interfaces are either not affected due to SPARSE_IRQ=n or already
protected against removal.
Fixes: 1f5a5b87f78f "genirq: Implement a sane sparse_irq allocator"
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[lizf: Backported to 3.4:
- define kstat_irqs() for CONFIG_GENERIC_HARDIRQS
- add ifdef/endif CONFIG_SPARSE_IRQ]
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 5188cd44c55db3e92cd9e77a40b5baa7ed4340f7 upstream.
UFO is now disabled on all drivers that work with virtio net headers,
but userland may try to send UFO/IPv6 packets anyway. Instead of
sending with ID=0, we should select identifiers on their behalf (as we
used to).
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Fixes: 916e4cf46d02 ("ipv6: reuse ip6_frag_id from ip6_ufo_append_data")
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: For 3.2, net/ipv6/output_core.c is a completely new file]
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit a87fa1d81a9fb5e9adca9820e16008c40ad09f33 upstream.
The string property read helpers will run off the end of the buffer if
it is handed a malformed string property. Rework the parsers to make
sure that doesn't happen. At the same time add new test cases to make
sure the functions behave themselves.
The original implementations of of_property_read_string_index() and
of_property_count_strings() both open-coded the same block of parsing
code, each with it's own subtly different bugs. The fix here merges
functions into a single helper and makes the original functions static
inline wrappers around the helper.
One non-bugfix aspect of this patch is the addition of a new wrapper,
of_property_read_string_array(). The new wrapper is needed by the
device_properties feature that Rafael is working on and planning to
merge for v3.19. The implementation is identical both with and without
the new static inline wrapper, so it just got left in to reduce the
churn on the header file.
Signed-off-by: Grant Likely <grant.likely@linaro.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Darren Hart <darren.hart@intel.com>
[lizf: Backported to 3.4:
- adjust context
- drop selftest hunks that don't apply]
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 2a159389bf5d962359349a76827b2f683276a1c7 upstream.
Add new quirk for devices that cannot handle requests for the
device_qualifier descriptor.
A USB-2.0 compliant device must respond to requests for the
device_qualifier descriptor (even if it's with a request error), but at
least one device is known to misbehave after such a request.
Suggested-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 6d50e60cd2edb5a57154db5a6f64eef5aa59b751 upstream.
If an anonymous mapping is not allowed to fault thp memory and then
madvise(MADV_HUGEPAGE) is used after fault, khugepaged will never
collapse this memory into thp memory.
This occurs because the madvise(2) handler for thp, hugepage_madvise(),
clears VM_NOHUGEPAGE on the stack and it isn't stored in vma->vm_flags
until the final action of madvise_behavior(). This causes the
khugepaged_enter_vma_merge() to be a no-op in hugepage_madvise() when
the vma had previously had VM_NOHUGEPAGE set.
Fix this by passing the correct vma flags to the khugepaged mm slot
handler. There's no chance khugepaged can run on this vma until after
madvise_behavior() returns since we hold mm->mmap_sem.
It would be possible to clear VM_NOHUGEPAGE directly from vma->vm_flags
in hugepage_advise(), but I didn't want to introduce special case
behavior into madvise_behavior(). I think it's best to just let it
always set vma->vm_flags itself.
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Suleiman Souhlal <suleiman@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 8c3e434769b1707fd2d24de5a2eb25fedc634c4a upstream.
0x4c6e is a secondary device id so should not be used
by the driver.
Noticed-by: Mark Kettenis <mark.kettenis@xs4all.nl>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 5695be142e203167e3cb515ef86a88424f3524eb upstream.
PM freezer relies on having all tasks frozen by the time devices are
getting frozen so that no task will touch them while they are getting
frozen. But OOM killer is allowed to kill an already frozen task in
order to handle OOM situtation. In order to protect from late wake ups
OOM killer is disabled after all tasks are frozen. This, however, still
keeps a window open when a killed task didn't manage to die by the time
freeze_processes finishes.
Reduce the race window by checking all tasks after OOM killer has been
disabled. This is still not race free completely unfortunately because
oom_killer_disable cannot stop an already ongoing OOM killer so a task
might still wake up from the fridge and get killed without
freeze_processes noticing. Full synchronization of OOM and freezer is,
however, too heavy weight for this highly unlikely case.
Introduce and check oom_kills counter which gets incremented early when
the allocator enters __alloc_pages_may_oom path and only check all the
tasks if the counter changes during the freezing attempt. The counter
is updated so early to reduce the race window since allocator checked
oom_killer_disabled which is set by PM-freezing code. A false positive
will push the PM-freezer into a slow path but that is not a big deal.
Changes since v1
- push the re-check loop out of freeze_processes into
check_frozen_processes and invert the condition to make the code more
readable as per Rafael
Fixes: f660daac474c6f (oom: thaw threads if oom killed thread is frozen before deferring)
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 0c740d0afc3bff0a097ad03a1c8df92757516f5c upstream.
while_each_thread() and next_thread() should die, almost every lockless
usage is wrong.
1. Unless g == current, the lockless while_each_thread() is not safe.
while_each_thread(g, t) can loop forever if g exits, next_thread()
can't reach the unhashed thread in this case. Note that this can
happen even if g is the group leader, it can exec.
2. Even if while_each_thread() itself was correct, people often use
it wrongly.
It was never safe to just take rcu_read_lock() and loop unless
you verify that pid_alive(g) == T, even the first next_thread()
can point to the already freed/reused memory.
This patch adds signal_struct->thread_head and task->thread_node to
create the normal rcu-safe list with the stable head. The new
for_each_thread(g, t) helper is always safe under rcu_read_lock() as
long as this task_struct can't go away.
Note: of course it is ugly to have both task_struct->thread_node and the
old task_struct->thread_group, we will kill it later, after we change
the users of while_each_thread() to use for_each_thread().
Perhaps we can kill it even before we convert all users, we can
reimplement next_thread(t) using the new thread_head/thread_node. But
we can't do this right now because this will lead to subtle behavioural
changes. For example, do/while_each_thread() always sees at least one
task, while for_each_thread() can do nothing if the whole thread group
has died. Or thread_group_empty(), currently its semantics is not clear
unless thread_group_leader(p) and we need to audit the callers before we
can change it.
So this patch adds the new interface which has to coexist with the old
one for some time, hopefully the next changes will be more or less
straightforward and the old one will go away soon.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Sergey Dyasly <dserrg@gmail.com>
Tested-by: Sergey Dyasly <dserrg@gmail.com>
Reviewed-by: Sameer Nanda <snanda@chromium.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mandeep Singh Baines <msb@chromium.org>
Cc: "Ma, Xindong" <xindong.ma@intel.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: "Tu, Xiaobing" <xiaobing.tu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit d4c5efdb97773f59a2b711754ca0953f24516739 upstream.
zatimend has reported that in his environment (3.16/gcc4.8.3/corei7)
memset() calls which clear out sensitive data in extract_{buf,entropy,
entropy_user}() in random driver are being optimized away by gcc.
Add a helper memzero_explicit() (similarly as explicit_bzero() variants)
that can be used in such cases where a variable with sensitive data is
being cleared out in the end. Other use cases might also be in crypto
code. [ I have put this into lib/string.c though, as it's always built-in
and doesn't need any dependencies then. ]
Fixes kernel bugzilla: 82041
Reported-by: zatimend@hotmail.co.uk
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
[lizf: Backported to 3.4:
- adjust context
- another memset() in extract_buf() needs to be converted]
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit fe8c8a126806fea4465c43d62a1f9d273a572bf5 upstream.
[Only use the compiler.h portion of this patch, to get the
OPTIMIZER_HIDE_VAR() macro, which we need for other -stable patches
- gregkh]
Disabling compiler optimizations can be fragile, since a new
optimization could be added to -O0 or -Os that breaks the assumptions
the code is making.
Instead of disabling compiler optimizations, use a dummy inline assembly
(based on RELOC_HIDE) to block the problematic kinds of optimization,
while still allowing other optimizations to be applied to the code.
The dummy inline assembly is added after every OR, and has the
accumulator variable as its input and output. The compiler is forced to
assume that the dummy inline assembly could both depend on the
accumulator variable and change the accumulator variable, so it is
forced to compute the value correctly before the inline assembly, and
cannot assume anything about its value after the inline assembly.
This change should be enough to make crypto_memneq work correctly (with
data-independent timing) even if it is inlined at its call sites. That
can be done later in a followup patch.
Compile-tested on x86_64.
Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.eti.br>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 90a8020278c1598fafd071736a0846b38510309c upstream.
->page_mkwrite() is used by filesystems to allocate blocks under a page
which is becoming writeably mmapped in some process' address space. This
allows a filesystem to return a page fault if there is not enough space
available, user exceeds quota or similar problem happens, rather than
silently discarding data later when writepage is called.
However VFS fails to call ->page_mkwrite() in all the cases where
filesystems need it when blocksize < pagesize. For example when
blocksize = 1024, pagesize = 4096 the following is problematic:
ftruncate(fd, 0);
pwrite(fd, buf, 1024, 0);
map = mmap(NULL, 1024, PROT_WRITE, MAP_SHARED, fd, 0);
map[0] = 'a'; ----> page_mkwrite() for index 0 is called
ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
mremap(map, 1024, 10000, 0);
map[4095] = 'a'; ----> no page_mkwrite() called
At the moment ->page_mkwrite() is called, filesystem can allocate only
one block for the page because i_size == 1024. Otherwise it would create
blocks beyond i_size which is generally undesirable. But later at
->writepage() time, we also need to store data at offset 4095 but we
don't have block allocated for it.
This patch introduces a helper function filesystems can use to have
->page_mkwrite() called at all the necessary moments.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
[lizf: Backported to 3.4:
- adjust context
- truncate_setsize() already has an oldsize variable]
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 71458cfc782eafe4b27656e078d379a34e472adf upstream.
We're missing include/linux/compiler-gcc5.h which is required now
because gcc branched off to v5 in trunk.
Just copy the relevant bits out of include/linux/compiler-gcc4.h,
no new code is added as of now.
This fixes a build error when using gcc 5.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit b8839b8c55f3fdd60dc36abcda7e0266aff7985c upstream.
The math in both blk_stack_limits() and queue_limit_alignment_offset()
assume that a block device's io_min (aka minimum_io_size) is always a
power-of-2. Fix the math such that it works for non-power-of-2 io_min.
This issue (of alignment_offset != 0) became apparent when testing
dm-thinp with a thinp blocksize that matches a RAID6 stripesize of
1280K. Commit fdfb4c8c1 ("dm thin: set minimum_io_size to pool's data
block size") unlocked the potential for alignment_offset != 0 due to
the dm-thin-pool's io_min possibly being a non-power-of-2.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit ddbe1fca0bcb87ca8c199ea873a456ca8a948567 upstream.
This full-speed USB device generates spurious remote wakeup event
as soon as USB_DEVICE_REMOTE_WAKEUP feature is set. As the result,
Linux can't enter system suspend and S0ix power saving modes once
this keyboard is used.
This patch tries to introduce USB_QUIRK_IGNORE_REMOTE_WAKEUP quirk.
With this quirk set, wakeup capability will be ignored during
device configure.
This patch could be back-ported to kernels as old as 2.6.39.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit a6138db815df5ee542d848318e5dae681590fccd upstream.
Kenton Varda <kenton@sandstorm.io> discovered that by remounting a
read-only bind mount read-only in a user namespace the
MNT_LOCK_READONLY bit would be cleared, allowing an unprivileged user
to the remount a read-only mount read-write.
Correct this by replacing the mask of mount flags to preserve
with a mask of mount flags that may be changed, and preserve
all others. This ensures that any future bugs with this mask and
remount will fail in an easy to detect way where new mount flags
simply won't change.
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Francis Moreau <francis.moro@gmail.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit 2ad654bc5e2b211e92f66da1d819e47d79a866f0 upstream.
When we change cpuset.memory_spread_{page,slab}, cpuset will flip
PF_SPREAD_{PAGE,SLAB} bit of tsk->flags for each task in that cpuset.
This should be done using atomic bitops, but currently we don't,
which is broken.
Tetsuo reported a hard-to-reproduce kernel crash on RHEL6, which happened
when one thread tried to clear PF_USED_MATH while at the same time another
thread tried to flip PF_SPREAD_PAGE/PF_SPREAD_SLAB. They both operate on
the same task.
Here's the full report:
https://lkml.org/lkml/2014/9/19/230
To fix this, we make PF_SPREAD_PAGE and PF_SPREAD_SLAB atomic flags.
v4:
- updated mm/slab.c. (Fengguang Wu)
- updated Documentation.
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: Kees Cook <keescook@chromium.org>
Fixes: 950592f7b991 ("cpusets: update tasks' page/slab spread flags in time")
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
[lizf: Backported to 3.4:
- adjust context
- check current->flags & PF_MEMPOLICY rather than current->mempolicy]
|
|
commit e0e5070b20e01f0321f97db4e4e174f3f6b49e50 upstream.
This will simplify code when we add new flags.
v3:
- Kees pointed out that no_new_privs should never be cleared, so we
shouldn't define task_clear_no_new_privs(). we define 3 macros instead
of a single one.
v2:
- updated scripts/tags.sh, suggested by Peter
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
[lizf: Backported to 3.4:
- adjust context
- remove no_new_priv code
- add atomic_flags to struct task_struct]
|
|
commit d78c9300c51d6ceed9f6d078d4e9366f259de28c upstream.
timeval_to_jiffies tried to round a timeval up to an integral number
of jiffies, but the logic for doing so was incorrect: intervals
corresponding to exactly N jiffies would become N+1. This manifested
itself particularly repeatedly stopping/starting an itimer:
setitimer(ITIMER_PROF, &val, NULL);
setitimer(ITIMER_PROF, NULL, &val);
would add a full tick to val, _even if it was exactly representable in
terms of jiffies_ (say, the result of a previous rounding.) Doing
this repeatedly would cause unbounded growth in val. So fix the math.
Here's what was wrong with the conversion: we essentially computed
(eliding seconds)
jiffies = usec * (NSEC_PER_USEC/TICK_NSEC)
by using scaling arithmetic, which took the best approximation of
NSEC_PER_USEC/TICK_NSEC with denominator of 2^USEC_JIFFIE_SC =
x/(2^USEC_JIFFIE_SC), and computed:
jiffies = (usec * x) >> USEC_JIFFIE_SC
and rounded this calculation up in the intermediate form (since we
can't necessarily exactly represent TICK_NSEC in usec.) But the
scaling arithmetic is a (very slight) *over*approximation of the true
value; that is, instead of dividing by (1 usec/ 1 jiffie), we
effectively divided by (1 usec/1 jiffie)-epsilon (rounding
down). This would normally be fine, but we want to round timeouts up,
and we did so by adding 2^USEC_JIFFIE_SC - 1 before the shift; this
would be fine if our division was exact, but dividing this by the
slightly smaller factor was equivalent to adding just _over_ 1 to the
final result (instead of just _under_ 1, as desired.)
In particular, with HZ=1000, we consistently computed that 10000 usec
was 11 jiffies; the same was true for any exact multiple of
TICK_NSEC.
We could possibly still round in the intermediate form, adding
something less than 2^USEC_JIFFIE_SC - 1, but easier still is to
convert usec->nsec, round in nanoseconds, and then convert using
time*spec*_to_jiffies. This adds one constant multiplication, and is
not observably slower in microbenchmarks on recent x86 hardware.
Tested: the following program:
int main() {
struct itimerval zero = {{0, 0}, {0, 0}};
/* Initially set to 10 ms. */
struct itimerval initial = zero;
initial.it_interval.tv_usec = 10000;
setitimer(ITIMER_PROF, &initial, NULL);
/* Save and restore several times. */
for (size_t i = 0; i < 10; ++i) {
struct itimerval prev;
setitimer(ITIMER_PROF, &zero, &prev);
/* on old kernels, this goes up by TICK_USEC every iteration */
printf("previous value: %ld %ld %ld %ld\n",
prev.it_interval.tv_sec, prev.it_interval.tv_usec,
prev.it_value.tv_sec, prev.it_value.tv_usec);
setitimer(ITIMER_PROF, &prev, NULL);
}
return 0;
}
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Paul Turner <pjt@google.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Reviewed-by: Paul Turner <pjt@google.com>
Reported-by: Aaron Jacobs <jacobsa@google.com>
Signed-off-by: Andrew Hunter <ahh@google.com>
[jstultz: Tweaked to apply to 3.17-rc]
Signed-off-by: John Stultz <john.stultz@linaro.org>
[lizf: Backported to 3.4: adjust filename]
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit a5fe8e7695dc3f547e955ad2b662e3e72969e506 upstream.
alpha2 is defined as 2-chars array, but is used in multiple
places as string (e.g. with nla_put_string calls), which
might leak kernel data.
Solve it by simply adding an extra char for the NULL
terminator, making such operations safe.
Signed-off-by: Eliad Peller <eliadx.peller@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
commit e7b691b085fda913830e5280ae6f724b2a63c824 upstream.
slab_node() could access current->mempolicy from interrupt context.
However there's a race condition during exit where the mempolicy
is first freed and then the pointer zeroed.
Using this from interrupts seems bogus anyways. The interrupt
will interrupt a random process and therefore get a random
mempolicy. Many times, this will be idle's, which noone can change.
Just disable this here and always use local for slab
from interrupts. I also cleaned up the callers of slab_node a bit
which always passed the same argument.
I believe the original mempolicy code did that in fact,
so it's likely a regression.
v2: send version with correct logic
v3: simplify. fix typo.
Reported-by: Arun Sharma <asharma@fb.com>
Cc: penberg@kernel.org
Cc: cl@linux.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
[tdmackey@twitter.com: Rework control flow based on feedback from
cl@linux.com, fix logic, and cleanup current task_struct reference]
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David Mackey <tdmackey@twitter.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Zefan Li <lizefan@huawei.com>
|
|
[ Upstream commit 04ca6973f7c1a0d8537f2d9906a0cf8e69886d75 ]
In "Counting Packets Sent Between Arbitrary Internet Hosts", Jeffrey and
Jedidiah describe ways exploiting linux IP identifier generation to
infer whether two machines are exchanging packets.
With commit 73f156a6e8c1 ("inetpeer: get rid of ip_id_count"), we
changed IP id generation, but this does not really prevent this
side-channel technique.
This patch adds a random amount of perturbation so that IP identifiers
for a given destination [1] are no longer monotonically increasing after
an idle period.
Note that prandom_u32_max(1) returns 0, so if generator is used at most
once per jiffy, this patch inserts no hole in the ID suite and do not
increase collision probability.
This is jiffies based, so in the worst case (HZ=1000), the id can
rollover after ~65 seconds of idle time, which should be fine.
We also change the hash used in __ip_select_ident() to not only hash
on daddr, but also saddr and protocol, so that ICMP probes can not be
used to infer information for other protocols.
For IPv6, adds saddr into the hash as well, but not nexthdr.
If I ping the patched target, we can see ID are now hard to predict.
21:57:11.008086 IP (...)
A > target: ICMP echo request, seq 1, length 64
21:57:11.010752 IP (... id 2081 ...)
target > A: ICMP echo reply, seq 1, length 64
21:57:12.013133 IP (...)
A > target: ICMP echo request, seq 2, length 64
21:57:12.015737 IP (... id 3039 ...)
target > A: ICMP echo reply, seq 2, length 64
21:57:13.016580 IP (...)
A > target: ICMP echo request, seq 3, length 64
21:57:13.019251 IP (... id 3437 ...)
target > A: ICMP echo reply, seq 3, length 64
[1] TCP sessions uses a per flow ID generator not changed by this patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jeffrey Knockel <jeffk@cs.unm.edu>
Reported-by: Jedidiah R. Crandall <crandall@cs.unm.edu>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Hannes Frederic Sowa <hannes@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 73f156a6e8c1074ac6327e0abd1169e95eb66463 ]
Ideally, we would need to generate IP ID using a per destination IP
generator.
linux kernels used inet_peer cache for this purpose, but this had a huge
cost on servers disabling MTU discovery.
1) each inet_peer struct consumes 192 bytes
2) inetpeer cache uses a binary tree of inet_peer structs,
with a nominal size of ~66000 elements under load.
3) lookups in this tree are hitting a lot of cache lines, as tree depth
is about 20.
4) If server deals with many tcp flows, we have a high probability of
not finding the inet_peer, allocating a fresh one, inserting it in
the tree with same initial ip_id_count, (cf secure_ip_id())
5) We garbage collect inet_peer aggressively.
IP ID generation do not have to be 'perfect'
Goal is trying to avoid duplicates in a short period of time,
so that reassembly units have a chance to complete reassembly of
fragments belonging to one message before receiving other fragments
with a recycled ID.
We simply use an array of generators, and a Jenkin hash using the dst IP
as a key.
ipv6_select_ident() is put back into net/ipv6/ip6_output.c where it
belongs (it is only used from this file)
secure_ip_id() and secure_ipv6_id() no longer are needed.
Rename ip_select_ident_more() to ip_select_ident_segs() to avoid
unnecessary decrement/increment of the number of segments.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
This reverts commit 29a3cd46644ec8098dbe1c12f89643b5c11831a9 which is
commit fe6cc55f3a9a053482a76f5a6b2257cee51b4663 upstream.
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Marcelo Ricardo Leitner <mleitner@redhat.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: David S. Miller <davem@davemloft.net>
Cc: Thomas Jarosch <thomas.jarosch@intra2net.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit aac74dc495456412c4130a1167ce4beb6c1f0b38 upstream.
After learning we'll need some sort of deferred printk functionality in
the timekeeping core, Peter suggested we rename the printk_sched function
so it can be reused by needed subsystems.
This only changes the function name. No logic changes.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a3860c1c5dd1137db23d7786d284939c5761d517 upstream.
ULONG_MAX is often used to check for integer overflow when calculating
allocation size. While ULONG_MAX happens to work on most systems, there
is no guarantee that `size_t' must be the same size as `long'.
This patch introduces SIZE_MAX, the maximum value of `size_t', to improve
portability and readability for allocation size validation.
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Acked-by: Alex Elder <elder@dreamhost.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Qiang Huang <h.huangqiang@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1a112d10f03e83fb3a2fdc4c9165865dec8a3ca6 upstream.
1871ee134b73 ("libata: support the ata host which implements a queue
depth less than 32") directly used ata_port->scsi_host->can_queue from
ata_qc_new() to determine the number of tags supported by the host;
unfortunately, SAS controllers doing SATA don't initialize ->scsi_host
leading to the following oops.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
IP: [<ffffffff814e0618>] ata_qc_new_init+0x188/0x1b0
PGD 0
Oops: 0002 [#1] SMP
Modules linked in: isci libsas scsi_transport_sas mgag200 drm_kms_helper ttm
CPU: 1 PID: 518 Comm: udevd Not tainted 3.16.0-rc6+ #62
Hardware name: Intel Corporation S2600CO/S2600CO, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
task: ffff880c1a00b280 ti: ffff88061a000000 task.ti: ffff88061a000000
RIP: 0010:[<ffffffff814e0618>] [<ffffffff814e0618>] ata_qc_new_init+0x188/0x1b0
RSP: 0018:ffff88061a003ae8 EFLAGS: 00010012
RAX: 0000000000000001 RBX: ffff88000241ca80 RCX: 00000000000000fa
RDX: 0000000000000020 RSI: 0000000000000020 RDI: ffff8806194aa298
RBP: ffff88061a003ae8 R08: ffff8806194a8000 R09: 0000000000000000
R10: 0000000000000000 R11: ffff88000241ca80 R12: ffff88061ad58200
R13: ffff8806194aa298 R14: ffffffff814e67a0 R15: ffff8806194a8000
FS: 00007f3ad7fe3840(0000) GS:ffff880627620000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000058 CR3: 000000061a118000 CR4: 00000000001407e0
Stack:
ffff88061a003b20 ffffffff814e96e1 ffff88000241ca80 ffff88061ad58200
ffff8800b6bf6000 ffff880c1c988000 ffff880619903850 ffff88061a003b68
ffffffffa0056ce1 ffff88061a003b48 0000000013d6e6f8 ffff88000241ca80
Call Trace:
[<ffffffff814e96e1>] ata_sas_queuecmd+0xa1/0x430
[<ffffffffa0056ce1>] sas_queuecommand+0x191/0x220 [libsas]
[<ffffffff8149afee>] scsi_dispatch_cmd+0x10e/0x300 [<ffffffff814a3bc5>] scsi_request_fn+0x2f5/0x550
[<ffffffff81317613>] __blk_run_queue+0x33/0x40
[<ffffffff8131781a>] queue_unplugged+0x2a/0x90
[<ffffffff8131ceb4>] blk_flush_plug_list+0x1b4/0x210
[<ffffffff8131d274>] blk_finish_plug+0x14/0x50
[<ffffffff8117eaa8>] __do_page_cache_readahead+0x198/0x1f0
[<ffffffff8117ee21>] force_page_cache_readahead+0x31/0x50
[<ffffffff8117ee7e>] page_cache_sync_readahead+0x3e/0x50
[<ffffffff81172ac6>] generic_file_read_iter+0x496/0x5a0
[<ffffffff81219897>] blkdev_read_iter+0x37/0x40
[<ffffffff811e307e>] new_sync_read+0x7e/0xb0
[<ffffffff811e3734>] vfs_read+0x94/0x170
[<ffffffff811e43c6>] SyS_read+0x46/0xb0
[<ffffffff811e33d1>] ? SyS_lseek+0x91/0xb0
[<ffffffff8171ee29>] system_call_fastpath+0x16/0x1b
Code: 00 00 00 88 50 29 83 7f 08 01 19 d2 83 e2 f0 83 ea 50 88 50 34 c6 81 1d 02 00 00 40 c6 81 17 02 00 00 00 5d c3 66 0f 1f 44 00 00 <89> 14 25 58 00 00 00
Fix it by introducing ata_host->n_tags which is initialized to
ATA_MAX_QUEUE - 1 in ata_host_init() for SAS controllers and set to
scsi_host_template->can_queue in ata_host_register() for !SAS ones.
As SAS hosts are never registered, this will give them the same
ATA_MAX_QUEUE - 1 as before. Note that we can't use
scsi_host->can_queue directly for SAS hosts anyway as they can go
higher than the libata maximum.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mike Qiu <qiudayu@linux.vnet.ibm.com>
Reported-by: Jesse Brandeburg <jesse.brandeburg@gmail.com>
Reported-by: Peter Hurley <peter@hurleysoftware.com>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Fixes: 1871ee134b73 ("libata: support the ata host which implements a queue depth less than 32")
Cc: Kevin Hao <haokexin@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e10b9969f217c948c5523045f44eba4d3a758ff0 upstream.
In this API, we were using sizeof operator for an array
given as function argument, which is invalid.
However this API is not used anywhere.
Signed-off-by: Syam Sidhardhan <s.syam@samsung.com>
Signed-off-by: Gustavo Padovan <gustavo@padovan.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 17b72feb2be14e6d37023267dc0e199e8e0e3fdc upstream.
Matching on device and interface class with with unspecified
subclass and protocol is sometimes useful. This is slightly
different from USB_DEVICE_AND_INTERFACE_INFO which requires
the full interface class/subclass/protocol triplet.
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4af4206be2bd1933cae20c2b6fb2058dbc887f7c upstream.
syscall_regfunc() and syscall_unregfunc() should set/clear
TIF_SYSCALL_TRACEPOINT system-wide, but do_each_thread() can race
with copy_process() and miss the new child which was not added to
the process/thread lists yet.
Change copy_process() to update the child's TIF_SYSCALL_TRACEPOINT
under tasklist.
Link: http://lkml.kernel.org/p/20140413185854.GB20668@redhat.com
Fixes: a871bd33a6c0 "tracing: Add syscall tracepoints"
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b9cd18de4db3c9ffa7e17b0dc0ca99ed5aa4d43a upstream.
The 'sysret' fastpath does not correctly restore even all regular
registers, much less any segment registers or reflags values. That is
very much part of why it's faster than 'iret'.
Normally that isn't a problem, because the normal ptrace() interface
catches the process using the signal handler infrastructure, which
always returns with an iret.
However, some paths can get caught using ptrace_event() instead of the
signal path, and for those we need to make sure that we aren't going to
return to user space using 'sysret'. Otherwise the modifications that
may have been done to the register set by the tracer wouldn't
necessarily take effect.
Fix it by forcing IRET path by setting TIF_NOTIFY_RESUME from
arch_ptrace_stop_needed() which is invoked from ptrace_stop().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|