summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2013-01-26userns: Recommend use of memory control groups.Eric W. Biederman
In the help text describing user namespaces recommend use of memory control groups. In many cases memory control groups are the only mechanism there is to limit how much memory a user who can create user namespaces can use. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2013-01-26userns: Allow any uid or gid mappings that don't overlap.Eric W. Biederman
When I initially wrote the code for /proc/<pid>/uid_map. I was lazy and avoided duplicate mappings by the simple expedient of ensuring the first number in a new extent was greater than any number in the previous extent. Unfortunately that precludes a number of valid mappings, and someone noticed and complained. So use a simple check to ensure that ranges in the mapping extents don't overlap. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-01-26userns: Avoid recursion in put_user_nsEric W. Biederman
When freeing a deeply nested user namespace free_user_ns calls put_user_ns on it's parent which may in turn call free_user_ns again. When -fno-optimize-sibling-calls is passed to gcc one stack frame per user namespace is left on the stack, potentially overflowing the kernel stack. CONFIG_FRAME_POINTER forces -fno-optimize-sibling-calls so we can't count on gcc to optimize this code. Remove struct kref and use a plain atomic_t. Making the code more flexible and easier to comprehend. Make the loop in free_user_ns explict to guarantee that the stack does not overflow with CONFIG_FRAME_POINTER enabled. I have tested this fix with a simple program that uses unshare to create a deeply nested user namespace structure and then calls exit. With 1000 nesteuser namespaces before this change running my test program causes the kernel to die a horrible death. With 10,000,000 nested user namespaces after this change my test program runs to completion and causes no harm. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Pointed-out-by: Vasily Kulikov <segoon@openwall.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-12-26userns: Allow unprivileged rebootLi Zefan
In a container with its own pid namespace and user namespace, rebooting the system won't reboot the host, but terminate all the processes in it and thus have the container shutdown, so it's safe. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-12-25f2fs: Don't assign e_id in f2fs_acl_from_diskEric W. Biederman
With user namespaces enabled building f2fs fails with: CC fs/f2fs/acl.o fs/f2fs/acl.c: In function ‘f2fs_acl_from_disk’: fs/f2fs/acl.c:85:21: error: ‘struct posix_acl_entry’ has no member named ‘e_id’ make[2]: *** [fs/f2fs/acl.o] Error 1 make[2]: Target `__build' not remade because of errors. e_id is a backwards compatibility field only used for file systems that haven't been converted to use kuids and kgids. When the posix acl tag field is neither ACL_USER nor ACL_GROUP assigning e_id is unnecessary. Remove the assignment so f2fs will build with user namespaces enabled. Cc: Namjae Jeon <namjae.jeon@samsung.com> Cc: Amit Sahrawat <a.sahrawat@samsung.com> Acked-by: Jaegeuk Kim <jaegeuk.kim@samsung.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-12-25proc: Allow proc_free_inum to be called from any contextEric W. Biederman
While testing the pid namespace code I hit this nasty warning. [ 176.262617] ------------[ cut here ]------------ [ 176.263388] WARNING: at /home/eric/projects/linux/linux-userns-devel/kernel/softirq.c:160 local_bh_enable_ip+0x7a/0xa0() [ 176.265145] Hardware name: Bochs [ 176.265677] Modules linked in: [ 176.266341] Pid: 742, comm: bash Not tainted 3.7.0userns+ #18 [ 176.266564] Call Trace: [ 176.266564] [<ffffffff810a539f>] warn_slowpath_common+0x7f/0xc0 [ 176.266564] [<ffffffff810a53fa>] warn_slowpath_null+0x1a/0x20 [ 176.266564] [<ffffffff810ad9ea>] local_bh_enable_ip+0x7a/0xa0 [ 176.266564] [<ffffffff819308c9>] _raw_spin_unlock_bh+0x19/0x20 [ 176.266564] [<ffffffff8123dbda>] proc_free_inum+0x3a/0x50 [ 176.266564] [<ffffffff8111d0dc>] free_pid_ns+0x1c/0x80 [ 176.266564] [<ffffffff8111d195>] put_pid_ns+0x35/0x50 [ 176.266564] [<ffffffff810c608a>] put_pid+0x4a/0x60 [ 176.266564] [<ffffffff8146b177>] tty_ioctl+0x717/0xc10 [ 176.266564] [<ffffffff810aa4d5>] ? wait_consider_task+0x855/0xb90 [ 176.266564] [<ffffffff81086bf9>] ? default_spin_lock_flags+0x9/0x10 [ 176.266564] [<ffffffff810cab0a>] ? remove_wait_queue+0x5a/0x70 [ 176.266564] [<ffffffff811e37e8>] do_vfs_ioctl+0x98/0x550 [ 176.266564] [<ffffffff810b8a0f>] ? recalc_sigpending+0x1f/0x60 [ 176.266564] [<ffffffff810b9127>] ? __set_task_blocked+0x37/0x80 [ 176.266564] [<ffffffff810ab95b>] ? sys_wait4+0xab/0xf0 [ 176.266564] [<ffffffff811e3d31>] sys_ioctl+0x91/0xb0 [ 176.266564] [<ffffffff810a95f0>] ? task_stopped_code+0x50/0x50 [ 176.266564] [<ffffffff81939199>] system_call_fastpath+0x16/0x1b [ 176.266564] ---[ end trace 387af88219ad6143 ]--- It turns out that spin_unlock_bh(proc_inum_lock) is not safe when put_pid is called with another spinlock held and irqs disabled. For now take the easy path and use spin_lock_irqsave(proc_inum_lock) in proc_free_inum and spin_loc_irq in proc_alloc_inum(proc_inum_lock). Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-12-25pidns: Stop pid allocation when init diesEric W. Biederman
Oleg pointed out that in a pid namespace the sequence. - pid 1 becomes a zombie - setns(thepidns), fork,... - reaping pid 1. - The injected processes exiting. Can lead to processes attempting access their child reaper and instead following a stale pointer. That waitpid for init can return before all of the processes in the pid namespace have exited is also unfortunate. Avoid these problems by disabling the allocation of new pids in a pid namespace when init dies, instead of when the last process in a pid namespace is reaped. Pointed-out-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-12-24pidns: Outlaw thread creation after unshare(CLONE_NEWPID)Eric W. Biederman
The sequence: unshare(CLONE_NEWPID) clone(CLONE_THREAD|CLONE_SIGHAND|CLONE_VM) Creates a new process in the new pid namespace without setting pid_ns->child_reaper. After forking this results in a NULL pointer dereference. Avoid this and other nonsense scenarios that can show up after creating a new pid namespace with unshare by adding a new check in copy_prodcess. Pointed-out-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-12-21Linux 3.8-rc1v3.8-rc1Linus Torvalds
2012-12-21Merge git://www.linux-watchdog.org/linux-watchdogLinus Torvalds
Pull watchdog updates from Wim Van Sebroeck: "This includes some fixes and code improvements (like clk_prepare_enable and clk_disable_unprepare), conversion from the omap_wdt and twl4030_wdt drivers to the watchdog framework, addition of the SB8x0 chipset support and the DA9055 Watchdog driver and some OF support for the davinci_wdt driver." * git://www.linux-watchdog.org/linux-watchdog: (22 commits) watchdog: mei: avoid oops in watchdog unregister code path watchdog: Orion: Fix possible null-deference in orion_wdt_probe watchdog: sp5100_tco: Add SB8x0 chipset support watchdog: davinci_wdt: add OF support watchdog: da9052: Fix invalid free of devm_ allocated data watchdog: twl4030_wdt: Change TWL4030_MODULE_PM_RECEIVER to TWL_MODULE_PM_RECEIVER watchdog: remove depends on CONFIG_EXPERIMENTAL watchdog: Convert dev_printk(KERN_<LEVEL> to dev_<level>( watchdog: DA9055 Watchdog driver watchdog: omap_wdt: eliminate goto watchdog: omap_wdt: delete redundant platform_set_drvdata() calls watchdog: omap_wdt: convert to devm_ functions watchdog: omap_wdt: convert to new watchdog core watchdog: WatchDog Timer Driver Core: fix comment watchdog: s3c2410_wdt: use clk_prepare_enable and clk_disable_unprepare watchdog: imx2_wdt: Select the driver via ARCH_MXC watchdog: cpu5wdt.c: add missing del_timer call watchdog: hpwdt.c: Increase version string watchdog: Convert twl4030_wdt to watchdog core davinci_wdt: preparation for switch to common clock framework ...
2012-12-21Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull CIFS fixes from Steve French: "Misc small cifs fixes" * 'for-next' of git://git.samba.org/sfrench/cifs-2.6: cifs: eliminate cifsERROR variable cifs: don't compare uniqueids in cifs_prime_dcache unless server inode numbers are in use cifs: fix double-free of "string" in cifs_parse_mount_options
2012-12-21Merge tag 'dm-3.8-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm Pull dm update from Alasdair G Kergon: "Miscellaneous device-mapper fixes, cleanups and performance improvements. Of particular note: - Disable broken WRITE SAME support in all targets except linear and striped. Use it when kcopyd is zeroing blocks. - Remove several mempools from targets by moving the data into the bio's new front_pad area(which dm calls 'per_bio_data'). - Fix a race in thin provisioning if discards are misused. - Prevent userspace from interfering with the ioctl parameters and use kmalloc for the data buffer if it's small instead of vmalloc. - Throttle some annoying error messages when I/O fails." * tag 'dm-3.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm: (36 commits) dm stripe: add WRITE SAME support dm: remove map_info dm snapshot: do not use map_context dm thin: dont use map_context dm raid1: dont use map_context dm flakey: dont use map_context dm raid1: rename read_record to bio_record dm: move target request nr to dm_target_io dm snapshot: use per_bio_data dm verity: use per_bio_data dm raid1: use per_bio_data dm: introduce per_bio_data dm kcopyd: add WRITE SAME support to dm_kcopyd_zero dm linear: add WRITE SAME support dm: add WRITE SAME support dm: prepare to support WRITE SAME dm ioctl: use kmalloc if possible dm ioctl: remove PF_MEMALLOC dm persistent data: improve improve space map block alloc failure message dm thin: use DMERR_LIMIT for errors ...
2012-12-21Revert "nfsd: warn on odd reply state in nfsd_vfs_read"J. Bruce Fields
This reverts commit 79f77bf9a4e3dd5ead006b8f17e7c4ff07d8374e. This is obviously wrong, and I have no idea how I missed seeing the warning in testing: I must just not have looked at the right logs. The caller bumps rq_resused/rq_next_page, so it will always be hit on a large enough read. Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-21Merge tag 'rdma-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband Pull more infiniband changes from Roland Dreier: "Second batch of InfiniBand/RDMA changes for 3.8: - cxgb4 changes to fix lookup engine hash collisions - mlx4 changes to make flow steering usable - fix to IPoIB to avoid pinning dst reference for too long" * tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: RDMA/cxgb4: Fix bug for active and passive LE hash collision path RDMA/cxgb4: Fix LE hash collision bug for passive open connection RDMA/cxgb4: Fix LE hash collision bug for active open connection mlx4_core: Allow choosing flow steering mode mlx4_core: Adjustments to Flow Steering activation logic for SR-IOV mlx4_core: Fix error flow in the flow steering wrapper mlx4_core: Add QPN enforcement for flow steering rules set by VFs cxgb4: Add LE hash collision bug fix path in LLD driver cxgb4: Add T4 filter support IPoIB: Call skb_dst_drop() once skb is enqueued for sending
2012-12-21Merge tag 'asm-generic' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic Pull asm-generic cleanup from Arnd Bergmann: "These are a few cleanups for asm-generic: - a set of patches from Lars-Peter Clausen to generalize asm/mmu.h and use it in the architectures that don't need any special handling. - A patch from Will Deacon to remove the {read,write}s{b,w,l} as discussed during the arm64 review - A patch from James Hogan that helps with the meta architecture series." * tag 'asm-generic' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic: xtensa: Use generic asm/mmu.h for nommu h8300: Use generic asm/mmu.h c6x: Use generic asm/mmu.h asm-generic/mmu.h: Add support for FDPIC asm-generic/mmu.h: Remove unused vmlist field from mm_context_t asm-generic: io: remove {read,write} string functions asm-generic/io.h: remove asm/cacheflush.h include
2012-12-21ARM: dts: fix duplicated build target and alphabetical sort out for exynosKukjin Kim
Commit db5b0ae00712 ("Merge tag 'dt' of git://git.kernel.org/.../arm-soc") causes a duplicated build target. This patch fixes it and sorts out the build target alphabetically so that we can recognize something wrong easily. Cc: Olof Johansson <olof@lixom.net> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Kukjin Kim <kgene.kim@samsung.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-21dm stripe: add WRITE SAME supportMike Snitzer
Rename stripe_map_discard to stripe_map_range and reuse it for WRITE SAME bio processing. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21dm: remove map_infoMikulas Patocka
This patch removes map_info from bio-based device mapper targets. map_info is still used for request-based targets. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21dm snapshot: do not use map_contextMikulas Patocka
Eliminate struct map_info from dm-snap. map_info->ptr was used in dm-snap to indicate if the bio was tracked. If map_info->ptr was non-NULL, the bio was linked in tracked_chunk_hash. This patch removes the use of map_info->ptr. We determine if the bio was tracked based on hlist_unhashed(&c->node). If hlist_unhashed is true, the bio is not tracked, if it is false, the bio is tracked. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21dm thin: dont use map_contextMikulas Patocka
This patch removes endio_hook_pool from dm-thin and uses per-bio data instead. This patch removes any use of map_info in preparation for the next patch that removes map_info from bio-based device mapper. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21dm raid1: dont use map_contextMikulas Patocka
Don't use map_info any more in dm-raid1. map_info was used for writes to hold the region number. For this purpose we add a new field dm_bio_details to dm_raid1_bio_record. map_info was used for reads to hold a pointer to dm_raid1_bio_record (if the pointer was non-NULL, bio details were saved; if the pointer was NULL, bio details were not saved). We use dm_raid1_bio_record.details->bi_bdev for this purpose. If bi_bdev is NULL, details were not saved, if bi_bdev is non-NULL, details were saved. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21dm flakey: dont use map_contextMikulas Patocka
Replace map_info with a per-bio structure "struct per_bio_data" in dm-flakey. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21dm raid1: rename read_record to bio_recordMikulas Patocka
Rename struct read_record to bio_record in dm-raid1. In the following patch, the structure will be used for both read and write bios, so rename it. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21dm: move target request nr to dm_target_ioMikulas Patocka
This patch moves target_request_nr from map_info to dm_target_io and makes it accessible with dm_bio_get_target_request_nr. This patch is a preparation for the next patch that removes map_info. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21dm snapshot: use per_bio_dataMikulas Patocka
Replace tracked_chunk_pool with per_bio_data in dm-snap. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21dm verity: use per_bio_dataMikulas Patocka
Replace io_mempool with per_bio_data in dm-verity. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21dm raid1: use per_bio_dataMikulas Patocka
Replace read_record_pool with per_bio_data in dm-raid1. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21dm: introduce per_bio_dataMikulas Patocka
Introduce a field per_bio_data_size in struct dm_target. Targets can set this field in the constructor. If a target sets this field to a non-zero value, "per_bio_data_size" bytes of auxiliary data are allocated for each bio submitted to the target. These data can be used for any purpose by the target and help us improve performance by removing some per-target mempools. Per-bio data is accessed with dm_per_bio_data. The argument data_size must be the same as the value per_bio_data_size in dm_target. If the target has a pointer to per_bio_data, it can get a pointer to the bio with dm_bio_from_per_bio_data() function (data_size must be the same as the value passed to dm_per_bio_data). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21dm kcopyd: add WRITE SAME support to dm_kcopyd_zeroMike Snitzer
Add WRITE SAME support to dm-io and make it accessible to dm_kcopyd_zero(). dm_kcopyd_zero() provides an asynchronous interface whereas the blkdev_issue_write_same() interface is synchronous. WRITE SAME is a SCSI command that can be leveraged for more efficient zeroing of a specified logical extent of a device which supports it. Only a single zeroed logical block is transfered to the target for each WRITE SAME and the target then writes that same block across the specified extent. The dm thin target uses this. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21dm linear: add WRITE SAME supportMike Snitzer
The linear target can already support WRITE SAME requests so signal this by setting num_write_same_requests to 1. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21dm: add WRITE SAME supportMike Snitzer
WRITE SAME bios have a payload that contain a single page. When cloning WRITE SAME bios DM has no need to modify the bi_io_vec attributes (and doing so would be detrimental). DM need only alter the start and end of the WRITE SAME bio accordingly. Rather than duplicate __clone_and_map_discard, factor out a common function that is also used by __clone_and_map_write_same. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21dm: prepare to support WRITE SAMEMike Snitzer
Allow targets to opt in to WRITE SAME support by setting 'num_write_same_requests' in the dm_target structure. A dm device will only advertise WRITE SAME support if all its targets and all its underlying devices support it. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21dm ioctl: use kmalloc if possibleMikulas Patocka
If the parameter buffer is small enough, try to allocate it with kmalloc() rather than vmalloc(). vmalloc is noticeably slower than kmalloc because it has to manipulate page tables. In my tests, on PA-RISC this patch speeds up activation 13 times. On Opteron this patch speeds up activation by 5%. This patch introduces a new function free_params() to free the parameters and this uses new flags that record whether or not vmalloc() was used and whether or not the input buffer must be wiped after use. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21dm ioctl: remove PF_MEMALLOCMikulas Patocka
When allocating memory for the userspace ioctl data, set some appropriate GPF flags directly instead of using PF_MEMALLOC. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21dm persistent data: improve improve space map block alloc failure messageJoe Thornber
Improve space map error message when unable to allocate a new metadata block. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21dm thin: use DMERR_LIMIT for errorsMike Snitzer
Throttle all errors logged from the IO path by dm thin. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21dm persistent data: use DMERR_LIMIT for errorsMike Snitzer
Nearly all of persistent-data is in the IO path so throttle error messages with DMERR_LIMIT to limit the amount logged when something has gone wrong. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21dm block manager: reinstate message when validator failsMike Snitzer
Reinstate a useful error message when the block manager buffer validator fails. This was mistakenly eliminated when the block manager was converted to use dm-bufio. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21dm raid: round region_size to power of twoJonathan Brassow
If the user does not supply a bitmap region_size to the dm raid target, a reasonable size is computed automatically. If this is not a power of 2, the md code will report an error later. This patch catches the problem early and rounds the region_size to the next power of two. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21dm thin: cleanup dead codeJoe Thornber
Remove unused @data_block parameter from cell_defer. Change thin_bio_map to use many returns rather than setting a variable. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21dm thin: rename cell_defer_except to cell_defer_no_holderJoe Thornber
Rename cell_defer_except() to cell_defer_no_holder() which describes its function more clearly. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21dm snapshot: optimize track_chunkMikulas Patocka
track_chunk is always called with interrupts enabled. Consequently, we do not need to save and restore interrupt state in "flags" variable. This patch changes spin_lock_irqsave to spin_lock_irq and spin_unlock_irqrestore to spin_unlock_irq. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21dm raid: use DM_ENDIO_INCOMPLETEMikulas Patocka
Use a defined macro DM_ENDIO_INCOMPLETE instead of a numeric constant. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21dm raid1: remove impossible mempool_alloc error testMikulas Patocka
mempool_alloc can't fail if __GFP_WAIT is specified, so the condition that tests if read_record is non-NULL is always true. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21dm thin: emit ignore_discard in status when discards disabledMike Snitzer
If "ignore_discard" is specified when creating the thin pool device then discard support is disabled for that device. The pool device's status should reflect this fact rather than stating "no_discard_passdown" (which implies discards are enabled but passdown is disabled). Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21dm persistent data: fix nested btree deletionJoe Thornber
When deleting nested btrees, the code forgets to delete the innermost btree. The thin-metadata code serendipitously compensates for this by claiming there is one extra layer in the tree. This patch corrects both problems. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21dm thin: wake worker when discard is preparedJoe Thornber
When discards are prepared it is best to directly wake the worker that will process them. The worker will be woken anyway, via periodic commit, but there is no reason to not wake_worker here. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21dm thin: fix race between simultaneous io and discards to same blockJoe Thornber
There is a race when discard bios and non-discard bios are issued simultaneously to the same block. Discard support is expensive for all thin devices precisely because you have to be careful to quiesce the area you're discarding. DM thin must handle this conflicting IO pattern (simultaneous non-discard vs discard) even though a sane application shouldn't be issuing such IO. The race manifests as follows: 1. A non-discard bio is mapped in thin_bio_map. This doesn't lock out parallel activity to the same block. 2. A discard bio is issued to the same block as the non-discard bio. 3. The discard bio is locked in a dm_bio_prison_cell in process_discard to lock out parallel activity against the same block. 4. The non-discard bio's mapping continues and its all_io_entry is incremented so the bio is accounted for in the thin pool's all_io_ds which is a dm_deferred_set used to track time locality of non-discard IO. 5. The non-discard bio is finally locked in a dm_bio_prison_cell in process_bio. The race can result in deadlock, leaving the block layer hanging waiting for completion of a discard bio that never completes, e.g.: INFO: task ruby:15354 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ruby D ffffffff8160f0e0 0 15354 15314 0x00000000 ffff8802fb08bc58 0000000000000082 ffff8802fb08bfd8 0000000000012900 ffff8802fb08a010 0000000000012900 0000000000012900 0000000000012900 ffff8802fb08bfd8 0000000000012900 ffff8803324b9480 ffff88032c6f14c0 Call Trace: [<ffffffff814e5a19>] schedule+0x29/0x70 [<ffffffff814e3d85>] schedule_timeout+0x195/0x220 [<ffffffffa06b9bc1>] ? _dm_request+0x111/0x160 [dm_mod] [<ffffffff814e589e>] wait_for_common+0x11e/0x190 [<ffffffff8107a170>] ? try_to_wake_up+0x2b0/0x2b0 [<ffffffff814e59ed>] wait_for_completion+0x1d/0x20 [<ffffffff81233289>] blkdev_issue_discard+0x219/0x260 [<ffffffff81233e79>] blkdev_ioctl+0x6e9/0x7b0 [<ffffffff8119a65c>] block_ioctl+0x3c/0x40 [<ffffffff8117539c>] do_vfs_ioctl+0x8c/0x340 [<ffffffff8119a547>] ? block_llseek+0x67/0xb0 [<ffffffff811756f1>] sys_ioctl+0xa1/0xb0 [<ffffffff810561f6>] ? sys_rt_sigprocmask+0x86/0xd0 [<ffffffff814ef099>] system_call_fastpath+0x16/0x1b The thinp-test-suite's test_discard_random_sectors reliably hits this deadlock on fast SSD storage. The fix for this race is that the all_io_entry for a bio must be incremented whilst the dm_bio_prison_cell is held for the bio's associated virtual and physical blocks. That cell locking wasn't occurring early enough in thin_bio_map. This patch fixes this. Care is taken to always call the new function inc_all_io_entry() with the relevant cells locked, but they are generally unlocked before calling issue() to try to avoid holding the cells locked across generic_submit_request. Also, now that thin_bio_map may lock bios in a cell, process_bio() is no longer the only thread that will do so. Because of this we must be sure to use cell_defer_except() to release all non-holder entries, that were added by the other thread, because they must be deferred. This patch depends on "dm thin: replace dm_cell_release_singleton with cell_defer_except". Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: stable@vger.kernel.org
2012-12-21dm thin: replace dm_cell_release_singleton with cell_defer_exceptJoe Thornber
Change existing users of the function dm_cell_release_singleton to share cell_defer_except instead, and then remove the now-unused function. Everywhere that calls dm_cell_release_singleton, the bio in question is the holder of the cell. If there are no non-holder entries in the cell then cell_defer_except behaves exactly like dm_cell_release_singleton. Conversely, if there *are* non-holder entries then dm_cell_release_singleton must not be used because those entries would need to be deferred. Consequently, it is safe to replace use of dm_cell_release_singleton with cell_defer_except. This patch is a pre-requisite for "dm thin: fix race between simultaneous io and discards to same block". Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21dm: disable WRITE SAMEMike Snitzer
WRITE SAME bios are not yet handled correctly by device-mapper so disable their use on device-mapper devices by setting max_write_same_sectors to zero. As an example, a ciphertext device is incompatible because the data gets changed according to the location at which it written and so the dm crypt target cannot support it. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Cc: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>