From 80168676ebfe4af51407d30f336d67f082d45201 Mon Sep 17 00:00:00 2001 From: Dave Chinner Date: Fri, 24 Sep 2010 18:13:44 +1000 Subject: xfs: force background CIL push under sustained load I have been seeing occasional pauses in transaction throughput up to 30s long under heavy parallel workloads. The only notable thing was that the xfsaild was trying to be active during the pauses, but making no progress. It was running exactly 20 times a second (on the 50ms no-progress backoff), and the number of pushbuf events was constant across this time as well. IOWs, the xfsaild appeared to be stuck on buffers that it could not push out. Further investigation indicated that it was trying to push out inode buffers that were pinned and/or locked. The xfsbufd was also getting woken at the same frequency (by the xfsaild, no doubt) to push out delayed write buffers. The xfsbufd was not making any progress because all the buffers in the delwri queue were pinned. This scan- and-make-no-progress dance went one in the trace for some seconds, before the xfssyncd came along an issued a log force, and then things started going again. However, I noticed something strange about the log force - there were way too many IO's issued. 516 log buffers were written, to be exact. That added up to 129MB of log IO, which got me very interested because it's almost exactly 25% of the size of the log. He delayed logging code is suppose to aggregate the minimum of 25% of the log or 8MB worth of changes before flushing. That's what really puzzled me - why did a log force write 129MB instead of only 8MB? Essentially what has happened is that no CIL pushes had occurred since the previous tail push which cleared out 25% of the log space. That caused all the new transactions to block because there wasn't log space for them, but they kick the xfsaild to push the tail. However, the xfsaild was not making progress because there were buffers it could not lock and flush, and the xfsbufd could not flush them because they were pinned. As a result, both the xfsaild and the xfsbufd could not move the tail of the log forward without the CIL first committing. The cause of the problem was that the background CIL push, which should happen when 8MB of aggregated changes have been committed, is being held off by the concurrent transaction commit load. The background push does a down_write_trylock() which will fail if there is a concurrent transaction commit holding the push lock in read mode. With 8 CPUs all doing transactions as fast as they can, there was enough concurrent transaction commits to hold off the background push until tail-pushing could no longer free log space, and the halt would occur. It should be noted that there is no reason why it would halt at 25% of log space used by a single CIL checkpoint. This bug could definitely violate the "no transaction should be larger than half the log" requirement and hence result in corruption if the system crashed under heavy load. This sort of bug is exactly the reason why delayed logging was tagged as experimental.... The fix is to start blocking background pushes once the threshold has been exceeded. Rework the threshold calculations to keep the amount of log space a CIL checkpoint can use to below that of the AIL push threshold to avoid the problem completely. Signed-off-by: Dave Chinner Reviewed-by: Alex Elder Reviewed-by: Christoph Hellwig --- fs/xfs/xfs_log_cil.c | 12 +++++++++--- fs/xfs/xfs_log_priv.h | 37 +++++++++++++++++++++---------------- 2 files changed, 30 insertions(+), 19 deletions(-) (limited to 'fs') diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index ed575fb4b495..7e206fc1fa36 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -405,9 +405,15 @@ xlog_cil_push( new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_SLEEP|KM_NOFS); new_ctx->ticket = xlog_cil_ticket_alloc(log); - /* lock out transaction commit, but don't block on background push */ + /* + * Lock out transaction commit, but don't block for background pushes + * unless we are well over the CIL space limit. See the definition of + * XLOG_CIL_HARD_SPACE_LIMIT() for the full explanation of the logic + * used here. + */ if (!down_write_trylock(&cil->xc_ctx_lock)) { - if (!push_seq) + if (!push_seq && + cil->xc_ctx->space_used < XLOG_CIL_HARD_SPACE_LIMIT(log)) goto out_free_ticket; down_write(&cil->xc_ctx_lock); } @@ -422,7 +428,7 @@ xlog_cil_push( goto out_skip; /* check for a previously pushed seqeunce */ - if (push_seq < cil->xc_ctx->sequence) + if (push_seq && push_seq < cil->xc_ctx->sequence) goto out_skip; /* diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index ced52b98b322..edcdfe01617f 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -426,13 +426,13 @@ struct xfs_cil { }; /* - * The amount of log space we should the CIL to aggregate is difficult to size. - * Whatever we chose we have to make we can get a reservation for the log space - * effectively, that it is large enough to capture sufficient relogging to - * reduce log buffer IO significantly, but it is not too large for the log or - * induces too much latency when writing out through the iclogs. We track both - * space consumed and the number of vectors in the checkpoint context, so we - * need to decide which to use for limiting. + * The amount of log space we allow the CIL to aggregate is difficult to size. + * Whatever we choose, we have to make sure we can get a reservation for the + * log space effectively, that it is large enough to capture sufficient + * relogging to reduce log buffer IO significantly, but it is not too large for + * the log or induces too much latency when writing out through the iclogs. We + * track both space consumed and the number of vectors in the checkpoint + * context, so we need to decide which to use for limiting. * * Every log buffer we write out during a push needs a header reserved, which * is at least one sector and more for v2 logs. Hence we need a reservation of @@ -459,16 +459,21 @@ struct xfs_cil { * checkpoint transaction ticket is specific to the checkpoint context, rather * than the CIL itself. * - * With dynamic reservations, we can basically make up arbitrary limits for the - * checkpoint size so long as they don't violate any other size rules. Hence - * the initial maximum size for the checkpoint transaction will be set to a - * quarter of the log or 8MB, which ever is smaller. 8MB is an arbitrary limit - * right now based on the latency of writing out a large amount of data through - * the circular iclog buffers. + * With dynamic reservations, we can effectively make up arbitrary limits for + * the checkpoint size so long as they don't violate any other size rules. + * Recovery imposes a rule that no transaction exceed half the log, so we are + * limited by that. Furthermore, the log transaction reservation subsystem + * tries to keep 25% of the log free, so we need to keep below that limit or we + * risk running out of free log space to start any new transactions. + * + * In order to keep background CIL push efficient, we will set a lower + * threshold at which background pushing is attempted without blocking current + * transaction commits. A separate, higher bound defines when CIL pushes are + * enforced to ensure we stay within our maximum checkpoint size bounds. + * threshold, yet give us plenty of space for aggregation on large logs. */ - -#define XLOG_CIL_SPACE_LIMIT(log) \ - (min((log->l_logsize >> 2), (8 * 1024 * 1024))) +#define XLOG_CIL_SPACE_LIMIT(log) (log->l_logsize >> 3) +#define XLOG_CIL_HARD_SPACE_LIMIT(log) (3 * (log->l_logsize >> 4)) /* * The reservation head lsn is not made up of a cycle number and block number. -- cgit v1.2.3 From 522440ed55d2cc8855ea5f82bc067e0483b2e1be Mon Sep 17 00:00:00 2001 From: Jeff Layton Date: Wed, 29 Sep 2010 09:49:54 -0400 Subject: cifs: set backing_dev_info on new S_ISREG inodes Testing on very recent kernel (2.6.36-rc6) made this warning pop: WARNING: at fs/fs-writeback.c:87 inode_to_bdi+0x65/0x70() Hardware name: Dirtiable inode bdi default != sb bdi cifs ...the following patch fixes it and seems to be the obviously correct thing to do for cifs. Cc: stable@kernel.org Acked-by: Dave Kleikamp Signed-off-by: Jeff Layton Signed-off-by: Steve French --- fs/cifs/inode.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'fs') diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c index 93f77d438d3c..53cce8cc2224 100644 --- a/fs/cifs/inode.c +++ b/fs/cifs/inode.c @@ -801,6 +801,8 @@ retry_iget5_locked: inode->i_flags |= S_NOATIME | S_NOCMTIME; if (inode->i_state & I_NEW) { inode->i_ino = hash; + if (S_ISREG(inode->i_mode)) + inode->i_data.backing_dev_info = sb->s_bdi; #ifdef CONFIG_CIFS_FSCACHE /* initialize per-inode cache cookie pointer */ CIFS_I(inode)->fscache = NULL; -- cgit v1.2.3 From 1fc8a117865b54590acd773a55fbac9221b018f0 Mon Sep 17 00:00:00 2001 From: Joel Becker Date: Wed, 29 Sep 2010 17:33:05 -0700 Subject: ocfs2: Don't walk off the end of fast symlinks. ocfs2 fast symlinks are NUL terminated strings stored inline in the inode data area. However, disk corruption or a local attacker could, in theory, remove that NUL. Because we're using strlen() (my fault, introduced in a731d1 when removing vfs_follow_link()), we could walk off the end of that string. Signed-off-by: Joel Becker Cc: stable@kernel.org --- fs/ocfs2/symlink.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'fs') diff --git a/fs/ocfs2/symlink.c b/fs/ocfs2/symlink.c index 32499d213fc4..9975457c981f 100644 --- a/fs/ocfs2/symlink.c +++ b/fs/ocfs2/symlink.c @@ -128,7 +128,7 @@ static void *ocfs2_fast_follow_link(struct dentry *dentry, } /* Fast symlinks can't be large */ - len = strlen(target); + len = strnlen(target, ocfs2_fast_symlink_chars(inode->i_sb)); link = kzalloc(len + 1, GFP_NOFS); if (!link) { status = -ENOMEM; -- cgit v1.2.3 From f569599ae70f0899035f8d5876a7939f629c5976 Mon Sep 17 00:00:00 2001 From: Jeff Layton Date: Wed, 29 Sep 2010 15:27:08 -0400 Subject: cifs: prevent infinite recursion in cifs_reconnect_tcon cifs_reconnect_tcon is called from smb_init. After a successful reconnect, cifs_reconnect_tcon will call reset_cifs_unix_caps. That function will, in turn call CIFSSMBQFSUnixInfo and CIFSSMBSetFSUnixInfo. Those functions also call smb_init. It's possible for the session and tcon reconnect to succeed, and then for another cifs_reconnect to occur before CIFSSMBQFSUnixInfo or CIFSSMBSetFSUnixInfo to be called. That'll cause those functions to call smb_init and cifs_reconnect_tcon again, ad infinitum... Break the infinite recursion by having those functions use a new smb_init variant that doesn't attempt to perform a reconnect. Reported-and-Tested-by: Michal Suchanek Signed-off-by: Jeff Layton Signed-off-by: Steve French --- fs/cifs/cifssmb.c | 49 +++++++++++++++++++++++++++++++++---------------- 1 file changed, 33 insertions(+), 16 deletions(-) (limited to 'fs') diff --git a/fs/cifs/cifssmb.c b/fs/cifs/cifssmb.c index c65c3419dd37..7e83b356cc9e 100644 --- a/fs/cifs/cifssmb.c +++ b/fs/cifs/cifssmb.c @@ -232,7 +232,7 @@ static int small_smb_init(int smb_command, int wct, struct cifsTconInfo *tcon, void **request_buf) { - int rc = 0; + int rc; rc = cifs_reconnect_tcon(tcon, smb_command); if (rc) @@ -250,7 +250,7 @@ small_smb_init(int smb_command, int wct, struct cifsTconInfo *tcon, if (tcon != NULL) cifs_stats_inc(&tcon->num_smbs_sent); - return rc; + return 0; } int @@ -281,16 +281,9 @@ small_smb_init_no_tc(const int smb_command, const int wct, /* If the return code is zero, this function must fill in request_buf pointer */ static int -smb_init(int smb_command, int wct, struct cifsTconInfo *tcon, - void **request_buf /* returned */ , - void **response_buf /* returned */ ) +__smb_init(int smb_command, int wct, struct cifsTconInfo *tcon, + void **request_buf, void **response_buf) { - int rc = 0; - - rc = cifs_reconnect_tcon(tcon, smb_command); - if (rc) - return rc; - *request_buf = cifs_buf_get(); if (*request_buf == NULL) { /* BB should we add a retry in here if not a writepage? */ @@ -309,7 +302,31 @@ smb_init(int smb_command, int wct, struct cifsTconInfo *tcon, if (tcon != NULL) cifs_stats_inc(&tcon->num_smbs_sent); - return rc; + return 0; +} + +/* If the return code is zero, this function must fill in request_buf pointer */ +static int +smb_init(int smb_command, int wct, struct cifsTconInfo *tcon, + void **request_buf, void **response_buf) +{ + int rc; + + rc = cifs_reconnect_tcon(tcon, smb_command); + if (rc) + return rc; + + return __smb_init(smb_command, wct, tcon, request_buf, response_buf); +} + +static int +smb_init_no_reconnect(int smb_command, int wct, struct cifsTconInfo *tcon, + void **request_buf, void **response_buf) +{ + if (tcon->ses->need_reconnect || tcon->need_reconnect) + return -EHOSTDOWN; + + return __smb_init(smb_command, wct, tcon, request_buf, response_buf); } static int validate_t2(struct smb_t2_rsp *pSMB) @@ -4534,8 +4551,8 @@ CIFSSMBQFSUnixInfo(const int xid, struct cifsTconInfo *tcon) cFYI(1, "In QFSUnixInfo"); QFSUnixRetry: - rc = smb_init(SMB_COM_TRANSACTION2, 15, tcon, (void **) &pSMB, - (void **) &pSMBr); + rc = smb_init_no_reconnect(SMB_COM_TRANSACTION2, 15, tcon, + (void **) &pSMB, (void **) &pSMBr); if (rc) return rc; @@ -4604,8 +4621,8 @@ CIFSSMBSetFSUnixInfo(const int xid, struct cifsTconInfo *tcon, __u64 cap) cFYI(1, "In SETFSUnixInfo"); SETFSUnixRetry: /* BB switch to small buf init to save memory */ - rc = smb_init(SMB_COM_TRANSACTION2, 15, tcon, (void **) &pSMB, - (void **) &pSMBr); + rc = smb_init_no_reconnect(SMB_COM_TRANSACTION2, 15, tcon, + (void **) &pSMB, (void **) &pSMBr); if (rc) return rc; -- cgit v1.2.3 From 3036e7b490bf7878c6dae952eec5fb87b1106589 Mon Sep 17 00:00:00 2001 From: Jiri Olsa Date: Thu, 30 Sep 2010 15:15:33 -0700 Subject: proc: make /proc/pid/limits world readable Having the limits file world readable will ease the task of system management on systems where root privileges might be restricted. Having admin restricted with root priviledges, he/she could not check other users process' limits. Also it'd align with most of the /proc stat files. Signed-off-by: Jiri Olsa Acked-by: Neil Horman Cc: Eugene Teo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/proc/base.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'fs') diff --git a/fs/proc/base.c b/fs/proc/base.c index a1c43e7c8a7b..8e4addaa5424 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -2675,7 +2675,7 @@ static const struct pid_entry tgid_base_stuff[] = { INF("auxv", S_IRUSR, proc_pid_auxv), ONE("status", S_IRUGO, proc_pid_status), ONE("personality", S_IRUSR, proc_pid_personality), - INF("limits", S_IRUSR, proc_pid_limits), + INF("limits", S_IRUGO, proc_pid_limits), #ifdef CONFIG_SCHED_DEBUG REG("sched", S_IRUGO|S_IWUSR, proc_pid_sched_operations), #endif @@ -3011,7 +3011,7 @@ static const struct pid_entry tid_base_stuff[] = { INF("auxv", S_IRUSR, proc_pid_auxv), ONE("status", S_IRUGO, proc_pid_status), ONE("personality", S_IRUSR, proc_pid_personality), - INF("limits", S_IRUSR, proc_pid_limits), + INF("limits", S_IRUGO, proc_pid_limits), #ifdef CONFIG_SCHED_DEBUG REG("sched", S_IRUGO|S_IWUSR, proc_pid_sched_operations), #endif -- cgit v1.2.3 From 3f259d092c7a2fdf217823e8f1838530adb0cdb0 Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Thu, 30 Sep 2010 15:15:37 -0700 Subject: reiserfs: fix dependency inversion between inode and reiserfs mutexes The reiserfs mutex already depends on the inode mutex, so we can't lock the inode mutex in reiserfs_unpack() without using the safe locking API, because reiserfs_unpack() is always called with the reiserfs mutex locked. This fixes: ======================================================= [ INFO: possible circular locking dependency detected ] 2.6.35c #13 ------------------------------------------------------- lilo/1606 is trying to acquire lock: (&sb->s_type->i_mutex_key#8){+.+.+.}, at: [] reiserfs_unpack+0x60/0x110 [reiserfs] but task is already holding lock: (&REISERFS_SB(s)->lock){+.+.+.}, at: [] reiserfs_write_lock+0x28/0x40 [reiserfs] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&REISERFS_SB(s)->lock){+.+.+.}: [] lock_acquire+0x67/0x80 [] __mutex_lock_common+0x4d/0x410 [] mutex_lock_nested+0x18/0x20 [] reiserfs_write_lock+0x28/0x40 [reiserfs] [] reiserfs_lookup_privroot+0x2a/0x90 [reiserfs] [] reiserfs_fill_super+0x941/0xe60 [reiserfs] [] get_sb_bdev+0x117/0x170 [] get_super_block+0x21/0x30 [reiserfs] [] vfs_kern_mount+0x6a/0x1b0 [] do_kern_mount+0x39/0xe0 [] do_mount+0x340/0x790 [] sys_mount+0x84/0xb0 [] syscall_call+0x7/0xb -> #0 (&sb->s_type->i_mutex_key#8){+.+.+.}: [] __lock_acquire+0x1026/0x1180 [] lock_acquire+0x67/0x80 [] __mutex_lock_common+0x4d/0x410 [] mutex_lock_nested+0x18/0x20 [] reiserfs_unpack+0x60/0x110 [reiserfs] [] reiserfs_ioctl+0x272/0x320 [reiserfs] [] vfs_ioctl+0x28/0xa0 [] do_vfs_ioctl+0x32d/0x5c0 [] sys_ioctl+0x63/0x70 [] syscall_call+0x7/0xb other info that might help us debug this: 1 lock held by lilo/1606: #0: (&REISERFS_SB(s)->lock){+.+.+.}, at: [] reiserfs_write_lock+0x28/0x40 [reiserfs] stack backtrace: Pid: 1606, comm: lilo Not tainted 2.6.35c #13 Call Trace: [] __lock_acquire+0x1026/0x1180 [] lock_acquire+0x67/0x80 [] __mutex_lock_common+0x4d/0x410 [] mutex_lock_nested+0x18/0x20 [] reiserfs_unpack+0x60/0x110 [reiserfs] [] reiserfs_ioctl+0x272/0x320 [reiserfs] [] vfs_ioctl+0x28/0xa0 [] do_vfs_ioctl+0x32d/0x5c0 [] sys_ioctl+0x63/0x70 [] syscall_call+0x7/0xb Reported-by: Jarek Poplawski Tested-by: Jarek Poplawski Signed-off-by: Frederic Weisbecker Cc: Jeff Mahoney Cc: [2.6.32 and later] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/reiserfs/ioctl.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'fs') diff --git a/fs/reiserfs/ioctl.c b/fs/reiserfs/ioctl.c index f53505de0712..679d5029f50f 100644 --- a/fs/reiserfs/ioctl.c +++ b/fs/reiserfs/ioctl.c @@ -188,7 +188,7 @@ int reiserfs_unpack(struct inode *inode, struct file *filp) /* we need to make sure nobody is changing the file size beneath ** us */ - mutex_lock(&inode->i_mutex); + reiserfs_mutex_lock_safe(&inode->i_mutex, inode->i_sb); reiserfs_write_lock(inode->i_sb); write_from = inode->i_size & (blocksize - 1); -- cgit v1.2.3 From 9d8117e72bf453dd9d85e0cd322ce4a0f8bccbc0 Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Thu, 30 Sep 2010 15:15:38 -0700 Subject: reiserfs: fix unwanted reiserfs lock recursion Prevent from recursively locking the reiserfs lock in reiserfs_unpack() because we may call journal_begin() that requires the lock to be taken only once, otherwise it won't be able to release the lock while taking other mutexes, ending up in inverted dependencies between the journal mutex and the reiserfs lock for example. This fixes: ======================================================= [ INFO: possible circular locking dependency detected ] 2.6.35.4.4a #3 ------------------------------------------------------- lilo/1620 is trying to acquire lock: (&journal->j_mutex){+.+...}, at: [] do_journal_begin_r+0x7f/0x340 [reiserfs] but task is already holding lock: (&REISERFS_SB(s)->lock){+.+.+.}, at: [] reiserfs_write_lock+0x28/0x40 [reiserfs] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&REISERFS_SB(s)->lock){+.+.+.}: [] lock_acquire+0x67/0x80 [] __mutex_lock_common+0x4d/0x410 [] mutex_lock_nested+0x18/0x20 [] reiserfs_write_lock+0x28/0x40 [reiserfs] [] do_journal_begin_r+0x86/0x340 [reiserfs] [] journal_begin+0x77/0x140 [reiserfs] [] reiserfs_remount+0x224/0x530 [reiserfs] [] do_remount_sb+0x60/0x110 [] do_mount+0x625/0x790 [] sys_mount+0x84/0xb0 [] syscall_call+0x7/0xb -> #0 (&journal->j_mutex){+.+...}: [] __lock_acquire+0x1026/0x1180 [] lock_acquire+0x67/0x80 [] __mutex_lock_common+0x4d/0x410 [] mutex_lock_nested+0x18/0x20 [] do_journal_begin_r+0x7f/0x340 [reiserfs] [] journal_begin+0x77/0x140 [reiserfs] [] reiserfs_persistent_transaction+0x41/0x90 [reiserfs] [] reiserfs_get_block+0x22c/0x1530 [reiserfs] [] __block_prepare_write+0x1bb/0x3a0 [] block_prepare_write+0x26/0x40 [] reiserfs_prepare_write+0x88/0x170 [reiserfs] [] reiserfs_unpack+0xe6/0x120 [reiserfs] [] reiserfs_ioctl+0x272/0x320 [reiserfs] [] vfs_ioctl+0x28/0xa0 [] do_vfs_ioctl+0x32d/0x5c0 [] sys_ioctl+0x63/0x70 [] syscall_call+0x7/0xb other info that might help us debug this: 2 locks held by lilo/1620: #0: (&sb->s_type->i_mutex_key#8){+.+.+.}, at: [] reiserfs_unpack+0x6a/0x120 [reiserfs] #1: (&REISERFS_SB(s)->lock){+.+.+.}, at: [] reiserfs_write_lock+0x28/0x40 [reiserfs] stack backtrace: Pid: 1620, comm: lilo Not tainted 2.6.35.4.4a #3 Call Trace: [] __lock_acquire+0x1026/0x1180 [] lock_acquire+0x67/0x80 [] __mutex_lock_common+0x4d/0x410 [] mutex_lock_nested+0x18/0x20 [] do_journal_begin_r+0x7f/0x340 [reiserfs] [] journal_begin+0x77/0x140 [reiserfs] [] reiserfs_persistent_transaction+0x41/0x90 [reiserfs] [] reiserfs_get_block+0x22c/0x1530 [reiserfs] [] __block_prepare_write+0x1bb/0x3a0 [] block_prepare_write+0x26/0x40 [] reiserfs_prepare_write+0x88/0x170 [reiserfs] [] reiserfs_unpack+0xe6/0x120 [reiserfs] [] reiserfs_ioctl+0x272/0x320 [reiserfs] [] vfs_ioctl+0x28/0xa0 [] do_vfs_ioctl+0x32d/0x5c0 [] sys_ioctl+0x63/0x70 [] syscall_call+0x7/0xb Reported-by: Jarek Poplawski Tested-by: Jarek Poplawski Signed-off-by: Frederic Weisbecker Cc: Jeff Mahoney Cc: All since 2.6.32 Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/reiserfs/ioctl.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) (limited to 'fs') diff --git a/fs/reiserfs/ioctl.c b/fs/reiserfs/ioctl.c index 679d5029f50f..5cbb81e134ac 100644 --- a/fs/reiserfs/ioctl.c +++ b/fs/reiserfs/ioctl.c @@ -170,6 +170,7 @@ int reiserfs_prepare_write(struct file *f, struct page *page, int reiserfs_unpack(struct inode *inode, struct file *filp) { int retval = 0; + int depth; int index; struct page *page; struct address_space *mapping; @@ -189,7 +190,7 @@ int reiserfs_unpack(struct inode *inode, struct file *filp) ** us */ reiserfs_mutex_lock_safe(&inode->i_mutex, inode->i_sb); - reiserfs_write_lock(inode->i_sb); + depth = reiserfs_write_lock_once(inode->i_sb); write_from = inode->i_size & (blocksize - 1); /* if we are on a block boundary, we are already unpacked. */ @@ -224,6 +225,6 @@ int reiserfs_unpack(struct inode *inode, struct file *filp) out: mutex_unlock(&inode->i_mutex); - reiserfs_write_unlock(inode->i_sb); + reiserfs_write_unlock_once(inode->i_sb, depth); return retval; } -- cgit v1.2.3 From 0157443c56bcc50be4933ebdff3ece723dfd535c Mon Sep 17 00:00:00 2001 From: Geert Uytterhoeven Date: Thu, 30 Sep 2010 22:06:21 +0200 Subject: fuse: Initialize total_len in fuse_retrieve() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit fs/fuse/dev.c:1357: warning: ‘total_len’ may be used uninitialized in this function Initialize total_len to zero, else its value will be undefined. Signed-off-by: Geert Uytterhoeven Signed-off-by: Miklos Szeredi --- fs/fuse/dev.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'fs') diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index d367af1514ef..cde755cca564 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -1354,7 +1354,7 @@ static int fuse_retrieve(struct fuse_conn *fc, struct inode *inode, loff_t file_size; unsigned int num; unsigned int offset; - size_t total_len; + size_t total_len = 0; req = fuse_get_req(fc); if (IS_ERR(req)) -- cgit v1.2.3 From aaead25b954879e1a708ff2f3602f494c18d20b5 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Mon, 4 Oct 2010 14:25:33 +0200 Subject: writeback: always use sb->s_bdi for writeback purposes We currently use struct backing_dev_info for various different purposes. Originally it was introduced to describe a backing device which includes an unplug and congestion function and various bits of readahead information and VM-relevant flags. We're also using for tracking dirty inodes for writeback. To make writeback properly find all inodes we need to only access the per-filesystem backing_device pointed to by the superblock in ->s_bdi inside the writeback code, and not the instances pointeded to by inode->i_mapping->backing_dev which can be overriden by special devices or might not be set at all by some filesystems. Long term we should split out the writeback-relevant bits of struct backing_device_info (which includes more than the current bdi_writeback) and only point to it from the superblock while leaving the traditional backing device as a separate structure that can be overriden by devices. The one exception for now is the block device filesystem which really wants different writeback contexts for it's different (internal) inodes to handle the writeout more efficiently. For now we do this with a hack in fs-writeback.c because we're so late in the cycle, but in the future I plan to replace this with a superblock method that allows for multiple writeback contexts per filesystem. Signed-off-by: Christoph Hellwig Signed-off-by: Jens Axboe --- fs/fs-writeback.c | 19 ++++--------------- 1 file changed, 4 insertions(+), 15 deletions(-) (limited to 'fs') diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 5581122bd2c0..ab38fef1c9a1 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -72,22 +72,11 @@ int writeback_in_progress(struct backing_dev_info *bdi) static inline struct backing_dev_info *inode_to_bdi(struct inode *inode) { struct super_block *sb = inode->i_sb; - struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info; - /* - * For inodes on standard filesystems, we use superblock's bdi. For - * inodes on virtual filesystems, we want to use inode mapping's bdi - * because they can possibly point to something useful (think about - * block_dev filesystem). - */ - if (sb->s_bdi && sb->s_bdi != &noop_backing_dev_info) { - /* Some device inodes could play dirty tricks. Catch them... */ - WARN(bdi != sb->s_bdi && bdi_cap_writeback_dirty(bdi), - "Dirtiable inode bdi %s != sb bdi %s\n", - bdi->name, sb->s_bdi->name); - return sb->s_bdi; - } - return bdi; + if (strcmp(sb->s_type->name, "bdev") == 0) + return inode->i_mapping->backing_dev_info; + + return sb->s_bdi; } static void bdi_queue_work(struct backing_dev_info *bdi, -- cgit v1.2.3 From 081003fff467ea0e727f66d5d435b4f473a789b3 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Fri, 1 Oct 2010 07:43:54 +0000 Subject: xfs: properly account for reclaimed inodes When marking an inode reclaimable, a per-AG counter is increased, the inode is tagged reclaimable in its per-AG tree, and, when this is the first reclaimable inode in the AG, the AG entry in the per-mount tree is also tagged. When an inode is finally reclaimed, however, it is only deleted from the per-AG tree. Neither the counter is decreased, nor is the parent tree's AG entry untagged properly. Since the tags in the per-mount tree are not cleared, the inode shrinker iterates over all AGs that have had reclaimable inodes at one point in time. The counters on the other hand signal an increasing amount of slab objects to reclaim. Since "70e60ce xfs: convert inode shrinker to per-filesystem context" this is not a real issue anymore because the shrinker bails out after one iteration. But the problem was observable on a machine running v2.6.34, where the reclaimable work increased and each process going into direct reclaim eventually got stuck on the xfs inode shrinking path, trying to scan several million objects. Fix this by properly unwinding the reclaimable-state tracking of an inode when it is reclaimed. Signed-off-by: Johannes Weiner Cc: stable@kernel.org Reviewed-by: Dave Chinner Signed-off-by: Alex Elder --- fs/xfs/linux-2.6/xfs_sync.c | 19 ++++++++++++++----- 1 file changed, 14 insertions(+), 5 deletions(-) (limited to 'fs') diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c index d59c4a65d492..81976ffed7d6 100644 --- a/fs/xfs/linux-2.6/xfs_sync.c +++ b/fs/xfs/linux-2.6/xfs_sync.c @@ -668,14 +668,11 @@ xfs_inode_set_reclaim_tag( xfs_perag_put(pag); } -void -__xfs_inode_clear_reclaim_tag( - xfs_mount_t *mp, +STATIC void +__xfs_inode_clear_reclaim( xfs_perag_t *pag, xfs_inode_t *ip) { - radix_tree_tag_clear(&pag->pag_ici_root, - XFS_INO_TO_AGINO(mp, ip->i_ino), XFS_ICI_RECLAIM_TAG); pag->pag_ici_reclaimable--; if (!pag->pag_ici_reclaimable) { /* clear the reclaim tag from the perag radix tree */ @@ -689,6 +686,17 @@ __xfs_inode_clear_reclaim_tag( } } +void +__xfs_inode_clear_reclaim_tag( + xfs_mount_t *mp, + xfs_perag_t *pag, + xfs_inode_t *ip) +{ + radix_tree_tag_clear(&pag->pag_ici_root, + XFS_INO_TO_AGINO(mp, ip->i_ino), XFS_ICI_RECLAIM_TAG); + __xfs_inode_clear_reclaim(pag, ip); +} + /* * Inodes in different states need to be treated differently, and the return * value of xfs_iflush is not sufficient to get this right. The following table @@ -838,6 +846,7 @@ reclaim: if (!radix_tree_delete(&pag->pag_ici_root, XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino))) ASSERT(0); + __xfs_inode_clear_reclaim(pag, ip); write_unlock(&pag->pag_ici_lock); /* -- cgit v1.2.3 From 936aeb5c4a9fa799abd7d630a94223acedcaad50 Mon Sep 17 00:00:00 2001 From: Henry C Chang Date: Wed, 22 Sep 2010 20:21:17 -0700 Subject: ceph: fix list_add usage on unsafe_writes list Fix argument order. Signed-off-by: Henry C Chang Signed-off-by: Sage Weil --- fs/ceph/file.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'fs') diff --git a/fs/ceph/file.c b/fs/ceph/file.c index 8c044a4f0457..66e4da6dba22 100644 --- a/fs/ceph/file.c +++ b/fs/ceph/file.c @@ -697,7 +697,7 @@ more: * start_request so that a tid has been assigned. */ spin_lock(&ci->i_unsafe_lock); - list_add(&ci->i_unsafe_writes, &req->r_unsafe_item); + list_add(&req->r_unsafe_item, &ci->i_unsafe_writes); spin_unlock(&ci->i_unsafe_lock); ceph_get_cap_refs(ci, CEPH_CAP_FILE_WR); } -- cgit v1.2.3 From 6bc18876ba01fd4a077db6e1ed27201e4bda8864 Mon Sep 17 00:00:00 2001 From: Sage Weil Date: Mon, 27 Sep 2010 10:18:52 -0700 Subject: ceph: avoid null deref in osd request error path If we interrupt an osd request, we call __cancel_request, but it wasn't verifying that req->r_osd was non-NULL before dereferencing it. This could cause a crash if osds were flapping and we aborted a request on said osd. Reported-by: Henry C Chang Signed-off-by: Sage Weil --- fs/ceph/osd_client.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'fs') diff --git a/fs/ceph/osd_client.c b/fs/ceph/osd_client.c index dfced1dacbcd..3b5571b8ce22 100644 --- a/fs/ceph/osd_client.c +++ b/fs/ceph/osd_client.c @@ -549,7 +549,7 @@ static void __unregister_request(struct ceph_osd_client *osdc, */ static void __cancel_request(struct ceph_osd_request *req) { - if (req->r_sent) { + if (req->r_sent && req->r_osd) { ceph_con_revoke(&req->r_osd->o_con, req->r_request); req->r_sent = 0; } -- cgit v1.2.3 From 92923dcbfcad107b0e0469f579a2455729ccf10e Mon Sep 17 00:00:00 2001 From: "Aneesh Kumar K.V" Date: Tue, 5 Oct 2010 16:03:41 +0530 Subject: ceph: Fix return value of encode_fh function encode_fh function should return 255 on error as done by other file system to indicate EOVERFLOW. Also max_len is in sizeof(u32) units and not in bytes. Signed-off-by: Aneesh Kumar K.V Signed-off-by: Sage Weil --- fs/ceph/export.c | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) (limited to 'fs') diff --git a/fs/ceph/export.c b/fs/ceph/export.c index 4480cb1c63e7..387c5823944e 100644 --- a/fs/ceph/export.c +++ b/fs/ceph/export.c @@ -42,32 +42,34 @@ struct ceph_nfs_confh { static int ceph_encode_fh(struct dentry *dentry, u32 *rawfh, int *max_len, int connectable) { + int type; struct ceph_nfs_fh *fh = (void *)rawfh; struct ceph_nfs_confh *cfh = (void *)rawfh; struct dentry *parent = dentry->d_parent; struct inode *inode = dentry->d_inode; - int type; + int connected_handle_length = sizeof(*cfh)/4; + int handle_length = sizeof(*fh)/4; /* don't re-export snaps */ if (ceph_snap(inode) != CEPH_NOSNAP) return -EINVAL; - if (*max_len >= sizeof(*cfh)) { + if (*max_len >= connected_handle_length) { dout("encode_fh %p connectable\n", dentry); cfh->ino = ceph_ino(dentry->d_inode); cfh->parent_ino = ceph_ino(parent->d_inode); cfh->parent_name_hash = parent->d_name.hash; - *max_len = sizeof(*cfh); + *max_len = connected_handle_length; type = 2; - } else if (*max_len > sizeof(*fh)) { + } else if (*max_len >= handle_length) { if (connectable) - return -ENOSPC; + return 255; dout("encode_fh %p\n", dentry); fh->ino = ceph_ino(dentry->d_inode); - *max_len = sizeof(*fh); + *max_len = handle_length; type = 1; } else { - return -ENOSPC; + return 255; } return type; } -- cgit v1.2.3 From bba0cd0e3d97472855840af817b766e3f632a501 Mon Sep 17 00:00:00 2001 From: "Aneesh Kumar K.V" Date: Tue, 5 Oct 2010 16:03:42 +0530 Subject: ceph: Update max_len with minimum required size encode_fh on error should update max_len with minimum required size, so that caller can redo the call with the reallocated buffer. This is required with open by handle patch series Signed-off-by: Aneesh Kumar K.V Signed-off-by: Sage Weil --- fs/ceph/export.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) (limited to 'fs') diff --git a/fs/ceph/export.c b/fs/ceph/export.c index 387c5823944e..e38423e82f2e 100644 --- a/fs/ceph/export.c +++ b/fs/ceph/export.c @@ -62,13 +62,16 @@ static int ceph_encode_fh(struct dentry *dentry, u32 *rawfh, int *max_len, *max_len = connected_handle_length; type = 2; } else if (*max_len >= handle_length) { - if (connectable) + if (connectable) { + *max_len = connected_handle_length; return 255; + } dout("encode_fh %p\n", dentry); fh->ino = ceph_ino(dentry->d_inode); *max_len = handle_length; type = 1; } else { + *max_len = handle_length; return 255; } return type; -- cgit v1.2.3 From 21b559de56695d36b3f0819b7e2454737db254f8 Mon Sep 17 00:00:00 2001 From: Greg Farnum Date: Wed, 6 Oct 2010 15:46:30 -0700 Subject: ceph: send cap release message early on failed revoke. If an MDS tries to revoke caps that we don't have, we want to send releases early since they probably contain the caps message the MDS is looking for. Previously, we only sent the messages if we didn't have the inode either. But in a multi-mds system we can retain the inode after dropping all caps for a single MDS. Signed-off-by: Greg Farnum Signed-off-by: Sage Weil --- fs/ceph/caps.c | 23 +++++++++++++---------- 1 file changed, 13 insertions(+), 10 deletions(-) (limited to 'fs') diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c index 73c153092f72..97de325a49f8 100644 --- a/fs/ceph/caps.c +++ b/fs/ceph/caps.c @@ -2774,15 +2774,7 @@ void ceph_handle_caps(struct ceph_mds_session *session, if (op == CEPH_CAP_OP_IMPORT) __queue_cap_release(session, vino.ino, cap_id, mseq, seq); - - /* - * send any full release message to try to move things - * along for the mds (who clearly thinks we still have this - * cap). - */ - ceph_add_cap_releases(mdsc, session); - ceph_send_cap_releases(mdsc, session); - goto done; + goto flush_cap_releases; } /* these will work even if we don't have a cap yet */ @@ -2810,7 +2802,7 @@ void ceph_handle_caps(struct ceph_mds_session *session, dout(" no cap on %p ino %llx.%llx from mds%d\n", inode, ceph_ino(inode), ceph_snap(inode), mds); spin_unlock(&inode->i_lock); - goto done; + goto flush_cap_releases; } /* note that each of these drops i_lock for us */ @@ -2834,6 +2826,17 @@ void ceph_handle_caps(struct ceph_mds_session *session, ceph_cap_op_name(op)); } + goto done; + +flush_cap_releases: + /* + * send any full release message to try to move things + * along for the mds (who clearly thinks we still have this + * cap). + */ + ceph_add_cap_releases(mdsc, session); + ceph_send_cap_releases(mdsc, session); + done: mutex_unlock(&session->s_mutex); done_unlocked: -- cgit v1.2.3 From d91f2438d881514e4a923fd786dbd94b764a9440 Mon Sep 17 00:00:00 2001 From: Sage Weil Date: Wed, 22 Sep 2010 11:16:00 -0700 Subject: ceph: update issue_seq on cap grant We need to update the issue_seq on any grant operation, be it via an MDS reply or a separate grant message. The update in the grant path was missing. This broke cap release for inodes in which the MDS sent an explicit grant message that was not soon after followed by a successful MDS reply on the same inode. Also fix the signedness on seq locals. Signed-off-by: Sage Weil --- fs/ceph/caps.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) (limited to 'fs') diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c index 97de325a49f8..5e9da996a151 100644 --- a/fs/ceph/caps.c +++ b/fs/ceph/caps.c @@ -2283,7 +2283,8 @@ static void handle_cap_grant(struct inode *inode, struct ceph_mds_caps *grant, { struct ceph_inode_info *ci = ceph_inode(inode); int mds = session->s_mds; - int seq = le32_to_cpu(grant->seq); + unsigned seq = le32_to_cpu(grant->seq); + unsigned issue_seq = le32_to_cpu(grant->issue_seq); int newcaps = le32_to_cpu(grant->caps); int issued, implemented, used, wanted, dirty; u64 size = le64_to_cpu(grant->size); @@ -2295,8 +2296,8 @@ static void handle_cap_grant(struct inode *inode, struct ceph_mds_caps *grant, int revoked_rdcache = 0; int queue_invalidate = 0; - dout("handle_cap_grant inode %p cap %p mds%d seq %d %s\n", - inode, cap, mds, seq, ceph_cap_string(newcaps)); + dout("handle_cap_grant inode %p cap %p mds%d seq %u/%u %s\n", + inode, cap, mds, seq, issue_seq, ceph_cap_string(newcaps)); dout(" size %llu max_size %llu, i_size %llu\n", size, max_size, inode->i_size); @@ -2392,6 +2393,7 @@ static void handle_cap_grant(struct inode *inode, struct ceph_mds_caps *grant, } cap->seq = seq; + cap->issue_seq = issue_seq; /* file layout may have changed */ ci->i_layout = grant->layout; -- cgit v1.2.3 From f17b1f9f1a5882e486aad469b9ac4cb18581707f Mon Sep 17 00:00:00 2001 From: Boaz Harrosh Date: Thu, 7 Oct 2010 13:37:51 -0400 Subject: exofs: Fix double page_unlock BUG in write_begin/end This BUG is there since the first submit of the code, but only triggered in last Kernel. It's timing related do to the asynchronous object-creation behaviour of exofs. (Which should be investigated farther) The bug is obvious hence the fixed. Signed-off-by: Boaz Harrosh --- fs/exofs/inode.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) (limited to 'fs') diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c index eb7368ebd8cd..3eadd97324b1 100644 --- a/fs/exofs/inode.c +++ b/fs/exofs/inode.c @@ -54,6 +54,9 @@ struct page_collect { unsigned nr_pages; unsigned long length; loff_t pg_first; /* keep 64bit also in 32-arches */ + bool read_4_write; /* This means two things: that the read is sync + * And the pages should not be unlocked. + */ }; static void _pcol_init(struct page_collect *pcol, unsigned expected_pages, @@ -71,6 +74,7 @@ static void _pcol_init(struct page_collect *pcol, unsigned expected_pages, pcol->nr_pages = 0; pcol->length = 0; pcol->pg_first = -1; + pcol->read_4_write = false; } static void _pcol_reset(struct page_collect *pcol) @@ -347,7 +351,8 @@ static int readpage_strip(void *data, struct page *page) if (PageError(page)) ClearPageError(page); - unlock_page(page); + if (!pcol->read_4_write) + unlock_page(page); EXOFS_DBGMSG("readpage_strip(0x%lx, 0x%lx) empty page," " splitting\n", inode->i_ino, page->index); @@ -428,6 +433,7 @@ static int _readpage(struct page *page, bool is_sync) /* readpage_strip might call read_exec(,is_sync==false) at several * places but not if we have a single page. */ + pcol.read_4_write = is_sync; ret = readpage_strip(&pcol, page); if (ret) { EXOFS_ERR("_readpage => %d\n", ret); -- cgit v1.2.3 From 7c5347733dcc4ba0bac0baf86d99fae0561f33b7 Mon Sep 17 00:00:00 2001 From: Eric Paris Date: Mon, 11 Oct 2010 18:13:31 -0400 Subject: fanotify: disable fanotify syscalls This patch disables the fanotify syscalls by just not building them and letting the cond_syscall() statements in kernel/sys_ni.c redirect them to sys_ni_syscall(). It was pointed out by Tvrtko Ursulin that the fanotify interface did not include an explicit prioritization between groups. This is necessary for fanotify to be usable for hierarchical storage management software, as they must get first access to the file, before inotify-like notifiers see the file. This feature can be added in an ABI compatible way in the next release (by using a number of bits in the flags field to carry the info) but it was suggested by Alan that maybe we should just hold off and do it in the next cycle, likely with an (new) explicit argument to the syscall. I don't like this approach best as I know people are already starting to use the current interface, but Alan is all wise and noone on list backed me up with just using what we have. I feel this is needlessly ripping the rug out from under people at the last minute, but if others think it needs to be a new argument it might be the best way forward. Three choices: Go with what we got (and implement the new feature next cycle). Add a new field right now (and implement the new feature next cycle). Wait till next cycle to release the ABI (and implement the new feature next cycle). This is number 3. Signed-off-by: Eric Paris Signed-off-by: Linus Torvalds --- fs/notify/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'fs') diff --git a/fs/notify/Kconfig b/fs/notify/Kconfig index 22c629eedd82..b388443c3a09 100644 --- a/fs/notify/Kconfig +++ b/fs/notify/Kconfig @@ -3,4 +3,4 @@ config FSNOTIFY source "fs/notify/dnotify/Kconfig" source "fs/notify/inotify/Kconfig" -source "fs/notify/fanotify/Kconfig" +#source "fs/notify/fanotify/Kconfig" -- cgit v1.2.3 From b1e86db1de2e8bc2be9fb94fae3451c2a776e8c1 Mon Sep 17 00:00:00 2001 From: "J. Bruce Fields" Date: Wed, 13 Oct 2010 14:46:17 -0400 Subject: nfsd: fix BUG at fs/nfsd/nfsfh.h:199 on unlink As of commit 43a9aa64a2f4330a9cb59aaf5c5636566bce067c "NFSD: Fill in WCC data for REMOVE, RMDIR, MKNOD, and MKDIR", we sometimes call fh_unlock on a filehandle that isn't fully initialized. We should fix up the callers, but as a quick fix it is also sufficient just to remove this assertion. Reported-by: Marius Tolzmann Cc: Chuck Lever Signed-off-by: J. Bruce Fields --- fs/nfsd/nfsfh.h | 2 -- 1 file changed, 2 deletions(-) (limited to 'fs') diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h index cdfb8c6a4206..c16f8d8331b5 100644 --- a/fs/nfsd/nfsfh.h +++ b/fs/nfsd/nfsfh.h @@ -196,8 +196,6 @@ fh_lock(struct svc_fh *fhp) static inline void fh_unlock(struct svc_fh *fhp) { - BUG_ON(!fhp->fh_dentry); - if (fhp->fh_locked) { fill_post_wcc(fhp); mutex_unlock(&fhp->fh_dentry->d_inode->i_mutex); -- cgit v1.2.3 From 0eead9ab41da33644ae2c97c57ad03da636a0422 Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Thu, 14 Oct 2010 10:57:40 -0700 Subject: Don't dump task struct in a.out core-dumps akiphie points out that a.out core-dumps have that odd task struct dumping that was never used and was never really a good idea (it goes back into the mists of history, probably the original core-dumping code). Just remove it. Also do the access_ok() check on dump_write(). It probably doesn't matter (since normal filesystems all seem to do it anyway), but he points out that it's normally done by the VFS layer, so ... [ I suspect that we should possibly do "vfs_write()" instead of calling ->write directly. That also does the whole fsnotify and write statistics thing, which may or may not be a good idea. ] And just to be anal, do this all for the x86-64 32-bit a.out emulation code too, even though it's not enabled (and won't currently even compile) Reported-by: akiphie Signed-off-by: Linus Torvalds --- fs/binfmt_aout.c | 4 ---- 1 file changed, 4 deletions(-) (limited to 'fs') diff --git a/fs/binfmt_aout.c b/fs/binfmt_aout.c index f96eff04e11a..a6395bdb26ae 100644 --- a/fs/binfmt_aout.c +++ b/fs/binfmt_aout.c @@ -134,10 +134,6 @@ static int aout_core_dump(struct coredump_params *cprm) if (!dump_write(file, dump_start, dump_size)) goto end_coredump; } -/* Finally dump the task struct. Not be used by gdb, but could be useful */ - set_fs(KERNEL_DS); - if (!dump_write(file, current, sizeof(*current))) - goto end_coredump; end_coredump: set_fs(fs); return has_dumped; -- cgit v1.2.3 From 3aa0ce825ade0cf5506e32ccf51d01fc8d22a9cf Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Thu, 14 Oct 2010 14:32:06 -0700 Subject: Un-inline the core-dump helper functions Tony Luck reports that the addition of the access_ok() check in commit 0eead9ab41da ("Don't dump task struct in a.out core-dumps") broke the ia64 compile due to missing the necessary header file includes. Rather than add yet another include () to make everything happy, just uninline the silly core dump helper functions and move the bodies to fs/exec.c where they make a lot more sense. dump_seek() in particular was too big to be an inline function anyway, and none of them are in any way performance-critical. And we really don't need to mess up our include file headers more than they already are. Reported-and-tested-by: Tony Luck Signed-off-by: Linus Torvalds --- fs/exec.c | 38 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 38 insertions(+) (limited to 'fs') diff --git a/fs/exec.c b/fs/exec.c index 828dd2461d6b..03278c984ba0 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -2014,3 +2014,41 @@ fail_creds: fail: return; } + +/* + * Core dumping helper functions. These are the only things you should + * do on a core-file: use only these functions to write out all the + * necessary info. + */ +int dump_write(struct file *file, const void *addr, int nr) +{ + return access_ok(VERIFY_READ, addr, nr) && file->f_op->write(file, addr, nr, &file->f_pos) == nr; +} + +int dump_seek(struct file *file, loff_t off) +{ + int ret = 1; + + if (file->f_op->llseek && file->f_op->llseek != no_llseek) { + if (file->f_op->llseek(file, off, SEEK_CUR) < 0) + return 0; + } else { + char *buf = (char *)get_zeroed_page(GFP_KERNEL); + + if (!buf) + return 0; + while (off > 0) { + unsigned long n = off; + + if (n > PAGE_SIZE) + n = PAGE_SIZE; + if (!dump_write(file, buf, n)) { + ret = 0; + break; + } + off -= n; + } + free_page((unsigned long)buf); + } + return ret; +} -- cgit v1.2.3 From 8fd01d6cfbf75465d84a4e533ed70c5f57b3ff51 Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Thu, 14 Oct 2010 19:15:28 -0700 Subject: Export dump_{write,seek} to binary loader modules If you build aout support as a module, you'll want these exported. Reported-by: Tetsuo Handa Signed-off-by: Linus Torvalds --- fs/exec.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'fs') diff --git a/fs/exec.c b/fs/exec.c index 03278c984ba0..6d2b6f936858 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -2024,6 +2024,7 @@ int dump_write(struct file *file, const void *addr, int nr) { return access_ok(VERIFY_READ, addr, nr) && file->f_op->write(file, addr, nr, &file->f_pos) == nr; } +EXPORT_SYMBOL(dump_write); int dump_seek(struct file *file, loff_t off) { @@ -2052,3 +2053,4 @@ int dump_seek(struct file *file, loff_t off) } return ret; } +EXPORT_SYMBOL(dump_seek); -- cgit v1.2.3