linux-toradex.git/fs/jbd2, branch v3.2.73

jbd2: avoid infinite loop when destroying aborted journal

2015-10-13T02:46:13+00:00

commit 841df7df196237ea63233f0f9eaa41db53afd70f upstream.

Commit 6f6a6fda2945 "jbd2: fix ocfs2 corrupt when updating journal
superblock fails" changed jbd2_cleanup_journal_tail() to return EIO
when the journal is aborted. That makes the logic in
jbd2_log_do_checkpoint() bail out, which is fine, except that
jbd2_journal_destroy() expects jbd2_log_do_checkpoint() to always make
progress in cleaning the journal. Without that progress,
jbd2_journal_destroy() loops forever.

Fix jbd2_journal_destroy() to clean up the journal checkpoint lists if
jbd2_log_do_checkpoint() fails with an error.

Reported-by: Eryu Guan 
Tested-by: Eryu Guan 
Fixes: 6f6a6fda294506dfe0e3e0a253bb2d2923f28f0a
Signed-off-by: Jan Kara 
Signed-off-by: Theodore Ts'o 
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings 
Cc: Roland Dreier
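
The loop described in the commit above can be modelled in userspace. The
following is a minimal sketch, not the kernel patch: every model_* name is
hypothetical, and the checkpoint list is reduced to a simple counter. It only
shows why a teardown loop must discard the lists once the checkpoint pass
reports an error.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct model_journal {
    int  checkpoint_transactions; /* transactions still on the checkpoint list */
    bool aborted;                 /* set once the journal has been aborted */
};

/* Models one checkpoint pass: cleans one transaction, or fails with -EIO
 * without making any progress when the journal is aborted. */
static int model_do_checkpoint(struct model_journal *j)
{
    if (j->aborted)
        return -EIO;
    if (j->checkpoint_transactions > 0)
        j->checkpoint_transactions--;
    return 0;
}

/* Models dropping every remaining checkpoint entry without writeback. */
static void model_destroy_checkpoint(struct model_journal *j)
{
    j->checkpoint_transactions = 0;
}

/* Teardown loop; returns the number of iterations it needed. */
static int model_journal_destroy(struct model_journal *j)
{
    int iterations = 0;

    while (j->checkpoint_transactions > 0) {
        int err = model_do_checkpoint(j);

        iterations++;
        if (err) {
            /* No progress is possible: discard the lists and stop. */
            model_destroy_checkpoint(j);
            break;
        }
    }
    assert(j->checkpoint_transactions == 0);
    return iterations;
}
```

Without the error branch, an aborted journal would keep the counter above zero
and the while loop would never terminate.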

jbd2: protect all log tail updates with j_checkpoint_mutex

2015-10-13T02:46:00+00:00

commit a78bb11d7acd525623c6a0c2ff4e213d527573fa upstream.

There are some log tail updates that are not protected by j_checkpoint_mutex.
Some of these are harmless because they happen during startup or shutdown but
updates in jbd2_journal_commit_transaction() and jbd2_journal_flush() can
really race with other log tail updates (e.g. someone doing
jbd2_journal_flush() with someone running jbd2_cleanup_journal_tail()). So
protect all log tail updates with j_checkpoint_mutex.

Signed-off-by: Jan Kara 
Signed-off-by: "Theodore Ts'o" 
[bwh: Backported to 3.2:
 - Adjust context
 - Add unlock on the error path in jbd2_journal_flush()]
Signed-off-by: Ben Hutchings 
Cc: Bartosz Kwitniewski

jbd2: fix ocfs2 corrupt when updating journal superblock fails

2015-08-12T14:33:15+00:00

commit 6f6a6fda294506dfe0e3e0a253bb2d2923f28f0a upstream.

If updating the journal superblock fails after journal data has been
flushed, the error is dropped and the caller is misled into treating
this as the normal case.  In ocfs2, the checkpoint will be treated as
successful and the other node can get the lock to update. Since
sb_start still points to the old log block, the other node will
rewrite the journal data during journal recovery. Thus the new updates
will be overwritten and ocfs2 becomes corrupted.  So in the above case
we have to return the error, and ocfs2_commit_cache will handle the
error and prevent the other node from updating first.  Only after
recovering the journal can it do the new updates.

The mailing list discussion of this issue can be found at:
https://oss.oracle.com/pipermail/ocfs2-devel/2015-June/010856.html
http://comments.gmane.org/gmane.comp.file-systems.ext4/48841

[ Fixed bug in patch which allowed a non-negative error return from
  jbd2_cleanup_journal_tail() to leak out of jbd2_journal_flush(); this
  was causing xfstests ext4/306 to fail. -- Ted ]

Reported-by: Yiwen Jiang 
Signed-off-by: Joseph Qi 
Signed-off-by: Theodore Ts'o 
Tested-by: Yiwen Jiang 
Cc: Junxiao Bi 
[bwh: Backported to 3.2:
 - Adjust context
 - Don't drop j_checkpoint_mutex where we don't hold it]
Signed-off-by: Ben Hutchings

jbd2: use GFP_NOFS in jbd2_cleanup_journal_tail()

2015-08-12T14:33:15+00:00

commit b4f1afcd068f6e533230dfed00782cd8a907f96b upstream.

jbd2_cleanup_journal_tail() can be invoked by jbd2__journal_start(),
so allocations should be done with GFP_NOFS.

[Full stack trace snipped from 3.10-rh7]
[] dump_stack+0x19/0x1b
[] warn_slowpath_common+0x61/0x80
[] warn_slowpath_null+0x1a/0x20
[] slab_pre_alloc_hook.isra.31.part.32+0x15/0x17
[] kmem_cache_alloc+0x55/0x210
[] ? mempool_alloc_slab+0x15/0x20
[] mempool_alloc_slab+0x15/0x20
[] mempool_alloc+0x69/0x170
[] ? _raw_spin_unlock_irq+0xe/0x20
[] ? finish_task_switch+0x5d/0x150
[] bio_alloc_bioset+0x1be/0x2e0
[] blkdev_issue_flush+0x99/0x120
[] jbd2_cleanup_journal_tail+0x93/0xa0 [jbd2] -->GFP_KERNEL
[] jbd2_log_do_checkpoint+0x221/0x4a0 [jbd2]
[] __jbd2_log_wait_for_space+0xa7/0x1e0 [jbd2]
[] start_this_handle+0x2d8/0x550 [jbd2]
[] ? __memcg_kmem_put_cache+0x29/0x30
[] ? kmem_cache_alloc+0x130/0x210
[] jbd2__journal_start+0xba/0x190 [jbd2]
[] ? lru_cache_add+0xe/0x10
[] ? ext4_da_write_begin+0xf9/0x330 [ext4]
[] __ext4_journal_start_sb+0x77/0x160 [ext4]
[] ext4_da_write_begin+0xf9/0x330 [ext4]
[] generic_file_buffered_write_iter+0x10c/0x270
[] __generic_file_write_iter+0x178/0x390
[] __generic_file_aio_write+0x8b/0xb0
[] generic_file_aio_write+0x5d/0xc0
[] ext4_file_write+0xa9/0x450 [ext4]
[] ? pipe_read+0x379/0x4f0
[] do_sync_write+0x90/0xe0
[] vfs_write+0xbd/0x1e0
[] SyS_write+0x58/0xb0
[] system_call_fastpath+0x16/0x1b

Signed-off-by: Dmitry Monakhov 
Signed-off-by: Theodore Ts'o 
Signed-off-by: Ben Hutchings

jbd2: issue cache flush after checkpointing even with internal journal

2015-08-12T14:33:15+00:00

commit 79feb521a44705262d15cc819a4117a447b11ea7 upstream.

When we reach jbd2_cleanup_journal_tail(), there is no guarantee that
checkpointed buffers are on stable storage - especially if buffers were
written out by jbd2_log_do_checkpoint(), they are likely to be only in the
disk's caches. Thus when we update the journal superblock, effectively removing
an old transaction from the journal, this superblock write can reach stable
storage before those checkpointed buffers, which can result in filesystem
corruption after a crash. Thus we must unconditionally issue a cache flush
before we update the journal superblock in these cases.

A similar problem can also occur if the journal superblock is written only to
the disk's caches, another transaction starts reusing the space of the
transaction cleaned from the log, and a power failure happens. Subsequent
journal replay would still try to replay the old transaction but some of its
blocks may already be overwritten by the new transaction. For this reason we
must use WRITE_FUA when updating the log tail, and we must first write the new
log tail to disk and update in-memory information only after that.

Signed-off-by: Jan Kara 
Signed-off-by: "Theodore Ts'o" 
[bwh: Prerequisite for "jbd2: fix ocfs2 corrupt when updating journal
 superblock fails".
 Backported to 3.2:
 - Adjust context
 - Drop changes to jbd2_journal_update_sb_log_tail trace event]
Signed-off-by: Ben Hutchings
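
The ordering required by the commit above can be sketched as a small userspace
model. This is an illustration under stated assumptions, not kernel code: the
model_* names are hypothetical, and the cache flush and the FUA superblock
write are only recorded as events instead of being issued to a device. The
sketch also shows the behaviour that the later "fix ocfs2 corrupt" commit
depends on: a failed superblock write is reported and leaves the in-memory
tail untouched.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical event names; they only label the steps of the model. */
enum model_op { MODEL_OP_FLUSH, MODEL_OP_WRITE_SB_FUA, MODEL_OP_UPDATE_MEMORY };

struct model_log {
    enum model_op ops[4];        /* steps in the order they were performed */
    int count;                   /* number of recorded steps */
    unsigned int tail_in_memory; /* log tail as seen by running code */
    unsigned int tail_on_disk;   /* log tail stored in the superblock */
    int fail_sb_write;           /* non-zero makes the superblock write fail */
};

static void model_record(struct model_log *l, enum model_op op)
{
    assert(l->count < 4);
    l->ops[l->count++] = op;
}

/* Advance the log tail: flush, then write the superblock, then update memory. */
static int model_update_log_tail(struct model_log *l, unsigned int new_tail)
{
    /* 1. Checkpointed buffers must be on stable storage first. */
    model_record(l, MODEL_OP_FLUSH);

    /* 2. The new tail goes to disk with a forced-unit-access style write. */
    model_record(l, MODEL_OP_WRITE_SB_FUA);
    if (l->fail_sb_write)
        return -EIO;             /* in-memory tail is left unchanged */
    l->tail_on_disk = new_tail;

    /* 3. Only now may the freed journal space become visible for reuse. */
    model_record(l, MODEL_OP_UPDATE_MEMORY);
    l->tail_in_memory = new_tail;
    return 0;
}
```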

jbd2: split updating of journal superblock and marking journal empty

2015-08-12T14:33:15+00:00

commit 24bcc89c7e7c64982e6192b4952a0a92379fc341 upstream.

There are three cases of updating the journal superblock. In the first case, we
want to mark the journal as empty (setting s_sequence to 0), in the second case
we want to update the log tail, and in the third case we want to update s_errno.
Split these cases into separate functions. It makes the code slightly more
straightforward and later patches will make the distinction even more important.

Signed-off-by: Jan Kara 
Signed-off-by: "Theodore Ts'o" 
[bwh: Prerequisite for "jbd2: fix ocfs2 corrupt when updating journal
 superblock fails".
 Backported to 3.2: drop changes to trace events.]
Signed-off-by: Ben Hutchings

jbd2: fix r_count overflows leading to buffer overflow in journal recovery

2015-08-06T23:32:12+00:00

commit e531d0bceb402e643a4499de40dd3fa39d8d2e43 upstream.

The journal revoke block recovery code does not check r_count for
sanity, which means that an evil value of r_count could result in
the kernel reading off the end of the revoke table and into whatever
garbage lies beyond.  This could crash the kernel, so fix that.

However, in testing this fix, I discovered that the code to write
out the revoke tables also was not correctly checking to see if the
block was full -- the current offset check is fine so long as the
revoke table space size is a multiple of the record size, but this
is not true when either of journal_csum_v[23] is set.

Signed-off-by: Darrick J. Wong 
Signed-off-by: Theodore Ts'o 
Reviewed-by: Jan Kara 
[bwh: Backported to 3.2: journal checksumming is not supported, so only
 the first fix is needed]
Signed-off-by: Ben Hutchings

ext4: disable synchronous transaction batching if max_batch_time==0

2014-08-06T17:07:34+00:00

commit 5dd214248f94d430d70e9230bda72f2654ac88a8 upstream.

The mount manpage says of the max_batch_time option,

	This optimization can be turned off entirely
	by setting max_batch_time to 0.

But the code doesn't do that.  So fix the code to do
that.

Signed-off-by: Eric Sandeen 
Signed-off-by: Theodore Ts'o 
[bwh: Backported to 3.2: option parsing looks different]
Signed-off-by: Ben Hutchings
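
The intended behaviour of max_batch_time == 0 can be sketched as a small
decision function. This is only an illustration: model_should_batch and its
parameters are hypothetical names, and the other conditions are a simplified
assumption about when a synchronous writer would wait for others to join its
transaction.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the decision taken when a synchronous handle stops:
 * should the caller sleep briefly so that other writers can join the commit?
 */
static bool model_should_batch(bool handle_is_sync, int last_sync_writer,
                               int current_pid, unsigned int max_batch_time_us)
{
    /* A max_batch_time of zero turns the optimization off entirely. */
    if (max_batch_time_us == 0)
        return false;

    /* Otherwise batch only for sync handles from a different writer. */
    return handle_is_sync && last_sync_writer != current_pid;
}
```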

jbd2: fix theoretical race in jbd2__journal_restart

2013-07-27T04:34:25+00:00

commit 39c04153fda8c32e85b51c96eb5511a326ad7609 upstream.

Once we decrement transaction->t_updates, if this is the last handle
holding the transaction from closing, and once we release the
t_handle_lock spinlock, it's possible for the transaction to commit
and be released.  In practice with normal kernels, this probably won't
happen, since the commit happens in a separate kernel thread and it's
unlikely this could all happen within the space of a few CPU cycles.

On the other hand, with a real-time kernel, this could potentially
happen, so save the tid found in transaction->t_tid before we release
t_handle_lock.  It would require an insane configuration, such as one
where the jbd2 thread was set to a very high real-time priority,
perhaps because a high priority real-time thread is trying to read or
write to a file system.  But some people who use real-time kernels
have been known to do insane things, including controlling
laser-wielding industrial robots.  :-)

Signed-off-by: "Theodore Ts'o" 
Signed-off-by: Ben Hutchings
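
The pattern behind the fix above, reading a field while the object is still
guaranteed to be alive, can be sketched in a single-threaded userspace model.
All model_* names are hypothetical and the locking is indicated only by
comments, so this shows the ordering of the accesses and nothing more.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for a transaction and a handle. */
struct model_txn {
    unsigned int tid;   /* transaction ID */
    int updates;        /* handles currently holding the transaction open */
};

struct model_handle {
    struct model_txn *txn;
};

/*
 * Drop the handle's hold on its transaction and return the tid that a
 * later "start commit" request should use.
 */
static unsigned int model_restart_release(struct model_handle *h)
{
    struct model_txn *t = h->txn;
    unsigned int tid;

    assert(t != NULL && t->updates > 0);

    /* (a real implementation would hold the handle lock here) */
    t->updates--;
    tid = t->tid;       /* saved while *t cannot yet have been freed */
    h->txn = NULL;
    /* (lock released here; from now on *t must not be dereferenced) */

    return tid;
}
```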

jbd2: fix race between jbd2_journal_remove_checkpoint and ->j_commit_callback

2013-05-13T14:02:14+00:00

commit 794446c6946513c684d448205fbd76fa35f38b72 upstream.

The following race is possible:

[kjournald2]                              other_task
jbd2_journal_commit_transaction()
  j_state = T_FINISHED;
  spin_unlock(&journal->j_list_lock);
                                         ->jbd2_journal_remove_checkpoint()
					   ->jbd2_journal_free_transaction();
					     ->kmem_cache_free(transaction)
  ->j_commit_callback(journal, transaction);
    -> USE_AFTER_FREE

WARNING: at lib/list_debug.c:62 __list_del_entry+0x1c0/0x250()
Hardware name:
list_del corruption. prev->next should be ffff88019a4ec198, but was 6b6b6b6b6b6b6b6b
Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode sg xhci_hcd button sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_mod
Pid: 16400, comm: jbd2/dm-1-8 Tainted: G        W    3.8.0-rc3+ #107
Call Trace:
 [] warn_slowpath_common+0xad/0xf0
 [] warn_slowpath_fmt+0x46/0x50
 [] ? ext4_journal_commit_callback+0x99/0xc0
 [] __list_del_entry+0x1c0/0x250
 [] ext4_journal_commit_callback+0x6f/0xc0
 [] jbd2_journal_commit_transaction+0x23a6/0x2570
 [] ? try_to_del_timer_sync+0x82/0xa0
 [] ? del_timer_sync+0x91/0x1e0
 [] kjournald2+0x19f/0x6a0
 [] ? wake_up_bit+0x40/0x40
 [] ? bit_spin_lock+0x80/0x80
 [] kthread+0x10e/0x120
 [] ? __init_kthread_worker+0x70/0x70
 [] ret_from_fork+0x7c/0xb0
 [] ? __init_kthread_worker+0x70/0x70

To demonstrate this issue, mount ext4 with the mount -o discard option
on an SSD.  This makes the callback take longer, so the race window
becomes wider.

To fix this, mark the transaction as finished only after the callbacks
have completed.

Signed-off-by: Dmitry Monakhov 
Signed-off-by: "Theodore Ts'o" 
[bwh: Backported to 3.2: s/jbd2_journal_free_transaction/kfree/]
Signed-off-by: Ben Hutchings
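
The ordering rule from the commit above (run the commit callback first, mark
the transaction finished afterwards) can be sketched in a userspace model. The
state names and model_* functions are hypothetical and there is no real
concurrency here; the recording callback only verifies that it never observes
the finished state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical states; once FINISHED is visible, others may free the object. */
enum model_state { MODEL_COMMITTING, MODEL_FINISHED };

struct model_transaction {
    enum model_state state;
    int callback_saw_finished;  /* 1 if the callback ran after FINISHED was set */
};

typedef void (*model_commit_cb)(struct model_transaction *);

/* Example callback: records which state it observed. */
static void model_record_cb(struct model_transaction *t)
{
    t->callback_saw_finished = (t->state == MODEL_FINISHED);
}

/* Final step of a commit in this model. */
static void model_commit(struct model_transaction *t, model_commit_cb cb)
{
    assert(t->state == MODEL_COMMITTING);

    if (cb != NULL)
        cb(t);                  /* transaction cannot be released yet */

    t->state = MODEL_FINISHED;  /* only now may other tasks free it */
}
```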