<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/fs/nfs/write.c, branch v3.15-rc7</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>nfs: add memory barriers around NFS_INO_INVALID_DATA and NFS_INO_INVALIDATING</title>
<updated>2014-01-28T19:48:18+00:00</updated>
<author>
<name>Jeff Layton</name>
<email>jlayton@redhat.com</email>
</author>
<published>2014-01-28T18:47:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=4db72b40fdbc706f8957e9773ae73b1574b8c694'/>
<id>4db72b40fdbc706f8957e9773ae73b1574b8c694</id>
<content type='text'>
If the setting of NFS_INO_INVALIDATING gets reordered to before the
clearing of NFS_INO_INVALID_DATA, then another task may hit a race
window where both appear to be clear, even though the inode's pages are
still in need of invalidation. Fix this by adding the appropriate memory
barriers.

Signed-off-by: Jeff Layton &lt;jlayton@redhat.com&gt;
Signed-off-by: Trond Myklebust &lt;trond.myklebust@primarydata.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
If the setting of NFS_INO_INVALIDATING gets reordered to before the
clearing of NFS_INO_INVALID_DATA, then another task may hit a race
window where both appear to be clear, even though the inode's pages are
still in need of invalidation. Fix this by adding the appropriate memory
barriers.

Signed-off-by: Jeff Layton &lt;jlayton@redhat.com&gt;
Signed-off-by: Trond Myklebust &lt;trond.myklebust@primarydata.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>NFS: fix the handling of NFS_INO_INVALID_DATA flag in nfs_revalidate_mapping</title>
<updated>2014-01-27T20:35:56+00:00</updated>
<author>
<name>Jeff Layton</name>
<email>jlayton@redhat.com</email>
</author>
<published>2014-01-27T18:46:15+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=d529ef83c355f97027ff85298a9709fe06216a66'/>
<id>d529ef83c355f97027ff85298a9709fe06216a66</id>
<content type='text'>
There is a possible race in how the nfs_invalidate_mapping function is
handled.  Currently, we go and invalidate the pages in the file and then
clear NFS_INO_INVALID_DATA.

The problem is that it's possible for a stale page to creep into the
mapping after the page was invalidated (i.e., via readahead). If another
writer comes along and sets the flag after that happens but before
invalidate_inode_pages2 returns then we could clear the flag
without the cache having been properly invalidated.

So, we must clear the flag first and then invalidate the pages. Doing
this however, opens another race:

It's possible to have two concurrent read() calls that end up in
nfs_revalidate_mapping at the same time. The first one clears the
NFS_INO_INVALID_DATA flag and then goes to call nfs_invalidate_mapping.

Just before calling that though, the other task races in, checks the
flag and finds it cleared. At that point, it trusts that the mapping is
good and gets the lock on the page, allowing the read() to be satisfied
from the cache even though the data is no longer valid.

These effects are easily manifested by running diotest3 from the LTP
test suite on NFS. That program does a series of DIO writes and buffered
reads. The operations are serialized and page-aligned but the existing
code fails the test since it occasionally allows a read to come out of
the cache incorrectly. While mixing direct and buffered I/O isn't
recommended, I believe it's possible to hit this in other ways that just
use buffered I/O, though that situation is much harder to reproduce.

The problem is that the checking/clearing of that flag and the
invalidation of the mapping really need to be atomic. Fix this by
serializing concurrent invalidations with a bitlock.

At the same time, we also need to allow other places that check
NFS_INO_INVALID_DATA to check whether we might be in the middle of
invalidating the file, so fix up a couple of places that do that
to look for the new NFS_INO_INVALIDATING flag.

Doing this requires us to be careful not to set the bitlock
unnecessarily, so this code only does that if it believes it will
be doing an invalidation.

Signed-off-by: Jeff Layton &lt;jlayton@redhat.com&gt;
Signed-off-by: Trond Myklebust &lt;trond.myklebust@primarydata.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
There is a possible race in how the nfs_invalidate_mapping function is
handled.  Currently, we go and invalidate the pages in the file and then
clear NFS_INO_INVALID_DATA.

The problem is that it's possible for a stale page to creep into the
mapping after the page was invalidated (i.e., via readahead). If another
writer comes along and sets the flag after that happens but before
invalidate_inode_pages2 returns then we could clear the flag
without the cache having been properly invalidated.

So, we must clear the flag first and then invalidate the pages. Doing
this however, opens another race:

It's possible to have two concurrent read() calls that end up in
nfs_revalidate_mapping at the same time. The first one clears the
NFS_INO_INVALID_DATA flag and then goes to call nfs_invalidate_mapping.

Just before calling that though, the other task races in, checks the
flag and finds it cleared. At that point, it trusts that the mapping is
good and gets the lock on the page, allowing the read() to be satisfied
from the cache even though the data is no longer valid.

These effects are easily manifested by running diotest3 from the LTP
test suite on NFS. That program does a series of DIO writes and buffered
reads. The operations are serialized and page-aligned but the existing
code fails the test since it occasionally allows a read to come out of
the cache incorrectly. While mixing direct and buffered I/O isn't
recommended, I believe it's possible to hit this in other ways that just
use buffered I/O, though that situation is much harder to reproduce.

The problem is that the checking/clearing of that flag and the
invalidation of the mapping really need to be atomic. Fix this by
serializing concurrent invalidations with a bitlock.

At the same time, we also need to allow other places that check
NFS_INO_INVALID_DATA to check whether we might be in the middle of
invalidating the file, so fix up a couple of places that do that
to look for the new NFS_INO_INVALIDATING flag.

Doing this requires us to be careful not to set the bitlock
unnecessarily, so this code only does that if it believes it will
be doing an invalidation.

Signed-off-by: Jeff Layton &lt;jlayton@redhat.com&gt;
Signed-off-by: Trond Myklebust &lt;trond.myklebust@primarydata.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>nfs: always make sure page is up-to-date before extending a write to cover the entire page</title>
<updated>2014-01-17T20:37:15+00:00</updated>
<author>
<name>Scott Mayhew</name>
<email>smayhew@redhat.com</email>
</author>
<published>2014-01-17T20:12:05+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=263b4509ec4d47e0da3e753f85a39ea12d1eff24'/>
<id>263b4509ec4d47e0da3e753f85a39ea12d1eff24</id>
<content type='text'>
We should always make sure the cached page is up-to-date when we're
determining whether we can extend a write to cover the full page -- even
if we've received a write delegation from the server.

Commit c7559663 added logic to skip this check if we have a write
delegation, which can lead to data corruption such as the following
scenario if client B receives a write delegation from the NFS server:

Client A:
    # echo 123456789 &gt; /mnt/file

Client B:
    # echo abcdefghi &gt;&gt; /mnt/file
    # cat /mnt/file
    0�D0�abcdefghi

Just because we hold a write delegation doesn't mean that we've read in
the entire page contents.

Cc: &lt;stable@vger.kernel.org&gt; # v3.11+
Signed-off-by: Scott Mayhew &lt;smayhew@redhat.com&gt;
Signed-off-by: Trond Myklebust &lt;trond.myklebust@primarydata.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
We should always make sure the cached page is up-to-date when we're
determining whether we can extend a write to cover the full page -- even
if we've received a write delegation from the server.

Commit c7559663 added logic to skip this check if we have a write
delegation, which can lead to data corruption such as the following
scenario if client B receives a write delegation from the NFS server:

Client A:
    # echo 123456789 &gt; /mnt/file

Client B:
    # echo abcdefghi &gt;&gt; /mnt/file
    # cat /mnt/file
    0�D0�abcdefghi

Just because we hold a write delegation doesn't mean that we've read in
the entire page contents.

Cc: &lt;stable@vger.kernel.org&gt; # v3.11+
Signed-off-by: Scott Mayhew &lt;smayhew@redhat.com&gt;
Signed-off-by: Trond Myklebust &lt;trond.myklebust@primarydata.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>NFS: dprintk() should not print negative fileids and inode numbers</title>
<updated>2014-01-05T20:51:23+00:00</updated>
<author>
<name>Niels de Vos</name>
<email>ndevos@redhat.com</email>
</author>
<published>2013-12-17T17:20:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=1e8968c5b0582392d5f132422f581e3ebc24e627'/>
<id>1e8968c5b0582392d5f132422f581e3ebc24e627</id>
<content type='text'>
A fileid in NFS is a uint64. There are some occurrences where dprintk()
outputs a signed fileid. This leads to confusion and more difficult to
read debugging (negative fileids matching positive inode numbers).

Signed-off-by: Niels de Vos &lt;ndevos@redhat.com&gt;
CC: Santosh Pradhan &lt;spradhan@redhat.com&gt;
Signed-off-by: Trond Myklebust &lt;trond.myklebust@primarydata.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
A fileid in NFS is a uint64. There are some occurrences where dprintk()
outputs a signed fileid. This leads to confusion and more difficult to
read debugging (negative fileids matching positive inode numbers).

Signed-off-by: Niels de Vos &lt;ndevos@redhat.com&gt;
CC: Santosh Pradhan &lt;spradhan@redhat.com&gt;
Signed-off-by: Trond Myklebust &lt;trond.myklebust@primarydata.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>nfs: use %p[dD] instead of open-coded (and often racy) equivalents</title>
<updated>2013-10-25T03:34:50+00:00</updated>
<author>
<name>Al Viro</name>
<email>viro@zeniv.linux.org.uk</email>
</author>
<published>2013-09-16T14:53:17+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=6de1472f1a4a3bd912f515f29d3cf52a65a4c718'/>
<id>6de1472f1a4a3bd912f515f29d3cf52a65a4c718</id>
<content type='text'>
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>NFS: Don't check lock owner compatibility in writes unless file is locked</title>
<updated>2013-09-05T22:11:42+00:00</updated>
<author>
<name>Trond Myklebust</name>
<email>Trond.Myklebust@netapp.com</email>
</author>
<published>2013-09-05T19:52:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=0f1d26055068bbc66751d1974ecc6f0398b3ac67'/>
<id>0f1d26055068bbc66751d1974ecc6f0398b3ac67</id>
<content type='text'>
If we're doing buffered writes, and there is no file locking involved,
then we don't have to worry about whether or not the lock owner information
is identical.
By relaxing this check, we ensure that fork()ed child processes can write
to a page without having to first sync dirty data that was written
by the parent to disk.

Reported-by: Quentin Barnes &lt;qbarnes@gmail.com&gt;
Signed-off-by: Trond Myklebust &lt;Trond.Myklebust@netapp.com&gt;
Tested-by: Quentin Barnes &lt;qbarnes@gmail.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
If we're doing buffered writes, and there is no file locking involved,
then we don't have to worry about whether or not the lock owner information
is identical.
By relaxing this check, we ensure that fork()ed child processes can write
to a page without having to first sync dirty data that was written
by the parent to disk.

Reported-by: Quentin Barnes &lt;qbarnes@gmail.com&gt;
Signed-off-by: Trond Myklebust &lt;Trond.Myklebust@netapp.com&gt;
Tested-by: Quentin Barnes &lt;qbarnes@gmail.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>nfs4.1: Add SP4_MACH_CRED write and commit support</title>
<updated>2013-09-05T14:50:45+00:00</updated>
<author>
<name>Weston Andros Adamson</name>
<email>dros@netapp.com</email>
</author>
<published>2013-08-13T20:37:37+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=8c21c62c4452f4e66c3dac9b3f6b74474fad3e08'/>
<id>8c21c62c4452f4e66c3dac9b3f6b74474fad3e08</id>
<content type='text'>
WRITE and COMMIT can use the machine credential.

If WRITE is supported and COMMIT is not, make all (mach cred) writes FILE_SYNC4.

Signed-off-by: Weston Andros Adamson &lt;dros@netapp.com&gt;
Signed-off-by: Trond Myklebust &lt;Trond.Myklebust@netapp.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
WRITE and COMMIT can use the machine credential.

If WRITE is supported and COMMIT is not, make all (mach cred) writes FILE_SYNC4.

Signed-off-by: Weston Andros Adamson &lt;dros@netapp.com&gt;
Signed-off-by: Trond Myklebust &lt;Trond.Myklebust@netapp.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>NFSv4: Don't try to recover NFSv4 locks when they are lost.</title>
<updated>2013-09-04T16:26:32+00:00</updated>
<author>
<name>NeilBrown</name>
<email>neilb@suse.de</email>
</author>
<published>2013-09-04T07:04:49+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=ef1820f9be27b6ad158f433ab38002ab8131db4d'/>
<id>ef1820f9be27b6ad158f433ab38002ab8131db4d</id>
<content type='text'>
When an NFSv4 client loses contact with the server it can lose any
locks that it holds.

Currently when it reconnects to the server it simply tries to reclaim
those locks.  This might succeed even though some other client has
held and released a lock in the mean time.  So the first client might
think the file is unchanged, but it isn't.  This isn't good.

If, when recovery happens, the locks cannot be claimed because some
other client still holds the lock, then we get a message in the kernel
logs, but the client can still write.  So two clients can both think
they have a lock and can both write at the same time.  This is equally
not good.

There was a patch a while ago
  http://comments.gmane.org/gmane.linux.nfs/41917

which tried to address some of this, but it didn't seem to go
anywhere.  That patch would also send a signal to the process.  That
might be useful but for now this patch just causes writes to fail.

For NFSv4 (unlike v2/v3) there is a strong link between the lock and
the write request so we can fairly easily fail any IO of the lock is
gone.  While some applications might not expect this, it is still
safer than allowing the write to succeed.

Because this is a fairly big change in behaviour a module parameter,
"recover_locks", is introduced which defaults to true (the current
behaviour) but can be set to "false" to tell the client not to try to
recover things that were lost.

Signed-off-by: NeilBrown &lt;neilb@suse.de&gt;
Signed-off-by: Trond Myklebust &lt;Trond.Myklebust@netapp.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
When an NFSv4 client loses contact with the server it can lose any
locks that it holds.

Currently when it reconnects to the server it simply tries to reclaim
those locks.  This might succeed even though some other client has
held and released a lock in the mean time.  So the first client might
think the file is unchanged, but it isn't.  This isn't good.

If, when recovery happens, the locks cannot be claimed because some
other client still holds the lock, then we get a message in the kernel
logs, but the client can still write.  So two clients can both think
they have a lock and can both write at the same time.  This is equally
not good.

There was a patch a while ago
  http://comments.gmane.org/gmane.linux.nfs/41917

which tried to address some of this, but it didn't seem to go
anywhere.  That patch would also send a signal to the process.  That
might be useful but for now this patch just causes writes to fail.

For NFSv4 (unlike v2/v3) there is a strong link between the lock and
the write request so we can fairly easily fail any IO of the lock is
gone.  While some applications might not expect this, it is still
safer than allowing the write to succeed.

Because this is a fairly big change in behaviour a module parameter,
"recover_locks", is introduced which defaults to true (the current
behaviour) but can be set to "false" to tell the client not to try to
recover things that were lost.

Signed-off-by: NeilBrown &lt;neilb@suse.de&gt;
Signed-off-by: Trond Myklebust &lt;Trond.Myklebust@netapp.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>NFS avoid expired credential keys for buffered writes</title>
<updated>2013-09-03T19:25:09+00:00</updated>
<author>
<name>Andy Adamson</name>
<email>andros@netapp.com</email>
</author>
<published>2013-08-14T15:59:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=dc24826bfca8d788d05f625208f06d90be5560b3'/>
<id>dc24826bfca8d788d05f625208f06d90be5560b3</id>
<content type='text'>
We must avoid buffering a WRITE that is using a credential key (e.g. a GSS
context key) that is about to expire or has expired.  We currently will
paint ourselves into a corner by returning success to the applciation
for such a buffered WRITE, only to discover that we do not have permission when
we attempt to flush the WRITE (and potentially associated COMMIT) to disk.

Use the RPC layer credential key timeout and expire routines which use a
a watermark, gss_key_expire_timeo. We test the key in nfs_file_write.

If a WRITE is using a credential with a key that will expire within
watermark seconds, flush the inode in nfs_write_end and send only
NFS_FILE_SYNC WRITEs by adding nfs_ctx_key_to_expire to nfs_need_sync_write.
Note that this results in single page NFS_FILE_SYNC WRITEs.

Signed-off-by: Andy Adamson &lt;andros@netapp.com&gt;
[Trond: removed a pr_warn_ratelimited() for now]
Signed-off-by: Trond Myklebust &lt;Trond.Myklebust@netapp.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
We must avoid buffering a WRITE that is using a credential key (e.g. a GSS
context key) that is about to expire or has expired.  We currently will
paint ourselves into a corner by returning success to the applciation
for such a buffered WRITE, only to discover that we do not have permission when
we attempt to flush the WRITE (and potentially associated COMMIT) to disk.

Use the RPC layer credential key timeout and expire routines which use a
a watermark, gss_key_expire_timeo. We test the key in nfs_file_write.

If a WRITE is using a credential with a key that will expire within
watermark seconds, flush the inode in nfs_write_end and send only
NFS_FILE_SYNC WRITEs by adding nfs_ctx_key_to_expire to nfs_need_sync_write.
Note that this results in single page NFS_FILE_SYNC WRITEs.

Signed-off-by: Andy Adamson &lt;andros@netapp.com&gt;
[Trond: removed a pr_warn_ratelimited() for now]
Signed-off-by: Trond Myklebust &lt;Trond.Myklebust@netapp.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>NFS: Add event tracing for generic NFS events</title>
<updated>2013-08-22T12:58:17+00:00</updated>
<author>
<name>Trond Myklebust</name>
<email>Trond.Myklebust@netapp.com</email>
</author>
<published>2013-08-19T22:59:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=f4ce1299b329e96bb247c95c4fee8809827d6931'/>
<id>f4ce1299b329e96bb247c95c4fee8809827d6931</id>
<content type='text'>
Add tracepoints for inode attribute updates, attribute revalidation,
writeback start/end fsync start/end, attribute change start/end,
permission check start/end.

The intention is to enable performance tracing using 'perf'as well as
improving debugging.

Signed-off-by: Trond Myklebust &lt;Trond.Myklebust@netapp.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Add tracepoints for inode attribute updates, attribute revalidation,
writeback start/end fsync start/end, attribute change start/end,
permission check start/end.

The intention is to enable performance tracing using 'perf'as well as
improving debugging.

Signed-off-by: Trond Myklebust &lt;Trond.Myklebust@netapp.com&gt;
</pre>
</div>
</content>
</entry>
</feed>
