<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/fs/file.c, branch v5.5</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>Revert "fs: remove ksys_dup()"</title>
<updated>2020-01-03T00:15:33+00:00</updated>
<author>
<name>Dominik Brodowski</name>
<email>linux@dominikbrodowski.net</email>
</author>
<published>2020-01-01T19:05:03+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=74f1a299107b9e1a563831a4ba85f769ab577164'/>
<id>74f1a299107b9e1a563831a4ba85f769ab577164</id>
<content type='text'>
This reverts commit 8243186f0cc7 ("fs: remove ksys_dup()") and the
subsequent fix for it in commit 2d3145f8d280 ("early init: fix error
handling when opening /dev/console").

Trying to use filp_open() and f_dupfd() instead of pseudo-syscalls
caused more trouble than what is worth it: it requires accessing vfs
internals and it turns out there were other bugs in it too.

In particular, the file reference counting was wrong - because unlike
the original "open+2*dup" sequence it used "filp_open+3*f_dupfd" and
thus had an extra leaked file reference.

That in turn then caused odd problems with Androidx86 long after boot
becaue of how the extra reference to the console kept the session active
even after all file descriptors had been closed.

Reported-by: youling 257 &lt;youling257@gmail.com&gt;
Cc: Arvind Sankar &lt;nivedita@alum.mit.edu&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Dominik Brodowski &lt;linux@dominikbrodowski.net&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This reverts commit 8243186f0cc7 ("fs: remove ksys_dup()") and the
subsequent fix for it in commit 2d3145f8d280 ("early init: fix error
handling when opening /dev/console").

Trying to use filp_open() and f_dupfd() instead of pseudo-syscalls
caused more trouble than what is worth it: it requires accessing vfs
internals and it turns out there were other bugs in it too.

In particular, the file reference counting was wrong - because unlike
the original "open+2*dup" sequence it used "filp_open+3*f_dupfd" and
thus had an extra leaked file reference.

That in turn then caused odd problems with Androidx86 long after boot
becaue of how the extra reference to the console kept the session active
even after all file descriptors had been closed.

Reported-by: youling 257 &lt;youling257@gmail.com&gt;
Cc: Arvind Sankar &lt;nivedita@alum.mit.edu&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Dominik Brodowski &lt;linux@dominikbrodowski.net&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>fs: remove ksys_dup()</title>
<updated>2019-12-12T18:00:36+00:00</updated>
<author>
<name>Dominik Brodowski</name>
<email>linux@dominikbrodowski.net</email>
</author>
<published>2018-10-23T14:24:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=8243186f0cc7c57cf9d6a110cd7315c44e3e0be8'/>
<id>8243186f0cc7c57cf9d6a110cd7315c44e3e0be8</id>
<content type='text'>
ksys_dup() is used only at one place in the kernel, namely to duplicate
fd 0 of /dev/console to stdout and stderr. The same functionality can be
achieved by using functions already available within the kernel namespace.

Signed-off-by: Dominik Brodowski &lt;linux@dominikbrodowski.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
ksys_dup() is used only at one place in the kernel, namely to duplicate
fd 0 of /dev/console to stdout and stderr. The same functionality can be
achieved by using functions already available within the kernel namespace.

Signed-off-by: Dominik Brodowski &lt;linux@dominikbrodowski.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Revert "vfs: properly and reliably lock f_pos in fdget_pos()"</title>
<updated>2019-11-26T19:34:06+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2019-11-26T19:34:06+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=2be7d348fe924f0c5583c6a805bd42cecda93104'/>
<id>2be7d348fe924f0c5583c6a805bd42cecda93104</id>
<content type='text'>
This reverts commit 0be0ee71816b2b6725e2b4f32ad6726c9d729777.

I was hoping it would be benign to switch over entirely to FMODE_STREAM,
and we'd have just a couple of small fixups we'd need, but it looks like
we're not quite there yet.

While it worked fine on both my desktop and laptop, they are fairly
similar in other respects, and run mostly the same loads.  Kenneth
Crudup reports that it seems to break both his vmware installation and
the KDE upower service.  In both cases apparently leading to timeouts
due to waitinmg for the f_pos lock.

There are a number of character devices in particular that definitely
want stream-like behavior, but that currently don't get marked as
streams, and as a result get the exclusion between concurrent
read()/write() on the same file descriptor.  Which doesn't work well for
them.

The most obvious example if this is /dev/console and /dev/tty, which use
console_fops and tty_fops respectively (and ptmx_fops for the pty master
side).  It may be that it's just this that causes problems, but we
clearly weren't ready yet.

Because there's a number of other likely common cases that don't have
llseek implementations and would seem to act as stream devices:

  /dev/fuse		(fuse_dev_operations)
  /dev/mcelog		(mce_chrdev_ops)
  /dev/mei0		(mei_fops)
  /dev/net/tun		(tun_fops)
  /dev/nvme0		(nvme_dev_fops)
  /dev/tpm0		(tpm_fops)
  /proc/self/ns/mnt	(ns_file_operations)
  /dev/snd/pcm*		(snd_pcm_f_ops[])

and while some of these could be trivially automatically detected by the
vfs layer when the character device is opened by just noticing that they
have no read or write operations either, it often isn't that obvious.

Some character devices most definitely do use the file position, even if
they don't allow seeking: the firmware update code, for example, uses
simple_read_from_buffer() that does use f_pos, but doesn't allow seeking
back and forth.

We'll revisit this when there's a better way to detect the problem and
fix it (possibly with a coccinelle script to do more of the FMODE_STREAM
annotations).

Reported-by: Kenneth R. Crudup &lt;kenny@panix.com&gt;
Cc: Kirill Smelkov &lt;kirr@nexedi.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This reverts commit 0be0ee71816b2b6725e2b4f32ad6726c9d729777.

I was hoping it would be benign to switch over entirely to FMODE_STREAM,
and we'd have just a couple of small fixups we'd need, but it looks like
we're not quite there yet.

While it worked fine on both my desktop and laptop, they are fairly
similar in other respects, and run mostly the same loads.  Kenneth
Crudup reports that it seems to break both his vmware installation and
the KDE upower service.  In both cases apparently leading to timeouts
due to waitinmg for the f_pos lock.

There are a number of character devices in particular that definitely
want stream-like behavior, but that currently don't get marked as
streams, and as a result get the exclusion between concurrent
read()/write() on the same file descriptor.  Which doesn't work well for
them.

The most obvious example if this is /dev/console and /dev/tty, which use
console_fops and tty_fops respectively (and ptmx_fops for the pty master
side).  It may be that it's just this that causes problems, but we
clearly weren't ready yet.

Because there's a number of other likely common cases that don't have
llseek implementations and would seem to act as stream devices:

  /dev/fuse		(fuse_dev_operations)
  /dev/mcelog		(mce_chrdev_ops)
  /dev/mei0		(mei_fops)
  /dev/net/tun		(tun_fops)
  /dev/nvme0		(nvme_dev_fops)
  /dev/tpm0		(tpm_fops)
  /proc/self/ns/mnt	(ns_file_operations)
  /dev/snd/pcm*		(snd_pcm_f_ops[])

and while some of these could be trivially automatically detected by the
vfs layer when the character device is opened by just noticing that they
have no read or write operations either, it often isn't that obvious.

Some character devices most definitely do use the file position, even if
they don't allow seeking: the firmware update code, for example, uses
simple_read_from_buffer() that does use f_pos, but doesn't allow seeking
back and forth.

We'll revisit this when there's a better way to detect the problem and
fix it (possibly with a coccinelle script to do more of the FMODE_STREAM
annotations).

Reported-by: Kenneth R. Crudup &lt;kenny@panix.com&gt;
Cc: Kirill Smelkov &lt;kirr@nexedi.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>vfs: properly and reliably lock f_pos in fdget_pos()</title>
<updated>2019-11-25T18:01:53+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2019-11-11T23:51:03+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=0be0ee71816b2b6725e2b4f32ad6726c9d729777'/>
<id>0be0ee71816b2b6725e2b4f32ad6726c9d729777</id>
<content type='text'>
fdget_pos() is used by file operations that will read and update f_pos:
things like "read()", "write()" and "lseek()" (but not, for example,
"pread()/pwrite" that get their file positions elsewhere).

However, it had two separate escape clauses for this, because not
everybody wants or needs serialization of the file position.

The first and most obvious case is the "file descriptor doesn't have a
position at all", ie a stream-like file.  Except we didn't actually use
FMODE_STREAM, but instead used FMODE_ATOMIC_POS.  The reason for that
was that FMODE_STREAM didn't exist back in the days, but also that we
didn't want to mark all the special cases, so we only marked the ones
that _required_ position atomicity according to POSIX - regular files
and directories.

The case one was intentionally lazy, but now that we _do_ have
FMODE_STREAM we could and should just use it.  With the change to use
FMODE_STREAM, there are no remaining uses for FMODE_ATOMIC_POS, and all
the code to set it is deleted.

Any cases where we don't want the serialization because the driver (or
subsystem) doesn't use the file position should just be updated to do
"stream_open()".  We've done that for all the obvious and common
situations, we may need a few more.  Quoting Kirill Smelkov in the
original FMODE_STREAM thread (see link below for full email):

 "And I appreciate if people could help at least somehow with "getting
  rid of mixed case entirely" (i.e. always lock f_pos_lock on
  !FMODE_STREAM), because this transition starts to diverge from my
  particular use-case too far. To me it makes sense to do that
  transition as follows:

   - convert nonseekable_open -&gt; stream_open via stream_open.cocci;
   - audit other nonseekable_open calls and convert left users that
     truly don't depend on position to stream_open;
   - extend stream_open.cocci to analyze alloc_file_pseudo as well (this
     will cover pipes and sockets), or maybe convert pipes and sockets
     to FMODE_STREAM manually;
   - extend stream_open.cocci to analyze file_operations that use
     no_llseek or noop_llseek, but do not use nonseekable_open or
     alloc_file_pseudo. This might find files that have stream semantic
     but are opened differently;
   - extend stream_open.cocci to analyze file_operations whose
     .read/.write do not use ppos at all (independently of how file was
     opened);
   - ...
   - after that remove FMODE_ATOMIC_POS and always take f_pos_lock if
     !FMODE_STREAM;
   - gather bug reports for deadlocked read/write and convert missed
     cases to FMODE_STREAM, probably extending stream_open.cocci along
     the road to catch similar cases

  i.e. always take f_pos_lock unless a file is explicitly marked as
  being stream, and try to find and cover all files that are streams"

We have not done the "extend stream_open.cocci to analyze
alloc_file_pseudo" as well, but the previous commit did manually handle
the case of pipes and sockets.

The other case where we can avoid locking f_pos is the "this file
descriptor only has a single user and it is us, and thus there is no
need to lock it".

The second test was correct, although a bit subtle and worth just
re-iterating here.  There are two kinds of other sources of references
to the same file descriptor: file descriptors that have been explicitly
shared across fork() or with dup(), and file tables having elevated
reference counts due to threading (or explicit file sharing with
clone()).

The first case would have incremented the file count explicitly, and in
the second case the previous __fdget() would have incremented it for us
and set the FDPUT_FPUT flag.

But in both cases the file count would be greater than one, so the
"file_count(file) &gt; 1" test catches both situations.  Also note that if
file_count is 1, that also means that no other thread can have access to
the file table, so there also cannot be races with concurrent calls to
dup()/fork()/clone() that would increment the file count any other way.

Link: https://lore.kernel.org/linux-fsdevel/20190413184404.GA13490@deco.navytux.spb.ru
Cc: Kirill Smelkov &lt;kirr@nexedi.com&gt;
Cc: Eic Dumazet &lt;edumazet@google.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Alan Stern &lt;stern@rowland.harvard.edu&gt;
Cc: Marco Elver &lt;elver@google.com&gt;
Cc: Andrea Parri &lt;parri.andrea@gmail.com&gt;
Cc: Paul McKenney &lt;paulmck@kernel.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
fdget_pos() is used by file operations that will read and update f_pos:
things like "read()", "write()" and "lseek()" (but not, for example,
"pread()/pwrite" that get their file positions elsewhere).

However, it had two separate escape clauses for this, because not
everybody wants or needs serialization of the file position.

The first and most obvious case is the "file descriptor doesn't have a
position at all", ie a stream-like file.  Except we didn't actually use
FMODE_STREAM, but instead used FMODE_ATOMIC_POS.  The reason for that
was that FMODE_STREAM didn't exist back in the days, but also that we
didn't want to mark all the special cases, so we only marked the ones
that _required_ position atomicity according to POSIX - regular files
and directories.

The case one was intentionally lazy, but now that we _do_ have
FMODE_STREAM we could and should just use it.  With the change to use
FMODE_STREAM, there are no remaining uses for FMODE_ATOMIC_POS, and all
the code to set it is deleted.

Any cases where we don't want the serialization because the driver (or
subsystem) doesn't use the file position should just be updated to do
"stream_open()".  We've done that for all the obvious and common
situations, we may need a few more.  Quoting Kirill Smelkov in the
original FMODE_STREAM thread (see link below for full email):

 "And I appreciate if people could help at least somehow with "getting
  rid of mixed case entirely" (i.e. always lock f_pos_lock on
  !FMODE_STREAM), because this transition starts to diverge from my
  particular use-case too far. To me it makes sense to do that
  transition as follows:

   - convert nonseekable_open -&gt; stream_open via stream_open.cocci;
   - audit other nonseekable_open calls and convert left users that
     truly don't depend on position to stream_open;
   - extend stream_open.cocci to analyze alloc_file_pseudo as well (this
     will cover pipes and sockets), or maybe convert pipes and sockets
     to FMODE_STREAM manually;
   - extend stream_open.cocci to analyze file_operations that use
     no_llseek or noop_llseek, but do not use nonseekable_open or
     alloc_file_pseudo. This might find files that have stream semantic
     but are opened differently;
   - extend stream_open.cocci to analyze file_operations whose
     .read/.write do not use ppos at all (independently of how file was
     opened);
   - ...
   - after that remove FMODE_ATOMIC_POS and always take f_pos_lock if
     !FMODE_STREAM;
   - gather bug reports for deadlocked read/write and convert missed
     cases to FMODE_STREAM, probably extending stream_open.cocci along
     the road to catch similar cases

  i.e. always take f_pos_lock unless a file is explicitly marked as
  being stream, and try to find and cover all files that are streams"

We have not done the "extend stream_open.cocci to analyze
alloc_file_pseudo" as well, but the previous commit did manually handle
the case of pipes and sockets.

The other case where we can avoid locking f_pos is the "this file
descriptor only has a single user and it is us, and thus there is no
need to lock it".

The second test was correct, although a bit subtle and worth just
re-iterating here.  There are two kinds of other sources of references
to the same file descriptor: file descriptors that have been explicitly
shared across fork() or with dup(), and file tables having elevated
reference counts due to threading (or explicit file sharing with
clone()).

The first case would have incremented the file count explicitly, and in
the second case the previous __fdget() would have incremented it for us
and set the FDPUT_FPUT flag.

But in both cases the file count would be greater than one, so the
"file_count(file) &gt; 1" test catches both situations.  Also note that if
file_count is 1, that also means that no other thread can have access to
the file table, so there also cannot be races with concurrent calls to
dup()/fork()/clone() that would increment the file count any other way.

Link: https://lore.kernel.org/linux-fsdevel/20190413184404.GA13490@deco.navytux.spb.ru
Cc: Kirill Smelkov &lt;kirr@nexedi.com&gt;
Cc: Eic Dumazet &lt;edumazet@google.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Alan Stern &lt;stern@rowland.harvard.edu&gt;
Cc: Marco Elver &lt;elver@google.com&gt;
Cc: Andrea Parri &lt;parri.andrea@gmail.com&gt;
Cc: Paul McKenney &lt;paulmck@kernel.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block</title>
<updated>2019-03-08T22:48:40+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2019-03-08T22:48:40+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=38e7571c07be01f9f19b355a9306a4e3d5cb0f5b'/>
<id>38e7571c07be01f9f19b355a9306a4e3d5cb0f5b</id>
<content type='text'>
Pull io_uring IO interface from Jens Axboe:
 "Second attempt at adding the io_uring interface.

  Since the first one, we've added basic unit testing of the three
  system calls, that resides in liburing like the other unit tests that
  we have so far. It'll take a while to get full coverage of it, but
  we're working towards it. I've also added two basic test programs to
  tools/io_uring. One uses the raw interface and has support for all the
  various features that io_uring supports outside of standard IO, like
  fixed files, fixed IO buffers, and polled IO. The other uses the
  liburing API, and is a simplified version of cp(1).

  This adds support for a new IO interface, io_uring.

  io_uring allows an application to communicate with the kernel through
  two rings, the submission queue (SQ) and completion queue (CQ) ring.
  This allows for very efficient handling of IOs, see the v5 posting for
  some basic numbers:

    https://lore.kernel.org/linux-block/20190116175003.17880-1-axboe@kernel.dk/

  Outside of just efficiency, the interface is also flexible and
  extendable, and allows for future use cases like the upcoming NVMe
  key-value store API, networked IO, and so on. It also supports async
  buffered IO, something that we've always failed to support in the
  kernel.

  Outside of basic IO features, it supports async polled IO as well.
  This particular feature has already been tested at Facebook months ago
  for flash storage boxes, with 25-33% improvements. It makes polled IO
  actually useful for real world use cases, where even basic flash sees
  a nice win in terms of efficiency, latency, and performance. These
  boxes were IOPS bound before, now they are not.

  This series adds three new system calls. One for setting up an
  io_uring instance (io_uring_setup(2)), one for submitting/completing
  IO (io_uring_enter(2)), and one for aux functions like registrating
  file sets, buffers, etc (io_uring_register(2)). Through the help of
  Arnd, I've coordinated the syscall numbers so merge on that front
  should be painless.

  Jon did a writeup of the interface a while back, which (except for
  minor details that have been tweaked) is still accurate. Find that
  here:

    https://lwn.net/Articles/776703/

  Huge thanks to Al Viro for helping getting the reference cycle code
  correct, and to Jann Horn for his extensive reviews focused on both
  security and bugs in general.

  There's a userspace library that provides basic functionality for
  applications that don't need or want to care about how to fiddle with
  the rings directly. It has helpers to allow applications to easily set
  up an io_uring instance, and submit/complete IO through it without
  knowing about the intricacies of the rings. It also includes man pages
  (thanks to Jeff Moyer), and will continue to grow support helper
  functions and features as time progresses. Find it here:

    git://git.kernel.dk/liburing

  Fio has full support for the raw interface, both in the form of an IO
  engine (io_uring), but also with a small test application (t/io_uring)
  that can exercise and benchmark the interface"

* tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block:
  io_uring: add a few test tools
  io_uring: allow workqueue item to handle multiple buffered requests
  io_uring: add support for IORING_OP_POLL
  io_uring: add io_kiocb ref count
  io_uring: add submission polling
  io_uring: add file set registration
  net: split out functions related to registering inflight socket files
  io_uring: add support for pre-mapped user IO buffers
  block: implement bio helper to add iter bvec pages to bio
  io_uring: batch io_kiocb allocation
  io_uring: use fget/fput_many() for file references
  fs: add fget_many() and fput_many()
  io_uring: support for IO polling
  io_uring: add fsync support
  Add io_uring IO interface
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull io_uring IO interface from Jens Axboe:
 "Second attempt at adding the io_uring interface.

  Since the first one, we've added basic unit testing of the three
  system calls, that resides in liburing like the other unit tests that
  we have so far. It'll take a while to get full coverage of it, but
  we're working towards it. I've also added two basic test programs to
  tools/io_uring. One uses the raw interface and has support for all the
  various features that io_uring supports outside of standard IO, like
  fixed files, fixed IO buffers, and polled IO. The other uses the
  liburing API, and is a simplified version of cp(1).

  This adds support for a new IO interface, io_uring.

  io_uring allows an application to communicate with the kernel through
  two rings, the submission queue (SQ) and completion queue (CQ) ring.
  This allows for very efficient handling of IOs, see the v5 posting for
  some basic numbers:

    https://lore.kernel.org/linux-block/20190116175003.17880-1-axboe@kernel.dk/

  Outside of just efficiency, the interface is also flexible and
  extendable, and allows for future use cases like the upcoming NVMe
  key-value store API, networked IO, and so on. It also supports async
  buffered IO, something that we've always failed to support in the
  kernel.

  Outside of basic IO features, it supports async polled IO as well.
  This particular feature has already been tested at Facebook months ago
  for flash storage boxes, with 25-33% improvements. It makes polled IO
  actually useful for real world use cases, where even basic flash sees
  a nice win in terms of efficiency, latency, and performance. These
  boxes were IOPS bound before, now they are not.

  This series adds three new system calls. One for setting up an
  io_uring instance (io_uring_setup(2)), one for submitting/completing
  IO (io_uring_enter(2)), and one for aux functions like registrating
  file sets, buffers, etc (io_uring_register(2)). Through the help of
  Arnd, I've coordinated the syscall numbers so merge on that front
  should be painless.

  Jon did a writeup of the interface a while back, which (except for
  minor details that have been tweaked) is still accurate. Find that
  here:

    https://lwn.net/Articles/776703/

  Huge thanks to Al Viro for helping getting the reference cycle code
  correct, and to Jann Horn for his extensive reviews focused on both
  security and bugs in general.

  There's a userspace library that provides basic functionality for
  applications that don't need or want to care about how to fiddle with
  the rings directly. It has helpers to allow applications to easily set
  up an io_uring instance, and submit/complete IO through it without
  knowing about the intricacies of the rings. It also includes man pages
  (thanks to Jeff Moyer), and will continue to grow support helper
  functions and features as time progresses. Find it here:

    git://git.kernel.dk/liburing

  Fio has full support for the raw interface, both in the form of an IO
  engine (io_uring), but also with a small test application (t/io_uring)
  that can exercise and benchmark the interface"

* tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block:
  io_uring: add a few test tools
  io_uring: allow workqueue item to handle multiple buffered requests
  io_uring: add support for IORING_OP_POLL
  io_uring: add io_kiocb ref count
  io_uring: add submission polling
  io_uring: add file set registration
  net: split out functions related to registering inflight socket files
  io_uring: add support for pre-mapped user IO buffers
  block: implement bio helper to add iter bvec pages to bio
  io_uring: batch io_kiocb allocation
  io_uring: use fget/fput_many() for file references
  fs: add fget_many() and fput_many()
  io_uring: support for IO polling
  io_uring: add fsync support
  Add io_uring IO interface
</pre>
</div>
</content>
</entry>
<entry>
<title>fs/file.c: initialize init_files.resize_wait</title>
<updated>2019-03-06T05:07:14+00:00</updated>
<author>
<name>Shuriyc Chu</name>
<email>sureeju@gmail.com</email>
</author>
<published>2019-03-05T23:41:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=5704a06810682683355624923547b41540e2801a'/>
<id>5704a06810682683355624923547b41540e2801a</id>
<content type='text'>
(Taken from https://bugzilla.kernel.org/show_bug.cgi?id=200647)

'get_unused_fd_flags' in kthread cause kernel crash.  It works fine on
4.1, but causes crash after get 64 fds.  It also cause crash on
ubuntu1404/1604/1804, centos7.5, and the crash messages are almost the
same.

The crash message on centos7.5 shows below:

  start fd 61
  start fd 62
  start fd 63
  BUG: unable to handle kernel NULL pointer dereference at           (null)
  IP: __wake_up_common+0x2e/0x90
  PGD 0
  Oops: 0000 [#1] SMP
  Modules linked in: test(OE) xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter devlink sunrpc kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd sg ppdev pcspkr virtio_balloon parport_pc parport i2c_piix4 joydev ip_tables xfs libcrc32c sr_mod cdrom sd_mod crc_t10dif crct10dif_generic ata_generic pata_acpi virtio_scsi virtio_console virtio_net cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm crct10dif_pclmul crct10dif_common crc32c_intel drm ata_piix serio_raw libata virtio_pci virtio_ring i2c_core
   virtio floppy dm_mirror dm_region_hash dm_log dm_mod
  CPU: 2 PID: 1820 Comm: test_fd Kdump: loaded Tainted: G           OE  ------------   3.10.0-862.3.3.el7.x86_64 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
  task: ffff8e92b9431fa0 ti: ffff8e94247a0000 task.ti: ffff8e94247a0000
  RIP: 0010:__wake_up_common+0x2e/0x90
  RSP: 0018:ffff8e94247a2d18  EFLAGS: 00010086
  RAX: 0000000000000000 RBX: ffffffff9d09daa0 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffffffff9d09daa0
  RBP: ffff8e94247a2d50 R08: 0000000000000000 R09: ffff8e92b95dfda8
  R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff9d09daa8
  R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000003
  FS:  0000000000000000(0000) GS:ffff8e9434e80000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000000 CR3: 000000017c686000 CR4: 00000000000207e0
  Call Trace:
    __wake_up+0x39/0x50
    expand_files+0x131/0x250
    __alloc_fd+0x47/0x170
    get_unused_fd_flags+0x30/0x40
    test_fd+0x12a/0x1c0 [test]
    kthread+0xd1/0xe0
    ret_from_fork_nospec_begin+0x21/0x21
  Code: 66 90 55 48 89 e5 41 57 41 89 f7 41 56 41 89 ce 41 55 41 54 49 89 fc 49 83 c4 08 53 48 83 ec 10 48 8b 47 08 89 55 cc 4c 89 45 d0 &lt;48&gt; 8b 08 49 39 c4 48 8d 78 e8 4c 8d 69 e8 75 08 eb 3b 4c 89 ef
  RIP   __wake_up_common+0x2e/0x90
   RSP &lt;ffff8e94247a2d18&gt;
  CR2: 0000000000000000

This issue exists since CentOS 7.5 3.10.0-862 and CentOS 7.4
(3.10.0-693.21.1 ) is ok.  Root cause: the item 'resize_wait' is not
initialized before being used.

Reported-by: Richard Zhang &lt;zhang.zijian@h3c.com&gt;
Reviewed-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
(Taken from https://bugzilla.kernel.org/show_bug.cgi?id=200647)

'get_unused_fd_flags' in kthread cause kernel crash.  It works fine on
4.1, but causes crash after get 64 fds.  It also cause crash on
ubuntu1404/1604/1804, centos7.5, and the crash messages are almost the
same.

The crash message on centos7.5 shows below:

  start fd 61
  start fd 62
  start fd 63
  BUG: unable to handle kernel NULL pointer dereference at           (null)
  IP: __wake_up_common+0x2e/0x90
  PGD 0
  Oops: 0000 [#1] SMP
  Modules linked in: test(OE) xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter devlink sunrpc kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd sg ppdev pcspkr virtio_balloon parport_pc parport i2c_piix4 joydev ip_tables xfs libcrc32c sr_mod cdrom sd_mod crc_t10dif crct10dif_generic ata_generic pata_acpi virtio_scsi virtio_console virtio_net cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm crct10dif_pclmul crct10dif_common crc32c_intel drm ata_piix serio_raw libata virtio_pci virtio_ring i2c_core
   virtio floppy dm_mirror dm_region_hash dm_log dm_mod
  CPU: 2 PID: 1820 Comm: test_fd Kdump: loaded Tainted: G           OE  ------------   3.10.0-862.3.3.el7.x86_64 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
  task: ffff8e92b9431fa0 ti: ffff8e94247a0000 task.ti: ffff8e94247a0000
  RIP: 0010:__wake_up_common+0x2e/0x90
  RSP: 0018:ffff8e94247a2d18  EFLAGS: 00010086
  RAX: 0000000000000000 RBX: ffffffff9d09daa0 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffffffff9d09daa0
  RBP: ffff8e94247a2d50 R08: 0000000000000000 R09: ffff8e92b95dfda8
  R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff9d09daa8
  R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000003
  FS:  0000000000000000(0000) GS:ffff8e9434e80000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000000 CR3: 000000017c686000 CR4: 00000000000207e0
  Call Trace:
    __wake_up+0x39/0x50
    expand_files+0x131/0x250
    __alloc_fd+0x47/0x170
    get_unused_fd_flags+0x30/0x40
    test_fd+0x12a/0x1c0 [test]
    kthread+0xd1/0xe0
    ret_from_fork_nospec_begin+0x21/0x21
  Code: 66 90 55 48 89 e5 41 57 41 89 f7 41 56 41 89 ce 41 55 41 54 49 89 fc 49 83 c4 08 53 48 83 ec 10 48 8b 47 08 89 55 cc 4c 89 45 d0 &lt;48&gt; 8b 08 49 39 c4 48 8d 78 e8 4c 8d 69 e8 75 08 eb 3b 4c 89 ef
  RIP   __wake_up_common+0x2e/0x90
   RSP &lt;ffff8e94247a2d18&gt;
  CR2: 0000000000000000

This issue exists since CentOS 7.5 3.10.0-862 and CentOS 7.4
(3.10.0-693.21.1 ) is ok.  Root cause: the item 'resize_wait' is not
initialized before being used.

Reported-by: Richard Zhang &lt;zhang.zijian@h3c.com&gt;
Reviewed-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>fs: add fget_many() and fput_many()</title>
<updated>2019-02-28T15:24:23+00:00</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@kernel.dk</email>
</author>
<published>2018-11-21T17:32:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=091141a42e15fe47ada737f3996b317072afcefb'/>
<id>091141a42e15fe47ada737f3996b317072afcefb</id>
<content type='text'>
Some uses cases repeatedly get and put references to the same file, but
the only exposed interface is doing these one at the time. As each of
these entail an atomic inc or dec on a shared structure, that cost can
add up.

Add fget_many(), which works just like fget(), except it takes an
argument for how many references to get on the file. Ditto fput_many(),
which can drop an arbitrary number of references to a file.

Reviewed-by: Hannes Reinecke &lt;hare@suse.com&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Some uses cases repeatedly get and put references to the same file, but
the only exposed interface is doing these one at the time. As each of
these entail an atomic inc or dec on a shared structure, that cost can
add up.

Add fget_many(), which works just like fget(), except it takes an
argument for how many references to get on the file. Ditto fput_many(),
which can drop an arbitrary number of references to a file.

Reviewed-by: Hannes Reinecke &lt;hare@suse.com&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge tag 'char-misc-4.21-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc</title>
<updated>2018-12-29T04:54:57+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2018-12-29T04:54:57+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=457fa3469a65a524be04412f5cd497fa3b11c9fd'/>
<id>457fa3469a65a524be04412f5cd497fa3b11c9fd</id>
<content type='text'>
Pull char/misc driver updates from Greg KH:
 "Here is the big set of char and misc driver patches for 4.21-rc1.

  Lots of different types of driver things in here, as this tree seems
  to be the "collection of various driver subsystems not big enough to
  have their own git tree" lately.

  Anyway, some highlights of the changes in here:

   - binderfs: is it a rule that all driver subsystems will eventually
     grow to have their own filesystem? Binder now has one to handle the
     use of it in containerized systems.

     This was discussed at the Plumbers conference a few months ago and
     knocked into mergable shape very fast by Christian Brauner. Who
     also has signed up to be another binder maintainer, showing a
     distinct lack of good judgement :)

   - binder updates and fixes

   - mei driver updates

   - fpga driver updates and additions

   - thunderbolt driver updates

   - soundwire driver updates

   - extcon driver updates

   - nvmem driver updates

   - hyper-v driver updates

   - coresight driver updates

   - pvpanic driver additions and reworking for more device support

   - lp driver updates. Yes really, it's _finally_ moved to the proper
     parallal port driver model, something I never thought I would see
     happen. Good stuff.

   - other tiny driver updates and fixes.

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'char-misc-4.21-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (116 commits)
  MAINTAINERS: add another Android binder maintainer
  intel_th: msu: Fix an off-by-one in attribute store
  stm class: Add a reference to the SyS-T document
  stm class: Fix a module refcount leak in policy creation error path
  char: lp: use new parport device model
  char: lp: properly count the lp devices
  char: lp: use first unused lp number while registering
  char: lp: detach the device when parallel port is removed
  char: lp: introduce list to save port number
  bus: qcom: remove duplicated include from qcom-ebi2.c
  VMCI: Use memdup_user() rather than duplicating its implementation
  char/rtc: Use of_node_name_eq for node name comparisons
  misc: mic: fix a DMA pool free failure
  ptp: fix an IS_ERR() vs NULL check
  genwqe: Fix size check
  binder: implement binderfs
  binder: fix use-after-free due to ksys_close() during fdget()
  bus: fsl-mc: remove duplicated include files
  bus: fsl-mc: explicitly define the fsl_mc_command endianness
  misc: ti-st: make array read_ver_cmd static, shrinks object size
  ...
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull char/misc driver updates from Greg KH:
 "Here is the big set of char and misc driver patches for 4.21-rc1.

  Lots of different types of driver things in here, as this tree seems
  to be the "collection of various driver subsystems not big enough to
  have their own git tree" lately.

  Anyway, some highlights of the changes in here:

   - binderfs: is it a rule that all driver subsystems will eventually
     grow to have their own filesystem? Binder now has one to handle the
     use of it in containerized systems.

     This was discussed at the Plumbers conference a few months ago and
     knocked into mergable shape very fast by Christian Brauner. Who
     also has signed up to be another binder maintainer, showing a
     distinct lack of good judgement :)

   - binder updates and fixes

   - mei driver updates

   - fpga driver updates and additions

   - thunderbolt driver updates

   - soundwire driver updates

   - extcon driver updates

   - nvmem driver updates

   - hyper-v driver updates

   - coresight driver updates

   - pvpanic driver additions and reworking for more device support

   - lp driver updates. Yes really, it's _finally_ moved to the proper
     parallal port driver model, something I never thought I would see
     happen. Good stuff.

   - other tiny driver updates and fixes.

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'char-misc-4.21-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (116 commits)
  MAINTAINERS: add another Android binder maintainer
  intel_th: msu: Fix an off-by-one in attribute store
  stm class: Add a reference to the SyS-T document
  stm class: Fix a module refcount leak in policy creation error path
  char: lp: use new parport device model
  char: lp: properly count the lp devices
  char: lp: use first unused lp number while registering
  char: lp: detach the device when parallel port is removed
  char: lp: introduce list to save port number
  bus: qcom: remove duplicated include from qcom-ebi2.c
  VMCI: Use memdup_user() rather than duplicating its implementation
  char/rtc: Use of_node_name_eq for node name comparisons
  misc: mic: fix a DMA pool free failure
  ptp: fix an IS_ERR() vs NULL check
  genwqe: Fix size check
  binder: implement binderfs
  binder: fix use-after-free due to ksys_close() during fdget()
  bus: fsl-mc: remove duplicated include files
  bus: fsl-mc: explicitly define the fsl_mc_command endianness
  misc: ti-st: make array read_ver_cmd static, shrinks object size
  ...
</pre>
</div>
</content>
</entry>
<entry>
<title>binder: fix use-after-free due to ksys_close() during fdget()</title>
<updated>2018-12-19T08:40:13+00:00</updated>
<author>
<name>Todd Kjos</name>
<email>tkjos@android.com</email>
</author>
<published>2018-12-14T23:58:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=80cd795630d6526ba729a089a435bf74a57af927'/>
<id>80cd795630d6526ba729a089a435bf74a57af927</id>
<content type='text'>
44d8047f1d8 ("binder: use standard functions to allocate fds")
exposed a pre-existing issue in the binder driver.

fdget() is used in ksys_ioctl() as a performance optimization.
One of the rules associated with fdget() is that ksys_close() must
not be called between the fdget() and the fdput(). There is a case
where this requirement is not met in the binder driver which results
in the reference count dropping to 0 when the device is still in
use. This can result in use-after-free or other issues.

If userpace has passed a file-descriptor for the binder driver using
a BINDER_TYPE_FDA object, then kys_close() is called on it when
handling a binder_ioctl(BC_FREE_BUFFER) command. This violates
the assumptions for using fdget().

The problem is fixed by deferring the close using task_work_add(). A
new variant of __close_fd() was created that returns a struct file
with a reference. The fput() is deferred instead of using ksys_close().

Fixes: 44d8047f1d87a ("binder: use standard functions to allocate fds")
Suggested-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Todd Kjos &lt;tkjos@google.com&gt;
Cc: stable &lt;stable@vger.kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
44d8047f1d8 ("binder: use standard functions to allocate fds")
exposed a pre-existing issue in the binder driver.

fdget() is used in ksys_ioctl() as a performance optimization.
One of the rules associated with fdget() is that ksys_close() must
not be called between the fdget() and the fdput(). There is a case
where this requirement is not met in the binder driver which results
in the reference count dropping to 0 when the device is still in
use. This can result in use-after-free or other issues.

If userpace has passed a file-descriptor for the binder driver using
a BINDER_TYPE_FDA object, then kys_close() is called on it when
handling a binder_ioctl(BC_FREE_BUFFER) command. This violates
the assumptions for using fdget().

The problem is fixed by deferring the close using task_work_add(). A
new variant of __close_fd() was created that returns a struct file
with a reference. The fput() is deferred instead of using ksys_close().

Fixes: 44d8047f1d87a ("binder: use standard functions to allocate fds")
Suggested-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Todd Kjos &lt;tkjos@google.com&gt;
Cc: stable &lt;stable@vger.kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>fs/file: Replace synchronize_sched() with synchronize_rcu()</title>
<updated>2018-11-27T17:21:39+00:00</updated>
<author>
<name>Paul E. McKenney</name>
<email>paulmck@linux.ibm.com</email>
</author>
<published>2018-11-06T01:31:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=c93ffc15cceb057924410f9178e679120ee12353'/>
<id>c93ffc15cceb057924410f9178e679120ee12353</id>
<content type='text'>
Now that synchronize_rcu() waits for preempt-disable regions of code
as well as RCU read-side critical sections, synchronize_sched() can be
replaced by synchronize_rcu().  This commit therefore makes this change.

Signed-off-by: Paul E. McKenney &lt;paulmck@linux.ibm.com&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: &lt;linux-fsdevel@vger.kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Now that synchronize_rcu() waits for preempt-disable regions of code
as well as RCU read-side critical sections, synchronize_sched() can be
replaced by synchronize_rcu().  This commit therefore makes this change.

Signed-off-by: Paul E. McKenney &lt;paulmck@linux.ibm.com&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: &lt;linux-fsdevel@vger.kernel.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
