linux-toradex.git/include/linux/kernfs.h, branch v4.8-rc6

Merge tag 'driver-core-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

2016-05-21T04:26:15+00:00

Pull driver core updates from Greg KH:
 "Here's the "big" driver core update for 4.7-rc1.

  Mostly just debugfs changes, the long-known and messy races with
  removing debugfs files should be fixed thanks to the great work of
  Nicolai Stange.  We also have some isa updates in here (the x86
  maintainers told me to take it through this tree), a new warning when
  we run out of dynamic char major numbers, and a few other assorted
  changes, details in the shortlog.

  All have been in linux-next for some time with no reported issues"

* tag 'driver-core-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (32 commits)
  Revert "base: dd: don't remove driver_data in -EPROBE_DEFER case"
  gpio: ws16c48: Utilize the ISA bus driver
  gpio: 104-idio-16: Utilize the ISA bus driver
  gpio: 104-idi-48: Utilize the ISA bus driver
  gpio: 104-dio-48e: Utilize the ISA bus driver
  watchdog: ebc-c384_wdt: Utilize the ISA bus driver
  iio: stx104: Utilize the module_isa_driver and max_num_isa_dev macros
  iio: stx104: Add X86 dependency to STX104 Kconfig option
  Documentation: Add ISA bus driver documentation
  isa: Implement the max_num_isa_dev macro
  isa: Implement the module_isa_driver macro
  pnp: pnpbios: Add explicit X86_32 dependency to PNPBIOS
  isa: Decouple X86_32 dependency from the ISA Kconfig option
  driver-core: use 'dev' argument in dev_dbg_ratelimited stub
  base: dd: don't remove driver_data in -EPROBE_DEFER case
  kernfs: Move faulting copy_user operations outside of the mutex
  devcoredump: add scatterlist support
  debugfs: unproxify files created through debugfs_create_u32_array()
  debugfs: unproxify files created through debugfs_create_blob()
  debugfs: unproxify files created through debugfs_create_bool()
  ...

cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces

2016-05-09T16:15:03+00:00

Patch summary:

When showing a cgroupfs entry in mountinfo, show the path of the mount
root dentry relative to the reader's cgroup namespace root.

Short explanation (courtesy of mkerrisk):

If we create a new cgroup namespace, then we want both /proc/self/cgroup
and /proc/self/mountinfo to show cgroup paths that are correctly
virtualized with respect to the cgroup mount point.  Previous to this
patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo
does not.

Long version:

When a uid 0 task that is in freezer cgroup /a/b unshares a new cgroup
namespace and then mounts a new instance of the freezer cgroup, the new
mount will be rooted at /a/b.  The root dentry field of the mountinfo
entry will show '/a/b'.

 cat > /tmp/do1 << EOF
 mount -t cgroup -o freezer freezer /mnt
 grep freezer /proc/self/mountinfo
 EOF

 unshare -Gm  bash /tmp/do1
 > 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
 > 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer

The task's freezer cgroup entry in /proc/self/cgroup will simply show
'/':

 grep freezer /proc/self/cgroup
 9:freezer:/

If instead the same task simply bind mounts the /a/b cgroup directory,
the resulting mountinfo entry will again show /a/b for the dentry root.
However in this case the task will find its own cgroup at /mnt/a/b,
not at /mnt:

 mount --bind /sys/fs/cgroup/freezer/a/b /mnt
 130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer

In other words, there is no way for the task to know, based on what is
in mountinfo, which cgroup directory is its own.
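
The ambiguity can be seen by parsing the two mountinfo lines quoted
above. The following is a minimal user-space sketch, not part of the
patch; the MountEntry type and parse_mountinfo_line() helper are
hypothetical names, and the field positions follow the mountinfo layout
described in proc(5).

```python
from typing import NamedTuple


class MountEntry(NamedTuple):
    root: str           # 4th field: root of the mount within the filesystem
    mount_point: str    # 5th field: where the mount is attached
    fs_type: str        # first field after the "-" separator
    super_options: str  # third field after the "-" separator


def parse_mountinfo_line(line: str) -> MountEntry:
    """Split one mountinfo line into the fields used in this example."""
    fields = line.split()
    # Optional fields such as "shared:21" start at index 6 and are
    # terminated by a lone "-".
    sep = fields.index("-", 6)
    return MountEntry(fields[3], fields[4], fields[sep + 1], fields[sep + 3])


# The two lines below are copied from the examples in this commit message.
fresh_mount = parse_mountinfo_line(
    "355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer")
bind_mount = parse_mountinfo_line(
    "130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 "
    "- cgroup cgroup rw,freezer")

# Both entries report the same root and mount point, although the task's
# own cgroup lives at /mnt in one case and at /mnt/a/b in the other.
print(fresh_mount.root, fresh_mount.mount_point)
print(bind_mount.root, bind_mount.mount_point)
```

Since the root and mount point fields are identical in both cases, a
reader of mountinfo alone has nothing to tell the two mounts apart.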

Example (by mkerrisk):

First, a little script to save some typing and verbiage:

echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)"
cat /proc/self/mountinfo | grep freezer |
        awk '{print "\tmountinfo:\t\t" $4 "\t" $5}'

Create cgroup, place this shell into the cgroup, and look at the state
of the /proc files:

2653
2653                         # Our shell
14254                        # cat(1)
        /proc/self/cgroup:      10:freezer:/a/b
        mountinfo:              /       /sys/fs/cgroup/freezer

Create a shell in new cgroup and mount namespaces. The act of creating
a new cgroup namespace causes the process's current cgroups directories
to become its cgroup root directories. (Here, I'm using my own version
of the "unshare" utility, which takes the same options as the util-linux
version):

Look at the state of the /proc files:

        /proc/self/cgroup:      10:freezer:/
        mountinfo:              /       /sys/fs/cgroup/freezer

The third entry in /proc/self/cgroup (the pathname of the cgroup inside
the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which
is rooted at /a/b in the outer namespace.

However, the info in /proc/self/mountinfo is not for this cgroup
namespace, since we are seeing a duplicate of the mount from the
old mount namespace, and the info there does not correspond to the
new cgroup namespace. However, trying to create a new mount still
doesn't show us the right information in mountinfo:

                                      # propagating to other mountns
        /proc/self/cgroup:      7:freezer:/
        mountinfo:              /a/b    /mnt/freezer

The act of creating a new cgroup namespace caused the process's
current freezer directory, "/a/b", to become its cgroup freezer root
directory. In other words, the pathname of that directory within the
newly mounted cgroup filesystem should be "/",
but mountinfo wrongly shows us "/a/b". The consequence of this is
that the process in the cgroup namespace cannot correctly construct
the pathname of its cgroup root directory from the information in
/proc/PID/mountinfo.

With this patch, the dentry root field in mountinfo is shown relative
to the reader's cgroup namespace.  So the same steps as above:

        /proc/self/cgroup:      10:freezer:/a/b
        mountinfo:              /       /sys/fs/cgroup/freezer
        /proc/self/cgroup:      10:freezer:/
        mountinfo:              /../..  /sys/fs/cgroup/freezer
        /proc/self/cgroup:      10:freezer:/
        mountinfo:              /       /mnt/freezer

cgroup.clone_children  freezer.parent_freezing  freezer.state      tasks
cgroup.procs           freezer.self_freezing    notify_on_release
3164
2653                   # First shell that placed in this cgroup
3164                   # Shell started by 'unshare'
14197                  # cat(1)

Signed-off-by: Serge Hallyn 
Tested-by: Michael Kerrisk 
Acked-by: Michael Kerrisk 
Signed-off-by: Tejun Heo

kernfs: Move faulting copy_user operations outside of the mutex

2016-04-30T17:05:05+00:00

A fault in a user provided buffer may lead anywhere, and lockdep warns
that we have a potential deadlock between the mm->mmap_sem and the
kernfs file mutex:

[   82.811702] ======================================================
[   82.811705] [ INFO: possible circular locking dependency detected ]
[   82.811709] 4.5.0-rc4-gfxbench+ #1 Not tainted
[   82.811711] -------------------------------------------------------
[   82.811714] kms_setmode/5859 is trying to acquire lock:
[   82.811717]  (&dev->struct_mutex){+.+.+.}, at: [] drm_gem_mmap+0x1a1/0x270
[   82.811731]
but task is already holding lock:
[   82.811734]  (&mm->mmap_sem){++++++}, at: [] vm_mmap_pgoff+0x44/0xa0
[   82.811745]
which lock already depends on the new lock.

[   82.811749]
the existing dependency chain (in reverse order) is:
[   82.811752]
-> #3 (&mm->mmap_sem){++++++}:
[   82.811761]        [] lock_acquire+0xc3/0x1d0
[   82.811766]        [] __might_fault+0x75/0xa0
[   82.811771]        [] kernfs_fop_write+0x8a/0x180
[   82.811787]        [] __vfs_write+0x23/0xe0
[   82.811792]        [] vfs_write+0xa4/0x190
[   82.811797]        [] SyS_write+0x44/0xb0
[   82.811801]        [] entry_SYSCALL_64_fastpath+0x16/0x73
[   82.811807]
-> #2 (s_active#6){++++.+}:
[   82.811814]        [] lock_acquire+0xc3/0x1d0
[   82.811819]        [] __kernfs_remove+0x210/0x2f0
[   82.811823]        [] kernfs_remove_by_name_ns+0x40/0xa0
[   82.811828]        [] sysfs_remove_file_ns+0x10/0x20
[   82.811832]        [] device_del+0x124/0x250
[   82.811837]        [] device_unregister+0x19/0x60
[   82.811841]        [] cpu_cache_sysfs_exit+0x51/0xb0
[   82.811846]        [] cacheinfo_cpu_callback+0x38/0x70
[   82.811851]        [] notifier_call_chain+0x39/0xa0
[   82.811856]        [] __raw_notifier_call_chain+0x9/0x10
[   82.811860]        [] cpu_notify+0x1e/0x40
[   82.811865]        [] cpu_notify_nofail+0x9/0x20
[   82.811869]        [] _cpu_down+0x233/0x340
[   82.811874]        [] disable_nonboot_cpus+0xc9/0x350
[   82.811878]        [] suspend_devices_and_enter+0x5a1/0xb50
[   82.811883]        [] pm_suspend+0x543/0x8d0
[   82.811888]        [] state_store+0x77/0xe0
[   82.811892]        [] kobj_attr_store+0xf/0x20
[   82.811897]        [] sysfs_kf_write+0x40/0x50
[   82.811902]        [] kernfs_fop_write+0x13c/0x180
[   82.811906]        [] __vfs_write+0x23/0xe0
[   82.811910]        [] vfs_write+0xa4/0x190
[   82.811914]        [] SyS_write+0x44/0xb0
[   82.811918]        [] entry_SYSCALL_64_fastpath+0x16/0x73
[   82.811923]
-> #1 (cpu_hotplug.lock){+.+.+.}:
[   82.811929]        [] lock_acquire+0xc3/0x1d0
[   82.811933]        [] mutex_lock_nested+0x62/0x3b0
[   82.811940]        [] get_online_cpus+0x61/0x80
[   82.811944]        [] stop_machine+0x1b/0xe0
[   82.811949]        [] gen8_ggtt_insert_entries__BKL+0x2d/0x30 [i915]
[   82.812009]        [] ggtt_bind_vma+0x46/0x70 [i915]
[   82.812045]        [] i915_vma_bind+0x140/0x290 [i915]
[   82.812081]        [] i915_gem_object_do_pin+0x899/0xb00 [i915]
[   82.812117]        [] i915_gem_object_pin+0x35/0x40 [i915]
[   82.812154]        [] intel_init_pipe_control+0xbe/0x210 [i915]
[   82.812192]        [] intel_logical_rings_init+0xe2/0xde0 [i915]
[   82.812232]        [] i915_gem_init+0xf3/0x130 [i915]
[   82.812278]        [] i915_driver_load+0xf2d/0x1770 [i915]
[   82.812318]        [] drm_dev_register+0xa4/0xb0
[   82.812323]        [] drm_get_pci_dev+0xce/0x1e0
[   82.812328]        [] i915_pci_probe+0x2f/0x50 [i915]
[   82.812360]        [] pci_device_probe+0x87/0xf0
[   82.812366]        [] driver_probe_device+0x229/0x450
[   82.812371]        [] __driver_attach+0x83/0x90
[   82.812375]        [] bus_for_each_dev+0x61/0xa0
[   82.812380]        [] driver_attach+0x19/0x20
[   82.812384]        [] bus_add_driver+0x1ef/0x290
[   82.812388]        [] driver_register+0x5b/0xe0
[   82.812393]        [] __pci_register_driver+0x5b/0x60
[   82.812398]        [] drm_pci_init+0xd6/0x100
[   82.812402]        [] 0xffffffffa027c094
[   82.812406]        [] do_one_initcall+0xae/0x1d0
[   82.812412]        [] do_init_module+0x5b/0x1cb
[   82.812417]        [] load_module+0x1c20/0x2480
[   82.812422]        [] SyS_finit_module+0x7e/0xa0
[   82.812428]        [] entry_SYSCALL_64_fastpath+0x16/0x73
[   82.812433]
-> #0 (&dev->struct_mutex){+.+.+.}:
[   82.812439]        [] __lock_acquire+0x1fc9/0x20f0
[   82.812443]        [] lock_acquire+0xc3/0x1d0
[   82.812456]        [] drm_gem_mmap+0x1c7/0x270
[   82.812460]        [] mmap_region+0x334/0x580
[   82.812466]        [] do_mmap+0x364/0x410
[   82.812470]        [] vm_mmap_pgoff+0x6d/0xa0
[   82.812474]        [] SyS_mmap_pgoff+0x184/0x220
[   82.812479]        [] SyS_mmap+0x1d/0x20
[   82.812484]        [] entry_SYSCALL_64_fastpath+0x16/0x73
[   82.812489]
other info that might help us debug this:

[   82.812493] Chain exists of:
  &dev->struct_mutex --> s_active#6 --> &mm->mmap_sem

[   82.812502]  Possible unsafe locking scenario:

[   82.812506]        CPU0                    CPU1
[   82.812508]        ----                    ----
[   82.812510]   lock(&mm->mmap_sem);
[   82.812514]                                lock(s_active#6);
[   82.812519]                                lock(&mm->mmap_sem);
[   82.812522]   lock(&dev->struct_mutex);
[   82.812526]
 *** DEADLOCK ***

[   82.812531] 1 lock held by kms_setmode/5859:
[   82.812533]  #0:  (&mm->mmap_sem){++++++}, at: [] vm_mmap_pgoff+0x44/0xa0
[   82.812541]
stack backtrace:
[   82.812547] CPU: 0 PID: 5859 Comm: kms_setmode Not tainted 4.5.0-rc4-gfxbench+ #1
[   82.812550] Hardware name:                  /NUC5CPYB, BIOS PYBSWCEL.86A.0040.2015.0814.1353 08/14/2015
[   82.812553]  0000000000000000 ffff880079407bf0 ffffffff813f8505 ffffffff825fb270
[   82.812560]  ffffffff825c4190 ffff880079407c30 ffffffff810c84ac ffff880079407c90
[   82.812566]  ffff8800797ed328 ffff8800797ecb00 0000000000000001 ffff8800797ed350
[   82.812573] Call Trace:
[   82.812578]  [] dump_stack+0x67/0x92
[   82.812582]  [] print_circular_bug+0x1fc/0x310
[   82.812586]  [] __lock_acquire+0x1fc9/0x20f0
[   82.812590]  [] lock_acquire+0xc3/0x1d0
[   82.812594]  [] ? drm_gem_mmap+0x1a1/0x270
[   82.812599]  [] drm_gem_mmap+0x1c7/0x270
[   82.812603]  [] ? drm_gem_mmap+0x1a1/0x270
[   82.812608]  [] mmap_region+0x334/0x580
[   82.812612]  [] do_mmap+0x364/0x410
[   82.812616]  [] vm_mmap_pgoff+0x6d/0xa0
[   82.812629]  [] SyS_mmap_pgoff+0x184/0x220
[   82.812633]  [] SyS_mmap+0x1d/0x20
[   82.812637]  [] entry_SYSCALL_64_fastpath+0x16/0x73

Highly unlikely though this scenario is, we can avoid the issue entirely
by moving the copy operation out from under the kernfs_get_active()
tracking and assigning the preallocated buffer its own mutex. The
temporary buffer allocation doesn't require mutex locking as it is
entirely local.
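
The lock ordering described above can be modelled in user space. This
is only a sketch under stated assumptions: the OpenFile class, its
field names, and the read_user_data callback are hypothetical
stand-ins for the kernfs open file, its mutexes, and the copy from
user memory; Python's threading.Lock replaces the kernel mutex.

```python
import threading


class OpenFile:
    """Hypothetical user-space model of an open kernfs file."""

    def __init__(self, store, prealloc_size=0):
        self.mutex = threading.Lock()           # serializes the write op
        self.prealloc_mutex = threading.Lock()  # guards only prealloc_buf
        self.prealloc_buf = bytearray(prealloc_size) if prealloc_size else None
        self.store = store

    def write(self, read_user_data):
        """read_user_data() models a user copy that may fault."""
        if self.prealloc_buf is not None:
            # The shared buffer gets its own lock, so the copy never
            # runs under the write-op mutex.
            with self.prealloc_mutex:
                data = read_user_data()
                n = min(len(data), len(self.prealloc_buf))
                self.prealloc_buf[:n] = data[:n]
                return self._store_locked(bytes(self.prealloc_buf[:n]))
        # A temporary buffer is local to this call; filling it needs no lock.
        return self._store_locked(bytes(read_user_data()))

    def _store_locked(self, data):
        with self.mutex:
            return self.store(data)


log = []
f = OpenFile(store=lambda data: log.append(data) or len(data),
             prealloc_size=16)
held_during_copy = []


def user_copy():
    # Record whether the write-op mutex is held while "copying".
    held_during_copy.append(f.mutex.locked())
    return b"mem"


print(f.write(user_copy), held_during_copy)
```

In this model the potentially faulting step always completes before
the write-op mutex is taken, which is the ordering the commit aims for.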

The locked section was extended by the addition of the preallocated buf
to speed up md user operations in

commit 2b75869bba676c248d8d25ae6d2bd9221dfffdb6
Author: NeilBrown 
Date:   Mon Oct 13 16:41:28 2014 +1100

    sysfs/kernfs: allow attributes to request write buffer be pre-allocated.

Reported-by: Ville Syrjälä 
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=94350
Signed-off-by: Chris Wilson 
Reviewed-by: Joonas Lahtinen 
Cc: Ville Syrjälä 
Cc: Joonas Lahtinen 
Cc: NeilBrown 
Acked-by: Tejun Heo 
Signed-off-by: Greg Kroah-Hartman

kernfs: define kernfs_node_dentry

2016-02-16T18:04:58+00:00

Add a new kernfs API to look up the dentry for a particular kernfs
path.

Signed-off-by: Aditya Kali 
Signed-off-by: Serge E. Hallyn 
Acked-by: Greg Kroah-Hartman 
Signed-off-by: Tejun Heo

kernfs: Add API to generate relative kernfs path

2016-02-16T18:04:58+00:00

The new function kernfs_path_from_node() generates and returns the kernfs
path of a given kernfs_node relative to a given parent kernfs_node.
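
A rough user-space illustration of the idea follows. It works on path
strings instead of kernfs_node pointers, the path_from_node() name and
signature are hypothetical, and the handling of a target outside the
starting node's subtree is an assumption chosen to match the "/../.."
root shown in the cgroup mountinfo example earlier in this log.

```python
def path_from_node(to_path: str, from_path: str) -> str:
    """Return to_path as seen from from_path, always with a leading '/'."""
    to_parts = [p for p in to_path.split("/") if p]
    from_parts = [p for p in from_path.split("/") if p]

    # Count the leading components that both paths share.
    common = 0
    for a, b in zip(to_parts, from_parts):
        if a != b:
            break
        common += 1

    # Climb out of from_path, then descend into to_path.
    ups = [".."] * (len(from_parts) - common)
    downs = to_parts[common:]
    return "/" + "/".join(ups + downs)


print(path_from_node("/a/b", "/a/b"))    # same node
print(path_from_node("/a/b/c", "/a/b"))  # descendant
print(path_from_node("/", "/a/b"))       # ancestor of the starting node
```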

Signed-off-by: Aditya Kali 
Signed-off-by: Serge E. Hallyn 
Acked-by: Greg Kroah-Hartman 
Signed-off-by: Tejun Heo

kernfs: implement kernfs_walk_and_get()

2015-11-20T20:55:52+00:00

Implement kernfs_walk_and_get() which is similar to
kernfs_find_and_get() but can walk a path instead of just a name.
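
The difference between the two lookups can be sketched in user space.
The Node class and the find() and walk() helpers below are
hypothetical; reference counting (the "_and_get" part) and namespace
tags are left out.

```python
class Node:
    """Minimal tree node: a name, a parent, and children keyed by name."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = {}
        if parent is not None:
            parent.children[name] = self


def find(parent, name):
    """Look up a direct child by a single name component."""
    return parent.children.get(name)


def walk(parent, path):
    """Resolve a '/'-separated path one component at a time."""
    node = parent
    for component in path.split("/"):
        if not component:
            continue  # skip empty components from leading or doubled '/'
        node = find(node, component)
        if node is None:
            return None
    return node


root = Node("")
a = Node("a", root)
b = Node("b", a)

print(find(root, "a/b"))      # a single-name lookup cannot cross '/'
print(walk(root, "a/b") is b)
```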

v2: Use strlcpy() instead of strlen() + memcpy() as suggested by
    David.

Signed-off-by: Tejun Heo 
Acked-by: Greg Kroah-Hartman 
Cc: David Miller

kernfs: implement kernfs_path_len()

2015-08-18T22:49:15+00:00

Add a function to determine the path length of a kernfs node.  For
now, this will be used by writeback tracepoint updates.
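
As a rough illustration, the length can be computed by walking up the
parent chain without building the path string. The Node class and
path_len() below are hypothetical user-space stand-ins, and treating
the root as the one-character path "/" is an assumption.

```python
class Node:
    """Minimal node with a name and a parent link."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def path_len(node):
    """Length of the absolute path of node, computed without a buffer."""
    if node.parent is None:
        return 1  # the root is "/"
    length = 0
    while node.parent is not None:
        length += 1 + len(node.name)  # "/" plus the component name
        node = node.parent
    return length


root = Node("")
a = Node("a", root)
b = Node("b", a)
print(path_len(b) == len("/a/b"))
```

A caller such as a tracepoint can use the length to size a buffer
before generating the path itself.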

Signed-off-by: Tejun Heo 
Acked-by: Greg Kroah-Hartman 
Signed-off-by: Jens Axboe

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

2015-07-03T22:20:57+00:00

Pull user namespace updates from Eric Biederman:
 "Long ago and far away when user namespaces were young it was realized
  that allowing fresh mounts of proc and sysfs with only user namespace
  permissions could violate the basic rule that only root gets to decide
  if proc or sysfs should be mounted at all.

  Some hacks were put in place to reduce the worst of the damage that
  could be done, and the common sense rule was adopted that fresh mounts of
  proc and sysfs should allow no more than bind mounts of proc and
  sysfs.  Unfortunately that rule has not been fully enforced.

  There are two kinds of gaps in that enforcement.  Only filesystems
  mounted on empty directories of proc and sysfs should be ignored but
  the test for empty directories was insufficient.  So in my tree
  directories on proc, sysctl and sysfs that will always be empty are
  created specially.  Every other technique is imperfect as an ordinary
  directory can have entries added even after a readdir returns and
  shows that the directory is empty.  Special creation of directories
  for mount points makes the code in the kernel a smidge clearer about
  its purpose.  I asked container developers from the various container
  projects to help test this and no holes were found in the set of mount
  points on proc and sysfs that are created specially.

  This set of changes also starts enforcing that the mount flags of
  fresh mounts of proc and sysfs are consistent with the existing mount of
  proc and sysfs.  I expected this to be the boring part of the work but
  unfortunately unprivileged userspace winds up mounting fresh copies of
  proc and sysfs with noexec and nosuid clear when root set those flags
  on the previous mount of proc and sysfs.  So for now only the atime,
  read-only and nodev attributes which userspace happens to keep
  consistent are enforced.  Dealing with the noexec and nosuid
  attributes remains for another time.

  This set of changes also addresses an issue with how open file
  descriptors from /proc/<pid>/ns/* are displayed.  Recently readlink of
  /proc/<pid>/fd has been triggering a WARN_ON that has not been
  meaningful since it was added (as all of the code in the kernel was
  converted) and is now actively wrong.

  There is also a short list of issues that have not been fixed yet that
  I will mention briefly.

  It is possible to rename a directory from below to above a bind mount.
  At which point any directory pointers below the renamed directory can
  be walked up to the root directory of the filesystem.  With user
  namespaces enabled a bind mount of the bind mount can be created
  allowing the user to pick a directory whose children they can rename
  to outside of the bind mount.  This is challenging to fix and doubly
  so because all obvious solutions must touch code that is in the
  performance part of pathname resolution.

  As mentioned above there is also a question of how to ensure that
  developers by accident or with purpose do not introduce executable
  files on sysfs and proc and in doing so introduce security regressions
  in the current userspace that will not be immediately obvious and as
  such are likely to require breaking userspace in painful ways once
  they are recognized"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  vfs: Remove incorrect debugging WARN in prepend_path
  mnt: Update fs_fully_visible to test for permanently empty directories
  sysfs: Create mountpoints with sysfs_create_mount_point
  sysfs: Add support for permanently empty directories to serve as mount points.
  kernfs: Add support for always empty directories.
  proc: Allow creating permanently empty directories that serve as mount points
  sysctl: Allow creating permanently empty directories that serve as mountpoints.
  fs: Add helper functions for permanently empty directories.
  vfs: Ignore unlocked mounts in fs_fully_visible
  mnt: Modify fs_fully_visible to deal with locked ro nodev and atime
  mnt: Refactor the logic for mounting sysfs and proc in a user namespace

kernfs: Add support for always empty directories.

2015-07-01T15:36:43+00:00

Add a new function kernfs_create_empty_dir that can be used to create
a directory that cannot be modified.

Update the code to use make_empty_dir_inode when reporting a
permanently empty directory to the vfs.

Update the code to not allow adding to permanently empty directories.
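
The last point can be pictured with a small user-space model. The
Directory class, its permanently_empty flag, and the EmptyDirError
exception are hypothetical; the error the kernel actually returns is
not stated in this message and is not modelled here.

```python
class EmptyDirError(Exception):
    """Raised when something is added to a permanently empty directory."""


class Directory:
    def __init__(self, name, permanently_empty=False):
        self.name = name
        self.permanently_empty = permanently_empty
        self.children = {}

    def add(self, child):
        # A permanently empty directory rejects every new entry, so it
        # stays empty even after a listing has reported it as empty.
        if self.permanently_empty:
            raise EmptyDirError(f"{self.name} is permanently empty")
        self.children[child.name] = child
        return child


normal = Directory("normal")
normal.add(Directory("child"))

mount_point = Directory("mount_point", permanently_empty=True)
try:
    mount_point.add(Directory("child"))
except EmptyDirError as err:
    print(err)
```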

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman"

kernfs: make kernfs_get_inode() public

2015-06-18T20:54:28+00:00

Move kernfs_get_inode() prototype from fs/kernfs/kernfs-internal.h to
include/linux/kernfs.h.  It obtains the matching inode for a
kernfs_node.

It will be used by cgroup for inode based permission checks for now
but is generally useful.

Signed-off-by: Tejun Heo 
Acked-by: Greg Kroah-Hartman