294 files changed, 7439 insertions, 3359 deletions
diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst index 280355d08af5..33feab2f4084 100644 --- a/Documentation/admin-guide/index.rst +++ b/Documentation/admin-guide/index.rst @@ -77,6 +77,7 @@ configure specific aspects of kernel behavior to your liking. blockdev/index ext4 binderfs + xfs pm/index thunderbolt LSM/index diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index a5f4004e8705..f0461456d910 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -2011,6 +2011,19 @@ Built with CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=y, the default is off. + kprobe_event=[probe-list] + [FTRACE] Add kprobe events and enable at boot time. + The probe-list is a semicolon delimited list of probe + definitions. Each definition is same as kprobe_events + interface, but the parameters are comma delimited. + For example, to add a kprobe event on vfs_read with + arg1 and arg2, add to the command line; + + kprobe_event=p,vfs_read,$arg1,$arg2 + + See also Documentation/trace/kprobetrace.rst "Kernel + Boot Parameter" section. + kpti= [ARM64] Control page table isolation of user and kernel address spaces. Default: enabled on cores which need mitigation. diff --git a/Documentation/filesystems/xfs.txt b/Documentation/admin-guide/xfs.rst index a5cbb5e0e3db..e76665a8f2f2 100644 --- a/Documentation/filesystems/xfs.txt +++ b/Documentation/admin-guide/xfs.rst @@ -1,4 +1,6 @@ +.. SPDX-License-Identifier: GPL-2.0 +====================== The SGI XFS Filesystem ====================== @@ -18,8 +20,6 @@ Mount Options ============= When mounting an XFS filesystem, the following options are accepted. -For boolean mount options, the names with the (*) suffix is the -default behaviour. allocsize=size Sets the buffered I/O end-of-file preallocation size when @@ -31,46 +31,43 @@ default behaviour. preallocation size, which uses a set of heuristics to optimise the preallocation size based on the current allocation patterns within the file and the access patterns - to the file. Specifying a fixed allocsize value turns off + to the file. Specifying a fixed ``allocsize`` value turns off the dynamic behaviour. - attr2 - noattr2 + attr2 or noattr2 The options enable/disable an "opportunistic" improvement to be made in the way inline extended attributes are stored on-disk. When the new form is used for the first time when - attr2 is selected (either when setting or removing extended + ``attr2`` is selected (either when setting or removing extended attributes) the on-disk superblock feature bit field will be updated to reflect this format being in use. The default behaviour is determined by the on-disk feature - bit indicating that attr2 behaviour is active. If either - mount option it set, then that becomes the new default used + bit indicating that ``attr2`` behaviour is active. If either + mount option is set, then that becomes the new default used by the filesystem. - CRC enabled filesystems always use the attr2 format, and so - will reject the noattr2 mount option if it is set. + CRC enabled filesystems always use the ``attr2`` format, and so + will reject the ``noattr2`` mount option if it is set. - discard - nodiscard (*) + discard or nodiscard (default) Enable/disable the issuing of commands to let the block device reclaim space freed by the filesystem. This is useful for SSD devices, thinly provisioned LUNs and virtual machine images, but may have a performance impact. 
- Note: It is currently recommended that you use the fstrim - application to discard unused blocks rather than the discard + Note: It is currently recommended that you use the ``fstrim`` + application to ``discard`` unused blocks rather than the ``discard`` mount option because the performance impact of this option is quite severe. - grpid/bsdgroups - nogrpid/sysvgroups (*) + grpid/bsdgroups or nogrpid/sysvgroups (default) These options define what group ID a newly created file - gets. When grpid is set, it takes the group ID of the + gets. When ``grpid`` is set, it takes the group ID of the directory in which it is created; otherwise it takes the - fsgid of the current process, unless the directory has the - setgid bit set, in which case it takes the gid from the - parent directory, and also gets the setgid bit set if it is + ``fsgid`` of the current process, unless the directory has the + ``setgid`` bit set, in which case it takes the ``gid`` from the + parent directory, and also gets the ``setgid`` bit set if it is a directory itself. filestreams @@ -78,46 +75,42 @@ default behaviour. across the entire filesystem rather than just on directories configured to use it. - ikeep - noikeep (*) - When ikeep is specified, XFS does not delete empty inode - clusters and keeps them around on disk. When noikeep is + ikeep or noikeep (default) + When ``ikeep`` is specified, XFS does not delete empty inode + clusters and keeps them around on disk. When ``noikeep`` is specified, empty inode clusters are returned to the free space pool. - inode32 - inode64 (*) - When inode32 is specified, it indicates that XFS limits + inode32 or inode64 (default) + When ``inode32`` is specified, it indicates that XFS limits inode creation to locations which will not result in inode numbers with more than 32 bits of significance. - When inode64 is specified, it indicates that XFS is allowed + When ``inode64`` is specified, it indicates that XFS is allowed to create inodes at any location in the filesystem, including those which will result in inode numbers occupying - more than 32 bits of significance. + more than 32 bits of significance. - inode32 is provided for backwards compatibility with older + ``inode32`` is provided for backwards compatibility with older systems and applications, since 64 bits inode numbers might cause problems for some applications that cannot handle large inode numbers. If applications are in use which do - not handle inode numbers bigger than 32 bits, the inode32 + not handle inode numbers bigger than 32 bits, the ``inode32`` option should be specified. - - largeio - nolargeio (*) - If "nolargeio" is specified, the optimal I/O reported in - st_blksize by stat(2) will be as small as possible to allow + largeio or nolargeio (default) + If ``nolargeio`` is specified, the optimal I/O reported in + ``st_blksize`` by **stat(2)** will be as small as possible to allow user applications to avoid inefficient read/modify/write I/O. This is typically the page size of the machine, as this is the granularity of the page cache. - If "largeio" specified, a filesystem that was created with a - "swidth" specified will return the "swidth" value (in bytes) - in st_blksize. If the filesystem does not have a "swidth" - specified but does specify an "allocsize" then "allocsize" + If ``largeio`` is specified, a filesystem that was created with a + ``swidth`` specified will return the ``swidth`` value (in bytes) + in ``st_blksize``. 
If the filesystem does not have a ``swidth`` + specified but does specify an ``allocsize`` then ``allocsize`` (in bytes) will be returned instead. Otherwise the behaviour - is the same as if "nolargeio" was specified. + is the same as if ``nolargeio`` was specified. logbufs=value Set the number of in-memory log buffers. Valid numbers @@ -127,7 +120,7 @@ default behaviour. If the memory cost of 8 log buffers is too high on small systems, then it may be reduced at some cost to performance - on metadata intensive workloads. The logbsize option below + on metadata intensive workloads. The ``logbsize`` option below controls the size of each buffer and so is also relevant to this case. @@ -138,7 +131,7 @@ default behaviour. and 32768 (32k). Valid sizes for version 2 logs also include 65536 (64k), 131072 (128k) and 262144 (256k). The logbsize must be an integer multiple of the log - stripe unit configured at mkfs time. + stripe unit configured at **mkfs(8)** time. The default value for for version 1 logs is 32768, while the default value for version 2 logs is MAX(32768, log_sunit). @@ -153,21 +146,21 @@ default behaviour. noalign Data allocations will not be aligned at stripe unit boundaries. This is only relevant to filesystems created - with non-zero data alignment parameters (sunit, swidth) by - mkfs. + with non-zero data alignment parameters (``sunit``, ``swidth``) by + **mkfs(8)**. norecovery The filesystem will be mounted without running log recovery. If the filesystem was not cleanly unmounted, it is likely to - be inconsistent when mounted in "norecovery" mode. + be inconsistent when mounted in ``norecovery`` mode. Some files or directories may not be accessible because of this. - Filesystems mounted "norecovery" must be mounted read-only or + Filesystems mounted ``norecovery`` must be mounted read-only or the mount will fail. nouuid Don't check for double mounted file systems using the file - system uuid. This is useful to mount LVM snapshot volumes, - and often used in combination with "norecovery" for mounting + system ``uuid``. This is useful to mount LVM snapshot volumes, + and often used in combination with ``norecovery`` for mounting read-only snapshots. noquota @@ -176,15 +169,15 @@ default behaviour. uquota/usrquota/uqnoenforce/quota User disk quota accounting enabled, and limits (optionally) - enforced. Refer to xfs_quota(8) for further details. + enforced. Refer to **xfs_quota(8)** for further details. gquota/grpquota/gqnoenforce Group disk quota accounting enabled and limits (optionally) - enforced. Refer to xfs_quota(8) for further details. + enforced. Refer to **xfs_quota(8)** for further details. pquota/prjquota/pqnoenforce Project disk quota accounting enabled and limits (optionally) - enforced. Refer to xfs_quota(8) for further details. + enforced. Refer to **xfs_quota(8)** for further details. sunit=value and swidth=value Used to specify the stripe unit and width for a RAID device @@ -192,11 +185,11 @@ default behaviour. block units. These options are only relevant to filesystems that were created with non-zero data alignment parameters. - The sunit and swidth parameters specified must be compatible + The ``sunit`` and ``swidth`` parameters specified must be compatible with the existing filesystem alignment characteristics. In - general, that means the only valid changes to sunit are - increasing it by a power-of-2 multiple. Valid swidth values - are any integer multiple of a valid sunit value. 
+ general, that means the only valid changes to ``sunit`` are + increasing it by a power-of-2 multiple. Valid ``swidth`` values + are any integer multiple of a valid ``sunit`` value. Typically the only time these mount options are necessary if after an underlying RAID device has had it's geometry @@ -221,22 +214,25 @@ default behaviour. Deprecated Mount Options ======================== +=========================== ================ Name Removal Schedule - ---- ---------------- +=========================== ================ +=========================== ================ Removed Mount Options ===================== +=========================== ======= Name Removed - ---- ------- +=========================== ======= delaylog/nodelaylog v4.0 ihashsize v4.0 irixsgid v4.0 osyncisdsync/osyncisosync v4.0 barrier v4.19 nobarrier v4.19 - +=========================== ======= sysctls ======= @@ -302,27 +298,27 @@ The following sysctls are available for the XFS filesystem: fs.xfs.inherit_sync (Min: 0 Default: 1 Max: 1) Setting this to "1" will cause the "sync" flag set - by the xfs_io(8) chattr command on a directory to be + by the **xfs_io(8)** chattr command on a directory to be inherited by files in that directory. fs.xfs.inherit_nodump (Min: 0 Default: 1 Max: 1) Setting this to "1" will cause the "nodump" flag set - by the xfs_io(8) chattr command on a directory to be + by the **xfs_io(8)** chattr command on a directory to be inherited by files in that directory. fs.xfs.inherit_noatime (Min: 0 Default: 1 Max: 1) Setting this to "1" will cause the "noatime" flag set - by the xfs_io(8) chattr command on a directory to be + by the **xfs_io(8)** chattr command on a directory to be inherited by files in that directory. fs.xfs.inherit_nosymlinks (Min: 0 Default: 1 Max: 1) Setting this to "1" will cause the "nosymlinks" flag set - by the xfs_io(8) chattr command on a directory to be + by the **xfs_io(8)** chattr command on a directory to be inherited by files in that directory. fs.xfs.inherit_nodefrag (Min: 0 Default: 1 Max: 1) Setting this to "1" will cause the "nodefrag" flag set - by the xfs_io(8) chattr command on a directory to be + by the **xfs_io(8)** chattr command on a directory to be inherited by files in that directory. fs.xfs.rotorstep (Min: 1 Default: 1 Max: 256) @@ -368,7 +364,7 @@ handler: -error handlers: Defines the behavior for a specific error. -The filesystem behavior during an error can be set via sysfs files. Each +The filesystem behavior during an error can be set via ``sysfs`` files. Each error handler works independently - the first condition met by an error handler for a specific class will cause the error to be propagated rather than reset and retried. @@ -419,7 +415,7 @@ level directory: handler configurations. Note: there is no guarantee that fail_at_unmount can be set while an - unmount is in progress. It is possible that the sysfs entries are + unmount is in progress. It is possible that the ``sysfs`` entries are removed by the unmounting filesystem before a "retry forever" error handler configuration causes unmount to hang, and hence the filesystem must be configured appropriately before unmount begins to prevent @@ -428,7 +424,7 @@ level directory: Each filesystem has specific error class handlers that define the error propagation behaviour for specific errors. There is also a "default" error handler defined, which defines the behaviour for all errors that don't have -specific handlers defined. Where multiple retry constraints are configuredi for +specific handlers defined. 
Where multiple retry constraints are configured for a single error, the first retry configuration that expires will cause the error to be propagated. The handler configurations are found in the directory: @@ -463,7 +459,7 @@ to be propagated. The handler configurations are found in the directory: Setting the value to "N" (where 0 < N < Max) will allow XFS to retry the operation for up to "N" seconds before propagating the error. -Note: The default behaviour for a specific error handler is dependent on both +**Note:** The default behaviour for a specific error handler is dependent on both the class and error context. For example, the default values for "metadata/ENODEV" are "0" rather than "-1" so that this error handler defaults to "fail immediately" behaviour. This is done because ENODEV is a fatal, diff --git a/Documentation/devicetree/bindings/arm/freescale/fsl,scu.txt b/Documentation/devicetree/bindings/arm/freescale/fsl,scu.txt index f378922906f6..a575e42f7fec 100644 --- a/Documentation/devicetree/bindings/arm/freescale/fsl,scu.txt +++ b/Documentation/devicetree/bindings/arm/freescale/fsl,scu.txt @@ -145,6 +145,16 @@ Optional Child nodes: - Data cells of ocotp: Detailed bindings are described in bindings/nvmem/nvmem.txt +Watchdog bindings based on SCU Message Protocol +------------------------------------------------------------ + +Required properties: +- compatible: should be: + "fsl,imx8qxp-sc-wdt" + followed by "fsl,imx-sc-wdt"; +Optional properties: +- timeout-sec: contains the watchdog timeout in seconds. + Example (imx8qxp): ------------- aliases { @@ -207,6 +217,11 @@ firmware { rtc: rtc { compatible = "fsl,imx8qxp-sc-rtc"; }; + + watchdog { + compatible = "fsl,imx8qxp-sc-wdt", "fsl,imx-sc-wdt"; + timeout-sec = <60>; + }; }; }; diff --git a/Documentation/devicetree/bindings/watchdog/fsl-imx-sc-wdt.txt b/Documentation/devicetree/bindings/watchdog/fsl-imx-sc-wdt.txt deleted file mode 100644 index 02b87e92ae68..000000000000 --- a/Documentation/devicetree/bindings/watchdog/fsl-imx-sc-wdt.txt +++ /dev/null @@ -1,24 +0,0 @@ -* Freescale i.MX System Controller Watchdog - -i.MX system controller watchdog is for i.MX SoCs with system controller inside, -the watchdog is managed by system controller, users can ONLY communicate with -system controller from secure mode for watchdog operations, so Linux i.MX system -controller watchdog driver will call ARM SMC API and trap into ARM-Trusted-Firmware -for watchdog operations, ARM-Trusted-Firmware is running at secure EL3 mode and -it will request system controller to execute the watchdog operation passed from -Linux kernel. - -Required properties: -- compatible: Should be : - "fsl,imx8qxp-sc-wdt" - followed by "fsl,imx-sc-wdt"; - -Optional properties: -- timeout-sec : Contains the watchdog timeout in seconds. 
- -Examples: - -watchdog { - compatible = "fsl,imx8qxp-sc-wdt", "fsl,imx-sc-wdt"; - timeout-sec = <60>; -}; diff --git a/Documentation/devicetree/bindings/watchdog/renesas-wdt.txt b/Documentation/devicetree/bindings/watchdog/renesas,wdt.txt index 9f365c1a3399..9f365c1a3399 100644 --- a/Documentation/devicetree/bindings/watchdog/renesas-wdt.txt +++ b/Documentation/devicetree/bindings/watchdog/renesas,wdt.txt diff --git a/Documentation/devicetree/bindings/watchdog/sunxi-wdt.txt b/Documentation/devicetree/bindings/watchdog/sunxi-wdt.txt index 46055254e8dd..e65198d82a2b 100644 --- a/Documentation/devicetree/bindings/watchdog/sunxi-wdt.txt +++ b/Documentation/devicetree/bindings/watchdog/sunxi-wdt.txt @@ -6,6 +6,7 @@ Required properties: "allwinner,sun4i-a10-wdt" "allwinner,sun6i-a31-wdt" "allwinner,sun50i-a64-wdt","allwinner,sun6i-a31-wdt" + "allwinner,sun50i-h6-wdt","allwinner,sun6i-a31-wdt" "allwinner,suniv-f1c100s-wdt", "allwinner,sun4i-a10-wdt" - reg : Specifies base physical address and size of the registers. diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt index 6d2c0d340dea..679729442fd2 100644 --- a/Documentation/filesystems/dax.txt +++ b/Documentation/filesystems/dax.txt @@ -76,7 +76,7 @@ exposure of uninitialized data through mmap. These filesystems may be used for inspiration: - ext2: see Documentation/filesystems/ext2.txt - ext4: see Documentation/filesystems/ext4/ -- xfs: see Documentation/filesystems/xfs.txt +- xfs: see Documentation/admin-guide/xfs.rst Handling Media Errors diff --git a/Documentation/riscv/boot-image-header.txt b/Documentation/riscv/boot-image-header.txt new file mode 100644 index 000000000000..1b73fea23b39 --- /dev/null +++ b/Documentation/riscv/boot-image-header.txt @@ -0,0 +1,50 @@ + Boot image header in RISC-V Linux + ============================================= + +Author: Atish Patra <atish.patra@wdc.com> +Date : 20 May 2019 + +This document only describes the boot image header details for RISC-V Linux. +The complete booting guide will be available at Documentation/riscv/booting.txt. + +The following 64-byte header is present in the decompressed Linux kernel image. + + u32 code0; /* Executable code */ + u32 code1; /* Executable code */ + u64 text_offset; /* Image load offset, little endian */ + u64 image_size; /* Effective Image size, little endian */ + u64 flags; /* kernel flags, little endian */ + u32 version; /* Version of this header */ + u32 res1 = 0; /* Reserved */ + u64 res2 = 0; /* Reserved */ + u64 magic = 0x5643534952; /* Magic number, little endian, "RISCV" */ + u32 res3; /* Reserved for additional RISC-V specific header */ + u32 res4; /* Reserved for PE COFF offset */ + +This header format is compliant with the PE/COFF header and largely inspired by +the ARM64 header. Thus, both the ARM64 and RISC-V headers can be combined into one +common header in the future. + +Notes: +- This header can also be reused to support an EFI stub for RISC-V in the future. + The EFI specification needs a PE/COFF image header at the beginning of the kernel + image in order to load it as an EFI application. In order to support the EFI stub, + code0 should be replaced with the "MZ" magic string and res4 (at offset 0x3c) should + point to the rest of the PE/COFF header. + +- The version field indicates the header version number. + Bits 0:15 - Minor version + Bits 16:31 - Major version + + This preserves compatibility across newer and older versions of the header. + The current version is defined as 0.1. + +- res3 is reserved for an offset to any other additional fields.
This makes the + header extensible in the future. One example would be to accommodate ISA + extensions for RISC-V in the future. For the current version, it is set to zero. + +- In the current header, the flags field has only one flag defined. + Bit 0: Kernel endianness. 1 if BE, 0 if LE. + +- Image size is mandatory for the boot loader to load the kernel image. Booting will + fail otherwise. diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst index 7d2b0178d3f3..fbb314bfa112 100644 --- a/Documentation/trace/kprobetrace.rst +++ b/Documentation/trace/kprobetrace.rst @@ -51,15 +51,17 @@ Synopsis of kprobe_events $argN : Fetch the Nth function argument. (N >= 1) (\*1) $retval : Fetch return value.(\*2) $comm : Fetch current task comm. - +|-offs(FETCHARG) : Fetch memory at FETCHARG +|- offs address.(\*3) + +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4) NAME=FETCHARG : Set NAME as the argument name of FETCHARG. FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types - (x8/x16/x32/x64), "string" and bitfield are supported. + (x8/x16/x32/x64), "string", "ustring" and bitfield + are supported. (\*1) only for the probe on function entry (offs == 0). (\*2) only for return probe. (\*3) this is useful for fetching a field of data structures. + (\*4) "u" means user-space dereference. See :ref:`user_mem_access`. Types ----- @@ -77,7 +79,8 @@ apply it to registers/stack-entries etc. (for example, '$stack1:x8[8]' is wrong, but '+8($stack):x8[8]' is OK.) String type is a special type, which fetches a "null-terminated" string from kernel space. This means it will fail and store NULL if the string container -has been paged out. +has been paged out. "ustring" type is an alternative to string for user-space. +See :ref:`user_mem_access` for more info. The string array type is a bit different from other types. For other base types, <base-type>[1] is equal to <base-type> (e.g. +0(%di):x32[1] is same as +0(%di):x32.) But string[1] is not equal to string. The string type itself @@ -92,6 +95,25 @@ Symbol type('symbol') is an alias of u32 or u64 type (depends on BITS_PER_LONG) which shows given pointer in "symbol+offset" style. For $comm, the default type is "string"; any other type is invalid. +.. _user_mem_access: +User Memory Access +------------------ +Kprobe events support user-space memory access. For that purpose, you can use +either the user-space dereference syntax or the 'ustring' type. + +The user-space dereference syntax allows you to access a field of a data +structure in user-space. This is done by adding the "u" prefix to the +dereference syntax. For example, +u4(%si) means it will read memory from the +address in the register %si offset by 4, and the memory is expected to be in +user-space. You can use this for strings too, e.g. +u0(%si):string will read +a string from the address in the register %si that is expected to be in +user-space. 'ustring' is a shortcut way of performing the same task. That is, ++0(%si):ustring is equivalent to +u0(%si):string. + +Note that kprobe-event provides the user-memory access syntax but it doesn't +use it transparently. This means if you use a normal dereference or string type +for user memory, it might fail, and may always fail on some archs. The user +has to carefully check if the target data is in kernel or user space.
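To make the equivalence above concrete, here is a minimal tracefs sketch. The event name "uopen" is illustrative, and mapping the filename to $arg2 assumes the do_sys_open(int dfd, const char __user *filename, ...) signature used in the usage examples below:

    echo 'p:uopen do_sys_open path=+0($arg2):ustring' >> /sys/kernel/debug/tracing/kprobe_events
    # Equivalent, spelled with the explicit user-space dereference prefix:
    # echo 'p:uopen do_sys_open path=+u0($arg2):string' >> /sys/kernel/debug/tracing/kprobe_events
    echo 1 > /sys/kernel/debug/tracing/events/kprobes/uopen/enable

Either spelling fetches the same null-terminated string from user space; per the note above, a plain "string" fetch of a user address may happen to work on some architectures, but only the "u" form is reliable.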
Per-Probe Event Filtering ------------------------- @@ -124,6 +146,20 @@ You can check the total number of probe hits and probe miss-hits via The first column is event name, the second is the number of probe hits, the third is the number of probe miss-hits. +Kernel Boot Parameter +--------------------- +You can add and enable new kprobe events when booting up the kernel via the +"kprobe_event=" parameter. The parameter accepts a semicolon-delimited +list of kprobe event definitions, whose format is similar to the kprobe_events +interface. The difference is that the probe definition parameters are +comma-delimited instead of space-delimited. For example, adding a myprobe +event on do_sys_open like below + + p:myprobe do_sys_open dfd=%ax filename=%dx flags=%cx mode=+4($stack) + +becomes the following kernel boot parameter (just replace the spaces with commas) + + p:myprobe,do_sys_open,dfd=%ax,filename=%dx,flags=%cx,mode=+4($stack) + Usage examples -------------- diff --git a/Documentation/trace/uprobetracer.rst b/Documentation/trace/uprobetracer.rst index 0b21305fabdc..6e75a6c5a2c8 100644 --- a/Documentation/trace/uprobetracer.rst +++ b/Documentation/trace/uprobetracer.rst @@ -42,16 +42,18 @@ Synopsis of uprobe_tracer @+OFFSET : Fetch memory at OFFSET (OFFSET from same file as PATH) $stackN : Fetch Nth entry of stack (N >= 0) $stack : Fetch stack address. - $retval : Fetch return value.(*) + $retval : Fetch return value.(\*1) $comm : Fetch current task comm. - +|-offs(FETCHARG) : Fetch memory at FETCHARG +|- offs address.(**) + +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*2)(\*3) NAME=FETCHARG : Set NAME as the argument name of FETCHARG. FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types (x8/x16/x32/x64), "string" and bitfield are supported. - (*) only for return probe. - (**) this is useful for fetching a field of data structures. + (\*1) only for return probe. + (\*2) this is useful for fetching a field of data structures. + (\*3) Unlike kprobe events, the "u" prefix will just be ignored, because uprobe + events can access only user-space memory. Types ----- diff --git a/Documentation/watchdog/hpwdt.rst b/Documentation/watchdog/hpwdt.rst index 94a96371113e..c165d92cfd12 100644 --- a/Documentation/watchdog/hpwdt.rst +++ b/Documentation/watchdog/hpwdt.rst @@ -39,6 +39,10 @@ Last reviewed: 08/20/2018 Default value is set when compiling the kernel. If it is set to "Y", then there is no way of disabling the watchdog once it has been started. + kdumptimeout Minimum timeout in seconds to apply upon receipt of an NMI + before calling panic. (-1) disables the watchdog. When the value + is > 0, the timer is reprogrammed with the greater of the + value or the current timeout value. ============ ================================================================ NOTE: diff --git a/Documentation/watchdog/watchdog-parameters.rst b/Documentation/watchdog/watchdog-parameters.rst index b121caae7798..a3985cc5aeda 100644 --- a/Documentation/watchdog/watchdog-parameters.rst +++ b/Documentation/watchdog/watchdog-parameters.rst @@ -13,6 +13,17 @@ modules. ------------------------------------------------- +watchdog core: + open_timeout: + Maximum time, in seconds, for which the watchdog framework will take + care of pinging a running hardware watchdog until userspace opens the + corresponding /dev/watchdogN device. A value of 0 means an infinite + timeout.
Setting this to a non-zero value can be useful to ensure that + either userspace comes up properly, or the board gets reset and allows + fallback logic in the bootloader to try something else. + +------------------------------------------------- + acquirewdt: wdt_stop: Acquire WDT 'stop' io port (default 0x43) diff --git a/MAINTAINERS b/MAINTAINERS index d51808468713..500cdb68ccbc 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -3765,7 +3765,7 @@ F: arch/powerpc/platforms/cell/ CEPH COMMON CODE (LIBCEPH) M: Ilya Dryomov <idryomov@gmail.com> -M: "Yan, Zheng" <zyan@redhat.com> +M: Jeff Layton <jlayton@kernel.org> M: Sage Weil <sage@redhat.com> L: ceph-devel@vger.kernel.org W: http://ceph.com/ @@ -3777,7 +3777,7 @@ F: include/linux/ceph/ F: include/linux/crush/ CEPH DISTRIBUTED FILE SYSTEM CLIENT (CEPH) -M: "Yan, Zheng" <zyan@redhat.com> +M: Jeff Layton <jlayton@kernel.org> M: Sage Weil <sage@redhat.com> M: Ilya Dryomov <idryomov@gmail.com> L: ceph-devel@vger.kernel.org @@ -13720,7 +13720,7 @@ RISC-V ARCHITECTURE M: Palmer Dabbelt <palmer@sifive.com> M: Albert Ou <aou@eecs.berkeley.edu> L: linux-riscv@lists.infradead.org -T: git git://git.kernel.org/pub/scm/linux/kernel/git/palmer/riscv-linux.git +T: git git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux.git S: Supported F: arch/riscv/ K: riscv @@ -14582,7 +14582,7 @@ M: Paul Walmsley <paul.walmsley@sifive.com> L: linux-riscv@lists.infradead.org T: git git://github.com/sifive/riscv-linux.git S: Supported -K: sifive +K: [^@]sifive N: sifive SIFIVE FU540 SYSTEM-ON-CHIP @@ -17651,9 +17651,8 @@ L: linux-xfs@vger.kernel.org W: http://xfs.org/ T: git git://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git S: Supported -F: Documentation/filesystems/xfs.txt +F: Documentation/admin-guide/xfs.rst F: Documentation/ABI/testing/sysfs-fs-xfs -F: Documentation/filesystems/xfs.txt F: Documentation/filesystems/xfs-delayed-logging-design.txt F: Documentation/filesystems/xfs-self-describing-metadata.txt F: fs/xfs/ diff --git a/arch/Kconfig b/arch/Kconfig index e8d19c3cb91f..ac0fba400ded 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -128,22 +128,6 @@ config UPROBES managed by the kernel and kept transparent to the probed application. ) -config HAVE_64BIT_ALIGNED_ACCESS - def_bool 64BIT && !HAVE_EFFICIENT_UNALIGNED_ACCESS - help - Some architectures require 64 bit accesses to be 64 bit - aligned, which also requires structs containing 64 bit values - to be 64 bit aligned too. This includes some 32 bit - architectures which can do 64 bit accesses, as well as 64 bit - architectures without unaligned access. - - This symbol should be selected by an architecture if 64 bit - accesses are required to be 64 bit aligned in this way even - though it is not a 64 bit architecture. - - See Documentation/unaligned-memory-access.txt for more - information on the topic of unaligned memory accesses. 
- config HAVE_EFFICIENT_UNALIGNED_ACCESS bool help @@ -585,6 +569,9 @@ config HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD config HAVE_ARCH_HUGE_VMAP bool +config ARCH_WANT_HUGE_PMD_SHARE + bool + config HAVE_ARCH_SOFT_DIRTY bool diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c index b3d439c41c7b..deef17f34bd2 100644 --- a/arch/arm/kernel/module.c +++ b/arch/arm/kernel/module.c @@ -55,6 +55,13 @@ void *module_alloc(unsigned long size) } #endif +bool module_exit_section(const char *name) +{ + return strstarts(name, ".exit") || + strstarts(name, ".ARM.extab.exit") || + strstarts(name, ".ARM.exidx.exit"); +} + int apply_relocate(Elf32_Shdr *sechdrs, const char *strtab, unsigned int symindex, unsigned int relindex, struct module *module) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index e1ea69994e0f..3adcec05b1f6 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -73,6 +73,7 @@ config ARM64 select ARCH_SUPPORTS_NUMA_BALANCING select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT select ARCH_WANT_FRAME_POINTERS + select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36) select ARCH_HAS_UBSAN_SANITIZE_ALL select ARM_AMBA select ARM_ARCH_TIMER @@ -906,7 +907,6 @@ config SYS_SUPPORTS_HUGETLBFS def_bool y config ARCH_WANT_HUGE_PMD_SHARE - def_bool y if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36) config ARCH_HAS_CACHE_LINE_SIZE def_bool y diff --git a/arch/parisc/include/asm/unistd.h b/arch/parisc/include/asm/unistd.h index b0838dc4dfee..cd438e4150f6 100644 --- a/arch/parisc/include/asm/unistd.h +++ b/arch/parisc/include/asm/unistd.h @@ -166,6 +166,7 @@ type name(type1 arg1, type2 arg2, type3 arg3, type4 arg4, type5 arg5) \ #define __ARCH_WANT_SYS_FORK #define __ARCH_WANT_SYS_VFORK #define __ARCH_WANT_SYS_CLONE +#define __ARCH_WANT_SYS_CLONE3 #define __ARCH_WANT_COMPAT_SYS_SENDFILE #ifdef CONFIG_64BIT diff --git a/arch/parisc/kernel/entry.S b/arch/parisc/kernel/entry.S index 3e430590c1e1..d9d3387f7c47 100644 --- a/arch/parisc/kernel/entry.S +++ b/arch/parisc/kernel/entry.S @@ -1732,6 +1732,7 @@ ENDPROC_CFI(sys_\name\()_wrapper) .endm fork_like clone +fork_like clone3 fork_like fork fork_like vfork diff --git a/arch/parisc/kernel/kprobes.c b/arch/parisc/kernel/kprobes.c index d58960b33bda..5d7f2692ac5a 100644 --- a/arch/parisc/kernel/kprobes.c +++ b/arch/parisc/kernel/kprobes.c @@ -133,6 +133,9 @@ int __kprobes parisc_kprobe_ss_handler(struct pt_regs *regs) struct kprobe_ctlblk *kcb = get_kprobe_ctlblk(); struct kprobe *p = kprobe_running(); + if (!p) + return 0; + if (regs->iaoq[0] != (unsigned long)p->ainsn.insn+4) return 0; diff --git a/arch/parisc/kernel/ptrace.c b/arch/parisc/kernel/ptrace.c index f642ba378ffa..9f6ff7bc06f9 100644 --- a/arch/parisc/kernel/ptrace.c +++ b/arch/parisc/kernel/ptrace.c @@ -167,6 +167,9 @@ long arch_ptrace(struct task_struct *child, long request, if ((addr & (sizeof(unsigned long)-1)) || addr >= sizeof(struct pt_regs)) break; + if (addr == PT_IAOQ0 || addr == PT_IAOQ1) { + data |= 3; /* ensure userspace privilege */ + } if ((addr >= PT_GR1 && addr <= PT_GR31) || addr == PT_IAOQ0 || addr == PT_IAOQ1 || (addr >= PT_FR0 && addr <= PT_FR31 + 4) || @@ -228,16 +231,18 @@ long arch_ptrace(struct task_struct *child, long request, static compat_ulong_t translate_usr_offset(compat_ulong_t offset) { - if (offset < 0) - return sizeof(struct pt_regs); - else if (offset <= 32*4) /* gr[0..31] */ - return offset * 2 + 4; - else if (offset <= 32*4+32*8) /* gr[0..31] + fr[0..31] */ - return offset + 32*4; - else if 
(offset < sizeof(struct pt_regs)/2 + 32*4) - return offset * 2 + 4 - 32*8; + compat_ulong_t pos; + + if (offset < 32*4) /* gr[0..31] */ + pos = offset * 2 + 4; + else if (offset < 32*4+32*8) /* fr[0] ... fr[31] */ + pos = (offset - 32*4) + PT_FR0; + else if (offset < sizeof(struct pt_regs)/2 + 32*4) /* sr[0] ... ipsw */ + pos = (offset - 32*4 - 32*8) * 2 + PT_SR0 + 4; else - return sizeof(struct pt_regs); + pos = sizeof(struct pt_regs); + + return pos; } long compat_arch_ptrace(struct task_struct *child, compat_long_t request, @@ -281,9 +286,12 @@ long compat_arch_ptrace(struct task_struct *child, compat_long_t request, addr = translate_usr_offset(addr); if (addr >= sizeof(struct pt_regs)) break; + if (addr == PT_IAOQ0+4 || addr == PT_IAOQ1+4) { + data |= 3; /* ensure userspace privilege */ + } if (addr >= PT_FR0 && addr <= PT_FR31 + 4) { /* Special case, fp regs are 64 bits anyway */ - *(__u64 *) ((char *) task_regs(child) + addr) = data; + *(__u32 *) ((char *) task_regs(child) + addr) = data; ret = 0; } else if ((addr >= PT_GR1+4 && addr <= PT_GR31+4) || @@ -496,7 +504,8 @@ static void set_reg(struct pt_regs *regs, int num, unsigned long val) return; case RI(iaoq[0]): case RI(iaoq[1]): - regs->iaoq[num - RI(iaoq[0])] = val; + /* set 2 lowest bits to ensure userspace privilege: */ + regs->iaoq[num - RI(iaoq[0])] = val | 3; return; case RI(sar): regs->sar = val; return; diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl index c7aadfef5386..670d1371aca1 100644 --- a/arch/parisc/kernel/syscalls/syscall.tbl +++ b/arch/parisc/kernel/syscalls/syscall.tbl @@ -431,4 +431,4 @@ 432 common fsmount sys_fsmount 433 common fspick sys_fspick 434 common pidfd_open sys_pidfd_open -# 435 reserved for clone3 +435 common clone3 sys_clone3_wrapper diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig index 13a1c0d04e9e..59a4727ecd6c 100644 --- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@ -52,6 +52,8 @@ config RISCV select ARCH_HAS_MMIOWB select HAVE_EBPF_JIT if 64BIT select EDAC_SUPPORT + select ARCH_HAS_GIGANTIC_PAGE + select ARCH_WANT_HUGE_PMD_SHARE if 64BIT config MMU def_bool y @@ -66,6 +68,12 @@ config PAGE_OFFSET default 0xffffffff80000000 if 64BIT && MAXPHYSMEM_2GB default 0xffffffe000000000 if 64BIT && MAXPHYSMEM_128GB +config ARCH_WANT_GENERAL_HUGETLB + def_bool y + +config SYS_SUPPORTS_HUGETLBFS + def_bool y + config STACKTRACE_SUPPORT def_bool y @@ -97,6 +105,8 @@ config PGTABLE_LEVELS default 3 if 64BIT default 2 +source "arch/riscv/Kconfig.socs" + menu "Platform type" choice diff --git a/arch/riscv/Kconfig.socs b/arch/riscv/Kconfig.socs new file mode 100644 index 000000000000..536c0ef4aee8 --- /dev/null +++ b/arch/riscv/Kconfig.socs @@ -0,0 +1,13 @@ +menu "SoC selection" + +config SOC_SIFIVE + bool "SiFive SoCs" + select SERIAL_SIFIVE + select SERIAL_SIFIVE_CONSOLE + select CLK_SIFIVE + select CLK_SIFIVE_FU540_PRCI + select SIFIVE_PLIC + help + This enables support for SiFive SoC platform hardware. 
+ +endmenu diff --git a/arch/riscv/boot/dts/sifive/Makefile b/arch/riscv/boot/dts/sifive/Makefile index baaeef9efdcb..6d6189e6e4af 100644 --- a/arch/riscv/boot/dts/sifive/Makefile +++ b/arch/riscv/boot/dts/sifive/Makefile @@ -1,2 +1,2 @@ # SPDX-License-Identifier: GPL-2.0 -dtb-y += hifive-unleashed-a00.dtb +dtb-$(CONFIG_SOC_SIFIVE) += hifive-unleashed-a00.dtb diff --git a/arch/riscv/configs/defconfig b/arch/riscv/configs/defconfig index 04944fb4fa7a..b7b749b18853 100644 --- a/arch/riscv/configs/defconfig +++ b/arch/riscv/configs/defconfig @@ -1,5 +1,7 @@ CONFIG_SYSVIPC=y CONFIG_POSIX_MQUEUE=y +CONFIG_NO_HZ_IDLE=y +CONFIG_HIGH_RES_TIMERS=y CONFIG_IKCONFIG=y CONFIG_IKCONFIG_PROC=y CONFIG_CGROUPS=y @@ -12,6 +14,7 @@ CONFIG_CHECKPOINT_RESTORE=y CONFIG_BLK_DEV_INITRD=y CONFIG_EXPERT=y CONFIG_BPF_SYSCALL=y +CONFIG_SOC_SIFIVE=y CONFIG_SMP=y CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y @@ -49,8 +52,6 @@ CONFIG_SERIAL_8250=y CONFIG_SERIAL_8250_CONSOLE=y CONFIG_SERIAL_OF_PLATFORM=y CONFIG_SERIAL_EARLYCON_RISCV_SBI=y -CONFIG_SERIAL_SIFIVE=y -CONFIG_SERIAL_SIFIVE_CONSOLE=y CONFIG_HVC_RISCV_SBI=y # CONFIG_PTP_1588_CLOCK is not set CONFIG_DRM=y @@ -66,9 +67,6 @@ CONFIG_USB_OHCI_HCD_PLATFORM=y CONFIG_USB_STORAGE=y CONFIG_USB_UAS=y CONFIG_VIRTIO_MMIO=y -CONFIG_CLK_SIFIVE=y -CONFIG_CLK_SIFIVE_FU540_PRCI=y -CONFIG_SIFIVE_PLIC=y CONFIG_SPI_SIFIVE=y CONFIG_EXT4_FS=y CONFIG_EXT4_FS_POSIX_ACL=y diff --git a/arch/riscv/configs/rv32_defconfig b/arch/riscv/configs/rv32_defconfig index 1a911ed8e772..d5449ef805a3 100644 --- a/arch/riscv/configs/rv32_defconfig +++ b/arch/riscv/configs/rv32_defconfig @@ -1,5 +1,7 @@ CONFIG_SYSVIPC=y CONFIG_POSIX_MQUEUE=y +CONFIG_NO_HZ_IDLE=y +CONFIG_HIGH_RES_TIMERS=y CONFIG_IKCONFIG=y CONFIG_IKCONFIG_PROC=y CONFIG_CGROUPS=y diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h index ad8678f1b54a..555b20b11dc3 100644 --- a/arch/riscv/include/asm/cacheflush.h +++ b/arch/riscv/include/asm/cacheflush.h @@ -6,11 +6,66 @@ #ifndef _ASM_RISCV_CACHEFLUSH_H #define _ASM_RISCV_CACHEFLUSH_H -#include <asm-generic/cacheflush.h> +#include <linux/mm.h> -#undef flush_icache_range -#undef flush_icache_user_range -#undef flush_dcache_page +#define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 0 + +/* + * The cache doesn't need to be flushed when TLB entries change when + * the cache is mapped to physical memory, not virtual memory + */ +static inline void flush_cache_all(void) +{ +} + +static inline void flush_cache_mm(struct mm_struct *mm) +{ +} + +static inline void flush_cache_dup_mm(struct mm_struct *mm) +{ +} + +static inline void flush_cache_range(struct vm_area_struct *vma, + unsigned long start, + unsigned long end) +{ +} + +static inline void flush_cache_page(struct vm_area_struct *vma, + unsigned long vmaddr, + unsigned long pfn) +{ +} + +static inline void flush_dcache_mmap_lock(struct address_space *mapping) +{ +} + +static inline void flush_dcache_mmap_unlock(struct address_space *mapping) +{ +} + +static inline void flush_icache_page(struct vm_area_struct *vma, + struct page *page) +{ +} + +static inline void flush_cache_vmap(unsigned long start, unsigned long end) +{ +} + +static inline void flush_cache_vunmap(unsigned long start, unsigned long end) +{ +} + +#define copy_to_user_page(vma, page, vaddr, dst, src, len) \ + do { \ + memcpy(dst, src, len); \ + flush_icache_user_range(vma, page, vaddr, len); \ + } while (0) +#define copy_from_user_page(vma, page, vaddr, dst, src, len) \ + memcpy(dst, src, len) static inline void local_flush_icache_all(void) { diff --git 
a/arch/riscv/include/asm/fixmap.h b/arch/riscv/include/asm/fixmap.h index c207f6634b91..9c66033c3a54 100644 --- a/arch/riscv/include/asm/fixmap.h +++ b/arch/riscv/include/asm/fixmap.h @@ -21,6 +21,11 @@ */ enum fixed_addresses { FIX_HOLE, +#define FIX_FDT_SIZE SZ_1M + FIX_FDT_END, + FIX_FDT = FIX_FDT_END + FIX_FDT_SIZE / PAGE_SIZE - 1, + FIX_PTE, + FIX_PMD, FIX_EARLYCON_MEM_BASE, __end_of_fixed_addresses }; diff --git a/arch/riscv/include/asm/hugetlb.h b/arch/riscv/include/asm/hugetlb.h new file mode 100644 index 000000000000..728a5db66597 --- /dev/null +++ b/arch/riscv/include/asm/hugetlb.h @@ -0,0 +1,18 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_RISCV_HUGETLB_H +#define _ASM_RISCV_HUGETLB_H + +#include <asm-generic/hugetlb.h> +#include <asm/page.h> + +static inline int is_hugepage_only_range(struct mm_struct *mm, + unsigned long addr, + unsigned long len) { + return 0; +} + +static inline void arch_clear_hugepage_flags(struct page *page) +{ +} + +#endif /* _ASM_RISCV_HUGETLB_H */ diff --git a/arch/riscv/include/asm/image.h b/arch/riscv/include/asm/image.h new file mode 100644 index 000000000000..ef28e106f247 --- /dev/null +++ b/arch/riscv/include/asm/image.h @@ -0,0 +1,65 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#ifndef __ASM_IMAGE_H +#define __ASM_IMAGE_H + +#define RISCV_IMAGE_MAGIC "RISCV" + +#define RISCV_IMAGE_FLAG_BE_SHIFT 0 +#define RISCV_IMAGE_FLAG_BE_MASK 0x1 + +#define RISCV_IMAGE_FLAG_LE 0 +#define RISCV_IMAGE_FLAG_BE 1 + +#ifdef CONFIG_CPU_BIG_ENDIAN +#error conversion of header fields to LE not yet implemented +#else +#define __HEAD_FLAG_BE RISCV_IMAGE_FLAG_LE +#endif + +#define __HEAD_FLAG(field) (__HEAD_FLAG_##field << \ + RISCV_IMAGE_FLAG_##field##_SHIFT) + +#define __HEAD_FLAGS (__HEAD_FLAG(BE)) + +#define RISCV_HEADER_VERSION_MAJOR 0 +#define RISCV_HEADER_VERSION_MINOR 1 + +#define RISCV_HEADER_VERSION (RISCV_HEADER_VERSION_MAJOR << 16 | \ + RISCV_HEADER_VERSION_MINOR) + +#ifndef __ASSEMBLY__ +/** + * struct riscv_image_header - riscv kernel image header + * @code0: Executable code + * @code1: Executable code + * @text_offset: Image load offset (little endian) + * @image_size: Effective Image size (little endian) + * @flags: kernel flags (little endian) + * @version: version + * @res1: reserved + * @res2: reserved + * @magic: Magic number + * @res3: reserved (will be used for additional RISC-V specific + * header) + * @res4: reserved (will be used for PE COFF offset) + * + * The intention is for this header format to be shared between multiple + * architectures to avoid a proliferation of image header formats. + */ + +struct riscv_image_header { + u32 code0; + u32 code1; + u64 text_offset; + u64 image_size; + u64 flags; + u32 version; + u32 res1; + u64 res2; + u64 magic; + u32 res3; + u32 res4; +}; +#endif /* __ASSEMBLY__ */ +#endif /* __ASM_IMAGE_H */ diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h index 8ddb6c7fedac..707e00a8430b 100644 --- a/arch/riscv/include/asm/page.h +++ b/arch/riscv/include/asm/page.h @@ -16,6 +16,16 @@ #define PAGE_SIZE (_AC(1, UL) << PAGE_SHIFT) #define PAGE_MASK (~(PAGE_SIZE - 1)) +#ifdef CONFIG_64BIT +#define HUGE_MAX_HSTATE 2 +#else +#define HUGE_MAX_HSTATE 1 +#endif +#define HPAGE_SHIFT PMD_SHIFT +#define HPAGE_SIZE (_AC(1, UL) << HPAGE_SHIFT) +#define HPAGE_MASK (~(HPAGE_SIZE - 1)) +#define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT) + /* * PAGE_OFFSET -- the first address of the first page of memory. 
* When not using MMU this corresponds to the first free page in @@ -115,8 +125,4 @@ extern unsigned long min_low_pfn; #include <asm-generic/memory_model.h> #include <asm-generic/getorder.h> -/* vDSO support */ -/* We do define AT_SYSINFO_EHDR but don't use the gate mechanism */ -#define __HAVE_ARCH_GATE_AREA - #endif /* _ASM_RISCV_PAGE_H */ diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h index 45dfac2ac51f..74630989006d 100644 --- a/arch/riscv/include/asm/pgtable-64.h +++ b/arch/riscv/include/asm/pgtable-64.h @@ -70,6 +70,11 @@ static inline pmd_t pfn_pmd(unsigned long pfn, pgprot_t prot) return __pmd((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot)); } +static inline unsigned long _pmd_pfn(pmd_t pmd) +{ + return pmd_val(pmd) >> _PAGE_PFN_SHIFT; +} + #define pmd_ERROR(e) \ pr_err("%s:%d: bad pmd %016lx.\n", __FILE__, __LINE__, pmd_val(e)) diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h index f7c3f7de15f2..a364aba23d55 100644 --- a/arch/riscv/include/asm/pgtable.h +++ b/arch/riscv/include/asm/pgtable.h @@ -59,6 +59,8 @@ #define PAGE_KERNEL __pgprot(_PAGE_KERNEL) #define PAGE_KERNEL_EXEC __pgprot(_PAGE_KERNEL | _PAGE_EXEC) +#define PAGE_TABLE __pgprot(_PAGE_TABLE) + extern pgd_t swapper_pg_dir[]; /* MAP_PRIVATE permissions: xwr (copy-on-write) */ @@ -113,12 +115,16 @@ static inline void pmd_clear(pmd_t *pmdp) set_pmd(pmdp, __pmd(0)); } - static inline pgd_t pfn_pgd(unsigned long pfn, pgprot_t prot) { return __pgd((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot)); } +static inline unsigned long _pgd_pfn(pgd_t pgd) +{ + return pgd_val(pgd) >> _PAGE_PFN_SHIFT; +} + #define pgd_index(addr) (((addr) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1)) /* Locate an entry in the page global directory */ @@ -250,6 +256,11 @@ static inline pte_t pte_mkspecial(pte_t pte) return __pte(pte_val(pte) | _PAGE_SPECIAL); } +static inline pte_t pte_mkhuge(pte_t pte) +{ + return pte; +} + /* Modify page protection bits */ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) { @@ -396,6 +407,7 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma, #define kern_addr_valid(addr) (1) /* FIXME */ #endif +extern void *dtb_early_va; extern void setup_bootmem(void); extern void paging_init(void); @@ -409,7 +421,7 @@ static inline void pgtable_cache_init(void) #define VMALLOC_START (PAGE_OFFSET - VMALLOC_SIZE) /* - * Task size is 0x40000000000 for RV64 or 0xb800000 for RV32. + * Task size is 0x4000000000 for RV64 or 0xb800000 for RV32. * Note that PGDIR_SIZE must evenly divide TASK_SIZE. */ #ifdef CONFIG_64BIT diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S index 4e46f31072da..0f1ba17e476f 100644 --- a/arch/riscv/kernel/head.S +++ b/arch/riscv/kernel/head.S @@ -11,9 +11,41 @@ #include <asm/thread_info.h> #include <asm/page.h> #include <asm/csr.h> +#include <asm/image.h> __INIT ENTRY(_start) + /* + * Image header expected by Linux boot-loaders. The image header data + * structure is described in asm/image.h. + * Do not modify it without modifying the structure and all bootloaders + * that expects this header format!! 
+ */ + /* jump to start kernel */ + j _start_kernel + /* reserved */ + .word 0 + .balign 8 +#if __riscv_xlen == 64 + /* Image load offset(2MB) from start of RAM */ + .dword 0x200000 +#else + /* Image load offset(4MB) from start of RAM */ + .dword 0x400000 +#endif + /* Effective size of kernel image */ + .dword _end - _start + .dword __HEAD_FLAGS + .word RISCV_HEADER_VERSION + .word 0 + .dword 0 + .asciz RISCV_IMAGE_MAGIC + .word 0 + .balign 4 + .word 0 + +.global _start_kernel +_start_kernel: /* Mask all interrupts */ csrw CSR_SIE, zero csrw CSR_SIP, zero @@ -55,7 +87,9 @@ clear_bss_done: /* Initialize page tables and relocate to virtual addresses */ la sp, init_thread_union + THREAD_SIZE + mv a0, s1 call setup_vm + la a0, early_pg_dir call relocate /* Restore C environment */ @@ -64,25 +98,23 @@ clear_bss_done: la sp, init_thread_union + THREAD_SIZE /* Start the kernel */ - mv a0, s1 call parse_dtb tail start_kernel relocate: /* Relocate return address */ li a1, PAGE_OFFSET - la a0, _start - sub a1, a1, a0 + la a2, _start + sub a1, a1, a2 add ra, ra, a1 /* Point stvec to virtual address of intruction after satp write */ - la a0, 1f - add a0, a0, a1 - csrw CSR_STVEC, a0 + la a2, 1f + add a2, a2, a1 + csrw CSR_STVEC, a2 /* Compute satp for kernel page tables, but don't load it yet */ - la a2, swapper_pg_dir - srl a2, a2, PAGE_SHIFT + srl a2, a0, PAGE_SHIFT li a1, SATP_MODE or a2, a2, a1 @@ -148,6 +180,7 @@ relocate: fence /* Enable virtual memory and relocate to virtual address */ + la a0, swapper_pg_dir call relocate tail smp_callin diff --git a/arch/riscv/kernel/setup.c b/arch/riscv/kernel/setup.c index b92e6831d1ec..a990a6cb184f 100644 --- a/arch/riscv/kernel/setup.c +++ b/arch/riscv/kernel/setup.c @@ -39,11 +39,9 @@ struct screen_info screen_info = { atomic_t hart_lottery; unsigned long boot_cpu_hartid; -void __init parse_dtb(phys_addr_t dtb_phys) +void __init parse_dtb(void) { - void *dtb = __va(dtb_phys); - - if (early_init_dt_scan(dtb)) + if (early_init_dt_scan(dtb_early_va)) return; pr_err("No DTB passed to the kernel\n"); diff --git a/arch/riscv/kernel/vdso.c b/arch/riscv/kernel/vdso.c index a0084c36d270..c9c21e0d5641 100644 --- a/arch/riscv/kernel/vdso.c +++ b/arch/riscv/kernel/vdso.c @@ -92,22 +92,3 @@ const char *arch_vma_name(struct vm_area_struct *vma) return "[vdso]"; return NULL; } - -/* - * Function stubs to prevent linker errors when AT_SYSINFO_EHDR is defined - */ - -int in_gate_area_no_mm(unsigned long addr) -{ - return 0; -} - -int in_gate_area(struct mm_struct *mm, unsigned long addr) -{ - return 0; -} - -struct vm_area_struct *get_gate_vma(struct mm_struct *mm) -{ - return NULL; -} diff --git a/arch/riscv/mm/Makefile b/arch/riscv/mm/Makefile index fc51d3b7876e..74055e1d6f21 100644 --- a/arch/riscv/mm/Makefile +++ b/arch/riscv/mm/Makefile @@ -12,3 +12,5 @@ obj-y += ioremap.o obj-y += cacheflush.o obj-y += context.o obj-y += sifive_l2_cache.o + +obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o diff --git a/arch/riscv/mm/hugetlbpage.c b/arch/riscv/mm/hugetlbpage.c new file mode 100644 index 000000000000..0d4747e9d5b5 --- /dev/null +++ b/arch/riscv/mm/hugetlbpage.c @@ -0,0 +1,44 @@ +// SPDX-License-Identifier: GPL-2.0 +#include <linux/hugetlb.h> +#include <linux/err.h> + +int pud_huge(pud_t pud) +{ + return pud_present(pud) && + (pud_val(pud) & (_PAGE_READ | _PAGE_WRITE | _PAGE_EXEC)); +} + +int pmd_huge(pmd_t pmd) +{ + return pmd_present(pmd) && + (pmd_val(pmd) & (_PAGE_READ | _PAGE_WRITE | _PAGE_EXEC)); +} + +static __init int setup_hugepagesz(char *opt) +{ + unsigned long 
ps = memparse(opt, &opt); + + if (ps == HPAGE_SIZE) { + hugetlb_add_hstate(HPAGE_SHIFT - PAGE_SHIFT); + } else if (IS_ENABLED(CONFIG_64BIT) && ps == PUD_SIZE) { + hugetlb_add_hstate(PUD_SHIFT - PAGE_SHIFT); + } else { + hugetlb_bad_size(); + pr_err("hugepagesz: Unsupported page size %lu M\n", ps >> 20); + return 0; + } + + return 1; +} +__setup("hugepagesz=", setup_hugepagesz); + +#ifdef CONFIG_CONTIG_ALLOC +static __init int gigantic_pages_init(void) +{ + /* With CONTIG_ALLOC, we can allocate gigantic pages at runtime */ + if (IS_ENABLED(CONFIG_64BIT) && !size_to_hstate(1UL << PUD_SHIFT)) + hugetlb_add_hstate(PUD_SHIFT - PAGE_SHIFT); + return 0; +} +arch_initcall(gigantic_pages_init); +#endif diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c index 84747d7a1e85..42bf939693d3 100644 --- a/arch/riscv/mm/init.c +++ b/arch/riscv/mm/init.c @@ -1,6 +1,7 @@ // SPDX-License-Identifier: GPL-2.0-only /* * Copyright (C) 2012 Regents of the University of California + * Copyright (C) 2019 Western Digital Corporation or its affiliates. */ #include <linux/init.h> @@ -21,6 +22,8 @@ unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)] __page_aligned_bss; EXPORT_SYMBOL(empty_zero_page); +extern char _start[]; + static void __init zone_sizes_init(void) { unsigned long max_zone_pfns[MAX_NR_ZONES] = { 0, }; @@ -39,13 +42,6 @@ void setup_zero_page(void) memset((void *)empty_zero_page, 0, PAGE_SIZE); } -void __init paging_init(void) -{ - setup_zero_page(); - local_flush_tlb_all(); - zone_sizes_init(); -} - void __init mem_init(void) { #ifdef CONFIG_FLATMEM @@ -84,29 +80,20 @@ disable: initrd_start = 0; initrd_end = 0; } - -void __init free_initrd_mem(unsigned long start, unsigned long end) -{ - free_reserved_area((void *)start, (void *)end, -1, "initrd"); -} #endif /* CONFIG_BLK_DEV_INITRD */ void __init setup_bootmem(void) { struct memblock_region *reg; phys_addr_t mem_size = 0; + phys_addr_t vmlinux_end = __pa(&_end); + phys_addr_t vmlinux_start = __pa(&_start); /* Find the memory region containing the kernel */ for_each_memblock(memory, reg) { - phys_addr_t vmlinux_end = __pa(_end); phys_addr_t end = reg->base + reg->size; if (reg->base <= vmlinux_end && vmlinux_end <= end) { - /* - * Reserve from the start of the region to the end of - * the kernel - */ - memblock_reserve(reg->base, vmlinux_end - reg->base); mem_size = min(reg->size, (phys_addr_t)-PAGE_OFFSET); /* @@ -120,6 +107,9 @@ void __init setup_bootmem(void) } BUG_ON(mem_size == 0); + /* Reserve from the start of the kernel to the end of the kernel */ + memblock_reserve(vmlinux_start, vmlinux_end - vmlinux_start); + set_max_mapnr(PFN_DOWN(mem_size)); max_low_pfn = PFN_DOWN(memblock_end_of_DRAM()); @@ -147,17 +137,15 @@ EXPORT_SYMBOL(va_pa_offset); unsigned long pfn_base; EXPORT_SYMBOL(pfn_base); +void *dtb_early_va; pgd_t swapper_pg_dir[PTRS_PER_PGD] __page_aligned_bss; -pgd_t trampoline_pg_dir[PTRS_PER_PGD] __initdata __aligned(PAGE_SIZE); +pgd_t trampoline_pg_dir[PTRS_PER_PGD] __page_aligned_bss; +pte_t fixmap_pte[PTRS_PER_PTE] __page_aligned_bss; +static bool mmu_enabled; -#ifndef __PAGETABLE_PMD_FOLDED -#define NUM_SWAPPER_PMDS ((uintptr_t)-PAGE_OFFSET >> PGDIR_SHIFT) -pmd_t swapper_pmd[PTRS_PER_PMD*((-PAGE_OFFSET)/PGDIR_SIZE)] __page_aligned_bss; -pmd_t trampoline_pmd[PTRS_PER_PGD] __initdata __aligned(PAGE_SIZE); -pmd_t fixmap_pmd[PTRS_PER_PMD] __page_aligned_bss; -#endif +#define MAX_EARLY_MAPPING_SIZE SZ_128M -pte_t fixmap_pte[PTRS_PER_PTE] __page_aligned_bss; +pgd_t early_pg_dir[PTRS_PER_PGD] __initdata 
__aligned(PAGE_SIZE); void __set_fixmap(enum fixed_addresses idx, phys_addr_t phys, pgprot_t prot) { @@ -176,6 +164,156 @@ void __set_fixmap(enum fixed_addresses idx, phys_addr_t phys, pgprot_t prot) } } +static pte_t *__init get_pte_virt(phys_addr_t pa) +{ + if (mmu_enabled) { + clear_fixmap(FIX_PTE); + return (pte_t *)set_fixmap_offset(FIX_PTE, pa); + } else { + return (pte_t *)((uintptr_t)pa); + } +} + +static phys_addr_t __init alloc_pte(uintptr_t va) +{ + /* + * We only create PMD or PGD early mappings so we + * should never reach here with MMU disabled. + */ + BUG_ON(!mmu_enabled); + + return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE); +} + +static void __init create_pte_mapping(pte_t *ptep, + uintptr_t va, phys_addr_t pa, + phys_addr_t sz, pgprot_t prot) +{ + uintptr_t pte_index = pte_index(va); + + BUG_ON(sz != PAGE_SIZE); + + if (pte_none(ptep[pte_index])) + ptep[pte_index] = pfn_pte(PFN_DOWN(pa), prot); +} + +#ifndef __PAGETABLE_PMD_FOLDED + +pmd_t trampoline_pmd[PTRS_PER_PMD] __page_aligned_bss; +pmd_t fixmap_pmd[PTRS_PER_PMD] __page_aligned_bss; + +#if MAX_EARLY_MAPPING_SIZE < PGDIR_SIZE +#define NUM_EARLY_PMDS 1UL +#else +#define NUM_EARLY_PMDS (1UL + MAX_EARLY_MAPPING_SIZE / PGDIR_SIZE) +#endif +pmd_t early_pmd[PTRS_PER_PMD * NUM_EARLY_PMDS] __initdata __aligned(PAGE_SIZE); + +static pmd_t *__init get_pmd_virt(phys_addr_t pa) +{ + if (mmu_enabled) { + clear_fixmap(FIX_PMD); + return (pmd_t *)set_fixmap_offset(FIX_PMD, pa); + } else { + return (pmd_t *)((uintptr_t)pa); + } +} + +static phys_addr_t __init alloc_pmd(uintptr_t va) +{ + uintptr_t pmd_num; + + if (mmu_enabled) + return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE); + + pmd_num = (va - PAGE_OFFSET) >> PGDIR_SHIFT; + BUG_ON(pmd_num >= NUM_EARLY_PMDS); + return (uintptr_t)&early_pmd[pmd_num * PTRS_PER_PMD]; +} + +static void __init create_pmd_mapping(pmd_t *pmdp, + uintptr_t va, phys_addr_t pa, + phys_addr_t sz, pgprot_t prot) +{ + pte_t *ptep; + phys_addr_t pte_phys; + uintptr_t pmd_index = pmd_index(va); + + if (sz == PMD_SIZE) { + if (pmd_none(pmdp[pmd_index])) + pmdp[pmd_index] = pfn_pmd(PFN_DOWN(pa), prot); + return; + } + + if (pmd_none(pmdp[pmd_index])) { + pte_phys = alloc_pte(va); + pmdp[pmd_index] = pfn_pmd(PFN_DOWN(pte_phys), PAGE_TABLE); + ptep = get_pte_virt(pte_phys); + memset(ptep, 0, PAGE_SIZE); + } else { + pte_phys = PFN_PHYS(_pmd_pfn(pmdp[pmd_index])); + ptep = get_pte_virt(pte_phys); + } + + create_pte_mapping(ptep, va, pa, sz, prot); +} + +#define pgd_next_t pmd_t +#define alloc_pgd_next(__va) alloc_pmd(__va) +#define get_pgd_next_virt(__pa) get_pmd_virt(__pa) +#define create_pgd_next_mapping(__nextp, __va, __pa, __sz, __prot) \ + create_pmd_mapping(__nextp, __va, __pa, __sz, __prot) +#define PTE_PARENT_SIZE PMD_SIZE +#define fixmap_pgd_next fixmap_pmd +#else +#define pgd_next_t pte_t +#define alloc_pgd_next(__va) alloc_pte(__va) +#define get_pgd_next_virt(__pa) get_pte_virt(__pa) +#define create_pgd_next_mapping(__nextp, __va, __pa, __sz, __prot) \ + create_pte_mapping(__nextp, __va, __pa, __sz, __prot) +#define PTE_PARENT_SIZE PGDIR_SIZE +#define fixmap_pgd_next fixmap_pte +#endif + +static void __init create_pgd_mapping(pgd_t *pgdp, + uintptr_t va, phys_addr_t pa, + phys_addr_t sz, pgprot_t prot) +{ + pgd_next_t *nextp; + phys_addr_t next_phys; + uintptr_t pgd_index = pgd_index(va); + + if (sz == PGDIR_SIZE) { + if (pgd_val(pgdp[pgd_index]) == 0) + pgdp[pgd_index] = pfn_pgd(PFN_DOWN(pa), prot); + return; + } + + if (pgd_val(pgdp[pgd_index]) == 0) { + next_phys = alloc_pgd_next(va); + pgdp[pgd_index] 
= pfn_pgd(PFN_DOWN(next_phys), PAGE_TABLE);
+		nextp = get_pgd_next_virt(next_phys);
+		memset(nextp, 0, PAGE_SIZE);
+	} else {
+		next_phys = PFN_PHYS(_pgd_pfn(pgdp[pgd_index]));
+		nextp = get_pgd_next_virt(next_phys);
+	}
+
+	create_pgd_next_mapping(nextp, va, pa, sz, prot);
+}
+
+static uintptr_t __init best_map_size(phys_addr_t base, phys_addr_t size)
+{
+	uintptr_t map_size = PAGE_SIZE;
+
+	/* Upgrade to PMD/PGDIR mappings whenever possible */
+	if (!(base & (PTE_PARENT_SIZE - 1)) &&
+	    !(size & (PTE_PARENT_SIZE - 1)))
+		map_size = PTE_PARENT_SIZE;
+
+	return map_size;
+}
+
 /*
  * setup_vm() is called from head.S with MMU-off.
  *
@@ -195,55 +333,115 @@ void __set_fixmap(enum fixed_addresses idx, phys_addr_t phys, pgprot_t prot)
 	  "not use absolute addressing."
 #endif
 
-asmlinkage void __init setup_vm(void)
+asmlinkage void __init setup_vm(uintptr_t dtb_pa)
 {
-	extern char _start;
-	uintptr_t i;
-	uintptr_t pa = (uintptr_t) &_start;
-	pgprot_t prot = __pgprot(pgprot_val(PAGE_KERNEL) | _PAGE_EXEC);
+	uintptr_t va, end_va;
+	uintptr_t load_pa = (uintptr_t)(&_start);
+	uintptr_t load_sz = (uintptr_t)(&_end) - load_pa;
+	uintptr_t map_size = best_map_size(load_pa, MAX_EARLY_MAPPING_SIZE);
 
-	va_pa_offset = PAGE_OFFSET - pa;
-	pfn_base = PFN_DOWN(pa);
+	va_pa_offset = PAGE_OFFSET - load_pa;
+	pfn_base = PFN_DOWN(load_pa);
+
+	/*
+	 * Enforce boot alignment requirements of RV32 and
+	 * RV64 by only allowing PMD or PGD mappings.
+	 */
+	BUG_ON(map_size == PAGE_SIZE);
 
 	/* Sanity check alignment and size */
 	BUG_ON((PAGE_OFFSET % PGDIR_SIZE) != 0);
-	BUG_ON((pa % (PAGE_SIZE * PTRS_PER_PTE)) != 0);
+	BUG_ON((load_pa % map_size) != 0);
+	BUG_ON(load_sz > MAX_EARLY_MAPPING_SIZE);
+
+	/* Setup early PGD for fixmap */
+	create_pgd_mapping(early_pg_dir, FIXADDR_START,
+			   (uintptr_t)fixmap_pgd_next, PGDIR_SIZE, PAGE_TABLE);
 
 #ifndef __PAGETABLE_PMD_FOLDED
-	trampoline_pg_dir[(PAGE_OFFSET >> PGDIR_SHIFT) % PTRS_PER_PGD] =
-		pfn_pgd(PFN_DOWN((uintptr_t)trampoline_pmd),
-			__pgprot(_PAGE_TABLE));
-	trampoline_pmd[0] = pfn_pmd(PFN_DOWN(pa), prot);
+	/* Setup fixmap PMD */
+	create_pmd_mapping(fixmap_pmd, FIXADDR_START,
+			   (uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
+	/* Setup trampoline PGD and PMD */
+	create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
+			   (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
+	create_pmd_mapping(trampoline_pmd, PAGE_OFFSET,
+			   load_pa, PMD_SIZE, PAGE_KERNEL_EXEC);
+#else
+	/* Setup trampoline PGD */
+	create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
+			   load_pa, PGDIR_SIZE, PAGE_KERNEL_EXEC);
+#endif
 
-	for (i = 0; i < (-PAGE_OFFSET)/PGDIR_SIZE; ++i) {
-		size_t o = (PAGE_OFFSET >> PGDIR_SHIFT) % PTRS_PER_PGD + i;
+	/*
+	 * Setup early PGD covering entire kernel which will allow
	 * us to reach paging_init(). We map all memory banks later
	 * in setup_vm_final() below.
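
A note on the mapping-size heuristic above: best_map_size() returns a large block size only when both the physical base and the size are aligned to that block, and setup_vm() then insists on getting one (the BUG_ON(map_size == PAGE_SIZE) guard). A minimal userspace sketch of the same check; the 2 MiB constant standing in for PTE_PARENT_SIZE is an assumption for illustration:

#include <stdint.h>

#define SK_PAGE_SIZE	4096UL
#define SK_BLOCK_SIZE	(2UL * 1024 * 1024)	/* stand-in for PTE_PARENT_SIZE */

/* Use a block mapping only if base and size are both block-aligned. */
static uintptr_t sk_best_map_size(uint64_t base, uint64_t size)
{
	if (!(base & (SK_BLOCK_SIZE - 1)) && !(size & (SK_BLOCK_SIZE - 1)))
		return SK_BLOCK_SIZE;
	return SK_PAGE_SIZE;
}

With a 2 MiB block size, a load address of 0x80200000 with a 4 MiB size qualifies; a base that is merely page-aligned falls back to 4 KiB mappings.
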
+ */ + end_va = PAGE_OFFSET + load_sz; + for (va = PAGE_OFFSET; va < end_va; va += map_size) + create_pgd_mapping(early_pg_dir, va, + load_pa + (va - PAGE_OFFSET), + map_size, PAGE_KERNEL_EXEC); + + /* Create fixed mapping for early FDT parsing */ + end_va = __fix_to_virt(FIX_FDT) + FIX_FDT_SIZE; + for (va = __fix_to_virt(FIX_FDT); va < end_va; va += PAGE_SIZE) + create_pte_mapping(fixmap_pte, va, + dtb_pa + (va - __fix_to_virt(FIX_FDT)), + PAGE_SIZE, PAGE_KERNEL); + + /* Save pointer to DTB for early FDT parsing */ + dtb_early_va = (void *)fix_to_virt(FIX_FDT) + (dtb_pa & ~PAGE_MASK); +} - swapper_pg_dir[o] = - pfn_pgd(PFN_DOWN((uintptr_t)swapper_pmd) + i, - __pgprot(_PAGE_TABLE)); - } - for (i = 0; i < ARRAY_SIZE(swapper_pmd); i++) - swapper_pmd[i] = pfn_pmd(PFN_DOWN(pa + i * PMD_SIZE), prot); - - swapper_pg_dir[(FIXADDR_START >> PGDIR_SHIFT) % PTRS_PER_PGD] = - pfn_pgd(PFN_DOWN((uintptr_t)fixmap_pmd), - __pgprot(_PAGE_TABLE)); - fixmap_pmd[(FIXADDR_START >> PMD_SHIFT) % PTRS_PER_PMD] = - pfn_pmd(PFN_DOWN((uintptr_t)fixmap_pte), - __pgprot(_PAGE_TABLE)); -#else - trampoline_pg_dir[(PAGE_OFFSET >> PGDIR_SHIFT) % PTRS_PER_PGD] = - pfn_pgd(PFN_DOWN(pa), prot); +static void __init setup_vm_final(void) +{ + uintptr_t va, map_size; + phys_addr_t pa, start, end; + struct memblock_region *reg; - for (i = 0; i < (-PAGE_OFFSET)/PGDIR_SIZE; ++i) { - size_t o = (PAGE_OFFSET >> PGDIR_SHIFT) % PTRS_PER_PGD + i; + /* Set mmu_enabled flag */ + mmu_enabled = true; - swapper_pg_dir[o] = - pfn_pgd(PFN_DOWN(pa + i * PGDIR_SIZE), prot); + /* Setup swapper PGD for fixmap */ + create_pgd_mapping(swapper_pg_dir, FIXADDR_START, + __pa(fixmap_pgd_next), + PGDIR_SIZE, PAGE_TABLE); + + /* Map all memory banks */ + for_each_memblock(memory, reg) { + start = reg->base; + end = start + reg->size; + + if (start >= end) + break; + if (memblock_is_nomap(reg)) + continue; + if (start <= __pa(PAGE_OFFSET) && + __pa(PAGE_OFFSET) < end) + start = __pa(PAGE_OFFSET); + + map_size = best_map_size(start, end - start); + for (pa = start; pa < end; pa += map_size) { + va = (uintptr_t)__va(pa); + create_pgd_mapping(swapper_pg_dir, va, pa, + map_size, PAGE_KERNEL_EXEC); + } } - swapper_pg_dir[(FIXADDR_START >> PGDIR_SHIFT) % PTRS_PER_PGD] = - pfn_pgd(PFN_DOWN((uintptr_t)fixmap_pte), - __pgprot(_PAGE_TABLE)); -#endif + /* Clear fixmap PTE and PMD mappings */ + clear_fixmap(FIX_PTE); + clear_fixmap(FIX_PMD); + + /* Move to swapper page table */ + csr_write(sptbr, PFN_DOWN(__pa(swapper_pg_dir)) | SATP_MODE); + local_flush_tlb_all(); +} + +void __init paging_init(void) +{ + setup_vm_final(); + setup_zero_page(); + zone_sizes_init(); } diff --git a/arch/riscv/mm/sifive_l2_cache.c b/arch/riscv/mm/sifive_l2_cache.c index 4eb64619b3f4..2e637ad71c05 100644 --- a/arch/riscv/mm/sifive_l2_cache.c +++ b/arch/riscv/mm/sifive_l2_cache.c @@ -109,13 +109,14 @@ EXPORT_SYMBOL_GPL(unregister_sifive_l2_error_notifier); static irqreturn_t l2_int_handler(int irq, void *device) { - unsigned int regval, add_h, add_l; + unsigned int add_h, add_l; if (irq == g_irq[DIR_CORR]) { add_h = readl(l2_base + SIFIVE_L2_DIRECCFIX_HIGH); add_l = readl(l2_base + SIFIVE_L2_DIRECCFIX_LOW); pr_err("L2CACHE: DirError @ 0x%08X.%08X\n", add_h, add_l); - regval = readl(l2_base + SIFIVE_L2_DIRECCFIX_COUNT); + /* Reading this register clears the DirError interrupt sig */ + readl(l2_base + SIFIVE_L2_DIRECCFIX_COUNT); atomic_notifier_call_chain(&l2_err_chain, SIFIVE_L2_ERR_TYPE_CE, "DirECCFix"); } @@ -123,7 +124,8 @@ static irqreturn_t l2_int_handler(int irq, void *device) add_h = 
readl(l2_base + SIFIVE_L2_DATECCFIX_HIGH); add_l = readl(l2_base + SIFIVE_L2_DATECCFIX_LOW); pr_err("L2CACHE: DataError @ 0x%08X.%08X\n", add_h, add_l); - regval = readl(l2_base + SIFIVE_L2_DATECCFIX_COUNT); + /* Reading this register clears the DataError interrupt sig */ + readl(l2_base + SIFIVE_L2_DATECCFIX_COUNT); atomic_notifier_call_chain(&l2_err_chain, SIFIVE_L2_ERR_TYPE_CE, "DatECCFix"); } @@ -131,7 +133,8 @@ static irqreturn_t l2_int_handler(int irq, void *device) add_h = readl(l2_base + SIFIVE_L2_DATECCFAIL_HIGH); add_l = readl(l2_base + SIFIVE_L2_DATECCFAIL_LOW); pr_err("L2CACHE: DataFail @ 0x%08X.%08X\n", add_h, add_l); - regval = readl(l2_base + SIFIVE_L2_DATECCFAIL_COUNT); + /* Reading this register clears the DataFail interrupt sig */ + readl(l2_base + SIFIVE_L2_DATECCFAIL_COUNT); atomic_notifier_call_chain(&l2_err_chain, SIFIVE_L2_ERR_TYPE_UE, "DatECCFail"); } diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 1342654e8057..78772870facd 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -94,6 +94,7 @@ config X86 select ARCH_USE_QUEUED_SPINLOCKS select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH select ARCH_WANTS_DYNAMIC_TASK_STRUCT + select ARCH_WANT_HUGE_PMD_SHARE select ARCH_WANTS_THP_SWAP if X86_64 select BUILDTIME_EXTABLE_SORT select CLKEVT_I8253 @@ -307,9 +308,6 @@ config ARCH_HIBERNATION_POSSIBLE config ARCH_SUSPEND_POSSIBLE def_bool y -config ARCH_WANT_HUGE_PMD_SHARE - def_bool y - config ARCH_WANT_GENERAL_HUGETLB def_bool y diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h index c82abd6e4ca3..9c4435307ff8 100644 --- a/arch/x86/include/asm/uaccess.h +++ b/arch/x86/include/asm/uaccess.h @@ -66,7 +66,9 @@ static inline bool __chk_range_not_ok(unsigned long addr, unsigned long size, un }) #ifdef CONFIG_DEBUG_ATOMIC_SLEEP -# define WARN_ON_IN_IRQ() WARN_ON_ONCE(!in_task()) +static inline bool pagefault_disabled(void); +# define WARN_ON_IN_IRQ() \ + WARN_ON_ONCE(!in_task() && !pagefault_disabled()) #else # define WARN_ON_IN_IRQ() #endif diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c index 4b73f5937f41..024c3053dbba 100644 --- a/arch/x86/kernel/ftrace.c +++ b/arch/x86/kernel/ftrace.c @@ -373,7 +373,7 @@ static int add_brk_on_nop(struct dyn_ftrace *rec) return add_break(rec->ip, old); } -static int add_breakpoints(struct dyn_ftrace *rec, int enable) +static int add_breakpoints(struct dyn_ftrace *rec, bool enable) { unsigned long ftrace_addr; int ret; @@ -481,7 +481,7 @@ static int add_update_nop(struct dyn_ftrace *rec) return add_update_code(ip, new); } -static int add_update(struct dyn_ftrace *rec, int enable) +static int add_update(struct dyn_ftrace *rec, bool enable) { unsigned long ftrace_addr; int ret; @@ -527,7 +527,7 @@ static int finish_update_nop(struct dyn_ftrace *rec) return ftrace_write(ip, new, 1); } -static int finish_update(struct dyn_ftrace *rec, int enable) +static int finish_update(struct dyn_ftrace *rec, bool enable) { unsigned long ftrace_addr; int ret; diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c index 23022cf20d26..c02fa27dd3f3 100644 --- a/drivers/acpi/nfit/core.c +++ b/drivers/acpi/nfit/core.c @@ -2426,7 +2426,7 @@ static void write_blk_ctl(struct nfit_blk *nfit_blk, unsigned int bw, offset = to_interleave_offset(offset, mmio); writeq(cmd, mmio->addr.base + offset); - nvdimm_flush(nfit_blk->nd_region); + nvdimm_flush(nfit_blk->nd_region, NULL); if (nfit_blk->dimm_flags & NFIT_BLK_DCR_LATCH) readq(mmio->addr.base + offset); @@ -2475,7 +2475,7 @@ static int acpi_nfit_blk_single_io(struct 
nfit_blk *nfit_blk, } if (rw) - nvdimm_flush(nfit_blk->nd_region); + nvdimm_flush(nfit_blk->nd_region, NULL); rc = read_blk_stat(nfit_blk, lane) ? -EIO : 0; return rc; diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index e5009a34f9c2..3327192bb71f 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -115,6 +115,8 @@ static int atomic_dec_return_safe(atomic_t *v) #define RBD_FEATURE_LAYERING (1ULL<<0) #define RBD_FEATURE_STRIPINGV2 (1ULL<<1) #define RBD_FEATURE_EXCLUSIVE_LOCK (1ULL<<2) +#define RBD_FEATURE_OBJECT_MAP (1ULL<<3) +#define RBD_FEATURE_FAST_DIFF (1ULL<<4) #define RBD_FEATURE_DEEP_FLATTEN (1ULL<<5) #define RBD_FEATURE_DATA_POOL (1ULL<<7) #define RBD_FEATURE_OPERATIONS (1ULL<<8) @@ -122,6 +124,8 @@ static int atomic_dec_return_safe(atomic_t *v) #define RBD_FEATURES_ALL (RBD_FEATURE_LAYERING | \ RBD_FEATURE_STRIPINGV2 | \ RBD_FEATURE_EXCLUSIVE_LOCK | \ + RBD_FEATURE_OBJECT_MAP | \ + RBD_FEATURE_FAST_DIFF | \ RBD_FEATURE_DEEP_FLATTEN | \ RBD_FEATURE_DATA_POOL | \ RBD_FEATURE_OPERATIONS) @@ -203,6 +207,11 @@ struct rbd_client { struct list_head node; }; +struct pending_result { + int result; /* first nonzero result */ + int num_pending; +}; + struct rbd_img_request; enum obj_request_type { @@ -219,6 +228,18 @@ enum obj_operation_type { OBJ_OP_ZEROOUT, }; +#define RBD_OBJ_FLAG_DELETION (1U << 0) +#define RBD_OBJ_FLAG_COPYUP_ENABLED (1U << 1) +#define RBD_OBJ_FLAG_COPYUP_ZEROS (1U << 2) +#define RBD_OBJ_FLAG_MAY_EXIST (1U << 3) +#define RBD_OBJ_FLAG_NOOP_FOR_NONEXISTENT (1U << 4) + +enum rbd_obj_read_state { + RBD_OBJ_READ_START = 1, + RBD_OBJ_READ_OBJECT, + RBD_OBJ_READ_PARENT, +}; + /* * Writes go through the following state machine to deal with * layering: @@ -245,17 +266,28 @@ enum obj_operation_type { * even if there is a parent). 
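
The struct pending_result added here is the heart of the rbd rework's new completion model: a parent request latches the first nonzero result reported by its children and completes only when num_pending drops to zero, replacing the old xferred/result bookkeeping. A compilable miniature of that aggregation rule (the sk_* names are illustrative, not the driver's):

#include <assert.h>
#include <stdio.h>

struct sk_pending {
	int result;		/* first nonzero child result, latched */
	int num_pending;	/* children still outstanding */
};

/* Returns 1 when the last child completes; *result then holds the outcome. */
static int sk_pending_dec(struct sk_pending *p, int *result)
{
	assert(p->num_pending > 0);
	if (*result && !p->result)
		p->result = *result;
	if (--p->num_pending)
		return 0;		/* siblings still in flight */
	*result = p->result;
	return 1;
}

int main(void)
{
	struct sk_pending p = { 0, 3 };	/* three child ops fanned out */
	int r;

	r = 0;  sk_pending_dec(&p, &r);	/* first child succeeds */
	r = -5; sk_pending_dec(&p, &r);	/* second fails; -EIO is latched */
	r = 0;  sk_pending_dec(&p, &r);	/* last child completes */
	printf("aggregate result: %d\n", r);	/* prints -5 */
	return 0;
}
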
*/ enum rbd_obj_write_state { - RBD_OBJ_WRITE_FLAT = 1, - RBD_OBJ_WRITE_GUARD, - RBD_OBJ_WRITE_READ_FROM_PARENT, - RBD_OBJ_WRITE_COPYUP_EMPTY_SNAPC, - RBD_OBJ_WRITE_COPYUP_OPS, + RBD_OBJ_WRITE_START = 1, + RBD_OBJ_WRITE_PRE_OBJECT_MAP, + RBD_OBJ_WRITE_OBJECT, + __RBD_OBJ_WRITE_COPYUP, + RBD_OBJ_WRITE_COPYUP, + RBD_OBJ_WRITE_POST_OBJECT_MAP, +}; + +enum rbd_obj_copyup_state { + RBD_OBJ_COPYUP_START = 1, + RBD_OBJ_COPYUP_READ_PARENT, + __RBD_OBJ_COPYUP_OBJECT_MAPS, + RBD_OBJ_COPYUP_OBJECT_MAPS, + __RBD_OBJ_COPYUP_WRITE_OBJECT, + RBD_OBJ_COPYUP_WRITE_OBJECT, }; struct rbd_obj_request { struct ceph_object_extent ex; + unsigned int flags; /* RBD_OBJ_FLAG_* */ union { - bool tried_parent; /* for reads */ + enum rbd_obj_read_state read_state; /* for reads */ enum rbd_obj_write_state write_state; /* for writes */ }; @@ -271,14 +303,15 @@ struct rbd_obj_request { u32 bvec_idx; }; }; + + enum rbd_obj_copyup_state copyup_state; struct bio_vec *copyup_bvecs; u32 copyup_bvec_count; - struct ceph_osd_request *osd_req; - - u64 xferred; /* bytes transferred */ - int result; + struct list_head osd_reqs; /* w/ r_private_item */ + struct mutex state_mutex; + struct pending_result pending; struct kref kref; }; @@ -287,11 +320,19 @@ enum img_req_flags { IMG_REQ_LAYERED, /* ENOENT handling: normal = 0, layered = 1 */ }; +enum rbd_img_state { + RBD_IMG_START = 1, + RBD_IMG_EXCLUSIVE_LOCK, + __RBD_IMG_OBJECT_REQUESTS, + RBD_IMG_OBJECT_REQUESTS, +}; + struct rbd_img_request { struct rbd_device *rbd_dev; enum obj_operation_type op_type; enum obj_request_type data_type; unsigned long flags; + enum rbd_img_state state; union { u64 snap_id; /* for reads */ struct ceph_snap_context *snapc; /* for writes */ @@ -300,13 +341,14 @@ struct rbd_img_request { struct request *rq; /* block request */ struct rbd_obj_request *obj_request; /* obj req initiator */ }; - spinlock_t completion_lock; - u64 xferred;/* aggregate bytes transferred */ - int result; /* first nonzero obj_request result */ + struct list_head lock_item; struct list_head object_extents; /* obj_req.ex structs */ - u32 pending_count; + struct mutex state_mutex; + struct pending_result pending; + struct work_struct work; + int work_result; struct kref kref; }; @@ -380,7 +422,17 @@ struct rbd_device { struct work_struct released_lock_work; struct delayed_work lock_dwork; struct work_struct unlock_work; - wait_queue_head_t lock_waitq; + spinlock_t lock_lists_lock; + struct list_head acquiring_list; + struct list_head running_list; + struct completion acquire_wait; + int acquire_err; + struct completion releasing_wait; + + spinlock_t object_map_lock; + u8 *object_map; + u64 object_map_size; /* in objects */ + u64 object_map_flags; struct workqueue_struct *task_wq; @@ -408,12 +460,10 @@ struct rbd_device { * Flag bits for rbd_dev->flags: * - REMOVING (which is coupled with rbd_dev->open_count) is protected * by rbd_dev->lock - * - BLACKLISTED is protected by rbd_dev->lock_rwsem */ enum rbd_dev_flags { RBD_DEV_FLAG_EXISTS, /* mapped snapshot has not been deleted */ RBD_DEV_FLAG_REMOVING, /* this mapping is being removed */ - RBD_DEV_FLAG_BLACKLISTED, /* our ceph_client is blacklisted */ }; static DEFINE_MUTEX(client_mutex); /* Serialize client creation */ @@ -466,6 +516,8 @@ static int minor_to_rbd_dev_id(int minor) static bool __rbd_is_lock_owner(struct rbd_device *rbd_dev) { + lockdep_assert_held(&rbd_dev->lock_rwsem); + return rbd_dev->lock_state == RBD_LOCK_STATE_LOCKED || rbd_dev->lock_state == RBD_LOCK_STATE_RELEASING; } @@ -583,6 +635,26 @@ static int 
_rbd_dev_v2_snap_size(struct rbd_device *rbd_dev, u64 snap_id, u8 *order, u64 *snap_size); static int _rbd_dev_v2_snap_features(struct rbd_device *rbd_dev, u64 snap_id, u64 *snap_features); +static int rbd_dev_v2_get_flags(struct rbd_device *rbd_dev); + +static void rbd_obj_handle_request(struct rbd_obj_request *obj_req, int result); +static void rbd_img_handle_request(struct rbd_img_request *img_req, int result); + +/* + * Return true if nothing else is pending. + */ +static bool pending_result_dec(struct pending_result *pending, int *result) +{ + rbd_assert(pending->num_pending > 0); + + if (*result && !pending->result) + pending->result = *result; + if (--pending->num_pending) + return false; + + *result = pending->result; + return true; +} static int rbd_open(struct block_device *bdev, fmode_t mode) { @@ -1317,6 +1389,8 @@ static void zero_bvecs(struct ceph_bvec_iter *bvec_pos, u32 off, u32 bytes) static void rbd_obj_zero_range(struct rbd_obj_request *obj_req, u32 off, u32 bytes) { + dout("%s %p data buf %u~%u\n", __func__, obj_req, off, bytes); + switch (obj_req->img_request->data_type) { case OBJ_REQUEST_BIO: zero_bios(&obj_req->bio_pos, off, bytes); @@ -1339,13 +1413,6 @@ static void rbd_obj_request_put(struct rbd_obj_request *obj_request) kref_put(&obj_request->kref, rbd_obj_request_destroy); } -static void rbd_img_request_get(struct rbd_img_request *img_request) -{ - dout("%s: img %p (was %d)\n", __func__, img_request, - kref_read(&img_request->kref)); - kref_get(&img_request->kref); -} - static void rbd_img_request_destroy(struct kref *kref); static void rbd_img_request_put(struct rbd_img_request *img_request) { @@ -1362,7 +1429,6 @@ static inline void rbd_img_obj_request_add(struct rbd_img_request *img_request, /* Image request now owns object's original reference */ obj_request->img_request = img_request; - img_request->pending_count++; dout("%s: img %p obj %p\n", __func__, img_request, obj_request); } @@ -1375,13 +1441,13 @@ static inline void rbd_img_obj_request_del(struct rbd_img_request *img_request, rbd_obj_request_put(obj_request); } -static void rbd_obj_request_submit(struct rbd_obj_request *obj_request) +static void rbd_osd_submit(struct ceph_osd_request *osd_req) { - struct ceph_osd_request *osd_req = obj_request->osd_req; + struct rbd_obj_request *obj_req = osd_req->r_priv; - dout("%s %p object_no %016llx %llu~%llu osd_req %p\n", __func__, - obj_request, obj_request->ex.oe_objno, obj_request->ex.oe_off, - obj_request->ex.oe_len, osd_req); + dout("%s osd_req %p for obj_req %p objno %llu %llu~%llu\n", + __func__, osd_req, obj_req, obj_req->ex.oe_objno, + obj_req->ex.oe_off, obj_req->ex.oe_len); ceph_osdc_start_request(osd_req->r_osdc, osd_req, false); } @@ -1457,41 +1523,38 @@ static bool rbd_img_is_write(struct rbd_img_request *img_req) } } -static void rbd_obj_handle_request(struct rbd_obj_request *obj_req); - static void rbd_osd_req_callback(struct ceph_osd_request *osd_req) { struct rbd_obj_request *obj_req = osd_req->r_priv; + int result; dout("%s osd_req %p result %d for obj_req %p\n", __func__, osd_req, osd_req->r_result, obj_req); - rbd_assert(osd_req == obj_req->osd_req); - obj_req->result = osd_req->r_result < 0 ? osd_req->r_result : 0; - if (!obj_req->result && !rbd_img_is_write(obj_req->img_request)) - obj_req->xferred = osd_req->r_result; + /* + * Writes aren't allowed to return a data payload. In some + * guarded write cases (e.g. stat + zero on an empty object) + * a stat response makes it through, but we don't care. 
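
The comment above states the result convention now enforced in rbd_osd_req_callback(): for reads, r_result is a byte count or a negative errno; for writes, any positive value (such as the stat response of a guarded write) is squashed to zero. Standalone, the rule is just:

/* Sketch of the callback rule; is_write mirrors rbd_img_is_write(). */
static int sk_normalize_result(long r_result, int is_write)
{
	if (r_result > 0 && is_write)
		return 0;	/* writes carry no data payload */
	return (int)r_result;
}
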
+ */ + if (osd_req->r_result > 0 && rbd_img_is_write(obj_req->img_request)) + result = 0; else - /* - * Writes aren't allowed to return a data payload. In some - * guarded write cases (e.g. stat + zero on an empty object) - * a stat response makes it through, but we don't care. - */ - obj_req->xferred = 0; + result = osd_req->r_result; - rbd_obj_handle_request(obj_req); + rbd_obj_handle_request(obj_req, result); } -static void rbd_osd_req_format_read(struct rbd_obj_request *obj_request) +static void rbd_osd_format_read(struct ceph_osd_request *osd_req) { - struct ceph_osd_request *osd_req = obj_request->osd_req; + struct rbd_obj_request *obj_request = osd_req->r_priv; osd_req->r_flags = CEPH_OSD_FLAG_READ; osd_req->r_snapid = obj_request->img_request->snap_id; } -static void rbd_osd_req_format_write(struct rbd_obj_request *obj_request) +static void rbd_osd_format_write(struct ceph_osd_request *osd_req) { - struct ceph_osd_request *osd_req = obj_request->osd_req; + struct rbd_obj_request *obj_request = osd_req->r_priv; osd_req->r_flags = CEPH_OSD_FLAG_WRITE; ktime_get_real_ts64(&osd_req->r_mtime); @@ -1499,19 +1562,21 @@ static void rbd_osd_req_format_write(struct rbd_obj_request *obj_request) } static struct ceph_osd_request * -__rbd_osd_req_create(struct rbd_obj_request *obj_req, - struct ceph_snap_context *snapc, unsigned int num_ops) +__rbd_obj_add_osd_request(struct rbd_obj_request *obj_req, + struct ceph_snap_context *snapc, int num_ops) { struct rbd_device *rbd_dev = obj_req->img_request->rbd_dev; struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc; struct ceph_osd_request *req; const char *name_format = rbd_dev->image_format == 1 ? RBD_V1_DATA_FORMAT : RBD_V2_DATA_FORMAT; + int ret; req = ceph_osdc_alloc_request(osdc, snapc, num_ops, false, GFP_NOIO); if (!req) - return NULL; + return ERR_PTR(-ENOMEM); + list_add_tail(&req->r_private_item, &obj_req->osd_reqs); req->r_callback = rbd_osd_req_callback; req->r_priv = obj_req; @@ -1522,27 +1587,20 @@ __rbd_osd_req_create(struct rbd_obj_request *obj_req, ceph_oloc_copy(&req->r_base_oloc, &rbd_dev->header_oloc); req->r_base_oloc.pool = rbd_dev->layout.pool_id; - if (ceph_oid_aprintf(&req->r_base_oid, GFP_NOIO, name_format, - rbd_dev->header.object_prefix, obj_req->ex.oe_objno)) - goto err_req; + ret = ceph_oid_aprintf(&req->r_base_oid, GFP_NOIO, name_format, + rbd_dev->header.object_prefix, + obj_req->ex.oe_objno); + if (ret) + return ERR_PTR(ret); return req; - -err_req: - ceph_osdc_put_request(req); - return NULL; } static struct ceph_osd_request * -rbd_osd_req_create(struct rbd_obj_request *obj_req, unsigned int num_ops) +rbd_obj_add_osd_request(struct rbd_obj_request *obj_req, int num_ops) { - return __rbd_osd_req_create(obj_req, obj_req->img_request->snapc, - num_ops); -} - -static void rbd_osd_req_destroy(struct ceph_osd_request *osd_req) -{ - ceph_osdc_put_request(osd_req); + return __rbd_obj_add_osd_request(obj_req, obj_req->img_request->snapc, + num_ops); } static struct rbd_obj_request *rbd_obj_request_create(void) @@ -1554,6 +1612,8 @@ static struct rbd_obj_request *rbd_obj_request_create(void) return NULL; ceph_object_extent_init(&obj_request->ex); + INIT_LIST_HEAD(&obj_request->osd_reqs); + mutex_init(&obj_request->state_mutex); kref_init(&obj_request->kref); dout("%s %p\n", __func__, obj_request); @@ -1563,14 +1623,19 @@ static struct rbd_obj_request *rbd_obj_request_create(void) static void rbd_obj_request_destroy(struct kref *kref) { struct rbd_obj_request *obj_request; + struct ceph_osd_request *osd_req; u32 
i; obj_request = container_of(kref, struct rbd_obj_request, kref); dout("%s: obj %p\n", __func__, obj_request); - if (obj_request->osd_req) - rbd_osd_req_destroy(obj_request->osd_req); + while (!list_empty(&obj_request->osd_reqs)) { + osd_req = list_first_entry(&obj_request->osd_reqs, + struct ceph_osd_request, r_private_item); + list_del_init(&osd_req->r_private_item); + ceph_osdc_put_request(osd_req); + } switch (obj_request->img_request->data_type) { case OBJ_REQUEST_NODATA: @@ -1684,8 +1749,9 @@ static struct rbd_img_request *rbd_img_request_create( if (rbd_dev_parent_get(rbd_dev)) img_request_layered_set(img_request); - spin_lock_init(&img_request->completion_lock); + INIT_LIST_HEAD(&img_request->lock_item); INIT_LIST_HEAD(&img_request->object_extents); + mutex_init(&img_request->state_mutex); kref_init(&img_request->kref); dout("%s: rbd_dev %p %s -> img %p\n", __func__, rbd_dev, @@ -1703,6 +1769,7 @@ static void rbd_img_request_destroy(struct kref *kref) dout("%s: img %p\n", __func__, img_request); + WARN_ON(!list_empty(&img_request->lock_item)); for_each_obj_request_safe(img_request, obj_request, next_obj_request) rbd_img_obj_request_del(img_request, obj_request); @@ -1717,6 +1784,466 @@ static void rbd_img_request_destroy(struct kref *kref) kmem_cache_free(rbd_img_request_cache, img_request); } +#define BITS_PER_OBJ 2 +#define OBJS_PER_BYTE (BITS_PER_BYTE / BITS_PER_OBJ) +#define OBJ_MASK ((1 << BITS_PER_OBJ) - 1) + +static void __rbd_object_map_index(struct rbd_device *rbd_dev, u64 objno, + u64 *index, u8 *shift) +{ + u32 off; + + rbd_assert(objno < rbd_dev->object_map_size); + *index = div_u64_rem(objno, OBJS_PER_BYTE, &off); + *shift = (OBJS_PER_BYTE - off - 1) * BITS_PER_OBJ; +} + +static u8 __rbd_object_map_get(struct rbd_device *rbd_dev, u64 objno) +{ + u64 index; + u8 shift; + + lockdep_assert_held(&rbd_dev->object_map_lock); + __rbd_object_map_index(rbd_dev, objno, &index, &shift); + return (rbd_dev->object_map[index] >> shift) & OBJ_MASK; +} + +static void __rbd_object_map_set(struct rbd_device *rbd_dev, u64 objno, u8 val) +{ + u64 index; + u8 shift; + u8 *p; + + lockdep_assert_held(&rbd_dev->object_map_lock); + rbd_assert(!(val & ~OBJ_MASK)); + + __rbd_object_map_index(rbd_dev, objno, &index, &shift); + p = &rbd_dev->object_map[index]; + *p = (*p & ~(OBJ_MASK << shift)) | (val << shift); +} + +static u8 rbd_object_map_get(struct rbd_device *rbd_dev, u64 objno) +{ + u8 state; + + spin_lock(&rbd_dev->object_map_lock); + state = __rbd_object_map_get(rbd_dev, objno); + spin_unlock(&rbd_dev->object_map_lock); + return state; +} + +static bool use_object_map(struct rbd_device *rbd_dev) +{ + return ((rbd_dev->header.features & RBD_FEATURE_OBJECT_MAP) && + !(rbd_dev->object_map_flags & RBD_FLAG_OBJECT_MAP_INVALID)); +} + +static bool rbd_object_map_may_exist(struct rbd_device *rbd_dev, u64 objno) +{ + u8 state; + + /* fall back to default logic if object map is disabled or invalid */ + if (!use_object_map(rbd_dev)) + return true; + + state = rbd_object_map_get(rbd_dev, objno); + return state != OBJECT_NONEXISTENT; +} + +static void rbd_object_map_name(struct rbd_device *rbd_dev, u64 snap_id, + struct ceph_object_id *oid) +{ + if (snap_id == CEPH_NOSNAP) + ceph_oid_printf(oid, "%s%s", RBD_OBJECT_MAP_PREFIX, + rbd_dev->spec->image_id); + else + ceph_oid_printf(oid, "%s%s.%016llx", RBD_OBJECT_MAP_PREFIX, + rbd_dev->spec->image_id, snap_id); +} + +static int rbd_object_map_lock(struct rbd_device *rbd_dev) +{ + struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc; + 
CEPH_DEFINE_OID_ONSTACK(oid); + u8 lock_type; + char *lock_tag; + struct ceph_locker *lockers; + u32 num_lockers; + bool broke_lock = false; + int ret; + + rbd_object_map_name(rbd_dev, CEPH_NOSNAP, &oid); + +again: + ret = ceph_cls_lock(osdc, &oid, &rbd_dev->header_oloc, RBD_LOCK_NAME, + CEPH_CLS_LOCK_EXCLUSIVE, "", "", "", 0); + if (ret != -EBUSY || broke_lock) { + if (ret == -EEXIST) + ret = 0; /* already locked by myself */ + if (ret) + rbd_warn(rbd_dev, "failed to lock object map: %d", ret); + return ret; + } + + ret = ceph_cls_lock_info(osdc, &oid, &rbd_dev->header_oloc, + RBD_LOCK_NAME, &lock_type, &lock_tag, + &lockers, &num_lockers); + if (ret) { + if (ret == -ENOENT) + goto again; + + rbd_warn(rbd_dev, "failed to get object map lockers: %d", ret); + return ret; + } + + kfree(lock_tag); + if (num_lockers == 0) + goto again; + + rbd_warn(rbd_dev, "breaking object map lock owned by %s%llu", + ENTITY_NAME(lockers[0].id.name)); + + ret = ceph_cls_break_lock(osdc, &oid, &rbd_dev->header_oloc, + RBD_LOCK_NAME, lockers[0].id.cookie, + &lockers[0].id.name); + ceph_free_lockers(lockers, num_lockers); + if (ret) { + if (ret == -ENOENT) + goto again; + + rbd_warn(rbd_dev, "failed to break object map lock: %d", ret); + return ret; + } + + broke_lock = true; + goto again; +} + +static void rbd_object_map_unlock(struct rbd_device *rbd_dev) +{ + struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc; + CEPH_DEFINE_OID_ONSTACK(oid); + int ret; + + rbd_object_map_name(rbd_dev, CEPH_NOSNAP, &oid); + + ret = ceph_cls_unlock(osdc, &oid, &rbd_dev->header_oloc, RBD_LOCK_NAME, + ""); + if (ret && ret != -ENOENT) + rbd_warn(rbd_dev, "failed to unlock object map: %d", ret); +} + +static int decode_object_map_header(void **p, void *end, u64 *object_map_size) +{ + u8 struct_v; + u32 struct_len; + u32 header_len; + void *header_end; + int ret; + + ceph_decode_32_safe(p, end, header_len, e_inval); + header_end = *p + header_len; + + ret = ceph_start_decoding(p, end, 1, "BitVector header", &struct_v, + &struct_len); + if (ret) + return ret; + + ceph_decode_64_safe(p, end, *object_map_size, e_inval); + + *p = header_end; + return 0; + +e_inval: + return -EINVAL; +} + +static int __rbd_object_map_load(struct rbd_device *rbd_dev) +{ + struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc; + CEPH_DEFINE_OID_ONSTACK(oid); + struct page **pages; + void *p, *end; + size_t reply_len; + u64 num_objects; + u64 object_map_bytes; + u64 object_map_size; + int num_pages; + int ret; + + rbd_assert(!rbd_dev->object_map && !rbd_dev->object_map_size); + + num_objects = ceph_get_num_objects(&rbd_dev->layout, + rbd_dev->mapping.size); + object_map_bytes = DIV_ROUND_UP_ULL(num_objects * BITS_PER_OBJ, + BITS_PER_BYTE); + num_pages = calc_pages_for(0, object_map_bytes) + 1; + pages = ceph_alloc_page_vector(num_pages, GFP_KERNEL); + if (IS_ERR(pages)) + return PTR_ERR(pages); + + reply_len = num_pages * PAGE_SIZE; + rbd_object_map_name(rbd_dev, rbd_dev->spec->snap_id, &oid); + ret = ceph_osdc_call(osdc, &oid, &rbd_dev->header_oloc, + "rbd", "object_map_load", CEPH_OSD_FLAG_READ, + NULL, 0, pages, &reply_len); + if (ret) + goto out; + + p = page_address(pages[0]); + end = p + min(reply_len, (size_t)PAGE_SIZE); + ret = decode_object_map_header(&p, end, &object_map_size); + if (ret) + goto out; + + if (object_map_size != num_objects) { + rbd_warn(rbd_dev, "object map size mismatch: %llu vs %llu", + object_map_size, num_objects); + ret = -EINVAL; + goto out; + } + + if (offset_in_page(p) + object_map_bytes > 
reply_len) { + ret = -EINVAL; + goto out; + } + + rbd_dev->object_map = kvmalloc(object_map_bytes, GFP_KERNEL); + if (!rbd_dev->object_map) { + ret = -ENOMEM; + goto out; + } + + rbd_dev->object_map_size = object_map_size; + ceph_copy_from_page_vector(pages, rbd_dev->object_map, + offset_in_page(p), object_map_bytes); + +out: + ceph_release_page_vector(pages, num_pages); + return ret; +} + +static void rbd_object_map_free(struct rbd_device *rbd_dev) +{ + kvfree(rbd_dev->object_map); + rbd_dev->object_map = NULL; + rbd_dev->object_map_size = 0; +} + +static int rbd_object_map_load(struct rbd_device *rbd_dev) +{ + int ret; + + ret = __rbd_object_map_load(rbd_dev); + if (ret) + return ret; + + ret = rbd_dev_v2_get_flags(rbd_dev); + if (ret) { + rbd_object_map_free(rbd_dev); + return ret; + } + + if (rbd_dev->object_map_flags & RBD_FLAG_OBJECT_MAP_INVALID) + rbd_warn(rbd_dev, "object map is invalid"); + + return 0; +} + +static int rbd_object_map_open(struct rbd_device *rbd_dev) +{ + int ret; + + ret = rbd_object_map_lock(rbd_dev); + if (ret) + return ret; + + ret = rbd_object_map_load(rbd_dev); + if (ret) { + rbd_object_map_unlock(rbd_dev); + return ret; + } + + return 0; +} + +static void rbd_object_map_close(struct rbd_device *rbd_dev) +{ + rbd_object_map_free(rbd_dev); + rbd_object_map_unlock(rbd_dev); +} + +/* + * This function needs snap_id (or more precisely just something to + * distinguish between HEAD and snapshot object maps), new_state and + * current_state that were passed to rbd_object_map_update(). + * + * To avoid allocating and stashing a context we piggyback on the OSD + * request. A HEAD update has two ops (assert_locked). For new_state + * and current_state we decode our own object_map_update op, encoded in + * rbd_cls_object_map_update(). + */ +static int rbd_object_map_update_finish(struct rbd_obj_request *obj_req, + struct ceph_osd_request *osd_req) +{ + struct rbd_device *rbd_dev = obj_req->img_request->rbd_dev; + struct ceph_osd_data *osd_data; + u64 objno; + u8 state, new_state, current_state; + bool has_current_state; + void *p; + + if (osd_req->r_result) + return osd_req->r_result; + + /* + * Nothing to do for a snapshot object map. + */ + if (osd_req->r_num_ops == 1) + return 0; + + /* + * Update in-memory HEAD object map. 
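
For orientation, the in-memory map updated here is the array managed by __rbd_object_map_index()/_get()/_set() earlier in this hunk: two bits per object, packed most-significant-object-first within each byte. The same arithmetic as a self-contained pair (sk_* names are illustrative):

#include <stdint.h>

#define SK_BITS_PER_OBJ		2
#define SK_OBJS_PER_BYTE	(8 / SK_BITS_PER_OBJ)
#define SK_OBJ_MASK		((1 << SK_BITS_PER_OBJ) - 1)

/* Object 0 occupies the high two bits of byte 0. */
static uint8_t sk_map_get(const uint8_t *map, uint64_t objno)
{
	uint64_t index = objno / SK_OBJS_PER_BYTE;
	uint8_t shift = (SK_OBJS_PER_BYTE - objno % SK_OBJS_PER_BYTE - 1) *
			SK_BITS_PER_OBJ;

	return (map[index] >> shift) & SK_OBJ_MASK;
}

static void sk_map_set(uint8_t *map, uint64_t objno, uint8_t val)
{
	uint64_t index = objno / SK_OBJS_PER_BYTE;
	uint8_t shift = (SK_OBJS_PER_BYTE - objno % SK_OBJS_PER_BYTE - 1) *
			SK_BITS_PER_OBJ;

	map[index] = (uint8_t)((map[index] & ~(SK_OBJ_MASK << shift)) |
			       (val << shift));
}
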
+ */ + rbd_assert(osd_req->r_num_ops == 2); + osd_data = osd_req_op_data(osd_req, 1, cls, request_data); + rbd_assert(osd_data->type == CEPH_OSD_DATA_TYPE_PAGES); + + p = page_address(osd_data->pages[0]); + objno = ceph_decode_64(&p); + rbd_assert(objno == obj_req->ex.oe_objno); + rbd_assert(ceph_decode_64(&p) == objno + 1); + new_state = ceph_decode_8(&p); + has_current_state = ceph_decode_8(&p); + if (has_current_state) + current_state = ceph_decode_8(&p); + + spin_lock(&rbd_dev->object_map_lock); + state = __rbd_object_map_get(rbd_dev, objno); + if (!has_current_state || current_state == state || + (current_state == OBJECT_EXISTS && state == OBJECT_EXISTS_CLEAN)) + __rbd_object_map_set(rbd_dev, objno, new_state); + spin_unlock(&rbd_dev->object_map_lock); + + return 0; +} + +static void rbd_object_map_callback(struct ceph_osd_request *osd_req) +{ + struct rbd_obj_request *obj_req = osd_req->r_priv; + int result; + + dout("%s osd_req %p result %d for obj_req %p\n", __func__, osd_req, + osd_req->r_result, obj_req); + + result = rbd_object_map_update_finish(obj_req, osd_req); + rbd_obj_handle_request(obj_req, result); +} + +static bool update_needed(struct rbd_device *rbd_dev, u64 objno, u8 new_state) +{ + u8 state = rbd_object_map_get(rbd_dev, objno); + + if (state == new_state || + (new_state == OBJECT_PENDING && state == OBJECT_NONEXISTENT) || + (new_state == OBJECT_NONEXISTENT && state != OBJECT_PENDING)) + return false; + + return true; +} + +static int rbd_cls_object_map_update(struct ceph_osd_request *req, + int which, u64 objno, u8 new_state, + const u8 *current_state) +{ + struct page **pages; + void *p, *start; + int ret; + + ret = osd_req_op_cls_init(req, which, "rbd", "object_map_update"); + if (ret) + return ret; + + pages = ceph_alloc_page_vector(1, GFP_NOIO); + if (IS_ERR(pages)) + return PTR_ERR(pages); + + p = start = page_address(pages[0]); + ceph_encode_64(&p, objno); + ceph_encode_64(&p, objno + 1); + ceph_encode_8(&p, new_state); + if (current_state) { + ceph_encode_8(&p, 1); + ceph_encode_8(&p, *current_state); + } else { + ceph_encode_8(&p, 0); + } + + osd_req_op_cls_request_data_pages(req, which, pages, p - start, 0, + false, true); + return 0; +} + +/* + * Return: + * 0 - object map update sent + * 1 - object map update isn't needed + * <0 - error + */ +static int rbd_object_map_update(struct rbd_obj_request *obj_req, u64 snap_id, + u8 new_state, const u8 *current_state) +{ + struct rbd_device *rbd_dev = obj_req->img_request->rbd_dev; + struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc; + struct ceph_osd_request *req; + int num_ops = 1; + int which = 0; + int ret; + + if (snap_id == CEPH_NOSNAP) { + if (!update_needed(rbd_dev, obj_req->ex.oe_objno, new_state)) + return 1; + + num_ops++; /* assert_locked */ + } + + req = ceph_osdc_alloc_request(osdc, NULL, num_ops, false, GFP_NOIO); + if (!req) + return -ENOMEM; + + list_add_tail(&req->r_private_item, &obj_req->osd_reqs); + req->r_callback = rbd_object_map_callback; + req->r_priv = obj_req; + + rbd_object_map_name(rbd_dev, snap_id, &req->r_base_oid); + ceph_oloc_copy(&req->r_base_oloc, &rbd_dev->header_oloc); + req->r_flags = CEPH_OSD_FLAG_WRITE; + ktime_get_real_ts64(&req->r_mtime); + + if (snap_id == CEPH_NOSNAP) { + /* + * Protect against possible race conditions during lock + * ownership transitions. 
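
The payload assembled by rbd_cls_object_map_update() above is a simple little-endian encoding: a half-open object range, the new state, and an optional current-state guard. A hedged userspace rendering, spelling out the byte order instead of using the kernel's ceph_encode_* helpers:

#include <stddef.h>
#include <stdint.h>

static uint8_t *sk_put_le64(uint8_t *p, uint64_t v)
{
	int i;

	for (i = 0; i < 8; i++)
		*p++ = (uint8_t)(v >> (8 * i));	/* ceph wire format is LE */
	return p;
}

/* Layout: [start_objno][end_objno][new_state][has_current][current?] */
static size_t sk_encode_update(uint8_t *buf, uint64_t objno,
			       uint8_t new_state, const uint8_t *current)
{
	uint8_t *p = buf;

	p = sk_put_le64(p, objno);
	p = sk_put_le64(p, objno + 1);	/* half-open range [objno, objno+1) */
	*p++ = new_state;
	*p++ = current ? 1 : 0;
	if (current)
		*p++ = *current;
	return (size_t)(p - buf);
}
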
+ */ + ret = ceph_cls_assert_locked(req, which++, RBD_LOCK_NAME, + CEPH_CLS_LOCK_EXCLUSIVE, "", ""); + if (ret) + return ret; + } + + ret = rbd_cls_object_map_update(req, which, obj_req->ex.oe_objno, + new_state, current_state); + if (ret) + return ret; + + ret = ceph_osdc_alloc_messages(req, GFP_NOIO); + if (ret) + return ret; + + ceph_osdc_start_request(osdc, req, false); + return 0; +} + static void prune_extents(struct ceph_file_extent *img_extents, u32 *num_img_extents, u64 overlap) { @@ -1764,11 +2291,13 @@ static int rbd_obj_calc_img_extents(struct rbd_obj_request *obj_req, return 0; } -static void rbd_osd_req_setup_data(struct rbd_obj_request *obj_req, u32 which) +static void rbd_osd_setup_data(struct ceph_osd_request *osd_req, int which) { + struct rbd_obj_request *obj_req = osd_req->r_priv; + switch (obj_req->img_request->data_type) { case OBJ_REQUEST_BIO: - osd_req_op_extent_osd_data_bio(obj_req->osd_req, which, + osd_req_op_extent_osd_data_bio(osd_req, which, &obj_req->bio_pos, obj_req->ex.oe_len); break; @@ -1777,7 +2306,7 @@ static void rbd_osd_req_setup_data(struct rbd_obj_request *obj_req, u32 which) rbd_assert(obj_req->bvec_pos.iter.bi_size == obj_req->ex.oe_len); rbd_assert(obj_req->bvec_idx == obj_req->bvec_count); - osd_req_op_extent_osd_data_bvec_pos(obj_req->osd_req, which, + osd_req_op_extent_osd_data_bvec_pos(osd_req, which, &obj_req->bvec_pos); break; default: @@ -1785,22 +2314,7 @@ static void rbd_osd_req_setup_data(struct rbd_obj_request *obj_req, u32 which) } } -static int rbd_obj_setup_read(struct rbd_obj_request *obj_req) -{ - obj_req->osd_req = __rbd_osd_req_create(obj_req, NULL, 1); - if (!obj_req->osd_req) - return -ENOMEM; - - osd_req_op_extent_init(obj_req->osd_req, 0, CEPH_OSD_OP_READ, - obj_req->ex.oe_off, obj_req->ex.oe_len, 0, 0); - rbd_osd_req_setup_data(obj_req, 0); - - rbd_osd_req_format_read(obj_req); - return 0; -} - -static int __rbd_obj_setup_stat(struct rbd_obj_request *obj_req, - unsigned int which) +static int rbd_osd_setup_stat(struct ceph_osd_request *osd_req, int which) { struct page **pages; @@ -1816,45 +2330,60 @@ static int __rbd_obj_setup_stat(struct rbd_obj_request *obj_req, if (IS_ERR(pages)) return PTR_ERR(pages); - osd_req_op_init(obj_req->osd_req, which, CEPH_OSD_OP_STAT, 0); - osd_req_op_raw_data_in_pages(obj_req->osd_req, which, pages, + osd_req_op_init(osd_req, which, CEPH_OSD_OP_STAT, 0); + osd_req_op_raw_data_in_pages(osd_req, which, pages, 8 + sizeof(struct ceph_timespec), 0, false, true); return 0; } -static int count_write_ops(struct rbd_obj_request *obj_req) +static int rbd_osd_setup_copyup(struct ceph_osd_request *osd_req, int which, + u32 bytes) +{ + struct rbd_obj_request *obj_req = osd_req->r_priv; + int ret; + + ret = osd_req_op_cls_init(osd_req, which, "rbd", "copyup"); + if (ret) + return ret; + + osd_req_op_cls_request_data_bvecs(osd_req, which, obj_req->copyup_bvecs, + obj_req->copyup_bvec_count, bytes); + return 0; +} + +static int rbd_obj_init_read(struct rbd_obj_request *obj_req) { - return 2; /* setallochint + write/writefull */ + obj_req->read_state = RBD_OBJ_READ_START; + return 0; } -static void __rbd_obj_setup_write(struct rbd_obj_request *obj_req, - unsigned int which) +static void __rbd_osd_setup_write_ops(struct ceph_osd_request *osd_req, + int which) { + struct rbd_obj_request *obj_req = osd_req->r_priv; struct rbd_device *rbd_dev = obj_req->img_request->rbd_dev; u16 opcode; - osd_req_op_alloc_hint_init(obj_req->osd_req, which++, - rbd_dev->layout.object_size, - rbd_dev->layout.object_size); + if 
(!use_object_map(rbd_dev) || + !(obj_req->flags & RBD_OBJ_FLAG_MAY_EXIST)) { + osd_req_op_alloc_hint_init(osd_req, which++, + rbd_dev->layout.object_size, + rbd_dev->layout.object_size); + } if (rbd_obj_is_entire(obj_req)) opcode = CEPH_OSD_OP_WRITEFULL; else opcode = CEPH_OSD_OP_WRITE; - osd_req_op_extent_init(obj_req->osd_req, which, opcode, + osd_req_op_extent_init(osd_req, which, opcode, obj_req->ex.oe_off, obj_req->ex.oe_len, 0, 0); - rbd_osd_req_setup_data(obj_req, which++); - - rbd_assert(which == obj_req->osd_req->r_num_ops); - rbd_osd_req_format_write(obj_req); + rbd_osd_setup_data(osd_req, which); } -static int rbd_obj_setup_write(struct rbd_obj_request *obj_req) +static int rbd_obj_init_write(struct rbd_obj_request *obj_req) { - unsigned int num_osd_ops, which = 0; - bool need_guard; int ret; /* reverse map the entire object onto the parent */ @@ -1862,24 +2391,10 @@ static int rbd_obj_setup_write(struct rbd_obj_request *obj_req) if (ret) return ret; - need_guard = rbd_obj_copyup_enabled(obj_req); - num_osd_ops = need_guard + count_write_ops(obj_req); - - obj_req->osd_req = rbd_osd_req_create(obj_req, num_osd_ops); - if (!obj_req->osd_req) - return -ENOMEM; - - if (need_guard) { - ret = __rbd_obj_setup_stat(obj_req, which++); - if (ret) - return ret; + if (rbd_obj_copyup_enabled(obj_req)) + obj_req->flags |= RBD_OBJ_FLAG_COPYUP_ENABLED; - obj_req->write_state = RBD_OBJ_WRITE_GUARD; - } else { - obj_req->write_state = RBD_OBJ_WRITE_FLAT; - } - - __rbd_obj_setup_write(obj_req, which); + obj_req->write_state = RBD_OBJ_WRITE_START; return 0; } @@ -1889,11 +2404,26 @@ static u16 truncate_or_zero_opcode(struct rbd_obj_request *obj_req) CEPH_OSD_OP_ZERO; } -static int rbd_obj_setup_discard(struct rbd_obj_request *obj_req) +static void __rbd_osd_setup_discard_ops(struct ceph_osd_request *osd_req, + int which) +{ + struct rbd_obj_request *obj_req = osd_req->r_priv; + + if (rbd_obj_is_entire(obj_req) && !obj_req->num_img_extents) { + rbd_assert(obj_req->flags & RBD_OBJ_FLAG_DELETION); + osd_req_op_init(osd_req, which, CEPH_OSD_OP_DELETE, 0); + } else { + osd_req_op_extent_init(osd_req, which, + truncate_or_zero_opcode(obj_req), + obj_req->ex.oe_off, obj_req->ex.oe_len, + 0, 0); + } +} + +static int rbd_obj_init_discard(struct rbd_obj_request *obj_req) { struct rbd_device *rbd_dev = obj_req->img_request->rbd_dev; - u64 off = obj_req->ex.oe_off; - u64 next_off = obj_req->ex.oe_off + obj_req->ex.oe_len; + u64 off, next_off; int ret; /* @@ -1906,10 +2436,17 @@ static int rbd_obj_setup_discard(struct rbd_obj_request *obj_req) */ if (rbd_dev->opts->alloc_size != rbd_dev->layout.object_size || !rbd_obj_is_tail(obj_req)) { - off = round_up(off, rbd_dev->opts->alloc_size); - next_off = round_down(next_off, rbd_dev->opts->alloc_size); + off = round_up(obj_req->ex.oe_off, rbd_dev->opts->alloc_size); + next_off = round_down(obj_req->ex.oe_off + obj_req->ex.oe_len, + rbd_dev->opts->alloc_size); if (off >= next_off) return 1; + + dout("%s %p %llu~%llu -> %llu~%llu\n", __func__, + obj_req, obj_req->ex.oe_off, obj_req->ex.oe_len, + off, next_off - off); + obj_req->ex.oe_off = off; + obj_req->ex.oe_len = next_off - off; } /* reverse map the entire object onto the parent */ @@ -1917,52 +2454,29 @@ static int rbd_obj_setup_discard(struct rbd_obj_request *obj_req) if (ret) return ret; - obj_req->osd_req = rbd_osd_req_create(obj_req, 1); - if (!obj_req->osd_req) - return -ENOMEM; - - if (rbd_obj_is_entire(obj_req) && !obj_req->num_img_extents) { - osd_req_op_init(obj_req->osd_req, 0, CEPH_OSD_OP_DELETE, 0); - 
} else { - dout("%s %p %llu~%llu -> %llu~%llu\n", __func__, - obj_req, obj_req->ex.oe_off, obj_req->ex.oe_len, - off, next_off - off); - osd_req_op_extent_init(obj_req->osd_req, 0, - truncate_or_zero_opcode(obj_req), - off, next_off - off, 0, 0); - } + obj_req->flags |= RBD_OBJ_FLAG_NOOP_FOR_NONEXISTENT; + if (rbd_obj_is_entire(obj_req) && !obj_req->num_img_extents) + obj_req->flags |= RBD_OBJ_FLAG_DELETION; - obj_req->write_state = RBD_OBJ_WRITE_FLAT; - rbd_osd_req_format_write(obj_req); + obj_req->write_state = RBD_OBJ_WRITE_START; return 0; } -static int count_zeroout_ops(struct rbd_obj_request *obj_req) -{ - int num_osd_ops; - - if (rbd_obj_is_entire(obj_req) && obj_req->num_img_extents && - !rbd_obj_copyup_enabled(obj_req)) - num_osd_ops = 2; /* create + truncate */ - else - num_osd_ops = 1; /* delete/truncate/zero */ - - return num_osd_ops; -} - -static void __rbd_obj_setup_zeroout(struct rbd_obj_request *obj_req, - unsigned int which) +static void __rbd_osd_setup_zeroout_ops(struct ceph_osd_request *osd_req, + int which) { + struct rbd_obj_request *obj_req = osd_req->r_priv; u16 opcode; if (rbd_obj_is_entire(obj_req)) { if (obj_req->num_img_extents) { - if (!rbd_obj_copyup_enabled(obj_req)) - osd_req_op_init(obj_req->osd_req, which++, + if (!(obj_req->flags & RBD_OBJ_FLAG_COPYUP_ENABLED)) + osd_req_op_init(osd_req, which++, CEPH_OSD_OP_CREATE, 0); opcode = CEPH_OSD_OP_TRUNCATE; } else { - osd_req_op_init(obj_req->osd_req, which++, + rbd_assert(obj_req->flags & RBD_OBJ_FLAG_DELETION); + osd_req_op_init(osd_req, which++, CEPH_OSD_OP_DELETE, 0); opcode = 0; } @@ -1971,18 +2485,13 @@ static void __rbd_obj_setup_zeroout(struct rbd_obj_request *obj_req, } if (opcode) - osd_req_op_extent_init(obj_req->osd_req, which++, opcode, + osd_req_op_extent_init(osd_req, which, opcode, obj_req->ex.oe_off, obj_req->ex.oe_len, 0, 0); - - rbd_assert(which == obj_req->osd_req->r_num_ops); - rbd_osd_req_format_write(obj_req); } -static int rbd_obj_setup_zeroout(struct rbd_obj_request *obj_req) +static int rbd_obj_init_zeroout(struct rbd_obj_request *obj_req) { - unsigned int num_osd_ops, which = 0; - bool need_guard; int ret; /* reverse map the entire object onto the parent */ @@ -1990,31 +2499,66 @@ static int rbd_obj_setup_zeroout(struct rbd_obj_request *obj_req) if (ret) return ret; - need_guard = rbd_obj_copyup_enabled(obj_req); - num_osd_ops = need_guard + count_zeroout_ops(obj_req); + if (rbd_obj_copyup_enabled(obj_req)) + obj_req->flags |= RBD_OBJ_FLAG_COPYUP_ENABLED; + if (!obj_req->num_img_extents) { + obj_req->flags |= RBD_OBJ_FLAG_NOOP_FOR_NONEXISTENT; + if (rbd_obj_is_entire(obj_req)) + obj_req->flags |= RBD_OBJ_FLAG_DELETION; + } - obj_req->osd_req = rbd_osd_req_create(obj_req, num_osd_ops); - if (!obj_req->osd_req) - return -ENOMEM; + obj_req->write_state = RBD_OBJ_WRITE_START; + return 0; +} - if (need_guard) { - ret = __rbd_obj_setup_stat(obj_req, which++); - if (ret) - return ret; +static int count_write_ops(struct rbd_obj_request *obj_req) +{ + struct rbd_img_request *img_req = obj_req->img_request; - obj_req->write_state = RBD_OBJ_WRITE_GUARD; - } else { - obj_req->write_state = RBD_OBJ_WRITE_FLAT; + switch (img_req->op_type) { + case OBJ_OP_WRITE: + if (!use_object_map(img_req->rbd_dev) || + !(obj_req->flags & RBD_OBJ_FLAG_MAY_EXIST)) + return 2; /* setallochint + write/writefull */ + + return 1; /* write/writefull */ + case OBJ_OP_DISCARD: + return 1; /* delete/truncate/zero */ + case OBJ_OP_ZEROOUT: + if (rbd_obj_is_entire(obj_req) && obj_req->num_img_extents && + !(obj_req->flags & 
RBD_OBJ_FLAG_COPYUP_ENABLED)) + return 2; /* create + truncate */ + + return 1; /* delete/truncate/zero */ + default: + BUG(); } +} - __rbd_obj_setup_zeroout(obj_req, which); - return 0; +static void rbd_osd_setup_write_ops(struct ceph_osd_request *osd_req, + int which) +{ + struct rbd_obj_request *obj_req = osd_req->r_priv; + + switch (obj_req->img_request->op_type) { + case OBJ_OP_WRITE: + __rbd_osd_setup_write_ops(osd_req, which); + break; + case OBJ_OP_DISCARD: + __rbd_osd_setup_discard_ops(osd_req, which); + break; + case OBJ_OP_ZEROOUT: + __rbd_osd_setup_zeroout_ops(osd_req, which); + break; + default: + BUG(); + } } /* - * For each object request in @img_req, allocate an OSD request, add - * individual OSD ops and prepare them for submission. The number of - * OSD ops depends on op_type and the overlap point (if any). + * Prune the list of object requests (adjust offset and/or length, drop + * redundant requests). Prepare object request state machines and image + * request state machine for execution. */ static int __rbd_img_fill_request(struct rbd_img_request *img_req) { @@ -2024,16 +2568,16 @@ static int __rbd_img_fill_request(struct rbd_img_request *img_req) for_each_obj_request_safe(img_req, obj_req, next_obj_req) { switch (img_req->op_type) { case OBJ_OP_READ: - ret = rbd_obj_setup_read(obj_req); + ret = rbd_obj_init_read(obj_req); break; case OBJ_OP_WRITE: - ret = rbd_obj_setup_write(obj_req); + ret = rbd_obj_init_write(obj_req); break; case OBJ_OP_DISCARD: - ret = rbd_obj_setup_discard(obj_req); + ret = rbd_obj_init_discard(obj_req); break; case OBJ_OP_ZEROOUT: - ret = rbd_obj_setup_zeroout(obj_req); + ret = rbd_obj_init_zeroout(obj_req); break; default: BUG(); @@ -2041,17 +2585,12 @@ static int __rbd_img_fill_request(struct rbd_img_request *img_req) if (ret < 0) return ret; if (ret > 0) { - img_req->xferred += obj_req->ex.oe_len; - img_req->pending_count--; rbd_img_obj_request_del(img_req, obj_req); continue; } - - ret = ceph_osdc_alloc_messages(obj_req->osd_req, GFP_NOIO); - if (ret) - return ret; } + img_req->state = RBD_IMG_START; return 0; } @@ -2340,17 +2879,55 @@ static int rbd_img_fill_from_bvecs(struct rbd_img_request *img_req, &it); } -static void rbd_img_request_submit(struct rbd_img_request *img_request) +static void rbd_img_handle_request_work(struct work_struct *work) { - struct rbd_obj_request *obj_request; + struct rbd_img_request *img_req = + container_of(work, struct rbd_img_request, work); - dout("%s: img %p\n", __func__, img_request); + rbd_img_handle_request(img_req, img_req->work_result); +} - rbd_img_request_get(img_request); - for_each_obj_request(img_request, obj_request) - rbd_obj_request_submit(obj_request); +static void rbd_img_schedule(struct rbd_img_request *img_req, int result) +{ + INIT_WORK(&img_req->work, rbd_img_handle_request_work); + img_req->work_result = result; + queue_work(rbd_wq, &img_req->work); +} - rbd_img_request_put(img_request); +static bool rbd_obj_may_exist(struct rbd_obj_request *obj_req) +{ + struct rbd_device *rbd_dev = obj_req->img_request->rbd_dev; + + if (rbd_object_map_may_exist(rbd_dev, obj_req->ex.oe_objno)) { + obj_req->flags |= RBD_OBJ_FLAG_MAY_EXIST; + return true; + } + + dout("%s %p objno %llu assuming dne\n", __func__, obj_req, + obj_req->ex.oe_objno); + return false; +} + +static int rbd_obj_read_object(struct rbd_obj_request *obj_req) +{ + struct ceph_osd_request *osd_req; + int ret; + + osd_req = __rbd_obj_add_osd_request(obj_req, NULL, 1); + if (IS_ERR(osd_req)) + return PTR_ERR(osd_req); + + 
osd_req_op_extent_init(osd_req, 0, CEPH_OSD_OP_READ, + obj_req->ex.oe_off, obj_req->ex.oe_len, 0, 0); + rbd_osd_setup_data(osd_req, 0); + rbd_osd_format_read(osd_req); + + ret = ceph_osdc_alloc_messages(osd_req, GFP_NOIO); + if (ret) + return ret; + + rbd_osd_submit(osd_req); + return 0; } static int rbd_obj_read_from_parent(struct rbd_obj_request *obj_req) @@ -2396,51 +2973,144 @@ static int rbd_obj_read_from_parent(struct rbd_obj_request *obj_req) return ret; } - rbd_img_request_submit(child_img_req); + /* avoid parent chain recursion */ + rbd_img_schedule(child_img_req, 0); return 0; } -static bool rbd_obj_handle_read(struct rbd_obj_request *obj_req) +static bool rbd_obj_advance_read(struct rbd_obj_request *obj_req, int *result) { struct rbd_device *rbd_dev = obj_req->img_request->rbd_dev; int ret; - if (obj_req->result == -ENOENT && - rbd_dev->parent_overlap && !obj_req->tried_parent) { - /* reverse map this object extent onto the parent */ - ret = rbd_obj_calc_img_extents(obj_req, false); +again: + switch (obj_req->read_state) { + case RBD_OBJ_READ_START: + rbd_assert(!*result); + + if (!rbd_obj_may_exist(obj_req)) { + *result = -ENOENT; + obj_req->read_state = RBD_OBJ_READ_OBJECT; + goto again; + } + + ret = rbd_obj_read_object(obj_req); if (ret) { - obj_req->result = ret; + *result = ret; return true; } - - if (obj_req->num_img_extents) { - obj_req->tried_parent = true; - ret = rbd_obj_read_from_parent(obj_req); + obj_req->read_state = RBD_OBJ_READ_OBJECT; + return false; + case RBD_OBJ_READ_OBJECT: + if (*result == -ENOENT && rbd_dev->parent_overlap) { + /* reverse map this object extent onto the parent */ + ret = rbd_obj_calc_img_extents(obj_req, false); if (ret) { - obj_req->result = ret; + *result = ret; return true; } - return false; + if (obj_req->num_img_extents) { + ret = rbd_obj_read_from_parent(obj_req); + if (ret) { + *result = ret; + return true; + } + obj_req->read_state = RBD_OBJ_READ_PARENT; + return false; + } + } + + /* + * -ENOENT means a hole in the image -- zero-fill the entire + * length of the request. A short read also implies zero-fill + * to the end of the request. + */ + if (*result == -ENOENT) { + rbd_obj_zero_range(obj_req, 0, obj_req->ex.oe_len); + *result = 0; + } else if (*result >= 0) { + if (*result < obj_req->ex.oe_len) + rbd_obj_zero_range(obj_req, *result, + obj_req->ex.oe_len - *result); + else + rbd_assert(*result == obj_req->ex.oe_len); + *result = 0; } + return true; + case RBD_OBJ_READ_PARENT: + return true; + default: + BUG(); } +} - /* - * -ENOENT means a hole in the image -- zero-fill the entire - * length of the request. A short read also implies zero-fill - * to the end of the request. In both cases we update xferred - * count to indicate the whole request was satisfied. 
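
Setting aside the parent-image fallback, the zero-fill rules of the read machine above reduce to: -ENOENT is a hole (the whole extent reads as zeros) and a short read is zeros from the returned length to the end. A self-contained sketch:

#include <errno.h>
#include <stdint.h>

static int sk_finish_read(int result, uint32_t len,
			  void (*zero_range)(uint32_t off, uint32_t cnt))
{
	if (result == -ENOENT) {
		zero_range(0, len);	/* hole: entire extent is zeros */
		return 0;
	}
	if (result >= 0) {
		if ((uint32_t)result < len)	/* short read: zero the tail */
			zero_range((uint32_t)result,
				   len - (uint32_t)result);
		return 0;
	}
	return result;			/* a real error */
}
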
- */ - if (obj_req->result == -ENOENT || - (!obj_req->result && obj_req->xferred < obj_req->ex.oe_len)) { - rbd_assert(!obj_req->xferred || !obj_req->result); - rbd_obj_zero_range(obj_req, obj_req->xferred, - obj_req->ex.oe_len - obj_req->xferred); - obj_req->result = 0; - obj_req->xferred = obj_req->ex.oe_len; +static bool rbd_obj_write_is_noop(struct rbd_obj_request *obj_req) +{ + struct rbd_device *rbd_dev = obj_req->img_request->rbd_dev; + + if (rbd_object_map_may_exist(rbd_dev, obj_req->ex.oe_objno)) + obj_req->flags |= RBD_OBJ_FLAG_MAY_EXIST; + + if (!(obj_req->flags & RBD_OBJ_FLAG_MAY_EXIST) && + (obj_req->flags & RBD_OBJ_FLAG_NOOP_FOR_NONEXISTENT)) { + dout("%s %p noop for nonexistent\n", __func__, obj_req); + return true; } - return true; + return false; +} + +/* + * Return: + * 0 - object map update sent + * 1 - object map update isn't needed + * <0 - error + */ +static int rbd_obj_write_pre_object_map(struct rbd_obj_request *obj_req) +{ + struct rbd_device *rbd_dev = obj_req->img_request->rbd_dev; + u8 new_state; + + if (!(rbd_dev->header.features & RBD_FEATURE_OBJECT_MAP)) + return 1; + + if (obj_req->flags & RBD_OBJ_FLAG_DELETION) + new_state = OBJECT_PENDING; + else + new_state = OBJECT_EXISTS; + + return rbd_object_map_update(obj_req, CEPH_NOSNAP, new_state, NULL); +} + +static int rbd_obj_write_object(struct rbd_obj_request *obj_req) +{ + struct ceph_osd_request *osd_req; + int num_ops = count_write_ops(obj_req); + int which = 0; + int ret; + + if (obj_req->flags & RBD_OBJ_FLAG_COPYUP_ENABLED) + num_ops++; /* stat */ + + osd_req = rbd_obj_add_osd_request(obj_req, num_ops); + if (IS_ERR(osd_req)) + return PTR_ERR(osd_req); + + if (obj_req->flags & RBD_OBJ_FLAG_COPYUP_ENABLED) { + ret = rbd_osd_setup_stat(osd_req, which++); + if (ret) + return ret; + } + + rbd_osd_setup_write_ops(osd_req, which); + rbd_osd_format_write(osd_req); + + ret = ceph_osdc_alloc_messages(osd_req, GFP_NOIO); + if (ret) + return ret; + + rbd_osd_submit(osd_req); + return 0; } /* @@ -2463,123 +3133,67 @@ static bool is_zero_bvecs(struct bio_vec *bvecs, u32 bytes) #define MODS_ONLY U32_MAX -static int rbd_obj_issue_copyup_empty_snapc(struct rbd_obj_request *obj_req, - u32 bytes) +static int rbd_obj_copyup_empty_snapc(struct rbd_obj_request *obj_req, + u32 bytes) { + struct ceph_osd_request *osd_req; int ret; dout("%s obj_req %p bytes %u\n", __func__, obj_req, bytes); - rbd_assert(obj_req->osd_req->r_ops[0].op == CEPH_OSD_OP_STAT); rbd_assert(bytes > 0 && bytes != MODS_ONLY); - rbd_osd_req_destroy(obj_req->osd_req); - obj_req->osd_req = __rbd_osd_req_create(obj_req, &rbd_empty_snapc, 1); - if (!obj_req->osd_req) - return -ENOMEM; + osd_req = __rbd_obj_add_osd_request(obj_req, &rbd_empty_snapc, 1); + if (IS_ERR(osd_req)) + return PTR_ERR(osd_req); - ret = osd_req_op_cls_init(obj_req->osd_req, 0, "rbd", "copyup"); + ret = rbd_osd_setup_copyup(osd_req, 0, bytes); if (ret) return ret; - osd_req_op_cls_request_data_bvecs(obj_req->osd_req, 0, - obj_req->copyup_bvecs, - obj_req->copyup_bvec_count, - bytes); - rbd_osd_req_format_write(obj_req); + rbd_osd_format_write(osd_req); - ret = ceph_osdc_alloc_messages(obj_req->osd_req, GFP_NOIO); + ret = ceph_osdc_alloc_messages(osd_req, GFP_NOIO); if (ret) return ret; - rbd_obj_request_submit(obj_req); + rbd_osd_submit(osd_req); return 0; } -static int rbd_obj_issue_copyup_ops(struct rbd_obj_request *obj_req, u32 bytes) +static int rbd_obj_copyup_current_snapc(struct rbd_obj_request *obj_req, + u32 bytes) { - struct rbd_img_request *img_req = obj_req->img_request; - 
unsigned int num_osd_ops = (bytes != MODS_ONLY); - unsigned int which = 0; + struct ceph_osd_request *osd_req; + int num_ops = count_write_ops(obj_req); + int which = 0; int ret; dout("%s obj_req %p bytes %u\n", __func__, obj_req, bytes); - rbd_assert(obj_req->osd_req->r_ops[0].op == CEPH_OSD_OP_STAT || - obj_req->osd_req->r_ops[0].op == CEPH_OSD_OP_CALL); - rbd_osd_req_destroy(obj_req->osd_req); - switch (img_req->op_type) { - case OBJ_OP_WRITE: - num_osd_ops += count_write_ops(obj_req); - break; - case OBJ_OP_ZEROOUT: - num_osd_ops += count_zeroout_ops(obj_req); - break; - default: - BUG(); - } + if (bytes != MODS_ONLY) + num_ops++; /* copyup */ - obj_req->osd_req = rbd_osd_req_create(obj_req, num_osd_ops); - if (!obj_req->osd_req) - return -ENOMEM; + osd_req = rbd_obj_add_osd_request(obj_req, num_ops); + if (IS_ERR(osd_req)) + return PTR_ERR(osd_req); if (bytes != MODS_ONLY) { - ret = osd_req_op_cls_init(obj_req->osd_req, which, "rbd", - "copyup"); + ret = rbd_osd_setup_copyup(osd_req, which++, bytes); if (ret) return ret; - - osd_req_op_cls_request_data_bvecs(obj_req->osd_req, which++, - obj_req->copyup_bvecs, - obj_req->copyup_bvec_count, - bytes); } - switch (img_req->op_type) { - case OBJ_OP_WRITE: - __rbd_obj_setup_write(obj_req, which); - break; - case OBJ_OP_ZEROOUT: - __rbd_obj_setup_zeroout(obj_req, which); - break; - default: - BUG(); - } + rbd_osd_setup_write_ops(osd_req, which); + rbd_osd_format_write(osd_req); - ret = ceph_osdc_alloc_messages(obj_req->osd_req, GFP_NOIO); + ret = ceph_osdc_alloc_messages(osd_req, GFP_NOIO); if (ret) return ret; - rbd_obj_request_submit(obj_req); + rbd_osd_submit(osd_req); return 0; } -static int rbd_obj_issue_copyup(struct rbd_obj_request *obj_req, u32 bytes) -{ - /* - * Only send non-zero copyup data to save some I/O and network - * bandwidth -- zero copyup data is equivalent to the object not - * existing. - */ - if (is_zero_bvecs(obj_req->copyup_bvecs, bytes)) { - dout("%s obj_req %p detected zeroes\n", __func__, obj_req); - bytes = 0; - } - - if (obj_req->img_request->snapc->num_snaps && bytes > 0) { - /* - * Send a copyup request with an empty snapshot context to - * deep-copyup the object through all existing snapshots. - * A second request with the current snapshot context will be - * sent for the actual modification. - */ - obj_req->write_state = RBD_OBJ_WRITE_COPYUP_EMPTY_SNAPC; - return rbd_obj_issue_copyup_empty_snapc(obj_req, bytes); - } - - obj_req->write_state = RBD_OBJ_WRITE_COPYUP_OPS; - return rbd_obj_issue_copyup_ops(obj_req, bytes); -} - static int setup_copyup_bvecs(struct rbd_obj_request *obj_req, u64 obj_overlap) { u32 i; @@ -2608,7 +3222,12 @@ static int setup_copyup_bvecs(struct rbd_obj_request *obj_req, u64 obj_overlap) return 0; } -static int rbd_obj_handle_write_guard(struct rbd_obj_request *obj_req) +/* + * The target object doesn't exist. Read the data for the entire + * target object up to the overlap point (if any) from the parent, + * so we can use it for a copyup. + */ +static int rbd_obj_copyup_read_parent(struct rbd_obj_request *obj_req) { struct rbd_device *rbd_dev = obj_req->img_request->rbd_dev; int ret; @@ -2623,178 +3242,492 @@ static int rbd_obj_handle_write_guard(struct rbd_obj_request *obj_req) * request -- pass MODS_ONLY since the copyup isn't needed * anymore. 
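
MODS_ONLY (U32_MAX) acts throughout these helpers as a sentinel meaning "send only the modification ops, no copyup op at all"; a zero-length copyup, by contrast, is still sent when the parent data turned out to be all zeros. The op-count arithmetic in miniature:

#include <stdint.h>

#define SK_MODS_ONLY UINT32_MAX		/* sentinel: skip the copyup op */

/* One optional copyup op in front of the usual write op(s); note that
 * bytes == 0 still sends a zero-length copyup. */
static int sk_copyup_num_ops(uint32_t bytes, int write_ops)
{
	return (bytes != SK_MODS_ONLY) + write_ops;
}
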
*/ - obj_req->write_state = RBD_OBJ_WRITE_COPYUP_OPS; - return rbd_obj_issue_copyup_ops(obj_req, MODS_ONLY); + return rbd_obj_copyup_current_snapc(obj_req, MODS_ONLY); } ret = setup_copyup_bvecs(obj_req, rbd_obj_img_extents_bytes(obj_req)); if (ret) return ret; - obj_req->write_state = RBD_OBJ_WRITE_READ_FROM_PARENT; return rbd_obj_read_from_parent(obj_req); } -static bool rbd_obj_handle_write(struct rbd_obj_request *obj_req) +static void rbd_obj_copyup_object_maps(struct rbd_obj_request *obj_req) { + struct rbd_device *rbd_dev = obj_req->img_request->rbd_dev; + struct ceph_snap_context *snapc = obj_req->img_request->snapc; + u8 new_state; + u32 i; int ret; - switch (obj_req->write_state) { - case RBD_OBJ_WRITE_GUARD: - rbd_assert(!obj_req->xferred); - if (obj_req->result == -ENOENT) { - /* - * The target object doesn't exist. Read the data for - * the entire target object up to the overlap point (if - * any) from the parent, so we can use it for a copyup. - */ - ret = rbd_obj_handle_write_guard(obj_req); - if (ret) { - obj_req->result = ret; - return true; - } - return false; + rbd_assert(!obj_req->pending.result && !obj_req->pending.num_pending); + + if (!(rbd_dev->header.features & RBD_FEATURE_OBJECT_MAP)) + return; + + if (obj_req->flags & RBD_OBJ_FLAG_COPYUP_ZEROS) + return; + + for (i = 0; i < snapc->num_snaps; i++) { + if ((rbd_dev->header.features & RBD_FEATURE_FAST_DIFF) && + i + 1 < snapc->num_snaps) + new_state = OBJECT_EXISTS_CLEAN; + else + new_state = OBJECT_EXISTS; + + ret = rbd_object_map_update(obj_req, snapc->snaps[i], + new_state, NULL); + if (ret < 0) { + obj_req->pending.result = ret; + return; } - /* fall through */ - case RBD_OBJ_WRITE_FLAT: - case RBD_OBJ_WRITE_COPYUP_OPS: - if (!obj_req->result) - /* - * There is no such thing as a successful short - * write -- indicate the whole request was satisfied. - */ - obj_req->xferred = obj_req->ex.oe_len; - return true; - case RBD_OBJ_WRITE_READ_FROM_PARENT: - if (obj_req->result) - return true; - rbd_assert(obj_req->xferred); - ret = rbd_obj_issue_copyup(obj_req, obj_req->xferred); + rbd_assert(!ret); + obj_req->pending.num_pending++; + } +} + +static void rbd_obj_copyup_write_object(struct rbd_obj_request *obj_req) +{ + u32 bytes = rbd_obj_img_extents_bytes(obj_req); + int ret; + + rbd_assert(!obj_req->pending.result && !obj_req->pending.num_pending); + + /* + * Only send non-zero copyup data to save some I/O and network + * bandwidth -- zero copyup data is equivalent to the object not + * existing. + */ + if (obj_req->flags & RBD_OBJ_FLAG_COPYUP_ZEROS) + bytes = 0; + + if (obj_req->img_request->snapc->num_snaps && bytes > 0) { + /* + * Send a copyup request with an empty snapshot context to + * deep-copyup the object through all existing snapshots. + * A second request with the current snapshot context will be + * sent for the actual modification. 
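rbd_obj_copyup_object_maps() and rbd_obj_copyup_write_object() fan out several asynchronous sub-requests and account for them in obj_req->pending, a counter plus the first error seen. The helper that folds completions back in, pending_result_dec(), is not shown in this hunk; the following user-space sketch models the same fan-out/fan-in bookkeeping under that assumption.

#include <stdbool.h>
#include <stdio.h>

struct pending_result {
        int result;             /* first nonzero completion result wins */
        int num_pending;        /* sub-requests still outstanding */
};

/* Record one completion; return true only when the last one arrives. */
static bool pending_result_dec(struct pending_result *pending, int *result)
{
        if (*result && !pending->result)
                pending->result = *result;      /* latch the first error */
        if (--pending->num_pending)
                return false;

        *result = pending->result;
        return true;
}

int main(void)
{
        struct pending_result pending = { .result = 0, .num_pending = 3 };
        int completions[3] = { 0, -5, 0 };      /* second sub-request fails (-EIO) */

        for (int i = 0; i < 3; i++) {
                int result = completions[i];

                if (pending_result_dec(&pending, &result))
                        printf("all done, aggregated result %d\n", result);
        }
        return 0;
}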
+ */ + ret = rbd_obj_copyup_empty_snapc(obj_req, bytes); + if (ret) { + obj_req->pending.result = ret; + return; + } + + obj_req->pending.num_pending++; + bytes = MODS_ONLY; + } + + ret = rbd_obj_copyup_current_snapc(obj_req, bytes); + if (ret) { + obj_req->pending.result = ret; + return; + } + + obj_req->pending.num_pending++; +} + +static bool rbd_obj_advance_copyup(struct rbd_obj_request *obj_req, int *result) +{ + struct rbd_device *rbd_dev = obj_req->img_request->rbd_dev; + int ret; + +again: + switch (obj_req->copyup_state) { + case RBD_OBJ_COPYUP_START: + rbd_assert(!*result); + + ret = rbd_obj_copyup_read_parent(obj_req); if (ret) { - obj_req->result = ret; - obj_req->xferred = 0; + *result = ret; return true; } + if (obj_req->num_img_extents) + obj_req->copyup_state = RBD_OBJ_COPYUP_READ_PARENT; + else + obj_req->copyup_state = RBD_OBJ_COPYUP_WRITE_OBJECT; return false; - case RBD_OBJ_WRITE_COPYUP_EMPTY_SNAPC: - if (obj_req->result) + case RBD_OBJ_COPYUP_READ_PARENT: + if (*result) return true; - obj_req->write_state = RBD_OBJ_WRITE_COPYUP_OPS; - ret = rbd_obj_issue_copyup_ops(obj_req, MODS_ONLY); - if (ret) { - obj_req->result = ret; + if (is_zero_bvecs(obj_req->copyup_bvecs, + rbd_obj_img_extents_bytes(obj_req))) { + dout("%s %p detected zeros\n", __func__, obj_req); + obj_req->flags |= RBD_OBJ_FLAG_COPYUP_ZEROS; + } + + rbd_obj_copyup_object_maps(obj_req); + if (!obj_req->pending.num_pending) { + *result = obj_req->pending.result; + obj_req->copyup_state = RBD_OBJ_COPYUP_OBJECT_MAPS; + goto again; + } + obj_req->copyup_state = __RBD_OBJ_COPYUP_OBJECT_MAPS; + return false; + case __RBD_OBJ_COPYUP_OBJECT_MAPS: + if (!pending_result_dec(&obj_req->pending, result)) + return false; + /* fall through */ + case RBD_OBJ_COPYUP_OBJECT_MAPS: + if (*result) { + rbd_warn(rbd_dev, "snap object map update failed: %d", + *result); return true; } + + rbd_obj_copyup_write_object(obj_req); + if (!obj_req->pending.num_pending) { + *result = obj_req->pending.result; + obj_req->copyup_state = RBD_OBJ_COPYUP_WRITE_OBJECT; + goto again; + } + obj_req->copyup_state = __RBD_OBJ_COPYUP_WRITE_OBJECT; return false; + case __RBD_OBJ_COPYUP_WRITE_OBJECT: + if (!pending_result_dec(&obj_req->pending, result)) + return false; + /* fall through */ + case RBD_OBJ_COPYUP_WRITE_OBJECT: + return true; default: BUG(); } } /* - * Returns true if @obj_req is completed, or false otherwise. 
+ * Return: + * 0 - object map update sent + * 1 - object map update isn't needed + * <0 - error */ -static bool __rbd_obj_handle_request(struct rbd_obj_request *obj_req) +static int rbd_obj_write_post_object_map(struct rbd_obj_request *obj_req) { - switch (obj_req->img_request->op_type) { - case OBJ_OP_READ: - return rbd_obj_handle_read(obj_req); - case OBJ_OP_WRITE: - return rbd_obj_handle_write(obj_req); - case OBJ_OP_DISCARD: - case OBJ_OP_ZEROOUT: - if (rbd_obj_handle_write(obj_req)) { + struct rbd_device *rbd_dev = obj_req->img_request->rbd_dev; + u8 current_state = OBJECT_PENDING; + + if (!(rbd_dev->header.features & RBD_FEATURE_OBJECT_MAP)) + return 1; + + if (!(obj_req->flags & RBD_OBJ_FLAG_DELETION)) + return 1; + + return rbd_object_map_update(obj_req, CEPH_NOSNAP, OBJECT_NONEXISTENT, + &current_state); +} + +static bool rbd_obj_advance_write(struct rbd_obj_request *obj_req, int *result) +{ + struct rbd_device *rbd_dev = obj_req->img_request->rbd_dev; + int ret; + +again: + switch (obj_req->write_state) { + case RBD_OBJ_WRITE_START: + rbd_assert(!*result); + + if (rbd_obj_write_is_noop(obj_req)) + return true; + + ret = rbd_obj_write_pre_object_map(obj_req); + if (ret < 0) { + *result = ret; + return true; + } + obj_req->write_state = RBD_OBJ_WRITE_PRE_OBJECT_MAP; + if (ret > 0) + goto again; + return false; + case RBD_OBJ_WRITE_PRE_OBJECT_MAP: + if (*result) { + rbd_warn(rbd_dev, "pre object map update failed: %d", + *result); + return true; + } + ret = rbd_obj_write_object(obj_req); + if (ret) { + *result = ret; + return true; + } + obj_req->write_state = RBD_OBJ_WRITE_OBJECT; + return false; + case RBD_OBJ_WRITE_OBJECT: + if (*result == -ENOENT) { + if (obj_req->flags & RBD_OBJ_FLAG_COPYUP_ENABLED) { + *result = 0; + obj_req->copyup_state = RBD_OBJ_COPYUP_START; + obj_req->write_state = __RBD_OBJ_WRITE_COPYUP; + goto again; + } /* - * Hide -ENOENT from delete/truncate/zero -- discarding - * a non-existent object is not a problem. + * On a non-existent object: + * delete - -ENOENT, truncate/zero - 0 */ - if (obj_req->result == -ENOENT) { - obj_req->result = 0; - obj_req->xferred = obj_req->ex.oe_len; - } + if (obj_req->flags & RBD_OBJ_FLAG_DELETION) + *result = 0; + } + if (*result) + return true; + + obj_req->write_state = RBD_OBJ_WRITE_COPYUP; + goto again; + case __RBD_OBJ_WRITE_COPYUP: + if (!rbd_obj_advance_copyup(obj_req, result)) + return false; + /* fall through */ + case RBD_OBJ_WRITE_COPYUP: + if (*result) { + rbd_warn(rbd_dev, "copyup failed: %d", *result); + return true; + } + ret = rbd_obj_write_post_object_map(obj_req); + if (ret < 0) { + *result = ret; + return true; + } + obj_req->write_state = RBD_OBJ_WRITE_POST_OBJECT_MAP; + if (ret > 0) + goto again; return false; + case RBD_OBJ_WRITE_POST_OBJECT_MAP: + if (*result) + rbd_warn(rbd_dev, "post object map update failed: %d", + *result); + return true; default: BUG(); } } -static void rbd_obj_end_request(struct rbd_obj_request *obj_req) +/* + * Return true if @obj_req is completed.
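rbd_obj_write_pre_object_map() and rbd_obj_write_post_object_map() share a tri-state return convention -- 0 means an update was sent and the state machine must wait for its completion, 1 means no update is needed, negative is an error -- and the "if (ret > 0) goto again;" steps above are what consume it. A compact user-space sketch of that dispatch shape, with invented step names:

#include <stdio.h>

enum step { STEP_START, STEP_UPDATE, STEP_DONE };

/* Tri-state helper: 0 = async work sent, 1 = nothing to do, <0 = error. */
static int maybe_send_update(int update_needed)
{
        return update_needed ? 0 : 1;
}

/* Return 1 when the request completes, 0 when waiting on a callback. */
static int advance(enum step *step, int *result, int update_needed)
{
        int ret;

again:
        switch (*step) {
        case STEP_START:
                ret = maybe_send_update(update_needed);
                if (ret < 0) {
                        *result = ret;
                        return 1;
                }
                *step = STEP_UPDATE;
                if (ret > 0)
                        goto again;     /* skipped: advance synchronously */
                return 0;               /* sent: resume from the completion */
        case STEP_UPDATE:
                *step = STEP_DONE;      /* *result carries the update status */
                return 1;
        default:
                return 1;
        }
}

int main(void)
{
        enum step step = STEP_START;
        int result = 0;

        while (!advance(&step, &result, 0))
                ;       /* update not needed: finishes without waiting */
        printf("result %d\n", result);
        return 0;
}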
+ */ +static bool __rbd_obj_handle_request(struct rbd_obj_request *obj_req, + int *result) { struct rbd_img_request *img_req = obj_req->img_request; + struct rbd_device *rbd_dev = img_req->rbd_dev; + bool done; - rbd_assert((!obj_req->result && - obj_req->xferred == obj_req->ex.oe_len) || - (obj_req->result < 0 && !obj_req->xferred)); - if (!obj_req->result) { - img_req->xferred += obj_req->xferred; - return; - } + mutex_lock(&obj_req->state_mutex); + if (!rbd_img_is_write(img_req)) + done = rbd_obj_advance_read(obj_req, result); + else + done = rbd_obj_advance_write(obj_req, result); + mutex_unlock(&obj_req->state_mutex); - rbd_warn(img_req->rbd_dev, - "%s at objno %llu %llu~%llu result %d xferred %llu", - obj_op_name(img_req->op_type), obj_req->ex.oe_objno, - obj_req->ex.oe_off, obj_req->ex.oe_len, obj_req->result, - obj_req->xferred); - if (!img_req->result) { - img_req->result = obj_req->result; - img_req->xferred = 0; + if (done && *result) { + rbd_assert(*result < 0); + rbd_warn(rbd_dev, "%s at objno %llu %llu~%llu result %d", + obj_op_name(img_req->op_type), obj_req->ex.oe_objno, + obj_req->ex.oe_off, obj_req->ex.oe_len, *result); } + return done; } -static void rbd_img_end_child_request(struct rbd_img_request *img_req) +/* + * This is open-coded in rbd_img_handle_request() to avoid parent chain + * recursion. + */ +static void rbd_obj_handle_request(struct rbd_obj_request *obj_req, int result) { - struct rbd_obj_request *obj_req = img_req->obj_request; + if (__rbd_obj_handle_request(obj_req, &result)) + rbd_img_handle_request(obj_req->img_request, result); +} - rbd_assert(test_bit(IMG_REQ_CHILD, &img_req->flags)); - rbd_assert((!img_req->result && - img_req->xferred == rbd_obj_img_extents_bytes(obj_req)) || - (img_req->result < 0 && !img_req->xferred)); +static bool need_exclusive_lock(struct rbd_img_request *img_req) +{ + struct rbd_device *rbd_dev = img_req->rbd_dev; - obj_req->result = img_req->result; - obj_req->xferred = img_req->xferred; - rbd_img_request_put(img_req); + if (!(rbd_dev->header.features & RBD_FEATURE_EXCLUSIVE_LOCK)) + return false; + + if (rbd_dev->spec->snap_id != CEPH_NOSNAP) + return false; + + rbd_assert(!test_bit(IMG_REQ_CHILD, &img_req->flags)); + if (rbd_dev->opts->lock_on_read || + (rbd_dev->header.features & RBD_FEATURE_OBJECT_MAP)) + return true; + + return rbd_img_is_write(img_req); } -static void rbd_img_end_request(struct rbd_img_request *img_req) +static bool rbd_lock_add_request(struct rbd_img_request *img_req) { - rbd_assert(!test_bit(IMG_REQ_CHILD, &img_req->flags)); - rbd_assert((!img_req->result && - img_req->xferred == blk_rq_bytes(img_req->rq)) || - (img_req->result < 0 && !img_req->xferred)); + struct rbd_device *rbd_dev = img_req->rbd_dev; + bool locked; + + lockdep_assert_held(&rbd_dev->lock_rwsem); + locked = rbd_dev->lock_state == RBD_LOCK_STATE_LOCKED; + spin_lock(&rbd_dev->lock_lists_lock); + rbd_assert(list_empty(&img_req->lock_item)); + if (!locked) + list_add_tail(&img_req->lock_item, &rbd_dev->acquiring_list); + else + list_add_tail(&img_req->lock_item, &rbd_dev->running_list); + spin_unlock(&rbd_dev->lock_lists_lock); + return locked; +} + +static void rbd_lock_del_request(struct rbd_img_request *img_req) +{ + struct rbd_device *rbd_dev = img_req->rbd_dev; + bool need_wakeup; - blk_mq_end_request(img_req->rq, - errno_to_blk_status(img_req->result)); - rbd_img_request_put(img_req); + lockdep_assert_held(&rbd_dev->lock_rwsem); + spin_lock(&rbd_dev->lock_lists_lock); + rbd_assert(!list_empty(&img_req->lock_item)); + 
list_del_init(&img_req->lock_item); + need_wakeup = (rbd_dev->lock_state == RBD_LOCK_STATE_RELEASING && + list_empty(&rbd_dev->running_list)); + spin_unlock(&rbd_dev->lock_lists_lock); + if (need_wakeup) + complete(&rbd_dev->releasing_wait); } -static void rbd_obj_handle_request(struct rbd_obj_request *obj_req) +static int rbd_img_exclusive_lock(struct rbd_img_request *img_req) { - struct rbd_img_request *img_req; + struct rbd_device *rbd_dev = img_req->rbd_dev; + + if (!need_exclusive_lock(img_req)) + return 1; + + if (rbd_lock_add_request(img_req)) + return 1; + + if (rbd_dev->opts->exclusive) { + WARN_ON(1); /* lock got released? */ + return -EROFS; + } + + /* + * Note the use of mod_delayed_work() in rbd_acquire_lock() + * and cancel_delayed_work() in wake_lock_waiters(). + */ + dout("%s rbd_dev %p queueing lock_dwork\n", __func__, rbd_dev); + queue_delayed_work(rbd_dev->task_wq, &rbd_dev->lock_dwork, 0); + return 0; +} + +static void rbd_img_object_requests(struct rbd_img_request *img_req) +{ + struct rbd_obj_request *obj_req; + + rbd_assert(!img_req->pending.result && !img_req->pending.num_pending); + + for_each_obj_request(img_req, obj_req) { + int result = 0; + + if (__rbd_obj_handle_request(obj_req, &result)) { + if (result) { + img_req->pending.result = result; + return; + } + } else { + img_req->pending.num_pending++; + } + } +} + +static bool rbd_img_advance(struct rbd_img_request *img_req, int *result) +{ + struct rbd_device *rbd_dev = img_req->rbd_dev; + int ret; again: - if (!__rbd_obj_handle_request(obj_req)) - return; + switch (img_req->state) { + case RBD_IMG_START: + rbd_assert(!*result); - img_req = obj_req->img_request; - spin_lock(&img_req->completion_lock); - rbd_obj_end_request(obj_req); - rbd_assert(img_req->pending_count); - if (--img_req->pending_count) { - spin_unlock(&img_req->completion_lock); - return; + ret = rbd_img_exclusive_lock(img_req); + if (ret < 0) { + *result = ret; + return true; + } + img_req->state = RBD_IMG_EXCLUSIVE_LOCK; + if (ret > 0) + goto again; + return false; + case RBD_IMG_EXCLUSIVE_LOCK: + if (*result) + return true; + + rbd_assert(!need_exclusive_lock(img_req) || + __rbd_is_lock_owner(rbd_dev)); + + rbd_img_object_requests(img_req); + if (!img_req->pending.num_pending) { + *result = img_req->pending.result; + img_req->state = RBD_IMG_OBJECT_REQUESTS; + goto again; + } + img_req->state = __RBD_IMG_OBJECT_REQUESTS; + return false; + case __RBD_IMG_OBJECT_REQUESTS: + if (!pending_result_dec(&img_req->pending, result)) + return false; + /* fall through */ + case RBD_IMG_OBJECT_REQUESTS: + return true; + default: + BUG(); + } +} + +/* + * Return true if @img_req is completed. + */ +static bool __rbd_img_handle_request(struct rbd_img_request *img_req, + int *result) +{ + struct rbd_device *rbd_dev = img_req->rbd_dev; + bool done; + + if (need_exclusive_lock(img_req)) { + down_read(&rbd_dev->lock_rwsem); + mutex_lock(&img_req->state_mutex); + done = rbd_img_advance(img_req, result); + if (done) + rbd_lock_del_request(img_req); + mutex_unlock(&img_req->state_mutex); + up_read(&rbd_dev->lock_rwsem); + } else { + mutex_lock(&img_req->state_mutex); + done = rbd_img_advance(img_req, result); + mutex_unlock(&img_req->state_mutex); + } + + if (done && *result) { + rbd_assert(*result < 0); + rbd_warn(rbd_dev, "%s%s result %d", + test_bit(IMG_REQ_CHILD, &img_req->flags) ? 
"child " : "", + obj_op_name(img_req->op_type), *result); } + return done; +} + +static void rbd_img_handle_request(struct rbd_img_request *img_req, int result) +{ +again: + if (!__rbd_img_handle_request(img_req, &result)) + return; - spin_unlock(&img_req->completion_lock); if (test_bit(IMG_REQ_CHILD, &img_req->flags)) { - obj_req = img_req->obj_request; - rbd_img_end_child_request(img_req); - goto again; + struct rbd_obj_request *obj_req = img_req->obj_request; + + rbd_img_request_put(img_req); + if (__rbd_obj_handle_request(obj_req, &result)) { + img_req = obj_req->img_request; + goto again; + } + } else { + struct request *rq = img_req->rq; + + rbd_img_request_put(img_req); + blk_mq_end_request(rq, errno_to_blk_status(result)); } - rbd_img_end_request(img_req); } static const struct rbd_client_id rbd_empty_cid; @@ -2839,6 +3772,7 @@ static void __rbd_lock(struct rbd_device *rbd_dev, const char *cookie) { struct rbd_client_id cid = rbd_get_cid(rbd_dev); + rbd_dev->lock_state = RBD_LOCK_STATE_LOCKED; strcpy(rbd_dev->lock_cookie, cookie); rbd_set_owner_cid(rbd_dev, &cid); queue_work(rbd_dev->task_wq, &rbd_dev->acquired_lock_work); @@ -2863,7 +3797,6 @@ static int rbd_lock(struct rbd_device *rbd_dev) if (ret) return ret; - rbd_dev->lock_state = RBD_LOCK_STATE_LOCKED; __rbd_lock(rbd_dev, cookie); return 0; } @@ -2882,7 +3815,7 @@ static void rbd_unlock(struct rbd_device *rbd_dev) ret = ceph_cls_unlock(osdc, &rbd_dev->header_oid, &rbd_dev->header_oloc, RBD_LOCK_NAME, rbd_dev->lock_cookie); if (ret && ret != -ENOENT) - rbd_warn(rbd_dev, "failed to unlock: %d", ret); + rbd_warn(rbd_dev, "failed to unlock header: %d", ret); /* treat errors as the image is unlocked */ rbd_dev->lock_state = RBD_LOCK_STATE_UNLOCKED; @@ -3009,15 +3942,34 @@ e_inval: goto out; } -static void wake_requests(struct rbd_device *rbd_dev, bool wake_all) +/* + * Either image request state machine(s) or rbd_add_acquire_lock() + * (i.e. "rbd map"). 
+ */ +static void wake_lock_waiters(struct rbd_device *rbd_dev, int result) { - dout("%s rbd_dev %p wake_all %d\n", __func__, rbd_dev, wake_all); + struct rbd_img_request *img_req; + + dout("%s rbd_dev %p result %d\n", __func__, rbd_dev, result); + lockdep_assert_held_write(&rbd_dev->lock_rwsem); cancel_delayed_work(&rbd_dev->lock_dwork); - if (wake_all) - wake_up_all(&rbd_dev->lock_waitq); - else - wake_up(&rbd_dev->lock_waitq); + if (!completion_done(&rbd_dev->acquire_wait)) { + rbd_assert(list_empty(&rbd_dev->acquiring_list) && + list_empty(&rbd_dev->running_list)); + rbd_dev->acquire_err = result; + complete_all(&rbd_dev->acquire_wait); + return; + } + + list_for_each_entry(img_req, &rbd_dev->acquiring_list, lock_item) { + mutex_lock(&img_req->state_mutex); + rbd_assert(img_req->state == RBD_IMG_EXCLUSIVE_LOCK); + rbd_img_schedule(img_req, result); + mutex_unlock(&img_req->state_mutex); + } + + list_splice_tail_init(&rbd_dev->acquiring_list, &rbd_dev->running_list); } static int get_lock_owner_info(struct rbd_device *rbd_dev, @@ -3132,13 +4084,10 @@ static int rbd_try_lock(struct rbd_device *rbd_dev) goto again; ret = find_watcher(rbd_dev, lockers); - if (ret) { - if (ret > 0) - ret = 0; /* have to request lock */ - goto out; - } + if (ret) + goto out; /* request lock or error */ - rbd_warn(rbd_dev, "%s%llu seems dead, breaking lock", + rbd_warn(rbd_dev, "breaking header lock owned by %s%llu", ENTITY_NAME(lockers[0].id.name)); ret = ceph_monc_blacklist_add(&client->monc, @@ -3165,53 +4114,90 @@ out: return ret; } +static int rbd_post_acquire_action(struct rbd_device *rbd_dev) +{ + int ret; + + if (rbd_dev->header.features & RBD_FEATURE_OBJECT_MAP) { + ret = rbd_object_map_open(rbd_dev); + if (ret) + return ret; + } + + return 0; +} + /* - * ret is set only if lock_state is RBD_LOCK_STATE_UNLOCKED + * Return: + * 0 - lock acquired + * 1 - caller should call rbd_request_lock() + * <0 - error */ -static enum rbd_lock_state rbd_try_acquire_lock(struct rbd_device *rbd_dev, - int *pret) +static int rbd_try_acquire_lock(struct rbd_device *rbd_dev) { - enum rbd_lock_state lock_state; + int ret; down_read(&rbd_dev->lock_rwsem); dout("%s rbd_dev %p read lock_state %d\n", __func__, rbd_dev, rbd_dev->lock_state); if (__rbd_is_lock_owner(rbd_dev)) { - lock_state = rbd_dev->lock_state; up_read(&rbd_dev->lock_rwsem); - return lock_state; + return 0; } up_read(&rbd_dev->lock_rwsem); down_write(&rbd_dev->lock_rwsem); dout("%s rbd_dev %p write lock_state %d\n", __func__, rbd_dev, rbd_dev->lock_state); - if (!__rbd_is_lock_owner(rbd_dev)) { - *pret = rbd_try_lock(rbd_dev); - if (*pret) - rbd_warn(rbd_dev, "failed to acquire lock: %d", *pret); + if (__rbd_is_lock_owner(rbd_dev)) { + up_write(&rbd_dev->lock_rwsem); + return 0; + } + + ret = rbd_try_lock(rbd_dev); + if (ret < 0) { + rbd_warn(rbd_dev, "failed to lock header: %d", ret); + if (ret == -EBLACKLISTED) + goto out; + + ret = 1; /* request lock anyway */ + } + if (ret > 0) { + up_write(&rbd_dev->lock_rwsem); + return ret; + } + + rbd_assert(rbd_dev->lock_state == RBD_LOCK_STATE_LOCKED); + rbd_assert(list_empty(&rbd_dev->running_list)); + + ret = rbd_post_acquire_action(rbd_dev); + if (ret) { + rbd_warn(rbd_dev, "post-acquire action failed: %d", ret); + /* + * Can't stay in RBD_LOCK_STATE_LOCKED because + * rbd_lock_add_request() would let the request through, + * assuming that e.g. object map is locked and loaded. 
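The rollback noted above is the classic acquire-then-setup idiom: if the post-acquire action fails, the lock must be dropped again, because leaving it held would let rbd_lock_add_request() wave requests through against state (such as the object map) that never got set up. A user-space sketch of the idiom, with stand-in functions and -5 standing in for -EIO:

#include <stdio.h>

static int take_lock(void)      { return 0; }
static void drop_lock(void)     { puts("lock rolled back"); }

/* Stand-in for e.g. loading an object map after taking the lock. */
static int post_acquire_setup(int should_fail)
{
        return should_fail ? -5 : 0;
}

static int acquire(int setup_fails)
{
        int ret = take_lock();

        if (ret)
                return ret;

        ret = post_acquire_setup(setup_fails);
        if (ret)
                drop_lock();    /* cannot stay locked with setup undone */
        return ret;
}

int main(void)
{
        printf("clean path: %d\n", acquire(0));
        printf("failing setup: %d\n", acquire(1));
        return 0;
}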
+ */ + rbd_unlock(rbd_dev); } - lock_state = rbd_dev->lock_state; +out: + wake_lock_waiters(rbd_dev, ret); up_write(&rbd_dev->lock_rwsem); - return lock_state; + return ret; } static void rbd_acquire_lock(struct work_struct *work) { struct rbd_device *rbd_dev = container_of(to_delayed_work(work), struct rbd_device, lock_dwork); - enum rbd_lock_state lock_state; - int ret = 0; + int ret; dout("%s rbd_dev %p\n", __func__, rbd_dev); again: - lock_state = rbd_try_acquire_lock(rbd_dev, &ret); - if (lock_state != RBD_LOCK_STATE_UNLOCKED || ret == -EBLACKLISTED) { - if (lock_state == RBD_LOCK_STATE_LOCKED) - wake_requests(rbd_dev, true); - dout("%s rbd_dev %p lock_state %d ret %d - done\n", __func__, - rbd_dev, lock_state, ret); + ret = rbd_try_acquire_lock(rbd_dev); + if (ret <= 0) { + dout("%s rbd_dev %p ret %d - done\n", __func__, rbd_dev, ret); return; } @@ -3220,16 +4206,9 @@ again: goto again; /* treat this as a dead client */ } else if (ret == -EROFS) { rbd_warn(rbd_dev, "peer will not release lock"); - /* - * If this is rbd_add_acquire_lock(), we want to fail - * immediately -- reuse BLACKLISTED flag. Otherwise we - * want to block. - */ - if (!(rbd_dev->disk->flags & GENHD_FL_UP)) { - set_bit(RBD_DEV_FLAG_BLACKLISTED, &rbd_dev->flags); - /* wake "rbd map --exclusive" process */ - wake_requests(rbd_dev, false); - } + down_write(&rbd_dev->lock_rwsem); + wake_lock_waiters(rbd_dev, ret); + up_write(&rbd_dev->lock_rwsem); } else if (ret < 0) { rbd_warn(rbd_dev, "error requesting lock: %d", ret); mod_delayed_work(rbd_dev->task_wq, &rbd_dev->lock_dwork, @@ -3246,43 +4225,67 @@ again: } } -/* - * lock_rwsem must be held for write - */ -static bool rbd_release_lock(struct rbd_device *rbd_dev) +static bool rbd_quiesce_lock(struct rbd_device *rbd_dev) { - dout("%s rbd_dev %p read lock_state %d\n", __func__, rbd_dev, - rbd_dev->lock_state); + bool need_wait; + + dout("%s rbd_dev %p\n", __func__, rbd_dev); + lockdep_assert_held_write(&rbd_dev->lock_rwsem); + if (rbd_dev->lock_state != RBD_LOCK_STATE_LOCKED) return false; - rbd_dev->lock_state = RBD_LOCK_STATE_RELEASING; - downgrade_write(&rbd_dev->lock_rwsem); /* * Ensure that all in-flight IO is flushed. - * - * FIXME: ceph_osdc_sync() flushes the entire OSD client, which - * may be shared with other devices. 
*/ - ceph_osdc_sync(&rbd_dev->rbd_client->client->osdc); + rbd_dev->lock_state = RBD_LOCK_STATE_RELEASING; + rbd_assert(!completion_done(&rbd_dev->releasing_wait)); + need_wait = !list_empty(&rbd_dev->running_list); + downgrade_write(&rbd_dev->lock_rwsem); + if (need_wait) + wait_for_completion(&rbd_dev->releasing_wait); up_read(&rbd_dev->lock_rwsem); down_write(&rbd_dev->lock_rwsem); - dout("%s rbd_dev %p write lock_state %d\n", __func__, rbd_dev, - rbd_dev->lock_state); if (rbd_dev->lock_state != RBD_LOCK_STATE_RELEASING) return false; + rbd_assert(list_empty(&rbd_dev->running_list)); + return true; +} + +static void rbd_pre_release_action(struct rbd_device *rbd_dev) +{ + if (rbd_dev->header.features & RBD_FEATURE_OBJECT_MAP) + rbd_object_map_close(rbd_dev); +} + +static void __rbd_release_lock(struct rbd_device *rbd_dev) +{ + rbd_assert(list_empty(&rbd_dev->running_list)); + + rbd_pre_release_action(rbd_dev); rbd_unlock(rbd_dev); +} + +/* + * lock_rwsem must be held for write + */ +static void rbd_release_lock(struct rbd_device *rbd_dev) +{ + if (!rbd_quiesce_lock(rbd_dev)) + return; + + __rbd_release_lock(rbd_dev); + /* * Give others a chance to grab the lock - we would re-acquire - * almost immediately if we got new IO during ceph_osdc_sync() - * otherwise. We need to ack our own notifications, so this - * lock_dwork will be requeued from rbd_wait_state_locked() - * after wake_requests() in rbd_handle_released_lock(). + * almost immediately if we got new IO while draining the running + * list otherwise. We need to ack our own notifications, so this + * lock_dwork will be requeued from rbd_handle_released_lock() by + * way of maybe_kick_acquire(). */ cancel_delayed_work(&rbd_dev->lock_dwork); - return true; } static void rbd_release_lock_work(struct work_struct *work) @@ -3295,6 +4298,23 @@ static void rbd_release_lock_work(struct work_struct *work) up_write(&rbd_dev->lock_rwsem); } +static void maybe_kick_acquire(struct rbd_device *rbd_dev) +{ + bool have_requests; + + dout("%s rbd_dev %p\n", __func__, rbd_dev); + if (__rbd_is_lock_owner(rbd_dev)) + return; + + spin_lock(&rbd_dev->lock_lists_lock); + have_requests = !list_empty(&rbd_dev->acquiring_list); + spin_unlock(&rbd_dev->lock_lists_lock); + if (have_requests || delayed_work_pending(&rbd_dev->lock_dwork)) { + dout("%s rbd_dev %p kicking lock_dwork\n", __func__, rbd_dev); + mod_delayed_work(rbd_dev->task_wq, &rbd_dev->lock_dwork, 0); + } +} + static void rbd_handle_acquired_lock(struct rbd_device *rbd_dev, u8 struct_v, void **p) { @@ -3324,8 +4344,7 @@ static void rbd_handle_acquired_lock(struct rbd_device *rbd_dev, u8 struct_v, down_read(&rbd_dev->lock_rwsem); } - if (!__rbd_is_lock_owner(rbd_dev)) - wake_requests(rbd_dev, false); + maybe_kick_acquire(rbd_dev); up_read(&rbd_dev->lock_rwsem); } @@ -3357,8 +4376,7 @@ static void rbd_handle_released_lock(struct rbd_device *rbd_dev, u8 struct_v, down_read(&rbd_dev->lock_rwsem); } - if (!__rbd_is_lock_owner(rbd_dev)) - wake_requests(rbd_dev, false); + maybe_kick_acquire(rbd_dev); up_read(&rbd_dev->lock_rwsem); } @@ -3608,7 +4626,6 @@ static void cancel_tasks_sync(struct rbd_device *rbd_dev) static void rbd_unregister_watch(struct rbd_device *rbd_dev) { - WARN_ON(waitqueue_active(&rbd_dev->lock_waitq)); cancel_tasks_sync(rbd_dev); mutex_lock(&rbd_dev->watch_mutex); @@ -3630,7 +4647,8 @@ static void rbd_reacquire_lock(struct rbd_device *rbd_dev) char cookie[32]; int ret; - WARN_ON(rbd_dev->lock_state != RBD_LOCK_STATE_LOCKED); + if (!rbd_quiesce_lock(rbd_dev)) + return; 
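rbd_quiesce_lock() replaces the old ceph_osdc_sync() call with a targeted drain: flip the lock to RELEASING so nothing new joins the running list, then, if the list is non-empty, sleep on releasing_wait until the final rbd_lock_del_request() fires the completion. A user-space model of that drain idiom using pthreads (build with -pthread); the variable names are illustrative:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t drained = PTHREAD_COND_INITIALIZER;
static int running;     /* stand-in for the running-list length */
static int releasing;   /* stand-in for RBD_LOCK_STATE_RELEASING */

static void *worker(void *arg)
{
        usleep(10000);  /* pretend to do I/O */
        pthread_mutex_lock(&lock);
        if (--running == 0 && releasing)
                pthread_cond_signal(&drained);  /* complete(releasing_wait) */
        pthread_mutex_unlock(&lock);
        return NULL;
}

int main(void)
{
        pthread_t tid;

        running = 1;
        pthread_create(&tid, NULL, worker, NULL);

        /* Quiesce: flip the state first so no new request joins, then wait. */
        pthread_mutex_lock(&lock);
        releasing = 1;
        while (running)
                pthread_cond_wait(&drained, &lock);
        pthread_mutex_unlock(&lock);

        printf("drained, safe to release the lock\n");
        pthread_join(tid, NULL);
        return 0;
}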
format_lock_cookie(rbd_dev, cookie); ret = ceph_cls_set_cookie(osdc, &rbd_dev->header_oid, @@ -3646,11 +4664,11 @@ static void rbd_reacquire_lock(struct rbd_device *rbd_dev) * Lock cookie cannot be updated on older OSDs, so do * a manual release and queue an acquire. */ - if (rbd_release_lock(rbd_dev)) - queue_delayed_work(rbd_dev->task_wq, - &rbd_dev->lock_dwork, 0); + __rbd_release_lock(rbd_dev); + queue_delayed_work(rbd_dev->task_wq, &rbd_dev->lock_dwork, 0); } else { __rbd_lock(rbd_dev, cookie); + wake_lock_waiters(rbd_dev, 0); } } @@ -3671,15 +4689,18 @@ static void rbd_reregister_watch(struct work_struct *work) ret = __rbd_register_watch(rbd_dev); if (ret) { rbd_warn(rbd_dev, "failed to reregister watch: %d", ret); - if (ret == -EBLACKLISTED || ret == -ENOENT) { - set_bit(RBD_DEV_FLAG_BLACKLISTED, &rbd_dev->flags); - wake_requests(rbd_dev, true); - } else { + if (ret != -EBLACKLISTED && ret != -ENOENT) { queue_delayed_work(rbd_dev->task_wq, &rbd_dev->watch_dwork, RBD_RETRY_DELAY); + mutex_unlock(&rbd_dev->watch_mutex); + return; } + mutex_unlock(&rbd_dev->watch_mutex); + down_write(&rbd_dev->lock_rwsem); + wake_lock_waiters(rbd_dev, ret); + up_write(&rbd_dev->lock_rwsem); return; } @@ -3742,7 +4763,7 @@ static int rbd_obj_method_sync(struct rbd_device *rbd_dev, ret = ceph_osdc_call(osdc, oid, oloc, RBD_DRV_NAME, method_name, CEPH_OSD_FLAG_READ, req_page, outbound_size, - reply_page, &inbound_size); + &reply_page, &inbound_size); if (!ret) { memcpy(inbound, page_address(reply_page), inbound_size); ret = inbound_size; @@ -3754,54 +4775,6 @@ static int rbd_obj_method_sync(struct rbd_device *rbd_dev, return ret; } -/* - * lock_rwsem must be held for read - */ -static int rbd_wait_state_locked(struct rbd_device *rbd_dev, bool may_acquire) -{ - DEFINE_WAIT(wait); - unsigned long timeout; - int ret = 0; - - if (test_bit(RBD_DEV_FLAG_BLACKLISTED, &rbd_dev->flags)) - return -EBLACKLISTED; - - if (rbd_dev->lock_state == RBD_LOCK_STATE_LOCKED) - return 0; - - if (!may_acquire) { - rbd_warn(rbd_dev, "exclusive lock required"); - return -EROFS; - } - - do { - /* - * Note the use of mod_delayed_work() in rbd_acquire_lock() - * and cancel_delayed_work() in wake_requests(). 
- */ - dout("%s rbd_dev %p queueing lock_dwork\n", __func__, rbd_dev); - queue_delayed_work(rbd_dev->task_wq, &rbd_dev->lock_dwork, 0); - prepare_to_wait_exclusive(&rbd_dev->lock_waitq, &wait, - TASK_UNINTERRUPTIBLE); - up_read(&rbd_dev->lock_rwsem); - timeout = schedule_timeout(ceph_timeout_jiffies( - rbd_dev->opts->lock_timeout)); - down_read(&rbd_dev->lock_rwsem); - if (test_bit(RBD_DEV_FLAG_BLACKLISTED, &rbd_dev->flags)) { - ret = -EBLACKLISTED; - break; - } - if (!timeout) { - rbd_warn(rbd_dev, "timed out waiting for lock"); - ret = -ETIMEDOUT; - break; - } - } while (rbd_dev->lock_state != RBD_LOCK_STATE_LOCKED); - - finish_wait(&rbd_dev->lock_waitq, &wait); - return ret; -} - static void rbd_queue_workfn(struct work_struct *work) { struct request *rq = blk_mq_rq_from_pdu(work); @@ -3812,7 +4785,6 @@ static void rbd_queue_workfn(struct work_struct *work) u64 length = blk_rq_bytes(rq); enum obj_operation_type op_type; u64 mapping_size; - bool must_be_locked; int result; switch (req_op(rq)) { @@ -3886,21 +4858,10 @@ static void rbd_queue_workfn(struct work_struct *work) goto err_rq; } - must_be_locked = - (rbd_dev->header.features & RBD_FEATURE_EXCLUSIVE_LOCK) && - (op_type != OBJ_OP_READ || rbd_dev->opts->lock_on_read); - if (must_be_locked) { - down_read(&rbd_dev->lock_rwsem); - result = rbd_wait_state_locked(rbd_dev, - !rbd_dev->opts->exclusive); - if (result) - goto err_unlock; - } - img_request = rbd_img_request_create(rbd_dev, op_type, snapc); if (!img_request) { result = -ENOMEM; - goto err_unlock; + goto err_rq; } img_request->rq = rq; snapc = NULL; /* img_request consumes a ref */ @@ -3910,19 +4871,14 @@ static void rbd_queue_workfn(struct work_struct *work) else result = rbd_img_fill_from_bio(img_request, offset, length, rq->bio); - if (result || !img_request->pending_count) + if (result) goto err_img_request; - rbd_img_request_submit(img_request); - if (must_be_locked) - up_read(&rbd_dev->lock_rwsem); + rbd_img_handle_request(img_request, 0); return; err_img_request: rbd_img_request_put(img_request); -err_unlock: - if (must_be_locked) - up_read(&rbd_dev->lock_rwsem); err_rq: if (result) rbd_warn(rbd_dev, "%s %llx at %llx result %d", @@ -4589,7 +5545,13 @@ static struct rbd_device *__rbd_dev_create(struct rbd_client *rbdc, INIT_WORK(&rbd_dev->released_lock_work, rbd_notify_released_lock); INIT_DELAYED_WORK(&rbd_dev->lock_dwork, rbd_acquire_lock); INIT_WORK(&rbd_dev->unlock_work, rbd_release_lock_work); - init_waitqueue_head(&rbd_dev->lock_waitq); + spin_lock_init(&rbd_dev->lock_lists_lock); + INIT_LIST_HEAD(&rbd_dev->acquiring_list); + INIT_LIST_HEAD(&rbd_dev->running_list); + init_completion(&rbd_dev->acquire_wait); + init_completion(&rbd_dev->releasing_wait); + + spin_lock_init(&rbd_dev->object_map_lock); rbd_dev->dev.bus = &rbd_bus_type; rbd_dev->dev.type = &rbd_device_type; @@ -4772,6 +5734,32 @@ static int rbd_dev_v2_features(struct rbd_device *rbd_dev) &rbd_dev->header.features); } +/* + * These are generic image flags, but since they are used only for + * object map, store them in rbd_dev->object_map_flags. + * + * For the same reason, this function is called only on object map + * (re)load and not on header refresh. 
+ */ +static int rbd_dev_v2_get_flags(struct rbd_device *rbd_dev) +{ + __le64 snapid = cpu_to_le64(rbd_dev->spec->snap_id); + __le64 flags; + int ret; + + ret = rbd_obj_method_sync(rbd_dev, &rbd_dev->header_oid, + &rbd_dev->header_oloc, "get_flags", + &snapid, sizeof(snapid), + &flags, sizeof(flags)); + if (ret < 0) + return ret; + if (ret < sizeof(flags)) + return -EBADMSG; + + rbd_dev->object_map_flags = le64_to_cpu(flags); + return 0; +} + struct parent_image_info { u64 pool_id; const char *pool_ns; @@ -4829,7 +5817,7 @@ static int __get_parent_info(struct rbd_device *rbd_dev, ret = ceph_osdc_call(osdc, &rbd_dev->header_oid, &rbd_dev->header_oloc, "rbd", "parent_get", CEPH_OSD_FLAG_READ, - req_page, sizeof(u64), reply_page, &reply_len); + req_page, sizeof(u64), &reply_page, &reply_len); if (ret) return ret == -EOPNOTSUPP ? 1 : ret; @@ -4841,7 +5829,7 @@ static int __get_parent_info(struct rbd_device *rbd_dev, ret = ceph_osdc_call(osdc, &rbd_dev->header_oid, &rbd_dev->header_oloc, "rbd", "parent_overlap_get", CEPH_OSD_FLAG_READ, - req_page, sizeof(u64), reply_page, &reply_len); + req_page, sizeof(u64), &reply_page, &reply_len); if (ret) return ret; @@ -4872,7 +5860,7 @@ static int __get_parent_info_legacy(struct rbd_device *rbd_dev, ret = ceph_osdc_call(osdc, &rbd_dev->header_oid, &rbd_dev->header_oloc, "rbd", "get_parent", CEPH_OSD_FLAG_READ, - req_page, sizeof(u64), reply_page, &reply_len); + req_page, sizeof(u64), &reply_page, &reply_len); if (ret) return ret; @@ -5605,28 +6593,49 @@ static void rbd_dev_image_unlock(struct rbd_device *rbd_dev) { down_write(&rbd_dev->lock_rwsem); if (__rbd_is_lock_owner(rbd_dev)) - rbd_unlock(rbd_dev); + __rbd_release_lock(rbd_dev); up_write(&rbd_dev->lock_rwsem); } +/* + * If the wait is interrupted, an error is returned even if the lock + * was successfully acquired. rbd_dev_image_unlock() will release it + * if needed. + */ static int rbd_add_acquire_lock(struct rbd_device *rbd_dev) { - int ret; + long ret; if (!(rbd_dev->header.features & RBD_FEATURE_EXCLUSIVE_LOCK)) { + if (!rbd_dev->opts->exclusive && !rbd_dev->opts->lock_on_read) + return 0; + rbd_warn(rbd_dev, "exclusive-lock feature is not enabled"); return -EINVAL; } - /* FIXME: "rbd map --exclusive" should be in interruptible */ - down_read(&rbd_dev->lock_rwsem); - ret = rbd_wait_state_locked(rbd_dev, true); - up_read(&rbd_dev->lock_rwsem); + if (rbd_dev->spec->snap_id != CEPH_NOSNAP) + return 0; + + rbd_assert(!rbd_is_lock_owner(rbd_dev)); + queue_delayed_work(rbd_dev->task_wq, &rbd_dev->lock_dwork, 0); + ret = wait_for_completion_killable_timeout(&rbd_dev->acquire_wait, + ceph_timeout_jiffies(rbd_dev->opts->lock_timeout)); + if (ret > 0) + ret = rbd_dev->acquire_err; + else if (!ret) + ret = -ETIMEDOUT; + if (ret) { - rbd_warn(rbd_dev, "failed to acquire exclusive lock"); - return -EROFS; + rbd_warn(rbd_dev, "failed to acquire exclusive lock: %ld", ret); + return ret; } + /* + * The lock may have been released by now, unless automatic lock + * transitions are disabled. 
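rbd_add_acquire_lock() decodes wait_for_completion_killable_timeout() the standard way: a positive return means the completion fired, so the verdict is whatever the acquire path stored in acquire_err; zero means the timeout elapsed; negative means a fatal signal interrupted the wait. A sketch of that decode under the same convention (the errno values are Linux's, shown for illustration):

#include <stdio.h>

#define ETIMEDOUT 110   /* Linux value, for illustration */

/*
 * wait_ret > 0: completed in time -> report the acquire path's own verdict
 * wait_ret == 0: timed out
 * wait_ret < 0: interrupted (e.g. -ERESTARTSYS)
 */
static long decode_wait(long wait_ret, int acquire_err)
{
        if (wait_ret > 0)
                return acquire_err;
        if (wait_ret == 0)
                return -ETIMEDOUT;
        return wait_ret;
}

int main(void)
{
        printf("%ld\n", decode_wait(1, 0));     /* acquired: 0 */
        printf("%ld\n", decode_wait(1, -30));   /* completed, acquire failed (-EROFS) */
        printf("%ld\n", decode_wait(0, 0));     /* timed out: -ETIMEDOUT */
        printf("%ld\n", decode_wait(-512, 0));  /* killed: -ERESTARTSYS */
        return 0;
}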
+ */ + rbd_assert(!rbd_dev->opts->exclusive || rbd_is_lock_owner(rbd_dev)); return 0; } @@ -5724,6 +6733,8 @@ static void rbd_dev_unprobe(struct rbd_device *rbd_dev) struct rbd_image_header *header; rbd_dev_parent_put(rbd_dev); + rbd_object_map_free(rbd_dev); + rbd_dev_mapping_clear(rbd_dev); /* Free dynamic fields from the header, then zero it out */ @@ -5824,7 +6835,6 @@ out_err: static void rbd_dev_device_release(struct rbd_device *rbd_dev) { clear_bit(RBD_DEV_FLAG_EXISTS, &rbd_dev->flags); - rbd_dev_mapping_clear(rbd_dev); rbd_free_disk(rbd_dev); if (!single_major) unregister_blkdev(rbd_dev->major, rbd_dev->name); @@ -5858,23 +6868,17 @@ static int rbd_dev_device_setup(struct rbd_device *rbd_dev) if (ret) goto err_out_blkdev; - ret = rbd_dev_mapping_set(rbd_dev); - if (ret) - goto err_out_disk; - set_capacity(rbd_dev->disk, rbd_dev->mapping.size / SECTOR_SIZE); set_disk_ro(rbd_dev->disk, rbd_dev->opts->read_only); ret = dev_set_name(&rbd_dev->dev, "%d", rbd_dev->dev_id); if (ret) - goto err_out_mapping; + goto err_out_disk; set_bit(RBD_DEV_FLAG_EXISTS, &rbd_dev->flags); up_write(&rbd_dev->header_rwsem); return 0; -err_out_mapping: - rbd_dev_mapping_clear(rbd_dev); err_out_disk: rbd_free_disk(rbd_dev); err_out_blkdev: @@ -5975,6 +6979,17 @@ static int rbd_dev_image_probe(struct rbd_device *rbd_dev, int depth) goto err_out_probe; } + ret = rbd_dev_mapping_set(rbd_dev); + if (ret) + goto err_out_probe; + + if (rbd_dev->spec->snap_id != CEPH_NOSNAP && + (rbd_dev->header.features & RBD_FEATURE_OBJECT_MAP)) { + ret = rbd_object_map_load(rbd_dev); + if (ret) + goto err_out_probe; + } + if (rbd_dev->header.features & RBD_FEATURE_LAYERING) { ret = rbd_dev_v2_parent_info(rbd_dev); if (ret) @@ -6071,11 +7086,9 @@ static ssize_t do_rbd_add(struct bus_type *bus, if (rc) goto err_out_image_probe; - if (rbd_dev->opts->exclusive) { - rc = rbd_add_acquire_lock(rbd_dev); - if (rc) - goto err_out_device_setup; - } + rc = rbd_add_acquire_lock(rbd_dev); + if (rc) + goto err_out_image_lock; /* Everything's ready. Announce the disk to the world. */ @@ -6101,7 +7114,6 @@ out: err_out_image_lock: rbd_dev_image_unlock(rbd_dev); -err_out_device_setup: rbd_dev_device_release(rbd_dev); err_out_image_probe: rbd_dev_image_release(rbd_dev); diff --git a/drivers/block/rbd_types.h b/drivers/block/rbd_types.h index 62ff50d3e7a6..ac98ab6ccd3b 100644 --- a/drivers/block/rbd_types.h +++ b/drivers/block/rbd_types.h @@ -18,6 +18,7 @@ /* For format version 2, rbd image 'foo' consists of objects * rbd_id.foo - id of image * rbd_header.<id> - image metadata + * rbd_object_map.<id> - optional image object map * rbd_data.<id>.0000000000000000 * rbd_data.<id>.0000000000000001 * ... - data @@ -25,6 +26,7 @@ */ #define RBD_HEADER_PREFIX "rbd_header." +#define RBD_OBJECT_MAP_PREFIX "rbd_object_map." #define RBD_ID_PREFIX "rbd_id." 
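Putting the prefixes together: for a format 2 image with id <id>, the header object is rbd_header.<id>, the optional object map is rbd_object_map.<id>, and data object N is named with RBD_V2_DATA_FORMAT (just below) as rbd_data.<id>.<N as 16 hex digits>. A user-space sketch of building those names; the image id here is made up:

#include <stdio.h>

#define RBD_HEADER_PREFIX       "rbd_header."
#define RBD_OBJECT_MAP_PREFIX   "rbd_object_map."

int main(void)
{
        const char *image_id = "10076b8b4567"; /* illustrative id */
        char name[64];

        snprintf(name, sizeof(name), "%s%s", RBD_HEADER_PREFIX, image_id);
        printf("%s\n", name);   /* rbd_header.10076b8b4567 */

        snprintf(name, sizeof(name), "%s%s", RBD_OBJECT_MAP_PREFIX, image_id);
        printf("%s\n", name);   /* rbd_object_map.10076b8b4567 */

        /* RBD_V2_DATA_FORMAT applied to "rbd_data.<id>" and object number 1 */
        snprintf(name, sizeof(name), "%s.%016llx", "rbd_data.10076b8b4567", 1ULL);
        printf("%s\n", name);   /* rbd_data.10076b8b4567.0000000000000001 */
        return 0;
}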
#define RBD_V2_DATA_FORMAT "%s.%016llx" @@ -39,6 +41,14 @@ enum rbd_notify_op { RBD_NOTIFY_OP_HEADER_UPDATE = 3, }; +#define OBJECT_NONEXISTENT 0 +#define OBJECT_EXISTS 1 +#define OBJECT_PENDING 2 +#define OBJECT_EXISTS_CLEAN 3 + +#define RBD_FLAG_OBJECT_MAP_INVALID (1ULL << 0) +#define RBD_FLAG_FAST_DIFF_INVALID (1ULL << 1) + /* * For format version 1, rbd image 'foo' consists of objects * foo.rbd - image metadata diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c index 2109cfe80219..8fafbeab510a 100644 --- a/drivers/dax/bus.c +++ b/drivers/dax/bus.c @@ -295,6 +295,22 @@ static ssize_t target_node_show(struct device *dev, } static DEVICE_ATTR_RO(target_node); +static unsigned long long dev_dax_resource(struct dev_dax *dev_dax) +{ + struct dax_region *dax_region = dev_dax->region; + + return dax_region->res.start; +} + +static ssize_t resource_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct dev_dax *dev_dax = to_dev_dax(dev); + + return sprintf(buf, "%#llx\n", dev_dax_resource(dev_dax)); +} +static DEVICE_ATTR_RO(resource); + static ssize_t modalias_show(struct device *dev, struct device_attribute *attr, char *buf) { @@ -313,6 +329,8 @@ static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n) if (a == &dev_attr_target_node.attr && dev_dax_target_node(dev_dax) < 0) return 0; + if (a == &dev_attr_resource.attr) + return 0400; return a->mode; } @@ -320,6 +338,7 @@ static struct attribute *dev_dax_attributes[] = { &dev_attr_modalias.attr, &dev_attr_size.attr, &dev_attr_target_node.attr, + &dev_attr_resource.attr, NULL, }; @@ -388,7 +407,7 @@ struct dev_dax *__devm_create_dev_dax(struct dax_region *dax_region, int id, * No 'host' or dax_operations since there is no access to this * device outside of mmap of the resulting character device. 
*/ - dax_dev = alloc_dax(dev_dax, NULL, NULL); + dax_dev = alloc_dax(dev_dax, NULL, NULL, DAXDEV_F_SYNC); if (!dax_dev) goto err; diff --git a/drivers/dax/super.c b/drivers/dax/super.c index 4e5ae7e8b557..8ab12068eea3 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -195,6 +195,8 @@ enum dax_device_flags { DAXDEV_ALIVE, /* gate whether dax_flush() calls the low level flush routine */ DAXDEV_WRITE_CACHE, + /* flag to check if device supports synchronous flush */ + DAXDEV_SYNC, }; /** @@ -372,6 +374,18 @@ bool dax_write_cache_enabled(struct dax_device *dax_dev) } EXPORT_SYMBOL_GPL(dax_write_cache_enabled); +bool __dax_synchronous(struct dax_device *dax_dev) +{ + return test_bit(DAXDEV_SYNC, &dax_dev->flags); +} +EXPORT_SYMBOL_GPL(__dax_synchronous); + +void __set_dax_synchronous(struct dax_device *dax_dev) +{ + set_bit(DAXDEV_SYNC, &dax_dev->flags); +} +EXPORT_SYMBOL_GPL(__set_dax_synchronous); + bool dax_alive(struct dax_device *dax_dev) { lockdep_assert_held(&dax_srcu); @@ -526,7 +540,7 @@ static void dax_add_host(struct dax_device *dax_dev, const char *host) } struct dax_device *alloc_dax(void *private, const char *__host, - const struct dax_operations *ops) + const struct dax_operations *ops, unsigned long flags) { struct dax_device *dax_dev; const char *host; @@ -549,6 +563,9 @@ struct dax_device *alloc_dax(void *private, const char *__host, dax_add_host(dax_dev, host); dax_dev->ops = ops; dax_dev->private = private; + if (flags & DAXDEV_F_SYNC) + set_dax_synchronous(dax_dev); + return dax_dev; err_dev: diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c index 671c24332802..df2011de7be2 100644 --- a/drivers/md/dm-kcopyd.c +++ b/drivers/md/dm-kcopyd.c @@ -28,10 +28,27 @@ #include "dm-core.h" -#define SUB_JOB_SIZE 128 #define SPLIT_COUNT 8 #define MIN_JOBS 8 -#define RESERVE_PAGES (DIV_ROUND_UP(SUB_JOB_SIZE << SECTOR_SHIFT, PAGE_SIZE)) + +#define DEFAULT_SUB_JOB_SIZE_KB 512 +#define MAX_SUB_JOB_SIZE_KB 1024 + +static unsigned kcopyd_subjob_size_kb = DEFAULT_SUB_JOB_SIZE_KB; + +module_param(kcopyd_subjob_size_kb, uint, S_IRUGO | S_IWUSR); +MODULE_PARM_DESC(kcopyd_subjob_size_kb, "Sub-job size for dm-kcopyd clients"); + +static unsigned dm_get_kcopyd_subjob_size(void) +{ + unsigned sub_job_size_kb; + + sub_job_size_kb = __dm_get_module_param(&kcopyd_subjob_size_kb, + DEFAULT_SUB_JOB_SIZE_KB, + MAX_SUB_JOB_SIZE_KB); + + return sub_job_size_kb << 1; +} /*----------------------------------------------------------------- * Each kcopyd client has its own little pool of preallocated @@ -41,6 +58,7 @@ struct dm_kcopyd_client { struct page_list *pages; unsigned nr_reserved_pages; unsigned nr_free_pages; + unsigned sub_job_size; struct dm_io_client *io_client; @@ -693,8 +711,8 @@ static void segment_complete(int read_err, unsigned long write_err, progress = job->progress; count = job->source.count - progress; if (count) { - if (count > SUB_JOB_SIZE) - count = SUB_JOB_SIZE; + if (count > kc->sub_job_size) + count = kc->sub_job_size; job->progress += count; } @@ -821,7 +839,7 @@ void dm_kcopyd_copy(struct dm_kcopyd_client *kc, struct dm_io_region *from, job->master_job = job; job->write_offset = 0; - if (job->source.count <= SUB_JOB_SIZE) + if (job->source.count <= kc->sub_job_size) dispatch_job(job); else { job->progress = 0; @@ -888,6 +906,7 @@ int kcopyd_cancel(struct kcopyd_job *job, int block) struct dm_kcopyd_client *dm_kcopyd_client_create(struct dm_kcopyd_throttle *throttle) { int r; + unsigned reserve_pages; struct dm_kcopyd_client *kc; kc = kzalloc(sizeof(*kc), 
GFP_KERNEL); @@ -912,9 +931,12 @@ struct dm_kcopyd_client *dm_kcopyd_client_create(struct dm_kcopyd_throttle *thro goto bad_workqueue; } + kc->sub_job_size = dm_get_kcopyd_subjob_size(); + reserve_pages = DIV_ROUND_UP(kc->sub_job_size << SECTOR_SHIFT, PAGE_SIZE); + kc->pages = NULL; kc->nr_reserved_pages = kc->nr_free_pages = 0; - r = client_reserve_pages(kc, RESERVE_PAGES); + r = client_reserve_pages(kc, reserve_pages); if (r) goto bad_client_pages; diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c index 63916e1dc569..f150f5c5492b 100644 --- a/drivers/md/dm-snap.c +++ b/drivers/md/dm-snap.c @@ -2072,6 +2072,12 @@ static int snapshot_merge_map(struct dm_target *ti, struct bio *bio) return DM_MAPIO_REMAPPED; } + if (unlikely(bio_op(bio) == REQ_OP_DISCARD)) { + /* Once merging, discards no longer effect change */ + bio_endio(bio); + return DM_MAPIO_SUBMITTED; + } + chunk = sector_to_chunk(s->store, bio->bi_iter.bi_sector); down_write(&s->lock); @@ -2331,6 +2337,8 @@ static void snapshot_io_hints(struct dm_target *ti, struct queue_limits *limits) if (snap->discard_zeroes_cow) { struct dm_snapshot *snap_src = NULL, *snap_dest = NULL; + down_read(&_origins_lock); + (void) __find_snapshots_sharing_cow(snap, &snap_src, &snap_dest, NULL); if (snap_src && snap_dest) snap = snap_src; @@ -2338,6 +2346,8 @@ static void snapshot_io_hints(struct dm_target *ti, struct queue_limits *limits) /* All discards are split on chunk_size boundary */ limits->discard_granularity = snap->store->chunk_size; limits->max_discard_sectors = snap->store->chunk_size; + + up_read(&_origins_lock); } } diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c index ec8b27e20de3..caaee8032afe 100644 --- a/drivers/md/dm-table.c +++ b/drivers/md/dm-table.c @@ -881,7 +881,7 @@ void dm_table_set_type(struct dm_table *t, enum dm_queue_mode type) EXPORT_SYMBOL_GPL(dm_table_set_type); /* validate the dax capability of the target device span */ -static int device_supports_dax(struct dm_target *ti, struct dm_dev *dev, +int device_supports_dax(struct dm_target *ti, struct dm_dev *dev, sector_t start, sector_t len, void *data) { int blocksize = *(int *) data; @@ -890,7 +890,15 @@ static int device_supports_dax(struct dm_target *ti, struct dm_dev *dev, start, len); } -bool dm_table_supports_dax(struct dm_table *t, int blocksize) +/* Check devices support synchronous DAX */ +static int device_synchronous(struct dm_target *ti, struct dm_dev *dev, + sector_t start, sector_t len, void *data) +{ + return dax_synchronous(dev->dax_dev); +} + +bool dm_table_supports_dax(struct dm_table *t, + iterate_devices_callout_fn iterate_fn, int *blocksize) { struct dm_target *ti; unsigned i; @@ -903,8 +911,7 @@ bool dm_table_supports_dax(struct dm_table *t, int blocksize) return false; if (!ti->type->iterate_devices || - !ti->type->iterate_devices(ti, device_supports_dax, - &blocksize)) + !ti->type->iterate_devices(ti, iterate_fn, blocksize)) return false; } @@ -940,6 +947,7 @@ static int dm_table_determine_type(struct dm_table *t) struct dm_target *tgt; struct list_head *devices = dm_table_get_devices(t); enum dm_queue_mode live_md_type = dm_get_md_type(t->md); + int page_size = PAGE_SIZE; if (t->type != DM_TYPE_NONE) { /* target already set the table's type */ @@ -984,7 +992,7 @@ static int dm_table_determine_type(struct dm_table *t) verify_bio_based: /* We must use this table as bio-based */ t->type = DM_TYPE_BIO_BASED; - if (dm_table_supports_dax(t, PAGE_SIZE) || + if (dm_table_supports_dax(t, device_supports_dax, &page_size) || 
(list_empty(devices) && live_md_type == DM_TYPE_DAX_BIO_BASED)) { t->type = DM_TYPE_DAX_BIO_BASED; } else { @@ -1883,6 +1891,7 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q, struct queue_limits *limits) { bool wc = false, fua = false; + int page_size = PAGE_SIZE; /* * Copy table's limits to the DM device's request_queue @@ -1910,8 +1919,11 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q, } blk_queue_write_cache(q, wc, fua); - if (dm_table_supports_dax(t, PAGE_SIZE)) + if (dm_table_supports_dax(t, device_supports_dax, &page_size)) { blk_queue_flag_set(QUEUE_FLAG_DAX, q); + if (dm_table_supports_dax(t, device_synchronous, NULL)) + set_dax_synchronous(t->md->dax_dev); + } else blk_queue_flag_clear(QUEUE_FLAG_DAX, q); diff --git a/drivers/md/dm-zoned-metadata.c b/drivers/md/dm-zoned-metadata.c index 9faf3e49c7af..8545dcee9fd0 100644 --- a/drivers/md/dm-zoned-metadata.c +++ b/drivers/md/dm-zoned-metadata.c @@ -1602,30 +1602,6 @@ struct dm_zone *dmz_get_zone_for_reclaim(struct dmz_metadata *zmd) } /* - * Activate a zone (increment its reference count). - */ -void dmz_activate_zone(struct dm_zone *zone) -{ - set_bit(DMZ_ACTIVE, &zone->flags); - atomic_inc(&zone->refcount); -} - -/* - * Deactivate a zone. This decrement the zone reference counter - * and clears the active state of the zone once the count reaches 0, - * indicating that all BIOs to the zone have completed. Returns - * true if the zone was deactivated. - */ -void dmz_deactivate_zone(struct dm_zone *zone) -{ - if (atomic_dec_and_test(&zone->refcount)) { - WARN_ON(!test_bit(DMZ_ACTIVE, &zone->flags)); - clear_bit_unlock(DMZ_ACTIVE, &zone->flags); - smp_mb__after_atomic(); - } -} - -/* * Get the zone mapping a chunk, if the chunk is mapped already. * If no mapping exist and the operation is WRITE, a zone is * allocated and used to map the chunk. diff --git a/drivers/md/dm-zoned.h b/drivers/md/dm-zoned.h index 12419f0bfe78..ed8de49c9a08 100644 --- a/drivers/md/dm-zoned.h +++ b/drivers/md/dm-zoned.h @@ -115,7 +115,6 @@ enum { DMZ_BUF, /* Zone internal state */ - DMZ_ACTIVE, DMZ_RECLAIM, DMZ_SEQ_WRITE_ERR, }; @@ -128,7 +127,6 @@ enum { #define dmz_is_empty(z) ((z)->wp_block == 0) #define dmz_is_offline(z) test_bit(DMZ_OFFLINE, &(z)->flags) #define dmz_is_readonly(z) test_bit(DMZ_READ_ONLY, &(z)->flags) -#define dmz_is_active(z) test_bit(DMZ_ACTIVE, &(z)->flags) #define dmz_in_reclaim(z) test_bit(DMZ_RECLAIM, &(z)->flags) #define dmz_seq_write_err(z) test_bit(DMZ_SEQ_WRITE_ERR, &(z)->flags) @@ -188,8 +186,30 @@ void dmz_unmap_zone(struct dmz_metadata *zmd, struct dm_zone *zone); unsigned int dmz_nr_rnd_zones(struct dmz_metadata *zmd); unsigned int dmz_nr_unmap_rnd_zones(struct dmz_metadata *zmd); -void dmz_activate_zone(struct dm_zone *zone); -void dmz_deactivate_zone(struct dm_zone *zone); +/* + * Activate a zone (increment its reference count). + */ +static inline void dmz_activate_zone(struct dm_zone *zone) +{ + atomic_inc(&zone->refcount); +} + +/* + * Deactivate a zone. This decrements the zone reference counter; + * a count of 0 indicates that all BIOs to the zone have completed. + */ +static inline void dmz_deactivate_zone(struct dm_zone *zone) +{ + atomic_dec(&zone->refcount); +} + +/* + * Test if a zone is active, that is, has a refcount > 0.
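With DMZ_ACTIVE gone, zone activity is nothing more than the reference count itself: activate increments, deactivate decrements, and "active" simply means the count is above zero, with no flag that has to be kept in sync with it. A user-space model of the three inline helpers using C11 atomics:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct zone {
        atomic_int refcount;    /* stand-in for dm_zone.refcount */
};

static void zone_activate(struct zone *z)
{
        atomic_fetch_add(&z->refcount, 1);
}

static void zone_deactivate(struct zone *z)
{
        atomic_fetch_sub(&z->refcount, 1);
}

static bool zone_is_active(struct zone *z)
{
        return atomic_load(&z->refcount) > 0;
}

int main(void)
{
        struct zone z = { 0 };

        zone_activate(&z);
        printf("active: %d\n", zone_is_active(&z));     /* 1 */
        zone_deactivate(&z);
        printf("active: %d\n", zone_is_active(&z));     /* 0 */
        return 0;
}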
+ */ +static inline bool dmz_is_active(struct dm_zone *zone) +{ + return atomic_read(&zone->refcount); +} int dmz_lock_zone_reclaim(struct dm_zone *zone); void dmz_unlock_zone_reclaim(struct dm_zone *zone); diff --git a/drivers/md/dm.c b/drivers/md/dm.c index 61f1152b74e9..d0beef033e2f 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -1117,7 +1117,7 @@ static bool dm_dax_supported(struct dax_device *dax_dev, struct block_device *bd if (!map) return false; - ret = dm_table_supports_dax(map, blocksize); + ret = dm_table_supports_dax(map, device_supports_dax, &blocksize); dm_put_live_table(md, srcu_idx); @@ -1989,7 +1989,8 @@ static struct mapped_device *alloc_dev(int minor) sprintf(md->disk->disk_name, "dm-%d", minor); if (IS_ENABLED(CONFIG_DAX_DRIVER)) { - md->dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops); + md->dax_dev = alloc_dax(md, md->disk->disk_name, + &dm_dax_ops, 0); if (!md->dax_dev) goto bad; } diff --git a/drivers/md/dm.h b/drivers/md/dm.h index 17e3db54404c..0475673337f3 100644 --- a/drivers/md/dm.h +++ b/drivers/md/dm.h @@ -72,7 +72,10 @@ bool dm_table_bio_based(struct dm_table *t); bool dm_table_request_based(struct dm_table *t); void dm_table_free_md_mempools(struct dm_table *t); struct dm_md_mempools *dm_table_get_md_mempools(struct dm_table *t); -bool dm_table_supports_dax(struct dm_table *t, int blocksize); +bool dm_table_supports_dax(struct dm_table *t, iterate_devices_callout_fn fn, + int *blocksize); +int device_supports_dax(struct dm_target *ti, struct dm_dev *dev, + sector_t start, sector_t len, void *data); void dm_lock_md_type(struct mapped_device *md); void dm_unlock_md_type(struct mapped_device *md); diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile index 6f2a088afad6..cefe233e0b52 100644 --- a/drivers/nvdimm/Makefile +++ b/drivers/nvdimm/Makefile @@ -5,6 +5,7 @@ obj-$(CONFIG_ND_BTT) += nd_btt.o obj-$(CONFIG_ND_BLK) += nd_blk.o obj-$(CONFIG_X86_PMEM_LEGACY) += nd_e820.o obj-$(CONFIG_OF_PMEM) += of_pmem.o +obj-$(CONFIG_VIRTIO_PMEM) += virtio_pmem.o nd_virtio.o nd_pmem-y := pmem.o diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c index 26c1c7618891..2985ca949912 100644 --- a/drivers/nvdimm/claim.c +++ b/drivers/nvdimm/claim.c @@ -255,7 +255,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns, struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev); unsigned int sz_align = ALIGN(size + (offset & (512 - 1)), 512); sector_t sector = offset >> 9; - int rc = 0; + int rc = 0, ret = 0; if (unlikely(!size)) return 0; @@ -293,7 +293,9 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns, } memcpy_flushcache(nsio->addr + offset, buf, size); - nvdimm_flush(to_nd_region(ndns->dev.parent)); + ret = nvdimm_flush(to_nd_region(ndns->dev.parent), NULL); + if (ret) + rc = ret; return rc; } diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c index a434a5964cb9..2d8d7e554877 100644 --- a/drivers/nvdimm/namespace_devs.c +++ b/drivers/nvdimm/namespace_devs.c @@ -1822,8 +1822,8 @@ static bool has_uuid_at_pos(struct nd_region *nd_region, u8 *uuid, && !guid_equal(&nd_set->type_guid, &nd_label->type_guid)) { dev_dbg(ndd->dev, "expect type_guid %pUb got %pUb\n", - nd_set->type_guid.b, - nd_label->type_guid.b); + &nd_set->type_guid, + &nd_label->type_guid); continue; } @@ -2227,8 +2227,8 @@ static struct device *create_namespace_blk(struct nd_region *nd_region, if (namespace_label_has(ndd, type_guid)) { if (!guid_equal(&nd_set->type_guid, &nd_label->type_guid)) { dev_dbg(ndd->dev, "expect type_guid 
%pUb got %pUb\n", - nd_set->type_guid.b, - nd_label->type_guid.b); + &nd_set->type_guid, + &nd_label->type_guid); return ERR_PTR(-EAGAIN); } diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h index d24304c0e6d7..1b9955651379 100644 --- a/drivers/nvdimm/nd.h +++ b/drivers/nvdimm/nd.h @@ -155,6 +155,7 @@ struct nd_region { struct badblocks bb; struct nd_interleave_set *nd_set; struct nd_percpu_lane __percpu *lane; + int (*flush)(struct nd_region *nd_region, struct bio *bio); struct nd_mapping mapping[0]; }; diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c new file mode 100644 index 000000000000..10351d5b49fa --- /dev/null +++ b/drivers/nvdimm/nd_virtio.c @@ -0,0 +1,125 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * virtio_pmem.c: Virtio pmem Driver + * + * Discovers persistent memory range information + * from the host and provides a virtio based flushing + * interface. + */ +#include "virtio_pmem.h" +#include "nd.h" + + /* The interrupt handler */ +void virtio_pmem_host_ack(struct virtqueue *vq) +{ + struct virtio_pmem *vpmem = vq->vdev->priv; + struct virtio_pmem_request *req_data, *req_buf; + unsigned long flags; + unsigned int len; + + spin_lock_irqsave(&vpmem->pmem_lock, flags); + while ((req_data = virtqueue_get_buf(vq, &len)) != NULL) { + req_data->done = true; + wake_up(&req_data->host_acked); + + if (!list_empty(&vpmem->req_list)) { + req_buf = list_first_entry(&vpmem->req_list, + struct virtio_pmem_request, list); + req_buf->wq_buf_avail = true; + wake_up(&req_buf->wq_buf); + list_del(&req_buf->list); + } + } + spin_unlock_irqrestore(&vpmem->pmem_lock, flags); +} +EXPORT_SYMBOL_GPL(virtio_pmem_host_ack); + + /* The request submission function */ +static int virtio_pmem_flush(struct nd_region *nd_region) +{ + struct virtio_device *vdev = nd_region->provider_data; + struct virtio_pmem *vpmem = vdev->priv; + struct virtio_pmem_request *req_data; + struct scatterlist *sgs[2], sg, ret; + unsigned long flags; + int err, err1; + + might_sleep(); + req_data = kmalloc(sizeof(*req_data), GFP_KERNEL); + if (!req_data) + return -ENOMEM; + + req_data->done = false; + init_waitqueue_head(&req_data->host_acked); + init_waitqueue_head(&req_data->wq_buf); + INIT_LIST_HEAD(&req_data->list); + req_data->req.type = cpu_to_le32(VIRTIO_PMEM_REQ_TYPE_FLUSH); + sg_init_one(&sg, &req_data->req, sizeof(req_data->req)); + sgs[0] = &sg; + sg_init_one(&ret, &req_data->resp.ret, sizeof(req_data->resp)); + sgs[1] = &ret; + + spin_lock_irqsave(&vpmem->pmem_lock, flags); + /* + * If virtqueue_add_sgs returns -ENOSPC then the req_vq virtual + * queue does not have a free descriptor. We add the request + * to req_list and wait for host_ack to wake us up when free + * slots are available. + */ + while ((err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req_data, + GFP_ATOMIC)) == -ENOSPC) { + + dev_info(&vdev->dev, "failed to send command to virtio pmem device, no free slots in the virtqueue\n"); + req_data->wq_buf_avail = false; + list_add_tail(&req_data->list, &vpmem->req_list); + spin_unlock_irqrestore(&vpmem->pmem_lock, flags); + + /* A host response results in "host_ack" getting called */ + wait_event(req_data->wq_buf, req_data->wq_buf_avail); + spin_lock_irqsave(&vpmem->pmem_lock, flags); + } + err1 = virtqueue_kick(vpmem->req_vq); + spin_unlock_irqrestore(&vpmem->pmem_lock, flags); + /* + * virtqueue_add_sgs failed with an error other than -ENOSPC; we can't + * do anything about that.
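The submission path that follows combines two failure sources: err from queueing the request and err1 from virtqueue_kick() (a boolean, false when the host was not notified). If either says the host never saw the request, the flush is reported as -EIO; otherwise the caller sleeps until the host ack and returns the host's own status. A condensed sketch of that contract, with illustrative names and Linux errno values:

#include <stdbool.h>
#include <stdio.h>

#define EIO 5   /* Linux value, for illustration */

/*
 * add_err:  result of queueing the request (0 on success)
 * kicked:   result of notifying the host (false = host never told)
 * host_ret: status the host wrote back, valid only if the request went out
 */
static int flush_outcome(int add_err, bool kicked, int host_ret)
{
        if (add_err || !kicked)
                return -EIO;    /* request never reached the host */
        return host_ret;        /* wait for the ack, then report its verdict */
}

int main(void)
{
        printf("%d\n", flush_outcome(0, true, 0));      /* clean flush */
        printf("%d\n", flush_outcome(0, false, 0));     /* kick failed: -EIO */
        printf("%d\n", flush_outcome(-22, true, 0));    /* add failed (-EINVAL): -EIO */
        return 0;
}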
+	 */
+	if (err || !err1) {
+		dev_info(&vdev->dev, "failed to send command to virtio pmem device\n");
+		err = -EIO;
+	} else {
+		/* A host response results in "host_ack" getting called */
+		wait_event(req_data->host_acked, req_data->done);
+		err = le32_to_cpu(req_data->resp.ret);
+	}
+
+	kfree(req_data);
+	return err;
+};
+
+/* The asynchronous flush callback function */
+int async_pmem_flush(struct nd_region *nd_region, struct bio *bio)
+{
+	/*
+	 * Create child bio for asynchronous flush and chain with
+	 * parent bio. Otherwise directly call nd_region flush.
+	 */
+	if (bio && bio->bi_iter.bi_sector != -1) {
+		struct bio *child = bio_alloc(GFP_ATOMIC, 0);
+
+		if (!child)
+			return -ENOMEM;
+		bio_copy_dev(child, bio);
+		child->bi_opf = REQ_PREFLUSH;
+		child->bi_iter.bi_sector = -1;
+		bio_chain(child, bio);
+		submit_bio(child);
+		return 0;
+	}
+	if (virtio_pmem_flush(nd_region))
+		return -EIO;
+
+	return 0;
+};
+EXPORT_SYMBOL_GPL(async_pmem_flush);
+MODULE_LICENSE("GPL");
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index e7d8cc9f41e8..2bf3acd69613 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -184,6 +184,7 @@ static blk_status_t pmem_do_bvec(struct pmem_device *pmem, struct page *page,
 
 static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
 {
+	int ret = 0;
 	blk_status_t rc = 0;
 	bool do_acct;
 	unsigned long start;
@@ -193,7 +194,7 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
 	struct nd_region *nd_region = to_region(pmem);
 
 	if (bio->bi_opf & REQ_PREFLUSH)
-		nvdimm_flush(nd_region);
+		ret = nvdimm_flush(nd_region, bio);
 
 	do_acct = nd_iostat_start(bio, &start);
 	bio_for_each_segment(bvec, bio, iter) {
@@ -208,7 +209,10 @@ static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
 		nd_iostat_end(bio, start);
 
 	if (bio->bi_opf & REQ_FUA)
-		nvdimm_flush(nd_region);
+		ret = nvdimm_flush(nd_region, bio);
+
+	if (ret)
+		bio->bi_status = errno_to_blk_status(ret);
 
 	bio_endio(bio);
 	return BLK_QC_T_NONE;
@@ -362,6 +366,7 @@ static int pmem_attach_disk(struct device *dev,
 	struct gendisk *disk;
 	void *addr;
 	int rc;
+	unsigned long flags = 0UL;
 
 	pmem = devm_kzalloc(dev, sizeof(*pmem), GFP_KERNEL);
 	if (!pmem)
@@ -457,14 +462,15 @@ static int pmem_attach_disk(struct device *dev,
 	nvdimm_badblocks_populate(nd_region, &pmem->bb, &bb_res);
 	disk->bb = &pmem->bb;
 
-	dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops);
+	if (is_nvdimm_sync(nd_region))
+		flags = DAXDEV_F_SYNC;
+	dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops, flags);
 	if (!dax_dev) {
 		put_disk(disk);
 		return -ENOMEM;
 	}
 	dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
 	pmem->dax_dev = dax_dev;
-
 	gendev = disk_to_dev(disk);
 	gendev->groups = pmem_attribute_groups;
 
@@ -522,14 +528,14 @@ static int nd_pmem_remove(struct device *dev)
 		sysfs_put(pmem->bb_state);
 		pmem->bb_state = NULL;
 	}
-	nvdimm_flush(to_nd_region(dev->parent));
+	nvdimm_flush(to_nd_region(dev->parent), NULL);
 
 	return 0;
 }
 
 static void nd_pmem_shutdown(struct device *dev)
 {
-	nvdimm_flush(to_nd_region(dev->parent));
+	nvdimm_flush(to_nd_region(dev->parent), NULL);
 }
 
 static void nd_pmem_notify(struct device *dev, enum nvdimm_event event)
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 4fed9ce9c2fe..56f2227f192a 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -287,7 +287,9 @@ static ssize_t deep_flush_store(struct device *dev, struct device_attribute *att
 		return rc;
 	if (!flush)
 		return -EINVAL;
-	nvdimm_flush(nd_region);
+	rc = nvdimm_flush(nd_region, NULL);
+	if (rc)
+		return rc;
 
 	return len;
 }
@@ -1077,6 +1079,11 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
 	dev->of_node = ndr_desc->of_node;
 	nd_region->ndr_size = resource_size(ndr_desc->res);
 	nd_region->ndr_start = ndr_desc->res->start;
+	if (ndr_desc->flush)
+		nd_region->flush = ndr_desc->flush;
+	else
+		nd_region->flush = NULL;
+
 	nd_device_register(dev);
 
 	return nd_region;
@@ -1117,11 +1124,24 @@ struct nd_region *nvdimm_volatile_region_create(struct nvdimm_bus *nvdimm_bus,
 }
 EXPORT_SYMBOL_GPL(nvdimm_volatile_region_create);
 
+int nvdimm_flush(struct nd_region *nd_region, struct bio *bio)
+{
+	int rc = 0;
+
+	if (!nd_region->flush)
+		rc = generic_nvdimm_flush(nd_region);
+	else {
+		if (nd_region->flush(nd_region, bio))
+			rc = -EIO;
+	}
+
+	return rc;
+}
 /**
  * nvdimm_flush - flush any posted write queues between the cpu and pmem media
  * @nd_region: blk or interleaved pmem region
  */
-void nvdimm_flush(struct nd_region *nd_region)
+int generic_nvdimm_flush(struct nd_region *nd_region)
 {
 	struct nd_region_data *ndrd = dev_get_drvdata(&nd_region->dev);
 	int i, idx;
@@ -1145,6 +1165,8 @@ void nvdimm_flush(struct nd_region *nd_region)
 		if (ndrd_get_flush_wpq(ndrd, i, 0))
 			writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
 	wmb();
+
+	return 0;
 }
 EXPORT_SYMBOL_GPL(nvdimm_flush);
 
@@ -1189,6 +1211,13 @@ int nvdimm_has_cache(struct nd_region *nd_region)
 }
 EXPORT_SYMBOL_GPL(nvdimm_has_cache);
 
+bool is_nvdimm_sync(struct nd_region *nd_region)
+{
+	return is_nd_pmem(&nd_region->dev) &&
+		!test_bit(ND_REGION_ASYNC, &nd_region->flags);
+}
+EXPORT_SYMBOL_GPL(is_nvdimm_sync);
+
 struct conflict_context {
 	struct nd_region *nd_region;
 	resource_size_t start, size;
diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
new file mode 100644
index 000000000000..5e3d07b47e0c
--- /dev/null
+++ b/drivers/nvdimm/virtio_pmem.c
@@ -0,0 +1,122 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * virtio_pmem.c: Virtio pmem Driver
+ *
+ * Discovers persistent memory range information
+ * from host and registers the virtual pmem device
+ * with libnvdimm core.
+ */ +#include "virtio_pmem.h" +#include "nd.h" + +static struct virtio_device_id id_table[] = { + { VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID }, + { 0 }, +}; + + /* Initialize virt queue */ +static int init_vq(struct virtio_pmem *vpmem) +{ + /* single vq */ + vpmem->req_vq = virtio_find_single_vq(vpmem->vdev, + virtio_pmem_host_ack, "flush_queue"); + if (IS_ERR(vpmem->req_vq)) + return PTR_ERR(vpmem->req_vq); + + spin_lock_init(&vpmem->pmem_lock); + INIT_LIST_HEAD(&vpmem->req_list); + + return 0; +}; + +static int virtio_pmem_probe(struct virtio_device *vdev) +{ + struct nd_region_desc ndr_desc = {}; + int nid = dev_to_node(&vdev->dev); + struct nd_region *nd_region; + struct virtio_pmem *vpmem; + struct resource res; + int err = 0; + + if (!vdev->config->get) { + dev_err(&vdev->dev, "%s failure: config access disabled\n", + __func__); + return -EINVAL; + } + + vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem), GFP_KERNEL); + if (!vpmem) { + err = -ENOMEM; + goto out_err; + } + + vpmem->vdev = vdev; + vdev->priv = vpmem; + err = init_vq(vpmem); + if (err) { + dev_err(&vdev->dev, "failed to initialize virtio pmem vq's\n"); + goto out_err; + } + + virtio_cread(vpmem->vdev, struct virtio_pmem_config, + start, &vpmem->start); + virtio_cread(vpmem->vdev, struct virtio_pmem_config, + size, &vpmem->size); + + res.start = vpmem->start; + res.end = vpmem->start + vpmem->size - 1; + vpmem->nd_desc.provider_name = "virtio-pmem"; + vpmem->nd_desc.module = THIS_MODULE; + + vpmem->nvdimm_bus = nvdimm_bus_register(&vdev->dev, + &vpmem->nd_desc); + if (!vpmem->nvdimm_bus) { + dev_err(&vdev->dev, "failed to register device with nvdimm_bus\n"); + err = -ENXIO; + goto out_vq; + } + + dev_set_drvdata(&vdev->dev, vpmem->nvdimm_bus); + + ndr_desc.res = &res; + ndr_desc.numa_node = nid; + ndr_desc.flush = async_pmem_flush; + set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags); + set_bit(ND_REGION_ASYNC, &ndr_desc.flags); + nd_region = nvdimm_pmem_region_create(vpmem->nvdimm_bus, &ndr_desc); + if (!nd_region) { + dev_err(&vdev->dev, "failed to create nvdimm region\n"); + err = -ENXIO; + goto out_nd; + } + nd_region->provider_data = dev_to_virtio(nd_region->dev.parent->parent); + return 0; +out_nd: + nvdimm_bus_unregister(vpmem->nvdimm_bus); +out_vq: + vdev->config->del_vqs(vdev); +out_err: + return err; +} + +static void virtio_pmem_remove(struct virtio_device *vdev) +{ + struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev); + + nvdimm_bus_unregister(nvdimm_bus); + vdev->config->del_vqs(vdev); + vdev->config->reset(vdev); +} + +static struct virtio_driver virtio_pmem_driver = { + .driver.name = KBUILD_MODNAME, + .driver.owner = THIS_MODULE, + .id_table = id_table, + .probe = virtio_pmem_probe, + .remove = virtio_pmem_remove, +}; + +module_virtio_driver(virtio_pmem_driver); +MODULE_DEVICE_TABLE(virtio, id_table); +MODULE_DESCRIPTION("Virtio pmem driver"); +MODULE_LICENSE("GPL"); diff --git a/drivers/nvdimm/virtio_pmem.h b/drivers/nvdimm/virtio_pmem.h new file mode 100644 index 000000000000..0dddefe594c4 --- /dev/null +++ b/drivers/nvdimm/virtio_pmem.h @@ -0,0 +1,55 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * virtio_pmem.h: virtio pmem Driver + * + * Discovers persistent memory range information + * from host and provides a virtio based flushing + * interface. 
+ **/ + +#ifndef _LINUX_VIRTIO_PMEM_H +#define _LINUX_VIRTIO_PMEM_H + +#include <linux/module.h> +#include <uapi/linux/virtio_pmem.h> +#include <linux/libnvdimm.h> +#include <linux/spinlock.h> + +struct virtio_pmem_request { + struct virtio_pmem_req req; + struct virtio_pmem_resp resp; + + /* Wait queue to process deferred work after ack from host */ + wait_queue_head_t host_acked; + bool done; + + /* Wait queue to process deferred work after virt queue buffer avail */ + wait_queue_head_t wq_buf; + bool wq_buf_avail; + struct list_head list; +}; + +struct virtio_pmem { + struct virtio_device *vdev; + + /* Virtio pmem request queue */ + struct virtqueue *req_vq; + + /* nvdimm bus registers virtio pmem device */ + struct nvdimm_bus *nvdimm_bus; + struct nvdimm_bus_descriptor nd_desc; + + /* List to store deferred work if virtqueue is full */ + struct list_head req_list; + + /* Synchronize virtqueue data */ + spinlock_t pmem_lock; + + /* Memory region information */ + __u64 start; + __u64 size; +}; + +void virtio_pmem_host_ack(struct virtqueue *vq); +int async_pmem_flush(struct nd_region *nd_region, struct bio *bio); +#endif diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c index d04d4378ca50..63502ca537eb 100644 --- a/drivers/s390/block/dcssblk.c +++ b/drivers/s390/block/dcssblk.c @@ -679,7 +679,7 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char goto put_dev; dev_info->dax_dev = alloc_dax(dev_info, dev_info->gd->disk_name, - &dcssblk_dax_ops); + &dcssblk_dax_ops, DAXDEV_F_SYNC); if (!dev_info->dax_dev) { rc = -ENOMEM; goto put_dev; diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig index 023fc3bc01c6..078615cf2afc 100644 --- a/drivers/virtio/Kconfig +++ b/drivers/virtio/Kconfig @@ -43,6 +43,17 @@ config VIRTIO_PCI_LEGACY If unsure, say Y. +config VIRTIO_PMEM + tristate "Support for virtio pmem driver" + depends on VIRTIO + depends on LIBNVDIMM + help + This driver provides access to virtio-pmem devices, storage devices + that are mapped into the physical address space - similar to NVDIMMs + - with a virtio-based flushing interface. + + If unsure, say Y. + config VIRTIO_BALLOON tristate "Virtio balloon driver" depends on VIRTIO diff --git a/drivers/watchdog/Kconfig b/drivers/watchdog/Kconfig index 6cad0b33d7ad..8188963a405b 100644 --- a/drivers/watchdog/Kconfig +++ b/drivers/watchdog/Kconfig @@ -58,6 +58,15 @@ config WATCHDOG_HANDLE_BOOT_ENABLED the watchdog on its own. Thus if your userspace does not start fast enough your device will reboot. +config WATCHDOG_OPEN_TIMEOUT + int "Timeout value for opening watchdog device" + default 0 + help + The maximum time, in seconds, for which the watchdog framework takes + care of pinging a hardware watchdog. A value of 0 means infinite. The + value set here can be overridden by the commandline parameter + "watchdog.open_timeout". 
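For context on how a Kconfig-supplied default like WATCHDOG_OPEN_TIMEOUT is typically consumed, here is a minimal sketch in the watchdog core's style; the helper and variable names below are illustrative assumptions, not necessarily the exact upstream symbols:

#include <linux/jiffies.h>
#include <linux/module.h>

/* Build-time default from Kconfig; "watchdog.open_timeout" on the kernel
 * command line overrides it via the module parameter (assumed wiring).
 */
static unsigned int open_timeout = CONFIG_WATCHDOG_OPEN_TIMEOUT;
module_param(open_timeout, uint, 0644);
MODULE_PARM_DESC(open_timeout,
	"Seconds the framework keeps pinging an unopened watchdog (0 = forever)");

/* True once userspace has missed its window to open the watchdog device;
 * the deadline would be armed as jiffies + open_timeout * HZ when the
 * kernel starts pinging the hardware on userspace's behalf.
 */
static bool open_deadline_passed(unsigned long deadline)
{
	return open_timeout && time_is_before_jiffies(deadline);
}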
+ config WATCHDOG_SYSFS bool "Read different watchdog information through sysfs" help @@ -717,6 +726,7 @@ config IMX2_WDT config IMX_SC_WDT tristate "IMX SC Watchdog" depends on HAVE_ARM_SMCCC + depends on IMX_SCU select WATCHDOG_CORE help This is the driver for the system controller watchdog diff --git a/drivers/watchdog/acquirewdt.c b/drivers/watchdog/acquirewdt.c index 957d1255d4ca..848db958411e 100644 --- a/drivers/watchdog/acquirewdt.c +++ b/drivers/watchdog/acquirewdt.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * Acquire Single Board Computer Watchdog Timer driver * @@ -6,11 +7,6 @@ * (c) Copyright 1996 Alan Cox <alan@lxorguk.ukuu.org.uk>, * All Rights Reserved. * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. - * * Neither Alan Cox nor CymruNet Ltd. admit liability nor provide * warranty for any of this software. This material is provided * "AS-IS" and at no charge. diff --git a/drivers/watchdog/advantechwdt.c b/drivers/watchdog/advantechwdt.c index 2766af292a71..0d02bb275b3d 100644 --- a/drivers/watchdog/advantechwdt.c +++ b/drivers/watchdog/advantechwdt.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * Advantech Single Board Computer WDT driver * @@ -9,11 +10,6 @@ * (c) Copyright 1996 Alan Cox <alan@lxorguk.ukuu.org.uk>, * All Rights Reserved. * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. - * * Neither Alan Cox nor CymruNet Ltd. admit liability nor provide * warranty for any of this software. This material is provided * "AS-IS" and at no charge. 
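One recurring idiom in the watchdog driver diffs that follow is worth spelling out once: the watchdog core is expected to log registration failures itself, so a per-driver error message adds nothing. A schematic before/after of the cleanup (placeholder driver fields, not any single driver's exact code):

	/* Before: every driver printed its own message on failure. */
	ret = devm_watchdog_register_device(dev, &wdt->wdd);
	if (ret) {
		dev_err(dev, "failed to register watchdog\n");
		return ret;
	}
	return 0;

	/* After: return the result directly and let the core report
	 * the failure once, centrally.
	 */
	return devm_watchdog_register_device(dev, &wdt->wdd);

The same transformation also lets several probe functions drop a now-unused local ret variable, as in digicolor_wdt below.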
diff --git a/drivers/watchdog/aspeed_wdt.c b/drivers/watchdog/aspeed_wdt.c index f0148637e5dd..cc71861e033a 100644 --- a/drivers/watchdog/aspeed_wdt.c +++ b/drivers/watchdog/aspeed_wdt.c @@ -309,13 +309,7 @@ static int aspeed_wdt_probe(struct platform_device *pdev) if (status & WDT_TIMEOUT_STATUS_BOOT_SECONDARY) wdt->wdd.bootstatus = WDIOF_CARDRESET; - ret = devm_watchdog_register_device(dev, &wdt->wdd); - if (ret) { - dev_err(dev, "failed to register\n"); - return ret; - } - - return 0; + return devm_watchdog_register_device(dev, &wdt->wdd); } static struct platform_driver aspeed_watchdog_driver = { diff --git a/drivers/watchdog/bcm2835_wdt.c b/drivers/watchdog/bcm2835_wdt.c index 560c1c54c177..dec6ca019bea 100644 --- a/drivers/watchdog/bcm2835_wdt.c +++ b/drivers/watchdog/bcm2835_wdt.c @@ -202,10 +202,8 @@ static int bcm2835_wdt_probe(struct platform_device *pdev) watchdog_stop_on_reboot(&bcm2835_wdt_wdd); err = devm_watchdog_register_device(dev, &bcm2835_wdt_wdd); - if (err) { - dev_err(dev, "Failed to register watchdog device"); + if (err) return err; - } if (pm_power_off == NULL) { pm_power_off = bcm2835_power_off; @@ -240,6 +238,7 @@ module_param(nowayout, bool, 0); MODULE_PARM_DESC(nowayout, "Watchdog cannot be stopped once started (default=" __MODULE_STRING(WATCHDOG_NOWAYOUT) ")"); +MODULE_ALIAS("platform:bcm2835-wdt"); MODULE_AUTHOR("Lubomir Rintel <lkundrak@v3.sk>"); MODULE_DESCRIPTION("Driver for Broadcom BCM2835 watchdog timer"); MODULE_LICENSE("GPL"); diff --git a/drivers/watchdog/bcm7038_wdt.c b/drivers/watchdog/bcm7038_wdt.c index d3d88f6703d7..979caa18d3c8 100644 --- a/drivers/watchdog/bcm7038_wdt.c +++ b/drivers/watchdog/bcm7038_wdt.c @@ -159,10 +159,8 @@ static int bcm7038_wdt_probe(struct platform_device *pdev) watchdog_stop_on_reboot(&wdt->wdd); watchdog_stop_on_unregister(&wdt->wdd); err = devm_watchdog_register_device(dev, &wdt->wdd); - if (err) { - dev_err(dev, "Failed to register watchdog device\n"); + if (err) return err; - } dev_info(dev, "Registered BCM7038 Watchdog\n"); diff --git a/drivers/watchdog/bcm_kona_wdt.c b/drivers/watchdog/bcm_kona_wdt.c index 921291025680..eb850a8d19df 100644 --- a/drivers/watchdog/bcm_kona_wdt.c +++ b/drivers/watchdog/bcm_kona_wdt.c @@ -301,10 +301,8 @@ static int bcm_kona_wdt_probe(struct platform_device *pdev) watchdog_stop_on_reboot(&bcm_kona_wdt_wdd); watchdog_stop_on_unregister(&bcm_kona_wdt_wdd); ret = devm_watchdog_register_device(dev, &bcm_kona_wdt_wdd); - if (ret) { - dev_err(dev, "Failed to register watchdog device"); + if (ret) return ret; - } bcm_kona_wdt_debug_init(pdev); dev_dbg(dev, "Broadcom Kona Watchdog Timer"); diff --git a/drivers/watchdog/cadence_wdt.c b/drivers/watchdog/cadence_wdt.c index a22f2d431a35..f8d4e91d0383 100644 --- a/drivers/watchdog/cadence_wdt.c +++ b/drivers/watchdog/cadence_wdt.c @@ -363,10 +363,8 @@ static int cdns_wdt_probe(struct platform_device *pdev) watchdog_stop_on_reboot(cdns_wdt_device); watchdog_stop_on_unregister(cdns_wdt_device); ret = devm_watchdog_register_device(dev, cdns_wdt_device); - if (ret) { - dev_err(dev, "Failed to register wdt device\n"); + if (ret) return ret; - } platform_set_drvdata(pdev, wdt); dev_info(dev, "Xilinx Watchdog Timer at %p with timeout %ds%s\n", diff --git a/drivers/watchdog/da9052_wdt.c b/drivers/watchdog/da9052_wdt.c index a2feef1ff307..d708c091bf1b 100644 --- a/drivers/watchdog/da9052_wdt.c +++ b/drivers/watchdog/da9052_wdt.c @@ -176,14 +176,7 @@ static int da9052_wdt_probe(struct platform_device *pdev) return ret; } - ret = 
devm_watchdog_register_device(dev, &driver_data->wdt); - if (ret != 0) { - dev_err(da9052->dev, "watchdog_register_device() failed: %d\n", - ret); - return ret; - } - - return ret; + return devm_watchdog_register_device(dev, &driver_data->wdt); } static struct platform_driver da9052_wdt_driver = { diff --git a/drivers/watchdog/da9062_wdt.c b/drivers/watchdog/da9062_wdt.c index aac749cfaccb..e149e66a6ea9 100644 --- a/drivers/watchdog/da9062_wdt.c +++ b/drivers/watchdog/da9062_wdt.c @@ -214,11 +214,8 @@ static int da9062_wdt_probe(struct platform_device *pdev) watchdog_set_drvdata(&wdt->wdtdev, wdt); ret = devm_watchdog_register_device(dev, &wdt->wdtdev); - if (ret < 0) { - dev_err(wdt->hw->dev, - "watchdog registration failed (%d)\n", ret); + if (ret < 0) return ret; - } return da9062_wdt_ping(&wdt->wdtdev); } diff --git a/drivers/watchdog/davinci_wdt.c b/drivers/watchdog/davinci_wdt.c index 7b2ee35b5ffd..2b3f3cd382ef 100644 --- a/drivers/watchdog/davinci_wdt.c +++ b/drivers/watchdog/davinci_wdt.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * drivers/char/watchdog/davinci_wdt.c * @@ -5,10 +6,7 @@ * * Copyright (C) 2006-2013 Texas Instruments. * - * 2007 (c) MontaVista Software, Inc. This file is licensed under - * the terms of the GNU General Public License version 2. This program - * is licensed "as is" without any warranty of any kind, whether express - * or implied. + * 2007 (c) MontaVista Software, Inc. */ #include <linux/module.h> @@ -247,13 +245,7 @@ static int davinci_wdt_probe(struct platform_device *pdev) if (IS_ERR(davinci_wdt->base)) return PTR_ERR(davinci_wdt->base); - ret = devm_watchdog_register_device(dev, wdd); - if (ret) { - dev_err(dev, "cannot register watchdog device\n"); - return ret; - } - - return 0; + return devm_watchdog_register_device(dev, wdd); } static const struct of_device_id davinci_wdt_of_match[] = { diff --git a/drivers/watchdog/digicolor_wdt.c b/drivers/watchdog/digicolor_wdt.c index 8af6e9a67d0d..073d37867f47 100644 --- a/drivers/watchdog/digicolor_wdt.c +++ b/drivers/watchdog/digicolor_wdt.c @@ -118,7 +118,6 @@ static int dc_wdt_probe(struct platform_device *pdev) { struct device *dev = &pdev->dev; struct dc_wdt *wdt; - int ret; wdt = devm_kzalloc(dev, sizeof(struct dc_wdt), GFP_KERNEL); if (!wdt) @@ -141,13 +140,7 @@ static int dc_wdt_probe(struct platform_device *pdev) watchdog_set_restart_priority(&dc_wdt_wdd, 128); watchdog_init_timeout(&dc_wdt_wdd, timeout, dev); watchdog_stop_on_reboot(&dc_wdt_wdd); - ret = devm_watchdog_register_device(dev, &dc_wdt_wdd); - if (ret) { - dev_err(dev, "Failed to register watchdog device"); - return ret; - } - - return 0; + return devm_watchdog_register_device(dev, &dc_wdt_wdd); } static const struct of_device_id dc_wdt_of_match[] = { diff --git a/drivers/watchdog/ebc-c384_wdt.c b/drivers/watchdog/ebc-c384_wdt.c index c176f59fea28..8ef4b0df3855 100644 --- a/drivers/watchdog/ebc-c384_wdt.c +++ b/drivers/watchdog/ebc-c384_wdt.c @@ -2,15 +2,6 @@ /* * Watchdog timer driver for the WinSystems EBC-C384 * Copyright (C) 2016 William Breathitt Gray - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License, version 2, as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. 
*/ #include <linux/device.h> #include <linux/dmi.h> diff --git a/drivers/watchdog/eurotechwdt.c b/drivers/watchdog/eurotechwdt.c index 89129e6fa9b6..3a83a48abcae 100644 --- a/drivers/watchdog/eurotechwdt.c +++ b/drivers/watchdog/eurotechwdt.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * Eurotech CPU-1220/1410/1420 on board WDT driver * @@ -11,11 +12,6 @@ * (c) Copyright 1996-1997 Alan Cox <alan@lxorguk.ukuu.org.uk>, * All Rights Reserved. * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. - * * Neither Alan Cox nor CymruNet Ltd. admit liability nor provide * warranty for any of this software. This material is provided * "AS-IS" and at no charge. diff --git a/drivers/watchdog/ftwdt010_wdt.c b/drivers/watchdog/ftwdt010_wdt.c index d9626ef9b9ae..21dcc7765688 100644 --- a/drivers/watchdog/ftwdt010_wdt.c +++ b/drivers/watchdog/ftwdt010_wdt.c @@ -165,10 +165,8 @@ static int ftwdt010_wdt_probe(struct platform_device *pdev) } ret = devm_watchdog_register_device(dev, &gwdt->wdd); - if (ret) { - dev_err(dev, "failed to register watchdog\n"); + if (ret) return ret; - } /* Set up platform driver data */ platform_set_drvdata(pdev, gwdt); diff --git a/drivers/watchdog/gpio_wdt.c b/drivers/watchdog/gpio_wdt.c index 777de10f2a78..0923201ce874 100644 --- a/drivers/watchdog/gpio_wdt.c +++ b/drivers/watchdog/gpio_wdt.c @@ -13,6 +13,12 @@ #include <linux/platform_device.h> #include <linux/watchdog.h> +static bool nowayout = WATCHDOG_NOWAYOUT; +module_param(nowayout, bool, 0); +MODULE_PARM_DESC(nowayout, + "Watchdog cannot be stopped once started (default=" + __MODULE_STRING(WATCHDOG_NOWAYOUT) ")"); + #define SOFT_TIMEOUT_MIN 1 #define SOFT_TIMEOUT_DEF 60 @@ -151,6 +157,7 @@ static int gpio_wdt_probe(struct platform_device *pdev) priv->wdd.timeout = SOFT_TIMEOUT_DEF; watchdog_init_timeout(&priv->wdd, 0, dev); + watchdog_set_nowayout(&priv->wdd, nowayout); watchdog_stop_on_reboot(&priv->wdd); diff --git a/drivers/watchdog/hpwdt.c b/drivers/watchdog/hpwdt.c index 8a90f159ffb1..7d34bcf1c45b 100644 --- a/drivers/watchdog/hpwdt.c +++ b/drivers/watchdog/hpwdt.c @@ -22,10 +22,11 @@ #include <linux/watchdog.h> #include <asm/nmi.h> -#define HPWDT_VERSION "2.0.2" +#define HPWDT_VERSION "2.0.3" #define SECS_TO_TICKS(secs) ((secs) * 1000 / 128) #define TICKS_TO_SECS(ticks) ((ticks) * 128 / 1000) -#define HPWDT_MAX_TIMER TICKS_TO_SECS(65535) +#define HPWDT_MAX_TICKS 65535 +#define HPWDT_MAX_TIMER TICKS_TO_SECS(HPWDT_MAX_TICKS) #define DEFAULT_MARGIN 30 #define PRETIMEOUT_SEC 9 @@ -33,6 +34,7 @@ static bool ilo5; static unsigned int soft_margin = DEFAULT_MARGIN; /* in seconds */ static bool nowayout = WATCHDOG_NOWAYOUT; static bool pretimeout = IS_ENABLED(CONFIG_HPWDT_NMI_DECODING); +static int kdumptimeout = -1; static void __iomem *pci_mem_addr; /* the PCI-memory address */ static unsigned long __iomem *hpwdt_nmistat; @@ -52,15 +54,21 @@ static const struct pci_device_id hpwdt_blacklist[] = { {0}, /* terminate list */ }; +static struct watchdog_device hpwdt_dev; /* * Watchdog operations */ +static int hpwdt_hw_is_running(void) +{ + return ioread8(hpwdt_timer_con) & 0x01; +} + static int hpwdt_start(struct watchdog_device *wdd) { int control = 0x81 | (pretimeout ? 
0x4 : 0); - int reload = SECS_TO_TICKS(wdd->timeout); + int reload = SECS_TO_TICKS(min(wdd->timeout, wdd->max_hw_heartbeat_ms/1000)); - dev_dbg(wdd->parent, "start watchdog 0x%08x:0x%02x\n", reload, control); + dev_dbg(wdd->parent, "start watchdog 0x%08x:0x%08x:0x%02x\n", wdd->timeout, reload, control); iowrite16(reload, hpwdt_timer_reg); iowrite8(control, hpwdt_timer_con); @@ -85,12 +93,18 @@ static int hpwdt_stop_core(struct watchdog_device *wdd) return 0; } +static void hpwdt_ping_ticks(int val) +{ + val = min(val, HPWDT_MAX_TICKS); + iowrite16(val, hpwdt_timer_reg); +} + static int hpwdt_ping(struct watchdog_device *wdd) { - int reload = SECS_TO_TICKS(wdd->timeout); + int reload = SECS_TO_TICKS(min(wdd->timeout, wdd->max_hw_heartbeat_ms/1000)); - dev_dbg(wdd->parent, "ping watchdog 0x%08x\n", reload); - iowrite16(reload, hpwdt_timer_reg); + dev_dbg(wdd->parent, "ping watchdog 0x%08x:0x%08x\n", wdd->timeout, reload); + hpwdt_ping_ticks(reload); return 0; } @@ -166,7 +180,14 @@ static int hpwdt_pretimeout(unsigned int ulReason, struct pt_regs *regs) if (ilo5 && !pretimeout && !mynmi) return NMI_DONE; - hpwdt_stop(); + if (kdumptimeout < 0) + hpwdt_stop(); + else if (kdumptimeout == 0) + ; + else { + unsigned int val = max((unsigned int)kdumptimeout, hpwdt_dev.timeout); + hpwdt_ping_ticks(SECS_TO_TICKS(val)); + } hex_byte_pack(panic_msg, mynmi); nmi_panic(regs, panic_msg); @@ -204,9 +225,9 @@ static struct watchdog_device hpwdt_dev = { .info = &ident, .ops = &hpwdt_ops, .min_timeout = 1, - .max_timeout = HPWDT_MAX_TIMER, .timeout = DEFAULT_MARGIN, .pretimeout = PRETIMEOUT_SEC, + .max_hw_heartbeat_ms = HPWDT_MAX_TIMER * 1000, }; @@ -298,14 +319,18 @@ static int hpwdt_init_one(struct pci_dev *dev, hpwdt_timer_reg = pci_mem_addr + 0x70; hpwdt_timer_con = pci_mem_addr + 0x72; - /* Make sure that timer is disabled until /dev/watchdog is opened */ - hpwdt_stop(); + /* Have the core update running timer until user space is ready */ + if (hpwdt_hw_is_running()) { + dev_info(&dev->dev, "timer is running\n"); + set_bit(WDOG_HW_RUNNING, &hpwdt_dev.status); + } /* Initialize NMI Decoding functionality */ retval = hpwdt_init_nmi_decoding(dev); if (retval != 0) goto error_init_nmi_decoding; + watchdog_stop_on_unregister(&hpwdt_dev); watchdog_set_nowayout(&hpwdt_dev, nowayout); watchdog_init_timeout(&hpwdt_dev, soft_margin, NULL); @@ -314,13 +339,12 @@ static int hpwdt_init_one(struct pci_dev *dev, pretimeout = 0; } hpwdt_dev.pretimeout = pretimeout ? PRETIMEOUT_SEC : 0; + kdumptimeout = min(kdumptimeout, HPWDT_MAX_TIMER); hpwdt_dev.parent = &dev->dev; retval = watchdog_register_device(&hpwdt_dev); - if (retval < 0) { - dev_err(&dev->dev, "watchdog register failed: %d.\n", retval); + if (retval < 0) goto error_wd_register; - } dev_info(&dev->dev, "HPE Watchdog Timer Driver: Version: %s\n", HPWDT_VERSION); @@ -328,6 +352,7 @@ static int hpwdt_init_one(struct pci_dev *dev, hpwdt_dev.timeout, nowayout); dev_info(&dev->dev, "pretimeout: %s.\n", pretimeout ? 
"on" : "off"); + dev_info(&dev->dev, "kdumptimeout: %d.\n", kdumptimeout); if (dev->subsystem_vendor == PCI_VENDOR_ID_HP_3PAR) ilo5 = true; @@ -345,9 +370,6 @@ error_pci_iomap: static void hpwdt_exit(struct pci_dev *dev) { - if (!nowayout) - hpwdt_stop(); - watchdog_unregister_device(&hpwdt_dev); hpwdt_exit_nmi_decoding(); pci_iounmap(dev, pci_mem_addr); @@ -376,6 +398,9 @@ module_param(nowayout, bool, 0); MODULE_PARM_DESC(nowayout, "Watchdog cannot be stopped once started (default=" __MODULE_STRING(WATCHDOG_NOWAYOUT) ")"); +module_param(kdumptimeout, int, 0444); +MODULE_PARM_DESC(kdumptimeout, "Timeout applied for crash kernel transition in seconds"); + #ifdef CONFIG_HPWDT_NMI_DECODING module_param(pretimeout, bool, 0); MODULE_PARM_DESC(pretimeout, "Watchdog pretimeout enabled"); diff --git a/drivers/watchdog/i6300esb.c b/drivers/watchdog/i6300esb.c index f98f35a05896..a30835f547b3 100644 --- a/drivers/watchdog/i6300esb.c +++ b/drivers/watchdog/i6300esb.c @@ -315,11 +315,8 @@ static int esb_probe(struct pci_dev *pdev, /* Register the watchdog so that userspace has access to it */ ret = watchdog_register_device(&edev->wdd); - if (ret != 0) { - dev_err(&pdev->dev, - "cannot register watchdog device (err=%d)\n", ret); + if (ret != 0) goto err_unmap; - } dev_info(&pdev->dev, "initialized. heartbeat=%d sec (nowayout=%d)\n", edev->wdd.timeout, nowayout); diff --git a/drivers/watchdog/iTCO_vendor_support.c b/drivers/watchdog/iTCO_vendor_support.c index 68a9d9cc2eb8..4f1b96f59349 100644 --- a/drivers/watchdog/iTCO_vendor_support.c +++ b/drivers/watchdog/iTCO_vendor_support.c @@ -1,13 +1,9 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * intel TCO vendor specific watchdog driver support * * (c) Copyright 2006-2009 Wim Van Sebroeck <wim@iguana.be>. * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. - * * Neither Wim Van Sebroeck nor Iguana vzw. admit liability nor * provide warranty for any of this software. This material is * provided "AS-IS" and at no charge. @@ -216,4 +212,3 @@ MODULE_AUTHOR("Wim Van Sebroeck <wim@iguana.be>, " MODULE_DESCRIPTION("Intel TCO Vendor Specific WatchDog Timer Driver Support"); MODULE_VERSION(DRV_VERSION); MODULE_LICENSE("GPL"); - diff --git a/drivers/watchdog/iTCO_wdt.c b/drivers/watchdog/iTCO_wdt.c index 89cea6ce9a08..c559f706ae7e 100644 --- a/drivers/watchdog/iTCO_wdt.c +++ b/drivers/watchdog/iTCO_wdt.c @@ -1,13 +1,9 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * intel TCO Watchdog Driver * * (c) Copyright 2006-2011 Wim Van Sebroeck <wim@iguana.be>. * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. - * * Neither Wim Van Sebroeck nor Iguana vzw. admit liability nor * provide warranty for any of this software. This material is * provided "AS-IS" and at no charge. diff --git a/drivers/watchdog/ib700wdt.c b/drivers/watchdog/ib700wdt.c index 30d6cec582af..92fd7f33bc4d 100644 --- a/drivers/watchdog/ib700wdt.c +++ b/drivers/watchdog/ib700wdt.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * IB700 Single Board Computer WDT driver * @@ -14,11 +15,6 @@ * (c) Copyright 1996 Alan Cox <alan@lxorguk.ukuu.org.uk>, * All Rights Reserved. 
* - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. - * * Neither Alan Cox nor CymruNet Ltd. admit liability nor provide * warranty for any of this software. This material is provided * "AS-IS" and at no charge. diff --git a/drivers/watchdog/ie6xx_wdt.c b/drivers/watchdog/ie6xx_wdt.c index 508fbefce9f6..8f28993fab8b 100644 --- a/drivers/watchdog/ie6xx_wdt.c +++ b/drivers/watchdog/ie6xx_wdt.c @@ -66,7 +66,7 @@ MODULE_PARM_DESC(resetmode, static struct { unsigned short sch_wdtba; - struct spinlock unlock_sequence; + spinlock_t unlock_sequence; #ifdef CONFIG_DEBUG_FS struct dentry *debugfs; #endif @@ -254,12 +254,8 @@ static int ie6xx_wdt_probe(struct platform_device *pdev) ie6xx_wdt_debugfs_init(); ret = watchdog_register_device(&ie6xx_wdt_dev); - if (ret) { - dev_err(&pdev->dev, - "Watchdog timer: cannot register device (err =%d)\n", - ret); + if (ret) goto misc_register_error; - } return 0; diff --git a/drivers/watchdog/imx2_wdt.c b/drivers/watchdog/imx2_wdt.c index a606005dd65f..32af3974e6bb 100644 --- a/drivers/watchdog/imx2_wdt.c +++ b/drivers/watchdog/imx2_wdt.c @@ -316,10 +316,8 @@ static int __init imx2_wdt_probe(struct platform_device *pdev) regmap_write(wdev->regmap, IMX2_WDT_WMCR, 0); ret = watchdog_register_device(wdog); - if (ret) { - dev_err(&pdev->dev, "cannot register watchdog device\n"); + if (ret) goto disable_clk; - } dev_info(&pdev->dev, "timeout %d sec (nowayout=%d)\n", wdog->timeout, nowayout); diff --git a/drivers/watchdog/imx_sc_wdt.c b/drivers/watchdog/imx_sc_wdt.c index 49848b66186c..78eaaf75a263 100644 --- a/drivers/watchdog/imx_sc_wdt.c +++ b/drivers/watchdog/imx_sc_wdt.c @@ -4,6 +4,7 @@ */ #include <linux/arm-smccc.h> +#include <linux/firmware/imx/sci.h> #include <linux/io.h> #include <linux/init.h> #include <linux/kernel.h> @@ -33,11 +34,19 @@ #define SC_TIMER_WDOG_ACTION_PARTITION 0 +#define SC_IRQ_WDOG 1 +#define SC_IRQ_GROUP_WDOG 1 + static bool nowayout = WATCHDOG_NOWAYOUT; module_param(nowayout, bool, 0000); MODULE_PARM_DESC(nowayout, "Watchdog cannot be stopped once started (default=" __MODULE_STRING(WATCHDOG_NOWAYOUT) ")"); +struct imx_sc_wdt_device { + struct watchdog_device wdd; + struct notifier_block wdt_notifier; +}; + static int imx_sc_wdt_ping(struct watchdog_device *wdog) { struct arm_smccc_res res; @@ -85,24 +94,66 @@ static int imx_sc_wdt_set_timeout(struct watchdog_device *wdog, return res.a0 ? 
-EACCES : 0; } +static int imx_sc_wdt_set_pretimeout(struct watchdog_device *wdog, + unsigned int pretimeout) +{ + struct arm_smccc_res res; + + arm_smccc_smc(IMX_SIP_TIMER, IMX_SIP_TIMER_SET_PRETIME_WDOG, + pretimeout * 1000, 0, 0, 0, 0, 0, &res); + if (res.a0) + return -EACCES; + + wdog->pretimeout = pretimeout; + + return 0; +} + +static int imx_sc_wdt_notify(struct notifier_block *nb, + unsigned long event, void *group) +{ + struct imx_sc_wdt_device *imx_sc_wdd = + container_of(nb, + struct imx_sc_wdt_device, + wdt_notifier); + + if (event & SC_IRQ_WDOG && + *(u8 *)group == SC_IRQ_GROUP_WDOG) + watchdog_notify_pretimeout(&imx_sc_wdd->wdd); + + return 0; +} + +static void imx_sc_wdt_action(void *data) +{ + struct notifier_block *wdt_notifier = data; + + imx_scu_irq_unregister_notifier(wdt_notifier); + imx_scu_irq_group_enable(SC_IRQ_GROUP_WDOG, + SC_IRQ_WDOG, + false); +} + static const struct watchdog_ops imx_sc_wdt_ops = { .owner = THIS_MODULE, .start = imx_sc_wdt_start, .stop = imx_sc_wdt_stop, .ping = imx_sc_wdt_ping, .set_timeout = imx_sc_wdt_set_timeout, + .set_pretimeout = imx_sc_wdt_set_pretimeout, }; -static const struct watchdog_info imx_sc_wdt_info = { +static struct watchdog_info imx_sc_wdt_info = { .identity = "i.MX SC watchdog timer", .options = WDIOF_SETTIMEOUT | WDIOF_KEEPALIVEPING | - WDIOF_MAGICCLOSE | WDIOF_PRETIMEOUT, + WDIOF_MAGICCLOSE, }; static int imx_sc_wdt_probe(struct platform_device *pdev) { + struct imx_sc_wdt_device *imx_sc_wdd; + struct watchdog_device *wdog; struct device *dev = &pdev->dev; - struct watchdog_device *imx_sc_wdd; int ret; imx_sc_wdd = devm_kzalloc(dev, sizeof(*imx_sc_wdd), GFP_KERNEL); @@ -111,42 +162,70 @@ static int imx_sc_wdt_probe(struct platform_device *pdev) platform_set_drvdata(pdev, imx_sc_wdd); - imx_sc_wdd->info = &imx_sc_wdt_info; - imx_sc_wdd->ops = &imx_sc_wdt_ops; - imx_sc_wdd->min_timeout = 1; - imx_sc_wdd->max_timeout = MAX_TIMEOUT; - imx_sc_wdd->parent = dev; - imx_sc_wdd->timeout = DEFAULT_TIMEOUT; - - watchdog_init_timeout(imx_sc_wdd, 0, dev); - watchdog_stop_on_reboot(imx_sc_wdd); - watchdog_stop_on_unregister(imx_sc_wdd); + wdog = &imx_sc_wdd->wdd; + wdog->info = &imx_sc_wdt_info; + wdog->ops = &imx_sc_wdt_ops; + wdog->min_timeout = 1; + wdog->max_timeout = MAX_TIMEOUT; + wdog->parent = dev; + wdog->timeout = DEFAULT_TIMEOUT; + + watchdog_init_timeout(wdog, 0, dev); + watchdog_stop_on_reboot(wdog); + watchdog_stop_on_unregister(wdog); + + ret = devm_watchdog_register_device(dev, wdog); + + if (ret) { + dev_err(dev, "Failed to register watchdog device\n"); + return ret; + } + + ret = imx_scu_irq_group_enable(SC_IRQ_GROUP_WDOG, + SC_IRQ_WDOG, + true); + if (ret) { + dev_warn(dev, "Enable irq failed, pretimeout NOT supported\n"); + return 0; + } - ret = devm_watchdog_register_device(dev, imx_sc_wdd); + imx_sc_wdd->wdt_notifier.notifier_call = imx_sc_wdt_notify; + ret = imx_scu_irq_register_notifier(&imx_sc_wdd->wdt_notifier); if (ret) { - dev_err(dev, "Failed to register watchdog device\n"); - return ret; + imx_scu_irq_group_enable(SC_IRQ_GROUP_WDOG, + SC_IRQ_WDOG, + false); + dev_warn(dev, + "Register irq notifier failed, pretimeout NOT supported\n"); + return 0; } + ret = devm_add_action_or_reset(dev, imx_sc_wdt_action, + &imx_sc_wdd->wdt_notifier); + if (!ret) + imx_sc_wdt_info.options |= WDIOF_PRETIMEOUT; + else + dev_warn(dev, "Add action failed, pretimeout NOT supported\n"); + return 0; } static int __maybe_unused imx_sc_wdt_suspend(struct device *dev) { - struct watchdog_device *imx_sc_wdd = dev_get_drvdata(dev); + 
struct imx_sc_wdt_device *imx_sc_wdd = dev_get_drvdata(dev); - if (watchdog_active(imx_sc_wdd)) - imx_sc_wdt_stop(imx_sc_wdd); + if (watchdog_active(&imx_sc_wdd->wdd)) + imx_sc_wdt_stop(&imx_sc_wdd->wdd); return 0; } static int __maybe_unused imx_sc_wdt_resume(struct device *dev) { - struct watchdog_device *imx_sc_wdd = dev_get_drvdata(dev); + struct imx_sc_wdt_device *imx_sc_wdd = dev_get_drvdata(dev); - if (watchdog_active(imx_sc_wdd)) - imx_sc_wdt_start(imx_sc_wdd); + if (watchdog_active(&imx_sc_wdd->wdd)) + imx_sc_wdt_start(&imx_sc_wdd->wdd); return 0; } diff --git a/drivers/watchdog/intel-mid_wdt.c b/drivers/watchdog/intel-mid_wdt.c index b2463f8276e6..2cdbd37c700c 100644 --- a/drivers/watchdog/intel-mid_wdt.c +++ b/drivers/watchdog/intel-mid_wdt.c @@ -161,10 +161,8 @@ static int mid_wdt_probe(struct platform_device *pdev) set_bit(WDOG_HW_RUNNING, &wdt_dev->status); ret = devm_watchdog_register_device(dev, wdt_dev); - if (ret) { - dev_err(dev, "error registering watchdog device\n"); + if (ret) return ret; - } dev_info(dev, "Intel MID watchdog device probed\n"); diff --git a/drivers/watchdog/jz4740_wdt.c b/drivers/watchdog/jz4740_wdt.c index 313358b2e0b1..d4a90916dd38 100644 --- a/drivers/watchdog/jz4740_wdt.c +++ b/drivers/watchdog/jz4740_wdt.c @@ -4,6 +4,7 @@ * JZ4740 Watchdog driver */ +#include <linux/mfd/ingenic-tcu.h> #include <linux/module.h> #include <linux/moduleparam.h> #include <linux/types.h> @@ -19,23 +20,16 @@ #include <asm/mach-jz4740/timer.h> -#define JZ_REG_WDT_TIMER_DATA 0x0 -#define JZ_REG_WDT_COUNTER_ENABLE 0x4 -#define JZ_REG_WDT_TIMER_COUNTER 0x8 -#define JZ_REG_WDT_TIMER_CONTROL 0xC - #define JZ_WDT_CLOCK_PCLK 0x1 #define JZ_WDT_CLOCK_RTC 0x2 #define JZ_WDT_CLOCK_EXT 0x4 -#define JZ_WDT_CLOCK_DIV_SHIFT 3 - -#define JZ_WDT_CLOCK_DIV_1 (0 << JZ_WDT_CLOCK_DIV_SHIFT) -#define JZ_WDT_CLOCK_DIV_4 (1 << JZ_WDT_CLOCK_DIV_SHIFT) -#define JZ_WDT_CLOCK_DIV_16 (2 << JZ_WDT_CLOCK_DIV_SHIFT) -#define JZ_WDT_CLOCK_DIV_64 (3 << JZ_WDT_CLOCK_DIV_SHIFT) -#define JZ_WDT_CLOCK_DIV_256 (4 << JZ_WDT_CLOCK_DIV_SHIFT) -#define JZ_WDT_CLOCK_DIV_1024 (5 << JZ_WDT_CLOCK_DIV_SHIFT) +#define JZ_WDT_CLOCK_DIV_1 (0 << TCU_TCSR_PRESCALE_LSB) +#define JZ_WDT_CLOCK_DIV_4 (1 << TCU_TCSR_PRESCALE_LSB) +#define JZ_WDT_CLOCK_DIV_16 (2 << TCU_TCSR_PRESCALE_LSB) +#define JZ_WDT_CLOCK_DIV_64 (3 << TCU_TCSR_PRESCALE_LSB) +#define JZ_WDT_CLOCK_DIV_256 (4 << TCU_TCSR_PRESCALE_LSB) +#define JZ_WDT_CLOCK_DIV_1024 (5 << TCU_TCSR_PRESCALE_LSB) #define DEFAULT_HEARTBEAT 5 #define MAX_HEARTBEAT 2048 @@ -63,7 +57,7 @@ static int jz4740_wdt_ping(struct watchdog_device *wdt_dev) { struct jz4740_wdt_drvdata *drvdata = watchdog_get_drvdata(wdt_dev); - writew(0x0, drvdata->base + JZ_REG_WDT_TIMER_COUNTER); + writew(0x0, drvdata->base + TCU_REG_WDT_TCNT); return 0; } @@ -74,6 +68,7 @@ static int jz4740_wdt_set_timeout(struct watchdog_device *wdt_dev, unsigned int rtc_clk_rate; unsigned int timeout_value; unsigned short clock_div = JZ_WDT_CLOCK_DIV_1; + u8 tcer; rtc_clk_rate = clk_get_rate(drvdata->rtc_clk); @@ -86,18 +81,19 @@ static int jz4740_wdt_set_timeout(struct watchdog_device *wdt_dev, break; } timeout_value >>= 2; - clock_div += (1 << JZ_WDT_CLOCK_DIV_SHIFT); + clock_div += (1 << TCU_TCSR_PRESCALE_LSB); } - writeb(0x0, drvdata->base + JZ_REG_WDT_COUNTER_ENABLE); - writew(clock_div, drvdata->base + JZ_REG_WDT_TIMER_CONTROL); + tcer = readb(drvdata->base + TCU_REG_WDT_TCER); + writeb(0x0, drvdata->base + TCU_REG_WDT_TCER); + writew(clock_div, drvdata->base + TCU_REG_WDT_TCSR); - writew((u16)timeout_value, 
drvdata->base + JZ_REG_WDT_TIMER_DATA); - writew(0x0, drvdata->base + JZ_REG_WDT_TIMER_COUNTER); - writew(clock_div | JZ_WDT_CLOCK_RTC, - drvdata->base + JZ_REG_WDT_TIMER_CONTROL); + writew((u16)timeout_value, drvdata->base + TCU_REG_WDT_TDR); + writew(0x0, drvdata->base + TCU_REG_WDT_TCNT); + writew(clock_div | JZ_WDT_CLOCK_RTC, drvdata->base + TCU_REG_WDT_TCSR); - writeb(0x1, drvdata->base + JZ_REG_WDT_COUNTER_ENABLE); + if (tcer & TCU_WDT_TCER_TCEN) + writeb(TCU_WDT_TCER_TCEN, drvdata->base + TCU_REG_WDT_TCER); wdt_dev->timeout = new_timeout; return 0; @@ -105,9 +101,18 @@ static int jz4740_wdt_set_timeout(struct watchdog_device *wdt_dev, static int jz4740_wdt_start(struct watchdog_device *wdt_dev) { + struct jz4740_wdt_drvdata *drvdata = watchdog_get_drvdata(wdt_dev); + u8 tcer; + + tcer = readb(drvdata->base + TCU_REG_WDT_TCER); + jz4740_timer_enable_watchdog(); jz4740_wdt_set_timeout(wdt_dev, wdt_dev->timeout); + /* Start watchdog if it wasn't started already */ + if (!(tcer & TCU_WDT_TCER_TCEN)) + writeb(TCU_WDT_TCER_TCEN, drvdata->base + TCU_REG_WDT_TCER); + return 0; } @@ -115,7 +120,7 @@ static int jz4740_wdt_stop(struct watchdog_device *wdt_dev) { struct jz4740_wdt_drvdata *drvdata = watchdog_get_drvdata(wdt_dev); - writeb(0x0, drvdata->base + JZ_REG_WDT_COUNTER_ENABLE); + writeb(0x0, drvdata->base + TCU_REG_WDT_TCER); jz4740_timer_disable_watchdog(); return 0; @@ -187,11 +192,7 @@ static int jz4740_wdt_probe(struct platform_device *pdev) return PTR_ERR(drvdata->rtc_clk); } - ret = devm_watchdog_register_device(dev, &drvdata->wdt); - if (ret < 0) - return ret; - - return 0; + return devm_watchdog_register_device(dev, &drvdata->wdt); } static struct platform_driver jz4740_wdt_driver = { diff --git a/drivers/watchdog/loongson1_wdt.c b/drivers/watchdog/loongson1_wdt.c index c8c2b8a88fc2..bb3d075c0633 100644 --- a/drivers/watchdog/loongson1_wdt.c +++ b/drivers/watchdog/loongson1_wdt.c @@ -132,10 +132,8 @@ static int ls1x_wdt_probe(struct platform_device *pdev) watchdog_set_drvdata(ls1x_wdt, drvdata); err = devm_watchdog_register_device(dev, &drvdata->wdt); - if (err) { - dev_err(dev, "failed to register watchdog device\n"); + if (err) return err; - } platform_set_drvdata(pdev, drvdata); diff --git a/drivers/watchdog/max77620_wdt.c b/drivers/watchdog/max77620_wdt.c index 9937f9fccd2e..be6a53c30002 100644 --- a/drivers/watchdog/max77620_wdt.c +++ b/drivers/watchdog/max77620_wdt.c @@ -182,13 +182,7 @@ static int max77620_wdt_probe(struct platform_device *pdev) watchdog_set_drvdata(wdt_dev, wdt); watchdog_stop_on_unregister(wdt_dev); - ret = devm_watchdog_register_device(dev, wdt_dev); - if (ret < 0) { - dev_err(dev, "watchdog registration failed: %d\n", ret); - return ret; - } - - return 0; + return devm_watchdog_register_device(dev, wdt_dev); } static const struct platform_device_id max77620_wdt_devtype[] = { diff --git a/drivers/watchdog/mei_wdt.c b/drivers/watchdog/mei_wdt.c index 96a770938ff0..5391bf3e6b11 100644 --- a/drivers/watchdog/mei_wdt.c +++ b/drivers/watchdog/mei_wdt.c @@ -384,10 +384,8 @@ static int mei_wdt_register(struct mei_wdt *wdt) watchdog_stop_on_reboot(&wdt->wdd); ret = watchdog_register_device(&wdt->wdd); - if (ret) { - dev_err(dev, "unable to register watchdog device = %d.\n", ret); + if (ret) watchdog_set_drvdata(&wdt->wdd, NULL); - } wdt->state = MEI_WDT_IDLE; diff --git a/drivers/watchdog/mena21_wdt.c b/drivers/watchdog/mena21_wdt.c index e9ca4e0e25dc..99d2359d5a8a 100644 --- a/drivers/watchdog/mena21_wdt.c +++ b/drivers/watchdog/mena21_wdt.c @@ -190,10 
+190,8 @@ static int a21_wdt_probe(struct platform_device *pdev) dev_set_drvdata(dev, drv); ret = devm_watchdog_register_device(dev, &a21_wdt); - if (ret) { - dev_err(dev, "Cannot register watchdog device\n"); + if (ret) return ret; - } dev_info(dev, "MEN A21 watchdog timer driver enabled\n"); diff --git a/drivers/watchdog/menf21bmc_wdt.c b/drivers/watchdog/menf21bmc_wdt.c index 7766d7361d3b..81ebdfc371f4 100644 --- a/drivers/watchdog/menf21bmc_wdt.c +++ b/drivers/watchdog/menf21bmc_wdt.c @@ -152,10 +152,8 @@ static int menf21bmc_wdt_probe(struct platform_device *pdev) } ret = devm_watchdog_register_device(dev, &drv_data->wdt); - if (ret) { - dev_err(dev, "failed to register Watchdog device\n"); + if (ret) return ret; - } dev_info(dev, "MEN 14F021P00 BMC Watchdog device enabled\n"); diff --git a/drivers/watchdog/mpc8xxx_wdt.c b/drivers/watchdog/mpc8xxx_wdt.c index b6ffad421bd0..3fc457bc16db 100644 --- a/drivers/watchdog/mpc8xxx_wdt.c +++ b/drivers/watchdog/mpc8xxx_wdt.c @@ -201,11 +201,8 @@ static int mpc8xxx_wdt_probe(struct platform_device *ofdev) ddata->wdd.timeout = ddata->wdd.min_timeout; ret = devm_watchdog_register_device(dev, &ddata->wdd); - if (ret) { - dev_err(dev, "cannot register watchdog device (err=%d)\n", - ret); + if (ret) return ret; - } dev_info(dev, "WDT driver for MPC8xxx initialized. mode:%s timeout=%d sec\n", diff --git a/drivers/watchdog/mv64x60_wdt.c b/drivers/watchdog/mv64x60_wdt.c index c785f4f0a196..74bf7144a970 100644 --- a/drivers/watchdog/mv64x60_wdt.c +++ b/drivers/watchdog/mv64x60_wdt.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * mv64x60_wdt.c - MV64X60 (Marvell Discovery) watchdog userspace interface * @@ -9,10 +10,7 @@ * * Derived from mpc8xx_wdt.c, with the following copyright. * - * 2002 (c) Florian Schirmer <jolt@tuxbox.org> This file is licensed under - * the terms of the GNU General Public License version 2. This program - * is licensed "as is" without any warranty of any kind, whether express - * or implied. 
+ * 2002 (c) Florian Schirmer <jolt@tuxbox.org> */ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt diff --git a/drivers/watchdog/ni903x_wdt.c b/drivers/watchdog/ni903x_wdt.c index 60f5608af2a8..4cebad324b20 100644 --- a/drivers/watchdog/ni903x_wdt.c +++ b/drivers/watchdog/ni903x_wdt.c @@ -211,10 +211,8 @@ static int ni903x_acpi_add(struct acpi_device *device) watchdog_init_timeout(wdd, timeout, dev); ret = watchdog_register_device(wdd); - if (ret) { - dev_err(dev, "failed to register watchdog\n"); + if (ret) return ret; - } /* Switch from boot mode to user mode */ outb(NIWD_CONTROL_RESET | NIWD_CONTROL_MODE, diff --git a/drivers/watchdog/nic7018_wdt.c b/drivers/watchdog/nic7018_wdt.c index 2e1a2a3d4ec9..2a46cc662943 100644 --- a/drivers/watchdog/nic7018_wdt.c +++ b/drivers/watchdog/nic7018_wdt.c @@ -210,7 +210,6 @@ static int nic7018_probe(struct platform_device *pdev) ret = watchdog_register_device(wdd); if (ret) { outb(LOCK, wdt->io_base + WDT_REG_LOCK); - dev_err(dev, "failed to register watchdog\n"); return ret; } diff --git a/drivers/watchdog/npcm_wdt.c b/drivers/watchdog/npcm_wdt.c index 9d6c1689b12c..9c773c3d6d5d 100644 --- a/drivers/watchdog/npcm_wdt.c +++ b/drivers/watchdog/npcm_wdt.c @@ -220,10 +220,8 @@ static int npcm_wdt_probe(struct platform_device *pdev) return ret; ret = devm_watchdog_register_device(dev, &wdt->wdd); - if (ret) { - dev_err(dev, "failed to register watchdog\n"); + if (ret) return ret; - } dev_info(dev, "NPCM watchdog driver enabled\n"); diff --git a/drivers/watchdog/nv_tco.h b/drivers/watchdog/nv_tco.h index c2d1d04e055b..d325e528010f 100644 --- a/drivers/watchdog/nv_tco.h +++ b/drivers/watchdog/nv_tco.h @@ -1,3 +1,4 @@ +/* SPDX-License-Identifier: GPL-2.0+ */ /* * nv_tco: TCO timer driver for nVidia chipsets. * @@ -10,11 +11,6 @@ * Reserved. * http://www.kernelconcepts.de * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. - * * Neither kernel concepts nor Nils Faerber admit liability nor provide * warranty for any of this software. This material is provided * "AS-IS" and at no charge. diff --git a/drivers/watchdog/octeon-wdt-main.c b/drivers/watchdog/octeon-wdt-main.c index 0ec419a3f7ed..fde9e739b436 100644 --- a/drivers/watchdog/octeon-wdt-main.c +++ b/drivers/watchdog/octeon-wdt-main.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * Octeon Watchdog driver * @@ -10,22 +11,12 @@ * (c) Copyright 1996-1997 Alan Cox <alan@lxorguk.ukuu.org.uk>, * All Rights Reserved. * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. - * * Neither Alan Cox nor CymruNet Ltd. admit liability nor provide * warranty for any of this software. This material is provided * "AS-IS" and at no charge. * * (c) Copyright 1995 Alan Cox <alan@lxorguk.ukuu.org.uk> * - * This file is subject to the terms and conditions of the GNU General Public - * License. See the file "COPYING" in the main directory of this archive - * for more details. - * - * * The OCTEON watchdog has a maximum timeout of 2^32 * io_clock. 
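 * (A quick sanity check of that bound, assuming a hypothetical 800 MHz
 * io_clock: 2^32 cycles / 800000000 Hz is roughly 5.4 seconds, which is
 * consistent with the "less than 10 seconds" figure below.)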
* For most systems this is less than 10 seconds, so to allow for * software to request longer watchdog heartbeats, we maintain software diff --git a/drivers/watchdog/of_xilinx_wdt.c b/drivers/watchdog/of_xilinx_wdt.c index 03786992b701..7fe4f7c3f7ce 100644 --- a/drivers/watchdog/of_xilinx_wdt.c +++ b/drivers/watchdog/of_xilinx_wdt.c @@ -238,10 +238,8 @@ static int xwdt_probe(struct platform_device *pdev) } rc = devm_watchdog_register_device(dev, xilinx_wdt_wdd); - if (rc) { - dev_err(dev, "Cannot register watchdog (err=%d)\n", rc); + if (rc) return rc; - } clk_disable(xdev->clk); diff --git a/drivers/watchdog/omap_wdt.c b/drivers/watchdog/omap_wdt.c index d49688d93f6a..9b91882fe3c4 100644 --- a/drivers/watchdog/omap_wdt.c +++ b/drivers/watchdog/omap_wdt.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * omap_wdt.c * @@ -6,10 +7,7 @@ * Author: MontaVista Software, Inc. * <gdavis@mvista.com> or <source@mvista.com> * - * 2003 (c) MontaVista Software, Inc. This file is licensed under the - * terms of the GNU General Public License version 2. This program is - * licensed "as is" without any warranty of any kind, whether express - * or implied. + * 2003 (c) MontaVista Software, Inc. * * History: * diff --git a/drivers/watchdog/omap_wdt.h b/drivers/watchdog/omap_wdt.h index 42f31ec5e90d..950b4643f3e7 100644 --- a/drivers/watchdog/omap_wdt.h +++ b/drivers/watchdog/omap_wdt.h @@ -1,3 +1,4 @@ +/* SPDX-License-Identifier: GPL-2.0+ */ /* * linux/drivers/char/watchdog/omap_wdt.h * @@ -5,26 +6,6 @@ * OMAP Watchdog timer register definitions * * Copyright (C) 2004 Texas Instruments. - * - * This program is free software; you can redistribute it and/or modify it - * under the terms of the GNU General Public License as published by the - * Free Software Foundation; either version 2 of the License, or (at your - * option) any later version. - * - * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED - * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF - * MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN - * NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, - * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT - * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF - * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON - * ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF - * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - * You should have received a copy of the GNU General Public License along - * with this program; if not, write to the Free Software Foundation, Inc., - * 675 Mass Ave, Cambridge, MA 02139, USA. */ #ifndef _OMAP_WATCHDOG_H diff --git a/drivers/watchdog/pc87413_wdt.c b/drivers/watchdog/pc87413_wdt.c index ca21d6c240a3..2af1a8b3f973 100644 --- a/drivers/watchdog/pc87413_wdt.c +++ b/drivers/watchdog/pc87413_wdt.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * NS pc87413-wdt Watchdog Timer driver for Linux 2.6.x.x * @@ -6,11 +7,6 @@ * (C) Copyright 2006 Sven Anders, <anders@anduras.de> * and Marcus Junker, <junker@anduras.de> * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. 
- * * Neither Sven Anders, Marcus Junker nor ANDURAS AG * admit liability nor provide warranty for any of this software. * This material is provided "AS-IS" and at no charge. diff --git a/drivers/watchdog/pcwd_pci.c b/drivers/watchdog/pcwd_pci.c index 5773d2591d3f..e30c1f762045 100644 --- a/drivers/watchdog/pcwd_pci.c +++ b/drivers/watchdog/pcwd_pci.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * Berkshire PCI-PC Watchdog Card Driver * @@ -10,11 +11,6 @@ * Matt Domsch <Matt_Domsch@dell.com>, * Rob Radez <rob@osinvestor.com> * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. - * * Neither Wim Van Sebroeck nor Iguana vzw. admit liability nor * provide warranty for any of this software. This material is * provided "AS-IS" and at no charge. diff --git a/drivers/watchdog/pcwd_usb.c b/drivers/watchdog/pcwd_usb.c index 5de6182dae33..6727f8ab2d18 100644 --- a/drivers/watchdog/pcwd_usb.c +++ b/drivers/watchdog/pcwd_usb.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * Berkshire USB-PC Watchdog Card Driver * @@ -10,11 +11,6 @@ * Rob Radez <rob@osinvestor.com>, * Greg Kroah-Hartman <greg@kroah.com> * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. - * * Neither Wim Van Sebroeck nor Iguana vzw. admit liability nor * provide warranty for any of this software. This material is * provided "AS-IS" and at no charge. diff --git a/drivers/watchdog/pic32-dmt.c b/drivers/watchdog/pic32-dmt.c index 4f2aca78f13a..f43062b3c4c8 100644 --- a/drivers/watchdog/pic32-dmt.c +++ b/drivers/watchdog/pic32-dmt.c @@ -212,10 +212,8 @@ static int pic32_dmt_probe(struct platform_device *pdev) watchdog_set_drvdata(wdd, dmt); ret = devm_watchdog_register_device(dev, wdd); - if (ret) { - dev_err(dev, "watchdog register failed, err %d\n", ret); + if (ret) return ret; - } platform_set_drvdata(pdev, wdd); return 0; diff --git a/drivers/watchdog/pic32-wdt.c b/drivers/watchdog/pic32-wdt.c index 5ecdd880f0b7..41715d68d9e9 100644 --- a/drivers/watchdog/pic32-wdt.c +++ b/drivers/watchdog/pic32-wdt.c @@ -221,10 +221,8 @@ static int pic32_wdt_drv_probe(struct platform_device *pdev) watchdog_set_drvdata(wdd, wdt); ret = devm_watchdog_register_device(dev, wdd); - if (ret) { - dev_err(dev, "watchdog register failed, err %d\n", ret); + if (ret) return ret; - } platform_set_drvdata(pdev, wdd); diff --git a/drivers/watchdog/pnx4008_wdt.c b/drivers/watchdog/pnx4008_wdt.c index d9e03544aeae..7b446b696f2b 100644 --- a/drivers/watchdog/pnx4008_wdt.c +++ b/drivers/watchdog/pnx4008_wdt.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0 /* * drivers/char/watchdog/pnx4008_wdt.c * @@ -11,10 +12,6 @@ * 2005-2006 (c) MontaVista Software, Inc. * * (C) 2012 Wolfram Sang, Pengutronix - * - * This file is licensed under the terms of the GNU General Public License - * version 2. This program is licensed "as is" without any warranty of any - * kind, whether express or implied. 
*/ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt @@ -221,10 +218,8 @@ static int pnx4008_wdt_probe(struct platform_device *pdev) set_bit(WDOG_HW_RUNNING, &pnx4008_wdd.status); ret = devm_watchdog_register_device(dev, &pnx4008_wdd); - if (ret < 0) { - dev_err(dev, "cannot register watchdog device\n"); + if (ret < 0) return ret; - } dev_info(dev, "heartbeat %d sec\n", pnx4008_wdd.timeout); diff --git a/drivers/watchdog/qcom-wdt.c b/drivers/watchdog/qcom-wdt.c index fc0f7e5de38d..7be7f87be28f 100644 --- a/drivers/watchdog/qcom-wdt.c +++ b/drivers/watchdog/qcom-wdt.c @@ -223,10 +223,8 @@ static int qcom_wdt_probe(struct platform_device *pdev) watchdog_init_timeout(&wdt->wdd, 0, dev); ret = devm_watchdog_register_device(dev, &wdt->wdd); - if (ret) { - dev_err(dev, "failed to register watchdog\n"); + if (ret) return ret; - } platform_set_drvdata(pdev, wdt); return 0; diff --git a/drivers/watchdog/rave-sp-wdt.c b/drivers/watchdog/rave-sp-wdt.c index 35db173252f9..2c95615b6354 100644 --- a/drivers/watchdog/rave-sp-wdt.c +++ b/drivers/watchdog/rave-sp-wdt.c @@ -310,7 +310,6 @@ static int rave_sp_wdt_probe(struct platform_device *pdev) ret = devm_watchdog_register_device(dev, wdd); if (ret) { - dev_err(dev, "Failed to register watchdog device\n"); rave_sp_wdt_stop(wdd); return ret; } diff --git a/drivers/watchdog/renesas_wdt.c b/drivers/watchdog/renesas_wdt.c index 565dbc1ec638..00662a8e039c 100644 --- a/drivers/watchdog/renesas_wdt.c +++ b/drivers/watchdog/renesas_wdt.c @@ -7,6 +7,7 @@ */ #include <linux/bitops.h> #include <linux/clk.h> +#include <linux/delay.h> #include <linux/io.h> #include <linux/kernel.h> #include <linux/module.h> @@ -70,6 +71,15 @@ static int rwdt_init_timeout(struct watchdog_device *wdev) return 0; } +static void rwdt_wait_cycles(struct rwdt_priv *priv, unsigned int cycles) +{ + unsigned int delay; + + delay = DIV_ROUND_UP(cycles * 1000000, priv->clk_rate); + + usleep_range(delay, 2 * delay); +} + static int rwdt_start(struct watchdog_device *wdev) { struct rwdt_priv *priv = watchdog_get_drvdata(wdev); @@ -80,6 +90,8 @@ static int rwdt_start(struct watchdog_device *wdev) /* Stop the timer before we modify any register */ val = readb_relaxed(priv->base + RWTCSRA) & ~RWTCSRA_TME; rwdt_write(priv, val, RWTCSRA); + /* Delay 2 cycles before setting watchdog counter */ + rwdt_wait_cycles(priv, 2); rwdt_init_timeout(wdev); rwdt_write(priv, priv->cks, RWTCSRA); @@ -98,6 +110,8 @@ static int rwdt_stop(struct watchdog_device *wdev) struct rwdt_priv *priv = watchdog_get_drvdata(wdev); rwdt_write(priv, priv->cks, RWTCSRA); + /* Delay 3 cycles before disabling module clock */ + rwdt_wait_cycles(priv, 3); pm_runtime_put(wdev->parent); return 0; @@ -175,15 +189,16 @@ static inline bool rwdt_blacklisted(struct device *dev) { return false; } static int rwdt_probe(struct platform_device *pdev) { + struct device *dev = &pdev->dev; struct rwdt_priv *priv; struct clk *clk; unsigned long clks_per_sec; int ret, i; - if (rwdt_blacklisted(&pdev->dev)) + if (rwdt_blacklisted(dev)) return -ENODEV; - priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL); + priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL); if (!priv) return -ENOMEM; @@ -191,16 +206,16 @@ static int rwdt_probe(struct platform_device *pdev) if (IS_ERR(priv->base)) return PTR_ERR(priv->base); - clk = devm_clk_get(&pdev->dev, NULL); + clk = devm_clk_get(dev, NULL); if (IS_ERR(clk)) return PTR_ERR(clk); - pm_runtime_enable(&pdev->dev); - pm_runtime_get_sync(&pdev->dev); + pm_runtime_enable(dev); + pm_runtime_get_sync(dev); 
priv->clk_rate = clk_get_rate(clk); priv->wdev.bootstatus = (readb_relaxed(priv->base + RWTCSRA) & RWTCSRA_WOVF) ? WDIOF_CARDRESET : 0; - pm_runtime_put(&pdev->dev); + pm_runtime_put(dev); if (!priv->clk_rate) { ret = -ENOENT; @@ -216,14 +231,14 @@ static int rwdt_probe(struct platform_device *pdev) } if (i < 0) { - dev_err(&pdev->dev, "Can't find suitable clock divider\n"); + dev_err(dev, "Can't find suitable clock divider\n"); ret = -ERANGE; goto out_pm_disable; } priv->wdev.info = &rwdt_ident; priv->wdev.ops = &rwdt_ops; - priv->wdev.parent = &pdev->dev; + priv->wdev.parent = dev; priv->wdev.min_timeout = 1; priv->wdev.max_timeout = DIV_BY_CLKS_PER_SEC(priv, 65536); priv->wdev.timeout = min(priv->wdev.max_timeout, RWDT_DEFAULT_TIMEOUT); @@ -235,7 +250,7 @@ static int rwdt_probe(struct platform_device *pdev) watchdog_stop_on_unregister(&priv->wdev); /* This overrides the default timeout only if DT configuration was found */ - watchdog_init_timeout(&priv->wdev, 0, &pdev->dev); + watchdog_init_timeout(&priv->wdev, 0, dev); ret = watchdog_register_device(&priv->wdev); if (ret < 0) @@ -244,7 +259,7 @@ static int rwdt_probe(struct platform_device *pdev) return 0; out_pm_disable: - pm_runtime_disable(&pdev->dev); + pm_runtime_disable(dev); return ret; } diff --git a/drivers/watchdog/retu_wdt.c b/drivers/watchdog/retu_wdt.c index 39cd51df2ffc..258dfcf9cbda 100644 --- a/drivers/watchdog/retu_wdt.c +++ b/drivers/watchdog/retu_wdt.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * Retu watchdog driver * @@ -5,15 +6,6 @@ * * Based on code written by Amit Kucheria and Michael Buesch. * Rewritten by Aaro Koskinen. - * - * This file is subject to the terms and conditions of the GNU General - * Public License. See the file "COPYING" in the main directory of this - * archive for more details. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. */ #include <linux/slab.h> diff --git a/drivers/watchdog/s3c2410_wdt.c b/drivers/watchdog/s3c2410_wdt.c index daf3bf0d86b8..2395f353e52d 100644 --- a/drivers/watchdog/s3c2410_wdt.c +++ b/drivers/watchdog/s3c2410_wdt.c @@ -606,10 +606,8 @@ static int s3c2410wdt_probe(struct platform_device *pdev) wdt->wdt_device.parent = dev; ret = watchdog_register_device(&wdt->wdt_device); - if (ret) { - dev_err(dev, "cannot register watchdog (%d)\n", ret); + if (ret) goto err_cpufreq; - } ret = s3c2410wdt_mask_and_disable_reset(wdt, false); if (ret < 0) diff --git a/drivers/watchdog/sa1100_wdt.c b/drivers/watchdog/sa1100_wdt.c index bfa035e1a75e..cbd8c957182f 100644 --- a/drivers/watchdog/sa1100_wdt.c +++ b/drivers/watchdog/sa1100_wdt.c @@ -1,14 +1,10 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * Watchdog driver for the SA11x0/PXA2xx * * (c) Copyright 2000 Oleg Drokin <green@crimea.edu> * Based on SoftDog driver by Alan Cox <alan@lxorguk.ukuu.org.uk> * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. - * * Neither Oleg Drokin nor iXcelerator.com admit liability nor provide * warranty for any of this software. This material is provided * "AS-IS" and at no charge. 
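The rwdt_wait_cycles() helper introduced in the renesas_wdt hunk above converts a count of watchdog-clock cycles into a microsecond sleep window for usleep_range(). A minimal userspace sketch of the same arithmetic, assuming an illustrative 32768 Hz clock (the patch reads the real rate from the clock framework):

#include <stdio.h>

/* same round-up division as the kernel's DIV_ROUND_UP() */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

int main(void)
{
	unsigned long clk_rate = 32768;	/* assumed watchdog clock rate, Hz */
	unsigned int cycles = 3;	/* longest wait the driver needs (stop path) */

	/* cycles / clk_rate seconds, rounded up to whole microseconds */
	unsigned long delay = DIV_ROUND_UP(cycles * 1000000UL, clk_rate);

	/* the driver then sleeps with usleep_range(delay, 2 * delay) */
	printf("%u cycles at %lu Hz -> between %lu and %lu us\n",
	       cycles, clk_rate, delay, 2 * delay);
	return 0;
}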
diff --git a/drivers/watchdog/sama5d4_wdt.c b/drivers/watchdog/sama5d4_wdt.c index b8da1bf21e12..d193a60430b2 100644 --- a/drivers/watchdog/sama5d4_wdt.c +++ b/drivers/watchdog/sama5d4_wdt.c @@ -110,9 +110,7 @@ static int sama5d4_wdt_set_timeout(struct watchdog_device *wdd, u32 value = WDT_SEC2TICKS(timeout); wdt->mr &= ~AT91_WDT_WDV; - wdt->mr &= ~AT91_WDT_WDD; wdt->mr |= AT91_WDT_SET_WDV(value); - wdt->mr |= AT91_WDT_SET_WDD(value); /* * WDDIS has to be 0 when updating WDD/WDV. The datasheet states: When @@ -248,7 +246,7 @@ static int sama5d4_wdt_probe(struct platform_device *pdev) timeout = WDT_SEC2TICKS(wdd->timeout); - wdt->mr |= AT91_WDT_SET_WDD(timeout); + wdt->mr |= AT91_WDT_SET_WDD(WDT_SEC2TICKS(MAX_WDT_TIMEOUT)); wdt->mr |= AT91_WDT_SET_WDV(timeout); ret = sama5d4_wdt_init(wdt); @@ -259,10 +257,8 @@ static int sama5d4_wdt_probe(struct platform_device *pdev) watchdog_stop_on_unregister(wdd); ret = devm_watchdog_register_device(dev, wdd); - if (ret) { - dev_err(dev, "failed to register watchdog device\n"); + if (ret) return ret; - } platform_set_drvdata(pdev, wdt); @@ -279,7 +275,17 @@ static const struct of_device_id sama5d4_wdt_of_match[] = { MODULE_DEVICE_TABLE(of, sama5d4_wdt_of_match); #ifdef CONFIG_PM_SLEEP -static int sama5d4_wdt_resume(struct device *dev) +static int sama5d4_wdt_suspend_late(struct device *dev) +{ + struct sama5d4_wdt *wdt = dev_get_drvdata(dev); + + if (watchdog_active(&wdt->wdd)) + sama5d4_wdt_stop(&wdt->wdd); + + return 0; +} + +static int sama5d4_wdt_resume_early(struct device *dev) { struct sama5d4_wdt *wdt = dev_get_drvdata(dev); @@ -290,12 +296,17 @@ static int sama5d4_wdt_resume(struct device *dev) */ sama5d4_wdt_init(wdt); + if (watchdog_active(&wdt->wdd)) + sama5d4_wdt_start(&wdt->wdd); + return 0; } #endif -static SIMPLE_DEV_PM_OPS(sama5d4_wdt_pm_ops, NULL, - sama5d4_wdt_resume); +static const struct dev_pm_ops sama5d4_wdt_pm_ops = { + SET_LATE_SYSTEM_SLEEP_PM_OPS(sama5d4_wdt_suspend_late, + sama5d4_wdt_resume_early) +}; static struct platform_driver sama5d4_wdt_driver = { .probe = sama5d4_wdt_probe, diff --git a/drivers/watchdog/sbc7240_wdt.c b/drivers/watchdog/sbc7240_wdt.c index efc81b318939..12cdee7d5069 100644 --- a/drivers/watchdog/sbc7240_wdt.c +++ b/drivers/watchdog/sbc7240_wdt.c @@ -1,19 +1,10 @@ +// SPDX-License-Identifier: GPL-2.0 /* * NANO7240 SBC Watchdog device driver * * Based on w83877f.c by Scott Jennings, * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License version 2 as - * published by the Free Software Foundation; - * - * Software distributed under the License is distributed on an "AS IS" - * basis, WITHOUT WARRANTY OF ANY KIND, either express or - * implied. See the License for the specific language governing - * rights and limitations under the License. - * * (c) Copyright 2007 Gilles GIGAN <gilles.gigan@jcu.edu.au> - * */ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt diff --git a/drivers/watchdog/sbc8360.c b/drivers/watchdog/sbc8360.c index 3396024e7b76..4f8b9912fc51 100644 --- a/drivers/watchdog/sbc8360.c +++ b/drivers/watchdog/sbc8360.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * SBC8360 Watchdog driver * @@ -19,11 +20,6 @@ * (c) Copyright 1996 Alan Cox <alan@lxorguk.ukuu.org.uk>, * All Rights Reserved. 
* - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. - * * Neither Alan Cox nor CymruNet Ltd. admit liability nor provide * warranty for any of this software. This material is provided * "AS-IS" and at no charge. diff --git a/drivers/watchdog/sch311x_wdt.c b/drivers/watchdog/sch311x_wdt.c index ed6e9fac5d74..3612f1df381b 100644 --- a/drivers/watchdog/sch311x_wdt.c +++ b/drivers/watchdog/sch311x_wdt.c @@ -1,14 +1,10 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * sch311x_wdt.c - Driver for the SCH311x Super-I/O chips * integrated watchdog. * * (c) Copyright 2008 Wim Van Sebroeck <wim@iguana.be>. * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. - * * Neither Wim Van Sebroeck nor Iguana vzw. admit liability nor * provide warranty for any of this software. This material is * provided "AS-IS" and at no charge. diff --git a/drivers/watchdog/softdog.c b/drivers/watchdog/softdog.c index 060740625485..3e4885c1545e 100644 --- a/drivers/watchdog/softdog.c +++ b/drivers/watchdog/softdog.c @@ -1,14 +1,10 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * SoftDog: A Software Watchdog Device * * (c) Copyright 1996 Alan Cox <alan@lxorguk.ukuu.org.uk>, * All Rights Reserved. * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. - * * Neither Alan Cox nor CymruNet Ltd. admit liability nor provide * warranty for any of this software. This material is provided * "AS-IS" and at no charge. diff --git a/drivers/watchdog/sp5100_tco.c b/drivers/watchdog/sp5100_tco.c index cd4430ff9b1c..93bd302ae7c5 100644 --- a/drivers/watchdog/sp5100_tco.c +++ b/drivers/watchdog/sp5100_tco.c @@ -402,10 +402,8 @@ static int sp5100_tco_probe(struct platform_device *pdev) return ret; ret = devm_watchdog_register_device(dev, wdd); - if (ret) { - dev_err(dev, "cannot register watchdog device (err=%d)\n", ret); + if (ret) return ret; - } /* Show module parameters */ dev_info(dev, "initialized. 
heartbeat=%d sec (nowayout=%d)\n", diff --git a/drivers/watchdog/sp805_wdt.c b/drivers/watchdog/sp805_wdt.c index 072986d461b7..53e04926a7b2 100644 --- a/drivers/watchdog/sp805_wdt.c +++ b/drivers/watchdog/sp805_wdt.c @@ -288,11 +288,8 @@ sp805_wdt_probe(struct amba_device *adev, const struct amba_id *id) } ret = watchdog_register_device(&wdt->wdd); - if (ret) { - dev_err(&adev->dev, "watchdog_register_device() failed: %d\n", - ret); + if (ret) goto err; - } amba_set_drvdata(adev, wdt); dev_info(&adev->dev, "registration successful\n"); diff --git a/drivers/watchdog/sprd_wdt.c b/drivers/watchdog/sprd_wdt.c index 916fb3f96bdc..edba4e278685 100644 --- a/drivers/watchdog/sprd_wdt.c +++ b/drivers/watchdog/sprd_wdt.c @@ -320,7 +320,6 @@ static int sprd_wdt_probe(struct platform_device *pdev) ret = devm_watchdog_register_device(dev, &wdt->wdd); if (ret) { sprd_wdt_disable(wdt); - dev_err(dev, "failed to register watchdog\n"); return ret; } platform_set_drvdata(pdev, wdt); diff --git a/drivers/watchdog/st_lpc_wdt.c b/drivers/watchdog/st_lpc_wdt.c index 7a90184eb950..14ab6559c748 100644 --- a/drivers/watchdog/st_lpc_wdt.c +++ b/drivers/watchdog/st_lpc_wdt.c @@ -228,10 +228,8 @@ static int st_wdog_probe(struct platform_device *pdev) return ret; ret = devm_watchdog_register_device(dev, &st_wdog_dev); - if (ret) { - dev_err(dev, "Unable to register watchdog\n"); + if (ret) return ret; - } st_wdog_setup(st_wdog, true); diff --git a/drivers/watchdog/stm32_iwdg.c b/drivers/watchdog/stm32_iwdg.c index d569a3634d9b..a3a329011a06 100644 --- a/drivers/watchdog/stm32_iwdg.c +++ b/drivers/watchdog/stm32_iwdg.c @@ -263,10 +263,8 @@ static int stm32_iwdg_probe(struct platform_device *pdev) watchdog_init_timeout(wdd, 0, dev); ret = devm_watchdog_register_device(dev, wdd); - if (ret) { - dev_err(dev, "failed to register watchdog device\n"); + if (ret) return ret; - } platform_set_drvdata(pdev, wdt); diff --git a/drivers/watchdog/stmp3xxx_rtc_wdt.c b/drivers/watchdog/stmp3xxx_rtc_wdt.c index 671f4ba7b4ed..7caf3aa71c6a 100644 --- a/drivers/watchdog/stmp3xxx_rtc_wdt.c +++ b/drivers/watchdog/stmp3xxx_rtc_wdt.c @@ -98,10 +98,8 @@ static int stmp3xxx_wdt_probe(struct platform_device *pdev) stmp3xxx_wdd.parent = dev; ret = devm_watchdog_register_device(dev, &stmp3xxx_wdd); - if (ret < 0) { - dev_err(dev, "cannot register watchdog device\n"); + if (ret < 0) return ret; - } if (register_reboot_notifier(&wdt_notifier)) dev_warn(dev, "cannot register reboot notifier\n"); diff --git a/drivers/watchdog/tegra_wdt.c b/drivers/watchdog/tegra_wdt.c index a58b000acc4f..dfe06e506cad 100644 --- a/drivers/watchdog/tegra_wdt.c +++ b/drivers/watchdog/tegra_wdt.c @@ -219,10 +219,8 @@ static int tegra_wdt_probe(struct platform_device *pdev) watchdog_stop_on_unregister(wdd); ret = devm_watchdog_register_device(dev, wdd); - if (ret) { - dev_err(dev, "failed to register watchdog device\n"); + if (ret) return ret; - } platform_set_drvdata(pdev, wdt); diff --git a/drivers/watchdog/ts4800_wdt.c b/drivers/watchdog/ts4800_wdt.c index 9dc6d7f45806..c137ad2bd5c3 100644 --- a/drivers/watchdog/ts4800_wdt.c +++ b/drivers/watchdog/ts4800_wdt.c @@ -171,10 +171,8 @@ static int ts4800_wdt_probe(struct platform_device *pdev) ts4800_wdt_stop(wdd); ret = devm_watchdog_register_device(dev, wdd); - if (ret) { - dev_err(dev, "failed to register watchdog device\n"); + if (ret) return ret; - } platform_set_drvdata(pdev, wdt); diff --git a/drivers/watchdog/w83627hf_wdt.c b/drivers/watchdog/w83627hf_wdt.c index 3a49ba9ea608..38b31e9947aa 100644 --- 
a/drivers/watchdog/w83627hf_wdt.c +++ b/drivers/watchdog/w83627hf_wdt.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * w83627hf/thf WDT driver * @@ -17,11 +18,6 @@ * (c) Copyright 1996 Alan Cox <alan@lxorguk.ukuu.org.uk>, * All Rights Reserved. * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. - * * Neither Alan Cox nor CymruNet Ltd. admit liability nor provide * warranty for any of this software. This material is provided * "AS-IS" and at no charge. diff --git a/drivers/watchdog/wafer5823wdt.c b/drivers/watchdog/wafer5823wdt.c index 0a8073b419f8..6d2071a0590d 100644 --- a/drivers/watchdog/wafer5823wdt.c +++ b/drivers/watchdog/wafer5823wdt.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * ICP Wafer 5823 Single Board Computer WDT driver * http://www.icpamerica.com/wafer_5823.php @@ -13,11 +14,6 @@ * (c) Copyright 1996-1997 Alan Cox <alan@lxorguk.ukuu.org.uk>, * All Rights Reserved. * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. - * * Neither Alan Cox nor CymruNet Ltd. admit liability nor provide * warranty for any of this software. This material is provided * "AS-IS" and at no charge. diff --git a/drivers/watchdog/watchdog_core.c b/drivers/watchdog/watchdog_core.c index 62be9e52a4de..21e8085b848b 100644 --- a/drivers/watchdog/watchdog_core.c +++ b/drivers/watchdog/watchdog_core.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * watchdog_core.c * @@ -16,11 +17,6 @@ * Satyam Sharma <satyam@infradead.org> * Randy Dunlap <randy.dunlap@oracle.com> * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. - * * Neither Alan Cox, CymruNet Ltd., Wim Van Sebroeck nor Iguana vzw. * admit liability nor provide warranty for any of this software. * This material is provided "AS-IS" and at no charge. @@ -60,11 +56,10 @@ static DEFINE_MUTEX(wtd_deferred_reg_mutex); static LIST_HEAD(wtd_deferred_reg_list); static bool wtd_deferred_reg_done; -static int watchdog_deferred_registration_add(struct watchdog_device *wdd) +static void watchdog_deferred_registration_add(struct watchdog_device *wdd) { list_add_tail(&wdd->deferred, &wtd_deferred_reg_list); - return 0; } static void watchdog_deferred_registration_del(struct watchdog_device *wdd) @@ -265,14 +260,23 @@ static int __watchdog_register_device(struct watchdog_device *wdd) int watchdog_register_device(struct watchdog_device *wdd) { - int ret; + const char *dev_str; + int ret = 0; mutex_lock(&wtd_deferred_reg_mutex); if (wtd_deferred_reg_done) ret = __watchdog_register_device(wdd); else - ret = watchdog_deferred_registration_add(wdd); + watchdog_deferred_registration_add(wdd); mutex_unlock(&wtd_deferred_reg_mutex); + + if (ret) { + dev_str = wdd->parent ? 
dev_name(wdd->parent) : + (const char *)wdd->info->identity; + pr_err("%s: failed to register watchdog device (err = %d)\n", + dev_str, ret); + } + return ret; } EXPORT_SYMBOL_GPL(watchdog_register_device); diff --git a/drivers/watchdog/watchdog_core.h b/drivers/watchdog/watchdog_core.h index 86ff962d1e15..a5062e8e0d13 100644 --- a/drivers/watchdog/watchdog_core.h +++ b/drivers/watchdog/watchdog_core.h @@ -1,3 +1,4 @@ +/* SPDX-License-Identifier: GPL-2.0+ */ /* * watchdog_core.h * @@ -16,11 +17,6 @@ * Satyam Sharma <satyam@infradead.org> * Randy Dunlap <randy.dunlap@oracle.com> * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. - * * Neither Alan Cox, CymruNet Ltd., Wim Van Sebroeck nor Iguana vzw. * admit liability nor provide warranty for any of this software. * This material is provided "AS-IS" and at no charge. diff --git a/drivers/watchdog/watchdog_dev.c b/drivers/watchdog/watchdog_dev.c index 252a7c7b6592..dbd2ad4c9294 100644 --- a/drivers/watchdog/watchdog_dev.c +++ b/drivers/watchdog/watchdog_dev.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * watchdog_dev.c * @@ -20,11 +21,6 @@ * Satyam Sharma <satyam@infradead.org> * Randy Dunlap <randy.dunlap@oracle.com> * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. - * * Neither Alan Cox, CymruNet Ltd., Wim Van Sebroeck nor Iguana vzw. * admit liability nor provide warranty for any of this software. * This material is provided "AS-IS" and at no charge. @@ -69,6 +65,7 @@ struct watchdog_core_data { struct mutex lock; ktime_t last_keepalive; ktime_t last_hw_keepalive; + ktime_t open_deadline; struct hrtimer timer; struct kthread_work work; unsigned long status; /* Internal status bits */ @@ -87,6 +84,19 @@ static struct kthread_worker *watchdog_kworker; static bool handle_boot_enabled = IS_ENABLED(CONFIG_WATCHDOG_HANDLE_BOOT_ENABLED); +static unsigned open_timeout = CONFIG_WATCHDOG_OPEN_TIMEOUT; + +static bool watchdog_past_open_deadline(struct watchdog_core_data *data) +{ + return ktime_after(ktime_get(), data->open_deadline); +} + +static void watchdog_set_open_deadline(struct watchdog_core_data *data) +{ + data->open_deadline = open_timeout ? 
+ ktime_get() + ktime_set(open_timeout, 0) : KTIME_MAX; +} + static inline bool watchdog_need_worker(struct watchdog_device *wdd) { /* All variables in milli-seconds */ @@ -119,14 +129,15 @@ static ktime_t watchdog_next_keepalive(struct watchdog_device *wdd) ktime_t virt_timeout; unsigned int hw_heartbeat_ms; - virt_timeout = ktime_add(wd_data->last_keepalive, - ms_to_ktime(timeout_ms)); + if (watchdog_active(wdd)) + virt_timeout = ktime_add(wd_data->last_keepalive, + ms_to_ktime(timeout_ms)); + else + virt_timeout = wd_data->open_deadline; + hw_heartbeat_ms = min_not_zero(timeout_ms, wdd->max_hw_heartbeat_ms); keepalive_interval = ms_to_ktime(hw_heartbeat_ms / 2); - if (!watchdog_active(wdd)) - return keepalive_interval; - /* * To ensure that the watchdog times out wdd->timeout seconds * after the most recent ping from userspace, the last @@ -211,7 +222,13 @@ static bool watchdog_worker_should_ping(struct watchdog_core_data *wd_data) { struct watchdog_device *wdd = wd_data->wdd; - return wdd && (watchdog_active(wdd) || watchdog_hw_running(wdd)); + if (!wdd) + return false; + + if (watchdog_active(wdd)) + return true; + + return watchdog_hw_running(wdd) && !watchdog_past_open_deadline(wd_data); } static void watchdog_ping_work(struct kthread_work *work) @@ -824,6 +841,15 @@ static int watchdog_open(struct inode *inode, struct file *file) if (!hw_running) kref_get(&wd_data->kref); + /* + * open_timeout only applies for the first open from + * userspace. Set open_deadline to infinity so that the kernel + * will take care of an always-running hardware watchdog in + * case the device gets magic-closed or WDIOS_DISABLECARD is + * applied. + */ + wd_data->open_deadline = KTIME_MAX; + /* dev/watchdog is a virtual (and thus non-seekable) filesystem */ return stream_open(inode, file); @@ -983,6 +1009,7 @@ static int watchdog_cdev_register(struct watchdog_device *wdd, dev_t devno) /* Record time of most recent heartbeat as 'just before now'. */ wd_data->last_hw_keepalive = ktime_sub(ktime_get(), 1); + watchdog_set_open_deadline(wd_data); /* * If the watchdog is running, prevent its driver from being unloaded, @@ -1181,3 +1208,8 @@ module_param(handle_boot_enabled, bool, 0444); MODULE_PARM_DESC(handle_boot_enabled, "Watchdog core auto-updates boot enabled watchdogs before userspace takes over (default=" __MODULE_STRING(IS_ENABLED(CONFIG_WATCHDOG_HANDLE_BOOT_ENABLED)) ")"); + +module_param(open_timeout, uint, 0644); +MODULE_PARM_DESC(open_timeout, + "Maximum time (in seconds, 0 means infinity) for userspace to take over a running watchdog (default=" + __MODULE_STRING(CONFIG_WATCHDOG_OPEN_TIMEOUT) ")"); diff --git a/drivers/watchdog/wd501p.h b/drivers/watchdog/wd501p.h index 0e3a497d5626..43a4d88fd363 100644 --- a/drivers/watchdog/wd501p.h +++ b/drivers/watchdog/wd501p.h @@ -1,3 +1,4 @@ +/* SPDX-License-Identifier: GPL-1.0+ */ /* * Industrial Computer Source WDT500/501 driver * @@ -11,12 +12,7 @@ * * http://www.cymru.net * - * This driver is provided under the GNU General Public License, - * incorporated herein by reference. The driver is provided without - * warranty or support. - * * Release 0.04. - * */ diff --git a/drivers/watchdog/wdt.c b/drivers/watchdog/wdt.c index 3d2f5ed60e88..0650100fad00 100644 --- a/drivers/watchdog/wdt.c +++ b/drivers/watchdog/wdt.c @@ -1,14 +1,10 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * Industrial Computer Source WDT501 driver * * (c) Copyright 1996-1997 Alan Cox <alan@lxorguk.ukuu.org.uk>, * All Rights Reserved. 
* - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. - * * Neither Alan Cox nor CymruNet Ltd. admit liability nor provide * warranty for any of this software. This material is provided * "AS-IS" and at no charge. diff --git a/drivers/watchdog/wdt_pci.c b/drivers/watchdog/wdt_pci.c index ff3a41f47127..66303ab95685 100644 --- a/drivers/watchdog/wdt_pci.c +++ b/drivers/watchdog/wdt_pci.c @@ -1,14 +1,10 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * Industrial Computer Source PCI-WDT500/501 driver * * (c) Copyright 1996-1997 Alan Cox <alan@lxorguk.ukuu.org.uk>, * All Rights Reserved. * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. - * * Neither Alan Cox nor CymruNet Ltd. admit liability nor provide * warranty for any of this software. This material is provided * "AS-IS" and at no charge. diff --git a/drivers/watchdog/wm831x_wdt.c b/drivers/watchdog/wm831x_wdt.c index 9b6565a3fab4..030ce240620d 100644 --- a/drivers/watchdog/wm831x_wdt.c +++ b/drivers/watchdog/wm831x_wdt.c @@ -267,14 +267,7 @@ static int wm831x_wdt_probe(struct platform_device *pdev) } } - ret = devm_watchdog_register_device(dev, &driver_data->wdt); - if (ret != 0) { - dev_err(wm831x->dev, "watchdog_register_device() failed: %d\n", - ret); - return ret; - } - - return 0; + return devm_watchdog_register_device(dev, &driver_data->wdt); } static struct platform_driver wm831x_wdt_driver = { diff --git a/drivers/watchdog/xen_wdt.c b/drivers/watchdog/xen_wdt.c index 2ba0a3c4523c..b343f421dc72 100644 --- a/drivers/watchdog/xen_wdt.c +++ b/drivers/watchdog/xen_wdt.c @@ -138,10 +138,8 @@ static int xen_wdt_probe(struct platform_device *pdev) watchdog_stop_on_unregister(&xen_wdt_dev); ret = devm_watchdog_register_device(dev, &xen_wdt_dev); - if (ret) { - dev_err(dev, "cannot register watchdog device (%d)\n", ret); + if (ret) return ret; - } dev_info(dev, "initialized (timeout=%ds, nowayout=%d)\n", xen_wdt_dev.timeout, nowayout); diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c index d53f3493a6b9..cfbe46785a3b 100644 --- a/drivers/xen/swiotlb-xen.c +++ b/drivers/xen/swiotlb-xen.c @@ -402,7 +402,7 @@ static dma_addr_t xen_swiotlb_map_page(struct device *dev, struct page *page, map = swiotlb_tbl_map_single(dev, start_dma_addr, phys, size, dir, attrs); - if (map == DMA_MAPPING_ERROR) + if (map == (phys_addr_t)DMA_MAPPING_ERROR) return DMA_MAPPING_ERROR; dev_addr = xen_phys_to_bus(map); diff --git a/fs/ceph/Kconfig b/fs/ceph/Kconfig index 7f7d92d6b024..cf235f6eacf9 100644 --- a/fs/ceph/Kconfig +++ b/fs/ceph/Kconfig @@ -36,3 +36,15 @@ config CEPH_FS_POSIX_ACL groups beyond the owner/group/world scheme. If you don't know what Access Control Lists are, say N + +config CEPH_FS_SECURITY_LABEL + bool "CephFS Security Labels" + depends on CEPH_FS && SECURITY + help + Security labels support alternative access control models + implemented by security modules like SELinux. This option + enables an extended attribute handler for file security + labels in the Ceph filesystem. + + If you are not using a security module that requires using + extended attributes for file security labels, say N. 
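Once CEPH_FS_SECURITY_LABEL is enabled, file security labels on CephFS surface through the ordinary extended-attribute interface, just as on local filesystems. A minimal sketch of reading such a label from userspace; the attribute name "security.selinux" assumes SELinux is the active security module:

#include <stdio.h>
#include <sys/xattr.h>

int main(int argc, char **argv)
{
	char label[256];
	ssize_t len;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <path-on-cephfs>\n", argv[0]);
		return 1;
	}

	/* security labels are plain xattrs in the "security" namespace */
	len = getxattr(argv[1], "security.selinux", label, sizeof(label) - 1);
	if (len < 0) {
		perror("getxattr");
		return 1;
	}
	label[len] = '\0';
	printf("%s: %s\n", argv[1], label);
	return 0;
}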
diff --git a/fs/ceph/acl.c b/fs/ceph/acl.c index 8a19c249036c..aa55f412a6e3 100644 --- a/fs/ceph/acl.c +++ b/fs/ceph/acl.c @@ -159,7 +159,7 @@ out: } int ceph_pre_init_acls(struct inode *dir, umode_t *mode, - struct ceph_acls_info *info) + struct ceph_acl_sec_ctx *as_ctx) { struct posix_acl *acl, *default_acl; size_t val_size1 = 0, val_size2 = 0; @@ -234,9 +234,9 @@ int ceph_pre_init_acls(struct inode *dir, umode_t *mode, kfree(tmp_buf); - info->acl = acl; - info->default_acl = default_acl; - info->pagelist = pagelist; + as_ctx->acl = acl; + as_ctx->default_acl = default_acl; + as_ctx->pagelist = pagelist; return 0; out_err: @@ -248,18 +248,10 @@ out_err: return err; } -void ceph_init_inode_acls(struct inode* inode, struct ceph_acls_info *info) +void ceph_init_inode_acls(struct inode *inode, struct ceph_acl_sec_ctx *as_ctx) { if (!inode) return; - ceph_set_cached_acl(inode, ACL_TYPE_ACCESS, info->acl); - ceph_set_cached_acl(inode, ACL_TYPE_DEFAULT, info->default_acl); -} - -void ceph_release_acls_info(struct ceph_acls_info *info) -{ - posix_acl_release(info->acl); - posix_acl_release(info->default_acl); - if (info->pagelist) - ceph_pagelist_release(info->pagelist); + ceph_set_cached_acl(inode, ACL_TYPE_ACCESS, as_ctx->acl); + ceph_set_cached_acl(inode, ACL_TYPE_DEFAULT, as_ctx->default_acl); } diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c index a47c541f8006..e078cc55b989 100644 --- a/fs/ceph/addr.c +++ b/fs/ceph/addr.c @@ -10,6 +10,7 @@ #include <linux/pagevec.h> #include <linux/task_io_accounting_ops.h> #include <linux/signal.h> +#include <linux/iversion.h> #include "super.h" #include "mds_client.h" @@ -1576,6 +1577,7 @@ static vm_fault_t ceph_page_mkwrite(struct vm_fault *vmf) /* Update time before taking page lock */ file_update_time(vma->vm_file); + inode_inc_iversion_raw(inode); do { lock_page(page); diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c index 0176241eaea7..d98dcd976c80 100644 --- a/fs/ceph/caps.c +++ b/fs/ceph/caps.c @@ -8,6 +8,7 @@ #include <linux/vmalloc.h> #include <linux/wait.h> #include <linux/writeback.h> +#include <linux/iversion.h> #include "super.h" #include "mds_client.h" @@ -1138,8 +1139,9 @@ struct cap_msg_args { u64 ino, cid, follows; u64 flush_tid, oldest_flush_tid, size, max_size; u64 xattr_version; + u64 change_attr; struct ceph_buffer *xattr_buf; - struct timespec64 atime, mtime, ctime; + struct timespec64 atime, mtime, ctime, btime; int op, caps, wanted, dirty; u32 seq, issue_seq, mseq, time_warp_seq; u32 flags; @@ -1160,7 +1162,6 @@ static int send_cap_msg(struct cap_msg_args *arg) struct ceph_msg *msg; void *p; size_t extra_len; - struct timespec64 zerotime = {0}; struct ceph_osd_client *osdc = &arg->session->s_mdsc->fsc->client->osdc; dout("send_cap_msg %s %llx %llx caps %s wanted %s dirty %s" @@ -1245,15 +1246,10 @@ static int send_cap_msg(struct cap_msg_args *arg) /* pool namespace (version 8) (mds always ignores this) */ ceph_encode_32(&p, 0); - /* - * btime and change_attr (version 9) - * - * We just zero these out for now, as the MDS ignores them unless - * the requisite feature flags are set (which we don't do yet). - */ - ceph_encode_timespec64(p, &zerotime); + /* btime and change_attr (version 9) */ + ceph_encode_timespec64(p, &arg->btime); p += sizeof(struct ceph_timespec); - ceph_encode_64(&p, 0); + ceph_encode_64(&p, arg->change_attr); /* Advisory flags (version 10) */ ceph_encode_32(&p, arg->flags); @@ -1263,20 +1259,22 @@ static int send_cap_msg(struct cap_msg_args *arg) } /* - * Queue cap releases when an inode is dropped from our cache. 
Since - * inode is about to be destroyed, there is no need for i_ceph_lock. + * Queue cap releases when an inode is dropped from our cache. */ -void __ceph_remove_caps(struct inode *inode) +void __ceph_remove_caps(struct ceph_inode_info *ci) { - struct ceph_inode_info *ci = ceph_inode(inode); struct rb_node *p; + /* lock i_ceph_lock, because ceph_d_revalidate(..., LOOKUP_RCU) + * may call __ceph_caps_issued_mask() on a freeing inode. */ + spin_lock(&ci->i_ceph_lock); p = rb_first(&ci->i_caps); while (p) { struct ceph_cap *cap = rb_entry(p, struct ceph_cap, ci_node); p = rb_next(p); __ceph_remove_cap(cap, true); } + spin_unlock(&ci->i_ceph_lock); } /* @@ -1297,7 +1295,7 @@ void __ceph_remove_caps(struct inode *inode) * caller should hold snap_rwsem (read), s_mutex. */ static int __send_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap, - int op, bool sync, int used, int want, int retain, + int op, int flags, int used, int want, int retain, int flushing, u64 flush_tid, u64 oldest_flush_tid) __releases(cap->ci->i_ceph_lock) { @@ -1377,6 +1375,8 @@ static int __send_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap, arg.mtime = inode->i_mtime; arg.atime = inode->i_atime; arg.ctime = inode->i_ctime; + arg.btime = ci->i_btime; + arg.change_attr = inode_peek_iversion_raw(inode); arg.op = op; arg.caps = cap->implemented; @@ -1393,12 +1393,19 @@ static int __send_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap, arg.mode = inode->i_mode; arg.inline_data = ci->i_inline_version != CEPH_INLINE_NONE; - if (list_empty(&ci->i_cap_snaps)) - arg.flags = CEPH_CLIENT_CAPS_NO_CAPSNAP; - else - arg.flags = CEPH_CLIENT_CAPS_PENDING_CAPSNAP; - if (sync) - arg.flags |= CEPH_CLIENT_CAPS_SYNC; + if (!(flags & CEPH_CLIENT_CAPS_PENDING_CAPSNAP) && + !list_empty(&ci->i_cap_snaps)) { + struct ceph_cap_snap *capsnap; + list_for_each_entry_reverse(capsnap, &ci->i_cap_snaps, ci_item) { + if (capsnap->cap_flush.tid) + break; + if (capsnap->need_flush) { + flags |= CEPH_CLIENT_CAPS_PENDING_CAPSNAP; + break; + } + } + } + arg.flags = flags; spin_unlock(&ci->i_ceph_lock); @@ -1436,6 +1443,8 @@ static inline int __send_flush_snap(struct inode *inode, arg.atime = capsnap->atime; arg.mtime = capsnap->mtime; arg.ctime = capsnap->ctime; + arg.btime = capsnap->btime; + arg.change_attr = capsnap->change_attr; arg.op = CEPH_CAP_OP_FLUSHSNAP; arg.caps = capsnap->issued; @@ -1603,10 +1612,8 @@ retry: } // make sure flushsnap messages are sent in proper order. - if (ci->i_ceph_flags & CEPH_I_KICK_FLUSH) { + if (ci->i_ceph_flags & CEPH_I_KICK_FLUSH) __kick_flushing_caps(mdsc, session, ci, 0); - ci->i_ceph_flags &= ~CEPH_I_KICK_FLUSH; - } __ceph_flush_snaps(ci, session); out: @@ -2048,10 +2055,8 @@ ack: if (cap == ci->i_auth_cap && (ci->i_ceph_flags & (CEPH_I_KICK_FLUSH | CEPH_I_FLUSH_SNAPS))) { - if (ci->i_ceph_flags & CEPH_I_KICK_FLUSH) { + if (ci->i_ceph_flags & CEPH_I_KICK_FLUSH) __kick_flushing_caps(mdsc, session, ci, 0); - ci->i_ceph_flags &= ~CEPH_I_KICK_FLUSH; - } if (ci->i_ceph_flags & CEPH_I_FLUSH_SNAPS) __ceph_flush_snaps(ci, session); @@ -2087,7 +2092,7 @@ ack: sent++; /* __send_cap drops i_ceph_lock */ - delayed += __send_cap(mdsc, cap, CEPH_CAP_OP_UPDATE, false, + delayed += __send_cap(mdsc, cap, CEPH_CAP_OP_UPDATE, 0, cap_used, want, retain, flushing, flush_tid, oldest_flush_tid); goto retry; /* retake i_ceph_lock and restart our cap scan. 
*/ @@ -2121,6 +2126,7 @@ static int try_flush_caps(struct inode *inode, u64 *ptid) retry: spin_lock(&ci->i_ceph_lock); +retry_locked: if (ci->i_ceph_flags & CEPH_I_NOFLUSH) { spin_unlock(&ci->i_ceph_lock); dout("try_flush_caps skipping %p I_NOFLUSH set\n", inode); @@ -2128,8 +2134,6 @@ retry: } if (ci->i_dirty_caps && ci->i_auth_cap) { struct ceph_cap *cap = ci->i_auth_cap; - int used = __ceph_caps_used(ci); - int want = __ceph_caps_wanted(ci); int delayed; if (!session || session != cap->session) { @@ -2145,13 +2149,25 @@ retry: goto out; } + if (ci->i_ceph_flags & + (CEPH_I_KICK_FLUSH | CEPH_I_FLUSH_SNAPS)) { + if (ci->i_ceph_flags & CEPH_I_KICK_FLUSH) + __kick_flushing_caps(mdsc, session, ci, 0); + if (ci->i_ceph_flags & CEPH_I_FLUSH_SNAPS) + __ceph_flush_snaps(ci, session); + goto retry_locked; + } + flushing = __mark_caps_flushing(inode, session, true, &flush_tid, &oldest_flush_tid); /* __send_cap drops i_ceph_lock */ - delayed = __send_cap(mdsc, cap, CEPH_CAP_OP_FLUSH, true, - used, want, (cap->issued | cap->implemented), - flushing, flush_tid, oldest_flush_tid); + delayed = __send_cap(mdsc, cap, CEPH_CAP_OP_FLUSH, + CEPH_CLIENT_CAPS_SYNC, + __ceph_caps_used(ci), + __ceph_caps_wanted(ci), + (cap->issued | cap->implemented), + flushing, flush_tid, oldest_flush_tid); if (delayed) { spin_lock(&ci->i_ceph_lock); @@ -2320,6 +2336,16 @@ static void __kick_flushing_caps(struct ceph_mds_client *mdsc, struct ceph_cap_flush *cf; int ret; u64 first_tid = 0; + u64 last_snap_flush = 0; + + ci->i_ceph_flags &= ~CEPH_I_KICK_FLUSH; + + list_for_each_entry_reverse(cf, &ci->i_cap_flush_list, i_list) { + if (!cf->caps) { + last_snap_flush = cf->tid; + break; + } + } list_for_each_entry(cf, &ci->i_cap_flush_list, i_list) { if (cf->tid < first_tid) @@ -2338,10 +2364,13 @@ static void __kick_flushing_caps(struct ceph_mds_client *mdsc, dout("kick_flushing_caps %p cap %p tid %llu %s\n", inode, cap, cf->tid, ceph_cap_string(cf->caps)); ci->i_ceph_flags |= CEPH_I_NODELAY; + ret = __send_cap(mdsc, cap, CEPH_CAP_OP_FLUSH, - false, __ceph_caps_used(ci), + (cf->tid < last_snap_flush ? + CEPH_CLIENT_CAPS_PENDING_CAPSNAP : 0), + __ceph_caps_used(ci), __ceph_caps_wanted(ci), - cap->issued | cap->implemented, + (cap->issued | cap->implemented), cf->caps, cf->tid, oldest_flush_tid); if (ret) { pr_err("kick_flushing_caps: error sending " @@ -2410,7 +2439,6 @@ void ceph_early_kick_flushing_caps(struct ceph_mds_client *mdsc, */ if ((cap->issued & ci->i_flushing_caps) != ci->i_flushing_caps) { - ci->i_ceph_flags &= ~CEPH_I_KICK_FLUSH; /* encode_caps_cb() also will reset these sequence * numbers. 
make sure sequence numbers in cap flush * message match later reconnect message */ @@ -2450,7 +2478,6 @@ void ceph_kick_flushing_caps(struct ceph_mds_client *mdsc, continue; } if (ci->i_ceph_flags & CEPH_I_KICK_FLUSH) { - ci->i_ceph_flags &= ~CEPH_I_KICK_FLUSH; __kick_flushing_caps(mdsc, session, ci, oldest_flush_tid); } @@ -2478,7 +2505,6 @@ static void kick_flushing_inode_caps(struct ceph_mds_client *mdsc, oldest_flush_tid = __get_oldest_flush_tid(mdsc); spin_unlock(&mdsc->cap_dirty_lock); - ci->i_ceph_flags &= ~CEPH_I_KICK_FLUSH; __kick_flushing_caps(mdsc, session, ci, oldest_flush_tid); spin_unlock(&ci->i_ceph_lock); } else { @@ -3040,8 +3066,10 @@ struct cap_extra_info { bool dirstat_valid; u64 nfiles; u64 nsubdirs; + u64 change_attr; /* currently issued */ int issued; + struct timespec64 btime; }; /* @@ -3123,11 +3151,14 @@ static void handle_cap_grant(struct inode *inode, __check_cap_issue(ci, cap, newcaps); + inode_set_max_iversion_raw(inode, extra_info->change_attr); + if ((newcaps & CEPH_CAP_AUTH_SHARED) && (extra_info->issued & CEPH_CAP_AUTH_EXCL) == 0) { inode->i_mode = le32_to_cpu(grant->mode); inode->i_uid = make_kuid(&init_user_ns, le32_to_cpu(grant->uid)); inode->i_gid = make_kgid(&init_user_ns, le32_to_cpu(grant->gid)); + ci->i_btime = extra_info->btime; dout("%p mode 0%o uid.gid %d.%d\n", inode, inode->i_mode, from_kuid(&init_user_ns, inode->i_uid), from_kgid(&init_user_ns, inode->i_gid)); @@ -3154,6 +3185,7 @@ static void handle_cap_grant(struct inode *inode, ci->i_xattrs.blob = ceph_buffer_get(xattr_buf); ci->i_xattrs.version = version; ceph_forget_all_cached_acls(inode); + ceph_security_invalidate_secctx(inode); } } @@ -3848,17 +3880,19 @@ void ceph_handle_caps(struct ceph_mds_session *session, } } - if (msg_version >= 11) { + if (msg_version >= 9) { struct ceph_timespec *btime; - u64 change_attr; - u32 flags; - /* version >= 9 */ if (p + sizeof(*btime) > end) goto bad; btime = p; + ceph_decode_timespec64(&extra_info.btime, btime); p += sizeof(*btime); - ceph_decode_64_safe(&p, end, change_attr, bad); + ceph_decode_64_safe(&p, end, extra_info.change_attr, bad); + } + + if (msg_version >= 11) { + u32 flags; /* version >= 10 */ ceph_decode_32_safe(&p, end, flags, bad); /* version >= 11 */ diff --git a/fs/ceph/debugfs.c b/fs/ceph/debugfs.c index 83cd41fa2b01..2eb88ed22993 100644 --- a/fs/ceph/debugfs.c +++ b/fs/ceph/debugfs.c @@ -52,7 +52,7 @@ static int mdsc_show(struct seq_file *s, void *p) struct ceph_mds_client *mdsc = fsc->mdsc; struct ceph_mds_request *req; struct rb_node *rp; - int pathlen; + int pathlen = 0; u64 pathbase; char *path; diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c index 0637149fb9f9..aab29f48c62d 100644 --- a/fs/ceph/dir.c +++ b/fs/ceph/dir.c @@ -825,7 +825,7 @@ static int ceph_mknod(struct inode *dir, struct dentry *dentry, struct ceph_fs_client *fsc = ceph_sb_to_client(dir->i_sb); struct ceph_mds_client *mdsc = fsc->mdsc; struct ceph_mds_request *req; - struct ceph_acls_info acls = {}; + struct ceph_acl_sec_ctx as_ctx = {}; int err; if (ceph_snap(dir) != CEPH_NOSNAP) @@ -836,7 +836,10 @@ static int ceph_mknod(struct inode *dir, struct dentry *dentry, goto out; } - err = ceph_pre_init_acls(dir, &mode, &acls); + err = ceph_pre_init_acls(dir, &mode, &as_ctx); + if (err < 0) + goto out; + err = ceph_security_init_secctx(dentry, mode, &as_ctx); if (err < 0) goto out; @@ -855,9 +858,9 @@ static int ceph_mknod(struct inode *dir, struct dentry *dentry, req->r_args.mknod.rdev = cpu_to_le32(rdev); req->r_dentry_drop = CEPH_CAP_FILE_SHARED | CEPH_CAP_AUTH_EXCL; 
req->r_dentry_unless = CEPH_CAP_FILE_EXCL; - if (acls.pagelist) { - req->r_pagelist = acls.pagelist; - acls.pagelist = NULL; + if (as_ctx.pagelist) { + req->r_pagelist = as_ctx.pagelist; + as_ctx.pagelist = NULL; } err = ceph_mdsc_do_request(mdsc, dir, req); if (!err && !req->r_reply_info.head->is_dentry) @@ -865,10 +868,10 @@ static int ceph_mknod(struct inode *dir, struct dentry *dentry, ceph_mdsc_put_request(req); out: if (!err) - ceph_init_inode_acls(d_inode(dentry), &acls); + ceph_init_inode_acls(d_inode(dentry), &as_ctx); else d_drop(dentry); - ceph_release_acls_info(&acls); + ceph_release_acl_sec_ctx(&as_ctx); return err; } @@ -884,6 +887,7 @@ static int ceph_symlink(struct inode *dir, struct dentry *dentry, struct ceph_fs_client *fsc = ceph_sb_to_client(dir->i_sb); struct ceph_mds_client *mdsc = fsc->mdsc; struct ceph_mds_request *req; + struct ceph_acl_sec_ctx as_ctx = {}; int err; if (ceph_snap(dir) != CEPH_NOSNAP) @@ -894,6 +898,10 @@ static int ceph_symlink(struct inode *dir, struct dentry *dentry, goto out; } + err = ceph_security_init_secctx(dentry, S_IFLNK | 0777, &as_ctx); + if (err < 0) + goto out; + dout("symlink in dir %p dentry %p to '%s'\n", dir, dentry, dest); req = ceph_mdsc_create_request(mdsc, CEPH_MDS_OP_SYMLINK, USE_AUTH_MDS); if (IS_ERR(req)) { @@ -919,6 +927,7 @@ static int ceph_symlink(struct inode *dir, struct dentry *dentry, out: if (err) d_drop(dentry); + ceph_release_acl_sec_ctx(&as_ctx); return err; } @@ -927,7 +936,7 @@ static int ceph_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode) struct ceph_fs_client *fsc = ceph_sb_to_client(dir->i_sb); struct ceph_mds_client *mdsc = fsc->mdsc; struct ceph_mds_request *req; - struct ceph_acls_info acls = {}; + struct ceph_acl_sec_ctx as_ctx = {}; int err = -EROFS; int op; @@ -950,7 +959,10 @@ static int ceph_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode) } mode |= S_IFDIR; - err = ceph_pre_init_acls(dir, &mode, &acls); + err = ceph_pre_init_acls(dir, &mode, &as_ctx); + if (err < 0) + goto out; + err = ceph_security_init_secctx(dentry, mode, &as_ctx); if (err < 0) goto out; @@ -967,9 +979,9 @@ static int ceph_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode) req->r_args.mkdir.mode = cpu_to_le32(mode); req->r_dentry_drop = CEPH_CAP_FILE_SHARED | CEPH_CAP_AUTH_EXCL; req->r_dentry_unless = CEPH_CAP_FILE_EXCL; - if (acls.pagelist) { - req->r_pagelist = acls.pagelist; - acls.pagelist = NULL; + if (as_ctx.pagelist) { + req->r_pagelist = as_ctx.pagelist; + as_ctx.pagelist = NULL; } err = ceph_mdsc_do_request(mdsc, dir, req); if (!err && @@ -979,10 +991,10 @@ static int ceph_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode) ceph_mdsc_put_request(req); out: if (!err) - ceph_init_inode_acls(d_inode(dentry), &acls); + ceph_init_inode_acls(d_inode(dentry), &as_ctx); else d_drop(dentry); - ceph_release_acls_info(&acls); + ceph_release_acl_sec_ctx(&as_ctx); return err; } @@ -1433,8 +1445,7 @@ static bool __dentry_lease_is_valid(struct ceph_dentry_info *di) return false; } -static int dentry_lease_is_valid(struct dentry *dentry, unsigned int flags, - struct inode *dir) +static int dentry_lease_is_valid(struct dentry *dentry, unsigned int flags) { struct ceph_dentry_info *di; struct ceph_mds_session *session = NULL; @@ -1466,7 +1477,7 @@ static int dentry_lease_is_valid(struct dentry *dentry, unsigned int flags, spin_unlock(&dentry->d_lock); if (session) { - ceph_mdsc_lease_send_msg(session, dir, dentry, + ceph_mdsc_lease_send_msg(session, dentry, CEPH_MDS_LEASE_RENEW, seq); 
ceph_put_mds_session(session); } @@ -1512,18 +1523,26 @@ static int __dir_lease_try_check(const struct dentry *dentry) static int dir_lease_is_valid(struct inode *dir, struct dentry *dentry) { struct ceph_inode_info *ci = ceph_inode(dir); - struct ceph_dentry_info *di = ceph_dentry(dentry); - int valid = 0; + int valid; + int shared_gen; spin_lock(&ci->i_ceph_lock); - if (atomic_read(&ci->i_shared_gen) == di->lease_shared_gen) - valid = __ceph_caps_issued_mask(ci, CEPH_CAP_FILE_SHARED, 1); + valid = __ceph_caps_issued_mask(ci, CEPH_CAP_FILE_SHARED, 1); + shared_gen = atomic_read(&ci->i_shared_gen); spin_unlock(&ci->i_ceph_lock); - if (valid) - __ceph_dentry_dir_lease_touch(di); - dout("dir_lease_is_valid dir %p v%u dentry %p v%u = %d\n", - dir, (unsigned)atomic_read(&ci->i_shared_gen), - dentry, (unsigned)di->lease_shared_gen, valid); + if (valid) { + struct ceph_dentry_info *di; + spin_lock(&dentry->d_lock); + di = ceph_dentry(dentry); + if (dir == d_inode(dentry->d_parent) && + di && di->lease_shared_gen == shared_gen) + __ceph_dentry_dir_lease_touch(di); + else + valid = 0; + spin_unlock(&dentry->d_lock); + } + dout("dir_lease_is_valid dir %p v%u dentry %p = %d\n", + dir, (unsigned)atomic_read(&ci->i_shared_gen), dentry, valid); return valid; } @@ -1558,7 +1577,7 @@ static int ceph_d_revalidate(struct dentry *dentry, unsigned int flags) ceph_snap(d_inode(dentry)) == CEPH_SNAPDIR) { valid = 1; } else { - valid = dentry_lease_is_valid(dentry, flags, dir); + valid = dentry_lease_is_valid(dentry, flags); if (valid == -ECHILD) return valid; if (valid || dir_lease_is_valid(dir, dentry)) { diff --git a/fs/ceph/export.c b/fs/ceph/export.c index d3ef7ee429ec..15ff1b09cfa2 100644 --- a/fs/ceph/export.c +++ b/fs/ceph/export.c @@ -368,7 +368,7 @@ static struct dentry *ceph_get_parent(struct dentry *child) } out: dout("get_parent %p ino %llx.%llx err=%ld\n", - child, ceph_vinop(inode), (IS_ERR(dn) ? 
PTR_ERR(dn) : 0)); + child, ceph_vinop(inode), (long)PTR_ERR_OR_ZERO(dn)); return dn; } diff --git a/fs/ceph/file.c b/fs/ceph/file.c index c5517ffeb11c..685a03cc4b77 100644 --- a/fs/ceph/file.c +++ b/fs/ceph/file.c @@ -10,6 +10,7 @@ #include <linux/namei.h> #include <linux/writeback.h> #include <linux/falloc.h> +#include <linux/iversion.h> #include "super.h" #include "mds_client.h" @@ -437,7 +438,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry, struct ceph_mds_client *mdsc = fsc->mdsc; struct ceph_mds_request *req; struct dentry *dn; - struct ceph_acls_info acls = {}; + struct ceph_acl_sec_ctx as_ctx = {}; int mask; int err; @@ -451,25 +452,28 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry, if (flags & O_CREAT) { if (ceph_quota_is_max_files_exceeded(dir)) return -EDQUOT; - err = ceph_pre_init_acls(dir, &mode, &acls); + err = ceph_pre_init_acls(dir, &mode, &as_ctx); if (err < 0) return err; + err = ceph_security_init_secctx(dentry, mode, &as_ctx); + if (err < 0) + goto out_ctx; } /* do the open */ req = prepare_open_request(dir->i_sb, flags, mode); if (IS_ERR(req)) { err = PTR_ERR(req); - goto out_acl; + goto out_ctx; } req->r_dentry = dget(dentry); req->r_num_caps = 2; if (flags & O_CREAT) { req->r_dentry_drop = CEPH_CAP_FILE_SHARED | CEPH_CAP_AUTH_EXCL; req->r_dentry_unless = CEPH_CAP_FILE_EXCL; - if (acls.pagelist) { - req->r_pagelist = acls.pagelist; - acls.pagelist = NULL; + if (as_ctx.pagelist) { + req->r_pagelist = as_ctx.pagelist; + as_ctx.pagelist = NULL; } } @@ -507,7 +511,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry, } else { dout("atomic_open finish_open on dn %p\n", dn); if (req->r_op == CEPH_MDS_OP_CREATE && req->r_reply_info.has_create_ino) { - ceph_init_inode_acls(d_inode(dentry), &acls); + ceph_init_inode_acls(d_inode(dentry), &as_ctx); file->f_mode |= FMODE_CREATED; } err = finish_open(file, dentry, ceph_open); @@ -516,8 +520,8 @@ out_req: if (!req->r_err && req->r_target_inode) ceph_put_fmode(ceph_inode(req->r_target_inode), req->r_fmode); ceph_mdsc_put_request(req); -out_acl: - ceph_release_acls_info(&acls); +out_ctx: + ceph_release_acl_sec_ctx(&as_ctx); dout("atomic_open result=%d\n", err); return err; } @@ -1007,7 +1011,7 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter, * may block. 
*/ truncate_inode_pages_range(inode->i_mapping, pos, - (pos+len) | (PAGE_SIZE - 1)); + PAGE_ALIGN(pos + len) - 1); req->r_mtime = mtime; } @@ -1022,7 +1026,7 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter, req->r_callback = ceph_aio_complete_req; req->r_inode = inode; req->r_priv = aio_req; - list_add_tail(&req->r_unsafe_item, &aio_req->osd_reqs); + list_add_tail(&req->r_private_item, &aio_req->osd_reqs); pos += len; continue; @@ -1082,8 +1086,8 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter, while (!list_empty(&osd_reqs)) { req = list_first_entry(&osd_reqs, struct ceph_osd_request, - r_unsafe_item); - list_del_init(&req->r_unsafe_item); + r_private_item); + list_del_init(&req->r_private_item); if (ret >= 0) ret = ceph_osdc_start_request(req->r_osdc, req, false); @@ -1432,6 +1436,8 @@ retry_snap: if (err) goto out; + inode_inc_iversion_raw(inode); + if (ci->i_inline_version != CEPH_INLINE_NONE) { err = ceph_uninline_data(file, NULL); if (err < 0) @@ -2063,6 +2069,8 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off, do_final_copy = true; file_update_time(dst_file); + inode_inc_iversion_raw(dst_inode); + if (endoff > size) { int caps_flags = 0; diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c index 761451f36e2d..791f84a13bb8 100644 --- a/fs/ceph/inode.c +++ b/fs/ceph/inode.c @@ -13,6 +13,7 @@ #include <linux/posix_acl.h> #include <linux/random.h> #include <linux/sort.h> +#include <linux/iversion.h> #include "super.h" #include "mds_client.h" @@ -42,6 +43,7 @@ static int ceph_set_ino_cb(struct inode *inode, void *data) { ceph_inode(inode)->i_vino = *(struct ceph_vino *)data; inode->i_ino = ceph_vino_to_ino(*(struct ceph_vino *)data); + inode_set_iversion_raw(inode, 0); return 0; } @@ -509,6 +511,7 @@ struct inode *ceph_alloc_inode(struct super_block *sb) INIT_WORK(&ci->i_work, ceph_inode_work); ci->i_work_mask = 0; + memset(&ci->i_btime, '\0', sizeof(ci->i_btime)); ceph_fscache_inode_init(ci); @@ -523,17 +526,20 @@ void ceph_free_inode(struct inode *inode) kmem_cache_free(ceph_inode_cachep, ci); } -void ceph_destroy_inode(struct inode *inode) +void ceph_evict_inode(struct inode *inode) { struct ceph_inode_info *ci = ceph_inode(inode); struct ceph_inode_frag *frag; struct rb_node *n; - dout("destroy_inode %p ino %llx.%llx\n", inode, ceph_vinop(inode)); + dout("evict_inode %p ino %llx.%llx\n", inode, ceph_vinop(inode)); + + truncate_inode_pages_final(&inode->i_data); + clear_inode(inode); ceph_fscache_unregister_inode_cookie(ci); - __ceph_remove_caps(inode); + __ceph_remove_caps(ci); if (__ceph_has_any_quota(ci)) ceph_adjust_quota_realms_count(inode, false); @@ -578,16 +584,6 @@ void ceph_destroy_inode(struct inode *inode) ceph_put_string(rcu_dereference_raw(ci->i_layout.pool_ns)); } -int ceph_drop_inode(struct inode *inode) -{ - /* - * Positve dentry and corresponding inode are always accompanied - * in MDS reply. So no need to keep inode in the cache after - * dropping all its aliases. 
- */ - return 1; -} - static inline blkcnt_t calc_inode_blocks(u64 size) { return (size + (1<<9) - 1) >> 9; @@ -795,6 +791,9 @@ static int fill_inode(struct inode *inode, struct page *locked_page, le64_to_cpu(info->version) > (ci->i_version & ~1))) new_version = true; + /* Update change_attribute */ + inode_set_max_iversion_raw(inode, iinfo->change_attr); + __ceph_caps_issued(ci, &issued); issued |= __ceph_caps_dirty(ci); new_issued = ~issued & info_caps; @@ -813,6 +812,8 @@ static int fill_inode(struct inode *inode, struct page *locked_page, dout("%p mode 0%o uid.gid %d.%d\n", inode, inode->i_mode, from_kuid(&init_user_ns, inode->i_uid), from_kgid(&init_user_ns, inode->i_gid)); + ceph_decode_timespec64(&ci->i_btime, &iinfo->btime); + ceph_decode_timespec64(&ci->i_snap_btime, &iinfo->snap_btime); } if ((new_version || (new_issued & CEPH_CAP_LINK_SHARED)) && @@ -887,6 +888,7 @@ static int fill_inode(struct inode *inode, struct page *locked_page, iinfo->xattr_data, iinfo->xattr_len); ci->i_xattrs.version = le64_to_cpu(info->xattr_version); ceph_forget_all_cached_acls(inode); + ceph_security_invalidate_secctx(inode); xattr_blob = NULL; } @@ -1027,59 +1029,38 @@ out: } /* - * caller should hold session s_mutex. + * caller should hold session s_mutex and dentry->d_lock. */ -static void update_dentry_lease(struct dentry *dentry, - struct ceph_mds_reply_lease *lease, - struct ceph_mds_session *session, - unsigned long from_time, - struct ceph_vino *tgt_vino, - struct ceph_vino *dir_vino) +static void __update_dentry_lease(struct inode *dir, struct dentry *dentry, + struct ceph_mds_reply_lease *lease, + struct ceph_mds_session *session, + unsigned long from_time, + struct ceph_mds_session **old_lease_session) { struct ceph_dentry_info *di = ceph_dentry(dentry); long unsigned duration = le32_to_cpu(lease->duration_ms); long unsigned ttl = from_time + (duration * HZ) / 1000; long unsigned half_ttl = from_time + (duration * HZ / 2) / 1000; - struct inode *dir; - struct ceph_mds_session *old_lease_session = NULL; - /* - * Make sure dentry's inode matches tgt_vino. NULL tgt_vino means that - * we expect a negative dentry. - */ - if (!tgt_vino && d_really_is_positive(dentry)) - return; - - if (tgt_vino && (d_really_is_negative(dentry) || - !ceph_ino_compare(d_inode(dentry), tgt_vino))) - return; - - spin_lock(&dentry->d_lock); dout("update_dentry_lease %p duration %lu ms ttl %lu\n", dentry, duration, ttl); - dir = d_inode(dentry->d_parent); - - /* make sure parent matches dir_vino */ - if (!ceph_ino_compare(dir, dir_vino)) - goto out_unlock; - /* only track leases on regular dentries */ if (ceph_snap(dir) != CEPH_NOSNAP) - goto out_unlock; + return; di->lease_shared_gen = atomic_read(&ceph_inode(dir)->i_shared_gen); if (duration == 0) { __ceph_dentry_dir_lease_touch(di); - goto out_unlock; + return; } if (di->lease_gen == session->s_cap_gen && time_before(ttl, di->time)) - goto out_unlock; /* we already have a newer lease. */ + return; /* we already have a newer lease. 
*/ if (di->lease_session && di->lease_session != session) { - old_lease_session = di->lease_session; + *old_lease_session = di->lease_session; di->lease_session = NULL; } @@ -1092,6 +1073,62 @@ static void update_dentry_lease(struct dentry *dentry, di->time = ttl; __ceph_dentry_lease_touch(di); +} + +static inline void update_dentry_lease(struct inode *dir, struct dentry *dentry, + struct ceph_mds_reply_lease *lease, + struct ceph_mds_session *session, + unsigned long from_time) +{ + struct ceph_mds_session *old_lease_session = NULL; + spin_lock(&dentry->d_lock); + __update_dentry_lease(dir, dentry, lease, session, from_time, + &old_lease_session); + spin_unlock(&dentry->d_lock); + if (old_lease_session) + ceph_put_mds_session(old_lease_session); +} + +/* + * update dentry lease without having parent inode locked + */ +static void update_dentry_lease_careful(struct dentry *dentry, + struct ceph_mds_reply_lease *lease, + struct ceph_mds_session *session, + unsigned long from_time, + char *dname, u32 dname_len, + struct ceph_vino *pdvino, + struct ceph_vino *ptvino) + +{ + struct inode *dir; + struct ceph_mds_session *old_lease_session = NULL; + + spin_lock(&dentry->d_lock); + /* make sure dentry's name matches target */ + if (dentry->d_name.len != dname_len || + memcmp(dentry->d_name.name, dname, dname_len)) + goto out_unlock; + + dir = d_inode(dentry->d_parent); + /* make sure parent matches dvino */ + if (!ceph_ino_compare(dir, pdvino)) + goto out_unlock; + + /* make sure dentry's inode matches target. NULL ptvino means that + * we expect a negative dentry */ + if (ptvino) { + if (d_really_is_negative(dentry)) + goto out_unlock; + if (!ceph_ino_compare(d_inode(dentry), ptvino)) + goto out_unlock; + } else { + if (d_really_is_positive(dentry)) + goto out_unlock; + } + + __update_dentry_lease(dir, dentry, lease, session, + from_time, &old_lease_session); out_unlock: spin_unlock(&dentry->d_lock); if (old_lease_session) @@ -1156,19 +1193,6 @@ static int splice_dentry(struct dentry **pdn, struct inode *in) return 0; } -static int d_name_cmp(struct dentry *dentry, const char *name, size_t len) -{ - int ret; - - /* take d_lock to ensure dentry->d_name stability */ - spin_lock(&dentry->d_lock); - ret = dentry->d_name.len - len; - if (!ret) - ret = memcmp(dentry->d_name.name, name, len); - spin_unlock(&dentry->d_lock); - return ret; -} - /* * Incorporate results into the local cache. 
This is either just * one inode, or a directory, dentry, and possibly linked-to inode (e.g., @@ -1371,10 +1395,9 @@ retry_lookup: } else if (have_lease) { if (d_unhashed(dn)) d_add(dn, NULL); - update_dentry_lease(dn, rinfo->dlease, - session, - req->r_request_started, - NULL, &dvino); + update_dentry_lease(dir, dn, + rinfo->dlease, session, + req->r_request_started); } goto done; } @@ -1396,11 +1419,9 @@ retry_lookup: } if (have_lease) { - tvino.ino = le64_to_cpu(rinfo->targeti.in->ino); - tvino.snap = le64_to_cpu(rinfo->targeti.in->snapid); - update_dentry_lease(dn, rinfo->dlease, session, - req->r_request_started, - &tvino, &dvino); + update_dentry_lease(dir, dn, + rinfo->dlease, session, + req->r_request_started); } dout(" final dn %p\n", dn); } else if ((req->r_op == CEPH_MDS_OP_LOOKUPSNAP || @@ -1418,27 +1439,20 @@ retry_lookup: err = splice_dentry(&req->r_dentry, in); if (err < 0) goto done; - } else if (rinfo->head->is_dentry && - !d_name_cmp(req->r_dentry, rinfo->dname, rinfo->dname_len)) { + } else if (rinfo->head->is_dentry && req->r_dentry) { + /* parent inode is not locked, be carefull */ struct ceph_vino *ptvino = NULL; - - if ((le32_to_cpu(rinfo->diri.in->cap.caps) & CEPH_CAP_FILE_SHARED) || - le32_to_cpu(rinfo->dlease->duration_ms)) { - dvino.ino = le64_to_cpu(rinfo->diri.in->ino); - dvino.snap = le64_to_cpu(rinfo->diri.in->snapid); - - if (rinfo->head->is_target) { - tvino.ino = le64_to_cpu(rinfo->targeti.in->ino); - tvino.snap = le64_to_cpu(rinfo->targeti.in->snapid); - ptvino = &tvino; - } - - update_dentry_lease(req->r_dentry, rinfo->dlease, - session, req->r_request_started, ptvino, - &dvino); - } else { - dout("%s: no dentry lease or dir cap\n", __func__); + dvino.ino = le64_to_cpu(rinfo->diri.in->ino); + dvino.snap = le64_to_cpu(rinfo->diri.in->snapid); + if (rinfo->head->is_target) { + tvino.ino = le64_to_cpu(rinfo->targeti.in->ino); + tvino.snap = le64_to_cpu(rinfo->targeti.in->snapid); + ptvino = &tvino; } + update_dentry_lease_careful(req->r_dentry, rinfo->dlease, + session, req->r_request_started, + rinfo->dname, rinfo->dname_len, + &dvino, ptvino); } done: dout("fill_trace done err=%d\n", err); @@ -1600,7 +1614,7 @@ int ceph_readdir_prepopulate(struct ceph_mds_request *req, /* FIXME: release caps/leases if error occurs */ for (i = 0; i < rinfo->dir_nr; i++) { struct ceph_mds_reply_dir_entry *rde = rinfo->dir_entries + i; - struct ceph_vino tvino, dvino; + struct ceph_vino tvino; dname.name = rde->name; dname.len = rde->name_len; @@ -1701,9 +1715,9 @@ retry_lookup: ceph_dentry(dn)->offset = rde->offset; - dvino = ceph_vino(d_inode(parent)); - update_dentry_lease(dn, rde->lease, req->r_session, - req->r_request_started, &tvino, &dvino); + update_dentry_lease(d_inode(parent), dn, + rde->lease, req->r_session, + req->r_request_started); if (err == 0 && skipped == 0 && cache_ctl.index >= 0) { ret = fill_readdir_cache(d_inode(parent), dn, @@ -2282,7 +2296,7 @@ static int statx_to_caps(u32 want) { int mask = 0; - if (want & (STATX_MODE|STATX_UID|STATX_GID|STATX_CTIME)) + if (want & (STATX_MODE|STATX_UID|STATX_GID|STATX_CTIME|STATX_BTIME)) mask |= CEPH_CAP_AUTH_SHARED; if (want & (STATX_NLINK|STATX_CTIME)) @@ -2307,6 +2321,7 @@ int ceph_getattr(const struct path *path, struct kstat *stat, { struct inode *inode = d_inode(path->dentry); struct ceph_inode_info *ci = ceph_inode(inode); + u32 valid_mask = STATX_BASIC_STATS; int err = 0; /* Skip the getattr altogether if we're asked not to sync */ @@ -2319,6 +2334,16 @@ int ceph_getattr(const struct path *path, struct kstat 
*stat, generic_fillattr(inode, stat); stat->ino = ceph_translate_ino(inode->i_sb, inode->i_ino); + + /* + * btime on newly-allocated inodes is 0, so if this is still set to + * that, then assume that it's not valid. + */ + if (ci->i_btime.tv_sec || ci->i_btime.tv_nsec) { + stat->btime = ci->i_btime; + valid_mask |= STATX_BTIME; + } + if (ceph_snap(inode) == CEPH_NOSNAP) stat->dev = inode->i_sb->s_dev; else @@ -2342,7 +2367,6 @@ int ceph_getattr(const struct path *path, struct kstat *stat, stat->nlink = 1 + 1 + ci->i_subdirs; } - /* Mask off any higher bits (e.g. btime) until we have support */ - stat->result_mask = request_mask & STATX_BASIC_STATS; + stat->result_mask = request_mask & valid_mask; return err; } diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c index c8a9b89b922d..920e9f048bd8 100644 --- a/fs/ceph/mds_client.c +++ b/fs/ceph/mds_client.c @@ -150,14 +150,13 @@ static int parse_reply_info_in(void **p, void *end, info->pool_ns_data = *p; *p += info->pool_ns_len; } - /* btime, change_attr */ - { - struct ceph_timespec btime; - u64 change_attr; - ceph_decode_need(p, end, sizeof(btime), bad); - ceph_decode_copy(p, &btime, sizeof(btime)); - ceph_decode_64_safe(p, end, change_attr, bad); - } + + /* btime */ + ceph_decode_need(p, end, sizeof(info->btime), bad); + ceph_decode_copy(p, &info->btime, sizeof(info->btime)); + + /* change attribute */ + ceph_decode_64_safe(p, end, info->change_attr, bad); /* dir pin */ if (struct_v >= 2) { @@ -166,6 +165,15 @@ static int parse_reply_info_in(void **p, void *end, info->dir_pin = -ENODATA; } + /* snapshot birth time, remains zero for v<=2 */ + if (struct_v >= 3) { + ceph_decode_need(p, end, sizeof(info->snap_btime), bad); + ceph_decode_copy(p, &info->snap_btime, + sizeof(info->snap_btime)); + } else { + memset(&info->snap_btime, 0, sizeof(info->snap_btime)); + } + *p = end; } else { if (features & CEPH_FEATURE_MDS_INLINE_DATA) { @@ -197,7 +205,14 @@ static int parse_reply_info_in(void **p, void *end, } } + if (features & CEPH_FEATURE_FS_BTIME) { + ceph_decode_need(p, end, sizeof(info->btime), bad); + ceph_decode_copy(p, &info->btime, sizeof(info->btime)); + ceph_decode_64_safe(p, end, info->change_attr, bad); + } + info->dir_pin = -ENODATA; + /* info->snap_btime remains zero */ } return 0; bad: @@ -717,6 +732,7 @@ void ceph_mdsc_release_request(struct kref *kref) ceph_pagelist_release(req->r_pagelist); put_request_session(req); ceph_unreserve_caps(req->r_mdsc, &req->r_caps_reservation); + WARN_ON_ONCE(!list_empty(&req->r_wait)); kfree(req); } @@ -903,7 +919,7 @@ static int __choose_mds(struct ceph_mds_client *mdsc, struct inode *dir; rcu_read_lock(); - parent = req->r_dentry->d_parent; + parent = READ_ONCE(req->r_dentry->d_parent); dir = req->r_parent ? : d_inode_rcu(parent); if (!dir || dir->i_sb != mdsc->fsc->sb) { @@ -2135,7 +2151,7 @@ retry: memcpy(path + pos, temp->d_name.name, temp->d_name.len); } spin_unlock(&temp->d_lock); - temp = temp->d_parent; + temp = READ_ONCE(temp->d_parent); /* Are we at the root? */ if (IS_ROOT(temp)) @@ -3727,42 +3743,35 @@ static void check_new_map(struct ceph_mds_client *mdsc, ceph_mdsmap_is_laggy(newmap, i) ? 
" (laggy)" : "", ceph_session_state_name(s->s_state)); - if (i >= newmap->m_num_mds || - memcmp(ceph_mdsmap_get_addr(oldmap, i), - ceph_mdsmap_get_addr(newmap, i), - sizeof(struct ceph_entity_addr))) { - if (s->s_state == CEPH_MDS_SESSION_OPENING) { - /* the session never opened, just close it - * out now */ - get_session(s); - __unregister_session(mdsc, s); - __wake_requests(mdsc, &s->s_waiting); - ceph_put_mds_session(s); - } else if (i >= newmap->m_num_mds) { - /* force close session for stopped mds */ - get_session(s); - __unregister_session(mdsc, s); - __wake_requests(mdsc, &s->s_waiting); - kick_requests(mdsc, i); - mutex_unlock(&mdsc->mutex); + if (i >= newmap->m_num_mds) { + /* force close session for stopped mds */ + get_session(s); + __unregister_session(mdsc, s); + __wake_requests(mdsc, &s->s_waiting); + mutex_unlock(&mdsc->mutex); - mutex_lock(&s->s_mutex); - cleanup_session_requests(mdsc, s); - remove_session_caps(s); - mutex_unlock(&s->s_mutex); + mutex_lock(&s->s_mutex); + cleanup_session_requests(mdsc, s); + remove_session_caps(s); + mutex_unlock(&s->s_mutex); - ceph_put_mds_session(s); + ceph_put_mds_session(s); - mutex_lock(&mdsc->mutex); - } else { - /* just close it */ - mutex_unlock(&mdsc->mutex); - mutex_lock(&s->s_mutex); - mutex_lock(&mdsc->mutex); - ceph_con_close(&s->s_con); - mutex_unlock(&s->s_mutex); - s->s_state = CEPH_MDS_SESSION_RESTARTING; - } + mutex_lock(&mdsc->mutex); + kick_requests(mdsc, i); + continue; + } + + if (memcmp(ceph_mdsmap_get_addr(oldmap, i), + ceph_mdsmap_get_addr(newmap, i), + sizeof(struct ceph_entity_addr))) { + /* just close it */ + mutex_unlock(&mdsc->mutex); + mutex_lock(&s->s_mutex); + mutex_lock(&mdsc->mutex); + ceph_con_close(&s->s_con); + mutex_unlock(&s->s_mutex); + s->s_state = CEPH_MDS_SESSION_RESTARTING; } else if (oldstate == newstate) { continue; /* nothing new with this mds */ } @@ -3931,31 +3940,33 @@ bad: } void ceph_mdsc_lease_send_msg(struct ceph_mds_session *session, - struct inode *inode, struct dentry *dentry, char action, u32 seq) { struct ceph_msg *msg; struct ceph_mds_lease *lease; - int len = sizeof(*lease) + sizeof(u32); - int dnamelen = 0; + struct inode *dir; + int len = sizeof(*lease) + sizeof(u32) + NAME_MAX; - dout("lease_send_msg inode %p dentry %p %s to mds%d\n", - inode, dentry, ceph_lease_op_name(action), session->s_mds); - dnamelen = dentry->d_name.len; - len += dnamelen; + dout("lease_send_msg identry %p %s to mds%d\n", + dentry, ceph_lease_op_name(action), session->s_mds); msg = ceph_msg_new(CEPH_MSG_CLIENT_LEASE, len, GFP_NOFS, false); if (!msg) return; lease = msg->front.iov_base; lease->action = action; - lease->ino = cpu_to_le64(ceph_vino(inode).ino); - lease->first = lease->last = cpu_to_le64(ceph_vino(inode).snap); lease->seq = cpu_to_le32(seq); - put_unaligned_le32(dnamelen, lease + 1); - memcpy((void *)(lease + 1) + 4, dentry->d_name.name, dnamelen); + spin_lock(&dentry->d_lock); + dir = d_inode(dentry->d_parent); + lease->ino = cpu_to_le64(ceph_ino(dir)); + lease->first = lease->last = cpu_to_le64(ceph_snap(dir)); + + put_unaligned_le32(dentry->d_name.len, lease + 1); + memcpy((void *)(lease + 1) + 4, + dentry->d_name.name, dentry->d_name.len); + spin_unlock(&dentry->d_lock); /* * if this is a preemptive lease RELEASE, no need to * flush request stream, since the actual request will @@ -4157,6 +4168,7 @@ static void wait_requests(struct ceph_mds_client *mdsc) while ((req = __get_oldest_req(mdsc))) { dout("wait_requests timed out on tid %llu\n", req->r_tid); + list_del_init(&req->r_wait); 
__unregister_request(mdsc, req); } } diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h index a83f28bc2387..f7c8603484fe 100644 --- a/fs/ceph/mds_client.h +++ b/fs/ceph/mds_client.h @@ -69,6 +69,9 @@ struct ceph_mds_reply_info_in { u64 max_bytes; u64 max_files; s32 dir_pin; + struct ceph_timespec btime; + struct ceph_timespec snap_btime; + u64 change_attr; }; struct ceph_mds_reply_dir_entry { @@ -504,7 +507,6 @@ extern char *ceph_mdsc_build_path(struct dentry *dentry, int *plen, u64 *base, extern void __ceph_mdsc_drop_dentry_lease(struct dentry *dentry); extern void ceph_mdsc_lease_send_msg(struct ceph_mds_session *session, - struct inode *inode, struct dentry *dentry, char action, u32 seq); diff --git a/fs/ceph/mdsmap.c b/fs/ceph/mdsmap.c index 701b4fb0fb5a..ce2d00da5096 100644 --- a/fs/ceph/mdsmap.c +++ b/fs/ceph/mdsmap.c @@ -107,7 +107,7 @@ struct ceph_mdsmap *ceph_mdsmap_decode(void **p, void *end) struct ceph_mdsmap *m; const void *start = *p; int i, j, n; - int err = -EINVAL; + int err; u8 mdsmap_v, mdsmap_cv; u16 mdsmap_ev; @@ -183,8 +183,9 @@ struct ceph_mdsmap *ceph_mdsmap_decode(void **p, void *end) inc = ceph_decode_32(p); state = ceph_decode_32(p); state_seq = ceph_decode_64(p); - ceph_decode_copy(p, &addr, sizeof(addr)); - ceph_decode_addr(&addr); + err = ceph_decode_entity_addr(p, end, &addr); + if (err) + goto corrupt; ceph_decode_copy(p, &laggy_since, sizeof(laggy_since)); *p += sizeof(u32); ceph_decode_32_safe(p, end, namelen, bad); @@ -357,7 +358,7 @@ bad_ext: nomem: err = -ENOMEM; goto out_err; -bad: +corrupt: pr_err("corrupt mdsmap\n"); print_hex_dump(KERN_DEBUG, "mdsmap: ", DUMP_PREFIX_OFFSET, 16, 1, @@ -365,6 +366,9 @@ bad: out_err: ceph_mdsmap_destroy(m); return ERR_PTR(err); +bad: + err = -EINVAL; + goto corrupt; } void ceph_mdsmap_destroy(struct ceph_mdsmap *m) diff --git a/fs/ceph/quota.c b/fs/ceph/quota.c index d629fc857450..de56dee60540 100644 --- a/fs/ceph/quota.c +++ b/fs/ceph/quota.c @@ -135,7 +135,7 @@ static struct inode *lookup_quotarealm_inode(struct ceph_mds_client *mdsc, return NULL; mutex_lock(&qri->mutex); - if (qri->inode) { + if (qri->inode && ceph_is_any_caps(qri->inode)) { /* A request has already returned the inode */ mutex_unlock(&qri->mutex); return qri->inode; @@ -146,7 +146,18 @@ static struct inode *lookup_quotarealm_inode(struct ceph_mds_client *mdsc, mutex_unlock(&qri->mutex); return NULL; } - in = ceph_lookup_inode(sb, realm->ino); + if (qri->inode) { + /* get caps */ + int ret = __ceph_do_getattr(qri->inode, NULL, + CEPH_STAT_CAP_INODE, true); + if (ret >= 0) + in = qri->inode; + else + in = ERR_PTR(ret); + } else { + in = ceph_lookup_inode(sb, realm->ino); + } + if (IS_ERR(in)) { pr_warn("Can't lookup inode %llx (err: %ld)\n", realm->ino, PTR_ERR(in)); diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c index 72c6c022f02b..4c6494eb02b5 100644 --- a/fs/ceph/snap.c +++ b/fs/ceph/snap.c @@ -3,6 +3,7 @@ #include <linux/sort.h> #include <linux/slab.h> +#include <linux/iversion.h> #include "super.h" #include "mds_client.h" #include <linux/ceph/decode.h> @@ -606,6 +607,8 @@ int __ceph_finish_cap_snap(struct ceph_inode_info *ci, capsnap->mtime = inode->i_mtime; capsnap->atime = inode->i_atime; capsnap->ctime = inode->i_ctime; + capsnap->btime = ci->i_btime; + capsnap->change_attr = inode_peek_iversion_raw(inode); capsnap->time_warp_seq = ci->i_time_warp_seq; capsnap->truncate_size = ci->i_truncate_size; capsnap->truncate_seq = ci->i_truncate_seq; diff --git a/fs/ceph/super.c b/fs/ceph/super.c index ed1b65a6c2c3..ab4868c7308e 100644 --- 
a/fs/ceph/super.c +++ b/fs/ceph/super.c @@ -840,10 +840,10 @@ static int ceph_remount(struct super_block *sb, int *flags, char *data) static const struct super_operations ceph_super_ops = { .alloc_inode = ceph_alloc_inode, - .destroy_inode = ceph_destroy_inode, .free_inode = ceph_free_inode, .write_inode = ceph_write_inode, - .drop_inode = ceph_drop_inode, + .drop_inode = generic_delete_inode, + .evict_inode = ceph_evict_inode, .sync_fs = ceph_sync_fs, .put_super = ceph_put_super, .remount_fs = ceph_remount, @@ -978,7 +978,7 @@ static int ceph_set_super(struct super_block *s, void *data) s->s_d_op = &ceph_dentry_ops; s->s_export_op = &ceph_export_ops; - s->s_time_gran = 1000; /* 1000 ns == 1 us */ + s->s_time_gran = 1; ret = set_anon_super(s, NULL); /* what is that second arg for? */ if (ret != 0) @@ -1159,17 +1159,15 @@ static int __init init_ceph(void) goto out; ceph_flock_init(); - ceph_xattr_init(); ret = register_filesystem(&ceph_fs_type); if (ret) - goto out_xattr; + goto out_caches; pr_info("loaded (mds proto %d)\n", CEPH_MDSC_PROTOCOL); return 0; -out_xattr: - ceph_xattr_exit(); +out_caches: destroy_caches(); out: return ret; @@ -1179,7 +1177,6 @@ static void __exit exit_ceph(void) { dout("exit_ceph\n"); unregister_filesystem(&ceph_fs_type); - ceph_xattr_exit(); destroy_caches(); } diff --git a/fs/ceph/super.h b/fs/ceph/super.h index fbe6869a3f95..d2352fd95dbc 100644 --- a/fs/ceph/super.h +++ b/fs/ceph/super.h @@ -197,7 +197,8 @@ struct ceph_cap_snap { u64 xattr_version; u64 size; - struct timespec64 mtime, atime, ctime; + u64 change_attr; + struct timespec64 mtime, atime, ctime, btime; u64 time_warp_seq; u64 truncate_size; u32 truncate_seq; @@ -384,6 +385,8 @@ struct ceph_inode_info { int i_snap_realm_counter; /* snap realm (if caps) */ struct list_head i_snap_realm_item; struct list_head i_snap_flush_item; + struct timespec64 i_btime; + struct timespec64 i_snap_btime; struct work_struct i_work; unsigned long i_work_mask; @@ -544,7 +547,12 @@ static inline void __ceph_dir_set_complete(struct ceph_inode_info *ci, long long release_count, long long ordered_count) { - smp_mb__before_atomic(); + /* + * Makes sure operations that setup readdir cache (update page + * cache and i_size) are strongly ordered w.r.t. the following + * atomic64_set() operations. 
+ */ + smp_mb(); atomic64_set(&ci->i_complete_seq[0], release_count); atomic64_set(&ci->i_complete_seq[1], ordered_count); } @@ -876,9 +884,8 @@ static inline bool __ceph_have_pending_cap_snap(struct ceph_inode_info *ci) extern const struct inode_operations ceph_file_iops; extern struct inode *ceph_alloc_inode(struct super_block *sb); -extern void ceph_destroy_inode(struct inode *inode); +extern void ceph_evict_inode(struct inode *inode); extern void ceph_free_inode(struct inode *inode); -extern int ceph_drop_inode(struct inode *inode); extern struct inode *ceph_get_inode(struct super_block *sb, struct ceph_vino vino); @@ -921,10 +928,20 @@ ssize_t __ceph_getxattr(struct inode *, const char *, void *, size_t); extern ssize_t ceph_listxattr(struct dentry *, char *, size_t); extern void __ceph_build_xattrs_blob(struct ceph_inode_info *ci); extern void __ceph_destroy_xattrs(struct ceph_inode_info *ci); -extern void __init ceph_xattr_init(void); -extern void ceph_xattr_exit(void); extern const struct xattr_handler *ceph_xattr_handlers[]; +struct ceph_acl_sec_ctx { +#ifdef CONFIG_CEPH_FS_POSIX_ACL + void *default_acl; + void *acl; +#endif +#ifdef CONFIG_CEPH_FS_SECURITY_LABEL + void *sec_ctx; + u32 sec_ctxlen; +#endif + struct ceph_pagelist *pagelist; +}; + #ifdef CONFIG_SECURITY extern bool ceph_security_xattr_deadlock(struct inode *in); extern bool ceph_security_xattr_wanted(struct inode *in); @@ -939,21 +956,32 @@ static inline bool ceph_security_xattr_wanted(struct inode *in) } #endif -/* acl.c */ -struct ceph_acls_info { - void *default_acl; - void *acl; - struct ceph_pagelist *pagelist; -}; +#ifdef CONFIG_CEPH_FS_SECURITY_LABEL +extern int ceph_security_init_secctx(struct dentry *dentry, umode_t mode, + struct ceph_acl_sec_ctx *ctx); +extern void ceph_security_invalidate_secctx(struct inode *inode); +#else +static inline int ceph_security_init_secctx(struct dentry *dentry, umode_t mode, + struct ceph_acl_sec_ctx *ctx) +{ + return 0; +} +static inline void ceph_security_invalidate_secctx(struct inode *inode) +{ +} +#endif + +void ceph_release_acl_sec_ctx(struct ceph_acl_sec_ctx *as_ctx); +/* acl.c */ #ifdef CONFIG_CEPH_FS_POSIX_ACL struct posix_acl *ceph_get_acl(struct inode *, int); int ceph_set_acl(struct inode *inode, struct posix_acl *acl, int type); int ceph_pre_init_acls(struct inode *dir, umode_t *mode, - struct ceph_acls_info *info); -void ceph_init_inode_acls(struct inode *inode, struct ceph_acls_info *info); -void ceph_release_acls_info(struct ceph_acls_info *info); + struct ceph_acl_sec_ctx *as_ctx); +void ceph_init_inode_acls(struct inode *inode, + struct ceph_acl_sec_ctx *as_ctx); static inline void ceph_forget_all_cached_acls(struct inode *inode) { @@ -966,15 +994,12 @@ static inline void ceph_forget_all_cached_acls(struct inode *inode) #define ceph_set_acl NULL static inline int ceph_pre_init_acls(struct inode *dir, umode_t *mode, - struct ceph_acls_info *info) + struct ceph_acl_sec_ctx *as_ctx) { return 0; } static inline void ceph_init_inode_acls(struct inode *inode, - struct ceph_acls_info *info) -{ -} -static inline void ceph_release_acls_info(struct ceph_acls_info *info) + struct ceph_acl_sec_ctx *as_ctx) { } static inline int ceph_acl_chmod(struct dentry *dentry, struct inode *inode) @@ -1000,7 +1025,7 @@ extern void ceph_add_cap(struct inode *inode, unsigned cap, unsigned seq, u64 realmino, int flags, struct ceph_cap **new_cap); extern void __ceph_remove_cap(struct ceph_cap *cap, bool queue_release); -extern void __ceph_remove_caps(struct inode* inode); +extern void 
__ceph_remove_caps(struct ceph_inode_info *ci); extern void ceph_put_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap); extern int ceph_is_any_caps(struct inode *inode); diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c index 0cc42c8879e9..37b458a9af3a 100644 --- a/fs/ceph/xattr.c +++ b/fs/ceph/xattr.c @@ -8,6 +8,7 @@ #include <linux/ceph/decode.h> #include <linux/xattr.h> +#include <linux/security.h> #include <linux/posix_acl_xattr.h> #include <linux/slab.h> @@ -17,26 +18,9 @@ static int __remove_xattr(struct ceph_inode_info *ci, struct ceph_inode_xattr *xattr); -static const struct xattr_handler ceph_other_xattr_handler; - -/* - * List of handlers for synthetic system.* attributes. Other - * attributes are handled directly. - */ -const struct xattr_handler *ceph_xattr_handlers[] = { -#ifdef CONFIG_CEPH_FS_POSIX_ACL - &posix_acl_access_xattr_handler, - &posix_acl_default_xattr_handler, -#endif - &ceph_other_xattr_handler, - NULL, -}; - static bool ceph_is_valid_xattr(const char *name) { return !strncmp(name, XATTR_CEPH_PREFIX, XATTR_CEPH_PREFIX_LEN) || - !strncmp(name, XATTR_SECURITY_PREFIX, - XATTR_SECURITY_PREFIX_LEN) || !strncmp(name, XATTR_TRUSTED_PREFIX, XATTR_TRUSTED_PREFIX_LEN) || !strncmp(name, XATTR_USER_PREFIX, XATTR_USER_PREFIX_LEN); } @@ -48,8 +32,8 @@ static bool ceph_is_valid_xattr(const char *name) struct ceph_vxattr { char *name; size_t name_size; /* strlen(name) + 1 (for '\0') */ - size_t (*getxattr_cb)(struct ceph_inode_info *ci, char *val, - size_t size); + ssize_t (*getxattr_cb)(struct ceph_inode_info *ci, char *val, + size_t size); bool (*exists_cb)(struct ceph_inode_info *ci); unsigned int flags; }; @@ -68,8 +52,8 @@ static bool ceph_vxattrcb_layout_exists(struct ceph_inode_info *ci) rcu_dereference_raw(fl->pool_ns) != NULL); } -static size_t ceph_vxattrcb_layout(struct ceph_inode_info *ci, char *val, - size_t size) +static ssize_t ceph_vxattrcb_layout(struct ceph_inode_info *ci, char *val, + size_t size) { struct ceph_fs_client *fsc = ceph_sb_to_client(ci->vfs_inode.i_sb); struct ceph_osd_client *osdc = &fsc->client->osdc; @@ -79,7 +63,7 @@ static size_t ceph_vxattrcb_layout(struct ceph_inode_info *ci, char *val, const char *ns_field = " pool_namespace="; char buf[128]; size_t len, total_len = 0; - int ret; + ssize_t ret; pool_ns = ceph_try_get_string(ci->i_layout.pool_ns); @@ -96,18 +80,15 @@ static size_t ceph_vxattrcb_layout(struct ceph_inode_info *ci, char *val, len = snprintf(buf, sizeof(buf), "stripe_unit=%u stripe_count=%u object_size=%u pool=%lld", ci->i_layout.stripe_unit, ci->i_layout.stripe_count, - ci->i_layout.object_size, (unsigned long long)pool); + ci->i_layout.object_size, pool); total_len = len; } if (pool_ns) total_len += strlen(ns_field) + pool_ns->len; - if (!size) { - ret = total_len; - } else if (total_len > size) { - ret = -ERANGE; - } else { + ret = total_len; + if (size >= total_len) { memcpy(val, buf, len); ret = len; if (pool_name) { @@ -128,28 +109,55 @@ static size_t ceph_vxattrcb_layout(struct ceph_inode_info *ci, char *val, return ret; } -static size_t ceph_vxattrcb_layout_stripe_unit(struct ceph_inode_info *ci, - char *val, size_t size) +/* + * The convention with strings in xattrs is that they should not be NULL + * terminated, since we're returning the length with them. snprintf always + * NULL terminates however, so call it on a temporary buffer and then memcpy + * the result into place. + */ +static int ceph_fmt_xattr(char *val, size_t size, const char *fmt, ...) 
{ - return snprintf(val, size, "%u", ci->i_layout.stripe_unit); + int ret; + va_list args; + char buf[96]; /* NB: reevaluate size if new vxattrs are added */ + + va_start(args, fmt); + ret = vsnprintf(buf, size ? sizeof(buf) : 0, fmt, args); + va_end(args); + + /* Sanity check */ + if (size && ret + 1 > sizeof(buf)) { + WARN_ONCE(true, "Returned length too big (%d)", ret); + return -E2BIG; + } + + if (ret <= size) + memcpy(val, buf, ret); + return ret; } -static size_t ceph_vxattrcb_layout_stripe_count(struct ceph_inode_info *ci, +static ssize_t ceph_vxattrcb_layout_stripe_unit(struct ceph_inode_info *ci, char *val, size_t size) { - return snprintf(val, size, "%u", ci->i_layout.stripe_count); + return ceph_fmt_xattr(val, size, "%u", ci->i_layout.stripe_unit); +} + +static ssize_t ceph_vxattrcb_layout_stripe_count(struct ceph_inode_info *ci, + char *val, size_t size) +{ + return ceph_fmt_xattr(val, size, "%u", ci->i_layout.stripe_count); } -static size_t ceph_vxattrcb_layout_object_size(struct ceph_inode_info *ci, - char *val, size_t size) +static ssize_t ceph_vxattrcb_layout_object_size(struct ceph_inode_info *ci, + char *val, size_t size) { - return snprintf(val, size, "%u", ci->i_layout.object_size); + return ceph_fmt_xattr(val, size, "%u", ci->i_layout.object_size); } -static size_t ceph_vxattrcb_layout_pool(struct ceph_inode_info *ci, - char *val, size_t size) +static ssize_t ceph_vxattrcb_layout_pool(struct ceph_inode_info *ci, + char *val, size_t size) { - int ret; + ssize_t ret; struct ceph_fs_client *fsc = ceph_sb_to_client(ci->vfs_inode.i_sb); struct ceph_osd_client *osdc = &fsc->client->osdc; s64 pool = ci->i_layout.pool_id; @@ -157,21 +165,27 @@ static size_t ceph_vxattrcb_layout_pool(struct ceph_inode_info *ci, down_read(&osdc->lock); pool_name = ceph_pg_pool_name_by_id(osdc->osdmap, pool); - if (pool_name) - ret = snprintf(val, size, "%s", pool_name); - else - ret = snprintf(val, size, "%lld", (unsigned long long)pool); + if (pool_name) { + ret = strlen(pool_name); + if (ret <= size) + memcpy(val, pool_name, ret); + } else { + ret = ceph_fmt_xattr(val, size, "%lld", pool); + } up_read(&osdc->lock); return ret; } -static size_t ceph_vxattrcb_layout_pool_namespace(struct ceph_inode_info *ci, - char *val, size_t size) +static ssize_t ceph_vxattrcb_layout_pool_namespace(struct ceph_inode_info *ci, + char *val, size_t size) { - int ret = 0; + ssize_t ret = 0; struct ceph_string *ns = ceph_try_get_string(ci->i_layout.pool_ns); + if (ns) { - ret = snprintf(val, size, "%.*s", (int)ns->len, ns->str); + ret = ns->len; + if (ret <= size) + memcpy(val, ns->str, ret); ceph_put_string(ns); } return ret; @@ -179,53 +193,54 @@ static size_t ceph_vxattrcb_layout_pool_namespace(struct ceph_inode_info *ci, /* directories */ -static size_t ceph_vxattrcb_dir_entries(struct ceph_inode_info *ci, char *val, - size_t size) +static ssize_t ceph_vxattrcb_dir_entries(struct ceph_inode_info *ci, char *val, + size_t size) { - return snprintf(val, size, "%lld", ci->i_files + ci->i_subdirs); + return ceph_fmt_xattr(val, size, "%lld", ci->i_files + ci->i_subdirs); } -static size_t ceph_vxattrcb_dir_files(struct ceph_inode_info *ci, char *val, - size_t size) +static ssize_t ceph_vxattrcb_dir_files(struct ceph_inode_info *ci, char *val, + size_t size) { - return snprintf(val, size, "%lld", ci->i_files); + return ceph_fmt_xattr(val, size, "%lld", ci->i_files); } -static size_t ceph_vxattrcb_dir_subdirs(struct ceph_inode_info *ci, char *val, - size_t size) +static ssize_t ceph_vxattrcb_dir_subdirs(struct 
ceph_inode_info *ci, char *val, + size_t size) { - return snprintf(val, size, "%lld", ci->i_subdirs); + return ceph_fmt_xattr(val, size, "%lld", ci->i_subdirs); } -static size_t ceph_vxattrcb_dir_rentries(struct ceph_inode_info *ci, char *val, - size_t size) +static ssize_t ceph_vxattrcb_dir_rentries(struct ceph_inode_info *ci, char *val, + size_t size) { - return snprintf(val, size, "%lld", ci->i_rfiles + ci->i_rsubdirs); + return ceph_fmt_xattr(val, size, "%lld", + ci->i_rfiles + ci->i_rsubdirs); } -static size_t ceph_vxattrcb_dir_rfiles(struct ceph_inode_info *ci, char *val, - size_t size) +static ssize_t ceph_vxattrcb_dir_rfiles(struct ceph_inode_info *ci, char *val, + size_t size) { - return snprintf(val, size, "%lld", ci->i_rfiles); + return ceph_fmt_xattr(val, size, "%lld", ci->i_rfiles); } -static size_t ceph_vxattrcb_dir_rsubdirs(struct ceph_inode_info *ci, char *val, - size_t size) +static ssize_t ceph_vxattrcb_dir_rsubdirs(struct ceph_inode_info *ci, char *val, + size_t size) { - return snprintf(val, size, "%lld", ci->i_rsubdirs); + return ceph_fmt_xattr(val, size, "%lld", ci->i_rsubdirs); } -static size_t ceph_vxattrcb_dir_rbytes(struct ceph_inode_info *ci, char *val, - size_t size) +static ssize_t ceph_vxattrcb_dir_rbytes(struct ceph_inode_info *ci, char *val, + size_t size) { - return snprintf(val, size, "%lld", ci->i_rbytes); + return ceph_fmt_xattr(val, size, "%lld", ci->i_rbytes); } -static size_t ceph_vxattrcb_dir_rctime(struct ceph_inode_info *ci, char *val, - size_t size) +static ssize_t ceph_vxattrcb_dir_rctime(struct ceph_inode_info *ci, char *val, + size_t size) { - return snprintf(val, size, "%lld.09%ld", ci->i_rctime.tv_sec, - ci->i_rctime.tv_nsec); + return ceph_fmt_xattr(val, size, "%lld.%09ld", ci->i_rctime.tv_sec, + ci->i_rctime.tv_nsec); } /* dir pin */ @@ -234,10 +249,10 @@ static bool ceph_vxattrcb_dir_pin_exists(struct ceph_inode_info *ci) return ci->i_dir_pin != -ENODATA; } -static size_t ceph_vxattrcb_dir_pin(struct ceph_inode_info *ci, char *val, - size_t size) +static ssize_t ceph_vxattrcb_dir_pin(struct ceph_inode_info *ci, char *val, + size_t size) { - return snprintf(val, size, "%d", (int)ci->i_dir_pin); + return ceph_fmt_xattr(val, size, "%d", (int)ci->i_dir_pin); } /* quotas */ @@ -254,23 +269,36 @@ static bool ceph_vxattrcb_quota_exists(struct ceph_inode_info *ci) return ret; } -static size_t ceph_vxattrcb_quota(struct ceph_inode_info *ci, char *val, - size_t size) +static ssize_t ceph_vxattrcb_quota(struct ceph_inode_info *ci, char *val, + size_t size) +{ + return ceph_fmt_xattr(val, size, "max_bytes=%llu max_files=%llu", + ci->i_max_bytes, ci->i_max_files); +} + +static ssize_t ceph_vxattrcb_quota_max_bytes(struct ceph_inode_info *ci, + char *val, size_t size) { - return snprintf(val, size, "max_bytes=%llu max_files=%llu", - ci->i_max_bytes, ci->i_max_files); + return ceph_fmt_xattr(val, size, "%llu", ci->i_max_bytes); } -static size_t ceph_vxattrcb_quota_max_bytes(struct ceph_inode_info *ci, - char *val, size_t size) +static ssize_t ceph_vxattrcb_quota_max_files(struct ceph_inode_info *ci, + char *val, size_t size) { - return snprintf(val, size, "%llu", ci->i_max_bytes); + return ceph_fmt_xattr(val, size, "%llu", ci->i_max_files); } -static size_t ceph_vxattrcb_quota_max_files(struct ceph_inode_info *ci, - char *val, size_t size) +/* snapshots */ +static bool ceph_vxattrcb_snap_btime_exists(struct ceph_inode_info *ci) { - return snprintf(val, size, "%llu", ci->i_max_files); + return (ci->i_snap_btime.tv_sec != 0 || ci->i_snap_btime.tv_nsec != 
0); +} + +static ssize_t ceph_vxattrcb_snap_btime(struct ceph_inode_info *ci, char *val, + size_t size) +{ + return ceph_fmt_xattr(val, size, "%lld.%09ld", ci->i_snap_btime.tv_sec, + ci->i_snap_btime.tv_nsec); } #define CEPH_XATTR_NAME(_type, _name) XATTR_CEPH_PREFIX #_type "." #_name @@ -327,7 +355,7 @@ static struct ceph_vxattr ceph_dir_vxattrs[] = { XATTR_RSTAT_FIELD(dir, rctime), { .name = "ceph.dir.pin", - .name_size = sizeof("ceph.dir_pin"), + .name_size = sizeof("ceph.dir.pin"), .getxattr_cb = ceph_vxattrcb_dir_pin, .exists_cb = ceph_vxattrcb_dir_pin_exists, .flags = VXATTR_FLAG_HIDDEN, @@ -341,9 +369,15 @@ static struct ceph_vxattr ceph_dir_vxattrs[] = { }, XATTR_QUOTA_FIELD(quota, max_bytes), XATTR_QUOTA_FIELD(quota, max_files), + { + .name = "ceph.snap.btime", + .name_size = sizeof("ceph.snap.btime"), + .getxattr_cb = ceph_vxattrcb_snap_btime, + .exists_cb = ceph_vxattrcb_snap_btime_exists, + .flags = VXATTR_FLAG_READONLY, + }, { .name = NULL, 0 } /* Required table terminator */ }; -static size_t ceph_dir_vxattrs_name_size; /* total size of all names */ /* files */ @@ -360,9 +394,15 @@ static struct ceph_vxattr ceph_file_vxattrs[] = { XATTR_LAYOUT_FIELD(file, layout, object_size), XATTR_LAYOUT_FIELD(file, layout, pool), XATTR_LAYOUT_FIELD(file, layout, pool_namespace), + { + .name = "ceph.snap.btime", + .name_size = sizeof("ceph.snap.btime"), + .getxattr_cb = ceph_vxattrcb_snap_btime, + .exists_cb = ceph_vxattrcb_snap_btime_exists, + .flags = VXATTR_FLAG_READONLY, + }, { .name = NULL, 0 } /* Required table terminator */ }; -static size_t ceph_file_vxattrs_name_size; /* total size of all names */ static struct ceph_vxattr *ceph_inode_vxattrs(struct inode *inode) { @@ -373,47 +413,6 @@ static struct ceph_vxattr *ceph_inode_vxattrs(struct inode *inode) return NULL; } -static size_t ceph_vxattrs_name_size(struct ceph_vxattr *vxattrs) -{ - if (vxattrs == ceph_dir_vxattrs) - return ceph_dir_vxattrs_name_size; - if (vxattrs == ceph_file_vxattrs) - return ceph_file_vxattrs_name_size; - BUG_ON(vxattrs); - return 0; -} - -/* - * Compute the aggregate size (including terminating '\0') of all - * virtual extended attribute names in the given vxattr table. 
- */ -static size_t __init vxattrs_name_size(struct ceph_vxattr *vxattrs) -{ - struct ceph_vxattr *vxattr; - size_t size = 0; - - for (vxattr = vxattrs; vxattr->name; vxattr++) { - if (!(vxattr->flags & VXATTR_FLAG_HIDDEN)) - size += vxattr->name_size; - } - - return size; -} - -/* Routines called at initialization and exit time */ - -void __init ceph_xattr_init(void) -{ - ceph_dir_vxattrs_name_size = vxattrs_name_size(ceph_dir_vxattrs); - ceph_file_vxattrs_name_size = vxattrs_name_size(ceph_file_vxattrs); -} - -void ceph_xattr_exit(void) -{ - ceph_dir_vxattrs_name_size = 0; - ceph_file_vxattrs_name_size = 0; -} - static struct ceph_vxattr *ceph_match_vxattr(struct inode *inode, const char *name) { @@ -523,8 +522,8 @@ static int __set_xattr(struct ceph_inode_info *ci, dout("__set_xattr_val p=%p\n", p); } - dout("__set_xattr_val added %llx.%llx xattr %p %s=%.*s\n", - ceph_vinop(&ci->vfs_inode), xattr, name, val_len, val); + dout("__set_xattr_val added %llx.%llx xattr %p %.*s=%.*s\n", + ceph_vinop(&ci->vfs_inode), xattr, name_len, name, val_len, val); return 0; } @@ -823,7 +822,7 @@ ssize_t __ceph_getxattr(struct inode *inode, const char *name, void *value, struct ceph_inode_xattr *xattr; struct ceph_vxattr *vxattr = NULL; int req_mask; - int err; + ssize_t err; /* let's see if a virtual xattr was requested */ vxattr = ceph_match_vxattr(inode, name); @@ -835,8 +834,11 @@ ssize_t __ceph_getxattr(struct inode *inode, const char *name, void *value, if (err) return err; err = -ENODATA; - if (!(vxattr->exists_cb && !vxattr->exists_cb(ci))) + if (!(vxattr->exists_cb && !vxattr->exists_cb(ci))) { err = vxattr->getxattr_cb(ci, value, size); + if (size && size < err) + err = -ERANGE; + } return err; } @@ -897,10 +899,9 @@ ssize_t ceph_listxattr(struct dentry *dentry, char *names, size_t size) struct inode *inode = d_inode(dentry); struct ceph_inode_info *ci = ceph_inode(inode); struct ceph_vxattr *vxattrs = ceph_inode_vxattrs(inode); - u32 vir_namelen = 0; + bool len_only = (size == 0); u32 namelen; int err; - u32 len; int i; spin_lock(&ci->i_ceph_lock); @@ -919,38 +920,45 @@ ssize_t ceph_listxattr(struct dentry *dentry, char *names, size_t size) err = __build_xattrs(inode); if (err < 0) goto out; - /* - * Start with virtual dir xattr names (if any) (including - * terminating '\0' characters for each). 
- */ - vir_namelen = ceph_vxattrs_name_size(vxattrs); - /* adding 1 byte per each variable due to the null termination */ + /* add 1 byte for each xattr due to the null termination */ namelen = ci->i_xattrs.names_size + ci->i_xattrs.count; - err = -ERANGE; - if (size && vir_namelen + namelen > size) - goto out; - - err = namelen + vir_namelen; - if (size == 0) - goto out; + if (!len_only) { + if (namelen > size) { + err = -ERANGE; + goto out; + } + names = __copy_xattr_names(ci, names); + size -= namelen; + } - names = __copy_xattr_names(ci, names); /* virtual xattr names, too */ - err = namelen; if (vxattrs) { for (i = 0; vxattrs[i].name; i++) { - if (!(vxattrs[i].flags & VXATTR_FLAG_HIDDEN) && - !(vxattrs[i].exists_cb && - !vxattrs[i].exists_cb(ci))) { - len = sprintf(names, "%s", vxattrs[i].name); - names += len + 1; - err += len + 1; + size_t this_len; + + if (vxattrs[i].flags & VXATTR_FLAG_HIDDEN) + continue; + if (vxattrs[i].exists_cb && !vxattrs[i].exists_cb(ci)) + continue; + + this_len = strlen(vxattrs[i].name) + 1; + namelen += this_len; + if (len_only) + continue; + + if (this_len > size) { + err = -ERANGE; + goto out; } + + memcpy(names, vxattrs[i].name, this_len); + names += this_len; + size -= this_len; } } - + err = namelen; out: spin_unlock(&ci->i_ceph_lock); return err; @@ -1206,4 +1214,138 @@ bool ceph_security_xattr_deadlock(struct inode *in) spin_unlock(&ci->i_ceph_lock); return ret; } + +#ifdef CONFIG_CEPH_FS_SECURITY_LABEL +int ceph_security_init_secctx(struct dentry *dentry, umode_t mode, + struct ceph_acl_sec_ctx *as_ctx) +{ + struct ceph_pagelist *pagelist = as_ctx->pagelist; + const char *name; + size_t name_len; + int err; + + err = security_dentry_init_security(dentry, mode, &dentry->d_name, + &as_ctx->sec_ctx, + &as_ctx->sec_ctxlen); + if (err < 0) { + WARN_ON_ONCE(err != -EOPNOTSUPP); + err = 0; /* do nothing */ + goto out; + } + + err = -ENOMEM; + if (!pagelist) { + pagelist = ceph_pagelist_alloc(GFP_KERNEL); + if (!pagelist) + goto out; + err = ceph_pagelist_reserve(pagelist, PAGE_SIZE); + if (err) + goto out; + ceph_pagelist_encode_32(pagelist, 1); + } + + /* + * FIXME: Make security_dentry_init_security() generic. Currently + * It only supports single security module and only selinux has + * dentry_init_security hook. 
+ */ + name = XATTR_NAME_SELINUX; + name_len = strlen(name); + err = ceph_pagelist_reserve(pagelist, + 4 * 2 + name_len + as_ctx->sec_ctxlen); + if (err) + goto out; + + if (as_ctx->pagelist) { + /* update count of KV pairs */ + BUG_ON(pagelist->length <= sizeof(__le32)); + if (list_is_singular(&pagelist->head)) { + le32_add_cpu((__le32*)pagelist->mapped_tail, 1); + } else { + struct page *page = list_first_entry(&pagelist->head, + struct page, lru); + void *addr = kmap_atomic(page); + le32_add_cpu((__le32*)addr, 1); + kunmap_atomic(addr); + } + } else { + as_ctx->pagelist = pagelist; + } + + ceph_pagelist_encode_32(pagelist, name_len); + ceph_pagelist_append(pagelist, name, name_len); + + ceph_pagelist_encode_32(pagelist, as_ctx->sec_ctxlen); + ceph_pagelist_append(pagelist, as_ctx->sec_ctx, as_ctx->sec_ctxlen); + + err = 0; +out: + if (pagelist && !as_ctx->pagelist) + ceph_pagelist_release(pagelist); + return err; +} + +void ceph_security_invalidate_secctx(struct inode *inode) +{ + security_inode_invalidate_secctx(inode); +} + +static int ceph_xattr_set_security_label(const struct xattr_handler *handler, + struct dentry *unused, struct inode *inode, + const char *key, const void *buf, + size_t buflen, int flags) +{ + if (security_ismaclabel(key)) { + const char *name = xattr_full_name(handler, key); + return __ceph_setxattr(inode, name, buf, buflen, flags); + } + return -EOPNOTSUPP; +} + +static int ceph_xattr_get_security_label(const struct xattr_handler *handler, + struct dentry *unused, struct inode *inode, + const char *key, void *buf, size_t buflen) +{ + if (security_ismaclabel(key)) { + const char *name = xattr_full_name(handler, key); + return __ceph_getxattr(inode, name, buf, buflen); + } + return -EOPNOTSUPP; +} + +static const struct xattr_handler ceph_security_label_handler = { + .prefix = XATTR_SECURITY_PREFIX, + .get = ceph_xattr_get_security_label, + .set = ceph_xattr_set_security_label, +}; +#endif #endif + +void ceph_release_acl_sec_ctx(struct ceph_acl_sec_ctx *as_ctx) +{ +#ifdef CONFIG_CEPH_FS_POSIX_ACL + posix_acl_release(as_ctx->acl); + posix_acl_release(as_ctx->default_acl); +#endif +#ifdef CONFIG_CEPH_FS_SECURITY_LABEL + security_release_secctx(as_ctx->sec_ctx, as_ctx->sec_ctxlen); +#endif + if (as_ctx->pagelist) + ceph_pagelist_release(as_ctx->pagelist); +} + +/* + * List of handlers for synthetic system.* attributes. Other + * attributes are handled directly. + */ +const struct xattr_handler *ceph_xattr_handlers[] = { +#ifdef CONFIG_CEPH_FS_POSIX_ACL + &posix_acl_access_xattr_handler, + &posix_acl_default_xattr_handler, +#endif +#ifdef CONFIG_CEPH_FS_SECURITY_LABEL + &ceph_security_label_handler, +#endif + &ceph_other_xattr_handler, + NULL, +}; diff --git a/fs/cifs/Kconfig b/fs/cifs/Kconfig index 523e9ea78a28..b16219e5dac9 100644 --- a/fs/cifs/Kconfig +++ b/fs/cifs/Kconfig @@ -13,9 +13,11 @@ config CIFS select CRYPTO_LIB_ARC4 select CRYPTO_AEAD2 select CRYPTO_CCM + select CRYPTO_GCM select CRYPTO_ECB select CRYPTO_AES select CRYPTO_DES + select KEYS help This is the client VFS module for the SMB3 family of NAS protocols, (including support for the most recent, most secure dialect SMB3.1.1) @@ -109,7 +111,7 @@ config CIFS_WEAK_PW_HASH config CIFS_UPCALL bool "Kerberos/SPNEGO advanced session setup" - depends on CIFS && KEYS + depends on CIFS select DNS_RESOLVER help Enables an upcall mechanism for CIFS which accesses userspace helper @@ -144,14 +146,6 @@ config CIFS_POSIX (such as Samba 3.10 and later) which can negotiate CIFS POSIX ACL support. If unsure, say N. 
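Stepping back briefly to the fs/ceph/xattr.c rework above: the handler table that now closes that file relies on the VFS matching an attribute against each handler's .prefix and passing only the remaining suffix to the callback, which is why ceph_xattr_get_security_label() calls xattr_full_name() to rebuild the complete "security.*" name before forwarding it to __ceph_getxattr(). A minimal sketch of that round trip, assuming only the stock <linux/xattr.h> helpers; the "demo" names are hypothetical:

static int demo_get(const struct xattr_handler *handler,
		    struct dentry *unused, struct inode *inode,
		    const char *key, void *buf, size_t buflen)
{
	/* key arrives with "security." already stripped by the VFS;
	 * xattr_full_name() walks back over the prefix to recover it */
	const char *name = xattr_full_name(handler, key);

	return __ceph_getxattr(inode, name, buf, buflen);
}

static const struct xattr_handler demo_security_handler = {
	.prefix	= XATTR_SECURITY_PREFIX,	/* "security." */
	.get	= demo_get,
};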
-config CIFS_ACL - bool "Provide CIFS ACL support" - depends on CIFS_XATTR && KEYS - help - Allows fetching CIFS/NTFS ACL from the server. The DACL blob - is handed over to the application/caller. See the man - page for getcifsacl for more information. If unsure, say Y. - config CIFS_DEBUG bool "Enable CIFS debugging routines" default y @@ -184,7 +178,7 @@ config CIFS_DEBUG_DUMP_KEYS config CIFS_DFS_UPCALL bool "DFS feature support" - depends on CIFS && KEYS + depends on CIFS select DNS_RESOLVER help Distributed File System (DFS) support is used to access shares @@ -203,10 +197,10 @@ config CIFS_NFSD_EXPORT Allows NFS server to export a CIFS mounted share (nfsd over cifs) config CIFS_SMB_DIRECT - bool "SMB Direct support (Experimental)" + bool "SMB Direct support" depends on CIFS=m && INFINIBAND && INFINIBAND_ADDR_TRANS || CIFS=y && INFINIBAND=y && INFINIBAND_ADDR_TRANS=y help - Enables SMB Direct experimental support for SMB 3.0, 3.02 and 3.1.1. + Enables SMB Direct support for SMB 3.0, 3.02 and 3.1.1. SMB Direct allows transferring SMB packets over RDMA. If unsure, say N. diff --git a/fs/cifs/Makefile b/fs/cifs/Makefile index 51af69a1a328..41332f20055b 100644 --- a/fs/cifs/Makefile +++ b/fs/cifs/Makefile @@ -10,10 +10,9 @@ cifs-y := trace.o cifsfs.o cifssmb.o cifs_debug.o connect.o dir.o file.o \ cifs_unicode.o nterr.o cifsencrypt.o \ readdir.o ioctl.o sess.o export.o smb1ops.o winucase.o \ smb2ops.o smb2maperror.o smb2transport.o \ - smb2misc.o smb2pdu.o smb2inode.o smb2file.o + smb2misc.o smb2pdu.o smb2inode.o smb2file.o cifsacl.o cifs-$(CONFIG_CIFS_XATTR) += xattr.o -cifs-$(CONFIG_CIFS_ACL) += cifsacl.o cifs-$(CONFIG_CIFS_UPCALL) += cifs_spnego.o diff --git a/fs/cifs/cifs_debug.c b/fs/cifs/cifs_debug.c index ec933fb0b36e..a38d796f5ffe 100644 --- a/fs/cifs/cifs_debug.c +++ b/fs/cifs/cifs_debug.c @@ -240,9 +240,7 @@ static int cifs_debug_data_proc_show(struct seq_file *m, void *v) #ifdef CONFIG_CIFS_XATTR seq_printf(m, ",XATTR"); #endif -#ifdef CONFIG_CIFS_ACL seq_printf(m, ",ACL"); -#endif seq_putc(m, '\n'); seq_printf(m, "CIFSMaxBufSize: %d\n", CIFSMaxBufSize); seq_printf(m, "Active VFS Requests: %d\n", GlobalTotalActiveXid); diff --git a/fs/cifs/cifs_fs_sb.h b/fs/cifs/cifs_fs_sb.h index ed49222abecb..b326d2ca3765 100644 --- a/fs/cifs/cifs_fs_sb.h +++ b/fs/cifs/cifs_fs_sb.h @@ -52,6 +52,7 @@ #define CIFS_MOUNT_UID_FROM_ACL 0x2000000 /* try to get UID via special SID */ #define CIFS_MOUNT_NO_HANDLE_CACHE 0x4000000 /* disable caching dir handles */ #define CIFS_MOUNT_NO_DFS 0x8000000 /* disable DFS resolving */ +#define CIFS_MOUNT_MODE_FROM_SID 0x10000000 /* retrieve mode from special ACE */ struct cifs_sb_info { struct rb_root tlink_tree; @@ -83,5 +84,10 @@ struct cifs_sb_info { * failover properly. */ char *origin_fullpath; /* \\HOST\SHARE\[OPTIONAL PATH] */ + /* + * Indicate whether serverino option was turned off later + * (cifs_autodisable_serverino) in order to match new mounts. 
+ */ + bool mnt_cifs_serverino_autodisabled; }; #endif /* _CIFS_FS_SB_H */ diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c index 24635b65effa..270d3c58fb3b 100644 --- a/fs/cifs/cifsfs.c +++ b/fs/cifs/cifsfs.c @@ -526,6 +526,8 @@ cifs_show_options(struct seq_file *s, struct dentry *root) seq_puts(s, ",nobrl"); if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_NO_HANDLE_CACHE) seq_puts(s, ",nohandlecache"); + if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_MODE_FROM_SID) + seq_puts(s, ",modefromsid"); if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_CIFS_ACL) seq_puts(s, ",cifsacl"); if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_DYNPERM) @@ -554,6 +556,11 @@ cifs_show_options(struct seq_file *s, struct dentry *root) seq_printf(s, ",bsize=%u", cifs_sb->bsize); seq_printf(s, ",echo_interval=%lu", tcon->ses->server->echo_interval / HZ); + + /* Only display max_credits if it was overridden on mount */ + if (tcon->ses->server->max_credits != SMB2_MAX_CREDITS_AVAILABLE) + seq_printf(s, ",max_credits=%u", tcon->ses->server->max_credits); + if (tcon->snapshot_time) seq_printf(s, ",snapshot=%llu", tcon->snapshot_time); if (tcon->handle_timeout) @@ -1517,11 +1524,9 @@ init_cifs(void) goto out_destroy_dfs_cache; #endif /* CONFIG_CIFS_UPCALL */ -#ifdef CONFIG_CIFS_ACL rc = init_cifs_idmap(); if (rc) goto out_register_key_type; -#endif /* CONFIG_CIFS_ACL */ rc = register_filesystem(&cifs_fs_type); if (rc) @@ -1536,10 +1541,8 @@ init_cifs(void) return 0; out_init_cifs_idmap: -#ifdef CONFIG_CIFS_ACL exit_cifs_idmap(); out_register_key_type: -#endif #ifdef CONFIG_CIFS_UPCALL exit_cifs_spnego(); out_destroy_dfs_cache: @@ -1571,9 +1574,7 @@ exit_cifs(void) unregister_filesystem(&cifs_fs_type); unregister_filesystem(&smb3_fs_type); cifs_dfs_release_automount_timer(); -#ifdef CONFIG_CIFS_ACL exit_cifs_idmap(); -#endif #ifdef CONFIG_CIFS_UPCALL exit_cifs_spnego(); #endif @@ -1607,5 +1608,6 @@ MODULE_SOFTDEP("pre: sha256"); MODULE_SOFTDEP("pre: sha512"); MODULE_SOFTDEP("pre: aead2"); MODULE_SOFTDEP("pre: ccm"); +MODULE_SOFTDEP("pre: gcm"); module_init(init_cifs) module_exit(exit_cifs) diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h index 4777b3c4a92c..fe610e7e3670 100644 --- a/fs/cifs/cifsglob.h +++ b/fs/cifs/cifsglob.h @@ -550,6 +550,7 @@ struct smb_vol { bool override_gid:1; bool dynperm:1; bool noperm:1; + bool mode_ace:1; bool no_psx_acl:1; /* set if posix acl support should be disabled */ bool cifs_acl:1; bool backupuid_specified; /* mount option backupuid is specified */ @@ -600,6 +601,7 @@ struct smb_vol { __u64 snapshot_time; /* needed for timewarp tokens */ __u32 handle_timeout; /* persistent and durable handle timeout in ms */ unsigned int max_credits; /* smb3 max_credits 10 < credits < 60000 */ + __u16 compression; /* compression algorithm 0xFFFF default 0=disabled */ }; /** @@ -617,7 +619,8 @@ struct smb_vol { CIFS_MOUNT_FSCACHE | CIFS_MOUNT_MF_SYMLINKS | \ CIFS_MOUNT_MULTIUSER | CIFS_MOUNT_STRICT_IO | \ CIFS_MOUNT_CIFS_BACKUPUID | CIFS_MOUNT_CIFS_BACKUPGID | \ - CIFS_MOUNT_NO_DFS) + CIFS_MOUNT_UID_FROM_ACL | CIFS_MOUNT_NO_HANDLE_CACHE | \ + CIFS_MOUNT_NO_DFS | CIFS_MOUNT_MODE_FROM_SID) /** * Generic VFS superblock mount flags (s_flags) to consider when @@ -1870,7 +1873,6 @@ extern unsigned int cifs_min_small; /* min size of small buf pool */ extern unsigned int cifs_max_pending; /* MAX requests at once to server*/ extern bool disable_legacy_dialects; /* forbid vers=1.0 and vers=2.0 mounts */ -#ifdef CONFIG_CIFS_ACL GLOBAL_EXTERN struct rb_root uidtree; GLOBAL_EXTERN struct rb_root gidtree; GLOBAL_EXTERN spinlock_t 
siduidlock; @@ -1879,7 +1881,6 @@ GLOBAL_EXTERN struct rb_root siduidtree; GLOBAL_EXTERN struct rb_root sidgidtree; GLOBAL_EXTERN spinlock_t uidsidlock; GLOBAL_EXTERN spinlock_t gidsidlock; -#endif /* CONFIG_CIFS_ACL */ void cifs_oplock_break(struct work_struct *work); void cifs_queue_oplock_break(struct cifsFileInfo *cfile); diff --git a/fs/cifs/cifssmb.c b/fs/cifs/cifssmb.c index 1fbd92843a73..e2f95965065d 100644 --- a/fs/cifs/cifssmb.c +++ b/fs/cifs/cifssmb.c @@ -3600,11 +3600,9 @@ static int cifs_copy_posix_acl(char *trgt, char *src, const int buflen, return size; } -static __u16 convert_ace_to_cifs_ace(struct cifs_posix_ace *cifs_ace, +static void convert_ace_to_cifs_ace(struct cifs_posix_ace *cifs_ace, const struct posix_acl_xattr_entry *local_ace) { - __u16 rc = 0; /* 0 = ACL converted ok */ - cifs_ace->cifs_e_perm = le16_to_cpu(local_ace->e_perm); cifs_ace->cifs_e_tag = le16_to_cpu(local_ace->e_tag); /* BB is there a better way to handle the large uid? */ @@ -3617,7 +3615,6 @@ static __u16 convert_ace_to_cifs_ace(struct cifs_posix_ace *cifs_ace, cifs_dbg(FYI, "perm %d tag %d id %d\n", ace->e_perm, ace->e_tag, ace->e_id); */ - return rc; } /* Convert ACL from local Linux POSIX xattr to CIFS POSIX ACL wire format */ @@ -3653,13 +3650,8 @@ static __u16 ACL_to_cifs_posix(char *parm_data, const char *pACL, cifs_dbg(FYI, "unknown ACL type %d\n", acl_type); return 0; } - for (i = 0; i < count; i++) { - rc = convert_ace_to_cifs_ace(&cifs_acl->ace_array[i], &ace[i]); - if (rc != 0) { - /* ACE not converted */ - break; - } - } + for (i = 0; i < count; i++) + convert_ace_to_cifs_ace(&cifs_acl->ace_array[i], &ace[i]); if (rc == 0) { rc = (__u16)(count * sizeof(struct cifs_posix_ace)); rc += sizeof(struct cifs_posix_acl); @@ -3920,7 +3912,6 @@ GetExtAttrOut: #endif /* CONFIG_POSIX */ -#ifdef CONFIG_CIFS_ACL /* * Initialize NT TRANSACT SMB into small smb request buffer. 
This assumes that * all NT TRANSACTS that we init here have total parm and data under about 400 @@ -4164,7 +4155,6 @@ setCifsAclRetry: return (rc); } -#endif /* CONFIG_CIFS_ACL */ /* Legacy Query Path Information call for lookup to old servers such as Win9x/WinME */ diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c index 714a359c7c8d..a4830ced0f98 100644 --- a/fs/cifs/connect.c +++ b/fs/cifs/connect.c @@ -96,7 +96,8 @@ enum { Opt_multiuser, Opt_sloppy, Opt_nosharesock, Opt_persistent, Opt_nopersistent, Opt_resilient, Opt_noresilient, - Opt_domainauto, Opt_rdma, + Opt_domainauto, Opt_rdma, Opt_modesid, + Opt_compress, /* Mount options which take numeric value */ Opt_backupuid, Opt_backupgid, Opt_uid, @@ -175,6 +176,7 @@ static const match_table_t cifs_mount_option_tokens = { { Opt_serverino, "serverino" }, { Opt_noserverino, "noserverino" }, { Opt_rwpidforward, "rwpidforward" }, + { Opt_modesid, "modefromsid" }, { Opt_cifsacl, "cifsacl" }, { Opt_nocifsacl, "nocifsacl" }, { Opt_acl, "acl" }, @@ -212,6 +214,7 @@ static const match_table_t cifs_mount_option_tokens = { { Opt_echo_interval, "echo_interval=%s" }, { Opt_max_credits, "max_credits=%s" }, { Opt_snapshot, "snapshot=%s" }, + { Opt_compress, "compress=%s" }, { Opt_blank_user, "user=" }, { Opt_blank_user, "username=" }, @@ -706,10 +709,10 @@ static bool server_unresponsive(struct TCP_Server_Info *server) { /* - * We need to wait 2 echo intervals to make sure we handle such + * We need to wait 3 echo intervals to make sure we handle such * situations right: * 1s client sends a normal SMB request - * 2s client gets a response + * 3s client gets a response * 30s echo workqueue job pops, and decides we got a response recently * and don't need to send another * ... @@ -718,9 +721,9 @@ server_unresponsive(struct TCP_Server_Info *server) */ if ((server->tcpStatus == CifsGood || server->tcpStatus == CifsNeedNegotiate) && - time_after(jiffies, server->lstrp + 2 * server->echo_interval)) { + time_after(jiffies, server->lstrp + 3 * server->echo_interval)) { cifs_dbg(VFS, "Server %s has not responded in %lu seconds. 
Reconnecting...\n", - server->hostname, (2 * server->echo_interval) / HZ); + server->hostname, (3 * server->echo_interval) / HZ); cifs_reconnect(server); wake_up(&server->response_q); return true; @@ -1223,11 +1226,11 @@ next_pdu: atomic_read(&midCount)); cifs_dump_mem("Received Data is: ", bufs[i], HEADER_SIZE(server)); + smb2_add_credits_from_hdr(bufs[i], server); #ifdef CONFIG_CIFS_DEBUG2 if (server->ops->dump_detail) server->ops->dump_detail(bufs[i], server); - smb2_add_credits_from_hdr(bufs[i], server); cifs_dump_mids(server); #endif /* CIFS_DEBUG2 */ } @@ -1830,6 +1833,9 @@ cifs_parse_mount_options(const char *mountdata, const char *devname, case Opt_rwpidforward: vol->rwpidforward = 1; break; + case Opt_modesid: + vol->mode_ace = 1; + break; case Opt_cifsacl: vol->cifs_acl = 1; break; @@ -1911,6 +1917,11 @@ cifs_parse_mount_options(const char *mountdata, const char *devname, case Opt_rdma: vol->rdma = true; break; + case Opt_compress: + vol->compression = UNKNOWN_TYPE; + cifs_dbg(VFS, + "SMB3 compression support is experimental\n"); + break; /* Numeric Values */ case Opt_backupuid: @@ -2544,8 +2555,15 @@ static int match_server(struct TCP_Server_Info *server, struct smb_vol *vol) if (vol->nosharesock) return 0; - /* BB update this for smb3any and default case */ - if ((server->vals != vol->vals) || (server->ops != vol->ops)) + /* If multidialect negotiation see if existing sessions match one */ + if (strcmp(vol->vals->version_string, SMB3ANY_VERSION_STRING) == 0) { + if (server->vals->protocol_id < SMB30_PROT_ID) + return 0; + } else if (strcmp(vol->vals->version_string, + SMBDEFAULT_VERSION_STRING) == 0) { + if (server->vals->protocol_id < SMB21_PROT_ID) + return 0; + } else if ((server->vals != vol->vals) || (server->ops != vol->ops)) return 0; if (!net_eq(cifs_net_ns(server), current->nsproxy->net_ns)) @@ -2680,6 +2698,7 @@ cifs_get_tcp_session(struct smb_vol *volume_info) tcp_ses->sequence_number = 0; tcp_ses->reconnect_instance = 1; tcp_ses->lstrp = jiffies; + tcp_ses->compress_algorithm = cpu_to_le16(volume_info->compression); spin_lock_init(&tcp_ses->req_lock); INIT_LIST_HEAD(&tcp_ses->tcp_ses_list); INIT_LIST_HEAD(&tcp_ses->smb_ses_list); @@ -3460,12 +3479,16 @@ compare_mount_options(struct super_block *sb, struct cifs_mnt_data *mnt_data) { struct cifs_sb_info *old = CIFS_SB(sb); struct cifs_sb_info *new = mnt_data->cifs_sb; + unsigned int oldflags = old->mnt_cifs_flags & CIFS_MOUNT_MASK; + unsigned int newflags = new->mnt_cifs_flags & CIFS_MOUNT_MASK; if ((sb->s_flags & CIFS_MS_MASK) != (mnt_data->flags & CIFS_MS_MASK)) return 0; - if ((old->mnt_cifs_flags & CIFS_MOUNT_MASK) != - (new->mnt_cifs_flags & CIFS_MOUNT_MASK)) + if (old->mnt_cifs_serverino_autodisabled) + newflags &= ~CIFS_MOUNT_SERVER_INUM; + + if (oldflags != newflags) return 0; /* @@ -3965,6 +3988,8 @@ int cifs_setup_cifs_sb(struct smb_vol *pvolume_info, cifs_sb->mnt_cifs_flags |= CIFS_MOUNT_NOPOSIXBRL; if (pvolume_info->rwpidforward) cifs_sb->mnt_cifs_flags |= CIFS_MOUNT_RWPIDFORWARD; + if (pvolume_info->mode_ace) + cifs_sb->mnt_cifs_flags |= CIFS_MOUNT_MODE_FROM_SID; if (pvolume_info->cifs_acl) cifs_sb->mnt_cifs_flags |= CIFS_MOUNT_CIFS_ACL; if (pvolume_info->backupuid_specified) { @@ -4459,11 +4484,13 @@ cifs_are_all_path_components_accessible(struct TCP_Server_Info *server, unsigned int xid, struct cifs_tcon *tcon, struct cifs_sb_info *cifs_sb, - char *full_path) + char *full_path, + int added_treename) { int rc; char *s; char sep, tmp; + int skip = added_treename ? 
1 : 0; sep = CIFS_DIR_SEP(cifs_sb); s = full_path; @@ -4478,7 +4505,14 @@ cifs_are_all_path_components_accessible(struct TCP_Server_Info *server, /* next separator */ while (*s && *s != sep) s++; - + /* + * if the treename is added, we then have to skip the first + * part within the separators + */ + if (skip) { + skip = 0; + continue; + } /* * temporarily null-terminate the path at the end of * the current component @@ -4526,8 +4560,7 @@ static int is_path_remote(struct cifs_sb_info *cifs_sb, struct smb_vol *vol, if (rc != -EREMOTE) { rc = cifs_are_all_path_components_accessible(server, xid, tcon, - cifs_sb, - full_path); + cifs_sb, full_path, tcon->Flags & SMB_SHARE_IS_IN_DFS); if (rc != 0) { cifs_dbg(VFS, "cannot query dirs between root and final path, " "enabling CIFS_MOUNT_USE_PREFIX_PATH\n"); diff --git a/fs/cifs/dfs_cache.c b/fs/cifs/dfs_cache.c index e3e1c13df439..1692c0c6c23a 100644 --- a/fs/cifs/dfs_cache.c +++ b/fs/cifs/dfs_cache.c @@ -492,7 +492,7 @@ static struct dfs_cache_entry *__find_cache_entry(unsigned int hash, #ifdef CONFIG_CIFS_DEBUG2 char *name = get_tgt_name(ce); - if (unlikely(IS_ERR(name))) { + if (IS_ERR(name)) { rcu_read_unlock(); return ERR_CAST(name); } diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c index d7cc62252634..1bffe029fb66 100644 --- a/fs/cifs/inode.c +++ b/fs/cifs/inode.c @@ -892,7 +892,6 @@ cifs_get_inode_info(struct inode **inode, const char *full_path, cifs_dbg(FYI, "cifs_sfu_type failed: %d\n", tmprc); } -#ifdef CONFIG_CIFS_ACL /* fill in 0777 bits from ACL */ if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_CIFS_ACL) { rc = cifs_acl_to_fattr(cifs_sb, &fattr, *inode, full_path, fid); @@ -902,7 +901,6 @@ cifs_get_inode_info(struct inode **inode, const char *full_path, goto cgii_exit; } } -#endif /* CONFIG_CIFS_ACL */ /* fill in remaining high mode bits e.g. 
SUID, VTX */ if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_UNX_EMUL) @@ -2415,7 +2413,7 @@ cifs_setattr_nounix(struct dentry *direntry, struct iattr *attrs) xid = get_xid(); - cifs_dbg(FYI, "setattr on file %pd attrs->iavalid 0x%x\n", + cifs_dbg(FYI, "setattr on file %pd attrs->ia_valid 0x%x\n", direntry, attrs->ia_valid); if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_NO_PERM) @@ -2466,7 +2464,6 @@ cifs_setattr_nounix(struct dentry *direntry, struct iattr *attrs) if (attrs->ia_valid & ATTR_GID) gid = attrs->ia_gid; -#ifdef CONFIG_CIFS_ACL if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_CIFS_ACL) { if (uid_valid(uid) || gid_valid(gid)) { rc = id_mode_to_cifs_acl(inode, full_path, NO_CHANGE_64, @@ -2478,7 +2475,6 @@ cifs_setattr_nounix(struct dentry *direntry, struct iattr *attrs) } } } else -#endif /* CONFIG_CIFS_ACL */ if (!(cifs_sb->mnt_cifs_flags & CIFS_MOUNT_SET_UID)) attrs->ia_valid &= ~(ATTR_UID | ATTR_GID); @@ -2489,7 +2485,6 @@ cifs_setattr_nounix(struct dentry *direntry, struct iattr *attrs) if (attrs->ia_valid & ATTR_MODE) { mode = attrs->ia_mode; rc = 0; -#ifdef CONFIG_CIFS_ACL if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_CIFS_ACL) { rc = id_mode_to_cifs_acl(inode, full_path, mode, INVALID_UID, INVALID_GID); @@ -2499,7 +2494,6 @@ cifs_setattr_nounix(struct dentry *direntry, struct iattr *attrs) goto cifs_setattr_exit; } } else -#endif /* CONFIG_CIFS_ACL */ if (((mode & S_IWUGO) == 0) && (cifsInode->cifsAttrs & ATTR_READONLY) == 0) { diff --git a/fs/cifs/misc.c b/fs/cifs/misc.c index b1a696a73f7c..f383877a6511 100644 --- a/fs/cifs/misc.c +++ b/fs/cifs/misc.c @@ -539,6 +539,7 @@ cifs_autodisable_serverino(struct cifs_sb_info *cifs_sb) tcon = cifs_sb_master_tcon(cifs_sb); cifs_sb->mnt_cifs_flags &= ~CIFS_MOUNT_SERVER_INUM; + cifs_sb->mnt_cifs_serverino_autodisabled = true; cifs_dbg(VFS, "Autodisabling the use of server inode numbers on %s.\n", tcon ? 
tcon->treeName : "new server"); cifs_dbg(VFS, "The server doesn't seem to support them properly or the files might be on different servers (DFS).\n"); diff --git a/fs/cifs/smb1ops.c b/fs/cifs/smb1ops.c index 9e430ae9314f..b7421a096319 100644 --- a/fs/cifs/smb1ops.c +++ b/fs/cifs/smb1ops.c @@ -1223,16 +1223,15 @@ struct smb_version_operations smb1_operations = { .query_all_EAs = CIFSSMBQAllEAs, .set_EA = CIFSSMBSetEA, #endif /* CIFS_XATTR */ -#ifdef CONFIG_CIFS_ACL .get_acl = get_cifs_acl, .get_acl_by_fid = get_cifs_acl_by_fid, .set_acl = set_cifs_acl, -#endif /* CIFS_ACL */ .make_node = cifs_make_node, }; struct smb_version_values smb1_values = { .version_string = SMB1_VERSION_STRING, + .protocol_id = SMB10_PROT_ID, .large_lock_type = LOCKING_ANDX_LARGE_FILES, .exclusive_lock_type = 0, .shared_lock_type = LOCKING_ANDX_SHARED_LOCK, diff --git a/fs/cifs/smb2inode.c b/fs/cifs/smb2inode.c index 278405d26c47..d8d9cdfa30b6 100644 --- a/fs/cifs/smb2inode.c +++ b/fs/cifs/smb2inode.c @@ -120,6 +120,8 @@ smb2_compound_op(const unsigned int xid, struct cifs_tcon *tcon, SMB2_O_INFO_FILE, 0, sizeof(struct smb2_file_all_info) + PATH_MAX * 2, 0, NULL); + if (rc) + goto finished; smb2_set_next_command(tcon, &rqst[num_rqst]); smb2_set_related(&rqst[num_rqst++]); trace_smb3_query_info_compound_enter(xid, ses->Suid, tcon->tid, @@ -147,6 +149,8 @@ smb2_compound_op(const unsigned int xid, struct cifs_tcon *tcon, COMPOUND_FID, current->tgid, FILE_DISPOSITION_INFORMATION, SMB2_O_INFO_FILE, 0, data, size); + if (rc) + goto finished; smb2_set_next_command(tcon, &rqst[num_rqst]); smb2_set_related(&rqst[num_rqst++]); trace_smb3_rmdir_enter(xid, ses->Suid, tcon->tid, full_path); @@ -163,6 +167,8 @@ smb2_compound_op(const unsigned int xid, struct cifs_tcon *tcon, COMPOUND_FID, current->tgid, FILE_END_OF_FILE_INFORMATION, SMB2_O_INFO_FILE, 0, data, size); + if (rc) + goto finished; smb2_set_next_command(tcon, &rqst[num_rqst]); smb2_set_related(&rqst[num_rqst++]); trace_smb3_set_eof_enter(xid, ses->Suid, tcon->tid, full_path); @@ -180,6 +186,8 @@ smb2_compound_op(const unsigned int xid, struct cifs_tcon *tcon, COMPOUND_FID, current->tgid, FILE_BASIC_INFORMATION, SMB2_O_INFO_FILE, 0, data, size); + if (rc) + goto finished; smb2_set_next_command(tcon, &rqst[num_rqst]); smb2_set_related(&rqst[num_rqst++]); trace_smb3_set_info_compound_enter(xid, ses->Suid, tcon->tid, @@ -206,6 +214,8 @@ smb2_compound_op(const unsigned int xid, struct cifs_tcon *tcon, COMPOUND_FID, current->tgid, FILE_RENAME_INFORMATION, SMB2_O_INFO_FILE, 0, data, size); + if (rc) + goto finished; smb2_set_next_command(tcon, &rqst[num_rqst]); smb2_set_related(&rqst[num_rqst++]); trace_smb3_rename_enter(xid, ses->Suid, tcon->tid, full_path); @@ -231,6 +241,8 @@ smb2_compound_op(const unsigned int xid, struct cifs_tcon *tcon, COMPOUND_FID, current->tgid, FILE_LINK_INFORMATION, SMB2_O_INFO_FILE, 0, data, size); + if (rc) + goto finished; smb2_set_next_command(tcon, &rqst[num_rqst]); smb2_set_related(&rqst[num_rqst++]); trace_smb3_hardlink_enter(xid, ses->Suid, tcon->tid, full_path); diff --git a/fs/cifs/smb2ops.c b/fs/cifs/smb2ops.c index 9fd56b0acd7e..0cdc4e47ca87 100644 --- a/fs/cifs/smb2ops.c +++ b/fs/cifs/smb2ops.c @@ -2027,6 +2027,10 @@ smb2_set_related(struct smb_rqst *rqst) struct smb2_sync_hdr *shdr; shdr = (struct smb2_sync_hdr *)(rqst->rq_iov[0].iov_base); + if (shdr == NULL) { + cifs_dbg(FYI, "shdr NULL in smb2_set_related\n"); + return; + } shdr->Flags |= SMB2_FLAGS_RELATED_OPERATIONS; } @@ -2041,6 +2045,12 @@ smb2_set_next_command(struct cifs_tcon 
*tcon, struct smb_rqst *rqst) unsigned long len = smb_rqst_len(server, rqst); int i, num_padding; + shdr = (struct smb2_sync_hdr *)(rqst->rq_iov[0].iov_base); + if (shdr == NULL) { + cifs_dbg(FYI, "shdr NULL in smb2_set_next_command\n"); + return; + } + /* SMB headers in a compound are 8 byte aligned. */ /* No padding needed */ @@ -2080,7 +2090,6 @@ smb2_set_next_command(struct cifs_tcon *tcon, struct smb_rqst *rqst) } finished: - shdr = (struct smb2_sync_hdr *)(rqst->rq_iov[0].iov_base); shdr->NextCommand = cpu_to_le32(len); } @@ -2374,6 +2383,34 @@ smb2_get_dfs_refer(const unsigned int xid, struct cifs_ses *ses, } static int +parse_reparse_posix(struct reparse_posix_data *symlink_buf, + u32 plen, char **target_path, + struct cifs_sb_info *cifs_sb) +{ + unsigned int len; + + /* See MS-FSCC 2.1.2.6 for the 'NFS' style reparse tags */ + len = le16_to_cpu(symlink_buf->ReparseDataLength); + + if (le64_to_cpu(symlink_buf->InodeType) != NFS_SPECFILE_LNK) { + cifs_dbg(VFS, "%lld not a supported symlink type\n", + le64_to_cpu(symlink_buf->InodeType)); + return -EOPNOTSUPP; + } + + *target_path = cifs_strndup_from_utf16( + symlink_buf->PathBuffer, + len, true, cifs_sb->local_nls); + if (!(*target_path)) + return -ENOMEM; + + convert_delimiter(*target_path, '/'); + cifs_dbg(FYI, "%s: target path: %s\n", __func__, *target_path); + + return 0; +} + +static int parse_reparse_symlink(struct reparse_symlink_data_buffer *symlink_buf, u32 plen, char **target_path, struct cifs_sb_info *cifs_sb) @@ -2381,11 +2418,7 @@ parse_reparse_symlink(struct reparse_symlink_data_buffer *symlink_buf, unsigned int sub_len; unsigned int sub_offset; - /* We only handle Symbolic Link : MS-FSCC 2.1.2.4 */ - if (le32_to_cpu(symlink_buf->ReparseTag) != IO_REPARSE_TAG_SYMLINK) { - cifs_dbg(VFS, "srv returned invalid symlink buffer\n"); - return -EIO; - } + /* We handle Symbolic Link reparse tag here. See: MS-FSCC 2.1.2.4 */ sub_offset = le16_to_cpu(symlink_buf->SubstituteNameOffset); sub_len = le16_to_cpu(symlink_buf->SubstituteNameLength); @@ -2407,6 +2440,41 @@ parse_reparse_symlink(struct reparse_symlink_data_buffer *symlink_buf, return 0; } +static int +parse_reparse_point(struct reparse_data_buffer *buf, + u32 plen, char **target_path, + struct cifs_sb_info *cifs_sb) +{ + if (plen < sizeof(struct reparse_data_buffer)) { + cifs_dbg(VFS, "reparse buffer is too small. Must be " + "at least 8 bytes but was %d\n", plen); + return -EIO; + } + + if (plen < le16_to_cpu(buf->ReparseDataLength) + + sizeof(struct reparse_data_buffer)) { + cifs_dbg(VFS, "srv returned invalid reparse buf " + "length: %d\n", plen); + return -EIO; + } + + /* See MS-FSCC 2.1.2 */ + switch (le32_to_cpu(buf->ReparseTag)) { + case IO_REPARSE_TAG_NFS: + return parse_reparse_posix( + (struct reparse_posix_data *)buf, + plen, target_path, cifs_sb); + case IO_REPARSE_TAG_SYMLINK: + return parse_reparse_symlink( + (struct reparse_symlink_data_buffer *)buf, + plen, target_path, cifs_sb); + default: + cifs_dbg(VFS, "srv returned unknown symlink buffer " + "tag:0x%08x\n", le32_to_cpu(buf->ReparseTag)); + return -EOPNOTSUPP; + } +} + #define SMB2_SYMLINK_STRUCT_SIZE \ (sizeof(struct smb2_err_rsp) - 1 + sizeof(struct smb2_symlink_err_rsp)) @@ -2533,23 +2601,8 @@ smb2_query_symlink(const unsigned int xid, struct cifs_tcon *tcon, goto querty_exit; } - if (plen < 8) { - cifs_dbg(VFS, "reparse buffer is too small. 
Must be " - "at least 8 bytes but was %d\n", plen); - rc = -EIO; - goto querty_exit; - } - - if (plen < le16_to_cpu(reparse_buf->ReparseDataLength) + 8) { - cifs_dbg(VFS, "srv returned invalid reparse buf " - "length: %d\n", plen); - rc = -EIO; - goto querty_exit; - } - - rc = parse_reparse_symlink( - (struct reparse_symlink_data_buffer *)reparse_buf, - plen, target_path, cifs_sb); + rc = parse_reparse_point(reparse_buf, plen, target_path, + cifs_sb); goto querty_exit; } @@ -2561,26 +2614,32 @@ smb2_query_symlink(const unsigned int xid, struct cifs_tcon *tcon, err_buf = err_iov.iov_base; if (le32_to_cpu(err_buf->ByteCount) < sizeof(struct smb2_symlink_err_rsp) || err_iov.iov_len < SMB2_SYMLINK_STRUCT_SIZE) { - rc = -ENOENT; + rc = -EINVAL; + goto querty_exit; + } + + symlink = (struct smb2_symlink_err_rsp *)err_buf->ErrorData; + if (le32_to_cpu(symlink->SymLinkErrorTag) != SYMLINK_ERROR_TAG || + le32_to_cpu(symlink->ReparseTag) != IO_REPARSE_TAG_SYMLINK) { + rc = -EINVAL; goto querty_exit; } /* open must fail on symlink - reset rc */ rc = 0; - symlink = (struct smb2_symlink_err_rsp *)err_buf->ErrorData; sub_len = le16_to_cpu(symlink->SubstituteNameLength); sub_offset = le16_to_cpu(symlink->SubstituteNameOffset); print_len = le16_to_cpu(symlink->PrintNameLength); print_offset = le16_to_cpu(symlink->PrintNameOffset); if (err_iov.iov_len < SMB2_SYMLINK_STRUCT_SIZE + sub_offset + sub_len) { - rc = -ENOENT; + rc = -EINVAL; goto querty_exit; } if (err_iov.iov_len < SMB2_SYMLINK_STRUCT_SIZE + print_offset + print_len) { - rc = -ENOENT; + rc = -EINVAL; goto querty_exit; } @@ -2606,7 +2665,6 @@ smb2_query_symlink(const unsigned int xid, struct cifs_tcon *tcon, return rc; } -#ifdef CONFIG_CIFS_ACL static struct cifs_ntsd * get_smb2_acl_by_fid(struct cifs_sb_info *cifs_sb, const struct cifs_fid *cifsfid, u32 *pacllen) @@ -2691,7 +2749,6 @@ get_smb2_acl_by_path(struct cifs_sb_info *cifs_sb, return pntsd; } -#ifdef CONFIG_CIFS_ACL static int set_smb2_acl(struct cifs_ntsd *pnntsd, __u32 acllen, struct inode *inode, const char *path, int aclflag) @@ -2749,7 +2806,6 @@ set_smb2_acl(struct cifs_ntsd *pnntsd, __u32 acllen, free_xid(xid); return rc; } -#endif /* CIFS_ACL */ /* Retrieve an ACL from the server */ static struct cifs_ntsd * @@ -2769,7 +2825,6 @@ get_smb2_acl(struct cifs_sb_info *cifs_sb, cifsFileInfo_put(open_file); return pntsd; } -#endif static long smb3_zero_range(struct file *file, struct cifs_tcon *tcon, loff_t offset, loff_t len, bool keep_size) @@ -3367,7 +3422,7 @@ smb2_dir_needs_close(struct cifsFileInfo *cfile) static void fill_transform_hdr(struct smb2_transform_hdr *tr_hdr, unsigned int orig_len, - struct smb_rqst *old_rq) + struct smb_rqst *old_rq, __le16 cipher_type) { struct smb2_sync_hdr *shdr = (struct smb2_sync_hdr *)old_rq->rq_iov[0].iov_base; @@ -3376,7 +3431,10 @@ fill_transform_hdr(struct smb2_transform_hdr *tr_hdr, unsigned int orig_len, tr_hdr->ProtocolId = SMB2_TRANSFORM_PROTO_NUM; tr_hdr->OriginalMessageSize = cpu_to_le32(orig_len); tr_hdr->Flags = cpu_to_le16(0x01); - get_random_bytes(&tr_hdr->Nonce, SMB3_AES128CMM_NONCE); + if (cipher_type == SMB2_ENCRYPTION_AES128_GCM) + get_random_bytes(&tr_hdr->Nonce, SMB3_AES128GCM_NONCE); + else + get_random_bytes(&tr_hdr->Nonce, SMB3_AES128CCM_NONCE); memcpy(&tr_hdr->SessionId, &shdr->SessionId, 8); } @@ -3534,8 +3592,13 @@ crypt_message(struct TCP_Server_Info *server, int num_rqst, rc = -ENOMEM; goto free_sg; } - iv[0] = 3; - memcpy(iv + 1, (char *)tr_hdr->Nonce, SMB3_AES128CMM_NONCE); + + if (server->cipher_type == 
SMB2_ENCRYPTION_AES128_GCM) + memcpy(iv, (char *)tr_hdr->Nonce, SMB3_AES128GCM_NONCE); + else { + iv[0] = 3; + memcpy(iv + 1, (char *)tr_hdr->Nonce, SMB3_AES128CCM_NONCE); + } aead_request_set_crypt(req, sg, sg, crypt_len, iv); aead_request_set_ad(req, assoc_data_len); @@ -3635,7 +3698,7 @@ smb3_init_transform_rq(struct TCP_Server_Info *server, int num_rqst, } /* fill the 1st iov with a transform header */ - fill_transform_hdr(tr_hdr, orig_len, old_rq); + fill_transform_hdr(tr_hdr, orig_len, old_rq, server->cipher_type); rc = crypt_message(server, num_rqst, new_rq, 1); cifs_dbg(FYI, "Encrypt message returned %d\n", rc); @@ -4284,11 +4347,9 @@ struct smb_version_operations smb20_operations = { .query_all_EAs = smb2_query_eas, .set_EA = smb2_set_ea, #endif /* CIFS_XATTR */ -#ifdef CONFIG_CIFS_ACL .get_acl = get_smb2_acl, .get_acl_by_fid = get_smb2_acl_by_fid, .set_acl = set_smb2_acl, -#endif /* CIFS_ACL */ .next_header = smb2_next_header, .ioctl_query_info = smb2_ioctl_query_info, .make_node = smb2_make_node, @@ -4385,11 +4446,9 @@ struct smb_version_operations smb21_operations = { .query_all_EAs = smb2_query_eas, .set_EA = smb2_set_ea, #endif /* CIFS_XATTR */ -#ifdef CONFIG_CIFS_ACL .get_acl = get_smb2_acl, .get_acl_by_fid = get_smb2_acl_by_fid, .set_acl = set_smb2_acl, -#endif /* CIFS_ACL */ .next_header = smb2_next_header, .ioctl_query_info = smb2_ioctl_query_info, .make_node = smb2_make_node, @@ -4495,11 +4554,9 @@ struct smb_version_operations smb30_operations = { .query_all_EAs = smb2_query_eas, .set_EA = smb2_set_ea, #endif /* CIFS_XATTR */ -#ifdef CONFIG_CIFS_ACL .get_acl = get_smb2_acl, .get_acl_by_fid = get_smb2_acl_by_fid, .set_acl = set_smb2_acl, -#endif /* CIFS_ACL */ .next_header = smb2_next_header, .ioctl_query_info = smb2_ioctl_query_info, .make_node = smb2_make_node, @@ -4606,11 +4663,9 @@ struct smb_version_operations smb311_operations = { .query_all_EAs = smb2_query_eas, .set_EA = smb2_set_ea, #endif /* CIFS_XATTR */ -#ifdef CONFIG_CIFS_ACL .get_acl = get_smb2_acl, .get_acl_by_fid = get_smb2_acl_by_fid, .set_acl = set_smb2_acl, -#endif /* CIFS_ACL */ .next_header = smb2_next_header, .ioctl_query_info = smb2_ioctl_query_info, .make_node = smb2_make_node, diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c index 75311a8a68bf..f58e4dc3987b 100644 --- a/fs/cifs/smb2pdu.c +++ b/fs/cifs/smb2pdu.c @@ -489,10 +489,25 @@ static void build_encrypt_ctxt(struct smb2_encryption_neg_context *pneg_ctxt) { pneg_ctxt->ContextType = SMB2_ENCRYPTION_CAPABILITIES; - pneg_ctxt->DataLength = cpu_to_le16(4); /* Cipher Count + le16 cipher */ - pneg_ctxt->CipherCount = cpu_to_le16(1); -/* pneg_ctxt->Ciphers[0] = SMB2_ENCRYPTION_AES128_GCM;*/ /* not supported yet */ - pneg_ctxt->Ciphers[0] = SMB2_ENCRYPTION_AES128_CCM; + pneg_ctxt->DataLength = cpu_to_le16(6); /* Cipher Count + two ciphers */ + pneg_ctxt->CipherCount = cpu_to_le16(2); + pneg_ctxt->Ciphers[0] = SMB2_ENCRYPTION_AES128_GCM; + pneg_ctxt->Ciphers[1] = SMB2_ENCRYPTION_AES128_CCM; +} + +static unsigned int +build_netname_ctxt(struct smb2_netname_neg_context *pneg_ctxt, char *hostname) +{ + struct nls_table *cp = load_nls_default(); + + pneg_ctxt->ContextType = SMB2_NETNAME_NEGOTIATE_CONTEXT_ID; + + /* copy up to max of first 100 bytes of server name to NetName field */ + pneg_ctxt->DataLength = cpu_to_le16(2 + + (2 * cifs_strtoUTF16(pneg_ctxt->NetName, hostname, 100, cp))); + /* context size is DataLength + minimal smb2_neg_context */ + return DIV_ROUND_UP(le16_to_cpu(pneg_ctxt->DataLength) + + sizeof(struct smb2_neg_context), 8) * 8; } 
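/*
 * Editorial aside, not part of the patch: the hunks above switch the
 * transform-header nonce size and the AEAD IV layout on the negotiated
 * cipher. A minimal user-space sketch of that IV construction follows;
 * smb3_demo_build_iv() and the DEMO_* constants are hypothetical
 * stand-ins, with values matching the SMB3_AES128CCM_NONCE (11) and
 * SMB3_AES128GCM_NONCE (12) constants this series renames and uses.
 */
#include <stdint.h>
#include <string.h>

#define DEMO_AES128CCM_NONCE	11
#define DEMO_AES128GCM_NONCE	12
#define DEMO_AES_BLOCK_SIZE	16

static void smb3_demo_build_iv(uint8_t iv[DEMO_AES_BLOCK_SIZE],
			       const uint8_t *nonce, int is_gcm)
{
	memset(iv, 0, DEMO_AES_BLOCK_SIZE);
	if (is_gcm) {
		/* gcm(aes) takes the 12-byte nonce as the IV directly. */
		memcpy(iv, nonce, DEMO_AES128GCM_NONCE);
	} else {
		/*
		 * ccm(aes) keeps L - 1 in the first IV octet; 3 selects a
		 * 4-octet message-length field, leaving room for 11 nonce
		 * bytes, which is why the CCM path writes iv[0] = 3 above.
		 */
		iv[0] = 3;
		memcpy(iv + 1, nonce, DEMO_AES128CCM_NONCE);
	}
}
/* End of editorial aside. */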
static void @@ -521,7 +536,7 @@ build_posix_ctxt(struct smb2_posix_neg_context *pneg_ctxt) static void assemble_neg_contexts(struct smb2_negotiate_req *req, - unsigned int *total_len) + struct TCP_Server_Info *server, unsigned int *total_len) { char *pneg_ctxt = (char *)req; unsigned int ctxt_len; @@ -551,17 +566,25 @@ assemble_neg_contexts(struct smb2_negotiate_req *req, *total_len += ctxt_len; pneg_ctxt += ctxt_len; - build_compression_ctxt((struct smb2_compression_capabilities_context *) + if (server->compress_algorithm) { + build_compression_ctxt((struct smb2_compression_capabilities_context *) pneg_ctxt); - ctxt_len = DIV_ROUND_UP( - sizeof(struct smb2_compression_capabilities_context), 8) * 8; + ctxt_len = DIV_ROUND_UP( + sizeof(struct smb2_compression_capabilities_context), + 8) * 8; + *total_len += ctxt_len; + pneg_ctxt += ctxt_len; + req->NegotiateContextCount = cpu_to_le16(5); + } else + req->NegotiateContextCount = cpu_to_le16(4); + + ctxt_len = build_netname_ctxt((struct smb2_netname_neg_context *)pneg_ctxt, + server->hostname); *total_len += ctxt_len; pneg_ctxt += ctxt_len; build_posix_ctxt((struct smb2_posix_neg_context *)pneg_ctxt); *total_len += sizeof(struct smb2_posix_neg_context); - - req->NegotiateContextCount = cpu_to_le16(4); } static void decode_preauth_context(struct smb2_preauth_neg_context *ctxt) @@ -829,7 +852,7 @@ SMB2_negotiate(const unsigned int xid, struct cifs_ses *ses) if ((ses->server->vals->protocol_id == SMB311_PROT_ID) || (strcmp(ses->server->vals->version_string, SMBDEFAULT_VERSION_STRING) == 0)) - assemble_neg_contexts(req, &total_len); + assemble_neg_contexts(req, server, &total_len); } iov[0].iov_base = (char *)req; iov[0].iov_len = total_len; @@ -2095,6 +2118,48 @@ add_twarp_context(struct kvec *iov, unsigned int *num_iovec, __u64 timewarp) return 0; } +static struct crt_query_id_ctxt * +create_query_id_buf(void) +{ + struct crt_query_id_ctxt *buf; + + buf = kzalloc(sizeof(struct crt_query_id_ctxt), GFP_KERNEL); + if (!buf) + return NULL; + + buf->ccontext.DataOffset = cpu_to_le16(0); + buf->ccontext.DataLength = cpu_to_le32(0); + buf->ccontext.NameOffset = cpu_to_le16(offsetof + (struct crt_query_id_ctxt, Name)); + buf->ccontext.NameLength = cpu_to_le16(4); + /* SMB2_CREATE_QUERY_ON_DISK_ID is "QFid" */ + buf->Name[0] = 'Q'; + buf->Name[1] = 'F'; + buf->Name[2] = 'i'; + buf->Name[3] = 'd'; + return buf; +} + +/* See MS-SMB2 2.2.13.2.9 */ +static int +add_query_id_context(struct kvec *iov, unsigned int *num_iovec) +{ + struct smb2_create_req *req = iov[0].iov_base; + unsigned int num = *num_iovec; + + iov[num].iov_base = create_query_id_buf(); + if (iov[num].iov_base == NULL) + return -ENOMEM; + iov[num].iov_len = sizeof(struct crt_query_id_ctxt); + if (!req->CreateContextsOffset) + req->CreateContextsOffset = cpu_to_le32( + sizeof(struct smb2_create_req) + + iov[num - 1].iov_len); + le32_add_cpu(&req->CreateContextsLength, sizeof(struct crt_query_id_ctxt)); + *num_iovec = num + 1; + return 0; +} + static int alloc_path_with_tree_prefix(__le16 **out_path, int *out_size, int *out_len, const char *treename, const __le16 *path) @@ -2423,6 +2488,12 @@ SMB2_open_init(struct cifs_tcon *tcon, struct smb_rqst *rqst, __u8 *oplock, return rc; } + if (n_iov > 2) { + struct create_context *ccontext = + (struct create_context *)iov[n_iov-1].iov_base; + ccontext->Next = cpu_to_le32(iov[n_iov-1].iov_len); + } + add_query_id_context(iov, &n_iov); rqst->rq_nvec = n_iov; return 0; @@ -2550,12 +2621,11 @@ SMB2_ioctl_init(struct cifs_tcon *tcon, struct smb_rqst *rqst, 
* indatalen is usually small at a couple of bytes max, so * just allocate through generic pool */ - in_data_buf = kmalloc(indatalen, GFP_NOFS); + in_data_buf = kmemdup(in_data, indatalen, GFP_NOFS); if (!in_data_buf) { cifs_small_buf_release(req); return -ENOMEM; } - memcpy(in_data_buf, in_data, indatalen); } req->CtlCode = cpu_to_le32(opcode); diff --git a/fs/cifs/smb2pdu.h b/fs/cifs/smb2pdu.h index 858353d20c39..7e2e782f8edd 100644 --- a/fs/cifs/smb2pdu.h +++ b/fs/cifs/smb2pdu.h @@ -123,7 +123,7 @@ struct smb2_sync_pdu { __le16 StructureSize2; /* size of wct area (varies, request specific) */ } __packed; -#define SMB3_AES128CMM_NONCE 11 +#define SMB3_AES128CCM_NONCE 11 #define SMB3_AES128GCM_NONCE 12 struct smb2_transform_hdr { @@ -166,6 +166,8 @@ struct smb2_err_rsp { __u8 ErrorData[1]; /* variable length */ } __packed; +#define SYMLINK_ERROR_TAG 0x4c4d5953 + struct smb2_symlink_err_rsp { __le32 SymLinkLength; __le32 SymLinkErrorTag; @@ -227,6 +229,7 @@ struct smb2_negotiate_req { } __packed; /* Dialects */ +#define SMB10_PROT_ID 0x0000 /* local only, not sent on wire w/CIFS negprot */ #define SMB20_PROT_ID 0x0202 #define SMB21_PROT_ID 0x0210 #define SMB30_PROT_ID 0x0300 @@ -293,7 +296,7 @@ struct smb2_encryption_neg_context { __le16 DataLength; __le32 Reserved; __le16 CipherCount; /* AES-128-GCM and AES-128-CCM */ - __le16 Ciphers[1]; /* Ciphers[0] since only one used now */ + __le16 Ciphers[2]; } __packed; /* See MS-SMB2 2.2.3.1.3 */ @@ -316,6 +319,12 @@ struct smb2_compression_capabilities_context { * For smb2_netname_negotiate_context_id See MS-SMB2 2.2.3.1.4. * Its struct simply contains NetName, an array of Unicode characters */ +struct smb2_netname_neg_context { + __le16 ContextType; /* 0x100 */ + __le16 DataLength; + __le32 Reserved; + __le16 NetName[0]; /* hostname of target converted to UCS-2 */ +} __packed; #define POSIX_CTXT_DATA_LEN 16 struct smb2_posix_neg_context { @@ -640,6 +649,7 @@ struct smb2_tree_disconnect_rsp { #define SMB2_CREATE_DURABLE_HANDLE_REQUEST_V2 "DH2Q" #define SMB2_CREATE_DURABLE_HANDLE_RECONNECT_V2 "DH2C" #define SMB2_CREATE_APP_INSTANCE_ID 0x45BCA66AEFA7F74A9008FA462E144D74 +#define SMB2_CREATE_APP_INSTANCE_VERSION 0xB982D0B73B56074FA07B524A8116A010 #define SVHDX_OPEN_DEVICE_CONTEX 0x9CCBCF9E04C1E643980E158DA1F6EC83 #define SMB2_CREATE_TAG_POSIX 0x93AD25509CB411E7B42383DE968BCD7C @@ -654,9 +664,10 @@ struct smb2_tree_disconnect_rsp { * [3] : durable context * [4] : posix context * [5] : time warp context - * [6] : compound padding + * [6] : query id context + * [7] : compound padding */ -#define SMB2_CREATE_IOV_SIZE 7 +#define SMB2_CREATE_IOV_SIZE 8 struct smb2_create_req { struct smb2_sync_hdr sync_hdr; @@ -680,10 +691,10 @@ struct smb2_create_req { /* * Maximum size of a SMB2_CREATE response is 64 (smb2 header) + - * 88 (fixed part of create response) + 520 (path) + 150 (contexts) + + * 88 (fixed part of create response) + 520 (path) + 208 (contexts) + * 2 bytes of padding. 
*/ -#define MAX_SMB2_CREATE_RESPONSE_SIZE 824 +#define MAX_SMB2_CREATE_RESPONSE_SIZE 880 struct smb2_create_rsp { struct smb2_sync_hdr sync_hdr; @@ -806,6 +817,13 @@ struct durable_reconnect_context_v2 { __le32 Flags; /* see above DHANDLE_FLAG_PERSISTENT */ } __packed; +/* See MS-SMB2 2.2.14.2.9 */ +struct on_disk_id { + __le64 DiskFileId; + __le64 VolumeId; + __u32 Reserved[4]; +} __packed; + /* See MS-SMB2 2.2.14.2.12 */ struct durable_reconnect_context_v2_rsp { __le32 Timeout; @@ -826,6 +844,12 @@ struct crt_twarp_ctxt { } __packed; +/* See MS-SMB2 2.2.13.2.9 */ +struct crt_query_id_ctxt { + struct create_context ccontext; + __u8 Name[8]; +} __packed; + #define COPY_CHUNK_RES_KEY_SIZE 24 struct resume_key_req { char ResumeKey[COPY_CHUNK_RES_KEY_SIZE]; diff --git a/fs/cifs/smb2transport.c b/fs/cifs/smb2transport.c index d1181572758b..1ccbcf9c2c3b 100644 --- a/fs/cifs/smb2transport.c +++ b/fs/cifs/smb2transport.c @@ -734,7 +734,10 @@ smb3_crypto_aead_allocate(struct TCP_Server_Info *server) struct crypto_aead *tfm; if (!server->secmech.ccmaesencrypt) { - tfm = crypto_alloc_aead("ccm(aes)", 0, 0); + if (server->cipher_type == SMB2_ENCRYPTION_AES128_GCM) + tfm = crypto_alloc_aead("gcm(aes)", 0, 0); + else + tfm = crypto_alloc_aead("ccm(aes)", 0, 0); if (IS_ERR(tfm)) { cifs_dbg(VFS, "%s: Failed to alloc encrypt aead\n", __func__); @@ -744,7 +747,10 @@ smb3_crypto_aead_allocate(struct TCP_Server_Info *server) } if (!server->secmech.ccmaesdecrypt) { - tfm = crypto_alloc_aead("ccm(aes)", 0, 0); + if (server->cipher_type == SMB2_ENCRYPTION_AES128_GCM) + tfm = crypto_alloc_aead("gcm(aes)", 0, 0); + else + tfm = crypto_alloc_aead("ccm(aes)", 0, 0); if (IS_ERR(tfm)) { crypto_free_aead(server->secmech.ccmaesencrypt); server->secmech.ccmaesencrypt = NULL; diff --git a/fs/cifs/transport.c b/fs/cifs/transport.c index 60661b3f983a..5d6d44bfe10a 100644 --- a/fs/cifs/transport.c +++ b/fs/cifs/transport.c @@ -979,6 +979,7 @@ compound_send_recv(const unsigned int xid, struct cifs_ses *ses, }; unsigned int instance; char *buf; + struct TCP_Server_Info *server; optype = flags & CIFS_OP_MASK; @@ -990,7 +991,8 @@ compound_send_recv(const unsigned int xid, struct cifs_ses *ses, return -EIO; } - if (ses->server->tcpStatus == CifsExiting) + server = ses->server; + if (server->tcpStatus == CifsExiting) return -ENOENT; /* @@ -1001,7 +1003,7 @@ compound_send_recv(const unsigned int xid, struct cifs_ses *ses, * other requests. * This can be handled by the eventual session reconnect. */ - rc = wait_for_compound_request(ses->server, num_rqst, flags, + rc = wait_for_compound_request(server, num_rqst, flags, &instance); if (rc) return rc; @@ -1017,7 +1019,7 @@ compound_send_recv(const unsigned int xid, struct cifs_ses *ses, * of smb data. */ - mutex_lock(&ses->server->srv_mutex); + mutex_lock(&server->srv_mutex); /* * All the parts of the compound chain belong obtained credits from the @@ -1026,24 +1028,24 @@ compound_send_recv(const unsigned int xid, struct cifs_ses *ses, * we obtained credits and return -EAGAIN in such cases to let callers * handle it. 
*/ - if (instance != ses->server->reconnect_instance) { - mutex_unlock(&ses->server->srv_mutex); + if (instance != server->reconnect_instance) { + mutex_unlock(&server->srv_mutex); for (j = 0; j < num_rqst; j++) - add_credits(ses->server, &credits[j], optype); + add_credits(server, &credits[j], optype); return -EAGAIN; } for (i = 0; i < num_rqst; i++) { - midQ[i] = ses->server->ops->setup_request(ses, &rqst[i]); + midQ[i] = server->ops->setup_request(ses, &rqst[i]); if (IS_ERR(midQ[i])) { - revert_current_mid(ses->server, i); + revert_current_mid(server, i); for (j = 0; j < i; j++) cifs_delete_mid(midQ[j]); - mutex_unlock(&ses->server->srv_mutex); + mutex_unlock(&server->srv_mutex); /* Update # of requests on wire to server */ for (j = 0; j < num_rqst; j++) - add_credits(ses->server, &credits[j], optype); + add_credits(server, &credits[j], optype); return PTR_ERR(midQ[i]); } @@ -1059,19 +1061,19 @@ compound_send_recv(const unsigned int xid, struct cifs_ses *ses, else midQ[i]->callback = cifs_compound_last_callback; } - cifs_in_send_inc(ses->server); - rc = smb_send_rqst(ses->server, num_rqst, rqst, flags); - cifs_in_send_dec(ses->server); + cifs_in_send_inc(server); + rc = smb_send_rqst(server, num_rqst, rqst, flags); + cifs_in_send_dec(server); for (i = 0; i < num_rqst; i++) cifs_save_when_sent(midQ[i]); if (rc < 0) { - revert_current_mid(ses->server, num_rqst); - ses->server->sequence_number -= 2; + revert_current_mid(server, num_rqst); + server->sequence_number -= 2; } - mutex_unlock(&ses->server->srv_mutex); + mutex_unlock(&server->srv_mutex); /* * If sending failed for some reason or it is an oplock break that we @@ -1079,7 +1081,7 @@ compound_send_recv(const unsigned int xid, struct cifs_ses *ses, */ if (rc < 0 || (flags & CIFS_NO_SRV_RSP)) { for (i = 0; i < num_rqst; i++) - add_credits(ses->server, &credits[i], optype); + add_credits(server, &credits[i], optype); goto out; } @@ -1099,7 +1101,7 @@ compound_send_recv(const unsigned int xid, struct cifs_ses *ses, rqst[0].rq_nvec); for (i = 0; i < num_rqst; i++) { - rc = wait_for_response(ses->server, midQ[i]); + rc = wait_for_response(server, midQ[i]); if (rc != 0) break; } @@ -1107,7 +1109,7 @@ compound_send_recv(const unsigned int xid, struct cifs_ses *ses, for (; i < num_rqst; i++) { cifs_dbg(VFS, "Cancelling wait for mid %llu cmd: %d\n", midQ[i]->mid, le16_to_cpu(midQ[i]->command)); - send_cancel(ses->server, &rqst[i], midQ[i]); + send_cancel(server, &rqst[i], midQ[i]); spin_lock(&GlobalMid_Lock); if (midQ[i]->mid_state == MID_REQUEST_SUBMITTED) { midQ[i]->mid_flags |= MID_WAIT_CANCELLED; @@ -1123,7 +1125,7 @@ compound_send_recv(const unsigned int xid, struct cifs_ses *ses, if (rc < 0) goto out; - rc = cifs_sync_mid_result(midQ[i], ses->server); + rc = cifs_sync_mid_result(midQ[i], server); if (rc != 0) { /* mark this mid as cancelled to not free it below */ cancelled_mid[i] = true; @@ -1140,14 +1142,14 @@ compound_send_recv(const unsigned int xid, struct cifs_ses *ses, buf = (char *)midQ[i]->resp_buf; resp_iov[i].iov_base = buf; resp_iov[i].iov_len = midQ[i]->resp_buf_size + - ses->server->vals->header_preamble_size; + server->vals->header_preamble_size; if (midQ[i]->large_buf) resp_buf_type[i] = CIFS_LARGE_BUFFER; else resp_buf_type[i] = CIFS_SMALL_BUFFER; - rc = ses->server->ops->check_receive(midQ[i], ses->server, + rc = server->ops->check_receive(midQ[i], server, flags & CIFS_LOG_ERROR); /* mark it so buf will not be freed by cifs_delete_mid */ diff --git a/fs/cifs/xattr.c b/fs/cifs/xattr.c index 50ddb795aaeb..9076150758d8 
100644 --- a/fs/cifs/xattr.c +++ b/fs/cifs/xattr.c @@ -96,7 +96,6 @@ static int cifs_xattr_set(const struct xattr_handler *handler, break; case XATTR_CIFS_ACL: { -#ifdef CONFIG_CIFS_ACL struct cifs_ntsd *pacl; if (!value) @@ -117,7 +116,6 @@ static int cifs_xattr_set(const struct xattr_handler *handler, CIFS_I(inode)->time = 0; kfree(pacl); } -#endif /* CONFIG_CIFS_ACL */ break; } @@ -247,7 +245,6 @@ static int cifs_xattr_get(const struct xattr_handler *handler, break; case XATTR_CIFS_ACL: { -#ifdef CONFIG_CIFS_ACL u32 acllen; struct cifs_ntsd *pacl; @@ -270,7 +267,6 @@ static int cifs_xattr_get(const struct xattr_handler *handler, rc = acllen; kfree(pacl); } -#endif /* CONFIG_CIFS_ACL */ break; } diff --git a/fs/dax.c b/fs/dax.c --- a/fs/dax.c +++ b/fs/dax.c @@ -124,6 +124,15 @@ static int dax_is_empty_entry(void *entry) } /* + * true if the entry that was found is of a smaller order than the entry + * we were looking for + */ +static bool dax_is_conflict(void *entry) +{ + return entry == XA_RETRY_ENTRY; +} + +/* * DAX page cache entry locking */ struct exceptional_entry_key { @@ -195,11 +204,13 @@ static void dax_wake_entry(struct xa_state *xas, void *entry, bool wake_all) * Look up entry in page cache, wait for it to become unlocked if it * is a DAX entry and return it. The caller must subsequently call * put_unlocked_entry() if it did not lock the entry or dax_unlock_entry() - * if it did. + * if it did. The entry returned may have a larger order than @order. + * If @order is larger than the order of the entry found in i_pages, this + * function returns a dax_is_conflict entry. * * Must be called with the i_pages lock held. */ -static void *get_unlocked_entry(struct xa_state *xas) +static void *get_unlocked_entry(struct xa_state *xas, unsigned int order) { void *entry; struct wait_exceptional_entry_queue ewait; @@ -210,6 +221,8 @@ static void *get_unlocked_entry(struct xa_state *xas) for (;;) { entry = xas_find_conflict(xas); + if (dax_entry_order(entry) < order) + return XA_RETRY_ENTRY; if (!entry || WARN_ON_ONCE(!xa_is_value(entry)) || !dax_is_locked(entry)) return entry; @@ -254,7 +267,7 @@ static void wait_entry_unlocked(struct xa_state *xas, void *entry) static void put_unlocked_entry(struct xa_state *xas, void *entry) { /* If we were the only waiter woken, wake the next one */ - if (entry) + if (entry && !dax_is_conflict(entry)) dax_wake_entry(xas, entry, false); } @@ -461,7 +474,7 @@ void dax_unlock_page(struct page *page, dax_entry_t cookie) * overlap with xarray value entries. */ static void *grab_mapping_entry(struct xa_state *xas, - struct address_space *mapping, unsigned long size_flag) + struct address_space *mapping, unsigned int order) { unsigned long index = xas->xa_index; bool pmd_downgrade = false; /* splitting PMD entry into PTE entries? 
*/ @@ -469,20 +482,17 @@ static void *grab_mapping_entry(struct xa_state *xas, retry: xas_lock_irq(xas); - entry = get_unlocked_entry(xas); + entry = get_unlocked_entry(xas, order); if (entry) { + if (dax_is_conflict(entry)) + goto fallback; if (!xa_is_value(entry)) { xas_set_err(xas, EIO); goto out_unlock; } - if (size_flag & DAX_PMD) { - if (dax_is_pte_entry(entry)) { - put_unlocked_entry(xas, entry); - goto fallback; - } - } else { /* trying to grab a PTE entry */ + if (order == 0) { if (dax_is_pmd_entry(entry) && (dax_is_zero_entry(entry) || dax_is_empty_entry(entry))) { @@ -523,7 +533,11 @@ retry: if (entry) { dax_lock_entry(xas, entry); } else { - entry = dax_make_entry(pfn_to_pfn_t(0), size_flag | DAX_EMPTY); + unsigned long flags = DAX_EMPTY; + + if (order > 0) + flags |= DAX_PMD; + entry = dax_make_entry(pfn_to_pfn_t(0), flags); dax_lock_entry(xas, entry); if (xas_error(xas)) goto out_unlock; @@ -594,7 +608,7 @@ struct page *dax_layout_busy_page(struct address_space *mapping) if (WARN_ON_ONCE(!xa_is_value(entry))) continue; if (unlikely(dax_is_locked(entry))) - entry = get_unlocked_entry(&xas); + entry = get_unlocked_entry(&xas, 0); if (entry) page = dax_busy_page(entry); put_unlocked_entry(&xas, entry); @@ -621,7 +635,7 @@ static int __dax_invalidate_entry(struct address_space *mapping, void *entry; xas_lock_irq(&xas); - entry = get_unlocked_entry(&xas); + entry = get_unlocked_entry(&xas, 0); if (!entry || WARN_ON_ONCE(!xa_is_value(entry))) goto out; if (!trunc && @@ -848,7 +862,7 @@ static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev, if (unlikely(dax_is_locked(entry))) { void *old_entry = entry; - entry = get_unlocked_entry(xas); + entry = get_unlocked_entry(xas, 0); /* Entry got punched out / reallocated? */ if (!entry || WARN_ON_ONCE(!xa_is_value(entry))) @@ -1509,7 +1523,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp, * entry is already in the array, for instance), it will return * VM_FAULT_FALLBACK. */ - entry = grab_mapping_entry(&xas, mapping, DAX_PMD); + entry = grab_mapping_entry(&xas, mapping, PMD_ORDER); if (xa_is_internal(entry)) { result = xa_to_internal(entry); goto fallback; @@ -1658,11 +1672,10 @@ dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, unsigned int order) vm_fault_t ret; xas_lock_irq(&xas); - entry = get_unlocked_entry(&xas); + entry = get_unlocked_entry(&xas, order); /* Did we race with someone splitting entry or so? */ - if (!entry || - (order == 0 && !dax_is_pte_entry(entry)) || - (order == PMD_ORDER && !dax_is_pmd_entry(entry))) { + if (!entry || dax_is_conflict(entry) || + (order == 0 && !dax_is_pte_entry(entry))) { put_unlocked_entry(&xas, entry); xas_unlock_irq(&xas); trace_dax_insert_pfn_mkwrite_no_entry(mapping->host, vmf, diff --git a/fs/ext4/file.c b/fs/ext4/file.c index f4a24a46245e..70b0438dbc94 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -371,15 +371,17 @@ static const struct vm_operations_struct ext4_file_vm_ops = { static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma) { struct inode *inode = file->f_mapping->host; + struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); + struct dax_device *dax_dev = sbi->s_daxdev; - if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb)))) + if (unlikely(ext4_forced_shutdown(sbi))) return -EIO; /* - * We don't support synchronous mappings for non-DAX files. At least - * until someone comes with a sensible use case. 
+ * We don't support synchronous mappings for non-DAX files and + * for DAX files if underneath dax_device is not synchronous. */ - if (!IS_DAX(file_inode(file)) && (vma->vm_flags & VM_SYNC)) + if (!daxdev_mapping_supported(vma, dax_dev)) return -EOPNOTSUPP; file_accessed(file); diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile index c587e3c4c6a6..34cdeaecccf6 100644 --- a/fs/nfs/Makefile +++ b/fs/nfs/Makefile @@ -8,7 +8,8 @@ obj-$(CONFIG_NFS_FS) += nfs.o CFLAGS_nfstrace.o += -I$(src) nfs-y := client.o dir.o file.o getroot.o inode.o super.o \ io.o direct.o pagelist.o read.o symlink.o unlink.o \ - write.o namespace.o mount_clnt.o nfstrace.o export.o + write.o namespace.o mount_clnt.o nfstrace.o \ + export.o sysfs.o nfs-$(CONFIG_ROOT_NFS) += nfsroot.o nfs-$(CONFIG_SYSCTL) += sysctl.o nfs-$(CONFIG_NFS_FSCACHE) += fscache.o fscache-index.o diff --git a/fs/nfs/callback_proc.c b/fs/nfs/callback_proc.c index 315967354954..f39924ba050b 100644 --- a/fs/nfs/callback_proc.c +++ b/fs/nfs/callback_proc.c @@ -414,27 +414,39 @@ static __be32 validate_seqid(const struct nfs4_slot_table *tbl, const struct nfs4_slot *slot, const struct cb_sequenceargs * args) { + __be32 ret; + + ret = cpu_to_be32(NFS4ERR_BADSLOT); if (args->csa_slotid > tbl->server_highest_slotid) - return htonl(NFS4ERR_BADSLOT); + goto out_err; /* Replay */ if (args->csa_sequenceid == slot->seq_nr) { + ret = cpu_to_be32(NFS4ERR_DELAY); if (nfs4_test_locked_slot(tbl, slot->slot_nr)) - return htonl(NFS4ERR_DELAY); + goto out_err; + /* Signal process_op to set this error on next op */ + ret = cpu_to_be32(NFS4ERR_RETRY_UNCACHED_REP); if (args->csa_cachethis == 0) - return htonl(NFS4ERR_RETRY_UNCACHED_REP); + goto out_err; /* Liar! We never allowed you to set csa_cachethis != 0 */ - return htonl(NFS4ERR_SEQ_FALSE_RETRY); + ret = cpu_to_be32(NFS4ERR_SEQ_FALSE_RETRY); + goto out_err; } /* Note: wraparound relies on seq_nr being of type u32 */ - if (likely(args->csa_sequenceid == slot->seq_nr + 1)) - return htonl(NFS4_OK); - /* Misordered request */ - return htonl(NFS4ERR_SEQ_MISORDERED); + ret = cpu_to_be32(NFS4ERR_SEQ_MISORDERED); + if (args->csa_sequenceid != slot->seq_nr + 1) + goto out_err; + + return cpu_to_be32(NFS4_OK); + +out_err: + trace_nfs4_cb_seqid_err(args, ret); + return ret; } /* diff --git a/fs/nfs/client.c b/fs/nfs/client.c index d7e4f0848e28..30838304a0bf 100644 --- a/fs/nfs/client.c +++ b/fs/nfs/client.c @@ -49,6 +49,7 @@ #include "pnfs.h" #include "nfs.h" #include "netns.h" +#include "sysfs.h" #define NFSDBG_FACILITY NFSDBG_CLIENT @@ -175,6 +176,7 @@ struct nfs_client *nfs_alloc_client(const struct nfs_client_initdata *cl_init) clp->cl_rpcclient = ERR_PTR(-EINVAL); clp->cl_proto = cl_init->proto; + clp->cl_nconnect = cl_init->nconnect; clp->cl_net = get_net(cl_init->net); clp->cl_principal = "*"; @@ -192,7 +194,7 @@ error_0: EXPORT_SYMBOL_GPL(nfs_alloc_client); #if IS_ENABLED(CONFIG_NFS_V4) -void nfs_cleanup_cb_ident_idr(struct net *net) +static void nfs_cleanup_cb_ident_idr(struct net *net) { struct nfs_net *nn = net_generic(net, nfs_net_id); @@ -214,7 +216,7 @@ static void pnfs_init_server(struct nfs_server *server) } #else -void nfs_cleanup_cb_ident_idr(struct net *net) +static void nfs_cleanup_cb_ident_idr(struct net *net) { } @@ -406,10 +408,10 @@ struct nfs_client *nfs_get_client(const struct nfs_client_initdata *cl_init) clp = nfs_match_client(cl_init); if (clp) { spin_unlock(&nn->nfs_client_lock); - if (IS_ERR(clp)) - return clp; if (new) new->rpc_ops->free_client(new); + if (IS_ERR(clp)) + return clp; return 
nfs_found_client(cl_init, clp); } if (new) { @@ -493,6 +495,7 @@ int nfs_create_rpc_client(struct nfs_client *clp, struct rpc_create_args args = { .net = clp->cl_net, .protocol = clp->cl_proto, + .nconnect = clp->cl_nconnect, .address = (struct sockaddr *)&clp->cl_addr, .addrsize = clp->cl_addrlen, .timeout = cl_init->timeparms, @@ -658,6 +661,7 @@ static int nfs_init_server(struct nfs_server *server, .net = data->net, .timeparms = &timeparms, .cred = server->cred, + .nconnect = data->nfs_server.nconnect, }; struct nfs_client *clp; int error; @@ -1072,6 +1076,18 @@ void nfs_clients_init(struct net *net) #endif spin_lock_init(&nn->nfs_client_lock); nn->boot_time = ktime_get_real(); + + nfs_netns_sysfs_setup(nn, net); +} + +void nfs_clients_exit(struct net *net) +{ + struct nfs_net *nn = net_generic(net, nfs_net_id); + + nfs_netns_sysfs_destroy(nn); + nfs_cleanup_cb_ident_idr(net); + WARN_ON_ONCE(!list_empty(&nn->nfs_client_list)); + WARN_ON_ONCE(!list_empty(&nn->nfs_volume_list)); } #ifdef CONFIG_PROC_FS diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index 57b6a45576ad..8d501093660f 100644 --- a/fs/nfs/dir.c +++ b/fs/nfs/dir.c @@ -80,6 +80,10 @@ static struct nfs_open_dir_context *alloc_nfs_open_dir_context(struct inode *dir ctx->dup_cookie = 0; ctx->cred = get_cred(cred); spin_lock(&dir->i_lock); + if (list_empty(&nfsi->open_files) && + (nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER)) + nfsi->cache_validity |= NFS_INO_INVALID_DATA | + NFS_INO_REVAL_FORCED; list_add(&ctx->list, &nfsi->open_files); spin_unlock(&dir->i_lock); return ctx; @@ -140,19 +144,12 @@ struct nfs_cache_array { struct nfs_cache_array_entry array[0]; }; -struct readdirvec { - unsigned long nr; - unsigned long index; - struct page *pages[NFS_MAX_READDIR_RAPAGES]; -}; - typedef int (*decode_dirent_t)(struct xdr_stream *, struct nfs_entry *, bool); typedef struct { struct file *file; struct page *page; struct dir_context *ctx; unsigned long page_index; - struct readdirvec pvec; u64 *dir_cookie; u64 last_cookie; loff_t current_index; @@ -532,10 +529,6 @@ int nfs_readdir_page_filler(nfs_readdir_descriptor_t *desc, struct nfs_entry *en struct nfs_cache_array *array; unsigned int count = 0; int status; - int max_rapages = NFS_MAX_READDIR_RAPAGES; - - desc->pvec.index = desc->page_index; - desc->pvec.nr = 0; scratch = alloc_page(GFP_KERNEL); if (scratch == NULL) @@ -560,40 +553,20 @@ int nfs_readdir_page_filler(nfs_readdir_descriptor_t *desc, struct nfs_entry *en if (desc->plus) nfs_prime_dcache(file_dentry(desc->file), entry); - status = nfs_readdir_add_to_array(entry, desc->pvec.pages[desc->pvec.nr]); - if (status == -ENOSPC) { - desc->pvec.nr++; - if (desc->pvec.nr == max_rapages) - break; - status = nfs_readdir_add_to_array(entry, desc->pvec.pages[desc->pvec.nr]); - } + status = nfs_readdir_add_to_array(entry, page); if (status != 0) break; } while (!entry->eof); - /* - * page and desc->pvec.pages[0] are valid, don't need to check - * whether or not to be NULL. - */ - copy_highpage(page, desc->pvec.pages[0]); - out_nopages: if (count == 0 || (status == -EBADCOOKIE && entry->eof != 0)) { - array = kmap_atomic(desc->pvec.pages[desc->pvec.nr]); + array = kmap(page); array->eof_index = array->size; status = 0; - kunmap_atomic(array); + kunmap(page); } put_page(scratch); - - /* - * desc->pvec.nr > 0 means at least one page was completely filled, - * we should return -ENOSPC. Otherwise function - * nfs_readdir_xdr_to_array will enter infinite loop. 
- */ - if (desc->pvec.nr > 0) - return -ENOSPC; return status; } @@ -627,24 +600,6 @@ out_freepages: return -ENOMEM; } -/* - * nfs_readdir_rapages_init initialize rapages by nfs_cache_array structure. - */ -static -void nfs_readdir_rapages_init(nfs_readdir_descriptor_t *desc) -{ - struct nfs_cache_array *array; - int max_rapages = NFS_MAX_READDIR_RAPAGES; - int index; - - for (index = 0; index < max_rapages; index++) { - array = kmap_atomic(desc->pvec.pages[index]); - memset(array, 0, sizeof(struct nfs_cache_array)); - array->eof_index = -1; - kunmap_atomic(array); - } -} - static int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page, struct inode *inode) { @@ -655,12 +610,6 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page, int status = -ENOMEM; unsigned int array_size = ARRAY_SIZE(pages); - /* - * This means we hit readdir rdpages miss, the preallocated rdpages - * are useless, the preallocate rdpages should be reinitialized. - */ - nfs_readdir_rapages_init(desc); - entry.prev_cookie = 0; entry.cookie = desc->last_cookie; entry.eof = 0; @@ -721,24 +670,9 @@ int nfs_readdir_filler(void *data, struct page* page) struct inode *inode = file_inode(desc->file); int ret; - /* - * If desc->page_index in range desc->pvec.index and - * desc->pvec.index + desc->pvec.nr, we get readdir cache hit. - */ - if (desc->page_index >= desc->pvec.index && - desc->page_index < (desc->pvec.index + desc->pvec.nr)) { - /* - * page and desc->pvec.pages[x] are valid, don't need to check - * whether or not to be NULL. - */ - copy_highpage(page, desc->pvec.pages[desc->page_index - desc->pvec.index]); - ret = 0; - } else { - ret = nfs_readdir_xdr_to_array(desc, page, inode); - if (ret < 0) - goto error; - } - + ret = nfs_readdir_xdr_to_array(desc, page, inode); + if (ret < 0) + goto error; SetPageUptodate(page); if (invalidate_inode_pages2_range(inode->i_mapping, page->index + 1, -1) < 0) { @@ -903,7 +837,6 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx) *desc = &my_desc; struct nfs_open_dir_context *dir_ctx = file->private_data; int res = 0; - int max_rapages = NFS_MAX_READDIR_RAPAGES; dfprintk(FILE, "NFS: readdir(%pD2) starting at cookie %llu\n", file, (long long)ctx->pos); @@ -923,12 +856,6 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx) desc->decode = NFS_PROTO(inode)->decode_dirent; desc->plus = nfs_use_readdirplus(inode, ctx); - res = nfs_readdir_alloc_pages(desc->pvec.pages, max_rapages); - if (res < 0) - return -ENOMEM; - - nfs_readdir_rapages_init(desc); - if (ctx->pos == 0 || nfs_attribute_cache_expired(inode)) res = nfs_revalidate_mapping(inode, file->f_mapping); if (res < 0) @@ -964,7 +891,6 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx) break; } while (!desc->eof); out: - nfs_readdir_free_pages(desc->pvec.pages, max_rapages); if (res > 0) res = 0; dfprintk(FILE, "NFS: readdir(%pD2) returns %d\n", file, res); diff --git a/fs/nfs/flexfilelayout/flexfilelayout.c b/fs/nfs/flexfilelayout/flexfilelayout.c index bcff3bf5ae09..b04e20d28162 100644 --- a/fs/nfs/flexfilelayout/flexfilelayout.c +++ b/fs/nfs/flexfilelayout/flexfilelayout.c @@ -934,6 +934,10 @@ out_nolseg: if (pgio->pg_error < 0) return; out_mds: + trace_pnfs_mds_fallback_pg_init_read(pgio->pg_inode, + 0, NFS4_MAX_UINT64, IOMODE_READ, + NFS_I(pgio->pg_inode)->layout, + pgio->pg_lseg); pnfs_put_lseg(pgio->pg_lseg); pgio->pg_lseg = NULL; nfs_pageio_reset_read_mds(pgio); @@ -1000,6 +1004,10 @@ retry: return; out_mds: + 
trace_pnfs_mds_fallback_pg_init_write(pgio->pg_inode, + 0, NFS4_MAX_UINT64, IOMODE_RW, + NFS_I(pgio->pg_inode)->layout, + pgio->pg_lseg); pnfs_put_lseg(pgio->pg_lseg); pgio->pg_lseg = NULL; nfs_pageio_reset_write_mds(pgio); @@ -1026,6 +1034,10 @@ ff_layout_pg_get_mirror_count_write(struct nfs_pageio_descriptor *pgio, if (pgio->pg_lseg) return FF_LAYOUT_MIRROR_COUNT(pgio->pg_lseg); + trace_pnfs_mds_fallback_pg_get_mirror_count(pgio->pg_inode, + 0, NFS4_MAX_UINT64, IOMODE_RW, + NFS_I(pgio->pg_inode)->layout, + pgio->pg_lseg); /* no lseg means that pnfs is not in use, so no mirroring here */ nfs_pageio_reset_write_mds(pgio); out: @@ -1075,6 +1087,10 @@ static void ff_layout_reset_write(struct nfs_pgio_header *hdr, bool retry_pnfs) hdr->args.count, (unsigned long long)hdr->args.offset); + trace_pnfs_mds_fallback_write_done(hdr->inode, + hdr->args.offset, hdr->args.count, + IOMODE_RW, NFS_I(hdr->inode)->layout, + hdr->lseg); task->tk_status = pnfs_write_done_resend_to_mds(hdr); } } @@ -1094,6 +1110,10 @@ static void ff_layout_reset_read(struct nfs_pgio_header *hdr) hdr->args.count, (unsigned long long)hdr->args.offset); + trace_pnfs_mds_fallback_read_done(hdr->inode, + hdr->args.offset, hdr->args.count, + IOMODE_READ, NFS_I(hdr->inode)->layout, + hdr->lseg); task->tk_status = pnfs_read_done_resend_to_mds(hdr); } } @@ -1827,6 +1847,9 @@ ff_layout_read_pagelist(struct nfs_pgio_header *hdr) out_failed: if (ff_layout_avoid_mds_available_ds(lseg)) return PNFS_TRY_AGAIN; + trace_pnfs_mds_fallback_read_pagelist(hdr->inode, + hdr->args.offset, hdr->args.count, + IOMODE_READ, NFS_I(hdr->inode)->layout, lseg); return PNFS_NOT_ATTEMPTED; } @@ -1892,6 +1915,9 @@ ff_layout_write_pagelist(struct nfs_pgio_header *hdr, int sync) out_failed: if (ff_layout_avoid_mds_available_ds(lseg)) return PNFS_TRY_AGAIN; + trace_pnfs_mds_fallback_write_pagelist(hdr->inode, + hdr->args.offset, hdr->args.count, + IOMODE_RW, NFS_I(hdr->inode)->layout, lseg); return PNFS_NOT_ATTEMPTED; } diff --git a/fs/nfs/flexfilelayout/flexfilelayoutdev.c b/fs/nfs/flexfilelayout/flexfilelayoutdev.c index 19f856f45689..3eda40a320a5 100644 --- a/fs/nfs/flexfilelayout/flexfilelayoutdev.c +++ b/fs/nfs/flexfilelayout/flexfilelayoutdev.c @@ -257,7 +257,7 @@ int ff_layout_track_ds_error(struct nfs4_flexfile_layout *flo, if (status == 0) return 0; - if (mirror->mirror_ds == NULL) + if (IS_ERR_OR_NULL(mirror->mirror_ds)) return -EINVAL; dserr = kmalloc(sizeof(*dserr), gfp_flags); diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c index 0b4a1a974411..8a1758200b57 100644 --- a/fs/nfs/inode.c +++ b/fs/nfs/inode.c @@ -51,6 +51,7 @@ #include "pnfs.h" #include "nfs.h" #include "netns.h" +#include "sysfs.h" #include "nfstrace.h" @@ -208,7 +209,7 @@ static void nfs_set_cache_invalid(struct inode *inode, unsigned long flags) } if (inode->i_mapping->nrpages == 0) - flags &= ~NFS_INO_INVALID_DATA; + flags &= ~(NFS_INO_INVALID_DATA|NFS_INO_DATA_INVAL_DEFER); nfsi->cache_validity |= flags; if (flags & NFS_INO_INVALID_DATA) nfs_fscache_invalidate(inode); @@ -652,7 +653,8 @@ static int nfs_vmtruncate(struct inode * inode, loff_t offset) i_size_write(inode, offset); /* Optimisation */ if (offset == 0) - NFS_I(inode)->cache_validity &= ~NFS_INO_INVALID_DATA; + NFS_I(inode)->cache_validity &= ~(NFS_INO_INVALID_DATA | + NFS_INO_DATA_INVAL_DEFER); NFS_I(inode)->cache_validity &= ~NFS_INO_INVALID_SIZE; spin_unlock(&inode->i_lock); @@ -1032,6 +1034,10 @@ void nfs_inode_attach_open_context(struct nfs_open_context *ctx) struct nfs_inode *nfsi = NFS_I(inode); 
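/*
 * Editorial aside, not part of the patch: the NFS_INO_DATA_INVAL_DEFER
 * hunks in this file implement a two-step invalidation. A reduced
 * sketch follows; the struct and helpers are hypothetical stand-ins,
 * only the flag roles mirror the kernel's, and the kernel additionally
 * invalidates immediately whenever it safely can.
 */
enum {
	DEMO_INVALID_DATA	= 1 << 0,
	DEMO_DATA_INVAL_DEFER	= 1 << 1,
	DEMO_REVAL_FORCED	= 1 << 2,
};

struct demo_inode {
	unsigned long cache_validity;
	int nr_open_files;
};

/*
 * The change attribute moved on the server while we hold no delegation:
 * rather than dropping the pagecache under an active user, only record
 * that revalidation is owed.
 */
static void demo_remote_change(struct demo_inode *di, int have_delegation)
{
	if (!have_delegation)
		di->cache_validity |= DEMO_DATA_INVAL_DEFER;
}

/*
 * The next opener of a then-unopened file pays the debt, which is what
 * the surrounding hunk does when the open_files list is empty.
 */
static void demo_attach_open(struct demo_inode *di)
{
	if (di->nr_open_files++ == 0 &&
	    (di->cache_validity & DEMO_DATA_INVAL_DEFER))
		di->cache_validity |= DEMO_INVALID_DATA | DEMO_REVAL_FORCED;
}
/* End of editorial aside. */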
spin_lock(&inode->i_lock); + if (list_empty(&nfsi->open_files) && + (nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER)) + nfsi->cache_validity |= NFS_INO_INVALID_DATA | + NFS_INO_REVAL_FORCED; list_add_tail_rcu(&ctx->list, &nfsi->open_files); spin_unlock(&inode->i_lock); } @@ -1100,6 +1106,7 @@ int nfs_open(struct inode *inode, struct file *filp) nfs_fscache_open_file(inode, filp); return 0; } +EXPORT_SYMBOL_GPL(nfs_open); /* * This function is called whenever some part of NFS notices that @@ -1312,7 +1319,8 @@ int nfs_revalidate_mapping(struct inode *inode, set_bit(NFS_INO_INVALIDATING, bitlock); smp_wmb(); - nfsi->cache_validity &= ~NFS_INO_INVALID_DATA; + nfsi->cache_validity &= ~(NFS_INO_INVALID_DATA| + NFS_INO_DATA_INVAL_DEFER); spin_unlock(&inode->i_lock); trace_nfs_invalidate_mapping_enter(inode); ret = nfs_invalidate_mapping(inode, mapping); @@ -1870,7 +1878,8 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr) dprintk("NFS: change_attr change on server for file %s/%ld\n", inode->i_sb->s_id, inode->i_ino); - } + } else if (!have_delegation) + nfsi->cache_validity |= NFS_INO_DATA_INVAL_DEFER; inode_set_iversion_raw(inode, fattr->change_attr); attr_changed = true; } @@ -2159,12 +2168,8 @@ static int nfs_net_init(struct net *net) static void nfs_net_exit(struct net *net) { - struct nfs_net *nn = net_generic(net, nfs_net_id); - nfs_fs_proc_net_exit(net); - nfs_cleanup_cb_ident_idr(net); - WARN_ON_ONCE(!list_empty(&nn->nfs_client_list)); - WARN_ON_ONCE(!list_empty(&nn->nfs_volume_list)); + nfs_clients_exit(net); } static struct pernet_operations nfs_net_ops = { @@ -2181,6 +2186,10 @@ static int __init init_nfs_fs(void) { int err; + err = nfs_sysfs_init(); + if (err < 0) + goto out10; + err = register_pernet_subsys(&nfs_net_ops); if (err < 0) goto out9; @@ -2244,6 +2253,8 @@ out7: out8: unregister_pernet_subsys(&nfs_net_ops); out9: + nfs_sysfs_exit(); +out10: return err; } @@ -2260,6 +2271,7 @@ static void __exit exit_nfs_fs(void) unregister_nfs_fs(); nfs_fs_proc_exit(); nfsiod_stop(); + nfs_sysfs_exit(); } /* Not quite true; I just maintain it */ diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h index 498fab72f70b..a2346a2f8361 100644 --- a/fs/nfs/internal.h +++ b/fs/nfs/internal.h @@ -69,8 +69,7 @@ struct nfs_clone_mount { * Maximum number of pages that readdir can use for creating * a vmapped array of pages. 
*/ -#define NFS_MAX_READDIR_PAGES 64 -#define NFS_MAX_READDIR_RAPAGES 8 +#define NFS_MAX_READDIR_PAGES 8 struct nfs_client_initdata { unsigned long init_flags; @@ -82,6 +81,7 @@ struct nfs_client_initdata { struct nfs_subversion *nfs_mod; int proto; u32 minorversion; + unsigned int nconnect; struct net *net; const struct rpc_timeout *timeparms; const struct cred *cred; @@ -123,6 +123,7 @@ struct nfs_parsed_mount_data { char *export_path; int port; unsigned short protocol; + unsigned short nconnect; } nfs_server; void *lsm_opts; @@ -158,6 +159,7 @@ extern void nfs_umount(const struct nfs_mount_request *info); /* client.c */ extern const struct rpc_program nfs_program; extern void nfs_clients_init(struct net *net); +extern void nfs_clients_exit(struct net *net); extern struct nfs_client *nfs_alloc_client(const struct nfs_client_initdata *); int nfs_create_rpc_client(struct nfs_client *, const struct nfs_client_initdata *, rpc_authflavor_t); struct nfs_client *nfs_get_client(const struct nfs_client_initdata *); @@ -170,7 +172,6 @@ int nfs_init_server_rpcclient(struct nfs_server *, const struct rpc_timeout *t, struct nfs_server *nfs_alloc_server(void); void nfs_server_copy_userdata(struct nfs_server *, struct nfs_server *); -extern void nfs_cleanup_cb_ident_idr(struct net *); extern void nfs_put_client(struct nfs_client *); extern void nfs_free_client(struct nfs_client *); extern struct nfs_client *nfs4_find_client_ident(struct net *, int); diff --git a/fs/nfs/netns.h b/fs/nfs/netns.h index fc9978c58265..c8374f74dce1 100644 --- a/fs/nfs/netns.h +++ b/fs/nfs/netns.h @@ -15,6 +15,8 @@ struct bl_dev_msg { uint32_t major, minor; }; +struct nfs_netns_client; + struct nfs_net { struct cache_detail *nfs_dns_resolve; struct rpc_pipe *bl_device_pipe; @@ -29,6 +31,7 @@ struct nfs_net { unsigned short nfs_callback_tcpport6; int cb_users[NFS4_MAX_MINOR_VERSION + 1]; #endif + struct nfs_netns_client *nfs_client; spinlock_t nfs_client_lock; ktime_t boot_time; #ifdef CONFIG_PROC_FS diff --git a/fs/nfs/nfs2xdr.c b/fs/nfs/nfs2xdr.c index 572794dab4b1..cbc17a203248 100644 --- a/fs/nfs/nfs2xdr.c +++ b/fs/nfs/nfs2xdr.c @@ -151,7 +151,7 @@ static int decode_stat(struct xdr_stream *xdr, enum nfs_stat *status) return 0; out_status: *status = be32_to_cpup(p); - trace_nfs_xdr_status((int)*status); + trace_nfs_xdr_status(xdr, (int)*status); return 0; } diff --git a/fs/nfs/nfs3client.c b/fs/nfs/nfs3client.c index fb0c425b5d45..148ceb74d27c 100644 --- a/fs/nfs/nfs3client.c +++ b/fs/nfs/nfs3client.c @@ -102,6 +102,9 @@ struct nfs_client *nfs3_set_ds_client(struct nfs_server *mds_srv, return ERR_PTR(-EINVAL); cl_init.hostname = buf; + if (mds_clp->cl_nconnect > 1 && ds_proto == XPRT_TRANSPORT_TCP) + cl_init.nconnect = mds_clp->cl_nconnect; + if (mds_srv->flags & NFS_MOUNT_NORESVPORT) set_bit(NFS_CS_NORESVPORT, &cl_init.init_flags); diff --git a/fs/nfs/nfs3xdr.c b/fs/nfs/nfs3xdr.c index abbbdde97e31..602767850b36 100644 --- a/fs/nfs/nfs3xdr.c +++ b/fs/nfs/nfs3xdr.c @@ -343,7 +343,7 @@ static int decode_nfsstat3(struct xdr_stream *xdr, enum nfs_stat *status) return 0; out_status: *status = be32_to_cpup(p); - trace_nfs_xdr_status((int)*status); + trace_nfs_xdr_status(xdr, (int)*status); return 0; } diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h index 8a38a254f516..d778dad9a75e 100644 --- a/fs/nfs/nfs4_fs.h +++ b/fs/nfs/nfs4_fs.h @@ -312,12 +312,12 @@ extern int nfs4_set_rw_stateid(nfs4_stateid *stateid, const struct nfs_lock_context *l_ctx, fmode_t fmode); +extern int nfs4_proc_get_lease_time(struct nfs_client *clp, + 
struct nfs_fsinfo *fsinfo); #if defined(CONFIG_NFS_V4_1) extern int nfs41_sequence_done(struct rpc_task *, struct nfs4_sequence_res *); extern int nfs4_proc_create_session(struct nfs_client *, const struct cred *); extern int nfs4_proc_destroy_session(struct nfs4_session *, const struct cred *); -extern int nfs4_proc_get_lease_time(struct nfs_client *clp, - struct nfs_fsinfo *fsinfo); extern int nfs4_proc_layoutcommit(struct nfs4_layoutcommit_data *data, bool sync); extern int nfs4_detect_session_trunking(struct nfs_client *clp, diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c index 81b9b6d7927a..616393a01c06 100644 --- a/fs/nfs/nfs4client.c +++ b/fs/nfs/nfs4client.c @@ -859,7 +859,8 @@ static int nfs4_set_client(struct nfs_server *server, const size_t addrlen, const char *ip_addr, int proto, const struct rpc_timeout *timeparms, - u32 minorversion, struct net *net) + u32 minorversion, unsigned int nconnect, + struct net *net) { struct nfs_client_initdata cl_init = { .hostname = hostname, @@ -875,6 +876,8 @@ static int nfs4_set_client(struct nfs_server *server, }; struct nfs_client *clp; + if (minorversion > 0 && proto == XPRT_TRANSPORT_TCP) + cl_init.nconnect = nconnect; if (server->flags & NFS_MOUNT_NORESVPORT) set_bit(NFS_CS_NORESVPORT, &cl_init.init_flags); if (server->options & NFS_OPTION_MIGRATION) @@ -941,6 +944,9 @@ struct nfs_client *nfs4_set_ds_client(struct nfs_server *mds_srv, return ERR_PTR(-EINVAL); cl_init.hostname = buf; + if (mds_clp->cl_nconnect > 1 && ds_proto == XPRT_TRANSPORT_TCP) + cl_init.nconnect = mds_clp->cl_nconnect; + if (mds_srv->flags & NFS_MOUNT_NORESVPORT) __set_bit(NFS_CS_NORESVPORT, &cl_init.init_flags); @@ -1074,6 +1080,7 @@ static int nfs4_init_server(struct nfs_server *server, data->nfs_server.protocol, &timeparms, data->minorversion, + data->nfs_server.nconnect, data->net); if (error < 0) return error; @@ -1163,6 +1170,7 @@ struct nfs_server *nfs4_create_referral_server(struct nfs_clone_mount *data, XPRT_TRANSPORT_RDMA, parent_server->client->cl_timeout, parent_client->cl_mvops->minor_version, + parent_client->cl_nconnect, parent_client->cl_net); if (!error) goto init_server; @@ -1176,6 +1184,7 @@ struct nfs_server *nfs4_create_referral_server(struct nfs_clone_mount *data, XPRT_TRANSPORT_TCP, parent_server->client->cl_timeout, parent_client->cl_mvops->minor_version, + parent_client->cl_nconnect, parent_client->cl_net); if (error < 0) goto error; @@ -1271,7 +1280,8 @@ int nfs4_update_server(struct nfs_server *server, const char *hostname, set_bit(NFS_MIG_TSM_POSSIBLE, &server->mig_status); error = nfs4_set_client(server, hostname, sap, salen, buf, clp->cl_proto, clnt->cl_timeout, - clp->cl_minorversion, net); + clp->cl_minorversion, + clp->cl_nconnect, net); clear_bit(NFS_MIG_TSM_POSSIBLE, &server->mig_status); if (error != 0) { nfs_server_insert_lists(server); diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c index f4157eb1f69d..96db471ca2e5 100644 --- a/fs/nfs/nfs4file.c +++ b/fs/nfs/nfs4file.c @@ -49,7 +49,7 @@ nfs4_file_open(struct inode *inode, struct file *filp) return err; if ((openflags & O_ACCMODE) == 3) - openflags--; + return nfs_open(inode, filp); /* We can't create new files here */ openflags &= ~(O_CREAT|O_EXCL); @@ -204,7 +204,11 @@ static loff_t nfs42_remap_file_range(struct file *src_file, loff_t src_off, bool same_inode = false; int ret; - if (remap_flags & ~(REMAP_FILE_DEDUP | REMAP_FILE_ADVISORY)) + /* NFS does not support deduplication. 
*/ + if (remap_flags & REMAP_FILE_DEDUP) + return -EOPNOTSUPP; + + if (remap_flags & ~REMAP_FILE_ADVISORY) return -EINVAL; /* check alignment w.r.t. clone_blksize */ diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c index 6418cb6c079b..39896afc6edf 100644 --- a/fs/nfs/nfs4proc.c +++ b/fs/nfs/nfs4proc.c @@ -428,6 +428,22 @@ static int nfs4_delay(long *timeout, bool interruptible) return nfs4_delay_killable(timeout); } +static const nfs4_stateid * +nfs4_recoverable_stateid(const nfs4_stateid *stateid) +{ + if (!stateid) + return NULL; + switch (stateid->type) { + case NFS4_OPEN_STATEID_TYPE: + case NFS4_LOCK_STATEID_TYPE: + case NFS4_DELEGATION_STATEID_TYPE: + return stateid; + default: + break; + } + return NULL; +} + /* This is the error handling routine for processes that are allowed * to sleep. */ @@ -436,7 +452,7 @@ static int nfs4_do_handle_exception(struct nfs_server *server, { struct nfs_client *clp = server->nfs_client; struct nfs4_state *state = exception->state; - const nfs4_stateid *stateid = exception->stateid; + const nfs4_stateid *stateid; struct inode *inode = exception->inode; int ret = errorcode; @@ -444,8 +460,9 @@ static int nfs4_do_handle_exception(struct nfs_server *server, exception->recovering = 0; exception->retry = 0; + stateid = nfs4_recoverable_stateid(exception->stateid); if (stateid == NULL && state != NULL) - stateid = &state->stateid; + stateid = nfs4_recoverable_stateid(&state->stateid); switch(errorcode) { case 0: @@ -1165,6 +1182,18 @@ static bool nfs4_clear_cap_atomic_open_v1(struct nfs_server *server, return true; } +static fmode_t _nfs4_ctx_to_accessmode(const struct nfs_open_context *ctx) +{ + return ctx->mode & (FMODE_READ|FMODE_WRITE|FMODE_EXEC); +} + +static fmode_t _nfs4_ctx_to_openmode(const struct nfs_open_context *ctx) +{ + fmode_t ret = ctx->mode & (FMODE_READ|FMODE_WRITE); + + return (ctx->mode & FMODE_EXEC) ? 
FMODE_READ | ret : ret; +} + static u32 nfs4_map_atomic_open_share(struct nfs_server *server, fmode_t fmode, int openflags) @@ -2900,14 +2929,13 @@ static unsigned nfs4_exclusive_attrset(struct nfs4_opendata *opendata, } static int _nfs4_open_and_get_state(struct nfs4_opendata *opendata, - fmode_t fmode, - int flags, - struct nfs_open_context *ctx) + int flags, struct nfs_open_context *ctx) { struct nfs4_state_owner *sp = opendata->owner; struct nfs_server *server = sp->so_server; struct dentry *dentry; struct nfs4_state *state; + fmode_t acc_mode = _nfs4_ctx_to_accessmode(ctx); unsigned int seq; int ret; @@ -2946,7 +2974,8 @@ static int _nfs4_open_and_get_state(struct nfs4_opendata *opendata, /* Parse layoutget results before we check for access */ pnfs_parse_lgopen(state->inode, opendata->lgp, ctx); - ret = nfs4_opendata_access(sp->so_cred, opendata, state, fmode, flags); + ret = nfs4_opendata_access(sp->so_cred, opendata, state, + acc_mode, flags); if (ret != 0) goto out; @@ -2978,7 +3007,7 @@ static int _nfs4_do_open(struct inode *dir, struct dentry *dentry = ctx->dentry; const struct cred *cred = ctx->cred; struct nfs4_threshold **ctx_th = &ctx->mdsthreshold; - fmode_t fmode = ctx->mode & (FMODE_READ|FMODE_WRITE|FMODE_EXEC); + fmode_t fmode = _nfs4_ctx_to_openmode(ctx); enum open_claim_type4 claim = NFS4_OPEN_CLAIM_NULL; struct iattr *sattr = c->sattr; struct nfs4_label *label = c->label; @@ -3024,7 +3053,7 @@ static int _nfs4_do_open(struct inode *dir, if (d_really_is_positive(dentry)) opendata->state = nfs4_get_open_state(d_inode(dentry), sp); - status = _nfs4_open_and_get_state(opendata, fmode, flags, ctx); + status = _nfs4_open_and_get_state(opendata, flags, ctx); if (status != 0) goto err_free_label; state = ctx->state; @@ -3594,9 +3623,9 @@ static void nfs4_close_context(struct nfs_open_context *ctx, int is_sync) if (ctx->state == NULL) return; if (is_sync) - nfs4_close_sync(ctx->state, ctx->mode); + nfs4_close_sync(ctx->state, _nfs4_ctx_to_openmode(ctx)); else - nfs4_close_state(ctx->state, ctx->mode); + nfs4_close_state(ctx->state, _nfs4_ctx_to_openmode(ctx)); } #define FATTR4_WORD1_NFS40_MASK (2*FATTR4_WORD1_MOUNTED_ON_FILEID - 1UL) @@ -5980,7 +6009,7 @@ int nfs4_proc_setclientid(struct nfs_client *clp, u32 program, .rpc_message = &msg, .callback_ops = &nfs4_setclientid_ops, .callback_data = &setclientid, - .flags = RPC_TASK_TIMEOUT, + .flags = RPC_TASK_TIMEOUT | RPC_TASK_NO_ROUND_ROBIN, }; int status; @@ -6046,7 +6075,8 @@ int nfs4_proc_setclientid_confirm(struct nfs_client *clp, dprintk("NFS call setclientid_confirm auth=%s, (client ID %llx)\n", clp->cl_rpcclient->cl_auth->au_ops->au_name, clp->cl_clientid); - status = rpc_call_sync(clp->cl_rpcclient, &msg, RPC_TASK_TIMEOUT); + status = rpc_call_sync(clp->cl_rpcclient, &msg, + RPC_TASK_TIMEOUT | RPC_TASK_NO_ROUND_ROBIN); trace_nfs4_setclientid_confirm(clp, status); dprintk("NFS reply setclientid_confirm: %d\n", status); return status; @@ -7627,7 +7657,7 @@ static int _nfs4_proc_secinfo(struct inode *dir, const struct qstr *name, struct NFS_SP4_MACH_CRED_SECINFO, &clnt, &msg); status = nfs4_call_sync(clnt, NFS_SERVER(dir), &msg, &args.seq_args, - &res.seq_res, 0); + &res.seq_res, RPC_TASK_NO_ROUND_ROBIN); dprintk("NFS reply secinfo: %d\n", status); put_cred(cred); @@ -7965,7 +7995,7 @@ nfs4_run_exchange_id(struct nfs_client *clp, const struct cred *cred, .rpc_client = clp->cl_rpcclient, .callback_ops = &nfs4_exchange_id_call_ops, .rpc_message = &msg, - .flags = RPC_TASK_TIMEOUT, + .flags = RPC_TASK_TIMEOUT | 
RPC_TASK_NO_ROUND_ROBIN, }; struct nfs41_exchange_id_data *calldata; int status; @@ -8190,7 +8220,8 @@ static int _nfs4_proc_destroy_clientid(struct nfs_client *clp, }; int status; - status = rpc_call_sync(clp->cl_rpcclient, &msg, RPC_TASK_TIMEOUT); + status = rpc_call_sync(clp->cl_rpcclient, &msg, + RPC_TASK_TIMEOUT | RPC_TASK_NO_ROUND_ROBIN); trace_nfs4_destroy_clientid(clp, status); if (status) dprintk("NFS: Got error %d from the server %s on " @@ -8241,6 +8272,8 @@ out: return ret; } +#endif /* CONFIG_NFS_V4_1 */ + struct nfs4_get_lease_time_data { struct nfs4_get_lease_time_args *args; struct nfs4_get_lease_time_res *res; @@ -8273,7 +8306,7 @@ static void nfs4_get_lease_time_done(struct rpc_task *task, void *calldata) (struct nfs4_get_lease_time_data *)calldata; dprintk("--> %s\n", __func__); - if (!nfs41_sequence_done(task, &data->res->lr_seq_res)) + if (!nfs4_sequence_done(task, &data->res->lr_seq_res)) return; switch (task->tk_status) { case -NFS4ERR_DELAY: @@ -8331,6 +8364,8 @@ int nfs4_proc_get_lease_time(struct nfs_client *clp, struct nfs_fsinfo *fsinfo) return status; } +#ifdef CONFIG_NFS_V4_1 + /* * Initialize the values to be used by the client in CREATE_SESSION * If nfs4_init_session set the fore channel request and response sizes, @@ -8345,6 +8380,7 @@ static void nfs4_init_channel_attrs(struct nfs41_create_session_args *args, { unsigned int max_rqst_sz, max_resp_sz; unsigned int max_bc_payload = rpc_max_bc_payload(clnt); + unsigned int max_bc_slots = rpc_num_bc_slots(clnt); max_rqst_sz = NFS_MAX_FILE_IO_SIZE + nfs41_maxwrite_overhead; max_resp_sz = NFS_MAX_FILE_IO_SIZE + nfs41_maxread_overhead; @@ -8367,6 +8403,8 @@ static void nfs4_init_channel_attrs(struct nfs41_create_session_args *args, args->bc_attrs.max_resp_sz_cached = 0; args->bc_attrs.max_ops = NFS4_MAX_BACK_CHANNEL_OPS; args->bc_attrs.max_reqs = max_t(unsigned short, max_session_cb_slots, 1); + if (args->bc_attrs.max_reqs > max_bc_slots) + args->bc_attrs.max_reqs = max_bc_slots; dprintk("%s: Back Channel : max_rqst_sz=%u max_resp_sz=%u " "max_resp_sz_cached=%u max_ops=%u max_reqs=%u\n", @@ -8469,7 +8507,8 @@ static int _nfs4_proc_create_session(struct nfs_client *clp, nfs4_init_channel_attrs(&args, clp->cl_rpcclient); args.flags = (SESSION4_PERSIST | SESSION4_BACK_CHAN); - status = rpc_call_sync(session->clp->cl_rpcclient, &msg, RPC_TASK_TIMEOUT); + status = rpc_call_sync(session->clp->cl_rpcclient, &msg, + RPC_TASK_TIMEOUT | RPC_TASK_NO_ROUND_ROBIN); trace_nfs4_create_session(clp, status); switch (status) { @@ -8545,7 +8584,8 @@ int nfs4_proc_destroy_session(struct nfs4_session *session, if (!test_and_clear_bit(NFS4_SESSION_ESTABLISHED, &session->session_state)) return 0; - status = rpc_call_sync(session->clp->cl_rpcclient, &msg, RPC_TASK_TIMEOUT); + status = rpc_call_sync(session->clp->cl_rpcclient, &msg, + RPC_TASK_TIMEOUT | RPC_TASK_NO_ROUND_ROBIN); trace_nfs4_destroy_session(session->clp, status); if (status) @@ -8799,7 +8839,7 @@ static int nfs41_proc_reclaim_complete(struct nfs_client *clp, .rpc_client = clp->cl_rpcclient, .rpc_message = &msg, .callback_ops = &nfs4_reclaim_complete_call_ops, - .flags = RPC_TASK_ASYNC, + .flags = RPC_TASK_ASYNC | RPC_TASK_NO_ROUND_ROBIN, }; int status = -ENOMEM; @@ -9318,7 +9358,7 @@ _nfs41_proc_secinfo_no_name(struct nfs_server *server, struct nfs_fh *fhandle, dprintk("--> %s\n", __func__); status = nfs4_call_sync(clnt, server, &msg, &args.seq_args, - &res.seq_res, 0); + &res.seq_res, RPC_TASK_NO_ROUND_ROBIN); dprintk("<-- %s status=%d\n", __func__, status); 
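
The hunks above tag the clientid-, session- and lease-management RPCs (SETCLIENTID and SETCLIENTID_CONFIRM, EXCHANGE_ID, CREATE_SESSION, DESTROY_SESSION, DESTROY_CLIENTID, RECLAIM_COMPLETE and both SECINFO variants) with RPC_TASK_NO_ROUND_ROBIN. This pairs with the nconnect mount option added in fs/nfs/super.c further down in this diff: once a mount multiplexes RPC traffic over several TCP connections, state-management calls must stick to the client's main transport. A minimal sketch of that intent, assuming the rpc_clnt field names from linux/sunrpc/clnt.h (cl_xprt, cl_xpi); the real transport-selection change lives in net/sunrpc and is outside this excerpt:

    /*
     * Illustrative only: pin flagged tasks to the main transport and
     * let everything else rotate through the nconnect transport set.
     * Caller is assumed to hold rcu_read_lock() for cl_xprt.
     */
    static struct rpc_xprt *sketch_task_get_xprt(struct rpc_clnt *clnt,
                                                 struct rpc_task *task)
    {
            if (task->tk_flags & RPC_TASK_NO_ROUND_ROBIN)
                    return xprt_get(rcu_dereference(clnt->cl_xprt));
            return xprt_iter_get_next(&clnt->cl_xpi);
    }

With the new option, something like "mount -o vers=4.1,nconnect=4 server:/export /mnt" spreads I/O over four connections (capped at NFS_MAX_CONNECTIONS, i.e. 16), while the calls flagged here stay on the primary connection.
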
put_cred(cred); diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c index e2e3c4f04d3e..9afd051a4876 100644 --- a/fs/nfs/nfs4state.c +++ b/fs/nfs/nfs4state.c @@ -87,6 +87,27 @@ const nfs4_stateid current_stateid = { static DEFINE_MUTEX(nfs_clid_init_mutex); +static int nfs4_setup_state_renewal(struct nfs_client *clp) +{ + int status; + struct nfs_fsinfo fsinfo; + unsigned long now; + + if (!test_bit(NFS_CS_CHECK_LEASE_TIME, &clp->cl_res_state)) { + nfs4_schedule_state_renewal(clp); + return 0; + } + + now = jiffies; + status = nfs4_proc_get_lease_time(clp, &fsinfo); + if (status == 0) { + nfs4_set_lease_period(clp, fsinfo.lease_time * HZ, now); + nfs4_schedule_state_renewal(clp); + } + + return status; +} + int nfs4_init_clientid(struct nfs_client *clp, const struct cred *cred) { struct nfs4_setclientid_res clid = { @@ -114,7 +135,7 @@ do_confirm: if (status != 0) goto out; clear_bit(NFS4CLNT_LEASE_CONFIRM, &clp->cl_state); - nfs4_schedule_state_renewal(clp); + nfs4_setup_state_renewal(clp); out: return status; } @@ -286,34 +307,13 @@ static int nfs4_begin_drain_session(struct nfs_client *clp) #if defined(CONFIG_NFS_V4_1) -static int nfs41_setup_state_renewal(struct nfs_client *clp) -{ - int status; - struct nfs_fsinfo fsinfo; - unsigned long now; - - if (!test_bit(NFS_CS_CHECK_LEASE_TIME, &clp->cl_res_state)) { - nfs4_schedule_state_renewal(clp); - return 0; - } - - now = jiffies; - status = nfs4_proc_get_lease_time(clp, &fsinfo); - if (status == 0) { - nfs4_set_lease_period(clp, fsinfo.lease_time * HZ, now); - nfs4_schedule_state_renewal(clp); - } - - return status; -} - static void nfs41_finish_session_reset(struct nfs_client *clp) { clear_bit(NFS4CLNT_LEASE_CONFIRM, &clp->cl_state); clear_bit(NFS4CLNT_SESSION_RESET, &clp->cl_state); /* create_session negotiated new slot table */ clear_bit(NFS4CLNT_BIND_CONN_TO_SESSION, &clp->cl_state); - nfs41_setup_state_renewal(clp); + nfs4_setup_state_renewal(clp); } int nfs41_init_clientid(struct nfs_client *clp, const struct cred *cred) @@ -1064,8 +1064,7 @@ int nfs4_select_rw_stateid(struct nfs4_state *state, * choose to use. */ goto out; - nfs4_copy_open_stateid(dst, state); - ret = 0; + ret = nfs4_copy_open_stateid(dst, state) ? 
0 : -EAGAIN; out: if (nfs_server_capable(state->inode, NFS_CAP_STATEID_NFSV41)) dst->seqid = 0; diff --git a/fs/nfs/nfs4trace.c b/fs/nfs/nfs4trace.c index e9fb3e50a999..1a8f376b3f73 100644 --- a/fs/nfs/nfs4trace.c +++ b/fs/nfs/nfs4trace.c @@ -16,4 +16,12 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(nfs4_pnfs_read); EXPORT_TRACEPOINT_SYMBOL_GPL(nfs4_pnfs_write); EXPORT_TRACEPOINT_SYMBOL_GPL(nfs4_pnfs_commit_ds); + +EXPORT_TRACEPOINT_SYMBOL_GPL(pnfs_mds_fallback_pg_init_read); +EXPORT_TRACEPOINT_SYMBOL_GPL(pnfs_mds_fallback_pg_init_write); +EXPORT_TRACEPOINT_SYMBOL_GPL(pnfs_mds_fallback_pg_get_mirror_count); +EXPORT_TRACEPOINT_SYMBOL_GPL(pnfs_mds_fallback_read_done); +EXPORT_TRACEPOINT_SYMBOL_GPL(pnfs_mds_fallback_write_done); +EXPORT_TRACEPOINT_SYMBOL_GPL(pnfs_mds_fallback_read_pagelist); +EXPORT_TRACEPOINT_SYMBOL_GPL(pnfs_mds_fallback_write_pagelist); #endif diff --git a/fs/nfs/nfs4trace.h b/fs/nfs/nfs4trace.h index cd1a5c08da9a..b2f395fa7350 100644 --- a/fs/nfs/nfs4trace.h +++ b/fs/nfs/nfs4trace.h @@ -156,7 +156,7 @@ TRACE_DEFINE_ENUM(NFS4ERR_WRONG_TYPE); TRACE_DEFINE_ENUM(NFS4ERR_XDEV); #define show_nfsv4_errors(error) \ - __print_symbolic(-(error), \ + __print_symbolic(error, \ { NFS4_OK, "OK" }, \ /* Mapped by nfs4_stat_to_errno() */ \ { EPERM, "EPERM" }, \ @@ -348,7 +348,7 @@ DECLARE_EVENT_CLASS(nfs4_clientid_event, TP_STRUCT__entry( __string(dstaddr, clp->cl_hostname) - __field(int, error) + __field(unsigned long, error) ), TP_fast_assign( @@ -357,8 +357,8 @@ DECLARE_EVENT_CLASS(nfs4_clientid_event, ), TP_printk( - "error=%d (%s) dstaddr=%s", - __entry->error, + "error=%ld (%s) dstaddr=%s", + -__entry->error, show_nfsv4_errors(__entry->error), __get_str(dstaddr) ) @@ -420,7 +420,7 @@ TRACE_EVENT(nfs4_sequence_done, __field(unsigned int, highest_slotid) __field(unsigned int, target_highest_slotid) __field(unsigned int, status_flags) - __field(int, error) + __field(unsigned long, error) ), TP_fast_assign( @@ -435,10 +435,10 @@ TRACE_EVENT(nfs4_sequence_done, __entry->error = res->sr_status; ), TP_printk( - "error=%d (%s) session=0x%08x slot_nr=%u seq_nr=%u " + "error=%ld (%s) session=0x%08x slot_nr=%u seq_nr=%u " "highest_slotid=%u target_highest_slotid=%u " "status_flags=%u (%s)", - __entry->error, + -__entry->error, show_nfsv4_errors(__entry->error), __entry->session, __entry->slot_nr, @@ -467,7 +467,7 @@ TRACE_EVENT(nfs4_cb_sequence, __field(unsigned int, seq_nr) __field(unsigned int, highest_slotid) __field(unsigned int, cachethis) - __field(int, error) + __field(unsigned long, error) ), TP_fast_assign( @@ -476,13 +476,13 @@ TRACE_EVENT(nfs4_cb_sequence, __entry->seq_nr = args->csa_sequenceid; __entry->highest_slotid = args->csa_highestslotid; __entry->cachethis = args->csa_cachethis; - __entry->error = -be32_to_cpu(status); + __entry->error = be32_to_cpu(status); ), TP_printk( - "error=%d (%s) session=0x%08x slot_nr=%u seq_nr=%u " + "error=%ld (%s) session=0x%08x slot_nr=%u seq_nr=%u " "highest_slotid=%u", - __entry->error, + -__entry->error, show_nfsv4_errors(__entry->error), __entry->session, __entry->slot_nr, @@ -490,6 +490,44 @@ TRACE_EVENT(nfs4_cb_sequence, __entry->highest_slotid ) ); + +TRACE_EVENT(nfs4_cb_seqid_err, + TP_PROTO( + const struct cb_sequenceargs *args, + __be32 status + ), + TP_ARGS(args, status), + + TP_STRUCT__entry( + __field(unsigned int, session) + __field(unsigned int, slot_nr) + __field(unsigned int, seq_nr) + __field(unsigned int, highest_slotid) + __field(unsigned int, cachethis) + __field(unsigned long, error) + ), + + TP_fast_assign( + __entry->session = 
nfs_session_id_hash(&args->csa_sessionid); + __entry->slot_nr = args->csa_slotid; + __entry->seq_nr = args->csa_sequenceid; + __entry->highest_slotid = args->csa_highestslotid; + __entry->cachethis = args->csa_cachethis; + __entry->error = be32_to_cpu(status); + ), + + TP_printk( + "error=%ld (%s) session=0x%08x slot_nr=%u seq_nr=%u " + "highest_slotid=%u", + -__entry->error, + show_nfsv4_errors(__entry->error), + __entry->session, + __entry->slot_nr, + __entry->seq_nr, + __entry->highest_slotid + ) +); + #endif /* CONFIG_NFS_V4_1 */ TRACE_EVENT(nfs4_setup_sequence, @@ -526,26 +564,37 @@ TRACE_EVENT(nfs4_setup_sequence, TRACE_EVENT(nfs4_xdr_status, TP_PROTO( + const struct xdr_stream *xdr, u32 op, int error ), - TP_ARGS(op, error), + TP_ARGS(xdr, op, error), TP_STRUCT__entry( + __field(unsigned int, task_id) + __field(unsigned int, client_id) + __field(u32, xid) __field(u32, op) - __field(int, error) + __field(unsigned long, error) ), TP_fast_assign( + const struct rpc_rqst *rqstp = xdr->rqst; + const struct rpc_task *task = rqstp->rq_task; + + __entry->task_id = task->tk_pid; + __entry->client_id = task->tk_client->cl_clid; + __entry->xid = be32_to_cpu(rqstp->rq_xid); __entry->op = op; - __entry->error = -error; + __entry->error = error; ), TP_printk( - "operation %d: nfs status %d (%s)", - __entry->op, - __entry->error, show_nfsv4_errors(__entry->error) + "task:%u@%d xid=0x%08x error=%ld (%s) operation=%u", + __entry->task_id, __entry->client_id, __entry->xid, + -__entry->error, show_nfsv4_errors(__entry->error), + __entry->op ) ); @@ -559,7 +608,7 @@ DECLARE_EVENT_CLASS(nfs4_open_event, TP_ARGS(ctx, flags, error), TP_STRUCT__entry( - __field(int, error) + __field(unsigned long, error) __field(unsigned int, flags) __field(unsigned int, fmode) __field(dev_t, dev) @@ -577,7 +626,7 @@ DECLARE_EVENT_CLASS(nfs4_open_event, const struct nfs4_state *state = ctx->state; const struct inode *inode = NULL; - __entry->error = error; + __entry->error = -error; __entry->flags = flags; __entry->fmode = (__force unsigned int)ctx->mode; __entry->dev = ctx->dentry->d_sb->s_dev; @@ -609,11 +658,11 @@ DECLARE_EVENT_CLASS(nfs4_open_event, ), TP_printk( - "error=%d (%s) flags=%d (%s) fmode=%s " + "error=%ld (%s) flags=%d (%s) fmode=%s " "fileid=%02x:%02x:%llu fhandle=0x%08x " "name=%02x:%02x:%llu/%s stateid=%d:0x%08x " "openstateid=%d:0x%08x", - __entry->error, + -__entry->error, show_nfsv4_errors(__entry->error), __entry->flags, show_open_flags(__entry->flags), @@ -695,7 +744,7 @@ TRACE_EVENT(nfs4_close, __field(u32, fhandle) __field(u64, fileid) __field(unsigned int, fmode) - __field(int, error) + __field(unsigned long, error) __field(int, stateid_seq) __field(u32, stateid_hash) ), @@ -715,9 +764,9 @@ TRACE_EVENT(nfs4_close, ), TP_printk( - "error=%d (%s) fmode=%s fileid=%02x:%02x:%llu " + "error=%ld (%s) fmode=%s fileid=%02x:%02x:%llu " "fhandle=0x%08x openstateid=%d:0x%08x", - __entry->error, + -__entry->error, show_nfsv4_errors(__entry->error), __entry->fmode ? 
show_fmode_flags(__entry->fmode) : "closed", @@ -757,7 +806,7 @@ DECLARE_EVENT_CLASS(nfs4_lock_event, TP_ARGS(request, state, cmd, error), TP_STRUCT__entry( - __field(int, error) + __field(unsigned long, error) __field(int, cmd) __field(char, type) __field(loff_t, start) @@ -787,10 +836,10 @@ DECLARE_EVENT_CLASS(nfs4_lock_event, ), TP_printk( - "error=%d (%s) cmd=%s:%s range=%lld:%lld " + "error=%ld (%s) cmd=%s:%s range=%lld:%lld " "fileid=%02x:%02x:%llu fhandle=0x%08x " "stateid=%d:0x%08x", - __entry->error, + -__entry->error, show_nfsv4_errors(__entry->error), show_lock_cmd(__entry->cmd), show_lock_type(__entry->type), @@ -827,7 +876,7 @@ TRACE_EVENT(nfs4_set_lock, TP_ARGS(request, state, lockstateid, cmd, error), TP_STRUCT__entry( - __field(int, error) + __field(unsigned long, error) __field(int, cmd) __field(char, type) __field(loff_t, start) @@ -863,10 +912,10 @@ TRACE_EVENT(nfs4_set_lock, ), TP_printk( - "error=%d (%s) cmd=%s:%s range=%lld:%lld " + "error=%ld (%s) cmd=%s:%s range=%lld:%lld " "fileid=%02x:%02x:%llu fhandle=0x%08x " "stateid=%d:0x%08x lockstateid=%d:0x%08x", - __entry->error, + -__entry->error, show_nfsv4_errors(__entry->error), show_lock_cmd(__entry->cmd), show_lock_type(__entry->type), @@ -932,7 +981,7 @@ TRACE_EVENT(nfs4_delegreturn_exit, TP_STRUCT__entry( __field(dev_t, dev) __field(u32, fhandle) - __field(int, error) + __field(unsigned long, error) __field(int, stateid_seq) __field(u32, stateid_hash) ), @@ -948,9 +997,9 @@ TRACE_EVENT(nfs4_delegreturn_exit, ), TP_printk( - "error=%d (%s) dev=%02x:%02x fhandle=0x%08x " + "error=%ld (%s) dev=%02x:%02x fhandle=0x%08x " "stateid=%d:0x%08x", - __entry->error, + -__entry->error, show_nfsv4_errors(__entry->error), MAJOR(__entry->dev), MINOR(__entry->dev), __entry->fhandle, @@ -969,7 +1018,7 @@ DECLARE_EVENT_CLASS(nfs4_test_stateid_event, TP_ARGS(state, lsp, error), TP_STRUCT__entry( - __field(int, error) + __field(unsigned long, error) __field(dev_t, dev) __field(u32, fhandle) __field(u64, fileid) @@ -991,9 +1040,9 @@ DECLARE_EVENT_CLASS(nfs4_test_stateid_event, ), TP_printk( - "error=%d (%s) fileid=%02x:%02x:%llu fhandle=0x%08x " + "error=%ld (%s) fileid=%02x:%02x:%llu fhandle=0x%08x " "stateid=%d:0x%08x", - __entry->error, + -__entry->error, show_nfsv4_errors(__entry->error), MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long long)__entry->fileid, @@ -1026,7 +1075,7 @@ DECLARE_EVENT_CLASS(nfs4_lookup_event, TP_STRUCT__entry( __field(dev_t, dev) - __field(int, error) + __field(unsigned long, error) __field(u64, dir) __string(name, name->name) ), @@ -1034,13 +1083,13 @@ DECLARE_EVENT_CLASS(nfs4_lookup_event, TP_fast_assign( __entry->dev = dir->i_sb->s_dev; __entry->dir = NFS_FILEID(dir); - __entry->error = error; + __entry->error = -error; __assign_str(name, name->name); ), TP_printk( - "error=%d (%s) name=%02x:%02x:%llu/%s", - __entry->error, + "error=%ld (%s) name=%02x:%02x:%llu/%s", + -__entry->error, show_nfsv4_errors(__entry->error), MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long long)__entry->dir, @@ -1076,7 +1125,7 @@ TRACE_EVENT(nfs4_lookupp, TP_STRUCT__entry( __field(dev_t, dev) __field(u64, ino) - __field(int, error) + __field(unsigned long, error) ), TP_fast_assign( @@ -1086,8 +1135,8 @@ TRACE_EVENT(nfs4_lookupp, ), TP_printk( - "error=%d (%s) inode=%02x:%02x:%llu", - __entry->error, + "error=%ld (%s) inode=%02x:%02x:%llu", + -__entry->error, show_nfsv4_errors(__entry->error), MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long long)__entry->ino @@ -1107,7 +1156,7 @@ 
TRACE_EVENT(nfs4_rename, TP_STRUCT__entry( __field(dev_t, dev) - __field(int, error) + __field(unsigned long, error) __field(u64, olddir) __string(oldname, oldname->name) __field(u64, newdir) @@ -1124,9 +1173,9 @@ TRACE_EVENT(nfs4_rename, ), TP_printk( - "error=%d (%s) oldname=%02x:%02x:%llu/%s " + "error=%ld (%s) oldname=%02x:%02x:%llu/%s " "newname=%02x:%02x:%llu/%s", - __entry->error, + -__entry->error, show_nfsv4_errors(__entry->error), MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long long)__entry->olddir, @@ -1149,19 +1198,19 @@ DECLARE_EVENT_CLASS(nfs4_inode_event, __field(dev_t, dev) __field(u32, fhandle) __field(u64, fileid) - __field(int, error) + __field(unsigned long, error) ), TP_fast_assign( __entry->dev = inode->i_sb->s_dev; __entry->fileid = NFS_FILEID(inode); __entry->fhandle = nfs_fhandle_hash(NFS_FH(inode)); - __entry->error = error; + __entry->error = error < 0 ? -error : 0; ), TP_printk( - "error=%d (%s) fileid=%02x:%02x:%llu fhandle=0x%08x", - __entry->error, + "error=%ld (%s) fileid=%02x:%02x:%llu fhandle=0x%08x", + -__entry->error, show_nfsv4_errors(__entry->error), MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long long)__entry->fileid, @@ -1200,7 +1249,7 @@ DECLARE_EVENT_CLASS(nfs4_inode_stateid_event, __field(dev_t, dev) __field(u32, fhandle) __field(u64, fileid) - __field(int, error) + __field(unsigned long, error) __field(int, stateid_seq) __field(u32, stateid_hash) ), @@ -1217,9 +1266,9 @@ DECLARE_EVENT_CLASS(nfs4_inode_stateid_event, ), TP_printk( - "error=%d (%s) fileid=%02x:%02x:%llu fhandle=0x%08x " + "error=%ld (%s) fileid=%02x:%02x:%llu fhandle=0x%08x " "stateid=%d:0x%08x", - __entry->error, + -__entry->error, show_nfsv4_errors(__entry->error), MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long long)__entry->fileid, @@ -1257,7 +1306,7 @@ DECLARE_EVENT_CLASS(nfs4_getattr_event, __field(u32, fhandle) __field(u64, fileid) __field(unsigned int, valid) - __field(int, error) + __field(unsigned long, error) ), TP_fast_assign( @@ -1269,9 +1318,9 @@ DECLARE_EVENT_CLASS(nfs4_getattr_event, ), TP_printk( - "error=%d (%s) fileid=%02x:%02x:%llu fhandle=0x%08x " + "error=%ld (%s) fileid=%02x:%02x:%llu fhandle=0x%08x " "valid=%s", - __entry->error, + -__entry->error, show_nfsv4_errors(__entry->error), MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long long)__entry->fileid, @@ -1304,7 +1353,7 @@ DECLARE_EVENT_CLASS(nfs4_inode_callback_event, TP_ARGS(clp, fhandle, inode, error), TP_STRUCT__entry( - __field(int, error) + __field(unsigned long, error) __field(dev_t, dev) __field(u32, fhandle) __field(u64, fileid) @@ -1325,9 +1374,9 @@ DECLARE_EVENT_CLASS(nfs4_inode_callback_event, ), TP_printk( - "error=%d (%s) fileid=%02x:%02x:%llu fhandle=0x%08x " + "error=%ld (%s) fileid=%02x:%02x:%llu fhandle=0x%08x " "dstaddr=%s", - __entry->error, + -__entry->error, show_nfsv4_errors(__entry->error), MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long long)__entry->fileid, @@ -1359,7 +1408,7 @@ DECLARE_EVENT_CLASS(nfs4_inode_stateid_callback_event, TP_ARGS(clp, fhandle, inode, stateid, error), TP_STRUCT__entry( - __field(int, error) + __field(unsigned long, error) __field(dev_t, dev) __field(u32, fhandle) __field(u64, fileid) @@ -1386,9 +1435,9 @@ DECLARE_EVENT_CLASS(nfs4_inode_stateid_callback_event, ), TP_printk( - "error=%d (%s) fileid=%02x:%02x:%llu fhandle=0x%08x " + "error=%ld (%s) fileid=%02x:%02x:%llu fhandle=0x%08x " "stateid=%d:0x%08x dstaddr=%s", - __entry->error, + -__entry->error, show_nfsv4_errors(__entry->error), MAJOR(__entry->dev), 
MINOR(__entry->dev), (unsigned long long)__entry->fileid, @@ -1422,7 +1471,7 @@ DECLARE_EVENT_CLASS(nfs4_idmap_event, TP_ARGS(name, len, id, error), TP_STRUCT__entry( - __field(int, error) + __field(unsigned long, error) __field(u32, id) __dynamic_array(char, name, len > 0 ? len + 1 : 1) ), @@ -1437,8 +1486,8 @@ DECLARE_EVENT_CLASS(nfs4_idmap_event, ), TP_printk( - "error=%d id=%u name=%s", - __entry->error, + "error=%ld (%s) id=%u name=%s", + -__entry->error, show_nfsv4_errors(__entry->error), __entry->id, __get_str(name) ) @@ -1471,7 +1520,7 @@ DECLARE_EVENT_CLASS(nfs4_read_event, __field(u64, fileid) __field(loff_t, offset) __field(size_t, count) - __field(int, error) + __field(unsigned long, error) __field(int, stateid_seq) __field(u32, stateid_hash) ), @@ -1485,7 +1534,7 @@ DECLARE_EVENT_CLASS(nfs4_read_event, __entry->fhandle = nfs_fhandle_hash(NFS_FH(inode)); __entry->offset = hdr->args.offset; __entry->count = hdr->args.count; - __entry->error = error; + __entry->error = error < 0 ? -error : 0; __entry->stateid_seq = be32_to_cpu(state->stateid.seqid); __entry->stateid_hash = @@ -1493,9 +1542,9 @@ DECLARE_EVENT_CLASS(nfs4_read_event, ), TP_printk( - "error=%d (%s) fileid=%02x:%02x:%llu fhandle=0x%08x " + "error=%ld (%s) fileid=%02x:%02x:%llu fhandle=0x%08x " "offset=%lld count=%zu stateid=%d:0x%08x", - __entry->error, + -__entry->error, show_nfsv4_errors(__entry->error), MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long long)__entry->fileid, @@ -1531,7 +1580,7 @@ DECLARE_EVENT_CLASS(nfs4_write_event, __field(u64, fileid) __field(loff_t, offset) __field(size_t, count) - __field(int, error) + __field(unsigned long, error) __field(int, stateid_seq) __field(u32, stateid_hash) ), @@ -1545,7 +1594,7 @@ DECLARE_EVENT_CLASS(nfs4_write_event, __entry->fhandle = nfs_fhandle_hash(NFS_FH(inode)); __entry->offset = hdr->args.offset; __entry->count = hdr->args.count; - __entry->error = error; + __entry->error = error < 0 ? 
-error : 0; __entry->stateid_seq = be32_to_cpu(state->stateid.seqid); __entry->stateid_hash = @@ -1553,9 +1602,9 @@ DECLARE_EVENT_CLASS(nfs4_write_event, ), TP_printk( - "error=%d (%s) fileid=%02x:%02x:%llu fhandle=0x%08x " + "error=%ld (%s) fileid=%02x:%02x:%llu fhandle=0x%08x " "offset=%lld count=%zu stateid=%d:0x%08x", - __entry->error, + -__entry->error, show_nfsv4_errors(__entry->error), MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long long)__entry->fileid, @@ -1592,7 +1641,7 @@ DECLARE_EVENT_CLASS(nfs4_commit_event, __field(u64, fileid) __field(loff_t, offset) __field(size_t, count) - __field(int, error) + __field(unsigned long, error) ), TP_fast_assign( @@ -1606,9 +1655,9 @@ DECLARE_EVENT_CLASS(nfs4_commit_event, ), TP_printk( - "error=%d (%s) fileid=%02x:%02x:%llu fhandle=0x%08x " + "error=%ld (%s) fileid=%02x:%02x:%llu fhandle=0x%08x " "offset=%lld count=%zu", - __entry->error, + -__entry->error, show_nfsv4_errors(__entry->error), MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long long)__entry->fileid, @@ -1656,7 +1705,7 @@ TRACE_EVENT(nfs4_layoutget, __field(u32, iomode) __field(u64, offset) __field(u64, count) - __field(int, error) + __field(unsigned long, error) __field(int, stateid_seq) __field(u32, stateid_hash) __field(int, layoutstateid_seq) @@ -1689,10 +1738,10 @@ TRACE_EVENT(nfs4_layoutget, ), TP_printk( - "error=%d (%s) fileid=%02x:%02x:%llu fhandle=0x%08x " + "error=%ld (%s) fileid=%02x:%02x:%llu fhandle=0x%08x " "iomode=%s offset=%llu count=%llu stateid=%d:0x%08x " "layoutstateid=%d:0x%08x", - __entry->error, + -__entry->error, show_nfsv4_errors(__entry->error), MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long long)__entry->fileid, @@ -1722,6 +1771,7 @@ TRACE_DEFINE_ENUM(PNFS_UPDATE_LAYOUT_BLOCKED); TRACE_DEFINE_ENUM(PNFS_UPDATE_LAYOUT_INVALID_OPEN); TRACE_DEFINE_ENUM(PNFS_UPDATE_LAYOUT_RETRY); TRACE_DEFINE_ENUM(PNFS_UPDATE_LAYOUT_SEND_LAYOUTGET); +TRACE_DEFINE_ENUM(PNFS_UPDATE_LAYOUT_EXIT); #define show_pnfs_update_layout_reason(reason) \ __print_symbolic(reason, \ @@ -1737,7 +1787,8 @@ TRACE_DEFINE_ENUM(PNFS_UPDATE_LAYOUT_SEND_LAYOUTGET); { PNFS_UPDATE_LAYOUT_BLOCKED, "layouts blocked" }, \ { PNFS_UPDATE_LAYOUT_INVALID_OPEN, "invalid open" }, \ { PNFS_UPDATE_LAYOUT_RETRY, "retrying" }, \ - { PNFS_UPDATE_LAYOUT_SEND_LAYOUTGET, "sent layoutget" }) + { PNFS_UPDATE_LAYOUT_SEND_LAYOUTGET, "sent layoutget" }, \ + { PNFS_UPDATE_LAYOUT_EXIT, "exit" }) TRACE_EVENT(pnfs_update_layout, TP_PROTO(struct inode *inode, @@ -1796,6 +1847,78 @@ TRACE_EVENT(pnfs_update_layout, ) ); +DECLARE_EVENT_CLASS(pnfs_layout_event, + TP_PROTO(struct inode *inode, + loff_t pos, + u64 count, + enum pnfs_iomode iomode, + struct pnfs_layout_hdr *lo, + struct pnfs_layout_segment *lseg + ), + TP_ARGS(inode, pos, count, iomode, lo, lseg), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(u64, fileid) + __field(u32, fhandle) + __field(loff_t, pos) + __field(u64, count) + __field(enum pnfs_iomode, iomode) + __field(int, layoutstateid_seq) + __field(u32, layoutstateid_hash) + __field(long, lseg) + ), + TP_fast_assign( + __entry->dev = inode->i_sb->s_dev; + __entry->fileid = NFS_FILEID(inode); + __entry->fhandle = nfs_fhandle_hash(NFS_FH(inode)); + __entry->pos = pos; + __entry->count = count; + __entry->iomode = iomode; + if (lo != NULL) { + __entry->layoutstateid_seq = + be32_to_cpu(lo->plh_stateid.seqid); + __entry->layoutstateid_hash = + nfs_stateid_hash(&lo->plh_stateid); + } else { + __entry->layoutstateid_seq = 0; + __entry->layoutstateid_hash = 0; + } + __entry->lseg = 
(long)lseg; + ), + TP_printk( + "fileid=%02x:%02x:%llu fhandle=0x%08x " + "iomode=%s pos=%llu count=%llu " + "layoutstateid=%d:0x%08x lseg=0x%lx", + MAJOR(__entry->dev), MINOR(__entry->dev), + (unsigned long long)__entry->fileid, + __entry->fhandle, + show_pnfs_iomode(__entry->iomode), + (unsigned long long)__entry->pos, + (unsigned long long)__entry->count, + __entry->layoutstateid_seq, __entry->layoutstateid_hash, + __entry->lseg + ) +); + +#define DEFINE_PNFS_LAYOUT_EVENT(name) \ + DEFINE_EVENT(pnfs_layout_event, name, \ + TP_PROTO(struct inode *inode, \ + loff_t pos, \ + u64 count, \ + enum pnfs_iomode iomode, \ + struct pnfs_layout_hdr *lo, \ + struct pnfs_layout_segment *lseg \ + ), \ + TP_ARGS(inode, pos, count, iomode, lo, lseg)) + +DEFINE_PNFS_LAYOUT_EVENT(pnfs_mds_fallback_pg_init_read); +DEFINE_PNFS_LAYOUT_EVENT(pnfs_mds_fallback_pg_init_write); +DEFINE_PNFS_LAYOUT_EVENT(pnfs_mds_fallback_pg_get_mirror_count); +DEFINE_PNFS_LAYOUT_EVENT(pnfs_mds_fallback_read_done); +DEFINE_PNFS_LAYOUT_EVENT(pnfs_mds_fallback_write_done); +DEFINE_PNFS_LAYOUT_EVENT(pnfs_mds_fallback_read_pagelist); +DEFINE_PNFS_LAYOUT_EVENT(pnfs_mds_fallback_write_pagelist); + #endif /* CONFIG_NFS_V4_1 */ #endif /* _TRACE_NFS4_H */ diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c index 602446158bfb..46a8d636d151 100644 --- a/fs/nfs/nfs4xdr.c +++ b/fs/nfs/nfs4xdr.c @@ -837,6 +837,7 @@ static int decode_layoutget(struct xdr_stream *xdr, struct rpc_rqst *req, #define NFS4_dec_sequence_sz \ (compound_decode_hdr_maxsz + \ decode_sequence_maxsz) +#endif #define NFS4_enc_get_lease_time_sz (compound_encode_hdr_maxsz + \ encode_sequence_maxsz + \ encode_putrootfh_maxsz + \ @@ -845,6 +846,7 @@ static int decode_layoutget(struct xdr_stream *xdr, struct rpc_rqst *req, decode_sequence_maxsz + \ decode_putrootfh_maxsz + \ decode_fsinfo_maxsz) +#if defined(CONFIG_NFS_V4_1) #define NFS4_enc_reclaim_complete_sz (compound_encode_hdr_maxsz + \ encode_sequence_maxsz + \ encode_reclaim_complete_maxsz) @@ -2957,6 +2959,8 @@ static void nfs4_xdr_enc_sequence(struct rpc_rqst *req, struct xdr_stream *xdr, encode_nops(&hdr); } +#endif + /* * a GET_LEASE_TIME request */ @@ -2977,6 +2981,8 @@ static void nfs4_xdr_enc_get_lease_time(struct rpc_rqst *req, encode_nops(&hdr); } +#ifdef CONFIG_NFS_V4_1 + /* * a RECLAIM_COMPLETE request */ @@ -3187,7 +3193,7 @@ static bool __decode_op_hdr(struct xdr_stream *xdr, enum nfs_opnum4 expected, return true; out_status: nfserr = be32_to_cpup(p); - trace_nfs4_xdr_status(opnum, nfserr); + trace_nfs4_xdr_status(xdr, opnum, nfserr); *nfs_retval = nfs4_stat_to_errno(nfserr); return true; out_bad_operation: @@ -3427,7 +3433,7 @@ static int decode_attr_lease_time(struct xdr_stream *xdr, uint32_t *bitmap, uint *res = be32_to_cpup(p); bitmap[0] &= ~FATTR4_WORD0_LEASE_TIME; } - dprintk("%s: file size=%u\n", __func__, (unsigned int)*res); + dprintk("%s: lease time=%u\n", __func__, (unsigned int)*res); return 0; } @@ -7122,6 +7128,8 @@ static int nfs4_xdr_dec_sequence(struct rpc_rqst *rqstp, return status; } +#endif + /* * Decode GET_LEASE_TIME response */ @@ -7143,6 +7151,8 @@ static int nfs4_xdr_dec_get_lease_time(struct rpc_rqst *rqstp, return status; } +#ifdef CONFIG_NFS_V4_1 + /* * Decode RECLAIM_COMPLETE response */ @@ -7551,7 +7561,7 @@ const struct rpc_procinfo nfs4_procedures[] = { PROC41(CREATE_SESSION, enc_create_session, dec_create_session), PROC41(DESTROY_SESSION, enc_destroy_session, dec_destroy_session), PROC41(SEQUENCE, enc_sequence, dec_sequence), - PROC41(GET_LEASE_TIME, enc_get_lease_time, 
dec_get_lease_time), + PROC(GET_LEASE_TIME, enc_get_lease_time, dec_get_lease_time), PROC41(RECLAIM_COMPLETE,enc_reclaim_complete, dec_reclaim_complete), PROC41(GETDEVICEINFO, enc_getdeviceinfo, dec_getdeviceinfo), PROC41(LAYOUTGET, enc_layoutget, dec_layoutget), diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h index a0d6910aa03a..976d4089e267 100644 --- a/fs/nfs/nfstrace.h +++ b/fs/nfs/nfstrace.h @@ -11,6 +11,16 @@ #include <linux/tracepoint.h> #include <linux/iversion.h> +TRACE_DEFINE_ENUM(DT_UNKNOWN); +TRACE_DEFINE_ENUM(DT_FIFO); +TRACE_DEFINE_ENUM(DT_CHR); +TRACE_DEFINE_ENUM(DT_DIR); +TRACE_DEFINE_ENUM(DT_BLK); +TRACE_DEFINE_ENUM(DT_REG); +TRACE_DEFINE_ENUM(DT_LNK); +TRACE_DEFINE_ENUM(DT_SOCK); +TRACE_DEFINE_ENUM(DT_WHT); + #define nfs_show_file_type(ftype) \ __print_symbolic(ftype, \ { DT_UNKNOWN, "UNKNOWN" }, \ @@ -23,25 +33,57 @@ { DT_SOCK, "SOCK" }, \ { DT_WHT, "WHT" }) +TRACE_DEFINE_ENUM(NFS_INO_INVALID_DATA); +TRACE_DEFINE_ENUM(NFS_INO_INVALID_ATIME); +TRACE_DEFINE_ENUM(NFS_INO_INVALID_ACCESS); +TRACE_DEFINE_ENUM(NFS_INO_INVALID_ACL); +TRACE_DEFINE_ENUM(NFS_INO_REVAL_PAGECACHE); +TRACE_DEFINE_ENUM(NFS_INO_REVAL_FORCED); +TRACE_DEFINE_ENUM(NFS_INO_INVALID_LABEL); +TRACE_DEFINE_ENUM(NFS_INO_INVALID_CHANGE); +TRACE_DEFINE_ENUM(NFS_INO_INVALID_CTIME); +TRACE_DEFINE_ENUM(NFS_INO_INVALID_MTIME); +TRACE_DEFINE_ENUM(NFS_INO_INVALID_SIZE); +TRACE_DEFINE_ENUM(NFS_INO_INVALID_OTHER); + #define nfs_show_cache_validity(v) \ __print_flags(v, "|", \ - { NFS_INO_INVALID_ATTR, "INVALID_ATTR" }, \ { NFS_INO_INVALID_DATA, "INVALID_DATA" }, \ { NFS_INO_INVALID_ATIME, "INVALID_ATIME" }, \ { NFS_INO_INVALID_ACCESS, "INVALID_ACCESS" }, \ { NFS_INO_INVALID_ACL, "INVALID_ACL" }, \ { NFS_INO_REVAL_PAGECACHE, "REVAL_PAGECACHE" }, \ { NFS_INO_REVAL_FORCED, "REVAL_FORCED" }, \ - { NFS_INO_INVALID_LABEL, "INVALID_LABEL" }) + { NFS_INO_INVALID_LABEL, "INVALID_LABEL" }, \ + { NFS_INO_INVALID_CHANGE, "INVALID_CHANGE" }, \ + { NFS_INO_INVALID_CTIME, "INVALID_CTIME" }, \ + { NFS_INO_INVALID_MTIME, "INVALID_MTIME" }, \ + { NFS_INO_INVALID_SIZE, "INVALID_SIZE" }, \ + { NFS_INO_INVALID_OTHER, "INVALID_OTHER" }) + +TRACE_DEFINE_ENUM(NFS_INO_ADVISE_RDPLUS); +TRACE_DEFINE_ENUM(NFS_INO_STALE); +TRACE_DEFINE_ENUM(NFS_INO_ACL_LRU_SET); +TRACE_DEFINE_ENUM(NFS_INO_INVALIDATING); +TRACE_DEFINE_ENUM(NFS_INO_FSCACHE); +TRACE_DEFINE_ENUM(NFS_INO_FSCACHE_LOCK); +TRACE_DEFINE_ENUM(NFS_INO_LAYOUTCOMMIT); +TRACE_DEFINE_ENUM(NFS_INO_LAYOUTCOMMITTING); +TRACE_DEFINE_ENUM(NFS_INO_LAYOUTSTATS); +TRACE_DEFINE_ENUM(NFS_INO_ODIRECT); #define nfs_show_nfsi_flags(v) \ __print_flags(v, "|", \ - { 1 << NFS_INO_ADVISE_RDPLUS, "ADVISE_RDPLUS" }, \ - { 1 << NFS_INO_STALE, "STALE" }, \ - { 1 << NFS_INO_INVALIDATING, "INVALIDATING" }, \ - { 1 << NFS_INO_FSCACHE, "FSCACHE" }, \ - { 1 << NFS_INO_LAYOUTCOMMIT, "NEED_LAYOUTCOMMIT" }, \ - { 1 << NFS_INO_LAYOUTCOMMITTING, "LAYOUTCOMMIT" }) + { BIT(NFS_INO_ADVISE_RDPLUS), "ADVISE_RDPLUS" }, \ + { BIT(NFS_INO_STALE), "STALE" }, \ + { BIT(NFS_INO_ACL_LRU_SET), "ACL_LRU_SET" }, \ + { BIT(NFS_INO_INVALIDATING), "INVALIDATING" }, \ + { BIT(NFS_INO_FSCACHE), "FSCACHE" }, \ + { BIT(NFS_INO_FSCACHE_LOCK), "FSCACHE_LOCK" }, \ + { BIT(NFS_INO_LAYOUTCOMMIT), "NEED_LAYOUTCOMMIT" }, \ + { BIT(NFS_INO_LAYOUTCOMMITTING), "LAYOUTCOMMIT" }, \ + { BIT(NFS_INO_LAYOUTSTATS), "LAYOUTSTATS" }, \ + { BIT(NFS_INO_ODIRECT), "ODIRECT" }) DECLARE_EVENT_CLASS(nfs_inode_event, TP_PROTO( @@ -83,7 +125,7 @@ DECLARE_EVENT_CLASS(nfs_inode_event_done, TP_ARGS(inode, error), TP_STRUCT__entry( - __field(int, error) + __field(unsigned 
long, error) __field(dev_t, dev) __field(u32, fhandle) __field(unsigned char, type) @@ -96,7 +138,7 @@ DECLARE_EVENT_CLASS(nfs_inode_event_done, TP_fast_assign( const struct nfs_inode *nfsi = NFS_I(inode); - __entry->error = error; + __entry->error = error < 0 ? -error : 0; __entry->dev = inode->i_sb->s_dev; __entry->fileid = nfsi->fileid; __entry->fhandle = nfs_fhandle_hash(&nfsi->fh); @@ -108,10 +150,10 @@ DECLARE_EVENT_CLASS(nfs_inode_event_done, ), TP_printk( - "error=%d fileid=%02x:%02x:%llu fhandle=0x%08x " + "error=%ld (%s) fileid=%02x:%02x:%llu fhandle=0x%08x " "type=%u (%s) version=%llu size=%lld " - "cache_validity=%lu (%s) nfs_flags=%ld (%s)", - __entry->error, + "cache_validity=0x%lx (%s) nfs_flags=0x%lx (%s)", + -__entry->error, nfs_show_status(__entry->error), MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long long)__entry->fileid, __entry->fhandle, @@ -158,13 +200,41 @@ DEFINE_NFS_INODE_EVENT_DONE(nfs_fsync_exit); DEFINE_NFS_INODE_EVENT(nfs_access_enter); DEFINE_NFS_INODE_EVENT_DONE(nfs_access_exit); +TRACE_DEFINE_ENUM(LOOKUP_FOLLOW); +TRACE_DEFINE_ENUM(LOOKUP_DIRECTORY); +TRACE_DEFINE_ENUM(LOOKUP_AUTOMOUNT); +TRACE_DEFINE_ENUM(LOOKUP_PARENT); +TRACE_DEFINE_ENUM(LOOKUP_REVAL); +TRACE_DEFINE_ENUM(LOOKUP_RCU); +TRACE_DEFINE_ENUM(LOOKUP_NO_REVAL); +TRACE_DEFINE_ENUM(LOOKUP_NO_EVAL); +TRACE_DEFINE_ENUM(LOOKUP_OPEN); +TRACE_DEFINE_ENUM(LOOKUP_CREATE); +TRACE_DEFINE_ENUM(LOOKUP_EXCL); +TRACE_DEFINE_ENUM(LOOKUP_RENAME_TARGET); +TRACE_DEFINE_ENUM(LOOKUP_JUMPED); +TRACE_DEFINE_ENUM(LOOKUP_ROOT); +TRACE_DEFINE_ENUM(LOOKUP_EMPTY); +TRACE_DEFINE_ENUM(LOOKUP_DOWN); + #define show_lookup_flags(flags) \ - __print_flags((unsigned long)flags, "|", \ - { LOOKUP_AUTOMOUNT, "AUTOMOUNT" }, \ + __print_flags(flags, "|", \ + { LOOKUP_FOLLOW, "FOLLOW" }, \ { LOOKUP_DIRECTORY, "DIRECTORY" }, \ + { LOOKUP_AUTOMOUNT, "AUTOMOUNT" }, \ + { LOOKUP_PARENT, "PARENT" }, \ + { LOOKUP_REVAL, "REVAL" }, \ + { LOOKUP_RCU, "RCU" }, \ + { LOOKUP_NO_REVAL, "NO_REVAL" }, \ + { LOOKUP_NO_EVAL, "NO_EVAL" }, \ { LOOKUP_OPEN, "OPEN" }, \ { LOOKUP_CREATE, "CREATE" }, \ - { LOOKUP_EXCL, "EXCL" }) + { LOOKUP_EXCL, "EXCL" }, \ + { LOOKUP_RENAME_TARGET, "RENAME_TARGET" }, \ + { LOOKUP_JUMPED, "JUMPED" }, \ + { LOOKUP_ROOT, "ROOT" }, \ + { LOOKUP_EMPTY, "EMPTY" }, \ + { LOOKUP_DOWN, "DOWN" }) DECLARE_EVENT_CLASS(nfs_lookup_event, TP_PROTO( @@ -176,7 +246,7 @@ DECLARE_EVENT_CLASS(nfs_lookup_event, TP_ARGS(dir, dentry, flags), TP_STRUCT__entry( - __field(unsigned int, flags) + __field(unsigned long, flags) __field(dev_t, dev) __field(u64, dir) __string(name, dentry->d_name.name) @@ -190,7 +260,7 @@ DECLARE_EVENT_CLASS(nfs_lookup_event, ), TP_printk( - "flags=%u (%s) name=%02x:%02x:%llu/%s", + "flags=0x%lx (%s) name=%02x:%02x:%llu/%s", __entry->flags, show_lookup_flags(__entry->flags), MAJOR(__entry->dev), MINOR(__entry->dev), @@ -219,8 +289,8 @@ DECLARE_EVENT_CLASS(nfs_lookup_event_done, TP_ARGS(dir, dentry, flags, error), TP_STRUCT__entry( - __field(int, error) - __field(unsigned int, flags) + __field(unsigned long, error) + __field(unsigned long, flags) __field(dev_t, dev) __field(u64, dir) __string(name, dentry->d_name.name) @@ -229,14 +299,14 @@ DECLARE_EVENT_CLASS(nfs_lookup_event_done, TP_fast_assign( __entry->dev = dir->i_sb->s_dev; __entry->dir = NFS_FILEID(dir); - __entry->error = error; + __entry->error = error < 0 ? 
-error : 0; __entry->flags = flags; __assign_str(name, dentry->d_name.name); ), TP_printk( - "error=%d flags=%u (%s) name=%02x:%02x:%llu/%s", - __entry->error, + "error=%ld (%s) flags=0x%lx (%s) name=%02x:%02x:%llu/%s", + -__entry->error, nfs_show_status(__entry->error), __entry->flags, show_lookup_flags(__entry->flags), MAJOR(__entry->dev), MINOR(__entry->dev), @@ -260,15 +330,43 @@ DEFINE_NFS_LOOKUP_EVENT_DONE(nfs_lookup_exit); DEFINE_NFS_LOOKUP_EVENT(nfs_lookup_revalidate_enter); DEFINE_NFS_LOOKUP_EVENT_DONE(nfs_lookup_revalidate_exit); +TRACE_DEFINE_ENUM(O_WRONLY); +TRACE_DEFINE_ENUM(O_RDWR); +TRACE_DEFINE_ENUM(O_CREAT); +TRACE_DEFINE_ENUM(O_EXCL); +TRACE_DEFINE_ENUM(O_NOCTTY); +TRACE_DEFINE_ENUM(O_TRUNC); +TRACE_DEFINE_ENUM(O_APPEND); +TRACE_DEFINE_ENUM(O_NONBLOCK); +TRACE_DEFINE_ENUM(O_DSYNC); +TRACE_DEFINE_ENUM(O_DIRECT); +TRACE_DEFINE_ENUM(O_LARGEFILE); +TRACE_DEFINE_ENUM(O_DIRECTORY); +TRACE_DEFINE_ENUM(O_NOFOLLOW); +TRACE_DEFINE_ENUM(O_NOATIME); +TRACE_DEFINE_ENUM(O_CLOEXEC); + #define show_open_flags(flags) \ - __print_flags((unsigned long)flags, "|", \ + __print_flags(flags, "|", \ + { O_WRONLY, "O_WRONLY" }, \ + { O_RDWR, "O_RDWR" }, \ { O_CREAT, "O_CREAT" }, \ { O_EXCL, "O_EXCL" }, \ + { O_NOCTTY, "O_NOCTTY" }, \ { O_TRUNC, "O_TRUNC" }, \ { O_APPEND, "O_APPEND" }, \ + { O_NONBLOCK, "O_NONBLOCK" }, \ { O_DSYNC, "O_DSYNC" }, \ { O_DIRECT, "O_DIRECT" }, \ - { O_DIRECTORY, "O_DIRECTORY" }) + { O_LARGEFILE, "O_LARGEFILE" }, \ + { O_DIRECTORY, "O_DIRECTORY" }, \ + { O_NOFOLLOW, "O_NOFOLLOW" }, \ + { O_NOATIME, "O_NOATIME" }, \ + { O_CLOEXEC, "O_CLOEXEC" }) + +TRACE_DEFINE_ENUM(FMODE_READ); +TRACE_DEFINE_ENUM(FMODE_WRITE); +TRACE_DEFINE_ENUM(FMODE_EXEC); #define show_fmode_flags(mode) \ __print_flags(mode, "|", \ @@ -286,7 +384,7 @@ TRACE_EVENT(nfs_atomic_open_enter, TP_ARGS(dir, ctx, flags), TP_STRUCT__entry( - __field(unsigned int, flags) + __field(unsigned long, flags) __field(unsigned int, fmode) __field(dev_t, dev) __field(u64, dir) @@ -302,7 +400,7 @@ TRACE_EVENT(nfs_atomic_open_enter, ), TP_printk( - "flags=%u (%s) fmode=%s name=%02x:%02x:%llu/%s", + "flags=0x%lx (%s) fmode=%s name=%02x:%02x:%llu/%s", __entry->flags, show_open_flags(__entry->flags), show_fmode_flags(__entry->fmode), @@ -323,8 +421,8 @@ TRACE_EVENT(nfs_atomic_open_exit, TP_ARGS(dir, ctx, flags, error), TP_STRUCT__entry( - __field(int, error) - __field(unsigned int, flags) + __field(unsigned long, error) + __field(unsigned long, flags) __field(unsigned int, fmode) __field(dev_t, dev) __field(u64, dir) @@ -332,7 +430,7 @@ TRACE_EVENT(nfs_atomic_open_exit, ), TP_fast_assign( - __entry->error = error; + __entry->error = -error; __entry->dev = dir->i_sb->s_dev; __entry->dir = NFS_FILEID(dir); __entry->flags = flags; @@ -341,9 +439,9 @@ TRACE_EVENT(nfs_atomic_open_exit, ), TP_printk( - "error=%d flags=%u (%s) fmode=%s " + "error=%ld (%s) flags=0x%lx (%s) fmode=%s " "name=%02x:%02x:%llu/%s", - __entry->error, + -__entry->error, nfs_show_status(__entry->error), __entry->flags, show_open_flags(__entry->flags), show_fmode_flags(__entry->fmode), @@ -363,7 +461,7 @@ TRACE_EVENT(nfs_create_enter, TP_ARGS(dir, dentry, flags), TP_STRUCT__entry( - __field(unsigned int, flags) + __field(unsigned long, flags) __field(dev_t, dev) __field(u64, dir) __string(name, dentry->d_name.name) @@ -377,7 +475,7 @@ TRACE_EVENT(nfs_create_enter, ), TP_printk( - "flags=%u (%s) name=%02x:%02x:%llu/%s", + "flags=0x%lx (%s) name=%02x:%02x:%llu/%s", __entry->flags, show_open_flags(__entry->flags), MAJOR(__entry->dev), MINOR(__entry->dev), @@ 
-397,15 +495,15 @@ TRACE_EVENT(nfs_create_exit, TP_ARGS(dir, dentry, flags, error), TP_STRUCT__entry( - __field(int, error) - __field(unsigned int, flags) + __field(unsigned long, error) + __field(unsigned long, flags) __field(dev_t, dev) __field(u64, dir) __string(name, dentry->d_name.name) ), TP_fast_assign( - __entry->error = error; + __entry->error = -error; __entry->dev = dir->i_sb->s_dev; __entry->dir = NFS_FILEID(dir); __entry->flags = flags; @@ -413,8 +511,8 @@ TRACE_EVENT(nfs_create_exit, ), TP_printk( - "error=%d flags=%u (%s) name=%02x:%02x:%llu/%s", - __entry->error, + "error=%ld (%s) flags=0x%lx (%s) name=%02x:%02x:%llu/%s", + -__entry->error, nfs_show_status(__entry->error), __entry->flags, show_open_flags(__entry->flags), MAJOR(__entry->dev), MINOR(__entry->dev), @@ -469,7 +567,7 @@ DECLARE_EVENT_CLASS(nfs_directory_event_done, TP_ARGS(dir, dentry, error), TP_STRUCT__entry( - __field(int, error) + __field(unsigned long, error) __field(dev_t, dev) __field(u64, dir) __string(name, dentry->d_name.name) @@ -478,13 +576,13 @@ DECLARE_EVENT_CLASS(nfs_directory_event_done, TP_fast_assign( __entry->dev = dir->i_sb->s_dev; __entry->dir = NFS_FILEID(dir); - __entry->error = error; + __entry->error = error < 0 ? -error : 0; __assign_str(name, dentry->d_name.name); ), TP_printk( - "error=%d name=%02x:%02x:%llu/%s", - __entry->error, + "error=%ld (%s) name=%02x:%02x:%llu/%s", + -__entry->error, nfs_show_status(__entry->error), MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long long)__entry->dir, __get_str(name) @@ -557,7 +655,7 @@ TRACE_EVENT(nfs_link_exit, TP_ARGS(inode, dir, dentry, error), TP_STRUCT__entry( - __field(int, error) + __field(unsigned long, error) __field(dev_t, dev) __field(u64, fileid) __field(u64, dir) @@ -568,13 +666,13 @@ TRACE_EVENT(nfs_link_exit, __entry->dev = inode->i_sb->s_dev; __entry->fileid = NFS_FILEID(inode); __entry->dir = NFS_FILEID(dir); - __entry->error = error; + __entry->error = error < 0 ? 
-error : 0; __assign_str(name, dentry->d_name.name); ), TP_printk( - "error=%d fileid=%02x:%02x:%llu name=%02x:%02x:%llu/%s", - __entry->error, + "error=%ld (%s) fileid=%02x:%02x:%llu name=%02x:%02x:%llu/%s", + -__entry->error, nfs_show_status(__entry->error), MAJOR(__entry->dev), MINOR(__entry->dev), __entry->fileid, MAJOR(__entry->dev), MINOR(__entry->dev), @@ -642,7 +740,7 @@ DECLARE_EVENT_CLASS(nfs_rename_event_done, TP_STRUCT__entry( __field(dev_t, dev) - __field(int, error) + __field(unsigned long, error) __field(u64, old_dir) __string(old_name, old_dentry->d_name.name) __field(u64, new_dir) @@ -651,17 +749,17 @@ DECLARE_EVENT_CLASS(nfs_rename_event_done, TP_fast_assign( __entry->dev = old_dir->i_sb->s_dev; + __entry->error = -error; __entry->old_dir = NFS_FILEID(old_dir); __entry->new_dir = NFS_FILEID(new_dir); - __entry->error = error; __assign_str(old_name, old_dentry->d_name.name); __assign_str(new_name, new_dentry->d_name.name); ), TP_printk( - "error=%d old_name=%02x:%02x:%llu/%s " + "error=%ld (%s) old_name=%02x:%02x:%llu/%s " "new_name=%02x:%02x:%llu/%s", - __entry->error, + -__entry->error, nfs_show_status(__entry->error), MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long long)__entry->old_dir, __get_str(old_name), @@ -697,7 +795,7 @@ TRACE_EVENT(nfs_sillyrename_unlink, TP_STRUCT__entry( __field(dev_t, dev) - __field(int, error) + __field(unsigned long, error) __field(u64, dir) __dynamic_array(char, name, data->args.name.len + 1) ), @@ -707,15 +805,15 @@ TRACE_EVENT(nfs_sillyrename_unlink, size_t len = data->args.name.len; __entry->dev = dir->i_sb->s_dev; __entry->dir = NFS_FILEID(dir); - __entry->error = error; + __entry->error = -error; memcpy(__get_str(name), data->args.name.name, len); __get_str(name)[len] = 0; ), TP_printk( - "error=%d name=%02x:%02x:%llu/%s", - __entry->error, + "error=%ld (%s) name=%02x:%02x:%llu/%s", + -__entry->error, nfs_show_status(__entry->error), MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long long)__entry->dir, __get_str(name) @@ -974,6 +1072,8 @@ TRACE_DEFINE_ENUM(NFSERR_PERM); TRACE_DEFINE_ENUM(NFSERR_NOENT); TRACE_DEFINE_ENUM(NFSERR_IO); TRACE_DEFINE_ENUM(NFSERR_NXIO); +TRACE_DEFINE_ENUM(ECHILD); +TRACE_DEFINE_ENUM(NFSERR_EAGAIN); TRACE_DEFINE_ENUM(NFSERR_ACCES); TRACE_DEFINE_ENUM(NFSERR_EXIST); TRACE_DEFINE_ENUM(NFSERR_XDEV); @@ -985,6 +1085,7 @@ TRACE_DEFINE_ENUM(NFSERR_FBIG); TRACE_DEFINE_ENUM(NFSERR_NOSPC); TRACE_DEFINE_ENUM(NFSERR_ROFS); TRACE_DEFINE_ENUM(NFSERR_MLINK); +TRACE_DEFINE_ENUM(NFSERR_OPNOTSUPP); TRACE_DEFINE_ENUM(NFSERR_NAMETOOLONG); TRACE_DEFINE_ENUM(NFSERR_NOTEMPTY); TRACE_DEFINE_ENUM(NFSERR_DQUOT); @@ -1007,6 +1108,8 @@ TRACE_DEFINE_ENUM(NFSERR_JUKEBOX); { NFSERR_NOENT, "NOENT" }, \ { NFSERR_IO, "IO" }, \ { NFSERR_NXIO, "NXIO" }, \ + { ECHILD, "CHILD" }, \ + { NFSERR_EAGAIN, "AGAIN" }, \ { NFSERR_ACCES, "ACCES" }, \ { NFSERR_EXIST, "EXIST" }, \ { NFSERR_XDEV, "XDEV" }, \ @@ -1018,6 +1121,7 @@ TRACE_DEFINE_ENUM(NFSERR_JUKEBOX); { NFSERR_NOSPC, "NOSPC" }, \ { NFSERR_ROFS, "ROFS" }, \ { NFSERR_MLINK, "MLINK" }, \ + { NFSERR_OPNOTSUPP, "OPNOTSUPP" }, \ { NFSERR_NAMETOOLONG, "NAMETOOLONG" }, \ { NFSERR_NOTEMPTY, "NOTEMPTY" }, \ { NFSERR_DQUOT, "DQUOT" }, \ @@ -1035,22 +1139,33 @@ TRACE_DEFINE_ENUM(NFSERR_JUKEBOX); TRACE_EVENT(nfs_xdr_status, TP_PROTO( + const struct xdr_stream *xdr, int error ), - TP_ARGS(error), + TP_ARGS(xdr, error), TP_STRUCT__entry( - __field(int, error) + __field(unsigned int, task_id) + __field(unsigned int, client_id) + __field(u32, xid) + __field(unsigned long, error) ), TP_fast_assign( 
+ const struct rpc_rqst *rqstp = xdr->rqst; + const struct rpc_task *task = rqstp->rq_task; + + __entry->task_id = task->tk_pid; + __entry->client_id = task->tk_client->cl_clid; + __entry->xid = be32_to_cpu(rqstp->rq_xid); __entry->error = error; ), TP_printk( - "error=%d (%s)", - __entry->error, nfs_show_status(__entry->error) + "task:%u@%d xid=0x%08x error=%ld (%s)", + __entry->task_id, __entry->client_id, __entry->xid, + -__entry->error, nfs_show_status(__entry->error) ) ); diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c index 6ef5278326b6..ed4e1b07447b 100644 --- a/fs/nfs/pagelist.c +++ b/fs/nfs/pagelist.c @@ -77,7 +77,7 @@ void nfs_set_pgio_error(struct nfs_pgio_header *hdr, int error, loff_t pos) static inline struct nfs_page * nfs_page_alloc(void) { - struct nfs_page *p = kmem_cache_zalloc(nfs_page_cachep, GFP_NOIO); + struct nfs_page *p = kmem_cache_zalloc(nfs_page_cachep, GFP_KERNEL); if (p) INIT_LIST_HEAD(&p->wb_list); return p; @@ -775,8 +775,6 @@ int nfs_generic_pgio(struct nfs_pageio_descriptor *desc, if (pagecount <= ARRAY_SIZE(pg_array->page_array)) pg_array->pagevec = pg_array->page_array; else { - if (hdr->rw_mode == FMODE_WRITE) - gfp_flags = GFP_NOIO; pg_array->pagevec = kcalloc(pagecount, sizeof(struct page *), gfp_flags); if (!pg_array->pagevec) { pg_array->npages = 0; @@ -851,7 +849,7 @@ nfs_pageio_alloc_mirrors(struct nfs_pageio_descriptor *desc, desc->pg_mirrors_dynamic = NULL; if (mirror_count == 1) return desc->pg_mirrors_static; - ret = kmalloc_array(mirror_count, sizeof(*ret), GFP_NOFS); + ret = kmalloc_array(mirror_count, sizeof(*ret), GFP_KERNEL); if (ret != NULL) { for (i = 0; i < mirror_count; i++) nfs_pageio_mirror_init(&ret[i], desc->pg_bsize); diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c index 83722e936b4a..75bd5b552ba4 100644 --- a/fs/nfs/pnfs.c +++ b/fs/nfs/pnfs.c @@ -1890,7 +1890,7 @@ lookup_again: spin_unlock(&ino->i_lock); lseg = ERR_PTR(wait_var_event_killable(&lo->plh_outstanding, !atomic_read(&lo->plh_outstanding))); - if (IS_ERR(lseg) || !list_empty(&lo->plh_segs)) + if (IS_ERR(lseg)) goto out_put_layout_hdr; pnfs_put_layout_hdr(lo); goto lookup_again; @@ -1915,6 +1915,7 @@ lookup_again: * stateid. */ if (test_bit(NFS_LAYOUT_INVALID_STID, &lo->plh_flags)) { + int status; /* * The first layoutget for the file. Need to serialize per @@ -1934,13 +1935,20 @@ lookup_again: } first = true; - if (nfs4_select_rw_stateid(ctx->state, + status = nfs4_select_rw_stateid(ctx->state, iomode == IOMODE_RW ? 
FMODE_WRITE : FMODE_READ, - NULL, &stateid, NULL) != 0) { + NULL, &stateid, NULL); + if (status != 0) { trace_pnfs_update_layout(ino, pos, count, iomode, lo, lseg, PNFS_UPDATE_LAYOUT_INVALID_OPEN); - goto out_unlock; + if (status != -EAGAIN) + goto out_unlock; + spin_unlock(&ino->i_lock); + nfs4_schedule_stateid_recovery(server, ctx->state); + pnfs_clear_first_layoutget(lo); + pnfs_put_layout_hdr(lo); + goto lookup_again; } } else { nfs4_stateid_copy(&stateid, &lo->plh_stateid); @@ -2029,6 +2037,8 @@ lookup_again: out_put_layout_hdr: if (first) pnfs_clear_first_layoutget(lo); + trace_pnfs_update_layout(ino, pos, count, iomode, lo, lseg, + PNFS_UPDATE_LAYOUT_EXIT); pnfs_put_layout_hdr(lo); out: dprintk("%s: inode %s/%llu pNFS layout segment %s for " @@ -2468,7 +2478,7 @@ pnfs_generic_pg_init_write(struct nfs_pageio_descriptor *pgio, wb_size, IOMODE_RW, false, - GFP_NOFS); + GFP_KERNEL); if (IS_ERR(pgio->pg_lseg)) { pgio->pg_error = PTR_ERR(pgio->pg_lseg); pgio->pg_lseg = NULL; diff --git a/fs/nfs/super.c b/fs/nfs/super.c index f88ddac2dcdf..3683d2b1cc8e 100644 --- a/fs/nfs/super.c +++ b/fs/nfs/super.c @@ -77,6 +77,8 @@ #define NFS_DEFAULT_VERSION 2 #endif +#define NFS_MAX_CONNECTIONS 16 + enum { /* Mount options that take no arguments */ Opt_soft, Opt_softerr, Opt_hard, @@ -108,6 +110,7 @@ enum { Opt_nfsvers, Opt_sec, Opt_proto, Opt_mountproto, Opt_mounthost, Opt_addr, Opt_mountaddr, Opt_clientaddr, + Opt_nconnect, Opt_lookupcache, Opt_fscache_uniq, Opt_local_lock, @@ -181,6 +184,8 @@ static const match_table_t nfs_mount_option_tokens = { { Opt_mounthost, "mounthost=%s" }, { Opt_mountaddr, "mountaddr=%s" }, + { Opt_nconnect, "nconnect=%s" }, + { Opt_lookupcache, "lookupcache=%s" }, { Opt_fscache_uniq, "fsc=%s" }, { Opt_local_lock, "local_lock=%s" }, @@ -582,7 +587,7 @@ static void nfs_show_mountd_options(struct seq_file *m, struct nfs_server *nfss, } default: if (showdefaults) - seq_printf(m, ",mountaddr=unspecified"); + seq_puts(m, ",mountaddr=unspecified"); } if (nfss->mountd_version || showdefaults) @@ -673,6 +678,8 @@ static void nfs_show_mount_options(struct seq_file *m, struct nfs_server *nfss, seq_printf(m, ",proto=%s", rpc_peeraddr2str(nfss->client, RPC_DISPLAY_NETID)); rcu_read_unlock(); + if (clp->cl_nconnect > 0) + seq_printf(m, ",nconnect=%u", clp->cl_nconnect); if (version == 4) { if (nfss->port != NFS_PORT) seq_printf(m, ",port=%u", nfss->port); @@ -690,29 +697,29 @@ static void nfs_show_mount_options(struct seq_file *m, struct nfs_server *nfss, nfs_show_nfsv4_options(m, nfss, showdefaults); if (nfss->options & NFS_OPTION_FSCACHE) - seq_printf(m, ",fsc"); + seq_puts(m, ",fsc"); if (nfss->options & NFS_OPTION_MIGRATION) - seq_printf(m, ",migration"); + seq_puts(m, ",migration"); if (nfss->flags & NFS_MOUNT_LOOKUP_CACHE_NONEG) { if (nfss->flags & NFS_MOUNT_LOOKUP_CACHE_NONE) - seq_printf(m, ",lookupcache=none"); + seq_puts(m, ",lookupcache=none"); else - seq_printf(m, ",lookupcache=pos"); + seq_puts(m, ",lookupcache=pos"); } local_flock = nfss->flags & NFS_MOUNT_LOCAL_FLOCK; local_fcntl = nfss->flags & NFS_MOUNT_LOCAL_FCNTL; if (!local_flock && !local_fcntl) - seq_printf(m, ",local_lock=none"); + seq_puts(m, ",local_lock=none"); else if (local_flock && local_fcntl) - seq_printf(m, ",local_lock=all"); + seq_puts(m, ",local_lock=all"); else if (local_flock) - seq_printf(m, ",local_lock=flock"); + seq_puts(m, ",local_lock=flock"); else - seq_printf(m, ",local_lock=posix"); + seq_puts(m, ",local_lock=posix"); } /* @@ -735,11 +742,21 @@ int nfs_show_options(struct seq_file *m, struct 
dentry *root) EXPORT_SYMBOL_GPL(nfs_show_options); #if IS_ENABLED(CONFIG_NFS_V4) +static void show_lease(struct seq_file *m, struct nfs_server *server) +{ + struct nfs_client *clp = server->nfs_client; + unsigned long expire; + + seq_printf(m, ",lease_time=%ld", clp->cl_lease_time / HZ); + expire = clp->cl_last_renewal + clp->cl_lease_time; + seq_printf(m, ",lease_expired=%ld", + time_after(expire, jiffies) ? 0 : (jiffies - expire) / HZ); +} #ifdef CONFIG_NFS_V4_1 static void show_sessions(struct seq_file *m, struct nfs_server *server) { if (nfs4_has_session(server->nfs_client)) - seq_printf(m, ",sessions"); + seq_puts(m, ",sessions"); } #else static void show_sessions(struct seq_file *m, struct nfs_server *server) {} @@ -816,7 +833,7 @@ int nfs_show_stats(struct seq_file *m, struct dentry *root) /* * Display all mount option settings */ - seq_printf(m, "\n\topts:\t"); + seq_puts(m, "\n\topts:\t"); seq_puts(m, sb_rdonly(root->d_sb) ? "ro" : "rw"); seq_puts(m, root->d_sb->s_flags & SB_SYNCHRONOUS ? ",sync" : ""); seq_puts(m, root->d_sb->s_flags & SB_NOATIME ? ",noatime" : ""); @@ -827,7 +844,7 @@ int nfs_show_stats(struct seq_file *m, struct dentry *root) show_implementation_id(m, nfss); - seq_printf(m, "\n\tcaps:\t"); + seq_puts(m, "\n\tcaps:\t"); seq_printf(m, "caps=0x%x", nfss->caps); seq_printf(m, ",wtmult=%u", nfss->wtmult); seq_printf(m, ",dtsize=%u", nfss->dtsize); @@ -836,13 +853,14 @@ int nfs_show_stats(struct seq_file *m, struct dentry *root) #if IS_ENABLED(CONFIG_NFS_V4) if (nfss->nfs_client->rpc_ops->version == 4) { - seq_printf(m, "\n\tnfsv4:\t"); + seq_puts(m, "\n\tnfsv4:\t"); seq_printf(m, "bm0=0x%x", nfss->attr_bitmask[0]); seq_printf(m, ",bm1=0x%x", nfss->attr_bitmask[1]); seq_printf(m, ",bm2=0x%x", nfss->attr_bitmask[2]); seq_printf(m, ",acl=0x%x", nfss->acl_bitmask); show_sessions(m, nfss); show_pnfs(m, nfss); + show_lease(m, nfss); } #endif @@ -874,20 +892,20 @@ int nfs_show_stats(struct seq_file *m, struct dentry *root) preempt_enable(); } - seq_printf(m, "\n\tevents:\t"); + seq_puts(m, "\n\tevents:\t"); for (i = 0; i < __NFSIOS_COUNTSMAX; i++) seq_printf(m, "%lu ", totals.events[i]); - seq_printf(m, "\n\tbytes:\t"); + seq_puts(m, "\n\tbytes:\t"); for (i = 0; i < __NFSIOS_BYTESMAX; i++) seq_printf(m, "%Lu ", totals.bytes[i]); #ifdef CONFIG_NFS_FSCACHE if (nfss->options & NFS_OPTION_FSCACHE) { - seq_printf(m, "\n\tfsc:\t"); + seq_puts(m, "\n\tfsc:\t"); for (i = 0; i < __NFSIOS_FSCACHEMAX; i++) seq_printf(m, "%Lu ", totals.fscache[i]); } #endif - seq_printf(m, "\n"); + seq_putc(m, '\n'); rpc_clnt_show_stats(m, nfss->client); @@ -1549,6 +1567,11 @@ static int nfs_parse_mount_options(char *raw, if (mnt->mount_server.addrlen == 0) goto out_invalid_address; break; + case Opt_nconnect: + if (nfs_get_option_ul_bound(args, &option, 1, NFS_MAX_CONNECTIONS)) + goto out_invalid_value; + mnt->nfs_server.nconnect = option; + break; case Opt_lookupcache: string = match_strdup(args); if (string == NULL) diff --git a/fs/nfs/sysfs.c b/fs/nfs/sysfs.c new file mode 100644 index 000000000000..4f3390b20239 --- /dev/null +++ b/fs/nfs/sysfs.c @@ -0,0 +1,187 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2019 Hammerspace Inc + */ + +#include <linux/module.h> +#include <linux/kobject.h> +#include <linux/sysfs.h> +#include <linux/fs.h> +#include <linux/slab.h> +#include <linux/netdevice.h> +#include <linux/string.h> +#include <linux/nfs_fs.h> +#include <linux/rcupdate.h> + +#include "nfs4_fs.h" +#include "netns.h" +#include "sysfs.h" + +struct kobject *nfs_client_kobj; +static 
struct kset *nfs_client_kset; + +static void nfs_netns_object_release(struct kobject *kobj) +{ + kfree(kobj); +} + +static const struct kobj_ns_type_operations *nfs_netns_object_child_ns_type( + struct kobject *kobj) +{ + return &net_ns_type_operations; +} + +static struct kobj_type nfs_netns_object_type = { + .release = nfs_netns_object_release, + .sysfs_ops = &kobj_sysfs_ops, + .child_ns_type = nfs_netns_object_child_ns_type, +}; + +static struct kobject *nfs_netns_object_alloc(const char *name, + struct kset *kset, struct kobject *parent) +{ + struct kobject *kobj; + + kobj = kzalloc(sizeof(*kobj), GFP_KERNEL); + if (kobj) { + kobj->kset = kset; + if (kobject_init_and_add(kobj, &nfs_netns_object_type, + parent, "%s", name) == 0) + return kobj; + kobject_put(kobj); + } + return NULL; +} + +int nfs_sysfs_init(void) +{ + nfs_client_kset = kset_create_and_add("nfs", NULL, fs_kobj); + if (!nfs_client_kset) + return -ENOMEM; + nfs_client_kobj = nfs_netns_object_alloc("net", nfs_client_kset, NULL); + if (!nfs_client_kobj) { + kset_unregister(nfs_client_kset); + nfs_client_kset = NULL; + return -ENOMEM; + } + return 0; +} + +void nfs_sysfs_exit(void) +{ + kobject_put(nfs_client_kobj); + kset_unregister(nfs_client_kset); +} + +static ssize_t nfs_netns_identifier_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + struct nfs_netns_client *c = container_of(kobj, + struct nfs_netns_client, + kobject); + return scnprintf(buf, PAGE_SIZE, "%s\n", c->identifier); +} + +/* Strip trailing '\n' */ +static size_t nfs_string_strip(const char *c, size_t len) +{ + while (len > 0 && c[len-1] == '\n') + --len; + return len; +} + +static ssize_t nfs_netns_identifier_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + struct nfs_netns_client *c = container_of(kobj, + struct nfs_netns_client, + kobject); + const char *old; + char *p; + size_t len; + + len = nfs_string_strip(buf, min_t(size_t, count, CONTAINER_ID_MAXLEN)); + if (!len) + return 0; + p = kmemdup_nul(buf, len, GFP_KERNEL); + if (!p) + return -ENOMEM; + old = xchg(&c->identifier, p); + if (old) { + synchronize_rcu(); + kfree(old); + } + return count; +} + +static void nfs_netns_client_release(struct kobject *kobj) +{ + struct nfs_netns_client *c = container_of(kobj, + struct nfs_netns_client, + kobject); + + if (c->identifier) + kfree(c->identifier); + kfree(c); +} + +static const void *nfs_netns_client_namespace(struct kobject *kobj) +{ + return container_of(kobj, struct nfs_netns_client, kobject)->net; +} + +static struct kobj_attribute nfs_netns_client_id = __ATTR(identifier, + 0644, nfs_netns_identifier_show, nfs_netns_identifier_store); + +static struct attribute *nfs_netns_client_attrs[] = { + &nfs_netns_client_id.attr, + NULL, +}; + +static struct kobj_type nfs_netns_client_type = { + .release = nfs_netns_client_release, + .default_attrs = nfs_netns_client_attrs, + .sysfs_ops = &kobj_sysfs_ops, + .namespace = nfs_netns_client_namespace, +}; + +static struct nfs_netns_client *nfs_netns_client_alloc(struct kobject *parent, + struct net *net) +{ + struct nfs_netns_client *p; + + p = kzalloc(sizeof(*p), GFP_KERNEL); + if (p) { + p->net = net; + p->kobject.kset = nfs_client_kset; + if (kobject_init_and_add(&p->kobject, &nfs_netns_client_type, + parent, "nfs_client") == 0) + return p; + kobject_put(&p->kobject); + } + return NULL; +} + +void nfs_netns_sysfs_setup(struct nfs_net *netns, struct net *net) +{ + struct nfs_netns_client *clp; + + clp = nfs_netns_client_alloc(nfs_client_kobj, 
net); + if (clp) { + netns->nfs_client = clp; + kobject_uevent(&clp->kobject, KOBJ_ADD); + } +} + +void nfs_netns_sysfs_destroy(struct nfs_net *netns) +{ + struct nfs_netns_client *clp = netns->nfs_client; + + if (clp) { + kobject_uevent(&clp->kobject, KOBJ_REMOVE); + kobject_del(&clp->kobject); + kobject_put(&clp->kobject); + netns->nfs_client = NULL; + } +} diff --git a/fs/nfs/sysfs.h b/fs/nfs/sysfs.h new file mode 100644 index 000000000000..f1b27411dcc0 --- /dev/null +++ b/fs/nfs/sysfs.h @@ -0,0 +1,25 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2019 Hammerspace Inc + */ + +#ifndef __NFS_SYSFS_H +#define __NFS_SYSFS_H + +#define CONTAINER_ID_MAXLEN (64) + +struct nfs_netns_client { + struct kobject kobject; + struct net *net; + const char *identifier; +}; + +extern struct kobject *nfs_client_kobj; + +extern int nfs_sysfs_init(void); +extern void nfs_sysfs_exit(void); + +void nfs_netns_sysfs_setup(struct nfs_net *netns, struct net *net); +void nfs_netns_sysfs_destroy(struct nfs_net *netns); + +#endif diff --git a/fs/nfs/write.c b/fs/nfs/write.c index 059a7c38bc4f..92d9cadc6102 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -103,7 +103,7 @@ EXPORT_SYMBOL_GPL(nfs_commit_free); static struct nfs_pgio_header *nfs_writehdr_alloc(void) { - struct nfs_pgio_header *p = mempool_alloc(nfs_wdata_mempool, GFP_NOIO); + struct nfs_pgio_header *p = mempool_alloc(nfs_wdata_mempool, GFP_KERNEL); memset(p, 0, sizeof(*p)); p->rw_mode = FMODE_WRITE; @@ -721,12 +721,11 @@ int nfs_writepages(struct address_space *mapping, struct writeback_control *wbc) struct inode *inode = mapping->host; struct nfs_pageio_descriptor pgio; struct nfs_io_completion *ioc; - unsigned int pflags = memalloc_nofs_save(); int err; nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGES); - ioc = nfs_io_completion_alloc(GFP_NOFS); + ioc = nfs_io_completion_alloc(GFP_KERNEL); if (ioc) nfs_io_completion_init(ioc, nfs_io_completion_commit, inode); @@ -737,8 +736,6 @@ int nfs_writepages(struct address_space *mapping, struct writeback_control *wbc) nfs_pageio_complete(&pgio); nfs_io_completion_put(ioc); - memalloc_nofs_restore(pflags); - if (err < 0) goto out_err; err = pgio.pg_error; diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index b74a47169297..06b68b6115bc 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -49,6 +49,7 @@ xfs-y += $(addprefix libxfs/, \ xfs_refcount_btree.o \ xfs_sb.o \ xfs_symlink_remote.o \ + xfs_trans_inode.o \ xfs_trans_resv.o \ xfs_types.o \ ) @@ -107,8 +108,7 @@ xfs-y += xfs_log.o \ xfs_rmap_item.o \ xfs_log_recover.o \ xfs_trans_ail.o \ - xfs_trans_buf.o \ - xfs_trans_inode.o + xfs_trans_buf.o # optional features xfs-$(CONFIG_XFS_QUOTA) += xfs_dquot.o \ diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c index 93d14e47269d..a9ad90926b87 100644 --- a/fs/xfs/xfs_trans_inode.c +++ b/fs/xfs/libxfs/xfs_trans_inode.c @@ -66,6 +66,10 @@ xfs_trans_ichgtime( inode->i_mtime = tv; if (flags & XFS_ICHGTIME_CHG) inode->i_ctime = tv; + if (flags & XFS_ICHGTIME_CREATE) { + ip->i_d.di_crtime.t_sec = (int32_t)tv.tv_sec; + ip->i_d.di_crtime.t_nsec = (int32_t)tv.tv_nsec; + } } /* diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index e93bacbd49ae..28101bbc0b78 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -1197,11 +1197,14 @@ xfs_file_mmap( struct file *filp, struct vm_area_struct *vma) { + struct dax_device *dax_dev; + + dax_dev = xfs_find_daxdev_for_inode(file_inode(filp)); /* - * We don't support synchronous mappings for non-DAX files. 
At least - * until someone comes with a sensible use case. + * We don't support synchronous mappings for non-DAX files and + * for DAX files if underneath dax_device is not synchronous. */ - if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC)) + if (!daxdev_mapping_supported(vma, dax_dev)) return -EOPNOTSUPP; file_accessed(filp); diff --git a/include/linux/ceph/ceph_features.h b/include/linux/ceph/ceph_features.h index 65a38c4a02a1..39e6f4c57580 100644 --- a/include/linux/ceph/ceph_features.h +++ b/include/linux/ceph/ceph_features.h @@ -211,6 +211,7 @@ DEFINE_CEPH_FEATURE_DEPRECATED(63, 1, RESERVED_BROKEN, LUMINOUS) // client-facin CEPH_FEATURE_MON_STATEFUL_SUB | \ CEPH_FEATURE_CRUSH_TUNABLES5 | \ CEPH_FEATURE_NEW_OSDOPREPLY_ENCODING | \ + CEPH_FEATURE_MSG_ADDR2 | \ CEPH_FEATURE_CEPHX_V2) #define CEPH_FEATURES_REQUIRED_DEFAULT 0 diff --git a/include/linux/ceph/ceph_fs.h b/include/linux/ceph/ceph_fs.h index 3ac0feaf2b5e..cb21c5cf12c3 100644 --- a/include/linux/ceph/ceph_fs.h +++ b/include/linux/ceph/ceph_fs.h @@ -682,7 +682,7 @@ extern const char *ceph_cap_op_name(int op); /* flags field in client cap messages (version >= 10) */ #define CEPH_CLIENT_CAPS_SYNC (1<<0) #define CEPH_CLIENT_CAPS_NO_CAPSNAP (1<<1) -#define CEPH_CLIENT_CAPS_PENDING_CAPSNAP (1<<2); +#define CEPH_CLIENT_CAPS_PENDING_CAPSNAP (1<<2) /* * caps message, used for capability callbacks, acks, requests, etc. diff --git a/include/linux/ceph/cls_lock_client.h b/include/linux/ceph/cls_lock_client.h index bea6c77d2093..17bc7584d1fe 100644 --- a/include/linux/ceph/cls_lock_client.h +++ b/include/linux/ceph/cls_lock_client.h @@ -52,4 +52,7 @@ int ceph_cls_lock_info(struct ceph_osd_client *osdc, char *lock_name, u8 *type, char **tag, struct ceph_locker **lockers, u32 *num_lockers); +int ceph_cls_assert_locked(struct ceph_osd_request *req, int which, + char *lock_name, u8 type, char *cookie, char *tag); + #endif diff --git a/include/linux/ceph/decode.h b/include/linux/ceph/decode.h index a6c2a48d42e0..450384fe487c 100644 --- a/include/linux/ceph/decode.h +++ b/include/linux/ceph/decode.h @@ -218,18 +218,27 @@ static inline void ceph_encode_timespec64(struct ceph_timespec *tv, /* * sockaddr_storage <-> ceph_sockaddr */ -static inline void ceph_encode_addr(struct ceph_entity_addr *a) +#define CEPH_ENTITY_ADDR_TYPE_NONE 0 +#define CEPH_ENTITY_ADDR_TYPE_LEGACY __cpu_to_le32(1) + +static inline void ceph_encode_banner_addr(struct ceph_entity_addr *a) { __be16 ss_family = htons(a->in_addr.ss_family); a->in_addr.ss_family = *(__u16 *)&ss_family; + + /* Banner addresses require TYPE_NONE */ + a->type = CEPH_ENTITY_ADDR_TYPE_NONE; } -static inline void ceph_decode_addr(struct ceph_entity_addr *a) +static inline void ceph_decode_banner_addr(struct ceph_entity_addr *a) { __be16 ss_family = *(__be16 *)&a->in_addr.ss_family; a->in_addr.ss_family = ntohs(ss_family); WARN_ON(a->in_addr.ss_family == 512); + a->type = CEPH_ENTITY_ADDR_TYPE_LEGACY; } +extern int ceph_decode_entity_addr(void **p, void *end, + struct ceph_entity_addr *addr); /* * encoders */ diff --git a/include/linux/ceph/libceph.h b/include/linux/ceph/libceph.h index 337d5049ff93..82156da3c650 100644 --- a/include/linux/ceph/libceph.h +++ b/include/linux/ceph/libceph.h @@ -84,11 +84,13 @@ struct ceph_options { #define CEPH_MSG_MAX_MIDDLE_LEN (16*1024*1024) /* - * Handle the largest possible rbd object in one message. + * The largest possible rbd data object is 32M. + * The largest possible rbd object map object is 64M. 
+ * * There is no limit on the size of cephfs objects, but it has to obey * rsize and wsize mount options anyway. */ -#define CEPH_MSG_MAX_DATA_LEN (32*1024*1024) +#define CEPH_MSG_MAX_DATA_LEN (64*1024*1024) #define CEPH_AUTH_NAME_DEFAULT "guest" @@ -299,10 +301,6 @@ int ceph_wait_for_latest_osdmap(struct ceph_client *client, /* pagevec.c */ extern void ceph_release_page_vector(struct page **pages, int num_pages); - -extern struct page **ceph_get_direct_page_vector(const void __user *data, - int num_pages, - bool write_page); extern void ceph_put_page_vector(struct page **pages, int num_pages, bool dirty); extern struct page **ceph_alloc_page_vector(int num_pages, gfp_t flags); diff --git a/include/linux/ceph/mon_client.h b/include/linux/ceph/mon_client.h index 3a4688af7455..b4d134d3312a 100644 --- a/include/linux/ceph/mon_client.h +++ b/include/linux/ceph/mon_client.h @@ -104,7 +104,6 @@ struct ceph_mon_client { #endif }; -extern struct ceph_monmap *ceph_monmap_decode(void *p, void *end); extern int ceph_monmap_contains(struct ceph_monmap *m, struct ceph_entity_addr *addr); diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h index 2294f963dab7..ad7fe5d10dcd 100644 --- a/include/linux/ceph/osd_client.h +++ b/include/linux/ceph/osd_client.h @@ -198,9 +198,9 @@ struct ceph_osd_request { bool r_mempool; struct completion r_completion; /* private to osd_client.c */ ceph_osdc_callback_t r_callback; - struct list_head r_unsafe_item; struct inode *r_inode; /* for use by callbacks */ + struct list_head r_private_item; /* ditto */ void *r_priv; /* ditto */ /* set by submitter */ @@ -389,6 +389,14 @@ extern void ceph_osdc_handle_map(struct ceph_osd_client *osdc, void ceph_osdc_update_epoch_barrier(struct ceph_osd_client *osdc, u32 eb); void ceph_osdc_abort_requests(struct ceph_osd_client *osdc, int err); +#define osd_req_op_data(oreq, whch, typ, fld) \ +({ \ + struct ceph_osd_request *__oreq = (oreq); \ + unsigned int __whch = (whch); \ + BUG_ON(__whch >= __oreq->r_num_ops); \ + &__oreq->r_ops[__whch].typ.fld; \ +}) + extern void osd_req_op_init(struct ceph_osd_request *osd_req, unsigned int which, u16 opcode, u32 flags); @@ -497,7 +505,7 @@ int ceph_osdc_call(struct ceph_osd_client *osdc, const char *class, const char *method, unsigned int flags, struct page *req_page, size_t req_len, - struct page *resp_page, size_t *resp_len); + struct page **resp_pages, size_t *resp_len); extern int ceph_osdc_readpages(struct ceph_osd_client *osdc, struct ceph_vino vino, diff --git a/include/linux/ceph/striper.h b/include/linux/ceph/striper.h index cbd0d24b7148..3486636c0e6e 100644 --- a/include/linux/ceph/striper.h +++ b/include/linux/ceph/striper.h @@ -66,4 +66,6 @@ int ceph_extent_to_file(struct ceph_file_layout *l, struct ceph_file_extent **file_extents, u32 *num_file_extents); +u64 ceph_get_num_objects(struct ceph_file_layout *l, u64 size); + #endif diff --git a/include/linux/dax.h b/include/linux/dax.h index becaea5f4488..9bd8528bd305 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -7,6 +7,9 @@ #include <linux/radix-tree.h> #include <asm/pgtable.h> +/* Flag for synchronous flush */ +#define DAXDEV_F_SYNC (1UL << 0) + typedef unsigned long dax_entry_t; struct iomap_ops; @@ -38,18 +41,40 @@ extern struct attribute_group dax_attribute_group; #if IS_ENABLED(CONFIG_DAX) struct dax_device *dax_get_by_host(const char *host); struct dax_device *alloc_dax(void *private, const char *host, - const struct dax_operations *ops); + const struct dax_operations *ops, unsigned long 
flags); void put_dax(struct dax_device *dax_dev); void kill_dax(struct dax_device *dax_dev); void dax_write_cache(struct dax_device *dax_dev, bool wc); bool dax_write_cache_enabled(struct dax_device *dax_dev); +bool __dax_synchronous(struct dax_device *dax_dev); +static inline bool dax_synchronous(struct dax_device *dax_dev) +{ + return __dax_synchronous(dax_dev); +} +void __set_dax_synchronous(struct dax_device *dax_dev); +static inline void set_dax_synchronous(struct dax_device *dax_dev) +{ + __set_dax_synchronous(dax_dev); +} +/* + * Check if given mapping is supported by the file / underlying device. + */ +static inline bool daxdev_mapping_supported(struct vm_area_struct *vma, + struct dax_device *dax_dev) +{ + if (!(vma->vm_flags & VM_SYNC)) + return true; + if (!IS_DAX(file_inode(vma->vm_file))) + return false; + return dax_synchronous(dax_dev); +} #else static inline struct dax_device *dax_get_by_host(const char *host) { return NULL; } static inline struct dax_device *alloc_dax(void *private, const char *host, - const struct dax_operations *ops) + const struct dax_operations *ops, unsigned long flags) { /* * Callers should check IS_ENABLED(CONFIG_DAX) to know if this @@ -70,6 +95,18 @@ static inline bool dax_write_cache_enabled(struct dax_device *dax_dev) { return false; } +static inline bool dax_synchronous(struct dax_device *dax_dev) +{ + return true; +} +static inline void set_dax_synchronous(struct dax_device *dax_dev) +{ +} +static inline bool daxdev_mapping_supported(struct vm_area_struct *vma, + struct dax_device *dax_dev) +{ + return !(vma->vm_flags & VM_SYNC); +} #endif struct writeback_control; diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h index 3b470cb03b66..399ad8632356 100644 --- a/include/linux/device-mapper.h +++ b/include/linux/device-mapper.h @@ -529,29 +529,20 @@ void *dm_vcalloc(unsigned long nmemb, unsigned long elem_size); *---------------------------------------------------------------*/ #define DM_NAME "device-mapper" -#define DM_RATELIMIT(pr_func, fmt, ...) \ -do { \ - static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL, \ - DEFAULT_RATELIMIT_BURST); \ - \ - if (__ratelimit(&rs)) \ - pr_func(DM_FMT(fmt), ##__VA_ARGS__); \ -} while (0) - #define DM_FMT(fmt) DM_NAME ": " DM_MSG_PREFIX ": " fmt "\n" #define DMCRIT(fmt, ...) pr_crit(DM_FMT(fmt), ##__VA_ARGS__) #define DMERR(fmt, ...) pr_err(DM_FMT(fmt), ##__VA_ARGS__) -#define DMERR_LIMIT(fmt, ...) DM_RATELIMIT(pr_err, fmt, ##__VA_ARGS__) +#define DMERR_LIMIT(fmt, ...) pr_err_ratelimited(DM_FMT(fmt), ##__VA_ARGS__) #define DMWARN(fmt, ...) pr_warn(DM_FMT(fmt), ##__VA_ARGS__) -#define DMWARN_LIMIT(fmt, ...) DM_RATELIMIT(pr_warn, fmt, ##__VA_ARGS__) +#define DMWARN_LIMIT(fmt, ...) pr_warn_ratelimited(DM_FMT(fmt), ##__VA_ARGS__) #define DMINFO(fmt, ...) pr_info(DM_FMT(fmt), ##__VA_ARGS__) -#define DMINFO_LIMIT(fmt, ...) DM_RATELIMIT(pr_info, fmt, ##__VA_ARGS__) +#define DMINFO_LIMIT(fmt, ...) pr_info_ratelimited(DM_FMT(fmt), ##__VA_ARGS__) #ifdef CONFIG_DM_DEBUG #define DMDEBUG(fmt, ...) printk(KERN_DEBUG DM_FMT(fmt), ##__VA_ARGS__) -#define DMDEBUG_LIMIT(fmt, ...) DM_RATELIMIT(pr_debug, fmt, ##__VA_ARGS__) +#define DMDEBUG_LIMIT(fmt, ...) pr_debug_ratelimited(DM_FMT(fmt), ##__VA_ARGS__) #else #define DMDEBUG(fmt, ...) no_printk(fmt, ##__VA_ARGS__) #define DMDEBUG_LIMIT(fmt, ...) 
no_printk(fmt, ##__VA_ARGS__) diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h index 25e2995d4a4c..8a8cb3c401b2 100644 --- a/include/linux/ftrace.h +++ b/include/linux/ftrace.h @@ -427,8 +427,8 @@ struct dyn_ftrace *ftrace_rec_iter_record(struct ftrace_rec_iter *iter); iter = ftrace_rec_iter_next(iter)) -int ftrace_update_record(struct dyn_ftrace *rec, int enable); -int ftrace_test_record(struct dyn_ftrace *rec, int enable); +int ftrace_update_record(struct dyn_ftrace *rec, bool enable); +int ftrace_test_record(struct dyn_ftrace *rec, bool enable); void ftrace_run_stop_machine(int command); unsigned long ftrace_location(unsigned long ip); unsigned long ftrace_location_range(unsigned long start, unsigned long end); diff --git a/include/linux/iversion.h b/include/linux/iversion.h index be50ef7cedab..2917ef990d43 100644 --- a/include/linux/iversion.h +++ b/include/linux/iversion.h @@ -113,6 +113,30 @@ inode_peek_iversion_raw(const struct inode *inode) } /** + * inode_set_max_iversion_raw - update i_version if the new value is larger + * @inode: inode to set + * @val: new i_version to set + * + * Some self-managed filesystems (e.g. Ceph) will only update the i_version + * value if the new value is larger than the one we already have. + */ +static inline void +inode_set_max_iversion_raw(struct inode *inode, u64 val) +{ + u64 cur, old; + + cur = inode_peek_iversion_raw(inode); + for (;;) { + if (cur > val) + break; + old = atomic64_cmpxchg(&inode->i_version, cur, val); + if (likely(old == cur)) + break; + cur = old; + } +} + +/** * inode_set_iversion - set i_version to a particular value * @inode: inode to set * @val: new i_version value to set diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h index 03d5c3aece9d..7a64b3ddb408 100644 --- a/include/linux/libnvdimm.h +++ b/include/linux/libnvdimm.h @@ -11,6 +11,7 @@ #include <linux/types.h> #include <linux/uuid.h> #include <linux/spinlock.h> +#include <linux/bio.h> struct badrange_entry { u64 start; @@ -57,6 +58,9 @@ enum { */ ND_REGION_PERSIST_MEMCTRL = 2, + /* Platform provides asynchronous flush mechanism */ + ND_REGION_ASYNC = 3, + /* mark newly adjusted resources as requiring a label update */ DPA_RESOURCE_ADJUSTED = 1 << 0, }; @@ -113,6 +117,7 @@ struct nd_mapping_desc { int position; }; +struct nd_region; struct nd_region_desc { struct resource *res; struct nd_mapping_desc *mapping; @@ -125,6 +130,7 @@ struct nd_region_desc { int target_node; unsigned long flags; struct device_node *of_node; + int (*flush)(struct nd_region *nd_region, struct bio *bio); }; struct device; @@ -252,10 +258,12 @@ unsigned long nd_blk_memremap_flags(struct nd_blk_region *ndbr); unsigned int nd_region_acquire_lane(struct nd_region *nd_region); void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane); u64 nd_fletcher64(void *addr, size_t len, bool le); -void nvdimm_flush(struct nd_region *nd_region); +int nvdimm_flush(struct nd_region *nd_region, struct bio *bio); +int generic_nvdimm_flush(struct nd_region *nd_region); int nvdimm_has_flush(struct nd_region *nd_region); int nvdimm_has_cache(struct nd_region *nd_region); int nvdimm_in_overwrite(struct nvdimm *nvdimm); +bool is_nvdimm_sync(struct nd_region *nd_region); static inline int nvdimm_ctl(struct nvdimm *nvdimm, unsigned int cmd, void *buf, unsigned int buf_len, int *cmd_rc) diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h index 31013c2effd3..5229c18025e9 100644 --- a/include/linux/moduleloader.h +++ b/include/linux/moduleloader.h @@ -29,6 +29,11 
@@ void *module_alloc(unsigned long size); /* Free memory returned from module_alloc. */ void module_memfree(void *module_region); +/* Determines if the section name is an exit section (that is only used during + * module unloading) + */ +bool module_exit_section(const char *name); + /* * Apply the given relocation to the (simplified) ELF. Return -error * or 0. diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h index 22494d170619..fd59904a282c 100644 --- a/include/linux/nfs4.h +++ b/include/linux/nfs4.h @@ -660,6 +660,7 @@ enum pnfs_update_layout_reason { PNFS_UPDATE_LAYOUT_BLOCKED, PNFS_UPDATE_LAYOUT_INVALID_OPEN, PNFS_UPDATE_LAYOUT_SEND_LAYOUTGET, + PNFS_UPDATE_LAYOUT_EXIT, }; #define NFS4_OP_MAP_NUM_LONGS \ diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h index d363d5765cdf..0a11712a80e3 100644 --- a/include/linux/nfs_fs.h +++ b/include/linux/nfs_fs.h @@ -223,6 +223,8 @@ struct nfs4_copy_state { #define NFS_INO_INVALID_MTIME BIT(10) /* cached mtime is invalid */ #define NFS_INO_INVALID_SIZE BIT(11) /* cached size is invalid */ #define NFS_INO_INVALID_OTHER BIT(12) /* other attrs are invalid */ +#define NFS_INO_DATA_INVAL_DEFER \ + BIT(13) /* Deferred cache invalidation */ #define NFS_INO_INVALID_ATTR (NFS_INO_INVALID_CHANGE \ | NFS_INO_INVALID_CTIME \ diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h index 1e78032a174b..a87fe854f008 100644 --- a/include/linux/nfs_fs_sb.h +++ b/include/linux/nfs_fs_sb.h @@ -58,6 +58,7 @@ struct nfs_client { struct nfs_subversion * cl_nfs_mod; /* pointer to nfs version module */ u32 cl_minorversion;/* NFSv4 minorversion */ + unsigned int cl_nconnect; /* Number of connections */ const char * cl_principal; /* used for machine cred */ #if IS_ENABLED(CONFIG_NFS_V4) diff --git a/include/linux/sunrpc/bc_xprt.h b/include/linux/sunrpc/bc_xprt.h index d4229a78524a..87d27e13d885 100644 --- a/include/linux/sunrpc/bc_xprt.h +++ b/include/linux/sunrpc/bc_xprt.h @@ -43,6 +43,7 @@ void xprt_destroy_backchannel(struct rpc_xprt *, unsigned int max_reqs); int xprt_setup_bc(struct rpc_xprt *xprt, unsigned int min_reqs); void xprt_destroy_bc(struct rpc_xprt *xprt, unsigned int max_reqs); void xprt_free_bc_rqst(struct rpc_rqst *req); +unsigned int xprt_bc_max_slots(struct rpc_xprt *xprt); /* * Determine if a shared backchannel is in use diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h index 6e8073140a5d..abc63bd1be2b 100644 --- a/include/linux/sunrpc/clnt.h +++ b/include/linux/sunrpc/clnt.h @@ -124,6 +124,7 @@ struct rpc_create_args { u32 prognumber; /* overrides program->number */ u32 version; rpc_authflavor_t authflavor; + u32 nconnect; unsigned long flags; char *client_name; struct svc_xprt *bc_xprt; /* NFSv4.1 backchannel */ @@ -163,6 +164,8 @@ void rpc_shutdown_client(struct rpc_clnt *); void rpc_release_client(struct rpc_clnt *); void rpc_task_release_transport(struct rpc_task *); void rpc_task_release_client(struct rpc_task *); +struct rpc_xprt *rpc_task_get_xprt(struct rpc_clnt *clnt, + struct rpc_xprt *xprt); int rpcb_create_local(struct net *); void rpcb_put_local(struct net *); @@ -191,6 +194,7 @@ void rpc_setbufsize(struct rpc_clnt *, unsigned int, unsigned int); struct net * rpc_net_ns(struct rpc_clnt *); size_t rpc_max_payload(struct rpc_clnt *); size_t rpc_max_bc_payload(struct rpc_clnt *); +unsigned int rpc_num_bc_slots(struct rpc_clnt *); void rpc_force_rebind(struct rpc_clnt *); size_t rpc_peeraddr(struct rpc_clnt *, struct sockaddr *, size_t); const char *rpc_peeraddr2str(struct rpc_clnt *, enum 
rpc_display_format_t); diff --git a/include/linux/sunrpc/metrics.h b/include/linux/sunrpc/metrics.h index 1b3751327575..0ee3f7052846 100644 --- a/include/linux/sunrpc/metrics.h +++ b/include/linux/sunrpc/metrics.h @@ -30,7 +30,7 @@ #include <linux/ktime.h> #include <linux/spinlock.h> -#define RPC_IOSTATS_VERS "1.0" +#define RPC_IOSTATS_VERS "1.1" struct rpc_iostats { spinlock_t om_lock; @@ -66,6 +66,11 @@ struct rpc_iostats { ktime_t om_queue, /* queued for xmit */ om_rtt, /* RPC RTT */ om_execute; /* RPC execution */ + /* + * The count of operations that complete with tk_status < 0. + * These statuses usually indicate error conditions. + */ + unsigned long om_error_status; } ____cacheline_aligned; struct rpc_task; diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h index d0e451868f02..baa3ecdb882f 100644 --- a/include/linux/sunrpc/sched.h +++ b/include/linux/sunrpc/sched.h @@ -126,6 +126,7 @@ struct rpc_task_setup { #define RPC_CALL_MAJORSEEN 0x0020 /* major timeout seen */ #define RPC_TASK_ROOTCREDS 0x0040 /* force root creds */ #define RPC_TASK_DYNAMIC 0x0080 /* task was kmalloc'ed */ +#define RPC_TASK_NO_ROUND_ROBIN 0x0100 /* send requests on "main" xprt */ #define RPC_TASK_SOFT 0x0200 /* Use soft timeouts */ #define RPC_TASK_SOFTCONN 0x0400 /* Fail if can't connect */ #define RPC_TASK_SENT 0x0800 /* message was sent */ @@ -183,8 +184,9 @@ struct rpc_task_setup { #define RPC_NR_PRIORITY (1 + RPC_PRIORITY_PRIVILEGED - RPC_PRIORITY_LOW) struct rpc_timer { - struct timer_list timer; struct list_head list; + unsigned long expires; + struct delayed_work dwork; }; /* diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h index a6d9fce7f20e..13e108bcc9eb 100644 --- a/include/linux/sunrpc/xprt.h +++ b/include/linux/sunrpc/xprt.h @@ -158,6 +158,7 @@ struct rpc_xprt_ops { int (*bc_setup)(struct rpc_xprt *xprt, unsigned int min_reqs); size_t (*bc_maxpayload)(struct rpc_xprt *xprt); + unsigned int (*bc_num_slots)(struct rpc_xprt *xprt); void (*bc_free_rqst)(struct rpc_rqst *rqst); void (*bc_destroy)(struct rpc_xprt *xprt, unsigned int max_reqs); @@ -238,6 +239,7 @@ struct rpc_xprt { /* * Send stuff */ + atomic_long_t queuelen; spinlock_t transport_lock; /* lock transport info */ spinlock_t reserve_lock; /* lock slot table */ spinlock_t queue_lock; /* send/receive queue lock */ @@ -250,8 +252,9 @@ struct rpc_xprt { #if defined(CONFIG_SUNRPC_BACKCHANNEL) struct svc_serv *bc_serv; /* The RPC service which will */ /* process the callback */ - int bc_alloc_count; /* Total number of preallocs */ - atomic_t bc_free_slots; + unsigned int bc_alloc_max; + unsigned int bc_alloc_count; /* Total number of preallocs */ + atomic_t bc_slot_count; /* Number of allocated slots */ spinlock_t bc_pa_lock; /* Protects the preallocated * items */ struct list_head bc_pa_list; /* List of preallocated @@ -334,6 +337,9 @@ struct xprt_class { */ struct rpc_xprt *xprt_create_transport(struct xprt_create *args); void xprt_connect(struct rpc_task *task); +unsigned long xprt_reconnect_delay(const struct rpc_xprt *xprt); +void xprt_reconnect_backoff(struct rpc_xprt *xprt, + unsigned long init_to); void xprt_reserve(struct rpc_task *task); void xprt_retry_reserve(struct rpc_task *task); int xprt_reserve_xprt(struct rpc_xprt *xprt, struct rpc_task *task); diff --git a/include/linux/sunrpc/xprtmultipath.h b/include/linux/sunrpc/xprtmultipath.h index af1257c030d2..c6cce3fbf29d 100644 --- a/include/linux/sunrpc/xprtmultipath.h +++ b/include/linux/sunrpc/xprtmultipath.h @@ -15,6 +15,8 @@ struct 
rpc_xprt_switch { struct kref xps_kref; unsigned int xps_nxprts; + unsigned int xps_nactive; + atomic_long_t xps_queuelen; struct list_head xps_xprt_list; struct net * xps_net; diff --git a/include/linux/sunrpc/xprtsock.h b/include/linux/sunrpc/xprtsock.h index b81d0b3e0799..7638dbe7bc50 100644 --- a/include/linux/sunrpc/xprtsock.h +++ b/include/linux/sunrpc/xprtsock.h @@ -56,6 +56,7 @@ struct sock_xprt { */ unsigned long sock_state; struct delayed_work connect_worker; + struct work_struct error_worker; struct work_struct recv_worker; struct mutex recv_mutex; struct sockaddr_storage srcaddr; @@ -84,6 +85,10 @@ struct sock_xprt { #define XPRT_SOCK_CONNECTING 1U #define XPRT_SOCK_DATA_READY (2) #define XPRT_SOCK_UPD_TIMEOUT (3) +#define XPRT_SOCK_WAKE_ERROR (4) +#define XPRT_SOCK_WAKE_WRITE (5) +#define XPRT_SOCK_WAKE_PENDING (6) +#define XPRT_SOCK_WAKE_DISCONNECT (7) #endif /* __KERNEL__ */ diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h index 8a62731673f7..5150436783e8 100644 --- a/include/linux/trace_events.h +++ b/include/linux/trace_events.h @@ -142,6 +142,7 @@ enum print_line_t { enum print_line_t trace_handle_return(struct trace_seq *s); void tracing_generic_entry_update(struct trace_entry *entry, + unsigned short type, unsigned long flags, int pc); struct trace_event_file; @@ -317,6 +318,14 @@ trace_event_name(struct trace_event_call *call) return call->name; } +static inline struct list_head * +trace_get_fields(struct trace_event_call *event_call) +{ + if (!event_call->class->get_fields) + return &event_call->class->fields; + return event_call->class->get_fields(event_call); +} + struct trace_array; struct trace_subsystem_dir; diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h index 2b70130af585..34a038563d97 100644 --- a/include/linux/uaccess.h +++ b/include/linux/uaccess.h @@ -203,7 +203,10 @@ static inline void pagefault_enable(void) /* * Is the pagefault handler disabled? If so, user access methods will not sleep. */ -#define pagefault_disabled() (current->pagefault_disabled != 0) +static inline bool pagefault_disabled(void) +{ + return current->pagefault_disabled != 0; +} /* * The pagefault handler is in general disabled by pagefault_disable() or @@ -240,6 +243,18 @@ extern long probe_kernel_read(void *dst, const void *src, size_t size); extern long __probe_kernel_read(void *dst, const void *src, size_t size); /* + * probe_user_read(): safely attempt to read from a location in user space + * @dst: pointer to the buffer that shall take the data + * @src: address to read from + * @size: size of the data chunk + * + * Safely read from address @src to the buffer at @dst. If a kernel fault + * happens, handle that and return -EFAULT. 
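+ *
+ * A minimal usage sketch (editorial illustration, not part of this patch;
+ * "uaddr" stands in for a hypothetical const void __user * pointer taken
+ * from probe context):
+ *
+ *	u32 val;
+ *
+ *	if (probe_user_read(&val, uaddr, sizeof(val)))
+ *		return -EFAULT;
+ *
+ * The read runs with page faults disabled, so unlike copy_from_user() it
+ * is safe to call from contexts that must not sleep.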
+ */ +extern long probe_user_read(void *dst, const void __user *src, size_t size); +extern long __probe_user_read(void *dst, const void __user *src, size_t size); + +/* * probe_kernel_write(): safely attempt to write to a location * @dst: address to write to * @src: pointer to the data that shall be written @@ -252,6 +267,9 @@ extern long notrace probe_kernel_write(void *dst, const void *src, size_t size); extern long notrace __probe_kernel_write(void *dst, const void *src, size_t size); extern long strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count); +extern long strncpy_from_unsafe_user(char *dst, const void __user *unsafe_addr, + long count); +extern long strnlen_unsafe_user(const void __user *unsafe_addr, long count); /** * probe_kernel_address(): safely attempt to read from a location diff --git a/include/trace/events/rpcrdma.h b/include/trace/events/rpcrdma.h index df9851cb82b2..f6a4eaa85a3e 100644 --- a/include/trace/events/rpcrdma.h +++ b/include/trace/events/rpcrdma.h @@ -181,18 +181,6 @@ DECLARE_EVENT_CLASS(xprtrdma_wrch_event, ), \ TP_ARGS(task, mr, nsegs)) -TRACE_DEFINE_ENUM(FRWR_IS_INVALID); -TRACE_DEFINE_ENUM(FRWR_IS_VALID); -TRACE_DEFINE_ENUM(FRWR_FLUSHED_FR); -TRACE_DEFINE_ENUM(FRWR_FLUSHED_LI); - -#define xprtrdma_show_frwr_state(x) \ - __print_symbolic(x, \ - { FRWR_IS_INVALID, "INVALID" }, \ - { FRWR_IS_VALID, "VALID" }, \ - { FRWR_FLUSHED_FR, "FLUSHED_FR" }, \ - { FRWR_FLUSHED_LI, "FLUSHED_LI" }) - DECLARE_EVENT_CLASS(xprtrdma_frwr_done, TP_PROTO( const struct ib_wc *wc, @@ -203,22 +191,19 @@ DECLARE_EVENT_CLASS(xprtrdma_frwr_done, TP_STRUCT__entry( __field(const void *, mr) - __field(unsigned int, state) __field(unsigned int, status) __field(unsigned int, vendor_err) ), TP_fast_assign( __entry->mr = container_of(frwr, struct rpcrdma_mr, frwr); - __entry->state = frwr->fr_state; __entry->status = wc->status; __entry->vendor_err = __entry->status ? 
wc->vendor_err : 0; ), TP_printk( - "mr=%p state=%s: %s (%u/0x%x)", - __entry->mr, xprtrdma_show_frwr_state(__entry->state), - rdma_show_wc_status(__entry->status), + "mr=%p: %s (%u/0x%x)", + __entry->mr, rdma_show_wc_status(__entry->status), __entry->status, __entry->vendor_err ) ); @@ -390,6 +375,37 @@ DEFINE_RXPRT_EVENT(xprtrdma_op_inject_dsc); DEFINE_RXPRT_EVENT(xprtrdma_op_close); DEFINE_RXPRT_EVENT(xprtrdma_op_connect); +TRACE_EVENT(xprtrdma_op_set_cto, + TP_PROTO( + const struct rpcrdma_xprt *r_xprt, + unsigned long connect, + unsigned long reconnect + ), + + TP_ARGS(r_xprt, connect, reconnect), + + TP_STRUCT__entry( + __field(const void *, r_xprt) + __field(unsigned long, connect) + __field(unsigned long, reconnect) + __string(addr, rpcrdma_addrstr(r_xprt)) + __string(port, rpcrdma_portstr(r_xprt)) + ), + + TP_fast_assign( + __entry->r_xprt = r_xprt; + __entry->connect = connect; + __entry->reconnect = reconnect; + __assign_str(addr, rpcrdma_addrstr(r_xprt)); + __assign_str(port, rpcrdma_portstr(r_xprt)); + ), + + TP_printk("peer=[%s]:%s r_xprt=%p: connect=%lu reconnect=%lu", + __get_str(addr), __get_str(port), __entry->r_xprt, + __entry->connect / HZ, __entry->reconnect / HZ + ) +); + TRACE_EVENT(xprtrdma_qp_event, TP_PROTO( const struct rpcrdma_xprt *r_xprt, @@ -470,13 +486,12 @@ TRACE_DEFINE_ENUM(rpcrdma_replych); TRACE_EVENT(xprtrdma_marshal, TP_PROTO( - const struct rpc_rqst *rqst, - unsigned int hdrlen, + const struct rpcrdma_req *req, unsigned int rtype, unsigned int wtype ), - TP_ARGS(rqst, hdrlen, rtype, wtype), + TP_ARGS(req, rtype, wtype), TP_STRUCT__entry( __field(unsigned int, task_id) @@ -491,10 +506,12 @@ TRACE_EVENT(xprtrdma_marshal, ), TP_fast_assign( + const struct rpc_rqst *rqst = &req->rl_slot; + __entry->task_id = rqst->rq_task->tk_pid; __entry->client_id = rqst->rq_task->tk_client->cl_clid; __entry->xid = be32_to_cpu(rqst->rq_xid); - __entry->hdrlen = hdrlen; + __entry->hdrlen = req->rl_hdrbuf.len; __entry->headlen = rqst->rq_snd_buf.head[0].iov_len; __entry->pagelen = rqst->rq_snd_buf.page_len; __entry->taillen = rqst->rq_snd_buf.tail[0].iov_len; @@ -538,6 +555,33 @@ TRACE_EVENT(xprtrdma_marshal_failed, ) ); +TRACE_EVENT(xprtrdma_prepsend_failed, + TP_PROTO(const struct rpc_rqst *rqst, + int ret + ), + + TP_ARGS(rqst, ret), + + TP_STRUCT__entry( + __field(unsigned int, task_id) + __field(unsigned int, client_id) + __field(u32, xid) + __field(int, ret) + ), + + TP_fast_assign( + __entry->task_id = rqst->rq_task->tk_pid; + __entry->client_id = rqst->rq_task->tk_client->cl_clid; + __entry->xid = be32_to_cpu(rqst->rq_xid); + __entry->ret = ret; + ), + + TP_printk("task:%u@%u xid=0x%08x: ret=%d", + __entry->task_id, __entry->client_id, __entry->xid, + __entry->ret + ) +); + TRACE_EVENT(xprtrdma_post_send, TP_PROTO( const struct rpcrdma_req *req, @@ -559,7 +603,8 @@ TRACE_EVENT(xprtrdma_post_send, const struct rpc_rqst *rqst = &req->rl_slot; __entry->task_id = rqst->rq_task->tk_pid; - __entry->client_id = rqst->rq_task->tk_client->cl_clid; + __entry->client_id = rqst->rq_task->tk_client ? 
+ rqst->rq_task->tk_client->cl_clid : -1; __entry->req = req; __entry->num_sge = req->rl_sendctx->sc_wr.num_sge; __entry->signaled = req->rl_sendctx->sc_wr.send_flags & @@ -698,6 +743,7 @@ TRACE_EVENT(xprtrdma_wc_receive, DEFINE_FRWR_DONE_EVENT(xprtrdma_wc_fastreg); DEFINE_FRWR_DONE_EVENT(xprtrdma_wc_li); DEFINE_FRWR_DONE_EVENT(xprtrdma_wc_li_wake); +DEFINE_FRWR_DONE_EVENT(xprtrdma_wc_li_done); TRACE_EVENT(xprtrdma_frwr_alloc, TP_PROTO( diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h index cfe47c5d9a56..348fd0176f75 100644 --- a/include/uapi/linux/virtio_ids.h +++ b/include/uapi/linux/virtio_ids.h @@ -44,5 +44,6 @@ #define VIRTIO_ID_VSOCK 19 /* virtio vsock transport */ #define VIRTIO_ID_CRYPTO 20 /* virtio crypto */ #define VIRTIO_ID_IOMMU 23 /* virtio IOMMU */ +#define VIRTIO_ID_PMEM 27 /* virtio pmem */ #endif /* _LINUX_VIRTIO_IDS_H */ diff --git a/include/uapi/linux/virtio_pmem.h b/include/uapi/linux/virtio_pmem.h new file mode 100644 index 000000000000..9a63ed6d062f --- /dev/null +++ b/include/uapi/linux/virtio_pmem.h @@ -0,0 +1,34 @@ +/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */ +/* + * Definitions for virtio-pmem devices. + * + * Copyright (C) 2019 Red Hat, Inc. + * + * Author(s): Pankaj Gupta <pagupta@redhat.com> + */ + +#ifndef _UAPI_LINUX_VIRTIO_PMEM_H +#define _UAPI_LINUX_VIRTIO_PMEM_H + +#include <linux/types.h> +#include <linux/virtio_ids.h> +#include <linux/virtio_config.h> + +struct virtio_pmem_config { + __u64 start; + __u64 size; +}; + +#define VIRTIO_PMEM_REQ_TYPE_FLUSH 0 + +struct virtio_pmem_resp { + /* Host return status corresponding to flush request */ + __le32 ret; +}; + +struct virtio_pmem_req { + /* command type */ + __le32 type; +}; + +#endif diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index 62fa5a82a065..9de232229063 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -129,15 +129,17 @@ setup_io_tlb_npages(char *str) } early_param("swiotlb", setup_io_tlb_npages); +static bool no_iotlb_memory; + unsigned long swiotlb_nr_tbl(void) { - return io_tlb_nslabs; + return unlikely(no_iotlb_memory) ? 0 : io_tlb_nslabs; } EXPORT_SYMBOL_GPL(swiotlb_nr_tbl); unsigned int swiotlb_max_segment(void) { - return max_segment; + return unlikely(no_iotlb_memory) ? 0 : max_segment; } EXPORT_SYMBOL_GPL(swiotlb_max_segment); @@ -160,8 +162,6 @@ unsigned long swiotlb_size_or_default(void) return size ? 
size : (IO_TLB_DEFAULT_SIZE); } -static bool no_iotlb_memory; - void swiotlb_print_info(void) { unsigned long bytes = io_tlb_nslabs << IO_TLB_SHIFT; @@ -317,6 +317,14 @@ swiotlb_late_init_with_default_size(size_t default_size) return rc; } +static void swiotlb_cleanup(void) +{ + io_tlb_end = 0; + io_tlb_start = 0; + io_tlb_nslabs = 0; + max_segment = 0; +} + int swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs) { @@ -367,10 +375,7 @@ cleanup4: sizeof(int))); io_tlb_list = NULL; cleanup3: - io_tlb_end = 0; - io_tlb_start = 0; - io_tlb_nslabs = 0; - max_segment = 0; + swiotlb_cleanup(); return -ENOMEM; } @@ -394,10 +399,7 @@ void __init swiotlb_exit(void) memblock_free_late(io_tlb_start, PAGE_ALIGN(io_tlb_nslabs << IO_TLB_SHIFT)); } - io_tlb_start = 0; - io_tlb_end = 0; - io_tlb_nslabs = 0; - max_segment = 0; + swiotlb_cleanup(); } /* @@ -546,7 +548,7 @@ not_found: if (!(attrs & DMA_ATTR_NO_WARN) && printk_ratelimit()) dev_warn(hwdev, "swiotlb buffer is full (sz: %zd bytes), total %lu (slots), used %lu (slots)\n", size, io_tlb_nslabs, tmp_io_tlb_used); - return DMA_MAPPING_ERROR; + return (phys_addr_t)DMA_MAPPING_ERROR; found: io_tlb_used += nslots; spin_unlock_irqrestore(&io_tlb_lock, flags); @@ -664,7 +666,7 @@ bool swiotlb_map(struct device *dev, phys_addr_t *phys, dma_addr_t *dma_addr, /* Oh well, have to allocate and map a bounce buffer. */ *phys = swiotlb_tbl_map_single(dev, __phys_to_dma(dev, io_tlb_start), *phys, size, dir, attrs); - if (*phys == DMA_MAPPING_ERROR) + if (*phys == (phys_addr_t)DMA_MAPPING_ERROR) return false; /* Ensure that the address returned is DMA'ble */ diff --git a/kernel/kprobes.c b/kernel/kprobes.c index 9f5433a52488..9873fc627d61 100644 --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -2276,6 +2276,7 @@ static int __init init_kprobes(void) init_test_probes(); return err; } +subsys_initcall(init_kprobes); #ifdef CONFIG_DEBUG_FS static void report_probe(struct seq_file *pi, struct kprobe *p, @@ -2588,5 +2589,3 @@ static int __init debugfs_kprobe_init(void) late_initcall(debugfs_kprobe_init); #endif /* CONFIG_DEBUG_FS */ - -module_init(init_kprobes); diff --git a/kernel/module.c b/kernel/module.c index a2cee14a83f3..5933395af9a0 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -1492,8 +1492,7 @@ static void add_sect_attrs(struct module *mod, const struct load_info *info) for (i = 0; i < info->hdr->e_shnum; i++) if (!sect_empty(&info->sechdrs[i])) nloaded++; - size[0] = ALIGN(sizeof(*sect_attrs) - + nloaded * sizeof(sect_attrs->attrs[0]), + size[0] = ALIGN(struct_size(sect_attrs, attrs, nloaded), sizeof(sect_attrs->grp.attrs[0])); size[1] = (nloaded + 1) * sizeof(sect_attrs->grp.attrs[0]); sect_attrs = kzalloc(size[0] + size[1], GFP_KERNEL); @@ -1697,6 +1696,8 @@ static int add_usage_links(struct module *mod) return ret; } +static void module_remove_modinfo_attrs(struct module *mod, int end); + static int module_add_modinfo_attrs(struct module *mod) { struct module_attribute *attr; @@ -1711,24 +1712,34 @@ static int module_add_modinfo_attrs(struct module *mod) return -ENOMEM; temp_attr = mod->modinfo_attrs; - for (i = 0; (attr = modinfo_attrs[i]) && !error; i++) { + for (i = 0; (attr = modinfo_attrs[i]); i++) { if (!attr->test || attr->test(mod)) { memcpy(temp_attr, attr, sizeof(*temp_attr)); sysfs_attr_init(&temp_attr->attr); error = sysfs_create_file(&mod->mkobj.kobj, &temp_attr->attr); + if (error) + goto error_out; ++temp_attr; } } + + return 0; + +error_out: + if (i > 0) + module_remove_modinfo_attrs(mod, --i); return error; } -static void 
module_remove_modinfo_attrs(struct module *mod) +static void module_remove_modinfo_attrs(struct module *mod, int end) { struct module_attribute *attr; int i; for (i = 0; (attr = &mod->modinfo_attrs[i]); i++) { + if (end >= 0 && i > end) + break; /* pick a field to test for end of list */ if (!attr->attr.name) break; @@ -1816,7 +1827,7 @@ static int mod_sysfs_setup(struct module *mod, return 0; out_unreg_modinfo_attrs: - module_remove_modinfo_attrs(mod); + module_remove_modinfo_attrs(mod, -1); out_unreg_param: module_param_sysfs_remove(mod); out_unreg_holders: @@ -1852,7 +1863,7 @@ static void mod_sysfs_fini(struct module *mod) { } -static void module_remove_modinfo_attrs(struct module *mod) +static void module_remove_modinfo_attrs(struct module *mod, int end) { } @@ -1868,14 +1879,14 @@ static void init_param_lock(struct module *mod) static void mod_sysfs_teardown(struct module *mod) { del_usage_links(mod); - module_remove_modinfo_attrs(mod); + module_remove_modinfo_attrs(mod, -1); module_param_sysfs_remove(mod); kobject_put(mod->mkobj.drivers_dir); kobject_put(mod->holders_dir); mod_sysfs_fini(mod); } -#ifdef CONFIG_STRICT_MODULE_RWX +#ifdef CONFIG_ARCH_HAS_STRICT_MODULE_RWX /* * LKM RO/NX protection: protect module's text/ro-data * from modification and any data from execution. @@ -1898,6 +1909,7 @@ static void frob_text(const struct module_layout *layout, layout->text_size >> PAGE_SHIFT); } +#ifdef CONFIG_STRICT_MODULE_RWX static void frob_rodata(const struct module_layout *layout, int (*set_memory)(unsigned long start, int num_pages)) { @@ -1949,13 +1961,9 @@ void module_enable_ro(const struct module *mod, bool after_init) set_vm_flush_reset_perms(mod->core_layout.base); set_vm_flush_reset_perms(mod->init_layout.base); frob_text(&mod->core_layout, set_memory_ro); - frob_text(&mod->core_layout, set_memory_x); frob_rodata(&mod->core_layout, set_memory_ro); - frob_text(&mod->init_layout, set_memory_ro); - frob_text(&mod->init_layout, set_memory_x); - frob_rodata(&mod->init_layout, set_memory_ro); if (after_init) @@ -2014,9 +2022,19 @@ void set_all_modules_text_ro(void) } mutex_unlock(&module_mutex); } -#else +#else /* !CONFIG_STRICT_MODULE_RWX */ static void module_enable_nx(const struct module *mod) { } -#endif +#endif /* CONFIG_STRICT_MODULE_RWX */ +static void module_enable_x(const struct module *mod) +{ + frob_text(&mod->core_layout, set_memory_x); + frob_text(&mod->init_layout, set_memory_x); +} +#else /* !CONFIG_ARCH_HAS_STRICT_MODULE_RWX */ +static void module_enable_nx(const struct module *mod) { } +static void module_enable_x(const struct module *mod) { } +#endif /* CONFIG_ARCH_HAS_STRICT_MODULE_RWX */ + #ifdef CONFIG_LIVEPATCH /* @@ -2723,6 +2741,11 @@ void * __weak module_alloc(unsigned long size) return vmalloc_exec(size); } +bool __weak module_exit_section(const char *name) +{ + return strstarts(name, ".exit"); +} + #ifdef CONFIG_DEBUG_KMEMLEAK static void kmemleak_load_module(const struct module *mod, const struct load_info *info) @@ -2912,7 +2935,7 @@ static int rewrite_section_headers(struct load_info *info, int flags) #ifndef CONFIG_MODULE_UNLOAD /* Don't load .exit sections */ - if (strstarts(info->secstrings+shdr->sh_name, ".exit")) + if (module_exit_section(info->secstrings+shdr->sh_name)) shdr->sh_flags &= ~(unsigned long)SHF_ALLOC; #endif } @@ -3390,8 +3413,7 @@ static bool finished_loading(const char *name) sched_annotate_sleep(); mutex_lock(&module_mutex); mod = find_module_all(name, strlen(name), true); - ret = !mod || mod->state == MODULE_STATE_LIVE - || 
mod->state == MODULE_STATE_GOING; mutex_unlock(&module_mutex); return ret; @@ -3581,8 +3603,7 @@ again: mutex_lock(&module_mutex); old = find_module_all(mod->name, strlen(mod->name), true); if (old != NULL) { - if (old->state == MODULE_STATE_COMING - || old->state == MODULE_STATE_UNFORMED) { + if (old->state != MODULE_STATE_LIVE) { /* Wait in case it fails to load. */ mutex_unlock(&module_mutex); err = wait_event_interruptible(module_wq, @@ -3621,6 +3642,7 @@ static int complete_formation(struct module *mod, struct load_info *info) module_enable_ro(mod, false); module_enable_nx(mod); + module_enable_x(mod); /* Mark state as coming so strong_try_module_get() ignores us, * but kallsyms etc. can see us. */ diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig index 564e5fdb025f..98da8998c25c 100644 --- a/kernel/trace/Kconfig +++ b/kernel/trace/Kconfig @@ -597,9 +597,19 @@ config FTRACE_STARTUP_TEST functioning properly. It will do tests on all the configured tracers of ftrace. +config EVENT_TRACE_STARTUP_TEST + bool "Run selftest on trace events" + depends on FTRACE_STARTUP_TEST + default y + help + This option performs a test on all trace events in the system. + It basically just enables each event and runs some code that + will trigger events (not necessarily the event it enables). + This may take some time to run as there are a lot of events. + config EVENT_TRACE_TEST_SYSCALLS bool "Run selftest on syscall events" - depends on FTRACE_STARTUP_TEST + depends on EVENT_TRACE_STARTUP_TEST help This option will also enable testing every syscall event. It only enables the event and disables it and runs various loads diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 576c41644e77..eca34503f178 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -1622,6 +1622,11 @@ static bool test_rec_ops_needs_regs(struct dyn_ftrace *rec) return keep_regs; } +static struct ftrace_ops * +ftrace_find_tramp_ops_any(struct dyn_ftrace *rec); +static struct ftrace_ops * +ftrace_find_tramp_ops_next(struct dyn_ftrace *rec, struct ftrace_ops *ops); + static bool __ftrace_hash_rec_update(struct ftrace_ops *ops, int filter_hash, bool inc) @@ -1750,15 +1755,17 @@ static bool __ftrace_hash_rec_update(struct ftrace_ops *ops, } /* - * If the rec had TRAMP enabled, then it needs to - * be cleared. As TRAMP can only be enabled iff - * there is only a single ops attached to it. - * In otherwords, always disable it on decrementing. - * In the future, we may set it if rec count is - * decremented to one, and the ops that is left - * has a trampoline. + * The TRAMP needs to be set only if rec count + * is decremented to one, and the ops that is + * left has a trampoline, as TRAMP can only be + * enabled if there is only a single ops attached + * to it. */ - rec->flags &= ~FTRACE_FL_TRAMP; + if (ftrace_rec_count(rec) == 1 && + ftrace_find_tramp_ops_any(rec)) + rec->flags |= FTRACE_FL_TRAMP; + else + rec->flags &= ~FTRACE_FL_TRAMP; /* * flags will be cleared in ftrace_check_record() @@ -1768,7 +1775,7 @@ static bool __ftrace_hash_rec_update(struct ftrace_ops *ops, count++; /* Must match FTRACE_UPDATE_CALLS in ftrace_modify_all_code() */ - update |= ftrace_test_record(rec, 1) != FTRACE_UPDATE_IGNORE; + update |= ftrace_test_record(rec, true) != FTRACE_UPDATE_IGNORE; /* Shortcut, if we handled all records, we are done. 
*/ if (!all && count == hash->count) @@ -1951,11 +1958,6 @@ static void print_ip_ins(const char *fmt, const unsigned char *p) printk(KERN_CONT "%s%02x", i ? ":" : "", p[i]); } -static struct ftrace_ops * -ftrace_find_tramp_ops_any(struct dyn_ftrace *rec); -static struct ftrace_ops * -ftrace_find_tramp_ops_next(struct dyn_ftrace *rec, struct ftrace_ops *ops); - enum ftrace_bug_type ftrace_bug_type; const void *ftrace_expected; @@ -2047,7 +2049,7 @@ void ftrace_bug(int failed, struct dyn_ftrace *rec) } } -static int ftrace_check_record(struct dyn_ftrace *rec, int enable, int update) +static int ftrace_check_record(struct dyn_ftrace *rec, bool enable, bool update) { unsigned long flag = 0UL; @@ -2146,28 +2148,28 @@ static int ftrace_check_record(struct dyn_ftrace *rec, int enable, int update) /** * ftrace_update_record, set a record that now is tracing or not * @rec: the record to update - * @enable: set to 1 if the record is tracing, zero to force disable + * @enable: set to true if the record is tracing, false to force disable * * The records that represent all functions that can be traced need * to be updated when tracing has been enabled. */ -int ftrace_update_record(struct dyn_ftrace *rec, int enable) +int ftrace_update_record(struct dyn_ftrace *rec, bool enable) { - return ftrace_check_record(rec, enable, 1); + return ftrace_check_record(rec, enable, true); } /** * ftrace_test_record, check if the record has been enabled or not * @rec: the record to test - * @enable: set to 1 to check if enabled, 0 if it is disabled + * @enable: set to true to check if enabled, false if it is disabled * * The arch code may need to test if a record is already set to * tracing to determine how to modify the function code that it * represents. */ -int ftrace_test_record(struct dyn_ftrace *rec, int enable) +int ftrace_test_record(struct dyn_ftrace *rec, bool enable) { - return ftrace_check_record(rec, enable, 0); + return ftrace_check_record(rec, enable, false); } static struct ftrace_ops * @@ -2356,7 +2358,7 @@ unsigned long ftrace_get_addr_curr(struct dyn_ftrace *rec) } static int -__ftrace_replace_code(struct dyn_ftrace *rec, int enable) +__ftrace_replace_code(struct dyn_ftrace *rec, bool enable) { unsigned long ftrace_old_addr; unsigned long ftrace_addr; @@ -2395,7 +2397,7 @@ void __weak ftrace_replace_code(int mod_flags) { struct dyn_ftrace *rec; struct ftrace_page *pg; - int enable = mod_flags & FTRACE_MODIFY_ENABLE_FL; + bool enable = mod_flags & FTRACE_MODIFY_ENABLE_FL; int schedulable = mod_flags & FTRACE_MODIFY_MAY_SLEEP_FL; int failed; diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 05b0b3139ebc..66358d66c933 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -128,16 +128,7 @@ int ring_buffer_print_entry_header(struct trace_seq *s) #define RB_ALIGNMENT 4U #define RB_MAX_SMALL_DATA (RB_ALIGNMENT * RINGBUF_TYPE_DATA_TYPE_LEN_MAX) #define RB_EVNT_MIN_SIZE 8U /* two 32bit words */ - -#ifndef CONFIG_HAVE_64BIT_ALIGNED_ACCESS -# define RB_FORCE_8BYTE_ALIGNMENT 0 -# define RB_ARCH_ALIGNMENT RB_ALIGNMENT -#else -# define RB_FORCE_8BYTE_ALIGNMENT 1 -# define RB_ARCH_ALIGNMENT 8U -#endif - -#define RB_ALIGN_DATA __aligned(RB_ARCH_ALIGNMENT) +#define RB_ALIGN_DATA __aligned(RB_ALIGNMENT) /* define RINGBUF_TYPE_DATA for 'case RINGBUF_TYPE_DATA:' */ #define RINGBUF_TYPE_DATA 0 ... 
RINGBUF_TYPE_DATA_TYPE_LEN_MAX @@ -2373,7 +2364,7 @@ rb_update_event(struct ring_buffer_per_cpu *cpu_buffer, event->time_delta = delta; length -= RB_EVNT_HDR_SIZE; - if (length > RB_MAX_SMALL_DATA || RB_FORCE_8BYTE_ALIGNMENT) { + if (length > RB_MAX_SMALL_DATA) { event->type_len = 0; event->array[0] = length; } else @@ -2388,11 +2379,11 @@ static unsigned rb_calculate_event_length(unsigned length) if (!length) length++; - if (length > RB_MAX_SMALL_DATA || RB_FORCE_8BYTE_ALIGNMENT) + if (length > RB_MAX_SMALL_DATA) length += sizeof(event.array[0]); length += RB_EVNT_HDR_SIZE; - length = ALIGN(length, RB_ARCH_ALIGNMENT); + length = ALIGN(length, RB_ALIGNMENT); /* * In case the time delta is larger than the 27 bits for it diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index c90c687cf950..525a97fbbc60 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -366,7 +366,7 @@ trace_ignore_this_task(struct trace_pid_list *filtered_pids, struct task_struct } /** - * trace_pid_filter_add_remove_task - Add or remove a task from a pid_list + * trace_filter_add_remove_task - Add or remove a task from a pid_list * @pid_list: The list to modify * @self: The current task for fork or NULL for exit * @task: The task to add or remove @@ -743,8 +743,7 @@ trace_event_setup(struct ring_buffer_event *event, { struct trace_entry *ent = ring_buffer_event_data(event); - tracing_generic_entry_update(ent, flags, pc); - ent->type = type; + tracing_generic_entry_update(ent, type, flags, pc); } static __always_inline struct ring_buffer_event * @@ -2312,13 +2311,14 @@ enum print_line_t trace_handle_return(struct trace_seq *s) EXPORT_SYMBOL_GPL(trace_handle_return); void -tracing_generic_entry_update(struct trace_entry *entry, unsigned long flags, - int pc) +tracing_generic_entry_update(struct trace_entry *entry, unsigned short type, + unsigned long flags, int pc) { struct task_struct *tsk = current; entry->preempt_count = pc & 0xff; entry->pid = (tsk) ? tsk->pid : 0; + entry->type = type; entry->flags = #ifdef CONFIG_TRACE_IRQFLAGS_SUPPORT (irqs_disabled_flags(flags) ? 
TRACE_FLAG_IRQS_OFF : 0) | @@ -4842,12 +4842,13 @@ static const char readme_msg[] = "\t args: <name>=fetcharg[:type]\n" "\t fetcharg: %<register>, @<address>, @<symbol>[+|-<offset>],\n" #ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API - "\t $stack<index>, $stack, $retval, $comm, $arg<N>\n" + "\t $stack<index>, $stack, $retval, $comm, $arg<N>,\n" #else - "\t $stack<index>, $stack, $retval, $comm\n" + "\t $stack<index>, $stack, $retval, $comm,\n" #endif + "\t +|-[u]<offset>(<fetcharg>)\n" "\t type: s8/16/32/64, u8/16/32/64, x8/16/32/64, string, symbol,\n" - "\t b<bit-width>@<bit-offset>/<container-size>,\n" + "\t b<bit-width>@<bit-offset>/<container-size>, ustring,\n" "\t <type>\\[<array-size>\\]\n" #ifdef CONFIG_HIST_TRIGGERS "\t field: <stype> <name>;\n" diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c index 4629a6104474..0892e38ed6fb 100644 --- a/kernel/trace/trace_event_perf.c +++ b/kernel/trace/trace_event_perf.c @@ -416,8 +416,7 @@ void perf_trace_buf_update(void *record, u16 type) unsigned long flags; local_save_flags(flags); - tracing_generic_entry_update(entry, flags, pc); - entry->type = type; + tracing_generic_entry_update(entry, type, flags, pc); } NOKPROBE_SYMBOL(perf_trace_buf_update); diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c index 0ce3db67f556..c7506bc81b75 100644 --- a/kernel/trace/trace_events.c +++ b/kernel/trace/trace_events.c @@ -70,14 +70,6 @@ static int system_refcount_dec(struct event_subsystem *system) #define while_for_each_event_file() \ } -static struct list_head * -trace_get_fields(struct trace_event_call *event_call) -{ - if (!event_call->class->get_fields) - return &event_call->class->fields; - return event_call->class->get_fields(event_call); -} - static struct ftrace_event_field * __find_event_field(struct list_head *head, char *name) { @@ -3190,7 +3182,7 @@ void __init trace_event_init(void) event_trace_enable(); } -#ifdef CONFIG_FTRACE_STARTUP_TEST +#ifdef CONFIG_EVENT_TRACE_STARTUP_TEST static DEFINE_SPINLOCK(test_spinlock); static DEFINE_SPINLOCK(test_spinlock_irq); diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c index 5079d1db3754..c773b8fb270c 100644 --- a/kernel/trace/trace_events_filter.c +++ b/kernel/trace/trace_events_filter.c @@ -1084,6 +1084,9 @@ int filter_assign_type(const char *type) if (strchr(type, '[') && strstr(type, "char")) return FILTER_STATIC_STRING; + if (strcmp(type, "char *") == 0 || strcmp(type, "const char *") == 0) + return FILTER_PTR_STRING; + return FILTER_OTHER; } diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index 7d736248a070..9d483ad9bb6c 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -12,6 +12,8 @@ #include <linux/rculist.h> #include <linux/error-injection.h> +#include <asm/setup.h> /* for COMMAND_LINE_SIZE */ + #include "trace_dynevent.h" #include "trace_kprobe_selftest.h" #include "trace_probe.h" @@ -19,6 +21,18 @@ #define KPROBE_EVENT_SYSTEM "kprobes" #define KRETPROBE_MAXACTIVE_MAX 4096 +#define MAX_KPROBE_CMDLINE_SIZE 1024 + +/* Kprobe early definition from command line */ +static char kprobe_boot_events_buf[COMMAND_LINE_SIZE] __initdata; +static bool kprobe_boot_events_enabled __initdata; + +static int __init set_kprobe_boot_events(char *str) +{ + strlcpy(kprobe_boot_events_buf, str, COMMAND_LINE_SIZE); + return 0; +} +__setup("kprobe_event=", set_kprobe_boot_events); static int trace_kprobe_create(int argc, const char **argv); static int trace_kprobe_show(struct seq_file 
*m, struct dyn_event *ev); @@ -128,8 +142,8 @@ static bool trace_kprobe_match(const char *system, const char *event, { struct trace_kprobe *tk = to_trace_kprobe(ev); - return strcmp(trace_event_name(&tk->tp.call), event) == 0 && - (!system || strcmp(tk->tp.call.class->system, system) == 0); + return strcmp(trace_probe_name(&tk->tp), event) == 0 && + (!system || strcmp(trace_probe_group_name(&tk->tp), system) == 0); } static nokprobe_inline unsigned long trace_kprobe_nhit(struct trace_kprobe *tk) @@ -143,6 +157,12 @@ static nokprobe_inline unsigned long trace_kprobe_nhit(struct trace_kprobe *tk) return nhit; } +static nokprobe_inline bool trace_kprobe_is_registered(struct trace_kprobe *tk) +{ + return !(list_empty(&tk->rp.kp.list) && + hlist_unhashed(&tk->rp.kp.hlist)); +} + /* Return 0 if it fails to find the symbol address */ static nokprobe_inline unsigned long trace_kprobe_address(struct trace_kprobe *tk) @@ -183,6 +203,16 @@ static int kprobe_dispatcher(struct kprobe *kp, struct pt_regs *regs); static int kretprobe_dispatcher(struct kretprobe_instance *ri, struct pt_regs *regs); +static void free_trace_kprobe(struct trace_kprobe *tk) +{ + if (tk) { + trace_probe_cleanup(&tk->tp); + kfree(tk->symbol); + free_percpu(tk->nhit); + kfree(tk); + } +} + /* * Allocate new trace_probe and initialize it (including kprobes). */ @@ -220,49 +250,20 @@ static struct trace_kprobe *alloc_trace_kprobe(const char *group, tk->rp.kp.pre_handler = kprobe_dispatcher; tk->rp.maxactive = maxactive; + INIT_HLIST_NODE(&tk->rp.kp.hlist); + INIT_LIST_HEAD(&tk->rp.kp.list); - if (!event || !group) { - ret = -EINVAL; - goto error; - } - - tk->tp.call.class = &tk->tp.class; - tk->tp.call.name = kstrdup(event, GFP_KERNEL); - if (!tk->tp.call.name) - goto error; - - tk->tp.class.system = kstrdup(group, GFP_KERNEL); - if (!tk->tp.class.system) + ret = trace_probe_init(&tk->tp, event, group); + if (ret < 0) goto error; dyn_event_init(&tk->devent, &trace_kprobe_ops); - INIT_LIST_HEAD(&tk->tp.files); return tk; error: - kfree(tk->tp.call.name); - kfree(tk->symbol); - free_percpu(tk->nhit); - kfree(tk); + free_trace_kprobe(tk); return ERR_PTR(ret); } -static void free_trace_kprobe(struct trace_kprobe *tk) -{ - int i; - - if (!tk) - return; - - for (i = 0; i < tk->tp.nr_args; i++) - traceprobe_free_probe_arg(&tk->tp.args[i]); - - kfree(tk->tp.call.class->system); - kfree(tk->tp.call.name); - kfree(tk->symbol); - free_percpu(tk->nhit); - kfree(tk); -} - static struct trace_kprobe *find_trace_kprobe(const char *event, const char *group) { @@ -270,8 +271,8 @@ static struct trace_kprobe *find_trace_kprobe(const char *event, struct trace_kprobe *tk; for_each_trace_kprobe(tk, pos) - if (strcmp(trace_event_name(&tk->tp.call), event) == 0 && - strcmp(tk->tp.call.class->system, group) == 0) + if (strcmp(trace_probe_name(&tk->tp), event) == 0 && + strcmp(trace_probe_group_name(&tk->tp), group) == 0) return tk; return NULL; } @@ -280,7 +281,7 @@ static inline int __enable_trace_kprobe(struct trace_kprobe *tk) { int ret = 0; - if (trace_probe_is_registered(&tk->tp) && !trace_kprobe_has_gone(tk)) { + if (trace_kprobe_is_registered(tk) && !trace_kprobe_has_gone(tk)) { if (trace_kprobe_is_return(tk)) ret = enable_kretprobe(&tk->rp); else @@ -297,34 +298,27 @@ static inline int __enable_trace_kprobe(struct trace_kprobe *tk) static int enable_trace_kprobe(struct trace_kprobe *tk, struct trace_event_file *file) { - struct event_file_link *link; + bool enabled = trace_probe_is_enabled(&tk->tp); int ret = 0; if (file) { - link = 
kmalloc(sizeof(*link), GFP_KERNEL);
-		if (!link) {
-			ret = -ENOMEM;
-			goto out;
-		}
-
-		link->file = file;
-		list_add_tail_rcu(&link->list, &tk->tp.files);
+		ret = trace_probe_add_file(&tk->tp, file);
+		if (ret)
+			return ret;
+	} else
+		trace_probe_set_flag(&tk->tp, TP_FLAG_PROFILE);

-		tk->tp.flags |= TP_FLAG_TRACE;
-		ret = __enable_trace_kprobe(tk);
-		if (ret) {
-			list_del_rcu(&link->list);
-			kfree(link);
-			tk->tp.flags &= ~TP_FLAG_TRACE;
-		}
+	if (enabled)
+		return 0;

-	} else {
-		tk->tp.flags |= TP_FLAG_PROFILE;
-		ret = __enable_trace_kprobe(tk);
-		if (ret)
-			tk->tp.flags &= ~TP_FLAG_PROFILE;
+	ret = __enable_trace_kprobe(tk);
+	if (ret) {
+		if (file)
+			trace_probe_remove_file(&tk->tp, file);
+		else
+			trace_probe_clear_flag(&tk->tp, TP_FLAG_PROFILE);
	}
- out:
+
	return ret;
}

@@ -335,54 +329,34 @@ enable_trace_kprobe(struct trace_kprobe *tk, struct trace_event_file *file)
static int
disable_trace_kprobe(struct trace_kprobe *tk, struct trace_event_file *file)
{
-	struct event_file_link *link = NULL;
-	int wait = 0;
+	struct trace_probe *tp = &tk->tp;
	int ret = 0;

	if (file) {
-		link = find_event_file_link(&tk->tp, file);
-		if (!link) {
-			ret = -EINVAL;
-			goto out;
-		}
-
-		list_del_rcu(&link->list);
-		wait = 1;
-		if (!list_empty(&tk->tp.files))
+		if (!trace_probe_get_file_link(tp, file))
+			return -ENOENT;
+		if (!trace_probe_has_single_file(tp))
			goto out;
-
-		tk->tp.flags &= ~TP_FLAG_TRACE;
+		trace_probe_clear_flag(tp, TP_FLAG_TRACE);
	} else
-		tk->tp.flags &= ~TP_FLAG_PROFILE;
+		trace_probe_clear_flag(tp, TP_FLAG_PROFILE);

-	if (!trace_probe_is_enabled(&tk->tp) && trace_probe_is_registered(&tk->tp)) {
+	if (!trace_probe_is_enabled(tp) && trace_kprobe_is_registered(tk)) {
		if (trace_kprobe_is_return(tk))
			disable_kretprobe(&tk->rp);
		else
			disable_kprobe(&tk->rp.kp);
-		wait = 1;
	}
-	/*
-	 * if tk is not added to any list, it must be a local trace_kprobe
-	 * created with perf_event_open. We don't need to wait for these
-	 * trace_kprobes
-	 */
-	if (list_empty(&tk->devent.list))
-		wait = 0;
 out:
-	if (wait) {
+	if (file)
		/*
-		 * Synchronize with kprobe_trace_func/kretprobe_trace_func
-		 * to ensure disabled (all running handlers are finished).
-		 * This is not only for kfree(), but also the caller,
-		 * trace_remove_event_call() supposes it for releasing
-		 * event_call related objects, which will be accessed in
-		 * the kprobe_trace_func/kretprobe_trace_func.
+		 * Synchronization is done in the function below. For a perf
+		 * event, file == NULL and perf_trace_event_unreg() calls
+		 * tracepoint_synchronize_unregister() to synchronize the
+		 * event. We don't need to care about it here.
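+		 * For the file != NULL case, trace_probe_remove_file() below
+		 * performs list_del_rcu() + synchronize_rcu() + kfree(), so
+		 * all running kprobe_trace_func()/kretprobe_trace_func()
+		 * handlers are done with the link before it is freed.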
*/ - synchronize_rcu(); - kfree(link); /* Ignored if link == NULL */ - } + trace_probe_remove_file(tp, file); return ret; } @@ -415,7 +389,7 @@ static int __register_trace_kprobe(struct trace_kprobe *tk) { int i, ret; - if (trace_probe_is_registered(&tk->tp)) + if (trace_kprobe_is_registered(tk)) return -EINVAL; if (within_notrace_func(tk)) { @@ -441,21 +415,20 @@ static int __register_trace_kprobe(struct trace_kprobe *tk) else ret = register_kprobe(&tk->rp.kp); - if (ret == 0) - tk->tp.flags |= TP_FLAG_REGISTERED; return ret; } /* Internal unregister function - just handle k*probes and flags */ static void __unregister_trace_kprobe(struct trace_kprobe *tk) { - if (trace_probe_is_registered(&tk->tp)) { + if (trace_kprobe_is_registered(tk)) { if (trace_kprobe_is_return(tk)) unregister_kretprobe(&tk->rp); else unregister_kprobe(&tk->rp.kp); - tk->tp.flags &= ~TP_FLAG_REGISTERED; - /* Cleanup kprobe for reuse */ + /* Cleanup kprobe for reuse and mark it unregistered */ + INIT_HLIST_NODE(&tk->rp.kp.hlist); + INIT_LIST_HEAD(&tk->rp.kp.list); if (tk->rp.kp.symbol_name) tk->rp.kp.addr = NULL; } @@ -487,8 +460,8 @@ static int register_trace_kprobe(struct trace_kprobe *tk) mutex_lock(&event_mutex); /* Delete old (same name) event if exist */ - old_tk = find_trace_kprobe(trace_event_name(&tk->tp.call), - tk->tp.call.class->system); + old_tk = find_trace_kprobe(trace_probe_name(&tk->tp), + trace_probe_group_name(&tk->tp)); if (old_tk) { ret = unregister_trace_kprobe(old_tk); if (ret < 0) @@ -541,7 +514,7 @@ static int trace_kprobe_module_callback(struct notifier_block *nb, ret = __register_trace_kprobe(tk); if (ret) pr_warn("Failed to re-register probe %s on %s: %d\n", - trace_event_name(&tk->tp.call), + trace_probe_name(&tk->tp), mod->name, ret); } } @@ -716,6 +689,10 @@ static int trace_kprobe_create(int argc, const char *argv[]) goto error; /* This can be -ENOMEM */ } + ret = traceprobe_set_print_fmt(&tk->tp, is_return); + if (ret < 0) + goto error; + ret = register_trace_kprobe(tk); if (ret) { trace_probe_log_set_index(1); @@ -767,8 +744,8 @@ static int trace_kprobe_show(struct seq_file *m, struct dyn_event *ev) int i; seq_putc(m, trace_kprobe_is_return(tk) ? 'r' : 'p'); - seq_printf(m, ":%s/%s", tk->tp.call.class->system, - trace_event_name(&tk->tp.call)); + seq_printf(m, ":%s/%s", trace_probe_group_name(&tk->tp), + trace_probe_name(&tk->tp)); if (!tk->symbol) seq_printf(m, " 0x%p", tk->rp.kp.addr); @@ -842,7 +819,7 @@ static int probes_profile_seq_show(struct seq_file *m, void *v) tk = to_trace_kprobe(ev); seq_printf(m, " %-44s %15lu %15lu\n", - trace_event_name(&tk->tp.call), + trace_probe_name(&tk->tp), trace_kprobe_nhit(tk), tk->rp.kp.nmissed); @@ -886,6 +863,15 @@ fetch_store_strlen(unsigned long addr) return (ret < 0) ? ret : len; } +/* Return the length of string -- including null terminal byte */ +static nokprobe_inline int +fetch_store_strlen_user(unsigned long addr) +{ + const void __user *uaddr = (__force const void __user *)addr; + + return strnlen_unsafe_user(uaddr, MAX_STRING_SIZE); +} + /* * Fetch a null-terminated string. Caller MUST set *(u32 *)buf with max * length and relative data location. @@ -894,19 +880,46 @@ static nokprobe_inline int fetch_store_string(unsigned long addr, void *dest, void *base) { int maxlen = get_loc_len(*(u32 *)dest); - u8 *dst = get_loc_data(dest, base); + void *__dest; long ret; if (unlikely(!maxlen)) return -ENOMEM; + + __dest = get_loc_data(dest, base); + /* * Try to get string again, since the string can be changed while * probing. 
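	 * (The u32 at *dest is a data_loc word as built by make_data_loc()
	 * in trace_probe.h: the string length goes in the upper 16 bits and
	 * the offset of the data from @base in the lower 16 bits, which is
	 * what get_loc_len()/get_loc_data() unpack again.)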
*/ - ret = strncpy_from_unsafe(dst, (void *)addr, maxlen); + ret = strncpy_from_unsafe(__dest, (void *)addr, maxlen); + if (ret >= 0) + *(u32 *)dest = make_data_loc(ret, __dest - base); + + return ret; +} +/* + * Fetch a null-terminated string from user. Caller MUST set *(u32 *)buf + * with max length and relative data location. + */ +static nokprobe_inline int +fetch_store_string_user(unsigned long addr, void *dest, void *base) +{ + const void __user *uaddr = (__force const void __user *)addr; + int maxlen = get_loc_len(*(u32 *)dest); + void *__dest; + long ret; + + if (unlikely(!maxlen)) + return -ENOMEM; + + __dest = get_loc_data(dest, base); + + ret = strncpy_from_unsafe_user(__dest, uaddr, maxlen); if (ret >= 0) - *(u32 *)dest = make_data_loc(ret, (void *)dst - base); + *(u32 *)dest = make_data_loc(ret, __dest - base); + return ret; } @@ -916,6 +929,14 @@ probe_mem_read(void *dest, void *src, size_t size) return probe_kernel_read(dest, src, size); } +static nokprobe_inline int +probe_mem_read_user(void *dest, void *src, size_t size) +{ + const void __user *uaddr = (__force const void __user *)src; + + return probe_user_read(dest, uaddr, size); +} + /* Note that we don't verify it, since the code does not come from user space */ static int process_fetch_insn(struct fetch_insn *code, struct pt_regs *regs, void *dest, @@ -971,7 +992,7 @@ __kprobe_trace_func(struct trace_kprobe *tk, struct pt_regs *regs, struct ring_buffer *buffer; int size, dsize, pc; unsigned long irq_flags; - struct trace_event_call *call = &tk->tp.call; + struct trace_event_call *call = trace_probe_event_call(&tk->tp); WARN_ON(call != trace_file->event_call); @@ -1003,7 +1024,7 @@ kprobe_trace_func(struct trace_kprobe *tk, struct pt_regs *regs) { struct event_file_link *link; - list_for_each_entry_rcu(link, &tk->tp.files, list) + trace_probe_for_each_link_rcu(link, &tk->tp) __kprobe_trace_func(tk, regs, link->file); } NOKPROBE_SYMBOL(kprobe_trace_func); @@ -1019,7 +1040,7 @@ __kretprobe_trace_func(struct trace_kprobe *tk, struct kretprobe_instance *ri, struct ring_buffer *buffer; int size, pc, dsize; unsigned long irq_flags; - struct trace_event_call *call = &tk->tp.call; + struct trace_event_call *call = trace_probe_event_call(&tk->tp); WARN_ON(call != trace_file->event_call); @@ -1053,7 +1074,7 @@ kretprobe_trace_func(struct trace_kprobe *tk, struct kretprobe_instance *ri, { struct event_file_link *link; - list_for_each_entry_rcu(link, &tk->tp.files, list) + trace_probe_for_each_link_rcu(link, &tk->tp) __kretprobe_trace_func(tk, ri, regs, link->file); } NOKPROBE_SYMBOL(kretprobe_trace_func); @@ -1070,7 +1091,7 @@ print_kprobe_event(struct trace_iterator *iter, int flags, field = (struct kprobe_trace_entry_head *)iter->ent; tp = container_of(event, struct trace_probe, call.event); - trace_seq_printf(s, "%s: (", trace_event_name(&tp->call)); + trace_seq_printf(s, "%s: (", trace_probe_name(tp)); if (!seq_print_ip_sym(s, field->ip, flags | TRACE_ITER_SYM_OFFSET)) goto out; @@ -1097,7 +1118,7 @@ print_kretprobe_event(struct trace_iterator *iter, int flags, field = (struct kretprobe_trace_entry_head *)iter->ent; tp = container_of(event, struct trace_probe, call.event); - trace_seq_printf(s, "%s: (", trace_event_name(&tp->call)); + trace_seq_printf(s, "%s: (", trace_probe_name(tp)); if (!seq_print_ip_sym(s, field->ret_ip, flags | TRACE_ITER_SYM_OFFSET)) goto out; @@ -1149,7 +1170,7 @@ static int kretprobe_event_define_fields(struct trace_event_call *event_call) static int kprobe_perf_func(struct trace_kprobe *tk, struct 
pt_regs *regs)
{
-	struct trace_event_call *call = &tk->tp.call;
+	struct trace_event_call *call = trace_probe_event_call(&tk->tp);
	struct kprobe_trace_entry_head *entry;
	struct hlist_head *head;
	int size, __size, dsize;
@@ -1199,7 +1220,7 @@ static void
kretprobe_perf_func(struct trace_kprobe *tk, struct kretprobe_instance *ri,
		    struct pt_regs *regs)
{
-	struct trace_event_call *call = &tk->tp.call;
+	struct trace_event_call *call = trace_probe_event_call(&tk->tp);
	struct kretprobe_trace_entry_head *entry;
	struct hlist_head *head;
	int size, __size, dsize;
@@ -1299,10 +1320,10 @@ static int kprobe_dispatcher(struct kprobe *kp, struct pt_regs *regs)
	raw_cpu_inc(*tk->nhit);

-	if (tk->tp.flags & TP_FLAG_TRACE)
+	if (trace_probe_test_flag(&tk->tp, TP_FLAG_TRACE))
		kprobe_trace_func(tk, regs);
#ifdef CONFIG_PERF_EVENTS
-	if (tk->tp.flags & TP_FLAG_PROFILE)
+	if (trace_probe_test_flag(&tk->tp, TP_FLAG_PROFILE))
		ret = kprobe_perf_func(tk, regs);
#endif
	return ret;
@@ -1316,10 +1337,10 @@ kretprobe_dispatcher(struct kretprobe_instance *ri, struct pt_regs *regs)
	raw_cpu_inc(*tk->nhit);

-	if (tk->tp.flags & TP_FLAG_TRACE)
+	if (trace_probe_test_flag(&tk->tp, TP_FLAG_TRACE))
		kretprobe_trace_func(tk, ri, regs);
#ifdef CONFIG_PERF_EVENTS
-	if (tk->tp.flags & TP_FLAG_PROFILE)
+	if (trace_probe_test_flag(&tk->tp, TP_FLAG_PROFILE))
		kretprobe_perf_func(tk, ri, regs);
#endif
	return 0;	/* We don't tweak the kernel, so just return 0 */
@@ -1334,10 +1355,10 @@ static struct trace_event_functions kprobe_funcs = {
	.trace		= print_kprobe_event
};

-static inline void init_trace_event_call(struct trace_kprobe *tk,
-					 struct trace_event_call *call)
+static inline void init_trace_event_call(struct trace_kprobe *tk)
{
-	INIT_LIST_HEAD(&call->class->fields);
+	struct trace_event_call *call = trace_probe_event_call(&tk->tp);
+
	if (trace_kprobe_is_return(tk)) {
		call->event.funcs = &kretprobe_funcs;
		call->class->define_fields = kretprobe_event_define_fields;
@@ -1353,37 +1374,14 @@ static inline void init_trace_event_call(struct trace_kprobe *tk,

static int register_kprobe_event(struct trace_kprobe *tk)
{
-	struct trace_event_call *call = &tk->tp.call;
-	int ret = 0;
-
-	init_trace_event_call(tk, call);
+	init_trace_event_call(tk);

-	if (traceprobe_set_print_fmt(&tk->tp, trace_kprobe_is_return(tk)) < 0)
-		return -ENOMEM;
-	ret = register_trace_event(&call->event);
-	if (!ret) {
-		kfree(call->print_fmt);
-		return -ENODEV;
-	}
-	ret = trace_add_event_call(call);
-	if (ret) {
-		pr_info("Failed to register kprobe event: %s\n",
-			trace_event_name(call));
-		kfree(call->print_fmt);
-		unregister_trace_event(&call->event);
-	}
-	return ret;
+	return trace_probe_register_event_call(&tk->tp);
}

static int unregister_kprobe_event(struct trace_kprobe *tk)
{
-	int ret;
-
-	/* tp->event is unregistered in trace_remove_event_call() */
-	ret = trace_remove_event_call(&tk->tp.call);
-	if (!ret)
-		kfree(tk->tp.call.print_fmt);
-	return ret;
+	return trace_probe_unregister_event_call(&tk->tp);
}

#ifdef CONFIG_PERF_EVENTS
@@ -1413,7 +1411,7 @@ create_local_trace_kprobe(char *func, void *addr, unsigned long offs,
		return ERR_CAST(tk);
	}

-	init_trace_event_call(tk, &tk->tp.call);
+	init_trace_event_call(tk);

	if (traceprobe_set_print_fmt(&tk->tp, trace_kprobe_is_return(tk)) < 0) {
		ret = -ENOMEM;
@@ -1421,12 +1419,10 @@ create_local_trace_kprobe(char *func, void *addr, unsigned long offs,
	}

	ret = __register_trace_kprobe(tk);
-	if (ret < 0) {
-		kfree(tk->tp.call.print_fmt);
+	if (ret < 0)
		goto error;
-	}

-	return &tk->tp.call;
+	return
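/*
 * Local events created here on behalf of perf_event_open() are never
 * added to the dyn_event list or exposed via tracefs, which is why
 * destroy_local_trace_kprobe() below unregisters and frees them
 * directly instead of going through the dynamic-event ops.
 */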
trace_probe_event_call(&tk->tp); error: free_trace_kprobe(tk); return ERR_PTR(ret); @@ -1445,11 +1441,50 @@ void destroy_local_trace_kprobe(struct trace_event_call *event_call) __unregister_trace_kprobe(tk); - kfree(tk->tp.call.print_fmt); free_trace_kprobe(tk); } #endif /* CONFIG_PERF_EVENTS */ +static __init void enable_boot_kprobe_events(void) +{ + struct trace_array *tr = top_trace_array(); + struct trace_event_file *file; + struct trace_kprobe *tk; + struct dyn_event *pos; + + mutex_lock(&event_mutex); + for_each_trace_kprobe(tk, pos) { + list_for_each_entry(file, &tr->events, list) + if (file->event_call == trace_probe_event_call(&tk->tp)) + trace_event_enable_disable(file, 1, 0); + } + mutex_unlock(&event_mutex); +} + +static __init void setup_boot_kprobe_events(void) +{ + char *p, *cmd = kprobe_boot_events_buf; + int ret; + + strreplace(kprobe_boot_events_buf, ',', ' '); + + while (cmd && *cmd != '\0') { + p = strchr(cmd, ';'); + if (p) + *p++ = '\0'; + + ret = trace_run_command(cmd, create_or_delete_trace_kprobe); + if (ret) + pr_warn("Failed to add event(%d): %s\n", ret, cmd); + else + kprobe_boot_events_enabled = true; + + cmd = p; + } + + enable_boot_kprobe_events(); +} + /* Make a tracefs interface for controlling probe points */ static __init int init_kprobe_trace(void) { @@ -1481,6 +1516,9 @@ static __init int init_kprobe_trace(void) if (!entry) pr_warn("Could not create tracefs 'kprobe_profile' entry\n"); + + setup_boot_kprobe_events(); + return 0; } fs_initcall(init_kprobe_trace); @@ -1493,7 +1531,7 @@ find_trace_probe_file(struct trace_kprobe *tk, struct trace_array *tr) struct trace_event_file *file; list_for_each_entry(file, &tr->events, list) - if (file->event_call == &tk->tp.call) + if (file->event_call == trace_probe_event_call(&tk->tp)) return file; return NULL; @@ -1513,6 +1551,11 @@ static __init int kprobe_trace_self_tests_init(void) if (tracing_is_disabled()) return -ENODEV; + if (kprobe_boot_events_enabled) { + pr_info("Skipping kprobe tests due to kprobe_event on cmdline\n"); + return 0; + } + target = kprobe_trace_selftest_target; pr_info("Testing kprobe tracing: "); diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c index a347faced959..dbef0d135075 100644 --- a/kernel/trace/trace_probe.c +++ b/kernel/trace/trace_probe.c @@ -78,6 +78,8 @@ static const struct fetch_type probe_fetch_types[] = { /* Special types */ __ASSIGN_FETCH_TYPE("string", string, string, sizeof(u32), 1, "__data_loc char[]"), + __ASSIGN_FETCH_TYPE("ustring", string, string, sizeof(u32), 1, + "__data_loc char[]"), /* Basic types */ ASSIGN_FETCH_TYPE(u8, u8, 0), ASSIGN_FETCH_TYPE(u16, u16, 0), @@ -322,6 +324,7 @@ parse_probe_arg(char *arg, const struct fetch_type *type, { struct fetch_insn *code = *pcode; unsigned long param; + int deref = FETCH_OP_DEREF; long offset = 0; char *tmp; int ret = 0; @@ -394,9 +397,14 @@ parse_probe_arg(char *arg, const struct fetch_type *type, break; case '+': /* deref memory */ - arg++; /* Skip '+', because kstrtol() rejects it. */ - /* fall through */ case '-': + if (arg[1] == 'u') { + deref = FETCH_OP_UDEREF; + arg[1] = arg[0]; + arg++; + } + if (arg[0] == '+') + arg++; /* Skip '+', because kstrtol() rejects it. 
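	 * A 'u' right after the sign requests a user-space dereference
	 * (FETCH_OP_UDEREF) instead of a kernel one; for example, the
	 * hypothetical probe definition (argument meaning assumed, not
	 * taken from this patch)
	 *
	 *   p:myopen do_sys_open path=+u0($arg2):ustring
	 *
	 * follows $arg2 as a user pointer and stores the string with
	 * strncpy_from_unsafe_user().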
*/ tmp = strchr(arg, '('); if (!tmp) { trace_probe_log_err(offs, DEREF_NEED_BRACE); @@ -432,7 +440,7 @@ parse_probe_arg(char *arg, const struct fetch_type *type, } *pcode = code; - code->op = FETCH_OP_DEREF; + code->op = deref; code->offset = offset; } break; @@ -569,15 +577,17 @@ static int traceprobe_parse_probe_arg_body(char *arg, ssize_t *size, goto fail; /* Store operation */ - if (!strcmp(parg->type->name, "string")) { - if (code->op != FETCH_OP_DEREF && code->op != FETCH_OP_IMM && - code->op != FETCH_OP_COMM) { + if (!strcmp(parg->type->name, "string") || + !strcmp(parg->type->name, "ustring")) { + if (code->op != FETCH_OP_DEREF && code->op != FETCH_OP_UDEREF && + code->op != FETCH_OP_IMM && code->op != FETCH_OP_COMM) { trace_probe_log_err(offset + (t ? (t - arg) : 0), BAD_STRING); ret = -EINVAL; goto fail; } - if (code->op != FETCH_OP_DEREF || parg->count) { + if ((code->op == FETCH_OP_IMM || code->op == FETCH_OP_COMM) || + parg->count) { /* * IMM and COMM is pointing actual address, those must * be kept, and if parg->count != 0, this is an array @@ -590,12 +600,20 @@ static int traceprobe_parse_probe_arg_body(char *arg, ssize_t *size, goto fail; } } - code->op = FETCH_OP_ST_STRING; /* In DEREF case, replace it */ + /* If op == DEREF, replace it with STRING */ + if (!strcmp(parg->type->name, "ustring") || + code->op == FETCH_OP_UDEREF) + code->op = FETCH_OP_ST_USTRING; + else + code->op = FETCH_OP_ST_STRING; code->size = parg->type->size; parg->dynamic = true; } else if (code->op == FETCH_OP_DEREF) { code->op = FETCH_OP_ST_MEM; code->size = parg->type->size; + } else if (code->op == FETCH_OP_UDEREF) { + code->op = FETCH_OP_ST_UMEM; + code->size = parg->type->size; } else { code++; if (code->op != FETCH_OP_NOP) { @@ -618,7 +636,8 @@ static int traceprobe_parse_probe_arg_body(char *arg, ssize_t *size, /* Loop(Array) operation */ if (parg->count) { if (scode->op != FETCH_OP_ST_MEM && - scode->op != FETCH_OP_ST_STRING) { + scode->op != FETCH_OP_ST_STRING && + scode->op != FETCH_OP_ST_USTRING) { trace_probe_log_err(offset + (t ? 
(t - arg) : 0), BAD_STRING); ret = -EINVAL; @@ -825,6 +844,7 @@ static int __set_print_fmt(struct trace_probe *tp, char *buf, int len, int traceprobe_set_print_fmt(struct trace_probe *tp, bool is_return) { + struct trace_event_call *call = trace_probe_event_call(tp); int len; char *print_fmt; @@ -836,7 +856,7 @@ int traceprobe_set_print_fmt(struct trace_probe *tp, bool is_return) /* Second: actually write the @print_fmt */ __set_print_fmt(tp, print_fmt, len + 1, is_return); - tp->call.print_fmt = print_fmt; + call->print_fmt = print_fmt; return 0; } @@ -865,3 +885,105 @@ int traceprobe_define_arg_fields(struct trace_event_call *event_call, } return 0; } + + +void trace_probe_cleanup(struct trace_probe *tp) +{ + struct trace_event_call *call = trace_probe_event_call(tp); + int i; + + for (i = 0; i < tp->nr_args; i++) + traceprobe_free_probe_arg(&tp->args[i]); + + kfree(call->class->system); + kfree(call->name); + kfree(call->print_fmt); +} + +int trace_probe_init(struct trace_probe *tp, const char *event, + const char *group) +{ + struct trace_event_call *call = trace_probe_event_call(tp); + + if (!event || !group) + return -EINVAL; + + call->class = &tp->class; + call->name = kstrdup(event, GFP_KERNEL); + if (!call->name) + return -ENOMEM; + + tp->class.system = kstrdup(group, GFP_KERNEL); + if (!tp->class.system) { + kfree(call->name); + call->name = NULL; + return -ENOMEM; + } + INIT_LIST_HEAD(&tp->files); + INIT_LIST_HEAD(&tp->class.fields); + + return 0; +} + +int trace_probe_register_event_call(struct trace_probe *tp) +{ + struct trace_event_call *call = trace_probe_event_call(tp); + int ret; + + ret = register_trace_event(&call->event); + if (!ret) + return -ENODEV; + + ret = trace_add_event_call(call); + if (ret) + unregister_trace_event(&call->event); + + return ret; +} + +int trace_probe_add_file(struct trace_probe *tp, struct trace_event_file *file) +{ + struct event_file_link *link; + + link = kmalloc(sizeof(*link), GFP_KERNEL); + if (!link) + return -ENOMEM; + + link->file = file; + INIT_LIST_HEAD(&link->list); + list_add_tail_rcu(&link->list, &tp->files); + trace_probe_set_flag(tp, TP_FLAG_TRACE); + return 0; +} + +struct event_file_link *trace_probe_get_file_link(struct trace_probe *tp, + struct trace_event_file *file) +{ + struct event_file_link *link; + + trace_probe_for_each_link(link, tp) { + if (link->file == file) + return link; + } + + return NULL; +} + +int trace_probe_remove_file(struct trace_probe *tp, + struct trace_event_file *file) +{ + struct event_file_link *link; + + link = trace_probe_get_file_link(tp, file); + if (!link) + return -ENOENT; + + list_del_rcu(&link->list); + synchronize_rcu(); + kfree(link); + + if (list_empty(&tp->files)) + trace_probe_clear_flag(tp, TP_FLAG_TRACE); + + return 0; +} diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h index f9a8c632188b..d1714820efe1 100644 --- a/kernel/trace/trace_probe.h +++ b/kernel/trace/trace_probe.h @@ -55,7 +55,6 @@ /* Flags for trace_probe */ #define TP_FLAG_TRACE 1 #define TP_FLAG_PROFILE 2 -#define TP_FLAG_REGISTERED 4 /* data_loc: data location, compatible with u32 */ #define make_data_loc(len, offs) \ @@ -92,10 +91,13 @@ enum fetch_op { FETCH_OP_FOFFS, /* File offset: .immediate */ // Stage 2 (dereference) op FETCH_OP_DEREF, /* Dereference: .offset */ + FETCH_OP_UDEREF, /* User-space Dereference: .offset */ // Stage 3 (store) ops FETCH_OP_ST_RAW, /* Raw: .size */ FETCH_OP_ST_MEM, /* Mem: .offset, .size */ + FETCH_OP_ST_UMEM, /* Mem: .offset, .size */ FETCH_OP_ST_STRING, /* String: 
.offset, .size */ + FETCH_OP_ST_USTRING, /* User String: .offset, .size */ // Stage 4 (modify) op FETCH_OP_MOD_BF, /* Bitfield: .basesize, .lshift, .rshift */ // Stage 5 (loop) op @@ -235,16 +237,71 @@ struct event_file_link { struct list_head list; }; +static inline bool trace_probe_test_flag(struct trace_probe *tp, + unsigned int flag) +{ + return !!(tp->flags & flag); +} + +static inline void trace_probe_set_flag(struct trace_probe *tp, + unsigned int flag) +{ + tp->flags |= flag; +} + +static inline void trace_probe_clear_flag(struct trace_probe *tp, + unsigned int flag) +{ + tp->flags &= ~flag; +} + static inline bool trace_probe_is_enabled(struct trace_probe *tp) { - return !!(tp->flags & (TP_FLAG_TRACE | TP_FLAG_PROFILE)); + return trace_probe_test_flag(tp, TP_FLAG_TRACE | TP_FLAG_PROFILE); } -static inline bool trace_probe_is_registered(struct trace_probe *tp) +static inline const char *trace_probe_name(struct trace_probe *tp) { - return !!(tp->flags & TP_FLAG_REGISTERED); + return trace_event_name(&tp->call); } +static inline const char *trace_probe_group_name(struct trace_probe *tp) +{ + return tp->call.class->system; +} + +static inline struct trace_event_call * + trace_probe_event_call(struct trace_probe *tp) +{ + return &tp->call; +} + +static inline int trace_probe_unregister_event_call(struct trace_probe *tp) +{ + /* tp->event is unregistered in trace_remove_event_call() */ + return trace_remove_event_call(&tp->call); +} + +static inline bool trace_probe_has_single_file(struct trace_probe *tp) +{ + return !!list_is_singular(&tp->files); +} + +int trace_probe_init(struct trace_probe *tp, const char *event, + const char *group); +void trace_probe_cleanup(struct trace_probe *tp); +int trace_probe_register_event_call(struct trace_probe *tp); +int trace_probe_add_file(struct trace_probe *tp, struct trace_event_file *file); +int trace_probe_remove_file(struct trace_probe *tp, + struct trace_event_file *file); +struct event_file_link *trace_probe_get_file_link(struct trace_probe *tp, + struct trace_event_file *file); + +#define trace_probe_for_each_link(pos, tp) \ + list_for_each_entry(pos, &(tp)->files, list) +#define trace_probe_for_each_link_rcu(pos, tp) \ + list_for_each_entry_rcu(pos, &(tp)->files, list) + /* Check the name is good for event/group/fields */ static inline bool is_good_name(const char *name) { @@ -257,18 +314,6 @@ static inline bool is_good_name(const char *name) return true; } -static inline struct event_file_link * -find_event_file_link(struct trace_probe *tp, struct trace_event_file *file) -{ - struct event_file_link *link; - - list_for_each_entry(link, &tp->files, list) - if (link->file == file) - return link; - - return NULL; -} - #define TPARG_FL_RETURN BIT(0) #define TPARG_FL_KERNEL BIT(1) #define TPARG_FL_FENTRY BIT(2) diff --git a/kernel/trace/trace_probe_tmpl.h b/kernel/trace/trace_probe_tmpl.h index c30c61f12ddd..e5282828f4a6 100644 --- a/kernel/trace/trace_probe_tmpl.h +++ b/kernel/trace/trace_probe_tmpl.h @@ -59,8 +59,13 @@ process_fetch_insn(struct fetch_insn *code, struct pt_regs *regs, static nokprobe_inline int fetch_store_strlen(unsigned long addr); static nokprobe_inline int fetch_store_string(unsigned long addr, void *dest, void *base); +static nokprobe_inline int fetch_store_strlen_user(unsigned long addr); +static nokprobe_inline int +fetch_store_string_user(unsigned long addr, void *dest, void *base); static nokprobe_inline int probe_mem_read(void *dest, void *src, size_t size); +static nokprobe_inline int +probe_mem_read_user(void *dest, 
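/*
 * The template below pairs each op with one of these hooks:
 * FETCH_OP_DEREF/FETCH_OP_ST_MEM go through probe_mem_read() and
 * FETCH_OP_UDEREF/FETCH_OP_ST_UMEM through probe_mem_read_user(),
 * so every includer of this header must supply both flavours.
 */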
void *src, size_t size); /* From the 2nd stage, routine is same */ static nokprobe_inline int @@ -74,14 +79,21 @@ process_fetch_insn_bottom(struct fetch_insn *code, unsigned long val, stage2: /* 2nd stage: dereference memory if needed */ - while (code->op == FETCH_OP_DEREF) { - lval = val; - ret = probe_mem_read(&val, (void *)val + code->offset, - sizeof(val)); + do { + if (code->op == FETCH_OP_DEREF) { + lval = val; + ret = probe_mem_read(&val, (void *)val + code->offset, + sizeof(val)); + } else if (code->op == FETCH_OP_UDEREF) { + lval = val; + ret = probe_mem_read_user(&val, + (void *)val + code->offset, sizeof(val)); + } else + break; if (ret) return ret; code++; - } + } while (1); s3 = code; stage3: @@ -91,6 +103,10 @@ stage3: ret = fetch_store_strlen(val + code->offset); code++; goto array; + } else if (code->op == FETCH_OP_ST_USTRING) { + ret += fetch_store_strlen_user(val + code->offset); + code++; + goto array; } else return -EILSEQ; } @@ -102,10 +118,17 @@ stage3: case FETCH_OP_ST_MEM: probe_mem_read(dest, (void *)val + code->offset, code->size); break; + case FETCH_OP_ST_UMEM: + probe_mem_read_user(dest, (void *)val + code->offset, code->size); + break; case FETCH_OP_ST_STRING: loc = *(u32 *)dest; ret = fetch_store_string(val + code->offset, dest, base); break; + case FETCH_OP_ST_USTRING: + loc = *(u32 *)dest; + ret = fetch_store_string_user(val + code->offset, dest, base); + break; default: return -EILSEQ; } @@ -123,7 +146,8 @@ array: total += ret; if (++i < code->param) { code = s3; - if (s3->op != FETCH_OP_ST_STRING) { + if (s3->op != FETCH_OP_ST_STRING && + s3->op != FETCH_OP_ST_USTRING) { dest += s3->size; val += s3->size; goto stage3; diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c index 7860e3f59fad..1ceedb9146b1 100644 --- a/kernel/trace/trace_uprobe.c +++ b/kernel/trace/trace_uprobe.c @@ -140,6 +140,13 @@ probe_mem_read(void *dest, void *src, size_t size) return copy_from_user(dest, vaddr, size) ? -EFAULT : 0; } + +static nokprobe_inline int +probe_mem_read_user(void *dest, void *src, size_t size) +{ + return probe_mem_read(dest, src, size); +} + /* * Fetch a null-terminated string. Caller MUST set *(u32 *)dest with max * length and relative data location. @@ -176,6 +183,12 @@ fetch_store_string(unsigned long addr, void *dest, void *base) return ret; } +static nokprobe_inline int +fetch_store_string_user(unsigned long addr, void *dest, void *base) +{ + return fetch_store_string(addr, dest, base); +} + /* Return the length of string -- including null terminal byte */ static nokprobe_inline int fetch_store_strlen(unsigned long addr) @@ -191,6 +204,12 @@ fetch_store_strlen(unsigned long addr) return (len > MAX_STRING_SIZE) ? 
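/*
 * The _user variants below are plain aliases: uprobe handlers always
 * run in task context on user-space addresses, so probe_mem_read()
 * here already goes through copy_from_user() and needs no separate
 * user-space implementation.
 */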
0 : len; } +static nokprobe_inline int +fetch_store_strlen_user(unsigned long addr) +{ + return fetch_store_strlen(addr); +} + static unsigned long translate_user_vaddr(unsigned long file_offset) { unsigned long base_addr; @@ -270,8 +289,8 @@ static bool trace_uprobe_match(const char *system, const char *event, { struct trace_uprobe *tu = to_trace_uprobe(ev); - return strcmp(trace_event_name(&tu->tp.call), event) == 0 && - (!system || strcmp(tu->tp.call.class->system, system) == 0); + return strcmp(trace_probe_name(&tu->tp), event) == 0 && + (!system || strcmp(trace_probe_group_name(&tu->tp), system) == 0); } /* @@ -281,25 +300,17 @@ static struct trace_uprobe * alloc_trace_uprobe(const char *group, const char *event, int nargs, bool is_ret) { struct trace_uprobe *tu; - - if (!event || !group) - return ERR_PTR(-EINVAL); + int ret; tu = kzalloc(SIZEOF_TRACE_UPROBE(nargs), GFP_KERNEL); if (!tu) return ERR_PTR(-ENOMEM); - tu->tp.call.class = &tu->tp.class; - tu->tp.call.name = kstrdup(event, GFP_KERNEL); - if (!tu->tp.call.name) - goto error; - - tu->tp.class.system = kstrdup(group, GFP_KERNEL); - if (!tu->tp.class.system) + ret = trace_probe_init(&tu->tp, event, group); + if (ret < 0) goto error; dyn_event_init(&tu->devent, &trace_uprobe_ops); - INIT_LIST_HEAD(&tu->tp.files); tu->consumer.handler = uprobe_dispatcher; if (is_ret) tu->consumer.ret_handler = uretprobe_dispatcher; @@ -307,25 +318,18 @@ alloc_trace_uprobe(const char *group, const char *event, int nargs, bool is_ret) return tu; error: - kfree(tu->tp.call.name); kfree(tu); - return ERR_PTR(-ENOMEM); + return ERR_PTR(ret); } static void free_trace_uprobe(struct trace_uprobe *tu) { - int i; - if (!tu) return; - for (i = 0; i < tu->tp.nr_args; i++) - traceprobe_free_probe_arg(&tu->tp.args[i]); - path_put(&tu->path); - kfree(tu->tp.call.class->system); - kfree(tu->tp.call.name); + trace_probe_cleanup(&tu->tp); kfree(tu->filename); kfree(tu); } @@ -336,8 +340,8 @@ static struct trace_uprobe *find_probe_event(const char *event, const char *grou struct trace_uprobe *tu; for_each_trace_uprobe(tu, pos) - if (strcmp(trace_event_name(&tu->tp.call), event) == 0 && - strcmp(tu->tp.call.class->system, group) == 0) + if (strcmp(trace_probe_name(&tu->tp), event) == 0 && + strcmp(trace_probe_group_name(&tu->tp), group) == 0) return tu; return NULL; @@ -372,8 +376,8 @@ static struct trace_uprobe *find_old_trace_uprobe(struct trace_uprobe *new) struct trace_uprobe *tmp, *old = NULL; struct inode *new_inode = d_real_inode(new->path.dentry); - old = find_probe_event(trace_event_name(&new->tp.call), - new->tp.call.class->system); + old = find_probe_event(trace_probe_name(&new->tp), + trace_probe_group_name(&new->tp)); for_each_trace_uprobe(tmp, pos) { if ((old ? old != tmp : true) && @@ -578,6 +582,10 @@ static int trace_uprobe_create(int argc, const char **argv) goto error; } + ret = traceprobe_set_print_fmt(&tu->tp, is_ret_probe(tu)); + if (ret < 0) + goto error; + ret = register_trace_uprobe(tu); if (!ret) goto out; @@ -621,8 +629,8 @@ static int trace_uprobe_show(struct seq_file *m, struct dyn_event *ev) char c = is_ret_probe(tu) ? 
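/*
 * As on the kprobe side, trace_uprobe_create() now calls
 * traceprobe_set_print_fmt() itself; register_uprobe_event() no longer
 * allocates print_fmt, and trace_probe_cleanup() frees it, keeping the
 * create and free paths symmetric.
 */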
'r' : 'p'; int i; - seq_printf(m, "%c:%s/%s %s:0x%0*lx", c, tu->tp.call.class->system, - trace_event_name(&tu->tp.call), tu->filename, + seq_printf(m, "%c:%s/%s %s:0x%0*lx", c, trace_probe_group_name(&tu->tp), + trace_probe_name(&tu->tp), tu->filename, (int)(sizeof(void *) * 2), tu->offset); if (tu->ref_ctr_offset) @@ -692,7 +700,7 @@ static int probes_profile_seq_show(struct seq_file *m, void *v) tu = to_trace_uprobe(ev); seq_printf(m, " %s %-44s %15lu\n", tu->filename, - trace_event_name(&tu->tp.call), tu->nhit); + trace_probe_name(&tu->tp), tu->nhit); return 0; } @@ -818,7 +826,7 @@ static void __uprobe_trace_func(struct trace_uprobe *tu, struct ring_buffer *buffer; void *data; int size, esize; - struct trace_event_call *call = &tu->tp.call; + struct trace_event_call *call = trace_probe_event_call(&tu->tp); WARN_ON(call != trace_file->event_call); @@ -860,7 +868,7 @@ static int uprobe_trace_func(struct trace_uprobe *tu, struct pt_regs *regs, return 0; rcu_read_lock(); - list_for_each_entry_rcu(link, &tu->tp.files, list) + trace_probe_for_each_link_rcu(link, &tu->tp) __uprobe_trace_func(tu, 0, regs, ucb, dsize, link->file); rcu_read_unlock(); @@ -874,7 +882,7 @@ static void uretprobe_trace_func(struct trace_uprobe *tu, unsigned long func, struct event_file_link *link; rcu_read_lock(); - list_for_each_entry_rcu(link, &tu->tp.files, list) + trace_probe_for_each_link_rcu(link, &tu->tp) __uprobe_trace_func(tu, func, regs, ucb, dsize, link->file); rcu_read_unlock(); } @@ -893,12 +901,12 @@ print_uprobe_event(struct trace_iterator *iter, int flags, struct trace_event *e if (is_ret_probe(tu)) { trace_seq_printf(s, "%s: (0x%lx <- 0x%lx)", - trace_event_name(&tu->tp.call), + trace_probe_name(&tu->tp), entry->vaddr[1], entry->vaddr[0]); data = DATAOF_TRACE_ENTRY(entry, true); } else { trace_seq_printf(s, "%s: (0x%lx)", - trace_event_name(&tu->tp.call), + trace_probe_name(&tu->tp), entry->vaddr[0]); data = DATAOF_TRACE_ENTRY(entry, false); } @@ -921,26 +929,20 @@ probe_event_enable(struct trace_uprobe *tu, struct trace_event_file *file, filter_func_t filter) { bool enabled = trace_probe_is_enabled(&tu->tp); - struct event_file_link *link = NULL; int ret; if (file) { - if (tu->tp.flags & TP_FLAG_PROFILE) + if (trace_probe_test_flag(&tu->tp, TP_FLAG_PROFILE)) return -EINTR; - link = kmalloc(sizeof(*link), GFP_KERNEL); - if (!link) - return -ENOMEM; - - link->file = file; - list_add_tail_rcu(&link->list, &tu->tp.files); - - tu->tp.flags |= TP_FLAG_TRACE; + ret = trace_probe_add_file(&tu->tp, file); + if (ret < 0) + return ret; } else { - if (tu->tp.flags & TP_FLAG_TRACE) + if (trace_probe_test_flag(&tu->tp, TP_FLAG_TRACE)) return -EINTR; - tu->tp.flags |= TP_FLAG_PROFILE; + trace_probe_set_flag(&tu->tp, TP_FLAG_PROFILE); } WARN_ON(!uprobe_filter_is_empty(&tu->filter)); @@ -970,13 +972,11 @@ probe_event_enable(struct trace_uprobe *tu, struct trace_event_file *file, uprobe_buffer_disable(); err_flags: - if (file) { - list_del(&link->list); - kfree(link); - tu->tp.flags &= ~TP_FLAG_TRACE; - } else { - tu->tp.flags &= ~TP_FLAG_PROFILE; - } + if (file) + trace_probe_remove_file(&tu->tp, file); + else + trace_probe_clear_flag(&tu->tp, TP_FLAG_PROFILE); + return ret; } @@ -987,26 +987,18 @@ probe_event_disable(struct trace_uprobe *tu, struct trace_event_file *file) return; if (file) { - struct event_file_link *link; - - link = find_event_file_link(&tu->tp, file); - if (!link) + if (trace_probe_remove_file(&tu->tp, file) < 0) return; - list_del_rcu(&link->list); - /* synchronize with u{,ret}probe_trace_func 
*/ - synchronize_rcu(); - kfree(link); - - if (!list_empty(&tu->tp.files)) + if (trace_probe_is_enabled(&tu->tp)) return; - } + } else + trace_probe_clear_flag(&tu->tp, TP_FLAG_PROFILE); WARN_ON(!uprobe_filter_is_empty(&tu->filter)); uprobe_unregister(tu->inode, tu->offset, &tu->consumer); tu->inode = NULL; - tu->tp.flags &= file ? ~TP_FLAG_TRACE : ~TP_FLAG_PROFILE; uprobe_buffer_disable(); } @@ -1126,7 +1118,7 @@ static void __uprobe_perf_func(struct trace_uprobe *tu, unsigned long func, struct pt_regs *regs, struct uprobe_cpu_buffer *ucb, int dsize) { - struct trace_event_call *call = &tu->tp.call; + struct trace_event_call *call = trace_probe_event_call(&tu->tp); struct uprobe_trace_entry_head *entry; struct hlist_head *head; void *data; @@ -1279,11 +1271,11 @@ static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs) ucb = uprobe_buffer_get(); store_trace_args(ucb->buf, &tu->tp, regs, esize, dsize); - if (tu->tp.flags & TP_FLAG_TRACE) + if (trace_probe_test_flag(&tu->tp, TP_FLAG_TRACE)) ret |= uprobe_trace_func(tu, regs, ucb, dsize); #ifdef CONFIG_PERF_EVENTS - if (tu->tp.flags & TP_FLAG_PROFILE) + if (trace_probe_test_flag(&tu->tp, TP_FLAG_PROFILE)) ret |= uprobe_perf_func(tu, regs, ucb, dsize); #endif uprobe_buffer_put(ucb); @@ -1314,11 +1306,11 @@ static int uretprobe_dispatcher(struct uprobe_consumer *con, ucb = uprobe_buffer_get(); store_trace_args(ucb->buf, &tu->tp, regs, esize, dsize); - if (tu->tp.flags & TP_FLAG_TRACE) + if (trace_probe_test_flag(&tu->tp, TP_FLAG_TRACE)) uretprobe_trace_func(tu, func, regs, ucb, dsize); #ifdef CONFIG_PERF_EVENTS - if (tu->tp.flags & TP_FLAG_PROFILE) + if (trace_probe_test_flag(&tu->tp, TP_FLAG_PROFILE)) uretprobe_perf_func(tu, func, regs, ucb, dsize); #endif uprobe_buffer_put(ucb); @@ -1329,10 +1321,10 @@ static struct trace_event_functions uprobe_funcs = { .trace = print_uprobe_event }; -static inline void init_trace_event_call(struct trace_uprobe *tu, - struct trace_event_call *call) +static inline void init_trace_event_call(struct trace_uprobe *tu) { - INIT_LIST_HEAD(&call->class->fields); + struct trace_event_call *call = trace_probe_event_call(&tu->tp); + call->event.funcs = &uprobe_funcs; call->class->define_fields = uprobe_event_define_fields; @@ -1343,43 +1335,14 @@ static inline void init_trace_event_call(struct trace_uprobe *tu, static int register_uprobe_event(struct trace_uprobe *tu) { - struct trace_event_call *call = &tu->tp.call; - int ret = 0; - - init_trace_event_call(tu, call); - - if (traceprobe_set_print_fmt(&tu->tp, is_ret_probe(tu)) < 0) - return -ENOMEM; + init_trace_event_call(tu); - ret = register_trace_event(&call->event); - if (!ret) { - kfree(call->print_fmt); - return -ENODEV; - } - - ret = trace_add_event_call(call); - - if (ret) { - pr_info("Failed to register uprobe event: %s\n", - trace_event_name(call)); - kfree(call->print_fmt); - unregister_trace_event(&call->event); - } - - return ret; + return trace_probe_register_event_call(&tu->tp); } static int unregister_uprobe_event(struct trace_uprobe *tu) { - int ret; - - /* tu->event is unregistered in trace_remove_event_call() */ - ret = trace_remove_event_call(&tu->tp.call); - if (ret) - return ret; - kfree(tu->tp.call.print_fmt); - tu->tp.call.print_fmt = NULL; - return 0; + return trace_probe_unregister_event_call(&tu->tp); } #ifdef CONFIG_PERF_EVENTS @@ -1419,14 +1382,14 @@ create_local_trace_uprobe(char *name, unsigned long offs, tu->path = path; tu->ref_ctr_offset = ref_ctr_offset; tu->filename = kstrdup(name, GFP_KERNEL); - 
init_trace_event_call(tu, &tu->tp.call); + init_trace_event_call(tu); if (traceprobe_set_print_fmt(&tu->tp, is_ret_probe(tu)) < 0) { ret = -ENOMEM; goto error; } - return &tu->tp.call; + return trace_probe_event_call(&tu->tp); error: free_trace_uprobe(tu); return ERR_PTR(ret); @@ -1438,9 +1401,6 @@ void destroy_local_trace_uprobe(struct trace_event_call *event_call) tu = container_of(event_call, struct trace_uprobe, tp.call); - kfree(tu->tp.call.print_fmt); - tu->tp.call.print_fmt = NULL; - free_trace_uprobe(tu); } #endif /* CONFIG_PERF_EVENTS */ diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c index df3ade14ccbd..73956eaff8a9 100644 --- a/kernel/tracepoint.c +++ b/kernel/tracepoint.c @@ -55,8 +55,8 @@ struct tp_probes { static inline void *allocate_probes(int count) { - struct tp_probes *p = kmalloc(count * sizeof(struct tracepoint_func) - + sizeof(struct tp_probes), GFP_KERNEL); + struct tp_probes *p = kmalloc(struct_size(p, probes, count), + GFP_KERNEL); return p == NULL ? NULL : p->probes; } diff --git a/mm/maccess.c b/mm/maccess.c index 482d4d670f19..d065736f6b87 100644 --- a/mm/maccess.c +++ b/mm/maccess.c @@ -6,8 +6,20 @@ #include <linux/mm.h> #include <linux/uaccess.h> +static __always_inline long +probe_read_common(void *dst, const void __user *src, size_t size) +{ + long ret; + + pagefault_disable(); + ret = __copy_from_user_inatomic(dst, src, size); + pagefault_enable(); + + return ret ? -EFAULT : 0; +} + /** - * probe_kernel_read(): safely attempt to read from a location + * probe_kernel_read(): safely attempt to read from a kernel-space location * @dst: pointer to the buffer that shall take the data * @src: address to read from * @size: size of the data chunk @@ -30,17 +42,41 @@ long __probe_kernel_read(void *dst, const void *src, size_t size) mm_segment_t old_fs = get_fs(); set_fs(KERNEL_DS); - pagefault_disable(); - ret = __copy_from_user_inatomic(dst, - (__force const void __user *)src, size); - pagefault_enable(); + ret = probe_read_common(dst, (__force const void __user *)src, size); set_fs(old_fs); - return ret ? -EFAULT : 0; + return ret; } EXPORT_SYMBOL_GPL(probe_kernel_read); /** + * probe_user_read(): safely attempt to read from a user-space location + * @dst: pointer to the buffer that shall take the data + * @src: address to read from. This must be a user address. + * @size: size of the data chunk + * + * Safely read from user address @src to the buffer at @dst. If a kernel fault + * happens, handle that and return -EFAULT. + */ + +long __weak probe_user_read(void *dst, const void __user *src, size_t size) + __attribute__((alias("__probe_user_read"))); + +long __probe_user_read(void *dst, const void __user *src, size_t size) +{ + long ret = -EFAULT; + mm_segment_t old_fs = get_fs(); + + set_fs(USER_DS); + if (access_ok(src, size)) + ret = probe_read_common(dst, src, size); + set_fs(old_fs); + + return ret; +} +EXPORT_SYMBOL_GPL(probe_user_read); + +/** * probe_kernel_write(): safely attempt to write to a location * @dst: address to write to * @src: pointer to the data that shall be written @@ -67,6 +103,7 @@ long __probe_kernel_write(void *dst, const void *src, size_t size) } EXPORT_SYMBOL_GPL(probe_kernel_write); + /** * strncpy_from_unsafe: - Copy a NUL terminated string from unsafe address. * @dst: Destination address, in kernel space. This buffer must be at @@ -106,3 +143,76 @@ long strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count) return ret ? 
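/*
 * A minimal usage sketch for the probe_user_read() helper added above
 * (variable names illustrative only). Unlike probe_kernel_read(),
 * which runs under KERNEL_DS, it switches to USER_DS and validates
 * the range with access_ok() before copying:
 *
 *	u32 val;
 *
 *	if (!probe_user_read(&val, user_ptr, sizeof(val)))
 *		pr_debug("read %u from user memory\n", val);
 */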
-EFAULT : src - unsafe_addr; } + +/** + * strncpy_from_unsafe_user: - Copy a NUL terminated string from unsafe user + * address. + * @dst: Destination address, in kernel space. This buffer must be at + * least @count bytes long. + * @unsafe_addr: Unsafe user address. + * @count: Maximum number of bytes to copy, including the trailing NUL. + * + * Copies a NUL-terminated string from unsafe user address to kernel buffer. + * + * On success, returns the length of the string INCLUDING the trailing NUL. + * + * If access fails, returns -EFAULT (some data may have been copied + * and the trailing NUL added). + * + * If @count is smaller than the length of the string, copies @count-1 bytes, + * sets the last byte of @dst buffer to NUL and returns @count. + */ +long strncpy_from_unsafe_user(char *dst, const void __user *unsafe_addr, + long count) +{ + mm_segment_t old_fs = get_fs(); + long ret; + + if (unlikely(count <= 0)) + return 0; + + set_fs(USER_DS); + pagefault_disable(); + ret = strncpy_from_user(dst, unsafe_addr, count); + pagefault_enable(); + set_fs(old_fs); + + if (ret >= count) { + ret = count; + dst[ret - 1] = '\0'; + } else if (ret > 0) { + ret++; + } + + return ret; +} + +/** + * strnlen_unsafe_user: - Get the size of a user string INCLUDING final NUL. + * @unsafe_addr: The string to measure. + * @count: Maximum count (including NUL) + * + * Get the size of a NUL-terminated string in user space without pagefault. + * + * Returns the size of the string INCLUDING the terminating NUL. + * + * If the string is too long, returns a number larger than @count. User + * has to check the return value against "> count". + * On exception (or invalid count), returns 0. + * + * Unlike strnlen_user, this can be used from IRQ handler etc. because + * it disables pagefaults. 
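+ *
+ * Within this series the caller is fetch_store_strlen_user(), which
+ * passes MAX_STRING_SIZE as @count when sizing a user string fetch.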
+ */ +long strnlen_unsafe_user(const void __user *unsafe_addr, long count) +{ + mm_segment_t old_fs = get_fs(); + int ret; + + set_fs(USER_DS); + pagefault_disable(); + ret = strnlen_user(unsafe_addr, count); + pagefault_enable(); + set_fs(old_fs); + + return ret; +} diff --git a/net/ceph/Makefile b/net/ceph/Makefile index db09defe27d0..59d0ba2072de 100644 --- a/net/ceph/Makefile +++ b/net/ceph/Makefile @@ -5,7 +5,7 @@ obj-$(CONFIG_CEPH_LIB) += libceph.o libceph-y := ceph_common.o messenger.o msgpool.o buffer.o pagelist.o \ - mon_client.o \ + mon_client.o decode.o \ cls_lock_client.o \ osd_client.o osdmap.o crush/crush.o crush/mapper.o crush/hash.o \ striper.o \ diff --git a/net/ceph/cls_lock_client.c b/net/ceph/cls_lock_client.c index 4cc28541281b..17447c19d937 100644 --- a/net/ceph/cls_lock_client.c +++ b/net/ceph/cls_lock_client.c @@ -6,6 +6,7 @@ #include <linux/ceph/cls_lock_client.h> #include <linux/ceph/decode.h> +#include <linux/ceph/libceph.h> /** * ceph_cls_lock - grab rados lock for object @@ -264,8 +265,11 @@ static int decode_locker(void **p, void *end, struct ceph_locker *locker) return ret; *p += sizeof(struct ceph_timespec); /* skip expiration */ - ceph_decode_copy(p, &locker->info.addr, sizeof(locker->info.addr)); - ceph_decode_addr(&locker->info.addr); + + ret = ceph_decode_entity_addr(p, end, &locker->info.addr); + if (ret) + return ret; + len = ceph_decode_32(p); *p += len; /* skip description */ @@ -360,7 +364,7 @@ int ceph_cls_lock_info(struct ceph_osd_client *osdc, dout("%s lock_name %s\n", __func__, lock_name); ret = ceph_osdc_call(osdc, oid, oloc, "lock", "get_info", CEPH_OSD_FLAG_READ, get_info_op_page, - get_info_op_buf_size, reply_page, &reply_len); + get_info_op_buf_size, &reply_page, &reply_len); dout("%s: status %d\n", __func__, ret); if (ret >= 0) { @@ -375,3 +379,47 @@ int ceph_cls_lock_info(struct ceph_osd_client *osdc, return ret; } EXPORT_SYMBOL(ceph_cls_lock_info); + +int ceph_cls_assert_locked(struct ceph_osd_request *req, int which, + char *lock_name, u8 type, char *cookie, char *tag) +{ + int assert_op_buf_size; + int name_len = strlen(lock_name); + int cookie_len = strlen(cookie); + int tag_len = strlen(tag); + struct page **pages; + void *p, *end; + int ret; + + assert_op_buf_size = name_len + sizeof(__le32) + + cookie_len + sizeof(__le32) + + tag_len + sizeof(__le32) + + sizeof(u8) + CEPH_ENCODING_START_BLK_LEN; + if (assert_op_buf_size > PAGE_SIZE) + return -E2BIG; + + ret = osd_req_op_cls_init(req, which, "lock", "assert_locked"); + if (ret) + return ret; + + pages = ceph_alloc_page_vector(1, GFP_NOIO); + if (IS_ERR(pages)) + return PTR_ERR(pages); + + p = page_address(pages[0]); + end = p + assert_op_buf_size; + + /* encode cls_lock_assert_op struct */ + ceph_start_encoding(&p, 1, 1, + assert_op_buf_size - CEPH_ENCODING_START_BLK_LEN); + ceph_encode_string(&p, end, lock_name, name_len); + ceph_encode_8(&p, type); + ceph_encode_string(&p, end, cookie, cookie_len); + ceph_encode_string(&p, end, tag, tag_len); + WARN_ON(p != end); + + osd_req_op_cls_request_data_pages(req, which, pages, assert_op_buf_size, + 0, false, true); + return 0; +} +EXPORT_SYMBOL(ceph_cls_assert_locked); diff --git a/net/ceph/decode.c b/net/ceph/decode.c new file mode 100644 index 000000000000..eea529595a7a --- /dev/null +++ b/net/ceph/decode.c @@ -0,0 +1,84 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include <linux/ceph/decode.h> + +static int +ceph_decode_entity_addr_versioned(void **p, void *end, + struct ceph_entity_addr *addr) +{ + int ret; + u8 struct_v; + u32 
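/*
 * Decoding note: ceph_decode_entity_addr() below dispatches on a
 * one-byte marker -- 1 selects this versioned encoding with its
 * struct_v/struct_len envelope, 0 selects the legacy fixed layout --
 * so both old and new peers can be parsed.
 */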
struct_len, addr_len; + void *struct_end; + + ret = ceph_start_decoding(p, end, 1, "entity_addr_t", &struct_v, + &struct_len); + if (ret) + goto bad; + + ret = -EINVAL; + struct_end = *p + struct_len; + + ceph_decode_copy_safe(p, end, &addr->type, sizeof(addr->type), bad); + + ceph_decode_copy_safe(p, end, &addr->nonce, sizeof(addr->nonce), bad); + + ceph_decode_32_safe(p, end, addr_len, bad); + if (addr_len > sizeof(addr->in_addr)) + goto bad; + + memset(&addr->in_addr, 0, sizeof(addr->in_addr)); + if (addr_len) { + ceph_decode_copy_safe(p, end, &addr->in_addr, addr_len, bad); + + addr->in_addr.ss_family = + le16_to_cpu((__force __le16)addr->in_addr.ss_family); + } + + /* Advance past anything the client doesn't yet understand */ + *p = struct_end; + ret = 0; +bad: + return ret; +} + +static int +ceph_decode_entity_addr_legacy(void **p, void *end, + struct ceph_entity_addr *addr) +{ + int ret = -EINVAL; + + /* Skip rest of type field */ + ceph_decode_skip_n(p, end, 3, bad); + + /* + * Clients that don't support ADDR2 always send TYPE_NONE, change it + * to TYPE_LEGACY for forward compatibility. + */ + addr->type = CEPH_ENTITY_ADDR_TYPE_LEGACY; + ceph_decode_copy_safe(p, end, &addr->nonce, sizeof(addr->nonce), bad); + memset(&addr->in_addr, 0, sizeof(addr->in_addr)); + ceph_decode_copy_safe(p, end, &addr->in_addr, + sizeof(addr->in_addr), bad); + addr->in_addr.ss_family = + be16_to_cpu((__force __be16)addr->in_addr.ss_family); + ret = 0; +bad: + return ret; +} + +int +ceph_decode_entity_addr(void **p, void *end, struct ceph_entity_addr *addr) +{ + u8 marker; + + ceph_decode_8_safe(p, end, marker, bad); + if (marker == 1) + return ceph_decode_entity_addr_versioned(p, end, addr); + else if (marker == 0) + return ceph_decode_entity_addr_legacy(p, end, addr); +bad: + return -EINVAL; +} +EXPORT_SYMBOL(ceph_decode_entity_addr); + diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index a33402c99321..962f521c863e 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -199,12 +199,14 @@ const char *ceph_pr_addr(const struct ceph_entity_addr *addr) switch (ss.ss_family) { case AF_INET: - snprintf(s, MAX_ADDR_STR_LEN, "%pI4:%hu", &in4->sin_addr, + snprintf(s, MAX_ADDR_STR_LEN, "(%d)%pI4:%hu", + le32_to_cpu(addr->type), &in4->sin_addr, ntohs(in4->sin_port)); break; case AF_INET6: - snprintf(s, MAX_ADDR_STR_LEN, "[%pI6c]:%hu", &in6->sin6_addr, + snprintf(s, MAX_ADDR_STR_LEN, "(%d)[%pI6c]:%hu", + le32_to_cpu(addr->type), &in6->sin6_addr, ntohs(in6->sin6_port)); break; @@ -220,7 +222,7 @@ EXPORT_SYMBOL(ceph_pr_addr); static void encode_my_addr(struct ceph_messenger *msgr) { memcpy(&msgr->my_enc_addr, &msgr->inst.addr, sizeof(msgr->my_enc_addr)); - ceph_encode_addr(&msgr->my_enc_addr); + ceph_encode_banner_addr(&msgr->my_enc_addr); } /* @@ -1732,12 +1734,14 @@ static int read_partial_banner(struct ceph_connection *con) ret = read_partial(con, end, size, &con->actual_peer_addr); if (ret <= 0) goto out; + ceph_decode_banner_addr(&con->actual_peer_addr); size = sizeof (con->peer_addr_for_me); end += size; ret = read_partial(con, end, size, &con->peer_addr_for_me); if (ret <= 0) goto out; + ceph_decode_banner_addr(&con->peer_addr_for_me); out: return ret; @@ -1981,6 +1985,7 @@ int ceph_parse_ips(const char *c, const char *end, } addr_set_port(&addr[i], port); + addr[i].type = CEPH_ENTITY_ADDR_TYPE_LEGACY; dout("parse_ips got %s\n", ceph_pr_addr(&addr[i])); @@ -2011,9 +2016,6 @@ static int process_banner(struct ceph_connection *con) if (verify_hello(con) < 0) return -1; - 
ceph_decode_addr(&con->actual_peer_addr); - ceph_decode_addr(&con->peer_addr_for_me); - /* * Make sure the other end is who we wanted. note that the other * end may not yet know their ip address, so if it's 0.0.0.0, give diff --git a/net/ceph/mon_client.c b/net/ceph/mon_client.c index 895679d3529b..0520bf9825aa 100644 --- a/net/ceph/mon_client.c +++ b/net/ceph/mon_client.c @@ -39,7 +39,7 @@ static int __validate_auth(struct ceph_mon_client *monc); /* * Decode a monmap blob (e.g., during mount). */ -struct ceph_monmap *ceph_monmap_decode(void *p, void *end) +static struct ceph_monmap *ceph_monmap_decode(void *p, void *end) { struct ceph_monmap *m = NULL; int i, err = -EINVAL; @@ -50,7 +50,7 @@ struct ceph_monmap *ceph_monmap_decode(void *p, void *end) ceph_decode_32_safe(&p, end, len, bad); ceph_decode_need(&p, end, len, bad); - dout("monmap_decode %p %p len %d\n", p, end, (int)(end-p)); + dout("monmap_decode %p %p len %d (%d)\n", p, end, len, (int)(end-p)); p += sizeof(u16); /* skip version */ ceph_decode_need(&p, end, sizeof(fsid) + 2*sizeof(u32), bad); @@ -58,7 +58,6 @@ struct ceph_monmap *ceph_monmap_decode(void *p, void *end) epoch = ceph_decode_32(&p); num_mon = ceph_decode_32(&p); - ceph_decode_need(&p, end, num_mon*sizeof(m->mon_inst[0]), bad); if (num_mon > CEPH_MAX_MON) goto bad; @@ -68,17 +67,22 @@ struct ceph_monmap *ceph_monmap_decode(void *p, void *end) m->fsid = fsid; m->epoch = epoch; m->num_mon = num_mon; - ceph_decode_copy(&p, m->mon_inst, num_mon*sizeof(m->mon_inst[0])); - for (i = 0; i < num_mon; i++) - ceph_decode_addr(&m->mon_inst[i].addr); - + for (i = 0; i < num_mon; ++i) { + struct ceph_entity_inst *inst = &m->mon_inst[i]; + + /* copy name portion */ + ceph_decode_copy_safe(&p, end, &inst->name, + sizeof(inst->name), bad); + err = ceph_decode_entity_addr(&p, end, &inst->addr); + if (err) + goto bad; + } dout("monmap_decode epoch %d, num_mon %d\n", m->epoch, m->num_mon); for (i = 0; i < m->num_mon; i++) dout("monmap_decode mon%d is %s\n", i, ceph_pr_addr(&m->mon_inst[i].addr)); return m; - bad: dout("monmap_decode failed with %d\n", err); kfree(m); @@ -469,6 +473,7 @@ static void ceph_monc_handle_map(struct ceph_mon_client *monc, if (IS_ERR(monmap)) { pr_err("problem decoding monmap, %d\n", (int)PTR_ERR(monmap)); + ceph_msg_dump(msg); goto out; } diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index 9a8eca5eda65..0b2df09b2554 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -171,14 +171,6 @@ static void ceph_osd_data_bvecs_init(struct ceph_osd_data *osd_data, osd_data->num_bvecs = num_bvecs; } -#define osd_req_op_data(oreq, whch, typ, fld) \ -({ \ - struct ceph_osd_request *__oreq = (oreq); \ - unsigned int __whch = (whch); \ - BUG_ON(__whch >= __oreq->r_num_ops); \ - &__oreq->r_ops[__whch].typ.fld; \ -}) - static struct ceph_osd_data * osd_req_op_raw_data_in(struct ceph_osd_request *osd_req, unsigned int which) { @@ -478,7 +470,7 @@ static void request_release_checks(struct ceph_osd_request *req) { WARN_ON(!RB_EMPTY_NODE(&req->r_node)); WARN_ON(!RB_EMPTY_NODE(&req->r_mc_node)); - WARN_ON(!list_empty(&req->r_unsafe_item)); + WARN_ON(!list_empty(&req->r_private_item)); WARN_ON(req->r_osd); } @@ -538,7 +530,7 @@ static void request_init(struct ceph_osd_request *req) init_completion(&req->r_completion); RB_CLEAR_NODE(&req->r_node); RB_CLEAR_NODE(&req->r_mc_node); - INIT_LIST_HEAD(&req->r_unsafe_item); + INIT_LIST_HEAD(&req->r_private_item); target_init(&req->r_t); } @@ -4914,20 +4906,26 @@ static int decode_watcher(void **p, void *end, struct 
ceph_watch_item *item) ret = ceph_start_decoding(p, end, 2, "watch_item_t", &struct_v, &struct_len); if (ret) - return ret; + goto bad; + + ret = -EINVAL; + ceph_decode_copy_safe(p, end, &item->name, sizeof(item->name), bad); + ceph_decode_64_safe(p, end, item->cookie, bad); + ceph_decode_skip_32(p, end, bad); /* skip timeout seconds */ - ceph_decode_copy(p, &item->name, sizeof(item->name)); - item->cookie = ceph_decode_64(p); - *p += 4; /* skip timeout_seconds */ if (struct_v >= 2) { - ceph_decode_copy(p, &item->addr, sizeof(item->addr)); - ceph_decode_addr(&item->addr); + ret = ceph_decode_entity_addr(p, end, &item->addr); + if (ret) + goto bad; + } else { + ret = 0; } dout("%s %s%llu cookie %llu addr %s\n", __func__, ENTITY_NAME(item->name), item->cookie, ceph_pr_addr(&item->addr)); - return 0; +bad: + return ret; } static int decode_watchers(void **p, void *end, @@ -5044,12 +5042,12 @@ int ceph_osdc_call(struct ceph_osd_client *osdc, const char *class, const char *method, unsigned int flags, struct page *req_page, size_t req_len, - struct page *resp_page, size_t *resp_len) + struct page **resp_pages, size_t *resp_len) { struct ceph_osd_request *req; int ret; - if (req_len > PAGE_SIZE || (resp_page && *resp_len > PAGE_SIZE)) + if (req_len > PAGE_SIZE) return -E2BIG; req = ceph_osdc_alloc_request(osdc, NULL, 1, false, GFP_NOIO); @@ -5067,8 +5065,8 @@ int ceph_osdc_call(struct ceph_osd_client *osdc, if (req_page) osd_req_op_cls_request_data_pages(req, 0, &req_page, req_len, 0, false, false); - if (resp_page) - osd_req_op_cls_response_data_pages(req, 0, &resp_page, + if (resp_pages) + osd_req_op_cls_response_data_pages(req, 0, resp_pages, *resp_len, 0, false, false); ret = ceph_osdc_alloc_messages(req, GFP_NOIO); @@ -5079,7 +5077,7 @@ int ceph_osdc_call(struct ceph_osd_client *osdc, ret = ceph_osdc_wait_request(osdc, req); if (ret >= 0) { ret = req->r_ops[0].rval; - if (resp_page) + if (resp_pages) *resp_len = req->r_ops[0].outdata_len; } diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c index 48a31dc9161c..90437906b7bc 100644 --- a/net/ceph/osdmap.c +++ b/net/ceph/osdmap.c @@ -1489,11 +1489,9 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map) /* osd_state, osd_weight, osd_addrs->client_addr */ ceph_decode_need(p, end, 3*sizeof(u32) + - map->max_osd*((struct_v >= 5 ? sizeof(u32) : - sizeof(u8)) + - sizeof(*map->osd_weight) + - sizeof(*map->osd_addr)), e_inval); - + map->max_osd*(struct_v >= 5 ? 
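/*
 * osd_addr entries are deliberately dropped from this bulk length
 * check: they are no longer fixed-size copies but are decoded one at
 * a time below with ceph_decode_entity_addr(), which bounds-checks
 * against @end itself.
 */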
sizeof(u32) : + sizeof(u8)) + + sizeof(*map->osd_weight), e_inval); if (ceph_decode_32(p) != map->max_osd) goto e_inval; @@ -1514,9 +1512,11 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map) if (ceph_decode_32(p) != map->max_osd) goto e_inval; - ceph_decode_copy(p, map->osd_addr, map->max_osd*sizeof(*map->osd_addr)); - for (i = 0; i < map->max_osd; i++) - ceph_decode_addr(&map->osd_addr[i]); + for (i = 0; i < map->max_osd; i++) { + err = ceph_decode_entity_addr(p, end, &map->osd_addr[i]); + if (err) + goto bad; + } /* pg_temp */ err = decode_pg_temp(p, end, map); @@ -1618,12 +1618,17 @@ static int decode_new_up_state_weight(void **p, void *end, u8 struct_v, void *new_state; void *new_weight_end; u32 len; + int i; new_up_client = *p; ceph_decode_32_safe(p, end, len, e_inval); - len *= sizeof(u32) + sizeof(struct ceph_entity_addr); - ceph_decode_need(p, end, len, e_inval); - *p += len; + for (i = 0; i < len; ++i) { + struct ceph_entity_addr addr; + + ceph_decode_skip_32(p, end, e_inval); + if (ceph_decode_entity_addr(p, end, &addr)) + goto e_inval; + } new_state = *p; ceph_decode_32_safe(p, end, len, e_inval); @@ -1699,9 +1704,9 @@ static int decode_new_up_state_weight(void **p, void *end, u8 struct_v, struct ceph_entity_addr addr; osd = ceph_decode_32(p); - ceph_decode_copy(p, &addr, sizeof(addr)); - ceph_decode_addr(&addr); BUG_ON(osd >= map->max_osd); + if (ceph_decode_entity_addr(p, end, &addr)) + goto e_inval; pr_info("osd%d up\n", osd); map->osd_state[osd] |= CEPH_OSD_EXISTS | CEPH_OSD_UP; map->osd_addr[osd] = addr; diff --git a/net/ceph/pagevec.c b/net/ceph/pagevec.c index 74cafc0142ea..64305e7056a1 100644 --- a/net/ceph/pagevec.c +++ b/net/ceph/pagevec.c @@ -10,39 +10,6 @@ #include <linux/ceph/libceph.h> -/* - * build a vector of user pages - */ -struct page **ceph_get_direct_page_vector(const void __user *data, - int num_pages, bool write_page) -{ - struct page **pages; - int got = 0; - int rc = 0; - - pages = kmalloc_array(num_pages, sizeof(*pages), GFP_NOFS); - if (!pages) - return ERR_PTR(-ENOMEM); - - while (got < num_pages) { - rc = get_user_pages_fast( - (unsigned long)data + ((unsigned long)got * PAGE_SIZE), - num_pages - got, write_page ? 
FOLL_WRITE : 0, pages + got); - if (rc < 0) - break; - BUG_ON(rc == 0); - got += rc; - } - if (rc < 0) - goto fail; - return pages; - -fail: - ceph_put_page_vector(pages, got, false); - return ERR_PTR(rc); -} -EXPORT_SYMBOL(ceph_get_direct_page_vector); - void ceph_put_page_vector(struct page **pages, int num_pages, bool dirty) { int i; diff --git a/net/ceph/striper.c b/net/ceph/striper.c index c36462dc86b7..3b3fa75d1189 100644 --- a/net/ceph/striper.c +++ b/net/ceph/striper.c @@ -259,3 +259,20 @@ int ceph_extent_to_file(struct ceph_file_layout *l, return 0; } EXPORT_SYMBOL(ceph_extent_to_file); + +u64 ceph_get_num_objects(struct ceph_file_layout *l, u64 size) +{ + u64 period = (u64)l->stripe_count * l->object_size; + u64 num_periods = DIV64_U64_ROUND_UP(size, period); + u64 remainder_bytes; + u64 remainder_objs = 0; + + div64_u64_rem(size, period, &remainder_bytes); + if (remainder_bytes > 0 && + remainder_bytes < (u64)l->stripe_count * l->stripe_unit) + remainder_objs = l->stripe_count - + DIV_ROUND_UP_ULL(remainder_bytes, l->stripe_unit); + + return num_periods * l->stripe_count - remainder_objs; +} +EXPORT_SYMBOL(ceph_get_num_objects); diff --git a/net/sunrpc/Kconfig b/net/sunrpc/Kconfig index aa307505ca54..3bcf985507be 100644 --- a/net/sunrpc/Kconfig +++ b/net/sunrpc/Kconfig @@ -35,7 +35,7 @@ config RPCSEC_GSS_KRB5 If unsure, say Y. -config CONFIG_SUNRPC_DISABLE_INSECURE_ENCTYPES +config SUNRPC_DISABLE_INSECURE_ENCTYPES bool "Secure RPC: Disable insecure Kerberos encryption types" depends on RPCSEC_GSS_KRB5 default n diff --git a/net/sunrpc/backchannel_rqst.c b/net/sunrpc/backchannel_rqst.c index c47d82622fd1..339e8c077c2d 100644 --- a/net/sunrpc/backchannel_rqst.c +++ b/net/sunrpc/backchannel_rqst.c @@ -31,25 +31,20 @@ SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. #define RPCDBG_FACILITY RPCDBG_TRANS #endif +#define BC_MAX_SLOTS 64U + +unsigned int xprt_bc_max_slots(struct rpc_xprt *xprt) +{ + return BC_MAX_SLOTS; +} + /* * Helper routines that track the number of preallocation elements * on the transport. */ static inline int xprt_need_to_requeue(struct rpc_xprt *xprt) { - return xprt->bc_alloc_count < atomic_read(&xprt->bc_free_slots); -} - -static inline void xprt_inc_alloc_count(struct rpc_xprt *xprt, unsigned int n) -{ - atomic_add(n, &xprt->bc_free_slots); - xprt->bc_alloc_count += n; -} - -static inline int xprt_dec_alloc_count(struct rpc_xprt *xprt, unsigned int n) -{ - atomic_sub(n, &xprt->bc_free_slots); - return xprt->bc_alloc_count -= n; + return xprt->bc_alloc_count < xprt->bc_alloc_max; } /* @@ -145,6 +140,9 @@ int xprt_setup_bc(struct rpc_xprt *xprt, unsigned int min_reqs) dprintk("RPC: setup backchannel transport\n"); + if (min_reqs > BC_MAX_SLOTS) + min_reqs = BC_MAX_SLOTS; + /* * We use a temporary list to keep track of the preallocated * buffers. 
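The libceph decoding changes above (ceph_monmap_decode(), decode_watcher(), osdmap_decode(), decode_new_up_state_weight()) all apply one pattern: unchecked ceph_decode_copy() calls and pointer arithmetic are replaced with the *_safe helpers plus the new ceph_decode_entity_addr(), which validates every read against the end of the buffer and copes with entity addresses that are no longer a single fixed-size blob on the wire. Because the addresses are variable length, a list of them cannot be skipped arithmetically; each entry must be decoded. A minimal sketch of the pattern, assuming the helpers from include/linux/ceph/decode.h:

    /* Sketch: every read is bounds-checked against "end"; a short
     * or malformed buffer takes the error label instead of
     * overrunning. */
    u32 n;

    ceph_decode_32_safe(p, end, n, e_inval);
    while (n--) {
        struct ceph_entity_addr addr;

        ceph_decode_skip_32(p, end, e_inval);   /* per-entry id */
        if (ceph_decode_entity_addr(p, end, &addr))
            goto e_inval;
    }

In the same spirit, ceph_osdc_call() now takes a page vector for the reply, lifting the old single-page cap on class-method responses (only the request side keeps the PAGE_SIZE limit).

ceph_get_num_objects() computes how many RADOS objects a striped file of a given size touches. A worked example with illustrative layout values, stripe_unit = 1 MiB, stripe_count = 4, object_size = 4 MiB: one period covers 16 MiB across a set of 4 objects. For size = 18 MiB, num_periods = ceil(18/16) = 2 and remainder_bytes = 2 MiB, which is less than stripe_count * stripe_unit = 4 MiB, so the tail reaches only the first ceil(2/1) = 2 objects of the final set: remainder_objs = 4 - 2 = 2, giving 2 * 4 - 2 = 6 objects. A self-contained userspace mirror of the arithmetic, with plain 64-bit division standing in for the div64 helpers:

    #include <stdint.h>
    #include <stdio.h>

    static uint64_t num_objects(uint64_t stripe_unit, uint64_t stripe_count,
                                uint64_t object_size, uint64_t size)
    {
        uint64_t period = stripe_count * object_size;
        uint64_t num_periods = (size + period - 1) / period;  /* round up */
        uint64_t rem = size % period;
        uint64_t rem_objs = 0;

        if (rem > 0 && rem < stripe_count * stripe_unit)
            rem_objs = stripe_count - (rem + stripe_unit - 1) / stripe_unit;
        return num_periods * stripe_count - rem_objs;
    }

    int main(void)
    {
        uint64_t mib = 1ULL << 20;

        printf("%llu\n", (unsigned long long)
               num_objects(mib, 4, 4 * mib, 18 * mib));  /* prints 6 */
        return 0;
    }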
Once we're done building the list we splice it @@ -172,7 +170,9 @@ int xprt_setup_bc(struct rpc_xprt *xprt, unsigned int min_reqs) */ spin_lock(&xprt->bc_pa_lock); list_splice(&tmp_list, &xprt->bc_pa_list); - xprt_inc_alloc_count(xprt, min_reqs); + xprt->bc_alloc_count += min_reqs; + xprt->bc_alloc_max += min_reqs; + atomic_add(min_reqs, &xprt->bc_slot_count); spin_unlock(&xprt->bc_pa_lock); dprintk("RPC: setup backchannel transport done\n"); @@ -220,11 +220,13 @@ void xprt_destroy_bc(struct rpc_xprt *xprt, unsigned int max_reqs) goto out; spin_lock_bh(&xprt->bc_pa_lock); - xprt_dec_alloc_count(xprt, max_reqs); + xprt->bc_alloc_max -= max_reqs; list_for_each_entry_safe(req, tmp, &xprt->bc_pa_list, rq_bc_pa_list) { dprintk("RPC: req=%p\n", req); list_del(&req->rq_bc_pa_list); xprt_free_allocation(req); + xprt->bc_alloc_count--; + atomic_dec(&xprt->bc_slot_count); if (--max_reqs == 0) break; } @@ -241,13 +243,14 @@ static struct rpc_rqst *xprt_get_bc_request(struct rpc_xprt *xprt, __be32 xid, struct rpc_rqst *req = NULL; dprintk("RPC: allocate a backchannel request\n"); - if (atomic_read(&xprt->bc_free_slots) <= 0) - goto not_found; if (list_empty(&xprt->bc_pa_list)) { if (!new) goto not_found; + if (atomic_read(&xprt->bc_slot_count) >= BC_MAX_SLOTS) + goto not_found; list_add_tail(&new->rq_bc_pa_list, &xprt->bc_pa_list); xprt->bc_alloc_count++; + atomic_inc(&xprt->bc_slot_count); } req = list_first_entry(&xprt->bc_pa_list, struct rpc_rqst, rq_bc_pa_list); @@ -291,6 +294,7 @@ void xprt_free_bc_rqst(struct rpc_rqst *req) if (xprt_need_to_requeue(xprt)) { list_add_tail(&req->rq_bc_pa_list, &xprt->bc_pa_list); xprt->bc_alloc_count++; + atomic_inc(&xprt->bc_slot_count); req = NULL; } spin_unlock_bh(&xprt->bc_pa_lock); @@ -357,7 +361,7 @@ void xprt_complete_bc_request(struct rpc_rqst *req, uint32_t copied) spin_lock(&xprt->bc_pa_lock); list_del(&req->rq_bc_pa_list); - xprt_dec_alloc_count(xprt, 1); + xprt->bc_alloc_count--; spin_unlock(&xprt->bc_pa_lock); req->rq_private_buf.len = copied; diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c index b03bfa055c08..d8679b6027e9 100644 --- a/net/sunrpc/clnt.c +++ b/net/sunrpc/clnt.c @@ -528,6 +528,8 @@ struct rpc_clnt *rpc_create(struct rpc_create_args *args) .bc_xprt = args->bc_xprt, }; char servername[48]; + struct rpc_clnt *clnt; + int i; if (args->bc_xprt) { WARN_ON_ONCE(!(args->protocol & XPRT_TRANSPORT_BC)); @@ -590,7 +592,15 @@ struct rpc_clnt *rpc_create(struct rpc_create_args *args) if (args->flags & RPC_CLNT_CREATE_NONPRIVPORT) xprt->resvport = 0; - return rpc_create_xprt(args, xprt); + clnt = rpc_create_xprt(args, xprt); + if (IS_ERR(clnt) || args->nconnect <= 1) + return clnt; + + for (i = 0; i < args->nconnect - 1; i++) { + if (rpc_clnt_add_xprt(clnt, &xprtargs, NULL, NULL) < 0) + break; + } + return clnt; } EXPORT_SYMBOL_GPL(rpc_create); @@ -968,13 +978,46 @@ out: } EXPORT_SYMBOL_GPL(rpc_bind_new_program); +struct rpc_xprt * +rpc_task_get_xprt(struct rpc_clnt *clnt, struct rpc_xprt *xprt) +{ + struct rpc_xprt_switch *xps; + + if (!xprt) + return NULL; + rcu_read_lock(); + xps = rcu_dereference(clnt->cl_xpi.xpi_xpswitch); + atomic_long_inc(&xps->xps_queuelen); + rcu_read_unlock(); + atomic_long_inc(&xprt->queuelen); + + return xprt; +} + +static void +rpc_task_release_xprt(struct rpc_clnt *clnt, struct rpc_xprt *xprt) +{ + struct rpc_xprt_switch *xps; + + atomic_long_dec(&xprt->queuelen); + rcu_read_lock(); + xps = rcu_dereference(clnt->cl_xpi.xpi_xpswitch); + atomic_long_dec(&xps->xps_queuelen); + rcu_read_unlock(); + + xprt_put(xprt); +} + 
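rpc_create() grows multi-connection support: once the primary transport is set up, it attaches args->nconnect - 1 additional transports to the client's transport switch via rpc_clnt_add_xprt(). A hedged sketch of a caller (values illustrative, remaining rpc_create_args fields set as usual):

    struct rpc_create_args args = {
        .net      = net,
        .protocol = XPRT_TRANSPORT_TCP,
        .address  = (struct sockaddr *)&srvaddr,
        .addrsize = sizeof(srvaddr),
        .nconnect = 4,   /* one primary plus three extra connections */
        /* program, version, authflavor, servername as before */
    };
    struct rpc_clnt *clnt = rpc_create(&args);

Requests are spread across those connections by the round-robin iterator reworked in xprtmultipath.c below; the rpc_task_get_xprt()/rpc_task_release_xprt() pair above maintains the per-transport and switch-wide queue lengths the picker consults. The comparison avoids a division: a candidate passes when xprt_queuelen * nactive <= xps_queuelen, i.e. when its queue is at or below the mean. With 3 active transports and 12 queued tasks in total, a transport with 5 queued fails (15 > 12) and the search moves on, while one with 4 queued passes.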
void rpc_task_release_transport(struct rpc_task *task) { struct rpc_xprt *xprt = task->tk_xprt; if (xprt) { task->tk_xprt = NULL; - xprt_put(xprt); + if (task->tk_client) + rpc_task_release_xprt(task->tk_client, xprt); + else + xprt_put(xprt); } } EXPORT_SYMBOL_GPL(rpc_task_release_transport); @@ -983,6 +1026,7 @@ void rpc_task_release_client(struct rpc_task *task) { struct rpc_clnt *clnt = task->tk_client; + rpc_task_release_transport(task); if (clnt != NULL) { /* Remove from client task list */ spin_lock(&clnt->cl_lock); @@ -992,14 +1036,34 @@ void rpc_task_release_client(struct rpc_task *task) rpc_release_client(clnt); } - rpc_task_release_transport(task); +} + +static struct rpc_xprt * +rpc_task_get_first_xprt(struct rpc_clnt *clnt) +{ + struct rpc_xprt *xprt; + + rcu_read_lock(); + xprt = xprt_get(rcu_dereference(clnt->cl_xprt)); + rcu_read_unlock(); + return rpc_task_get_xprt(clnt, xprt); +} + +static struct rpc_xprt * +rpc_task_get_next_xprt(struct rpc_clnt *clnt) +{ + return rpc_task_get_xprt(clnt, xprt_iter_get_next(&clnt->cl_xpi)); } static void rpc_task_set_transport(struct rpc_task *task, struct rpc_clnt *clnt) { - if (!task->tk_xprt) - task->tk_xprt = xprt_iter_get_next(&clnt->cl_xpi); + if (task->tk_xprt) + return; + if (task->tk_flags & RPC_TASK_NO_ROUND_ROBIN) + task->tk_xprt = rpc_task_get_first_xprt(clnt); + else + task->tk_xprt = rpc_task_get_next_xprt(clnt); } static @@ -1462,6 +1526,19 @@ size_t rpc_max_bc_payload(struct rpc_clnt *clnt) } EXPORT_SYMBOL_GPL(rpc_max_bc_payload); +unsigned int rpc_num_bc_slots(struct rpc_clnt *clnt) +{ + struct rpc_xprt *xprt; + unsigned int ret; + + rcu_read_lock(); + xprt = rcu_dereference(clnt->cl_xprt); + ret = xprt->ops->bc_num_slots(xprt); + rcu_read_unlock(); + return ret; +} +EXPORT_SYMBOL_GPL(rpc_num_bc_slots); + /** * rpc_force_rebind - force transport to check that remote port is unchanged * @clnt: client to rebind @@ -1788,6 +1865,7 @@ rpc_xdr_encode(struct rpc_task *task) req->rq_snd_buf.head[0].iov_len = 0; xdr_init_encode(&xdr, &req->rq_snd_buf, req->rq_snd_buf.head[0].iov_base, req); + xdr_free_bvec(&req->rq_snd_buf); if (rpc_encode_header(task, &xdr)) return; @@ -1827,8 +1905,6 @@ call_encode(struct rpc_task *task) rpc_call_rpcerror(task, task->tk_status); } return; - } else { - xprt_request_prepare(task->tk_rqstp); } /* Add task to reply queue before transmission to avoid races */ @@ -2696,6 +2772,10 @@ int rpc_clnt_test_and_add_xprt(struct rpc_clnt *clnt, return -ENOMEM; data->xps = xprt_switch_get(xps); data->xprt = xprt_get(xprt); + if (rpc_xprt_switch_has_addr(data->xps, (struct sockaddr *)&xprt->addr)) { + rpc_cb_add_xprt_release(data); + goto success; + } task = rpc_call_null_helper(clnt, xprt, NULL, RPC_TASK_SOFT|RPC_TASK_SOFTCONN|RPC_TASK_ASYNC|RPC_TASK_NULLCREDS, @@ -2703,6 +2783,7 @@ int rpc_clnt_test_and_add_xprt(struct rpc_clnt *clnt, if (IS_ERR(task)) return PTR_ERR(task); rpc_put_task(task); +success: return 1; } EXPORT_SYMBOL_GPL(rpc_clnt_test_and_add_xprt); diff --git a/net/sunrpc/debugfs.c b/net/sunrpc/debugfs.c index 707d7aab1546..fd9bca242724 100644 --- a/net/sunrpc/debugfs.c +++ b/net/sunrpc/debugfs.c @@ -1,5 +1,5 @@ // SPDX-License-Identifier: GPL-2.0 -/** +/* * debugfs interface for sunrpc * * (c) 2014 Jeff Layton <jlayton@primarydata.com> @@ -117,12 +117,37 @@ static const struct file_operations tasks_fops = { .release = tasks_release, }; +static int do_xprt_debugfs(struct rpc_clnt *clnt, struct rpc_xprt *xprt, void *numv) +{ + int len; + char name[24]; /* enough for "../../rpc_xprt/ + 8 hex digits 
+ NULL */ + char link[9]; /* enough for 8 hex digits + NULL */ + int *nump = numv; + + if (IS_ERR_OR_NULL(xprt->debugfs)) + return 0; + len = snprintf(name, sizeof(name), "../../rpc_xprt/%s", + xprt->debugfs->d_name.name); + if (len > sizeof(name)) + return -1; + if (*nump == 0) + strcpy(link, "xprt"); + else { + len = snprintf(link, sizeof(link), "xprt%d", *nump); + if (len > sizeof(link)) + return -1; + } + debugfs_create_symlink(link, clnt->cl_debugfs, name); + (*nump)++; + return 0; +} + void rpc_clnt_debugfs_register(struct rpc_clnt *clnt) { int len; - char name[24]; /* enough for "../../rpc_xprt/ + 8 hex digits + NULL */ - struct rpc_xprt *xprt; + char name[9]; /* enough for 8 hex digits + NULL */ + int xprtnum = 0; len = snprintf(name, sizeof(name), "%x", clnt->cl_clid); if (len >= sizeof(name)) @@ -135,26 +160,7 @@ rpc_clnt_debugfs_register(struct rpc_clnt *clnt) debugfs_create_file("tasks", S_IFREG | 0400, clnt->cl_debugfs, clnt, &tasks_fops); - rcu_read_lock(); - xprt = rcu_dereference(clnt->cl_xprt); - /* no "debugfs" dentry? Don't bother with the symlink. */ - if (IS_ERR_OR_NULL(xprt->debugfs)) { - rcu_read_unlock(); - return; - } - len = snprintf(name, sizeof(name), "../../rpc_xprt/%s", - xprt->debugfs->d_name.name); - rcu_read_unlock(); - - if (len >= sizeof(name)) - goto out_err; - - debugfs_create_symlink("xprt", clnt->cl_debugfs, name); - - return; -out_err: - debugfs_remove_recursive(clnt->cl_debugfs); - clnt->cl_debugfs = NULL; + rpc_clnt_iterate_for_each_xprt(clnt, do_xprt_debugfs, &xprtnum); } void diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c index a2c114812717..1f275aba786f 100644 --- a/net/sunrpc/sched.c +++ b/net/sunrpc/sched.c @@ -23,6 +23,7 @@ #include <linux/sched/mm.h> #include <linux/sunrpc/clnt.h> +#include <linux/sunrpc/metrics.h> #include "sunrpc.h" @@ -46,7 +47,7 @@ static mempool_t *rpc_buffer_mempool __read_mostly; static void rpc_async_schedule(struct work_struct *); static void rpc_release_task(struct rpc_task *task); -static void __rpc_queue_timer_fn(struct timer_list *t); +static void __rpc_queue_timer_fn(struct work_struct *); /* * RPC tasks sit here while waiting for conditions to improve. 
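With a client owning several transports, rpc_clnt_debugfs_register() now emits one symlink per transport via do_xprt_debugfs(): the first keeps the historical name "xprt" so existing tooling still resolves, the rest are numbered. Illustrative layout (IDs hypothetical):

    /sys/kernel/debug/sunrpc/rpc_clnt/3/xprt   -> ../../rpc_xprt/5
    /sys/kernel/debug/sunrpc/rpc_clnt/3/xprt1  -> ../../rpc_xprt/6
    /sys/kernel/debug/sunrpc/rpc_clnt/3/xprt2  -> ../../rpc_xprt/7

The sched.c hunks that follow turn the wait-queue timer into a deferrable delayed work item, with two knock-on effects visible through the rest of the file. First, timers take an absolute jiffies expiry while delayed work takes a relative delay, so rpc_set_queue_timer() below must translate, and the absolute deadline is cached in queue->timer_list.expires so that __rpc_add_timer() can tell whether a new deadline is earlier than the armed one (a comparison timer_reduce() used to make internally). Second, the expiry callback no longer runs in softirq context, which is why the queue's spin_lock_bh()/spin_unlock_bh() pairs can all become plain spin_lock()/spin_unlock(). The translation in sketch form:

    unsigned long now = jiffies;
    unsigned long delay = time_before_eq(expires, now) ? 0 : expires - now;

    mod_delayed_work(rpciod_workqueue, &queue->timer_list.dwork, delay);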
@@ -58,6 +59,7 @@ static struct rpc_wait_queue delay_queue; */ struct workqueue_struct *rpciod_workqueue __read_mostly; struct workqueue_struct *xprtiod_workqueue __read_mostly; +EXPORT_SYMBOL_GPL(xprtiod_workqueue); unsigned long rpc_task_timeout(const struct rpc_task *task) @@ -87,13 +89,19 @@ __rpc_disable_timer(struct rpc_wait_queue *queue, struct rpc_task *task) task->tk_timeout = 0; list_del(&task->u.tk_wait.timer_list); if (list_empty(&queue->timer_list.list)) - del_timer(&queue->timer_list.timer); + cancel_delayed_work(&queue->timer_list.dwork); } static void rpc_set_queue_timer(struct rpc_wait_queue *queue, unsigned long expires) { - timer_reduce(&queue->timer_list.timer, expires); + unsigned long now = jiffies; + queue->timer_list.expires = expires; + if (time_before_eq(expires, now)) + expires = 0; + else + expires -= now; + mod_delayed_work(rpciod_workqueue, &queue->timer_list.dwork, expires); } /* @@ -107,7 +115,8 @@ __rpc_add_timer(struct rpc_wait_queue *queue, struct rpc_task *task, task->tk_pid, jiffies_to_msecs(timeout - jiffies)); task->tk_timeout = timeout; - rpc_set_queue_timer(queue, timeout); + if (list_empty(&queue->timer_list.list) || time_before(timeout, queue->timer_list.expires)) + rpc_set_queue_timer(queue, timeout); list_add(&task->u.tk_wait.timer_list, &queue->timer_list.list); } @@ -250,7 +259,8 @@ static void __rpc_init_priority_wait_queue(struct rpc_wait_queue *queue, const c queue->maxpriority = nr_queues - 1; rpc_reset_waitqueue_priority(queue); queue->qlen = 0; - timer_setup(&queue->timer_list.timer, __rpc_queue_timer_fn, 0); + queue->timer_list.expires = 0; + INIT_DEFERRABLE_WORK(&queue->timer_list.dwork, __rpc_queue_timer_fn); INIT_LIST_HEAD(&queue->timer_list.list); rpc_assign_waitqueue_name(queue, qname); } @@ -269,7 +279,7 @@ EXPORT_SYMBOL_GPL(rpc_init_wait_queue); void rpc_destroy_wait_queue(struct rpc_wait_queue *queue) { - del_timer_sync(&queue->timer_list.timer); + cancel_delayed_work_sync(&queue->timer_list.dwork); } EXPORT_SYMBOL_GPL(rpc_destroy_wait_queue); @@ -424,9 +434,9 @@ void rpc_sleep_on_timeout(struct rpc_wait_queue *q, struct rpc_task *task, /* * Protect the queue operations. */ - spin_lock_bh(&q->lock); + spin_lock(&q->lock); __rpc_sleep_on_priority_timeout(q, task, timeout, task->tk_priority); - spin_unlock_bh(&q->lock); + spin_unlock(&q->lock); } EXPORT_SYMBOL_GPL(rpc_sleep_on_timeout); @@ -442,9 +452,9 @@ void rpc_sleep_on(struct rpc_wait_queue *q, struct rpc_task *task, /* * Protect the queue operations. */ - spin_lock_bh(&q->lock); + spin_lock(&q->lock); __rpc_sleep_on_priority(q, task, task->tk_priority); - spin_unlock_bh(&q->lock); + spin_unlock(&q->lock); } EXPORT_SYMBOL_GPL(rpc_sleep_on); @@ -458,9 +468,9 @@ void rpc_sleep_on_priority_timeout(struct rpc_wait_queue *q, /* * Protect the queue operations. */ - spin_lock_bh(&q->lock); + spin_lock(&q->lock); __rpc_sleep_on_priority_timeout(q, task, timeout, priority); - spin_unlock_bh(&q->lock); + spin_unlock(&q->lock); } EXPORT_SYMBOL_GPL(rpc_sleep_on_priority_timeout); @@ -475,9 +485,9 @@ void rpc_sleep_on_priority(struct rpc_wait_queue *q, struct rpc_task *task, /* * Protect the queue operations. 
*/ - spin_lock_bh(&q->lock); + spin_lock(&q->lock); __rpc_sleep_on_priority(q, task, priority); - spin_unlock_bh(&q->lock); + spin_unlock(&q->lock); } EXPORT_SYMBOL_GPL(rpc_sleep_on_priority); @@ -555,9 +565,9 @@ void rpc_wake_up_queued_task_on_wq(struct workqueue_struct *wq, { if (!RPC_IS_QUEUED(task)) return; - spin_lock_bh(&queue->lock); + spin_lock(&queue->lock); rpc_wake_up_task_on_wq_queue_locked(wq, queue, task); - spin_unlock_bh(&queue->lock); + spin_unlock(&queue->lock); } /* @@ -567,9 +577,9 @@ void rpc_wake_up_queued_task(struct rpc_wait_queue *queue, struct rpc_task *task { if (!RPC_IS_QUEUED(task)) return; - spin_lock_bh(&queue->lock); + spin_lock(&queue->lock); rpc_wake_up_task_queue_locked(queue, task); - spin_unlock_bh(&queue->lock); + spin_unlock(&queue->lock); } EXPORT_SYMBOL_GPL(rpc_wake_up_queued_task); @@ -602,9 +612,9 @@ rpc_wake_up_queued_task_set_status(struct rpc_wait_queue *queue, { if (!RPC_IS_QUEUED(task)) return; - spin_lock_bh(&queue->lock); + spin_lock(&queue->lock); rpc_wake_up_task_queue_set_status_locked(queue, task, status); - spin_unlock_bh(&queue->lock); + spin_unlock(&queue->lock); } /* @@ -667,12 +677,12 @@ struct rpc_task *rpc_wake_up_first_on_wq(struct workqueue_struct *wq, dprintk("RPC: wake_up_first(%p \"%s\")\n", queue, rpc_qname(queue)); - spin_lock_bh(&queue->lock); + spin_lock(&queue->lock); task = __rpc_find_next_queued(queue); if (task != NULL) task = rpc_wake_up_task_on_wq_queue_action_locked(wq, queue, task, func, data); - spin_unlock_bh(&queue->lock); + spin_unlock(&queue->lock); return task; } @@ -711,7 +721,7 @@ void rpc_wake_up(struct rpc_wait_queue *queue) { struct list_head *head; - spin_lock_bh(&queue->lock); + spin_lock(&queue->lock); head = &queue->tasks[queue->maxpriority]; for (;;) { while (!list_empty(head)) { @@ -725,7 +735,7 @@ void rpc_wake_up(struct rpc_wait_queue *queue) break; head--; } - spin_unlock_bh(&queue->lock); + spin_unlock(&queue->lock); } EXPORT_SYMBOL_GPL(rpc_wake_up); @@ -740,7 +750,7 @@ void rpc_wake_up_status(struct rpc_wait_queue *queue, int status) { struct list_head *head; - spin_lock_bh(&queue->lock); + spin_lock(&queue->lock); head = &queue->tasks[queue->maxpriority]; for (;;) { while (!list_empty(head)) { @@ -755,13 +765,15 @@ void rpc_wake_up_status(struct rpc_wait_queue *queue, int status) break; head--; } - spin_unlock_bh(&queue->lock); + spin_unlock(&queue->lock); } EXPORT_SYMBOL_GPL(rpc_wake_up_status); -static void __rpc_queue_timer_fn(struct timer_list *t) +static void __rpc_queue_timer_fn(struct work_struct *work) { - struct rpc_wait_queue *queue = from_timer(queue, t, timer_list.timer); + struct rpc_wait_queue *queue = container_of(work, + struct rpc_wait_queue, + timer_list.dwork.work); struct rpc_task *task, *n; unsigned long expires, now, timeo; @@ -832,6 +844,10 @@ rpc_reset_task_statistics(struct rpc_task *task) void rpc_exit_task(struct rpc_task *task) { task->tk_action = NULL; + if (task->tk_ops->rpc_count_stats) + task->tk_ops->rpc_count_stats(task, task->tk_calldata); + else if (task->tk_client) + rpc_count_iostats(task, task->tk_client->cl_metrics); if (task->tk_ops->rpc_call_done != NULL) { task->tk_ops->rpc_call_done(task, task->tk_calldata); if (task->tk_action != NULL) { @@ -927,13 +943,13 @@ static void __rpc_execute(struct rpc_task *task) * rpc_task pointer may still be dereferenced. 
*/ queue = task->tk_waitqueue; - spin_lock_bh(&queue->lock); + spin_lock(&queue->lock); if (!RPC_IS_QUEUED(task)) { - spin_unlock_bh(&queue->lock); + spin_unlock(&queue->lock); continue; } rpc_clear_running(task); - spin_unlock_bh(&queue->lock); + spin_unlock(&queue->lock); if (task_is_async) return; @@ -1076,7 +1092,8 @@ static void rpc_init_task(struct rpc_task *task, const struct rpc_task_setup *ta /* Initialize workqueue for async tasks */ task->tk_workqueue = task_setup_data->workqueue; - task->tk_xprt = xprt_get(task_setup_data->rpc_xprt); + task->tk_xprt = rpc_task_get_xprt(task_setup_data->rpc_client, + xprt_get(task_setup_data->rpc_xprt)); task->tk_op_cred = get_rpccred(task_setup_data->rpc_op_cred); diff --git a/net/sunrpc/stats.c b/net/sunrpc/stats.c index 2b6dc7e5f74f..7c74197c2ecf 100644 --- a/net/sunrpc/stats.c +++ b/net/sunrpc/stats.c @@ -177,6 +177,8 @@ void rpc_count_iostats_metrics(const struct rpc_task *task, execute = ktime_sub(now, task->tk_start); op_metrics->om_execute = ktime_add(op_metrics->om_execute, execute); + if (task->tk_status < 0) + op_metrics->om_error_status++; spin_unlock(&op_metrics->om_lock); @@ -219,13 +221,14 @@ static void _add_rpc_iostats(struct rpc_iostats *a, struct rpc_iostats *b) a->om_queue = ktime_add(a->om_queue, b->om_queue); a->om_rtt = ktime_add(a->om_rtt, b->om_rtt); a->om_execute = ktime_add(a->om_execute, b->om_execute); + a->om_error_status += b->om_error_status; } static void _print_rpc_iostats(struct seq_file *seq, struct rpc_iostats *stats, int op, const struct rpc_procinfo *procs) { _print_name(seq, op, procs); - seq_printf(seq, "%lu %lu %lu %Lu %Lu %Lu %Lu %Lu\n", + seq_printf(seq, "%lu %lu %lu %llu %llu %llu %llu %llu %lu\n", stats->om_ops, stats->om_ntrans, stats->om_timeouts, @@ -233,12 +236,20 @@ static void _print_rpc_iostats(struct seq_file *seq, struct rpc_iostats *stats, stats->om_bytes_recv, ktime_to_ms(stats->om_queue), ktime_to_ms(stats->om_rtt), - ktime_to_ms(stats->om_execute)); + ktime_to_ms(stats->om_execute), + stats->om_error_status); +} + +static int do_print_stats(struct rpc_clnt *clnt, struct rpc_xprt *xprt, void *seqv) +{ + struct seq_file *seq = seqv; + + xprt->ops->print_stats(xprt, seq); + return 0; } void rpc_clnt_show_stats(struct seq_file *seq, struct rpc_clnt *clnt) { - struct rpc_xprt *xprt; unsigned int op, maxproc = clnt->cl_maxproc; if (!clnt->cl_metrics) @@ -248,11 +259,7 @@ void rpc_clnt_show_stats(struct seq_file *seq, struct rpc_clnt *clnt) seq_printf(seq, "p/v: %u/%u (%s)\n", clnt->cl_prog, clnt->cl_vers, clnt->cl_program->name); - rcu_read_lock(); - xprt = rcu_dereference(clnt->cl_xprt); - if (xprt) - xprt->ops->print_stats(xprt, seq); - rcu_read_unlock(); + rpc_clnt_iterate_for_each_xprt(clnt, do_print_stats, seq); seq_printf(seq, "\tper-op statistics\n"); for (op = 0; op < maxproc; op++) { diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c index e15cb704453e..220b79988000 100644 --- a/net/sunrpc/svc.c +++ b/net/sunrpc/svc.c @@ -1595,7 +1595,7 @@ bc_svc_process(struct svc_serv *serv, struct rpc_rqst *req, /* Parse and execute the bc call */ proc_error = svc_process_common(rqstp, argv, resv); - atomic_inc(&req->rq_xprt->bc_free_slots); + atomic_dec(&req->rq_xprt->bc_slot_count); if (!proc_error) { /* Processing error: drop the request */ xprt_free_bc_request(req); diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c index f6c82b1651e7..783748dc5e6f 100644 --- a/net/sunrpc/xprt.c +++ b/net/sunrpc/xprt.c @@ -302,9 +302,9 @@ static inline int xprt_lock_write(struct rpc_xprt *xprt, struct rpc_task 
*task) if (test_bit(XPRT_LOCKED, &xprt->state) && xprt->snd_task == task) return 1; - spin_lock_bh(&xprt->transport_lock); + spin_lock(&xprt->transport_lock); retval = xprt->ops->reserve_xprt(xprt, task); - spin_unlock_bh(&xprt->transport_lock); + spin_unlock(&xprt->transport_lock); return retval; } @@ -381,9 +381,9 @@ static inline void xprt_release_write(struct rpc_xprt *xprt, struct rpc_task *ta { if (xprt->snd_task != task) return; - spin_lock_bh(&xprt->transport_lock); + spin_lock(&xprt->transport_lock); xprt->ops->release_xprt(xprt, task); - spin_unlock_bh(&xprt->transport_lock); + spin_unlock(&xprt->transport_lock); } /* @@ -435,9 +435,9 @@ xprt_request_get_cong(struct rpc_xprt *xprt, struct rpc_rqst *req) if (req->rq_cong) return true; - spin_lock_bh(&xprt->transport_lock); + spin_lock(&xprt->transport_lock); ret = __xprt_get_cong(xprt, req) != 0; - spin_unlock_bh(&xprt->transport_lock); + spin_unlock(&xprt->transport_lock); return ret; } EXPORT_SYMBOL_GPL(xprt_request_get_cong); @@ -464,9 +464,9 @@ static void xprt_clear_congestion_window_wait(struct rpc_xprt *xprt) { if (test_and_clear_bit(XPRT_CWND_WAIT, &xprt->state)) { - spin_lock_bh(&xprt->transport_lock); + spin_lock(&xprt->transport_lock); __xprt_lock_write_next_cong(xprt); - spin_unlock_bh(&xprt->transport_lock); + spin_unlock(&xprt->transport_lock); } } @@ -563,9 +563,9 @@ bool xprt_write_space(struct rpc_xprt *xprt) if (!test_bit(XPRT_WRITE_SPACE, &xprt->state)) return false; - spin_lock_bh(&xprt->transport_lock); + spin_lock(&xprt->transport_lock); ret = xprt_clear_write_space_locked(xprt); - spin_unlock_bh(&xprt->transport_lock); + spin_unlock(&xprt->transport_lock); return ret; } EXPORT_SYMBOL_GPL(xprt_write_space); @@ -634,9 +634,9 @@ int xprt_adjust_timeout(struct rpc_rqst *req) req->rq_retries = 0; xprt_reset_majortimeo(req); /* Reset the RTT counters == "slow start" */ - spin_lock_bh(&xprt->transport_lock); + spin_lock(&xprt->transport_lock); rpc_init_rtt(req->rq_task->tk_client->cl_rtt, to->to_initval); - spin_unlock_bh(&xprt->transport_lock); + spin_unlock(&xprt->transport_lock); status = -ETIMEDOUT; } @@ -668,11 +668,11 @@ static void xprt_autoclose(struct work_struct *work) void xprt_disconnect_done(struct rpc_xprt *xprt) { dprintk("RPC: disconnected transport %p\n", xprt); - spin_lock_bh(&xprt->transport_lock); + spin_lock(&xprt->transport_lock); xprt_clear_connected(xprt); xprt_clear_write_space_locked(xprt); xprt_wake_pending_tasks(xprt, -ENOTCONN); - spin_unlock_bh(&xprt->transport_lock); + spin_unlock(&xprt->transport_lock); } EXPORT_SYMBOL_GPL(xprt_disconnect_done); @@ -684,7 +684,7 @@ EXPORT_SYMBOL_GPL(xprt_disconnect_done); void xprt_force_disconnect(struct rpc_xprt *xprt) { /* Don't race with the test_bit() in xprt_clear_locked() */ - spin_lock_bh(&xprt->transport_lock); + spin_lock(&xprt->transport_lock); set_bit(XPRT_CLOSE_WAIT, &xprt->state); /* Try to schedule an autoclose RPC call */ if (test_and_set_bit(XPRT_LOCKED, &xprt->state) == 0) @@ -692,7 +692,7 @@ void xprt_force_disconnect(struct rpc_xprt *xprt) else if (xprt->snd_task) rpc_wake_up_queued_task_set_status(&xprt->pending, xprt->snd_task, -ENOTCONN); - spin_unlock_bh(&xprt->transport_lock); + spin_unlock(&xprt->transport_lock); } EXPORT_SYMBOL_GPL(xprt_force_disconnect); @@ -726,7 +726,7 @@ xprt_request_retransmit_after_disconnect(struct rpc_task *task) void xprt_conditional_disconnect(struct rpc_xprt *xprt, unsigned int cookie) { /* Don't race with the test_bit() in xprt_clear_locked() */ - spin_lock_bh(&xprt->transport_lock); + 
spin_lock(&xprt->transport_lock); if (cookie != xprt->connect_cookie) goto out; if (test_bit(XPRT_CLOSING, &xprt->state)) @@ -737,7 +737,7 @@ void xprt_conditional_disconnect(struct rpc_xprt *xprt, unsigned int cookie) queue_work(xprtiod_workqueue, &xprt->task_cleanup); xprt_wake_pending_tasks(xprt, -EAGAIN); out: - spin_unlock_bh(&xprt->transport_lock); + spin_unlock(&xprt->transport_lock); } static bool @@ -750,6 +750,7 @@ static void xprt_schedule_autodisconnect(struct rpc_xprt *xprt) __must_hold(&xprt->transport_lock) { + xprt->last_used = jiffies; if (RB_EMPTY_ROOT(&xprt->recv_queue) && xprt_has_timer(xprt)) mod_timer(&xprt->timer, xprt->last_used + xprt->idle_timeout); } @@ -759,18 +760,13 @@ xprt_init_autodisconnect(struct timer_list *t) { struct rpc_xprt *xprt = from_timer(xprt, t, timer); - spin_lock(&xprt->transport_lock); if (!RB_EMPTY_ROOT(&xprt->recv_queue)) - goto out_abort; + return; /* Reset xprt->last_used to avoid connect/autodisconnect cycling */ xprt->last_used = jiffies; if (test_and_set_bit(XPRT_LOCKED, &xprt->state)) - goto out_abort; - spin_unlock(&xprt->transport_lock); + return; queue_work(xprtiod_workqueue, &xprt->task_cleanup); - return; -out_abort: - spin_unlock(&xprt->transport_lock); } bool xprt_lock_connect(struct rpc_xprt *xprt, @@ -779,7 +775,7 @@ bool xprt_lock_connect(struct rpc_xprt *xprt, { bool ret = false; - spin_lock_bh(&xprt->transport_lock); + spin_lock(&xprt->transport_lock); if (!test_bit(XPRT_LOCKED, &xprt->state)) goto out; if (xprt->snd_task != task) @@ -787,13 +783,13 @@ bool xprt_lock_connect(struct rpc_xprt *xprt, xprt->snd_task = cookie; ret = true; out: - spin_unlock_bh(&xprt->transport_lock); + spin_unlock(&xprt->transport_lock); return ret; } void xprt_unlock_connect(struct rpc_xprt *xprt, void *cookie) { - spin_lock_bh(&xprt->transport_lock); + spin_lock(&xprt->transport_lock); if (xprt->snd_task != cookie) goto out; if (!test_bit(XPRT_LOCKED, &xprt->state)) @@ -802,7 +798,7 @@ void xprt_unlock_connect(struct rpc_xprt *xprt, void *cookie) xprt->ops->release_xprt(xprt, NULL); xprt_schedule_autodisconnect(xprt); out: - spin_unlock_bh(&xprt->transport_lock); + spin_unlock(&xprt->transport_lock); wake_up_bit(&xprt->state, XPRT_LOCKED); } @@ -850,6 +846,38 @@ void xprt_connect(struct rpc_task *task) xprt_release_write(xprt, task); } +/** + * xprt_reconnect_delay - compute the wait before scheduling a connect + * @xprt: transport instance + * + */ +unsigned long xprt_reconnect_delay(const struct rpc_xprt *xprt) +{ + unsigned long start, now = jiffies; + + start = xprt->stat.connect_start + xprt->reestablish_timeout; + if (time_after(start, now)) + return start - now; + return 0; +} +EXPORT_SYMBOL_GPL(xprt_reconnect_delay); + +/** + * xprt_reconnect_backoff - compute the new re-establish timeout + * @xprt: transport instance + * @init_to: initial reestablish timeout + * + */ +void xprt_reconnect_backoff(struct rpc_xprt *xprt, unsigned long init_to) +{ + xprt->reestablish_timeout <<= 1; + if (xprt->reestablish_timeout > xprt->max_reconnect_timeout) + xprt->reestablish_timeout = xprt->max_reconnect_timeout; + if (xprt->reestablish_timeout < init_to) + xprt->reestablish_timeout = init_to; +} +EXPORT_SYMBOL_GPL(xprt_reconnect_backoff); + enum xprt_xid_rb_cmp { XID_RB_EQUAL, XID_RB_LEFT, @@ -1013,6 +1041,8 @@ xprt_request_enqueue_receive(struct rpc_task *task) if (!xprt_request_need_enqueue_receive(task, req)) return; + + xprt_request_prepare(task->tk_rqstp); spin_lock(&xprt->queue_lock); /* Update the softirq receive buffer */ @@ -1412,14 
+1442,14 @@ xprt_request_transmit(struct rpc_rqst *req, struct rpc_task *snd_task) xprt_inject_disconnect(xprt); task->tk_flags |= RPC_TASK_SENT; - spin_lock_bh(&xprt->transport_lock); + spin_lock(&xprt->transport_lock); xprt->stat.sends++; xprt->stat.req_u += xprt->stat.sends - xprt->stat.recvs; xprt->stat.bklog_u += xprt->backlog.qlen; xprt->stat.sending_u += xprt->sending.qlen; xprt->stat.pending_u += xprt->pending.qlen; - spin_unlock_bh(&xprt->transport_lock); + spin_unlock(&xprt->transport_lock); req->rq_connect_cookie = connect_cookie; out_dequeue: @@ -1765,18 +1795,13 @@ void xprt_release(struct rpc_task *task) } xprt = req->rq_xprt; - if (task->tk_ops->rpc_count_stats != NULL) - task->tk_ops->rpc_count_stats(task, task->tk_calldata); - else if (task->tk_client) - rpc_count_iostats(task, task->tk_client->cl_metrics); xprt_request_dequeue_all(task, req); - spin_lock_bh(&xprt->transport_lock); + spin_lock(&xprt->transport_lock); xprt->ops->release_xprt(xprt, task); if (xprt->ops->release_request) xprt->ops->release_request(task); - xprt->last_used = jiffies; xprt_schedule_autodisconnect(xprt); - spin_unlock_bh(&xprt->transport_lock); + spin_unlock(&xprt->transport_lock); if (req->rq_buffer) xprt->ops->buf_free(task); xprt_inject_disconnect(xprt); diff --git a/net/sunrpc/xprtmultipath.c b/net/sunrpc/xprtmultipath.c index 8394124126f8..78c075a68c04 100644 --- a/net/sunrpc/xprtmultipath.c +++ b/net/sunrpc/xprtmultipath.c @@ -19,7 +19,7 @@ #include <linux/sunrpc/addr.h> #include <linux/sunrpc/xprtmultipath.h> -typedef struct rpc_xprt *(*xprt_switch_find_xprt_t)(struct list_head *head, +typedef struct rpc_xprt *(*xprt_switch_find_xprt_t)(struct rpc_xprt_switch *xps, const struct rpc_xprt *cur); static const struct rpc_xprt_iter_ops rpc_xprt_iter_singular; @@ -36,6 +36,7 @@ static void xprt_switch_add_xprt_locked(struct rpc_xprt_switch *xps, if (xps->xps_nxprts == 0) xps->xps_net = xprt->xprt_net; xps->xps_nxprts++; + xps->xps_nactive++; } /** @@ -51,8 +52,7 @@ void rpc_xprt_switch_add_xprt(struct rpc_xprt_switch *xps, if (xprt == NULL) return; spin_lock(&xps->xps_lock); - if ((xps->xps_net == xprt->xprt_net || xps->xps_net == NULL) && - !rpc_xprt_switch_has_addr(xps, (struct sockaddr *)&xprt->addr)) + if (xps->xps_net == xprt->xprt_net || xps->xps_net == NULL) xprt_switch_add_xprt_locked(xps, xprt); spin_unlock(&xps->xps_lock); } @@ -62,6 +62,7 @@ static void xprt_switch_remove_xprt_locked(struct rpc_xprt_switch *xps, { if (unlikely(xprt == NULL)) return; + xps->xps_nactive--; xps->xps_nxprts--; if (xps->xps_nxprts == 0) xps->xps_net = NULL; @@ -102,7 +103,9 @@ struct rpc_xprt_switch *xprt_switch_alloc(struct rpc_xprt *xprt, if (xps != NULL) { spin_lock_init(&xps->xps_lock); kref_init(&xps->xps_kref); - xps->xps_nxprts = 0; + xps->xps_nxprts = xps->xps_nactive = 0; + atomic_long_set(&xps->xps_queuelen, 0); + xps->xps_net = NULL; INIT_LIST_HEAD(&xps->xps_xprt_list); xps->xps_iter_ops = &rpc_xprt_iter_singular; xprt_switch_add_xprt_locked(xps, xprt); @@ -193,9 +196,21 @@ void xprt_iter_default_rewind(struct rpc_xprt_iter *xpi) } static +bool xprt_is_active(const struct rpc_xprt *xprt) +{ + return kref_read(&xprt->kref) != 0; +} + +static struct rpc_xprt *xprt_switch_find_first_entry(struct list_head *head) { - return list_first_or_null_rcu(head, struct rpc_xprt, xprt_switch); + struct rpc_xprt *pos; + + list_for_each_entry_rcu(pos, head, xprt_switch) { + if (xprt_is_active(pos)) + return pos; + } + return NULL; } static @@ -213,9 +228,12 @@ struct rpc_xprt 
*xprt_switch_find_current_entry(struct list_head *head, const struct rpc_xprt *cur) { struct rpc_xprt *pos; + bool found = false; list_for_each_entry_rcu(pos, head, xprt_switch) { if (cur == pos) + found = true; + if (found && xprt_is_active(pos)) return pos; } return NULL; @@ -260,9 +278,12 @@ struct rpc_xprt *xprt_switch_find_next_entry(struct list_head *head, const struct rpc_xprt *cur) { struct rpc_xprt *pos, *prev = NULL; + bool found = false; list_for_each_entry_rcu(pos, head, xprt_switch) { if (cur == prev) + found = true; + if (found && xprt_is_active(pos)) return pos; prev = pos; } @@ -270,22 +291,15 @@ struct rpc_xprt *xprt_switch_find_next_entry(struct list_head *head, } static -struct rpc_xprt *xprt_switch_set_next_cursor(struct list_head *head, +struct rpc_xprt *xprt_switch_set_next_cursor(struct rpc_xprt_switch *xps, struct rpc_xprt **cursor, xprt_switch_find_xprt_t find_next) { - struct rpc_xprt *cur, *pos, *old; + struct rpc_xprt *pos, *old; - cur = READ_ONCE(*cursor); - for (;;) { - old = cur; - pos = find_next(head, old); - if (pos == NULL) - break; - cur = cmpxchg_relaxed(cursor, old, pos); - if (cur == old) - break; - } + old = smp_load_acquire(cursor); + pos = find_next(xps, old); + smp_store_release(cursor, pos); return pos; } @@ -297,13 +311,11 @@ struct rpc_xprt *xprt_iter_next_entry_multiple(struct rpc_xprt_iter *xpi, if (xps == NULL) return NULL; - return xprt_switch_set_next_cursor(&xps->xps_xprt_list, - &xpi->xpi_cursor, - find_next); + return xprt_switch_set_next_cursor(xps, &xpi->xpi_cursor, find_next); } static -struct rpc_xprt *xprt_switch_find_next_entry_roundrobin(struct list_head *head, +struct rpc_xprt *__xprt_switch_find_next_entry_roundrobin(struct list_head *head, const struct rpc_xprt *cur) { struct rpc_xprt *ret; @@ -315,6 +327,31 @@ struct rpc_xprt *xprt_switch_find_next_entry_roundrobin(struct list_head *head, } static +struct rpc_xprt *xprt_switch_find_next_entry_roundrobin(struct rpc_xprt_switch *xps, + const struct rpc_xprt *cur) +{ + struct list_head *head = &xps->xps_xprt_list; + struct rpc_xprt *xprt; + unsigned int nactive; + + for (;;) { + unsigned long xprt_queuelen, xps_queuelen; + + xprt = __xprt_switch_find_next_entry_roundrobin(head, cur); + if (!xprt) + break; + xprt_queuelen = atomic_long_read(&xprt->queuelen); + xps_queuelen = atomic_long_read(&xps->xps_queuelen); + nactive = READ_ONCE(xps->xps_nactive); + /* Exit loop if xprt_queuelen <= average queue length */ + if (xprt_queuelen * nactive <= xps_queuelen) + break; + cur = xprt; + } + return xprt; +} + +static struct rpc_xprt *xprt_iter_next_entry_roundrobin(struct rpc_xprt_iter *xpi) { return xprt_iter_next_entry_multiple(xpi, @@ -322,9 +359,17 @@ struct rpc_xprt *xprt_iter_next_entry_roundrobin(struct rpc_xprt_iter *xpi) } static +struct rpc_xprt *xprt_switch_find_next_entry_all(struct rpc_xprt_switch *xps, + const struct rpc_xprt *cur) +{ + return xprt_switch_find_next_entry(&xps->xps_xprt_list, cur); +} + +static struct rpc_xprt *xprt_iter_next_entry_all(struct rpc_xprt_iter *xpi) { - return xprt_iter_next_entry_multiple(xpi, xprt_switch_find_next_entry); + return xprt_iter_next_entry_multiple(xpi, + xprt_switch_find_next_entry_all); } /* diff --git a/net/sunrpc/xprtrdma/backchannel.c b/net/sunrpc/xprtrdma/backchannel.c index ce986591f213..59e624b1d7a0 100644 --- a/net/sunrpc/xprtrdma/backchannel.c +++ b/net/sunrpc/xprtrdma/backchannel.c @@ -52,6 +52,13 @@ size_t xprt_rdma_bc_maxpayload(struct rpc_xprt *xprt) return maxmsg - RPCRDMA_HDRLEN_MIN; } +unsigned int 
xprt_rdma_bc_max_slots(struct rpc_xprt *xprt) +{ + struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt); + + return r_xprt->rx_buf.rb_bc_srv_max_requests; +} + static int rpcrdma_bc_marshal_reply(struct rpc_rqst *rqst) { struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(rqst->rq_xprt); diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c index 794ba4ca0994..0b6dad7580a1 100644 --- a/net/sunrpc/xprtrdma/frwr_ops.c +++ b/net/sunrpc/xprtrdma/frwr_ops.c @@ -144,6 +144,26 @@ frwr_mr_recycle_worker(struct work_struct *work) frwr_release_mr(mr); } +/* frwr_reset - Place MRs back on the free list + * @req: request to reset + * + * Used after a failed marshal. For FRWR, this means the MRs + * don't have to be fully released and recreated. + * + * NB: This is safe only as long as none of @req's MRs are + * involved with an ongoing asynchronous FAST_REG or LOCAL_INV + * Work Request. + */ +void frwr_reset(struct rpcrdma_req *req) +{ + while (!list_empty(&req->rl_registered)) { + struct rpcrdma_mr *mr; + + mr = rpcrdma_mr_pop(&req->rl_registered); + rpcrdma_mr_unmap_and_put(mr); + } +} + /** * frwr_init_mr - Initialize one MR * @ia: interface adapter @@ -168,7 +188,6 @@ int frwr_init_mr(struct rpcrdma_ia *ia, struct rpcrdma_mr *mr) goto out_list_err; mr->frwr.fr_mr = frmr; - mr->frwr.fr_state = FRWR_IS_INVALID; mr->mr_dir = DMA_NONE; INIT_LIST_HEAD(&mr->mr_list); INIT_WORK(&mr->mr_recycle, frwr_mr_recycle_worker); @@ -298,65 +317,6 @@ size_t frwr_maxpages(struct rpcrdma_xprt *r_xprt) } /** - * frwr_wc_fastreg - Invoked by RDMA provider for a flushed FastReg WC - * @cq: completion queue (ignored) - * @wc: completed WR - * - */ -static void -frwr_wc_fastreg(struct ib_cq *cq, struct ib_wc *wc) -{ - struct ib_cqe *cqe = wc->wr_cqe; - struct rpcrdma_frwr *frwr = - container_of(cqe, struct rpcrdma_frwr, fr_cqe); - - /* WARNING: Only wr_cqe and status are reliable at this point */ - if (wc->status != IB_WC_SUCCESS) - frwr->fr_state = FRWR_FLUSHED_FR; - trace_xprtrdma_wc_fastreg(wc, frwr); -} - -/** - * frwr_wc_localinv - Invoked by RDMA provider for a flushed LocalInv WC - * @cq: completion queue (ignored) - * @wc: completed WR - * - */ -static void -frwr_wc_localinv(struct ib_cq *cq, struct ib_wc *wc) -{ - struct ib_cqe *cqe = wc->wr_cqe; - struct rpcrdma_frwr *frwr = container_of(cqe, struct rpcrdma_frwr, - fr_cqe); - - /* WARNING: Only wr_cqe and status are reliable at this point */ - if (wc->status != IB_WC_SUCCESS) - frwr->fr_state = FRWR_FLUSHED_LI; - trace_xprtrdma_wc_li(wc, frwr); -} - -/** - * frwr_wc_localinv_wake - Invoked by RDMA provider for a signaled LocalInv WC - * @cq: completion queue (ignored) - * @wc: completed WR - * - * Awaken anyone waiting for an MR to finish being fenced. 
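The flushed-completion handlers deleted here return further down in rewritten form: instead of recording a failure in frwr->fr_state for some later actor, each completion handler now disposes of its MR directly (see __frwr_release_mr() below). That retires the FRWR state machine, and with it the retry loop in frwr_map(): the free list only ever holds MRs that are safe to register, so the mapper reduces to

    mr = rpcrdma_mr_get(r_xprt);
    if (!mr)
        goto out_getmr_err;   /* no free MRs: wait for buffer space */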
- */ -static void -frwr_wc_localinv_wake(struct ib_cq *cq, struct ib_wc *wc) -{ - struct ib_cqe *cqe = wc->wr_cqe; - struct rpcrdma_frwr *frwr = container_of(cqe, struct rpcrdma_frwr, - fr_cqe); - - /* WARNING: Only wr_cqe and status are reliable at this point */ - if (wc->status != IB_WC_SUCCESS) - frwr->fr_state = FRWR_FLUSHED_LI; - trace_xprtrdma_wc_li_wake(wc, frwr); - complete(&frwr->fr_linv_done); -} - -/** * frwr_map - Register a memory region * @r_xprt: controlling transport * @seg: memory region co-ordinates @@ -378,23 +338,15 @@ struct rpcrdma_mr_seg *frwr_map(struct rpcrdma_xprt *r_xprt, { struct rpcrdma_ia *ia = &r_xprt->rx_ia; bool holes_ok = ia->ri_mrtype == IB_MR_TYPE_SG_GAPS; - struct rpcrdma_frwr *frwr; struct rpcrdma_mr *mr; struct ib_mr *ibmr; struct ib_reg_wr *reg_wr; int i, n; u8 key; - mr = NULL; - do { - if (mr) - rpcrdma_mr_recycle(mr); - mr = rpcrdma_mr_get(r_xprt); - if (!mr) - return ERR_PTR(-EAGAIN); - } while (mr->frwr.fr_state != FRWR_IS_INVALID); - frwr = &mr->frwr; - frwr->fr_state = FRWR_IS_VALID; + mr = rpcrdma_mr_get(r_xprt); + if (!mr) + goto out_getmr_err; if (nsegs > ia->ri_max_frwr_depth) nsegs = ia->ri_max_frwr_depth; @@ -423,7 +375,7 @@ struct rpcrdma_mr_seg *frwr_map(struct rpcrdma_xprt *r_xprt, if (!mr->mr_nents) goto out_dmamap_err; - ibmr = frwr->fr_mr; + ibmr = mr->frwr.fr_mr; n = ib_map_mr_sg(ibmr, mr->mr_sg, mr->mr_nents, NULL, PAGE_SIZE); if (unlikely(n != mr->mr_nents)) goto out_mapmr_err; @@ -433,7 +385,7 @@ struct rpcrdma_mr_seg *frwr_map(struct rpcrdma_xprt *r_xprt, key = (u8)(ibmr->rkey & 0x000000FF); ib_update_fast_reg_key(ibmr, ++key); - reg_wr = &frwr->fr_regwr; + reg_wr = &mr->frwr.fr_regwr; reg_wr->mr = ibmr; reg_wr->key = ibmr->rkey; reg_wr->access = writing ? @@ -448,6 +400,10 @@ struct rpcrdma_mr_seg *frwr_map(struct rpcrdma_xprt *r_xprt, *out = mr; return seg; +out_getmr_err: + xprt_wait_for_buffer_space(&r_xprt->rx_xprt); + return ERR_PTR(-EAGAIN); + out_dmamap_err: mr->mr_dir = DMA_NONE; trace_xprtrdma_frwr_sgerr(mr, i); @@ -461,6 +417,23 @@ out_mapmr_err: } /** + * frwr_wc_fastreg - Invoked by RDMA provider for a flushed FastReg WC + * @cq: completion queue (ignored) + * @wc: completed WR + * + */ +static void frwr_wc_fastreg(struct ib_cq *cq, struct ib_wc *wc) +{ + struct ib_cqe *cqe = wc->wr_cqe; + struct rpcrdma_frwr *frwr = + container_of(cqe, struct rpcrdma_frwr, fr_cqe); + + /* WARNING: Only wr_cqe and status are reliable at this point */ + trace_xprtrdma_wc_fastreg(wc, frwr); + /* The MR will get recycled when the associated req is retransmitted */ +} + +/** * frwr_send - post Send WR containing the RPC Call message * @ia: interface adapter * @req: Prepared RPC Call @@ -512,31 +485,75 @@ void frwr_reminv(struct rpcrdma_rep *rep, struct list_head *mrs) if (mr->mr_handle == rep->rr_inv_rkey) { list_del_init(&mr->mr_list); trace_xprtrdma_mr_remoteinv(mr); - mr->frwr.fr_state = FRWR_IS_INVALID; rpcrdma_mr_unmap_and_put(mr); break; /* only one invalidated MR per RPC */ } } +static void __frwr_release_mr(struct ib_wc *wc, struct rpcrdma_mr *mr) +{ + if (wc->status != IB_WC_SUCCESS) + rpcrdma_mr_recycle(mr); + else + rpcrdma_mr_unmap_and_put(mr); +} + /** - * frwr_unmap_sync - invalidate memory regions that were registered for @req - * @r_xprt: controlling transport - * @mrs: list of MRs to process + * frwr_wc_localinv - Invoked by RDMA provider for a LOCAL_INV WC + * @cq: completion queue (ignored) + * @wc: completed WR + * + */ +static void frwr_wc_localinv(struct ib_cq *cq, struct ib_wc *wc) +{ + struct ib_cqe *cqe = 
wc->wr_cqe; + struct rpcrdma_frwr *frwr = + container_of(cqe, struct rpcrdma_frwr, fr_cqe); + struct rpcrdma_mr *mr = container_of(frwr, struct rpcrdma_mr, frwr); + + /* WARNING: Only wr_cqe and status are reliable at this point */ + trace_xprtrdma_wc_li(wc, frwr); + __frwr_release_mr(wc, mr); +} + +/** + * frwr_wc_localinv_wake - Invoked by RDMA provider for a LOCAL_INV WC + * @cq: completion queue (ignored) + * @wc: completed WR * - * Sleeps until it is safe for the host CPU to access the - * previously mapped memory regions. + * Awaken anyone waiting for an MR to finish being fenced. + */ +static void frwr_wc_localinv_wake(struct ib_cq *cq, struct ib_wc *wc) +{ + struct ib_cqe *cqe = wc->wr_cqe; + struct rpcrdma_frwr *frwr = + container_of(cqe, struct rpcrdma_frwr, fr_cqe); + struct rpcrdma_mr *mr = container_of(frwr, struct rpcrdma_mr, frwr); + + /* WARNING: Only wr_cqe and status are reliable at this point */ + trace_xprtrdma_wc_li_wake(wc, frwr); + complete(&frwr->fr_linv_done); + __frwr_release_mr(wc, mr); +} + +/** + * frwr_unmap_sync - invalidate memory regions that were registered for @req + * @r_xprt: controlling transport instance + * @req: rpcrdma_req with a non-empty list of MRs to process * - * Caller ensures that @mrs is not empty before the call. This - * function empties the list. + * Sleeps until it is safe for the host CPU to access the previously mapped + * memory regions. This guarantees that registered MRs are properly fenced + * from the server before the RPC consumer accesses the data in them. It + * also ensures proper Send flow control: waking the next RPC waits until + * this RPC has relinquished all its Send Queue entries. */ -void frwr_unmap_sync(struct rpcrdma_xprt *r_xprt, struct list_head *mrs) +void frwr_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req) { struct ib_send_wr *first, **prev, *last; const struct ib_send_wr *bad_wr; - struct rpcrdma_ia *ia = &r_xprt->rx_ia; struct rpcrdma_frwr *frwr; struct rpcrdma_mr *mr; - int count, rc; + int rc; /* ORDER: Invalidate all of the MRs first * @@ -544,33 +561,32 @@ void frwr_unmap_sync(struct rpcrdma_xprt *r_xprt, struct list_head *mrs) * a single ib_post_send() call. */ frwr = NULL; - count = 0; prev = &first; - list_for_each_entry(mr, mrs, mr_list) { - mr->frwr.fr_state = FRWR_IS_INVALID; + while (!list_empty(&req->rl_registered)) { + mr = rpcrdma_mr_pop(&req->rl_registered); - frwr = &mr->frwr; trace_xprtrdma_mr_localinv(mr); + r_xprt->rx_stats.local_inv_needed++; + frwr = &mr->frwr; frwr->fr_cqe.done = frwr_wc_localinv; last = &frwr->fr_invwr; - memset(last, 0, sizeof(*last)); + last->next = NULL; last->wr_cqe = &frwr->fr_cqe; + last->sg_list = NULL; + last->num_sge = 0; last->opcode = IB_WR_LOCAL_INV; + last->send_flags = IB_SEND_SIGNALED; last->ex.invalidate_rkey = mr->mr_handle; - count++; *prev = last; prev = &last->next; } - if (!frwr) - goto unmap; /* Strong send queue ordering guarantees that when the * last WR in the chain completes, all WRs in the chain * are complete. */ - last->send_flags = IB_SEND_SIGNALED; frwr->fr_cqe.done = frwr_wc_localinv_wake; reinit_completion(&frwr->fr_linv_done); @@ -578,37 +594,126 @@ void frwr_unmap_sync(struct rpcrdma_xprt *r_xprt, struct list_head *mrs) * replaces the QP. The RPC reply handler won't call us * unless ri_id->qp is a valid pointer. 
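__frwr_release_mr() is now the single place where an invalidated MR's fate is decided: a successful LOCAL_INV means the registration is cleanly fenced and the MR can return to the free list, while a flushed or failed completion leaves the hardware state of the registration unknown, so the MR is recycled (its ib_mr destroyed and re-created) rather than reused:

    if (wc->status != IB_WC_SUCCESS)
        rpcrdma_mr_recycle(mr);         /* state unknown: rebuild */
    else
        rpcrdma_mr_unmap_and_put(mr);   /* cleanly fenced: reuse */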
*/ - r_xprt->rx_stats.local_inv_needed++; bad_wr = NULL; - rc = ib_post_send(ia->ri_id->qp, first, &bad_wr); + rc = ib_post_send(r_xprt->rx_ia.ri_id->qp, first, &bad_wr); + trace_xprtrdma_post_send(req, rc); + + /* The final LOCAL_INV WR in the chain is supposed to + * do the wake. If it was never posted, the wake will + * not happen, so don't wait in that case. + */ if (bad_wr != first) wait_for_completion(&frwr->fr_linv_done); - if (rc) - goto out_release; + if (!rc) + return; - /* ORDER: Now DMA unmap all of the MRs, and return - * them to the free MR list. + /* Recycle MRs in the LOCAL_INV chain that did not get posted. */ -unmap: - while (!list_empty(mrs)) { - mr = rpcrdma_mr_pop(mrs); - rpcrdma_mr_unmap_and_put(mr); + while (bad_wr) { + frwr = container_of(bad_wr, struct rpcrdma_frwr, + fr_invwr); + mr = container_of(frwr, struct rpcrdma_mr, frwr); + bad_wr = bad_wr->next; + + list_del_init(&mr->mr_list); + rpcrdma_mr_recycle(mr); } - return; +} -out_release: - pr_err("rpcrdma: FRWR invalidate ib_post_send returned %i\n", rc); +/** + * frwr_wc_localinv_done - Invoked by RDMA provider for a signaled LOCAL_INV WC + * @cq: completion queue (ignored) + * @wc: completed WR + * + */ +static void frwr_wc_localinv_done(struct ib_cq *cq, struct ib_wc *wc) +{ + struct ib_cqe *cqe = wc->wr_cqe; + struct rpcrdma_frwr *frwr = + container_of(cqe, struct rpcrdma_frwr, fr_cqe); + struct rpcrdma_mr *mr = container_of(frwr, struct rpcrdma_mr, frwr); - /* Unmap and release the MRs in the LOCAL_INV WRs that did not - * get posted. + /* WARNING: Only wr_cqe and status are reliable at this point */ + trace_xprtrdma_wc_li_done(wc, frwr); + rpcrdma_complete_rqst(frwr->fr_req->rl_reply); + __frwr_release_mr(wc, mr); +} + +/** + * frwr_unmap_async - invalidate memory regions that were registered for @req + * @r_xprt: controlling transport instance + * @req: rpcrdma_req with a non-empty list of MRs to process + * + * This guarantees that registered MRs are properly fenced from the + * server before the RPC consumer accesses the data in them. It also + * ensures proper Send flow control: waking the next RPC waits until + * this RPC has relinquished all its Send Queue entries. + */ +void frwr_unmap_async(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req) +{ + struct ib_send_wr *first, *last, **prev; + const struct ib_send_wr *bad_wr; + struct rpcrdma_frwr *frwr; + struct rpcrdma_mr *mr; + int rc; + + /* Chain the LOCAL_INV Work Requests and post them with + * a single ib_post_send() call. + */ + frwr = NULL; + prev = &first; + while (!list_empty(&req->rl_registered)) { + mr = rpcrdma_mr_pop(&req->rl_registered); + + trace_xprtrdma_mr_localinv(mr); + r_xprt->rx_stats.local_inv_needed++; + + frwr = &mr->frwr; + frwr->fr_cqe.done = frwr_wc_localinv; + frwr->fr_req = req; + last = &frwr->fr_invwr; + last->next = NULL; + last->wr_cqe = &frwr->fr_cqe; + last->sg_list = NULL; + last->num_sge = 0; + last->opcode = IB_WR_LOCAL_INV; + last->send_flags = IB_SEND_SIGNALED; + last->ex.invalidate_rkey = mr->mr_handle; + + *prev = last; + prev = &last->next; + } + + /* Strong send queue ordering guarantees that when the + * last WR in the chain completes, all WRs in the chain + * are complete. The last completion will wake up the + * RPC waiter. + */ + frwr->fr_cqe.done = frwr_wc_localinv_done; + + /* Transport disconnect drains the receive CQ before it + * replaces the QP. The RPC reply handler won't call us + * unless ri_id->qp is a valid pointer. 
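frwr_unmap_sync() and the new frwr_unmap_async() build their LOCAL_INV requests identically: the classic chained-WR idiom, stitching each request's next pointer in through a **prev cursor and handing the whole chain to the HCA in one ib_post_send(). In sketch form (kernel RDMA context assumed, next_mr() hypothetical):

    struct ib_send_wr *first, *last, **prev = &first;
    struct rpcrdma_mr *mr;

    while ((mr = next_mr()) != NULL) {
        last = &mr->frwr.fr_invwr;
        last->next = NULL;
        *prev = last;               /* link the previous WR to this one */
        prev = &last->next;
    }
    rc = ib_post_send(qp, first, &bad_wr);  /* bad_wr: first unposted WR */

The two differ only in who finishes the RPC: the sync variant points the last WR's completion at frwr_wc_localinv_wake() and sleeps on fr_linv_done, while the async variant points it at frwr_wc_localinv_done(), which completes the reply straight from the completion handler and so saves a context switch on the reply path.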
+ */ + bad_wr = NULL; + rc = ib_post_send(r_xprt->rx_ia.ri_id->qp, first, &bad_wr); + trace_xprtrdma_post_send(req, rc); + if (!rc) + return; + + /* Recycle MRs in the LOCAL_INV chain that did not get posted. */ while (bad_wr) { - frwr = container_of(bad_wr, struct rpcrdma_frwr, - fr_invwr); + frwr = container_of(bad_wr, struct rpcrdma_frwr, fr_invwr); mr = container_of(frwr, struct rpcrdma_mr, frwr); bad_wr = bad_wr->next; - list_del_init(&mr->mr_list); rpcrdma_mr_recycle(mr); } + + /* The final LOCAL_INV WR in the chain is supposed to + * do the wake. If it was never posted, the wake will + * not happen, so wake here in that case. + */ + rpcrdma_complete_rqst(req->rl_reply); } diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c index 85115a2e2639..4345e6912392 100644 --- a/net/sunrpc/xprtrdma/rpc_rdma.c +++ b/net/sunrpc/xprtrdma/rpc_rdma.c @@ -366,6 +366,9 @@ rpcrdma_encode_read_list(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req, unsigned int pos; int nsegs; + if (rtype == rpcrdma_noch) + goto done; + pos = rqst->rq_snd_buf.head[0].iov_len; if (rtype == rpcrdma_areadch) pos = 0; @@ -389,7 +392,8 @@ rpcrdma_encode_read_list(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req, nsegs -= mr->mr_nents; } while (nsegs); - return 0; +done: + return encode_item_not_present(xdr); } /* Register and XDR encode the Write list. Supports encoding a list @@ -417,6 +421,9 @@ rpcrdma_encode_write_list(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req, int nsegs, nchunks; __be32 *segcount; + if (wtype != rpcrdma_writech) + goto done; + seg = req->rl_segments; nsegs = rpcrdma_convert_iovs(r_xprt, &rqst->rq_rcv_buf, rqst->rq_rcv_buf.head[0].iov_len, @@ -451,7 +458,8 @@ rpcrdma_encode_write_list(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req, /* Update count of segments in this Write chunk */ *segcount = cpu_to_be32(nchunks); - return 0; +done: + return encode_item_not_present(xdr); } /* Register and XDR encode the Reply chunk. Supports encoding an array @@ -476,6 +484,9 @@ rpcrdma_encode_reply_chunk(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req, int nsegs, nchunks; __be32 *segcount; + if (wtype != rpcrdma_replych) + return encode_item_not_present(xdr); + seg = req->rl_segments; nsegs = rpcrdma_convert_iovs(r_xprt, &rqst->rq_rcv_buf, 0, wtype, seg); if (nsegs < 0) @@ -511,6 +522,16 @@ rpcrdma_encode_reply_chunk(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req, return 0; } +static void rpcrdma_sendctx_done(struct kref *kref) +{ + struct rpcrdma_req *req = + container_of(kref, struct rpcrdma_req, rl_kref); + struct rpcrdma_rep *rep = req->rl_reply; + + rpcrdma_complete_rqst(rep); + rep->rr_rxprt->rx_stats.reply_waits_for_send++; +} + /** * rpcrdma_sendctx_unmap - DMA-unmap Send buffer * @sc: sendctx containing SGEs to unmap @@ -520,6 +541,9 @@ void rpcrdma_sendctx_unmap(struct rpcrdma_sendctx *sc) { struct ib_sge *sge; + if (!sc->sc_unmap_count) + return; + /* The first two SGEs contain the transport header and * the inline buffer. These are always left mapped so * they can be cheaply re-used. @@ -529,9 +553,7 @@ void rpcrdma_sendctx_unmap(struct rpcrdma_sendctx *sc) ib_dma_unmap_page(sc->sc_device, sge->addr, sge->length, DMA_TO_DEVICE); - if (test_and_clear_bit(RPCRDMA_REQ_F_TX_RESOURCES, - &sc->sc_req->rl_flags)) - wake_up_bit(&sc->sc_req->rl_flags, RPCRDMA_REQ_F_TX_RESOURCES); + kref_put(&sc->sc_req->rl_kref, rpcrdma_sendctx_done); } /* Prepare an SGE for the RPC-over-RDMA transport header. 
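The rpc_rdma.c hunks move the "chunk not present" decision into the encoders themselves: rpcrdma_encode_read_list(), _write_list() and _reply_chunk() each check whether their chunk type applies and always finish by emitting the XDR not-present item, so rpcrdma_marshal_req() below collapses into three unconditional calls. Conceptually, every marshaled transport header now has the shape

    /* RPC-over-RDMA transport header after marshaling:
     *   xid, vers, credits, proc
     *   Read list:   zero or more entries, then a not-present item
     *   Write list:  zero or more chunks, then a not-present item
     *   Reply chunk: one entry, or a not-present item
     */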
@@ -666,7 +688,7 @@ map_tail: out: sc->sc_wr.num_sge += sge_no; if (sc->sc_unmap_count) - __set_bit(RPCRDMA_REQ_F_TX_RESOURCES, &req->rl_flags); + kref_get(&req->rl_kref); return true; out_regbuf: @@ -699,22 +721,28 @@ rpcrdma_prepare_send_sges(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req, u32 hdrlen, struct xdr_buf *xdr, enum rpcrdma_chunktype rtype) { + int ret; + + ret = -EAGAIN; req->rl_sendctx = rpcrdma_sendctx_get_locked(r_xprt); if (!req->rl_sendctx) - return -EAGAIN; + goto err; req->rl_sendctx->sc_wr.num_sge = 0; req->rl_sendctx->sc_unmap_count = 0; req->rl_sendctx->sc_req = req; - __clear_bit(RPCRDMA_REQ_F_TX_RESOURCES, &req->rl_flags); + kref_init(&req->rl_kref); + ret = -EIO; if (!rpcrdma_prepare_hdr_sge(r_xprt, req, hdrlen)) - return -EIO; - + goto err; if (rtype != rpcrdma_areadch) if (!rpcrdma_prepare_msg_sges(r_xprt, req, xdr, rtype)) - return -EIO; - + goto err; return 0; + +err: + trace_xprtrdma_prepsend_failed(&req->rl_slot, ret); + return ret; } /** @@ -842,50 +870,28 @@ rpcrdma_marshal_req(struct rpcrdma_xprt *r_xprt, struct rpc_rqst *rqst) * send a Call message with a Position Zero Read chunk and a * regular Read chunk at the same time. */ - if (rtype != rpcrdma_noch) { - ret = rpcrdma_encode_read_list(r_xprt, req, rqst, rtype); - if (ret) - goto out_err; - } - ret = encode_item_not_present(xdr); + ret = rpcrdma_encode_read_list(r_xprt, req, rqst, rtype); if (ret) goto out_err; - - if (wtype == rpcrdma_writech) { - ret = rpcrdma_encode_write_list(r_xprt, req, rqst, wtype); - if (ret) - goto out_err; - } - ret = encode_item_not_present(xdr); + ret = rpcrdma_encode_write_list(r_xprt, req, rqst, wtype); if (ret) goto out_err; - - if (wtype != rpcrdma_replych) - ret = encode_item_not_present(xdr); - else - ret = rpcrdma_encode_reply_chunk(r_xprt, req, rqst, wtype); + ret = rpcrdma_encode_reply_chunk(r_xprt, req, rqst, wtype); if (ret) goto out_err; - trace_xprtrdma_marshal(rqst, xdr_stream_pos(xdr), rtype, wtype); - - ret = rpcrdma_prepare_send_sges(r_xprt, req, xdr_stream_pos(xdr), + ret = rpcrdma_prepare_send_sges(r_xprt, req, req->rl_hdrbuf.len, &rqst->rq_snd_buf, rtype); if (ret) goto out_err; + + trace_xprtrdma_marshal(req, rtype, wtype); return 0; out_err: trace_xprtrdma_marshal_failed(rqst, ret); - switch (ret) { - case -EAGAIN: - xprt_wait_for_buffer_space(rqst->rq_xprt); - break; - case -ENOBUFS: - break; - default: - r_xprt->rx_stats.failed_marshal_count++; - } + r_xprt->rx_stats.failed_marshal_count++; + frwr_reset(req); return ret; } @@ -1269,51 +1275,17 @@ out_badheader: goto out; } -void rpcrdma_release_rqst(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req) -{ - /* Invalidate and unmap the data payloads before waking - * the waiting application. This guarantees the memory - * regions are properly fenced from the server before the - * application accesses the data. It also ensures proper - * send flow control: waking the next RPC waits until this - * RPC has relinquished all its Send Queue entries. - */ - if (!list_empty(&req->rl_registered)) - frwr_unmap_sync(r_xprt, &req->rl_registered); - - /* Ensure that any DMA mapped pages associated with - * the Send of the RPC Call have been unmapped before - * allowing the RPC to complete. This protects argument - * memory not controlled by the RPC client from being - * re-used before we're done with it. 
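req->rl_kref replaces both the RPCRDMA_REQ_F_TX_RESOURCES bit-wait and the deferred-completion workqueue (removed below) as the mechanism that keeps an RPC from completing while its Send buffers are still DMA-mapped. One reference is taken at kref_init() on behalf of the reply; rpcrdma_prepare_msg_sges() takes a second whenever the Send leaves pages mapped; whichever owner drops last runs rpcrdma_complete_rqst() through its release callback (the Local Invalidate path instead completes the RPC from frwr_wc_localinv_done()). The pattern:

    kref_init(&req->rl_kref);        /* owner 1: the reply path */
    if (sc->sc_unmap_count)
        kref_get(&req->rl_kref);     /* owner 2: the Send unmap */

    /* when the reply arrives: */
    kref_put(&req->rl_kref, rpcrdma_reply_done);
    /* when the Send is unmapped: */
    kref_put(&req->rl_kref, rpcrdma_sendctx_done);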
- */ - if (test_bit(RPCRDMA_REQ_F_TX_RESOURCES, &req->rl_flags)) { - r_xprt->rx_stats.reply_waits_for_send++; - out_of_line_wait_on_bit(&req->rl_flags, - RPCRDMA_REQ_F_TX_RESOURCES, - bit_wait, - TASK_UNINTERRUPTIBLE); - } -} - -/* Reply handling runs in the poll worker thread. Anything that - * might wait is deferred to a separate workqueue. - */ -void rpcrdma_deferred_completion(struct work_struct *work) +static void rpcrdma_reply_done(struct kref *kref) { - struct rpcrdma_rep *rep = - container_of(work, struct rpcrdma_rep, rr_work); - struct rpcrdma_req *req = rpcr_to_rdmar(rep->rr_rqst); - struct rpcrdma_xprt *r_xprt = rep->rr_rxprt; + struct rpcrdma_req *req = + container_of(kref, struct rpcrdma_req, rl_kref); - trace_xprtrdma_defer_cmp(rep); - if (rep->rr_wc_flags & IB_WC_WITH_INVALIDATE) - frwr_reminv(rep, &req->rl_registered); - rpcrdma_release_rqst(r_xprt, req); - rpcrdma_complete_rqst(rep); + rpcrdma_complete_rqst(req->rl_reply); } -/* Process received RPC/RDMA messages. +/** + * rpcrdma_reply_handler - Process received RPC/RDMA messages + * @rep: Incoming rpcrdma_rep object to process * * Errors must result in the RPC task either being awakened, or * allowed to timeout, to discover the errors at that time. @@ -1360,10 +1332,10 @@ void rpcrdma_reply_handler(struct rpcrdma_rep *rep) else if (credits > buf->rb_max_requests) credits = buf->rb_max_requests; if (buf->rb_credits != credits) { - spin_lock_bh(&xprt->transport_lock); + spin_lock(&xprt->transport_lock); buf->rb_credits = credits; xprt->cwnd = credits << RPC_CWNDSHIFT; - spin_unlock_bh(&xprt->transport_lock); + spin_unlock(&xprt->transport_lock); } req = rpcr_to_rdmar(rqst); @@ -1373,10 +1345,16 @@ void rpcrdma_reply_handler(struct rpcrdma_rep *rep) } req->rl_reply = rep; rep->rr_rqst = rqst; - clear_bit(RPCRDMA_REQ_F_PENDING, &req->rl_flags); trace_xprtrdma_reply(rqst->rq_task, rep, req, credits); - queue_work(buf->rb_completion_wq, &rep->rr_work); + + if (rep->rr_wc_flags & IB_WC_WITH_INVALIDATE) + frwr_reminv(rep, &req->rl_registered); + if (!list_empty(&req->rl_registered)) + frwr_unmap_async(r_xprt, req); + /* LocalInv completion will complete the RPC */ + else + kref_put(&req->rl_kref, rpcrdma_reply_done); return; out_badversion: diff --git a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c index bed57d8b5c19..d1fcc41d5eb5 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c +++ b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c @@ -72,9 +72,9 @@ int svc_rdma_handle_bc_reply(struct rpc_xprt *xprt, __be32 *rdma_resp, else if (credits > r_xprt->rx_buf.rb_bc_max_requests) credits = r_xprt->rx_buf.rb_bc_max_requests; - spin_lock_bh(&xprt->transport_lock); + spin_lock(&xprt->transport_lock); xprt->cwnd = credits << RPC_CWNDSHIFT; - spin_unlock_bh(&xprt->transport_lock); + spin_unlock(&xprt->transport_lock); spin_lock(&xprt->queue_lock); ret = 0; diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c index 0004535c0188..3fe665152d95 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c @@ -226,9 +226,9 @@ static void handle_connect_req(struct rdma_cm_id *new_cma_id, * Enqueue the new transport on the accept queue of the listening * transport */ - spin_lock_bh(&listen_xprt->sc_lock); + spin_lock(&listen_xprt->sc_lock); list_add_tail(&newxprt->sc_accept_q, &listen_xprt->sc_accept_q); - spin_unlock_bh(&listen_xprt->sc_lock); + spin_unlock(&listen_xprt->sc_lock); set_bit(XPT_CONN, 
&listen_xprt->sc_xprt.xpt_flags); svc_xprt_enqueue(&listen_xprt->sc_xprt); @@ -401,7 +401,7 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt) listen_rdma = container_of(xprt, struct svcxprt_rdma, sc_xprt); clear_bit(XPT_CONN, &xprt->xpt_flags); /* Get the next entry off the accept list */ - spin_lock_bh(&listen_rdma->sc_lock); + spin_lock(&listen_rdma->sc_lock); if (!list_empty(&listen_rdma->sc_accept_q)) { newxprt = list_entry(listen_rdma->sc_accept_q.next, struct svcxprt_rdma, sc_accept_q); @@ -409,7 +409,7 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt) } if (!list_empty(&listen_rdma->sc_accept_q)) set_bit(XPT_CONN, &listen_rdma->sc_xprt.xpt_flags); - spin_unlock_bh(&listen_rdma->sc_lock); + spin_unlock(&listen_rdma->sc_lock); if (!newxprt) return NULL; diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c index ffb1684c4573..2ec349ed4770 100644 --- a/net/sunrpc/xprtrdma/transport.c +++ b/net/sunrpc/xprtrdma/transport.c @@ -297,6 +297,7 @@ xprt_rdma_destroy(struct rpc_xprt *xprt) module_put(THIS_MODULE); } +/* 60 second timeout, no retries */ static const struct rpc_timeout xprt_rdma_default_timeout = { .to_initval = 60 * HZ, .to_maxval = 60 * HZ, @@ -322,8 +323,9 @@ xprt_setup_rdma(struct xprt_create *args) if (!xprt) return ERR_PTR(-ENOMEM); - /* 60 second timeout, no retries */ xprt->timeout = &xprt_rdma_default_timeout; + xprt->connect_timeout = xprt->timeout->to_initval; + xprt->max_reconnect_timeout = xprt->timeout->to_maxval; xprt->bind_timeout = RPCRDMA_BIND_TO; xprt->reestablish_timeout = RPCRDMA_INIT_REEST_TO; xprt->idle_timeout = RPCRDMA_IDLE_DISC_TO; @@ -486,31 +488,64 @@ xprt_rdma_timer(struct rpc_xprt *xprt, struct rpc_task *task) } /** - * xprt_rdma_connect - try to establish a transport connection + * xprt_rdma_tcp_set_connect_timeout - set timeouts for establishing a connection + * @xprt: controlling transport instance + * @connect_timeout: reconnect timeout after client disconnects + * @reconnect_timeout: reconnect timeout after server disconnects + * + */ +static void xprt_rdma_tcp_set_connect_timeout(struct rpc_xprt *xprt, + unsigned long connect_timeout, + unsigned long reconnect_timeout) +{ + struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt); + + trace_xprtrdma_op_set_cto(r_xprt, connect_timeout, reconnect_timeout); + + spin_lock(&xprt->transport_lock); + + if (connect_timeout < xprt->connect_timeout) { + struct rpc_timeout to; + unsigned long initval; + + to = *xprt->timeout; + initval = connect_timeout; + if (initval < RPCRDMA_INIT_REEST_TO << 1) + initval = RPCRDMA_INIT_REEST_TO << 1; + to.to_initval = initval; + to.to_maxval = initval; + r_xprt->rx_timeout = to; + xprt->timeout = &r_xprt->rx_timeout; + xprt->connect_timeout = connect_timeout; + } + + if (reconnect_timeout < xprt->max_reconnect_timeout) + xprt->max_reconnect_timeout = reconnect_timeout; + + spin_unlock(&xprt->transport_lock); +} + +/** + * xprt_rdma_connect - schedule an attempt to reconnect * @xprt: transport state - * @task: RPC scheduler context + * @task: RPC scheduler context (unused) * */ static void xprt_rdma_connect(struct rpc_xprt *xprt, struct rpc_task *task) { struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt); + unsigned long delay; trace_xprtrdma_op_connect(r_xprt); + + delay = 0; if (r_xprt->rx_ep.rep_connected != 0) { - /* Reconnect */ - schedule_delayed_work(&r_xprt->rx_connect_worker, - xprt->reestablish_timeout); - xprt->reestablish_timeout <<= 1; - if (xprt->reestablish_timeout > RPCRDMA_MAX_REEST_TO) -
xprt->reestablish_timeout = RPCRDMA_MAX_REEST_TO; - else if (xprt->reestablish_timeout < RPCRDMA_INIT_REEST_TO) - xprt->reestablish_timeout = RPCRDMA_INIT_REEST_TO; - } else { - schedule_delayed_work(&r_xprt->rx_connect_worker, 0); - if (!RPC_IS_ASYNC(task)) - flush_delayed_work(&r_xprt->rx_connect_worker); + delay = xprt_reconnect_delay(xprt); + xprt_reconnect_backoff(xprt, RPCRDMA_INIT_REEST_TO); } + queue_delayed_work(xprtiod_workqueue, &r_xprt->rx_connect_worker, + delay); } /** @@ -549,8 +584,11 @@ out_sleep: static void xprt_rdma_free_slot(struct rpc_xprt *xprt, struct rpc_rqst *rqst) { + struct rpcrdma_xprt *r_xprt = + container_of(xprt, struct rpcrdma_xprt, rx_xprt); + memset(rqst, 0, sizeof(*rqst)); - rpcrdma_buffer_put(rpcr_to_rdmar(rqst)); + rpcrdma_buffer_put(&r_xprt->rx_buf, rpcr_to_rdmar(rqst)); rpc_wake_up_next(&xprt->backlog); } @@ -617,9 +655,16 @@ xprt_rdma_free(struct rpc_task *task) struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(rqst->rq_xprt); struct rpcrdma_req *req = rpcr_to_rdmar(rqst); - if (test_bit(RPCRDMA_REQ_F_PENDING, &req->rl_flags)) - rpcrdma_release_rqst(r_xprt, req); trace_xprtrdma_op_free(task, req); + + if (!list_empty(&req->rl_registered)) + frwr_unmap_sync(r_xprt, req); + + /* XXX: If the RPC is completing because of a signal and + * not because a reply was received, we ought to ensure + * that the Send completion has fired, so that memory + * involved with the Send is not still visible to the NIC. + */ } /** @@ -666,7 +711,6 @@ xprt_rdma_send_request(struct rpc_rqst *rqst) goto drop_connection; rqst->rq_xtime = ktime_get(); - __set_bit(RPCRDMA_REQ_F_PENDING, &req->rl_flags); if (rpcrdma_ep_post(&r_xprt->rx_ia, &r_xprt->rx_ep, req)) goto drop_connection; @@ -759,6 +803,7 @@ static const struct rpc_xprt_ops xprt_rdma_procs = { .send_request = xprt_rdma_send_request, .close = xprt_rdma_close, .destroy = xprt_rdma_destroy, + .set_connect_timeout = xprt_rdma_tcp_set_connect_timeout, .print_stats = xprt_rdma_print_stats, .enable_swap = xprt_rdma_enable_swap, .disable_swap = xprt_rdma_disable_swap, @@ -766,6 +811,7 @@ static const struct rpc_xprt_ops xprt_rdma_procs = { #if defined(CONFIG_SUNRPC_BACKCHANNEL) .bc_setup = xprt_rdma_bc_setup, .bc_maxpayload = xprt_rdma_bc_maxpayload, + .bc_num_slots = xprt_rdma_bc_max_slots, .bc_free_rqst = xprt_rdma_bc_free_rqst, .bc_destroy = xprt_rdma_bc_destroy, #endif diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c index 84bb37924540..805b1f35e1ca 100644 --- a/net/sunrpc/xprtrdma/verbs.c +++ b/net/sunrpc/xprtrdma/verbs.c @@ -89,14 +89,12 @@ static void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp); */ static void rpcrdma_xprt_drain(struct rpcrdma_xprt *r_xprt) { - struct rpcrdma_buffer *buf = &r_xprt->rx_buf; struct rpcrdma_ia *ia = &r_xprt->rx_ia; /* Flush Receives, then wait for deferred Reply work * to complete. */ ib_drain_rq(ia->ri_id->qp); - drain_workqueue(buf->rb_completion_wq); /* Deferred Reply processing might have scheduled * local invalidations. @@ -901,7 +899,7 @@ out_emptyq: * completions recently. This is a sign the Send Queue is * backing up. Cause the caller to pause and try again. 
*/ - set_bit(RPCRDMA_BUF_F_EMPTY_SCQ, &buf->rb_flags); + xprt_wait_for_buffer_space(&r_xprt->rx_xprt); r_xprt->rx_stats.empty_sendctx_q++; return NULL; } @@ -936,10 +934,7 @@ rpcrdma_sendctx_put_locked(struct rpcrdma_sendctx *sc) /* Paired with READ_ONCE */ smp_store_release(&buf->rb_sc_tail, next_tail); - if (test_and_clear_bit(RPCRDMA_BUF_F_EMPTY_SCQ, &buf->rb_flags)) { - smp_mb__after_atomic(); - xprt_write_space(&sc->sc_xprt->rx_xprt); - } + xprt_write_space(&sc->sc_xprt->rx_xprt); } static void @@ -977,8 +972,6 @@ rpcrdma_mrs_create(struct rpcrdma_xprt *r_xprt) r_xprt->rx_stats.mrs_allocated += count; spin_unlock(&buf->rb_mrlock); trace_xprtrdma_createmrs(r_xprt, count); - - xprt_write_space(&r_xprt->rx_xprt); } static void @@ -990,6 +983,7 @@ rpcrdma_mr_refresh_worker(struct work_struct *work) rx_buf); rpcrdma_mrs_create(r_xprt); + xprt_write_space(&r_xprt->rx_xprt); } /** @@ -1025,7 +1019,6 @@ struct rpcrdma_req *rpcrdma_req_create(struct rpcrdma_xprt *r_xprt, size_t size, if (!req->rl_recvbuf) goto out4; - req->rl_buffer = buffer; INIT_LIST_HEAD(&req->rl_registered); spin_lock(&buffer->rb_lock); list_add(&req->rl_all, &buffer->rb_allreqs); @@ -1042,9 +1035,9 @@ out1: return NULL; } -static bool rpcrdma_rep_create(struct rpcrdma_xprt *r_xprt, bool temp) +static struct rpcrdma_rep *rpcrdma_rep_create(struct rpcrdma_xprt *r_xprt, + bool temp) { - struct rpcrdma_buffer *buf = &r_xprt->rx_buf; struct rpcrdma_rep *rep; rep = kzalloc(sizeof(*rep), GFP_KERNEL); @@ -1055,27 +1048,22 @@ static bool rpcrdma_rep_create(struct rpcrdma_xprt *r_xprt, bool temp) DMA_FROM_DEVICE, GFP_KERNEL); if (!rep->rr_rdmabuf) goto out_free; + xdr_buf_init(&rep->rr_hdrbuf, rdmab_data(rep->rr_rdmabuf), rdmab_length(rep->rr_rdmabuf)); - rep->rr_cqe.done = rpcrdma_wc_receive; rep->rr_rxprt = r_xprt; - INIT_WORK(&rep->rr_work, rpcrdma_deferred_completion); rep->rr_recv_wr.next = NULL; rep->rr_recv_wr.wr_cqe = &rep->rr_cqe; rep->rr_recv_wr.sg_list = &rep->rr_rdmabuf->rg_iov; rep->rr_recv_wr.num_sge = 1; rep->rr_temp = temp; - - spin_lock(&buf->rb_lock); - list_add(&rep->rr_list, &buf->rb_recv_bufs); - spin_unlock(&buf->rb_lock); - return true; + return rep; out_free: kfree(rep); out: - return false; + return NULL; } /** @@ -1089,7 +1077,6 @@ int rpcrdma_buffer_create(struct rpcrdma_xprt *r_xprt) struct rpcrdma_buffer *buf = &r_xprt->rx_buf; int i, rc; - buf->rb_flags = 0; buf->rb_max_requests = r_xprt->rx_ep.rep_max_requests; buf->rb_bc_srv_max_requests = 0; spin_lock_init(&buf->rb_mrlock); @@ -1122,15 +1109,6 @@ int rpcrdma_buffer_create(struct rpcrdma_xprt *r_xprt) if (rc) goto out; - buf->rb_completion_wq = alloc_workqueue("rpcrdma-%s", - WQ_MEM_RECLAIM | WQ_HIGHPRI, - 0, - r_xprt->rx_xprt.address_strings[RPC_DISPLAY_ADDR]); - if (!buf->rb_completion_wq) { - rc = -ENOMEM; - goto out; - } - return 0; out: rpcrdma_buffer_destroy(buf); @@ -1204,11 +1182,6 @@ rpcrdma_buffer_destroy(struct rpcrdma_buffer *buf) { cancel_delayed_work_sync(&buf->rb_refresh_worker); - if (buf->rb_completion_wq) { - destroy_workqueue(buf->rb_completion_wq); - buf->rb_completion_wq = NULL; - } - rpcrdma_sendctxs_destroy(buf); while (!list_empty(&buf->rb_recv_bufs)) { @@ -1325,13 +1298,12 @@ rpcrdma_buffer_get(struct rpcrdma_buffer *buffers) /** * rpcrdma_buffer_put - Put request/reply buffers back into pool + * @buffers: buffer pool * @req: object to return * */ -void -rpcrdma_buffer_put(struct rpcrdma_req *req) +void rpcrdma_buffer_put(struct rpcrdma_buffer *buffers, struct rpcrdma_req *req) { - struct rpcrdma_buffer *buffers = 
req->rl_buffer; struct rpcrdma_rep *rep = req->rl_reply; req->rl_reply = NULL; @@ -1484,8 +1456,7 @@ rpcrdma_ep_post(struct rpcrdma_ia *ia, struct ib_send_wr *send_wr = &req->rl_sendctx->sc_wr; int rc; - if (!ep->rep_send_count || - test_bit(RPCRDMA_REQ_F_TX_RESOURCES, &req->rl_flags)) { + if (!ep->rep_send_count || kref_read(&req->rl_kref) > 1) { send_wr->send_flags |= IB_SEND_SIGNALED; ep->rep_send_count = ep->rep_send_batch; } else { @@ -1505,11 +1476,13 @@ rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp) { struct rpcrdma_buffer *buf = &r_xprt->rx_buf; struct rpcrdma_ep *ep = &r_xprt->rx_ep; - struct ib_recv_wr *wr, *bad_wr; + struct ib_recv_wr *i, *wr, *bad_wr; + struct rpcrdma_rep *rep; int needed, count, rc; rc = 0; count = 0; + needed = buf->rb_credits + (buf->rb_bc_srv_max_requests << 1); if (ep->rep_receive_count > needed) goto out; @@ -1517,51 +1490,65 @@ rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, bool temp) if (!temp) needed += RPCRDMA_MAX_RECV_BATCH; - count = 0; + /* fast path: all needed reps can be found on the free list */ wr = NULL; + spin_lock(&buf->rb_lock); while (needed) { - struct rpcrdma_regbuf *rb; - struct rpcrdma_rep *rep; - - spin_lock(&buf->rb_lock); rep = list_first_entry_or_null(&buf->rb_recv_bufs, struct rpcrdma_rep, rr_list); - if (likely(rep)) - list_del(&rep->rr_list); - spin_unlock(&buf->rb_lock); - if (!rep) { - if (!rpcrdma_rep_create(r_xprt, temp)) - break; - continue; - } + if (!rep) + break; - rb = rep->rr_rdmabuf; - if (!rpcrdma_regbuf_dma_map(r_xprt, rb)) { - rpcrdma_recv_buffer_put(rep); + list_del(&rep->rr_list); + rep->rr_recv_wr.next = wr; + wr = &rep->rr_recv_wr; + --needed; + } + spin_unlock(&buf->rb_lock); + + while (needed) { + rep = rpcrdma_rep_create(r_xprt, temp); + if (!rep) break; - } - trace_xprtrdma_post_recv(rep->rr_recv_wr.wr_cqe); rep->rr_recv_wr.next = wr; wr = &rep->rr_recv_wr; - ++count; --needed; } - if (!count) + if (!wr) goto out; + for (i = wr; i; i = i->next) { + rep = container_of(i, struct rpcrdma_rep, rr_recv_wr); + + if (!rpcrdma_regbuf_dma_map(r_xprt, rep->rr_rdmabuf)) + goto release_wrs; + + trace_xprtrdma_post_recv(rep->rr_recv_wr.wr_cqe); + ++count; + } + rc = ib_post_recv(r_xprt->rx_ia.ri_id->qp, wr, (const struct ib_recv_wr **)&bad_wr); +out: + trace_xprtrdma_post_recvs(r_xprt, count, rc); if (rc) { - for (wr = bad_wr; wr; wr = wr->next) { + for (wr = bad_wr; wr;) { struct rpcrdma_rep *rep; rep = container_of(wr, struct rpcrdma_rep, rr_recv_wr); + wr = wr->next; rpcrdma_recv_buffer_put(rep); --count; } } ep->rep_receive_count += count; -out: - trace_xprtrdma_post_recvs(r_xprt, count, rc); + return; + +release_wrs: + for (i = wr; i;) { + rep = container_of(i, struct rpcrdma_rep, rr_recv_wr); + i = i->next; + rpcrdma_recv_buffer_put(rep); + } } diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h index d1e0749bcbc4..92ce09fcea74 100644 --- a/net/sunrpc/xprtrdma/xprt_rdma.h +++ b/net/sunrpc/xprtrdma/xprt_rdma.h @@ -44,7 +44,8 @@ #include <linux/wait.h> /* wait_queue_head_t, etc */ #include <linux/spinlock.h> /* spinlock_t, etc */ -#include <linux/atomic.h> /* atomic_t, etc */ +#include <linux/atomic.h> /* atomic_t, etc */ +#include <linux/kref.h> /* struct kref */ #include <linux/workqueue.h> /* struct work_struct */ #include <rdma/rdma_cm.h> /* RDMA connection api */ @@ -202,10 +203,9 @@ struct rpcrdma_rep { bool rr_temp; struct rpcrdma_regbuf *rr_rdmabuf; struct rpcrdma_xprt *rr_rxprt; - struct work_struct rr_work; + struct rpc_rqst *rr_rqst; struct xdr_buf rr_hdrbuf; 
struct xdr_stream rr_stream; - struct rpc_rqst *rr_rqst; struct list_head rr_list; struct ib_recv_wr rr_recv_wr; }; @@ -240,18 +240,12 @@ struct rpcrdma_sendctx { * An external memory region is any buffer or page that is registered * on the fly (ie, not pre-registered). */ -enum rpcrdma_frwr_state { - FRWR_IS_INVALID, /* ready to be used */ - FRWR_IS_VALID, /* in use */ - FRWR_FLUSHED_FR, /* flushed FASTREG WR */ - FRWR_FLUSHED_LI, /* flushed LOCALINV WR */ -}; - +struct rpcrdma_req; struct rpcrdma_frwr { struct ib_mr *fr_mr; struct ib_cqe fr_cqe; - enum rpcrdma_frwr_state fr_state; struct completion fr_linv_done; + struct rpcrdma_req *fr_req; union { struct ib_reg_wr fr_regwr; struct ib_send_wr fr_invwr; @@ -326,7 +320,6 @@ struct rpcrdma_buffer; struct rpcrdma_req { struct list_head rl_list; struct rpc_rqst rl_slot; - struct rpcrdma_buffer *rl_buffer; struct rpcrdma_rep *rl_reply; struct xdr_stream rl_stream; struct xdr_buf rl_hdrbuf; @@ -336,18 +329,12 @@ struct rpcrdma_req { struct rpcrdma_regbuf *rl_recvbuf; /* rq_rcv_buf */ struct list_head rl_all; - unsigned long rl_flags; + struct kref rl_kref; struct list_head rl_registered; /* registered segments */ struct rpcrdma_mr_seg rl_segments[RPCRDMA_MAX_SEGS]; }; -/* rl_flags */ -enum { - RPCRDMA_REQ_F_PENDING = 0, - RPCRDMA_REQ_F_TX_RESOURCES, -}; - static inline struct rpcrdma_req * rpcr_to_rdmar(const struct rpc_rqst *rqst) { @@ -391,22 +378,15 @@ struct rpcrdma_buffer { struct list_head rb_recv_bufs; struct list_head rb_allreqs; - unsigned long rb_flags; u32 rb_max_requests; u32 rb_credits; /* most recent credit grant */ u32 rb_bc_srv_max_requests; u32 rb_bc_max_requests; - struct workqueue_struct *rb_completion_wq; struct delayed_work rb_refresh_worker; }; -/* rb_flags */ -enum { - RPCRDMA_BUF_F_EMPTY_SCQ = 0, -}; - /* * Statistics for RPCRDMA */ @@ -452,6 +432,7 @@ struct rpcrdma_xprt { struct rpcrdma_ep rx_ep; struct rpcrdma_buffer rx_buf; struct delayed_work rx_connect_worker; + struct rpc_timeout rx_timeout; struct rpcrdma_stats rx_stats; }; @@ -518,7 +499,8 @@ rpcrdma_mr_recycle(struct rpcrdma_mr *mr) } struct rpcrdma_req *rpcrdma_buffer_get(struct rpcrdma_buffer *); -void rpcrdma_buffer_put(struct rpcrdma_req *); +void rpcrdma_buffer_put(struct rpcrdma_buffer *buffers, + struct rpcrdma_req *req); void rpcrdma_recv_buffer_put(struct rpcrdma_rep *); bool rpcrdma_regbuf_realloc(struct rpcrdma_regbuf *rb, size_t size, @@ -564,6 +546,7 @@ rpcrdma_data_dir(bool writing) /* Memory registration calls xprtrdma/frwr_ops.c */ bool frwr_is_supported(struct ib_device *device); +void frwr_reset(struct rpcrdma_req *req); int frwr_open(struct rpcrdma_ia *ia, struct rpcrdma_ep *ep); int frwr_init_mr(struct rpcrdma_ia *ia, struct rpcrdma_mr *mr); void frwr_release_mr(struct rpcrdma_mr *mr); @@ -574,8 +557,8 @@ struct rpcrdma_mr_seg *frwr_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr **mr); int frwr_send(struct rpcrdma_ia *ia, struct rpcrdma_req *req); void frwr_reminv(struct rpcrdma_rep *rep, struct list_head *mrs); -void frwr_unmap_sync(struct rpcrdma_xprt *r_xprt, - struct list_head *mrs); +void frwr_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req); +void frwr_unmap_async(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req); /* * RPC/RDMA protocol calls - xprtrdma/rpc_rdma.c @@ -598,9 +581,6 @@ int rpcrdma_marshal_req(struct rpcrdma_xprt *r_xprt, struct rpc_rqst *rqst); void rpcrdma_set_max_header_sizes(struct rpcrdma_xprt *); void rpcrdma_complete_rqst(struct rpcrdma_rep *rep); void rpcrdma_reply_handler(struct 
rpcrdma_rep *rep); -void rpcrdma_release_rqst(struct rpcrdma_xprt *r_xprt, - struct rpcrdma_req *req); -void rpcrdma_deferred_completion(struct work_struct *work); static inline void rpcrdma_set_xdrlen(struct xdr_buf *xdr, size_t len) { @@ -625,6 +605,7 @@ void xprt_rdma_cleanup(void); #if defined(CONFIG_SUNRPC_BACKCHANNEL) int xprt_rdma_bc_setup(struct rpc_xprt *, unsigned int); size_t xprt_rdma_bc_maxpayload(struct rpc_xprt *); +unsigned int xprt_rdma_bc_max_slots(struct rpc_xprt *); int rpcrdma_bc_post_recv(struct rpcrdma_xprt *, unsigned int); void rpcrdma_bc_receive_call(struct rpcrdma_xprt *, struct rpcrdma_rep *); int xprt_rdma_bc_send_reply(struct rpc_rqst *rqst); diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c index 36652352a38c..e2176c167a57 100644 --- a/net/sunrpc/xprtsock.c +++ b/net/sunrpc/xprtsock.c @@ -880,7 +880,7 @@ static int xs_nospace(struct rpc_rqst *req) req->rq_slen); /* Protect against races with write_space */ - spin_lock_bh(&xprt->transport_lock); + spin_lock(&xprt->transport_lock); /* Don't race with disconnect */ if (xprt_connected(xprt)) { @@ -890,7 +890,7 @@ static int xs_nospace(struct rpc_rqst *req) } else ret = -ENOTCONN; - spin_unlock_bh(&xprt->transport_lock); + spin_unlock(&xprt->transport_lock); /* Race breaker in case memory is freed before above code is called */ if (ret == -EAGAIN) { @@ -909,6 +909,7 @@ static int xs_nospace(struct rpc_rqst *req) static void xs_stream_prepare_request(struct rpc_rqst *req) { + xdr_free_bvec(&req->rq_rcv_buf); req->rq_task->tk_status = xdr_alloc_bvec(&req->rq_rcv_buf, GFP_KERNEL); } @@ -1211,6 +1212,15 @@ static void xs_sock_reset_state_flags(struct rpc_xprt *xprt) struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt); clear_bit(XPRT_SOCK_DATA_READY, &transport->sock_state); + clear_bit(XPRT_SOCK_WAKE_ERROR, &transport->sock_state); + clear_bit(XPRT_SOCK_WAKE_WRITE, &transport->sock_state); + clear_bit(XPRT_SOCK_WAKE_DISCONNECT, &transport->sock_state); +} + +static void xs_run_error_worker(struct sock_xprt *transport, unsigned int nr) +{ + set_bit(nr, &transport->sock_state); + queue_work(xprtiod_workqueue, &transport->error_worker); } static void xs_sock_reset_connection_flags(struct rpc_xprt *xprt) @@ -1231,6 +1241,7 @@ static void xs_sock_reset_connection_flags(struct rpc_xprt *xprt) */ static void xs_error_report(struct sock *sk) { + struct sock_xprt *transport; struct rpc_xprt *xprt; int err; @@ -1238,13 +1249,14 @@ static void xs_error_report(struct sock *sk) if (!(xprt = xprt_from_sock(sk))) goto out; + transport = container_of(xprt, struct sock_xprt, xprt); err = -sk->sk_err; if (err == 0) goto out; dprintk("RPC: xs_error_report client %p, error=%d...\n", xprt, -err); trace_rpc_socket_error(xprt, sk->sk_socket, err); - xprt_wake_pending_tasks(xprt, err); + xs_run_error_worker(transport, XPRT_SOCK_WAKE_ERROR); out: read_unlock_bh(&sk->sk_callback_lock); } @@ -1333,6 +1345,7 @@ static void xs_destroy(struct rpc_xprt *xprt) cancel_delayed_work_sync(&transport->connect_worker); xs_close(xprt); cancel_work_sync(&transport->recv_worker); + cancel_work_sync(&transport->error_worker); xs_xprt_free(xprt); module_put(THIS_MODULE); } @@ -1386,9 +1399,9 @@ static void xs_udp_data_read_skb(struct rpc_xprt *xprt, } - spin_lock_bh(&xprt->transport_lock); + spin_lock(&xprt->transport_lock); xprt_adjust_cwnd(xprt, task, copied); - spin_unlock_bh(&xprt->transport_lock); + spin_unlock(&xprt->transport_lock); spin_lock(&xprt->queue_lock); xprt_complete_rqst(task, copied); __UDPX_INC_STATS(sk, 
UDP_MIB_INDATAGRAMS); @@ -1498,7 +1511,6 @@ static void xs_tcp_state_change(struct sock *sk) trace_rpc_socket_state_change(xprt, sk->sk_socket); switch (sk->sk_state) { case TCP_ESTABLISHED: - spin_lock(&xprt->transport_lock); if (!xprt_test_and_set_connected(xprt)) { xprt->connect_cookie++; clear_bit(XPRT_SOCK_CONNECTING, &transport->sock_state); @@ -1507,9 +1519,8 @@ static void xs_tcp_state_change(struct sock *sk) xprt->stat.connect_count++; xprt->stat.connect_time += (long)jiffies - xprt->stat.connect_start; - xprt_wake_pending_tasks(xprt, -EAGAIN); + xs_run_error_worker(transport, XPRT_SOCK_WAKE_PENDING); } - spin_unlock(&xprt->transport_lock); break; case TCP_FIN_WAIT1: /* The client initiated a shutdown of the socket */ @@ -1525,7 +1536,7 @@ static void xs_tcp_state_change(struct sock *sk) /* The server initiated a shutdown of the socket */ xprt->connect_cookie++; clear_bit(XPRT_CONNECTED, &xprt->state); - xs_tcp_force_close(xprt); + xs_run_error_worker(transport, XPRT_SOCK_WAKE_DISCONNECT); /* fall through */ case TCP_CLOSING: /* @@ -1547,7 +1558,7 @@ static void xs_tcp_state_change(struct sock *sk) xprt_clear_connecting(xprt); clear_bit(XPRT_CLOSING, &xprt->state); /* Trigger the socket release */ - xs_tcp_force_close(xprt); + xs_run_error_worker(transport, XPRT_SOCK_WAKE_DISCONNECT); } out: read_unlock_bh(&sk->sk_callback_lock); @@ -1556,6 +1567,7 @@ static void xs_tcp_state_change(struct sock *sk) static void xs_write_space(struct sock *sk) { struct socket_wq *wq; + struct sock_xprt *transport; struct rpc_xprt *xprt; if (!sk->sk_socket) @@ -1564,13 +1576,14 @@ static void xs_write_space(struct sock *sk) if (unlikely(!(xprt = xprt_from_sock(sk)))) return; + transport = container_of(xprt, struct sock_xprt, xprt); rcu_read_lock(); wq = rcu_dereference(sk->sk_wq); if (!wq || test_and_clear_bit(SOCKWQ_ASYNC_NOSPACE, &wq->flags) == 0) goto out; - if (xprt_write_space(xprt)) - sk->sk_write_pending--; + xs_run_error_worker(transport, XPRT_SOCK_WAKE_WRITE); + sk->sk_write_pending--; out: rcu_read_unlock(); } @@ -1664,9 +1677,9 @@ static void xs_udp_set_buffer_size(struct rpc_xprt *xprt, size_t sndsize, size_t */ static void xs_udp_timer(struct rpc_xprt *xprt, struct rpc_task *task) { - spin_lock_bh(&xprt->transport_lock); + spin_lock(&xprt->transport_lock); xprt_adjust_cwnd(xprt, task, -ETIMEDOUT); - spin_unlock_bh(&xprt->transport_lock); + spin_unlock(&xprt->transport_lock); } static int xs_get_random_port(void) @@ -2201,13 +2214,13 @@ static void xs_tcp_set_socket_timeouts(struct rpc_xprt *xprt, unsigned int opt_on = 1; unsigned int timeo; - spin_lock_bh(&xprt->transport_lock); + spin_lock(&xprt->transport_lock); keepidle = DIV_ROUND_UP(xprt->timeout->to_initval, HZ); keepcnt = xprt->timeout->to_retries + 1; timeo = jiffies_to_msecs(xprt->timeout->to_initval) * (xprt->timeout->to_retries + 1); clear_bit(XPRT_SOCK_UPD_TIMEOUT, &transport->sock_state); - spin_unlock_bh(&xprt->transport_lock); + spin_unlock(&xprt->transport_lock); /* TCP Keepalive options */ kernel_setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, @@ -2232,7 +2245,7 @@ static void xs_tcp_set_connect_timeout(struct rpc_xprt *xprt, struct rpc_timeout to; unsigned long initval; - spin_lock_bh(&xprt->transport_lock); + spin_lock(&xprt->transport_lock); if (reconnect_timeout < xprt->max_reconnect_timeout) xprt->max_reconnect_timeout = reconnect_timeout; if (connect_timeout < xprt->connect_timeout) { @@ -2249,7 +2262,7 @@ static void xs_tcp_set_connect_timeout(struct rpc_xprt *xprt, xprt->connect_timeout = connect_timeout; } 
set_bit(XPRT_SOCK_UPD_TIMEOUT, &transport->sock_state); - spin_unlock_bh(&xprt->transport_lock); + spin_unlock(&xprt->transport_lock); } static int xs_tcp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock) @@ -2402,25 +2415,6 @@ out: xprt_wake_pending_tasks(xprt, status); } -static unsigned long xs_reconnect_delay(const struct rpc_xprt *xprt) -{ - unsigned long start, now = jiffies; - - start = xprt->stat.connect_start + xprt->reestablish_timeout; - if (time_after(start, now)) - return start - now; - return 0; -} - -static void xs_reconnect_backoff(struct rpc_xprt *xprt) -{ - xprt->reestablish_timeout <<= 1; - if (xprt->reestablish_timeout > xprt->max_reconnect_timeout) - xprt->reestablish_timeout = xprt->max_reconnect_timeout; - if (xprt->reestablish_timeout < XS_TCP_INIT_REEST_TO) - xprt->reestablish_timeout = XS_TCP_INIT_REEST_TO; -} - /** * xs_connect - connect a socket to a remote endpoint * @xprt: pointer to transport structure @@ -2450,8 +2444,8 @@ static void xs_connect(struct rpc_xprt *xprt, struct rpc_task *task) /* Start by resetting any existing state */ xs_reset_transport(transport); - delay = xs_reconnect_delay(xprt); - xs_reconnect_backoff(xprt); + delay = xprt_reconnect_delay(xprt); + xprt_reconnect_backoff(xprt, XS_TCP_INIT_REEST_TO); } else dprintk("RPC: xs_connect scheduled xprt %p\n", xprt); @@ -2461,6 +2455,56 @@ static void xs_connect(struct rpc_xprt *xprt, struct rpc_task *task) delay); } +static void xs_wake_disconnect(struct sock_xprt *transport) +{ + if (test_and_clear_bit(XPRT_SOCK_WAKE_DISCONNECT, &transport->sock_state)) + xs_tcp_force_close(&transport->xprt); +} + +static void xs_wake_write(struct sock_xprt *transport) +{ + if (test_and_clear_bit(XPRT_SOCK_WAKE_WRITE, &transport->sock_state)) + xprt_write_space(&transport->xprt); +} + +static void xs_wake_error(struct sock_xprt *transport) +{ + int sockerr; + int sockerr_len = sizeof(sockerr); + + if (!test_bit(XPRT_SOCK_WAKE_ERROR, &transport->sock_state)) + return; + mutex_lock(&transport->recv_mutex); + if (transport->sock == NULL) + goto out; + if (!test_and_clear_bit(XPRT_SOCK_WAKE_ERROR, &transport->sock_state)) + goto out; + if (kernel_getsockopt(transport->sock, SOL_SOCKET, SO_ERROR, + (char *)&sockerr, &sockerr_len) != 0) + goto out; + if (sockerr < 0) + xprt_wake_pending_tasks(&transport->xprt, sockerr); +out: + mutex_unlock(&transport->recv_mutex); +} + +static void xs_wake_pending(struct sock_xprt *transport) +{ + if (test_and_clear_bit(XPRT_SOCK_WAKE_PENDING, &transport->sock_state)) + xprt_wake_pending_tasks(&transport->xprt, -EAGAIN); +} + +static void xs_error_handle(struct work_struct *work) +{ + struct sock_xprt *transport = container_of(work, + struct sock_xprt, error_worker); + + xs_wake_disconnect(transport); + xs_wake_write(transport); + xs_wake_error(transport); + xs_wake_pending(transport); +} + /** * xs_local_print_stats - display AF_LOCAL socket-specifc stats * @xprt: rpc_xprt struct containing statistics @@ -2745,6 +2789,7 @@ static const struct rpc_xprt_ops xs_tcp_ops = { #ifdef CONFIG_SUNRPC_BACKCHANNEL .bc_setup = xprt_setup_bc, .bc_maxpayload = xs_tcp_bc_maxpayload, + .bc_num_slots = xprt_bc_max_slots, .bc_free_rqst = xprt_free_bc_rqst, .bc_destroy = xprt_destroy_bc, #endif @@ -2873,6 +2918,7 @@ static struct rpc_xprt *xs_setup_local(struct xprt_create *args) xprt->timeout = &xs_local_default_timeout; INIT_WORK(&transport->recv_worker, xs_stream_data_receive_workfn); + INIT_WORK(&transport->error_worker, xs_error_handle); INIT_DELAYED_WORK(&transport->connect_worker, 
xs_dummy_setup_socket); switch (sun->sun_family) { @@ -2943,6 +2989,7 @@ static struct rpc_xprt *xs_setup_udp(struct xprt_create *args) xprt->timeout = &xs_udp_default_timeout; INIT_WORK(&transport->recv_worker, xs_udp_data_receive_workfn); + INIT_WORK(&transport->error_worker, xs_error_handle); INIT_DELAYED_WORK(&transport->connect_worker, xs_udp_setup_socket); switch (addr->sa_family) { @@ -3024,6 +3071,7 @@ static struct rpc_xprt *xs_setup_tcp(struct xprt_create *args) (xprt->timeout->to_retries + 1); INIT_WORK(&transport->recv_worker, xs_stream_data_receive_workfn); + INIT_WORK(&transport->error_worker, xs_error_handle); INIT_DELAYED_WORK(&transport->connect_worker, xs_tcp_setup_socket); switch (addr->sa_family) { diff --git a/tools/perf/Documentation/perf-probe.txt b/tools/perf/Documentation/perf-probe.txt index b6866a05edd2..ed3ecfa422e1 100644 --- a/tools/perf/Documentation/perf-probe.txt +++ b/tools/perf/Documentation/perf-probe.txt @@ -194,12 +194,13 @@ PROBE ARGUMENT -------------- Each probe argument follows below syntax. - [NAME=]LOCALVAR|$retval|%REG|@SYMBOL[:TYPE] + [NAME=]LOCALVAR|$retval|%REG|@SYMBOL[:TYPE][@user] 'NAME' specifies the name of this argument (optional). You can use the name of local variable, local data structure member (e.g. var->field, var.field2), local array with fixed index (e.g. array[1], var->array[0], var->pointer[2]), or kprobe-tracer argument format (e.g. $retval, %ax, etc). Note that the name of this argument will be set as the last member name if you specify a local data structure member (e.g. field2 for 'var->field1.field2'.) '$vars' and '$params' special arguments are also available for NAME, '$vars' is expanded to the local variables (including function parameters) which can access at given probe point. '$params' is expanded to only the function parameters. 'TYPE' casts the type of this argument (optional). If omitted, perf probe automatically set the type based on debuginfo (*). Currently, basic types (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal integers (x/x8/x16/x32/x64), signedness casting (u/s), "string" and bitfield are supported. (see TYPES for detail) On x86 systems %REG is always the short form of the register: for example %AX. %RAX or %EAX is not valid. +"@user" is a special attribute which means the LOCALVAR will be treated as user-space memory. This is valid only for kprobe events.
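The "@user" attribute (and the ustring type on the kernel side) exists because an argument such as do_sys_open()'s filename is a __user pointer: its value is an address in the probed process's address space, which a plain kernel dereference may fault on or, on architectures where user and kernel mappings overlap, resolve to the wrong memory. A sketch of the distinction follows; demo_fetch_ustring() is a hypothetical helper, not the tracer's real fetch routine, and the tracer itself uses non-faulting variants of these accessors:

#include <linux/uaccess.h>

/* Hypothetical helper: copy a string argument whose value is a
 * user-space address, which is what ":ustring" or a "+u<offset>(...)"
 * fetcharg requests. */
static long demo_fetch_ustring(const char __user *uptr, char *buf, long len)
{
	long ret;

	ret = strncpy_from_user(buf, uptr, len);
	if (ret < 0)
		return ret;	/* unreadable: the event prints "(fault)" */
	return 0;
}

A plain ":string" fetch instead treats the value as a kernel address, which is correct only for arguments like vfs_symlink()'s oldname; the selftest at the end of this series uses exactly that pair of cases.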
TYPES ----- diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c index 0c3b55d0617d..cd1eb73cfe83 100644 --- a/tools/perf/util/probe-event.c +++ b/tools/perf/util/probe-event.c @@ -1562,6 +1562,17 @@ static int parse_perf_probe_arg(char *str, struct perf_probe_arg *arg) str = tmp + 1; } + tmp = strchr(str, '@'); + if (tmp && tmp != str && !strcmp(tmp + 1, "user")) { /* user attr */ + if (!user_access_is_supported()) { + semantic_error("ftrace does not support user access\n"); + return -EINVAL; + } + *tmp = '\0'; + arg->user_access = true; + pr_debug("user_access "); + } + tmp = strchr(str, ':'); if (tmp) { /* Type setting */ *tmp = '\0'; diff --git a/tools/perf/util/probe-event.h b/tools/perf/util/probe-event.h index 05c8d571a901..96a319cd2378 100644 --- a/tools/perf/util/probe-event.h +++ b/tools/perf/util/probe-event.h @@ -37,6 +37,7 @@ struct probe_trace_point { struct probe_trace_arg_ref { struct probe_trace_arg_ref *next; /* Next reference */ long offset; /* Offset value */ + bool user_access; /* User-memory access */ }; /* kprobe-tracer and uprobe-tracer tracing argument */ @@ -82,6 +83,7 @@ struct perf_probe_arg { char *var; /* Variable name */ char *type; /* Type name */ struct perf_probe_arg_field *field; /* Structure fields */ + bool user_access; /* User-memory access */ }; /* Perf probe probing event (point + arg) */ diff --git a/tools/perf/util/probe-file.c b/tools/perf/util/probe-file.c index c2998f90b23c..5b4d49382932 100644 --- a/tools/perf/util/probe-file.c +++ b/tools/perf/util/probe-file.c @@ -1005,6 +1005,7 @@ enum ftrace_readme { FTRACE_README_PROBE_TYPE_X = 0, FTRACE_README_KRETPROBE_OFFSET, FTRACE_README_UPROBE_REF_CTR, + FTRACE_README_USER_ACCESS, FTRACE_README_END, }; @@ -1017,6 +1018,7 @@ static struct { DEFINE_TYPE(FTRACE_README_PROBE_TYPE_X, "*type: * x8/16/32/64,*"), DEFINE_TYPE(FTRACE_README_KRETPROBE_OFFSET, "*place (kretprobe): *"), DEFINE_TYPE(FTRACE_README_UPROBE_REF_CTR, "*ref_ctr_offset*"), + DEFINE_TYPE(FTRACE_README_USER_ACCESS, "*[u]<offset>*"), }; static bool scan_ftrace_readme(enum ftrace_readme type) @@ -1077,3 +1079,8 @@ bool uprobe_ref_ctr_is_supported(void) { return scan_ftrace_readme(FTRACE_README_UPROBE_REF_CTR); } + +bool user_access_is_supported(void) +{ + return scan_ftrace_readme(FTRACE_README_USER_ACCESS); +} diff --git a/tools/perf/util/probe-file.h b/tools/perf/util/probe-file.h index 2a249182f2a6..986c1c94f64f 100644 --- a/tools/perf/util/probe-file.h +++ b/tools/perf/util/probe-file.h @@ -70,6 +70,7 @@ int probe_cache__show_all_caches(struct strfilter *filter); bool probe_type_is_available(enum probe_type type); bool kretprobe_offset_is_supported(void); bool uprobe_ref_ctr_is_supported(void); +bool user_access_is_supported(void); #else /* !
HAVE_LIBELF_SUPPORT */ static inline struct probe_cache *probe_cache__new(const char *tgt __maybe_unused, struct nsinfo *nsi __maybe_unused) { diff --git a/tools/perf/util/probe-finder.c b/tools/perf/util/probe-finder.c index 7d8c99734928..025fc4491993 100644 --- a/tools/perf/util/probe-finder.c +++ b/tools/perf/util/probe-finder.c @@ -280,7 +280,7 @@ static_var: static int convert_variable_type(Dwarf_Die *vr_die, struct probe_trace_arg *tvar, - const char *cast) + const char *cast, bool user_access) { struct probe_trace_arg_ref **ref_ptr = &tvar->ref; Dwarf_Die type; @@ -320,7 +320,8 @@ static int convert_variable_type(Dwarf_Die *vr_die, pr_debug("%s type is %s.\n", dwarf_diename(vr_die), dwarf_diename(&type)); - if (cast && strcmp(cast, "string") == 0) { /* String type */ + if (cast && (!strcmp(cast, "string") || !strcmp(cast, "ustring"))) { + /* String type */ ret = dwarf_tag(&type); if (ret != DW_TAG_pointer_type && ret != DW_TAG_array_type) { @@ -343,6 +344,7 @@ static int convert_variable_type(Dwarf_Die *vr_die, pr_warning("Out of memory error\n"); return -ENOMEM; } + (*ref_ptr)->user_access = user_access; } if (!die_compare_name(&type, "char") && !die_compare_name(&type, "unsigned char")) { @@ -397,7 +399,7 @@ formatted: static int convert_variable_fields(Dwarf_Die *vr_die, const char *varname, struct perf_probe_arg_field *field, struct probe_trace_arg_ref **ref_ptr, - Dwarf_Die *die_mem) + Dwarf_Die *die_mem, bool user_access) { struct probe_trace_arg_ref *ref = *ref_ptr; Dwarf_Die type; @@ -434,6 +436,7 @@ static int convert_variable_fields(Dwarf_Die *vr_die, const char *varname, *ref_ptr = ref; } ref->offset += dwarf_bytesize(&type) * field->index; + ref->user_access = user_access; goto next; } else if (tag == DW_TAG_pointer_type) { /* Check the pointer and dereference */ @@ -505,17 +508,18 @@ static int convert_variable_fields(Dwarf_Die *vr_die, const char *varname, } } ref->offset += (long)offs; + ref->user_access = user_access; /* If this member is unnamed, we need to reuse this field */ if (!dwarf_diename(die_mem)) return convert_variable_fields(die_mem, varname, field, - &ref, die_mem); + &ref, die_mem, user_access); next: /* Converting next field */ if (field->next) return convert_variable_fields(die_mem, field->name, - field->next, &ref, die_mem); + field->next, &ref, die_mem, user_access); else return 0; } @@ -541,11 +545,12 @@ static int convert_variable(Dwarf_Die *vr_die, struct probe_finder *pf) else if (ret == 0 && pf->pvar->field) { ret = convert_variable_fields(vr_die, pf->pvar->var, pf->pvar->field, &pf->tvar->ref, - &die_mem); + &die_mem, pf->pvar->user_access); vr_die = &die_mem; } if (ret == 0) - ret = convert_variable_type(vr_die, pf->tvar, pf->pvar->type); + ret = convert_variable_type(vr_die, pf->tvar, pf->pvar->type, + pf->pvar->user_access); /* *expr will be cached in libdw. Don't free it. 
*/ return ret; } diff --git a/tools/testing/selftests/ftrace/ftracetest b/tools/testing/selftests/ftrace/ftracetest index 6d5e9e87c4b7..063ecb290a5a 100755 --- a/tools/testing/selftests/ftrace/ftracetest +++ b/tools/testing/selftests/ftrace/ftracetest @@ -23,9 +23,15 @@ echo " If <dir> is -, all logs output in console only" exit $1 } +# default error +err_ret=1 + +# kselftest skip code is 4 +err_skip=4 + errexit() { # message echo "Error: $1" 1>&2 - exit 1 + exit $err_ret } # Ensuring user privilege @@ -116,11 +122,31 @@ parse_opts() { # opts } # Parameters -DEBUGFS_DIR=`grep debugfs /proc/mounts | cut -f2 -d' ' | head -1` -if [ -z "$DEBUGFS_DIR" ]; then - TRACING_DIR=`grep tracefs /proc/mounts | cut -f2 -d' ' | head -1` -else - TRACING_DIR=$DEBUGFS_DIR/tracing +TRACING_DIR=`grep tracefs /proc/mounts | cut -f2 -d' ' | head -1` +if [ -z "$TRACING_DIR" ]; then + DEBUGFS_DIR=`grep debugfs /proc/mounts | cut -f2 -d' ' | head -1` + if [ -z "$DEBUGFS_DIR" ]; then + # If tracefs exists, then so does /sys/kernel/tracing + if [ -d "/sys/kernel/tracing" ]; then + mount -t tracefs nodev /sys/kernel/tracing || + errexit "Failed to mount /sys/kernel/tracing" + TRACING_DIR="/sys/kernel/tracing" + # If debugfs exists, then so does /sys/kernel/debug + elif [ -d "/sys/kernel/debug" ]; then + mount -t debugfs nodev /sys/kernel/debug || + errexit "Failed to mount /sys/kernel/debug" + TRACING_DIR="/sys/kernel/debug/tracing" + else + err_ret=$err_skip + errexit "debugfs and tracefs are not configured in this kernel" + fi + else + TRACING_DIR="$DEBUGFS_DIR/tracing" + fi +fi +if [ ! -d "$TRACING_DIR" ]; then + err_ret=$err_skip + errexit "ftrace is not configured in this kernel" fi TOP_DIR=`absdir $0` diff --git a/tools/testing/selftests/ftrace/test.d/functions b/tools/testing/selftests/ftrace/test.d/functions index 779ec11f61bd..1d96c5f7e402 100644 --- a/tools/testing/selftests/ftrace/test.d/functions +++ b/tools/testing/selftests/ftrace/test.d/functions @@ -91,8 +91,8 @@ initialize_ftrace() { # Reset ftrace to initial-state reset_events_filter reset_ftrace_filter disable_events - echo > set_event_pid # event tracer is always on - echo > set_ftrace_pid + [ -f set_event_pid ] && echo > set_event_pid + [ -f set_ftrace_pid ] && echo > set_ftrace_pid [ -f set_ftrace_filter ] && echo | tee set_ftrace_* [ -f set_graph_function ] && echo | tee set_graph_* [ -f stack_trace_filter ] && echo > stack_trace_filter diff --git a/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_args_user.tc b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_args_user.tc new file mode 100644 index 000000000000..0f60087583d8 --- /dev/null +++ b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_args_user.tc @@ -0,0 +1,32 @@ +#!/bin/sh +# SPDX-License-Identifier: GPL-2.0 +# description: Kprobe event user-memory access + +[ -f kprobe_events ] || exit_unsupported # this is configurable + +grep -q '\$arg<N>' README || exit_unresolved # depends on arch +grep -A10 "fetcharg:" README | grep -q 'ustring' || exit_unsupported +grep -A10 "fetcharg:" README | grep -q '\[u\]<offset>' || exit_unsupported + +:;: "user-memory access syntax and ustring working on user memory";: +echo 'p:myevent do_sys_open path=+0($arg2):ustring path2=+u0($arg2):string' \ + > kprobe_events + +grep myevent kprobe_events | \ + grep -q 'path=+0($arg2):ustring path2=+u0($arg2):string' +echo 1 > events/kprobes/myevent/enable +echo > /dev/null +echo 0 > events/kprobes/myevent/enable + +grep myevent trace | grep -q 'path="/dev/null" path2="/dev/null"' + +:;: "user-memory access 
syntax and ustring not working with kernel memory";: +echo 'p:myevent vfs_symlink path=+0($arg3):ustring path2=+u0($arg3):string' \ +> kprobe_events +echo 1 > events/kprobes/myevent/enable +ln -s foo $TMPDIR/bar +echo 0 > events/kprobes/myevent/enable + +grep myevent trace | grep -q 'path=(fault) path2=(fault)' + +exit 0
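For reference, the same probe the selftest configures can also be programmed from C by writing to the tracefs files that the ftracetest changes above mount. This is a sketch only: error handling is trimmed, root privileges are assumed, and the path assumes tracefs is mounted at /sys/kernel/tracing.

#include <stdio.h>

static int write_file(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fputs(val, f) < 0) {
		fclose(f);
		return -1;
	}
	return fclose(f);
}

int main(void)
{
	const char *dir = "/sys/kernel/tracing";
	char path[256];

	/* Same event as the selftest: fetch do_sys_open()'s second
	 * argument as a user-space string. */
	snprintf(path, sizeof(path), "%s/kprobe_events", dir);
	if (write_file(path, "p:myevent do_sys_open path=+0($arg2):ustring\n"))
		return 1;

	snprintf(path, sizeof(path), "%s/events/kprobes/myevent/enable", dir);
	write_file(path, "1");
	/* ... run a workload that calls open(2), then read "trace" ... */
	write_file(path, "0");
	return 0;
}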