From ba49df4767d4fa5bbd2af3a51709fb81f94264ec Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Sun, 7 Oct 2012 09:26:13 -0700 Subject: rcu: Update RCU_FAST_NO_HZ help text The RCU_FAST_NO_HZ help text included a warning about overhead on large systems, but that issue has since been resolved. The main remaining issue with RCU_FAST_NO_HZ is increased real-time latency. This commit therefore updates the help text accordingly. Signed-off-by: Paul E. McKenney Signed-off-by: Paul E. McKenney --- init/Kconfig | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) (limited to 'init') diff --git a/init/Kconfig b/init/Kconfig index 6fdd6e339326..b63e7982c1c6 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -582,14 +582,13 @@ config RCU_FAST_NO_HZ depends on NO_HZ && SMP default n help - This option causes RCU to attempt to accelerate grace periods - in order to allow CPUs to enter dynticks-idle state more - quickly. On the other hand, this option increases the overhead - of the dynticks-idle checking, particularly on systems with - large numbers of CPUs. - - Say Y if energy efficiency is critically important, particularly - if you have relatively few CPUs. + This option causes RCU to attempt to accelerate grace periods in + order to allow CPUs to enter dynticks-idle state more quickly. + On the other hand, this option increases the overhead of the + dynticks-idle checking, thus degrading scheduling latency. + + Say Y if energy efficiency is critically important, and you don't + care about real-time response. Say N if you are unsure. -- cgit v1.2.3 From af71befa282ddf51c09509978abe1e83de8fe7eb Mon Sep 17 00:00:00 2001 From: Paul Gortmaker Date: Wed, 24 Oct 2012 11:07:09 -0700 Subject: rcu: Wordsmith help text for RCU_USER_QS kernel parameter This commit adds a "try" missing from the end of the first paragraph of the RCU_USER_QS help text. [ paulmck: Also fix up the last paragraph a bit. ] Signed-off-by: Paul Gortmaker Signed-off-by: Paul E. McKenney --- init/Kconfig | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'init') diff --git a/init/Kconfig b/init/Kconfig index b63e7982c1c6..ec62139207d3 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -494,11 +494,11 @@ config RCU_USER_QS puts RCU in extended quiescent state when the CPU runs in userspace. It means that when a CPU runs in userspace, it is excluded from the global RCU state machine and thus doesn't - to keep the timer tick on for RCU. + try to keep the timer tick on for RCU. Unless you want to hack and help the development of the full - tickless feature, you shouldn't enable this option. It adds - unnecessary overhead. + tickless feature, you shouldn't enable this option. It also + adds unnecessary overhead. If unsure say N -- cgit v1.2.3 From bd52276fa1d420c3a504b76ffaaa1642cc79d4c4 Mon Sep 17 00:00:00 2001 From: Jan Beulich Date: Fri, 25 May 2012 16:20:31 +0100 Subject: x86-64/efi: Use EFI to deal with platform wall clock (again) Other than ix86, x86-64 on EFI so far didn't set the {g,s}et_wallclock accessors to the EFI routines, thus incorrectly using raw RTC accesses instead. Simply removing the #ifdef around the respective code isn't enough, however: While so far early get-time calls were done in physical mode, this doesn't work properly for x86-64, as virtual addresses would still need to be set up for all runtime regions (which wasn't the case on the system I have access to), so instead the patch moves the call to efi_enter_virtual_mode() ahead (which in turn allows to drop all code related to calling efi-get-time in physical mode). Additionally the earlier calling of efi_set_executable() requires the CPA code to cope, i.e. during early boot it must be avoided to call cpa_flush_array(), as the first thing this function does is a BUG_ON(irqs_disabled()). Also make the two EFI functions in question here static - they're not being referenced elsewhere. History: This commit was originally merged as bacef661acdb ("x86-64/efi: Use EFI to deal with platform wall clock") but it resulted in some ASUS machines no longer booting due to a firmware bug, and so was reverted in f026cfa82f62. A pre-emptive fix for the buggy ASUS firmware was merged in 03a1c254975e ("x86, efi: 1:1 pagetable mapping for virtual EFI calls") so now this patch can be reapplied. Signed-off-by: Jan Beulich Tested-by: Matt Fleming Acked-by: Matthew Garrett Cc: Ingo Molnar Cc: Peter Zijlstra Cc: H. Peter Anvin Signed-off-by: Matt Fleming [added commit history] --- init/main.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'init') diff --git a/init/main.c b/init/main.c index 9cf77ab138a6..ae70b647b4d9 100644 --- a/init/main.c +++ b/init/main.c @@ -461,6 +461,10 @@ static void __init mm_init(void) percpu_init_late(); pgtable_cache_init(); vmalloc_init(); +#ifdef CONFIG_X86 + if (efi_enabled) + efi_enter_virtual_mode(); +#endif } asmlinkage void __init start_kernel(void) @@ -601,10 +605,6 @@ asmlinkage void __init start_kernel(void) calibrate_delay(); pidmap_init(); anon_vma_init(); -#ifdef CONFIG_X86 - if (efi_enabled) - efi_enter_virtual_mode(); -#endif thread_info_cache_init(); cred_init(); fork_init(totalram_pages); -- cgit v1.2.3 From eded09ccf58ab00474ccde547dd525c75dbc28fd Mon Sep 17 00:00:00 2001 From: David Howells Date: Fri, 2 Nov 2012 13:20:42 +0000 Subject: FRV: gcc-4.1.2 also inlines weak functions gcc-4.1.2 inlines weak functions, which causes FRV to fail when the dummy thread_info_cache_init() gets inlined into start_kernel(). Signed-off-by: David Howells --- init/main.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'init') diff --git a/init/main.c b/init/main.c index 9cf77ab138a6..e33e09df3cbc 100644 --- a/init/main.c +++ b/init/main.c @@ -442,9 +442,11 @@ void __init __weak smp_setup_processor_id(void) { } +# if THREAD_SIZE >= PAGE_SIZE void __init __weak thread_info_cache_init(void) { } +#endif /* * Set up kernel memory allocators -- cgit v1.2.3 From 45634cd8cb6541523227753944c7417ac3d20f94 Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Tue, 7 Feb 2012 16:18:52 -0800 Subject: userns: Support autofs4 interacing with multiple user namespaces Use kuid_t and kgid_t in struct autofs_info and struct autofs_wait_queue. When creating directories and symlinks default the uid and gid of the mount requester to the global root uid and gid. autofs4_wait will update these fields when a mount is requested. When generating autofsv5 packets report the uid and gid of the mount requestor in user namespace of the process that opened the pipe, reporting unmapped uids and gids as overflowuid and overflowgid. In autofs_dev_ioctl_requester return the uid and gid of the last mount requester converted into the calling processes user namespace. When the uid or gid don't map return overflowuid and overflowgid as appropriate, allowing failure to find a mount requester to be distinguished from failure to map a mount requester. The uid and gid mount options specifying the user and group of the root autofs inode are converted into kuid and kgid as they are parsed defaulting to the current uid and current gid of the process that mounts autofs. Mounting of autofs for the present remains confined to processes in the initial user namespace. Cc: Ian Kent Acked-by: Serge Hallyn Signed-off-by: Eric W. Biederman --- init/Kconfig | 1 - 1 file changed, 1 deletion(-) (limited to 'init') diff --git a/init/Kconfig b/init/Kconfig index 6fdd6e339326..b6369fbaa22b 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1004,7 +1004,6 @@ config UIDGID_CONVERTED # Filesystems depends on 9P_FS = n depends on AFS_FS = n - depends on AUTOFS4_FS = n depends on CEPH_FS = n depends on CIFS = n depends on CODA_FS = n -- cgit v1.2.3 From 499dcf2024092e5cce41d05599a5b51d1f92031a Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Tue, 7 Feb 2012 16:26:03 -0800 Subject: userns: Support fuse interacting with multiple user namespaces Use kuid_t and kgid_t in struct fuse_conn and struct fuse_mount_data. The connection between between a fuse filesystem and a fuse daemon is established when a fuse filesystem is mounted and provided with a file descriptor the fuse daemon created by opening /dev/fuse. For now restrict the communication of uids and gids between the fuse filesystem and the fuse daemon to the initial user namespace. Enforce this by verifying the file descriptor passed to the mount of fuse was opened in the initial user namespace. Ensuring the mount happens in the initial user namespace is not necessary as mounts from non-initial user namespaces are not yet allowed. In fuse_req_init_context convert the currrent fsuid and fsgid into the initial user namespace for the request that will be sent to the fuse daemon. In fuse_fill_attr convert the uid and gid passed from the fuse daemon from the initial user namespace into kuids and kgids. In iattr_to_fattr called from fuse_setattr convert kuids and kgids into the uids and gids in the initial user namespace before passing them to the fuse filesystem. In fuse_change_attributes_common called from fuse_dentry_revalidate, fuse_permission, fuse_geattr, and fuse_setattr, and fuse_iget convert the uid and gid from the fuse daemon into a kuid and a kgid to store on the fuse inode. By default fuse mounts are restricted to task whose uid, suid, and euid matches the fuse user_id and whose gid, sgid, and egid matches the fuse group id. Convert the user_id and group_id mount options into kuids and kgids at mount time, and use uid_eq and gid_eq to compare the in fuse_allow_task. Cc: Miklos Szeredi Acked-by: Serge Hallyn Signed-off-by: Eric W. Biederman --- init/Kconfig | 1 - 1 file changed, 1 deletion(-) (limited to 'init') diff --git a/init/Kconfig b/init/Kconfig index b6369fbaa22b..38c1a1d0bf38 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1007,7 +1007,6 @@ config UIDGID_CONVERTED depends on CEPH_FS = n depends on CIFS = n depends on CODA_FS = n - depends on FUSE_FS = n depends on GFS2_FS = n depends on NCP_FS = n depends on NFSD = n -- cgit v1.2.3 From 3fbfbf7a3b66ec424042d909f14ba2ddf4372ea8 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Sun, 19 Aug 2012 21:35:53 -0700 Subject: rcu: Add callback-free CPUs RCU callback execution can add significant OS jitter and also can degrade both scheduling latency and, in asymmetric multiprocessors, energy efficiency. This commit therefore adds the ability for selected CPUs ("rcu_nocbs=" boot parameter) to have their callbacks offloaded to kthreads. If the "rcu_nocb_poll" boot parameter is also specified, these kthreads will do polling, removing the need for the offloaded CPUs to do wakeups. At least one CPU must be doing normal callback processing: currently CPU 0 cannot be selected as a no-CBs CPU. In addition, attempts to offline the last normal-CBs CPU will fail. This feature was inspired by Jim Houston's and Joe Korty's JRCU, and this commit includes fixes to problems located by Fengguang Wu's kbuild test robot. [ paulmck: Added gfp.h include file as suggested by Fengguang Wu. ] Signed-off-by: Paul E. McKenney Signed-off-by: Paul E. McKenney --- init/Kconfig | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+) (limited to 'init') diff --git a/init/Kconfig b/init/Kconfig index ec62139207d3..5ac6ee094225 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -654,6 +654,28 @@ config RCU_BOOST_DELAY Accept the default if unsure. +config RCU_NOCB_CPU + bool "Offload RCU callback processing from boot-selected CPUs" + depends on TREE_RCU || TREE_PREEMPT_RCU + default n + help + Use this option to reduce OS jitter for aggressive HPC or + real-time workloads. It can also be used to offload RCU + callback invocation to energy-efficient CPUs in battery-powered + asymmetric multiprocessors. + + This option offloads callback invocation from the set of + CPUs specified at boot time by the rcu_nocbs parameter. + For each such CPU, a kthread ("rcuoN") will be created to + invoke callbacks, where the "N" is the CPU being offloaded. + Nothing prevents this kthread from running on the specified + CPUs, but (1) the kthreads may be preempted between each + callback, and (2) affinity or cgroups can be used to force + the kthreads to run on whatever set of CPUs is desired. + + Say Y here if you want reduced OS jitter on selected CPUs. + Say N here if you are unsure. + endmenu # "RCU Subsystem" config IKCONFIG -- cgit v1.2.3 From 1c4042c29bd2e85aac4110552ca8ade763762e84 Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Mon, 12 Jul 2010 17:10:36 -0700 Subject: pidns: Consolidate initialzation of special init task state Instead of setting child_reaper and SIGNAL_UNKILLABLE one way for the system init process, and another way for pid namespace init processes test pid->nr == 1 and use the same code for both. For the global init this results in SIGNAL_UNKILLABLE being set much earlier in the initialization process. This is a small cleanup and it paves the way for allowing unshare and enter of the pid namespace as that path like our global init also will not set CLONE_NEWPID. Signed-off-by: Eric W. Biederman --- init/main.c | 1 - 1 file changed, 1 deletion(-) (limited to 'init') diff --git a/init/main.c b/init/main.c index 9cf77ab138a6..317750a18f74 100644 --- a/init/main.c +++ b/init/main.c @@ -810,7 +810,6 @@ static int __ref kernel_init(void *unused) system_state = SYSTEM_RUNNING; numa_default_policy(); - current->signal->flags |= SIGNAL_UNKILLABLE; flush_delayed_fput(); if (ramdisk_execute_command) { -- cgit v1.2.3 From 98f842e675f96ffac96e6c50315790912b2812be Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Wed, 15 Jun 2011 10:21:48 -0700 Subject: proc: Usable inode numbers for the namespace file descriptors. Assign a unique proc inode to each namespace, and use that inode number to ensure we only allocate at most one proc inode for every namespace in proc. A single proc inode per namespace allows userspace to test to see if two processes are in the same namespace. This has been a long requested feature and only blocked because a naive implementation would put the id in a global space and would ultimately require having a namespace for the names of namespaces, making migration and certain virtualization tricks impossible. We still don't have per superblock inode numbers for proc, which appears necessary for application unaware checkpoint/restart and migrations (if the application is using namespace file descriptors) but that is now allowd by the design if it becomes important. I have preallocated the ipc and uts initial proc inode numbers so their structures can be statically initialized. Signed-off-by: Eric W. Biederman --- init/version.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'init') diff --git a/init/version.c b/init/version.c index 86fe0ccb997a..58170f18912d 100644 --- a/init/version.c +++ b/init/version.c @@ -12,6 +12,7 @@ #include #include #include +#include #ifndef CONFIG_KALLSYMS #define version(a) Version_ ## a @@ -34,6 +35,7 @@ struct uts_namespace init_uts_ns = { .domainname = UTS_DOMAINNAME, }, .user_ns = &init_user_ns, + .proc_inum = PROC_UTS_INIT_INO, }; EXPORT_SYMBOL_GPL(init_uts_ns); -- cgit v1.2.3 From 1ad7e89940d5ac411928189e1a4a01901dbf590f Mon Sep 17 00:00:00 2001 From: Stephen Warren Date: Thu, 8 Nov 2012 16:12:25 -0800 Subject: block: store partition_meta_info.uuid as a string This will allow other types of UUID to be stored here, aside from true UUIDs. This also simplifies code that uses this field, since it's usually constructed from a, used as a, or compared to other, strings. Note: A simplistic approach here would be to set uuid_str[36]=0 whenever a /PARTNROFF option was found to be present. However, this modifies the input string, and causes subsequent calls to devt_from_partuuid() not to see the /PARTNROFF option, which causes different results. In order to avoid misleading future maintainers, this parameter is marked const. Signed-off-by: Stephen Warren Cc: Tejun Heo Cc: Will Drewry Cc: Kay Sievers Signed-off-by: Andrew Morton Signed-off-by: Jens Axboe --- init/do_mounts.c | 28 +++++++++++++++++----------- 1 file changed, 17 insertions(+), 11 deletions(-) (limited to 'init') diff --git a/init/do_mounts.c b/init/do_mounts.c index f8a66424360d..b28ec5819325 100644 --- a/init/do_mounts.c +++ b/init/do_mounts.c @@ -69,23 +69,28 @@ __setup("ro", readonly); __setup("rw", readwrite); #ifdef CONFIG_BLOCK +struct uuidcmp { + const char *uuid; + int len; +}; + /** * match_dev_by_uuid - callback for finding a partition using its uuid * @dev: device passed in by the caller - * @data: opaque pointer to a 36 byte char array with a UUID + * @data: opaque pointer to the desired struct uuidcmp to match * * Returns 1 if the device matches, and 0 otherwise. */ static int match_dev_by_uuid(struct device *dev, void *data) { - u8 *uuid = data; + struct uuidcmp *cmp = data; struct hd_struct *part = dev_to_part(dev); if (!part->info) goto no_match; - if (memcmp(uuid, part->info->uuid, sizeof(part->info->uuid))) - goto no_match; + if (strncasecmp(cmp->uuid, part->info->uuid, cmp->len)) + goto no_match; return 1; no_match: @@ -95,7 +100,7 @@ no_match: /** * devt_from_partuuid - looks up the dev_t of a partition by its UUID - * @uuid: min 36 byte char array containing a hex ascii UUID + * @uuid: char array containing ascii UUID * * The function will return the first partition which contains a matching * UUID value in its partition_meta_info struct. This does not search @@ -106,11 +111,11 @@ no_match: * * Returns the matching dev_t on success or 0 on failure. */ -static dev_t devt_from_partuuid(char *uuid_str) +static dev_t devt_from_partuuid(const char *uuid_str) { dev_t res = 0; + struct uuidcmp cmp; struct device *dev = NULL; - u8 uuid[16]; struct gendisk *disk; struct hd_struct *part; int offset = 0; @@ -118,6 +123,9 @@ static dev_t devt_from_partuuid(char *uuid_str) if (strlen(uuid_str) < 36) goto done; + cmp.uuid = uuid_str; + cmp.len = 36; + /* Check for optional partition number offset attributes. */ if (uuid_str[36]) { char c = 0; @@ -134,10 +142,8 @@ static dev_t devt_from_partuuid(char *uuid_str) } } - /* Pack the requested UUID in the expected format. */ - part_pack_uuid(uuid_str, uuid); - - dev = class_find_device(&block_class, NULL, uuid, &match_dev_by_uuid); + dev = class_find_device(&block_class, NULL, &cmp, + &match_dev_by_uuid); if (!dev) goto done; -- cgit v1.2.3 From 283f8fc03927b0ef42a2faa60a0df5ec8c612edb Mon Sep 17 00:00:00 2001 From: Stephen Warren Date: Thu, 8 Nov 2012 16:12:27 -0800 Subject: init: reduce PARTUUID min length to 1 from 36 Reduce the minimum length for a root=PARTUUID= parameter to be considered valid from 36 to 1. EFI/GPT partition UUIDs are always exactly 36 characters long, hence the previous limit. However, the next patch will support DOS/MBR UUIDs too, which have a different, shorter, format. Instead of validating any particular length, just ensure that at least some non-empty value was given by the user. Also, consider a missing UUID value to be a parsing error, in the same vein as if /PARTNROFF exists and can't be parsed. As such, make both error cases print a message and disable rootwait. Convert to pr_err while we're at it. Signed-off-by: Stephen Warren Cc: Tejun Heo Cc: Will Drewry Cc: Kay Sievers Signed-off-by: Andrew Morton Signed-off-by: Jens Axboe --- init/do_mounts.c | 35 ++++++++++++++++++++++------------- 1 file changed, 22 insertions(+), 13 deletions(-) (limited to 'init') diff --git a/init/do_mounts.c b/init/do_mounts.c index b28ec5819325..c950d7c93f98 100644 --- a/init/do_mounts.c +++ b/init/do_mounts.c @@ -119,27 +119,29 @@ static dev_t devt_from_partuuid(const char *uuid_str) struct gendisk *disk; struct hd_struct *part; int offset = 0; - - if (strlen(uuid_str) < 36) - goto done; + bool clear_root_wait = false; + char *slash; cmp.uuid = uuid_str; - cmp.len = 36; + slash = strchr(uuid_str, '/'); /* Check for optional partition number offset attributes. */ - if (uuid_str[36]) { + if (slash) { char c = 0; /* Explicitly fail on poor PARTUUID syntax. */ - if (sscanf(&uuid_str[36], - "/PARTNROFF=%d%c", &offset, &c) != 1) { - printk(KERN_ERR "VFS: PARTUUID= is invalid.\n" - "Expected PARTUUID=[/PARTNROFF=%%d]\n"); - if (root_wait) - printk(KERN_ERR - "Disabling rootwait; root= is invalid.\n"); - root_wait = 0; + if (sscanf(slash + 1, + "PARTNROFF=%d%c", &offset, &c) != 1) { + clear_root_wait = true; goto done; } + cmp.len = slash - uuid_str; + } else { + cmp.len = strlen(uuid_str); + } + + if (!cmp.len) { + clear_root_wait = true; + goto done; } dev = class_find_device(&block_class, NULL, &cmp, @@ -164,6 +166,13 @@ static dev_t devt_from_partuuid(const char *uuid_str) no_offset: put_device(dev); done: + if (clear_root_wait) { + pr_err("VFS: PARTUUID= is invalid.\n" + "Expected PARTUUID=[/PARTNROFF=%%d]\n"); + if (root_wait) + pr_err("Disabling rootwait; root= is invalid.\n"); + root_wait = 0; + } return res; } #endif -- cgit v1.2.3 From d33b98fc82b0908e91fb05ae081acaed7323f9d2 Mon Sep 17 00:00:00 2001 From: Stephen Warren Date: Thu, 8 Nov 2012 16:12:28 -0800 Subject: block: partition: msdos: provide UUIDs for partitions The MSDOS/MBR partition table includes a 32-bit unique ID, often referred to as the NT disk signature. When combined with a partition number within the table, this can form a unique ID similar in concept to EFI/GPT's partition UUID. Constructing and recording this value in struct partition_meta_info allows MSDOS partitions to be referred to on the kernel command-line using the following syntax: root=PARTUUID=0002dd75-01 Signed-off-by: Stephen Warren Cc: Tejun Heo Cc: Will Drewry Cc: Kay Sievers Signed-off-by: Andrew Morton Signed-off-by: Jens Axboe --- init/do_mounts.c | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'init') diff --git a/init/do_mounts.c b/init/do_mounts.c index c950d7c93f98..1d1b6348f903 100644 --- a/init/do_mounts.c +++ b/init/do_mounts.c @@ -189,6 +189,10 @@ done: * used when disk name of partitioned disk ends on a digit. * 6) PARTUUID=00112233-4455-6677-8899-AABBCCDDEEFF representing the * unique id of a partition if the partition table provides it. + * The UUID may be either an EFI/GPT UUID, or refer to an MSDOS + * partition using the format SSSSSSSS-PP, where SSSSSSSS is a zero- + * filled hex representation of the 32-bit "NT disk signature", and PP + * is a zero-filled hex representation of the 1-based partition number. * 7) PARTUUID=/PARTNROFF= to select a partition in relation to * a partition with a known unique id. * -- cgit v1.2.3 From 91d1aa43d30505b0b825db8898ffc80a8eca96c7 Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Tue, 27 Nov 2012 19:33:25 +0100 Subject: context_tracking: New context tracking susbsystem Create a new subsystem that probes on kernel boundaries to keep track of the transitions between level contexts with two basic initial contexts: user or kernel. This is an abstraction of some RCU code that use such tracking to implement its userspace extended quiescent state. We need to pull this up from RCU into this new level of indirection because this tracking is also going to be used to implement an "on demand" generic virtual cputime accounting. A necessary step to shutdown the tick while still accounting the cputime. Signed-off-by: Frederic Weisbecker Cc: Andrew Morton Cc: H. Peter Anvin Cc: Ingo Molnar Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Steven Rostedt Cc: Thomas Gleixner Cc: Li Zhong Cc: Gilad Ben-Yossef Reviewed-by: Steven Rostedt [ paulmck: fix whitespace error and email address. ] Signed-off-by: Paul E. McKenney --- init/Kconfig | 28 ++++++++++++++-------------- 1 file changed, 14 insertions(+), 14 deletions(-) (limited to 'init') diff --git a/init/Kconfig b/init/Kconfig index 5ac6ee094225..2054e048bb98 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -486,9 +486,13 @@ config PREEMPT_RCU This option enables preemptible-RCU code that is common between the TREE_PREEMPT_RCU and TINY_PREEMPT_RCU implementations. +config CONTEXT_TRACKING + bool + config RCU_USER_QS bool "Consider userspace as in RCU extended quiescent state" - depends on HAVE_RCU_USER_QS && SMP + depends on HAVE_CONTEXT_TRACKING && SMP + select CONTEXT_TRACKING help This option sets hooks on kernel / userspace boundaries and puts RCU in extended quiescent state when the CPU runs in @@ -497,24 +501,20 @@ config RCU_USER_QS try to keep the timer tick on for RCU. Unless you want to hack and help the development of the full - tickless feature, you shouldn't enable this option. It also + dynticks mode, you shouldn't enable this option. It also adds unnecessary overhead. If unsure say N -config RCU_USER_QS_FORCE - bool "Force userspace extended QS by default" - depends on RCU_USER_QS +config CONTEXT_TRACKING_FORCE + bool "Force context tracking" + depends on CONTEXT_TRACKING help - Set the hooks in user/kernel boundaries by default in order to - test this feature that treats userspace as an extended quiescent - state until we have a real user like a full adaptive nohz option. - - Unless you want to hack and help the development of the full - tickless feature, you shouldn't enable this option. It adds - unnecessary overhead. - - If unsure say N + Probe on user/kernel boundaries by default in order to + test the features that rely on it such as userspace RCU extended + quiescent states. + This test is there for debugging until we have a real user like the + full dynticks mode. config RCU_FANOUT int "Tree-based hierarchical RCU fanout value" -- cgit v1.2.3 From be3a728427a605990a7a0b6dbf9e29b68e266146 Mon Sep 17 00:00:00 2001 From: Andrea Arcangeli Date: Thu, 4 Oct 2012 01:50:47 +0200 Subject: mm: numa: pte_numa() and pmd_numa() Implement pte_numa and pmd_numa. We must atomically set the numa bit and clear the present bit to define a pte_numa or pmd_numa. Once a pte or pmd has been set as pte_numa or pmd_numa, the next time a thread touches a virtual address in the corresponding virtual range, a NUMA hinting page fault will trigger. The NUMA hinting page fault will clear the NUMA bit and set the present bit again to resolve the page fault. The expectation is that a NUMA hinting page fault is used as part of a placement policy that decides if a page should remain on the current node or migrated to a different node. Acked-by: Rik van Riel Signed-off-by: Andrea Arcangeli Signed-off-by: Mel Gorman --- init/Kconfig | 37 +++++++++++++++++++++++++++++++++++++ 1 file changed, 37 insertions(+) (limited to 'init') diff --git a/init/Kconfig b/init/Kconfig index 6fdd6e339326..9f00f004796a 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -696,6 +696,43 @@ config LOG_BUF_SHIFT config HAVE_UNSTABLE_SCHED_CLOCK bool +# +# For architectures that want to enable the support for NUMA-affine scheduler +# balancing logic: +# +config ARCH_SUPPORTS_NUMA_BALANCING + bool + +# For architectures that (ab)use NUMA to represent different memory regions +# all cpu-local but of different latencies, such as SuperH. +# +config ARCH_WANT_NUMA_VARIABLE_LOCALITY + bool + +# +# For architectures that are willing to define _PAGE_NUMA as _PAGE_PROTNONE +config ARCH_WANTS_PROT_NUMA_PROT_NONE + bool + +config ARCH_USES_NUMA_PROT_NONE + bool + default y + depends on ARCH_WANTS_PROT_NUMA_PROT_NONE + depends on NUMA_BALANCING + +config NUMA_BALANCING + bool "Memory placement aware NUMA scheduler" + default y + depends on ARCH_SUPPORTS_NUMA_BALANCING + depends on !ARCH_WANT_NUMA_VARIABLE_LOCALITY + depends on SMP && NUMA && MIGRATION + help + This option adds support for automatic NUMA aware memory/task placement. + The mechanism is quite primitive and is based on migrating memory when + it is references to the node the task is running on. + + This system will be inactive on UMA systems. + menuconfig CGROUPS boolean "Control Group support" depends on EVENTFD -- cgit v1.2.3 From 1a687c2e9a99335c9e77392f050fe607fa18a652 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 22 Nov 2012 11:16:36 +0000 Subject: mm: sched: numa: Control enabling and disabling of NUMA balancing This patch adds Kconfig options and kernel parameters to allow the enabling and disabling of automatic NUMA balancing. The existance of such a switch was and is very important when debugging problems related to transparent hugepages and we should have the same for automatic NUMA placement. Signed-off-by: Mel Gorman --- init/Kconfig | 8 ++++++++ 1 file changed, 8 insertions(+) (limited to 'init') diff --git a/init/Kconfig b/init/Kconfig index 9f00f004796a..18e2a5920a34 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -720,6 +720,14 @@ config ARCH_USES_NUMA_PROT_NONE depends on ARCH_WANTS_PROT_NUMA_PROT_NONE depends on NUMA_BALANCING +config NUMA_BALANCING_DEFAULT_ENABLED + bool "Automatically enable NUMA aware memory/task placement" + default y + depends on NUMA_BALANCING + help + If set, autonumic NUMA balancing will be enabled if running on a NUMA + machine. + config NUMA_BALANCING bool "Memory placement aware NUMA scheduler" default y -- cgit v1.2.3 From 3c466d46a9f38de141d8a492d8b1b1d0fa504759 Mon Sep 17 00:00:00 2001 From: Lai Jiangshan Date: Wed, 12 Dec 2012 13:51:40 -0800 Subject: init: use N_MEMORY instead N_HIGH_MEMORY N_HIGH_MEMORY stands for the nodes that has normal or high memory. N_MEMORY stands for the nodes that has any memory. The code here need to handle with the nodes which have memory, we should use N_MEMORY instead. Signed-off-by: Lai Jiangshan Signed-off-by: Wen Congyang Cc: Christoph Lameter Cc: Hillf Danton Cc: Lin Feng Cc: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- init/main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'init') diff --git a/init/main.c b/init/main.c index e33e09df3cbc..63ae904a99a8 100644 --- a/init/main.c +++ b/init/main.c @@ -857,7 +857,7 @@ static void __init kernel_init_freeable(void) /* * init can allocate pages on any node */ - set_mems_allowed(node_states[N_HIGH_MEMORY]); + set_mems_allowed(node_states[N_MEMORY]); /* * init can run on any cpu. */ -- cgit v1.2.3 From 11520e5e7c1855fc3bf202bb3be35a39d9efa034 Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Sat, 15 Dec 2012 15:15:24 -0800 Subject: Revert "x86-64/efi: Use EFI to deal with platform wall clock (again)" This reverts commit bd52276fa1d4 ("x86-64/efi: Use EFI to deal with platform wall clock (again)"), and the two supporting commits: da5a108d05b4: "x86/kernel: remove tboot 1:1 page table creation code" 185034e72d59: "x86, efi: 1:1 pagetable mapping for virtual EFI calls") as they all depend semantically on commit 53b87cf088e2 ("x86, mm: Include the entire kernel memory map in trampoline_pgd") that got reverted earlier due to the problems it caused. This was pointed out by Yinghai Lu, and verified by me on my Macbook Air that uses EFI. Pointed-out-by: Yinghai Lu Signed-off-by: Linus Torvalds --- init/main.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'init') diff --git a/init/main.c b/init/main.c index 6af5470b8067..63ae904a99a8 100644 --- a/init/main.c +++ b/init/main.c @@ -463,10 +463,6 @@ static void __init mm_init(void) percpu_init_late(); pgtable_cache_init(); vmalloc_init(); -#ifdef CONFIG_X86 - if (efi_enabled) - efi_enter_virtual_mode(); -#endif } asmlinkage void __init start_kernel(void) @@ -607,6 +603,10 @@ asmlinkage void __init start_kernel(void) calibrate_delay(); pidmap_init(); anon_vma_init(); +#ifdef CONFIG_X86 + if (efi_enabled) + efi_enter_virtual_mode(); +#endif thread_info_cache_init(); cred_init(); fork_init(totalram_pages); -- cgit v1.2.3 From 510fc4e11b772fd60f2c545c64d4c55abd07ce36 Mon Sep 17 00:00:00 2001 From: Glauber Costa Date: Tue, 18 Dec 2012 14:21:47 -0800 Subject: memcg: kmem accounting basic infrastructure Add the basic infrastructure for the accounting of kernel memory. To control that, the following files are created: * memory.kmem.usage_in_bytes * memory.kmem.limit_in_bytes * memory.kmem.failcnt * memory.kmem.max_usage_in_bytes They have the same meaning of their user memory counterparts. They reflect the state of the "kmem" res_counter. Per cgroup kmem memory accounting is not enabled until a limit is set for the group. Once the limit is set the accounting cannot be disabled for that group. This means that after the patch is applied, no behavioral changes exists for whoever is still using memcg to control their memory usage, until memory.kmem.limit_in_bytes is set for the first time. We always account to both user and kernel resource_counters. This effectively means that an independent kernel limit is in place when the limit is set to a lower value than the user memory. A equal or higher value means that the user limit will always hit first, meaning that kmem is effectively unlimited. People who want to track kernel memory but not limit it, can set this limit to a very high number (like RESOURCE_MAX - 1page - that no one will ever hit, or equal to the user memory) [akpm@linux-foundation.org: MEMCG_MMEM only works with slab and slub] Signed-off-by: Glauber Costa Acked-by: Kamezawa Hiroyuki Acked-by: Michal Hocko Cc: Johannes Weiner Cc: Tejun Heo Cc: Christoph Lameter Cc: David Rientjes Cc: Frederic Weisbecker Cc: Greg Thelen Cc: JoonSoo Kim Cc: Mel Gorman Cc: Pekka Enberg Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- init/Kconfig | 1 + 1 file changed, 1 insertion(+) (limited to 'init') diff --git a/init/Kconfig b/init/Kconfig index 675d8a2326cf..19ccb33c99d9 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -882,6 +882,7 @@ config MEMCG_SWAP_ENABLED config MEMCG_KMEM bool "Memory Resource Controller Kernel Memory accounting (EXPERIMENTAL)" depends on MEMCG && EXPERIMENTAL + depends on SLUB || SLAB default n help The Kernel Memory extension for Memory Resource Controller can limit -- cgit v1.2.3 From d7f25f8a2f81252d1ac134470ba1d0a287cf8fcd Mon Sep 17 00:00:00 2001 From: Glauber Costa Date: Tue, 18 Dec 2012 14:22:40 -0800 Subject: memcg: infrastructure to match an allocation to the right cache The page allocator is able to bind a page to a memcg when it is allocated. But for the caches, we'd like to have as many objects as possible in a page belonging to the same cache. This is done in this patch by calling memcg_kmem_get_cache in the beginning of every allocation function. This function is patched out by static branches when kernel memory controller is not being used. It assumes that the task allocating, which determines the memcg in the page allocator, belongs to the same cgroup throughout the whole process. Misaccounting can happen if the task calls memcg_kmem_get_cache() while belonging to a cgroup, and later on changes. This is considered acceptable, and should only happen upon task migration. Before the cache is created by the memcg core, there is also a possible imbalance: the task belongs to a memcg, but the cache being allocated from is the global cache, since the child cache is not yet guaranteed to be ready. This case is also fine, since in this case the GFP_KMEMCG will not be passed and the page allocator will not attempt any cgroup accounting. Signed-off-by: Glauber Costa Cc: Christoph Lameter Cc: David Rientjes Cc: Frederic Weisbecker Cc: Greg Thelen Cc: Johannes Weiner Cc: JoonSoo Kim Cc: KAMEZAWA Hiroyuki Cc: Mel Gorman Cc: Michal Hocko Cc: Pekka Enberg Cc: Rik van Riel Cc: Suleiman Souhlal Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- init/Kconfig | 1 - 1 file changed, 1 deletion(-) (limited to 'init') diff --git a/init/Kconfig b/init/Kconfig index 19ccb33c99d9..7d30240e5bfe 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -883,7 +883,6 @@ config MEMCG_KMEM bool "Memory Resource Controller Kernel Memory accounting (EXPERIMENTAL)" depends on MEMCG && EXPERIMENTAL depends on SLUB || SLAB - default n help The Kernel Memory extension for Memory Resource Controller can limit the amount of memory used by kernel objects in the system. Those are -- cgit v1.2.3 From ae903caae267154de7cf8576b130ff474630596b Mon Sep 17 00:00:00 2001 From: Al Viro Date: Fri, 14 Dec 2012 12:44:11 -0500 Subject: Bury the conditionals from kernel_thread/kernel_execve series All architectures have CONFIG_GENERIC_KERNEL_THREAD CONFIG_GENERIC_KERNEL_EXECVE __ARCH_WANT_SYS_EXECVE None of them have __ARCH_WANT_KERNEL_EXECVE and there are only two callers of kernel_execve() (which is a trivial wrapper for do_execve() now) left. Kill the conditionals and make both callers use do_execve(). Signed-off-by: Al Viro --- init/main.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'init') diff --git a/init/main.c b/init/main.c index e33e09df3cbc..155ac208d581 100644 --- a/init/main.c +++ b/init/main.c @@ -797,7 +797,9 @@ static void __init do_pre_smp_initcalls(void) static int run_init_process(const char *init_filename) { argv_init[0] = init_filename; - return kernel_execve(init_filename, argv_init, envp_init); + return do_execve(init_filename, + (const char __user *const __user *)argv_init, + (const char __user *const __user *)envp_init); } static void __init kernel_init_freeable(void); -- cgit v1.2.3 From 3a55fb0d9fe8e2f4594329edd58c5fd6f35a99dd Mon Sep 17 00:00:00 2001 From: Kirill Smelkov Date: Fri, 2 Nov 2012 15:41:01 +0400 Subject: Tell the world we gave up on pushing CC_OPTIMIZE_FOR_SIZE In commit 281dc5c5ec0f ("Give up on pushing CC_OPTIMIZE_FOR_SIZE") we already changed the actual default value, but the help-text still suggested 'y'. Fix the help text too, for all the same reasons. Sadly, -Os keeps on generating some very suboptimal code for certain cases, to the point where any I$ miss upside is swamped by the downside. The main ones are: - using "rep movsb" for memcpy, even on CPU's where that is horrendously bad for performance. - not honoring branch prediction information, so any I$ footprint you win from smaller code, you lose from less code density in the I$. - using divide instructions when that is very expensive. Signed-off-by: Kirill Smelkov Signed-off-by: Linus Torvalds --- init/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'init') diff --git a/init/Kconfig b/init/Kconfig index 7d30240e5bfe..be8b7f55312d 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1182,7 +1182,7 @@ config CC_OPTIMIZE_FOR_SIZE Enabling this option will pass "-Os" instead of "-O2" to gcc resulting in a smaller kernel. - If unsure, say Y. + If unsure, say N. config SYSCTL bool -- cgit v1.2.3