From 80367ad01d93ac781b0e1df246edaf006928002f Mon Sep 17 00:00:00 2001 From: Sebastian Andrzej Siewior Date: Wed, 16 Apr 2025 18:29:12 +0200 Subject: futex: Add basic infrastructure for local task local hash The futex hash is system wide and shared by all tasks. Each slot is hashed based on futex address and the VMA of the thread. Due to randomized VMAs (and memory allocations) the same logical lock (pointer) can end up in a different hash bucket on each invocation of the application. This in turn means that different applications may share a hash bucket on the first invocation but not on the second and it is not always clear which applications will be involved. This can result in high latency's to acquire the futex_hash_bucket::lock especially if the lock owner is limited to a CPU and can not be effectively PI boosted. Introduce basic infrastructure for process local hash which is shared by all threads of process. This hash will only be used for a PROCESS_PRIVATE FUTEX operation. The hashmap can be allocated via: prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, num); A `num' of 0 means that the global hash is used instead of a private hash. Other values for `num' specify the number of slots for the hash and the number must be power of two, starting with two. The prctl() returns zero on success. This function can only be used before a thread is created. The current status for the private hash can be queried via: num = prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS); which return the current number of slots. The value 0 means that the global hash is used. Values greater than 0 indicate the number of slots that are used. A negative number indicates an error. For optimisation, for the private hash jhash2() uses only two arguments the address and the offset. This omits the VMA which is always the same. [peterz: Use 0 for global hash. A bit shuffling and renaming. ] Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Peter Zijlstra (Intel) Link: https://lore.kernel.org/r/20250416162921.513656-13-bigeasy@linutronix.de --- include/uapi/linux/prctl.h | 5 +++++ 1 file changed, 5 insertions(+) (limited to 'include/uapi/linux/prctl.h') diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 15c18ef4eb11..3b93fb906e3c 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -364,4 +364,9 @@ struct prctl_mm_map { # define PR_TIMER_CREATE_RESTORE_IDS_ON 1 # define PR_TIMER_CREATE_RESTORE_IDS_GET 2 +/* FUTEX hash management */ +#define PR_FUTEX_HASH 78 +# define PR_FUTEX_HASH_SET_SLOTS 1 +# define PR_FUTEX_HASH_GET_SLOTS 2 + #endif /* _LINUX_PRCTL_H */ -- cgit v1.2.3 From 63e8595c060a1fef421e3eecfc05ad882dafb8ac Mon Sep 17 00:00:00 2001 From: Sebastian Andrzej Siewior Date: Wed, 16 Apr 2025 18:29:15 +0200 Subject: futex: Allow to make the private hash immutable My initial testing showed that: perf bench futex hash reported less operations/sec with private hash. After using the same amount of buckets in the private hash as used by the global hash then the operations/sec were about the same. This changed once the private hash became resizable. This feature added an RCU section and reference counting via atomic inc+dec operation into the hot path. The reference counting can be avoided if the private hash is made immutable. Extend PR_FUTEX_HASH_SET_SLOTS by a fourth argument which denotes if the private should be made immutable. Once set (to true) the a further resize is not allowed (same if set to global hash). Add PR_FUTEX_HASH_GET_IMMUTABLE which returns true if the hash can not be changed. Update "perf bench" suite. For comparison, results of "perf bench futex hash -s": - Xeon CPU E5-2650, 2 NUMA nodes, total 32 CPUs: - Before the introducing task local hash shared Averaged 1.487.148 operations/sec (+- 0,53%), total secs = 10 private Averaged 2.192.405 operations/sec (+- 0,07%), total secs = 10 - With the series shared Averaged 1.326.342 operations/sec (+- 0,41%), total secs = 10 -b128 Averaged 141.394 operations/sec (+- 1,15%), total secs = 10 -Ib128 Averaged 851.490 operations/sec (+- 0,67%), total secs = 10 -b8192 Averaged 131.321 operations/sec (+- 2,13%), total secs = 10 -Ib8192 Averaged 1.923.077 operations/sec (+- 0,61%), total secs = 10 128 is the default allocation of hash buckets. 8192 was the previous amount of allocated hash buckets. - Xeon(R) CPU E7-8890 v3, 4 NUMA nodes, total 144 CPUs: - Before the introducing task local hash shared Averaged 1.810.936 operations/sec (+- 0,26%), total secs = 20 private Averaged 2.505.801 operations/sec (+- 0,05%), total secs = 20 - With the series shared Averaged 1.589.002 operations/sec (+- 0,25%), total secs = 20 -b1024 Averaged 42.410 operations/sec (+- 0,20%), total secs = 20 -Ib1024 Averaged 740.638 operations/sec (+- 1,51%), total secs = 20 -b65536 Averaged 48.811 operations/sec (+- 1,35%), total secs = 20 -Ib65536 Averaged 1.963.165 operations/sec (+- 0,18%), total secs = 20 1024 is the default allocation of hash buckets. 65536 was the previous amount of allocated hash buckets. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Peter Zijlstra (Intel) Acked-by: Shrikanth Hegde Link: https://lore.kernel.org/r/20250416162921.513656-16-bigeasy@linutronix.de --- include/uapi/linux/prctl.h | 2 ++ 1 file changed, 2 insertions(+) (limited to 'include/uapi/linux/prctl.h') diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 3b93fb906e3c..43dec6eed559 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -367,6 +367,8 @@ struct prctl_mm_map { /* FUTEX hash management */ #define PR_FUTEX_HASH 78 # define PR_FUTEX_HASH_SET_SLOTS 1 +# define FH_FLAG_IMMUTABLE (1ULL << 0) # define PR_FUTEX_HASH_GET_SLOTS 2 +# define PR_FUTEX_HASH_GET_IMMUTABLE 3 #endif /* _LINUX_PRCTL_H */ -- cgit v1.2.3