<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-toradex.git/Documentation/features, branch v6.9-rc4</title>
<subtitle>Linux kernel for Apalis and Colibri modules</subtitle>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/'/>
<entry>
<title>membarrier: riscv: Provide core serializing command</title>
<updated>2024-02-15T16:04:14+00:00</updated>
<author>
<name>Andrea Parri</name>
<email>parri.andrea@gmail.com</email>
</author>
<published>2024-01-31T14:49:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=cd9b29014dc69609489261efe351d0c7709ae8bf'/>
<id>cd9b29014dc69609489261efe351d0c7709ae8bf</id>
<content type='text'>
RISC-V uses xRET instructions on return from interrupt and to go back
to user-space; the xRET instruction is not core serializing.

Use FENCE.I for providing core serialization as follows:

 - by calling sync_core_before_usermode() on return from interrupt (cf.
   ipi_sync_core()),

 - via switch_mm() and sync_core_before_usermode() (respectively, for
   uthread-&gt;uthread and kthread-&gt;uthread transitions) before returning
   to user-space.

On RISC-V, the serialization in switch_mm() is activated by resetting
the icache_stale_mask of the mm at prepare_sync_core_cmd().

Suggested-by: Palmer Dabbelt &lt;palmer@dabbelt.com&gt;
Signed-off-by: Andrea Parri &lt;parri.andrea@gmail.com&gt;
Reviewed-by: Mathieu Desnoyers &lt;mathieu.desnoyers@efficios.com&gt;
Link: https://lore.kernel.org/r/20240131144936.29190-5-parri.andrea@gmail.com
Signed-off-by: Palmer Dabbelt &lt;palmer@rivosinc.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
RISC-V uses xRET instructions on return from interrupt and to go back
to user-space; the xRET instruction is not core serializing.

Use FENCE.I for providing core serialization as follows:

 - by calling sync_core_before_usermode() on return from interrupt (cf.
   ipi_sync_core()),

 - via switch_mm() and sync_core_before_usermode() (respectively, for
   uthread-&gt;uthread and kthread-&gt;uthread transitions) before returning
   to user-space.

On RISC-V, the serialization in switch_mm() is activated by resetting
the icache_stale_mask of the mm at prepare_sync_core_cmd().

Suggested-by: Palmer Dabbelt &lt;palmer@dabbelt.com&gt;
Signed-off-by: Andrea Parri &lt;parri.andrea@gmail.com&gt;
Reviewed-by: Mathieu Desnoyers &lt;mathieu.desnoyers@efficios.com&gt;
Link: https://lore.kernel.org/r/20240131144936.29190-5-parri.andrea@gmail.com
Signed-off-by: Palmer Dabbelt &lt;palmer@rivosinc.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>riscv: Add support for BATCHED_UNMAP_TLB_FLUSH</title>
<updated>2024-01-11T16:01:53+00:00</updated>
<author>
<name>Alexandre Ghiti</name>
<email>alexghiti@rivosinc.com</email>
</author>
<published>2024-01-08T19:36:40+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=54d7431af73e2fa53b73cfeb2bec559c6664a4e4'/>
<id>54d7431af73e2fa53b73cfeb2bec559c6664a4e4</id>
<content type='text'>
Allow to defer the flushing of the TLB when unmapping pages, which allows
to reduce the numbers of IPI and the number of sfence.vma.

The ubenchmarch used in commit 43b3dfdd0455 ("arm64: support
batched/deferred tlb shootdown during page reclamation/migration") that
was multithreaded to force the usage of IPI shows good performance
improvement on all platforms:

* Unmatched: ~34%
* TH1520   : ~78%
* Qemu     : ~81%

In addition, perf on qemu reports an important decrease in time spent
dealing with IPIs:

Before:  68.17%  main     [kernel.kallsyms]            [k] __sbi_rfence_v02_call
After :   8.64%  main     [kernel.kallsyms]            [k] __sbi_rfence_v02_call

* Benchmark:

int stick_this_thread_to_core(int core_id) {
        int num_cores = sysconf(_SC_NPROCESSORS_ONLN);
        if (core_id &lt; 0 || core_id &gt;= num_cores)
           return EINVAL;

        cpu_set_t cpuset;
        CPU_ZERO(&amp;cpuset);
        CPU_SET(core_id, &amp;cpuset);

        pthread_t current_thread = pthread_self();
        return pthread_setaffinity_np(current_thread,
sizeof(cpu_set_t), &amp;cpuset);
}

static void *fn_thread (void *p_data)
{
        int ret;
        pthread_t thread;

        stick_this_thread_to_core((int)p_data);

        while (1) {
                sleep(1);
        }

        return NULL;
}

int main()
{
        volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        pthread_t threads[4];
        int ret;

        for (int i = 0; i &lt; 4; ++i) {
                ret = pthread_create(&amp;threads[i], NULL, fn_thread, (void *)i);
                if (ret)
                {
                        printf("%s", strerror (ret));
                }
        }

        memset(p, 0x88, SIZE);

        for (int k = 0; k &lt; 10000; k++) {
                /* swap in */
                for (int i = 0; i &lt; SIZE; i += 4096) {
                        (void)p[i];
                }

                /* swap out */
                madvise(p, SIZE, MADV_PAGEOUT);
        }

        for (int i = 0; i &lt; 4; i++)
        {
                pthread_cancel(threads[i]);
        }

        for (int i = 0; i &lt; 4; i++)
        {
                pthread_join(threads[i], NULL);
        }

        return 0;
}

Signed-off-by: Alexandre Ghiti &lt;alexghiti@rivosinc.com&gt;
Reviewed-by: Jisheng Zhang &lt;jszhang@kernel.org&gt;
Tested-by: Jisheng Zhang &lt;jszhang@kernel.org&gt; # Tested on TH1520
Tested-by: Nam Cao &lt;namcao@linutronix.de&gt;
Link: https://lore.kernel.org/r/20240108193640.344929-1-alexghiti@rivosinc.com
Signed-off-by: Palmer Dabbelt &lt;palmer@rivosinc.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Allow to defer the flushing of the TLB when unmapping pages, which allows
to reduce the numbers of IPI and the number of sfence.vma.

The ubenchmarch used in commit 43b3dfdd0455 ("arm64: support
batched/deferred tlb shootdown during page reclamation/migration") that
was multithreaded to force the usage of IPI shows good performance
improvement on all platforms:

* Unmatched: ~34%
* TH1520   : ~78%
* Qemu     : ~81%

In addition, perf on qemu reports an important decrease in time spent
dealing with IPIs:

Before:  68.17%  main     [kernel.kallsyms]            [k] __sbi_rfence_v02_call
After :   8.64%  main     [kernel.kallsyms]            [k] __sbi_rfence_v02_call

* Benchmark:

int stick_this_thread_to_core(int core_id) {
        int num_cores = sysconf(_SC_NPROCESSORS_ONLN);
        if (core_id &lt; 0 || core_id &gt;= num_cores)
           return EINVAL;

        cpu_set_t cpuset;
        CPU_ZERO(&amp;cpuset);
        CPU_SET(core_id, &amp;cpuset);

        pthread_t current_thread = pthread_self();
        return pthread_setaffinity_np(current_thread,
sizeof(cpu_set_t), &amp;cpuset);
}

static void *fn_thread (void *p_data)
{
        int ret;
        pthread_t thread;

        stick_this_thread_to_core((int)p_data);

        while (1) {
                sleep(1);
        }

        return NULL;
}

int main()
{
        volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        pthread_t threads[4];
        int ret;

        for (int i = 0; i &lt; 4; ++i) {
                ret = pthread_create(&amp;threads[i], NULL, fn_thread, (void *)i);
                if (ret)
                {
                        printf("%s", strerror (ret));
                }
        }

        memset(p, 0x88, SIZE);

        for (int k = 0; k &lt; 10000; k++) {
                /* swap in */
                for (int i = 0; i &lt; SIZE; i += 4096) {
                        (void)p[i];
                }

                /* swap out */
                madvise(p, SIZE, MADV_PAGEOUT);
        }

        for (int i = 0; i &lt; 4; i++)
        {
                pthread_cancel(threads[i]);
        }

        for (int i = 0; i &lt; 4; i++)
        {
                pthread_join(threads[i], NULL);
        }

        return 0;
}

Signed-off-by: Alexandre Ghiti &lt;alexghiti@rivosinc.com&gt;
Reviewed-by: Jisheng Zhang &lt;jszhang@kernel.org&gt;
Tested-by: Jisheng Zhang &lt;jszhang@kernel.org&gt; # Tested on TH1520
Tested-by: Nam Cao &lt;namcao@linutronix.de&gt;
Link: https://lore.kernel.org/r/20240108193640.344929-1-alexghiti@rivosinc.com
Signed-off-by: Palmer Dabbelt &lt;palmer@rivosinc.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Documentation: Drop IA64 from feature descriptions</title>
<updated>2023-09-11T08:13:18+00:00</updated>
<author>
<name>Ard Biesheuvel</name>
<email>ardb@kernel.org</email>
</author>
<published>2023-01-13T17:07:28+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=af1f459233d4edeef634f559539e7f4b64cb1d25'/>
<id>af1f459233d4edeef634f559539e7f4b64cb1d25</id>
<content type='text'>
Itanium (IA64) is going away, so drop it from the kernel feature
documentation.

Signed-off-by: Ard Biesheuvel &lt;ardb@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Itanium (IA64) is going away, so drop it from the kernel feature
documentation.

Signed-off-by: Ard Biesheuvel &lt;ardb@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge tag 'loongarch-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson</title>
<updated>2023-09-08T19:16:52+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2023-09-08T19:16:52+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=12952b6bbd36b372345f179f1a85576c5924d425'/>
<id>12952b6bbd36b372345f179f1a85576c5924d425</id>
<content type='text'>
Pull LoongArch updates from Huacai Chen:

 - Allow usage of LSX/LASX in the kernel, and use them for
   SIMD-optimized RAID5/RAID6 routines

 - Add Loongson Binary Translation (LBT) extension support

 - Add basic KGDB &amp; KDB support

 - Add building with kcov coverage

 - Add KFENCE (Kernel Electric-Fence) support

 - Add KASAN (Kernel Address Sanitizer) support

 - Some bug fixes and other small changes

 - Update the default config file

* tag 'loongarch-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson: (25 commits)
  LoongArch: Update Loongson-3 default config file
  LoongArch: Add KASAN (Kernel Address Sanitizer) support
  LoongArch: Simplify the processing of jumping new kernel for KASLR
  kasan: Add (pmd|pud)_init for LoongArch zero_(pud|p4d)_populate process
  kasan: Add __HAVE_ARCH_SHADOW_MAP to support arch specific mapping
  LoongArch: Add KFENCE (Kernel Electric-Fence) support
  LoongArch: Get partial stack information when providing regs parameter
  LoongArch: mm: Add page table mapped mode support for virt_to_page()
  kfence: Defer the assignment of the local variable addr
  LoongArch: Allow building with kcov coverage
  LoongArch: Provide kaslr_offset() to get kernel offset
  LoongArch: Add basic KGDB &amp; KDB support
  LoongArch: Add Loongson Binary Translation (LBT) extension support
  raid6: Add LoongArch SIMD recovery implementation
  raid6: Add LoongArch SIMD syndrome calculation
  LoongArch: Add SIMD-optimized XOR routines
  LoongArch: Allow usage of LSX/LASX in the kernel
  LoongArch: Define symbol 'fault' as a local label in fpu.S
  LoongArch: Adjust {copy, clear}_user exception handler behavior
  LoongArch: Use static defined zero page rather than allocated
  ...
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull LoongArch updates from Huacai Chen:

 - Allow usage of LSX/LASX in the kernel, and use them for
   SIMD-optimized RAID5/RAID6 routines

 - Add Loongson Binary Translation (LBT) extension support

 - Add basic KGDB &amp; KDB support

 - Add building with kcov coverage

 - Add KFENCE (Kernel Electric-Fence) support

 - Add KASAN (Kernel Address Sanitizer) support

 - Some bug fixes and other small changes

 - Update the default config file

* tag 'loongarch-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson: (25 commits)
  LoongArch: Update Loongson-3 default config file
  LoongArch: Add KASAN (Kernel Address Sanitizer) support
  LoongArch: Simplify the processing of jumping new kernel for KASLR
  kasan: Add (pmd|pud)_init for LoongArch zero_(pud|p4d)_populate process
  kasan: Add __HAVE_ARCH_SHADOW_MAP to support arch specific mapping
  LoongArch: Add KFENCE (Kernel Electric-Fence) support
  LoongArch: Get partial stack information when providing regs parameter
  LoongArch: mm: Add page table mapped mode support for virt_to_page()
  kfence: Defer the assignment of the local variable addr
  LoongArch: Allow building with kcov coverage
  LoongArch: Provide kaslr_offset() to get kernel offset
  LoongArch: Add basic KGDB &amp; KDB support
  LoongArch: Add Loongson Binary Translation (LBT) extension support
  raid6: Add LoongArch SIMD recovery implementation
  raid6: Add LoongArch SIMD syndrome calculation
  LoongArch: Add SIMD-optimized XOR routines
  LoongArch: Allow usage of LSX/LASX in the kernel
  LoongArch: Define symbol 'fault' as a local label in fpu.S
  LoongArch: Adjust {copy, clear}_user exception handler behavior
  LoongArch: Use static defined zero page rather than allocated
  ...
</pre>
</div>
</content>
</entry>
<entry>
<title>LoongArch: Add KASAN (Kernel Address Sanitizer) support</title>
<updated>2023-09-06T14:54:16+00:00</updated>
<author>
<name>Qing Zhang</name>
<email>zhangqing@loongson.cn</email>
</author>
<published>2023-09-06T14:54:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=5aa4ac64e6add3e40d5049e31275b2822daf885d'/>
<id>5aa4ac64e6add3e40d5049e31275b2822daf885d</id>
<content type='text'>
1/8 of kernel addresses reserved for shadow memory. But for LoongArch,
There are a lot of holes between different segments and valid address
space (256T available) is insufficient to map all these segments to kasan
shadow memory with the common formula provided by kasan core, saying
(addr &gt;&gt; KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET

So LoongArch has a arch-specific mapping formula, different segments are
mapped individually, and only limited space lengths of these specific
segments are mapped to shadow.

At early boot stage the whole shadow region populated with just one
physical page (kasan_early_shadow_page). Later, this page is reused as
readonly zero shadow for some memory that kasan currently don't track.
After mapping the physical memory, pages for shadow memory are allocated
and mapped.

Functions like memset()/memcpy()/memmove() do a lot of memory accesses.
If bad pointer passed to one of these function it is important to be
caught. Compiler's instrumentation cannot do this since these functions
are written in assembly.

KASan replaces memory functions with manually instrumented variants.
Original functions declared as weak symbols so strong definitions in
mm/kasan/kasan.c could replace them. Original functions have aliases
with '__' prefix in names, so we could call non-instrumented variant
if needed.

Signed-off-by: Qing Zhang &lt;zhangqing@loongson.cn&gt;
Signed-off-by: Huacai Chen &lt;chenhuacai@loongson.cn&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
1/8 of kernel addresses reserved for shadow memory. But for LoongArch,
There are a lot of holes between different segments and valid address
space (256T available) is insufficient to map all these segments to kasan
shadow memory with the common formula provided by kasan core, saying
(addr &gt;&gt; KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET

So LoongArch has a arch-specific mapping formula, different segments are
mapped individually, and only limited space lengths of these specific
segments are mapped to shadow.

At early boot stage the whole shadow region populated with just one
physical page (kasan_early_shadow_page). Later, this page is reused as
readonly zero shadow for some memory that kasan currently don't track.
After mapping the physical memory, pages for shadow memory are allocated
and mapped.

Functions like memset()/memcpy()/memmove() do a lot of memory accesses.
If bad pointer passed to one of these function it is important to be
caught. Compiler's instrumentation cannot do this since these functions
are written in assembly.

KASan replaces memory functions with manually instrumented variants.
Original functions declared as weak symbols so strong definitions in
mm/kasan/kasan.c could replace them. Original functions have aliases
with '__' prefix in names, so we could call non-instrumented variant
if needed.

Signed-off-by: Qing Zhang &lt;zhangqing@loongson.cn&gt;
Signed-off-by: Huacai Chen &lt;chenhuacai@loongson.cn&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>LoongArch: Allow building with kcov coverage</title>
<updated>2023-09-06T14:53:55+00:00</updated>
<author>
<name>Feiyang Chen</name>
<email>chenfeiyang@loongson.cn</email>
</author>
<published>2023-09-06T14:53:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=2363088eba2ecccfb643725e4864af73c4226a04'/>
<id>2363088eba2ecccfb643725e4864af73c4226a04</id>
<content type='text'>
Add ARCH_HAS_KCOV and HAVE_GCC_PLUGINS to the LoongArch Kconfig. And
also disable instrumentation of vdso.

Signed-off-by: Feiyang Chen &lt;chenfeiyang@loongson.cn&gt;
Signed-off-by: Huacai Chen &lt;chenhuacai@loongson.cn&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Add ARCH_HAS_KCOV and HAVE_GCC_PLUGINS to the LoongArch Kconfig. And
also disable instrumentation of vdso.

Signed-off-by: Feiyang Chen &lt;chenfeiyang@loongson.cn&gt;
Signed-off-by: Huacai Chen &lt;chenhuacai@loongson.cn&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>LoongArch: Add basic KGDB &amp; KDB support</title>
<updated>2023-09-06T14:53:55+00:00</updated>
<author>
<name>Qing Zhang</name>
<email>zhangqing@loongson.cn</email>
</author>
<published>2023-09-06T14:53:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=e14dd076964ef11e9d6e3b06a2f1c6bb7d034133'/>
<id>e14dd076964ef11e9d6e3b06a2f1c6bb7d034133</id>
<content type='text'>
KGDB is intended to be used as a source level debugger for the Linux
kernel. It is used along with gdb to debug a Linux kernel. GDB can be
used to "break in" to the kernel to inspect memory, variables and regs
similar to the way an application developer would use GDB to debug an
application. KDB is a frontend of KGDB which is similar to GDB.

By now, in addition to the generic KGDB features, the LoongArch KGDB
implements the following features:
- Hardware breakpoints/watchpoints;
- Software single-step support for KDB.

Signed-off-by: Qing Zhang &lt;zhangqing@loongson.cn&gt;   # Framework &amp; CoreFeature
Signed-off-by: Binbin Zhou &lt;zhoubinbin@loongson.cn&gt; # BreakPoint &amp; SingleStep
Signed-off-by: Hui Li &lt;lihui@loongson.cn&gt;           # Some Minor Improvements
Signed-off-by: Randy Dunlap &lt;rdunlap@infradead.org&gt; # Some Build Error Fixes
Signed-off-by: Huacai Chen &lt;chenhuacai@loongson.cn&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
KGDB is intended to be used as a source level debugger for the Linux
kernel. It is used along with gdb to debug a Linux kernel. GDB can be
used to "break in" to the kernel to inspect memory, variables and regs
similar to the way an application developer would use GDB to debug an
application. KDB is a frontend of KGDB which is similar to GDB.

By now, in addition to the generic KGDB features, the LoongArch KGDB
implements the following features:
- Hardware breakpoints/watchpoints;
- Software single-step support for KDB.

Signed-off-by: Qing Zhang &lt;zhangqing@loongson.cn&gt;   # Framework &amp; CoreFeature
Signed-off-by: Binbin Zhou &lt;zhoubinbin@loongson.cn&gt; # BreakPoint &amp; SingleStep
Signed-off-by: Hui Li &lt;lihui@loongson.cn&gt;           # Some Minor Improvements
Signed-off-by: Randy Dunlap &lt;rdunlap@infradead.org&gt; # Some Build Error Fixes
Signed-off-by: Huacai Chen &lt;chenhuacai@loongson.cn&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge tag 'docs-6.6' of git://git.lwn.net/linux</title>
<updated>2023-08-31T03:05:42+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2023-08-31T03:05:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=cd99b9eb4b702563c5ac7d26b632a628f5a832a5'/>
<id>cd99b9eb4b702563c5ac7d26b632a628f5a832a5</id>
<content type='text'>
Pull documentation updates from Jonathan Corbet:
 "Documentation work keeps chugging along; this includes:

   - Work from Carlos Bilbao to integrate rustdoc output into the
     generated HTML documentation. This took some work to figure out how
     to do it without slowing the docs build and without creating people
     who don't have Rust installed, but Carlos got there

   - Move the loongarch and mips architecture documentation under
     Documentation/arch/

   - Some more maintainer documentation from Jakub

  ... plus the usual assortment of updates, translations, and fixes"

* tag 'docs-6.6' of git://git.lwn.net/linux: (56 commits)
  Docu: genericirq.rst: fix irq-example
  input: docs: pxrc: remove reference to phoenix-sim
  Documentation: serial-console: Fix literal block marker
  docs/mm: remove references to hmm_mirror ops and clean typos
  docs/zh_CN: correct regi_chg(),regi_add() to region_chg(),region_add()
  Documentation: Fix typos
  Documentation/ABI: Fix typos
  scripts: kernel-doc: fix macro handling in enums
  scripts: kernel-doc: parse DEFINE_DMA_UNMAP_[ADDR|LEN]
  Documentation: riscv: Update boot image header since EFI stub is supported
  Documentation: riscv: Add early boot document
  Documentation: arm: Add bootargs to the table of added DT parameters
  docs: kernel-parameters: Refer to the correct bitmap function
  doc: update params of memhp_default_state=
  docs: Add book to process/kernel-docs.rst
  docs: sparse: fix invalid link addresses
  docs: vfs: clean up after the iterate() removal
  docs: Add a section on surveys to the researcher guidelines
  docs: move mips under arch
  docs: move loongarch under arch
  ...
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull documentation updates from Jonathan Corbet:
 "Documentation work keeps chugging along; this includes:

   - Work from Carlos Bilbao to integrate rustdoc output into the
     generated HTML documentation. This took some work to figure out how
     to do it without slowing the docs build and without creating people
     who don't have Rust installed, but Carlos got there

   - Move the loongarch and mips architecture documentation under
     Documentation/arch/

   - Some more maintainer documentation from Jakub

  ... plus the usual assortment of updates, translations, and fixes"

* tag 'docs-6.6' of git://git.lwn.net/linux: (56 commits)
  Docu: genericirq.rst: fix irq-example
  input: docs: pxrc: remove reference to phoenix-sim
  Documentation: serial-console: Fix literal block marker
  docs/mm: remove references to hmm_mirror ops and clean typos
  docs/zh_CN: correct regi_chg(),regi_add() to region_chg(),region_add()
  Documentation: Fix typos
  Documentation/ABI: Fix typos
  scripts: kernel-doc: fix macro handling in enums
  scripts: kernel-doc: parse DEFINE_DMA_UNMAP_[ADDR|LEN]
  Documentation: riscv: Update boot image header since EFI stub is supported
  Documentation: riscv: Add early boot document
  Documentation: arm: Add bootargs to the table of added DT parameters
  docs: kernel-parameters: Refer to the correct bitmap function
  doc: update params of memhp_default_state=
  docs: Add book to process/kernel-docs.rst
  docs: sparse: fix invalid link addresses
  docs: vfs: clean up after the iterate() removal
  docs: Add a section on surveys to the researcher guidelines
  docs: move mips under arch
  docs: move loongarch under arch
  ...
</pre>
</div>
</content>
</entry>
<entry>
<title>Documentation: Fix typos</title>
<updated>2023-08-18T17:29:03+00:00</updated>
<author>
<name>Bjorn Helgaas</name>
<email>bhelgaas@google.com</email>
</author>
<published>2023-08-14T21:28:22+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=d56b699d76d1b352f7a3d3a0a3e91c79b8612d94'/>
<id>d56b699d76d1b352f7a3d3a0a3e91c79b8612d94</id>
<content type='text'>
Fix typos in Documentation.

Signed-off-by: Bjorn Helgaas &lt;bhelgaas@google.com&gt;
Link: https://lore.kernel.org/r/20230814212822.193684-4-helgaas@kernel.org
Signed-off-by: Jonathan Corbet &lt;corbet@lwn.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Fix typos in Documentation.

Signed-off-by: Bjorn Helgaas &lt;bhelgaas@google.com&gt;
Link: https://lore.kernel.org/r/20230814212822.193684-4-helgaas@kernel.org
Signed-off-by: Jonathan Corbet &lt;corbet@lwn.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>arm64: support batched/deferred tlb shootdown during page reclamation/migration</title>
<updated>2023-08-18T17:12:37+00:00</updated>
<author>
<name>Barry Song</name>
<email>v-songbaohua@oppo.com</email>
</author>
<published>2023-07-17T13:10:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.toradex.cn/cgit/linux-toradex.git/commit/?id=43b3dfdd04553171488cb11d46d21948b6b90e27'/>
<id>43b3dfdd04553171488cb11d46d21948b6b90e27</id>
<content type='text'>
On x86, batched and deferred tlb shootdown has lead to 90% performance
increase on tlb shootdown.  on arm64, HW can do tlb shootdown without
software IPI.  But sync tlbi is still quite expensive.

Even running a simplest program which requires swapout can
prove this is true,
 #include &lt;sys/types.h&gt;
 #include &lt;unistd.h&gt;
 #include &lt;sys/mman.h&gt;
 #include &lt;string.h&gt;

 int main()
 {
 #define SIZE (1 * 1024 * 1024)
         volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);

         memset(p, 0x88, SIZE);

         for (int k = 0; k &lt; 10000; k++) {
                 /* swap in */
                 for (int i = 0; i &lt; SIZE; i += 4096) {
                         (void)p[i];
                 }

                 /* swap out */
                 madvise(p, SIZE, MADV_PAGEOUT);
         }
 }

Perf result on snapdragon 888 with 8 cores by using zRAM
as the swap block device.

 ~ # perf record taskset -c 4 ./a.out
 [ perf record: Woken up 10 times to write data ]
 [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
 ~ # perf report
 # To display the perf.data header info, please use --header/--header-only options.
 # To display the perf.data header info, please use --header/--header-only options.
 #
 #
 # Total Lost Samples: 0
 #
 # Samples: 60K of event 'cycles'
 # Event count (approx.): 35706225414
 #
 # Overhead  Command  Shared Object      Symbol
 # ........  .......  .................  ......
 #
    21.07%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irq
     8.23%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     6.67%  a.out    [kernel.kallsyms]  [k] filemap_map_pages
     6.16%  a.out    [kernel.kallsyms]  [k] __zram_bvec_write
     5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
     3.71%  a.out    [kernel.kallsyms]  [k] _raw_spin_lock
     3.49%  a.out    [kernel.kallsyms]  [k] memset64
     1.63%  a.out    [kernel.kallsyms]  [k] clear_page
     1.42%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock
     1.26%  a.out    [kernel.kallsyms]  [k] mod_zone_state.llvm.8525150236079521930
     1.23%  a.out    [kernel.kallsyms]  [k] xas_load
     1.15%  a.out    [kernel.kallsyms]  [k] zram_slot_lock

ptep_clear_flush() takes 5.36% CPU in the micro-benchmark swapping in/out
a page mapped by only one process.  If the page is mapped by multiple
processes, typically, like more than 100 on a phone, the overhead would be
much higher as we have to run tlb flush 100 times for one single page. 
Plus, tlb flush overhead will increase with the number of CPU cores due to
the bad scalability of tlb shootdown in HW, so those ARM64 servers should
expect much higher overhead.

Further perf annonate shows 95% cpu time of ptep_clear_flush is actually
used by the final dsb() to wait for the completion of tlb flush.  This
provides us a very good chance to leverage the existing batched tlb in
kernel.  The minimum modification is that we only send async tlbi in the
first stage and we send dsb while we have to sync in the second stage.

With the above simplest micro benchmark, collapsed time to finish the
program decreases around 5%.

Typical collapsed time w/o patch:
 ~ # time taskset -c 4 ./a.out
 0.21user 14.34system 0:14.69elapsed
w/ patch:
 ~ # time taskset -c 4 ./a.out
 0.22user 13.45system 0:13.80elapsed

Also tested with benchmark in the commit on Kunpeng920 arm64 server
and observed an improvement around 12.5% with command
`time ./swap_bench`.
        w/o             w/
real    0m13.460s       0m11.771s
user    0m0.248s        0m0.279s
sys     0m12.039s       0m11.458s

Originally it's noticed a 16.99% overhead of ptep_clear_flush()
which has been eliminated by this patch:

[root@localhost yang]# perf record -- ./swap_bench &amp;&amp; perf report
[...]
16.99%  swap_bench  [kernel.kallsyms]  [k] ptep_clear_flush

It is tested on 4,8,128 CPU platforms and shows to be beneficial on
large systems but may not have improvement on small systems like on
a 4 CPU platform.

Also this patch improve the performance of page migration. Using pmbench
and tries to migrate the pages of pmbench between node 0 and node 1 for
100 times for 1G memory, this patch decrease the time used around 20%
(prev 18.338318910 sec after 13.981866350 sec) and saved the time used
by ptep_clear_flush().

Link: https://lkml.kernel.org/r/20230717131004.12662-5-yangyicong@huawei.com
Tested-by: Yicong Yang &lt;yangyicong@hisilicon.com&gt;
Tested-by: Xin Hao &lt;xhao@linux.alibaba.com&gt;
Tested-by: Punit Agrawal &lt;punit.agrawal@bytedance.com&gt;
Signed-off-by: Barry Song &lt;v-songbaohua@oppo.com&gt;
Signed-off-by: Yicong Yang &lt;yangyicong@hisilicon.com&gt;
Reviewed-by: Kefeng Wang &lt;wangkefeng.wang@huawei.com&gt;
Reviewed-by: Xin Hao &lt;xhao@linux.alibaba.com&gt;
Reviewed-by: Anshuman Khandual &lt;anshuman.khandual@arm.com&gt;
Reviewed-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Cc: Anshuman Khandual &lt;anshuman.khandual@arm.com&gt;
Cc: Jonathan Corbet &lt;corbet@lwn.net&gt;
Cc: Nadav Amit &lt;namit@vmware.com&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Anshuman Khandual &lt;khandual@linux.vnet.ibm.com&gt;
Cc: Arnd Bergmann &lt;arnd@arndb.de&gt;
Cc: Barry Song &lt;baohua@kernel.org&gt;
Cc: Darren Hart &lt;darren@os.amperecomputing.com&gt;
Cc: Jonathan Cameron &lt;Jonathan.Cameron@huawei.com&gt;
Cc: lipeifeng &lt;lipeifeng@oppo.com&gt;
Cc: Mark Rutland &lt;mark.rutland@arm.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Ryan Roberts &lt;ryan.roberts@arm.com&gt;
Cc: Steven Miao &lt;realmz6@gmail.com&gt;
Cc: Will Deacon &lt;will@kernel.org&gt;
Cc: Zeng Tao &lt;prime.zeng@hisilicon.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
On x86, batched and deferred tlb shootdown has lead to 90% performance
increase on tlb shootdown.  on arm64, HW can do tlb shootdown without
software IPI.  But sync tlbi is still quite expensive.

Even running a simplest program which requires swapout can
prove this is true,
 #include &lt;sys/types.h&gt;
 #include &lt;unistd.h&gt;
 #include &lt;sys/mman.h&gt;
 #include &lt;string.h&gt;

 int main()
 {
 #define SIZE (1 * 1024 * 1024)
         volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);

         memset(p, 0x88, SIZE);

         for (int k = 0; k &lt; 10000; k++) {
                 /* swap in */
                 for (int i = 0; i &lt; SIZE; i += 4096) {
                         (void)p[i];
                 }

                 /* swap out */
                 madvise(p, SIZE, MADV_PAGEOUT);
         }
 }

Perf result on snapdragon 888 with 8 cores by using zRAM
as the swap block device.

 ~ # perf record taskset -c 4 ./a.out
 [ perf record: Woken up 10 times to write data ]
 [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
 ~ # perf report
 # To display the perf.data header info, please use --header/--header-only options.
 # To display the perf.data header info, please use --header/--header-only options.
 #
 #
 # Total Lost Samples: 0
 #
 # Samples: 60K of event 'cycles'
 # Event count (approx.): 35706225414
 #
 # Overhead  Command  Shared Object      Symbol
 # ........  .......  .................  ......
 #
    21.07%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irq
     8.23%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     6.67%  a.out    [kernel.kallsyms]  [k] filemap_map_pages
     6.16%  a.out    [kernel.kallsyms]  [k] __zram_bvec_write
     5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
     3.71%  a.out    [kernel.kallsyms]  [k] _raw_spin_lock
     3.49%  a.out    [kernel.kallsyms]  [k] memset64
     1.63%  a.out    [kernel.kallsyms]  [k] clear_page
     1.42%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock
     1.26%  a.out    [kernel.kallsyms]  [k] mod_zone_state.llvm.8525150236079521930
     1.23%  a.out    [kernel.kallsyms]  [k] xas_load
     1.15%  a.out    [kernel.kallsyms]  [k] zram_slot_lock

ptep_clear_flush() takes 5.36% CPU in the micro-benchmark swapping in/out
a page mapped by only one process.  If the page is mapped by multiple
processes, typically, like more than 100 on a phone, the overhead would be
much higher as we have to run tlb flush 100 times for one single page. 
Plus, tlb flush overhead will increase with the number of CPU cores due to
the bad scalability of tlb shootdown in HW, so those ARM64 servers should
expect much higher overhead.

Further perf annonate shows 95% cpu time of ptep_clear_flush is actually
used by the final dsb() to wait for the completion of tlb flush.  This
provides us a very good chance to leverage the existing batched tlb in
kernel.  The minimum modification is that we only send async tlbi in the
first stage and we send dsb while we have to sync in the second stage.

With the above simplest micro benchmark, collapsed time to finish the
program decreases around 5%.

Typical collapsed time w/o patch:
 ~ # time taskset -c 4 ./a.out
 0.21user 14.34system 0:14.69elapsed
w/ patch:
 ~ # time taskset -c 4 ./a.out
 0.22user 13.45system 0:13.80elapsed

Also tested with benchmark in the commit on Kunpeng920 arm64 server
and observed an improvement around 12.5% with command
`time ./swap_bench`.
        w/o             w/
real    0m13.460s       0m11.771s
user    0m0.248s        0m0.279s
sys     0m12.039s       0m11.458s

Originally it's noticed a 16.99% overhead of ptep_clear_flush()
which has been eliminated by this patch:

[root@localhost yang]# perf record -- ./swap_bench &amp;&amp; perf report
[...]
16.99%  swap_bench  [kernel.kallsyms]  [k] ptep_clear_flush

It is tested on 4,8,128 CPU platforms and shows to be beneficial on
large systems but may not have improvement on small systems like on
a 4 CPU platform.

Also this patch improve the performance of page migration. Using pmbench
and tries to migrate the pages of pmbench between node 0 and node 1 for
100 times for 1G memory, this patch decrease the time used around 20%
(prev 18.338318910 sec after 13.981866350 sec) and saved the time used
by ptep_clear_flush().

Link: https://lkml.kernel.org/r/20230717131004.12662-5-yangyicong@huawei.com
Tested-by: Yicong Yang &lt;yangyicong@hisilicon.com&gt;
Tested-by: Xin Hao &lt;xhao@linux.alibaba.com&gt;
Tested-by: Punit Agrawal &lt;punit.agrawal@bytedance.com&gt;
Signed-off-by: Barry Song &lt;v-songbaohua@oppo.com&gt;
Signed-off-by: Yicong Yang &lt;yangyicong@hisilicon.com&gt;
Reviewed-by: Kefeng Wang &lt;wangkefeng.wang@huawei.com&gt;
Reviewed-by: Xin Hao &lt;xhao@linux.alibaba.com&gt;
Reviewed-by: Anshuman Khandual &lt;anshuman.khandual@arm.com&gt;
Reviewed-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Cc: Anshuman Khandual &lt;anshuman.khandual@arm.com&gt;
Cc: Jonathan Corbet &lt;corbet@lwn.net&gt;
Cc: Nadav Amit &lt;namit@vmware.com&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Anshuman Khandual &lt;khandual@linux.vnet.ibm.com&gt;
Cc: Arnd Bergmann &lt;arnd@arndb.de&gt;
Cc: Barry Song &lt;baohua@kernel.org&gt;
Cc: Darren Hart &lt;darren@os.amperecomputing.com&gt;
Cc: Jonathan Cameron &lt;Jonathan.Cameron@huawei.com&gt;
Cc: lipeifeng &lt;lipeifeng@oppo.com&gt;
Cc: Mark Rutland &lt;mark.rutland@arm.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Ryan Roberts &lt;ryan.roberts@arm.com&gt;
Cc: Steven Miao &lt;realmz6@gmail.com&gt;
Cc: Will Deacon &lt;will@kernel.org&gt;
Cc: Zeng Tao &lt;prime.zeng@hisilicon.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
