From 91d25ba8a6b0d810dc844cebeedc53029118ce3e Mon Sep 17 00:00:00 2001 From: Ross Zwisler Date: Wed, 6 Sep 2017 16:18:43 -0700 Subject: dax: use common 4k zero page for dax mmap reads When servicing mmap() reads from file holes the current DAX code allocates a page cache page of all zeroes and places the struct page pointer in the mapping->page_tree radix tree. This has three major drawbacks: 1) It consumes memory unnecessarily. For every 4k page that is read via a DAX mmap() over a hole, we allocate a new page cache page. This means that if you read 1GiB worth of pages, you end up using 1GiB of zeroed memory. This is easily visible by looking at the overall memory consumption of the system or by looking at /proc/[pid]/smaps: 7f62e72b3000-7f63272b3000 rw-s 00000000 103:00 12 /root/dax/data Size: 1048576 kB Rss: 1048576 kB Pss: 1048576 kB Shared_Clean: 0 kB Shared_Dirty: 0 kB Private_Clean: 1048576 kB Private_Dirty: 0 kB Referenced: 1048576 kB Anonymous: 0 kB LazyFree: 0 kB AnonHugePages: 0 kB ShmemPmdMapped: 0 kB Shared_Hugetlb: 0 kB Private_Hugetlb: 0 kB Swap: 0 kB SwapPss: 0 kB KernelPageSize: 4 kB MMUPageSize: 4 kB Locked: 0 kB 2) It is slower than using a common zero page because each page fault has more work to do. Instead of just inserting a common zero page we have to allocate a page cache page, zero it, and then insert it. Here are the average latencies of dax_load_hole() as measured by ftrace on a random test box: Old method, using zeroed page cache pages: 3.4 us New method, using the common 4k zero page: 0.8 us This was the average latency over 1 GiB of sequential reads done by this simple fio script: [global] size=1G filename=/root/dax/data fallocate=none [io] rw=read ioengine=mmap 3) The fact that we had to check for both DAX exceptional entries and for page cache pages in the radix tree made the DAX code more complex. Solve these issues by following the lead of the DAX PMD code and using a common 4k zero page instead. As with the PMD code we will now insert a DAX exceptional entry into the radix tree instead of a struct page pointer which allows us to remove all the special casing in the DAX code. Note that we do still pretty aggressively check for regular pages in the DAX radix tree, especially where we take action based on the bits set in the page. If we ever find a regular page in our radix tree now that most likely means that someone besides DAX is inserting pages (which has happened lots of times in the past), and we want to find that out early and fail loudly. This solution also removes the extra memory consumption. Here is that same /proc/[pid]/smaps after 1GiB of reading from a hole with the new code: 7f2054a74000-7f2094a74000 rw-s 00000000 103:00 12 /root/dax/data Size: 1048576 kB Rss: 0 kB Pss: 0 kB Shared_Clean: 0 kB Shared_Dirty: 0 kB Private_Clean: 0 kB Private_Dirty: 0 kB Referenced: 0 kB Anonymous: 0 kB LazyFree: 0 kB AnonHugePages: 0 kB ShmemPmdMapped: 0 kB Shared_Hugetlb: 0 kB Private_Hugetlb: 0 kB Swap: 0 kB SwapPss: 0 kB KernelPageSize: 4 kB MMUPageSize: 4 kB Locked: 0 kB Overall system memory consumption is similarly improved. Another major change is that we remove dax_pfn_mkwrite() from our fault flow, and instead rely on the page fault itself to make the PTE dirty and writeable. The following description from the patch adding the vm_insert_mixed_mkwrite() call explains this a little more: "To be able to use the common 4k zero page in DAX we need to have our PTE fault path look more like our PMD fault path where a PTE entry can be marked as dirty and writeable as it is first inserted rather than waiting for a follow-up dax_pfn_mkwrite() => finish_mkwrite_fault() call. Right now we can rely on having a dax_pfn_mkwrite() call because we can distinguish between these two cases in do_wp_page(): case 1: 4k zero page => writable DAX storage case 2: read-only DAX storage => writeable DAX storage This distinction is made by via vm_normal_page(). vm_normal_page() returns false for the common 4k zero page, though, just as it does for DAX ptes. Instead of special casing the DAX + 4k zero page case we will simplify our DAX PTE page fault sequence so that it matches our DAX PMD sequence, and get rid of the dax_pfn_mkwrite() helper. We will instead use dax_iomap_fault() to handle write-protection faults. This means that insert_pfn() needs to follow the lead of insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag. If 'mkwrite' is set insert_pfn() will do the work that was previously done by wp_page_reuse() as part of the dax_pfn_mkwrite() call path" Link: http://lkml.kernel.org/r/20170724170616.25810-4-ross.zwisler@linux.intel.com Signed-off-by: Ross Zwisler Reviewed-by: Jan Kara Cc: "Darrick J. Wong" Cc: "Theodore Ts'o" Cc: Alexander Viro Cc: Andreas Dilger Cc: Christoph Hellwig Cc: Dan Williams Cc: Dave Chinner Cc: Ingo Molnar Cc: Jonathan Corbet Cc: Matthew Wilcox Cc: Steven Rostedt Cc: Kirill A. Shutemov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/ext4/file.c | 32 +------------------------------- 1 file changed, 1 insertion(+), 31 deletions(-) (limited to 'fs/ext4') diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 0d7cf0cc9b87..f28ac999dfba 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -311,41 +311,11 @@ static int ext4_dax_fault(struct vm_fault *vmf) return ext4_dax_huge_fault(vmf, PE_SIZE_PTE); } -/* - * Handle write fault for VM_MIXEDMAP mappings. Similarly to ext4_dax_fault() - * handler we check for races agaist truncate. Note that since we cycle through - * i_mmap_sem, we are sure that also any hole punching that began before we - * were called is finished by now and so if it included part of the file we - * are working on, our pte will get unmapped and the check for pte_same() in - * wp_pfn_shared() fails. Thus fault gets retried and things work out as - * desired. - */ -static int ext4_dax_pfn_mkwrite(struct vm_fault *vmf) -{ - struct inode *inode = file_inode(vmf->vma->vm_file); - struct super_block *sb = inode->i_sb; - loff_t size; - int ret; - - sb_start_pagefault(sb); - file_update_time(vmf->vma->vm_file); - down_read(&EXT4_I(inode)->i_mmap_sem); - size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; - if (vmf->pgoff >= size) - ret = VM_FAULT_SIGBUS; - else - ret = dax_pfn_mkwrite(vmf); - up_read(&EXT4_I(inode)->i_mmap_sem); - sb_end_pagefault(sb); - - return ret; -} - static const struct vm_operations_struct ext4_dax_vm_ops = { .fault = ext4_dax_fault, .huge_fault = ext4_dax_huge_fault, .page_mkwrite = ext4_dax_fault, - .pfn_mkwrite = ext4_dax_pfn_mkwrite, + .pfn_mkwrite = ext4_dax_fault, }; #else #define ext4_dax_vm_ops ext4_file_vm_ops -- cgit v1.2.3 From d72dc8a25afc71ce90ee92bdd77550e9beb85d4d Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Wed, 6 Sep 2017 16:21:18 -0700 Subject: mm: make pagevec_lookup() update index Make pagevec_lookup() (and underlying find_get_pages()) update index to the next page where iteration should continue. Most callers want this and also pagevec_lookup_tag() already does this. Link: http://lkml.kernel.org/r/20170726114704.7626-3-jack@suse.cz Signed-off-by: Jan Kara Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/ext4/file.c | 4 +--- fs/ext4/inode.c | 8 ++------ 2 files changed, 3 insertions(+), 9 deletions(-) (limited to 'fs/ext4') diff --git a/fs/ext4/file.c b/fs/ext4/file.c index f28ac999dfba..1c73b21fdc13 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -468,7 +468,7 @@ static int ext4_find_unwritten_pgoff(struct inode *inode, unsigned long nr_pages; num = min_t(pgoff_t, end - index, PAGEVEC_SIZE - 1) + 1; - nr_pages = pagevec_lookup(&pvec, inode->i_mapping, index, + nr_pages = pagevec_lookup(&pvec, inode->i_mapping, &index, (pgoff_t)num); if (nr_pages == 0) break; @@ -536,8 +536,6 @@ next: /* The no. of pages is less than our desired, we are done. */ if (nr_pages < num) break; - - index = pvec.pages[i - 1]->index + 1; pagevec_release(&pvec); } while (index <= end); diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index c774bdc22759..b3ce1c6d9f23 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -1720,7 +1720,7 @@ static void mpage_release_unused_pages(struct mpage_da_data *mpd, pagevec_init(&pvec, 0); while (index <= end) { - nr_pages = pagevec_lookup(&pvec, mapping, index, PAGEVEC_SIZE); + nr_pages = pagevec_lookup(&pvec, mapping, &index, PAGEVEC_SIZE); if (nr_pages == 0) break; for (i = 0; i < nr_pages; i++) { @@ -1737,7 +1737,6 @@ static void mpage_release_unused_pages(struct mpage_da_data *mpd, } unlock_page(page); } - index = pvec.pages[nr_pages - 1]->index + 1; pagevec_release(&pvec); } } @@ -2348,7 +2347,7 @@ static int mpage_map_and_submit_buffers(struct mpage_da_data *mpd) pagevec_init(&pvec, 0); while (start <= end) { - nr_pages = pagevec_lookup(&pvec, inode->i_mapping, start, + nr_pages = pagevec_lookup(&pvec, inode->i_mapping, &start, PAGEVEC_SIZE); if (nr_pages == 0) break; @@ -2357,8 +2356,6 @@ static int mpage_map_and_submit_buffers(struct mpage_da_data *mpd) if (page->index > end) break; - /* Up to 'end' pages must be contiguous */ - BUG_ON(page->index != start); bh = head = page_buffers(page); do { if (lblk < mpd->map.m_lblk) @@ -2403,7 +2400,6 @@ static int mpage_map_and_submit_buffers(struct mpage_da_data *mpd) pagevec_release(&pvec); return err; } - start++; } pagevec_release(&pvec); } -- cgit v1.2.3 From dec0da7b608253b004daea1da5bd82b3b292ba0f Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Wed, 6 Sep 2017 16:21:27 -0700 Subject: ext4: use pagevec_lookup_range() in ext4_find_unwritten_pgoff() Use pagevec_lookup_range() in ext4_find_unwritten_pgoff() since we are interested only in pages in the given range. Simplify the logic as a result of not getting pages out of range and index getting automatically advanced. Link: http://lkml.kernel.org/r/20170726114704.7626-6-jack@suse.cz Signed-off-by: Jan Kara Cc: "Theodore Ts'o" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/ext4/file.c | 14 ++++---------- 1 file changed, 4 insertions(+), 10 deletions(-) (limited to 'fs/ext4') diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 1c73b21fdc13..726344809721 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -464,12 +464,11 @@ static int ext4_find_unwritten_pgoff(struct inode *inode, pagevec_init(&pvec, 0); do { - int i, num; + int i; unsigned long nr_pages; - num = min_t(pgoff_t, end - index, PAGEVEC_SIZE - 1) + 1; - nr_pages = pagevec_lookup(&pvec, inode->i_mapping, &index, - (pgoff_t)num); + nr_pages = pagevec_lookup_range(&pvec, inode->i_mapping, + &index, end, PAGEVEC_SIZE); if (nr_pages == 0) break; @@ -488,9 +487,6 @@ static int ext4_find_unwritten_pgoff(struct inode *inode, goto out; } - if (page->index > end) - goto out; - lock_page(page); if (unlikely(page->mapping != inode->i_mapping)) { @@ -533,12 +529,10 @@ next: unlock_page(page); } - /* The no. of pages is less than our desired, we are done. */ - if (nr_pages < num) - break; pagevec_release(&pvec); } while (index <= end); + /* There are no pages upto endoff - that would be a hole in there. */ if (whence == SEEK_HOLE && lastoff < endoff) { found = 1; *offset = lastoff; -- cgit v1.2.3 From 2b85a6171d1370903d65cc8b84cefca26a16b5e4 Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Wed, 6 Sep 2017 16:21:30 -0700 Subject: ext4: use pagevec_lookup_range() in writeback code Both occurences of pagevec_lookup() actually want only pages from a given range. Use pagevec_lookup_range() for the lookup. Link: http://lkml.kernel.org/r/20170726114704.7626-7-jack@suse.cz Signed-off-by: Jan Kara Cc: "Theodore Ts'o" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/ext4/inode.c | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-) (limited to 'fs/ext4') diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index b3ce1c6d9f23..6103ce0430df 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -1720,13 +1720,13 @@ static void mpage_release_unused_pages(struct mpage_da_data *mpd, pagevec_init(&pvec, 0); while (index <= end) { - nr_pages = pagevec_lookup(&pvec, mapping, &index, PAGEVEC_SIZE); + nr_pages = pagevec_lookup_range(&pvec, mapping, &index, end, + PAGEVEC_SIZE); if (nr_pages == 0) break; for (i = 0; i < nr_pages; i++) { struct page *page = pvec.pages[i]; - if (page->index > end) - break; + BUG_ON(!PageLocked(page)); BUG_ON(PageWriteback(page)); if (invalidate) { @@ -2347,15 +2347,13 @@ static int mpage_map_and_submit_buffers(struct mpage_da_data *mpd) pagevec_init(&pvec, 0); while (start <= end) { - nr_pages = pagevec_lookup(&pvec, inode->i_mapping, &start, - PAGEVEC_SIZE); + nr_pages = pagevec_lookup_range(&pvec, inode->i_mapping, + &start, end, PAGEVEC_SIZE); if (nr_pages == 0) break; for (i = 0; i < nr_pages; i++) { struct page *page = pvec.pages[i]; - if (page->index > end) - break; bh = head = page_buffers(page); do { if (lblk < mpd->map.m_lblk) -- cgit v1.2.3 From 397162ffa2ed1cadffe05c324c6ddc53647f9c62 Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Wed, 6 Sep 2017 16:21:43 -0700 Subject: mm: remove nr_pages argument from pagevec_lookup{,_range}() All users of pagevec_lookup() and pagevec_lookup_range() now pass PAGEVEC_SIZE as a desired number of pages. Just drop the argument. Link: http://lkml.kernel.org/r/20170726114704.7626-11-jack@suse.cz Signed-off-by: Jan Kara Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/ext4/file.c | 2 +- fs/ext4/inode.c | 5 ++--- 2 files changed, 3 insertions(+), 4 deletions(-) (limited to 'fs/ext4') diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 726344809721..484c1326dd6a 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -468,7 +468,7 @@ static int ext4_find_unwritten_pgoff(struct inode *inode, unsigned long nr_pages; nr_pages = pagevec_lookup_range(&pvec, inode->i_mapping, - &index, end, PAGEVEC_SIZE); + &index, end); if (nr_pages == 0) break; diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 6103ce0430df..ce68a3483666 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -1720,8 +1720,7 @@ static void mpage_release_unused_pages(struct mpage_da_data *mpd, pagevec_init(&pvec, 0); while (index <= end) { - nr_pages = pagevec_lookup_range(&pvec, mapping, &index, end, - PAGEVEC_SIZE); + nr_pages = pagevec_lookup_range(&pvec, mapping, &index, end); if (nr_pages == 0) break; for (i = 0; i < nr_pages; i++) { @@ -2348,7 +2347,7 @@ static int mpage_map_and_submit_buffers(struct mpage_da_data *mpd) pagevec_init(&pvec, 0); while (start <= end) { nr_pages = pagevec_lookup_range(&pvec, inode->i_mapping, - &start, end, PAGEVEC_SIZE); + &start, end); if (nr_pages == 0) break; for (i = 0; i < nr_pages; i++) { -- cgit v1.2.3