summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2005-07-07[PATCH] propagate __nocast annotationsAlexey Dobriyan
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-07[PATCH] mm: quieten OOM killer noiseAnton Blanchard
We now print statistics when invoking the OOM killer, however this information is not rate limited and you can get into situations where the console is continually spammed. For example, when a task is exiting the OOM killer will simply return (waiting for that task to exit and clear up memory). If the VM continually calls back into the OOM killer we get thousands of copies of show_mem() on the console. Use printk_ratelimit() to quieten it. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-07[PATCH] remove completly bogus comment inside __alloc_pages() ↵Marcelo Tosatti
try_to_free_pages handling Remove completly bogus comment from did_some_progress != 0 handling (that same comment is a few lines below on did_some_progress = 0 case, where it belongs). Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-07[PATCH] print order information when OOM killingMarcelo Tosatti
Dump the current allocation order when OOM killing. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-06[PATCH] Fix broken kmalloc_node in rc1/rc2Christoph Lameter
This patch used to be in Andrew's tree before the NUMA slab allocator went in. Either this patch or the NUMA slab allocator is needed in order for kmalloc_node to work correctly. pcibus_to_node may be used to generate the node information passed to kmalloc_node. pcibus_to_node returns -1 if it was not able to determine on which node a pcibus is located. For that case kmalloc_node must work like kmalloc. Signed-off-by: Christoph Lameter <christoph@lameter.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-28[PATCH] rename wakeup_bdflush to wakeup_pdflushPekka J Enberg
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-27[PATCH] fix WANT_PAGE_VIRTUAL in memmap_initBob Picco
I spotted this issue while in memmap_init last week. I can't say the change has any test coverage by me. start_pfn was formerly used in main "for" loop. The fix is replace start_pfn with pfn. Signed-off-by: Bob Picco <bob.picco@hp.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25Merge Christoph's freeze cleanup patchLinus Torvalds
2005-06-25[PATCH] Cleanup patch for process freezingChristoph Lameter
1. Establish a simple API for process freezing defined in linux/include/sched.h: frozen(process) Check for frozen process freezing(process) Check if a process is being frozen freeze(process) Tell a process to freeze (go to refrigerator) thaw_process(process) Restart process frozen_process(process) Process is frozen now 2. Remove all references to PF_FREEZE and PF_FROZEN from all kernel sources except sched.h 3. Fix numerous locations where try_to_freeze is manually done by a driver 4. Remove the argument that is no longer necessary from two function calls. 5. Some whitespace cleanup 6. Clear potential race in refrigerator (provides an open window of PF_FREEZE cleared before setting PF_FROZEN, recalc_sigpending does not check PF_FROZEN). This patch does not address the problem of freeze_processes() violating the rule that a task may only modify its own flags by setting PF_FREEZE. This is not clean in an SMP environment. freeze(process) is therefore not SMP safe! Signed-off-by: Christoph Lameter <christoph@lameter.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25[PATCH] Use ALIGN to remove duplicate codeNick Wilson
This patch makes use of ALIGN() to remove duplicate round-up code. Signed-off-by: Nick Wilson <njw@osdl.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25[PATCH] kdump: Retrieve saved max pfnVivek Goyal
This patch retrieves the max_pfn being used by previous kernel and stores it in a safe location (saved_max_pfn) before it is overwritten due to user defined memory map. This pfn is used to make sure that user does not try to read the physical memory beyond saved_max_pfn. Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25[PATCH] fix for generic_file_write iov problemBadari Pulavarty
Here is the fix for the problem described in http://bugzilla.kernel.org/show_bug.cgi?id=4721 Basically, problem is generic_file_buffered_write() is accessing beyond end of the iov[] vector after handling the last vector. If we happen to cross page boundary, we get a fault. I think this simple patch is good enough. If we really don't want to depend on the "count", then we need pass nr_segs to filemap_set_next_iovec() and decrement it and check it. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25[PATCH] swsusp: kill config_pm_diskPavel Machek
CONFIG_PM_DISK is long gone, but it still managed to survived at few places. Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25[PATCH] mm: fix remap_pte_range BUGHugh Dickins
Out-of-tree user of remap_pfn_range hit kernel BUG at mm/memory.c:1112! It passes an unrounded size to remap_pfn_range, which was okay before 2.6.12, but misses remap_pte_range's new end condition. An audit of all the other ptwalks confirms that this is the only one so exposed. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-25[PATCH] Fix the error handling in direct I/OHifumi Hisashi
Fix a bug on error handling in the direct I/O function. Currently, if a file is opened with the O_DIRECT|O_SYNC flag, the write() syscall cannot receive the EIO error after an I/O error (SCSI cable is disconnected etc.). Return values of other points that call generic_osync_inode() are treated appropriately. Signed-off-by: Hisashi Hifumi <hifumi.hisashi@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-24[PATCH] xip: madvice/fadvice: execute in placeCarsten Otte
Make sys_madvice/fadvice return sane with xip. Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-24[PATCH] xip: reduce code duplicationCarsten Otte
This patch reworks filemap_xip.c with the goal to reduce code duplication from mm/filemap.c. It applies agains 2.6.12-rc6-mm1. Instead of implementing the aio functions, this one implements the synchronous read/write functions only. For readv and writev, the generic fallback is used. For aio, we rely on the application doing the fallback. Since our "synchronous" function does memcpy immediately anyway, there is no performance difference between using the fallbacks or implementing each operation. Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-24[PATCH] xip: fs/mm: execute in placeCarsten Otte
- generic_file* file operations do no longer have a xip/non-xip split - filemap_xip.c implements a new set of fops that require get_xip_page aop to work proper. all new fops are exported GPL-only (don't like to see whatever code use those except GPL modules) - __xip_unmap now uses page_check_address, which is no longer static in rmap.c, and defined in linux/rmap.h - mm/filemap.h is now much more clean, plainly having just Linus' inline funcs moved here from filemap.c - fix includes in filemap_xip to make it build cleanly on i386 Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-24[PATCH] DocBook: update commentsMartin Waitz
This patch updates some comments to match code changes. Signed-off-by: Martin Waitz <tali@admingilde.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23[PATCH] Remove f_error field from struct fileChristoph Lameter
The following patch removes the f_error field and all checks of f_error. Trond said: f_error was introduced for NFS, and made sense when we were guaranteed always to have a file pointer around when write errors occurred. Since then, we have (for various reasons) had to introduce the nfs_open_context in order to track the file read/write state, and it made sense to move our f_error tracking there too. Signed-off-by: Christoph Lameter <christoph@lameter.com> Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23[PATCH] mempool - only init waitqueue in slow pathBenjamin LaHaise
Here's a small patch to improve the performance of mempool_alloc by only initializing the wait queue when we're about to wait. Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23[PATCH] remove redundant vm_flags clearing from madvise.cPekka Enberg
This patch removes redundant VM_ClearReadHint from mm/madvice.c which was left there by Prasanna's patch. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23[PATCH] create a kstrdup library functionPaulo Marques
This patch creates a new kstrdup library function and changes the "local" implementations in several places to use this function. Most of the changes come from the sound and net subsystems. The sound part had already been acknowledged by Takashi Iwai and the net part by David S. Miller. I left UML alone for now because I would need more time to read the code carefully before making changes there. Signed-off-by: Paulo Marques <pmarques@grupopie.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23[PATCH] NUMA aware block device control structure allocationChristoph Lameter
Patch to allocate the control structures for for ide devices on the node of the device itself (for NUMA systems). The patch depends on the Slab API change patch by Manfred and me (in mm) and the pcidev_to_node patch that I posted today. Does some realignment too. Signed-off-by: Justin M. Forbes <jmforbes@linuxtx.org> Signed-off-by: Christoph Lameter <christoph@lameter.com> Signed-off-by: Pravin Shelar <pravin@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23[PATCH] sparsemem hotplug baseAndy Whitcroft
Make sparse's initalization be accessible at runtime. This allows sparse mappings to be created after boot in a hotplug situation. This patch is separated from the previous one just to give an indication how much of the sparse infrastructure is *just* for hotplug memory. The section_mem_map doesn't really store a pointer. It stores something that is convenient to do some math against to get a pointer. It isn't valid to just do *section_mem_map, so I don't think it should be stored as a pointer. There are a couple of things I'd like to store about a section. First of all, the fact that it is !NULL does not mean that it is present. There could be such a combination where section_mem_map *is* NULL, but the math gets you properly to a real mem_map. So, I don't think that check is safe. Since we're storing 32-bit-aligned structures, we have a few bits in the bottom of the pointer to play with. Use one bit to encode whether there's really a mem_map there, and the other one to tell whether there's a valid section there. We need to distinguish between the two because sometimes there's a gap between when a section is discovered to be present and when we can get the mem_map for it. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Jack Steiner <steiner@sgi.com> Signed-off-by: Bob Picco <bob.picco@hp.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23[PATCH] sparsemem swiss cheese numa layoutsAndy Whitcroft
The part of the sparsemem patch which modifies memmap_init_zone() has recently become a problem. It changes behavior so that there is a call to pfn_to_page() for each individual page inside of a node's range: node_start_pfn through node_end_pfn. It used to simply do this once, at the beginning of the node, but having sparsemem's non-contiguous mem_map[]s inside of a node made it necessary to change. Mike Kravetz recently wrote a patch which made the NUMA code accept some new kinds of layouts. The system's memory was laid out like this, with node 0's memory in two pieces: one before and one after node 1's memory: Node 0: +++++ +++++ Node 1: +++++ Previous behavior before Mike's patch was to assign nodes like this: Node 0: 00000 XXXXX Node 1: 11111 Where the 'X' areas were simply thrown away. The new behavior was to make the pg_data_t span node 0 across all of its areas, including areas that are really node 1's: Node 0: 000000000000000 Node 1: 11111 This wastes a little bit of mem_map space, but ends up being OK, and more fully utilizes the system's memory. memmap_init_zone() initializes all of the "struct page"s for node 0, even for the "hole", but those never get used, because there is no pfn_to_page() that resolves to those pages. However, only calling pfn_to_page() once, memmap_init_zone() always uses the pages that were allocated for node0->node_mem_map because: struct page *start = pfn_to_page(start_pfn); // effectively start = &node->node_mem_map[0] for (page = start; page < (start + size); page++) { init_page_here();... page++; } Slow, and wasteful, but generally harmless. But, modify that to call pfn_to_page() for each loop iteration (like sparsemem does): for (pfn = start_pfn; pfn < < (start_pfn + size); pfn++++) { page = pfn_to_page(pfn); } And you end up trying to initialize node 1's pages too early, along with bogus data from node 0. This patch checks for those weird layouts and declines to touch the pages, making the more frequent pfn_to_page() calls OK to do. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23[PATCH] sparsemem memory modelAndy Whitcroft
Sparsemem abstracts the use of discontiguous mem_maps[]. This kind of mem_map[] is needed by discontiguous memory machines (like in the old CONFIG_DISCONTIGMEM case) as well as memory hotplug systems. Sparsemem replaces DISCONTIGMEM when enabled, and it is hoped that it can eventually become a complete replacement. A significant advantage over DISCONTIGMEM is that it's completely separated from CONFIG_NUMA. When producing this patch, it became apparent in that NUMA and DISCONTIG are often confused. Another advantage is that sparse doesn't require each NUMA node's ranges to be contiguous. It can handle overlapping ranges between nodes with no problems, where DISCONTIGMEM currently throws away that memory. Sparsemem uses an array to provide different pfn_to_page() translations for each SECTION_SIZE area of physical memory. This is what allows the mem_map[] to be chopped up. In order to do quick pfn_to_page() operations, the section number of the page is encoded in page->flags. Part of the sparsemem infrastructure enables sharing of these bits more dynamically (at compile-time) between the page_zone() and sparsemem operations. However, on 32-bit architectures, the number of bits is quite limited, and may require growing the size of the page->flags type in certain conditions. Several things might force this to occur: a decrease in the SECTION_SIZE (if you want to hotplug smaller areas of memory), an increase in the physical address space, or an increase in the number of used page->flags. One thing to note is that, once sparsemem is present, the NUMA node information no longer needs to be stored in the page->flags. It might provide speed increases on certain platforms and will be stored there if there is room. But, if out of room, an alternate (theoretically slower) mechanism is used. This patch introduces CONFIG_FLATMEM. It is used in almost all cases where there used to be an #ifndef DISCONTIG, because SPARSEMEM and DISCONTIGMEM often have to compile out the same areas of code. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Martin Bligh <mbligh@aracnet.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Bob Picco <bob.picco@hp.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23[PATCH] generify memory presentAndy Whitcroft
Allow architectures to indicate that they will be providing hooks to indice installed memory areas, memory_present(). Provide prototypes for the i386 implementation. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Martin Bligh <mbligh@aracnet.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23[PATCH] mm/Kconfig: give DISCONTIG more help textDave Hansen
This gives DISCONTIGMEM a bit more help text to explain what it does, not just when to choose it. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23[PATCH] mm/Kconfig: hide "Memory Model" selection menuDave Hansen
I got some feedback from users who think that the new "Memory Model" menu is a little invasive. This patch will hide that menu, except when CONFIG_EXPERIMENTAL is enabled *or* when an individual architecture wants it. An individual arch may want to enable it because they've removed their arch-specific DISCONTIG prompt in favor of the mm/Kconfig one. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23[PATCH] sparsemem: fix minor "defaults" issue in mm/KconfigDave Hansen
The following patch applies on top of 2.6.12-rc2-mm1. It fixes a minor user interaction issue, and an early reference to SPARSEMEM. This "choice" menu would always default to FLATMEM, as it was listed first. Move it to the end so that the other defaults have a chance first. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23[PATCH] Introduce new Kconfig option for NUMA or DISCONTIGDave Hansen
There is some confusion that arose when working on SPARSEMEM patch between what is needed for DISCONTIG vs. NUMA. Multiple pg_data_t's are needed for DISCONTIGMEM or NUMA, independently. All of the current NUMA implementations require an implementation of DISCONTIG. Because of this, quite a lot of code which is really needed for NUMA is actually under DISCONTIG #ifdefs. For SPARSEMEM, we changed some of these #ifdefs to CONFIG_NUMA, but that broke the DISCONTIG=y and NUMA=n case. Introducing this new NEED_MULTIPLE_NODES config option allows code that is needed for both NUMA or DISCONTIG to be separated out from code that is specific to DISCONTIG. One great advantage of this approach is that it doesn't require every architecture to be converted over. All of the current implementations should "just work", only the ones implementing SPARSEMEM will have to be fixed up. The change to free_area_init() makes it work inside, or out of the new config option. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23[PATCH] create mm/Kconfig for arch-independent memory optionsDave Hansen
With sparsemem being introduced, we need a central place for new memory-related .config options: mm/Kconfig. This allows us to remove many of the duplicated arch-specific options. The new option, CONFIG_FLATMEM, is there to enable us to detangle NUMA and DISCONTIGMEM. This is a requirement for sparsemem because sparsemem uses the NUMA code without the presence of DISCONTIGMEM. The sparsemem patches use CONFIG_FLATMEM in generic code, so this patch is a requirement before applying them. Almost all places that used to do '#ifndef CONFIG_DISCONTIGMEM' should use '#ifdef CONFIG_FLATMEM' instead. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23[PATCH] sparsemem base: reorganize page->flags bit operationsDave Hansen
Generify the value fields in the page_flags. The aim is to allow the location and size of these fields to be varied. Additionally we want to move away from fixed allocations per field whilst still enforcing the overall bit utilisation limits. We rely on the compiler to spot and optimise the accessor functions. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23[PATCH] sparsemem base: simple NUMA remap space allocatorDave Hansen
Introduce a simple allocator for the NUMA remap space. This space is very scarce, used for structures which are best allocated node local. This mechanism is also used on non-NUMA ia64 systems with a vmem_map to keep the pgdat->node_mem_map initialized in a consistent place for all architectures. Issues: o alloc_remap takes a node_id where we might expect a pgdat which was intended to allow us to allocate the pgdat's using this mechanism; which we do not yet do. Could have alloc_remap_node() and alloc_remap_nid() for this purpose. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-22[PATCH] boot_pageset must not be freed.Christoph Lameter
The boot_pageset needs to be preserved for hotplugging and for off line processors and nodes. Otherwise pointers will point into memory that has now a different use. /proc/zoneinfo is currently showing strange results if processors / nodes are not present. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] Kill stray newlineDenis Vlasenko
OOM killer prints a stray newline. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] msync: check pte dirty earlierAbhijit Karmarkar
It's common practice to msync a large address range regularly, in which often only a few ptes have actually been dirtied since the previous pass. sync_pte_range then goes much faster if it tests whether pte is dirty before locating and accessing each struct page cacheline; and it is hardly slowed by ptep_clear_flush_dirty repeating that test in the opposite case, when every pte actually is dirty. But beware, s390's pte_dirty always says false, since its dirty bit is kept in the storage key, located via the struct page address. So skip this optimization in its case: use a pte_maybe_dirty macro which just says true if page_test_and_clear_dirty is implemented. Signed-off-by: Abhijit Karmarkar <abhijitk@veritas.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] can_share_swap_page: use page_mapcountHugh Dickins
Remember that ironic get_user_pages race? when the raised page_count on a page swapped out led do_wp_page to decide that it had to copy on write, so substituted a different page into userspace. 2.6.7 onwards have Andrea's solution, where try_to_unmap_one backs out if it finds page_count raised. Which works, but is unsatisfying (rmap.c has no other page_count heuristics), and was found a few months ago to hang an intensive page migration test. A year ago I was hesitant to engage page_mapcount, now it seems the right fix. So remove the page_count hack from try_to_unmap_one; and use activate_page in unuse_mm when dropping lock, to replace its secondary effect of helping swapoff to make progress in that case. Simplify can_share_swap_page (now called only on anonymous pages) to check page_mapcount + page_swapcount == 1: still needs the page lock to stabilize their (pessimistic) sum, but does not need swapper_space.tree_lock for that. In do_swap_page, move swap_free and unlock_page below page_add_anon_rmap, to keep sum on the high side, and correct when can_share_swap_page called. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] do_wp_page: cannot share file pageHugh Dickins
A small optimization to do_wp_page's check for whether to avoid copy by reusing the page already mapped. It can never share a cached file page, nor can it share a reserved page (often the empty zero page), so it's a waste of time to lock and unlock in those cases. Which nowadays can both be neatly excluded by a preliminary PageAnon test. Christoph has reported that a preliminary page_count test proved valuable for scalability here, but PageAnon covers more common cases all at once. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] get_user_pages: kill get_page_mapHugh Dickins
Since its birth, get_user_pages has been calling a misguided get_page_map function. follow_page has already returned NULL if the pfn is invalid, we cannot reach an invalid pfn from a validated struct page. Remove get_page_map, and the messy rewind in get_user_pages to cope with its failure. Oh, and could we please call that "struct page *page" like everywhere else, instead of "struct page *map"? Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] bad_page: clear reclaim and slabHugh Dickins
Since free_pages_check complains if PG_reclaim or PG_slab is set, bad_page ought to clear them to avoid repetitive reports (Nikita noticed this too). Let prep_new_page check page_count and PG_slab as free_pages_check does. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] mbind: check_range use standard ptwalkHugh Dickins
Strict mbind's check for currently mapped pages being on node has been using a slow loop which re-evaluates pgd, pud, pmd, pte for each entry: replace that by a standard four-level page table walk like others in mm. Since mmap_sem is held for writing, page_table_lock can be taken at the inner level to limit latency. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] mbind: fix verify_pages pte_pageHugh Dickins
Strict mbind's check that pages already mapped are on right node has been using pte_page without checking if pfn_valid, and without page_table_lock to prevent spurious failures when try_to_unmap_one intervenes between the pte_present and the pte_page. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] shmem: restore superblock infoHugh Dickins
To improve shmem scalability, we allowed tmpfs instances which don't need their blocks or inodes limited not to count them, and not to allocate any sbinfo. Which was okay when the only use for the sbinfo was accounting blocks and inodes; but since then a couple of unrelated projects extending tmpfs want to store other data in the sbinfo. Whether either extension reaches mainline is beside the point: I'm guilty of a bad design decision, and should restore sbinfo to make any such future extensions easier. So, once again allocate a shmem_sb_info for every shmem/tmpfs instance, and now let max_blocks 0 indicate unlimited blocks, and max_inodes 0 unlimited inodes. Brent Casavant verified (many months ago) that this does not perceptibly impact the scalability (since the unlimited sbinfo cacheline is repeatedly accessed but only once dirtied). And merge shmem_set_size into its sole caller shmem_remount_fs. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] Reduce size of huge boot per_cpu_pagesetChristoph Lameter
Reduce size of the huge per_cpu_pageset structure in __initdata introduced into mm1 with the pageset localization patchset. Use one specially configured pageset per cpu for all zones and nodes during bootup. - Avoid duplication of pageset initialization code. - do the adding to the pageset list before potential free_pages_bulk in free_hot_cold_page (otherwise we would have to hold a page in a pageset during the period that the boot pagesets are in use). - remove mistaken __cpuinitdata attribute and revert back to __initdata for the boot pageset. A boot pageset is not necessary for cpu hotplug. Tested for UP SMP NUMA on x86_64 (2.6.12-rc6-mm1): UP SMP NUMA Tested on IA64 (2.6.12-rc5-mm2): NUMA (2.6.12-rc6-mm1 broken for IA64 because of sparsemem patches) Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] Periodically drain non local pagesetsChristoph Lameter
The pageset array can potentially acquire a huge amount of memory on large NUMA systems. F.e. on a system with 512 processors and 256 nodes there will be 256*512 pagesets. If each pageset only holds 5 pages then we are talking about 655360 pages.With a 16K page size on IA64 this results in potentially 10 Gigabytes of memory being trapped in pagesets. The typical cases are much less for smaller systems but there is still the potential of memory being trapped in off node pagesets. Off node memory may be rarely used if local memory is available and so we may potentially have memory in seldom used pagesets without this patch. The slab allocator flushes its per cpu caches every 2 seconds. The following patch flushes the off node pageset caches in the same way by tying into the slab flush. The patch also changes /proc/zoneinfo to include the number of pages currently in each pageset. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] add OOM debugJanet Morgan
This patch provides more debug info when the system is OOM. It displays memory stats (basically sysrq-m info) from __alloc_pages() when page allocation fails and during OOM kill. Thanks to Dave Jones for coming up with the idea. Signed-off-by: Janet Morgan <janetmor@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] __read_page_state(): pass unsigned long instead of unsignedBenjamin LaHaise
By making the offset argument of __read_page_state an unsigned long instead of unsigned, we can avoid forcing the compiler to sign extend a usually constant argument. This saves 1 instruction on x86-64. Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21[PATCH] __mod_page_state(): pass unsigned long instead of unsignedBenjamin LaHaise
By making the offset argument of __mod_page_state an unsigned long instead of unsigned, we can avoid forcing the compiler to sign extend a usually constant argument. This saves 1 instruction on x86-64. Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>