linux-toradex.git/mm, branch v2.6.17.2

[PATCH] tmpfs: Decrement i_nlink correctly in shmem_rmdir()

2006-06-12T21:29:04+00:00

shmem_rmdir() must undo the increment of i_nlink done in
shmem_get_inode() for directories, otherwise at least
IN_DELETE_SELF inotify event generation is broken.

Signed-off-by: Sergey Vlasov 
Signed-off-by: Hugh Dickins 
Signed-off-by: Linus Torvalds

[PATCH] tmpfs: time granularity fix for [acm]time going backwards

2006-06-12T20:55:52+00:00

I noticed a strange behavior in a tmpfs file system the other day, while
building packages - occasionally, and seemingly at random, make decided to
rebuild a target. However, only on tmpfs.

A file would be created, and if checked, it had a sub-second timestamp.
However, after an utimes related call where sub-seconds should be set, they
were zeroed instead. In the case that a file was created, and utimes(...,NULL)
was used on it in the same second, the timestamp on the file moved backwards.

After some digging, I found that this was being caused by tmpfs not having a
time granularity set, thus inheriting the default 1 second granularity.

Hugh adds: yes, we missed tmpfs when the s_time_gran mods went into 2.6.11.
Unfortunately, the granularity of CURRENT_TIME, often used in filesystems,
does not match the default granularity set by alloc_super.  A few more such
discrepancies have been found, but this is the most important to fix now.

Signed-off-by: Robin H. Johnson 
Acked-by: Andi Kleen 
Signed-off-by: Hugh Dickins 
Signed-off-by: Linus Torvalds

[PATCH] typo in vmscan.c

2006-06-11T22:27:37+00:00

From: Christoph Lameter 

Looks like a comma was left from the conversion from a struct to an
assignment.

Signed-off-by: Christoph Lameter 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

[PATCH] slab.c: fix offslab_limit bug

2006-06-02T18:21:10+00:00

mm/slab.c's offlab_limit logic is totally broken.

Firstly, "offslab_limit" is a global variable while it should either be
calculated in situ or should be passed in as a parameter.

Secondly, the more serious problem with it is that the condition for
calculating it:

               if (!(OFF_SLAB(sizes->cs_cachep))) {
                       offslab_limit = sizes->cs_size - sizeof(struct slab);
                       offslab_limit /= sizeof(kmem_bufctl_t);

is in total disconnect with the condition that makes use of it:

               /* More than offslab_limit objects will cause problems */
               if ((flags & CFLGS_OFF_SLAB) && num > offslab_limit)
                       break;

but due to offslab_limit being a global variable this breakage was
hidden.

Up until lockdep came along and perturbed the slab sizes sufficiently so
that the first off-slab cache would still see a (non-calculated) zero
value for offslab_limit and would panic with:

  kmem_cache_create: couldn't create cache size-512.

  Call Trace:
   [] show_trace+0x96/0x1c8
   [] dump_stack+0x13/0x15
   [] panic+0x39/0x21a
   [] kmem_cache_create+0x5a0/0x5d0
   [] kmem_cache_init+0x193/0x379
   [] start_kernel+0x17f/0x218
   [] _sinittext+0x263/0x26a

  Kernel panic - not syncing: kmem_cache_create(): failed to create slab `size-512'

Paolo Ornati's config on x86_64 managed to trigger it.

The fix is to move the calculation to the place that makes use of it.
This also makes slab.o 54 bytes smaller.

Btw., the check itself is quite silly. Its intention is to test whether
the number of objects per slab would be higher than the number of slab
control pointers possible. In theory it could be triggered: if someone
tried to allocate 4-byte objects cache and explicitly requested with
CFLGS_OFF_SLAB. So i kept the check.

Out of historic interest i checked how old this bug was and it's
ancient, 10 years old! It is the oldest hidden and then truly triggering
bugs i ever saw being fixed in the kernel!

Signed-off-by: Ingo Molnar 
Signed-off-by: Linus Torvalds

[PATCH] spanned_pages is not updated at a case of memory hot-add

2006-05-31T23:27:10+00:00

From: Yasunori Goto 

If hot-added memory's address is smaller than old area, spanned_pages will
not be updated.  It must be fixed.

example) Old zone_start_pfn = 0x60000, and spanned_pages = 0x10000
         Added new memory's start_pfn = 0x50000, and end_pfn = 0x60000

  new spanned_pages will be still 0x10000 by old code.
  (It should be updated to 0x20000.) Because old_zone_end_pfn will be
  0x70000, and end_pfn smaller than it. So, spanned_pages will not be
  updated.

In current code, spanned_pages is updated only when end_pfn is updated.
But, it should be updated by subtraction between bigger end_pfn and new
zone_start_pfn.

Signed-off-by: Yasunori Goto 
Signed-off-by: Dave Hansen 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

[PATCH] Align the node_mem_map endpoints to a MAX_ORDER boundary

2006-05-21T19:59:22+00:00

Andy added code to buddy allocator which does not require the zone's
endpoints to be aligned to MAX_ORDER.  An issue is that the buddy allocator
requires the node_mem_map's endpoints to be MAX_ORDER aligned.  Otherwise
__page_find_buddy could compute a buddy not in node_mem_map for partial
MAX_ORDER regions at zone's endpoints.  page_is_buddy will detect that
these pages at endpoints are not PG_buddy (they were zeroed out by bootmem
allocator and not part of zone).  Of course the negative here is we could
waste a little memory but the positive is eliminating all the old checks
for zone boundary conditions.

SPARSEMEM won't encounter this issue because of MAX_ORDER size constraint
when SPARSEMEM is configured.  ia64 VIRTUAL_MEM_MAP doesn't need the logic
either because the holes and endpoints are handled differently.  This
leaves checking alloc_remap and other arches which privately allocate for
node_mem_map.

Signed-off-by: Bob Picco 
Acked-by: Mel Gorman 
Cc: Dave Hansen 
Cc: Andy Whitcroft 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

[PATCH] Cpuset: might sleep checking zones allowed fix

2006-05-21T19:59:18+00:00

Fix a couple of infrequently encountered 'sleeping function called from
invalid context' in the cpuset hooks in __alloc_pages.  Could sleep while
interrupts disabled.

The routine cpuset_zone_allowed() is called by code in mm/page_alloc.c
__alloc_pages() to determine if a zone is allowed in the current tasks
cpuset.  This routine can sleep, for certain GFP_KERNEL allocations, if the
zone is on a memory node not allowed in the current cpuset, but might be
allowed in a parent cpuset.

But we can't sleep in __alloc_pages() if in interrupt, nor if called for a
GFP_ATOMIC request (__GFP_WAIT not set in gfp_flags).

The rule was intended to be:
  Don't call cpuset_zone_allowed() if you can't sleep, unless you
  pass in the __GFP_HARDWALL flag set in gfp_flag, which disables
  the code that might scan up ancestor cpusets and sleep.

This rule was being violated in a couple of places, due to a bogus change
made (by myself, pj) to __alloc_pages() as part of the November 2005 effort
to cleanup its logic, and also due to a later fix to constrain which swap
daemons were awoken.

The bogus change can be seen at:
  http://linux.derkeiler.com/Mailing-Lists/Kernel/2005-11/4691.html
  [PATCH 01/05] mm fix __alloc_pages cpuset ALLOC_* flags

This was first noticed on a tight memory system, in code that was disabling
interrupts and doing allocation requests with __GFP_WAIT not set, which
resulted in __might_sleep() writing complaints to the log "Debug: sleeping
function called ...", when the code in cpuset_zone_allowed() tried to take
the callback_sem cpuset semaphore.

We haven't seen a system hang on this 'might_sleep' yet, but we are at
decent risk of seeing it fairly soon, especially since the additional
cpuset_zone_allowed() check was added, conditioning wakeup_kswapd(), in
March 2006.

Special thanks to Dave Chinner, for figuring this out, and a tip of the hat
to Nick Piggin who warned me of this back in Nov 2005, before I was ready
to listen.

Signed-off-by: Paul Jackson 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

[PATCH] SPARSEMEM incorrectly calculates section number

2006-05-21T19:59:17+00:00

A bad calculation/loop in __section_nr() could result in incorrect section
information being put into sysfs memory entries.  This primarily impacts
memory add operations as the sysfs information is used while onlining new
memory.

Fix suggested by Dave Hansen.

Note that the bug may not be obvious from the patch.  It actually occurs in
the function's return statement:

	return (root_nr * SECTIONS_PER_ROOT) + (ms - root);

In the existing code, root_nr has already been multiplied by
SECTIONS_PER_ROOT.

Signed-off-by: Mike Kravetz 
Cc: Dave Hansen 
Cc: Andy Whitcroft 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

[PATCH] slab: Fix kmem_cache_destroy() on NUMA

2006-05-16T14:59:32+00:00

With CONFIG_NUMA set, kmem_cache_destroy() may fail and say "Can't
free all objects."  The problem is caused by sequences such as the
following (suppose we are on a NUMA machine with two nodes, 0 and 1):

 * Allocate an object from cache on node 0.
 * Free the object on node 1.  The object is put into node 1's alien
   array_cache for node 0.
 * Call kmem_cache_destroy(), which ultimately ends up in __cache_shrink().
 * __cache_shrink() does drain_cpu_caches(), which loops through all nodes.
   For each node it drains the shared array_cache and then handles the
   alien array_cache for the other node.

However this means that node 0's shared array_cache will be drained,
and then node 1 will move the contents of its alien[0] array_cache
into that same shared array_cache.  node 0's shared array_cache is
never looked at again, so the objects left there will appear to be in
use when __cache_shrink() calls __node_shrink() for node 0.  So
__node_shrink() will return 1 and kmem_cache_destroy() will fail.

This patch fixes this by having drain_cpu_caches() do
drain_alien_cache() on every node before it does drain_array() on the
nodes' shared array_caches.

The problem was originally reported by Or Gerlitz .

Signed-off-by: Roland Dreier 
Acked-by: Christoph Lameter 
Acked-by: Pekka Enberg 
Signed-off-by: Linus Torvalds

[PATCH] add slab_is_available() routine for boot code

2006-05-15T18:20:56+00:00

slab_is_available() indicates slab based allocators are available for use.
SPARSEMEM code needs to know this as it can be called at various times
during the boot process.

Signed-off-by: Mike Kravetz 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds