1. 24 Jul, 2008 8 commits
  2. 08 Jul, 2008 6 commits
  3. 03 Jul, 2008 1 commit
  4. 26 Jun, 2008 1 commit
  5. 10 Jun, 2008 1 commit
    • mm, x86: shrink_active_range() should check all · cc1a9d86
      Yinghai Lu authored
      
      Now we are using register_e820_active_regions() instead of calling
      add_active_range() directly, so the end_pfn recorded in
      early_node_map can differ from node_end_pfn.
      
      So we need to make shrink_active_range() smarter.
      
      shrink_active_range() is a generic MM function in mm/page_alloc.c but
      it is only used on 32-bit x86. Should we move it back to some file in
      arch/x86?
      Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
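
      A minimal sketch of the smarter version, assuming the era's
      early_node_map[]/nr_nodemap_entries bookkeeping in mm/page_alloc.c
      (the real patch also compacts the array; that is elided here):

      	/* Trim every early_node_map[] entry of @nid that extends past
      	 * @new_end_pfn, rather than assuming one entry per node. */
      	void __init shrink_active_range(unsigned int nid,
      					unsigned long new_end_pfn)
      	{
      		int i;

      		for (i = 0; i < nr_nodemap_entries; i++) {
      			if (early_node_map[i].nid != nid ||
      			    early_node_map[i].end_pfn <= new_end_pfn)
      				continue;
      			if (early_node_map[i].start_pfn >= new_end_pfn)
      				/* Entirely past the new end: empty it. */
      				early_node_map[i].end_pfn =
      					early_node_map[i].start_pfn;
      			else
      				/* Straddles the new end: trim it. */
      				early_node_map[i].end_pfn = new_end_pfn;
      		}
      	}
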
  6. 09 Jun, 2008 1 commit
  7. 03 Jun, 2008 1 commit
  8. 24 May, 2008 3 commits
    • memory hotplug: fix early allocation handling · cd94b9db
      Heiko Carstens authored
      
      Trying to add memory via add_memory() from within an initcall function
      results in
      
      bootmem alloc of 163840 bytes failed!
      Kernel panic - not syncing: Out of memory
      
      This is caused by zone_wait_table_init(), which uses system_state to
      decide whether it should use the bootmem allocator.
      
      When initcalls are handled, system_state is still SYSTEM_BOOTING, but
      the bootmem allocator no longer works, so the allocation fails.
      
      To fix this, use slab_is_available() as the indicator instead, as we
      do everywhere else.
      
      [akpm@linux-foundation.org: coding-style fix]
      Reviewed-by: Andy Whitcroft <apw@shadowen.org>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
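
      The shape of the fix, sketched against zone_wait_table_init() (the
      exact allocation size and the vmalloc path are recalled from the
      era's code, not quoted from the committed diff):

      	/* Choose the allocator by asking whether slab is up, not by
      	 * inspecting system_state: during initcalls system_state is
      	 * still SYSTEM_BOOTING although bootmem is already retired. */
      	if (!slab_is_available())
      		zone->wait_table = (wait_queue_head_t *)
      			alloc_bootmem_node(pgdat, alloc_size);
      	else
      		zone->wait_table = vmalloc(alloc_size);
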
    • zonelists: handle a node zonelist with no applicable entries · 7eb54824
      Andy Whitcroft authored
      
      When booting 2.6.26-rc3 on a multi-node x86_32 numa system we are seeing
      panics when trying node local allocations:
      
       BUG: unable to handle kernel NULL pointer dereference at 0000034c
       IP: [<c1042507>] get_page_from_freelist+0x4a/0x18e
       *pdpt = 00000000013a7001 *pde = 0000000000000000
       Oops: 0000 [#1] SMP
       Modules linked in:
      
       Pid: 0, comm: swapper Not tainted (2.6.26-rc3-00003-g5abc28d #82)
       EIP: 0060:[<c1042507>] EFLAGS: 00010282 CPU: 0
       EIP is at get_page_from_freelist+0x4a/0x18e
       EAX: c1371ed8 EBX: 00000000 ECX: 00000000 EDX: 00000000
       ESI: f7801180 EDI: 00000000 EBP: 00000000 ESP: c1371ec0
        DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
       Process swapper (pid: 0, ti=c1370000 task=c12f5b40 task.ti=c1370000)
       Stack: 00000000 00000000 00000000 00000000 000612d0 000412d0 00000000 000412d0
              f7801180 f7c0101c f7c01018 c10426e4 f7c01018 00000001 00000044 00000000
              00000001 c12f5b40 00000001 00000010 00000000 000412d0 00000286 000412d0
       Call Trace:
        [<c10426e4>] __alloc_pages_internal+0x99/0x378
        [<c10429ca>] __alloc_pages+0x7/0x9
        [<c105e0e8>] kmem_getpages+0x66/0xef
        [<c105ec55>] cache_grow+0x8f/0x123
        [<c105f117>] ____cache_alloc_node+0xb9/0xe4
        [<c105f427>] kmem_cache_alloc_node+0x92/0xd2
        [<c122118c>] setup_cpu_cache+0xaf/0x177
        [<c105e6ca>] kmem_cache_create+0x2c8/0x353
        [<c13853af>] kmem_cache_init+0x1ce/0x3ad
        [<c13755c5>] start_kernel+0x178/0x1ee
      
      This occurs when we are scanning the zonelists looking for a ZONE_NORMAL
      page.  In this system there is only ZONE_DMA and ZONE_NORMAL memory on
      node 0, all other nodes are mapped above 4GB physical.  Here is a dump
      of the zonelists from this system:
      
          zonelists pgdat=c1400000
           0: c14006c0:2 f7c006c0:2 f7e006c0:2 c1400360:1 c1400000:0
           1: c14006c0:2 c1400360:1 c1400000:0
          zonelists pgdat=f7c00000
           0: f7c006c0:2 f7e006c0:2 c14006c0:2 c1400360:1 c1400000:0
           1: f7c006c0:2
          zonelists pgdat=f7e00000
           0: f7e006c0:2 c14006c0:2 f7c006c0:2 c1400360:1 c1400000:0
           1: f7e006c0:2
      
      When performing a node local allocation we call get_page_from_freelist()
      looking for a page.  It in turn calls first_zones_zonelist(), which
      returns a preferred_zone.  Where there are no applicable zones this
      will be NULL.  However, we use the result unconditionally, leading to
      this panic.
      
      Where there are no applicable zones there is no possibility of a successful
      allocation, so simply fail the allocation.
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
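
      A sketch of the resulting guard at the top of
      get_page_from_freelist() (close to, but not quoted from, the
      committed fix):

      	(void)first_zones_zonelist(zonelist, high_zoneidx, nodemask,
      				   &preferred_zone);
      	if (!preferred_zone)
      		return NULL;	/* no applicable zones: fail the allocation */
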
    • mm: don't drop a partial page in a zone's memory map size · f7232154
      Johannes Weiner authored
      
      In a zone's present-pages count, account for all pages occupied by
      the memory map, including a partial one.
      Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
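
      The arithmetic in question, sketched with the names used by the
      era's free_area_init_core() (an illustration, not the diff):

      	/* Pages consumed by the memmap for this zone; PAGE_ALIGN()
      	 * rounds up so a trailing partial page is charged as well. */
      	memmap_pages = PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
      	if (realsize >= memmap_pages)
      		realsize -= memmap_pages;	/* feeds zone->present_pages */
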
  9. 14 May, 2008 1 commit
    • memory_hotplug: always initialize pageblock bitmap · 76cdd58e
      Heiko Carstens authored
      Trying to online a new memory section that was added via memory hotplug
      sometimes results in crashes when the new pages are added via __free_page.
      The reason is that the pageblock bitmap isn't initialized and hence
      contains random stuff.  That means that get_pageblock_migratetype()
      also returns random stuff, and therefore
      
      	list_add(&page->lru,
      		&zone->free_area[order].free_list[migratetype]);
      
      in __free_one_page() tries to do a list_add to something that isn't even
      necessarily a list.
      
      This has happened since commit 86051ca5 ("mm: fix usemap
      initialization"), which makes sure that the pageblock bitmap is only
      initialized for pages present in a zone.  Unfortunately, for hot-added
      memory the zones "grow" after the memmap and the pageblock bitmap have
      been initialized, which means that the new pages have an uninitialized
      bitmap.  To solve this, the calls to grow_zone_span() and
      grow_pgdat_span() are moved to __add_zone() just before ...
  10. 30 Apr, 2008 1 commit
    • infrastructure to debug (dynamic) objects · 3ac7fe5a
      Thomas Gleixner authored
      
      We can see an ever-repeating problem pattern with objects of any kind in the
      kernel:
      
      1) freeing of active objects
      2) reinitialization of active objects
      
      Both problems can be hard to debug because the crash happens at a point where
      we have no chance to decode the root cause anymore.  One problem spot is
      kernel timers, where the detection of the problem often happens in interrupt
      context and usually causes the machine to panic.
      
      While working on a timer related bug report I had to hack specialized code
      into the timer subsystem to get a reasonable hint for the root cause.  This
      debug hack was fine for temporary use, but far from a mergeable solution due
      to the intrusiveness into the timer code.
      
      The code further lacked the ability to detect and report the root cause
      instantly and keep the system operational.
      
      Keeping the system operational is important to get hold of the debug
      information without special debugging aids like serial consoles and special
      knowledge of the bug reporter.
      
      The problems described above are not restricted to timers, but timers tend to
      expose it usually in a full system crash.  Other objects are less explosive,
      but the symptoms caused by such mistakes can be even harder to debug.
      
      Instead of creating specialized debugging code for the timer subsystem a
      generic infrastructure is created which allows developers to verify their code
      and provides an easy to enable debug facility for users in case of trouble.
      
      The debugobjects core code keeps track of operations on static and
      dynamic objects by inserting them into a hashed list and
      sanity-checking them on object operations; it also provides additional
      checks whenever kernel memory is freed.
      
      The tracked object operations are:
      - initializing an object
      - adding an object to a subsystem list
      - deleting an object from a subsystem list
      
      Each operation is sanity-checked before it is executed, and the
      subsystem-specific code can provide a fixup function that prevents the
      operation from doing damage.  When a sanity check triggers, a warning
      message and a stack trace are printed.
      
      The list of operations can be extended if the need arises.  For now it's
      limited to the requirements of the first user (timers).
      
      The core code enqueues the objects into hash buckets.  The hash index is
      generated from the address of the object to simplify the lookup for the check
      on kfree/vfree.  Each bucket has its own spinlock to avoid contention on a
      global lock.
      
      The debug code can be compiled in without being active.  The runtime overhead
      is minimal and could be optimized by asm alternatives.  A kernel command line
      option enables the debugging code.
      
      Thanks to Ingo Molnar for review, suggestions and cleanup patches.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Cc: Greg KH <greg@kroah.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
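
      The bucket scheme described above, sketched (constants and names
      abbreviate lib/debugobjects.c rather than quoting it):

      	#define ODEBUG_HASH_BITS	14

      	struct debug_bucket {
      		struct hlist_head	list;	/* tracked objects */
      		spinlock_t		lock;	/* per bucket, no global lock */
      	};

      	static struct debug_bucket obj_hash[1 << ODEBUG_HASH_BITS];

      	/* Hash by object address so the checks run on kfree()/vfree()
      	 * can find any tracked object cheaply. */
      	static struct debug_bucket *get_bucket(unsigned long addr)
      	{
      		return &obj_hash[hash_long(addr, ODEBUG_HASH_BITS)];
      	}
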
  11. 29 Apr, 2008 3 commits
    • page allocator: smarter retry of costly-order allocations · a41f24ea
      Nishanth Aravamudan authored
      
      Because of page order checks in __alloc_pages(), hugepage (and similarly
      large order) allocations will not retry unless explicitly marked
      __GFP_REPEAT. However, the current retry logic is nearly an infinite
      loop (it runs until reclaim makes no progress whatsoever). For these costly
      allocations, that seems like overkill and could potentially never
      terminate. Mel observed that allowing current __GFP_REPEAT semantics for
      hugepage allocations essentially killed the system. I believe this is
      because we may continue to reclaim small orders of pages all over, but
      never have enough to satisfy the hugepage allocation request. This is
      clearly only a problem for large order allocations, of which hugepages
      are the most obvious (to me).
      
      Modify try_to_free_pages() to indicate how many pages were reclaimed.
      Use that information in __alloc_pages() to eventually fail a large
      __GFP_REPEAT allocation when we've reclaimed an order of pages equal to
      or greater than the allocation's order. This relies on lumpy reclaim
      functioning as advertised. Due to fragmentation, lumpy reclaim may not
      be able to free up the order needed in one invocation, so multiple
      iterations may be required. In other words, the more fragmented memory
      is, the more retry attempts __GFP_REPEAT will make (particularly for
      higher order allocations).
      
      This changes the semantics of __GFP_REPEAT subtly, but *only* for
      allocations > PAGE_ALLOC_COSTLY_ORDER. With this patch, for those size
      allocations, we will try up to some point (at least 1<<order reclaimed
      pages), rather than forever (which is the case for allocations <=
      PAGE_ALLOC_COSTLY_ORDER).
      
      This change improves the /proc/sys/vm/nr_hugepages interface with a
      follow-on patch that makes pool allocations use __GFP_REPEAT. Rather
      than administrators repeatedly echo'ing a particular value into the
      sysctl, and forcing reclaim into action manually, this change allows for
      the sysctl to attempt a reasonable effort itself. Similarly, dynamic
      pool growth should be more successful under load, as lumpy reclaim can
      try to free up pages, rather than failing right away.
      
      Choosing to reclaim only up to the order of the requested allocation
      strikes a balance between not failing hugepage allocations and returning
      to the caller when it's unlikely to ever succeed. Because of lumpy
      reclaim, if we have freed the order requested, hopefully it has been in
      big chunks and those chunks will allow our allocation to succeed. If
      that isn't the case after freeing up the current order, I don't think it
      is likely to succeed in the future, although it is possible given a
      particular fragmentation pattern.
      Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Tested-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
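
      A sketch of the changed retry decision in __alloc_pages(), following
      the description above (variable names mirror the 2.6.25-era
      allocator; this is an illustration, not the diff):

      	pages_reclaimed += did_some_progress;
      	do_retry = 0;
      	if (!(gfp_mask & __GFP_NORETRY)) {
      		if (order <= PAGE_ALLOC_COSTLY_ORDER) {
      			do_retry = 1;	/* small orders: retry as before */
      		} else {
      			/* Costly orders: stop once at least 1 << order
      			 * pages have been reclaimed in total. */
      			if ((gfp_mask & __GFP_REPEAT) &&
      			    pages_reclaimed < (1 << order))
      				do_retry = 1;
      		}
      		if (gfp_mask & __GFP_NOFAIL)
      			do_retry = 1;
      	}
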
    • mm: fix misleading __GFP_REPEAT related comments · ab857d09
      Nishanth Aravamudan authored
      
      The definition and use of __GFP_REPEAT, __GFP_NOFAIL and __GFP_NORETRY in the
      core VM have somewhat differing comments as to their actual semantics.
      Annoyingly, the flags definition has inline and header comments, which might
      be interpreted as not being equivalent.  Just add references to the header
      comments in the inline ones so they don't go out of sync in the future.  In
      their use in __alloc_pages() clarify that the current implementation treats
      low-order allocations and __GFP_REPEAT allocations as distinct cases.
      
      To clarify, the flags' semantics are:
      
      __GFP_NORETRY means try no harder than one run through __alloc_pages
      
      __GFP_REPEAT means __GFP_NOFAIL
      
      __GFP_NOFAIL means repeat forever
      
      order <= PAGE_ALLOC_COSTLY_ORDER means __GFP_NOFAIL
      Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
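
      Those semantics fall out of the retry decision as it stood before the
      change in the previous entry; a sketch of the 2.6.25-era logic:

      	do_retry = 0;
      	if (!(gfp_mask & __GFP_NORETRY)) {
      		/* Low orders always retry, so __GFP_REPEAT adds nothing
      		 * for them; above COSTLY_ORDER it means repeat forever,
      		 * i.e. the same as __GFP_NOFAIL. */
      		if ((order <= PAGE_ALLOC_COSTLY_ORDER) ||
      		    (gfp_mask & __GFP_REPEAT))
      			do_retry = 1;
      		if (gfp_mask & __GFP_NOFAIL)
      			do_retry = 1;
      	}
      	if (do_retry)
      		goto rebalance;
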
    • mm: fix usemap initialization · 86051ca5
      KAMEZAWA Hiroyuki authored
      
      usemap must be initialized only when pfn is within zone.  If not, it corrupts
      memory.
      
      This patch also reduces the number of calls to
      set_pageblock_migratetype() by changing the condition
      from
      	(pfn & (pageblock_nr_pages - 1))
      to
      	!(pfn & (pageblock_nr_pages - 1))
      since it should be called once per pageblock.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Shi Weihua <shiwh@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
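
      The fixed condition in memmap_init_zone(), roughly as committed:
      initialize a pageblock only at its first pfn, and only when that pfn
      actually lies within the zone:

      	if ((z->zone_start_pfn <= pfn)
      	    && (pfn < z->zone_start_pfn + z->spanned_pages)
      	    && !(pfn & (pageblock_nr_pages - 1)))
      		set_pageblock_migratetype(page, MIGRATE_MOVABLE);
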
  12. 28 Apr, 2008 10 commits
    • mm/page_alloc.c: remove hand-coded get_order() · 2309f9e6
      Pavel Machek authored
      
      Remove hand-coded get_order() from page_alloc.c.
      Signed-off-by: Pavel Machek <pavel@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
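
      For reference, the generic helper that replaces the hand-coded loop,
      as asm-generic defined it around this time (reproduced from memory,
      so treat it as a sketch):

      	static inline int get_order(unsigned long size)
      	{
      		int order = -1;

      		size = (size - 1) >> (PAGE_SHIFT - 1);
      		do {
      			size >>= 1;
      			order++;
      		} while (size);
      		return order;
      	}
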
    • memory hotplug: free memmaps allocated by bootmem · 0c0a4a51
      Yasunori Goto authored
      
      This patch frees memmaps that were allocated by bootmem.
      
      Freeing the usemap is not necessary: its pages may still be needed by
      other sections.
      
      If the section being removed is the last section on the node, that
      section is the final user of the usemap page (usemaps are allocated on
      its section by the previous patch).  Even then it shouldn't be freed,
      because the section must be in the logically offline state, with all
      of its pages isolated from the page allocator.  If the usemap page
      were freed, the page allocator might hand it out even though it will
      soon be removed physically, which would be a disaster.  So this patch
      keeps it as it is.
      Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Yinghai Lu <yhlu.kernel@gmail.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • pageflags: eliminate PG_xxx aliases · 0a128b2b
      Christoph Lameter authored
      
      Remove aliases of PG_xxx.  We can easily drop those now and alias by
      specifying the PG_xxx flag in the macro that generates the functions.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc.c: fix indentation · f05111f5
      S.Caglar Onur authored
      Commit 10ed273f ("zlc_setup(): handle jiffies wraparound") replaced a
      tab with spaces; fix the indentation.
      Signed-off-by: S.Caglar Onur <caglar@pardus.org.tr>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: filter based on a nodemask as well as a gfp_mask · 19770b32
      Mel Gorman authored
      
      The MPOL_BIND policy creates a zonelist that is used for allocations
      controlled by that mempolicy.  As the per-node zonelist is already being
      filtered based on a zone id, this patch adds a version of __alloc_pages() that
      takes a nodemask for further filtering.  This eliminates the need for
      MPOL_BIND to create a custom zonelist.
      
      A positive benefit of this is that allocations using MPOL_BIND now use the
      local node's distance-ordered zonelist instead of a custom node-id-ordered
      zonelist.  I.e., pages will be allocated from the closest allowed node with
      available memory.
      
      [Lee.Schermerhorn@hp.com: Mempolicy: update stale documentation and comments]
      [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask]
      [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
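
      A sketch of the added entry point and how MPOL_BIND can use it (the
      policy field name follows the era's struct mempolicy and is an
      assumption here):

      	struct page *__alloc_pages_nodemask(gfp_t gfp_mask,
      			unsigned int order, struct zonelist *zonelist,
      			nodemask_t *nodemask);

      	/* MPOL_BIND: the local node's distance-ordered zonelist,
      	 * filtered by the policy's nodemask, replaces the custom
      	 * zonelist. */
      	page = __alloc_pages_nodemask(gfp, order,
      				      node_zonelist(numa_node_id(), gfp),
      				      &policy->v.nodes);
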
    • mm: have zonelist contains structs with both a zone pointer and zone_idx · dd1a239f
      Mel Gorman authored
      
      Filtering zonelists requires very frequent use of zone_idx().  This is
      costly, as it involves a lookup of another structure and a subtraction
      operation.  As the zone_idx is often required, it should be quickly
      accessible.  The node idx could also be stored here if accessing
      zone->node were found to be significant, which may be the case on
      workloads where nodemasks are heavily used.
      
      This patch introduces a struct zoneref to store a zone pointer and a zone
      index.  The zonelist then consists of an array of these struct zonerefs which
      are looked up as necessary.  Helpers are given for accessing the zone index as
      well as the node index.
      
      [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers]
      [hugh@veritas.com: mm-have-zonelist: fix memcg ooms]
      [hugh@veritas.com: just return do_try_to_free_pages]
      [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
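
      The new entry type and its index helpers, sketched close to the
      committed include/linux/mmzone.h:

      	struct zoneref {
      		struct zone *zone;	/* Pointer to actual zone */
      		int zone_idx;		/* zone_idx(zoneref->zone) */
      	};

      	static inline struct zone *zonelist_zone(struct zoneref *zoneref)
      	{
      		return zoneref->zone;
      	}

      	static inline int zonelist_zone_idx(struct zoneref *zoneref)
      	{
      		return zoneref->zone_idx;	/* no zone_idx() lookup */
      	}
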
    • mm: use two zonelist that are filtered by GFP mask · 54a6eb5c
      Mel Gorman authored
      
      Currently a node has two sets of zonelists, one for each zone type in the
      system and a second set for GFP_THISNODE allocations.  Based on the zones
      allowed by a gfp mask, one of these zonelists is selected.  All of these
      zonelists consume memory and occupy cache lines.
      
      This patch replaces the multiple zonelists per-node with two zonelists.  The
      first contains all populated zones in the system, ordered by distance, for
      fallback allocations when the target/preferred node has no free pages.  The
      second contains all populated zones in the node suitable for GFP_THISNODE
      allocations.
      
      An iterator macro called for_each_zone_zonelist() is introduced that
      iterates through each zone allowed by the GFP flags in the selected
      zonelist.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
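
      Typical use of the iterator, sketched in its final 2.6.26 form (the
      struct zoneref cursor arrives with a later patch in this series):

      	struct zoneref *z;
      	struct zone *zone;

      	/* Visit only zones usable for this GFP mask, in fallback
      	 * order. */
      	for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask)) {
      		if (zone_watermark_ok(zone, order, zone->pages_low, 0, 0))
      			break;	/* found a candidate zone */
      	}
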
    • mm: remember what the preferred zone is for zone_statistics · 18ea7e71
      Mel Gorman authored
      
      On NUMA, zone_statistics() is used to record events like numa hit, miss and
      foreign.  It assumes that the first zone in a zonelist is the preferred zone.
      When multiple zonelists are replaced by one that is filtered, this is no
      longer the case.
      
      This patch records what the preferred zone is rather than assuming the first
      zone in the zonelist is it.  This simplifies the reading of later patches in
      this set.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
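
      The helper after the change, sketched from the era's mm/vmstat.c: the
      preferred zone is passed in rather than read from the head of the
      zonelist:

      	void zone_statistics(struct zone *preferred_zone, struct zone *z)
      	{
      		if (z->zone_pgdat == preferred_zone->zone_pgdat) {
      			__inc_zone_state(z, NUMA_HIT);
      		} else {
      			__inc_zone_state(z, NUMA_MISS);
      			__inc_zone_state(preferred_zone, NUMA_FOREIGN);
      		}
      		if (z->node == numa_node_id())
      			__inc_zone_state(z, NUMA_LOCAL);
      		else
      			__inc_zone_state(z, NUMA_OTHER);
      	}
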
    • mm: introduce node_zonelist() for accessing the zonelist for a GFP mask · 0e88460d
      Mel Gorman authored
      
      Introduce a node_zonelist() helper function.  It is used to lookup the
      appropriate zonelist given a node and a GFP mask.  The patch on its own is a
      cleanup but it helps clarify parts of the two-zonelist-per-node patchset.  If
      necessary, it can be merged with the next patch in this set without problems.
      Reviewed-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
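
      The helper itself is tiny; roughly as introduced (at this point in
      the series there is still one zonelist per zone type, so the GFP mask
      selects the index):

      	static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
      	{
      		return NODE_DATA(nid)->node_zonelists + gfp_zone(flags);
      	}

      	/* Callers become self-describing: */
      	zonelist = node_zonelist(numa_node_id(), gfp_mask);
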
    • mm: use zonelists instead of zones when direct reclaiming pages · dac1d27b
      Mel Gorman authored
      
      The following patches replace multiple zonelists per node with two zonelists
      that are filtered based on the GFP flags.  The patches as a set fix a bug with
      regard to the use of MPOL_BIND and ZONE_MOVABLE.  With this patchset,
      MPOL_BIND will apply to the two highest zones when the highest zone is
      ZONE_MOVABLE.  This should be considered an alternative to the
      previously discussed hack for the 2.6.23 MPOL_BIND+ZONE_MOVABLE
      problem, which filtered only custom zonelists.
      
      The first patch cleans up an inconsistency where direct reclaim uses
      zonelist->zones where other places use zonelist.
      
      The second patch introduces a helper function node_zonelist() for looking up
      the appropriate zonelist for a GFP mask which simplifies patches later in the
      set.
      
      The third patch defines/remembers the "preferred zone" for numa statistics, as
      it is no longer always the first zone in a zonelist.
      
      The fourth patch replaces multiple zonelists with two zonelists that are
      filtered.  The two zonelists are due to the fact that the memoryless patchset
      introduces a second set of zonelists for __GFP_THISNODE.
      
      The fifth patch introduces helper macros for retrieving the zone and node
      indices of entries in a zonelist.
      
      The final patch introduces filtering of the zonelists based on a nodemask.
      Two zonelists exist per node, one for normal allocations and one for
      __GFP_THISNODE.
      
      Performance results varied depending on the machine configuration.  In real
      workloads the gain/loss will depend on how much the userspace portion of the
      benchmark benefits from having more cache available due to reduced referencing
      of zonelists.
      
      These are the range of performance losses/gains when running against
      2.6.24-rc4-mm1.  The set and these machines are a mix of i386, x86_64 and
      ppc64 both NUMA and non-NUMA.
      			     loss   to  gain
      Total CPU time on Kernbench: -0.86% to  1.13%
      Elapsed   time on Kernbench: -0.79% to  0.76%
      page_test from aim9:         -4.37% to  0.79%
      brk_test  from aim9:         -0.71% to  4.07%
      fork_test from aim9:         -1.84% to  4.60%
      exec_test from aim9:         -0.71% to  1.08%
      
      This patch:
      
      The allocator deals with zonelists which indicate the order in which zones
      should be targeted for an allocation.  Similarly, direct reclaim of pages
      iterates over an array of zones.  For consistency, this patch converts direct
      reclaim to use a zonelist.  No functionality is changed by this patch.  This
      simplifies zonelist iterators in the next patch.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
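
      The call-site change in __alloc_pages(), sketched:

      	/* before: did_some_progress =
      	 *	try_to_free_pages(zonelist->zones, order, gfp_mask); */
      	did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
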
  13. 19 Apr, 2008 1 commit
    • nodemask: use new node_to_cpumask_ptr function · c5f59f08
      Mike Travis authored
      
        * Use new node_to_cpumask_ptr.  This creates a pointer to the
          cpumask for a given node.  This definition is in mm patch:
      
      	asm-generic-add-node_to_cpumask_ptr-macro.patch
      
        * Use new set_cpus_allowed_ptr function.
      
      Depends on:
      	[mm-patch]: asm-generic-add-node_to_cpumask_ptr-macro.patch
      	[sched-devel]: sched: add new set_cpus_allowed_ptr function
      	[x86/latest]: x86: add cpus_scnprintf function
      
      Cc: Greg Kroah-Hartman <gregkh@suse.de>
      Cc: Greg Banks <gnb@melbourne.sgi.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Signed-off-by: Mike Travis <travis@sgi.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
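
      The call pattern the patch adopts, sketched (the macro declares a
      pointer to the node's cpumask instead of copying a cpumask_t onto the
      stack; the exact expansion varied by architecture):

      	/* was: cpumask_t mask = node_to_cpumask(nid);
      	 *	set_cpus_allowed(current, mask); */
      	node_to_cpumask_ptr(nodemask, nid);
      	set_cpus_allowed_ptr(current, nodemask);
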
  14. 04 Mar, 2008 2 commits