1. 07 Feb, 2008 16 commits
    • memcgroup: fix zone isolation OOM · 436c6541
      Hugh Dickins authored
      
      mem_cgroup_charge_common shows a tendency to OOM without good reason, when
      a memhog goes well beyond its rss limit but with plenty of swap available.
      Seen on x86 but not on PowerPC; seen when the next patch omits swapcache
      from memcgroup, but we presume it can happen without.
      
      mem_cgroup_isolate_pages is not quite satisfying reclaim's criteria for OOM
      avoidance.  Already it has to scan beyond the nr_to_scan limit when it
      finds a !LRU page or an active page when handling inactive or an inactive
      page when handling active.  It needs to do exactly the same when it finds a
      page from the wrong zone (the x86 tests had two zones, the PowerPC tests
      had only one).
      
      Don't increment scan and then decrement it in these cases, just move the
      incrementation down.  Fix recent off-by-one when checking against
      nr_to_scan.  Cut out "Check if the meta page went away from under us",
      presumably left over from early debugging: no amount of such checks could
      save us if this list really were being updated without locking.
      
      This change does make the unlimited scan while holding two spinlocks
      even worse - bad for latency and bad for containment; but that's a
      separate issue which is better left to be fixed a little later.
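
The counting rule described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct model_page`, its fields, and `model_isolate_pages()` are hypothetical stand-ins, not the kernel's data structures or the actual patch. It shows the scan counter being incremented only once a page has passed the LRU, active/inactive, and zone checks, so ineligible pages do not consume the nr_to_scan budget.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a page on a per-cgroup LRU list. */
struct model_page {
    bool on_lru;  /* models PageLRU(page) */
    bool active;  /* models PageActive(page) */
    int zone;     /* models page_zone(page) */
    bool busy;    /* models the final isolation attempt failing */
};

/*
 * Walk the list and return how many pages were isolated. *scanned is
 * incremented only for pages that pass every eligibility check, so
 * skipped pages never count against nr_to_scan.
 */
static size_t model_isolate_pages(const struct model_page *pages, size_t n,
                                  size_t nr_to_scan, int zone,
                                  bool want_active, size_t *scanned)
{
    size_t taken = 0;

    assert(scanned != NULL);
    *scanned = 0;
    for (size_t i = 0; i < n && *scanned < nr_to_scan; i++) {
        const struct model_page *p = &pages[i];

        if (!p->on_lru)
            continue;           /* not on an LRU: skipped, not counted */
        if (p->active != want_active)
            continue;           /* wrong list type: skipped, not counted */
        if (p->zone != zone)
            continue;           /* wrong zone: skipped, not counted */

        (*scanned)++;           /* counted only once the page qualifies */
        if (!p->busy)
            taken++;
    }
    return taken;
}
```

Note that the model, like the behaviour the message describes, may walk arbitrarily far past nr_to_scan entries when most pages are ineligible.
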
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • bugfix for memory cgroup controller: avoid !PageLRU page in mem_cgroup_isolate_pages · ff7283fa
      KAMEZAWA Hiroyuki authored
      
      This patch makes mem_cgroup_isolate_pages()
      
        - ignore !PageLRU pages.
        - make progress after it finds a page with page_zone(page) != zone;
          previously isolation stalled there (now it just increments scan
          in this case).
      
      kswapd and memory migration remove a page from its list while they
      handle it for reclaim or migration.
      
      Because __isolate_lru_page() doesn't move !PageLRU pages, it is
      safe to avoid touching a !PageLRU page and its page_cgroup.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • bugfix for memory cgroup controller: migration under memory controller fix · ae41be37
      KAMEZAWA Hiroyuki authored
      
      When the memory control cgroup is in use, page migration under it works as follows.
      ==
       1. uncharge all refs at try_to_unmap().
       2. charge refs again at remove_migration_ptes().
      ==
      This is simple but has the following problems.
      ==
       The page is uncharged and charged back again only if it is *mapped*.
          - This means that the cgroup before migration can differ from the one
            after migration.
          - If the page is not mapped but charged as page cache, the charge is just
            ignored (because it is not mapped, it is not uncharged before
            migration). This is a memory leak.
      ==
      This patch keeps the memory cgroup across page migration by taking
      one extra refcnt for its duration. 3 functions are added.
      
       mem_cgroup_prepare_migration() --- increase refcnt of page->page_cgroup
       mem_cgroup_end_migration()     --- decrease refcnt of page->page_cgroup
       mem_cgroup_page_migration() --- copy page->page_cgroup from old page to
                                       new page.
      
      During migration
        - old page is under PG_locked.
        - new page is under PG_locked, too.
        - neither the old page nor the new page is on the LRU.
      
      These 3 facts guarantee that page_cgroup() migration has no race.
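
The effect of the extra reference can be sketched in a single-threaded userspace model. Everything here is an illustrative assumption: the `model_` types and functions are hypothetical, only loosely mirror the three functions named above, and leave out all locking and LRU handling.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the per-page accounting record. */
struct model_page_cgroup {
    int ref_cnt;    /* number of charges held on the page */
    int cgroup_id;  /* which cgroup the page is charged to */
};

struct model_page {
    struct model_page_cgroup *pc;  /* NULL when the page is not charged */
};

/* Take one more reference on an already charged page (e.g. a new mapping). */
static void model_charge(struct model_page *page)
{
    if (page->pc)
        page->pc->ref_cnt++;
}

/* Drop one reference; the page loses its record when the count hits zero. */
static void model_uncharge(struct model_page *page)
{
    if (page->pc && --page->pc->ref_cnt == 0)
        page->pc = NULL;
}

/* Pin the record for the duration of migration; returns 1 if pinned. */
static int model_prepare_migration(struct model_page *page)
{
    if (!page->pc)
        return 0;
    page->pc->ref_cnt++;
    return 1;
}

/* Hand the accounting record from the old page to the new page. */
static void model_page_migration(struct model_page *oldpage,
                                 struct model_page *newpage)
{
    assert(newpage->pc == NULL);  /* the new page must not be charged yet */
    newpage->pc = oldpage->pc;
    oldpage->pc = NULL;
}

/* Drop the pin taken by model_prepare_migration(). */
static void model_end_migration(struct model_page *newpage)
{
    model_uncharge(newpage);
}
```

In this model, a page with a single mapping survives the unmap step because the pin keeps ref_cnt above zero, so the new page ends up charged to the same cgroup as the old one.
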
      
      Tested and worked well in x86_64/fake-NUMA box.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • bugfix for memory controller: add helper function for assigning cgroup to page · 9175e031
      KAMEZAWA Hiroyuki authored
      
      This patch adds the following functions.
         - clear_page_cgroup(page, pc)
         - page_cgroup_assign_new_page_group(page, pc)
      
      Mainly for cleanup.
      
      The "check page->cgroup again after lock_page_cgroup()" pattern is
      now implemented in a straightforward way.
      
      A comment in mem_cgroup_uncharge() will be removed by the force-empty patch.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcontrol: move oom task exclusion to tasklist scan · 4c4a2214
      David Rientjes authored
      
      Creates a helper function to return non-zero if a task is a member of a
      memory controller:
      
      	int task_in_mem_cgroup(const struct task_struct *task,
      			       const struct mem_cgroup *mem);
      
      When the OOM killer is constrained by the memory controller, the exclusion
      of tasks that are not members of that controller was previously misplaced
      in the badness scoring function.  Such tasks should instead be excluded
      during the tasklist scan in select_bad_process().
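
As a rough illustration of filtering during the scan rather than inside the scoring function, here is a userspace sketch. The `model_` types, the precomputed `badness` field, and `model_select_bad_process()` are hypothetical simplifications and are not the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct mem_cgroup and struct task_struct. */
struct model_mem_cgroup {
    int id;
};

struct model_task {
    const struct model_mem_cgroup *mem;  /* cgroup the task belongs to */
    unsigned long badness;               /* precomputed OOM score */
};

/* Non-zero if the task is a member of the given memory cgroup. */
static int model_task_in_mem_cgroup(const struct model_task *task,
                                    const struct model_mem_cgroup *mem)
{
    return task->mem == mem;
}

/*
 * Pick the task with the highest score. When mem is non-NULL the OOM is
 * constrained to that cgroup, and non-members are skipped during the scan
 * itself, before their score is ever considered.
 */
static const struct model_task *
model_select_bad_process(const struct model_task *tasks, size_t n,
                         const struct model_mem_cgroup *mem)
{
    const struct model_task *chosen = NULL;

    assert(tasks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        const struct model_task *t = &tasks[i];

        if (mem && !model_task_in_mem_cgroup(t, mem))
            continue;
        if (!chosen || t->badness > chosen->badness)
            chosen = t;
    }
    return chosen;
}
```

Keeping the membership test in the scan leaves the scoring step free of any cgroup knowledge, which is the separation the message argues for.
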
      
      [akpm@linux-foundation.org: build fix]
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcontrol: move mm_cgroup to header file · 3062fc67
      David Rientjes authored
      
      Inline functions must precede their use, so mm_cgroup() should be defined
      in linux/memcontrol.h.
      
      include/linux/memcontrol.h:48: warning: 'mm_cgroup' declared inline after
      	being called
      include/linux/memcontrol.h:48: warning: previous declaration of
      	'mm_cgroup' was here
      
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: nuther build fix]
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Memory controller: make charging gfp mask aware · e1a1cd59
      Balbir Singh authored
      
      Nick Piggin pointed out that swap cache and page cache addition routines
      could be called from non GFP_KERNEL contexts.  This patch makes the
      charging routine aware of the gfp context.  Charging might fail if the
      cgroup is over its limit, in which case a suitable error is returned.
      
      This patch was tested on a PowerPC box.  I am still looking for a way
      to test the path through which allocations happen in non-GFP_KERNEL
      contexts.
      
      [kamezawa.hiroyu@jp.fujitsu.com: problem with ZONE_MOVABLE]
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Memory controller: make page_referenced() cgroup aware · bed7161a
      Balbir Singh authored
      
      Make page_referenced() cgroup aware.  Without this patch, page_referenced()
      can cause a page to be skipped while reclaiming pages.  This patch ensures
      that other cgroups do not hold pages in a particular cgroup hostage.  It
      is required to ensure that shared pages are freed from a cgroup when they
      are not actively referenced from the cgroup that brought them in.
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Memory controller: add switch to control what type of pages to limit · 8697d331
      Balbir Singh authored
      
      Choose whether cached pages are accounted or not.  By default both mapped
      and cached pages are accounted for.  A new set of tunables is added.
      
      echo -n 1 > mem_control_type
      
      switches the accounting to account for only mapped pages.
      
      echo -n 3 > mem_control_type
      
      switches the behaviour back.
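
The decision implied by the two documented values can be sketched as follows. This is a minimal model that assumes only what the message states (1 means mapped pages only, 3 means both); the constant names and `model_should_account()` are hypothetical, and other values of the tunable are deliberately not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the two values shown in the message. */
enum model_control_type {
    MODEL_TYPE_MAPPED = 1,  /* account mapped pages only */
    MODEL_TYPE_ALL    = 3   /* account mapped and cached pages (default) */
};

/*
 * Decide whether a page should be charged to the cgroup. Mapped pages are
 * charged under both settings; unmapped page cache is charged only when
 * everything is accounted.
 */
static bool model_should_account(int control_type, bool page_is_mapped)
{
    /* Only the two documented values are handled by this sketch. */
    assert(control_type == MODEL_TYPE_MAPPED ||
           control_type == MODEL_TYPE_ALL);

    if (page_is_mapped)
        return true;
    return control_type == MODEL_TYPE_ALL;
}
```
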
      
      [bunk@kernel.org: mm/memcontrol.c: cleanups]
      [akpm@linux-foundation.org: fix sparc32 build]
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Adrian Bunk <bunk@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Memory controller: OOM handling · c7ba5c9e
      Pavel Emelianov authored
      
      Out of memory handling for cgroups over their limit. A task from the
      cgroup over limit is chosen using the existing OOM logic and killed.
      
      TODO:
      1. As discussed in the OLS BOF session, consider implementing a user
      space policy for OOM handling.
      
      [akpm@linux-foundation.org: fix build due to oom-killer changes]
      Signed-off-by: Pavel Emelianov <xemul@openvz.org>
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Memory controller improve user interface · 0eea1030
      Balbir Singh authored
      
      Change the interface to use bytes instead of pages.  Page sizes can vary
      across platforms and configurations.  A new strategy routine has been added
      to the resource counters infrastructure to format the data as desired.
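
To see why a byte-based interface is more portable than a page-based one, consider this small sketch. The helper names are hypothetical and the round-up policy is an assumption made for illustration; the message does not say how the strategy routine rounds.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Convert a byte limit into a whole number of pages, rounding up so that
 * a partial page still counts as one page (assumed policy).
 */
static size_t model_bytes_to_pages(size_t bytes, size_t page_size)
{
    assert(page_size != 0);
    return bytes / page_size + (bytes % page_size != 0);
}

/* Convert a page count back into bytes for reporting to user space. */
static size_t model_pages_to_bytes(size_t pages, size_t page_size)
{
    return pages * page_size;
}
```

With these helpers, a 1 MiB limit corresponds to 256 pages when pages are 4 KiB but only 16 pages when they are 64 KiB, so a limit written as a page count would mean different amounts of memory on different configurations.
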
      
      Suggested by David Rientjes, Andrew Morton and Herbert Poetzl
      
      Tested on a UML setup with the config for memory control enabled.
      
      [kamezawa.hiroyu@jp.fujitsu.com: possible race fix in res_counter]
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Memory controller: add per cgroup LRU and reclaim · 66e1707b
      Balbir Singh authored
      
      Add the page_cgroup to the per cgroup LRU.  The reclaim algorithm has
      been modified to make isolate_lru_pages() a pluggable component.  The
      scan_control data structure now accepts the cgroup on behalf of which
      reclaims are carried out.  try_to_free_pages() has been extended to become
      cgroup aware.
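
The idea of a pluggable isolation step carried in the scan control structure can be sketched with a function pointer. This is an illustrative userspace model only: the `model_` names, the simplified callback signature, and the page counts are assumptions and do not reproduce the kernel's scan_control.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cgroup with its own LRU population. */
struct model_mem_cgroup {
    size_t nr_lru_pages;
};

/* Hypothetical reclaim context: who we reclaim for, and how to isolate. */
struct model_scan_control {
    const struct model_mem_cgroup *mem_cgroup;  /* NULL means global reclaim */
    size_t (*isolate_pages)(size_t nr_to_scan,
                            const struct model_mem_cgroup *mem);
};

enum { MODEL_GLOBAL_LRU_PAGES = 1000 };  /* assumed size of the global LRU */

static size_t model_min(size_t a, size_t b)
{
    return a < b ? a : b;
}

/* Isolation from the global LRU; ignores the cgroup argument. */
static size_t model_isolate_global(size_t nr_to_scan,
                                   const struct model_mem_cgroup *mem)
{
    (void)mem;
    return model_min(nr_to_scan, MODEL_GLOBAL_LRU_PAGES);
}

/* Isolation restricted to the pages on one cgroup's LRU. */
static size_t model_isolate_cgroup(size_t nr_to_scan,
                                   const struct model_mem_cgroup *mem)
{
    return model_min(nr_to_scan, mem->nr_lru_pages);
}

/* The reclaim path calls whichever isolation routine the caller plugged in. */
static size_t model_shrink(const struct model_scan_control *sc,
                           size_t nr_to_scan)
{
    assert(sc->isolate_pages != NULL);  /* every context must set the hook */
    return sc->isolate_pages(nr_to_scan, sc->mem_cgroup);
}
```

The assertion in `model_shrink()` reflects why every reclaim context has to initialize the callback: a context that leaves it unset would otherwise call through a null pointer.
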
      
      [akpm@linux-foundation.org: fix warning]
      [Lee.Schermerhorn@hp.com: initialize all scan_control's isolate_pages member]
      [bunk@kernel.org: make do_try_to_free_pages() static]
      [hugh@veritas.com: memcgroup: fix try_to_free order]
      [kamezawa.hiroyu@jp.fujitsu.com: this unlock_page_cgroup() is unnecessary]
      Signed-off-by: Pavel Emelianov <xemul@openvz.org>
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Memory controller: task migration · 67e465a7
      Balbir Singh authored
      
      Allow tasks to migrate from one cgroup to the other.  We migrate
      mm_struct's mem_cgroup only when the thread group id migrates.
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Memory controller: memory accounting · 8a9f3ccd
      Balbir Singh authored
      
      Add the accounting hooks.  The accounting is carried out for RSS and Page
      Cache (unmapped) pages.  There is now a common limit and accounting for both.
      RSS is accounted at page_add_*_rmap() and page_remove_rmap() time.
      Page cache is accounted at add_to_page_cache() and
      __delete_from_page_cache().  Swap cache is also accounted for.
      
      Each page's page_cgroup is protected with the last bit of the
      page_cgroup pointer; this makes handling of race conditions involving
      simultaneous mappings of a page easier.  A reference count is kept in the
      page_cgroup to deal with cases where a page might be unmapped from the RSS
      of all tasks, but still lives in the page cache.
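
Storing a lock flag in the lowest bit of an aligned pointer can be sketched as below. This is a deliberately non-atomic, single-threaded illustration with hypothetical `model_` names; a real implementation would need atomic bit operations and memory ordering, which are omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_LOCK_BIT ((uintptr_t)1)  /* lowest bit doubles as the lock */

struct model_page_cgroup {
    int ref_cnt;
};

/* The page stores the pointer and the lock flag in one word. */
struct model_page {
    uintptr_t page_cgroup;
};

/* Recover the pointer by masking the lock bit off. */
static struct model_page_cgroup *model_get_pc(const struct model_page *page)
{
    return (struct model_page_cgroup *)(page->page_cgroup & ~MODEL_LOCK_BIT);
}

/* Store a new pointer while preserving the current lock state. */
static void model_assign_pc(struct model_page *page,
                            struct model_page_cgroup *pc)
{
    uintptr_t locked = page->page_cgroup & MODEL_LOCK_BIT;

    /* The scheme only works if the pointer's low bit is always zero. */
    assert(((uintptr_t)pc & MODEL_LOCK_BIT) == 0);
    page->page_cgroup = (uintptr_t)pc | locked;
}

/* Try to take the lock; returns false if it is already held. */
static bool model_trylock(struct model_page *page)
{
    if (page->page_cgroup & MODEL_LOCK_BIT)
        return false;
    page->page_cgroup |= MODEL_LOCK_BIT;
    return true;
}

/* Release the lock without disturbing the stored pointer. */
static void model_unlock(struct model_page *page)
{
    page->page_cgroup &= ~MODEL_LOCK_BIT;
}
```

Because the lock lives in the same word as the pointer, no extra per-page storage is needed, and the pointer can be re-read under the lock to detect a concurrent change.
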
      
      Credits go to Vaidyanathan Srinivasan for helping with reference counting work
      of the page cgroup.  Almost all of the page cache accounting code has help
      from Vaidyanathan Srinivasan.
      
      [hugh@veritas.com: fix swapoff breakage]
      [akpm@linux-foundation.org: fix locking]
      Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <Valdis.Kletnieks@vt.edu>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Memory controller: accounting setup · 78fb7466
      Pavel Emelianov authored
      
      Basic setup routines: the mm_struct has a pointer to the cgroup that
      it belongs to and the page has a page_cgroup associated with it.
      Signed-off-by: Pavel Emelianov <xemul@openvz.org>
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Memory controller: cgroups setup · 8cdea7c0
      Balbir Singh authored
      
      Set up the memory cgroup and add basic hooks and controls to integrate
      and work with the cgroup.
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>