1. 30 Oct, 2005 40 commits
    • [PATCH] posix-cpu-timers: fix overrun reporting · 708f430d
      Roland McGrath authored
      
      This change corrects an omission in posix_cpu_timer_schedule, so that it
      correctly propagates the overrun calculation to where it will get reported
      to the user.
      Signed-off-by: Roland McGrath <roland@redhat.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] RCU torture-testing kernel module · a241ec65
      Paul E. McKenney authored
      This patch is a rewrite of the one submitted on October 1st, using modules
      (http://marc.theaimsgroup.com/?l=linux-kernel&m=112819093522998&w=2).
      
      This rewrite adds a tristate CONFIG_RCU_TORTURE_TEST, which enables an
      intense torture test of the RCU infrastructure.  This is needed due to the
      continued changes to the RCU infrastructure to accommodate dynamic ticks,
      CPU hotplug, realtime, and so on.  Most of the code is in a separate file
      that is compiled only if the CONFIG variable is set.  Documentation on how
      to run the test and interpret the output is also included.
      
      This code has been tested on i386 and ppc64, and an earlier version of the
      code has received extensive testing on a number of architectures as part of
      the PREEMPT_RT patchset.
      Signed-off-by: "Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] jiffies_64 cleanup · ecea8d19
      Thomas Gleixner authored
      
      Define jiffies_64 in kernel/timer.c rather than having 24 duplicated
      defines in each architecture.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] wait4 PTRACE_ATTACH race fix · 7f2a5255
      Roland McGrath authored
      
      About a year ago, when I last fiddled heavily with the do_wait code, I
      was thinking too hard about the wrong thing, and I now think I introduced a
      bug whose inverse I thought I was fixing.
      
      Apparently no one was looking too hard over my shoulder, so as to cite my
      bogus reasoning at the time.  In the race condition when PTRACE_ATTACH is
      about to steal a child and then the child hits a tracing event (what
      my_ptrace_child checks for), the real parent does need to set its flag
      noting it has some eligible live children.  Otherwise a spurious ECHILD
      error is possible, since the child in question is not yet on the
      ptrace_children list.
      Signed-off-by: Roland McGrath <roland@redhat.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] PF_DEAD cleanup · 7407251a
      Coywolf Qi Hunt authored
      
      The PF_DEAD setting doesn't belong in exit_notify(); move it to a proper
      place.
      Signed-off-by: Coywolf Qi Hunt <qiyong@fc-cn.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] cleanup for kernel/printk.c · 40dc5651
      Jesper Juhl authored
      
      - Removes some trailing whitespace
      
      - Breaks long lines and makes other small changes to conform to CodingStyle
      
      - Adds explicit printk loglevels in two places.
      Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Keys: Get rid of warning in kmod.c if keys disabled · 20e1129a
      David Howells authored
      
      The attached patch gets rid of a "statement without effect" warning when
      CONFIG_KEYS is disabled by making use of the return value of key_get().
      The compiler will optimise all of this away when keys are disabled.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] ptrace/coredump/exit_group deadlock · 30e0fca6
      Andrea Arcangeli authored
      
      I could seldom reproduce a deadlock with an unkillable task in T state
      (TASK_STOPPED, not TASK_TRACED) by attaching an NPTL threaded program to
      gdb, segfaulting the task and triggering a core dump while some other
      task is executing exit_group and while one task is in ptrace_attached
      TASK_STOPPED state (not TASK_TRACED yet).  This originated from a gdb
      bug report (the fact that gdb was segfaulting the task wasn't a kernel bug),
      but I incidentally noticed that the gdb bug triggered a real kernel bug as
      well.
      
      Most threads hang in exit_mm because the core dump is still in progress, the
      core dump hangs because the stopped task doesn't exit, and the stopped task
      can't wake up because it has SIGNAL_GROUP_EXIT set, hence the deadlock.
      
      To me it seems that the problem is that the force_sig_specific(SIGKILL) in
      zap_threads is a noop if the task has PF_PTRACED set (like in this case
      because gdb is attached).  The __ptrace_unlink does nothing because the
      signal->flags is set to SIGNAL_GROUP_EXIT|SIGNAL_STOP_DEQUEUED (verified).
      
      The above info also shows that the stopped task hit a race and got the stop
      signal (presumably from ptrace_attach; after only the attach, the state is
      still TASK_STOPPED, and gdb hangs waiting for the core before it can set it to
      TASK_TRACED) after one of the threads invoked the core dump (it's the core
      dump that sets signal->flags to SIGNAL_GROUP_EXIT).
      
      So besides the fact that nobody would wake up the task in __ptrace_unlink (the
      state is _not_ TASK_TRACED), there's a secondary problem in the signal
      handling code, where a task should ignore the ptrace-sigstops as long as
      SIGNAL_GROUP_EXIT is set (or the wakeup in the __ptrace_unlink path wouldn't
      be enough).
      
      So I attempted to make this patch, which seems to fix the problem.  There
      were various ways to fix it and perhaps you prefer a different one; I just
      opted for the one that looked safer to me.
      
      I also removed the clearing of the stopped bits from zap_other_threads
      (zap_other_threads was safe, unlike zap_threads).  I don't like useless
      code; this whole NPTL signal/ptrace thing is already unreadable enough and
      full of corner cases without adding confusing useless code to make it even
      less readable.  And if this code is really needed, then you may want to
      explain why it's not being done in the other paths that set
      SIGNAL_GROUP_EXIT at least.
      
      Even after this patch I still wonder who serializes the read of
      p->ptrace in zap_threads.
      
      Patch is called ptrace-core_dump-exit_group-deadlock-1.
      
      This was the trace I've got:
      
      test          T ffff81003e8118c0     0 14305      1         14311 14309 (NOTLB)
      ffff810058ccdde8 0000000000000082 000001f4000037e1 ffff810000000013
             00000000000000f8 ffff81003e811b00 ffff81003e8118c0 ffff810011362100
             0000000000000012 ffff810017ca4180
      Call Trace:<ffffffff801317ed>{try_to_wake_up+893} <ffffffff80141677>{finish_stop+87}
             <ffffffff8014367f>{get_signal_to_deliver+1359} <ffffffff8010d3ad>{do_signal+157}
             <ffffffff8013deee>{ptrace_check_attach+222} <ffffffff80111575>{sys_ptrace+2293}
             <ffffffff80131810>{default_wake_function+0} <ffffffff80196399>{sys_ioctl+73}
             <ffffffff8010dd27>{sysret_signal+28} <ffffffff8010e00f>{ptregscall_common+103}
      
      test          D ffff810011362100     0 14309      1         14305 14312 (NOTLB)
      ffff810053c81cf8 0000000000000082 0000000000000286 0000000000000001
             0000000000000195 ffff810011362340 ffff810011362100 ffff81002e338040
             ffff810001e0ca80 0000000000000001
      Call Trace:<ffffffff801317ed>{try_to_wake_up+893} <ffffffff8044677d>{wait_for_completion+173}
             <ffffffff80131810>{default_wake_function+0} <ffffffff80137435>{exit_mm+149}
             <ffffffff801381af>{do_exit+479} <ffffffff80138d0c>{do_group_exit+252}
             <ffffffff801436db>{get_signal_to_deliver+1451} <ffffffff8010d3ad>{do_signal+157}
             <ffffffff8013deee>{ptrace_check_attach+222} <ffffffff80140850>{specific_send_sig_info+2
      
             <ffffffff8014208a>{force_sig_info+186} <ffffffff804479a0>{do_int3+112}
             <ffffffff8010e308>{retint_signal+61}
      test          D ffff81002e338040     0 14311      1         14716 14305 (NOTLB)
      ffff81005ca8dcf8 0000000000000082 0000000000000286 0000000000000001
             0000000000000120 ffff81002e338280 ffff81002e338040 ffff8100481cb740
             ffff810001e0ca80 0000000000000001
      Call Trace:<ffffffff801317ed>{try_to_wake_up+893} <ffffffff8044677d>{wait_for_completion+173}
             <ffffffff80131810>{default_wake_function+0} <ffffffff80137435>{exit_mm+149}
             <ffffffff801381af>{do_exit+479} <ffffffff80142d0e>{__dequeue_signal+558}
             <ffffffff80138d0c>{do_group_exit+252} <ffffffff801436db>{get_signal_to_deliver+1451}
             <ffffffff8010d3ad>{do_signal+157} <ffffffff8013deee>{ptrace_check_attach+222}
             <ffffffff80140850>{specific_send_sig_info+208} <ffffffff8014208a>{force_sig_info+186}
             <ffffffff804479a0>{do_int3+112} <ffffffff8010e308>{retint_signal+61}
      
      test          D ffff810017ca4180     0 14312      1         14309 13882 (NOTLB)
      ffff81005d15fcb8 0000000000000082 ffff81005d15fc58 ffffffff80130816
             0000000000000897 ffff810017ca43c0 ffff810017ca4180 ffff81003e8118c0
             0000000000000082 ffffffff801317ed
      Call Trace:<ffffffff80130816>{activate_task+150} <ffffffff801317ed>{try_to_wake_up+893}
             <ffffffff8044677d>{wait_for_completion+173} <ffffffff80131810>{default_wake_function+0}
             <ffffffff8018cdc3>{do_coredump+819} <ffffffff80445f52>{thread_return+82}
             <ffffffff801436d4>{get_signal_to_deliver+1444} <ffffffff8010d3ad>{do_signal+157}
             <ffffffff8013deee>{ptrace_check_attach+222} <ffffffff80140850>{specific_send_sig_info+2
      
             <ffffffff804472e5>{_spin_unlock_irqrestore+5} <ffffffff8014208a>{force_sig_info+186}
             <ffffffff804476ff>{do_general_protection+159} <ffffffff8010e308>{retint_signal+61}
      Signed-off-by: Andrea Arcangeli <andrea@suse.de>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Linus Torvalds <torvalds@osdl.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] cpusets: automatic numa mempolicy rebinding · 68860ec1
      Paul Jackson authored
      
      This patch automatically updates a task's NUMA mempolicy when its cpuset
      memory placement changes.  It does so within the context of the task,
      without any need to support low level external mempolicy manipulation.
      
      If a system is not using cpusets, or if running on a system with just the
      root (all-encompassing) cpuset, then this remap is a no-op.  Only when a
      task is moved between cpusets, or a cpuset's memory placement is changed
      does the following apply.  Otherwise, the main routine below,
      rebind_policy() is not even called.
      
      When mixing cpusets, scheduler affinity, and NUMA mempolicies, the
      essential role of cpusets is to place jobs (several related tasks) on a set
      of CPUs and Memory Nodes, the essential role of sched_setaffinity is to
      manage a job's processor placement within its allowed cpuset, and the
      essential role of NUMA mempolicy (mbind, set_mempolicy) is to manage a job's
      memory placement within its allowed cpuset.
      
      However, CPU affinity and NUMA memory placement are managed within the
      kernel using absolute system wide numbering, not cpuset relative numbering.
      
      This is ok until a job is migrated to a different cpuset, or what's the
      same, a job's cpuset is moved to different CPUs and Memory Nodes.
      
      Then the CPU affinity and NUMA memory placement of the tasks in the job
      need to be updated, to preserve their cpuset-relative position.  This can
      be done for CPU affinity using sched_setaffinity() from user code, as one
      task can modify another's CPU affinity.  This cannot be done from an
      external task for NUMA memory placement, as that can only be modified in
      the context of the task using it.
      
      However, it is easy enough to remap a task's NUMA mempolicy automatically when
      a task is migrated, using the existing cpuset mechanism to trigger a
      refresh of a task's memory placement after its cpuset has changed.  All that
      is needed is the old and new nodemask, and notice to the task that it needs
      to rebind its mempolicy.  The task's mems_allowed has the old mask, the
      task's cpuset has the new mask, and the existing
      cpuset_update_current_mems_allowed() mechanism provides the notice.  The
      bitmap/cpumask/nodemask remap operators provide the cpuset relative
      calculations.
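      
      The cpuset-relative remap described above can be sketched in user space
      as follows.  This is only an illustration under stated assumptions:
      nodemasks are modelled as a single unsigned long, the helper names
      (remap_mask_sketch() and friends) are hypothetical and not taken from the
      patch, and wrapping modulo the size of the new mask when it is smaller
      than the old one is a choice made for this sketch.

```c
#include <assert.h>
#include <limits.h>

#define MASK_BITS ((int)(sizeof(unsigned long) * CHAR_BIT))

/* Number of set bits in mask. */
static int mask_weight(unsigned long mask)
{
	int n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/* Ordinal of 'bit' among the set bits of mask, or -1 if it is not set. */
static int bit_ordinal(unsigned long mask, int bit)
{
	if (!(mask & (1UL << bit)))
		return -1;
	return mask_weight(mask & ((1UL << bit) - 1));
}

/* Position of the n-th (0-based) set bit of mask, or -1 if there is none. */
static int nth_set_bit(unsigned long mask, int n)
{
	for (int bit = 0; bit < MASK_BITS; bit++)
		if ((mask & (1UL << bit)) && n-- == 0)
			return bit;
	return -1;
}

/*
 * Hypothetical helper: move each node in src from its relative position
 * in old_mask to the same relative position in new_mask.  Nodes that are
 * not part of old_mask are left untouched.
 */
static unsigned long remap_mask_sketch(unsigned long src,
				       unsigned long old_mask,
				       unsigned long new_mask)
{
	unsigned long dst = 0;
	int w = mask_weight(new_mask);

	for (int bit = 0; bit < MASK_BITS; bit++) {
		int n, pos;

		if (!(src & (1UL << bit)))
			continue;
		n = bit_ordinal(old_mask, bit);
		if (n < 0 || w == 0) {
			dst |= 1UL << bit;	/* outside old_mask: keep as is */
			continue;
		}
		pos = nth_set_bit(new_mask, n % w);
		assert(pos >= 0);
		dst |= 1UL << pos;
	}
	return dst;
}
```

      For example, a policy on nodes {1,2} of a cpuset that moves from nodes
      {0,1,2,3} to nodes {4,5,6,7} ends up on nodes {5,6}, which preserves its
      cpuset-relative position.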
      
      This patch leaves open a couple of issues:
      
       1) Updating vma and shmfs/tmpfs/hugetlbfs memory policies:
      
          These mempolicies may reference nodes outside of those allowed to
          the current task by its cpuset.  Tasks are migrated as part of jobs,
          which reside on what might be several cpusets in a subtree.  When such
          a job is migrated, all NUMA memory policy references to nodes within
          that cpuset subtree should be translated, and references to any nodes
          outside that subtree should be left untouched.  A future patch will
          provide the cpuset mechanism needed to mark such subtrees.  With that
          patch, we will be able to correctly migrate these other memory policies
          across a job migration.
      
       2) Updating cpuset, affinity and memory policies in user space:
      
          This is harder.  Any placement state stored in user space using
          system-wide numbering will be invalidated across a migration.  More
          work will be required to provide user code with a migration-safe means
          to manage its cpuset relative placement, while preserving the current
          APIs that pass system wide numbers, not cpuset relative numbers across
          the kernel-user boundary.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] cpusets: simple rename · 18a19cb3
      Paul Jackson authored
      
      Add support for renaming cpusets.  Only allow simple rename of cpuset
      directories in place.  Don't allow moving cpusets elsewhere in hierarchy or
      renaming the special cpuset files in each cpuset directory.
      
      The usefulness of this simple rename became apparent when developing task
      migration facilities.  It allows building a second cpuset hierarchy using
      new names and containing new CPUs and Memory Nodes, moving tasks from the
      old to the new cpusets, removing the old cpusets, and then renaming the new
      cpusets to be just like the old names, so that any knowledge that the tasks
      had of their cpuset names will still be valid.
      
      Leaf node cpusets can be migrated to other CPUs or Memory Nodes by just
      updating their 'cpus' and 'mems' files, but because no cpuset can contain
      CPUs or Nodes not in its parent cpuset, one cannot do this in a cpuset
      hierarchy without first expanding all the non-leaf cpusets to contain the
      union of both the old and new CPUs and Nodes, which would obfuscate the
      one-to-one migration of a task from one cpuset to another required to
      correctly migrate the physical page frames currently allocated to that
      task.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] cpusets: dual semaphore locking overhaul · 053199ed
      Paul Jackson authored
      
      Overhaul cpuset locking.  Replace single semaphore with two semaphores.
      
      The suggestion to use two locks was made by Roman Zippel.
      
      Both locks are global.  Code that wants to modify cpusets must first
      acquire the exclusive manage_sem, which allows them read-only access to
      cpusets, and holds off other would-be modifiers.  Before making actual
      changes, the second semaphore, callback_sem must be acquired as well.  Code
      that needs only to query cpusets must acquire callback_sem, which is also a
      global exclusive lock.
      
      The earlier problems with double tripping are avoided, because it is
      allowed for holders of manage_sem to nest the second callback_sem lock, and
      only callback_sem is needed by code called from within __alloc_pages(),
      where the double tripping had been possible.
      
      This is not quite the same as a normal read/write semaphore, because
      obtaining read-only access with intent to change must hold off other such
      attempts, while allowing read-only access w/o such intention.  Changing
      cpusets involves several related checks and changes, which must be done
      while allowing read-only queries (to avoid the double trip), but while
      ensuring nothing changes (holding off other would be modifiers.)
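      
      The two-lock protocol described above can be sketched in user space
      roughly as follows.  This is an illustration only: pthread mutexes stand
      in for the kernel semaphores, the guarded state is a single word, and
      all function names are hypothetical.  Only the lock names manage_sem and
      callback_sem come from the description.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t manage_sem = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t callback_sem = PTHREAD_MUTEX_INITIALIZER;
static unsigned long mems_allowed_sketch = 0x3;	/* state guarded by the locks */

static void lock_sketch(pthread_mutex_t *m)
{
	int rc = pthread_mutex_lock(m);

	assert(rc == 0);
	(void)rc;
}

static void unlock_sketch(pthread_mutex_t *m)
{
	int rc = pthread_mutex_unlock(m);

	assert(rc == 0);
	(void)rc;
}

/* Query path: takes callback_sem only. */
static unsigned long query_mems_sketch(void)
{
	unsigned long mems;

	lock_sketch(&callback_sem);
	mems = mems_allowed_sketch;
	unlock_sketch(&callback_sem);
	return mems;
}

/* Modify path: manage_sem first, callback_sem only around the change. */
static int update_mems_sketch(unsigned long new_mems)
{
	int ret = -1;

	lock_sketch(&manage_sem);	/* holds off other would-be modifiers */
	/*
	 * Validation runs with manage_sem held.  A nested query, such as
	 * one an allocation might trigger, cannot double trip here because
	 * callback_sem is not held yet.
	 */
	if (new_mems != 0 && new_mems != query_mems_sketch()) {
		lock_sketch(&callback_sem);
		mems_allowed_sketch = new_mems;	/* the actual change */
		unlock_sketch(&callback_sem);
		ret = 0;
	}
	unlock_sketch(&manage_sem);
	return ret;
}
```

      In this sketch an update with an empty or unchanged mask is rejected
      during validation, so callback_sem is never taken on that path.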
      
      This overhaul of cpuset locking also makes careful use of task_lock() to
      guard access to the task->cpuset pointer, closing a couple of race
      conditions noticed while reading this code (thanks, Roman).  I've never
      seen these races fail in any use or test.
      
      See further the comments in the code.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] cpusets: remove depth counted locking hack · 5aa15b5f
      Paul Jackson authored
      
      Remove a rather hackish depth counter on cpuset locking.  The depth counter
      was avoiding a possible double trip on the global cpuset_sem semaphore.  It
      worked, but now an improved version of cpuset locking is available, to come
      in the next patch, using two global semaphores.
      
      This patch reverses "cpuset semaphore depth check deadlock fix"
      
      The kernel still works, even after this patch, except for some rare and
      difficult to reproduce race conditions when aggressively creating and
      destroying cpusets marked with the notify_on_release option, on very large
      systems.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] cpuset cleanup · f35f31d7
      Paul Jackson authored
      
      Remove one more useless line from cpuset_common_file_read().
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] posix-timers: use schedule_timeout() in common_nsleep() · 4eb9af2a
      Oleg Nesterov authored
      
      common_nsleep() reimplements schedule_timeout_interruptible() for an
      unknown reason.
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Unify sys_tkill() and sys_tgkill() · 6dd69f10
      Vadim Lobanov authored
      
      The majority of the sys_tkill() and sys_tgkill() function code is
      duplicated between the two of them.  This patch pulls the duplication out
      into a separate function -- do_tkill() -- and lets sys_tkill() and
      sys_tgkill() be simple wrappers around it.  This should make it easier to
      maintain in light of future changes.
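      
      The shape of this refactoring can be sketched in user space as follows.
      This is an illustration only: the task table, the struct, the argument
      checks and the convention that a tgid of 0 means "do not check the
      thread group" are assumptions of this sketch, and every name carries a
      _sketch suffix or is otherwise hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy stand-in for a task: thread id, thread group id, last signal sent. */
struct task_sketch {
	int pid;
	int tgid;
	int last_sig;
};

static struct task_sketch tasks_sketch[] = {
	{ 101, 100, 0 },
	{ 102, 100, 0 },
	{ 201, 200, 0 },
};

static struct task_sketch *find_task_sketch(int pid)
{
	size_t n = sizeof(tasks_sketch) / sizeof(tasks_sketch[0]);

	for (size_t i = 0; i < n; i++)
		if (tasks_sketch[i].pid == pid)
			return &tasks_sketch[i];
	return NULL;
}

/* Shared worker holding the logic that used to be duplicated. */
static int do_tkill_sketch(int tgid, int pid, int sig)
{
	struct task_sketch *p;

	assert(pid > 0);	/* the wrappers validate their arguments */
	p = find_task_sketch(pid);
	if (!p || (tgid > 0 && p->tgid != tgid))
		return -ESRCH;
	p->last_sig = sig;	/* stands in for delivering the signal */
	return 0;
}

/* Thin wrapper: the thread must belong to the given thread group. */
static int sys_tgkill_sketch(int tgid, int pid, int sig)
{
	if (pid <= 0 || tgid <= 0)
		return -EINVAL;
	return do_tkill_sketch(tgid, pid, sig);
}

/* Thin wrapper: no thread group check. */
static int sys_tkill_sketch(int pid, int sig)
{
	if (pid <= 0)
		return -EINVAL;
	return do_tkill_sketch(0, pid, sig);
}
```

      With the lookup and delivery in one place, a future change to either
      only has to be made in the shared worker.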
      Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] kill sigqueue->lock · 19a4fcb5
      Oleg Nesterov authored
      
      This lock is used in sigqueue_free(), but it is always equal to
      current->sighand->siglock, so we don't need to keep it in the struct
      sigqueue.
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] remove timer debug field · dfc4f94d
      Andrew Morton authored
      
      Remove timer_list.magic and associated debugging code.
      
      I originally added this when a spinlock was added to timer_list - this meant
      that an all-zeroes timer became illegal and init_timer() was required.
      
      That spinlock isn't even there any more, although timer.base must now be
      initialised.
      
      I'll keep this debugging code in -mm.
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Use alloc_percpu to allocate workqueues locally · 89ada679
      Christoph Lameter authored
      
      This patch makes the workqueues use alloc_percpu instead of an array.  The
      workqueues are placed on nodes local to each processor.
      
      The workqueue structure can grow to a significant size on a system with
      lots of processors if this patch is not applied.  64 bit architectures with
      all debugging features enabled and configured for 512 processors will not
      be able to boot without this patch.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] ntp whitespace cleanup · a5a0d52c
      Andrew Morton authored
      
      Fix bizarre 4-space coding style in the NTP code.
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] NTP shift_right cleanup · 1bb34a41
      john stultz authored
      
      Create a macro shift_right() that avoids the numerous ugly conditionals in the
      NTP code that look like:
      
              if(a < 0)
                      b = -(-a >> shift);
              else
                      b = a >> shift;
      
      Replacing it with:
      
              b = shift_right(a, shift);
      
      This should have zero effect on the logic, however it should probably have
      a bit of testing just to be sure.
      
      Also replace open-coded min/max with the macros.
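      
      The idea behind shift_right() can be sketched as a plain C function.
      This is an illustration only: shift_right_sketch() is a hypothetical
      name, and the use of an inline function on long (rather than a macro) is
      an assumption of this sketch.

```c
#include <assert.h>
#include <limits.h>

/*
 * Shift the magnitude of 'a' right by 'shift' bits and restore the sign,
 * so that negative values round toward zero.  A plain >> on a negative
 * value is implementation-defined in C and typically rounds toward minus
 * infinity instead.
 */
static inline long shift_right_sketch(long a, unsigned int shift)
{
	assert(shift < sizeof(long) * CHAR_BIT);
	assert(a > LONG_MIN);		/* -a must not overflow */

	if (a < 0)
		return -(-a >> shift);
	return a >> shift;
}
```

      With this helper, -5 shifted right by one bit gives -2, mirroring the
      result of 2 for +5.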
      
      Signed-off-by: John Stultz <johnstul@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Add kthread_stop_sem() · 61e1a9ea
      Alan Stern authored
      
      Enhance the kthread API by adding kthread_stop_sem, for use in stopping
      threads that spend their idle time waiting on a semaphore.
      Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] introduce setup_timer() helper · a8db2db1
      Oleg Nesterov authored
      
      Every user of init_timer() also needs to initialize ->function and ->data
      fields.  This patch adds a simple setup_timer() helper for that.
      
      The schedule_timeout() is patched as an example of usage.
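      
      The helper described above can be sketched in user space as follows.
      This is an illustration only: struct timer_list_sketch models just the
      fields mentioned in the text, init_timer_sketch() is a stand-in for the
      real initialisation, and record_fire_sketch() is a hypothetical callback
      added so the sketch can be exercised.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's timer structure. */
struct timer_list_sketch {
	unsigned long expires;
	void (*function)(unsigned long);
	unsigned long data;
	int initialized;	/* stands in for whatever init sets up */
};

static void init_timer_sketch(struct timer_list_sketch *timer)
{
	timer->expires = 0;
	timer->function = NULL;
	timer->data = 0;
	timer->initialized = 1;
}

/* One call replaces the init plus two open-coded assignments. */
static inline void setup_timer_sketch(struct timer_list_sketch *timer,
				      void (*function)(unsigned long),
				      unsigned long data)
{
	assert(timer != NULL && function != NULL);
	init_timer_sketch(timer);
	timer->function = function;
	timer->data = data;
}

/* Hypothetical callback that records the data value it was fired with. */
static unsigned long last_fired_sketch;

static void record_fire_sketch(unsigned long data)
{
	last_fired_sketch = data;
}
```

      A caller then writes setup_timer_sketch(&t, record_fire_sketch, 42UL)
      instead of three separate statements.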
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] introduce .valid callback for pm_ops · eb9289eb
      Shaohua Li authored
      
      Add pm_ops.valid callback, so only the available pm states show in
      /sys/power/state.  This also makes enter_state report an invalid state
      early, before we do the actual suspend/resume.
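      
      A user-space sketch of how such a callback might be consulted is given
      here.  It is an illustration only: the struct, the state names and
      numbering, the -EINVAL error code and every identifier with a _sketch
      or _SK suffix are assumptions, not the code from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

enum { PM_ON_SK, PM_STANDBY_SK, PM_MEM_SK, PM_DISK_SK, PM_MAX_SK };

static const char *const state_names_sketch[PM_MAX_SK] = {
	NULL, "standby", "mem", "disk",
};

struct pm_ops_sketch {
	int (*valid)(int state);	/* nonzero if the platform supports it */
};

/* Hypothetical platform that only supports suspend-to-RAM. */
static int demo_valid_sketch(int state)
{
	return state == PM_MEM_SK;
}

static struct pm_ops_sketch demo_ops_sketch = { .valid = demo_valid_sketch };

/* Build a /sys/power/state style string listing only the valid states. */
static size_t show_states_sketch(const struct pm_ops_sketch *ops,
				 char *buf, size_t len)
{
	size_t used = 0;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';
	for (int i = 0; i < PM_MAX_SK; i++) {
		size_t need;

		if (!state_names_sketch[i])
			continue;
		if (!ops || !ops->valid || !ops->valid(i))
			continue;
		need = strlen(state_names_sketch[i]) + 1;	/* name + space */
		if (used + need >= len)
			break;
		memcpy(buf + used, state_names_sketch[i], need - 1);
		buf[used + need - 1] = ' ';
		used += need;
		buf[used] = '\0';
	}
	return used;
}

/* Reject an unsupported state before any suspend work would start. */
static int enter_state_sketch(const struct pm_ops_sketch *ops, int state)
{
	if (!ops || !ops->valid || !ops->valid(state))
		return -EINVAL;
	return 0;	/* the real suspend/resume work would go here */
}
```

      The same callback drives both the listing and the early check, so the
      two cannot disagree about which states are available.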
      
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Acked-by: Pavel Machek <pavel@suse.cz>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] swsusp: two simplifications · 0245b3e7
      Rafael J. Wysocki authored
      
      The following patch simplifies the progress meter in disk.c:free_some_memory()
      and makes disk.c:pm_suspend_disk() call device_resume() explicitly in the
      suspend path.
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Acked-by: Pavel Machek <pavel@suse.cz>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] swsusp: get rid of unnecessary wrapper function · 2e32a43e
      Rafael J. Wysocki authored
      
      The following patch merges two functions in a trivial way.
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Acked-by: Pavel Machek <pavel@suse.cz>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] swsusp: cleanups · de491861
      Pavel Machek authored
      
      Reduce the number of ifdefs somewhat, and fix whitespace a bit.  No real code
      changes.
      Signed-off-by: Pavel Machek <pavel@suse.cz>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] swsusp: remove unneccessary includes · 96bc7aec
      Pavel Machek authored
      
      Clean up comments and remove unnecessary includes.
      Signed-off-by: Pavel Machek <pavel@suse.cz>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] swsusp: rework memory freeing on resume · 2c1b4a5c
      Rafael J. Wysocki authored
      
      The following patch makes swsusp use the PG_nosave and PG_nosave_free flags to
      mark pages that should be freed in case of an error during resume.
      
      This allows us to simplify the code and to use swsusp_free() in all of the
      swsusp's resume error paths, which makes them actually work.
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • a0f49651
      Rafael J. Wysocki authored
    • [PATCH] swsusp: move snapshot functionality to separate file · 25761b6e
      Rafael J. Wysocki authored
      
      The following patch moves the functionality of swsusp related to creating and
      handling the snapshot of memory to a separate file, snapshot.c.
      
      This should enable us to untangle the code in the future and eventually to
      implement some parts of swsusp.c in user space.
      
      The patch does not change the code.
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Signed-off-by: Pavel Machek <pavel@suse.cz>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] swsusp: rework image freeing · 351619ba
      Rafael J. Wysocki authored
      
      The following patch makes swsusp use PG_nosave and PG_nosave_free flags to
      mark pages that should be freed after the state of the system has been
      restored from the image (or in case of an error during suspend).
      
      This allows us to avoid storing metadata in swap twice and to reduce the
      amount of memory needed by swsusp.   Additionally, it allows us to simplify
      the code by removing a couple of functions that are no longer necessary.
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Signed-off-by: Pavel Machek <pavel@suse.cz>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] create and destroy cpufreq sysfs entries based on cpu notifiers · c32b6b8e
      Ashok Raj authored
      
      cpufreq entries in sysfs should only be populated when the CPU is in the
      online state.  When we boot with maxcpus=x and then boot the other cpus by
      echoing to the sysfs online file, these entries should be created, and they
      should be destroyed when CPU_DEAD is notified.  Same treatment as cache
      entries under sysfs.
      
      We place the processor in the lowest frequency, so hw managed P-State
      transitions can still work on the other threads to save power.
      
      Primary goal was to just make these directories appear/disappear dynamically.
      
      There is one thing in this patch I had to do which I really don't like
      myself, but it is probably best if someone handling the cpufreq
      infrastructure could give this code the right treatment if this is not
      acceptable.  I guess it's probably good for the first cut.
      
      - Converting lock_cpu_hotplug()/unlock_cpu_hotplug() to disable/enable preempt.
        The locking was smack in the middle of the notification path, when the
        hotplug is already holding the lock. I tried another solution that avoids
        taking locks if we know we are in the notification path. That solution
        was getting very ugly and I decided this was probably good for this iteration
        until someone who understands cpufreq could do a better job than me.
      
      (akpm: export cpucontrol to GPL modules: drivers/cpufreq/cpufreq_stats.c now
      does lock_cpu_hotplug())
      Signed-off-by: Ashok Raj <ashok.raj@intel.com>
      Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
      Cc: Dave Jones <davej@codemonkey.org.uk>
      Cc: Zwane Mwaikambo <zwane@holomorphy.com>
      Cc: Greg KH <greg@kroah.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      c32b6b8e
    • Brian Gerst's avatar
      [PATCH] Remove redundant configs.o · 4276d322
      Brian Gerst authored
      
      Since CONFIG_IKCONFIG_PROC already depends on CONFIG_IKCONFIG, adding
      configs.o again is redundant.
      Signed-off-by: Brian Gerst <bgerst@didntduck.org>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      4276d322
    • Hugh Dickins's avatar
      [PATCH] mm: split page table lock · 4c21e2f2
      Hugh Dickins authored
      
      Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with
      a many-threaded application which concurrently initializes different parts of
      a large anonymous area.
      
      This patch corrects that, by using a separate spinlock per page table page, to
      guard the page table entries in that page, instead of using the mm's single
      page_table_lock.  (But even then, page_table_lock is still used to guard page
      table allocation, and anon_vma allocation.)
      
      In this implementation, the spinlock is tucked inside the struct page of the
      page table page: with a BUILD_BUG_ON in case it overflows - which it would in
      the case of 32-bit PA-RISC with spinlock debugging enabled.
      
      Splitting the lock is not quite for free: another cacheline access.  Ideally,
      I suppose we would use split ptlock only for multi-threaded processes on
      multi-cpu machines; but deciding that dynamically would have its own costs.
      So for now enable it by config, at some number of cpus - since the Kconfig
      language doesn't support inequalities, let preprocessor compare that with
      NR_CPUS.  But I don't think it's worth being user-configurable: for good
      testing of both split and unsplit configs, split now at 4 cpus, and perhaps
      change that to 8 later.
      
      There is a benefit even for singly threaded processes: kswapd can be attacking
      one part of the mm while another part is busy faulting.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      4c21e2f2
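
      As a rough user-space illustration of the split page table lock described
      in the commit above (not the kernel's code), the sketch below picks between
      one lock per mm and one lock per page table page at compile time, by letting
      the preprocessor compare a cpu count with a threshold.  The names
      spinlock_sketch, pt_page_sketch, mm_sketch, pte_lockptr_sketch and
      SPLIT_PTLOCK_CPUS, and the NR_CPUS value of 8, are assumptions made for the
      example; the toy spinlock is a C11 atomic flag, not a kernel spinlock.

```c
#include <assert.h>
#include <stdatomic.h>

#define NR_CPUS            8   /* assumed build-time cpu count */
#define SPLIT_PTLOCK_CPUS  4   /* "split now at 4 cpus" per the commit message */

/* Kconfig cannot express inequalities, so the preprocessor does the compare. */
#if NR_CPUS >= SPLIT_PTLOCK_CPUS
#define USE_SPLIT_PTLOCK 1
#else
#define USE_SPLIT_PTLOCK 0
#endif

/* Toy spinlock built on a C11 atomic flag (hypothetical helper). */
typedef struct { atomic_flag held; } spinlock_sketch;

static void spin_init(spinlock_sketch *l)   { atomic_flag_clear(&l->held); }
static int  spin_trylock(spinlock_sketch *l) { return !atomic_flag_test_and_set(&l->held); }
static void spin_lock(spinlock_sketch *l)   { while (!spin_trylock(l)) { } }
static void spin_unlock(spinlock_sketch *l) { atomic_flag_clear(&l->held); }

/* One page table page: the lock is tucked next to the entries it guards. */
struct pt_page_sketch {
    spinlock_sketch ptl;
    unsigned long pte[4];
};

/* The mm keeps its single lock as well, plus two page table pages. */
struct mm_sketch {
    spinlock_sketch page_table_lock;
    struct pt_page_sketch tables[2];
};

static void mm_init(struct mm_sketch *mm)
{
    spin_init(&mm->page_table_lock);
    for (int i = 0; i < 2; i++)
        spin_init(&mm->tables[i].ptl);
}

/* Return the lock that guards the entries of the given page table page. */
static spinlock_sketch *pte_lockptr_sketch(struct mm_sketch *mm,
                                           struct pt_page_sketch *pt)
{
#if USE_SPLIT_PTLOCK
    (void)mm;
    return &pt->ptl;               /* split: one lock per page table page */
#else
    (void)pt;
    return &mm->page_table_lock;   /* unsplit: the mm's single lock */
#endif
}
```

      With the split enabled, holding the lock for tables[0] does not prevent
      another thread from taking the lock for tables[1], which is the scalability
      gain the commit is after; with it disabled, both calls return the same lock.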
    • Hugh Dickins's avatar
      [PATCH] mm: follow_page with inner ptlock · deceb6cd
      Hugh Dickins authored
      
      Final step in pushing down common core's page_table_lock.  follow_page no
      longer wants caller to hold page_table_lock, uses pte_offset_map_lock itself;
      and so no page_table_lock is taken in get_user_pages itself.
      
      But get_user_pages (and get_futex_key) do then need follow_page to pin the
      page for them: take Daniel's suggestion of bitflags to follow_page.
      
      Need one for WRITE, another for TOUCH (it was the accessed flag before:
      vanished along with check_user_page_readable, but surely get_numa_maps is
      wrong to mark every page it finds as accessed), another for GET.
      
      And another, ANON to dispose of untouched_anonymous_page: it seems silly for
      that to descend a second time, let follow_page observe if there was no page
      table and return ZERO_PAGE if so.  Fix minor bug in that: check VM_LOCKED -
      make_pages_present ought to make readonly anonymous present.
      
      Give get_numa_maps a cond_resched while we're there.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      deceb6cd
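
      The bitflag interface described in the commit above can be pictured with a
      small user-space sketch.  It is only an approximation: the flag values, the
      page_sketch type, the zero_page object and follow_page_sketch itself are
      assumptions made for illustration, and a missing page table is modelled
      simply as a NULL page argument.

```c
#include <assert.h>
#include <stddef.h>

/* One bit per behaviour named in the commit message (values assumed). */
#define FOLL_WRITE 0x01u  /* caller needs a writable page */
#define FOLL_TOUCH 0x02u  /* mark the page accessed */
#define FOLL_GET   0x04u  /* pin the page by taking a reference */
#define FOLL_ANON  0x08u  /* hand back the zero page if no page table */

/* Hypothetical stand-in for struct page. */
struct page_sketch {
    int refcount;
    int accessed;
    int writable;
};

static struct page_sketch zero_page;   /* stand-in for ZERO_PAGE */

/*
 * page == NULL models "there was no page table".  Returns the page found,
 * or NULL when the request cannot be satisfied.
 */
static struct page_sketch *follow_page_sketch(struct page_sketch *page,
                                              unsigned int flags)
{
    if (page == NULL) {
        /* Only an ANON lookup for reading may fall back to the zero page. */
        if (!(flags & FOLL_ANON) || (flags & FOLL_WRITE))
            return NULL;
        page = &zero_page;
    } else if ((flags & FOLL_WRITE) && !page->writable) {
        return NULL;
    }

    if (flags & FOLL_GET)
        page->refcount++;       /* pinned on behalf of the caller */
    if (flags & FOLL_TOUCH)
        page->accessed = 1;     /* only when explicitly requested */
    return page;
}
```

      A caller in the style of get_user_pages would pass FOLL_GET (and usually
      FOLL_TOUCH), while a pure inspector such as a numa-maps walker would pass
      neither, so it no longer marks every page it finds as accessed.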
    • Hugh Dickins's avatar
      [PATCH] mm: ptd_alloc take ptlock · c74df32c
      Hugh Dickins authored
      
      Second step in pushing down the page_table_lock.  Remove the temporary
      bridging hack from __pud_alloc, __pmd_alloc, __pte_alloc: expect callers not
      to hold page_table_lock, whether it's on init_mm or a user mm; take
      page_table_lock internally to check if a racing task already allocated.
      
      Convert their callers from common code.  But avoid coming back to change them
      again later: instead of moving the spin_lock(&mm->page_table_lock) down,
      switch over to new macros pte_alloc_map_lock and pte_unmap_unlock, which
      encapsulate the mapping+locking and unlocking+unmapping together, and in the
      end may use alternatives to the mm page_table_lock itself.
      
      These callers all hold mmap_sem (some exclusively, some not), so at no level
      can a page table be whipped away from beneath them; and pte_alloc uses the
      "atomic" pmd_present to test whether it needs to allocate.  It appears that on
      all arches we can safely descend without page_table_lock.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      c74df32c
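
      The allocation pattern in the commit above (enter without the lock,
      allocate, then take page_table_lock only to check whether a racing task got
      there first) might look roughly like this in user space.  This is a sketch
      under assumptions, not the kernel's __pte_alloc: the pmd_sketch and
      mm_sketch types, TABLE_BYTES, and both function names are invented for the
      example, and the lock is a C11 atomic flag.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

#define TABLE_BYTES 4096   /* assumed size of one page table page */

/* Hypothetical pmd entry: points at a pte table once one is installed. */
struct pmd_sketch {
    _Atomic(void *) pte_table;
};

struct mm_sketch {
    atomic_flag page_table_lock;   /* toy spinlock */
};

/*
 * Slow path: the caller does NOT hold page_table_lock.  Allocate first,
 * then take the lock just long enough to install the table, or to notice
 * that somebody else already did.  Returns 0 on success, -1 on failure.
 */
static int pte_alloc_slow_sketch(struct mm_sketch *mm, struct pmd_sketch *pmd)
{
    void *new_table = calloc(1, TABLE_BYTES);
    if (new_table == NULL)
        return -1;

    while (atomic_flag_test_and_set(&mm->page_table_lock)) { }
    if (atomic_load(&pmd->pte_table) != NULL)
        free(new_table);                       /* lost the race: discard ours */
    else
        atomic_store(&pmd->pte_table, new_table);
    atomic_flag_clear(&mm->page_table_lock);
    return 0;
}

/* Fast path: an atomic presence test decides whether to allocate at all. */
static int pte_alloc_sketch(struct mm_sketch *mm, struct pmd_sketch *pmd)
{
    if (atomic_load(&pmd->pte_table) != NULL)
        return 0;
    return pte_alloc_slow_sketch(mm, pmd);
}
```

      Calling the slow path a second time on an already populated entry stands in
      for the losing side of the race: the freshly allocated table is freed and
      the installed one is left untouched.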
    • Hugh Dickins's avatar
      [PATCH] mm: update_hiwaters just in time · 365e9c87
      Hugh Dickins authored
      
      update_mem_hiwater has attracted various criticisms, in particular from those
      concerned with mm scalability.  Originally it was called whenever rss or
      total_vm got raised.  Then many of those callsites were replaced by a timer
      tick call from account_system_time.  Now Frank van Maarseveen reports that to
      be found inadequate.  How about this?  Works for Frank.
      
      Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros
      update_hiwater_rss and update_hiwater_vm.  Don't attempt to keep
      mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually
      by 1): those are hot paths.  Do the opposite, update only when about to lower
      rss (usually by many), or just before final accounting in do_exit.  Handle
      mm->hiwater_vm in the same way, though it's much less of an issue.  Demand
      that whoever collects these hiwater statistics do the work of taking the
      maximum with rss or total_vm.
      
      And there has been no collector of these hiwater statistics in the tree.  The
      new convention needs an example, so match Frank's usage by adding a VmPeak
      line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS
      (High-Water-Mark or High-Water-Memory).
      
      There was a particular anomaly during mremap move, that hiwater_vm might be
      captured too high.  A fleeting such anomaly remains, but it's quickly
      corrected now, whereas before it would stick.
      
      What locking?  None: if the app is racy then these statistics will be racy,
      it's not worth any overhead to make them exact.  But whenever it suits,
      hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under
      page_table_lock (for now) or with preemption disabled (later on): without
      going to any trouble, minimize the time between reading current values and
      updating, to minimize those occasions when a racing thread bumps a count up
      and back down in between.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      365e9c87
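
      The "just in time" convention from the commit above can be shown with a
      minimal user-space sketch.  The commit names update_hiwater_rss and
      update_hiwater_vm as macros; here they are plain functions over a
      hypothetical mm_sketch struct, and the collector helpers peak_rss and
      peak_vm are assumptions added to show who takes the maximum.

```c
#include <assert.h>

/* Hypothetical stand-in for the few mm fields involved. */
struct mm_sketch {
    unsigned long rss;          /* current resident pages */
    unsigned long hiwater_rss;  /* lazily maintained peak of rss */
    unsigned long total_vm;     /* current mapped pages */
    unsigned long hiwater_vm;   /* lazily maintained peak of total_vm */
};

/* Call only when rss is about to be lowered, never on the raise path. */
static void update_hiwater_rss(struct mm_sketch *mm)
{
    if (mm->hiwater_rss < mm->rss)
        mm->hiwater_rss = mm->rss;
}

/* Same idea for total_vm. */
static void update_hiwater_vm(struct mm_sketch *mm)
{
    if (mm->hiwater_vm < mm->total_vm)
        mm->hiwater_vm = mm->total_vm;
}

/* The collector does the work of taking the maximum with the live value. */
static unsigned long peak_rss(const struct mm_sketch *mm)
{
    return mm->hiwater_rss > mm->rss ? mm->hiwater_rss : mm->rss;
}

static unsigned long peak_vm(const struct mm_sketch *mm)
{
    return mm->hiwater_vm > mm->total_vm ? mm->hiwater_vm : mm->total_vm;
}
```

      Raising rss costs nothing extra: hiwater_rss may lag behind, and a reader
      such as the VmHWM line still sees the right peak because it compares against
      the current value.  The peak is recorded only at the moment it would
      otherwise be lost, just before rss drops.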
    • Nick Piggin's avatar
      [PATCH] core remove PageReserved · b5810039
      Nick Piggin authored
      
      Remove PageReserved() calls from core code by tightening VM_RESERVED
      handling in mm/ to cover PageReserved functionality.
      
      PageReserved special casing is removed from get_page and put_page.
      
      All setting and clearing of PageReserved is retained, and it is now flagged
      in the page_alloc checks to help ensure we don't introduce any refcount
      based freeing of Reserved pages.
      
      MAP_PRIVATE, PROT_WRITE of VM_RESERVED regions is tentatively being
      deprecated.  We never completely handled it correctly anyway, and it can be
      reintroduced in future if required (Hugh has a proof of concept).
      
      Once PageReserved() calls are removed from kernel/power/swsusp.c, and all
      arch/ and driver code, the Set and Clear calls, and the PG_reserved bit can
      be trivially removed.
      
      Last real user of PageReserved is swsusp, which uses PageReserved to
      determine whether a struct page points to valid memory or not.  This still
      needs to be addressed (a generic page_is_ram() should work).
      
      A last caveat: the ZERO_PAGE is now refcounted and managed with rmap (and
      thus mapcounted and counted towards shared rss).  These writes to the struct
      page could cause excessive cacheline bouncing on big systems.  There are a
      number of ways this could be addressed if it is an issue.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      
      Refcount bug fix for filemap_xip.c
      Signed-off-by: Carsten Otte <cotte@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      b5810039
    • Hugh Dickins's avatar
      [PATCH] mm: dup_mmap down new mmap_sem · 7ee78232
      Hugh Dickins authored
      
      One anomaly remains from when Andrea rationalized the responsibilities of
      mmap_sem and page_table_lock: in dup_mmap we add vmas to the child holding its
      page_table_lock, but not the mmap_sem which normally guards the vma list and
      rbtree.  Which could be an issue for unuse_mm: though since it just walks down
      the list (today with page_table_lock, tomorrow not), it's probably okay.  Will
      need a memory barrier?  Oh, keep it simple, Nick and I agreed, no harm in
      taking child's mmap_sem here.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      7ee78232
    • Hugh Dickins's avatar
      [PATCH] mm: dup_mmap use oldmm more · fd3e42fc
      Hugh Dickins authored
      
      Use the parent's oldmm throughout dup_mmap, instead of perversely going back
      to current->mm.  (Can you hear the sigh of relief from those mpnts?  Usually I
      squash them, but not today.)
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      fd3e42fc