- 25 Feb, 2008 2 commits
-
-
Harvey Harrison authored
Unsigned long values are always assigned to switch_count, make it unsigned long. kernel/sched.c:3897:15: warning: incorrect type in assignment (different signedness) kernel/sched.c:3897:15: expected long *switch_count kernel/sched.c:3897:15: got unsigned long *<noident> kernel/sched.c:3921:16: warning: incorrect type in assignment (different signedness) kernel/sched.c:3921:16: expected long *switch_count kernel/sched.c:3921:16: got unsigned long *<noident> Signed-off-by:
Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Ingo Molnar authored
do not call sched_clock() too early. Not only might rq->idle not be set up - but pure per-cpu data might not be accessible either. this solves an ia64 early bootup hang with CONFIG_PRINTK_TIME=y. Tested-by:
Tony Luck <tony.luck@gmail.com> Acked-by:
Tony Luck <tony.luck@gmail.com> Acked-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
- 23 Feb, 2008 2 commits
-
-
Linus Torvalds authored
Oleg Nesterov and others have pointed out that on some architectures, the traditional sequence of set_current_state(TASK_INTERRUPTIBLE); if (CONDITION) return; schedule(); is racy wrt another CPU doing CONDITION = 1; wake_up_process(p); because while set_current_state() has a memory barrier separating setting of the TASK_INTERRUPTIBLE state from reading of the CONDITION variable, there is no such memory barrier on the wakeup side. Now, wake_up_process() does actually take a spinlock before it reads and sets the task state on the waking side, and on x86 (and many other architectures) that spinlock is in fact equivalent to a memory barrier, but that is not generally guaranteed. The write that sets CONDITION could move into the critical region protected by the runqueue spinlock. However, adding a smp_wmb() to before the spinlock should now order the writing of CONDITION wrt the lock itself, which in turn is ordered wrt the accesses within the spinlock (which includes the reading of the old state). This should thus close the race (which probably has never been seen in practice, but since smp_wmb() is a no-op on x86, it's not like this will make anything worse either on the most common architecture where the spinlock already gave the required protection). Acked-by:
Oleg Nesterov <oleg@tv-sign.ru> Acked-by:
Dmitry Adamushko <dmitry.adamushko@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Srinivasa Ds authored
Kprobes makes use of preempt_disable(),preempt_enable_noresched() and these functions inturn call add/sub_preempt_count(). So we need to refuse user from inserting probe in to these functions. This patch disallows user from probing add/sub_preempt_count(). Signed-off-by:
Srinivasa DS <srinivasa@in.ibm.com> Acked-by:
Ananth N Mavinakayanahalli <ananth@in.ibm.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- 13 Feb, 2008 7 commits
-
-
Peter Zijlstra authored
Refuse to accept or create RT tasks in groups that can't run them. Signed-off-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Peter Zijlstra authored
Clean up some of the excessive ifdeffery introduces in the last patch. Signed-off-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Peter Zijlstra authored
Make the rt group scheduler compile time configurable. Keep it experimental for now. Signed-off-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Peter Zijlstra authored
Change the rt_ratio interface to rt_runtime_us, to match rt_period_us. This avoids picking a granularity for the ratio. Extend the /sys/kernel/uids/<uid>/ interface to allow setting the group's rt_runtime. Signed-off-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Peter Zijlstra authored
Steven mentioned the fun case where a lock holding task will be throttled. Simple fix: allow groups that have boosted tasks to run anyway. If a runnable task in a throttled group gets boosted the dequeue/enqueue done by rt_mutex_setprio() is enough to unthrottle the group. This is ofcourse not quite correct. Two possible ways forward are: - second prio array for boosted tasks - boost to a prio ceiling (this would also work for deadline scheduling) Signed-off-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Peter Zijlstra authored
lockdep spotted this bogus irq locking. normalize_rt_tasks() can be called from hardirq context through sysrq-n Signed-off-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Peter Zijlstra authored
On Mon, 2008-02-11 at 15:09 +0300, Denis V. Lunev wrote: > BUG: sleeping function called from invalid context > at /home/den/src/linux-netns26/kernel/mutex.c:209 > in_atomic():1, irqs_disabled():0 > no locks held by swapper/0. > Pid: 0, comm: swapper Not tainted 2.6.24 #304 > > Call Trace: > <IRQ> [<ffffffff80252d1e>] ? __debug_show_held_locks+0x15/0x27 > [<ffffffff8022c2a8>] __might_sleep+0xc0/0xdf > [<ffffffff8049f1df>] mutex_lock_nested+0x28/0x2a9 > [<ffffffff80231294>] sched_destroy_group+0x18/0xea > [<ffffffff8023e835>] sched_destroy_user+0xd/0xf > [<ffffffff8023e8c1>] free_uid+0x8a/0xab > [<ffffffff80233e24>] __put_task_struct+0x3f/0xd3 > [<ffffffff80236708>] delayed_put_task_struct+0x23/0x25 > [<ffffffff8026fda7>] __rcu_process_callbacks+0x8d/0x215 > [<ffffffff8026ff52>] rcu_process_callbacks+0x23/0x44 > [<ffffffff8023a2ae>] __do_softirq+0x79/0xf8 > [<ffffffff8020f8c3>] ? profile_pc+0x2a/0x67 > [<ffffffff8020d38c>] call_softirq+0x1c/0x30 > [<ffffffff8020f689>] do_softirq+0x61/0x9c > [<ffffffff8023a233>] irq_exit+0x51/0x53 > [<ffffffff8021bd1a>] smp_apic_timer_interrupt+0x77/0xad > [<ffffffff8020ce3b>] apic_timer_interrupt+0x6b/0x70 > <EOI> [<ffffffff8020b0dd>] ? default_idle+0x43/0x76 > [<ffffffff8020b0db>] ? default_idle+0x41/0x76 > [<ffffffff8020b09a>] ? default_idle+0x0/0x76 > [<ffffffff8020b186>] ? cpu_idle+0x76/0x98 separate the tg->shares protection from the task_group lock. Reported-by:
Denis V. Lunev <den@openvz.org> Tested-by:
Denis V. Lunev <den@openvz.org> Signed-off-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
- 08 Feb, 2008 1 commit
-
-
Harvey Harrison authored
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by:
Harvey Harrison <harvey.harrison@gmail.com> Acked-by:
Ingo Molnar <mingo@elte.hu> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- 31 Jan, 2008 1 commit
-
-
Gerald Stralko authored
This removes the extra struct task_struct *p parameter in inc_nr_running and dec_nr_running functions. Signed-off by: Jerry Stralko <gerb.stralko@gmail.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
- 30 Jan, 2008 1 commit
-
-
Nick Piggin authored
The break_lock data structure and code for spinlocks is quite nasty. Not only does it double the size of a spinlock but it changes locking to a potentially less optimal trylock. Put all of that under CONFIG_GENERIC_LOCKBREAK, and introduce a __raw_spin_is_contended that uses the lock data itself to determine whether there are waiters on the lock, to be used if CONFIG_GENERIC_LOCKBREAK is not set. Rename need_lockbreak to spin_needbreak, make it use spin_is_contended to decouple it from the spinlock implementation, and make it typesafe (rwlocks do not have any need_lockbreak sites -- why do they even get bloated up with that break_lock then?). Signed-off-by:
Nick Piggin <npiggin@suse.de> Signed-off-by:
Ingo Molnar <mingo@elte.hu> Signed-off-by:
Thomas Gleixner <tglx@linutronix.de>
-
- 25 Jan, 2008 26 commits
-
-
Nick Piggin authored
The attached patch is something really simple that can sometimes help in getting more info out of a hung system. Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Guillaume Chazarain authored
We monitor clock overflows, let's also monitor clock underflows. Signed-off-by:
Guillaume Chazarain <guichaz@yahoo.fr> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Guillaume Chazarain authored
sched: fix rq->clock warps on frequency changes Fix 2bacec8c (sched: touch softlockup watchdog after idling) that reintroduced warps on frequency changes. touch_softlockup_watchdog() calls __update_rq_clock that checks rq->clock for warps, so call it after adjusting rq->clock. Signed-off-by:
Guillaume Chazarain <guichaz@yahoo.fr> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Ingo Molnar authored
remove the !PREEMPT_BKL code. this removes 160 lines of legacy code. Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Peter Zijlstra authored
We need to teach no_hz about the rt throttling because its tick driven. Signed-off-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Peter Zijlstra authored
Extend group scheduling to also cover the realtime classes. It uses the time limiting introduced by the previous patch to allow multiple realtime groups. The hard time limit is required to keep behaviour deterministic. The algorithms used make the realtime scheduler O(tg), linear scaling wrt the number of task groups. This is the worst case behaviour I can't seem to get out of, the avg. case of the algorithms can be improved, I focused on correctness and worst case. [ akpm@linux-foundation.org: move side-effects out of BUG_ON(). ] Signed-off-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Peter Zijlstra authored
Very simple time limit on the realtime scheduling classes. Allow the rq's realtime class to consume sched_rt_ratio of every sched_rt_period slice. If the class exceeds this quota the fair class will preempt the realtime class. Signed-off-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Peter Zijlstra authored
Use HR-timers (when available) to deliver an accurate preemption tick. The regular scheduler tick that runs at 1/HZ can be too coarse when nice level are used. The fairness system will still keep the cpu utilisation 'fair' by then delaying the task that got an excessive amount of CPU time but try to minimize this by delivering preemption points spot-on. The average frequency of this extra interrupt is sched_latency / nr_latency. Which need not be higher than 1/HZ, its just that the distribution within the sched_latency period is important. Signed-off-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Herbert Xu authored
Why do we even have cond_resched when real preemption is on? It seems to be a waste of space and time. remove cond_resched with CONFIG_PREEMPT on. Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Ingo Molnar authored
whitespace fixes. Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Peter Zijlstra authored
Move the task_struct members specific to rt scheduling together. A future optimization could be to put sched_entity and sched_rt_entity into a union. Signed-off-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Gregory Haskins authored
The baseline code statically builds the span maps when the domain is formed. Previous attempts at dynamically updating the maps caused a suspend-to-ram regression, which should now be fixed. Signed-off-by:
Gregory Haskins <ghaskins@novell.com> CC: Gautham R Shenoy <ego@in.ibm.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Steven Rostedt authored
Dmitry Adamushko found that the current implementation of the RT balancing code left out changes to the sched_setscheduler and rt_mutex_setprio. This patch addresses this issue by adding methods to the schedule classes to handle being switched out of (switched_from) and being switched into (switched_to) a sched_class. Also a method for changing of priorities is also added (prio_changed). This patch also removes some duplicate logic between rt_mutex_setprio and sched_setscheduler. Signed-off-by:
Steven Rostedt <srostedt@redhat.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Steven Rostedt authored
To make the main sched.c code more agnostic to the schedule classes. Instead of having specific hooks in the schedule code for the RT class balancing. They are replaced with a pre_schedule, post_schedule and task_wake_up methods. These methods may be used by any of the classes but currently, only the sched_rt class implements them. Signed-off-by:
Steven Rostedt <srostedt@redhat.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Dmitry Adamushko authored
Clean-up try_to_wake_up(). Get rid of the 'new_cpu' variable in try_to_wake_up() [ that's, one #ifdef section less ]. Also remove a few redundant blank lines. Signed-off-by:
Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Ingo Molnar authored
add credits for RT balancing improvements. Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Ingo Molnar authored
style cleanup of various changes that were done recently. no code changed: text data bss dec hex filename 26399 2578 48 29025 7161 sched.o.before 26399 2578 48 29025 7161 sched.o.after Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Ingo Molnar authored
remove unused JIFFIES_TO_NS() macro. Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Gregory Haskins authored
We move the rt-overload data as the first global to per-domain reclassification. This limits the scope of overload related cache-line bouncing to stay with a specified partition instead of affecting all cpus in the system. Finally, we limit the scope of find_lowest_cpu searches to the domain instead of the entire system. Note that we would always respect domain boundaries even without this patch, but we first would scan potentially all cpus before whittling the list down. Now we can avoid looking at RQs that are out of scope, again reducing cache-line hits. Note: In some cases, task->cpus_allowed will effectively reduce our search to within our domain. However, I believe there are cases where the cpus_allowed mask may be all ones and therefore we err on the side of caution. If it can be optimized later, so be it. Signed-off-by:
Gregory Haskins <ghaskins@novell.com> CC: Christoph Lameter <clameter@sgi.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Gregory Haskins authored
We add the notion of a root-domain which will be used later to rescope global variables to per-domain variables. Each exclusive cpuset essentially defines an island domain by fully partitioning the member cpus from any other cpuset. However, we currently still maintain some policy/state as global variables which transcend all cpusets. Consider, for instance, rt-overload state. Whenever a new exclusive cpuset is created, we also create a new root-domain object and move each cpu member to the root-domain's span. By default the system creates a single root-domain with all cpus as members (mimicking the global state we have today). We add some plumbing for storing class specific data in our root-domain. Whenever a RQ is switching root-domains (because of repartitioning) we give each sched_class the opportunity to remove any state from its old domain and add state to the new one. This logic doesn't have any clients yet but it will later in the series. Signed-off-by:
Gregory Haskins <ghaskins@novell.com> CC: Christoph Lameter <clameter@sgi.com> CC: Paul Jackson <pj@sgi.com> CC: Simon Derr <simon.derr@bull.net> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Steven Rostedt authored
rt-balance when creating new tasks. Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Gregory Haskins authored
We have logic to detect whether the system has migratable tasks, but we are not using it when deciding whether to push tasks away. So we add support for considering this new information. Signed-off-by:
Gregory Haskins <ghaskins@novell.com> Signed-off-by:
Steven Rostedt <srostedt@redhat.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Gregory Haskins authored
The current wake-up code path tries to determine if it can optimize the wake-up to "this_cpu" by computing load calculations. The problem is that these calculations are only relevant to SCHED_OTHER tasks where load is king. For RT tasks, priority is king. So the load calculation is completely wasted bandwidth. Therefore, we create a new sched_class interface to help with pre-wakeup routing decisions and move the load calculation as a function of CFS task's class. Signed-off-by:
Gregory Haskins <ghaskins@novell.com> Signed-off-by:
Steven Rostedt <srostedt@redhat.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Gregory Haskins authored
Some RT tasks (particularly kthreads) are bound to one specific CPU. It is fairly common for two or more bound tasks to get queued up at the same time. Consider, for instance, softirq_timer and softirq_sched. A timer goes off in an ISR which schedules softirq_thread to run at RT50. Then the timer handler determines that it's time to smp-rebalance the system so it schedules softirq_sched to run. So we are in a situation where we have two RT50 tasks queued, and the system will go into rt-overload condition to request other CPUs for help. This causes two problems in the current code: 1) If a high-priority bound task and a low-priority unbounded task queue up behind the running task, we will fail to ever relocate the unbounded task because we terminate the search on the first unmovable task. 2) We spend precious futile cycles in the fast-path trying to pull overloaded tasks over. It is therefore optimial to strive to avoid the overhead all together if we can cheaply detect the condition before overload even occurs. This patch tries to achieve this optimization by utilizing the hamming weight of the task->cpus_allowed mask. A weight of 1 indicates that the task cannot be migrated. We will then utilize this information to skip non-migratable tasks and to eliminate uncessary rebalance attempts. We introduce a per-rq variable to count the number of migratable tasks that are currently running. We only go into overload if we have more than one rt task, AND at least one of them is migratable. In addition, we introduce a per-task variable to cache the cpus_allowed weight, since the hamming calculation is probably relatively expensive. We only update the cached value when the mask is updated which should be relatively infrequent, especially compared to scheduling frequency in the fast path. Signed-off-by:
Gregory Haskins <ghaskins@novell.com> Signed-off-by:
Steven Rostedt <srostedt@redhat.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Steven Rostedt authored
This patch adds pushing of overloaded RT tasks from a runqueue that is having tasks (most likely RT tasks) added to the run queue. TODO: We don't cover the case of waking of new RT tasks (yet). Signed-off-by:
Steven Rostedt <srostedt@redhat.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Steven Rostedt authored
This patch adds the algorithm to pull tasks from RT overloaded runqueues. When a pull RT is initiated, all overloaded runqueues are examined for a RT task that is higher in prio than the highest prio task queued on the target runqueue. If another runqueue holds a RT task that is of higher prio than the highest prio task on the target runqueue is found it is pulled to the target runqueue. Signed-off-by:
Steven Rostedt <srostedt@redhat.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-