aboutsummaryrefslogtreecommitdiff
path: root/include/linux/sched.h
AgeCommit message (Collapse)Author
2007-08-28sched: make the scheduler converge to the ideal latencyIngo Molnar
de-HZ-ification of the granularity defaults unearthed a pre-existing property of CFS: while it correctly converges to the granularity goal, it does not prevent run-time fluctuations in the range of [-gran ... 0 ... +gran]. With the increase of the granularity due to the removal of HZ dependencies, this becomes visible in chew-max output (with 5 tasks running): out: 28 . 27. 32 | flu: 0 . 0 | ran: 9 . 13 | per: 37 . 40 out: 27 . 27. 32 | flu: 0 . 0 | ran: 17 . 13 | per: 44 . 40 out: 27 . 27. 32 | flu: 0 . 0 | ran: 9 . 13 | per: 36 . 40 out: 29 . 27. 32 | flu: 2 . 0 | ran: 17 . 13 | per: 46 . 40 out: 28 . 27. 32 | flu: 0 . 0 | ran: 9 . 13 | per: 37 . 40 out: 29 . 27. 32 | flu: 0 . 0 | ran: 18 . 13 | per: 47 . 40 out: 28 . 27. 32 | flu: 0 . 0 | ran: 9 . 13 | per: 37 . 40 average slice is the ideal 13 msecs and the period is picture-perfect 40 msecs. But the 'ran' field fluctuates around 13.33 msecs and there's no mechanism in CFS to keep that from happening: it's a perfectly valid solution that CFS finds. to fix this we add a granularity/preemption rule that knows about the "target latency", which makes tasks that run longer than the ideal latency run a bit less. The simplest approach is to simply decrease the preemption granularity when a task overruns its ideal latency. For this we have to track how much the task executed since its last preemption. ( this adds a new field to task_struct, but we can eliminate that overhead in 2.6.24 by putting all the scheduler timestamps into an anonymous union. ) with this change in place, chew-max output is fluctuation-less all around: out: 28 . 27. 39 | flu: 0 . 2 | ran: 13 . 13 | per: 41 . 40 out: 28 . 27. 39 | flu: 0 . 2 | ran: 13 . 13 | per: 41 . 40 out: 28 . 27. 39 | flu: 0 . 2 | ran: 13 . 13 | per: 41 . 40 out: 28 . 27. 39 | flu: 0 . 2 | ran: 13 . 13 | per: 41 . 40 out: 28 . 27. 39 | flu: 0 . 1 | ran: 13 . 13 | per: 41 . 40 out: 28 . 27. 39 | flu: 0 . 1 | ran: 13 . 13 | per: 41 . 40 this patch has no impact on any fastpath or on any globally observable scheduling property. (unless you have sharp enough eyes to see millisecond-level ruckles in glxgears smoothness :-) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de>
2007-08-25sched: cleanup, sched_granularity -> sched_min_granularityIngo Molnar
due to adaptive granularity scheduling the role of sched_granularity has changed to "minimum granularity", so rename the variable (and the tunable) accordingly. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2007-08-25sched: adaptive scheduler granularityPeter Zijlstra
Instead of specifying the preemption granularity, specify the wanted latency. By fixing the granlarity to a constany the wakeup latency it a function of the number of running tasks on the rq. Invert this relation. sysctl_sched_granularity becomes a minimum for the dynamic granularity computed from the new sysctl_sched_latency. Then use this latency to do more intelligent granularity decisions: if there are fewer tasks running then we can schedule coarser. This helps performance while still always keeping the latency target. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-08-23sched: fix broken SMT/MC optimizationsSuresh Siddha
On a four package system with HT - HT load balancing optimizations were broken. For example, if two tasks end up running on two logical threads of one of the packages, scheduler is not able to pull one of the tasks to a completely idle package. In this scenario, for nice-0 tasks, imbalance calculated by scheduler will be 512 and find_busiest_queue() will return 0 (as each cpu's load is 1024 > imbalance and has only one task running). Similarly MC scheduler optimizations also get fixed with this patch. [ mingo@elte.hu: restored fair balancing by increasing the fuzz and adding it back to the power decision, without the /2 factor. ] Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-08-23sched: sched_clock_idle_[sleep|wakeup]_event()Ingo Molnar
construct a more or less wall-clock time out of sched_clock(), by using ACPI-idle's existing knowledge about how much time we spent idling. This allows the rq clock to work around TSC-stops-in-C2, TSC-gets-corrupted-in-C3 type of problems. ( Besides the scheduler's statistics this also benefits blktrace and printk-timestamps as well. ) Furthermore, the precise before-C2/C3-sleep and after-C2/C3-wakeup callbacks allow the scheduler to get out the most of the period where the CPU has a reliable TSC. This results in slightly more precise task statistics. the ACPI bits were acked by Len. Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Len Brown <len.brown@intel.com>
2007-08-09sched: remove the 'u64 now' parameter from ->task_new()Ingo Molnar
remove the 'u64 now' parameter from ->task_new(). ( identity transformation that causes no change in functionality. ) Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-08-09sched: remove the 'u64 now' parameter from ->put_prev_task()Ingo Molnar
remove the 'u64 now' parameter from ->put_prev_task(). ( identity transformation that causes no change in functionality. ) Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-08-09sched: remove the 'u64 now' parameter from ->pick_next_task()Ingo Molnar
remove the 'u64 now' parameter from ->pick_next_task(). ( identity transformation that causes no change in functionality. ) Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-08-09sched: remove the 'u64 now' parameter from ->dequeue_task()Ingo Molnar
remove the 'u64 now' parameter from ->dequeue_task(). ( identity transformation that causes no change in functionality. ) Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-08-09sched: remove the 'u64 now' parameter from ->enqueue_task()Ingo Molnar
remove the 'u64 now' parameter from ->enqueue_task(). ( identity transformation that causes no change in functionality. ) Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-08-09sched: remove the 'u64 now' parameter from print_cfs_rq()Ingo Molnar
remove the 'u64 now' parameter from print_cfs_rq(). ( identity transformation that causes no change in functionality. ) Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-08-09sched: fix bug in balance_tasks()Peter Williams
There are two problems with balance_tasks() and how it used: 1. The variables best_prio and best_prio_seen (inherited from the old move_tasks()) were only required to handle problems caused by the active/expired arrays, the order in which they were processed and the possibility that the task with the highest priority could be on either. These issues are no longer present and the extra overhead associated with their use is unnecessary (and possibly wrong). 2. In the absence of CONFIG_FAIR_GROUP_SCHED being set, the same this_best_prio variable needs to be used by all scheduling classes or there is a risk of moving too much load. E.g. if the highest priority task on this at the beginning is a fairly low priority task and the rt class migrates a task (during its turn) then that moved task becomes the new highest priority task on this_rq but when the sched_fair class initializes its copy of this_best_prio it will get the priority of the original highest priority task as, due to the run queue locks being held, the reschedule triggered by pull_task() will not have taken place. This could result in inappropriate overriding of skip_for_load and excessive load being moved. The attached patch addresses these problems by deleting all reference to best_prio and best_prio_seen and making this_best_prio a reference parameter to the various functions involved. load_balance_fair() has also been modified so that this_best_prio is only reset (in the loop) if CONFIG_FAIR_GROUP_SCHED is set. This should preserve the effect of helping spread groups' higher priority tasks around the available CPUs while improving system performance when CONFIG_FAIR_GROUP_SCHED isn't set. Signed-off-by: Peter Williams <pwil3058@bigpond.net.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-08-09sched: simplify move_tasks()Peter Williams
The move_tasks() function is currently multiplexed with two distinct capabilities: 1. attempt to move a specified amount of weighted load from one run queue to another; and 2. attempt to move a specified number of tasks from one run queue to another. The first of these capabilities is used in two places, load_balance() and load_balance_idle(), and in both of these cases the return value of move_tasks() is used purely to decide if tasks/load were moved and no notice of the actual number of tasks moved is taken. The second capability is used in exactly one place, active_load_balance(), to attempt to move exactly one task and, as before, the return value is only used as an indicator of success or failure. This multiplexing of sched_task() was introduced, by me, as part of the smpnice patches and was motivated by the fact that the alternative, one function to move specified load and one to move a single task, would have led to two functions of roughly the same complexity as the old move_tasks() (or the new balance_tasks()). However, the new modular design of the new CFS scheduler allows a simpler solution to be adopted and this patch addresses that solution by: 1. adding a new function, move_one_task(), to be used by active_load_balance(); and 2. making move_tasks() a single purpose function that tries to move a specified weighted load and returns 1 for success and 0 for failure. One of the consequences of these changes is that neither move_one_task() or the new move_tasks() care how many tasks sched_class.load_balance() moves and this enables its interface to be simplified by returning the amount of load moved as its result and removing the load_moved pointer from the argument list. This helps simplify the new move_tasks() and slightly reduces the amount of work done in each of sched_class.load_balance()'s implementations. Further simplification, e.g. changes to balance_tasks(), are possible but (slightly) complicated by the special needs of load_balance_fair() so I've left them to a later patch (if this one gets accepted). NB Since move_tasks() gets called with two run queue locks held even small reductions in overhead are worthwhile. [ mingo@elte.hu ] this change also reduces code size nicely: text data bss dec hex filename 39216 3618 24 42858 a76a sched.o.before 39173 3618 24 42815 a73f sched.o.after Signed-off-by: Peter Williams <pwil3058@bigpond.net.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-08-02[PATCH] sched: reduce task_struct sizeIngo Molnar
more task_struct size reduction, by moving the debugging/instrumentation fields to under CONFIG_SCHEDSTATS: (i386, nodebug): size ---- pre-CFS 1328 CFS 1472 CFS+patch 1376 Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-08-02[PATCH] sched: ->task_new cleanupIngo Molnar
make sched_class.task_new == NULL a 'default method', this allows the removal of task_rt_new. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-08-02[PATCH] sched: remove cache_hot_timeIngo Molnar
remove the last unused remains of cache_hot_time. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-26[PATCH] sched: add above_background_load() functionCon Kolivas
Add an above_background_load() function which can be used by other subsystems to detect if there is anything besides niced tasks running. Place it in sched.h to allow it to be compiled out if not used. Unused for now, but it is a useful hint to the IO scheduler and to swap-prefetch. Signed-off-by: Con Kolivas <kernel@kolivas.org> Cc: Peter Williams <pwil3058@bigpond.net.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-26[PATCH] sched: arch preempt notifier mechanismAvi Kivity
This adds a general mechanism whereby a task can request the scheduler to notify it whenever it is preempted or scheduled back in. This allows the task to swap any special-purpose registers like the fpu or Intel's VT registers. Signed-off-by: Avi Kivity <avi@qumranet.com> [ mingo@elte.hu: fixes, cleanups ] Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-26[PATCH] sched: increase SCHED_LOAD_SCALE_FUZZIngo Molnar
increase SCHED_LOAD_SCALE_FUZZ that adds a small amount of over-balancing: to help distribute CPU-bound tasks more fairly on SMP systems. the problem of unfair balancing was noticed and reported by Tong N Li. 10 CPU-bound tasks running on 8 CPUs, v2.6.23-rc1: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2572 mingo 20 0 1576 244 196 R 100 0.0 1:03.61 loop 2578 mingo 20 0 1576 248 196 R 100 0.0 1:03.59 loop 2576 mingo 20 0 1576 248 196 R 100 0.0 1:03.52 loop 2571 mingo 20 0 1576 244 196 R 100 0.0 1:03.46 loop 2569 mingo 20 0 1576 244 196 R 99 0.0 1:03.36 loop 2570 mingo 20 0 1576 244 196 R 95 0.0 1:00.55 loop 2577 mingo 20 0 1576 248 196 R 50 0.0 0:31.88 loop 2574 mingo 20 0 1576 248 196 R 50 0.0 0:31.87 loop 2573 mingo 20 0 1576 248 196 R 50 0.0 0:31.86 loop 2575 mingo 20 0 1576 248 196 R 50 0.0 0:31.86 loop v2.6.23-rc1 + patch: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2681 mingo 20 0 1576 244 196 R 85 0.0 3:51.68 loop 2688 mingo 20 0 1576 244 196 R 81 0.0 3:46.35 loop 2682 mingo 20 0 1576 244 196 R 80 0.0 3:43.68 loop 2685 mingo 20 0 1576 248 196 R 80 0.0 3:45.97 loop 2683 mingo 20 0 1576 248 196 R 80 0.0 3:40.25 loop 2679 mingo 20 0 1576 244 196 R 80 0.0 3:33.53 loop 2680 mingo 20 0 1576 244 196 R 79 0.0 3:43.53 loop 2686 mingo 20 0 1576 244 196 R 79 0.0 3:39.31 loop 2687 mingo 20 0 1576 244 196 R 78 0.0 3:33.31 loop 2684 mingo 20 0 1576 244 196 R 77 0.0 3:27.52 loop so they now nicely converge to the expected 80% long-term CPU usage. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-19[PATCH] sched: implement cpu_clock(cpu) high-speed time sourceIngo Molnar
Implement the cpu_clock(cpu) interface for kernel-internal use: high-speed (but slightly incorrect) per-cpu clock constructed from sched_clock(). This API, unused at the moment, will be used in the future by blktrace, by the softlockup-watchdog, by printk and by lockstat. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-19coredump masking: add an interface for core dump filterKawai, Hidehiro
This patch adds an interface to set/reset flags which determines each memory segment should be dumped or not when a core file is generated. /proc/<pid>/coredump_filter file is provided to access the flags. You can change the flag status for a particular process by writing to or reading from the file. The flag status is inherited to the child process when it is created. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19coredump masking: reimplementation of dumpable using two flagsKawai, Hidehiro
This patch changes mm_struct.dumpable to a pair of bit flags. set_dumpable() converts three-value dumpable to two flags and stores it into lower two bits of mm_struct.flags instead of mm_struct.dumpable. get_dumpable() behaves in the opposite way. [akpm@linux-foundation.org: export set_dumpable] Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16user namespace: add unshareSerge E. Hallyn
This patch enables the unshare of user namespaces. It adds a new clone flag CLONE_NEWUSER and implements copy_user_ns() which resets the current user_struct and adds a new root user (uid == 0) For now, unsharing the user namespace allows a process to reset its user_struct accounting and uid 0 in the new user namespace should be contained using appropriate means, for instance selinux The plan, when the full support is complete (all uid checks covered), is to keep the original user's rights in the original namespace, and let a process become uid 0 in the new namespace, with full capabilities to the new namespace. Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Acked-by: Pavel Emelianov <xemul@openvz.org> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: James Morris <jmorris@namei.org> Cc: Andrew Morgan <agm@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16user namespace: add the frameworkCedric Le Goater
Basically, it will allow a process to unshare its user_struct table, resetting at the same time its own user_struct and all the associated accounting. A new root user (uid == 0) is added to the user namespace upon creation. Such root users have full privileges and it seems that theses privileges should be controlled through some means (process capabilities ?) The unshare is not included in this patch. Changes since [try #4]: - Updated get_user_ns and put_user_ns to accept NULL, and get_user_ns to return the namespace. Changes since [try #3]: - moved struct user_namespace to files user_namespace.{c,h} Changes since [try #2]: - removed struct user_namespace* argument from find_user() Changes since [try #1]: - removed struct user_namespace* argument from find_user() - added a root_user per user namespace Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Acked-by: Pavel Emelianov <xemul@openvz.org> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: James Morris <jmorris@namei.org> Cc: Andrew Morgan <agm@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16Audit: add TTY input auditingMiloslav Trmac
Add TTY input auditing, used to audit system administrator's actions. This is required by various security standards such as DCID 6/3 and PCI to provide non-repudiation of administrator's actions and to allow a review of past actions if the administrator seems to overstep their duties or if the system becomes misconfigured for unknown reasons. These requirements do not make it necessary to audit TTY output as well. Compared to an user-space keylogger, this approach records TTY input using the audit subsystem, correlated with other audit events, and it is completely transparent to the user-space application (e.g. the console ioctls still work). TTY input auditing works on a higher level than auditing all system calls within the session, which would produce an overwhelming amount of mostly useless audit events. Add an "audit_tty" attribute, inherited across fork (). Data read from TTYs by process with the attribute is sent to the audit subsystem by the kernel. The audit netlink interface is extended to allow modifying the audit_tty attribute, and to allow sending explanatory audit events from user-space (for example, a shell might send an event containing the final command, after the interactive command-line editing and history expansion is performed, which might be difficult to decipher from the TTY input alone). Because the "audit_tty" attribute is inherited across fork (), it would be set e.g. for sshd restarted within an audited session. To prevent this, the audit_tty attribute is cleared when a process with no open TTY file descriptors (e.g. after daemon startup) opens a TTY. See https://www.redhat.com/archives/linux-audit/2007-June/msg00000.html for a more detailed rationale document for an older version of this patch. [akpm@linux-foundation.org: build fix] Signed-off-by: Miloslav Trmac <mitr@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Paul Fulghum <paulkf@microgate.com> Cc: Casey Schaufler <casey@schaufler-ca.com> Cc: Steve Grubb <sgrubb@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16Use boot based time for process start time and boot time in /procTomas Janousek
Commit 411187fb05cd11676b0979d9fbf3291db69dbce2 caused boot time to move and process start times to become invalid after suspend. Using boot based time for those restores the old behaviour and fixes the issue. [akpm@linux-foundation.org: little cleanup] Signed-off-by: Tomas Janousek <tjanouse@redhat.com> Cc: Tomas Smetana <tsmetana@redhat.com> Acked-by: John Stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-09sched: micro-optimize mmdrop()Ingo Molnar
micro-optimize mmdrop(). Improves schedule()'s assembly a bit. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09sched: scheduler debugging, coreIngo Molnar
scheduler debugging core: implement /proc/sched_debug and /proc/<PID>/sched files for scheduler debugging. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09sched: remove old cpu accounting fieldIngo Molnar
remove the old cpu-accounting field from signal_struct, now that the code is using CFS's stats. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09sched: remove batch_task()Ingo Molnar
batch_task() in sched.h is now unused - remove it. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09sched: remove interactivity types from sched.hIngo Molnar
remove now-unused types/fields used by the old scheduler. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09sched: clean up fastcall uses of sched_fork()/sched_exit()Ingo Molnar
sched_fork()/sched_exit() does not need to specify fastcall anymore, as the x86 kernel defaults to regparm3, and no assembly code calls these functions. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09sched: update delay-accounting to use CFS's precise statsBalbir Singh
update delay-accounting to use CFS's precise stats. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09sched: x86, track TSC-unstable eventsIngo Molnar
track TSC-unstable events and propagate it to the scheduler code. Also allow sched_clock() to be used when the TSC is unstable, the rq_clock() wrapper creates a reliable clock out of it. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09sched: remove sleep_typeIngo Molnar
remove the sleep_type heuristics from the core scheduler - scheduling policy is implemented in the scheduling-policy modules. (and CFS does not use this type of sleep-type heuristics) Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09sched: clean up the rt priority macrosIngo Molnar
clean up the rt priority macros, pointed out by Andrew Morton. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09sched: make posix-cpu-timers use CFS's accounting informationIngo Molnar
update the posix-cpu-timers code to use CFS's CPU accounting information. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09sched: cfs, core data typesIngo Molnar
add the CFS data types to sched.h. (the old scheduler is still fully intact.) Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09sched: cfs core, kernel/sched_fair.cIngo Molnar
add kernel/sched_fair.c - which implements the bulk of CFS's behavioral changes for SCHED_OTHER tasks. see Documentation/sched-design-CFS.txt about details. Authors: Ingo Molnar <mingo@elte.hu> Dmitry Adamushko <dmitry.adamushko@gmail.com> Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
2007-07-09sched: increase the resolution of smpniceIngo Molnar
increase SMP-nice's resolution. This is needed by CFS to implement SCHED_IDLE and cleaned up nice level support. no behavioral changes. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09sched: add init_idle_bootup_task()Ingo Molnar
add the init_idle_bootup_task() callback to the bootup thread, unused at the moment. (CFS will use it to switch the scheduling class of the boot thread to the idle class) Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09sched: uninline set_task_cpu()Ingo Molnar
uninline set_task_cpu(): CFS will add more code to it. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09sched: zap the migration init / cache-hot balancing codeIngo Molnar
the SMP load-balancer uses the boot-time migration-cost estimation code to attempt to improve the quality of balancing. The reason for this code is that the discrete priority queues do not preserve the order of scheduling accurately, so the load-balancer skips tasks that were running on a CPU 'recently'. this code is fundamental fragile: the boot-time migration cost detector doesnt really work on systems that had large L3 caches, it caused boot delays on large systems and the whole cache-hot concept made the balancing code pretty undeterministic as well. (and hey, i wrote most of it, so i can say it out loud that it sucks ;-) under CFS the same purpose of cache affinity can be achieved without any special cache-hot special-case: tasks are sorted in the 'timeline' tree and the SMP balancer picks tasks from the left side of the tree, thus the most cache-cold task is balanced automatically. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09sched: add SCHED_IDLE policyIngo Molnar
this patch adds the SCHED_IDLE policy to sched.h. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09sched: rename idle_type/SCHED_IDLEIngo Molnar
enum idle_type (used by the load-balancer) clashes with the SCHED_IDLE name that we want to introduce. 'CPU_IDLE' instead of 'SCHED_IDLE' is more descriptive as well. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-06-08pi-futex: fix exit races and locking problemsAlexey Kuznetsov
1. New entries can be added to tsk->pi_state_list after task completed exit_pi_state_list(). The result is memory leakage and deadlocks. 2. handle_mm_fault() is called under spinlock. The result is obvious. 3. results in self-inflicted deadlock inside glibc. Sometimes futex_lock_pi returns -ESRCH, when it is not expected and glibc enters to for(;;) sleep() to simulate deadlock. This problem is quite obvious and I think the patch is right. Though it looks like each "if" in futex_lock_pi() got some stupid special case "else if". :-) 4. sometimes futex_lock_pi() returns -EDEADLK, when nobody has the lock. The reason is also obvious (see comment in the patch), but correct fix is far beyond my comprehension. I guess someone already saw this, the chunk: if (rt_mutex_trylock(&q.pi_state->pi_mutex)) ret = 0; is obviously from the same opera. But it does not work, because the rtmutex is really taken at this point: wake_futex_pi() of previous owner reassigned it to us. My fix works. But it looks very stupid. I would think about removal of shift of ownership in wake_futex_pi() and making all the work in context of process taking lock. From: Thomas Gleixner <tglx@linutronix.de> Fix 1) Avoid the tasklist lock variant of the exit race fix by adding an additional state transition to the exit code. This fixes also the issue, when a task with recursive segfaults is not able to release the futexes. Fix 2) Cleanup the lookup_pi_state() failure path and solve the -ESRCH problem finally. Fix 3) Solve the fixup_pi_state_owner() problem which needs to do the fixup in the lock protected section by using the in_atomic userspace access functions. This removes also the ugly lock drop / unqueue inside of fixup_pi_state() Fix 4) Fix a stale lock in the error path of futex_wake_pi() Added some error checks for verification. The -EDEADLK problem is solved by the rtmutex fixups. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ulrich Drepper <drepper@redhat.com> Cc: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23recalc_sigpending_tsk fixesRoland McGrath
Steve Hawkes discovered a problem where recalc_sigpending_tsk was called in do_sigaction but no signal_wake_up call was made, preventing later signals from waking up blocked threads with TIF_SIGPENDING already set. In fact, the few other calls to recalc_sigpending_tsk outside the signals code are also subject to this problem in other race conditions. This change makes recalc_sigpending_tsk private to the signals code. It changes the outside calls, as well as do_sigaction, to use the new recalc_sigpending_and_wake instead. Signed-off-by: Roland McGrath <roland@redhat.com> Cc: <Steve.Hawkes@motorola.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23freezer: fix vfork problemRafael J. Wysocki
Currently try_to_freeze_tasks() has to wait until all of the vforked processes exit and for this reason every user can make it fail. To fix this problem we can introduce the additional process flag PF_FREEZER_SKIP to be used by tasks that do not want to be counted as freezable by the freezer and want to have TIF_FREEZE set nevertheless. Then, this flag can be set by tasks using sys_vfork() before they call wait_for_completion(&vfork) and cleared after they have woken up. After clearing it, the tasks should call try_to_freeze() as soon as possible. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-11Merge branch 'for-linus' of git://git.kernel.dk/data/git/linux-2.6-blockLinus Torvalds
* 'for-linus' of git://git.kernel.dk/data/git/linux-2.6-block: Fix compile/link of init/do_mounts.c with !CONFIG_BLOCK When stacked block devices are in-use (e.g. md or dm), the recursive calls
2007-05-11signal/timer/event: signalfd coreDavide Libenzi
This patch series implements the new signalfd() system call. I took part of the original Linus code (and you know how badly it can be broken :), and I added even more breakage ;) Signals are fetched from the same signal queue used by the process, so signalfd will compete with standard kernel delivery in dequeue_signal(). If you want to reliably fetch signals on the signalfd file, you need to block them with sigprocmask(SIG_BLOCK). This seems to be working fine on my Dual Opteron machine. I made a quick test program for it: http://www.xmailserver.org/signafd-test.c The signalfd() system call implements signal delivery into a file descriptor receiver. The signalfd file descriptor if created with the following API: int signalfd(int ufd, const sigset_t *mask, size_t masksize); The "ufd" parameter allows to change an existing signalfd sigmask, w/out going to close/create cycle (Linus idea). Use "ufd" == -1 if you want a brand new signalfd file. The "mask" allows to specify the signal mask of signals that we are interested in. The "masksize" parameter is the size of "mask". The signalfd fd supports the poll(2) and read(2) system calls. The poll(2) will return POLLIN when signals are available to be dequeued. As a direct consequence of supporting the Linux poll subsystem, the signalfd fd can use used together with epoll(2) too. The read(2) system call will return a "struct signalfd_siginfo" structure in the userspace supplied buffer. The return value is the number of bytes copied in the supplied buffer, or -1 in case of error. The read(2) call can also return 0, in case the sighand structure to which the signalfd was attached, has been orphaned. The O_NONBLOCK flag is also supported, and read(2) will return -EAGAIN in case no signal is available. If the size of the buffer passed to read(2) is lower than sizeof(struct signalfd_siginfo), -EINVAL is returned. A read from the signalfd can also return -ERESTARTSYS in case a signal hits the process. The format of the struct signalfd_siginfo is, and the valid fields depends of the (->code & __SI_MASK) value, in the same way a struct siginfo would: struct signalfd_siginfo { __u32 signo; /* si_signo */ __s32 err; /* si_errno */ __s32 code; /* si_code */ __u32 pid; /* si_pid */ __u32 uid; /* si_uid */ __s32 fd; /* si_fd */ __u32 tid; /* si_fd */ __u32 band; /* si_band */ __u32 overrun; /* si_overrun */ __u32 trapno; /* si_trapno */ __s32 status; /* si_status */ __s32 svint; /* si_int */ __u64 svptr; /* si_ptr */ __u64 utime; /* si_utime */ __u64 stime; /* si_stime */ __u64 addr; /* si_addr */ }; [akpm@linux-foundation.org: fix signalfd_copyinfo() on i386] Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>