From 97e1c18e8d17bd87e1e383b2e9d9fc740332c8e2 Mon Sep 17 00:00:00 2001 From: Mathieu Desnoyers Date: Fri, 18 Jul 2008 12:16:16 -0400 Subject: tracing: Kernel Tracepoints Implementation of kernel tracepoints. Inspired from the Linux Kernel Markers. Allows complete typing verification by declaring both tracing statement inline functions and probe registration/unregistration static inline functions within the same macro "DEFINE_TRACE". No format string is required. See the tracepoint Documentation and Samples patches for usage examples. Taken from the documentation patch : "A tracepoint placed in code provides a hook to call a function (probe) that you can provide at runtime. A tracepoint can be "on" (a probe is connected to it) or "off" (no probe is attached). When a tracepoint is "off" it has no effect, except for adding a tiny time penalty (checking a condition for a branch) and space penalty (adding a few bytes for the function call at the end of the instrumented function and adds a data structure in a separate section). When a tracepoint is "on", the function you provide is called each time the tracepoint is executed, in the execution context of the caller. When the function provided ends its execution, it returns to the caller (continuing from the tracepoint site). You can put tracepoints at important locations in the code. They are lightweight hooks that can pass an arbitrary number of parameters, which prototypes are described in a tracepoint declaration placed in a header file." Addition and removal of tracepoints is synchronized by RCU using the scheduler (and preempt_disable) as guarantees to find a quiescent state (this is really RCU "classic"). The update side uses rcu_barrier_sched() with call_rcu_sched() and the read/execute side uses "preempt_disable()/preempt_enable()". We make sure the previous array containing probes, which has been scheduled for deletion by the rcu callback, is indeed freed before we proceed to the next update. It therefore limits the rate of modification of a single tracepoint to one update per RCU period. The objective here is to permit fast batch add/removal of probes on _different_ tracepoints. Changelog : - Use #name ":" #proto as string to identify the tracepoint in the tracepoint table. This will make sure not type mismatch happens due to connexion of a probe with the wrong type to a tracepoint declared with the same name in a different header. - Add tracepoint_entry_free_old. - Change __TO_TRACE to get rid of the 'i' iterator. Masami Hiramatsu : Tested on x86-64. Performance impact of a tracepoint : same as markers, except that it adds about 70 bytes of instructions in an unlikely branch of each instrumented function (the for loop, the stack setup and the function call). It currently adds a memory read, a test and a conditional branch at the instrumentation site (in the hot path). Immediate values will eventually change this into a load immediate, test and branch, which removes the memory read which will make the i-cache impact smaller (changing the memory read for a load immediate removes 3-4 bytes per site on x86_32 (depending on mov prefixes), or 7-8 bytes on x86_64, it also saves the d-cache hit). About the performance impact of tracepoints (which is comparable to markers), even without immediate values optimizations, tests done by Hideo Aoki on ia64 show no regression. His test case was using hackbench on a kernel where scheduler instrumentation (about 5 events in code scheduler code) was added. Quoting Hideo Aoki about Markers : I evaluated overhead of kernel marker using linux-2.6-sched-fixes git tree, which includes several markers for LTTng, using an ia64 server. While the immediate trace mark feature isn't implemented on ia64, there is no major performance regression. So, I think that we don't have any issues to propose merging marker point patches into Linus's tree from the viewpoint of performance impact. I prepared two kernels to evaluate. The first one was compiled without CONFIG_MARKERS. The second one was enabled CONFIG_MARKERS. I downloaded the original hackbench from the following URL: http://devresources.linux-foundation.org/craiger/hackbench/src/hackbench.c I ran hackbench 5 times in each condition and calculated the average and difference between the kernels. The parameter of hackbench: every 50 from 50 to 800 The number of CPUs of the server: 2, 4, and 8 Below is the results. As you can see, major performance regression wasn't found in any case. Even if number of processes increases, differences between marker-enabled kernel and marker- disabled kernel doesn't increase. Moreover, if number of CPUs increases, the differences doesn't increase either. Curiously, marker-enabled kernel is better than marker-disabled kernel in more than half cases, although I guess it comes from the difference of memory access pattern. * 2 CPUs Number of | without | with | diff | diff | processes | Marker [Sec] | Marker [Sec] | [Sec] | [%] | -------------------------------------------------------------- 50 | 4.811 | 4.872 | +0.061 | +1.27 | 100 | 9.854 | 10.309 | +0.454 | +4.61 | 150 | 15.602 | 15.040 | -0.562 | -3.6 | 200 | 20.489 | 20.380 | -0.109 | -0.53 | 250 | 25.798 | 25.652 | -0.146 | -0.56 | 300 | 31.260 | 30.797 | -0.463 | -1.48 | 350 | 36.121 | 35.770 | -0.351 | -0.97 | 400 | 42.288 | 42.102 | -0.186 | -0.44 | 450 | 47.778 | 47.253 | -0.526 | -1.1 | 500 | 51.953 | 52.278 | +0.325 | +0.63 | 550 | 58.401 | 57.700 | -0.701 | -1.2 | 600 | 63.334 | 63.222 | -0.112 | -0.18 | 650 | 68.816 | 68.511 | -0.306 | -0.44 | 700 | 74.667 | 74.088 | -0.579 | -0.78 | 750 | 78.612 | 79.582 | +0.970 | +1.23 | 800 | 85.431 | 85.263 | -0.168 | -0.2 | -------------------------------------------------------------- * 4 CPUs Number of | without | with | diff | diff | processes | Marker [Sec] | Marker [Sec] | [Sec] | [%] | -------------------------------------------------------------- 50 | 2.586 | 2.584 | -0.003 | -0.1 | 100 | 5.254 | 5.283 | +0.030 | +0.56 | 150 | 8.012 | 8.074 | +0.061 | +0.76 | 200 | 11.172 | 11.000 | -0.172 | -1.54 | 250 | 13.917 | 14.036 | +0.119 | +0.86 | 300 | 16.905 | 16.543 | -0.362 | -2.14 | 350 | 19.901 | 20.036 | +0.135 | +0.68 | 400 | 22.908 | 23.094 | +0.186 | +0.81 | 450 | 26.273 | 26.101 | -0.172 | -0.66 | 500 | 29.554 | 29.092 | -0.461 | -1.56 | 550 | 32.377 | 32.274 | -0.103 | -0.32 | 600 | 35.855 | 35.322 | -0.533 | -1.49 | 650 | 39.192 | 38.388 | -0.804 | -2.05 | 700 | 41.744 | 41.719 | -0.025 | -0.06 | 750 | 45.016 | 44.496 | -0.520 | -1.16 | 800 | 48.212 | 47.603 | -0.609 | -1.26 | -------------------------------------------------------------- * 8 CPUs Number of | without | with | diff | diff | processes | Marker [Sec] | Marker [Sec] | [Sec] | [%] | -------------------------------------------------------------- 50 | 2.094 | 2.072 | -0.022 | -1.07 | 100 | 4.162 | 4.273 | +0.111 | +2.66 | 150 | 6.485 | 6.540 | +0.055 | +0.84 | 200 | 8.556 | 8.478 | -0.078 | -0.91 | 250 | 10.458 | 10.258 | -0.200 | -1.91 | 300 | 12.425 | 12.750 | +0.325 | +2.62 | 350 | 14.807 | 14.839 | +0.032 | +0.22 | 400 | 16.801 | 16.959 | +0.158 | +0.94 | 450 | 19.478 | 19.009 | -0.470 | -2.41 | 500 | 21.296 | 21.504 | +0.208 | +0.98 | 550 | 23.842 | 23.979 | +0.137 | +0.57 | 600 | 26.309 | 26.111 | -0.198 | -0.75 | 650 | 28.705 | 28.446 | -0.259 | -0.9 | 700 | 31.233 | 31.394 | +0.161 | +0.52 | 750 | 34.064 | 33.720 | -0.344 | -1.01 | 800 | 36.320 | 36.114 | -0.206 | -0.57 | -------------------------------------------------------------- Signed-off-by: Mathieu Desnoyers Acked-by: Masami Hiramatsu Acked-by: 'Peter Zijlstra' Signed-off-by: Ingo Molnar --- init/Kconfig | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'init') diff --git a/init/Kconfig b/init/Kconfig index c11da38837e..70082678a91 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -771,6 +771,13 @@ config PROFILING Say Y here to enable the extended profiling support mechanisms used by profilers such as OProfile. +config TRACEPOINTS + bool "Activate tracepoints" + default y + help + Place an empty function call at each tracepoint site. Can be + dynamically changed for a probe function. + config MARKERS bool "Activate markers" help -- cgit v1.2.3 From fa340d9c050e78fb21a142b617304214ae5e0c2d Mon Sep 17 00:00:00 2001 From: Ingo Molnar Date: Wed, 23 Jul 2008 13:38:00 +0200 Subject: tracing: disable tracepoints by default while it's arguably low overhead, we dont enable new features by default. Signed-off-by: Ingo Molnar --- init/Kconfig | 1 - 1 file changed, 1 deletion(-) (limited to 'init') diff --git a/init/Kconfig b/init/Kconfig index 70082678a91..d5994490b0b 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -773,7 +773,6 @@ config PROFILING config TRACEPOINTS bool "Activate tracepoints" - default y help Place an empty function call at each tracepoint site. Can be dynamically changed for a probe function. -- cgit v1.2.3 From 5f87f1121895dc09d2d1c1db5f14af6aa4ce3e94 Mon Sep 17 00:00:00 2001 From: Ingo Molnar Date: Wed, 23 Jul 2008 14:15:22 +0200 Subject: tracing: clean up tracepoints kconfig structure do not expose users to CONFIG_TRACEPOINTS - tracers can select it just fine. update ftrace to select CONFIG_TRACEPOINTS. Signed-off-by: Ingo Molnar --- init/Kconfig | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) (limited to 'init') diff --git a/init/Kconfig b/init/Kconfig index d5994490b0b..031344f954f 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -771,11 +771,12 @@ config PROFILING Say Y here to enable the extended profiling support mechanisms used by profilers such as OProfile. +# +# Place an empty function call at each tracepoint site. Can be +# dynamically changed for a probe function. +# config TRACEPOINTS - bool "Activate tracepoints" - help - Place an empty function call at each tracepoint site. Can be - dynamically changed for a probe function. + bool config MARKERS bool "Activate markers" -- cgit v1.2.3 From 68bf21aa15c85d2e9b623dcda2b1ed8893275fa1 Mon Sep 17 00:00:00 2001 From: Steven Rostedt Date: Thu, 14 Aug 2008 15:45:08 -0400 Subject: ftrace: mcount call site on boot nops core This is the infrastructure to the converting the mcount call sites recorded by the __mcount_loc section into nops on boot. It also allows for using these sites to enable tracing as normal. When the __mcount_loc section is used, the "ftraced" kernel thread is disabled. This uses the current infrastructure to record the mcount call sites as well as convert them to nops. The mcount function is kept as a stub on boot up and not converted to the ftrace_record_ip function. We use the ftrace_record_ip to only record from the table. This patch does not handle modules. That comes with a later patch. Signed-off-by: Steven Rostedt Signed-off-by: Ingo Molnar --- init/main.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'init') diff --git a/init/main.c b/init/main.c index 3820323c4c8..ded1fae965a 100644 --- a/init/main.c +++ b/init/main.c @@ -60,6 +60,7 @@ #include #include #include +#include #include #include @@ -687,6 +688,8 @@ asmlinkage void __init start_kernel(void) acpi_early_init(); /* before LAPIC and SMP init */ + ftrace_init(); + /* Do the rest non-__init'ed, we're now alive */ rest_init(); } -- cgit v1.2.3 From aa5d9151f745b6ee6a236a1f109118034277eb92 Mon Sep 17 00:00:00 2001 From: Arjan van de Ven Date: Sat, 13 Sep 2008 09:36:06 -0700 Subject: tracing/fastboot: add a script to visualize the kernel boot process / time When optimizing the kernel boot time, it's very valuable to visualize what is going on at which time. In addition, with the fastboot asynchronous initcall level, it's very valuable to see which initcall gets run where and when. This patch adds a script to turn a dmesg into a SVG graph (that can be shown with tools such as InkScape, Gimp or Firefox) and a small change to the initcall code to print the PID of the thread calling the initcall (so that the script can work out the parallelism). Signed-off-by: Arjan van de Ven --- init/main.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'init') diff --git a/init/main.c b/init/main.c index ded1fae965a..16abba05c82 100644 --- a/init/main.c +++ b/init/main.c @@ -711,7 +711,8 @@ int do_one_initcall(initcall_t fn) int result; if (initcall_debug) { - printk("calling %pF\n", fn); + printk("calling %pF", fn); + printk(" @ %i\n", task_pid_nr(current)); t0 = ktime_get(); } -- cgit v1.2.3 From 3bf77af6e1fef1124bf71d81f9f84885f0ee0dea Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Fr=C3=A9d=C3=A9ric=20Weisbecker?= Date: Tue, 23 Sep 2008 11:38:18 +0100 Subject: tracing/ftrace: launch boot tracing after pre-smp initcalls Launch the boot tracing inside the initcall_debug area. Old printk have not been removed to keep the old way of initcall tracing for backward compatibility. [ mingo@elte.hu: resolved conflicts ] Signed-off-by: Frederic Weisbecker Signed-off-by: Ingo Molnar --- init/main.c | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) (limited to 'init') diff --git a/init/main.c b/init/main.c index 16abba05c82..1e39a1eab19 100644 --- a/init/main.c +++ b/init/main.c @@ -709,10 +709,12 @@ int do_one_initcall(initcall_t fn) ktime_t t0, t1, delta; char msgbuf[64]; int result; + struct boot_trace it; if (initcall_debug) { - printk("calling %pF", fn); - printk(" @ %i\n", task_pid_nr(current)); + it.caller = task_pid_nr(current); + it.func = fn; + printk("calling %pF @ %i\n", fn, it.caller); t0 = ktime_get(); } @@ -721,10 +723,11 @@ int do_one_initcall(initcall_t fn) if (initcall_debug) { t1 = ktime_get(); delta = ktime_sub(t1, t0); - - printk("initcall %pF returned %d after %Ld msecs\n", - fn, result, - (unsigned long long) delta.tv64 >> 20); + it.result = result; + it.duration = (unsigned long long) delta.tv64 >> 20; + printk("initcall %pF returned %d after %Ld msecs\n", fn, + result, it.duration); + trace_boot(&it); } msgbuf[0] = 0; @@ -859,6 +862,7 @@ static int __init kernel_init(void * unused) smp_prepare_cpus(setup_max_cpus); do_pre_smp_initcalls(); + start_boot_trace(); smp_init(); sched_init_smp(); -- cgit v1.2.3 From cb5ab74204a6e2579d1119bf1348eb806526b12b Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Thu, 2 Oct 2008 12:59:20 +0200 Subject: tracing/fastboot: change the printing of boot tracer according to bootgraph.pl Change the boot tracer printing to make it parsable for the scripts/bootgraph.pl script. We have now to output two lines for each initcall, according to the printk in do_one_initcall() in init/main.c We need now the call's time and the return's time. Signed-off-by: Frederic Weisbecker Signed-off-by: Ingo Molnar --- init/main.c | 20 +++++++++----------- 1 file changed, 9 insertions(+), 11 deletions(-) (limited to 'init') diff --git a/init/main.c b/init/main.c index 1e39a1eab19..61eb6615939 100644 --- a/init/main.c +++ b/init/main.c @@ -706,34 +706,32 @@ __setup("initcall_debug", initcall_debug_setup); int do_one_initcall(initcall_t fn) { int count = preempt_count(); - ktime_t t0, t1, delta; + ktime_t delta; char msgbuf[64]; - int result; struct boot_trace it; if (initcall_debug) { it.caller = task_pid_nr(current); it.func = fn; printk("calling %pF @ %i\n", fn, it.caller); - t0 = ktime_get(); + it.calltime = ktime_get(); } - result = fn(); + it.result = fn(); if (initcall_debug) { - t1 = ktime_get(); - delta = ktime_sub(t1, t0); - it.result = result; + it.rettime = ktime_get(); + delta = ktime_sub(it.rettime, it.calltime); it.duration = (unsigned long long) delta.tv64 >> 20; printk("initcall %pF returned %d after %Ld msecs\n", fn, - result, it.duration); + it.result, it.duration); trace_boot(&it); } msgbuf[0] = 0; - if (result && result != -ENODEV && initcall_debug) - sprintf(msgbuf, "error code %d ", result); + if (it.result && it.result != -ENODEV && initcall_debug) + sprintf(msgbuf, "error code %d ", it.result); if (preempt_count() != count) { strlcat(msgbuf, "preemption imbalance ", sizeof(msgbuf)); @@ -747,7 +745,7 @@ int do_one_initcall(initcall_t fn) printk("initcall %pF returned with %s\n", fn, msgbuf); } - return result; + return it.result; } -- cgit v1.2.3 From 5601020feb0c3010e9e3e0131e9697ac6a06777b Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Thu, 2 Oct 2008 13:26:05 +0200 Subject: tracing/fastboot: get the initcall name before it disappears After some initcall traces, some initcall names may be inconsistent. That's because these functions will disappear from the .init section and also their name from the symbols table. So we have to copy the name of the function in a buffer large enough during the trace appending. It is not costly for the ring_buffer because the number of initcall entries is commonly not really large. Signed-off-by: Frederic Weisbecker Signed-off-by: Ingo Molnar --- init/main.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'init') diff --git a/init/main.c b/init/main.c index 61eb6615939..8e96a0ef17f 100644 --- a/init/main.c +++ b/init/main.c @@ -712,7 +712,6 @@ int do_one_initcall(initcall_t fn) if (initcall_debug) { it.caller = task_pid_nr(current); - it.func = fn; printk("calling %pF @ %i\n", fn, it.caller); it.calltime = ktime_get(); } @@ -725,7 +724,7 @@ int do_one_initcall(initcall_t fn) it.duration = (unsigned long long) delta.tv64 >> 20; printk("initcall %pF returned %d after %Ld msecs\n", fn, it.result, it.duration); - trace_boot(&it); + trace_boot(&it, fn); } msgbuf[0] = 0; -- cgit v1.2.3 From 097d036a2f25eecc42435c57e010aaf4a2eed2d9 Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Fri, 3 Oct 2008 15:39:21 +0200 Subject: tracing/fastboot: only trace non-module initcalls At this time, only built-in initcalls interest us. We can't really produce a relevant graph if we include the modules initcall too. I had good results after this patch (see svg in attachment). Signed-off-by: Frederic Weisbecker Signed-off-by: Ingo Molnar --- init/main.c | 1 + 1 file changed, 1 insertion(+) (limited to 'init') diff --git a/init/main.c b/init/main.c index 8e96a0ef17f..e7939de80f3 100644 --- a/init/main.c +++ b/init/main.c @@ -886,6 +886,7 @@ static int __init kernel_init(void * unused) * we're essentially up and running. Get rid of the * initmem segments and start the user-mode stuff.. */ + stop_boot_trace(); init_post(); return 0; } -- cgit v1.2.3 From ca538f6bbe583406f941f3041d40c41f9a13d1de Mon Sep 17 00:00:00 2001 From: Tim Bird Date: Thu, 9 Oct 2008 15:23:05 -0700 Subject: tracing/fastboot: add better resolution to initcall debug/tracing Change the time resolution for initcall_debug to microseconds, from milliseconds. This is handy to determine which initcalls you want to work on for faster booting. One one of my test machines, over 90% of the initcalls are less than a millisecond and (without this patch) these are all reported as 0 msecs. Working on the 900 us ones is more important than the 4 us ones. With 'quiet' on the kernel command line, this adds no significant overhead to kernel boot time. Signed-off-by: Tim Bird Signed-off-by: Andrew Morton Signed-off-by: Ingo Molnar --- init/main.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'init') diff --git a/init/main.c b/init/main.c index e7939de80f3..b2e7ff4a534 100644 --- a/init/main.c +++ b/init/main.c @@ -721,8 +721,8 @@ int do_one_initcall(initcall_t fn) if (initcall_debug) { it.rettime = ktime_get(); delta = ktime_sub(it.rettime, it.calltime); - it.duration = (unsigned long long) delta.tv64 >> 20; - printk("initcall %pF returned %d after %Ld msecs\n", fn, + it.duration = (unsigned long long) delta.tv64 >> 10; + printk("initcall %pF returned %d after %Ld usecs\n", fn, it.result, it.duration); trace_boot(&it, fn); } -- cgit v1.2.3 From 93fd85d005eae2d1106aabd581adb6f20e335c83 Mon Sep 17 00:00:00 2001 From: Geert Uytterhoeven Date: Wed, 15 Oct 2008 22:01:32 -0700 Subject: identify_ramdisk_image(): correct typo about return value in comment identify_ramdisk_image() returns 0 (not -1) if a gzipped ramdisk is found: if (buf[0] == 037 && ((buf[1] == 0213) || (buf[1] == 0236))) { printk(KERN_NOTICE "RAMDISK: Compressed image found at block %d\n", start_block); nblocks = 0; ^^^^^^^^^^^ goto done; } ... done: sys_lseek(fd, start_block * BLOCK_SIZE, 0); kfree(buf); return nblocks; ^^^^^^^^^^^^^^ Hence correct the typo in the comment, which has existed since the addition of compressed ramdisk support in 1.3.48. Signed-off-by: Geert Uytterhoeven Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- init/do_mounts_rd.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'init') diff --git a/init/do_mounts_rd.c b/init/do_mounts_rd.c index fedef93b586..a7c748fa977 100644 --- a/init/do_mounts_rd.c +++ b/init/do_mounts_rd.c @@ -71,7 +71,7 @@ identify_ramdisk_image(int fd, int start_block) sys_read(fd, buf, size); /* - * If it matches the gzip magic numbers, return -1 + * If it matches the gzip magic numbers, return 0 */ if (buf[0] == 037 && ((buf[1] == 0213) || (buf[1] == 0236))) { printk(KERN_NOTICE -- cgit v1.2.3 From 889d51a10712b6fd6175196626de2116858394f4 Mon Sep 17 00:00:00 2001 From: Nye Liu Date: Wed, 15 Oct 2008 22:01:40 -0700 Subject: initramfs: add option to preserve mtime from initramfs cpio images When unpacking the cpio into the initramfs, mtimes are not preserved by default. This patch adds an INITRAMFS_PRESERVE_MTIME option that allows mtimes stored in the cpio image to be used when constructing the initramfs. For embedded applications that run exclusively out of the initramfs, this is invaluable: When building embedded application initramfs images, its nice to know when the files were actually created during the build process - that makes it easier to see what files were modified when so we can compare the files that are being used on the image with the files used during the build process. This might help (for example) to determine if the target system has all the updated files you expect to see w/o having to check MD5s etc. In our environment, the whole system runs off the initramfs partition, and seeing the modified times of the shared libraries (for example) helps us find bugs that may have been introduced by the build system incorrectly propogating outdated shared libraries into the image. Similarly, many of the initializion/configuration files in /etc might be dynamically built by the build system, and knowing when they were modified helps us sanity check whether the target system has the "latest" files etc. Finally, we might use last modified times to determine whether a hot fix should be applied or not to the running ramfs. Signed-off-by: Nye Liu Cc: Al Viro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- init/initramfs.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 53 insertions(+) (limited to 'init') diff --git a/init/initramfs.c b/init/initramfs.c index 644fc01ad5f..4f5ba75aaa7 100644 --- a/init/initramfs.c +++ b/init/initramfs.c @@ -6,6 +6,7 @@ #include #include #include +#include static __initdata char *message; static void __init error(char *x) @@ -72,6 +73,49 @@ static void __init free_hash(void) } } +static long __init do_utime(char __user *filename, time_t mtime) +{ + struct timespec t[2]; + + t[0].tv_sec = mtime; + t[0].tv_nsec = 0; + t[1].tv_sec = mtime; + t[1].tv_nsec = 0; + + return do_utimes(AT_FDCWD, filename, t, AT_SYMLINK_NOFOLLOW); +} + +static __initdata LIST_HEAD(dir_list); +struct dir_entry { + struct list_head list; + char *name; + time_t mtime; +}; + +static void __init dir_add(const char *name, time_t mtime) +{ + struct dir_entry *de = kmalloc(sizeof(struct dir_entry), GFP_KERNEL); + if (!de) + panic("can't allocate dir_entry buffer"); + INIT_LIST_HEAD(&de->list); + de->name = kstrdup(name, GFP_KERNEL); + de->mtime = mtime; + list_add(&de->list, &dir_list); +} + +static void __init dir_utime(void) +{ + struct dir_entry *de, *tmp; + list_for_each_entry_safe(de, tmp, &dir_list, list) { + list_del(&de->list); + do_utime(de->name, de->mtime); + kfree(de->name); + kfree(de); + } +} + +static __initdata time_t mtime; + /* cpio header parsing */ static __initdata unsigned long ino, major, minor, nlink; @@ -97,6 +141,7 @@ static void __init parse_header(char *s) uid = parsed[2]; gid = parsed[3]; nlink = parsed[4]; + mtime = parsed[5]; body_len = parsed[6]; major = parsed[7]; minor = parsed[8]; @@ -130,6 +175,7 @@ static inline void __init eat(unsigned n) count -= n; } +static __initdata char *vcollected; static __initdata char *collected; static __initdata int remains; static __initdata char *collect; @@ -271,6 +317,7 @@ static int __init do_name(void) if (wfd >= 0) { sys_fchown(wfd, uid, gid); sys_fchmod(wfd, mode); + vcollected = kstrdup(collected, GFP_KERNEL); state = CopyFile; } } @@ -278,12 +325,14 @@ static int __init do_name(void) sys_mkdir(collected, mode); sys_chown(collected, uid, gid); sys_chmod(collected, mode); + dir_add(collected, mtime); } else if (S_ISBLK(mode) || S_ISCHR(mode) || S_ISFIFO(mode) || S_ISSOCK(mode)) { if (maybe_link() == 0) { sys_mknod(collected, mode, rdev); sys_chown(collected, uid, gid); sys_chmod(collected, mode); + do_utime(collected, mtime); } } return 0; @@ -294,6 +343,8 @@ static int __init do_copy(void) if (count >= body_len) { sys_write(wfd, victim, body_len); sys_close(wfd); + do_utime(vcollected, mtime); + kfree(vcollected); eat(body_len); state = SkipIt; return 0; @@ -311,6 +362,7 @@ static int __init do_symlink(void) clean_path(collected, 0); sys_symlink(collected + N_ALIGN(name_len), collected); sys_lchown(collected, uid, gid); + do_utime(collected, mtime); state = SkipIt; next_state = Reset; return 0; @@ -466,6 +518,7 @@ static char * __init unpack_to_rootfs(char *buf, unsigned len, int check_only) buf += inptr; len -= inptr; } + dir_utime(); kfree(window); kfree(name_buf); kfree(symlink_buf); -- cgit v1.2.3 From ebf3f09c634906d371f2bfd71b41c7e0c52efe7e Mon Sep 17 00:00:00 2001 From: Thomas Petazzoni Date: Wed, 15 Oct 2008 22:05:12 -0700 Subject: Configure out AIO support This patchs adds the CONFIG_AIO option which allows to remove support for asynchronous I/O operations, that are not necessarly used by applications, particularly on embedded devices. As this is a size-reduction option, it depends on CONFIG_EMBEDDED. It allows to save ~7 kilobytes of kernel code/data: text data bss dec hex filename 1115067 119180 217088 1451335 162547 vmlinux 1108025 119048 217088 1444161 160941 vmlinux.new -7042 -132 0 -7174 -1C06 +/- This patch has been originally written by Matt Mackall , and is part of the Linux Tiny project. [randy.dunlap@oracle.com: build fix] Signed-off-by: Thomas Petazzoni Cc: Benjamin LaHaise Cc: Zach Brown Signed-off-by: Matt Mackall Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- init/Kconfig | 8 ++++++++ 1 file changed, 8 insertions(+) (limited to 'init') diff --git a/init/Kconfig b/init/Kconfig index 8a8e2d00c40..5ceff3249a2 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -713,6 +713,14 @@ config SHMEM option replaces shmem and tmpfs with the much simpler ramfs code, which may be appropriate on small systems without swap. +config AIO + bool "Enable AIO support" if EMBEDDED + default y + help + This option enables POSIX asynchronous I/O which may by used + by some high performance threaded applications. Disabling + this option saves about 7k. + config VM_EVENT_COUNTERS default y bool "Enable VM event counters for /proc/vmstat" if EMBEDDED -- cgit v1.2.3 From 73b4a24f5ff09389ba6277c53a266b142f655ed2 Mon Sep 17 00:00:00 2001 From: Adrian Bunk Date: Thu, 16 Oct 2008 23:29:21 +0300 Subject: init/do_mounts_md.c must #include This patch fixes the following compile error caused by commit 589f800bb12c5cd6c9167bbf9bf3cb70cd8e422c ("fastboot: make the raid autodetect code wait for all devices to init"): CC init/do_mounts_md.o init/do_mounts_md.c: In function 'autodetect_raid': init/do_mounts_md.c:285: error: implicit declaration of function 'msleep' make[2]: *** [init/do_mounts_md.o] Error 1 Signed-off-by: Adrian Bunk Signed-off-by: Linus Torvalds --- init/do_mounts_md.c | 1 + 1 file changed, 1 insertion(+) (limited to 'init') diff --git a/init/do_mounts_md.c b/init/do_mounts_md.c index 48b3fadd83e..4c87ee1fe5d 100644 --- a/init/do_mounts_md.c +++ b/init/do_mounts_md.c @@ -1,5 +1,6 @@ #include +#include #include "do_mounts.h" -- cgit v1.2.3 From db64fe02258f1507e13fe5212a989922323685ce Mon Sep 17 00:00:00 2001 From: Nick Piggin Date: Sat, 18 Oct 2008 20:27:03 -0700 Subject: mm: rewrite vmap layer Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and provide a fast, scalable percpu frontend for small vmaps (requires a slightly different API, though). The biggest problem with vmap is actually vunmap. Presently this requires a global kernel TLB flush, which on most architectures is a broadcast IPI to all CPUs to flush the cache. This is all done under a global lock. As the number of CPUs increases, so will the number of vunmaps a scaled workload will want to perform, and so will the cost of a global TLB flush. This gives terrible quadratic scalability characteristics. Another problem is that the entire vmap subsystem works under a single lock. It is a rwlock, but it is actually taken for write in all the fast paths, and the read locking would likely never be run concurrently anyway, so it's just pointless. This is a rewrite of vmap subsystem to solve those problems. The existing vmalloc API is implemented on top of the rewritten subsystem. The TLB flushing problem is solved by using lazy TLB unmapping. vmap addresses do not have to be flushed immediately when they are vunmapped, because the kernel will not reuse them again (would be a use-after-free) until they are reallocated. So the addresses aren't allocated again until a subsequent TLB flush. A single TLB flush then can flush multiple vunmaps from each CPU. XEN and PAT and such do not like deferred TLB flushing because they can't always handle multiple aliasing virtual addresses to a physical address. They now call vm_unmap_aliases() in order to flush any deferred mappings. That call is very expensive (well, actually not a lot more expensive than a single vunmap under the old scheme), however it should be OK if not called too often. The virtual memory extent information is stored in an rbtree rather than a linked list to improve the algorithmic scalability. There is a per-CPU allocator for small vmaps, which amortizes or avoids global locking. To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces must be used in place of vmap and vunmap. Vmalloc does not use these interfaces at the moment, so it will not be quite so scalable (although it will use lazy TLB flushing). As a quick test of performance, I ran a test that loops in the kernel, linearly mapping then touching then unmapping 4 pages. Different numbers of tests were run in parallel on an 4 core, 2 socket opteron. Results are in nanoseconds per map+touch+unmap. threads vanilla vmap rewrite 1 14700 2900 2 33600 3000 4 49500 2800 8 70631 2900 So with a 8 cores, the rewritten version is already 25x faster. In a slightly more realistic test (although with an older and less scalable version of the patch), I ripped the not-very-good vunmap batching code out of XFS, and implemented the large buffer mapping with vm_map_ram and vm_unmap_ram... along with a couple of other tricks, I was able to speed up a large directory workload by 20x on a 64 CPU system. I believe vmap/vunmap is actually sped up a lot more than 20x on such a system, but I'm running into other locks now. vmap is pretty well blown off the profiles. Before: 1352059 total 0.1401 798784 _write_lock 8320.6667 <- vmlist_lock 529313 default_idle 1181.5022 15242 smp_call_function 15.8771 <- vmap tlb flushing 2472 __get_vm_area_node 1.9312 <- vmap 1762 remove_vm_area 4.5885 <- vunmap 316 map_vm_area 0.2297 <- vmap 312 kfree 0.1950 300 _spin_lock 3.1250 252 sn_send_IPI_phys 0.4375 <- tlb flushing 238 vmap 0.8264 <- vmap 216 find_lock_page 0.5192 196 find_next_bit 0.3603 136 sn2_send_IPI 0.2024 130 pio_phys_write_mmr 2.0312 118 unmap_kernel_range 0.1229 After: 78406 total 0.0081 40053 default_idle 89.4040 33576 ia64_spinlock_contention 349.7500 1650 _spin_lock 17.1875 319 __reg_op 0.5538 281 _atomic_dec_and_lock 1.0977 153 mutex_unlock 1.5938 123 iget_locked 0.1671 117 xfs_dir_lookup 0.1662 117 dput 0.1406 114 xfs_iget_core 0.0268 92 xfs_da_hashname 0.1917 75 d_alloc 0.0670 68 vmap_page_range 0.0462 <- vmap 58 kmem_cache_alloc 0.0604 57 memset 0.0540 52 rb_next 0.1625 50 __copy_user 0.0208 49 bitmap_find_free_region 0.2188 <- vmap 46 ia64_sn_udelay 0.1106 45 find_inode_fast 0.1406 42 memcmp 0.2188 42 finish_task_switch 0.1094 42 __d_lookup 0.0410 40 radix_tree_lookup_slot 0.1250 37 _spin_unlock_irqrestore 0.3854 36 xfs_bmapi 0.0050 36 kmem_cache_free 0.0256 35 xfs_vn_getattr 0.0322 34 radix_tree_lookup 0.1062 33 __link_path_walk 0.0035 31 xfs_da_do_buf 0.0091 30 _xfs_buf_find 0.0204 28 find_get_page 0.0875 27 xfs_iread 0.0241 27 __strncpy_from_user 0.2812 26 _xfs_buf_initialize 0.0406 24 _xfs_buf_lookup_pages 0.0179 24 vunmap_page_range 0.0250 <- vunmap 23 find_lock_page 0.0799 22 vm_map_ram 0.0087 <- vmap 20 kfree 0.0125 19 put_page 0.0330 18 __kmalloc 0.0176 17 xfs_da_node_lookup_int 0.0086 17 _read_lock 0.0885 17 page_waitqueue 0.0664 vmap has gone from being the top 5 on the profiles and flushing the crap out of all TLBs, to using less than 1% of kernel time. [akpm@linux-foundation.org: cleanups, section fix] [akpm@linux-foundation.org: fix build on alpha] Signed-off-by: Nick Piggin Cc: Jeremy Fitzhardinge Cc: Krzysztof Helt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- init/main.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'init') diff --git a/init/main.c b/init/main.c index 27f6bf6108e..4371d11721f 100644 --- a/init/main.c +++ b/init/main.c @@ -27,6 +27,7 @@ #include #include #include +#include #include #include #include @@ -642,6 +643,7 @@ asmlinkage void __init start_kernel(void) initrd_start = 0; } #endif + vmalloc_init(); vfs_caches_init_early(); cpuset_init_early(); mem_init(); -- cgit v1.2.3 From dc52ddc0e6f45b04780b26fc0813509f8e798c42 Mon Sep 17 00:00:00 2001 From: Matt Helsley Date: Sat, 18 Oct 2008 20:27:21 -0700 Subject: container freezer: implement freezer cgroup subsystem This patch implements a new freezer subsystem in the control groups framework. It provides a way to stop and resume execution of all tasks in a cgroup by writing in the cgroup filesystem. The freezer subsystem in the container filesystem defines a file named freezer.state. Writing "FROZEN" to the state file will freeze all tasks in the cgroup. Subsequently writing "RUNNING" will unfreeze the tasks in the cgroup. Reading will return the current state. * Examples of usage : # mkdir /containers/freezer # mount -t cgroup -ofreezer freezer /containers # mkdir /containers/0 # echo $some_pid > /containers/0/tasks to get status of the freezer subsystem : # cat /containers/0/freezer.state RUNNING to freeze all tasks in the container : # echo FROZEN > /containers/0/freezer.state # cat /containers/0/freezer.state FREEZING # cat /containers/0/freezer.state FROZEN to unfreeze all tasks in the container : # echo RUNNING > /containers/0/freezer.state # cat /containers/0/freezer.state RUNNING This is the basic mechanism which should do the right thing for user space task in a simple scenario. It's important to note that freezing can be incomplete. In that case we return EBUSY. This means that some tasks in the cgroup are busy doing something that prevents us from completely freezing the cgroup at this time. After EBUSY, the cgroup will remain partially frozen -- reflected by freezer.state reporting "FREEZING" when read. The state will remain "FREEZING" until one of these things happens: 1) Userspace cancels the freezing operation by writing "RUNNING" to the freezer.state file 2) Userspace retries the freezing operation by writing "FROZEN" to the freezer.state file (writing "FREEZING" is not legal and returns EIO) 3) The tasks that blocked the cgroup from entering the "FROZEN" state disappear from the cgroup's set of tasks. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: export thaw_process] Signed-off-by: Cedric Le Goater Signed-off-by: Matt Helsley Acked-by: Serge E. Hallyn Tested-by: Matt Helsley Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- init/Kconfig | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'init') diff --git a/init/Kconfig b/init/Kconfig index 5ceff3249a2..8828ed0b205 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -299,6 +299,13 @@ config CGROUP_NS for instance virtual servers and checkpoint/restart jobs. +config CGROUP_FREEZER + bool "control group freezer subsystem" + depends on CGROUPS + help + Provides a way to freeze and unfreeze all tasks in a + cgroup. + config CGROUP_DEVICE bool "Device controller for cgroups" depends on CGROUPS && EXPERIMENTAL -- cgit v1.2.3 From 3d137310245e4cdc3e8c8ba1bea2e145a87ae8e3 Mon Sep 17 00:00:00 2001 From: Thomas Petazzoni Date: Tue, 19 Aug 2008 10:28:24 +0200 Subject: PCI: allow quirks to be compiled out This patch adds the CONFIG_PCI_QUIRKS option which allows to remove all the PCI quirks, which are not necessarily used on embedded systems when PCI is working properly. As this is a size-reduction option, it depends on CONFIG_EMBEDDED. It allows to save almost 12 kilobytes of kernel code: text data bss dec hex filename 1287806 123596 212992 1624394 18c94a vmlinux.old 1275854 123596 212992 1612442 189a9a vmlinux -11952 0 0 -11952 -2EB0 +/- This patch has originally been written by Zwane Mwaikambo and is part of the Linux Tiny project. Signed-off-by: Thomas Petazzoni Signed-off-by: Jesse Barnes --- init/Kconfig | 8 ++++++++ 1 file changed, 8 insertions(+) (limited to 'init') diff --git a/init/Kconfig b/init/Kconfig index 8828ed0b205..06330a30524 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -737,6 +737,14 @@ config VM_EVENT_COUNTERS on EMBEDDED systems. /proc/vmstat will only show page counts if VM event counters are disabled. +config PCI_QUIRKS + default y + bool "Enable PCI quirk workarounds" if EMBEDDED && PCI + help + This enables workarounds for various PCI chipset + bugs/quirks. Disable this only if your target machine is + unaffected by PCI quirks. + config SLUB_DEBUG default y bool "Enable SLUB debugging support" if EMBEDDED -- cgit v1.2.3 From d0ea3d7d286aeda2a9216d76424abc285b87b7b4 Mon Sep 17 00:00:00 2001 From: Rusty Russell Date: Wed, 22 Oct 2008 10:00:23 -0500 Subject: Make initcall_debug a core_param This is the one I really wanted: now it effects module loading, it makes sense to be able to flip it after boot. Signed-off-by: Rusty Russell Acked-by: Arjan van de Ven --- init/main.c | 8 +------- 1 file changed, 1 insertion(+), 7 deletions(-) (limited to 'init') diff --git a/init/main.c b/init/main.c index 3e17a3bafe6..3d68aaaf616 100644 --- a/init/main.c +++ b/init/main.c @@ -697,13 +697,7 @@ asmlinkage void __init start_kernel(void) } static int initcall_debug; - -static int __init initcall_debug_setup(char *str) -{ - initcall_debug = 1; - return 1; -} -__setup("initcall_debug", initcall_debug_setup); +core_param(initcall_debug, initcall_debug, bool, 0644); int do_one_initcall(initcall_t fn) { -- cgit v1.2.3 From a802dd0eb5fc97a50cf1abb1f788a8f6cc5db635 Mon Sep 17 00:00:00 2001 From: Heiko Carstens Date: Mon, 13 Oct 2008 23:50:08 +0200 Subject: Call init_workqueues before pre smp initcalls. This allows to create workqueues from within the context of a pre smp initcall (aka early_initcall). Signed-off-by: Heiko Carstens Signed-off-by: Rusty Russell --- init/main.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'init') diff --git a/init/main.c b/init/main.c index 3d68aaaf616..6c7fd137c8c 100644 --- a/init/main.c +++ b/init/main.c @@ -767,8 +767,6 @@ static void __init do_initcalls(void) static void __init do_basic_setup(void) { rcu_init_sched(); /* needed by module_init stage. */ - /* drivers will send hotplug events */ - init_workqueues(); usermodehelper_init(); driver_init(); init_irq_proc(); @@ -852,6 +850,8 @@ static int __init kernel_init(void * unused) cad_pid = task_pid(current); + init_workqueues(); + smp_prepare_cpus(setup_max_cpus); do_pre_smp_initcalls(); -- cgit v1.2.3 From 61cfc7e442c52c14e632d9af0e70779cfa04249d Mon Sep 17 00:00:00 2001 From: Geert Uytterhoeven Date: Wed, 22 Oct 2008 08:53:25 +0200 Subject: PCI: PCI_QUIRKS depends on PCI commit 3d137310245e4cdc3e8c8ba1bea2e145a87ae8e3 ("PCI: allow quirks to be compiled out") introduced CONFIG_PCI_QUIRKS, which now shows up in each and every .config. Fix this by making it depend on PCI. Signed-off-by: Geert Uytterhoeven Signed-off-by: Jesse Barnes --- init/Kconfig | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'init') diff --git a/init/Kconfig b/init/Kconfig index 113c74c07da..44e9208f9c7 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -739,7 +739,8 @@ config VM_EVENT_COUNTERS config PCI_QUIRKS default y - bool "Enable PCI quirk workarounds" if EMBEDDED && PCI + bool "Enable PCI quirk workarounds" if EMBEDDED + depends on PCI help This enables workarounds for various PCI chipset bugs/quirks. Disable this only if your target machine is -- cgit v1.2.3 From 6de24f0ed08054b2a202902e4d63beff27654db8 Mon Sep 17 00:00:00 2001 From: Alexey Dobriyan Date: Thu, 28 Aug 2008 06:25:49 +0400 Subject: [PATCH 1/2] anondev: init IDR statically Signed-off-by: Alexey Dobriyan --- init/main.c | 1 - 1 file changed, 1 deletion(-) (limited to 'init') diff --git a/init/main.c b/init/main.c index 3e17a3bafe6..c6a1024a27a 100644 --- a/init/main.c +++ b/init/main.c @@ -670,7 +670,6 @@ asmlinkage void __init start_kernel(void) fork_init(num_physpages); proc_caches_init(); buffer_init(); - unnamed_dev_init(); key_init(); security_init(); vfs_caches_init(num_physpages); -- cgit v1.2.3 From 94b6da5ab8293b04a300ba35c72eddfa94db8b02 Mon Sep 17 00:00:00 2001 From: KAMEZAWA Hiroyuki Date: Wed, 22 Oct 2008 14:15:05 -0700 Subject: memcg: fix page_cgroup allocation page_cgroup_init() is called from mem_cgroup_init(). But at this point, we cannot call alloc_bootmem(). (and this caused panic at boot.) This patch moves page_cgroup_init() to init/main.c. Time table is following: == parse_args(). # we can trust mem_cgroup_subsys.disabled bit after this. .... cgroup_init_early() # "early" init of cgroup. .... setup_arch() # memmap is allocated. ... page_cgroup_init(); mem_init(); # we cannot call alloc_bootmem after this. .... cgroup_init() # mem_cgroup is initialized. == Before page_cgroup_init(), mem_map must be initialized. So, I added page_cgroup_init() to init/main.c directly. (*) maybe this is not very clean but - cgroup_init_early() is too early - in cgroup_init(), we have to use vmalloc instead of alloc_bootmem(). use of vmalloc area in x86-32 is important and we should avoid very large vmalloc() in x86-32. So, we want to use alloc_bootmem() and added page_cgroup_init() directly to init/main.c [akpm@linux-foundation.org: remove unneeded/bad mem_cgroup_subsys declaration] [akpm@linux-foundation.org: fix build] Acked-by: Balbir Singh Tested-by: Balbir Singh Signed-off-by: KAMEZAWA Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- init/main.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'init') diff --git a/init/main.c b/init/main.c index 3e17a3bafe6..672ae75b205 100644 --- a/init/main.c +++ b/init/main.c @@ -52,6 +52,7 @@ #include #include #include +#include #include #include #include @@ -647,6 +648,7 @@ asmlinkage void __init start_kernel(void) vmalloc_init(); vfs_caches_init_early(); cpuset_init_early(); + page_cgroup_init(); mem_init(); enable_debug_pagealloc(); cpu_hotplug_init(); -- cgit v1.2.3 From 4403b406d4369a275d483ece6ddee0088cc0d592 Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Sat, 25 Oct 2008 19:53:38 -0700 Subject: Revert "Call init_workqueues before pre smp initcalls." This reverts commit a802dd0eb5fc97a50cf1abb1f788a8f6cc5db635 by moving the call to init_workqueues() back where it belongs - after SMP has been initialized. It also moves stop_machine_init() - which needs workqueues - to a later phase using a core_initcall() instead of early_initcall(). That should satisfy all ordering requirements, and was apparently the reason why init_workqueues() was moved to be too early. Cc: Heiko Carstens Cc: Rusty Russell Signed-off-by: Linus Torvalds --- init/main.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'init') diff --git a/init/main.c b/init/main.c index 130d1a0eef1..7e117a231af 100644 --- a/init/main.c +++ b/init/main.c @@ -768,6 +768,7 @@ static void __init do_initcalls(void) static void __init do_basic_setup(void) { rcu_init_sched(); /* needed by module_init stage. */ + init_workqueues(); usermodehelper_init(); driver_init(); init_irq_proc(); @@ -851,8 +852,6 @@ static int __init kernel_init(void * unused) cad_pid = task_pid(current); - init_workqueues(); - smp_prepare_cpus(setup_max_cpus); do_pre_smp_initcalls(); -- cgit v1.2.3 From f8b77d39397e1510b1a3bcfd385ebd1a45aae77f Mon Sep 17 00:00:00 2001 From: Alexey Dobriyan Date: Wed, 29 Oct 2008 14:01:05 -0700 Subject: init/do_mounts_md.c: msleep compile fix init/do_mounts_md.c:285: error: implicit declaration of function 'msleep' Signed-off-by: Alexey Dobriyan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- init/do_mounts_md.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'init') diff --git a/init/do_mounts_md.c b/init/do_mounts_md.c index 4c87ee1fe5d..4d42f450b59 100644 --- a/init/do_mounts_md.c +++ b/init/do_mounts_md.c @@ -1,4 +1,4 @@ - +#include #include #include -- cgit v1.2.3 From 84ad6d70001df969d7e8569dd18d98d9550277fb Mon Sep 17 00:00:00 2001 From: KAMEZAWA Hiroyuki Date: Wed, 29 Oct 2008 14:01:06 -0700 Subject: memcg: update menuconfig help text page_cgroup is now allocated at boot and memmap doesn't includes pointer for page_cgroup. Fix the menu help text. Reviewed-by: Balbir Singh Signed-off-by: KAMEZAWA hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- init/Kconfig | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) (limited to 'init') diff --git a/init/Kconfig b/init/Kconfig index 44e9208f9c7..86b00c53fad 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -401,16 +401,20 @@ config CGROUP_MEM_RES_CTLR depends on CGROUPS && RESOURCE_COUNTERS select MM_OWNER help - Provides a memory resource controller that manages both page cache and - RSS memory. + Provides a memory resource controller that manages both anonymous + memory and page cache. (See Documentation/controllers/memory.txt) Note that setting this option increases fixed memory overhead - associated with each page of memory in the system by 4/8 bytes - and also increases cache misses because struct page on many 64bit - systems will not fit into a single cache line anymore. + associated with each page of memory in the system. By this, + 20(40)bytes/PAGE_SIZE on 32(64)bit system will be occupied by memory + usage tracking struct at boot. Total amount of this is printed out + at boot. Only enable when you're ok with these trade offs and really - sure you need the memory resource controller. + sure you need the memory resource controller. Even when you enable + this, you can set "cgroup_disable=memory" at your boot option to + disable memory resource controller and you can avoid overheads. + (and lose benefits of memory resource contoller) This config option also selects MM_OWNER config option, which could in turn add some fork/exit overhead. -- cgit v1.2.3 From d3f15800d5752ca4814270180798ab8323157d28 Mon Sep 17 00:00:00 2001 From: Huang Weiyi Date: Fri, 31 Oct 2008 12:47:23 +0800 Subject: init/do_mounts_md.c: remove duplicated #include Removed duplicated #include in init/do_mounts_md.c. The same compile error ("error: implicit declaration of function 'msleep'") got fixed twice: - f8b77d39397e1510b1a3bcfd385ebd1a45aae77f ("init/do_mounts_md.c: msleep compile fix") - 73b4a24f5ff09389ba6277c53a266b142f655ed2 ("init/do_mounts_md.c must #include ") by people adding the include in two slightly different places. Andrew's quilt scripts happily ignore the fuzz, and will re-apply the patch even though they had conflicts. Signed-off-by: Huang Weiyi Signed-off-by: Linus Torvalds --- init/do_mounts_md.c | 1 - 1 file changed, 1 deletion(-) (limited to 'init') diff --git a/init/do_mounts_md.c b/init/do_mounts_md.c index 4d42f450b59..d6da5cdd3c3 100644 --- a/init/do_mounts_md.c +++ b/init/do_mounts_md.c @@ -1,6 +1,5 @@ #include #include -#include #include "do_mounts.h" -- cgit v1.2.3 From 71566a0d161edec70361b7f90f6e54af6a6d5d05 Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Fri, 31 Oct 2008 12:57:20 +0100 Subject: tracing/fastboot: Enable boot tracing only during initcalls Impact: modify boot tracer We used to disable the initcall tracing at a specified time (IE: end of builtin initcalls). But we don't need it anymore. It will be stopped when initcalls are finished. However we want two things: _Start this tracing only after pre-smp initcalls are finished. _Since we are planning to trace sched_switches at the same time, we want to enable them only during the initcall execution. For this purpose, this patch introduce two functions to enable/disable the sched_switch tracing during boot. Signed-off-by: Frederic Weisbecker Signed-off-by: Ingo Molnar --- init/main.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'init') diff --git a/init/main.c b/init/main.c index 7e117a231af..4b03cd5656c 100644 --- a/init/main.c +++ b/init/main.c @@ -711,6 +711,7 @@ int do_one_initcall(initcall_t fn) it.caller = task_pid_nr(current); printk("calling %pF @ %i\n", fn, it.caller); it.calltime = ktime_get(); + enable_boot_trace(); } it.result = fn(); @@ -722,6 +723,7 @@ int do_one_initcall(initcall_t fn) printk("initcall %pF returned %d after %Ld usecs\n", fn, it.result, it.duration); trace_boot(&it, fn); + disable_boot_trace(); } msgbuf[0] = 0; @@ -882,7 +884,7 @@ static int __init kernel_init(void * unused) * we're essentially up and running. Get rid of the * initmem segments and start the user-mode stuff.. */ - stop_boot_trace(); + init_post(); return 0; } -- cgit v1.2.3 From 3f5ec13696fd4a33bde42f385406cbb1d3cc96fd Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Tue, 11 Nov 2008 23:21:31 +0100 Subject: tracing/fastboot: move boot tracer structs and funcs into their own header. Impact: Cleanups on the boot tracer and ftrace This patch bring some cleanups about the boot tracer headers. The functions and structures of this tracer have nothing related to ftrace and should have so their own header file. Signed-off-by: Frederic Weisbecker Acked-by: Steven Rostedt Signed-off-by: Ingo Molnar --- init/main.c | 1 + 1 file changed, 1 insertion(+) (limited to 'init') diff --git a/init/main.c b/init/main.c index 4b03cd5656c..16ca1ee071c 100644 --- a/init/main.c +++ b/init/main.c @@ -63,6 +63,7 @@ #include #include #include +#include #include #include -- cgit v1.2.3 From 74239072830ef3f1398edeb1bc1076fc330fd4a2 Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Tue, 11 Nov 2008 23:24:42 +0100 Subject: tracing/fastboot: Use the ring-buffer timestamp for initcall entries Impact: Split the boot tracer entries in two parts: call and return Now that we are using the sched tracer from the boot tracer, we want to use the same timestamp than the ring-buffer to have consistent time captures between sched events and initcall events. So we get rid of the old time capture by the boot tracer and split the initcall events in two parts: call and return. This way we have the ring buffer timestamp of both. An example trace: [ 27.904149584] calling net_ns_init+0x0/0x1c0 @ 1 [ 27.904429624] initcall net_ns_init+0x0/0x1c0 returned 0 after 0 msecs [ 27.904575926] calling reboot_init+0x0/0x20 @ 1 [ 27.904655399] initcall reboot_init+0x0/0x20 returned 0 after 0 msecs [ 27.904800228] calling sysctl_init+0x0/0x30 @ 1 [ 27.905142914] initcall sysctl_init+0x0/0x30 returned 0 after 0 msecs [ 27.905287211] calling ksysfs_init+0x0/0xb0 @ 1 ##### CPU 0 buffer started #### init-1 [000] 27.905395: 1:120:R + [001] 11:115:S ##### CPU 1 buffer started #### -0 [001] 27.905425: 0:140:R ==> [001] 11:115:R init-1 [000] 27.905426: 1:120:D ==> [000] 0:140:R -0 [000] 27.905431: 0:140:R + [000] 4:115:S -0 [000] 27.905451: 0:140:R ==> [000] 4:115:R ksoftirqd/0-4 [000] 27.905456: 4:115:S ==> [000] 0:140:R udevd-11 [001] 27.905458: 11:115:R + [001] 14:115:R -0 [000] 27.905459: 0:140:R + [000] 4:115:S -0 [000] 27.905462: 0:140:R ==> [000] 4:115:R udevd-11 [001] 27.905462: 11:115:R ==> [001] 14:115:R ksoftirqd/0-4 [000] 27.905467: 4:115:S ==> [000] 0:140:R -0 [000] 27.905470: 0:140:R + [000] 4:115:S -0 [000] 27.905473: 0:140:R ==> [000] 4:115:R ksoftirqd/0-4 [000] 27.905476: 4:115:S ==> [000] 0:140:R -0 [000] 27.905479: 0:140:R + [000] 4:115:S -0 [000] 27.905482: 0:140:R ==> [000] 4:115:R ksoftirqd/0-4 [000] 27.905486: 4:115:S ==> [000] 0:140:R udevd-14 [001] 27.905499: 14:120:X ==> [001] 11:115:R udevd-11 [001] 27.905506: 11:115:R + [000] 1:120:D -0 [000] 27.905515: 0:140:R ==> [000] 1:120:R udevd-11 [001] 27.905517: 11:115:S ==> [001] 0:140:R [ 27.905557107] initcall ksysfs_init+0x0/0xb0 returned 0 after 3906 msecs [ 27.905705736] calling init_jiffies_clocksource+0x0/0x10 @ 1 [ 27.905779239] initcall init_jiffies_clocksource+0x0/0x10 returned 0 after 0 msecs [ 27.906769814] calling pm_init+0x0/0x30 @ 1 [ 27.906853627] initcall pm_init+0x0/0x30 returned 0 after 0 msecs [ 27.906997803] calling pm_disk_init+0x0/0x20 @ 1 [ 27.907076946] initcall pm_disk_init+0x0/0x20 returned 0 after 0 msecs [ 27.907222556] calling swsusp_header_init+0x0/0x30 @ 1 [ 27.907294325] initcall swsusp_header_init+0x0/0x30 returned 0 after 0 msecs [ 27.907439620] calling stop_machine_init+0x0/0x50 @ 1 init-1 [000] 27.907485: 1:120:R + [000] 2:115:S init-1 [000] 27.907490: 1:120:D ==> [000] 2:115:R kthreadd-2 [000] 27.907507: 2:115:R + [001] 15:115:R -0 [001] 27.907517: 0:140:R ==> [001] 15:115:R kthreadd-2 [000] 27.907517: 2:115:D ==> [000] 0:140:R -0 [000] 27.907521: 0:140:R + [000] 4:115:S -0 [000] 27.907524: 0:140:R ==> [000] 4:115:R udevd-15 [001] 27.907527: 15:115:D + [000] 2:115:D ksoftirqd/0-4 [000] 27.907537: 4:115:S ==> [000] 2:115:R udevd-15 [001] 27.907537: 15:115:D ==> [001] 0:140:R kthreadd-2 [000] 27.907546: 2:115:R + [000] 1:120:D kthreadd-2 [000] 27.907550: 2:115:S ==> [000] 1:120:R init-1 [000] 27.907584: 1:120:R + [000] 15: 0:D init-1 [000] 27.907589: 1:120:R + [000] 2:115:S init-1 [000] 27.907593: 1:120:D ==> [000] 15: 0:R udevd-15 [000] 27.907601: 15: 0:S ==> [000] 2:115:R ##### CPU 0 buffer started #### kthreadd-2 [000] 27.907616: 2:115:R + [001] 16:115:R ##### CPU 1 buffer started #### -0 [001] 27.907620: 0:140:R ==> [001] 16:115:R kthreadd-2 [000] 27.907621: 2:115:D ==> [000] 0:140:R udevd-16 [001] 27.907625: 16:115:D + [000] 2:115:D -0 [000] 27.907628: 0:140:R + [000] 4:115:S udevd-16 [001] 27.907629: 16:115:D ==> [001] 0:140:R -0 [000] 27.907631: 0:140:R ==> [000] 4:115:R ksoftirqd/0-4 [000] 27.907636: 4:115:S ==> [000] 2:115:R kthreadd-2 [000] 27.907644: 2:115:R + [000] 1:120:D kthreadd-2 [000] 27.907647: 2:115:S ==> [000] 1:120:R init-1 [000] 27.907657: 1:120:R + [001] 16: 0:D -0 [001] 27.907666: 0:140:R ==> [001] 16: 0:R [ 27.907703862] initcall stop_machine_init+0x0/0x50 returned 0 after 0 msecs [ 27.907850704] calling filelock_init+0x0/0x30 @ 1 [ 27.907926573] initcall filelock_init+0x0/0x30 returned 0 after 0 msecs [ 27.908071327] calling init_script_binfmt+0x0/0x10 @ 1 [ 27.908165195] initcall init_script_binfmt+0x0/0x10 returned 0 after 0 msecs [ 27.908309461] calling init_elf_binfmt+0x0/0x10 @ 1 Signed-off-by: Frederic Weisbecker Acked-by: Steven Rostedt Signed-off-by: Ingo Molnar --- init/main.c | 32 +++++++++++++++++--------------- 1 file changed, 17 insertions(+), 15 deletions(-) (limited to 'init') diff --git a/init/main.c b/init/main.c index 16ca1ee071c..e810196bf2f 100644 --- a/init/main.c +++ b/init/main.c @@ -704,33 +704,35 @@ core_param(initcall_debug, initcall_debug, bool, 0644); int do_one_initcall(initcall_t fn) { int count = preempt_count(); - ktime_t delta; + ktime_t calltime, delta, rettime; char msgbuf[64]; - struct boot_trace it; + struct boot_trace_call call; + struct boot_trace_ret ret; if (initcall_debug) { - it.caller = task_pid_nr(current); - printk("calling %pF @ %i\n", fn, it.caller); - it.calltime = ktime_get(); + call.caller = task_pid_nr(current); + printk("calling %pF @ %i\n", fn, call.caller); + calltime = ktime_get(); + trace_boot_call(&call, fn); enable_boot_trace(); } - it.result = fn(); + ret.result = fn(); if (initcall_debug) { - it.rettime = ktime_get(); - delta = ktime_sub(it.rettime, it.calltime); - it.duration = (unsigned long long) delta.tv64 >> 10; - printk("initcall %pF returned %d after %Ld usecs\n", fn, - it.result, it.duration); - trace_boot(&it, fn); disable_boot_trace(); + rettime = ktime_get(); + delta = ktime_sub(rettime, calltime); + ret.duration = (unsigned long long) delta.tv64 >> 10; + trace_boot_ret(&ret, fn); + printk("initcall %pF returned %d after %Ld usecs\n", fn, + ret.result, ret.duration); } msgbuf[0] = 0; - if (it.result && it.result != -ENODEV && initcall_debug) - sprintf(msgbuf, "error code %d ", it.result); + if (ret.result && ret.result != -ENODEV && initcall_debug) + sprintf(msgbuf, "error code %d ", ret.result); if (preempt_count() != count) { strlcat(msgbuf, "preemption imbalance ", sizeof(msgbuf)); @@ -744,7 +746,7 @@ int do_one_initcall(initcall_t fn) printk("initcall %pF returned with %s\n", fn, msgbuf); } - return it.result; + return ret.result; } -- cgit v1.2.3 From 2fe401e38602e853e01376cdb670b0bc4d526a6d Mon Sep 17 00:00:00 2001 From: Adrian Knoth Date: Wed, 12 Nov 2008 16:23:55 -0800 Subject: sched: correct sched-rt-group.txt pathname in init/Kconfig init/Kconfig directs the user to Documentation/sched-rt-group.txt, but the file is actually in Documentation/scheduler/sched-rt-group.txt. This patch corrects the pathname mentioned in init/Kconfig. Signed-off-by: Adrian Knoth Signed-off-by: Andrew Morton Signed-off-by: Ingo Molnar --- init/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'init') diff --git a/init/Kconfig b/init/Kconfig index 86b00c53fad..2f850d800d9 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -354,7 +354,7 @@ config RT_GROUP_SCHED setting below. If enabled, it will also make it impossible to schedule realtime tasks for non-root users until you allocate realtime bandwidth for them. - See Documentation/sched-rt-group.txt for more information. + See Documentation/scheduler/sched-rt-group.txt for more information. choice depends on GROUP_SCHED -- cgit v1.2.3 From 02f5621042e3f7e2fb6c741cbe5ee7c1f3caf354 Mon Sep 17 00:00:00 2001 From: Simon Arlott Date: Wed, 5 Nov 2008 22:18:19 +0000 Subject: Kconfig: SLUB is the default slab allocator In 2007, a0acd820807680d2ccc4ef3448387fcdbf152c73 changed the default slab allocator to SLUB, but the SLAB help text still says SLAB is the default. This change fixes that. Signed-off-by: Simon Arlott Signed-off-by: Pekka Enberg --- init/Kconfig | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'init') diff --git a/init/Kconfig b/init/Kconfig index 86b00c53fad..226da2733c1 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -771,8 +771,7 @@ config SLAB help The regular slab allocator that is established and known to work well in all environments. It organizes cache hot objects in - per cpu and per node queues. SLAB is the default choice for - a slab allocator. + per cpu and per node queues. config SLUB bool "SLUB (Unqueued Allocator)" @@ -781,7 +780,8 @@ config SLUB instead of managing queues of cached objects (SLAB approach). Per cpu caching is realized using slabs of objects instead of queues of objects. SLUB can use memory efficiently - and has enhanced diagnostics. + and has enhanced diagnostics. SLUB is the default choice for + a slab allocator. config SLOB depends on EMBEDDED -- cgit v1.2.3 From d84f4f992cbd76e8f39c488cf0c5d123843923b1 Mon Sep 17 00:00:00 2001 From: David Howells Date: Fri, 14 Nov 2008 10:39:23 +1100 Subject: CRED: Inaugurate COW credentials Inaugurate copy-on-write credentials management. This uses RCU to manage the credentials pointer in the task_struct with respect to accesses by other tasks. A process may only modify its own credentials, and so does not need locking to access or modify its own credentials. A mutex (cred_replace_mutex) is added to the task_struct to control the effect of PTRACE_ATTACHED on credential calculations, particularly with respect to execve(). With this patch, the contents of an active credentials struct may not be changed directly; rather a new set of credentials must be prepared, modified and committed using something like the following sequence of events: struct cred *new = prepare_creds(); int ret = blah(new); if (ret < 0) { abort_creds(new); return ret; } return commit_creds(new); There are some exceptions to this rule: the keyrings pointed to by the active credentials may be instantiated - keyrings violate the COW rule as managing COW keyrings is tricky, given that it is possible for a task to directly alter the keys in a keyring in use by another task. To help enforce this, various pointers to sets of credentials, such as those in the task_struct, are declared const. The purpose of this is compile-time discouragement of altering credentials through those pointers. Once a set of credentials has been made public through one of these pointers, it may not be modified, except under special circumstances: (1) Its reference count may incremented and decremented. (2) The keyrings to which it points may be modified, but not replaced. The only safe way to modify anything else is to create a replacement and commit using the functions described in Documentation/credentials.txt (which will be added by a later patch). This patch and the preceding patches have been tested with the LTP SELinux testsuite. This patch makes several logical sets of alteration: (1) execve(). This now prepares and commits credentials in various places in the security code rather than altering the current creds directly. (2) Temporary credential overrides. do_coredump() and sys_faccessat() now prepare their own credentials and temporarily override the ones currently on the acting thread, whilst preventing interference from other threads by holding cred_replace_mutex on the thread being dumped. This will be replaced in a future patch by something that hands down the credentials directly to the functions being called, rather than altering the task's objective credentials. (3) LSM interface. A number of functions have been changed, added or removed: (*) security_capset_check(), ->capset_check() (*) security_capset_set(), ->capset_set() Removed in favour of security_capset(). (*) security_capset(), ->capset() New. This is passed a pointer to the new creds, a pointer to the old creds and the proposed capability sets. It should fill in the new creds or return an error. All pointers, barring the pointer to the new creds, are now const. (*) security_bprm_apply_creds(), ->bprm_apply_creds() Changed; now returns a value, which will cause the process to be killed if it's an error. (*) security_task_alloc(), ->task_alloc_security() Removed in favour of security_prepare_creds(). (*) security_cred_free(), ->cred_free() New. Free security data attached to cred->security. (*) security_prepare_creds(), ->cred_prepare() New. Duplicate any security data attached to cred->security. (*) security_commit_creds(), ->cred_commit() New. Apply any security effects for the upcoming installation of new security by commit_creds(). (*) security_task_post_setuid(), ->task_post_setuid() Removed in favour of security_task_fix_setuid(). (*) security_task_fix_setuid(), ->task_fix_setuid() Fix up the proposed new credentials for setuid(). This is used by cap_set_fix_setuid() to implicitly adjust capabilities in line with setuid() changes. Changes are made to the new credentials, rather than the task itself as in security_task_post_setuid(). (*) security_task_reparent_to_init(), ->task_reparent_to_init() Removed. Instead the task being reparented to init is referred directly to init's credentials. NOTE! This results in the loss of some state: SELinux's osid no longer records the sid of the thread that forked it. (*) security_key_alloc(), ->key_alloc() (*) security_key_permission(), ->key_permission() Changed. These now take cred pointers rather than task pointers to refer to the security context. (4) sys_capset(). This has been simplified and uses less locking. The LSM functions it calls have been merged. (5) reparent_to_kthreadd(). This gives the current thread the same credentials as init by simply using commit_thread() to point that way. (6) __sigqueue_alloc() and switch_uid() __sigqueue_alloc() can't stop the target task from changing its creds beneath it, so this function gets a reference to the currently applicable user_struct which it then passes into the sigqueue struct it returns if successful. switch_uid() is now called from commit_creds(), and possibly should be folded into that. commit_creds() should take care of protecting __sigqueue_alloc(). (7) [sg]et[ug]id() and co and [sg]et_current_groups. The set functions now all use prepare_creds(), commit_creds() and abort_creds() to build and check a new set of credentials before applying it. security_task_set[ug]id() is called inside the prepared section. This guarantees that nothing else will affect the creds until we've finished. The calling of set_dumpable() has been moved into commit_creds(). Much of the functionality of set_user() has been moved into commit_creds(). The get functions all simply access the data directly. (8) security_task_prctl() and cap_task_prctl(). security_task_prctl() has been modified to return -ENOSYS if it doesn't want to handle a function, or otherwise return the return value directly rather than through an argument. Additionally, cap_task_prctl() now prepares a new set of credentials, even if it doesn't end up using it. (9) Keyrings. A number of changes have been made to the keyrings code: (a) switch_uid_keyring(), copy_keys(), exit_keys() and suid_keys() have all been dropped and built in to the credentials functions directly. They may want separating out again later. (b) key_alloc() and search_process_keyrings() now take a cred pointer rather than a task pointer to specify the security context. (c) copy_creds() gives a new thread within the same thread group a new thread keyring if its parent had one, otherwise it discards the thread keyring. (d) The authorisation key now points directly to the credentials to extend the search into rather pointing to the task that carries them. (e) Installing thread, process or session keyrings causes a new set of credentials to be created, even though it's not strictly necessary for process or session keyrings (they're shared). (10) Usermode helper. The usermode helper code now carries a cred struct pointer in its subprocess_info struct instead of a new session keyring pointer. This set of credentials is derived from init_cred and installed on the new process after it has been cloned. call_usermodehelper_setup() allocates the new credentials and call_usermodehelper_freeinfo() discards them if they haven't been used. A special cred function (prepare_usermodeinfo_creds()) is provided specifically for call_usermodehelper_setup() to call. call_usermodehelper_setkeys() adjusts the credentials to sport the supplied keyring as the new session keyring. (11) SELinux. SELinux has a number of changes, in addition to those to support the LSM interface changes mentioned above: (a) selinux_setprocattr() no longer does its check for whether the current ptracer can access processes with the new SID inside the lock that covers getting the ptracer's SID. Whilst this lock ensures that the check is done with the ptracer pinned, the result is only valid until the lock is released, so there's no point doing it inside the lock. (12) is_single_threaded(). This function has been extracted from selinux_setprocattr() and put into a file of its own in the lib/ directory as join_session_keyring() now wants to use it too. The code in SELinux just checked to see whether a task shared mm_structs with other tasks (CLONE_VM), but that isn't good enough. We really want to know if they're part of the same thread group (CLONE_THREAD). (13) nfsd. The NFS server daemon now has to use the COW credentials to set the credentials it is going to use. It really needs to pass the credentials down to the functions it calls, but it can't do that until other patches in this series have been applied. Signed-off-by: David Howells Acked-by: James Morris Signed-off-by: James Morris --- init/main.c | 1 + 1 file changed, 1 insertion(+) (limited to 'init') diff --git a/init/main.c b/init/main.c index 7e117a231af..db843bff573 100644 --- a/init/main.c +++ b/init/main.c @@ -669,6 +669,7 @@ asmlinkage void __init start_kernel(void) efi_enter_virtual_mode(); #endif thread_info_cache_init(); + cred_init(); fork_init(num_physpages); proc_caches_init(); buffer_init(); -- cgit v1.2.3 From c1df1bd2c4d4b20c83755a0f41956b57aec4842a Mon Sep 17 00:00:00 2001 From: Mathieu Desnoyers Date: Fri, 14 Nov 2008 17:47:39 -0500 Subject: markers: auto enable tracepoints (new API : trace_mark_tp()) Impact: new API Add a new API trace_mark_tp(), which declares a marker within a tracepoint probe. When the marker is activated, the tracepoint is automatically enabled. No branch test is used at the marker site, because it would be a duplicate of the branch already present in the tracepoint. Signed-off-by: Mathieu Desnoyers Signed-off-by: Ingo Molnar --- init/Kconfig | 1 + 1 file changed, 1 insertion(+) (limited to 'init') diff --git a/init/Kconfig b/init/Kconfig index 86b00c53fad..f5bacb43871 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -808,6 +808,7 @@ config TRACEPOINTS config MARKERS bool "Activate markers" + depends on TRACEPOINTS help Place an empty function call at each marker site. Can be dynamically changed for a probe function. -- cgit v1.2.3 From 1d926f2756392c6909f60e0c9fe2a09d5462e376 Mon Sep 17 00:00:00 2001 From: Will Newton Date: Fri, 21 Nov 2008 14:08:59 -0800 Subject: init/main.c: use ktime accessor function in initcall_debug code Impact: fix initcall debug output on non-scalar ktime platforms (32-bit embedded) The initcall_debug code access the tv64 member of ktime. This won't work correctly for large deltas on platforms that don't use the scalar ktime implementation. Signed-off-by: Will Newton Acked-by: Tim Bird Signed-off-by: Andrew Morton Signed-off-by: Ingo Molnar --- init/main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'init') diff --git a/init/main.c b/init/main.c index e810196bf2f..79213c0785d 100644 --- a/init/main.c +++ b/init/main.c @@ -723,7 +723,7 @@ int do_one_initcall(initcall_t fn) disable_boot_trace(); rettime = ktime_get(); delta = ktime_sub(rettime, calltime); - ret.duration = (unsigned long long) delta.tv64 >> 10; + ret.duration = (unsigned long long) ktime_to_ns(delta) >> 10; trace_boot_ret(&ret, fn); printk("initcall %pF returned %d after %Ld usecs\n", fn, ret.result, ret.duration); -- cgit v1.2.3 From 0b8f1efad30bd58f89961b82dfe68b9edf8fd2ac Mon Sep 17 00:00:00 2001 From: Yinghai Lu Date: Fri, 5 Dec 2008 18:58:31 -0800 Subject: sparse irq_desc[] array: core kernel and x86 changes Impact: new feature Problem on distro kernels: irq_desc[NR_IRQS] takes megabytes of RAM with NR_CPUS set to large values. The goal is to be able to scale up to much larger NR_IRQS value without impacting the (important) common case. To solve this, we generalize irq_desc[NR_IRQS] to an (optional) array of irq_desc pointers. When CONFIG_SPARSE_IRQ=y is used, we use kzalloc_node to get irq_desc, this also makes the IRQ descriptors NUMA-local (to the site that calls request_irq()). This gets rid of the irq_cfg[] static array on x86 as well: irq_cfg now uses desc->chip_data for x86 to store irq_cfg. Signed-off-by: Yinghai Lu Signed-off-by: Ingo Molnar --- init/main.c | 11 +++++++++++ 1 file changed, 11 insertions(+) (limited to 'init') diff --git a/init/main.c b/init/main.c index 7e117a231af..c1f999a3cf3 100644 --- a/init/main.c +++ b/init/main.c @@ -539,6 +539,15 @@ void __init __weak thread_info_cache_init(void) { } +void __init __weak arch_early_irq_init(void) +{ +} + +void __init __weak early_irq_init(void) +{ + arch_early_irq_init(); +} + asmlinkage void __init start_kernel(void) { char * command_line; @@ -603,6 +612,8 @@ asmlinkage void __init start_kernel(void) sort_main_extable(); trap_init(); rcu_init(); + /* init some links before init_ISA_irqs() */ + early_irq_init(); init_IRQ(); pidhash_init(); init_timers(); -- cgit v1.2.3 From 64db4cfff99c04cd5f550357edcc8780f96b54a2 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 18 Dec 2008 21:55:32 +0100 Subject: "Tree RCU": scalable classic RCU implementation This patch fixes a long-standing performance bug in classic RCU that results in massive internal-to-RCU lock contention on systems with more than a few hundred CPUs. Although this patch creates a separate flavor of RCU for ease of review and patch maintenance, it is intended to replace classic RCU. This patch still handles stress better than does mainline, so I am still calling it ready for inclusion. This patch is against the -tip tree. Nevertheless, experience on an actual 1000+ CPU machine would still be most welcome. Most of the changes noted below were found while creating an rcutiny (which should permit ejecting the current rcuclassic) and while doing detailed line-by-line documentation. Updates from v9 (http://lkml.org/lkml/2008/12/2/334): o Fixes from remainder of line-by-line code walkthrough, including comment spelling, initialization, undesirable narrowing due to type conversion, removing redundant memory barriers, removing redundant local-variable initialization, and removing redundant local variables. I do not believe that any of these fixes address the CPU-hotplug issues that Andi Kleen was seeing, but please do give it a whirl in case the machine is smarter than I am. A writeup from the walkthrough may be found at the following URL, in case you are suffering from terminal insomnia or masochism: http://www.kernel.org/pub/linux/kernel/people/paulmck/tmp/rcutree-walkthrough.2008.12.16a.pdf o Made rcutree tracing use seq_file, as suggested some time ago by Lai Jiangshan. o Added a .csv variant of the rcudata debugfs trace file, to allow people having thousands of CPUs to drop the data into a spreadsheet. Tested with oocalc and gnumeric. Updated documentation to suit. Updates from v8 (http://lkml.org/lkml/2008/11/15/139): o Fix a theoretical race between grace-period initialization and force_quiescent_state() that could occur if more than three jiffies were required to carry out the grace-period initialization. Which it might, if you had enough CPUs. o Apply Ingo's printk-standardization patch. o Substitute local variables for repeated accesses to global variables. o Fix comment misspellings and redundant (but harmless) increments of ->n_rcu_pending (this latter after having explicitly added it). o Apply checkpatch fixes. Updates from v7 (http://lkml.org/lkml/2008/10/10/291): o Fixed a number of problems noted by Gautham Shenoy, including the cpu-stall-detection bug that he was having difficulty convincing me was real. ;-) o Changed cpu-stall detection to wait for ten seconds rather than three in order to reduce false positive, as suggested by Ingo Molnar. o Produced a design document (http://lwn.net/Articles/305782/). The act of writing this document uncovered a number of both theoretical and "here and now" bugs as noted below. o Fix dynticks_nesting accounting confusion, simplify WARN_ON() condition, fix kerneldoc comments, and add memory barriers in dynticks interface functions. o Add more data to tracing. o Remove unused "rcu_barrier" field from rcu_data structure. o Count calls to rcu_pending() from scheduling-clock interrupt to use as a surrogate timebase should jiffies stop counting. o Fix a theoretical race between force_quiescent_state() and grace-period initialization. Yes, initialization does have to go on for some jiffies for this race to occur, but given enough CPUs... Updates from v6 (http://lkml.org/lkml/2008/9/23/448): o Fix a number of checkpatch.pl complaints. o Apply review comments from Ingo Molnar and Lai Jiangshan on the stall-detection code. o Fix several bugs in !CONFIG_SMP builds. o Fix a misspelled config-parameter name so that RCU now announces at boot time if stall detection is configured. o Run tests on numerous combinations of configurations parameters, which after the fixes above, now build and run correctly. Updates from v5 (http://lkml.org/lkml/2008/9/15/92, bad subject line): o Fix a compiler error in the !CONFIG_FANOUT_EXACT case (blew a changeset some time ago, and finally got around to retesting this option). o Fix some tracing bugs in rcupreempt that caused incorrect totals to be printed. o I now test with a more brutal random-selection online/offline script (attached). Probably more brutal than it needs to be on the people reading it as well, but so it goes. o A number of optimizations and usability improvements: o Make rcu_pending() ignore the grace-period timeout when there is no grace period in progress. o Make force_quiescent_state() avoid going for a global lock in the case where there is no grace period in progress. o Rearrange struct fields to improve struct layout. o Make call_rcu() initiate a grace period if RCU was idle, rather than waiting for the next scheduling clock interrupt. o Invoke rcu_irq_enter() and rcu_irq_exit() only when idle, as suggested by Andi Kleen. I still don't completely trust this change, and might back it out. o Make CONFIG_RCU_TRACE be the single config variable manipulated for all forms of RCU, instead of the prior confusion. o Document tracing files and formats for both rcupreempt and rcutree. Updates from v4 for those missing v5 given its bad subject line: o Separated dynticks interface so that NMIs and irqs call separate functions, greatly simplifying it. In particular, this code no longer requires a proof of correctness. ;-) o Separated dynticks state out into its own per-CPU structure, avoiding the duplicated accounting. o The case where a dynticks-idle CPU runs an irq handler that invokes call_rcu() is now correctly handled, forcing that CPU out of dynticks-idle mode. o Review comments have been applied (thank you all!!!). For but one example, fixed the dynticks-ordering issue that Manfred pointed out, saving me much debugging. ;-) o Adjusted rcuclassic and rcupreempt to handle dynticks changes. Attached is an updated patch to Classic RCU that applies a hierarchy, greatly reducing the contention on the top-level lock for large machines. This passes 10-hour concurrent rcutorture and online-offline testing on 128-CPU ppc64 without dynticks enabled, and exposes some timekeeping bugs in presence of dynticks (exciting working on a system where "sleep 1" hangs until interrupted...), which were fixed in the 2.6.27 kernel. It is getting more reliable than mainline by some measures, so the next version will be against -tip for inclusion. See also Manfred Spraul's recent patches (or his earlier work from 2004 at http://marc.info/?l=linux-kernel&m=108546384711797&w=2). We will converge onto a common patch in the fullness of time, but are currently exploring different regions of the design space. That said, I have already gratefully stolen quite a few of Manfred's ideas. This patch provides CONFIG_RCU_FANOUT, which controls the bushiness of the RCU hierarchy. Defaults to 32 on 32-bit machines and 64 on 64-bit machines. If CONFIG_NR_CPUS is less than CONFIG_RCU_FANOUT, there is no hierarchy. By default, the RCU initialization code will adjust CONFIG_RCU_FANOUT to balance the hierarchy, so strongly NUMA architectures may choose to set CONFIG_RCU_FANOUT_EXACT to disable this balancing, allowing the hierarchy to be exactly aligned to the underlying hardware. Up to two levels of hierarchy are permitted (in addition to the root node), allowing up to 16,384 CPUs on 32-bit systems and up to 262,144 CPUs on 64-bit systems. I just know that I am going to regret saying this, but this seems more than sufficient for the foreseeable future. (Some architectures might wish to set CONFIG_RCU_FANOUT=4, which would limit such architectures to 64 CPUs. If this becomes a real problem, additional levels can be added, but I doubt that it will make a significant difference on real hardware.) In the common case, a given CPU will manipulate its private rcu_data structure and the rcu_node structure that it shares with its immediate neighbors. This can reduce both lock and memory contention by multiple orders of magnitude, which should eliminate the need for the strange manipulations that are reported to be required when running Linux on very large systems. Some shortcomings: o More bugs will probably surface as a result of an ongoing line-by-line code inspection. Patches will be provided as required. o There are probably hangs, rcutorture failures, &c. Seems quite stable on a 128-CPU machine, but that is kind of small compared to 4096 CPUs. However, seems to do better than mainline. Patches will be provided as required. o The memory footprint of this version is several KB larger than rcuclassic. A separate UP-only rcutiny patch will be provided, which will reduce the memory footprint significantly, even compared to the old rcuclassic. One such patch passes light testing, and has a memory footprint smaller even than rcuclassic. Initial reaction from various embedded guys was "it is not worth it", so am putting it aside. Credits: o Manfred Spraul for ideas, review comments, and bugs spotted, as well as some good friendly competition. ;-) o Josh Triplett, Ingo Molnar, Peter Zijlstra, Mathieu Desnoyers, Lai Jiangshan, Andi Kleen, Andy Whitcroft, and Andrew Morton for reviews and comments. o Thomas Gleixner for much-needed help with some timer issues (see patches below). o Jon M. Tollefson, Tim Pepper, Andrew Theurer, Jose R. Santos, Andy Whitcroft, Darrick Wong, Nishanth Aravamudan, Anton Blanchard, Dave Kleikamp, and Nathan Lynch for keeping machines alive despite my heavy abuse^Wtesting. Signed-off-by: Paul E. McKenney Signed-off-by: Ingo Molnar --- init/Kconfig | 18 ++++++++++++------ 1 file changed, 12 insertions(+), 6 deletions(-) (limited to 'init') diff --git a/init/Kconfig b/init/Kconfig index f763762d544..9dd7958a71f 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -928,10 +928,16 @@ source "block/Kconfig" config PREEMPT_NOTIFIERS bool -config CLASSIC_RCU - def_bool !PREEMPT_RCU +config TREE_RCU_TRACE + def_bool RCU_TRACE && TREE_RCU + select DEBUG_FS help - This option selects the classic RCU implementation that is - designed for best read-side performance on non-realtime - systems. Classic RCU is the default. Note that the - PREEMPT_RCU symbol is used to select/deselect this option. + This option provides tracing for the TREE_RCU implementation, + permitting Makefile to trivially select kernel/rcutree_trace.c. + +config PREEMPT_RCU_TRACE + def_bool RCU_TRACE && PREEMPT_RCU + select DEBUG_FS + help + This option provides tracing for the PREEMPT_RCU implementation, + permitting Makefile to trivially select kernel/rcupreempt_trace.c. -- cgit v1.2.3 From 9bb482476c6c9d1ae033306440c51ceac93ea80c Mon Sep 17 00:00:00 2001 From: Jan Beulich Date: Tue, 16 Dec 2008 11:30:08 +0000 Subject: allow stripping of generated symbols under CONFIG_KALLSYMS_ALL Building upon parts of the module stripping patch, this patch introduces similar stripping for vmlinux when CONFIG_KALLSYMS_ALL=y. Using CONFIG_KALLSYMS_STRIP_GENERATED reduces the overhead of CONFIG_KALLSYMS_ALL from 245k/310k to 65k/80k for the (i386/x86-64) kernels I tested with. The patch also does away with the need to special case the kallsyms- internal symbols by making them available even in the first linking stage. While it is a generated file, the patch includes the changes to scripts/genksyms/keywords.c_shipped, as I'm unsure what the procedure here is. Signed-off-by: Jan Beulich Signed-off-by: Sam Ravnborg --- init/Kconfig | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'init') diff --git a/init/Kconfig b/init/Kconfig index f763762d544..0f5af409fef 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -588,6 +588,13 @@ config KALLSYMS_ALL Say N. +config KALLSYMS_STRIP_GENERATED + bool "Strip machine generated symbols from kallsyms" + depends on KALLSYMS_ALL + default y + help + Say N if you want kallsyms to retain even machine generated symbols. + config KALLSYMS_EXTRA_PASS bool "Do an extra kallsyms pass" depends on KALLSYMS -- cgit v1.2.3 From 12d79bafb75639f406a9f71aab94808c414c836e Mon Sep 17 00:00:00 2001 From: Ingo Molnar Date: Thu, 25 Dec 2008 09:31:28 +0100 Subject: rcu: provide RCU options on non-preempt architectures too Impact: build fix Some old architectures still do not use kernel/Kconfig.preempt, so the moving of the RCU options there broke their build: In file included from /home/mingo/tip/include/linux/sem.h:81, from /home/mingo/tip/include/linux/sched.h:69, from /home/mingo/tip/arch/alpha/kernel/asm-offsets.c:9: /home/mingo/tip/include/linux/rcupdate.h:62:2: error: #error "Unknown RCU implementation specified to kernel configuration" Move these options back to init/Kconfig, which every architecture includes. Signed-off-by: Ingo Molnar --- init/Kconfig | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 74 insertions(+) (limited to 'init') diff --git a/init/Kconfig b/init/Kconfig index 9dd7958a71f..6b0fdedf359 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -928,6 +928,80 @@ source "block/Kconfig" config PREEMPT_NOTIFIERS bool +choice + prompt "RCU Implementation" + default CLASSIC_RCU + +config CLASSIC_RCU + bool "Classic RCU" + help + This option selects the classic RCU implementation that is + designed for best read-side performance on non-realtime + systems. + + Select this option if you are unsure. + +config TREE_RCU + bool "Tree-based hierarchical RCU" + help + This option selects the RCU implementation that is + designed for very large SMP system with hundreds or + thousands of CPUs. + +config PREEMPT_RCU + bool "Preemptible RCU" + depends on PREEMPT + help + This option reduces the latency of the kernel by making certain + RCU sections preemptible. Normally RCU code is non-preemptible, if + this option is selected then read-only RCU sections become + preemptible. This helps latency, but may expose bugs due to + now-naive assumptions about each RCU read-side critical section + remaining on a given CPU through its execution. + +endchoice + +config RCU_TRACE + bool "Enable tracing for RCU" + depends on TREE_RCU || PREEMPT_RCU + help + This option provides tracing in RCU which presents stats + in debugfs for debugging RCU implementation. + + Say Y here if you want to enable RCU tracing + Say N if you are unsure. + +config RCU_FANOUT + int "Tree-based hierarchical RCU fanout value" + range 2 64 if 64BIT + range 2 32 if !64BIT + depends on TREE_RCU + default 64 if 64BIT + default 32 if !64BIT + help + This option controls the fanout of hierarchical implementations + of RCU, allowing RCU to work efficiently on machines with + large numbers of CPUs. This value must be at least the cube + root of NR_CPUS, which allows NR_CPUS up to 32,768 for 32-bit + systems and up to 262,144 for 64-bit systems. + + Select a specific number if testing RCU itself. + Take the default if unsure. + +config RCU_FANOUT_EXACT + bool "Disable tree-based hierarchical RCU auto-balancing" + depends on TREE_RCU + default n + help + This option forces use of the exact RCU_FANOUT value specified, + regardless of imbalances in the hierarchy. This is useful for + testing RCU itself, and might one day be useful on systems with + strong NUMA behavior. + + Without RCU_FANOUT_EXACT, the code will balance the hierarchy. + + Say N if unsure. + config TREE_RCU_TRACE def_bool RCU_TRACE && TREE_RCU select DEBUG_FS -- cgit v1.2.3