diff options
Diffstat (limited to 'Documentation')
101 files changed, 3205 insertions, 1037 deletions
diff --git a/Documentation/ABI/stable/sysfs-class-backlight b/Documentation/ABI/stable/sysfs-class-backlight new file mode 100644 index 00000000000..4d637e1c4ff --- /dev/null +++ b/Documentation/ABI/stable/sysfs-class-backlight @@ -0,0 +1,36 @@ +What: /sys/class/backlight/<backlight>/bl_power +Date: April 2005 +KernelVersion: 2.6.12 +Contact: Richard Purdie <rpurdie@rpsys.net> +Description: + Control BACKLIGHT power, values are FB_BLANK_* from fb.h + - FB_BLANK_UNBLANK (0) : power on. + - FB_BLANK_POWERDOWN (4) : power off +Users: HAL + +What: /sys/class/backlight/<backlight>/brightness +Date: April 2005 +KernelVersion: 2.6.12 +Contact: Richard Purdie <rpurdie@rpsys.net> +Description: + Control the brightness for this <backlight>. Values + are between 0 and max_brightness. This file will also + show the brightness level stored in the driver, which + may not be the actual brightness (see actual_brightness). +Users: HAL + +What: /sys/class/backlight/<backlight>/actual_brightness +Date: March 2006 +KernelVersion: 2.6.17 +Contact: Richard Purdie <rpurdie@rpsys.net> +Description: + Show the actual brightness by querying the hardware. +Users: HAL + +What: /sys/class/backlight/<backlight>/max_brightness +Date: April 2005 +KernelVersion: 2.6.12 +Contact: Richard Purdie <rpurdie@rpsys.net> +Description: + Maximum brightness for <backlight>. +Users: HAL diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-cciss b/Documentation/ABI/testing/sysfs-bus-pci-devices-cciss index 0a92a7c93a6..4f29e5f1ebf 100644 --- a/Documentation/ABI/testing/sysfs-bus-pci-devices-cciss +++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-cciss @@ -31,3 +31,31 @@ Date: March 2009 Kernel Version: 2.6.30 Contact: iss_storagedev@hp.com Description: A symbolic link to /sys/block/cciss!cXdY + +Where: /sys/bus/pci/devices/<dev>/ccissX/rescan +Date: August 2009 +Kernel Version: 2.6.31 +Contact: iss_storagedev@hp.com +Description: Kicks of a rescan of the controller to discover logical + drive topology changes. + +Where: /sys/bus/pci/devices/<dev>/ccissX/cXdY/lunid +Date: August 2009 +Kernel Version: 2.6.31 +Contact: iss_storagedev@hp.com +Description: Displays the 8-byte LUN ID used to address logical + drive Y of controller X. + +Where: /sys/bus/pci/devices/<dev>/ccissX/cXdY/raid_level +Date: August 2009 +Kernel Version: 2.6.31 +Contact: iss_storagedev@hp.com +Description: Displays the RAID level of logical drive Y of + controller X. + +Where: /sys/bus/pci/devices/<dev>/ccissX/cXdY/usage_count +Date: August 2009 +Kernel Version: 2.6.31 +Contact: iss_storagedev@hp.com +Description: Displays the usage count (number of opens) of logical drive Y + of controller X. diff --git a/Documentation/ABI/testing/sysfs-class-lcd b/Documentation/ABI/testing/sysfs-class-lcd new file mode 100644 index 00000000000..35906bf7aa7 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-class-lcd @@ -0,0 +1,23 @@ +What: /sys/class/lcd/<lcd>/lcd_power +Date: April 2005 +KernelVersion: 2.6.12 +Contact: Richard Purdie <rpurdie@rpsys.net> +Description: + Control LCD power, values are FB_BLANK_* from fb.h + - FB_BLANK_UNBLANK (0) : power on. + - FB_BLANK_POWERDOWN (4) : power off + +What: /sys/class/lcd/<lcd>/contrast +Date: April 2005 +KernelVersion: 2.6.12 +Contact: Richard Purdie <rpurdie@rpsys.net> +Description: + Current contrast of this LCD device. Value is between 0 and + /sys/class/lcd/<lcd>/max_contrast. + +What: /sys/class/lcd/<lcd>/max_contrast +Date: April 2005 +KernelVersion: 2.6.12 +Contact: Richard Purdie <rpurdie@rpsys.net> +Description: + Maximum contrast for this LCD device. diff --git a/Documentation/ABI/testing/sysfs-class-led b/Documentation/ABI/testing/sysfs-class-led new file mode 100644 index 00000000000..9e4541d71cb --- /dev/null +++ b/Documentation/ABI/testing/sysfs-class-led @@ -0,0 +1,28 @@ +What: /sys/class/leds/<led>/brightness +Date: March 2006 +KernelVersion: 2.6.17 +Contact: Richard Purdie <rpurdie@rpsys.net> +Description: + Set the brightness of the LED. Most LEDs don't + have hardware brightness support so will just be turned on for + non-zero brightness settings. The value is between 0 and + /sys/class/leds/<led>/max_brightness. + +What: /sys/class/leds/<led>/max_brightness +Date: March 2006 +KernelVersion: 2.6.17 +Contact: Richard Purdie <rpurdie@rpsys.net> +Description: + Maximum brightness level for this led, default is 255 (LED_FULL). + +What: /sys/class/leds/<led>/trigger +Date: March 2006 +KernelVersion: 2.6.17 +Contact: Richard Purdie <rpurdie@rpsys.net> +Description: + Set the trigger for this LED. A trigger is a kernel based source + of led events. + You can change triggers in a similar manner to the way an IO + scheduler is chosen. Trigger specific parameters can appear in + /sys/class/leds/<led> once a given trigger is selected. + diff --git a/Documentation/ABI/testing/sysfs-gpio b/Documentation/ABI/testing/sysfs-gpio index 8aab8092ad3..80f4c94c7be 100644 --- a/Documentation/ABI/testing/sysfs-gpio +++ b/Documentation/ABI/testing/sysfs-gpio @@ -19,6 +19,7 @@ Description: /gpioN ... for each exported GPIO #N /value ... always readable, writes fail for input GPIOs /direction ... r/w as: in, out (default low); write: high, low + /edge ... r/w as: none, falling, rising, both /gpiochipN ... for each gpiochip; #N is its first GPIO /base ... (r/o) same as N /label ... (r/o) descriptive, not necessarily unique diff --git a/Documentation/ABI/testing/sysfs-platform-asus-laptop b/Documentation/ABI/testing/sysfs-platform-asus-laptop new file mode 100644 index 00000000000..a1cb660c50c --- /dev/null +++ b/Documentation/ABI/testing/sysfs-platform-asus-laptop @@ -0,0 +1,52 @@ +What: /sys/devices/platform/asus-laptop/display +Date: January 2007 +KernelVersion: 2.6.20 +Contact: "Corentin Chary" <corentincj@iksaif.net> +Description: + This file allows display switching. The value + is composed by 4 bits and defined as follow: + 4321 + |||`- LCD + ||`-- CRT + |`--- TV + `---- DVI + Ex: - 0 (0000b) means no display + - 3 (0011b) CRT+LCD. + +What: /sys/devices/platform/asus-laptop/gps +Date: January 2007 +KernelVersion: 2.6.20 +Contact: "Corentin Chary" <corentincj@iksaif.net> +Description: + Control the gps device. 1 means on, 0 means off. +Users: Lapsus + +What: /sys/devices/platform/asus-laptop/ledd +Date: January 2007 +KernelVersion: 2.6.20 +Contact: "Corentin Chary" <corentincj@iksaif.net> +Description: + Some models like the W1N have a LED display that can be + used to display several informations. + To control the LED display, use the following : + echo 0x0T000DDD > /sys/devices/platform/asus-laptop/ + where T control the 3 letters display, and DDD the 3 digits display. + The DDD table can be found in Documentation/laptops/asus-laptop.txt + +What: /sys/devices/platform/asus-laptop/bluetooth +Date: January 2007 +KernelVersion: 2.6.20 +Contact: "Corentin Chary" <corentincj@iksaif.net> +Description: + Control the bluetooth device. 1 means on, 0 means off. + This may control the led, the device or both. +Users: Lapsus + +What: /sys/devices/platform/asus-laptop/wlan +Date: January 2007 +KernelVersion: 2.6.20 +Contact: "Corentin Chary" <corentincj@iksaif.net> +Description: + Control the bluetooth device. 1 means on, 0 means off. + This may control the led, the device or both. +Users: Lapsus diff --git a/Documentation/ABI/testing/sysfs-platform-eeepc-laptop b/Documentation/ABI/testing/sysfs-platform-eeepc-laptop new file mode 100644 index 00000000000..7445dfb321b --- /dev/null +++ b/Documentation/ABI/testing/sysfs-platform-eeepc-laptop @@ -0,0 +1,50 @@ +What: /sys/devices/platform/eeepc-laptop/disp +Date: May 2008 +KernelVersion: 2.6.26 +Contact: "Corentin Chary" <corentincj@iksaif.net> +Description: + This file allows display switching. + - 1 = LCD + - 2 = CRT + - 3 = LCD+CRT + If you run X11, you should use xrandr instead. + +What: /sys/devices/platform/eeepc-laptop/camera +Date: May 2008 +KernelVersion: 2.6.26 +Contact: "Corentin Chary" <corentincj@iksaif.net> +Description: + Control the camera. 1 means on, 0 means off. + +What: /sys/devices/platform/eeepc-laptop/cardr +Date: May 2008 +KernelVersion: 2.6.26 +Contact: "Corentin Chary" <corentincj@iksaif.net> +Description: + Control the card reader. 1 means on, 0 means off. + +What: /sys/devices/platform/eeepc-laptop/cpufv +Date: Jun 2009 +KernelVersion: 2.6.31 +Contact: "Corentin Chary" <corentincj@iksaif.net> +Description: + Change CPU clock configuration. + On the Eee PC 1000H there are three available clock configuration: + * 0 -> Super Performance Mode + * 1 -> High Performance Mode + * 2 -> Power Saving Mode + On Eee PC 701 there is only 2 available clock configurations. + Available configuration are listed in available_cpufv file. + Reading this file will show the raw hexadecimal value which + is defined as follow: + | 8 bit | 8 bit | + | `---- Current mode + `------------ Availables modes + For example, 0x301 means: mode 1 selected, 3 available modes. + +What: /sys/devices/platform/eeepc-laptop/available_cpufv +Date: Jun 2009 +KernelVersion: 2.6.31 +Contact: "Corentin Chary" <corentincj@iksaif.net> +Description: + List available cpufv modes. diff --git a/Documentation/DocBook/mtdnand.tmpl b/Documentation/DocBook/mtdnand.tmpl index 8e145857fc9..df0d089d0fb 100644 --- a/Documentation/DocBook/mtdnand.tmpl +++ b/Documentation/DocBook/mtdnand.tmpl @@ -568,7 +568,7 @@ static void board_select_chip (struct mtd_info *mtd, int chip) <para> The blocks in which the tables are stored are procteted against accidental access by marking them bad in the memory bad block - table. The bad block table managment functions are allowed + table. The bad block table management functions are allowed to circumvernt this protection. </para> <para> diff --git a/Documentation/DocBook/scsi.tmpl b/Documentation/DocBook/scsi.tmpl index 10a150ae2a7..d87f4569e76 100644 --- a/Documentation/DocBook/scsi.tmpl +++ b/Documentation/DocBook/scsi.tmpl @@ -317,7 +317,7 @@ <para> The SAS transport class contains common code to deal with SAS HBAs, an aproximated representation of SAS topologies in the driver model, - and various sysfs attributes to expose these topologies and managment + and various sysfs attributes to expose these topologies and management interfaces to userspace. </para> <para> diff --git a/Documentation/Intel-IOMMU.txt b/Documentation/Intel-IOMMU.txt index 21bc416d887..cf9431db873 100644 --- a/Documentation/Intel-IOMMU.txt +++ b/Documentation/Intel-IOMMU.txt @@ -56,11 +56,7 @@ Graphics Problems? ------------------ If you encounter issues with graphics devices, you can try adding option intel_iommu=igfx_off to turn off the integrated graphics engine. - -If it happens to be a PCI device included in the INCLUDE_ALL Engine, -then try enabling CONFIG_DMAR_GFX_WA to setup a 1-1 map. We hear -graphics drivers may be in process of using DMA api's in the near -future and at that time this option can be yanked out. +If this fixes anything, please ensure you file a bug reporting the problem. Some exceptions to IOVA ----------------------- diff --git a/Documentation/SubmittingPatches b/Documentation/SubmittingPatches index 5c555a8b39e..72651f788f4 100644 --- a/Documentation/SubmittingPatches +++ b/Documentation/SubmittingPatches @@ -183,7 +183,7 @@ the MAN-PAGES maintainer (as listed in the MAINTAINERS file) a man-pages patch, or at least a notification of the change, so that some information makes its way into the manual pages. -Even if the maintainer did not respond in step #4, make sure to ALWAYS +Even if the maintainer did not respond in step #5, make sure to ALWAYS copy the maintainer when you change their code. For small patches you may want to CC the Trivial Patch Monkey @@ -232,7 +232,7 @@ your e-mail client so that it sends your patches untouched. When sending patches to Linus, always follow step #7. Large changes are not appropriate for mailing lists, and some -maintainers. If your patch, uncompressed, exceeds 40 kB in size, +maintainers. If your patch, uncompressed, exceeds 300 kB in size, it is preferred that you store your patch on an Internet-accessible server, and provide instead a URL (link) pointing to your patch. diff --git a/Documentation/accounting/getdelays.c b/Documentation/accounting/getdelays.c index aa73e72fd79..6e25c2659e0 100644 --- a/Documentation/accounting/getdelays.c +++ b/Documentation/accounting/getdelays.c @@ -116,7 +116,7 @@ error: } -int send_cmd(int sd, __u16 nlmsg_type, __u32 nlmsg_pid, +static int send_cmd(int sd, __u16 nlmsg_type, __u32 nlmsg_pid, __u8 genl_cmd, __u16 nla_type, void *nla_data, int nla_len) { @@ -160,7 +160,7 @@ int send_cmd(int sd, __u16 nlmsg_type, __u32 nlmsg_pid, * Probe the controller in genetlink to find the family id * for the TASKSTATS family */ -int get_family_id(int sd) +static int get_family_id(int sd) { struct { struct nlmsghdr n; @@ -190,7 +190,7 @@ int get_family_id(int sd) return id; } -void print_delayacct(struct taskstats *t) +static void print_delayacct(struct taskstats *t) { printf("\n\nCPU %15s%15s%15s%15s\n" " %15llu%15llu%15llu%15llu\n" @@ -216,7 +216,7 @@ void print_delayacct(struct taskstats *t) (unsigned long long)t->freepages_delay_total); } -void task_context_switch_counts(struct taskstats *t) +static void task_context_switch_counts(struct taskstats *t) { printf("\n\nTask %15s%15s\n" " %15llu%15llu\n", @@ -224,7 +224,7 @@ void task_context_switch_counts(struct taskstats *t) (unsigned long long)t->nvcsw, (unsigned long long)t->nivcsw); } -void print_cgroupstats(struct cgroupstats *c) +static void print_cgroupstats(struct cgroupstats *c) { printf("sleeping %llu, blocked %llu, running %llu, stopped %llu, " "uninterruptible %llu\n", (unsigned long long)c->nr_sleeping, @@ -235,7 +235,7 @@ void print_cgroupstats(struct cgroupstats *c) } -void print_ioacct(struct taskstats *t) +static void print_ioacct(struct taskstats *t) { printf("%s: read=%llu, write=%llu, cancelled_write=%llu\n", t->ac_comm, diff --git a/Documentation/arm/tcm.txt b/Documentation/arm/tcm.txt new file mode 100644 index 00000000000..77fd9376e6d --- /dev/null +++ b/Documentation/arm/tcm.txt @@ -0,0 +1,147 @@ +ARM TCM (Tightly-Coupled Memory) handling in Linux +---- +Written by Linus Walleij <linus.walleij@stericsson.com> + +Some ARM SoC:s have a so-called TCM (Tightly-Coupled Memory). +This is usually just a few (4-64) KiB of RAM inside the ARM +processor. + +Due to being embedded inside the CPU The TCM has a +Harvard-architecture, so there is an ITCM (instruction TCM) +and a DTCM (data TCM). The DTCM can not contain any +instructions, but the ITCM can actually contain data. +The size of DTCM or ITCM is minimum 4KiB so the typical +minimum configuration is 4KiB ITCM and 4KiB DTCM. + +ARM CPU:s have special registers to read out status, physical +location and size of TCM memories. arch/arm/include/asm/cputype.h +defines a CPUID_TCM register that you can read out from the +system control coprocessor. Documentation from ARM can be found +at http://infocenter.arm.com, search for "TCM Status Register" +to see documents for all CPUs. Reading this register you can +determine if ITCM (bit 0) and/or DTCM (bit 16) is present in the +machine. + +There is further a TCM region register (search for "TCM Region +Registers" at the ARM site) that can report and modify the location +size of TCM memories at runtime. This is used to read out and modify +TCM location and size. Notice that this is not a MMU table: you +actually move the physical location of the TCM around. At the +place you put it, it will mask any underlying RAM from the +CPU so it is usually wise not to overlap any physical RAM with +the TCM. + +The TCM memory can then be remapped to another address again using +the MMU, but notice that the TCM if often used in situations where +the MMU is turned off. To avoid confusion the current Linux +implementation will map the TCM 1 to 1 from physical to virtual +memory in the location specified by the machine. + +TCM is used for a few things: + +- FIQ and other interrupt handlers that need deterministic + timing and cannot wait for cache misses. + +- Idle loops where all external RAM is set to self-refresh + retention mode, so only on-chip RAM is accessible by + the CPU and then we hang inside ITCM waiting for an + interrupt. + +- Other operations which implies shutting off or reconfiguring + the external RAM controller. + +There is an interface for using TCM on the ARM architecture +in <asm/tcm.h>. Using this interface it is possible to: + +- Define the physical address and size of ITCM and DTCM. + +- Tag functions to be compiled into ITCM. + +- Tag data and constants to be allocated to DTCM and ITCM. + +- Have the remaining TCM RAM added to a special + allocation pool with gen_pool_create() and gen_pool_add() + and provice tcm_alloc() and tcm_free() for this + memory. Such a heap is great for things like saving + device state when shutting off device power domains. + +A machine that has TCM memory shall select HAVE_TCM in +arch/arm/Kconfig for itself, and then the +rest of the functionality will depend on the physical +location and size of ITCM and DTCM to be defined in +mach/memory.h for the machine. Code that needs to use +TCM shall #include <asm/tcm.h> If the TCM is not located +at the place given in memory.h it will be moved using +the TCM Region registers. + +Functions to go into itcm can be tagged like this: +int __tcmfunc foo(int bar); + +Variables to go into dtcm can be tagged like this: +int __tcmdata foo; + +Constants can be tagged like this: +int __tcmconst foo; + +To put assembler into TCM just use +.section ".tcm.text" or .section ".tcm.data" +respectively. + +Example code: + +#include <asm/tcm.h> + +/* Uninitialized data */ +static u32 __tcmdata tcmvar; +/* Initialized data */ +static u32 __tcmdata tcmassigned = 0x2BADBABEU; +/* Constant */ +static const u32 __tcmconst tcmconst = 0xCAFEBABEU; + +static void __tcmlocalfunc tcm_to_tcm(void) +{ + int i; + for (i = 0; i < 100; i++) + tcmvar ++; +} + +static void __tcmfunc hello_tcm(void) +{ + /* Some abstract code that runs in ITCM */ + int i; + for (i = 0; i < 100; i++) { + tcmvar ++; + } + tcm_to_tcm(); +} + +static void __init test_tcm(void) +{ + u32 *tcmem; + int i; + + hello_tcm(); + printk("Hello TCM executed from ITCM RAM\n"); + + printk("TCM variable from testrun: %u @ %p\n", tcmvar, &tcmvar); + tcmvar = 0xDEADBEEFU; + printk("TCM variable: 0x%x @ %p\n", tcmvar, &tcmvar); + + printk("TCM assigned variable: 0x%x @ %p\n", tcmassigned, &tcmassigned); + + printk("TCM constant: 0x%x @ %p\n", tcmconst, &tcmconst); + + /* Allocate some TCM memory from the pool */ + tcmem = tcm_alloc(20); + if (tcmem) { + printk("TCM Allocated 20 bytes of TCM @ %p\n", tcmem); + tcmem[0] = 0xDEADBEEFU; + tcmem[1] = 0x2BADBABEU; + tcmem[2] = 0xCAFEBABEU; + tcmem[3] = 0xDEADBEEFU; + tcmem[4] = 0x2BADBABEU; + for (i = 0; i < 5; i++) + printk("TCM tcmem[%d] = %08x\n", i, tcmem[i]); + tcm_free(tcmem, 20); + } +} diff --git a/Documentation/auxdisplay/cfag12864b-example.c b/Documentation/auxdisplay/cfag12864b-example.c index 2caeea5e499..e7823ffb1ca 100644 --- a/Documentation/auxdisplay/cfag12864b-example.c +++ b/Documentation/auxdisplay/cfag12864b-example.c @@ -62,7 +62,7 @@ unsigned char cfag12864b_buffer[CFAG12864B_SIZE]; * Unable to open: return = -1 * Unable to mmap: return = -2 */ -int cfag12864b_init(char *path) +static int cfag12864b_init(char *path) { cfag12864b_fd = open(path, O_RDWR); if (cfag12864b_fd == -1) @@ -81,7 +81,7 @@ int cfag12864b_init(char *path) /* * exit a cfag12864b framebuffer device */ -void cfag12864b_exit(void) +static void cfag12864b_exit(void) { munmap(cfag12864b_mem, CFAG12864B_SIZE); close(cfag12864b_fd); @@ -90,7 +90,7 @@ void cfag12864b_exit(void) /* * set (x, y) pixel */ -void cfag12864b_set(unsigned char x, unsigned char y) +static void cfag12864b_set(unsigned char x, unsigned char y) { if (CFAG12864B_CHECK(x, y)) cfag12864b_buffer[CFAG12864B_ADDRESS(x, y)] |= @@ -100,7 +100,7 @@ void cfag12864b_set(unsigned char x, unsigned char y) /* * unset (x, y) pixel */ -void cfag12864b_unset(unsigned char x, unsigned char y) +static void cfag12864b_unset(unsigned char x, unsigned char y) { if (CFAG12864B_CHECK(x, y)) cfag12864b_buffer[CFAG12864B_ADDRESS(x, y)] &= @@ -113,7 +113,7 @@ void cfag12864b_unset(unsigned char x, unsigned char y) * Pixel off: return = 0 * Pixel on: return = 1 */ -unsigned char cfag12864b_isset(unsigned char x, unsigned char y) +static unsigned char cfag12864b_isset(unsigned char x, unsigned char y) { if (CFAG12864B_CHECK(x, y)) if (cfag12864b_buffer[CFAG12864B_ADDRESS(x, y)] & @@ -126,7 +126,7 @@ unsigned char cfag12864b_isset(unsigned char x, unsigned char y) /* * not (x, y) pixel */ -void cfag12864b_not(unsigned char x, unsigned char y) +static void cfag12864b_not(unsigned char x, unsigned char y) { if (cfag12864b_isset(x, y)) cfag12864b_unset(x, y); @@ -137,7 +137,7 @@ void cfag12864b_not(unsigned char x, unsigned char y) /* * fill (set all pixels) */ -void cfag12864b_fill(void) +static void cfag12864b_fill(void) { unsigned short i; @@ -148,7 +148,7 @@ void cfag12864b_fill(void) /* * clear (unset all pixels) */ -void cfag12864b_clear(void) +static void cfag12864b_clear(void) { unsigned short i; @@ -162,7 +162,7 @@ void cfag12864b_clear(void) * Pixel off: src[i] = 0 * Pixel on: src[i] > 0 */ -void cfag12864b_format(unsigned char * matrix) +static void cfag12864b_format(unsigned char * matrix) { unsigned char i, j, n; @@ -182,7 +182,7 @@ void cfag12864b_format(unsigned char * matrix) /* * blit buffer to lcd */ -void cfag12864b_blit(void) +static void cfag12864b_blit(void) { memcpy(cfag12864b_mem, cfag12864b_buffer, CFAG12864B_SIZE); } @@ -194,11 +194,10 @@ void cfag12864b_blit(void) */ #include <stdio.h> -#include <string.h> #define EXAMPLES 6 -void example(unsigned char n) +static void example(unsigned char n) { unsigned short i, j; unsigned char matrix[CFAG12864B_WIDTH * CFAG12864B_HEIGHT]; diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt index 6eb1a97e88c..0b33bfe7dde 100644 --- a/Documentation/cgroups/cgroups.txt +++ b/Documentation/cgroups/cgroups.txt @@ -227,7 +227,14 @@ as the path relative to the root of the cgroup file system. Each cgroup is represented by a directory in the cgroup file system containing the following files describing that cgroup: - - tasks: list of tasks (by pid) attached to that cgroup + - tasks: list of tasks (by pid) attached to that cgroup. This list + is not guaranteed to be sorted. Writing a thread id into this file + moves the thread into this cgroup. + - cgroup.procs: list of tgids in the cgroup. This list is not + guaranteed to be sorted or free of duplicate tgids, and userspace + should sort/uniquify the list if this property is required. + Writing a tgid into this file moves all threads with that tgid into + this cgroup. - notify_on_release flag: run the release agent on exit? - release_agent: the path to use for release notifications (this file exists in the top cgroup only) @@ -374,7 +381,7 @@ Now you want to do something with this cgroup. In this directory you can find several files: # ls -notify_on_release tasks +cgroup.procs notify_on_release tasks (plus whatever files added by the attached subsystems) Now attach your shell to this cgroup: @@ -408,6 +415,26 @@ You can attach the current shell task by echoing 0: # echo 0 > tasks +2.3 Mounting hierarchies by name +-------------------------------- + +Passing the name=<x> option when mounting a cgroups hierarchy +associates the given name with the hierarchy. This can be used when +mounting a pre-existing hierarchy, in order to refer to it by name +rather than by its set of active subsystems. Each hierarchy is either +nameless, or has a unique name. + +The name should match [\w.-]+ + +When passing a name=<x> option for a new hierarchy, you need to +specify subsystems manually; the legacy behaviour of mounting all +subsystems when none are explicitly specified is not supported when +you give a subsystem a name. + +The name of the subsystem appears as part of the hierarchy description +in /proc/mounts and /proc/<pid>/cgroups. + + 3. Kernel API ============= @@ -501,7 +528,7 @@ rmdir() will fail with it. From this behavior, pre_destroy() can be called multiple times against a cgroup. int can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp, - struct task_struct *task) + struct task_struct *task, bool threadgroup) (cgroup_mutex held by caller) Called prior to moving a task into a cgroup; if the subsystem @@ -509,14 +536,20 @@ returns an error, this will abort the attach operation. If a NULL task is passed, then a successful result indicates that *any* unspecified task can be moved into the cgroup. Note that this isn't called on a fork. If this method returns 0 (success) then this should -remain valid while the caller holds cgroup_mutex. +remain valid while the caller holds cgroup_mutex. If threadgroup is +true, then a successful result indicates that all threads in the given +thread's threadgroup can be moved together. void attach(struct cgroup_subsys *ss, struct cgroup *cgrp, - struct cgroup *old_cgrp, struct task_struct *task) + struct cgroup *old_cgrp, struct task_struct *task, + bool threadgroup) (cgroup_mutex held by caller) Called after the task has been attached to the cgroup, to allow any post-attachment activity that requires memory allocations or blocking. +If threadgroup is true, the subsystem should take care of all threads +in the specified thread's threadgroup. Currently does not support any +subsystem that might need the old_cgrp for every thread in the group. void fork(struct cgroup_subsy *ss, struct task_struct *task) diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index 23d1262c077..b871f2552b4 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -179,6 +179,9 @@ The reclaim algorithm has not been modified for cgroups, except that pages that are selected for reclaiming come from the per cgroup LRU list. +NOTE: Reclaim does not work for the root cgroup, since we cannot set any +limits on the root cgroup. + 2. Locking The memory controller uses the following hierarchy @@ -210,6 +213,7 @@ We can alter the memory limit: NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo, mega or gigabytes. NOTE: We can write "-1" to reset the *.limit_in_bytes(unlimited). +NOTE: We cannot set limits on the root cgroup any more. # cat /cgroups/0/memory.limit_in_bytes 4194304 @@ -375,7 +379,42 @@ cgroups created below it. NOTE2: This feature can be enabled/disabled per subtree. -7. TODO +7. Soft limits + +Soft limits allow for greater sharing of memory. The idea behind soft limits +is to allow control groups to use as much of the memory as needed, provided + +a. There is no memory contention +b. They do not exceed their hard limit + +When the system detects memory contention or low memory control groups +are pushed back to their soft limits. If the soft limit of each control +group is very high, they are pushed back as much as possible to make +sure that one control group does not starve the others of memory. + +Please note that soft limits is a best effort feature, it comes with +no guarantees, but it does its best to make sure that when memory is +heavily contended for, memory is allocated based on the soft limit +hints/setup. Currently soft limit based reclaim is setup such that +it gets invoked from balance_pgdat (kswapd). + +7.1 Interface + +Soft limits can be setup by using the following commands (in this example we +assume a soft limit of 256 megabytes) + +# echo 256M > memory.soft_limit_in_bytes + +If we want to change this to 1G, we can at any time use + +# echo 1G > memory.soft_limit_in_bytes + +NOTE1: Soft limits take effect over a long period of time, since they involve + reclaiming memory for balancing between memory cgroups +NOTE2: It is recommended to set the soft limit always below the hard limit, + otherwise the hard limit will take precedence. + +8. TODO 1. Add support for accounting huge pages (as a separate controller) 2. Make per-cgroup scanner reclaim not-shared pages first diff --git a/Documentation/connector/cn_test.c b/Documentation/connector/cn_test.c index 1711adc3337..b07add3467f 100644 --- a/Documentation/connector/cn_test.c +++ b/Documentation/connector/cn_test.c @@ -34,7 +34,7 @@ static char cn_test_name[] = "cn_test"; static struct sock *nls; static struct timer_list cn_test_timer; -static void cn_test_callback(struct cn_msg *msg) +static void cn_test_callback(struct cn_msg *msg, struct netlink_skb_parms *nsp) { pr_info("%s: %lu: idx=%x, val=%x, seq=%u, ack=%u, len=%d: %s.\n", __func__, jiffies, msg->id.idx, msg->id.val, diff --git a/Documentation/connector/connector.txt b/Documentation/connector/connector.txt index 81e6bf6ead5..78c9466a9aa 100644 --- a/Documentation/connector/connector.txt +++ b/Documentation/connector/connector.txt @@ -23,7 +23,7 @@ handling, etc... The Connector driver allows any kernelspace agents to use netlink based networking for inter-process communication in a significantly easier way: -int cn_add_callback(struct cb_id *id, char *name, void (*callback) (void *)); +int cn_add_callback(struct cb_id *id, char *name, void (*callback) (struct cn_msg *, struct netlink_skb_parms *)); void cn_netlink_send(struct cn_msg *msg, u32 __group, int gfp_mask); struct cb_id @@ -53,15 +53,15 @@ struct cn_msg Connector interfaces. /*****************************************/ -int cn_add_callback(struct cb_id *id, char *name, void (*callback) (void *)); +int cn_add_callback(struct cb_id *id, char *name, void (*callback) (struct cn_msg *, struct netlink_skb_parms *)); Registers new callback with connector core. struct cb_id *id - unique connector's user identifier. It must be registered in connector.h for legal in-kernel users. char *name - connector's callback symbolic name. - void (*callback) (void *) - connector's callback. - Argument must be dereferenced to struct cn_msg *. + void (*callback) (struct cn..) - connector's callback. + cn_msg and the sender's credentials void cn_del_callback(struct cb_id *id); diff --git a/Documentation/crypto/async-tx-api.txt b/Documentation/crypto/async-tx-api.txt index 9f59fcbf5d8..ba046b8fa92 100644 --- a/Documentation/crypto/async-tx-api.txt +++ b/Documentation/crypto/async-tx-api.txt @@ -54,20 +54,23 @@ features surfaced as a result: 3.1 General format of the API: struct dma_async_tx_descriptor * -async_<operation>(<op specific parameters>, - enum async_tx_flags flags, - struct dma_async_tx_descriptor *dependency, - dma_async_tx_callback callback_routine, - void *callback_parameter); +async_<operation>(<op specific parameters>, struct async_submit ctl *submit) 3.2 Supported operations: -memcpy - memory copy between a source and a destination buffer -memset - fill a destination buffer with a byte value -xor - xor a series of source buffers and write the result to a - destination buffer -xor_zero_sum - xor a series of source buffers and set a flag if the - result is zero. The implementation attempts to prevent - writes to memory +memcpy - memory copy between a source and a destination buffer +memset - fill a destination buffer with a byte value +xor - xor a series of source buffers and write the result to a + destination buffer +xor_val - xor a series of source buffers and set a flag if the + result is zero. The implementation attempts to prevent + writes to memory +pq - generate the p+q (raid6 syndrome) from a series of source buffers +pq_val - validate that a p and or q buffer are in sync with a given series of + sources +datap - (raid6_datap_recov) recover a raid6 data block and the p block + from the given sources +2data - (raid6_2data_recov) recover 2 raid6 data blocks from the given + sources 3.3 Descriptor management: The return value is non-NULL and points to a 'descriptor' when the operation @@ -80,8 +83,8 @@ acknowledged by the application before the offload engine driver is allowed to recycle (or free) the descriptor. A descriptor can be acked by one of the following methods: 1/ setting the ASYNC_TX_ACK flag if no child operations are to be submitted -2/ setting the ASYNC_TX_DEP_ACK flag to acknowledge the parent - descriptor of a new operation. +2/ submitting an unacknowledged descriptor as a dependency to another + async_tx call will implicitly set the acknowledged state. 3/ calling async_tx_ack() on the descriptor. 3.4 When does the operation execute? @@ -119,30 +122,42 @@ of an operation. Perform a xor->copy->xor operation where each operation depends on the result from the previous operation: -void complete_xor_copy_xor(void *param) +void callback(void *param) { - printk("complete\n"); + struct completion *cmp = param; + + complete(cmp); } -int run_xor_copy_xor(struct page **xor_srcs, - int xor_src_cnt, - struct page *xor_dest, - size_t xor_len, - struct page *copy_src, - struct page *copy_dest, - size_t copy_len) +void run_xor_copy_xor(struct page **xor_srcs, + int xor_src_cnt, + struct page *xor_dest, + size_t xor_len, + struct page *copy_src, + struct page *copy_dest, + size_t copy_len) { struct dma_async_tx_descriptor *tx; + addr_conv_t addr_conv[xor_src_cnt]; + struct async_submit_ctl submit; + addr_conv_t addr_conv[NDISKS]; + struct completion cmp; + + init_async_submit(&submit, ASYNC_TX_XOR_DROP_DST, NULL, NULL, NULL, + addr_conv); + tx = async_xor(xor_dest, xor_srcs, 0, xor_src_cnt, xor_len, &submit) - tx = async_xor(xor_dest, xor_srcs, 0, xor_src_cnt, xor_len, - ASYNC_TX_XOR_DROP_DST, NULL, NULL, NULL); - tx = async_memcpy(copy_dest, copy_src, 0, 0, copy_len, - ASYNC_TX_DEP_ACK, tx, NULL, NULL); - tx = async_xor(xor_dest, xor_srcs, 0, xor_src_cnt, xor_len, - ASYNC_TX_XOR_DROP_DST | ASYNC_TX_DEP_ACK | ASYNC_TX_ACK, - tx, complete_xor_copy_xor, NULL); + submit->depend_tx = tx; + tx = async_memcpy(copy_dest, copy_src, 0, 0, copy_len, &submit); + + init_completion(&cmp); + init_async_submit(&submit, ASYNC_TX_XOR_DROP_DST | ASYNC_TX_ACK, tx, + callback, &cmp, addr_conv); + tx = async_xor(xor_dest, xor_srcs, 0, xor_src_cnt, xor_len, &submit); async_tx_issue_pending_all(); + + wait_for_completion(&cmp); } See include/linux/async_tx.h for more information on the flags. See the diff --git a/Documentation/fb/ep93xx-fb.txt b/Documentation/fb/ep93xx-fb.txt new file mode 100644 index 00000000000..5af1bd9effa --- /dev/null +++ b/Documentation/fb/ep93xx-fb.txt @@ -0,0 +1,135 @@ +================================ +Driver for EP93xx LCD controller +================================ + +The EP93xx LCD controller can drive both standard desktop monitors and +embedded LCD displays. If you have a standard desktop monitor then you +can use the standard Linux video mode database. In your board file: + + static struct ep93xxfb_mach_info some_board_fb_info = { + .num_modes = EP93XXFB_USE_MODEDB, + .bpp = 16, + }; + +If you have an embedded LCD display then you need to define a video +mode for it as follows: + + static struct fb_videomode some_board_video_modes[] = { + { + .name = "some_lcd_name", + /* Pixel clock, porches, etc */ + }, + }; + +Note that the pixel clock value is in pico-seconds. You can use the +KHZ2PICOS macro to convert the pixel clock value. Most other values +are in pixel clocks. See Documentation/fb/framebuffer.txt for further +details. + +The ep93xxfb_mach_info structure for your board should look like the +following: + + static struct ep93xxfb_mach_info some_board_fb_info = { + .num_modes = ARRAY_SIZE(some_board_video_modes), + .modes = some_board_video_modes, + .default_mode = &some_board_video_modes[0], + .bpp = 16, + }; + +The framebuffer device can be registered by adding the following to +your board initialisation function: + + ep93xx_register_fb(&some_board_fb_info); + +===================== +Video Attribute Flags +===================== + +The ep93xxfb_mach_info structure has a flags field which can be used +to configure the controller. The video attributes flags are fully +documented in section 7 of the EP93xx users' guide. The following +flags are available: + +EP93XXFB_PCLK_FALLING Clock data on the falling edge of the + pixel clock. The default is to clock + data on the rising edge. + +EP93XXFB_SYNC_BLANK_HIGH Blank signal is active high. By + default the blank signal is active low. + +EP93XXFB_SYNC_HORIZ_HIGH Horizontal sync is active high. By + default the horizontal sync is active low. + +EP93XXFB_SYNC_VERT_HIGH Vertical sync is active high. By + default the vertical sync is active high. + +The physical address of the framebuffer can be controlled using the +following flags: + +EP93XXFB_USE_SDCSN0 Use SDCSn[0] for the framebuffer. This + is the default setting. + +EP93XXFB_USE_SDCSN1 Use SDCSn[1] for the framebuffer. + +EP93XXFB_USE_SDCSN2 Use SDCSn[2] for the framebuffer. + +EP93XXFB_USE_SDCSN3 Use SDCSn[3] for the framebuffer. + +================== +Platform callbacks +================== + +The EP93xx framebuffer driver supports three optional platform +callbacks: setup, teardown and blank. The setup and teardown functions +are called when the framebuffer driver is installed and removed +respectively. The blank function is called whenever the display is +blanked or unblanked. + +The setup and teardown devices pass the platform_device structure as +an argument. The fb_info and ep93xxfb_mach_info structures can be +obtained as follows: + + static int some_board_fb_setup(struct platform_device *pdev) + { + struct ep93xxfb_mach_info *mach_info = pdev->dev.platform_data; + struct fb_info *fb_info = platform_get_drvdata(pdev); + + /* Board specific framebuffer setup */ + } + +====================== +Setting the video mode +====================== + +The video mode is set using the following syntax: + + video=XRESxYRES[-BPP][@REFRESH] + +If the EP93xx video driver is built-in then the video mode is set on +the Linux kernel command line, for example: + + video=ep93xx-fb:800x600-16@60 + +If the EP93xx video driver is built as a module then the video mode is +set when the module is installed: + + modprobe ep93xx-fb video=320x240 + +============== +Screenpage bug +============== + +At least on the EP9315 there is a silicon bug which causes bit 27 of +the VIDSCRNPAGE (framebuffer physical offset) to be tied low. There is +an unofficial errata for this bug at: + http://marc.info/?l=linux-arm-kernel&m=110061245502000&w=2 + +By default the EP93xx framebuffer driver checks if the allocated physical +address has bit 27 set. If it does, then the memory is freed and an +error is returned. The check can be disabled by adding the following +option when loading the driver: + + ep93xx-fb.check_screenpage_bug=0 + +In some cases it may be possible to reconfigure your SDRAM layout to +avoid this bug. See section 13 of the EP93xx users' guide for details. diff --git a/Documentation/fb/matroxfb.txt b/Documentation/fb/matroxfb.txt index ad7a67707d6..e5ce8a1a978 100644 --- a/Documentation/fb/matroxfb.txt +++ b/Documentation/fb/matroxfb.txt @@ -186,9 +186,7 @@ noinverse - show true colors on screen. It is default. dev:X - bind driver to device X. Driver numbers device from 0 up to N, where device 0 is first `known' device found, 1 second and so on. lspci lists devices in this order. - Default is `every' known device for driver with multihead support - and first working device (usually dev:0) for driver without - multihead support. + Default is `every' known device. nohwcursor - disables hardware cursor (use software cursor instead). hwcursor - enables hardware cursor. It is default. If you are using non-accelerated mode (`noaccel' or `fbset -accel false'), software diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt index fa75220f8d3..89a47b5aff0 100644 --- a/Documentation/feature-removal-schedule.txt +++ b/Documentation/feature-removal-schedule.txt @@ -354,14 +354,6 @@ Who: Krzysztof Piotr Oledzki <ole@ans.pl> --------------------------- -What: fscher and fscpos drivers -When: June 2009 -Why: Deprecated by the new fschmd driver. -Who: Hans de Goede <hdegoede@redhat.com> - Jean Delvare <khali@linux-fr.org> - ---------------------------- - What: sysfs ui for changing p4-clockmod parameters When: September 2009 Why: See commits 129f8ae9b1b5be94517da76009ea956e89104ce8 and diff --git a/Documentation/filesystems/9p.txt b/Documentation/filesystems/9p.txt index 6208f55c44c..57e0b80a527 100644 --- a/Documentation/filesystems/9p.txt +++ b/Documentation/filesystems/9p.txt @@ -18,11 +18,11 @@ the 9p client is available in the form of a USENIX paper: Other applications are described in the following papers: * XCPU & Clustering - http://www.xcpu.org/xcpu-talk.pdf + http://xcpu.org/papers/xcpu-talk.pdf * KVMFS: control file system for KVM - http://www.xcpu.org/kvmfs.pdf - * CellFS: A New ProgrammingModel for the Cell BE - http://www.xcpu.org/cellfs-talk.pdf + http://xcpu.org/papers/kvmfs.pdf + * CellFS: A New Programming Model for the Cell BE + http://xcpu.org/papers/cellfs-talk.pdf * PROSE I/O: Using 9p to enable Application Partitions http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf @@ -48,6 +48,7 @@ OPTIONS (see rfdno and wfdno) virtio - connect to the next virtio channel available (from lguest or KVM with trans_virtio module) + rdma - connect to a specified RDMA channel uname=name user name to attempt mount as on the remote server. The server may override or ignore this value. Certain user @@ -59,16 +60,22 @@ OPTIONS cache=mode specifies a caching policy. By default, no caches are used. loose = no attempts are made at consistency, intended for exclusive, read-only mounts + fscache = use FS-Cache for a persistent, read-only + cache backend. debug=n specifies debug level. The debug level is a bitmask. - 0x01 = display verbose error messages - 0x02 = developer debug (DEBUG_CURRENT) - 0x04 = display 9p trace - 0x08 = display VFS trace - 0x10 = display Marshalling debug - 0x20 = display RPC debug - 0x40 = display transport debug - 0x80 = display allocation debug + 0x01 = display verbose error messages + 0x02 = developer debug (DEBUG_CURRENT) + 0x04 = display 9p trace + 0x08 = display VFS trace + 0x10 = display Marshalling debug + 0x20 = display RPC debug + 0x40 = display transport debug + 0x80 = display allocation debug + 0x100 = display protocol message debug + 0x200 = display Fid debug + 0x400 = display packet debug + 0x800 = display fscache tracing debug rfdno=n the file descriptor for reading with trans=fd @@ -100,6 +107,10 @@ OPTIONS any = v9fs does single attach and performs all operations as one user + cachetag cache tag to use the specified persistent cache. + cache tags for existing cache sessions can be listed at + /sys/fs/9p/caches. (applies only to cache=fscache) + RESOURCES ========= @@ -118,7 +129,7 @@ and export. A Linux version of the 9p server is now maintained under the npfs project on sourceforge (http://sourceforge.net/projects/npfs). The currently maintained version is the single-threaded version of the server (named spfs) -available from the same CVS repository. +available from the same SVN repository. There are user and developer mailing lists available through the v9fs project on sourceforge (http://sourceforge.net/projects/v9fs). @@ -126,7 +137,8 @@ on sourceforge (http://sourceforge.net/projects/v9fs). A stand-alone version of the module (which should build for any 2.6 kernel) is available via (http://github.com/ericvh/9p-sac/tree/master) -News and other information is maintained on SWiK (http://swik.net/v9fs). +News and other information is maintained on SWiK (http://swik.net/v9fs) +and the Wiki (http://sf.net/apps/mediawiki/v9fs/index.php). Bug reports may be issued through the kernel.org bugzilla (http://bugzilla.kernel.org) diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt index 18b5ec8cea4..bf4f4b7e11b 100644 --- a/Documentation/filesystems/ext4.txt +++ b/Documentation/filesystems/ext4.txt @@ -282,9 +282,16 @@ stripe=n Number of filesystem blocks that mballoc will try to use for allocation size and alignment. For RAID5/6 systems this should be the number of data disks * RAID chunk size in file system blocks. -delalloc (*) Deferring block allocation until write-out time. -nodelalloc Disable delayed allocation. Blocks are allocation - when data is copied from user to page cache. + +delalloc (*) Defer block allocation until just before ext4 + writes out the block(s) in question. This + allows ext4 to better allocation decisions + more efficiently. +nodelalloc Disable delayed allocation. Blocks are allocated + when the data is copied from userspace to the + page cache, either via the write(2) system call + or when an mmap'ed page which was previously + unallocated is written for the first time. max_batch_time=usec Maximum amount of time ext4 should wait for additional filesystem operations to be batch diff --git a/Documentation/filesystems/ncpfs.txt b/Documentation/filesystems/ncpfs.txt index f12c30c93f2..5af164f4b37 100644 --- a/Documentation/filesystems/ncpfs.txt +++ b/Documentation/filesystems/ncpfs.txt @@ -7,6 +7,6 @@ ftp.gwdg.de/pub/linux/misc/ncpfs, but sunsite and its many mirrors will have it as well. Related products are linware and mars_nwe, which will give Linux partial -NetWare server functionality. Linware's home site is -klokan.sh.cvut.cz/pub/linux/linware; mars_nwe can be found on -ftp.gwdg.de/pub/linux/misc/ncpfs. +NetWare server functionality. + +mars_nwe can be found on ftp.gwdg.de/pub/linux/misc/ncpfs. diff --git a/Documentation/filesystems/nfs41-server.txt b/Documentation/filesystems/nfs41-server.txt index 05d81cbcb2e..5920fe26e6f 100644 --- a/Documentation/filesystems/nfs41-server.txt +++ b/Documentation/filesystems/nfs41-server.txt @@ -11,6 +11,11 @@ the /proc/fs/nfsd/versions control file. Note that to write this control file, the nfsd service must be taken down. Use your user-mode nfs-utils to set this up; see rpc.nfsd(8) +(Warning: older servers will interpret "+4.1" and "-4.1" as "+4" and +"-4", respectively. Therefore, code meant to work on both new and old +kernels must turn 4.1 on or off *before* turning support for version 4 +on or off; rpc.nfsd does this correctly.) + The NFSv4 minorversion 1 (NFSv4.1) implementation in nfsd is based on the latest NFSv4.1 Internet Draft: http://tools.ietf.org/html/draft-ietf-nfsv4-minorversion1-29 @@ -25,6 +30,49 @@ are still under development out of tree. See http://wiki.linux-nfs.org/wiki/index.php/PNFS_prototype_design for more information. +The current implementation is intended for developers only: while it +does support ordinary file operations on clients we have tested against +(including the linux client), it is incomplete in ways which may limit +features unexpectedly, cause known bugs in rare cases, or cause +interoperability problems with future clients. Known issues: + + - gss support is questionable: currently mounts with kerberos + from a linux client are possible, but we aren't really + conformant with the spec (for example, we don't use kerberos + on the backchannel correctly). + - no trunking support: no clients currently take advantage of + trunking, but this is a mandatory failure, and its use is + recommended to clients in a number of places. (E.g. to ensure + timely renewal in case an existing connection's retry timeouts + have gotten too long; see section 8.3 of the draft.) + Therefore, lack of this feature may cause future clients to + fail. + - Incomplete backchannel support: incomplete backchannel gss + support and no support for BACKCHANNEL_CTL mean that + callbacks (hence delegations and layouts) may not be + available and clients confused by the incomplete + implementation may fail. + - Server reboot recovery is unsupported; if the server reboots, + clients may fail. + - We do not support SSV, which provides security for shared + client-server state (thus preventing unauthorized tampering + with locks and opens, for example). It is mandatory for + servers to support this, though no clients use it yet. + - Mandatory operations which we do not support, such as + DESTROY_CLIENTID, FREE_STATEID, SECINFO_NO_NAME, and + TEST_STATEID, are not currently used by clients, but will be + (and the spec recommends their uses in common cases), and + clients should not be expected to know how to recover from the + case where they are not supported. This will eventually cause + interoperability failures. + +In addition, some limitations are inherited from the current NFSv4 +implementation: + + - Incomplete delegation enforcement: if a file is renamed or + unlinked, a client holding a delegation may continue to + indefinitely allow opens of the file under the old name. + The table below, taken from the NFSv4.1 document, lists the operations that are mandatory to implement (REQ), optional (OPT), and NFSv4.0 operations that are required not to implement (MNI) @@ -142,6 +190,12 @@ NS*| CB_WANTS_CANCELLED | OPT | FDELG, | Section 20.10 | Implementation notes: +DELEGPURGE: +* mandatory only for servers that support CLAIM_DELEGATE_PREV and/or + CLAIM_DELEG_PREV_FH (which allows clients to keep delegations that + persist across client reboots). Thus we need not implement this for + now. + EXCHANGE_ID: * only SP4_NONE state protection supported * implementation ids are ignored diff --git a/Documentation/filesystems/nfsroot.txt b/Documentation/filesystems/nfsroot.txt index 68baddf3c3e..3ba0b945aaf 100644 --- a/Documentation/filesystems/nfsroot.txt +++ b/Documentation/filesystems/nfsroot.txt @@ -105,7 +105,7 @@ ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf> the client address and this parameter is NOT empty only replies from the specified server are accepted. - Only required for for NFS root. That is autoconfiguration + Only required for NFS root. That is autoconfiguration will not be triggered if it is missing and NFS root is not in operation. diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index ffead13f944..2c48f945546 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -176,6 +176,7 @@ read the file /proc/PID/status: CapBnd: ffffffffffffffff voluntary_ctxt_switches: 0 nonvoluntary_ctxt_switches: 1 + Stack usage: 12 kB This shows you nearly the same information you would get if you viewed it with the ps command. In fact, ps uses the proc file system to obtain its @@ -229,6 +230,7 @@ Table 1-2: Contents of the statm files (as of 2.6.30-rc7) Mems_allowed_list Same as previous, but in "list format" voluntary_ctxt_switches number of voluntary context switches nonvoluntary_ctxt_switches number of non voluntary context switches + Stack usage: stack usage high water mark (round up to page size) .............................................................................. Table 1-3: Contents of the statm files (as of 2.6.8-rc3) @@ -307,7 +309,7 @@ address perms offset dev inode pathname 08049000-0804a000 rw-p 00001000 03:00 8312 /opt/test 0804a000-0806b000 rw-p 00000000 00:00 0 [heap] a7cb1000-a7cb2000 ---p 00000000 00:00 0 -a7cb2000-a7eb2000 rw-p 00000000 00:00 0 +a7cb2000-a7eb2000 rw-p 00000000 00:00 0 [threadstack:001ff4b4] a7eb2000-a7eb3000 ---p 00000000 00:00 0 a7eb3000-a7ed5000 rw-p 00000000 00:00 0 a7ed5000-a8008000 r-xp 00000000 03:00 4222 /lib/libc.so.6 @@ -343,6 +345,7 @@ is not associated with a file: [stack] = the stack of the main process [vdso] = the "virtual dynamic shared object", the kernel system call handler + [threadstack:xxxxxxxx] = the stack of the thread, xxxxxxxx is the stack size or if empty, the mapping is anonymous. @@ -375,6 +378,19 @@ of memory currently marked as referenced or accessed. This file is only present if the CONFIG_MMU kernel configuration option is enabled. +The /proc/PID/clear_refs is used to reset the PG_Referenced and ACCESSED/YOUNG +bits on both physical and virtual pages associated with a process. +To clear the bits for all the pages associated with the process + > echo 1 > /proc/PID/clear_refs + +To clear the bits for the anonymous pages associated with the process + > echo 2 > /proc/PID/clear_refs + +To clear the bits for the file mapped pages associated with the process + > echo 3 > /proc/PID/clear_refs +Any other value written to /proc/PID/clear_refs will have no effect. + + 1.2 Kernel data --------------- @@ -1032,9 +1048,9 @@ Various pieces of information about kernel activity are available in the since the system first booted. For a quick look, simply cat the file: > cat /proc/stat - cpu 2255 34 2290 22625563 6290 127 456 0 - cpu0 1132 34 1441 11311718 3675 127 438 0 - cpu1 1123 0 849 11313845 2614 0 18 0 + cpu 2255 34 2290 22625563 6290 127 456 0 0 + cpu0 1132 34 1441 11311718 3675 127 438 0 0 + cpu1 1123 0 849 11313845 2614 0 18 0 0 intr 114930548 113199788 3 0 5 263 0 4 [... lots more numbers ...] ctxt 1990473 btime 1062191376 @@ -1056,6 +1072,7 @@ second). The meanings of the columns are as follows, from left to right: - irq: servicing interrupts - softirq: servicing softirqs - steal: involuntary wait +- guest: running a guest The "intr" line gives counts of interrupts serviced since boot time, for each of the possible system interrupts. The first column is the total of all @@ -1096,7 +1113,6 @@ Table 1-12: Files in /proc/fs/ext4/<devname> .............................................................................. File Content mb_groups details of multiblock allocator buddy cache of free blocks - mb_history multiblock allocation history .............................................................................. @@ -1191,7 +1207,7 @@ The following heuristics are then applied: * if the task was reniced, its score doubles * superuser or direct hardware access tasks (CAP_SYS_ADMIN, CAP_SYS_RESOURCE or CAP_SYS_RAWIO) have their score divided by 4 - * if oom condition happened in one cpuset and checked task does not belong + * if oom condition happened in one cpuset and checked process does not belong to it, its score is divided by 8 * the resulting score is multiplied by two to the power of oom_adj, i.e. points <<= oom_adj when it is positive and diff --git a/Documentation/filesystems/sharedsubtree.txt b/Documentation/filesystems/sharedsubtree.txt index 736540045dc..23a181074f9 100644 --- a/Documentation/filesystems/sharedsubtree.txt +++ b/Documentation/filesystems/sharedsubtree.txt @@ -4,7 +4,7 @@ Shared Subtrees Contents: 1) Overview 2) Features - 3) smount command + 3) Setting mount states 4) Use-case 5) Detailed semantics 6) Quiz @@ -41,14 +41,14 @@ replicas continue to be exactly same. Here is an example: - Lets say /mnt has a mount that is shared. + Let's say /mnt has a mount that is shared. mount --make-shared /mnt - note: mount command does not yet support the --make-shared flag. - I have included a small C program which does the same by executing - 'smount /mnt shared' + Note: mount(8) command now supports the --make-shared flag, + so the sample 'smount' program is no longer needed and has been + removed. - #mount --bind /mnt /tmp + # mount --bind /mnt /tmp The above command replicates the mount at /mnt to the mountpoint /tmp and the contents of both the mounts remain identical. @@ -58,8 +58,8 @@ replicas continue to be exactly same. #ls /tmp a b c - Now lets say we mount a device at /tmp/a - #mount /dev/sd0 /tmp/a + Now let's say we mount a device at /tmp/a + # mount /dev/sd0 /tmp/a #ls /tmp/a t1 t2 t2 @@ -80,21 +80,20 @@ replicas continue to be exactly same. Here is an example: - Lets say /mnt has a mount which is shared. - #mount --make-shared /mnt + Let's say /mnt has a mount which is shared. + # mount --make-shared /mnt - Lets bind mount /mnt to /tmp - #mount --bind /mnt /tmp + Let's bind mount /mnt to /tmp + # mount --bind /mnt /tmp the new mount at /tmp becomes a shared mount and it is a replica of the mount at /mnt. - Now lets make the mount at /tmp; a slave of /mnt - #mount --make-slave /tmp - [or smount /tmp slave] + Now let's make the mount at /tmp; a slave of /mnt + # mount --make-slave /tmp - lets mount /dev/sd0 on /mnt/a - #mount /dev/sd0 /mnt/a + let's mount /dev/sd0 on /mnt/a + # mount /dev/sd0 /mnt/a #ls /mnt/a t1 t2 t3 @@ -104,9 +103,9 @@ replicas continue to be exactly same. Note the mount event has propagated to the mount at /tmp - However lets see what happens if we mount something on the mount at /tmp + However let's see what happens if we mount something on the mount at /tmp - #mount /dev/sd1 /tmp/b + # mount /dev/sd1 /tmp/b #ls /tmp/b s1 s2 s3 @@ -124,12 +123,11 @@ replicas continue to be exactly same. 2d) A unbindable mount is a unbindable private mount - lets say we have a mount at /mnt and we make is unbindable + let's say we have a mount at /mnt and we make is unbindable - #mount --make-unbindable /mnt - [ smount /mnt unbindable ] + # mount --make-unbindable /mnt - Lets try to bind mount this mount somewhere else. + Let's try to bind mount this mount somewhere else. # mount --bind /mnt /tmp mount: wrong fs type, bad option, bad superblock on /mnt, or too many mounted file systems @@ -137,149 +135,15 @@ replicas continue to be exactly same. Binding a unbindable mount is a invalid operation. -3) smount command +3) Setting mount states - Currently the mount command is not aware of shared subtree features. - Work is in progress to add the support in mount ( util-linux package ). - Till then use the following program. + The mount command (util-linux package) can be used to set mount + states: - ------------------------------------------------------------------------ - // - //this code was developed my Miklos Szeredi <miklos@szeredi.hu> - //and modified by Ram Pai <linuxram@us.ibm.com> - // sample usage: - // smount /tmp shared - // - #include <stdio.h> - #include <stdlib.h> - #include <unistd.h> - #include <string.h> - #include <sys/mount.h> - #include <sys/fsuid.h> - - #ifndef MS_REC - #define MS_REC 0x4000 /* 16384: Recursive loopback */ - #endif - - #ifndef MS_SHARED - #define MS_SHARED 1<<20 /* Shared */ - #endif - - #ifndef MS_PRIVATE - #define MS_PRIVATE 1<<18 /* Private */ - #endif - - #ifndef MS_SLAVE - #define MS_SLAVE 1<<19 /* Slave */ - #endif - - #ifndef MS_UNBINDABLE - #define MS_UNBINDABLE 1<<17 /* Unbindable */ - #endif - - int main(int argc, char *argv[]) - { - int type; - if(argc != 3) { - fprintf(stderr, "usage: %s dir " - "<rshared|rslave|rprivate|runbindable|shared|slave" - "|private|unbindable>\n" , argv[0]); - return 1; - } - - fprintf(stdout, "%s %s %s\n", argv[0], argv[1], argv[2]); - - if (strcmp(argv[2],"rshared")==0) - type=(MS_SHARED|MS_REC); - else if (strcmp(argv[2],"rslave")==0) - type=(MS_SLAVE|MS_REC); - else if (strcmp(argv[2],"rprivate")==0) - type=(MS_PRIVATE|MS_REC); - else if (strcmp(argv[2],"runbindable")==0) - type=(MS_UNBINDABLE|MS_REC); - else if (strcmp(argv[2],"shared")==0) - type=MS_SHARED; - else if (strcmp(argv[2],"slave")==0) - type=MS_SLAVE; - else if (strcmp(argv[2],"private")==0) - type=MS_PRIVATE; - else if (strcmp(argv[2],"unbindable")==0) - type=MS_UNBINDABLE; - else { - fprintf(stderr, "invalid operation: %s\n", argv[2]); - return 1; - } - setfsuid(getuid()); - - if(mount("", argv[1], "dontcare", type, "") == -1) { - perror("mount"); - return 1; - } - return 0; - } - ----------------------------------------------------------------------- - - Copy the above code snippet into smount.c - gcc -o smount smount.c - - - (i) To mark all the mounts under /mnt as shared execute the following - command: - - smount /mnt rshared - the corresponding syntax planned for mount command is - mount --make-rshared /mnt - - just to mark a mount /mnt as shared, execute the following - command: - smount /mnt shared - the corresponding syntax planned for mount command is - mount --make-shared /mnt - - (ii) To mark all the shared mounts under /mnt as slave execute the - following - - command: - smount /mnt rslave - the corresponding syntax planned for mount command is - mount --make-rslave /mnt - - just to mark a mount /mnt as slave, execute the following - command: - smount /mnt slave - the corresponding syntax planned for mount command is - mount --make-slave /mnt - - (iii) To mark all the mounts under /mnt as private execute the - following command: - - smount /mnt rprivate - the corresponding syntax planned for mount command is - mount --make-rprivate /mnt - - just to mark a mount /mnt as private, execute the following - command: - smount /mnt private - the corresponding syntax planned for mount command is - mount --make-private /mnt - - NOTE: by default all the mounts are created as private. But if - you want to change some shared/slave/unbindable mount as - private at a later point in time, this command can help. - - (iv) To mark all the mounts under /mnt as unbindable execute the - following - - command: - smount /mnt runbindable - the corresponding syntax planned for mount command is - mount --make-runbindable /mnt - - just to mark a mount /mnt as unbindable, execute the following - command: - smount /mnt unbindable - the corresponding syntax planned for mount command is - mount --make-unbindable /mnt + mount --make-shared mountpoint + mount --make-slave mountpoint + mount --make-private mountpoint + mount --make-unbindable mountpoint 4) Use cases @@ -350,7 +214,7 @@ replicas continue to be exactly same. mount --rbind / /view/v3 mount --rbind / /view/v4 - and if /usr has a versioning filesystem mounted, than that + and if /usr has a versioning filesystem mounted, then that mount appears at /view/v1/usr, /view/v2/usr, /view/v3/usr and /view/v4/usr too @@ -390,7 +254,7 @@ replicas continue to be exactly same. For example: mount --make-shared /mnt - mount --bin /mnt /tmp + mount --bind /mnt /tmp The mount at /mnt and that at /tmp are both shared and belong to the same peer group. Anything mounted or unmounted under @@ -558,7 +422,7 @@ replicas continue to be exactly same. then the subtree under the unbindable mount is pruned in the new location. - eg: lets say we have the following mount tree. + eg: let's say we have the following mount tree. A / \ @@ -566,7 +430,7 @@ replicas continue to be exactly same. / \ / \ D E F G - Lets say all the mount except the mount C in the tree are + Let's say all the mount except the mount C in the tree are of a type other than unbindable. If this tree is rbound to say Z @@ -683,13 +547,13 @@ replicas continue to be exactly same. 'b' on mounts that receive propagation from mount 'B' and does not have sub-mounts within them are unmounted. - Example: Lets say 'B1', 'B2', 'B3' are shared mounts that propagate to + Example: Let's say 'B1', 'B2', 'B3' are shared mounts that propagate to each other. - lets say 'A1', 'A2', 'A3' are first mounted at dentry 'b' on mount + let's say 'A1', 'A2', 'A3' are first mounted at dentry 'b' on mount 'B1', 'B2' and 'B3' respectively. - lets say 'C1', 'C2', 'C3' are next mounted at the same dentry 'b' on + let's say 'C1', 'C2', 'C3' are next mounted at the same dentry 'b' on mount 'B1', 'B2' and 'B3' respectively. if 'C1' is unmounted, all the mounts that are most-recently-mounted on @@ -710,7 +574,7 @@ replicas continue to be exactly same. A cloned namespace contains all the mounts as that of the parent namespace. - Lets say 'A' and 'B' are the corresponding mounts in the parent and the + Let's say 'A' and 'B' are the corresponding mounts in the parent and the child namespace. If 'A' is shared, then 'B' is also shared and 'A' and 'B' propagate to @@ -759,11 +623,11 @@ replicas continue to be exactly same. mount --make-slave /mnt At this point we have the first mount at /tmp and - its root dentry is 1. Lets call this mount 'A' + its root dentry is 1. Let's call this mount 'A' And then we have a second mount at /tmp1 with root - dentry 2. Lets call this mount 'B' + dentry 2. Let's call this mount 'B' Next we have a third mount at /mnt with root dentry - mnt. Lets call this mount 'C' + mnt. Let's call this mount 'C' 'B' is the slave of 'A' and 'C' is a slave of 'B' A -> B -> C @@ -794,7 +658,7 @@ replicas continue to be exactly same. Q3 Why is unbindable mount needed? - Lets say we want to replicate the mount tree at multiple + Let's say we want to replicate the mount tree at multiple locations within the same subtree. if one rbind mounts a tree within the same subtree 'n' times @@ -803,7 +667,7 @@ replicas continue to be exactly same. mounts. Here is a example. step 1: - lets say the root tree has just two directories with + let's say the root tree has just two directories with one vfsmount. root / \ @@ -875,7 +739,7 @@ replicas continue to be exactly same. Unclonable mounts come in handy here. step 1: - lets say the root tree has just two directories with + let's say the root tree has just two directories with one vfsmount. root / \ diff --git a/Documentation/filesystems/vfat.txt b/Documentation/filesystems/vfat.txt index b58b84b50fa..eed520fd0c8 100644 --- a/Documentation/filesystems/vfat.txt +++ b/Documentation/filesystems/vfat.txt @@ -102,7 +102,7 @@ shortname=lower|win95|winnt|mixed winnt: emulate the Windows NT rule for display/create. mixed: emulate the Windows NT rule for display, emulate the Windows 95 rule for create. - Default setting is `lower'. + Default setting is `mixed'. tz=UTC -- Interpret timestamps as UTC rather than local time. This option disables the conversion of timestamps diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index f49eecf2e57..623f094c9d8 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -536,6 +536,7 @@ struct address_space_operations { /* migrate the contents of a page to the specified target */ int (*migratepage) (struct page *, struct page *); int (*launder_page) (struct page *); + int (*error_remove_page) (struct mapping *mapping, struct page *page); }; writepage: called by the VM to write a dirty page to backing store. @@ -694,6 +695,12 @@ struct address_space_operations { prevent redirtying the page, it is kept locked during the whole operation. + error_remove_page: normally set to generic_error_remove_page if truncation + is ok for this address space. Used for memory failure handling. + Setting this implies you deal with pages going away under you, + unless you have them locked or reference counts increased. + + The File Object =============== diff --git a/Documentation/gcov.txt b/Documentation/gcov.txt index 40ec6335276..e7ca6478cd9 100644 --- a/Documentation/gcov.txt +++ b/Documentation/gcov.txt @@ -47,7 +47,7 @@ Possible uses: Configure the kernel with: - CONFIG_DEBUGFS=y + CONFIG_DEBUG_FS=y CONFIG_GCOV_KERNEL=y and to get coverage data for the entire kernel: diff --git a/Documentation/gpio.txt b/Documentation/gpio.txt index e4b6985044a..fa4dc077ae0 100644 --- a/Documentation/gpio.txt +++ b/Documentation/gpio.txt @@ -524,6 +524,13 @@ and have the following read/write attributes: is configured as an output, this value may be written; any nonzero value is treated as high. + "edge" ... reads as either "none", "rising", "falling", or + "both". Write these strings to select the signal edge(s) + that will make poll(2) on the "value" file return. + + This file exists only if the pin can be configured as an + interrupt generating input pin. + GPIO controllers have paths like /sys/class/gpio/chipchip42/ (for the controller implementing GPIOs starting at #42) and have the following read-only attributes: @@ -555,6 +562,11 @@ requested using gpio_request(): /* reverse gpio_export() */ void gpio_unexport(); + /* create a sysfs link to an exported GPIO node */ + int gpio_export_link(struct device *dev, const char *name, + unsigned gpio) + + After a kernel driver requests a GPIO, it may only be made available in the sysfs interface by gpio_export(). The driver can control whether the signal direction may change. This helps drivers prevent userspace code @@ -563,3 +575,8 @@ from accidentally clobbering important system state. This explicit exporting can help with debugging (by making some kinds of experiments easier), or can provide an always-there interface that's suitable for documenting as part of a board support package. + +After the GPIO has been exported, gpio_export_link() allows creating +symlinks from elsewhere in sysfs to the GPIO sysfs node. Drivers can +use this to provide the interface under their own device in sysfs with +a descriptive name. diff --git a/Documentation/hwmon/acpi_power_meter b/Documentation/hwmon/acpi_power_meter new file mode 100644 index 00000000000..c80399a00c5 --- /dev/null +++ b/Documentation/hwmon/acpi_power_meter @@ -0,0 +1,51 @@ +Kernel driver power_meter +========================= + +This driver talks to ACPI 4.0 power meters. + +Supported systems: + * Any recent system with ACPI 4.0. + Prefix: 'power_meter' + Datasheet: http://acpi.info/, section 10.4. + +Author: Darrick J. Wong + +Description +----------- + +This driver implements sensor reading support for the power meters exposed in +the ACPI 4.0 spec (Chapter 10.4). These devices have a simple set of +features--a power meter that returns average power use over a configurable +interval, an optional capping mechanism, and a couple of trip points. The +sysfs interface conforms with the specification outlined in the "Power" section +of Documentation/hwmon/sysfs-interface. + +Special Features +---------------- + +The power[1-*]_is_battery knob indicates if the power supply is a battery. +Both power[1-*]_average_{min,max} must be set before the trip points will work. +When both of them are set, an ACPI event will be broadcast on the ACPI netlink +socket and a poll notification will be sent to the appropriate +power[1-*]_average sysfs file. + +The power[1-*]_{model_number, serial_number, oem_info} fields display arbitrary +strings that ACPI provides with the meter. The measures/ directory contains +symlinks to the devices that this meter measures. + +Some computers have the ability to enforce a power cap in hardware. If this is +the case, the power[1-*]_cap and related sysfs files will appear. When the +average power consumption exceeds the cap, an ACPI event will be broadcast on +the netlink event socket and a poll notification will be sent to the +appropriate power[1-*]_alarm file to indicate that capping has begun, and the +hardware has taken action to reduce power consumption. Most likely this will +result in reduced performance. + +There are a few other ACPI notifications that can be sent by the firmware. In +all cases the ACPI event will be broadcast on the ACPI netlink event socket as +well as sent as a poll notification to a sysfs file. The events are as +follows: + +power[1-*]_cap will be notified if the firmware changes the power cap. +power[1-*]_interval will be notified if the firmware changes the averaging +interval. diff --git a/Documentation/hwmon/coretemp b/Documentation/hwmon/coretemp index dbbe6c7025b..92267b62db5 100644 --- a/Documentation/hwmon/coretemp +++ b/Documentation/hwmon/coretemp @@ -4,7 +4,9 @@ Kernel driver coretemp Supported chips: * All Intel Core family Prefix: 'coretemp' - CPUID: family 0x6, models 0xe, 0xf, 0x16, 0x17 + CPUID: family 0x6, models 0xe (Pentium M DC), 0xf (Core 2 DC 65nm), + 0x16 (Core 2 SC 65nm), 0x17 (Penryn 45nm), + 0x1a (Nehalem), 0x1c (Atom), 0x1e (Lynnfield) Datasheet: Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3A: System Programming Guide http://softwarecommunity.intel.com/Wiki/Mobility/720.htm diff --git a/Documentation/hwmon/fscher b/Documentation/hwmon/fscher deleted file mode 100644 index 64031659aff..00000000000 --- a/Documentation/hwmon/fscher +++ /dev/null @@ -1,169 +0,0 @@ -Kernel driver fscher -==================== - -Supported chips: - * Fujitsu-Siemens Hermes chip - Prefix: 'fscher' - Addresses scanned: I2C 0x73 - -Authors: - Reinhard Nissl <rnissl@gmx.de> based on work - from Hermann Jung <hej@odn.de>, - Frodo Looijaard <frodol@dds.nl>, - Philip Edelbrock <phil@netroedge.com> - -Description ------------ - -This driver implements support for the Fujitsu-Siemens Hermes chip. It is -described in the 'Register Set Specification BMC Hermes based Systemboard' -from Fujitsu-Siemens. - -The Hermes chip implements a hardware-based system management, e.g. for -controlling fan speed and core voltage. There is also a watchdog counter on -the chip which can trigger an alarm and even shut the system down. - -The chip provides three temperature values (CPU, motherboard and -auxiliary), three voltage values (+12V, +5V and battery) and three fans -(power supply, CPU and auxiliary). - -Temperatures are measured in degrees Celsius. The resolution is 1 degree. - -Fan rotation speeds are reported in RPM (rotations per minute). The value -can be divided by a programmable divider (1, 2 or 4) which is stored on -the chip. - -Voltage sensors (also known as "in" sensors) report their values in volts. - -All values are reported as final values from the driver. There is no need -for further calculations. - - -Detailed description --------------------- - -Below you'll find a single line description of all the bit values. With -this information, you're able to decode e. g. alarms, wdog, etc. To make -use of the watchdog, you'll need to set the watchdog time and enable the -watchdog. After that it is necessary to restart the watchdog time within -the specified period of time, or a system reset will occur. - -* revision - READING & 0xff = 0x??: HERMES revision identification - -* alarms - READING & 0x80 = 0x80: CPU throttling active - READING & 0x80 = 0x00: CPU running at full speed - - READING & 0x10 = 0x10: software event (see control:1) - READING & 0x10 = 0x00: no software event - - READING & 0x08 = 0x08: watchdog event (see wdog:2) - READING & 0x08 = 0x00: no watchdog event - - READING & 0x02 = 0x02: thermal event (see temp*:1) - READING & 0x02 = 0x00: no thermal event - - READING & 0x01 = 0x01: fan event (see fan*:1) - READING & 0x01 = 0x00: no fan event - - READING & 0x13 ! 0x00: ALERT LED is flashing - -* control - READING & 0x01 = 0x01: software event - READING & 0x01 = 0x00: no software event - - WRITING & 0x01 = 0x01: set software event - WRITING & 0x01 = 0x00: clear software event - -* watchdog_control - READING & 0x80 = 0x80: power off on watchdog event while thermal event - READING & 0x80 = 0x00: watchdog power off disabled (just system reset enabled) - - READING & 0x40 = 0x40: watchdog timebase 60 seconds (see also wdog:1) - READING & 0x40 = 0x00: watchdog timebase 2 seconds - - READING & 0x10 = 0x10: watchdog enabled - READING & 0x10 = 0x00: watchdog disabled - - WRITING & 0x80 = 0x80: enable "power off on watchdog event while thermal event" - WRITING & 0x80 = 0x00: disable "power off on watchdog event while thermal event" - - WRITING & 0x40 = 0x40: set watchdog timebase to 60 seconds - WRITING & 0x40 = 0x00: set watchdog timebase to 2 seconds - - WRITING & 0x20 = 0x20: disable watchdog - - WRITING & 0x10 = 0x10: enable watchdog / restart watchdog time - -* watchdog_state - READING & 0x02 = 0x02: watchdog system reset occurred - READING & 0x02 = 0x00: no watchdog system reset occurred - - WRITING & 0x02 = 0x02: clear watchdog event - -* watchdog_preset - READING & 0xff = 0x??: configured watch dog time in units (see wdog:3 0x40) - - WRITING & 0xff = 0x??: configure watch dog time in units - -* in* (0: +5V, 1: +12V, 2: onboard 3V battery) - READING: actual voltage value - -* temp*_status (1: CPU sensor, 2: onboard sensor, 3: auxiliary sensor) - READING & 0x02 = 0x02: thermal event (overtemperature) - READING & 0x02 = 0x00: no thermal event - - READING & 0x01 = 0x01: sensor is working - READING & 0x01 = 0x00: sensor is faulty - - WRITING & 0x02 = 0x02: clear thermal event - -* temp*_input (1: CPU sensor, 2: onboard sensor, 3: auxiliary sensor) - READING: actual temperature value - -* fan*_status (1: power supply fan, 2: CPU fan, 3: auxiliary fan) - READING & 0x04 = 0x04: fan event (fan fault) - READING & 0x04 = 0x00: no fan event - - WRITING & 0x04 = 0x04: clear fan event - -* fan*_div (1: power supply fan, 2: CPU fan, 3: auxiliary fan) - Divisors 2,4 and 8 are supported, both for reading and writing - -* fan*_pwm (1: power supply fan, 2: CPU fan, 3: auxiliary fan) - READING & 0xff = 0x00: fan may be switched off - READING & 0xff = 0x01: fan must run at least at minimum speed (supply: 6V) - READING & 0xff = 0xff: fan must run at maximum speed (supply: 12V) - READING & 0xff = 0x??: fan must run at least at given speed (supply: 6V..12V) - - WRITING & 0xff = 0x00: fan may be switched off - WRITING & 0xff = 0x01: fan must run at least at minimum speed (supply: 6V) - WRITING & 0xff = 0xff: fan must run at maximum speed (supply: 12V) - WRITING & 0xff = 0x??: fan must run at least at given speed (supply: 6V..12V) - -* fan*_input (1: power supply fan, 2: CPU fan, 3: auxiliary fan) - READING: actual RPM value - - -Limitations ------------ - -* Measuring fan speed -It seems that the chip counts "ripples" (typical fans produce 2 ripples per -rotation while VERAX fans produce 18) in a 9-bit register. This register is -read out every second, then the ripple prescaler (2, 4 or 8) is applied and -the result is stored in the 8 bit output register. Due to the limitation of -the counting register to 9 bits, it is impossible to measure a VERAX fan -properly (even with a prescaler of 8). At its maximum speed of 3500 RPM the -fan produces 1080 ripples per second which causes the counting register to -overflow twice, leading to only 186 RPM. - -* Measuring input voltages -in2 ("battery") reports the voltage of the onboard lithium battery and not -+3.3V from the power supply. - -* Undocumented features -Fujitsu-Siemens Computers has not documented all features of the chip so -far. Their software, System Guard, shows that there are a still some -features which cannot be controlled by this implementation. diff --git a/Documentation/hwmon/hpfall.c b/Documentation/hwmon/hpfall.c index bbea1ccfd46..681ec22b9d0 100644 --- a/Documentation/hwmon/hpfall.c +++ b/Documentation/hwmon/hpfall.c @@ -16,6 +16,34 @@ #include <stdint.h> #include <errno.h> #include <signal.h> +#include <sys/mman.h> +#include <sched.h> + +char unload_heads_path[64]; + +int set_unload_heads_path(char *device) +{ + char devname[64]; + + if (strlen(device) <= 5 || strncmp(device, "/dev/", 5) != 0) + return -EINVAL; + strncpy(devname, device + 5, sizeof(devname)); + + snprintf(unload_heads_path, sizeof(unload_heads_path), + "/sys/block/%s/device/unload_heads", devname); + return 0; +} +int valid_disk(void) +{ + int fd = open(unload_heads_path, O_RDONLY); + if (fd < 0) { + perror(unload_heads_path); + return 0; + } + + close(fd); + return 1; +} void write_int(char *path, int i) { @@ -40,7 +68,7 @@ void set_led(int on) void protect(int seconds) { - write_int("/sys/block/sda/device/unload_heads", seconds*1000); + write_int(unload_heads_path, seconds*1000); } int on_ac(void) @@ -57,45 +85,62 @@ void ignore_me(void) { protect(0); set_led(0); - } -int main(int argc, char* argv[]) +int main(int argc, char **argv) { - int fd, ret; + int fd, ret; + struct sched_param param; + + if (argc == 1) + ret = set_unload_heads_path("/dev/sda"); + else if (argc == 2) + ret = set_unload_heads_path(argv[1]); + else + ret = -EINVAL; + + if (ret || !valid_disk()) { + fprintf(stderr, "usage: %s <device> (default: /dev/sda)\n", + argv[0]); + exit(1); + } + + fd = open("/dev/freefall", O_RDONLY); + if (fd < 0) { + perror("/dev/freefall"); + return EXIT_FAILURE; + } - fd = open("/dev/freefall", O_RDONLY); - if (fd < 0) { - perror("open"); - return EXIT_FAILURE; - } + daemon(0, 0); + param.sched_priority = sched_get_priority_max(SCHED_FIFO); + sched_setscheduler(0, SCHED_FIFO, ¶m); + mlockall(MCL_CURRENT|MCL_FUTURE); signal(SIGALRM, ignore_me); - for (;;) { - unsigned char count; - - ret = read(fd, &count, sizeof(count)); - alarm(0); - if ((ret == -1) && (errno == EINTR)) { - /* Alarm expired, time to unpark the heads */ - continue; - } - - if (ret != sizeof(count)) { - perror("read"); - break; - } - - protect(21); - set_led(1); - if (1 || on_ac() || lid_open()) { - alarm(2); - } else { - alarm(20); - } - } - - close(fd); - return EXIT_SUCCESS; + for (;;) { + unsigned char count; + + ret = read(fd, &count, sizeof(count)); + alarm(0); + if ((ret == -1) && (errno == EINTR)) { + /* Alarm expired, time to unpark the heads */ + continue; + } + + if (ret != sizeof(count)) { + perror("read"); + break; + } + + protect(21); + set_led(1); + if (1 || on_ac() || lid_open()) + alarm(2); + else + alarm(20); + } + + close(fd); + return EXIT_SUCCESS; } diff --git a/Documentation/hwmon/ltc4215 b/Documentation/hwmon/ltc4215 index 2e6a21eb656..c196a184625 100644 --- a/Documentation/hwmon/ltc4215 +++ b/Documentation/hwmon/ltc4215 @@ -22,12 +22,13 @@ Usage Notes ----------- This driver does not probe for LTC4215 devices, due to the fact that some -of the possible addresses are unfriendly to probing. You will need to use -the "force" parameter to tell the driver where to find the device. +of the possible addresses are unfriendly to probing. You will have to +instantiate the devices explicitly. Example: the following will load the driver for an LTC4215 at address 0x44 on I2C bus #0: -$ modprobe ltc4215 force=0,0x44 +$ modprobe ltc4215 +$ echo ltc4215 0x44 > /sys/bus/i2c/devices/i2c-0/new_device Sysfs entries diff --git a/Documentation/hwmon/ltc4245 b/Documentation/hwmon/ltc4245 index bae7a3adc5d..02838a47d86 100644 --- a/Documentation/hwmon/ltc4245 +++ b/Documentation/hwmon/ltc4245 @@ -23,12 +23,13 @@ Usage Notes ----------- This driver does not probe for LTC4245 devices, due to the fact that some -of the possible addresses are unfriendly to probing. You will need to use -the "force" parameter to tell the driver where to find the device. +of the possible addresses are unfriendly to probing. You will have to +instantiate the devices explicitly. Example: the following will load the driver for an LTC4245 at address 0x23 on I2C bus #1: -$ modprobe ltc4245 force=1,0x23 +$ modprobe ltc4245 +$ echo ltc4245 0x23 > /sys/bus/i2c/devices/i2c-1/new_device Sysfs entries diff --git a/Documentation/hwmon/pc87427 b/Documentation/hwmon/pc87427 index d1ebbe510f3..db5cc1227a8 100644 --- a/Documentation/hwmon/pc87427 +++ b/Documentation/hwmon/pc87427 @@ -34,5 +34,5 @@ Fan rotation speeds are reported as 14-bit values from a gated clock signal. Speeds down to 83 RPM can be measured. An alarm is triggered if the rotation speed drops below a programmable -limit. Another alarm is triggered if the speed is too low to to be measured +limit. Another alarm is triggered if the speed is too low to be measured (including stalled or missing fan). diff --git a/Documentation/i2c/busses/i2c-piix4 b/Documentation/i2c/busses/i2c-piix4 index f889481762b..c5b37c57055 100644 --- a/Documentation/i2c/busses/i2c-piix4 +++ b/Documentation/i2c/busses/i2c-piix4 @@ -8,6 +8,8 @@ Supported adapters: Datasheet: Only available via NDA from ServerWorks * ATI IXP200, IXP300, IXP400, SB600, SB700 and SB800 southbridges Datasheet: Not publicly available + * AMD SB900 + Datasheet: Not publicly available * Standard Microsystems (SMSC) SLC90E66 (Victory66) southbridge Datasheet: Publicly available at the SMSC website http://www.smsc.com diff --git a/Documentation/i2c/chips/pca9539 b/Documentation/i2c/chips/pca9539 deleted file mode 100644 index 6aff890088b..00000000000 --- a/Documentation/i2c/chips/pca9539 +++ /dev/null @@ -1,58 +0,0 @@ -Kernel driver pca9539 -===================== - -NOTE: this driver is deprecated and will be dropped soon, use -drivers/gpio/pca9539.c instead. - -Supported chips: - * Philips PCA9539 - Prefix: 'pca9539' - Addresses scanned: none - Datasheet: - http://www.semiconductors.philips.com/acrobat/datasheets/PCA9539_2.pdf - -Author: Ben Gardner <bgardner@wabtec.com> - - -Description ------------ - -The Philips PCA9539 is a 16 bit low power I/O device. -All 16 lines can be individually configured as an input or output. -The input sense can also be inverted. -The 16 lines are split between two bytes. - - -Detection ---------- - -The PCA9539 is difficult to detect and not commonly found in PC machines, -so you have to pass the I2C bus and address of the installed PCA9539 -devices explicitly to the driver at load time via the force=... parameter. - - -Sysfs entries -------------- - -Each is a byte that maps to the 8 I/O bits. -A '0' suffix is for bits 0-7, while '1' is for bits 8-15. - -input[01] - read the current value -output[01] - sets the output value -direction[01] - direction of each bit: 1=input, 0=output -invert[01] - toggle the input bit sense - -input reads the actual state of the line and is always available. -The direction defaults to input for all channels. - - -General Remarks ---------------- - -Note that each output, direction, and invert entry controls 8 lines. -You should use the read, modify, write sequence. -For example. to set output bit 0 of 1. - val=$(cat output0) - val=$(( $val | 1 )) - echo $val > output0 - diff --git a/Documentation/i2c/chips/pcf8574 b/Documentation/i2c/chips/pcf8574 deleted file mode 100644 index 235815c075f..00000000000 --- a/Documentation/i2c/chips/pcf8574 +++ /dev/null @@ -1,65 +0,0 @@ -Kernel driver pcf8574 -===================== - -Supported chips: - * Philips PCF8574 - Prefix: 'pcf8574' - Addresses scanned: none - Datasheet: Publicly available at the Philips Semiconductors website - http://www.semiconductors.philips.com/pip/PCF8574P.html - - * Philips PCF8574A - Prefix: 'pcf8574a' - Addresses scanned: none - Datasheet: Publicly available at the Philips Semiconductors website - http://www.semiconductors.philips.com/pip/PCF8574P.html - -Authors: - Frodo Looijaard <frodol@dds.nl>, - Philip Edelbrock <phil@netroedge.com>, - Dan Eaton <dan.eaton@rocketlogix.com>, - Aurelien Jarno <aurelien@aurel32.net>, - Jean Delvare <khali@linux-fr.org>, - - -Description ------------ -The PCF8574(A) is an 8-bit I/O expander for the I2C bus produced by Philips -Semiconductors. It is designed to provide a byte I2C interface to up to 16 -separate devices (8 x PCF8574 and 8 x PCF8574A). - -This device consists of a quasi-bidirectional port. Each of the eight I/Os -can be independently used as an input or output. To setup an I/O as an -input, you have to write a 1 to the corresponding output. - -For more informations see the datasheet. - - -Accessing PCF8574(A) via /sys interface -------------------------------------- - -The PCF8574(A) is plainly impossible to detect ! Stupid chip. -So, you have to pass the I2C bus and address of the installed PCF857A -and PCF8574A devices explicitly to the driver at load time via the -force=... parameter. - -On detection (i.e. insmod, modprobe et al.), directories are being -created for each detected PCF8574(A): - -/sys/bus/i2c/devices/<0>-<1>/ -where <0> is the bus the chip was detected on (e. g. i2c-0) -and <1> the chip address ([20..27] or [38..3f]): - -(example: /sys/bus/i2c/devices/1-0020/) - -Inside these directories, there are two files each: -read and write (and one file with chip name). - -The read file is read-only. Reading gives you the current I/O input -if the corresponding output is set as 1, otherwise the current output -value, that is to say 0. - -The write file is read/write. Writing a value outputs it on the I/O -port. Reading returns the last written value. As it is not possible -to read this value from the chip, you need to write at least once to -this file before you can read back from it. diff --git a/Documentation/i2c/chips/pcf8575 b/Documentation/i2c/chips/pcf8575 deleted file mode 100644 index 40b268eb276..00000000000 --- a/Documentation/i2c/chips/pcf8575 +++ /dev/null @@ -1,69 +0,0 @@ -About the PCF8575 chip and the pcf8575 kernel driver -==================================================== - -The PCF8575 chip is produced by the following manufacturers: - - * Philips NXP - http://www.nxp.com/#/pip/cb=[type=product,path=50807/41735/41850,final=PCF8575_3]|pip=[pip=PCF8575_3][0] - - * Texas Instruments - http://focus.ti.com/docs/prod/folders/print/pcf8575.html - - -Some vendors sell small PCB's with the PCF8575 mounted on it. You can connect -such a board to a Linux host via e.g. an USB to I2C interface. Examples of -PCB boards with a PCF8575: - - * SFE Breakout Board for PCF8575 I2C Expander by RobotShop - http://www.robotshop.ca/home/products/robot-parts/electronics/adapters-converters/sfe-pcf8575-i2c-expander-board.html - - * Breakout Board for PCF8575 I2C Expander by Spark Fun Electronics - http://www.sparkfun.com/commerce/product_info.php?products_id=8130 - - -Description ------------ -The PCF8575 chip is a 16-bit I/O expander for the I2C bus. Up to eight of -these chips can be connected to the same I2C bus. You can find this -chip on some custom designed hardware, but you won't find it on PC -motherboards. - -The PCF8575 chip consists of a 16-bit quasi-bidirectional port and an I2C-bus -interface. Each of the sixteen I/O's can be independently used as an input or -an output. To set up an I/O pin as an input, you have to write a 1 to the -corresponding output. - -For more information please see the datasheet. - - -Detection ---------- - -There is no method known to detect whether a chip on a given I2C address is -a PCF8575 or whether it is any other I2C device, so you have to pass the I2C -bus and address of the installed PCF8575 devices explicitly to the driver at -load time via the force=... parameter. - -/sys interface --------------- - -For each address on which a PCF8575 chip was found or forced the following -files will be created under /sys: -* /sys/bus/i2c/devices/<bus>-<address>/read -* /sys/bus/i2c/devices/<bus>-<address>/write -where bus is the I2C bus number (0, 1, ...) and address is the four-digit -hexadecimal representation of the 7-bit I2C address of the PCF8575 -(0020 .. 0027). - -The read file is read-only. Reading it will trigger an I2C read and will hence -report the current input state for the pins configured as inputs, and the -current output value for the pins configured as outputs. - -The write file is read-write. Writing a value to it will configure all pins -as output for which the corresponding bit is zero. Reading the write file will -return the value last written, or -EAGAIN if no value has yet been written to -the write file. - -On module initialization the configuration of the chip is not changed -- the -chip is left in the state it was already configured in through either power-up -or through previous I2C write actions. diff --git a/Documentation/i2c/instantiating-devices b/Documentation/i2c/instantiating-devices index c740b7b4108..e89490270ab 100644 --- a/Documentation/i2c/instantiating-devices +++ b/Documentation/i2c/instantiating-devices @@ -188,7 +188,7 @@ segment, the address is sufficient to uniquely identify the device to be deleted. Example: -# echo eeprom 0x50 > /sys/class/i2c-adapter/i2c-3/new_device +# echo eeprom 0x50 > /sys/bus/i2c/devices/i2c-3/new_device While this interface should only be used when in-kernel device declaration can't be done, there is a variety of cases where it can be helpful: diff --git a/Documentation/ia64/aliasing-test.c b/Documentation/ia64/aliasing-test.c index d23610fb2ff..3dfb76ca693 100644 --- a/Documentation/ia64/aliasing-test.c +++ b/Documentation/ia64/aliasing-test.c @@ -24,7 +24,7 @@ int sum; -int map_mem(char *path, off_t offset, size_t length, int touch) +static int map_mem(char *path, off_t offset, size_t length, int touch) { int fd, rc; void *addr; @@ -62,7 +62,7 @@ int map_mem(char *path, off_t offset, size_t length, int touch) return 0; } -int scan_tree(char *path, char *file, off_t offset, size_t length, int touch) +static int scan_tree(char *path, char *file, off_t offset, size_t length, int touch) { struct dirent **namelist; char *name, *path2; @@ -119,7 +119,7 @@ skip: char buf[1024]; -int read_rom(char *path) +static int read_rom(char *path) { int fd, rc; size_t size = 0; @@ -146,7 +146,7 @@ int read_rom(char *path) return size; } -int scan_rom(char *path, char *file) +static int scan_rom(char *path, char *file) { struct dirent **namelist; char *name, *path2; diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt index aafca0a8f66..947374977ca 100644 --- a/Documentation/ioctl/ioctl-number.txt +++ b/Documentation/ioctl/ioctl-number.txt @@ -135,6 +135,7 @@ Code Seq# Include File Comments <http://mikonos.dia.unisa.it/tcfs> 'l' 40-7F linux/udf_fs_i.h in development: <http://sourceforge.net/projects/linux-udf/> +'m' 00-09 linux/mmtimer.h 'm' all linux/mtio.h conflict! 'm' all linux/soundcard.h conflict! 'm' all linux/synclink.h conflict! diff --git a/Documentation/isdn/INTERFACE.CAPI b/Documentation/isdn/INTERFACE.CAPI index 686e107923e..5fe8de5cc72 100644 --- a/Documentation/isdn/INTERFACE.CAPI +++ b/Documentation/isdn/INTERFACE.CAPI @@ -60,10 +60,9 @@ open() operation on regular files or character devices. After a successful return from register_appl(), CAPI messages from the application may be passed to the driver for the device via calls to the -send_message() callback function. The CAPI message to send is stored in the -data portion of an skb. Conversely, the driver may call Kernel CAPI's -capi_ctr_handle_message() function to pass a received CAPI message to Kernel -CAPI for forwarding to an application, specifying its ApplID. +send_message() callback function. Conversely, the driver may call Kernel +CAPI's capi_ctr_handle_message() function to pass a received CAPI message to +Kernel CAPI for forwarding to an application, specifying its ApplID. Deregistration requests (CAPI operation CAPI_RELEASE) from applications are forwarded as calls to the release_appl() callback function, passing the same @@ -142,6 +141,7 @@ u16 (*send_message)(struct capi_ctr *ctrlr, struct sk_buff *skb) to accepting or queueing the message. Errors occurring during the actual processing of the message should be signaled with an appropriate reply message. + May be called in process or interrupt context. Calls to this function are not serialized by Kernel CAPI, ie. it must be prepared to be re-entered. @@ -154,7 +154,8 @@ read_proc_t *ctr_read_proc system entry, /proc/capi/controllers/<n>; will be called with a pointer to the device's capi_ctr structure as the last (data) argument -Note: Callback functions are never called in interrupt context. +Note: Callback functions except send_message() are never called in interrupt +context. - to be filled in before calling capi_ctr_ready(): @@ -171,14 +172,40 @@ u8 serial[CAPI_SERIAL_LEN] value to return for CAPI_GET_SERIAL -4.3 The _cmsg Structure +4.3 SKBs + +CAPI messages are passed between Kernel CAPI and the driver via send_message() +and capi_ctr_handle_message(), stored in the data portion of a socket buffer +(skb). Each skb contains a single CAPI message coded according to the CAPI 2.0 +standard. + +For the data transfer messages, DATA_B3_REQ and DATA_B3_IND, the actual +payload data immediately follows the CAPI message itself within the same skb. +The Data and Data64 parameters are not used for processing. The Data64 +parameter may be omitted by setting the length field of the CAPI message to 22 +instead of 30. + + +4.4 The _cmsg Structure (declared in <linux/isdn/capiutil.h>) The _cmsg structure stores the contents of a CAPI 2.0 message in an easily -accessible form. It contains members for all possible CAPI 2.0 parameters, of -which only those appearing in the message type currently being processed are -actually used. Unused members should be set to zero. +accessible form. It contains members for all possible CAPI 2.0 parameters, +including subparameters of the Additional Info and B Protocol structured +parameters, with the following exceptions: + +* second Calling party number (CONNECT_IND) + +* Data64 (DATA_B3_REQ and DATA_B3_IND) + +* Sending complete (subparameter of Additional Info, CONNECT_REQ and INFO_REQ) + +* Global Configuration (subparameter of B Protocol, CONNECT_REQ, CONNECT_RESP + and SELECT_B_PROTOCOL_REQ) + +Only those parameters appearing in the message type currently being processed +are actually used. Unused members should be set to zero. Members are named after the CAPI 2.0 standard names of the parameters they represent. See <linux/isdn/capiutil.h> for the exact spelling. Member data @@ -190,18 +217,19 @@ u16 for CAPI parameters of type 'word' u32 for CAPI parameters of type 'dword' -_cstruct for CAPI parameters of type 'struct' not containing any - variably-sized (struct) subparameters (eg. 'Called Party Number') +_cstruct for CAPI parameters of type 'struct' The member is a pointer to a buffer containing the parameter in CAPI encoding (length + content). It may also be NULL, which will be taken to represent an empty (zero length) parameter. + Subparameters are stored in encoded form within the content part. -_cmstruct for CAPI parameters of type 'struct' containing 'struct' - subparameters ('Additional Info' and 'B Protocol') +_cmstruct alternative representation for CAPI parameters of type 'struct' + (used only for the 'Additional Info' and 'B Protocol' parameters) The representation is a single byte containing one of the values: - CAPI_DEFAULT: the parameter is empty - CAPI_COMPOSE: the values of the subparameters are stored - individually in the corresponding _cmsg structure members + CAPI_DEFAULT: The parameter is empty/absent. + CAPI_COMPOSE: The parameter is present. + Subparameter values are stored individually in the corresponding + _cmsg structure members. Functions capi_cmsg2message() and capi_message2cmsg() are provided to convert messages between their transport encoding described in the CAPI 2.0 standard @@ -297,3 +325,26 @@ char *capi_cmd2str(u8 Command, u8 Subcommand) be NULL if the command/subcommand is not one of those defined in the CAPI 2.0 standard. + +7. Debugging + +The module kernelcapi has a module parameter showcapimsgs controlling some +debugging output produced by the module. It can only be set when the module is +loaded, via a parameter "showcapimsgs=<n>" to the modprobe command, either on +the command line or in the configuration file. + +If the lowest bit of showcapimsgs is set, kernelcapi logs controller and +application up and down events. + +In addition, every registered CAPI controller has an associated traceflag +parameter controlling how CAPI messages sent from and to tha controller are +logged. The traceflag parameter is initialized with the value of the +showcapimsgs parameter when the controller is registered, but can later be +changed via the MANUFACTURER_REQ command KCAPI_CMD_TRACE. + +If the value of traceflag is non-zero, CAPI messages are logged. +DATA_B3 messages are only logged if the value of traceflag is > 2. + +If the lowest bit of traceflag is set, only the command/subcommand and message +length are logged. Otherwise, kernelcapi logs a readable representation of +the entire message. diff --git a/Documentation/kbuild/kbuild.txt b/Documentation/kbuild/kbuild.txt index f3355b6812d..bb3bf38f03d 100644 --- a/Documentation/kbuild/kbuild.txt +++ b/Documentation/kbuild/kbuild.txt @@ -65,6 +65,22 @@ INSTALL_PATH INSTALL_PATH specifies where to place the updated kernel and system map images. Default is /boot, but you can set it to other values. +INSTALLKERNEL +-------------------------------------------------- +Install script called when using "make install". +The default name is "installkernel". + +The script will be called with the following arguments: + $1 - kernel version + $2 - kernel image file + $3 - kernel map file + $4 - default install path (use root directory if blank) + +The implmentation of "make install" is architecture specific +and it may differ from the above. + +INSTALLKERNEL is provided to enable the possibility to +specify a custom installer when cross compiling a kernel. MODLIB -------------------------------------------------- diff --git a/Documentation/kbuild/makefiles.txt b/Documentation/kbuild/makefiles.txt index d76cfd8712e..71c602d6168 100644 --- a/Documentation/kbuild/makefiles.txt +++ b/Documentation/kbuild/makefiles.txt @@ -18,6 +18,7 @@ This document describes the Linux kernel Makefiles. --- 3.9 Dependency tracking --- 3.10 Special Rules --- 3.11 $(CC) support functions + --- 3.12 $(LD) support functions === 4 Host Program support --- 4.1 Simple Host Program @@ -435,14 +436,14 @@ more details, with real examples. The second argument is optional, and if supplied will be used if first argument is not supported. - ld-option - ld-option is used to check if $(CC) when used to link object files + cc-ldoption + cc-ldoption is used to check if $(CC) when used to link object files supports the given option. An optional second option may be specified if first option are not supported. Example: #arch/i386/kernel/Makefile - vsyscall-flags += $(call ld-option, -Wl$(comma)--hash-style=sysv) + vsyscall-flags += $(call cc-ldoption, -Wl$(comma)--hash-style=sysv) In the above example, vsyscall-flags will be assigned the option -Wl$(comma)--hash-style=sysv if it is supported by $(CC). @@ -570,6 +571,19 @@ more details, with real examples. endif endif +--- 3.12 $(LD) support functions + + ld-option + ld-option is used to check if $(LD) supports the supplied option. + ld-option takes two options as arguments. + The second argument is an optional option that can be used if the + first option is not supported by $(LD). + + Example: + #Makefile + LDFLAGS_vmlinux += $(call really-ld-option, -X) + + === 4 Host Program support Kbuild supports building executables on the host for use during the diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 0f17d16dc10..9107b387e91 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -671,7 +671,8 @@ and is between 256 and 4096 characters. It is defined in the file earlyprintk= [X86,SH,BLACKFIN] earlyprintk=vga earlyprintk=serial[,ttySn[,baudrate]] - earlyprintk=dbgp + earlyprintk=ttySn[,baudrate] + earlyprintk=dbgp[debugController#] Append ",keep" to not disable it when the real console takes over. @@ -933,7 +934,7 @@ and is between 256 and 4096 characters. It is defined in the file 1 -- enable informational integrity auditing messages. ima_hash= [IMA] - Formt: { "sha1" | "md5" } + Format: { "sha1" | "md5" } default: "sha1" ima_tcb [IMA] diff --git a/Documentation/kmemcheck.txt b/Documentation/kmemcheck.txt index 363044609da..c28f82895d6 100644 --- a/Documentation/kmemcheck.txt +++ b/Documentation/kmemcheck.txt @@ -43,26 +43,7 @@ feature. 1. Downloading ============== -kmemcheck can only be downloaded using git. If you want to write patches -against the current code, you should use the kmemcheck development branch of -the tip tree. It is also possible to use the linux-next tree, which also -includes the latest version of kmemcheck. - -Assuming that you've already cloned the linux-2.6.git repository, all you -have to do is add the -tip tree as a remote, like this: - - $ git remote add tip git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git - -To actually download the tree, fetch the remote: - - $ git fetch tip - -And to check out a new local branch with the kmemcheck code: - - $ git checkout -b kmemcheck tip/kmemcheck - -General instructions for the -tip tree can be found here: -http://people.redhat.com/mingo/tip.git/readme.txt +As of version 2.6.31-rc1, kmemcheck is included in the mainline kernel. 2. Configuring and compiling diff --git a/Documentation/laptops/asus-laptop.txt b/Documentation/laptops/asus-laptop.txt new file mode 100644 index 00000000000..c1c5be84e4b --- /dev/null +++ b/Documentation/laptops/asus-laptop.txt @@ -0,0 +1,258 @@ +Asus Laptop Extras + +Version 0.1 +August 6, 2009 + +Corentin Chary <corentincj@iksaif.net> +http://acpi4asus.sf.net/ + + This driver provides support for extra features of ACPI-compatible ASUS laptops. + It may also support some MEDION, JVC or VICTOR laptops (such as MEDION 9675 or + VICTOR XP7210 for example). It makes all the extra buttons generate standard + ACPI events that go through /proc/acpi/events and input events (like keyboards). + On some models adds support for changing the display brightness and output, + switching the LCD backlight on and off, and most importantly, allows you to + blink those fancy LEDs intended for reporting mail and wireless status. + +This driver supercedes the old asus_acpi driver. + +Requirements +------------ + + Kernel 2.6.X sources, configured for your computer, with ACPI support. + You also need CONFIG_INPUT and CONFIG_ACPI. + +Status +------ + + The features currently supported are the following (see below for + detailed description): + + - Fn key combinations + - Bluetooth enable and disable + - Wlan enable and disable + - GPS enable and disable + - Video output switching + - Ambient Light Sensor on and off + - LED control + - LED Display control + - LCD brightness control + - LCD on and off + + A compatibility table by model and feature is maintained on the web + site, http://acpi4asus.sf.net/. + +Usage +----- + + Try "modprobe asus_acpi". Check your dmesg (simply type dmesg). You should + see some lines like this : + + Asus Laptop Extras version 0.42 + L2D model detected. + + If it is not the output you have on your laptop, send it (and the laptop's + DSDT) to me. + + That's all, now, all the events generated by the hotkeys of your laptop + should be reported in your /proc/acpi/event entry. You can check with + "acpi_listen". + + Hotkeys are also reported as input keys (like keyboards) you can check + which key are supported using "xev" under X11. + + You can get informations on the version of your DSDT table by reading the + /sys/devices/platform/asus-laptop/infos entry. If you have a question or a + bug report to do, please include the output of this entry. + +LEDs +---- + + You can modify LEDs be echoing values to /sys/class/leds/asus::*/brightness : + echo 1 > /sys/class/leds/asus::mail/brightness + will switch the mail LED on. + You can also know if they are on/off by reading their content and use + kernel triggers like ide-disk or heartbeat. + +Backlight +--------- + + You can control lcd backlight power and brightness with + /sys/class/backlight/asus-laptop/. Brightness Values are between 0 and 15. + +Wireless devices +--------------- + + You can turn the internal Bluetooth adapter on/off with the bluetooth entry + (only on models with Bluetooth). This usually controls the associated LED. + Same for Wlan adapter. + +Display switching +----------------- + + Note: the display switching code is currently considered EXPERIMENTAL. + + Switching works for the following models: + L3800C + A2500H + L5800C + M5200N + W1000N (albeit with some glitches) + M6700R + A6JC + F3J + + Switching doesn't work for the following: + M3700N + L2X00D (locks the laptop under certain conditions) + + To switch the displays, echo values from 0 to 15 to + /sys/devices/platform/asus-laptop/display. The significance of those values + is as follows: + + +-------+-----+-----+-----+-----+-----+ + | Bin | Val | DVI | TV | CRT | LCD | + +-------+-----+-----+-----+-----+-----+ + + 0000 + 0 + + + + + + +-------+-----+-----+-----+-----+-----+ + + 0001 + 1 + + + + X + + +-------+-----+-----+-----+-----+-----+ + + 0010 + 2 + + + X + + + +-------+-----+-----+-----+-----+-----+ + + 0011 + 3 + + + X + X + + +-------+-----+-----+-----+-----+-----+ + + 0100 + 4 + + X + + + + +-------+-----+-----+-----+-----+-----+ + + 0101 + 5 + + X + + X + + +-------+-----+-----+-----+-----+-----+ + + 0110 + 6 + + X + X + + + +-------+-----+-----+-----+-----+-----+ + + 0111 + 7 + + X + X + X + + +-------+-----+-----+-----+-----+-----+ + + 1000 + 8 + X + + + + + +-------+-----+-----+-----+-----+-----+ + + 1001 + 9 + X + + + X + + +-------+-----+-----+-----+-----+-----+ + + 1010 + 10 + X + + X + + + +-------+-----+-----+-----+-----+-----+ + + 1011 + 11 + X + + X + X + + +-------+-----+-----+-----+-----+-----+ + + 1100 + 12 + X + X + + + + +-------+-----+-----+-----+-----+-----+ + + 1101 + 13 + X + X + + X + + +-------+-----+-----+-----+-----+-----+ + + 1110 + 14 + X + X + X + + + +-------+-----+-----+-----+-----+-----+ + + 1111 + 15 + X + X + X + X + + +-------+-----+-----+-----+-----+-----+ + + In most cases, the appropriate displays must be plugged in for the above + combinations to work. TV-Out may need to be initialized at boot time. + + Debugging: + 1) Check whether the Fn+F8 key: + a) does not lock the laptop (try disabling CONFIG_X86_UP_APIC or boot with + noapic / nolapic if it does) + b) generates events (0x6n, where n is the value corresponding to the + configuration above) + c) actually works + Record the disp value at every configuration. + 2) Echo values from 0 to 15 to /sys/devices/platform/asus-laptop/display. + Record its value, note any change. If nothing changes, try a broader range, + up to 65535. + 3) Send ANY output (both positive and negative reports are needed, unless your + machine is already listed above) to the acpi4asus-user mailing list. + + Note: on some machines (e.g. L3C), after the module has been loaded, only 0x6n + events are generated and no actual switching occurs. In such a case, a line + like: + + echo $((10#$arg-60)) > /sys/devices/platform/asus-laptop/display + + will usually do the trick ($arg is the 0000006n-like event passed to acpid). + + Note: there is currently no reliable way to read display status on xxN + (Centrino) models. + +LED display +----------- + + Some models like the W1N have a LED display that can be used to display + several informations. + + LED display works for the following models: + W1000N + W1J + + To control the LED display, use the following : + + echo 0x0T000DDD > /sys/devices/platform/asus-laptop/ + + where T control the 3 letters display, and DDD the 3 digits display, + according to the tables below. + + DDD (digits) + 000 to 999 = display digits + AAA = --- + BBB to FFF = turn-off + + T (type) + 0 = off + 1 = dvd + 2 = vcd + 3 = mp3 + 4 = cd + 5 = tv + 6 = cpu + 7 = vol + + For example "echo 0x01000001 >/sys/devices/platform/asus-laptop/ledd" + would display "DVD001". + +Driver options: +--------------- + + Options can be passed to the asus-laptop driver using the standard + module argument syntax (<param>=<value> when passing the option to the + module or asus-laptop.<param>=<value> on the kernel boot line when + asus-laptop is statically linked into the kernel). + + wapf: WAPF defines the behavior of the Fn+Fx wlan key + The significance of values is yet to be found, but + most of the time: + - 0x0 should do nothing + - 0x1 should allow to control the device with Fn+Fx key. + - 0x4 should send an ACPI event (0x88) while pressing the Fn+Fx key + - 0x5 like 0x1 or 0x4 + + The default value is 0x1. + +Unsupported models +------------------ + + These models will never be supported by this module, as they use a completely + different mechanism to handle LEDs and extra stuff (meaning we have no clue + how it works): + + - ASUS A1300 (A1B), A1370D + - ASUS L7300G + - ASUS L8400 + +Patches, Errors, Questions: +-------------------------- + + I appreciate any success or failure + reports, especially if they add to or correct the compatibility table. + Please include the following information in your report: + + - Asus model name + - a copy of your ACPI tables, using the "acpidump" utility + - a copy of /sys/devices/platform/asus-laptop/infos + - which driver features work and which don't + - the observed behavior of non-working features + + Any other comments or patches are also more than welcome. + + acpi4asus-user@lists.sourceforge.net + http://sourceforge.net/projects/acpi4asus + diff --git a/Documentation/laptops/thinkpad-acpi.txt b/Documentation/laptops/thinkpad-acpi.txt index e2ddcdeb61b..aafcaa63419 100644 --- a/Documentation/laptops/thinkpad-acpi.txt +++ b/Documentation/laptops/thinkpad-acpi.txt @@ -199,18 +199,22 @@ kind to allow it (and it often doesn't!). Not all bits in the mask can be modified. Not all bits that can be modified do anything. Not all hot keys can be individually controlled -by the mask. Some models do not support the mask at all, and in those -models, hot keys cannot be controlled individually. The behaviour of -the mask is, therefore, highly dependent on the ThinkPad model. +by the mask. Some models do not support the mask at all. The behaviour +of the mask is, therefore, highly dependent on the ThinkPad model. + +The driver will filter out any unmasked hotkeys, so even if the firmware +doesn't allow disabling an specific hotkey, the driver will not report +events for unmasked hotkeys. Note that unmasking some keys prevents their default behavior. For example, if Fn+F5 is unmasked, that key will no longer enable/disable -Bluetooth by itself. +Bluetooth by itself in firmware. -Note also that not all Fn key combinations are supported through ACPI. -For example, on the X40, the brightness, volume and "Access IBM" buttons -do not generate ACPI events even with this driver. They *can* be used -through the "ThinkPad Buttons" utility, see http://www.nongnu.org/tpb/ +Note also that not all Fn key combinations are supported through ACPI +depending on the ThinkPad model and firmware version. On those +ThinkPads, it is still possible to support some extra hotkeys by +polling the "CMOS NVRAM" at least 10 times per second. The driver +attempts to enables this functionality automatically when required. procfs notes: @@ -219,7 +223,7 @@ The following commands can be written to the /proc/acpi/ibm/hotkey file: echo 0xffffffff > /proc/acpi/ibm/hotkey -- enable all hot keys echo 0 > /proc/acpi/ibm/hotkey -- disable all possible hot keys ... any other 8-hex-digit mask ... - echo reset > /proc/acpi/ibm/hotkey -- restore the original mask + echo reset > /proc/acpi/ibm/hotkey -- restore the recommended mask The following commands have been deprecated and will cause the kernel to log a warning: @@ -240,9 +244,13 @@ sysfs notes: Returns 0. hotkey_bios_mask: + DEPRECATED, DON'T USE, WILL BE REMOVED IN THE FUTURE. + Returns the hot keys mask when thinkpad-acpi was loaded. Upon module unload, the hot keys mask will be restored - to this value. + to this value. This is always 0x80c, because those are + the hotkeys that were supported by ancient firmware + without mask support. hotkey_enable: DEPRECATED, WILL BE REMOVED SOON. @@ -251,18 +259,11 @@ sysfs notes: 1: does nothing hotkey_mask: - bit mask to enable driver-handling (and depending on + bit mask to enable reporting (and depending on the firmware, ACPI event generation) for each hot key (see above). Returns the current status of the hot keys mask, and allows one to modify it. - Note: when NVRAM polling is active, the firmware mask - will be different from the value returned by - hotkey_mask. The driver will retain enabled bits for - hotkeys that are under NVRAM polling even if the - firmware refuses them, and will not set these bits on - the firmware hot key mask. - hotkey_all_mask: bit mask that should enable event reporting for all supported hot keys, when echoed to hotkey_mask above. @@ -275,7 +276,8 @@ sysfs notes: bit mask that should enable event reporting for all supported hot keys, except those which are always handled by the firmware anyway. Echo it to - hotkey_mask above, to use. + hotkey_mask above, to use. This is the default mask + used by the driver. hotkey_source_mask: bit mask that selects which hot keys will the driver @@ -283,9 +285,10 @@ sysfs notes: based on the capabilities reported by the ACPI firmware, but it can be overridden at runtime. - Hot keys whose bits are set in both hotkey_source_mask - and also on hotkey_mask are polled for in NVRAM. Only a - few hot keys are available through CMOS NVRAM polling. + Hot keys whose bits are set in hotkey_source_mask are + polled for in NVRAM, and reported as hotkey events if + enabled in hotkey_mask. Only a few hot keys are + available through CMOS NVRAM polling. Warning: when in NVRAM mode, the volume up/down/mute keys are synthesized according to changes in the mixer, @@ -521,6 +524,7 @@ compatibility purposes when hotkey_report_mode is set to 1. 0x2305 System is waking up from suspend to eject bay 0x2404 System is waking up from hibernation to undock 0x2405 System is waking up from hibernation to eject bay +0x5010 Brightness level changed/control event The above events are never propagated by the driver. @@ -528,7 +532,6 @@ The above events are never propagated by the driver. 0x4003 Undocked (see 0x2x04), can sleep again 0x500B Tablet pen inserted into its storage bay 0x500C Tablet pen removed from its storage bay -0x5010 Brightness level changed (newer Lenovo BIOSes) The above events are propagated by the driver. @@ -617,6 +620,8 @@ For Lenovo models *with* ACPI backlight control: 2. Do *NOT* load up ACPI video, enable the hotkeys in thinkpad-acpi, and map them to KEY_BRIGHTNESS_UP and KEY_BRIGHTNESS_DOWN. Process these keys on userspace somehow (e.g. by calling xbacklight). + The driver will do this automatically if it detects that ACPI video + has been disabled. Bluetooth @@ -1455,3 +1460,8 @@ Sysfs interface changelog: 0x020400: Marker for 16 LEDs support. Also, LEDs that are known to not exist in a given model are not registered with the LED sysfs class anymore. + +0x020500: Updated hotkey driver, hotkey_mask is always available + and it is always able to disable hot keys. Very old + thinkpads are properly supported. hotkey_bios_mask + is deprecated and marked for removal. diff --git a/Documentation/leds-class.txt b/Documentation/leds-class.txt index 6399557cdab..8fd5ca2ae32 100644 --- a/Documentation/leds-class.txt +++ b/Documentation/leds-class.txt @@ -1,3 +1,4 @@ + LED handling under Linux ======================== @@ -5,10 +6,10 @@ If you're reading this and thinking about keyboard leds, these are handled by the input subsystem and the led class is *not* needed. In its simplest form, the LED class just allows control of LEDs from -userspace. LEDs appear in /sys/class/leds/. The brightness file will -set the brightness of the LED (taking a value 0-255). Most LEDs don't -have hardware brightness support so will just be turned on for non-zero -brightness settings. +userspace. LEDs appear in /sys/class/leds/. The maximum brightness of the +LED is defined in max_brightness file. The brightness file will set the brightness +of the LED (taking a value 0-max_brightness). Most LEDs don't have hardware +brightness support so will just be turned on for non-zero brightness settings. The class also introduces the optional concept of an LED trigger. A trigger is a kernel based source of led events. Triggers can either be simple or diff --git a/Documentation/lguest/lguest.c b/Documentation/lguest/lguest.c index 950cde6d6e5..ba9373f82ab 100644 --- a/Documentation/lguest/lguest.c +++ b/Documentation/lguest/lguest.c @@ -42,6 +42,7 @@ #include <signal.h> #include "linux/lguest_launcher.h" #include "linux/virtio_config.h" +#include <linux/virtio_ids.h> #include "linux/virtio_net.h" #include "linux/virtio_blk.h" #include "linux/virtio_console.h" @@ -133,6 +134,9 @@ struct device { /* Is it operational */ bool running; + /* Does Guest want an intrrupt on empty? */ + bool irq_on_empty; + /* Device-specific data. */ void *priv; }; @@ -623,10 +627,13 @@ static void trigger_irq(struct virtqueue *vq) return; vq->pending_used = 0; - /* If they don't want an interrupt, don't send one, unless empty. */ - if ((vq->vring.avail->flags & VRING_AVAIL_F_NO_INTERRUPT) - && lg_last_avail(vq) != vq->vring.avail->idx) - return; + /* If they don't want an interrupt, don't send one... */ + if (vq->vring.avail->flags & VRING_AVAIL_F_NO_INTERRUPT) { + /* ... unless they've asked us to force one on empty. */ + if (!vq->dev->irq_on_empty + || lg_last_avail(vq) != vq->vring.avail->idx) + return; + } /* Send the Guest an interrupt tell them we used something up. */ if (write(lguest_fd, buf, sizeof(buf)) != 0) @@ -1042,6 +1049,15 @@ static void create_thread(struct virtqueue *vq) close(vq->eventfd); } +static bool accepted_feature(struct device *dev, unsigned int bit) +{ + const u8 *features = get_feature_bits(dev) + dev->feature_len; + + if (dev->feature_len < bit / CHAR_BIT) + return false; + return features[bit / CHAR_BIT] & (1 << (bit % CHAR_BIT)); +} + static void start_device(struct device *dev) { unsigned int i; @@ -1055,6 +1071,8 @@ static void start_device(struct device *dev) verbose(" %02x", get_feature_bits(dev) [dev->feature_len+i]); + dev->irq_on_empty = accepted_feature(dev, VIRTIO_F_NOTIFY_ON_EMPTY); + for (vq = dev->vq; vq; vq = vq->next) { if (vq->service) create_thread(vq); diff --git a/Documentation/memory.txt b/Documentation/memory.txt index 2b3dedd3953..802efe58647 100644 --- a/Documentation/memory.txt +++ b/Documentation/memory.txt @@ -1,18 +1,7 @@ There are several classic problems related to memory on Linux systems. - 1) There are some buggy motherboards which cannot properly - deal with the memory above 16MB. Consider exchanging - your motherboard. - - 2) You cannot do DMA on the ISA bus to addresses above - 16M. Most device drivers under Linux allow the use - of bounce buffers which work around this problem. Drivers - that don't use bounce buffers will be unstable with - more than 16M installed. Drivers that use bounce buffers - will be OK, but may have slightly higher overhead. - - 3) There are some motherboards that will not cache above + 1) There are some motherboards that will not cache above a certain quantity of memory. If you have one of these motherboards, your system will be SLOWER, not faster as you add more memory. Consider exchanging your @@ -24,7 +13,7 @@ It can also tell Linux to use less memory than is actually installed. If you use "mem=" on a machine with PCI, consider using "memmap=" to avoid physical address space collisions. -See the documentation of your boot loader (LILO, loadlin, etc.) about +See the documentation of your boot loader (LILO, grub, loadlin, etc.) about how to pass options to the kernel. There are other memory problems which Linux cannot deal with. Random @@ -42,19 +31,3 @@ Try: with the vendor. Consider testing it with memtest86 yourself. * Exchanging your CPU, cache, or motherboard for one that works. - - * Disabling the cache from the BIOS. - - * Try passing the "mem=4M" option to the kernel to limit - Linux to using a very small amount of memory. Use "memmap="-option - together with "mem=" on systems with PCI to avoid physical address - space collisions. - - -Other tricks: - - * Try passing the "no-387" option to the kernel to ignore - a buggy FPU. - - * Try passing the "no-hlt" option to disable the potentially - buggy HLT instruction in your CPU. diff --git a/Documentation/i2c/chips/eeprom b/Documentation/misc-devices/eeprom index f7e8104b576..f7e8104b576 100644 --- a/Documentation/i2c/chips/eeprom +++ b/Documentation/misc-devices/eeprom diff --git a/Documentation/i2c/chips/max6875 b/Documentation/misc-devices/max6875 index 10ca43cd1a7..1e89ee3ccc1 100644 --- a/Documentation/i2c/chips/max6875 +++ b/Documentation/misc-devices/max6875 @@ -42,10 +42,12 @@ General Remarks Valid addresses for the MAX6875 are 0x50 and 0x52. Valid addresses for the MAX6874 are 0x50, 0x52, 0x54 and 0x56. -The driver does not probe any address, so you must force the address. +The driver does not probe any address, so you explicitly instantiate the +devices. Example: -$ modprobe max6875 force=0,0x50 +$ modprobe max6875 +$ echo max6875 0x50 > /sys/bus/i2c/devices/i2c-0/new_device The MAX6874/MAX6875 ignores address bit 0, so this driver attaches to multiple addresses. For example, for address 0x50, it also reserves 0x51. diff --git a/Documentation/networking/pktgen.txt b/Documentation/networking/pktgen.txt index c6cf4a3c16e..61bb645d50e 100644 --- a/Documentation/networking/pktgen.txt +++ b/Documentation/networking/pktgen.txt @@ -90,6 +90,11 @@ Examples: pgset "dstmac 00:00:00:00:00:00" sets MAC destination address pgset "srcmac 00:00:00:00:00:00" sets MAC source address + pgset "queue_map_min 0" Sets the min value of tx queue interval + pgset "queue_map_max 7" Sets the max value of tx queue interval, for multiqueue devices + To select queue 1 of a given device, + use queue_map_min=1 and queue_map_max=1 + pgset "src_mac_count 1" Sets the number of MACs we'll range through. The 'minimum' MAC is what you set with srcmac. @@ -101,6 +106,9 @@ Examples: IPDST_RND, UDPSRC_RND, UDPDST_RND, MACSRC_RND, MACDST_RND MPLS_RND, VID_RND, SVID_RND + QUEUE_MAP_RND # queue map random + QUEUE_MAP_CPU # queue map mirrors smp_processor_id() + pgset "udp_src_min 9" set UDP source port min, If < udp_src_max, then cycle through the port range. diff --git a/Documentation/networking/regulatory.txt b/Documentation/networking/regulatory.txt index eaa1a25946c..ee31369e9e5 100644 --- a/Documentation/networking/regulatory.txt +++ b/Documentation/networking/regulatory.txt @@ -96,7 +96,7 @@ Example code - drivers hinting an alpha2: This example comes from the zd1211rw device driver. You can start by having a mapping of your device's EEPROM country/regulatory -domain value to to a specific alpha2 as follows: +domain value to a specific alpha2 as follows: static struct zd_reg_alpha2_map reg_alpha2_map[] = { { ZD_REGDOMAIN_FCC, "US" }, diff --git a/Documentation/networking/timestamping/timestamping.c b/Documentation/networking/timestamping/timestamping.c index 43d14310421..a7936fe8444 100644 --- a/Documentation/networking/timestamping/timestamping.c +++ b/Documentation/networking/timestamping/timestamping.c @@ -381,7 +381,7 @@ int main(int argc, char **argv) memset(&hwtstamp, 0, sizeof(hwtstamp)); strncpy(hwtstamp.ifr_name, interface, sizeof(hwtstamp.ifr_name)); hwtstamp.ifr_data = (void *)&hwconfig; - memset(&hwconfig, 0, sizeof(&hwconfig)); + memset(&hwconfig, 0, sizeof(hwconfig)); hwconfig.tx_type = (so_timestamping_flags & SOF_TIMESTAMPING_TX_HARDWARE) ? HWTSTAMP_TX_ON : HWTSTAMP_TX_OFF; diff --git a/Documentation/numastat.txt b/Documentation/numastat.txt index 80133ace1eb..9fcc9a608dc 100644 --- a/Documentation/numastat.txt +++ b/Documentation/numastat.txt @@ -7,10 +7,10 @@ All units are pages. Hugepages have separate counters. numa_hit A process wanted to allocate memory from this node, and succeeded. -numa_miss A process wanted to allocate memory from this node, - but ended up with memory from another. -numa_foreign A process wanted to allocate on another node, - but ended up with memory from this one. +numa_miss A process wanted to allocate memory from another node, + but ended up with memory from this node. +numa_foreign A process wanted to allocate on this node, + but ended up with memory from another one. local_node A process ran on this node and got memory from it. other_node A process ran on this node and got memory from another node. interleave_hit Interleaving wanted to allocate from this node diff --git a/Documentation/pcmcia/crc32hash.c b/Documentation/pcmcia/crc32hash.c index 4210e5abab8..44f8beea726 100644 --- a/Documentation/pcmcia/crc32hash.c +++ b/Documentation/pcmcia/crc32hash.c @@ -8,7 +8,7 @@ $ ./crc32hash "Dual Speed" #include <ctype.h> #include <stdlib.h> -unsigned int crc32(unsigned char const *p, unsigned int len) +static unsigned int crc32(unsigned char const *p, unsigned int len) { int i; unsigned int crc = 0; diff --git a/Documentation/power/power_supply_class.txt b/Documentation/power/power_supply_class.txt index c6cd4956047..9f16c5178b6 100644 --- a/Documentation/power/power_supply_class.txt +++ b/Documentation/power/power_supply_class.txt @@ -76,6 +76,11 @@ STATUS - this attribute represents operating status (charging, full, discharging (i.e. powering a load), etc.). This corresponds to BATTERY_STATUS_* values, as defined in battery.h. +CHARGE_TYPE - batteries can typically charge at different rates. +This defines trickle and fast charges. For batteries that +are already charged or discharging, 'n/a' can be displayed (or +'unknown', if the status is not known). + HEALTH - represents health of the battery, values corresponds to POWER_SUPPLY_HEALTH_*, defined in battery.h. @@ -108,6 +113,8 @@ relative, time-based measurements. ENERGY_FULL, ENERGY_EMPTY - same as above but for energy. CAPACITY - capacity in percents. +CAPACITY_LEVEL - capacity level. This corresponds to +POWER_SUPPLY_CAPACITY_LEVEL_*. TEMP - temperature of the power supply. TEMP_AMBIENT - ambient temperature. diff --git a/Documentation/power/regulator/design.txt b/Documentation/power/regulator/design.txt new file mode 100644 index 00000000000..f9b56b72b78 --- /dev/null +++ b/Documentation/power/regulator/design.txt @@ -0,0 +1,33 @@ +Regulator API design notes +========================== + +This document provides a brief, partially structured, overview of some +of the design considerations which impact the regulator API design. + +Safety +------ + + - Errors in regulator configuration can have very serious consequences + for the system, potentially including lasting hardware damage. + - It is not possible to automatically determine the power confugration + of the system - software-equivalent variants of the same chip may + have different power requirments, and not all components with power + requirements are visible to software. + + => The API should make no changes to the hardware state unless it has + specific knowledge that these changes are safe to do perform on + this particular system. + +Consumer use cases +------------------ + + - The overwhelming majority of devices in a system will have no + requirement to do any runtime configuration of their power beyond + being able to turn it on or off. + + - Many of the power supplies in the system will be shared between many + different consumers. + + => The consumer API should be structured so that these use cases are + very easy to handle and so that consumers will work with shared + supplies without any additional effort. diff --git a/Documentation/power/regulator/machine.txt b/Documentation/power/regulator/machine.txt index ce3487d99ab..63728fed620 100644 --- a/Documentation/power/regulator/machine.txt +++ b/Documentation/power/regulator/machine.txt @@ -87,7 +87,7 @@ static struct platform_device regulator_devices[] = { }, }; /* register regulator 1 device */ -platform_device_register(&wm8350_regulator_devices[0]); +platform_device_register(®ulator_devices[0]); /* register regulator 2 device */ -platform_device_register(&wm8350_regulator_devices[1]); +platform_device_register(®ulator_devices[1]); diff --git a/Documentation/power/regulator/overview.txt b/Documentation/power/regulator/overview.txt index 0cded696ca0..ffd185bb605 100644 --- a/Documentation/power/regulator/overview.txt +++ b/Documentation/power/regulator/overview.txt @@ -29,7 +29,7 @@ Some terms used in this document:- o PMIC - Power Management IC. An IC that contains numerous regulators - and often contains other susbsystems. + and often contains other subsystems. o Consumer - Electronic device that is supplied power by a regulator. @@ -168,4 +168,4 @@ relevant to non SoC devices and is split into the following four interfaces:- userspace via sysfs. This could be used to help monitor device power consumption and status. - See Documentation/ABI/testing/regulator-sysfs.txt + See Documentation/ABI/testing/sysfs-class-regulator diff --git a/Documentation/power/regulator/regulator.txt b/Documentation/power/regulator/regulator.txt index 4200accb9bb..3f8b528f237 100644 --- a/Documentation/power/regulator/regulator.txt +++ b/Documentation/power/regulator/regulator.txt @@ -10,8 +10,9 @@ Registration Drivers can register a regulator by calling :- -struct regulator_dev *regulator_register(struct device *dev, - struct regulator_desc *regulator_desc); +struct regulator_dev *regulator_register(struct regulator_desc *regulator_desc, + struct device *dev, struct regulator_init_data *init_data, + void *driver_data); This will register the regulators capabilities and operations to the regulator core. diff --git a/Documentation/powerpc/dts-bindings/fsl/esdhc.txt b/Documentation/powerpc/dts-bindings/fsl/esdhc.txt index 3ed3797b508..8a004073896 100644 --- a/Documentation/powerpc/dts-bindings/fsl/esdhc.txt +++ b/Documentation/powerpc/dts-bindings/fsl/esdhc.txt @@ -10,6 +10,8 @@ Required properties: - interrupts : should contain eSDHC interrupt. - interrupt-parent : interrupt source phandle. - clock-frequency : specifies eSDHC base clock frequency. + - sdhci,wp-inverted : (optional) specifies that eSDHC controller + reports inverted write-protect state; - sdhci,1-bit-only : (optional) specifies that a controller can only handle 1-bit data transfers. diff --git a/Documentation/powerpc/dts-bindings/marvell.txt b/Documentation/powerpc/dts-bindings/marvell.txt index 3708a2fd474..f1533d91953 100644 --- a/Documentation/powerpc/dts-bindings/marvell.txt +++ b/Documentation/powerpc/dts-bindings/marvell.txt @@ -32,7 +32,7 @@ prefixed with the string "marvell,", for Marvell Technology Group Ltd. devices. This field represents the number of cells needed to represent the address of the memory-mapped registers of devices within the system controller chip. - - #size-cells : Size representation for for the memory-mapped + - #size-cells : Size representation for the memory-mapped registers within the system controller chip. - #interrupt-cells : Defines the width of cells used to represent interrupts. diff --git a/Documentation/powerpc/dts-bindings/mtd-physmap.txt b/Documentation/powerpc/dts-bindings/mtd-physmap.txt index 667c9bde869..80152cb567d 100644 --- a/Documentation/powerpc/dts-bindings/mtd-physmap.txt +++ b/Documentation/powerpc/dts-bindings/mtd-physmap.txt @@ -1,18 +1,19 @@ -CFI or JEDEC memory-mapped NOR flash +CFI or JEDEC memory-mapped NOR flash, MTD-RAM (NVRAM...) Flash chips (Memory Technology Devices) are often used for solid state file systems on embedded devices. - - compatible : should contain the specific model of flash chip(s) - used, if known, followed by either "cfi-flash" or "jedec-flash" - - reg : Address range(s) of the flash chip(s) + - compatible : should contain the specific model of mtd chip(s) + used, if known, followed by either "cfi-flash", "jedec-flash" + or "mtd-ram". + - reg : Address range(s) of the mtd chip(s) It's possible to (optionally) define multiple "reg" tuples so that - non-identical NOR chips can be described in one flash node. - - bank-width : Width (in bytes) of the flash bank. Equal to the + non-identical chips can be described in one node. + - bank-width : Width (in bytes) of the bank. Equal to the device width times the number of interleaved chips. - - device-width : (optional) Width of a single flash chip. If + - device-width : (optional) Width of a single mtd chip. If omitted, assumed to be equal to 'bank-width'. - - #address-cells, #size-cells : Must be present if the flash has + - #address-cells, #size-cells : Must be present if the device has sub-nodes representing partitions (see below). In this case both #address-cells and #size-cells must be equal to 1. @@ -22,24 +23,24 @@ are defined: - vendor-id : Contains the flash chip's vendor id (1 byte). - device-id : Contains the flash chip's device id (1 byte). -In addition to the information on the flash bank itself, the +In addition to the information on the mtd bank itself, the device tree may optionally contain additional information -describing partitions of the flash address space. This can be +describing partitions of the address space. This can be used on platforms which have strong conventions about which -portions of the flash are used for what purposes, but which don't +portions of a flash are used for what purposes, but which don't use an on-flash partition table such as RedBoot. -Each partition is represented as a sub-node of the flash device. +Each partition is represented as a sub-node of the mtd device. Each node's name represents the name of the corresponding -partition of the flash device. +partition of the mtd device. Flash partitions - - reg : The partition's offset and size within the flash bank. - - label : (optional) The label / name for this flash partition. + - reg : The partition's offset and size within the mtd bank. + - label : (optional) The label / name for this partition. If omitted, the label is taken from the node name (excluding the unit address). - read-only : (optional) This parameter, if present, is a hint to - Linux that this flash partition should only be mounted + Linux that this partition should only be mounted read-only. This is usually used for flash partitions containing early-boot firmware images or data which should not be clobbered. @@ -78,3 +79,12 @@ Here an example with multiple "reg" tuples: reg = <0 0x04000000>; }; }; + +An example using SRAM: + + sram@2,0 { + compatible = "samsung,k6f1616u6a", "mtd-ram"; + reg = <2 0 0x00200000>; + bank-width = <2>; + }; + diff --git a/Documentation/rtc.txt b/Documentation/rtc.txt index 8deffcd68cb..9104c106208 100644 --- a/Documentation/rtc.txt +++ b/Documentation/rtc.txt @@ -135,6 +135,30 @@ a high functionality RTC is integrated into the SOC. That system might read the system clock from the discrete RTC, but use the integrated one for all other tasks, because of its greater functionality. +SYSFS INTERFACE +--------------- + +The sysfs interface under /sys/class/rtc/rtcN provides access to various +rtc attributes without requiring the use of ioctls. All dates and times +are in the RTC's timezone, rather than in system time. + +date: RTC-provided date +hctosys: 1 if the RTC provided the system time at boot via the + CONFIG_RTC_HCTOSYS kernel option, 0 otherwise +max_user_freq: The maximum interrupt rate an unprivileged user may request + from this RTC. +name: The name of the RTC corresponding to this sysfs directory +since_epoch: The number of seconds since the epoch according to the RTC +time: RTC-provided time +wakealarm: The time at which the clock will generate a system wakeup + event. This is a one shot wakeup event, so must be reset + after wake if a daily wakeup is required. Format is either + seconds since the epoch or, if there's a leading +, seconds + in the future. + +IOCTL INTERFACE +--------------- + The ioctl() calls supported by /dev/rtc are also supported by the RTC class framework. However, because the chips and systems are not standardized, some PC/AT functionality might not be provided. And in the same way, some @@ -185,6 +209,8 @@ driver returns ENOIOCTLCMD. Some common examples: hardware in the irq_set_freq function. If it isn't, return -EINVAL. If you cannot actually change the frequency, do not define irq_set_freq. + * RTC_PIE_ON, RTC_PIE_OFF: the irq_set_state function will be called. + If all else fails, check out the rtc-test.c driver! diff --git a/Documentation/scsi/ChangeLog.megaraid b/Documentation/scsi/ChangeLog.megaraid index eaa4801f2ce..38e9e7cadc9 100644 --- a/Documentation/scsi/ChangeLog.megaraid +++ b/Documentation/scsi/ChangeLog.megaraid @@ -514,7 +514,7 @@ iv. Remove yield() while mailbox handshake in synchronous commands v. Remove redundant __megaraid_busywait_mbox routine -vi. Fix bug in the managment module, which causes a system lockup when the +vi. Fix bug in the management module, which causes a system lockup when the IO module is loaded and then unloaded, followed by executing any management utility. The current version of management module does not handle the adapter unregister properly. diff --git a/Documentation/scsi/scsi_fc_transport.txt b/Documentation/scsi/scsi_fc_transport.txt index d7f181701dc..aec6549ab09 100644 --- a/Documentation/scsi/scsi_fc_transport.txt +++ b/Documentation/scsi/scsi_fc_transport.txt @@ -378,7 +378,7 @@ Vport Disable/Enable: int vport_disable(struct fc_vport *vport, bool disable) where: - vport: Is vport to to be enabled or disabled + vport: Is vport to be enabled or disabled disable: If "true", the vport is to be disabled. If "false", the vport is to be enabled. diff --git a/Documentation/sound/alsa/HD-Audio-Models.txt b/Documentation/sound/alsa/HD-Audio-Models.txt index 97eebd63bed..75fddb40f41 100644 --- a/Documentation/sound/alsa/HD-Audio-Models.txt +++ b/Documentation/sound/alsa/HD-Audio-Models.txt @@ -209,6 +209,7 @@ AD1884A / AD1883 / AD1984A / AD1984B laptop laptop with HP jack sensing mobile mobile devices with HP jack sensing thinkpad Lenovo Thinkpad X300 + touchsmart HP Touchsmart AD1884 ====== @@ -387,7 +388,7 @@ STAC92HD73* STAC92HD83* =========== ref Reference board - mic-ref Reference board with power managment for ports + mic-ref Reference board with power management for ports dell-s14 Dell laptop auto BIOS setup (default) diff --git a/Documentation/spi/spi-summary b/Documentation/spi/spi-summary index 4a02d2508bc..deab51ddc33 100644 --- a/Documentation/spi/spi-summary +++ b/Documentation/spi/spi-summary @@ -350,7 +350,7 @@ SPI protocol drivers somewhat resemble platform device drivers: .resume = CHIP_resume, }; -The driver core will autmatically attempt to bind this driver to any SPI +The driver core will automatically attempt to bind this driver to any SPI device whose board_info gave a modalias of "CHIP". Your probe() code might look like this unless you're creating a device which is managing a bus (appearing under /sys/class/spi_master). diff --git a/Documentation/spi/spidev_test.c b/Documentation/spi/spidev_test.c index c1a5aad3c75..10abd3773e4 100644 --- a/Documentation/spi/spidev_test.c +++ b/Documentation/spi/spidev_test.c @@ -69,7 +69,7 @@ static void transfer(int fd) puts(""); } -void print_usage(const char *prog) +static void print_usage(const char *prog) { printf("Usage: %s [-DsbdlHOLC3]\n", prog); puts(" -D --device device to use (default /dev/spidev1.1)\n" @@ -85,7 +85,7 @@ void print_usage(const char *prog) exit(1); } -void parse_opts(int argc, char *argv[]) +static void parse_opts(int argc, char *argv[]) { while (1) { static const struct option lopts[] = { diff --git a/Documentation/sysctl/fs.txt b/Documentation/sysctl/fs.txt index 1458448436c..62682500878 100644 --- a/Documentation/sysctl/fs.txt +++ b/Documentation/sysctl/fs.txt @@ -96,13 +96,16 @@ handles that the Linux kernel will allocate. When you get lots of error messages about running out of file handles, you might want to increase this limit. -The three values in file-nr denote the number of allocated -file handles, the number of unused file handles and the maximum -number of file handles. When the allocated file handles come -close to the maximum, but the number of unused file handles is -significantly greater than 0, you've encountered a peak in your -usage of file handles and you don't need to increase the maximum. - +Historically, the three values in file-nr denoted the number of +allocated file handles, the number of allocated but unused file +handles, and the maximum number of file handles. Linux 2.6 always +reports 0 as the number of free file handles -- this is not an +error, it just means that the number of allocated file handles +exactly matches the number of used file handles. + +Attempts to allocate more file descriptors than file-max are +reported with printk, look for "VFS: file-max limit <number> +reached". ============================================================== nr_open: diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt index 2dbff53369d..a028b92001e 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.txt @@ -22,6 +22,7 @@ show up in /proc/sys/kernel: - callhome [ S390 only ] - auto_msgmni - core_pattern +- core_pipe_limit - core_uses_pid - ctrl-alt-del - dentry-state @@ -135,6 +136,27 @@ core_pattern is used to specify a core dumpfile pattern name. ============================================================== +core_pipe_limit: + +This sysctl is only applicable when core_pattern is configured to pipe core +files to user space helper a (when the first character of core_pattern is a '|', +see above). When collecting cores via a pipe to an application, it is +occasionally usefull for the collecting application to gather data about the +crashing process from its /proc/pid directory. In order to do this safely, the +kernel must wait for the collecting process to exit, so as not to remove the +crashing processes proc files prematurely. This in turn creates the possibility +that a misbehaving userspace collecting process can block the reaping of a +crashed process simply by never exiting. This sysctl defends against that. It +defines how many concurrent crashing processes may be piped to user space +applications in parallel. If this value is exceeded, then those crashing +processes above that value are noted via the kernel log and their cores are +skipped. 0 is a special value, indicating that unlimited processes may be +captured in parallel, but that no waiting will take place (i.e. the collecting +process is not guaranteed access to /proc/<crahing pid>/). This value defaults +to 0. + +============================================================== + core_uses_pid: The default coredump filename is "core". By setting @@ -313,31 +335,43 @@ send before ratelimiting kicks in. ============================================================== +printk_delay: + +Delay each printk message in printk_delay milliseconds + +Value from 0 - 10000 is allowed. + +============================================================== + randomize-va-space: This option can be used to select the type of process address space randomization that is used in the system, for architectures that support this feature. -0 - Turn the process address space randomization off by default. +0 - Turn the process address space randomization off. This is the + default for architectures that do not support this feature anyways, + and kernels that are booted with the "norandmaps" parameter. 1 - Make the addresses of mmap base, stack and VDSO page randomized. This, among other things, implies that shared libraries will be - loaded to random addresses. Also for PIE-linked binaries, the location - of code start is randomized. + loaded to random addresses. Also for PIE-linked binaries, the + location of code start is randomized. This is the default if the + CONFIG_COMPAT_BRK option is enabled. - With heap randomization, the situation is a little bit more - complicated. - There a few legacy applications out there (such as some ancient +2 - Additionally enable heap randomization. This is the default if + CONFIG_COMPAT_BRK is disabled. + + There are a few legacy applications out there (such as some ancient versions of libc.so.5 from 1996) that assume that brk area starts - just after the end of the code+bss. These applications break when - start of the brk area is randomized. There are however no known + just after the end of the code+bss. These applications break when + start of the brk area is randomized. There are however no known non-legacy applications that would be broken this way, so for most - systems it is safe to choose full randomization. However there is - a CONFIG_COMPAT_BRK option for systems with ancient and/or broken - binaries, that makes heap non-randomized, but keeps all other - parts of process address space randomized if randomize_va_space - sysctl is turned on. + systems it is safe to choose full randomization. + + Systems with ancient and/or broken binaries should be configured + with CONFIG_COMPAT_BRK enabled, which excludes the heap from process + address space randomization. ============================================================== diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index c4de6359d44..a6e360d2055 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -32,6 +32,8 @@ Currently, these files are in /proc/sys/vm: - legacy_va_layout - lowmem_reserve_ratio - max_map_count +- memory_failure_early_kill +- memory_failure_recovery - min_free_kbytes - min_slab_ratio - min_unmapped_ratio @@ -53,7 +55,6 @@ Currently, these files are in /proc/sys/vm: - vfs_cache_pressure - zone_reclaim_mode - ============================================================== block_dump @@ -275,6 +276,44 @@ e.g., up to one or two maps per allocation. The default value is 65536. +============================================================= + +memory_failure_early_kill: + +Control how to kill processes when uncorrected memory error (typically +a 2bit error in a memory module) is detected in the background by hardware +that cannot be handled by the kernel. In some cases (like the page +still having a valid copy on disk) the kernel will handle the failure +transparently without affecting any applications. But if there is +no other uptodate copy of the data it will kill to prevent any data +corruptions from propagating. + +1: Kill all processes that have the corrupted and not reloadable page mapped +as soon as the corruption is detected. Note this is not supported +for a few types of pages, like kernel internally allocated data or +the swap cache, but works for the majority of user pages. + +0: Only unmap the corrupted page from all processes and only kill a process +who tries to access it. + +The kill is done using a catchable SIGBUS with BUS_MCEERR_AO, so processes can +handle this if they want to. + +This is only active on architectures/platforms with advanced machine +check handling and depends on the hardware capabilities. + +Applications can override this setting individually with the PR_MCE_KILL prctl + +============================================================== + +memory_failure_recovery + +Enable memory failure recovery (when supported by the platform) + +1: Attempt recovery. + +0: Always panic on a memory failure. + ============================================================== min_free_kbytes: @@ -585,7 +624,9 @@ caching of directory and inode objects. At the default value of vfs_cache_pressure=100 the kernel will attempt to reclaim dentries and inodes at a "fair" rate with respect to pagecache and swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer -to retain dentry and inode caches. Increasing vfs_cache_pressure beyond 100 +to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will +never reclaim dentries and inodes due to memory pressure and this can easily +lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100 causes the kernel to prefer to reclaim dentries and inodes. ============================================================== diff --git a/Documentation/trace/events-kmem.txt b/Documentation/trace/events-kmem.txt new file mode 100644 index 00000000000..6ef2a8652e1 --- /dev/null +++ b/Documentation/trace/events-kmem.txt @@ -0,0 +1,107 @@ + Subsystem Trace Points: kmem + +The tracing system kmem captures events related to object and page allocation +within the kernel. Broadly speaking there are four major subheadings. + + o Slab allocation of small objects of unknown type (kmalloc) + o Slab allocation of small objects of known type + o Page allocation + o Per-CPU Allocator Activity + o External Fragmentation + +This document will describe what each of the tracepoints are and why they +might be useful. + +1. Slab allocation of small objects of unknown type +=================================================== +kmalloc call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s +kmalloc_node call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d +kfree call_site=%lx ptr=%p + +Heavy activity for these events may indicate that a specific cache is +justified, particularly if kmalloc slab pages are getting significantly +internal fragmented as a result of the allocation pattern. By correlating +kmalloc with kfree, it may be possible to identify memory leaks and where +the allocation sites were. + + +2. Slab allocation of small objects of known type +================================================= +kmem_cache_alloc call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s +kmem_cache_alloc_node call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d +kmem_cache_free call_site=%lx ptr=%p + +These events are similar in usage to the kmalloc-related events except that +it is likely easier to pin the event down to a specific cache. At the time +of writing, no information is available on what slab is being allocated from, +but the call_site can usually be used to extrapolate that information + +3. Page allocation +================== +mm_page_alloc page=%p pfn=%lu order=%d migratetype=%d gfp_flags=%s +mm_page_alloc_zone_locked page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d +mm_page_free_direct page=%p pfn=%lu order=%d +mm_pagevec_free page=%p pfn=%lu order=%d cold=%d + +These four events deal with page allocation and freeing. mm_page_alloc is +a simple indicator of page allocator activity. Pages may be allocated from +the per-CPU allocator (high performance) or the buddy allocator. + +If pages are allocated directly from the buddy allocator, the +mm_page_alloc_zone_locked event is triggered. This event is important as high +amounts of activity imply high activity on the zone->lock. Taking this lock +impairs performance by disabling interrupts, dirtying cache lines between +CPUs and serialising many CPUs. + +When a page is freed directly by the caller, the mm_page_free_direct event +is triggered. Significant amounts of activity here could indicate that the +callers should be batching their activities. + +When pages are freed using a pagevec, the mm_pagevec_free is +triggered. Broadly speaking, pages are taken off the LRU lock in bulk and +freed in batch with a pagevec. Significant amounts of activity here could +indicate that the system is under memory pressure and can also indicate +contention on the zone->lru_lock. + +4. Per-CPU Allocator Activity +============================= +mm_page_alloc_zone_locked page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d +mm_page_pcpu_drain page=%p pfn=%lu order=%d cpu=%d migratetype=%d + +In front of the page allocator is a per-cpu page allocator. It exists only +for order-0 pages, reduces contention on the zone->lock and reduces the +amount of writing on struct page. + +When a per-CPU list is empty or pages of the wrong type are allocated, +the zone->lock will be taken once and the per-CPU list refilled. The event +triggered is mm_page_alloc_zone_locked for each page allocated with the +event indicating whether it is for a percpu_refill or not. + +When the per-CPU list is too full, a number of pages are freed, each one +which triggers a mm_page_pcpu_drain event. + +The individual nature of the events are so that pages can be tracked +between allocation and freeing. A number of drain or refill pages that occur +consecutively imply the zone->lock being taken once. Large amounts of PCP +refills and drains could imply an imbalance between CPUs where too much work +is being concentrated in one place. It could also indicate that the per-CPU +lists should be a larger size. Finally, large amounts of refills on one CPU +and drains on another could be a factor in causing large amounts of cache +line bounces due to writes between CPUs and worth investigating if pages +can be allocated and freed on the same CPU through some algorithm change. + +5. External Fragmentation +========================= +mm_page_alloc_extfrag page=%p pfn=%lu alloc_order=%d fallback_order=%d pageblock_order=%d alloc_migratetype=%d fallback_migratetype=%d fragmenting=%d change_ownership=%d + +External fragmentation affects whether a high-order allocation will be +successful or not. For some types of hardware, this is important although +it is avoided where possible. If the system is using huge pages and needs +to be able to resize the pool over the lifetime of the system, this value +is important. + +Large numbers of this event implies that memory is fragmenting and +high-order allocations will start failing at some time in the future. One +means of reducing the occurange of this event is to increase the size of +min_free_kbytes in increments of 3*pageblock_size*nr_online_nodes where +pageblock_size is usually the size of the default hugepage size. diff --git a/Documentation/trace/events.txt b/Documentation/trace/events.txt index 78c45a87be5..02ac6ed38b2 100644 --- a/Documentation/trace/events.txt +++ b/Documentation/trace/events.txt @@ -72,7 +72,7 @@ To enable all events in sched subsystem: # echo 1 > /sys/kernel/debug/tracing/events/sched/enable -To eanble all events: +To enable all events: # echo 1 > /sys/kernel/debug/tracing/events/enable diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt index 1b6292bbdd6..957b22fde2d 100644 --- a/Documentation/trace/ftrace.txt +++ b/Documentation/trace/ftrace.txt @@ -133,7 +133,7 @@ of ftrace. Here is a list of some of the key files: than requested, the rest of the page will be used, making the actual allocation bigger than requested. ( Note, the size may not be a multiple of the page size - due to buffer managment overhead. ) + due to buffer management overhead. ) This can only be updated when the current_tracer is set to "nop". diff --git a/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl new file mode 100644 index 00000000000..7df50e8cf4d --- /dev/null +++ b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl @@ -0,0 +1,418 @@ +#!/usr/bin/perl +# This is a POC (proof of concept or piece of crap, take your pick) for reading the +# text representation of trace output related to page allocation. It makes an attempt +# to extract some high-level information on what is going on. The accuracy of the parser +# may vary considerably +# +# Example usage: trace-pagealloc-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe +# other options +# --prepend-parent Report on the parent proc and PID +# --read-procstat If the trace lacks process info, get it from /proc +# --ignore-pid Aggregate processes of the same name together +# +# Copyright (c) IBM Corporation 2009 +# Author: Mel Gorman <mel@csn.ul.ie> +use strict; +use Getopt::Long; + +# Tracepoint events +use constant MM_PAGE_ALLOC => 1; +use constant MM_PAGE_FREE_DIRECT => 2; +use constant MM_PAGEVEC_FREE => 3; +use constant MM_PAGE_PCPU_DRAIN => 4; +use constant MM_PAGE_ALLOC_ZONE_LOCKED => 5; +use constant MM_PAGE_ALLOC_EXTFRAG => 6; +use constant EVENT_UNKNOWN => 7; + +# Constants used to track state +use constant STATE_PCPU_PAGES_DRAINED => 8; +use constant STATE_PCPU_PAGES_REFILLED => 9; + +# High-level events extrapolated from tracepoints +use constant HIGH_PCPU_DRAINS => 10; +use constant HIGH_PCPU_REFILLS => 11; +use constant HIGH_EXT_FRAGMENT => 12; +use constant HIGH_EXT_FRAGMENT_SEVERE => 13; +use constant HIGH_EXT_FRAGMENT_MODERATE => 14; +use constant HIGH_EXT_FRAGMENT_CHANGED => 15; + +my %perprocesspid; +my %perprocess; +my $opt_ignorepid; +my $opt_read_procstat; +my $opt_prepend_parent; + +# Catch sigint and exit on request +my $sigint_report = 0; +my $sigint_exit = 0; +my $sigint_pending = 0; +my $sigint_received = 0; +sub sigint_handler { + my $current_time = time; + if ($current_time - 2 > $sigint_received) { + print "SIGINT received, report pending. Hit ctrl-c again to exit\n"; + $sigint_report = 1; + } else { + if (!$sigint_exit) { + print "Second SIGINT received quickly, exiting\n"; + } + $sigint_exit++; + } + + if ($sigint_exit > 3) { + print "Many SIGINTs received, exiting now without report\n"; + exit; + } + + $sigint_received = $current_time; + $sigint_pending = 1; +} +$SIG{INT} = "sigint_handler"; + +# Parse command line options +GetOptions( + 'ignore-pid' => \$opt_ignorepid, + 'read-procstat' => \$opt_read_procstat, + 'prepend-parent' => \$opt_prepend_parent, +); + +# Defaults for dynamically discovered regex's +my $regex_fragdetails_default = 'page=([0-9a-f]*) pfn=([0-9]*) alloc_order=([-0-9]*) fallback_order=([-0-9]*) pageblock_order=([-0-9]*) alloc_migratetype=([-0-9]*) fallback_migratetype=([-0-9]*) fragmenting=([-0-9]) change_ownership=([-0-9])'; + +# Dyanically discovered regex +my $regex_fragdetails; + +# Static regex used. Specified like this for readability and for use with /o +# (process_pid) (cpus ) ( time ) (tpoint ) (details) +my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)'; +my $regex_statname = '[-0-9]*\s\((.*)\).*'; +my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*'; + +sub generate_traceevent_regex { + my $event = shift; + my $default = shift; + my $regex; + + # Read the event format or use the default + if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) { + $regex = $default; + } else { + my $line; + while (!eof(FORMAT)) { + $line = <FORMAT>; + if ($line =~ /^print fmt:\s"(.*)",.*/) { + $regex = $1; + $regex =~ s/%p/\([0-9a-f]*\)/g; + $regex =~ s/%d/\([-0-9]*\)/g; + $regex =~ s/%lu/\([0-9]*\)/g; + } + } + } + + # Verify fields are in the right order + my $tuple; + foreach $tuple (split /\s/, $regex) { + my ($key, $value) = split(/=/, $tuple); + my $expected = shift; + if ($key ne $expected) { + print("WARNING: Format not as expected '$key' != '$expected'"); + $regex =~ s/$key=\((.*)\)/$key=$1/; + } + } + + if (defined shift) { + die("Fewer fields than expected in format"); + } + + return $regex; +} +$regex_fragdetails = generate_traceevent_regex("kmem/mm_page_alloc_extfrag", + $regex_fragdetails_default, + "page", "pfn", + "alloc_order", "fallback_order", "pageblock_order", + "alloc_migratetype", "fallback_migratetype", + "fragmenting", "change_ownership"); + +sub read_statline($) { + my $pid = $_[0]; + my $statline; + + if (open(STAT, "/proc/$pid/stat")) { + $statline = <STAT>; + close(STAT); + } + + if ($statline eq '') { + $statline = "-1 (UNKNOWN_PROCESS_NAME) R 0"; + } + + return $statline; +} + +sub guess_process_pid($$) { + my $pid = $_[0]; + my $statline = $_[1]; + + if ($pid == 0) { + return "swapper-0"; + } + + if ($statline !~ /$regex_statname/o) { + die("Failed to math stat line for process name :: $statline"); + } + return "$1-$pid"; +} + +sub parent_info($$) { + my $pid = $_[0]; + my $statline = $_[1]; + my $ppid; + + if ($pid == 0) { + return "NOPARENT-0"; + } + + if ($statline !~ /$regex_statppid/o) { + die("Failed to match stat line process ppid:: $statline"); + } + + # Read the ppid stat line + $ppid = $1; + return guess_process_pid($ppid, read_statline($ppid)); +} + +sub process_events { + my $traceevent; + my $process_pid; + my $cpus; + my $timestamp; + my $tracepoint; + my $details; + my $statline; + + # Read each line of the event log +EVENT_PROCESS: + while ($traceevent = <STDIN>) { + if ($traceevent =~ /$regex_traceevent/o) { + $process_pid = $1; + $tracepoint = $4; + + if ($opt_read_procstat || $opt_prepend_parent) { + $process_pid =~ /(.*)-([0-9]*)$/; + my $process = $1; + my $pid = $2; + + $statline = read_statline($pid); + + if ($opt_read_procstat && $process eq '') { + $process_pid = guess_process_pid($pid, $statline); + } + + if ($opt_prepend_parent) { + $process_pid = parent_info($pid, $statline) . " :: $process_pid"; + } + } + + # Unnecessary in this script. Uncomment if required + # $cpus = $2; + # $timestamp = $3; + } else { + next; + } + + # Perl Switch() sucks majorly + if ($tracepoint eq "mm_page_alloc") { + $perprocesspid{$process_pid}->{MM_PAGE_ALLOC}++; + } elsif ($tracepoint eq "mm_page_free_direct") { + $perprocesspid{$process_pid}->{MM_PAGE_FREE_DIRECT}++; + } elsif ($tracepoint eq "mm_pagevec_free") { + $perprocesspid{$process_pid}->{MM_PAGEVEC_FREE}++; + } elsif ($tracepoint eq "mm_page_pcpu_drain") { + $perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN}++; + $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED}++; + } elsif ($tracepoint eq "mm_page_alloc_zone_locked") { + $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED}++; + $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED}++; + } elsif ($tracepoint eq "mm_page_alloc_extfrag") { + + # Extract the details of the event now + $details = $5; + + my ($page, $pfn); + my ($alloc_order, $fallback_order, $pageblock_order); + my ($alloc_migratetype, $fallback_migratetype); + my ($fragmenting, $change_ownership); + + if ($details !~ /$regex_fragdetails/o) { + print "WARNING: Failed to parse mm_page_alloc_extfrag as expected\n"; + next; + } + + $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG}++; + $page = $1; + $pfn = $2; + $alloc_order = $3; + $fallback_order = $4; + $pageblock_order = $5; + $alloc_migratetype = $6; + $fallback_migratetype = $7; + $fragmenting = $8; + $change_ownership = $9; + + if ($fragmenting) { + $perprocesspid{$process_pid}->{HIGH_EXT_FRAG}++; + if ($fallback_order <= 3) { + $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE}++; + } else { + $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE}++; + } + } + if ($change_ownership) { + $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED}++; + } + } else { + $perprocesspid{$process_pid}->{EVENT_UNKNOWN}++; + } + + # Catch a full pcpu drain event + if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} && + $tracepoint ne "mm_page_pcpu_drain") { + + $perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS}++; + $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0; + } + + # Catch a full pcpu refill event + if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} && + $tracepoint ne "mm_page_alloc_zone_locked") { + $perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS}++; + $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0; + } + + if ($sigint_pending) { + last EVENT_PROCESS; + } + } +} + +sub dump_stats { + my $hashref = shift; + my %stats = %$hashref; + + # Dump per-process stats + my $process_pid; + my $max_strlen = 0; + + # Get the maximum process name + foreach $process_pid (keys %perprocesspid) { + my $len = length($process_pid); + if ($len > $max_strlen) { + $max_strlen = $len; + } + } + $max_strlen += 2; + + printf("\n"); + printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n", + "Process", "Pages", "Pages", "Pages", "Pages", "PCPU", "PCPU", "PCPU", "Fragment", "Fragment", "MigType", "Fragment", "Fragment", "Unknown"); + printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n", + "details", "allocd", "allocd", "freed", "freed", "pages", "drains", "refills", "Fallback", "Causing", "Changed", "Severe", "Moderate", ""); + + printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n", + "", "", "under lock", "direct", "pagevec", "drain", "", "", "", "", "", "", "", ""); + + foreach $process_pid (keys %stats) { + # Dump final aggregates + if ($stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED}) { + $stats{$process_pid}->{HIGH_PCPU_DRAINS}++; + $stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0; + } + if ($stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED}) { + $stats{$process_pid}->{HIGH_PCPU_REFILLS}++; + $stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0; + } + + printf("%-" . $max_strlen . "s %8d %10d %8d %8d %8d %8d %8d %8d %8d %8d %8d %8d %8d\n", + $process_pid, + $stats{$process_pid}->{MM_PAGE_ALLOC}, + $stats{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED}, + $stats{$process_pid}->{MM_PAGE_FREE_DIRECT}, + $stats{$process_pid}->{MM_PAGEVEC_FREE}, + $stats{$process_pid}->{MM_PAGE_PCPU_DRAIN}, + $stats{$process_pid}->{HIGH_PCPU_DRAINS}, + $stats{$process_pid}->{HIGH_PCPU_REFILLS}, + $stats{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG}, + $stats{$process_pid}->{HIGH_EXT_FRAG}, + $stats{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED}, + $stats{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE}, + $stats{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE}, + $stats{$process_pid}->{EVENT_UNKNOWN}); + } +} + +sub aggregate_perprocesspid() { + my $process_pid; + my $process; + undef %perprocess; + + foreach $process_pid (keys %perprocesspid) { + $process = $process_pid; + $process =~ s/-([0-9])*$//; + if ($process eq '') { + $process = "NO_PROCESS_NAME"; + } + + $perprocess{$process}->{MM_PAGE_ALLOC} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC}; + $perprocess{$process}->{MM_PAGE_ALLOC_ZONE_LOCKED} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED}; + $perprocess{$process}->{MM_PAGE_FREE_DIRECT} += $perprocesspid{$process_pid}->{MM_PAGE_FREE_DIRECT}; + $perprocess{$process}->{MM_PAGEVEC_FREE} += $perprocesspid{$process_pid}->{MM_PAGEVEC_FREE}; + $perprocess{$process}->{MM_PAGE_PCPU_DRAIN} += $perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN}; + $perprocess{$process}->{HIGH_PCPU_DRAINS} += $perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS}; + $perprocess{$process}->{HIGH_PCPU_REFILLS} += $perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS}; + $perprocess{$process}->{MM_PAGE_ALLOC_EXTFRAG} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG}; + $perprocess{$process}->{HIGH_EXT_FRAG} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAG}; + $perprocess{$process}->{HIGH_EXT_FRAGMENT_CHANGED} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED}; + $perprocess{$process}->{HIGH_EXT_FRAGMENT_SEVERE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE}; + $perprocess{$process}->{HIGH_EXT_FRAGMENT_MODERATE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE}; + $perprocess{$process}->{EVENT_UNKNOWN} += $perprocesspid{$process_pid}->{EVENT_UNKNOWN}; + } +} + +sub report() { + if (!$opt_ignorepid) { + dump_stats(\%perprocesspid); + } else { + aggregate_perprocesspid(); + dump_stats(\%perprocess); + } +} + +# Process events or signals until neither is available +sub signal_loop() { + my $sigint_processed; + do { + $sigint_processed = 0; + process_events(); + + # Handle pending signals if any + if ($sigint_pending) { + my $current_time = time; + + if ($sigint_exit) { + print "Received exit signal\n"; + $sigint_pending = 0; + } + if ($sigint_report) { + if ($current_time >= $sigint_received + 2) { + report(); + $sigint_report = 0; + $sigint_pending = 0; + $sigint_processed = 1; + } + } + } + } while ($sigint_pending || $sigint_processed); +} + +signal_loop(); +report(); diff --git a/Documentation/trace/tracepoint-analysis.txt b/Documentation/trace/tracepoint-analysis.txt new file mode 100644 index 00000000000..5eb4e487e66 --- /dev/null +++ b/Documentation/trace/tracepoint-analysis.txt @@ -0,0 +1,327 @@ + Notes on Analysing Behaviour Using Events and Tracepoints + + Documentation written by Mel Gorman + PCL information heavily based on email from Ingo Molnar + +1. Introduction +=============== + +Tracepoints (see Documentation/trace/tracepoints.txt) can be used without +creating custom kernel modules to register probe functions using the event +tracing infrastructure. + +Simplistically, tracepoints will represent an important event that when can +be taken in conjunction with other tracepoints to build a "Big Picture" of +what is going on within the system. There are a large number of methods for +gathering and interpreting these events. Lacking any current Best Practises, +this document describes some of the methods that can be used. + +This document assumes that debugfs is mounted on /sys/kernel/debug and that +the appropriate tracing options have been configured into the kernel. It is +assumed that the PCL tool tools/perf has been installed and is in your path. + +2. Listing Available Events +=========================== + +2.1 Standard Utilities +---------------------- + +All possible events are visible from /sys/kernel/debug/tracing/events. Simply +calling + + $ find /sys/kernel/debug/tracing/events -type d + +will give a fair indication of the number of events available. + +2.2 PCL +------- + +Discovery and enumeration of all counters and events, including tracepoints +are available with the perf tool. Getting a list of available events is a +simple case of + + $ perf list 2>&1 | grep Tracepoint + ext4:ext4_free_inode [Tracepoint event] + ext4:ext4_request_inode [Tracepoint event] + ext4:ext4_allocate_inode [Tracepoint event] + ext4:ext4_write_begin [Tracepoint event] + ext4:ext4_ordered_write_end [Tracepoint event] + [ .... remaining output snipped .... ] + + +2. Enabling Events +================== + +2.1 System-Wide Event Enabling +------------------------------ + +See Documentation/trace/events.txt for a proper description on how events +can be enabled system-wide. A short example of enabling all events related +to page allocation would look something like + + $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done + +2.2 System-Wide Event Enabling with SystemTap +--------------------------------------------- + +In SystemTap, tracepoints are accessible using the kernel.trace() function +call. The following is an example that reports every 5 seconds what processes +were allocating the pages. + + global page_allocs + + probe kernel.trace("mm_page_alloc") { + page_allocs[execname()]++ + } + + function print_count() { + printf ("%-25s %-s\n", "#Pages Allocated", "Process Name") + foreach (proc in page_allocs-) + printf("%-25d %s\n", page_allocs[proc], proc) + printf ("\n") + delete page_allocs + } + + probe timer.s(5) { + print_count() + } + +2.3 System-Wide Event Enabling with PCL +--------------------------------------- + +By specifying the -a switch and analysing sleep, the system-wide events +for a duration of time can be examined. + + $ perf stat -a \ + -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \ + -e kmem:mm_pagevec_free \ + sleep 10 + Performance counter stats for 'sleep 10': + + 9630 kmem:mm_page_alloc + 2143 kmem:mm_page_free_direct + 7424 kmem:mm_pagevec_free + + 10.002577764 seconds time elapsed + +Similarly, one could execute a shell and exit it as desired to get a report +at that point. + +2.4 Local Event Enabling +------------------------ + +Documentation/trace/ftrace.txt describes how to enable events on a per-thread +basis using set_ftrace_pid. + +2.5 Local Event Enablement with PCL +----------------------------------- + +Events can be activate and tracked for the duration of a process on a local +basis using PCL such as follows. + + $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \ + -e kmem:mm_pagevec_free ./hackbench 10 + Time: 0.909 + + Performance counter stats for './hackbench 10': + + 17803 kmem:mm_page_alloc + 12398 kmem:mm_page_free_direct + 4827 kmem:mm_pagevec_free + + 0.973913387 seconds time elapsed + +3. Event Filtering +================== + +Documentation/trace/ftrace.txt covers in-depth how to filter events in +ftrace. Obviously using grep and awk of trace_pipe is an option as well +as any script reading trace_pipe. + +4. Analysing Event Variances with PCL +===================================== + +Any workload can exhibit variances between runs and it can be important +to know what the standard deviation in. By and large, this is left to the +performance analyst to do it by hand. In the event that the discrete event +occurrences are useful to the performance analyst, then perf can be used. + + $ perf stat --repeat 5 -e kmem:mm_page_alloc -e kmem:mm_page_free_direct + -e kmem:mm_pagevec_free ./hackbench 10 + Time: 0.890 + Time: 0.895 + Time: 0.915 + Time: 1.001 + Time: 0.899 + + Performance counter stats for './hackbench 10' (5 runs): + + 16630 kmem:mm_page_alloc ( +- 3.542% ) + 11486 kmem:mm_page_free_direct ( +- 4.771% ) + 4730 kmem:mm_pagevec_free ( +- 2.325% ) + + 0.982653002 seconds time elapsed ( +- 1.448% ) + +In the event that some higher-level event is required that depends on some +aggregation of discrete events, then a script would need to be developed. + +Using --repeat, it is also possible to view how events are fluctuating over +time on a system wide basis using -a and sleep. + + $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \ + -e kmem:mm_pagevec_free \ + -a --repeat 10 \ + sleep 1 + Performance counter stats for 'sleep 1' (10 runs): + + 1066 kmem:mm_page_alloc ( +- 26.148% ) + 182 kmem:mm_page_free_direct ( +- 5.464% ) + 890 kmem:mm_pagevec_free ( +- 30.079% ) + + 1.002251757 seconds time elapsed ( +- 0.005% ) + +5. Higher-Level Analysis with Helper Scripts +============================================ + +When events are enabled the events that are triggering can be read from +/sys/kernel/debug/tracing/trace_pipe in human-readable format although binary +options exist as well. By post-processing the output, further information can +be gathered on-line as appropriate. Examples of post-processing might include + + o Reading information from /proc for the PID that triggered the event + o Deriving a higher-level event from a series of lower-level events. + o Calculate latencies between two events + +Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example +script that can read trace_pipe from STDIN or a copy of a trace. When used +on-line, it can be interrupted once to generate a report without existing +and twice to exit. + +Simplistically, the script just reads STDIN and counts up events but it +also can do more such as + + o Derive high-level events from many low-level events. If a number of pages + are freed to the main allocator from the per-CPU lists, it recognises + that as one per-CPU drain even though there is no specific tracepoint + for that event + o It can aggregate based on PID or individual process number + o In the event memory is getting externally fragmented, it reports + on whether the fragmentation event was severe or moderate. + o When receiving an event about a PID, it can record who the parent was so + that if large numbers of events are coming from very short-lived + processes, the parent process responsible for creating all the helpers + can be identified + +6. Lower-Level Analysis with PCL +================================ + +There may also be a requirement to identify what functions with a program +were generating events within the kernel. To begin this sort of analysis, the +data must be recorded. At the time of writing, this required root + + $ perf record -c 1 \ + -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \ + -e kmem:mm_pagevec_free \ + ./hackbench 10 + Time: 0.894 + [ perf record: Captured and wrote 0.733 MB perf.data (~32010 samples) ] + +Note the use of '-c 1' to set the event period to sample. The default sample +period is quite high to minimise overhead but the information collected can be +very coarse as a result. + +This record outputted a file called perf.data which can be analysed using +perf report. + + $ perf report + # Samples: 30922 + # + # Overhead Command Shared Object + # ........ ......... ................................ + # + 87.27% hackbench [vdso] + 6.85% hackbench /lib/i686/cmov/libc-2.9.so + 2.62% hackbench /lib/ld-2.9.so + 1.52% perf [vdso] + 1.22% hackbench ./hackbench + 0.48% hackbench [kernel] + 0.02% perf /lib/i686/cmov/libc-2.9.so + 0.01% perf /usr/bin/perf + 0.01% perf /lib/ld-2.9.so + 0.00% hackbench /lib/i686/cmov/libpthread-2.9.so + # + # (For more details, try: perf report --sort comm,dso,symbol) + # + +According to this, the vast majority of events occured triggered on events +within the VDSO. With simple binaries, this will often be the case so lets +take a slightly different example. In the course of writing this, it was +noticed that X was generating an insane amount of page allocations so lets look +at it + + $ perf record -c 1 -f \ + -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \ + -e kmem:mm_pagevec_free \ + -p `pidof X` + +This was interrupted after a few seconds and + + $ perf report + # Samples: 27666 + # + # Overhead Command Shared Object + # ........ ....... ....................................... + # + 51.95% Xorg [vdso] + 47.95% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 + 0.09% Xorg /lib/i686/cmov/libc-2.9.so + 0.01% Xorg [kernel] + # + # (For more details, try: perf report --sort comm,dso,symbol) + # + +So, almost half of the events are occuring in a library. To get an idea which +symbol. + + $ perf report --sort comm,dso,symbol + # Samples: 27666 + # + # Overhead Command Shared Object Symbol + # ........ ....... ....................................... ...... + # + 51.95% Xorg [vdso] [.] 0x000000ffffe424 + 47.93% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] pixmanFillsse2 + 0.09% Xorg /lib/i686/cmov/libc-2.9.so [.] _int_malloc + 0.01% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] pixman_region32_copy_f + 0.01% Xorg [kernel] [k] read_hpet + 0.01% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] get_fast_path + 0.00% Xorg [kernel] [k] ftrace_trace_userstack + +To see where within the function pixmanFillsse2 things are going wrong + + $ perf annotate pixmanFillsse2 + [ ... ] + 0.00 : 34eeb: 0f 18 08 prefetcht0 (%eax) + : } + : + : extern __inline void __attribute__((__gnu_inline__, __always_inline__, _ + : _mm_store_si128 (__m128i *__P, __m128i __B) : { + : *__P = __B; + 12.40 : 34eee: 66 0f 7f 80 40 ff ff movdqa %xmm0,-0xc0(%eax) + 0.00 : 34ef5: ff + 12.40 : 34ef6: 66 0f 7f 80 50 ff ff movdqa %xmm0,-0xb0(%eax) + 0.00 : 34efd: ff + 12.39 : 34efe: 66 0f 7f 80 60 ff ff movdqa %xmm0,-0xa0(%eax) + 0.00 : 34f05: ff + 12.67 : 34f06: 66 0f 7f 80 70 ff ff movdqa %xmm0,-0x90(%eax) + 0.00 : 34f0d: ff + 12.58 : 34f0e: 66 0f 7f 40 80 movdqa %xmm0,-0x80(%eax) + 12.31 : 34f13: 66 0f 7f 40 90 movdqa %xmm0,-0x70(%eax) + 12.40 : 34f18: 66 0f 7f 40 a0 movdqa %xmm0,-0x60(%eax) + 12.31 : 34f1d: 66 0f 7f 40 b0 movdqa %xmm0,-0x50(%eax) + +At a glance, it looks like the time is being spent copying pixmaps to +the card. Further investigation would be needed to determine why pixmaps +are being copied around so much but a starting point would be to take an +ancient build of libpixmap out of the library path where it was totally +forgotten about from months ago! diff --git a/Documentation/usb/authorization.txt b/Documentation/usb/authorization.txt index 381b22ee783..c069b6884c7 100644 --- a/Documentation/usb/authorization.txt +++ b/Documentation/usb/authorization.txt @@ -16,20 +16,20 @@ Usage: Authorize a device to connect: -$ echo 1 > /sys/usb/devices/DEVICE/authorized +$ echo 1 > /sys/bus/usb/devices/DEVICE/authorized Deauthorize a device: -$ echo 0 > /sys/usb/devices/DEVICE/authorized +$ echo 0 > /sys/bus/usb/devices/DEVICE/authorized Set new devices connected to hostX to be deauthorized by default (ie: lock down): -$ echo 0 > /sys/bus/devices/usbX/authorized_default +$ echo 0 > /sys/bus/usb/devices/usbX/authorized_default Remove the lock down: -$ echo 1 > /sys/bus/devices/usbX/authorized_default +$ echo 1 > /sys/bus/usb/devices/usbX/authorized_default By default, Wired USB devices are authorized by default to connect. Wireless USB hosts deauthorize by default all new connected @@ -47,7 +47,7 @@ USB port): boot up rc.local -> - for host in /sys/bus/devices/usb* + for host in /sys/bus/usb/devices/usb* do echo 0 > $host/authorized_default done diff --git a/Documentation/usb/usbmon.txt b/Documentation/usb/usbmon.txt index 6c3c625b7f3..66f92d1194c 100644 --- a/Documentation/usb/usbmon.txt +++ b/Documentation/usb/usbmon.txt @@ -33,7 +33,7 @@ if usbmon is built into the kernel. Verify that bus sockets are present. -# ls /sys/kernel/debug/usbmon +# ls /sys/kernel/debug/usb/usbmon 0s 0u 1s 1t 1u 2s 2t 2u 3s 3t 3u 4s 4t 4u # @@ -58,11 +58,11 @@ Bus=03 means it's bus 3. 3. Start 'cat' -# cat /sys/kernel/debug/usbmon/3u > /tmp/1.mon.out +# cat /sys/kernel/debug/usb/usbmon/3u > /tmp/1.mon.out to listen on a single bus, otherwise, to listen on all buses, type: -# cat /sys/kernel/debug/usbmon/0u > /tmp/1.mon.out +# cat /sys/kernel/debug/usb/usbmon/0u > /tmp/1.mon.out This process will be reading until killed. Naturally, the output can be redirected to a desirable location. This is preferred, because it is going @@ -305,7 +305,7 @@ Before the call, hdr, data, and alloc should be filled. Upon return, the area pointed by hdr contains the next event structure, and the data buffer contains the data, if any. The event is removed from the kernel buffer. -The MON_IOCX_GET copies 48 bytes, MON_IOCX_GETX copies 64 bytes. +The MON_IOCX_GET copies 48 bytes to hdr area, MON_IOCX_GETX copies 64 bytes. MON_IOCX_MFETCH, defined as _IOWR(MON_IOC_MAGIC, 7, struct mon_mfetch_arg) diff --git a/Documentation/video4linux/v4lgrab.c b/Documentation/video4linux/v4lgrab.c index 05769cff100..c8ded175796 100644 --- a/Documentation/video4linux/v4lgrab.c +++ b/Documentation/video4linux/v4lgrab.c @@ -89,7 +89,7 @@ } \ } -int get_brightness_adj(unsigned char *image, long size, int *brightness) { +static int get_brightness_adj(unsigned char *image, long size, int *brightness) { long i, tot = 0; for (i=0;i<size*3;i++) tot += image[i]; diff --git a/Documentation/vm/.gitignore b/Documentation/vm/.gitignore index 33e8a023df0..09b164a5700 100644 --- a/Documentation/vm/.gitignore +++ b/Documentation/vm/.gitignore @@ -1 +1,2 @@ +page-types slabinfo diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX index 2f77ced35df..e57d6a9dd32 100644 --- a/Documentation/vm/00-INDEX +++ b/Documentation/vm/00-INDEX @@ -6,6 +6,8 @@ balance - various information on memory balancing. hugetlbpage.txt - a brief summary of hugetlbpage support in the Linux kernel. +ksm.txt + - how to use the Kernel Samepage Merging feature. locking - info on how locking and synchronization is done in the Linux vm code. numa @@ -20,3 +22,5 @@ slabinfo.c - source code for a tool to get reports about slabs. slub.txt - a short users guide for SLUB. +map_hugetlb.c + - an example program that uses the MAP_HUGETLB mmap flag. diff --git a/Documentation/vm/hugetlbpage.txt b/Documentation/vm/hugetlbpage.txt index ea8714fcc3a..82a7bd1800b 100644 --- a/Documentation/vm/hugetlbpage.txt +++ b/Documentation/vm/hugetlbpage.txt @@ -18,13 +18,13 @@ First the Linux kernel needs to be built with the CONFIG_HUGETLBFS automatically when CONFIG_HUGETLBFS is selected) configuration options. -The kernel built with hugepage support should show the number of configured -hugepages in the system by running the "cat /proc/meminfo" command. +The kernel built with huge page support should show the number of configured +huge pages in the system by running the "cat /proc/meminfo" command. /proc/meminfo also provides information about the total number of hugetlb pages configured in the kernel. It also displays information about the number of free hugetlb pages at any time. It also displays information about -the configured hugepage size - this is needed for generating the proper +the configured huge page size - this is needed for generating the proper alignment and size of the arguments to the above system calls. The output of "cat /proc/meminfo" will have lines like: @@ -37,25 +37,27 @@ HugePages_Surp: yyy Hugepagesize: zzz kB where: -HugePages_Total is the size of the pool of hugepages. -HugePages_Free is the number of hugepages in the pool that are not yet -allocated. -HugePages_Rsvd is short for "reserved," and is the number of hugepages -for which a commitment to allocate from the pool has been made, but no -allocation has yet been made. It's vaguely analogous to overcommit. -HugePages_Surp is short for "surplus," and is the number of hugepages in -the pool above the value in /proc/sys/vm/nr_hugepages. The maximum -number of surplus hugepages is controlled by -/proc/sys/vm/nr_overcommit_hugepages. +HugePages_Total is the size of the pool of huge pages. +HugePages_Free is the number of huge pages in the pool that are not yet + allocated. +HugePages_Rsvd is short for "reserved," and is the number of huge pages for + which a commitment to allocate from the pool has been made, + but no allocation has yet been made. Reserved huge pages + guarantee that an application will be able to allocate a + huge page from the pool of huge pages at fault time. +HugePages_Surp is short for "surplus," and is the number of huge pages in + the pool above the value in /proc/sys/vm/nr_hugepages. The + maximum number of surplus huge pages is controlled by + /proc/sys/vm/nr_overcommit_hugepages. /proc/filesystems should also show a filesystem of type "hugetlbfs" configured in the kernel. /proc/sys/vm/nr_hugepages indicates the current number of configured hugetlb pages in the kernel. Super user can dynamically request more (or free some -pre-configured) hugepages. +pre-configured) huge pages. The allocation (or deallocation) of hugetlb pages is possible only if there are -enough physically contiguous free pages in system (freeing of hugepages is +enough physically contiguous free pages in system (freeing of huge pages is possible only if there are enough hugetlb pages free that can be transferred back to regular memory pool). @@ -67,43 +69,82 @@ use either the mmap system call or shared memory system calls to start using the huge pages. It is required that the system administrator preallocate enough memory for huge page purposes. -Use the following command to dynamically allocate/deallocate hugepages: +The administrator can preallocate huge pages on the kernel boot command line by +specifying the "hugepages=N" parameter, where 'N' = the number of huge pages +requested. This is the most reliable method for preallocating huge pages as +memory has not yet become fragmented. + +Some platforms support multiple huge page sizes. To preallocate huge pages +of a specific size, one must preceed the huge pages boot command parameters +with a huge page size selection parameter "hugepagesz=<size>". <size> must +be specified in bytes with optional scale suffix [kKmMgG]. The default huge +page size may be selected with the "default_hugepagesz=<size>" boot parameter. + +/proc/sys/vm/nr_hugepages indicates the current number of configured [default +size] hugetlb pages in the kernel. Super user can dynamically request more +(or free some pre-configured) huge pages. + +Use the following command to dynamically allocate/deallocate default sized +huge pages: echo 20 > /proc/sys/vm/nr_hugepages -This command will try to configure 20 hugepages in the system. The success -or failure of allocation depends on the amount of physically contiguous -memory that is preset in system at this time. System administrators may want -to put this command in one of the local rc init files. This will enable the -kernel to request huge pages early in the boot process (when the possibility -of getting physical contiguous pages is still very high). In either -case, administrators will want to verify the number of hugepages actually -allocated by checking the sysctl or meminfo. - -/proc/sys/vm/nr_overcommit_hugepages indicates how large the pool of -hugepages can grow, if more hugepages than /proc/sys/vm/nr_hugepages are -requested by applications. echo'ing any non-zero value into this file -indicates that the hugetlb subsystem is allowed to try to obtain -hugepages from the buddy allocator, if the normal pool is exhausted. As -these surplus hugepages go out of use, they are freed back to the buddy +This command will try to configure 20 default sized huge pages in the system. +On a NUMA platform, the kernel will attempt to distribute the huge page pool +over the all on-line nodes. These huge pages, allocated when nr_hugepages +is increased, are called "persistent huge pages". + +The success or failure of huge page allocation depends on the amount of +physically contiguous memory that is preset in system at the time of the +allocation attempt. If the kernel is unable to allocate huge pages from +some nodes in a NUMA system, it will attempt to make up the difference by +allocating extra pages on other nodes with sufficient available contiguous +memory, if any. + +System administrators may want to put this command in one of the local rc init +files. This will enable the kernel to request huge pages early in the boot +process when the possibility of getting physical contiguous pages is still +very high. Administrators can verify the number of huge pages actually +allocated by checking the sysctl or meminfo. To check the per node +distribution of huge pages in a NUMA system, use: + + cat /sys/devices/system/node/node*/meminfo | fgrep Huge + +/proc/sys/vm/nr_overcommit_hugepages specifies how large the pool of +huge pages can grow, if more huge pages than /proc/sys/vm/nr_hugepages are +requested by applications. Writing any non-zero value into this file +indicates that the hugetlb subsystem is allowed to try to obtain "surplus" +huge pages from the buddy allocator, when the normal pool is exhausted. As +these surplus huge pages go out of use, they are freed back to the buddy allocator. +When increasing the huge page pool size via nr_hugepages, any surplus +pages will first be promoted to persistent huge pages. Then, additional +huge pages will be allocated, if necessary and if possible, to fulfill +the new huge page pool size. + +The administrator may shrink the pool of preallocated huge pages for +the default huge page size by setting the nr_hugepages sysctl to a +smaller value. The kernel will attempt to balance the freeing of huge pages +across all on-line nodes. Any free huge pages on the selected nodes will +be freed back to the buddy allocator. + Caveat: Shrinking the pool via nr_hugepages such that it becomes less -than the number of hugepages in use will convert the balance to surplus +than the number of huge pages in use will convert the balance to surplus huge pages even if it would exceed the overcommit value. As long as this condition holds, however, no more surplus huge pages will be allowed on the system until one of the two sysctls are increased sufficiently, or the surplus huge pages go out of use and are freed. -With support for multiple hugepage pools at run-time available, much of -the hugepage userspace interface has been duplicated in sysfs. The above -information applies to the default hugepage size (which will be -controlled by the proc interfaces for backwards compatibility). The root -hugepage control directory is +With support for multiple huge page pools at run-time available, much of +the huge page userspace interface has been duplicated in sysfs. The above +information applies to the default huge page size which will be +controlled by the /proc interfaces for backwards compatibility. The root +huge page control directory in sysfs is: /sys/kernel/mm/hugepages -For each hugepage size supported by the running kernel, a subdirectory +For each huge page size supported by the running kernel, a subdirectory will exist, of the form hugepages-${size}kB @@ -116,9 +157,9 @@ Inside each of these directories, the same set of files will exist: resv_hugepages surplus_hugepages -which function as described above for the default hugepage-sized case. +which function as described above for the default huge page-sized case. -If the user applications are going to request hugepages using mmap system +If the user applications are going to request huge pages using mmap system call, then it is required that system administrator mount a file system of type hugetlbfs: @@ -127,7 +168,7 @@ type hugetlbfs: none /mnt/huge This command mounts a (pseudo) filesystem of type hugetlbfs on the directory -/mnt/huge. Any files created on /mnt/huge uses hugepages. The uid and gid +/mnt/huge. Any files created on /mnt/huge uses huge pages. The uid and gid options sets the owner and group of the root of the file system. By default the uid and gid of the current process are taken. The mode option sets the mode of root of file system to value & 0777. This value is given in octal. @@ -146,24 +187,26 @@ Regular chown, chgrp, and chmod commands (with right permissions) could be used to change the file attributes on hugetlbfs. Also, it is important to note that no such mount command is required if the -applications are going to use only shmat/shmget system calls. Users who -wish to use hugetlb page via shared memory segment should be a member of -a supplementary group and system admin needs to configure that gid into -/proc/sys/vm/hugetlb_shm_group. It is possible for same or different -applications to use any combination of mmaps and shm* calls, though the -mount of filesystem will be required for using mmap calls. +applications are going to use only shmat/shmget system calls or mmap with +MAP_HUGETLB. Users who wish to use hugetlb page via shared memory segment +should be a member of a supplementary group and system admin needs to +configure that gid into /proc/sys/vm/hugetlb_shm_group. It is possible for +same or different applications to use any combination of mmaps and shm* +calls, though the mount of filesystem will be required for using mmap calls +without MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see +map_hugetlb.c. ******************************************************************* /* - * Example of using hugepage memory in a user application using Sys V shared + * Example of using huge page memory in a user application using Sys V shared * memory system calls. In this example the app is requesting 256MB of * memory that is backed by huge pages. The application uses the flag * SHM_HUGETLB in the shmget system call to inform the kernel that it is - * requesting hugepages. + * requesting huge pages. * * For the ia64 architecture, the Linux kernel reserves Region number 4 for - * hugepages. That means the addresses starting with 0x800000... will need + * huge pages. That means the addresses starting with 0x800000... will need * to be specified. Specifying a fixed address is not required on ppc64, * i386 or x86_64. * @@ -252,14 +295,14 @@ int main(void) ******************************************************************* /* - * Example of using hugepage memory in a user application using the mmap + * Example of using huge page memory in a user application using the mmap * system call. Before running this application, make sure that the * administrator has mounted the hugetlbfs filesystem (on some directory * like /mnt) using the command mount -t hugetlbfs nodev /mnt. In this * example, the app is requesting memory of size 256MB that is backed by * huge pages. * - * For ia64 architecture, Linux kernel reserves Region number 4 for hugepages. + * For ia64 architecture, Linux kernel reserves Region number 4 for huge pages. * That means the addresses starting with 0x800000... will need to be * specified. Specifying a fixed address is not required on ppc64, i386 * or x86_64. diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt new file mode 100644 index 00000000000..262d8e6793a --- /dev/null +++ b/Documentation/vm/ksm.txt @@ -0,0 +1,90 @@ +How to use the Kernel Samepage Merging feature +---------------------------------------------- + +KSM is a memory-saving de-duplication feature, enabled by CONFIG_KSM=y, +added to the Linux kernel in 2.6.32. See mm/ksm.c for its implementation, +and http://lwn.net/Articles/306704/ and http://lwn.net/Articles/330589/ + +The KSM daemon ksmd periodically scans those areas of user memory which +have been registered with it, looking for pages of identical content which +can be replaced by a single write-protected page (which is automatically +copied if a process later wants to update its content). + +KSM was originally developed for use with KVM (where it was known as +Kernel Shared Memory), to fit more virtual machines into physical memory, +by sharing the data common between them. But it can be useful to any +application which generates many instances of the same data. + +KSM only merges anonymous (private) pages, never pagecache (file) pages. +KSM's merged pages are at present locked into kernel memory for as long +as they are shared: so cannot be swapped out like the user pages they +replace (but swapping KSM pages should follow soon in a later release). + +KSM only operates on those areas of address space which an application +has advised to be likely candidates for merging, by using the madvise(2) +system call: int madvise(addr, length, MADV_MERGEABLE). + +The app may call int madvise(addr, length, MADV_UNMERGEABLE) to cancel +that advice and restore unshared pages: whereupon KSM unmerges whatever +it merged in that range. Note: this unmerging call may suddenly require +more memory than is available - possibly failing with EAGAIN, but more +probably arousing the Out-Of-Memory killer. + +If KSM is not configured into the running kernel, madvise MADV_MERGEABLE +and MADV_UNMERGEABLE simply fail with EINVAL. If the running kernel was +built with CONFIG_KSM=y, those calls will normally succeed: even if the +the KSM daemon is not currently running, MADV_MERGEABLE still registers +the range for whenever the KSM daemon is started; even if the range +cannot contain any pages which KSM could actually merge; even if +MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE. + +Like other madvise calls, they are intended for use on mapped areas of +the user address space: they will report ENOMEM if the specified range +includes unmapped gaps (though working on the intervening mapped areas), +and might fail with EAGAIN if not enough memory for internal structures. + +Applications should be considerate in their use of MADV_MERGEABLE, +restricting its use to areas likely to benefit. KSM's scans may use +a lot of processing power, and its kernel-resident pages are a limited +resource. Some installations will disable KSM for these reasons. + +The KSM daemon is controlled by sysfs files in /sys/kernel/mm/ksm/, +readable by all but writable only by root: + +max_kernel_pages - set to maximum number of kernel pages that KSM may use + e.g. "echo 100000 > /sys/kernel/mm/ksm/max_kernel_pages" + Value 0 imposes no limit on the kernel pages KSM may use; + but note that any process using MADV_MERGEABLE can cause + KSM to allocate these pages, unswappable until it exits. + Default: quarter of memory (chosen to not pin too much) + +pages_to_scan - how many present pages to scan before ksmd goes to sleep + e.g. "echo 100 > /sys/kernel/mm/ksm/pages_to_scan" + Default: 100 (chosen for demonstration purposes) + +sleep_millisecs - how many milliseconds ksmd should sleep before next scan + e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs" + Default: 20 (chosen for demonstration purposes) + +run - set 0 to stop ksmd from running but keep merged pages, + set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run", + set 2 to stop ksmd and unmerge all pages currently merged, + but leave mergeable areas registered for next run + Default: 0 (must be changed to 1 to activate KSM, + except if CONFIG_SYSFS is disabled) + +The effectiveness of KSM and MADV_MERGEABLE is shown in /sys/kernel/mm/ksm/: + +pages_shared - how many shared unswappable kernel pages KSM is using +pages_sharing - how many more sites are sharing them i.e. how much saved +pages_unshared - how many pages unique but repeatedly checked for merging +pages_volatile - how many pages changing too fast to be placed in a tree +full_scans - how many times all mergeable areas have been scanned + +A high ratio of pages_sharing to pages_shared indicates good sharing, but +a high ratio of pages_unshared to pages_sharing indicates wasted effort. +pages_volatile embraces several different kinds of activity, but a high +proportion there would also indicate poor use of madvise MADV_MERGEABLE. + +Izik Eidus, +Hugh Dickins, 24 Sept 2009 diff --git a/Documentation/vm/locking b/Documentation/vm/locking index f366fa95617..25fadb44876 100644 --- a/Documentation/vm/locking +++ b/Documentation/vm/locking @@ -80,7 +80,7 @@ Note: PTL can also be used to guarantee that no new clones using the mm start up ... this is a loose form of stability on mm_users. For example, it is used in copy_mm to protect against a racing tlb_gather_mmu single address space optimization, so that the zap_page_range (from -vmtruncate) does not lose sending ipi's to cloned threads that might +truncate) does not lose sending ipi's to cloned threads that might be spawned underneath it and go to user mode to drag in pte's into tlbs. swap_lock diff --git a/Documentation/vm/map_hugetlb.c b/Documentation/vm/map_hugetlb.c new file mode 100644 index 00000000000..e2bdae37f49 --- /dev/null +++ b/Documentation/vm/map_hugetlb.c @@ -0,0 +1,77 @@ +/* + * Example of using hugepage memory in a user application using the mmap + * system call with MAP_HUGETLB flag. Before running this program make + * sure the administrator has allocated enough default sized huge pages + * to cover the 256 MB allocation. + * + * For ia64 architecture, Linux kernel reserves Region number 4 for hugepages. + * That means the addresses starting with 0x800000... will need to be + * specified. Specifying a fixed address is not required on ppc64, i386 + * or x86_64. + */ +#include <stdlib.h> +#include <stdio.h> +#include <unistd.h> +#include <sys/mman.h> +#include <fcntl.h> + +#define LENGTH (256UL*1024*1024) +#define PROTECTION (PROT_READ | PROT_WRITE) + +#ifndef MAP_HUGETLB +#define MAP_HUGETLB 0x40 +#endif + +/* Only ia64 requires this */ +#ifdef __ia64__ +#define ADDR (void *)(0x8000000000000000UL) +#define FLAGS (MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_FIXED) +#else +#define ADDR (void *)(0x0UL) +#define FLAGS (MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB) +#endif + +void check_bytes(char *addr) +{ + printf("First hex is %x\n", *((unsigned int *)addr)); +} + +void write_bytes(char *addr) +{ + unsigned long i; + + for (i = 0; i < LENGTH; i++) + *(addr + i) = (char)i; +} + +void read_bytes(char *addr) +{ + unsigned long i; + + check_bytes(addr); + for (i = 0; i < LENGTH; i++) + if (*(addr + i) != (char)i) { + printf("Mismatch at %lu\n", i); + break; + } +} + +int main(void) +{ + void *addr; + + addr = mmap(ADDR, LENGTH, PROTECTION, FLAGS, 0, 0); + if (addr == MAP_FAILED) { + perror("mmap"); + exit(1); + } + + printf("Returned address is %p\n", addr); + check_bytes(addr); + write_bytes(addr); + read_bytes(addr); + + munmap(addr, LENGTH); + + return 0; +} diff --git a/Documentation/vm/page-types.c b/Documentation/vm/page-types.c index 0833f44ba16..3ec4f2a2258 100644 --- a/Documentation/vm/page-types.c +++ b/Documentation/vm/page-types.c @@ -2,9 +2,13 @@ * page-types: Tool for querying page flags * * Copyright (C) 2009 Intel corporation - * Copyright (C) 2009 Wu Fengguang <fengguang.wu@intel.com> + * + * Authors: Wu Fengguang <fengguang.wu@intel.com> + * + * Released under the General Public License (GPL). */ +#define _LARGEFILE64_SOURCE #include <stdio.h> #include <stdlib.h> #include <unistd.h> @@ -13,12 +17,33 @@ #include <string.h> #include <getopt.h> #include <limits.h> +#include <assert.h> #include <sys/types.h> #include <sys/errno.h> #include <sys/fcntl.h> /* + * pagemap kernel ABI bits + */ + +#define PM_ENTRY_BYTES sizeof(uint64_t) +#define PM_STATUS_BITS 3 +#define PM_STATUS_OFFSET (64 - PM_STATUS_BITS) +#define PM_STATUS_MASK (((1LL << PM_STATUS_BITS) - 1) << PM_STATUS_OFFSET) +#define PM_STATUS(nr) (((nr) << PM_STATUS_OFFSET) & PM_STATUS_MASK) +#define PM_PSHIFT_BITS 6 +#define PM_PSHIFT_OFFSET (PM_STATUS_OFFSET - PM_PSHIFT_BITS) +#define PM_PSHIFT_MASK (((1LL << PM_PSHIFT_BITS) - 1) << PM_PSHIFT_OFFSET) +#define PM_PSHIFT(x) (((u64) (x) << PM_PSHIFT_OFFSET) & PM_PSHIFT_MASK) +#define PM_PFRAME_MASK ((1LL << PM_PSHIFT_OFFSET) - 1) +#define PM_PFRAME(x) ((x) & PM_PFRAME_MASK) + +#define PM_PRESENT PM_STATUS(4LL) +#define PM_SWAP PM_STATUS(2LL) + + +/* * kernel page flags */ @@ -47,7 +72,9 @@ #define KPF_COMPOUND_TAIL 16 #define KPF_HUGE 17 #define KPF_UNEVICTABLE 18 +#define KPF_HWPOISON 19 #define KPF_NOPAGE 20 +#define KPF_KSM 21 /* [32-] kernel hacking assistances */ #define KPF_RESERVED 32 @@ -94,7 +121,9 @@ static char *page_flag_names[] = { [KPF_COMPOUND_TAIL] = "T:compound_tail", [KPF_HUGE] = "G:huge", [KPF_UNEVICTABLE] = "u:unevictable", + [KPF_HWPOISON] = "X:hwpoison", [KPF_NOPAGE] = "n:nopage", + [KPF_KSM] = "x:ksm", [KPF_RESERVED] = "r:reserved", [KPF_MLOCKED] = "m:mlocked", @@ -126,6 +155,11 @@ static int nr_addr_ranges; static unsigned long opt_offset[MAX_ADDR_RANGES]; static unsigned long opt_size[MAX_ADDR_RANGES]; +#define MAX_VMAS 10240 +static int nr_vmas; +static unsigned long pg_start[MAX_VMAS]; +static unsigned long pg_end[MAX_VMAS]; + #define MAX_BIT_FILTERS 64 static int nr_bit_filters; static uint64_t opt_mask[MAX_BIT_FILTERS]; @@ -133,9 +167,15 @@ static uint64_t opt_bits[MAX_BIT_FILTERS]; static int page_size; -#define PAGES_BATCH (64 << 10) /* 64k pages */ +static int pagemap_fd; static int kpageflags_fd; -static uint64_t kpageflags_buf[KPF_BYTES * PAGES_BATCH]; + +static int opt_hwpoison; +static int opt_unpoison; + +static char *hwpoison_debug_fs = "/debug/hwpoison"; +static int hwpoison_inject_fd; +static int hwpoison_forget_fd; #define HASH_SHIFT 13 #define HASH_SIZE (1 << HASH_SHIFT) @@ -158,12 +198,17 @@ static uint64_t page_flags[HASH_SIZE]; type __min2 = (y); \ __min1 < __min2 ? __min1 : __min2; }) -unsigned long pages2mb(unsigned long pages) +#define max_t(type, x, y) ({ \ + type __max1 = (x); \ + type __max2 = (y); \ + __max1 > __max2 ? __max1 : __max2; }) + +static unsigned long pages2mb(unsigned long pages) { return (pages * page_size) >> 20; } -void fatal(const char *x, ...) +static void fatal(const char *x, ...) { va_list ap; @@ -173,12 +218,80 @@ void fatal(const char *x, ...) exit(EXIT_FAILURE); } +int checked_open(const char *pathname, int flags) +{ + int fd = open(pathname, flags); + + if (fd < 0) { + perror(pathname); + exit(EXIT_FAILURE); + } + + return fd; +} + +/* + * pagemap/kpageflags routines + */ + +static unsigned long do_u64_read(int fd, char *name, + uint64_t *buf, + unsigned long index, + unsigned long count) +{ + long bytes; + + if (index > ULONG_MAX / 8) + fatal("index overflow: %lu\n", index); + + if (lseek(fd, index * 8, SEEK_SET) < 0) { + perror(name); + exit(EXIT_FAILURE); + } + + bytes = read(fd, buf, count * 8); + if (bytes < 0) { + perror(name); + exit(EXIT_FAILURE); + } + if (bytes % 8) + fatal("partial read: %lu bytes\n", bytes); + + return bytes / 8; +} + +static unsigned long kpageflags_read(uint64_t *buf, + unsigned long index, + unsigned long pages) +{ + return do_u64_read(kpageflags_fd, PROC_KPAGEFLAGS, buf, index, pages); +} + +static unsigned long pagemap_read(uint64_t *buf, + unsigned long index, + unsigned long pages) +{ + return do_u64_read(pagemap_fd, "/proc/pid/pagemap", buf, index, pages); +} + +static unsigned long pagemap_pfn(uint64_t val) +{ + unsigned long pfn; + + if (val & PM_PRESENT) + pfn = PM_PFRAME(val); + else + pfn = 0; + + return pfn; +} + /* * page flag names */ -char *page_flag_name(uint64_t flags) +static char *page_flag_name(uint64_t flags) { static char buf[65]; int present; @@ -197,7 +310,7 @@ char *page_flag_name(uint64_t flags) return buf; } -char *page_flag_longname(uint64_t flags) +static char *page_flag_longname(uint64_t flags) { static char buf[1024]; int i, n; @@ -221,32 +334,42 @@ char *page_flag_longname(uint64_t flags) * page list and summary */ -void show_page_range(unsigned long offset, uint64_t flags) +static void show_page_range(unsigned long voffset, + unsigned long offset, uint64_t flags) { static uint64_t flags0; + static unsigned long voff; static unsigned long index; static unsigned long count; - if (flags == flags0 && offset == index + count) { + if (flags == flags0 && offset == index + count && + (!opt_pid || voffset == voff + count)) { count++; return; } - if (count) - printf("%lu\t%lu\t%s\n", + if (count) { + if (opt_pid) + printf("%lx\t", voff); + printf("%lx\t%lx\t%s\n", index, count, page_flag_name(flags0)); + } flags0 = flags; index = offset; + voff = voffset; count = 1; } -void show_page(unsigned long offset, uint64_t flags) +static void show_page(unsigned long voffset, + unsigned long offset, uint64_t flags) { - printf("%lu\t%s\n", offset, page_flag_name(flags)); + if (opt_pid) + printf("%lx\t", voffset); + printf("%lx\t%s\n", offset, page_flag_name(flags)); } -void show_summary(void) +static void show_summary(void) { int i; @@ -272,7 +395,7 @@ void show_summary(void) * page flag filters */ -int bit_mask_ok(uint64_t flags) +static int bit_mask_ok(uint64_t flags) { int i; @@ -289,7 +412,7 @@ int bit_mask_ok(uint64_t flags) return 1; } -uint64_t expand_overloaded_flags(uint64_t flags) +static uint64_t expand_overloaded_flags(uint64_t flags) { /* SLOB/SLUB overload several page flags */ if (flags & BIT(SLAB)) { @@ -308,7 +431,7 @@ uint64_t expand_overloaded_flags(uint64_t flags) return flags; } -uint64_t well_known_flags(uint64_t flags) +static uint64_t well_known_flags(uint64_t flags) { /* hide flags intended only for kernel hacker */ flags &= ~KPF_HACKERS_BITS; @@ -320,12 +443,68 @@ uint64_t well_known_flags(uint64_t flags) return flags; } +static uint64_t kpageflags_flags(uint64_t flags) +{ + flags = expand_overloaded_flags(flags); + + if (!opt_raw) + flags = well_known_flags(flags); + + return flags; +} + +/* + * page actions + */ + +static void prepare_hwpoison_fd(void) +{ + char buf[100]; + + if (opt_hwpoison && !hwpoison_inject_fd) { + sprintf(buf, "%s/corrupt-pfn", hwpoison_debug_fs); + hwpoison_inject_fd = checked_open(buf, O_WRONLY); + } + + if (opt_unpoison && !hwpoison_forget_fd) { + sprintf(buf, "%s/renew-pfn", hwpoison_debug_fs); + hwpoison_forget_fd = checked_open(buf, O_WRONLY); + } +} + +static int hwpoison_page(unsigned long offset) +{ + char buf[100]; + int len; + + len = sprintf(buf, "0x%lx\n", offset); + len = write(hwpoison_inject_fd, buf, len); + if (len < 0) { + perror("hwpoison inject"); + return len; + } + return 0; +} + +static int unpoison_page(unsigned long offset) +{ + char buf[100]; + int len; + + len = sprintf(buf, "0x%lx\n", offset); + len = write(hwpoison_forget_fd, buf, len); + if (len < 0) { + perror("hwpoison forget"); + return len; + } + return 0; +} /* * page frame walker */ -int hash_slot(uint64_t flags) +static int hash_slot(uint64_t flags) { int k = HASH_KEY(flags); int i; @@ -352,73 +531,124 @@ int hash_slot(uint64_t flags) exit(EXIT_FAILURE); } -void add_page(unsigned long offset, uint64_t flags) +static void add_page(unsigned long voffset, + unsigned long offset, uint64_t flags) { - flags = expand_overloaded_flags(flags); - - if (!opt_raw) - flags = well_known_flags(flags); + flags = kpageflags_flags(flags); if (!bit_mask_ok(flags)) return; + if (opt_hwpoison) + hwpoison_page(offset); + if (opt_unpoison) + unpoison_page(offset); + if (opt_list == 1) - show_page_range(offset, flags); + show_page_range(voffset, offset, flags); else if (opt_list == 2) - show_page(offset, flags); + show_page(voffset, offset, flags); nr_pages[hash_slot(flags)]++; total_pages++; } -void walk_pfn(unsigned long index, unsigned long count) +#define KPAGEFLAGS_BATCH (64 << 10) /* 64k pages */ +static void walk_pfn(unsigned long voffset, + unsigned long index, + unsigned long count) { + uint64_t buf[KPAGEFLAGS_BATCH]; unsigned long batch; - unsigned long n; + unsigned long pages; unsigned long i; - if (index > ULONG_MAX / KPF_BYTES) - fatal("index overflow: %lu\n", index); + while (count) { + batch = min_t(unsigned long, count, KPAGEFLAGS_BATCH); + pages = kpageflags_read(buf, index, batch); + if (pages == 0) + break; + + for (i = 0; i < pages; i++) + add_page(voffset + i, index + i, buf[i]); - lseek(kpageflags_fd, index * KPF_BYTES, SEEK_SET); + index += pages; + count -= pages; + } +} + +#define PAGEMAP_BATCH (64 << 10) +static void walk_vma(unsigned long index, unsigned long count) +{ + uint64_t buf[PAGEMAP_BATCH]; + unsigned long batch; + unsigned long pages; + unsigned long pfn; + unsigned long i; while (count) { - batch = min_t(unsigned long, count, PAGES_BATCH); - n = read(kpageflags_fd, kpageflags_buf, batch * KPF_BYTES); - if (n == 0) + batch = min_t(unsigned long, count, PAGEMAP_BATCH); + pages = pagemap_read(buf, index, batch); + if (pages == 0) break; - if (n < 0) { - perror(PROC_KPAGEFLAGS); - exit(EXIT_FAILURE); + + for (i = 0; i < pages; i++) { + pfn = pagemap_pfn(buf[i]); + if (pfn) + walk_pfn(index + i, pfn, 1); } - if (n % KPF_BYTES != 0) - fatal("partial read: %lu bytes\n", n); - n = n / KPF_BYTES; + index += pages; + count -= pages; + } +} + +static void walk_task(unsigned long index, unsigned long count) +{ + const unsigned long end = index + count; + unsigned long start; + int i = 0; + + while (index < end) { + + while (pg_end[i] <= index) + if (++i >= nr_vmas) + return; + if (pg_start[i] >= end) + return; - for (i = 0; i < n; i++) - add_page(index + i, kpageflags_buf[i]); + start = max_t(unsigned long, pg_start[i], index); + index = min_t(unsigned long, pg_end[i], end); - index += batch; - count -= batch; + assert(start < index); + walk_vma(start, index - start); } } -void walk_addr_ranges(void) +static void add_addr_range(unsigned long offset, unsigned long size) +{ + if (nr_addr_ranges >= MAX_ADDR_RANGES) + fatal("too many addr ranges\n"); + + opt_offset[nr_addr_ranges] = offset; + opt_size[nr_addr_ranges] = min_t(unsigned long, size, ULONG_MAX-offset); + nr_addr_ranges++; +} + +static void walk_addr_ranges(void) { int i; - kpageflags_fd = open(PROC_KPAGEFLAGS, O_RDONLY); - if (kpageflags_fd < 0) { - perror(PROC_KPAGEFLAGS); - exit(EXIT_FAILURE); - } + kpageflags_fd = checked_open(PROC_KPAGEFLAGS, O_RDONLY); if (!nr_addr_ranges) - walk_pfn(0, ULONG_MAX); + add_addr_range(0, ULONG_MAX); for (i = 0; i < nr_addr_ranges; i++) - walk_pfn(opt_offset[i], opt_size[i]); + if (!opt_pid) + walk_pfn(0, opt_offset[i], opt_size[i]); + else + walk_task(opt_offset[i], opt_size[i]); close(kpageflags_fd); } @@ -428,7 +658,7 @@ void walk_addr_ranges(void) * user interface */ -const char *page_flag_type(uint64_t flag) +static const char *page_flag_type(uint64_t flag) { if (flag & KPF_HACKERS_BITS) return "(r)"; @@ -437,7 +667,7 @@ const char *page_flag_type(uint64_t flag) return " "; } -void usage(void) +static void usage(void) { int i, j; @@ -446,20 +676,22 @@ void usage(void) " -r|--raw Raw mode, for kernel developers\n" " -a|--addr addr-spec Walk a range of pages\n" " -b|--bits bits-spec Walk pages with specified bits\n" -#if 0 /* planned features */ " -p|--pid pid Walk process address space\n" +#if 0 /* planned features */ " -f|--file filename Walk file address space\n" #endif " -l|--list Show page details in ranges\n" " -L|--list-each Show page details one by one\n" " -N|--no-summary Don't show summay info\n" +" -X|--hwpoison hwpoison pages\n" +" -x|--unpoison unpoison pages\n" " -h|--help Show this usage message\n" "addr-spec:\n" " N one page at offset N (unit: pages)\n" " N+M pages range from N to N+M-1\n" " N,M pages range from N to M-1\n" " N, pages range from N to end\n" -" ,M pages range from 0 to M\n" +" ,M pages range from 0 to M-1\n" "bits-spec:\n" " bit1,bit2 (flags & (bit1|bit2)) != 0\n" " bit1,bit2=bit1 (flags & (bit1|bit2)) == bit1\n" @@ -482,7 +714,7 @@ void usage(void) "(r) raw mode bits (o) overloaded bits\n"); } -unsigned long long parse_number(const char *str) +static unsigned long long parse_number(const char *str) { unsigned long long n; @@ -494,26 +726,58 @@ unsigned long long parse_number(const char *str) return n; } -void parse_pid(const char *str) +static void parse_pid(const char *str) { + FILE *file; + char buf[5000]; + opt_pid = parse_number(str); -} -void parse_file(const char *name) -{ + sprintf(buf, "/proc/%d/pagemap", opt_pid); + pagemap_fd = checked_open(buf, O_RDONLY); + + sprintf(buf, "/proc/%d/maps", opt_pid); + file = fopen(buf, "r"); + if (!file) { + perror(buf); + exit(EXIT_FAILURE); + } + + while (fgets(buf, sizeof(buf), file) != NULL) { + unsigned long vm_start; + unsigned long vm_end; + unsigned long long pgoff; + int major, minor; + char r, w, x, s; + unsigned long ino; + int n; + + n = sscanf(buf, "%lx-%lx %c%c%c%c %llx %x:%x %lu", + &vm_start, + &vm_end, + &r, &w, &x, &s, + &pgoff, + &major, &minor, + &ino); + if (n < 10) { + fprintf(stderr, "unexpected line: %s\n", buf); + continue; + } + pg_start[nr_vmas] = vm_start / page_size; + pg_end[nr_vmas] = vm_end / page_size; + if (++nr_vmas >= MAX_VMAS) { + fprintf(stderr, "too many VMAs\n"); + break; + } + } + fclose(file); } -void add_addr_range(unsigned long offset, unsigned long size) +static void parse_file(const char *name) { - if (nr_addr_ranges >= MAX_ADDR_RANGES) - fatal("too much addr ranges\n"); - - opt_offset[nr_addr_ranges] = offset; - opt_size[nr_addr_ranges] = size; - nr_addr_ranges++; } -void parse_addr_range(const char *optarg) +static void parse_addr_range(const char *optarg) { unsigned long offset; unsigned long size; @@ -547,7 +811,7 @@ void parse_addr_range(const char *optarg) add_addr_range(offset, size); } -void add_bits_filter(uint64_t mask, uint64_t bits) +static void add_bits_filter(uint64_t mask, uint64_t bits) { if (nr_bit_filters >= MAX_BIT_FILTERS) fatal("too much bit filters\n"); @@ -557,7 +821,7 @@ void add_bits_filter(uint64_t mask, uint64_t bits) nr_bit_filters++; } -uint64_t parse_flag_name(const char *str, int len) +static uint64_t parse_flag_name(const char *str, int len) { int i; @@ -577,7 +841,7 @@ uint64_t parse_flag_name(const char *str, int len) return parse_number(str); } -uint64_t parse_flag_names(const char *str, int all) +static uint64_t parse_flag_names(const char *str, int all) { const char *p = str; uint64_t flags = 0; @@ -596,7 +860,7 @@ uint64_t parse_flag_names(const char *str, int all) return flags; } -void parse_bits_mask(const char *optarg) +static void parse_bits_mask(const char *optarg) { uint64_t mask; uint64_t bits; @@ -621,7 +885,7 @@ void parse_bits_mask(const char *optarg) } -struct option opts[] = { +static struct option opts[] = { { "raw" , 0, NULL, 'r' }, { "pid" , 1, NULL, 'p' }, { "file" , 1, NULL, 'f' }, @@ -630,6 +894,8 @@ struct option opts[] = { { "list" , 0, NULL, 'l' }, { "list-each" , 0, NULL, 'L' }, { "no-summary", 0, NULL, 'N' }, + { "hwpoison" , 0, NULL, 'X' }, + { "unpoison" , 0, NULL, 'x' }, { "help" , 0, NULL, 'h' }, { NULL , 0, NULL, 0 } }; @@ -641,7 +907,7 @@ int main(int argc, char *argv[]) page_size = getpagesize(); while ((c = getopt_long(argc, argv, - "rp:f:a:b:lLNh", opts, NULL)) != -1) { + "rp:f:a:b:lLNXxh", opts, NULL)) != -1) { switch (c) { case 'r': opt_raw = 1; @@ -667,6 +933,14 @@ int main(int argc, char *argv[]) case 'N': opt_no_summary = 1; break; + case 'X': + opt_hwpoison = 1; + prepare_hwpoison_fd(); + break; + case 'x': + opt_unpoison = 1; + prepare_hwpoison_fd(); + break; case 'h': usage(); exit(0); @@ -676,15 +950,17 @@ int main(int argc, char *argv[]) } } + if (opt_list && opt_pid) + printf("voffset\t"); if (opt_list == 1) - printf("offset\tcount\tflags\n"); + printf("offset\tlen\tflags\n"); if (opt_list == 2) printf("offset\tflags\n"); walk_addr_ranges(); if (opt_list == 1) - show_page_range(0, 0); /* drain the buffer */ + show_page_range(0, 0, 0); /* drain the buffer */ if (opt_no_summary) return 0; diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt index 600a304a828..df09b9650a8 100644 --- a/Documentation/vm/pagemap.txt +++ b/Documentation/vm/pagemap.txt @@ -57,7 +57,9 @@ There are three components to pagemap: 16. COMPOUND_TAIL 16. HUGE 18. UNEVICTABLE + 19. HWPOISON 20. NOPAGE + 21. KSM Short descriptions to the page flags: @@ -86,9 +88,15 @@ Short descriptions to the page flags: 17. HUGE this is an integral part of a HugeTLB page +19. HWPOISON + hardware detected memory corruption on this page: don't touch the data! + 20. NOPAGE no page frame exists at the requested address +21. KSM + identical memory pages dynamically shared between one or more processes + [IO related page flags] 1. ERROR IO error occurred 3. UPTODATE page has up-to-date data diff --git a/Documentation/vm/slabinfo.c b/Documentation/vm/slabinfo.c index df3227605d5..92e729f4b67 100644 --- a/Documentation/vm/slabinfo.c +++ b/Documentation/vm/slabinfo.c @@ -87,7 +87,7 @@ int page_size; regex_t pattern; -void fatal(const char *x, ...) +static void fatal(const char *x, ...) { va_list ap; @@ -97,7 +97,7 @@ void fatal(const char *x, ...) exit(EXIT_FAILURE); } -void usage(void) +static void usage(void) { printf("slabinfo 5/7/2007. (c) 2007 sgi.\n\n" "slabinfo [-ahnpvtsz] [-d debugopts] [slab-regexp]\n" @@ -131,7 +131,7 @@ void usage(void) ); } -unsigned long read_obj(const char *name) +static unsigned long read_obj(const char *name) { FILE *f = fopen(name, "r"); @@ -151,7 +151,7 @@ unsigned long read_obj(const char *name) /* * Get the contents of an attribute */ -unsigned long get_obj(const char *name) +static unsigned long get_obj(const char *name) { if (!read_obj(name)) return 0; @@ -159,7 +159,7 @@ unsigned long get_obj(const char *name) return atol(buffer); } -unsigned long get_obj_and_str(const char *name, char **x) +static unsigned long get_obj_and_str(const char *name, char **x) { unsigned long result = 0; char *p; @@ -178,7 +178,7 @@ unsigned long get_obj_and_str(const char *name, char **x) return result; } -void set_obj(struct slabinfo *s, const char *name, int n) +static void set_obj(struct slabinfo *s, const char *name, int n) { char x[100]; FILE *f; @@ -192,7 +192,7 @@ void set_obj(struct slabinfo *s, const char *name, int n) fclose(f); } -unsigned long read_slab_obj(struct slabinfo *s, const char *name) +static unsigned long read_slab_obj(struct slabinfo *s, const char *name) { char x[100]; FILE *f; @@ -215,7 +215,7 @@ unsigned long read_slab_obj(struct slabinfo *s, const char *name) /* * Put a size string together */ -int store_size(char *buffer, unsigned long value) +static int store_size(char *buffer, unsigned long value) { unsigned long divisor = 1; char trailer = 0; @@ -247,7 +247,7 @@ int store_size(char *buffer, unsigned long value) return n; } -void decode_numa_list(int *numa, char *t) +static void decode_numa_list(int *numa, char *t) { int node; int nr; @@ -272,7 +272,7 @@ void decode_numa_list(int *numa, char *t) } } -void slab_validate(struct slabinfo *s) +static void slab_validate(struct slabinfo *s) { if (strcmp(s->name, "*") == 0) return; @@ -280,7 +280,7 @@ void slab_validate(struct slabinfo *s) set_obj(s, "validate", 1); } -void slab_shrink(struct slabinfo *s) +static void slab_shrink(struct slabinfo *s) { if (strcmp(s->name, "*") == 0) return; @@ -290,7 +290,7 @@ void slab_shrink(struct slabinfo *s) int line = 0; -void first_line(void) +static void first_line(void) { if (show_activity) printf("Name Objects Alloc Free %%Fast Fallb O\n"); @@ -302,7 +302,7 @@ void first_line(void) /* * Find the shortest alias of a slab */ -struct aliasinfo *find_one_alias(struct slabinfo *find) +static struct aliasinfo *find_one_alias(struct slabinfo *find) { struct aliasinfo *a; struct aliasinfo *best = NULL; @@ -318,18 +318,18 @@ struct aliasinfo *find_one_alias(struct slabinfo *find) return best; } -unsigned long slab_size(struct slabinfo *s) +static unsigned long slab_size(struct slabinfo *s) { return s->slabs * (page_size << s->order); } -unsigned long slab_activity(struct slabinfo *s) +static unsigned long slab_activity(struct slabinfo *s) { return s->alloc_fastpath + s->free_fastpath + s->alloc_slowpath + s->free_slowpath; } -void slab_numa(struct slabinfo *s, int mode) +static void slab_numa(struct slabinfo *s, int mode) { int node; @@ -374,7 +374,7 @@ void slab_numa(struct slabinfo *s, int mode) line++; } -void show_tracking(struct slabinfo *s) +static void show_tracking(struct slabinfo *s) { printf("\n%s: Kernel object allocation\n", s->name); printf("-----------------------------------------------------------------------\n"); @@ -392,7 +392,7 @@ void show_tracking(struct slabinfo *s) } -void ops(struct slabinfo *s) +static void ops(struct slabinfo *s) { if (strcmp(s->name, "*") == 0) return; @@ -405,14 +405,14 @@ void ops(struct slabinfo *s) printf("\n%s has no kmem_cache operations\n", s->name); } -const char *onoff(int x) +static const char *onoff(int x) { if (x) return "On "; return "Off"; } -void slab_stats(struct slabinfo *s) +static void slab_stats(struct slabinfo *s) { unsigned long total_alloc; unsigned long total_free; @@ -477,7 +477,7 @@ void slab_stats(struct slabinfo *s) s->deactivate_to_tail, (s->deactivate_to_tail * 100) / total); } -void report(struct slabinfo *s) +static void report(struct slabinfo *s) { if (strcmp(s->name, "*") == 0) return; @@ -518,7 +518,7 @@ void report(struct slabinfo *s) slab_stats(s); } -void slabcache(struct slabinfo *s) +static void slabcache(struct slabinfo *s) { char size_str[20]; char dist_str[40]; @@ -593,7 +593,7 @@ void slabcache(struct slabinfo *s) /* * Analyze debug options. Return false if something is amiss. */ -int debug_opt_scan(char *opt) +static int debug_opt_scan(char *opt) { if (!opt || !opt[0] || strcmp(opt, "-") == 0) return 1; @@ -642,7 +642,7 @@ int debug_opt_scan(char *opt) return 1; } -int slab_empty(struct slabinfo *s) +static int slab_empty(struct slabinfo *s) { if (s->objects > 0) return 0; @@ -657,7 +657,7 @@ int slab_empty(struct slabinfo *s) return 1; } -void slab_debug(struct slabinfo *s) +static void slab_debug(struct slabinfo *s) { if (strcmp(s->name, "*") == 0) return; @@ -717,7 +717,7 @@ void slab_debug(struct slabinfo *s) set_obj(s, "trace", 1); } -void totals(void) +static void totals(void) { struct slabinfo *s; @@ -976,7 +976,7 @@ void totals(void) b1, b2, b3); } -void sort_slabs(void) +static void sort_slabs(void) { struct slabinfo *s1,*s2; @@ -1005,7 +1005,7 @@ void sort_slabs(void) } } -void sort_aliases(void) +static void sort_aliases(void) { struct aliasinfo *a1,*a2; @@ -1030,7 +1030,7 @@ void sort_aliases(void) } } -void link_slabs(void) +static void link_slabs(void) { struct aliasinfo *a; struct slabinfo *s; @@ -1048,7 +1048,7 @@ void link_slabs(void) } } -void alias(void) +static void alias(void) { struct aliasinfo *a; char *active = NULL; @@ -1079,7 +1079,7 @@ void alias(void) } -void rename_slabs(void) +static void rename_slabs(void) { struct slabinfo *s; struct aliasinfo *a; @@ -1102,12 +1102,12 @@ void rename_slabs(void) } } -int slab_mismatch(char *slab) +static int slab_mismatch(char *slab) { return regexec(&pattern, slab, 0, NULL, 0); } -void read_slab_dir(void) +static void read_slab_dir(void) { DIR *dir; struct dirent *de; @@ -1209,7 +1209,7 @@ void read_slab_dir(void) fatal("Too many aliases\n"); } -void output_slabs(void) +static void output_slabs(void) { struct slabinfo *slab; diff --git a/Documentation/w1/masters/ds2482 b/Documentation/w1/masters/ds2482 index 9210d6fa502..299b91c7609 100644 --- a/Documentation/w1/masters/ds2482 +++ b/Documentation/w1/masters/ds2482 @@ -24,8 +24,8 @@ General Remarks Valid addresses are 0x18, 0x19, 0x1a, and 0x1b. However, the device cannot be detected without writing to the i2c bus, so no -detection is done. -You should force the device address. +detection is done. You should instantiate the device explicitly. -$ modprobe ds2482 force=0,0x18 +$ modprobe ds2482 +$ echo ds2482 0x18 > /sys/bus/i2c/devices/i2c-0/new_device diff --git a/Documentation/watchdog/src/watchdog-test.c b/Documentation/watchdog/src/watchdog-test.c index 65f6c19cb86..a750532ffcf 100644 --- a/Documentation/watchdog/src/watchdog-test.c +++ b/Documentation/watchdog/src/watchdog-test.c @@ -18,7 +18,7 @@ int fd; * the PC Watchdog card to reset its internal timer so it doesn't trigger * a computer reset. */ -void keep_alive(void) +static void keep_alive(void) { int dummy; diff --git a/Documentation/x86/earlyprintk.txt b/Documentation/x86/earlyprintk.txt index 607b1a01606..f19802c0f48 100644 --- a/Documentation/x86/earlyprintk.txt +++ b/Documentation/x86/earlyprintk.txt @@ -7,7 +7,7 @@ and two USB cables, connected like this: [host/target] <-------> [USB debug key] <-------> [client/console] -1. There are three specific hardware requirements: +1. There are a number of specific hardware requirements: a.) Host/target system needs to have USB debug port capability. @@ -42,7 +42,35 @@ and two USB cables, connected like this: This is a small blue plastic connector with two USB connections, it draws power from its USB connections. - c.) Thirdly, you need a second client/console system with a regular USB port. + c.) You need a second client/console system with a high speed USB 2.0 + port. + + d.) The Netchip device must be plugged directly into the physical + debug port on the "host/target" system. You cannot use a USB hub in + between the physical debug port and the "host/target" system. + + The EHCI debug controller is bound to a specific physical USB + port and the Netchip device will only work as an early printk + device in this port. The EHCI host controllers are electrically + wired such that the EHCI debug controller is hooked up to the + first physical and there is no way to change this via software. + You can find the physical port through experimentation by trying + each physical port on the system and rebooting. Or you can try + and use lsusb or look at the kernel info messages emitted by the + usb stack when you plug a usb device into various ports on the + "host/target" system. + + Some hardware vendors do not expose the usb debug port with a + physical connector and if you find such a device send a complaint + to the hardware vendor, because there is no reason not to wire + this port into one of the physically accessible ports. + + e.) It is also important to note, that many versions of the Netchip + device require the "client/console" system to be plugged into the + right and side of the device (with the product logo facing up and + readable left to right). The reason being is that the 5 volt + power supply is taken from only one side of the device and it + must be the side that does not get rebooted. 2. Software requirements: @@ -56,6 +84,13 @@ and two USB cables, connected like this: (If you are using Grub, append it to the 'kernel' line in /etc/grub.conf) + On systems with more than one EHCI debug controller you must + specify the correct EHCI debug controller number. The ordering + comes from the PCI bus enumeration of the EHCI controllers. The + default with no number argument is "0" the first EHCI debug + controller. To use the second EHCI debug controller, you would + use the command line: "earlyprintk=dbgp1" + NOTE: normally earlyprintk console gets turned off once the regular console is alive - use "earlyprintk=dbgp,keep" to keep this channel open beyond early bootup. This can be useful for |