mirror of
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
synced 2025-01-09 14:43:16 +00:00
Merge branches 'bugzilla-14668' and 'misc-2.6.35' into release
This commit is contained in:
commit
b42f5b0f0f
@ -250,6 +250,8 @@ numastat.txt
|
||||
- info on how to read Numa policy hit/miss statistics in sysfs.
|
||||
oops-tracing.txt
|
||||
- how to decode those nasty internal kernel error dump messages.
|
||||
padata.txt
|
||||
- An introduction to the "padata" parallel execution API
|
||||
parisc/
|
||||
- directory with info on using Linux on PA-RISC architecture.
|
||||
parport.txt
|
||||
|
31
Documentation/ABI/obsolete/sysfs-bus-usb
Normal file
31
Documentation/ABI/obsolete/sysfs-bus-usb
Normal file
@ -0,0 +1,31 @@
|
||||
What: /sys/bus/usb/devices/.../power/level
|
||||
Date: March 2007
|
||||
KernelVersion: 2.6.21
|
||||
Contact: Alan Stern <stern@rowland.harvard.edu>
|
||||
Description:
|
||||
Each USB device directory will contain a file named
|
||||
power/level. This file holds a power-level setting for
|
||||
the device, either "on" or "auto".
|
||||
|
||||
"on" means that the device is not allowed to autosuspend,
|
||||
although normal suspends for system sleep will still
|
||||
be honored. "auto" means the device will autosuspend
|
||||
and autoresume in the usual manner, according to the
|
||||
capabilities of its driver.
|
||||
|
||||
During normal use, devices should be left in the "auto"
|
||||
level. The "on" level is meant for administrative uses.
|
||||
If you want to suspend a device immediately but leave it
|
||||
free to wake up in response to I/O requests, you should
|
||||
write "0" to power/autosuspend.
|
||||
|
||||
Device not capable of proper suspend and resume should be
|
||||
left in the "on" level. Although the USB spec requires
|
||||
devices to support suspend/resume, many of them do not.
|
||||
In fact so many don't that by default, the USB core
|
||||
initializes all non-hub devices in the "on" level. Some
|
||||
drivers may change this setting when they are bound.
|
||||
|
||||
This file is deprecated and will be removed after 2010.
|
||||
Use the power/control file instead; it does exactly the
|
||||
same thing.
|
29
Documentation/ABI/obsolete/sysfs-class-rfkill
Normal file
29
Documentation/ABI/obsolete/sysfs-class-rfkill
Normal file
@ -0,0 +1,29 @@
|
||||
rfkill - radio frequency (RF) connector kill switch support
|
||||
|
||||
For details to this subsystem look at Documentation/rfkill.txt.
|
||||
|
||||
What: /sys/class/rfkill/rfkill[0-9]+/state
|
||||
Date: 09-Jul-2007
|
||||
KernelVersion v2.6.22
|
||||
Contact: linux-wireless@vger.kernel.org
|
||||
Description: Current state of the transmitter.
|
||||
This file is deprecated and sheduled to be removed in 2014,
|
||||
because its not possible to express the 'soft and hard block'
|
||||
state of the rfkill driver.
|
||||
Values: A numeric value.
|
||||
0: RFKILL_STATE_SOFT_BLOCKED
|
||||
transmitter is turned off by software
|
||||
1: RFKILL_STATE_UNBLOCKED
|
||||
transmitter is (potentially) active
|
||||
2: RFKILL_STATE_HARD_BLOCKED
|
||||
transmitter is forced off by something outside of
|
||||
the driver's control.
|
||||
|
||||
What: /sys/class/rfkill/rfkill[0-9]+/claim
|
||||
Date: 09-Jul-2007
|
||||
KernelVersion v2.6.22
|
||||
Contact: linux-wireless@vger.kernel.org
|
||||
Description: This file is deprecated because there no longer is a way to
|
||||
claim just control over a single rfkill instance.
|
||||
This file is scheduled to be removed in 2012.
|
||||
Values: 0: Kernel handles events
|
67
Documentation/ABI/stable/sysfs-class-rfkill
Normal file
67
Documentation/ABI/stable/sysfs-class-rfkill
Normal file
@ -0,0 +1,67 @@
|
||||
rfkill - radio frequency (RF) connector kill switch support
|
||||
|
||||
For details to this subsystem look at Documentation/rfkill.txt.
|
||||
|
||||
For the deprecated /sys/class/rfkill/*/state and
|
||||
/sys/class/rfkill/*/claim knobs of this interface look in
|
||||
Documentation/ABI/obsolete/sysfs-class-rfkill.
|
||||
|
||||
What: /sys/class/rfkill
|
||||
Date: 09-Jul-2007
|
||||
KernelVersion: v2.6.22
|
||||
Contact: linux-wireless@vger.kernel.org,
|
||||
Description: The rfkill class subsystem folder.
|
||||
Each registered rfkill driver is represented by an rfkillX
|
||||
subfolder (X being an integer > 0).
|
||||
|
||||
|
||||
What: /sys/class/rfkill/rfkill[0-9]+/name
|
||||
Date: 09-Jul-2007
|
||||
KernelVersion v2.6.22
|
||||
Contact: linux-wireless@vger.kernel.org
|
||||
Description: Name assigned by driver to this key (interface or driver name).
|
||||
Values: arbitrary string.
|
||||
|
||||
|
||||
What: /sys/class/rfkill/rfkill[0-9]+/type
|
||||
Date: 09-Jul-2007
|
||||
KernelVersion v2.6.22
|
||||
Contact: linux-wireless@vger.kernel.org
|
||||
Description: Driver type string ("wlan", "bluetooth", etc).
|
||||
Values: See include/linux/rfkill.h.
|
||||
|
||||
|
||||
What: /sys/class/rfkill/rfkill[0-9]+/persistent
|
||||
Date: 09-Jul-2007
|
||||
KernelVersion v2.6.22
|
||||
Contact: linux-wireless@vger.kernel.org
|
||||
Description: Whether the soft blocked state is initialised from non-volatile
|
||||
storage at startup.
|
||||
Values: A numeric value.
|
||||
0: false
|
||||
1: true
|
||||
|
||||
|
||||
What: /sys/class/rfkill/rfkill[0-9]+/hard
|
||||
Date: 12-March-2010
|
||||
KernelVersion v2.6.34
|
||||
Contact: linux-wireless@vger.kernel.org
|
||||
Description: Current hardblock state. This file is read only.
|
||||
Values: A numeric value.
|
||||
0: inactive
|
||||
The transmitter is (potentially) active.
|
||||
1: active
|
||||
The transmitter is forced off by something outside of
|
||||
the driver's control.
|
||||
|
||||
|
||||
What: /sys/class/rfkill/rfkill[0-9]+/soft
|
||||
Date: 12-March-2010
|
||||
KernelVersion v2.6.34
|
||||
Contact: linux-wireless@vger.kernel.org
|
||||
Description: Current softblock state. This file is read and write.
|
||||
Values: A numeric value.
|
||||
0: inactive
|
||||
The transmitter is (potentially) active.
|
||||
1: active
|
||||
The transmitter is turned off by software.
|
@ -133,6 +133,46 @@ Description:
|
||||
The symbolic link points to the PCI device sysfs entry of the
|
||||
Physical Function this device associates with.
|
||||
|
||||
|
||||
What: /sys/bus/pci/slots/...
|
||||
Date: April 2005 (possibly older)
|
||||
KernelVersion: 2.6.12 (possibly older)
|
||||
Contact: linux-pci@vger.kernel.org
|
||||
Description:
|
||||
When the appropriate driver is loaded, it will create a
|
||||
directory per claimed physical PCI slot in
|
||||
/sys/bus/pci/slots/. The names of these directories are
|
||||
specific to the driver, which in turn, are specific to the
|
||||
platform, but in general, should match the label on the
|
||||
machine's physical chassis.
|
||||
|
||||
The drivers that can create slot directories include the
|
||||
PCI hotplug drivers, and as of 2.6.27, the pci_slot driver.
|
||||
|
||||
The slot directories contain, at a minimum, a file named
|
||||
'address' which contains the PCI bus:device:function tuple.
|
||||
Other files may appear as well, but are specific to the
|
||||
driver.
|
||||
|
||||
What: /sys/bus/pci/slots/.../function[0-7]
|
||||
Date: March 2010
|
||||
KernelVersion: 2.6.35
|
||||
Contact: linux-pci@vger.kernel.org
|
||||
Description:
|
||||
If PCI slot directories (as described above) are created,
|
||||
and the physical slot is actually populated with a device,
|
||||
symbolic links in the slot directory pointing to the
|
||||
device's PCI functions are created as well.
|
||||
|
||||
What: /sys/bus/pci/devices/.../slot
|
||||
Date: March 2010
|
||||
KernelVersion: 2.6.35
|
||||
Contact: linux-pci@vger.kernel.org
|
||||
Description:
|
||||
If PCI slot directories (as described above) are created,
|
||||
a symbolic link pointing to the slot directory will be
|
||||
created as well.
|
||||
|
||||
What: /sys/bus/pci/slots/.../module
|
||||
Date: June 2009
|
||||
Contact: linux-pci@vger.kernel.org
|
||||
|
@ -14,34 +14,6 @@ Description:
|
||||
The autosuspend delay for newly-created devices is set to
|
||||
the value of the usbcore.autosuspend module parameter.
|
||||
|
||||
What: /sys/bus/usb/devices/.../power/level
|
||||
Date: March 2007
|
||||
KernelVersion: 2.6.21
|
||||
Contact: Alan Stern <stern@rowland.harvard.edu>
|
||||
Description:
|
||||
Each USB device directory will contain a file named
|
||||
power/level. This file holds a power-level setting for
|
||||
the device, either "on" or "auto".
|
||||
|
||||
"on" means that the device is not allowed to autosuspend,
|
||||
although normal suspends for system sleep will still
|
||||
be honored. "auto" means the device will autosuspend
|
||||
and autoresume in the usual manner, according to the
|
||||
capabilities of its driver.
|
||||
|
||||
During normal use, devices should be left in the "auto"
|
||||
level. The "on" level is meant for administrative uses.
|
||||
If you want to suspend a device immediately but leave it
|
||||
free to wake up in response to I/O requests, you should
|
||||
write "0" to power/autosuspend.
|
||||
|
||||
Device not capable of proper suspend and resume should be
|
||||
left in the "on" level. Although the USB spec requires
|
||||
devices to support suspend/resume, many of them do not.
|
||||
In fact so many don't that by default, the USB core
|
||||
initializes all non-hub devices in the "on" level. Some
|
||||
drivers may change this setting when they are bound.
|
||||
|
||||
What: /sys/bus/usb/devices/.../power/persist
|
||||
Date: May 2007
|
||||
KernelVersion: 2.6.23
|
||||
|
20
Documentation/ABI/testing/sysfs-class-power
Normal file
20
Documentation/ABI/testing/sysfs-class-power
Normal file
@ -0,0 +1,20 @@
|
||||
What: /sys/class/power/ds2760-battery.*/charge_now
|
||||
Date: May 2010
|
||||
KernelVersion: 2.6.35
|
||||
Contact: Daniel Mack <daniel@caiaq.de>
|
||||
Description:
|
||||
This file is writeable and can be used to set the current
|
||||
coloumb counter value inside the battery monitor chip. This
|
||||
is needed for unavoidable corrections of aging batteries.
|
||||
A userspace daemon can monitor the battery charging logic
|
||||
and once the counter drops out of considerable bounds, take
|
||||
appropriate action.
|
||||
|
||||
What: /sys/class/power/ds2760-battery.*/charge_full
|
||||
Date: May 2010
|
||||
KernelVersion: 2.6.35
|
||||
Contact: Daniel Mack <daniel@caiaq.de>
|
||||
Description:
|
||||
This file is writeable and can be used to set the assumed
|
||||
battery 'full level'. As batteries age, this value has to be
|
||||
amended over time.
|
@ -43,7 +43,7 @@ Date: September 2008
|
||||
Contact: Badari Pulavarty <pbadari@us.ibm.com>
|
||||
Description:
|
||||
The file /sys/devices/system/memory/memoryX/state
|
||||
is read-write. When read, it's contents show the
|
||||
is read-write. When read, its contents show the
|
||||
online/offline state of the memory section. When written,
|
||||
root can toggle the the online/offline state of a removable
|
||||
memory section (see removable file description above)
|
||||
|
7
Documentation/ABI/testing/sysfs-devices-node
Normal file
7
Documentation/ABI/testing/sysfs-devices-node
Normal file
@ -0,0 +1,7 @@
|
||||
What: /sys/devices/system/node/nodeX/compact
|
||||
Date: February 2010
|
||||
Contact: Mel Gorman <mel@csn.ul.ie>
|
||||
Description:
|
||||
When this file is written to, all memory within that node
|
||||
will be compacted. When it completes, memory will be freed
|
||||
into blocks which have as many contiguous pages as possible
|
@ -0,0 +1,9 @@
|
||||
What: /sys/devices/platform/_UDC_/gadget/suspended
|
||||
Date: April 2010
|
||||
Contact: Fabien Chouteau <fabien.chouteau@barco.com>
|
||||
Description:
|
||||
Show the suspend state of an USB composite gadget.
|
||||
1 -> suspended
|
||||
0 -> resumed
|
||||
|
||||
(_UDC_ is the name of the USB Device Controller driver)
|
43
Documentation/ABI/testing/sysfs-driver-hid-picolcd
Normal file
43
Documentation/ABI/testing/sysfs-driver-hid-picolcd
Normal file
@ -0,0 +1,43 @@
|
||||
What: /sys/bus/usb/devices/<busnum>-<devnum>:<config num>.<interface num>/<hid-bus>:<vendor-id>:<product-id>.<num>/operation_mode
|
||||
Date: March 2010
|
||||
Contact: Bruno Prémont <bonbons@linux-vserver.org>
|
||||
Description: Make it possible to switch the PicoLCD device between LCD
|
||||
(firmware) and bootloader (flasher) operation modes.
|
||||
|
||||
Reading: returns list of available modes, the active mode being
|
||||
enclosed in brackets ('[' and ']')
|
||||
|
||||
Writing: causes operation mode switch. Permitted values are
|
||||
the non-active mode names listed when read.
|
||||
|
||||
Note: when switching mode the current PicoLCD HID device gets
|
||||
disconnected and reconnects after above delay (see attribute
|
||||
operation_mode_delay for its value).
|
||||
|
||||
|
||||
What: /sys/bus/usb/devices/<busnum>-<devnum>:<config num>.<interface num>/<hid-bus>:<vendor-id>:<product-id>.<num>/operation_mode_delay
|
||||
Date: April 2010
|
||||
Contact: Bruno Prémont <bonbons@linux-vserver.org>
|
||||
Description: Delay PicoLCD waits before restarting in new mode when
|
||||
operation_mode has changed.
|
||||
|
||||
Reading/Writing: It is expressed in ms and permitted range is
|
||||
0..30000ms.
|
||||
|
||||
|
||||
What: /sys/bus/usb/devices/<busnum>-<devnum>:<config num>.<interface num>/<hid-bus>:<vendor-id>:<product-id>.<num>/fb_update_rate
|
||||
Date: March 2010
|
||||
Contact: Bruno Prémont <bonbons@linux-vserver.org>
|
||||
Description: Make it possible to adjust defio refresh rate.
|
||||
|
||||
Reading: returns list of available refresh rates (expressed in Hz),
|
||||
the active refresh rate being enclosed in brackets ('[' and ']')
|
||||
|
||||
Writing: accepts new refresh rate expressed in integer Hz
|
||||
within permitted rates.
|
||||
|
||||
Note: As device can barely do 2 complete refreshes a second
|
||||
it only makes sense to adjust this value if only one or two
|
||||
tiles get changed and it's not appropriate to expect the application
|
||||
to flush it's tiny changes explicitely at higher than default rate.
|
||||
|
29
Documentation/ABI/testing/sysfs-driver-hid-prodikeys
Normal file
29
Documentation/ABI/testing/sysfs-driver-hid-prodikeys
Normal file
@ -0,0 +1,29 @@
|
||||
What: /sys/bus/hid/drivers/prodikeys/.../channel
|
||||
Date: April 2010
|
||||
KernelVersion: 2.6.34
|
||||
Contact: Don Prince <dhprince.devel@yahoo.co.uk>
|
||||
Description:
|
||||
Allows control (via software) the midi channel to which
|
||||
that the pc-midi keyboard will output.midi data.
|
||||
Range: 0..15
|
||||
Type: Read/write
|
||||
What: /sys/bus/hid/drivers/prodikeys/.../sustain
|
||||
Date: April 2010
|
||||
KernelVersion: 2.6.34
|
||||
Contact: Don Prince <dhprince.devel@yahoo.co.uk>
|
||||
Description:
|
||||
Allows control (via software) the sustain duration of a
|
||||
note held by the pc-midi driver.
|
||||
0 means sustain mode is disabled.
|
||||
Range: 0..5000 (milliseconds)
|
||||
Type: Read/write
|
||||
What: /sys/bus/hid/drivers/prodikeys/.../octave
|
||||
Date: April 2010
|
||||
KernelVersion: 2.6.34
|
||||
Contact: Don Prince <dhprince.devel@yahoo.co.uk>
|
||||
Description:
|
||||
Controls the octave shift modifier in the pc-midi driver.
|
||||
The octave can be shifted via software up/down 2 octaves.
|
||||
0 means the no ocatve shift.
|
||||
Range: -2..2 (minus 2 to plus 2)
|
||||
Type: Read/Write
|
111
Documentation/ABI/testing/sysfs-driver-hid-roccat-kone
Normal file
111
Documentation/ABI/testing/sysfs-driver-hid-roccat-kone
Normal file
@ -0,0 +1,111 @@
|
||||
What: /sys/bus/usb/devices/<busnum>-<devnum>:<config num>.<interface num>/actual_dpi
|
||||
Date: March 2010
|
||||
Contact: Stefan Achatz <erazor_de@users.sourceforge.net>
|
||||
Description: It is possible to switch the dpi setting of the mouse with the
|
||||
press of a button.
|
||||
When read, this file returns the raw number of the actual dpi
|
||||
setting reported by the mouse. This number has to be further
|
||||
processed to receive the real dpi value.
|
||||
|
||||
VALUE DPI
|
||||
1 800
|
||||
2 1200
|
||||
3 1600
|
||||
4 2000
|
||||
5 2400
|
||||
6 3200
|
||||
|
||||
This file is readonly.
|
||||
|
||||
What: /sys/bus/usb/devices/<busnum>-<devnum>:<config num>.<interface num>/actual_profile
|
||||
Date: March 2010
|
||||
Contact: Stefan Achatz <erazor_de@users.sourceforge.net>
|
||||
Description: When read, this file returns the number of the actual profile.
|
||||
This file is readonly.
|
||||
|
||||
What: /sys/bus/usb/devices/<busnum>-<devnum>:<config num>.<interface num>/firmware_version
|
||||
Date: March 2010
|
||||
Contact: Stefan Achatz <erazor_de@users.sourceforge.net>
|
||||
Description: When read, this file returns the raw integer version number of the
|
||||
firmware reported by the mouse. Using the integer value eases
|
||||
further usage in other programs. To receive the real version
|
||||
number the decimal point has to be shifted 2 positions to the
|
||||
left. E.g. a returned value of 138 means 1.38
|
||||
This file is readonly.
|
||||
|
||||
What: /sys/bus/usb/devices/<busnum>-<devnum>:<config num>.<interface num>/kone_driver_version
|
||||
Date: March 2010
|
||||
Contact: Stefan Achatz <erazor_de@users.sourceforge.net>
|
||||
Description: When read, this file returns the driver version.
|
||||
The format of the string is "v<major>.<minor>.<patchlevel>".
|
||||
This attribute is used by the userland tools to find the sysfs-
|
||||
paths of installed kone-mice and determine the capabilites of
|
||||
the driver. Versions of this driver for old kernels replace
|
||||
usbhid instead of generic-usb. The way to scan for this file
|
||||
has been chosen to provide a consistent way for all supported
|
||||
kernel versions.
|
||||
This file is readonly.
|
||||
|
||||
What: /sys/bus/usb/devices/<busnum>-<devnum>:<config num>.<interface num>/profile[1-5]
|
||||
Date: March 2010
|
||||
Contact: Stefan Achatz <erazor_de@users.sourceforge.net>
|
||||
Description: The mouse can store 5 profiles which can be switched by the
|
||||
press of a button. A profile holds informations like button
|
||||
mappings, sensitivity, the colors of the 5 leds and light
|
||||
effects.
|
||||
When read, these files return the respective profile. The
|
||||
returned data is 975 bytes in size.
|
||||
When written, this file lets one write the respective profile
|
||||
data back to the mouse. The data has to be 975 bytes long.
|
||||
The mouse will reject invalid data, whereas the profile number
|
||||
stored in the profile doesn't need to fit the number of the
|
||||
store.
|
||||
|
||||
What: /sys/bus/usb/devices/<busnum>-<devnum>:<config num>.<interface num>/settings
|
||||
Date: March 2010
|
||||
Contact: Stefan Achatz <erazor_de@users.sourceforge.net>
|
||||
Description: When read, this file returns the settings stored in the mouse.
|
||||
The size of the data is 36 bytes and holds information like the
|
||||
startup_profile, tcu state and calibration_data.
|
||||
When written, this file lets write settings back to the mouse.
|
||||
The data has to be 36 bytes long. The mouse will reject invalid
|
||||
data.
|
||||
|
||||
What: /sys/bus/usb/devices/<busnum>-<devnum>:<config num>.<interface num>/startup_profile
|
||||
Date: March 2010
|
||||
Contact: Stefan Achatz <erazor_de@users.sourceforge.net>
|
||||
Description: The integer value of this attribute ranges from 1 to 5.
|
||||
When read, this attribute returns the number of the profile
|
||||
that's active when the mouse is powered on.
|
||||
When written, this file sets the number of the startup profile
|
||||
and the mouse activates this profile immediately.
|
||||
|
||||
What: /sys/bus/usb/devices/<busnum>-<devnum>:<config num>.<interface num>/tcu
|
||||
Date: March 2010
|
||||
Contact: Stefan Achatz <erazor_de@users.sourceforge.net>
|
||||
Description: The mouse has a "Tracking Control Unit" which lets the user
|
||||
calibrate the laser power to fit the mousepad surface.
|
||||
When read, this file returns the current state of the TCU,
|
||||
where 0 means off and 1 means on.
|
||||
Writing 0 in this file will switch the TCU off.
|
||||
Writing 1 in this file will start the calibration which takes
|
||||
around 6 seconds to complete and activates the TCU.
|
||||
|
||||
What: /sys/bus/usb/devices/<busnum>-<devnum>:<config num>.<interface num>/weight
|
||||
Date: March 2010
|
||||
Contact: Stefan Achatz <erazor_de@users.sourceforge.net>
|
||||
Description: The mouse can be equipped with one of four supplied weights
|
||||
ranging from 5 to 20 grams which are recognized by the mouse
|
||||
and its value can be read out. When read, this file returns the
|
||||
raw value returned by the mouse which eases further processing
|
||||
in other software.
|
||||
The values map to the weights as follows:
|
||||
|
||||
VALUE WEIGHT
|
||||
0 none
|
||||
1 5g
|
||||
2 10g
|
||||
3 15g
|
||||
4 20g
|
||||
|
||||
This file is readonly.
|
15
Documentation/ABI/testing/sysfs-firmware-sfi
Normal file
15
Documentation/ABI/testing/sysfs-firmware-sfi
Normal file
@ -0,0 +1,15 @@
|
||||
What: /sys/firmware/sfi/tables/
|
||||
Date: May 2010
|
||||
Contact: Len Brown <lenb@kernel.org>
|
||||
Description:
|
||||
SFI defines a number of small static memory tables
|
||||
so the kernel can get platform information from firmware.
|
||||
|
||||
The tables are defined in the latest SFI specification:
|
||||
http://simplefirmware.org/documentation
|
||||
|
||||
While the tables are used by the kernel, user-space
|
||||
can observe them this way:
|
||||
|
||||
# cd /sys/firmware/sfi/tables
|
||||
# cat $TABLENAME > $TABLENAME.bin
|
10
Documentation/ABI/testing/sysfs-wacom
Normal file
10
Documentation/ABI/testing/sysfs-wacom
Normal file
@ -0,0 +1,10 @@
|
||||
What: /sys/class/hidraw/hidraw*/device/speed
|
||||
Date: April 2010
|
||||
Kernel Version: 2.6.35
|
||||
Contact: linux-bluetooth@vger.kernel.org
|
||||
Description:
|
||||
The /sys/class/hidraw/hidraw*/device/speed file controls
|
||||
reporting speed of wacom bluetooth tablet. Reading from
|
||||
this file returns 1 if tablet reports in high speed mode
|
||||
or 0 otherwise. Writing to this file one of these values
|
||||
switches reporting speed.
|
@ -49,7 +49,7 @@ o oprofile 0.9 # oprofiled --version
|
||||
o udev 081 # udevinfo -V
|
||||
o grub 0.93 # grub --version
|
||||
o mcelog 0.6
|
||||
o iptables 1.4.1 # iptables -V
|
||||
o iptables 1.4.2 # iptables -V
|
||||
|
||||
|
||||
Kernel compilation
|
||||
|
@ -639,6 +639,36 @@ is planned to completely remove virt_to_bus() and bus_to_virt() as
|
||||
they are entirely deprecated. Some ports already do not provide these
|
||||
as it is impossible to correctly support them.
|
||||
|
||||
Handling Errors
|
||||
|
||||
DMA address space is limited on some architectures and an allocation
|
||||
failure can be determined by:
|
||||
|
||||
- checking if dma_alloc_coherent returns NULL or dma_map_sg returns 0
|
||||
|
||||
- checking the returned dma_addr_t of dma_map_single and dma_map_page
|
||||
by using dma_mapping_error():
|
||||
|
||||
dma_addr_t dma_handle;
|
||||
|
||||
dma_handle = dma_map_single(dev, addr, size, direction);
|
||||
if (dma_mapping_error(dev, dma_handle)) {
|
||||
/*
|
||||
* reduce current DMA mapping usage,
|
||||
* delay and try again later or
|
||||
* reset driver.
|
||||
*/
|
||||
}
|
||||
|
||||
Networking drivers must call dev_kfree_skb to free the socket buffer
|
||||
and return NETDEV_TX_OK if the DMA mapping fails on the transmit hook
|
||||
(ndo_start_xmit). This means that the socket buffer is just dropped in
|
||||
the failure case.
|
||||
|
||||
SCSI drivers must return SCSI_MLQUEUE_HOST_BUSY if the DMA mapping
|
||||
fails in the queuecommand hook. This means that the SCSI subsystem
|
||||
passes the command to the driver again later.
|
||||
|
||||
Optimizing Unmap State Space Consumption
|
||||
|
||||
On many platforms, dma_unmap_{single,page}() is simply a nop.
|
||||
@ -703,46 +733,29 @@ to "Closing".
|
||||
|
||||
1) Struct scatterlist requirements.
|
||||
|
||||
Struct scatterlist must contain, at a minimum, the following
|
||||
members:
|
||||
Don't invent the architecture specific struct scatterlist; just use
|
||||
<asm-generic/scatterlist.h>. You need to enable
|
||||
CONFIG_NEED_SG_DMA_LENGTH if the architecture supports IOMMUs
|
||||
(including software IOMMU).
|
||||
|
||||
struct page *page;
|
||||
unsigned int offset;
|
||||
unsigned int length;
|
||||
2) ARCH_KMALLOC_MINALIGN
|
||||
|
||||
The base address is specified by a "page+offset" pair.
|
||||
Architectures must ensure that kmalloc'ed buffer is
|
||||
DMA-safe. Drivers and subsystems depend on it. If an architecture
|
||||
isn't fully DMA-coherent (i.e. hardware doesn't ensure that data in
|
||||
the CPU cache is identical to data in main memory),
|
||||
ARCH_KMALLOC_MINALIGN must be set so that the memory allocator
|
||||
makes sure that kmalloc'ed buffer doesn't share a cache line with
|
||||
the others. See arch/arm/include/asm/cache.h as an example.
|
||||
|
||||
Previous versions of struct scatterlist contained a "void *address"
|
||||
field that was sometimes used instead of page+offset. As of Linux
|
||||
2.5., page+offset is always used, and the "address" field has been
|
||||
deleted.
|
||||
|
||||
2) More to come...
|
||||
|
||||
Handling Errors
|
||||
|
||||
DMA address space is limited on some architectures and an allocation
|
||||
failure can be determined by:
|
||||
|
||||
- checking if dma_alloc_coherent returns NULL or dma_map_sg returns 0
|
||||
|
||||
- checking the returned dma_addr_t of dma_map_single and dma_map_page
|
||||
by using dma_mapping_error():
|
||||
|
||||
dma_addr_t dma_handle;
|
||||
|
||||
dma_handle = dma_map_single(dev, addr, size, direction);
|
||||
if (dma_mapping_error(dev, dma_handle)) {
|
||||
/*
|
||||
* reduce current DMA mapping usage,
|
||||
* delay and try again later or
|
||||
* reset driver.
|
||||
*/
|
||||
}
|
||||
Note that ARCH_KMALLOC_MINALIGN is about DMA memory alignment
|
||||
constraints. You don't need to worry about the architecture data
|
||||
alignment constraints (e.g. the alignment constraints about 64-bit
|
||||
objects).
|
||||
|
||||
Closing
|
||||
|
||||
This document, and the API itself, would not be in it's current
|
||||
This document, and the API itself, would not be in its current
|
||||
form without the feedback and suggestions from numerous individuals.
|
||||
We would like to specifically mention, in no particular order, the
|
||||
following people:
|
||||
|
@ -14,7 +14,7 @@ DOCBOOKS := z8530book.xml mcabook.xml device-drivers.xml \
|
||||
genericirq.xml s390-drivers.xml uio-howto.xml scsi.xml \
|
||||
mac80211.xml debugobjects.xml sh.xml regulator.xml \
|
||||
alsa-driver-api.xml writing-an-alsa-driver.xml \
|
||||
tracepoint.xml media.xml
|
||||
tracepoint.xml media.xml drm.xml
|
||||
|
||||
###
|
||||
# The build process is as follows (targets):
|
||||
|
839
Documentation/DocBook/drm.tmpl
Normal file
839
Documentation/DocBook/drm.tmpl
Normal file
@ -0,0 +1,839 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
|
||||
"http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []>
|
||||
|
||||
<book id="drmDevelopersGuide">
|
||||
<bookinfo>
|
||||
<title>Linux DRM Developer's Guide</title>
|
||||
|
||||
<copyright>
|
||||
<year>2008-2009</year>
|
||||
<holder>
|
||||
Intel Corporation (Jesse Barnes <jesse.barnes@intel.com>)
|
||||
</holder>
|
||||
</copyright>
|
||||
|
||||
<legalnotice>
|
||||
<para>
|
||||
The contents of this file may be used under the terms of the GNU
|
||||
General Public License version 2 (the "GPL") as distributed in
|
||||
the kernel source COPYING file.
|
||||
</para>
|
||||
</legalnotice>
|
||||
</bookinfo>
|
||||
|
||||
<toc></toc>
|
||||
|
||||
<!-- Introduction -->
|
||||
|
||||
<chapter id="drmIntroduction">
|
||||
<title>Introduction</title>
|
||||
<para>
|
||||
The Linux DRM layer contains code intended to support the needs
|
||||
of complex graphics devices, usually containing programmable
|
||||
pipelines well suited to 3D graphics acceleration. Graphics
|
||||
drivers in the kernel can make use of DRM functions to make
|
||||
tasks like memory management, interrupt handling and DMA easier,
|
||||
and provide a uniform interface to applications.
|
||||
</para>
|
||||
<para>
|
||||
A note on versions: this guide covers features found in the DRM
|
||||
tree, including the TTM memory manager, output configuration and
|
||||
mode setting, and the new vblank internals, in addition to all
|
||||
the regular features found in current kernels.
|
||||
</para>
|
||||
<para>
|
||||
[Insert diagram of typical DRM stack here]
|
||||
</para>
|
||||
</chapter>
|
||||
|
||||
<!-- Internals -->
|
||||
|
||||
<chapter id="drmInternals">
|
||||
<title>DRM Internals</title>
|
||||
<para>
|
||||
This chapter documents DRM internals relevant to driver authors
|
||||
and developers working to add support for the latest features to
|
||||
existing drivers.
|
||||
</para>
|
||||
<para>
|
||||
First, we'll go over some typical driver initialization
|
||||
requirements, like setting up command buffers, creating an
|
||||
initial output configuration, and initializing core services.
|
||||
Subsequent sections will cover core internals in more detail,
|
||||
providing implementation notes and examples.
|
||||
</para>
|
||||
<para>
|
||||
The DRM layer provides several services to graphics drivers,
|
||||
many of them driven by the application interfaces it provides
|
||||
through libdrm, the library that wraps most of the DRM ioctls.
|
||||
These include vblank event handling, memory
|
||||
management, output management, framebuffer management, command
|
||||
submission & fencing, suspend/resume support, and DMA
|
||||
services.
|
||||
</para>
|
||||
<para>
|
||||
The core of every DRM driver is struct drm_device. Drivers
|
||||
will typically statically initialize a drm_device structure,
|
||||
then pass it to drm_init() at load time.
|
||||
</para>
|
||||
|
||||
<!-- Internals: driver init -->
|
||||
|
||||
<sect1>
|
||||
<title>Driver initialization</title>
|
||||
<para>
|
||||
Before calling the DRM initialization routines, the driver must
|
||||
first create and fill out a struct drm_device structure.
|
||||
</para>
|
||||
<programlisting>
|
||||
static struct drm_driver driver = {
|
||||
/* don't use mtrr's here, the Xserver or user space app should
|
||||
* deal with them for intel hardware.
|
||||
*/
|
||||
.driver_features =
|
||||
DRIVER_USE_AGP | DRIVER_REQUIRE_AGP |
|
||||
DRIVER_HAVE_IRQ | DRIVER_IRQ_SHARED | DRIVER_MODESET,
|
||||
.load = i915_driver_load,
|
||||
.unload = i915_driver_unload,
|
||||
.firstopen = i915_driver_firstopen,
|
||||
.lastclose = i915_driver_lastclose,
|
||||
.preclose = i915_driver_preclose,
|
||||
.save = i915_save,
|
||||
.restore = i915_restore,
|
||||
.device_is_agp = i915_driver_device_is_agp,
|
||||
.get_vblank_counter = i915_get_vblank_counter,
|
||||
.enable_vblank = i915_enable_vblank,
|
||||
.disable_vblank = i915_disable_vblank,
|
||||
.irq_preinstall = i915_driver_irq_preinstall,
|
||||
.irq_postinstall = i915_driver_irq_postinstall,
|
||||
.irq_uninstall = i915_driver_irq_uninstall,
|
||||
.irq_handler = i915_driver_irq_handler,
|
||||
.reclaim_buffers = drm_core_reclaim_buffers,
|
||||
.get_map_ofs = drm_core_get_map_ofs,
|
||||
.get_reg_ofs = drm_core_get_reg_ofs,
|
||||
.fb_probe = intelfb_probe,
|
||||
.fb_remove = intelfb_remove,
|
||||
.fb_resize = intelfb_resize,
|
||||
.master_create = i915_master_create,
|
||||
.master_destroy = i915_master_destroy,
|
||||
#if defined(CONFIG_DEBUG_FS)
|
||||
.debugfs_init = i915_debugfs_init,
|
||||
.debugfs_cleanup = i915_debugfs_cleanup,
|
||||
#endif
|
||||
.gem_init_object = i915_gem_init_object,
|
||||
.gem_free_object = i915_gem_free_object,
|
||||
.gem_vm_ops = &i915_gem_vm_ops,
|
||||
.ioctls = i915_ioctls,
|
||||
.fops = {
|
||||
.owner = THIS_MODULE,
|
||||
.open = drm_open,
|
||||
.release = drm_release,
|
||||
.ioctl = drm_ioctl,
|
||||
.mmap = drm_mmap,
|
||||
.poll = drm_poll,
|
||||
.fasync = drm_fasync,
|
||||
#ifdef CONFIG_COMPAT
|
||||
.compat_ioctl = i915_compat_ioctl,
|
||||
#endif
|
||||
},
|
||||
.pci_driver = {
|
||||
.name = DRIVER_NAME,
|
||||
.id_table = pciidlist,
|
||||
.probe = probe,
|
||||
.remove = __devexit_p(drm_cleanup_pci),
|
||||
},
|
||||
.name = DRIVER_NAME,
|
||||
.desc = DRIVER_DESC,
|
||||
.date = DRIVER_DATE,
|
||||
.major = DRIVER_MAJOR,
|
||||
.minor = DRIVER_MINOR,
|
||||
.patchlevel = DRIVER_PATCHLEVEL,
|
||||
};
|
||||
</programlisting>
|
||||
<para>
|
||||
In the example above, taken from the i915 DRM driver, the driver
|
||||
sets several flags indicating what core features it supports.
|
||||
We'll go over the individual callbacks in later sections. Since
|
||||
flags indicate which features your driver supports to the DRM
|
||||
core, you need to set most of them prior to calling drm_init(). Some,
|
||||
like DRIVER_MODESET can be set later based on user supplied parameters,
|
||||
but that's the exception rather than the rule.
|
||||
</para>
|
||||
<variablelist>
|
||||
<title>Driver flags</title>
|
||||
<varlistentry>
|
||||
<term>DRIVER_USE_AGP</term>
|
||||
<listitem><para>
|
||||
Driver uses AGP interface
|
||||
</para></listitem>
|
||||
</varlistentry>
|
||||
<varlistentry>
|
||||
<term>DRIVER_REQUIRE_AGP</term>
|
||||
<listitem><para>
|
||||
Driver needs AGP interface to function.
|
||||
</para></listitem>
|
||||
</varlistentry>
|
||||
<varlistentry>
|
||||
<term>DRIVER_USE_MTRR</term>
|
||||
<listitem>
|
||||
<para>
|
||||
Driver uses MTRR interface for mapping memory. Deprecated.
|
||||
</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
<varlistentry>
|
||||
<term>DRIVER_PCI_DMA</term>
|
||||
<listitem><para>
|
||||
Driver is capable of PCI DMA. Deprecated.
|
||||
</para></listitem>
|
||||
</varlistentry>
|
||||
<varlistentry>
|
||||
<term>DRIVER_SG</term>
|
||||
<listitem><para>
|
||||
Driver can perform scatter/gather DMA. Deprecated.
|
||||
</para></listitem>
|
||||
</varlistentry>
|
||||
<varlistentry>
|
||||
<term>DRIVER_HAVE_DMA</term>
|
||||
<listitem><para>Driver supports DMA. Deprecated.</para></listitem>
|
||||
</varlistentry>
|
||||
<varlistentry>
|
||||
<term>DRIVER_HAVE_IRQ</term><term>DRIVER_IRQ_SHARED</term>
|
||||
<listitem>
|
||||
<para>
|
||||
DRIVER_HAVE_IRQ indicates whether the driver has a IRQ
|
||||
handler, DRIVER_IRQ_SHARED indicates whether the device &
|
||||
handler support shared IRQs (note that this is required of
|
||||
PCI drivers).
|
||||
</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
<varlistentry>
|
||||
<term>DRIVER_DMA_QUEUE</term>
|
||||
<listitem>
|
||||
<para>
|
||||
If the driver queues DMA requests and completes them
|
||||
asynchronously, this flag should be set. Deprecated.
|
||||
</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
<varlistentry>
|
||||
<term>DRIVER_FB_DMA</term>
|
||||
<listitem>
|
||||
<para>
|
||||
Driver supports DMA to/from the framebuffer. Deprecated.
|
||||
</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
<varlistentry>
|
||||
<term>DRIVER_MODESET</term>
|
||||
<listitem>
|
||||
<para>
|
||||
Driver supports mode setting interfaces.
|
||||
</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
</variablelist>
|
||||
<para>
|
||||
In this specific case, the driver requires AGP and supports
|
||||
IRQs. DMA, as we'll see, is handled by device specific ioctls
|
||||
in this case. It also supports the kernel mode setting APIs, though
|
||||
unlike in the actual i915 driver source, this example unconditionally
|
||||
exports KMS capability.
|
||||
</para>
|
||||
</sect1>
|
||||
|
||||
<!-- Internals: driver load -->
|
||||
|
||||
<sect1>
|
||||
<title>Driver load</title>
|
||||
<para>
|
||||
In the previous section, we saw what a typical drm_driver
|
||||
structure might look like. One of the more important fields in
|
||||
the structure is the hook for the load function.
|
||||
</para>
|
||||
<programlisting>
|
||||
static struct drm_driver driver = {
|
||||
...
|
||||
.load = i915_driver_load,
|
||||
...
|
||||
};
|
||||
</programlisting>
|
||||
<para>
|
||||
The load function has many responsibilities: allocating a driver
|
||||
private structure, specifying supported performance counters,
|
||||
configuring the device (e.g. mapping registers & command
|
||||
buffers), initializing the memory manager, and setting up the
|
||||
initial output configuration.
|
||||
</para>
|
||||
<para>
|
||||
Note that the tasks performed at driver load time must not
|
||||
conflict with DRM client requirements. For instance, if user
|
||||
level mode setting drivers are in use, it would be problematic
|
||||
to perform output discovery & configuration at load time.
|
||||
Likewise, if pre-memory management aware user level drivers are
|
||||
in use, memory management and command buffer setup may need to
|
||||
be omitted. These requirements are driver specific, and care
|
||||
needs to be taken to keep both old and new applications and
|
||||
libraries working. The i915 driver supports the "modeset"
|
||||
module parameter to control whether advanced features are
|
||||
enabled at load time or in legacy fashion. If compatibility is
|
||||
a concern (e.g. with drivers converted over to the new interfaces
|
||||
from the old ones), care must be taken to prevent incompatible
|
||||
device initialization and control with the currently active
|
||||
userspace drivers.
|
||||
</para>
|
||||
|
||||
<sect2>
|
||||
<title>Driver private & performance counters</title>
|
||||
<para>
|
||||
The driver private hangs off the main drm_device structure and
|
||||
can be used for tracking various device specific bits of
|
||||
information, like register offsets, command buffer status,
|
||||
register state for suspend/resume, etc. At load time, a
|
||||
driver can simply allocate one and set drm_device.dev_priv
|
||||
appropriately; at unload the driver can free it and set
|
||||
drm_device.dev_priv to NULL.
|
||||
</para>
|
||||
<para>
|
||||
The DRM supports several counters which can be used for rough
|
||||
performance characterization. Note that the DRM stat counter
|
||||
system is not often used by applications, and supporting
|
||||
additional counters is completely optional.
|
||||
</para>
|
||||
<para>
|
||||
These interfaces are deprecated and should not be used. If performance
|
||||
monitoring is desired, the developer should investigate and
|
||||
potentially enhance the kernel perf and tracing infrastructure to export
|
||||
GPU related performance information to performance monitoring
|
||||
tools and applications.
|
||||
</para>
|
||||
</sect2>
|
||||
|
||||
<sect2>
|
||||
<title>Configuring the device</title>
|
||||
<para>
|
||||
Obviously, device configuration will be device specific.
|
||||
However, there are several common operations: finding a
|
||||
device's PCI resources, mapping them, and potentially setting
|
||||
up an IRQ handler.
|
||||
</para>
|
||||
<para>
|
||||
Finding & mapping resources is fairly straightforward. The
|
||||
DRM wrapper functions, drm_get_resource_start() and
|
||||
drm_get_resource_len() can be used to find BARs on the given
|
||||
drm_device struct. Once those values have been retrieved, the
|
||||
driver load function can call drm_addmap() to create a new
|
||||
mapping for the BAR in question. Note you'll probably want a
|
||||
drm_local_map_t in your driver private structure to track any
|
||||
mappings you create.
|
||||
<!-- !Fdrivers/gpu/drm/drm_bufs.c drm_get_resource_* -->
|
||||
<!-- !Finclude/drm/drmP.h drm_local_map_t -->
|
||||
</para>
|
||||
<para>
|
||||
if compatibility with other operating systems isn't a concern
|
||||
(DRM drivers can run under various BSD variants and OpenSolaris),
|
||||
native Linux calls can be used for the above, e.g. pci_resource_*
|
||||
and iomap*/iounmap. See the Linux device driver book for more
|
||||
info.
|
||||
</para>
|
||||
<para>
|
||||
Once you have a register map, you can use the DRM_READn() and
|
||||
DRM_WRITEn() macros to access the registers on your device, or
|
||||
use driver specific versions to offset into your MMIO space
|
||||
relative to a driver specific base pointer (see I915_READ for
|
||||
example).
|
||||
</para>
|
||||
<para>
|
||||
If your device supports interrupt generation, you may want to
|
||||
setup an interrupt handler at driver load time as well. This
|
||||
is done using the drm_irq_install() function. If your device
|
||||
supports vertical blank interrupts, it should call
|
||||
drm_vblank_init() to initialize the core vblank handling code before
|
||||
enabling interrupts on your device. This ensures the vblank related
|
||||
structures are allocated and allows the core to handle vblank events.
|
||||
</para>
|
||||
<!--!Fdrivers/char/drm/drm_irq.c drm_irq_install-->
|
||||
<para>
|
||||
Once your interrupt handler is registered (it'll use your
|
||||
drm_driver.irq_handler as the actual interrupt handling
|
||||
function), you can safely enable interrupts on your device,
|
||||
assuming any other state your interrupt handler uses is also
|
||||
initialized.
|
||||
</para>
|
||||
<para>
|
||||
Another task that may be necessary during configuration is
|
||||
mapping the video BIOS. On many devices, the VBIOS describes
|
||||
device configuration, LCD panel timings (if any), and contains
|
||||
flags indicating device state. Mapping the BIOS can be done
|
||||
using the pci_map_rom() call, a convenience function that
|
||||
takes care of mapping the actual ROM, whether it has been
|
||||
shadowed into memory (typically at address 0xc0000) or exists
|
||||
on the PCI device in the ROM BAR. Note that once you've
|
||||
mapped the ROM and extracted any necessary information, be
|
||||
sure to unmap it; on many devices the ROM address decoder is
|
||||
shared with other BARs, so leaving it mapped can cause
|
||||
undesired behavior like hangs or memory corruption.
|
||||
<!--!Fdrivers/pci/rom.c pci_map_rom-->
|
||||
</para>
|
||||
</sect2>
|
||||
|
||||
<sect2>
|
||||
<title>Memory manager initialization</title>
|
||||
<para>
|
||||
In order to allocate command buffers, cursor memory, scanout
|
||||
buffers, etc., as well as support the latest features provided
|
||||
by packages like Mesa and the X.Org X server, your driver
|
||||
should support a memory manager.
|
||||
</para>
|
||||
<para>
|
||||
If your driver supports memory management (it should!), you'll
|
||||
need to set that up at load time as well. How you intialize
|
||||
it depends on which memory manager you're using, TTM or GEM.
|
||||
</para>
|
||||
<sect3>
|
||||
<title>TTM initialization</title>
|
||||
<para>
|
||||
TTM (for Translation Table Manager) manages video memory and
|
||||
aperture space for graphics devices. TTM supports both UMA devices
|
||||
and devices with dedicated video RAM (VRAM), i.e. most discrete
|
||||
graphics devices. If your device has dedicated RAM, supporting
|
||||
TTM is desireable. TTM also integrates tightly with your
|
||||
driver specific buffer execution function. See the radeon
|
||||
driver for examples.
|
||||
</para>
|
||||
<para>
|
||||
The core TTM structure is the ttm_bo_driver struct. It contains
|
||||
several fields with function pointers for initializing the TTM,
|
||||
allocating and freeing memory, waiting for command completion
|
||||
and fence synchronization, and memory migration. See the
|
||||
radeon_ttm.c file for an example of usage.
|
||||
</para>
|
||||
<para>
|
||||
The ttm_global_reference structure is made up of several fields:
|
||||
</para>
|
||||
<programlisting>
|
||||
struct ttm_global_reference {
|
||||
enum ttm_global_types global_type;
|
||||
size_t size;
|
||||
void *object;
|
||||
int (*init) (struct ttm_global_reference *);
|
||||
void (*release) (struct ttm_global_reference *);
|
||||
};
|
||||
</programlisting>
|
||||
<para>
|
||||
There should be one global reference structure for your memory
|
||||
manager as a whole, and there will be others for each object
|
||||
created by the memory manager at runtime. Your global TTM should
|
||||
have a type of TTM_GLOBAL_TTM_MEM. The size field for the global
|
||||
object should be sizeof(struct ttm_mem_global), and the init and
|
||||
release hooks should point at your driver specific init and
|
||||
release routines, which will probably eventually call
|
||||
ttm_mem_global_init and ttm_mem_global_release respectively.
|
||||
</para>
|
||||
<para>
|
||||
Once your global TTM accounting structure is set up and initialized
|
||||
(done by calling ttm_global_item_ref on the global object you
|
||||
just created), you'll need to create a buffer object TTM to
|
||||
provide a pool for buffer object allocation by clients and the
|
||||
kernel itself. The type of this object should be TTM_GLOBAL_TTM_BO,
|
||||
and its size should be sizeof(struct ttm_bo_global). Again,
|
||||
driver specific init and release functions can be provided,
|
||||
likely eventually calling ttm_bo_global_init and
|
||||
ttm_bo_global_release, respectively. Also like the previous
|
||||
object, ttm_global_item_ref is used to create an initial reference
|
||||
count for the TTM, which will call your initalization function.
|
||||
</para>
|
||||
</sect3>
|
||||
<sect3>
|
||||
<title>GEM initialization</title>
|
||||
<para>
|
||||
GEM is an alternative to TTM, designed specifically for UMA
|
||||
devices. It has simpler initialization and execution requirements
|
||||
than TTM, but has no VRAM management capability. Core GEM
|
||||
initialization is comprised of a basic drm_mm_init call to create
|
||||
a GTT DRM MM object, which provides an address space pool for
|
||||
object allocation. In a KMS configuration, the driver will
|
||||
need to allocate and initialize a command ring buffer following
|
||||
basic GEM initialization. Most UMA devices have a so-called
|
||||
"stolen" memory region, which provides space for the initial
|
||||
framebuffer and large, contiguous memory regions required by the
|
||||
device. This space is not typically managed by GEM, and must
|
||||
be initialized separately into its own DRM MM object.
|
||||
</para>
|
||||
<para>
|
||||
Initialization will be driver specific, and will depend on
|
||||
the architecture of the device. In the case of Intel
|
||||
integrated graphics chips like 965GM, GEM initialization can
|
||||
be done by calling the internal GEM init function,
|
||||
i915_gem_do_init(). Since the 965GM is a UMA device
|
||||
(i.e. it doesn't have dedicated VRAM), GEM will manage
|
||||
making regular RAM available for GPU operations. Memory set
|
||||
aside by the BIOS (called "stolen" memory by the i915
|
||||
driver) will be managed by the DRM memrange allocator; the
|
||||
rest of the aperture will be managed by GEM.
|
||||
<programlisting>
|
||||
/* Basic memrange allocator for stolen space (aka vram) */
|
||||
drm_memrange_init(&dev_priv->vram, 0, prealloc_size);
|
||||
/* Let GEM Manage from end of prealloc space to end of aperture */
|
||||
i915_gem_do_init(dev, prealloc_size, agp_size);
|
||||
</programlisting>
|
||||
<!--!Edrivers/char/drm/drm_memrange.c-->
|
||||
</para>
|
||||
<para>
|
||||
Once the memory manager has been set up, we can allocate the
|
||||
command buffer. In the i915 case, this is also done with a
|
||||
GEM function, i915_gem_init_ringbuffer().
|
||||
</para>
|
||||
</sect3>
|
||||
</sect2>
|
||||
|
||||
<sect2>
|
||||
<title>Output configuration</title>
|
||||
<para>
|
||||
The final initialization task is output configuration. This involves
|
||||
finding and initializing the CRTCs, encoders and connectors
|
||||
for your device, creating an initial configuration and
|
||||
registering a framebuffer console driver.
|
||||
</para>
|
||||
<sect3>
|
||||
<title>Output discovery and initialization</title>
|
||||
<para>
|
||||
Several core functions exist to create CRTCs, encoders and
|
||||
connectors, namely drm_crtc_init(), drm_connector_init() and
|
||||
drm_encoder_init(), along with several "helper" functions to
|
||||
perform common tasks.
|
||||
</para>
|
||||
<para>
|
||||
Connectors should be registered with sysfs once they've been
|
||||
detected and initialized, using the
|
||||
drm_sysfs_connector_add() function. Likewise, when they're
|
||||
removed from the system, they should be destroyed with
|
||||
drm_sysfs_connector_remove().
|
||||
</para>
|
||||
<programlisting>
|
||||
<![CDATA[
|
||||
void intel_crt_init(struct drm_device *dev)
|
||||
{
|
||||
struct drm_connector *connector;
|
||||
struct intel_output *intel_output;
|
||||
|
||||
intel_output = kzalloc(sizeof(struct intel_output), GFP_KERNEL);
|
||||
if (!intel_output)
|
||||
return;
|
||||
|
||||
connector = &intel_output->base;
|
||||
drm_connector_init(dev, &intel_output->base,
|
||||
&intel_crt_connector_funcs, DRM_MODE_CONNECTOR_VGA);
|
||||
|
||||
drm_encoder_init(dev, &intel_output->enc, &intel_crt_enc_funcs,
|
||||
DRM_MODE_ENCODER_DAC);
|
||||
|
||||
drm_mode_connector_attach_encoder(&intel_output->base,
|
||||
&intel_output->enc);
|
||||
|
||||
/* Set up the DDC bus. */
|
||||
intel_output->ddc_bus = intel_i2c_create(dev, GPIOA, "CRTDDC_A");
|
||||
if (!intel_output->ddc_bus) {
|
||||
dev_printk(KERN_ERR, &dev->pdev->dev, "DDC bus registration "
|
||||
"failed.\n");
|
||||
return;
|
||||
}
|
||||
|
||||
intel_output->type = INTEL_OUTPUT_ANALOG;
|
||||
connector->interlace_allowed = 0;
|
||||
connector->doublescan_allowed = 0;
|
||||
|
||||
drm_encoder_helper_add(&intel_output->enc, &intel_crt_helper_funcs);
|
||||
drm_connector_helper_add(connector, &intel_crt_connector_helper_funcs);
|
||||
|
||||
drm_sysfs_connector_add(connector);
|
||||
}
|
||||
]]>
|
||||
</programlisting>
|
||||
<para>
|
||||
In the example above (again, taken from the i915 driver), a
|
||||
CRT connector and encoder combination is created. A device
|
||||
specific i2c bus is also created, for fetching EDID data and
|
||||
performing monitor detection. Once the process is complete,
|
||||
the new connector is regsitered with sysfs, to make its
|
||||
properties available to applications.
|
||||
</para>
|
||||
<sect4>
|
||||
<title>Helper functions and core functions</title>
|
||||
<para>
|
||||
Since many PC-class graphics devices have similar display output
|
||||
designs, the DRM provides a set of helper functions to make
|
||||
output management easier. The core helper routines handle
|
||||
encoder re-routing and disabling of unused functions following
|
||||
mode set. Using the helpers is optional, but recommended for
|
||||
devices with PC-style architectures (i.e. a set of display planes
|
||||
for feeding pixels to encoders which are in turn routed to
|
||||
connectors). Devices with more complex requirements needing
|
||||
finer grained management can opt to use the core callbacks
|
||||
directly.
|
||||
</para>
|
||||
<para>
|
||||
[Insert typical diagram here.] [Insert OMAP style config here.]
|
||||
</para>
|
||||
</sect4>
|
||||
<para>
|
||||
For each encoder, CRTC and connector, several functions must
|
||||
be provided, depending on the object type. Encoder objects
|
||||
need should provide a DPMS (basically on/off) function, mode fixup
|
||||
(for converting requested modes into native hardware timings),
|
||||
and prepare, set and commit functions for use by the core DRM
|
||||
helper functions. Connector helpers need to provide mode fetch and
|
||||
validity functions as well as an encoder matching function for
|
||||
returing an ideal encoder for a given connector. The core
|
||||
connector functions include a DPMS callback, (deprecated)
|
||||
save/restore routines, detection, mode probing, property handling,
|
||||
and cleanup functions.
|
||||
</para>
|
||||
<!--!Edrivers/char/drm/drm_crtc.h-->
|
||||
<!--!Edrivers/char/drm/drm_crtc.c-->
|
||||
<!--!Edrivers/char/drm/drm_crtc_helper.c-->
|
||||
</sect3>
|
||||
</sect2>
|
||||
</sect1>
|
||||
|
||||
<!-- Internals: vblank handling -->
|
||||
|
||||
<sect1>
|
||||
<title>VBlank event handling</title>
|
||||
<para>
|
||||
The DRM core exposes two vertical blank related ioctls:
|
||||
DRM_IOCTL_WAIT_VBLANK and DRM_IOCTL_MODESET_CTL.
|
||||
<!--!Edrivers/char/drm/drm_irq.c-->
|
||||
</para>
|
||||
<para>
|
||||
DRM_IOCTL_WAIT_VBLANK takes a struct drm_wait_vblank structure
|
||||
as its argument, and is used to block or request a signal when a
|
||||
specified vblank event occurs.
|
||||
</para>
|
||||
<para>
|
||||
DRM_IOCTL_MODESET_CTL should be called by application level
|
||||
drivers before and after mode setting, since on many devices the
|
||||
vertical blank counter will be reset at that time. Internally,
|
||||
the DRM snapshots the last vblank count when the ioctl is called
|
||||
with the _DRM_PRE_MODESET command so that the counter won't go
|
||||
backwards (which is dealt with when _DRM_POST_MODESET is used).
|
||||
</para>
|
||||
<para>
|
||||
To support the functions above, the DRM core provides several
|
||||
helper functions for tracking vertical blank counters, and
|
||||
requires drivers to provide several callbacks:
|
||||
get_vblank_counter(), enable_vblank() and disable_vblank(). The
|
||||
core uses get_vblank_counter() to keep the counter accurate
|
||||
across interrupt disable periods. It should return the current
|
||||
vertical blank event count, which is often tracked in a device
|
||||
register. The enable and disable vblank callbacks should enable
|
||||
and disable vertical blank interrupts, respectively. In the
|
||||
absence of DRM clients waiting on vblank events, the core DRM
|
||||
code will use the disable_vblank() function to disable
|
||||
interrupts, which saves power. They'll be re-enabled again when
|
||||
a client calls the vblank wait ioctl above.
|
||||
</para>
|
||||
<para>
|
||||
Devices that don't provide a count register can simply use an
|
||||
internal atomic counter incremented on every vertical blank
|
||||
interrupt, and can make their enable and disable vblank
|
||||
functions into no-ops.
|
||||
</para>
|
||||
</sect1>
|
||||
|
||||
<sect1>
|
||||
<title>Memory management</title>
|
||||
<para>
|
||||
The memory manager lies at the heart of many DRM operations, and
|
||||
is also required to support advanced client features like OpenGL
|
||||
pbuffers. The DRM currently contains two memory managers, TTM
|
||||
and GEM.
|
||||
</para>
|
||||
|
||||
<sect2>
|
||||
<title>The Translation Table Manager (TTM)</title>
|
||||
<para>
|
||||
TTM was developed by Tungsten Graphics, primarily by Thomas
|
||||
Hellström, and is intended to be a flexible, high performance
|
||||
graphics memory manager.
|
||||
</para>
|
||||
<para>
|
||||
Drivers wishing to support TTM must fill out a drm_bo_driver
|
||||
structure.
|
||||
</para>
|
||||
<para>
|
||||
TTM design background and information belongs here.
|
||||
</para>
|
||||
</sect2>
|
||||
|
||||
<sect2>
|
||||
<title>The Graphics Execution Manager (GEM)</title>
|
||||
<para>
|
||||
GEM is an Intel project, authored by Eric Anholt and Keith
|
||||
Packard. It provides simpler interfaces than TTM, and is well
|
||||
suited for UMA devices.
|
||||
</para>
|
||||
<para>
|
||||
GEM-enabled drivers must provide gem_init_object() and
|
||||
gem_free_object() callbacks to support the core memory
|
||||
allocation routines. They should also provide several driver
|
||||
specific ioctls to support command execution, pinning, buffer
|
||||
read & write, mapping, and domain ownership transfers.
|
||||
</para>
|
||||
<para>
|
||||
On a fundamental level, GEM involves several operations: memory
|
||||
allocation and freeing, command execution, and aperture management
|
||||
at command execution time. Buffer object allocation is relatively
|
||||
straightforward and largely provided by Linux's shmem layer, which
|
||||
provides memory to back each object. When mapped into the GTT
|
||||
or used in a command buffer, the backing pages for an object are
|
||||
flushed to memory and marked write combined so as to be coherent
|
||||
with the GPU. Likewise, when the GPU finishes rendering to an object,
|
||||
if the CPU accesses it, it must be made coherent with the CPU's view
|
||||
of memory, usually involving GPU cache flushing of various kinds.
|
||||
This core CPU<->GPU coherency management is provided by the GEM
|
||||
set domain function, which evaluates an object's current domain and
|
||||
performs any necessary flushing or synchronization to put the object
|
||||
into the desired coherency domain (note that the object may be busy,
|
||||
i.e. an active render target; in that case the set domain function
|
||||
will block the client and wait for rendering to complete before
|
||||
performing any necessary flushing operations).
|
||||
</para>
|
||||
<para>
|
||||
Perhaps the most important GEM function is providing a command
|
||||
execution interface to clients. Client programs construct command
|
||||
buffers containing references to previously allocated memory objects
|
||||
and submit them to GEM. At that point, GEM will take care to bind
|
||||
all the objects into the GTT, execute the buffer, and provide
|
||||
necessary synchronization between clients accessing the same buffers.
|
||||
This often involves evicting some objects from the GTT and re-binding
|
||||
others (a fairly expensive operation), and providing relocation
|
||||
support which hides fixed GTT offsets from clients. Clients must
|
||||
take care not to submit command buffers that reference more objects
|
||||
than can fit in the GTT or GEM will reject them and no rendering
|
||||
will occur. Similarly, if several objects in the buffer require
|
||||
fence registers to be allocated for correct rendering (e.g. 2D blits
|
||||
on pre-965 chips), care must be taken not to require more fence
|
||||
registers than are available to the client. Such resource management
|
||||
should be abstracted from the client in libdrm.
|
||||
</para>
|
||||
</sect2>
|
||||
|
||||
</sect1>
|
||||
|
||||
<!-- Output management -->
|
||||
<sect1>
|
||||
<title>Output management</title>
|
||||
<para>
|
||||
At the core of the DRM output management code is a set of
|
||||
structures representing CRTCs, encoders and connectors.
|
||||
</para>
|
||||
<para>
|
||||
A CRTC is an abstraction representing a part of the chip that
|
||||
contains a pointer to a scanout buffer. Therefore, the number
|
||||
of CRTCs available determines how many independent scanout
|
||||
buffers can be active at any given time. The CRTC structure
|
||||
contains several fields to support this: a pointer to some video
|
||||
memory, a display mode, and an (x, y) offset into the video
|
||||
memory to support panning or configurations where one piece of
|
||||
video memory spans multiple CRTCs.
|
||||
</para>
|
||||
<para>
|
||||
An encoder takes pixel data from a CRTC and converts it to a
|
||||
format suitable for any attached connectors. On some devices,
|
||||
it may be possible to have a CRTC send data to more than one
|
||||
encoder. In that case, both encoders would receive data from
|
||||
the same scanout buffer, resulting in a "cloned" display
|
||||
configuration across the connectors attached to each encoder.
|
||||
</para>
|
||||
<para>
|
||||
A connector is the final destination for pixel data on a device,
|
||||
and usually connects directly to an external display device like
|
||||
a monitor or laptop panel. A connector can only be attached to
|
||||
one encoder at a time. The connector is also the structure
|
||||
where information about the attached display is kept, so it
|
||||
contains fields for display data, EDID data, DPMS &
|
||||
connection status, and information about modes supported on the
|
||||
attached displays.
|
||||
</para>
|
||||
<!--!Edrivers/char/drm/drm_crtc.c-->
|
||||
</sect1>
|
||||
|
||||
<sect1>
|
||||
<title>Framebuffer management</title>
|
||||
<para>
|
||||
In order to set a mode on a given CRTC, encoder and connector
|
||||
configuration, clients need to provide a framebuffer object which
|
||||
will provide a source of pixels for the CRTC to deliver to the encoder(s)
|
||||
and ultimately the connector(s) in the configuration. A framebuffer
|
||||
is fundamentally a driver specific memory object, made into an opaque
|
||||
handle by the DRM addfb function. Once an fb has been created this
|
||||
way it can be passed to the KMS mode setting routines for use in
|
||||
a configuration.
|
||||
</para>
|
||||
</sect1>
|
||||
|
||||
<sect1>
|
||||
<title>Command submission & fencing</title>
|
||||
<para>
|
||||
This should cover a few device specific command submission
|
||||
implementations.
|
||||
</para>
|
||||
</sect1>
|
||||
|
||||
<sect1>
|
||||
<title>Suspend/resume</title>
|
||||
<para>
|
||||
The DRM core provides some suspend/resume code, but drivers
|
||||
wanting full suspend/resume support should provide save() and
|
||||
restore() functions. These will be called at suspend,
|
||||
hibernate, or resume time, and should perform any state save or
|
||||
restore required by your device across suspend or hibernate
|
||||
states.
|
||||
</para>
|
||||
</sect1>
|
||||
|
||||
<sect1>
|
||||
<title>DMA services</title>
|
||||
<para>
|
||||
This should cover how DMA mapping etc. is supported by the core.
|
||||
These functions are deprecated and should not be used.
|
||||
</para>
|
||||
</sect1>
|
||||
</chapter>
|
||||
|
||||
<!-- External interfaces -->
|
||||
|
||||
<chapter id="drmExternals">
|
||||
<title>Userland interfaces</title>
|
||||
<para>
|
||||
The DRM core exports several interfaces to applications,
|
||||
generally intended to be used through corresponding libdrm
|
||||
wrapper functions. In addition, drivers export device specific
|
||||
interfaces for use by userspace drivers & device aware
|
||||
applications through ioctls and sysfs files.
|
||||
</para>
|
||||
<para>
|
||||
External interfaces include: memory mapping, context management,
|
||||
DMA operations, AGP management, vblank control, fence
|
||||
management, memory management, and output management.
|
||||
</para>
|
||||
<para>
|
||||
Cover generic ioctls and sysfs layout here. Only need high
|
||||
level info, since man pages will cover the rest.
|
||||
</para>
|
||||
</chapter>
|
||||
|
||||
<!-- API reference -->
|
||||
|
||||
<appendix id="drmDriverApi">
|
||||
<title>DRM Driver API</title>
|
||||
<para>
|
||||
Include auto-generated API reference here (need to reference it
|
||||
from paragraphs above too).
|
||||
</para>
|
||||
</appendix>
|
||||
|
||||
</book>
|
@ -4,7 +4,7 @@
|
||||
|
||||
<book id="kgdbOnLinux">
|
||||
<bookinfo>
|
||||
<title>Using kgdb and the kgdb Internals</title>
|
||||
<title>Using kgdb, kdb and the kernel debugger internals</title>
|
||||
|
||||
<authorgroup>
|
||||
<author>
|
||||
@ -17,33 +17,8 @@
|
||||
</affiliation>
|
||||
</author>
|
||||
</authorgroup>
|
||||
|
||||
<authorgroup>
|
||||
<author>
|
||||
<firstname>Tom</firstname>
|
||||
<surname>Rini</surname>
|
||||
<affiliation>
|
||||
<address>
|
||||
<email>trini@kernel.crashing.org</email>
|
||||
</address>
|
||||
</affiliation>
|
||||
</author>
|
||||
</authorgroup>
|
||||
|
||||
<authorgroup>
|
||||
<author>
|
||||
<firstname>Amit S.</firstname>
|
||||
<surname>Kale</surname>
|
||||
<affiliation>
|
||||
<address>
|
||||
<email>amitkale@linsyssoft.com</email>
|
||||
</address>
|
||||
</affiliation>
|
||||
</author>
|
||||
</authorgroup>
|
||||
|
||||
<copyright>
|
||||
<year>2008</year>
|
||||
<year>2008,2010</year>
|
||||
<holder>Wind River Systems, Inc.</holder>
|
||||
</copyright>
|
||||
<copyright>
|
||||
@ -69,41 +44,76 @@
|
||||
<chapter id="Introduction">
|
||||
<title>Introduction</title>
|
||||
<para>
|
||||
kgdb is a source level debugger for linux kernel. It is used along
|
||||
with gdb to debug a linux kernel. The expectation is that gdb can
|
||||
be used to "break in" to the kernel to inspect memory, variables
|
||||
and look through call stack information similar to what an
|
||||
application developer would use gdb for. It is possible to place
|
||||
breakpoints in kernel code and perform some limited execution
|
||||
stepping.
|
||||
The kernel has two different debugger front ends (kdb and kgdb)
|
||||
which interface to the debug core. It is possible to use either
|
||||
of the debugger front ends and dynamically transition between them
|
||||
if you configure the kernel properly at compile and runtime.
|
||||
</para>
|
||||
<para>
|
||||
Two machines are required for using kgdb. One of these machines is a
|
||||
development machine and the other is a test machine. The kernel
|
||||
to be debugged runs on the test machine. The development machine
|
||||
runs an instance of gdb against the vmlinux file which contains
|
||||
the symbols (not boot image such as bzImage, zImage, uImage...).
|
||||
In gdb the developer specifies the connection parameters and
|
||||
connects to kgdb. The type of connection a developer makes with
|
||||
gdb depends on the availability of kgdb I/O modules compiled as
|
||||
builtin's or kernel modules in the test machine's kernel.
|
||||
Kdb is simplistic shell-style interface which you can use on a
|
||||
system console with a keyboard or serial console. You can use it
|
||||
to inspect memory, registers, process lists, dmesg, and even set
|
||||
breakpoints to stop in a certain location. Kdb is not a source
|
||||
level debugger, although you can set breakpoints and execute some
|
||||
basic kernel run control. Kdb is mainly aimed at doing some
|
||||
analysis to aid in development or diagnosing kernel problems. You
|
||||
can access some symbols by name in kernel built-ins or in kernel
|
||||
modules if the code was built
|
||||
with <symbol>CONFIG_KALLSYMS</symbol>.
|
||||
</para>
|
||||
<para>
|
||||
Kgdb is intended to be used as a source level debugger for the
|
||||
Linux kernel. It is used along with gdb to debug a Linux kernel.
|
||||
The expectation is that gdb can be used to "break in" to the
|
||||
kernel to inspect memory, variables and look through call stack
|
||||
information similar to the way an application developer would use
|
||||
gdb to debug an application. It is possible to place breakpoints
|
||||
in kernel code and perform some limited execution stepping.
|
||||
</para>
|
||||
<para>
|
||||
Two machines are required for using kgdb. One of these machines is
|
||||
a development machine and the other is the target machine. The
|
||||
kernel to be debugged runs on the target machine. The development
|
||||
machine runs an instance of gdb against the vmlinux file which
|
||||
contains the symbols (not boot image such as bzImage, zImage,
|
||||
uImage...). In gdb the developer specifies the connection
|
||||
parameters and connects to kgdb. The type of connection a
|
||||
developer makes with gdb depends on the availability of kgdb I/O
|
||||
modules compiled as built-ins or loadable kernel modules in the test
|
||||
machine's kernel.
|
||||
</para>
|
||||
</chapter>
|
||||
<chapter id="CompilingAKernel">
|
||||
<title>Compiling a kernel</title>
|
||||
<title>Compiling a kernel</title>
|
||||
<para>
|
||||
<itemizedlist>
|
||||
<listitem><para>In order to enable compilation of kdb, you must first enable kgdb.</para></listitem>
|
||||
<listitem><para>The kgdb test compile options are described in the kgdb test suite chapter.</para></listitem>
|
||||
</itemizedlist>
|
||||
</para>
|
||||
<sect1 id="CompileKGDB">
|
||||
<title>Kernel config options for kgdb</title>
|
||||
<para>
|
||||
To enable <symbol>CONFIG_KGDB</symbol> you should first turn on
|
||||
"Prompt for development and/or incomplete code/drivers"
|
||||
(CONFIG_EXPERIMENTAL) in "General setup", then under the
|
||||
"Kernel debugging" select "KGDB: kernel debugging with remote gdb".
|
||||
"Kernel debugging" select "KGDB: kernel debugger".
|
||||
</para>
|
||||
<para>
|
||||
While it is not a hard requirement that you have symbols in your
|
||||
vmlinux file, gdb tends not to be very useful without the symbolic
|
||||
data, so you will want to turn
|
||||
on <symbol>CONFIG_DEBUG_INFO</symbol> which is called "Compile the
|
||||
kernel with debug info" in the config menu.
|
||||
</para>
|
||||
<para>
|
||||
It is advised, but not required that you turn on the
|
||||
CONFIG_FRAME_POINTER kernel option. This option inserts code to
|
||||
into the compiled executable which saves the frame information in
|
||||
registers or on the stack at different points which will allow a
|
||||
debugger such as gdb to more accurately construct stack back traces
|
||||
while debugging the kernel.
|
||||
<symbol>CONFIG_FRAME_POINTER</symbol> kernel option which is called "Compile the
|
||||
kernel with frame pointers" in the config menu. This option
|
||||
inserts code to into the compiled executable which saves the frame
|
||||
information in registers or on the stack at different points which
|
||||
allows a debugger such as gdb to more accurately construct
|
||||
stack back traces while debugging the kernel.
|
||||
</para>
|
||||
<para>
|
||||
If the architecture that you are using supports the kernel option
|
||||
@ -116,38 +126,160 @@
|
||||
this option.
|
||||
</para>
|
||||
<para>
|
||||
Next you should choose one of more I/O drivers to interconnect debugging
|
||||
host and debugged target. Early boot debugging requires a KGDB
|
||||
I/O driver that supports early debugging and the driver must be
|
||||
built into the kernel directly. Kgdb I/O driver configuration
|
||||
takes place via kernel or module parameters, see following
|
||||
chapter.
|
||||
Next you should choose one of more I/O drivers to interconnect
|
||||
debugging host and debugged target. Early boot debugging requires
|
||||
a KGDB I/O driver that supports early debugging and the driver
|
||||
must be built into the kernel directly. Kgdb I/O driver
|
||||
configuration takes place via kernel or module parameters which
|
||||
you can learn more about in the in the section that describes the
|
||||
parameter "kgdboc".
|
||||
</para>
|
||||
<para>
|
||||
The kgdb test compile options are described in the kgdb test suite chapter.
|
||||
<para>Here is an example set of .config symbols to enable or
|
||||
disable for kgdb:
|
||||
<itemizedlist>
|
||||
<listitem><para># CONFIG_DEBUG_RODATA is not set</para></listitem>
|
||||
<listitem><para>CONFIG_FRAME_POINTER=y</para></listitem>
|
||||
<listitem><para>CONFIG_KGDB=y</para></listitem>
|
||||
<listitem><para>CONFIG_KGDB_SERIAL_CONSOLE=y</para></listitem>
|
||||
</itemizedlist>
|
||||
</para>
|
||||
|
||||
</sect1>
|
||||
<sect1 id="CompileKDB">
|
||||
<title>Kernel config options for kdb</title>
|
||||
<para>Kdb is quite a bit more complex than the simple gdbstub
|
||||
sitting on top of the kernel's debug core. Kdb must implement a
|
||||
shell, and also adds some helper functions in other parts of the
|
||||
kernel, responsible for printing out interesting data such as what
|
||||
you would see if you ran "lsmod", or "ps". In order to build kdb
|
||||
into the kernel you follow the same steps as you would for kgdb.
|
||||
</para>
|
||||
<para>The main config option for kdb
|
||||
is <symbol>CONFIG_KGDB_KDB</symbol> which is called "KGDB_KDB:
|
||||
include kdb frontend for kgdb" in the config menu. In theory you
|
||||
would have already also selected an I/O driver such as the
|
||||
CONFIG_KGDB_SERIAL_CONSOLE interface if you plan on using kdb on a
|
||||
serial port, when you were configuring kgdb.
|
||||
</para>
|
||||
<para>If you want to use a PS/2-style keyboard with kdb, you would
|
||||
select CONFIG_KDB_KEYBOARD which is called "KGDB_KDB: keyboard as
|
||||
input device" in the config menu. The CONFIG_KDB_KEYBOARD option
|
||||
is not used for anything in the gdb interface to kgdb. The
|
||||
CONFIG_KDB_KEYBOARD option only works with kdb.
|
||||
</para>
|
||||
<para>Here is an example set of .config symbols to enable/disable kdb:
|
||||
<itemizedlist>
|
||||
<listitem><para># CONFIG_DEBUG_RODATA is not set</para></listitem>
|
||||
<listitem><para>CONFIG_FRAME_POINTER=y</para></listitem>
|
||||
<listitem><para>CONFIG_KGDB=y</para></listitem>
|
||||
<listitem><para>CONFIG_KGDB_SERIAL_CONSOLE=y</para></listitem>
|
||||
<listitem><para>CONFIG_KGDB_KDB=y</para></listitem>
|
||||
<listitem><para>CONFIG_KDB_KEYBOARD=y</para></listitem>
|
||||
</itemizedlist>
|
||||
</para>
|
||||
</sect1>
|
||||
</chapter>
|
||||
<chapter id="EnableKGDB">
|
||||
<title>Enable kgdb for debugging</title>
|
||||
<para>
|
||||
In order to use kgdb you must activate it by passing configuration
|
||||
information to one of the kgdb I/O drivers. If you do not pass any
|
||||
configuration information kgdb will not do anything at all. Kgdb
|
||||
will only actively hook up to the kernel trap hooks if a kgdb I/O
|
||||
driver is loaded and configured. If you unconfigure a kgdb I/O
|
||||
driver, kgdb will unregister all the kernel hook points.
|
||||
<chapter id="kgdbKernelArgs">
|
||||
<title>Kernel Debugger Boot Arguments</title>
|
||||
<para>This section describes the various runtime kernel
|
||||
parameters that affect the configuration of the kernel debugger.
|
||||
The following chapter covers using kdb and kgdb as well as
|
||||
provides some examples of the configuration parameters.</para>
|
||||
<sect1 id="kgdboc">
|
||||
<title>Kernel parameter: kgdboc</title>
|
||||
<para>The kgdboc driver was originally an abbreviation meant to
|
||||
stand for "kgdb over console". Today it is the primary mechanism
|
||||
to configure how to communicate from gdb to kgdb as well as the
|
||||
devices you want to use to interact with the kdb shell.
|
||||
</para>
|
||||
<para>
|
||||
All drivers can be reconfigured at run time, if
|
||||
<symbol>CONFIG_SYSFS</symbol> and <symbol>CONFIG_MODULES</symbol>
|
||||
are enabled, by echo'ing a new config string to
|
||||
<constant>/sys/module/<driver>/parameter/<option></constant>.
|
||||
The driver can be unconfigured by passing an empty string. You cannot
|
||||
change the configuration while the debugger is attached. Make sure
|
||||
to detach the debugger with the <constant>detach</constant> command
|
||||
prior to trying unconfigure a kgdb I/O driver.
|
||||
<para>For kgdb/gdb, kgdboc is designed to work with a single serial
|
||||
port. It is intended to cover the circumstance where you want to
|
||||
use a serial console as your primary console as well as using it to
|
||||
perform kernel debugging. It is also possible to use kgdb on a
|
||||
serial port which is not designated as a system console. Kgdboc
|
||||
may be configured as a kernel built-in or a kernel loadable module.
|
||||
You can only make use of <constant>kgdbwait</constant> and early
|
||||
debugging if you build kgdboc into the kernel as a built-in.
|
||||
</para>
|
||||
<sect2 id="kgdbocArgs">
|
||||
<title>kgdboc arguments</title>
|
||||
<para>Usage: <constant>kgdboc=[kbd][[,]serial_device][,baud]</constant></para>
|
||||
<sect3 id="kgdbocArgs1">
|
||||
<title>Using loadable module or built-in</title>
|
||||
<para>
|
||||
<orderedlist>
|
||||
<listitem><para>As a kernel built-in:</para>
|
||||
<para>Use the kernel boot argument: <constant>kgdboc=<tty-device>,[baud]</constant></para></listitem>
|
||||
<listitem>
|
||||
<para>As a kernel loadable module:</para>
|
||||
<para>Use the command: <constant>modprobe kgdboc kgdboc=<tty-device>,[baud]</constant></para>
|
||||
<para>Here are two examples of how you might formate the kgdboc
|
||||
string. The first is for an x86 target using the first serial port.
|
||||
The second example is for the ARM Versatile AB using the second
|
||||
serial port.
|
||||
<orderedlist>
|
||||
<listitem><para><constant>kgdboc=ttyS0,115200</constant></para></listitem>
|
||||
<listitem><para><constant>kgdboc=ttyAMA1,115200</constant></para></listitem>
|
||||
</orderedlist>
|
||||
</para>
|
||||
</listitem>
|
||||
</orderedlist></para>
|
||||
</sect3>
|
||||
<sect3 id="kgdbocArgs2">
|
||||
<title>Configure kgdboc at runtime with sysfs</title>
|
||||
<para>At run time you can enable or disable kgdboc by echoing a
|
||||
parameters into the sysfs. Here are two examples:</para>
|
||||
<orderedlist>
|
||||
<listitem><para>Enable kgdboc on ttyS0</para>
|
||||
<para><constant>echo ttyS0 > /sys/module/kgdboc/parameters/kgdboc</constant></para></listitem>
|
||||
<listitem><para>Disable kgdboc</para>
|
||||
<para><constant>echo "" > /sys/module/kgdboc/parameters/kgdboc</constant></para></listitem>
|
||||
</orderedlist>
|
||||
<para>NOTE: You do not need to specify the baud if you are
|
||||
configuring the console on tty which is already configured or
|
||||
open.</para>
|
||||
</sect3>
|
||||
<sect3 id="kgdbocArgs3">
|
||||
<title>More examples</title>
|
||||
<para>You can configure kgdboc to use the keyboard, and or a serial device
|
||||
depending on if you are using kdb and or kgdb, in one of the
|
||||
following scenarios.
|
||||
<orderedlist>
|
||||
<listitem><para>kdb and kgdb over only a serial port</para>
|
||||
<para><constant>kgdboc=<serial_device>[,baud]</constant></para>
|
||||
<para>Example: <constant>kgdboc=ttyS0,115200</constant></para>
|
||||
</listitem>
|
||||
<listitem><para>kdb and kgdb with keyboard and a serial port</para>
|
||||
<para><constant>kgdboc=kbd,<serial_device>[,baud]</constant></para>
|
||||
<para>Example: <constant>kgdboc=kbd,ttyS0,115200</constant></para>
|
||||
</listitem>
|
||||
<listitem><para>kdb with a keyboard</para>
|
||||
<para><constant>kgdboc=kbd</constant></para>
|
||||
</listitem>
|
||||
</orderedlist>
|
||||
</para>
|
||||
</sect3>
|
||||
<para>NOTE: Kgdboc does not support interrupting the target via the
|
||||
gdb remote protocol. You must manually send a sysrq-g unless you
|
||||
have a proxy that splits console output to a terminal program.
|
||||
A console proxy has a separate TCP port for the debugger and a separate
|
||||
TCP port for the "human" console. The proxy can take care of sending
|
||||
the sysrq-g for you.
|
||||
</para>
|
||||
<para>When using kgdboc with no debugger proxy, you can end up
|
||||
connecting the debugger at one of two entry points. If an
|
||||
exception occurs after you have loaded kgdboc, a message should
|
||||
print on the console stating it is waiting for the debugger. In
|
||||
this case you disconnect your terminal program and then connect the
|
||||
debugger in its place. If you want to interrupt the target system
|
||||
and forcibly enter a debug session you have to issue a Sysrq
|
||||
sequence and then type the letter <constant>g</constant>. Then
|
||||
you disconnect the terminal session and connect gdb. Your options
|
||||
if you don't like this are to hack gdb to send the sysrq-g for you
|
||||
as well as on the initial connect, or to use a debugger proxy that
|
||||
allows an unmodified gdb to do the debugging.
|
||||
</para>
|
||||
</sect2>
|
||||
</sect1>
|
||||
<sect1 id="kgdbwait">
|
||||
<title>Kernel parameter: kgdbwait</title>
|
||||
<para>
|
||||
@ -162,103 +294,204 @@
|
||||
</para>
|
||||
<para>
|
||||
The kernel will stop and wait as early as the I/O driver and
|
||||
architecture will allow when you use this option. If you build the
|
||||
kgdb I/O driver as a kernel module kgdbwait will not do anything.
|
||||
architecture allows when you use this option. If you build the
|
||||
kgdb I/O driver as a loadable kernel module kgdbwait will not do
|
||||
anything.
|
||||
</para>
|
||||
</sect1>
|
||||
<sect1 id="kgdboc">
|
||||
<title>Kernel parameter: kgdboc</title>
|
||||
<para>
|
||||
The kgdboc driver was originally an abbreviation meant to stand for
|
||||
"kgdb over console". Kgdboc is designed to work with a single
|
||||
serial port. It was meant to cover the circumstance
|
||||
where you wanted to use a serial console as your primary console as
|
||||
well as using it to perform kernel debugging. Of course you can
|
||||
also use kgdboc without assigning a console to the same port.
|
||||
</para>
|
||||
<sect2 id="UsingKgdboc">
|
||||
<title>Using kgdboc</title>
|
||||
<para>
|
||||
You can configure kgdboc via sysfs or a module or kernel boot line
|
||||
parameter depending on if you build with CONFIG_KGDBOC as a module
|
||||
or built-in.
|
||||
<orderedlist>
|
||||
<listitem><para>From the module load or build-in</para>
|
||||
<para><constant>kgdboc=<tty-device>,[baud]</constant></para>
|
||||
<para>
|
||||
The example here would be if your console port was typically ttyS0, you would use something like <constant>kgdboc=ttyS0,115200</constant> or on the ARM Versatile AB you would likely use <constant>kgdboc=ttyAMA0,115200</constant>
|
||||
</para>
|
||||
</listitem>
|
||||
<listitem><para>From sysfs</para>
|
||||
<para><constant>echo ttyS0 > /sys/module/kgdboc/parameters/kgdboc</constant></para>
|
||||
</listitem>
|
||||
</orderedlist>
|
||||
</para>
|
||||
<para>
|
||||
NOTE: Kgdboc does not support interrupting the target via the
|
||||
gdb remote protocol. You must manually send a sysrq-g unless you
|
||||
have a proxy that splits console output to a terminal problem and
|
||||
has a separate port for the debugger to connect to that sends the
|
||||
sysrq-g for you.
|
||||
</para>
|
||||
<para>When using kgdboc with no debugger proxy, you can end up
|
||||
connecting the debugger for one of two entry points. If an
|
||||
exception occurs after you have loaded kgdboc a message should print
|
||||
on the console stating it is waiting for the debugger. In case you
|
||||
disconnect your terminal program and then connect the debugger in
|
||||
its place. If you want to interrupt the target system and forcibly
|
||||
enter a debug session you have to issue a Sysrq sequence and then
|
||||
type the letter <constant>g</constant>. Then you disconnect the
|
||||
terminal session and connect gdb. Your options if you don't like
|
||||
this are to hack gdb to send the sysrq-g for you as well as on the
|
||||
initial connect, or to use a debugger proxy that allows an
|
||||
unmodified gdb to do the debugging.
|
||||
</para>
|
||||
</sect2>
|
||||
</sect1>
|
||||
<sect1 id="kgdbcon">
|
||||
<title>Kernel parameter: kgdbcon</title>
|
||||
<para>
|
||||
Kgdb supports using the gdb serial protocol to send console messages
|
||||
to the debugger when the debugger is connected and running. There
|
||||
are two ways to activate this feature.
|
||||
<orderedlist>
|
||||
<listitem><para>Activate with the kernel command line option:</para>
|
||||
<para><constant>kgdbcon</constant></para>
|
||||
</listitem>
|
||||
<listitem><para>Use sysfs before configuring an io driver</para>
|
||||
<para>
|
||||
<constant>echo 1 > /sys/module/kgdb/parameters/kgdb_use_con</constant>
|
||||
</para>
|
||||
<para>
|
||||
NOTE: If you do this after you configure the kgdb I/O driver, the
|
||||
setting will not take effect until the next point the I/O is
|
||||
reconfigured.
|
||||
</para>
|
||||
</listitem>
|
||||
</orderedlist>
|
||||
</para>
|
||||
<para>
|
||||
IMPORTANT NOTE: Using this option with kgdb over the console
|
||||
(kgdboc) is not supported.
|
||||
<sect1 id="kgdbcon">
|
||||
<title>Kernel parameter: kgdbcon</title>
|
||||
<para> The kgdbcon feature allows you to see printk() messages
|
||||
inside gdb while gdb is connected to the kernel. Kdb does not make
|
||||
use of the kgdbcon feature.
|
||||
</para>
|
||||
<para>Kgdb supports using the gdb serial protocol to send console
|
||||
messages to the debugger when the debugger is connected and running.
|
||||
There are two ways to activate this feature.
|
||||
<orderedlist>
|
||||
<listitem><para>Activate with the kernel command line option:</para>
|
||||
<para><constant>kgdbcon</constant></para>
|
||||
</listitem>
|
||||
<listitem><para>Use sysfs before configuring an I/O driver</para>
|
||||
<para>
|
||||
<constant>echo 1 > /sys/module/kgdb/parameters/kgdb_use_con</constant>
|
||||
</para>
|
||||
<para>
|
||||
NOTE: If you do this after you configure the kgdb I/O driver, the
|
||||
setting will not take effect until the next point the I/O is
|
||||
reconfigured.
|
||||
</para>
|
||||
</listitem>
|
||||
</orderedlist>
|
||||
<para>IMPORTANT NOTE: You cannot use kgdboc + kgdbcon on a tty that is an
|
||||
active system console. An example incorrect usage is <constant>console=ttyS0,115200 kgdboc=ttyS0 kgdbcon</constant>
|
||||
</para>
|
||||
<para>It is possible to use this option with kgdboc on a tty that is not a system console.
|
||||
</para>
|
||||
</para>
|
||||
</sect1>
|
||||
</chapter>
|
||||
<chapter id="ConnectingGDB">
|
||||
<title>Connecting gdb</title>
|
||||
<chapter id="usingKDB">
|
||||
<title>Using kdb</title>
|
||||
<para>
|
||||
</para>
|
||||
<sect1 id="quickKDBserial">
|
||||
<title>Quick start for kdb on a serial port</title>
|
||||
<para>This is a quick example of how to use kdb.</para>
|
||||
<para><orderedlist>
|
||||
<listitem><para>Boot kernel with arguments:
|
||||
<itemizedlist>
|
||||
<listitem><para><constant>console=ttyS0,115200 kgdboc=ttyS0,115200</constant></para></listitem>
|
||||
</itemizedlist></para>
|
||||
<para>OR</para>
|
||||
<para>Configure kgdboc after the kernel booted; assuming you are using a serial port console:
|
||||
<itemizedlist>
|
||||
<listitem><para><constant>echo ttyS0 > /sys/module/kgdboc/parameters/kgdboc</constant></para></listitem>
|
||||
</itemizedlist>
|
||||
</para>
|
||||
</listitem>
|
||||
<listitem><para>Enter the kernel debugger manually or by waiting for an oops or fault. There are several ways you can enter the kernel debugger manually; all involve using the sysrq-g, which means you must have enabled CONFIG_MAGIC_SYSRQ=y in your kernel config.</para>
|
||||
<itemizedlist>
|
||||
<listitem><para>When logged in as root or with a super user session you can run:</para>
|
||||
<para><constant>echo g > /proc/sysrq-trigger</constant></para></listitem>
|
||||
<listitem><para>Example using minicom 2.2</para>
|
||||
<para>Press: <constant>Control-a</constant></para>
|
||||
<para>Press: <constant>f</constant></para>
|
||||
<para>Press: <constant>g</constant></para>
|
||||
</listitem>
|
||||
<listitem><para>When you have telneted to a terminal server that supports sending a remote break</para>
|
||||
<para>Press: <constant>Control-]</constant></para>
|
||||
<para>Type in:<constant>send break</constant></para>
|
||||
<para>Press: <constant>Enter</constant></para>
|
||||
<para>Press: <constant>g</constant></para>
|
||||
</listitem>
|
||||
</itemizedlist>
|
||||
</listitem>
|
||||
<listitem><para>From the kdb prompt you can run the "help" command to see a complete list of the commands that are available.</para>
|
||||
<para>Some useful commands in kdb include:
|
||||
<itemizedlist>
|
||||
<listitem><para>lsmod -- Shows where kernel modules are loaded</para></listitem>
|
||||
<listitem><para>ps -- Displays only the active processes</para></listitem>
|
||||
<listitem><para>ps A -- Shows all the processes</para></listitem>
|
||||
<listitem><para>summary -- Shows kernel version info and memory usage</para></listitem>
|
||||
<listitem><para>bt -- Get a backtrace of the current process using dump_stack()</para></listitem>
|
||||
<listitem><para>dmesg -- View the kernel syslog buffer</para></listitem>
|
||||
<listitem><para>go -- Continue the system</para></listitem>
|
||||
</itemizedlist>
|
||||
</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>When you are done using kdb you need to consider rebooting the
|
||||
system or using the "go" command to resuming normal kernel
|
||||
execution. If you have paused the kernel for a lengthy period of
|
||||
time, applications that rely on timely networking or anything to do
|
||||
with real wall clock time could be adversely affected, so you
|
||||
should take this into consideration when using the kernel
|
||||
debugger.</para>
|
||||
</listitem>
|
||||
</orderedlist></para>
|
||||
</sect1>
|
||||
<sect1 id="quickKDBkeyboard">
|
||||
<title>Quick start for kdb using a keyboard connected console</title>
|
||||
<para>This is a quick example of how to use kdb with a keyboard.</para>
|
||||
<para><orderedlist>
|
||||
<listitem><para>Boot kernel with arguments:
|
||||
<itemizedlist>
|
||||
<listitem><para><constant>kgdboc=kbd</constant></para></listitem>
|
||||
</itemizedlist></para>
|
||||
<para>OR</para>
|
||||
<para>Configure kgdboc after the kernel booted:
|
||||
<itemizedlist>
|
||||
<listitem><para><constant>echo kbd > /sys/module/kgdboc/parameters/kgdboc</constant></para></listitem>
|
||||
</itemizedlist>
|
||||
</para>
|
||||
</listitem>
|
||||
<listitem><para>Enter the kernel debugger manually or by waiting for an oops or fault. There are several ways you can enter the kernel debugger manually; all involve using the sysrq-g, which means you must have enabled CONFIG_MAGIC_SYSRQ=y in your kernel config.</para>
|
||||
<itemizedlist>
|
||||
<listitem><para>When logged in as root or with a super user session you can run:</para>
|
||||
<para><constant>echo g > /proc/sysrq-trigger</constant></para></listitem>
|
||||
<listitem><para>Example using a laptop keyboard</para>
|
||||
<para>Press and hold down: <constant>Alt</constant></para>
|
||||
<para>Press and hold down: <constant>Fn</constant></para>
|
||||
<para>Press and release the key with the label: <constant>SysRq</constant></para>
|
||||
<para>Release: <constant>Fn</constant></para>
|
||||
<para>Press and release: <constant>g</constant></para>
|
||||
<para>Release: <constant>Alt</constant></para>
|
||||
</listitem>
|
||||
<listitem><para>Example using a PS/2 101-key keyboard</para>
|
||||
<para>Press and hold down: <constant>Alt</constant></para>
|
||||
<para>Press and release the key with the label: <constant>SysRq</constant></para>
|
||||
<para>Press and release: <constant>g</constant></para>
|
||||
<para>Release: <constant>Alt</constant></para>
|
||||
</listitem>
|
||||
</itemizedlist>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>Now type in a kdb command such as "help", "dmesg", "bt" or "go" to continue kernel execution.</para>
|
||||
</listitem>
|
||||
</orderedlist></para>
|
||||
</sect1>
|
||||
</chapter>
|
||||
<chapter id="EnableKGDB">
|
||||
<title>Using kgdb / gdb</title>
|
||||
<para>In order to use kgdb you must activate it by passing
|
||||
configuration information to one of the kgdb I/O drivers. If you
|
||||
do not pass any configuration information kgdb will not do anything
|
||||
at all. Kgdb will only actively hook up to the kernel trap hooks
|
||||
if a kgdb I/O driver is loaded and configured. If you unconfigure
|
||||
a kgdb I/O driver, kgdb will unregister all the kernel hook points.
|
||||
</para>
|
||||
<para> All kgdb I/O drivers can be reconfigured at run time, if
|
||||
<symbol>CONFIG_SYSFS</symbol> and <symbol>CONFIG_MODULES</symbol>
|
||||
are enabled, by echo'ing a new config string to
|
||||
<constant>/sys/module/<driver>/parameter/<option></constant>.
|
||||
The driver can be unconfigured by passing an empty string. You cannot
|
||||
change the configuration while the debugger is attached. Make sure
|
||||
to detach the debugger with the <constant>detach</constant> command
|
||||
prior to trying to unconfigure a kgdb I/O driver.
|
||||
</para>
|
||||
<sect1 id="ConnectingGDB">
|
||||
<title>Connecting with gdb to a serial port</title>
|
||||
<orderedlist>
|
||||
<listitem><para>Configure kgdboc</para>
|
||||
<para>Boot kernel with arguments:
|
||||
<itemizedlist>
|
||||
<listitem><para><constant>kgdboc=ttyS0,115200</constant></para></listitem>
|
||||
</itemizedlist></para>
|
||||
<para>OR</para>
|
||||
<para>Configure kgdboc after the kernel booted:
|
||||
<itemizedlist>
|
||||
<listitem><para><constant>echo ttyS0 > /sys/module/kgdboc/parameters/kgdboc</constant></para></listitem>
|
||||
</itemizedlist></para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>Stop kernel execution (break into the debugger)</para>
|
||||
<para>In order to connect to gdb via kgdboc, the kernel must
|
||||
first be stopped. There are several ways to stop the kernel which
|
||||
include using kgdbwait as a boot argument, via a sysrq-g, or running
|
||||
the kernel until it takes an exception where it waits for the
|
||||
debugger to attach.
|
||||
<itemizedlist>
|
||||
<listitem><para>When logged in as root or with a super user session you can run:</para>
|
||||
<para><constant>echo g > /proc/sysrq-trigger</constant></para></listitem>
|
||||
<listitem><para>Example using minicom 2.2</para>
|
||||
<para>Press: <constant>Control-a</constant></para>
|
||||
<para>Press: <constant>f</constant></para>
|
||||
<para>Press: <constant>g</constant></para>
|
||||
</listitem>
|
||||
<listitem><para>When you have telneted to a terminal server that supports sending a remote break</para>
|
||||
<para>Press: <constant>Control-]</constant></para>
|
||||
<para>Type in:<constant>send break</constant></para>
|
||||
<para>Press: <constant>Enter</constant></para>
|
||||
<para>Press: <constant>g</constant></para>
|
||||
</listitem>
|
||||
</itemizedlist>
|
||||
</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>Connect from from gdb</para>
|
||||
<para>
|
||||
If you are using kgdboc, you need to have used kgdbwait as a boot
|
||||
argument, issued a sysrq-g, or the system you are going to debug
|
||||
has already taken an exception and is waiting for the debugger to
|
||||
attach before you can connect gdb.
|
||||
</para>
|
||||
<para>
|
||||
If you are not using different kgdb I/O driver other than kgdboc,
|
||||
you should be able to connect and the target will automatically
|
||||
respond.
|
||||
</para>
|
||||
<para>
|
||||
Example (using a serial port):
|
||||
Example (using a directly connected port):
|
||||
</para>
|
||||
<programlisting>
|
||||
% gdb ./vmlinux
|
||||
@ -266,7 +499,7 @@
|
||||
(gdb) target remote /dev/ttyS0
|
||||
</programlisting>
|
||||
<para>
|
||||
Example (kgdb to a terminal server on tcp port 2012):
|
||||
Example (kgdb to a terminal server on TCP port 2012):
|
||||
</para>
|
||||
<programlisting>
|
||||
% gdb ./vmlinux
|
||||
@ -283,6 +516,83 @@
|
||||
communications. You do this prior to issuing the <constant>target
|
||||
remote</constant> command by typing in: <constant>set debug remote 1</constant>
|
||||
</para>
|
||||
</listitem>
|
||||
</orderedlist>
|
||||
<para>Remember if you continue in gdb, and need to "break in" again,
|
||||
you need to issue an other sysrq-g. It is easy to create a simple
|
||||
entry point by putting a breakpoint at <constant>sys_sync</constant>
|
||||
and then you can run "sync" from a shell or script to break into the
|
||||
debugger.</para>
|
||||
</sect1>
|
||||
</chapter>
|
||||
<chapter id="switchKdbKgdb">
|
||||
<title>kgdb and kdb interoperability</title>
|
||||
<para>It is possible to transition between kdb and kgdb dynamically.
|
||||
The debug core will remember which you used the last time and
|
||||
automatically start in the same mode.</para>
|
||||
<sect1>
|
||||
<title>Switching between kdb and kgdb</title>
|
||||
<sect2>
|
||||
<title>Switching from kgdb to kdb</title>
|
||||
<para>
|
||||
There are two ways to switch from kgdb to kdb: you can use gdb to
|
||||
issue a maintenance packet, or you can blindly type the command $3#33.
|
||||
Whenever kernel debugger stops in kgdb mode it will print the
|
||||
message <constant>KGDB or $3#33 for KDB</constant>. It is important
|
||||
to note that you have to type the sequence correctly in one pass.
|
||||
You cannot type a backspace or delete because kgdb will interpret
|
||||
that as part of the debug stream.
|
||||
<orderedlist>
|
||||
<listitem><para>Change from kgdb to kdb by blindly typing:</para>
|
||||
<para><constant>$3#33</constant></para></listitem>
|
||||
<listitem><para>Change from kgdb to kdb with gdb</para>
|
||||
<para><constant>maintenance packet 3</constant></para>
|
||||
<para>NOTE: Now you must kill gdb. Typically you press control-z and
|
||||
issue the command: kill -9 %</para></listitem>
|
||||
</orderedlist>
|
||||
</para>
|
||||
</sect2>
|
||||
<sect2>
|
||||
<title>Change from kdb to kgdb</title>
|
||||
<para>There are two ways you can change from kdb to kgdb. You can
|
||||
manually enter kgdb mode by issuing the kgdb command from the kdb
|
||||
shell prompt, or you can connect gdb while the kdb shell prompt is
|
||||
active. The kdb shell looks for the typical first commands that gdb
|
||||
would issue with the gdb remote protocol and if it sees one of those
|
||||
commands it automatically changes into kgdb mode.</para>
|
||||
<orderedlist>
|
||||
<listitem><para>From kdb issue the command:</para>
|
||||
<para><constant>kgdb</constant></para>
|
||||
<para>Now disconnect your terminal program and connect gdb in its place</para></listitem>
|
||||
<listitem><para>At the kdb prompt, disconnect the terminal program and connect gdb in its place.</para></listitem>
|
||||
</orderedlist>
|
||||
</sect2>
|
||||
</sect1>
|
||||
<sect1>
|
||||
<title>Running kdb commands from gdb</title>
|
||||
<para>It is possible to run a limited set of kdb commands from gdb,
|
||||
using the gdb monitor command. You don't want to execute any of the
|
||||
run control or breakpoint operations, because it can disrupt the
|
||||
state of the kernel debugger. You should be using gdb for
|
||||
breakpoints and run control operations if you have gdb connected.
|
||||
The more useful commands to run are things like lsmod, dmesg, ps or
|
||||
possibly some of the memory information commands. To see all the kdb
|
||||
commands you can run <constant>monitor help</constant>.</para>
|
||||
<para>Example:
|
||||
<informalexample><programlisting>
|
||||
(gdb) monitor ps
|
||||
1 idle process (state I) and
|
||||
27 sleeping system daemon (state M) processes suppressed,
|
||||
use 'ps A' to see all.
|
||||
Task Addr Pid Parent [*] cpu State Thread Command
|
||||
|
||||
0xc78291d0 1 0 0 0 S 0xc7829404 init
|
||||
0xc7954150 942 1 0 0 S 0xc7954384 dropbear
|
||||
0xc78789c0 944 1 0 0 S 0xc7878bf4 sh
|
||||
(gdb)
|
||||
</programlisting></informalexample>
|
||||
</para>
|
||||
</sect1>
|
||||
</chapter>
|
||||
<chapter id="KGDBTestSuite">
|
||||
<title>kgdb Test Suite</title>
|
||||
@ -309,34 +619,36 @@
|
||||
</para>
|
||||
</chapter>
|
||||
<chapter id="CommonBackEndReq">
|
||||
<title>KGDB Internals</title>
|
||||
<title>Kernel Debugger Internals</title>
|
||||
<sect1 id="kgdbArchitecture">
|
||||
<title>Architecture Specifics</title>
|
||||
<para>
|
||||
Kgdb is organized into three basic components:
|
||||
The kernel debugger is organized into a number of components:
|
||||
<orderedlist>
|
||||
<listitem><para>kgdb core</para>
|
||||
<listitem><para>The debug core</para>
|
||||
<para>
|
||||
The kgdb core is found in kernel/kgdb.c. It contains:
|
||||
The debug core is found in kernel/debugger/debug_core.c. It contains:
|
||||
<itemizedlist>
|
||||
<listitem><para>All the logic to implement the gdb serial protocol</para></listitem>
|
||||
<listitem><para>A generic OS exception handler which includes sync'ing the processors into a stopped state on an multi cpu system.</para></listitem>
|
||||
<listitem><para>A generic OS exception handler which includes
|
||||
sync'ing the processors into a stopped state on an multi-CPU
|
||||
system.</para></listitem>
|
||||
<listitem><para>The API to talk to the kgdb I/O drivers</para></listitem>
|
||||
<listitem><para>The API to make calls to the arch specific kgdb implementation</para></listitem>
|
||||
<listitem><para>The API to make calls to the arch-specific kgdb implementation</para></listitem>
|
||||
<listitem><para>The logic to perform safe memory reads and writes to memory while using the debugger</para></listitem>
|
||||
<listitem><para>A full implementation for software breakpoints unless overridden by the arch</para></listitem>
|
||||
<listitem><para>The API to invoke either the kdb or kgdb frontend to the debug core.</para></listitem>
|
||||
</itemizedlist>
|
||||
</para>
|
||||
</listitem>
|
||||
<listitem><para>kgdb arch specific implementation</para>
|
||||
<listitem><para>kgdb arch-specific implementation</para>
|
||||
<para>
|
||||
This implementation is generally found in arch/*/kernel/kgdb.c.
|
||||
As an example, arch/x86/kernel/kgdb.c contains the specifics to
|
||||
implement HW breakpoint as well as the initialization to
|
||||
dynamically register and unregister for the trap handlers on
|
||||
this architecture. The arch specific portion implements:
|
||||
this architecture. The arch-specific portion implements:
|
||||
<itemizedlist>
|
||||
<listitem><para>contains an arch specific trap catcher which
|
||||
<listitem><para>contains an arch-specific trap catcher which
|
||||
invokes kgdb_handle_exception() to start kgdb about doing its
|
||||
work</para></listitem>
|
||||
<listitem><para>translation to and from gdb specific packet format to pt_regs</para></listitem>
|
||||
@ -347,11 +659,35 @@
|
||||
</itemizedlist>
|
||||
</para>
|
||||
</listitem>
|
||||
<listitem><para>gdbstub frontend (aka kgdb)</para>
|
||||
<para>The gdbstub is located in kernel/debug/gdbstub.c. It contains:</para>
|
||||
<itemizedlist>
|
||||
<listitem><para>All the logic to implement the gdb serial protocol</para></listitem>
|
||||
</itemizedlist>
|
||||
</listitem>
|
||||
<listitem><para>kdb frontend</para>
|
||||
<para>The kdb debugger shell is broken down into a number of
|
||||
components. The kdb core is located in kernel/debug/kdb. There
|
||||
are a number of helper functions in some of the other kernel
|
||||
components to make it possible for kdb to examine and report
|
||||
information about the kernel without taking locks that could
|
||||
cause a kernel deadlock. The kdb core contains implements the following functionality.</para>
|
||||
<itemizedlist>
|
||||
<listitem><para>A simple shell</para></listitem>
|
||||
<listitem><para>The kdb core command set</para></listitem>
|
||||
<listitem><para>A registration API to register additional kdb shell commands.</para>
|
||||
<para>A good example of a self-contained kdb module is the "ftdump" command for dumping the ftrace buffer. See: kernel/trace/trace_kdb.c</para></listitem>
|
||||
<listitem><para>The implementation for kdb_printf() which
|
||||
emits messages directly to I/O drivers, bypassing the kernel
|
||||
log.</para></listitem>
|
||||
<listitem><para>SW / HW breakpoint management for the kdb shell</para></listitem>
|
||||
</itemizedlist>
|
||||
</listitem>
|
||||
<listitem><para>kgdb I/O driver</para>
|
||||
<para>
|
||||
Each kgdb I/O driver has to provide an implemenation for the following:
|
||||
Each kgdb I/O driver has to provide an implementation for the following:
|
||||
<itemizedlist>
|
||||
<listitem><para>configuration via builtin or module</para></listitem>
|
||||
<listitem><para>configuration via built-in or module</para></listitem>
|
||||
<listitem><para>dynamic configuration and kgdb hook registration calls</para></listitem>
|
||||
<listitem><para>read and write character interface</para></listitem>
|
||||
<listitem><para>A cleanup handler for unconfiguring from the kgdb core</para></listitem>
|
||||
@ -416,15 +752,15 @@
|
||||
underlying low level to the hardware driver having "polling hooks"
|
||||
which the to which the tty driver is attached. In the initial
|
||||
implementation of kgdboc it the serial_core was changed to expose a
|
||||
low level uart hook for doing polled mode reading and writing of a
|
||||
low level UART hook for doing polled mode reading and writing of a
|
||||
single character while in an atomic context. When kgdb makes an I/O
|
||||
request to the debugger, kgdboc invokes a call back in the serial
|
||||
core which in turn uses the call back in the uart driver. It is
|
||||
certainly possible to extend kgdboc to work with non-uart based
|
||||
core which in turn uses the call back in the UART driver. It is
|
||||
certainly possible to extend kgdboc to work with non-UART based
|
||||
consoles in the future.
|
||||
</para>
|
||||
<para>
|
||||
When using kgdboc with a uart, the uart driver must implement two callbacks in the <constant>struct uart_ops</constant>. Example from drivers/8250.c:<programlisting>
|
||||
When using kgdboc with a UART, the UART driver must implement two callbacks in the <constant>struct uart_ops</constant>. Example from drivers/8250.c:<programlisting>
|
||||
#ifdef CONFIG_CONSOLE_POLL
|
||||
.poll_get_char = serial8250_get_poll_char,
|
||||
.poll_put_char = serial8250_put_poll_char,
|
||||
@ -434,7 +770,7 @@
|
||||
<constant>#ifdef CONFIG_CONSOLE_POLL</constant>, as shown above.
|
||||
Keep in mind that polling hooks have to be implemented in such a way
|
||||
that they can be called from an atomic context and have to restore
|
||||
the state of the uart chip on return such that the system can return
|
||||
the state of the UART chip on return such that the system can return
|
||||
to normal when the debugger detaches. You need to be very careful
|
||||
with any kind of lock you consider, because failing here is most
|
||||
going to mean pressing the reset button.
|
||||
@ -453,6 +789,10 @@
|
||||
<itemizedlist>
|
||||
<listitem><para>Jason Wessel<email>jason.wessel@windriver.com</email></para></listitem>
|
||||
</itemizedlist>
|
||||
In Jan 2010 this document was updated to include kdb.
|
||||
<itemizedlist>
|
||||
<listitem><para>Jason Wessel<email>jason.wessel@windriver.com</email></para></listitem>
|
||||
</itemizedlist>
|
||||
</para>
|
||||
</chapter>
|
||||
</book>
|
||||
|
@ -81,16 +81,14 @@ void (*port_disable) (struct ata_port *);
|
||||
</programlisting>
|
||||
|
||||
<para>
|
||||
Called from ata_bus_probe() and ata_bus_reset() error paths,
|
||||
as well as when unregistering from the SCSI module (rmmod, hot
|
||||
unplug).
|
||||
Called from ata_bus_probe() error path, as well as when
|
||||
unregistering from the SCSI module (rmmod, hot unplug).
|
||||
This function should do whatever needs to be done to take the
|
||||
port out of use. In most cases, ata_port_disable() can be used
|
||||
as this hook.
|
||||
</para>
|
||||
<para>
|
||||
Called from ata_bus_probe() on a failed probe.
|
||||
Called from ata_bus_reset() on a failed bus reset.
|
||||
Called from ata_scsi_release().
|
||||
</para>
|
||||
|
||||
@ -227,6 +225,18 @@ u8 (*sff_check_altstatus)(struct ata_port *ap);
|
||||
|
||||
</sect2>
|
||||
|
||||
<sect2><title>Write specific ATA shadow register</title>
|
||||
<programlisting>
|
||||
void (*sff_set_devctl)(struct ata_port *ap, u8 ctl);
|
||||
</programlisting>
|
||||
|
||||
<para>
|
||||
Write the device control ATA shadow register to the hardware.
|
||||
Most drivers don't need to define this.
|
||||
</para>
|
||||
|
||||
</sect2>
|
||||
|
||||
<sect2><title>Select ATA device on bus</title>
|
||||
<programlisting>
|
||||
void (*sff_dev_select)(struct ata_port *ap, unsigned int device);
|
||||
@ -477,7 +487,7 @@ void (*host_stop) (struct ata_host_set *host_set);
|
||||
allocates space for a legacy IDE PRD table and returns.
|
||||
</para>
|
||||
<para>
|
||||
->port_stop() is called after ->host_stop(). It's sole function
|
||||
->port_stop() is called after ->host_stop(). Its sole function
|
||||
is to release DMA/memory resources, now that they are no longer
|
||||
actively being used. Many drivers also free driver-private
|
||||
data from port at this time.
|
||||
|
@ -17,6 +17,7 @@
|
||||
<!ENTITY VIDIOC-DBG-G-REGISTER "<link linkend='vidioc-dbg-g-register'><constant>VIDIOC_DBG_G_REGISTER</constant></link>">
|
||||
<!ENTITY VIDIOC-DBG-S-REGISTER "<link linkend='vidioc-dbg-g-register'><constant>VIDIOC_DBG_S_REGISTER</constant></link>">
|
||||
<!ENTITY VIDIOC-DQBUF "<link linkend='vidioc-qbuf'><constant>VIDIOC_DQBUF</constant></link>">
|
||||
<!ENTITY VIDIOC-DQEVENT "<link linkend='vidioc-dqevent'><constant>VIDIOC_DQEVENT</constant></link>">
|
||||
<!ENTITY VIDIOC-ENCODER-CMD "<link linkend='vidioc-encoder-cmd'><constant>VIDIOC_ENCODER_CMD</constant></link>">
|
||||
<!ENTITY VIDIOC-ENUMAUDIO "<link linkend='vidioc-enumaudio'><constant>VIDIOC_ENUMAUDIO</constant></link>">
|
||||
<!ENTITY VIDIOC-ENUMAUDOUT "<link linkend='vidioc-enumaudioout'><constant>VIDIOC_ENUMAUDOUT</constant></link>">
|
||||
@ -60,6 +61,7 @@
|
||||
<!ENTITY VIDIOC-REQBUFS "<link linkend='vidioc-reqbufs'><constant>VIDIOC_REQBUFS</constant></link>">
|
||||
<!ENTITY VIDIOC-STREAMOFF "<link linkend='vidioc-streamon'><constant>VIDIOC_STREAMOFF</constant></link>">
|
||||
<!ENTITY VIDIOC-STREAMON "<link linkend='vidioc-streamon'><constant>VIDIOC_STREAMON</constant></link>">
|
||||
<!ENTITY VIDIOC-SUBSCRIBE-EVENT "<link linkend='vidioc-subscribe-event'><constant>VIDIOC_SUBSCRIBE_EVENT</constant></link>">
|
||||
<!ENTITY VIDIOC-S-AUDIO "<link linkend='vidioc-g-audio'><constant>VIDIOC_S_AUDIO</constant></link>">
|
||||
<!ENTITY VIDIOC-S-AUDOUT "<link linkend='vidioc-g-audioout'><constant>VIDIOC_S_AUDOUT</constant></link>">
|
||||
<!ENTITY VIDIOC-S-CROP "<link linkend='vidioc-g-crop'><constant>VIDIOC_S_CROP</constant></link>">
|
||||
@ -83,6 +85,7 @@
|
||||
<!ENTITY VIDIOC-TRY-ENCODER-CMD "<link linkend='vidioc-encoder-cmd'><constant>VIDIOC_TRY_ENCODER_CMD</constant></link>">
|
||||
<!ENTITY VIDIOC-TRY-EXT-CTRLS "<link linkend='vidioc-g-ext-ctrls'><constant>VIDIOC_TRY_EXT_CTRLS</constant></link>">
|
||||
<!ENTITY VIDIOC-TRY-FMT "<link linkend='vidioc-g-fmt'><constant>VIDIOC_TRY_FMT</constant></link>">
|
||||
<!ENTITY VIDIOC-UNSUBSCRIBE-EVENT "<link linkend='vidioc-subscribe-event'><constant>VIDIOC_UNSUBSCRIBE_EVENT</constant></link>">
|
||||
|
||||
<!-- Types -->
|
||||
<!ENTITY v4l2-std-id "<link linkend='v4l2-std-id'>v4l2_std_id</link>">
|
||||
@ -141,6 +144,9 @@
|
||||
<!ENTITY v4l2-enc-idx "struct <link linkend='v4l2-enc-idx'>v4l2_enc_idx</link>">
|
||||
<!ENTITY v4l2-enc-idx-entry "struct <link linkend='v4l2-enc-idx-entry'>v4l2_enc_idx_entry</link>">
|
||||
<!ENTITY v4l2-encoder-cmd "struct <link linkend='v4l2-encoder-cmd'>v4l2_encoder_cmd</link>">
|
||||
<!ENTITY v4l2-event "struct <link linkend='v4l2-event'>v4l2_event</link>">
|
||||
<!ENTITY v4l2-event-subscription "struct <link linkend='v4l2-event-subscription'>v4l2_event_subscription</link>">
|
||||
<!ENTITY v4l2-event-vsync "struct <link linkend='v4l2-event-vsync'>v4l2_event_vsync</link>">
|
||||
<!ENTITY v4l2-ext-control "struct <link linkend='v4l2-ext-control'>v4l2_ext_control</link>">
|
||||
<!ENTITY v4l2-ext-controls "struct <link linkend='v4l2-ext-controls'>v4l2_ext_controls</link>">
|
||||
<!ENTITY v4l2-fmtdesc "struct <link linkend='v4l2-fmtdesc'>v4l2_fmtdesc</link>">
|
||||
@ -200,6 +206,7 @@
|
||||
<!ENTITY sub-controls SYSTEM "v4l/controls.xml">
|
||||
<!ENTITY sub-dev-capture SYSTEM "v4l/dev-capture.xml">
|
||||
<!ENTITY sub-dev-codec SYSTEM "v4l/dev-codec.xml">
|
||||
<!ENTITY sub-dev-event SYSTEM "v4l/dev-event.xml">
|
||||
<!ENTITY sub-dev-effect SYSTEM "v4l/dev-effect.xml">
|
||||
<!ENTITY sub-dev-osd SYSTEM "v4l/dev-osd.xml">
|
||||
<!ENTITY sub-dev-output SYSTEM "v4l/dev-output.xml">
|
||||
@ -292,6 +299,8 @@
|
||||
<!ENTITY sub-v4l2grab-c SYSTEM "v4l/v4l2grab.c.xml">
|
||||
<!ENTITY sub-videodev2-h SYSTEM "v4l/videodev2.h.xml">
|
||||
<!ENTITY sub-v4l2 SYSTEM "v4l/v4l2.xml">
|
||||
<!ENTITY sub-dqevent SYSTEM "v4l/vidioc-dqevent.xml">
|
||||
<!ENTITY sub-subscribe-event SYSTEM "v4l/vidioc-subscribe-event.xml">
|
||||
<!ENTITY sub-intro SYSTEM "dvb/intro.xml">
|
||||
<!ENTITY sub-frontend SYSTEM "dvb/frontend.xml">
|
||||
<!ENTITY sub-dvbproperty SYSTEM "dvb/dvbproperty.xml">
|
||||
@ -381,3 +390,5 @@
|
||||
<!ENTITY reqbufs SYSTEM "v4l/vidioc-reqbufs.xml">
|
||||
<!ENTITY s-hw-freq-seek SYSTEM "v4l/vidioc-s-hw-freq-seek.xml">
|
||||
<!ENTITY streamon SYSTEM "v4l/vidioc-streamon.xml">
|
||||
<!ENTITY dqevent SYSTEM "v4l/vidioc-dqevent.xml">
|
||||
<!ENTITY subscribe_event SYSTEM "v4l/vidioc-subscribe-event.xml">
|
||||
|
@ -269,7 +269,7 @@ static void board_hwcontrol(struct mtd_info *mtd, int cmd)
|
||||
information about the device.
|
||||
</para>
|
||||
<programlisting>
|
||||
int __init board_init (void)
|
||||
static int __init board_init (void)
|
||||
{
|
||||
struct nand_chip *this;
|
||||
int err = 0;
|
||||
|
@ -19,13 +19,17 @@
|
||||
</authorgroup>
|
||||
|
||||
<copyright>
|
||||
<year>2008</year>
|
||||
<year>2008-2010</year>
|
||||
<holder>Paul Mundt</holder>
|
||||
</copyright>
|
||||
<copyright>
|
||||
<year>2008</year>
|
||||
<year>2008-2010</year>
|
||||
<holder>Renesas Technology Corp.</holder>
|
||||
</copyright>
|
||||
<copyright>
|
||||
<year>2010</year>
|
||||
<holder>Renesas Electronics Corp.</holder>
|
||||
</copyright>
|
||||
|
||||
<legalnotice>
|
||||
<para>
|
||||
@ -77,7 +81,7 @@
|
||||
</chapter>
|
||||
<chapter id="clk">
|
||||
<title>Clock Framework Extensions</title>
|
||||
!Iarch/sh/include/asm/clock.h
|
||||
!Iinclude/linux/sh_clk.h
|
||||
</chapter>
|
||||
<chapter id="mach">
|
||||
<title>Machine Specific Interfaces</title>
|
||||
|
@ -2332,15 +2332,26 @@ more information.</para>
|
||||
</listitem>
|
||||
</orderedlist>
|
||||
</section>
|
||||
</section>
|
||||
<section>
|
||||
<title>V4L2 in Linux 2.6.34</title>
|
||||
<orderedlist>
|
||||
<listitem>
|
||||
<para>Added
|
||||
<constant>V4L2_CID_IRIS_ABSOLUTE</constant> and
|
||||
<constant>V4L2_CID_IRIS_RELATIVE</constant> controls to the
|
||||
<link linkend="camera-controls">Camera controls class</link>.
|
||||
</para>
|
||||
</listitem>
|
||||
</orderedlist>
|
||||
</section>
|
||||
|
||||
<section id="other">
|
||||
<title>Relation of V4L2 to other Linux multimedia APIs</title>
|
||||
<section id="other">
|
||||
<title>Relation of V4L2 to other Linux multimedia APIs</title>
|
||||
|
||||
<section id="xvideo">
|
||||
<title>X Video Extension</title>
|
||||
<section id="xvideo">
|
||||
<title>X Video Extension</title>
|
||||
|
||||
<para>The X Video Extension (abbreviated XVideo or just Xv) is
|
||||
<para>The X Video Extension (abbreviated XVideo or just Xv) is
|
||||
an extension of the X Window system, implemented for example by the
|
||||
XFree86 project. Its scope is similar to V4L2, an API to video capture
|
||||
and output devices for X clients. Xv allows applications to display
|
||||
@ -2351,7 +2362,7 @@ capture or output still images in XPixmaps<footnote>
|
||||
extension available across many operating systems and
|
||||
architectures.</para>
|
||||
|
||||
<para>Because the driver is embedded into the X server Xv has a
|
||||
<para>Because the driver is embedded into the X server Xv has a
|
||||
number of advantages over the V4L2 <link linkend="overlay">video
|
||||
overlay interface</link>. The driver can easily determine the overlay
|
||||
target, &ie; visible graphics memory or off-screen buffers for a
|
||||
@ -2360,16 +2371,16 @@ overlay, scaling or color-keying, or the clipping functions of the
|
||||
video capture hardware, always in sync with drawing operations or
|
||||
windows moving or changing their stacking order.</para>
|
||||
|
||||
<para>To combine the advantages of Xv and V4L a special Xv
|
||||
<para>To combine the advantages of Xv and V4L a special Xv
|
||||
driver exists in XFree86 and XOrg, just programming any overlay capable
|
||||
Video4Linux device it finds. To enable it
|
||||
<filename>/etc/X11/XF86Config</filename> must contain these lines:</para>
|
||||
<para><screen>
|
||||
<para><screen>
|
||||
Section "Module"
|
||||
Load "v4l"
|
||||
EndSection</screen></para>
|
||||
|
||||
<para>As of XFree86 4.2 this driver still supports only V4L
|
||||
<para>As of XFree86 4.2 this driver still supports only V4L
|
||||
ioctls, however it should work just fine with all V4L2 devices through
|
||||
the V4L2 backward-compatibility layer. Since V4L2 permits multiple
|
||||
opens it is possible (if supported by the V4L2 driver) to capture
|
||||
@ -2377,83 +2388,84 @@ video while an X client requested video overlay. Restrictions of
|
||||
simultaneous capturing and overlay are discussed in <xref
|
||||
linkend="overlay" /> apply.</para>
|
||||
|
||||
<para>Only marginally related to V4L2, XFree86 extended Xv to
|
||||
<para>Only marginally related to V4L2, XFree86 extended Xv to
|
||||
support hardware YUV to RGB conversion and scaling for faster video
|
||||
playback, and added an interface to MPEG-2 decoding hardware. This API
|
||||
is useful to display images captured with V4L2 devices.</para>
|
||||
</section>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<title>Digital Video</title>
|
||||
<section>
|
||||
<title>Digital Video</title>
|
||||
|
||||
<para>V4L2 does not support digital terrestrial, cable or
|
||||
<para>V4L2 does not support digital terrestrial, cable or
|
||||
satellite broadcast. A separate project aiming at digital receivers
|
||||
exists. You can find its homepage at <ulink
|
||||
url="http://linuxtv.org">http://linuxtv.org</ulink>. The Linux DVB API
|
||||
has no connection to the V4L2 API except that drivers for hybrid
|
||||
hardware may support both.</para>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<title>Audio Interfaces</title>
|
||||
|
||||
<para>[to do - OSS/ALSA]</para>
|
||||
</section>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<title>Audio Interfaces</title>
|
||||
<section id="experimental">
|
||||
<title>Experimental API Elements</title>
|
||||
|
||||
<para>[to do - OSS/ALSA]</para>
|
||||
</section>
|
||||
</section>
|
||||
|
||||
<section id="experimental">
|
||||
<title>Experimental API Elements</title>
|
||||
|
||||
<para>The following V4L2 API elements are currently experimental
|
||||
<para>The following V4L2 API elements are currently experimental
|
||||
and may change in the future.</para>
|
||||
|
||||
<itemizedlist>
|
||||
<listitem>
|
||||
<para>Video Output Overlay (OSD) Interface, <xref
|
||||
<itemizedlist>
|
||||
<listitem>
|
||||
<para>Video Output Overlay (OSD) Interface, <xref
|
||||
linkend="osd" />.</para>
|
||||
</listitem>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para><constant>V4L2_BUF_TYPE_VIDEO_OUTPUT_OVERLAY</constant>,
|
||||
<para><constant>V4L2_BUF_TYPE_VIDEO_OUTPUT_OVERLAY</constant>,
|
||||
&v4l2-buf-type;, <xref linkend="v4l2-buf-type" />.</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para><constant>V4L2_CAP_VIDEO_OUTPUT_OVERLAY</constant>,
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para><constant>V4L2_CAP_VIDEO_OUTPUT_OVERLAY</constant>,
|
||||
&VIDIOC-QUERYCAP; ioctl, <xref linkend="device-capabilities" />.</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>&VIDIOC-ENUM-FRAMESIZES; and
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>&VIDIOC-ENUM-FRAMESIZES; and
|
||||
&VIDIOC-ENUM-FRAMEINTERVALS; ioctls.</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>&VIDIOC-G-ENC-INDEX; ioctl.</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>&VIDIOC-ENCODER-CMD; and &VIDIOC-TRY-ENCODER-CMD;
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>&VIDIOC-G-ENC-INDEX; ioctl.</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>&VIDIOC-ENCODER-CMD; and &VIDIOC-TRY-ENCODER-CMD;
|
||||
ioctls.</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>&VIDIOC-DBG-G-REGISTER; and &VIDIOC-DBG-S-REGISTER;
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>&VIDIOC-DBG-G-REGISTER; and &VIDIOC-DBG-S-REGISTER;
|
||||
ioctls.</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>&VIDIOC-DBG-G-CHIP-IDENT; ioctl.</para>
|
||||
</listitem>
|
||||
</itemizedlist>
|
||||
</section>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>&VIDIOC-DBG-G-CHIP-IDENT; ioctl.</para>
|
||||
</listitem>
|
||||
</itemizedlist>
|
||||
</section>
|
||||
|
||||
<section id="obsolete">
|
||||
<title>Obsolete API Elements</title>
|
||||
<section id="obsolete">
|
||||
<title>Obsolete API Elements</title>
|
||||
|
||||
<para>The following V4L2 API elements were superseded by new
|
||||
<para>The following V4L2 API elements were superseded by new
|
||||
interfaces and should not be implemented in new drivers.</para>
|
||||
|
||||
<itemizedlist>
|
||||
<listitem>
|
||||
<para><constant>VIDIOC_G_MPEGCOMP</constant> and
|
||||
<itemizedlist>
|
||||
<listitem>
|
||||
<para><constant>VIDIOC_G_MPEGCOMP</constant> and
|
||||
<constant>VIDIOC_S_MPEGCOMP</constant> ioctls. Use Extended Controls,
|
||||
<xref linkend="extended-controls" />.</para>
|
||||
</listitem>
|
||||
</itemizedlist>
|
||||
</listitem>
|
||||
</itemizedlist>
|
||||
</section>
|
||||
</section>
|
||||
|
||||
<!--
|
||||
|
@ -266,6 +266,12 @@ minimum value disables backlight compensation.</entry>
|
||||
<entry>boolean</entry>
|
||||
<entry>Chroma automatic gain control.</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry><constant>V4L2_CID_CHROMA_GAIN</constant></entry>
|
||||
<entry>integer</entry>
|
||||
<entry>Adjusts the Chroma gain control (for use when chroma AGC
|
||||
is disabled).</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry><constant>V4L2_CID_COLOR_KILLER</constant></entry>
|
||||
<entry>boolean</entry>
|
||||
@ -277,8 +283,15 @@ minimum value disables backlight compensation.</entry>
|
||||
<entry>Selects a color effect. Possible values for
|
||||
<constant>enum v4l2_colorfx</constant> are:
|
||||
<constant>V4L2_COLORFX_NONE</constant> (0),
|
||||
<constant>V4L2_COLORFX_BW</constant> (1) and
|
||||
<constant>V4L2_COLORFX_SEPIA</constant> (2).</entry>
|
||||
<constant>V4L2_COLORFX_BW</constant> (1),
|
||||
<constant>V4L2_COLORFX_SEPIA</constant> (2),
|
||||
<constant>V4L2_COLORFX_NEGATIVE</constant> (3),
|
||||
<constant>V4L2_COLORFX_EMBOSS</constant> (4),
|
||||
<constant>V4L2_COLORFX_SKETCH</constant> (5),
|
||||
<constant>V4L2_COLORFX_SKY_BLUE</constant> (6),
|
||||
<constant>V4L2_COLORFX_GRASS_GREEN</constant> (7),
|
||||
<constant>V4L2_COLORFX_SKIN_WHITEN</constant> (8) and
|
||||
<constant>V4L2_COLORFX_VIVID</constant> (9).</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry><constant>V4L2_CID_ROTATE</constant></entry>
|
||||
@ -1824,6 +1837,25 @@ wide-angle direction. The zoom speed unit is driver-specific.</entry>
|
||||
</row>
|
||||
<row><entry></entry></row>
|
||||
|
||||
<row>
|
||||
<entry spanname="id"><constant>V4L2_CID_IRIS_ABSOLUTE</constant> </entry>
|
||||
<entry>integer</entry>
|
||||
</row><row><entry spanname="descr">This control sets the
|
||||
camera's aperture to the specified value. The unit is undefined.
|
||||
Larger values open the iris wider, smaller values close it.</entry>
|
||||
</row>
|
||||
<row><entry></entry></row>
|
||||
|
||||
<row>
|
||||
<entry spanname="id"><constant>V4L2_CID_IRIS_RELATIVE</constant> </entry>
|
||||
<entry>integer</entry>
|
||||
</row><row><entry spanname="descr">This control modifies the
|
||||
camera's aperture by the specified amount. The unit is undefined.
|
||||
Positive values open the iris one step further, negative values close
|
||||
it one step further. This is a write-only control.</entry>
|
||||
</row>
|
||||
<row><entry></entry></row>
|
||||
|
||||
<row>
|
||||
<entry spanname="id"><constant>V4L2_CID_PRIVACY</constant> </entry>
|
||||
<entry>boolean</entry>
|
||||
|
31
Documentation/DocBook/v4l/dev-event.xml
Normal file
31
Documentation/DocBook/v4l/dev-event.xml
Normal file
@ -0,0 +1,31 @@
|
||||
<title>Event Interface</title>
|
||||
|
||||
<para>The V4L2 event interface provides means for user to get
|
||||
immediately notified on certain conditions taking place on a device.
|
||||
This might include start of frame or loss of signal events, for
|
||||
example.
|
||||
</para>
|
||||
|
||||
<para>To receive events, the events the user is interested in first must
|
||||
be subscribed using the &VIDIOC-SUBSCRIBE-EVENT; ioctl. Once an event is
|
||||
subscribed, the events of subscribed types are dequeueable using the
|
||||
&VIDIOC-DQEVENT; ioctl. Events may be unsubscribed using
|
||||
VIDIOC_UNSUBSCRIBE_EVENT ioctl. The special event type V4L2_EVENT_ALL may
|
||||
be used to unsubscribe all the events the driver supports.</para>
|
||||
|
||||
<para>The event subscriptions and event queues are specific to file
|
||||
handles. Subscribing an event on one file handle does not affect
|
||||
other file handles.
|
||||
</para>
|
||||
|
||||
<para>The information on dequeueable events is obtained by using select or
|
||||
poll system calls on video devices. The V4L2 events use POLLPRI events on
|
||||
poll system call and exceptions on select system call. </para>
|
||||
|
||||
<!--
|
||||
Local Variables:
|
||||
mode: sgml
|
||||
sgml-parent-document: "v4l2.sgml"
|
||||
indent-tabs-mode: nil
|
||||
End:
|
||||
-->
|
@ -701,6 +701,16 @@ buffer cannot be on both queues at the same time, the
|
||||
They can be both cleared however, then the buffer is in "dequeued"
|
||||
state, in the application domain to say so.</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry><constant>V4L2_BUF_FLAG_ERROR</constant></entry>
|
||||
<entry>0x0040</entry>
|
||||
<entry>When this flag is set, the buffer has been dequeued
|
||||
successfully, although the data might have been corrupted.
|
||||
This is recoverable, streaming may continue as normal and
|
||||
the buffer may be reused normally.
|
||||
Drivers set this flag when the <constant>VIDIOC_DQBUF</constant>
|
||||
ioctl is called.</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry><constant>V4L2_BUF_FLAG_KEYFRAME</constant></entry>
|
||||
<entry>0x0008</entry>
|
||||
@ -918,8 +928,8 @@ order</emphasis>.</para>
|
||||
|
||||
<para>When the driver provides or accepts images field by field
|
||||
rather than interleaved, it is also important applications understand
|
||||
how the fields combine to frames. We distinguish between top and
|
||||
bottom fields, the <emphasis>spatial order</emphasis>: The first line
|
||||
how the fields combine to frames. We distinguish between top (aka odd) and
|
||||
bottom (aka even) fields, the <emphasis>spatial order</emphasis>: The first line
|
||||
of the top field is the first line of an interlaced frame, the first
|
||||
line of the bottom field is the second line of that frame.</para>
|
||||
|
||||
@ -972,12 +982,12 @@ between <constant>V4L2_FIELD_TOP</constant> and
|
||||
<row>
|
||||
<entry><constant>V4L2_FIELD_TOP</constant></entry>
|
||||
<entry>2</entry>
|
||||
<entry>Images consist of the top field only.</entry>
|
||||
<entry>Images consist of the top (aka odd) field only.</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry><constant>V4L2_FIELD_BOTTOM</constant></entry>
|
||||
<entry>3</entry>
|
||||
<entry>Images consist of the bottom field only.
|
||||
<entry>Images consist of the bottom (aka even) field only.
|
||||
Applications may wish to prevent a device from capturing interlaced
|
||||
images because they will have "comb" or "feathering" artefacts around
|
||||
moving objects.</entry>
|
||||
|
@ -792,6 +792,18 @@ http://www.thedirks.org/winnov/</ulink></para></entry>
|
||||
<entry>'YYUV'</entry>
|
||||
<entry>unknown</entry>
|
||||
</row>
|
||||
<row id="V4L2-PIX-FMT-Y4">
|
||||
<entry><constant>V4L2_PIX_FMT_Y4</constant></entry>
|
||||
<entry>'Y04 '</entry>
|
||||
<entry>Old 4-bit greyscale format. Only the least significant 4 bits of each byte are used,
|
||||
the other bits are set to 0.</entry>
|
||||
</row>
|
||||
<row id="V4L2-PIX-FMT-Y6">
|
||||
<entry><constant>V4L2_PIX_FMT_Y6</constant></entry>
|
||||
<entry>'Y06 '</entry>
|
||||
<entry>Old 6-bit greyscale format. Only the least significant 6 bits of each byte are used,
|
||||
the other bits are set to 0.</entry>
|
||||
</row>
|
||||
</tbody>
|
||||
</tgroup>
|
||||
</table>
|
||||
|
@ -401,6 +401,7 @@ and discussions on the V4L mailing list.</revremark>
|
||||
<section id="ttx"> &sub-dev-teletext; </section>
|
||||
<section id="radio"> &sub-dev-radio; </section>
|
||||
<section id="rds"> &sub-dev-rds; </section>
|
||||
<section id="event"> &sub-dev-event; </section>
|
||||
</chapter>
|
||||
|
||||
<chapter id="driver">
|
||||
@ -426,6 +427,7 @@ and discussions on the V4L mailing list.</revremark>
|
||||
&sub-cropcap;
|
||||
&sub-dbg-g-chip-ident;
|
||||
&sub-dbg-g-register;
|
||||
&sub-dqevent;
|
||||
&sub-encoder-cmd;
|
||||
&sub-enumaudio;
|
||||
&sub-enumaudioout;
|
||||
@ -467,6 +469,7 @@ and discussions on the V4L mailing list.</revremark>
|
||||
&sub-reqbufs;
|
||||
&sub-s-hw-freq-seek;
|
||||
&sub-streamon;
|
||||
&sub-subscribe-event;
|
||||
<!-- End of ioctls. -->
|
||||
&sub-mmap;
|
||||
&sub-munmap;
|
||||
|
@ -1018,6 +1018,13 @@ enum <link linkend="v4l2-colorfx">v4l2_colorfx</link> {
|
||||
V4L2_COLORFX_NONE = 0,
|
||||
V4L2_COLORFX_BW = 1,
|
||||
V4L2_COLORFX_SEPIA = 2,
|
||||
V4L2_COLORFX_NEGATIVE = 3,
|
||||
V4L2_COLORFX_EMBOSS = 4,
|
||||
V4L2_COLORFX_SKETCH = 5,
|
||||
V4L2_COLORFX_SKY_BLUE = 6,
|
||||
V4L2_COLORFX_GRASS_GREEN = 7,
|
||||
V4L2_COLORFX_SKIN_WHITEN = 8,
|
||||
V4L2_COLORFX_VIVID = 9.
|
||||
};
|
||||
#define V4L2_CID_AUTOBRIGHTNESS (V4L2_CID_BASE+32)
|
||||
#define V4L2_CID_BAND_STOP_FILTER (V4L2_CID_BASE+33)
|
||||
@ -1271,6 +1278,9 @@ enum <link linkend="v4l2-exposure-auto-type">v4l2_exposure_auto_type</link> {
|
||||
|
||||
#define V4L2_CID_PRIVACY (V4L2_CID_CAMERA_CLASS_BASE+16)
|
||||
|
||||
#define V4L2_CID_IRIS_ABSOLUTE (V4L2_CID_CAMERA_CLASS_BASE+17)
|
||||
#define V4L2_CID_IRIS_RELATIVE (V4L2_CID_CAMERA_CLASS_BASE+18)
|
||||
|
||||
/* FM Modulator class control IDs */
|
||||
#define V4L2_CID_FM_TX_CLASS_BASE (V4L2_CTRL_CLASS_FM_TX | 0x900)
|
||||
#define V4L2_CID_FM_TX_CLASS (V4L2_CTRL_CLASS_FM_TX | 1)
|
||||
|
131
Documentation/DocBook/v4l/vidioc-dqevent.xml
Normal file
131
Documentation/DocBook/v4l/vidioc-dqevent.xml
Normal file
@ -0,0 +1,131 @@
|
||||
<refentry id="vidioc-dqevent">
|
||||
<refmeta>
|
||||
<refentrytitle>ioctl VIDIOC_DQEVENT</refentrytitle>
|
||||
&manvol;
|
||||
</refmeta>
|
||||
|
||||
<refnamediv>
|
||||
<refname>VIDIOC_DQEVENT</refname>
|
||||
<refpurpose>Dequeue event</refpurpose>
|
||||
</refnamediv>
|
||||
|
||||
<refsynopsisdiv>
|
||||
<funcsynopsis>
|
||||
<funcprototype>
|
||||
<funcdef>int <function>ioctl</function></funcdef>
|
||||
<paramdef>int <parameter>fd</parameter></paramdef>
|
||||
<paramdef>int <parameter>request</parameter></paramdef>
|
||||
<paramdef>struct v4l2_event
|
||||
*<parameter>argp</parameter></paramdef>
|
||||
</funcprototype>
|
||||
</funcsynopsis>
|
||||
</refsynopsisdiv>
|
||||
|
||||
<refsect1>
|
||||
<title>Arguments</title>
|
||||
|
||||
<variablelist>
|
||||
<varlistentry>
|
||||
<term><parameter>fd</parameter></term>
|
||||
<listitem>
|
||||
<para>&fd;</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
<varlistentry>
|
||||
<term><parameter>request</parameter></term>
|
||||
<listitem>
|
||||
<para>VIDIOC_DQEVENT</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
<varlistentry>
|
||||
<term><parameter>argp</parameter></term>
|
||||
<listitem>
|
||||
<para></para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
</variablelist>
|
||||
</refsect1>
|
||||
|
||||
<refsect1>
|
||||
<title>Description</title>
|
||||
|
||||
<para>Dequeue an event from a video device. No input is required
|
||||
for this ioctl. All the fields of the &v4l2-event; structure are
|
||||
filled by the driver. The file handle will also receive exceptions
|
||||
which the application may get by e.g. using the select system
|
||||
call.</para>
|
||||
|
||||
<table frame="none" pgwide="1" id="v4l2-event">
|
||||
<title>struct <structname>v4l2_event</structname></title>
|
||||
<tgroup cols="4">
|
||||
&cs-str;
|
||||
<tbody valign="top">
|
||||
<row>
|
||||
<entry>__u32</entry>
|
||||
<entry><structfield>type</structfield></entry>
|
||||
<entry></entry>
|
||||
<entry>Type of the event.</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>union</entry>
|
||||
<entry><structfield>u</structfield></entry>
|
||||
<entry></entry>
|
||||
<entry></entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry></entry>
|
||||
<entry>&v4l2-event-vsync;</entry>
|
||||
<entry><structfield>vsync</structfield></entry>
|
||||
<entry>Event data for event V4L2_EVENT_VSYNC.
|
||||
</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry></entry>
|
||||
<entry>__u8</entry>
|
||||
<entry><structfield>data</structfield>[64]</entry>
|
||||
<entry>Event data. Defined by the event type. The union
|
||||
should be used to define easily accessible type for
|
||||
events.</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>__u32</entry>
|
||||
<entry><structfield>pending</structfield></entry>
|
||||
<entry></entry>
|
||||
<entry>Number of pending events excluding this one.</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>__u32</entry>
|
||||
<entry><structfield>sequence</structfield></entry>
|
||||
<entry></entry>
|
||||
<entry>Event sequence number. The sequence number is
|
||||
incremented for every subscribed event that takes place.
|
||||
If sequence numbers are not contiguous it means that
|
||||
events have been lost.
|
||||
</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>struct timespec</entry>
|
||||
<entry><structfield>timestamp</structfield></entry>
|
||||
<entry></entry>
|
||||
<entry>Event timestamp.</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>__u32</entry>
|
||||
<entry><structfield>reserved</structfield>[9]</entry>
|
||||
<entry></entry>
|
||||
<entry>Reserved for future extensions. Drivers must set
|
||||
the array to zero.</entry>
|
||||
</row>
|
||||
</tbody>
|
||||
</tgroup>
|
||||
</table>
|
||||
|
||||
</refsect1>
|
||||
</refentry>
|
||||
<!--
|
||||
Local Variables:
|
||||
mode: sgml
|
||||
sgml-parent-document: "v4l2.sgml"
|
||||
indent-tabs-mode: nil
|
||||
End:
|
||||
-->
|
@ -283,7 +283,7 @@ input/output interface to linux-media@vger.kernel.org on 19 Oct 2009.
|
||||
<entry>This input supports setting DV presets by using VIDIOC_S_DV_PRESET.</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry><constant>V4L2_OUT_CAP_CUSTOM_TIMINGS</constant></entry>
|
||||
<entry><constant>V4L2_IN_CAP_CUSTOM_TIMINGS</constant></entry>
|
||||
<entry>0x00000002</entry>
|
||||
<entry>This input supports setting custom video timings by using VIDIOC_S_DV_TIMINGS.</entry>
|
||||
</row>
|
||||
|
@ -111,7 +111,11 @@ from the driver's outgoing queue. They just set the
|
||||
and <structfield>reserved</structfield>
|
||||
fields of a &v4l2-buffer; as above, when <constant>VIDIOC_DQBUF</constant>
|
||||
is called with a pointer to this structure the driver fills the
|
||||
remaining fields or returns an error code.</para>
|
||||
remaining fields or returns an error code. The driver may also set
|
||||
<constant>V4L2_BUF_FLAG_ERROR</constant> in the <structfield>flags</structfield>
|
||||
field. It indicates a non-critical (recoverable) streaming error. In such case
|
||||
the application may continue as normal, but should be aware that data in the
|
||||
dequeued buffer might be corrupted.</para>
|
||||
|
||||
<para>By default <constant>VIDIOC_DQBUF</constant> blocks when no
|
||||
buffer is in the outgoing queue. When the
|
||||
@ -158,7 +162,13 @@ enqueue a user pointer buffer.</para>
|
||||
<para><constant>VIDIOC_DQBUF</constant> failed due to an
|
||||
internal error. Can also indicate temporary problems like signal
|
||||
loss. Note the driver might dequeue an (empty) buffer despite
|
||||
returning an error, or even stop capturing.</para>
|
||||
returning an error, or even stop capturing. Reusing such buffer may be unsafe
|
||||
though and its details (e.g. <structfield>index</structfield>) may not be
|
||||
returned either. It is recommended that drivers indicate recoverable errors
|
||||
by setting the <constant>V4L2_BUF_FLAG_ERROR</constant> and returning 0 instead.
|
||||
In that case the application should be able to safely reuse the buffer and
|
||||
continue streaming.
|
||||
</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
</variablelist>
|
||||
|
@ -325,7 +325,7 @@ should be part of the control documentation.</entry>
|
||||
<entry>n/a</entry>
|
||||
<entry>This is not a control. When
|
||||
<constant>VIDIOC_QUERYCTRL</constant> is called with a control ID
|
||||
equal to a control class code (see <xref linkend="ctrl-class" />), the
|
||||
equal to a control class code (see <xref linkend="ctrl-class" />) + 1, the
|
||||
ioctl returns the name of the control class and this control type.
|
||||
Older drivers which do not support this feature return an
|
||||
&EINVAL;.</entry>
|
||||
|
@ -61,7 +61,7 @@ fields of the <structname>v4l2_requestbuffers</structname> structure.
|
||||
They set the <structfield>type</structfield> field to the respective
|
||||
stream or buffer type, the <structfield>count</structfield> field to
|
||||
the desired number of buffers, <structfield>memory</structfield>
|
||||
must be set to the requested I/O method and the reserved array
|
||||
must be set to the requested I/O method and the <structfield>reserved</structfield> array
|
||||
must be zeroed. When the ioctl
|
||||
is called with a pointer to this structure the driver will attempt to allocate
|
||||
the requested number of buffers and it stores the actual number
|
||||
|
133
Documentation/DocBook/v4l/vidioc-subscribe-event.xml
Normal file
133
Documentation/DocBook/v4l/vidioc-subscribe-event.xml
Normal file
@ -0,0 +1,133 @@
|
||||
<refentry id="vidioc-subscribe-event">
|
||||
<refmeta>
|
||||
<refentrytitle>ioctl VIDIOC_SUBSCRIBE_EVENT, VIDIOC_UNSUBSCRIBE_EVENT</refentrytitle>
|
||||
&manvol;
|
||||
</refmeta>
|
||||
|
||||
<refnamediv>
|
||||
<refname>VIDIOC_SUBSCRIBE_EVENT, VIDIOC_UNSUBSCRIBE_EVENT</refname>
|
||||
<refpurpose>Subscribe or unsubscribe event</refpurpose>
|
||||
</refnamediv>
|
||||
|
||||
<refsynopsisdiv>
|
||||
<funcsynopsis>
|
||||
<funcprototype>
|
||||
<funcdef>int <function>ioctl</function></funcdef>
|
||||
<paramdef>int <parameter>fd</parameter></paramdef>
|
||||
<paramdef>int <parameter>request</parameter></paramdef>
|
||||
<paramdef>struct v4l2_event_subscription
|
||||
*<parameter>argp</parameter></paramdef>
|
||||
</funcprototype>
|
||||
</funcsynopsis>
|
||||
</refsynopsisdiv>
|
||||
|
||||
<refsect1>
|
||||
<title>Arguments</title>
|
||||
|
||||
<variablelist>
|
||||
<varlistentry>
|
||||
<term><parameter>fd</parameter></term>
|
||||
<listitem>
|
||||
<para>&fd;</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
<varlistentry>
|
||||
<term><parameter>request</parameter></term>
|
||||
<listitem>
|
||||
<para>VIDIOC_SUBSCRIBE_EVENT, VIDIOC_UNSUBSCRIBE_EVENT</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
<varlistentry>
|
||||
<term><parameter>argp</parameter></term>
|
||||
<listitem>
|
||||
<para></para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
</variablelist>
|
||||
</refsect1>
|
||||
|
||||
<refsect1>
|
||||
<title>Description</title>
|
||||
|
||||
<para>Subscribe or unsubscribe V4L2 event. Subscribed events are
|
||||
dequeued by using the &VIDIOC-DQEVENT; ioctl.</para>
|
||||
|
||||
<table frame="none" pgwide="1" id="v4l2-event-subscription">
|
||||
<title>struct <structname>v4l2_event_subscription</structname></title>
|
||||
<tgroup cols="3">
|
||||
&cs-str;
|
||||
<tbody valign="top">
|
||||
<row>
|
||||
<entry>__u32</entry>
|
||||
<entry><structfield>type</structfield></entry>
|
||||
<entry>Type of the event.</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>__u32</entry>
|
||||
<entry><structfield>reserved</structfield>[7]</entry>
|
||||
<entry>Reserved for future extensions. Drivers and applications
|
||||
must set the array to zero.</entry>
|
||||
</row>
|
||||
</tbody>
|
||||
</tgroup>
|
||||
</table>
|
||||
|
||||
<table frame="none" pgwide="1" id="event-type">
|
||||
<title>Event Types</title>
|
||||
<tgroup cols="3">
|
||||
&cs-def;
|
||||
<tbody valign="top">
|
||||
<row>
|
||||
<entry><constant>V4L2_EVENT_ALL</constant></entry>
|
||||
<entry>0</entry>
|
||||
<entry>All events. V4L2_EVENT_ALL is valid only for
|
||||
VIDIOC_UNSUBSCRIBE_EVENT for unsubscribing all events at once.
|
||||
</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry><constant>V4L2_EVENT_VSYNC</constant></entry>
|
||||
<entry>1</entry>
|
||||
<entry>This event is triggered on the vertical sync.
|
||||
This event has &v4l2-event-vsync; associated with it.
|
||||
</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry><constant>V4L2_EVENT_EOS</constant></entry>
|
||||
<entry>2</entry>
|
||||
<entry>This event is triggered when the end of a stream is reached.
|
||||
This is typically used with MPEG decoders to report to the application
|
||||
when the last of the MPEG stream has been decoded.
|
||||
</entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry><constant>V4L2_EVENT_PRIVATE_START</constant></entry>
|
||||
<entry>0x08000000</entry>
|
||||
<entry>Base event number for driver-private events.</entry>
|
||||
</row>
|
||||
</tbody>
|
||||
</tgroup>
|
||||
</table>
|
||||
|
||||
<table frame="none" pgwide="1" id="v4l2-event-vsync">
|
||||
<title>struct <structname>v4l2_event_vsync</structname></title>
|
||||
<tgroup cols="3">
|
||||
&cs-str;
|
||||
<tbody valign="top">
|
||||
<row>
|
||||
<entry>__u8</entry>
|
||||
<entry><structfield>field</structfield></entry>
|
||||
<entry>The upcoming field. See &v4l2-field;.</entry>
|
||||
</row>
|
||||
</tbody>
|
||||
</tgroup>
|
||||
</table>
|
||||
|
||||
</refsect1>
|
||||
</refentry>
|
||||
<!--
|
||||
Local Variables:
|
||||
mode: sgml
|
||||
sgml-parent-document: "v4l2.sgml"
|
||||
indent-tabs-mode: nil
|
||||
End:
|
||||
-->
|
@ -5518,34 +5518,41 @@ struct _snd_pcm_runtime {
|
||||
]]>
|
||||
</programlisting>
|
||||
</informalexample>
|
||||
|
||||
For the raw data, <structfield>size</structfield> field must be
|
||||
set properly. This specifies the maximum size of the proc file access.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
The callback is much more complicated than the text-file
|
||||
version. You need to use a low-level I/O functions such as
|
||||
The read/write callbacks of raw mode are more direct than the text mode.
|
||||
You need to use a low-level I/O functions such as
|
||||
<function>copy_from/to_user()</function> to transfer the
|
||||
data.
|
||||
|
||||
<informalexample>
|
||||
<programlisting>
|
||||
<![CDATA[
|
||||
static long my_file_io_read(struct snd_info_entry *entry,
|
||||
static ssize_t my_file_io_read(struct snd_info_entry *entry,
|
||||
void *file_private_data,
|
||||
struct file *file,
|
||||
char *buf,
|
||||
unsigned long count,
|
||||
unsigned long pos)
|
||||
size_t count,
|
||||
loff_t pos)
|
||||
{
|
||||
long size = count;
|
||||
if (pos + size > local_max_size)
|
||||
size = local_max_size - pos;
|
||||
if (copy_to_user(buf, local_data + pos, size))
|
||||
if (copy_to_user(buf, local_data + pos, count))
|
||||
return -EFAULT;
|
||||
return size;
|
||||
return count;
|
||||
}
|
||||
]]>
|
||||
</programlisting>
|
||||
</informalexample>
|
||||
|
||||
If the size of the info entry has been set up properly,
|
||||
<structfield>count</structfield> and <structfield>pos</structfield> are
|
||||
guaranteed to fit within 0 and the given size.
|
||||
You don't have to check the range in the callbacks unless any
|
||||
other condition is required.
|
||||
|
||||
</para>
|
||||
|
||||
</chapter>
|
||||
|
@ -342,7 +342,7 @@ static inline void skel_delete (struct usb_skel *dev)
|
||||
{
|
||||
kfree (dev->bulk_in_buffer);
|
||||
if (dev->bulk_out_buffer != NULL)
|
||||
usb_buffer_free (dev->udev, dev->bulk_out_size,
|
||||
usb_free_coherent (dev->udev, dev->bulk_out_size,
|
||||
dev->bulk_out_buffer,
|
||||
dev->write_urb->transfer_dma);
|
||||
usb_free_urb (dev->write_urb);
|
||||
|
@ -216,7 +216,7 @@ The driver should return one of the following result codes:
|
||||
|
||||
- PCI_ERS_RESULT_NEED_RESET
|
||||
Driver returns this if it thinks the device is not
|
||||
recoverable in it's current state and it needs a slot
|
||||
recoverable in its current state and it needs a slot
|
||||
reset to proceed.
|
||||
|
||||
- PCI_ERS_RESULT_DISCONNECT
|
||||
@ -241,7 +241,7 @@ in working condition.
|
||||
|
||||
The driver is not supposed to restart normal driver I/O operations
|
||||
at this point. It should limit itself to "probing" the device to
|
||||
check it's recoverability status. If all is right, then the platform
|
||||
check its recoverability status. If all is right, then the platform
|
||||
will call resume() once all drivers have ack'd link_reset().
|
||||
|
||||
Result codes:
|
||||
|
@ -13,7 +13,7 @@ Reporting (AER) driver and provides information on how to use it, as
|
||||
well as how to enable the drivers of endpoint devices to conform with
|
||||
PCI Express AER driver.
|
||||
|
||||
1.2 Copyright © Intel Corporation 2006.
|
||||
1.2 Copyright (C) Intel Corporation 2006.
|
||||
|
||||
1.3 What is the PCI Express AER Driver?
|
||||
|
||||
@ -71,15 +71,11 @@ console. If it's a correctable error, it is outputed as a warning.
|
||||
Otherwise, it is printed as an error. So users could choose different
|
||||
log level to filter out correctable error messages.
|
||||
|
||||
Below shows an example.
|
||||
+------ PCI-Express Device Error -----+
|
||||
Error Severity : Uncorrected (Fatal)
|
||||
PCIE Bus Error type : Transaction Layer
|
||||
Unsupported Request : First
|
||||
Requester ID : 0500
|
||||
VendorID=8086h, DeviceID=0329h, Bus=05h, Device=00h, Function=00h
|
||||
TLB Header:
|
||||
04000001 00200a03 05010000 00050100
|
||||
Below shows an example:
|
||||
0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0500(Requester ID)
|
||||
0000:50:00.0: device [8086:0329] error status/mask=00100000/00000000
|
||||
0000:50:00.0: [20] Unsupported Request (First)
|
||||
0000:50:00.0: TLP Header: 04000001 00200a03 05010000 00050100
|
||||
|
||||
In the example, 'Requester ID' means the ID of the device who sends
|
||||
the error message to root port. Pls. refer to pci express specs for
|
||||
@ -112,7 +108,7 @@ but the PCI Express link itself is fully functional. Fatal errors, on
|
||||
the other hand, cause the link to be unreliable.
|
||||
|
||||
When AER is enabled, a PCI Express device will automatically send an
|
||||
error message to the PCIE root port above it when the device captures
|
||||
error message to the PCIe root port above it when the device captures
|
||||
an error. The Root Port, upon receiving an error reporting message,
|
||||
internally processes and logs the error message in its PCI Express
|
||||
capability structure. Error information being logged includes storing
|
||||
@ -198,8 +194,9 @@ to reset link, AER port service driver is required to provide the
|
||||
function to reset link. Firstly, kernel looks for if the upstream
|
||||
component has an aer driver. If it has, kernel uses the reset_link
|
||||
callback of the aer driver. If the upstream component has no aer driver
|
||||
and the port is downstream port, we will use the aer driver of the
|
||||
root port who reports the AER error. As for upstream ports,
|
||||
and the port is downstream port, we will perform a hot reset as the
|
||||
default by setting the Secondary Bus Reset bit of the Bridge Control
|
||||
register associated with the downstream port. As for upstream ports,
|
||||
they should provide their own aer service drivers with reset_link
|
||||
function. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER and
|
||||
reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes
|
||||
@ -253,11 +250,11 @@ cleanup uncorrectable status register. Pls. refer to section 3.3.
|
||||
|
||||
4. Software error injection
|
||||
|
||||
Debugging PCIE AER error recovery code is quite difficult because it
|
||||
Debugging PCIe AER error recovery code is quite difficult because it
|
||||
is hard to trigger real hardware errors. Software based error
|
||||
injection can be used to fake various kinds of PCIE errors.
|
||||
injection can be used to fake various kinds of PCIe errors.
|
||||
|
||||
First you should enable PCIE AER software error injection in kernel
|
||||
First you should enable PCIe AER software error injection in kernel
|
||||
configuration, that is, following item should be in your .config.
|
||||
|
||||
CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJECT=m
|
||||
|
@ -3,35 +3,79 @@ Using RCU's CPU Stall Detector
|
||||
The CONFIG_RCU_CPU_STALL_DETECTOR kernel config parameter enables
|
||||
RCU's CPU stall detector, which detects conditions that unduly delay
|
||||
RCU grace periods. The stall detector's idea of what constitutes
|
||||
"unduly delayed" is controlled by a pair of C preprocessor macros:
|
||||
"unduly delayed" is controlled by a set of C preprocessor macros:
|
||||
|
||||
RCU_SECONDS_TILL_STALL_CHECK
|
||||
|
||||
This macro defines the period of time that RCU will wait from
|
||||
the beginning of a grace period until it issues an RCU CPU
|
||||
stall warning. It is normally ten seconds.
|
||||
stall warning. This time period is normally ten seconds.
|
||||
|
||||
RCU_SECONDS_TILL_STALL_RECHECK
|
||||
|
||||
This macro defines the period of time that RCU will wait after
|
||||
issuing a stall warning until it issues another stall warning.
|
||||
It is normally set to thirty seconds.
|
||||
issuing a stall warning until it issues another stall warning
|
||||
for the same stall. This time period is normally set to thirty
|
||||
seconds.
|
||||
|
||||
RCU_STALL_RAT_DELAY
|
||||
|
||||
The CPU stall detector tries to make the offending CPU rat on itself,
|
||||
as this often gives better-quality stack traces. However, if
|
||||
the offending CPU does not detect its own stall in the number
|
||||
of jiffies specified by RCU_STALL_RAT_DELAY, then other CPUs will
|
||||
complain. This is normally set to two jiffies.
|
||||
The CPU stall detector tries to make the offending CPU print its
|
||||
own warnings, as this often gives better-quality stack traces.
|
||||
However, if the offending CPU does not detect its own stall in
|
||||
the number of jiffies specified by RCU_STALL_RAT_DELAY, then
|
||||
some other CPU will complain. This delay is normally set to
|
||||
two jiffies.
|
||||
|
||||
The following problems can result in an RCU CPU stall warning:
|
||||
When a CPU detects that it is stalling, it will print a message similar
|
||||
to the following:
|
||||
|
||||
INFO: rcu_sched_state detected stall on CPU 5 (t=2500 jiffies)
|
||||
|
||||
This message indicates that CPU 5 detected that it was causing a stall,
|
||||
and that the stall was affecting RCU-sched. This message will normally be
|
||||
followed by a stack dump of the offending CPU. On TREE_RCU kernel builds,
|
||||
RCU and RCU-sched are implemented by the same underlying mechanism,
|
||||
while on TREE_PREEMPT_RCU kernel builds, RCU is instead implemented
|
||||
by rcu_preempt_state.
|
||||
|
||||
On the other hand, if the offending CPU fails to print out a stall-warning
|
||||
message quickly enough, some other CPU will print a message similar to
|
||||
the following:
|
||||
|
||||
INFO: rcu_bh_state detected stalls on CPUs/tasks: { 3 5 } (detected by 2, 2502 jiffies)
|
||||
|
||||
This message indicates that CPU 2 detected that CPUs 3 and 5 were both
|
||||
causing stalls, and that the stall was affecting RCU-bh. This message
|
||||
will normally be followed by stack dumps for each CPU. Please note that
|
||||
TREE_PREEMPT_RCU builds can be stalled by tasks as well as by CPUs,
|
||||
and that the tasks will be indicated by PID, for example, "P3421".
|
||||
It is even possible for a rcu_preempt_state stall to be caused by both
|
||||
CPUs -and- tasks, in which case the offending CPUs and tasks will all
|
||||
be called out in the list.
|
||||
|
||||
Finally, if the grace period ends just as the stall warning starts
|
||||
printing, there will be a spurious stall-warning message:
|
||||
|
||||
INFO: rcu_bh_state detected stalls on CPUs/tasks: { } (detected by 4, 2502 jiffies)
|
||||
|
||||
This is rare, but does happen from time to time in real life.
|
||||
|
||||
So your kernel printed an RCU CPU stall warning. The next question is
|
||||
"What caused it?" The following problems can result in RCU CPU stall
|
||||
warnings:
|
||||
|
||||
o A CPU looping in an RCU read-side critical section.
|
||||
|
||||
o A CPU looping with interrupts disabled.
|
||||
o A CPU looping with interrupts disabled. This condition can
|
||||
result in RCU-sched and RCU-bh stalls.
|
||||
|
||||
o A CPU looping with preemption disabled.
|
||||
o A CPU looping with preemption disabled. This condition can
|
||||
result in RCU-sched stalls and, if ksoftirqd is in use, RCU-bh
|
||||
stalls.
|
||||
|
||||
o A CPU looping with bottom halves disabled. This condition can
|
||||
result in RCU-sched and RCU-bh stalls.
|
||||
|
||||
o For !CONFIG_PREEMPT kernels, a CPU looping anywhere in the kernel
|
||||
without invoking schedule().
|
||||
@ -39,20 +83,24 @@ o For !CONFIG_PREEMPT kernels, a CPU looping anywhere in the kernel
|
||||
o A bug in the RCU implementation.
|
||||
|
||||
o A hardware failure. This is quite unlikely, but has occurred
|
||||
at least once in a former life. A CPU failed in a running system,
|
||||
at least once in real life. A CPU failed in a running system,
|
||||
becoming unresponsive, but not causing an immediate crash.
|
||||
This resulted in a series of RCU CPU stall warnings, eventually
|
||||
leading the realization that the CPU had failed.
|
||||
|
||||
The RCU, RCU-sched, and RCU-bh implementations have CPU stall warning.
|
||||
SRCU does not do so directly, but its calls to synchronize_sched() will
|
||||
result in RCU-sched detecting any CPU stalls that might be occurring.
|
||||
The RCU, RCU-sched, and RCU-bh implementations have CPU stall
|
||||
warning. SRCU does not have its own CPU stall warnings, but its
|
||||
calls to synchronize_sched() will result in RCU-sched detecting
|
||||
RCU-sched-related CPU stalls. Please note that RCU only detects
|
||||
CPU stalls when there is a grace period in progress. No grace period,
|
||||
no CPU stall warnings.
|
||||
|
||||
To diagnose the cause of the stall, inspect the stack traces. The offending
|
||||
function will usually be near the top of the stack. If you have a series
|
||||
of stall warnings from a single extended stall, comparing the stack traces
|
||||
can often help determine where the stall is occurring, which will usually
|
||||
be in the function nearest the top of the stack that stays the same from
|
||||
trace to trace.
|
||||
To diagnose the cause of the stall, inspect the stack traces.
|
||||
The offending function will usually be near the top of the stack.
|
||||
If you have a series of stall warnings from a single extended stall,
|
||||
comparing the stack traces can often help determine where the stall
|
||||
is occurring, which will usually be in the function nearest the top of
|
||||
that portion of the stack which remains the same from trace to trace.
|
||||
If you can reliably trigger the stall, ftrace can be quite helpful.
|
||||
|
||||
RCU bugs can often be debugged with the help of CONFIG_RCU_TRACE.
|
||||
|
@ -182,16 +182,6 @@ Similarly, sched_expedited RCU provides the following:
|
||||
sched_expedited-torture: Reader Pipe: 12660320201 95875 0 0 0 0 0 0 0 0 0
|
||||
sched_expedited-torture: Reader Batch: 12660424885 0 0 0 0 0 0 0 0 0 0
|
||||
sched_expedited-torture: Free-Block Circulation: 1090795 1090795 1090794 1090793 1090792 1090791 1090790 1090789 1090788 1090787 0
|
||||
state: -1 / 0:0 3:0 4:0
|
||||
|
||||
As before, the first four lines are similar to those for RCU.
|
||||
The last line shows the task-migration state. The first number is
|
||||
-1 if synchronize_sched_expedited() is idle, -2 if in the process of
|
||||
posting wakeups to the migration kthreads, and N when waiting on CPU N.
|
||||
Each of the colon-separated fields following the "/" is a CPU:state pair.
|
||||
Valid states are "0" for idle, "1" for waiting for quiescent state,
|
||||
"2" for passed through quiescent state, and "3" when a race with a
|
||||
CPU-hotplug event forces use of the synchronize_sched() primitive.
|
||||
|
||||
|
||||
USAGE
|
||||
|
@ -256,23 +256,23 @@ o Each element of the form "1/1 0:127 ^0" represents one struct
|
||||
The output of "cat rcu/rcu_pending" looks as follows:
|
||||
|
||||
rcu_sched:
|
||||
0 np=255892 qsp=53936 cbr=0 cng=14417 gpc=10033 gps=24320 nf=6445 nn=146741
|
||||
1 np=261224 qsp=54638 cbr=0 cng=25723 gpc=16310 gps=2849 nf=5912 nn=155792
|
||||
2 np=237496 qsp=49664 cbr=0 cng=2762 gpc=45478 gps=1762 nf=1201 nn=136629
|
||||
3 np=236249 qsp=48766 cbr=0 cng=286 gpc=48049 gps=1218 nf=207 nn=137723
|
||||
4 np=221310 qsp=46850 cbr=0 cng=26 gpc=43161 gps=4634 nf=3529 nn=123110
|
||||
5 np=237332 qsp=48449 cbr=0 cng=54 gpc=47920 gps=3252 nf=201 nn=137456
|
||||
6 np=219995 qsp=46718 cbr=0 cng=50 gpc=42098 gps=6093 nf=4202 nn=120834
|
||||
7 np=249893 qsp=49390 cbr=0 cng=72 gpc=38400 gps=17102 nf=41 nn=144888
|
||||
0 np=255892 qsp=53936 rpq=85 cbr=0 cng=14417 gpc=10033 gps=24320 nf=6445 nn=146741
|
||||
1 np=261224 qsp=54638 rpq=33 cbr=0 cng=25723 gpc=16310 gps=2849 nf=5912 nn=155792
|
||||
2 np=237496 qsp=49664 rpq=23 cbr=0 cng=2762 gpc=45478 gps=1762 nf=1201 nn=136629
|
||||
3 np=236249 qsp=48766 rpq=98 cbr=0 cng=286 gpc=48049 gps=1218 nf=207 nn=137723
|
||||
4 np=221310 qsp=46850 rpq=7 cbr=0 cng=26 gpc=43161 gps=4634 nf=3529 nn=123110
|
||||
5 np=237332 qsp=48449 rpq=9 cbr=0 cng=54 gpc=47920 gps=3252 nf=201 nn=137456
|
||||
6 np=219995 qsp=46718 rpq=12 cbr=0 cng=50 gpc=42098 gps=6093 nf=4202 nn=120834
|
||||
7 np=249893 qsp=49390 rpq=42 cbr=0 cng=72 gpc=38400 gps=17102 nf=41 nn=144888
|
||||
rcu_bh:
|
||||
0 np=146741 qsp=1419 cbr=0 cng=6 gpc=0 gps=0 nf=2 nn=145314
|
||||
1 np=155792 qsp=12597 cbr=0 cng=0 gpc=4 gps=8 nf=3 nn=143180
|
||||
2 np=136629 qsp=18680 cbr=0 cng=0 gpc=7 gps=6 nf=0 nn=117936
|
||||
3 np=137723 qsp=2843 cbr=0 cng=0 gpc=10 gps=7 nf=0 nn=134863
|
||||
4 np=123110 qsp=12433 cbr=0 cng=0 gpc=4 gps=2 nf=0 nn=110671
|
||||
5 np=137456 qsp=4210 cbr=0 cng=0 gpc=6 gps=5 nf=0 nn=133235
|
||||
6 np=120834 qsp=9902 cbr=0 cng=0 gpc=6 gps=3 nf=2 nn=110921
|
||||
7 np=144888 qsp=26336 cbr=0 cng=0 gpc=8 gps=2 nf=0 nn=118542
|
||||
0 np=146741 qsp=1419 rpq=6 cbr=0 cng=6 gpc=0 gps=0 nf=2 nn=145314
|
||||
1 np=155792 qsp=12597 rpq=3 cbr=0 cng=0 gpc=4 gps=8 nf=3 nn=143180
|
||||
2 np=136629 qsp=18680 rpq=1 cbr=0 cng=0 gpc=7 gps=6 nf=0 nn=117936
|
||||
3 np=137723 qsp=2843 rpq=0 cbr=0 cng=0 gpc=10 gps=7 nf=0 nn=134863
|
||||
4 np=123110 qsp=12433 rpq=0 cbr=0 cng=0 gpc=4 gps=2 nf=0 nn=110671
|
||||
5 np=137456 qsp=4210 rpq=1 cbr=0 cng=0 gpc=6 gps=5 nf=0 nn=133235
|
||||
6 np=120834 qsp=9902 rpq=2 cbr=0 cng=0 gpc=6 gps=3 nf=2 nn=110921
|
||||
7 np=144888 qsp=26336 rpq=0 cbr=0 cng=0 gpc=8 gps=2 nf=0 nn=118542
|
||||
|
||||
As always, this is once again split into "rcu_sched" and "rcu_bh"
|
||||
portions, with CONFIG_TREE_PREEMPT_RCU kernels having an additional
|
||||
@ -284,6 +284,9 @@ o "np" is the number of times that __rcu_pending() has been invoked
|
||||
o "qsp" is the number of times that the RCU was waiting for a
|
||||
quiescent state from this CPU.
|
||||
|
||||
o "rpq" is the number of times that the CPU had passed through
|
||||
a quiescent state, but not yet reported it to RCU.
|
||||
|
||||
o "cbr" is the number of times that this CPU had RCU callbacks
|
||||
that had passed through a grace period, and were thus ready
|
||||
to be invoked.
|
||||
|
@ -73,7 +73,7 @@ NOTE: Smack labels are limited to 23 characters. The attr command
|
||||
If you don't do anything special all users will get the floor ("_")
|
||||
label when they log in. If you do want to log in via the hacked ssh
|
||||
at other labels use the attr command to set the smack value on the
|
||||
home directory and it's contents.
|
||||
home directory and its contents.
|
||||
|
||||
You can add access rules in /etc/smack/accesses. They take the form:
|
||||
|
||||
|
@ -18,6 +18,8 @@ kernel patches.
|
||||
|
||||
2b: Passes allnoconfig, allmodconfig
|
||||
|
||||
2c: Builds successfully when using O=builddir
|
||||
|
||||
3: Builds on multiple CPU architectures by using local cross-compile tools
|
||||
or some other build farm.
|
||||
|
||||
@ -95,3 +97,13 @@ kernel patches.
|
||||
|
||||
25: If any ioctl's are added by the patch, then also update
|
||||
Documentation/ioctl/ioctl-number.txt.
|
||||
|
||||
26: If your modified source code depends on or uses any of the kernel
|
||||
APIs or features that are related to the following kconfig symbols,
|
||||
then test multiple builds with the related kconfig symbols disabled
|
||||
and/or =m (if that option is available) [not all of these at the
|
||||
same time, just various/random combinations of them]:
|
||||
|
||||
CONFIG_SMP, CONFIG_SYSFS, CONFIG_PROC_FS, CONFIG_INPUT, CONFIG_PCI,
|
||||
CONFIG_BLOCK, CONFIG_PM, CONFIG_HOTPLUG, CONFIG_MAGIC_SYSRQ,
|
||||
CONFIG_NET, CONFIG_INET=n (but latter with CONFIG_NET=y)
|
||||
|
@ -130,6 +130,8 @@ Linux kernel master tree:
|
||||
ftp.??.kernel.org:/pub/linux/kernel/...
|
||||
?? == your country code, such as "us", "uk", "fr", etc.
|
||||
|
||||
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git
|
||||
|
||||
Linux kernel mailing list:
|
||||
linux-kernel@vger.kernel.org
|
||||
[mail majordomo@vger.kernel.org to subscribe]
|
||||
@ -160,3 +162,6 @@ How to NOT write kernel driver by Arjan van de Ven:
|
||||
|
||||
Kernel Janitor:
|
||||
http://janitor.kernelnewbies.org/
|
||||
|
||||
GIT, Fast Version Control System:
|
||||
http://git-scm.com/
|
||||
|
59
Documentation/acpi/apei/einj.txt
Normal file
59
Documentation/acpi/apei/einj.txt
Normal file
@ -0,0 +1,59 @@
|
||||
APEI Error INJection
|
||||
~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
EINJ provides a hardware error injection mechanism
|
||||
It is very useful for debugging and testing of other APEI and RAS features.
|
||||
|
||||
To use EINJ, make sure the following are enabled in your kernel
|
||||
configuration:
|
||||
|
||||
CONFIG_DEBUG_FS
|
||||
CONFIG_ACPI_APEI
|
||||
CONFIG_ACPI_APEI_EINJ
|
||||
|
||||
The user interface of EINJ is debug file system, under the
|
||||
directory apei/einj. The following files are provided.
|
||||
|
||||
- available_error_type
|
||||
Reading this file returns the error injection capability of the
|
||||
platform, that is, which error types are supported. The error type
|
||||
definition is as follow, the left field is the error type value, the
|
||||
right field is error description.
|
||||
|
||||
0x00000001 Processor Correctable
|
||||
0x00000002 Processor Uncorrectable non-fatal
|
||||
0x00000004 Processor Uncorrectable fatal
|
||||
0x00000008 Memory Correctable
|
||||
0x00000010 Memory Uncorrectable non-fatal
|
||||
0x00000020 Memory Uncorrectable fatal
|
||||
0x00000040 PCI Express Correctable
|
||||
0x00000080 PCI Express Uncorrectable fatal
|
||||
0x00000100 PCI Express Uncorrectable non-fatal
|
||||
0x00000200 Platform Correctable
|
||||
0x00000400 Platform Uncorrectable non-fatal
|
||||
0x00000800 Platform Uncorrectable fatal
|
||||
|
||||
The format of file contents are as above, except there are only the
|
||||
available error type lines.
|
||||
|
||||
- error_type
|
||||
This file is used to set the error type value. The error type value
|
||||
is defined in "available_error_type" description.
|
||||
|
||||
- error_inject
|
||||
Write any integer to this file to trigger the error
|
||||
injection. Before this, please specify all necessary error
|
||||
parameters.
|
||||
|
||||
- param1
|
||||
This file is used to set the first error parameter value. Effect of
|
||||
parameter depends on error_type specified. For memory error, this is
|
||||
physical memory address.
|
||||
|
||||
- param2
|
||||
This file is used to set the second error parameter value. Effect of
|
||||
parameter depends on error_type specified. For memory error, this is
|
||||
physical memory address mask.
|
||||
|
||||
For more information about EINJ, please refer to ACPI specification
|
||||
version 4.0, section 17.5.
|
@ -20,6 +20,8 @@ Samsung-S3C24XX
|
||||
- S3C24XX ARM Linux Overview
|
||||
Sharp-LH
|
||||
- Linux on Sharp LH79524 and LH7A40X System On a Chip (SOC)
|
||||
SPEAr
|
||||
- ST SPEAr platform Linux Overview
|
||||
VFP/
|
||||
- Release notes for Linux Kernel Vector Floating Point support code
|
||||
empeg/
|
||||
|
@ -32,7 +32,7 @@ Notes:
|
||||
|
||||
- The flash on board is divided into 3 partitions.
|
||||
You should be careful to use flash on board.
|
||||
It's partition is different from GraphicsClient Plus and GraphicsMaster
|
||||
Its partition is different from GraphicsClient Plus and GraphicsMaster
|
||||
|
||||
- 16bpp mode requires a different cable than what ships with the board.
|
||||
Contact ADS or look through the manual to wire your own. Currently,
|
||||
|
60
Documentation/arm/SPEAr/overview.txt
Normal file
60
Documentation/arm/SPEAr/overview.txt
Normal file
@ -0,0 +1,60 @@
|
||||
SPEAr ARM Linux Overview
|
||||
==========================
|
||||
|
||||
Introduction
|
||||
------------
|
||||
|
||||
SPEAr (Structured Processor Enhanced Architecture).
|
||||
weblink : http://www.st.com/spear
|
||||
|
||||
The ST Microelectronics SPEAr range of ARM9/CortexA9 System-on-Chip CPUs are
|
||||
supported by the 'spear' platform of ARM Linux. Currently SPEAr300,
|
||||
SPEAr310, SPEAr320 and SPEAr600 SOCs are supported. Support for the SPEAr13XX
|
||||
series is in progress.
|
||||
|
||||
Hierarchy in SPEAr is as follows:
|
||||
|
||||
SPEAr (Platform)
|
||||
- SPEAr3XX (3XX SOC series, based on ARM9)
|
||||
- SPEAr300 (SOC)
|
||||
- SPEAr300_EVB (Evaluation Board)
|
||||
- SPEAr310 (SOC)
|
||||
- SPEAr310_EVB (Evaluation Board)
|
||||
- SPEAr320 (SOC)
|
||||
- SPEAr320_EVB (Evaluation Board)
|
||||
- SPEAr6XX (6XX SOC series, based on ARM9)
|
||||
- SPEAr600 (SOC)
|
||||
- SPEAr600_EVB (Evaluation Board)
|
||||
- SPEAr13XX (13XX SOC series, based on ARM CORTEXA9)
|
||||
- SPEAr1300 (SOC)
|
||||
|
||||
Configuration
|
||||
-------------
|
||||
|
||||
A generic configuration is provided for each machine, and can be used as the
|
||||
default by
|
||||
make spear600_defconfig
|
||||
make spear300_defconfig
|
||||
make spear310_defconfig
|
||||
make spear320_defconfig
|
||||
|
||||
Layout
|
||||
------
|
||||
|
||||
The common files for multiple machine families (SPEAr3XX, SPEAr6XX and
|
||||
SPEAr13XX) are located in the platform code contained in arch/arm/plat-spear
|
||||
with headers in plat/.
|
||||
|
||||
Each machine series have a directory with name arch/arm/mach-spear followed by
|
||||
series name. Like mach-spear3xx, mach-spear6xx and mach-spear13xx.
|
||||
|
||||
Common file for machines of spear3xx family is mach-spear3xx/spear3xx.c and for
|
||||
spear6xx is mach-spear6xx/spear6xx.c. mach-spear* also contain soc/machine
|
||||
specific files, like spear300.c, spear310.c, spear320.c and spear600.c.
|
||||
mach-spear* also contains board specific files for each machine type.
|
||||
|
||||
|
||||
Document Author
|
||||
---------------
|
||||
|
||||
Viresh Kumar, (c) 2010 ST Microelectronics
|
@ -12,6 +12,8 @@ Introduction
|
||||
of the s3c2410 GPIO system, please read the Samsung provided
|
||||
data-sheet/users manual to find out the complete list.
|
||||
|
||||
See Documentation/arm/Samsung/GPIO.txt for the core implemetation.
|
||||
|
||||
|
||||
GPIOLIB
|
||||
-------
|
||||
@ -24,8 +26,60 @@ GPIOLIB
|
||||
listed below will be removed (they may be marked as __deprecated
|
||||
in the near future).
|
||||
|
||||
- s3c2410_gpio_getpin
|
||||
- s3c2410_gpio_setpin
|
||||
The following functions now either have a s3c_ specific variant
|
||||
or are merged into gpiolib. See the definitions in
|
||||
arch/arm/plat-samsung/include/plat/gpio-cfg.h:
|
||||
|
||||
s3c2410_gpio_setpin() gpio_set_value() or gpio_direction_output()
|
||||
s3c2410_gpio_getpin() gpio_get_value() or gpio_direction_input()
|
||||
s3c2410_gpio_getirq() gpio_to_irq()
|
||||
s3c2410_gpio_cfgpin() s3c_gpio_cfgpin()
|
||||
s3c2410_gpio_getcfg() s3c_gpio_getcfg()
|
||||
s3c2410_gpio_pullup() s3c_gpio_setpull()
|
||||
|
||||
|
||||
GPIOLIB conversion
|
||||
------------------
|
||||
|
||||
If you need to convert your board or driver to use gpiolib from the exiting
|
||||
s3c2410 api, then here are some notes on the process.
|
||||
|
||||
1) If your board is exclusively using an GPIO, say to control peripheral
|
||||
power, then it will require to claim the gpio with gpio_request() before
|
||||
it can use it.
|
||||
|
||||
It is recommended to check the return value, with at least WARN_ON()
|
||||
during initialisation.
|
||||
|
||||
2) The s3c2410_gpio_cfgpin() can be directly replaced with s3c_gpio_cfgpin()
|
||||
as they have the same arguments, and can either take the pin specific
|
||||
values, or the more generic special-function-number arguments.
|
||||
|
||||
3) s3c2410_gpio_pullup() changs have the problem that whilst the
|
||||
s3c2410_gpio_pullup(x, 1) can be easily translated to the
|
||||
s3c_gpio_setpull(x, S3C_GPIO_PULL_NONE), the s3c2410_gpio_pullup(x, 0)
|
||||
are not so easy.
|
||||
|
||||
The s3c2410_gpio_pullup(x, 0) case enables the pull-up (or in the case
|
||||
of some of the devices, a pull-down) and as such the new API distinguishes
|
||||
between the UP and DOWN case. There is currently no 'just turn on' setting
|
||||
which may be required if this becomes a problem.
|
||||
|
||||
4) s3c2410_gpio_setpin() can be replaced by gpio_set_value(), the old call
|
||||
does not implicitly configure the relevant gpio to output. The gpio
|
||||
direction should be changed before using gpio_set_value().
|
||||
|
||||
5) s3c2410_gpio_getpin() is replaceable by gpio_get_value() if the pin
|
||||
has been set to input. It is currently unknown what the behaviour is
|
||||
when using gpio_get_value() on an output pin (s3c2410_gpio_getpin
|
||||
would return the value the pin is supposed to be outputting).
|
||||
|
||||
6) s3c2410_gpio_getirq() should be directly replacable with the
|
||||
gpio_to_irq() call.
|
||||
|
||||
The s3c2410_gpio and gpio_ calls have always operated on the same gpio
|
||||
numberspace, so there is no problem with converting the gpio numbering
|
||||
between the calls.
|
||||
|
||||
|
||||
Headers
|
||||
@ -54,6 +108,11 @@ PIN Numbers
|
||||
eg S3C2410_GPA(0) or S3C2410_GPF(1). These defines are used to tell
|
||||
the GPIO functions which pin is to be used.
|
||||
|
||||
With the conversion to gpiolib, there is no longer a direct conversion
|
||||
from gpio pin number to register base address as in earlier kernels. This
|
||||
is due to the number space required for newer SoCs where the later
|
||||
GPIOs are not contiguous.
|
||||
|
||||
|
||||
Configuring a pin
|
||||
-----------------
|
||||
@ -71,6 +130,8 @@ Configuring a pin
|
||||
which would turn GPA(0) into the lowest Address line A0, and set
|
||||
GPE(8) to be connected to the SDIO/MMC controller's SDDAT1 line.
|
||||
|
||||
The s3c_gpio_cfgpin() call is a functional replacement for this call.
|
||||
|
||||
|
||||
Reading the current configuration
|
||||
---------------------------------
|
||||
@ -82,6 +143,9 @@ Reading the current configuration
|
||||
The return value will be from the same set of values which can be
|
||||
passed to s3c2410_gpio_cfgpin().
|
||||
|
||||
The s3c_gpio_getcfg() call should be a functional replacement for
|
||||
this call.
|
||||
|
||||
|
||||
Configuring a pull-up resistor
|
||||
------------------------------
|
||||
@ -95,6 +159,10 @@ Configuring a pull-up resistor
|
||||
Where the to value is zero to set the pull-up off, and 1 to enable
|
||||
the specified pull-up. Any other values are currently undefined.
|
||||
|
||||
The s3c_gpio_setpull() offers similar functionality, but with the
|
||||
ability to encode whether the pull is up or down. Currently there
|
||||
is no 'just on' state, so up or down must be selected.
|
||||
|
||||
|
||||
Getting the state of a PIN
|
||||
--------------------------
|
||||
@ -106,6 +174,9 @@ Getting the state of a PIN
|
||||
This will return either zero or non-zero. Do not count on this
|
||||
function returning 1 if the pin is set.
|
||||
|
||||
This call is now implemented by the relevant gpiolib calls, convert
|
||||
your board or driver to use gpiolib.
|
||||
|
||||
|
||||
Setting the state of a PIN
|
||||
--------------------------
|
||||
@ -117,6 +188,9 @@ Setting the state of a PIN
|
||||
Which sets the given pin to the value. Use 0 to write 0, and 1 to
|
||||
set the output to 1.
|
||||
|
||||
This call is now implemented by the relevant gpiolib calls, convert
|
||||
your board or driver to use gpiolib.
|
||||
|
||||
|
||||
Getting the IRQ number associated with a PIN
|
||||
--------------------------------------------
|
||||
@ -128,6 +202,9 @@ Getting the IRQ number associated with a PIN
|
||||
|
||||
Note, not all pins have an IRQ.
|
||||
|
||||
This call is now implemented by the relevant gpiolib calls, convert
|
||||
your board or driver to use gpiolib.
|
||||
|
||||
|
||||
Authour
|
||||
-------
|
||||
|
@ -8,10 +8,16 @@ Introduction
|
||||
|
||||
The Samsung S3C24XX range of ARM9 System-on-Chip CPUs are supported
|
||||
by the 's3c2410' architecture of ARM Linux. Currently the S3C2410,
|
||||
S3C2412, S3C2413, S3C2440, S3C2442 and S3C2443 devices are supported.
|
||||
S3C2412, S3C2413, S3C2416 S3C2440, S3C2442, S3C2443 and S3C2450 devices
|
||||
are supported.
|
||||
|
||||
Support for the S3C2400 and S3C24A0 series are in progress.
|
||||
|
||||
The S3C2416 and S3C2450 devices are very similar and S3C2450 support is
|
||||
included under the arch/arm/mach-s3c2416 directory. Note, whilst core
|
||||
support for these SoCs is in, work on some of the extra peripherals
|
||||
and extra interrupts is still ongoing.
|
||||
|
||||
|
||||
Configuration
|
||||
-------------
|
||||
@ -209,6 +215,13 @@ GPIO
|
||||
Newer kernels carry GPIOLIB, and support is being moved towards
|
||||
this with some of the older support in line to be removed.
|
||||
|
||||
As of v2.6.34, the move towards using gpiolib support is almost
|
||||
complete, and very little of the old calls are left.
|
||||
|
||||
See Documentation/arm/Samsung-S3C24XX/GPIO.txt for the S3C24XX specific
|
||||
support and Documentation/arm/Samsung/GPIO.txt for the core Samsung
|
||||
implementation.
|
||||
|
||||
|
||||
Clock Management
|
||||
----------------
|
||||
|
42
Documentation/arm/Samsung/GPIO.txt
Normal file
42
Documentation/arm/Samsung/GPIO.txt
Normal file
@ -0,0 +1,42 @@
|
||||
Samsung GPIO implementation
|
||||
===========================
|
||||
|
||||
Introduction
|
||||
------------
|
||||
|
||||
This outlines the Samsung GPIO implementation and the architecture
|
||||
specfic calls provided alongisde the drivers/gpio core.
|
||||
|
||||
|
||||
S3C24XX (Legacy)
|
||||
----------------
|
||||
|
||||
See Documentation/arm/Samsung-S3C24XX/GPIO.txt for more information
|
||||
about these devices. Their implementation is being brought into line
|
||||
with the core samsung implementation described in this document.
|
||||
|
||||
|
||||
GPIOLIB integration
|
||||
-------------------
|
||||
|
||||
The gpio implementation uses gpiolib as much as possible, only providing
|
||||
specific calls for the items that require Samsung specific handling, such
|
||||
as pin special-function or pull resistor control.
|
||||
|
||||
GPIO numbering is synchronised between the Samsung and gpiolib system.
|
||||
|
||||
|
||||
PIN configuration
|
||||
-----------------
|
||||
|
||||
Pin configuration is specific to the Samsung architecutre, with each SoC
|
||||
registering the necessary information for the core gpio configuration
|
||||
implementation to configure pins as necessary.
|
||||
|
||||
The s3c_gpio_cfgpin() and s3c_gpio_setpull() provide the means for a
|
||||
driver or machine to change gpio configuration.
|
||||
|
||||
See arch/arm/plat-samsung/include/plat/gpio-cfg.h for more information
|
||||
on these functions.
|
||||
|
||||
|
@ -13,9 +13,10 @@ Introduction
|
||||
|
||||
- S3C24XX: See Documentation/arm/Samsung-S3C24XX/Overview.txt for full list
|
||||
- S3C64XX: S3C6400 and S3C6410
|
||||
- S5PC6440
|
||||
|
||||
S5PC100 and S5PC110 support is currently being merged
|
||||
- S5P6440
|
||||
- S5P6442
|
||||
- S5PC100
|
||||
- S5PC110 / S5PV210
|
||||
|
||||
|
||||
S3C24XX Systems
|
||||
@ -35,7 +36,10 @@ Configuration
|
||||
unifying all the SoCs into one kernel.
|
||||
|
||||
s5p6440_defconfig - S5P6440 specific default configuration
|
||||
s5p6442_defconfig - S5P6442 specific default configuration
|
||||
s5pc100_defconfig - S5PC100 specific default configuration
|
||||
s5pc110_defconfig - S5PC110 specific default configuration
|
||||
s5pv210_defconfig - S5PV210 specific default configuration
|
||||
|
||||
|
||||
Layout
|
||||
@ -50,18 +54,27 @@ Layout
|
||||
specific information. It contains the base clock, GPIO and device definitions
|
||||
to get the system running.
|
||||
|
||||
plat-s3c is the s3c24xx/s3c64xx platform directory, although it is currently
|
||||
involved in other builds this will be phased out once the relevant code is
|
||||
moved elsewhere.
|
||||
|
||||
plat-s3c24xx is for s3c24xx specific builds, see the S3C24XX docs.
|
||||
|
||||
plat-s3c64xx is for the s3c64xx specific bits, see the S3C24XX docs.
|
||||
|
||||
plat-s5p is for s5p specific builds, more to be added.
|
||||
plat-s5p is for s5p specific builds, and contains common support for the
|
||||
S5P specific systems. Not all S5Ps use all the features in this directory
|
||||
due to differences in the hardware.
|
||||
|
||||
|
||||
Layout changes
|
||||
--------------
|
||||
|
||||
The old plat-s3c and plat-s5pc1xx directories have been removed, with
|
||||
support moved to either plat-samsung or plat-s5p as necessary. These moves
|
||||
where to simplify the include and dependency issues involved with having
|
||||
so many different platform directories.
|
||||
|
||||
It was decided to remove plat-s5pc1xx as some of the support was already
|
||||
in plat-s5p or plat-samsung, with the S5PC110 support added with S5PV210
|
||||
the only user was the S5PC100. The S5PC100 specific items where moved to
|
||||
arch/arm/mach-s5pc100.
|
||||
|
||||
|
||||
[ to finish ]
|
||||
|
||||
|
||||
Port Contributors
|
||||
|
@ -7,7 +7,7 @@ The driver only implements a four-wire touch panel protocol.
|
||||
|
||||
The touchscreen driver is maintenance free except for the pen-down or
|
||||
touch threshold. Some resistive displays and board combinations may
|
||||
require tuning of this threshold. The driver exposes some of it's
|
||||
require tuning of this threshold. The driver exposes some of its
|
||||
internal state in the sys filesystem. If the kernel is configured
|
||||
with it, CONFIG_SYSFS, and sysfs is mounted at /sys, there will be a
|
||||
directory
|
||||
|
@ -320,7 +320,7 @@ counter decrement would not become globally visible until the
|
||||
obj->active update does.
|
||||
|
||||
As a historical note, 32-bit Sparc used to only allow usage of
|
||||
24-bits of it's atomic_t type. This was because it used 8 bits
|
||||
24-bits of its atomic_t type. This was because it used 8 bits
|
||||
as a spinlock for SMP safety. Sparc32 lacked a "compare and swap"
|
||||
type instruction. However, 32-bit Sparc has since been moved over
|
||||
to a "hash table of spinlocks" scheme, that allows the full 32-bit
|
||||
|
@ -43,7 +43,7 @@
|
||||
void bfin_gpio_irq_free(unsigned gpio);
|
||||
|
||||
The request functions will record the function state for a certain pin,
|
||||
the free functions will clear it's function state.
|
||||
the free functions will clear its function state.
|
||||
Once a pin is requested, it can't be requested again before it is freed by
|
||||
previous caller, otherwise kernel will dump stacks, and the request
|
||||
function fail.
|
||||
|
@ -5,7 +5,7 @@
|
||||
|
||||
This document describes the cache/tlb flushing interfaces called
|
||||
by the Linux VM subsystem. It enumerates over each interface,
|
||||
describes it's intended purpose, and what side effect is expected
|
||||
describes its intended purpose, and what side effect is expected
|
||||
after the interface is invoked.
|
||||
|
||||
The side effects described below are stated for a uniprocessor
|
||||
@ -231,7 +231,7 @@ require a whole different set of interfaces to handle properly.
|
||||
The biggest problem is that of virtual aliasing in the data cache
|
||||
of a processor.
|
||||
|
||||
Is your port susceptible to virtual aliasing in it's D-cache?
|
||||
Is your port susceptible to virtual aliasing in its D-cache?
|
||||
Well, if your D-cache is virtually indexed, is larger in size than
|
||||
PAGE_SIZE, and does not prevent multiple cache lines for the same
|
||||
physical address from existing at once, you have this problem.
|
||||
@ -249,7 +249,7 @@ one way to solve this (in particular SPARC_FLAG_MMAPSHARED).
|
||||
Next, you have to solve the D-cache aliasing issue for all
|
||||
other cases. Please keep in mind that fact that, for a given page
|
||||
mapped into some user address space, there is always at least one more
|
||||
mapping, that of the kernel in it's linear mapping starting at
|
||||
mapping, that of the kernel in its linear mapping starting at
|
||||
PAGE_OFFSET. So immediately, once the first user maps a given
|
||||
physical page into its address space, by implication the D-cache
|
||||
aliasing problem has the potential to exist since the kernel already
|
||||
|
@ -17,6 +17,9 @@ HOWTO
|
||||
You can do a very simple testing of running two dd threads in two different
|
||||
cgroups. Here is what you can do.
|
||||
|
||||
- Enable Block IO controller
|
||||
CONFIG_BLK_CGROUP=y
|
||||
|
||||
- Enable group scheduling in CFQ
|
||||
CONFIG_CFQ_GROUP_IOSCHED=y
|
||||
|
||||
@ -54,32 +57,52 @@ cgroups. Here is what you can do.
|
||||
|
||||
Various user visible config options
|
||||
===================================
|
||||
CONFIG_BLK_CGROUP
|
||||
- Block IO controller.
|
||||
|
||||
CONFIG_DEBUG_BLK_CGROUP
|
||||
- Debug help. Right now some additional stats file show up in cgroup
|
||||
if this option is enabled.
|
||||
|
||||
CONFIG_CFQ_GROUP_IOSCHED
|
||||
- Enables group scheduling in CFQ. Currently only 1 level of group
|
||||
creation is allowed.
|
||||
|
||||
CONFIG_DEBUG_CFQ_IOSCHED
|
||||
- Enables some debugging messages in blktrace. Also creates extra
|
||||
cgroup file blkio.dequeue.
|
||||
|
||||
Config options selected automatically
|
||||
=====================================
|
||||
These config options are not user visible and are selected/deselected
|
||||
automatically based on IO scheduler configuration.
|
||||
|
||||
CONFIG_BLK_CGROUP
|
||||
- Block IO controller. Selected by CONFIG_CFQ_GROUP_IOSCHED.
|
||||
|
||||
CONFIG_DEBUG_BLK_CGROUP
|
||||
- Debug help. Selected by CONFIG_DEBUG_CFQ_IOSCHED.
|
||||
|
||||
Details of cgroup files
|
||||
=======================
|
||||
- blkio.weight
|
||||
- Specifies per cgroup weight.
|
||||
|
||||
- Specifies per cgroup weight. This is default weight of the group
|
||||
on all the devices until and unless overridden by per device rule.
|
||||
(See blkio.weight_device).
|
||||
Currently allowed range of weights is from 100 to 1000.
|
||||
|
||||
- blkio.weight_device
|
||||
- One can specify per cgroup per device rules using this interface.
|
||||
These rules override the default value of group weight as specified
|
||||
by blkio.weight.
|
||||
|
||||
Following is the format.
|
||||
|
||||
#echo dev_maj:dev_minor weight > /path/to/cgroup/blkio.weight_device
|
||||
Configure weight=300 on /dev/sdb (8:16) in this cgroup
|
||||
# echo 8:16 300 > blkio.weight_device
|
||||
# cat blkio.weight_device
|
||||
dev weight
|
||||
8:16 300
|
||||
|
||||
Configure weight=500 on /dev/sda (8:0) in this cgroup
|
||||
# echo 8:0 500 > blkio.weight_device
|
||||
# cat blkio.weight_device
|
||||
dev weight
|
||||
8:0 500
|
||||
8:16 300
|
||||
|
||||
Remove specific weight for /dev/sda in this cgroup
|
||||
# echo 8:0 0 > blkio.weight_device
|
||||
# cat blkio.weight_device
|
||||
dev weight
|
||||
8:16 300
|
||||
|
||||
- blkio.time
|
||||
- disk time allocated to cgroup per device in milliseconds. First
|
||||
two fields specify the major and minor number of the device and
|
||||
@ -92,13 +115,105 @@ Details of cgroup files
|
||||
third field specifies the number of sectors transferred by the
|
||||
group to/from the device.
|
||||
|
||||
- blkio.io_service_bytes
|
||||
- Number of bytes transferred to/from the disk by the group. These
|
||||
are further divided by the type of operation - read or write, sync
|
||||
or async. First two fields specify the major and minor number of the
|
||||
device, third field specifies the operation type and the fourth field
|
||||
specifies the number of bytes.
|
||||
|
||||
- blkio.io_serviced
|
||||
- Number of IOs completed to/from the disk by the group. These
|
||||
are further divided by the type of operation - read or write, sync
|
||||
or async. First two fields specify the major and minor number of the
|
||||
device, third field specifies the operation type and the fourth field
|
||||
specifies the number of IOs.
|
||||
|
||||
- blkio.io_service_time
|
||||
- Total amount of time between request dispatch and request completion
|
||||
for the IOs done by this cgroup. This is in nanoseconds to make it
|
||||
meaningful for flash devices too. For devices with queue depth of 1,
|
||||
this time represents the actual service time. When queue_depth > 1,
|
||||
that is no longer true as requests may be served out of order. This
|
||||
may cause the service time for a given IO to include the service time
|
||||
of multiple IOs when served out of order which may result in total
|
||||
io_service_time > actual time elapsed. This time is further divided by
|
||||
the type of operation - read or write, sync or async. First two fields
|
||||
specify the major and minor number of the device, third field
|
||||
specifies the operation type and the fourth field specifies the
|
||||
io_service_time in ns.
|
||||
|
||||
- blkio.io_wait_time
|
||||
- Total amount of time the IOs for this cgroup spent waiting in the
|
||||
scheduler queues for service. This can be greater than the total time
|
||||
elapsed since it is cumulative io_wait_time for all IOs. It is not a
|
||||
measure of total time the cgroup spent waiting but rather a measure of
|
||||
the wait_time for its individual IOs. For devices with queue_depth > 1
|
||||
this metric does not include the time spent waiting for service once
|
||||
the IO is dispatched to the device but till it actually gets serviced
|
||||
(there might be a time lag here due to re-ordering of requests by the
|
||||
device). This is in nanoseconds to make it meaningful for flash
|
||||
devices too. This time is further divided by the type of operation -
|
||||
read or write, sync or async. First two fields specify the major and
|
||||
minor number of the device, third field specifies the operation type
|
||||
and the fourth field specifies the io_wait_time in ns.
|
||||
|
||||
- blkio.io_merged
|
||||
- Total number of bios/requests merged into requests belonging to this
|
||||
cgroup. This is further divided by the type of operation - read or
|
||||
write, sync or async.
|
||||
|
||||
- blkio.io_queued
|
||||
- Total number of requests queued up at any given instant for this
|
||||
cgroup. This is further divided by the type of operation - read or
|
||||
write, sync or async.
|
||||
|
||||
- blkio.avg_queue_size
|
||||
- Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
|
||||
The average queue size for this cgroup over the entire time of this
|
||||
cgroup's existence. Queue size samples are taken each time one of the
|
||||
queues of this cgroup gets a timeslice.
|
||||
|
||||
- blkio.group_wait_time
|
||||
- Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
|
||||
This is the amount of time the cgroup had to wait since it became busy
|
||||
(i.e., went from 0 to 1 request queued) to get a timeslice for one of
|
||||
its queues. This is different from the io_wait_time which is the
|
||||
cumulative total of the amount of time spent by each IO in that cgroup
|
||||
waiting in the scheduler queue. This is in nanoseconds. If this is
|
||||
read when the cgroup is in a waiting (for timeslice) state, the stat
|
||||
will only report the group_wait_time accumulated till the last time it
|
||||
got a timeslice and will not include the current delta.
|
||||
|
||||
- blkio.empty_time
|
||||
- Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
|
||||
This is the amount of time a cgroup spends without any pending
|
||||
requests when not being served, i.e., it does not include any time
|
||||
spent idling for one of the queues of the cgroup. This is in
|
||||
nanoseconds. If this is read when the cgroup is in an empty state,
|
||||
the stat will only report the empty_time accumulated till the last
|
||||
time it had a pending request and will not include the current delta.
|
||||
|
||||
- blkio.idle_time
|
||||
- Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
|
||||
This is the amount of time spent by the IO scheduler idling for a
|
||||
given cgroup in anticipation of a better request than the exising ones
|
||||
from other queues/cgroups. This is in nanoseconds. If this is read
|
||||
when the cgroup is in an idling state, the stat will only report the
|
||||
idle_time accumulated till the last idle period and will not include
|
||||
the current delta.
|
||||
|
||||
- blkio.dequeue
|
||||
- Debugging aid only enabled if CONFIG_DEBUG_CFQ_IOSCHED=y. This
|
||||
- Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y. This
|
||||
gives the statistics about how many a times a group was dequeued
|
||||
from service tree of the device. First two fields specify the major
|
||||
and minor number of the device and third field specifies the number
|
||||
of times a group was dequeued from a particular device.
|
||||
|
||||
- blkio.reset_stats
|
||||
- Writing an int to this file will result in resetting all the stats
|
||||
for that cgroup.
|
||||
|
||||
CFQ sysfs tunable
|
||||
=================
|
||||
/sys/block/<disk>/queue/iosched/group_isolation
|
||||
|
@ -339,7 +339,7 @@ To mount a cgroup hierarchy with all available subsystems, type:
|
||||
The "xxx" is not interpreted by the cgroup code, but will appear in
|
||||
/proc/mounts so may be any useful identifying string that you like.
|
||||
|
||||
To mount a cgroup hierarchy with just the cpuset and numtasks
|
||||
To mount a cgroup hierarchy with just the cpuset and memory
|
||||
subsystems, type:
|
||||
# mount -t cgroup -o cpuset,memory hier1 /dev/cgroup
|
||||
|
||||
@ -572,7 +572,7 @@ void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
|
||||
|
||||
Called when a task attach operation has failed after can_attach() has succeeded.
|
||||
A subsystem whose can_attach() has some side-effects should provide this
|
||||
function, so that the subsytem can implement a rollback. If not, not necessary.
|
||||
function, so that the subsystem can implement a rollback. If not, not necessary.
|
||||
This will be called only about subsystems whose can_attach() operation have
|
||||
succeeded.
|
||||
|
||||
|
@ -42,7 +42,7 @@ Nodes to a set of tasks. In this document "Memory Node" refers to
|
||||
an on-line node that contains memory.
|
||||
|
||||
Cpusets constrain the CPU and Memory placement of tasks to only
|
||||
the resources within a tasks current cpuset. They form a nested
|
||||
the resources within a task's current cpuset. They form a nested
|
||||
hierarchy visible in a virtual file system. These are the essential
|
||||
hooks, beyond what is already present, required to manage dynamic
|
||||
job placement on large systems.
|
||||
@ -53,11 +53,11 @@ Documentation/cgroups/cgroups.txt.
|
||||
Requests by a task, using the sched_setaffinity(2) system call to
|
||||
include CPUs in its CPU affinity mask, and using the mbind(2) and
|
||||
set_mempolicy(2) system calls to include Memory Nodes in its memory
|
||||
policy, are both filtered through that tasks cpuset, filtering out any
|
||||
policy, are both filtered through that task's cpuset, filtering out any
|
||||
CPUs or Memory Nodes not in that cpuset. The scheduler will not
|
||||
schedule a task on a CPU that is not allowed in its cpus_allowed
|
||||
vector, and the kernel page allocator will not allocate a page on a
|
||||
node that is not allowed in the requesting tasks mems_allowed vector.
|
||||
node that is not allowed in the requesting task's mems_allowed vector.
|
||||
|
||||
User level code may create and destroy cpusets by name in the cgroup
|
||||
virtual file system, manage the attributes and permissions of these
|
||||
@ -121,9 +121,9 @@ Cpusets extends these two mechanisms as follows:
|
||||
- Each task in the system is attached to a cpuset, via a pointer
|
||||
in the task structure to a reference counted cgroup structure.
|
||||
- Calls to sched_setaffinity are filtered to just those CPUs
|
||||
allowed in that tasks cpuset.
|
||||
allowed in that task's cpuset.
|
||||
- Calls to mbind and set_mempolicy are filtered to just
|
||||
those Memory Nodes allowed in that tasks cpuset.
|
||||
those Memory Nodes allowed in that task's cpuset.
|
||||
- The root cpuset contains all the systems CPUs and Memory
|
||||
Nodes.
|
||||
- For any cpuset, one can define child cpusets containing a subset
|
||||
@ -141,11 +141,11 @@ into the rest of the kernel, none in performance critical paths:
|
||||
- in init/main.c, to initialize the root cpuset at system boot.
|
||||
- in fork and exit, to attach and detach a task from its cpuset.
|
||||
- in sched_setaffinity, to mask the requested CPUs by what's
|
||||
allowed in that tasks cpuset.
|
||||
allowed in that task's cpuset.
|
||||
- in sched.c migrate_live_tasks(), to keep migrating tasks within
|
||||
the CPUs allowed by their cpuset, if possible.
|
||||
- in the mbind and set_mempolicy system calls, to mask the requested
|
||||
Memory Nodes by what's allowed in that tasks cpuset.
|
||||
Memory Nodes by what's allowed in that task's cpuset.
|
||||
- in page_alloc.c, to restrict memory to allowed nodes.
|
||||
- in vmscan.c, to restrict page recovery to the current cpuset.
|
||||
|
||||
@ -155,7 +155,7 @@ new system calls are added for cpusets - all support for querying and
|
||||
modifying cpusets is via this cpuset file system.
|
||||
|
||||
The /proc/<pid>/status file for each task has four added lines,
|
||||
displaying the tasks cpus_allowed (on which CPUs it may be scheduled)
|
||||
displaying the task's cpus_allowed (on which CPUs it may be scheduled)
|
||||
and mems_allowed (on which Memory Nodes it may obtain memory),
|
||||
in the two formats seen in the following example:
|
||||
|
||||
@ -323,17 +323,17 @@ stack segment pages of a task.
|
||||
|
||||
By default, both kinds of memory spreading are off, and memory
|
||||
pages are allocated on the node local to where the task is running,
|
||||
except perhaps as modified by the tasks NUMA mempolicy or cpuset
|
||||
except perhaps as modified by the task's NUMA mempolicy or cpuset
|
||||
configuration, so long as sufficient free memory pages are available.
|
||||
|
||||
When new cpusets are created, they inherit the memory spread settings
|
||||
of their parent.
|
||||
|
||||
Setting memory spreading causes allocations for the affected page
|
||||
or slab caches to ignore the tasks NUMA mempolicy and be spread
|
||||
or slab caches to ignore the task's NUMA mempolicy and be spread
|
||||
instead. Tasks using mbind() or set_mempolicy() calls to set NUMA
|
||||
mempolicies will not notice any change in these calls as a result of
|
||||
their containing tasks memory spread settings. If memory spreading
|
||||
their containing task's memory spread settings. If memory spreading
|
||||
is turned off, then the currently specified NUMA mempolicy once again
|
||||
applies to memory page allocations.
|
||||
|
||||
@ -357,7 +357,7 @@ pages from the node returned by cpuset_mem_spread_node().
|
||||
|
||||
The cpuset_mem_spread_node() routine is also simple. It uses the
|
||||
value of a per-task rotor cpuset_mem_spread_rotor to select the next
|
||||
node in the current tasks mems_allowed to prefer for the allocation.
|
||||
node in the current task's mems_allowed to prefer for the allocation.
|
||||
|
||||
This memory placement policy is also known (in other contexts) as
|
||||
round-robin or interleave.
|
||||
@ -594,7 +594,7 @@ is attached, is subtle.
|
||||
If a cpuset has its Memory Nodes modified, then for each task attached
|
||||
to that cpuset, the next time that the kernel attempts to allocate
|
||||
a page of memory for that task, the kernel will notice the change
|
||||
in the tasks cpuset, and update its per-task memory placement to
|
||||
in the task's cpuset, and update its per-task memory placement to
|
||||
remain within the new cpusets memory placement. If the task was using
|
||||
mempolicy MPOL_BIND, and the nodes to which it was bound overlap with
|
||||
its new cpuset, then the task will continue to use whatever subset
|
||||
@ -603,13 +603,13 @@ was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed
|
||||
in the new cpuset, then the task will be essentially treated as if it
|
||||
was MPOL_BIND bound to the new cpuset (even though its NUMA placement,
|
||||
as queried by get_mempolicy(), doesn't change). If a task is moved
|
||||
from one cpuset to another, then the kernel will adjust the tasks
|
||||
from one cpuset to another, then the kernel will adjust the task's
|
||||
memory placement, as above, the next time that the kernel attempts
|
||||
to allocate a page of memory for that task.
|
||||
|
||||
If a cpuset has its 'cpuset.cpus' modified, then each task in that cpuset
|
||||
will have its allowed CPU placement changed immediately. Similarly,
|
||||
if a tasks pid is written to another cpusets 'cpuset.tasks' file, then its
|
||||
if a task's pid is written to another cpusets 'cpuset.tasks' file, then its
|
||||
allowed CPU placement is changed immediately. If such a task had been
|
||||
bound to some subset of its cpuset using the sched_setaffinity() call,
|
||||
the task will be allowed to run on any CPU allowed in its new cpuset,
|
||||
@ -626,16 +626,16 @@ cpusets memory placement policy 'cpuset.mems' subsequently changes.
|
||||
If the cpuset flag file 'cpuset.memory_migrate' is set true, then when
|
||||
tasks are attached to that cpuset, any pages that task had
|
||||
allocated to it on nodes in its previous cpuset are migrated
|
||||
to the tasks new cpuset. The relative placement of the page within
|
||||
to the task's new cpuset. The relative placement of the page within
|
||||
the cpuset is preserved during these migration operations if possible.
|
||||
For example if the page was on the second valid node of the prior cpuset
|
||||
then the page will be placed on the second valid node of the new cpuset.
|
||||
|
||||
Also if 'cpuset.memory_migrate' is set true, then if that cpusets
|
||||
Also if 'cpuset.memory_migrate' is set true, then if that cpuset's
|
||||
'cpuset.mems' file is modified, pages allocated to tasks in that
|
||||
cpuset, that were on nodes in the previous setting of 'cpuset.mems',
|
||||
will be moved to nodes in the new setting of 'mems.'
|
||||
Pages that were not in the tasks prior cpuset, or in the cpusets
|
||||
Pages that were not in the task's prior cpuset, or in the cpuset's
|
||||
prior 'cpuset.mems' setting, will not be moved.
|
||||
|
||||
There is an exception to the above. If hotplug functionality is used
|
||||
@ -655,7 +655,7 @@ There is a second exception to the above. GFP_ATOMIC requests are
|
||||
kernel internal allocations that must be satisfied, immediately.
|
||||
The kernel may drop some request, in rare cases even panic, if a
|
||||
GFP_ATOMIC alloc fails. If the request cannot be satisfied within
|
||||
the current tasks cpuset, then we relax the cpuset, and look for
|
||||
the current task's cpuset, then we relax the cpuset, and look for
|
||||
memory anywhere we can find it. It's better to violate the cpuset
|
||||
than stress the kernel.
|
||||
|
||||
|
@ -244,7 +244,7 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
|
||||
we have to check if OLDPAGE/NEWPAGE is a valid page after commit().
|
||||
|
||||
8. LRU
|
||||
Each memcg has its own private LRU. Now, it's handling is under global
|
||||
Each memcg has its own private LRU. Now, its handling is under global
|
||||
VM's control (means that it's handled under global zone->lru_lock).
|
||||
Almost all routines around memcg's LRU is called by global LRU's
|
||||
list management functions under zone->lru_lock().
|
||||
|
@ -1,18 +1,15 @@
|
||||
Memory Resource Controller
|
||||
|
||||
NOTE: The Memory Resource Controller has been generically been referred
|
||||
to as the memory controller in this document. Do not confuse memory controller
|
||||
used here with the memory controller that is used in hardware.
|
||||
to as the memory controller in this document. Do not confuse memory
|
||||
controller used here with the memory controller that is used in hardware.
|
||||
|
||||
Salient features
|
||||
|
||||
a. Enable control of Anonymous, Page Cache (mapped and unmapped) and
|
||||
Swap Cache memory pages.
|
||||
b. The infrastructure allows easy addition of other types of memory to control
|
||||
c. Provides *zero overhead* for non memory controller users
|
||||
d. Provides a double LRU: global memory pressure causes reclaim from the
|
||||
global LRU; a cgroup on hitting a limit, reclaims from the per
|
||||
cgroup LRU
|
||||
(For editors)
|
||||
In this document:
|
||||
When we mention a cgroup (cgroupfs's directory) with memory controller,
|
||||
we call it "memory cgroup". When you see git-log and source code, you'll
|
||||
see patch's title and function names tend to use "memcg".
|
||||
In this document, we avoid using it.
|
||||
|
||||
Benefits and Purpose of the memory controller
|
||||
|
||||
@ -33,6 +30,45 @@ d. A CD/DVD burner could control the amount of memory used by the
|
||||
e. There are several other use cases, find one or use the controller just
|
||||
for fun (to learn and hack on the VM subsystem).
|
||||
|
||||
Current Status: linux-2.6.34-mmotm(development version of 2010/April)
|
||||
|
||||
Features:
|
||||
- accounting anonymous pages, file caches, swap caches usage and limiting them.
|
||||
- private LRU and reclaim routine. (system's global LRU and private LRU
|
||||
work independently from each other)
|
||||
- optionally, memory+swap usage can be accounted and limited.
|
||||
- hierarchical accounting
|
||||
- soft limit
|
||||
- moving(recharging) account at moving a task is selectable.
|
||||
- usage threshold notifier
|
||||
- oom-killer disable knob and oom-notifier
|
||||
- Root cgroup has no limit controls.
|
||||
|
||||
Kernel memory and Hugepages are not under control yet. We just manage
|
||||
pages on LRU. To add more controls, we have to take care of performance.
|
||||
|
||||
Brief summary of control files.
|
||||
|
||||
tasks # attach a task(thread) and show list of threads
|
||||
cgroup.procs # show list of processes
|
||||
cgroup.event_control # an interface for event_fd()
|
||||
memory.usage_in_bytes # show current memory(RSS+Cache) usage.
|
||||
memory.memsw.usage_in_bytes # show current memory+Swap usage
|
||||
memory.limit_in_bytes # set/show limit of memory usage
|
||||
memory.memsw.limit_in_bytes # set/show limit of memory+Swap usage
|
||||
memory.failcnt # show the number of memory usage hits limits
|
||||
memory.memsw.failcnt # show the number of memory+Swap hits limits
|
||||
memory.max_usage_in_bytes # show max memory usage recorded
|
||||
memory.memsw.usage_in_bytes # show max memory+Swap usage recorded
|
||||
memory.soft_limit_in_bytes # set/show soft limit of memory usage
|
||||
memory.stat # show various statistics
|
||||
memory.use_hierarchy # set/show hierarchical account enabled
|
||||
memory.force_empty # trigger forced move charge to parent
|
||||
memory.swappiness # set/show swappiness parameter of vmscan
|
||||
(See sysctl's vm.swappiness)
|
||||
memory.move_charge_at_immigrate # set/show controls of moving charges
|
||||
memory.oom_control # set/show oom controls.
|
||||
|
||||
1. History
|
||||
|
||||
The memory controller has a long history. A request for comments for the memory
|
||||
@ -106,14 +142,14 @@ the necessary data structures and check if the cgroup that is being charged
|
||||
is over its limit. If it is then reclaim is invoked on the cgroup.
|
||||
More details can be found in the reclaim section of this document.
|
||||
If everything goes well, a page meta-data-structure called page_cgroup is
|
||||
allocated and associated with the page. This routine also adds the page to
|
||||
the per cgroup LRU.
|
||||
updated. page_cgroup has its own LRU on cgroup.
|
||||
(*) page_cgroup structure is allocated at boot/memory-hotplug time.
|
||||
|
||||
2.2.1 Accounting details
|
||||
|
||||
All mapped anon pages (RSS) and cache pages (Page Cache) are accounted.
|
||||
(some pages which never be reclaimable and will not be on global LRU
|
||||
are not accounted. we just accounts pages under usual vm management.)
|
||||
Some pages which are never reclaimable and will not be on the global LRU
|
||||
are not accounted. We just account pages under usual VM management.
|
||||
|
||||
RSS pages are accounted at page_fault unless they've already been accounted
|
||||
for earlier. A file page will be accounted for as Page Cache when it's
|
||||
@ -121,12 +157,19 @@ inserted into inode (radix-tree). While it's mapped into the page tables of
|
||||
processes, duplicate accounting is carefully avoided.
|
||||
|
||||
A RSS page is unaccounted when it's fully unmapped. A PageCache page is
|
||||
unaccounted when it's removed from radix-tree.
|
||||
unaccounted when it's removed from radix-tree. Even if RSS pages are fully
|
||||
unmapped (by kswapd), they may exist as SwapCache in the system until they
|
||||
are really freed. Such SwapCaches also also accounted.
|
||||
A swapped-in page is not accounted until it's mapped.
|
||||
|
||||
Note: The kernel does swapin-readahead and read multiple swaps at once.
|
||||
This means swapped-in pages may contain pages for other tasks than a task
|
||||
causing page fault. So, we avoid accounting at swap-in I/O.
|
||||
|
||||
At page migration, accounting information is kept.
|
||||
|
||||
Note: we just account pages-on-lru because our purpose is to control amount
|
||||
of used pages. not-on-lru pages are tend to be out-of-control from vm view.
|
||||
Note: we just account pages-on-LRU because our purpose is to control amount
|
||||
of used pages; not-on-LRU pages tend to be out-of-control from VM view.
|
||||
|
||||
2.3 Shared Page Accounting
|
||||
|
||||
@ -143,6 +186,7 @@ caller of swapoff rather than the users of shmem.
|
||||
|
||||
|
||||
2.4 Swap Extension (CONFIG_CGROUP_MEM_RES_CTLR_SWAP)
|
||||
|
||||
Swap Extension allows you to record charge for swap. A swapped-in page is
|
||||
charged back to original page allocator if possible.
|
||||
|
||||
@ -150,13 +194,20 @@ When swap is accounted, following files are added.
|
||||
- memory.memsw.usage_in_bytes.
|
||||
- memory.memsw.limit_in_bytes.
|
||||
|
||||
usage of mem+swap is limited by memsw.limit_in_bytes.
|
||||
memsw means memory+swap. Usage of memory+swap is limited by
|
||||
memsw.limit_in_bytes.
|
||||
|
||||
* why 'mem+swap' rather than swap.
|
||||
Example: Assume a system with 4G of swap. A task which allocates 6G of memory
|
||||
(by mistake) under 2G memory limitation will use all swap.
|
||||
In this case, setting memsw.limit_in_bytes=3G will prevent bad use of swap.
|
||||
By using memsw limit, you can avoid system OOM which can be caused by swap
|
||||
shortage.
|
||||
|
||||
* why 'memory+swap' rather than swap.
|
||||
The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
|
||||
to move account from memory to swap...there is no change in usage of
|
||||
mem+swap. In other words, when we want to limit the usage of swap without
|
||||
affecting global LRU, mem+swap limit is better than just limiting swap from
|
||||
memory+swap. In other words, when we want to limit the usage of swap without
|
||||
affecting global LRU, memory+swap limit is better than just limiting swap from
|
||||
OS point of view.
|
||||
|
||||
* What happens when a cgroup hits memory.memsw.limit_in_bytes
|
||||
@ -168,12 +219,12 @@ it by cgroup.
|
||||
|
||||
2.5 Reclaim
|
||||
|
||||
Each cgroup maintains a per cgroup LRU that consists of an active
|
||||
and inactive list. When a cgroup goes over its limit, we first try
|
||||
Each cgroup maintains a per cgroup LRU which has the same structure as
|
||||
global VM. When a cgroup goes over its limit, we first try
|
||||
to reclaim memory from the cgroup so as to make space for the new
|
||||
pages that the cgroup has touched. If the reclaim is unsuccessful,
|
||||
an OOM routine is invoked to select and kill the bulkiest task in the
|
||||
cgroup.
|
||||
cgroup. (See 10. OOM Control below.)
|
||||
|
||||
The reclaim algorithm has not been modified for cgroups, except that
|
||||
pages that are selected for reclaiming come from the per cgroup LRU
|
||||
@ -184,13 +235,22 @@ limits on the root cgroup.
|
||||
|
||||
Note2: When panic_on_oom is set to "2", the whole system will panic.
|
||||
|
||||
2. Locking
|
||||
When oom event notifier is registered, event will be delivered.
|
||||
(See oom_control section)
|
||||
|
||||
The memory controller uses the following hierarchy
|
||||
2.6 Locking
|
||||
|
||||
1. zone->lru_lock is used for selecting pages to be isolated
|
||||
2. mem->per_zone->lru_lock protects the per cgroup LRU (per zone)
|
||||
3. lock_page_cgroup() is used to protect page->page_cgroup
|
||||
lock_page_cgroup()/unlock_page_cgroup() should not be called under
|
||||
mapping->tree_lock.
|
||||
|
||||
Other lock order is following:
|
||||
PG_locked.
|
||||
mm->page_table_lock
|
||||
zone->lru_lock
|
||||
lock_page_cgroup.
|
||||
In many cases, just lock_page_cgroup() is called.
|
||||
per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
|
||||
zone->lru_lock, it has no lock of its own.
|
||||
|
||||
3. User Interface
|
||||
|
||||
@ -199,6 +259,7 @@ The memory controller uses the following hierarchy
|
||||
a. Enable CONFIG_CGROUPS
|
||||
b. Enable CONFIG_RESOURCE_COUNTERS
|
||||
c. Enable CONFIG_CGROUP_MEM_RES_CTLR
|
||||
d. Enable CONFIG_CGROUP_MEM_RES_CTLR_SWAP (to use swap extension)
|
||||
|
||||
1. Prepare the cgroups
|
||||
# mkdir -p /cgroups
|
||||
@ -206,31 +267,28 @@ c. Enable CONFIG_CGROUP_MEM_RES_CTLR
|
||||
|
||||
2. Make the new group and move bash into it
|
||||
# mkdir /cgroups/0
|
||||
# echo $$ > /cgroups/0/tasks
|
||||
# echo $$ > /cgroups/0/tasks
|
||||
|
||||
Since now we're in the 0 cgroup,
|
||||
We can alter the memory limit:
|
||||
Since now we're in the 0 cgroup, we can alter the memory limit:
|
||||
# echo 4M > /cgroups/0/memory.limit_in_bytes
|
||||
|
||||
NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
|
||||
mega or gigabytes.
|
||||
mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes, Gibibytes.)
|
||||
|
||||
NOTE: We can write "-1" to reset the *.limit_in_bytes(unlimited).
|
||||
NOTE: We cannot set limits on the root cgroup any more.
|
||||
|
||||
# cat /cgroups/0/memory.limit_in_bytes
|
||||
4194304
|
||||
|
||||
NOTE: The interface has now changed to display the usage in bytes
|
||||
instead of pages
|
||||
|
||||
We can check the usage:
|
||||
# cat /cgroups/0/memory.usage_in_bytes
|
||||
1216512
|
||||
|
||||
A successful write to this file does not guarantee a successful set of
|
||||
this limit to the value written into the file. This can be due to a
|
||||
this limit to the value written into the file. This can be due to a
|
||||
number of factors, such as rounding up to page boundaries or the total
|
||||
availability of memory on the system. The user is required to re-read
|
||||
availability of memory on the system. The user is required to re-read
|
||||
this file after a write to guarantee the value committed by the kernel.
|
||||
|
||||
# echo 1 > memory.limit_in_bytes
|
||||
@ -245,15 +303,23 @@ caches, RSS and Active pages/Inactive pages are shown.
|
||||
|
||||
4. Testing
|
||||
|
||||
Balbir posted lmbench, AIM9, LTP and vmmstress results [10] and [11].
|
||||
Apart from that v6 has been tested with several applications and regular
|
||||
daily use. The controller has also been tested on the PPC64, x86_64 and
|
||||
UML platforms.
|
||||
For testing features and implementation, see memcg_test.txt.
|
||||
|
||||
Performance test is also important. To see pure memory controller's overhead,
|
||||
testing on tmpfs will give you good numbers of small overheads.
|
||||
Example: do kernel make on tmpfs.
|
||||
|
||||
Page-fault scalability is also important. At measuring parallel
|
||||
page fault test, multi-process test may be better than multi-thread
|
||||
test because it has noise of shared objects/status.
|
||||
|
||||
But the above two are testing extreme situations.
|
||||
Trying usual test under memory controller is always helpful.
|
||||
|
||||
4.1 Troubleshooting
|
||||
|
||||
Sometimes a user might find that the application under a cgroup is
|
||||
terminated. There are several causes for this:
|
||||
terminated by OOM killer. There are several causes for this:
|
||||
|
||||
1. The cgroup limit is too low (just too low to do anything useful)
|
||||
2. The user is using anonymous memory and swap is turned off or too low
|
||||
@ -261,23 +327,29 @@ terminated. There are several causes for this:
|
||||
A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of
|
||||
some of the pages cached in the cgroup (page cache pages).
|
||||
|
||||
To know what happens, disable OOM_Kill by 10. OOM Control(see below) and
|
||||
seeing what happens will be helpful.
|
||||
|
||||
4.2 Task migration
|
||||
|
||||
When a task migrates from one cgroup to another, it's charge is not
|
||||
When a task migrates from one cgroup to another, its charge is not
|
||||
carried forward by default. The pages allocated from the original cgroup still
|
||||
remain charged to it, the charge is dropped when the page is freed or
|
||||
reclaimed.
|
||||
|
||||
Note: You can move charges of a task along with task migration. See 8.
|
||||
You can move charges of a task along with task migration.
|
||||
See 8. "Move charges at task migration"
|
||||
|
||||
4.3 Removing a cgroup
|
||||
|
||||
A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
|
||||
cgroup might have some charge associated with it, even though all
|
||||
tasks have migrated away from it.
|
||||
Such charges are freed(at default) or moved to its parent. When moved,
|
||||
both of RSS and CACHES are moved to parent.
|
||||
If both of them are busy, rmdir() returns -EBUSY. See 5.1 Also.
|
||||
tasks have migrated away from it. (because we charge against pages, not
|
||||
against tasks.)
|
||||
|
||||
Such charges are freed or moved to their parent. At moving, both of RSS
|
||||
and CACHES are moved to parent.
|
||||
rmdir() may return -EBUSY if freeing/moving fails. See 5.1 also.
|
||||
|
||||
Charges recorded in swap information is not updated at removal of cgroup.
|
||||
Recorded information is discarded and a cgroup which uses swap (swapcache)
|
||||
@ -293,10 +365,10 @@ will be charged as a new owner of it.
|
||||
|
||||
# echo 0 > memory.force_empty
|
||||
|
||||
Almost all pages tracked by this memcg will be unmapped and freed. Some of
|
||||
pages cannot be freed because it's locked or in-use. Such pages are moved
|
||||
to parent and this cgroup will be empty. But this may return -EBUSY in
|
||||
some too busy case.
|
||||
Almost all pages tracked by this memory cgroup will be unmapped and freed.
|
||||
Some pages cannot be freed because they are locked or in-use. Such pages are
|
||||
moved to parent and this cgroup will be empty. This may return -EBUSY if
|
||||
VM is too busy to free/move all pages immediately.
|
||||
|
||||
Typical use case of this interface is that calling this before rmdir().
|
||||
Because rmdir() moves all pages to parent, some out-of-use page caches can be
|
||||
@ -306,19 +378,41 @@ will be charged as a new owner of it.
|
||||
|
||||
memory.stat file includes following statistics
|
||||
|
||||
# per-memory cgroup local status
|
||||
cache - # of bytes of page cache memory.
|
||||
rss - # of bytes of anonymous and swap cache memory.
|
||||
mapped_file - # of bytes of mapped file (includes tmpfs/shmem)
|
||||
pgpgin - # of pages paged in (equivalent to # of charging events).
|
||||
pgpgout - # of pages paged out (equivalent to # of uncharging events).
|
||||
active_anon - # of bytes of anonymous and swap cache memory on active
|
||||
lru list.
|
||||
swap - # of bytes of swap usage
|
||||
inactive_anon - # of bytes of anonymous memory and swap cache memory on
|
||||
inactive lru list.
|
||||
active_file - # of bytes of file-backed memory on active lru list.
|
||||
inactive_file - # of bytes of file-backed memory on inactive lru list.
|
||||
LRU list.
|
||||
active_anon - # of bytes of anonymous and swap cache memory on active
|
||||
inactive LRU list.
|
||||
inactive_file - # of bytes of file-backed memory on inactive LRU list.
|
||||
active_file - # of bytes of file-backed memory on active LRU list.
|
||||
unevictable - # of bytes of memory that cannot be reclaimed (mlocked etc).
|
||||
|
||||
The following additional stats are dependent on CONFIG_DEBUG_VM.
|
||||
# status considering hierarchy (see memory.use_hierarchy settings)
|
||||
|
||||
hierarchical_memory_limit - # of bytes of memory limit with regard to hierarchy
|
||||
under which the memory cgroup is
|
||||
hierarchical_memsw_limit - # of bytes of memory+swap limit with regard to
|
||||
hierarchy under which memory cgroup is.
|
||||
|
||||
total_cache - sum of all children's "cache"
|
||||
total_rss - sum of all children's "rss"
|
||||
total_mapped_file - sum of all children's "cache"
|
||||
total_pgpgin - sum of all children's "pgpgin"
|
||||
total_pgpgout - sum of all children's "pgpgout"
|
||||
total_swap - sum of all children's "swap"
|
||||
total_inactive_anon - sum of all children's "inactive_anon"
|
||||
total_active_anon - sum of all children's "active_anon"
|
||||
total_inactive_file - sum of all children's "inactive_file"
|
||||
total_active_file - sum of all children's "active_file"
|
||||
total_unevictable - sum of all children's "unevictable"
|
||||
|
||||
# The following additional stats are dependent on CONFIG_DEBUG_VM.
|
||||
|
||||
inactive_ratio - VM internal parameter. (see mm/page_alloc.c)
|
||||
recent_rotated_anon - VM internal parameter. (see mm/vmscan.c)
|
||||
@ -327,24 +421,37 @@ recent_scanned_anon - VM internal parameter. (see mm/vmscan.c)
|
||||
recent_scanned_file - VM internal parameter. (see mm/vmscan.c)
|
||||
|
||||
Memo:
|
||||
recent_rotated means recent frequency of lru rotation.
|
||||
recent_scanned means recent # of scans to lru.
|
||||
recent_rotated means recent frequency of LRU rotation.
|
||||
recent_scanned means recent # of scans to LRU.
|
||||
showing for better debug please see the code for meanings.
|
||||
|
||||
Note:
|
||||
Only anonymous and swap cache memory is listed as part of 'rss' stat.
|
||||
This should not be confused with the true 'resident set size' or the
|
||||
amount of physical memory used by the cgroup. Per-cgroup rss
|
||||
accounting is not done yet.
|
||||
amount of physical memory used by the cgroup.
|
||||
'rss + file_mapped" will give you resident set size of cgroup.
|
||||
(Note: file and shmem may be shared among other cgroups. In that case,
|
||||
file_mapped is accounted only when the memory cgroup is owner of page
|
||||
cache.)
|
||||
|
||||
5.3 swappiness
|
||||
Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only.
|
||||
|
||||
Following cgroups' swappiness can't be changed.
|
||||
- root cgroup (uses /proc/sys/vm/swappiness).
|
||||
- a cgroup which uses hierarchy and it has child cgroup.
|
||||
- a cgroup which uses hierarchy and not the root of hierarchy.
|
||||
Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only.
|
||||
|
||||
Following cgroups' swappiness can't be changed.
|
||||
- root cgroup (uses /proc/sys/vm/swappiness).
|
||||
- a cgroup which uses hierarchy and it has other cgroup(s) below it.
|
||||
- a cgroup which uses hierarchy and not the root of hierarchy.
|
||||
|
||||
5.4 failcnt
|
||||
|
||||
A memory cgroup provides memory.failcnt and memory.memsw.failcnt files.
|
||||
This failcnt(== failure count) shows the number of times that a usage counter
|
||||
hit its limit. When a memory cgroup hits a limit, failcnt increases and
|
||||
memory under it will be reclaimed.
|
||||
|
||||
You can reset failcnt by writing 0 to failcnt file.
|
||||
# echo 0 > .../memory.failcnt
|
||||
|
||||
6. Hierarchy support
|
||||
|
||||
@ -363,13 +470,13 @@ hierarchy
|
||||
|
||||
In the diagram above, with hierarchical accounting enabled, all memory
|
||||
usage of e, is accounted to its ancestors up until the root (i.e, c and root),
|
||||
that has memory.use_hierarchy enabled. If one of the ancestors goes over its
|
||||
that has memory.use_hierarchy enabled. If one of the ancestors goes over its
|
||||
limit, the reclaim algorithm reclaims from the tasks in the ancestor and the
|
||||
children of the ancestor.
|
||||
|
||||
6.1 Enabling hierarchical accounting and reclaim
|
||||
|
||||
The memory controller by default disables the hierarchy feature. Support
|
||||
A memory cgroup by default disables the hierarchy feature. Support
|
||||
can be enabled by writing 1 to memory.use_hierarchy file of the root cgroup
|
||||
|
||||
# echo 1 > memory.use_hierarchy
|
||||
@ -379,10 +486,10 @@ The feature can be disabled by
|
||||
# echo 0 > memory.use_hierarchy
|
||||
|
||||
NOTE1: Enabling/disabling will fail if the cgroup already has other
|
||||
cgroups created below it.
|
||||
cgroups created below it.
|
||||
|
||||
NOTE2: When panic_on_oom is set to "2", the whole system will panic in
|
||||
case of an oom event in any cgroup.
|
||||
case of an OOM event in any cgroup.
|
||||
|
||||
7. Soft limits
|
||||
|
||||
@ -392,7 +499,7 @@ is to allow control groups to use as much of the memory as needed, provided
|
||||
a. There is no memory contention
|
||||
b. They do not exceed their hard limit
|
||||
|
||||
When the system detects memory contention or low memory control groups
|
||||
When the system detects memory contention or low memory, control groups
|
||||
are pushed back to their soft limits. If the soft limit of each control
|
||||
group is very high, they are pushed back as much as possible to make
|
||||
sure that one control group does not starve the others of memory.
|
||||
@ -406,7 +513,7 @@ it gets invoked from balance_pgdat (kswapd).
|
||||
7.1 Interface
|
||||
|
||||
Soft limits can be setup by using the following commands (in this example we
|
||||
assume a soft limit of 256 megabytes)
|
||||
assume a soft limit of 256 MiB)
|
||||
|
||||
# echo 256M > memory.soft_limit_in_bytes
|
||||
|
||||
@ -442,7 +549,7 @@ Note: Charges are moved only when you move mm->owner, IOW, a leader of a thread
|
||||
Note: If we cannot find enough space for the task in the destination cgroup, we
|
||||
try to make space by reclaiming memory. Task migration may fail if we
|
||||
cannot make enough space.
|
||||
Note: It can take several seconds if you move charges in giga bytes order.
|
||||
Note: It can take several seconds if you move charges much.
|
||||
|
||||
And if you want disable it again:
|
||||
|
||||
@ -451,21 +558,27 @@ And if you want disable it again:
|
||||
8.2 Type of charges which can be move
|
||||
|
||||
Each bits of move_charge_at_immigrate has its own meaning about what type of
|
||||
charges should be moved.
|
||||
charges should be moved. But in any cases, it must be noted that an account of
|
||||
a page or a swap can be moved only when it is charged to the task's current(old)
|
||||
memory cgroup.
|
||||
|
||||
bit | what type of charges would be moved ?
|
||||
-----+------------------------------------------------------------------------
|
||||
0 | A charge of an anonymous page(or swap of it) used by the target task.
|
||||
| Those pages and swaps must be used only by the target task. You must
|
||||
| enable Swap Extension(see 2.4) to enable move of swap charges.
|
||||
|
||||
Note: Those pages and swaps must be charged to the old cgroup.
|
||||
Note: More type of pages(e.g. file cache, shmem,) will be supported by other
|
||||
bits in future.
|
||||
-----+------------------------------------------------------------------------
|
||||
1 | A charge of file pages(normal file, tmpfs file(e.g. ipc shared memory)
|
||||
| and swaps of tmpfs file) mmapped by the target task. Unlike the case of
|
||||
| anonymous pages, file pages(and swaps) in the range mmapped by the task
|
||||
| will be moved even if the task hasn't done page fault, i.e. they might
|
||||
| not be the task's "RSS", but other task's "RSS" that maps the same file.
|
||||
| And mapcount of the page is ignored(the page can be moved even if
|
||||
| page_mapcount(page) > 1). You must enable Swap Extension(see 2.4) to
|
||||
| enable move of swap charges.
|
||||
|
||||
8.3 TODO
|
||||
|
||||
- Add support for other types of pages(e.g. file cache, shmem, etc.).
|
||||
- Implement madvise(2) to let users decide the vma to be moved or not to be
|
||||
moved.
|
||||
- All of moving charge operations are done under cgroup_mutex. It's not good
|
||||
@ -473,22 +586,61 @@ Note: More type of pages(e.g. file cache, shmem,) will be supported by other
|
||||
|
||||
9. Memory thresholds
|
||||
|
||||
Memory controler implements memory thresholds using cgroups notification
|
||||
Memory cgroup implements memory thresholds using cgroups notification
|
||||
API (see cgroups.txt). It allows to register multiple memory and memsw
|
||||
thresholds and gets notifications when it crosses.
|
||||
|
||||
To register a threshold application need:
|
||||
- create an eventfd using eventfd(2);
|
||||
- open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
|
||||
- write string like "<event_fd> <memory.usage_in_bytes> <threshold>" to
|
||||
cgroup.event_control.
|
||||
- create an eventfd using eventfd(2);
|
||||
- open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
|
||||
- write string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to
|
||||
cgroup.event_control.
|
||||
|
||||
Application will be notified through eventfd when memory usage crosses
|
||||
threshold in any direction.
|
||||
|
||||
It's applicable for root and non-root cgroup.
|
||||
|
||||
10. TODO
|
||||
10. OOM Control
|
||||
|
||||
memory.oom_control file is for OOM notification and other controls.
|
||||
|
||||
Memory cgroup implements OOM notifier using cgroup notification
|
||||
API (See cgroups.txt). It allows to register multiple OOM notification
|
||||
delivery and gets notification when OOM happens.
|
||||
|
||||
To register a notifier, application need:
|
||||
- create an eventfd using eventfd(2)
|
||||
- open memory.oom_control file
|
||||
- write string like "<event_fd> <fd of memory.oom_control>" to
|
||||
cgroup.event_control
|
||||
|
||||
Application will be notified through eventfd when OOM happens.
|
||||
OOM notification doesn't work for root cgroup.
|
||||
|
||||
You can disable OOM-killer by writing "1" to memory.oom_control file, as:
|
||||
|
||||
#echo 1 > memory.oom_control
|
||||
|
||||
This operation is only allowed to the top cgroup of sub-hierarchy.
|
||||
If OOM-killer is disabled, tasks under cgroup will hang/sleep
|
||||
in memory cgroup's OOM-waitqueue when they request accountable memory.
|
||||
|
||||
For running them, you have to relax the memory cgroup's OOM status by
|
||||
* enlarge limit or reduce usage.
|
||||
To reduce usage,
|
||||
* kill some tasks.
|
||||
* move some tasks to other group with account migration.
|
||||
* remove some files (on tmpfs?)
|
||||
|
||||
Then, stopped tasks will work again.
|
||||
|
||||
At reading, current status of OOM is shown.
|
||||
oom_kill_disable 0 or 1 (if 1, oom-killer is disabled)
|
||||
under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may
|
||||
be stopped.)
|
||||
|
||||
11. TODO
|
||||
|
||||
1. Add support for accounting huge pages (as a separate controller)
|
||||
2. Make per-cgroup scanner reclaim not-shared pages first
|
||||
|
@ -88,7 +88,7 @@ int cn_netlink_send(struct cn_msg *msg, u32 __groups, int gfp_mask);
|
||||
int gfp_mask - GFP mask.
|
||||
|
||||
Note: When registering new callback user, connector core assigns
|
||||
netlink group to the user which is equal to it's id.idx.
|
||||
netlink group to the user which is equal to its id.idx.
|
||||
|
||||
/*****************************************/
|
||||
Protocol description.
|
||||
|
@ -408,9 +408,6 @@ This should be used inside the RCU read lock, as in the following example:
|
||||
...
|
||||
}
|
||||
|
||||
A function need not get RCU read lock to use __task_cred() if it is holding a
|
||||
spinlock at the time as this implicitly holds the RCU read lock.
|
||||
|
||||
Should it be necessary to hold another task's credentials for a long period of
|
||||
time, and possibly to sleep whilst doing so, then the caller should get a
|
||||
reference on them using:
|
||||
@ -426,17 +423,16 @@ credentials, hiding the RCU magic from the caller:
|
||||
uid_t task_uid(task) Task's real UID
|
||||
uid_t task_euid(task) Task's effective UID
|
||||
|
||||
If the caller is holding a spinlock or the RCU read lock at the time anyway,
|
||||
then:
|
||||
If the caller is holding the RCU read lock at the time anyway, then:
|
||||
|
||||
__task_cred(task)->uid
|
||||
__task_cred(task)->euid
|
||||
|
||||
should be used instead. Similarly, if multiple aspects of a task's credentials
|
||||
need to be accessed, RCU read lock or a spinlock should be used, __task_cred()
|
||||
called, the result stored in a temporary pointer and then the credential
|
||||
aspects called from that before dropping the lock. This prevents the
|
||||
potentially expensive RCU magic from being invoked multiple times.
|
||||
need to be accessed, RCU read lock should be used, __task_cred() called, the
|
||||
result stored in a temporary pointer and then the credential aspects called
|
||||
from that before dropping the lock. This prevents the potentially expensive
|
||||
RCU magic from being invoked multiple times.
|
||||
|
||||
Should some other single aspect of another task's credentials need to be
|
||||
accessed, then this can be used:
|
||||
|
@ -151,7 +151,7 @@ The stages that a patch goes through are, generally:
|
||||
well.
|
||||
|
||||
- Wider review. When the patch is getting close to ready for mainline
|
||||
inclusion, it will be accepted by a relevant subsystem maintainer -
|
||||
inclusion, it should be accepted by a relevant subsystem maintainer -
|
||||
though this acceptance is not a guarantee that the patch will make it
|
||||
all the way to the mainline. The patch will show up in the maintainer's
|
||||
subsystem tree and into the staging trees (described below). When the
|
||||
@ -159,6 +159,15 @@ The stages that a patch goes through are, generally:
|
||||
the discovery of any problems resulting from the integration of this
|
||||
patch with work being done by others.
|
||||
|
||||
- Please note that most maintainers also have day jobs, so merging
|
||||
your patch may not be their highest priority. If your patch is
|
||||
getting feedback about changes that are needed, you should either
|
||||
make those changes or justify why they should not be made. If your
|
||||
patch has no review complaints but is not being merged by its
|
||||
appropriate subsystem or driver maintainer, you should be persistent
|
||||
in updating the patch to the current kernel so that it applies cleanly
|
||||
and keep sending it for review and merging.
|
||||
|
||||
- Merging into the mainline. Eventually, a successful patch will be
|
||||
merged into the mainline repository managed by Linus Torvalds. More
|
||||
comments and/or problems may surface at this time; it is important that
|
||||
@ -258,12 +267,8 @@ an appropriate subsystem tree or be sent directly to Linus. In a typical
|
||||
development cycle, approximately 10% of the patches going into the mainline
|
||||
get there via -mm.
|
||||
|
||||
The current -mm patch can always be found from the front page of
|
||||
|
||||
http://kernel.org/
|
||||
|
||||
Those who want to see the current state of -mm can get the "-mm of the
|
||||
moment" tree, found at:
|
||||
The current -mm patch is available in the "mmotm" (-mm of the moment)
|
||||
directory at:
|
||||
|
||||
http://userweb.kernel.org/~akpm/mmotm/
|
||||
|
||||
@ -298,6 +303,12 @@ volatility of linux-next tends to make it a difficult development target.
|
||||
See http://lwn.net/Articles/289013/ for more information on this topic, and
|
||||
stay tuned; much is still in flux where linux-next is involved.
|
||||
|
||||
Besides the mmotm and linux-next trees, the kernel source tree now contains
|
||||
the drivers/staging/ directory and many sub-directories for drivers or
|
||||
filesystems that are on their way to being added to the kernel tree
|
||||
proper, but they remain in drivers/staging/ while they still need more
|
||||
work.
|
||||
|
||||
|
||||
2.5: TOOLS
|
||||
|
||||
@ -319,9 +330,9 @@ developers; even if they do not use it for their own work, they'll need git
|
||||
to keep up with what other developers (and the mainline) are doing.
|
||||
|
||||
Git is now packaged by almost all Linux distributions. There is a home
|
||||
page at
|
||||
page at:
|
||||
|
||||
http://git.or.cz/
|
||||
http://git-scm.com/
|
||||
|
||||
That page has pointers to documentation and tutorials. One should be
|
||||
aware, in particular, of the Kernel Hacker's Guide to git, which has
|
||||
|
@ -25,7 +25,7 @@ long document in its own right. Instead, the focus here will be on how git
|
||||
fits into the kernel development process in particular. Developers who
|
||||
wish to come up to speed with git will find more information at:
|
||||
|
||||
http://git.or.cz/
|
||||
http://git-scm.com/
|
||||
|
||||
http://www.kernel.org/pub/software/scm/git/docs/user-manual.html
|
||||
|
||||
|
@ -443,6 +443,8 @@ Your cooperation is appreciated.
|
||||
231 = /dev/snapshot System memory snapshot device
|
||||
232 = /dev/kvm Kernel-based virtual machine (hardware virtualization extensions)
|
||||
233 = /dev/kmview View-OS A process with a view
|
||||
234 = /dev/btrfs-control Btrfs control device
|
||||
235 = /dev/autofs Autofs control device
|
||||
240-254 Reserved for local use
|
||||
255 Reserved for MISC_DYNAMIC_MINOR
|
||||
|
||||
|
@ -41,7 +41,7 @@ This application requires the following to function properly as of now.
|
||||
|
||||
* Cards that fall in this category
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
At present the cards that fall in this category are the Twinhan and it's
|
||||
At present the cards that fall in this category are the Twinhan and its
|
||||
clones, these cards are available as VVMER, Tomato, Hercules, Orange and
|
||||
so on.
|
||||
|
||||
|
@ -1,7 +1,7 @@
|
||||
Thanks go to the following people for patches and contributions:
|
||||
|
||||
Michael Hunold <m.hunold@gmx.de>
|
||||
for the initial saa7146 driver and it's recent overhaul
|
||||
for the initial saa7146 driver and its recent overhaul
|
||||
|
||||
Christian Theiss
|
||||
for his work on the initial Linux DVB driver
|
||||
|
@ -241,16 +241,6 @@ Who: Thomas Gleixner <tglx@linutronix.de>
|
||||
|
||||
---------------------------
|
||||
|
||||
What (Why):
|
||||
- xt_recent: the old ipt_recent proc dir
|
||||
(superseded by /proc/net/xt_recent)
|
||||
|
||||
When: January 2009 or Linux 2.7.0, whichever comes first
|
||||
Why: Superseded by newer revisions or modules
|
||||
Who: Jan Engelhardt <jengelh@computergmbh.de>
|
||||
|
||||
---------------------------
|
||||
|
||||
What: GPIO autorequest on gpio_direction_{input,output}() in gpiolib
|
||||
When: February 2010
|
||||
Why: All callers should use explicit gpio_request()/gpio_free().
|
||||
@ -520,26 +510,21 @@ Who: Hans de Goede <hdegoede@redhat.com>
|
||||
|
||||
----------------------------
|
||||
|
||||
What: corgikbd, spitzkbd, tosakbd driver
|
||||
When: 2.6.35
|
||||
Files: drivers/input/keyboard/{corgi,spitz,tosa}kbd.c
|
||||
Why: We now have a generic GPIO based matrix keyboard driver that
|
||||
are fully capable of handling all the keys on these devices.
|
||||
The original drivers manipulate the GPIO registers directly
|
||||
and so are difficult to maintain.
|
||||
Who: Eric Miao <eric.y.miao@gmail.com>
|
||||
What: sysfs-class-rfkill state file
|
||||
When: Feb 2014
|
||||
Files: net/rfkill/core.c
|
||||
Why: Documented as obsolete since Feb 2010. This file is limited to 3
|
||||
states while the rfkill drivers can have 4 states.
|
||||
Who: anybody or Florian Mickler <florian@mickler.org>
|
||||
|
||||
----------------------------
|
||||
|
||||
What: corgi_ssp and corgi_ts driver
|
||||
When: 2.6.35
|
||||
Files: arch/arm/mach-pxa/corgi_ssp.c, drivers/input/touchscreen/corgi_ts.c
|
||||
Why: The corgi touchscreen is now deprecated in favour of the generic
|
||||
ads7846.c driver. The noise reduction technique used in corgi_ts.c,
|
||||
that's to wait till vsync before ADC sampling, is also integrated into
|
||||
ads7846 driver now. Provided that the original driver is not generic
|
||||
and is difficult to maintain, it will be removed later.
|
||||
Who: Eric Miao <eric.y.miao@gmail.com>
|
||||
What: sysfs-class-rfkill claim file
|
||||
When: Feb 2012
|
||||
Files: net/rfkill/core.c
|
||||
Why: It is not possible to claim an rfkill driver since 2007. This is
|
||||
Documented as obsolete since Feb 2010.
|
||||
Who: anybody or Florian Mickler <florian@mickler.org>
|
||||
|
||||
----------------------------
|
||||
|
||||
@ -564,6 +549,16 @@ Who: Avi Kivity <avi@redhat.com>
|
||||
|
||||
----------------------------
|
||||
|
||||
What: xtime, wall_to_monotonic
|
||||
When: 2.6.36+
|
||||
Files: kernel/time/timekeeping.c include/linux/time.h
|
||||
Why: Cleaning up timekeeping internal values. Please use
|
||||
existing timekeeping accessor functions to access
|
||||
the equivalent functionality.
|
||||
Who: John Stultz <johnstul@us.ibm.com>
|
||||
|
||||
----------------------------
|
||||
|
||||
What: KVM kernel-allocated memory slots
|
||||
When: July 2010
|
||||
Why: Since 2.6.25, kvm supports user-allocated memory slots, which are
|
||||
@ -583,15 +578,35 @@ Who: Avi Kivity <avi@redhat.com>
|
||||
|
||||
----------------------------
|
||||
|
||||
What: "acpi=ht" boot option
|
||||
When: 2.6.35
|
||||
Why: Useful in 2003, implementation is a hack.
|
||||
Generally invoked by accident today.
|
||||
Seen as doing more harm than good.
|
||||
Who: Len Brown <len.brown@intel.com>
|
||||
What: iwlwifi 50XX module parameters
|
||||
When: 2.6.40
|
||||
Why: The "..50" modules parameters were used to configure 5000 series and
|
||||
up devices; different set of module parameters also available for 4965
|
||||
with same functionalities. Consolidate both set into single place
|
||||
in drivers/net/wireless/iwlwifi/iwl-agn.c
|
||||
|
||||
Who: Wey-Yi Guy <wey-yi.w.guy@intel.com>
|
||||
|
||||
----------------------------
|
||||
|
||||
What: iwl4965 alias support
|
||||
When: 2.6.40
|
||||
Why: Internal alias support has been present in module-init-tools for some
|
||||
time, the MODULE_ALIAS("iwl4965") boilerplate aliases can be removed
|
||||
with no impact.
|
||||
|
||||
Who: Wey-Yi Guy <wey-yi.w.guy@intel.com>
|
||||
|
||||
---------------------------
|
||||
|
||||
What: xt_NOTRACK
|
||||
Files: net/netfilter/xt_NOTRACK.c
|
||||
When: April 2011
|
||||
Why: Superseded by xt_CT
|
||||
Who: Netfilter developer team <netfilter-devel@vger.kernel.org>
|
||||
|
||||
---------------------------
|
||||
|
||||
What: video4linux /dev/vtx teletext API support
|
||||
When: 2.6.35
|
||||
Files: drivers/media/video/saa5246a.c drivers/media/video/saa5249.c
|
||||
@ -612,3 +627,23 @@ Why: The vtx device nodes have been superseded by vbi device nodes
|
||||
provided by the vtx API, then that functionality should be build
|
||||
around the sliced VBI API instead.
|
||||
Who: Hans Verkuil <hverkuil@xs4all.nl>
|
||||
|
||||
----------------------------
|
||||
|
||||
What: IRQF_DISABLED
|
||||
When: 2.6.36
|
||||
Why: The flag is a NOOP as we run interrupt handlers with interrupts disabled
|
||||
Who: Thomas Gleixner <tglx@linutronix.de>
|
||||
|
||||
----------------------------
|
||||
|
||||
What: old ieee1394 subsystem (CONFIG_IEEE1394)
|
||||
When: 2.6.37
|
||||
Files: drivers/ieee1394/ except init_ohci1394_dma.c
|
||||
Why: superseded by drivers/firewire/ (CONFIG_FIREWIRE) which offers more
|
||||
features, better performance, and better security, all with smaller
|
||||
and more modern code base
|
||||
Who: Stefan Richter <stefanr@s5r6.in-berlin.de>
|
||||
|
||||
----------------------------
|
||||
|
||||
|
@ -178,7 +178,7 @@ prototypes:
|
||||
locking rules:
|
||||
All except set_page_dirty may block
|
||||
|
||||
BKL PageLocked(page) i_sem
|
||||
BKL PageLocked(page) i_mutex
|
||||
writepage: no yes, unlocks (see below)
|
||||
readpage: no yes, unlocks
|
||||
sync_page: no maybe
|
||||
@ -380,7 +380,7 @@ prototypes:
|
||||
int (*open) (struct inode *, struct file *);
|
||||
int (*flush) (struct file *);
|
||||
int (*release) (struct inode *, struct file *);
|
||||
int (*fsync) (struct file *, struct dentry *, int datasync);
|
||||
int (*fsync) (struct file *, int datasync);
|
||||
int (*aio_fsync) (struct kiocb *, int datasync);
|
||||
int (*fasync) (int, struct file *, int);
|
||||
int (*lock) (struct file *, int, struct file_lock *);
|
||||
@ -429,8 +429,9 @@ check_flags: no
|
||||
implementations. If your fs is not using generic_file_llseek, you
|
||||
need to acquire and release the appropriate locks in your ->llseek().
|
||||
For many filesystems, it is probably safe to acquire the inode
|
||||
semaphore. Note some filesystems (i.e. remote ones) provide no
|
||||
protection for i_size so you will need to use the BKL.
|
||||
mutex or just to use i_size_read() instead.
|
||||
Note: this does not protect the file->f_pos against concurrent modifications
|
||||
since this is something the userspace has to take care about.
|
||||
|
||||
Note: ext2_release() was *the* source of contention on fs-intensive
|
||||
loads and dropping BKL on ->release() helps to get rid of that (we still
|
||||
|
@ -146,7 +146,7 @@ found to be inadequate, in this case. The Generic Netlink system was
|
||||
used for this as raw Netlink would lead to a significant increase in
|
||||
complexity. There's no question that the Generic Netlink system is an
|
||||
elegant solution for common case ioctl functions but it's not a complete
|
||||
replacement probably because it's primary purpose in life is to be a
|
||||
replacement probably because its primary purpose in life is to be a
|
||||
message bus implementation rather than specifically an ioctl replacement.
|
||||
While it would be possible to work around this there is one concern
|
||||
that lead to the decision to not use it. This is that the autofs
|
||||
|
@ -90,7 +90,7 @@ Mount Options
|
||||
Specify the IP and/or port the client should bind to locally.
|
||||
There is normally not much reason to do this. If the IP is not
|
||||
specified, the client's IP address is determined by looking at the
|
||||
address it's connection to the monitor originates from.
|
||||
address its connection to the monitor originates from.
|
||||
|
||||
wsize=X
|
||||
Specify the maximum write size in bytes. By default there is no
|
||||
|
@ -47,7 +47,7 @@ You'll want to start heartbeating on a volume which all the nodes in
|
||||
your lockspace can access. The easiest way to do this is via
|
||||
ocfs2_hb_ctl (distributed with ocfs2-tools). Right now it requires
|
||||
that an OCFS2 file system be in place so that it can automatically
|
||||
find it's heartbeat area, though it will eventually support heartbeat
|
||||
find its heartbeat area, though it will eventually support heartbeat
|
||||
against raw disks.
|
||||
|
||||
Please see the ocfs2_hb_ctl and mkfs.ocfs2 manual pages distributed
|
||||
|
@ -59,8 +59,19 @@ commit=nrsec (*) Ext3 can be told to sync all its data and metadata
|
||||
Setting it to very large values will improve
|
||||
performance.
|
||||
|
||||
barrier=1 This enables/disables barriers. barrier=0 disables
|
||||
it, barrier=1 enables it.
|
||||
barrier=<0(*)|1> This enables/disables the use of write barriers in
|
||||
barrier the jbd code. barrier=0 disables, barrier=1 enables.
|
||||
nobarrier (*) This also requires an IO stack which can support
|
||||
barriers, and if jbd gets an error on a barrier
|
||||
write, it will disable again with a warning.
|
||||
Write barriers enforce proper on-disk ordering
|
||||
of journal commits, making volatile disk write caches
|
||||
safe to use, at some performance penalty. If
|
||||
your disks are battery-backed in one way or another,
|
||||
disabling barriers may safely improve performance.
|
||||
The mount options "barrier" and "nobarrier" can
|
||||
also be used to enable or disable barriers, for
|
||||
consistency with other ext3 mount options.
|
||||
|
||||
orlov (*) This enables the new Orlov block allocator. It is
|
||||
enabled by default.
|
||||
|
@ -38,7 +38,7 @@ flags, it will return EBADR and the contents of fm_flags will contain
|
||||
the set of flags which caused the error. If the kernel is compatible
|
||||
with all flags passed, the contents of fm_flags will be unmodified.
|
||||
It is up to userspace to determine whether rejection of a particular
|
||||
flag is fatal to it's operation. This scheme is intended to allow the
|
||||
flag is fatal to its operation. This scheme is intended to allow the
|
||||
fiemap interface to grow in the future but without losing
|
||||
compatibility with old software.
|
||||
|
||||
@ -56,7 +56,7 @@ If this flag is set, the kernel will sync the file before mapping extents.
|
||||
|
||||
* FIEMAP_FLAG_XATTR
|
||||
If this flag is set, the extents returned will describe the inodes
|
||||
extended attribute lookup tree, instead of it's data tree.
|
||||
extended attribute lookup tree, instead of its data tree.
|
||||
|
||||
|
||||
Extent Mapping
|
||||
@ -89,7 +89,7 @@ struct fiemap_extent {
|
||||
};
|
||||
|
||||
All offsets and lengths are in bytes and mirror those on disk. It is valid
|
||||
for an extents logical offset to start before the request or it's logical
|
||||
for an extents logical offset to start before the request or its logical
|
||||
length to extend past the request. Unless FIEMAP_EXTENT_NOT_ALIGNED is
|
||||
returned, fe_logical, fe_physical, and fe_length will be aligned to the
|
||||
block size of the file system. With the exception of extents flagged as
|
||||
@ -125,7 +125,7 @@ been allocated for the file yet.
|
||||
|
||||
* FIEMAP_EXTENT_DELALLOC
|
||||
- This will also set FIEMAP_EXTENT_UNKNOWN.
|
||||
Delayed allocation - while there is data for this extent, it's
|
||||
Delayed allocation - while there is data for this extent, its
|
||||
physical location has not been allocated yet.
|
||||
|
||||
* FIEMAP_EXTENT_ENCODED
|
||||
@ -159,7 +159,7 @@ Data is located within a meta data block.
|
||||
Data is packed into a block with data from other files.
|
||||
|
||||
* FIEMAP_EXTENT_UNWRITTEN
|
||||
Unwritten extent - the extent is allocated but it's data has not been
|
||||
Unwritten extent - the extent is allocated but its data has not been
|
||||
initialized. This indicates the extent's data will be all zero if read
|
||||
through the filesystem but the contents are undefined if read directly from
|
||||
the device.
|
||||
@ -176,7 +176,7 @@ VFS -> File System Implementation
|
||||
|
||||
File systems wishing to support fiemap must implement a ->fiemap callback on
|
||||
their inode_operations structure. The fs ->fiemap call is responsible for
|
||||
defining it's set of supported fiemap flags, and calling a helper function on
|
||||
defining its set of supported fiemap flags, and calling a helper function on
|
||||
each discovered extent:
|
||||
|
||||
struct inode_operations {
|
||||
|
@ -91,7 +91,7 @@ Mount options
|
||||
'default_permissions'
|
||||
|
||||
By default FUSE doesn't check file access permissions, the
|
||||
filesystem is free to implement it's access policy or leave it to
|
||||
filesystem is free to implement its access policy or leave it to
|
||||
the underlying file access mechanism (e.g. in case of network
|
||||
filesystems). This option enables permission checking, restricting
|
||||
access based on file mode. It is usually useful together with the
|
||||
@ -171,7 +171,7 @@ or may honor them by sending a reply to the _original_ request, with
|
||||
the error set to EINTR.
|
||||
|
||||
It is also possible that there's a race between processing the
|
||||
original request and it's INTERRUPT request. There are two possibilities:
|
||||
original request and its INTERRUPT request. There are two possibilities:
|
||||
|
||||
1) The INTERRUPT request is processed before the original request is
|
||||
processed
|
||||
|
@ -1,7 +1,7 @@
|
||||
Global File System
|
||||
------------------
|
||||
|
||||
http://sources.redhat.com/cluster/
|
||||
http://sources.redhat.com/cluster/wiki/
|
||||
|
||||
GFS is a cluster file system. It allows a cluster of computers to
|
||||
simultaneously use a block device that is shared between them (with FC,
|
||||
@ -36,11 +36,11 @@ GFS2 is not on-disk compatible with previous versions of GFS, but it
|
||||
is pretty close.
|
||||
|
||||
The following man pages can be found at the URL above:
|
||||
fsck.gfs2 to repair a filesystem
|
||||
gfs2_grow to expand a filesystem online
|
||||
gfs2_jadd to add journals to a filesystem online
|
||||
gfs2_tool to manipulate, examine and tune a filesystem
|
||||
fsck.gfs2 to repair a filesystem
|
||||
gfs2_grow to expand a filesystem online
|
||||
gfs2_jadd to add journals to a filesystem online
|
||||
gfs2_tool to manipulate, examine and tune a filesystem
|
||||
gfs2_quota to examine and change quota values in a filesystem
|
||||
gfs2_convert to convert a gfs filesystem to gfs2 in-place
|
||||
mount.gfs2 to help mount(8) mount a filesystem
|
||||
mkfs.gfs2 to make a filesystem
|
||||
mkfs.gfs2 to make a filesystem
|
||||
|
@ -103,7 +103,7 @@ to analyze or change OS2SYS.INI.
|
||||
Codepages
|
||||
|
||||
HPFS can contain several uppercasing tables for several codepages and each
|
||||
file has a pointer to codepage it's name is in. However OS/2 was created in
|
||||
file has a pointer to codepage its name is in. However OS/2 was created in
|
||||
America where people don't care much about codepages and so multiple codepages
|
||||
support is quite buggy. I have Czech OS/2 working in codepage 852 on my disk.
|
||||
Once I booted English OS/2 working in cp 850 and I created a file on my 852
|
||||
|
@ -59,7 +59,7 @@ Levels
|
||||
------
|
||||
|
||||
Garbage collection (GC) may fail if all data is written
|
||||
indiscriminately. One requirement of GC is that data is seperated
|
||||
indiscriminately. One requirement of GC is that data is separated
|
||||
roughly according to the distance between the tree root and the data.
|
||||
Effectively that means all file data is on level 0, indirect blocks
|
||||
are on levels 1, 2, 3 4 or 5 for 1x, 2x, 3x, 4x or 5x indirect blocks,
|
||||
@ -67,7 +67,7 @@ respectively. Inode file data is on level 6 for the inodes and 7-11
|
||||
for indirect blocks.
|
||||
|
||||
Each segment contains objects of a single level only. As a result,
|
||||
each level requires its own seperate segment to be open for writing.
|
||||
each level requires its own separate segment to be open for writing.
|
||||
|
||||
Inode File
|
||||
----------
|
||||
@ -106,9 +106,9 @@ Vim
|
||||
---
|
||||
|
||||
By cleverly predicting the life time of data, it is possible to
|
||||
seperate long-living data from short-living data and thereby reduce
|
||||
separate long-living data from short-living data and thereby reduce
|
||||
the GC overhead later. Each type of distinc life expectency (vim) can
|
||||
have a seperate segment open for writing. Each (level, vim) tupel can
|
||||
have a separate segment open for writing. Each (level, vim) tupel can
|
||||
be open just once. If an open segment with unknown vim is encountered
|
||||
at mount time, it is closed and ignored henceforth.
|
||||
|
||||
|
@ -137,7 +137,7 @@ NS*| OPENATTR | OPT | | Section 18.17 |
|
||||
| READ | REQ | | Section 18.22 |
|
||||
| READDIR | REQ | | Section 18.23 |
|
||||
| READLINK | OPT | | Section 18.24 |
|
||||
NS | RECLAIM_COMPLETE | REQ | | Section 18.51 |
|
||||
| RECLAIM_COMPLETE | REQ | | Section 18.51 |
|
||||
| RELEASE_LOCKOWNER | MNI | | N/A |
|
||||
| REMOVE | REQ | | Section 18.25 |
|
||||
| RENAME | REQ | | Section 18.26 |
|
||||
|
@ -185,7 +185,7 @@ failed lookup meant a definite 'no'.
|
||||
request/response format
|
||||
-----------------------
|
||||
|
||||
While each cache is free to use it's own format for requests
|
||||
While each cache is free to use its own format for requests
|
||||
and responses over channel, the following is recommended as
|
||||
appropriate and support routines are available to help:
|
||||
Each request or response record should be printable ASCII
|
||||
|
@ -50,8 +50,8 @@ NILFS2 supports the following mount options:
|
||||
(*) == default
|
||||
|
||||
nobarrier Disables barriers.
|
||||
errors=continue(*) Keep going on a filesystem error.
|
||||
errors=remount-ro Remount the filesystem read-only on an error.
|
||||
errors=continue Keep going on a filesystem error.
|
||||
errors=remount-ro(*) Remount the filesystem read-only on an error.
|
||||
errors=panic Panic and halt the machine if an error occurs.
|
||||
cp=n Specify the checkpoint-number of the snapshot to be
|
||||
mounted. Checkpoints and snapshots are listed by lscp
|
||||
|
@ -80,3 +80,10 @@ user_xattr (*) Enables Extended User Attributes.
|
||||
nouser_xattr Disables Extended User Attributes.
|
||||
acl Enables POSIX Access Control Lists support.
|
||||
noacl (*) Disables POSIX Access Control Lists support.
|
||||
resv_level=2 (*) Set how agressive allocation reservations will be.
|
||||
Valid values are between 0 (reservations off) to 8
|
||||
(maximum space for reservations).
|
||||
dir_resv_level= (*) By default, directory reservations will scale with file
|
||||
reservations - users should rarely need to change this
|
||||
value. If allocation reservations are turned off, this
|
||||
option will have no effect.
|
||||
|
@ -305,7 +305,7 @@ Table 1-4: Contents of the stat files (as of 2.6.30-rc7)
|
||||
cgtime guest time of the task children in jiffies
|
||||
..............................................................................
|
||||
|
||||
The /proc/PID/map file containing the currently mapped memory regions and
|
||||
The /proc/PID/maps file containing the currently mapped memory regions and
|
||||
their access permissions.
|
||||
|
||||
The format is:
|
||||
@ -565,6 +565,10 @@ The default_smp_affinity mask applies to all non-active IRQs, which are the
|
||||
IRQs which have not yet been allocated/activated, and hence which lack a
|
||||
/proc/irq/[0-9]* directory.
|
||||
|
||||
The node file on an SMP system shows the node to which the device using the IRQ
|
||||
reports itself as being attached. This hardware locality information does not
|
||||
include information about any possible driver locality preference.
|
||||
|
||||
prof_cpu_mask specifies which CPUs are to be profiled by the system wide
|
||||
profiler. Default value is ffffffff (all cpus).
|
||||
|
||||
@ -964,7 +968,7 @@ your system and how much traffic was routed over those devices:
|
||||
...] 1375103 17405 0 0 0 0 0 0
|
||||
...] 1703981 5535 0 0 0 3 0 0
|
||||
|
||||
In addition, each Channel Bond interface has it's own directory. For
|
||||
In addition, each Channel Bond interface has its own directory. For
|
||||
example, the bond0 device will have a directory called /proc/net/bond0/.
|
||||
It will contain information that is specific to that bond, such as the
|
||||
current slaves of the bond, the link status of the slaves, and how
|
||||
@ -1361,7 +1365,7 @@ been accounted as having caused 1MB of write.
|
||||
In other words: The number of bytes which this process caused to not happen,
|
||||
by truncating pagecache. A task can cause "negative" IO too. If this task
|
||||
truncates some dirty pagecache, some IO which another task has been accounted
|
||||
for (in it's write_bytes) will not be happening. We _could_ just subtract that
|
||||
for (in its write_bytes) will not be happening. We _could_ just subtract that
|
||||
from the truncating task's write_bytes, but there is information loss in doing
|
||||
that.
|
||||
|
||||
|
@ -3,6 +3,6 @@ protocol used by Windows for Workgroups, Windows 95 and Windows NT.
|
||||
Smbfs was inspired by Samba, the program written by Andrew Tridgell
|
||||
that turns any Unix host into a file server for DOS or Windows clients.
|
||||
|
||||
Smbfs is a SMB client, but uses parts of samba for it's operation. For
|
||||
Smbfs is a SMB client, but uses parts of samba for its operation. For
|
||||
more info on samba, including documentation, please go to
|
||||
http://www.samba.org/ and then on to your nearest mirror.
|
||||
|
@ -38,7 +38,8 @@ Hard link support: yes no
|
||||
Real inode numbers: yes no
|
||||
32-bit uids/gids: yes no
|
||||
File creation time: yes no
|
||||
Xattr and ACL support: no no
|
||||
Xattr support: yes no
|
||||
ACL support: no no
|
||||
|
||||
Squashfs compresses data, inodes and directories. In addition, inode and
|
||||
directory data are highly compacted, and packed on byte boundaries. Each
|
||||
@ -58,7 +59,7 @@ obtained from this site also.
|
||||
3. SQUASHFS FILESYSTEM DESIGN
|
||||
-----------------------------
|
||||
|
||||
A squashfs filesystem consists of seven parts, packed together on a byte
|
||||
A squashfs filesystem consists of a maximum of eight parts, packed together on a byte
|
||||
alignment:
|
||||
|
||||
---------------
|
||||
@ -80,6 +81,9 @@ alignment:
|
||||
|---------------|
|
||||
| uid/gid |
|
||||
| lookup table |
|
||||
|---------------|
|
||||
| xattr |
|
||||
| table |
|
||||
---------------
|
||||
|
||||
Compressed data blocks are written to the filesystem as files are read from
|
||||
@ -192,6 +196,26 @@ This table is stored compressed into metadata blocks. A second index table is
|
||||
used to locate these. This second index table for speed of access (and because
|
||||
it is small) is read at mount time and cached in memory.
|
||||
|
||||
3.7 Xattr table
|
||||
---------------
|
||||
|
||||
The xattr table contains extended attributes for each inode. The xattrs
|
||||
for each inode are stored in a list, each list entry containing a type,
|
||||
name and value field. The type field encodes the xattr prefix
|
||||
("user.", "trusted." etc) and it also encodes how the name/value fields
|
||||
should be interpreted. Currently the type indicates whether the value
|
||||
is stored inline (in which case the value field contains the xattr value),
|
||||
or if it is stored out of line (in which case the value field stores a
|
||||
reference to where the actual value is stored). This allows large values
|
||||
to be stored out of line improving scanning and lookup performance and it
|
||||
also allows values to be de-duplicated, the value being stored once, and
|
||||
all other occurences holding an out of line reference to that value.
|
||||
|
||||
The xattr lists are packed into compressed 8K metadata blocks.
|
||||
To reduce overhead in inodes, rather than storing the on-disk
|
||||
location of the xattr list inside each inode, a 32-bit xattr id
|
||||
is stored. This xattr id is mapped into the location of the xattr
|
||||
list using a second xattr id lookup table.
|
||||
|
||||
4. TODOS AND OUTSTANDING ISSUES
|
||||
-------------------------------
|
||||
@ -199,9 +223,7 @@ it is small) is read at mount time and cached in memory.
|
||||
4.1 Todo list
|
||||
-------------
|
||||
|
||||
Implement Xattr and ACL support. The Squashfs 4.0 filesystem layout has hooks
|
||||
for these but the code has not been written. Once the code has been written
|
||||
the existing layout should not require modification.
|
||||
Implement ACL support.
|
||||
|
||||
4.2 Squashfs internal cache
|
||||
---------------------------
|
||||
|
42
Documentation/filesystems/sysfs-tagging.txt
Normal file
42
Documentation/filesystems/sysfs-tagging.txt
Normal file
@ -0,0 +1,42 @@
|
||||
Sysfs tagging
|
||||
-------------
|
||||
|
||||
(Taken almost verbatim from Eric Biederman's netns tagging patch
|
||||
commit msg)
|
||||
|
||||
The problem. Network devices show up in sysfs and with the network
|
||||
namespace active multiple devices with the same name can show up in
|
||||
the same directory, ouch!
|
||||
|
||||
To avoid that problem and allow existing applications in network
|
||||
namespaces to see the same interface that is currently presented in
|
||||
sysfs, sysfs now has tagging directory support.
|
||||
|
||||
By using the network namespace pointers as tags to separate out the
|
||||
the sysfs directory entries we ensure that we don't have conflicts
|
||||
in the directories and applications only see a limited set of
|
||||
the network devices.
|
||||
|
||||
Each sysfs directory entry may be tagged with zero or one
|
||||
namespaces. A sysfs_dirent is augmented with a void *s_ns. If a
|
||||
directory entry is tagged, then sysfs_dirent->s_flags will have a
|
||||
flag between KOBJ_NS_TYPE_NONE and KOBJ_NS_TYPES, and s_ns will
|
||||
point to the namespace to which it belongs.
|
||||
|
||||
Each sysfs superblock's sysfs_super_info contains an array void
|
||||
*ns[KOBJ_NS_TYPES]. When a a task in a tagging namespace
|
||||
kobj_nstype first mounts sysfs, a new superblock is created. It
|
||||
will be differentiated from other sysfs mounts by having its
|
||||
s_fs_info->ns[kobj_nstype] set to the new namespace. Note that
|
||||
through bind mounting and mounts propagation, a task can easily view
|
||||
the contents of other namespaces' sysfs mounts. Therefore, when a
|
||||
namespace exits, it will call kobj_ns_exit() to invalidate any
|
||||
sysfs_dirent->s_ns pointers pointing to it.
|
||||
|
||||
Users of this interface:
|
||||
- define a type in the kobj_ns_type enumeration.
|
||||
- call kobj_ns_type_register() with its kobj_ns_type_operations which has
|
||||
- current_ns() which returns current's namespace
|
||||
- netlink_ns() which returns a socket's namespace
|
||||
- initial_ns() which returns the initial namesapce
|
||||
- call kobj_ns_exit() when an individual tag is no longer valid
|
@ -94,11 +94,19 @@ NodeList format is a comma-separated list of decimal numbers and ranges,
|
||||
a range being two hyphen-separated decimal numbers, the smallest and
|
||||
largest node numbers in the range. For example, mpol=bind:0-3,5,7,9-15
|
||||
|
||||
A memory policy with a valid NodeList will be saved, as specified, for
|
||||
use at file creation time. When a task allocates a file in the file
|
||||
system, the mount option memory policy will be applied with a NodeList,
|
||||
if any, modified by the calling task's cpuset constraints
|
||||
[See Documentation/cgroups/cpusets.txt] and any optional flags, listed
|
||||
below. If the resulting NodeLists is the empty set, the effective memory
|
||||
policy for the file will revert to "default" policy.
|
||||
|
||||
NUMA memory allocation policies have optional flags that can be used in
|
||||
conjunction with their modes. These optional flags can be specified
|
||||
when tmpfs is mounted by appending them to the mode before the NodeList.
|
||||
See Documentation/vm/numa_memory_policy.txt for a list of all available
|
||||
memory allocation policy mode flags.
|
||||
memory allocation policy mode flags and their effect on memory policy.
|
||||
|
||||
=static is equivalent to MPOL_F_STATIC_NODES
|
||||
=relative is equivalent to MPOL_F_RELATIVE_NODES
|
||||
|
@ -72,7 +72,7 @@ structure (this is the kernel-side implementation of file
|
||||
descriptors). The freshly allocated file structure is initialized with
|
||||
a pointer to the dentry and a set of file operation member functions.
|
||||
These are taken from the inode data. The open() file method is then
|
||||
called so the specific filesystem implementation can do it's work. You
|
||||
called so the specific filesystem implementation can do its work. You
|
||||
can see that this is another switch performed by the VFS. The file
|
||||
structure is placed into the file descriptor table for the process.
|
||||
|
||||
@ -401,11 +401,16 @@ otherwise noted.
|
||||
started might not be in the page cache at the end of the
|
||||
walk).
|
||||
|
||||
truncate: called by the VFS to change the size of a file. The
|
||||
truncate: Deprecated. This will not be called if ->setsize is defined.
|
||||
Called by the VFS to change the size of a file. The
|
||||
i_size field of the inode is set to the desired size by the
|
||||
VFS before this method is called. This method is called by
|
||||
the truncate(2) system call and related functionality.
|
||||
|
||||
Note: ->truncate and vmtruncate are deprecated. Do not add new
|
||||
instances/calls of these. Filesystems should be converted to do their
|
||||
truncate sequence via ->setattr().
|
||||
|
||||
permission: called by the VFS to check for access rights on a POSIX-like
|
||||
filesystem.
|
||||
|
||||
@ -729,7 +734,7 @@ struct file_operations {
|
||||
int (*open) (struct inode *, struct file *);
|
||||
int (*flush) (struct file *);
|
||||
int (*release) (struct inode *, struct file *);
|
||||
int (*fsync) (struct file *, struct dentry *, int datasync);
|
||||
int (*fsync) (struct file *, int datasync);
|
||||
int (*aio_fsync) (struct kiocb *, int datasync);
|
||||
int (*fasync) (int, struct file *, int);
|
||||
int (*lock) (struct file *, int, struct file_lock *);
|
||||
|
816
Documentation/filesystems/xfs-delayed-logging-design.txt
Normal file
816
Documentation/filesystems/xfs-delayed-logging-design.txt
Normal file
@ -0,0 +1,816 @@
|
||||
XFS Delayed Logging Design
|
||||
--------------------------
|
||||
|
||||
Introduction to Re-logging in XFS
|
||||
---------------------------------
|
||||
|
||||
XFS logging is a combination of logical and physical logging. Some objects,
|
||||
such as inodes and dquots, are logged in logical format where the details
|
||||
logged are made up of the changes to in-core structures rather than on-disk
|
||||
structures. Other objects - typically buffers - have their physical changes
|
||||
logged. The reason for these differences is to reduce the amount of log space
|
||||
required for objects that are frequently logged. Some parts of inodes are more
|
||||
frequently logged than others, and inodes are typically more frequently logged
|
||||
than any other object (except maybe the superblock buffer) so keeping the
|
||||
amount of metadata logged low is of prime importance.
|
||||
|
||||
The reason that this is such a concern is that XFS allows multiple separate
|
||||
modifications to a single object to be carried in the log at any given time.
|
||||
This allows the log to avoid needing to flush each change to disk before
|
||||
recording a new change to the object. XFS does this via a method called
|
||||
"re-logging". Conceptually, this is quite simple - all it requires is that any
|
||||
new change to the object is recorded with a *new copy* of all the existing
|
||||
changes in the new transaction that is written to the log.
|
||||
|
||||
That is, if we have a sequence of changes A through to F, and the object was
|
||||
written to disk after change D, we would see in the log the following series
|
||||
of transactions, their contents and the log sequence number (LSN) of the
|
||||
transaction:
|
||||
|
||||
Transaction Contents LSN
|
||||
A A X
|
||||
B A+B X+n
|
||||
C A+B+C X+n+m
|
||||
D A+B+C+D X+n+m+o
|
||||
<object written to disk>
|
||||
E E Y (> X+n+m+o)
|
||||
F E+F Yٍ+p
|
||||
|
||||
In other words, each time an object is relogged, the new transaction contains
|
||||
the aggregation of all the previous changes currently held only in the log.
|
||||
|
||||
This relogging technique also allows objects to be moved forward in the log so
|
||||
that an object being relogged does not prevent the tail of the log from ever
|
||||
moving forward. This can be seen in the table above by the changing
|
||||
(increasing) LSN of each subsquent transaction - the LSN is effectively a
|
||||
direct encoding of the location in the log of the transaction.
|
||||
|
||||
This relogging is also used to implement long-running, multiple-commit
|
||||
transactions. These transaction are known as rolling transactions, and require
|
||||
a special log reservation known as a permanent transaction reservation. A
|
||||
typical example of a rolling transaction is the removal of extents from an
|
||||
inode which can only be done at a rate of two extents per transaction because
|
||||
of reservation size limitations. Hence a rolling extent removal transaction
|
||||
keeps relogging the inode and btree buffers as they get modified in each
|
||||
removal operation. This keeps them moving forward in the log as the operation
|
||||
progresses, ensuring that current operation never gets blocked by itself if the
|
||||
log wraps around.
|
||||
|
||||
Hence it can be seen that the relogging operation is fundamental to the correct
|
||||
working of the XFS journalling subsystem. From the above description, most
|
||||
people should be able to see why the XFS metadata operations writes so much to
|
||||
the log - repeated operations to the same objects write the same changes to
|
||||
the log over and over again. Worse is the fact that objects tend to get
|
||||
dirtier as they get relogged, so each subsequent transaction is writing more
|
||||
metadata into the log.
|
||||
|
||||
Another feature of the XFS transaction subsystem is that most transactions are
|
||||
asynchronous. That is, they don't commit to disk until either a log buffer is
|
||||
filled (a log buffer can hold multiple transactions) or a synchronous operation
|
||||
forces the log buffers holding the transactions to disk. This means that XFS is
|
||||
doing aggregation of transactions in memory - batching them, if you like - to
|
||||
minimise the impact of the log IO on transaction throughput.
|
||||
|
||||
The limitation on asynchronous transaction throughput is the number and size of
|
||||
log buffers made available by the log manager. By default there are 8 log
|
||||
buffers available and the size of each is 32kB - the size can be increased up
|
||||
to 256kB by use of a mount option.
|
||||
|
||||
Effectively, this gives us the maximum bound of outstanding metadata changes
|
||||
that can be made to the filesystem at any point in time - if all the log
|
||||
buffers are full and under IO, then no more transactions can be committed until
|
||||
the current batch completes. It is now common for a single current CPU core to
|
||||
be to able to issue enough transactions to keep the log buffers full and under
|
||||
IO permanently. Hence the XFS journalling subsystem can be considered to be IO
|
||||
bound.
|
||||
|
||||
Delayed Logging: Concepts
|
||||
-------------------------
|
||||
|
||||
The key thing to note about the asynchronous logging combined with the
|
||||
relogging technique XFS uses is that we can be relogging changed objects
|
||||
multiple times before they are committed to disk in the log buffers. If we
|
||||
return to the previous relogging example, it is entirely possible that
|
||||
transactions A through D are committed to disk in the same log buffer.
|
||||
|
||||
That is, a single log buffer may contain multiple copies of the same object,
|
||||
but only one of those copies needs to be there - the last one "D", as it
|
||||
contains all the changes from the previous changes. In other words, we have one
|
||||
necessary copy in the log buffer, and three stale copies that are simply
|
||||
wasting space. When we are doing repeated operations on the same set of
|
||||
objects, these "stale objects" can be over 90% of the space used in the log
|
||||
buffers. It is clear that reducing the number of stale objects written to the
|
||||
log would greatly reduce the amount of metadata we write to the log, and this
|
||||
is the fundamental goal of delayed logging.
|
||||
|
||||
From a conceptual point of view, XFS is already doing relogging in memory (where
|
||||
memory == log buffer), only it is doing it extremely inefficiently. It is using
|
||||
logical to physical formatting to do the relogging because there is no
|
||||
infrastructure to keep track of logical changes in memory prior to physically
|
||||
formatting the changes in a transaction to the log buffer. Hence we cannot avoid
|
||||
accumulating stale objects in the log buffers.
|
||||
|
||||
Delayed logging is the name we've given to keeping and tracking transactional
|
||||
changes to objects in memory outside the log buffer infrastructure. Because of
|
||||
the relogging concept fundamental to the XFS journalling subsystem, this is
|
||||
actually relatively easy to do - all the changes to logged items are already
|
||||
tracked in the current infrastructure. The big problem is how to accumulate
|
||||
them and get them to the log in a consistent, recoverable manner.
|
||||
Describing the problems and how they have been solved is the focus of this
|
||||
document.
|
||||
|
||||
One of the key changes that delayed logging makes to the operation of the
|
||||
journalling subsystem is that it disassociates the amount of outstanding
|
||||
metadata changes from the size and number of log buffers available. In other
|
||||
words, instead of there only being a maximum of 2MB of transaction changes not
|
||||
written to the log at any point in time, there may be a much greater amount
|
||||
being accumulated in memory. Hence the potential for loss of metadata on a
|
||||
crash is much greater than for the existing logging mechanism.
|
||||
|
||||
It should be noted that this does not change the guarantee that log recovery
|
||||
will result in a consistent filesystem. What it does mean is that as far as the
|
||||
recovered filesystem is concerned, there may be many thousands of transactions
|
||||
that simply did not occur as a result of the crash. This makes it even more
|
||||
important that applications that care about their data use fsync() where they
|
||||
need to ensure application level data integrity is maintained.
|
||||
|
||||
It should be noted that delayed logging is not an innovative new concept that
|
||||
warrants rigorous proofs to determine whether it is correct or not. The method
|
||||
of accumulating changes in memory for some period before writing them to the
|
||||
log is used effectively in many filesystems including ext3 and ext4. Hence
|
||||
no time is spent in this document trying to convince the reader that the
|
||||
concept is sound. Instead it is simply considered a "solved problem" and as
|
||||
such implementing it in XFS is purely an exercise in software engineering.
|
||||
|
||||
The fundamental requirements for delayed logging in XFS are simple:
|
||||
|
||||
1. Reduce the amount of metadata written to the log by at least
|
||||
an order of magnitude.
|
||||
2. Supply sufficient statistics to validate Requirement #1.
|
||||
3. Supply sufficient new tracing infrastructure to be able to debug
|
||||
problems with the new code.
|
||||
4. No on-disk format change (metadata or log format).
|
||||
5. Enable and disable with a mount option.
|
||||
6. No performance regressions for synchronous transaction workloads.
|
||||
|
||||
Delayed Logging: Design
|
||||
-----------------------
|
||||
|
||||
Storing Changes
|
||||
|
||||
The problem with accumulating changes at a logical level (i.e. just using the
|
||||
existing log item dirty region tracking) is that when it comes to writing the
|
||||
changes to the log buffers, we need to ensure that the object we are formatting
|
||||
is not changing while we do this. This requires locking the object to prevent
|
||||
concurrent modification. Hence flushing the logical changes to the log would
|
||||
require us to lock every object, format them, and then unlock them again.
|
||||
|
||||
This introduces lots of scope for deadlocks with transactions that are already
|
||||
running. For example, a transaction has object A locked and modified, but needs
|
||||
the delayed logging tracking lock to commit the transaction. However, the
|
||||
flushing thread has the delayed logging tracking lock already held, and is
|
||||
trying to get the lock on object A to flush it to the log buffer. This appears
|
||||
to be an unsolvable deadlock condition, and it was solving this problem that
|
||||
was the barrier to implementing delayed logging for so long.
|
||||
|
||||
The solution is relatively simple - it just took a long time to recognise it.
|
||||
Put simply, the current logging code formats the changes to each item into an
|
||||
vector array that points to the changed regions in the item. The log write code
|
||||
simply copies the memory these vectors point to into the log buffer during
|
||||
transaction commit while the item is locked in the transaction. Instead of
|
||||
using the log buffer as the destination of the formatting code, we can use an
|
||||
allocated memory buffer big enough to fit the formatted vector.
|
||||
|
||||
If we then copy the vector into the memory buffer and rewrite the vector to
|
||||
point to the memory buffer rather than the object itself, we now have a copy of
|
||||
the changes in a format that is compatible with the log buffer writing code.
|
||||
that does not require us to lock the item to access. This formatting and
|
||||
rewriting can all be done while the object is locked during transaction commit,
|
||||
resulting in a vector that is transactionally consistent and can be accessed
|
||||
without needing to lock the owning item.
|
||||
|
||||
Hence we avoid the need to lock items when we need to flush outstanding
|
||||
asynchronous transactions to the log. The differences between the existing
|
||||
formatting method and the delayed logging formatting can be seen in the
|
||||
diagram below.
|
||||
|
||||
Current format log vector:
|
||||
|
||||
Object +---------------------------------------------+
|
||||
Vector 1 +----+
|
||||
Vector 2 +----+
|
||||
Vector 3 +----------+
|
||||
|
||||
After formatting:
|
||||
|
||||
Log Buffer +-V1-+-V2-+----V3----+
|
||||
|
||||
Delayed logging vector:
|
||||
|
||||
Object +---------------------------------------------+
|
||||
Vector 1 +----+
|
||||
Vector 2 +----+
|
||||
Vector 3 +----------+
|
||||
|
||||
After formatting:
|
||||
|
||||
Memory Buffer +-V1-+-V2-+----V3----+
|
||||
Vector 1 +----+
|
||||
Vector 2 +----+
|
||||
Vector 3 +----------+
|
||||
|
||||
The memory buffer and associated vector need to be passed as a single object,
|
||||
but still need to be associated with the parent object so if the object is
|
||||
relogged we can replace the current memory buffer with a new memory buffer that
|
||||
contains the latest changes.
|
||||
|
||||
The reason for keeping the vector around after we've formatted the memory
|
||||
buffer is to support splitting vectors across log buffer boundaries correctly.
|
||||
If we don't keep the vector around, we do not know where the region boundaries
|
||||
are in the item, so we'd need a new encapsulation method for regions in the log
|
||||
buffer writing (i.e. double encapsulation). This would be an on-disk format
|
||||
change and as such is not desirable. It also means we'd have to write the log
|
||||
region headers in the formatting stage, which is problematic as there is per
|
||||
region state that needs to be placed into the headers during the log write.
|
||||
|
||||
Hence we need to keep the vector, but by attaching the memory buffer to it and
|
||||
rewriting the vector addresses to point at the memory buffer we end up with a
|
||||
self-describing object that can be passed to the log buffer write code to be
|
||||
handled in exactly the same manner as the existing log vectors are handled.
|
||||
Hence we avoid needing a new on-disk format to handle items that have been
|
||||
relogged in memory.
|
||||
|
||||
|
||||
Tracking Changes
|
||||
|
||||
Now that we can record transactional changes in memory in a form that allows
|
||||
them to be used without limitations, we need to be able to track and accumulate
|
||||
them so that they can be written to the log at some later point in time. The
|
||||
log item is the natural place to store this vector and buffer, and also makes sense
|
||||
to be the object that is used to track committed objects as it will always
|
||||
exist once the object has been included in a transaction.
|
||||
|
||||
The log item is already used to track the log items that have been written to
|
||||
the log but not yet written to disk. Such log items are considered "active"
|
||||
and as such are stored in the Active Item List (AIL) which is a LSN-ordered
|
||||
double linked list. Items are inserted into this list during log buffer IO
|
||||
completion, after which they are unpinned and can be written to disk. An object
|
||||
that is in the AIL can be relogged, which causes the object to be pinned again
|
||||
and then moved forward in the AIL when the log buffer IO completes for that
|
||||
transaction.
|
||||
|
||||
Essentially, this shows that an item that is in the AIL can still be modified
|
||||
and relogged, so any tracking must be separate to the AIL infrastructure. As
|
||||
such, we cannot reuse the AIL list pointers for tracking committed items, nor
|
||||
can we store state in any field that is protected by the AIL lock. Hence the
|
||||
committed item tracking needs it's own locks, lists and state fields in the log
|
||||
item.
|
||||
|
||||
Similar to the AIL, tracking of committed items is done through a new list
|
||||
called the Committed Item List (CIL). The list tracks log items that have been
|
||||
committed and have formatted memory buffers attached to them. It tracks objects
|
||||
in transaction commit order, so when an object is relogged it is removed from
|
||||
it's place in the list and re-inserted at the tail. This is entirely arbitrary
|
||||
and done to make it easy for debugging - the last items in the list are the
|
||||
ones that are most recently modified. Ordering of the CIL is not necessary for
|
||||
transactional integrity (as discussed in the next section) so the ordering is
|
||||
done for convenience/sanity of the developers.
|
||||
|
||||
|
||||
Delayed Logging: Checkpoints
|
||||
|
||||
When we have a log synchronisation event, commonly known as a "log force",
|
||||
all the items in the CIL must be written into the log via the log buffers.
|
||||
We need to write these items in the order that they exist in the CIL, and they
|
||||
need to be written as an atomic transaction. The need for all the objects to be
|
||||
written as an atomic transaction comes from the requirements of relogging and
|
||||
log replay - all the changes in all the objects in a given transaction must
|
||||
either be completely replayed during log recovery, or not replayed at all. If
|
||||
a transaction is not replayed because it is not complete in the log, then
|
||||
no later transactions should be replayed, either.
|
||||
|
||||
To fulfill this requirement, we need to write the entire CIL in a single log
|
||||
transaction. Fortunately, the XFS log code has no fixed limit on the size of a
|
||||
transaction, nor does the log replay code. The only fundamental limit is that
|
||||
the transaction cannot be larger than just under half the size of the log. The
|
||||
reason for this limit is that to find the head and tail of the log, there must
|
||||
be at least one complete transaction in the log at any given time. If a
|
||||
transaction is larger than half the log, then there is the possibility that a
|
||||
crash during the write of a such a transaction could partially overwrite the
|
||||
only complete previous transaction in the log. This will result in a recovery
|
||||
failure and an inconsistent filesystem and hence we must enforce the maximum
|
||||
size of a checkpoint to be slightly less than a half the log.
|
||||
|
||||
Apart from this size requirement, a checkpoint transaction looks no different
|
||||
to any other transaction - it contains a transaction header, a series of
|
||||
formatted log items and a commit record at the tail. From a recovery
|
||||
perspective, the checkpoint transaction is also no different - just a lot
|
||||
bigger with a lot more items in it. The worst case effect of this is that we
|
||||
might need to tune the recovery transaction object hash size.
|
||||
|
||||
Because the checkpoint is just another transaction and all the changes to log
|
||||
items are stored as log vectors, we can use the existing log buffer writing
|
||||
code to write the changes into the log. To do this efficiently, we need to
|
||||
minimise the time we hold the CIL locked while writing the checkpoint
|
||||
transaction. The current log write code enables us to do this easily with the
|
||||
way it separates the writing of the transaction contents (the log vectors) from
|
||||
the transaction commit record, but tracking this requires us to have a
|
||||
per-checkpoint context that travels through the log write process through to
|
||||
checkpoint completion.
|
||||
|
||||
Hence a checkpoint has a context that tracks the state of the current
|
||||
checkpoint from initiation to checkpoint completion. A new context is initiated
|
||||
at the same time a checkpoint transaction is started. That is, when we remove
|
||||
all the current items from the CIL during a checkpoint operation, we move all
|
||||
those changes into the current checkpoint context. We then initialise a new
|
||||
context and attach that to the CIL for aggregation of new transactions.
|
||||
|
||||
This allows us to unlock the CIL immediately after transfer of all the
|
||||
committed items and effectively allow new transactions to be issued while we
|
||||
are formatting the checkpoint into the log. It also allows concurrent
|
||||
checkpoints to be written into the log buffers in the case of log force heavy
|
||||
workloads, just like the existing transaction commit code does. This, however,
|
||||
requires that we strictly order the commit records in the log so that
|
||||
checkpoint sequence order is maintained during log replay.
|
||||
|
||||
To ensure that we can be writing an item into a checkpoint transaction at
|
||||
the same time another transaction modifies the item and inserts the log item
|
||||
into the new CIL, then checkpoint transaction commit code cannot use log items
|
||||
to store the list of log vectors that need to be written into the transaction.
|
||||
Hence log vectors need to be able to be chained together to allow them to be
|
||||
detatched from the log items. That is, when the CIL is flushed the memory
|
||||
buffer and log vector attached to each log item needs to be attached to the
|
||||
checkpoint context so that the log item can be released. In diagrammatic form,
|
||||
the CIL would look like this before the flush:
|
||||
|
||||
CIL Head
|
||||
|
|
||||
V
|
||||
Log Item <-> log vector 1 -> memory buffer
|
||||
| -> vector array
|
||||
V
|
||||
Log Item <-> log vector 2 -> memory buffer
|
||||
| -> vector array
|
||||
V
|
||||
......
|
||||
|
|
||||
V
|
||||
Log Item <-> log vector N-1 -> memory buffer
|
||||
| -> vector array
|
||||
V
|
||||
Log Item <-> log vector N -> memory buffer
|
||||
-> vector array
|
||||
|
||||
And after the flush the CIL head is empty, and the checkpoint context log
|
||||
vector list would look like:
|
||||
|
||||
Checkpoint Context
|
||||
|
|
||||
V
|
||||
log vector 1 -> memory buffer
|
||||
| -> vector array
|
||||
| -> Log Item
|
||||
V
|
||||
log vector 2 -> memory buffer
|
||||
| -> vector array
|
||||
| -> Log Item
|
||||
V
|
||||
......
|
||||
|
|
||||
V
|
||||
log vector N-1 -> memory buffer
|
||||
| -> vector array
|
||||
| -> Log Item
|
||||
V
|
||||
log vector N -> memory buffer
|
||||
-> vector array
|
||||
-> Log Item
|
||||
|
||||
Once this transfer is done, the CIL can be unlocked and new transactions can
|
||||
start, while the checkpoint flush code works over the log vector chain to
|
||||
commit the checkpoint.
|
||||
|
||||
Once the checkpoint is written into the log buffers, the checkpoint context is
|
||||
attached to the log buffer that the commit record was written to along with a
|
||||
completion callback. Log IO completion will call that callback, which can then
|
||||
run transaction committed processing for the log items (i.e. insert into AIL
|
||||
and unpin) in the log vector chain and then free the log vector chain and
|
||||
checkpoint context.
|
||||
|
||||
Discussion Point: I am uncertain as to whether the log item is the most
|
||||
efficient way to track vectors, even though it seems like the natural way to do
|
||||
it. The fact that we walk the log items (in the CIL) just to chain the log
|
||||
vectors and break the link between the log item and the log vector means that
|
||||
we take a cache line hit for the log item list modification, then another for
|
||||
the log vector chaining. If we track by the log vectors, then we only need to
|
||||
break the link between the log item and the log vector, which means we should
|
||||
dirty only the log item cachelines. Normally I wouldn't be concerned about one
|
||||
vs two dirty cachelines except for the fact I've seen upwards of 80,000 log
|
||||
vectors in one checkpoint transaction. I'd guess this is a "measure and
|
||||
compare" situation that can be done after a working and reviewed implementation
|
||||
is in the dev tree....
|
||||
|
||||
Delayed Logging: Checkpoint Sequencing
|
||||
|
||||
One of the key aspects of the XFS transaction subsystem is that it tags
|
||||
committed transactions with the log sequence number of the transaction commit.
|
||||
This allows transactions to be issued asynchronously even though there may be
|
||||
future operations that cannot be completed until that transaction is fully
|
||||
committed to the log. In the rare case that a dependent operation occurs (e.g.
|
||||
re-using a freed metadata extent for a data extent), a special, optimised log
|
||||
force can be issued to force the dependent transaction to disk immediately.
|
||||
|
||||
To do this, transactions need to record the LSN of the commit record of the
|
||||
transaction. This LSN comes directly from the log buffer the transaction is
|
||||
written into. While this works just fine for the existing transaction
|
||||
mechanism, it does not work for delayed logging because transactions are not
|
||||
written directly into the log buffers. Hence some other method of sequencing
|
||||
transactions is required.
|
||||
|
||||
As discussed in the checkpoint section, delayed logging uses per-checkpoint
|
||||
contexts, and as such it is simple to assign a sequence number to each
|
||||
checkpoint. Because the switching of checkpoint contexts must be done
|
||||
atomically, it is simple to ensure that each new context has a monotonically
|
||||
increasing sequence number assigned to it without the need for an external
|
||||
atomic counter - we can just take the current context sequence number and add
|
||||
one to it for the new context.
|
||||
|
||||
Then, instead of assigning a log buffer LSN to the transaction commit LSN
|
||||
during the commit, we can assign the current checkpoint sequence. This allows
|
||||
operations that track transactions that have not yet completed know what
|
||||
checkpoint sequence needs to be committed before they can continue. As a
|
||||
result, the code that forces the log to a specific LSN now needs to ensure that
|
||||
the log forces to a specific checkpoint.
|
||||
|
||||
To ensure that we can do this, we need to track all the checkpoint contexts
|
||||
that are currently committing to the log. When we flush a checkpoint, the
|
||||
context gets added to a "committing" list which can be searched. When a
|
||||
checkpoint commit completes, it is removed from the committing list. Because
|
||||
the checkpoint context records the LSN of the commit record for the checkpoint,
|
||||
we can also wait on the log buffer that contains the commit record, thereby
|
||||
using the existing log force mechanisms to execute synchronous forces.
|
||||
|
||||
It should be noted that the synchronous forces may need to be extended with
|
||||
mitigation algorithms similar to the current log buffer code to allow
|
||||
aggregation of multiple synchronous transactions if there are already
|
||||
synchronous transactions being flushed. Investigation of the performance of the
|
||||
current design is needed before making any decisions here.
|
||||
|
||||
The main concern with log forces is to ensure that all the previous checkpoints
|
||||
are also committed to disk before the one we need to wait for. Therefore we
|
||||
need to check that all the prior contexts in the committing list are also
|
||||
complete before waiting on the one we need to complete. We do this
|
||||
synchronisation in the log force code so that we don't need to wait anywhere
|
||||
else for such serialisation - it only matters when we do a log force.
|
||||
|
||||
The only remaining complexity is that a log force now also has to handle the
|
||||
case where the forcing sequence number is the same as the current context. That
|
||||
is, we need to flush the CIL and potentially wait for it to complete. This is a
|
||||
simple addition to the existing log forcing code to check the sequence numbers
|
||||
and push if required. Indeed, placing the current sequence checkpoint flush in
|
||||
the log force code enables the current mechanism for issuing synchronous
|
||||
transactions to remain untouched (i.e. commit an asynchronous transaction, then
|
||||
force the log at the LSN of that transaction) and so the higher level code
|
||||
behaves the same regardless of whether delayed logging is being used or not.
|
||||
|
||||
Delayed Logging: Checkpoint Log Space Accounting
|
||||
|
||||
The big issue for a checkpoint transaction is the log space reservation for the
|
||||
transaction. We don't know how big a checkpoint transaction is going to be
|
||||
ahead of time, nor how many log buffers it will take to write out, nor the
|
||||
number of split log vector regions are going to be used. We can track the
|
||||
amount of log space required as we add items to the commit item list, but we
|
||||
still need to reserve the space in the log for the checkpoint.
|
||||
|
||||
A typical transaction reserves enough space in the log for the worst case space
|
||||
usage of the transaction. The reservation accounts for log record headers,
|
||||
transaction and region headers, headers for split regions, buffer tail padding,
|
||||
etc. as well as the actual space for all the changed metadata in the
|
||||
transaction. While some of this is fixed overhead, much of it is dependent on
|
||||
the size of the transaction and the number of regions being logged (the number
|
||||
of log vectors in the transaction).
|
||||
|
||||
An example of the differences would be logging directory changes versus logging
|
||||
inode changes. If you modify lots of inode cores (e.g. chmod -R g+w *), then
|
||||
there are lots of transactions that only contain an inode core and an inode log
|
||||
format structure. That is, two vectors totaling roughly 150 bytes. If we modify
|
||||
10,000 inodes, we have about 1.5MB of metadata to write in 20,000 vectors. Each
|
||||
vector is 12 bytes, so the total to be logged is approximately 1.75MB. In
|
||||
comparison, if we are logging full directory buffers, they are typically 4KB
|
||||
each, so we in 1.5MB of directory buffers we'd have roughly 400 buffers and a
|
||||
buffer format structure for each buffer - roughly 800 vectors or 1.51MB total
|
||||
space. From this, it should be obvious that a static log space reservation is
|
||||
not particularly flexible and is difficult to select the "optimal value" for
|
||||
all workloads.
|
||||
|
||||
Further, if we are going to use a static reservation, which bit of the entire
|
||||
reservation does it cover? We account for space used by the transaction
|
||||
reservation by tracking the space currently used by the object in the CIL and
|
||||
then calculating the increase or decrease in space used as the object is
|
||||
relogged. This allows for a checkpoint reservation to only have to account for
|
||||
log buffer metadata used such as log header records.
|
||||
|
||||
However, even using a static reservation for just the log metadata is
|
||||
problematic. Typically log record headers use at least 16KB of log space per
|
||||
1MB of log space consumed (512 bytes per 32k) and the reservation needs to be
|
||||
large enough to handle arbitrary sized checkpoint transactions. This
|
||||
reservation needs to be made before the checkpoint is started, and we need to
|
||||
be able to reserve the space without sleeping. For a 8MB checkpoint, we need a
|
||||
reservation of around 150KB, which is a non-trivial amount of space.
|
||||
|
||||
A static reservation needs to manipulate the log grant counters - we can take a
|
||||
permanent reservation on the space, but we still need to make sure we refresh
|
||||
the write reservation (the actual space available to the transaction) after
|
||||
every checkpoint transaction completion. Unfortunately, if this space is not
|
||||
available when required, then the regrant code will sleep waiting for it.
|
||||
|
||||
The problem with this is that it can lead to deadlocks as we may need to commit
|
||||
checkpoints to be able to free up log space (refer back to the description of
|
||||
rolling transactions for an example of this). Hence we *must* always have
|
||||
space available in the log if we are to use static reservations, and that is
|
||||
very difficult and complex to arrange. It is possible to do, but there is a
|
||||
simpler way.
|
||||
|
||||
The simpler way of doing this is tracking the entire log space used by the
|
||||
items in the CIL and using this to dynamically calculate the amount of log
|
||||
space required by the log metadata. If this log metadata space changes as a
|
||||
result of a transaction commit inserting a new memory buffer into the CIL, then
|
||||
the difference in space required is removed from the transaction that causes
|
||||
the change. Transactions at this level will *always* have enough space
|
||||
available in their reservation for this as they have already reserved the
|
||||
maximal amount of log metadata space they require, and such a delta reservation
|
||||
will always be less than or equal to the maximal amount in the reservation.
|
||||
|
||||
Hence we can grow the checkpoint transaction reservation dynamically as items
|
||||
are added to the CIL and avoid the need for reserving and regranting log space
|
||||
up front. This avoids deadlocks and removes a blocking point from the
|
||||
checkpoint flush code.
|
||||
|
||||
As mentioned early, transactions can't grow to more than half the size of the
|
||||
log. Hence as part of the reservation growing, we need to also check the size
|
||||
of the reservation against the maximum allowed transaction size. If we reach
|
||||
the maximum threshold, we need to push the CIL to the log. This is effectively
|
||||
a "background flush" and is done on demand. This is identical to
|
||||
a CIL push triggered by a log force, only that there is no waiting for the
|
||||
checkpoint commit to complete. This background push is checked and executed by
|
||||
transaction commit code.
|
||||
|
||||
If the transaction subsystem goes idle while we still have items in the CIL,
|
||||
they will be flushed by the periodic log force issued by the xfssyncd. This log
|
||||
force will push the CIL to disk, and if the transaction subsystem stays idle,
|
||||
allow the idle log to be covered (effectively marked clean) in exactly the same
|
||||
manner that is done for the existing logging method. A discussion point is
|
||||
whether this log force needs to be done more frequently than the current rate
|
||||
which is once every 30s.
|
||||
|
||||
|
||||
Delayed Logging: Log Item Pinning
|
||||
|
||||
Currently log items are pinned during transaction commit while the items are
|
||||
still locked. This happens just after the items are formatted, though it could
|
||||
be done any time before the items are unlocked. The result of this mechanism is
|
||||
that items get pinned once for every transaction that is committed to the log
|
||||
buffers. Hence items that are relogged in the log buffers will have a pin count
|
||||
for every outstanding transaction they were dirtied in. When each of these
|
||||
transactions is completed, they will unpin the item once. As a result, the item
|
||||
only becomes unpinned when all the transactions complete and there are no
|
||||
pending transactions. Thus the pinning and unpinning of a log item is symmetric
|
||||
as there is a 1:1 relationship with transaction commit and log item completion.
|
||||
|
||||
For delayed logging, however, we have an assymetric transaction commit to
|
||||
completion relationship. Every time an object is relogged in the CIL it goes
|
||||
through the commit process without a corresponding completion being registered.
|
||||
That is, we now have a many-to-one relationship between transaction commit and
|
||||
log item completion. The result of this is that pinning and unpinning of the
|
||||
log items becomes unbalanced if we retain the "pin on transaction commit, unpin
|
||||
on transaction completion" model.
|
||||
|
||||
To keep pin/unpin symmetry, the algorithm needs to change to a "pin on
|
||||
insertion into the CIL, unpin on checkpoint completion". In other words, the
|
||||
pinning and unpinning becomes symmetric around a checkpoint context. We have to
|
||||
pin the object the first time it is inserted into the CIL - if it is already in
|
||||
the CIL during a transaction commit, then we do not pin it again. Because there
|
||||
can be multiple outstanding checkpoint contexts, we can still see elevated pin
|
||||
counts, but as each checkpoint completes the pin count will retain the correct
|
||||
value according to it's context.
|
||||
|
||||
Just to make matters more slightly more complex, this checkpoint level context
|
||||
for the pin count means that the pinning of an item must take place under the
|
||||
CIL commit/flush lock. If we pin the object outside this lock, we cannot
|
||||
guarantee which context the pin count is associated with. This is because of
|
||||
the fact pinning the item is dependent on whether the item is present in the
|
||||
current CIL or not. If we don't pin the CIL first before we check and pin the
|
||||
object, we have a race with CIL being flushed between the check and the pin
|
||||
(or not pinning, as the case may be). Hence we must hold the CIL flush/commit
|
||||
lock to guarantee that we pin the items correctly.
|
||||
|
||||
Delayed Logging: Concurrent Scalability
|
||||
|
||||
A fundamental requirement for the CIL is that accesses through transaction
|
||||
commits must scale to many concurrent commits. The current transaction commit
|
||||
code does not break down even when there are transactions coming from 2048
|
||||
processors at once. The current transaction code does not go any faster than if
|
||||
there was only one CPU using it, but it does not slow down either.
|
||||
|
||||
As a result, the delayed logging transaction commit code needs to be designed
|
||||
for concurrency from the ground up. It is obvious that there are serialisation
|
||||
points in the design - the three important ones are:
|
||||
|
||||
1. Locking out new transaction commits while flushing the CIL
|
||||
2. Adding items to the CIL and updating item space accounting
|
||||
3. Checkpoint commit ordering
|
||||
|
||||
Looking at the transaction commit and CIL flushing interactions, it is clear
|
||||
that we have a many-to-one interaction here. That is, the only restriction on
|
||||
the number of concurrent transactions that can be trying to commit at once is
|
||||
the amount of space available in the log for their reservations. The practical
|
||||
limit here is in the order of several hundred concurrent transactions for a
|
||||
128MB log, which means that it is generally one per CPU in a machine.
|
||||
|
||||
The amount of time a transaction commit needs to hold out a flush is a
|
||||
relatively long period of time - the pinning of log items needs to be done
|
||||
while we are holding out a CIL flush, so at the moment that means it is held
|
||||
across the formatting of the objects into memory buffers (i.e. while memcpy()s
|
||||
are in progress). Ultimately a two pass algorithm where the formatting is done
|
||||
separately to the pinning of objects could be used to reduce the hold time of
|
||||
the transaction commit side.
|
||||
|
||||
Because of the number of potential transaction commit side holders, the lock
|
||||
really needs to be a sleeping lock - if the CIL flush takes the lock, we do not
|
||||
want every other CPU in the machine spinning on the CIL lock. Given that
|
||||
flushing the CIL could involve walking a list of tens of thousands of log
|
||||
items, it will get held for a significant time and so spin contention is a
|
||||
significant concern. Preventing lots of CPUs spinning doing nothing is the
|
||||
main reason for choosing a sleeping lock even though nothing in either the
|
||||
transaction commit or CIL flush side sleeps with the lock held.
|
||||
|
||||
It should also be noted that CIL flushing is also a relatively rare operation
|
||||
compared to transaction commit for asynchronous transaction workloads - only
|
||||
time will tell if using a read-write semaphore for exclusion will limit
|
||||
transaction commit concurrency due to cache line bouncing of the lock on the
|
||||
read side.
|
||||
|
||||
The second serialisation point is on the transaction commit side where items
|
||||
are inserted into the CIL. Because transactions can enter this code
|
||||
concurrently, the CIL needs to be protected separately from the above
|
||||
commit/flush exclusion. It also needs to be an exclusive lock but it is only
|
||||
held for a very short time and so a spin lock is appropriate here. It is
|
||||
possible that this lock will become a contention point, but given the short
|
||||
hold time once per transaction I think that contention is unlikely.
|
||||
|
||||
The final serialisation point is the checkpoint commit record ordering code
|
||||
that is run as part of the checkpoint commit and log force sequencing. The code
|
||||
path that triggers a CIL flush (i.e. whatever triggers the log force) will enter
|
||||
an ordering loop after writing all the log vectors into the log buffers but
|
||||
before writing the commit record. This loop walks the list of committing
|
||||
checkpoints and needs to block waiting for checkpoints to complete their commit
|
||||
record write. As a result it needs a lock and a wait variable. Log force
|
||||
sequencing also requires the same lock, list walk, and blocking mechanism to
|
||||
ensure completion of checkpoints.
|
||||
|
||||
These two sequencing operations can use the mechanism even though the
|
||||
events they are waiting for are different. The checkpoint commit record
|
||||
sequencing needs to wait until checkpoint contexts contain a commit LSN
|
||||
(obtained through completion of a commit record write) while log force
|
||||
sequencing needs to wait until previous checkpoint contexts are removed from
|
||||
the committing list (i.e. they've completed). A simple wait variable and
|
||||
broadcast wakeups (thundering herds) has been used to implement these two
|
||||
serialisation queues. They use the same lock as the CIL, too. If we see too
|
||||
much contention on the CIL lock, or too many context switches as a result of
|
||||
the broadcast wakeups these operations can be put under a new spinlock and
|
||||
given separate wait lists to reduce lock contention and the number of processes
|
||||
woken by the wrong event.
|
||||
|
||||
|
||||
Lifecycle Changes
|
||||
|
||||
The existing log item life cycle is as follows:
|
||||
|
||||
1. Transaction allocate
|
||||
2. Transaction reserve
|
||||
3. Lock item
|
||||
4. Join item to transaction
|
||||
If not already attached,
|
||||
Allocate log item
|
||||
Attach log item to owner item
|
||||
Attach log item to transaction
|
||||
5. Modify item
|
||||
Record modifications in log item
|
||||
6. Transaction commit
|
||||
Pin item in memory
|
||||
Format item into log buffer
|
||||
Write commit LSN into transaction
|
||||
Unlock item
|
||||
Attach transaction to log buffer
|
||||
|
||||
<log buffer IO dispatched>
|
||||
<log buffer IO completes>
|
||||
|
||||
7. Transaction completion
|
||||
Mark log item committed
|
||||
Insert log item into AIL
|
||||
Write commit LSN into log item
|
||||
Unpin log item
|
||||
8. AIL traversal
|
||||
Lock item
|
||||
Mark log item clean
|
||||
Flush item to disk
|
||||
|
||||
<item IO completion>
|
||||
|
||||
9. Log item removed from AIL
|
||||
Moves log tail
|
||||
Item unlocked
|
||||
|
||||
Essentially, steps 1-6 operate independently from step 7, which is also
|
||||
independent of steps 8-9. An item can be locked in steps 1-6 or steps 8-9
|
||||
at the same time step 7 is occurring, but only steps 1-6 or 8-9 can occur
|
||||
at the same time. If the log item is in the AIL or between steps 6 and 7
|
||||
and steps 1-6 are re-entered, then the item is relogged. Only when steps 8-9
|
||||
are entered and completed is the object considered clean.
|
||||
|
||||
With delayed logging, there are new steps inserted into the life cycle:
|
||||
|
||||
1. Transaction allocate
|
||||
2. Transaction reserve
|
||||
3. Lock item
|
||||
4. Join item to transaction
|
||||
If not already attached,
|
||||
Allocate log item
|
||||
Attach log item to owner item
|
||||
Attach log item to transaction
|
||||
5. Modify item
|
||||
Record modifications in log item
|
||||
6. Transaction commit
|
||||
Pin item in memory if not pinned in CIL
|
||||
Format item into log vector + buffer
|
||||
Attach log vector and buffer to log item
|
||||
Insert log item into CIL
|
||||
Write CIL context sequence into transaction
|
||||
Unlock item
|
||||
|
||||
<next log force>
|
||||
|
||||
7. CIL push
|
||||
lock CIL flush
|
||||
Chain log vectors and buffers together
|
||||
Remove items from CIL
|
||||
unlock CIL flush
|
||||
write log vectors into log
|
||||
sequence commit records
|
||||
attach checkpoint context to log buffer
|
||||
|
||||
<log buffer IO dispatched>
|
||||
<log buffer IO completes>
|
||||
|
||||
8. Checkpoint completion
|
||||
Mark log item committed
|
||||
Insert item into AIL
|
||||
Write commit LSN into log item
|
||||
Unpin log item
|
||||
9. AIL traversal
|
||||
Lock item
|
||||
Mark log item clean
|
||||
Flush item to disk
|
||||
<item IO completion>
|
||||
10. Log item removed from AIL
|
||||
Moves log tail
|
||||
Item unlocked
|
||||
|
||||
From this, it can be seen that the only life cycle differences between the two
|
||||
logging methods are in the middle of the life cycle - they still have the same
|
||||
beginning and end and execution constraints. The only differences are in the
|
||||
commiting of the log items to the log itself and the completion processing.
|
||||
Hence delayed logging should not introduce any constraints on log item
|
||||
behaviour, allocation or freeing that don't already exist.
|
||||
|
||||
As a result of this zero-impact "insertion" of delayed logging infrastructure
|
||||
and the design of the internal structures to avoid on disk format changes, we
|
||||
can basically switch between delayed logging and the existing mechanism with a
|
||||
mount option. Fundamentally, there is no reason why the log manager would not
|
||||
be able to swap methods automatically and transparently depending on load
|
||||
characteristics, but this should not be necessary if delayed logging works as
|
||||
designed.
|
||||
|
||||
Roadmap:
|
||||
|
||||
2.6.35 Inclusion in mainline as an experimental mount option
|
||||
=> approximately 2-3 months to merge window
|
||||
=> needs to be in xfs-dev tree in 4-6 weeks
|
||||
=> code is nearing readiness for review
|
||||
|
||||
2.6.37 Remove experimental tag from mount option
|
||||
=> should be roughly 6 months after initial merge
|
||||
=> enough time to:
|
||||
=> gain confidence and fix problems reported by early
|
||||
adopters (a.k.a. guinea pigs)
|
||||
=> address worst performance regressions and undesired
|
||||
behaviours
|
||||
=> start tuning/optimising code for parallelism
|
||||
=> start tuning/optimising algorithms consuming
|
||||
excessive CPU time
|
||||
|
||||
2.6.39 Switch default mount option to use delayed logging
|
||||
=> should be roughly 12 months after initial merge
|
||||
=> enough time to shake out remaining problems before next round of
|
||||
enterprise distro kernel rebases
|
@ -9,11 +9,15 @@ Supported chips:
|
||||
* SMSC SCH3112, SCH3114, SCH3116
|
||||
Prefix: 'sch311x'
|
||||
Addresses scanned: none, address read from Super-I/O config space
|
||||
Datasheet: http://www.nuhorizons.com/FeaturedProducts/Volume1/SMSC/311x.pdf
|
||||
Datasheet: Available on the Internet
|
||||
* SMSC SCH5027
|
||||
Prefix: 'sch5027'
|
||||
Addresses scanned: I2C 0x2c, 0x2d, 0x2e
|
||||
Datasheet: Provided by SMSC upon request and under NDA
|
||||
* SMSC SCH5127
|
||||
Prefix: 'sch5127'
|
||||
Addresses scanned: none, address read from Super-I/O config space
|
||||
Datasheet: Provided by SMSC upon request and under NDA
|
||||
|
||||
Authors:
|
||||
Juerg Haefliger <juergh@gmail.com>
|
||||
@ -36,8 +40,8 @@ Description
|
||||
-----------
|
||||
|
||||
This driver implements support for the hardware monitoring capabilities of the
|
||||
SMSC DME1737 and Asus A8000 (which are the same), SMSC SCH5027, and SMSC
|
||||
SCH311x Super-I/O chips. These chips feature monitoring of 3 temp sensors
|
||||
SMSC DME1737 and Asus A8000 (which are the same), SMSC SCH5027, SCH311x,
|
||||
and SCH5127 Super-I/O chips. These chips feature monitoring of 3 temp sensors
|
||||
temp[1-3] (2 remote diodes and 1 internal), 7 voltages in[0-6] (6 external and
|
||||
1 internal) and up to 6 fan speeds fan[1-6]. Additionally, the chips implement
|
||||
up to 5 PWM outputs pwm[1-3,5-6] for controlling fan speeds both manually and
|
||||
@ -48,14 +52,14 @@ Fan[3-6] and pwm[3,5-6] are optional features and their availability depends on
|
||||
the configuration of the chip. The driver will detect which features are
|
||||
present during initialization and create the sysfs attributes accordingly.
|
||||
|
||||
For the SCH311x, fan[1-3] and pwm[1-3] are always present and fan[4-6] and
|
||||
pwm[5-6] don't exist.
|
||||
For the SCH311x and SCH5127, fan[1-3] and pwm[1-3] are always present and
|
||||
fan[4-6] and pwm[5-6] don't exist.
|
||||
|
||||
The hardware monitoring features of the DME1737, A8000, and SCH5027 are only
|
||||
accessible via SMBus, while the SCH311x only provides access via the ISA bus.
|
||||
The driver will therefore register itself as an I2C client driver if it detects
|
||||
a DME1737, A8000, or SCH5027 and as a platform driver if it detects a SCH311x
|
||||
chip.
|
||||
accessible via SMBus, while the SCH311x and SCH5127 only provide access via
|
||||
the ISA bus. The driver will therefore register itself as an I2C client driver
|
||||
if it detects a DME1737, A8000, or SCH5027 and as a platform driver if it
|
||||
detects a SCH311x or SCH5127 chip.
|
||||
|
||||
|
||||
Voltage Monitoring
|
||||
@ -76,7 +80,7 @@ DME1737, A8000:
|
||||
in6: Vbat (+3.0V) 0V - 4.38V
|
||||
|
||||
SCH311x:
|
||||
in0: +2.5V 0V - 6.64V
|
||||
in0: +2.5V 0V - 3.32V
|
||||
in1: Vccp (processor core) 0V - 2V
|
||||
in2: VCC (internal +3.3V) 0V - 4.38V
|
||||
in3: +5V 0V - 6.64V
|
||||
@ -93,6 +97,15 @@ SCH5027:
|
||||
in5: VTR (+3.3V standby) 0V - 4.38V
|
||||
in6: Vbat (+3.0V) 0V - 4.38V
|
||||
|
||||
SCH5127:
|
||||
in0: +2.5 0V - 3.32V
|
||||
in1: Vccp (processor core) 0V - 3V
|
||||
in2: VCC (internal +3.3V) 0V - 4.38V
|
||||
in3: V2_IN 0V - 1.5V
|
||||
in4: V1_IN 0V - 1.5V
|
||||
in5: VTR (+3.3V standby) 0V - 4.38V
|
||||
in6: Vbat (+3.0V) 0V - 4.38V
|
||||
|
||||
Each voltage input has associated min and max limits which trigger an alarm
|
||||
when crossed.
|
||||
|
||||
@ -293,3 +306,21 @@ pwm[1-3]_auto_point1_pwm RW Auto PWM pwm point. Auto_point1 is the
|
||||
pwm[1-3]_auto_point2_pwm RO Auto PWM pwm point. Auto_point2 is the
|
||||
full-speed duty-cycle which is hard-
|
||||
wired to 255 (100% duty-cycle).
|
||||
|
||||
Chip Differences
|
||||
----------------
|
||||
|
||||
Feature dme1737 sch311x sch5027 sch5127
|
||||
-------------------------------------------------------
|
||||
temp[1-3]_offset yes yes
|
||||
vid yes
|
||||
zone3 yes yes yes
|
||||
zone[1-3]_hyst yes yes
|
||||
pwm min/off yes yes
|
||||
fan3 opt yes opt yes
|
||||
pwm3 opt yes opt yes
|
||||
fan4 opt opt
|
||||
fan5 opt opt
|
||||
pwm5 opt opt
|
||||
fan6 opt opt
|
||||
pwm6 opt opt
|
||||
|
@ -7,6 +7,11 @@ Supported chips:
|
||||
Addresses scanned: I2C 0x4c
|
||||
Datasheet: Publicly available at the National Semiconductor website
|
||||
http://www.national.com/pf/LM/LM63.html
|
||||
* National Semiconductor LM64
|
||||
Prefix: 'lm64'
|
||||
Addresses scanned: I2C 0x18 and 0x4e
|
||||
Datasheet: Publicly available at the National Semiconductor website
|
||||
http://www.national.com/pf/LM/LM64.html
|
||||
|
||||
Author: Jean Delvare <khali@linux-fr.org>
|
||||
|
||||
@ -55,3 +60,5 @@ The lm63 driver will not update its values more frequently than every
|
||||
second; reading them more often will do no harm, but will return 'old'
|
||||
values.
|
||||
|
||||
The LM64 is effectively an LM63 with GPIO lines. The driver does not
|
||||
support these GPIO lines at present.
|
||||
|
@ -157,7 +157,7 @@ temperature configuration points:
|
||||
|
||||
There are three PWM outputs. The LM85 datasheet suggests that the
|
||||
pwm3 output control both fan3 and fan4. Each PWM can be individually
|
||||
configured and assigned to a zone for it's control value. Each PWM can be
|
||||
configured and assigned to a zone for its control value. Each PWM can be
|
||||
configured individually according to the following options.
|
||||
|
||||
* pwm#_auto_pwm_min - this specifies the PWM value for temp#_auto_temp_off
|
||||
|
@ -72,9 +72,7 @@ in6_min_alarm 5v output undervoltage alarm
|
||||
in7_min_alarm 3v output undervoltage alarm
|
||||
in8_min_alarm Vee (-12v) output undervoltage alarm
|
||||
|
||||
in9_input GPIO #1 voltage data
|
||||
in10_input GPIO #2 voltage data
|
||||
in11_input GPIO #3 voltage data
|
||||
in9_input GPIO voltage data
|
||||
|
||||
power1_input 12v power usage (mW)
|
||||
power2_input 5v power usage (mW)
|
||||
|
@ -80,9 +80,9 @@ All entries (except name) are optional, and should only be created in a
|
||||
given driver if the chip has the feature.
|
||||
|
||||
|
||||
********
|
||||
* Name *
|
||||
********
|
||||
*********************
|
||||
* Global attributes *
|
||||
*********************
|
||||
|
||||
name The chip name.
|
||||
This should be a short, lowercase string, not containing
|
||||
@ -91,6 +91,13 @@ name The chip name.
|
||||
I2C devices get this attribute created automatically.
|
||||
RO
|
||||
|
||||
update_rate The rate at which the chip will update readings.
|
||||
Unit: millisecond
|
||||
RW
|
||||
Some devices have a variable update rate. This attribute
|
||||
can be used to change the update rate to the desired
|
||||
frequency.
|
||||
|
||||
|
||||
************
|
||||
* Voltages *
|
||||
|
26
Documentation/hwmon/tmp102
Normal file
26
Documentation/hwmon/tmp102
Normal file
@ -0,0 +1,26 @@
|
||||
Kernel driver tmp102
|
||||
====================
|
||||
|
||||
Supported chips:
|
||||
* Texas Instruments TMP102
|
||||
Prefix: 'tmp102'
|
||||
Addresses scanned: none
|
||||
Datasheet: http://focus.ti.com/docs/prod/folders/print/tmp102.html
|
||||
|
||||
Author:
|
||||
Steven King <sfking@fdwdc.com>
|
||||
|
||||
Description
|
||||
-----------
|
||||
|
||||
The Texas Instruments TMP102 implements one temperature sensor. Limits can be
|
||||
set through the Overtemperature Shutdown register and Hysteresis register. The
|
||||
sensor is accurate to 0.5 degree over the range of -25 to +85 C, and to 1.0
|
||||
degree from -40 to +125 C. Resolution of the sensor is 0.0625 degree. The
|
||||
operating temperature has a minimum of -55 C and a maximum of +150 C.
|
||||
|
||||
The TMP102 has a programmable update rate that can select between 8, 4, 1, and
|
||||
0.5 Hz. (Currently the driver only supports the default of 4 Hz).
|
||||
|
||||
The driver provides the common sysfs-interface for temperatures (see
|
||||
Documentation/hwmon/sysfs-interface under Temperatures).
|
@ -27,7 +27,13 @@ Authors:
|
||||
Module Parameters
|
||||
-----------------
|
||||
|
||||
None.
|
||||
* disable_features (bit vector)
|
||||
Disable selected features normally supported by the device. This makes it
|
||||
possible to work around possible driver or hardware bugs if the feature in
|
||||
question doesn't work as intended for whatever reason. Bit values:
|
||||
1 disable SMBus PEC
|
||||
2 disable the block buffer
|
||||
8 disable the I2C block read functionality
|
||||
|
||||
|
||||
Description
|
||||
|
Some files were not shown because too many files have changed in this diff Show More
Loading…
x
Reference in New Issue
Block a user