mirror of
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
synced 2025-01-01 10:45:49 +00:00
[PATCH] EDAC: core EDAC support code
This is a subset of the bluesmoke project core code, stripped of the NMI work which isn't ready to merge and some of the "interesting" proc functionality that needs reworking or just has no place in the kernel. It requires no core kernel changes except the added scrub functions already posted.

The goal is to merge further functionality only after the core code is accepted and proven in the base kernel, and only at the point the upstream extras are really ready to merge.

From: doug thompson <norsk5@xmission.com>

This converts EDAC to sysfs and is the final chunk necessary before EDAC has a stable user space API and can be considered for submission into the base kernel.

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: doug thompson <norsk5@xmission.com>
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This commit is contained in:
parent
2f768af73f
commit
da9bb1d27b
673
Documentation/drivers/edac/edac.txt
Normal file
@ -0,0 +1,673 @@
|
||||
|
||||
|
||||
EDAC - Error Detection And Correction
|
||||
|
||||
Written by Doug Thompson <norsk5@xmission.com>
|
||||
7 Dec 2005
|
||||
|
||||
|
||||
EDAC was written by:
|
||||
Thayne Harbaugh,
|
||||
modified by Dave Peterson, Doug Thompson, et al,
|
||||
from the bluesmoke.sourceforge.net project.
|
||||
|
||||
|
||||
============================================================================
|
||||
EDAC PURPOSE
|
||||
|
||||
The goal of the 'edac' kernel module is to detect and report errors that occur
|
||||
within the computer system. In the initial release, memory Correctable Errors
|
||||
(CE) and Uncorrectable Errors (UE) are the primary errors being harvested.
|
||||
|
||||
Detecting CE events, then harvesting those events and reporting them,
|
||||
CAN be a predictor of future UE events. With CE events, the system can
|
||||
continue to operate, but with less safety. Preventive maintenance and
|
||||
proactive part replacement of memory DIMMs exhibiting CEs can reduce
|
||||
the likelihood of the dreaded UE events and system 'panics'.
|
||||
|
||||
|
||||
In addition, PCI Bus Parity and SERR Errors are scanned for on PCI devices
|
||||
in order to determine if errors are occurring on data transfers.
|
||||
The presence of PCI Parity errors must be examined with a grain of salt.
|
||||
There are several add-in adapters that do NOT follow the PCI specification
|
||||
with regard to Parity generation and reporting. The specification says
|
||||
the vendor should tie the parity status bits to 0 if they do not intend
|
||||
to generate parity. Some vendors do not do this, and thus the parity bit
|
||||
can "float" giving false positives.
|
||||
|
||||
The PCI Parity EDAC device has the ability to "skip" known flaky
|
||||
cards during the parity scan. These are set by the parity "blacklist"
|
||||
interface in the sysfs for PCI Parity. (See the PCI section in the sysfs
|
||||
section below.) There is also a parity "whitelist" which is used as
|
||||
an explicit list of devices to scan, while the blacklist is a list
|
||||
of devices to skip.
|
||||
|
||||
EDAC will have future error detectors that will be added or integrated
|
||||
into EDAC in the following list:
|
||||
|
||||
MCE Machine Check Exception
|
||||
MCA Machine Check Architecture
|
||||
NMI NMI notification of ECC errors
|
||||
MSRs Machine Specific Register error cases
|
||||
and other mechanisms.
|
||||
|
||||
These errors are usually bus errors, ECC errors, thermal throttling
|
||||
and the like.
|
||||
|
||||
|
||||
============================================================================
|
||||
EDAC VERSIONING
|
||||
|
||||
EDAC is composed of a "core" module (edac_mc.ko) and several Memory
|
||||
Controller (MC) driver modules. On a given system, the CORE
|
||||
is loaded and one MC driver will be loaded. Both the CORE and
|
||||
the MC driver have individual versions that reflect current release
|
||||
level of their respective modules. Thus, to "report" on what version
|
||||
a system is running, one must report both the CORE's and the
|
||||
MC driver's versions.
|
||||
|
||||
|
||||
LOADING
|
||||
|
||||
If 'edac' was statically linked with the kernel then no loading is
|
||||
necessary. If 'edac' was built as modules then simply modprobe the
|
||||
'edac' pieces that you need. You should be able to modprobe
|
||||
hardware-specific modules and have the dependencies load the necessary core
|
||||
modules.
|
||||
|
||||
Example:
|
||||
|
||||
$> modprobe amd76x_edac
|
||||
|
||||
loads both the amd76x_edac.ko memory controller module and the edac_mc.ko
|
||||
core module.
|
||||
|
||||
|
||||
============================================================================
|
||||
EDAC sysfs INTERFACE
|
||||
|
||||
EDAC presents a 'sysfs' interface for control, reporting and attribute
|
||||
reporting purposes.
|
||||
|
||||
EDAC lives in the /sys/devices/system/edac directory. Within this directory
|
||||
there currently reside 2 'edac' components:
|
||||
|
||||
mc memory controller(s) system
|
||||
pci PCI status system
|
||||
|
||||
|
||||
============================================================================
|
||||
Memory Controller (mc) Model
|
||||
|
||||
First a background on the memory controller's model abstracted in EDAC.
|
||||
Each mc device controls a set of DIMM memory modules. These modules are
|
||||
laid out in a Chip-Select Row (csrowX) and Channel table (chX). There can
|
||||
be multiple csrows and two channels.
|
||||
|
||||
Memory controllers allow for several csrows, with 8 csrows being a typical value.
|
||||
Yet, the actual number of csrows depends on the electrical "loading"
|
||||
of a given motherboard, memory controller and DIMM characteristics.
|
||||
|
||||
Dual channels allow for 128 bit data transfers to the CPU from memory.
|
||||
|
||||
|
||||
Channel 0 Channel 1
|
||||
===================================
|
||||
csrow0 | DIMM_A0 | DIMM_B0 |
|
||||
csrow1 | DIMM_A0 | DIMM_B0 |
|
||||
===================================
|
||||
|
||||
===================================
|
||||
csrow2 | DIMM_A1 | DIMM_B1 |
|
||||
csrow3 | DIMM_A1 | DIMM_B1 |
|
||||
===================================
|
||||
|
||||
In the above example table there are 4 physical slots on the motherboard
|
||||
for memory DIMMs:
|
||||
|
||||
DIMM_A0
|
||||
DIMM_B0
|
||||
DIMM_A1
|
||||
DIMM_B1
|
||||
|
||||
Labels for these slots are usually silk screened on the motherboard. Slots
|
||||
labeled 'A' are channel 0 in this example. Slots labeled 'B'
|
||||
are channel 1. Notice that there are two csrows possible on a
|
||||
physical DIMM. These csrows are allocated their csrow assignment
|
||||
based on the slot into which the memory DIMM is placed. Thus, when 1 DIMM
|
||||
is placed in each Channel, the csrows cross both DIMMs.
|
||||
|
||||
Memory DIMMs come single or dual "ranked". A rank is a populated csrow.
|
||||
Thus, 2 single ranked DIMMs, placed in slots DIMM_A0 and DIMM_B0 above
|
||||
will have 1 csrow, csrow0. csrow1 will be empty. On the other hand,
|
||||
when 2 dual ranked DIMMs are similarly placed, then both csrow0 and
|
||||
csrow1 will be populated. The pattern repeats itself for csrow2 and
|
||||
csrow3.
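The slot-to-csrow relationship described above can be sketched as a small lookup. This is only an illustration of the example table, assuming two csrows per slot pair as drawn there; the function name is hypothetical and is not part of EDAC.

```python
# Illustrative model of the example table above: each pair of slots
# (DIMM_A<n>, DIMM_B<n>) owns two consecutive csrows, and a csrow is
# populated only if the DIMMs in that pair have enough ranks.


def populated_csrows(ranks_per_pair):
    """Return the populated csrow indices.

    ranks_per_pair[n] is the rank count (0, 1 or 2) of the DIMMs
    installed in slot pair n (DIMM_A<n> and DIMM_B<n>); 0 means empty.
    """
    rows = []
    for pair, ranks in enumerate(ranks_per_pair):
        if ranks not in (0, 1, 2):
            raise ValueError("rank count must be 0, 1 or 2")
        first = 2 * pair  # pair 0 -> csrow0/1, pair 1 -> csrow2/3
        rows.extend(range(first, first + ranks))
    return rows


# Single ranked DIMMs in A0/B0, dual ranked DIMMs in A1/B1.
print(populated_csrows([1, 2]))  # prints [0, 2, 3]
```

With single ranked DIMMs in the first pair and dual ranked DIMMs in the second, csrow1 stays empty while csrow0, csrow2 and csrow3 are populated.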
|
||||
|
||||
The representation of the above is reflected in the directory tree
|
||||
in EDAC's sysfs interface. Starting in directory
|
||||
/sys/devices/system/edac/mc each memory controller will be represented
|
||||
by its own 'mcX' directory, where 'X' is the index of the MC.
|
||||
|
||||
|
||||
..../edac/mc/
|
||||
|
|
||||
|->mc0
|
||||
|->mc1
|
||||
|->mc2
|
||||
....
|
||||
|
||||
Under each 'mcX' directory each 'csrowX' is again represented by a
|
||||
'csrowX', where 'X' is the csrow index:
|
||||
|
||||
|
||||
.../mc/mc0/
|
||||
|
|
||||
|->csrow0
|
||||
|->csrow2
|
||||
|->csrow3
|
||||
....
|
||||
|
||||
Notice that there is no csrow1, which indicates that csrow0 is
|
||||
composed of single ranked DIMMs. This should also apply in both
|
||||
Channels, in order to have dual-channel mode be operational. Since
|
||||
both csrow2 and csrow3 are populated, this indicates a dual ranked
|
||||
set of DIMMs for channels 0 and 1.
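The directory layout can be inspected from userland to see which csrows are populated. The following is a minimal sketch, assuming the 'csrowX' directory naming described above; the helper names are made up for this example.

```python
import os
import re


def list_csrows(mc_dir):
    """Return the sorted csrow indices present under one 'mcX' directory."""
    rows = []
    for name in os.listdir(mc_dir):
        match = re.fullmatch(r"csrow(\d+)", name)
        if match and os.path.isdir(os.path.join(mc_dir, name)):
            rows.append(int(match.group(1)))
    return sorted(rows)


def missing_csrows(rows):
    """Return the indices skipped below the highest populated csrow."""
    if not rows:
        return []
    return [i for i in range(max(rows) + 1) if i not in rows]


if __name__ == "__main__":
    mc0 = "/sys/devices/system/edac/mc/mc0"
    if os.path.isdir(mc0):  # only present when an EDAC MC driver is loaded
        rows = list_csrows(mc0)
        print("populated:", rows, "empty:", missing_csrows(rows))
```

For the tree shown above, list_csrows would return 0, 2 and 3, and missing_csrows would report csrow1 as the empty one.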
|
||||
|
||||
|
||||
Within each of the 'mc','mcX' and 'csrowX' directories are several
|
||||
EDAC control and attribute files.
|
||||
|
||||
|
||||
============================================================================
|
||||
DIRECTORY 'mc'
|
||||
|
||||
In directory 'mc' are EDAC system overall control and attribute files:
|
||||
|
||||
|
||||
Panic on UE control file:
|
||||
|
||||
'panic_on_ue'
|
||||
|
||||
An uncorrectable error will cause a machine panic. This is usually
|
||||
desirable. It is a bad idea to continue when an uncorrectable error
|
||||
occurs - it is indeterminate what was uncorrected and the operating
|
||||
system context might be so mangled that continuing will lead to further
|
||||
corruption. If the kernel has MCE configured, then EDAC will never
|
||||
notice the UE.
|
||||
|
||||
LOAD TIME: module/kernel parameter: panic_on_ue=[0|1]
|
||||
|
||||
RUN TIME: echo "1" >/sys/devices/system/edac/mc/panic_on_ue
|
||||
|
||||
|
||||
Log UE control file:
|
||||
|
||||
'log_ue'
|
||||
|
||||
Generate kernel messages describing uncorrectable errors. These errors
|
||||
are reported through the system message log system. UE statistics
|
||||
will be accumulated even when UE logging is disabled.
|
||||
|
||||
LOAD TIME: module/kernel parameter: log_ue=[0|1]
|
||||
|
||||
RUN TIME: echo "1" >/sys/devices/system/edac/mc/log_ue
|
||||
|
||||
|
||||
Log CE control file:
|
||||
|
||||
'log_ce'
|
||||
|
||||
Generate kernel messages describing correctable errors. These
|
||||
errors are reported through the system message log system.
|
||||
CE statistics will be accumulated even when CE logging is disabled.
|
||||
|
||||
LOAD TIME: module/kernel parameter: log_ce=[0|1]
|
||||
|
||||
RUN TIME: echo "1" >/sys/devices/system/edac/mc/log_ce
|
||||
|
||||
|
||||
Polling period control file:
|
||||
|
||||
'poll_msec'
|
||||
|
||||
The time period, in milliseconds, for polling for error information.
|
||||
Too small a value wastes resources. Too large a value might delay
|
||||
necessary handling of errors and might lose valuable information for
|
||||
locating the error. 1000 milliseconds (once each second) is about
|
||||
right for most uses.
|
||||
|
||||
LOAD TIME: module/kernel parameter: poll_msec=[milliseconds]
|
||||
|
||||
RUN TIME: echo "1000" >/sys/devices/system/edac/mc/poll_msec
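The run-time control files in this directory all take a short text value, so one small helper covers them. Below is a hedged sketch for 'poll_msec' only; the function name is hypothetical, and rejecting non-positive values is this sketch's own policy rather than something this document specifies.

```python
import os

EDAC_MC_DIR = "/sys/devices/system/edac/mc"  # path given in this document


def set_poll_msec(msec, mc_dir=EDAC_MC_DIR):
    """Write a polling period, in milliseconds, to the poll_msec file."""
    msec = int(msec)
    if msec <= 0:
        # Assumption of this sketch: a zero or negative period is an error.
        raise ValueError("polling period must be a positive number of ms")
    with open(os.path.join(mc_dir, "poll_msec"), "w") as f:
        f.write("%d\n" % msec)  # same effect as: echo "1000" > .../poll_msec
    return msec
```

Calling set_poll_msec(1000) as root would apply the once-per-second period suggested above; the mc_dir parameter exists only so the helper can be exercised against a scratch directory.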
|
||||
|
||||
|
||||
Module Version read-only attribute file:
|
||||
|
||||
'mc_version'
|
||||
|
||||
The EDAC CORE module's version and compile date are shown here to
|
||||
indicate what EDAC is running.
|
||||
|
||||
|
||||
|
||||
============================================================================
|
||||
'mcX' DIRECTORIES
|
||||
|
||||
|
||||
In 'mcX' directories are EDAC control and attribute files for
|
||||
this 'X' instance of the memory controllers:
|
||||
|
||||
|
||||
Counter reset control file:
|
||||
|
||||
'reset_counters'
|
||||
|
||||
This write-only control file will zero all the statistical counters
|
||||
for UE and CE errors. Zeroing the counters will also reset the timer
|
||||
indicating how long since the last counter zero. This is useful
|
||||
for computing errors/time. Since the counters are always reset at
|
||||
driver initialization time, no module/kernel parameter is available.
|
||||
|
||||
RUN TIME: echo "anything" >/sys/devices/system/edac/mc/mc0/reset_counters
|
||||
|
||||
This resets the counters on memory controller 0
|
||||
|
||||
|
||||
Seconds since last counter reset attribute file:
|
||||
|
||||
'seconds_since_reset'
|
||||
|
||||
This attribute file displays how many seconds have elapsed since the
|
||||
last counter reset. This can be used with the error counters to
|
||||
measure error rates.
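Combining this attribute with an error counter gives a rate. The sketch below assumes each attribute file holds a single decimal integer, which is not stated explicitly here; the helper names are illustrative.

```python
import os


def read_int(path):
    """Read one decimal integer from a sysfs-style attribute file."""
    with open(path) as f:
        return int(f.read().strip())


def ce_rate_per_hour(mcx_dir):
    """Correctable errors per hour since the last counter reset."""
    count = read_int(os.path.join(mcx_dir, "ce_count"))
    seconds = read_int(os.path.join(mcx_dir, "seconds_since_reset"))
    if seconds == 0:
        return 0.0  # counters were just reset; no meaningful rate yet
    return count * 3600.0 / seconds
```

For example, 12 correctable errors over 7200 seconds works out to 6 errors per hour.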
|
||||
|
||||
|
||||
|
||||
DIMM capability attribute file:
|
||||
|
||||
'edac_capability'
|
||||
|
||||
The EDAC (Error Detection and Correction) capabilities/modes of
|
||||
the memory controller hardware.
|
||||
|
||||
|
||||
DIMM Current Capability attribute file:
|
||||
|
||||
'edac_current_capability'
|
||||
|
||||
The EDAC capabilities available with the hardware
|
||||
configuration. This may not be the same as "EDAC capability"
|
||||
if the correct memory is not used. If a memory controller is
|
||||
capable of EDAC, but DIMMs without check bits are in use, then
|
||||
Parity, SECDED, S4ECD4ED capabilities will not be available
|
||||
even though the memory controller might be capable of those
|
||||
modes with the proper memory loaded.
|
||||
|
||||
|
||||
Memory Type supported on this controller attribute file:
|
||||
|
||||
'supported_mem_type'
|
||||
|
||||
This attribute file displays the memory type, usually
|
||||
buffered and unbuffered DIMMs.
|
||||
|
||||
|
||||
Memory Controller name attribute file:
|
||||
|
||||
'mc_name'
|
||||
|
||||
This attribute file displays the type of memory controller
|
||||
that is being utilized.
|
||||
|
||||
|
||||
Memory Controller Module name attribute file:
|
||||
|
||||
'module_name'
|
||||
|
||||
This attribute file displays the memory controller module name,
|
||||
version and date built. The name of the memory controller
|
||||
hardware - some drivers work with multiple controllers and
|
||||
this field shows which hardware is present.
|
||||
|
||||
|
||||
Total memory managed by this memory controller attribute file:
|
||||
|
||||
'size_mb'
|
||||
|
||||
This attribute file displays, in megabytes, the amount of memory
|
||||
that this instance of memory controller manages.
|
||||
|
||||
|
||||
Total Uncorrectable Errors count attribute file:
|
||||
|
||||
'ue_count'
|
||||
|
||||
This attribute file displays the total count of uncorrectable
|
||||
errors that have occurred on this memory controller. If panic_on_ue
|
||||
is set this counter will not have a chance to increment,
|
||||
since EDAC will panic the system.
|
||||
|
||||
|
||||
Total UE count that had no information attribute file:
|
||||
|
||||
'ue_noinfo_count'
|
||||
|
||||
This attribute file displays the number of UEs that
|
||||
have occurred with no information as to which DIMM
|
||||
slot is having errors.
|
||||
|
||||
|
||||
Total Correctable Errors count attribute file:
|
||||
|
||||
'ce_count'
|
||||
|
||||
This attribute file displays the total count of correctable
|
||||
errors that have occurred on this memory controller. This
|
||||
count is very important to examine. CEs provide early
|
||||
indications that a DIMM is beginning to fail. This count
|
||||
field should be monitored for non-zero values, reporting
|
||||
such information to the system administrator.
|
||||
|
||||
|
||||
Total CE count that had no information attribute file:
|
||||
|
||||
'ce_noinfo_count'
|
||||
|
||||
This attribute file displays the number of CEs that
|
||||
have occurred with no information as to which DIMM slot
|
||||
is having errors. Memory is handicapped, but operational,
|
||||
yet no information is available to indicate which slot
|
||||
the failing memory is in. This count field should also
|
||||
be monitored for non-zero values.
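Monitoring the counters for non-zero values can be automated with a short scan. This is a sketch under the assumption that each counter file contains one decimal integer; the function name and the choice of which counters to check are this example's own.

```python
import os

COUNTERS = ("ce_count", "ce_noinfo_count", "ue_count", "ue_noinfo_count")


def nonzero_counters(mc_root):
    """Return {(mc_name, counter): value} for every non-zero counter."""
    found = {}
    for mc in sorted(os.listdir(mc_root)):
        mc_dir = os.path.join(mc_root, mc)
        # Only mc0, mc1, ... are controller instances.
        if not (mc.startswith("mc") and mc[2:].isdigit()
                and os.path.isdir(mc_dir)):
            continue
        for counter in COUNTERS:
            path = os.path.join(mc_dir, counter)
            if not os.path.isfile(path):
                continue
            with open(path) as f:
                value = int(f.read().strip())
            if value:
                found[(mc, counter)] = value
    return found
```

A periodic job could call nonzero_counters("/sys/devices/system/edac/mc") and notify the system administrator whenever the result is not empty.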
|
||||
|
||||
Device Symlink:
|
||||
|
||||
'device'
|
||||
|
||||
Symlink to the memory controller device
|
||||
|
||||
|
||||
|
||||
============================================================================
|
||||
'csrowX' DIRECTORIES
|
||||
|
||||
In the 'csrowX' directories are EDAC control and attribute files for
|
||||
this 'X' instance of csrow:
|
||||
|
||||
|
||||
Total Uncorrectable Errors count attribute file:
|
||||
|
||||
'ue_count'
|
||||
|
||||
This attribute file displays the total count of uncorrectable
|
||||
errors that have occurred on this csrow. If panic_on_ue is set
|
||||
this counter will not have a chance to increment, since EDAC
|
||||
will panic the system.
|
||||
|
||||
|
||||
Total Correctable Errors count attribute file:
|
||||
|
||||
'ce_count'
|
||||
|
||||
This attribute file displays the total count of correctable
|
||||
errors that have occurred on this csrow. This
|
||||
count is very important to examine. CEs provide early
|
||||
indications that a DIMM is beginning to fail. This count
|
||||
field should be monitored for non-zero values, reporting
|
||||
such information to the system administrator.
|
||||
|
||||
|
||||
Total memory managed by this csrow attribute file:
|
||||
|
||||
'size_mb'
|
||||
|
||||
This attribute file displays, in megabytes, the amount of memory
|
||||
that this csrow contains.
|
||||
|
||||
|
||||
Memory Type attribute file:
|
||||
|
||||
'mem_type'
|
||||
|
||||
This attribute file will display what type of memory is currently
|
||||
on this csrow. Normally, either buffered or unbuffered memory.
|
||||
|
||||
|
||||
EDAC Mode of operation attribute file:
|
||||
|
||||
'edac_mode'
|
||||
|
||||
This attribute file will display what type of Error detection
|
||||
and correction is being utilized.
|
||||
|
||||
|
||||
Device type attribute file:
|
||||
|
||||
'dev_type'
|
||||
|
||||
This attribute file will display what type of DIMM device is
|
||||
being utilized. Example: x4
|
||||
|
||||
|
||||
Channel 0 CE Count attribute file:
|
||||
|
||||
'ch0_ce_count'
|
||||
|
||||
This attribute file will display the count of CEs on this
|
||||
DIMM located in channel 0.
|
||||
|
||||
|
||||
Channel 0 UE Count attribute file:
|
||||
|
||||
'ch0_ue_count'
|
||||
|
||||
This attribute file will display the count of UEs on this
|
||||
DIMM located in channel 0.
|
||||
|
||||
|
||||
Channel 0 DIMM Label control file:
|
||||
|
||||
'ch0_dimm_label'
|
||||
|
||||
This control file allows this DIMM to have a label assigned
|
||||
to it. With this label in the module, when errors occur
|
||||
the output can provide the DIMM label in the system log.
|
||||
This becomes vital for panic events to isolate the
|
||||
cause of the UE event.
|
||||
|
||||
DIMM Labels must be assigned after booting, with information
|
||||
that correctly identifies the physical slot with its
|
||||
silk screen label. This information is currently very
|
||||
motherboard specific and determination of this information
|
||||
must occur in userland at this time.
|
||||
|
||||
|
||||
Channel 1 CE Count attribute file:
|
||||
|
||||
'ch1_ce_count'
|
||||
|
||||
This attribute file will display the count of CEs on this
|
||||
DIMM located in channel 1.
|
||||
|
||||
|
||||
Channel 1 UE Count attribute file:
|
||||
|
||||
'ch1_ue_count'
|
||||
|
||||
This attribute file will display the count of UEs on this
|
||||
DIMM located in channel 1.
|
||||
|
||||
|
||||
Channel 1 DIMM Label control file:
|
||||
|
||||
'ch1_dimm_label'
|
||||
|
||||
This control file allows this DIMM to have a label assigned
|
||||
to it. With this label in the module, when errors occur
|
||||
the output can provide the DIMM label in the system log.
|
||||
This becomes vital for panic events to isolate the
|
||||
cause of the UE event.
|
||||
|
||||
DIMM Labels must be assigned after booting, with information
|
||||
that correctly identifies the physical slot with its
|
||||
silk screen label. This information is currently very
|
||||
motherboard specific and determination of this information
|
||||
must occur in userland at this time.
|
||||
|
||||
|
||||
============================================================================
|
||||
SYSTEM LOGGING
|
||||
|
||||
If logging for UEs and CEs is enabled then system logs will have
|
||||
error notices indicating errors that have been detected:
|
||||
|
||||
MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0,
|
||||
channel 1 "DIMM_B1": amd76x_edac
|
||||
|
||||
MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0,
|
||||
channel 1 "DIMM_B1": amd76x_edac
|
||||
|
||||
|
||||
The structure of the message is:
|
||||
the memory controller (MC0)
|
||||
Error type (CE)
|
||||
memory page (0x283)
|
||||
offset in the page (0xce0)
|
||||
the byte granularity (grain 8)
|
||||
or resolution of the error
|
||||
the error syndrome (0x6ec3)
|
||||
memory row (row 0)
|
||||
memory channel (channel 1)
|
||||
DIMM label, if set prior (DIMM_B1)
|
||||
and then an optional, driver-specific message that may
|
||||
have additional information.
|
||||
|
||||
Both UEs and CEs with no info will lack all but memory controller,
|
||||
error type, a notice of "no info" and then an optional,
|
||||
driver-specific error message.
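The fields listed above can be pulled out of a log line with a regular expression. The pattern below is a sketch derived only from the two sample messages in this section, assuming each message is logged on one line; other drivers or kernel versions may format their messages differently, and "no info" messages are not matched.

```python
import re

# Pattern built from the sample messages above (an assumption, not a
# format guaranteed by EDAC).
MESSAGE_RE = re.compile(
    r'MC(?P<mc>\d+): (?P<type>CE|UE) '
    r'page 0x(?P<page>[0-9a-fA-F]+), '
    r'offset 0x(?P<offset>[0-9a-fA-F]+), '
    r'grain (?P<grain>\d+), '
    r'syndrome 0x(?P<syndrome>[0-9a-fA-F]+), '
    r'row (?P<row>\d+),\s+'
    r'channel (?P<channel>\d+) '
    r'"(?P<label>[^"]*)": (?P<driver_msg>.*)'
)


def parse_edac_message(line):
    """Parse one EDAC memory error line; return a dict or None."""
    m = MESSAGE_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    for key in ("mc", "grain", "row", "channel"):
        fields[key] = int(fields[key])
    for key in ("page", "offset", "syndrome"):
        fields[key] = int(fields[key], 16)
    return fields


sample = ('MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, '
          'row 0, channel 1 "DIMM_B1": amd76x_edac')
print(parse_edac_message(sample)["label"])  # prints DIMM_B1
```

Lines that do not follow this layout return None, so a caller can fall back to logging them unparsed.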
|
||||
|
||||
|
||||
|
||||
============================================================================
|
||||
PCI Bus Parity Detection
|
||||
|
||||
|
||||
On Header Type 00 devices the primary status is looked at
|
||||
for any parity error regardless of whether Parity is enabled on the
|
||||
device. (The spec indicates parity is generated in some cases).
|
||||
On Header Type 01 bridges, the secondary status register is also
|
||||
looked at to see if a parity error occurred on the bus on the other side of
|
||||
the bridge.
|
||||
|
||||
|
||||
SYSFS CONFIGURATION
|
||||
|
||||
Under /sys/devices/system/edac/pci are control and attribute files as follows:
|
||||
|
||||
|
||||
Enable/Disable PCI Parity checking control file:
|
||||
|
||||
'check_pci_parity'
|
||||
|
||||
|
||||
This control file enables or disables the PCI Bus Parity scanning
|
||||
operation. Writing a 1 to this file enables the scanning. Writing
|
||||
a 0 to this file disables the scanning.
|
||||
|
||||
Enable:
|
||||
echo "1" >/sys/devices/system/edac/pci/check_pci_parity
|
||||
|
||||
Disable:
|
||||
echo "0" >/sys/devices/system/edac/pci/check_pci_parity
|
||||
|
||||
|
||||
|
||||
Panic on PCI PARITY Error:
|
||||
|
||||
'panic_on_pci_parity'
|
||||
|
||||
|
||||
This control file enables or disables panicking when a parity
|
||||
error has been detected.
|
||||
|
||||
|
||||
module/kernel parameter: panic_on_pci_parity=[0|1]
|
||||
|
||||
Enable:
|
||||
echo "1" >/sys/devices/system/edac/pci/panic_on_pci_parity
|
||||
|
||||
Disable:
|
||||
echo "0" >/sys/devices/system/edac/pci/panic_on_pci_parity
|
||||
|
||||
|
||||
Parity Count:
|
||||
|
||||
'pci_parity_count'
|
||||
|
||||
This attribute file will display the number of parity errors that
|
||||
have been detected.
|
||||
|
||||
|
||||
|
||||
PCI Device Whitelist:
|
||||
|
||||
'pci_parity_whitelist'
|
||||
|
||||
This control file allows for an explicit list of PCI devices to be
|
||||
scanned for parity errors. Only devices found on this list will
|
||||
be examined. The list is a line of hexadecimal VENDOR and DEVICE
|
||||
ID tuples:
|
||||
|
||||
1022:7450,1434:16a6
|
||||
|
||||
One or more can be inserted, separated by a comma.
|
||||
|
||||
To write the above list, run the following as one command line:
|
||||
|
||||
echo "1022:7450,1434:16a6" > /sys/devices/system/edac/pci/pci_parity_whitelist
|
||||
|
||||
|
||||
|
||||
To display what the whitelist is, simply 'cat' the same file.
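The list format is simple enough to parse and generate in userland. The helpers below are a sketch, assuming the comma separated VENDOR:DEVICE layout shown above and four-digit lowercase hexadecimal output; the function names are hypothetical.

```python
def parse_pci_id_list(text):
    """Parse 'vvvv:dddd,vvvv:dddd' into a list of (vendor, device) ints."""
    ids = []
    for item in text.strip().split(","):
        item = item.strip()
        if not item:
            continue  # an empty string means an empty list
        vendor, sep, device = item.partition(":")
        if not sep:
            raise ValueError("expected VENDOR:DEVICE, got %r" % item)
        ids.append((int(vendor, 16), int(device, 16)))
    return ids


def format_pci_id_list(ids):
    """Inverse of parse_pci_id_list: produce the comma separated line."""
    return ",".join("%04x:%04x" % (vendor, device) for vendor, device in ids)
```

The text read back from the list file can be passed to parse_pci_id_list, and format_pci_id_list produces a line suitable for writing back.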
|
||||
|
||||
|
||||
PCI Device Blacklist:
|
||||
|
||||
'pci_parity_blacklist'
|
||||
|
||||
This control file allows for a list of PCI devices to be
|
||||
skipped for scanning.
|
||||
The list is a line of hexadecimal VENDOR and DEVICE ID tuples:
|
||||
|
||||
1022:7450,1434:16a6
|
||||
|
||||
One or more can be inserted, separated by a comma.
|
||||
|
||||
To write the above list, run the following as one command line:
|
||||
|
||||
echo "1022:7450,1434:16a6" > /sys/devices/system/edac/pci/pci_parity_blacklist
|
||||
|
||||
|
||||
To display what the blacklist currently contains,
|
||||
simply 'cat' the same file.
|
||||
|
||||
=======================================================================
|
||||
|
||||
PCI Vendor and Devices IDs can be obtained with the lspci command. Using
|
||||
the -n option lspci will display the vendor and device IDs. The system
|
||||
administrator will have to determine which devices should be scanned or
|
||||
skipped.
|
||||
|
||||
|
||||
|
||||
The two lists (white and black) are prioritized. The blacklist has the lower
|
||||
priority and will NOT be utilized when a whitelist has been set.
|
||||
Turn OFF a whitelist by an empty echo command:
|
||||
|
||||
echo > /sys/devices/system/edac/pci/pci_parity_whitelist
|
||||
|
||||
and any previous blacklist will be utilized.
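The precedence between the two lists can be written down as a small decision function. This models the behaviour described in this section and is not taken from the kernel code; the function name is hypothetical and devices are represented as (vendor, device) integer pairs.

```python
def should_scan(device_id, whitelist, blacklist):
    """Decide whether a (vendor, device) pair is parity-scanned.

    A non-empty whitelist takes priority and the blacklist is ignored;
    otherwise every device not on the blacklist is scanned.
    """
    if whitelist:
        return device_id in whitelist
    return device_id not in blacklist
```

With a whitelist set, only the listed devices are scanned; once the whitelist is emptied, the blacklist alone decides which devices are skipped.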
|
||||
|
@ -867,6 +867,15 @@ L: ebtables-devel@lists.sourceforge.net
|
||||
W: http://ebtables.sourceforge.net/
|
||||
S: Maintained
|
||||
|
||||
EDAC-CORE
|
||||
P: Doug Thompson
|
||||
M: norsk5@xmission.com, dthompson@linuxnetworx.com
|
||||
P: Dave Peterson
|
||||
M: dsp@llnl.gov, dave_peterson@pobox.com
|
||||
L: bluesmoke-devel@lists.sourceforge.net
|
||||
W: bluesmoke.sourceforge.net
|
||||
S: Maintained
|
||||
|
||||
EEPRO100 NETWORK DRIVER
|
||||
P: Andrey V. Savochkin
|
||||
M: saw@saw.sw.com.sg
|
||||
|
@ -25,8 +25,7 @@ static void __devinit quirk_intel_irqbalance(struct pci_dev *dev)
|
||||
|
||||
/* enable access to config space*/
|
||||
pci_read_config_byte(dev, 0xf4, &config);
|
||||
config |= 0x2;
|
||||
pci_write_config_byte(dev, 0xf4, config);
|
||||
pci_write_config_byte(dev, 0xf4, config|0x2);
|
||||
|
||||
/* read xTPR register */
|
||||
raw_pci_ops->read(0, 0, 0x40, 0x4c, 2, &word);
|
||||
@ -42,9 +41,9 @@ static void __devinit quirk_intel_irqbalance(struct pci_dev *dev)
|
||||
#endif
|
||||
}
|
||||
|
||||
config &= ~0x2;
|
||||
/* disable access to config space*/
|
||||
pci_write_config_byte(dev, 0xf4, config);
|
||||
/* put back the original value for config space*/
|
||||
if (!(config & 0x2))
|
||||
pci_write_config_byte(dev, 0xf4, config);
|
||||
}
|
||||
DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_E7320_MCH, quirk_intel_irqbalance);
|
||||
DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_E7525_MCH, quirk_intel_irqbalance);
|
||||
|
@ -68,4 +68,6 @@ source "drivers/infiniband/Kconfig"
|
||||
|
||||
source "drivers/sn/Kconfig"
|
||||
|
||||
source "drivers/edac/Kconfig"
|
||||
|
||||
endmenu
|
||||
|
@ -63,6 +63,7 @@ obj-$(CONFIG_PHONE) += telephony/
|
||||
obj-$(CONFIG_MD) += md/
|
||||
obj-$(CONFIG_BT) += bluetooth/
|
||||
obj-$(CONFIG_ISDN) += isdn/
|
||||
obj-$(CONFIG_EDAC) += edac/
|
||||
obj-$(CONFIG_MCA) += mca/
|
||||
obj-$(CONFIG_EISA) += eisa/
|
||||
obj-$(CONFIG_CPU_FREQ) += cpufreq/
|
||||
|
102
drivers/edac/Kconfig
Normal file
@ -0,0 +1,102 @@
|
||||
#
|
||||
# EDAC Kconfig
|
||||
# Copyright (c) 2003 Linux Networx
|
||||
# Licensed and distributed under the GPL
|
||||
#
|
||||
# $Id: Kconfig,v 1.4.2.7 2005/07/08 22:05:38 dsp_llnl Exp $
|
||||
#
|
||||
|
||||
menu 'EDAC - error detection and reporting (RAS)'
|
||||
|
||||
config EDAC
|
||||
tristate "EDAC core system error reporting"
|
||||
depends on X86
|
||||
default y
|
||||
help
|
||||
EDAC is designed to report errors in the core system.
|
||||
These are low-level errors that are reported in the CPU or
|
||||
supporting chipset: memory errors, cache errors, PCI errors,
|
||||
thermal throttling, etc. If unsure, select 'Y'.
|
||||
|
||||
|
||||
comment "Reporting subsystems"
|
||||
depends on EDAC
|
||||
|
||||
config EDAC_DEBUG
|
||||
bool "Debugging"
|
||||
depends on EDAC
|
||||
help
|
||||
This turns on debugging information for the entire EDAC
|
||||
sub-system. You can insert the module with "debug_level=x"; currently
|
||||
there are four debug levels (x=0,1,2,3 from low to high).
|
||||
Usually you should select 'N'.
|
||||
|
||||
config EDAC_MM_EDAC
|
||||
tristate "Main Memory EDAC (Error Detection And Correction) reporting"
|
||||
depends on EDAC
|
||||
default y
|
||||
help
|
||||
Some systems are able to detect and correct errors in main
|
||||
memory. EDAC can report statistics on memory error
|
||||
detection and correction (EDAC - or commonly referred to as ECC
|
||||
errors). EDAC will also try to decode where these errors
|
||||
occurred so that a particular failing memory module can be
|
||||
replaced. If unsure, select 'Y'.
|
||||
|
||||
|
||||
config EDAC_AMD76X
|
||||
tristate "AMD 76x (760, 762, 768)"
|
||||
depends on EDAC_MM_EDAC && PCI
|
||||
help
|
||||
Support for error detection and correction on the AMD 76x
|
||||
series of chipsets used with the Athlon processor.
|
||||
|
||||
config EDAC_E7XXX
|
||||
tristate "Intel e7xxx (e7205, e7500, e7501, e7505)"
|
||||
depends on EDAC_MM_EDAC && PCI
|
||||
help
|
||||
Support for error detection and correction on the Intel
|
||||
E7205, E7500, E7501 and E7505 server chipsets.
|
||||
|
||||
config EDAC_E752X
|
||||
tristate "Intel e752x (e7520, e7525, e7320)"
|
||||
depends on EDAC_MM_EDAC && PCI
|
||||
help
|
||||
Support for error detection and correction on the Intel
|
||||
E7520, E7525, E7320 server chipsets.
|
||||
|
||||
config EDAC_I82875P
|
||||
tristate "Intel 82875p (D82875P, E7210)"
|
||||
depends on EDAC_MM_EDAC && PCI
|
||||
help
|
||||
Support for error detection and correction on the Intel
|
||||
D82875P and E7210 server chipsets.
|
||||
|
||||
config EDAC_I82860
|
||||
tristate "Intel 82860"
|
||||
depends on EDAC_MM_EDAC && PCI
|
||||
help
|
||||
Support for error detection and correction on the Intel
|
||||
82860 chipset.
|
||||
|
||||
config EDAC_R82600
|
||||
tristate "Radisys 82600 embedded chipset"
|
||||
depends on EDAC_MM_EDAC
|
||||
help
|
||||
Support for error detection and correction on the Radisys
|
||||
82600 embedded chipset.
|
||||
|
||||
choice
|
||||
prompt "Error detecting method"
|
||||
depends on EDAC
|
||||
default EDAC_POLL
|
||||
|
||||
config EDAC_POLL
|
||||
bool "Poll for errors"
|
||||
depends on EDAC
|
||||
help
|
||||
Poll the chipset periodically to detect errors.
|
||||
|
||||
endchoice
|
||||
|
||||
endmenu
|
18
drivers/edac/Makefile
Normal file
@ -0,0 +1,18 @@
|
||||
#
|
||||
# Makefile for the Linux kernel EDAC drivers.
|
||||
#
|
||||
# Copyright 02 Jul 2003, Linux Networx (http://lnxi.com)
|
||||
# This file may be distributed under the terms of the
|
||||
# GNU General Public License.
|
||||
#
|
||||
# $Id: Makefile,v 1.4.2.3 2005/07/08 22:05:38 dsp_llnl Exp $
|
||||
|
||||
|
||||
obj-$(CONFIG_EDAC_MM_EDAC) += edac_mc.o
|
||||
obj-$(CONFIG_EDAC_AMD76X) += amd76x_edac.o
|
||||
obj-$(CONFIG_EDAC_E7XXX) += e7xxx_edac.o
|
||||
obj-$(CONFIG_EDAC_E752X) += e752x_edac.o
|
||||
obj-$(CONFIG_EDAC_I82875P) += i82875p_edac.o
|
||||
obj-$(CONFIG_EDAC_I82860) += i82860_edac.o
|
||||
obj-$(CONFIG_EDAC_R82600) += r82600_edac.o
|
||||
|
@ -338,7 +338,7 @@ static struct pci_driver amd76x_driver = {
|
||||
.id_table = amd76x_pci_tbl,
|
||||
};
|
||||
|
||||
int __init amd76x_init(void)
|
||||
static int __init amd76x_init(void)
|
||||
{
|
||||
return pci_register_driver(&amd76x_driver);
|
||||
}
|
||||
|
@ -13,7 +13,7 @@
|
||||
* Wang Zhenyu at intel.com
|
||||
* Dave Jiang at mvista.com
|
||||
*
|
||||
* $Id: bluesmoke_e752x.c,v 1.5.2.11 2005/10/05 00:43:44 dsp_llnl Exp $
|
||||
* $Id: edac_e752x.c,v 1.5.2.11 2005/10/05 00:43:44 dsp_llnl Exp $
|
||||
*
|
||||
*/
|
||||
|
||||
@ -376,14 +376,14 @@ static inline void process_threshold_ce(struct mem_ctl_info *mci, u16 error,
|
||||
mci->mc_idx);
|
||||
}
|
||||
|
||||
char *global_message[11] = {
|
||||
static char *global_message[11] = {
|
||||
"PCI Express C1", "PCI Express C", "PCI Express B1",
|
||||
"PCI Express B", "PCI Express A1", "PCI Express A",
|
||||
"DMA Controler", "HUB Interface", "System Bus",
|
||||
"DRAM Controler", "Internal Buffer"
|
||||
};
|
||||
|
||||
char *fatal_message[2] = { "Non-Fatal ", "Fatal " };
|
||||
static char *fatal_message[2] = { "Non-Fatal ", "Fatal " };
|
||||
|
||||
static void do_global_error(int fatal, u32 errors)
|
||||
{
|
||||
@ -405,7 +405,7 @@ static inline void global_error(int fatal, u32 errors, int *error_found,
|
||||
do_global_error(fatal, errors);
|
||||
}
|
||||
|
||||
-char *hub_message[7] = {
|
||||
+static char *hub_message[7] = {
|
||||
"HI Address or Command Parity", "HI Illegal Access",
|
||||
"HI Internal Parity", "Out of Range Access",
|
||||
"HI Data Parity", "Enhanced Config Access",
|
||||
@ -432,7 +432,7 @@ static inline void hub_error(int fatal, u8 errors, int *error_found,
|
||||
do_hub_error(fatal, errors);
|
||||
}
|
||||
|
||||
-char *membuf_message[4] = {
|
||||
+static char *membuf_message[4] = {
|
||||
"Internal PMWB to DRAM parity",
|
||||
"Internal PMWB to System Bus Parity",
|
||||
"Internal System Bus or IO to PMWB Parity",
|
||||
@ -458,6 +458,7 @@ static inline void membuf_error(u8 errors, int *error_found, int handle_error)
|
||||
do_membuf_error(errors);
|
||||
}
|
||||
|
||||
+#if 0
|
||||
char *sysbus_message[10] = {
|
||||
"Addr or Request Parity",
|
||||
"Data Strobe Glitch",
|
||||
@ -469,6 +470,7 @@ char *sysbus_message[10] = {
|
||||
"Memory Parity",
|
||||
"IO Subsystem Parity"
|
||||
};
|
||||
+#endif /* 0 */
|
||||
|
||||
static void do_sysbus_error(int fatal, u32 errors)
|
||||
{
|
||||
@ -1044,7 +1046,7 @@ static struct pci_driver e752x_driver = {
|
||||
};
|
||||
|
||||
|
||||
-int __init e752x_init(void)
|
||||
+static int __init e752x_init(void)
|
||||
{
|
||||
int pci_rc;
|
||||
|
||||
|
@ -537,7 +537,7 @@ static struct pci_driver e7xxx_driver = {
|
||||
};
|
||||
|
||||
|
||||
-int __init e7xxx_init(void)
|
||||
+static int __init e7xxx_init(void)
|
||||
{
|
||||
return pci_register_driver(&e7xxx_driver);
|
||||
}
|
||||
|
2209
drivers/edac/edac_mc.c
Normal file
File diff suppressed because it is too large
448
drivers/edac/edac_mc.h
Normal file
@ -0,0 +1,448 @@
|
||||
/*
|
||||
* MC kernel module
|
||||
* (C) 2003 Linux Networx (http://lnxi.com)
|
||||
* This file may be distributed under the terms of the
|
||||
* GNU General Public License.
|
||||
*
|
||||
* Written by Thayne Harbaugh
|
||||
* Based on work by Dan Hollis <goemon at anime dot net> and others.
|
||||
* http://www.anime.net/~goemon/linux-ecc/
|
||||
*
|
||||
* NMI handling support added by
|
||||
* Dave Peterson <dsp@llnl.gov> <dave_peterson@pobox.com>
|
||||
*
|
||||
* $Id: edac_mc.h,v 1.4.2.10 2005/10/05 00:43:44 dsp_llnl Exp $
|
||||
*
|
||||
*/
|
||||
|
||||
|
||||
#ifndef _EDAC_MC_H_
|
||||
#define _EDAC_MC_H_
|
||||
|
||||
|
||||
#include <linux/config.h>
|
||||
#include <linux/kernel.h>
|
||||
#include <linux/types.h>
|
||||
#include <linux/module.h>
|
||||
#include <linux/spinlock.h>
|
||||
#include <linux/smp.h>
|
||||
#include <linux/pci.h>
|
||||
#include <linux/time.h>
|
||||
#include <linux/nmi.h>
|
||||
#include <linux/rcupdate.h>
|
||||
#include <linux/completion.h>
|
||||
#include <linux/kobject.h>
|
||||
|
||||
|
||||
#define EDAC_MC_LABEL_LEN 31
|
||||
#define MC_PROC_NAME_MAX_LEN 7
|
||||
|
||||
#if PAGE_SHIFT < 20
|
||||
#define PAGES_TO_MiB( pages ) ( ( pages ) >> ( 20 - PAGE_SHIFT ) )
|
||||
#else /* PAGE_SHIFT >= 20 */
|
||||
#define PAGES_TO_MiB( pages ) ( ( pages ) << ( PAGE_SHIFT - 20 ) )
|
||||
#endif
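As a rough illustration of the page-count conversion above, the following user-space sketch restates the macro as a function that takes the page shift as a parameter. The name `pages_to_mib` and the run-time `page_shift` argument are assumptions made for illustration; the header itself picks the branch at compile time from `PAGE_SHIFT`.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of edac_mc.h): convert a page count to MiB.
 * 1 MiB is 2^20 bytes, so pages smaller than 1 MiB are shifted right and
 * pages of 1 MiB or larger are shifted left.
 */
static uint64_t pages_to_mib(uint64_t pages, unsigned int page_shift)
{
	assert(page_shift < 64);	/* keep both shift counts in range */

	if (page_shift < 20)
		return pages >> (20 - page_shift);
	return pages << (page_shift - 20);
}
```

With 4 KiB pages (`page_shift` of 12), 262144 pages come out as 1024 MiB. Counts that do not fill a whole MiB are truncated, so 255 such pages report 0 MiB.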
|
||||
|
||||
#ifdef CONFIG_EDAC_DEBUG
|
||||
extern int edac_debug_level;
|
||||
#define edac_debug_printk(level, fmt, args...) \
|
||||
do { if (level <= edac_debug_level) printk(KERN_DEBUG fmt, ##args); } while(0)
|
||||
#define debugf0( ... ) edac_debug_printk(0, __VA_ARGS__ )
|
||||
#define debugf1( ... ) edac_debug_printk(1, __VA_ARGS__ )
|
||||
#define debugf2( ... ) edac_debug_printk(2, __VA_ARGS__ )
|
||||
#define debugf3( ... ) edac_debug_printk(3, __VA_ARGS__ )
|
||||
#define debugf4( ... ) edac_debug_printk(4, __VA_ARGS__ )
|
||||
#else /* !CONFIG_EDAC_DEBUG */
|
||||
#define debugf0( ... )
|
||||
#define debugf1( ... )
|
||||
#define debugf2( ... )
|
||||
#define debugf3( ... )
|
||||
#define debugf4( ... )
|
||||
#endif /* !CONFIG_EDAC_DEBUG */
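The level-gated debug macros above can be tried outside the kernel with a small sketch like the one below. It swaps `printk` for `vfprintf` and uses a helper function instead of a `do { } while (0)` macro body; every `ex_`-prefixed name, including the `ex_emitted` counter, is a hypothetical stand-in and not part of the header.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

static int ex_debug_level = 1;	/* stand-in for edac_debug_level */
static int ex_emitted;		/* number of messages actually printed */

/* Print the message only when its level does not exceed the threshold. */
static void ex_debug_printk(int level, const char *fmt, ...)
{
	va_list ap;

	assert(fmt);
	if (level > ex_debug_level)
		return;

	va_start(ap, fmt);
	vfprintf(stderr, fmt, ap);
	va_end(ap);
	ex_emitted++;
}

/* One wrapper per level, mirroring debugf0() .. debugf4(). */
#define ex_debugf0(...) ex_debug_printk(0, __VA_ARGS__)
#define ex_debugf1(...) ex_debug_printk(1, __VA_ARGS__)
#define ex_debugf2(...) ex_debug_printk(2, __VA_ARGS__)
```

With the threshold at 1, calls through `ex_debugf0` and `ex_debugf1` print, while `ex_debugf2` is silently dropped. In the header the same gating costs nothing when `CONFIG_EDAC_DEBUG` is off, because the macros then expand to nothing.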
|
||||
|
||||
|
||||
#define bs_xstr(s) bs_str(s)
|
||||
#define bs_str(s) #s
|
||||
#define BS_MOD_STR bs_xstr(KBUILD_BASENAME)
|
||||
|
||||
#define BIT(x) (1 << (x))
|
||||
|
||||
#define PCI_VEND_DEV(vend, dev) PCI_VENDOR_ID_ ## vend, PCI_DEVICE_ID_ ## vend ## _ ## dev
|
||||
|
||||
/* memory devices */
|
||||
enum dev_type {
|
||||
DEV_UNKNOWN = 0,
|
||||
DEV_X1,
|
||||
DEV_X2,
|
||||
DEV_X4,
|
||||
DEV_X8,
|
||||
DEV_X16,
|
||||
DEV_X32, /* Do these parts exist? */
|
||||
DEV_X64 /* Do these parts exist? */
|
||||
};
|
||||
|
||||
#define DEV_FLAG_UNKNOWN BIT(DEV_UNKNOWN)
|
||||
#define DEV_FLAG_X1 BIT(DEV_X1)
|
||||
#define DEV_FLAG_X2 BIT(DEV_X2)
|
||||
#define DEV_FLAG_X4 BIT(DEV_X4)
|
||||
#define DEV_FLAG_X8 BIT(DEV_X8)
|
||||
#define DEV_FLAG_X16 BIT(DEV_X16)
|
||||
#define DEV_FLAG_X32 BIT(DEV_X32)
|
||||
#define DEV_FLAG_X64 BIT(DEV_X64)
|
||||
|
||||
/* memory types */
|
||||
enum mem_type {
|
||||
MEM_EMPTY = 0, /* Empty csrow */
|
||||
MEM_RESERVED, /* Reserved csrow type */
|
||||
MEM_UNKNOWN, /* Unknown csrow type */
|
||||
MEM_FPM, /* Fast page mode */
|
||||
MEM_EDO, /* Extended data out */
|
||||
MEM_BEDO, /* Burst Extended data out */
|
||||
MEM_SDR, /* Single data rate SDRAM */
|
||||
MEM_RDR, /* Registered single data rate SDRAM */
|
||||
MEM_DDR, /* Double data rate SDRAM */
|
||||
MEM_RDDR, /* Registered Double data rate SDRAM */
|
||||
MEM_RMBS /* Rambus DRAM */
|
||||
};
|
||||
|
||||
#define MEM_FLAG_EMPTY BIT(MEM_EMPTY)
|
||||
#define MEM_FLAG_RESERVED BIT(MEM_RESERVED)
|
||||
#define MEM_FLAG_UNKNOWN BIT(MEM_UNKNOWN)
|
||||
#define MEM_FLAG_FPM BIT(MEM_FPM)
|
||||
#define MEM_FLAG_EDO BIT(MEM_EDO)
|
||||
#define MEM_FLAG_BEDO BIT(MEM_BEDO)
|
||||
#define MEM_FLAG_SDR BIT(MEM_SDR)
|
||||
#define MEM_FLAG_RDR BIT(MEM_RDR)
|
||||
#define MEM_FLAG_DDR BIT(MEM_DDR)
|
||||
#define MEM_FLAG_RDDR BIT(MEM_RDDR)
|
||||
#define MEM_FLAG_RMBS BIT(MEM_RMBS)
|
||||
|
||||
|
||||
/* chipset Error Detection and Correction capabilities and mode */
|
||||
enum edac_type {
|
||||
EDAC_UNKNOWN = 0, /* Unknown if ECC is available */
|
||||
EDAC_NONE, /* Doesn't support ECC */
|
||||
EDAC_RESERVED, /* Reserved ECC type */
|
||||
EDAC_PARITY, /* Detects parity errors */
|
||||
EDAC_EC, /* Error Checking - no correction */
|
||||
EDAC_SECDED, /* Single bit error correction, Double detection */
|
||||
EDAC_S2ECD2ED, /* Chipkill x2 devices - do these exist? */
|
||||
EDAC_S4ECD4ED, /* Chipkill x4 devices */
|
||||
EDAC_S8ECD8ED, /* Chipkill x8 devices */
|
||||
EDAC_S16ECD16ED, /* Chipkill x16 devices */
|
||||
};
|
||||
|
||||
#define EDAC_FLAG_UNKNOWN BIT(EDAC_UNKNOWN)
|
||||
#define EDAC_FLAG_NONE BIT(EDAC_NONE)
|
||||
#define EDAC_FLAG_PARITY BIT(EDAC_PARITY)
|
||||
#define EDAC_FLAG_EC BIT(EDAC_EC)
|
||||
#define EDAC_FLAG_SECDED BIT(EDAC_SECDED)
|
||||
#define EDAC_FLAG_S2ECD2ED BIT(EDAC_S2ECD2ED)
|
||||
#define EDAC_FLAG_S4ECD4ED BIT(EDAC_S4ECD4ED)
|
||||
#define EDAC_FLAG_S8ECD8ED BIT(EDAC_S8ECD8ED)
|
||||
#define EDAC_FLAG_S16ECD16ED BIT(EDAC_S16ECD16ED)
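Each `*_FLAG_*` macro turns an enumerator into a one-bit mask, so a set of capabilities fits in a single `unsigned long`. The sketch below shows that pattern in user space under stated assumptions: the `EX_`-prefixed names and the helper `ex_supports` are hypothetical, and only the first few `edac_type` values are mirrored.

```c
#include <assert.h>

#define EX_BIT(x) (1UL << (x))

/* Cut-down copy of enum edac_type, for illustration only. */
enum ex_edac_type {
	EX_EDAC_UNKNOWN = 0,
	EX_EDAC_NONE,
	EX_EDAC_RESERVED,
	EX_EDAC_PARITY,
	EX_EDAC_EC,
	EX_EDAC_SECDED
};

#define EX_FLAG_PARITY	EX_BIT(EX_EDAC_PARITY)
#define EX_FLAG_EC	EX_BIT(EX_EDAC_EC)
#define EX_FLAG_SECDED	EX_BIT(EX_EDAC_SECDED)

/* Return non-zero when the capability mask contains the given mode. */
static int ex_supports(unsigned long caps, enum ex_edac_type mode)
{
	assert((unsigned int)mode < sizeof(unsigned long) * 8);
	return (caps & EX_BIT(mode)) != 0;
}
```

A controller that handles error checking and SECDED would carry `EX_FLAG_EC | EX_FLAG_SECDED`, and `ex_supports()` then answers membership queries with one AND. The sketch shifts `1UL` rather than the plain `1` used by the header's `BIT()`, which keeps the shift well defined for every bit position in an `unsigned long`.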
|
||||
|
||||
|
||||
/* scrubbing capabilities */
|
||||
enum scrub_type {
|
||||
SCRUB_UNKNOWN = 0, /* Unknown if scrubber is available */
|
||||
SCRUB_NONE, /* No scrubber */
|
||||
SCRUB_SW_PROG, /* SW progressive (sequential) scrubbing */
|
||||
SCRUB_SW_SRC, /* Software scrub only errors */
|
||||
SCRUB_SW_PROG_SRC, /* Progressive software scrub from an error */
|
||||
SCRUB_SW_TUNABLE, /* Software scrub frequency is tunable */
|
||||
SCRUB_HW_PROG, /* HW progressive (sequential) scrubbing */
|
||||
SCRUB_HW_SRC, /* Hardware scrub only errors */
|
||||
SCRUB_HW_PROG_SRC, /* Progressive hardware scrub from an error */
|
||||
SCRUB_HW_TUNABLE /* Hardware scrub frequency is tunable */
|
||||
};
|
||||
|
||||
#define SCRUB_FLAG_SW_PROG BIT(SCRUB_SW_PROG)
|
||||
#define SCRUB_FLAG_SW_SRC BIT(SCRUB_SW_SRC)
|
||||
#define SCRUB_FLAG_SW_PROG_SRC BIT(SCRUB_SW_PROG_SRC)
|
||||
#define SCRUB_FLAG_SW_TUN BIT(SCRUB_SW_TUNABLE)
|
||||
#define SCRUB_FLAG_HW_PROG BIT(SCRUB_HW_PROG)
|
||||
#define SCRUB_FLAG_HW_SRC BIT(SCRUB_HW_SRC)
|
||||
#define SCRUB_FLAG_HW_PROG_SRC BIT(SCRUB_HW_PROG_SRC)
|
||||
#define SCRUB_FLAG_HW_TUN BIT(SCRUB_HW_TUNABLE)
|
||||
|
||||
enum mci_sysfs_status {
|
||||
MCI_SYSFS_INACTIVE = 0, /* sysfs entries NOT registered */
|
||||
MCI_SYSFS_ACTIVE /* sysfs entries ARE registered */
|
||||
};
|
||||
|
||||
/* FIXME - should have notify capabilities: NMI, LOG, PROC, etc */
|
||||
|
||||
/*
|
||||
* There are several things to be aware of that aren't at all obvious:
|
||||
*
|
||||
*
|
||||
* SOCKETS, SOCKET SETS, BANKS, ROWS, CHIP-SELECT ROWS, CHANNELS, etc..
|
||||
*
|
||||
* These are some of the many terms that are thrown about that don't always
|
||||
* mean what people think they mean (Inconceivable!). In the interest of
|
||||
* creating a common ground for discussion, terms and their definitions
|
||||
* will be established.
|
||||
*
|
||||
* Memory devices: The individual chip on a memory stick. These devices
|
||||
* commonly output 4 and 8 bits each. Grouping several
|
||||
* of these in parallel provides 64 bits which is common
|
||||
* for a memory stick.
|
||||
*
|
||||
* Memory Stick: A printed circuit board that aggregates multiple
|
||||
* memory devices in parallel. This is the atomic
|
||||
* memory component that is purchaseable by Joe consumer
|
||||
* and loaded into a memory socket.
|
||||
*
|
||||
* Socket: A physical connector on the motherboard that accepts
|
||||
* a single memory stick.
|
||||
*
|
||||
* Channel: Set of memory devices on a memory stick that must be
|
||||
* grouped in parallel with one or more additional
|
||||
* channels from other memory sticks. This parallel
|
||||
* grouping of the output from multiple channels is
|
||||
* necessary for the smallest granularity of memory access.
|
||||
* Some memory controllers are capable of single channel -
|
||||
* which means that memory sticks can be loaded
|
||||
* individually. Other memory controllers are only
|
||||
* capable of dual channel - which means that memory
|
||||
* sticks must be loaded as pairs (see "socket set").
|
||||
*
|
||||
* Chip-select row: All of the memory devices that are selected together
|
||||
* for a single, minimum grain of memory access.
|
||||
* This selects all of the parallel memory devices across
|
||||
* all of the parallel channels. Common chip-select rows
|
||||
* for single channel are 64 bits, for dual channel 128
|
||||
* bits.
|
||||
*
|
||||
* Single-Ranked stick: A Single-ranked stick has 1 chip-select row of memory.
|
||||
* Motherboards commonly drive two chip-select pins to
|
||||
* a memory stick. A single-ranked stick will occupy
|
||||
* only one of those rows. The other will be unused.
|
||||
*
|
||||
* Double-Ranked stick: A double-ranked stick has two chip-select rows which
|
||||
* access different sets of memory devices. The two
|
||||
* rows cannot be accessed concurrently.
|
||||
*
|
||||
* Double-sided stick: DEPRECATED TERM, see Double-Ranked stick.
|
||||
* A double-sided stick has two chip-select rows which
|
||||
* access different sets of memory devices. The two
|
||||
* rows cannot be accessed concurrently. "Double-sided"
|
||||
* is irrespective of the memory devices being mounted
|
||||
* on both sides of the memory stick.
|
||||
*
|
||||
* Socket set: All of the memory sticks that are required for
|
||||
* a single memory access or all of the memory sticks
|
||||
* spanned by a chip-select row. A single socket set
|
||||
* has two chip-select rows and if double-sided sticks
|
||||
* are used these will occupy those chip-select rows.
|
||||
*
|
||||
* Bank: This term is avoided because it is unclear when
|
||||
* needing to distinguish between chip-select rows and
|
||||
* socket sets.
|
||||
*
|
||||
* Controller pages:
|
||||
*
|
||||
* Physical pages:
|
||||
*
|
||||
* Virtual pages:
|
||||
*
|
||||
*
|
||||
* STRUCTURE ORGANIZATION AND CHOICES
|
||||
*
|
||||
*
|
||||
*
|
||||
* PS - I enjoyed writing all that about as much as you enjoyed reading it.
|
||||
*/
|
||||
|
||||
|
||||
struct channel_info {
|
||||
int chan_idx; /* channel index */
|
||||
u32 ce_count; /* Correctable Errors for this CHANNEL */
|
||||
char label[EDAC_MC_LABEL_LEN + 1]; /* DIMM label on motherboard */
|
||||
struct csrow_info *csrow; /* the parent */
|
||||
};
|
||||
|
||||
|
||||
struct csrow_info {
|
||||
unsigned long first_page; /* first page number in dimm */
|
||||
unsigned long last_page; /* last page number in dimm */
|
||||
unsigned long page_mask; /* used for interleaving -
|
||||
0UL for non intlv */
|
||||
u32 nr_pages; /* number of pages in csrow */
|
||||
u32 grain; /* granularity of reported error in bytes */
|
||||
int csrow_idx; /* the chip-select row */
|
||||
enum dev_type dtype; /* memory device type */
|
||||
u32 ue_count; /* Uncorrectable Errors for this csrow */
|
||||
u32 ce_count; /* Correctable Errors for this csrow */
|
||||
enum mem_type mtype; /* memory csrow type */
|
||||
enum edac_type edac_mode; /* EDAC mode for this csrow */
|
||||
struct mem_ctl_info *mci; /* the parent */
|
||||
|
||||
struct kobject kobj; /* sysfs kobject for this csrow */
|
||||
|
||||
/* FIXME the number of CHANNELs might need to become dynamic */
|
||||
u32 nr_channels;
|
||||
struct channel_info *channels;
|
||||
};
|
||||
|
||||
|
||||
struct mem_ctl_info {
|
||||
struct list_head link; /* for global list of mem_ctl_info structs */
|
||||
unsigned long mtype_cap; /* memory types supported by mc */
|
||||
unsigned long edac_ctl_cap; /* Mem controller EDAC capabilities */
|
||||
unsigned long edac_cap; /* configuration capabilities - this is
|
||||
closely related to edac_ctl_cap. The
|
||||
difference is that the controller
|
||||
may be capable of s4ecd4ed which would
|
||||
be listed in edac_ctl_cap, but if
|
||||
channels aren't capable of s4ecd4ed then the
|
||||
edac_cap would not have that capability. */
|
||||
unsigned long scrub_cap; /* chipset scrub capabilities */
|
||||
enum scrub_type scrub_mode; /* current scrub mode */
|
||||
|
||||
enum mci_sysfs_status sysfs_active; /* status of sysfs */
|
||||
|
||||
/* pointer to edac checking routine */
|
||||
void (*edac_check) (struct mem_ctl_info * mci);
|
||||
/*
|
||||
* Remaps memory pages: controller pages to physical pages.
|
||||
* For most MC's, this will be NULL.
|
||||
*/
|
||||
/* FIXME - why not send the phys page to begin with? */
|
||||
unsigned long (*ctl_page_to_phys) (struct mem_ctl_info * mci,
|
||||
unsigned long page);
|
||||
int mc_idx;
|
||||
int nr_csrows;
|
||||
struct csrow_info *csrows;
|
||||
/*
|
||||
* FIXME - what about controllers on other busses? - IDs must be
|
||||
* unique. pdev pointer should be sufficiently unique, but
|
||||
* BUS:SLOT.FUNC numbers may not be unique.
|
||||
*/
|
||||
struct pci_dev *pdev;
|
||||
const char *mod_name;
|
||||
const char *mod_ver;
|
||||
const char *ctl_name;
|
||||
char proc_name[MC_PROC_NAME_MAX_LEN + 1];
|
||||
void *pvt_info;
|
||||
u32 ue_noinfo_count; /* Uncorrectable Errors w/o info */
|
||||
u32 ce_noinfo_count; /* Correctable Errors w/o info */
|
||||
u32 ue_count; /* Total Uncorrectable Errors for this MC */
|
||||
u32 ce_count; /* Total Correctable Errors for this MC */
|
||||
unsigned long start_time; /* mci load start time (in jiffies) */
|
||||
|
||||
/* this stuff is for safe removal of mc devices from global list while
|
||||
* NMI handlers may be traversing list
|
||||
*/
|
||||
struct rcu_head rcu;
|
||||
struct completion complete;
|
||||
|
||||
/* edac sysfs device control */
|
||||
struct kobject edac_mci_kobj;
|
||||
};
|
||||
|
||||
|
||||
|
||||
/* write all or some bits in a byte-register*/
|
||||
static inline void pci_write_bits8(struct pci_dev *pdev, int offset,
|
||||
u8 value, u8 mask)
|
||||
{
|
||||
if (mask != 0xff) {
|
||||
u8 buf;
|
||||
pci_read_config_byte(pdev, offset, &buf);
|
||||
value &= mask;
|
||||
buf &= ~mask;
|
||||
value |= buf;
|
||||
}
|
||||
pci_write_config_byte(pdev, offset, value);
|
||||
}
|
||||
|
||||
|
||||
/* write all or some bits in a word-register*/
|
||||
static inline void pci_write_bits16(struct pci_dev *pdev, int offset,
|
||||
u16 value, u16 mask)
|
||||
{
|
||||
if (mask != 0xffff) {
|
||||
u16 buf;
|
||||
pci_read_config_word(pdev, offset, &buf);
|
||||
value &= mask;
|
||||
buf &= ~mask;
|
||||
value |= buf;
|
||||
}
|
||||
pci_write_config_word(pdev, offset, value);
|
||||
}
|
||||
|
||||
|
||||
/* write all or some bits in a dword-register*/
|
||||
static inline void pci_write_bits32(struct pci_dev *pdev, int offset,
|
||||
u32 value, u32 mask)
|
||||
{
|
||||
if (mask != 0xffffffff) {
|
||||
u32 buf;
|
||||
pci_read_config_dword(pdev, offset, &buf);
|
||||
value &= mask;
|
||||
buf &= ~mask;
|
||||
value |= buf;
|
||||
}
|
||||
pci_write_config_dword(pdev, offset, value);
|
||||
}
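The three `pci_write_bits*()` helpers share one read-modify-write pattern: bits selected by `mask` come from `value`, all other bits keep what the register already holds, and the read is skipped when the mask covers the whole register. The sketch below models a 32-bit register as plain memory to show that merge. The `ex_` names are hypothetical, and there are no real PCI configuration accesses here. Note that the full-width mask for a 32-bit register is `0xffffffff`.

```c
#include <assert.h>
#include <stdint.h>

/* Bits set in mask are taken from value; the rest are taken from old. */
static uint32_t ex_merge_bits32(uint32_t old, uint32_t value, uint32_t mask)
{
	return (old & ~mask) | (value & mask);
}

/*
 * Masked write to a register modelled as ordinary memory.
 * The current contents are only read when some bits must be preserved.
 */
static void ex_write_bits32(uint32_t *reg, uint32_t value, uint32_t mask)
{
	assert(reg);

	if (mask != UINT32_MAX)
		value = ex_merge_bits32(*reg, value, mask);
	*reg = value;
}
```

Starting from `0xAABBCCDD`, writing `0x00001111` under mask `0x0000ffff` leaves `0xAABB1111`: the upper half survives because the partial mask forces the read and merge.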
|
||||
|
||||
|
||||
#ifdef CONFIG_EDAC_DEBUG
|
||||
void edac_mc_dump_channel(struct channel_info *chan);
|
||||
void edac_mc_dump_mci(struct mem_ctl_info *mci);
|
||||
void edac_mc_dump_csrow(struct csrow_info *csrow);
|
||||
#endif /* CONFIG_EDAC_DEBUG */
|
||||
|
||||
extern int edac_mc_add_mc(struct mem_ctl_info *mci);
|
||||
extern int edac_mc_del_mc(struct mem_ctl_info *mci);
|
||||
|
||||
extern int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci,
|
||||
unsigned long page);
|
||||
|
||||
extern struct mem_ctl_info *edac_mc_find_mci_by_pdev(struct pci_dev
|
||||
*pdev);
|
||||
|
||||
extern void edac_mc_scrub_block(unsigned long page,
|
||||
unsigned long offset, u32 size);
|
||||
|
||||
/*
|
||||
* The no info errors are used when error overflows are reported.
|
||||
* There are a limited number of error logging registers that can
|
||||
* be exhausted. When all registers are exhausted and an additional
|
||||
* error occurs then an error overflow register records that an
|
||||
* error occurred and the type of error, but doesn't have any
|
||||
* further information. The ce/ue versions make for cleaner
|
||||
* reporting logic and function interface - reduces conditional
|
||||
* statement clutter and extra function arguments.
|
||||
*/
|
||||
extern void edac_mc_handle_ce(struct mem_ctl_info *mci,
|
||||
unsigned long page_frame_number,
|
||||
unsigned long offset_in_page,
|
||||
unsigned long syndrome,
|
||||
int row, int channel, const char *msg);
|
||||
|
||||
extern void edac_mc_handle_ce_no_info(struct mem_ctl_info *mci,
|
||||
const char *msg);
|
||||
|
||||
extern void edac_mc_handle_ue(struct mem_ctl_info *mci,
|
||||
unsigned long page_frame_number,
|
||||
unsigned long offset_in_page,
|
||||
int row, const char *msg);
|
||||
|
||||
extern void edac_mc_handle_ue_no_info(struct mem_ctl_info *mci,
|
||||
const char *msg);
|
||||
|
||||
/*
|
||||
* This kmalloc's and initializes all the structures.
|
||||
* Can't be used unless all structures have the same lifetime.
|
||||
*/
|
||||
extern struct mem_ctl_info *edac_mc_alloc(unsigned sz_pvt,
|
||||
unsigned nr_csrows, unsigned nr_chans);
|
||||
|
||||
/* Free an mc previously allocated by edac_mc_alloc() */
|
||||
extern void edac_mc_free(struct mem_ctl_info *mci);
|
||||
|
||||
|
||||
#endif /* _EDAC_MC_H_ */
|
@ -253,7 +253,7 @@ static struct pci_driver i82860_driver = {
|
||||
.id_table = i82860_pci_tbl,
|
||||
};
|
||||
|
||||
-int __init i82860_init(void)
|
||||
+static int __init i82860_init(void)
|
||||
{
|
||||
int pci_rc;
|
||||
|
||||
|
@ -483,7 +483,7 @@ static struct pci_driver i82875p_driver = {
|
||||
};
|
||||
|
||||
|
||||
-int __init i82875p_init(void)
|
||||
+static int __init i82875p_init(void)
|
||||
{
|
||||
int pci_rc;
|
||||
|
||||
|
@ -381,7 +381,7 @@ static struct pci_driver r82600_driver = {
|
||||
};
|
||||
|
||||
|
||||
-int __init r82600_init(void)
|
||||
+static int __init r82600_init(void)
|
||||
{
|
||||
return pci_register_driver(&r82600_driver);
|
||||
}
|
||||
|
@ -255,17 +255,5 @@ __asm__ __volatile__(LOCK "orl %0,%1" \
|
||||
#define smp_mb__before_atomic_inc() barrier()
|
||||
#define smp_mb__after_atomic_inc() barrier()
|
||||
|
||||
-/* ECC atomic, DMA, SMP and interrupt safe scrub function */
|
||||
|
||||
-static __inline__ void atomic_scrub(unsigned long *virt_addr, u32 size)
|
||||
-{
|
||||
-u32 i;
|
||||
-for (i = 0; i < size / 4; i++, virt_addr++)
|
||||
-/* Very carefully read and write to memory atomically
|
||||
-* so we are interrupt, DMA and SMP safe.
|
||||
-*/
|
||||
-__asm__ __volatile__("lock; addl $0, %0"::"m"(*virt_addr));
|
||||
-}
|
||||
|
||||
#include <asm-generic/atomic.h>
|
||||
#endif
|
||||
|
18
include/asm-i386/edac.h
Normal file
@ -0,0 +1,18 @@
|
||||
#ifndef ASM_EDAC_H
|
||||
#define ASM_EDAC_H
|
||||
|
||||
/* ECC atomic, DMA, SMP and interrupt safe scrub function */
|
||||
|
||||
static __inline__ void atomic_scrub(void *va, u32 size)
|
||||
{
|
||||
unsigned long *virt_addr = va;
|
||||
u32 i;
|
||||
|
||||
for (i = 0; i < size / 4; i++, virt_addr++)
|
||||
/* Very carefully read and write to memory atomically
|
||||
* so we are interrupt, DMA and SMP safe.
|
||||
*/
|
||||
__asm__ __volatile__("lock; addl $0, %0"::"m"(*virt_addr));
|
||||
}
|
||||
|
||||
#endif
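The scrub routine works by adding zero to each 32-bit word with a locked instruction: the value does not change, but the word is read and written back as one atomic operation. A portable approximation using C11 atomics is sketched below. The name `ex_atomic_scrub` and the `_Atomic uint32_t` buffer type are assumptions for illustration. A compiler may lower an atomic add of zero to a fence plus a load instead of a locked read-modify-write, which is one reason a kernel would use inline assembly here, so treat this only as a model of the loop and not as a replacement for it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical user-space analogue of atomic_scrub(): touch every 32-bit
 * word in the buffer with an atomic add of zero, leaving the data unchanged.
 * size_bytes is rounded down to a whole number of words, as in the original.
 */
static void ex_atomic_scrub(_Atomic uint32_t *words, size_t size_bytes)
{
	size_t i;

	assert(words);
	for (i = 0; i < size_bytes / sizeof(uint32_t); i++)
		atomic_fetch_add_explicit(&words[i], 0, memory_order_seq_cst);
}
```

After the call the buffer holds exactly the values it held before; the useful effect on ECC hardware, if any, comes from the write-back, which this sketch cannot guarantee.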
|
@ -426,17 +426,5 @@ __asm__ __volatile__(LOCK "orl %0,%1" \
|
||||
#define smp_mb__before_atomic_inc() barrier()
|
||||
#define smp_mb__after_atomic_inc() barrier()
|
||||
|
||||
-/* ECC atomic, DMA, SMP and interrupt safe scrub function */
|
||||
|
||||
-static __inline__ void atomic_scrub(u32 *virt_addr, u32 size)
|
||||
-{
|
||||
-u32 i;
|
||||
-for (i = 0; i < size / 4; i++, virt_addr++)
|
||||
-/* Very carefully read and write to memory atomically
|
||||
-* so we are interrupt, DMA and SMP safe.
|
||||
-*/
|
||||
-__asm__ __volatile__("lock; addl $0, %0"::"m"(*virt_addr));
|
||||
-}
|
||||
|
||||
#include <asm-generic/atomic.h>
|
||||
#endif
|
||||
|
18
include/asm-x86_64/edac.h
Normal file
@ -0,0 +1,18 @@
|
||||
#ifndef ASM_EDAC_H
|
||||
#define ASM_EDAC_H
|
||||
|
||||
/* ECC atomic, DMA, SMP and interrupt safe scrub function */
|
||||
|
||||
static __inline__ void atomic_scrub(void *va, u32 size)
|
||||
{
|
||||
unsigned int *virt_addr = va;
|
||||
u32 i;
|
||||
|
||||
for (i = 0; i < size / 4; i++, virt_addr++)
|
||||
/* Very carefully read and write to memory atomically
|
||||
* so we are interrupt, DMA and SMP safe.
|
||||
*/
|
||||
__asm__ __volatile__("lock; addl $0, %0"::"m"(*virt_addr));
|
||||
}
|
||||
|
||||
#endif
|