Some PCs with Intel N100 (with PCI device 8086:461c, DID_ADL_N_SKU4)
experienced issues with error interrupts not working, even with the
following configuration in the BIOS.
In-Band ECC Support: Enabled
In-Band ECC Operation Mode: 2 (make all requests protected and
ignore range checks)
IBECC Error Injection Control: Inject Correctable Error on insertion
counter
Error Injection Insertion Count: 251658240 (0xf000000)
Add polling mode support for these machines to ensure that memory error
events are handled.
Signed-off-by: Orange Kao <orange@aiven.io>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://lore.kernel.org/all/20241106114024.941659-3-orange@aiven.io
Currently, igen6_edac sets edac_op_state to EDAC_OPSTATE_NMI, while the
driver also supports memory errors reported from Machine Check. Initialize
edac_op_state to the correct value according to the configuration data
that the driver probed.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/all/20241106114024.941659-2-orange@aiven.io
The segmentation fault happens because:
During modprobe:
1. In igen6_probe(), igen6_pvt will be allocated with kzalloc()
2. In igen6_register_mci(), mci->pvt_info will point to
&igen6_pvt->imc[mc]
During rmmod:
1. In mci_release() in edac_mc.c, it will kfree(mci->pvt_info)
2. In igen6_remove(), it will kfree(igen6_pvt);
Fix this issue by setting mci->pvt_info to NULL to avoid the double
kfree.
Fixes: 10590a9d4f ("EDAC/igen6: Add EDAC driver for Intel client SoCs using IBECC")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219360
Signed-off-by: Orange Kao <orange@aiven.io>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20241104124237.124109-2-orange@aiven.io
The conversion of system address to physical memory address (as viewed by
the memory controller) by igen6_edac is incorrect when the system address
is above the TOM (Total amount Of populated physical Memory) for Elkhart
Lake and Ice Lake (Neural Network Processor). Fix this conversion.
Fixes: 10590a9d4f ("EDAC/igen6: Add EDAC driver for Intel client SoCs using IBECC")
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/stable/20240814061011.43545-1-qiuxu.zhuo%40intel.com
non-power-of-2 denormalization in the sense that certain bits of the
system physical address cannot be reconstructed from the normalized
address reported by the RAS hardware. Add support for handling such
addresses
- Switch the EDAC drivers to the new Intel CPU model defines
- The usual fixes and cleanups all over the place
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmaU9aQACgkQEsHwGGHe
VUpkKQ//eWbeC4JosmRohUECE7MtZppAJ7iX7I7DbQkpKAjdeN4qnPESIQleFN9o
qg7CYkLRUOi8sYJ3MKmIG5l+yxgztKZl7EvzfAaKiCPDt2EK0DDLmhO3VTE1muTn
bYo3kk0HpxCVFfuWxmDCu36CC11wkGmjUo5k6XCE5L4hFlywvVwrktc55jQWsbWk
Kc5iAJxxSc+C8/7oTjqnYuARNl/6Fl4S376GYoxHXzlZI8VoFLO/sW20fz7gQjZg
n/y25CEHki/K9y+bU8Gsexcwhd0jbU02HYtKQI7klcDqyamm8IlmLcTEXZ6Ozlhg
C/dYs2FI9vi6V8B3f8tGHSA3jZgFmcU0OJV9Zl1Pr/ORax9+nbhfxyJbYgp/SgT5
1so5d3iqM2vD+UHnyld0WftVO/HxurhhKPgfCHvcagQnseFwNNqSKGUuwcJ33RCs
iUMBtwmupJL4nAoF+7ZskYbT2zTUduxgCjRiw0ok3h/mxZ+HvmPne5T8y1c1nzUC
+GJbPmprLhKhxKaBrd8w2vrWZHb3X0OccZzfyoS/Eiy0VTdZsVGZfhFEYHvRxYHA
rpM2ex0HrrI3RwrGRmp80PJjMVdGTVbue9yWRBN7LTyBmB+GkUPzCnGpFzyxibNe
iKnwwUjIzhZ48ImImbiCcVA+VMUHSqvLvBMEeYD3nyrZO1x9OKI=
=kLNX
-----END PGP SIGNATURE-----
Merge tag 'edac_updates_for_v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC updates from Borislav Petkov:
- The AMD memory controllers data fabric version 4.5 supports
non-power-of-2 denormalization in the sense that certain bits of the
system physical address cannot be reconstructed from the normalized
address reported by the RAS hardware. Add support for handling such
addresses
- Switch the EDAC drivers to the new Intel CPU model defines
- The usual fixes and cleanups all over the place
* tag 'edac_updates_for_v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC: Add missing MODULE_DESCRIPTION() macros
EDAC/dmc520: Use devm_platform_ioremap_resource()
EDAC/igen6: Add Intel Arrow Lake-U/H SoCs support
RAS/AMD/FMPM: Use atl internal.h for INVALID_SPA
RAS/AMD/ATL: Implement DF 4.5 NP2 denormalization
RAS/AMD/ATL: Validate address map when information is gathered
RAS/AMD/ATL: Expand helpers for adding and removing base and hole
RAS/AMD/ATL: Read DRAM hole base early
RAS/AMD/ATL: Add amd_atl pr_fmt() prefix
RAS/AMD/ATL: Add a missing module description
EDAC, i10nm: make skx_common.o a separate module
EDAC/skx: Switch to new Intel CPU model defines
EDAC/sb_edac: Switch to new Intel CPU model defines
EDAC, pnd2: Switch to new Intel CPU model defines
EDAC/i10nm: Switch to new Intel CPU model defines
EDAC/ghes: Add missing newline to pr_info() statement
RAS/AMD/ATL: Add missing newline to pr_info() statement
EDAC/thunderx: Remove unused struct error_syndrome
errcmd_enable_error_reporting() uses pci_{read,write}_config_word()
that return PCIBIOS_* codes. The return code is then returned all the
way into the probe function igen6_probe() that returns it as is. The
probe functions, however, should return normal errnos.
Convert PCIBIOS_* returns code using pcibios_err_to_errno() into normal
errno before returning it from errcmd_enable_error_reporting().
Fixes: 10590a9d4f ("EDAC/igen6: Add EDAC driver for Intel client SoCs using IBECC")
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240527132236.13875-2-ilpo.jarvinen@linux.intel.com
Add a new Intel Alder Lake-N SoC compute die ID for EDAC support.
Signed-off-by: Lili Li <lili.li@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://lore.kernel.org/r/20240129062040.60809-2-qiuxu.zhuo@intel.com
Add Intel Meteor Lake-PS SoC compute die IDs for EDAC support.
These SoCs share similar IBECC registers with Alder Lake-P SoCs.
The only difference is that IBECC presence is detected through an
MMIO-mapped register instead of the capability register in the
PCI configuration space.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Add Intel Raptor Lake-P SoC compute die IDs for EDAC support.
These Raptor Lake-P SoCs share similar IBECC registers with Alder Lake-P
SoCs but extend the most significant bit of the error address logged
in IBECC from bit 38 to bit 45.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Add Intel Alder Lake-N SoC compute die IDs for EDAC support.
Alder Lake-N, with one memory controller, is a reduced version of
Alder Lake-P, which has two memory controllers.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Make get_mchbar() helper function to retrieve the BAR address of
the memory controller. No function changes.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Current igen6_edac checks for pending errors before the registration
of the error handler. However, there is a possibility that the error
occurs during the registration process, leading to unhandled pending
errors and no future error events. This issue can be reproduced by
repeatedly injecting errors during the loading of the igen6_edac.
Fix this issue by moving the pending error handler after the registration
of the error handler, ensuring that no pending errors are left unhandled.
Fixes: 10590a9d4f ("EDAC/igen6: Add EDAC driver for Intel client SoCs using IBECC")
Reported-by: Ee Wey Lim <ee.wey.lim@intel.com>
Tested-by: Ee Wey Lim <ee.wey.lim@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20230725080427.23883-1-qiuxu.zhuo@intel.com
Call ghes_get_devices() to check whether ghes_edac should be used on the
platform where it is preferred over the corresponding chipset-specific
EDAC driver.
Unlike the existing edac_get_owner() check, the ghes_get_devices() check
works independent to the module_init ordering.
[ bp: Massage. ]
Suggested-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Jia He <justin.he@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221010023559.69655-6-justin.he@arm.com
Alder Lake SoC shares the same memory controller and In-Band ECC
(IBECC) IP with Tiger Lake SoC. Like Tiger Lake, it also has two
memory controllers each associated one IBECC instance. The minor
differences include the MMIO offset of each memory controller and
the type of memory error address logged in the IBECC.
So add Alder Lake compute die IDs, adjust the MMIO offset for each
memory controller and handle the type of memory error address logged
in the IBECC for Alder Lake EDAC support.
Tested-by: Vrukesh V Panse <vrukesh.v.panse@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20210611170123.1057025-7-tony.luck@intel.com
Tiger Lake SoC shares the same memory controller and In-Band ECC
(IBECC) IP with Elkhart Lake SoC. The main differences are that Tiger
Lake has two memory controllers each associated with one IBECC and
uses Machine Check for the memory error notification.
So add Tiger Lake compute die IDs, MCE decoding chain registration,
and memory slice decoding for Tiger Lake EDAC support.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20210611170123.1057025-6-tony.luck@intel.com
The Ice Lake Neural Network Processor for Deep Learning Inference
(ICL-NNPI) SoC shares the same memory controller and In-Band ECC with
Elkhart Lake SoC. Add the ICL-NNPI compute die IDs for EDAC support.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20210611170123.1057025-5-tony.luck@intel.com
Add debugfs support to fake memory correctable errors to test the
error reporting path and the error address decoding logic in the
igen6_edac driver.
Please note that the fake errors are also reported to EDAC core and
then the CE counter in EDAC sysfs is also increased.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
This driver supports Intel client SoC with integrated memory controller
using In-Band ECC(IBECC). The memory correctable and uncorrectable errors
are reported via NMIs. The driver handles the NMIs and decodes the memory
error address to platform specific address. The first IBECC-supported SoC
is Elkhart Lake.
[Tony: s/#include <linux/nmi.h>/#include <asm/nmi.h>/ to fix randconfig build]
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>