When a hardware error occurs in a cell of the HBM memory, the internal
SRAM of the memory controller is used to replace the faulty memory. This
method is called ACLS (Adaptive Cache Line Sparing). The IMU reports the
ACLS RAS error, and rasdaemon records it and runs ACLS to replace the
faulty memory.
HBM ACLS can repair one memory cell (258-bit) at a time. The HBM can
check which HBM cell the physical address belongs to and filter invalid
HBM addresses. Multiple RAS errors are reported if memory errors occur
in different HBM cells.
The feature depends on the Linux kernel options CONFIG_HISI_HBMDEV [1] and
CONFIG_HWPOISON_INJECT [2].
[1]: https://gitee.com/openeuler/kernel/pulls/2757
[2]: https://gitee.com/openeuler/kernel/blob/OLK-5.10/mm/hwpoison-inject.c
Signed-off-by: Junhao He <hejunhao3@huawei.com>
Patch#1: 0001-rasdaemon-ras-mc-ctl-Modify-check-for-HiSilicon-KunP.patch
Fix the "nimbus_id=0" field not being displayed when querying RAS error information.
Patch#2: 0001-Check-CPUs-online-not-configured.patch
Check CPUs online, not configured.
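
The difference between online and configured CPUs can be seen with a
minimal sketch like the one below. It assumes a Linux system where
sysconf exposes SC_NPROCESSORS_ONLN and SC_NPROCESSORS_CONF. The helper
name cpu_counts is illustrative and is not taken from the patch.

```python
import os


def cpu_counts():
    """Return (online, configured) CPU counts as reported by sysconf."""
    # CPUs that are currently online and schedulable.
    online = os.sysconf("SC_NPROCESSORS_ONLN")
    # CPUs the system is configured with, including offline ones.
    configured = os.sysconf("SC_NPROCESSORS_CONF")
    return online, configured


if __name__ == "__main__":
    online, configured = cpu_counts()
    print(f"online={online} configured={configured}")
```

When some CPUs have been taken offline, the two values differ, so code
that sizes per-CPU resources from the configured count may try to use
CPUs that are not available.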
Signed-off-by: caijian <caijian11@h-partners.com>
Add the following fixes and changes:
1. Fix the return value type handling for the read/write functions from unistd.h.
2. Fix signed/unsigned integer comparisons.
3. Remove a redundant header file and do some clean-up.
4. Add support for creating/opening the vendor error tables at rasdaemon startup.
5. Update the HiSilicon error handling code accordingly.
6. Add four modules supported by the HiSilicon common section.
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Add the following patch to fix a core dump at startup.
0001-rasdaemon-use-standard-length-PATH_MAX-for-path-name.patch
Signed-off-by: fenglei <fenglei47@h-partners.com>
Update with the latest patches for the following:
1. CPU online fault isolation for arm event.
2. Modify the recording of HiSilicon common error data in rasdaemon.
3. In ras-mc-ctl:
3.1. Improve HiSilicon common error statistics.
3.2. Add support to display the HiSilicon vendor-errors for a specified module.
3.3. Print usage if the necessary parameters are not passed for the HiSilicon vendor-errors options.
3.4. Reformat error info of the HiSilicon Kunpeng920.
3.5. Relocate reading and displaying of Kunpeng920 errors under Kunpeng9xx.
3.6. Update the HiSilicon platform name to KunPeng9xx.
4. Fix a memory out-of-bounds issue in rasdaemon.
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
1. Modify the recording format of the HiSilicon Kunpeng9xx common errors in rasdaemon.
2. In ras-mc-ctl:
2.1. Modify the error statistics for the HiSilicon Kunpeng9xx common errors to display
the statistics and error info based on the module and the error severity.
2.2. Add support to display the vendor-errors for a specified module.
2.3. Print usage if the necessary parameters are not passed for the
vendor-errors options.
2.4. Reformat error info of the HiSilicon Kunpeng920.
2.5. Relocate reading and displaying of Kunpeng920 errors under Kunpeng9xx.
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
(cherry picked from commit ca01a3db7b2b002855070d02a095296680325354)
1. Backport 4 patches from the openEuler master branch.
2. Enable compilation of the memory fault prediction feature based on corrected errors.
3. Fix a changelog date error in this spec file.
Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
(cherry picked from commit beb85dd5cfd3920dee725abb2e07fffd13f3dc1c)
1. ras-page-isolation: do_page_offline always considers page offline was successful
e4d27840e1
2. ras-page-isolation: page which is PAGE_OFFLINE_FAILED can be offlined again
c329012ce4
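
The behaviour these two fixes aim for can be sketched roughly as
follows. This is an illustration only, not the rasdaemon code: the
PageState enum, try_offline, and the write_request callable are
hypothetical stand-ins, and the real implementation is in C.

```python
from enum import Enum


class PageState(Enum):
    ONLINE = "online"
    OFFLINED = "offlined"
    OFFLINE_FAILED = "offline-failed"


def do_page_offline(addr, write_request):
    """Request that one page be offlined and report whether it worked.

    write_request is a hypothetical callable standing in for the kernel
    interface; it returns True on success and may raise OSError.
    """
    try:
        return bool(write_request(addr))
    except OSError:
        return False


def try_offline(pages, addr, write_request):
    """Offline addr unless already offlined; failed pages may be retried."""
    state = pages.get(addr, PageState.ONLINE)
    if state is PageState.OFFLINED:
        return state  # already offlined, nothing to do
    # Record the real outcome instead of assuming success.
    ok = do_page_offline(addr, write_request)
    pages[addr] = PageState.OFFLINED if ok else PageState.OFFLINE_FAILED
    return pages[addr]
```

In this sketch a page left in OFFLINE_FAILED is not skipped on the next
error report, so a later attempt can still move it to OFFLINED.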