
memory_failure_early_kill

file: /proc/sys/vm/memory_failure_early_kill
variable: vm.memory_failure_early_kill

Official reference

Controls how processes are killed when hardware detects, in the background, an uncorrected memory error (typically a 2-bit error in a memory module) that the kernel cannot handle. In some cases (such as when the page still has a valid copy on disk), the kernel handles the failure transparently without affecting any applications. But if there is no other up-to-date copy of the data, it kills processes to prevent the data corruption from propagating.

1: Kill all processes that have the corrupted, non-reloadable page mapped as soon as the corruption is detected. Note that this is not supported for a few types of pages, such as kernel internally allocated data or the swap cache, but it works for the majority of user pages.

0: Only unmap the corrupted page from all processes, and kill only a process that tries to access it.
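To check which of the two modes is currently active, a program can read the file named at the top of this page. The following is a minimal sketch; `read_early_kill` and `EARLY_KILL_PATH` are hypothetical names chosen for illustration, not part of any kernel or libc API.

```c
#include <stdio.h>

/* Location of the setting, as given in this document. */
#define EARLY_KILL_PATH "/proc/sys/vm/memory_failure_early_kill"

/*
 * Hypothetical helper: read the setting from `path`.
 * Returns 1 (early kill), 0 (kill only on access), or -1 if the file
 * is missing, unreadable, or holds anything other than 0 or 1.
 */
int read_early_kill(const char *path)
{
    FILE *f = fopen(path, "r");
    int value;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return (value == 0 || value == 1) ? value : -1;
}
```

A caller would typically use `read_early_kill(EARLY_KILL_PATH)` and treat -1 as "not available on this platform". Changing the value requires root, for example with `sysctl -w vm.memory_failure_early_kill=1`.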

The kill is delivered as a catchable SIGBUS with si_code BUS_MCEERR_AO, so processes can handle it if they want to.
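A process that wants to catch this signal could install a SIGBUS handler with SA_SIGINFO and check the signal code for BUS_MCEERR_AO. The sketch below only records the affected address; `mce_seen`, `mce_addr`, `is_mce_action_optional` and `install_mce_handler` are hypothetical names, and what the application does next (for example discarding or rebuilding the affected data) is left open.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Fallback for older headers; 5 is the Linux value of BUS_MCEERR_AO. */
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5
#endif

/* Written by the handler, polled by the main program (hypothetical names). */
static volatile sig_atomic_t mce_seen = 0;
static void *volatile mce_addr = NULL;

/* Returns 1 if `si` describes an "action optional" memory error, else 0. */
int is_mce_action_optional(const siginfo_t *si)
{
    return si->si_signo == SIGBUS && si->si_code == BUS_MCEERR_AO;
}

static void on_sigbus(int signo, siginfo_t *si, void *ctx)
{
    (void)signo;
    (void)ctx;
    if (is_mce_action_optional(si)) {
        /* Record the reported address; store it before raising the flag. */
        mce_addr = si->si_addr;
        mce_seen = 1;
        return;
    }
    /* Any other SIGBUS: restore the default action so that a genuine
     * fault terminates the process when the instruction is retried. */
    signal(SIGBUS, SIG_DFL);
}

/* Installs the handler; returns 0 on success, -1 on error. */
int install_mce_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sigbus;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}
```

The handler restricts itself to async-signal-safe operations and defers real recovery work to the main program, which can poll `mce_seen` and then inspect `mce_addr`.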

This is only active on architectures/platforms with advanced machine check handling and depends on the hardware capabilities.

Applications can override this setting individually with the PR_MCE_KILL prctl.
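As a rough illustration of such an override, the sketch below wraps the PR_MCE_KILL and PR_MCE_KILL_GET operations described in prctl(2). The wrapper names are hypothetical; the constants come from <sys/prctl.h>. Per prctl(2) the policy applies to the calling thread, and unused arguments must be zero.

```c
#include <sys/prctl.h>

/*
 * Hypothetical wrapper: set a thread-specific policy.
 * `policy` is PR_MCE_KILL_EARLY, PR_MCE_KILL_LATE or PR_MCE_KILL_DEFAULT.
 * Returns 0 on success, -1 on error (errno is set by prctl).
 */
int set_mce_kill_policy(unsigned long policy)
{
    return prctl(PR_MCE_KILL, (unsigned long)PR_MCE_KILL_SET, policy, 0UL, 0UL);
}

/* Hypothetical wrapper: drop the override and follow the sysctl again. */
int clear_mce_kill_policy(void)
{
    return prctl(PR_MCE_KILL, (unsigned long)PR_MCE_KILL_CLEAR, 0UL, 0UL, 0UL);
}

/* Hypothetical wrapper: returns the current policy constant, or -1 on error. */
int get_mce_kill_policy(void)
{
    return prctl(PR_MCE_KILL_GET, 0UL, 0UL, 0UL, 0UL);
}
```

For example, a process that has installed its own SIGBUS handler might call `set_mce_kill_policy(PR_MCE_KILL_EARLY)` at startup so that it is notified as soon as corruption is detected, regardless of the system-wide setting.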
