enable_soft_offline

file: /proc/sys/vm/enable_soft_offline
variable: vm.enable_soft_offline
Official reference

Correctable memory errors are very common on servers. Soft-offline is the kernel’s mechanism for handling memory pages that have (excessive) corrected memory errors.

Soft-offline behaves differently, and has different costs, depending on the type of page.

  • For a raw error page, soft-offline migrates the in-use page’s content to a new raw page.

  • For a page that is part of a transparent hugepage, soft-offline splits the transparent hugepage into raw pages, then migrates only the raw error page. As a result, the user is transparently backed by one fewer hugepage, which impacts memory access performance.

  • For a page that is part of a HugeTLB hugepage, soft-offline first migrates the entire HugeTLB hugepage, consuming a free hugepage as the migration target. The original hugepage is then dissolved into raw pages without compensation, reducing the capacity of the HugeTLB pool by 1.

It is the user’s call to choose between reliability (staying away from fragile physical memory) and the performance / capacity implications in the transparent hugepage and HugeTLB cases.
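
The HugeTLB capacity cost described above can be watched from user space. The following is a minimal sketch, assuming the standard HugePages_Total and HugePages_Free fields of /proc/meminfo; the function names are hypothetical and chosen only for illustration. Comparing the counters before and after a soft-offline event shows whether the pool has shrunk.

```python
def parse_hugetlb_counters(meminfo_text):
    """Extract the HugePages_* counters from /proc/meminfo-style text.

    Returns a dict such as {"HugePages_Total": 8, "HugePages_Free": 3}.
    Lines like "Hugepagesize:" are skipped because they do not carry
    the "HugePages_" prefix.
    """
    counters = {}
    for line in meminfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.startswith("HugePages_"):
            counters[key] = int(value.split()[0])
    return counters


def read_hugetlb_counters(path="/proc/meminfo"):
    """Read and parse the HugeTLB counters from the given meminfo file."""
    with open(path) as f:
        return parse_hugetlb_counters(f.read())
```

A drop of 1 in HugePages_Total between two calls to read_hugetlb_counters() is consistent with one HugeTLB hugepage having been dissolved, although other pool resizing can cause the same change.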

For all architectures, enable_soft_offline controls whether to soft offline memory pages. When set to 1, the kernel attempts to soft offline pages whenever it deems it necessary. When set to 0, the kernel returns EOPNOTSUPP to any request to soft offline pages. The default value is 1.
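
As a rough sketch of how the setting can be inspected and changed, the helpers below read and write the file named at the top of this page. The helper names and the path parameter are hypothetical (the parameter exists only so the code can be tried against an ordinary file); writing to the real file requires root privileges and a kernel that provides this sysctl.

```python
SYSCTL_PATH = "/proc/sys/vm/enable_soft_offline"


def get_soft_offline(path=SYSCTL_PATH):
    """Return the current value of vm.enable_soft_offline as an int (0 or 1)."""
    with open(path) as f:
        return int(f.read().strip())


def set_soft_offline(enabled, path=SYSCTL_PATH):
    """Write 1 (enabled) or 0 (disabled) to vm.enable_soft_offline.

    Raises PermissionError when run without sufficient privileges and
    FileNotFoundError when the kernel does not expose the sysctl.
    """
    with open(path, "w") as f:
        f.write("1" if enabled else "0")
```

Calling set_soft_offline(False) as root has the same effect as running sysctl -w vm.enable_soft_offline=0. The change does not persist across reboots.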

It is worth mentioning that after setting enable_soft_offline to 0, the following requests to soft offline pages will not be performed:

  • Requests to soft offline pages from the RAS Correctable Errors Collector.

  • On ARM, requests to soft offline pages from the GHES driver.

  • On PARISC, requests to soft offline pages from the Page Deallocation Table.