From Fedora Project Wiki
(F32 system wide change proposal to enable earlyoom.service by default)
 

Revision as of 19:45, 2 January 2020

Enable EarlyOOM killing

Summary

Install the earlyoom package and enable it by default. This will cause the kernel OOM killer to trigger sooner, but will not affect which process it chooses to kill. The idea is to recover from out-of-memory situations sooner, rather than suffer the typical complete system hang in which the user has no choice but to force power off.


Owner

Current status

  • Targeted release: Fedora 32
  • Last updated: 2020-01-02
  • Tracker bug: <will be assigned by the Wrangler>
  • Release notes tracker: <will be assigned by the Wrangler>

Detailed Description

The kernel in Fedora editions and spins enables the in-kernel OOM (out-of-memory) manager. Its concern is to keep the kernel itself functioning; it has no concern at all for user-space function or interactivity. This change attempts to improve the user experience in the short term by triggering the same process-killing mechanism, but sooner. Instead of the system becoming completely unresponsive for tens of minutes, hours or days, the expectation is that an offending process (determined by oom_score, same as now) will be killed off within seconds to minutes. This is better, but admittedly still suboptimal, and there is more long-term work ongoing to improve the user experience in this area.
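To see which processes are currently the most likely victims, the per-process oom_score values can be read from /proc. The following is a minimal sketch, not part of earlyoom itself; the function name read_oom_scores and its proc_root parameter are made up for illustration, and it assumes a Linux system that exposes /proc/<pid>/oom_score and /proc/<pid>/comm.

```python
from pathlib import Path


def read_oom_scores(proc_root="/proc"):
    """Return (oom_score, pid, command) tuples, highest score first.

    proc_root is a hypothetical parameter so the function can be pointed
    at a test directory instead of the real /proc.
    """
    results = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # not a process directory
        try:
            score = int((entry / "oom_score").read_text().strip())
            name = (entry / "comm").read_text().strip()
        except (OSError, ValueError):
            continue  # process exited, or the files are not readable
        results.append((score, int(entry.name), name))
    # The process with the highest score is the preferred victim
    results.sort(key=lambda item: item[0], reverse=True)
    return results
```

Calling read_oom_scores()[:5] on a running system lists the five processes with the highest scores at that moment.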

Background information on this complicated problem:
https://www.kernel.org/doc/gorman/html/understand/understand016.html
https://lwn.net/Articles/317814/

Recent discussion:
https://pagure.io/fedora-workstation/issue/98
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/message/XUZLHJ5O32OX24LG44R7UZ2TMN6NY47N/

Other in-progress solutions:
https://gitlab.freedesktop.org/hadess/low-memory-monitor


Benefit to Fedora

There are two major benefits to Fedora:

  • improved user experience by more quickly regaining control over one's system, rather than having to force power off in low-memory situations where there's aggressive swapping. Once a system becomes unresponsive, it's completely reasonable for the user to assume the system is lost, but that includes high potential for data loss.
  • reducing forced poweroff as the main workaround will increase data collection, improving understanding of low-memory situations and how to handle them better


Scope

  • Proposal owners:

Include the earlyoom package and enable it by default, both for clean installs and upgrades.

  • Other developers:

Desktop spins may choose to opt out. Server, Cloud, and IoT may choose to opt in.

  • Release engineering: #9141 (a check of an impact with Release Engineering is needed)
  • Policies and guidelines: N/A
  • Trademark approval: N/A

Upgrade/compatibility impact

Systems upgraded from Fedora 30 or 31 to Fedora 32 will also have this service enabled.


How To Test

Fedora 30/31 users can test today, on any edition or spin:

  sudo dnf install earlyoom
  sudo systemctl enable earlyoom

Then attempt to cause an out-of-memory situation. An extreme example is building webkitgtk; if your system has few CPUs or a lot of RAM, you may need to sabotage it by specifying an unreasonable number of jobs with the -j flag.
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/message/XUZLHJ5O32OX24LG44R7UZ2TMN6NY47N/

Fedora Workstation 32 (and Rawhide) users will see this service is already enabled, and can experiment with it enabled and disabled without rebooting.
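For testers who do not want to build webkitgtk, memory pressure can also be produced with a small script. This is a minimal sketch and not part of the proposal: the function name allocate_until and its parameters are invented for illustration, and the limit is explicit so that the script cannot consume more memory than requested.

```python
import time


def allocate_until(limit_mib, chunk_mib=64, pause_s=0.0):
    """Allocate chunk_mib-sized blocks until limit_mib is reached.

    Returns the list of blocks; the memory stays in use for as long as
    the caller keeps a reference to that list.
    """
    blocks = []
    allocated = 0
    while allocated + chunk_mib <= limit_mib:
        # Repeating a non-zero byte fills the whole buffer, so the pages
        # are really written rather than only reserved.
        blocks.append(b"\x01" * (chunk_mib * 1024 * 1024))
        allocated += chunk_mib
        print(f"allocated {allocated} MiB", flush=True)
        time.sleep(pause_s)
    return blocks
```

To exercise earlyoom, call allocate_until with a limit larger than the machine's RAM plus swap and a short pause, and keep the result referenced. Save any open work first: if earlyoom is disabled, the system may become unresponsive as described in this proposal.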


User Experience

The most egregious instances this change is trying to mitigate:

  a. RAM is completely used
  b. Swap is completely used
  c. System becomes unresponsive to the user as swap thrashing has ensued

  • earlyoom disabled: the user often gives up and forces power off (in my own testing this condition lasts more than 30 minutes, with no kernel-triggered OOM killer and no recovery)
  • earlyoom enabled: the system likely still becomes unresponsive, but the OOM killer is triggered in much less time (seconds or a few minutes in my testing, once less than 10% of RAM and 10% of swap remain)

earlyoom starts sending SIGTERM once both memory and swap are below their respective PERCENT setting, default 10%. It sends SIGKILL once both are below their respective KILL_PERCENT setting, default 2%.
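The two-threshold rule described above can be sketched as follows. This is an illustration of the rule as stated in this proposal, not earlyoom's actual source: the function names are hypothetical, the 10% and 2% defaults are copied from the text above, and both the strict "below" comparison and the treatment of a system without swap as having 0% swap free are assumptions.

```python
import signal


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> integer value."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def percent_free(fields):
    """Return (available memory %, free swap %) from parsed meminfo fields."""
    mem_pct = 100.0 * fields["MemAvailable"] / fields["MemTotal"]
    swap_total = fields.get("SwapTotal", 0)
    # Assumption: with no swap configured, treat free swap as 0%
    swap_pct = 100.0 * fields.get("SwapFree", 0) / swap_total if swap_total else 0.0
    return mem_pct, swap_pct


def choose_signal(mem_pct, swap_pct, term_pct=10.0, kill_pct=2.0):
    """Return the signal to send, or None while memory is still sufficient."""
    # Both memory AND swap must be below a threshold for it to apply
    if mem_pct < kill_pct and swap_pct < kill_pct:
        return signal.SIGKILL
    if mem_pct < term_pct and swap_pct < term_pct:
        return signal.SIGTERM
    return None
```

Because both values must be below the threshold, a system with 1% of memory available but 50% of swap free gets no signal under this rule.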

The package includes configuration file /etc/default/earlyoom which sets option -r 60 causing a memory report to be entered into the journal every minute.


Dependencies

The earlyoom package has no dependencies.

Contingency Plan

  • Contingency mechanism: Owner will revert all changes
  • Contingency deadline: Final freeze
  • Blocks release? No
  • Blocks product? No

Documentation

man earlyoom

https://www.kernel.org/doc/gorman/html/understand/understand016.html

Release Notes

The earlyoom service is enabled by default, which will cause the kernel oom-killer to trigger sooner. To revert to the previous behavior, run sudo systemctl disable earlyoom.service; to customize, see man earlyoom.