| Central, scheduler-driven, power-performance control |
| (EXPERIMENTAL) |
| |
| Abstract |
| ======== |
| |
| The topic of a single simple power-performance tunable, that is wholly |
| scheduler centric, and has well defined and predictable properties has come up |
| on several occasions in the past [1,2]. With techniques such as a scheduler |
| driven DVFS [3], we now have a good framework for implementing such a tunable. |
| This document describes the overall ideas behind its design and implementation. |
| |
| |
| Table of Contents |
| ================= |
| |
| 1. Motivation |
| 2. Introduction |
| 3. Signal Boosting Strategy |
| 4. OPP selection using boosted CPU utilization |
| 5. Per task group boosting |
| 6. Per-task wakeup-placement-strategy Selection |
| 7. Questions and Answers |
| - What about "auto" mode? |
| - How are multiple groups of tasks with different boost values managed? |
| 8. References |
| |
| |
| 1. Motivation |
| ============= |
| |
| Sched-DVFS [3] was a new event-driven cpufreq governor which allows the |
| scheduler to select the optimal DVFS operating point (OPP) for running a task |
| allocated to a CPU. Later, the cpufreq maintainers introduced a similar |
| governor, schedutil. The introduction of schedutil also enables running |
| workloads at the most energy efficient OPPs. |
| |
| However, sometimes it may be desired to intentionally boost the performance of |
| a workload even if that could imply a reasonable increase in energy |
| consumption. For example, in order to reduce the response time of a task, we |
| may want to run the task at a higher OPP than the one that is actually required |
| by its CPU bandwidth demand. |
| |
| This last requirement is especially important if we consider that one of the |
| main goals of the utilization-driven governor component is to replace all |
| currently available CPUFreq policies. Since sched-DVFS and schedutil are event |
| based, as opposed to the sampling driven governors we currently have, they are |
| already more responsive at selecting the optimal OPP to run tasks allocated to |
| a CPU. However, just tracking the actual task load demand may not be enough |
| from a performance standpoint. For example, it is not possible to get |
| behaviors similar to those provided by the "performance" and "interactive" |
| CPUFreq governors. |
| |
| This document describes an implementation of a tunable, stacked on top of the |
| utilization-driven governors which extends their functionality to support task |
| performance boosting. |
| |
| By "performance boosting" we mean the reduction of the time required to |
| complete a task activation, i.e. the time elapsed from a task wakeup to its |
| next deactivation (e.g. because it goes back to sleep or it terminates). For |
| example, if we consider a simple periodic task which executes the same workload |
| for 5[s] every 20[s] while running at a certain OPP, a boosted execution of |
| that task must complete each of its activations in less than 5[s]. |
| |
| A previous attempt [5] to introduce such a boosting feature has not been |
| successful mainly because of the complexity of the proposed solution. Previous |
| versions of the approach described in this document exposed a single simple |
| interface to user-space. This single tunable knob allowed the tuning of |
| system wide scheduler behaviours ranging from energy efficiency at one end |
| through to incremental performance boosting at the other end. That first |
| tunable affected all tasks. However, a system-wide knob is not useful for |
| Android products, so this version provides only a more advanced extension of |
| the concept, which uses CGroups to boost the performance of selected tasks |
| while using the energy efficient default for all others. |
| |
| The rest of this document introduces in more detail the proposed solution, |
| which has been named SchedTune. |
| |
| |
| 2. Introduction |
| =============== |
| |
| SchedTune exposes a simple user-space interface through a new CGroup |
| controller, 'schedtune', which provides two power-performance tunables |
| per group: |
| |
| /<stune cgroup mount point>/schedtune.prefer_idle |
| /<stune cgroup mount point>/schedtune.boost |
| |
| The CGroup implementation permits arbitrary user-space defined task |
| classification to tune the scheduler for different goals depending on the |
| specific nature of the task, e.g. background vs interactive vs low-priority. |
| |
| More details are given in section 5. |
| |
| 2.1 Boosting |
| ============ |
| |
| The boost value is expressed as an integer in the range [-100..0..100]. |
| |
| A value of 0 (default) configures the CFS scheduler for maximum energy |
| efficiency. This means that sched-DVFS runs the tasks at the minimum OPP |
| required to satisfy their workload demand. |
| |
| A value of 100 configures the scheduler for maximum performance, which |
| translates to the selection of the maximum OPP on that CPU. |
| |
| A value of -100 configures the scheduler for minimum performance, which |
| translates to the selection of the minimum OPP on that CPU. |
| |
| Intermediate values between -100 and 100 can be used to suit other |
| scenarios, for example to satisfy interactive response requirements or to |
| react to other system events (battery level, etc.). |
| |
| The overall design of the SchedTune module is built on top of "Per-Entity Load |
| Tracking" (PELT) signals and sched-DVFS by introducing a bias on the Operating |
| Performance Point (OPP) selection. |
| |
| Each time a task is allocated on a CPU, cpufreq is given the opportunity to tune |
| the operating frequency of that CPU to better match the workload demand. The |
| selection of the actual OPP being activated is influenced by the boost value |
| for the task CGroup. |
| |
| This simple biasing approach leverages existing frameworks, which means minimal |
| modifications to the scheduler, and yet it allows achieving a range of |
| different behaviours, all from a single simple tunable knob. |
| |
| In EAS schedulers, we use boosted task and CPU utilization for energy |
| calculation and energy-aware task placement. |
| |
| 2.2 prefer_idle |
| =============== |
| |
| This is a flag which indicates to the scheduler that userspace would like |
| the scheduler to focus on energy or to focus on performance. |
| |
| A value of 0 (default) signals to the CFS scheduler that tasks in this group |
| can be placed according to the energy-aware wakeup strategy. |
| |
| A value of 1 signals to the CFS scheduler that tasks in this group should be |
| placed to minimise wakeup latency. |
| |
| The value is combined with the boost value: task placement will not be |
| boost aware; however, CPU OPP selection is still boost aware. |
| |
| Android platforms typically use this flag for application tasks which the |
| user is currently interacting with. |
| |
| |
| 3. Signal Boosting Strategy |
| =========================== |
| |
| The whole PELT machinery works based on the value of a few load tracking signals |
| which basically track the CPU bandwidth requirements for tasks and the capacity |
| of CPUs. The basic idea behind the SchedTune knob is to artificially inflate |
| some of these load tracking signals to make a task or RQ appear more demanding |
| than it actually is. |
| |
| Which signals have to be inflated depends on the specific "consumer". However, |
| independently from the specific (signal, consumer) pair, it is important to |
| define a simple and possibly consistent strategy for the concept of boosting a |
| signal. |
| |
| A boosting strategy defines how the "abstract" user-space defined |
| sched_cfs_boost value is translated into an internal "margin" value to be added |
| to a signal to get its inflated value: |
| |
| margin := boosting_strategy(sched_cfs_boost, signal) |
| boosted_signal := signal + margin |
| |
| Different boosting strategies were identified and analyzed before selecting the |
| one found to be most effective. |
| |
| Signal Proportional Compensation (SPC) |
| -------------------------------------- |
| |
| In this boosting strategy the sched_cfs_boost value is used to compute a |
| margin which is proportional to the complement of the original signal. |
| When a signal has a maximum possible value, its complement is defined as |
| the delta from the actual value and its possible maximum. |
| |
| Since the tunable implementation uses signals which have SCHED_LOAD_SCALE as |
| the maximum possible value, the margin becomes: |
| |
| margin := sched_cfs_boost * (SCHED_LOAD_SCALE - signal) |
| |
| Using this boosting strategy: |
| - a 100% sched_cfs_boost means that the signal is scaled to the maximum value |
| - each value in the range of sched_cfs_boost effectively inflates the signal in |
| question by a quantity which is proportional to its complement, i.e. to its |
| distance from the maximum value. |
| |
| For example, by applying the SPC boosting strategy to the selection of the OPP |
| to run a task it is possible to achieve these behaviors: |
| |
| - 0% boosting: run the task at the minimum OPP required by its workload |
| - 100% boosting: run the task at the maximum OPP available for the CPU |
| - 50% boosting: run at the half-way OPP between minimum and maximum |
| |
| This means that, at 50% boosting, a task will be scheduled to run at an OPP |
| half-way between the one required by its workload and the maximum |
| theoretically achievable on the specific target platform. |
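| |
| The SPC computation above can be sketched in C as follows. This is an |
| illustrative sketch rather than the kernel implementation: the helper names |
| spc_margin() and spc_boost_signal() are hypothetical, SCHED_LOAD_SCALE is |
| assumed to be 1024, and the boost is passed as a percentage in [0..100], so |
| the product is divided by 100. Negative boost values are not modelled. |
| |
```c
/* Maximum value of a tracked signal; 1024 is an assumption of this sketch. */
#define SCHED_LOAD_SCALE 1024L

/*
 * Signal Proportional Compensation: the margin is the boost percentage
 * applied to the complement of the signal, i.e. to the headroom left
 * between the signal and its maximum value.
 * boost_pct is expected in [0..100], signal in [0..SCHED_LOAD_SCALE].
 */
long spc_margin(long boost_pct, long signal)
{
    return boost_pct * (SCHED_LOAD_SCALE - signal) / 100;
}

/* boosted_signal := signal + margin */
long spc_boost_signal(long boost_pct, long signal)
{
    return signal + spc_margin(boost_pct, signal);
}
```
| |
| With a signal of 200, a 0% boost leaves it at 200, a 50% boost raises it to |
| 612 (half-way to 1024) and a 100% boost saturates it at 1024. |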
| |
| A graphical representation of an SPC boosted signal is represented in the |
| following figure where: |
| a) "-" represents the original signal |
| b) "b" represents a 50% boosted signal |
| c) "p" represents a 100% boosted signal |
| |
| |
| ^ |
| |  SCHED_LOAD_SCALE |
| +-----------------------------------------------------------------+ |
| |pppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp |
| | |
| |                                             boosted_signal |
| |                                          bbbbbbbbbbbbbbbbbbbbbbbb |
| | |
| |                                            original signal |
| |                  bbbbbbbbbbbbbbbbbbbbbbbb+----------------------+ |
| |                                          | |
| |bbbbbbbbbbbbbbbbbb                        | |
| |                                          | |
| |                                          | |
| |                                          | |
| |                  +-----------------------+ |
| |                  | |
| |                  | |
| |                  | |
| |------------------+ |
| | |
| | |
| +-----------------------------------------------------------------------> |
| |
| The plot above shows a ramped load signal (titled 'original signal') and its |
| boosted equivalent. For each step of the original signal the boosted signal |
| corresponding to a 50% boost is midway from the original signal and the upper |
| bound. Boosting by 100% generates a boosted signal which is always saturated to |
| the upper bound. |
| |
| |
| 4. OPP selection using boosted CPU utilization |
| ============================================== |
| |
| It is worth calling out that the implementation does not introduce any new load |
| signals. Instead, it provides an API to tune existing signals. This tuning is |
| done on demand and only in scheduler code paths where it is sensible to do so. |
| The new API calls are defined to return either the default signal or a boosted |
| one, depending on the value of sched_cfs_boost. This is a clean and |
| non-invasive modification of the existing code paths. |
| |
| The signal representing a CPU's utilization is boosted according to the |
| previously described SPC boosting strategy. To sched-DVFS, this allows a CPU |
| (i.e. CFS run-queue) to appear more used than it actually is. |
| |
| Thus, with the sched_cfs_boost enabled we have the following main functions to |
| get the current utilization of a CPU: |
| |
| cpu_util() |
| boosted_cpu_util() |
| |
| The new boosted_cpu_util() is similar to the first but returns a boosted |
| utilization signal which is a function of the sched_cfs_boost value. |
| |
| This function is used in the CFS scheduler code paths where sched-DVFS needs to |
| decide the OPP to run a CPU at. |
| For example, this allows selecting the highest OPP for a CPU which has |
| the boost value set to 100%. |
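| |
| As an illustration of how a boosted utilization can drive the OPP choice, the |
| sketch below picks the lowest OPP whose capacity covers the boosted signal. |
| The struct opp table, boosted_util() and select_opp() are hypothetical names |
| introduced only for this example, SCHED_LOAD_SCALE is assumed to be 1024 and |
| the boost is a percentage in [0..100]; the actual governor code differs. |
| |
```c
#include <stddef.h>

/* Maximum utilization value; 1024 is an assumption of this sketch. */
#define SCHED_LOAD_SCALE 1024L

/* Hypothetical OPP descriptor, not a kernel structure. */
struct opp {
    unsigned int freq_khz; /* operating frequency */
    long capacity;         /* capacity at this OPP, in [0..SCHED_LOAD_SCALE] */
};

/* SPC-boosted utilization: add boost_pct percent of the remaining headroom. */
long boosted_util(long util, long boost_pct)
{
    return util + boost_pct * (SCHED_LOAD_SCALE - util) / 100;
}

/*
 * Return the index of the lowest OPP able to serve the boosted utilization.
 * The table must be sorted by ascending capacity; if no entry is large
 * enough, the highest OPP is returned.
 */
size_t select_opp(const struct opp *table, size_t n, long util, long boost_pct)
{
    long target = boosted_util(util, boost_pct);
    size_t i;

    for (i = 0; i < n; i++)
        if (table[i].capacity >= target)
            return i;
    return n ? n - 1 : 0;
}
```
| |
| With four OPPs of capacity 256, 512, 768 and 1024 and a CPU utilization of |
| 200, a 0% boost selects the first OPP, a 50% boost (boosted utilization 612) |
| selects the third one and a 100% boost selects the highest one. |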
| |
| |
| 5. Per task group boosting |
| ========================== |
| |
| On battery powered devices there usually are many background services which are |
| long running and need energy efficient scheduling. On the other hand, some |
| applications are more performance sensitive and require an interactive |
| response and/or maximum performance, regardless of the energy cost. |
| |
| To better service such scenarios, the SchedTune implementation has an extension |
| that provides a more fine grained boosting interface. |
| |
| A new CGroup controller, namely "schedtune", can be enabled which allows |
| defining and configuring task groups with different boosting values. |
| Tasks that require special performance can be put into separate CGroups. |
| The value of the boost associated with the tasks in this group can be specified |
| using a single knob exposed by the CGroup controller: |
| |
| schedtune.boost |
| |
| This knob allows the definition of a boost value that is to be used for |
| SPC boosting of all tasks attached to this group. |
| |
| The current schedtune controller implementation is really simple and has these |
| main characteristics: |
| |
| 1) It is only possible to create hierarchies with a depth of 1 level |
| |
| The root control group defines the system-wide boost value to be applied |
| by default to all tasks. Its direct subgroups are named "boost groups" and |
| they define the boost value for a specific set of tasks. |
| Further nested subgroups are not allowed since they do not have a sensible |
| meaning from a user-space standpoint. |
| |
| 2) It is possible to define only a limited number of "boost groups" |
| |
| This number is defined at compile time and by default configured to 16. |
| This is a design decision motivated by two main reasons: |
| a) In a real system we do not expect utilization scenarios with more than a few |
| boost groups. For example, a reasonable collection of groups could be |
| just "background", "interactive" and "performance". |
| b) It simplifies the implementation considerably, especially for the code |
| which has to compute the per CPU boosting once there are multiple |
| RUNNABLE tasks with different boost values. |
| |
| Such a simple design should allow servicing the main utilization scenarios identified |
| so far. It provides a simple interface which can be used to manage the |
| power-performance of all tasks or only selected tasks. |
| Moreover, this interface can be easily integrated by user-space run-times (e.g. |
| Android, ChromeOS) to implement a QoS solution for task boosting based on task |
| classification, which has been a long standing requirement. |
| |
| Setup and usage |
| --------------- |
| |
| 0. Use a kernel with CONFIG_SCHED_TUNE support enabled |
| |
| 1. Check that the "schedtune" CGroup controller is available: |
| |
| root@linaro-nano:~# cat /proc/cgroups |
| #subsys_name hierarchy num_cgroups enabled |
| cpuset 0 1 1 |
| cpu 0 1 1 |
| schedtune 0 1 1 |
| |
| 2. Mount a tmpfs to create the CGroups mount point (Optional) |
| |
| root@linaro-nano:~# sudo mount -t tmpfs cgroups /sys/fs/cgroup |
| |
| 3. Mount the "schedtune" controller |
| |
| root@linaro-nano:~# mkdir /sys/fs/cgroup/stune |
| root@linaro-nano:~# sudo mount -t cgroup -o schedtune stune /sys/fs/cgroup/stune |
| |
| 4. Create task groups and configure their specific boost value (Optional) |
| |
| For example, here we create a "performance" boost group configured to boost |
| all its tasks to 100% |
| |
| root@linaro-nano:~# mkdir /sys/fs/cgroup/stune/performance |
| root@linaro-nano:~# echo 100 > /sys/fs/cgroup/stune/performance/schedtune.boost |
| |
| 5. Move tasks into the boost group |
| |
| For example, the following moves the task with PID $TASKPID (and all its |
| threads) into the "performance" boost group. |
| |
| root@linaro-nano:~# echo $TASKPID > /sys/fs/cgroup/stune/performance/cgroup.procs |
| |
| This simple configuration allows only the threads of the $TASKPID task to run, |
| when needed, at the highest OPP in the most capable CPU of the system. |
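| |
| Steps 4 and 5 can also be performed from a user-space program. The following |
| is a minimal sketch, assuming the controller is mounted as in step 3 and the |
| boost group directory already exists; write_cgroup_value() and boost_task() |
| are hypothetical helper names, not part of any library. |
| |
```c
#include <stdio.h>
#include <sys/types.h>

/*
 * Write one integer, followed by a newline, to a cgroup control file.
 * Returns 0 on success and -1 on any error.
 */
int write_cgroup_value(const char *path, long value)
{
    FILE *f = fopen(path, "w");

    if (!f)
        return -1;
    if (fprintf(f, "%ld\n", value) < 0) {
        fclose(f);
        return -1;
    }
    /* fclose() flushes the buffer, so a rejected write is reported here. */
    return fclose(f) == 0 ? 0 : -1;
}

/*
 * Hypothetical helper: set the boost value of a group, then attach a
 * process (and all its threads) to it. group_dir is the directory of an
 * existing boost group.
 */
int boost_task(const char *group_dir, int boost, pid_t pid)
{
    char path[4096];
    int n;

    n = snprintf(path, sizeof(path), "%s/schedtune.boost", group_dir);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;
    if (write_cgroup_value(path, boost) != 0)
        return -1;

    n = snprintf(path, sizeof(path), "%s/cgroup.procs", group_dir);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;
    return write_cgroup_value(path, (long)pid);
}
```
| |
| For example, boost_task("/sys/fs/cgroup/stune/performance", 100, pid) sets the |
| boost of the "performance" group to 100 and then moves the process into it. |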
| |
| |
| 6. Per-task wakeup-placement-strategy Selection |
| =============================================== |
| |
| Many devices have a number of CFS tasks in use which require an absolute |
| minimum wakeup latency, and many tasks for which wakeup latency is not |
| important. |
| |
| For touch-driven environments, removing additional wakeup latency can be |
| critical. |
| |
| When you use the SchedTune CGroup controller, you have access to a second |
| parameter which allows a group to be marked such that energy-aware task |
| placement is bypassed for tasks belonging to that group. |
| |
| prefer_idle=0 (default - use energy-aware task placement if available) |
| prefer_idle=1 (never use energy-aware task placement for these tasks) |
| |
| Since the regular wakeup task placement algorithm in CFS is biased for |
| performance, this has the effect of restoring minimum wakeup latency |
| for the desired tasks whilst still allowing energy-aware wakeup placement |
| to save energy for other tasks. |
| |
| |
| 7. Questions and Answers |
| ======================== |
| |
| What about "auto" mode? |
| ----------------------- |
| |
| The 'auto' mode as described in [5] can be implemented by interfacing SchedTune |
| with some suitable user-space element. This element could use the exposed |
| system-wide or cgroup based interface. |
| |
| How are multiple groups of tasks with different boost values managed? |
| --------------------------------------------------------------------- |
| |
| The current SchedTune implementation keeps track of the boosted RUNNABLE tasks |
| on a CPU. The CPU utilization seen by the scheduler-driven cpufreq governors |
| (and used to select an appropriate OPP) is boosted with a value which is the |
| maximum of the boost values of the currently RUNNABLE tasks in its RQ. |
| |
| This allows cpufreq to boost a CPU only while there are boosted tasks ready |
| to run and switch back to the energy efficient mode as soon as the last boosted |
| task is dequeued. |
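| |
| The bookkeeping described above can be sketched as follows. The data layout |
| and the names used (struct cpu_boost, boost_enqueue(), boost_dequeue(), |
| cpu_boost_max()) are hypothetical and only illustrate the idea; in |
| particular, returning 0 when no task is RUNNABLE is an assumption of this |
| sketch, not a statement about the actual implementation. |
| |
```c
#include <stddef.h>

/* Number of boost groups; 16 matches the default limit stated in section 5. */
#define BOOST_GROUPS 16

/* Per-CPU view of one boost group. */
struct boost_group {
    int boost;          /* schedtune.boost of the group, in [-100..100] */
    unsigned int tasks; /* RUNNABLE tasks of this group on the CPU */
};

/* Per-CPU boost bookkeeping. */
struct cpu_boost {
    struct boost_group group[BOOST_GROUPS];
};

/* Account for a task of group idx becoming RUNNABLE on the CPU. */
void boost_enqueue(struct cpu_boost *cb, size_t idx)
{
    cb->group[idx].tasks++;
}

/* Account for a task of group idx leaving the CPU run-queue. */
void boost_dequeue(struct cpu_boost *cb, size_t idx)
{
    if (cb->group[idx].tasks > 0)
        cb->group[idx].tasks--;
}

/*
 * Boost to apply to the CPU: the maximum boost among the groups which
 * currently have RUNNABLE tasks, or 0 (an assumption) when there is none.
 */
int cpu_boost_max(const struct cpu_boost *cb)
{
    int found = 0, max = 0;
    size_t i;

    for (i = 0; i < BOOST_GROUPS; i++) {
        if (cb->group[i].tasks == 0)
            continue;
        if (!found || cb->group[i].boost > max) {
            max = cb->group[i].boost;
            found = 1;
        }
    }
    return max;
}
```
| |
| With one group boosted to 10 and another to 100, the CPU boost is 100 while a |
| task of the second group is RUNNABLE and falls back to 10 as soon as that |
| task is dequeued. |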
| |
| |
| 8. References |
| ============= |
| [1] http://lwn.net/Articles/552889 |
| [2] http://lkml.org/lkml/2012/5/18/91 |
| [3] http://lkml.org/lkml/2015/6/26/620 |