| Energy cost model for energy-aware scheduling (EXPERIMENTAL) |
| |
| Introduction |
| ============= |
| |
| The basic energy model uses platform energy data stored in sched_group_energy |
| data structures attached to the sched_groups in the sched_domain hierarchy. The |
| energy cost model offers two functions that can be used to guide scheduling |
| decisions: |
| |
| 1. static unsigned int sched_group_energy(struct energy_env *eenv) |
| 2. static int energy_diff(struct energy_env *eenv) |
| |
| sched_group_energy() estimates the energy consumed by all cpus in a specific |
| sched_group including any shared resources owned exclusively by this group of |
| cpus. Resources shared with other cpus are excluded (e.g. later level caches). |
| |
| energy_diff() estimates the total energy impact of a utilization change. That |
| is, adding, removing, or migrating utilization (tasks). |
| |
| Both functions use a struct energy_env to specify the scenario to be evaluated: |
| |
| struct energy_env { |
| struct sched_group *sg_top; |
| struct sched_group *sg_cap; |
| int cap_idx; |
| int util_delta; |
| int src_cpu; |
| int dst_cpu; |
| int energy; |
| }; |
| |
| sg_top: sched_group to be evaluated. Not used by energy_diff(). |
| |
| sg_cap: sched_group covering the cpus in the same frequency domain. Set by |
| sched_group_energy(). |
| |
| cap_idx: Capacity state to be used for energy calculations. Set by |
| find_new_capacity(). |
| |
| util_delta: Amount of utilization to be added, removed, or migrated. |
| |
| src_cpu: Source cpu from where 'util_delta' utilization is removed. Should be |
| -1 if no source (e.g. task wake-up). |
| |
| dst_cpu: Destination cpu where 'util_delta' utilization is added. Should be -1 |
| if utilization is removed (e.g. terminating tasks). |
| |
| energy: Result of sched_group_energy(). |
| |
| The metric used to represent utilization is the actual per-entity running time |
| averaged over time using a geometric series. Very similar to the existing |
| per-entity load-tracking, but _not_ scaled by task priority and capped by the |
| capacity of the cpu. The latter property does mean that utilization may |
| underestimate the compute requirements for task on fully/over utilized cpus. |
| The greatest potential for energy savings without affecting performance too much |
| is scenarios where the system isn't fully utilized. If the system is deemed |
| fully utilized load-balancing should be done with task load (includes task |
| priority) instead in the interest of fairness and performance. |
| |
| |
| Background and Terminology |
| =========================== |
| |
| To make it clear from the start: |
| |
| energy = [joule] (resource like a battery on powered devices) |
| power = energy/time = [joule/second] = [watt] |
| |
| The goal of energy-aware scheduling is to minimize energy, while still getting |
| the job done. That is, we want to maximize: |
| |
| performance [inst/s] |
| -------------------- |
| power [W] |
| |
| which is equivalent to minimizing: |
| |
| energy [J] |
| ----------- |
| instruction |
| |
| while still getting 'good' performance. It is essentially an alternative |
| optimization objective to the current performance-only objective for the |
| scheduler. This alternative considers two objectives: energy-efficiency and |
| performance. Hence, there needs to be a user controllable knob to switch the |
| objective. Since it is early days, this is currently a sched_feature |
| (ENERGY_AWARE). |
| |
| The idea behind introducing an energy cost model is to allow the scheduler to |
| evaluate the implications of its decisions rather than applying energy-saving |
| techniques blindly that may only have positive effects on some platforms. At |
| the same time, the energy cost model must be as simple as possible to minimize |
| the scheduler latency impact. |
| |
| Platform topology |
| ------------------ |
| |
| The system topology (cpus, caches, and NUMA information, not peripherals) is |
| represented in the scheduler by the sched_domain hierarchy which has |
| sched_groups attached at each level that covers one or more cpus (see |
| sched-domains.txt for more details). To add energy awareness to the scheduler |
| we need to consider power and frequency domains. |
| |
| Power domain: |
| |
| A power domain is a part of the system that can be powered on/off |
| independently. Power domains are typically organized in a hierarchy where you |
| may be able to power down just a cpu or a group of cpus along with any |
| associated resources (e.g. shared caches). Powering up a cpu means that all |
| power domains it is a part of in the hierarchy must be powered up. Hence, it is |
| more expensive to power up the first cpu that belongs to a higher level power |
| domain than powering up additional cpus in the same high level domain. Two |
| level power domain hierarchy example: |
| |
| Power source |
| +-------------------------------+----... |
| per group PD G G |
| | +----------+ | |
| +--------+-------| Shared | (other groups) |
| per-cpu PD G G | resource | |
| | | +----------+ |
| +-------+ +-------+ |
| | CPU 0 | | CPU 1 | |
| +-------+ +-------+ |
| |
| Frequency domain: |
| |
| Frequency domains (P-states) typically cover the same group of cpus as one of |
| the power domain levels. That is, there might be several smaller power domains |
| sharing the same frequency (P-state) or there might be a power domain spanning |
| multiple frequency domains. |
| |
| From a scheduling point of view there is no need to know the actual frequencies |
| [Hz]. All the scheduler cares about is the compute capacity available at the |
| current state (P-state) the cpu is in and any other available states. For that |
| reason, and to also factor in any cpu micro-architecture differences, compute |
| capacity scaling states are called 'capacity states' in this document. For SMP |
| systems this is equivalent to P-states. For mixed micro-architecture systems |
| (like ARM big.LITTLE) it is P-states scaled according to the micro-architecture |
| performance relative to the other cpus in the system. |
| |
| Energy modelling: |
| ------------------ |
| |
| Due to the hierarchical nature of the power domains, the most obvious way to |
| model energy costs is therefore to associate power and energy costs with |
| domains (groups of cpus). Energy costs of shared resources are associated with |
| the group of cpus that share the resources, only the cost of powering the |
| cpu itself and any private resources (e.g. private L1 caches) is associated |
| with the per-cpu groups (lowest level). |
| |
| For example, for an SMP system with per-cpu power domains and a cluster level |
| (group of cpus) power domain we get the overall energy costs to be: |
| |
| energy = energy_cluster + n * energy_cpu |
| |
| where 'n' is the number of cpus powered up and energy_cluster is the cost paid |
| as soon as any cpu in the cluster is powered up. |
| |
| The power and frequency domains can naturally be mapped onto the existing |
| sched_domain hierarchy and sched_groups by adding the necessary data to the |
| existing data structures. |
| |
| The energy model considers energy consumption from two contributors (shown in |
| the illustration below): |
| |
| 1. Busy energy: Energy consumed while a cpu and the higher level groups that it |
| belongs to are busy running tasks. Busy energy is associated with the state of |
| the cpu, not an event. The time the cpu spends in this state varies. Thus, the |
| most obvious platform parameter for this contribution is busy power |
| (energy/time). |
| |
| 2. Idle energy: Energy consumed while a cpu and higher level groups that it |
| belongs to are idle (in a C-state). Like busy energy, idle energy is associated |
| with the state of the cpu. Thus, the platform parameter for this contribution |
| is idle power (energy/time). |
| |
| Energy consumed during transitions from an idle-state (C-state) to a busy state |
| (P-state) or going the other way is ignored by the model to simplify the energy |
| model calculations. |
| |
| |
| Power |
| ^ |
| | busy->idle idle->busy |
| | transition transition |
| | |
| | _ __ |
| | / \ / \__________________ |
| |______________/ \ / |
| | \ / |
| | Busy \ Idle / Busy |
| | low P-state \____________/ high P-state |
| | |
| +------------------------------------------------------------> time |
| |
| Busy |--------------| |-----------------| |
| |
| Wakeup |------| |------| |
| |
| Idle |------------| |
| |
| |
| The basic algorithm |
| ==================== |
| |
| The basic idea is to determine the total energy impact when utilization is |
| added or removed by estimating the impact at each level in the sched_domain |
| hierarchy starting from the bottom (sched_group contains just a single cpu). |
| The energy cost comes from busy time (sched_group is awake because one or more |
| cpus are busy) and idle time (in an idle-state). Energy model numbers account |
| for energy costs associated with all cpus in the sched_group as a group. |
| |
| for_each_domain(cpu, sd) { |
| sg = sched_group_of(cpu) |
| energy_before = curr_util(sg) * busy_power(sg) |
| + (1-curr_util(sg)) * idle_power(sg) |
| energy_after = new_util(sg) * busy_power(sg) |
| + (1-new_util(sg)) * idle_power(sg) |
| energy_diff += energy_before - energy_after |
| |
| } |
| |
| return energy_diff |
| |
| {curr, new}_util: The cpu utilization at the lowest level and the overall |
| non-idle time for the entire group for higher levels. Utilization is in the |
| range 0.0 to 1.0 in the pseudo-code. |
| |
| busy_power: The power consumption of the sched_group. |
| |
| idle_power: The power consumption of the sched_group when idle. |
| |
| Note: It is a fundamental assumption that the utilization is (roughly) scale |
| invariant. Task utilization tracking factors in any frequency scaling and |
| performance scaling differences due to difference cpu microarchitectures such |
| that task utilization can be used across the entire system. |
| |
| |
| Platform energy data |
| ===================== |
| |
| struct sched_group_energy can be attached to sched_groups in the sched_domain |
| hierarchy and has the following members: |
| |
| cap_states: |
| List of struct capacity_state representing the supported capacity states |
| (P-states). struct capacity_state has two members: cap and power, which |
| represents the compute capacity and the busy_power of the state. The |
| list must be ordered by capacity low->high. |
| |
| nr_cap_states: |
| Number of capacity states in cap_states list. |
| |
| idle_states: |
| List of struct idle_state containing idle_state power cost for each |
| idle-state supported by the system orderd by shallowest state first. |
| All states must be included at all level in the hierarchy, i.e. a |
| sched_group spanning just a single cpu must also include coupled |
| idle-states (cluster states). In addition to the cpuidle idle-states, |
| the list must also contain an entry for the idling using the arch |
| default idle (arch_idle_cpu()). Despite this state may not be a true |
| hardware idle-state it is considered the shallowest idle-state in the |
| energy model and must be the first entry. cpus may enter this state |
| (possibly 'active idling') if cpuidle decides not enter a cpuidle |
| idle-state. Default idle may not be used when cpuidle is enabled. |
| In this case, it should just be a copy of the first cpuidle idle-state. |
| |
| nr_idle_states: |
| Number of idle states in idle_states list. |
| |
| There are no unit requirements for the energy cost data. Data can be normalized |
| with any reference, however, the normalization must be consistent across all |
| energy cost data. That is, one bogo-joule/watt must be the same quantity for |
| data, but we don't care what it is. |
| |
| A recipe for platform characterization |
| ======================================= |
| |
| Obtaining the actual model data for a particular platform requires some way of |
| measuring power/energy. There isn't a tool to help with this (yet). This |
| section provides a recipe for use as reference. It covers the steps used to |
| characterize the ARM TC2 development platform. This sort of measurements is |
| expected to be done anyway when tuning cpuidle and cpufreq for a given |
| platform. |
| |
| The energy model needs two types of data (struct sched_group_energy holds |
| these) for each sched_group where energy costs should be taken into account: |
| |
| 1. Capacity state information |
| |
| A list containing the compute capacity and power consumption when fully |
| utilized attributed to the group as a whole for each available capacity state. |
| At the lowest level (group contains just a single cpu) this is the power of the |
| cpu alone without including power consumed by resources shared with other cpus. |
| It basically needs to fit the basic modelling approach described in "Background |
| and Terminology" section: |
| |
| energy_system = energy_shared + n * energy_cpu |
| |
| for a system containing 'n' busy cpus. Only 'energy_cpu' should be included at |
| the lowest level. 'energy_shared' is included at the next level which |
| represents the group of cpus among which the resources are shared. |
| |
| This model is, of course, a simplification of reality. Thus, power/energy |
| attributions might not always exactly represent how the hardware is designed. |
| Also, busy power is likely to depend on the workload. It is therefore |
| recommended to use a representative mix of workloads when characterizing the |
| capacity states. |
| |
| If the group has no capacity scaling support, the list will contain a single |
| state where power is the busy power attributed to the group. The capacity |
| should be set to a default value (1024). |
| |
| When frequency domains include multiple power domains, the group representing |
| the frequency domain and all child groups share capacity states. This must be |
| indicated by setting the SD_SHARE_CAP_STATES sched_domain flag. All groups at |
| all levels that share the capacity state must have the list of capacity states |
| with the power set to the contribution of the individual group. |
| |
| 2. Idle power information |
| |
| Stored in the idle_states list. The power number is the group idle power |
| consumption in each idle state as well when the group is idle but has not |
| entered an idle-state ('active idle' as mentioned earlier). Due to the way the |
| energy model is defined, the idle power of the deepest group idle state can |
| alternatively be accounted for in the parent group busy power. In that case the |
| group idle state power values are offset such that the idle power of the |
| deepest state is zero. It is less intuitive, but it is easier to measure as |
| idle power consumed by the group and the busy/idle power of the parent group |
| cannot be distinguished without per group measurement points. |
| |
| Measuring capacity states and idle power: |
| |
| The capacity states' capacity and power can be estimated by running a benchmark |
| workload at each available capacity state. By restricting the benchmark to run |
| on subsets of cpus it is possible to extrapolate the power consumption of |
| shared resources. |
| |
| ARM TC2 has two clusters of two and three cpus respectively. Each cluster has a |
| shared L2 cache. TC2 has on-chip energy counters per cluster. Running a |
| benchmark workload on just one cpu in a cluster means that power is consumed in |
| the cluster (higher level group) and a single cpu (lowest level group). Adding |
| another benchmark task to another cpu increases the power consumption by the |
| amount consumed by the additional cpu. Hence, it is possible to extrapolate the |
| cluster busy power. |
| |
| For platforms that don't have energy counters or equivalent instrumentation |
| built-in, it may be possible to use an external DAQ to acquire similar data. |
| |
| If the benchmark includes some performance score (for example sysbench cpu |
| benchmark), this can be used to record the compute capacity. |
| |
| Measuring idle power requires insight into the idle state implementation on the |
| particular platform. Specifically, if the platform has coupled idle-states (or |
| package states). To measure non-coupled per-cpu idle-states it is necessary to |
| keep one cpu busy to keep any shared resources alive to isolate the idle power |
| of the cpu from idle/busy power of the shared resources. The cpu can be tricked |
| into different per-cpu idle states by disabling the other states. Based on |
| various combinations of measurements with specific cpus busy and disabling |
| idle-states it is possible to extrapolate the idle-state power. |