 Energy cost model for energy-aware scheduling (EXPERIMENTAL)

 Introduction
 =============

 The basic energy model uses platform energy data stored in sched_group_energy
 data structures attached to the sched_groups in the sched_domain hierarchy. The
 energy cost model offers two functions that can be used to guide scheduling
 decisions:

 1.	static unsigned int sched_group_energy(struct energy_env *eenv)
 2.	static int energy_diff(struct energy_env *eenv)

 sched_group_energy() estimates the energy consumed by all cpus in a specific
 sched_group including any shared resources owned exclusively by this group of
 cpus. Resources shared with other cpus are excluded (e.g. later level caches).

 energy_diff() estimates the total energy impact of a utilization change. That
 is, adding, removing, or migrating utilization (tasks).

 Both functions use a struct energy_env to specify the scenario to be evaluated:

 	struct energy_env {
 		struct sched_group      *sg_top;
 		struct sched_group      *sg_cap;
 		int                     cap_idx;
 		int                     util_delta;
 		int                     src_cpu;
 		int                     dst_cpu;
 		int                     energy;
 	};

 sg_top: sched_group to be evaluated. Not used by energy_diff().

 sg_cap: sched_group covering the cpus in the same frequency domain. Set by
 sched_group_energy().

 cap_idx: Capacity state to be used for energy calculations. Set by
 find_new_capacity().

 util_delta: Amount of utilization to be added, removed, or migrated.

 src_cpu: Source cpu from where 'util_delta' utilization is removed. Should be
 -1 if no source (e.g. task wake-up).

 dst_cpu: Destination cpu where 'util_delta' utilization is added. Should be -1
 if utilization is removed (e.g. terminating tasks).

 energy: Result of sched_group_energy().
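
 As an illustration, the three scenarios described above (wake-up, migration,
 and task exit) could be set up as sketched below. This is a userspace sketch,
 not scheduler code: the struct is a copy of the one shown above, struct
 sched_group is left opaque, and the eenv_*() helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Opaque in this sketch; the real type lives in the scheduler. */
struct sched_group;

/* Userspace copy of the struct shown above. */
struct energy_env {
	struct sched_group	*sg_top;
	struct sched_group	*sg_cap;
	int			cap_idx;
	int			util_delta;
	int			src_cpu;
	int			dst_cpu;
	int			energy;
};

/* Hypothetical helper: task wake-up, utilization is only added. */
static struct energy_env eenv_wakeup(int dst_cpu, int task_util)
{
	struct energy_env eenv = {
		.util_delta	= task_util,
		.src_cpu	= -1,		/* no source cpu */
		.dst_cpu	= dst_cpu,
	};

	assert(dst_cpu >= 0);
	return eenv;
}

/* Hypothetical helper: utilization moves from src_cpu to dst_cpu. */
static struct energy_env eenv_migrate(int src_cpu, int dst_cpu, int task_util)
{
	struct energy_env eenv = {
		.util_delta	= task_util,
		.src_cpu	= src_cpu,
		.dst_cpu	= dst_cpu,
	};

	assert(src_cpu >= 0 && dst_cpu >= 0);
	return eenv;
}

/* Hypothetical helper: terminating task, utilization is only removed. */
static struct energy_env eenv_exit(int src_cpu, int task_util)
{
	struct energy_env eenv = {
		.util_delta	= task_util,
		.src_cpu	= src_cpu,
		.dst_cpu	= -1,		/* no destination cpu */
	};

	assert(src_cpu >= 0);
	return eenv;
}
```

 The remaining members (sg_top, sg_cap, cap_idx, energy) are left zeroed in
 the sketch since, per the descriptions above, they are set during the energy
 calculation itself.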

 The metric used to represent utilization is the actual per-entity running time
 averaged over time using a geometric series. It is very similar to the existing
 per-entity load-tracking, but _not_ scaled by task priority, and it is capped by
 the capacity of the cpu. The latter property means that utilization may
 underestimate the compute requirements of tasks on fully/over utilized cpus.
 The greatest potential for energy savings without affecting performance too much
 is in scenarios where the system isn't fully utilized. If the system is deemed
 fully utilized, load-balancing should be done with task load (which includes
 task priority) instead, in the interest of fairness and performance.
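
 A minimal sketch of such a geometric-series average is shown below. It is
 not the scheduler's implementation: it uses floating point for readability,
 and the decay factor, the CAPACITY_SCALE value of 1024, and all names are
 assumptions made for the example.

```c
#include <assert.h>

#define CAPACITY_SCALE	1024		/* assumed full-scale utilization */

/* Assumed decay per period, roughly 0.5^(1/32): halves after 32 periods. */
#define DECAY_Y		0.97857206

struct util_avg {
	double sum;	/* decayed sum of per-period running fractions */
};

/* running: fraction of the last period spent running, 0.0 to 1.0. */
static void util_update(struct util_avg *ua, double running)
{
	assert(running >= 0.0 && running <= 1.0);
	/* Older periods are weighted by y, y^2, y^3, ... */
	ua->sum = ua->sum * DECAY_Y + running;
}

/* Scale to 0..CAPACITY_SCALE and cap by the capacity of the cpu. */
static int util_value(const struct util_avg *ua, int cpu_capacity)
{
	/* Limit of the series for an always-running entity: 1 / (1 - y). */
	double max_sum = 1.0 / (1.0 - DECAY_Y);
	int util = (int)(ua->sum / max_sum * CAPACITY_SCALE);

	return util > cpu_capacity ? cpu_capacity : util;
}
```

 An entity that runs continuously converges towards CAPACITY_SCALE, but the
 reported value never exceeds cpu_capacity. This is the capping effect that
 can hide the true compute requirement on a fully utilized cpu.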


 Background and Terminology
 ===========================

 To make it clear from the start:

 energy = [joule] (a limited resource on battery-powered devices)
 power = energy/time = [joule/second] = [watt]

 The goal of energy-aware scheduling is to minimize energy, while still getting
 the job done. That is, we want to maximize:

 	performance [inst/s]
 	--------------------
 	    power [W]

 which is equivalent to minimizing:

 	energy [J]
 	-----------
 	instruction

 while still getting 'good' performance. It is essentially an alternative
 optimization objective to the current performance-only objective for the
 scheduler. This alternative considers two objectives: energy-efficiency and
 performance. Hence, there needs to be a user controllable knob to switch the
 objective. Since it is early days, this is currently a sched_feature
 (ENERGY_AWARE).

 The idea behind introducing an energy cost model is to allow the scheduler to
 evaluate the implications of its decisions rather than applying energy-saving
 techniques blindly that may only have positive effects on some platforms. At
 the same time, the energy cost model must be as simple as possible to minimize
 the scheduler latency impact.

 Platform topology
 ------------------

 The system topology (cpus, caches, and NUMA information, not peripherals) is
 represented in the scheduler by the sched_domain hierarchy which has
 sched_groups attached at each level that covers one or more cpus (see
 sched-domains.txt for more details). To add energy awareness to the scheduler
 we need to consider power and frequency domains.

 Power domain:

 A power domain is a part of the system that can be powered on/off
 independently. Power domains are typically organized in a hierarchy where you
 may be able to power down just a cpu or a group of cpus along with any
 associated resources (e.g.  shared caches). Powering up a cpu means that all
 power domains it is a part of in the hierarchy must be powered up. Hence, it is
 more expensive to power up the first cpu that belongs to a higher level power
 domain than powering up additional cpus in the same high level domain. Two
 level power domain hierarchy example:

 		Power source
 		         +-------------------------------+----...
 per group PD		 G                               G
 		         |           +----------+        |
 		    +--------+-------| Shared   |  (other groups)
 per-cpu PD	    G        G       | resource |
 		    |        |       +----------+
 		+-------+ +-------+
 		| CPU 0 | | CPU 1 |
 		+-------+ +-------+

 Frequency domain:

 Frequency domains (P-states) typically cover the same group of cpus as one of
 the power domain levels. That is, there might be several smaller power domains
 sharing the same frequency (P-state) or there might be a power domain spanning
 multiple frequency domains.

 From a scheduling point of view there is no need to know the actual frequencies
 [Hz]. All the scheduler cares about is the compute capacity available at the
 current state (P-state) the cpu is in and any other available states. For that
 reason, and to also factor in any cpu micro-architecture differences, compute
 capacity scaling states are called 'capacity states' in this document. For SMP
 systems this is equivalent to P-states. For mixed micro-architecture systems
 (like ARM big.LITTLE) it is P-states scaled according to the micro-architecture
 performance relative to the other cpus in the system.
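
 As a sketch of how capacity states might be derived from P-states, the
 function below scales a frequency by a relative per-MHz performance figure
 and normalizes it against the fastest cpu at its highest frequency. The
 function name, the perf_per_mhz parameter, the 1024 scale, and the numbers in
 the usage note are assumptions for the example.

```c
#include <assert.h>

#define CAPACITY_SCALE	1024ULL		/* assumed capacity of the reference */

/*
 * freq_mhz, perf_per_mhz:		the state being converted
 * ref_freq_mhz, ref_perf_per_mhz:	fastest cpu at its highest frequency
 */
static unsigned long long capacity_of_state(unsigned long long freq_mhz,
					    unsigned long long perf_per_mhz,
					    unsigned long long ref_freq_mhz,
					    unsigned long long ref_perf_per_mhz)
{
	assert(ref_freq_mhz != 0 && ref_perf_per_mhz != 0);

	return CAPACITY_SCALE * freq_mhz * perf_per_mhz /
	       (ref_freq_mhz * ref_perf_per_mhz);
}
```

 On an SMP system perf_per_mhz is the same for all cpus, so
 capacity_of_state(600, 100, 1200, 100) gives 512, which is simply the
 P-state ratio. For a hypothetical little cpu with half the per-MHz
 performance, capacity_of_state(1000, 50, 1200, 100) gives 426.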

 Energy modelling:
 ------------------

 Due to the hierarchical nature of the power domains, the most obvious way to
 model energy costs is to associate power and energy costs with domains (groups
 of cpus). Energy costs of shared resources are associated with the group of
 cpus that share the resources. Only the cost of powering the cpu itself and any
 private resources (e.g. private L1 caches) is associated with the per-cpu
 groups (lowest level).

 For example, for an SMP system with per-cpu power domains and a cluster level
 (group of cpus) power domain we get the overall energy costs to be:

 	energy = energy_cluster + n * energy_cpu

 where 'n' is the number of cpus powered up and energy_cluster is the cost paid
 as soon as any cpu in the cluster is powered up.
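
 The formula can be sketched as a small function. The name cluster_energy()
 and the values in the usage note are hypothetical. The sketch treats the
 cluster as costing nothing when no cpu is powered up.

```c
#include <assert.h>

/*
 * energy_cluster: cost paid once any cpu in the cluster is powered up
 * energy_cpu:     cost per powered-up cpu
 * n:              number of cpus powered up, at most nr_cpus
 */
static unsigned long cluster_energy(unsigned long energy_cluster,
				    unsigned long energy_cpu,
				    unsigned int n, unsigned int nr_cpus)
{
	assert(n <= nr_cpus);

	if (n == 0)
		return 0;	/* whole cluster powered down */

	return energy_cluster + n * energy_cpu;
}
```

 With energy_cluster = 10 and energy_cpu = 5 in a two-cpu cluster, the first
 cpu costs 15 while the second adds only 5. This is why powering up the first
 cpu in a higher level power domain is the expensive step.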

 The power and frequency domains can naturally be mapped onto the existing
 sched_domain hierarchy and sched_groups by adding the necessary data to the
 existing data structures.

 The energy model considers energy consumption from two contributors (shown in
 the illustration below):

 1. Busy energy: Energy consumed while a cpu and the higher level groups that it
 belongs to are busy running tasks. Busy energy is associated with the state of
 the cpu, not an event. The time the cpu spends in this state varies. Thus, the
 most obvious platform parameter for this contribution is busy power
 (energy/time).

 2. Idle energy: Energy consumed while a cpu and higher level groups that it
 belongs to are idle (in a C-state). Like busy energy, idle energy is associated
 with the state of the cpu. Thus, the platform parameter for this contribution
 is idle power (energy/time).

 Energy consumed during transitions from an idle-state (C-state) to a busy state
 (P-state) or going the other way is ignored by the model to simplify the energy
 model calculations.


 	Power
 	^
 	|            busy->idle             idle->busy
 	|            transition             transition
 	|
 	|                _                      __
 	|               / \                    /  \__________________
 	|______________/   \                  /
 	|                   \                /
 	|  Busy              \    Idle      /        Busy
 	|  low P-state        \____________/         high P-state
 	|
 	+------------------------------------------------------------> time

 Busy    |--------------|                          |-----------------|

 Wakeup                 |------|            |------|

 Idle                          |------------|


 The basic algorithm
 ====================

 The basic idea is to determine the total energy impact when utilization is
 added or removed by estimating the impact at each level in the sched_domain
 hierarchy starting from the bottom (sched_group contains just a single cpu).
 The energy cost comes from busy time (sched_group is awake because one or more
 cpus are busy) and idle time (in an idle-state). Energy model numbers account
 for energy costs associated with all cpus in the sched_group as a group.

 	for_each_domain(cpu, sd) {
 		sg = sched_group_of(cpu)
 		energy_before = curr_util(sg) * busy_power(sg)
 				+ (1-curr_util(sg)) * idle_power(sg)
 		energy_after = new_util(sg) * busy_power(sg)
 				+ (1-new_util(sg)) * idle_power(sg)
 		energy_diff += energy_before - energy_after

 	}

 	return energy_diff

 {curr, new}_util: The cpu utilization at the lowest level and the overall
 non-idle time for the entire group for higher levels. Utilization is in the
 range 0.0 to 1.0 in the pseudo-code.

 busy_power: The power consumption of the sched_group.

 idle_power: The power consumption of the sched_group when idle.

 Note: It is a fundamental assumption that the utilization is (roughly) scale
 invariant. Task utilization tracking factors in any frequency scaling and
 performance scaling differences due to different cpu micro-architectures such
 that task utilization can be used across the entire system.
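
 The pseudo-code can be turned into a small, self-contained sketch as shown
 below. It is an illustration only: the sched_domain walk is replaced by an
 array with one entry per level, utilization is a double in the range 0.0 to
 1.0 as in the pseudo-code, and struct level and the function names are
 hypothetical.

```c
#include <assert.h>

/* One entry per level in the hierarchy, lowest level (single cpu) first. */
struct level {
	double busy_power;	/* power of the group when busy */
	double idle_power;	/* power of the group when idle */
	double curr_util;	/* utilization before the change */
	double new_util;	/* utilization after the change */
};

/* Energy per unit time at one level for a given utilization. */
static double level_energy(const struct level *l, double util)
{
	assert(util >= 0.0 && util <= 1.0);

	return util * l->busy_power + (1.0 - util) * l->idle_power;
}

/* Sum of (energy_before - energy_after) over all levels. */
static double energy_diff_sketch(const struct level *levels, int nr_levels)
{
	double diff = 0.0;
	int i;

	for (i = 0; i < nr_levels; i++)
		diff += level_energy(&levels[i], levels[i].curr_util) -
			level_energy(&levels[i], levels[i].new_util);

	return diff;
}
```

 Following the pseudo-code, the result is energy_before minus energy_after,
 so a negative value means the change costs energy. For example, raising a
 cpu with busy_power 100 and idle_power 10 from 0.2 to 0.5, and its cluster
 with busy_power 50 and idle_power 5 from 0.4 to 0.6, gives
 (28 - 55) + (23 - 32) = -36.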


 Platform energy data
 =====================

 struct sched_group_energy can be attached to sched_groups in the sched_domain
 hierarchy and has the following members:

 cap_states:
 	List of struct capacity_state representing the supported capacity states
 	(P-states). struct capacity_state has two members: cap and power, which
 	represents the compute capacity and the busy_power of the state. The
 	list must be ordered by capacity low->high.

 nr_cap_states:
 	Number of capacity states in cap_states list.

 idle_states:
 	List of struct idle_state containing idle_state power cost for each
 	idle-state supported by the system, ordered by shallowest state first.
 	All states must be included at all levels in the hierarchy, i.e. a
 	sched_group spanning just a single cpu must also include coupled
 	idle-states (cluster states). In addition to the cpuidle idle-states,
 	the list must also contain an entry for idling using the arch
 	default idle (arch_idle_cpu()). Although this state may not be a true
 	hardware idle-state, it is considered the shallowest idle-state in the
 	energy model and must be the first entry. cpus may enter this state
 	(possibly 'active idling') if cpuidle decides not to enter a cpuidle
 	idle-state. Default idle may not be used when cpuidle is enabled.
 	In this case, it should just be a copy of the first cpuidle idle-state.

 nr_idle_states:
 	Number of idle states in idle_states list.

 There are no unit requirements for the energy cost data. Data can be normalized
 with any reference; however, the normalization must be consistent across all
 energy cost data. That is, one bogo-joule/watt must be the same quantity for
 all data, but we don't care what it is.
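
 A sketch of what such a table could look like for a single-cpu group follows.
 The member names come from the description above; the member types, the
 example numbers, and the two helper functions are assumptions for
 illustration. sge_pick_cap_idx() only mimics the role of selecting a capacity
 state (lowest state that fits the utilization) and is not the kernel's
 find_new_capacity().

```c
#include <assert.h>

#define ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))

struct capacity_state {
	unsigned long cap;	/* compute capacity */
	unsigned long power;	/* busy power at this capacity */
};

struct idle_state {
	unsigned long power;	/* power in this idle-state */
};

struct sched_group_energy {
	unsigned int nr_idle_states;
	struct idle_state *idle_states;
	unsigned int nr_cap_states;
	struct capacity_state *cap_states;
};

/* Hypothetical numbers in bogo-watts, ordered by capacity low->high. */
static struct capacity_state example_cap_states[] = {
	{ .cap =  358, .power = 150 },
	{ .cap =  512, .power = 250 },
	{ .cap = 1024, .power = 700 },
};

/* Shallowest first; entry 0 is the arch default idle. */
static struct idle_state example_idle_states[] = {
	{ .power = 20 },	/* default idle: copy of first cpuidle state */
	{ .power = 20 },	/* first cpuidle idle-state */
	{ .power =  0 },	/* coupled (cluster) idle-state */
};

static struct sched_group_energy example_cpu_energy = {
	.nr_idle_states	= ARRAY_SIZE(example_idle_states),
	.idle_states	= example_idle_states,
	.nr_cap_states	= ARRAY_SIZE(example_cap_states),
	.cap_states	= example_cap_states,
};

/* Returns 1 if cap_states is ordered by capacity low->high, else 0. */
static int sge_cap_states_ordered(const struct sched_group_energy *sge)
{
	unsigned int i;

	for (i = 1; i < sge->nr_cap_states; i++)
		if (sge->cap_states[i].cap < sge->cap_states[i - 1].cap)
			return 0;
	return 1;
}

/* Lowest capacity state that fits util, or the highest state if none. */
static unsigned int sge_pick_cap_idx(const struct sched_group_energy *sge,
				     unsigned long util)
{
	unsigned int i;

	assert(sge->nr_cap_states > 0);
	for (i = 0; i < sge->nr_cap_states; i++)
		if (sge->cap_states[i].cap >= util)
			return i;
	return sge->nr_cap_states - 1;
}
```

 The ordering requirement is what lets a simple linear scan find the lowest
 capacity state that still fits the utilization.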

 A recipe for platform characterization
 =======================================

 Obtaining the actual model data for a particular platform requires some way of
 measuring power/energy. There isn't a tool to help with this (yet). This
 section provides a recipe for use as reference. It covers the steps used to
 characterize the ARM TC2 development platform. This sort of measurement is
 expected to be done anyway when tuning cpuidle and cpufreq for a given
 platform.

 The energy model needs two types of data (struct sched_group_energy holds
 these) for each sched_group where energy costs should be taken into account:

 1. Capacity state information

 A list containing the compute capacity and power consumption when fully
 utilized attributed to the group as a whole for each available capacity state.
 At the lowest level (group contains just a single cpu) this is the power of the
 cpu alone without including power consumed by resources shared with other cpus.
 It basically needs to fit the basic modelling approach described in the
 "Background and Terminology" section:

 	energy_system = energy_shared + n * energy_cpu

 for a system containing 'n' busy cpus. Only 'energy_cpu' should be included at
 the lowest level. 'energy_shared' is included at the next level which
 represents the group of cpus among which the resources are shared.

 This model is, of course, a simplification of reality. Thus, power/energy
 attributions might not always exactly represent how the hardware is designed.
 Also, busy power is likely to depend on the workload. It is therefore
 recommended to use a representative mix of workloads when characterizing the
 capacity states.

 If the group has no capacity scaling support, the list will contain a single
 state where power is the busy power attributed to the group. The capacity
 should be set to a default value (1024).

 When frequency domains include multiple power domains, the group representing
 the frequency domain and all child groups share capacity states. This must be
 indicated by setting the SD_SHARE_CAP_STATES sched_domain flag. All groups at
 all levels that share the capacity state must have the list of capacity states
 with the power set to the contribution of the individual group.

 2. Idle power information

 Stored in the idle_states list. The power number is the group idle power
 consumption in each idle state, as well as when the group is idle but has not
 entered an idle-state ('active idle' as mentioned earlier). Due to the way the
 energy model is defined, the idle power of the deepest group idle state can
 alternatively be accounted for in the parent group busy power. In that case the
 group idle state power values are offset such that the idle power of the
 deepest state is zero. This is less intuitive, but easier to measure, because
 the idle power consumed by the group and the busy/idle power of the parent
 group cannot be distinguished without per-group measurement points.
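
 The offsetting step can be sketched as follows. The function name is
 hypothetical. The sketch assumes the list is ordered shallowest first, so the
 last entry is the deepest state, and that no state consumes less than the
 deepest one.

```c
#include <assert.h>

/*
 * Offset all idle power values so that the deepest state becomes zero.
 * Returns the amount removed, which is then to be accounted for in the
 * busy power of the parent group.
 */
static unsigned long idle_power_rebase(unsigned long *idle_power,
				       unsigned int nr_idle_states)
{
	unsigned long deepest;
	unsigned int i;

	assert(nr_idle_states > 0);
	deepest = idle_power[nr_idle_states - 1];

	for (i = 0; i < nr_idle_states; i++) {
		assert(idle_power[i] >= deepest);
		idle_power[i] -= deepest;
	}

	return deepest;
}
```

 For example, measured values of 30, 20, and 8 become 22, 12, and 0, and the
 returned 8 is attributed to the parent group.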

 Measuring capacity states and idle power:

 The capacity states' capacity and power can be estimated by running a benchmark
 workload at each available capacity state. By restricting the benchmark to run
 on subsets of cpus it is possible to extrapolate the power consumption of
 shared resources.

 ARM TC2 has two clusters of two and three cpus respectively. Each cluster has a
 shared L2 cache. TC2 has on-chip energy counters per cluster. Running a
 benchmark workload on just one cpu in a cluster means that power is consumed in
 the cluster (higher level group) and a single cpu (lowest level group). Adding
 another benchmark task to another cpu increases the power consumption by the
 amount consumed by the additional cpu. Hence, it is possible to extrapolate the
 cluster busy power.
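
 The extrapolation amounts to solving two equations: the one-cpu measurement
 is cluster + cpu and the two-cpu measurement is cluster + 2 * cpu. A sketch
 under that assumption is shown below. The names are hypothetical, and it
 assumes both measurements are taken at the same capacity state with identical
 cpus.

```c
#include <assert.h>

struct busy_split {
	unsigned long cpu;	/* busy power of one cpu */
	unsigned long cluster;	/* busy power of the shared resources */
};

/*
 * p_one: cluster power with the benchmark on one cpu
 * p_two: cluster power with the benchmark on two cpus
 */
static struct busy_split split_busy_power(unsigned long p_one,
					  unsigned long p_two)
{
	struct busy_split s;

	/* The second cpu cannot reduce power or cost more than the total. */
	assert(p_two >= p_one && p_two - p_one <= p_one);

	s.cpu = p_two - p_one;		/* added by the second cpu */
	s.cluster = p_one - s.cpu;	/* what remains is shared */

	return s;
}
```

 With hypothetical readings of 150 for one busy cpu and 200 for two, the
 per-cpu busy power is 50 and the cluster busy power is 100.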

 For platforms that don't have energy counters or equivalent instrumentation
 built-in, it may be possible to use an external DAQ to acquire similar data.

 If the benchmark includes some performance score (for example sysbench cpu
 benchmark), this can be used to record the compute capacity.

 Measuring idle power requires insight into the idle state implementation on the
 particular platform, specifically whether the platform has coupled idle-states
 (or package states). To measure non-coupled per-cpu idle-states it is necessary
 to keep one cpu busy so that any shared resources stay alive. This isolates the
 idle power of the cpu from the idle/busy power of the shared resources. The cpu
 can be tricked into different per-cpu idle states by disabling the other
 states. Based on various combinations of measurements with specific cpus busy
 and idle-states disabled, it is possible to extrapolate the idle-state power.