## System-Level Low-Power Techniques



Naehyuck Chang, Dept. of EECS/CSE, Seoul National University (naehyuck@snu.ac.kr)



#### Contents

- Dynamic Power Management
  - DPM introduction
  - Time-out method
  - Predictive method
  - Stochastic method
- Dynamic Voltage Scaling
  - DVS introduction
  - intra-task DVS
  - inter-task DVS



#### Definition

- Systems and components are
  - Designed to deliver peak performance
  - Not needing the peak performance most of the time
- Slack and idle time exist
- Dynamic power management (DPM)
  - In wide-sense definition, DPM includes DVS
  - Shut-down idle components
- Dynamic voltage scaling (DVS)
  - Slow-down components, by scaling down frequency and voltage
  - DFS and DVFS





#### **Power manageable components**

- Components with several internal states
  - Corresponding to power and service levels
- Abstracted as power state machines
  - State diagram with:
    - Power and service annotation on states
    - Power and delay annotation on edges





#### Example: SA1100

- RUN
  - Operational state
- IDLE
  - A software routine may stop the CPU when not in use, while monitoring interrupts
- SLEEP
  - Shutdown of on-chip activity







#### **Another example: hard disk drive**

- Model: Fujitsu MHF 2043 AT





#### ELPL Embedded Low-Power Laboratory

#### **Structure of power-manageable systems**

- System consists of several components:
  - E.g., Laptop: processor, memory, disk, display, and so on
  - E.g., SoC: CPU, DSP, FPU, RF unit, and so on
- Components may
  - Self-manage state transitions
  - Be controlled externally
- Power manager (PM)
  - Abstraction of power control unit
  - Implemented typically in software
  - Energy consumption of PM is negligible



## The applicability of DPM

- State transition power (P<sub>tr</sub>) and delay (T<sub>tr</sub>)
- If $T_{tr} = 0$ and $P_{tr} = 0$, the policy is trivial
  - Stop a component whenever it is not needed
- If $T_{tr} \neq 0$ or $P_{tr} \neq 0$
  - Shut down only when the idle period is long enough to amortize the transition cost
  - What if T and P fluctuate?







#### The opportunity

- Reduce power according to workloads
- Shutdown only during long idle time





#### The challenge



- Is an idle period long enough for shutdown (longer than $T_{be}$)?
- **Predicting the future!**





- Break even time: T<sub>be</sub>
- Shortest idle period for energy saving














#### System break-even time: T<sub>BE</sub>

• Minimum idle time for amortizing the cost of component shutdown

$$T_{BE} = T_{tr} + T_{tr} \frac{P_{tr} - P_{on}}{P_{on} - P_{off}}$$

- T<sub>tr</sub>: transition delay
- P<sub>tr</sub>: transition power
- P<sub>on</sub>: power in the on (active) state
- P<sub>off</sub>: sleep power
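The break-even formula can be checked with a few lines of Python. This is a minimal sketch: the numeric values are illustrative and do not come from a data sheet, and clamping the extra term to zero when P<sub>tr</sub> ≤ P<sub>on</sub> is an assumption about how the formula is meant to be applied.

```python
def break_even_time(t_tr, p_tr, p_on, p_off):
    """Return T_BE for one shutdown/wake-up round trip.

    t_tr  : total transition delay (s)
    p_tr  : average power drawn during the transition (W)
    p_on  : power in the on state (W)
    p_off : power in the sleep state (W)
    """
    if p_on <= p_off:
        raise ValueError("the sleep state must draw less power than the on state")
    # Assumption: if the transition is cheaper than staying on,
    # the transition delay alone bounds the break-even time.
    extra = max(p_tr - p_on, 0.0)
    return t_tr + t_tr * extra / (p_on - p_off)


# Hypothetical component: 2 s transition at 3 W, 1 W when on, 0.2 W asleep.
t_be = break_even_time(t_tr=2.0, p_tr=3.0, p_on=1.0, p_off=0.2)
print(f"T_BE = {t_be:.2f} s")  # prints "T_BE = 7.00 s"
```

With these numbers, an idle period shorter than 7 s costs more energy with a shutdown than without one.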










#### When to use power management

- When T<sub>BE</sub> < T<sup>avg</sup><sub>idle</sub>
  - Average idle periods are long enough
  - Transition delay is short enough
  - Transition power is low enough
  - Sleep power is low enough
- When designing system for a known workload
  - Criteria for component specification and design





## **Controlling PM systems**



- DPM is a control problem: a policy is the control law
  - Collect observations
  - Issue commands
- Optimal control
  - Synthesize the *"best"* controller (PM)




#### **Categories of DPM techniques**

- Timeout : [Karlin94, Douglis95, Li94, Krishnan99]
  - Shut down the system when the time-out expires
  - Fixed vs. adaptive
- Predictive : [Chung99, Golding95, Hwang00, Srivastava96]
  - Shut down the system if the predicted idle period is longer than T<sub>be</sub>
- Stochastic : [Chung99, Benini99, Qiu99, Simunic01]
  - Model the system stochastically (Markov chain)
  - Policy optimization with constraints
  - Trade off between energy saving and performance
  - Non-deterministic decision
  - Discrete time model/continuous time model
  - Superior to predictive and timeout










#### **Time-out method (I)**

- Shut down the system once it has been idle for longer than a pre-defined threshold T<sub>TO</sub>
  - Widely used technique
    - PC, monitor, disk, ...
- Rationale
  - When $T_{idle} > T_{TO}$, it is likely that $T_{idle} > T_{TO} + T_{BE}$
- How to determine T<sub>TO</sub>?
  - Choice of T<sub>TO</sub> is critical
    - A large value is safe, but it may save little or no energy
    - A value that is too small is highly undesirable: it triggers shutdowns that do not pay off
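The effect of the time-out value can be explored with a small simulation. The sketch below uses a simplified energy model (one transition charge per shutdown, sleep power for whatever idle time remains); the idle trace and power numbers are hypothetical.

```python
def timeout_energy(idle_periods, t_to, t_tr, p_tr, p_on, p_off):
    """Return (energy in J, number of shutdowns) for a fixed time-out policy."""
    energy, shutdowns = 0.0, 0
    for t_idle in idle_periods:
        if t_idle <= t_to:
            # The time-out never expires: the component stays on.
            energy += p_on * t_idle
        else:
            shutdowns += 1
            energy += p_on * t_to          # power wasted while waiting for the time-out
            energy += p_tr * t_tr          # cost of the state transition
            energy += p_off * max(t_idle - t_to - t_tr, 0.0)
    return energy, shutdowns


# Hypothetical idle trace (s) and component parameters.
trace = [1, 3, 20, 40, 2, 60]
params = dict(t_tr=2.0, p_tr=3.0, p_on=1.0, p_off=0.2)

for t_to in (0.0, 5.0, 30.0):
    e, n = timeout_energy(trace, t_to, **params)
    print(f"T_TO = {t_to:4.1f} s: {e:6.1f} J, {n} shutdowns")
```

In this trace, a 5 s time-out uses slightly less energy than shutting down immediately while halving the number of wake-ups, and a 30 s time-out gives up most of the saving.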





#### **Time-out method (II)**

- Two typical ways to control the time-out value
  - Fixed time-out
    - Independent of the workload
  - Adaptive time-out
    - Varies the time-out value depending on the workload
- Limitations
  - Performance penalty for wake-up is paid after every shutdown
  - Power is wasted while waiting for T<sub>TO</sub> to expire
  - A time-out policy has no way to avoid either cost
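One possible adaptive rule is sketched below: lengthen the time-out after a shutdown that did not pay off, and shorten it after one that did. The doubling/halving factor, the bounds, and the function name are hypothetical choices for illustration, not the scheme of any of the cited papers.

```python
def adaptive_timeouts(idle_periods, t_be, t_to=4.0, t_min=0.5, t_max=30.0, factor=2.0):
    """Return the time-out value that was in effect for each idle period."""
    used = []
    for t_idle in idle_periods:
        used.append(t_to)
        if t_idle > t_to:                    # the time-out expired: component was shut down
            if t_idle - t_to < t_be:         # sleep was too short to pay off
                t_to = min(t_to * factor, t_max)
            else:                            # sleep paid off: be more aggressive next time
                t_to = max(t_to / factor, t_min)
    return used


print(adaptive_timeouts([6, 6, 50, 50, 3], t_be=7.0))
# prints [4.0, 8.0, 8.0, 4.0, 2.0]
```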










#### **Predictive method**

- Observe time-varying workload
  - Predict idle period T<sub>pred</sub> ~ T<sub>idle</sub>
  - Go to sleep state if T<sub>pred</sub> is long enough to amortize state transition cost
- Main issue: prediction accuracy
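As an illustration, the sketch below uses an exponential average of past idle periods as the predictor, which is one well-known predictor family. The smoothing factor `a`, the initial prediction, and the trace are assumptions chosen for the example.

```python
def predictive_shutdown(idle_periods, t_be, a=0.5, initial=0.0):
    """Return (prediction, sleep decision, was sleeping actually worthwhile) per idle period."""
    pred = initial
    decisions = []
    for t_idle in idle_periods:
        sleep = pred > t_be                       # decide before the idle period is known
        decisions.append((pred, sleep, t_idle > t_be))
        pred = a * t_idle + (1 - a) * pred        # exponential average update
    return decisions


for pred, sleep, worthwhile in predictive_shutdown([10, 12, 11, 1, 1, 9], t_be=5.0):
    verdict = "ok" if sleep == worthwhile else "misprediction"
    print(f"T_pred = {pred:6.3f}  sleep = {sleep!s:5}  {verdict}")
```

The trace shows both kinds of error: the predictor stays awake during the first long idle periods, and keeps sleeping for a while after the workload turns bursty.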







#### When to use predictive methods?

- When the workload has memory (past idle periods carry information about future ones)
- Implementing predictive schemes
  - Predictor families must be chosen based on workload types
  - Predictor parameters must be tuned to the instance-specific workload statistics
  - Low cost
  - When the workload is non-stationary or unknown, on-line adaptation is required








#### **Stochastic method**

- Recognize inherent uncertainty
  - Exact prediction of future events is impossible
  - Abstraction of system model implies uncertainty
- Model components, system and workload as stochastic processes
- Expected values of cost metrics are optimized
  - Power
  - Latency





#### **System modeling**





#### **Controlled Markov Processes**

- Component and workload modeled as Markov chains
  - Component is called service provider (**SP**)
  - Workload is called service requester (SR)
  - System (S) is the combination of SR and SP (with queue)
- SP is a controlled Markov chain:
  - State transition probabilities depend on commands
- The power manager (PM) observes the state of the system and issues commands to control evolution
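A controlled Markov chain can be represented as one transition matrix per command. The sketch below models a two-state SP; the state names, command names, and probabilities are hypothetical.

```python
SP_STATES = ("on", "off")

# TRANSITION[command][current state] = probabilities of the next state, in SP_STATES order.
TRANSITION = {
    "s_on":  {"on": (1.0, 0.0), "off": (0.1, 0.9)},   # wake-up takes 10 slices on average
    "s_off": {"on": (0.2, 0.8), "off": (0.0, 1.0)},   # shutdown takes 1.25 slices on average
}


def step(dist, command):
    """Propagate a distribution over SP states by one time slice under `command`."""
    nxt = [0.0] * len(SP_STATES)
    for i, state in enumerate(SP_STATES):
        for j, p in enumerate(TRANSITION[command][state]):
            nxt[j] += dist[i] * p
    return tuple(nxt)


dist = (0.0, 1.0)              # the SP is off with certainty
for _ in range(3):
    dist = step(dist, "s_on")  # the PM keeps issuing the wake-up command
print(dist)                    # probability of being on after 3 slices is 1 - 0.9**3
```

The same state evolves differently depending on the command, which is what makes the chain "controlled".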







#### **Discrete-time, finite-state CMPs**

- Discrete time t = 1, 2, ...
  - System evaluated at periodic time points
- SR and SP are modeled by Markov chains
- PM can issue a finite number of commands a in A



#### **Power management policies**

- PM observes system state and issues a command
- A policy is a sequence of commands
- A Markovian policy yields commands as function of system state (and not previous history)
- A deterministic policy
  - For each state s in S, policy specifies command a in A
- A randomized policy
  - For each state *s* in *S*, policy specifies the probability of issuing command *a*
- A stationary policy
  - The policy does not change with time
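The policy types above map naturally onto lookup tables. In this sketch a system state is a hypothetical (SP state, SR state) pair; a deterministic policy stores one command per state and a randomized policy stores a probability per command. Both tables are stationary and Markovian because they depend only on the current state.

```python
import random

DETERMINISTIC = {
    ("on", "idle"): "s_off",
    ("on", "busy"): "s_on",
    ("off", "idle"): "s_off",
    ("off", "busy"): "s_on",
}

RANDOMIZED = {
    ("on", "idle"): {"s_on": 0.3, "s_off": 0.7},
    ("on", "busy"): {"s_on": 1.0, "s_off": 0.0},
    ("off", "idle"): {"s_on": 0.0, "s_off": 1.0},
    ("off", "busy"): {"s_on": 1.0, "s_off": 0.0},
}


def issue_command(policy, state, rng):
    """Return the command the PM issues in `state` under `policy`."""
    entry = policy[state]
    if isinstance(entry, str):            # deterministic: one command per state
        return entry
    commands = list(entry)                # randomized: sample according to the table
    weights = [entry[c] for c in commands]
    return rng.choices(commands, weights=weights, k=1)[0]


rng = random.Random(0)
print(issue_command(DETERMINISTIC, ("on", "idle"), rng))
print([issue_command(RANDOMIZED, ("on", "idle"), rng) for _ in range(5)])
```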





## **PM policy optimization**

- Solve a stochastic optimal control problem:
  - Find a policy that
    - Minimizes the power cost function
    - Satisfies the performance constraints
  - **Dual formulation**: minimize the performance penalty subject to a power constraint
- Key result for CMPs:
  - Optimum policy is stationary, Markovian and randomized
  - Policy optimization can be reduced to a linear program (LP) and solved exactly and efficiently
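One standard way to write such a linear program is over state-action frequencies $x_{s,a}$ (the long-run fraction of time slices spent in state $s$ issuing command $a$). This is a sketch of the average-cost form; the symbols $p(s,a)$ for power, $d(s,a)$ for performance penalty and $D$ for the penalty bound are notation introduced here, and the cited work may use a different (for example discounted) variant.

$$\min_{x \ge 0} \; \sum_{s \in S} \sum_{a \in A} x_{s,a} \, p(s,a)$$

$$\text{subject to} \quad \sum_{a \in A} x_{s',a} = \sum_{s \in S} \sum_{a \in A} x_{s,a} \, P(s' \mid s, a) \quad \forall s' \in S$$

$$\sum_{s \in S} \sum_{a \in A} x_{s,a} = 1, \qquad \sum_{s \in S} \sum_{a \in A} x_{s,a} \, d(s,a) \le D$$

The randomized stationary policy is then recovered as $\pi(a \mid s) = x_{s,a} / \sum_{a'} x_{s,a'}$ for every state with a non-zero denominator.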





#### **Power-performance trade-off**







#### **CMP** advantages

- Constrained optimization:
  - Energy/performance (latency) trade-off
- Global view of the system:
  - Workload and component models
- Optimum policy is captured by commands:
  - Control policy is a table
  - Policy implementation is easy
- Policy computation can be cast as a linear program and solved exactly and efficiently





#### **CMP** limitations

- Discrete-time models require periodic evaluation
  - Use continuous-time Markov models
- Event-driven paradigm
- Stochastic distributions:
  - Geometric and exponential distribution of events may not fit component and workload
  - Use (time-indexed) semi-Markov models
- Non-stationary workloads
  - Use adaptive schemes










## **Dynamic Voltage Scaling definitions**

- For a given task T and its deadline $d_T$
  - Reduce the voltage and frequency so that task T finishes as close as possible to its deadline $d_T$, but never after it





## **Alpha-Power Model**

• Simple hand calculation model that empirically fits the real data

$$I_{DS} = K_S \frac{W}{L} (V_{GS} - V_T)^{\alpha}$$

- α is closer to 1 than to 2 (approximately 1.25) and continues to approach 1 as technology scales

$$I_{ON} = I_0 (S\alpha)^{-\alpha} (V_{GS} - V_T)^{\alpha}$$
$$I_{sub} = I_0 e^{-\alpha} e^{\frac{V_{GS} - V_T}{S}}$$
$$Delay \propto \frac{V_{DD}}{(V_{DD} - V_T)^{\alpha}}$$
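The delay relation can be evaluated numerically to see how quickly a circuit slows down as the supply approaches the threshold. This is a minimal sketch; the threshold voltage, the reference supply and the sampled voltages are illustrative values.

```python
def relative_delay(vdd, vt, alpha, vdd_ref):
    """Delay at `vdd` divided by delay at `vdd_ref`, using Delay ~ VDD / (VDD - VT)**alpha."""
    if vdd <= vt or vdd_ref <= vt:
        raise ValueError("the supply voltage must exceed the threshold voltage")

    def delay(v):
        return v / (v - vt) ** alpha

    return delay(vdd) / delay(vdd_ref)


# Hypothetical process: VT = 0.4 V, alpha = 1.25, reference supply 1.2 V.
for v in (1.2, 1.0, 0.8, 0.6):
    print(f"VDD = {v:.1f} V -> delay x{relative_delay(v, 0.4, 1.25, 1.2):.2f}")
```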





## **DVS effect**

- Exploits under-utilized resources by reducing f and V
  - Power is proportional to frequency and voltage<sup>2</sup>
  - Energy is proportional to power and time
    - Frequency scaling alone does not reduce energy: power drops, but execution time grows by the same factor
- Overhead: typically tens to hundreds of microseconds
  - Wait until voltage is stabilized
  - Wait until frequency is stabilized
- Order of change
  - When f is going up: change voltage first, so the supply is already high enough for the faster clock
  - When f is going down: change frequency first, so the clock is already slow enough for the lower supply
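The difference between frequency-only scaling and combined voltage and frequency scaling can be seen with a short calculation. The sketch assumes dynamic power of the form C<sub>eff</sub> V<sup>2</sup> f; the effective capacitance, cycle count and operating points are hypothetical.

```python
def task_energy_and_time(cycles, freq_hz, vdd, c_eff=1e-9):
    """Return (energy in J, time in s) for a task of `cycles` clock cycles."""
    power = c_eff * vdd ** 2 * freq_hz     # P ~ f * V^2
    time = cycles / freq_hz
    return power * time, time


def transition_order(f_now, f_next):
    """Return the order of the two changes for a safe operating-point switch."""
    if f_next > f_now:
        return ("raise voltage", "raise frequency")
    if f_next < f_now:
        return ("lower frequency", "lower voltage")
    return ()


CYCLES = 100_000_000
for label, f, v in (("full speed", 200e6, 1.5), ("DFS only", 100e6, 1.5), ("DVFS", 100e6, 1.1)):
    e, t = task_energy_and_time(CYCLES, f, v)
    print(f"{label:10s}: {e * 1e3:6.1f} mJ in {t * 1e3:6.1f} ms")
print(transition_order(200e6, 100e6))
```

Halving the frequency alone doubles the execution time without changing the energy; only the accompanying voltage reduction saves energy.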





## **DVS supporting HW block diagram**



Figure 2.4. Block diagram of DVS-enabled processor [36]

- Procedure (when f<sub>d</sub> is larger)
  - Processor writes the desired frequency to frequency register (f<sub>d</sub>)
  - DC/DC converter compares f<sub>d</sub> with f<sub>c</sub> (current frequency)
  - DC/DC converter changes VDD to a certain value paired with f<sub>d</sub>
  - VCO adapts the system clock

## **Processors supporting DVS**

- Recent processors such as XScale and the ARM11 series also support DVS
  - IEM: Intelligent Energy Manager from ARM

| Category   | Processor              | Clock range (MHz) | Voltage range (V) | Transition time                            |
|------------|------------------------|-------------------|-------------------|--------------------------------------------|
| Commercial | Transmeta's Crusoe [2] | 200-700           | 1.1-1.65          | 300 µs                                     |
| Commercial | AMD's Mobile K6 [3]    | 192-588           | 0.9-2.0           | 200 µs                                     |
| Commercial | Intel PXA250 [4]       | 100-400           | 0.85-1.3          | 500 µs                                     |
| Commercial | Compaq's Itsy [24]     | 59.0-206.4        | 1.0-1.55          | 189 µs                                     |
| Commercial | TI's TMS320C55x [23]   | 6-200             | 1.1-1.6           | 3.3 ms (1.6 → 1.1 V), 300 µs (1.1 → 1.6 V) |
| Academic   | Burd et al. [1]        | 5-80              | 1.2-3.8           | 520 µs                                     |
| Academic   | LART [25]              | 59-251            | 0.79-1.65         | 5.5 ms (1.65 → 0.79 V), 40 µs (0.79 → 1.65 V) |





## **DVS classification**







## Inter-task vs. Intra-task DVS (I)

- Classification is based on the scaling granularity
- Inter-task DVFS
  - Scaling occurs at the start of a task
    - It is unchanged until the task is completed
  - Use worst-case slack time (= Deadline<sub>task</sub> − WCET<sub>task</sub>)
  - Usually used in multi-task scheduling scenario at OS level
- Intra-task DVFS
  - Scaling occurs at the sub-task level
    - Different frequency is set for each sub-task
  - Use workload-variation slack time
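For the inter-task case, the worst-case slack translates directly into a single speed setting for the whole task. The sketch below shows that calculation under the simplifying assumption that execution time scales inversely with clock speed; the function name and the numbers are illustrative.

```python
def inter_task_speed(wcet, deadline):
    """Normalized speed (1.0 = full speed) that stretches the WCET exactly to the deadline."""
    if not 0 < wcet <= deadline:
        raise ValueError("the task must be schedulable at full speed")
    return wcet / deadline


wcet_ms, deadline_ms = 6.0, 10.0
slack_ms = deadline_ms - wcet_ms                  # worst-case slack
speed = inter_task_speed(wcet_ms, deadline_ms)
print(f"slack = {slack_ms} ms, run the whole task at {speed:.0%} of full speed")
```

Any slack that appears at run time because the task finishes early is not used by this setting; exploiting it is the purpose of intra-task scaling.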



## Inter-task vs. Intra-task DVS (II)

- Average-Case Execution Time (ACET) rather than Worst-Case Execution Time (WCET)
  - Much finer granularity than inter-task
  - Fully exploits the slack time arising from task execution time variation
  - Requires off-line profiling and source code modification
  - Can achieve higher energy saving compared to inter-task
  - Energy and delay overheads of voltage switching must be carefully considered





#### Intra-task DVS





- By Shin and Kim (SNU)
- For a given piece of code with hard real-time constraints
  - Extract the control flow graph (CFG)
- Each execution path has a different execution time
- WCET method
  - Loses too much slack
- ACET method
  - Hard to predict which path will be executed





• Example



- 51 paths exist
  - Worst path : 200 \* 10<sup>6</sup> cycles
  - 12 out of 51 paths: under 100 \* 10<sup>6</sup> cycles





- Problem definition
  - Find an optimal voltage / frequency pair at each edge (b<sub>i</sub>, b<sub>j</sub>) in a given CFG
- Pre-processing
  - Estimate the # of clock cycles required for each basic block
  - Estimate the visiting probability for each path
- Indirect energy consumption metrics
  - Clock speed representation
    - normalized to initial clock speed
  - Speed Update ratio
    - clock speed ratio between two edges
- Energy can be easily estimated from the information above
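A speed update ratio can be illustrated on a toy CFG. The sketch below assumes a loop-free graph and uses the remaining worst-case cycle count after each block: when a branch leaves the worst-case path, the clock can slow down by the ratio of the remaining worst-case cycles. The block names, cycle counts and the exact ratio definition are assumptions for illustration and may differ from the formulation in the referenced paper.

```python
from functools import lru_cache

# Hypothetical CFG: b1 branches to a long block b2 or a short block b3, both reach b4.
CYCLES = {"b1": 10, "b2": 80, "b3": 20, "b4": 10}           # in units of 10^6 cycles
SUCC = {"b1": ("b2", "b3"), "b2": ("b4",), "b3": ("b4",), "b4": ()}


@lru_cache(maxsize=None)
def rwec(block):
    """Remaining worst-case execution cycles from the start of `block`."""
    return CYCLES[block] + max((rwec(s) for s in SUCC[block]), default=0)


def speed_update_ratio(bi, bj):
    """New speed divided by old speed when control flows along edge (bi, bj)."""
    return rwec(bj) / (rwec(bi) - CYCLES[bi])


print(rwec("b1"))                       # worst-case path length
print(speed_update_ratio("b1", "b2"))   # staying on the worst-case path: no change
print(speed_update_ratio("b1", "b3"))   # leaving it: the clock can slow down
```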











- Too many decision points
  - Increase voltage / frequency changing overhead
- To deal with this issue, predict the path!
  - RWEP: predict the remaining worst-case execution path
  - RAEP: predict the remaining average-case execution path
- To cope with the mis-prediction
  - Voltage scaling edges (VSE) are selected
  - Based on static timing analysis for the given code
- VSE can change the speed
  - RWEP: monotonically decrease
  - RAEP: either decrease or increase





Automated tool flow for this method






#### Intra-task DVS

Result for RWEP method







• Comparison of RWEP and RAEP












## Inter-task voltage scaling technique

- Single processor environment
  - Similar to the conventional task scheduling method
  - Additional work is to exploit the slack maximally
- Multi processor environment
  - In conjunction with task assignment problem
  - Need to consider the communication overhead
- We will see the multi processor environment with the consideration of energy gradient





## **Target architecture and task graph**

- Two heterogeneous processors
  - Transmeta Crusoe and StrongARM with XScale technology
  - Connected by a single bus
- Each processor has its own dedicated memory
- Task graph (system specification)
  - Already scheduled (five tasks)





## **Task and inter-task information**

- Nominal task execution time / power dissipation
  - **PE0**: $V_{max} = 5V$, $V_t = 1.2V$
  - **PE1**: $V_{max} = 3.3V$, $V_t = 0.8V$

| task     | PE0 exe. time (ms) | PE0 power (mW) | PE1 exe. time (ms) | PE1 power (mW) |
|----------|--------------------|----------------|--------------------|----------------|
| $\tau_0$ | 0.15               | 85             | 0.70               | 30             |
| $\tau_1$ | 0.40               | 90             | 0.30               | 20             |
| $\tau_2$ | 0.10               | 75             | 0.75               | 15             |
| $\tau_3$ | 0.10               | 50             | 0.15               | 80             |
| $\tau_4$ | 0.15               | 100            | 0.20               | 60             |

- Communication time / power dissipation

| comm.                     | comm. time (ms) | power dis. (mW) |
|---------------------------|-----------------|-----------------|
| $\gamma_{0\rightarrow 1}$ | 0.05            | 5               |
| $\gamma_{1\rightarrow 2}$ | 0.05            | 5               |
| $\gamma_{1\rightarrow 3}$ | 0.15            | 5               |
| $\gamma_{2\rightarrow 4}$ | 0.10            | 5               |



# **One possible mapping scenario**

- Task mapping
  - P0: T0, T4
  - P1: T1, T2, T3
- Simply compute the power of each processor (at nominal voltage)
- Slack exists for T3 and T4







# With non energy-gradient model

- Evenly distribute the slack to all the tasks
- Extension factor
  - $e = \left(\sum_{\tau} t_{nom}(\tau) + t_s\right) / \sum_{\tau} t_{nom}(\tau)$, summed over all tasks $\tau$
- Delay
  - $d \sim 1/f \sim k_d \, V_{dd} / (V_{dd} - V_t)^2$
- Voltage
  - $V_{dd} = V_t + \frac{V_0}{2d^*} + \sqrt{\left(V_t + \frac{V_0}{2d^*}\right)^2 - V_t^2}$
- Energy reduction: 8.2%
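The delay and voltage relations can be tried out numerically. This sketch reads the square-root term as $(V_t + V_0/2d^*)^2 - V_t^2$ and assumes $V_0 = (V_{max} - V_t)^2 / V_{max}$, so that a normalized delay $d^* = 1$ gives back $V_{max}$; that normalization is not stated on the slide. The slack value in the example is hypothetical.

```python
import math


def extension_factor(nominal_times, slack):
    """e = (sum of nominal times + slack) / sum of nominal times."""
    total = sum(nominal_times)
    return (total + slack) / total


def scaled_vdd(v_max, v_t, d_star):
    """Supply voltage whose delay is `d_star` times the delay at `v_max` (d_star >= 1)."""
    if d_star < 1.0:
        raise ValueError("tasks can only be slowed down")
    v0 = (v_max - v_t) ** 2 / v_max        # assumed normalization constant
    h = v_t + v0 / (2.0 * d_star)
    return h + math.sqrt(h * h - v_t * v_t)


# PE0 parameters from the task table; 0.05 ms of slack is a made-up value.
e = extension_factor([0.15, 0.15], slack=0.05)
v = scaled_vdd(v_max=5.0, v_t=1.2, d_star=e)
print(f"e = {e:.3f}, V_dd = {v:.3f} V, energy ratio = {(v / 5.0) ** 2:.3f}")
```

The energy ratio uses the fact that, for a fixed number of cycles, dynamic energy scales with the square of the supply voltage.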





# With energy-gradient model

- Suppose that Δt = 0.01 ms
  - Ten times smaller than the slack
- Compute energy gradient for all tasks
  - Using $\Delta E_{\tau} = E_{\tau}(t_{exe}) - E_{\tau}(t_{exe} + \Delta t)$
  - $E_{\tau}(t_{exe})$ is given from the table
  - $E_{\tau}(t_{exe} + \Delta t)$ is computed by using the previous method for entire slack range
  - The task which has the largest gradient is the winner and is extended by Δt
    - The largest energy saver
- Iteratively perform the energy gradient computation until the slack is used up
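The greedy procedure can be sketched in a few lines. Task times, powers and voltages are taken from the task table and mapping; the per-processor slack budgets (10 steps on PE0, 6 steps on PE1) are an assumption inferred from the extension row of the result table, and the real algorithm recomputes slack from the schedule instead. Tasks without slack are left out. The voltage model assumes $V_0 = (V_{max} - V_t)^2 / V_{max}$.

```python
import math

DT = 0.01  # extension step in ms

# name: (processor, nominal time in ms, nominal power in mW, V_max, V_t)
TASKS = {
    "t0": ("PE0", 0.15, 85.0, 5.0, 1.2),
    "t4": ("PE0", 0.15, 100.0, 5.0, 1.2),
    "t3": ("PE1", 0.15, 80.0, 3.3, 0.8),
}
SLACK_STEPS = {"PE0": 10, "PE1": 6}   # assumed slack per processor, in units of DT


def vdd(v_max, v_t, d_star):
    v0 = (v_max - v_t) ** 2 / v_max
    h = v_t + v0 / (2.0 * d_star)
    return h + math.sqrt(h * h - v_t * v_t)


def energy(name, steps):
    """Energy in uJ after extending the task by steps * DT."""
    _, t_nom, p_nom, v_max, v_t = TASKS[name]
    v = vdd(v_max, v_t, (t_nom + steps * DT) / t_nom)
    return p_nom * t_nom * (v / v_max) ** 2


def gradient(name, steps):
    """Energy saved by granting one more step of DT."""
    return energy(name, steps) - energy(name, steps + 1)


def distribute_slack():
    steps = {name: 0 for name in TASKS}
    left = dict(SLACK_STEPS)
    order = []
    while True:
        candidates = [n for n in TASKS if left[TASKS[n][0]] > 0]
        if not candidates:
            return steps, order
        winner = max(candidates, key=lambda n: gradient(n, steps[n]))
        steps[winner] += 1
        left[TASKS[winner][0]] -= 1
        order.append(winner)


steps, order = distribute_slack()
print({name: round(n * DT, 2) for name, n in steps.items()})
print(order)
```

Under these assumptions the first gradients come out close to the first row of the result table for τ<sub>0</sub>, τ<sub>3</sub> and τ<sub>4</sub>.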



# **Result of energy-gradient model**

Energy gradient $\Delta E$ (µJ) per iteration:

| iteration | $\tau_0$ | $\tau_1$ | $\tau_2$ | $\tau_3$ | $\tau_4$ |
|-----------|----------|----------|----------|----------|----------|
| 1         | 0.960    | 0.234    | 0.156    | 0.899    | 1.130    |
| 2         | 0.960    | 0.234    | 0.156    | 0.899    | 0.965    |
| 3         | 0.960    | 0.234    | 0.156    | 0.899    | 0.833    |
| 4         | 0.820    | 0.234    | 0.156    | 0.899    | 0.833    |
| 5         | 0.820    | 0.234    | 0.156    | 0.768    | 0.833    |
| 6         | 0.820    | 0.234    | 0.156    | 0.768    | 0.725    |
| 7         | 0.708    | 0.234    | 0.156    | 0.768    | 0.725    |
| 8         | 0.708    | 0.234    | 0.156    | 0.663    | 0.725    |
| 9         | 0.708    | 0.234    | 0.156    | 0.663    | 0.636    |
| 10        | 0.616    | 0.234    | 0.156    | 0.663    | 0.636    |
| 11        | 0.616    | 0.234    | 0.156    | 0.578    | 0.636    |
| 12        | 0.616    | 0.234    | 0.156    | 0.578    | 0.562    |
| 13        | 0.541    | 0.234    | 0.156    | 0.578    | 0.562    |
| 14        | 0.541    | 0.234    | 0.156    | 0.507    | 0.562    |
| 15        |          |          |          | 0.507    |          |
| 16        |          |          |          | 0.451    |          |
| extension | 0.04     | 0        | 0        | 0.06     | 0.06     |







# Formal approach (I)

- MSTG generation
  - Mapped-Scheduled Task Graph
  - Insert communication edges
  - Each comm. edge is represented as a pseudo node
  - Insert pseudo edge
    - Dependency of tasks mapped to the same resource



# Formal approach (II)

- Perform the scheduling as described qualitatively earlier
  - Q<sub>E</sub>: priority queue
  - T<sub>d</sub>: set of tasks that have deadlines
  - $t_S(\tau)$: task start time



```
Algorithm: PV-DVS
Input:  - task graph G_S(T, C)
        - mapping
        - schedule
        - architectural information
        - minimum extension time Δt_min
Output: - energy optimised voltages V_dd(τ)
        - dissipated dynamic energy E

MSTG_TRANSFORM                          // generate MSTG from G_S
Q_E ← ∅
forall (τ ∈ T_d) { Δt_d(τ) := t_d(τ) − (t_S(τ) + t_exe(τ)) }
forall (τ ∈ T)   { calculate t_ε }
forall (τ ∈ T)   { if t_ε ≥ Δt_min then Q_E := Q_E + τ }
Δt = min(t_ε) / |Q_E|;  if Δt < Δt_min then Δt = Δt_min
forall (τ ∈ Q_E) { calculate ΔE(τ) }
reorder Q_E in decreasing order of ΔE
while (Q_E ≠ ∅) {
    select first task τ_ΔEmax ∈ Q_E
    t_exe(τ_ΔEmax) := t_exe(τ_ΔEmax) + Δt
    update E(τ_ΔEmax)
    forall (τ ∈ T)   { update t_S, t_E and t_ε }
    forall (τ ∈ Q_E) { if (t_ε(τ) < Δt_min) ∨ (V_dd(τ) ≤ V_t(τ))
                       then Q_E := Q_E − τ }
    Δt = min(t_ε) / |Q_E|;  if Δt < Δt_min then Δt = Δt_min
    forall (τ ∈ Q_E) { update ΔE(τ) }
    reorder Q_E in decreasing order of ΔE
}
delete MSTG
return E_Σ and V_dd(τ), ∀ (τ ∈ T)
```

# Voltage change is practically discrete (I)

- For a task T
  - We have t<sub>exe</sub> with V<sub>dd</sub> in the continuous domain
- In the discrete domain
  - The two nearest available voltages (one lower, one higher) can be used
  - e.g. $V_{d1} < V_{dd} < V_{d2}$
- How?
  - Split the execution time between the two voltages: $t_{exe} = t_{d1} + t_{d2}$

$$t_{d1} = t_{exe} \cdot \frac{V_{d1}\,(V_{dd}-V_t)^2}{(V_{d1}-V_t)^2\, V_{dd}} \cdot \frac{\dfrac{V_{dd}}{(V_{dd}-V_t)^2} - \dfrac{V_{d2}}{(V_{d2}-V_t)^2}}{\dfrac{V_{d1}}{(V_{d1}-V_t)^2} - \dfrac{V_{d2}}{(V_{d2}-V_t)^2}}$$



# Voltage change is practically discrete (II)

- Make $t_{d1}$ and $t_{d2}$ correspond to integer numbers of clock cycles
  - Since they should be represented as # of clock cycles
- $t_i = NC_i / f_i$
  - NC<sub>i</sub>: # of clock cycles executed at frequency f<sub>i</sub>
- $NC_{d1} = \lfloor t_{d1} \cdot f_{d1} \rfloor$
- $NC_{d2} = NC_{tot} - NC_{d1}$
- $t_{d1} = NC_{d1} / f_{d1}$
- $t_{d2} = NC_{d2} / f_{d2}$
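The split between two discrete voltages can be sketched as follows. The code assumes a cycle time proportional to $V / (V - V_t)^2$ and rounds the number of cycles at the lower voltage down, so that the task never takes longer than at the ideal continuous voltage; the rounding direction and the example voltages are assumptions.

```python
import math


def cycle_time(v, v_t, k=1.0):
    """Clock period at supply voltage v, up to the constant k."""
    return k * v / (v - v_t) ** 2


def split_cycles(nc_tot, v_dd, v_d1, v_d2, v_t):
    """Return (NC_d1, NC_d2): cycles to run at the lower and the higher discrete voltage."""
    if not v_t < v_d1 <= v_dd <= v_d2:
        raise ValueError("expected V_t < V_d1 <= V_dd <= V_d2")
    tau, tau1, tau2 = (cycle_time(v, v_t) for v in (v_dd, v_d1, v_d2))
    if tau1 == tau2:                      # V_d1 == V_d2: no split is needed
        return nc_tot, 0
    nc_d1 = math.floor(nc_tot * (tau - tau2) / (tau1 - tau2))
    return nc_d1, nc_tot - nc_d1


# Hypothetical task: 1000 cycles, ideal voltage 2.5 V, available levels 2.0 V and 3.3 V.
nc1, nc2 = split_cycles(1000, v_dd=2.5, v_d1=2.0, v_d2=3.3, v_t=0.8)
t_split = nc1 * cycle_time(2.0, 0.8) + nc2 * cycle_time(3.3, 0.8)
t_ideal = 1000 * cycle_time(2.5, 0.8)
print(nc1, nc2, t_split <= t_ideal)
```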





### Summary

- Two system-level power management techniques
  - DPM by shutdown
  - DVS by extending the execution time
- DPM
  - Time-out / Predictive / Stochastic
  - Prediction accuracy is critical
- DVS
  - Intra-task / Inter-task
- Common feature of DVS and DPM
  - Exploiting idleness



### References

- Luca Benini, Giovanni De Micheli, "Dynamic Power Management: Design Techniques and CAD Tools", Springer, 2005
- Marcus T. Schmitz, Bashir M. Al-Hashimi, and Petru Eles, "System-Level Design Techniques for Energy-Efficient Embedded Systems", Kluwer Academic Publishers, 2003
- D. Shin, J. Kim and S. Lee, "Intra-Task Voltage Scheduling for Low-Energy Hard Real-Time Applications," IEEE Design and Test of Computers, Vol.18 No.2, pp. 20-30, March 2001
- E. Macii, M. Pedram, "Leakage Power Modelling and Minimization", Tutorial at ICCAD, 2004

