# Software-level Power-Aware Computing

Lecture 4

# Lecture Organization

- Lecture 1
  - Introduction to low-power systems
  - Low-power binary encoding
  - Power-aware compiler techniques
- Lectures 2 & 3
  - Dynamic voltage scaling (DVS) techniques
    - OS-level DVS: Inter-Task DVS
    - Compiler-level DVS: Intra-Task DVS
    - Application-level DVS
  - Dynamic power management
- Lecture 4
  - Software power estimation & optimization
  - Low-power techniques for multiprocessor systems
  - Leakage reduction techniques

Low Power SW.4 J. Kim/SNU

### Software Power Estimation & Optimization

- Modeling-based approaches
  - Instruction-level power modeling (V. Tiwari)
    - Employs the base energy cost of each instruction
    - Instruction-level analysis and optimization
  - Component-based power modeling
    - Wattch, SimplePower
- Measurement-based approaches
  - SES: SNU Energy Scanner
    - Cycle-level power measurement on the target board
  - PowerScope (J. Flinn)
    - Function-level analysis w/ DMM (digital multimeter)
  - ePRO: energy PRofiler and Optimizer
    - Function-level analysis and optimization w/ DAQ (data acquisition device)

### Instruction-level Power Estimation (1)

- Base energy cost
  - Base energy costs vary with operands and addresses, so average base cost values are employed

#### Instruction-level Power Estimation (2)

- Inter-instruction effects
  - Effect of circuit state: switching activity in a circuit
    - Avg. overhead determined through extensive experiments on pairs of instructions
  - Effect of resource constraints: pipeline stalls and write buffer stalls
    - Avg. energy cost of each stall determined experimentally
  - Effect of cache misses
    - Avg. energy penalty of cache miss cycles

$$E_p = \sum_i (B_i \times N_i) + \sum_{i,j} (O_{i,j} \times N_{i,j}) + \sum_k E_k$$

- $E_p$: average energy cost for a program
- $B_i$: the base cost of instruction $i$
- $N_i$: the number of times instruction $i$ is executed
- $O_{i,j}$: the circuit state overhead of the instruction pair $(i, j)$
- $N_{i,j}$: the number of times the pair $(i, j)$ is executed
- $E_k$: other inter-instruction effects (stalls and cache misses)

### Instruction-level Power Estimation (3)

[Figure: energy estimation framework for a program]

### Instruction-level Power Optimization

- Instruction reordering
  - A technique to reduce the circuit state overhead
  - Instructions are scheduled to minimize the estimated switching in the control path
  - For the 486DX, energy savings are only up to 2%, but for a DSP up to 33.1%
- Energy cost driven code generation
  - Instructions with memory operands draw much higher currents than those with register operands
  - Better register utilization, such as optimal global register allocation of temporaries and frequently used variables

Results of energy optimization of sort and circle:

| Program (sort)        | hlcc.asm | hht1.asm | hht2.asm | hht3.asm |
|-----------------------|----------|----------|----------|----------|
| Avg. Current (mA)     | 525.7    | 534.2    | 507.6    | 486.6    |
| Execution Time (µsec) | 11.02    | 9.37     | 8.73     | 7.07     |
| Energy ($10^{-6}$ J)  | 19.12    | 16.52    | 14.62    | 11.35    |

| Program (circle)      | clcc.asm | cht1.asm | cht2.asm | cht3.asm |
|-----------------------|----------|----------|----------|----------|
| Avg. Current (mA)     | 530.2    | 527.9    | 516.3    | 514.8    |

# SES: SNU Energy Scanner (1)

**Overall Structure**

- Summary
  - On-board, cycle-level power measurement
  - Source code related energy analysis

### SES: SNU Energy Scanner (2)

[Figure: energy measurement H/W module]

### SES: SNU Energy Scanner (3)

[Figure: energy analyzer GUI]

### SES: SNU Energy Scanner (4)

- Pros
  - No additional measurement device (like DMM or DAQ) necessary
  - Cycle-level accuracy and timeliness
  - Source code related energy analysis
    - C program function or instruction level
- Cons
  - No portability
    - For each processor, new hardware and a new program are necessary
    - → time, cost, and effort!!
  - No exact performance-energy correlation
    - Performance is not measured

# PowerScope (1)

**Overall Structure**

- Summary
  - Power measurement w/ external device (DMM)
  - Source code related energy analysis

### PowerScope (2)

- Pros
  - Portability
    - Employs Linux LKM (Loadable Kernel Module) and DMM
  - Moderate accuracy but fast measurement
  - Source code related energy analysis
    - C program function level
- Cons
  - Overhead
    - Sampling trigger feedback between the target board and the DMM
      - → varying interrupt handling time
    - Long profile function path
  - No performance-energy correlation
    - Performance is not measured

### ePRO: energy PRofiler and Optimizer (1)

- Summary
  - Power measurement w/ external device (DAQ)
  - Source code related energy, performance, and code size analysis
  - Automatic compiler-level optimization

### ePRO: energy PRofiler and Optimizer (2)

- Overview
  - Automated tool that analyzes and optimizes software energy and performance based on measurement
- Function details
  - Performance analysis
    - Function-level performance indices
  - Energy analysis
    - Function-level energy consumption
    - Device-level energy consumption
  - Energy optimization
    - Energy-optimal compiler option selection
  - Integrated Development Environment (IDE)
    - Eclipse plug-in

# ePRO: energy PRofiler and Optimizer (3)

- Performance analysis
  - Uses the XScale processor's PMU
    - CPI (Cycles Per Instruction)
    - I-cache/D-cache efficiency
    - Instruction fetch latency
    - Data/bus request buffer: D-cache buffer stall
    - Stall/writeback statistics
    - I-TLB/D-TLB efficiency

[Screenshot: function-level profile table with one row per function and columns Function, Source, Code, Energy Consumption, Energy Ratio, Execution Time, I-Cache Miss, D-Cache Miss]

### ePRO: energy PRofiler and Optimizer (4)

- Function-level energy consumption analysis
- Device-level energy consumption analysis
  - CPU, RAM, Flash, HDD, etc.

| Device Name | Energy Consumption | Percentage (%) |
|-------------|--------------------|----------------|
| _TOTAL_     | 678.84             | 100.00         |
| CPU         | 123.90             | 18.25          |
| Flash       | 43.60              | 6.42           |
| HDD         | 410.03             | 60.40          |
| RAM         | 101.31             | 14.92          |
### ePRO: energy PRofiler and Optimizer (5)

- Energy optimization
  - CL-OSE (Compiler-Level Optimal Space Exploration): selects the energy-optimal options for the target program, in a time-efficient way, from among all the available compiler options

# ePRO: energy PRofiler and Optimizer (6)

- Integrated Development Environment
  - Employs Eclipse's plug-in function

# ePRO: energy PRofiler and Optimizer (7)

- Pros
  - Portability
    - Employs Linux LKM and DAQ assembly
  - Program function-level energy, performance, and code size analysis
  - Automated compiler-level energy optimization
- Cons
  - Overhead
    - System behavior sampling overhead
  - Limited to processors with a PMU (Performance Monitoring Unit), e.g. XScale
  - No support for multiple processes so far

### 3D Graphics Pipeline

- 3D graphics pipeline
  - Geometry calculation
    - Calculation of geometric data of objects
  - Rasterization
    - Converting an object into pixels on a screen

[Figure: 3D graphics pipeline, illustrated with a face model]

## Power Breakdown of 3D Graphics

[Figure: power consumption of a 3D graphics application]

# A Low-Power Texture Mapping (1)

- Previous work
  - "A Low-Power Content-Adaptive Texture Mapping Architecture for Real-Time 3D Graphics", Jeongseon Euh et al., PACS'02
    - Adaptive texture mapping
      - Based on a model of human visual perception (HVP)
    - DVS is applied to the interpolation block
  - "Trading Efficiency for Energy in a Texture Cache Architecture", Iosif Antochi et al., MPCS'02
    - Mobile devices cannot afford a large texture cache
      - Due to gate count limitations and the need for low power consumption
    - A 128 to 512-byte texture cache between the graphics accelerator and texture memory
# A Low-Power Texture Mapping (2)

- A low-power texture mapping technique for mobile 3D graphics
  - A small texture cache can increase the miss ratio
  - A technique to preserve performance is needed
    - Prefetching
    - Victim cache

# A Low-Power Texture Mapping (3)

- Prefetch techniques
  - Technique 1: Prediction of next texels
    - Division is required due to "perspective correction"
  - Technique 2: Prediction of next blocks
    - Assumes that the derivatives do not change
    - Division is eliminated
  - Technique 3: Prediction of next blocks based on the direction of texture map access
    - Simple, but less exact than technique 2
  - A small fully associative prefetch buffer is used

# Technique 1

# Technique 2

# Technique 3

# A Low-Power Texture Mapping (4)

- Using a victim cache
  - Sizes of texture images are powers of two
    - Conflict misses can occur between blocks
      - Especially in a small texture cache
  - Blocks are reused when processing the next spanline
  - A victim cache can reduce conflict misses
    - The prefetch buffer also serves as the victim cache
    - Evicted blocks are moved into the prefetch buffer

# A Low-Power Texture Mapping (5)

- Experimental results

[Figures: area reduction; prefetch accuracy]

# A Low-Power Texture Mapping (6)

[Figure: miss ratio and power consumption]

## Research on 3D Graphics (1)

- "An Effective Pixel Rasterization Pipeline Architecture for 3D Rendering Processors", Woo-Chan Park et al., IEEE Transactions on Computers '03
  - Avoids unnecessary texture mapping for obscured fragments
  - Reduces the miss penalties of the pixel cache with a prefetching scheme
- "Design and Implementation of Low-Power 3D Graphics SoC for Mobile Multimedia Applications", Ramchan Woo, PhD thesis, KAIST '04
  - Implements a full 3D pipeline with texture mapping and special effects
## Research on 3D Graphics (2)

- "GraalBench: A 3D Graphics Benchmark Suite for Mobile Phones", Iosif Antochi et al., LCTES'04
  - A set of 3D graphics workloads representative of mobile devices
- "Power-Aware 3D Computer Graphics Rendering", Jeongseon Euh, Journal of VLSI Signal Processing '05
  - Low-power system based on Approximate Graphics Rendering (AGR)
  - Power savings are examined for two stages
    - Shading
    - Texture mapping

# Low-Power Techniques for Multiprocessors

- DVS techniques for multiprocessors
  - Slack reclamation
  - Condition-aware scheduling
  - dist-PID
- Power-aware parallelism optimization
  - Static & dynamic optimization
- Local memory management

## Slack Reclamation (1)

- Problem definition
  - Greedy slack reclamation
    - Any slack is used to reduce the speed of the next task on the same processor
    - It cannot guarantee the deadline
  - Example: 6 tasks on 2 processors
    - $\Gamma = \{T_i(\mathrm{WCET}, \mathrm{AET})\} = \{T_1(5,2),\ T_2(4,4),\ T_3(3,3),\ T_4(2,2),\ T_5(2,2),\ T_6(2,2)\}$, Deadline = 9
    - → $T_3$ uses up all of the slack, and **$T_6$ misses the deadline**

## Slack Reclamation (2)

- Shared slack reclamation [Zhu03]
  - Share the slack with other processors
    - Split the slack into multiple parts
  - Slack sharing example (see figure (b))
    - Slack1: two time units before $T_2$'s finish time (based on $T_2$'s WCET)
    - Slack2: one time unit after $T_2$'s finish time
    - → Share slack2 with $P_2$
  - All tasks meet their deadlines

# Condition-Aware Scheduling (1)

- Task scheduling for conditional task graphs
  - Conditional Task Graph (CTG)
    - Various task sequences depending on the conditions
  - Requires a power-aware scheduling technique that takes the conditions into account
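Returning to the slack reclamation example (6 tasks, 2 processors, deadline 9), the difference between greedy and shared reclamation can be reproduced with a small simulation. This is a minimal sketch, not the algorithm as specified in [Zhu03]: the function name `simulate`, the global-queue dispatch order, and the "swap estimated end times" rule used to model slack sharing are assumptions made for illustration.

```python
import heapq

# (WCET, AET) pairs from the slide, in global-queue order T1..T6.
TASKS = [(5, 2), (4, 4), (3, 3), (2, 2), (2, 2), (2, 2)]
DEADLINE = 9


def simulate(tasks, n_procs, share_slack):
    """Return the finish time of the last task (hypothetical helper).

    Each processor keeps an estimated end time (EET): when its current
    task would finish if it ran for its full allotted time.
    Slack = EET - actual finish time of the task that just completed.
    """
    eet = [0.0] * n_procs
    free = [(0.0, p) for p in range(n_procs)]  # (time processor is free, id)
    heapq.heapify(free)
    makespan = 0.0
    for wcet, aet in tasks:
        now, p = heapq.heappop(free)  # earliest-free processor takes the task
        if share_slack:
            # Assumed sharing rule: the free processor takes over the
            # earliest EET in the system, handing its own later EET
            # (and the slack behind it) to the other processor.
            q = min(range(n_procs), key=lambda i: eet[i])
            if eet[q] < eet[p]:
                eet[p], eet[q] = eet[q], eet[p]
        slack = max(eet[p] - now, 0.0)
        allotted = wcet + slack            # task is slowed to fill this time
        finish = now + aet * allotted / wcet
        eet[p] = now + allotted
        heapq.heappush(free, (finish, p))
        makespan = max(makespan, finish)
    return makespan


if __name__ == "__main__":
    greedy = simulate(TASKS, 2, share_slack=False)
    shared = simulate(TASKS, 2, share_slack=True)
    print("greedy:", greedy, "meets deadline:", greedy <= DEADLINE)
    print("shared:", shared, "meets deadline:", shared <= DEADLINE)
```

Under these assumptions the greedy run gives all three units of $T_1$'s slack to $T_3$ and finishes at time 10, while the shared run gives two units to $T_3$ and one to $T_4$ and finishes at time 9.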
## Condition-Aware Scheduling (2)

- Condition-aware scheduling [Shin03]
  - Step 1: Task ordering
    - Uses the schedule table: (start time, clock speed)
    - Depending on the condition value, each task has a different start time and clock speed

| Task     | true       | $c_1$        | $c_2$        | $c_3$      |
|----------|------------|--------------|--------------|------------|
| $\tau_0$ | (0, 0.5)   |              |              |            |
| $\tau_1$ | (10, 0.25) |              |              |            |
| $\tau_2$ | (10, 0.5)  |              |              |            |
| $\tau_3$ |            | (30, 0.39)   | (30, 0.39)   | (30, 0.38) |
| $\tau_4$ |            |              | (30, 0.42)   |            |
| $\tau_5$ |            |              |              | (30, 0.5)  |
| $\tau_6$ |            |              |              | (60, 0.5)  |
| $\tau_7$ |            | (68.6, 0.39) | (68.6, 0.39) | (80, 0.5)  |

# Condition-Aware Scheduling (3)

- Condition-aware scheduling [Shin03]
  - Step 2: Task stretching
    - Uses the probabilities of the conditions from profile information
    - Minimize $\sum E(\tau)\,\mathrm{Prob}(\tau)$
    - Optimize for high-probability conditions
      - e.g. $\mathrm{Prob}(c_1) \gg \mathrm{Prob}(c_2), \mathrm{Prob}(c_3)$

### DVFS in MPSoC (1)

- Local-DVFS
  - Decides the frequency of each processor using only local information
  - Does not use information from the other processors
  - The more tasks in the task queue, the higher the frequency
- Limitations of local-DVFS
  - If a processor runs at a lower frequency, it can hurt the performance of other processors that depend on it

### DVFS in MPSoC (2)

- dist-DVFS [Juang05]
  - Decides the frequency of each processor using global information
  - Operation steps
    - Estimate the future task queue occupancy
    - Identify the critical-path tasks (those with the highest queue occupancy)
    - Decide the frequency of each processor without hurting the performance of the critical-path tasks
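The contrast between the two DVFS policies above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the controller of [Juang05]: the names `local_dvfs`, `dist_dvfs`, and `depends_on`, the four frequency levels, the queue capacity of 8, and the use of the current queue length as the occupancy estimate are all hypothetical.

```python
def local_dvfs(queue_len, levels=(0.25, 0.5, 0.75, 1.0), queue_cap=8):
    """Local policy: the fuller this processor's own queue, the higher its level."""
    occupancy = min(queue_len, queue_cap) / queue_cap
    index = min(int(occupancy * len(levels)), len(levels) - 1)
    return levels[index]


def dist_dvfs(queue_lens, depends_on):
    """Toy global policy.

    depends_on[p] lists the processors whose results the tasks on
    processor p wait for (an assumed representation of the dependency).
    """
    # Start from the purely local decision for every processor.
    freqs = [local_dvfs(q) for q in queue_lens]
    # Treat the processor with the highest queue occupancy as critical.
    critical = max(range(len(queue_lens)), key=lambda p: queue_lens[p])
    # Do not let a processor the critical one depends on run slower than it.
    for producer in depends_on.get(critical, ()):
        freqs[producer] = max(freqs[producer], freqs[critical])
    return freqs


if __name__ == "__main__":
    queues = [1, 7, 2]          # processor 1 has the fullest queue
    deps = {1: [0]}             # tasks on processor 1 wait for processor 0
    print("local:", [local_dvfs(q) for q in queues])
    print("dist: ", dist_dvfs(queues, deps))
```

With these inputs the local policy leaves processor 0 at the lowest level because its own queue is nearly empty, while the global policy raises it to match the critical processor it feeds.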
### Optimizing Parallelism (2)

The number of processors that gives the best execution time for each loop nest (columns N1 to N9 are the loop nests of each benchmark):

| Benchmark Name | N1 | N2 | N3 | N4 | N5 | N6 | N7 | N8 | N9 |
|----------------|----|----|----|----|----|----|----|----|----|
| 3step-log      | 1  | 1  | 5  |    |    |    |    |    |    |
| adi            | 4  | 5  |    |    |    |    |    |    |    |
| aps            | 1  | 1  | 1  |    |    |    |    |    |    |
| bmcm           | 1  | 1  | 2  | 4  |    |    |    |    |    |
| btrix          | 2  | 1  | 7  | 6  | 1  | 3  | 8  |    |    |
| eflux          | 2  | 3  |    |    |    |    |    |    |    |
| full-search    | 2  | 2  | 6  |    |    |    |    |    |    |
| hier           | 1  | 1  | 3  | 3  | 2  | 1  | 5  |    |    |
| lms            | 2  | 1  | 2  | 2  |    |    |    |    |    |
| n-real-updates | 4  | 4  | 4  |    |    |    |    |    |    |
| parallel-hier  | 3  | 3  | 1  | 1  | 2  |    |    |    |    |
| tomcat         | 2  | 1  | 3  | 1  | 2  | 4  | 1  | 8  | 2  |
| tsf            | 1  | 7  | 2  | 4  |    |    |    |    |    |

- Most loop nests achieve their best execution time with only a small subset of the 8 processors
- This is a strong motivation for shutting down unused processors

### Optimizing Parallelism (3)

- Designing an effective parallelization strategy for an on-chip multiprocessor
  - Mechanism
    - Dynamic approach
      - The number of processors for each loop nest is decided at run time
    - Static approach
      - The number of processors for each loop nest is decided at compile time
  - Policy
    - Criterion for deciding the number of processors
      - Execution time, energy, and so on

### Optimizing Parallelism (4)

- Procedure
  - Determine the number of processors from the mechanism and policy
  - Insert activation/deactivation calls in the code
  - Optimize the code
- Optimization
  - The current active/idle status of the processors is maintained as much as possible
    - To minimize the overhead of turning processors on and off
  - Processors have to be pre-activated
    - If the processor will be used in the next loop
    - So as not to hurt performance
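The procedure above can be sketched as a small planner that turns per-loop-nest processor counts into activation/deactivation calls. This is a minimal sketch under assumptions: the function name `plan_power_calls`, the tuple format of the plan, the rule of always using the lowest-numbered processors (so that status changes stay small), and the assumption that all 8 processors are on at program start are illustrative choices, not taken from the slides. The input is the btrix row of the table.

```python
def plan_power_calls(procs_per_nest, total_procs=8):
    """Return a list of (position, action, processor_ids) tuples.

    Processors are numbered 0..total_procs-1 and a nest needing k
    processors always uses processors 0..k-1, so a processor keeps its
    current status whenever consecutive nests both need it (or both do not).
    """
    plan = []
    active = total_procs  # assumption: every processor is on at program start
    for i, need in enumerate(procs_per_nest, start=1):
        if need < active:
            # Shut down the processors this nest does not use.
            plan.append((f"before nest {i}", "deactivate",
                         list(range(need, active))))
        elif need > active:
            # Pre-activation: wake the extra processors while the previous
            # nest is still running, so the wake-up latency is hidden.
            plan.append((f"during nest {i - 1}", "pre-activate",
                         list(range(active, need))))
        active = need
    return plan


if __name__ == "__main__":
    btrix = [2, 1, 7, 6, 1, 3, 8]  # best processor count per loop nest
    for position, action, procs in plan_power_calls(btrix):
        print(f"{position}: {action} processors {procs}")
```

A real compiler pass would also weigh the on/off overhead against the idle energy saved before inserting each call; the sketch only shows where the calls would go.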
### Runtime Code Parallelization (1)

- A run-time strategy for determining the best number of processors to use [Kandemir03]
  - Dynamic mechanism
  - Aims to minimize energy and execution time
  - Needs some help from the H/W and the compiler

### Runtime Code Parallelization (2)

- Parallelization based on training
  - Each dot represents an iteration
  - Training period
    - Find the optimal number of processors
  - After training, the remaining iterations use the number of processors determined
- Extra optimization
  - Minimize the training iterations based on history
  - Utilize the past history to avoid redundant training

### Local Memory Management (1)

[Figures: latency of memory access; target MPSoC architecture]

- Frequent off-chip memory access can be very costly from both performance and energy perspectives
- A local memory management scheme is proposed to reduce this cost [Chen05]

### Local Memory Management (2)

- The access pattern of each data block is analyzed by the compiler
- Software-managed memory is used
- When a data block is stored in the local memory of a processor:
  - Keep the block in the local memory if it is predicted to be used by another processor,
  - even if it is predicted not to be used any more by the processor itself

### References

- [Chen05] Guilin Chen, Guangyu Chen, Ozcan Ozturk and Mahmut Kandemir, "Exploiting Inter-Processor Data Sharing for Improving Behavior of Multi-Processor SoCs", In Proc. of Annual Symposium on VLSI, 2005.
- [Juang05] Philo Juang and Qiang Wu, "Coordinated, Distributed, Formal Energy Management of Chip Multiprocessors", In Proc. of ISLPED, 2005.
- [Kadayif05] Ismail Kadayif, Mahmut Kandemir, Guilin Chen and Ozcan Ozturk, "Optimizing Array-Intensive Applications for On-Chip Multiprocessors", IEEE Trans. on Parallel and Distributed Systems, 2005.
- [Kandemir03] M. Kandemir, W. Zhang and M. Karakoy, "Runtime Code Parallelization for On-Chip Multiprocessors", In Proc. of DATE, 2003.
- [Shin03] Dongkun Shin and Jihong Kim, "Power-Aware Scheduling of Conditional Task Graphs in Real-Time Multiprocessor Systems", In Proc. of ISLPED, 2003.
- [Zhu03] Dakai Zhu, Rami Melhem and Bruce R. Childers, "Scheduling with Dynamic Voltage/Speed Adjustment Using Slack Reclamation in Multiprocessor Real-Time Systems", IEEE Trans. on Parallel and Distributed Systems, Vol. 14, No. 7, July 2003.

# Leakage Current

(source: Kim et al., IEEE Computer, Dec. 2003)

# Subthreshold Leakage, $I_{\text{sub}}$

- $I_{\text{sub}} \sim e^{-V_t/V_a}\,\bigl(1 - e^{-V/V_a}\bigr)$, where $V_t$ is the threshold voltage, $V$ is the supply voltage, and $V_a$ is the thermal voltage
- How to reduce $I_{\text{sub}}$
  - Turn off the supply voltage
    - (-) loss of state
  - Increase the threshold voltage
    - (-) loss of performance

# **Leakage Power Reduction**

- State-Destructive vs. State-Preserving
- Application-Sensitive vs. Application-Insensitive

# Dynamic Resizing of Instruction Cache

# Gated-Vdd

State Destructive, Application Insensitive

# Drowsy Cache

[Flautner, 2002]

[Figure: drowsy cache line circuit, showing the drowsy bit, voltage controller, drowsy (set) signal, word line driver, word line gate, row decoder, SRAM cells, and the VDD (1 V) and VDDLow (0.3 V) power lines]

State Preserving, Application Insensitive

# Compiler-Directed Approach

[Zhang, MICRO-35]
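Returning to the subthreshold leakage relation, the two reduction options (turn off the supply, raise the threshold) can be checked numerically. This is a rough sketch under assumptions: it evaluates only the proportionality as written on the slide, with no device constants or slope factor, and the function names, the 300 K temperature, and the example voltages are chosen for illustration.

```python
import math

K_B = 1.380649e-23     # Boltzmann constant, J/K
Q_E = 1.602176634e-19  # elementary charge, C


def thermal_voltage(temp_k=300.0):
    """Thermal voltage Va = kT/q, about 26 mV at room temperature."""
    return K_B * temp_k / Q_E


def relative_isub(vt, v, temp_k=300.0):
    """Relative subthreshold leakage: exp(-Vt/Va) * (1 - exp(-V/Va)).

    Only ratios between two operating points are meaningful, because
    the proportionality constant is omitted.
    """
    va = thermal_voltage(temp_k)
    return math.exp(-vt / va) * (1.0 - math.exp(-v / va))


if __name__ == "__main__":
    base = relative_isub(vt=0.3, v=1.0)
    high_vt = relative_isub(vt=0.4, v=1.0)   # raise the threshold by 0.1 V
    gated = relative_isub(vt=0.3, v=0.0)     # supply turned off
    print("leakage reduction from +0.1 V threshold:", base / high_vt)
    print("relative leakage with supply off:", gated)
```

In this simplified model the leakage falls exponentially with the threshold voltage and vanishes when the supply is gated, which matches the two options and their costs (lost performance, lost state) listed on the slide.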