#### COMPUTER ORGANIZATION AND DESIGN



The Hardware/Software Interface

# Chapter 1

# Computer Abstractions and Technology

#### Goals

- Understand the "how" and "why" of computer system organization
  - Instruction Set Architecture (ISA)
  - System Organization (processor, memory, I/O)
  - Microarchitecture: detailed internal architecture of a processor
  - Memory hierarchy
- Learn methods of evaluating performance
  - Metrics & benchmarks
- Learn how to make systems go fast
  - Pipelining, caching, branch prediction
  - Parallelism (ILP, DLP, TLP)

### Organization and Architecture

- Architecture: attributes visible to a programmer. (ISA)
- Organization: operational units and their interconnection that realize the architecture specification. (microarchitecture)
  - An architectural issue: support a multiply instruction or not.
  - An organizational issue: implemented by a multiply unit or with repeated use of the add unit.
- The organizational decision may be based on
  - the frequency of use of the multiply instruction,
  - the relative speed of the two approaches, and
  - the cost and physical size of a multiply unit.
- An architecture may survive many years, but its organization changes with changing technology.

### Logistics

- Lectures: Tu/Th 11:00 am - 12:15 pm, Room 301-104
- Instructor: Soo-Ik Chae, Tu/Th 2:00 - 3:00 pm, 880-5457, chae@snu.ac.kr
- TA: JiCheon Kim, 010-9308-1267, modesty@sdgroup.snu.ac.kr
- Texts: Hennessy & Patterson, Computer Organization and Design (Fourth Edition)
  - Revised Fourth Edition preferred, not required

### **Abstraction layers**

- Specification: compute the Fibonacci sequence
- Program
- ISA (Instruction Set Architecture): `load r1, a[i]; add r2, r2, r1;`
- microArchitecture
- Logic
- Transistors

Arch vs. µarch: the ISA layer marks the boundary between the two.




### **Topics**

- Technology Trends
- Instruction set architectures
- Pipelining
- Modern pipelined architectures
  - Dynamic ILP machines
  - Static ILP machines
- Cache memory systems
- Virtual memory
- I/O devices
- Multiprocessors
- Computer system implementation

#### What is Computer Architecture?



### **The Computer Revolution**

- Progress in computer technology
  - Underpinned by Moore's Law
- Makes novel applications feasible
  - Computers in automobiles
  - Cell phones
  - Human genome project
  - World Wide Web
  - Search Engines
- Computers are pervasive

### **Classes of Computers**

- Desktop computers
  - General purpose, variety of software
  - Subject to cost/performance tradeoff
- Server computers
  - Network based
  - High capacity, performance, reliability
  - Range from small servers to building sized
- Embedded computers
  - Hidden as components of systems
  - Stringent power/performance/cost constraints

#### **The Processor Market**



#### What You Will Learn

- How programs are translated into the machine language
  - And how the hardware executes them
- The hardware/software interface
- What determines program performance
  - And how it can be improved
- How hardware designers improve performance
- What is parallel processing

# **Understanding Performance**

- Algorithm
  - Determines number of operations executed
- Programming language, compiler, architecture
  - Determine number of machine instructions executed per operation
- Processor and memory system
  - Determine how fast instructions are executed
- I/O system (including OS)
  - Determines how fast I/O operations are executed

### **Below Your Program**



- Application software
  - Written in high-level language
- System software
  - Compiler: translates HLL code to machine code
  - Operating System: service code
    - Handling input/output
    - Allocating memory and storage
    - Scheduling tasks & sharing resources
- Hardware
  - Processor, memory, I/O controllers



### **Levels of Program Code**

- High-level language
  - Level of abstraction closer to problem domain
  - Provides for productivity and portability
- Assembly language
  - Textual representation of instructions
- Machine language
  - Binary digits (bits)
  - Encoded instructions and data

High-level language program (in C)

Assembly language program (for MIPS)





Binary machine language program (for MIPS)

### Components of a Computer

#### **The BIG Picture**



- Same components for all kinds of computer
  - Desktop, server, embedded
- Input/output includes
  - User-interface devices
    - Display, keyboard, mouse
  - Storage devices
    - Hard disk, CD/DVD, flash
  - Network adapters
    - For communicating with other computers

Five components: input, output, memory, datapath, control

### **Anatomy of a Computer**



#### **Anatomy of a Mouse**

- Optical mouse
  - LED illuminates desktop
  - Small low-res camera (1500 fps)
  - Basic image processor (18 MIPS)
    - Looks for x, y movement
  - Buttons & wheel
- Supersedes roller-ball mechanical mouse





# Through the Looking Glass

- LCD screen: picture elements (pixels)
  - Mirrors content of frame buffer memory



# **Opening the Box**






### Inside the Processor (CPU)

- Datapath: performs operations on data
- Control: sequences datapath, memory, ...
- Cache memory
  - Small fast SRAM memory for immediate access to data

#### **Inside the Processor**

AMD Barcelona: 4 processor cores





#### **Abstractions**

**The BIG Picture** 

- Abstraction helps us deal with complexity
  - Hide lower-level detail
- Instruction set architecture (ISA)
  - The hardware/software interface
- Application binary interface
  - ISA plus operating system interfaces for doing I/O, allocating memory, and low-level system functions
- Implementation
  - The details underlying an interface

#### A Safe Place for Data

- Volatile main memory
  - Loses instructions and data when power off
  - DRAM
- Non-volatile secondary memory
  - Magnetic disk
  - Flash memory
  - Optical disk (CDROM, DVD)









#### **Networks**

- Communication and resource sharing
- Local area network (LAN): Ethernet
  - Within a building
- Wide area network (WAN): the Internet
- Wireless network: WiFi, Bluetooth





### **Technology Trends**

- Electronics technology continues to evolve
  - Increased capacity and performance
  - Reduced cost



DRAM capacity

| Year | Technology                 | Relative performance/cost |
|------|----------------------------|---------------------------|
| 1951 | Vacuum tube                | 1                         |
| 1965 | Transistor                 | 35                        |
| 1975 | Integrated circuit (IC)    | 900                       |
| 1995 | Very large scale IC (VLSI) | 2,400,000                 |
| 2005 | Ultra large scale IC       | 6,200,000,000             |

### **Defining Performance**

Which airplane has the best performance?









#### Response Time and Throughput

- Response time
  - How long it takes to do a task
- Throughput
  - Total work done per unit time
    - e.g., tasks/transactions/... per hour
- How are response time and throughput affected by
  - Replacing the processor with a faster version?
  - Adding more processors?
- We'll focus on response time for now...

#### **Relative Performance**



- Define Performance = 1/Execution Time
- "X is n times faster than Y"

$$\frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = n$$

- Example: time taken to run a program
  - 10s on A, 15s on B
  - Execution Time<sub>B</sub> / Execution Time<sub>A</sub>
     = 15s / 10s = 1.5
  - So A is 1.5 times faster than B
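
The example above can be checked with a few lines of Python. This is a minimal sketch; the function name `relative_performance` is an illustrative choice and does not come from the slides.

```python
def relative_performance(time_x: float, time_y: float) -> float:
    """Return n such that 'X is n times faster than Y'.

    Performance is defined as 1 / execution time, so
    Performance_X / Performance_Y = time_y / time_x.
    """
    if time_x <= 0 or time_y <= 0:
        raise ValueError("execution times must be positive")
    return time_y / time_x


# Values from the example: 10 s on A, 15 s on B.
n = relative_performance(time_x=10.0, time_y=15.0)
print(f"A is {n:.1f} times faster than B")  # prints: A is 1.5 times faster than B
```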

### **Measuring Execution Time**

- Elapsed time
  - Total response time, including all aspects
    - Processing, I/O, OS overhead, idle time
  - Determines system performance
- CPU time
  - Time spent processing a given job
    - Discounts I/O time, other jobs' shares
  - Comprises user CPU time and system CPU time
  - Different programs are affected differently by CPU and system performance

# **CPU Clocking**

- Operation of digital hardware is governed by a constant-rate clock



- Clock period: duration of a clock cycle
  - e.g.,  $250ps = 0.25ns = 250 \times 10^{-12}s$
- Clock frequency (rate): cycles per second
  - e.g.,  $4.0GHz = 4000MHz = 4.0 \times 10^9Hz$
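
Clock period and clock rate are reciprocals of each other. The short sketch below converts between them using the two example values; the helper names are assumptions made for illustration.

```python
def period_to_rate(period_s: float) -> float:
    """Clock rate in Hz for a clock period given in seconds."""
    return 1.0 / period_s


def rate_to_period(rate_hz: float) -> float:
    """Clock period in seconds for a clock rate given in Hz."""
    return 1.0 / rate_hz


rate = period_to_rate(250e-12)   # 250 ps period
period = rate_to_period(4.0e9)   # 4.0 GHz rate
print(f"{rate / 1e9:.1f} GHz")   # prints: 4.0 GHz
print(f"{period * 1e12:.0f} ps") # prints: 250 ps
```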

#### **CPU Time**

$$\begin{aligned} \text{CPU Time} &= \text{CPU Clock Cycles} \times \text{Clock Cycle Time} \\ &= \frac{\text{CPU Clock Cycles}}{\text{Clock Rate}} \end{aligned}$$

- Performance improved by
  - Reducing the number of clock cycles
  - Increasing the clock rate (a hardware decision)
  - Hardware designers must often trade off clock rate against cycle count
    - That is, a tradeoff between clock cycle count and clock period

# **CPU Time Example**

- Computer A: 2GHz clock, 10s CPU time
- Designing Computer B
  - Aim for 6s CPU time
  - Can do faster clock, but causes 1.2 × clock cycles
- How fast must Computer B clock be?

$$\begin{aligned} \text{Clock Rate}_{\text{B}} &= \frac{\text{Clock Cycles}_{\text{B}}}{\text{CPU Time}_{\text{B}}} = \frac{1.2 \times \text{Clock Cycles}_{\text{A}}}{6\text{s}} \\ \text{Clock Cycles}_{\text{A}} &= \text{CPU Time}_{\text{A}} \times \text{Clock Rate}_{\text{A}} \\ &= 10\text{s} \times 2\text{GHz} = 20 \times 10^9 \\ \text{Clock Rate}_{\text{B}} &= \frac{1.2 \times 20 \times 10^9}{6\text{s}} = \frac{24 \times 10^9}{6\text{s}} = 4\text{GHz} \end{aligned}$$
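
The same calculation can be written as a short Python sketch. The function names are illustrative assumptions; the numbers are those of the example.

```python
def clock_cycles(cpu_time_s: float, clock_rate_hz: float) -> float:
    """Clock cycles = CPU time x clock rate."""
    return cpu_time_s * clock_rate_hz


def required_clock_rate(cycles: float, target_cpu_time_s: float) -> float:
    """Clock rate needed to finish `cycles` cycles in the target time."""
    return cycles / target_cpu_time_s


cycles_a = clock_cycles(cpu_time_s=10.0, clock_rate_hz=2e9)  # 20 x 10^9 cycles
cycles_b = 1.2 * cycles_a                                    # faster clock costs 1.2x cycles
rate_b = required_clock_rate(cycles_b, target_cpu_time_s=6.0)
print(f"Computer B needs a {rate_b / 1e9:.1f} GHz clock")    # prints: ... 4.0 GHz clock
```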

#### Instruction Count and CPI

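$$\begin{aligned} \text{Clock Cycles} &= \text{Instruction Count} \times \text{Cycles per Instruction} \\ \text{CPU Time} &= \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time} \\ &= \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}} \end{aligned}$$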

- Instruction Count for a program
  - Determined by program, ISA and compiler
- Average cycles per instruction (CPI)
  - Determined by CPU hardware
  - If different instructions have different CPI
    - Average CPI affected by instruction mix

#### **CPI Example**

- Computer A: Cycle Time = 250ps, CPI = 2.0
- Computer B: Cycle Time = 500ps, CPI = 1.2
- Same ISA
- Which is faster, and by how much?

$$\begin{aligned} \text{CPU Time}_{A} &= \text{Instruction Count} \times \text{CPI}_{A} \times \text{Cycle Time}_{A} \\ &= I \times 2.0 \times 250\text{ps} = I \times 500\text{ps} && \text{A is faster...} \\ \text{CPU Time}_{B} &= \text{Instruction Count} \times \text{CPI}_{B} \times \text{Cycle Time}_{B} \\ &= I \times 1.2 \times 500\text{ps} = I \times 600\text{ps} \\ \frac{\text{CPU Time}_{B}}{\text{CPU Time}_{A}} &= \frac{I \times 600\text{ps}}{I \times 500\text{ps}} = 1.2 && \text{...by this much} \end{aligned}$$
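
As a quick check of this comparison, the sketch below evaluates CPU Time = Instruction Count × CPI × Cycle Time for both machines. Because both run the same ISA, the instruction count cancels in the ratio; the value 1 used for it here is an arbitrary placeholder, and the function name is an assumption.

```python
def cpu_time_ps(instruction_count: float, cpi: float, cycle_time_ps: float) -> float:
    """CPU time in picoseconds: instruction count x CPI x cycle time."""
    return instruction_count * cpi * cycle_time_ps


I = 1.0  # placeholder instruction count; it cancels in the ratio below
time_a = cpu_time_ps(I, cpi=2.0, cycle_time_ps=250.0)  # I x 500 ps
time_b = cpu_time_ps(I, cpi=1.2, cycle_time_ps=500.0)  # I x 600 ps

faster = "A" if time_a < time_b else "B"
ratio = max(time_a, time_b) / min(time_a, time_b)
print(f"{faster} is faster by a factor of {ratio:.1f}")  # prints: A is faster by a factor of 1.2
```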

#### **CPI in More Detail**

- If different instruction classes take different numbers of cycles

$$Clock\ Cycles = \sum_{i=1}^{n} (CPI_{i} \times Instruction\ Count_{i})$$

- Weighted average CPI

$$CPI = \frac{Clock \ Cycles}{Instruction \ Count} = \sum_{i=1}^{n} \left( CPI_i \times \frac{Instruction \ Count_i}{Instruction \ Count} \right)$$

- The ratio $Instruction \ Count_i / Instruction \ Count$ is the relative frequency of instruction class $i$

### **CPI Example**

- Alternative compiled code sequences using instructions in classes A, B, C

| Class            | A | B | C |
|------------------|---|---|---|
| CPI for class    | 1 | 2 | 3 |
| IC in sequence 1 | 2 | 1 | 2 |
| IC in sequence 2 | 4 | 1 | 1 |

- Sequence 1: IC = 5
  - Clock Cycles = 2×1 + 1×2 + 2×3 = 10
  - Avg. CPI = 10/5 = 2.0
- Sequence 2: IC = 6
  - Clock Cycles = 4×1 + 1×2 + 1×3 = 9
  - Avg. CPI = 9/6 = 1.5
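
The two sequences can be compared with a small Python sketch of the weighted-average CPI formula. The dictionary layout and function names are illustrative assumptions; the per-class CPI values and instruction counts come from the table.

```python
CPI_BY_CLASS = {"A": 1, "B": 2, "C": 3}  # cycles per instruction for each class


def clock_cycles(counts: dict) -> int:
    """Sum of CPI_i x Instruction Count_i over all instruction classes."""
    return sum(CPI_BY_CLASS[cls] * n for cls, n in counts.items())


def average_cpi(counts: dict) -> float:
    """Total clock cycles divided by total instruction count."""
    return clock_cycles(counts) / sum(counts.values())


sequence_1 = {"A": 2, "B": 1, "C": 2}
sequence_2 = {"A": 4, "B": 1, "C": 1}

for name, seq in (("Sequence 1", sequence_1), ("Sequence 2", sequence_2)):
    print(name, "IC =", sum(seq.values()),
          "cycles =", clock_cycles(seq),
          "avg CPI =", average_cpi(seq))
```

Sequence 2 executes more instructions yet needs fewer clock cycles, which is why instruction count alone is not a reliable indicator of execution time.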

# **Performance Summary**

#### **The BIG Picture**

$$CPU \, Time = \frac{Instructions}{Program} \times \frac{Clock \, cycles}{Instruction} \times \frac{Seconds}{Clock \, cycle}$$

- Performance depends on (page 38)
  - Algorithm: affects IC, possibly CPI
  - Programming language: affects IC, CPI
  - Compiler: affects IC, CPI
  - Instruction set architecture: affects IC, CPI, Tc

#### **Power Trends**



In CMOS IC technology

$$\text{Power} = \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency}$$



# **Reducing Power**

- Suppose a new CPU has
  - 85% of capacitive load of old CPU
  - 15% voltage and 15% frequency reduction

$$\frac{P_{\text{new}}}{P_{\text{old}}} = \frac{C_{\text{old}} \times 0.85 \times (V_{\text{old}} \times 0.85)^2 \times F_{\text{old}} \times 0.85}{C_{\text{old}} \times V_{\text{old}}^2 \times F_{\text{old}}} = 0.85^4 = 0.52$$
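
The ratio above depends only on the three scaling factors, since the old values cancel. A minimal Python sketch, with a function name chosen for illustration:

```python
def power_ratio(cap_scale: float, volt_scale: float, freq_scale: float) -> float:
    """P_new / P_old when power scales as C x V^2 x F.

    Each argument is the new value divided by the old value.
    """
    return cap_scale * volt_scale ** 2 * freq_scale


# 85% of the capacitive load, 15% lower voltage, 15% lower frequency.
ratio = power_ratio(cap_scale=0.85, volt_scale=0.85, freq_scale=0.85)
print(f"P_new / P_old = {ratio:.2f}")  # prints: P_new / P_old = 0.52
```

Voltage enters squared, so lowering it has the largest effect per percentage point.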

- The power wall
  - We can't reduce voltage further
  - We can't remove more heat
- How else can we improve performance?

# **Uniprocessor Performance**



# Multiprocessors

- Multicore microprocessors
  - More than one processor per chip
- Requires explicitly parallel programming
  - Compare with instruction level parallelism
    - Hardware executes multiple instructions at once
    - Hidden from the programmer
  - Hard to do
    - Programming for performance
    - Load balancing
    - Optimizing communication and synchronization

# 2008 multicore microprocessors

- Processor
- Microprocessor
- Core: processor

| Product              | AMD Opteron X4 (Barcelona) | Intel Nehalem | IBM Power 6 | Sun Ultra SPARC T2 (Niagara 2) |
|----------------------|----------------------------|---------------|-------------|--------------------------------|
| Cores per chip       | 4                          | 4             | 2           | 8                              |
| Clock rate           | 2.5 GHz                    | ~ 2.5 GHz?    | 4.7 GHz     | 1.4 GHz                        |
| Microprocessor power | 120 W                      | ~ 100 W?      | ~ 100 W?    | 94 W                           |

# Manufacturing ICs



Yield: proportion of working dies per wafer

## **AMD Opteron X2 Wafer**



- X2: 300mm wafer, 117 chips, 90nm technology
- X4: 45nm technology

# **Integrated Circuit Cost**

$$\text{Cost per die} = \frac{\text{Cost per wafer}}{\text{Dies per wafer} \times \text{Yield}}$$

$$\text{Dies per wafer} \approx \frac{\text{Wafer area}}{\text{Die area}}$$

$$\text{Yield} = \frac{1}{(1 + (\text{Defects per area} \times \text{Die area}/2))^2}$$

- Nonlinear relation to area and defect rate
  - Wafer cost and area are fixed
  - Defect rate determined by manufacturing process
  - Die area determined by architecture and circuit design
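
The nonlinear dependence on die area can be seen by evaluating the three formulas for a few die sizes. The sketch below is illustrative only: the wafer cost of 5000 and the defect density of 0.002 per mm² are hypothetical values, not figures from the slides; only the 300 mm wafer diameter is taken from the Opteron X2 slide.

```python
import math


def dies_per_wafer(wafer_area: float, die_area: float) -> float:
    """Approximate die count: wafer area / die area (ignores edge losses)."""
    return wafer_area / die_area


def die_yield(defects_per_area: float, die_area: float) -> float:
    """Yield = 1 / (1 + defects_per_area x die_area / 2)^2."""
    return 1.0 / (1.0 + defects_per_area * die_area / 2.0) ** 2


def cost_per_die(cost_per_wafer: float, wafer_area: float,
                 die_area: float, defects_per_area: float) -> float:
    """Cost per wafer divided by the expected number of good dies."""
    good_dies = dies_per_wafer(wafer_area, die_area) * die_yield(defects_per_area, die_area)
    return cost_per_wafer / good_dies


WAFER_AREA_MM2 = math.pi * (300.0 / 2) ** 2  # 300 mm diameter wafer
WAFER_COST = 5000.0                          # hypothetical cost per wafer
DEFECTS_PER_MM2 = 0.002                      # hypothetical defect density

for area in (50.0, 100.0, 200.0):            # hypothetical die areas in mm^2
    cost = cost_per_die(WAFER_COST, WAFER_AREA_MM2, area, DEFECTS_PER_MM2)
    print(f"die area {area:5.0f} mm^2: yield {die_yield(DEFECTS_PER_MM2, area):.3f}, cost {cost:.2f}")
```

Doubling the die area more than doubles the cost per die, because fewer dies fit on the wafer and a smaller fraction of them work.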

#### **SPEC CPU Benchmark**

- Programs used to measure performance
  - Supposedly typical of actual workload
- Standard Performance Evaluation Corp (SPEC)
  - Develops benchmarks for CPU, I/O, Web, ...
- SPEC CPU2006
  - Elapsed time to execute a selection of programs
    - Negligible I/O, so focuses on CPU performance
  - Normalize relative to reference machine
  - Summarize as geometric mean of performance ratios
    - CINT2006 (integer: 12 benchmarks)
    - CFP2006 (floating-point: 17 benchmarks)



## CINT2006 for Opteron X4 2356

| Name           | Description                   | IC × 10⁹ | CPI   | Tc (ns) | Exec time (s) | Ref time (s) | SPECratio |
|----------------|-------------------------------|----------|-------|---------|---------------|--------------|-----------|
| perl           | Interpreted string processing | 2,118    | 0.75  | 0.40    | 637           | 9,777        | 15.3      |
| bzip2          | Block-sorting compression     | 2,389    | 0.85  | 0.40    | 817           | 9,650        | 11.8      |
| gcc            | GNU C Compiler                | 1,050    | 1.72  | 0.40    | 724           | 8,050        | 11.1      |
| mcf            | Combinatorial optimization    | 336      | 10.00 | 0.40    | 1,345         | 9,120        | 6.8       |
| go             | Go game (AI)                  | 1,658    | 1.09  | 0.40    | 721           | 10,490       | 14.6      |
| hmmer          | Search gene sequence          | 2,783    | 0.80  | 0.40    | 890           | 9,330        | 10.5      |
| sjeng          | Chess game (AI)               | 2,176    | 0.96  | 0.40    | 837           | 12,100       | 14.5      |
| libquantum     | Quantum computer simulation   | 1,623    | 1.61  | 0.40    | 1,047         | 20,720       | 19.8      |
| h264avc        | Video compression             | 3,102    | 0.80  | 0.40    | 993           | 22,130       | 22.3      |
| omnetpp        | Discrete event simulation     | 587      | 2.94  | 0.40    | 690           | 6,250        | 9.1       |
| astar          | Games/path finding            | 1,082    | 1.79  | 0.40    | 773           | 7,020        | 9.1       |
| xalancbmk      | XML parsing                   | 1,058    | 2.70  | 0.40    | 1,143         | 6,900        | 6.0       |
| Geometric mean |                               |          |       |         |               |              | 11.7      |

High cache miss rates
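
The summary figure in the table can be reproduced from the SPECratio column. The sketch below assumes SPECratio = reference time / execution time, which matches the rows of the table; the function names are illustrative.

```python
import math


def spec_ratio(ref_time_s: float, exec_time_s: float) -> float:
    """Performance ratio relative to the reference machine."""
    return ref_time_s / exec_time_s


def geometric_mean(values) -> float:
    """n-th root of the product, computed via logarithms for stability."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


# SPECratio column of the CINT2006 table, in row order (perl ... xalancbmk).
spec_ratios = [15.3, 11.8, 11.1, 6.8, 14.6, 10.5, 14.5, 19.8, 22.3, 9.1, 9.1, 6.0]

print(f"perl SPECratio: {spec_ratio(9777, 637):.1f}")
print(f"Geometric mean: {geometric_mean(spec_ratios):.1f}")
```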

# **Averaging**

- Arithmetic mean (AM) ≥ Geometric mean (GM) ≥ Harmonic mean (HM)
- Geometric mean: for normalized ratios
- Harmonic mean: for rates and ratios
- The geometric mean is the only correct mean when averaging normalized results
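
A small experiment illustrates both points. The execution times below are hypothetical, chosen only to make the effect visible: when results are normalized, the arithmetic mean gives a verdict that depends on which machine is the reference, while the geometric mean does not.

```python
import statistics

# Hypothetical execution times (seconds) of two programs on two machines.
times_a = [1.0, 1000.0]
times_b = [10.0, 100.0]

# Normalize each machine's times to the other machine as reference.
b_rel_a = [b / a for a, b in zip(times_a, times_b)]  # [10.0, 0.1]
a_rel_b = [a / b for a, b in zip(times_a, times_b)]  # [0.1, 10.0]

# Arithmetic mean: each machine appears about 5x slower than the other.
print(statistics.mean(b_rel_a), statistics.mean(a_rel_b))

# Geometric mean: both normalizations agree that the machines are equal.
print(statistics.geometric_mean(b_rel_a), statistics.geometric_mean(a_rel_b))

# AM >= GM >= HM for any set of positive values.
sample = [1.0, 2.0, 4.0]
am = statistics.mean(sample)
gm = statistics.geometric_mean(sample)
hm = statistics.harmonic_mean(sample)
print(am, gm, hm)
```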

### **SPEC Power Benchmark**

- Power consumption of server at different workload levels
  - Performance: ssj\_ops/sec
  - Power: Watts (Joules/sec)

$$\text{Overall ssj\_ops per Watt} = \frac{\sum_{i=0}^{10} \text{ssj\_ops}_i}{\sum_{i=0}^{10} \text{power}_i}$$

# SPECpower\_ssj2008 for X4

| Target Load %     | Performance (ssj_ops/sec) | Average Power (Watts) |
|-------------------|---------------------------|-----------------------|
| 100%              | 231,867                   | 295                   |
| 90%               | 211,282                   | 286                   |
| 80%               | 185,803                   | 275                   |
| 70%               | 163,427                   | 265                   |
| 60%               | 140,160                   | 256                   |
| 50%               | 118,324                   | 246                   |
| 40%               | 92,035                    | 233                   |
| 30%               | 70,500                    | 222                   |
| 20%               | 47,126                    | 206                   |
| 10%               | 23,066                    | 180                   |
| 0%                | 0                         | 141                   |
| Overall sum       | 1,283,590                 | 2,605                 |
| ∑ssj_ops / ∑power |                           | 493                   |
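
The overall figure follows directly from the two columns. In the sketch below the data layout is an illustrative choice, and the 40% load entry is read as 92,035, the value consistent with the stated overall sum.

```python
# (target load %, ssj_ops/sec, average power in watts) from the table above.
MEASUREMENTS = [
    (100, 231_867, 295),
    (90, 211_282, 286),
    (80, 185_803, 275),
    (70, 163_427, 265),
    (60, 140_160, 256),
    (50, 118_324, 246),
    (40, 92_035, 233),   # read as 92,035 to match the overall sum
    (30, 70_500, 222),
    (20, 47_126, 206),
    (10, 23_066, 180),
    (0, 0, 141),
]

total_ops = sum(ops for _, ops, _ in MEASUREMENTS)
total_power = sum(watts for _, _, watts in MEASUREMENTS)
overall = total_ops / total_power

print(total_ops, total_power)                         # prints: 1283590 2605
print(f"Overall ssj_ops per Watt: {overall:.0f}")     # prints: Overall ssj_ops per Watt: 493
```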

## **Computer Elements**

- Transistors (computing)
  - How can they be connected to do something useful?
  - How do we evaluate how fast a logic block is?
- Wires (transporting)
  - What and where are they?
  - How can they be modeled?
- Memories (storing)
  - SRAM vs. DRAM

### **Transistor As a Switch**

Ideal Voltage Controlled Switch

- Three terminals
  - Gate
  - Drain
  - Source



$$V_G = 0$$

$$V_G = V_{dd}$$

# **Abstractions in Logic Design**

- In physical world
  - Voltages, Currents
  - Electron flow
- In logical world abstraction
  - $V < V_{lo} \Rightarrow$ "0" = FALSE
  - $V > V_{hi} \Rightarrow$ "1" = TRUE
  - In between: forbidden
- Simplify design problem
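
The thresholding rule can be written as a small function. This is only a sketch: the threshold values `V_LO` and `V_HI` are hypothetical and are not taken from the slides; a real logic family specifies its own levels.

```python
from typing import Optional

V_LO = 0.8  # hypothetical low threshold, in volts
V_HI = 2.0  # hypothetical high threshold, in volts


def logic_value(voltage: float) -> Optional[bool]:
    """Map a voltage to a logic value.

    Returns False for "0", True for "1", and None for the forbidden
    region between the two thresholds.
    """
    if voltage < V_LO:
        return False
    if voltage > V_HI:
        return True
    return None


for v in (0.2, 1.5, 3.0):
    print(v, logic_value(v))
```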



# What Is a Digital System?

- A digital system is a data technology that uses discrete (discontinuous) values.
- The description of a digital system...
  - Can be completely accurate
  - AND completely precise (no Heisenberg here)
- The world is messy and analog; it is converted to digital through careful engineering
  - Some inherent problems remain, e.g., metastability (see Appendix C)

# **Composition of Transistors**

- Logic Gates
  - Inverters, AND, OR, arbitrary functions



- Buffers (drive large capacitances, long wires, etc.)
- Memory elements
  - Latches, registers, SRAM, DRAM

## **Basic Components: CMOS Logic Gates**

#### **NAND Gate**





#### **NOR Gate**






# **Clocking and Clocked Elements**

- Typical Clock
  - 1Hz = 1 cycle per second



 Transparent Latch



Edge-Triggered Flip-Flop



# Storage Element's Timing Model



- Setup Time: Input must be stable BEFORE the trigger clock edge
- Hold Time: Input must REMAIN stable after the trigger clock edge
- Clock-to-Q time:
  - Output cannot change instantaneously at the trigger clock edge
  - Similar to delay in logic gates, two components:
    - Internal Clock-to-Q
    - Load dependent Clock-to-Q

# **Clocking Methodology**



- All storage elements are clocked by the same clock edge
- The combinational logic block's:
  - Inputs are updated at each clock tick
  - All outputs MUST be stable before the next clock tick

# Critical Path & Cycle Time



- Critical path: the slowest path between any two storage devices
- Cycle time is a function of the critical path; it must be greater than:
  - Clock-to-Q + Longest Path through the Combinational Logic + Setup
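
The cycle-time bound can be sketched as a one-line calculation. All delay values below are hypothetical, chosen only to illustrate the formula, and the function name is an assumption.

```python
def min_cycle_time_ps(clk_to_q_ps: float, path_delays_ps, setup_ps: float) -> float:
    """Lower bound on the clock period.

    Clock-to-Q delay + longest combinational path + setup time.
    """
    return clk_to_q_ps + max(path_delays_ps) + setup_ps


# Hypothetical delays in picoseconds for three register-to-register paths.
paths = [300, 420, 180]
cycle = min_cycle_time_ps(clk_to_q_ps=50, path_delays_ps=paths, setup_ps=30)
max_rate_ghz = 1000.0 / cycle  # a 1000 ps period corresponds to 1 GHz

print(f"Cycle time must exceed {cycle} ps (clock rate below {max_rate_ghz:.1f} GHz)")
```

Only the slowest path matters: shortening the 300 ps or 180 ps path would not allow a faster clock.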

#### Wires

- Limiting Factor
  - Density
  - Speed
  - Power
- 3 models for wires (model to use depends on switching frequency)
  - Short
  - Lossless
  - Lossy



#### Pitfall: Amdahl's Law

- Improving an aspect of a computer and expecting a proportional improvement in overall performance

$$T_{improved} = \frac{T_{affected}}{improvement factor} + T_{unaffected}$$

- Example: multiply accounts for 80s/100s
  - How much improvement in multiply performance to get 5× overall?
  - 5× overall requires $T_{improved} = 100\text{s}/5 = 20\text{s}$

$$20 = \frac{80}{n} + 20$$

- Can't be done: it would require $80/n = 0$
- Corollary: make the common case fast
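
The multiply example can be explored numerically. The sketch below applies the formula for the improved time; the function names are illustrative assumptions.

```python
from typing import Optional


def overall_speedup(t_affected: float, t_unaffected: float, factor: float) -> float:
    """Overall speedup when only the affected part runs `factor` times faster."""
    t_improved = t_affected / factor + t_unaffected
    return (t_affected + t_unaffected) / t_improved


def required_factor(t_affected: float, t_unaffected: float,
                    target_speedup: float) -> Optional[float]:
    """Improvement factor needed for the target overall speedup.

    Returns None when the target cannot be reached by any factor.
    """
    target_time = (t_affected + t_unaffected) / target_speedup
    remaining = target_time - t_unaffected  # time left for the affected part
    if remaining <= 0:
        return None
    return t_affected / remaining


# Multiply takes 80 s of a 100 s run.
print(required_factor(80.0, 20.0, target_speedup=4.0))  # 16.0
print(required_factor(80.0, 20.0, target_speedup=5.0))  # None: can't be done

for n in (2, 10, 100, 1000):
    print(n, round(overall_speedup(80.0, 20.0, n), 2))
```

The speedup approaches 5 as the multiply improvement grows but never reaches it, because the unaffected 20 s remains.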

## Fallacy: Low Power at Idle

- Look back at X4 power benchmark
  - At 100% load: 295W
  - At 50% load: 246W (83%)
  - At 10% load: 180W (61%)
- Google datacenter
  - Mostly operates at 10% to 50% load
  - At 100% load less than 1% of the time
- Consider designing processors to make power proportional to load
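
The percentages quoted above are power at a given load relative to power at full load. A short check, using a dictionary layout chosen for illustration:

```python
# Average power (watts) at selected target loads, from the X4 power benchmark.
POWER_AT_LOAD_W = {100: 295, 50: 246, 10: 180}


def percent_of_peak(load_percent: int) -> float:
    """Power at the given load as a percentage of power at 100% load."""
    return 100.0 * POWER_AT_LOAD_W[load_percent] / POWER_AT_LOAD_W[100]


for load in (50, 10):
    print(f"{load}% load draws {percent_of_peak(load):.0f}% of peak power")
```

With power proportional to load, these figures would be 50% and 10% instead.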

#### Pitfall: MIPS as a Performance Metric

- MIPS: Millions of Instructions Per Second
  - Doesn't account for
    - Differences in ISAs between computers
    - Differences in complexity between instructions

$$\begin{aligned} \text{MIPS} &= \frac{\text{Instruction count}}{\text{Execution time} \times 10^6} \\ &= \frac{\text{Instruction count}}{\frac{\text{Instruction count} \times \text{CPI}}{\text{Clock rate}} \times 10^6} = \frac{\text{Clock rate}}{\text{CPI} \times 10^6} \end{aligned}$$

CPI varies between programs on a given CPU
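
The sketch below shows why a single MIPS figure is misleading: the same CPU yields different MIPS ratings for programs with different CPI. The 4 GHz clock rate and the two CPI values are hypothetical, and the function name is an assumption.

```python
def mips(clock_rate_hz: float, cpi: float) -> float:
    """MIPS = clock rate / (CPI x 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


CLOCK_RATE_HZ = 4e9  # hypothetical 4 GHz CPU

# Two hypothetical programs on the same CPU with different average CPI.
print(mips(CLOCK_RATE_HZ, cpi=1.0))  # 4000.0
print(mips(CLOCK_RATE_HZ, cpi=2.0))  # 2000.0
```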

# **Concluding Remarks**

- Cost/performance is improving
  - Due to underlying technology development
- Hierarchical layers of abstraction
  - In both hardware and software
- Instruction set architecture
  - The hardware/software interface
- Execution time: the best performance measure
- Power is a limiting factor
  - Use parallelism to improve performance

#### Homework

- Due before starting the class on Sep. 13.
- Exercise 1.1
- Exercise 1.5
- Exercise 1.10
- Exercise 1.15