#### **ECE 498AL**

# Lectures 9: Memory Hardware in G80

# CUDA Device Memory Space: Review

Host

#### • Each thread can:

- R/W per-thread registers
- R/W per-thread local memory
- R/W per-block shared memory
- R/W per-grid global memory
- Read only per-grid constant memory
- Read only per-grid texture memory
- The host can R/W global, constant, and texture memories

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-200 ECE 498AL, University of Illinois, Urbana-Champaign



# Parallel Memory Sharing



# SM Memory Architecture



# SM Register File

- Register File (RF)
  - 32 KB (8K entries) for each SM in G80
- TEX pipe can also read/write RF
  - 2 SMs share 1 TEX
- Load/Store pipe can also read/write RF



## Programmer View of Register File

- There are 8192 registers in each SM in G80
  - This is an implementation decision, not part of CUDA
  - Registers are dynamically partitioned across all blocks assigned to the SM
  - Once assigned to a block, the register is NOT accessible by threads in other blocks
  - Each thread in the same block only access registers assigned to itself



## Matrix Multiplication Example

- If each Block has 16X16 threads and each thread uses 10 registers, how many thread can run on each SM?
  - Each block requires 10\*256 = 2560 registers
  - -8192 = 3 \* 2560 + change
  - So, three blocks can run on an SM as far as registers are concerned
- How about if each thread increases the use of registers by 1?
  - Each Block now requires 11\*256 = 2816 registers
  - -8192 < 2816\*3
  - Only two Blocks can run on an SM, 1/3 reduction of parallelism!!!

## More on Dynamic Partitioning

- Dynamic partitioning gives more flexibility to compilers/programmers
  - One can run a smaller number of threads that require many registers each or a large number of threads that require few registers each
    - This allows for finer grain threading than traditional CPU threading models.
  - The compiler can tradeoff between instruction-level parallelism and thread level parallelism

#### ILP vs. TLP Example

- Assume that a kernel has 256-thread Blocks, 4 independent instructions for each global memory load in the thread program, and each thread uses 10 registers, global laods have 200 cycles
  - 3 Blocks can run on each SM
- If a compiler can use one more register to change the dependence pattern so that 8 independent instructions exist for each global memory load
  - Only two can run on each SM
  - However, one only needs 200/(8\*4) = 7 Warps to tolerate the memory latency
  - Two blocks have 16 Warps. The performance can be actually higher!

# Memory Layout of a Matrix in C





# Memory Coalescing

• When accessing global memory, peak performance utilization occurs when all threads in a half warp access continuous memory locations.



# Memory Layout of a Matrix in C

Access direction in Kernel code





# Memory Layout of a Matrix in C



© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE 498AL, University of Illinois, Urbana-Champaign

#### **Constants**

- Immediate address constants
- Indexed address constants
- Constants stored in DRAM, and cached on chip
  - L1 per SM
- A constant value can be broadcast to all threads in a Warp
  - Extremely efficient way of accessing a value that is common for all threads in a block!



# **Shared Memory**

Each SM has 16 KB of Shared Memory

16 banks of 32bit words

 CUDA uses Shared Memory as shared storage visible to all threads in a thread block

- read and write access
- Not used explicitly for pixel shader programs
  - we dislike pixels talking to each other ☺



#### Parallel Memory Architecture

- In a parallel machine, many threads access memory
  - Therefore, memory is divided into banks
  - Essential to achieve high bandwidth
- Each bank can service one address per cycle
  - A memory can service as many simultaneous accesses as it has banks

- Multiple simultaneous accesses to a bank result in a bank conflict
  - Conflicting accesses are serialized



Bank 15

# Bank Addressing Examples



## Bank Addressing Examples



## How addresses map to banks on G80

- Each bank has a bandwidth of 32 bits per clock cycle
- Successive 32-bit words are assigned to successive banks
- G80 has 16 banks
  - So bank = address % 16
  - Same as the size of a half-warp
    - No bank conflicts between different half-warps, only within a single half-warp

## Shared memory bank conflicts

• Shared memory is as fast as registers if there are no bank conflicts

#### • The fast case:

- If all threads of a half-warp access different banks, there is no bank conflict
- If all threads of a half-warp access the identical address, there is no bank conflict (broadcast)

#### • The slow case:

- Bank Conflict: multiple threads in the same half-warp access the same bank
- Must serialize the accesses
- Cost = max # of simultaneous accesses to a single bank

Linear Addressing

• Given:

```
__shared__ float shared[256];
float foo =
   shared[baseIndex + s *
   threadIdx.x];
```

- This is only bank-conflict-free if s shares no common factors with the number of banks
  - 16 on G80, so s must be odd

s=1Thread 0 Bank 0 Thread 1 Bank 1 Thread 2 Thread 3 Bank 3 Thread 4 Bank 4 Thread 5 Bank 5 Thread 6 Bank 6 Thread 7 Bank 7 Thread 15



© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE 498AL, University of Illinois, Urbana-Champaign