# **CSE 4201 Computer Architecture** Prof. Mokhtar Aboelaze Parts of these slides are taken from Notes by Prof. David Patterson at UCB

CSE4201

CSE4201

Fall 08

Fall 08

# **Outline**

CSE4201

- · MIPS and instruction set
- Simple pipeline in MIPS
- · Structural and data hazards
- Forwarding
- Branching

Fall 08

- · Exception and interrupts
- Multicycle operations

#### **MIPS Instruction set Instruction Set** Instruction Set Architecture · 32-bit fixed format instruction (3 formats) Defines set of operations, instruction format, hardware supported data types, named storage, addressing modes, sequencing • 32 32-bit GPR (R0 contains zero, DP take pair) Meaning of each instruction is described by RTL on *architected* registers and memory · 3-address, reg-reg arithmetic instruction Given technology constraints assemble adequate datapath · Single address mode for load/store: Architected storage mapped to actual storage Function units to do all the required operations Possible additional storage (eg. MAR, MDR, ...) base + displacement - Interconnect to move information among regs and FUs - no indirection Map each instruction to sequence of RTLs · Simple branch conditions Collate sequences into symbolic controller state transition diagram (STD) · Delayed branch Lower symbolic STD to control points • • Implement controller

Fall 08

CSE4201





















| Speed Up Equation for<br>Pipelining                                                                                                                                                                       |  |  |  |  |  |  |  |  |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|--|--|--|--|--|--|
| ripenning                                                                                                                                                                                                 |  |  |  |  |  |  |  |  |
| CPI <sub>pipelined</sub> = Ideal CPI + Average Stall cycles per Inst                                                                                                                                      |  |  |  |  |  |  |  |  |
| Speedup = $\frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}}$ |  |  |  |  |  |  |  |  |
| For simple RISC pipeline, CPI = 1:                                                                                                                                                                        |  |  |  |  |  |  |  |  |
| Speedup = $\frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{unpipelined}}{\text{Cycle Time}_{pipelined}}$                                                      |  |  |  |  |  |  |  |  |
| Fall 08 CSE4201                                                                                                                                                                                           |  |  |  |  |  |  |  |  |







| Three Generic Data<br>Hazards                                                                                                                                                                   |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Write After Read (WAR)     Instr <sub>J</sub> writes operand <u>before</u> Instr <sub>I</sub> reads it                                                                                          |
| I: sub r4,r1,r3<br>J: add r1,r2,r3<br>K: mul r6,r1,r7                                                                                                                                           |
| • Called an "anti-dependence" by compiler writers.<br>This results from reuse of the name "r1".                                                                                                 |
| <ul> <li>Can't happen in MIPS 5 stage pipeline because:</li> <li>All instructions take 5 stages, and</li> <li>Reads are always in stage 2, and</li> <li>Writes are always in stage 5</li> </ul> |
| Fall 08 CSE4201                                                                                                                                                                                 |

|                | Three Generic Data                                                                                  |
|----------------|-----------------------------------------------------------------------------------------------------|
|                | Hazards                                                                                             |
| •              | Write After Write (WAW)                                                                             |
|                | Instr <sub>J</sub> writes operand <u>before</u> Instr <sub>I</sub> writes it.                       |
|                | $ \begin{array}{c} I: \text{ sub } r1, r4, r3 \\ J: \text{ add } r1, r2, r3 \end{array} $           |
|                | $\rightarrow$ J: add r1,r2,r3                                                                       |
|                | K: mul r6,r1,r7                                                                                     |
| •              | Called an "output dependence" by compiler writers<br>This also results from the reuse of name "r1". |
| •              | Can't happen in MIPS 5 stage pipeline because:                                                      |
|                | <ul> <li>All instructions take 5 stages, and</li> </ul>                                             |
|                | <ul> <li>Writes are always in stage 5</li> </ul>                                                    |
| • <sub>F</sub> | Will see WAR and WAW instruction complicated pipes                                                  |















| Branch Stall Impact                                                                                                                                                                                                                                                                                                                                       |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <ul> <li>If CPI = 1, 30% branch,<br/>Stall 3 cycles =&gt; new CPI = 1.9!</li> <li>Two part solution: <ul> <li>Determine branch taken or not sooner, AND</li> <li>Compute taken branch address earlier</li> </ul> </li> <li>MIPS branch tests if register = 0 or ≠ 0</li> <li>MIPS Solution: <ul> <li>Move Zero test to ID/RF stage</li> </ul> </li> </ul> |
| <ul> <li>Adder to calculate new PC in ID/RF stage</li> <li>1 clock cycle penalty for branch versus 3</li> </ul>                                                                                                                                                                                                                                           |
| Fall 08 CSE4201                                                                                                                                                                                                                                                                                                                                           |





### #3: Predict Branch Taken

- 53% MIPS branches taken on average
- But haven't calculated branch target address in MIPS
   MIPS still incurs 1 cycle branch penalty
- Other machines: branch target known before outcome
   SE4201

**Four Branch Hazard Alternatives** #4: Delayed Branch - Define branch to take place AFTER a following instruction branch instruction  $sequential successor_{1}$ sequential successor<sub>2</sub> <sup>></sup> Branch delay of length *n* . . . . . . . . sequential successor branch target if taken - 1 slot delay allows proper decision and branch target address in 5 stage pipeline MIPS uses this Fall 08 CSE4201



# **Delayed Branch**

- Compiler effectiveness for single branch delay slot:
   Fills about 60% of branch delay slots
  - About 80% of instructions executed in branch delay slots useful in computation
  - About 50% (60% x 80%) of slots usefully filled
- Delayed Branch downside: As processor go to deeper pipelines and multiple issue, the branch delay grows and need more than one delay slot
  - Delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches
  - Growth in available transistors has made dynamic approaches relatively cheaper

Fall 08

CSE4201

# Evaluating Branch Alternatives

Pipeline speedup =  $\frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$ 

Assume 4% unconditional branch, 6% conditional branch- untaken, 10% conditional branch-taken Scheduling Branch CPI speedup v. speedup v. unpipelined scheme penalty stall Stall pipeline 3 1 60 31 10 Predict taken 1 1.20 4.2 1.33

CSE4201

44

4.5

1.40

1.45

1 1.14

0.5 1.10

Predict not taken

Delayed branch

Fall 08





# Multi cycles operations

| Function    | Unit latency                          | initiation period                             |
|-------------|---------------------------------------|-----------------------------------------------|
| Integer ALU | 0                                     | 1                                             |
| Data Memory | 1                                     | 1                                             |
| FP add      | 3                                     | 1                                             |
| FP Multiply | 6                                     | 1                                             |
| FP Divide   | 24                                    | 24                                            |
|             | add and multiply<br>ine respectively) | are pipelined (4 and                          |
|             |                                       | between an instructi<br>other one that uses t |
|             |                                       |                                               |



| LDDD IF ID A1 A2 A3 A4 MEM WB                                                                                                                                                    | ADDD IF ID AI A2 A3 A4 MEM WB<br>.D IF ID EX MEM WB<br>SD IF ID EX MEM WB<br>Stages in red indicates when data are needed, in<br>blue indicates when data are produced    | ULTD |         |        |       |        |        |        |        |       |    |     |    |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------|---------|--------|-------|--------|--------|--------|--------|-------|----|-----|----|
| ID     IF     ID     EX     MEM     WB       pD     IF     ID     EX     MEM     WB       Stages in red indicates when data are needed, in blue indicates when data are produced | IF     ID     EX     MEM     WB       SD     IF     ID     EX     MEM     WB       Stages in red indicates when data are needed, in blue indicates when data are produced |      | IF      | ID     | Ml    | M2     | М3     | M4     | M5     | мб    | м7 | MEM | WB |
| D IF ID EX MEM WB<br>Stages in red indicates when data are needed, in<br>blue indicates when data are produced                                                                   | D IF ID EX MEM WB<br>Stages in red indicates when data are needed, in<br>blue indicates when data are produced                                                            | DDD  |         | IF     | ID    | A1     | A2     | A3     | A4     | MEM   | WB |     |    |
| Stages in red indicates when data are needed, in blue indicates when data are produced                                                                                           | Stages in red indicates when data are needed, in blue indicates when data are produced                                                                                    | D    |         | IF     | ID    | EX     | MEM    | WB     |        |       |    |     |    |
| blue indicates when data are produced                                                                                                                                            | blue indicates when data are produced                                                                                                                                     | 3D   |         |        | IF    | ID     | EX     | MEM    | WB     |       |    |     |    |
|                                                                                                                                                                                  |                                                                                                                                                                           | Ne   | ed to i | introd | uce m | ore pi | peline | regist | ters A | 1/A2, |    |     |    |
|                                                                                                                                                                                  |                                                                                                                                                                           |      |         |        |       |        |        |        |        |       |    |     |    |



- Because the divide unit is not pipelined, structural hazards may arise
- Because of different running times. We may need to do more than one register write in a single cycle
- WAW hazard is now possible, WAR is not since they all read in one stage
- Instructions can complete in different order, more complicated exception handling
- Because of the longer latency, more RAW Fall 08 hazard CSE4201

# **Hazards and Forwarding**

| Instruction   | 1        | 2     | 3    | 4     | 5   | 6   | 7    | 8    | 9  | 10 | 11 | 12  | 13 | 14 | 15 |   |
|---------------|----------|-------|------|-------|-----|-----|------|------|----|----|----|-----|----|----|----|---|
| LD F4,0(R2)   | IF       | ID    | EX   | MEM   | WB  |     |      |      |    |    |    |     |    |    |    |   |
| UL F0,F4,F6   |          | IF    | ID   | s     | Ml  | М2  | М3   | M4   | м5 | м6 | М7 | MEM | WB |    |    |   |
| ADDD F2,F0,F8 |          |       | IF   | ID    | s   | s   | s    | s    | S  | s  | s  | A1  | A2 | A3 | Α4 |   |
| SD F2,0(R2)   |          |       |      | IF    | ID  | s   | s    | s    | S  | S  | S  | s   | S  | s  | s  | M |
| Substa        | antially | y lor | nger | stall | and | for | ward | ling |    |    |    |     |    |    |    |   |
|               |          |       |      |       |     |     |      |      |    |    |    |     |    |    |    |   |
|               |          |       |      |       |     |     |      |      |    |    |    |     |    |    |    |   |
|               |          |       |      |       |     |     |      |      |    |    |    |     |    |    |    |   |
|               |          |       |      |       |     |     |      |      |    |    |    |     |    |    |    |   |
|               |          |       |      |       |     |     |      |      |    |    |    |     |    |    |    |   |
|               |          |       |      |       |     |     |      |      |    |    |    |     |    |    |    |   |
| Fall 08       |          |       |      |       |     | CSE | 4201 |      |    |    |    |     |    |    |    |   |
|               |          |       |      |       |     |     |      |      |    |    |    |     |    |    |    |   |

| Instruction                      | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10  | 11 |
|----------------------------------|----|----|----|----|----|----|----|----|----|-----|----|
| MULTD F0,F4,F6                   | IF | ID | М1 | М2 | М3 | М4 | м5 | м6 | м7 | MEM | WB |
| <br>ADDD F2,F4,F6<br>            |    |    |    | IF | ID | A1 | A2 | A3 | А4 | MEM | WB |
| <br>LD F8,0(R2)                  |    |    |    |    |    |    | IF | ID | EX | MEM | WB |
| Three differen<br>one cycle earl |    |    |    |    |    |    |    |    |    |     |    |

|        | Hazards and Forwarding                                                                                                                                                 |
|--------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| •      | One way to deal with multiple writes is to have multiple write ports, but it may be rarely used.                                                                       |
| •      | Another way is to detect the structural<br>hazard by using an interlock                                                                                                |
|        | 1. We can track the use of the write port before it is issued (ID stage) and stall                                                                                     |
|        | 2. Or, we can detect this hazard at entering the MEM stage, it is easier to detect, and we can choose which instruction to proceed (the one with the longest latency?) |
| all 08 | CSE4201                                                                                                                                                                |

# **Maintaining Precise Exception**

| DIVF | F0,F2,F4    |
|------|-------------|
| ADDF | F10,F10,F8  |
| SUBF | F12,F12,F14 |

- This is known as out of order completion
- What if SUBF causes an Exception after ADDF is completed but before DIVF is, or if DIVF caused an exception after both ADDF and SUBF completed, there is no way to maintain a precise exception since ADDF destroys one of its operands

Fall 08

CSE4201

## Maintaining Precise Exception (sol 1)

- Early solution is to ignore the problem
- More recent ones, are to introduce two modes of operations, fast but with imprecise exception, and slow with precise exception.
- DEC Alpha 2104, Power1 and Power-2, MIPS R8000

CSE4201

Maintaining Precise Exception
 (sol 2)

 Buffer the results of an operation until all the operations before it are completed.

- Costly, especially with long pipes.
- One variation is called history file, old values are stored in the history file and can be restored in case of exception
- Or, we an use future file, where new values are stored until all proceeding instructions are completed.

# Maintaining Precise Exception (sol 3)

- Allow the exception to become imprecise, but we have to keep enough information so that the exception handling routine can recover.
- These information are usually the PC addresses of the instructions that were in the pipe during the exception, who finished, and who not.

Fall 08

Fall 08

CSE4201

## Maintaining Precise Exception (sol 4) • A hybrid technique, Allow the

- A hybrid technique, Allow the instructions to be issued only if we are certain that all the instructions before the issuing instruction will complete without causing an exception
- That guarantees that no instruction after the interrupting one will be completed, and all instructions before it will complete.
- Must check for exception early in the EX

## **MIPS4000**

#### 8 Stage Pipeline:

- IF-first half of fetching of instruction; PC selection happens here as well as initiation of instruction cache access.
- IS-second half of access to instruction cache
- RF-instruction decode and register fetch, hazard checking and also instruction cache hit detection.
- EX-execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation.
- DF-data fetch, first half of access to data cache.
- DS-second half of access to data cache.
- TC-tag check, determine whether the data cache access hit.
- WB-write back for loads and register-register operations.
- 8 Stages: What is impact on Load delay? Branch delay? Why?

CSE4201

Fall 08

| MIP                                                                                                                                   | <b>S4(</b> | )00            |                      |                            |                                  |                                        |                                              |
|---------------------------------------------------------------------------------------------------------------------------------------|------------|----------------|----------------------|----------------------------|----------------------------------|----------------------------------------|----------------------------------------------|
| IF<br>2 cycle load delay<br>(data is ready after DS)                                                                                  | IS<br>IF   | RF<br>IS<br>IF | EX<br>RF<br>IS<br>IF | DF<br>EX<br>RF<br>IS<br>IF | DS<br>DF<br>EX<br>RF<br>IS<br>IF | TC<br>DS<br>DF<br>EX<br>RF<br>IS<br>IF | WB<br>TC<br>DS<br>DF<br>EX<br>RF<br>IS<br>IF |
| IF<br>3 cycles branch delay,<br>MIOS4000 has a singly<br>cycle branch delay<br>scheduling with a predict<br>taken for the remaining 2 | IS<br>IF   | RF<br>IS<br>IF | EX<br>RF<br>IS<br>IF | DF<br>EX<br>RF<br>IS<br>IF | DS<br>DF<br>EX<br>RF<br>IS<br>IF | TC<br>DS<br>DF<br>EX<br>RF<br>IS<br>IF | WB<br>TC<br>DS<br>DF<br>EX<br>RF<br>IS<br>IF |

|        | N                                                   | AIPS400            | ) FP Pipeline              |  |  |  |  |  |  |  |
|--------|-----------------------------------------------------|--------------------|----------------------------|--|--|--|--|--|--|--|
| •      | FP Adde                                             | er, FP Multiplier, | FP Divider                 |  |  |  |  |  |  |  |
| •      | Last step of FP Multiplier/Divider uses FP Adder HW |                    |                            |  |  |  |  |  |  |  |
| •      | 8 kinds (                                           | of stages in FP ι  | units:                     |  |  |  |  |  |  |  |
|        | Stage                                               | Functional unit    | Description                |  |  |  |  |  |  |  |
|        | А                                                   | FP adder           | Mantissa ADD stage         |  |  |  |  |  |  |  |
|        | D                                                   | FP divider         | Divide pipeline stage      |  |  |  |  |  |  |  |
|        | Е                                                   | FP multiplier      | Exception test stage       |  |  |  |  |  |  |  |
|        | М                                                   | FP multiplier      | First stage of multiplier  |  |  |  |  |  |  |  |
|        | Ν                                                   | FP multiplier      | Second stage of multiplier |  |  |  |  |  |  |  |
|        | R                                                   | FP adder           | Rounding stage             |  |  |  |  |  |  |  |
|        | S                                                   | FP adder           | Operand shift stage        |  |  |  |  |  |  |  |
|        | U                                                   |                    | Unpack FP numbers          |  |  |  |  |  |  |  |
| all O8 | 3                                                   | C                  | SE4201                     |  |  |  |  |  |  |  |

| FP Instr                | Pipeline stag            | jes                                                 |                     |  |  |  |  |  |
|-------------------------|--------------------------|-----------------------------------------------------|---------------------|--|--|--|--|--|
| Add, Subtract           | U,S+A,A+R                | ,R+S                                                |                     |  |  |  |  |  |
| Multiply                | U,E+M,M,M                | U,E+M,M,M,M,N,N+A,R                                 |                     |  |  |  |  |  |
| Divide                  | U,A,R,D <sup>28</sup> ,E | U,A,R,D <sup>28</sup> ,D+A,D+R, D+R, D+A, D+R, A, R |                     |  |  |  |  |  |
| Square root             | U,E,(A+R)10              | U,E,(A+R) <sup>108</sup> ,A,R                       |                     |  |  |  |  |  |
| Negate                  | U,S                      |                                                     |                     |  |  |  |  |  |
| Absolute value          | U,S                      |                                                     |                     |  |  |  |  |  |
| FP compare              | U,A,R                    |                                                     |                     |  |  |  |  |  |
| OP                      |                          | Latency                                             | Initiation interval |  |  |  |  |  |
| ADD,SUB                 |                          | 4                                                   | 3                   |  |  |  |  |  |
| MUL                     |                          | 8                                                   | 4                   |  |  |  |  |  |
| DIV                     |                          | 36                                                  | 35                  |  |  |  |  |  |
| SQRT<br>NEG,ABS<br>COMP |                          | 112                                                 | 111                 |  |  |  |  |  |
|                         |                          | 2                                                   | 1                   |  |  |  |  |  |
|                         |                          | 3                                                   | 2                   |  |  |  |  |  |

| • | MUL   | ISSUE                        | 0    | 1     | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9  | 10    |
|---|-------|------------------------------|------|-------|-----|-----|-----|-----|-----|-----|-----|----|-------|
| • | MUL   | Issue                        | U    | М     | М   | М   | М   | N   | N,A | R   |     |    |       |
| • | ADD   | Issue                        |      | U     | S,A | A,R | R,S |     |     |     |     |    |       |
| • | ADD   | Issue                        |      |       | U   | S,A | A,R | R,S |     |     |     |    |       |
| • | ADD   | Stall                        |      |       |     | U   | S,A | A,R | R,S |     |     |    |       |
| • | ADD   | Stall                        |      |       |     |     | U   | S,A | A,R | R,S |     |    |       |
| • | ADD   | Issue                        |      |       |     |     |     | U   | S,A | A,R | R,S |    |       |
| • | ADD   | Issue                        |      |       |     |     |     |     | U   | S,A | A,R | R, | s     |
| • | ADD   | Issue                        |      |       |     |     |     |     |     | U   | S,A | A, | R R,S |
| e | tweer | eraction<br>1 and<br>have to | 7, i | n all |     |     |     |     |     |     |     |    |       |