## More CPU Pipelining Issues



#### Important Stuff:

- Study Session for Problem Set 5 tomorrow night (11/11) 5:30-9:00pm
- Study Session for 2<sup>nd</sup> Midterm on Friday (11/13) during Lab time (1:30-3:00)
- 2<sup>nd</sup> Midterm next Tuesday (11/17)

## A Quick Review of Last Time

Thus far, while attempting to "speed up" our miniMIPS using pipelining, we added 4 stages and encountered 2 issues.



One fix was to add bypass paths, which forward results to later instructions before they are written into the register file in the WB stage.

## Structural Data Hazard

Consider LOADS: Can we fix this problem using bypass paths like before?



Source operands that reference the destination of a previous lw instruction



For a lw instruction fetched during cycle *i*, data isn't returned from memory until late into cycle *i*+3 (*in a 4-stage pipeline*). Bypassing will fix xor but not add!

## Load Delays

Bypassing CAN'T fix the problem with add since the data simply isn't available! In order to fix it we have to add pipeline interlock hardware to <u>stall</u> the add's execution, or else program around it.

lw \$t4,0(\$t1)
add \$t5,\$t1,\$t4
xor \$t6,\$t3,\$t4

|     | i  | i+1 | i+2 | i+3 | i+4 | i+5 | i+6 |
|-----|----|-----|-----|-----|-----|-----|-----|
| IF  | lw | add | xor | xor |     |     |     |
| RF  |    | lw  | add | add | xor |     |     |
| ALU |    |     | lw  | nop | add | xor |     |
| WB  |    |     |     | lw  | nop | add | xor |

In the RF stage, the add detects that it needs a result from the lw instruction. It then stalls the pipe and inserts a bubble (the nop).

Recall, adding stalls to the pipeline in order to assure proper operation is called inserting pipeline BUBBLES.

This requires "inserting a MUX" just before the instruction register of the ALU stage, IR<sup>ALU</sup>, to insert a NOP, and "clock enables" on the PC and IR pipeline registers of earlier pipeline stages to stall the pipeline. Unlike branching, no instructions are annulled.
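
As a rough illustration of the interlock described above, here is a minimal Python sketch of a 4-stage pipe that holds IF and RF and pushes a NOP into the ALU stage when the instruction in RF reads the destination of a lw that is still in the ALU stage. The `Instr` record and the `simulate` function are hypothetical names introduced for this sketch, and the model ignores branches and \$0.

```python
from collections import namedtuple

# Hypothetical instruction record: opcode, destination register, source registers.
Instr = namedtuple("Instr", "op dest srcs")
NOP = Instr("nop", None, ())

def simulate(program):
    """Return one [IF, RF, ALU, WB] snapshot of opcodes per clock cycle."""
    if not program:
        return []
    stages = [program[0], None, None, None]   # cycle i: first instruction in IF
    pc, trace = 1, []
    while any(s is not None for s in stages):
        trace.append([s.op if s else "" for s in stages])
        fetch, rf, alu, _ = stages
        # Load interlock: the instruction in RF reads the register that the
        # lw currently in ALU will not deliver until its WB stage.
        stall = (rf is not None and alu is not None
                 and alu.op == "lw" and alu.dest in rf.srcs)
        if stall:
            # Hold IF and RF (clock enables off) and push a bubble into ALU.
            stages = [fetch, rf, NOP, alu]
        else:
            nxt = program[pc] if pc < len(program) else None
            pc += 1
            stages = [nxt, fetch, rf, alu]
    return trace

program = [
    Instr("lw",  "$t4", ("$t1",)),
    Instr("add", "$t5", ("$t1", "$t4")),
    Instr("xor", "$t6", ("$t3", "$t4")),
]
for cycle, row in enumerate(simulate(program)):
    print(f"i+{cycle}: {row}")
```

Each printed row lists the opcodes in IF, RF, ALU, and WB for one cycle, so the bubble shows up as a nop entering the ALU column while add and xor stay put for a cycle.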

## Punting on Load Interlock

Early versions of MIPS did not include a pipeline interlock, thus requiring the compiler/programmer to work around it. Old code still worked once interlocks were added; it just used more memory.

lw \$t4,0(\$t1)
nop
add \$t5,\$t1,\$t4
xor \$t6,\$t3,\$t4

If you put an instruction in place of the nop, it was not allowed to access \$t4, thus complicating the ISA. S/W guys rebelled.

|     | i  | i+1 | i+2 | i+3 | i+4 | i+5 | i+6 |
|-----|----|-----|-----|-----|-----|-----|-----|
| IF  | lw | nop | add | xor |     |     |     |
| RF  |    | lw  | nop | add | xor |     |     |
| ALU |    |     | lw  | nop | add | xor |     |
| WB  |    |     |     | lw  | nop | add | xor |

OMG! What if there was a lw instruction in a branch-delay slot? This is getting complicated!

If the compiler knows about the load delay, it can often rearrange the code sequence to eliminate the hazard. Many compilers can provide implementation-specific instruction scheduling. This requires no additional H/W, but it leads to awkward instruction semantics. We'll include interlocks in miniMIPS.
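
To make the compiler-side workaround concrete, here is a minimal Python sketch of load-delay scheduling for the 4-stage pipe. It assumes straight-line code (one basic block, no branches), tracks register dependences only, and never moves loads or stores. `Instr`, `independent`, and `schedule_load_delay` are hypothetical names for this sketch, not part of any real compiler.

```python
from collections import namedtuple

# Hypothetical instruction record: opcode, destination register, source registers.
Instr = namedtuple("Instr", "op dest srcs")
NOP = Instr("nop", None, ())

def independent(cand, skipped):
    """True if cand can be hoisted above every instruction in skipped."""
    for ins in skipped:
        if ins.dest is not None and (ins.dest in cand.srcs or ins.dest == cand.dest):
            return False          # read-after-write or write-after-write
        if cand.dest is not None and cand.dest in ins.srcs:
            return False          # write-after-read
    return True

def schedule_load_delay(program):
    """Fill each load delay slot with an independent instruction, else a nop."""
    out = list(program)
    i = 0
    while i + 1 < len(out):
        load, user = out[i], out[i + 1]
        if load.op == "lw" and load.dest in user.srcs:
            for k in range(i + 2, len(out)):
                cand = out[k]
                if (cand.op not in ("lw", "sw", "nop")
                        and load.dest not in cand.srcs
                        and independent(cand, out[i + 1:k])):
                    out.insert(i + 1, out.pop(k))   # hoist into the delay slot
                    break
            else:
                out.insert(i + 1, NOP)              # nothing safe to move
        i += 1
    return out

code = [
    Instr("lw",  "$t4", ("$t1",)),
    Instr("add", "$t5", ("$t1", "$t4")),
    Instr("xor", "$t6", ("$t3", "$t4")),
    Instr("sub", "$t7", ("$t2", "$t3")),
]
print([ins.op for ins in schedule_load_delay(code)])
```

Here the sub does not touch \$t4 or anything the add and xor produce or consume, so it can sit in the load delay slot; without it, the sketch falls back to inserting a nop.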

## Load Delays (cont'd)

#### But, what about FASTER processors?

FACT: Processors have become very fast relative to memories!

Can we just stall the pipe longer? Add more NOPs?

ALTERNATIVE: Longer pipelines.

1. Add "MEMORY WAIT" stages between INITIATION of a load operation and when it returns data.
2. Build pipelined memories, so that multiple (say, N) memory transactions can be in progress at once.
3. (Optional) Stall the pipeline when the N limit is exceeded.
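
The effect of the N limit in the list above can be sketched with a toy model in Python. It assumes one load is issued per cycle when a slot is free, that every load takes a fixed `latency` in cycles, and that the pipe stalls while `max_in_flight` transactions are outstanding. The function name and parameters are hypothetical, chosen only for this illustration.

```python
from collections import deque

def run_loads(num_loads, latency, max_in_flight):
    """Return (cycle when the last load's data is ready, stall cycles)."""
    in_flight = deque()            # completion cycle of each outstanding load
    cycle = stalls = issued = 0
    finish = 0
    while issued < num_loads:
        # Retire transactions whose data has come back by this cycle.
        while in_flight and in_flight[0] <= cycle:
            in_flight.popleft()
        if len(in_flight) < max_in_flight:
            finish = cycle + latency
            in_flight.append(finish)
            issued += 1
        else:
            stalls += 1            # N limit exceeded: stall the pipeline
        cycle += 1
    return finish, stalls

print(run_loads(4, latency=3, max_in_flight=3))
print(run_loads(4, latency=3, max_in_flight=1))
```

In this model, once the memory can keep as many transactions in flight as its latency in cycles, back-to-back loads never stall; with only one transaction allowed, every load waits for the previous one to finish.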



Sadly, this IS the

4-Stage pipeline requires READ access in LESS than one clock.

## SOLUTION: A 5-Stage pipeline that allows nearly two clocks for data memory accesses...



L20 - Pipeline Issues 7

### One More Fly in the Ointment

There is one more structural hazard that we have not discussed: saving, and subsequently accessing, the return address produced by the jump-and-link instructions, jal and jalr.

Moreover, given that we have bought into a single delay-slot, which is always executed, we now need to store the address of the instruction FOLLOWING the delay slot instruction.



## **Return Address Register Writes**





On JALs, we need to store the next PC of the DELAY SLOT instruction (often PC+8).
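
The arithmetic behind PC+8 is small enough to write out. The Python sketch below assumes 4-byte instructions and a single delay slot, as in the text; `jal_link_address` is a hypothetical helper name.

```python
def jal_link_address(jal_pc, delay_slots=1, instr_bytes=4):
    """Address saved in $31: the instruction after the delay slot(s)."""
    return jal_pc + instr_bytes * (1 + delay_slots)

jal_pc = 0x00400010
print(hex(jal_pc + 4))                 # address of the delay-slot instruction
print(hex(jal_link_address(jal_pc)))   # return address written to $31
```

With no delay slot the same formula gives PC+4, which is why committing to a delay slot changes what the jal has to store.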

Note this bypass is routed from the PC pipeline not from the ALU output. Thus, we need to add bypass paths for PC<sup>MEM</sup>.



We need another PC<sup>ALU</sup> bypass.

In this case, the bypass path supplies the \$31 operand for the XOR instruction.



And, we need another PC<sup>MEM</sup> bypass.

In this case, the bypass path supplies the \$31 operand for the OR instruction.

PC<sup>WB</sup> is already taken care of for the following ADD, using the WB stage bypass at the output of the WDSEL mux.



## **Bypass MUX Details**

The previous diagram was oversimplified. We really need the bypass muxes to precede the A and B muxes, so that they provide the correct values for the jump target (JT), write data, and early branch decision logic.



## **Bypass Logic**

miniMIPS A bypass logic (need another copy for B bypass that compares to rt rather than rs):



\* If instruction is a sw (doesn't write into regfile), set rt for ALU/MEM/WB to \$0
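
Since the logic diagram itself did not survive, here is a hedged Python sketch of what such bypass selection could look like; it is not the slide's actual circuit. It assumes the nearest pipeline stage wins, that \$0 is never bypassed, that instructions writing no register report \$0 as their destination (per the footnote), and that load stalls are handled by the separate interlock. `dest_reg`, `bypass_select`, and the mux input names are hypothetical.

```python
I_TYPE = {"lw", "addi", "andi", "ori", "xori", "slti", "lui"}
NO_WRITE = {"sw", "beq", "bne", "j", "jr"}

def dest_reg(op, rt, rd):
    """Register number an instruction will write, or 0 if it writes none."""
    if op in NO_WRITE:
        return 0             # treated as a write to $0, so it never matches
    if op == "jal":
        return 31            # return address goes to $31
    return rt if op in I_TYPE else rd

def bypass_select(src, alu, mem, wb):
    """Pick the mux input for source register src (rs for A, rt for B).

    alu, mem, wb are (dest, is_jal) pairs for the instruction in each stage.
    """
    if src == 0:
        return "REGFILE"                 # $0 always reads as zero
    for name, (dest, is_jal) in (("ALU", alu), ("MEM", mem)):
        if dest == src:
            # jal results come from the PC pipeline, not the ALU output.
            return "PC_" + name if is_jal else name
    if wb[0] == src:
        return "WB"                      # WDSEL output already covers jal
    return "REGFILE"

# add $5,$1,$4 in RF; an instruction writing $4 is in ALU, a sw is in MEM.
alu = (dest_reg("addi", 4, 0), False)
mem = (dest_reg("sw", 4, 0), False)
wb = (dest_reg("jal", 0, 0), True)
print(bypass_select(4, alu, mem, wb))    # A operand (rs = $4)
print(bypass_select(31, alu, mem, wb))   # a read of $31
```

The six possible results (REGFILE, ALU, MEM, WB, PC_ALU, PC_MEM) are one plausible reading of a six-input mux; the B copy is the same function called with rt instead of rs.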



## CPU Pipeline Summary (I)



## CPU Pipeline Summary (II)

- We ended up with a 5-stage pipelined implementation. It increased throughput (3x-4x), but it had an impact:
  1. We added delayed branch decisions (1 stage). We chose to \*always\* execute the instruction after a branch, so pipelined code does not work the same as before (changed the ISA).
  2. We added bypasses to forward results ahead of the register write-back stage (3 stages). This did not impact instruction semantics, but it adds delays (due to 2 six-input muxes) to forward correct values.
  3. We stall the CPU if either of the 2 instructions following a LW references its destination register. We introduced NOPs at IR<sup>RF</sup> and IR<sup>ALU</sup> to stall until the LW result was ready. No impact on the ISA, but the timing of LW varies (1, 2, or 3 clocks).
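
One way to read the "1, 2, or 3 clocks" remark is sketched below in Python. It assumes a dependent instruction immediately after the LW costs two bubbles, a dependent instruction one slot later costs one bubble, and anything further away costs none. `lw_effective_clocks` is a hypothetical helper, and the model looks at a single LW in isolation.

```python
def lw_effective_clocks(following_srcs, lw_dest):
    """Clocks charged to one LW: 1 plus the bubbles its first consumer causes.

    following_srcs lists the source registers of the instructions after the
    LW, in program order.
    """
    for distance, srcs in enumerate(following_srcs[:2], start=1):
        if lw_dest in srcs:
            return 1 + (3 - distance)   # distance 1: 2 bubbles, distance 2: 1
    return 1                            # no nearby consumer: no stall

print(lw_effective_clocks([("$t1", "$t4"), ("$t3", "$t4")], "$t4"))
print(lw_effective_clocks([("$t1", "$t2"), ("$t3", "$t4")], "$t4"))
print(lw_effective_clocks([("$t1", "$t2"), ("$t3", "$t5")], "$t4"))
```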

## CPU Pipeline Summary (III)

#### Fallacy #1: Pipelining is easy

Smart people get it wrong all of the time! Costs? Re-spins of the design. Force S/W folks to devise program/compiler workarounds.

#### Fallacy #2: Pipelining is independent of ISA

Many ISA decisions impact how easy/costly it is to implement pipelining (e.g. branch semantics, addressing modes). Bad decisions impact future implementations (delay slot vs. annul? load interlocks?) and break otherwise clean semantics. For performance, S/W must be aware!

#### Fallacy #3: Increasing Pipeline stages improves performance

Diminishing returns. Increasing complexity. Can introduce unusable delay slots, long interlock stalls.

#### Fallacy of my Generation: RISC == Simplicity???

"The P.T. Barnum World's Tallest Dwarf Competition" World's Most Complex RISC?

- RISC was conceived to be SIMPLE
- SIMPLE --> FAST

