The PPC Processor: History

POWER (Performance Optimization With Enhanced RISC)

IBM RISC processor for mainframes and servers.

68000 (68K)

Motorola processor which previously formed the core of Apple's desktop computing line

"Apple needed a CPU for its personal computers that would be both cutting-edge and backwards compatible with the 68K." - arstechnica 1

PPC 601

Introduction date: March 14, 1994
Process: 0.60 micron
Transistor Count: 2.8 million
Die size: 121mm2
Clock speed at introduction: 60-80MHz
Cache sizes: 32KB unified L1
First appeared in: Power Macintosh 6100/60
"Instructions are all the same size; the 601's instruction fetch logic doesn't have the instruction alignment headaches that plague x86 designs, which means that the fetch hardware can be simpler and faster." - arstechnica 1
=== Pipeline ===
1. Fetch
  - Instructions go into an 8-entry Instruction Queue, which is used to facilitate branch detection/prediction.
  - Branch folding: if a branch will be taken no matter what, replace the branch instruction with the instruction at the branch target. Saves 2 cycles -- the branch that gets deleted, and the bubble that happens after most branch instructions.
2. Decode/Dispatch
3. Execute
  - to one of three different execution units: the integer unit, the floating-point unit, and the branch unit.
4. Writeback
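
A toy sketch of the branch folding idea from the Fetch stage above. The instruction format (`("b", target_index)` for an always-taken branch, plain strings for everything else) and the function name are made up for illustration; the real 601 does this in fetch hardware, not software.

```python
def fold_branches(program, start=0, max_steps=100):
    """Return the instructions that actually reach dispatch.

    An always-taken branch never occupies a slot: fetching simply
    continues at its target, so the target instruction takes its place.
    """
    issued = []
    pc = start
    for _ in range(max_steps):          # guard against branch-to-self loops
        if pc >= len(program):
            break
        op = program[pc]
        if isinstance(op, tuple) and op[0] == "b":
            pc = op[1]                  # fold: the branch itself is dropped
        else:
            issued.append(op)
            pc += 1
    return issued


program = ["add", "sub", ("b", 4), "never_runs", "mul", "or"]
print(fold_branches(program))  # ['add', 'sub', 'mul', 'or']
```

The branch at index 2 never shows up in the issued stream, which is the "branch that gets deleted" half of the 2-cycle saving.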
=== Integer Unit ===
- Performs: Integer and floating-point load-address calculations, Integer and floating-point store-address calculations, Integer and floating-point load-data operations, Integer store-data operations
- "Cramming all of these load-store functions into the 601's single integer ALU didn't exactly help the chip's integer performance, but it was good enough to keep up with the Pentium in this area despite the fact that the Pentium had two integer ALUs. I imagine that most of this integer performance parity came from the 601's huge 32K unified L1 cache (compare the Pentium's 8K split L1)." - arstechnica 1
- Multicycle integer operations NOT pipelined (but most are 1 cycle anyway, so not so bad)
=== Floating Point Unit ===
- Impressive.
- All single-precision ops pipelined AND most double-precision ops too
- "For single-precision operations (with the exception of divides) and most double-precision operations, the 601's floating-point hardware could turn out one instruction per cycle with a two-cycle latency." -arstechnica 1
- Since the integer unit handled all memory traffic, and since int and fp instructions tend to hang together rather than mix, in a big stretch of fp ops the sole purpose of the integer unit would be to feed the fp unit data from memory -- ars calls this a "dedicated load-store unit"
- 32K cache ROCKED HARDCORE with floating point
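
To see why "one instruction per cycle with a two-cycle latency" matters, here is a rough cycle count for a run of independent fp ops, pipelined versus not. The function names are mine, and the model ignores stalls, dependencies and cache misses entirely.

```python
def pipelined_cycles(n_ops, latency, issue_interval=1):
    """Cycles to finish n independent ops when a new one can start
    every `issue_interval` cycles and each takes `latency` cycles."""
    if n_ops == 0:
        return 0
    return latency + (n_ops - 1) * issue_interval


def unpipelined_cycles(n_ops, latency):
    """Cycles when each op must finish before the next one starts."""
    return n_ops * latency


# 100 independent single-precision ops on a 2-cycle-latency unit
print(pipelined_cycles(100, latency=2))    # 101
print(unpipelined_cycles(100, latency=2))  # 200
```

For long runs the pipelined count approaches one cycle per op, so the latency nearly disappears.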
=== Branch Execution Unit ===
- Static branch predictor (?)
=== Sequencer Unit ===
- holdover from POWER RISC single processor -- basically a mini-processor with its own instruction set, reg file, ROM, etc
- time crunches happen in the Real World, too, apparently

PPC 603 and 603e

	PowerPC 603 vitals	PowerPC 603e vitals
Introduction date	May 1, 1995	October 16, 1995
Process	0.50 micron	0.50 micron
Transistor count	1.6 million	2.6 million
Die size	81mm2	98mm2
Clock speed at introduction	75MHz	100MHz
L1 cache size	16K split L1	32K split L1
First appeared in	Macintosh Performa 5200CD	Macintosh Performa 6300CD

-direct copy from arstechnica 1

Low power: Apple needed a PowerBook chip. The first 603 was a bit shitty due to its small cache; Apple didn't start using it in PowerBooks until the 603e came out and could actually do the 68K emulation reasonably.
=== 603e: Integer Unit ===
=== 603e: Floating-point Unit ===
- Worse performance than the 601 (3 fp instructions every 4 cycles?), BUT:
- this is where the fp multiply-add combo instruction comes in that the carch book talks about. "A core DSP operation"; thus Apple begins its legacy of keeping the audio/graphics people happy happy happy.
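
Why multiply-add counts as "a core DSP operation": a dot product (and so an FIR filter tap loop) is nothing but a chain of them. A minimal sketch, where `multiply_add` is a stand-in I made up for the single hardware instruction. Plain Python rounds after the multiply and again after the add, whereas a fused instruction rounds once.

```python
def multiply_add(a, b, c):
    # Stand-in for one fused multiply-add instruction: a * b + c.
    # Not bit-exact with hardware: Python rounds twice here.
    return a * b + c


def dot(xs, ys):
    """Dot product expressed as one multiply-add per element pair."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = multiply_add(x, y, acc)
    return acc


print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```

With a combined instruction each loop iteration costs one fp op instead of two.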
=== 603e: Branch Unit ===
=== 603e: Load-Store Unit (*new*) ===
- Takes over address calculating (previously in 601's Integer Unit)
=== 603e: System Unit (*new*) ===
- Handles condition register updates (previously in 601's Integer Unit)
=== Do more research on... ===
- "Reservation stations"
- "Completion Unit"
- Branch unit
- "Instruction window"
- Register renaming

PPC 604 and 604e

	PowerPC 604 vitals	PowerPC 604e vitals
Introduction date	May 1, 1995	July 19, 1996
Process	0.50 micron	0.35 micron (->0.25 micron)
Transistor count	3.6 million	5.1 million
Die size	197mm2	148mm2
Clock speed at introduction	120MHz	180-200MHz (->350MHz)
L1 cache size	32K split L1	64K split L1
First appeared in	Power Mac 9500/120	PowerComputing PowerTower Pro 200 (Power Mac 9500/180 on August 7, 1996)

- Totally stolen from arstechnica 1

whitepaper put out by *Motorola*?
=== 604 ===
==== Pipeline ====
1. Fetch
2. Decode
3. Dispatch (*new*)
4. Execute
5. Complete (*new*)
6. Write-back
==== Integer Units ====
- Two simple integer units (SIUs)
  - Do simple, single-cycle ops
  - Faster
- One complex integer unit (CIU)
  - Do multi-cycle ops
  - Slower
==== Floating Point Unit ====
- Almost entirely pipelined, including doubles
==== Branch Unit ====
- Now handles condition register (previously in 603e's System Unit)
- Now uses dynamic branch prediction rather than "new branches not taken; old branches taken" rule
  - 512-entry branch history table, 2 bits per entry
- A longer pipeline means a bigger penalty for a mispredicted branch -- hence the better branch unit
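
A sketch of what a 512-entry, 2-bits-per-entry branch history table could look like, using the textbook 2-bit saturating counter scheme. The indexing function and the initial counter state are my assumptions; the notes don't say how the 604 actually indexes or initializes its table.

```python
class BranchHistoryTable:
    """2-bit saturating counters: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, entries=512):
        self.entries = entries
        self.counters = [1] * entries   # assumed start: weakly not-taken

    def _index(self, pc):
        # Assumed: drop the 2 low bits (4-byte instructions), then wrap.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def mispredictions(bht, pc, outcomes):
    """Count wrong guesses for one branch over a sequence of outcomes."""
    misses = 0
    for taken in outcomes:
        if bht.predict(pc) != taken:
            misses += 1
        bht.update(pc, taken)
    return misses


# A loop branch: taken 9 times, then falls through, run 3 times over.
loop = ([True] * 9 + [False]) * 3
print(mispredictions(BranchHistoryTable(), 0x1000, loop))  # 4
```

The point of two bits rather than one: a single loop exit only nudges the counter from 3 to 2, so the next time the loop runs the first iteration is still predicted taken.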
==== Do more research on... ====
- "Re-order buffer (ROB)" and its equivalence to the "Completion queue"
=== 604e ===
- doubled instruction and data cache sizes to 32K each
- added Condition Register Unit
==== Condition Register Unit ====
- Condition Register finally has its own unit for control
- Other units no longer tied up with CR calcs (which happen often)
==== Branch Unit ====
- "expanded capabilities"

PowerPC 750

Introduction date: November 10, 1997
Process: 0.25 micron
Transistor Count: 6.35 million
Die size: 67mm2
Clock speed at introduction: 233-266MHz
Cache sizes: 64KB split L1 (32KB instructions, 32KB data), 512KB L2
First appeared in: Power Macintosh G3/233
Very like the 603e
4-stage pipeline
Could not do vector calcs; Intel and AMD had extended their instruction sets to do SIMD. Motorola addresses this when it develops the G3 into an embedded/media workstation chip.
=== Integer units ===
Two of them:
- simple (SIU), does all int ops except mult and div
- complex (CIU), does all int ops
=== Floating Point Unit ===
- 3-cycle latency (like 603) on single
- also 3-cycle latency on doubles, except mult and mult-add, which take 4
- no bubble after 3rd instr (improvement)
=== Load-Store Unit ===
- Just like 603
=== System Register Unit ===
- just like 603
=== Branch Prediction Unit ===
- Vast improvement: A 64-entry branch target instruction cache stores not the target addresses of recently taken branches, but the target instructions. Then the processor doesn't have to wait for the instruction fetch logic, and can put fewer bubbles in its pipeline per branch.
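
A rough sketch of the branch target instruction cache idea: key on the branch's address, store the instructions found at its target. The class name, the eviction policy (oldest entry goes first) and the string "instructions" are all assumptions for illustration; the notes only give the 64-entry size.

```python
from collections import OrderedDict


class BTIC:
    """Maps a branch address to the instruction(s) at its target."""

    def __init__(self, entries=64):
        self.entries = entries
        self.cache = OrderedDict()

    def lookup(self, branch_pc):
        # Hit: target instructions are available without waiting on fetch.
        # Miss: returns None and the fetch logic has to do the work.
        return self.cache.get(branch_pc)

    def fill(self, branch_pc, target_instrs):
        if branch_pc in self.cache:
            self.cache.move_to_end(branch_pc)
        elif len(self.cache) >= self.entries:
            self.cache.popitem(last=False)   # assumed policy: evict oldest
        self.cache[branch_pc] = tuple(target_instrs)


btic = BTIC()
print(btic.lookup(0x2000))            # None (first encounter, a miss)
btic.fill(0x2000, ["lwz", "add"])
print(btic.lookup(0x2000))            # ('lwz', 'add')
```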
=== Frontend ===
- 6-entry reorder buffer (opposed to 603's 5)
- Has half the rename registers of the 604, a smaller reorder buffer, and fewer reservation stations, but has a much shorter pipeline (and the cooler branch prediction), which makes up for it
=== Do more research on... ===
- SIMD. What is it?

PPC 7400

AKA MPC7400
Essentially same as 750, but with SIMD
Lower power version: 7410
Stats:
- Introduction date: August 31, 1999
- Process: 0.25 micron
- Transistor Count: 10.5 million
- Die size: 83mm2
- Clock speed at introduction: 350-450MHz
- Cache sizes: 32KB L1 (instructions), 32KB L1 (data), 512KB L2
- First appeared in: Power Macintosh G4/400
=== Pipeline ===
- 4 stages
- The short pipeline kept the clock speed down for a long time
=== Floating Point Unit ===
- Improvement: now has a full double-precision FPU
- All operations on both single and double precision numbers have 3-cycle latency
=== Vector Unit ===
- AltiVec!
  - Instructions operate on 128-bit chunks of data
- Two subunits: Vector ALU (VALU) and Vector Permute Unit (VPU)
- Added 32 128-bit vector registers
- Added 6 vector rename registers
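
What "128-bit chunks" means in practice: one vector register holds several narrower elements and a single instruction operates on all of them. A sketch using plain lists as stand-ins for registers; the function names are mine, and the wraparound add is just one flavor of vector add, chosen for illustration.

```python
VECTOR_BITS = 128


def lanes(element_bits):
    """How many elements of a given width fit in one vector register."""
    return VECTOR_BITS // element_bits


def vector_add_u8(a, b):
    """One 'instruction': 16 unsigned byte adds at once, wrapping at 256."""
    if len(a) != lanes(8) or len(b) != lanes(8):
        raise ValueError("expected 16 byte-sized elements per operand")
    return [(x + y) & 0xFF for x, y in zip(a, b)]


print(lanes(8), lanes(16), lanes(32))          # 16 8 4
print(vector_add_u8([250] * 16, [10] * 16))    # sixteen 4s (260 wraps to 4)
```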
==== Vector ALU ====
- Vector arithmetic
- Vector logic
==== Vector Permute Unit ====
- Permute and shift
=== Frontend ===
- has a bigger completion queue (8 entries) than the 750 (6)

PowerPC 7450

Could have easily moved on to being called the G5, but Apple stuck with G4. Known as G4e.
Deeper pipeline, better VALU
Stats:
- Introduction date: January 9, 2001
- Process: 0.18 micron
- Transistor Count: 33 million
- Die size: 106mm2
- Clock speed at introduction: 667-733MHz
- Cache sizes: 32KB L1 (instructions), 32KB L1 (data)
- 256KB L2, 512KB-2MB L3 cache
- First appeared in: Power Macintosh G4/667
=== Pipeline ===
1. Fetch-1
2. Fetch-2
  - 4 instrs per clock cycle if they are in the L1 cache; delay of 9 cycles if it has to go to L2
3. Decode/dispatch
  - 12-entry instruction queue
  - dispatches up to 3 at a time into "issue queues"
4. Issue (*new*)
  - eliminates stall from previous processors if reservation stations were busy
  - General Issue Queue - 6-entry, can accept up to 3 instrs per cycle. Out-of-order execution issues bottom three instructions to any of three integer units or the load-store unit.
  - Vector Issue Queue - 4-entry, can accept up to 2 instrs per cycle. In-order execution issues bottom two instructions to any of 4 vector exec units.
  - Floating-point Issue Queue - 1-entry, accepts 1 instr per cycle. Executes 1 instr to the FPU.
5. Execute
  - Instrs pass from reservation stations into the execution units, and are executed. Go figure.
6. Complete
7. Write-back (Commit)
  - Complete and WB must put the effects of instructions back into program order -- the user needs to think they happened in the order they were coded.
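
A small sketch of the in-order completion idea from the last two stages: instructions may finish executing in any order, but results only become visible from the head of the queue. The function name and the one-finish-per-step model are assumptions for illustration, not how the 7450's completion queue is actually built.

```python
def commit_trace(program_order, finish_order):
    """For each finishing instruction, list what can be committed then.

    An instruction commits only once everything before it in program
    order has committed too.
    """
    finished = set()
    head = 0                      # next instruction allowed to commit
    trace = []
    for instr in finish_order:
        finished.add(instr)
        committed_now = []
        while head < len(program_order) and program_order[head] in finished:
            committed_now.append(program_order[head])
            head += 1
        trace.append(committed_now)
    return trace


# "c" finishes first but has to wait for "a" and "b".
print(commit_trace(["a", "b", "c", "d"], ["c", "a", "d", "b"]))
# [[], ['a'], [], ['b', 'c', 'd']]
```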
=== Branch Prediction Unit ===
- static and dynamic
- 2048-entry BHT
- 128-entry BTIC, also stores first four instructions starting at each branch target
- Basically, deeper pipeline => crappier branch penalties, => more awesome branch prediction strategies.
=== Integer Units ===
- 4 of them: 3 fast SIUs, 1 slow CIU
- SIU mostly single-cycle (some exceptions), 1 reservation station; CIU takes four cycles or more, 2 reservation stations
- instrs updating the Condition Register have an extra pipeline stage, "finish" (doesn't cause hazards or delays, just latency -- uses forwarding properly)
=== ISA (Instruction Set Architecture) ===
- 32 general purpose registers
- 16 rename registers (only for on-chip logic; not visible to user)
=== Floating Point Unit ===
- Single pipeline for all instrs
- single and double precision ops take 5 cycles
- One instr per cycle, except:
  - FPU is not fully pipelined. If first four of 5 pipeline stages are occupied, will stall on next cycle -- so only 4 instrs every 5 cycles can be executed. "[Certain floating-point code can run pretty badly]" - arstechnica
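
Back-of-envelope arithmetic for what "4 instrs every 5 cycles" costs, next to the 603e's "3 every 4" from the earlier notes. The clock speeds are the introduction speeds listed above; the function name is made up, and this is a best-case rate that ignores everything except the pipeline bubble.

```python
from fractions import Fraction


def sustained_rate(instrs, cycles, clock_mhz):
    """Best-case millions of fp instructions per second."""
    return Fraction(instrs, cycles) * clock_mhz


# 7450 at 733MHz, 4 instrs per 5 cycles
print(float(sustained_rate(4, 5, 733)))   # 586.4

# 603e at 100MHz, 3 instrs per 4 cycles
print(float(sustained_rate(3, 4, 100)))   # 75.0
```

So the bubble costs a fifth of the fp issue slots on the 7450 and a quarter on the 603e.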
=== Vector Units ===
- AltiVec adds 162 instructions to the PPC set (just FYI. this isn't new info, I think)
- Four independent AltiVec units, 3 fully pipelined:
  - Vector Permute (dwisott)
  - Vector Simple Integer Unit (mostly single-cycle stuff; like regular SIU)(single pipeline stage; can only handle one instr at a time; "not pipelined")
  - Vector Complex Integer Unit (like regular CIU)
  - Vector Floating Point Unit
- Tied to 32 128-bit architectural registers and 16 vector rename registers

PPC 970

Like, whoa. This one's really different.
Instructions are split into IOPs (Internal OPerations?)
- Flashback to x86 CISC instruction decoding into micro-ops
- Most 970 instructions translate to a single IOP
- instrs that translate to exactly two IOPs are "cracked"
- instrs that translate to more than two IOPs are "millicoded"
IOPs are issued in groups of 5 at a time
- Pros: reorder buffer only has to keep track of 20 groups to put in the right order, not 100 individual IOPs (P4 has problems with this -- has to watch 196 instrs)
- Cons: There are lots of rules for how a group is put together. Combined with the limit of 20 groups in flight at a time, this means you often have NOPs in the pipeline
- Notes: if your compiler is wicked good at ordering instructions properly so that the processor can put them into groups quite nicely, you can probably get really great performance out of it. But your compiler would have to be, like, magic. Typical compiler conditions will probably get "okay" performance.
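
A toy model of group formation to show where the NOPs come from. The only rule modeled here is my simplification: all IOPs of one instruction must land in the same group. The real 970 has more rules than this, which the notes only summarize as "lots", so treat the output as illustrative.

```python
GROUP_SIZE = 5     # IOPs per group
MAX_GROUPS = 20    # groups in flight at once


def form_groups(iop_counts):
    """Pack instructions into groups of 5 IOP slots, padding with NOPs.

    iop_counts: IOPs per instruction in program order (1 = simple,
    2 = "cracked"). Each entry must be <= GROUP_SIZE.
    """
    if not iop_counts:
        return []
    groups = [[]]
    for i, n in enumerate(iop_counts):
        if len(groups[-1]) + n > GROUP_SIZE:
            # Doesn't fit: pad out the current group and start a new one.
            groups[-1] += ["nop"] * (GROUP_SIZE - len(groups[-1]))
            groups.append([])
        groups[-1] += [f"i{i}"] * n
    groups[-1] += ["nop"] * (GROUP_SIZE - len(groups[-1]))
    return groups


print(GROUP_SIZE * MAX_GROUPS)             # 100 IOPs in flight at most
for g in form_groups([1, 1, 1, 1, 2]):
    print(g)
# ['i0', 'i1', 'i2', 'i3', 'nop']
# ['i4', 'i4', 'nop', 'nop', 'nop']
```

Six IOPs of real work take ten slots here. Reordering so the cracked instruction comes earlier would pack them tighter, which is the "magic compiler" point above.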
Branch prediction is amazing, though it really has to be -- we're already throwing bubbles into the pipeline with the IOPs
=== Integer Units ===
- 2 IUs that each do "most" ops
  - only one can do fixed-point (integer) divides
  - only one can do "Special Purpose Register" ops
- Some weird latency issues: Independent integer IOPs can execute one per cycle, BUT dependent IOPs must be separated by a dead cycle, giving a latency of 2. This is a higher latency than past PPC processors for integer ops.
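
The dead-cycle rule in numbers, for a single integer unit. The function name is mine and the model assumes nothing else gets in the way (no group-formation NOPs, no cache misses).

```python
def integer_chain_cycles(n_ops, dependent, latency=2):
    """Cycles for n integer IOPs on one unit.

    Independent IOPs start one per cycle; dependent IOPs each wait
    out the full latency of the one before.
    """
    if n_ops == 0:
        return 0
    if dependent:
        return n_ops * latency
    return n_ops + latency - 1


print(integer_chain_cycles(10, dependent=False))  # 11
print(integer_chain_cycles(10, dependent=True))   # 20
```

A chain of dependent adds runs at roughly half speed, so code with long dependency chains feels the extra latency most.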
=== Condition Register Logical Unit (CRU) ===
- Stuff dealing with the condition register gets its own unit in the 970. (haven't we seen this before, though? ars seems to think not.)
=== Load-Store Units ===
- There are two of them, presumably to keep up with and feed the deeper pipeline and faster clock
=== Branch Unit ===
- (re-fetch this data from the first 970 article)
=== Floating Point Units ===
- Just like the G4e, but with two of them...
- Fully pipelined except for FP divide, which stalls both FPUs
- 32 architectural registers, 48 rename registers
=== Vector Units ===
- minor differences. arithmetic units (SIU, CIU, FPU) combined in a Vector ALU
- Vector Permute queue takes 2 instrs/cycle (1 into each of a pair of queues)
- VALU queue takes 2 instrs/cycle (1 into each of a pair of queues)
- Vector stuff sort of slapped on haphazardly in a corner of the die
=== ISA ===
- 32 general purpose
- 48 rename registers
=== Do more research on... ===
- The frontside bus, which apparently runs at half the clock rate (1/4 the clock rate, double-pumped), and something called DDR
- IBM's high-end processor the POWER4 -- apparently the 970's big brother
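
How "1/4 the clock rate, double-pumped" works out to half. The 2000MHz core clock is a made-up round number to make the arithmetic easy, and the function name is mine.

```python
def fsb_effective_mhz(core_mhz, divider=4, transfers_per_clock=2):
    """Effective bus rate: core clock / divider, times transfers per bus clock."""
    return core_mhz / divider * transfers_per_clock


core = 2000                                                 # hypothetical core clock in MHz
print(fsb_effective_mhz(core, transfers_per_clock=1))       # 500.0 (actual bus clock)
print(fsb_effective_mhz(core))                              # 1000.0 (double-pumped)
```

Double-pumped means data moves on both edges of the bus clock, so a bus clocked at a quarter of the core speed transfers at half of it.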