The PPC Processor: History
POWER (Performance Optimization With Enhanced RISC)
- IBM RISC processor for mainframes and servers.
68000 (68K)
- Motorola processor which previously formed the core of Apple's desktop computing line
"Apple needed a CPU for its personal computers that would be both cutting-edge and backwards compatible with the 68K." - arstechnica 1
PPC 601
- Introduction date: March 14, 1994
- Process: 0.60 micron
- Transistor Count: 2.8 million
- Die size: 121mm²
- Clock speed at introduction: 60-80MHz
- Cache sizes: 32KB unified L1
- First appeared in: Power Macintosh 6100/60
- "Instructions are all the same size; the 601's instruction fetch logic doesn't have the instruction alignment headaches that plague x86 designs, which means that the fetch hardware can be simpler and faster." - arstechnica 1
=== Pipeline ===
- Fetch
into an 8-entry Instruction Queue, used to facilitate branch detection/prediction.
branch folding - if a branch will be taken no matter what, replace the branch instruction with the instruction at the branch target. Saves 2 cycles -- the branch that gets deleted, and the bubble that follows most branch instructions.
- Decode/Dispatch
- Execute
to one of three different execution units: the integer unit, the floating-point unit, and the branch unit.
- Writeback
Performs: Integer and floating-point load-address calculations, Integer and floating-point store-address calculations, Integer and floating-point load-data operations, Integer store-data operations
- "Cramming all of these load-store functions into the 601's single integer ALU didn't help the chip's integer performance, but it was good enough to keep up with the Pentium in this area despite the fact that the Pentium had two integer ALUs. I imagine that most of this integer performance parity came from the 601's huge 32K unified L1 cache (compare the Pentium's 8K split L1)." - arstechnica 1
- Multicycle integer operations NOT pipelined (but most are 1 cycle anyway, so not so bad)
- Impressive.
- All single-precision ops pipelined AND most double-precision ops too
- "For single-precision operations (with the exception of divides) and most double-precision operations, the 601's floating-point hardware could turn out one instruction per cycle with a two-cycle latency." - arstechnica 1
- Since integer unit handled all memory traffic, and since int and fp instructions tend to hang together rather than mix, in a big stretch of fp ops the sole purpose of integer unit would be to feed the fp unit data from memory -- ars calls this a "dedicated load-store unit"
- 32K cache ROCKED HARDCORE with floating point
- Static branch predictor (?)
- holdover from POWER RISC single processor -- basically a mini-processor with its own instruction set, reg file, ROM, etc
- time crunches happen in the Real World, too, apparently
PPC 603 and 603e
- Stats, direct copy from arstechnica 1:
- Introduction date: May 1, 1995 (603); October 16, 1995 (603e)
- Process: 0.50 micron (603); 0.50 micron (603e)
- Transistor count: 1.6 million (603); 2.6 million (603e)
- Die size: 81mm² (603); 98mm² (603e)
- Clock speed at introduction: 75MHz (603); 100MHz (603e)
- L1 cache size: 16K split L1 (603); 32K split L1 (603e)
- First appeared in: Macintosh Performa 5200CD (603); Macintosh Performa 6300CD (603e)
Low power: Apple needed a PowerBook chip. The first 603 was a bit shitty due to its small cache; Apple didn't start using it in PowerBooks until the 603e came out and could actually do the 68K emulation reasonably.
=== 603e: Integer Unit ===
=== 603e: Floating-point Unit ===
- Worse performance than the 601 (3 fp instructions every 4 cycles?), BUT:
- this is where the fp multiply-add combo instruction comes in that the carch book talks about. "A core DSP operation"; thus Apple begins its legacy of keeping the audio/graphics people happy happy happy.
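To see why a combined multiply-add counts as a core DSP operation, here is a minimal Python sketch of a dot product, the inner loop of a FIR filter. Every loop step is one multiply followed by one add, which is the pair of operations the multiply-add instruction combines. The function name and the sample data are made up for illustration, and Python itself does not issue a fused instruction here.

```python
def dot_product(coeffs, samples):
    """Accumulate sum(c * s) with one multiply-add per element."""
    acc = 0.0
    for c, s in zip(coeffs, samples):
        # A step of the shape a * b + c is the pattern that a single
        # fp multiply-add instruction covers in hardware.
        acc = c * s + acc
    return acc

# Hypothetical 3-tap filter applied to three samples.
result = dot_product([0.5, 0.25, 0.25], [2.0, 4.0, 8.0])
```

Audio filters and graphics transforms are mostly long runs of this loop, so halving the instruction count per step pays off directly.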
- Takes over address calculating (previously in 601's Integer Unit)
Handles condition register updates (previously in 601's Integer Unit)
- "Reservation Systems"
- "Completion Unit"
- Branch unit
- "Instruction window"
- Register renaming
PPC 604 and 604e
- Stats, totally stolen from arstechnica 1:
- Introduction date: May 1, 1995 (604); July 19, 1996 (604e)
- Process: 0.50 micron (604); 0.35 micron (->0.25 micron) (604e)
- Transistor count: 3.6 million (604); 5.1 million (604e)
- Die size: 197mm² (604); 148mm² (604e)
- Clock speed at introduction: 120MHz (604); 180-200MHz (->350MHz) (604e)
- L1 cache size: 32K split L1 (604); 64K split L1 (604e)
- First appeared in: Power Mac 9500/120 (604); PowerComputing PowerTower Pro 200 (604e; Power Mac 9500/180 on August 7, 1996)
whitepaper put out by *Motorola*?
=== 604 ===
==== Pipeline ====
- Fetch
- Decode
- Dispatch (*new*)
- Execute
- Complete (*new*)
- Write-back
- Two simple integer units (SIUs)
- Do simple, single-cycle ops
- Faster
- One complex integer unit (CIU)
- Do multi-cycle ops
- Slower
- Almost entirely pipelined, including doubles
Now handles condition register (previously in 603e's System Unit)
Now uses dynamic branch prediction rather than "new branches not taken; old branches taken" rule
- 512-entry branch history table, 2 bits per entry
- Longer pipeline means more penalty for a mispredicted branch -- hence the better branch unit
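A rough Python sketch of how a 512-entry, 2-bit-per-entry branch history table can work. It assumes the common 2-bit saturating-counter scheme, an initial state of "weakly not taken", and indexing by the branch address with the low two bits dropped; none of those details come from these notes, and the class and function names are hypothetical.

```python
class BranchHistoryTable:
    """Hypothetical model of a 512-entry, 2-bit-per-entry BHT."""

    def __init__(self, entries=512):
        self.entries = entries
        # Counter values 0..3; 0-1 predict not taken, 2-3 predict taken.
        # Starting at 1 (weakly not taken) is an assumption.
        self.counters = [1] * entries

    def _index(self, pc):
        # Instructions are 4 bytes, so drop the low two address bits
        # before indexing (assumed indexing scheme).
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_mispredictions(bht, pc, outcomes):
    """Run a sequence of actual outcomes for one branch; count misses."""
    misses = 0
    for taken in outcomes:
        if bht.predict(pc) != taken:
            misses += 1
        bht.update(pc, taken)
    return misses


# A loop branch: taken 9 times, then falls through, run twice.
loop_outcomes = ([True] * 9 + [False]) * 2
misses = count_mispredictions(BranchHistoryTable(), 0x1000, loop_outcomes)
```

The two bits mean a single loop exit does not flip the prediction, so the second pass through the loop only misses on the exit itself.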
- "Re-order buffer(ROB)" and equivalency to "Completion queue"
- doubled instruction and data cache sizes to 32K each
- added Condition Register Unit
- Condition Register finally has its own unit for control
- Other units no longer tied up with CR calcs (which happen often)
- "expanded capabilities"
PowerPC 750
- Introduction date: November 10, 1997
- Process: 0.25 micron
- Transistor Count: 6.35 million
- Die size: 167mm²
- Clock speed at introduction: 233-266MHz
- Cache sizes: 64KB split L1 (32KB instructions, 32KB data), 512KB L2
- First appeared in: Power Macintosh G3/233
- Very like the 603e
- 4-stage pipeline
Could not do vector calcs; Intel and AMD had extended their instruction sets to do SIMD. Motorola addresses this when it develops the G3 into an embedded/media workstation chip.
=== Integer units ===
- Two of them:
- simple(SIU), does all int ops except mult and div
- complex(CIU), does all int ops
- 3-cycle latency (like 603) on single
also 3-cycle latency on doubles, except mult and mult-add, which take 4
- no bubble after 3rd instr (improvement)
- Just like 603
- just like 603
Vast improvement: A 64-entry branch target instruction cache stores not the target addresses of recently taken branches, but the target instructions. Then the processor doesn't have to wait for the instruction fetch logic, and can put fewer bubbles in its pipeline per branch.
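The idea of caching target instructions rather than target addresses can be sketched in Python as below. The 64-entry size is from the note above; the class name, the oldest-first eviction policy, and the string stand-ins for instructions are assumptions made only for illustration.

```python
from collections import OrderedDict


class BranchTargetInstructionCache:
    """Hypothetical model: branch address -> instructions at its target."""

    def __init__(self, entries=64):
        self.entries = entries
        self.lines = OrderedDict()

    def lookup(self, branch_pc):
        # A hit hands back the target instructions immediately, so the
        # front end does not wait on a fetch from the target address.
        return self.lines.get(branch_pc)

    def fill(self, branch_pc, target_instrs):
        if branch_pc in self.lines:
            self.lines.move_to_end(branch_pc)
        elif len(self.lines) >= self.entries:
            # Evict the oldest entry (assumed replacement policy).
            self.lines.popitem(last=False)
        self.lines[branch_pc] = tuple(target_instrs)


btic = BranchTargetInstructionCache()
btic.fill(0x1000, ["add r3,r3,r4"])
hit = btic.lookup(0x1000)
miss = btic.lookup(0x2000)
```

On a miss the processor falls back to the normal fetch path and pays the usual bubbles; on a hit the instruction is already in hand.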
- 6-entry reorder buffer (opposed to 603's 5)
- Has half the rename registers of the 604, a smaller reorder buffer, and fewer reservation stations, but has a much shorter pipeline (and the cooler branch prediction), which makes up for it
- SIMD. What is it?
PPC 7400
- AKA MPC7400
- Essentially same as 750, but with SIMD
- Lower power version: 7410
- Stats:
- Introduction date: August 31, 1999
- Process: 0.25 micron
- Transistor Count: 10.5 million
- Die size: 83mm²
- Clock speed at introduction: 350-450MHz
- Cache sizes: 32KB L1 (instructions), 32KB L1 (data), 512KB L2
- First appeared in: Power Macintosh G4/400
- 4 stages
->>> kept the clock speed down for a long time
- Improvement: now has a full double-precision FPU
- All operations on both single and double precision numbers have 3-cycle latency
AltiVec!
- Instructions operate on 128-bit chunks of data
- Two subunits: Vector ALU (VALU) and Vector Permute Unit (VPU)
- Added 32 128-bit vector registers
- Added 6 vector rename registers
- Vector arithmetic
- Vector logic
- Permute and shift
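A small Python sketch of what "operating on 128-bit chunks" means in practice: one 128-bit register holds several narrower elements, and a vector instruction applies the same operation to every lane at once. The lane-wise add below wraps each 32-bit lane independently. The function names are invented for this example, and a Python list stands in for a vector register.

```python
REGISTER_BITS = 128


def lanes_per_register(element_bits):
    """How many elements of a given width fit in one 128-bit register."""
    return REGISTER_BITS // element_bits


def vector_add_u32(a, b):
    """Lane-wise add of two 4 x 32-bit vectors, wrapping each lane."""
    lanes = lanes_per_register(32)
    if len(a) != lanes or len(b) != lanes:
        raise ValueError("expected exactly 4 lanes of 32 bits")
    mask = (1 << 32) - 1
    # One vector instruction does all four adds; a carry out of one
    # lane never spills into its neighbour.
    return [(x + y) & mask for x, y in zip(a, b)]


total = vector_add_u32([1, 2, 3, 0xFFFFFFFF], [10, 20, 30, 1])
```

The same register can instead be treated as 8 lanes of 16 bits or 16 lanes of 8 bits, which is where the big wins for audio and pixel data come from.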
- has bigger completion queue(8) than the 750(6)
PowerPC 7450
- Could have easily moved on to being called the G5, but Apple stuck with G4. Known as G4e.
- Deeper pipeline, better VALU
- Stats:
- Introduction date: January 9, 2001
- Process: 0.18 micron
- Transistor Count: 33 million
- Die size: 106mm²
- Clock speed at introduction: 667-733MHz
- Cache sizes: 32KB L1 (instructions), 32KB L1 (data)
- 256KB L2, 512KB-2MB L3 cache
- First appeared in: Power Macintosh G4/667
- Fetch-1
- Fetch-2
- 4 instrs per clock cycle if in L1 cache; delay of 9 cycles if have to go to L2
- Decode/dispatch
- 12-entry instruction queue
- dispatches up to 3 at a time into "issue queues"
- Issue (*new*)
- eliminates stall from previous processors if reservation stations were busy
General Issue Queue - 6-entry, can accept up to 3 instrs per cycle. Out-of-order execution issues bottom three instructions to any of three integer units or the load-store unit.
Vector Issue Queue - 4-entry, can accept up to 2 instrs per cycle. In-order execution issues bottom two instructions to any of 4 vector exec units.
Floating-point Issue Queue - 1-entry, accepts 1 instr per cycle. Executes 1 instr to the FPU.
- Execute
- Instrs pass from reservation units into execution, and are executed. Go figure.
- Complete
- Write-back (Commit)
- Complete and WB must reorder effects of instructions back into their order of issuance -- the user needs to think they happened in the order they were coded.
- static and dynamic
- 2048-entry BHT
128-entry BTIC, also stores first four instructions starting at each branch target
Basically, deeper pipeline => crappier branch penalties, => more awesome branch prediction strategies.
- 4 of them: 3 fast SIUs, 1 slow CIU
- SIU mostly single-cycle(some exceptions)(1 reservation station), CIU takes four cycles or more(2 reservation stations)
- instrs updating the Condition Register have an extra pipeline stage, "finish" (doesn't cause hazards or delays, just latency -- uses forwarding properly)
- 32 general purpose registers
- 16 rename registers (only for on-chip logic; not visible to user)
- Single pipeline for all instrs
- single and double precision ops take 5 cycles
One instr per cycle, except:
FPU is not fully pipelined. If first four of 5 pipeline stages are occupied, will stall on next cycle -- so only 4 instrs every 5 cycles can be executed. "[Certain floating-point code can run pretty badly]" - arstechnica
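The "4 instrs every 5 cycles" figure can be turned into quick arithmetic. This Python sketch assumes the simplest model consistent with the note above: a run of back-to-back FP instructions loses one issue cycle after every 4 instructions. The function names and the `burst` parameter are made up for this example.

```python
def issue_cycles(n_instrs, burst=4):
    """Cycles needed to issue n back-to-back FP instructions when one
    stall cycle follows every `burst` instructions (assumed model)."""
    if n_instrs <= 0:
        return 0
    stalls = (n_instrs - 1) // burst
    return n_instrs + stalls


def sustained_throughput(burst=4):
    """Long-run instructions per cycle: `burst` issued per burst + 1 cycles."""
    return burst / (burst + 1)


cycles_for_40 = issue_cycles(40)
g4e_rate = sustained_throughput()
```

Passing `burst=3` gives 0.75, which matches the "3 fp instructions every 4 cycles" note for the 603.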
AltiVec adds 162 instructions to the PPC set (just FYI. this isn't new info, I think)
Four independent AltiVec units, 3 fully pipelined:
Vector Permute (dwisott)
- Vector Simple Integer Unit (mostly single-cycle stuff; like regular SIU)(single pipeline stage; can only handle one instr at a time; "not pipelined")
- Vector Complex Integer Unit (like regular CIU)
- Vector Floating Point Unit
- Tied to 32 128-bit architectural registers and 16 vector rename registers
PPC 970
- Like, whoa. This one's really different.
- Instructions are split into IOPs (Internal OPerations?)
- Flashback to x86 CISC instruction decoding into micro-ops
- Most 970 instructions translate to a single IOP
- instrs that translate to exactly two IOPs are "cracked"
- instrs that translate to more than two IOPs are "millicoded"
- IOPs are issued in groups of 5 at a time
- Pros: reorder buffer only has to keep track of 20 things to put in the right order, not 100 (P4 has problems with this -- has to watch 196 instrs)
- Cons: Lots of rules for how a group is put together, combined with the limit of 20 groups in flight at a time, mean that you often have NOPs in the pipeline
- Notes: if your compiler is wicked good at ordering instructions properly so that the processor can put them into groups quite nicely, you can probably get really great performance out of it. But your compiler would have to be, like, magic. Typical compiler conditions will probably get "okay" performance.
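A toy Python sketch of group formation, to show where the NOPs come from. It uses only the numbers in these notes (groups of 5, 20 groups in flight) plus one loudly assumed rule: all IOPs from one instruction must land in the same group, so a cracked instruction that does not fit forces padding. The real 970 has more grouping rules than this, and every name here is hypothetical.

```python
GROUP_SIZE = 5              # IOPs per group, per the notes above
MAX_GROUPS_IN_FLIGHT = 20   # groups tracked at once, per the notes above


def form_groups(iop_counts):
    """Pack instructions (given as IOP counts) into fixed-size groups.

    Assumed rule: an instruction's IOPs may not straddle two groups,
    so leftover slots are padded with "nop".
    """
    groups = []
    current = []
    for idx, n_iops in enumerate(iop_counts):
        if not 1 <= n_iops <= GROUP_SIZE:
            raise ValueError("this sketch handles 1..5 IOPs per instruction")
        if len(current) + n_iops > GROUP_SIZE:
            current.extend(["nop"] * (GROUP_SIZE - len(current)))
            groups.append(current)
            current = []
        current.extend([f"i{idx}"] * n_iops)
    if current:
        current.extend(["nop"] * (GROUP_SIZE - len(current)))
        groups.append(current)
    return groups


def nop_slots(groups):
    """Count the padding slots across all groups."""
    return sum(slot == "nop" for group in groups for slot in group)


# Four simple instructions, one cracked (2 IOPs), one more simple one.
groups = form_groups([1, 1, 1, 1, 2, 1])
wasted = nop_slots(groups)
```

Here 7 IOPs occupy 10 slots, which is the kind of waste a well-scheduled instruction stream would avoid.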
- Branch prediction is amazing, though it really has to be -- we're already throwing bubbles into the pipeline with the IOPs
=== Integer Units ===
- 2 IUs that each do "most" ops
- only one can do fixed-point (integer) divides
- only one can do "Special Purpose Register" ops
Some weird latency issues: Independent integer IOPs can execute one per cycle, BUT dependent IOPs must be separated by a dead cycle, giving a latency of 2. This is a higher latency than past PPC processors for integer ops.
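The cost of that dead cycle shows up in a quick calculation. This Python sketch assumes a single integer unit and counts the cycles until the last result is available; the function name and the one-unit simplification are assumptions for illustration only.

```python
def cycles_until_last_result(n_iops, dependent, latency=2):
    """Cycles until the final result of n integer IOPs on one unit.

    Independent IOPs issue every cycle; each dependent IOP must wait
    the full latency for its predecessor's result.
    """
    if n_iops <= 0:
        return 0
    if dependent:
        return n_iops * latency
    return (n_iops - 1) + latency


independent_10 = cycles_until_last_result(10, dependent=False)
dependent_10 = cycles_until_last_result(10, dependent=True)
```

A chain of ten dependent adds takes nearly twice as long as ten independent ones, so instruction scheduling matters more here than on the earlier chips.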
- Stuff dealing with the condition register gets its own unit in the 970. (haven't we seen this before, though? ars seems to think not.)
There are two of them, presumably to keep up with and feed the deeper pipeline and faster clock
- (re-fetch this data from the first 970 article)
- Just like the G4e, but with two of them...
- Fully pipelined except for FP divide, which stalls both FPUs
- 32 architectural registers, 48 rename registers
- minor differences. arithmetic units (SIU, CIU, FPU) combined in a Vector ALU
- Vector Permute queue takes 2 instrs/cycle (1 into each of a pair of queues)
- VALU queue takes 2 instrs/cycle (1 into each of a pair of queues)
- Vector stuff sort of slapped on haphazardly in a corner of the die
- 32 general purpose
- 48 rename registers
- The frontside bus, which apparently runs at an effective half the core clock rate (1/4 the clock rate, double-pumped), and something called DDR
- IBM's high-end processor the POWER4 -- apparently the 970's big brother