Future of Microprocessors

Future of Microprocessors : 

David Patterson
University of California, Berkeley
June 2001

Outline : 

A 30-year history of microprocessors
  Four generations of innovation
High-performance microprocessor drivers:
  Memory hierarchies
  Instruction-level parallelism (ILP)
Where are we and where are we going?
Focus on desktop/server microprocessors vs. embedded/DSP microprocessors

Microprocessor Generations : 

First generation: 1971-78
  Behind the power curve (16-bit, <50k transistors)
Second generation: 1979-85
  Becoming “real” computers (32-bit, >50k transistors)
Third generation: 1985-89
  Challenging the “establishment” (Reduced Instruction Set Computer/RISC, >100k transistors)
Fourth generation: 1990-
  Architectural and performance leadership (64-bit, >1M transistors, Intel/AMD translate into RISC internally)

In the beginning (8-bit) Intel 4004 : 

Intel 4004
  First general-purpose, single-chip microprocessor
  Shipped in 1971
  8-bit architecture, 4-bit implementation
  2,300 transistors
  Performance < 0.1 MIPS (million instructions per second)
8008: 8-bit implementation in 1972
  3,500 transistors
  First microprocessor-based computer (Micral)
    Targeted at laboratory instrumentation
    Mostly sold in Europe
All chip photos in this talk courtesy of Michael W. Davidson and The Florida State University

1st Generation (16-bit) Intel 8086 : 

Introduced in 1978
  Performance < 0.5 MIPS
New 16-bit architecture
  “Assembly language” compatible with 8080
  29,000 transistors
  Includes memory protection, support for floating-point coprocessor
In 1981, IBM introduces the PC
  Based on the 8088, an 8-bit bus version of the 8086

2nd Generation (32-bit) Motorola 68000 : 

Major architectural step in microprocessors:
  First 32-bit architecture (initial 16-bit implementation)
  First flat 32-bit address
    Support for paging
  General-purpose register architecture
    Loosely based on PDP-11 minicomputer
First implementation in 1979
  68,000 transistors
  < 1 MIPS (million instructions per second)
Used in Apple Mac and in Sun, Silicon Graphics, & Apollo workstations

3rd Generation: MIPS R2000 : 

Several firsts:
  First (commercial) RISC microprocessor
  First microprocessor to provide integrated support for instruction & data cache
  First pipelined microprocessor (sustains 1 instruction/clock)
Implemented in 1985
  125,000 transistors
  5-8 MIPS (million instructions per second)

4th Generation (64 bit) MIPS R4000 : 

First 64-bit architecture
Integrated caches
  On-chip
  Support for off-chip, secondary cache
Integrated floating point
Implemented in 1991:
  Deep pipeline
  1.4M transistors
  Initially 100 MHz
  > 50 MIPS
Intel translates 80x86/Pentium X instructions into RISC internally

Key Architectural Trends : 

Increase performance at 1.6x per year (2X/1.5yr)
  True from 1985 to present
Combination of technology and architectural enhancements
  Technology provides faster transistors (proportional to 1/lithographic feature size) and more of them
  Faster transistors lead to higher clock rates
  More transistors (“Moore’s Law”):
    Architectural ideas turn transistors into performance
    Responsible for about half the yearly performance growth
Two key architectural directions
  Sophisticated memory hierarchies
  Exploiting instruction-level parallelism
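
As a quick check of the growth figures on this slide, the sketch below converts "2X every 1.5 years" into a per-year rate and compounds it over the 15-year span the talk quotes for ILP techniques. The function names are illustrative and not part of the talk.

```python
def annual_growth(factor: float, period_years: float) -> float:
    """Convert 'factor every period_years' into an equivalent per-year factor."""
    return factor ** (1.0 / period_years)


def cumulative_growth(factor: float, period_years: float, span_years: float) -> float:
    """Total improvement after span_years at 'factor every period_years'."""
    return factor ** (span_years / period_years)


# 2X every 1.5 years is about 1.59X per year, which the slide rounds to 1.6X.
per_year = annual_growth(2.0, 1.5)
print(f"2X every 1.5 years = {per_year:.2f}X per year")

# Fifteen years at that pace is ten doublings, i.e. 1024X (the "1000X" in the talk).
total = cumulative_growth(2.0, 1.5, 15.0)
print(f"After 15 years: {total:.0f}X")
```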

Memory Hierarchies : 

Caches: hide latency of DRAM and increase bandwidth
  CPU-DRAM access gap has grown by a factor of 30-50!
Trend 1: Increasingly large caches
  On-chip: from 128 bytes (1984) to 100,000+ bytes
  Multilevel caches: add another level of caching
    First multilevel cache: 1986
    Secondary cache sizes today: 128,000 B to 16,000,000 B
    Third-level caches: 1998
Trend 2: Advances in caching techniques:
  Reduce or hide cache miss latencies
    Early restart after cache miss (1992)
    Nonblocking caches: continue during a cache miss (1994)
  Cache-aware combos: computers, compilers, code writers
    Prefetching: instruction to bring data into cache early
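
To illustrate why cache-aware code matters, here is a minimal sketch of a direct-mapped cache that counts misses for two traversal orders over the same data. The cache geometry (64 lines of 32 bytes), the 16 KB array, and all names are assumptions chosen for illustration; they do not describe any processor in the talk.

```python
def count_misses(addresses, num_lines=64, line_bytes=32):
    """Count misses for a byte-address trace in a direct-mapped cache.

    The default geometry (64 lines x 32 bytes = 2 KB) is an illustrative assumption.
    """
    tags = [None] * num_lines          # one stored tag per cache line
    misses = 0
    for addr in addresses:
        block = addr // line_bytes     # memory block containing this byte
        index = block % num_lines      # the only line this block may occupy
        tag = block // num_lines       # distinguishes blocks sharing that line
        if tags[index] != tag:
            misses += 1
            tags[index] = tag          # fill the line on a miss
    return misses


WORD = 4      # bytes per word (assumed)
N = 4096      # words touched: 16 KB of data, 8x the assumed cache size

# Same set of words, two visiting orders.
sequential = [i * WORD for i in range(N)]
strided = [i * WORD for start in range(8) for i in range(start, N, 8)]

print("sequential misses:", count_misses(sequential))
print("strided misses:   ", count_misses(strided))
```

Under these assumptions the sequential order misses once per 32-byte line, while the stride of 8 words lands on a new line at every access and the data set is too large for lines to survive between passes, so every access misses.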

Exploiting Instruction Level Parallelism (ILP) : 

ILP is the implicit parallelism among instructions (programmer not aware)
Exploited by
  Overlapping execution in a pipeline
  Issuing multiple instructions per clock
    Superscalar: uses dynamic issue decision (HW driven)
    VLIW: uses static issue decision (SW driven)
1985: simple microprocessor pipeline (1 instr/clock)
1990: first static multiple-issue microprocessors
1995: sophisticated dynamic schemes
  Determine parallelism dynamically
  Execute instructions out of order
  Speculative execution depending on branch prediction
“Off-the-shelf” ILP techniques yielded a 15-year path of 2X performance every 1.5 years => 1000X faster!
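
The difference between single issue, in-order multiple issue, and out-of-order issue can be sketched with a toy scheduler. This is a simplified model under stated assumptions, not a description of any real pipeline: every instruction takes one cycle, results become usable in the following cycle, the out-of-order window is unlimited, and branches are ignored. The six-instruction program and the name `schedule` are hypothetical.

```python
def schedule(deps, width, in_order=True):
    """Return the number of cycles needed to issue every instruction.

    deps[i] lists earlier instructions whose results instruction i needs.
    Assumes unit latency and that dependencies only point backwards.
    """
    n = len(deps)
    issue_cycle = [None] * n
    issued = 0
    cycle = 0
    while issued < n:
        slots = width
        for i in range(n):
            if slots == 0:
                break
            if issue_cycle[i] is not None:
                continue                       # already issued
            ready = all(
                issue_cycle[d] is not None and issue_cycle[d] < cycle
                for d in deps[i]
            )
            if ready:
                issue_cycle[i] = cycle
                issued += 1
                slots -= 1
            elif in_order:
                break                          # stall behind the blocked instruction
        cycle += 1
    return cycle


# Hypothetical program: two independent load/add chains, a combine, and a spare load.
#   0: a = load x    1: b = a + 1    2: c = load y
#   3: d = c + 1     4: e = b + d    5: f = load z
program = [[], [0], [], [2], [1, 3], []]

print("1-wide, in order:    ", schedule(program, width=1))
print("2-wide, in order:    ", schedule(program, width=2))
print("2-wide, out of order:", schedule(program, width=2, in_order=False))
```

In this model the three configurations need 6, 4, and 3 cycles. Three cycles is also the length of the longest dependence chain (0, 1, 4), so wider issue alone could not go lower for this program.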

Where have all the transistors gone? : 

Intel Pentium III (10M transistors)
  Superscalar (multiple instructions per clock cycle)
  Branch prediction (predict outcome of decisions)
  3 levels of cache
  Out-of-order execution (executing instructions in a different order than the programmer wrote them)

Diminishing Return On Investment : 

Until recently:
  Microprocessor effective work per clock cycle (instructions per clock) goes up by ~ square root of number of transistors
  Microprocessor clock rate goes up as lithographic feature size shrinks
With >4 instructions per clock, microprocessor performance increases even less efficiently
Chip-wide wires no longer scale with technology
  They get relatively slower than gates (proportional to (1/scale)^3)
  More complicated processors have longer wires
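
The square-root rule of thumb on this slide can be put into numbers. The sketch below takes the rule at face value (instructions per clock grow as the square root of the transistor count) and shows how work per clock per transistor falls as a design grows. The function names and the sample ratios are illustrative assumptions.

```python
import math


def relative_ipc(transistor_ratio: float) -> float:
    """IPC relative to a baseline design, under the ~sqrt(transistors) rule of thumb."""
    return math.sqrt(transistor_ratio)


def relative_ipc_per_transistor(transistor_ratio: float) -> float:
    """Work per clock per transistor, relative to the same baseline."""
    return relative_ipc(transistor_ratio) / transistor_ratio


for ratio in (2, 4, 16, 100):
    print(
        f"{ratio:>4}x transistors -> {relative_ipc(ratio):5.2f}x IPC, "
        f"{relative_ipc_per_transistor(ratio):.2f}x IPC per transistor"
    )
```

Under this rule, quadrupling the transistor budget only doubles the work per clock, so each transistor contributes half as much as before.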

Moore’s Law vs. Common Sense? : 

[Figure: RISC II die shown next to an Intel MPU die]
Scaled 32-bit, 5-stage RISC II: 1/1000th of current MPU in die size or transistors (1/4 mm²)

New view: ClusterOnaChip (CoC) : 

Use several simple processors on a single chip:
  Performance goes up linearly in number of transistors
  Simpler processors can run at faster clocks
  Less design cost/time, less time-to-market risk (reuse)
Inspiration: Google
  Search engine for the world: 100M/day
  Economical, scalable building block: PC cluster; today 8000 PCs, 16000 disks
  Advantages in fault tolerance, scalability, cost/performance
32-bit MPU as the new “Transistor”
  “Cluster on a chip” with 1000s of processors enables amazing MIPS/$, MIPS/watt for cluster applications
  MPUs combined with dense memory + system-on-a-chip CAD
30 years ago the Intel 4004 used 2300 transistors: when 2300 32-bit RISC processors on a single chip?
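
A rough way to frame the closing question is to count how far today's transistor budget is from holding 2300 simple cores. The sketch below uses the talk's figure that a scaled RISC II is about 1/1000th of a current MPU. It is deliberately naive: it ignores memory, interconnect, and I/O, and the names are hypothetical.

```python
import math

# From the talk: a scaled RISC II is about 1/1000th of a current MPU,
# so roughly 1000 such cores fit in one current transistor budget.
SIMPLE_CORES_PER_CURRENT_DIE = 1000


def die_budget_for(cores: int) -> float:
    """Transistor budget, in units of a current MPU, to hold `cores` simple processors."""
    return cores / SIMPLE_CORES_PER_CURRENT_DIE


def doublings_needed(cores: int) -> float:
    """Doublings of the transistor budget until `cores` processors fit (cores must be > 0)."""
    return max(0.0, math.log2(die_budget_for(cores)))


target = 2300
print(f"{target} cores need {die_budget_for(target):.1f}x the current budget")
print(f"that is about {doublings_needed(target):.1f} doublings of transistor count")
```

With these assumptions, 2300 cores need 2.3 times the current budget, or a little over one doubling of transistor count.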

VIRAM-1 Integrated Processor/Memory : 

Microprocessor
  256-bit media processor (vector)
  14 MBytes DRAM
  2.5-3.2 billion operations per second
  2 W at 170-200 MHz
  Industrial-strength compiler
280 mm² die area
  18.72 x 15 mm
  ~200 mm² for memory/logic
  DRAM: ~140 mm²
  Vector lanes: ~50 mm²
Technology: IBM SA-27E
  0.18 µm CMOS
  6 metal layers (copper)
Transistor count: >100M
Implemented by 6 Berkeley graduate students
Thanks to
  DARPA: funding
  IBM: donate masks, fab
  Avanti: donate CAD tools
  MIPS: donate MIPS core
  Cray: compilers
  MIT: FPU
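
The VIRAM-1 figures can be turned into the efficiency metrics the talk argues for. The sketch below derives operations per clock and operations per watt from the numbers on this slide. Pairing the low throughput with the low clock rate, and applying the 2 W figure to both ends of the range, are assumptions; the function names are illustrative.

```python
def ops_per_cycle(ops_per_sec: float, clock_hz: float) -> float:
    """Average operations completed per clock cycle."""
    return ops_per_sec / clock_hz


def gops_per_watt(ops_per_sec: float, watts: float) -> float:
    """Billions of operations per second delivered per watt."""
    return ops_per_sec / watts / 1e9


POWER_W = 2.0                       # slide: 2 W (assumed to hold across the clock range)
low = (2.5e9, 170e6)                # assumed pairing: 2.5 Gops/s at 170 MHz
high = (3.2e9, 200e6)               # assumed pairing: 3.2 Gops/s at 200 MHz

for ops, clock in (low, high):
    print(
        f"{clock / 1e6:.0f} MHz: {ops_per_cycle(ops, clock):.1f} ops/cycle, "
        f"{gops_per_watt(ops, POWER_W):.2f} Gops/s per watt"
    )

# The quoted die dimensions agree with the quoted area: 18.72 mm x 15 mm = 280.8 mm^2.
print(f"die area: {18.72 * 15:.1f} mm^2")
```

On these assumptions the chip sustains roughly 15 to 16 operations per clock, which is how a 200 MHz part reaches billions of operations per second at about 2 W.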

Concluding Remarks : 

A great 30-year history and a challenge for the next 30!
  Not a wall in performance growth, but a slowing down
  Diminishing returns on silicon investment
But need to use right metrics. Not just raw (peak) performance, but:
  Performance per transistor
  Performance per Watt
Possible new direction?
  Consider true multiprocessing?
  Key question: Could multiprocessors on a single piece of silicon be much easier to use efficiently than today’s multiprocessors?
(Thanks to John Hennessy@Stanford, Norm Jouppi@Compaq for most of these slides)