Presentation Transcript (added October 28, 2010)

Fundamentals of Computer Design : 1
Introduction
Classes of Computers
Defining Computer Architecture
Trends in Technology
Trends in Cost, Power and Performance
Dependability
Measuring and Reporting Performance
Quantitative Principles of Computer Design
Conclusion
CDA 4102/5155 – Fall 2010
Copyright © 2010 Prabhat Mishra

Microprocessor Performance Trends : 2
Performance relative to the VAX-11/780, measured with SPECint benchmarks
Gains are due to technological advances and to advances in architecture

Design Complexity : 3
Exponential growth: the number of transistors doubles every couple of years

Technology and Demand : 4
Technology: the number of transistors doubles every 2 years
Demand: communication, multimedia, entertainment, networking

Who wants to be a Millionaire : 5
You double your investment every day
Starting investment: one cent
How long does it take to become a millionaire?
  20 days
  27 days
  37 days
  365 days
  Lifetime ++

Who wants to be a Millionaire : 6
You double your investment every day
Starting investment: one cent
How long does it take to become a millionaire?
  20 days: one million cents
  27 days: millionaire
  37 days: billionaire
Believe it or not
  Each of us has more than a million ancestors in the last 20 generations
  Transistor counts double every 18 months
  This growth rate is hard to imagine

Time-to-Market : 7
Time required to develop a product to the point it can be sold to customers
Market window: the period during which the product would have its highest sales
Average time-to-market constraint is about 8 months
Delays can be costly

Losses due to Delayed Market Entry : 8
Simplified revenue model
  Product life = 2W, peak at W
  Time of market entry defines a triangle, representing market penetration
  Triangle area equals revenue
Loss
  Difference between the on-time and delayed triangle areas (shaded region)

Examples: Delayed Market Entry : 9
Area = 1/2 * base * height
  On-time = 1/2 * 2W * W
  Delayed = 1/2 * (W-D+W) * (W-D)
Percentage revenue loss = (D(3W-D) / (2W^2)) * 100%
[Figure: revenues ($) versus time for on-time and delayed entry; the market rises to peak revenue at W and falls to zero at 2W; the delayed entry starts at D and reaches a lower peak revenue]
Some examples
  Lifetime 2W = 52 weeks, delay D = 4 weeks: loss = 4*(3*26 - 4) / (2*26^2) = 22%
  Lifetime 2W = 52 weeks, delay D = 10 weeks: loss = 10*(3*26 - 10) / (2*26^2) = 50%
Delays are costly!

Design Productivity Gap : 10
1981: a leading-edge chip required 100 man-months
  10,000 transistors / 100 transistors/month
2002: a leading-edge chip requires 30K man-months
  150,000,000 transistors / 5000 transistors/month
Designer cost increased from $1M to $300M

Mythical Man-Month : 11
In theory, adding designers reduces project completion time
In reality, productivity per designer decreases due to the complexities of team management and communication overhead
At some point, adding designers can actually lengthen project completion time!
Some examples
  1M transistors; one designer produces 5000 transistors/month
  Each additional designer reduces each designer's productivity by 100 transistors/month

Computer Market : 13
Desktop
  Driven by price-performance
  $1000 - $10,000 [$100 - $1000 per processor]
Server
  Throughput, availability, scalability
  $10K - $10M [$200 - $2000 per processor]
Embedded Systems
  Application specific
  Low cost, low power, real-time performance
  $10 - $100,000 [$0.20 - $200 per processor]

An Example Embedded System : 14
[Figure: digital camera block diagram]

Components of Embedded Systems : 15
[Figure: block diagram with analog, digital, and analog stages; blocks labeled memory, coprocessors, controllers, converters, processor, interface, software (application programs), ASIC]

Computer Architecture : 17
Definition
  Instruction set architecture (ISA): the programmer (user) view
  Implementation
    Organization: CPU, memory, buses, I/O
    Hardware: logic design, packaging technology
Computer design must meet
  Functional requirements
  Area, performance, cost, power goals
Optimize, evaluate, and explore to find the best possible architecture
Consider other factors
  Time-to-market, technology trend, safety, reliability, …

Instruction-Set Architecture (ISA) : 18
An instruction set architecture is a specification of a standardized programmer-visible interface to hardware, comprising:
A set of instructions (instruction types and operations)
  With associated argument fields, assembly syntax, binary encoding
A set of named storage locations and addressing
  Registers, memory, … programmer-accessible caches?
A set of addressing modes (ways to name locations)
Types and sizes of operands
Control flow instructions
Often an I/O interface (usually memory-mapped)

Example: MIPS : 19
Programmable storage
  2^32 x bytes
  31 x 32-bit GPRs (R0=0)
  32 x 32-bit FP regs (paired DP)
  HI, LO, PC
[Figure: register file r0 (always 0), r1 … r31, plus PC, lo, hi]
Data types? Format? Addressing modes?
Arithmetic logical
  ADD, ADDU, SUB, SUBU, AND, OR, XOR, NOR, SLT, SLTU
  ADDI, ADDIU, SLTI, SLTIU, ANDI, ORI, XORI, LUI
  SLL, SRL, SRA, SLLV, SRLV, SRAV
Memory access
  LB, LBU, LH, LHU, LW, LWL, LWR
  SB, SH, SW, SWL, SWR
Control
  J, JAL, JR, JALR
  BEQ, BNE, BLEZ, BGTZ, BLTZ, BGEZ, BLTZAL, BGEZAL

MIPS64 Instruction Format : 20

Overview of This Course : 21
Understanding the design techniques, machine structures, technology factors, and evaluation methods that determine the form of computers in the 21st century

Technology Trend : 23
Component
  IC technology: transistors per chip increase 55% per year
  DRAM: density increases 40-60% per year
  Magnetic disk: density increases 100% per year
  Network: Ethernet took 10 years to go from 10 Mb to 100 Mb, and 5 years to go from 100 Mb to 1 Gb
Scaling of performance, wires and power
  Feature size: 10 micron in 1971; 0.18 micron in 2001, …
  Microprocessor organization improvement
  Wiring delay
  Power issue: ~100 watts for a 2 GHz Pentium 4

Disk Comparison : 24
CDC Wren I, 1983
  3600 RPM
  0.03 GBytes capacity
  Tracks/inch: 800
  Bits/inch: 9550
  Three 5.25” platters
  Bandwidth: 0.6 MBytes/sec
  Latency: 48.3 ms
  Cache: none
Seagate 373453, 2003
  15000 RPM (4X)
  73.4 GBytes (2500X)
  Tracks/inch: 64000 (80X)
  Bits/inch: 533,000 (60X)
  Four 2.5” platters (in 3.5” form factor)
  Bandwidth: 86 MBytes/sec (140X)
  Latency: 5.7 ms (8X)
  Cache: 8 MBytes

Memory Comparison : 25
1980 DRAM (asynchronous)
  0.06 Mbits/chip
  64 K transistors, 35 mm^2
  16-bit data bus per module, 16 pins/chip
  13 Mbytes/sec
  Latency: 225 ns
  (no block transfer)
2000 Double Data Rate Synchronous (clocked) DRAM
  256.00 Mbits/chip (4000X)
  256 M transistors, 204 mm^2
  64-bit data bus per DIMM (4X), 66 pins/chip
  1600 Mbytes/sec (120X)
  Latency: 52 ns (4X)
  Block transfers (page mode)

LAN Comparison : 26
Ethernet 802.3
  Year of standard: 1978
  10 Mbits/s link speed
  Latency: 3000 µsec
  Shared media
  Coaxial cable
Ethernet 802.3ae
  Year of standard: 2003
  10,000 Mbits/s link speed (1000X)
  Latency: 190 µsec (15X)
  Switched media
  Category 5 copper wire
[Figure: coaxial cable cross-section showing copper core, insulator, braided outer conductor, plastic covering]

CPU Comparison : 27
1982 Intel 80286
  12.5 MHz
  2 MIPS (peak)
  Latency 320 ns
  134,000 xtors, 47 mm^2
  16-bit data bus, 68 pins
  Microcode interpreter, separate FPU chip
  (no caches)
2001 Intel Pentium 4
  1500 MHz (120X)
  4500 MIPS (peak) (2250X)
  Latency 15 ns (20X)
  42,000,000 xtors, 217 mm^2
  64-bit data bus, 423 pins
  3-way superscalar, dynamic translation to RISC, superpipelined (22 stages), out-of-order execution
  On-chip 8KB data cache, 96KB instruction trace cache, 256KB L2 cache

Bandwidth vs. Latency : 28
Improvement from the first to the last milestone, given as (latency, bandwidth):
  Processor: '286, '386, '486, Pentium, Pentium Pro, Pentium 4 (21x, 2250x)
  Ethernet: 10 Mb, 100 Mb, 1000 Mb, 10000 Mb/s (16x, 1000x)
  Memory module: 16-bit plain DRAM, page mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
  Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
Latency improvement is about 10X while bandwidth improvement is 100X to 1000X.
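The improvement factors quoted on the comparison slides can be re-derived from the first and last data points of each technology. The sketch below is only an illustration: the dictionary layout and the function name `improvement_factors` are assumptions made here, and the numbers are copied from the disk, memory, LAN, and CPU comparison slides.

```python
# First and last milestone values copied from the comparison slides.
# Each entry is (old, new); units cancel in the ratios, so they are not tracked.
MILESTONES = {
    "Disk":   {"latency": (48.3, 5.7),     "bandwidth": (0.6, 86.0)},
    "Memory": {"latency": (225.0, 52.0),   "bandwidth": (13.0, 1600.0)},
    "LAN":    {"latency": (3000.0, 190.0), "bandwidth": (10.0, 10000.0)},
    "CPU":    {"latency": (320.0, 15.0),   "bandwidth": (2.0, 4500.0)},
}


def improvement_factors(entry):
    """Return (latency improvement, bandwidth improvement) for one technology."""
    old_latency, new_latency = entry["latency"]
    old_bandwidth, new_bandwidth = entry["bandwidth"]
    # Lower latency is better, so the improvement is old / new.
    # Higher bandwidth is better, so the improvement is new / old.
    return old_latency / new_latency, new_bandwidth / old_bandwidth


results = {name: improvement_factors(entry) for name, entry in MILESTONES.items()}

for name, (latency_gain, bandwidth_gain) in results.items():
    print(f"{name:6s} latency {latency_gain:6.1f}x   bandwidth {bandwidth_gain:7.1f}x")
```

The computed ratios round to the factors on the "Bandwidth vs. Latency" slide, for example roughly 21x latency and 2250x bandwidth for the processor.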
Summary: Technology Trends : 29
For disk, LAN, memory, and microprocessor, bandwidth improves by the square of the latency improvement
  In the time that bandwidth doubles, latency improves by no more than 1.2X to 1.4X
The lag is probably even larger in real systems, as bandwidth gains are multiplied by replicated components
  Multiple processors in a cluster or even in a chip
  Multiple disks in a disk array
  Multiple memory modules in a large memory
  Simultaneous communication in a switched LAN
HW and SW developers should innovate assuming “Latency Lags Bandwidth”
  If everything improves at the same rate, then nothing really changes
  When rates vary, real innovation is required

Power and Energy : 32
For CMOS, the traditional dominant energy consumption has been in switching transistors, called dynamic power
For mobile devices, energy is the better metric
For a fixed task, slowing the clock rate (frequency switched) reduces power, but not energy
Capacitive load is a function of the number of transistors connected to an output and of the technology, which determines the capacitance of wires and transistors
Dropping voltage helps both power and energy; supply voltages have dropped from 5V to 1V
Turn off the clock to save energy and dynamic power

Example : 33
Suppose a 15% reduction in voltage results in a 15% reduction in frequency. What is the impact on dynamic power?
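One way to work the example is with the usual CMOS relation, dynamic power = 1/2 * capacitive load * voltage^2 * frequency switched. That formula does not survive in the transcript, so treat it as an assumption here; the function name `dynamic_power_ratio` is also illustrative. Because only the ratio of new to old power matters, the capacitive load and the constant cancel.

```python
def dynamic_power_ratio(voltage_scale: float, frequency_scale: float) -> float:
    """Return new/old dynamic power for scaled voltage and frequency.

    Assumes dynamic power is proportional to voltage^2 * frequency and that
    the capacitive load is unchanged.
    """
    return voltage_scale ** 2 * frequency_scale


# A 15% reduction means both quantities are scaled to 0.85 of their old value.
ratio = dynamic_power_ratio(voltage_scale=0.85, frequency_scale=0.85)
print(f"new dynamic power / old dynamic power = {ratio:.2f}")
```

Under these assumptions the ratio is 0.85^3, so dynamic power drops to about 61% of its original value.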
Static Power : 34
Because leakage current flows even when a transistor is off, static power is now important too
Leakage current increases in processors with smaller transistor sizes
Increasing the number of transistors increases power even if they are turned off
In 2006, the goal for leakage is 25% of total power consumption; high-performance designs are at 40%
Very low power systems even gate the voltage to inactive modules to control loss due to leakage

How to Reduce Power Consumption
Multicore
  One core with frequency 2 GHz
  Two cores with 1 GHz frequency (each)
    Same performance
    Two 1 GHz cores require half the power/energy
      Power is proportional to freq^2
      A 1 GHz core needs one-fourth the power of a 2 GHz core
New challenges: performance
  How to utilize the cores
  It is difficult to find enough parallelism in programs to keep all these cores busy

Reducing Energy Consumption
[Figure: Pentium and Crusoe running the same multimedia application; source www.transmeta.com]
Infrared cameras (FLIR) can be used to detect thermal distribution

DRAM Pricing : 37
© 2003 Elsevier Science (USA). All rights reserved.

Processor Pricing (Intel Pentium III) : 38
© 2003 Elsevier Science (USA). All rights reserved.
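As a rough check of the multicore comparison on the "How to Reduce Power Consumption" slide, the sketch below applies the slide's rule that power scales with the square of frequency. The function name `relative_power` and the 2 GHz baseline parameter are illustrative choices, and the model ignores static power and any voltage change.

```python
def relative_power(freq_ghz: float, base_freq_ghz: float = 2.0) -> float:
    """Power of one core relative to a core running at base_freq_ghz.

    Uses the slide's simplification that power is proportional to frequency^2.
    """
    return (freq_ghz / base_freq_ghz) ** 2


one_fast_core = relative_power(2.0)        # the 2 GHz baseline
one_slow_core = relative_power(1.0)        # a single 1 GHz core
two_slow_cores = 2 * relative_power(1.0)   # two 1 GHz cores together

print(f"one 2 GHz core : {one_fast_core:.2f}")
print(f"one 1 GHz core : {one_slow_core:.2f}")
print(f"two 1 GHz cores: {two_slow_cores:.2f}")
```

Under this model a 1 GHz core draws one-fourth of the baseline power and two of them draw one-half, which matches the slide, provided the workload can actually be split across both cores.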
Silicon wafer and microprocessor die : 39
This 8-inch wafer contains 564 MIPS64 R20K processors (0.18 micron)
Intel Pentium 4 microprocessor

Cost of an Integrated Circuit (IC) : 40
Cost of IC = (die + packaging + test) / yield
  See the examples on pages 22-24
Cost of a system
  Processor board: ~37%
  I/O device: ~37%
  Cabinet: ~6%
  Software: ~20%

Cost : 41
Unit cost
  Monetary cost of manufacturing one unit, excluding NRE cost
NRE cost (Non-Recurring Engineering cost)
  The one-time monetary cost of designing the system
Total cost
  NRE cost + unit cost * # of units
Per-product cost
  total cost / # of units = (NRE cost / # of units) + unit cost
Example
  NRE = $2000, unit cost = $100
  For 10 units
    total cost = $2000 + 10 * $100 = $3000
    per-product cost = $2000/10 + $100 = $300

NRE versus Unit Cost : 42
[Figure: unit cost versus volume for two options: high NRE with low production cost, and low NRE with high production cost]

Cost versus Price : 43

Define and Quantify Dependability : 45
How to decide when a system is operating properly?
Infrastructure providers now offer Service Level Agreements (SLAs) to guarantee that their networking or power service will be dependable
Systems alternate between 2 states of service with respect to an SLA:
  1. Service accomplishment, where the service is delivered as specified in the SLA
  2. Service interruption, where the delivered service is different from the SLA
Failure = transition from state 1 to state 2
Restoration = transition from state 2 to state 1

Dependability : 46
Module reliability = measure of continuous service accomplishment (or time to failure)
Two metrics:
  Mean Time To Failure (MTTF) measures reliability
    Failures In Time (FIT) = 1/MTTF, the rate of failures
    Traditionally reported as failures per billion hours of operation
  Mean Time To Repair (MTTR) measures service interruption
Mean Time Between Failures (MTBF) = MTTF + MTTR
Module availability measures service as it alternates between the 2 states of accomplishment and interruption (a number between 0 and 1, e.g. 0.9)
Module availability = MTTF / (MTTF + MTTR)

Example : 47
If modules have exponentially distributed lifetimes (the age of a module does not affect its probability of failure), the overall failure rate is the sum of the failure rates of the modules
Calculate FIT and MTTF for 10 disks (1M hour MTTF per disk), 1 disk controller (0.5M hour MTTF), and 1 power supply (0.2M hour MTTF):
  Failure rate = 10 * (1/1,000,000) + 1/500,000 + 1/200,000 = (10 + 2 + 5) / 1,000,000 = 17 / 1,000,000 per hour = 17,000 FIT
  MTTF = 1,000,000,000 / 17,000 ≈ 59,000 hours

Performance Measurement : 49
Performance metric: execution time
  Increasing performance decreases execution time
Other metrics
  Wall-clock time, response time, elapsed time
  CPU time: user or system
We will focus on CPU performance, i.e., user CPU time on an unloaded system

Choosing Programs to Evaluate Performance : 50
Real applications
  For example: gcc compiler, Microsoft Word
Modified (or scripted) applications
  For example: remove I/O, script to simulate interactive behavior
Kernels
  For example: Livermore loops, Linpack
Toy benchmarks
  For example: Sieve of Eratosthenes, quicksort
Synthetic benchmarks
  For example: Whetstone, Dhrystone

Benchmark Suites : 51
Desktop
  New: SPEC CPU2006
  SPEC CPU2000: 11 integer, 14 floating-point
  SPECviewperf, SPECapc: graphics benchmarks
Server
  SPEC CPU2000: running multiple copies
  SPECSFS: for NFS performance
  SPECWeb: Web server benchmark
  TPC-x: measure transaction-processing, queries, and decision making database applications
Embedded Processor
  EEMBC: EDN Embedded Microprocessor Benchmark Consortium

SPEC CPU Benchmarks : 52

Reporting Performance : 53
Performance should be reproducible
  Description of the machine and compiler flags
  Report for both baseline and optimized version
Source code modifications
  Not allowed in SPEC benchmarks
  Allowed but difficult or impossible
    TPC-C using Oracle or SQL database
  Allowed in supercomputer benchmarks
    Modify or re-write algorithms
  Hand-coding in assembly for EEMBC benchmark

Comparing Performance : 54
Arithmetic Mean = (1/n) * (Time_1 + Time_2 + ... + Time_n)
What is the mixture of programs in the workload?

Comparing Performance : 55
Weighted Arithmetic Mean = Weight_1 * Time_1 + Weight_2 * Time_2 + ... + Weight_n * Time_n
What if programs are fixed and inputs are not?

Comparing Performance : 56
Geometric Mean = (Ratio_1 * Ratio_2 * ... * Ratio_n)^(1/n)
Each execution time ratio is normalized to a base machine.
The reference machine is not important.
The arithmetic means differ depending on which machine is used as the basis, but the geometric means are the same.
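The claim about reference machines can be checked with a small experiment. The sketch below uses made-up execution times for two programs on two machines; the values, the machine names, and the helper names are all hypothetical and chosen only for illustration. It normalizes the times to each machine in turn and compares the arithmetic and geometric means of the ratios.

```python
import math

# Hypothetical execution times in seconds for programs P1 and P2.
TIMES = {
    "A": [1.0, 1000.0],
    "B": [10.0, 100.0],
}


def normalized(times, reference):
    """Divide every machine's times by the reference machine's times."""
    ref = times[reference]
    return {m: [t / r for t, r in zip(ts, ref)] for m, ts in times.items()}


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    # exp of the mean of logs equals the n-th root of the product.
    return math.exp(sum(math.log(v) for v in values) / len(values))


for reference in ("A", "B"):
    ratios = normalized(TIMES, reference)
    for machine in ("A", "B"):
        am = arithmetic_mean(ratios[machine])
        gm = geometric_mean(ratios[machine])
        print(f"reference {reference}, machine {machine}: "
              f"arithmetic {am:.2f}, geometric {gm:.2f}")
```

With these numbers the arithmetic mean of the ratios favors whichever machine is chosen as the reference, while the geometric mean rates the two machines as equal under either reference.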
Geometric mean does not predict execution time

Normalized Execution Times (SPECRatio) : 57
Geometric mean does not predict execution time
  The performance of machines A and B is the same only if program P1 is executed 100 times for every occurrence of program P2
Rewards easy enhancements
  Improving program P3 (2 to 1) counts the same as improving program P4 (1000 to 500)

Performance, Price-Performance (SPEC) : 58

Performance, Price-Performance (TPC-C) : 59

Amdahl’s Law : 61
Speedup = 1 / ((1 - f) + f/n)
Where:
  f is the fraction of the execution time that can be enhanced
  n is the enhancement factor
Example: f = 0.1, n = 10
  Speedup = 1 / (0.9 + 0.01) = 1.1
Make the common case fast
  The performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.

Application of Amdahl’s Law : 62
Amdahl’s law is useful for comparing the overall performance of two design alternatives.
Example: Floating-point (FP) operations consume 50% of the execution time of a graphics application. FP square root (FPSQRT) is used 20% of the time.
  Case 1: improve FPSQRT execution by 10 times
    Speedup = 1 / ((1-0.2) + 0.2/10) = 1.22
  Case 2: improve all FP operations by 1.6 times
    Speedup = 1 / ((1-0.5) + 0.5/1.6) = 1.23
Because FP operations account for more of the execution time, the modest improvement of all FP operations (case 2) gains slightly more than the drastic improvement of FPSQRT (case 1).
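The Amdahl's Law calculations above can be reproduced with a few lines of code. This is a minimal sketch: the function name `amdahl_speedup` and its argument checks are choices made here, and the formula is the one used in the worked examples, Speedup = 1 / ((1 - f) + f/n).

```python
def amdahl_speedup(f: float, n: float) -> float:
    """Overall speedup when a fraction f of execution time is sped up n times."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be a fraction between 0 and 1")
    if n <= 0.0:
        raise ValueError("n must be positive")
    return 1.0 / ((1.0 - f) + f / n)


# The three cases from the slides.
cases = {
    "f = 0.1, n = 10": (0.1, 10.0),
    "FPSQRT 10x faster (f = 0.2)": (0.2, 10.0),
    "all FP 1.6x faster (f = 0.5)": (0.5, 1.6),
}

for label, (f, n) in cases.items():
    print(f"{label}: speedup = {amdahl_speedup(f, n):.2f}")
```

Letting n grow without bound in this function shows the limit of 1 / (1 - f), which is why improving a small fraction of the execution time never pays off much.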
Measuring the Performance : 63
Performance equation
  CPU time = Instruction count x Clock cycle time x CPI
How to compute these parameters
  Known for existing processors
    Clock cycle time
  Use of counters in new processors
    CPI, instruction count
  Simulation for performance analysis
    Profile-based
    Trace-driven
    Execution-driven

CPU Performance Equation : 64
The parameters are interdependent
  Instruction count: ISA and compiler technology
  CPI: organization and ISA
  Cycle time: hardware technology and organization
Many performance-enhancing techniques improve one parameter with small or predictable impacts on the other two.

Example : 65
Parameters:
  Frequency of FP operations (incl. FPSQR) = 25%
  CPI for FP operations = 4; CPI for others = 1.33
  Frequency of FPSQR = 2%; CPI of FPSQR = 20
Compare 2 designs:
  Decrease the CPI of FPSQR to 2
  Decrease the CPI of all FP operations to 2.5

Fallacies and Pitfalls : 67
The relative performance of two processors with the same ISA can be judged by clock rate or by the performance of a single benchmark suite.
  1.7 GHz Pentium 4 relative to 1.0 GHz Pentium III
  © 2003 Elsevier Science (USA). All rights reserved.

Fallacies and Pitfalls : 68
Benchmarks remain valid indefinitely.
  One line in matrix300 (SPEC89) executes 99% of the time
Peak performance tracks observed performance.
The best design is the one that optimizes the primary objective without considering design costs.
Synthetic benchmarks predict performance for real programs.
  Compiler/hardware optimizations can inflate performance
MIPS is an accurate measure for comparing performance among computers
  Consider using FP hardware instead of FP routines.
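Returning to the CPU performance equation example (Example : 65), the two designs can be compared by computing the average CPI of each. The sketch below assumes that instruction count and clock cycle time are the same for all three configurations, so the CPU-time speedup equals the ratio of CPIs. The helper name `average_cpi` and the constant names are illustrative.

```python
def average_cpi(mix):
    """Average CPI for a list of (instruction frequency, CPI) pairs."""
    return sum(freq * cpi for freq, cpi in mix)


# Parameters from the example slide.
FP_FREQ, FP_CPI = 0.25, 4.0
OTHER_CPI = 1.33
FPSQR_FREQ, FPSQR_CPI = 0.02, 20.0

cpi_original = average_cpi([(FP_FREQ, FP_CPI), (1.0 - FP_FREQ, OTHER_CPI)])

# Design 1: only FPSQR gets faster (CPI 20 -> 2); remove the cycles it saves.
cpi_design1 = cpi_original - FPSQR_FREQ * (FPSQR_CPI - 2.0)

# Design 2: every FP operation gets faster (CPI 4 -> 2.5).
cpi_design2 = average_cpi([(FP_FREQ, 2.5), (1.0 - FP_FREQ, OTHER_CPI)])

print(f"original CPI: {cpi_original:.4f}")
print(f"design 1 CPI: {cpi_design1:.4f}, speedup {cpi_original / cpi_design1:.2f}")
print(f"design 2 CPI: {cpi_design2:.4f}, speedup {cpi_original / cpi_design2:.2f}")
```

Under these assumptions design 2 ends up with the slightly lower CPI, so improving all FP operations is marginally better than improving FPSQR alone.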
fundamentals aSGuest73086 Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINT lite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 247 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: October 28, 2010 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Fundamentals of Computer Design : 1 Fundamentals of Computer Design Introduction Classes of Computers Defining Computer Architecture Trends in Technology Trends in Cost, Power and Performance Dependability Measuring and Reporting Performance Quantitative Principles of Computer Design Conclusion CDA 4102/5155 – Fall 2010 Copyright © 2010 Prabhat Mishra Microprocessor Performance Trends : 2 Microprocessor Performance Trends Relative to VAX-11/780 using SpecInt Benchmarks Due to technological advances Due to advances in architecture Design Complexity : 3 Design Complexity Exponential Growth – doubling of transistors every couple of years Technology and Demand : 4 Technology and Demand Technology Demand #of transistors are doubling every 2 years Communication, multimedia, entertainment, networking Who wants to be a Millionaire : 5 Who wants to be a Millionaire You double your investment everyday Starting investment - one cent. How long it takes to become a millionaire 20 days 27 days 37 days 365 days Lifetime ++ Who wants to be a Millionaire : 6 Who wants to be a Millionaire You double your investment everyday Starting investment - one cent. 
How long it takes to become a millionaire 20 days One million cents 27 days Millionaire 37 days Billionaire Believe it or not Each of us have more than a million ancestors in last 20 generations. Doubling transistors every 18 months This growth rate is hard to imagine Time-to-Market : 7 Time-to-Market Time required to develop a product to the point it can be sold to customers Market window Period during which the product would have highest sales Average time-to-market constraint is about 8 months Delays can be costly Losses due to Delayed Market Entry : 8 Losses due to Delayed Market Entry Simplified revenue model Product life = 2W, peak at W Time of market entry defines a triangle, representing market penetration Triangle area equals revenue Loss Difference between on-time and delayed triangle areas (shaded region) Examples: Delayed Market Entry : 9 Examples: Delayed Market Entry Area = 1/2 * base * height On-time = 1/2 * 2W * W Delayed = 1/2 * (W-D+W)*(W-D) Percentage revenue loss (D(3W-D)/2W2)*100% On-time Delayed entry entry Peak revenue Peak revenue from delayed entry Market rise Market fall W 2W Time D On-time Delayed Revenues ($) Some examples Lifetime 2W=52 weeks, delay D=4 weeks Loss = (4*(3*26 –4)/2*262) = 22% Lifetime 2W=52 weeks, delay D=10 weeks Loss = (10*(3*26 –10)/2*262) = 50% Delays are costly! Design Productivity Gap : 10 Design Productivity Gap 1981 leading edge chip required 100 man-months 10,000 transistors / 100 transistors/month 2002 leading edge chip requires 30K man-months 150,000,000 / 5000 transistors/month Designer cost increase from $1M to $300M Mythical Man-Month : 11 Mythical Man-Month In theory, adding designers reduces project completion time In reality, productivity/designer decreases due to complexities of team management and communication overhead At some point, can actually lengthen project completion time! 
Some Examples 1M transistors, one designer = 5000 transistors/month Each additional designer reduces for 100 transistors/month Fundamentals of Computer Design : 12 Fundamentals of Computer Design Introduction Classes of Computers Defining Computer Architecture Trends in Technology Trends in Cost, Power and Performance Dependability Measuring and Reporting Performance Quantitative Principles of Computer Design Conclusion Computer Market : 13 Computer Market Desktop Driven by price-performance $1000 - $10,000 [$100 - $1000 per processor] Server Throughput, availability, scalability $10K - $10M [$200 - $2000 per processor] Embedded Systems Application specific Low cost, low power, real-time performance $10 - $100,000 [$0.20 - $200 per processor] An Example Embedded System : 14 An Example Embedded System Digital Camera Block Diagram Components of Embedded Systems : 15 Components of Embedded Systems Analog Digital Analog Memory Coprocessors Controllers Converters Processor Interface Software (Application Programs) ASIC Fundamentals of Computer Design : 16 Fundamentals of Computer Design Introduction Classes of computers Defining Computer Architecture Trends in Technology Trends in Cost, Power and Performance Dependability Measuring and Reporting Performance Quantitative Principles of Computer Design Conclusion Computer Architecture : 17 Computer Architecture Definition Instruction set architecture (ISA) Programmer (user) View Implementation Organization: CPU, memory, buses, I/O Hardware: logic design, packaging technology Computer design must meet Functional requirements Area, performance, cost, power goals Optimize, evaluate, and explore to find best possible architecture Consider other factors Time-to-market, technology trend, safety, reliability, … Instruction-Set Architecture (ISA) : 18 Instruction-Set Architecture (ISA) An instruction set architecture is a specification of a standardized programmer-visible interface to hardware, comprised of: A set of instructions 
(instruction types and operations) With associated argument fields, assembly syntax, binary encoding. A set of named storage locations and addressing Registers, memory, … programmer-accessible caches? A set of addressing modes (ways to name locations) Types and sizes of operands Control flow instructions Often an I/O interface (usually memory-mapped) Example: MIPS : 19 Example: MIPS 0 r0 r1 ° ° ° r31 PC lo hi Programmable storage 232 x bytes 31 x 32-bit GPRs (R0=0) 32 x 32-bit FP regs (paired DP) HI, LO, PC Data types ? Format ? Addressing Modes? Arithmetic logical ADD, ADDU, SUB, SUBU, AND, OR, XOR, NOR, SLT, SLTU, ADDI, ADDIU, SLTI, SLTIU, ANDI, ORL, XORL, LUI SLL, SRL, SRA, SLLV, SRLV, SRAV Memory Access LB, LBU, LH, LHU, LW, LWL, LWR SB, SH, SW, SWL, SWR Control J, JAL, JR, JALR BEQ, BNE, BLEZ, BGTZ, BLTZ, BGEZ, BLTZAL, BGEZAL MIPS64 Instruction Format : 20 MIPS64 Instruction Format Overview of This Course : 21 Overview of This Course Understanding the design techniques, machine structures, technology factors, evaluation methods that determine the form of computers in 21st century Fundamentals of Computer Design : 22 Fundamentals of Computer Design Introduction Classes of computers Defining Computer Architecture Trends in Technology Trends in Cost, Power and Performance Dependability Measuring and Reporting Performance Quantitative Principles of Computer Design Conclusion Technology Trend : 23 Technology Trend Component IC technology: transistor/chip increases 55% per year DRAM: density increases 40-60% per year Magnetic disk: density increases 100% per year Network: Ethernet from 10 100Mb took 10 years; 100Mb 1Gb in 5 years Scaling of performance, wires and power Feature size: 10 micron in 1971; 0.18 in 2001, … Microprocessor organization improvement Wiring delay Power issue: ~100 watts for 2GHz Pentium 4 Disk Comparison : 24 Disk Comparison CDC Wren I, 1983 3600 RPM 0.03 GBytes capacity Tracks/Inch: 800 Bits/Inch: 9550 Three 5.25” platters Bandwidth: 0.6 
MBytes/sec Latency: 48.3 ms Cache: none Seagate 373453, 2003 15000 RPM (4X) 73.4 GBytes (2500X) Tracks/Inch: 64000 (80X) Bits/Inch: 533,000 (60X) Four 2.5” platters (in 3.5” form factor) Bandwidth: 86 MBytes/sec (140X) Latency: 5.7 ms (8X) Cache: 8 MBytes Memory Comparison : 25 Memory Comparison 1980 DRAM (asynchronous) 0.06 Mbits/chip 64 K transistors, 35 mm2 16-bit data bus per module 16 pins/chip 13 Mbytes/sec Latency: 225 ns (no block transfer) 2000 Double Data Rate Synchronous (clocked) DRAM 256.00 Mbits/chip (4000X) 256 M transistors, 204 mm2 64-bit data bus per DIMM (4X) 66 pins/chip 1600 Mbytes/sec (120X) Latency: 52 ns (4X) Block transfers (page mode) LAN Comparison : 26 LAN Comparison Ethernet 802.3 Year of Standard: 1978 10 Mbits/s link speed Latency: 3000 msec Shared media Coaxial cable Ethernet 802.3ae Year of Standard: 2003 10,000 Mbits/s (1000X)link speed Latency: 190 msec (15X) Switched media Category 5 copper wire Copper core Insulator Braided outer conductor Plastic Covering CPU Comparison : 27 CPU Comparison 1982 Intel 80286 12.5 MHz 2 MIPS (peak) Latency 320 ns 134,000 xtors, 47 mm2 16-bit data bus, 68 pins Microcode interpreter, separate FPU chip (no caches) 2001 Intel Pentium 4 1500 MHz (120X) 4500 MIPS (peak) (2250X) Latency 15 ns (20X) 42,000,000 xtors, 217 mm2 64-bit data bus, 423 pins 3-way superscalar,Dynamic translate to RISC, Superpipelined (22 stage),Out-of-Order execution On-chip 8KB Data caches, 96KB Instr. Trace cache, 256KB L2 cache Bandwidth vs. Latency : 28 Bandwidth vs. Latency Processor: ‘286, ‘386, ‘486, Pentium, Pentium Pro, Pentium 4 (21x,2250x) Ethernet: 10Mb, 100Mb, 1000Mb, 10000 Mb/s (16x,1000x) Memory Module: 16bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x,120x) Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x) Latency improvement is 10X while bandwidth improvement is 100X to 1000X. 
Summary: Technology Trends : 29 Summary: Technology Trends For disk, LAN, memory, and microprocessor, bandwidth improves by square of latency improvement In the time that bandwidth doubles, latency improves by no more than 1.2X to 1.4X Lag probably even larger in real systems, as bandwidth gains multiplied by replicated components Multiple processors in a cluster or even in a chip Multiple disks in a disk array Multiple memory modules in a large memory Simultaneous communication in switched LAN HW and SW developers should innovate assuming “Latency Lags Bandwidth” If everything improves at the same rate, then nothing really changes When rates vary, requires real innovation Fundamentals of Computer Design : 30 Fundamentals of Computer Design Introduction Classes of computers Defining Computer Architecture Trends in Technology Trends in Cost, Power and Performance Dependability Measuring and Reporting Performance Quantitative Principles of Computer Design Conclusion Slide 31: 31 Power and Energy : 32 Power and Energy For CMOS, traditional dominant energy consumption has been in switching transistors, called dynamic power For mobile devices, energy better metric For fixed task, slowing clock rate (frequency switched) reduces power, but not energy Capacitive load, a function of number of transistors connected to output and technology, which determines capacitance of wires and transistors Dropping voltage helps both, dropped from 5V to 1V Turn off clock to save energy & dynamic power Example : 33 Example Suppose 15% reduction in voltage results in a 15% reduction in frequency. What is impact on dynamic power? 
Static Power : 34
Because leakage current flows even when a transistor is off, static power is now important too.
Leakage current increases in processors with smaller transistor sizes.
Increasing the number of transistors increases power even if they are turned off.
In 2006, the goal for leakage was 25% of total power consumption; high-performance designs were at 40%.
Very low power systems even gate the voltage to inactive modules to control loss due to leakage.

How to Reduce Power Consumption : 35
Multicore
  One core with frequency 2 GHz
  Two cores with 1 GHz frequency (each)
  Same performance
  Two 1 GHz cores require half the power/energy
    Power ∝ freq²
    A 1 GHz core needs one-fourth the power of a 2 GHz core.
New challenges: performance
  How to utilize the cores
  It is difficult to find enough parallelism in programs to keep all these cores busy.

Reducing Energy Consumption : 36
[www.transmeta.com] Pentium and Crusoe running the same multimedia application.
Infrared cameras (FLIR) can be used to detect the thermal distribution.

DRAM Pricing : 37
© 2003 Elsevier Science (USA). All rights reserved.

Processor Pricing (Intel Pentium III) : 38
Silicon wafer and microprocessor die : 39
This 8-inch wafer contains 564 MIPS64 R20K processors (0.18 µm process).
Intel Pentium 4 Microprocessor

Cost of an Integrated Circuit (IC) : 40
Cost of IC = (die + packaging + test) / yield
  See examples on pages 22-24.
Cost of a system
  Processor board: ~37%
  I/O devices: ~37%
  Cabinet: ~6%
  Software: ~20%

Cost : 41
Unit cost
  Monetary cost of manufacturing one unit, excluding NRE cost
NRE cost (Non-Recurring Engineering cost)
  The one-time monetary cost of designing the system
Total cost
  NRE cost + unit cost × number of units
Per-product cost
  total cost / number of units = (NRE cost / number of units) + unit cost
Example: NRE = $2000, unit cost = $100
  For 10 units:
  total cost = $2000 + 10 × $100 = $3000
  per-product cost = $2000/10 + $100 = $300

NRE versus Unit Cost : 42
High NRE, low production cost
Low NRE, high production cost
(Figure axes: volume, unit cost)

Cost versus Price : 43

Fundamentals of Computer Design : 44
Introduction; Classes of Computers; Defining Computer Architecture; Trends in Technology; Trends in Cost, Power and Performance; Dependability; Measuring and Reporting Performance; Quantitative Principles of Computer Design; Conclusion

Define and Quantify Dependability : 45
How to decide when a system is operating properly?
Infrastructure providers now offer Service Level Agreements (SLAs) to guarantee that their networking or power service will be dependable.
Systems alternate between 2 states of service with respect to an SLA:
  1. Service accomplishment, where the service is delivered as specified in the SLA
  2. Service interruption, where the delivered service is different from the SLA
Failure = transition from state 1 to state 2
Restoration = transition from state 2 to state 1

Dependability : 46
Module reliability = measure of continuous service accomplishment (or time to failure)
Two metrics:
  Mean Time To Failure (MTTF) measures reliability
    Failures In Time (FIT) = 1/MTTF, the rate of failures, traditionally reported as failures per billion hours of operation
  Mean Time To Repair (MTTR) measures service interruption
Mean Time Between Failures (MTBF) = MTTF + MTTR
Module availability measures service as it alternates between the 2 states of accomplishment and interruption (a number between 0 and 1, e.g. 0.9)
  Module availability = MTTF / (MTTF + MTTR)

Example : 47
If modules have exponentially distributed lifetimes (the age of a module does not affect its probability of failure), the overall failure rate is the sum of the failure rates of the modules.
Calculate FIT and MTTF for 10 disks (1M hour MTTF per disk), 1 disk controller (0.5M hour MTTF), and 1 power supply (0.2M hour MTTF):
  Failure rate = 10 × (1/1,000,000) + 1/500,000 + 1/200,000 = (10 + 2 + 5)/1,000,000 = 17/1,000,000 per hour = 17,000 FIT
  MTTF = 1,000,000,000 / 17,000 ≈ 59,000 hours

Fundamentals of Computer Design : 48
Introduction; Classes of Computers; Defining Computer Architecture; Trends in Technology; Trends in Cost, Power and Performance; Dependability; Measuring and Reporting Performance; Quantitative Principles of Computer Design; Conclusion

Performance Measurement : 49
Performance metric: execution time
  Increasing performance decreases execution time
Other metrics
  Wall-clock time, response time, elapsed time
  CPU time: user or system
We will focus on CPU performance, i.e., user CPU time on an unloaded system.

Choosing Programs to Evaluate
Performance : 50
Real applications
  For example: gcc compiler, Microsoft Word
Modified (or scripted) applications
  For example: remove I/O, script to simulate interactive behavior
Kernels
  For example: Livermore loops, Linpack
Toy benchmarks
  For example: Sieve of Eratosthenes, quicksort
Synthetic benchmarks
  For example: Whetstone, Dhrystone

Benchmark Suites : 51
Desktop
  New: SPEC CPU2006
  SPEC CPU2000: 11 integer, 14 floating-point
  SPECviewperf, SPECapc: graphics benchmarks
Server
  SPEC CPU2000: running multiple copies
  SPECSFS: for NFS performance
  SPECWeb: Web server benchmark
  TPC-x: measure transaction-processing, queries, and decision making database applications
Embedded processor
  EEMBC: EDN Embedded Microprocessor Benchmark Consortium

SPEC CPU Benchmarks : 52

Reporting Performance : 53
Performance should be reproducible
  Description of the machine and compiler flags
  Report for both baseline and optimized version
Source code modifications
  Not allowed in SPEC benchmarks
  Allowed but difficult or impossible: TPC-C using Oracle or SQL database
  Allowed in supercomputer benchmarks: modify or re-write algorithms
  Hand-coding in assembly for EEMBC benchmark

Comparing Performance : 54
Arithmetic Mean: What is the mixture of programs in the workload?

Comparing Performance : 55
Weighted Arithmetic Mean: What if programs are fixed and inputs are not?

Comparing Performance : 56
Geometric Mean: Execution time ratio is normalized to a base machine.
  The reference machine is not important.
  The arithmetic means differ depending on which machine is used as the basis, but the geometric means are the same.
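The claim that the choice of reference machine does not affect a geometric-mean comparison can be checked numerically. The Python sketch below uses made-up execution times for two machines and two candidate reference machines; all numbers and names are hypothetical and are not taken from the slides.

```python
import math


def geometric_mean(values):
    """n-th root of the product, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def arithmetic_mean(values):
    return sum(values) / len(values)


# Hypothetical execution times in seconds for three programs.
times = {
    "A": [1.0, 1000.0, 20.0],
    "B": [10.0, 100.0, 20.0],
    "ref1": [2.0, 500.0, 40.0],
    "ref2": [50.0, 50.0, 5.0],
}


def normalized(machine, reference):
    """Execution-time ratios of `machine` relative to `reference`."""
    return [t / r for t, r in zip(times[machine], times[reference])]


for ref in ("ref1", "ref2"):
    a, b = normalized("A", ref), normalized("B", ref)
    gm_ratio = geometric_mean(a) / geometric_mean(b)
    am_ratio = arithmetic_mean(a) / arithmetic_mean(b)
    print(f"{ref}: GM(A)/GM(B) = {gm_ratio:.3f}, AM(A)/AM(B) = {am_ratio:.3f}")
```

With this data the geometric-mean ratio is identical under both references, while the arithmetic mean of normalized times favours A under one reference and B under the other.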
Geometric mean does not predict execution time.

Normalized Execution Times (SPECRatio) : 57
Geometric mean does not predict execution time
  Performance of machines A and B is the same only if program P1 is executed 100 times for every occurrence of program P2
Rewards easy enhancements
  Improving program P3 (2 to 1) counts the same as improving program P4 (1000 to 500).

Performance, Price-Performance (SPEC) : 58

Performance, Price-Performance (TPC-C) : 59

Fundamentals of Computer Design : 60
Introduction; Classes of Computers; Defining Computer Architecture; Trends in Technology; Trends in Cost, Power and Performance; Dependability; Measuring and Reporting Performance; Quantitative Principles of Computer Design; Conclusion

Amdahl’s Law : 61
Speedup = 1 / ((1 - f) + f/n)
Where:
  f is the fraction of the execution time that can be enhanced
  n is the enhancement factor
Example: f = 0.1, n = 10 gives Speedup = 1.1
Make the common case fast.
The performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.

Application of Amdahl’s Law : 62
Amdahl’s law is useful for comparing the overall performance of two design alternatives.
Example: Floating-point (FP) operations consume 50% of the execution time of a graphics application. FP square root (FPSQRT) is used 20% of the time.
  Case 1: improve FPSQRT execution by 10 times
    Speedup = 1 / ((1 - 0.2) + 0.2/10) = 1.22
  Case 2: improve all FP operations by 1.6 times
    Speedup = 1 / ((1 - 0.5) + 0.5/1.6) = 1.23
Because FP operations account for a larger share of the execution time, the modest improvement of case 2 gains slightly more than the drastic FPSQRT improvement of case 1.
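The three speedups quoted on the Amdahl’s Law slides can be reproduced with a few lines of Python. This is a minimal sketch; the function name `amdahl_speedup` is an illustrative choice.

```python
def amdahl_speedup(f, n):
    """Overall speedup when a fraction f of execution time is sped up n times."""
    return 1.0 / ((1.0 - f) + f / n)


print(f"f = 0.1, n = 10:            {amdahl_speedup(0.1, 10):.2f}")
print(f"FPSQRT 10x (f = 0.2):       {amdahl_speedup(0.2, 10):.2f}")
print(f"all FP ops 1.6x (f = 0.5):  {amdahl_speedup(0.5, 1.6):.2f}")
```

Letting n grow without bound shows the limit the law imposes: with f = 0.2 the speedup can never exceed 1 / (1 - 0.2) = 1.25, however fast FPSQRT becomes.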
Measuring the Performance : 63
Performance equation
  CPU time = Instruction count × CPI × Clock cycle time
How to compute these parameters
  Known for existing processors: clock cycle time
  Use of counters in new processors: CPI, instruction count
Simulation for performance analysis
  Profile-based
  Trace-driven
  Execution-driven

CPU Performance Equation : 64
The parameters are interdependent
  Instruction count: ISA and compiler technology
  CPI: organization and ISA
  Cycle time: hardware technology and organization
Many performance-enhancing techniques improve one parameter with small or predictable impacts on the other two.

Example : 65
Parameters:
  Frequency of FP operations (incl. FPSQR) = 25%
  CPI for FP operations = 4; CPI for others = 1.33
  Frequency of FPSQR = 2%; CPI of FPSQR = 20
Compare 2 designs:
  1. Decrease CPI of FPSQR to 2
  2. Decrease CPI of all FP operations to 2.5

Fundamentals of Computer Design : 66
Introduction; Classes of Computers; Defining Computer Architecture; Trends in Technology; Trends in Cost, Power and Performance; Dependability; Measuring and Reporting Performance; Quantitative Principles of Computer Design; Conclusion

Fallacies and Pitfalls : 67
Fallacy: The relative performance of two processors with the same ISA can be judged by clock rate or by the performance of a single benchmark suite.
  Figure: 1.7 GHz Pentium 4 relative to 1.0 GHz Pentium III

Fallacies and Pitfalls : 68
Fallacy: Benchmarks remain valid indefinitely.
  One line in matrix300 (SPEC89) executes 99% of the time
Fallacy: Peak performance tracks observed performance.
Fallacy: The best design is the one that optimizes the primary objective without considering design costs.
Fallacy: Synthetic benchmarks predict performance for real programs.
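  Compiler and hardware optimizations can artificially inflate performance on these benchmarks
Fallacy: MIPS is an accurate measure for comparing performance among computers.
  Consider using FP hardware instead of software FP routines: the software version executes many more (simpler) instructions, so it can show a higher MIPS rating while taking longer to run.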
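The FP hardware versus FP routines case can be made concrete with the performance equation. The Python sketch below uses purely hypothetical instruction counts, CPI values, and clock rate (none of them come from the slides) to show how a MIPS rating and execution time can disagree.

```python
def exec_time_s(instructions, cpi, clock_hz):
    """CPU time = instruction count * CPI * clock cycle time."""
    return instructions * cpi / clock_hz


def mips(instructions, seconds):
    """Millions of instructions executed per second."""
    return instructions / (seconds * 1e6)


CLOCK_HZ = 100e6  # hypothetical 100 MHz clock for both versions

# Hypothetical program: software FP routines run many simple instructions,
# FP hardware runs fewer instructions that each take more cycles.
versions = {
    "software FP routines": {"instructions": 100_000_000, "cpi": 1.0},
    "FP hardware": {"instructions": 10_000_000, "cpi": 4.0},
}

for name, v in versions.items():
    t = exec_time_s(v["instructions"], v["cpi"], CLOCK_HZ)
    print(f"{name}: {t:.2f} s, {mips(v['instructions'], t):.0f} MIPS")
```

With these assumed numbers the FP hardware version finishes in 0.4 s against 1.0 s, yet it rates only 25 MIPS against 100 MIPS, so the MIPS ranking is the reverse of the execution-time ranking.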