Reliability Adaptive Processor Allocation:
  Mary Jane (Janie) Irwin, Yang Ding, Mahmut Kandemir, Padma Raghavan
  Penn State University
  December 2006

Resiliency Issues in CMPs:
  Runaway leakage on idle PEs
  Thermal emergencies
  PE timing errors due to PV + temperature
  Memory errors due to SEUs
  PE faults due to EM, NBTI, ...

Coping Mechanisms:
  Hardware sensors (temperature, reliability)
  Hardware counters
  BIST
  Periodic software test routines
  How can ongoing computations 'recover' most efficiently?
  Turn off faulty PEs

The Problem at Hand:
  How to design a reliable, scalable, power-efficient, fast processor allocation and thread mapping system to handle runtime processor availability changes
  [Figure: program execution going from high availability (8 PEs, 8 threads) to low availability (7 PEs, 8 threads) when 1 PE goes down]

Adapting to Low Availability:
  [Figure: PE and thread diagram]

Another Adaptation:
  [Figure: PE and thread diagram]
  Do these have similar execution times?
  If so, this choice should have much lower EDP.

Performance/EDP Expectation:
  [Figure: execution time vs. # of PEs for a program with 16 threads running on a varying number of PEs]

Architecture Model:
  [Figure: N PEs (P), each shown with its own I$ and D$, and a Shared L2 Cache]
  I$: 64KB L1 Instruction Cache
  D$: 64KB L1 Data Cache
  L2: 4MB or 8MB or 32MB (not currently modeling NUCA L2)
  N PEs
  Technology: 70nm

Metrics:
  Performance
    Cycle count [from SIMICS]
  EDP = Energy * t = Power * t * t
    CPU power numbers [from SimpleScalar & Wattch, industry datasheets]
    Memory (on-chip caches) power numbers
      Leakage energy = (Leakage Power [from CACTI 4.2]) * t
      Dynamic energy = (# of accesses [from SIMICS]) * (dynamic energy per access [from CACTI 4.2])

NAS FT Benchmark:
  Benchmark FT performs the spectral method, first with a 3-D fast Fourier transform (FFT) and then the inverse FFT in an iterative loop

    CALL SETUP
    CALL FFT(1)
    DO ITER=1, NITER
      CALL EVOLVE
      CALL FFT(-1)
      CALL CHECKSUM
    END DO

Experiment Methodology:
  [Figure: 16 processors, 16 threads, then < 16 processors, 16 threads]
  Assume no change in cache: best case
  Flush cache and let it refill on PE switch: worst case
  Copy the cache over on PE switch: practical case

Power Baselines:
  Running FT.W on Sim-Wattch
    CPU Power: ≈ 11.3 watts
  From CACTI 4.2
    64KB L1 Leakage Power: ≈ 0.23 watts
    4MB L2 Leakage Power: ≈ 15 watts
    8MB L2 Leakage Power: ≈ 30 watts
    32MB L2 Leakage Power: ≈ 118 watts
  Technology: 70nm; Associativity: 16; Line Size: 64 Bytes

Slides 13-19:
  [Figures not captured in the transcript. Surviving titles: L2 Dynamic Energy (Slide 15), L1 Dynamic Energy (Slide 16), EDP Results (Slide 19)]

Thank you!

Calculate the Power:
  Without considering dynamic power for caches
  PowerTotal = (PowerCPU + 2*PowerL1)*n + PowerL2
  Linear relationship with n (the number of processors)!
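
As a rough illustration, the PowerTotal formula and the EDP definition from the Metrics slide can be sketched in a few lines of Python. The power constants are the figures quoted on the Power Baselines slide. The function names and the 1-second runtime are assumptions made only for this example, since the measured cycle counts are not part of the transcript.

```python
# Baseline figures quoted on the "Power Baselines" slide (70nm technology).
CPU_POWER_W = 11.3                             # per-PE CPU power, FT.W on Sim-Wattch
L1_LEAKAGE_W = 0.23                            # per 64KB L1 cache, from CACTI 4.2
L2_LEAKAGE_W = {4: 15.0, 8: 30.0, 32: 118.0}   # keyed by L2 size in MB


def total_power(n_pes: int, l2_mb: int) -> float:
    """PowerTotal = (PowerCPU + 2*PowerL1)*n + PowerL2.

    The factor 2 covers the I$ and D$ of each PE. Cache dynamic power
    is ignored, as stated on the slide.
    """
    return (CPU_POWER_W + 2 * L1_LEAKAGE_W) * n_pes + L2_LEAKAGE_W[l2_mb]


def edp(power_w: float, t_s: float) -> float:
    """EDP = Energy * t = Power * t * t."""
    return power_w * t_s * t_s


if __name__ == "__main__":
    # Hypothetical runtime, NOT a measured value from the presentation.
    t_hypothetical_s = 1.0
    for n in (8, 16):
        p = total_power(n, l2_mb=4)
        print(f"{n:2d} PEs, 4MB L2: {p:.2f} W, "
              f"EDP at t={t_hypothetical_s} s: {edp(p, t_hypothetical_s):.2f} J*s")
```

Because power grows linearly with n while EDP grows with the square of t, running on fewer PEs lowers EDP only if the execution time does not increase too much, which is the question the Another Adaptation slide raises.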

Metamorphosis:
  Exploit software-controlled hardware states to adapt the chip's resources 'just-in-time' to meet performance, reliability, and power goals via helper threads to
    monitor & predict program behavior and chip states using performance counters, on-chip sensors, and periodic self-test routines
    determine actions necessary to achieve the desired optimization goals (e.g., power, throughput, failure rate)
    activate the appropriate architectural features (e.g., error control circuitry) and circuit configuration 'knobs' and 'switches'