Presentation Transcript

Design Methodology for Customizable Programmable Processors : Berkeley – Finland Day, Oct. 18, 2002. Prof. Jarmo Takala, Institute of Digital and Computer Systems, Tampere University of Technology, Tampere, Finland. Tel: +358 – 33115 3879; Email: jarmo.takala@tut.fi


Outline : Motivation; Transport Triggered Architecture (TTA); Design Methodology for TTAs; Research at TUT; Conclusions


Motivation : Programmable processors are often used in products that rely on digital signal processing (DSP), because they offer flexibility and ease of verification. Traditionally, DSP processor architectures have been developed based on average performance over a large set of benchmark tasks (~100). User applications often contain only a subset of these benchmarks, so efficiency can be improved by customizing the architecture to the given tasks.


Motivation : DSP applications often have hard real-time constraints: execution should be deterministic and dynamic run-time behaviour should be avoided, so static scheduling lends itself to DSP. Current design complexity calls for an increase in designer productivity, so high-level languages should be used. DSP algorithms contain inherent parallelism, so instruction-level parallelism (ILP) should be maximized.


What is needed? : An application-driven design process with easy design space exploration. Replace hardware complexity with software complexity: a compiler-driven process. Use a templated architecture: flexible heterogeneous function units, modular scalability, orthogonal and compiler friendly.


Choices for Architecture Template : [Figure: ILP architectures. Division of work between compilation time (software) and run time (hardware), starting from the application frontend, for sequential (superscalar), dependence (dataflow), independence (EPIC) and independence (VLIW) architectures. Stages shown: determine dependencies, determine independencies, bind function units, bind datapaths & execute.]


VLIW Gained Popularity in DSP : [Figure: VLIW block diagram. CPU with instruction memory, instruction fetch, instruction decode, function units FU-1 to FU-5, bypassing network, register file and data memory.]


Transport Triggered Architecture : VLIW drawbacks: bypass complexity; register file complexity; register file design restricts FU flexibility; operation encoding format restricts FU flexibility. TTA reverses the programming paradigm [H. Corporaal, 94]: data transport → operation. The instruction set contains only a single instruction: move.


From VLIW to TTA : [Figure: VLIW and TTA organizations side by side. Blocks shown: instruction memory, instruction fetch, instruction decode, function units FU-1 to FU-5, bypassing network, register file and data memory.]


TTA Datapath : [Figure: example TTA datapath. Blocks shown: two integer ALUs, a float ALU, Boolean RF, float RF, integer RF, two load/store units, an immediate unit, an instruction unit, sockets, instruction memory and data memory.]


Function Units : Operands are written to operand registers (O). The operation is performed when the last operand is written to the trigger register (T). The pipeline is synchronized with control bits (C). Standard interface signals: FU_ready, Result_ready, Global_lock. [Figure: FU pipeline with registers O, T and R, logic stages between them, control bits C, and an optional shadow register.]
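
The trigger semantics described on this slide can be sketched in a few lines of Python. This is a toy model only, not the MOVE implementation: the class name `FunctionUnit` and its methods are hypothetical, the unit has a single operand register, and pipelining, control bits and the FU_ready / Result_ready / Global_lock signals are left out.

```python
class FunctionUnit:
    """Toy model of a TTA function unit (hypothetical names, no pipelining)."""

    def __init__(self, name, operation):
        self.name = name
        self.operation = operation  # callable(operand, trigger) -> result
        self.o = 0                  # operand register (O)
        self.r = None               # result register (R), empty until triggered

    def write_operand(self, value):
        # Writing O only latches the value; nothing executes yet.
        self.o = value

    def write_trigger(self, value):
        # Writing T starts the operation on the latched operand.
        self.r = self.operation(self.o, value)

    def read_result(self):
        return self.r


mul = FunctionUnit("mul", lambda a, b: a * b)
mul.write_operand(6)          # r1 -> mul.o
mul.write_trigger(7)          # r2 -> mul.t, the operation fires here
print(mul.read_result())      # mul.r -> r3, prints 42
```

The point of the sketch is that the move into the trigger register, not an opcode, is what causes the multiplication to happen.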


ILP Architectures : [Figure: division of work between compilation time and run time, starting from the application frontend, for sequential (superscalar), dependence (dataflow), independence (EPIC), independence (VLIW) and independence (TTA) architectures. Stages shown: determine dependencies, determine independencies, bind function units, bind datapaths, execute.]


TTA Characteristics: HW : Modular: can be constructed with standard building blocks. Very flexible and scalable: FU functionality can be arbitrary; supports user-defined Special Function Units (SFU). Lower complexity: reduction in the number of register ports; reduced bypass complexity; reduction in bypass connectivity; reduced register pressure; trivial decoding (implies long instructions).


TTA Characteristics: SW : Traditional operation-triggered instruction: mul r1,r2,r3; Transport-triggered instruction: r1 → mul.o; r2 → mul.t; mul.r → r3; or r1 → mul.o, r2 → mul.t; mul.r → r3; Resembles dataflow and time-stationary coding.
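
As a rough illustration of the mapping on this slide, the sketch below expands one operation-triggered instruction into the three moves of its transport-triggered form. The helper names `to_moves` and `format_moves` are hypothetical, the operand order (two sources, then destination) follows the `mul r1,r2,r3` example, and `->` stands in for the arrow used in the slides.

```python
def to_moves(op, src1, src2, dst):
    """Expand `op src1,src2,dst` into TTA moves (hypothetical helper)."""
    return [
        (src1, f"{op}.o"),   # first source to the operand register
        (src2, f"{op}.t"),   # second source to the trigger register, starts op
        (f"{op}.r", dst),    # result register to the destination
    ]


def format_moves(moves):
    """Render (source, destination) pairs in the slide's move notation."""
    return " ".join(f"{src} -> {dst};" for src, dst in moves)


print(format_moves(to_moves("mul", "r1", "r2", "r3")))
# r1 -> mul.o; r2 -> mul.t; mul.r -> r3;
```

A real compiler backend would also decide which moves share a cycle (the comma-separated form on the slide); that scheduling step is not modelled here.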


TTA Design Tools : Design tools based on the TTA architecture template have been developed at Delft University of Technology (DUT), Delft, the Netherlands, in the MOVE project led by Prof. Henk Corporaal. Fully parametric C/C++ compiler (buses, connections, function units, register files, etc.). Design space explorer. Processor generator.


Code Generation Trajectory : [Figure (MOVE Project at DUT): code generation flow. Blocks shown: Application (C/C++), Compiler Frontend (GCC or SUIF), Sequential Code, Sequential Simulator (I/O), Profiling Data, Architecture Description, Compiler Backend, Parallel Code, Parallel Simulator (I/O).]


TTA Specific Optimizations : TTA allows extra scheduling optimizations, e.g., software bypassing. Bypassing can eliminate the need for an RF access; however, it is more difficult to schedule! Example: r1 → add.o, r2 → add.t; add.r → r3; r3 → sub.o, r4 → sub.t; sub.r → r5; Translates to: r1 → add.o, r2 → add.t; add.r → sub.o, r4 → sub.t; sub.r → r5;
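
The bypassing example on this slide can be reproduced with a small pass over a flat move list. This is a simplified sketch, not the MOVE scheduler: the function name `software_bypass` is hypothetical, cycle grouping is ignored, and the caller must supply the set of registers that are read once and dead afterwards (a real compiler would derive this from liveness analysis).

```python
def software_bypass(moves, dead_after_use):
    """Forward FU results straight to their consumer (simplified sketch).

    moves: list of (source, destination) pairs in program order.
    dead_after_use: registers assumed to be read once and never needed again.
    """
    out = list(moves)
    i = 0
    while i < len(out):
        src, dst = out[i]
        if src.endswith(".r") and dst in dead_after_use:
            fu = src[:-2]
            for j in range(i + 1, len(out)):
                s, d = out[j]
                if d == f"{fu}.t" or d == dst:
                    break              # FU retriggered or register redefined
                if s == dst:
                    out[j] = (src, d)  # consumer reads the FU result directly
                    del out[i]         # the register file write is dropped
                    i -= 1
                    break
        i += 1
    return out


program = [
    ("r1", "add.o"), ("r2", "add.t"), ("add.r", "r3"),
    ("r3", "sub.o"), ("r4", "sub.t"), ("sub.r", "r5"),
]
for src, dst in software_bypass(program, dead_after_use={"r3"}):
    print(f"{src} -> {dst};")
```

With `r3` marked as dead after use, the pass removes the write to `r3` and turns the read of `r3` into a read of `add.r`, which matches the translated sequence on the slide.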


Design Space Exploration : [Figure (MOVE Project at DUT): exploration flow in two phases, resource optimization and connectivity optimization. Blocks shown: Application (C/C++), Frontend, FU models, Cost Functions, Resources (Mach), Map & Schedule, Simulator, Select Resources, Design Points, Reduce Connections, Design Point.]


Exploration: Resource Optimization : [Figure (MOVE Project at DUT): explored architecture configurations.] The Pareto curve represents the lower bound of the architecture configurations found. An architecture is selected for further optimization.
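
To make the Pareto curve concrete, the sketch below filters a list of design points down to the non-dominated ones. The two cost axes (area, cycle count) and all numbers are made up for illustration; the slides do not give the actual cost functions or data, and `pareto_front` is a hypothetical name.

```python
def pareto_front(points):
    """Return the points not dominated by any other (lower is better on both axes)."""
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1]
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)


# Hypothetical (area, cycles) pairs for five explored configurations.
design_points = [(10, 100), (12, 80), (11, 120), (15, 80), (20, 50)]
print(pareto_front(design_points))
# [(10, 100), (12, 80), (20, 50)]
```

Any configuration off this front is worse than some other configuration on both axes, so only the points on the front are candidates for further optimization.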


Exploration: Connectivity Optimization : [Figure (MOVE Project at DUT): architecture with reduced bus connections.] Reduced connections decrease bus delay. Critical connections have been removed.


Topics to be Investigated : Poor code density: a good target for code compression techniques, since a priori information about the application is available and instruction probabilities are therefore known. Estimations: power estimation; fast estimations with sufficient accuracy. Flexibility, reuse: applications may change, so additional resources need to be assigned although not needed by the original application. Tool-assisted special function unit generation: analysis support; model creation support; characterization support. Parameterized processor generator: interconnections, control, etc. may be realized in several ways depending on the target. Low-power optimizations. Clustered TTAs. Interprocessor communication schemes. These topics are considered in the FlexDSP Project at TUT.
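
For the code compression topic, one way to exploit known instruction probabilities is an entropy code. The sketch below uses Huffman coding purely as an example; the slides do not say which compression technique the FlexDSP project uses, and the function name and move counts are hypothetical.

```python
import heapq


def huffman_code_lengths(freqs):
    """Map each symbol to its Huffman code length in bits.

    freqs: dict of symbol -> occurrence count in the application code.
    """
    if len(freqs) == 1:
        return {sym: 1 for sym in freqs}
    # Heap entries: (count, tie-breaker, symbols in this subtree).
    heap = [(count, i, [sym]) for i, (sym, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    next_id = len(heap)
    while len(heap) > 1:
        c1, _, s1 = heapq.heappop(heap)
        c2, _, s2 = heapq.heappop(heap)
        for sym in s1 + s2:
            lengths[sym] += 1  # merged subtrees sit one level deeper
        heapq.heappush(heap, (c1 + c2, next_id, s1 + s2))
        next_id += 1
    return lengths


# Hypothetical move counts from profiling one application.
counts = {"r1 -> add.o": 8, "r2 -> add.t": 4, "add.r -> r3": 2, "r3 -> sub.o": 2}
lengths = huffman_code_lengths(counts)
compressed_bits = sum(counts[m] * lengths[m] for m in counts)
fixed_bits = sum(counts.values()) * 2  # 2 bits suffice for 4 distinct moves
print(compressed_bits, fixed_bits)     # 28 32
```

Frequent moves get short codes, so the total size drops below the fixed-width encoding whenever the distribution is skewed; the decoder cost in hardware is not considered here.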


New Design Environment : [Figure: target of the FlexDSP Project at TUT. Blocks shown: Functionality (C/C++), Frontend, Operation Analysis, SFU Generation, Design Space Exploration, FU models (C, HDL), Cost Functions (area, power, speed), Resource Constraints, Parametric Compiler, Parallel Object Code, Code Compression, Parametric Processor Generator, HDL Code.]


Conclusions : Design methodologies that allow processor customization will improve efficiency in certain application areas, e.g., multimedia and telecom. TTA is a promising candidate for an architectural template for customized processors; in particular, support for custom function units allows powerful tailoring. Results of the MOVE project at DUT have already proven the concept. A parameterized compiler allows tool-assisted design space exploration. More research is still needed on hardware implementations and enhanced compiler strategies.