Presentation Transcript
Design Methodology for Customizable Programmable Processors
Berkeley – Finland Day, Oct. 18, 2002
Prof. Jarmo Takala
Institute of Digital and Computer Systems
Tampere University of Technology
Tampere, Finland
Tel: +358 – 33115 3879; Email: jarmo.takala@tut.fi
Outline :
Motivation
Transport Triggered Architecture (TTA)
Design Methodology for TTAs
Research at TUT
Conclusions
Motivation :
Programmable processors are often used in products that rely on digital signal processing (DSP)
Flexibility
Ease of verification
Traditionally, DSP processor architectures have been developed based on average performance over several benchmark tasks (~100)
User applications often contain only a subset of the benchmark tasks
Efficiency can be improved by customizing architecture according to given tasks
Motivation :
DSP applications often have hard real-time constraints
execution should be deterministic
dynamic runtime behaviours should be avoided
Static scheduling lends itself to DSP
Current design complexities call for an increase in designer productivity
High level languages should be used
DSP algorithms contain inherent parallelism
Instruction level parallelism (ILP) should be maximized
What is needed? :
Application driven design process with easy design space exploration
Replace hardware complexity by software complexity
Compiler driven process
Use templated architecture
Flexible
heterogeneous function units
Modular
scalability
Orthogonal
compiler friendly
Choices for Architecture Template :
[Figure: ILP architectures, showing how the steps from application to execution are divided between compilation time (software) and run time (hardware). Architecture classes: sequential (superscalar), dependence (dataflow), independence (EPIC), independence (VLIW). Steps: frontend, determine dependencies, determine independencies, bind function units, bind datapaths & execute.]
VLIW Gained Popularity in DSP :
[Figure: VLIW block diagram. Labels: Instruction Memory, Instruction Fetch, Instruction Decode, CPU, FU-1 to FU-5, Bypassing Network, Register File, Data Memory.]
Transport Triggered Architecture :
VLIW drawbacks
Bypass complexity
Register file complexity
Register file design restricts FU flexibility
Operation encoding format restricts FU flexibility
Reverse programming paradigm [H. Corporaal, 94]
data transport triggers the operation
Instruction set contains only a single instruction: move
From VLIW to TTA :
[Figure: block diagrams of a VLIW and a TTA. Labels: Instruction Memory, Instruction Fetch, Instruction Decode, Register File, Bypassing Network, FU-1 to FU-5, Data Memory.]
TTA Datapath :
[Figure: TTA datapath. Labels: Integer ALU (two), Float ALU, Boolean RF, Float RF, Integer RF, Load/Store Unit (two), Immediate Unit, Instruction Unit, Socket, Instruction Memory, Data Memory.]
Function Units :
Operands written to operand registers (O)
Operation performed when last operand written to trigger register (T)
Pipeline synchronized with control bits (C)
Standard interface
FU_ready
Result_ready
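Global_lock
[Figure: function unit pipeline. Labels: operand register (O) with optional shadow register, trigger register (T), result register (R), logic stages, control bits (C); parts of the pipeline are marked optional.]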
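The triggering behaviour described on this slide can be sketched in a few lines of Python. This is a minimal toy model, not code from the MOVE tools: the class name `FunctionUnit` and its methods are hypothetical, the unit completes in a single step, and the pipeline control bits, FU_ready and Global_lock are left out.

```python
class FunctionUnit:
    """Toy model of a TTA function unit (all names are illustrative assumptions)."""

    def __init__(self, operation):
        self.operation = operation   # two-input function, e.g. multiply
        self.o = None                # operand register (O)
        self.r = None                # result register (R)
        self.result_ready = False    # loosely mirrors the Result_ready signal

    def write_operand(self, value):
        # Writing O only stores the value; nothing is computed yet.
        self.o = value

    def write_trigger(self, value):
        # Writing T starts the operation on the stored operand.
        if self.o is None:
            raise RuntimeError("trigger written before operand")
        self.r = self.operation(self.o, value)
        self.result_ready = True

    def read_result(self):
        if not self.result_ready:
            raise RuntimeError("result not ready")
        return self.r


mul = FunctionUnit(lambda a, b: a * b)
mul.write_operand(6)          # operand move into O
mul.write_trigger(7)          # trigger move into T starts the multiply
product = mul.read_result()   # result move out of R
```

The point of the sketch is that the operation is a side effect of the last data transport; there is no separate "multiply" instruction.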
ILP Architectures :
[Figure: the ILP architecture comparison extended with TTA. Architecture classes: sequential (superscalar), dependence (dataflow), independence (EPIC), independence (VLIW), independence (TTA). Steps divided between compilation time and run time: frontend, determine dependencies, determine independencies, bind function units, bind datapaths, execute.]
TTA Characteristics: HW :
Modular
Can be constructed with standard building blocks
Very flexible and scalable
FU functionality can be arbitrary
Supports user defined Special Function Units (SFU)
Lower complexity
Reduction in the number of register ports
Reduced bypass complexity
Reduction in bypass connectivity
Reduced register pressure
Trivial decoding (implies long instructions)
TTA Characteristics: SW :
Traditional operation-triggered instruction:
mul r1,r2,r3;
Transport-triggered instruction:
r1 → mul.o;
r2 → mul.t;
mul.r → r3;
or
r1 → mul.o, r2 → mul.t;
mul.r → r3;
Resembles dataflow and time-stationary coding
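As an illustration of the mapping on this slide, the following sketch splits one operation-triggered instruction into its three data transports. It assumes the three-operand form `op src1,src2,dst` used in the slide; the function name `to_moves` is hypothetical, and `->` stands in for the arrow.

```python
def to_moves(instruction):
    """Split 'op src1,src2,dst;' into three transport (move) strings."""
    opcode, operands = instruction.rstrip(";").split(None, 1)
    src1, src2, dst = [name.strip() for name in operands.split(",")]
    return [
        f"{src1} -> {opcode}.o",   # operand move
        f"{src2} -> {opcode}.t",   # trigger move, starts the operation
        f"{opcode}.r -> {dst}",    # result move
    ]


moves = to_moves("mul r1,r2,r3;")
for move in moves:
    print(move)
```

Whether the first two moves share one instruction or occupy two, as in the two variants on the slide, is a scheduling decision that this sketch does not make.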
TTA Design Tools :
Design tools based on the TTA architecture template have been developed at Delft University of Technology (DUT), Delft, the Netherlands
MOVE project led by Prof. Henk Corporaal
Fully parametric C/C++ Compiler
buses, connections, function units, register files, etc.
Design space explorer
Processor generator
Code Generation Trajectory :
[Figure: code generation flow (MOVE Project at DUT). Labels: Application (C/C++), Compiler Frontend, GCC or SUIF, Sequential Code, Sequential Simulator, Profiling Data, Architecture Description, Compiler Backend, Parallel Code, Parallel Simulator, I/O.]
TTA Specific Optimizations :
TTA allows extra scheduling optimizations
E.g., software bypassing
Bypassing can eliminate the need for RF access
However, more difficult to schedule!
Example:
r1 → add.o, r2 → add.t;
add.r → r3;
r3 → sub.o, r4 → sub.t;
sub.r → r5;
Translates to:
r1 → add.o, r2 → add.t;
add.r → sub.o, r4 → sub.t;
sub.r → r5;
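The transformation in the example can be sketched as a small pass over a flat list of moves. This is a simplified illustration, not the MOVE compiler's algorithm: the function name `software_bypass` and the `live_out` parameter are hypothetical, names containing a dot are treated as FU ports, the result register is assumed to hold its value until the unit is triggered again, and moves are not regrouped into instructions.

```python
def software_bypass(moves, live_out=frozenset()):
    """moves: list of (source, destination) pairs in program order.

    Forwards FU results directly to their consumers and drops the
    register write when no later reader needs the register.
    """
    moves = list(moves)
    keep = [True] * len(moves)
    for i, (src, dst) in enumerate(moves):
        if not (src.endswith(".r") and "." not in dst):
            continue                          # only result-to-register moves
        trigger = src[:-2] + ".t"
        result_valid = True                   # FU result still holds the value
        removable = dst not in live_out
        for j in range(i + 1, len(moves)):
            use_src, use_dst = moves[j]
            if use_src == dst:
                if result_valid:
                    moves[j] = (src, use_dst)  # read the FU result directly
                else:
                    removable = False          # register copy is still needed
            if use_dst == trigger:
                result_valid = False           # FU retriggered, result changes
            if use_dst == dst:
                break                          # register redefined
        if removable:
            keep[i] = False
    return [move for move, kept in zip(moves, keep) if kept]


program = [
    ("r1", "add.o"), ("r2", "add.t"),
    ("add.r", "r3"),
    ("r3", "sub.o"), ("r4", "sub.t"),
    ("sub.r", "r5"),
]
bypassed = software_bypass(program, live_out={"r5"})
```

With `r5` marked as live and `r3` assumed dead afterwards, the write to `r3` disappears and `sub.o` is fed from `add.r`, as in the slide. The scheduling difficulty the slide mentions comes from the fact that the bypassed move must be issued while the result is still available, which this sketch only checks in program order.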
Design Space Exploration :
[Figure: design space exploration flow (MOVE Project at DUT) in two phases, Resource Optimization and Connectivity Optimization. Labels: Application (C/C++), Frontend, Map & Schedule, Simulator, FU models, Cost Functions, Resources (Mach), Select Resources, Design Points, Design Point, Reduce Connections.]
Exploration: Resource Optimization :
The Pareto curve represents the lower bound of the architecture configurations found
An architecture is selected for further optimization
(MOVE Project at DUT)
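The Pareto curve mentioned here can be computed from a set of explored design points as sketched below. The two axes (a cost figure and a cycle count, both to be minimized) and the sample numbers are assumptions made for illustration; the slide does not state which metrics were plotted.

```python
def pareto_front(points):
    """Return the (cost, cycles) points that no other point dominates.

    Lower is better on both axes. A point is dominated when another point
    is at least as good on both axes and differs from it.
    """
    front = []
    best_cycles = float("inf")
    for cost, cycles in sorted(set(points)):   # ascending cost, then cycles
        if cycles < best_cycles:               # strictly faster than anything cheaper
            front.append((cost, cycles))
            best_cycles = cycles
    return front


# Made-up design points: (relative cost, execution cycles)
design_points = [(10, 900), (12, 700), (15, 720), (20, 400), (25, 400), (30, 380)]
front = pareto_front(design_points)
```

A designer would then pick one point on this front, trading cost against execution time, as the starting configuration for connectivity optimization.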
Exploration: Connectivity Optimization :
Reduced connections decrease bus delay
Critical connections have been removed
(MOVE Project at DUT)
Topics to be Investigated :
Poor code density
good target for code compression techniques
a priori information about the application is available, thus instruction probabilities are known
Estimations
Power estimation
Fast estimations with sufficient accuracy
Flexibility, reuse
Applications may change, thus additional resources need to be assigned although not needed by the original application
Tool-assisted special function unit generation
Analysis support
Model creation support
Characterization support
Parameterized processor generator
Interconnections, control, etc. may be realized in several ways depending on the target
Low-power optimizations
Clustered TTAs
Interprocessor communication schemes
These topics are considered in the FlexDSP Project at TUT
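To illustrate why known instruction probabilities make code compression attractive, the sketch below builds Huffman code lengths for a small instruction alphabet. Huffman coding is an assumption chosen for illustration; the slides do not name a specific compression technique, and the symbols and probabilities are made up.

```python
import heapq
from itertools import count


def huffman_code_lengths(probabilities):
    """Map each symbol to the length in bits of its Huffman code."""
    tie = count()   # tie-breaker so the heap never compares symbol lists
    heap = [(p, next(tie), [sym]) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in probabilities}
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        for sym in syms1 + syms2:
            lengths[sym] += 1       # merged symbols sit one level deeper
        heapq.heappush(heap, (p1 + p2, next(tie), syms1 + syms2))
    return lengths


# Hypothetical instruction-word probabilities taken from an application profile
probs = {"move_a": 0.5, "move_b": 0.25, "move_c": 0.125, "move_d": 0.125}
lengths = huffman_code_lengths(probs)
average_bits = sum(probs[s] * lengths[s] for s in probs)
```

For this skewed distribution the average code length is 1.75 bits, compared with 2 bits for a fixed-length encoding of four symbols; the more uneven the profile, the larger the saving.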
New Design Environment :
[Figure: design flow targeted by the FlexDSP Project at TUT. Labels: Functionality (C/C++), Frontend, Operation Analysis, SFU Generation, Design Space Exploration, FU models (C, HDL), Cost Functions (area, power, speed), Resource Constraints, Parametric Compiler, Parametric Processor Generator, Code Compression, Parallel Object Code, HDL Code.]
Conclusions :
Design methodologies allowing processor customization will improve efficiency in certain application areas, e.g., multimedia, telecom
TTA is a promising candidate for an architectural template for customized processors
In particular, support for custom function units allows powerful tailoring
Results of MOVE project at DUT have already proven the concept
Parameterized compiler allows tool-assisted design space exploration
Still more research needed on
Hardware implementations
Enhanced compiler strategies