# dsp processors-ii

## Presentation Transcript

### DSP PROCESSORS-II :

MODULE 2

### Syllabus :

- Fixed and floating point formats
- Code improvement
- Constraints
- TMS320C64x CPU
- Simple programming examples using C/assembly

### Fixed point numbers :

- Fast and inexpensive implementation
- Limited in the range of numbers
- Susceptible to problems of overflow
- Fixed-point numbers and their data types are characterized by their word size in bits, the position of the binary point, and whether they are signed or unsigned

### Slide 5:

For a 16-bit word:

- As an unsigned integer, the stored number can take any integer value from 0 to 65,535.
- As a signed integer, two's complement allows negative numbers; the range is -32,768 to 32,767.
- In unsigned fraction notation, the 65,536 levels are spread uniformly between 0 and 1.
- The signed fraction format allows negative numbers, equally spaced between -1 and 1.
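The signed fraction format can be illustrated with a short C sketch. It assumes the common Q15 convention (a 16-bit two's complement word scaled by 2^15), which the slide does not name explicitly; the function names `float_to_q15` and `q15_to_float` are illustrative, not part of any TI library.

```c
#include <stdint.h>

/* Convert a real value to a 16-bit signed fraction (assumed Q15 scaling).
   Values outside [-1, 1) saturate to the nearest representable level. */
int16_t float_to_q15(float x)
{
    float scaled = x * 32768.0f;            /* 2^15 levels per unit */
    if (scaled >= 32767.0f) return INT16_MAX;
    if (scaled <= -32768.0f) return INT16_MIN;
    /* round to nearest; the cast itself truncates toward zero */
    return (int16_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
}

/* Convert a 16-bit signed fraction back to a real value in [-1, 1). */
float q15_to_float(int16_t q)
{
    return (float)q / 32768.0f;
}
```

Saturating at the ends, rather than wrapping, is a design choice made here to avoid the overflow problem mentioned for fixed-point numbers.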

### Carry and Overflow :

- Carry applies to unsigned numbers: when it occurs during an addition or subtraction, the result is incorrect.
- Overflow applies to signed numbers: when it occurs during an addition or subtraction, the result is incorrect.

### Slide 9:

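Examples:

- Overflow (signed, 5 bits): 01111 + 00111 = 10110. Both operands are positive, but the sign bit of the result is 1.
- Carry (unsigned, 3 bits): 100 + 111 = 1011. The result needs a fourth bit, which is the carry.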
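The two conditions can be detected in C by computing the sum in a wider type and checking whether it still fits. This is a minimal sketch for 16-bit words; the helper names `add_u16_carry` and `add_s16_overflow` are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* Unsigned add: returns true when a carry out of bit 15 occurs. */
bool add_u16_carry(uint16_t a, uint16_t b, uint16_t *sum)
{
    uint32_t wide = (uint32_t)a + (uint32_t)b;   /* exact sum */
    *sum = (uint16_t)wide;                       /* wrapped 16-bit result */
    return wide > 0xFFFFu;
}

/* Signed add: returns true when the exact sum is outside -32768..32767. */
bool add_s16_overflow(int16_t a, int16_t b, int16_t *sum)
{
    int32_t wide = (int32_t)a + (int32_t)b;      /* exact sum */
    /* Wrapped result; the final narrowing is implementation-defined in C
       but wraps in two's complement on common compilers. */
    *sum = (int16_t)(uint16_t)wide;
    return wide > INT16_MAX || wide < INT16_MIN;
}
```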

### Data types :

1. Short: 16 bits, represented as 2's complement, with a range from $-2^{15}$ to $2^{15}-1$.
2. Int or signed int: 32 bits, represented as 2's complement, with a range from $-2^{31}$ to $2^{31}-1$.
3. Float: 32 bits, represented as IEEE 32-bit, with a range from $2^{-126}$ ($1.175494 \times 10^{-38}$) to $2^{+128}$ ($3.40282346 \times 10^{38}$).
4. Double: 64 bits, represented as IEEE 64-bit, with a range from $2^{-1022}$ ($2.22507385 \times 10^{-308}$) to $2^{1024}$ ($1.79769313 \times 10^{308}$).

### Floating-point representation :

- The advantage over fixed-point representation is that it can support a much wider range of values.
- The floating-point format needs slightly more storage.
- The speed of floating-point operations is measured in FLOPS.

### Slide 14:

General format of a floating point number: $X = M \cdot b^{e}$, where

- $M$ is the value of the significand (mantissa),
- $b$ is the base,
- $e$ is the exponent.

The mantissa determines the accuracy of the number. The exponent determines the range of numbers that can be represented.

### Slide 15:

Floating point numbers can be represented as:

- Single precision: called "float" in the C language family. It is a binary format that occupies 32 bits; its significand has a precision of 24 bits.
- Double precision: called "double" in the C language family. It is a binary format that occupies 64 bits; its significand has a precision of 53 bits.

### Slide 16:

Single Precision (SP):

- Bit 31 is the sign bit (s).
- Bits 23 to 30 are the exponent bits (e).
- Bits 0 to 22 are the fractional bits (f).
- Numbers as small as $10^{-38}$ and as large as $10^{38}$ can be represented.

| Bit 31 | Bits 30-23 | Bits 22-0 |
|---|---|---|
| s | e | f |
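The field positions above can be checked on a host machine with a few shifts and masks. The sketch below assumes that the C `float` type is the IEEE 32-bit format described here; the type `sp_fields` and the function `sp_unpack` are illustrative names.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t sign;      /* bit 31 */
    uint32_t exponent;  /* bits 30-23, biased by 127 */
    uint32_t fraction;  /* bits 22-0 */
} sp_fields;

/* Split a single-precision value into its three bit fields. */
sp_fields sp_unpack(float x)
{
    uint32_t bits;
    sp_fields f;
    memcpy(&bits, &x, sizeof bits);  /* reinterpret the bits without aliasing issues */
    f.sign     = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.fraction = bits & 0x7FFFFFu;
    return f;
}
```

For example, 1.0 unpacks to sign 0, biased exponent 127 and fraction 0.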

### Slide 17:

Double precision (DP):

- With 64 bits, more exponent and fractional bits are available.
- A pair of registers is used.
- Bits 0 to 31 of the first register are fractional bits.
- Bits 0 to 19 of the second register are also fractional bits.
- Bits 20 to 30 of the second register are the exponent bits.
- Bit 31 of the second register is the sign bit.
- Numbers as small as $10^{-308}$ and as large as $10^{+308}$ can be represented.

| Register | Bit 31 | Bits 30-20 | Bits 19-0 |
|---|---|---|---|
| Second | s | e | f |
| First | f (bits 31-0) | | |
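As a rough host-side illustration of the register-pair layout, the sketch below splits a C `double` into two 32-bit words and extracts the fields from the upper word. It assumes `double` is the IEEE 64-bit format; `dp_pair`, `dp_split` and the accessor names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t lo;  /* "first register": fraction bits 31-0 */
    uint32_t hi;  /* "second register": sign, exponent, upper fraction bits */
} dp_pair;

/* Split a double-precision value into two 32-bit words. */
dp_pair dp_split(double x)
{
    uint64_t bits;
    dp_pair p;
    memcpy(&bits, &x, sizeof bits);
    p.lo = (uint32_t)(bits & 0xFFFFFFFFu);
    p.hi = (uint32_t)(bits >> 32);
    return p;
}

uint32_t dp_sign(dp_pair p)        { return p.hi >> 31; }            /* bit 31 */
uint32_t dp_exponent(dp_pair p)    { return (p.hi >> 20) & 0x7FFu; } /* bits 30-20 */
uint32_t dp_fraction_hi(dp_pair p) { return p.hi & 0xFFFFFu; }       /* bits 19-0 */
```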

### Slide 18:

- Instructions ending in SP or DP operate on single-precision and double-precision values, respectively.
- Some floating-point instructions have longer latencies than fixed-point instructions. E.g. MPY requires one delay slot, MPYSP requires three and MPYDP requires nine.
- A single-precision floating-point value can be loaded into a single register, whereas a double-precision value needs a pair of registers: A1:A0, A3:A2, ... or B1:B0, B3:B2, ...
- The C6711 processor has a single-precision reciprocal instruction, RCPSP, for performing division.

### Code Optimization :

Code optimization is used to drastically reduce the execution time of the code. There are several techniques:

1. Use instructions in parallel
2. Word-wide data
3. Intrinsic functions
4. Software pipelining

Optimized assembly (ASM) code runs faster than C and requires less memory space.

### Slide 20:

Optimization Steps

1. Program in C. Build your project without optimization.
2. Use intrinsic functions when appropriate, as well as the various optimization levels.
3. Use the profiler to identify the functions that may need to be further optimized. Then convert these functions to linear ASM.
4. Optimize code in ASM.

### Slide 21:

Compiler options: A C-coded program is first passed through a parser that performs preprocessing functions and generates an intermediate file (.if), which becomes the input to an optimizer. The optimizer generates an (.opt) file, which becomes the input to a code generator for further optimization; the code generator produces the ASM file.

C code → Parser → .if → Optimizer → .opt → Code generator → ASM

### Slide 22:

The options for optimization levels:

1. -o0 optimizes the use of registers.
2. -o1 performs a local optimization in addition to the optimization done by -o0.
3. -o2 performs global optimization in addition to the optimizations done by -o0 and -o1.
4. -o3 performs file optimization in addition to the optimizations done by -o0, -o1 and -o2.

-o2 and -o3 attempt to do software optimization (software pipelining).

### Slide 23:

Intrinsic C functions: Similar to run-time support library functions, C intrinsic functions are used to increase the efficiency of code.

1. `int _mpy()` has an equivalent ASM instruction MPY, which multiplies the 16 LSBs of a number by the 16 LSBs of another number.
2. `int _mpyh()` has an equivalent ASM instruction MPYH, which multiplies the 16 MSBs of a number by the 16 MSBs of another number.
3. `int _mpylh()` has an equivalent ASM instruction MPYLH, which multiplies the 16 LSBs of a number by the 16 MSBs of another.
4. `int _mpyhl()` has an equivalent ASM instruction MPYHL, which multiplies the 16 MSBs of a number by the 16 LSBs of another.
5. `void _nassert(int)` generates no code. It tells the compiler that the expression passed to it is true.
6. `uint _lo(double)` and `uint _hi(double)` obtain the low and high 32 bits of a double word.
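The behaviour of the four multiply intrinsics can be modelled in portable C for testing on a host. This is only a behavioural sketch that assumes signed 16 x 16 multiplies; the `*_model` function names are hypothetical stand-ins, not the TI intrinsics themselves.

```c
#include <stdint.h>

/* Signed lower and upper halves of a 32-bit word. The final narrowing
   conversion is implementation-defined in C but wraps in two's
   complement on common compilers. */
static int16_t lo16(int32_t x) { return (int16_t)(uint16_t)((uint32_t)x & 0xFFFFu); }
static int16_t hi16(int32_t x) { return (int16_t)(uint16_t)((uint32_t)x >> 16); }

/* Models of MPY, MPYH, MPYLH and MPYHL as described in the list above. */
int32_t mpy_model(int32_t a, int32_t b)   { return (int32_t)lo16(a) * lo16(b); }
int32_t mpyh_model(int32_t a, int32_t b)  { return (int32_t)hi16(a) * hi16(b); }
int32_t mpylh_model(int32_t a, int32_t b) { return (int32_t)lo16(a) * hi16(b); }
int32_t mpyhl_model(int32_t a, int32_t b) { return (int32_t)hi16(a) * lo16(b); }
```

With `a = 0x00030002` and `b = 0x00050004`, the halves are (3, 2) and (5, 4), so the four models give 8, 15, 10 and 12.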

### Slide 24:

Trip directive for loop count: The linear assembly directive (.trip) is used to specify the number of times a loop iterates. If the exact number is known and specified, redundant loops are not generated, which can improve both code size and execution time.

### Software pipelining :

Software pipelining is a scheme that uses the available resources to obtain efficient pipelined code. The aim is to use all eight functional units within one cycle. There are three stages:

1. Prolog (warm-up): contains the instructions needed to build up the loop kernel cycle.
2. Loop kernel (cycle): within this loop, all instructions are executed in parallel. The entire loop is executed in one cycle.
3. Epilog (cool-off): contains the instructions necessary to complete all iterations.

### Slide 26:

Procedure for software pipelining:

1. Draw the dependency graph.
2. Set up a scheduling table.
3. Obtain code from the scheduling table.

Dependency graph (procedure):

1. Draw the nodes and paths.
2. Write the number of cycles needed to complete each instruction.
3. Assign the functional unit associated with each instruction.
4. Separate the data paths, so that the maximum number of units are utilized.

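### Dependency graph (e.g. two sums of products) :

[Figure: dependency graph split into side A and side B. Two LDW nodes (.D1, .D2) load ai and bi and feed MPY (.M1x, prod l) and MPYH (.M2x, prod h) after 5 cycles; the products feed the ADD nodes (.L1, .L2; sum l, sum h) after 2 cycles; SUB (.S1, count) and B (.S2, loop) form the loop control, each with 1 cycle.]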

### Slide 28:

Scheduling table:

1. LDW starts in cycle 1.
2. MPY and MPYH must start five cycles after LDW, due to the four delay slots of LDW. Therefore MPY/MPYH start in cycle 6.
3. ADD must start two cycles after MPY/MPYH, due to the one delay slot of MPY/MPYH. Therefore ADD starts in cycle 8.
4. B has 5 delay slots and starts in cycle 3, since branching occurs in cycle 9, after the ADD instructions.
5. SUB must start one cycle before the branch instruction, since the loop count is decremented before branching occurs. Therefore SUB starts in cycle 2.
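The start cycles in the list above follow from one rule: a consumer can start one cycle after the producer's delay slots have elapsed. The sketch below reproduces the numbers; the `schedule` type and function names are illustrative only.

```c
/* Delay slots quoted in the scheduling rules above. */
enum { LDW_DELAY = 4, MPY_DELAY = 1, B_DELAY = 5 };

typedef struct { int ldw, mpy, add, b, sub; } schedule;

/* Earliest cycle in which the result of an instruction can be consumed. */
static int earliest_consumer(int producer_start, int delay_slots)
{
    return producer_start + delay_slots + 1;
}

schedule build_schedule(void)
{
    schedule s;
    s.ldw = 1;
    s.mpy = earliest_consumer(s.ldw, LDW_DELAY);   /* cycle 6 */
    s.add = earliest_consumer(s.mpy, MPY_DELAY);   /* cycle 8 */
    /* The branch must take effect in the cycle after ADD (cycle 9),
       so it is issued B_DELAY + 1 cycles earlier. */
    s.b   = (s.add + 1) - (B_DELAY + 1);           /* cycle 3 */
    s.sub = s.b - 1;                               /* cycle 2 */
    return s;
}
```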

### Schedule table before software pipelining: :

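| Cycles | .D1 | .D2 | .M1 | .M2 | .L1 | .L2 | .S1 | .S2 |
|---|---|---|---|---|---|---|---|---|
| 1, 9, 17, ... | LDW | LDW | | | | | | |
| 2, 10, 18, ... | | | | | | | SUB | |
| 3, 11, ... | | | | | | | | B |
| 4, 12, ... | | | | | | | | |
| 5, 13, ... | | | | | | | | |
| 6, 14, ... | | | MPY | MPYH | | | | |
| 7, 15, ... | | | | | | | | |
| 8, 16, ... | | | | | ADD | ADD | | |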

### Schedule table after software pipelining: :

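| Cycles | .D1 | .D2 | .M1 | .M2 | .L1 | .L2 | .S1 | .S2 |
|---|---|---|---|---|---|---|---|---|
| 1, 9, 17, ... | LDW | LDW | | | | | | |
| 2, 10, 18, ... | LDW | LDW | | | | | SUB | |
| 3, 11, ... | LDW | LDW | | | | | SUB | B |
| 4, 12, ... | LDW | LDW | | | | | SUB | B |
| 5, 13, ... | LDW | LDW | | | | | SUB | B |
| 6, 14, ... | LDW | LDW | MPY | MPYH | | | SUB | B |
| 7, 15, ... | LDW | LDW | MPY | MPYH | | | SUB | B |
| 8, 16, ... | LDW | LDW | MPY | MPYH | ADD | ADD | SUB | B |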

### Slide 31:

Instructions within the prolog stage (cycles 1-7) are repeated up to and including the loop kernel (cycle 8). Instructions in the epilog stage (cycles 9, 10, ...) complete the functionality of the code.

### Slide 32:

Loop Kernel

Within the loop kernel (cycle 8), multiple iterations of the loop execute in parallel, i.e. different iterations are processed at the same time. E.g.:

- ADDs add data for iteration 1
- MPY/MPYH multiply data for iteration 3
- LDW loads data for iteration 8
- SUB decrements the counter for iteration 7
- B branches for iteration 6

That is, the values being multiplied are loaded into registers 5 cycles before the cycle in which they are actually multiplied. Before the first multiplication occurs, the fifth load has just completed. This software pipeline is 8 iterations deep.
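The iteration numbers in the kernel can be derived from the start cycles in the scheduling table (LDW 1, SUB 2, B 3, MPY/MPYH 6, ADD 8), assuming one new iteration is issued every cycle. The sketch below uses hypothetical names.

```c
/* Start cycles taken from the scheduling table. */
enum { START_LDW = 1, START_SUB = 2, START_B = 3, START_MPY = 6, START_ADD = 8 };

/* Iteration k issues an instruction in cycle start + (k - 1), so in a
   given cycle the instruction is working on iteration cycle - start + 1.
   A result below 1 means the instruction has not started yet. */
int iteration_at(int cycle, int start_cycle)
{
    return cycle - start_cycle + 1;
}
```

Evaluating `iteration_at(8, ...)` for each start cycle gives 8, 7, 6, 3 and 1, matching the list above.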

### Slide 33:

If the loop count is 100 (200 numbers):

- Cycle 1: LDW, LDW (also initialization of count and accumulators A7 and B7)
- Cycle 2: LDW, LDW, SUB
- Cycles 3-5: LDW, LDW, SUB, B
- Cycles 6-7: LDW, LDW, MPY, MPYH, SUB, B
- Cycles 8-107: LDW, LDW, MPY, MPYH, ADD, ADD, SUB, B
- Cycle 108: LDW, LDW, MPY, MPYH, ADD, ADD, SUB, B

The prolog section is within cycles 1-7, the loop kernel is in cycle 8, and the epilog section is in cycle 108.

### Slide 34:

Execution Cycles: number of cycles with software pipelining, for N input values:

- Fixed point: 7 + (N/2) + 1; e.g. N = 200 gives 7 + 100 + 1 = 108
- Floating point: 9 + (N/2) + 15

| Optimization | Fixed point | Floating point |
|---|---|---|
| No optimization | 2 + (16 × 200) = 3202 | 2 + (18 × 200) = 3602 |
| With parallel instructions | 1 + (8 × 200) = 1601 | 1 + (10 × 200) = 2001 |
| Two sums per iteration | 1 + (8 × 100) = 801 | 1 + (10 × 100) + 7 = 1008 |
| With S/W pipelining | 7 + (200/2) + 1 = 108 | 9 + (200/2) + 15 = 124 |
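The cycle counts above are simple enough to encode as functions, which makes it easy to compare the variants for other values of N. This is a small sketch of the slide's formulas; the function names are illustrative and N is assumed to be even.

```c
/* Cycle-count formulas from this slide; n is the number of input values. */
int cycles_fixed_no_opt(int n)    { return 2 + 16 * n; }
int cycles_fixed_parallel(int n)  { return 1 + 8 * n; }
int cycles_fixed_two_sums(int n)  { return 1 + 8 * (n / 2); }
int cycles_fixed_pipelined(int n) { return 7 + n / 2 + 1; }   /* prolog + kernel + epilog */
int cycles_float_pipelined(int n) { return 9 + n / 2 + 15; }
```

For N = 200, software pipelining reduces the fixed-point count from 3202 cycles to 108.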

### Slide 35:

Memory Constraints:

- Internal memory is arranged in several banks so that loads and stores can occur simultaneously.
- Since the banks are single ported, only one access to each bank can be performed per cycle.
- Two memory accesses per cycle can be performed if they do not access the same bank.
- If multiple accesses are made to the same bank, the pipeline stalls.
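A bank conflict check can be sketched as follows. The bank organisation used here (8 banks, each 4 bytes wide, interleaved on consecutive words) is an assumption made for illustration; the slide does not give the bank count or width, and real devices differ. The names `bank_of` and `accesses_conflict` are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* Assumed organisation, for illustration only. */
#define NUM_BANKS        8u
#define BANK_WIDTH_BYTES 4u

/* Bank selected by a byte address under the assumed interleaving. */
unsigned bank_of(uint32_t byte_addr)
{
    return (unsigned)((byte_addr / BANK_WIDTH_BYTES) % NUM_BANKS);
}

/* Simplified rule from the list above: two accesses in the same cycle
   stall the pipeline when they fall in the same bank. */
bool accesses_conflict(uint32_t addr_a, uint32_t addr_b)
{
    return bank_of(addr_a) == bank_of(addr_b);
}
```

Under these assumptions, addresses 0 and 4 can be accessed together, while addresses 0 and 32 map to the same bank and would stall.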

### Slide 36:

Cross Path Constraints: Each of the two data paths has one cross path, so at most two instructions per cycle can use a cross path.

E.g. valid code segment (both available cross paths are utilized):

        ADD .L1X A1, B1, A0
    ||  MPY .M2X A2, B2, B3

E.g. not valid (one cross path is used for both instructions):

        ADD .L1X A1, B1, A0
    ||  MPY .M1X A2, B2, A3
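The rule as stated on this slide can be expressed as a small check over the instructions of one execute packet. The sketch below is a simplified model with hypothetical names; it only counts how often each cross path is used.

```c
#include <stdbool.h>

/* Which cross path an instruction uses, if any (.L1X/.M1X use 1X, .M2X uses 2X). */
typedef enum { XPATH_NONE, XPATH_1X, XPATH_2X } xpath;

/* True when each cross path is used by at most one instruction in the packet. */
bool cross_paths_valid(const xpath *use, int n)
{
    int count_1x = 0, count_2x = 0;
    for (int i = 0; i < n; i++) {
        if (use[i] == XPATH_1X) count_1x++;
        if (use[i] == XPATH_2X) count_2x++;
    }
    return count_1x <= 1 && count_2x <= 1;
}
```

The valid example above corresponds to `{XPATH_1X, XPATH_2X}` and the invalid one to `{XPATH_1X, XPATH_1X}`.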

### Slide 37:

Load/store constraints: The address register to be used must be on the same side as the .D unit.

E.g. valid code:

        LDW .D1 *A1, A2
    ||  LDW .D2 *B1, B2

E.g. invalid code:

        LDW .D1 *A1, A2
    ||  LDW .D2 *A3, B2

A load and a store issued in parallel cannot both transfer data to or from the same register file.

E.g. valid code:

        LDW .D1 *A0, B1
    ||  STW .D2 A1, *B2

E.g. invalid code:

        LDW .D1 *A0, A1
    ||  STW .D2 A2, *B2

### TMS320C64x :

- TMS320C64x is a family of fixed-point Very Long Instruction Word (VLIW) DSPs from Texas Instruments.
- At clock rates of up to 1 GHz, C64x DSPs can process information at rates up to 8000 MIPS.
- C64x DSPs can do more work each cycle with built-in extensions.
- They can run all C62x object code unmodified (but not vice versa).

### Applications for the C64x :

TMS320C64x can be used as a CPU in the following devices:

- Wireless local base stations
- Remote access servers (RAS)
- Digital subscriber loop (DSL) systems
- Cable modems
- Multichannel telephony systems
- Pooled modems

### New extensions :

- Register file enhancements
- Data path extensions
- Packed data processing
- Additional functional unit hardware
- Increased orthogonality

### Register file enhancements :

- The ’C64x register file has twice as many general-purpose registers as the ’C62x/’C67x cores.
- There are 32 32-bit registers per data path: A0-A31 for file A and B0-B31 for file B.
- A0 may also be used as a condition register, bringing the total to six condition registers.
- In all ’C6000 devices, registers A4-A7 and B4-B7 can be used for circular addressing.

### Packed data processing :

- The ’C64x register file supports all the ’C62x data types and extends this by additionally supporting packed 8-bit types and 64-bit fixed-point data types.
- Instructions operate directly on packed data to streamline data flow and increase instruction set efficiency.
- Packed data types store either four 8-bit values or two 16-bit values in a single 32-bit register, or four 16-bit values in a 64-bit register pair.
- Besides being able to perform all the ’C62x instructions, the ’C64x also contains many 8-bit and 16-bit extensions to the instruction set. E.g. the MPYU4 instruction performs four 8x8 unsigned multiplies with a single instruction on a .M unit.
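To make the packed formats concrete, the sketch below models four 8x8 unsigned multiplies on a host: each 32-bit input holds four 8-bit lanes and the four 16-bit products are packed into a 64-bit result, standing in for the register pair. It is a behavioural illustration only; `mpyu4_model` is a hypothetical name and the lane ordering is an assumption.

```c
#include <stdint.h>

/* Multiply the four unsigned 8-bit lanes of a and b pairwise and pack
   the four 16-bit products into one 64-bit value (lane 0 in bits 15-0). */
uint64_t mpyu4_model(uint32_t a, uint32_t b)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t x = (a >> (8 * lane)) & 0xFFu;
        uint32_t y = (b >> (8 * lane)) & 0xFFu;
        uint64_t product = (uint64_t)(x * y);   /* at most 255 * 255, fits in 16 bits */
        result |= product << (16 * lane);
    }
    return result;
}
```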

### Data path extensions :

- On the ’C64x, all eight of the functional units have access to the register file on the opposite side via a cross path.
- On the ’C62x/’C67x, only six functional units have access to the register file on the opposite side via a cross path; the .D units do not have a data cross path.
- The ’C64x pipelines data cross path accesses, allowing multiple units per side to read the same cross path source simultaneously.
- In the ’C62x/’C67x, only one functional unit per data path per execute packet could get an operand from the opposite register file.

### Slide 44:

The ’C64x supports double-word loads and stores.

- There are four 32-bit paths for loading data from memory to the register file. For side A, LD1a is the load path for the 32 LSBs and LD1b is the load path for the 32 MSBs. For side B, LD2a is the load path for the 32 LSBs and LD2b is the load path for the 32 MSBs.
- There are also four 32-bit paths for storing register values to memory from each register file. For side A, ST1a is the write path for the 32 LSBs and ST1b is the write path for the 32 MSBs. For side B, ST2a is the write path for the 32 LSBs and ST2b is the write path for the 32 MSBs.

### Slide 45:

The ’C64x can also access words and double words at any byte boundary using non-aligned loads and stores. As a result, word and double-word data does not always need alignment to 32-bit or 64-bit boundaries as in the ’C62x/’C67x

### Additional Functional Unit Hardware :

- The .L units can perform byte shifts and the .M units can perform bi-directional variable shifts, in addition to the .S unit’s ability to do shifts.
- The .L units can now perform quad 8-bit subtracts with absolute value. This absolute difference instruction greatly aids motion estimation algorithms.
- Special communication-specific instructions, such as SHFL, DEAL and GMPY4, have been added to the .M unit to address common operations in error-correcting codes.
- Bit-count and rotate hardware on the .M unit extends support for bit-level algorithms such as binary morphology, image metric calculations and encryption algorithms.
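The quad 8-bit absolute difference can be modelled in portable C to show why it helps motion estimation: summing the four lane results gives a sum of absolute differences (SAD) over four pixels. This is a behavioural sketch with hypothetical function names, not the hardware instruction.

```c
#include <stdint.h>

/* Per-lane |a - b| on four unsigned 8-bit lanes, packed back into 32 bits. */
uint32_t subabs4_model(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t x = (a >> (8 * lane)) & 0xFFu;
        uint32_t y = (b >> (8 * lane)) & 0xFFu;
        uint32_t diff = (x > y) ? (x - y) : (y - x);
        result |= diff << (8 * lane);
    }
    return result;
}

/* Sum of absolute differences over the four lanes. */
unsigned sad4(uint32_t a, uint32_t b)
{
    uint32_t d = subabs4_model(a, b);
    return (d & 0xFFu) + ((d >> 8) & 0xFFu) + ((d >> 16) & 0xFFu) + (d >> 24);
}
```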

### Increased Orthogonality :

- The .D unit can now perform 32-bit logical instructions, in addition to the .S and .L units.
- The .D unit now directly supports load and store instructions for double-word data values.
- The ’C62x/’C67x allows up to four reads of a given register in a given clock cycle. The ’C64x allows any number of reads of a given register in a given clock cycle.
- On the ’C62x/’C67x, one long source and one long result per data path could occur every clock cycle. On the ’C64x, up to two long sources and two long results can be accessed on each data path every clock cycle.

### Block diagram :

[Figure: block diagram. The CPU core connects to an L1 program cache (direct-mapped, 16 Kbytes total) and an L1 data cache (2-way set-associative, 16 Kbytes total), both backed by an L2 memory of 1024 Kbytes. An enhanced DMA controller (64-channel) links the L2 memory to EMIF A and EMIF B, which interface to ZBT RAM, SDRAM, SBSRAM, FIFO, SRAM and I/O devices.]

C64X CPU

### Architecture Overview :

2 (almost) identical fixed-point data paths that each contain:

- 1 ALU (the .L unit)
- 1 shifter (the .S unit)
- 1 multiplier (the .M unit)
- 1 adder/subtractor used for address generation (the .D unit)
- 1 register file containing thirty-two 32-bit registers

### Slide 51:

- The 8 execution units in the 2 data paths are capable of executing up to 8 instructions in parallel.
- Can operate on 8-, 16-, 32-, and 40-bit data.
- Can perform double-word (64-bit) loads and stores by using 2 registers for the one operation.

### General-Purpose Register Files :

- The C64x register file contains 32 32-bit registers (A0-A31 for file A and B0-B31 for file B); they can be used for data, pointers or conditions.
- Values larger than 32 bits (40-bit long and 64-bit float quantities) are stored in register pairs.
- Packed data types are: four 8-bit values or two 16-bit values in a single 32-bit register, and four 16-bit values in a 64-bit register pair.

[Figure: 40-bit long value in a register pair. Bits 31-0 are held in the even register and bits 39-32 in the odd register; the remaining bits of the odd register are zero filled.]

### Pipeline :

The C64x pipeline has the following features:

- 11 phases divided into three stages: Fetch, Decode and Execute.
- The fetch stage has 4 phases for all instructions and the decode stage has 2 phases for all instructions.
- The execute stage of the pipeline requires a varying number of phases, depending on the type of the instruction.

The stages of the fixed-point pipeline are:

### Slide 54:

- In the C64x, instructions are fetched from the instruction memory in groups of eight instructions, called fetch packets (FPs).
- Each FP can be split into one to eight execute packets (EPs).
- Each EP contains only instructions that can execute in parallel.
- Each instruction in an EP executes in an independent functional unit.
- The C64x pipeline is most effective when it is kept as full as possible by organizing instructions.
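The split of a fetch packet into execute packets can be sketched with one flag per instruction. The model below assumes the C6000 convention of a parallel bit (p-bit) in each instruction, where 1 means the next instruction executes in parallel with this one; the slide does not describe this mechanism, so it is stated here as background, and the function name is hypothetical.

```c
/* p[i] == 1 means instruction i + 1 belongs to the same execute packet
   as instruction i. The last instruction always ends a packet, because
   in this model an execute packet does not cross the fetch packet. */
int count_execute_packets(const int p[8])
{
    int count = 0;
    for (int i = 0; i < 8; i++) {
        if (i == 7 || p[i] == 0)
            count++;   /* instruction i is the last one of its execute packet */
    }
    return count;
}
```

A fully parallel fetch packet yields one execute packet, and a fully serial one yields eight.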

Pipeline Stages

### Execute Pipeline Stages: E1 :

E1: Execute stage 1

- Single-cycle instructions are completed.
- For all instructions, conditions are evaluated and operands are read.
- For load/store instructions, address generation is performed and address modifications are written to the register file.
- For branch instructions, the branch fetch packet in the PG phase is affected.
- For single-cycle instructions, results are written to the register.

### Execute Pipeline Stages: E2 :

E2: Execute stage 2

- Multiply instructions are completed.
- Load instructions send the address to memory.
- Store instructions send the address and data to memory.
- The SAT bit in the control status register (CSR) is set if a single-cycle instruction saturated its result.
- Results of single 16x16 multiply instructions are written to the register.
- Results of .M unit non-multiply instructions are written to the register.

### Execute Pipeline Stages: E3 :

E3: Execute stage 3

- Store instructions are completed.
- Data memory accesses are performed.
- The SAT bit in the control status register (CSR) is set for multiply instructions.

### Execute Pipeline Stages: E4 :

E4: Execute stage 4

- Multiply extension instructions are completed.
- Load instructions bring the data to the CPU.
- Results of multiply extension instructions (MPY2, MPY4, DOTPx2, DOTPU4, MPYHIx, MPYLIx and MVD) are written to the register.

### Execute Pipeline Stages: E5 :

E5: Execute stage 5

- Load instructions are completed.
- Load instruction data is written to the register.

Pipeline summary

### Delay Slots :

Delay slots mean “how many CPU cycles come between the current instruction and when the results of the instruction can be used by another instruction”.

- Single-cycle instructions: 0 delay slots
- 16x16 single multiply and .M unit non-multiply instructions: 1 delay slot

### Slide 64:

- Store: 0 delay slots. If a load occurs before a store (either in parallel or not), the old data is loaded from memory before the new data is stored. If a load occurs after a store (either in parallel or not), the new data is stored before the data is loaded.
- C64x multiply extensions: 3 delay slots
- Load: 4 delay slots
- Branch: 5 delay slots. The branch target is in the PG phase when the branch condition is determined in E1. There are 5 slots between PG and E1 before the branch target begins executing useful code again.

### Memory :

- The C64x has different spaces for program and data memory.
- It uses a two-level cache memory scheme.

### Internal Memory :

The C64x has a 32-bit byte-addressable memory with the following features:

- Separate data and program address spaces
- Large on-chip RAM, up to 7 MB
- 2-level cache
- Single internal program memory port with an instruction-fetch bandwidth of 256 bits
- Two 64-bit internal data memory ports

### Memory Map (Internal and External Memory)‏ :

- Level 1 program cache is 128 Kbit, direct mapped.
- Level 1 data cache is 128 Kbit, 2-way set-associative.
- Shared Level 2 program/data memory/cache of 4 Mbit, which can be configured as:
  - mapped memory,
  - cache (up to 256 Kbytes), or
  - a combination of the two.

### Memory Buses :

- Instruction fetch using a 32-bit address bus and a 256-bit data bus
- Two 64-bit load buses (LD1 and LD2)
- Two 64-bit store buses (ST1 and ST2)

### Interrupts :

- 16 prioritized interrupts: INT_00 to INT_15.
- INT_00 has the highest priority and is dedicated to RESET. This halts the CPU and returns it to a known state.
- The first four interrupts (INT_00 to INT_03) are fixed and non-maskable.
- INT_01 to INT_03 are generally used to alert the CPU of an impending hardware problem, such as an imminent power failure.
- The remaining interrupts are maskable and can be programmed.

### Interrupt Performance Consideration :

- Overhead for all CPU interrupts is 7 cycles.
- Interrupt latency is 11 cycles.
- Interrupts can be recognized every 2 cycles.
- 2 occurrences of a specific interrupt can be recognized in 2 cycles.

### Peripheral Set :

- 2 multichannel buffered audio serial ports
- 2 inter-integrated circuit bus modules (I2Cs)
- 2 multichannel buffered serial ports (McBSPs)
- 3 32-bit general-purpose timers
- 1 user-configurable 16-bit or 32-bit host-port interface (HPI16/HPI32)
- 1 16-pin general-purpose input/output port (GP0) with programmable interrupt/event generation modes
- 1 32-bit glueless external memory interface (EMIFA), capable of interfacing to synchronous and asynchronous memories and peripherals

### ZBT RAM :

- Zero Bus Turnaround (ZBT) is a synchronous SRAM architecture optimized for networking and telecommunications applications.
- It can increase the internal bandwidth of a switch fabric when compared to standard SyncBurst SRAM.
- The ZBT architecture is optimized for switching and other applications with highly random READs and WRITEs.
- ZBT SRAMs eliminate all idle cycles when turning the data bus around from a WRITE operation to a READ operation.

### Packaging :

[Figures: package top view and bottom view.]

### Sum of products example :

C code:

    int DotP(short* m, short* n, int count)
    {
        int i, product, sum = 0;
        for (i = 0; i < count; i++)
        {
            product = m[i] * n[i];   /* 16 x 16 multiply */
            sum += product;          /* accumulate */
        }
        return(sum);
    }

TI TMS C64x code:

    LOOP:
       [A0]  SUB  .L1   A0, 1, A0
    || [!A0] ADD  .S1   A6, A5, A5
    ||       MPY  .M1X  B4, A4, A6
    || [B0]  BDEC .S2   LOOP, B0
             LDH  .D1T1 *A3++, A4
             LDH  .D2T2 *B5++, B4
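The "two sums per iteration" variant from the execution-cycles table can be sketched in C by unrolling the loop by two and keeping two independent accumulators, which gives a compiler two parallel multiply-add chains to schedule. This is a hedged illustration rather than the deck's own code; `dotp_two_sums` is a hypothetical name.

```c
/* Dot product with two partial sums per loop iteration. */
int dotp_two_sums(const short *m, const short *n, int count)
{
    int sum_lo = 0, sum_hi = 0;
    int i;

    /* Process the inputs in pairs; the two accumulations are independent. */
    for (i = 0; i + 1 < count; i += 2) {
        sum_lo += m[i] * n[i];
        sum_hi += m[i + 1] * n[i + 1];
    }
    /* Handle the last element when count is odd. */
    if (count > 0 && (count & 1))
        sum_lo += m[count - 1] * n[count - 1];

    return sum_lo + sum_hi;
}
```

For `m = {1, 2, 3, 4}` and `n = {5, 6, 7, 8}` the result is 70, the same as the single-accumulator `DotP` above.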

### Another code example :

MIPS:

    loop: LW   R1, 0(R11)
          MUL  R2, R1, R10
          SW   R2, 0(R12)
          ADDI R12, R12, #-4
          ADDI R11, R11, #-4
          BGTZ R12, loop

TI TMS C64x:

             ADDK .S1  #-4,A11
          || LDW  .D1  A1,0(A11)
          || MVK  .S2  #-4,B1

             ADDK .S1  #-4,A11
          || LDW  .D1  A1,0(A11)
          || MUL  .M1  A1,A10,A2
          || ADDK .S2  #-12,B12

    loop:    ADDK .S1  #-4,A11
          || LDW  .D1  A1,0(A11)
          || MUL  .M1  A1,A10,A2
          || STW  .D2x A2,0(B12)
          || ADD  .L2  B12,B1,B12
          || BGTZ .S2  B12, loop

             ADD  .L2  B12, B1, B12
          || MUL  .M1  A1,A10,A2
          || STW  .D2x A2,0(B12)

             ADD  .L2  B12, B1, B12
          || STW  .D2x A2,0(B12)

### Special purpose instructions :

Special purpose instructions

THE END