Reconfigurable Computing: Current Status and Potential for Spacecraft Computing Systems
Rod Barto
NASA/GSFC Office of Logic Design
Spacecraft Digital Electronics
3312 Moonlight
El Paso, Texas 79904
Reconfigurable Computing is…
A design methodology by which computational components can be arranged in several ways to perform various computing tasks
Two types of reconfigurable computing:
Static, i.e., the computing system is configured before launch
Dynamic, i.e., the computing system can be reconfigured after launch
Static Reconfigurability
Several examples exist, e.g., Cray
Typically processing modules connected by an intercommunication mechanism, e.g., Ethernet
Goals are
To reduce system development costs
To provide higher performance computing
Dynamic Reconfigurability (DR)
Processing modules that can be reconfigured in flight
Goal is to provide, using reduced amounts of hardware, processing support for algorithms that do not map well onto general purpose computers
Outline of Paper
Discuss the computation of a series of algorithms on general purpose, special purpose, and DR computers
Calculate the execution time of an image processing algorithm on a concept DR computer
Compare the reconfiguration time of a Xilinx FPGA with the algorithm execution time calculated in Section 2
Obtain an extremely rough estimate of image processing algorithm execution time on a flight computer
Conclude that the DR computer described offers higher performance than does the flight computer
Section 1: Algorithm Execution on General Purpose (GP), Special Purpose (SP), and DR Computers
Processing example
A computing function is the composition of n algorithms executed serially
Can be executed on a general purpose computer (GP) or a special purpose computer (SP)
Execution on a GP Computer
Processing time of each stage = ti, i=1..n
Total processing time = t1 + t2 + … + tn
Latency time = t1 + t2 + … + tn
GP computer must execute processing stages sequentially, and cannot exploit parallelism in overall computing function
Processing on an SP Processor
Each stage is an independently operating processor designed specifically for the algorithm it executes
Processing time of each stage = ti, i=1..n
Results appear at rate of one per max(ti), i=1..n
Latency time =
Performance increase comes from two factors:
Pipelining of constituent algorithms exploiting parallelism
Processors being designed specifically for their algorithms
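The GP and SP timing relations can be sketched in a few lines of Python. This is only an illustration: the stage times are hypothetical, the function names are not from the paper, and the same stage times are used for both machines even though an SP stage would normally be faster than its GP equivalent. The GP latency is taken to be the sum of the stage times, which is an assumption that follows from serial execution.

```python
def gp_times(stage_times):
    """General purpose computer: the n stages run one after another."""
    total = sum(stage_times)
    # Assumption: with serial execution, latency and the interval between
    # successive results are both the sum of the stage times.
    return {"latency": total, "interval": total}


def sp_times(stage_times):
    """Special purpose pipeline: one result per slowest-stage time."""
    return {"interval": max(stage_times)}


# Hypothetical stage times in milliseconds, for illustration only.
stages = [4.0, 10.0, 6.0]
print("GP interval between results:", gp_times(stages)["interval"])
print("SP interval between results:", sp_times(stages)["interval"])
```

With these made-up numbers the pipeline delivers results twice as often as the serial machine, purely from overlapping the stages.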
Processing on a DR Computer
Two processing elements alternately process and reconfigure, i.e., fodd executes one algorithm while feven reconfigures for the next algorithm, etc.
[Figure: processing elements fodd and feven between Input and Output]
DR Computer Processing Flow
Performance increase comes from configuring processors specifically for the algorithm they are executing
Do not get increase from exploiting parallelism.
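One way to reason about the alternating flow is a small scheduling sketch. The model below is an assumption, not something stated on the slides: algorithm i cannot start until algorithm i-1 has finished, and the element it runs on starts reconfiguring as soon as it finishes algorithm i-2. The function name and all the times are hypothetical.

```python
def dr_schedule(exec_times, reconfig_times):
    """Return a (start, end) pair per algorithm for two alternating elements.

    Assumed model (not from the slides): both elements are idle and
    unconfigured at time 0; algorithms run strictly in order; an element
    begins reconfiguring for algorithm i when it finishes algorithm i-2.
    """
    schedule = []
    for i, (t, r) in enumerate(zip(exec_times, reconfig_times)):
        prev_end = schedule[i - 1][1] if i >= 1 else 0.0
        element_free = schedule[i - 2][1] if i >= 2 else 0.0
        # Wait for both the input data and the finished reconfiguration.
        start = max(prev_end, element_free + r)
        schedule.append((start, start + t))
    return schedule


# Hypothetical execution and reconfiguration times, arbitrary units.
print(dr_schedule([30.0, 50.0, 40.0], [20.0, 20.0, 60.0]))
# [(20.0, 50.0), (50.0, 100.0), (110.0, 150.0)]
```

In this made-up case the third algorithm waits 10 time units because its reconfiguration takes longer than the second algorithm's execution; when reconfiguration is shorter than the preceding execution, it is fully hidden.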
Section 2: Execution Time of an Image Processing Algorithm on a Concept DR Computer
DR Computer Concept
RAM0 is source for FPGA0, destination for FPGA1, etc.
Processing elements are implemented in FPGAs
FPGA0 and FPGA1 alternately process and reconfigure, as previously discussed.
[Figure: FPGA0 and FPGA1 connected to RAM0 and RAM1; input and output not shown]
Algorithm Example: 3x3 Image Convolution
Rows are shifted in from the source RAM one at a time, pixel-serially, and then shifted in parallel up into the three upper row registers, where they circulate through the convolution processor. All row registers and processing are inside the FPGA. The results are written to the destination RAM after a latency of 3 row reads.
[Figure: source RAM feeds row i+2 one pixel at a time; rows i-1, i, and i+1, each one image width in pixels long, are parallel-shifted up and circularly shifted through the 3x3 convolution processor, which writes to the destination RAM]
Convolution Operation
Used, for example, to compute the intensity gradient (derivative) at pixel (i,j)
Result = P(i-1,j-1)*m11+P(i-1,j)*m12+P(i-1,j+1)*m13+…+P(i+1,j+1)*m33
[Figure: pixel array and convolution mask]
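As a software reference for the sum of products above, here is a minimal Python sketch. It is not the FPGA design: it skips border pixels (the slides do not say how image edges are handled), and the Sobel mask and the tiny test image are chosen only for illustration.

```python
def convolve3x3(pixels, mask):
    """Apply a 3x3 mask to every interior pixel of a 2D list of numbers.

    Border pixels are skipped (an assumption), so the output has 2 fewer
    rows and 2 fewer columns than the input.
    """
    rows, cols = len(pixels), len(pixels[0])
    out = []
    for i in range(1, rows - 1):
        out_row = []
        for j in range(1, cols - 1):
            acc = 0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    # mask[0][0] plays the role of m11, mask[2][2] of m33.
                    acc += pixels[i + di][j + dj] * mask[di + 1][dj + 1]
            out_row.append(acc)
        out.append(out_row)
    return out


# Horizontal-gradient (Sobel) mask and a small image with a vertical edge.
sobel_x = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]
image = [[0, 0, 10, 10],
         [0, 0, 10, 10],
         [0, 0, 10, 10]]
print(convolve3x3(image, sobel_x))  # [[40, 40]]
```

Both interior pixels sit next to the intensity step, so both report the same gradient value.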
Convolution Calculation
Arithmetic processing may require some pipelining
[Figure: convolution arithmetic producing Result(i,j)]
Convolution Timing
Total time = latency + processing = 20.971 msec
This assumes we can get pixels into the FPGA at a 20 nsec/pixel rate
Latency = time to read 3 rows:
1024 pixels *3 rows * 20 nsec/pixel = 61 usec
Processing = time to stream remaining 1021 rows through and process:
1024 * 1021 * 20 nsec = 20.910 msec
Larger convolutions (e.g., 7x7) have longer latencies, but same computation time
Calculation is for a mono image; a stereo image would take twice as long.
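The timing arithmetic can be checked with a short Python sketch. The 1024-row image height is inferred from the 3 latency rows plus the 1021 remaining rows; the function and parameter names are illustrative, not from the paper.

```python
def convolution_time_ns(width, height, kernel_rows, pixel_time_ns):
    """Return (latency, processing, total) in nanoseconds for a streamed image."""
    latency = width * kernel_rows * pixel_time_ns
    processing = width * (height - kernel_rows) * pixel_time_ns
    return latency, processing, latency + processing


latency, processing, total = convolution_time_ns(1024, 1024, 3, 20)
print(latency / 1e3, "usec latency")        # 61.44
print(processing / 1e6, "msec processing")  # 20.91008
print(total / 1e6, "msec total (mono)")     # 20.97152
print(2 * total / 1e6, "msec total (stereo)")
```

Under this model the total depends only on the pixel count and pixel rate, so a 7x7 kernel changes the split between latency and processing but not the total.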
Section 3: Comparing the Reconfiguration Time of a Xilinx FPGA With the Algorithm Execution Time Calculated in Section 2
DR Computer Processing Element: Virtex-4 LX FPGA
Eight versions:
XC4VLX15, -25, -40, -60, -80, -100, -160, -200
Logic hierarchically arranged:
2 flip-flops per slice
4 slices per CLB
Time to Configure FPGA
[Figure: FPGA configuration sequence showing PROG_B, INIT_B, CCLK, and DONE, with the Tpl, Tconfig, and total configuration time intervals marked]
Configuration Timing: Tpl
Tpl = 0.5 usec/frame
“frame” is a unit of configuration RAM
Tpl period clears configuration RAM
Configuration Timing: Tconfig
FPGA programmed by bitstream
CCLK (programming CLK) can run at 100 MHz
Parallel mode loads 8 bits per CCLK
Total Configuration Time
Total configuration time = Tpl + Tconfig, plus some extra time amounting to a few CCLK cycles (@ 10 nsec each)
Processing and Reconfiguration Time Comparison
Convolution execution is faster than reconfiguration
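The configuration-time estimate can be sketched as follows, assuming the total is simply Tpl plus Tconfig and ignoring the few extra CCLK cycles. The per-frame clearing time, CCLK rate, and bus width are the figures quoted on the slides; the frame count and bitstream size in the example are placeholders, not data for any real device, and would have to be taken from the Xilinx documentation.

```python
def configuration_time_s(frames, bitstream_bits,
                         tpl_per_frame_s=0.5e-6,  # Tpl = 0.5 usec/frame
                         cclk_hz=100e6,           # CCLK at 100 MHz
                         bits_per_cclk=8):        # parallel mode, 8 bits/CCLK
    """Estimate Tpl + Tconfig in seconds (extra CCLK cycles ignored)."""
    t_pl = frames * tpl_per_frame_s
    t_config = bitstream_bits / bits_per_cclk / cclk_hz
    return t_pl + t_config


# Placeholder values only: NOT the frame count or bitstream size of a real part.
example = configuration_time_s(frames=10_000, bitstream_bits=8_000_000)
print(f"{example * 1e3:.1f} msec")
```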
Convolution = 21 msec mono, 42 msec stereo
Reconfiguration = 81 msec
Assuming -200 device
Processing shown is well within FPGA’s capabilities
More complex algorithms may require use of FPGA performance features
Much higher internal clock rates
Large internal RAM
Dedicated arithmetic support in –SX series
What this shows is that it’s reasonable to consider alternating execution and reconfiguration of two FPGAs
Section 4: An Extremely Rough Estimate of Image Processing Algorithm Execution Time on a Flight Computer
GP Computing Performance Estimate
DANGER: really rough estimate!
Based on data from this paper:
“Stereo Vision and Rover Navigation Software for Planetary Exploration”, Steven B. Goldberg, Indelible Systems; Mark Maimone, Larry Matthies, JPL; 2002 IEEE Aerospace Conference
Available at robotics.jpl.nasa.gov/people/mwm/visnavsw/aero.pdf
Describes processing and algorithms to be used on 2004 Rover missions, and Rover requirements.
Published Vision Algorithm Timing
Timed on Pentium III 700 MHz CPU, 32K L1 cache, 256K L2 cache, 512M RAM, Win2K
Algorithms explicitly timed (names from paper):
The Gaussian and most vision algorithms involve neighborhood operations that are comparable to an image convolution of some size
Flight Computer Performance
Flight processor is RAD6000
GESTALT Navigation algorithm timed on 3 processors:
Assume that the RAD6000 takes 7 times as long as the 500 MHz Pentium
Final Performance Estimate
Assume RAD6000 time = 7 times the 500 MHz Pentium time
Assume 500 MHz Pentium time = 7/5=1.4 times the 700 MHz Pentium time
Then, RAD6000 time is 1.4*7=9.8 times the 700 MHz Pentium time
Vision algorithm timing can be estimated as follows:
Remember: This is a really rough estimate!!
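The scaling chain can be written out as a short Python sketch. The two factors are the assumptions stated on the slide; the 100 msec input is a hypothetical Pentium III timing used only to exercise the function, not a figure from the cited paper.

```python
RAD6000_VS_P500 = 7.0        # slide assumption: RAD6000 takes 7x the 500 MHz Pentium time
P500_VS_P700 = 700 / 500     # slide assumption: clock ratio, 7/5 = 1.4


def rad6000_estimate_ms(p700_ms):
    """Scale a 700 MHz Pentium III timing to a very rough RAD6000 timing."""
    return p700_ms * RAD6000_VS_P500 * P500_VS_P700


# Hypothetical 100 msec measurement on the 700 MHz Pentium III.
print(round(rad6000_estimate_ms(100.0), 1), "msec on RAD6000 (very rough)")
```

Because both factors are coarse assumptions, the result is only an order-of-magnitude indication.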
Section 5: Conclusions
What We Have Shown
We have shown that the concept DR computer presented executes a 3x3 neighborhood-type algorithm “a lot” faster than it appears that a RAD6000 executes what are probably a bunch of neighborhood algorithms.
The reader is cautioned to not try to quantify what “a lot” means based on the data given here.
But, it’s a good enough estimate to tell us that this is worth looking into in more detail.
Conclusions
Xilinx-based DR computer shows promise for performance enhancement of a vision system
By extension, the DR computer shows promise for the performance enhancement of other algorithms.