Proceedings Volume 4212

Reconfigurable Technology: FPGAs for Computing and Applications II

John Schewel, Peter M. Athanas, Chris H. Dick, et al.
View the digital version of this volume at SPIE Digital Library.

Volume Details

Date Published: 6 October 2000
Contents: 6 Sessions, 20 Papers, 0 Presentations
Conference: Information Technologies 2000
Volume Number: 4212

Table of Contents

  • Applications I
  • Tool/Techniques I
  • Tools/Techniques II
  • Applications II
  • Applications III
  • Devices and Systems
Applications I
Hardware-based image processing library for Virtex FPGA
Marek Gorgon, Ryszard Tadeusiewicz
The paper considers hardware-based realization of image processing algorithms. The use of a single FPGA device (Virtex) as a processing element capable of carrying out image processing in real time is thoroughly discussed. Specialized IP core architectures have been designed and tested for implementing the algorithms in hardware resources. An image-processing library is proposed, consisting of individual cores that can be linked together at the software level and implemented in high-capacity FPGA devices.
Approach to constructing reconfigurable computer vision system
Jianru Xue, Nanning Zheng, Xiaoling Wang, et al.
In this paper, we propose an approach to constructing a reconfigurable vision system. We found that timely and efficient execution of early tasks can significantly enhance the performance of whole computer vision tasks, so we abstract out a set of basic, computationally intensive stream operations that may be performed in parallel and embody them in a series of specific front-end processors. These processors, which are based on FPGAs (field-programmable gate arrays), can be reprogrammed to produce a range of different types of feature maps, such as edge detection and linking or image filtering. The front-end processors and a powerful DSP constitute a computing platform that can perform many CV tasks. Additionally, we adopt focus-of-attention techniques to reduce the I/O and computational demands by performing early vision processing only within a particular region of interest. We then implement a multi-page, dual-ported image memory interface between the image input and the computing platform (including the front-end processors and the DSP). Early vision features are loaded into banks of dual-ported image memory arrays, which are continually updated at high speed by raster scan from the input image or video data stream. Moreover, the computing platform has completely asynchronous, random access to the image data or any other early vision feature maps through the dual-ported memory banks. In this way, the computing platform resources can be properly allocated to a region of interest and decoupled from the task of dealing with a high-speed serial raster scan input. Finally, we choose the PCI bus as the main channel between the PC and the computing platform. Consequently, the front-end processors' control registers and the DSP's program memory are mapped into the PC's memory space, which gives the user access to reconfigure the system at any time. We also present test results of a computer vision application based on the system.
Face recognition by using feature position extraction and feature geometry comparison
In this face recognition research, the head is fixed when a photograph is taken. The infrared diodes provide the only illumination. In front of the CCD camera, a light filter lens is used to filter all other light. After the photograph is taken, the eyebrows, eyes, nostrils, lips, and face contour are extracted separately. The shape, size, object-to-object distance, center and orientation are found for each extracted object. The techniques to solve the object shifting and rotating problems are investigated. Image subtraction is used to examine the geometric differences of the two different faces. The obtained classifying data in this research can accurately classify different people's faces.
FPGA implementation of the pixel purity index algorithm
Dominique D. Lavenier, James P. Theiler, John J. Szymanski, et al.
The Pixel Purity Index (PPI) is an algorithm employed in remote sensing for analyzing hyperspectral images. Particularly for low-resolution imagery, a single pixel usually covers several different materials, and its observed spectrum is (to a good approximation) a linear combination of a few pure spectral shapes. The PPI algorithm tries to identify these pure spectra by assigning a pixel purity index to each pixel in the image; the spectra for those pixels with a high index value are candidates for basis elements in the image decomposition. The PPI algorithm is extremely time consuming but is a good candidate for parallel hardware implementation due to its high volume of independent dot-product calculations. This article presents two parallel architectures we have developed and implemented on the Wildforce board. The first one is based on bit-serial arithmetic operators and the second deals with standard operators. Speed-up factors of up to 80 have been measured for these hand-coded architectures. In addition, the second version has been synthesized with the Streams-C compiler. The compiler translates a high level algorithm expressed in a parallel C extension into synthesizable VHDL. This comparison provides an interesting way of estimating the tradeoff between a traditional approach which tailors the design to get optimal performance and a fully automatic approach which aims to generate a correct design in minimal time.
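The abstract does not spell out the steps of PPI, so the following is only a minimal software sketch of the commonly described form of the algorithm: project every pixel onto many random directions and count how often each pixel lands at an extreme. The function name, the term "skewer" for a random direction, and the parameters are illustrative assumptions, not taken from the paper.

```python
import random


def pixel_purity_index(pixels, num_skewers, seed=0):
    """Count how often each pixel is an extreme of a random projection.

    pixels: list of spectra (lists of numbers, one value per band).
    Returns one count per pixel; high counts suggest "pure" spectra.
    """
    rng = random.Random(seed)
    bands = len(pixels[0])
    counts = [0] * len(pixels)
    for _ in range(num_skewers):
        # Random direction ("skewer"); a hypothetical choice of Gaussian components.
        skewer = [rng.gauss(0.0, 1.0) for _ in range(bands)]
        # One independent dot product per pixel: the part that parallelizes well.
        proj = [sum(p * s for p, s in zip(pixel, skewer)) for pixel in pixels]
        # Credit the pixels at both ends of the projection.
        counts[proj.index(max(proj))] += 1
        counts[proj.index(min(proj))] += 1
    return counts
```

In this sketch, a pixel whose spectrum is a mixture of two others always projects between them, so it collects no counts, while the two pure spectra share all of them.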
Application of a dynamically reconfigurable cell-array processor to an MPEG-2 video decoder
Kiyotaka Komoku, Fumihiro Hatano, Takayuki Morishita, et al.
We have proposed and developed the Dynamically Reconfigurable Cell-Array Processor (DRCAP), which consists of functional Cell Arrays (CAs) and buses/bus switches that provide connections between CAs. A software simulator of the DRCAP has been constructed, on which an MPEG-2 video decoder is successfully implemented. This MPEG-2 decoder dynamically changes its configuration many times during the decoding process. The processing is executed for every macroblock, reconfiguring for each component process of MPEG-2 decoding, such as variable-length decoding, dequantization, the inverse DCT, and so on. The resources required for the DRCAP to decode an MPEG-2 MP@ML video stream are investigated. The simulation shows that the numbers of CAs needed to decode the MPEG-2 MP@ML video stream are 8 PCAs, 1 LCA, 2 CCAs and 35 MCAs, and that the required execution frequency is 94.6 MHz. When all configurations are doubled, so that the same two processes are executed in parallel, the numbers of CAs are 15, 1, 4 and 69 for PCA, LCA, CCA and MCA, respectively, and an execution frequency of 55.9 MHz is required.
Tool/Techniques I
VirtexDS: a Virtex device simulator
Scott P. McMillan, Brandon J. Blodget, Steven A. Guccione
Until recently FPGAs have been used almost exclusively to implement static circuits. Because FPGAs can be reprogrammed at any time, even in-system at run-time, interest in exploiting this mode of operation has steadily increased. One barrier to widespread use of Run-Time Reconfiguration (RTR) has been the lack of design tools. While tools such as JBits have begun to provide basic support for design entry, traditional verification tools such as simulators have been lacking. This paper discusses VirtexDS, a device level simulator for the Xilinx Virtex (tm) series. The approach taken by VirtexDS is to simulate at the device level, providing an interface which operates much like actual hardware. This approach not only supports simulation for run-time reconfiguration, but also interfaces easily to existing tools. In addition, this low-level simulation approach can provide higher performance than higher-level approaches to simulation.
Fast scheduling and placement methods for C to hardware/software compilation
Kia Bazargan, Majid Sarrafzadeh
Advances in FPGA technology, both in terms of device capacity and architecture, have resulted in the introduction of reconfigurable computing machines, where the hardware adapts itself to the running application to gain speedup. To keep up with the ever-growing performance expectations of such systems, designers need new methodologies and tools for developing reconfigurable computing systems (RCS). This paper addresses the need for a fast compilation and physical design phase to be used in the application development/debugging/testing cycle for RCS. We present a high-level synthesis approach that is integrated with placement, hence making the compilation cycle much faster. On average, our tool generates the VHDL code (and the corresponding placement information) from the data flow graph of a program in less than a minute. At the cost of a factor of 1.3 in the quality of the design, we achieve a 10.7 times speedup in the Xilinx placement phase and a 2.5 times overall speedup in the Xilinx place-and-route phase.
Object-oriented meta tools for reconfigurable architectures
Loic Lagadec, Bernard Pottier
A number of experimental and commercial reconfigurable architectures are designed with various objectives: random logic integration, hardware prototyping, computation accelerators, planar smart sensors or transducers, etc. Getting a new reconfigurable part to the final user remains a very difficult task, because there are no common tools, nor are there standard models that provide retargetable software development tools. A generic model for reconfigurable circuits has been built in three stages: full implementation of tools for a practical platform; creation of an abstract model and associated tools for arbitrary architectures (programmable editor, geometric operations on physical modules, place and route); and description tools for a concrete architecture defined as a specialization of the abstract model. The main advantages of this approach, and the further fields of research based on it, are retargetable tools based on a description of the new architecture, with the possibility of embedding new primitives, and the possibility of a quantitative approach to the design of new reconfigurable architectures.
Tools/Techniques II
Effect of data truncation in an implementation of pixel clustering on a custom computing machine
Miriam E. Leeser, James P. Theiler, Michael Estlick, et al.
We investigate the effect of truncating the precision of hyperspectral image data for the purpose of more efficiently segmenting the image using a variant of k-means clustering. We describe the implementation of the algorithm on field-programmable gate array (FPGA) hardware. Truncating the data to only a few bits per pixel in each spectral channel permits a more compact hardware design, enabling greater parallelism, and ultimately a more rapid execution. It also enables the storage of larger images in the onboard memory. In exchange for faster clustering, however, one trades off the quality of the produced segmentation. We find, however, that the clustering algorithm can tolerate considerable data truncation with little degradation in cluster quality. This robustness to truncated data can be extended by computing the cluster centers to a few more bits of precision than the data. Since there are so many more pixels than centers, the more aggressive data truncation leads to significant gains in the number of pixels that can be stored in memory and processed in hardware concurrently.
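The trade-off described above can be sketched in software. This is a minimal illustration only: the abstract says a variant of k-means is used but does not specify it, so the plain Lloyd iterations, the squared Euclidean distance, and all names below are assumptions. Floating-point centers stand in for the idea of keeping centers at higher precision than the truncated data.

```python
def truncate(pixel, total_bits, keep_bits):
    """Keep only the top keep_bits of each total_bits-wide channel value."""
    shift = total_bits - keep_bits
    return [v >> shift for v in pixel]


def nearest_center(pixel, centers):
    """Index of the center closest to pixel (squared Euclidean distance)."""
    def dist2(center):
        return sum((p - q) ** 2 for p, q in zip(pixel, center))
    return min(range(len(centers)), key=lambda i: dist2(centers[i]))


def kmeans(pixels, centers, iterations=10):
    """Plain k-means: alternate assignment and center update."""
    for _ in range(iterations):
        labels = [nearest_center(p, centers) for p in pixels]
        updated = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(pixels, labels) if lab == k]
            if members:
                # Centers are means, held at higher precision than the data.
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(old)  # keep an empty cluster's center unchanged
        centers = updated
    labels = [nearest_center(p, centers) for p in pixels]
    return labels, centers
```

For well-separated clusters, clustering `truncate(p, 12, 4)` for each pixel gives the same labels as clustering the full 12-bit data, while each channel needs only a third of the storage.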
XVPI: a portable hardware/software interface for Virtex
Prasanna Sundararajan, Steven A. Guccione
XVPI, the Xilinx Virtex Portable Interface, is a hardware / software interface and specification to assist in the design and implementation of Xilinx Virtex (tm) based systems. XVPI specifies a software accessible register to be defined in the hardware. This register contains all of the control and data signals necessary to drive the Virtex device. The software supplied with XVPI uses this register to read and write control and data signals to perform various device level functions. These functions combine to produce an Application Program Interface (API) which provides access to the Virtex device from software. The XVPI API supports all of the device level operations including partial configuration download, partial configuration readback, clock control and reset. Once the system is operational, designers may replace the software routines in the XVPI API with hardware assisted routines. This increases the system performance incrementally, without affecting the functionality. Though specified for the Virtex based system, this technique to perform the device level functions from software can be applied to any FPGA device. Additionally, XVPI is also supplied with an interface supporting Xilinx's JBits toolkit. Once XVPI is implemented, JBits and its associated applications, including the BoardScope debug tool, are fully operational on that Virtex based system.
Compiler for a dynamically reconfigurable processor with cell-array structures
Fumihiro Hatano, Takayuki Morishita, Kiyotaka Komoku, et al.
An extensive study has been made of the reconfigurable Cell-Array processor, which realizes very high-speed parallel computations. Our processor features an architecture whose configuration can be dynamically and optimally rearranged in real time by changing the pipeline length or the registry-area depth in macro-Cell units of memories and accumulators. This processor therefore requires much smaller instruction codes and a much shorter reconfiguration time than conventional FPGAs with small-scale logic layers. This paper describes a newly developed compiler as an alternative to circuit design, which requires long labor even from a well-trained designer. The compiler analyzes a program written in the C language, which allows many existing programs to be reused, and produces instruction codes containing information about hardware configurations. According to the structure of the hardware used, the compiler can find the optimal configuration, including the most efficient depth of pipelined accumulations and parallel calculations. The details of the program analysis are shown, together with the utility of the compiler for the reconfigurable cell-array processor.
Applications II
Parallel k-mismatching of strings using daughter-board structure
Toomas P. Plaks
This paper presents a family of scalable regular array structures for the k-mismatches problem for a reference string of length n and a pattern of length m. The conventional regular array of size O(m) computes in O(n + m) time. The drawbacks of this solution are, first, that the performance is bounded by the length of the pattern and, second, the long latency. In this paper we present regular array solutions where the array size is parameterized by the number s, 1<=s<=n, of parallel input/output channels. The array of size O(sm) computes in O(n/s + m) time. In order to reduce the latency, tree-like structures for computing the reduction are used, reducing the time complexity to O(n/s + m^{1/r}), where r is the dimensionality of the array. The number of patterns can be l, in which case the time complexity increases to O(n/s + m^{1/r}+l) using O(sml) processors. The proposed regular arrays are suitable for real-time applications and are efficiently mapped onto FPGAs using a daughter-board structure.
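For reference, the k-mismatches problem itself can be stated as a short sequential program. This is a plain O(nm) software baseline with a hypothetical function name, not the array architecture from the paper; it only shows what the hardware computes.

```python
def k_mismatch_positions(text, pattern, k):
    """Return (position, mismatches) for every alignment with at most k mismatches."""
    n, m = len(text), len(pattern)
    hits = []
    for i in range(n - m + 1):
        # Compare the pattern with the length-m window starting at position i.
        mismatches = sum(1 for a, b in zip(text[i:i + m], pattern) if a != b)
        if mismatches <= k:
            hits.append((i, mismatches))
    return hits
```

Each window comparison is independent of the others, which is what makes it possible to spread the work across s parallel input/output channels.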
DDGIPS: a general image processing system in robot vision
Yuan Tian, Jun Ying, Xiuqing Ye, et al.
Real-time image processing is the key task in robot vision. Because of limitations in hardware technology, many algorithm-oriented firmware systems were designed in the past, but their architectures were not flexible enough to serve as a multi-algorithm development system. With the rapid development of microelectronics, many high-performance DSP chips and high-density FPGA chips have become available, which makes it possible to construct a more flexible architecture for real-time image processing systems. In this paper, a Double DSP General Image Processing System (DDGIPS) is presented. We construct a two-DSP-based FPGA computational system with two TMS320C6201s. The TMS320C6x devices are fixed-point processors based on an advanced VLIW CPU, which has eight functional units, including two multipliers and six arithmetic logic units. These features make the C6x a good candidate for a general-purpose system. In our system, the two TMS320C6201s each have a local memory space, and they also have a shared system memory space that enables them to intercommunicate and exchange data efficiently. At the same time, they can be directly interconnected in a star-shaped architecture. All of this is under the control of an FPGA group. As the core of the system, the FPGA plays a very important role: it takes charge of DSP control, DSP communication, memory space access arbitration and communication between the system and the host machine. By taking advantage of FPGA reconfiguration, all of the interconnections between the two DSPs or between a DSP and the FPGA can be changed. In this way, users can easily rebuild the real-time image processing system according to the data stream and the task of the application, gaining great flexibility.
Integration of biologically plausible vision systems for controlling autonomous robots
Rene Zapata, P. Lepinay, Lionel Torres, et al.
This paper describes the realization of a biologically-plausible integrated vision system for implementing reactive behaviors of mobile robots. The starting point is the coupling of a vision stereo-matching algorithm with a collision avoidance method called DVZ (Deformable Virtual Zone). Some experiments have been carried out with MATLAB in order to test the validity of this method. Very interesting results have already been obtained in simulation and a prototyping board is presented.
Dynamic circuit specialization of a CORDIC processor
Eric R. Keller
A significant advantage of run-time reconfiguration (RTR) is that circuitry can be minimized or the delay path can be reduced through customization based on the current problem. Two reconfiguration techniques are common in today's applications: context switching and constant folding. This paper describes an implementation of a CORDIC processor and shows how other reconfiguration techniques, including run-time routing, can be applied with low overhead if care is taken. CORDIC is an algorithm for efficiently calculating several different functions in hardware, such as sine and cosine. Through the use of JBits and JRoute, the CORDIC processor can be customized to the specific problem and can be reconfigured at run-time, with minimal changes, to fit another problem. In doing so, the circuit is only as complex as it needs to be, and an optimal circuit is maintained. In this paper the CORDIC algorithm and its implementation in an FPGA are described. The implementation is specific to the mode of operation of the CORDIC processor; it is not general purpose. That reduces the complexity of the circuit, and RTR can be used to fit the given problem. Also presented is what has been done to reduce the run-time overhead of changing between different modes.
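To make the CORDIC idea concrete, here is a minimal floating-point sketch of the standard rotation mode, which yields sine and cosine. It is not the paper's FPGA design: the function name and iteration count are assumptions, and the multiplications by powers of two below would be wire shifts in hardware.

```python
import math


def cordic_sin_cos(angle, iterations=24):
    """Approximate (sin, cos) of angle in radians, for |angle| up to about 1.74."""
    # Elementary rotation angles atan(2^-i), normally stored in a small table.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Each micro-rotation stretches the vector; pre-divide by the total gain.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = 1.0 / gain, 0.0, angle
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0  # rotate toward zero residual angle
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return y, x
```

Only additions, subtractions, shifts, and a table lookup are needed per iteration, which is why the algorithm suits FPGA logic, and why fixing the mode of operation lets the circuit be trimmed further.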
Applications III
Building a flexible trigger system for high-energy physics
David G. Cussans, Dave M. Newbold, Greg P. Heath, et al.
In the 17th century Sir Isaac Newton wrote "These are therefore the Agents in Nature able to make the Particles of Bodies stick together by very strong Attractions, And it is the Business of Experimental Philosophy to find them out". In the 21st century the Large Hadron Collider (LHC) continues that "business". Bunches of protons will be accelerated to an energy of 7TeV/c per proton around a 27km circumference ring buried underneath French/Swiss countryside near Geneva. At a number of points around the ring counter-rotating bunches of protons will be passed through each other. Some of the protons will interact violently and the new particles generated will fly outwards into detectors surrounding the interaction region. Typically these detectors have millions of channels and data flows out of the "front-end" at about 400EBytes/year. Data can only be stored at about 1PByte/year, a factor of 4x10^5 less. Fortunately most interactions involve physical processes that are already understood and a multi-level trigger system is used to select data from interesting or unexpected events. This paper describes the Global Calorimeter Trigger (GCT), part of the trigger system for the Compact Muon Solenoid (CMS) detector at the LHC.
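As a quick check of the quoted reduction factor, assuming decimal prefixes (1 EByte = 10^18 bytes, 1 PByte = 10^15 bytes):

```latex
\[
\frac{400\ \text{EBytes/year}}{1\ \text{PByte/year}}
= \frac{400 \times 10^{18}\ \text{bytes/year}}{10^{15}\ \text{bytes/year}}
= 4 \times 10^{5}
\]
```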
High-performance reconfigurable constant coefficient multiplier implementations
Philip B. James-Roxby, Brandon J. Blodget
The use of dynamic reconfiguration appears extremely attractive for implementing adaptive processing algorithms. Often, the adaption involves updating look-up tables based on a parameter which can only be determined at run-time. For reasons of efficiency, these look-up tables are read-only to the rest of the circuitry. This paper compares the use of run-time reconfiguration and read-only look-up tables, with similar implementations using writable memories. The application under consideration is the multi-layer perceptron neural network. It is shown that the ROM based network is considerably simpler than the RAM based network, at the expense of a dramatically increased time to update the weights during training.
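One common way to build a constant coefficient multiplier from look-up tables can be sketched as follows. This is an illustrative software model under stated assumptions, not necessarily the structure used in the paper: the 4-bit digit width and both function names are hypothetical.

```python
def build_table(coefficient, digit_bits=4):
    """Precompute coefficient * d for every possible input digit d."""
    return [coefficient * d for d in range(1 << digit_bits)]


def lut_multiply(x, table, digit_bits=4):
    """Multiply a non-negative integer x by the table's coefficient.

    The input is split into digits; each digit indexes the read-only table,
    and the partial products are shifted into place and summed.
    """
    mask = (1 << digit_bits) - 1
    product, shift = 0, 0
    while x:
        product += table[x & mask] << shift
        x >>= digit_bits
        shift += digit_bits
    return product
```

In this model, changing the coefficient means calling `build_table` again, which plays the role of rewriting the table contents by reconfiguration; the datapath in `lut_multiply` never writes to the table.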
Framework for architecture-independent run-time reconfigurable applications
David I. Lehn, Rhett D. Hudson, Peter M. Athanas
Configurable Computing Machines (CCMs) have emerged as a technology with the computational benefits of custom ASICs as well as the flexibility and reconfigurability of general-purpose microprocessors. Significant effort from the research community has focused on techniques to move this reconfigurability from a rapid application development tool to a run-time tool. This requires the ability to change the hardware design while the application is executing and is known as Run-Time Reconfiguration (RTR). Widespread acceptance of run-time reconfigurable custom computing depends upon the existence of high-level automated design tools. Such tools must reduce the designer's effort to port applications between different platforms as the architecture, hardware, and software evolve. A Java implementation of a high-level application framework, called Janus, is presented here. In this environment, developers create Java classes that describe the structural behavior of an application. The framework allows hardware and software modules to be freely mixed and interchanged. A compilation phase of the development process analyzes the structure of the application and adapts it to the target platform. Janus is capable of structuring the run-time behavior of an application to take advantage of the memory and computational resources available.
Devices and Systems
Programming method and a management unit in a reconfigurable processor with cell-array structure for general-purpose calculation
Takayuki Morishita, Kiyotaka Komoku, Fumihiro Hatano, et al.
We have been developing a parallel processor that can be reconfigured to realize general-purpose computation at very high speed. Dynamic reconfiguration means changing the kind and number of processing elements, and the connections between them, in real time. This function realizes the optimum hardware for executing any software and decreases the scale of hardware needed to execute it. Our proposed processor is composed of cells, which perform any kind of processing, and buses, which provide any routing between cells. In this processor we support reconfiguration based on macro cells that are much larger than a logic cell. The shorter command length obtained by using macro cells is what makes dynamic reconfiguration feasible. We improve the usage of resources by separating the cells into four kinds: an arithmetic calculation cell, a logic calculation cell, a memory cell and a counter cell. This processor can be programmed by two methods: circuit design and C language programming. In this paper, we extend the circuit design technique to describe dynamic reconfiguration of circuit components and propose a management unit in the reconfigurable processor that is necessary for executing a program. Finally, we examine the performance of this processor using an example program.
Debug of reconfigurable systems
Tim Price, Delon Levi, Steven A. Guccione
While FPGA design tools have progressed steadily, availability of tools to aid in debug of FPGA-based systems has lagged. In particular, support for debug of run-time reconfigurable (RTR) systems has been all but absent. In this paper we describe DDTScript, a scripting language to aid in design, debug and test of RTR systems. The DDTScript language is fully integrated with the BoardScope graphical debug tool and permits parameterization, instantiation, and removal of Run-time Parameterizable (RTP) Cores and other configurable circuit components. DDTScript also provides control of system level resources and supplies access to device state and configuration data. DDTScript is currently used not only to test and debug RTP Cores, but to construct and interact with complete designs. DDTScript is currently part of the JBits tool suite and supports the Xilinx Virtex family of FPGA devices.