Reconfigurable Technology: FPGAs and Reconfigurable Processors for Computing and Communications III

Fault injection emulator for field-programmable gate arrays

Thomas Slaughter, Charles Stroud, John Emmert, et al.

As the prevalence and size of Field Programmable Gate Arrays (FPGAs) have increased, so too has the complexity of manufacturing testing and defect diagnosis for yield enhancement. The re-programmability of FPGAs has attracted considerable interest in the ability to re-program a system function to avoid any known faults. As a result, various test, diagnostic, and fault tolerant techniques have been developed for FPGAs. However, the evaluation of the effectiveness of these techniques is nearly impossible using traditional fault simulation techniques due to the size and complexity of current FPGAs. We have developed an emulation procedure to inject faults into FPGAs in such a way that the faults are actually emulated in the physical FPGA. By determining proper bit locations within the configuration memory of the FPGA, download files used to program the FPGA can be manipulated to emulate faults including stuck-at faults, bridging faults, and opens in the programmable logic and routing resources of the FPGA. Almost any combination of faults can be emulated spatially (allowing for either clustering or random distributions) and/or temporally (allowing for the simulation of burst or random faults over time).
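
As a rough illustration of manipulating a download file at the bit level, the sketch below forces or flips individual bits in a byte buffer that stands in for configuration data. The buffer, the bit indices, and the function names `force_bit` and `inject_random_faults` are hypothetical; real bit locations depend on the device's configuration memory layout, which the abstract does not give.

```python
import random

def force_bit(config: bytearray, bit_index: int, value: int) -> None:
    """Force one bit of the buffer to 0 or 1 (a stuck-at style edit)."""
    byte, offset = divmod(bit_index, 8)
    mask = 1 << offset
    if value:
        config[byte] |= mask
    else:
        config[byte] &= ~mask & 0xFF

def inject_random_faults(config: bytearray, count: int, seed: int = 0) -> list[int]:
    """Flip `count` distinct, randomly chosen bits and return their indices."""
    rng = random.Random(seed)
    chosen = rng.sample(range(len(config) * 8), count)
    for bit_index in chosen:
        byte, offset = divmod(bit_index, 8)
        config[byte] ^= 1 << offset
    return chosen

# Placeholder data only: a real download file would be read from disk.
golden = bytearray(16)
faulty = bytearray(golden)
force_bit(faulty, 10, 1)                        # one targeted fault
flipped = inject_random_faults(faulty, 3, 1)    # three randomly placed faults
```

Keeping the unmodified buffer alongside the faulty copy makes it straightforward to restore the original configuration between experiments.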

Transformation from C-program to circuitry for a dynamically reconfigurable cell array processor

Takayuki Morishita, Kiyotaka Komoku, Fumihiro Hatano, et al.

We have been developing a parallel processor whose hardware can be reconfigured according to the software. Dynamic reconfiguration means changing the kind and number of processing elements, and the connections between them, in real time. Our proposed processor creates a very long pipeline, which is able to execute for-loop calculations at very high speed. In this paper, we develop an algorithm that automatically transforms a C program into a circuit diagram. In particular, we consider how to process if statements and for statements, and realize high-performance processing of them by pipelining. The automatic transformation program is written in C. Finally, we examine the performance of this processor using an MPEG decoding program.

Designing application-specific cores using JBits: a run-time parameterizable FIR filter

Philip B. James-Roxby

An example of an application-specific core is described, which embodies domain-specific knowledge in the construction of a parameterizable circuit. Application-specific cores present a more familiar interface to the end user, since parameters are now domain-specific rather than core-specific. A FIR filter is described, which is parameterizable at run-time by its desired frequency response. The core can calculate its own weights and a configuration datastream suitable for downloading to a programmable device in a matter of seconds, opening up a new realm of testing, both at the core level and at the system level.
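
To make "parameterizable by its desired frequency response" concrete, here is a minimal sketch that derives FIR weights from a cutoff frequency. It uses a windowed-sinc low-pass design with a Hamming window, which is an assumption chosen for illustration: the abstract does not say how the core computes its weights, and the function names are hypothetical.

```python
import math

def lowpass_fir_weights(num_taps: int, cutoff: float) -> list[float]:
    """Windowed-sinc low-pass weights; cutoff is a fraction of the sample rate (0 < cutoff < 0.5)."""
    mid = (num_taps - 1) / 2
    weights = []
    for n in range(num_taps):
        x = n - mid
        # Ideal low-pass impulse response, with the x == 0 limit handled explicitly.
        ideal = 2 * cutoff if x == 0 else math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        # Hamming window to limit ripple caused by truncating the sinc.
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        weights.append(ideal * window)
    total = sum(weights)
    return [w / total for w in weights]  # normalise to unity gain at DC

def fir_filter(samples: list[float], weights: list[float]) -> list[float]:
    """Direct-form FIR: each output is a weighted sum of current and past inputs."""
    out = []
    for i in range(len(samples)):
        acc = 0.0
        for k, w in enumerate(weights):
            if i - k >= 0:
                acc += w * samples[i - k]
        out.append(acc)
    return out

weights = lowpass_fir_weights(21, 0.1)
smoothed = fir_filter([1.0] * 40, weights)
```

The user-facing parameter here is the cutoff, a domain-level quantity, rather than the tap values themselves.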

Temporal partitioning of circuits for advanced partially reconfigurable systems

Rajanikant Mohan, Aravind R. Dasu, Sethuraman Panchanathan

Reconfigurable architectures are proving to be very effective in applications that involve the implementation of multiple compute-intensive algorithms, which share the same computing modules. With the advent of dynamically reconfigurable architectures, many temporal partitioning algorithms (TPA) have been proposed to address the issue of area and time constraints. The main objective of TPA is to divide a large design into smaller sub-components so that they can be implemented over multiple reconfigurations. In this paper, we propose a new temporal partitioning process (TPP), which includes a modified TPA along with a port reallocation algorithm (PRA) to reduce the reconfiguration time to facilitate real-time implementation. The reduction in reconfiguration time is achieved by employing the knowledge of the function implemented in each logic block thereby effectively reusing the cells in the array in a selective manner. This avoids the need for complete reconfiguration and reduces the net reconfiguration time. The proposed approach has been tested on random graphs and on the MCNC benchmark circuits. Significant reduction in reconfiguration time has been achieved.

Model-based performance analysis for reconfigurable coprocessors

Stephen M. Charlwood, Jon P. Mangnall, Steven F. Quigley

Uni-processor and shared memory UMA multi-processor workstations are currently ubiquitous. The capabilities of such machines are commonly extended through the use of one or more application-specific coprocessors, located on the system expansion/peripheral bus, or a dedicated local bus. It is therefore considered worthwhile to investigate the limits of applicability of FPGA-based reconfigurable coprocessors when used to enhance such machines. In order to do this, it must be possible to estimate performance for coprocessor architectures that do not currently exist. This paper describes a method for generating estimates of performance for applications which make use of such reconfigurable coprocessors. By combining direct measurements on the target platform with model-based estimates and simulation data, estimates of performance can be synthesised which are accurate to better than +/- 5%.

Configuration subsystem design exploration for domain-specific reconfigurable technologies

Milan Vasilko

This paper presents an example of using a reconfigurable systems CAD tool (DYNASTY Framework) for the design exploration of various configuration subsystems in a specific reconfigurable technology. Given a set of application domain benchmarks, it is possible to examine whether a selected device configuration subsystem provides performance suitable for the targeted application domain. The approach is based on a 'plug-in' technology server, implemented in the DYNASTY Framework, which allows for many different configuration subsystems to be modelled at high level. The feasibility of using a specific configuration subsystem for the selected application domain can be assessed without the need to produce transistor-level device models. As an example, the paper presents an evaluation of partial reconfiguration performance for a simple design with two arithmetic modules. The design reconfiguration performance is evaluated for three different configuration subsystems (parallel random-access, frame-based and context-switched) implemented on top of the Xilinx XC6200 device architecture.

Programming high-performance reconfigurable computers

Melissa C. Smith, Gregory D. Peterson

High Performance Computers (HPC) provide dramatically improved capabilities for a number of defense and commercial applications, but often are too expensive to acquire and to program. The smaller market and customized nature of HPC architectures combine to increase the cost of most such platforms. To address the problems with high hardware costs, one may create less expensive Beowulf clusters of dedicated commodity processors. Despite the benefit of reduced hardware costs, programming the HPC platforms to achieve high performance often proves extremely time-consuming and expensive in practice. In recent years, programming productivity gains have come from the development of common APIs and libraries of functions to support distributed applications. Examples include PVM, MPI, BLAS, and VSIPL. The implementation of each API or library is optimized for a given platform, but application developers can write code that is portable across specific HPC architectures. The application of reconfigurable computing (RC) to HPC platforms promises significantly enhanced performance and flexibility at a modest cost. Unfortunately, configuring (programming) the reconfigurable computing nodes remains a challenging task, and relatively little work to date has focused on potential high performance reconfigurable computing (HPRC) platforms consisting of reconfigurable nodes paired with processing nodes. This paper addresses the challenge of effectively exploiting HPRC resources by first considering the performance evaluation and optimization problem before turning to improving the programming infrastructure used for porting applications to HPRC platforms.

Real-time debugger with bitstream configurator and C language design control for FPGAs

Steve Casselman, John Schewel, Frank Wartel

At the boundary between hardware and software, where FPGAs with 2,000,000 gates are just the beginning, we've found that the tools you have in your toolbox make all the difference. With larger and more feature-rich programmable devices such as the Virtex(TM) Platform FPGA, even minor changes in the design can require hours of compile time. The combination of design complexity and component size is taxing current design entry and implementation tools, making the design cycle longer. The simulation-verification cycle doesn't mean the design will work in the final product. The engineer needs, more than ever, to debug designs within the target hardware in real time. We have built a series of integrated tools aimed at enhancing productivity at the last stages of product design, the final ten percent of the design that takes ninety percent of the time. The tools shown are not meant to replace current tools. These are advanced tools for the FPGA power user. Our goal is to create a set of tools specifically designed to provide better design control, dramatically reduce iteration times and enable real-time in-circuit debugging in hardware. We have organized this series of tools as The Technology Stack(TM).

Digital FPGA implementation for Bellman-Ford computation

Wai-ming Fung, Hoi-shing Ng, Kai-pui Lam

The binary relation inference network (BRIN) is an architecture for the realisation of the Bellman-Ford and Floyd-Warshall algorithms. It has been used to solve a range of path problems, including shortest path and minimum spanning tree (MST) on graphs. The previous implementation was performed on an analog platform, by connecting op-amp chips externally. However, the physical size of the circuits would become impractical as the problem size grows. The external connections would also lead to bandwidth problems. The advancement of field programmable gate arrays (FPGAs) in recent years, allowing millions of gates on a single chip and accompanied by high-level design tools, has allowed the implementation of very complex networks. Freed from manual circuit construction and given an efficient design platform, the BRIN architecture can be built in a much more efficient way. Bandwidth problems are removed by moving all previously external connections inside the chip. By transforming BRIN to FPGA (Xilinx XC4010XL and XCV800 Virtex), we implement a synchronous network with computations in a finite number of steps. Two case studies are presented, with correct results verified from both simulation and circuit implementation. Resource consumption on FPGAs is studied, showing that Virtex devices are more suitable for the expansion of the network in future developments.
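
For readers unfamiliar with the underlying computation, the following is a minimal software sketch of Bellman-Ford in a synchronous style, where every node updates from the previous step's values and the result is reached in a finite number of steps. It is a plain software analogue, not the BRIN circuit; the function name and edge-list format are assumptions, and negative cycles are not detected.

```python
INF = float("inf")

def synchronous_bellman_ford(num_nodes, edges, source):
    """edges is a list of (u, v, weight); returns shortest distances from source."""
    dist = [INF] * num_nodes
    dist[source] = 0
    # At most num_nodes - 1 synchronous steps are needed when no negative cycle exists.
    for _ in range(num_nodes - 1):
        new_dist = list(dist)
        for u, v, w in edges:
            # Every update reads only the previous step's values, as in a clocked network.
            if dist[u] + w < new_dist[v]:
                new_dist[v] = dist[u] + w
        if new_dist == dist:
            break  # converged early
        dist = new_dist
    return dist

edges = [(0, 1, 4), (0, 2, 1), (2, 1, 2), (1, 3, 1)]
distances = synchronous_bellman_ford(4, edges, 0)  # [0, 3, 1, 4]
```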

Reconfiguring an FPGA-based RISC for LNS arithmetic

Mark G. Arnold, Mark D. Winkel

Field Programmable Gate Arrays (FPGAs) have some difficulty with the implementation of floating-point operations. In particular, devoting the large number of slices needed by floating-point multipliers prohibits incorporating floating point into smaller, less expensive FPGAs. An alternative is the Logarithmic Number System (LNS), where multiplication and division are easy and fast. LNS also has the advantage of lower power consumption than fixed point. The problem with LNS has been the implementation of addition. There are many price/performance tradeoffs in the LNS design space between pure software and specialised high-speed hardware. This paper focuses on a compromise between these extremes. We report on a small RISC core of our own design (loosely inspired by the popular ARM processor) in which only 4 percent additional investment in FPGA resources beyond that required for the integer RISC core more than doubles the speed of LNS addition compared to a pure software approach. Our approach shares resources in the datapath of the non-LNS parts of the RISC so that the only significant cost is the decoding and control for the LNS instruction. Since adoption of LNS depends on its cost effectiveness (e.g., FLOPs/slice), we compare our design against an earlier LNS ALU implemented in a similar FPGA. Our preliminary experiments suggest modest LNS-FPGA implementations, like ours, are more cost effective than pure software and can be as cost effective as more expensive LNS-FPGA implementations that attempt to maximise speed. Thus, our LNS-RISC fits in the Virtex-300, which is not possible for a comparable design.
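
The asymmetry between LNS multiplication and LNS addition can be seen in a few lines. The sketch below stores a positive value as its base-2 logarithm; handling of sign and zero is omitted, floating-point logs stand in for the fixed-point representation a real design would use, and all function names are hypothetical.

```python
import math

def to_lns(x: float) -> float:
    """Encode a positive real as its base-2 logarithm."""
    return math.log2(x)

def from_lns(e: float) -> float:
    """Decode an LNS value back to a real number."""
    return 2.0 ** e

def lns_mul(a: float, b: float) -> float:
    return a + b  # multiplication becomes an addition of exponents

def lns_div(a: float, b: float) -> float:
    return a - b  # division becomes a subtraction of exponents

def lns_add(a: float, b: float) -> float:
    # log2(x + y) = hi + log2(1 + 2**(lo - hi)); the log2(1 + 2**d) term is the costly part.
    hi, lo = max(a, b), min(a, b)
    return hi + math.log2(1.0 + 2.0 ** (lo - hi))

product = from_lns(lns_mul(to_lns(3.0), to_lns(5.0)))  # approximately 15
total = from_lns(lns_add(to_lns(3.0), to_lns(5.0)))    # approximately 8
```

Multiplication and division reduce to a single add or subtract, while addition needs the nonlinear function log2(1 + 2**d), which is why addition is the operation where dedicated hardware helps.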

Highly reconfigurable communication protocol multiplexing element for SCOPH

Gordon Brebner

The Soft Circuitry Optimised Protocol Harness (SCOPH) project is concerned with implementing streaming data paths that perform communication protocol functions. The data paths are implemented dynamically using soft circuitry on a programmed logic array, based upon a small set of parameterised function blocks that are instantiated and linked together on a per-connection basis, and may also be modified during connections. This allows the use of optimised bespoke protocols, with handling of protocol functions at hardware speeds. This paper focuses on one particular pair of function blocks, for the demultiplexing and multiplexing of multiple connections sharing a single communication channel. An example would be multiple TCP port connections simultaneously active over a single IP channel. In the SCOPH context, a single data stream emanating from a main processor or from a network interface would be split into multiple data streams, which undergo protocol processing, and are then recombined into a single stream to a network interface or main processor respectively. The demultiplexor block always requires a means of selecting one of a number of streams, given addressing information in the data stream itself. Symmetrically, the multiplexor block must insert appropriate addressing information into the data stream. However, the main thrust of the paper is in comparing area-time trade-offs between using blocks with a (small) fixed number of streams always configured and using blocks that vary in size and shape with the current number of active streams. Both options are being investigated on a Xilinx Virtex FPGA, using JBits for the dynamic configuration.

Spatially reconfigurable module for FIR filters

Toomas P. Plaks

This paper presents a spatially reconfigurable program module for FIR filters that can be used in FPGA design. Such a module is highly scalable and parameterized. In general, such a program module describes classical problem-size-dependent arrays, fixed-size arrays and arrays with increased performance. In the last case the problem is mapped into a higher-dimensional array. As a theoretical basis we use the theory of regular arrays. Efficient reconfiguring requires the rewriting of algorithms using algebraic transformations. For this purpose we use partitioning and specific pipe-structures. We also use multi-dimensional time to increase the range of different array topologies. The description of a module includes prescheduled designs, so reconfiguring does not require solving integer linear programming problems and can therefore be done at run-time. The paper presents a family of scalable array structures for FIR filters: 1-D, 2-D and 3-D structures with different input/output positioning and performance.

Run-time reconfigurable 2D discrete wavelet transform using JBits

Jonathan Ballagh, Peter M. Athanas, Eric R. Keller

With the growth in high performance multimedia applications, specialized hardware for certain tasks is desirable. While ASICs provide a solution addressing performance, they are unable to provide an optimal solution for a given problem instance. FPGAs can be used with run-time reconfiguration to dynamically customize a circuit. Optimizations leading to faster circuits and reduced logic can result. The paper discusses the implementation of a run-time parameterizable 2D Discrete Wavelet Transform core using the JBits tool suite. The motivation for such a core is discussed, as well as the benefits afforded by dynamic circuit specialization.

Network processor architecture for flexible buffer management in very high speed line interfaces

Shimonishi Hideyuki, Murase Tutomu

In this paper, the proposed architecture is described and the results obtained when evaluating it in a typical application program for traffic handling are reported. It is shown that the architecture enables Weighted Round Robin packet scheduling at 4.1 Gbps line speed, in addition to 10 Gbps IP packet forwarding and 2.4 Gbps IP/ATM multi-layer switching.
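
As background on the scheduling discipline named above, here is a minimal software sketch of Weighted Round Robin: in each round, every queue may send up to its weight in packets. It illustrates the policy only and says nothing about how the proposed architecture implements it; the function name and the packet and queue representation are assumptions.

```python
from collections import deque

def weighted_round_robin(queues, weights):
    """queues maps a name to a deque of packets; weights maps a name to packets allowed per round."""
    if any(weights[name] < 1 for name in queues):
        raise ValueError("every queue needs a weight of at least 1")
    order = []
    while any(queues.values()):  # continue until every queue is empty
        for name, q in queues.items():
            for _ in range(weights[name]):
                if not q:
                    break  # this queue has nothing left in the current round
                order.append(q.popleft())
    return order

queues = {"a": deque(["a1", "a2", "a3"]), "b": deque(["b1", "b2"])}
order = weighted_round_robin(queues, {"a": 2, "b": 1})
# order is ["a1", "a2", "b1", "a3", "b2"]
```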

Reconfigurable processors for handhelds and wearables: application analysis

Rolf Enzler, Marco Platzner, Christian Plessl, et al.

In this paper, we present the analysis of applications from the domain of handheld and wearable computing. This analysis is the first step to derive and evaluate design parameters for dynamically reconfigurable processors. We discuss the selection of representative benchmarks for handhelds and wearables and group the applications into multimedia, communications, and cryptography programs. We simulate the applications on a cycle-accurate processor simulator and gather statistical data such as instruction mix, cache hit rates and memory requirements for an embedded processor model. A breakdown of the executed cycles into different functions identifies the most compute-intensive code sections - the kernels. Then, we analyze the applications and discuss parameters that strongly influence the design of dynamically reconfigurable processors. Finally, we outline the construction of a parameterizable simulation model for a reconfigurable unit that is attached to a processor core.

Variable length decoder on dynamically reconfigurable cell array processor

Kiyotaka Komoku, Takayuki Morishita, Fumihiro Hatano, et al.

Three kinds of basic Variable Length Decoder were implemented on the Dynamically Reconfigurable Cell Array Processor. The traditional method, the leading-zeros method, and the generated-unique-address method are discussed. The number of resources required for each decoder is described. In particular, with the generated-unique-address method, the Variable Length Decoder circuit on the Dynamically Reconfigurable Cell Array Processor was quite small.

XHWIF: a portable hardware interface for reconfigurable computing

Prasanna Sundararajan, Steven A. Guccione, Delon Levi

As the interest in FPGA-based hardware has grown, so has the number and type of commercially available platforms. The greatest drawback to this proliferation of hardware platforms is the lack of standards. Even boards using identical hosts, FPGA devices and bus interfaces typically have widely varying software interfaces, limiting the portability of tools and applications across these platforms. Xilinx's XHWIF(tm) portable hardware interface attempts to address this problem. The XHWIF interface is a software layer providing all necessary communication and control for generic FPGA-based hardware. This interface permits tools and applications to be run on a variety of platforms, typically without modifications or re-compilation. In addition, a remote network interface is supplied as part of the XHWIF API. Applications and tools which use the XHWIF interface can also run transparently across a network without modification. This permits not only sharing of hardware resources in a networked environment, but also a simple way of implementing systems which use Remote Network Reconfiguration. The XHWIF API is currently provided as part of Xilinx's JBits(tm) Software Development Kit.

Signal processing for multiuser wireless systems on reconfigurable platforms

Joseph Thomas

Iterative multiuser decoding (IMD) is a suboptimal reception technique that yields near-optimal performance via the iterative exchange of soft information between a (multiuser) demodulator and a bank of single-user soft decoders to their mutual benefit. From a practical perspective, using the minimum mean square error (MMSE) multiuser interference suppressor for demodulation yields good performance and near-far resistance over a wide range of received signal powers, in both additive white Gaussian noise and dispersive fading channels. The computational issues and system tradeoffs, involved in realizing MMSE-IMD receivers on reconfigurable platforms, at base stations in the multicell code-division multiple access (CDMA) uplink, are considered in this paper. In particular, an approximation is proposed, to reduce the complexity of the most computationally intensive task in the receiver, viz., the inversion of a matrix of dimensions proportional to the product of the spreading factor and the receiver sensor-count; this approximation is observed to degrade system performance only marginally. A suitable partitioning of the receiver's computational tasks between software- and hardware-configurable platforms is also proposed. This is followed by a study of the performance-complexity tradeoffs among various system design options (such as the iteration-count, receiver sensor-count, and choice of soft decoding algorithm) in such rapidly configurable environments, and their impact on system capacity. It is inferred that actual realizations of the MMSE-IMD can indeed provide vast performance gains over existing suboptimal receivers.
