Biomedical Optics & Medical Imaging

Practical 3D template matching with FPGAs

Field-programmable gate arrays (FPGAs) can implement high-performance co-processors that accelerate volumetric pattern matching by a factor of 100 to 1000 compared with PCs.
6 March 2006, SPIE Newsroom. DOI: 10.1117/2.1200602.0013

Medical imaging and confocal microscopy are just two of the increasingly common technologies that collect three-dimensional, volumetric images. The images themselves are also growing more complex, with multiple fluorescence channels from microscopes or 3D flow vectors from diffusion tensor tomography. As the amount of volumetric data grows, so does the need for automated pattern recognition. Correlation, which scans an image for matches to a known template, is a standard approach in two dimensions but very hard in three. If a digitized image measures 100 units along each edge, the 2D image contains 10⁴ pixels, but the 3D image contains 10⁶ voxels. Searching for rotated forms of the template, assuming 10° resolution, requires 36 correlations for 2D templates, but over 12,000 correlations when rotating around three axes.

As a result, 3D correlation presents thousands of times more computational load than the corresponding problem in two dimensions. Fast Fourier transforms (FFTs) improve correlation performance somewhat, especially for larger problems, but must be repeated for each channel in multispectral data. Rotations can be distributed across a computing cluster, but the costs of buying, maintaining and cooling such systems can be prohibitive. Co-processors built around field-programmable gate arrays (FPGAs), which are reprogrammable commodity parts with performance approaching that of custom chips, represent an attractive alternative. For example, one FPGA can hold the entire correlation pipeline shown in Figure 1 [1].

Figure 1. Computation pipeline for 3D correlation.

Three-dimensional correlation starts at the ‘image rotation memory’, fetching one voxel per clock cycle. If the voxel data contains oriented values, such as diffusion flow vectors, ‘voxel rotation’ rotates those values along with the image grid. The ‘systolic correlation array’ then performs the correlation arithmetic. Here hundreds or thousands of processing elements (PEs) compute their partial correlation results concurrently. Then all the PEs in unison pass their results to their neighbors for further processing until the data reduction filter selects correlation peaks that may represent one or more matches to the template.

The co-processor structure of Figure 1 handles many different applications because the FPGA's fine-grained programmability allows all the details to vary. The designer allocates just enough bits for each voxel value, making efficient use of on-chip RAM. The correlation array also depends on the application: larger numbers of simple PEs versus fewer, complex PEs, according to the amount of FPGA fabric needed for each. These choices tune each instance of the coprocessor to the unique correlation task at hand. The PEs can also generalize beyond traditional sum-of-products correlation: they can sum arbitrary functions, including nonlinear ones (like absolute difference) that are a problem for FFT-based algorithms.
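The generalized correlation described above can be illustrated in software. The following is a minimal sketch, not the authors' FPGA design: it assumes volumes stored as nested lists indexed [z][y][x], and the names `correlate3d`, `product`, and `abs_difference` are hypothetical. The per-voxel function `pe` stands in for what one processing element computes, so swapping it changes the scoring rule without touching the scan.

```python
from typing import Callable, List

Volume = List[List[List[float]]]  # assumed layout: volume[z][y][x]


def correlate3d(image: Volume, template: Volume,
                pe: Callable[[float, float], float]) -> Volume:
    """Slide the template over the image and, at each offset,
    sum pe(image_voxel, template_voxel) over the template extent."""
    iz, iy, ix = len(image), len(image[0]), len(image[0][0])
    tz, ty, tx = len(template), len(template[0]), len(template[0][0])
    scores = []
    for z in range(iz - tz + 1):
        plane = []
        for y in range(iy - ty + 1):
            row = []
            for x in range(ix - tx + 1):
                total = 0.0
                # In hardware these terms would be computed concurrently;
                # here they are simply accumulated in a loop.
                for dz in range(tz):
                    for dy in range(ty):
                        for dx in range(tx):
                            total += pe(image[z + dz][y + dy][x + dx],
                                        template[dz][dy][dx])
                row.append(total)
            plane.append(row)
        scores.append(plane)
    return scores


def product(a: float, b: float) -> float:
    """Traditional sum-of-products term (higher score is a better match)."""
    return a * b


def abs_difference(a: float, b: float) -> float:
    """Nonlinear alternative (lower score is a better match)."""
    return abs(a - b)
```

With a 2×2×2 block of ones embedded in a 3×3×3 volume of zeros and an all-ones 2×2×2 template, `product` peaks at the block's offset while `abs_difference` reaches zero there, which shows how the same scan supports both scoring rules.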

Each rotation of template with respect to the image requires a separate correlation, but the ‘image rotation memory’ eliminates rotation as a distinct operation. Instead, the memory structure traverses the rotated space in simple scanning order. An optimized linear transform inside the rotation memory converts scan indices to image coordinates, and accesses image voxels in rotated order, padding as needed. This reverses the usual approach: it rotates the image relative to the template. Only the relative rotation matters, however, and this technique handles any rotation by downloading a few control parameters instead of the whole image or template.
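The idea of reading the image in rotated order, rather than rotating it as a separate step, can be sketched in a few lines. This is an illustration under stated assumptions and not the optimized transform used in the hardware: it assumes rotation about the volume center, nearest-neighbor sampling, and a [z][y][x] nested-list layout. The function names `rotation_z` and `scan_rotated` are hypothetical. The 3×3 matrix and the pad value play the role of the "few control parameters".

```python
import math


def rotation_z(degrees):
    """3x3 matrix for a rotation about the z axis (one example parameter set)."""
    angle = math.radians(degrees)
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]


def scan_rotated(image, matrix, pad=0.0):
    """Visit the rotated grid in plain scan order and return the voxels read.

    Each scan index (i, j, k) is mapped through the linear transform to an
    image coordinate; coordinates outside the image yield the pad value.
    """
    nz, ny, nx = len(image), len(image[0]), len(image[0][0])
    cx, cy, cz = (nx - 1) / 2, (ny - 1) / 2, (nz - 1) / 2
    voxels = []
    for k in range(nz):
        for j in range(ny):
            for i in range(nx):
                # Scan position relative to the volume center.
                u, v, w = i - cx, j - cy, k - cz
                x = matrix[0][0] * u + matrix[0][1] * v + matrix[0][2] * w + cx
                y = matrix[1][0] * u + matrix[1][1] * v + matrix[1][2] * w + cy
                z = matrix[2][0] * u + matrix[2][1] * v + matrix[2][2] * w + cz
                xi, yi, zi = round(x), round(y), round(z)  # nearest neighbor
                if 0 <= xi < nx and 0 <= yi < ny and 0 <= zi < nz:
                    voxels.append(image[zi][yi][xi])
                else:
                    voxels.append(pad)
    return voxels
```

Changing the orientation only means passing a different matrix; the stored image is never rewritten, which mirrors the point that a new rotation needs only new parameters rather than a new image or template download.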

The transfer of the results data is also a potential performance bottleneck, so the data reduction filter selects only a few results to be reported, as shown in Figure 2(a). This gives a much better summary than collecting the highest N values, or reporting all values past some cutoff as in Figure 2(b). Those two alternatives report many values clustered near broad maxima or ignore important peaks below the threshold. On-chip analysis also reduces the number of correlation scores transferred to the host, and double buffering allows the transfer of one result to overlap with acquisition of the next. Together, these techniques improve throughput by eliminating dead time between correlations.

Figure 2. (a) The sub-block maxima reduce redundant reporting and detect more local peaks. (b) A threshold technique reports broad peaks repeatedly and misses local maxima.
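The contrast between the two reporting strategies in Figure 2 can be shown with a small sketch. The article does not give the filter's details, so this assumes cubic sub-blocks of a fixed edge length and a [z][y][x] nested-list score volume; `subblock_maxima` and `above_threshold` are hypothetical names.

```python
def subblock_maxima(scores, block):
    """Report one (score, (z, y, x)) pair per sub-block of edge length `block`."""
    nz, ny, nx = len(scores), len(scores[0]), len(scores[0][0])
    peaks = []
    for z0 in range(0, nz, block):
        for y0 in range(0, ny, block):
            for x0 in range(0, nx, block):
                best = None
                # Scan this sub-block, clipping at the volume boundary.
                for z in range(z0, min(z0 + block, nz)):
                    for y in range(y0, min(y0 + block, ny)):
                        for x in range(x0, min(x0 + block, nx)):
                            s = scores[z][y][x]
                            if best is None or s > best[0]:
                                best = (s, (z, y, x))
                peaks.append(best)
    return peaks


def above_threshold(scores, cutoff):
    """Report every score strictly above `cutoff`, for comparison."""
    return [(s, (z, y, x))
            for z, plane in enumerate(scores)
            for y, row in enumerate(plane)
            for x, s in enumerate(row)
            if s > cutoff]


# One broad peak (x = 0..3) and one weaker local peak (x = 5).
scores = [[[9.0, 8.0, 9.5, 8.0, 1.0, 3.0, 1.0, 0.0]]]
print(subblock_maxima(scores, 4))   # one entry per sub-block
print(above_threshold(scores, 7.0)) # several entries, all from the broad peak
```

In this toy volume the threshold returns four values from the same broad peak and never sees the weaker one, while the sub-block approach returns exactly two entries, one for each peak.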

Correlation has a long history in two-dimensional pattern-matching, but applying it to volumetric images has been problematic because of the mass and complexity of 3D data. FPGA accelerators enable this and other computing applications [2,3] by applying massive parallelism tailored uniquely to each task. Reconfigurable co-processors, including Annapolis Micro Systems and Nallatech products, have traditionally served niche markets. However, recent announcements from Silicon Graphics [4] and Cray [5] place FPGA computation squarely in the mainstream. The challenge now lies in the design tools to unlock the full potential of FPGA computing.

Tom VanCourt and Martin Herbordt
Department of Electrical and Computer Engineering, Boston University
Boston, MA
Tom VanCourt is a PhD candidate in Boston University's Department of Electrical and Computer Engineering. He spent over twenty years in industry developing operating systems, embedded control, and other applications. He has also taught advanced software design techniques in BU's Metropolitan College. Current research interests include application-specific processors and use of software technologies in developing hardware systems.
Prof. Herbordt has directed the Computer Architecture and Automated Design Lab since 2001. His current research addresses configurable computing, particularly in its application to biological problems and in creating a development environment for that application. Herbordt is the author or co-author of 2 book chapters and more than 40 refereed papers.