3D modeling using minuscule volume elements
The rapid growth of overhead imagery collected from aerial and space platforms has sparked a revolution in mapping and surveillance applications. Geographical information systems (GISs) such as maps and road networks can be updated rapidly, and dynamic events such as natural disasters and military operations can be monitored as they evolve. But the discrepancy between 2D imagery and the 3D nature of the observed phenomena creates an algorithmic challenge that existing technology has yet to adequately address. Analyzing the hours of video footage collected daily from various imaging platforms currently requires substantial manual effort. Furthermore, the full potential of the imagery, for example in detecting subtle or rare changes in large volumes of visual data, cannot be realized due to inherent human limitations. Only automated processing of overhead imagery can allow effective exploitation of existing collection resources.
Automated overhead image processing requires basic algorithmic capabilities such as image registration, change detection, tracking, labeling, and efficient storage. Existing technology provides these functions, but only in the 2D domain of the image. 2D image processing alone cannot accurately assess the relative movement of scene elements in the presence of occlusion and 3D relief. A 3D scene representation, on the other hand, provides a complete description of the scene from any viewpoint. Recent research has focused on reconstructing the 3D surface geometry of a scene from aerial or ground-level imagery.1–3 The majority of these approaches rely on established image-processing techniques that extract 2D image features and match them across images, assuming they originate from a common 3D surface element.4, 5 The location of the surface element is then triangulated using the centers and orientations of the cameras, as illustrated in the camera geometry of Figure 1. Using this technique, however, only the surface elements containing matching, detectable features appear in the resulting 3D surface representation. This partial representation requires another layer of processing, which generates a complete surface.6 This additional processing layer, known as surface estimation, relies on optimization techniques whose assumptions about the structure of the 3D scene may not hold. Furthermore, the standard polygonal models (meshes) used for estimation cannot represent the shape of the uncertainty regions around the output surfaces. Due to the ambiguous nature of the input data, characterizing these uncertainty regions is crucial for higher-level reasoning applications that use the output surfaces. A completely different kind of representation is required for this purpose.
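The triangulation step described above can be sketched with the standard linear (direct linear transformation) method. The function below is an illustrative outline under the usual pinhole-camera assumption, with known 3×4 projection matrices; it is not the implementation used in the cited systems.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two views.

    P1, P2: 3x4 camera projection matrices (center + orientation + intrinsics).
    x1, x2: the matched 2D feature locations in each image.
    Each 2D match contributes two linear constraints on the homogeneous
    3D point X; the solution is the null vector of the stacked system.
    """
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]              # right singular vector of smallest singular value
    return X[:3] / X[3]     # dehomogenize
```

In practice the matched features are noisy, so the recovered point carries an uncertainty region that a mesh output cannot represent, which motivates the volumetric alternative discussed next.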
Assume that the surfaces in a 3D scene will be observed by a series of images. Before seeing any image, the observer assumes that each point in the 3D space of the scene volume is equally likely to contain a surface. As more images are observed from different viewpoints, more information regarding the shapes and locations of the surfaces becomes available. This accumulating evidence can only be captured by giving every point in the scene volume an equal chance to develop into a surface. This idea formed the basis of a probabilistic space carving approach developed in the late 1990s.7–9 In this approach, the scene volume is divided into a regular grid of minuscule volume elements called voxels (see Figure 1). Each voxel's probability of containing a surface is measured by projecting it into each image of the scene using the camera geometry and checking the consistency of the observed surface's appearance over all the views. This volumetric method enables a continuous representation of the probability of scene geometry throughout all of space. However, the volumetric representation technique requires substantial storage and computation compared to polygonal surface meshes.
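The per-voxel evidence accumulation can be sketched as a Bayesian update of a binary "contains a surface" probability. This is a minimal illustration of the idea, not the cited algorithms: the function name and the two likelihood terms (how well the observed pixel appearance is explained by a surface versus by empty space) are assumptions for this sketch.

```python
def update_occupancy(p_surface, like_surface, like_empty):
    """One Bayesian update of a voxel's surface probability.

    p_surface:    prior probability that the voxel contains a surface
                  (starts at 0.5, i.e. every point gets an equal chance).
    like_surface: likelihood of the observed pixel appearance given a surface.
    like_empty:   likelihood of the observation given empty space.
    Returns the posterior probability after observing one image.
    """
    num = like_surface * p_surface
    return num / (num + like_empty * (1.0 - p_surface))
```

Repeated consistent observations drive the probability toward 1, while inconsistent ones drive it toward 0, so the grid as a whole converges on the scene's surfaces as more viewpoints arrive.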
We honed this probabilistic volumetric technology into a practical solution through extensive development efforts over the last eight years. Our solution reformulates the space carving theory so that the appearance of a surface element is modeled probabilistically along with the surface existence at each voxel location. As more images are observed, the surface existence probabilities are updated simultaneously with their appearance models. In this way, we handle the uncertainty in the appearance of the surfaces explicitly, leading to more robust model reconstruction.10–12 We adopted an adaptive spatial resolution scheme to mitigate storage and computational costs (see Figure 2).
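The joint modeling of appearance alongside surface existence can be illustrated with a deliberately simplified per-voxel appearance model: a single Gaussian over pixel intensity updated online (Welford's algorithm). The published approach10–12 maintains richer models; the class below is an assumed, minimal sketch of the idea that each voxel learns what it looks like as images accumulate.

```python
import math

class VoxelAppearance:
    """Simplified per-voxel appearance model: one Gaussian over intensity,
    updated incrementally as each new image observation arrives.
    Illustrative sketch only; not the published mixture-model formulation."""

    def __init__(self):
        self.n = 0        # number of observations so far
        self.mean = 0.0   # running mean intensity
        self.m2 = 0.0     # running sum of squared deviations (Welford)

    def observe(self, intensity):
        """Fold one observed pixel intensity into the running statistics."""
        self.n += 1
        delta = intensity - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (intensity - self.mean)

    def likelihood(self, intensity, var_floor=1e-2):
        """Gaussian likelihood of a new observation under the learned model."""
        var = self.m2 / (self.n - 1) if self.n > 1 else var_floor
        var = max(var, var_floor)
        return math.exp(-(intensity - self.mean) ** 2 / (2.0 * var)) \
            / math.sqrt(2.0 * math.pi * var)
```

A likelihood of this kind can feed the surface-probability update: observations consistent with the learned appearance strengthen the voxel's surface hypothesis, while outliers weaken it, which is what makes the reconstruction robust to appearance uncertainty.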
In this new framework, the volume starts out with a coarse, space-efficient decomposition. As the scene is observed from more viewpoints, the voxels around the surfaces subdivide to represent finer details, while the voxels at the empty portions of the scene remain the same. The volumetric reconstruction algorithm is optimized to run on graphics-processing hardware, reducing model construction time to minutes and achieving real-time model rendering.12 A short video available online shows one of the 3D models generated using aerial footage collected above an urban scene with high 3D relief (see video16). Observe from the rendering that the geometry and texture of the building surfaces, the roads, and the vegetation are all captured in realistic detail. Due to their dynamic and highly irregular nature, amorphous shapes such as vegetation and geological formations can only be crudely approximated by a mesh representation, but can be defined in detail by the volumetric representation.
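The coarse-to-fine decomposition just described is naturally expressed as an octree: a cell splits into eight children only where surface evidence is strong, while empty regions stay coarse. The class below is a hedged sketch of that mechanism; the node layout, threshold, and minimum cell size are illustrative assumptions, not the optimized GPU data structure of the actual system.

```python
class OctreeNode:
    """Adaptive-resolution voxel cell. Cells with high surface probability
    subdivide into eight octants; low-probability (empty) cells stay coarse.
    Illustrative sketch of the adaptive scheme, not the GPU implementation."""

    def __init__(self, center, size, p_surface=0.5):
        self.center = center          # (x, y, z) of the cell center
        self.size = size              # edge length of the cubic cell
        self.p_surface = p_surface    # current surface probability
        self.children = None          # None while the cell is a leaf

    def refine(self, threshold=0.7, min_size=0.25):
        """Recursively subdivide wherever surface evidence is strong."""
        if self.p_surface < threshold or self.size <= min_size:
            return  # empty space, or already at the finest resolution
        h = self.size / 4.0
        cx, cy, cz = self.center
        self.children = [
            OctreeNode((cx + dx * h, cy + dy * h, cz + dz * h),
                       self.size / 2.0, self.p_surface)
            for dx in (-1, 1) for dy in (-1, 1) for dz in (-1, 1)
        ]
        for child in self.children:
            child.refine(threshold, min_size)
```

Storage then scales with the surface area of the scene rather than its volume, which is what makes the volumetric representation practical at large scales.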
The voxel technology generates the 3D model of a scene from its aerial imagery with increasing certainty as more images become available.10 Accurately representing the 3D context of the observed phenomena opens up exciting opportunities for higher-level applications, such as change detection, object recognition, and moving object tracking.13–15 Our ongoing research projects aim to demonstrate the effectiveness of volumetric representation in integrating imagery streaming from multiple platforms and with different modalities such as electro-optical, IR, and multispectral, as well as increasing the coverage of the 3D model to hundreds of square kilometers.
Ozge Ozcanli received a PhD in engineering from Brown University (2011), and co-founded Vision Systems Inc. Her current research focuses on geo-registration of 3D models and ground-level imagery.
Daniel Crispell received a PhD in engineering from Brown University (2010). Crispell is a co-founder of Vision Systems Inc. His projects there include work in artificial intelligence, 3D modeling, and GIS.
Joseph Mundy is a professor of engineering at Brown University and a co-founder of Vision Systems Inc., where his research focuses on probabilistic 3D modeling. He worked for GE Global Research for many years prior to joining Brown in 2002, and in 1993 was a co-recipient of the Marr Prize, the leading award in the field of computer vision.
Vishal Jain earned a PhD degree from Brown University (2009). He is a co-founder of Vision Systems Inc., where he works on the fusion of imagery from different modalities and GIS applications using probabilistic 3D models.
Tom Pollard received his PhD from Brown University (2010). Pollard currently works as a computer vision scientist for STR, where his research focuses on tracking, camera calibration, and fusion of imagery from different modalities.