A myriad of new applications would be enabled by bringing together 3D virtual spaces with physical spaces to deliver real-time 3D information to every household. For example, imagine watching 3D TV shows in which each viewer can choose the viewpoint and interact with the participants of the TV program. Similarly, art performances could be delivered by bringing artists with unique skills at multiple geographic locations to a joint virtual space and presenting the combined performance at a viewer's site. Another example is to hold business, government, or scientific meetings in 3D virtual spaces with full interactivity to avoid the rising costs of travel. One could list many other applications where people at multiple locations would use such technologies for education, rehabilitation, and communication.
Multiple techniques have been developed to create and fuse real-time 2D and 3D content from the physical and virtual worlds. The basic approach is to place a large number of 2D cameras around a scene and provide a viewer with one of the 2D camera views. A more advanced approach is to sense or estimate the depth of objects involved in the activities from many 2D cameras, and then to provide a user with arbitrary 2D views obtained by interpolation. The ideal approach is to reconstruct a 3D scene by acquiring or computing depth and spectrum values and then providing an arbitrary 3D view in real time. The spectrum values cover primarily those that are visible, but other bands are also of interest (e.g., thermal IR to measure heat during strenuous human activities). These approaches have been implemented in several commercial systems,1 such as Cisco's CTS 3000 Telepresence System (~$300,000 plus ~$10,000 a month in bandwidth costs), Polycom's RPX Telepresence Systems in Multipoint Call (complete room solution, ~$299,000 for a Polycom RPX HD 204), HP's Halo Collaboration Center (a system for 2 to 18 people, ~$120,000–$349,000 plus ~$18,000 a month in fees), or the China-based Huawei Technologies Co. Ltd.'s ViewPoint Telepresence 3006 (price unknown).
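As a rough illustration of the 'ideal approach' above, the following sketch forward-warps a color+depth image into an arbitrary virtual viewpoint. The intrinsic matrix K, the nearest-pixel z-buffered splatting, and all function and variable names are assumptions for illustration; a production system would additionally need hole filling and careful occlusion handling.

```python
import numpy as np

def render_virtual_view(color, depth, K, R, t, out_shape):
    """Forward-warp a color+depth image into a virtual viewpoint
    (rotation R, translation t). K is the 3x3 intrinsic matrix.
    Splatting is nearest-pixel with a z-buffer; holes and blending
    are ignored in this minimal sketch."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.ravel()
    valid = z > 0
    # Back-project pixels to 3D points in the source camera frame.
    pix = np.stack([u.ravel(), v.ravel(), np.ones(h * w)])
    pts = np.linalg.inv(K) @ (pix * z)
    # Transform points into the virtual camera frame and project.
    pts2 = R @ pts + t[:, None]
    proj = K @ pts2
    uu = (proj[0] / proj[2]).round().astype(int)
    vv = (proj[1] / proj[2]).round().astype(int)
    out = np.zeros(out_shape + (3,), dtype=color.dtype)
    zbuf = np.full(out_shape, np.inf)
    inb = valid & (uu >= 0) & (uu < out_shape[1]) & (vv >= 0) & (vv < out_shape[0])
    src = color.reshape(-1, 3)
    for i in np.nonzero(inb)[0]:        # z-buffered splat: keep nearest point
        if pts2[2, i] < zbuf[vv[i], uu[i]]:
            zbuf[vv[i], uu[i]] = pts2[2, i]
            out[vv[i], uu[i]] = src[i]
    return out
```

With an identity pose the warp reduces to a copy of the source image, which makes the geometry easy to sanity-check.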
Figure 1. An experimental setup with three green/blue targets monitored by a ceiling camera. Participant performance is evaluated based on the occlusion of colored areas. Different cues are presented on the 52" LCD monitor.
These products and others like them are frequently described as delivering ‘telepresence,’ which aims at creating a user's perception of being in a shared 3D space rather than delivering actual 3D information. This distinction leads to technical difficulties when the virtual and physical worlds must be fused and the co-location of objects must be detected to facilitate interaction. Beyond cost, the networking requirements (typically dedicated networks), system complexity, and setup time (typically days) prevent these technologies from becoming truly ubiquitous.
Our prototype tele-immersive system at the National Center for Supercomputing Applications (NCSA) aims to achieve portability, robustness, and low cost by leveraging commercial off-the-shelf (COTS) components.2 Our system consists of multiple stereo and thermal IR cameras, a server receiving 3D data streams from the stereo cameras, and several LCDs for displaying 3D video information. The stereo cameras are manufactured by TYZX Inc. and consist of two CMOS imagers combined with on-board processing and Ethernet capabilities. The two imagers are constructed with a 6cm baseline and have lenses that cover a 44° field of view. The cameras produce 500×312 pixel images at 33 frames per second. The thermal IR cameras (Photon 320 model manufactured by FLIR Systems) deliver single-band 320×240 pixel images at 30 frames per second. The hardware cost of our system (11 stereo cameras and 2 thermal IR cameras) would be less than $60,000, which compares favorably with those of the systems described above. By mounting the cameras on tripods or on TV carts with LCDs, the whole system becomes very portable (assuming that the deployment site has networking in place). Given the hardware components, we have implemented software for the optimal placement of stereo cameras, acquisition of the data from multiple stereo cameras, colorimetric calibration of color images, color space conversions, transformation of all depth maps into a common 3D coordinate system, and rendering of 3D clouds of points in real time on a large LCD.
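The transformation of each camera's depth map into a common 3D coordinate system can be sketched as follows, a minimal back-projection assuming a pinhole model with the focal length derived from the 44° field of view stated above; the function name and the extrinsics (R, t) are illustrative assumptions, not our actual calibration pipeline.

```python
import numpy as np

def depth_to_points(depth, fov_deg=44.0, R=None, t=None):
    """Back-project a depth map (meters) into a 3D point cloud and,
    optionally, map it into a common world frame via extrinsics (R, t).
    Assumes a pinhole camera with the principal point at the image
    center; parameters mirror the 44-degree stereo heads in the text."""
    h, w = depth.shape
    # Focal length in pixels from the horizontal field of view.
    f = (w / 2.0) / np.tan(np.radians(fov_deg / 2.0))
    cx, cy = w / 2.0, h / 2.0
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / f
    y = (v - cy) * z / f
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    pts = pts[pts[:, 2] > 0]            # drop invalid (zero-depth) pixels
    if R is not None:                    # camera-to-world transform
        pts = pts @ R.T + t
    return pts
```

Merging the clouds from all eleven stereo cameras then amounts to calling this per camera with its own (R, t) and concatenating the results before rendering.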
We have performed studies of tele-immersive spaces for rehabilitation purposes to quantify the benefits for citizens with proprioceptive impairments (lack of awareness of where the body is in space). We hypothesized that people with proprioceptive impairments can compensate with their other senses, using real-time 3D+color reconstructions of their bodies in space as a substitute for proprioceptive feedback.
In our experiments, three bi-modal targets consisting of blue and green halves are placed on the floor and monitored by a ceiling camera (see Figure 1). The goal for participants is to reach the targets in such a way that one color of the target is occluded by the wheelchair as viewed by the ceiling camera. In effect, these markers simulate a virtual wall, where the intersection between the colors represents the wall, and proximity to that wall can be measured by the amount of each color seen. Eight cues are presented to human subjects in wheelchairs by using the immersive virtual reality space. The types of cues include: direct observation/no cues, an audio cue, a top-down color video cue, a side-view color video cue (fixed and proximity based), a side-view depth video cue (proximity based with and without learning), and a 3D+color video cue (fixed viewpoint). While each cue is presented, a human subject has to move from a base location to one of the green/blue targets, stop at the boundary, return to the base location, and proceed to the next target.
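The proximity measure described above can be sketched as a simple color-counting routine over the ceiling-camera frames: the fraction of one colored half of a target hidden by the wheelchair, relative to an unoccluded reference frame. The RGB thresholds and function names below are illustrative assumptions, not the calibrated values used in the study.

```python
import numpy as np

def color_pixels(frame, color):
    """Boolean mask of pixels that read as the given target color
    in an 8-bit RGB ceiling-camera frame (thresholds are assumed)."""
    r, g, b = frame[..., 0], frame[..., 1], frame[..., 2]
    if color == "green":
        return (g > 128) & (r < 100) & (b < 100)
    return (b > 128) & (r < 100) & (g < 100)   # blue

def occlusion_ratio(reference, current, color):
    """Fraction of one colored target half occluded by the wheelchair:
    0.0 = fully visible, 1.0 = fully occluded. `reference` is an
    unoccluded frame; `current` is a frame captured during a trial."""
    ref = color_pixels(reference, color).sum()
    cur = color_pixels(current, color).sum()
    return 1.0 - cur / max(ref, 1)
```

Stopping exactly at the virtual wall then corresponds to driving this ratio toward 1.0 for one color while keeping it near 0.0 for the other.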
Based on accuracy and speed measurements, we concluded that the three most helpful cues are direct observation/no cue, top-down fixed color video, and 3D+color video. We anticipated that using no cue would be the best due to the simplicity of the task, where leaning forward and using direct visual observation would be the most natural way of assessing where the body is in space. The inclusion of top-down fixed color video relates to the fact that the targets are two dimensional, and the view is the most intuitive to humans. The data-driven selection of 3D+color video confirms our hypothesis that among the indirect cues, the 3D content in tele-immersive environments has the highest value for regaining proprioception. These experiments also considered improvement after learning: based on the results, we confirmed that participants who had learned the task achieved on average better accuracy and speed than novices.
The problem of enabling new applications using mixed 3D content from virtual and physical spaces requires advancements in multiple disciplines, as well as investments in infrastructure to deliver high-quality 3D content in real time to every household. Academic and industrial investments in building 3D TV, telepresence, and tele-immersive technologies might lead to meeting multiple application-specific requirements in the future. We have outlined and addressed three of those requirements: portability, robustness, and low cost. We plan to deploy tele-immersive systems in various environments and address the questions of illumination robustness and user-friendly deployment. Challenges remain in understanding the 3D+color video quality requirements in various applications.
Funding was provided by US National Science Foundation grant IIS 07-03756 (award 490630). The project is part of a joint collaboration with the computer science departments at the University of Illinois at Urbana-Champaign and the University of California at Berkeley.
Peter Bajcsy, Kenton McHenry
National Center for Supercomputing Applications (NCSA)
University of Illinois at Urbana-Champaign (UIUC)
Peter Bajcsy works as a research scientist on problems related to automatic transfer of image content to knowledge. Dr. Bajcsy's scientific interests include image processing, novel sensor technology, and computer and machine vision.
Kenton McHenry works as a research programmer on problems related to 3D content creation, conversion, and preservation. Dr. McHenry's research interests include computer vision, pattern recognition, and automation.