
Electronic Imaging & Signal Processing

Smart media management

From oemagazine July 2001
30 July 2001, SPIE Newsroom. DOI: 10.1117/2.5200107.0005

A system is called 'smart' if a user perceives the actions and reactions of the system as being smart. Media is therefore managed smartly if a computer system helps a user to perform an extensive set of operations on a large media database quickly, efficiently, and conveniently. Such operations include searching, browsing, manipulating, sharing, and reusing.

The more the computer knows about the media it manages, the smarter it can be. Thus, algorithms that are capable of extracting semantic information automatically from media are an important part of a smart media-management system. As part of this effort, our lab at Intel Corp. (Santa Clara, CA) is focusing on tasks such as reliable shot detection; text localization and text segmentation in images, web pages, and videos; and automatic semantic labeling of images.

calling the shots

A shot is commonly defined as an uninterrupted recording of an event or locale. Any video sequence consists of one or more shots joined by some kind of transition effect. Detecting shot boundaries thus means recovering those elementary video units, which in turn provide the basis for nearly all existing video abstraction and high-level video segmentation algorithms. In addition, during video production each transition type is chosen carefully to support the content and context of the video sequences; therefore, automatically recovering their positions and types may help the computer deduce high-level semantics. For instance, feature films often use dissolves to convey a passage of time. Dissolves also occur much more often in feature films, documentaries, and biographical and scenic video material than in newscasts, sports, comedies, and other shows. The opposite is true for wipes, in which a line moving across the screen marks the transition from one scene to the next. Automatic detection of transitions and their types can therefore be used for automatic recognition of the video genre.

A recent review of the state of the art in automatic shot-boundary detection emphasizes algorithms that specialize in detecting specific types of transitions, such as hard cuts, fades, and dissolves. In a fade, the scene gradually diminishes to a black screen over several seconds; in a dissolve, the scene fades directly into the next scene rather than to black.1 Today's cutting-edge systems detect hard cuts and fades at hit rates of 99% and 82%, with false-alarm rates of 1% and 18%, respectively. Dissolves are more difficult to detect; the best approaches report hit and false-alarm rates of 75% and 16% on a representative video test set.
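As an illustrative sketch only, and not any of the algorithms from the survey cited above, a basic hard-cut detector can threshold the difference between color histograms of consecutive frames. The bin count, threshold, and synthetic frames below are all assumptions for the example:

```python
import numpy as np

def color_histogram(frame, bins=8):
    """Quantize an RGB frame into a normalized joint color histogram."""
    # Map each channel to `bins` levels, then build one flat histogram.
    quantized = (frame // (256 // bins)).astype(np.int32)
    codes = (quantized[..., 0] * bins + quantized[..., 1]) * bins + quantized[..., 2]
    hist = np.bincount(codes.ravel(), minlength=bins ** 3).astype(np.float64)
    return hist / hist.sum()

def detect_hard_cuts(frames, threshold=0.5):
    """Flag a hard cut wherever the L1 distance (range [0, 2]) between
    consecutive frame histograms exceeds `threshold`."""
    hists = [color_histogram(f) for f in frames]
    return [i for i in range(1, len(hists))
            if np.abs(hists[i] - hists[i - 1]).sum() > threshold]

# Synthetic clip: 10 dark frames followed by 10 bright frames.
rng = np.random.default_rng(0)
dark = [rng.integers(0, 60, (32, 32, 3), dtype=np.uint8) for _ in range(10)]
bright = [rng.integers(200, 256, (32, 32, 3), dtype=np.uint8) for _ in range(10)]
print(detect_hard_cuts(dark + bright))  # → [10]
```

Gradual transitions such as fades and dissolves spread the histogram change over many frames, which is why they defeat this simple per-frame threshold and require the specialized detectors discussed above.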

extracting text

Extracting truly high-level semantics from images and videos is in most cases still an unsolved problem. One of the few exceptions is the extraction of text from complex backgrounds and cluttered scenes. Several researchers have recently developed novel algorithms for detecting, segmenting, and recognizing such text occurrences.2 These extracted text occurrences provide a valuable source of high-level semantics for indexing and retrieval. For instance, text extraction enables users of a video database to query for all movies featuring John Wayne or produced by Steven Spielberg. It can also be used to jump directly to news stories about a specific topic, since captions in newscasts often condense the underlying story.
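Once text has been extracted, indexing it reduces to standard text retrieval. The sketch below, with entirely hypothetical video IDs and recognized text lines, shows how an inverted index over text occurrences would support the kind of query described above:

```python
# Hypothetical extracted-text records: (video_id, recognized text line).
occurrences = [
    ("movie_01", "John Wayne"),
    ("movie_01", "directed by John Ford"),
    ("movie_02", "Steven Spielberg presents"),
    ("news_17", "Election results"),
]

# Build an inverted index: word -> set of videos it appears in.
index = {}
for video, text in occurrences:
    for word in text.lower().split():
        index.setdefault(word, set()).add(video)

def query(*words):
    """Return videos whose extracted text contains all query words."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()

print(query("john", "wayne"))  # → {'movie_01'}
print(query("spielberg"))      # → {'movie_02'}
```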

Detecting, segmenting, and recognizing text in the nontext parts of web pages is also important. More and more web pages present text in images. Existing document-based text segmentation and recognition algorithms cannot extract such text occurrences because of the potentially complex backgrounds and the wide variety of text colors used. The new algorithms allow users to index the content of image-rich web pages properly. Automatic text segmentation and recognition might also help in automatically converting web pages designed for large monitors to the small LCDs of appliances, since the textual content of images can be retrieved.

Our latest text segmentation method not only locates text occurrences and segments them into large binary images, but also labels each pixel within an image or video as text or nontext.3 Thus, our text detection and segmentation methods can be used for object-based video encoding, which is known to achieve much better video quality at a fixed bit rate than existing compression technologies. In most cases, however, the problem of extracting objects automatically is not yet solved. Our text localization and segmentation algorithms solve this problem for text occurrences in videos. Using this technique, a video encoded with multiple video object planes (VOPs) achieved a peak signal-to-noise ratio about 1.5 dB higher than the same video encoded as a single-object MPEG-4 stream. Encoding the text lines as rigid foreground objects and the rest of the video separately thus achieved much better visual quality.
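The 1.5-dB figure refers to the standard peak signal-to-noise ratio. This sketch only illustrates the metric itself, using synthetic frames with two noise levels standing in for the two encodings; it is not a reproduction of the MPEG-4 experiment:

```python
import numpy as np

def psnr(reference, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two 8-bit frames:
    10 * log10(max^2 / MSE)."""
    mse = np.mean((np.asarray(reference, dtype=np.float64)
                   - np.asarray(reconstructed, dtype=np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_value ** 2 / mse)

# Toy comparison: the same frame degraded by two different noise levels,
# standing in for single-object vs. object-based encodings.
rng = np.random.default_rng(1)
frame = rng.integers(0, 256, (64, 64), dtype=np.uint8).astype(np.float64)
coarse = np.clip(frame + rng.normal(0, 8, frame.shape), 0, 255)
fine = np.clip(frame + rng.normal(0, 6, frame.shape), 0, 255)
print(f"gain: {psnr(frame, fine) - psnr(frame, coarse):.1f} dB")
```

A lower-noise reconstruction yields a smaller mean squared error and therefore a higher PSNR, which is the sense in which the VOP-encoded video outperformed the single-object encoding.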

Figure 1. A classification scheme for web images enables an algorithm to sort them automatically.

Although much research has been published on extraction of low-level features from images and videos, only recently has the focus shifted to exploiting low-level features to classify images and videos automatically into semantically meaningful and broad categories. Examples of broad and general-purpose semantic classes are outdoor versus indoor scenes and city versus landscape scenes.4 In one of our media indexing research projects, we crawled about 300,000 images from the web. After browsing carefully through those images, we came up with broad- and general-purpose categories (see figure 1).


Figure 2. Starting with an undifferentiated mixture of images (top), the algorithm automatically sorts them according to a classification scheme (bottom). (INTEL)

Although it uses only simple, low-level features, such as the overall color diversity in the image, the average noise level in the images, and the distribution of text line positions and sizes, our classification algorithm achieved an accuracy of 97.3% in separating photo-like images from graphical images on a large image database. In the subset of photo-like images, the algorithm could separate true photos from ray-traced/rendered images with an accuracy of 87.3%, while the subset of graphical images was successfully partitioned into presentation slides and comics with an accuracy of 93.2%. Sample images illustrating the chaos before and the order after their classification are shown in figure 2.5 We are now working to increase the number of categories that can be classified automatically and will have to explore how joint classification can be done accurately and efficiently.
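The actual classifier combines several features; as a toy illustration of just one of them, the color-diversity cue, the sketch below separates photo-like from graphical images with a single threshold. The feature definition, threshold, and synthetic images are all assumptions for the example:

```python
import numpy as np

def color_diversity(image):
    """Fraction of distinct RGB colors relative to pixel count --
    photographs tend to use far more distinct colors than slides or comics."""
    pixels = image.reshape(-1, 3)
    distinct = len(np.unique(pixels, axis=0))
    return distinct / len(pixels)

def classify(image, threshold=0.1):
    """Naive one-feature classifier: photo-like vs. graphical."""
    return "photo-like" if color_diversity(image) > threshold else "graphical"

rng = np.random.default_rng(2)
# Random pixels stand in for a photo; a two-color image for a slide.
photo = rng.integers(0, 256, (64, 64, 3), dtype=np.uint8)
slide = np.zeros((64, 64, 3), dtype=np.uint8)
slide[:32] = (255, 255, 255)
print(classify(photo), classify(slide))  # → photo-like graphical
```

Adding further low-level cues, such as noise level and text-line statistics, and combining them in a trained classifier is what lifts the accuracy to the levels reported above.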

browsing media

Figure 3. In video browsing, scenes from videos similar to the current selection are shown on the border around the main image.

Although automatic media content analysis capabilities provide the basis of a smart media-management system, efficient methods to browse a media database in a random but directed way are equally important. One potentially useful video browsing paradigm is shown in figure 3. In the center, a normal video player allows the user to navigate through the currently selected video. Every 3 s while the main selection is playing, the system queries the whole video database for the shots most similar to the currently visible video sequence. The result of the query is shown as a decorative border around the main video player. At any time, the user can select any of those similar shots as the current video. In the example, similarity is based on color, but any similarity measure can be applied. For instance, similarity based on text appearing in a video sequence can be a useful criterion for browsing through a database of newscasts recorded from a diverse set of broadcast channels.
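A color-based similarity query of this kind can be sketched with histogram intersection, one common measure for comparing normalized color histograms; the shot IDs and four-bin histograms below are invented for the example, not taken from the system described:

```python
import numpy as np

def hist_intersection(h1, h2):
    """Similarity of two normalized color histograms, in [0, 1]."""
    return np.minimum(h1, h2).sum()

def most_similar_shots(current, database, k=3):
    """Rank database shots by color similarity to the current shot,
    mimicking the border of similar shots around the main player."""
    scores = [(hist_intersection(current, h), shot_id)
              for shot_id, h in database.items()]
    return [shot_id for _, shot_id in sorted(scores, reverse=True)[:k]]

# Hypothetical shot histograms over a 4-bin color space.
database = {
    "beach_1": np.array([0.1, 0.2, 0.4, 0.3]),
    "beach_2": np.array([0.2, 0.2, 0.3, 0.3]),
    "night_1": np.array([0.7, 0.2, 0.1, 0.0]),
}
current = np.array([0.1, 0.2, 0.4, 0.3])
print(most_similar_shots(current, database, k=2))  # → ['beach_1', 'beach_2']
```

Swapping `hist_intersection` for a text-based similarity function would realize the newscast-browsing variant mentioned above without changing the ranking logic.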

Another equally important task is automatic video abstraction. A video abstract is a sequence of still or moving images (with or without audio) designed to rapidly provide the user with concise information about the content of the video while preserving the essential message of the original. Different abstraction algorithms for edited video (newscasts, feature films) and raw video (home video and raw news footage) have been developed in the past, but even better methods are needed for the future.

Many interesting challenges are still waiting to be addressed by researchers. The SPIE conference Storage and Retrieval for Media Databases (20–26 January, San Jose, CA) is one of the major research meetings on this topic. A new special feature track covers peer-to-peer media sharing and distributed media searching and indexing (see www.spie.org/Conferences/Calls/02/pw/confs/ei23). More information and related work on smart media management are available at www.videoanalysis.org. oe


1. Rainer Lienhart, Reliable Transition Detection in Videos: A Survey and Practitioner's Guide, MRL technical report MRL_VIG000002-01, Intel Corporation, 2001; to appear in International Journal of Image and Graphics (IJIG).

2. Video Content Analysis homepage, www.videoanalysis.org or www.videoanalysis.de.

3. Axel Wernicke and Rainer Lienhart, On the Segmentation of Text in Videos, IEEE International Conference on Multimedia and Expo (ICME 2000), Vol. 3, pp. 1511-1514, July 2000.

4. Aditya Vailaya, Semantic Classification in Image Databases, Ph.D. thesis, Department of Computer Science, Michigan State University, 2000, www.cse.msu.edu/~vailayaa/publications.

5. Alexander Hartmann, Automatic Classification of Images on the Web, Master's thesis, University of Mannheim, August 2000.

more than skin deep

In the mid-nineties, scientists at the National Library of Medicine (Bethesda, MD) and the University of Colorado (Boulder, CO) embarked on the Visible Human project, digitizing the bodies of a male and a female cadaver at high resolution by photographing 1-mm and 0.3-mm slices, respectively. The result was a 3-D data set for both male and female bodies.

Now, with the parallel image server developed by the Ecole Polytechnique Federale de Lausanne (EPFL; Lausanne, Switzerland) in collaboration with the Geneva Hospitals and WDS Technologies SA (Geneva, Switzerland), users can view such a body inside and out. "We call this service virtual dissection," says Roger Hersch, head of the Peripheral Systems Laboratory of EPFL, the group responsible for the present program.

Using a parallel program running on multiple processors with parallel accesses to files striped across many disks, anyone can extract and animate, via the Internet, images of a knee, a brain, or an eye of one of the virtual humans simply by specifying the position and orientation of a series of sections. "What is outstanding about this work is that one can easily and quickly navigate through the entire body," says Giordano Beretta, a member of the technical staff at Hewlett-Packard (Palo Alto, CA). "It is a virtual human atlas where students or surgeons can look at various parts within the atlas."

Suppose, for example, a physician wanted to show a patient how the optic nerve leads to a view of the brain's blood vessels. With the visible human, one can extract both horizontal and oblique images that show most of the brain and optic nerve by successive slicing. "One can also specify a trajectory within the body, and the system will extract the surface along this trajectory perpendicular to the slice," says Hersch. This is helpful because very little in the human body is linear, he adds.

EPFL developed a parallel programming tool called computer-aided parallelization (CAP). When a user asks for a slice with a given orientation and position, the tool sends a request to the master PC, which forwards it to the server PCs; these in turn split the global slice-access request into approximately 450 subvolume-access and slice-part-extraction requests. Disk-access operations are overlapped with processing operations. The master PC then assembles the slice parts, compresses the resulting image into a JPEG, and sends it to the web client.
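The split-work-assemble pattern described above can be sketched in miniature with a thread pool; this is only an assumption-laden stand-in (simulated subvolumes, 8 work items instead of roughly 450, no real disk I/O or JPEG step), not the CAP tool itself:

```python
from concurrent.futures import ThreadPoolExecutor

def extract_slice_part(subvolume_id, orientation):
    """Server work item: read one subvolume and extract the part of the
    requested slice that intersects it (simulated here as a string)."""
    return f"part[{subvolume_id}@{orientation}]"

def extract_slice(orientation, n_subvolumes=8, n_workers=4):
    """Master: split the global slice request into subvolume requests,
    run them in parallel, then assemble the parts in subvolume order."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = pool.map(lambda i: extract_slice_part(i, orientation),
                         range(n_subvolumes))
        # Assembly step: map() yields results in submission order,
        # so concatenating them reconstructs the full slice.
        return list(parts)

parts = extract_slice("oblique-30deg")
print(len(parts), parts[0])  # → 8 part[0@oblique-30deg]
```

Overlapping disk access with processing, as the real system does, is exactly what a pool like this provides: while some workers wait on I/O, others extract slice parts.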

To view the Visible Human project online, go to visiblehuman.epfl.ch.

—Laurie Ann Toupin

Rainer Lienhart

Rainer Lienhart is a staff researcher at the Intel Corp. Microprocessor Research Lab, Santa Clara, CA.