Proceedings Volume 6506

Multimedia Content Access: Algorithms and Systems

View the digital version of this volume at the SPIE Digital Library.

Volume Details

Date Published: 28 January 2007
Contents: 9 Sessions, 26 Papers, 0 Presentations
Conference: Electronic Imaging 2007
Volume Number: 6506

Table of Contents

All links to SPIE Proceedings will open in the SPIE Digital Library.
  • Front Matter: Volume 6506
  • Image Analysis and Retrieval
  • Content Analysis-based Browsing
  • Video Analysis and Retrieval I
  • Applications I
  • Applications II
  • Bioinformatics
  • Video Analysis and Retrieval II
  • Applications III
Front Matter: Volume 6506
Front Matter: Volume 6506
This PDF file contains the front matter associated with SPIE Proceedings Volume 6506, including the Title Page, Copyright information, Table of Contents, Introduction (if any), and the Conference Committee listing.
Image Analysis and Retrieval
A model-based conceptual clustering of moving objects in video surveillance
Jeongkyu Lee, Pragya Rajauria, Subodh Kumar Shah
Data mining techniques have been applied to video databases to identify various patterns or groups. Clustering analysis is used to find the patterns and groups of moving objects in video surveillance systems. Most existing clustering methods focus on finding an optimal overall partitioning; however, these approaches cannot provide meaningful descriptions of the clusters. They are also not well suited to moving-object databases, since video data have spatial and temporal characteristics as well as high-dimensional attributes. In this paper, we propose a model-based conceptual clustering (MCC) of moving objects in video surveillance based on formal concept analysis. Our proposed MCC consists of three steps: 'model formation', 'model-based concept analysis', and 'concept graph generation'. The generated concept graph provides conceptual descriptions of moving objects. To assess the proposed approach, we conduct comprehensive experiments on artificial and real video surveillance data sets. The experimental results indicate that our MCC outperforms two other methods, i.e., generality-based and error-based conceptual clustering algorithms, in terms of the quality of the generated concepts.
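For readers unfamiliar with formal concept analysis, the sketch below enumerates formal concepts (extent/intent pairs) from a toy object-attribute table of moving objects. The attribute names are illustrative and the brute-force closure enumeration is a generic FCA baseline, not the paper's MCC algorithm.

```python
from itertools import combinations

# Toy object-attribute incidence table: moving objects vs. discretized
# trajectory attributes (all names are illustrative, not from the paper).
objects = {
    "obj1": {"fast", "northbound"},
    "obj2": {"fast", "southbound"},
    "obj3": {"slow", "northbound"},
}
attributes = {"fast", "slow", "northbound", "southbound"}

def extent(attrs):
    """Objects having all the given attributes."""
    return {o for o, a in objects.items() if attrs <= a}

def intent(objs):
    """Attributes shared by all the given objects."""
    return set.intersection(*(objects[o] for o in objs)) if objs else set(attributes)

# Enumerate formal concepts (extent, intent) by closing every attribute subset.
concepts = set()
for r in range(len(attributes) + 1):
    for combo in combinations(sorted(attributes), r):
        e = extent(set(combo))
        concepts.add((frozenset(e), frozenset(intent(e))))

for e, i in sorted(concepts, key=lambda c: -len(c[0])):
    print(sorted(e), "<->", sorted(i))
```

The resulting concept lattice is the kind of structure from which a concept graph with human-readable cluster descriptions can be derived.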
Image watermarking based on a color quantization process
The purpose of this paper is to propose a color image watermarking scheme based on an image-dependent color gamut sampling of the L*a*b* color space. The main motivation of this work is to control the reproduction of color images on different output devices so that they produce the same color feeling, coupling intrinsic information about the image gamut with output device calibration. The paper focuses first on finding an optimal LUT (look-up table) that both circumscribes the color gamut of the studied image and samples the color distribution of that image. This LUT is then embedded in the image as a secret message. The principle of the watermarking scheme is to modify the pixel values of the host image without perceptibly changing either the image appearance or the shape of the image gamut.
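One plausible way to build such an image-dependent LUT is to cluster the image's colors in L*a*b* space. The sketch below does this with k-means, assuming a hypothetical input file and an arbitrary LUT size of 64 entries; the authors' gamut-circumscribing construction is more elaborate.

```python
import numpy as np
from skimage import io, color
from sklearn.cluster import KMeans

# Illustrative sketch: derive an image-dependent LUT that samples the
# colour distribution inside the image gamut in L*a*b* space.
# The file name and LUT size (64 entries) are arbitrary choices.
rgb = io.imread("host_image.png")[..., :3] / 255.0
lab = color.rgb2lab(rgb).reshape(-1, 3)

kmeans = KMeans(n_clusters=64, n_init=4, random_state=0).fit(lab)
lut = kmeans.cluster_centers_   # 64 L*a*b* entries sampling the gamut

print(lut.shape)  # (64, 3); this LUT would then be embedded as the watermark
```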
Search and retrieval of medical images for improved diagnosis of neurodegenerative diseases
Ahmet Ekin, Radu Jasinschi, Erman Turan, et al.
In the medical world, the accuracy of diagnosis is mainly affected by either a lack of sufficient understanding of some diseases or by inter- and/or intra-observer variability of diagnoses. The former requires understanding the progress of diseases at much earlier stages, extracting important information from ever-growing amounts of data, and finally finding correlations between certain features and complications that illuminate disease progression. The latter (inter- and intra-observer variability) is caused by differences in the experience levels of different medical experts (inter-observer variability) or by the mental and physical tiredness of one expert (intra-observer variability). We believe that the use of large databases can help improve the current status of disease understanding and decision making. By comparing large numbers of patients, otherwise hidden relations can be revealed, patients with similar complications can be found, and diagnoses and treatments can be compared, so that the medical expert can make a better diagnosis. To this effect, this paper introduces a search and retrieval system for brain MR databases and shows that the shape of brain iron accumulation provides information beyond shape-insensitive features, such as the total brain iron load, that are commonly used in the clinic. We propose to use Kendall's correlation value to automatically compare the various returns of a query. We also describe a fully automated and fast brain MR image analysis system to detect degenerative iron accumulation in the brain, as is the case in Alzheimer's and Parkinson's disease. The system is composed of several novel image processing algorithms and has been extensively tested at Leiden University Medical Center on more than 600 patients so far.
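Kendall's correlation between two ranked return lists can be computed directly with SciPy; the ranks below are invented for illustration, e.g. one ranking by a shape-sensitive feature and one by total iron load.

```python
from scipy.stats import kendalltau

# Made-up ranks of the same six returned patients under two features.
ranking_a = [1, 2, 3, 4, 5, 6]   # ranks assigned by feature A
ranking_b = [2, 1, 3, 5, 4, 6]   # ranks assigned by feature B

tau, p_value = kendalltau(ranking_a, ranking_b)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")
```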
Content Analysis-based Browsing
Assessment of end-user response to sports highlights extraction for personal video recorders
Hal Shubin, Ajay Divakaran, Kent Wittenburg, et al.
We tested our previously reported sports highlights playback for personal video recorders with a carefully chosen set of sports aficionados. Each subject spent about an hour with the content, going through the same basic steps of introduction, trying out the system, and follow up questionnaire. The main conclusion was that the users unanimously liked the functionality very much even when it made mistakes. Furthermore, the users felt that if the user interface were made much more responsive so as to quickly compensate for false alarms and misses, the functionality would be vastly enhanced. The ability to choose summaries of any desired length turned out to be the main attraction.
Examining user interactions with video retrieval systems
The Informedia group at Carnegie Mellon University has since 1994 been developing and evaluating surrogates, summary interfaces, and visualizations for accessing digital video collections containing thousands of documents, millions of shots, and terabytes of data. This paper reports on TRECVID 2005 and 2006 interactive search tasks conducted with the Informedia system by users having no knowledge of Informedia or other video retrieval interfaces, but being experts in analyst activities. Think-aloud protocols, questionnaires, and interviews were also conducted with this user group to assess the contributions of various video summarization and browsing techniques with respect to broadcast news test corpora. Lessons learned from these user interactions are reported, with recommendations on both interface improvements for video retrieval systems and enhancing the ecological validity of video retrieval interface evaluations.
Automatic and user-centric approaches to video summary evaluation
Cuneyt M. Taskiran, Frank Bentley
Automatic video summarization has become an active research topic in content-based video processing. However, not much emphasis has been placed on developing rigorous summary evaluation methods or on developing summarization systems based on a clear understanding of user needs, obtained through user-centered design. In this paper we address these two topics and propose an automatic video summary evaluation algorithm adapted from the text summarization domain.
Video Analysis and Retrieval I
Efficient re-indexing of automatically annotated image collections using keyword combination
Alexei Yavlinsky, Stefan Rüger
This paper presents a framework for improving the image index obtained by automated image annotation. Within this framework, the technique of keyword combination is used for fast image re-indexing based on initial automated annotations. It aims to tackle the challenges of limited vocabulary size and low annotation accuracies resulting from differences between training and test collections. It is useful for situations when these two problems are not anticipated at the time of annotation. We show that based on example images from the automatically annotated collection, it is often possible to find multiple keyword queries that can retrieve new image concepts which are not present in the training vocabulary, and improve retrieval results of those that are already present. We demonstrate that this can be done at a very small computational cost and at an acceptable performance tradeoff, compared to traditional annotation models. We present a simple, robust, and computationally efficient approach for finding an appropriate set of keywords for a given target concept. We report results on TRECVID 2005, Getty Image Archive, and Web image datasets, the last two of which were specifically constructed to support realistic retrieval scenarios.
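A minimal sketch of the keyword-combination idea as we read it (not the authors' exact model): score every image in the annotated collection for an out-of-vocabulary concept by combining the posteriors of existing keywords. All data and the product combination rule are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_images, vocab = 1000, ["water", "sky", "boat", "person"]
P = rng.random((n_images, len(vocab)))   # stand-in annotation posteriors

# Hypothetical query: "sailing" ~ water AND boat, approximated by a product
# of the per-keyword posteriors (min or sum would be equally plausible).
query_keywords = [vocab.index("water"), vocab.index("boat")]
scores = P[:, query_keywords].prod(axis=1)

top10 = np.argsort(scores)[::-1][:10]
print("re-indexed top hits:", top10)
```

Because the re-indexing only combines already-computed posteriors, its cost is negligible compared with re-running an annotation model.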
Video to the rescue of audio: shot boundary assisted speaker change detection
Speaker change detection (SCD) is a preliminary step for many audio applications such as speaker segmentation and recognition. Its robustness is therefore crucial to achieving good performance in the later steps; misses (false negatives) in particular degrade the results. For some applications, domain-specific characteristics can be used to improve the reliability of the SCD. In broadcast news and discussions, the co-occurrence of shot boundaries and change points provides a robust clue for speaker changes. In this paper, two multimodal approaches are presented that utilize the results of a shot boundary detection (SBD) step to improve the robustness of the SCD. Both approaches clearly outperform the audio-only approach and are applicable exclusively to TV broadcast news and plenary discussions.
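One simple fusion rule in the spirit of the paper (the thresholds, window, and the rule itself are assumptions, not the authors' algorithm): accept confident audio change points outright, and promote weak ones that coincide with a shot boundary.

```python
# Shot boundaries from SBD and (time, confidence) candidates from SCD.
shot_boundaries = [12.4, 48.9, 101.2]           # seconds
scd_candidates = [(12.6, 0.41), (60.0, 0.92)]

STRONG, WEAK, WINDOW = 0.8, 0.3, 0.5            # assumed thresholds (s)

def near_shot_cut(t):
    return any(abs(t - b) <= WINDOW for b in shot_boundaries)

speaker_changes = [
    t for t, conf in scd_candidates
    if conf >= STRONG or (conf >= WEAK and near_shot_cut(t))
]
print(speaker_changes)  # [12.6, 60.0]
```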
A trajectory based video segmentation for surveillance applications
Naveen M. Thomas, Nishan Canagarajah
Video segmentation for content-based retrieval has traditionally been done using shot cut detection algorithms that search for abrupt changes in scene content. Surveillance videos, however, are usually captured by still cameras and do not contain any shot cuts. Hence, a novel high-level semantic change detection algorithm is proposed in this paper that uses object trajectory features to segment surveillance footage. These trajectory features are extracted automatically using background subtraction and a multiple-blob tracking algorithm. The trajectory features are first used to remove false object detections produced by background subtraction. Semantics extracted from the remaining object trajectories are then used to segment the video. The results of the algorithm when applied to surveillance data are compared with a hand-labeled segmentation to obtain precision-recall curves and the harmonic mean. Comparisons with traditional background subtraction and video segmentation algorithms show a drastic improvement in performance.
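The object-extraction front end can be sketched with OpenCV's stock components. The file name, morphology kernel, and area threshold below are assumptions, and the trajectory semantics that do the actual segmentation are not shown.

```python
import cv2

cap = cv2.VideoCapture("surveillance.avi")   # hypothetical input file
subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=True)
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)  # remove speckle
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    blobs = [cv2.boundingRect(c) for c in contours
             if cv2.contourArea(c) > 200]    # area threshold is arbitrary
    # blobs would next be linked over time into object trajectories
cap.release()
```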
Applications I
Knowledge discovery for better photographs
Jonathan Yen, Peng Wu, Daniel Tretter
A photograph captured by a digital camera usually includes camera metadata in which sensor readings, camera settings, and other capture pipeline information are recorded. The camera metadata, typically stored in an EXIF header, contains a rich set of information reflecting the conditions under which the photograph was captured. This information can be potentially useful for improving digital photography, but its high dimensionality and heterogeneous data structure make it difficult to use. Knowledge discovery, on the other hand, is usually associated with data mining to extract potentially useful information from complex data sets. In this paper we use a knowledge discovery framework based on data mining to automatically associate combinations of high-dimensional, heterogeneous metadata with scene types. In this way, we can perform very simple and efficient scene classification for certain types of photographs. We also provide an interactive user interface in which a user can type in a query on metadata, and the system retrieves from our image database the images that satisfy the query and displays them. We have used this approach to associate EXIF metadata with specific scene types such as back-lit scenes, night scenes, and snow scenes. To improve the classification results, we have combined an initial classification based only on the metadata with a simple histogram-based analysis for quick verification of the discovered knowledge. The classification results, in turn, can be used to better manage, assess, or enhance the photographs.
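Reading EXIF metadata and applying a hand-written scene rule is straightforward with Pillow; the file name and the night-scene rule below are illustrative stand-ins for the mined associations.

```python
from PIL import Image, ExifTags

img = Image.open("photo.jpg")                     # hypothetical file
exif = img.getexif()
exif_ifd = exif.get_ifd(0x8769)                   # Exif sub-IFD
tags = {ExifTags.TAGS.get(k, k): v for k, v in {**exif, **exif_ifd}.items()}

exposure = float(tags.get("ExposureTime", 0) or 0)
flash = int(tags.get("Flash", 0) or 0)

# Toy rule (an assumption, not a mined one): long exposure with flash
# fired suggests a night scene.
if exposure > 1 / 30 and flash & 1:
    print("candidate night scene:", exposure, flash)
```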
Organising a daily visual diary using multifeature clustering
The SenseCam is a prototype device from Microsoft that facilitates automatic capture of images of a person's life by integrating a colour camera, storage media and multiple sensors into a small wearable device. However, efficient search methods are required to reduce the user's burden of sifting through the thousands of images that are captured per day. In this paper, we describe experiments using colour spatiogram and block-based cross-correlation image features in conjunction with accelerometer sensor readings to cluster a day's worth of data into meaningful events, allowing the user to quickly browse a day's captured images. Two different low-complexity algorithms are detailed and evaluated for SenseCam image clustering.
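A toy version of such multi-feature event segmentation, assuming precomputed per-image dissimilarities and accelerometer-change values; the 0.6/0.4 weighting and the threshold are guesses, not the paper's evaluated algorithms.

```python
import numpy as np

rng = np.random.default_rng(1)
image_dist = rng.random(500)      # dissimilarity between adjacent images
accel_dist = rng.random(500)      # change in accelerometer readings

# Fuse the two signals and cut the day where the fused signal peaks.
fused = 0.6 * image_dist + 0.4 * accel_dist
threshold = fused.mean() + 2 * fused.std()

boundaries = np.flatnonzero(fused > threshold)
events = np.split(np.arange(501), boundaries + 1)   # 501 images total
print(f"{len(events)} events, first boundaries at {boundaries[:5]}")
```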
Applications II
Recognizing persons in images by learning from videos
Eva Hörster, Jochen Lux, Rainer Lienhart
In this paper, we propose an approach for automatically recognizing persons in images based on their general outer appearance. For this purpose we build a statistical model for each person. Large amounts of training data are collected and labeled automatically using a visual sensor array that captures image sequences containing the person to be learned. Foreground-background segmentation is performed to separate the person from the background, thus enabling the system to learn the person's appearance independently of the background. Color and gradient features are extracted to represent the segmented person. Person recognition on incoming photos is carried out using k-nearest-neighbor classification, with the normalized histogram intersection match value used as the distance measure. Reported experimental results show that the presented approach performs well.
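The classification step is concrete enough to sketch: k-nearest-neighbor voting with normalized histogram intersection as the similarity, here on synthetic stand-in histograms.

```python
import numpy as np

def hist_intersection(h1, h2):
    """Normalized histogram intersection in [0, 1] (1 = identical)."""
    return np.minimum(h1, h2).sum() / h2.sum()

def knn_person(query, train_hists, train_labels, k=5):
    """k-NN with intersection as similarity, majority vote over the top k."""
    sims = np.array([hist_intersection(query, h) for h in train_hists])
    nearest = np.argsort(sims)[::-1][:k]
    return np.bincount(train_labels[nearest]).argmax()

# Toy data: 3 persons, 30 colour/gradient histograms of 64 bins each.
rng = np.random.default_rng(0)
train = rng.random((30, 64))
labels = np.repeat(np.arange(3), 10)
print("predicted person:", knn_person(rng.random(64), train, labels))
```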
Storage format for personalized broadcasting content consumption
Sung Ho Jin, Jea-Seok Jang, Hyun-Seok Min, et al.
In this paper, we propose a storage format which binds digital broadcasts with related data such as TV-Anytime metadata, additional multimedia resources, and personal viewing history. The goal of the proposed format is to make it possible to offer personalized content consumption after recording broadcast content to storage devices, e.g., HD-DVD and Blu-ray Disc. To achieve this, we adopt the MPEG-4 file format as a container and apply the binary format for scenes (BIFS) for representing and rendering personal viewing history. In addition, TV-Anytime metadata is used to describe broadcasts and to refer to the additional multimedia resources, e.g., images, audio clips, and short video clips. To demonstrate the usefulness of the proposed format, we introduce an application scenario and test the format against it.
A unified and efficient framework for court-net sports video analysis using 3D camera modeling
The extensive amount of video data stored on available media (hard and optical disks) necessitates video content analysis, which is a cornerstone for different user-friendly applications, such as smart video retrieval and intelligent video summarization. This paper aims at finding a unified and efficient framework for court-net sports video analysis. We concentrate on techniques that are generally applicable to more than one sports type in order to arrive at a unified approach. To this end, our framework employs the concept of multi-level analysis, where a novel 3-D camera modeling is utilized to bridge the gap between object-level and scene-level analysis. The new 3-D camera modeling is based on collecting feature points from two planes that are perpendicular to each other, so that a true 3-D reference is obtained. Another important contribution is a new tracking algorithm for the objects (i.e., players), which can track up to four players simultaneously. The complete system contributes to summarization through various forms of information, of which the most important are the moving trajectory and real speed of each player, as well as the 3-D height information of objects and the semantic event segments in a game. We illustrate the performance of the proposed system by evaluating it on a variety of court-net sports videos containing badminton, tennis, and volleyball, and we show that feature detection performance is above 92% and event detection performance about 90%.
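Estimating a full 3x4 projection matrix from 3-D/2-D correspondences on two perpendicular planes can be done with the classic Direct Linear Transform; the sketch below is that generic construction, not necessarily the paper's exact calibration procedure.

```python
import numpy as np

def dlt_projection(points_3d, points_2d):
    """Direct Linear Transform: estimate a 3x4 projection matrix from
    >= 6 non-coplanar 3D-2D correspondences, e.g. feature points taken
    from the court plane and the perpendicular net plane."""
    rows = []
    for (X, Y, Z), (u, v) in zip(points_3d, points_2d):
        rows.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z, -u])
        rows.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z, -v])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    return vt[-1].reshape(3, 4)   # right singular vector of the smallest
                                  # singular value, reshaped to P

# Usage with made-up correspondences (court plane Z=0, net plane Y=0):
# P = dlt_projection(pts3d, pts2d); a point reprojects as x ~ P @ [X, Y, Z, 1]
```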
Bioinformatics
Ontology driven image search engine
Yun Bei, Julia Dmitrieva, Mounia Belmamoune, et al.
Image collections are most often domain specific. We have developed a system for image retrieval of multimodal microscopy images, that is, the same object of study visualized with a range of microscope techniques and at a range of different resolutions. In microscopy, image content depends on the preparation method of the object under study as well as on the microscope technique. Both are taken into account in the submission phase as metadata, while at the same time (domain-specific) ontologies are employed as controlled vocabularies to annotate the image. From that point onward, image data are interrelated through the relationships derived from the annotated concepts in the ontology. By using the concepts and relationships of an ontology, complex queries with true semantic content can be built. Image metadata can be used as powerful criteria to query image data which are directly or indirectly related to the original data. The results of image retrieval can be represented as a structural graph, by exploiting relationships from the ontology, rather than as a flat list. Applying this to retrieve images of the same subject at different levels of resolution opens a new field for the analysis of image content.
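With the annotations exported as RDF, such semantic queries can be expressed in SPARQL; the sketch below uses rdflib with a made-up annotation schema (the file name, namespace, and predicates are all assumptions, not the project's actual ontology).

```python
from rdflib import Graph

g = Graph()
g.parse("microscopy_annotations.rdf")      # hypothetical RDF export

# Find images annotated with any structure that is part of a larger one,
# exploiting ontology relationships rather than flat keyword matching.
query = """
PREFIX ex: <http://example.org/microscopy#>
SELECT ?image WHERE {
    ?image ex:depicts ?structure .
    ?structure ex:partOf ex:Hippocampus .
}
"""
for row in g.query(query):
    print(row.image)
```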
Adaptation of video game UVW mapping to 3D visualization of gene expression patterns
Peter D. Vize, Victor E. Gerth
Analysis of gene expression patterns within an organism plays a critical role in associating genes with biological processes in both health and disease. During embryonic development, the analysis and comparison of different gene expression patterns allows biologists to identify candidate genes that may regulate the formation of normal tissues and organs and to search for genes associated with congenital diseases. No two individual embryos, or organs, are exactly the same shape or size, so comparing spatial gene expression in one embryo to that in another is difficult. We will present our efforts in comparing gene expression data collected using both volumetric and projection approaches. Volumetric data is highly accurate but difficult to process and compare. Projection methods use UV mapping to align texture maps to standardized spatial frameworks. This approach is less accurate but is very rapid and requires very little processing. We have built a database of over 180 3D models depicting gene expression patterns mapped onto the surface of spline-based embryo models. Gene expression data in different models can easily be compared to determine common regions of activity. Visualization software, in both Java and OpenGL, optimized for viewing 3D gene expression data will also be demonstrated.
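The core of the projection approach, sampling expression "textures" from different embryos at shared UV coordinates, can be illustrated in a few lines; all data below are synthetic.

```python
import numpy as np

texture_a = np.random.rand(256, 256)     # expression pattern, embryo A
texture_b = np.random.rand(256, 256)     # expression pattern, embryo B

uv = np.array([[0.25, 0.50], [0.75, 0.50]])   # per-vertex UV coordinates

def sample(tex, uv):
    """Nearest-pixel lookup of a texture at UV coordinates in [0, 1]^2."""
    ij = (uv * (np.array(tex.shape) - 1)).astype(int)
    return tex[ij[:, 1], ij[:, 0]]

# Common regions of activity: both embryos active at the same UV locations
# (the 0.5 activity threshold is an arbitrary assumption).
common = (sample(texture_a, uv) > 0.5) & (sample(texture_b, uv) > 0.5)
print(common)
```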
Classification of yeast cells from image features to evaluate pathogen conditions
Peter van der Putten, Laura Bertens, Jinshuo Liu, et al.
Morphometrics from images, i.e. image analysis, may reveal differences between classes of objects present in the images. We have performed an image-feature-based classification for the pathogenic yeast Cryptococcus neoformans. Building and analyzing image collections from the yeast under different environmental or genetic conditions may help to diagnose a new, "unseen" situation. Diagnosis here means that retrieval of the relevant information from the image collection is at hand each time a new "sample" is presented. The basidiomycetous yeast Cryptococcus neoformans can cause infections such as meningitis or pneumonia. The presence of an extra-cellular capsule is known to be related to virulence. This paper reports on an approach to developing classifiers for detecting potentially more or less virulent cells in a sample, i.e. an image, using a range of features derived from the shape or density distribution. The classifier can henceforth be used to automate screening and to annotate existing image collections. In addition, we present our methods for creating samples, collecting images, preprocessing images, identifying yeast cells, and extracting features from the images. We compare various expertise-based and fully automated methods of feature selection, benchmark a range of classification algorithms, and illustrate their successful application to this particular domain.
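The benchmarking step maps naturally onto a cross-validation loop; the sketch below uses scikit-learn with synthetic stand-ins for the real morphometric features and capsule labels.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-ins: 200 cells, 12 shape/density features each, and a
# binary virulence-related label (e.g. encapsulated vs. not).
rng = np.random.default_rng(0)
X = rng.random((200, 12))
y = rng.integers(0, 2, 200)

for clf in (RandomForestClassifier(random_state=0), KNeighborsClassifier()):
    scores = cross_val_score(clf, X, y, cv=5)
    print(type(clf).__name__, scores.mean().round(3))
```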
Video Analysis and Retrieval II
Analysis of unstructured video based on camera motion
Although considerable work has been done on the management of "structured" video such as movies, sports, and television programs, which have known scene structures, "unstructured" video analysis is still a challenging problem due to its unrestricted nature. The purpose of this paper is to address issues in the analysis of unstructured video, in particular video shot by a typical unprofessional user (i.e., home video). We describe how camera motion information can be used for unstructured video analysis. A new concept, "camera viewing direction", is introduced as the building block of home video analysis. Motion displacement vectors are employed to temporally segment the video based on this concept. We then find the correspondence between camera behavior and the subjective importance of the information in each segment, and describe how different patterns in the camera motion can indicate levels of interest in a particular object or scene. By extracting these patterns, the most representative frames (keyframes) for the scenes are determined and aggregated to summarize the video sequence.
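The motion-displacement front end can be approximated with dense optical flow; the sketch below classifies a horizontal pan from the mean flow. The file name and threshold are assumptions, and this is a generic baseline rather than the paper's method.

```python
import cv2

cap = cv2.VideoCapture("home_video.mp4")   # hypothetical file
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    dx = flow[..., 0].mean()            # mean horizontal displacement
    if abs(dx) > 1.0:                   # threshold is an assumption
        # scene moving left (dx < 0) suggests the camera panning right
        print("pan", "right" if dx < 0 else "left")
    prev_gray = gray
cap.release()
```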
Dialog detection in narrative video by shot and face analysis
The proliferation of captured personal and broadcast content in personal consumer archives necessitates comfortable access to the stored audiovisual content. Intuitive retrieval and navigation solutions, however, require a semantic level that cannot be reached by generic multimedia content analysis alone. Fusion with film grammar rules can boost the reliability significantly. The current paper describes the fusion of low-level content analysis cues, including face parameters and inter-shot similarities, to segment commercial content into entities based on film grammar rules and subsequently classify those sequences into so-called shot reverse shots, i.e., dialog sequences. Moreover, shot-reverse-shot-specific mid-level cues are analyzed, augmenting the shot reverse shot information with dialog-specific descriptions.
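The characteristic ABAB alternation of a shot reverse shot can be detected once shots are clustered by visual similarity; the toy detector below assumes precomputed cluster labels per shot and a minimum run length of four.

```python
# Cluster labels per shot, precomputed from inter-shot similarity.
shot_clusters = ["A", "B", "A", "B", "A", "C", "D"]

def find_shot_reverse_shots(labels, min_len=4):
    """Return (start, end) index ranges of alternating two-cluster runs."""
    runs, i = [], 0
    while i + min_len <= len(labels):
        a, b = labels[i], labels[i + 1]
        j = i
        while j < len(labels) and labels[j] == (a if (j - i) % 2 == 0 else b):
            j += 1
        if a != b and j - i >= min_len:
            runs.append((i, j))          # candidate dialog sequence
            i = j
        else:
            i += 1
    return runs

print(find_shot_reverse_shots(shot_clusters))  # [(0, 5)]
```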
Edit while watching: home video editing made easy
In recent years, more and more people capture their experiences in home videos. However, home video editing is still a difficult and time-consuming task. We present the Edit While Watching system, which allows users to automatically create and change a summary of a home video in an easy, intuitive, and lean-back way. Based on content analysis, the video is indexed, segmented, and combined with appropriate music and editing effects. The result is an automatically generated home video summary that is shown to the user. While watching it, users can indicate whether they like certain content, so that the system will adapt the summary to contain more content that is similar or related to the displayed content. During video playback users can also modify and enrich the content, immediately seeing the effects of their changes. Edit While Watching does not require a complex user interface: a TV and a few keys of a remote control are sufficient. A user study has shown that it is easy to learn and to use, even though users expressed the need for more control over the editing operations and the editing process.
Applications III
Multi-module human motion analysis from a monocular video
In this paper, we propose an effective framework for the semantic analysis of human motion from a monocular video. As it is difficult to find a good motion description for humans, we focus on reliably recognizing the motion type and estimating the body orientation in the video sequence. Our framework analyzes body motion in three modules: a pre-processing module, a matching module, and a semantic module. The proposed framework includes novel object-level processing algorithms, such as a local descriptor to detect body parts and a global descriptor to analyze the shape of the whole body. Both descriptors jointly contribute to the matching process by being incorporated into a new weighted linear combination for matching. We also introduce a simple cost function based on time-index differences to distinguish motion types and cycles in human motions. Our system can provide three different types of analysis results: (1) foreground person detection; (2) motion recognition in the sequence; (3) 3-D modeling of human motion based on generic human models. The proposed framework was evaluated and proved its effectiveness, achieving motion recognition and body-orientation classification accuracies of 95% and 98%, respectively.
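The weighted linear combination used for matching is easy to sketch; the 0.6/0.4 weighting, descriptor sizes, and reference poses below are invented, not the paper's learned values.

```python
import numpy as np

def match_cost(local_q, global_q, local_ref, global_ref, w=0.6):
    """Weighted linear combination of local (body-part) and global
    (whole-shape) descriptor distances; the weight w is an assumption."""
    d_local = np.linalg.norm(local_q - local_ref)
    d_global = np.linalg.norm(global_q - global_ref)
    return w * d_local + (1 - w) * d_global

# Toy reference set: 8 poses, each with a 32-D local and 64-D global descriptor.
rng = np.random.default_rng(0)
refs = [(rng.random(32), rng.random(64), f"pose_{i}") for i in range(8)]
lq, gq = rng.random(32), rng.random(64)
best = min(refs, key=lambda r: match_cost(lq, gq, r[0], r[1]))
print("best matching reference pose:", best[2])
```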
A study on video viewing behavior: application to movie trailer miner
Sylvain Mongy, Chabane Djeraba
In this paper, we present a study on video viewing behavior. Based on a well-suited Markovian model, we have developed a clustering algorithm called K-Models, inspired by the K-Means technique, to cluster and analyze behaviors. These models are constructed from the different actions available to the user while viewing a video sequence (play, pause, forward, rewind, jump, stop). We have applied our algorithm within a movie trailer mining tool. This tool allows users to perform searches on basic attributes (cast, director, onscreen date...) and to watch the selected trailers. With an appropriate server, we log every action in order to analyze behaviors. First results, obtained from a set of beta users answering a set of defined questions, reveal interesting typical behaviors.
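Our reading of the K-Models analogy with K-Means: fit a first-order Markov transition matrix per cluster and assign each session to the model under which it is most likely, then refit and iterate as in Lloyd's algorithm. The sketch below shows the model-estimation and assignment steps; the smoothing constant and other details are assumptions.

```python
import numpy as np

ACTIONS = ["play", "pause", "forward", "rewind", "jump", "stop"]

def transition_matrix(session, n=len(ACTIONS), eps=1e-3):
    """First-order Markov model of one viewing session (action indices),
    with additive smoothing so unseen transitions keep nonzero mass."""
    M = np.full((n, n), eps)
    for a, b in zip(session, session[1:]):
        M[a, b] += 1
    return M / M.sum(axis=1, keepdims=True)

def log_likelihood(session, M):
    return sum(np.log(M[a, b]) for a, b in zip(session, session[1:]))

def assign(sessions, models):
    """K-Models assignment step: most likely model per session."""
    return [int(np.argmax([log_likelihood(s, M) for M in models]))
            for s in sessions]
```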
ARGOS: French evaluation campaign for benchmarking of video content analysis methods
The paper presents the Argos evaluation campaign for video content analysis tools, supported by the French Techno-Vision program. This project aims at developing the resources for benchmarking content analysis methods and algorithms. The paper describes the types of tasks evaluated, the way the content set was produced, the metrics and tools developed for the evaluations, and the results obtained at the end of the first phase.
Data mining learning bootstrap through semantic thumbnail analysis
Sebastiano Battiato, Giovanni Maria Farinella, Giovanni Giuffrida, et al.
The rapid increase of technological innovations in the mobile phone industry leads the research community to develop new and advanced systems to optimize the services offered by mobile phone operators (telcos), to maximize their effectiveness and improve their business. Data mining algorithms can run over data produced by mobile phone usage (e.g., image, video, text, and log files) to discover users' preferences and predict the most likely (to be purchased) offer for each individual customer. One of the main challenges is reducing the learning time and cost of these automatic tasks. In this paper we discuss an experiment where a commercial offer is composed of a small picture augmented with a short text describing the offer itself. Each customer's purchase is properly logged with all relevant information. Upon the arrival of new items we need to learn who the best customers (prospects) for each item are, that is, the ones most likely to be interested in purchasing that specific item. Such learning activity is time consuming and, in our specific case, is not feasible given the large number of new items arriving every day. Basically, given the current customer base we are not able to learn on all new items, so we need to select among the new items to identify the best candidates. We do so by using a joint analysis of visual features and text to estimate how good each new item could be, that is, whether or not it is worth learning on it. Preliminary results show the effectiveness of the proposed approach in improving classical data mining techniques.
A spatiotemporal decomposition strategy for personal home video management
Haoran Yi, Igor Kozintsev, Marzia Polito, et al.
With the advent and proliferation of low-cost and high-performance digital video recording devices, an increasing number of personal home video clips are recorded and stored by consumers. Compared to image data, video data is larger in size and richer in multimedia content. Efficient access to video content is therefore expected to be even more challenging than image mining. Previously, we developed a content-based image retrieval system and a benchmarking framework for personal images. In this paper, we extend our personal image retrieval system to include personal home video clips. A possible initial solution to video mining is to represent video clips by a set of key frames extracted from them, thus converting the problem into an image search one. Here we report that a careful selection of key frames may improve the retrieval accuracy. However, because video also has a temporal dimension, its key frame representation is inherently limited. Temporal information can give us a better representation of video content at the semantic object and concept levels than image-only representations. In this paper we propose a bottom-up framework that combines interest point tracking, image segmentation, and motion-shape factorization to decompose the video into spatiotemporal regions. We show an example application of activity concept detection using the trajectories extracted from the spatiotemporal regions. The proposed approach shows good potential for concise representation and indexing of objects and their motion in real-life consumer video.
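The key-frame baseline mentioned above can be sketched with a simple color-histogram novelty rule; the file name and the L1-distance threshold of 0.4 are arbitrary assumptions, and the paper's actual key-frame selection is more careful.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("home_clip.mp4")        # hypothetical file
keyframes, last_hist = [], None
idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # 8x8x8 BGR colour histogram, normalized to sum to 1.
    hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                        [0, 256] * 3).flatten()
    hist /= hist.sum()
    # Keep the frame if it departs enough from the last kept frame.
    if last_hist is None or np.abs(hist - last_hist).sum() > 0.4:
        keyframes.append(idx)
        last_hist = hist
    idx += 1
cap.release()
print("selected key frames:", keyframes[:10])
```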