Proceedings Volume 6820

Multimedia Content Access: Algorithms and Systems II

View the digital version of this volume at the SPIE Digital Library.

Volume Details

Date Published: 27 January 2008
Contents: 9 Sessions, 28 Papers, 0 Presentations
Conference: Electronic Imaging 2008
Volume Number: 6820

Table of Contents

All links to SPIE Proceedings will open in the SPIE Digital Library.
  • Front Matter: Volume 6820
  • Image Analysis and Retrieval I
  • Text and Image Retrieval
  • Face Analysis for Image Retrieval
  • Video Analysis and Retrieval I
  • Image Analysis and Retrieval II
  • Image Retrieval Applications
  • Video Analysis and Retrieval II
  • Image and Video Retrieval
Front Matter: Volume 6820
Front Matter: Volume 6820
This PDF file contains the front matter associated with SPIE-IS&T Proceedings Volume 6820, including the Title Page, Copyright information, Table of Contents, and the Conference Committee listing.
Image Analysis and Retrieval I
Logical unit and scene detection: a comparative survey
Logical units are semantic video segments above the shot level. Depending on the common semantics within the unit and the data domain, different types of logical unit extraction algorithms have been presented in the literature. Topic units are typically extracted for documentaries or news broadcasts, while scenes are extracted for narrative-driven video such as feature films, sitcoms, or cartoons. Other types of logical units are extracted from home video and sports. This paper reviews the logical unit extraction algorithms in the literature along the categories of unit type, data domain, features used, segmentation method, and thresholds applied. A detailed comparative study is presented for the case of extracting scenes from narrative-driven video. While earlier comparative studies focused only on scene segmentation methods or on complete news-story segmentation algorithms, this paper investigates various visual features and segmentation methods with their thresholding mechanisms and their combination into complete scene detection algorithms. The performance of the resulting large set of algorithms is then evaluated on a set of video files including feature films, sitcoms, children's shows, a detective story, and cartoons.
Hierarchical photo stream segmentation using context
Photo stream segmentation divides a photo stream into groups, each of which corresponds to an event. It can be done with or without prior knowledge of the event structure; in this paper, we study the problem by assuming that no a priori event model is available. Although both context and content information are important for photo stream segmentation, we focus on investigating the use of context information in this work. We consider different components of context, such as time, location, and optical settings, for inexpensive segmentation of photo streams from typical users of modern digital cameras. As events are hierarchical, we propose to segment photo streams using a hierarchical mixture model. We compare the generated hierarchy with that created by users to see how well results can be obtained without knowing the event model in advance. We experimented with about 3000 photos from amateur photographers to study the efficacy of the approach for these context information components.
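As a rough illustration of context-only segmentation, the sketch below groups a photo stream by capture-time gaps at two levels of granularity. It stands in for the hierarchical mixture model described above; the plain time-gap rule and both thresholds are our assumptions for the example, not the authors' method.

```python
from datetime import timedelta

def segment_by_time(timestamps, gap):
    """Split a time-sorted photo stream wherever two consecutive
    shots are further apart than `gap` (a timedelta)."""
    groups, current = [], [timestamps[0]]
    for prev, curr in zip(timestamps, timestamps[1:]):
        if curr - prev > gap:
            groups.append(current)
            current = []
        current.append(curr)
    groups.append(current)
    return groups

def hierarchical_segment(timestamps):
    # Coarse events (gaps over 6 hours) containing finer sub-events
    # (gaps over 15 minutes); both thresholds are illustrative only.
    events = segment_by_time(sorted(timestamps), timedelta(hours=6))
    return [segment_by_time(event, timedelta(minutes=15)) for event in events]
```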
Text and Image Retrieval
Enriching text with images and colored light
Dragan Sekulovski, Gijs Geleijnse, Bram Kater, et al.
We present an unsupervised method to enrich textual applications with relevant images and colors. The images are collected by querying large image repositories, and the colors are subsequently computed using image processing. A prototype system based on this method is presented, in which the method is applied to song lyrics; in combination with a lyrics synchronization algorithm, the system produces a rich multimedia experience. In order to identify terms within the text that may be associated with images and colors, we select noun phrases using a part-of-speech tagger. Large image repositories are queried with these terms, and representative colors are extracted per term from the collected images using either a histogram-based or a mean shift-based algorithm. The representative color extraction exploits the non-uniform distribution of the colors found in the large repositories. The images ranked best by the search engine are displayed on a screen, while the extracted representative colors are rendered on controllable lighting devices in the living room. We evaluate our method by comparing the computed colors to standard color representations of a set of English color terms. A second evaluation focuses on the distance in color between a queried term in English and its translation in a foreign language. Based on results from three sets of terms, a KL divergence-based measure of a term's suitability for color extraction is proposed. Finally, we compare the performance of the algorithm using either the automatically indexed repository of Google Images or the manually annotated Flickr.com. Based on the results of these experiments, we conclude that the presented method can compute the relevant color for a term using a large image repository and image processing.
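As an illustration, the sketch below implements the histogram-based variant in miniature: pixels from the images collected for a term vote in a quantized RGB histogram, and the center of the dominant bin is returned as the term's representative color. The bin count and the Pillow/NumPy plumbing are our assumptions.

```python
import numpy as np
from PIL import Image

def representative_color(image_paths, bins=8):
    """Dominant color over a set of images, via a quantized RGB histogram."""
    hist = np.zeros((bins, bins, bins))
    step = 256 // bins
    for path in image_paths:
        rgb = np.asarray(Image.open(path).convert("RGB")) // step
        # every pixel votes for its quantized color bin
        np.add.at(hist, (rgb[..., 0].ravel(), rgb[..., 1].ravel(),
                         rgb[..., 2].ravel()), 1)
    r, g, b = np.unravel_index(hist.argmax(), hist.shape)
    return (r * step + step // 2, g * step + step // 2, b * step + step // 2)
```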
Giving order to image queries
Jonathon S. Hare, Patrick A. S. Sinclair, Paul H. Lewis, et al.
Users of image retrieval systems often find it frustrating that the image they are looking for is not ranked near the top of the results they are presented with. This paper presents a computational approach for ranking keyworded images in order of relevance to a given keyword. Our approach uses machine learning to learn which visual features within an image are most related to the keywords, and then provides a ranking based on similarity to a visual aggregate. To evaluate the technique, a Web 2.0 application has been developed to obtain a corpus of user-generated ranking information for a given image collection that can be used to evaluate the performance of the ranking algorithm.
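A minimal reading of ranking against a "visual aggregate" might look like the sketch below: average the feature vectors of all images carrying the keyword and order the images by distance to that average. The learning step described above is replaced by a simple mean for brevity, and the feature representation is left abstract.

```python
import numpy as np

def rank_for_keyword(features, image_ids):
    """features: dict mapping image id -> 1-D feature vector."""
    X = np.array([features[i] for i in image_ids])
    aggregate = X.mean(axis=0)                      # the "visual aggregate"
    dists = np.linalg.norm(X - aggregate, axis=1)
    return [image_ids[i] for i in dists.argsort()]  # most relevant first
```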
Logo detection using wavelet co-occurrence histograms
Ali Hesson, Dimitrios Androutsos
In this paper, we propose a retrieval system for logo and trademark images. One technique that has been proposed in the past employs the histogram of edge direction angles to index an image in a database of images. The histogram is called the Edge Directional Histogram (EDH). Our proposed technique is based on using the wavelet decomposition coefficients of an image to build a co-occurrence histogram. We call this histogram the wavelet co-occurrence histogram (WCH). By capturing the edge information and intensity variations in an image, as well as the spatial separation of these variations, the WCH presents a more accurate representation of the image features than the EDH. Results demonstrate that our retrieval system performs better than the EDH.
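The sketch below gives one plausible construction of a WCH using PyWavelets: quantize one level of Haar detail coefficients and count horizontally adjacent coefficient pairs. The choice of subbands, quantization, and pair offset are our guesses; the authors' exact construction may differ.

```python
import numpy as np
import pywt  # PyWavelets

def wavelet_cooccurrence_histogram(gray_image, levels=16):
    # One-level 2-D Haar decomposition; keep the three detail subbands.
    _, (lh, hl, hh) = pywt.dwt2(gray_image.astype(float), "haar")
    hist = np.zeros((levels, levels))
    for band in (lh, hl, hh):
        q = np.digitize(band, np.linspace(band.min(), band.max(), levels - 1))
        # count co-occurrences of horizontally adjacent quantized coefficients
        np.add.at(hist, (q[:, :-1].ravel(), q[:, 1:].ravel()), 1)
    return hist / hist.sum()
```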
Face Analysis for Image Retrieval
A novel approach to personal photo album representation and management
Edoardo Ardizzone, Marco La Cascia, Filippo Vella
In this paper we present a novel approach to personal photo album management that allows the end user to efficiently access the collection without any need for tedious manual annotation or indexing of the photos. The proposed work exploits methods and technology from the fields of computer vision and pattern recognition for face detection, face representation, and image annotation to automatically create descriptions of images useful for content-based searching and retrieval. In fact, even if most of the techniques used are not reliable enough to address the general problem of content-based image retrieval, we show that, in a limited domain such as the personal photo album, it is possible to obtain results that improve the browsing capabilities of current photo album management systems. In particular, starting from the observation that most personal photos depict a usually small number of people in a relatively small number of different contexts (indoor, outdoor, beach, mountain, city, etc.), we propose the use of automatic techniques to index images based on who is present in the scene and on the context where the picture was taken. Experiments on a personal photo collection of about a thousand images showed that relatively simple content-based techniques lead to surprisingly good results in terms of ease of user access to the data.
Facial features matching using a virtual structuring element
Face analysis in a real-world environment is a complex task, as it must deal with challenging problems such as pose variations, illumination changes, and complex backgrounds. The use of active appearance models for facial feature detection is often successful in restricted environments, but performance decreases when they are applied in unconstrained environments. Therefore, in this paper, we introduce a novel method that integrates the knowledge of a face detector into the shape and appearance models by using what we call a 'virtual structuring element' (VSE). In this way the possible settings of the active appearance models are constrained in an appearance-driven manner. The use of a virtual structuring element in an active appearance model provides increased performance in both accuracy and robustness over standard active appearance models applied to different environments.
Picture management using person retrieval for consumer image collections
Gabriel Costache, Rhys Mulryan, Alexandru Drimbarean, et al.
In recent years, the rapid evolution of digital photography has led to increasing interest in developing algorithms for indexing and classifying collections of digital images. This paper presents an automatic system for organizing and browsing consumer digital image collections using the persons in the images as patterns. In order to implement such an automatic system, we have to detect and classify the people in the images according to their similarities. For this we employ algorithms for face detection and face recognition, along with additional methods to cope with the large variations that are usually present in consumer images. These additional methods include using more than one type of classifier for face recognition, as well as using additional information about a person's characteristics extracted from regions other than the face. This additional information is more robust to the factors that degrade the accuracy of classical face recognition systems when working with consumer images. The proposed system was tested on a typical consumer image collection, and practical applications of the system are presented at the end.
Distributed wireless face recognition system
A face recognition system gains flexibility and cost efficiency when integrated into a wireless network; meanwhile, face recognition enhances the functionality and security of the wireless network. This paper proposes a distributed wireless network prototype, consisting of a feature net and a database net, to accomplish the face identification task by optimally allocating network resources. The face recognition technique used in this paper is subspace-based modular processing with score- and decision-level fusion. The subspace features are selected by a step-wise statistical procedure, the Modified Indifference-Zone Method, which improves efficiency and accuracy. Fusion further improves the performance over using either the whole face or modules alone. The face recognition techniques are re-engineered to be implemented on the distributed wireless network, and simulation results show promising improvement over centralized recognition.
Video Analysis and Retrieval I
Improving multimedia retrieval with a video OCR
We present a set of experiments with a video OCR system (VOCR) tailored for video information retrieval and establish its importance in multimedia search in general and for some specific queries in particular. The system, inspired by existing work on text detection and recognition in images, has been developed using techniques involving detailed analysis of video frames to produce candidate text regions. The text regions are then binarized and sent to a commercial OCR, resulting in ASCII text that is finally used to create search indexes. The system is evaluated using the TRECVID data. We compare the system's performance from an information retrieval perspective with another VOCR developed using multi-frame integration and empirically demonstrate that deep analysis of individual video frames results in better video retrieval. We also evaluate the effect of various textual sources on multimedia retrieval by combining the VOCR outputs with automatic speech recognition (ASR) transcripts. For general search queries, the VOCR system coupled with ASR sources outperforms the other system by a large margin. For search queries that involve named entities, especially people's names, the VOCR system even outperforms speech transcripts, demonstrating that source selection for particular query types is essential.
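For flavor, here is a heavily simplified frame-level pipeline in the spirit of the VOCR: sample frames, binarize them, and hand them to an off-the-shelf OCR. OpenCV's Otsu thresholding and pytesseract stand in for the candidate-region analysis and the commercial OCR of the actual system.

```python
import cv2
import pytesseract

def ocr_video_frames(video_path, step=30):
    """Return (frame index, recognized text) pairs for a search index."""
    cap, texts, idx = cv2.VideoCapture(video_path), [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:  # analyze every `step`-th frame
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            # Otsu binarization as a crude stand-in for region binarization
            _, binary = cv2.threshold(gray, 0, 255,
                                      cv2.THRESH_BINARY + cv2.THRESH_OTSU)
            text = pytesseract.image_to_string(binary).strip()
            if text:
                texts.append((idx, text))
        idx += 1
    cap.release()
    return texts
```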
Event-centric media management
Ansgar Scherp, Srikanth Agaram, Ramesh Jain
The management of the vast amount of media assets captured at everyday events such as meetings, birthday parties, vacations, and conferences has become an increasingly challenging problem. Today, most media management applications are media-centric: they put the captured media assets at the center of the management. However, in recent years it has been argued that events are a much better abstraction of human experience and thus provide a more appropriate means for managing media assets. Consequently, approaches that include events in their media management solutions have been explored. However, they typically consider events only as additional metadata that can be extracted from the media assets. In addition, today's applications and approaches concentrate on particular problems such as event detection, tagging, sharing, classification, or clustering, and are often focused on a single media type. In this paper, we argue for the benefits of an event-centric media management (EMMa) approach that looks at the problem of media management holistically. Based on a generic event model, we specify a media event model for the EMMa approach. The individual phases and processes of the EMMa approach are defined in a general process chain for event-centric media management, the EMMa cycle. This cycle follows the event concept throughout all phases and processes of the chain and puts the concept of events at the center of media management. Based on the media event model and the EMMa cycle, we design a component-based architecture for the EMMa approach and implement it.
Improving scene detection by using gradual shot transitions as cues from film grammar
The types of shot transitions used by film editors in video are not randomly chosen. Cuts, dissolves, fades, and wipes are devices in film grammar used to structure video. In this work knowledge of film grammar is used to improve scene detection algorithms. Three improvements to known scene detection algorithms are proposed: (1) The selection of key-frames for shot similarity measurement should take the position of gradual shot transitions into account. (2) Gradual shot transitions have a separating effect. It is shown how this local cue can be used to improve the global structuring into logical units. (3) Gradual shot transitions also have a merging effect upon shots in their temporal proximity. It is shown how coherence values and shot similarity values used during scene detection have to be modified to exploit this fact. The proposed improvements can be used together with a variety of scene detection approaches. Experimental results with time adaptive grouping indicate that considerable improvements in terms of precision and recall are achieved.
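As a sketch of how cues (2) and (3) might enter a detector, the hypothetical helper below re-weights a shot-pair similarity: pairs separated by a gradual transition are attenuated (separating effect), while pairs lying close together around one are boosted (merging effect). The weighting factors and window are invented for illustration.

```python
def adjusted_similarity(sim, shot_i, shot_j, gradual_boundaries, window=2):
    """Re-weight the similarity of shots i < j; `gradual_boundaries`
    holds indices of boundaries that are dissolves, fades, or wipes."""
    crossed = sum(1 for b in gradual_boundaries if shot_i <= b < shot_j)
    if crossed:
        return sim * 0.5 ** crossed   # separating effect of gradual transitions
    near = any(abs(shot_i - b) <= window and abs(shot_j - b) <= window
               for b in gradual_boundaries)
    if near:
        return min(1.0, sim * 1.5)    # merging effect in temporal proximity
    return sim
```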
Video fingerprinting: features for duplicate and similar video detection and query-based video retrieval
Anindya Sarkar, Pratim Ghosh, Emily Moxley, et al.
A video "fingerprint" is a feature extracted from the video that should represent the video compactly, allowing faster search without compromising the retrieval accuracy. Here, we use a keyframe set to represent a video, motivated by the video summarization approach. We experiment with different features to represent each keyframe with the goal of identifying duplicate and similar videos. Various image processing operations like blurring, gamma correction, JPEG compression, and Gaussian noise addition are applied on the individual video frames to generate duplicate videos. Random and bursty frame drop errors of 20%, 40% and 60% (over the entire video) are also applied to create more noisy "duplicate" videos. The similar videos consist of videos with similar content but with varying camera angles, cuts, and idiosyncrasies that occur during successive retakes of a video. Among the feature sets used for comparison, for duplicate video detection, Compact Fourier-Mellin Transform (CFMT) performs the best while for similar video retrieval, Scale Invariant Feature Transform (SIFT) features are found to be better than comparable-dimension features. We also address the problem of retrieval of full-length videos with shorter-length clip queries. For identical feature size, CFMT performs the best for video retrieval.
Semantic video indexing using context-dependent fusion
We present a novel method for fusing the results of multiple semantic video indexing algorithms that use different types of feature descriptors and different classification methods. This method, called Context-Dependent Fusion (CDF), is motivated by the fact that the relative performance of different semantic indexing methods can vary significantly depending on the video type, context information, and the high-level concept of the video segment to be labeled. The training part of CDF has two main components: context extraction and algorithm fusion. In context extraction, the low-level audio-visual descriptors used by the different classification algorithms are combined and used to partition the descriptor space into groups of similar video shots, or contexts. The algorithm fusion component identifies a subset of classification algorithms (local experts) for each context based on their relative performance within that context. Results on the TRECVID-2002 data collections show that the proposed method can identify meaningful and coherent clusters and that different labeling algorithms are identified for the different contexts. Our initial experiments indicate that context-dependent fusion outperforms the individual algorithms. We also show that, using simple visual descriptors and a simple K-NN classifier, the CDF approach provides results that are comparable to state-of-the-art methods in semantic indexing.
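A compact sketch of the two training components, with scikit-learn as a stand-in: K-means partitions the descriptor space into contexts, and each context keeps the base classifier that scores best on it. The clusterer, the selection criterion, and the plumbing are illustrative, not the authors' exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_cdf(X, y, classifiers, n_contexts=5):
    """X: (n_shots, n_descriptors); classifiers: unfitted sklearn estimators."""
    contexts = KMeans(n_clusters=n_contexts, n_init=10).fit(X)
    fitted = [clf.fit(X, y) for clf in classifiers]  # train each base classifier
    experts = {}
    for c in range(n_contexts):
        mask = contexts.labels_ == c
        # local expert = classifier with the best accuracy inside this context
        scores = [clf.score(X[mask], y[mask]) for clf in fitted]
        experts[c] = fitted[int(np.argmax(scores))]
    return contexts, experts

def predict_cdf(x, contexts, experts):
    c = contexts.predict(x.reshape(1, -1))[0]
    return experts[c].predict(x.reshape(1, -1))[0]
```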
Highlight summarization in golf videos using audio signals
Hyoung-Gook Kim, Jin Young Kim
In this paper, we present an automatic summarization of highlights in golf videos based on audio information alone, without video information. The proposed highlight summarization system is based on semantic audio segmentation and the detection of action units from audio signals. Studio speech, field speech, music, and applause are segmented by means of sound classification, and swings are detected by impulse onset detection. Swing and applause sounds together form a complete action unit, while studio speech and music parts are used to anchor the program structure. Owing to the highly precise detection of applause, highlights are extracted effectively. Our experimental results show high classification precision on 18 golf games, demonstrating that the proposed system is effective and computationally efficient enough to be applied in embedded consumer electronic devices.
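A toy version of the action-unit rule described above: a highlight is a swing-like impulse followed shortly by applause. The input is assumed to be (label, start, end) segments in seconds from an upstream audio classifier, and the 10-second pairing window is our guess.

```python
def extract_highlights(segments, max_gap=10.0):
    """segments: time-ordered (label, start_sec, end_sec) tuples."""
    highlights = []
    for i, (label, start, end) in enumerate(segments):
        if label != "swing":
            continue
        for later_label, l_start, l_end in segments[i + 1:]:
            if l_start - end > max_gap:
                break  # applause came too late to belong to this swing
            if later_label == "applause":
                highlights.append((start, l_end))  # one complete action unit
                break
    return highlights
```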
Image Analysis and Retrieval II
Concept annotation and search space decrement of digital photos using optical context information
A modern digital camera is not just a single sensor capturing light. It is an ensemble of different sensors which capture independent contextual information about the photo shooting event. This is stored as metadata in the image. In this paper, we demonstrate how the optical metadata (data related to the optics of the camera) can be retrieved, interpreted and used along with content information for organizing and indexing digital photos. Our model is based on the physics of vision and operation of a camera. We use our algorithm on images from personal photo albums. Our results show that the optical metadata improves annotation performance and decreases the search space for retrieval.
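For reference, one way to read such optical metadata from a JPEG's EXIF block with Pillow is sketched below; the tag selection is ours, and tag availability varies by camera model.

```python
from PIL import Image
from PIL.ExifTags import TAGS

OPTICAL_TAGS = {"FocalLength", "FNumber", "ExposureTime",
                "ISOSpeedRatings", "Flash", "SubjectDistance"}

def optical_metadata(path):
    exif = Image.open(path).getexif()
    photo_ifd = exif.get_ifd(0x8769)  # Exif sub-IFD where the optics tags live
    named = {TAGS.get(tag, tag): value for tag, value in photo_ifd.items()}
    return {k: v for k, v in named.items() if k in OPTICAL_TAGS}
```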
Content-based image retrieval using greedy routing
Anthony Don, Nicolas Hanusse
In this paper, we propose a new concept for browsing and searching in large collections of content-based indexed images. Our approach is inspired by greedy routing algorithms used in distributed networks. We define a navigation graph, called a navgraph, whose vertices represent images. The edges of the navgraph are computed according to a similarity measure between indexed images. The resulting graph can be seen as an ad-hoc network of images in which a greedy routing algorithm can be applied for retrieval purposes. A request for a target image consists of a walk in the navigation graph using a greedy approach: starting from an arbitrary vertex/image, the neighbors of the current vertex are presented to the user, who iteratively selects the vertex that is most similar to the target. We present the navgraph construction and prove its efficiency for greedy routing. We also propose a specific content descriptor, which we compare to the MPEG-7 Color Layout Descriptor. Experimental results with test users show the usability of this approach.
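To make the routing idea concrete, a bare-bones greedy walk is sketched below, with the interactive user replaced by a distance oracle toward a known target so the example is runnable; the navgraph construction and the descriptor are abstracted away.

```python
def greedy_route(graph, dist, features, start, target):
    """graph: image id -> list of neighbor ids; dist: dissimilarity function."""
    current, path = start, [start]
    while current != target:
        # the "user" picks the neighbor most similar to the target
        best = min(graph[current],
                   key=lambda n: dist(features[n], features[target]))
        if dist(features[best], features[target]) >= dist(features[current],
                                                           features[target]):
            break  # local minimum: greedy routing is stuck
        current = best
        path.append(current)
    return path
```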
Evaluation of content-based features for user-centered image retrieval in small media collections
Horst Eidenberger, Maia Zaharieva
The experiments described in this paper indicate that under certain conditions content-based features are not required for efficient user-centered image retrieval in small media collections. The importance of feature selection drops dramatically if classification is used for retrieval (e.g. if Support Vector Machines are used) and only little user feedback is available. In this situation, simple image features and even random features perform as well as sophisticated signal processing-based features (e.g. the content-based MPEG-7 image descriptors). Practically relevant applications for these findings are retrieval on mobile devices and in heterogeneous (e.g. ad hoc generated) media collections.
Image Retrieval Applications
Content-based unconstrained color logo and trademark retrieval with color edge gradient co-occurrence histograms
Raymond Phan, Dimitrios Androutsos
In this paper, we present a logo and trademark retrieval system for unconstrained color image databases that extends the Color Edge Co-occurrence Histogram (CECH) object detection scheme. We introduce more accurate information to the CECH, by virtue of incorporating color edge detection using vector order statistics. This produces a more accurate representation of edges in color images, in comparison to the simple color pixel difference classification of edges as seen in the CECH. Our proposed method is thus reliant on edge gradient information, and as such, we call this the Color Edge Gradient Co-occurrence Histogram (CEGCH). We use this as the main mechanism for our unconstrained color logo and trademark retrieval scheme. Results illustrate that the proposed retrieval system retrieves logos and trademarks with good accuracy, and outperforms the CECH object detection scheme with higher precision and recall.
MapSnapper: engineering an efficient algorithm for matching images of maps from mobile phones
Jonathon S. Hare, Paul H. Lewis, Layla Gordon, et al.
The MapSnapper project aimed to develop a system for robust matching of low-quality images of a paper map taken from a mobile phone against a high quality digital raster representation of the same map. The paper presents a novel methodology for performing content-based image retrieval and object recognition from query images that have been degraded by noise and subjected to transformations through the imaging system. In addition the paper also provides an insight into the evaluation-driven development process that was used to incrementally improve the matching performance until the design specifications were met.
Visual search engine for product images
Xiaofan Lin, Burak Gokturk, Baris Sumengen, et al.
Nowadays there are many product comparison web sites, but most of them use only text information. This paper introduces a novel visual search engine for product images, which provides a brand-new way of visually locating products through content-based image retrieval (CBIR) technology. We discuss the unique technical challenges, solutions, and experimental results in the design and implementation of this system.
Video Analysis and Retrieval II
Distributed classifier chain optimization for real-time multimedia stream mining systems
We consider the problem of optimally configuring classifier chains for real-time multimedia stream mining systems. Jointly maximizing the performance over several classifiers under minimal end-to-end processing delay is a difficult task due to the distributed nature of analytics (e.g. utilized models or stored data sets), where changing the filtering process at a single classifier can have an unpredictable effect both on the feature values of data arriving at classifiers further downstream and on the end-to-end processing delay. Although the utility function cannot be accurately modeled, in this paper we propose a randomized distributed algorithm that guarantees almost sure convergence to the optimal solution. We also provide results using speech data showing that the algorithm performs well in highly dynamic environments.
Distributed multi-dimensional hidden Markov model: theory and application in multiple-object trajectory classification and recognition
Xiang Ma, Dan Schonfeld, Ashfaq Khokhar
In this paper, we propose a novel distributed causal multi-dimensional hidden Markov model (DHMM). The proposed model can represent, for example, multiple motion trajectories of objects and their interaction activities in a scene; it is capable of conveying not only the dynamics of each trajectory but also the interaction information between multiple trajectories, which can be critical in many applications. We first provide a solution for a non-causal, multi-dimensional hidden Markov model (HMM) by distributing the non-causal model into multiple distributed causal HMMs. We approximate the simultaneous solution of multiple HMMs on a sequential processor by an alternate updating scheme. Subsequently, we provide three algorithms for the training and classification of our proposed model. A new Expectation-Maximization (EM) algorithm suitable for estimating the new model is derived, where a novel General Forward-Backward (GFB) algorithm is proposed for recursive estimation of the model parameters. A new conditionally independent subset-state sequence structure decomposition of state sequences is proposed for the 2D Viterbi algorithm. The new model can be applied to many other areas such as image segmentation and image classification. Simulation results for the classification of multiple interacting trajectories demonstrate the superior performance and higher accuracy of our distributed HMM in comparison to previous models.
STRG-QL: spatio-temporal region graph query language for video databases
In this paper, we present a new graph-based query language and its query processing for a Graph-based Video Database Management System (GVDBMS). Although extensive research has proposed various query languages for video databases, most are limited in handling general-purpose video queries: each method can handle only a specific data model, query type, or application. In order to develop a general-purpose video query language, we first produce a Spatio-Temporal Region Graph (STRG) for each video, which represents the spatial and temporal information of video objects. An STRG data model is generated from the STRG by exploiting an object-oriented model. Based on the STRG data model, we propose a new graph-based query language named STRG-QL, which supports various types of video queries. To process the proposed STRG-QL, we introduce rule-based query optimization that considers the characteristics of video data, i.e., the hierarchical correlations among video segments. The results of our extensive experimental study show that the proposed STRG-QL is promising in terms of accuracy and cost.
Method of shot determination in a robot camera cooperative shooting system
Makoto Okuda, Takao Tsuda, Kazutoshi Mutou, et al.
We are building a program-production system employing multiple robot cameras as a new program-production support technology. In this system, the robot cameras are automatically controlled in accordance with shooting rules that specify the relationship between changes in the program situation and the shots taken by individual cameras. However, studio layout elements, such as the number of participants and the position in which flip-cards are displayed, differ for each program. For this reason, production staff must reset the shooting rules for every program, and this operation is extremely burdensome in the limited preparation time available. We therefore devised a method of automatically generating shooting rules from simple information input, based on an analysis of the shooting methods of camera operators, and tested the validity of this method in simulations. Moreover, we built a program-production system in which robot cameras are connected via a network to various sensors that we developed to detect changes in the program situation, and we evaluated the system by conducting shooting experiments on an actual TV program production.
Image and Video Retrieval
Colour appearance descriptors for image browsing and retrieval
In this paper, we focus on the development of whole-scene colour appearance descriptors for classification to be used in browsing applications. The descriptors can classify a whole-scene image into various categories of semantically based colour appearance. Colour appearance is an important feature and has been extensively used in image analysis, retrieval, and classification. Using pre-existing global CIELAB colour histograms, we first develop metrics for whole-scene colour appearance: "colour strength", "high/low lightness", and "multicoloured". We then propose methods that use these metrics, either alone or combined, to classify whole-scene images into five categories of appearance: strong, pastel, dark, pale, and multicoloured. Experiments show positive results and indicate that the global colour histogram is indeed useful for whole-scene colour appearance classification. We have also conducted a small-scale human evaluation test on whole-scene colour appearance. The results show that, with suitable threshold settings, the proposed methods can describe the whole-scene colour appearance of images close to human classification. The descriptors were tested on thousands of images from various sources: paintings, natural scenes, objects, photographs, and documents. The colour appearance classifications are being integrated into an image browsing system, which allows them to also be used to refine browsing.
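One illustrative reading of such metrics, computed directly on CIELAB pixels rather than on the pre-existing global histograms the paper uses; every threshold below is invented for the example.

```python
import numpy as np

def colour_appearance(lab_pixels):
    """lab_pixels: (n, 3) array of CIELAB rows, with L* in 0..100."""
    L = lab_pixels[:, 0]
    chroma = np.hypot(lab_pixels[:, 1], lab_pixels[:, 2])   # "colour strength"
    hue = np.degrees(np.arctan2(lab_pixels[:, 2], lab_pixels[:, 1]))
    # count distinct 30-degree hue sectors among sufficiently colourful pixels
    n_hues = len(np.unique((hue[chroma > 20] // 30).astype(int)))
    if n_hues >= 6:
        return "multicoloured"
    if chroma.mean() > 40:
        return "strong"
    if L.mean() < 35:
        return "dark"
    return "pastel" if chroma.mean() > 15 else "pale"
```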
Audio scene segmentation for video with generic content
Feng Niu, Naveen Goela, Ajay Divakaran, et al.
In this paper, we present a content-adaptive audio texture based method to segment video into audio scenes. The audio scene is modeled as a semantically consistent chunk of audio data. Our algorithm is based on "semantic audio texture analysis." At first, we train GMM models for basic audio classes such as speech, music, etc. Then we define the semantic audio texture based on those classes. We study and present two types of scene changes, those corresponding to an overall audio texture change and those corresponding to a special "transition marker" used by the content creator, such as a short stretch of music in a sitcom or silence in dramatic content. Unlike prior work using genre specific heuristics, such as some methods presented for detecting commercials, we adaptively find out if such special transition markers are being used and if so, which of the base classes are being used as markers without any prior knowledge about the content. Our experimental results show that our proposed audio scene segmentation works well across a wide variety of broadcast content genres.