Proceedings Volume 5682

Storage and Retrieval Methods and Applications for Multimedia 2005

View the digital version of this volume at SPIE Digital Library.

Volume Details

Date Published: 17 January 2005
Contents: 8 Sessions, 34 Papers, 0 Presentations
Conference: Electronic Imaging 2005
Volume Number: 5682

Table of Contents

  • Fast Storage Access
  • Storage Security
  • Media Mining
  • Special Session: Video Surveillance
  • Image Indexing
  • Image Retrieval
  • Audio Processing
  • Video Processing
Fast Storage Access
A Beowulf Class parallel remote-sensed image database retrieval system developed in ASSIST environment
Vincenzo Di Lecce, Andrea Guerriero, I. Guarino
Image databases are now used in a wide range of areas; in particular, the development and application of remote sensing platforms produce huge amounts of image data. Though advanced image compression technology has solved part of the storage problem, searching such a database is still a difficult task. In the 1990s, Content-based Image Retrieval (CBIR) gained increasing popularity among researchers; however, there is still no consensus on how to retrieve the content of an image efficiently and effectively. This is because the low-level features of an image, including color, shape, and texture, which can be easily analyzed, do not coincide with the high-level concepts of an image. Another major problem in the practical implementation of CBIR for remotely sensed images is that the content-based indexing and searching process requires extremely high computational power. On the other hand, content-based image retrieval algorithms are well suited to parallel computation, as they can be broken into several data-independent processes that run on a parallel computer. In this paper, we discuss the problem of porting a sequential remote-sensed image retrieval application to a parallel environment using the programming paradigm introduced by a new structured programming language (Assist 1.2), and we evaluate several skeleton compositions to optimize the performance of our application.
Fast and constant time random access decoding with log(2)n block seek time
For faster random access to a target image block, a bisection idea is applied to link image blocks. Conventional methods configure the blocks in a linearly linked way, so the block seek time depends entirely on the location of the block in the compressed bitstream. The block linkage information is configured such that binary search is possible, giving a worst-case block seek time of log2(n) for n blocks. Experimental results with 3D-SPIHT on video sequences show that the presented idea gives a substantial speed improvement with minimal bit overhead.
Storage Security
Protecting multimedia data in storage: a survey of techniques emphasizing encryption
Paul Stanton, William Yurcik, Larry Brumbaugh
Protecting multimedia data from malicious computer users continues to grow in importance. Whether preventing unauthorized access to digital photographs, ensuring compliance with copyright regulations, or guaranteeing the integrity of a video teleconference, all multimedia applications require increased security in the presence of talented intruders. Specifically, as more and more files are preserved on disk the requirement to provide secure storage has become more important. This paper presents a survey of techniques for securely storing multimedia data, including theoretical approaches, prototype systems, and existing systems ready for employment. Due to the wide variety of potential solutions available, a prospective customer can easily become overwhelmed while researching an appropriate system for multimedia requirements. Since added security measures inevitably result in slower system performance, certain storage solutions provide a better fit for particular applications along a security/performance continuum. This paper provides an overview of the prominent characteristics of several systems to provide a foundation for selecting the most appropriate solution. Initially, the paper establishes a set of criteria for evaluating a storage solution based on confidentiality, integrity, availability, and performance. Then, using these criteria, the paper explains the relevant characteristics of select storage systems providing a comparison of the major differences. Finally, the paper examines specific applications of storage devices in the multimedia environment.
Tamper-resistant storage techniques for multimedia systems
Elizabeth Haubert, Joseph Tucek, Larry Brumbaugh, et al.
Tamper-resistant storage techniques provide varying degrees of authenticity and integrity for data. This paper surveys five implemented tamper-resistant storage systems that use encryption, cryptographic hashes, digital signatures and error-correction primitives to provide varying levels of data protection. Five key evaluation points for such systems are: (1) authenticity guarantees, (2) integrity guarantees, (3) confidentiality guarantees, (4) performance overhead attributed to security, and (5) scalability concerns. Immutable storage techniques can enhance tamper-resistant techniques. Digital watermarking is not appropriate for tamper-resistance implemented in the storage system rather than at the application level.
The techniques and challenges of immutable storage with applications in multimedia
Security of storage and archival systems has become a basic necessity in recent years. Due to the increased vulnerability of existing systems and the need to comply with government regulations, different methods have been explored to attain a secure storage system. One of the primary problems in ensuring the integrity of storage systems is making sure a file cannot be changed without proper authorization. Immutable storage is storage whose content cannot be changed once it has been written. For example, critical system files and other important documents should never be changed and should thus be stored as immutable. In multimedia systems, immutability provides proper archival of indices as well as content. In this paper we present a survey of existing techniques for immutability in file systems.
Human identification using correlation metrics of iris images
Mehmet Celenk, Michael Brown, Yi Luo, et al.
Biometric identification relies on information that is difficult to misplace or duplicate, making it a very useful tool when properly implemented. One biometric feature of considerable interest is the iris. Since most people rely heavily on their vision, they are protective of their eyes. This means there is less likelihood of change due to environmental factors. In addition, since the iris is created in a random morphogenetic process, there is a large amount of complexity suitable for use as a discriminator. There are currently several powerful methods available for using the human iris as a biometric for identification. One drawback inherent in the existing methods, however, is their computational complexity. Adopting stochastic models can provide an approach to reducing the extensive computing burden. To this end, we have presented two methods that rely on a wide-sense stationary approximation to the texture and gray scale information in the iris; one uses auto- and cross-correlations while the other employs second order statistics of co-occurrence matrices. Our experiments indicate that cross- and auto-correlations and co-occurrence matrix features are likely to be prominent iris discriminators for correct identification. Future tests will be conducted on larger sample sets to further verify the findings presented here. Two main methods for feature generation will also be compared and combined to produce an optimal classification strategy for an embedded hardware realization of the method. The addition of more features for discrimination is a likely necessity for classifying larger numbers of irises.
Media Mining
Content-based image retrieval using a mobile device as a novel interface
Jonathon S. Hare, Paul H. Lewis
Given the large amount of research into content-based image retrieval currently taking place, new interfaces to systems that perform queries based on image content need to be considered. A new paradigm for content-based image retrieval is introduced, in which a mobile device is used to capture the query image and display the results. The system consists of a client-server architecture in which query images are captured on a mobile device and then transferred to a server for further processing. The server then returns the results of the query to the mobile device. The use of a mobile device as an interface to a content-based image retrieval or object recognition system presents a number of challenges because the query image from the device will have been degraded by noise and subjected to transformations through the imaging system. A methodology is presented that uses techniques inspired by the information retrieval community to aid efficient indexing and retrieval. In particular, a vector-space model is used in the efficient indexing of each image, and a two-stage pruning/ranking procedure is used to determine the correct matching image. The retrieval algorithm is shown to outperform existing algorithms when used with query images from the device.
Instantaneous reliability assessment of motion features in surveillance videos
Although a tremendous effort has been made over the past fifty years to perform reliable analysis of images and videos, the reality is that one cannot rely 100% on the analysis results. The only exception is applications in controlled environments, as dealt with in machine vision, where closed world assumptions apply. In general, however, one has to deal with an open world, which means that the content of images may change significantly, and it seems impossible to predict all possible changes. For example, in the context of surveillance videos, the light conditions may suddenly fluctuate in only parts of the images, video compression or transmission artifacts may occur, wind may cause a stationary camera to tremble, and so on. The problem is that video analysis has to be performed in order to detect content changes, but such analysis may be unreliable due to the changes, and may thus fail to detect the changes and lead to a "vicious cycle". The solution pursued in this paper is to monitor the reliability of the computed features by analyzing their general properties. We consider statistical properties of feature value distributions as well as temporal properties. Our main strategy is to estimate the feature properties when the features are reliably computed, so that any set of features that does not have these properties is detected as being unreliable. In this way we do not perform any direct content analysis, but instead analyze feature properties related to their reliability.
3-D shape descriptors and distance metrics for content-based artifact retrieval
Simon Goodall, Paul H. Lewis, Kirk Martinez
The growing number of large multimedia collections has led to an increased interest in content-based retrieval research. The application of content-based techniques to image retrieval is an active research area, but much less work has been reported on content-based retrieval of 3-D objects in a multimedia database context. Increasingly, such objects are being captured and added to multimedia collections, and the European project SCULPTEUR is developing a museum information system that includes facilities for content-based retrieval of the 3-D representations. This paper provides a comparison and evaluation of a range of 3-D shape descriptors and distance metrics that have been introduced into the SCULPTEUR project to demonstrate their use for content-based retrieval applications. Results show that while particular descriptors and distance metrics provide good overall performance, it can be more appropriate to choose different descriptors for different search tasks.
Applying vertebral boundary semantics to CBIR of digitized spine x-ray images
In developing reliable content-based image retrieval (CBIR) techniques specialized for biomedical image retrieval, applicable feature representation and similarity algorithms have to balance conflicting goals of efficient and effective retrieval. These methods must index important and often subtle biomedical features and also incorporate their significance. From a collection of digitized X-rays of the spine, such as that from the second National Health and Nutrition Examination Survey (NHANES II) maintained by the U.S. National Library of Medicine, a typical user may be interested in cases where the pathology is exhibited by only a small pertinent region of the vertebral boundary: for this experiment, the Anterior Osteophyte (AO). In a previous experiment in such pathology-based retrieval using partial shape matching (PSM) on a subset of the collection, 89% of normal vertebrae and 45% of moderate and severe cases were correctly retrieved. Analysis of the results also showed high inter-pathology-class confusion. The experiment showed that shape matching without incorporating application semantics is insufficient for correct retrieval of pathological cases. This paper describes an automatic localization algorithm that incorporates reasoning about vertebral boundary semantics equivalent to that applied by the content expert, as a step in our enhancements to PSM, and presents results from initial experiments.
Asynchronous multimedia annotations for web-based collaboration in biology education
Dragutin Petkovic, E. Lank, F. A. Ramirez, et al.
The focus of this paper is on the design, implementation, and validation of asynchronous multimedia annotations designed for Web-based collaboration in educational and research settings. The two key questions we explore in this paper are: How useful are such annotations, and what purpose do they serve? How easy to use is our specific implementation of annotations? The context of our project has been multimedia information usage and collaboration in the biological sciences. We have developed asynchronous annotations for HTML and image data. Our annotations can be executed via any browser and require no downloads. They are stored in a central database allowing search and asynchronous access by all registered users. An easy-to-use interface allows users to add, view, and search annotations. We also performed a usability study to validate our implementation of text annotations.
Shape-based posture and gesture recognition in videos
The recognition of human postures and gestures is considered to be highly relevant semantic information in videos and surveillance systems. We present a new three-step approach to classifying the posture or gesture of a person based on segmentation, classification, and aggregation. A background image is constructed from succeeding frames using motion compensation and shapes of people are segmented by comparing the background image with each frame. We use a modified curvature scale space (CSS) approach to classify a shape. But a major drawback to this approach is its poor representation of convex segments in shapes: Convex objects cannot be represented at all since there are no inflection points. We have extended the CSS approach to generate feature points for both the concave and convex segments of a shape. The key idea is to reflect each contour pixel and map the original shape to a second one whose curvature is the reverse: Strong convex segments in the original shape are mapped to concave segments in the second one and vice versa. For each shape a CSS image is generated whose feature points characterize the shape of a person very well. The last step aggregates the matching results. A transition matrix is defined that classifies possible transitions between adjacent frames, e.g. a person who is sitting on a chair in one frame cannot be walking in the next. A valid transition requires at least several frames where the posture is classified as "standing-up". We present promising results and compare the classification rates of postures and gestures for the standard CSS and our new approach.
Imperfect learning for autonomous concept modeling
Ching-Yung Lin, Xiaodan Song, Gang Wu
Most existing supervised machine learning frameworks assume there are no mistakes or false interpretations in the training samples. However, this assumption may not hold in practical applications. In some cases, if human beings are involved in providing training samples, there may be errors in the training set. In this paper, we study the effect of imperfect training samples on the supervised machine learning framework. We focus on the mathematical framework that describes the learnability of noisy training data. We study theorems to estimate the error bounds of generated models and the required number of training samples. These errors depend on the amount of training data and the probability that the training data are accurate. Based on the effectiveness of learnability on imperfect annotation, we describe an autonomous learning framework that uses cross-modality information to learn concept models. For instance, visual concept models can be trained based on the detection results of Automatic Speech Recognition, Closed Captions, or prior detection results of the same modality. Those detection results on an unsupervised training set serve as imperfect labeling for the models to be built. A prototype system based on this learning technique has been built. Promising results have been shown in experiments.
Special Session: Video Surveillance
Collaborative visual tracking of multiple identical targets
Multiple target tracking in video is an important problem in many emerging applications. It is also a challenging problem, in which the coalescence phenomenon often occurs, meaning that the tracker associates more than one trajectory with some targets while losing track of others. This coalescence may cause the tracker to fail, especially when similar targets move close together or present partial or complete occlusions. Existing approaches are mainly based on a joint state space representation of the multiple targets being tracked and are therefore confronted by combinatorial complexity due to the intrinsic high dimensionality. In this paper, we propose a novel distributed framework with linear complexity for this problem. The basic idea is a collaborative inference mechanism, in which the estimate of each individual target state is determined not only by its own observation and dynamics but also through interaction and collaboration with the state estimates of other targets, which leads to a competition mechanism that enables different but spatially adjacent targets to compete for the common image observations. The theoretical foundation of the new approach is a well-designed Markov network whose structure can change with time. To perform inference on such a Markov network, a probabilistic variational analysis is conducted, which reveals a mean field approximation to the posterior density of each target and therefore provides a computationally efficient solution to this difficult inference problem. Compared with existing solutions, the proposed approach stands out for its linear computational cost and its excellent performance in dealing with the coalescence problem, as demonstrated in extensive experiments.
A hierarchical framework for understanding human-human interactions in video surveillance
Understanding human behavior in video is essential in numerous applications including smart surveillance, video annotation/retrieval, and human-computer interaction. However, recognizing human interactions is a challenging task due to ambiguity in body articulation, variations in body size and appearance, loose clothing, mutual occlusion, and shadows. In this paper we present a framework for recognizing human actions and interactions in color video, and a hierarchical graphical model that unifies multiple-level processing in video computing: pixel level, blob level, object level, and event level. A mixture of Gaussians (MOG) model is used at the pixel level to train and classify individual pixel colors. Relaxation labeling with an attribute relational graph (ARG) is used at the blob level to merge the pixels into coherent blobs and to register inter-blob relations. At the object level, the poses of individual body parts are recognized using Bayesian networks (BNs). At the event level, the actions of a single person are modeled using a dynamic Bayesian network (DBN). The results of the object-level descriptions for each person are juxtaposed along a common timeline to identify an interaction between two persons. The linguistic 'verb argument structure' is used to represent human action in terms of triplets. A meaningful semantic description in terms of is obtained. Our system achieves semantic descriptions of positive, neutral, and negative interactions between two persons, including hand-shaking, standing hand-in-hand, and hugging as the positive interactions; approaching, departing, and pointing as the neutral interactions; and pushing, punching, and kicking as the negative interactions.
Multi-modal analysis for person type classification in news video
Classifying the identities of people appearing in broadcast news video into anchor, reporter, or news subject is an important topic in high-level video analysis, and it remains a missing piece in existing research. Given the visual resemblance of different types of people, this work explores multi-modal features derived from a variety of evidence, including speech identity, transcript clues, temporal video structure, named entities, and face information. A Support Vector Machine (SVM) model is trained on manually classified people to combine the multitude of features and predict the types of people who are giving monologue-style speeches in news videos. Experiments conducted on ABC World News Tonight video demonstrate that this approach can achieve over 93% accuracy in classifying person types. The contributions of different categories of features have been compared, showing that relatively understudied features such as speech identities and video temporal structure are very effective in this task.
Real-time multiple-object tracking and anomaly detection
Mei Han, Yihong Gong
In this paper we describe a real-time video surveillance system that is capable of tracking multiple objects simultaneously and detecting violations. The number of objects is unknown and varies during tracking. Based on preliminary object detection results in each image, which may include missing and/or false detections, the multiple-object tracking algorithm keeps a graph structure in which it maintains multiple hypotheses about the number and the trajectories of the objects in the video. The image information drives the process of extending and pruning the graph and determines the best hypothesis to explain the video. The multiple-object tracking algorithm feeds predictions of object locations back to the object detection module; the algorithm therefore integrates object detection and tracking tightly. The most likely hypothesis provides the multiple-object tracking result, which is used to accomplish anomaly detection. The trajectories generated by the tracking algorithm provide information on object identities, motion histories, timing at sensitive areas, and object interactions. The system has been running at a few access control areas for more than eighteen months. Experimental results on human tracking are presented and applications to anomaly detection are described.
Image Indexing
Efficiently querying spatial histograms
Yujun Wang, Simone Santini, Amarnath Gupta
In this paper, we examine the problem of efficiently computing a class of aggregate functions on regions of space. We first formalize region-based aggregations for a large class of efficient geometric aggregations. The idea is to represent the query object with pre-defined objects combined by set operations, and to compute the aggregation using the pre-computed aggregation values. We then show that it applies to existing results about points and rectangular objects. Since it is defined using set theory instead of object shapes, it can be applied to polygons. Given a database D of polygonal regions, a tessellation T of the plane, and a query polygon q constructed from T, we prove that the aggregation of q can be calculated by the aggregation over triangles and lines constructed from segments and vertices in q, which can be pre-computed. The query time complexity is O(k log n), where k is the size of the query polygon and n is the size of T.
Image retrieval using combination of color and multiresolution texture features
Young Deok Chun, Joong Ki Sung, Nam Chul Kim
We propose a content-based image retrieval (CBIR) method based on an efficient combination of a color feature and multiresolution texture features. As the color feature, an HSV autocorrelogram is chosen, which is known to measure the spatial correlation of colors well. As texture features, BDIP and BVLC moments are chosen, which are known to measure local intensity variations and local texture smoothness well, respectively. The texture features are obtained in a wavelet pyramid of the luminance component of a color image. The extracted features are combined for efficient similarity computation by normalization depending on their dimensions and standard deviation vectors. Experimental results show that the proposed method yields, on average, 10% better performance in precision vs. recall and 0.12 better average normalized modified retrieval rank (ANMRR) than the methods using color autocorrelogram, BDIP and BVLC moments, and wavelet moments, respectively.
An image-clustering method based on cross-correlation of color histograms
Yifeng Wu, Kevin Hudson
Color histogram analysis is a powerful tool for characterizing color images. It has been widely used in image indexing and retrieval systems. A key problem in using color histograms for image classification is finding a robust similarity measure between different color histograms. In this paper, we propose using a cross-correlation function to measure color histogram similarity. We show that a cross-correlation function has several advantages over histogram intersection, which has been widely used to calculate the similarity between color histograms: a cross-correlation function is normalized automatically; it can determine the similarity irrespective of image size; it is invariant to small color shifts; and it is easier to implement using computationally efficient methods. We present an example of unsupervised image clustering by applying the cross-correlation function to color histograms. This method was used to improve the perceived color consistency in a multi-print-engine system. We also show how to optimize the cross-correlation function to compensate for the color shift.
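The contrast drawn in this abstract between cross-correlation and histogram intersection can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the 1-D histograms, and the `max_shift` parameter are assumptions made for illustration, and the intersection is shown in its raw, unnormalized form.

```python
import math


def normalized_cross_correlation(h1, h2, max_shift=0):
    """Peak normalized cross-correlation of two equal-length 1-D histograms.

    Shifts in [-max_shift, max_shift] bins are tried and the best score is
    returned, so a small color shift between the histograms is tolerated.
    (Hypothetical helper; the paper's exact formulation is not given here.)
    """
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    norm = math.sqrt(sum(a * a for a in h1) * sum(b * b for b in h2))
    if norm == 0:
        return 0.0
    n = len(h1)
    best = 0.0
    for shift in range(-max_shift, max_shift + 1):
        total = 0.0
        for i in range(n):
            j = i + shift
            if 0 <= j < n:
                total += h1[i] * h2[j]
        best = max(best, total / norm)
    return best


def histogram_intersection(h1, h2):
    """Raw (unnormalized) histogram intersection: sum of bin-wise minima."""
    return sum(min(a, b) for a, b in zip(h1, h2))


if __name__ == "__main__":
    base = [0, 4, 8, 4, 0, 0]
    scaled = [0, 8, 16, 8, 0, 0]   # same shape, twice as many pixels
    shifted = [0, 0, 4, 8, 4, 0]   # same shape, moved by one bin
    print(normalized_cross_correlation(base, scaled))
    print(normalized_cross_correlation(base, shifted, max_shift=1))
    print(histogram_intersection(base, shifted))
```

In this toy setup the correlation score is unchanged when one histogram is scaled, because dividing by the product of the vector norms removes the dependence on pixel count, whereas the raw intersection drops as soon as the bins are displaced.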
Image Retrieval
Automated situation clustering of home photos for digital albuming
Seungji Yang, Sang Kyun Kim, Yong Man Ro
In this paper, we propose an automatic situation clustering method for digital photo albums. A group of photos taken in the same situation can have similar visual semantics. Visual semantic hints of photos are proposed and used to cluster situations. Experiments were performed with 2345 photos, and the results showed that the proposed clustering with visual semantic hints was useful for automated situation clustering based on human perception.
A relevance feedback image retrieval scheme using multi-instance and pseudo-image concepts
Content-based image search has long been considered a difficult task. Making correct conjectures about the user's intention (perception) based on the query images is a critical step in content-based search. The first key concept in this paper is how to find the user's preferred image characteristics from the multiple positive samples provided by the user. The second key concept is how to generate a set of consistent "pseudo images" when the user does not provide a sufficient number of samples. The notion of image feature stability is thus introduced. The third key concept is how to use negative images as a pruning criterion. In realizing the preceding concepts, an image search scheme is developed using weighted low-level image features. Finally, quantitative simulation results are used to show the effectiveness of these concepts.
A lightweight image retrieval system for paintings
Thomas Lombardi, Sung-Hyuk Cha, Charles Tappert
For describing and analyzing digital images of paintings we propose a model to serve as the basis for an interactive image retrieval system. The model defines two types of features: palette and canvas features. Palette features are those related to the set of colors in a painting while canvas features relate to the frequency and spatial distribution of those colors. The image retrieval system differs from previous retrieval systems for paintings in that it does not rely on image or color segmentation. The features specified in the model can be extracted from any image and stored in a database with other control information. Users select a sample image and the system returns the ten closest images as determined by calculating the Euclidean distance between feature sets. The system was tested with an initial dataset of 100 images (training set) and 90 sample images (testing set). In 81 percent of test cases, the system retrieved at least one painting by the same artist suggesting that the model is sufficient for the interactive classification of paintings by artist. Future studies aim to expand and refine the model for the classification of artwork according to artist and period style.
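The ranking step described in this abstract, returning the closest images by Euclidean distance between feature sets, can be sketched as follows. This is a minimal illustration only: the feature vectors, the dictionary layout, and the names `euclidean_distance` and `closest_images` are hypothetical, and the palette and canvas features themselves are not reproduced.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_images(query_features, database, k=10):
    """Return the names of the k database images nearest to the query.

    `database` maps an image name to its stored feature vector
    (an assumed layout, not the system's actual schema).
    """
    ranked = sorted(
        database.items(),
        key=lambda item: euclidean_distance(query_features, item[1]),
    )
    return [name for name, _ in ranked[:k]]


if __name__ == "__main__":
    # Toy two-dimensional feature vectors, invented for the example.
    paintings = {
        "painting_a": [0.0, 0.0],
        "painting_b": [1.0, 1.0],
        "painting_c": [5.0, 5.0],
    }
    print(closest_images([0.9, 0.9], paintings, k=2))
```

A full sort is adequate for a collection of a few hundred images such as the one described; a larger database would likely call for a partial sort or a spatial index.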
New method for similarity retrieval of iconic image database
Shu-Ming Hsieh, Chiun-Chieh Hsu
Retrieving images from Image Databases (IDBs) by the objects contained in the images and the interrelationships among these objects is an important capability. Spatial relationships are among the most perceptible discriminators (or similarities) between images. The 2D string approach provides a compact and efficient method to preserve spatial knowledge and to perform matching. However, the 2D string matching method is only suitable for sub-image queries. A similarity retrieval method based on the 2D string longest common subsequence (2D LCS) was proposed by Lee et al., but their algorithm for calculating the length of the 2D LCS (or the similarity degree of two images) transforms the task into an NP-hard problem. In this paper, we propose a new method of similarity retrieval based on 2D LCS. The proposed algorithm runs in polynomial time. Furthermore, the proposed model can be extended to discriminate images by multiple attributes of the contained objects as well as their spatial constraints. Thus an efficient and effective similarity retrieval model is achieved.
Audio Processing
Music genre classification via likelihood fusion from multiple feature models
Music genre provides an efficient way to index songs in a music database and can be used as an effective means to retrieve music of a similar type, i.e. content-based music retrieval. A new two-stage scheme for music genre classification is proposed in this work. At the first stage, we examine several different features, construct their corresponding parametric models (e.g. GMM and HMM), and compute their likelihood functions to yield soft classification results. In particular, the timbre, rhythm, and temporal variation features are considered. Then, at the second stage, these soft classification results are integrated to produce a hard decision for the final music genre classification. Experimental results are given to demonstrate the performance of the proposed scheme.
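The second stage described in this abstract, turning per-feature soft scores into one hard decision, can be sketched in Python. The abstract does not state the fusion rule, so the weighted sum of log-likelihoods used here is an assumption, as are the function name `fuse_and_classify`, the data layout, and the invented scores.

```python
def fuse_and_classify(log_likelihoods, weights=None):
    """Fuse per-feature log-likelihoods and pick the best genre.

    `log_likelihoods` maps a feature name (e.g. "timbre") to a dict of
    {genre: log-likelihood} produced by that feature's model.
    A weighted sum of log-likelihoods is an assumed fusion rule,
    not necessarily the one used in the paper.
    """
    if weights is None:
        weights = {feature: 1.0 for feature in log_likelihoods}
    fused = {}
    for feature, scores in log_likelihoods.items():
        for genre, value in scores.items():
            fused[genre] = fused.get(genre, 0.0) + weights[feature] * value
    best_genre = max(fused, key=fused.get)  # hard decision
    return best_genre, fused


if __name__ == "__main__":
    # Invented soft scores from three hypothetical feature models.
    soft_scores = {
        "timbre": {"rock": -10.0, "jazz": -12.0},
        "rhythm": {"rock": -9.0, "jazz": -5.0},
        "temporal": {"rock": -8.0, "jazz": -7.0},
    }
    genre, fused = fuse_and_classify(soft_scores)
    print(genre, fused)
```

In this toy example the timbre model alone prefers one genre while the fused score prefers the other, which is the situation in which combining several feature models can change the final decision.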
Modeling sports highlights using a time-series clustering framework and model interpretation
Regunathan Radhakrishnan, Isao Otsuka, Ziyou Xiong, et al.
In our past work on sports highlights extraction, we have shown the utility of detecting audience reaction using an audio classification framework. The audio classes in the framework were chosen based on intuition. In this paper, we present a systematic way of identifying the key audio classes for sports highlights extraction using a time series clustering framework. We treat the low-level audio features as a time series and model the highlight segments as "unusual" events in a background of a "usual" process. The set of audio classes to characterize the sports domain is then identified by analyzing the consistent patterns in each of the clusters output from the time series clustering framework. The distribution of features from the training data so obtained for each of the key audio classes is parameterized by a Minimum Description Length Gaussian Mixture Model (MDL-GMM). We also interpret the meaning of each of the mixture components of the MDL-GMM for the key audio class (the "highlight" class) that is correlated with highlight moments. Our results show that the "highlight" class is a mixture of audience cheering and commentator's excited speech. Furthermore, we show that the precision-recall performance for highlights extraction based on this "highlight" class is better than that of our previous approach, which uses only audience cheering as the key highlight class.
Towards automatic music transcription: note extraction based on independent subspace analysis
In this paper we present a technique for the separation of harmonic sounds within real sound mixtures for automatic music transcription using Independent Subspace Analysis (ISA). The algorithm is based on the assumption that tones played by an instrument within polyphonic music consist of components that are statistically independent from components of other tones. The first step of the algorithm is a temporal segmentation into note events. Features in both the time domain and the frequency domain are used to detect segment boundaries, which are represented by starting or decaying tones. Each segment is then examined using ISA, and a set of statistically independent components is calculated. One tone played by an instrument consists of the fundamental frequency and its harmonics. Usually, ISA yields more independent components than played notes, because not all harmonics are separated into the component containing their fundamental frequency. Some harmonics are separated into components of their own. Using the Kullback-Leibler divergence, components belonging together are grouped. A note classification, which is currently trained for piano music, is the last step of the algorithm. Results show that statistical independence is a promising measure for separating sounds into single notes using ISA as a step towards automatic music transcription.
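The abstract does not say how the Kullback-Leibler divergence is applied to the components. As a rough sketch of the measure itself, assuming each component is summarised by a discrete probability distribution with strictly positive entries, the divergence and a symmetrised variant could be computed as follows. The function names and the symmetrisation are assumptions made for illustration.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for discrete distributions.

    Assumes p and q have equal length, each sums to 1, and q[i] > 0
    wherever p[i] > 0. Terms with p[i] == 0 contribute nothing.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Symmetrised divergence, usable as a dissimilarity for grouping."""
    return kl_divergence(p, q) + kl_divergence(q, p)


# Hypothetical component distributions; a small value suggests grouping.
dissimilarity = symmetric_kl([0.5, 0.5], [0.25, 0.75])
```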
MPEG-7-based description infrastructure for an audiovisual content analysis and retrieval system
Werner Bailer, Peter Schallauer, Michael Hausenblas, et al.
We present a case study of establishing a description infrastructure for an audiovisual content-analysis and retrieval system. The description infrastructure consists of an internal metadata model and an access tool for using it. Based on an analysis of requirements, we have selected, out of a set of candidates, MPEG-7 as the basis of our metadata model. The openness and generality of MPEG-7 allow using it in a broad range of applications, but increase complexity and hinder interoperability. Profiling has been proposed as a solution, with the focus on selecting and constraining description tools. Semantic constraints are currently only described in textual form. Conformance in terms of semantics can thus not be evaluated automatically, and mappings between different profiles can only be defined manually. As a solution, we propose an approach to formalize the semantic constraints of an MPEG-7 profile using a formal vocabulary expressed in OWL, which allows automated processing of semantic constraints. We have defined the Detailed Audiovisual Profile as the profile to be used in our metadata model, and we show how some of the semantic constraints of this profile can be formulated using ontologies. To work practically with the metadata model, we have implemented an MPEG-7 library and a client/server document access infrastructure.
Video Processing
Home-video content analysis for MTV-style video generation
Intelligent video pre-processing and authoring techniques that help people create MTV-style music video clips are investigated in this research. First, we present an automatic approach to detect and remove bad shots often occurring in home video, such as video with poor lighting or motion blur. Then, we consider the generation of MTV-style video clips by performing video and music tempo analysis and seeking an effective way of matching these two tempos. Experimental results are given to demonstrate the feasibility and efficiency of the proposed techniques for home video editing.
Multimodal approach for speaker identification in news programs
The process of identifying speakers in a news program is difficult using only text information. We propose a system that will first perform text and video processing separately to identify the start of speech of a speaker. These start-of-speech locations are aligned and used to identify a change of speaker in the program. An analysis is performed to identify the contribution of the text and video information. It will be shown that the change-of-speaker locations identified by our alignment algorithm are more accurate than those from either mode individually.
Detection of goal events in soccer videos
Hyoung-Gook Kim, Steffen Roeber, Amjad Samour, et al.
In this paper, we present an automatic extraction of goal events in soccer videos by using audio track features alone without relying on expensive-to-compute video track features. The extracted goal events can be used for high-level indexing and selective browsing of soccer videos. The detection of soccer video highlights using audio contents comprises three steps: 1) extraction of audio features from a video sequence, 2) event candidate detection of highlight events based on the information provided by the feature extraction methods and the Hidden Markov Model (HMM), 3) goal event selection to finally determine the video intervals to be included in the summary. For this purpose we compared the performance of the well-known Mel-scale Frequency Cepstral Coefficients (MFCC) feature extraction method vs. the MPEG-7 Audio Spectrum Projection (ASP) feature extraction method based on three different decomposition methods, namely Principal Component Analysis (PCA), Independent Component Analysis (ICA) and Non-Negative Matrix Factorization (NMF). To evaluate our system we collected five soccer game videos from various sources. In total we have seven hours of soccer games consisting of eight gigabytes of data. One of the five soccer games is used as the training data (e.g., announcers' excited speech, audience ambient speech noise, audience clapping, environmental sounds). Our goal event detection results are encouraging.
Panoramic video in video-mediated education
This paper discusses the use of panoramic video and its benefits in video-mediated education. A panoramic view is generated by covering the blackboard with two or more cameras and then stitching the captured videos together. This paper describes the properties and advantages of multi-camera, panoramic video compared to single-camera approaches. One important difference between panoramic video and regular video is that the former has a wider field of view (FOV). As a result, the blackboard covers a larger part of the video screen and the information density is increased. Most importantly, the size of the letters written on the blackboard is enlarged, which improves the student’s ability to clearly read what is written on the blackboard. The panoramic view also allows students to focus their attention on different parts of the blackboard in the same way they would be able to in the classroom. This paper also discusses the results from a study among students where a panoramic view was tested against single-camera views. The study indicates that the students preferred the panoramic view. The study also revealed potential improvements that could make panoramic video even more beneficial.
Wipe shot boundary determination
Huge amounts of video data are produced around the world each day, and managing them is increasingly difficult. Tasks like archiving, browsing, analysis, and search and retrieval are aided by prior automatic temporal video segmentation into shots, which are the basic units of video. Among the different shot transition types (cut, dissolve, fade and wipe), wipes are regarded as difficult to detect because of their variety. This paper presents a new efficient and fast algorithm for wipe detection and position determination. It can be used on the luminance DC coefficients extracted from MPEG sequences in the compressed domain or on spatially sub-sampled sequences. The newly proposed evenness factor exploits the observation that, during a wipe, spatial zones of change move through the image. Wipe candidates are checked with frame differences of frame pairs with certain temporal distances. For the remaining candidates, a new approach is proposed that detects uniform movement of linear zones of change using a double Hough transform. Motion compensation is used to handle local and global motion. The algorithm has a low computational complexity due to its small input data rate and step-wise reduction of wipe candidates.
An efficient approach for video information retrieval
Daoguo Dong, Xiangyang Xue
Today, more and more video information can be accessed through the internet, satellite, etc. Retrieving specific video information from large-scale video databases has become an important and challenging research topic in the area of multimedia information retrieval. Generally, video retrieval can be categorized into retrieval of video shots and retrieval of video clips. Up to now, few approaches can support both shot retrieval and clip retrieval efficiently. In this paper, we introduce a new high-dimensional index structure, OVA-File, which is a variant of VA-File. In OVA-File, the approximations close to each other in data space are stored in close positions in the approximation file. The benefit is that only the part of the approximations near the query vector needs to be visited to get the approximate query result. Then, both a shot query algorithm and a video clip query algorithm are proposed to support video information retrieval efficiently. The experimental results showed that the queries based on OVA-File were much faster than those based on VA-File, with a small loss of result quality.
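The abstract does not describe how OVA-File orders its approximations, so that part is not reproduced here. As a hedged sketch of the VA-File idea it builds on, the code below quantises each coordinate of a feature vector into a small number of cells and derives a lower bound on the Euclidean distance from a query to any vector sharing that approximation. The function names, the assumption that coordinates lie in [0, 1], and the choice of 2 bits per dimension are all illustrative.

```python
def approximate(vector, bits=2):
    """Quantise each coordinate (assumed in [0, 1]) into 2**bits cells."""
    cells = 1 << bits
    return tuple(min(int(v * cells), cells - 1) for v in vector)


def lower_bound(query, approx, bits=2):
    """Lower bound on the Euclidean distance from query to any vector
    whose approximation is approx. Vectors whose bound already exceeds
    the current best distance can be skipped without reading them."""
    cells = 1 << bits
    total = 0.0
    for q, c in zip(query, approx):
        lo, hi = c / cells, (c + 1) / cells  # cell boundaries on this axis
        if q < lo:
            total += (lo - q) ** 2
        elif q > hi:
            total += (q - hi) ** 2
        # a query coordinate inside the cell contributes nothing
    return total ** 0.5


# Hypothetical 3-dimensional feature vector.
cell_ids = approximate((0.1, 0.6, 0.99))
```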