Proceedings Volume 6492

Human Vision and Electronic Imaging XII

View the digital version of this volume at the SPIE Digital Library.

Volume Details

Date Published: 7 February 2007
Contents: 12 Sessions, 62 Papers, 0 Presentations
Conference: Electronic Imaging 2007
Volume Number: 6492

Table of Contents

All links to SPIE Proceedings will open in the SPIE Digital Library.
  • Front Matter: Volume 6492
  • Keynote Session
  • Perception of Natural Images
  • Perceptual Image Quality and Compression
  • Perceptual Video Quality
  • Visual Optics
  • Characterizing Color in Imaging Systems
  • Cognitive Graphics
  • Perceptual Issues in High-Dynamic Range Imaging
  • Eye Movements and Visual Attention
  • Higher Level Vision and Cognition
  • Poster Session
  • Keynote Session
Front Matter: Volume 6492
Front Matter: Volume 6492
This PDF file contains the front matter associated with SPIE Proceedings Volume 6492, including the Title Page, Copyright information, Table of Contents, Introduction (if any), and the Conference Committee listing.
Keynote Session
New vistas in image and video quality
In this Keynote Address paper, we review early work on Image and Video Quality Assessment against the backdrop of an interpretation of image perception as a visual communication problem. As a way of explaining our recent work on Video Quality Assessment, we first describe our recent successful advances on QA algorithms for still images, specifically, the Structural SIMilarity (SSIM) Index and the Visual Information Fidelity (VIF) Index. We then describe our efforts towards extending these Image Quality Assessment frameworks to the much more complex problem of Video Quality Assessment. We also discuss our current efforts towards the design and construction of a generic and publicly-available Video Quality Assessment database.
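As background to the quality indices discussed above, here is a minimal NumPy sketch of a mean-SSIM computation. It uses a uniform local window rather than the Gaussian window of the published index, and the constants k1 and k2 are conventional defaults; treat it as an illustration, not the authors' implementation.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def mean_ssim(x, y, data_range=255.0, k1=0.01, k2=0.03, win=8):
    """Mean structural similarity between two grayscale images (simplified)."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    mu_x, mu_y = uniform_filter(x, win), uniform_filter(y, win)
    var_x = uniform_filter(x * x, win) - mu_x ** 2
    var_y = uniform_filter(y * y, win) - mu_y ** 2
    cov_xy = uniform_filter(x * y, win) - mu_x * mu_y
    ssim_map = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
               ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return ssim_map.mean()
```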
Painterly rendered portraits from photographs using a knowledge-based approach
Portrait artists using oils, acrylics or pastels use a specific but open human vision methodology to create a painterly portrait of a live sitter. When they must use a photograph as source, artists augment their process, since photographs have different focusing (everything is in focus or focused in vertical planes), value clumping (the camera darkens the shadows and lightens the bright areas), as well as color and perspective distortion. In general, artistic methodology attempts the following: from the photograph, the painting must 'simplify, compose and leave out what's irrelevant, emphasizing what's important'. While seemingly a qualitative goal, artists use known techniques, such as relying on source tone over color to indirect into a semantic color temperature model, using brush and tonal "sharpness" to create a center of interest, and using lost and found edges to move the viewer's gaze through the image towards the center of interest, as well as other techniques to filter and emphasize. Our work attempts to create a knowledge domain of the portrait painter's process and incorporate this knowledge into a multi-space parameterized system that can create an array of NPR painterly rendering output by analyzing the photograph-based input, which informs the semantic knowledge rules.
Nonlinear encoding in multilayer LNL systems optimized for the representation of natural images
Christoph Zetzsche, Ulrich Nuding
We consider the coding properties of multilayer LNL (linear-nonlinear-linear) systems. Such systems consist of interleaved layers of linear transforms (or filter banks), nonlinear mappings, linear transforms, and so forth. They can be used as models of visual processing in higher cortical areas (V2, V4), and are also interesting with respect to image processing and coding. The linear filter operations in the different layers are optimized for the exploitation of the statistical redundancies of natural images. We explain why even simple nonlinear operations, like ON/OFF rectification, can convert higher-order statistical dependencies remaining between the linear filter coefficients of the first layer to a lower order. The resulting nonlinear coefficients can then be linearly recombined by the second-level filtering stage, using the same principles as in the first stage. The complete nonlinear scheme is invertible, i.e., information is preserved, if nonlinearities like ON/OFF rectification or gain control are employed. In order to obtain insights into the coding efficiency of these systems we investigate the feature selectivity of the resulting nonlinear output units and the use of LNL systems in image compression.
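The layered structure described here is straightforward to prototype. The sketch below, with hypothetical filter matrices W1 and W2, shows one LNL stage with ON/OFF rectification; since the rectified pair preserves the signed coefficient (c = on - off), the nonlinearity is invertible, as the abstract notes.

```python
import numpy as np

def on_off_rectify(c):
    """Split signed coefficients into ON/OFF channels; invertible via c = on - off."""
    return np.concatenate([np.maximum(c, 0.0), np.maximum(-c, 0.0)])

def lnl_stage(patch, W1, W2):
    """One linear-nonlinear-linear stage: filter bank, rectification, recombination.
    W1: (k, d) first-layer filters; W2: (m, 2k) second-layer filters (hypothetical)."""
    c1 = W1 @ patch            # linear filter responses
    r = on_off_rectify(c1)     # simple invertible nonlinearity
    return W2 @ r              # second linear recombination
```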
Perception of Natural Images
Optimal sensor design for estimating local velocity in natural environments
Motion coding in the brain undoubtedly reflects the statistics of retinal image motion occurring in the natural environment. To characterize these statistics it is useful to measure motion in artificial movies derived from simulated environments where the "ground truth" is known precisely. Here we consider the problem of coding retinal image motion when an observer moves through an environment. Simulated environments were created by combining the range statistics of natural scenes with the spatial statistics of natural images. Artificial movies were then created by moving along a known trajectory at a constant speed through the simulated environments. We find that across a range of environments the optimal integration area of local motion sensors increases logarithmically with the speed to which the sensor is tuned. This result makes predictions for cortical neurons involved in heading perception and may find use in robotics applications.
Bilinear models of natural images
Bruno A. Olshausen, Charles Cadieu, Jack Culpepper, et al.
Previous work on unsupervised learning has shown that it is possible to learn Gabor-like feature representations, similar to those employed in the primary visual cortex, from the statistics of natural images. However, such representations are still not readily suited for object recognition or other high-level visual tasks because they can change drastically as the image changes due to object motion, variations in viewpoint, lighting, and other factors. In this paper, we describe how bilinear image models can be used to learn independent representations of the invariances, and their transformations, in natural image sequences. These models provide the foundation for learning higher-order feature representations that could serve as models of higher stages of processing in the cortex, in addition to having practical merit for computer vision tasks.
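The generic form of a bilinear image model, with content and transformation coefficients interacting multiplicatively, can be written in one line; this is the textbook form, not the authors' trained model.

```python
import numpy as np

def bilinear_render(W, x, y):
    """Bilinear image model: each pixel is a bilinear form in content (x) and
    transformation (y) coefficients. W has shape (pixels, len(x), len(y))."""
    return np.einsum('pij,i,j->p', W, x, y)
```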
Statistically and perceptually motivated nonlinear image representation
We describe an invertible nonlinear image transformation that is well-matched to the statistical properties of photographic images, as well as the perceptual sensitivity of the human visual system. Images are first decomposed using a multi-scale oriented linear transformation. In this domain, we develop a Markov random field model based on the dependencies within local clusters of transform coefficients associated with basis functions at nearby positions, orientations and scales. In this model, division of each coefficient by a particular linear combination of the amplitudes of others in the cluster produces a new nonlinear representation with marginally Gaussian statistics. We develop a reliable and efficient iterative procedure for inverting the divisive transformation. Finally, we probe the statistical and perceptual advantages of this image representation, examining robustness to added noise, rate-distortion behavior, and artifact-free local contrast enhancement.
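A toy version of the divisive transformation and its fixed-point inversion can be written in a few lines; the weighting matrix, the semisaturation constant sigma, and the iteration count are illustrative assumptions, not the paper's fitted values.

```python
import numpy as np

def divisive_normalize(c, W, sigma=0.1):
    """Divide each coefficient by a weighted combination of neighbor amplitudes.
    c: (n,) transform coefficients; W: (n, n) nonnegative neighborhood weights."""
    return c / np.sqrt(sigma ** 2 + W @ (c ** 2))

def invert_divisive(v, W, sigma=0.1, iters=50):
    """Fixed-point iteration: recover c such that divisive_normalize(c) == v.
    Converges for the modest weights typical of such models (assumption)."""
    c = v.copy()
    for _ in range(iters):
        c = v * np.sqrt(sigma ** 2 + W @ (c ** 2))
    return c
```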
Spatiotemporal power spectra of motion parallax: the case of cluttered 3D scenes
Derek Rivait, Michael S. Langer
We examine the spatiotemporal power spectra of image sequences that depict dense motion parallax, namely the parallax seen by an observer moving laterally in a cluttered 3D scene. Previous models of the spatiotemporal power have accounted for effects such as a static 1/f spectrum in each image frame, a spreading of power at high spatial frequencies in the direction of motion, and a bias toward either lower or higher image speeds depending on the 3D density of objects in the scene. Here we use computer graphics to generate a parameterized set of image sequences and qualitatively verify the main features of these models. The novel contribution is to discuss how failures of 1/f scaling can occur in cluttered scenes. Such failures have been described for the spatial case, but not for the spatiotemporal case. We find that when objects in the cluttered scene are visible over a wide range of depths, and when the image size of objects is smaller than the image width, failures of 1/f scaling tend to occur at certain critical frequencies, defined by a correspondence between object size and object speed.
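For readers who want to reproduce the qualitative analysis, the 3D power spectrum of an image sequence can be estimated directly with an FFT, as in this sketch (axes assumed ordered as time, height, width).

```python
import numpy as np

def spatiotemporal_power(seq):
    """Power spectrum of an image sequence; seq has shape (frames, height, width).
    Returns the centered spectrum over (temporal, vertical, horizontal) frequency."""
    seq = seq.astype(np.float64)
    seq -= seq.mean()                      # remove the DC component
    F = np.fft.fftn(seq)
    return np.fft.fftshift(np.abs(F) ** 2)
```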
Feature category systems for 2nd order local image structure induced by natural image statistics and otherwise
Lewis D. Griffin, Martin Lillholm
We report progress on an approach (Geometric Texton Theory - GTT) that like Marr's 'primal sketch' aims to describe image structure in a way that emphasises its qualitative aspects. In both approaches, image description is by labelling points using a vocabulary of feature types, though compared to Marr we aim for a much larger feature vocabulary. We base GTT on the Gaussian derivative (DtG) model of V1 measurement. Marr's primal sketch was based on DtG filters of derivative order up to 2nd, for GTT we plan to extend to the physiologically plausible limit of 4th. This is how we will achieve a larger feature vocabulary (we estimate 30-150) than Marr's 'edge', 'line' and 'blob'. The central requirement of GTT then is for a procedure for determining the feature vocabulary that will scale up to 4th order. We have previously published feature category systems for 1-D 1st order, 1-D 2nd order, 2-D 1st order and 2-D pure 2nd order. In this paper we will present results of GTT as applied to 2-D mixed 1st + 2nd order features. We will review various approaches to defining the feature vocabulary, including ones based on (i) purely geometrical considerations, and (ii) natural image statistics.
The independent components of natural images are perceptually dependent
Matthias Bethge, Thomas V. Wiecki, Felix A. Wichmann
The independent components of natural images are a set of linear filters which are optimized for statistical independence. With such a set of filters images can be represented without loss of information. Intriguingly, the filter shapes are localized, oriented, and bandpass, resembling important properties of V1 simple cell receptive fields. Here we address the question of whether the independent components of natural images are also perceptually less dependent than other image components. We compared the pixel basis, the ICA basis and the discrete cosine basis by asking subjects to interactively predict missing pixels (for the pixel basis) or to predict the coefficients of ICA and DCT basis functions in patches of natural images. Like Kersten (1987), we find the pixel basis to be perceptually highly redundant but, perhaps surprisingly, the ICA basis showed significantly higher perceptual dependencies than the DCT basis. This shows a dissociation between statistical and perceptual dependence measures.
Learning optimal features for visual pattern recognition
Kai Labusch, Udo Siewert, Thomas Martinetz, et al.
The optimal coding hypothesis proposes that the human visual system has adapted to the statistical properties of the environment by the use of relatively simple optimality criteria. We here (i) discuss how the properties of different models of image coding, i.e. sparseness, decorrelation, and statistical independence, are related to each other, (ii) propose to evaluate the different models by verifiable performance measures, and (iii) analyse the classification performance on images of handwritten digits (MNIST database). We first employ the SPARSENET algorithm (Olshausen, 1998) to derive a local filter basis (on 13 × 13 pixel windows). We then filter the images in the database (28 × 28 pixel images of digits) and reduce the dimensionality of the resulting feature space by selecting the locally maximal filter responses. We then train a support vector machine on a training set to classify the digits and report results obtained on a separate test set. Currently, the best state-of-the-art result on the MNIST database has an error rate of 0.4%. This result, however, has been obtained by using explicit knowledge that is specific to the data (an elastic distortion model for digits). We here obtain an error rate of 0.55%, which is second best but does not use explicit data-specific knowledge. In particular, it outperforms by far all methods that do not use data-specific knowledge.
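A compressed sketch of the described pipeline follows; the filter basis is assumed to have been learned beforehand (e.g., with SPARSENET), and global max pooling stands in for the paper's local-maximum selection, so the error rates above should not be expected from this toy version.

```python
import numpy as np
from scipy.signal import convolve2d
from sklearn.svm import SVC

def pooled_features(image, filters):
    """Filter one digit image and keep the maximal absolute response per filter."""
    return np.array([np.abs(convolve2d(image, f, mode='valid')).max()
                     for f in filters])

def train_digit_classifier(train_images, train_labels, filters):
    """SVM on pooled filter responses; 'filters' is a pre-learned local basis."""
    X = np.stack([pooled_features(im, filters) for im in train_images])
    return SVC(kernel='rbf').fit(X, train_labels)
```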
Unsupervised learning of a steerable basis for invariant image representations
Matthias Bethge, Sebastian Gerwinn, Jakob H. Macke
There are two aspects to unsupervised learning of invariant representations of images: First, we can reduce the dimensionality of the representation by finding an optimal trade-off between temporal stability and informativeness. We show that the answer to this optimization problem is generally not unique so that there is still considerable freedom in choosing a suitable basis. Which of the many optimal representations should be selected? Here, we focus on this second aspect, and seek to find representations that are invariant under geometrical transformations occurring in sequences of natural images. We utilize ideas of 'steerability' and Lie groups, which have been developed in the context of filter design. In particular, we show how an anti-symmetric version of canonical correlation analysis can be used to learn a full-rank image basis which is steerable with respect to rotations. We provide a geometric interpretation of this algorithm by showing that it finds the two-dimensional eigensubspaces of the average bivector. For data which exhibits a variety of transformations, we develop a bivector clustering algorithm, which we use to learn a basis of generalized quadrature pairs (i.e. 'complex cells') from sequences of natural images.
Analysis of segment statistics for semantic classification of natural images
A major challenge facing content-based image retrieval is bridging the gap between low-level image primitives and high-level semantics. We have proposed a new approach for semantic image classification that utilizes the adaptive perceptual color-texture segmentation algorithm by Chen et al., which segments natural scenes into perceptually uniform regions. The color composition and spatial texture features of the regions are used as medium level descriptors, based on which the segments are classified into semantic categories. The segment features consist of spatial texture orientation information and color composition in terms of a limited number of spatially adapted dominant colors. The feature selection and the performance of the classification algorithms are based on the segment statistics. We investigate the dependence of the segment statistics on the segmentation algorithm. For this, we compare the statistics of the segment features obtained using the Chen et al. algorithm to those that correspond to human segmentations, and show that they are remarkably similar. We also show that when human segmentations are used instead of the automatically detected segments, the performance of the semantic classification approach remains approximately the same.
Perceptual Image Quality and Compression
The role of spatially adaptive versus non-spatially adaptive distortion in supra-threshold compression
Matthew D. Gaubatz, Stephanie Kwan, Sheila S. Hemami
In wavelet-based image coding, a variety of masking properties have been exploited that result in spatially-adaptive quantization schemes. It has been shown that carefully selecting uniform quantization step-sizes across entire wavelet subbands or subband codeblocks results in considerable gains in efficiency with respect to visual quality. These gains have been achieved through analysis of wavelet distortion additivity in the presence of a background image; in effect, how wavelet distortions from different bands mask each other while being masked by the image itself at and above threshold. More recent studies have illustrated how the contrast and structural class of natural image data influences masking properties at threshold. Though these results have been extended in a number of methods to achieve supra-threshold compression schemes, the relationship between inter-band and intra-band masking at supra-threshold rates is not well understood. This work aims to quantify the importance of spatially-adaptive distortion as a function of compressed target rate. Two experiments are performed that require the subject to specify the optimal balance between spatially-adaptive and non-spatially-adaptive distortion. Analyses of the resulting data indicate that on average, the balance between spatially-adaptive and non-spatially-adaptive distortion is equally important across all tested rates. Furthermore, though it is known that mean-squared error alone is not a good indicator of image quality, it can be used to predict the outcome of this experiment with reasonable accuracy. This result has convenient implications for image coding that are also discussed.
Image compression using sparse colour sampling combined with nonlinear image processing
Stephen Brooks, Ian Saunders, Neil A. Dodgson
We apply two recent non-linear, image-processing algorithms to colour image compression. The two algorithms are colorization and joint bilateral filtering. Neither algorithm was designed for image compression. Our investigations were to ascertain whether their mechanisms could be used to improve the image compression rate for the same level of visual quality. Both show interesting behaviour, with the second showing a visible improvement in visual quality, over JPEG, at the same compression rate. In both cases, we store luminance as a standard, lossily compressed, greyscale image and store colour at a very low sampling rate. Each of the non-linear algorithms then uses the information from the luminance channel to determine how to propagate the colour information appropriately to reconstruct a full colour image.
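The reconstruction step can be approximated with a joint bilateral filter in which full-resolution luminance guides the propagation of coarsely sampled chroma; the sketch below is a slow reference version with illustrative sigma values, not the authors' implementation.

```python
import numpy as np

def joint_bilateral_color(lum, chroma_lr, s, sigma_s=1.0, sigma_r=12.0, r=2):
    """Reconstruct full-resolution chroma from an s-times subsampled chroma plane,
    guided by full-resolution luminance. lum: (H, W); chroma_lr: (H//s, W//s, C)."""
    H, W = lum.shape
    hl, wl, C = chroma_lr.shape
    out = np.zeros((H, W, C))
    for y in range(H):
        for x in range(W):
            yl, xl = y / s, x / s
            num, den = np.zeros(C), 0.0
            for j in range(max(0, int(yl) - r), min(hl, int(yl) + r + 1)):
                for i in range(max(0, int(xl) - r), min(wl, int(xl) + r + 1)):
                    spatial = (j - yl) ** 2 + (i - xl) ** 2
                    range_d = lum[y, x] - lum[min(j * s, H - 1), min(i * s, W - 1)]
                    w = np.exp(-spatial / (2 * sigma_s ** 2)
                               - range_d ** 2 / (2 * sigma_r ** 2))
                    num += w * chroma_lr[j, i]
                    den += w
            out[y, x] = num / max(den, 1e-8)
    return out
```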
Compression of Image Clusters using Karhunen Loeve Transformations
This paper proposes to extend the Karhunen-Loeve compression algorithm to multiple images. The resulting algorithm is compared against single-image Karhunen-Loeve as well as algorithms based on the Discrete Cosine Transformation (DCT). Furthermore, various methods for obtaining compressible clusters from large image databases are evaluated.
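A minimal version of the shared-basis idea: compute one Karhunen-Loeve basis per image cluster via the SVD and store only the mean, the leading k basis vectors, and each image's coefficients. The sketch assumes flattened, equally sized images.

```python
import numpy as np

def klt_compress(images, k):
    """Shared Karhunen-Loeve basis for a cluster of similar images.
    images: (n, d) array of flattened images; keeps k components."""
    mean = images.mean(axis=0)
    X = images - mean
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    basis = Vt[:k]                 # (k, d) cluster-wide basis
    coeffs = X @ basis.T           # (n, k) per-image coefficients
    return mean, basis, coeffs

def klt_decompress(mean, basis, coeffs):
    """Reconstruct approximate images from the stored representation."""
    return coeffs @ basis + mean
```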
Visual ergonomic aspects of glare on computer displays: glossy screens and angular dependence
Kjell Brunnström, Börje Andrén, Zacharias Konstantinides, et al.
Recently, flat panel computer displays and notebook computers designed with a so-called glare panel, i.e. a highly glossy screen, have emerged on the market. The shiny look of the display appeals to customers, and it is also argued that contrast, colour saturation, etc. improve with a glare panel. LCD displays often suffer from angular-dependent picture quality, an effect that has become even more pronounced with the introduction of Prism Light Guide plates into displays for notebook computers. The TCO label is the leading labelling system for computer displays; currently about 50% of all computer displays on the market are certified according to the TCO requirements. The requirements are periodically updated to keep up with the technical development and the latest research in, e.g., visual ergonomics. The gloss level of the screen and its angular dependence have recently been investigated in user studies. A study of the effect of highly glossy screens compared to matt screens has been performed. The results show a slight advantage for the glossy screen when no disturbing reflections are present; however, the difference was not statistically significant. When disturbing reflections are present, the advantage turns into a larger disadvantage, and this difference is statistically significant. Another study, of angular dependence, has also been performed; its results indicate a linear relationship between picture quality and the centre luminance of the screen.
The blur effect: perception and estimation with a new no-reference perceptual blur metric
Frederique Crete, Thierry Dolmiere, Patricia Ladret, et al.
To achieve the best image quality, noise and artifacts are generally removed at the cost of a loss of details, generating the blur effect. To control and quantify the emergence of the blur effect, blur metrics have already been proposed in the literature. By associating the blur effect with edge spreading, these metrics are sensitive not only to the threshold chosen to classify the edges, but also to the presence of noise, which can mislead the edge detection. Based on the observation that we have difficulty perceiving differences between a blurred image and the same image re-blurred, we propose a new approach which is not based on transient characteristics but on the discrimination between different levels of blur perceptible in the same picture. Using subjective tests and psychophysical functions, we validate our blur perception theory for a set of pictures which are naturally unsharp or blurred to varying degrees through one- or two-dimensional low-pass filters. Those tests show the robustness and the ability of the metric to evaluate not only the blur introduced by restoration processing but also focal blur or motion blur. Requiring no reference and a low-cost implementation, this new perceptual blur metric is applicable in a large domain, from a simple metric to a means of fine-tuning artifact corrections.
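The re-blurring principle lends itself to a compact sketch: blur the image again, and measure how much local variation the re-blur removes. The kernel size below is an arbitrary choice, and only horizontal variation is measured, so this is a simplification of the published metric.

```python
import numpy as np
from scipy.ndimage import uniform_filter1d

def blur_estimate(img, k=9):
    """Re-blur based no-reference blur estimate (horizontal only, simplified).
    Returns a value near 0 for sharp images and near 1 for heavily blurred ones."""
    img = img.astype(np.float64)
    reblurred = uniform_filter1d(img, k, axis=1)       # horizontal low-pass
    d_orig = np.abs(np.diff(img, axis=1))              # original local variation
    d_blur = np.abs(np.diff(reblurred, axis=1))
    lost = np.maximum(d_orig - d_blur, 0.0)            # variation removed by re-blur
    return 1.0 - lost.sum() / max(d_orig.sum(), 1e-8)
```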
Perceptual quality evaluation of geometric distortions in images
Human perception of image distortions has been widely explored in recent years; however, research has not dealt with distortions due to geometric operations. In this paper, we present the results we obtained by means of psychovisual experiments aimed at evaluating the way the human visual system perceives geometric distortions in images. A mathematical model of the geometric distortions is first introduced; then the impact of the model parameters on the visibility of the distortion is measured by means of both objective metrics and subjective tests.
Perceptual Video Quality
Color preference, color naturalness, and annoyance of compressed and color scaled digital videos
Chin Chye Koh, John M. Foley, Sanjit K. Mitra
In this work, we studied how video compression and color scaling interact to affect the overall video quality and the color quality attributes. We examined the three subjective attributes: perceived color preference, perceived color naturalness, and overall annoyance, as digital videos were subjected to compression and chroma scaling. Our objectives were: (1) to determine how the color chroma scaling of compressed digital videos affected the mean color preference and naturalness and overall annoyance ratings across subjects and (2) to determine how preference, naturalness, and annoyance were related. Psychophysical experiments were carried out in which naïve subjects made numerical judgments of these three attributes. Preference and naturalness scores increased to a maximum and decreased as the mean chroma of the videos increased. As compression increased, both preference and naturalness scores decreased and they varied less with mean chroma. Naturalness scores tended to reach a maximum at lower mean chroma than preference scores. Annoyance scores decreased to a minimum and then increased as mean chroma increased. The mean chroma at which annoyance was minimum was less than the mean chroma at which naturalness and preference were maximum. Preference, naturalness, and annoyance scores for individual videos were approximated relatively well by Gaussian functions of mean chroma. Preference and naturalness scores decreased linearly as a function of the logarithm of the total squared error, while annoyance scores increased as an S-shaped function of the logarithm of the total squared error. A three-parameter model is shown to provide a good description of how each attribute depends on chroma and compression for individual videos. Model parameters vary with video content.
Relation between DSIS and DSCQS for temporal and spatial video artifacts in a wireless home environment
To interpret the impressions of observers, it is necessary to understand the relationship between the components that influence perceived video quality. This paper addresses the effect of assessment methodology on the subjective judgement of spatially and temporally impaired video material, caused by video adaptation methods that come into play when the throughput available for video material varies (I-Frame Delay and Signal-to-Noise Ratio scalability). The judgement strategies used are the double-stimulus continuous-quality scale (DSCQS) and the double-stimulus impairment scale (DSIS). Results show no evidence for an influence of spatial artifacts on perceived video quality with the presented judgement strategies. Results for the influence of temporal artifacts are less easy to interpret, because it is not possible to distinguish whether the non-linear relation between DSIS and DSCQS appeared because of the temporal artifacts themselves or because of the presented scene content.
"Can you see me now?" An objective metric for predicting intelligibility of compressed American Sign Language video
Francis M. Ciaramello, Sheila S. Hemami
For members of the Deaf Community in the United States, current communication tools include TTY/TDD services, video relay services, and text-based communication. With the growth of cellular technology, mobile sign language conversations are becoming a possibility. Proper coding techniques must be employed to compress American Sign Language (ASL) video for low-rate transmission while maintaining the quality of the conversation. In order to evaluate these techniques, an appropriate quality metric is needed. This paper demonstrates that traditional video quality metrics, such as PSNR, fail to predict subjective intelligibility scores. By considering the unique structure of ASL video, an appropriate objective metric is developed. Face and hand segmentation is performed using skin-color detection techniques. The distortions in the face and hand regions are optimally weighted and pooled across all frames to create an objective intelligibility score for a distorted sequence. The objective intelligibility metric performs significantly better than PSNR in terms of correlation with subjective responses.
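The structure of such a metric, skin-region segmentation followed by weighted distortion pooling, can be sketched as follows; the normalized-RGB skin thresholds and the region weight are illustrative guesses, not the optimized values reported in the paper.

```python
import numpy as np

def skin_mask(rgb):
    """Crude skin detector in normalized RGB (thresholds are illustrative)."""
    s = rgb.astype(np.float64).sum(axis=2) + 1e-8
    r, g = rgb[..., 0] / s, rgb[..., 1] / s
    return (r > 0.36) & (r < 0.46) & (g > 0.28) & (g < 0.36)

def asl_intelligibility(ref_frames, dist_frames, w_skin=0.8):
    """Pool distortion over frames, weighting skin (face/hand) regions heavily."""
    frame_scores = []
    for ref, dist in zip(ref_frames, dist_frames):
        m = skin_mask(ref)
        err = ((ref.astype(np.float64) - dist.astype(np.float64)) ** 2).mean(axis=2)
        skin_err = err[m].mean() if m.any() else err.mean()
        back_err = err[~m].mean() if (~m).any() else 0.0
        frame_scores.append(w_skin * skin_err + (1.0 - w_skin) * back_err)
    return -10.0 * np.log10(np.mean(frame_scores) + 1e-8)  # higher = more intelligible
```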
Price-dependent quality: examining the effects of price on multimedia quality requirements
David S. Hands, Caroline Partridge, Kennedy Cheng, et al.
Traditionally, subjective quality assessments are made in isolation of mediating factors (e.g. interest in content, price). This approach is useful for determining the pure perceptual quality of content. Recently, there has been a growing interest in understanding users' quality of experience. To move from perceptual quality assessment to quality of experience assessment, factors beyond reproduction quality must be considered. From a commercial perspective, content and price are key determinants of success. This paper investigates the relationship between price and quality. Subjects selected content that was of interest to them. Subjects were given a budget of ten pounds at the start of the test. When viewing content, subjects were free to select different levels of quality. The lowest quality was free (and subjects left the test with ten pounds). The highest quality used up the full budget (and subjects left the test with no money). A range of pricing tariffs was used in the test. During the test, subjects were allowed to prioritise quality or price. The results of the test found that subjects prioritised quality over price across all tariff levels. At the higher pricing tariffs, subjects became more price sensitive. Using data from a number of subjective tests, a utility function describing the relationship between price and quality was produced.
Visual Optics
Correcting spurious resolution in defocused images
John I. Yellott, John W. Yellott
Optical modeling suggests that levels of retinal defocus routinely caused by presbyopia should produce phase reversals (spurious resolution, SR) for spatial frequencies in the 2 cycles/letter range known to be critical for reading. Simulations show that such reversals can have a decisive impact on character legibility, and that correcting only this feature of defocused images (by re-reversing contrast sign errors created by defocus) can make unrecognizably blurred letters completely legible. This deblurring impact of SR correction is remarkably unaffected by the magnitude of defocus, as determined by blur-circle size. Both the deblurring itself and its robustness can be understood from the effect that SR correction has on the defocused pointspread function, which changes from a broad flat cake to a sharply pointed cone. This SR-corrected pointspread acts like a delta function, preserving image shape during convolution regardless of blur-disk size. Curiously, such pointspread functions always contain a narrow annulus of negative light-intensity values whose radius equals the diameter of the blur circle. We show that these properties of SR correction all stem from the mathematical nature of the Fourier transform of the sign of the optical transfer function, which also accounts for the inevitable low contrast of images pre-corrected for SR.
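Correcting spurious resolution amounts to re-flipping the frequency components where the optical transfer function has gone negative. A sketch, assuming a real-valued OTF supplied on a centered frequency grid:

```python
import numpy as np

def correct_spurious_resolution(blurred, otf):
    """Re-reverse contrast at frequencies where the OTF is negative.
    blurred: defocused image; otf: real OTF on a centered frequency grid."""
    F = np.fft.fft2(blurred)
    F *= np.sign(np.fft.ifftshift(otf))   # flip phase-reversed components
    return np.real(np.fft.ifft2(F))
```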
Do focus measures apply to retinal images?
Yibin Tian, Kevin Shieh, Christine F. Wildsoet
The diverse needs of digital auto-focusing systems have driven the development of a variety of focus measures. The purpose of the current study was to investigate whether any of these focus measures are biologically plausible; specifically, whether they are applicable to the retinal images from which defocus information is extracted in the operation of accommodation and emmetropization, two ocular auto-focusing mechanisms. Ten representative focus measures were chosen for analysis, 6 in the spatial domain and 4 transform-based. Their performance was examined for combinations of non-defocus aberrations and positive and negative defocus. For each combination, a wavefront was reconstructed, the corresponding point spread function (PSF) computed using the Fast Fourier Transform (FFT), and the blurred image then obtained as the convolution of the PSF and a perfect image. For each blurred image, a curve was derived for each focus measure. Aberration data were either collected from 22 real eyes or randomly generated based on Gaussian parameters describing data from a published large-scale human study (n>100). For the latter data set, analyses made use of distributed computing on a small inhomogeneous computer cluster. In the presence of small amounts of non-defocus aberrations, all focus measures showed monotonic changes with positive or negative defocus, and their curves generally remained unimodal, although there were large differences in their variability, sensitivity to defocus and effective ranges. However, the performance of a number of these focus measures became unacceptable when non-defocus aberrations exceeded a certain level.
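Three representative focus measures, two spatial-domain and one transform-based, can be written compactly; the high-frequency cutoff in the spectral measure is an arbitrary illustrative choice.

```python
import numpy as np
from scipy import ndimage

def variance_focus(img):
    """Spatial-domain measure: gray-level variance rises with sharpness."""
    return img.astype(np.float64).var()

def tenengrad_focus(img):
    """Spatial-domain measure: Sobel gradient energy."""
    g = img.astype(np.float64)
    return (ndimage.sobel(g, axis=0) ** 2 + ndimage.sobel(g, axis=1) ** 2).mean()

def spectral_focus(img, cutoff_frac=0.125):
    """Transform-based measure: share of spectral energy above a radial cutoff."""
    F = np.fft.fftshift(np.abs(np.fft.fft2(img.astype(np.float64))))
    h, w = F.shape
    yy, xx = np.ogrid[:h, :w]
    radius = np.hypot(yy - h / 2.0, xx - w / 2.0)
    return F[radius > cutoff_frac * min(h, w)].sum() / F.sum()
```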
Characterizing Color in Imaging Systems
Aperture and object mode appearances in images
Vision scientists have segmented appearances into aperture and object modes, based on observations that scene stimuli appear different in a black (no-light) surround. This rests on a 19th-century assumption that the stimulus determines the mode, and sensory feedback determines the appearance. Since the 1960s there have been innumerable experiments on spatial vision following the work of Hubel and Wiesel, Campbell, Gibson, Land and Zeki. The modern view of vision is that appearance is generated by spatial interactions, or contrast. This paper describes experiments that provide a significant increment of new data on the effects of contrast and constancy over a wider range of luminances than previously studied. Matches are not consistent with discounting the illuminant. The observers' matches fit a simple two-step physical description: the appearance of maxima is dependent on luminance, and less-luminous areas are dependent on spatial contrast. Reliance on unspecified feedback processes, such as aperture mode and object mode, is no longer necessary. Simple rules of maxima and spatial interactions account for all matches in flat 2D transparent targets, complex 3D reflection prints and HDR displays.
Visibility improvement based on gray matching experiment between dark and ambient condition in mobile display
The purpose of this study is to examine gray matching between dark and ambient viewing conditions and to use the matching results to improve visibility on mobile displays; the target ambient illuminance for the experiments was 30,000 lux. First, to measure visibility under the ambient condition, a patch-count experiment was conducted in which observers counted how many patches could be seen in the original images under ambient light. Visibility under the ambient condition differed significantly from the dark condition. Next, a gray matching experiment was conducted by comparing gray patches between the dark and ambient conditions using the method of adjustment. Participants reported that for white or bright gray patches, no patch of equal brightness could be found under the ambient condition. To confirm the visibility improvement suggested by the gray matching results, visibility was measured under ambient light after a simple implementation of the derived gray matching curve, following the same procedure as the first visibility experiment. After applying the gray matching curve, visibility improved. A statistical t-test between the corrected patches under ambient light and the maximum of the dark condition was not significant, meaning that visibility did not differ between the original patches in the dark condition and the curve-corrected patches in the ambient condition.
Multispectral color constancy: real image tests
Experiments using real images are conducted on a variety of color constancy algorithms (Chromagenic, Greyworld, Max RGB, and a Maloney-Wandell extension called Subspace Testing) in order to determine whether extending the number of channels from 3 to 6 to 9 would enhance the accuracy with which they estimate the scene illuminant color. To create the 6- and 9-channel images, filters were placed over a standard 3-channel color camera. Although some improvement is found with 6 channels, the results indicate that essentially the extra channels do not help as much as might be expected.
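Grey-World and Max-RGB estimation extend naturally from 3 to 6 or 9 channels, since both operate per channel. A sketch (the input is assumed to be a linear H x W x C image):

```python
import numpy as np

def grey_world(img):
    """Illuminant estimate from per-channel means; img is (H, W, C), C = 3, 6, or 9."""
    e = img.reshape(-1, img.shape[-1]).mean(axis=0)
    return e / np.linalg.norm(e)

def max_rgb(img):
    """Illuminant estimate from per-channel maxima."""
    e = img.reshape(-1, img.shape[-1]).max(axis=0)
    return e / np.linalg.norm(e)
```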
Color balancing based upon gamut and temporal correlations
Image acquisition devices inherently do not have the color constancy mechanism of the human visual system. The machine color constancy problem can be circumvented using a white balancing technique based upon accurate illumination estimation. Unfortunately, previous studies cannot give satisfactory results for both accuracy and stability under various conditions. To overcome these problems, we suggest a new method: spatial and temporal illumination estimation. This method, an evolution of the Retinex and Color by Correlation methods, predicts an initial illuminant point and estimates the scene illumination between that point and sub-gamuts derived from luminance levels. The proposed method can raise the estimation probability not only by detecting motion of the scene reflectances but also by finding valid scenes using information that differs between sequential scenes. The proposed method outperforms recently developed algorithms.
Visibility of hue, saturation, and luminance deviations in LEDs
It is well known that LEDs have problems with color consistency and color stability over time. Two perception experiments were conducted in order to determine guidelines for the allowable color and luminance deviations between LEDs. The first experiment determined the visibility threshold of hue, saturation, and luminance deviations of one LED in an array of LEDs, and the second experiment measured the visibility threshold of hue, saturation, and luminance gratings for different spatial frequencies. The results of the first experiment show that people are most sensitive to color deviations between LEDs when a white color is generated. The visibility threshold for white was 0.004 Δu'v' for a deviation in the hue of the LED primaries, 0.007 Δu'v' for a deviation in the saturation of the LED primaries, and 0.006 Δu'v' for a deviation in the luminance of the LED primaries. The second experiment showed that the visibility of hue gratings is independent of spatial frequency in the range of 0.4 to 1.2 cycles/degree. However, for saturation and luminance gratings there was a significant effect of spatial frequency on the visibility threshold. Both experiments show that observers are more sensitive to hue than to saturation deviations.
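For reference, the Δu'v' differences reported above can be computed from tristimulus values with the CIE 1976 chromaticity formulas:

```python
import numpy as np

def uv_prime(X, Y, Z):
    """CIE 1976 u'v' chromaticity coordinates from tristimulus values."""
    d = X + 15.0 * Y + 3.0 * Z
    return 4.0 * X / d, 9.0 * Y / d

def delta_uv(xyz_a, xyz_b):
    """Chromaticity difference Delta u'v' between two stimuli given as (X, Y, Z)."""
    ua, va = uv_prime(*xyz_a)
    ub, vb = uv_prime(*xyz_b)
    return float(np.hypot(ua - ub, va - vb))
```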
Paper whiteness and its effect on the reproduction of colors
The whiteness level of a printing paper is considered an important quality measure. High paper whiteness improves the contrast to printed areas, providing a more distinct appearance of printed text and colors, and increases the number of reproducible colors. Its influence on perceived color rendering quality is, however, not completely explained. The intuitive interpretation of paper whiteness is a material with high light reflection for all wavelengths in the visual part of the color spectrum. However, a slightly bluish shade is perceived as being whiter than a neutral white; accordingly, papers with high whiteness values incline toward bluish-white. In paper production, a high whiteness level is achieved by the use of highly bleached pulp together with filler pigments of high light scattering. To further increase whiteness levels, expensive additives such as Fluorescent Whitening Agents (FWA) and shading dyes are needed. During the last years, the CIE whiteness level of some commercially available office papers has exceeded 170 CIE units, a level that can only be reached by the addition of significant amounts of FWA. Although paper whiteness is considered an important paper quality criterion, its influence on printed color images is complicated. The dynamic mechanisms of the human visual system strive to optimize the visual response to each particular viewing condition. One of these mechanisms is chromatic adaptation, whereby colored objects keep the same appearance under different light sources, i.e. a white paper appears white under tungsten, fluorescent and daylight illumination. In the process of judging printed color images, paper whiteness will be part of the chromatic adaptation, which implies that variations in paper whiteness would be discounted by the human visual system. On the other hand, high paper whiteness improves the contrast as well as the color gamut, both important parameters for perceived color reproduction quality. In order to quantify the influence of paper whiteness, pilot papers differing in their amount of FWA but in all other respects similar were produced on a small-scale experimental paper machine. Because only the FWA content varies, influences other than paper whiteness are reduced in the evaluation process. A set of images, all having characteristics with the potential to reveal the influence of the varied whiteness level on color reproduction quality, was printed on the pilot papers on two different printers. Prior to printing the test images, ICC profiles were calculated for all the printer-substrate combinations used. A visual assessment study of the printed samples was carried out in order to relate the paper whiteness level to perceived color reproduction quality. The results show improved color rendering quality with increased CIE whiteness up to a certain level; any further increase in paper whiteness does not contribute to improved color reproduction quality. Furthermore, the fact that some printing inks are UV blocking while others are not introduces a non-uniform color shift in the printed image when the FWA activation changes. This non-uniform color shift has been quantified both for variations in illuminant and for variations of FWA content in the paper.
Cognitive Graphics
Higher-order image representations for hyper-resolution image synthesis and capture
Real time imaging applications such as interactive rendering and video conferencing face particularly challenging bandwidth problems, especially as we attempt to improve resolution to perceptual limits. Compression has been an amazing enabler of video streaming and storage, but in interactive settings, it can introduce application-killing latencies. Rather than synthesizing or capturing a verbose representation and then immediately converting it into its succinct form, we should generate the concise representation directly. Our research is inspired by human vision, which as Hoffman (1998) notes, constructs "continuous lines and surfaces...from discrete information." Our adaptive frameless renderer uses gradient samples and steerable filters to perform spatiotemporally adaptive reconstruction that preserves both edges and occlusion boundaries. Resulting RMS qualities are equivalent to traditionally synthesized imagery with 10 times more samples. Nevertheless in dynamic scenes, producing pleasing edges with so few samples is challenging. We are currently developing methods for reconstructing imagery using color samples supplemented with sparse edge information. Such higher-order representations will be a crucial enabler of interactive, hyper-resolution image synthesis, capture and display.
Semantic photosynthesis
Matthew Johnson, G. J. Brostow, J. Shotton, et al.
Composite images are synthesized from existing photographs by artists who make concept art, e.g. storyboards for movies or architectural planning. Current techniques allow an artist to fabricate such an image by digitally splicing parts of stock photographs. While these images serve mainly to "quickly" convey how a scene should look, their production is laborious. We propose a technique that allows a person to design a new photograph with substantially less effort. This paper presents a method that generates a composite image when a user types in nouns, such as "boat" and "sand." The artist can optionally design an intended image by specifying other constraints. Our algorithm formulates the constraints as queries to search an automatically annotated image database. The desired photograph, not a collage, is then synthesized using graph-cut optimization, optionally allowing for further user interaction to edit or choose among alternative generated photos. Our results demonstrate our contributions of (1) a method of creating specific images with minimal human effort, and (2) a combined algorithm for automatically building an image library with semantic annotations from any photo collection.
Revealing pentimenti: the hidden history in a painting
Art conservators often explore X-ray images of paintings to help find pentimenti, the artist's revisions hidden beneath the painting's visible first surfaces. X-ray interpretation is difficult due to artifacts in the image, superimposed features from all paint layers, and because image intensity depends on both the paint layer thickness and each pigment's opacity. We present a robust user-guided method to suppress clutter, find visually significant differences between X-ray images and color photographs, and visualize them together. These tools allow domain experts as well as museum visitors to explore the artist's creative decisions that led to a masterpiece.
Perceptual Issues in High-Dynamic Range Imaging
Self-calibrating wide color gamut high-dynamic-range display
Helge Seetzen, Samy Makki, Henry Ip, et al.
High Dynamic Range displays offer higher brightness, higher contrast, better color reproduction and lower power consumption compared to conventional displays available today. In addition to these benefits, it is possible to leverage the unique design of HDR displays to overcome many of the calibration and lifetime degradation problems of liquid crystal displays, especially those using light emitting diodes. This paper describes a combination of sensor mechanisms and algorithms that reduce luminance and color variation for both HDR and conventional displays even with the use of highly variable light elements.
Tone mapping for high-dynamic range displays
We address the problem of re-rendering, for high dynamic range (HDR) displays, images that were originally tone-mapped for standard displays. As these new HDR displays have a much larger dynamic range than standard displays, an image rendered for standard monitors is likely to look too bright when displayed on an HDR monitor. Moreover, because of the operations performed during capture and rendering to standard displays, the specular highlights are likely to have been clipped or compressed, which causes a loss of realism. We propose a tone scale function, focused on the representation of specular highlights, for re-rendering images first tone-mapped for standard displays. The shape of the tone scale function depends on the segmentation of the input image into its diffuse and specular components. In this article, we describe a method to perform this segmentation automatically. Our method detects specular highlights by using two low-pass filters of different sizes combined with morphological operators. The results show that our method successfully detects small and medium-sized specular highlights. The locations of specular highlights define a mask used for the construction of the tone scale function. We then propose two ways of applying the tone scale: a global version that applies the same curve to each pixel in the image, and a local version that uses the spatial information given by the mask to apply the tone scale differently to diffuse and to specular pixels.
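The two-low-pass-filters-plus-morphology detector can be sketched directly; box filters stand in for whatever low-pass kernels the authors used, and the sizes and threshold below are illustrative.

```python
import numpy as np
from scipy import ndimage

def specular_mask(lum, small=5, large=31, thresh=0.1):
    """Mark small bright regions: difference of two low-pass filtered images,
    then morphological clean-up (parameters are illustrative)."""
    lo_small = ndimage.uniform_filter(lum.astype(np.float64), small)
    lo_large = ndimage.uniform_filter(lum.astype(np.float64), large)
    mask = (lo_small - lo_large) > thresh * lum.max()
    mask = ndimage.binary_opening(mask)                 # remove isolated pixels
    return ndimage.binary_closing(mask, iterations=2)   # fill highlight interiors
```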
High-dynamic range imaging pipeline: perception-motivated representation of visual content
Rafal Mantiuk, Grzegorz Krawczyk, Radoslaw Mantiuk, et al.
The advances in high dynamic range (HDR) imaging, especially in display and camera technology, have a significant impact on existing imaging systems. The assumptions of traditional low-dynamic-range imaging, designed for paper print as the major output medium, are ill suited to the range of visual material that is shown on modern displays. For example, the common assumption that the brightest color in an image is white can hardly be justified for high-contrast LCD displays, not to mention next-generation HDR displays that can easily create bright highlights and the impression of self-luminous colors. We argue that a high dynamic range representation can encode images regardless of the technology used to create and display them, with an accuracy that is constrained only by the limitations of the human eye and not by a particular output medium. To facilitate research on high dynamic range imaging, we have created a software package (http://pfstools.sourceforge.net/) capable of handling HDR data at all stages of image and video processing. The software package is available as open source under the General Public License and includes solutions for high quality image acquisition from multiple exposures, a range of tone mapping algorithms and a visual difference predictor for HDR images. Examples of shell scripts demonstrate how the software can be used for processing single images as well as video sequences.
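pfstools implements calibrated multi-exposure acquisition; a deliberately simplified, linear-sensor version of that merging step looks like this (the Gaussian weighting that favors mid-range pixels is a common convention, not necessarily the package's).

```python
import numpy as np

def merge_exposures(images, exposure_times):
    """Weighted merge of 8-bit exposures into a relative radiance map,
    assuming a linear sensor response."""
    num = np.zeros(images[0].shape, dtype=np.float64)
    den = np.zeros_like(num)
    for img, t in zip(images, exposure_times):
        x = img.astype(np.float64) / 255.0
        w = np.exp(-4.0 * (x - 0.5) ** 2)   # trust mid-range pixels most
        num += w * x / t
        den += w
    return num / np.maximum(den, 1e-8)
```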
Veiling glare: the dynamic range limit of HDR images
High Dynamic Range (HDR) images are superior to conventional images. However, veiling glare is a physical limit to HDR image acquisition and display. We performed camera calibration experiments using a single test target with 40 luminance patches covering a luminance range of 18,619:1. Veiling glare is a scene-dependent physical limit of the camera and the lens. Multiple exposures cannot accurately reconstruct scene luminances beyond the veiling glare limit. Human observer experiments, using the same targets, showed that image-dependent intraocular scatter changes identical display luminances into different retinal luminances. Vision's contrast mechanism further distorts any correlation of scene luminance and appearance. There must be reasons, other than accurate luminance, that explain the improvement in HDR images. The multiple exposure technique significantly improves digital quantization. The improved quantization allows displays to present better spatial information to humans. When human vision looks at high-dynamic-range displays, it processes them using spatial comparisons.
Eye Movements and Visual Attention
Hidden Markov model-based face recognition using selective attention
A. A. Salah, M. Bicego, L. Akarun, et al.
Sequential methods for face recognition rely on the analysis of local facial features in a sequential manner, typically with a raster scan. However, the distribution of discriminative information is not uniform over the facial surface. For instance, the eyes and the mouth are more informative than the cheek. We propose an extension to the sequential approach, where we take into account local feature saliency, and replace the raster scan with a guided scan that mimics the scanpath of the human eye. The selective attention mechanism that guides the human eye operates by coarsely detecting salient locations, and directing more resources (the fovea) at interesting or informative parts. We simulate this idea by employing a computationally cheap saliency scheme, based on Gabor wavelet filters. Hidden Markov models are used for classification, and the observations, i.e. features obtained with the simulation of the scanpath, are modeled with Gaussian distributions at each state of the model. We show that by visiting important locations first, our method is able to reach high accuracy with much shorter feature sequences. We compare several features in observation sequences, among which DCT coefficients result in the highest accuracy.
The role of eye movement signals in dorsal and ventral processing
John A. Black Jr., Stuart B. Braiman, Chatapuramkrishnan Narayanan, et al.
As human eyes scan an image, each fixation captures high resolution visual information from a small region of that image. The resulting intermittent visual stream is sent along two visual pathways to the visual centers of the brain concurrently with eye movement information. The ventral stream (the what pathway) is associated with object recognition, while the dorsal stream (the where pathway) is associated with spatial perception. This research employs three experiments to compare the relative importance of eye movement information within these two visual pathways. During Experiment 1 participants visually examine (a) outdoor scenery images, and (b) object images, while their fixation sequences are captured. These fixation sequences are then used to generate sequences of foveated images, in the form of videos. In Experiments 2 and 3 these videos are viewed by another set of participants. In doing so, participants in Experiments 2 and 3 experience the same sequence of foveal stimuli as those in Experiment 1, but might or might not experience the corresponding eye movement signals. The subsequent ability of the Experiment 2 and 3 participants to (a) recognize objects, and (b) locate landmarks in outdoor scenes provides information about the importance of eye movement information in dorsal and ventral processing.
Variable resolution images and their effects on eye movements during free viewing
Marcus Nyström, Kenneth Holmqvist
Earlier studies have shown that while free-viewing images, people tend to gaze at regions with a high local density of bottom-up features such as contrast and edge density. In particular, this tendency seems to be more pronounced during the first few fixations after image onset. In this paper, we present a new method to investigate how gaze locations are chosen, by introducing varying image resolution and measuring how it affects eye-movement behavior during free viewing. Results show that gaze density overall is shifted toward regions presented in high resolution over those degraded in resolution. However, certain image regions seem to attract early fixations regardless of display resolution. These results suggest that top-down control of gaze guidance may be the dominant factor early in visual processing.
Hierarchy visual attention map
Kai-Chieh Yang, Clark C. Guest, Pankaj K. Das
In this paper, a new implementation of the human attention map is introduced. Most conventional approaches share two characteristics: the pooling rule is fixed, and prior knowledge of the camera aim is discarded. Unlike previous research, the proposed method allows more freedom at the feature integration stage, since human eyes have a different sensitivity to each feature under different camera-aiming scenarios. An intelligent mechanism is designed to identify the importance of each feature, including the skin tone feature, for each type of camera motion, and the feature integration is adaptive to the content. With this framework, more important features are emphasized and less important features are suppressed.
Attention trees and semantic paths
In the last few decades several techniques for image content extraction, often based on segmentation, have been proposed. It has been suggested that under the assumption of very general image content, segmentation becomes unstable and classification becomes unreliable. According to recent psychological theories, certain image regions attract the attention of human observers more than others and, generally, the image main meaning appears concentrated in those regions. Initially, regions attracting our attention are perceived as a whole and hypotheses on their content are formulated; successively the components of those regions are carefully analyzed and a more precise interpretation is reached. It is interesting to observe that an image decomposition process performed according to these psychological visual attention theories might present advantages with respect to a traditional segmentation approach. In this paper we propose an automatic procedure generating image decomposition based on the detection of visual attention regions. A new clustering algorithm taking advantage of the Delaunay-Voronoi diagrams for achieving the decomposition target is proposed. By applying that algorithm recursively, starting from the whole image, a transformation of the image into a tree of related meaningful regions is obtained (Attention Tree). Successively, a semantic interpretation of the leaf nodes is carried out by using a structure of Neural Networks (Neural Tree) assisted by a knowledge base (Ontology Net). Starting from leaf nodes, paths toward the root node across the Attention Tree are attempted. The task of the path consists in relating the semantics of each child-parent node pair and, consequently, in merging the corresponding image regions. The relationship detected in this way between two tree nodes generates, as a result, the extension of the interpreted image area through each step of the path. The construction of several Attention Trees has been performed and partial results will be shown.
Motion integration in visual attention models for predicting simple dynamic scenes
A. Bur, P. Wurtz, R. M. Müri, et al.
Visual attention models mimic the ability of a visual system to detect potentially relevant parts of a scene. This process of attentional selection is a prerequisite for higher level tasks such as object recognition. Given the high relevance of temporal aspects in human visual attention, dynamic information as well as static information must be considered in computer models of visual attention. While some models have been proposed for extending the classical static model to motion, a comparison of the performances of models integrating motion in different manners is still not available. In this article, we present a comparative study of various visual attention models combining both static and dynamic features. The considered models are compared by measuring their respective performance with respect to the eye movement patterns of human subjects. Simple synthetic video sequences, containing static and moving objects, are used to assess the models' suitability. Qualitative and quantitative results provide a ranking of the different models.
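The integration schemes being compared can be abstracted to a few canonical fusion rules over normalized conspicuity maps; the three below are generic examples rather than the specific models evaluated in the paper.

```python
import numpy as np

def fuse_saliency(static_map, motion_map, scheme='sum'):
    """Fuse normalized static and dynamic conspicuity maps."""
    s = static_map / (static_map.max() + 1e-8)
    m = motion_map / (motion_map.max() + 1e-8)
    if scheme == 'max':
        return np.maximum(s, m)   # competitive integration
    if scheme == 'product':
        return s * m              # multiplicative integration
    return 0.5 * (s + m)          # additive integration
```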
Higher Level Vision and Cognition
Motion of specularities on low-relief surfaces: frequency domain analysis
Yousef Farasat, Michael S. Langer
Typical studies of the visual motion of specularities have been concerned with how to discriminate the motion of specularities from the motion of surface markings, and how to estimate the underlying surface shape. Here we take a different approach and ask whether a field of specularities gives rise to motion parallax similar to that of the underlying surface. The idea is that the caustics defined by specularities exist both in front of and behind the underlying surface and hence define a range of depths relative to the observer. We ask whether this range of depths leads to motion parallax. Our experiments are based on image sequences generated using computer graphics and Phong shading. Using low-relief undulating surfaces and assuming a laterally moving observer, we compare the specular and diffuse components of the resulting image sequences; in particular, we compare their power spectra. We find that as long as the undulations are sufficiently large, the ranges of speeds indicated in the power spectra of the diffuse and specular components are similar. This suggests that specularities can provide reliable motion parallax information to a moving observer.
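The power-spectrum comparison can be pictured as follows: a pattern translating at image speed v concentrates its energy near the plane f_t + v·f_x = 0 of the spatiotemporal spectrum, so the spread of energy over temporal frequency reflects the range of image speeds. A minimal sketch, with the windowing choice as an assumption:

```python
import numpy as np

def power_spectrum(seq):
    """3D power spectrum of an image sequence shaped (time, y, x)."""
    # A temporal Hann window reduces leakage from the finite sequence.
    windowed = seq * np.hanning(seq.shape[0])[:, None, None]
    return np.fft.fftshift(np.abs(np.fft.fftn(windowed)) ** 2)

# Comparing where (f_x, f_t) energy concentrates in
# power_spectrum(diffuse_seq) versus power_spectrum(specular_seq)
# indicates whether the two components carry similar speed ranges.
```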
Fully automatic perceptual modeling of near regular textures
G. Menegaz, A. Franceschetti, A. Mecocci
Near regular textures feature a relatively high degree of regularity. They can be conveniently modeled by combining a suitable set of textons with a placement rule. The main issues in this respect are the selection of the minimum set of textons capturing the variability of the basic patterns; the identification and positioning of the generating lattice; and the modeling of the variability in both the texton structure and the deviation from periodicity of the lattice, which captures the naturalness of the considered texture. In this contribution, we provide a fully automatic solution to both the analysis and the synthesis issues, leading to the generation of texture samples that are perceptually indistinguishable from the original ones. The definition of an ad hoc periodicity index makes it possible to predict the suitability of the model for a given texture. The model is validated through psychovisual experiments providing the conditions for subjective equivalence between the original and synthetic textures, while determining the minimum number of textons needed to meet such a requirement for a given texture class. This is of prime importance in model-based coding applications, such as the one we foresee, as it minimizes the amount of information to be transmitted to the receiver.
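As an illustration of the synthesis side, the sketch below places a single texton on a generating lattice with Gaussian positional jitter, the most basic placement rule. It is a minimal sketch: the paper's full model also varies the texton structure and fits the lattice automatically, both omitted here.

```python
import numpy as np

def synthesize(texton, v1, v2, shape, jitter=1.0, seed=0):
    """Tile `texton` over the lattice spanned by vectors v1 and v2
    (in (row, col) form), with small positional jitter mimicking the
    natural deviation from strict periodicity."""
    rng = np.random.default_rng(seed)
    canvas = np.zeros(shape)
    th, tw = texton.shape
    for i in range(-30, 31):             # index range assumed to cover the canvas
        for j in range(-30, 31):
            y = i * v1[0] + j * v2[0] + rng.normal(0.0, jitter)
            x = i * v1[1] + j * v2[1] + rng.normal(0.0, jitter)
            y, x = int(round(y)), int(round(x))
            if 0 <= y <= shape[0] - th and 0 <= x <= shape[1] - tw:
                canvas[y:y + th, x:x + tw] = texton
    return canvas
```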
Machine perception using the five human defined forms plus infrared
Paul DeRego, Steven Cave
A machine vision system needing to remain vigilant within its environment must be able to quickly perceive both clearly identifiable objects and those that are deceptive or camouflaged (attempting to blend into the background). Humans accomplish this task early in the visual pathways, using five spatially defined forms of processing. These forms are Luminance-defined, Color-defined, Texture-defined, Motion-defined, and Disparity-defined. This paper discusses a visual sensor approach that combines a biological system's strategy for breaking down camouflage with simple image processing algorithms that may be implemented for real-time video. Thermal imaging is added to increase sensing capability. Preliminary filters using MATLAB and operating on digital still images show somewhat encouraging results. Current efforts include implementing the sensor for real-time video processing.
The normalized compression distance and image distinguishability
We use an information-theoretic distortion measure called the Normalized Compression Distance (NCD), first proposed by M. Li et al., to determine whether two rectangular gray-scale images are visually distinguishable to a human observer. Image distinguishability is a fundamental constraint on operations carried out by all players in an image watermarking system. The NCD between two binary strings is defined in terms of the compressed sizes of the two strings and of their concatenation; it is designed to be an effective approximation of the noncomputable but universal Kolmogorov distance between two strings. We compare the effectiveness of different types of compression algorithms in predicting image distinguishability when they are used to compute the NCD between a sample of images and their watermarked counterparts. Our experiment shows that, as predicted by Li's theory, the NCD is largely independent of the underlying compression algorithm. However, in some cases the NCD fails as a predictor of image distinguishability, since it is designed to measure the more general notion of similarity. We propose and study a modified version of the NCD to model distinguishability directly, which requires not only that the change be small but also that it be, in some sense, random with respect to the original image.
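The standard NCD is computed from compressed sizes alone, so any off-the-shelf compressor can stand in for C. A minimal sketch with zlib follows; the compressor choice and the use of raw image bytes are assumptions here, since the paper compares several algorithms.

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance:
    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(s) is the compressed size of s."""
    cx = len(zlib.compress(x, 9))
    cy = len(zlib.compress(y, 9))
    cxy = len(zlib.compress(x + y, 9))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Usage: compare an image's raw bytes with its watermarked counterpart;
# values near 0 mean the strings are nearly identical in the
# compressor's eyes, larger values mean greater dissimilarity.
```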
Instantaneous stimulus paradigm: cortical network and dynamics of figure-ground organization
To reveal the cortical network underlying figure/ground perception and to understand its neural dynamics, we developed a novel paradigm that creates distinct and prolonged percepts of spatial structures by instantaneous refreshes in random dot fields. Three different forms of spatial configuration were generated by: (i) updating the whole stimulus field, (ii) updating the ground region only (negative-figure), and (iii) updating the figure and ground regions in brief temporal asynchrony. fMRI responses were measured throughout the brain. As expected, activation by the homogeneous whole-field update was focused on the posterior part of the brain, but distinct networks extending beyond the occipital lobe into the parietal and frontal cortex were activated by the figure/ground and by the negative-figure configurations. The instantaneous stimulus paradigm generated a wide variety of BOLD waveforms and corresponding neural response estimates throughout the network. Such expressly different responses evoked by differential stimulation of identical cortical regions ensure that the differences can be securely attributed to neural dynamics, not to spatial variations in the HRF. The activation pattern for figure/ground implies a widely distributed neural architecture, distinct from the control conditions. Even where activations partially overlap, an integrated analysis of the BOLD response properties will enable the functional specificity of the cortical areas to be distinguished.
Comparing realness between real objects and images at various resolutions
Kenichiro Masaoka, Masaki Emoto, Masayuki Sugawara, et al.
Image resolution is one of the important factors in visual realness. We performed subjective assessments to examine the realness of images at six different resolutions, ranging from 19.5 cpd (cycles per degree) to 156 cpd. A paired-comparison procedure was used to quantify the realness of the six images versus each other or versus the real object. Three objects were used. Both the real objects and the images were viewed through a synopter, which removed horizontal disparity and presented the same image to both eyes. For each pair of stimuli selected from the group of six images and the real object, sixty-five observers were asked to choose the one that was closer to the real object and appeared to be naturally present. The observers were not told that real objects were included among the stimuli. The paired-comparison data were analyzed using the Bradley-Terry model. The results indicated that the realness of an image increased as the image resolution increased up to about 40-50 cpd, which corresponded to the discrimination threshold calculated from the observers' visual acuity, and reached a plateau above this threshold.
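Bradley-Terry strengths can be recovered from paired-comparison counts with the classic fixed-point iteration; a minimal sketch follows, assuming every stimulus wins at least one comparison.

```python
import numpy as np

def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i, j] = number of times stimulus i was preferred over j.
    Model: P(i beats j) = p[i] / (p[i] + p[j]).
    """
    games = wins + wins.T                # total comparisons per pair
    w = wins.sum(axis=1)                 # total wins per stimulus
    p = np.ones(wins.shape[0])
    for _ in range(iters):               # Zermelo/Ford MM update
        denom = games / (p[:, None] + p[None, :])
        np.fill_diagonal(denom, 0.0)
        p = w / denom.sum(axis=1)
        p /= p.sum()                     # fix the arbitrary scale
    return p
```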
Navigation based on a sensorimotor representation: a virtual reality study
Christoph Zetzsche, Christopher Galbraith, Johannes Wolter, et al.
We investigate the hypothesis that the basic representation of space which underlies human navigation does not resemble an image-like map and is not restricted by the laws of Euclidean geometry. For this we developed a new experimental technique in which we use the properties of a virtual environment (VE) to directly influence the development of the representation. We compared the navigation performance of human observers under two conditions. Either the VE is consistent with the geometrical properties of physical space and could hence be represented in a map-like fashion, or it contains severe violations of Euclidean metric and planar topology, and would thus pose difficulties for the correct development of such a representation. Performance is not influenced by this difference, suggesting that a map-like representation is not the major basis of human navigation. Rather, the results are consistent with a representation which is similar to a non-planar graph augmented with path length information, or with a sensorimotor representation which combines sensory properties and motor actions. The latter may be seen as part of a revised view of perceptual processes due to recent results in psychology and neurobiology, which indicate that the traditional strict separation of sensory and motor systems is no longer tenable.
Poster Session
Making flat art for both eyes
By adding an additional dimension to the traditional two-dimensional art we make, we are able to expand our visual experience, what we see, and thus what we might become. This visual expansion changes or adds to the patterns that produce our thoughts and behavior. As 2D artists see and create in a more three-dimensional space, their work may generate within the viewer a deeper understanding of the thought processes in themselves and others. This can be achieved by creating images in three dimensions. The work then aligns more closely with natural physiology; that is, it is seen with both eyes. Traditionally, color and the rules of perspective trick the viewer into thinking in three dimensions. By adding the stereoscopic element, an object is experienced in a naturally 3D space with the use of two eyes. Further visual expansion is achieved with the use of ChromaDepth glasses to actually see the work in 3D as it is being created. This cannot be done with other 3D methods, which require two images or special programming to work. Hence, the spontaneous creation of an image within a 3D space becomes a new reality for the artist. By working in a truly three-dimensional space that depends on two eyes to experience, an artist gains a new perspective on color, transparency, overlapping, focus, etc., that allows him or her new ways of working and thus seeing: a new form of expression.
A novel Bayer-like WRGB color filter array for CMOS image sensors
Hiroto Honda, Yoshinori Iida, Go Itoh, et al.
We have developed a CMOS image sensor with a novel color filter array (CFA) in which one of the green pixels of the Bayer pattern is replaced with a white pixel. A transparent layer was fabricated on the white pixel instead of a color filter, realizing over 95% transmission for visible light at wavelengths of 400-700 nm. The pixel pitch of the device was 3.3 µm and the number of pixels was 2 million (1600H x 1200V). The novel Bayer-like WRGB (White-Red-Green-Blue) CFA achieved higher signal-to-noise ratios for the interpolated R, G, and B values under low illumination (3 lux) by 6 dB, 1 dB, and 6 dB, respectively, compared with those of the Bayer pattern, using low-noise pre-digital signal processing. Furthermore, there was no degradation of either resolution or color representation in the interpolated image. This new CFA has great potential to significantly increase the sensitivity of CMOS/CCD image sensors combined with digital signal processing technology.
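The pattern can be pictured as a 2x2 tile in which one green site of the RGGB Bayer tile becomes panchromatic. The sketch below simulates sampling an RGB image through such a CFA; approximating the white response as the mean of R, G, and B is an assumption (the real pixel integrates the full 400-700 nm band).

```python
import numpy as np

def wrgb_mosaic(rgb):
    """Sample an RGB image (H, W, 3), H and W even, through a
    hypothetical WRGB CFA laid out as   W R
                                        B G   per 2x2 tile."""
    h, w, _ = rgb.shape
    raw = np.zeros((h, w), dtype=float)
    raw[0::2, 0::2] = rgb[0::2, 0::2].mean(axis=-1)  # W: unfiltered ~ luminance
    raw[0::2, 1::2] = rgb[0::2, 1::2, 0]             # R
    raw[1::2, 0::2] = rgb[1::2, 0::2, 2]             # B
    raw[1::2, 1::2] = rgb[1::2, 1::2, 1]             # G
    return raw
```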
Improving video captioning for deaf and hearing-impaired people based on eye movement and attention overload
C. Chapdelaine, V. Gouaillier, M. Beaulieu, et al.
Deaf and hearing-impaired people capture information in video through visual content and captions. These activities require different visual attention strategies, and up to now little is known about how caption readers balance the two visual attention demands. Understanding these strategies could suggest more efficient ways of producing captions. Eye tracking and attention-overload detection are used to study these strategies. Eye tracking is monitored using a pupil-center corneal-reflection apparatus. Afterward, gaze fixations are analyzed for each region of interest, such as the caption area, high-motion areas, and face locations. These data are also used to identify the scanpaths. The collected data are used to establish specifications for a caption adaptation approach based on the location of visual action and the presence of character faces. This approach is implemented in computer-assisted captioning software which uses a face detector and a motion detection algorithm based on the Lucas-Kanade optical flow algorithm. The different scanpaths obtained among the subjects provide us with alternatives for conflicting caption positioning. This implementation is now undergoing a user evaluation with hearing-impaired participants to validate the efficiency of our approach.
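The authors' software is not described in detail here, but pyramidal Lucas-Kanade flow on a regular point grid is one standard way to flag high-motion regions with OpenCV; the grid spacing is an assumption in this sketch.

```python
import cv2
import numpy as np

def motion_magnitudes(prev_gray, next_gray, grid=32):
    """Per-point motion between two 8-bit grayscale frames, using
    pyramidal Lucas-Kanade optical flow on a regular grid."""
    h, w = prev_gray.shape
    ys, xs = np.mgrid[grid // 2:h:grid, grid // 2:w:grid]
    pts = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)
    pts = pts.reshape(-1, 1, 2)
    nxt, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
    mag = np.linalg.norm((nxt - pts).reshape(-1, 2), axis=1)
    mag[status.ravel() == 0] = 0.0       # discard points that were not tracked
    return pts.reshape(-1, 2), mag       # grid locations and motion magnitudes
```

A face detector such as OpenCV's cascade classifiers could supply the face regions mentioned above, with caption positions then chosen to avoid both high-motion points and detected faces.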
Comparison of methods for the simplification of mesh models using quality indices and an observer study
Samuel Silva, Joaquim Madeira, Carlos Ferreira, et al.
The complexity of a polygonal mesh model is usually reduced by applying a simplification method, resulting in a similar mesh having fewer vertices and faces. Although several such methods have been developed, only a few observer studies have been reported comparing them with regard to the perceived quality of the resulting simplified meshes, and it is not yet clear how the choice of a given method, and the level of simplification achieved, influence the quality of the resulting model as perceived by the final users. Mesh quality indices are the obvious, less costly alternative to user studies, but it is also not clear how they relate to perceived quality, and which indices best describe users' behavior. Following on earlier work carried out by the authors, but only for mesh models of the lungs, a comparison among the results of three simplification methods was performed through (1) quality indices and (2) a controlled experiment involving 65 observers, for a set of five reference mesh models of different kinds. These were simplified using two methods provided by the OpenMesh library - one using error quadrics, the other additionally using a normal flipping criterion - and also by the widely used QSlim method, for two simplification levels: 50% and 20% of the original number of faces. The main goal was to ascertain whether the findings previously obtained for lung models, through quality indices and a study with 32 observers, could be generalized to other types of models and confirmed for a larger number of observers. Data obtained using the quality indices and the results of the controlled experiment were compared and do confirm that some quality indices (e.g., geometric distance and normal deviation, as well as a new proposed weighted index) can be used, in specific circumstances, as reasonable estimators of the user-perceived quality of mesh models.
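For orientation, quadric-based decimation to the study's two levels can be reproduced with a library such as Open3D; note this is Open3D's implementation, not the OpenMesh or QSlim variants compared in the paper, and the file name is a placeholder.

```python
import open3d as o3d

mesh = o3d.io.read_triangle_mesh("model.ply")   # placeholder path
n_faces = len(mesh.triangles)
for ratio in (0.5, 0.2):                        # the study's two levels
    simplified = mesh.simplify_quadric_decimation(
        target_number_of_triangles=int(ratio * n_faces))
    o3d.io.write_triangle_mesh(f"model_{int(ratio * 100)}pct.ply", simplified)
```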
Influence of motion on contrast perception: supra-threshold spatio-velocity measurements
Sylvain Tourancheau, Patrick Le Callet, Dominique Barba
In this paper, a supra-threshold spatio-velocity CSF experiment is described. It consists of a contrast-matching task using a method-of-limits procedure. The results enable the determination of contrast perception functions, which give, for given spatial and temporal frequencies, the perceived contrast of a moving stimulus. These contrast perception functions are then used to construct the supra-threshold spatio-velocity CSF. As for the supra-threshold CSF in the spatial domain, the CSF shape changes from band-pass behaviour at threshold to low-pass behaviour at supra-threshold levels along spatial frequency. However, the supra-threshold CSF retains a band-pass behaviour along temporal frequency, as the threshold CSF does. This means that while spatial variations can be neglected above the visibility threshold, temporal variations remain of primary importance.
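A method-of-limits run adjusts the test contrast in fixed steps until the observer's report flips, and ascending and descending runs are then averaged. A minimal, simulation-friendly sketch, with the response callback as an assumption:

```python
def method_of_limits(respond, start, step, max_steps=100):
    """One run of the method of limits.

    `respond(level)` returns True once the observer reports that the
    test stimulus matches the reference; the function returns the
    contrast level at which the report flips.
    """
    level = start
    for _ in range(max_steps):
        if respond(level):
            return level
        level += step                    # step > 0: ascending, < 0: descending
    return level

# A matching point is typically estimated by averaging runs, e.g.
# 0.5 * (method_of_limits(r, 0.0, +0.02) + method_of_limits(r, 1.0, -0.02)).
```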
Third- and first-party ground truth collection for auto key frame extraction from consumer video clips
Extracting key frames (KF) from video is of great interest in many applications, such as video summary, video organization, video compression, and prints from video. KF extraction is not a new problem; however, the current literature has focused mainly on sports or news video. In the consumer video space, the biggest challenges for key frame selection are the unconstrained content and the lack of any pre-imposed structure. In this study, we conduct ground-truth collection of key frames from video clips taken by digital cameras (as opposed to camcorders), using both first- and third-party judges. The goals of this study are: (1) to create a reference database of video clips reasonably representative of the consumer video space; (2) to identify associated key frames by which automated algorithms can be compared and judged for effectiveness; and (3) to uncover the criteria used by both first- and third-party human judges so these criteria can influence algorithm design. The findings from these ground truths will be discussed.
Quantifying the use of structure in cognitive tasks
David M. Rouse, Sheila S. Hemami
Modern algorithms that process images to be viewed by humans analyze the images strictly as signals, where processing is typically limited to the pixel and frequency domains. The continuum of visual processing by the human visual system (HVS), from signal analysis to cognition, indicates that the signal-processing-based model of the HVS could be extended to include some higher-level, structural processing. An experiment was conducted to study the relative importance of higher-level, structural representations and lower-level, signal-based representations of natural images in a cognitive task. Structural representations preserve the overall image organization necessary to recognize the image content and discard the finer details of objects, such as textures. Signal-based representations (i.e., digital photographs) decompose an image in terms of its frequency, orientation, and contrast. Participants viewed sequences of images from either structural or signal-based representations, where subsequent images in the sequence reveal additional detail or visual information from the source image. When the content was recognizable, participants were instructed to provide a description of that image in the sequence. The descriptions were subjectively evaluated to identify a participant's recognition threshold for a particular image representation. The results from this experiment suggest that signal-based representations possess meaning to human observers when the proportion of high-frequency content, which conveys shape information, exceeds a seemingly fixed proportion. Additional comparisons among the representations chosen for this experiment provide insight toward quantifying their significance in cognition and developing a rudimentary measure of visual entropy.
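One plausible way to build a signal-based reveal sequence of the kind described is a coarse-to-fine blur series, in which each step admits higher spatial frequencies; the sigma schedule below is an assumption, not the paper's stimuli.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def reveal_sequence(image, sigmas=(16, 8, 4, 2, 1, 0)):
    """Coarse-to-fine sequence: smaller sigma admits more of the
    high-frequency (shape-bearing) content; sigma 0 is the source."""
    image = image.astype(float)
    return [gaussian_filter(image, s) if s > 0 else image.copy()
            for s in sigmas]
```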
Quality metric for H.264/AVC scalable video coding with full scalability
Cheon Seog Kim, Dongjun Suh, Tae Meon Bae, et al.
Scalable Video Coding (SVC) is one of the promising techniques for ensuring Quality of Service (QoS) in multimedia communication over heterogeneous networks. SVC compresses a raw video into multiple bitstreams, composed of a base bitstream and enhancement bitstreams, to support multiple scalabilities: SNR, temporal, and spatial. It is therefore able to extract an appropriate bitstream from the original coded bitstream without re-encoding, adapting the video to the user's environment. In this flexible environment, QoS has emerged as an important issue for service acceptability, creating a need to measure video quality in order to guarantee the quality of video streaming services. Existing studies on video quality metrics have mainly focused on temporal and SNR scalability. In this paper, we propose an efficient quality metric which accounts for spatial scalability as well as temporal and SNR scalability. To this end, we study the effects of frame rate, SNR, spatial scalability, and motion characteristics using subjective quality assessment, and then propose a new video quality metric supporting full scalability. Experimental results show that this quality metric has a high correlation with subjective quality. Because the proposed metric can measure video quality as scalability varies, it can play an important role at the extraction point in determining the quality of SVC.
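Schematically, a full-scalability metric must attenuate a base fidelity term by temporal and spatial correction factors. The form and constants below are purely illustrative placeholders, not the paper's fitted metric.

```python
def svc_quality(psnr, frame_rate, spatial_ratio,
                full_rate=30.0, alpha=0.1, beta=0.5):
    """Hypothetical full-scalability quality score in [0, ~1].

    psnr          -- SNR-scalability fidelity of the decoded layer (dB)
    frame_rate    -- decoded frame rate (temporal scalability)
    spatial_ratio -- decoded resolution / full resolution, in (0, 1]
    alpha, beta   -- illustrative exponents, not fitted values
    """
    q_snr = min(psnr / 50.0, 1.0)             # crude normalized SNR term
    q_temporal = (frame_rate / full_rate) ** alpha
    q_spatial = spatial_ratio ** beta
    return q_snr * q_temporal * q_spatial
```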
Temporal relation between bottom-up versus top-down strategies for gaze prediction
Sreekar Krishna, John A. Black Jr., Stuart Braiman, et al.
Much research has focused on the study of bottom-up, feature-based visual perception as a means to generate salience maps and predict the distribution of fixations within images. However, it is plausible that the eventual perception of distinct objects within a 3D scene (and the subsequent top-down effects) would also have a significant effect on the distribution of fixations within that scene. This research is aimed at testing the hypothesis that there exists a switch from feature-based to object-based scanning of images as the viewer gains a higher-level understanding of the image content, and that this switch can be detected by changes in the pattern of eye fixations within the image. An eye tracker is used to monitor the fixations of human participants over time as they view images, in an effort to answer questions pertaining to (1) the nature of fixations during bottom-up and top-down scene-scanning scenarios, (2) the ability to assess whether the subject is perceiving the scene content based on low-level visual features or on distinct objects, and (3) the identification of the participant's transition from bottom-up, feature-based perception to top-down, object-based perception.
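One observable signature of such a switch is a change in how tightly fixations cluster across participants over time; the sketch below computes a simple inter-subject dispersion per time window, leaving the switch-detection criterion open, since the paper's analysis is not specified here.

```python
import numpy as np

def fixation_dispersion(fixations_by_subject, window):
    """Mean distance of fixations from their common centroid.

    fixations_by_subject: list of (T, 2) arrays of fixation coordinates
    window: a slice selecting the time range of interest
    """
    pts = np.concatenate([f[window] for f in fixations_by_subject])
    centroid = pts.mean(axis=0)
    return np.linalg.norm(pts - centroid, axis=1).mean()

# A drop in dispersion over successive windows would be consistent with
# viewers converging from feature-driven to object-driven scanning.
```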
Eigen local color histograms for object recognition and orientation estimation
D. Muselet, B. Funt, L. Macaire
Color has been shown to be an important cue for object recognition and image indexing. We present a new algorithm for color-based recognition of objects in cluttered scenes that also determines the 2D pose of each object. As in many other color-based object recognition algorithms, color histograms are fundamental to our approach; however, we use histograms obtained from overlapping subwindows rather than from the entire image. An object from a database of prototypes is identified and located in an input image whenever there are many good matches between the respective subwindow histograms of the input image and the prototype image from the database. In essence, local color histograms are the features to be matched. Once an object's position in the image has been determined, its 2D pose is estimated by finding the geometrical transformation that most consistently maps the locations of the prototype's subwindows to their matching locations in the input image.
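The core matching step can be sketched directly: build color histograms over overlapping subwindows and score candidate matches with histogram intersection. Window size, stride, and bin counts are assumptions here.

```python
import numpy as np

def local_histograms(img, win=32, stride=16, bins=8):
    """Normalized RGB histograms of overlapping subwindows.

    img: uint8 array (H, W, 3). Returns (features, window locations).
    """
    h, w, _ = img.shape
    feats, locs = [], []
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            patch = img[y:y + win, x:x + win].reshape(-1, 3)
            hist, _ = np.histogramdd(patch, bins=(bins,) * 3,
                                     range=((0, 256),) * 3)
            feats.append(hist.ravel() / hist.sum())
            locs.append((y, x))
    return np.array(feats), np.array(locs)

def intersection(h1, h2):
    """Histogram intersection similarity in [0, 1] for normalized hists."""
    return np.minimum(h1, h2).sum()
```

Matching every prototype subwindow against input subwindows by intersection, then checking the geometric consistency of the matched locations, mirrors the recognition-plus-pose pipeline described above.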
Keynote Session
Adaptation and perceptual norms
Michael A. Webster, Maiko Yasuda, Sara Haber, et al.
We used adaptation to examine the relationship between perceptual norms (the stimuli observers describe as psychologically neutral) and response norms (the stimulus levels that leave visual sensitivity in a neutral or balanced state). Adapting to stimuli on opposite sides of a neutral point (e.g., redder or greener than white) biases appearance in opposite ways. Thus the adapting stimulus can be titrated to find the unique adapting level that does not bias appearance. We compared these response norms to subjectively defined neutral points both within the same observer (at different retinal eccentricities) and between observers. These comparisons were made for visual judgments of color, image focus, and human faces, stimuli that are very different and may depend on very different levels of processing, yet which share the property that for each there is a well defined and perceptually salient norm. In each case the adaptation aftereffects were consistent with an underlying sensitivity basis for the perceptual norm. Specifically, response norms were similar to and thus covaried with the perceptual norm, and under common adaptation, differences between subjectively defined norms were reduced. These results are consistent with models of norm-based codes and suggest that these codes underlie an important link between visual coding and visual experience.