Proceedings Volume 12226

Applications of Digital Image Processing XLV


Volume Details

Date Published: 19 October 2022
Contents: 10 Sessions, 45 Papers, 37 Presentations
Conference: SPIE Optical Engineering + Applications 2022
Volume Number: 12226

Table of Contents

All links to SPIE Proceedings will open in the SPIE Digital Library.
  • Front Matter: Volume 12226
  • Compression I
  • Compression II
  • Human Visual System and Perception
  • Imaging Systems
  • New Imaging Standards
  • Imaging Applications
  • Image and Video Processing
  • New Imaging Modalities and Applications
  • Poster Session
Front Matter: Volume 12226
Front Matter: Volume 12226
This PDF file contains the front matter associated with SPIE Proceedings Volume 12226, including the Title Page, Copyright information, Table of Contents, and Conference Committee listings.
Compression I
Joint backward and forward temporal masking for perceptually optimized x265-based video coding
Dan Grois, Alex Giladi, Praveen Kumar Karadugattu, et al.
There is a strong and ever-growing demand for higher-resolution video content, such as UltraHD, which requires significantly higher bitrates. Providing such content at scale is a challenge due to limitations of the available last-mile bandwidth and content delivery network (CDN) storage and egress capacity. Lower bitrates are often considered an answer; as a result, high-resolution video content is often compressed with visually perceptible coding artifacts, leading to an inferior user experience. Improved compression efficiency is thus the obvious solution for improving the user experience. However, in order to realize the gain from such efficiency improvements in a large-scale deployment, the improvements need to be applicable to an already deployed ecosystem, such as set-top boxes, mobile devices, and SmartTVs, and to have a reasonably low computational complexity. This work proposes low-complexity joint backward and forward temporal masking for reducing bitrate without perceptibly affecting visual quality. This is achieved by introducing a novel low-complexity scenecut-aware adaptive frame-level quantization framework, which considers the temporal distances between frames and the closest scenecuts. The proposed framework has been implemented in the popular x265 open-source HEVC encoder; that said, the framework is codec-independent and can be applied to other encoders and video coding standards. Different backward and forward masking time periods and quantizer behaviors are investigated to determine the time periods for which temporal masking does not substantially impact video quality as perceived by the human visual system (HVS). Extensive subjective quality assessments have been carried out to evaluate the benefits of the proposed scenecut-aware adaptive quantization framework. The subjective results show significant bitrate savings of up to about 26% while maintaining substantially the same perceived visual quality.
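The core idea, frame-level quantization offsets that are strongest near a scenecut where backward and forward temporal masking hides artifacts, can be illustrated with a minimal sketch. The window lengths, offset magnitude, and linear ramp below are hypothetical placeholders, not the parameters or quantizer behavior used by the authors:

```python
def scenecut_aware_qp_offset(frame_idx, scenecut_indices, fps,
                             backward_ms=150, forward_ms=300, max_offset=4):
    """Return a QP offset for one frame based on its temporal distance
    to the nearest scenecut (illustrative values only)."""
    if not scenecut_indices:
        return 0
    # signed distance in milliseconds to the closest scenecut:
    # negative -> frame precedes the scenecut (backward masking region)
    nearest = min(scenecut_indices, key=lambda s: abs(s - frame_idx))
    dist_ms = (frame_idx - nearest) * 1000.0 / fps
    window = backward_ms if dist_ms < 0 else forward_ms
    if abs(dist_ms) >= window:
        return 0                       # outside the masking window: no change
    # linear ramp: strongest offset right at the scenecut, fading to zero
    return round(max_offset * (1.0 - abs(dist_ms) / window))

# example: a 30 fps sequence with scenecuts at frames 120 and 480
offsets = [scenecut_aware_qp_offset(i, [120, 480], fps=30) for i in range(600)]
```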
Spatial scalability with VVC: coding performance and complexity
Version 1 of the VVC specification was released in July 2020. VVC is the successor of HEVC, with 40 to 50% better compression. It inherently supports additional features such as sub-pictures and multi-layer coding: different profiles are specified, including multilayer profiles for 4:2:0 and 4:4:4 content. These profiles give access to spatial scalability using a single decoder instance. This paper reports evaluation tests of the VVC 4:2:0 multilayer profile in a spatially scalable configuration with two layers and a spatial resolution ratio of 2x both vertically and horizontally. The tests were performed using various bitrate allocations between the base layer and the enhancement layer, and the best tradeoff leading to optimal enhancement-layer compression performance was investigated. Both objective and subjective evaluations have been carried out and show that scalable VVC can be on par with or outperform single-layer coding in specific conditions. The paper also addresses coding complexity and shows that customizing the encoding parameters can lead to large encoding time savings. Coding performance comparisons between dual-layer solutions, here scalable VVC and LCEVC, are also provided.
AV1 benchmarking test for 3GPP
Zhijun Lei, Jun Sik Song, Adrian Grange, et al.
AV1 is the first-generation royalty-free video coding standard developed by the Alliance for Open Media (AOM). Since it was finalized in 2018, there have been many attempts to compare the compression efficiency of AV1 against previous generations of video coding standards, such as HEVC, VP9, etc. Some study results have been published previously, but the results vary significantly depending on the test conditions and encoder configurations used. It is challenging to design a test configuration and align the encoding parameters to perform a benchmarking test between coding standards and their encoder implementations. Recently, 3GPP started an exploration project to evaluate next-generation video coding standards, after AVC and HEVC, for consideration in future 3GPP standards and applications. In this project, a comprehensive set of test conditions was defined to compare the submitted coding specifications, including AV1, VVC, and EVC, versus HEVC and AVC. The evaluation was conducted using reference encoders for each specification. The reference encoders of each proposed coding specification were thoroughly tested using the defined test condition set to provide benchmarking data that could be used in their evaluation and possible inclusion in future 5G streaming services. In this paper, we first discuss the test scenarios and test configurations for the 3GPP benchmarking test. Then, the detailed encoding parameters for the AV1 reference encoder, which comply with this set of test conditions, are introduced and the benchmarking test results are presented. Finally, we introduce some encoding settings that can be applied to the AV1 reference encoder without some HM- and VTM-inspired restrictions defined in the 3GPP test conditions, along with the corresponding compression efficiency improvement that can be achieved in the expected AV1 usage.
Towards effective visual information storage on DNA support
Luka Secilmis, Michela Testolina, Davi Lazzarotto, et al.
DNA is an excellent medium for efficient storage of information. Not only does it offer a long-term and robust mechanism, but it is also environmentally friendly and has an unparalleled storage capacity. However, the basic elements of DNA are quaternary, and therefore there is a need for efficient coding of information in a quaternary representation while taking into account the various biochemical constraints involved. Such constraints create additional complexity in how information should be represented in quaternary code. In this paper, an efficient solution for the storage of JPEG compressed images is proposed. The focus on the JPEG file format is motivated by the fact that it is a popular representation of digital pictures. The proposed approach converts an already coded image in JPEG format to a counterpart in quaternary representation while taking into account the intrinsic structure of the former. The superiority of the proposed approach is demonstrated by comparing its rate-distortion performance to two alternative approaches, namely, a direct transcoding of the binary JPEG compressed file into a quaternary codestream without taking into account its underlying structure, and a complete JPEG decoding followed by an image encoding for DNA storage.
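For illustration only, a naive binary-to-quaternary (nucleotide) mapping and a check of the kind of biochemical constraints mentioned above can be sketched as follows; the run-length and GC-content thresholds are hypothetical and this is not the paper's JPEG-structure-aware DNA codec:

```python
BASES = "ACGT"

def bits_to_dna(bits):
    """Map every 2 bits of a binary string to one nucleotide."""
    assert len(bits) % 2 == 0
    return "".join(BASES[int(bits[i:i + 2], 2)] for i in range(0, len(bits), 2))

def satisfies_constraints(seq, max_run=3, gc_low=0.4, gc_high=0.6):
    """Check homopolymer-run and GC-content constraints on a DNA sequence."""
    run, longest = 1, 1
    for a, b in zip(seq, seq[1:]):
        run = run + 1 if a == b else 1
        longest = max(longest, run)
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    return longest <= max_run and gc_low <= gc <= gc_high

payload = bits_to_dna("0001101101100011")
print(payload, satisfies_constraints(payload))
```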
Direct optimisation of λ for HDR content adaptive transcoding in AV1
Vibhoothi ., François Pitié, Angeliki Katsenou, et al.
Since the adoption of VP9 by Netflix in 2016, royalty-free coding standards have continued to gain prominence through the activities of the AOMedia consortium. AV1, the latest open-source standard, is now widely supported. In the early years after standardisation, HDR video tended to be underserved in open-source encoders for a variety of reasons, including the relatively small amount of true HDR content being broadcast and the challenges of RD optimisation with that material. AV1 codec optimisation has been ongoing since 2020, including consideration of the computational load. In this paper, we explore the idea of directly optimising the Lagrangian λ parameter used in the rate control of the encoders to estimate the optimal rate-distortion trade-off achievable for a video clip signalled as High Dynamic Range. We show that by adjusting the Lagrange multiplier in the RD optimisation process on a frame-hierarchy basis, we are able to increase the Bjontegaard difference rate gains by more than 3.98× on average without visually affecting the quality.
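The underlying principle is the Lagrangian rate-distortion cost J = D + λR that the encoder minimises; the sketch below illustrates scaling λ per frame-hierarchy level. The scaling factor, base λ, and candidate values are illustrative assumptions, not those found in the paper:

```python
# Sketch of the Lagrangian rate-distortion trade-off: a coding choice
# minimises J = D + lambda * R. Per-level scaling is hypothetical.
def rd_cost(distortion, rate_bits, lam):
    return distortion + lam * rate_bits

def adjusted_lambda(base_lambda, hierarchy_level, scale_per_level=1.2):
    """Scale lambda with the frame's position in the temporal hierarchy."""
    return base_lambda * (scale_per_level ** hierarchy_level)

# choose the candidate coding mode with the lowest RD cost
candidates = [  # (distortion, rate in bits) for hypothetical coding modes
    (120.0, 900), (95.0, 1400), (80.0, 2100),
]
lam = adjusted_lambda(base_lambda=0.8, hierarchy_level=2)
best = min(candidates, key=lambda c: rd_cost(c[0], c[1], lam))
print(lam, best)
```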
Towards efficient multi-codec streaming
Yuriy Reznik, Karl Lillevold, Abhijith Jagannath, et al.
One of the biggest challenges in modern-era streaming is the fragmentation of codec support across receiving devices. For example, modern Apple devices can decode and seamlessly switch between H.264/AVC and HEVC streams. Most new TVs and set-top boxes can also decode HEVC, but they cannot switch between HEVC and H.264/AVC streams. And there are still plenty of older devices/streaming clients that can only receive and decode H.264/AVC streams. With the arrival of next-generation codecs, such as AV1 and VVC, the fragmentation of codec support across devices becomes even more complex. This situation raises a question: how can we serve such a population of devices most efficiently, using the codecs that deliver the best performance in each case, while producing the minimum possible number of streams such that the overall cost of media delivery is minimal? In this paper, we explain how this problem can be formalized and solved at the stage of dynamic generation of encoding profiles for ABR streaming. The proposed solution is a generalization of the context-aware encoding (CAE) class of techniques, considering multiple sets of renditions generated using each codec and the codec usage distributions of the population of receiving devices. We also discuss several streaming system-level tools needed to make the proposed solution practically deployable.
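A toy version of such a formalization, choosing which codec ladders to produce so that expected delivery plus storage cost is minimal given the device population, might look like the sketch below. The device shares, relative costs, and per-ladder penalty are invented numbers, not the paper's model:

```python
from itertools import combinations

# Hypothetical model: codecs ordered from oldest to newest; a device that can
# decode codec i can also decode all older ones. "share" is the fraction of
# devices whose most advanced supported codec is that one; "rel_cost" is the
# relative delivery cost of serving equal quality with that codec.
CODECS   = ["h264", "hevc", "av1"]          # oldest -> newest
share    = {"h264": 0.35, "hevc": 0.30, "av1": 0.35}
rel_cost = {"h264": 1.00, "hevc": 0.70, "av1": 0.55}
LADDER_COST = 0.05                          # storage/packaging cost per ladder

def expected_cost(offered):
    """Expected delivery + storage cost when 'offered' ladders are produced."""
    cost = LADDER_COST * len(offered)
    for dev in CODECS:                      # each device class...
        usable = [c for c in offered if CODECS.index(c) <= CODECS.index(dev)]
        cost += share[dev] * rel_cost[min(usable, key=lambda c: rel_cost[c])]
    return cost

# brute-force search over codec subsets that can still serve every device
best = min((s for r in range(1, len(CODECS) + 1) for s in combinations(CODECS, r)
            if "h264" in s),                # h264 needed to reach all devices
           key=expected_cost)
print(best, round(expected_cost(best), 3))
```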
Compression II
Improving reference picture resampling (RPR) for future video coding
Philippe Bordes, Hassane Guermoud, Franck Galpin, et al.
Considering industry needs for further coding efficiency improvements, the Joint Video Experts Team (JVET), established by ITU-T and MPEG for standardizing VVC, has developed a new Enhanced Compression Model (ECM) based on VVC, which serves as a common platform for testing future video coding algorithms. Versatile Video Coding (VVC) supports Reference Picture Resampling (RPR) to change frame resolution without inserting an Instantaneous Decoder Refresh (IDR) or Intra Random Access Picture (IRAP). This feature is particularly well adapted to video streaming and low-delay scenarios since it allows seamless frame-based bit-rate adaptation, whereas traditional techniques based on stream switching between coded video chunks at fixed resolution can generate bitrate leaps. ECM implements several new tools that improve coding efficiency compared to VVC, but some of them were not designed to support RPR. In this paper, we first discuss some necessary adaptations to implement RPR in ECM for these new coding tools. At low bit rate, RPR may improve the coding performance of ECM for the luma component, and the coding complexity is reduced. However, RPR may show a PSNR drop for the chroma components because it performs an additional down-scaling filtering on samples that were already filtered from the original canonical 4:4:4 content to create the 4:2:0 format. In a second part, some modifications of RPR to re-scale luma and chroma differently are proposed. It is shown that they improve ECM efficiency in the context of both super-resolution and low-delay coding use cases.
A study on flexible block partitioning for future video coding standards
Fabrice Urban, Karam Naser, Franck Galpin, et al.
For each generation of video coding standard, increasing the flexibility of block partitioning has proven very effective. From fixed-size square blocks to rectangular blocks, the new block shapes have significantly improved video compression efficiency. During the development of Versatile Video Coding (VVC), the latest video standard developed by the Moving Picture Experts Group (MPEG), it was shown that flexible partitioning, adding the Binary Tree (BT) and Ternary Tree (TT) splits on top of the Quadtree (QT) splits, was the tool bringing the highest compression gains over the previous standard, High Efficiency Video Coding (HEVC). BT and TT splits brought a lot of flexibility to the block partitioning mechanism. During the VVC development, more flexible partitioning tools were also studied, namely the Asymmetric Binary Tree (ABT). In this paper, we show how increasing partition flexibility on top of the newest video codec still brings additional gains.
Green image codec: a lightweight learning-based image coding method
Image compression has experienced a new revolution with the success of deep learning, which yields superior rate-distortion (RD) performance against traditional codecs. Yet, high computational complexity and energy consumption are the main bottlenecks that hinder the practical applicability of deep-learning-based (DL-based) codecs. Inspired by the neural network's hierarchic structure yet with lower complexity, we propose a new lightweight image coding framework and name it the "Green Image Codec" (GIC) in this work. First, GIC down-samples an input image into several spatial resolutions from fine-to-coarse grids and computes image residuals between two adjacent grids. Then, it encodes the coarsest content, interpolates content from coarse-to-fine grids, encodes residuals, and adds residuals to interpolated images for reconstruction. All coding steps are implemented by vector quantization while all interpolation steps are conducted by Lanczos interpolation. To facilitate VQ codebook training, the Saab transform is applied for energy compaction and, thus, dimension reduction. A simple rate-distortion optimization (RDO) is developed to help select the coding parameters. GIC yields an RD performance that is comparable with BPG at significantly lower complexity.
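The coarse-to-fine residual pyramid described above can be sketched structurally as follows. The real GIC uses Lanczos interpolation, the Saab transform, and vector quantization; here plain 2x averaging, nearest-neighbour upsampling, and a uniform quantizer stand in, purely to show the data flow:

```python
import numpy as np

def down2(img):                       # 2x2 block averaging
    return img.reshape(img.shape[0] // 2, 2, img.shape[1] // 2, 2).mean(axis=(1, 3))

def up2(img):                         # nearest-neighbour upsampling
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def quantize(x, step=8.0):            # stand-in for VQ coding of a band
    return np.round(x / step) * step

def gic_like_codec(img, levels=3):
    pyramid = [img]
    for _ in range(levels):           # fine -> coarse
        pyramid.append(down2(pyramid[-1]))
    recon = quantize(pyramid[-1])     # code the coarsest grid
    for fine in reversed(pyramid[:-1]):          # coarse -> fine
        pred = up2(recon)                        # interpolate previous level
        recon = pred + quantize(fine - pred)     # code and add the residual
    return recon

img = np.random.rand(64, 64) * 255
out = gic_like_codec(img)
print(float(np.mean((img - out) ** 2)))          # reconstruction MSE
```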
Advanced video quality assessment of leading codecs: AV1, VVC, and their draft successor designs
P. Topiwala, W. Dai
AV1 and VVC are today's top-performing video codecs, with VVC holding an edge (in our previous estimates, about 10-12%). Moreover, both organizations are working on advancements of their respective codecs. Current work within the ISO/IEC/ITU committees shows updates with improvements of about 15% over VVC on common test data, while the Alliance for Open Media, the source of AV1, is already working on AV2 (called AVM), with an as yet unknown performance gain (but targeting about 20%) over AV1. We compare the performance of these codecs objectively on well-known 1080p test data, restricting ourselves to Standard Dynamic Range in this study (we treat HDR in an accompanying paper), not only using PSNR but also more advanced video quality assessment tools such as SSIM, VMAF, and FastVDO's quality measure, FVQ. We highlight the similarities and differences between the well-worn PSNR and the more advanced approaches to develop a nominal performance comparison among these leading video codecs. In this context, we use full-reference video quality assessment. We observe that the more advanced metrics may be picking up on differences in quality that PSNR misses.
Human Visual System and Perception
Towards JPEG AIC part 3: visual quality assessment of high to visually lossless image coding
Michela Testolina, Evgeniy Upenik, Jon Sneyer, et al.
Due to the increasing number of pictures captured and stored every day by and on digital devices, lossy image compression has become inevitable to limit the needed storage. As a consequence, these compression methods might introduce some visual artifacts, whose visibility depends on the chosen bitrate. Modern applications target images with high to near-visually lossless quality, in order to maximize the visual quality while still reducing storage space consumption. In this context, subjective and objective image quality assessment are essential tools for developing compression methods able to generate images with high visual quality. While a large variety of subjective quality assessment protocols have been standardized in the past, they have been found to be imprecise in the quality interval from high to near-visually lossless. Similarly, an objective quality metric designed to work specifically in the mentioned range has not been designed yet. As current quality assessment methodologies have proven to be unreliable, a renewed activity on the Assessment of Image Coding, also referred to as JPEG AIC, was recently launched by the JPEG Committee. The goal of this activity is to extend previous standardization efforts, i.e. AIC Part 1 and AIC Part 2 (also known as AIC-1 and AIC-2), by developing a new standard, known as AIC Part 3 (or AIC-3). Notably, the goal of the activity is to standardize both subjective and objective visual quality assessment methods, specifically targeting images with quality in the range from high to near-visually lossless. Two Draft Calls for Contributions on Subjective Image Quality Assessment were released, aiming at collecting contributions on new methods and best practices for subjective image quality assessment in the target quality range, while a Call for Proposals on Objective Image Quality Assessment is expected to be released at a later date. This paper summarizes past JPEG AIC efforts and reviews the main objectives of the future activities, outlining the scope of the activity, the main use cases and requirements, and the calls for contributions. Finally, conclusions on the activity are drawn.
Detection of facial emotions using neuromorphic computation
Modern computing systems are good at tasks that are difficult for humans, even with high-performance computing. Automation and artificial intelligence are combined disciplines that emulate humans, generating structured data that is transferable in real time. This combination of systems is also compatible with new neuromorphic processor architectures. Neuromorphic computing completely redesigns the architecture of the conventional computer model, both in hardware and in software. The procedure carried out was the collection of faces showing emotion; this information was used to generate a real-life database, in which the classification of faces and emotions was carried out using the theory of Robert Plutchik together with a spiking artificial neural network. When the previous concepts and tools are combined, signals are obtained that behave as a human would rather than as a conventional computer. The classification of faces and emotions was performed with a neural network whose performance is affected only when the contrast falls below 3%; it was also found that some images could be reduced to 0.005% contrast. Regarding noise resistance, the images withstood 50% noise without the performance of the network being affected. Although neuromorphic computing is intended to simulate the human brain using artificial biological neural networks, it is also used for the classification of objects and pattern recognition with the proposed image processing techniques, obtaining acceptable performance.
Optimal rendition resolution selection algorithm for web streaming players
In web streaming, the size of the video rendered on screen may be influenced by a number of factors, such as the layout of the web page embedding the video, the position and size of the web browser window, and the resolution of the screen. During playback, adaptive streaming players usually select one of the available encoded streams (renditions) to pull and render on the screen. Such a selection is typically based on the available network bandwidth and on the size of the player window. Typically, the logic of matching the video stream to be played to the size of the window is very simplistic, considering only the pixel dimensions of the video. However, with vastly different video playback devices, their pixel densities, and other parameters influencing the Quality of Experience (QoE), reliance on pixel matching is bound to be suboptimal. A better approach must use a proper QoE model, considering the parameters of the viewing setup on each device, and then predict which encoded resolution, given the player window and other constraints, would achieve the best quality. In this paper, we adopt such a model and develop an optimal rendition selection algorithm based on it. We report results for several different categories of receiving devices (HDTV, PCs, tablets, and mobile) and show that the optimal selections in those cases differ considerably.
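A hedged sketch of such a QoE-driven choice: instead of matching raw pixel counts, pick the smallest rendition whose angular resolution at the viewer's position reaches a target in pixels per degree. The 60 ppd target and the device parameters below are illustrative assumptions, not the calibrated model used in the paper:

```python
import math

def window_fov_deg(window_px_w, display_px_w, display_w_m, viewing_dist_m):
    """Horizontal field of view (degrees) subtended by the player window."""
    window_w_m = display_w_m * window_px_w / display_px_w
    return 2 * math.degrees(math.atan(window_w_m / (2 * viewing_dist_m)))

def select_rendition(rendition_widths, window_px_w, fov_deg, target_ppd=60.0):
    """Smallest rendition that reaches target_ppd; capped by the window size."""
    for w in sorted(rendition_widths):
        effective_px = min(w, window_px_w)      # upscaling adds no detail
        if effective_px / fov_deg >= target_ppd:
            return w
    return max(rendition_widths)

# example: a phone (1080 px wide, 6.5 cm, viewed at 30 cm) vs. an HDTV
ladder = [640, 960, 1280, 1920, 2560]
phone_fov = window_fov_deg(1080, 1080, 0.065, 0.30)
tv_fov    = window_fov_deg(1920, 1920, 1.20, 2.50)
print(select_rendition(ladder, 1080, phone_fov),
      select_rendition(ladder, 1920, tv_fov))
```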
Bootstrapping HDR video quality analysis from SDR via data-adaptive grading
P. Topiwala, W. Dai
High Dynamic Range (HDR) video in Ultra-High Definition (UHD) has been in use for a number of years in streaming services like Netflix, Amazon Prime, and even YouTube, as well as in broadcast and on Blu-ray, and can even be captured by consumers using high-end cameras and drones. These applications typically use HDR video with a wide color gamut (BT.2020 color space) compressed with the H.265/HEVC video codec. While HDR video is thus by now ubiquitous, measuring the quality of HDR video objectively remains challenging. First, the peak signal-to-noise ratio (PSNR), never a high-quality measure, becomes even less reliable in HDR. Moreover, with the wide color gamut typically used, one would expect color space distortion measures to play an essential role; several have been tested in the literature, but none has proven reliable. Finally, while there exist both visual quality databases for HDR images and HDR image quality measures, there is virtually no widely available database for HDR video (we understand that there are some in development), nor a single, widely recognized HDR video quality measure. On the other hand, the field of video quality analysis is by now well established for standard dynamic range (SDR) video, and FastVDO has developed such a measure (FastVDO Quality, FVQ). In this paper, we propose to make progress in HDR video quality analysis (VQA) with a novel bootstrapping method that measures the quality of HDR video by developing a dynamic conversion of HDR to standard dynamic range (SDR) and combining the quality of the SDR video, measured with the previously developed FVQ measure, with a measure of the surplus quality of the HDR over the SDR video. The central conversion function is itself based on an approach previously developed by FastVDO, described in standards documents, and presented at SPIE.
Perceptually motivated deep neural network for video compression artifact removal
Darren Ramsook, Anil Kokaram, Neil Birkbeck, et al.
Recent advances have shown that latent representations of pre-trained Deep Convolutional Neural Networks (DCNNs) for classification can be modelled to generate scores that are well correlated with human perceptual judgement. In this paper we seek to extend the use of perceptually relevant losses in training a DCNN for video compression artefact removal. We use internal representations of a pre-trained classification network as the basis of the loss functions. Specifically, the LPIPS metric and a perceptual discriminator are responsible for low-level and high-level features respectively. The perceptual discriminator uses differing internal feature representations of the VGG network as its first stage of feature extraction. Initial results show an increase in performance on perceptually based metrics such as VMAF, LPIPS, and BRISQUE, while showing a decrease in performance in PSNR.
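A hedged sketch of a perceptually weighted training loss in this spirit, using the publicly available lpips package for the VGG-feature term and a generic non-saturating adversarial term: the weights and the discriminator interface are illustrative assumptions, and the paper's discriminator additionally consumes VGG feature maps rather than raw logits alone:

```python
import torch
import lpips                      # pip install lpips (Zhang et al. metric)

lpips_vgg = lpips.LPIPS(net='vgg')          # expects N x 3 x H x W in [-1, 1]

def restoration_loss(restored, target, disc_logits_on_restored,
                     w_pix=1.0, w_lpips=0.5, w_adv=0.01):
    pixel = torch.nn.functional.l1_loss(restored, target)
    perceptual = lpips_vgg(restored, target).mean()
    # non-saturating GAN loss: generator wants the discriminator to say "real"
    adversarial = torch.nn.functional.binary_cross_entropy_with_logits(
        disc_logits_on_restored, torch.ones_like(disc_logits_on_restored))
    return w_pix * pixel + w_lpips * perceptual + w_adv * adversarial

# usage inside a training step (shapes only; tensors scaled to [-1, 1])
restored = torch.rand(2, 3, 128, 128) * 2 - 1
target   = torch.rand(2, 3, 128, 128) * 2 - 1
fake_logits = torch.randn(2, 1)
print(restoration_loss(restored, target, fake_logits).item())
```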
Multiple image super-resolution from the BGU SWIR CubeSat satellite
The BGU CubeSat satellite is from a class of low-cost, compact satellites. Its dimensions are 10×10×30 cm. It is equipped with a low-resolution 256×320-pixel short-wave infrared (SWIR) camera in the 1.55-1.7 µm wavelength band. Images are transmitted in bursts of tens of images at a time, with shifts of a few pixels from the first image to the last. Each image burst is suitable for Multiple Image Super-Resolution (MISR) enhancement. MISR can construct a high-resolution (HR) image from several low-resolution (LR) images, yielding an image that can resolve more details that are crucial for research in remote sensing. In this research, we verify the applicability of SOTA deep learning MISR models that were developed following the publication of the PROBA-V MISR satellite dataset in the visible red and near-IR bands. Our SWIR multiple images differ from PROBA-V in the spectral band and in the method of collecting multiple images of the same location. Our imagery data is acquired in a burst of very closely spaced temporal images, whereas PROBA-V revisits the same site with a period smaller than 30 days, assuming the soil dryness remains about the same. We compare the results of Single Image Super-Resolution (SISR) and MISR techniques to "off-the-shelf" products. The quality of the super-resolved images is compared by non-reference metrics suitable for remote sensing applications and by experts' visual inspection. Unlike the remarkable achievements of GAN techniques, which can produce very appealing results that are not always faithful to the original ground truth, the super-resolved images here should preserve the original details as much as possible for further scientific remote sensing analysis.
Imaging Systems
CLASSROOM: synthetic high dynamic range light field dataset
Mary Guindy, Vamsi K. Adhikarla, Peter A. Kara, et al.
Light field images provide tremendous amounts of visual information regarding the represented scenes, as they describe the light traversing in all directions for all the points of 3D space. Due to the recent technological advancements of light field visualization and its increasing relevance in research, the need for light field image datasets has risen significantly. Among the applications for which light field datasets are considered, high dynamic range light field image reconstruction has gained notable attention in the past years. When capturing a scene, either a single camera with a 2D microlens array or a 2D array of cameras is used to produce narrow- and wide-baseline light field images, respectively. Additionally, the turn-table methodology may be used as well for narrow-baseline light fields. While the majority of these methods enable the creation of plausible and reliable light field image datasets, such baseline-specific setups can be extremely expensive and may require immense computing resources for proper calibration. Furthermore, the resulting light field is commonly limited with regard to angular resolution. A suitable alternative for producing a light field dataset is to do it synthetically by rendering light field images, which may easily overcome the aforementioned issues. In this paper, we discuss our work on creating the “CLASSROOM” light field image dataset, depicting a classroom scene. The content is rendered both in horizontal-only parallax and in full parallax. The scene contains a high variety of light distribution, particularly involving under-exposed and over-exposed regions, which are essential to HDR image applications.
Comparison study of adaptive RGB-D SLAM systems
In visual simultaneous localization and mapping (SLAM), the odometry estimation and navigation map building are carried out concurrently using only cameras. An important step in the SLAM process is the detection and analysis of the keypoints found in the environment. Performing a good correspondence of these points allows us to build an optimal point cloud for maximum localization accuracy of the mobile robot and, therefore, to build a precise map of the environment. In this presentation, we perform an extensive comparison study of the correspondences made by various combinations of detectors/descriptors and contrast the performance of two iterative closest point (ICP) algorithms used in the RGB-D SLAM problem. An adaptive RGB-D SLAM system is proposed, and its performance on the TUM RGB-D dataset is presented and discussed.
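A minimal sketch of the kind of detector/descriptor comparison described above, using OpenCV: extract keypoints in two frames of an RGB-D sequence with different binary detectors and count cross-checked matches. The frame file names are placeholders, and the three detectors shown are merely examples of combinations one might test:

```python
import cv2

def match_count(img1, img2, make_detector):
    det = make_detector()
    _, d1 = det.detectAndCompute(img1, None)
    _, d2 = det.detectAndCompute(img2, None)
    if d1 is None or d2 is None:
        return 0
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    return len(matcher.match(d1, d2))       # cross-checked correspondences

img_a = cv2.imread("rgb_000.png", cv2.IMREAD_GRAYSCALE)   # placeholder frames
img_b = cv2.imread("rgb_001.png", cv2.IMREAD_GRAYSCALE)
if img_a is None or img_b is None:
    raise SystemExit("provide two consecutive RGB-D colour frames")
for name, factory in [("ORB", lambda: cv2.ORB_create(nfeatures=1000)),
                      ("AKAZE", cv2.AKAZE_create),
                      ("BRISK", cv2.BRISK_create)]:
    print(name, match_count(img_a, img_b, factory))
```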
Noise removal of thermal images using deep learning approach
Ahmet Çapci, H. Emre Güven, B. Uğur Töreyin
With the widespread use of thermal cameras in various fields such as medicine, the military, surveillance, astronomy, and fire detection, image distortion caused by the structure of thermal sensors has become an important problem. Since every detector or pixel in the sensor reacts differently even when fed the same signal, a correction is necessary for good imaging; this correction is known as non-uniformity correction (NUC). Because the response of each detector/pixel drifts slowly and randomly over time, a one-time or single factory correction of the array is not enough. Traditional methods are not sufficient for operational use. Calibration-based approaches are undesirable because of the shutter sound in uncooled thermal imagers, as well as the time gaps they cause during imaging in cooled thermal imagers. Scene-based approaches, on the other hand, are not preferred due to high computational cost or rather unrealistic assumptions about the scene. In this study, we propose a deep-learning-based approach for both cooled and uncooled thermal imagers. We created various thermal datasets to train models for temporal noise for both cooled and uncooled thermal imagers and compared the results. In a thermal system, many operations such as NUC, BPR, and IOP are applied in sequence, from the raw detector output to the final output shown to the user. Our deep learning model accounts for the entirety of these operations. We also show that optical artifacts or distortions can be eliminated using deep learning. We demonstrate this with different system architectures that are suitable for embedded systems.
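For context, the classical calibration-based baseline that such learning-based approaches aim to replace is a two-point (gain/offset) non-uniformity correction computed from flat-field frames at two temperatures; the sketch below uses synthetic values and is not part of the proposed method:

```python
import numpy as np

def two_point_nuc(flat_cold, flat_hot):
    """Per-pixel gain and offset mapping the raw response onto a uniform target."""
    t_cold, t_hot = flat_cold.mean(), flat_hot.mean()
    gain = (t_hot - t_cold) / (flat_hot - flat_cold)
    offset = t_cold - gain * flat_cold
    return gain, offset

def apply_nuc(raw, gain, offset):
    return gain * raw + offset

# simulate fixed-pattern non-uniformity on a small sensor
rng = np.random.default_rng(0)
g_true = 1 + 0.1 * rng.standard_normal((4, 4))
o_true = 5 * rng.standard_normal((4, 4))
flat_cold, flat_hot = g_true * 100 + o_true, g_true * 200 + o_true
gain, offset = two_point_nuc(flat_cold, flat_hot)
corrected = apply_nuc(g_true * 150 + o_true, gain, offset)
print(float(corrected.std()))    # ~0: the fixed pattern is removed on a flat scene
```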
On combining denoising with learning-based image decoding
Léo Larigauderie, Michela Testolina, Touradj Ebrahimi
Noise is an intrinsic part of any sensor and is present, to various degrees, in any content that has been captured in real-life environments. In imaging applications, several pre- and post-processing solutions have been proposed to cope with noise in captured images. More recently, learning-based solutions have shown impressive results in image enhancement in general, and in image denoising in particular. In this paper, we review multiple novel solutions for image denoising in the compressed domain, by integrating denoising operations into the decoder of a learning-based compression method. The paper starts by explaining the advantages of such an approach from different points of view. We then describe the proposed solutions, including both blind and non-blind methods, comparing them to state-of-the-art methods. Finally, conclusions are drawn from the obtained results, summarizing the advantages and drawbacks of each method.
A novel assessment framework for learning-based deepfake detectors in realistic conditions
Detecting manipulations in facial images and video has become an increasingly popular topic in the media forensics community. At the same time, deep convolutional neural networks have achieved exceptional results on deepfake detection tasks. Despite the remarkable progress, the performance of such detectors is often evaluated on benchmarks under constrained and non-realistic situations. In fact, current assessment and ranking approaches employed in related benchmarks or competitions are unreliable. The impact of conventional distortions and processing operations found in image and video processing workflows, such as compression, noise, and enhancement, is not sufficiently evaluated. This paper proposes a more rigorous framework to assess the performance of learning-based deepfake detectors in more realistic situations. This framework can serve as a broad benchmarking approach for both general model performance assessment and the ranking of proponents in a competition. In addition, a stochastic degradation-based data augmentation strategy driven by realistic processing operations is designed, which significantly improves the generalization ability of two deepfake detectors.
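In the spirit of the stochastic degradation-based augmentation described above, a training crop can be passed through a random subset of realistic processing operations; the probabilities and parameter ranges in this sketch are illustrative assumptions, not the strategy calibrated in the paper:

```python
import random
import numpy as np
import cv2

def stochastic_degrade(img_bgr):
    out = img_bgr.copy()
    if random.random() < 0.5:                              # Gaussian blur
        out = cv2.GaussianBlur(out, (0, 0), random.uniform(0.5, 2.0))
    if random.random() < 0.5:                              # additive noise
        noise = np.random.normal(0, random.uniform(2, 10), out.shape)
        out = np.clip(out.astype(np.float32) + noise, 0, 255).astype(np.uint8)
    if random.random() < 0.5:                              # down/up-scaling
        s = random.uniform(0.5, 0.9)
        h, w = out.shape[:2]
        out = cv2.resize(cv2.resize(out, (int(w * s), int(h * s))), (w, h))
    if random.random() < 0.7:                              # JPEG compression
        q = random.randint(40, 90)
        ok, buf = cv2.imencode(".jpg", out, [int(cv2.IMWRITE_JPEG_QUALITY), q])
        out = cv2.imdecode(buf, cv2.IMREAD_COLOR)
    return out

face = (np.random.rand(224, 224, 3) * 255).astype(np.uint8)  # stand-in face crop
print(stochastic_degrade(face).shape)
```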
New Imaging Standards
C2PA: the world’s first industry standard for content provenance (Conference Presentation)
Given the deluge of digital content and rapidly advancing technology, it is challenging for consumers to trust what they see online. Deceptive content, such as deepfakes generated by artificial intelligence or more traditionally manipulated media, can be indistinguishable from the real thing, so establishing the provenance of media is critical to ensure transparency, understanding, and trust. The C2PA specification provides platforms with an open, standards-based method to define what information is associated with each type of asset (e.g., images, videos, audio, or documents), how that information is presented and stored, and how evidence of tampering can be identified.
Analysis of AV1 coding tools
Hsiao-Chiang Chuang, Zhijun Lei, Agata Opalach, et al.
AV1 is the first royalty-free video coding standard developed by the Alliance for Open Media (AOM), which was finalized in 2018. During its standardization process, coding tools were gradually adopted into the specification based on a tradeoff between multiple parameters, such as bitrate, quality, encoding and decoding implementation complexity. A fair comparison of the coding tools supported by this codec can be essential for encoder designers who seek to achieve a good balance among all these factors within their implementations. To this end, this paper compiles a tool-on/off analysis of several prominent coding tools supported by the AV1 specification. The analysis includes the impact of such tools on several objective quality metrics, i.e., PSNR, SSIM, and VMAF, when using the reference encoder libaom implementation, as well as the corresponding impact on the software (SW) runtime complexity of both the libaom encoder and decoder.
JPEG XS screen content coding extensions
Thomas Richter, Siegfried Foessel
At its 94th meeting, SC29WG1 (JPEG) decided to create a third edition of the JPEG XS standard (ISO/IEC 21122) for lightweight, low-latency image coding. While existing coding tools already support the compression of natural content and CFA Bayer pattern data well, this third edition has a strong focus on extending the previous edition with coding tools to improve the performance on screen content data. In this work, some of the candidate tools that have been proposed for the third edition are described, and their performance is reported.
Enhancing SVT-AV1 with LCEVC to improve quality-cycles trade-offs and enhance sustainability of VOD transcoding
Guendalina Cobianchi, Guido Meardi, Stergios Poularakis, et al.
The trade-offs between compression performance and encoding complexity are key in software video encoding, even more so with increasing pressure on sustainability. Previous work “Towards much better SVT-AV1 quality-cycles tradeoffs for VOD applications” [1] described three approaches of evaluating compression efficiency vs cycles trade-offs within a convex-hull framework using the Dynamic Optimizer (DO) algorithm developed in [2] [3] for VOD applications. In parallel, the new video codec enhancer LCEVC (Low Complexity Enhancement Video Coding) [4], designed to provide gains in speed-quality trade-offs, has recently been standardized as MPEG-5 Part 2. The core idea of LCEVC is to use any video coding standard (such as AV1) as a base encoder at a lower resolution, and then reduce artifacts and reconstruct a full resolution output by combining the decoded low-resolution output with up to two low-complexity reconstruction enhancement sub-layers of the residual data. This paper starts by applying LCEVC to SVT-AV1 [5], as well as x264 [6] and x265 [7], while using two of the approaches presented in [1] to evaluate the resulting compression efficiency vs cycles trade-offs. The paper then discusses the benefits of LCEVC towards higher playback speed and lower battery power consumption when using AV1 software decoding. Results show that, with fast-encoding parameter selection using the discrete convex hull methodology, LCEVC improves the quality-cycles trade-offs for all the tested codecs and across the full complexity range. In the case of SVT-AV1, LCEVC yields a ~40% reduction in computations while achieving the same quality levels according to VMAF_NEG [8]. LCEVC also enlarges the set of mobile devices capable of playing HD as well as high-frame-rate content encoded with AV1 and extends mobile battery life by up to 50% with respect to state-of-the-art AV1 software decoding.
Benchmarking and analysis of AV1 software decoding on Android devices
AV1 is the first generation of royalty-free video coding standards developed by the Alliance for Open Media (AOM). Since it was released in 2018, it has gained broad adoption in the industry. Major service providers, such as YouTube and Netflix, have started streaming AV1-encoded content. Even though more and more vendors have started to implement hardware AV1 decoders in their products, software decoders with very good performance are still critical to enable AV1 playback on a broader range of devices, especially mobile devices. For this purpose, VideoLAN created Dav1d, a portable and highly optimized AV1 software decoder. The decoder implements all AV1 bitstream features. Dataflow is organized to allow the various decoding stages (bitstream parsing, pixel reconstruction, and in-loop postfilters) to be executed directly after each other for the same superblock row, allowing memory to stay in cache for most common frame resolutions. The project includes more than 200k lines of platform-specific assembly optimizations, including Neon optimizations for arm32/aarch64, as well as SSE, AVX2 (Haswell), and AVX512 (Icelake/Tigerlake) for x86 [3], to create optimal performance on the most popular devices. For multi-threading, Dav1d uses a generic task-pool design, which splits decoding stages into mini-tasks. This design allows multiple decoding stages to execute in parallel for adjacent tiles, superblock rows, and frames, and keeps common thread counts (2-16) efficiently occupied on multiple architectures with minimal memory or processing overhead. To test the performance of Dav1d on real devices, a set of low-end to high-end Android mobile devices was selected for benchmarking tests. To simulate real-time playback with display, the VLC video player application with Dav1d integration is used. Extensive testing is done using a wide range of video test vectors at various resolutions, bitrates, and framerates. The benchmarking and analysis are conducted to gain insights into single- and multi-threading performance, the impact of video coding tools, CPU utilization, and battery drain. Overall, AV1 real-time playback of 720p 30fps @ 2Mbps is feasible for low-end devices with 4 threads, and 1080p 30fps @ 4Mbps is feasible for high-end and mid-range devices with 4 threads using the Dav1d decoder.
Integrated learning-based point cloud compression for geometry and color with graph Fourier transforms
Davi Lazzarotto, Touradj Ebrahimi
Point cloud representation is a popular modality to code immersive 3D content. Several solutions and standards have been recently proposed to efficiently compress the large volume of data that point clouds require, in order to make them feasible for real-life applications. Recent studies adopting learning-based methods for point cloud compression have demonstrated high compression efficiency, especially when compared to the conventional compression standards. However, they are mostly evaluated either on geometry or color separately, and no learning-based joint codec with performance comparable to state-of-the-art methods has been proposed. In this paper, we propose an integrated learned coding architecture by joining a previously proposed geometry coding module based on three-dimensional convolutional layers with a color compression method relying on the graph Fourier transform (GFT), using a learning-based mean and scale hyperprior to compress the obtained coefficients. Evaluation on a test set with dense point clouds shows that the proposed method outperforms G-PCC and achieves competitive performance with V-PCC when evaluated with state-of-the-art objective quality metrics.
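A minimal sketch of the graph Fourier transform over point cloud colour attributes that such a colour module builds on: connect each point to its k nearest neighbours, form the graph Laplacian, and project the colours onto its eigenvectors. The graph weights and choice of k are illustrative, and the learned hyperprior stage is omitted:

```python
import numpy as np

def gft_of_colors(points, colors, k=6, sigma=0.05):
    n = len(points)
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]              # skip the point itself
        W[i, nbrs] = np.exp(-d2[i, nbrs] / (2 * sigma ** 2))
    W = np.maximum(W, W.T)                             # symmetrise
    L = np.diag(W.sum(1)) - W                          # combinatorial Laplacian
    eigvals, U = np.linalg.eigh(L)                     # GFT basis
    coeffs = U.T @ colors                              # forward GFT
    return eigvals, U, coeffs

pts = np.random.rand(200, 3)
cols = np.random.rand(200, 3)                          # RGB per point
vals, U, c = gft_of_colors(pts, cols)
print(np.allclose(U @ c, cols))                        # inverse GFT recovers colours
```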
Imaging Applications
H.264 or H.265 for lossy surveillance video transmission: a database study
P. Topiwala, W. Dai
Wireless video communications are a growing segment of communications in both commercial and defense domains. Due to the lossy nature of wireless channels, they require powerful error correction for successful communications. In the commercial domain, ever more people worldwide use video chat and watch videos on portable wireless devices, while in defense domains, surveillance assets are growing rapidly in numbers, providing real-time motion imagery intelligence. In both domains, the transmission of high-quality video is of vital importance, and a variety of both source and channel codecs have been developed, separately for each domain and application. Throw in tight bandwidth constraints (e.g., 500 kb/s), and the challenge intensifies. To give H.264 a fair chance to operate at such low rates, we first restrict (convert) the videos to 480p resolution. In this paper, we explore the space of video codecs, encryption, channel codecs, and lossy channel models, and use a database of aerial surveillance video we collected to find the best practices within this large search space. After some preliminary material, we focus attention on the two most common video codecs in use, H.264/AVC and H.265/HEVC, and ask which is better to use in lossy communications, where we mainly limit bandwidth to just 500 kb/s. We perform simulations in Rayleigh fading channel models, use signal-to-noise ratio (SNR) levels that stress transmissions, and use powerful Polar and Low-Density Parity-Check (LDPC) codes to correct errors. Since the channel varies with time (as does our simulation), we aggregate over 100 simulations for statistical convergence. Without channel errors, H.265 would be easily preferred due to its superior coding efficiency. With errors, this is still true in our simulations, but the result is far more subtle, and unexpectedly depends in part on using encrypted video. H.264 is in fact more resilient when there are high errors and no encryption, while as quality improves to usable levels, H.265 takes over. With encryption, H.265 wins at all channel Eb/No (~SNR) levels, and even pro
Sperm-cell DNA fragmentation prediction using label-free quantitative phase imaging and deep learning (Conference Presentation)
Lioz Noy, Itay Barnea, Simcha Mirsky, et al.
Intracytoplasmic sperm injection (ICSI) is the most common practice for in vitro fertilization (IVF) treatments. In ICSI, a single sperm is selected and injected into an oocyte. The quality of the sperm and specifically its DNA fragmentation index (DFI) have significant effects on the fertilization success rate. In our research, we use computer vision and deep learning methods to predict the DFI score of a single sperm cell. Each cell in the dataset was acquired using multiple white-light microscopy techniques combined with state-of-the-art interferometry. In our results, we see a strong correlation between the stained images and our score prediction, which can be used in the ICSI process.
Range enhancement of a semi-flash lidar system using a sparse VCSEL array and depth upsampling
Jihoon Park, Junho Choi, Changmo Jeong, et al.
3D flash lidar has no moving parts and can provide high-resolution depth information, so it is attracting attention in the development of solid-state lidar systems. However, the limited detection range is a key challenge because a single diffused laser source has to cover the whole target area while satisfying laser-safety requirements. In this paper, we propose a semi-flash lidar system that increases the detection range while providing high resolution and a wide field of view. To increase the peak power of the individual lasers under the laser-safety constraint, we design a customized VCSEL array that forms a sparse 2D laser array. We also propose a depth upsampling method to recover the empty pixels in the SPAD detector array that are caused by the sparsely distributed laser beams. For robust depth reconstruction, a joint trilateral filter that exploits the ambient light information and the structural information of the VCSEL array is presented. Using a hardware prototype, we show the degree of range enhancement and the fidelity of the depth upsampling results under various environments.
Automating sports broadcasting using ultra-high definition cameras, neural networks, and classical denoising
Sophia Rosney, Ciarán Donegan, Meegan Gower, et al.
Algorithms for automated sports broadcasting have been explored since the early 2000s. Systems consider automated control of virtual or physical cameras. However, output picture quality is compromised and capturing “off-the-ball” action remains a challenge for automated systems. In this paper, we present an exploration of the components of a semi-automated, high-quality broadcasting system. We simulate multiple dynamic views from fixed wide-angle cameras using an object detection network adapted to UHD wide-angle content. These views can be selected at the discretion of the director for broadcast. The final selected view then undergoes an enhancement process to address the optical blur and geometric distortion in wide-angle footage. Our overall system, through the combination of a human operator with automated view simulation algorithms, is capable of addressing many of the issues facing fully automated production in the complex environment of sport.
Convolutional neural networks for automatic detection of breast pathologies
Breast cancer in women is a worldwide health problem with a high mortality rate. A strategy to reduce breast cancer mortality in women is to implement preventive programs such as mammography screening for early breast cancer diagnosis. In this presentation, a method for automatic detection of breast pathologies using a deep convolutional neural network and a class activation map is proposed. A pretrained neural network is fine-tuned on the regions of interest, with its output layers modified to have two output classes. The proposed method is compared with different CNN models and applied to classify the public dataset Curated Breast Imaging Subset of DDSM (CBIS-DDSM).
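A hedged sketch of the transfer-learning setup implied above: take a pretrained CNN, replace its output layer with a two-class head, and fine-tune it on region-of-interest crops. The backbone choice and hyperparameters are illustrative assumptions, not those used in the presentation:

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 2)        # two output classes

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(roi_batch, labels):
    """One fine-tuning step on a batch of ROI crops (N x 3 x 224 x 224)."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(roi_batch), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# dummy batch just to show the expected shapes
print(train_step(torch.randn(4, 3, 224, 224), torch.tensor([0, 1, 1, 0])))
```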
AI-based telepresence for broadcast applications
In this paper, we introduce a new solution and underlying architecture that allows remote participants to interact with hosts in a broadcast scenario. To achieve this, background extraction is first applied to the video received from remote participants to extract their faces and bodies. Considering that the video from remote participants is usually of lower resolution when compared to content produced by professional cameras in production, we propose to scale the extracted video with a super-resolution module. Finally, the processed video from remote participants is merged with the studio video and streamed to audiences. Given the real-time and high-quality requirements, both the background extraction and super-resolution modules are learning-based solutions and run on GPUs. The proposed solution has been deployed in the Advance Mixed Reality (AdMiRe) project. The objective and subjective assessment results show that the proposed solution works well in real-world applications.
Image and Video Processing
An algorithm for a quality-optimized bit rate ladder generation for video streaming services using a neural network
Andreas Kah, Maurice Klein, Christoph Burgmair, et al.
For video streaming services, a bit rate ladder is generated by encoding each video signal at various bit rates and associated spatial resolutions. For a bit rate ladder that maximizes the subjective quality at a minimum bit rate, it was found, first, that the VMAF of the highest provided quality should not exceed 95, which is on average associated with the same subjective quality as the original signal. Second, all VMAF differences between adjacent renditions should ideally be no greater than 2, as this guarantees indistinguishable subjective quality on average. The generation of a bit rate ladder fulfilling these constraints faces the difficulties that (i) today's encoders cannot be instructed to achieve a certain VMAF and (ii) a certain VMAF can be achieved by various combinations of bit rate and spatial resolution. These difficulties result in a content-dependent multidimensional solution space for generating the quality-based bit rate ladder at a minimum bit rate. In this paper, an algorithm is presented which can generate such a bit rate ladder. The algorithm determines the VMAF of nine initial encodings of the signal. Using a specifically designed and trained neural network, the VMAF of 5805 combinations of bit rate and spatial resolution is predicted from the nine initial ones. Based on these predictions, a bit rate ladder is extracted and further refined until all VMAF constraints are fulfilled. Experiments show that the algorithm requires 3.6 encodings per provided VMAF on average. A VMAF of 95.07 is achieved on average for the highest provided quality, with an average VMAF difference between adjacent renditions of 1.92.
FPGA synthesis of an original chaotic system with application to image transmission
This article shows a nonlinear dynamical system capable of generating spatial attractors. The main activity is the realization of a spatial chaotic attractor on a Xilinx PYNQ-Z1 FPGA board, programmed in Python using Jupyter, with a focus on the implementation of a secure communication system. The first contribution is the successful synchronization of two chaotic attractor systems, programmed in VHDL, in a master-slave topology. The second important contribution is the FPGA realization of a secure communication system based on a spatial chaotic attractor, which involves encrypting grayscale and RGB images with chaos and a broadcast key in the transmission system, sending the encrypted image through the state variables, and reconstructing the encrypted image in the receiving system to recover the transmitted image.
An empirical approach for estimating the effect of a transcoding aware preprocessor
Video compression is complicated by degradation in User Generated Content (UGC). Preprocessing the data before encoding improves compression. However, the impact of the preprocessor depends not only on the codec and the filter strength of the preprocessor being used but also on the target bitrate of the encode and the level of degradation. In this paper we present a framework for modelling this relationship and estimating the optimal filter strength for a particular codec/preprocessor/bitrate/degradation combination. We examine two preprocessors, based on classical and DNN ideas, and two codecs, AV1 and VP9. We find that up to 2 dB of quality gain can result from preprocessing at a constant bitrate, and that our estimator is accurate enough to capture most of these gains.
Redundancy in lattice algebra based associative neural networks for image retrieval from noisy inputs
Since lattice algebra based associative memories can store any number k of associated vector pairs (x, y), where x is a real n-dimensional vector and y is a real m-dimensional vector, we propose a basic redundancy mechanism to endow the dual canonical min-W and max-M lattice associative memories with retrieval capability for inputs corrupted by random noise. To achieve our goal, given a finite set of exemplar vectors, redundant patterns are added in order to enlarge the fixed point set of the original exemplars. The redundant patterns are masked versions designed to be spatially correlated with each exemplar x in a given learning data set. An illustrative example with noisy color images is given to measure the retrieval performance of the proposed redundancy technique as applied to lattice associative memories.
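A minimal sketch of the dual canonical lattice auto-associative memories referred to above: min-W and max-M matrices with max-plus and min-plus recall. W_XX gives perfect recall of clean exemplars and tolerates erosive noise, M_XX tolerates dilative noise, and mixed random noise defeats both, which is what motivates the proposed redundancy mechanism. The patterns below are toy values:

```python
import numpy as np

def build_W(X):                    # X is n x K (one exemplar per column)
    return (X[:, None, :] - X[None, :, :]).min(axis=2)   # w_ij = min_k(x_i - x_j)

def build_M(X):
    return (X[:, None, :] - X[None, :, :]).max(axis=2)   # m_ij = max_k(x_i - x_j)

def recall_W(W, x):                # max-plus product: y_i = max_j(w_ij + x_j)
    return (W + x[None, :]).max(axis=1)

def recall_M(M, x):                # min-plus product: y_i = min_j(m_ij + x_j)
    return (M + x[None, :]).min(axis=1)

X = np.array([[0., 2., 5.],
              [1., 4., 3.],
              [6., 0., 2.]])       # three 3-dimensional exemplars (columns)
W, M = build_W(X), build_M(X)
# every clean exemplar is a fixed point of both memories
print(all(np.allclose(recall_W(W, X[:, k]), X[:, k]) for k in range(3)),
      all(np.allclose(recall_M(M, X[:, k]), X[:, k]) for k in range(3)))
```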
New Imaging Modalities and Applications
Quantitative performance evaluation in an augmented reality view enhancement driver assistant system
Lukas Jütte, Alexander Poschke, Benjamin Küster, et al.
The temporally and spatially accurate display of information in augmented reality (AR) systems is essential for immersion and operational reliability when using the technology. We developed an assistant system using a head-mounted display (HMD) to hide visual restrictions on forklifts. We propose a method to evaluate the accuracy and latency of AR systems using HMD. For measuring accuracy, we compare the deviation between real and virtual markers. For latency measurement, we count the frame difference between real and virtual events. We present the influence of different system parameters and dynamics on latency and overlay accuracy.
Skin cancer post-operative scar evaluation using autofluorescence features
Marta Lange, Szabolcs Bozsányi, Emilija V. Plorina, et al.
In this work, post-operative skin cancer scar evaluation with an LED screening device is described. The wavelength used for inducing autofluorescence (AF) of chromophores in the skin is 405 nm. The green channel of the captured images is best suited for calculating the AF intensity ratio between the scar and the surrounding skin; it was computed for 10 patients with healthy healing scars and scars with cancer recurrence. This non-invasive multispectral screening method can help dermato-oncologists decide whether a scar is healing correctly and evaluate any pigmentation that could be suspected as recurrent cancer.
Private key and password protection by steganographic image encryption
We propose a technique to protect and preserve a private key or a passcode in an encrypted two-dimensional graphical image. The plaintext private key or the passcode is converted into an encrypted QR code and embedded into a real-life color image with a steganographic scheme. The private key or the passcode is recovered from the stego color image by first extracting the encrypted QR code from the color image, followed by decryption of the QR code. The cryptographic key for encryption of the QR code is generated from the output of a Linear Feedback Shift Register (LFSR), initialized by a seed image chosen by the user. The user can store the seed image securely, without the knowledge of an attacker. Even if an active attacker modifies the seed image (without knowledge of the fact that it is the seed image), the user can easily restore it if he/she keeps multiple copies of it, so that the encryption key can be regenerated easily. Our experiments prove the feasibility of the technique using sample private key data and real-life color images.
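A hedged sketch of keystream generation with a linear feedback shift register (LFSR) seeded from image data, in the spirit of the scheme described above. The register width, tap positions, and seed derivation are illustrative assumptions, not the parameters of the proposed system:

```python
def lfsr_keystream(seed_bits, taps, nbytes):
    """Fibonacci LFSR: XOR the tapped stages, shift, and collect output bytes."""
    state = list(seed_bits)
    out = bytearray()
    for _ in range(nbytes):
        byte = 0
        for _ in range(8):
            fb = 0
            for t in taps:                  # feedback = XOR of tapped stages
                fb ^= state[t]
            byte = (byte << 1) | state[-1]  # output bit from the last stage
            state = [fb] + state[:-1]       # shift and insert the feedback bit
        out.append(byte)
    return bytes(out)

def derive_seed_bits(seed_image_bytes, width=16):
    """Toy seed derivation: take the first `width` bits of the seed image."""
    bits = []
    for b in seed_image_bytes:
        bits += [(b >> i) & 1 for i in range(7, -1, -1)]
        if len(bits) >= width:
            return bits[:width]
    return (bits + [1] * width)[:width]     # pad degenerate inputs

seed = derive_seed_bits(b"\x9a\x3c")        # stand-in for real seed-image data
qr_bytes = b"example QR payload"
stream = lfsr_keystream(seed, taps=[0, 2, 3, 5], nbytes=len(qr_bytes))
cipher = bytes(c ^ k for c, k in zip(qr_bytes, stream))
print(cipher.hex())
```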
Effective know-your-customer method for secure and trustworthy non-fungible tokens in media assets
Clément Sanh, Kambiz Homayounfar, Touradj Ebrahimi
Non-fungible tokens (NFTs) are becoming very popular in a large number of applications ranging from copyright protection to monetization of both physical and digital assets. It is however a fact that NFTs suffer from a large number of security issues that create a lack of trust in solutions based on them. In this paper, we provide an overview of some of the most critical security challenges for media assets in the form of visual content and then propose a specific solution for one of them, namely, secure person identification used in the context of Know-Your-Customer (KYC), with emphasis on liveness detection. The solution includes an authentication procedure that matches a selfie photo to a photograph of an identity document (ID). The system runs through a series of steps. First, detection is applied to extract faces from the selfie and the ID. Then a face comparison is performed to assess whether they belong to the same person. While these two procedures are standard in KYC, a liveness check is also included so as to increase security. The latter ensures that the user undergoing identity verification is in front of the camera and not a fraudster attempting to impersonate another individual. The system instructs the user to perform gestures such as waving hands or tilting the head in front of the camera. The algorithmic detection of these actions during the live feed reveals whether or not the user is carrying out the instructed activities. The performance of the proposed solution is then assessed under varying conditions.
Poster Session
Clustering in coarse registration task and extraction of common parts of point clouds
Sergei Voronin, Artyom Makovetskii, Vitaly Kober, et al.
Point cloud registration is an important method in 3D point cloud processing, used in computer vision, autonomous driving, and other fields. Point cloud registration looks for the optimal rigid transformation that can align two input point clouds to a common coordinate system. The most common method of alignment using geometric characteristics is the Iterative Closest Point (ICP) algorithm. The disadvantage of classical ICP variants, such as point-to-point and point-to-plane, is their dependence on the initial placement of the point clouds. If the rotation required to align two point clouds is sufficiently large, the ICP algorithm can converge to a local minimum. Coarse point cloud registration algorithms are used to find a suitable initial alignment of two clouds; in particular, feature-based methods for coarse registration are known. In this paper, we propose an algorithm that extracts the common parts of incongruent point clouds and coarsely aligns them. We use the SHOT algorithm to find matches between the two point clouds. The corresponding neighborhoods are obtained from the correspondences between points. These neighborhoods define local vector bases that allow computing an orthogonal transformation. The proposed algorithm extracts the common parts of incongruent point clouds. Computer simulation results are provided to illustrate the performance of the proposed method.
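The abstract does not spell out the local-basis construction, so the sketch below shows the standard SVD (Kabsch) solution for the rigid transform from matched point pairs, e.g. pairs produced by SHOT matching. It is a well-known stand-in for, not a reproduction of, the proposed method.

```python
import numpy as np

def rigid_transform_from_correspondences(P: np.ndarray, Q: np.ndarray):
    """Least-squares rigid transform (R, t) aligning matched points P -> Q.

    P, Q : N x 3 arrays of corresponding points (e.g. from SHOT feature matching).
    Classical SVD (Kabsch) solution, shown instead of the paper's local-basis method.
    """
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    H = (P - p_mean).T @ (Q - q_mean)        # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                 # guard against reflections
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = q_mean - R @ p_mean
    return R, t
```

Applying `P @ R.T + t` then brings the first cloud into coarse alignment with the second, which is the kind of initialization a subsequent ICP refinement expects.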
ICP error functional using point cloud geometry
Artyom Makovetskii, Sergei Voronin, Vitaly Kober, et al.
Registration of point clouds in three-dimensional space is an important task in many areas of computer vision, including robotics and autonomous driving. The purpose of registration is to find a rigid geometric transformation that aligns two point clouds. The registration problem can be affected by noise and incomplete data (partiality). The Iterative Closest Point (ICP) algorithm is a common method for solving the registration problem. Usually, the ICP algorithm monotonically reduces the functional value, but owing to non-convexity, the algorithm often stops at a suboptimal local minimum. Thus, an important characteristic of a registration algorithm is its ability to avoid local minima. The probability of obtaining an acceptable transformation from the ICP algorithm is a comparative criterion for different types of ICP algorithms and other registration algorithms. In this paper, we propose an ICP-type registration algorithm that uses a new type of error metric functional. The functional uses fine geometrical characteristics of the point cloud. Computer simulation results are provided to illustrate the performance of the proposed method.
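The proposed functional itself is not given in the abstract. As a point of reference, the sketch below evaluates the classical point-to-plane error, one well-known ICP functional that already uses local surface geometry (normals); the paper's metric is a different, finer-grained construction.

```python
import numpy as np
from scipy.spatial import cKDTree

def point_to_plane_error(source, target, target_normals, R, t):
    """Point-to-plane ICP error for a candidate rigid transform (R, t).

    source         : N x 3 points to be aligned
    target         : M x 3 reference points
    target_normals : M x 3 unit normals of the target points
    Shown only as the classical geometry-aware baseline, not the paper's functional.
    """
    transformed = source @ R.T + t
    tree = cKDTree(target)
    _, idx = tree.query(transformed)                        # nearest-neighbor pairing
    diff = transformed - target[idx]
    residuals = np.einsum("ij,ij->i", diff, target_normals[idx])  # signed distance to tangent plane
    return np.sum(residuals ** 2)
```

An ICP-type algorithm alternates this pairing step with a minimization of the chosen functional over (R, t) until convergence.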
Neural network for 3D point clouds alignment
Sergei Voronin, Alexander Vasilyev, Vitaly Kober, et al.
The point cloud is an important type of geometric data structure, and various applications require high-level point cloud processing. Instead of defining geometric elements such as corners and edges, state-of-the-art algorithms use semantic matching. These methods require learning-based approaches that rely on statistical analysis of labeled datasets. Adapting deep learning techniques to handle 3D point clouds remains challenging: the standard deep neural network model requires regular inputs such as vectors and matrices, whereas three-dimensional point clouds are fundamentally irregular; the positions of points are continuously distributed in space, and any permutation of their order does not change the spatial distribution. Specialized deep neural networks have therefore been designed to process point clouds directly, without converting them to an intermediate regular representation. The Deep Closest Point (DCP) network is a learning-based counterpart of the ICP algorithm. DCP utilizes the point-to-point functional for error metric minimization. In this paper, we propose a modified variant of DCP based on other types of ICP error-minimization functionals. Computer simulation results are provided to illustrate the performance of the proposed algorithm.
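For orientation, the sketch below shows the differentiable point-to-point SVD head used at the end of a DCP-style network to turn predicted soft correspondences into a rigid transform. The batch layout is an assumption, and the paper's modified functionals would replace this closed-form step or the training loss built on it.

```python
import torch

def svd_alignment_head(src: torch.Tensor, src_corr: torch.Tensor):
    """Differentiable SVD head (DCP-style) computing R, t that align
    src (B x N x 3) to its predicted soft correspondences src_corr (B x N x 3).
    Point-to-point closed form; alternative functionals change this step or the loss.
    """
    src_centered = src - src.mean(dim=1, keepdim=True)
    corr_centered = src_corr - src_corr.mean(dim=1, keepdim=True)
    H = src_centered.transpose(1, 2) @ corr_centered        # B x 3 x 3 cross-covariance
    U, _, Vh = torch.linalg.svd(H)

    flip = torch.ones_like(torch.linalg.det(Vh.transpose(1, 2) @ U.transpose(1, 2)))
    flip[torch.linalg.det(Vh.transpose(1, 2) @ U.transpose(1, 2)) < 0] = -1.0
    Vh_fixed = Vh.clone()
    Vh_fixed[:, -1, :] *= flip.unsqueeze(-1)                 # correct reflections
    R = Vh_fixed.transpose(1, 2) @ U.transpose(1, 2)

    t = src_corr.mean(dim=1) - (R @ src.mean(dim=1).unsqueeze(-1)).squeeze(-1)
    return R, t
```

Because every operation is differentiable, the alignment error can be back-propagated into the feature and matching layers that produce `src_corr`.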
A comparative study of convolutional network models for the classification of abnormalities in mammograms
Breast cancer is the most common cancer and one of the main causes of death in women. Early diagnosis of breast cancer is essential to ensure a high chance of survival for the affected women. Computer-aided detection (CAD) systems based on convolutional neural networks (CNN) can assist in the classification of abnormalities such as masses and calcifications. In this paper, several convolutional network models for the automatic classification of pathology in mammograms are analyzed. In addition, different preprocessing and tuning techniques, such as data augmentation, hyperparameter tuning, and fine-tuning, are used to train the models. Finally, the models are validated on several publicly available benchmark datasets.
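A typical fine-tuning setup of the kind described above might look like the sketch below (PyTorch/torchvision, requires a version with the weights enum API, i.e. torchvision 0.13 or later); the backbone, augmentations, and freezing policy are illustrative choices, not the specific models compared in the paper.

```python
import torch.nn as nn
from torchvision import models, transforms

def build_mammogram_classifier(n_classes: int = 2) -> nn.Module:
    """Transfer-learning sketch: ImageNet-pretrained ResNet-18 with a new head
    for abnormality classification. Backbone choice and freezing policy are
    illustrative assumptions."""
    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    for param in model.parameters():          # freeze the pretrained backbone
        param.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, n_classes)   # new trainable head
    return model

# Simple data augmentation of the kind mentioned in the abstract.
train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ToTensor(),
])
```

After the head converges, the backbone layers can be unfrozen with a lower learning rate for full fine-tuning, which is the usual second stage of this recipe.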
Improving the speed of ImageJ filtering through threads
ImageJ is a very popular Java-based open-source software package developed by the National Institutes of Health (NIH), primarily for biomedical image processing and microscopy, as it is easy to use and makes it simple to implement new tools and macros. However, processing and filtering high-resolution images or stacks composed of a very large number of slices can be very time-consuming, so reducing processing time is important. One way to do this is to use threads. A multi-threaded program contains two or more parts that can run simultaneously; each part is called a thread and defines an independent execution path. Multithreading is thus a specialized form of multitasking: in a thread-based multitasking environment, the thread is the smallest unit of dispatchable code, so a single program can perform two or more tasks simultaneously and take full advantage of the processing power available on the system. Although ImageJ employs multithreading, it is sometimes not sufficient to speed up processes. Therefore, in this work, threads were implemented inside several filters written as ImageJ plugins to speed up their performance in such situations. For the experiment, ten images were scaled to different dimensions and the processing times of the threaded versus non-threaded filters were recorded. On average, a 24% speedup was observed. This performance improvement would be even more useful for larger images and image stacks.
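The paper's plugins are written in Java for ImageJ; the Python sketch below only illustrates the general row-band threading idea on a 2D grayscale image, with bands padded by the filter radius so the seams match the single-threaded result. Whether threads yield a real speed-up in Python depends on the filter releasing the GIL, a concern that does not arise for ImageJ's Java plugins.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from scipy.ndimage import gaussian_filter

def threaded_gaussian(image: np.ndarray, sigma: float = 2.0, n_threads: int = 4):
    """Filter horizontal bands of a 2D image in parallel threads (illustrative)."""
    pad = int(4 * sigma)                                    # approximate Gaussian support
    bounds = np.linspace(0, image.shape[0], n_threads + 1, dtype=int)

    def filter_band(i):
        lo, hi = bounds[i], bounds[i + 1]
        lo_p, hi_p = max(lo - pad, 0), min(hi + pad, image.shape[0])
        band = gaussian_filter(image[lo_p:hi_p], sigma)     # filter padded band
        return band[lo - lo_p: hi - lo_p]                   # crop back to the band's rows

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        bands = list(pool.map(filter_band, range(n_threads)))
    return np.vstack(bands)
```

In the ImageJ setting, the same decomposition is expressed by launching one Java thread per band (or per slice of a stack) and joining them before the result is displayed.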