Proceedings Volume 11842

Applications of Digital Image Processing XLIV



Volume Details

Date Published: 8 October 2021
Contents: 14 Sessions, 71 Papers, 54 Presentations
Conference: SPIE Optical Engineering + Applications 2021
Volume Number: 11842

Table of Contents

All links to SPIE Proceedings will open in the SPIE Digital Library.
  • Front Matter: Volume 11842
  • Panel Discussion on Advanced Video Compression and Applications
  • Artificial Intelligence in Image and Video Compression
  • Video Processing for User-Generated Content
  • Advanced Video Compression
  • Imaging and Information Chaos
  • Image and Video Compression
  • Applications of Visual Perception in Imaging
  • Applications of Biomedical Imaging
  • Imaging Security and Analysis
  • 3D Imaging and Augmented-Reality Applications
  • Image Analysis Tools and Techniques
  • Imaging Systems
  • Poster Session
Front Matter: Volume 11842
Front Matter: Volume 11842
This PDF file contains the front matter associated with SPIE Proceedings Volume 11842 including the Title Page, Copyright information, and Table of Contents.
Panel Discussion on Advanced Video Compression and Applications
Panel Discussion on Advanced Video Compression and Applications
Pankaj Topiwala, Alan C. Bovik, Benjamin Bross, et al.
The massive $200B video services industry, comprising broadcast, streaming, and other services, currently has the widest breadth of video codecs in its history: MPEG-2, MPEG-4, AVC, HEVC, VVC, VP8, VP9, AV1, EVC, and LC-EVC. While these codecs compete in the marketplace for share of streams, the consumer surely benefits from having advanced services at lower rates. Is 4K HDR HEVC going to become the new norm for broadcast/streaming? At the same time, this is a challenging environment for developers and service providers. In this panel, we explore the breadth of consumer services that are enabled by these technologies, including high resolution (4K, 8K, and beyond) as well as HDR and AR/VR – will these finally take off and fulfill their promise? And is 8K the end of the line for consumer devices such as TVs, and even computers, tablets, and smartphones?
Artificial Intelligence in Image and Video Compression
Quality-aware CNN-based in-loop filter for video coding
Wei Chen, Xiaoyu Xiu, Xianglin Wang, et al.
The state-of-the-art video coding standard, Versatile Video Coding (VVC) or H.266, has demonstrated its superior coding efficiency over its predecessor HEVC/H.265. In this paper, a novel in-loop filter based on a convolutional neural network (CNN) is presented to further improve the coding efficiency over VVC. In this filter, a single NN model is used to process multiple video components simultaneously. In addition, with a quality map generated for each video component as network input, the same single NN model is capable of processing videos of different qualities and resolutions while maintaining coding efficiency, which reduces the overall network complexity significantly. Simulation results show that the proposed approach provides average BD-rate savings of 6.27%, 18.78% and 20.42% under the AI configuration, and average BD-rate savings of 5.18%, 21.95% and 22.13% under the RA configuration, for the Y, Cb and Cr components respectively.
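The BD-rate savings quoted above follow the standard Bjøntegaard procedure for comparing two rate-distortion curves. As a rough, hedged illustration (plain NumPy, not the authors' code), a typical computation from four rate/PSNR points per codec looks like this:

```python
import numpy as np

def bd_rate(rates_ref, psnr_ref, rates_test, psnr_test):
    """Approximate Bjontegaard delta-rate (%) between two RD curves.

    Fits log10(bitrate) as a cubic polynomial of PSNR for each curve and
    integrates over the overlapping PSNR interval.
    """
    lr_ref, lr_test = np.log10(rates_ref), np.log10(rates_test)
    p_ref = np.polyfit(psnr_ref, lr_ref, 3)
    p_test = np.polyfit(psnr_test, lr_test, 3)

    lo = max(min(psnr_ref), min(psnr_test))
    hi = min(max(psnr_ref), max(psnr_test))

    # Integrate the fitted log-rate curves over the common quality range.
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)

    avg_diff = (int_test - int_ref) / (hi - lo)
    return (10 ** avg_diff - 1) * 100  # negative values mean bitrate savings
```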
A study of deep image compression for YUV420 color space
Currently, most deep image compression methods are designed to compress images in the RGB color space. However, there are also many images in the YUV420 color space, and video coding standards such as H.265/HEVC and H.266/VVC support compression of images in the YUV420 color space with their respective Main Still Picture profiles. In this paper, we first study how to adjust deep compression frameworks designed for images in the RGB color space so that they can compress images in the YUV420 color space. Then, we study the impact on coding performance when we adjust the training distortion weights for the YUV channels, and compare the experimental results with HEVC and VVC under the all-intra configuration. The proposed approaches are applicable both to image compression and to intra coding in video compression.
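The training objective referred to above weights the reconstruction error of the Y, U and V channels differently. A minimal sketch of such a weighted distortion, assuming a commonly used 6:1:1 weighting (an illustrative assumption; the exact weights studied in the paper may differ):

```python
import numpy as np

def yuv420_distortion(orig, rec, weights=(6.0, 1.0, 1.0)):
    """Weighted MSE over the Y, U and V planes of a YUV420 image.

    `orig` and `rec` are (Y, U, V) tuples of NumPy arrays, with U and V at
    quarter resolution. The 6:1:1 weighting is an illustrative assumption.
    """
    mses = [np.mean((o.astype(np.float64) - r.astype(np.float64)) ** 2)
            for o, r in zip(orig, rec)]
    return sum(w * m for w, m in zip(weights, mses)) / sum(weights)
```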
Bi-directional prediction for end-to-end optimized video compression
Fabien Racapé, Jean Bégaint, Simon Feltman, et al.
This paper presents and studies an end-to-end Artificial Neural Network (ANN)-based compression framework leveraging bi-directional prediction. Like traditional hybrid codecs in the Random Access configuration, this codec processes video sequences divided into Groups Of Pictures (GOPs) in which each frame can be encoded in Intra or Inter mode. Inter frames can be bi-predicted, i.e., predicted from past and future previously decoded frames; the selection of the reference frames for prediction is signaled within the bitstream, allowing for efficient hierarchical GOP temporal structures. In particular, we study the benefits of optimizing the compression of the motion information prediction residuals using dedicated auto-encoder models in which the layers are conditioned on the GOP structure. The network is trained fully end-to-end from scratch. The increase in compression efficiency shows the promise of conditional convolutions for bi-directional inter coding.
Multi-hypothesis inspired super-resolution for compression distorted screen content image
Meng Wang, Jizheng Xu, Li Zhang, et al.
Multi-hypothesis-based prediction has repeatedly been proven effective in improving prediction accuracy and enhancing coding performance. In this paper, we introduce the principle of multi-hypothesis prediction to the super-resolution (SR) of compressed screen content images, with the goal of improving the restoration quality of compression-contaminated screen content images. More specifically, the super-resolution is achieved by a deep neural network, which learns the mapping relationship between the compressed low-resolution (LR) image and the original high-resolution (HR) image. During the learning process, we feed multiple LR patches for training, including the current patch and five neighboring patches, providing more informative clues for learning a high-quality restoration. In the inference process, the input LR image is translated with random offsets, yielding five assistant LR items for the SR of the input LR image. The LR and assistant LR items employ separate modules for feature extraction, and the features are then fused by concatenation. Subsequently, deep residual feature extraction is applied, composed of multiple consecutive residual blocks. Finally, the deep features are reconstructed with pixel shuffle, producing the SR image. Experimental results verify the effectiveness of the proposed multi-hypothesis-based SR scheme.
Learning-based encoder algorithms for VVC in the context of the optimized VVenC implementation
Gerhard Tech, Valeri George, Jonathan Pfaff, et al.
Versatile Video Coding (VVC) is the most recent and efficient video-compression standard of ITU-T and ISO/IEC. It follows the principle of a hybrid, block-based video codec and offers high flexibility in selecting a coded representation of a video. While encoders can exploit this flexibility for compression efficiency, designing algorithms for fast encoding becomes a challenging problem. This problem has recently been attacked with data-driven methods that train suitable neural networks to steer the encoder decisions. On the other hand, an optimized and fast VVC software implementation is provided by Fraunhofer's Versatile Video Encoder VVenC. The goal of this paper is to investigate whether these two approaches can be combined. To this end, we exemplarily incorporate into VVenC a recent CNN-based approach that has shown its efficiency for intra-picture coding in the VVC reference software VTM. The CNN estimates parameters that restrict the multi-type tree (MTT) partitioning modes tested in rate-distortion optimization. To train the CNN, the approach considers the Lagrangian rate-distortion-time cost caused by the parameters. For performance evaluation, we compare the five operational points reachable with the VVenC presets to operational points that we reach by using the CNN jointly with the presets. Results show that the combination of both approaches is efficient and that there is room for further improvements.
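In hedged form, and with assumed symbol names, a Lagrangian rate-distortion-time cost of the kind mentioned above can be written as a weighted sum of distortion, rate and encoding time for a given choice of partitioning-restriction parameters $\mathbf{p}$:

$$ J(\mathbf{p}) \;=\; D(\mathbf{p}) \;+\; \lambda\, R(\mathbf{p}) \;+\; \mu\, T(\mathbf{p}), $$

where $D$ is the distortion, $R$ the rate, $T$ the encoding time, and $\lambda$, $\mu$ are multipliers trading quality against bitrate and runtime.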
Video Processing for User-Generated Content
Machine-learning-based tuning of encoding parameters for UGC video coding optimizations
In the era of the COVID-19 pandemic, videos are very important to the billions of people staying and working at home. Two-pass video encoding allows for a refinement of parameters based on statistics obtained from the first pass. Given the variety of characteristics in user-generated content, there is an opportunity to make this refinement optimal for this type of content. We show how the traditional models used for rate control in video coding can be replaced with better prediction models based on linear and nonlinear model functions. Moreover, we can utilize these first-pass statistics to further refine the traditional encoding recipes that are typically applied to all input video sequences. Our work can provide much-needed bitrate savings for many different encoders, and we highlight it by testing on typical Facebook video content.
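The exact model functions are described in the paper; as a rough illustration of the idea of fitting a rate model to first-pass statistics and inverting it for the second pass, here is a minimal sketch assuming a simple exponential rate-QP model (an assumption for illustration, not the authors' model):

```python
import numpy as np

def fit_rate_model(qps, bits):
    """Fit a simple exponential rate model R(QP) = a * exp(b * QP).

    Linear least squares in the log domain; stands in for the linear and
    nonlinear first-pass models discussed in the paper (illustrative only).
    """
    b, log_a = np.polyfit(qps, np.log(bits), 1)
    return np.exp(log_a), b

def predict_qp(a, b, target_bits):
    """Invert the fitted model to pick a second-pass QP for a bitrate target."""
    return (np.log(target_bits) - np.log(a)) / b
```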
On quality control of user-generated-content (UGC) compression
Yang Li, Xinfeng Zhang, Shiqi Wang, et al.
The exponential increase in demand for high-quality user-generated content (UGC) videos and the limited bandwidth pose great challenges for hosting platforms in practice, so optimizing the compression of UGC videos efficiently becomes critical. As the ultimate receiver is the human visual system, there is a growing consensus that the optimization of video coding and processing should be fully driven by perceptual quality, so traditional rate-control-based methods may not be optimal. In this paper, a novel perceptual model of compressed UGC video quality is proposed that exploits characteristics extracted from the source video only. In the proposed method, content-aware features and quality-aware features are explored to estimate quality curves against quantization parameter (QP) variations. Specifically, content-relevant deep semantic features from pre-trained image classification neural networks and quality-relevant handcrafted features from various objective video quality assessment (VQA) models are utilized. Finally, a machine-learning approach is proposed to predict the quality of videos compressed with different QP values. Hence, the quality curves can be derived; by estimating the QP for a given target quality, a quality-centered compression paradigm can be built. Experimental results show that the proposed method can accurately model quality curves for various UGC videos and control compression quality well.
Subjective and objective study of sharpness enhanced UGC video quality
With the popularity of video sharing applications and video conferencing systems, there has been growing interest in measuring and enhancing the quality of videos captured and transmitted by those applications. While assessing the quality of UGC videos is itself still an open question, it is even more challenging to enhance the perceptual quality of UGC videos with unknown characteristics. In this work, we study the potential to enhance the quality of UGC videos by applying sharpening. To this end, we construct a subjective dataset through a massive online crowdsourcing campaign. The dataset consists of 1200 sharpness-enhanced UGC videos processed from 200 UGC source videos. During the subjective test, each processed video is compared with its source to capture fine-grained quality differences. We propose a statistical model to precisely measure whether the quality is enhanced or degraded. Moreover, we benchmark state-of-the-art no-reference image and video quality metrics against the collected subjective data. We observe that most metrics do not correlate well with the subjective scores, which indicates the need to develop more reliable objective metrics for UGC videos.
Advanced Video Compression
An end-to-end distributed video analytics system using HEVC annotated regions SEI message
The HEVC Annotated Regions (AR) SEI message supports object tracking by carrying parameters defining rectangular bounding boxes with unique object identifiers, time-aligned within a video bitstream. An end-to-end distributed video analytics pipeline utilizing the AR SEI message within the GStreamer framework has been implemented, with an edge node and a cloud server node. At the edge, light-weight face detection is performed, and face region parameters are used to create the AR SEI message syntax within an HEVC bitstream. At the cloud server, face regions are extracted from the decoded video and age and gender classification is performed. The HEVC bitstream is updated to include additional metadata in the AR SEI message.
Simplified carriage of MPEG immersive video in HEVC bitstream
Immersive video enables end users to experience video in a more natural and interactive way, with viewer motion from any position and orientation within a supported viewing space. MPEG Immersive Video (MIV) is an upcoming standard being developed to handle the compression and delivery of immersive media content. It extracts only the needed information, in the form of patches, from a collection of cameras capturing the scene and compresses it with video codecs such that the scene can be reconstructed at the decoder side from any pose. A MIV bitstream is composed of non-video components carrying view parameters and patch information in addition to multiple video data sub-bitstreams carrying texture and geometry information. In this paper, we describe a simplified MIV carriage method, using an SEI message within a single-layer HEVC bitstream, to take advantage of existing video streaming infrastructure, including legacy video servers. The Freeport player is built on the open-source VLC video player, a GPU DirectX implementation of a MIV renderer, and a face tracking tool for viewer motion. A prerecorded demonstration of the Freeport player is provided.
Implementation of film-grain technology within VVC
Miloš Radosavljevic, Edouard François, Erik Reinhard, et al.
Film grain is often a desirable feature in video production, creating a natural appearance and contributing to the expression of creative intent. Film grain, however, does not compress well with modern video compression standards, such as Versatile Video Coding (VVC), also known as ITU-T H.266 and ISO/IEC 23090-3. Indeed, within the various filtering and lossy compression steps, film grain is suppressed without the possibility of recovering it. One option to alleviate this problem is to use lower quantization parameters to better preserve fine details such as film grain. However, this may strongly increase the bitrate. In some scenarios, information on film grain can be communicated as metadata through, for instance, an SEI message specified by Versatile Supplemental Enhancement Information (VSEI, also known as ITU-T Recommendation H.274 and ISO/IEC 23002-7). Thus, film grain is often modeled and removed prior to compression, and it is then synthesized at the decoder side with the aid of appropriate metadata. In addition, film grain can also be used as a tool to mask coding artifacts introduced by the compression. Different approaches have been studied for film grain modeling. In the context of the new VVC standard, a frequency filtering solution to parameterize and synthesize film grain can be used. This paper provides an overview of such film-grain VVC-compatible technology, including parameterization, signaling and decoder-side synthesis. The approach based on frequency filtering is first summarized; then, quantitative and qualitative simulations are performed to show the benefits of film grain parameterization in terms of bitrate savings at the same perceived quality.
VMAF and variants: towards a unified VQA
P. Topiwala, W. Dai, J. Pian, et al.
Video quality analysis (VQA) is a critical processing task for all major video services. We take streaming services as a key example. Among many possible video encodings of the source material, it is vital to distinguish which encoding would produce the most favorable ratings in terms of quality by the viewers. VQA techniques aim to predict that human quality rating by objective methods. Among such methods, the VMAF algorithm, designed by Netflix, has come into prominence in recent years. We examine VMAF as well as several variants that we have developed, and assess their ability to predict human quality scoring.
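Assessing how well VMAF and its variants predict human quality scoring is usually reported through rank and linear correlation against mean opinion scores. A minimal, hedged sketch of that evaluation step (a generic illustration, not the authors' code):

```python
from scipy.stats import pearsonr, spearmanr

def metric_performance(predicted_scores, mos):
    """Correlation of an objective metric's outputs with mean opinion scores.

    SROCC measures rank agreement, PLCC linear agreement; both are standard
    yardsticks for the kind of comparison described above.
    """
    srocc, _ = spearmanr(predicted_scores, mos)
    plcc, _ = pearsonr(predicted_scores, mos)
    return srocc, plcc
```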
Bandlimited wireless video communications over lossy channels
P. Topiwala, W. Dai, K. Bush
Wireless video communications are a growing segment of communications in both commercial and defense domains. Due to the lossy nature of wireless channels, they require powerful error correction for successful communications. In the commercial domain, ever more people worldwide use video chat and watch videos on portable wireless devices, while in defense domains, surveillance assets are growing rapidly in numbers, providing real-time motion imagery intelligence. In both domains, the transmission of high-quality video is of vital importance, and a variety of both source and channel codecs have been developed, separately for each domain and application. Add tight bandwidth constraints (e.g., < 1 Mb/s), and the challenge intensifies. In this paper, we outline an effort to explore the space of video codecs, channel codecs, and channel models to find the best practices within this large search space. After some preliminary material, we focus attention on the two most common video codecs in use, H.264/AVC and H.265/HEVC, ask which is better to use over lossy channels, and in some cases limit bandwidth to just 750 kb/s. We perform simulations in both additive white Gaussian noise and Rayleigh fading channel models, use signal-to-noise ratios that stress transmissions, and use powerful Low-Density Parity-Check (LDPC) and Polar codes to correct errors. Since the channel varies with time (as does our simulation), we aggregate over multiple simulations. Without channel errors, H.265 would be preferred due to its superior coding efficiency. With errors, this is still true in our simulations, but the picture is more subtle: H.264 is in fact more resilient under high error rates, but as quality improves to usable levels, H.265 takes over.
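The simulation chain described above layers LDPC or Polar coding and fading on top of a basic modulated link. As a hedged baseline illustration only (uncoded BPSK over AWGN, not the authors' full chain):

```python
import numpy as np

def awgn_ber(snr_db, n_bits=100_000, seed=0):
    """Uncoded BPSK bit error rate over an AWGN channel.

    A toy stand-in for one link of the simulation chain described above; the
    actual study adds LDPC/Polar coding and Rayleigh fading on top of this.
    """
    rng = np.random.default_rng(seed)
    bits = rng.integers(0, 2, n_bits)
    symbols = 2.0 * bits - 1.0                         # map {0,1} -> {-1,+1}
    noise_std = np.sqrt(0.5 / 10 ** (snr_db / 10.0))   # Es/N0 with unit symbol energy
    received = symbols + noise_std * rng.normal(size=n_bits)
    return np.mean((received > 0).astype(int) != bits)
```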
VVenC: an open optimized VVC encoder in versatile application scenarios
Adam Wieckowski, Christian Stoffers, Benjamin Bross, et al.
Versatile Video Coding (H.266/VVC) was standardized in July 2020, around seven years after its predecessor, High Efficiency Video Coding (H.265/HEVC). As is typical for a successor standard, VVC aims to offer 50% bitrate savings at similar visual quality, which was confirmed in official verification tests. While HEVC provided large compression efficiency improvements over Advanced Video Coding (H.264/AVC), the fast development of the video technology ecosystem required more in terms of functionality. This resulted in various amendments being specified for HEVC, including screen content, scalability and 3D-video extensions, which fragmented the HEVC market and left only the base specification widely supported across a wide range of devices. To mitigate this, the VVC standard was designed from the start with versatile use cases in mind and provides widespread support already in its first version. Shortly after the finalization of VVC, an open, optimized encoder implementation, VVenC, was published, aiming to provide the potential of VVC at shorter runtime than the VVC reference software VTM. VVenC also supports additional features such as multi-threading, rate control and subjective quality optimizations. While the software is optimized for random-access high-resolution video encoding, it can be configured for alternative use cases. This paper discusses the performance of VVenC beyond its main use case, using different configurations and content types. Application-specific performance is also discussed. It is shown that VVenC can mostly match VTM performance with less computation, and that it provides attractive additional, faster operating points at a tradeoff in bitrate.
Overview of baseline profile in MPEG-5 essential video coding standard
Kwang Pyo Choi, Kiho Choi, Min Woo Park, et al.
The MPEG-5 Essential Video Coding (EVC) standard was finalized in July 2020 by the ISO/IEC Moving Picture Experts Group (MPEG). The main goal of the EVC standard development was to provide significantly improved compression efficiency over existing video coding standards together with timely publication of licensing terms. To achieve this goal, the EVC standard was developed with a royalty-free Baseline profile (BP) as its base and a royalty-bearing Main profile adding a small number of coding tools on top of the Baseline profile. This paper presents the EVC BP, which can be a strong candidate for the media applications that dominate internet platforms. To evaluate the coding performance of the EVC BP, test results compared to H.264/AVC are provided. The EVC BP shows 31.2% and 30.4% bitrate reductions while using only 40% and 23% of the encoding time of H.264/AVC under the RA and LD test scenarios, respectively.
HDR video coding for aerial videos with VVC and AV1
Today, a huge variety of video codecs are available for compressing video, for a wide range of applications. Besides the traditional MPEG-2 and H.264 codecs, we now have H.265, VP9, AV1, H.266, EVC, and LC-EVC. H.266 succeeds H.265; EVC is essentially a light version of H.266/VVC; AV1 is a more advanced version of VP9; and LC-EVC is not a codec at all (nor related to EVC), but a codec enhancer. Of the remaining contenders for the next-generation video codec, we focus on perhaps the two most relevant: VVC and AV1. We take a practical approach and focus on a specific application: high-quality 4K aerial HDR video. Consumers love 4K UHD HDR; that moniker sells TVs. And one of the most common ways consumers can capture such content is with a high-end consumer drone. Modern drones can already capture 4K (and even 8K) in HDR, using the HLG transfer function, in 10-bit H.265 at 100 Mb/s, creating clean content. We have collected some representative aerial HDR video clips to test with. In this context, we consider three issues: (a) convenience of conversion; (b) coding efficiency; and (c) playback support. We mention IP licensing issues in passing but focus on the technical merits as well as business prospects. In short, we find that both formats are usable and moderately convenient to convert to. VVC has a slight edge in terms of coding efficiency (~15%). However, only one of the two has even a possibility of wide playback support at this time: AV1. But that will change in time (~3-5 years). AV1 dates to March 2018, VVC only to July 2020.
Imaging and Information Chaos
Fake-buster: a lightweight solution for deepfake detection
Nathan Hubens, Matei Mancas, Bernard Gosselin, et al.
Recent advances in video manipulation techniques have made synthetic media creation more accessible than ever before. Nowadays, video editing is so realistic that we cannot rely exclusively on our senses to assess the veracity of media content. With the amount of manipulated video doubling every six months, we need sophisticated tools to process the huge amount of media shared all over the internet and to remove manipulated videos as fast as possible, thus reducing potential harm such as fueling disinformation or reducing trust in mainstream media. In this paper, we tackle the problem of face manipulation detection in video sequences targeting modern facial manipulation techniques. Our method involves two networks: (1) a face identification network, extracting the faces contained in a video, and (2) a manipulation recognition network, considering the face as well as its neighbouring context to find potential artifacts indicating that the face was manipulated. More particularly, we propose to make use of neural network compression techniques such as pruning and knowledge distillation to create a lightweight solution able to rapidly process streams of videos. Our approach is validated on the DeepFake Detection Dataset, consisting of videos created with five different manipulation techniques and reflecting the organic content found on the internet, and compared to state-of-the-art deepfake detection approaches.
JPEG Fake Media: a provenance-based sustainable approach to secure and trustworthy media annotation
Frederik Temmermans, Deepayan Bhowmik, Fernando Pereira, et al.
Media assets can easily be manipulated with photo editing software or artificially created using deep learning techniques. This can be done with the intention to mislead, but also for creative or educational purposes. Clear annotation of media modifications is a crucial element in assessing trustworthiness. However, these annotations should be attached securely to prevent them from being compromised. In addition, to achieve wide adoption, interoperability is essential. This paper gives an overview of the history of media manipulation and discusses the state of the art and challenges related to AI-based detection methods. The paper then introduces JPEG Fake Media as a provenance-based, sustainable approach to secure and trustworthy media annotation. The objective of JPEG Fake Media is to produce a standard that can facilitate the secure and reliable annotation of media asset creation and modifications. The standard shall support good-faith usage scenarios as well as those with malicious intent.
Adopting the JPEG universal metadata box format for media authenticity annotations
Frederik Temmermans, Leonard Rosenthol
The JPEG Universal Metadata Box Format (JUMBF) provides a uniform framework to support metadata-based extensions within the JPEG ecosystem. JUMBF provides a generic container in which domain-specific metadata can be defined. On top of this container, JUMBF provides additional functionalities to facilitate linkage between metadata and image data. While JUMBF forms the foundation for JPEG specifications such as JPEG 360, JPEG Privacy & Security and JLINK, it is also intended for adoption by third parties. The Coalition for Content Provenance and Authenticity (C2PA) is designing an architecture for providing provenance to digital media, giving creators tools to express objective reality and empowering consumers to evaluate whether what they are seeing is trustworthy. The C2PA adopted JUMBF as the container format for its provenance information, initially in JPEG files, but will also use it with video, audio and documents. As such, it serves as an interesting case study on the adoption of JUMBF by a third party and its use outside of the JPEG ecosystem. This paper presents how the C2PA uses the JUMBF architecture and proposes potential future enhancements to the JUMBF specification.
Framing photos in the digital dark age: towards a socio-technological 'ecology of images'
S. Caldwell, T. Gedeon
The ecology of images of which Susan Sontag spoke in 1978 is urgently needed now if we are to establish a framework within which the daily onslaught of uncertain imagery can be understood, and a reliable photographic record of our current and future real world can be preserved. Many technologies (ubiquitous digital photography, sophisticated camera sensors, image editing, ‘deep fakes,’ GAN imaging) have brought to fruition the potential for photographic images to expand into a wide range of ‘ecological’ niches. The contexts that apply to image distribution (digital platforms, social engineering, governmental regulation) are daily news.Drawing together a range of investigations into aspects of the human-image nexus using eye gaze signals, neural networks, online surveys, educational workshops and research into digital platforms, we outline the main technological and social features of a potential ‘ecology of images’ that may help shine light on the panoply of images of our times.
Image and Video Compression
Per-clip and per-bitrate adaptation of the Lagrangian multiplier in video coding
In the past ten years there have been significant developments in the optimization of transcoding parameters on a per-clip rather than per-genre basis. In our recent work we presented per-clip optimization of the Lagrangian multiplier in rate-controlled compression, which yielded BD-rate improvements of approximately 2% across a corpus of videos using HEVC. However, in a video streaming application, the focus is on optimizing the rate/distortion tradeoff at a particular bitrate and not on average across a range of operating points. We observed in previous work that a particular multiplier might give BD-rate improvements over a certain range of bitrates, but not over the entire range; using different parameters across the range would improve gains overall. Therefore, here we present a framework for choosing the best Lagrangian multiplier on a per-operating-point basis across a range of bitrates. In effect, we are trying to find the Pareto-optimal gain across bitrate and distortion for a single clip. In the experiments presented, we employ direct optimization techniques to estimate this Lagrangian parameter path for approximately 2,000 video clips, primarily from the YouTube-UGC dataset. We optimize both for bitrate savings and for distortion metrics (PSNR, SSIM).
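In hedged form, and with assumed symbol names, the per-operating-point adaptation described above amounts to replacing the codec's default multiplier $\lambda_{\text{orig}}$ in the rate-distortion cost with a scaled version and searching for the scale that is best at each target bitrate $R_t$:

$$ J \;=\; D + k\,\lambda_{\text{orig}}\,R, \qquad k^{*}(R_t) \;=\; \arg\min_{k}\; D\big(k \mid R = R_t\big). $$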
DCST: a data-driven color/spatial transform-based image coding method
Block-based discrete cosine transforms (DCT) and quantization matrices on the YCbCr color channels play key roles in JPEG and have been widely used in other standards over the last three decades. In this work, we propose a new image coding method, called DCST, which adopts a data-driven color transform and spatial transform based on the statistical properties of pixels and machine learning. To match the data-driven forward transform, we propose a method to design the quantization table based on the human visual system (HVS). Furthermore, to efficiently compensate for the quantization error, a machine-learning-based optimal inverse transform is proposed. The performance of our new design is verified on the Kodak image dataset using a libjpeg-based implementation. Our pipeline outperforms JPEG with a gain of 0.5738 dB in BD-PSNR (or a reduction of 9.5713% in BD-rate) over the range from 0.2 to 3 bpp.
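The exact transform design is described in the paper; as a generic, hedged stand-in for a data-driven color transform, a decorrelating transform can be learned from training pixels with a KLT/PCA (illustrative only, not the authors' method):

```python
import numpy as np

def fit_color_transform(pixels_rgb):
    """Learn a decorrelating 3x3 color transform (KLT/PCA) from training pixels.

    `pixels_rgb` is an (N, 3) array of training samples; this is a generic
    data-driven stand-in for the color transform design described above.
    """
    mean = pixels_rgb.mean(axis=0)
    cov = np.cov(pixels_rgb - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]          # strongest component first
    return mean, eigvecs[:, order].T           # rows are the transform basis

def apply_color_transform(img, mean, basis):
    """Map an (H, W, 3) image into the learned color space."""
    flat = img.reshape(-1, 3) - mean
    return (flat @ basis.T).reshape(img.shape)
```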
Content adaptive video compression for autonomous vehicle remote driving
Itai Dror, Raz Birman, Oren Solomon, et al.
It is anticipated that in some extreme situations, autonomous cars will benefit from the intervention of a "Remote Driver". The vehicle computer may discover a failure and decide to request remote assistance for safe roadside parking. In a more extreme scenario, the vehicle may require a complete remote-driver takeover due to malfunctions or an inability to resolve unknown decision logic. In such cases, the remote driver will need a sufficiently good quality real-time video stream of the vehicle cameras to respond quickly and accurately enough to the situation at hand. Relaying such a video stream to the remote Command and Control (C&C) center is especially challenging when considering the varying wireless channel bandwidths expected in these scenarios. This paper proposes an innovative end-to-end content-sensitive video compression scheme to allow efficient and satisfactory video transmission from autonomous vehicles to the remote C&C center.
Frame synthesis for video compression
Nicola Giuliani, Biao Wang, Elena Alshina, et al.
Traditional video compression techniques heavily rely on the concept of motion compensation to predict frames in a video given previous or future frames. With the recent advances in artificial intelligence powered techniques, new approaches to video coding become apparent. We describe a video compression scheme that employs neural network based image compression and frame generation. The proposed method encodes key frames (I frames) as still images and entirely skips compression and signalling of intermediate frames (S frames), which are conversely synthesized on the decoder side exclusively using I frames. Varying complexities of motion can occur within a sequence, which can have a strong impact on the quality of the generated frames. In order to address this challenge, we propose to let the encoder dynamically adjust the group of pictures (GOP) structure. This adjustment is performed based on the quality of predicted S frames. The achieved performance of the proposed method suggests that entirely skipping coding and instead synthesizing frames is promising and should be considered for future developments of learning based video codecs.
Learning residual coding for point clouds
Davi Lazzarotto, Touradj Ebrahimi
Recent advancements in the acquisition of three-dimensional models have been drawing increasing attention to imaging modalities based on plenoptic representations, such as light fields and point clouds. Since point cloud models can often contain millions of points, each including both geometric positions and associated attributes, efficient compression schemes are needed to enable transmission and storage of this type of media. In this paper, we present a detachable learning-based residual module for point cloud compression that allows for efficient scalable coding. Our proposed method is able to learn the encoding of residuals in any layered architecture, and is here implemented in a hybrid approach using both the TriSoup and the Octree modules from the G-PCC standard as its base layer. Results indicate that the proposed method can achieve performance gains in terms of rate-distortion when compared to either base layer alone, which is demonstrated both through objective metrics and through subjective perception of quality in a rate-distortion framework. The source code of the proposed codec can be found at https://github.com/mmspg/learned-residual-pcc.
Towards much better SVT-AV1 quality-cycles tradeoffs for VOD applications
Ping-Hao Wu, Ioannis Katsavounidis, Zhijun Lei, et al.
Software video encoders that have been developed based on the AVC, HEVC, VP9, and AV1 video coding standards have provided improved compression efficiency but at the cost of large increases in encoding complexity. As a result, there is currently no software video encoder that provides competitive quality-cycles tradeoffs extending from the AV1 high-quality range to the AVC low-complexity range. This paper describes methods based on the dynamic optimizer (DO) approach to further improve the SVT-AV1 overall quality-cycles tradeoffs for high-latency Video on Demand (VOD) applications. First the performance of the SVT-AV1 encoder is evaluated using the conventional DO approach, and then using the combined DO approach that accounts for all the encodings being considered in the selection of the encoding parameters. A fast parameter selection approach is then discussed. The latter allow for up to a 10x reduction in the complexity of the combined DO approach with minimal BD-rate loss.
Applications of Visual Perception in Imaging
Training compression artifacts reduction network with domain adaptation
Yu-Jin Ham, Chaehwa Yoo, Je-Won Kang
Compression artifact removal is imperative for visually pleasing content after image and video compression. Recent works on compression artifact reduction networks (CARN) assume that the same or a similar quality of images is employed for both training and testing and, accordingly, that a model needs a quality factor as a prior to accomplish the task successfully. However, any discrepancy will degrade performance substantially in the target domain if the model confronts a level of distortion different from that seen in the training phase. To solve this problem, we propose a novel training scheme for CARN that takes advantage of domain adaptation (DA). Specifically, we treat images encoded with different quality factors as different domains and train a CARN using DA to perform robustly in another domain with a different level of distortion. Experimental results demonstrate that the proposed method achieves superior performance on DIV2K, BSD68, and Set12.
Study on deep CNN as preprocessing for video compression
In recent years, video compression and picture quality have become increasingly active research topics, and user demand for higher-resolution, higher-quality video compression keeps growing. Versatile Video Coding (VVC) is the latest video coding standard specially designed for video compression. However, its frequency-based transform techniques are vulnerable to high-frequency noise, which results in increased bitrate or lower picture quality. To address this issue, we apply a denoising convolutional neural network (DnCNN) to the codec input video as a preprocessing step, since the DnCNN model was developed for image denoising and handles Gaussian noise with a residual learning strategy. In this paper, we present experimental results showing how the DnCNN model helps with noisy video data in terms of quality and bitrate.
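For readers unfamiliar with DnCNN-style preprocessing, the sketch below shows the residual-learning idea in PyTorch: the network predicts the noise map and the denoised frame is the input minus that prediction. Depth and channel counts are reduced for illustration and are not the configuration used in the paper:

```python
import torch.nn as nn

class DnCNN(nn.Module):
    """Compact DnCNN-style residual denoiser (depth reduced for illustration)."""

    def __init__(self, channels=1, features=64, depth=7):
        super().__init__()
        layers = [nn.Conv2d(channels, features, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(features, features, 3, padding=1),
                       nn.BatchNorm2d(features), nn.ReLU(inplace=True)]
        layers.append(nn.Conv2d(features, channels, 3, padding=1))
        self.body = nn.Sequential(*layers)

    def forward(self, noisy):
        # The network estimates the noise; subtracting it yields the clean frame.
        return noisy - self.body(noisy)
```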
The effect of degradation on compressibility of video
Varoun Hanooman, Anil Kokaram, Yeping Su, et al.
The technology climate for video streaming changed vastly during 2020. Since the start of the pandemic, video traffic over the internet has increased dramatically. This has clearly increased interest in the bitrate/quality tradeoff of video compression for applications in video streaming and real-time video communications. As far as we know, the impact of different artefacts on that tradeoff has not previously been systematically evaluated. In this paper we propose a methodology for measuring the impact of various degradations (noise, grain, flicker, shake) in a video compression pipeline. We show that noise/grain has the largest impact on codec performance, but that modern codecs are more robust to this artefact. In addition, we report on the impact of a denoising module deployed as a pre-processor and show that performance metrics change in the context of the pipeline. Denoising would benefit from being treated as part of the processing pipeline both in development and in testing.
A differentiable VMAF proxy as a loss function for video noise reduction
Darren Ramsook, Anil Kokaram, Noel O'Connor, et al.
Traditional metrics for evaluating video quality do not completely capture the nuances of the Human Visual System (HVS); however, they are simple to use for quantitatively optimizing parameters in enhancement or restoration. Modern full-reference Perceptual Visual Quality Metrics (PVQMs) such as the video multi-method assessment fusion (VMAF) function are more robust than traditional metrics in terms of the HVS, but they are generally complex and non-differentiable. This lack of differentiability means that they cannot be readily used in optimization scenarios for enhancement or restoration. In this paper we look at the formulation of a perceptually motivated restoration framework for video. We deploy this process in the context of denoising by training a spatio-temporal denoising deep convolutional neural network (DCNN). We design DCNNs as differentiable proxies for both a spatial and a temporal version of VMAF. These proxies are used as part of the proposed loss function when updating the weights of the spatio-temporal DCNNs. We combine these proxies with traditional losses to propose a perceptually motivated loss function for video. Our results show that using the perceptual loss function as a fine-tuning step yields a higher VMAF score and lower PSNR compared to the spatio-temporal network trained using the traditional mean squared error loss. Using the perceptual loss function for the entirety of training yields lower VMAF and PSNR, but produces visibly less noise in the output.
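A hedged sketch of how a differentiable proxy can be mixed with a pixel loss during training; the `vmaf_proxy` network, its score range and the weights are assumptions for illustration, not the authors' exact formulation:

```python
import torch

def perceptual_loss(denoised, reference, vmaf_proxy, alpha=1.0, beta=0.1):
    """Loss mixing a differentiable VMAF-proxy score with pixel-wise MSE.

    `vmaf_proxy` is assumed to be a trained network returning a score tensor
    in [0, 100] for a (denoised, reference) pair; alpha/beta are illustrative.
    """
    proxy_score = vmaf_proxy(denoised, reference)     # higher is better
    perceptual = 1.0 - proxy_score / 100.0            # turn the score into a loss
    fidelity = torch.nn.functional.mse_loss(denoised, reference)
    return alpha * perceptual.mean() + beta * fidelity
```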
Review of subjective quality assessment methodologies and standards for compressed images evaluation
Lossy image compression algorithms are usually employed to reduce the storage space required by the large number of digital pictures that are acquired and stored daily on digital devices. Despite the gain in storage space, these algorithms might introduce visible distortions in the images. However, users typically value the visual quality of digital media and do not tolerate any distortion. Objective image quality assessment metrics aim to predict the amount of such distortion as perceived by human subjects, but only a limited number of studies have been devoted to the objective assessment of the visibility of artifacts on images as seen by human subjects. In other words, most objective quality metrics do not indicate when artifacts become imperceptible to human observers. An objective image quality metric that assesses the visibility of artifacts could, in fact, drive compression methods toward a visually lossless approach. In this paper, we present a subjective image quality assessment dataset designed for the problem of visually lossless quality evaluation in image compression. The distorted images have been labeled, after a subjective experiment conducted through crowdsourcing, with the probability of the artifacts being visible to human observers. In contrast to other datasets in the state of the art, the proposed dataset contains a large number of images with multiple distortions, making it suitable as a training set for a learning-based approach to objective quality assessment.
Fundamental relationships between subjective quality, user acceptance, and the VMAF metric for a quality-based bit-rate ladder design for over-the-top video streaming services
Andreas Kah, Christopher Friedrich, Thomas Rusert, et al.
A quality-based bit rate ladder design for over-the-top video streaming services is presented. Following the design criterion of maximizing subjective quality under the constraint of minimizing storage costs, the bit rate ladder is defined by three parameters. The first parameter is the lowest VMAF score at which a video signal is on average subjectively indistinguishable from the original video signal. Following the international recommendation ITU-R BT.500, extensive subjective tests were carried out to evaluate the fundamental relationships between subjective quality and the VMAF score using a 4K OLED TV environment. Based on the test results, this VMAF score is set to 95. The second parameter is the lowest VMAF score accepted on average by more than 50% of the users for watching video signals of free streaming services. Additional tests result in setting this VMAF score to 55. The third parameter is the maximum difference between two VMAF scores for which the associated subjective qualities are approximately the same on average. In a third test, this difference is determined to be 2. This results in an ideal bit rate ladder providing each video signal in 21 qualities associated with the VMAF scores 95, 93, …, 57, 55. This bit rate ladder design can be applied to complete video signals, as in per-title encoding strategies, or to individual scenes of video signals, as in per-scene or shot-based encoding strategies. Applications using fewer than 21 renditions for this range may suffer from impaired subjective quality.
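The arithmetic behind the 21 renditions follows directly from the three parameters: targets run from 95 down to 55 in steps of 2, giving (95 - 55)/2 + 1 = 21 rungs. A trivial sketch:

```python
def vmaf_ladder(top=95, bottom=55, step=2):
    """VMAF quality targets of the ladder described above: 95, 93, ..., 55."""
    return list(range(top, bottom - 1, -step))

assert len(vmaf_ladder()) == 21
```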
Applications of Biomedical Imaging
Image-based autofocusing algorithm applied in image fusion process for optical microscope
This study proposes a novel passive auto-focusing algorithm for optical microscopes. The core of the proposed auto-focusing algorithm is the weight distribution of a Gaussian pyramid and the zero weight values in the designed median mask, which make it robust to noise. Optical microscope measurement is a non-contact optical measurement technology. The captured images will be blurred when the depth of the sample exceeds the depth of field of the microscope, and the blurred areas in the image easily produce noise. In order to reduce the problems of image blurring and noise, the proposed auto-focusing algorithm is applied within the image fusion process. The stack of images is captured by a microscope system that contains a CCD (charge-coupled device) camera and a PZT (piezoelectric transducer). The stack contains clear images on the focal plane and blurred images outside the depth of field. The image fusion algorithm processes this stack of images to obtain the 2D fusion image and the 3D profile simultaneously. The proposed algorithm is compared to four known auto-focusing algorithms. In the experimental results for the 2D fusion image, the proposed algorithm is as good as the four auto-focusing algorithms. Moreover, in the 3D profile results, the RMSE of the proposed algorithm is 287.628, which is lower than the RMSE of the other four auto-focusing algorithms.
Evaluation of deep learning techniques for the detection of pulmonary nodules in computer tomography scans
Lung cancer is the third most common cancer and the leading cause of cancer-related death in America. This cancer has a high lethality, with an overall survival of 16% at five years. Symptoms are nonspecific, so diagnosis is usually delayed. To achieve earlier diagnosis and initiate treatment at a non-advanced stage of the cancer, and thus reduce mortality, low-dose computed tomography (CT) scans are performed. Advanced image processing and machine learning techniques are therefore required, since the high volume of images generated by medical equipment forces specialists to review a large amount of information to make a medical diagnosis. For diagnosis, the images are analyzed by specialists in order to find nodules, measure them and evaluate them. However, the nodules found in the lungs have different shapes, dimensions and textures, which makes identification difficult. For this reason, this paper presents the implementation, analysis and evaluation of two deep learning techniques for the detection of pulmonary nodules in CT scans, resulting in prediction models with a high percentage of accuracy.
Evaluation of segmentation techniques for cell tracking in confocal microscopy images
Manuel G. Forero, Luis H. Rodriguez, Sergio L. Miranda
In different biological studies, such as cell regeneration studies, cell tracking over time is required. In these studies, the evolution of an amputated limb of the crustacean Parhyale hawaiensis is tracked using 4D confocal microscopy images. However, the high number of images, the noise level and the number of cells make the manual cell tracking process complex, cumbersome and difficult. The tracking process using image processing techniques generally includes three stages: image enhancement, segmentation and cell identification. A tool made for this purpose, as a plugin for the ImageJ program, is TrackMate, commonly used by biologists, which includes the Laplacian of Gaussian (LoG) and Difference of Gaussians (DoG) detectors for segmentation. To provide even more powerful detectors, filtering methods based on the second derivative, due to Deriche and to Shen and Castan, were implemented and included in TrackMate. These four methods were evaluated for cell detection in images of Parhyale hawaiensis, finding that the Deriche and Shen-Castan filters detected an appreciable number of false positives, due to sensitivity to noise and because the same cell was counted multiple times. The LoG and DoG methods presented the best results and behaved very similarly, because the DoG is basically an approximation of the LoG, with the DoG method slightly outperforming the LoG.
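For context on why the two best detectors behave so similarly, the difference of Gaussians is a standard fast approximation of the Laplacian of Gaussian; a minimal sketch with SciPy (parameter values are illustrative):

```python
from scipy.ndimage import gaussian_filter, gaussian_laplace

def dog_response(image, sigma, k=1.6):
    """Difference of Gaussians, a fast approximation of the Laplacian of Gaussian."""
    return gaussian_filter(image, k * sigma) - gaussian_filter(image, sigma)

def log_response(image, sigma):
    """Scale-normalised Laplacian of Gaussian blob response."""
    return sigma ** 2 * gaussian_laplace(image, sigma)
```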
Preprocessing fast filters and mass segmentation for mammography images
Yuliana Jiménez Gaona, M. J. Rodríguez-Álvarez, Jimmy Freire, et al.
Digital mammography is a valuable technique for breast cancer detection because it is safe, noninvasive and can reduce unnecessary biopsies. However, it is difficult to distinguish masses from normal or dense regions because of their morphological characteristics and ambiguous margins. Thus, improving image quality, highlighting tissue details and performing mass segmentation are important tasks for early breast cancer diagnosis. This work presents a preprocessing system for the mini-Mammographic Image Analysis Society (MIAS) database, which combines the classic and efficient Median, Wiener and Gaussian filters to remove salt-and-pepper, speckle and Gaussian noise in mammography images. The experimental results indicate that the Gaussian filter outperforms the other filtering techniques, as evaluated by the Peak Signal-to-Noise Ratio and Mean Square Error metrics.
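A hedged sketch of how such a filter comparison can be scored when a clean reference is available; kernel sizes and sigma are illustrative choices, not the settings used in the paper:

```python
import numpy as np
from scipy.ndimage import median_filter, gaussian_filter
from scipy.signal import wiener

def denoise_and_score(noisy, clean):
    """Apply the three filters compared above and score each with PSNR.

    `noisy` and `clean` are 2D float arrays in [0, 255]; filter parameters
    here are illustrative assumptions.
    """
    candidates = {
        "median": median_filter(noisy, size=3),
        "wiener": wiener(noisy, mysize=3),
        "gaussian": gaussian_filter(noisy, sigma=1.0),
    }

    def psnr(a, b):
        mse = np.mean((a - b) ** 2)
        return 10 * np.log10(255.0 ** 2 / mse)

    return {name: psnr(out, clean) for name, out in candidates.items()}
```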
Evaluation of filtering techniques for cell tracking in confocal microscopy images
The process of cell regeneration is a field of study and analysis that has grown in recent years in biology. For its study, 4D confocal microscopy images are acquired that allow the visualization of cell regeneration over time. However, the recognition and tracking of cells is in many cases done with manual techniques, making this task complex, biased and time consuming. In addition, the very low S/N ratio of this type of image makes it necessary to implement smoothing filters that do not degrade edge quality by making edges more diffuse, allowing a better detection of the number of cells over time. Although a freely available semi-automatic tracking technique, the TrackMate tool, facilitates the user's work, it only offers a median filter for the smoothing process. Therefore, this paper presents the study, development and implementation of the à trous, anisotropic diffusion, bilateral, guided, enhanced propagated, K-SVD, non-local means, bilateral enhanced propagated, ROF and TVL smoothing methods as integrated filters within the TrackMate tool. Their behavior is analyzed in practical cases of progenitor cell detection and tracking, taking as criteria optimal noise attenuation with the lowest loss of information and the highest cell count in 4D images of Parhyale hawaiensis, in order to find the most efficient and accurate techniques for cell tracking and thus improve this analysis tool, allowing the user to improve the results of studies performed on confocal microscopy images.
MTS image analyzer: a software tool to identify mesial temporal sclerosis in MRI images
D. Castillo, J. Macas, R. Samaniego, et al.
Epilepsy is a chronic neurological disorder that causes unprovoked and recurrent seizures and, according to the WHO, affects approximately 50 million people worldwide. Functional magnetic resonance images (MRI) help to identify certain affected areas of the brain, namely gliosis and hippocampal volume loss. These losses cause complex epilepsy and are known as hippocampal sclerosis or Mesial Temporal Sclerosis (MTS). This work presents the development of a Computer-Aided Diagnosis (CAD) system (software package) that can be used to identify the characteristics and patterns of MTS from brain magnetic resonance images. The image processing techniques involve texture analysis, statistical features, evaluation of the 3D region of interest (ROI), and threshold analysis. The software allows the automatic evaluation of the degeneration of hippocampal structures, hippocampal volume and signal intensity. We describe and demonstrate the software, which can currently be accessed on GitHub. It is expected that this tool will be useful for new neurology/radiology specialists and can serve as a secondary diagnostic aid. However, it is necessary to validate the software system qualitatively and quantitatively in order to achieve greater effectiveness and efficiency in a real-world clinical application.
Cartesian function of glycerin diffusion over ex-vivo porcine skin samples using multiple sequential THz images
E. Saucedo-Casas, M. Alfaro-Gomez, L. E. Piña-Villalpando
We study images from THz electromagnetic waves reflected from ex-vivo porcine skin samples. The THz images were taken sequentially over a period of time after glycerin was applied to the tissue samples. THz imaging is especially sensitive to the water content of any system. Given that glycerin acts as a dehydration agent, its concentration and diffusion can be determined by observing the THz response. In this work, we process the THz sequential images in order to evaluate the diffusion of the material within the sample. We apply image processing techniques to calculate changes in the area of interaction between glycerin and tissue with respect to time. We also analyze the changes in glycerin concentration as a function of time and space using a numerical approach based on a finite-difference algorithm. The obtained values of the diffusion coefficient are in agreement with those reported in the literature.
Imaging Security and Analysis
Facial recognition system for security access control
Some companies require that only authorized personnel can enter certain restricted areas. Access control systems normally use RFID (Radio Frequency Identification) cards. However, these systems are not immune to impersonation because cards can be stolen. Other alternatives for the development of security systems consist in the use of facial recognition techniques, which are safer since it is more difficult to impersonate someone, although photographs of the subject can still be used to try to breach the system's security. Therefore, this work proposes the development of a facial recognition application that allows access only to those authorized persons who, during recognition, make a gesture to confirm that they are indeed a real subject. The proposed technique comprises three serial stages: face recognition, mouth movement or blink detection, and liveness detection. Several algorithms were analyzed for each stage, choosing the models with which the best performance was obtained. In the first two stages, the Geitgey and Xie methods, respectively, were used for mouth detection, and the Geitgey and Soukupova-Cech algorithms for blinking. Since security systems demand the highest possible accuracy, a new technique for liveness detection based on background analysis is proposed, which outperformed the results obtained with Rosebrock's technique, achieving 100% accuracy in a processing time of 6.76 seconds. To evaluate the methods, a database was constructed consisting of 46 videos of fake people and 40 videos of real people performing the mouth opening and closing gesture and blinking.
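The blink detection stage builds on the eye-aspect-ratio idea of Soukupova and Cech: six eye landmarks give a ratio that drops sharply when the eye closes. A minimal sketch (the 0.2 threshold is an illustrative value, not necessarily the one used in the paper):

```python
import numpy as np

def eye_aspect_ratio(eye):
    """Eye aspect ratio from six eye landmarks, after Soukupova and Cech.

    `eye` is a (6, 2) array ordered p1..p6. The EAR drops sharply during a
    blink, so a frame can be flagged when the EAR falls below a threshold
    (e.g. 0.2, an illustrative value).
    """
    v1 = np.linalg.norm(eye[1] - eye[5])   # vertical distance p2-p6
    v2 = np.linalg.norm(eye[2] - eye[4])   # vertical distance p3-p5
    h = np.linalg.norm(eye[0] - eye[3])    # horizontal distance p1-p4
    return (v1 + v2) / (2.0 * h)
```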
Towards a secure and trustworthy imaging with non-fungible tokens
Nicolas Martinod, Kambiz Homayounfar, Davi Lazzarotto, et al.
Non-fungible tokens (NFTs) are used to define the ownership of digital assets. More recently, there has been a surge of platforms auctioning digital art as well as other digital assets in the form of image, video, and audio content of all sorts. Although NFTs have the potential to revolutionize the foundations of ownership, they also face various challenges, notably in terms of trust and security. This paper starts by identifying the challenges of current NFTs and proposes a solution to remedy their current shortcomings.
Towards image denoising in the latent space of learning-based compression
Learning-based approaches to image compression have demonstrated comparable or even superior performance to conventional approaches in terms of compression efficiency and visual quality. A typical approach in learning-based image compression is the autoencoder, an architecture consisting of two main parts: a multi-layer neural network encoder and a dual decoder. The encoder maps the input image from the pixel domain to a compact representation in a latent space. The decoder then reconstructs the original image in the pixel domain from its latent representation as accurately as possible. Traditionally, image processing algorithms, and in particular image denoising, are applied to images in the pixel domain before compression, and possibly even after decompression. Combining the denoising operation with the encoder might reduce the computational cost while achieving the same accuracy. In this paper, the idea of fusing the image denoising operation with the encoder is examined. The results are evaluated both by simulating the human perspective through objective quality metrics and by machine vision algorithms for the use case of face detection.
3D Imaging and Augmented-Reality Applications
Design and implementation of augmented reality system for paper media based on ARtoolKit
In the process of knowing and understanding the world, traditional books and paper media have always played a very important role. However, with the emergence of new media and the widespread use of digital media technology, traditional paper media have gradually faded from view. Stereoscopic paper media augment traditional paper books by using the camera of a display device to identify a marker card or the real scene, and then combine the captured virtual information with the real scene to display a three-dimensional virtual scene on the device. In this paper, a program is written in VC++ with the ARtoolKit software development kit under the Windows 10 operating system; based on augmented reality technology, virtual 3D scenes are built through 3D modeling to achieve matching and integration with the real scenes of paper books and to realize 3D interaction. On this basis, a highly realistic 3D interactive augmented reality system is implemented. The main research contents include: customizing the paper media markers; reducing the system's misjudgment rate for the markers by combining the homography matrix with an optimized computation of the image matching value, and recognizing and tracking the markers in real time; optimizing the image segmentation method of the ARtoolKit software development kit, rotating the marker card 45° relative to the camera to simulate a significant scene change and using a minimum-error method to compute the gaps in real time to realize augmented reality; building a 3D model with animation effects based on ARtoolKit and the Unity development platform; accurately aligning the recognized markers with the virtual 3D model; and adding audio matched to the content as well as interactive functions to control the 3D animation. Stereoscopic paper media break the traditional way of reading paper books: instead of only reading the printed content, readers can view related virtual information, such as three-dimensional scenes and videos, on the display device of a mobile terminal. Stereoscopic paper media increase the reader's interest in reading and provide a vivid and realistic reading experience.
3D computer-generated holograms for augmented reality applications in medical education
Current display technologies are limited in projecting floating ultra-high-definition images on multiple layers, preventing applications in augmented reality for medical education. Computer-generated holography (CGH) with custom algorithms allows displaying floating 3D images for augmented reality applications. A limitation of existing 3D display technologies is the lack of clearly defined anatomical structures and models for medical education. The custom algorithm was based on a micro-optical adaptation of the natural superposition compound eye, with the help of a virtual Gabor superlens and a layering technique. High-resolution Spatial Light Modulators (SLMs) can enable increasing the field of view and display size in CGH. 3D holographic projection applications require an enlarged field of view for multi-user purposes and can be employed to deliver presentations from remote settings and as non-invasive practical education in medicine. This technology targets academic medical centres, hospitals and clinics as well as research laboratories. In this work, a layered holographic projection method was developed to display high-resolution (3840×2160 px) 3D floating images in direct-view mode for medical education. A computational algorithm was created based on a phase retrieval algorithm and a virtual Gabor superlens to project a hologram on the Liquid Crystal on Silicon (LCoS) display panel of a UHD SLM. Code was written to generate multilayer 3D in-eye projections by adding multiple retrieved holograms, each with an independent Gabor zone plate, into a single hologram. The reconstructions were obtained with a HeNe laser (633 nm, 5 mW) and the UHD SLM with reflective phase modulation. The 3D holograms were directly observed floating as a ghost image at variable focal distances.
High-speed simultaneous measurement of depth and normal for real-time 3D reconstruction
Leo Miyashita, Yohta Kimura, Satoshi Tabata, et al.
Depth measurement and normal measurement have advantages at different spatial frequencies and are complementary in 3D shape measurement. However, conventional measurements of depth and normal are performed exclusively or in a time-division manner, and high-speed simultaneous measurement has not been achieved. In this paper, we propose a new optical system setup for high-speed simultaneous measurement of depth and normal with an active stereo method and a photometric stereo method. Furthermore, we propose a high-speed 3D shape reconstruction method using a GPU, which combines the complementary information obtained from the two measurements. We evaluated the throughput of the prototype system, and the results show that simultaneous depth and normal measurement and 3D shape reconstruction are performed at 500 fps.
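The photometric-stereo half of such a setup is commonly implemented as a per-pixel least-squares problem. The sketch below is a generic Lambertian formulation in NumPy, not the authors' GPU pipeline, and it assumes calibrated distant light directions.

```python
import numpy as np

def photometric_stereo_normals(images, light_dirs):
    """Classic Lambertian photometric stereo: recover per-pixel surface
    normals from >=3 images taken under known distant light directions.

    images:     array of shape (k, h, w), grayscale intensities
    light_dirs: array of shape (k, 3), unit light direction per image
    """
    k, h, w = images.shape
    I = images.reshape(k, -1)                    # (k, h*w) intensity matrix
    L = np.asarray(light_dirs, dtype=float)      # (k, 3)
    # Least-squares solve I = L @ G, where G = albedo * normal per pixel
    G, *_ = np.linalg.lstsq(L, I, rcond=None)    # (3, h*w)
    albedo = np.linalg.norm(G, axis=0)
    normals = G / np.maximum(albedo, 1e-8)
    return normals.reshape(3, h, w), albedo.reshape(h, w)
```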
Image Analysis Tools and Techniques
Phase congruency implementation in ImageJ using Radix-2 FFT
Automatic edge detection in images is an area of great interest to industry and the scientific community. A common problem is that edge detectors are sensitive to the magnitude of changes in brightness. This disadvantage disappears when employing the technique known as phase congruency, which allows edge detection in an image regardless of its illumination level. The technique is based on the phase alignment of frequency components: the edges of an image occur where the phases of the Fourier components coincide. By using phase, the direct dependence on brightness intensity in edge detection is avoided. A difficulty of phase congruency implemented with monogenic filters is that it requires the computation of the complex Fourier transform, whose computational cost is high. Some approaches seek to reduce this cost, such as the Fast Hartley Transform (FHT), but they only allow obtaining the Fourier transform of real images. Due to this limitation, phase-congruency methods were not available in several image processing tools, such as ImageJ, a program widely used by biologists and microscopists for the analysis of biological images, since these programs rely mainly on the FHT. Therefore, this work describes an implementation of phase congruency for ImageJ with monogenic filters using the radix-2 fast Fourier transform (FFT). The results obtained with the proposed implementation were compared with those of the Kovesi code in GNU Octave, showing that both implementations obtain equivalent results, and that the proposed method performs even better when at least one side of the image is not a power of two, in which case mirror tiling is used to complete the image.
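The power-of-two requirement of a radix-2 FFT can be met without resampling by mirror-tiling the image, as mentioned above. The following is a minimal NumPy sketch of that padding step; the exact padding layout used by the plugin may differ.

```python
import numpy as np

def next_pow2(n):
    """Smallest power of two greater than or equal to n."""
    return 1 << int(n - 1).bit_length()

def mirror_pad_to_pow2(image):
    """Pad an image to power-of-two dimensions by mirror (reflective) tiling,
    so that a radix-2 FFT can be applied without resampling the content."""
    h, w = image.shape
    H, W = next_pow2(h), next_pow2(w)
    return np.pad(image, ((0, H - h), (0, W - w)), mode="symmetric")

img = np.random.rand(300, 500)
padded = mirror_pad_to_pow2(img)
spectrum = np.fft.fft2(padded)    # both sides are now powers of two
print(padded.shape)               # (512, 512)
```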
Inpainting method based on variational calculus and sparse matrices
Manuel G. Forero, Andrés F. Navarro, Sergio L. Miranda
Photo restoration is one of the most popular tasks in digital image processing, required when an image has stains, scratches or any unwanted object. Inpainting is the name given to this type of method, which modifies the affected areas, where the unwanted information lies, in an imperceptible way. The concept of inpainting was born in the early twentieth century from the need to replace or remove an object in a photograph, which was achieved through the manual brushstrokes of an editor or painter. Building on this idea and on the theory of Poisson image editing, a new technique based on variational calculus and the use of sparse matrices is developed. In this technique, a functional is proposed and subsequently minimized, so that the junction between the filled region and the image to be repaired is visually imperceptible. The results obtained were compared with those of the bilinear interpolation, isophote, Orthogonal Matching Pursuit (OMP) and K-SVD techniques, the latter two being based on sparse models. The difference between the original and the resulting image was then calculated, considering only the areas of interest, to find the number of differing pixels and the root mean square error (RMSE). The proposed method presents better results than bilinear interpolation, Orthogonal Matching Pursuit and K-SVD, and very similar results to those obtained with the isophote technique.
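As a simplified illustration of the sparse-matrix formulation behind such variational inpainting, the sketch below fills a masked region by solving the discrete Laplace equation with SciPy's sparse solver. The functional minimized by the authors is richer; this is only the harmonic special case.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def harmonic_inpaint(image, mask):
    """Fill the masked region by solving the discrete Laplace equation
    (a smooth membrane fill) as a sparse linear system.
    image: 2D float array; mask: boolean array, True where pixels are missing."""
    h, w = image.shape
    ys, xs = np.nonzero(mask)
    idx = -np.ones((h, w), dtype=int)
    idx[ys, xs] = np.arange(len(ys))          # unknown index per masked pixel

    rows, cols, vals = [], [], []
    b = np.zeros(len(ys))
    for k, (y, x) in enumerate(zip(ys, xs)):
        diag = 0.0
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ny, nx = y + dy, x + dx
            if not (0 <= ny < h and 0 <= nx < w):
                continue                      # image border: neighbour dropped
            diag += 1.0
            if mask[ny, nx]:
                rows.append(k); cols.append(idx[ny, nx]); vals.append(-1.0)
            else:
                b[k] += image[ny, nx]         # known neighbour moves to the RHS
        rows.append(k); cols.append(k); vals.append(diag)

    A = sp.csc_matrix((vals, (rows, cols)), shape=(len(ys), len(ys)))
    result = image.astype(float).copy()
    result[ys, xs] = spsolve(A, b)
    return result
```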
Study of phase congruency quantization function properties for image edge detection
Phase congruency is a recently developed, but still rather unexplored, technique for edge detection that determines the location of edges, ridges and valleys in images by analyzing the phase of the signal's frequency components. One of its main uses is image segmentation, where the region of interest is separated from the background. The segmentation result varies according to the mathematical function used to quantify phase congruency, whose main properties are that it is centered at the origin, has even symmetry, and has a global maximum equal to one. In addition, the shape of the function determines how well different types of edges are detected. Several mathematical functions fulfill the necessary conditions for measuring phase congruency; however, these conditions have not yet been studied, and the changes produced in phase congruency when this function is varied are therefore unknown. Hence, this work presents an evaluation of the characteristics of the functions used to quantify phase congruency, examining their properties and the resulting behavior of phase congruency, which allows the most appropriate functions to be found depending on the type of edges to be detected.
Efficient Java implementation of image cloning method based on gradient processing
In the processing of photographic images, it is common to manipulate them to include objects that are not present in the original image, to make them more appealing or to change the environment. Traditionally, images are manually edited: the object to be included is cut out of a source image and joined to the destination image, and filters are then used to soften the contour so that the junction between both images is less noticeable. Perez et al. introduced a technique, called perfect cloning, based on processing the image gradient to integrate the images using sparse matrices, which allows this process to be automated. However, its implementation in languages such as Java is complex due to memory limitations for large matrices. Hence, this paper introduces a way of implementing sparse matrices that allows their use in Java. The method has been implemented as a plugin for the free software ImageJ. In addition, the technique is compared with the multiresolution method developed by Burt and Adelson, which is considered a reference in the field.
Evaluation of panchromatic and multispectral image fusion methods using natural images
Heber I. Mejia-Cabrera, Samuel Sanchez, Fernando Monja, et al.
Remote sensing satellites typically provide low-resolution multispectral images and high-resolution panchromatic images, which are fused after the low-resolution images are enlarged to the same size as the panchromatic ones. Panchromatic images have good spatial resolution but only one spectral band, while multispectral images typically have four or eight bands but a spatial resolution about four times lower than that of a panchromatic image. This type of image fusion seeks to combine the high spatial resolution of the panchromatic image with the spectral information of the multispectral image to obtain an image with both high spatial and high spectral resolution. Several techniques have been developed to perform this fusion; among those with low computational cost are the EIHS, Brovey and averaging algorithms. To compare them, in this work natural color photographs were taken, from which high-resolution monochromatic and lower-resolution chromatic images were obtained to emulate the real situation. The low-resolution color images were interpolated using three satellite image interpolation techniques. The fusion techniques were then evaluated by computing the quantitative spectral and spatial ERGAS indices and the RMSE. The EIHS and Brovey techniques were found to produce artifacts because the color component values can fall above or below the representation interval [0, 255]. After correcting this issue, it was found that the EIHS and Brovey methods, in that order, produced the lowest RMSE, followed by the averaging method. Since this result proved to be inconsistent with that obtained with the mean ERGAS, a new normalized mean ERGAS, which gives a better indication of fusion quality and matches the result given by the RMSE, was proposed to be used instead.
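For reference, the Brovey fusion used in the comparison reduces to a per-band ratio operation. The following is a minimal NumPy sketch, with a simple clipping correction for values that leave [0, 255], the effect identified above as the source of artifacts.

```python
import numpy as np

def brovey_fusion(ms, pan, eps=1e-6):
    """Brovey pan-sharpening: each upsampled multispectral band is scaled by
    the ratio of the panchromatic image to the band mean (intensity).
    ms:  (h, w, bands) multispectral image already interpolated to pan size
    pan: (h, w) panchromatic image
    """
    intensity = ms.mean(axis=2)
    fused = ms * (pan / (intensity + eps))[..., None]
    # Values can leave the valid range; clip as a simple correction
    return np.clip(fused, 0.0, 255.0)
```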
Imaging Systems
VehiPose: a multi-scale framework for vehicle pose estimation
Vehicle pose estimation is useful for applications such as self-driving cars, traffic monitoring, and scene analysis. Recent developments in computer vision and deep learning have achieved significant progress in human pose estimation, but little of this work has been applied to vehicle pose. We propose VehiPose, an efficient architecture for vehicle pose estimation, based on a multi-scale deep learning approach that achieves high accuracy vehicle pose estimation while maintaining manageable network complexity and modularity. The VehiPose architecture combines an encoder-decoder architecture with a waterfall atrous convolution module for multi-scale feature representation. Our approach aims to reduce the loss due to successive pooling layers and preserve the multiscale contextual and spatial information in the encoder feature representations. The waterfall module generates multiscale features, as it leverages the efficiency of progressive filtering while maintaining wider fields-of-view through the concatenation of multiple features. This multi-scale approach results in a robust vehicle pose estimation architecture that incorporates contextual information across scales and performs the localization of vehicle keypoints in an end-to-end trainable network.
An extensible framework for video ASIC development and validation at Facebook scale
Video consumption across social platforms has increased at a rapid pace. Video processing is a compute-heavy workload, and domain-specific accelerators (ASICs) allow more efficient scaling than general-purpose CPUs. One of the challenges for video ASIC adoption is that videos ingested in datacenters are user-generated content and have a long-tail distribution of uncommon features. A software stack can handle the outliers gracefully, but these uncommon features may pose a challenge for the ASIC, with undesirable effects for unsupported or unhandled edge cases. To avoid undesirable effects in production, it is critical to proof the system against long-tail conditions early in the product cycle of ASIC development. Similarly, critical signals such as BD-rate quality and outlier detection are needed from production traffic early in the product cycle. To address these needs, we propose an extensible framework that allows a continuous development strategy using production traffic, through progressive evaluation in the various product phases of the video ASIC development cycle. A similar framework would benefit other ASIC accelerator programs in reducing the time to deploy on large-scale platforms.
Towards super resolution in the compressed domain of learning-based image codecs
Learning-based image coding has shown promising results in recent years. Unlike traditional approaches to image compression, learning-based codecs exploit deep neural networks to reduce the dimensionality of the input at the stage where a linear transform would typically be applied. The signal representation after this stage, called the latent space, carries the information in a format that can be interpreted by other deep neural networks without the need to decode it. One of the tasks that can benefit from this possibility is super resolution. In this paper, we explore the possibilities and propose an approach for super resolution applied in the latent space. We focus on two types of architectures: a fixed compression model and an enhanced compression model. Additionally, we assess the performance of the proposed solutions.
Poster Session
Design and implementation of interactive game based on augmented reality
MingHui Xu, Sikai Wang, Hongfei Zhao, et al.
Mobile phones are playing an increasingly important role in everyday life. On both PCs and mobile devices, game development is no longer limited to simple on-screen operation: developers continue to tap the potential of different devices according to their characteristics and constantly incorporate the latest technology into games, bringing entertainment and convenience to daily life. As part of a graduation project and thesis, this paper discusses the origin and development of AR technology around an augmented reality game project. Within the project, the key technologies of AR are analyzed, the most important being the algorithms for feature-point detection and matching. Based on an understanding of these algorithms, they are applied in the game project and the effect of augmented reality in the game is analyzed, from the basic realization of augmented reality, to the key points of the plane recognition and image recognition algorithms, to the coordinate processing of the two recognition methods, which in turn affects the character-control behavior of the whole game. The two recognition methods bring completely different game experiences to players. We integrate augmented reality and video games in depth, rather than simply putting them together, making full use of the advantages of the software and platform, and present the design and implementation of an interactive game based on augmented reality.
ICP algorithm based on stochastic approach
Sergei Voronin, Artyom Makovetskii, Vitaly Kober, et al.
3D reconstruction has been widely applied in medical imaging, industrial inspection, self-driving cars, and indoor modeling. A 3D model is built through the steps of data collection, point cloud registration, surface reconstruction, and texture mapping. During data collection, due to the limited visibility of the scanning system, the scanner needs to scan from multiple angles and then splice the data to obtain a complete point cloud model. The point clouds from different angles must be merged into a unified coordinate system, which is known as point cloud registration. The result of point cloud registration directly affects the accuracy of the point cloud model; thus, registration is a key step in the construction of the model. The ICP (Iterative Closest Points) algorithm is the best-known technique for point cloud registration. The variational ICP problem can be solved not only by deterministic but also by stochastic methods; one of them is the Grey Wolf Optimizer (GWO) algorithm. Recently, GWO has been applied to rough point cloud alignment. In this paper, we apply the GWO approach to the realization of point-to-point ICP algorithms. Computer simulation results are presented to illustrate the performance of the proposed algorithm.
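For context, the deterministic point-to-point ICP baseline that the stochastic GWO variant addresses can be sketched as follows; this is the standard nearest-neighbour plus SVD (Kabsch) formulation, not the authors' GWO realization.

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_point_to_point(src, dst, iters=30):
    """Classic point-to-point ICP: alternate nearest-neighbour matching with a
    closed-form (SVD/Kabsch) estimate of the rigid transform. src, dst: (n, 3)."""
    src = np.asarray(src, dtype=float).copy()
    dst = np.asarray(dst, dtype=float)
    tree = cKDTree(dst)
    R_total, t_total = np.eye(3), np.zeros(3)
    for _ in range(iters):
        _, nn = tree.query(src)                 # correspondence step
        matched = dst[nn]
        mu_s, mu_d = src.mean(axis=0), matched.mean(axis=0)
        H = (src - mu_s).T @ (matched - mu_d)   # cross-covariance
        U, _, Vt = np.linalg.svd(H)
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:                # avoid reflections
            Vt[-1] *= -1
            R = Vt.T @ U.T
        t = mu_d - R @ mu_s
        src = src @ R.T + t                     # variational step: apply update
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total
```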
Fast algorithm of 3D object volume calculation from point cloud
Volume is a basic parameter in object morphology analysis. In this article, we propose a new fast algorithm for calculating object volume from a point cloud. The proposed algorithm is based on a voxel representation of the point cloud and a slice method. The accuracy and speed of the proposed volume calculation algorithm on real data are compared to those of state-of-the-art algorithms.
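A minimal sketch of the voxel-counting idea (omitting the slice-based interior filling described by the authors, and with an arbitrary voxel size) could look like this:

```python
import numpy as np

def voxel_volume(points, voxel_size=0.01):
    """Approximate the volume occupied by a point cloud by counting occupied
    voxels; interior filling (e.g. by slices) is assumed to be handled elsewhere."""
    pts = np.asarray(points, dtype=float)
    keys = np.floor((pts - pts.min(axis=0)) / voxel_size).astype(np.int64)
    occupied = np.unique(keys, axis=0)
    return occupied.shape[0] * voxel_size ** 3
```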
Fast 3D object pose normalization for point cloud
In this article, we propose novel fast pose normalization algorithms for point clouds. The first step detects the ground plane of the scene in the point cloud: the cloud is downsampled by point cloud filtering and the normal vector of the ground plane is estimated. The next step performs 3D segmentation of the point cloud, in which the ground plane is removed. We then compute an axis-aligned bounding box, which sets the pose and dimensions of a box surrounding the given point cloud. Because the bounding box is axis-aligned, its orientation is simply the identity orientation defined by the computed unit normal vector of the plane, and the computation essentially reduces to taking the minimum and maximum of each coordinate. Finally, we calculate the geometric center of the point cloud after pose normalization.
Fast 3D object symmetry detection for point cloud
In this article, we propose a fast symmetry detection algorithm for 3D models. First, we use the PCA algorithm for initial symmetry detection. Then, using an exhaustive search of symmetry planes passing through the center of gravity around the initial symmetry plane, we determine the optimal symmetry plane with the help of a modified Hausdorff metric. The accuracy and speed of the proposed symmetry detection algorithm on real and synthetic data are compared to those of the PCA algorithm.
Binarization method for chromosomal analysis of primitive plants: the case of Zamia tolimensis and Zamia huilensis (Cycadales, Zamiaceae)
Cytogenetic studies of plants allow integration with taxonomic and systematic data for the formulation of conservation strategies. However, traditional methods of microscopy image analysis for chromosome observation and identification are inefficient in species considered primitive and with critically endangered wild populations. For this reason, the objective of this work is to establish an image processing method that facilitates the distinction of chromosomes in Zamia tolimensis and Zamia huilensis, plants endemic to Colombia. For this purpose, metaphase plates of both species were photographed under the microscope; the photographs were superimposed and their contrast adjusted to obtain a composite image with species-specific arrangements and transparencies. The proposed technique, based on thresholding, allows a better analysis of the images obtained. The delimited areas of the chromosomes in the binarized images were thus shown with greater precision for the counting and measurement of genetic structures, giving totals of 27 and 26 chromosomes in Z. tolimensis and Z. huilensis, respectively. The technique complements and simplifies the counting and interpretation of plant genetic information obtained by classical computational methods.
Regularized variational functional use a rough alignment for point clouds registration
Artyom Makovetskii, Sergei Voronin, Vitaly Kober, et al.
Aligning two point clouds means finding the orthogonal or affine transformation in three-dimensional space that maximizes the consistent overlap between the two clouds. The ICP (Iterative Closest Points) algorithm is the best-known point cloud registration technique based exclusively on geometric characteristics. The ICP algorithm iteratively applies two main steps: determining the correspondence between the points of the two clouds, and minimizing an error metric (the variational subproblem of the ICP algorithm). The key element of the ICP algorithm is the search for an orthogonal or affine transformation that is optimal, in the sense of a metric, for combining the two point clouds under a given correspondence between the points. In real applications, the correspondence between point clouds is far from ideal, and a bad correspondence significantly reduces the probability of obtaining a good answer for orthogonal variants of the variational problem. Thus, the probability of obtaining an acceptable transformation from the ICP algorithm under poor correspondence is a criterion for comparing different types of variational problems. In this paper, we propose a regularized variant of the ICP variational problem that uses a rough point cloud alignment and improves the convergence frequency in the case of poor correspondence between point clouds. The proposed modification essentially increases the performance of the algorithm. Computational experiments show that the proposed point cloud alignment approach computes the true transformation for almost all synthetic 3D models, for any relative placement of the point clouds.
Convolutional neural network for 3D point clouds matching
Sergei Voronin, Artyom Makovetskii, Aleksei Voronin, et al.
Geometric registration is a key task in many computational fields, including medical imaging, robotics, and autonomous driving. Registration involves the prediction of a rigid motion to align one point cloud to another, potentially distorted by noise and partiality. The most popular point cloud registration algorithm, Iterative Closest Point (ICP), alternates between estimating the rigid motion based on a fixed correspondence estimate and updating the correspondences to their closest matches. Recently, the success of deep neural networks for image processing has motivated approaches that learn features on point clouds. Adapting deep learning to point cloud data is far from straightforward: most critically, standard deep neural network models require input data with a regular structure, while point clouds are fundamentally irregular; point positions are continuously distributed in space, and any permutation of their ordering does not change the spatial distribution. Several neural networks, such as PointNet and DGCNN, have recently been proposed for analyzing point cloud data. In this paper, we propose a permutation-invariant neural network to identify matching pairs of points in the clouds. Computer simulation results are provided to illustrate the performance of the proposed algorithm.
A comparative study of feature detection and description methods for a RGB-D SLAM system
Visual SLAM is widely known in robotics for concurrently computing the odometry of a robot and constructing a 3D navigation map with only a camera. In visual SLAM systems, the detection and description of local features are extremely important because they identify unique and invariant points in an observed frame. Although various detectors and descriptors exist, the proper detector/descriptor combination for feature extraction has not yet been established for this problem. In this work, a comprehensive performance evaluation of combinations of different feature detectors and descriptors is presented. This evaluation helps determine the best detector/descriptor combination for designing a visual SLAM system based on RGB-D data. The considered methods are evaluated in terms of accuracy and robustness, both individually and within the overall visual SLAM system.
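All detector/descriptor pairs in such an evaluation follow the same detect/compute/match pattern. The sketch below shows one combination (ORB with Hamming brute-force matching) in OpenCV, purely as an illustration of the pipeline being benchmarked rather than the best-performing pair.

```python
import cv2

def detect_and_match(img1, img2, n_features=1000):
    """Detect ORB keypoints in two grayscale frames, compute binary
    descriptors, and match them with cross-checked Hamming distance."""
    orb = cv2.ORB_create(nfeatures=n_features)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
    return kp1, kp2, matches
```

Swapping ORB for SIFT, BRISK or AKAZE (and the matching norm accordingly) yields the other combinations typically compared in this kind of study.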
Contactless robust 3D palm-print identification using photometric stereo
Lyndon N. Smith, Max P. Langhof, Mark F. Hansen, et al.
Palmprints are of considerable interest as a reliable biometric, since they offer significant advantages, such as greater user acceptance than fingerprint or iris recognition. 2D systems can be spoofed by a photograph of a hand; 3D avoids this by recovering and analysing 3D textures and profiles. 3D palmprints can also be captured in a contactless manner, which is critical for ensuring hygiene (something that is particularly important in relation to pandemics such as COVID-19) and ease of use. The gap in prior work, between low-resolution wrinkle studies and high-resolution palmprint recognition, is bridged here using high-resolution non-contact photometric stereo. A camera and illuminants are synchronised with image capture to recover high-definition 3D texture data from the palm, which are then analysed to extract ridges and wrinkles. This novel low-cost approach, which can tolerate distortions inherent to unconstrained contactless palmprint acquisition, achieved a 0.1% equal error rate.
Forest damage monitoring in South-Western Europe based on data from Unmanned Aerial Vehicles (UAV)
A. Fernandez-Manso, J. M. Cifuentes, E. Sanz-Ablanero, et al.
Knowledge of the occurrence and severity of the damage caused by the fungus Cryphonectria parasitica (Murrill) M.E. Barr and by forest fires is key to defining a management plan for Castanea sativa forest stands in south-western Europe. The main goal of our study is to verify whether there is concordance between field measurements and measurements from orthophotographs acquired by an unmanned flight, and to determine the influence of the number of severity levels on this concordance. The accuracy of the blight severity level estimate was computed using as ground truth a red-green-blue (RGB) orthophotograph of very high spatial resolution (8 cm) acquired by an Unmanned Aerial Vehicle (UAV). Using codes specifically defined for this study, the severity level of 823 chestnut trees was measured on the UAV orthophotograph. At the same time, we measured the severity level on the ground for a sample of 182 chestnut trees using a standard methodology. From these measurements, the overall accuracy and the Kappa statistic were computed. Our results show that the concordance for 6 severity levels varies between moderate and good, whereas for 4 and 5 levels it varies between moderate and very good (kappa statistic > 0.75). The study demonstrates the usefulness of UAVs for studying forest damage in south-western Europe.
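The Kappa statistic used for the concordance analysis is Cohen's kappa; a minimal sketch of its computation from paired severity labels (assumed to be integer-coded levels) is:

```python
import numpy as np

def cohens_kappa(field_labels, uav_labels, n_levels):
    """Cohen's kappa between field-measured and orthophoto-derived severity
    levels; labels are integers in 0..n_levels-1."""
    cm = np.zeros((n_levels, n_levels))
    for a, b in zip(field_labels, uav_labels):
        cm[a, b] += 1
    n = cm.sum()
    po = np.trace(cm) / n                                # observed agreement
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2  # chance agreement
    return (po - pe) / (1 - pe)
```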
Fast approximate geodesic distance on point cloud
The computation of geodesic paths and distances is a common task in many computer graphics applications, for instance obtaining measurements of 3D models. Geodesic distance computation is usually performed by exact algorithms on a reconstructed surface. However, surface reconstruction is a rather time-consuming process and does not always guarantee a good result in the case of missing data or cloud distortions. In this article, we propose a new fast approximate geodesic distance algorithm that operates directly on the point cloud. Computer simulation results for the proposed algorithm in terms of accuracy and speed of computation are presented and discussed.
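One common way to approximate geodesic distances without surface reconstruction is to run a shortest-path search on a k-nearest-neighbour graph of the cloud. The sketch below illustrates that generic idea with SciPy; it is not necessarily the algorithm proposed by the authors.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

def approx_geodesic(points, source_idx, k=8):
    """Approximate geodesic distances on a point cloud: connect each point to
    its k nearest neighbours with Euclidean edge weights and run Dijkstra."""
    pts = np.asarray(points, dtype=float)
    tree = cKDTree(pts)
    dist, nn = tree.query(pts, k=k + 1)      # first neighbour is the point itself
    rows = np.repeat(np.arange(len(pts)), k)
    cols = nn[:, 1:].ravel()
    graph = csr_matrix((dist[:, 1:].ravel(), (rows, cols)),
                       shape=(len(pts), len(pts)))
    return dijkstra(graph, directed=False, indices=source_idx)
```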
New method for digitization and manipulation of textile molds based on image processing
Heber I. Mejia-Cabrera, Jair A. Vallejos, Victor Tuesta-Monteza, et al.
Micro and small garment businesses use traditional molds based on paper drawings for cutting fabric. This process is performed manually at the discretion of the operator, generating material loss during the cutting process. To make this task more efficient and reduce losses, this paper presents a technique for editing and vectorizing physical molds using digital image processing, allowing the selected mold to be edited, modified or duplicated. For this purpose, a simple, low-cost device was developed to photograph the molds, and an automatic method for contour detection and vectorization of textile molds was implemented. Three edge detection methods, Sobel, Canny-Deriche and morphological gradient, were compared. The Harris corner detection method was then used, achieving better detection and reducing the number of false corners by using the gray-level image as the input of the detector. The shapes of the contours between the corners were approximated by cubic splines, obtaining an analytical representation of each mold, which is used to manipulate its size and position so that it can be placed more efficiently on the fabric, achieving a significant reduction in fabric losses. The developed low-cost application thus approximates the molds by a vector representation, allowing them to be manipulated easily and with low consumption of computational resources without losing important information. The molds can be moved, rotated and scaled to accommodate them within the available fabric space.
Analysis of lung cancer clinical diagnosis based on nodule detection from computed tomography images
According to The Global Cancer Observatory (GCO), lung cancer is the type of cancer with the highest mortality rate in the world, being the most common in men and the second most frequent in women. The main factor behind its high mortality is usually late diagnosis; therefore, early diagnosis using advanced imaging techniques could help decrease the mortality rate. Computed tomography images can be used for this diagnosis. However, the nodules that allow this kind of cancer to be recognized are not easy to identify, making it a difficult task for the specialist. For this reason, academic challenges have recently been proposed in which a database of images annotated by radiologists is provided so that researchers can develop more efficient deep learning methods for detecting these nodules. In this work, two databases from the LUNA and LNDb challenges are used to perform a statistical analysis of the exams, their characteristics and the clinical diagnoses of the specialists. The analysis shows that the clinical diagnoses present important differences between the two databases, which makes the task of labeling the samples difficult. This analysis is useful for developing new proposals and conclusions for the use of deep learning in diagnosis through medical images.
Development and validation of a novel automated method for quantification of choroidal thickness in age-related macular degeneration
A. Smitha, P. Jidesh, J. Jothi Balaji, et al.
Age-related Macular Degeneration (AMD) is a progressive, irreversible retinal disorder, and one of the leading causes of severe visual impairment or even blindness in the elderly population. The choroid plays a vital role in the pathophysiology of AMD: it is known that abnormal choroidal blood flow leads to retinal photoreceptor dysfunction and eventual death. We propose a new automated algorithm that can be used to quantify choroidal thickness (CT) from Optical Coherence Tomography (OCT) images of the retina. The thickness evaluation procedure includes image contrast enhancement, localization around the fovea centralis, segmentation of the Retinal Pigment Epithelium (RPE) and choroidal layer, followed by CT measurement at multiple locations in the sub-foveal region at intervals of 0.5 mm on both nasal and temporal sides, up to a distance of 1 mm from the center of the foveal pit. Horizontal radial scan OCT images (Cirrus 5000, Carl Zeiss Meditec Inc., Dublin, CA) of both healthy and AMD patients were used to measure the CT with the new algorithm. The statistical tests indicate that the CT of AMD patients is smaller than in the normal condition. Furthermore, a t-test conducted between the proposed approach and the clinical approach to extracting CT measurements confirms that the proposed method is in good agreement with the clinical measurements. On average, the thickness of the choroid in the central sub-foveal region is found to be 0.32 ± 0.10 mm for the normal category and 0.21 ± 0.06 mm for the AMD category, as obtained from the proposed automatic CT measurement method. The clinical significance and the results of automated choroid extraction are discussed in this paper.
Gradient direction analysis for contour tracking and local non maximum suppression
One of the most widely used techniques for edge refinement, once a convolution mask has been applied, is the local non-maximum suppression method proposed by Canny. This technique uses the magnitude and direction of the gradient to determine whether a pixel is a local maximum. In addition, the direction of the gradient is used in a subsequent step to close contours when discontinuities are present. The local maximum detection technique gives good results when using masks of size 3 x 3 or larger. Deriche introduced an extension of Canny's edge detection method, which obtains state-of-the-art results, using a recursive method that only considers pixels in the vertical or horizontal direction, which is equivalent to using linear operators of size 3 x 1 or larger. This technique allows good edge detection in a shorter time, regardless of the filter size. Deriche notes the need to employ a smoothing filter, obtained by integrating his derivative operator, prior to applying the derivative operator. In this work, the results obtained with this operator are compared with those of the Prewitt and Sobel-Feldman operators in determining the direction of the gradient for finding local maxima.
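For reference, the non-maximum suppression step discussed here quantizes the gradient direction into four sectors and compares each pixel with its two neighbours along that direction. A straightforward (unoptimized) NumPy sketch, assuming the direction map comes from arctan2 of the gradient components, is:

```python
import numpy as np

def non_max_suppression(magnitude, direction):
    """Suppress pixels that are not local maxima along the gradient direction
    (quantized to 0, 45, 90 or 135 degrees), as in the Canny refinement step.
    magnitude: gradient magnitude; direction: gradient angle in radians."""
    h, w = magnitude.shape
    out = np.zeros_like(magnitude)
    angle = (np.rad2deg(direction) + 180.0) % 180.0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            a = angle[y, x]
            if a < 22.5 or a >= 157.5:        # horizontal gradient
                n1, n2 = magnitude[y, x - 1], magnitude[y, x + 1]
            elif a < 67.5:                    # 45-degree gradient
                n1, n2 = magnitude[y - 1, x + 1], magnitude[y + 1, x - 1]
            elif a < 112.5:                   # vertical gradient
                n1, n2 = magnitude[y - 1, x], magnitude[y + 1, x]
            else:                             # 135-degree gradient
                n1, n2 = magnitude[y - 1, x - 1], magnitude[y + 1, x + 1]
            if magnitude[y, x] >= n1 and magnitude[y, x] >= n2:
                out[y, x] = magnitude[y, x]
    return out
```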
Transform-based quality assessment for enhanced image
V. Voronin, E. Semenishchev, A. Zelensky, et al.
Images captured in bad weather suffer from low contrast and faint color. Recently, plenty of enhanced algorithms have been proposed to improve visibility and restore color. The goal of image quality assessment is to predict the perceptual quality for improving imaging systems' performance. We proposed a no-reference image quality enhancement measure using hypercomplex Fourier transform for color images. The main idea is that enhancing the contrast of an image would create more high-frequency content in the enhanced image than the original image. An increase in the magnitude of higher frequency coefficients indicates an enhancement in contrast to the image's luminance content. To test the performance of the proposed algorithm, the public database TID2013 is used. The Pearson rank-ordered correlation coefficient is utilized to measure and compare the proposed quality measure's performance with state-of-the-art approaches.