Proceedings Volume 12177

International Workshop on Advanced Imaging Technology (IWAIT) 2022

Masayuki Nakajima, Shogo Muramatsu, Jae-Gon Kim, et al.

Volume Details

Date Published: 30 April 2022
Contents: 24 Sessions, 129 Papers, 0 Presentations
Conference: International Workshop on Advanced Imaging Technology 2022 (IWAIT 2022)
Volume Number: 12177

Table of Contents

  • Front Matter: Volume 12177
  • Advanced Image Technology I
  • Advanced Image Technology II
  • Advanced Image Technology III
  • Advanced Image Technology IV
  • Advanced Image Technology V
  • Advanced Image Technology VI
  • Image Processing and Classification I
  • Image Processing and Classification II
  • Multimedia Systems and Applications I
  • Multimedia Systems and Applications II
  • VR and AR
  • 3D Processing and Applications
  • Deep Learning and Applications I
  • Deep Learning and Applications II
  • Point Cloud Processing and Applications
  • Image Understanding and Recognition
  • Machine Learning and Applications
  • Video Processing and Applications
  • Immersive Applications
  • VR and 3D Applications
  • Image Processing and Applications
  • Video Coding for Machines
  • Intelligent System Design
Front Matter: Volume 12177
This PDF file contains the front matter associated with SPIE Proceedings Volume 12177, including the Title Page, Copyright information, Table of Contents, and Conference Committee list.
Advanced Image Technology I
Imaging method using multi-threshold pattern for photon detection of quanta image sensor
In recent years, the development of the quanta image sensor (QIS), which can observe incident light intensity in units of photons, has been progressing. In QIS imaging, a large number of photon-incidence observations are performed in the spatiotemporal directions, and multivalued images are obtained by reconstruction processing. In each observation, a binary value is output according to whether the number of incident photons exceeds a natural-number threshold preset in the minute photon detector (jot). In many existing methods for QIS imaging, a uniform threshold is set for all jots, so the reconstructed multivalued image may be overexposed or underexposed. On the other hand, setting an optimal threshold for each local region according to the scene requires time for adjustment, which leads to a decrease in temporal resolution. In this paper, we propose an imaging method that always accurately captures a wide range of light intensities, from low to high, by introducing a periodic pattern consisting of multiple thresholds. Since we do not adjust the thresholds according to the scene, we fundamentally avoid the degradation of temporal resolution. In addition, since the thresholds applied to the jots take a variety of values, a high-quality multivalued image can be acquired even from a small number of photon-incidence observations. Our proposed method consists of three components: multivalued image reconstruction, noise reduction that takes into account the characteristics of photon-incidence observation, and optimization of the periodic pattern.
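As a rough illustration of the observation model (not the authors' implementation), binary jot outputs under a periodic multi-threshold pattern can be simulated as follows; the 2x2 threshold pattern and all parameters here are assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    H, W, T = 64, 64, 200                      # jot grid and temporal fields
    intensity = np.linspace(0.1, 8.0, H * W).reshape(H, W)   # photons/field

    pattern = np.array([[1, 2],                # hypothetical periodic
                        [4, 8]])               # multi-threshold pattern
    thresholds = np.tile(pattern, (H // 2, W // 2))

    photons = rng.poisson(intensity, size=(T, H, W))   # photon arrivals
    bits = (photons >= thresholds).astype(np.uint8)    # 1-bit jot outputs

    # Each threshold saturates at a different intensity, so the per-jot
    # bit rates jointly cover a wide range of light levels.
    rate = bits.mean(axis=0)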
Low-light image enhancement via channel-wise intensity transformation
We propose a low-light image enhancement algorithm that learns channel-wise transformation functions. First, we develop a lightweight network, called the transformation function estimation network (TFE-Net), to predict the channel-wise transformation functions. TFE-Net learns to generate the transformation functions by considering both the global and local characteristics of the input image. Then, we obtain enhanced images by performing channel-wise intensity transformation. Experimental results show that the proposed algorithm provides higher image quality than conventional algorithms.
Multi-level feature aggregation network for high dynamic range imaging
Modern digital cameras typically cannot capture the whole range of illumination, due to the limited sensing capability of sensor devices. High dynamic range (HDR) imaging aims to generate images with a larger range of illumination by merging multiple low-dynamic-range (LDR) images with different exposure times. However, when the images are captured in dynamic scenes, existing methods unavoidably produce undesirable artifacts and distorted content. In this paper, we propose a multi-level feature aggregation network, based on the Laplacian pyramid, to address this issue for HDR imaging. The proposed method progressively aggregates non-overlapping frequency sub-bands at different pyramid levels, and generates the corresponding HDR image from coarser to finer scales. Experimental results show that our proposed method significantly outperforms other competitive HDR methods, producing HDR images with high visual quality.
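For reference, the kind of Laplacian pyramid such a method builds on can be constructed as follows (a generic OpenCV sketch, not the proposed network itself):

    import numpy as np
    import cv2

    def laplacian_pyramid(img, levels=3):
        """Split an image into band-pass sub-bands plus a low-pass residual."""
        bands, current = [], img.astype(np.float32)
        for _ in range(levels):
            down = cv2.pyrDown(current)
            up = cv2.pyrUp(down, dstsize=(current.shape[1], current.shape[0]))
            bands.append(current - up)     # detail (high-frequency) sub-band
            current = down
        bands.append(current)              # coarsest low-pass residual
        return bands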
Improved probability modeling for lossless image coding using example search and adaptive prediction
We previously proposed a novel lossless image coding method that utilizes example search and adaptive prediction within a framework of probability model optimization. In this paper, the definition of the probability model as well as its optimization procedure are modified to reduce the encoding complexity. In addition, affine predictors used in the adaptive prediction are refined for accurate probability modeling. Simulation results indicate that our modification contributes not only to encoding time reduction, but also to coding efficiency improvement for all of the tested images.
Vectorized computing for edge-avoiding wavelet
The discrete wavelet transform (DWT) is an essential tool for image and signal processing. The edge-avoiding wavelet (EAW) is an extension of the DWT with an edge-preserving property. EAW constructs a basis based on the edge content of the input image; thus, the wavelet contains nonlinear filtering. The DWT is computationally efficient in scale-space analysis; however, EAW has a complex loop structure, so parallel computing for EAW is not an easy task. In this paper, we vectorize EAW computing by using single instruction, multiple data (SIMD) parallelization. In particular, the lifting-based wavelet allows in-place operation, i.e., the source and destination arrays for the DWT can be shared, and this in-place operation improves cache efficiency. However, EAW prevents the in-place operation in the update processing. Moreover, data interleaving for wavelet computing is the bottleneck for SIMD computing. Therefore, we show a suitable data structure for effective SIMD vectorization of EAW. Experimental results show that our implementation accelerates EAW: the WCDF method runs more than two times faster, and the WRB method about three times faster, than a simple implementation.
Enhancing angular resolution using layers obtained from light field superpixel segmentation
Jonghoon Yim, Vinh Van Duong, Thuc Nguyen Huu, et al.
In order to alleviate the inherent trade-off between the spatial and angular resolutions of light field (LF) images, many attempts have been made to enhance the angular resolution of LFs by creating novel views. In this paper, we investigate a method to enhance the angular resolution of an LF image by first grouping the pixels within and across the multiple views into LF superpixels using an existing LF segmentation method, and then generating novel views by shifting and overlaying the layers containing the LF superpixels that have similar per-view disparity values. Experimental results on synthetic and real-scene datasets show that our method achieves good reconstruction quality.
Advanced Image Technology II
Inverse kinematics in VR motion mirroring
Sascha Roman Robitzki, Clifford Eng Ping Hao, Hock Soon Seah, et al.
The objective of this research is to develop a computer tool that performs body motion sensing, dynamics analysis, and real-time visualization in a virtual reality system. The proposed system enables effective motion detection and monitoring, with instant visual highlights of muscle contractions during exercise. Experimental results are presented in which one's movement is projected onto a virtual skeleton by applying inverse kinematics.
Expressive B-spline curves: a pilot study on a flexible shape representation
We present a pilot study on expressive B-spline curves (XBSC), an extension of disk B-spline curves (DBSC). XBSC facilitates expressive drawing in terms of both shape and color: the colors on the two sides of an XBSC stroke are defined independently, instead of using a single parameter for both sides as in DBSC, and coloring is performed by treating the envelopes of XBSC as diffusion curves. Our results show that XBSC can be used to easily draw a wide range of images with fewer primitives than previous methods.
Automatic extraction of ridge and valley lines based on overground and underground openness
Tomohiro Andoh, Toshikazu Samura, Katsumi Tadamura
To determine the behavior of rainwater after rainfall, it is important to obtain local ground-slope information and topographic-feature information, e.g., ridges and valleys. To obtain such information, a method that uses overground openness (OO) and underground openness (UO) has been proposed [1]. In this method, OO and UO are calculated using a digital elevation model (DEM). In addition, a method has been developed to generate a red relief image map [2] by combining the ridge–valley degree (RVD), calculated as (OO - UO)/2, with the gradient of the ground to visualize general topographical characteristics. However, these methods do not fully utilize the features of currently available high-resolution DEM data, and, to the best of our knowledge, automatic extraction of ridge and valley lines has not been considered to date. To address the former problem, a new method [3] that uses CG techniques was proposed to enable an openness calculation that reflects the DEM resolution. Thus, in this paper, we propose a method to extract ridge and valley lines automatically using OO and UO values obtained from a high-resolution DEM [3].
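For reference, the ridge–valley degree used above is defined from the two openness measures as

    \mathrm{RVD} = \frac{\mathrm{OO} - \mathrm{UO}}{2}

so positive values indicate convex, ridge-like terrain and negative values concave, valley-like terrain, which is what makes RVD a natural basis for extracting ridge and valley lines.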
Advanced Image Technology III
Anomaly object detection in x-ray images with Gabor convolution and bigger discriminative RoI pooling
In airports, railway stations, and other public places, security inspectors generally perform inspection by viewing x-ray images, so false detections and missed detections often occur. In this paper, an automatic anomaly object detection method for x-ray images is proposed under a two-stage framework. At the first stage, a learnable Gabor convolution layer is introduced into ResNeXt to help the network capture the edge information of objects. Then, a region proposal network (RPN) is used to determine the candidate regions of objects as well as to perform coarse classification. At the second stage, bigger discriminative RoI pooling (BDRP) is proposed to classify the candidate boxes and improve the classification accuracy. Furthermore, dense local regression (DLR) is applied to predict the offsets of multiple dense boxes in the region proposals to locate the objects accurately. Experimental results on the SIXray and OPIXray datasets show that, compared with state-of-the-art methods, the proposed method achieves competitive detection performance.
Single-scan multiple object detection based on template matching using only effective pixels
Sakura Eba, Naoya Nakabayashi, Manabu Hashimoto
In this paper, we propose a template matching method that enables multi-class classification by scanning an input image only once, focusing on the classification ability of each pixel in the template image. This ability is represented by the occurrence frequency of the pixel value, and on the basis of it, a small number of pixels that are effective for class identification are selected. Experimental results on the task of detecting five kinds of objects in 100 real images show that the recognition rate of the proposed method was 95.4% and the processing time was 1.9 seconds when 0.6% of all template pixels were used.
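A minimal sketch of the pixel selection idea (our simplification; the paper's exact criterion may differ): pixels whose values occur rarely across the template set carry the most class information, so keep only the k rarest:

    import numpy as np

    def select_effective_pixels(template, histograms, k):
        """Pick the k template pixels whose values occur least frequently.

        template:   (H, W) integer grayscale template of one class
        histograms: (256,) occurrence frequency of each pixel value,
                    accumulated over all classes' templates
        """
        freq = histograms[template]           # per-pixel occurrence frequency
        idx = np.argsort(freq, axis=None)[:k] # k most discriminative pixels
        return np.unravel_index(idx, template.shape)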
Template matching via search history driven genetic algorithm
Takumi Nakane, Takuya Akashi, Chao Zhang
Pixel-based template matching suffers from increasing computational cost as the number of potential solutions grows. Genetic algorithms have been adopted to search for promising solutions, but there is a demand for more accurate matching. In this paper, we propose to employ a modified real-coded genetic algorithm to solve the template matching problem. Specifically, individuals sampled during the exploration process are stored in an archive and spatially clustered in the search space. An enhanced crossover (abbreviated as SHX) exploits the extra cluster information to generate new individuals in more promising regions. To solve the matching problem, this algorithm searches for suitable geometric parameters using a pixel-level dense similarity measure. Experimental results show the effectiveness of SHX for solving the template matching problem.
Depth scene flow estimation based on variational method using thin-plate spline regularization
Scene flow is a three-dimensional (3D) vector field combining depth-directional motion with optical flow, and it can be applied to inter prediction in 3D video coding. The conventional method regularizes the scene flow so that it locally approaches a constant function, so it is difficult to handle spatial changes and motion boundaries. Regularizations called thin-plate spline or deformable models have been introduced in variational optical flow estimation because they find the solution locally as a linear function and may alleviate this problem. However, because the partial differential equation derived from thin-plate spline regularization includes fourth-order partial derivatives, it is not easy to derive an analytical solution or to solve it with a numerically stable iterative method. Previous studies have proposed a numerically stable iterative method for scene flow estimation that does not include thin-plate spline regularization. Therefore, building on the framework of those studies, I derive a partial differential equation using thin-plate spline regularization for scene flow estimation.
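In a generic variational form (our notation; the paper's data term may differ), thin-plate spline regularization of a flow component u adds a second-order smoothness energy

    E(u) = \int_{\Omega} \psi(\text{data}) \, d\mathbf{x} + \lambda \int_{\Omega} \left( u_{xx}^{2} + 2 u_{xy}^{2} + u_{yy}^{2} \right) d\mathbf{x}

whose Euler–Lagrange equation contains the biharmonic term \Delta^{2} u; these fourth-order partial derivatives are the source of the numerical difficulty mentioned above.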
A frame-by-frame integrated environment map building method in cooperative SLAM
We propose TWIN HEAD SLAM, a method to build an integrated and consistent environment map with two camera heads that move freely with respect to each other. Visual SLAM with multiple cameras is called cooperative SLAM. The updated environment map is shared by two tracking modules, each attached to one camera. Our contribution is that the integrated environment map is updated frame by frame from both camera inputs, which means the key features obtained by one camera are instantly available to the tracking module of the other camera. We implemented the proposed method based on OpenVSLAM and confirmed that the two video inputs are used to build one consistent environment map shared by the localization modules of the two cameras.
Advanced Image Technology IV
Posture estimation for visually impaired people using a human skeleton with a white cane
Visually impaired people avoid dangers by using a white cane. Because they use it as an extension of their arm, we consider it a part of their body. Our goal is to recognize visually impaired people using a white cane in images, and how they use it. We propose a method to estimate the posture of a person with a white cane by extending an existing posture estimation model, OpenPose. In our method, we incorporate the white cane as a part of the human skeleton model. We constructed a database of images of visually impaired people with a white cane to train the network for the extended skeleton model. We also developed a method to determine which hand holds the white cane in the training images, because right-handed and left-handed users must be trained separately. From the posture estimation result, we can analyze the motion of the white cane; we focus on the angle of the cane and analyze its swing frequency. Through our experiments, we confirmed that our preliminary system successfully estimates the human posture with the white cane and the swing frequency of the cane.
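The swing-frequency analysis can be illustrated with a short sketch (our own, assuming a per-frame cane angle sequence is available from the posture estimation):

    import numpy as np

    def swing_frequency(angles, fps):
        """Estimate the dominant swing frequency (Hz) of the cane angle.

        angles: 1-D array of per-frame cane angles from posture estimation
        fps:    video frame rate
        """
        a = angles - np.mean(angles)               # remove the DC component
        spectrum = np.abs(np.fft.rfft(a))
        freqs = np.fft.rfftfreq(len(a), d=1.0 / fps)
        return freqs[np.argmax(spectrum[1:]) + 1]  # skip the zero-frequency bin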
Proxy system with JPEG bitstream-based file-size preserving encryption for cloud photo streams
In this paper, we propose a proxy system with JPEG bitstream-based file-size preserving encryption to securely store compressed images in cloud environments. The proposed system, which sits between the client's device and the Internet, not only keeps exactly the same file size as the original JPEG streams but also maintains a predetermined image format. In an experiment, the proposed system is verified to be effective in two cloud photo streams: Google Photos and iCloud Photos.
Estimating the number of table tennis rallies in a match video
Shoma Kato, Akira Kito, Toru Tamaki, et al.
In this paper, we propose a method for estimating the number of rallies in table tennis match videos. Play scenes are extracted from a video by using frame similarity, and the ball is detected in the upper side of the table with frame differencing and color thresholding. A detected ball is either a receive by the upper-side player or a high-toss serve by the bottom player; the high-toss serve is identified and removed using the collision sound of the ball with the table or racket. Experiments on estimating the number of rallies show an accuracy of 63.7% on 157 play scenes from T-League match videos in Japan.
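A minimal sketch of the ball-detection step (our reading of the method; all thresholds and the region convention are assumptions):

    import numpy as np
    import cv2

    def ball_candidates(prev, curr, table_top_rows):
        """Moving, ball-colored pixels in the region above the table."""
        diff = cv2.absdiff(curr, prev)
        moving = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY) > 15      # motion mask
        hsv = cv2.cvtColor(curr, cv2.COLOR_BGR2HSV)
        white = cv2.inRange(hsv, (0, 0, 180), (180, 60, 255))     # whitish ball
        region = np.zeros(moving.shape, bool)
        region[:table_top_rows] = True      # keep rows above the table edge
        return moving & (white > 0) & region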
Explainable artist recommendation based on reinforcement knowledge graph exploration
Keigo Sakurai, Ren Togo, Takahiro Ogawa, et al.
This paper presents a novel artist recommendation method based on a knowledge graph and reinforcement learning. In the field of music services, subscription-based online platforms are becoming mainstream, and recommendation technology needs to be updated accordingly. In this field, it is desirable to achieve user-centered recommendation that satisfies various user preferences, rather than recommendation biased toward popular songs and artists. Our method realizes highly accurate and explainable artist recommendation by exploring a knowledge graph constructed from users' listening histories and artist metadata. We have confirmed the effectiveness of our method by comparing it with an existing state-of-the-art method.
Action detection system for alerting drivers using computer vision
Nowadays, the increasing number of careless drivers on the road has resulted in more accidents. Drivers' decisions and behaviors are the key to maintaining road safety. However, many drivers tend to perform secondary tasks such as playing with their phone, adjusting the radio, eating or drinking, answering phone calls, and, worst of all, reading text messages. Many approaches have previously been introduced to recognize and capture potential problems related to careless driving inside the car. This work focuses on recognizing the driver's secondary tasks using an action detection method. A camera is set up inside the car to capture the driver's actions in real time. The video is processed by a human pose estimator framework that extracts human pose frames without the background: raw images are input into a convolutional neural network (CNN) that computes keypoint activation maps, the keypoint coordinates are computed from these maps and drawn on a new blank frame, and the resulting frames are input into a classification CNN for action classification. If the action performed by the driver is considered a dangerous secondary task, an alert is given. The proposed framework achieves a higher speed than comparable frameworks when run on a Raspberry Pi CPU. It can detect 10 different driver actions, of which only talking to passengers and normal driving do not trigger the buzzer to alert the driver.
Advanced Image Technology V
Driver drowsiness detection by multitask and transfer learning
Yuan Chang, Wataru Kameyama
In this busy modern society, many external and psychological factors can cause people to feel tired. Fatigued driving is comparable in severity to drunk driving in terms of accident rate, so how to avoid it has become an important issue. As machine learning matures, facial expression recognition has been widely used in real life, and a large number of studies and reports on fatigue-driving detection and its mitigation can be found. Most of them either detect drowsiness states without detailed facial expressions or look at a single part of the face, such as the eyes or mouth. However, we consider the facial features to be highly correlated: for example, when a driver gets tired, his or her mouth and eyes are expected to change state together, so it is important to evaluate more than one facial feature at a time. Therefore, in this paper, we propose a new driver-drowsiness detection method using multi-task and transfer learning. The proposed method first captures the driver's facial area frame by frame in videos, and then learns the different facial features synchronously. The experimental results show that the proposal outperforms previously proposed methods on four out of five scenarios, and on average, on the NTHU driver drowsiness detection video dataset.
Abnormal object detection in x-ray images with self-normalizing channel attention and efficient data augmentation
In this paper, an abnormal object detection method for x-ray images is proposed under the YOLO framework. ResNeXt-50 is adopted as the backbone network to extract deep features, and a self-normalizing channel attention mechanism (SCAM) is proposed and introduced into the high layers of ResNeXt-50 to enhance the semantic representative ability of the features. According to the characteristics of x-ray images, an efficient data augmentation method is also proposed to enlarge the amount of training data, which helps improve the training of the network. Experimental results on the public SIXray and OPIXray datasets show that, compared with the YOLO-series methods, the proposed method obtains a higher detection accuracy.
A feasibility study of watermark embedding in RNN models
Kota Matsumoto, Shigeyuki Sakazawa
Deep learning models are created using large amounts of time and data and are therefore very costly, so attention has been focused on technologies that protect rights by embedding digital watermarks in models. In this work, we target recurrent neural networks (RNNs) and embed watermarks during model training. Few studies have shown the possibility of watermark embedding for RNN models. In our previous research, we showed that watermark embedding is possible for models generated by LSTM networks, a type of RNN, and verified watermark detection. In this paper, we investigate the effect of watermarking on an RNN model when the watermark is embedded during training. In particular, we conduct experiments and discuss the impact of embedding watermarks on the task and the impact of increasing the number of embedded watermark bits.
A hybrid automatic defect detection method for Thai woven fabrics using CNNs combined with an ANN
Wannida Sae-Tang, Atthaphon Ariyarit
Thai silk is a main export product of Thailand. Since it is a luxury, high-cost product, its quality must be controlled and guaranteed, so automatic defect detection for Thai fabrics, especially Thai silk, has become an interesting research issue. This paper proposes a hybrid automatic defect detection method for Thai woven fabrics using convolutional neural networks (CNNs) combined with an artificial neural network (ANN). Original images and local homogeneity images are used for CNN training, and gray-level co-occurrence matrix (GLCM) texture statistics are used for ANN training. The results from the two CNNs and the ANN are then combined by voting. The experimental results show that the proposed hybrid method is superior to conventional methods in terms of accuracy.
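The abstract says the three classifiers are combined by voting; for binary defect/no-defect labels a plausible majority vote (our assumption about the exact rule) is:

    import numpy as np

    def vote(pred_cnn1, pred_cnn2, pred_ann):
        """Majority vote over the three classifiers' per-sample 0/1 labels."""
        votes = np.stack([pred_cnn1, pred_cnn2, pred_ann])   # shape (3, N)
        # With three binary voters, the majority is the rounded mean.
        return (votes.mean(axis=0) >= 0.5).astype(int)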
Comparison of real-time CNN-based methods for finger-level hand segmentation
Hand segmentation is usually considered a pixel-wise binary classification problem, where the foreground hand is to be recognized in an input image. However, we envision that finger-level hand segmentation is more useful for applications such as hand gesture and sign language recognition. Therefore, in this paper, we compare five state-of-the-art (SOTA) real-time semantic segmentation methods on the task of finger-level hand segmentation. To do so, we introduce two subsets, consisting of 1,000 images in total, manually annotated pixel-wise and selected from newly proposed datasets for hand gesture and word-level sign language recognition. With these subsets, we evaluate the accuracy of the recent SOTA methods DABNet, FastSCNN, FC-HardNet, FASSDNet, and DDRNet. Since each subset has relatively few images (500), we introduce a simple yet effective loss function for training with synthetic data that includes the same annotations. Finally, we present a real-time performance evaluation of the five algorithms on the NVIDIA Jetson family of GPU-powered embedded systems, including the Jetson Xavier NX, Jetson TX2, and Jetson Nano.
Advanced Image Technology VI
Optimizing coded patterns with various lengths
The performance of coded-exposure-photography-based image deblurring highly depends on the coded pattern used. Conventionally, the coded pattern has been optimized under the assumption that its length is equal to the length of the motion blur. However, coded patterns whose lengths differ from that of the motion blur may have better invertibility than the conventional patterns. In this paper, we investigate a method to optimize the coded pattern within an extended range of length candidates. We demonstrate the effectiveness of the proposed method using a real dataset.
Study on 5-D light field compression using multi-focus images
Shuho Umebayashi, Kazuya Kodama, Takayuki Hamamoto
We propose a novel method of 5-D dynamic light field compression using multi-focus images. A light field enables us to observe a scene from various viewpoints. However, it generally consists of enormous 4-D data, which are not suitable for storage or transmission without effective compression. 4-D light fields are highly redundant because they essentially include just 3-D scene information. While robust 3-D scene estimation from light fields, such as depth recovery, is not easy, a method of reconstructing a light field directly from 3-D information composed of multi-focus images, without any scene estimation, has been successfully derived. Based on that method, we previously proposed light field compression via multi-focus images as an effective representation of 3-D scenes. In this paper, we extend the method to the compression of 5-D light fields composed of multi-view videos, including the time domain. It achieves a significant improvement in compression efficiency by utilizing the multiple redundancies of 5-D light fields. We show experimental results using synthetic videos; the quality of the reconstructed light fields is evaluated with PSNR and SSIM to analyze the characteristics of its performance. The results reveal that our method is much superior to light field compression using HEVC at practical lower bit-rates.
Polynomial fitting for period prediction in sliding-DCT-based filtering
Gaussian filtering (GF) is a fundamental smoothing filter that determines the kernel weights according to the Gaussian distribution. GF is an essential tool in image processing and is used in various applications; therefore, accelerating GF is important in many situations. Sliding-DCT-based GF is one of the fastest methods for approximating GF: the Gaussian kernel is decomposed into multiple cosine kernels using the DCT and is approximated by a limited number of them. When calculating the DCT period that best fits the kernel length, a linear search is conventionally used; however, this brute-force search has a significant impact on the filtering time. In this paper, we accelerate the period estimation by polynomial fitting. Experimental results show that the proposed method has almost the same accuracy as the brute-force approach.
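A sketch of the idea (ours; brute_force_best_period is a hypothetical placeholder for the linear search): fit the sigma-to-period mapping once offline, then evaluate the polynomial at run time:

    import numpy as np

    sigmas = np.linspace(1.0, 16.0, 31)
    periods = np.array([brute_force_best_period(s)   # slow search, run once
                        for s in sigmas])

    coeffs = np.polyfit(sigmas, periods, deg=3)      # fit T(sigma) offline
    predict_period = np.poly1d(coeffs)               # cheap run-time evaluation

    T = predict_period(4.5)                          # period for sigma = 4.5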
Conventional versus learning-based video coding benchmarking: Where are we?
Video coding technology and standards have largely shaped the current world in many domains, notably personal communications and entertainment, as demonstrated during recent COVID-19 times. AI-based multimedia tools already have a major impact on many computer vision tasks, even exceeding human performance, and are now arriving in multimedia coding. In this context, this paper offers one of the first extensive benchmarks of deep learning-based video coding solutions against the most powerful and optimized conventional video coding standard, Versatile Video Coding, under the solid, meaningful, and extensive JVET common test conditions. This study will let the video coding research community know the current status quo in this emerging 'battle' between learning-based and conventional video coding solutions and better design future developments.
Image Processing and Classification I
Scale adaptive structure tensor based rolling trilateral filter
One of the tasks of an image filter is to preserve strong edge structures while smoothing textures in the given image. Recently, many approaches have been proposed to accomplish this challenging task. In this paper, we propose a scale-adaptive structure-tensor-based rolling trilateral filter that smooths detailed small textures while preserving prominent structures. The proposed method first estimates the scale at each pixel using a structure measure, then computes the eigenvalues and eigenvectors of the structure tensor constructed from the gradient at each pixel. The trilateral filter combines an anisotropic weight in the spatial domain, a Gaussian weight in the range domain, and a Gaussian weight of the inner product of eigenvectors in the gradient domain. Experiments conducted on many natural images demonstrate that the proposed filter performs well.
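In our notation (the paper's exact kernels may differ), the weight between a center pixel p and a neighbor q multiplies the three factors described above:

    w(p,q) = \exp\left(-\frac{(q-p)^{\top} A_p (q-p)}{2\sigma_s^{2}}\right) \exp\left(-\frac{(I_q - I_p)^{2}}{2\sigma_r^{2}}\right) \exp\left(-\frac{(1 - \langle v_p, v_q \rangle)^{2}}{2\sigma_g^{2}}\right)

where A_p encodes the anisotropy given by the structure tensor's eigenvalues at the locally estimated scale, and v_p, v_q are its dominant eigenvectors.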
Coded diffraction pattern phase retrieval with green noise masks
Qiuliang Ye, Chris Y.H. Chan, Michael G. Somekh, et al.
Coded-diffraction-pattern phase retrieval algorithms enhance performance with the help of random masks. However, traditional methods only focus on the randomness of the masks and disregard their non-bandlimited characteristics. The intensity measurements thus include plenty of high-frequency components outside the consideration of the phase retrieval algorithm, leading to degraded performance. This article presents a green-noise binary masking technique that substantially reduces the high-frequency components of the masks while meeting the randomization criterion. In addition, a novel phase retrieval algorithm is proposed that incorporates arbitrary denoising algorithms as priors based on the plug-and-play framework. Simulation and experimental results show that the proposed green-noise masking technique and the plug-and-play reconstruction algorithm outperform traditional methods in phase retrieval.
Application of GoogLeNet for ocean-front tracking
In recent years, ocean-front tracking has become of vital importance in ocean-related research, and many algorithms have been proposed to identify ocean fronts. However, these methods focus on single-frame ocean-front classification instead of ocean-front tracking. In this paper, we propose an ocean-front tracking dataset (OFTraD) and apply the GoogLeNet Inception network to track ocean fronts in video sequences. First, the video sequence is split into image blocks; then, the image blocks are classified into ocean front and background by the GoogLeNet Inception network. Finally, the labeled image blocks are used to reconstruct the video sequence. Experiments show that our algorithm achieves accurate tracking results.
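The block-wise pipeline can be sketched as follows (classify_block is a hypothetical stand-in for the trained GoogLeNet Inception classifier):

    import numpy as np

    def track_fronts(frames, classify_block, bs=32):
        """Label each bs x bs block of every frame as front (1) or background (0)."""
        labels = []
        for f in frames:
            H, W = f.shape[:2]
            lab = np.zeros((H // bs, W // bs), dtype=int)
            for i in range(H // bs):
                for j in range(W // bs):
                    lab[i, j] = classify_block(f[i*bs:(i+1)*bs, j*bs:(j+1)*bs])
            labels.append(lab)      # labeled blocks reconstruct the sequence
        return labels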
Vehicle re-identification system for road side unit application
Vehicle re-identification is a complex smart-traffic problem: due to the large intra-class variation and high inter-class similarity, it is hard to solve with traditional hand-crafted features. In this paper, we propose a vehicle re-identification system based on deep learning techniques, which is able to re-identify vehicles through deep features within acceptable operation time on an Nvidia Jetson TX2. We collected multiple sequences captured from real-world road side units (RSUs) for system evaluation, and the results indicate that the system is well suited for adoption in real-world traffic applications.
Image Processing and Classification II
A context-aware anchor-free tiny object detector for aerial images
Li-Syuan Chen, Der-Lor Way, Zen-Chung Shih
Object detection in aerial images is the task of predicting target categories while locating the objects. Since different categories of objects may have similar shapes and textures in aerial images, we propose a context-aware layer to provide global and robust features for the classification and regression branches. In addition, we propose CentraBox to reduce unnecessary training samples during the training phase, and instance-level normalization to balance the contributions among instances. Finally, we compare our method with other methods in terms of accuracy, speed, and parameter usage, and we also evaluate our method under different hyper-parameter settings.
Method of hiding phase-shift pattern for depth estimation in interactive art
Akane Toizume, Isao Nishihara, Takayuki Nakata
In recent years, projection mapping has become popular in the commercial and artistic fields. Among its forms is projection mapping in which the viewer's motion and the projected movie interact; such works are called interactive art and are attracting attention. In augmented reality, pattern-hiding methods have been studied to estimate depth using only a camera and a projector. However, color breaking occurs at the borders of the grid, and flicker occurs when the pattern changes. Therefore, in this study, we applied the phase-shift method to interactive art and investigated the possibility of depth estimation. As a result, we calculated the phase image and reduced color breaking and flicker. In this paper, we experimented with still images as the content; in the future, it will be possible to apply the method to movies.
Car number recognition for Formula One driver identification
Kazuya Goto, Katsuto Nakajima
We propose a two-stage car number recognition method to identify drivers of Formula One cars displayed in broadcast videos to aid the production of digest videos. The evaluation result demonstrated an accuracy of approximately 90% for the recognition of small and blurry car numbers.
Extraction of Perilla frutescens areas in farmland using a camera on a self-propelled robot and examination of an evaluation method
Hiroki Fuwa, Kei Sawai, Takumi Tamamoto, et al.
Population ageing and a shortage of laborers are serious problems in Japanese agriculture; therefore, the government has been promoting smart agriculture. As part of the smart agriculture project, Toyama City is developing a weeding robot to establish an automated weeding system for perilla fields. In this project, the authors are in charge of extracting perilla areas from input images using deep learning, based on images sent from cameras in real time. The learner is based on YOLOv2 and is trained using images taken from above in an actual field. On the premise that the weeding robot must avoid cutting the perilla, we performed a subjective evaluation and an evaluation using the distance of the output values from the center point. As a result, we were able to roughly extract the perilla areas, but some outputs were not suitable for the weeding system. In the future, we will adjust the parameters of the learner and study the threshold value to avoid wrong harvesting.
PRNU-based source camera identification on smart video surveillance: a practical approach
Sai-Chung Law, Ngai Fong Law
Video surveillance is very common nowadays, with systems deployed in conventional networks as well as in the cloud and IoT domains. While internet-based video surveillance systems provide ease of operation, they are also prone to cyber attacks, so video authentication cannot be guaranteed if someone hacks into the system and reaches the video source. In order to identify the video source, a source identification method employing PRNU (photo-response non-uniformity) noise as the detecting signal has been devised. PRNU is a kind of sensor pattern noise that can be found in every digital image captured by a digital camera, and it has proved useful in image and video source camera identification. However, the challenges of real-life applications have not been fully addressed, especially for IoT-based video surveillance. In this paper, we present a practical PRNU-based source verification scheme incorporated into a smart video surveillance system with limited resources, such as low computation power at the edge of the network. The performance of the proposed scheme is evaluated through simulation tests on different cameras capturing video scenes at different times of day. The results show an efficient and effective prototype of our method that is comparable to state-of-the-art techniques in related work.
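A simplified sketch of PRNU-style matching (Gaussian blur stands in for the wavelet denoiser common in the PRNU literature; this is illustrative, not the paper's implementation):

    import numpy as np
    import cv2

    def noise_residual(img):
        """Noise residual: the image minus a denoised copy of itself."""
        blur = cv2.GaussianBlur(img, (5, 5), 1.0)
        return img.astype(np.float32) - blur.astype(np.float32)

    def ncc(a, b):
        """Normalized cross-correlation between two residual images."""
        a = a - a.mean()
        b = b - b.mean()
        return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    # The camera fingerprint is the average residual over many frames from
    # that camera; a questioned frame is attributed to the camera when the
    # NCC between its residual and the fingerprint exceeds a threshold.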
Swimmer position estimation and stroke analysis based on head recognition
In the field of competitive swimming, performance investigation of official games is important and useful for performance development. We have been working on swimmer position estimation from a wide video view of official games. The challenges are water splash and complicated reflections of light that may hide swimmers from the camera. To overcome these problems, we utilize YOLOv3 and prepare a dedicated dataset of swimmers' heads in real games. The trained YOLOv3 can detect heads with 48.1% mAP. In addition to position estimation, we also propose a new method to investigate the status of the strokes over time by detecting two head classes: over the water and under the water. We prepared another dedicated dataset for this two-class training. With the trained YOLOv3, we successfully visualize the status changes of a swimmer over a whole game.
Multimedia Systems and Applications I
PetCare: a real-time pet monitoring system with food dispensing using Raspberry Pi
Nowadays, many people adopt a pet, not for taking care of the house but as a companion. However, many pet owners may not be able to spend time taking care of their pets, especially when on a business trip. A stay-at-home pet has little or no survival skill for finding food outside and could not survive on its own, unlike stray animals. Usually, pet owners would therefore ask friends to take care of their pets, or look for a real-time food-dispensing system that can feed the pet at scheduled times with monitoring functions. This paper proposes a monitoring system with an automatic food dispenser and several other useful functions to assist pet owners. The proposed system uses a Raspberry Pi that controls and interconnects several subsystems: (1) a real-time monitoring subsystem, (2) a door-managing subsystem with software support to ease the burden, and (3) a food-dispensing subsystem. The wireless food-dispensing subsystem allows owners to feed their pet automatically, according to a schedule, or manually, according to their preference. The real-time monitoring subsystem allows owners to monitor their pet through a camera and check whether stray animals try to enter the cage. Finally, the door-managing subsystem allows owners to lock and unlock the door, giving their pet some freedom. This system proves useful in reducing the chance of pets being abandoned.
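A minimal dispensing sketch for a Raspberry Pi with a servo-driven food gate (the pin number, schedule, and duty cycles are all assumptions, not the paper's design):

    import time
    import RPi.GPIO as GPIO   # assumes a Raspberry Pi with a servo on GPIO 18

    FEED_TIMES = {"08:00", "18:00"}        # assumed feeding schedule
    GPIO.setmode(GPIO.BCM)
    GPIO.setup(18, GPIO.OUT)
    servo = GPIO.PWM(18, 50)               # 50 Hz servo control signal
    servo.start(0)

    def dispense():
        servo.ChangeDutyCycle(7.5)         # assumed duty cycle: open the gate
        time.sleep(1.0)
        servo.ChangeDutyCycle(2.5)         # assumed duty cycle: close the gate
        time.sleep(1.0)
        servo.ChangeDutyCycle(0)           # stop driving the servo

    while True:
        if time.strftime("%H:%M") in FEED_TIMES:
            dispense()
            time.sleep(60)                 # avoid double-dispensing in a minute
        time.sleep(1)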
A method of projection mapping from a moving cart with reflecting position information
In this study, we developed a projection mapping system that projects images onto the ground using projectors mounted on an auto-driving cart. By reflecting the position information used for auto-driving in the projected image, we realize projection mapping in which the image changes appropriately as the cart moves. In an experiment using a prototype system, we confirmed that our method can realize the effect of characters appearing in the headlights as the cart moves.
Study on bottom sediment classification by complementary use of seafloor images and environmental sounds
In CNN-based classification of seafloor images, the accuracy may decrease drastically in different sea areas. Therefore, we aim to improve the accuracy by additionally utilizing environmental sounds recorded while dragging. Sound-based classification includes CNN classification using log-mel images, and we can expect a complementary relationship by using image-based and sound-based classification together. As a concrete method, we propose a robust sediment classification method using transfer learning.
Multimedia Systems and Applications II
Study of automatic generation of motif tags in Nishiki-e
Hoshito Minagi, Takuzi Suzuki, Yoshitugu Manabe, et al.
The motifs drawn in Nishiki-e need to be registered in a database as search tags. The accuracy of the motif tags, which are currently registered manually, is unstable because it depends on the knowledge and interests of the registrant. Therefore, this study proposes an automatic motif-tag generation method using deep learning to support cultural activities. For Nishiki-e, it is more difficult to collect training images that include specific motifs than for photographs. In this study, we propose three methods for preparing training images. First, we applied a similar-image generation model that works from a single image to a small number of Nishiki-e containing motifs. Second, we applied a Nishiki-e style processing model to photographs containing motifs. Third, we combined a small number of photographs of motifs with background images. In particular, the third method can detect motifs from a small number of inputs, like the first method, with an accuracy close to that of the second method.
Design of a SSIM-optimal 2D post filter with symmetric coefficients for coding artifacts reduction
We previously proposed a method of designing a 2D FIR filter that maximizes the well-known objective quality index SSIM. The designed filter can be used as a post-processing tool for lossy image coding methods to reduce coding artifacts. In this scenario, there is a trade-off between the amount of side information on the filter coefficients and the obtained gain in image quality. In this paper, the effectiveness of the designed filters on rate-SSIM coding performance is evaluated under different settings of the size and quantization precision of the filter coefficients. Moreover, we introduce symmetry constraints on the filter coefficients to reduce the side information.
Analysis of human's skilled process of assembly task using time-sequence-based machine learning
Ryo Miyoshi, Kosuke Kimura, Shuichi Akizuki, et al.
To efficiently teach skilled tasks to novices, it is necessary to analyze the differences between novice and expert workers. Accordingly, a method for extracting skill-level-based differences in the motions of workers performing tasks is proposed. In this method, a network (multi-stream LSTM) that estimates skill level from 3D positional information of the worker's visual point and joints is trained, and the internal structure of the network is then analyzed. The results of an experiment indicate that a particular motion, namely "grasping an object," changes as the worker becomes skilled: the skilled worker grasps the object without moving their visual point to the position of the part, i.e., without looking at the object, and uses both hands efficiently.
CCIP: combined CCLM and intra prediction
Jeehwan Lee, Bumyoon Kim, Byeungwoo Jeon
In this paper, we propose a new coding method, called combined CCLM and intra prediction (CCIP), to improve the prediction efficiency of the cross-component linear model (CCLM) for chroma in the Versatile Video Coding (VVC) standard. While the CCLM technique in VVC can use the linear correlation between the current chroma block and its co-located reconstructed luma region, it cannot take into account the correlation between the current chroma block and its adjacent chroma blocks. The proposed CCIP overcomes this shortcoming by combining two predictors, generated respectively by the conventional CCLM model and by intra prediction using the spatial chroma reference samples. Our experimental results show BD-rates of 0.06% in the Y component, -0.60% in Cb, and -0.54% in Cr in the 4:2:0 color format, and -0.09% in Y, -0.33% in Cb, and -0.42% in Cr in the 4:4:4 color format.
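The combination weights are not given in this abstract; an equal-weight integer average in the spirit of VVC's combined inter/intra prediction (CIIP) would be

    P_CCIP(x, y) = ( P_CCLM(x, y) + P_intra(x, y) + 1 ) >> 1

where ">> 1" is the usual rounding right-shift on integer prediction samples. The actual CCIP weighting may differ.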
Study of sub-pel block vector for intra block copy
Yujin Lee, Bumyoon Kim, Byeungwoo Jeon
Versatile Video Coding (VVC) has the intra block copy (IBC) coding tool for intra prediction. It uses a block vector (BV) of either 1-pel or 4-pel accuracy to indicate a reference block in the current picture. However, a block vector not expressed in sub-pel accuracy might be limited in accurately locating a reference block. In this context, we investigate representing BVs in sub-pel resolutions of half-pel and quarter-pel accuracy. According to our study, for camera-captured video content, 18.96% and 28.63% of BVs prefer to be estimated in half-pel and quarter-pel resolution, respectively, if sub-pel accuracy is allowed for the IBC block vector. Regarding the block vector difference (BVD), 8.72% and 9.98% of BVs choose to be signaled using nonzero BVD in half-pel and quarter-pel, respectively. This is comparable to the usage ratio of the existing 4-pel resolution of IBC. Allowing half-pel and quarter-pel BV resolutions also brings coding gain in some natural content sequences. Therefore, to further improve the coding efficiency of IBC, especially for natural video content, sub-pel resolution of BVs can be effective.
An examination of effect on psychological time by evoking fear in VR
The time that people perceive is called "psychological time"; it is classified as "time perception" when the time range is less than 5 s and as "time estimation" when the time range is 5 s or more. It is known that psychological time is perceived as longer under fear, but there is little research on the separate effects of fear on time perception and time estimation. Therefore, using VR technology, which can give users a true sense of immersion, we present VR content that safely evokes strong fear in both the time perception and time estimation ranges, and we investigate the differences between the effects of fear on the two.
Classification of pancreatic tumors using colored two-dimensional histograms from ultrasound endoscopic images
Yuji Takeuchi, Takeshi Hara, Xiangrong Zhou, et al.
Diagnosis using a time-intensity curve (TIC) is considered useful in the differentiation of pancreatic tumors. A TIC is a graph that shows the contrast intensity of contrast-enhanced endoscopic ultrasonography over time. We propose a method to classify pancreatic tumors that generates and uses two types of images representing the contrast effect in ultrasound endoscopic images. The first is a two-dimensional histogram that adds information about the distribution of luminance values per frame to the TIC, which characterizes the contrast effect over time. The second is the frame with the highest average luminance value among all frames of each case, which captures the contrast enhancement pattern of the tumor. The features of the two images are extracted using deep learning, combined by a concatenation layer, and output by a fully connected layer as the probability of pancreatic cancer. In this study, 131 cases with pancreatic tumors (pancreatic cancer: 86 cases; non-pancreatic cancer: 45 cases) were used. In a receiver operating characteristic analysis of the output probability, the area under the curve was 0.82, the sensitivity was 80.2%, and the specificity was 71.1%.
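A minimal two-stream sketch of this architecture (Keras-style; the input shapes and small CNN branches are our assumptions, not the paper's actual backbones):

    from tensorflow import keras
    from tensorflow.keras import layers

    hist_in = keras.Input(shape=(128, 128, 3))    # 2D TIC histogram image
    frame_in = keras.Input(shape=(224, 224, 3))   # peak-luminance frame

    def cnn_branch(x):
        x = layers.Conv2D(32, 3, activation="relu")(x)
        x = layers.MaxPooling2D()(x)
        x = layers.Conv2D(64, 3, activation="relu")(x)
        return layers.GlobalAveragePooling2D()(x)

    merged = layers.Concatenate()([cnn_branch(hist_in), cnn_branch(frame_in)])
    prob = layers.Dense(1, activation="sigmoid")(merged)  # P(pancreatic cancer)
    model = keras.Model([hist_in, frame_in], prob)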
VR and AR
Continuous dynamic collision detection in VR pool
Shi Sheng Long, Feng Lin
A new algorithm for continuous dynamic collision detection is developed for VR pool games. Sweep-based 3D physics is applied to the cue ball and object ball movements, which ensures that fast-moving balls collide with each other. Time-of-impact algorithms are also exploited to compute potential collisions for an object by sweeping its forward trajectory using its current velocity. Experimental results are presented in an immersive and interactive virtual pool environment, showing comparative advantages in the accuracy of ball trajectory generation over existing methods.
A VR-based repetitive learning system of accurate tuna dismantling for fisheries high school students
Among the characteristic club activities in Japanese fisheries high schools, demonstrations of dismantling tuna and other fish are attractive to consumers as well as to students. To improve the situation in which it is difficult to provide sufficient practice opportunities due to high cost, previous research proposed a virtual reality self-study support system for tuna dismantling. The system provides visuals through a head-mounted display (HMD) with interactive manipulation of several kinds of knives with both hands. Implementing additional functions in the previously proposed system, this article proposes a system that enables effective repetitive practice and allows students to learn the exact dismantling procedure and operations through experience in the virtual world.
A VR-based indoor visualization system from floorplan images with deep learning
In this paper, we propose a system to represent a 3D virtual house based on given floorplan images, through which a user can intuitively and efficiently understand the interior furnishing conditions. The proposed system adopts a learning-based approach using a convolutional neural network with a customized dataset, which allows us to acquire high-accuracy semantic information about diverse components of the house, e.g., walls, toilets, and the balcony area, from the input floorplan image. Then, constructing an interactive virtual environment with a head-mounted display (HMD), the system enables the user to arbitrarily adjust the positions of furniture of various sizes, and indicates in an intuitive way whether the selected furniture can be placed at a specific position.
An interactive shadow generation system for spatial augmented reality
A shadow implicitly represents the existence of a human or object in various applications in interaction and media art design. However, it is challenging to artificially generate a natural shadow for spatial augmented reality, where conventional approaches ignore object dynamics and automatic control. In this work, we propose an interactive shadow generation system that creates interactive shadows of users with a projector-camera system. With the offline processes of human mesh modeling and virtual environment registration, the proposed system rigs a 3D model created by scanning the user to generate the shadow. Finally, the generated shadow is projected into the real environment. We verify the usability of the proposed system and the impression of the generated shadow through a user study.
A study of multimodal head/eye orientation prediction techniques in virtual space
Various models have been proposed to predict the future head/gaze orientation of a user watching a 360-degree video. However, most of these models do not take sound information into account, and there are few studies on the influence of sound on users in VR space. This study proposes a multimodal model for predicting head/gaze orientation for 360-degree videos based on a new analysis of users' head/gaze behavior in VR space. First, we focus on whether people are attracted to the sound source of a 360-degree video. We conducted a head/gaze tracking experiment with 22 subjects under AV (audio-visual) and V (visual) conditions using 32 videos. The results confirmed that whether viewers were attracted to the sound source differed depending on the video. Next, we trained a deep learning model based on these results and constructed and evaluated a multimodal model combining visual and auditory information. As a result, we were able to construct a multimodal head/gaze prediction model that uses the sound source explicitly. However, in terms of accuracy improvement, we could not confirm any advantage of multimodalization. Finally, a discussion of this problem and future prospects is given.
3D Processing and Applications
Calibration of photometric stereo point light source based on standard block
Yang Yang, Xianglong Wang, Hao Fan, et al.
The calibration of light source positions is critical to photometric stereo. This paper presents a method for light source position calibration based on a standard block. Unlike prior works that use mirror spheres, we use a calibration target consisting of a flat plane and a standard block of known length placed on the plane. By analyzing the shadow produced by the standard block under a fixed light, we find that the relation between shadow and light conforms to trigonometric geometry: the near light source's position can be obtained from the positional relationship between the line segment and the inflection point in the shadow. Compared with other methods, our method is convenient, fast, and produces better results.
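In a simplified 2-D reading of this trigonometric relation (our notation): with block height h, shadow length s measured from the block's base, and horizontal distance d from the block to the foot point of the light, similar triangles give

    H / (d + s) = h / s,  i.e.,  H = h (d + s) / s

so locating the shadow's endpoint fixes the light height H, and analogous measurements constrain the remaining coordinates of the near light source.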
Viewpoint extension method for light field display using stacked imaginary image display
Y. Koike, I. Nishihara, T. Nakata
In recent years, there has been a lot of research on autostereoscopic displays. In this work, we propose a viewpoint extension method for the stacked Pepper's Ghost method [1] that allows multiple people to view stereoscopic images simultaneously. In the proposed method, we use three cameras to obtain the light transport matrix, which encodes the pixel correspondence between the display and the cameras. In this way, we extend the number of viewpoints from two to three; moreover, the number of viewpoints can be extended further by increasing the number of cameras. In our experiments, we captured the target light field image, obtained the light transport matrix, and optimized the layer images. The experimental results showed that, although the optimization accuracy was lower than for two viewpoints, it was still possible to present disparity images from three viewpoints, confirming that the number of viewpoints can be expanded. In the future, we aim to present images more stereoscopically by presenting disparity in the vertical direction as well.
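Once the light transport matrix T is known, optimizing the layer image reduces to a least-squares problem; a toy sketch with stand-in data (the real matrix comes from the calibration, not random values):

    import numpy as np

    rng = np.random.default_rng(0)
    T = rng.random((300, 120))   # toy light transport matrix (stand-in values)
    c = rng.random(300)          # target light-field pixels from the cameras

    # Layer pixels that minimize ||T x - c||, clamped to displayable values.
    x, *_ = np.linalg.lstsq(T, c, rcond=None)
    x = np.clip(x, 0.0, 1.0)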
A study of calibration method for lenticular glasses-free 3D display using light transport matrix
Asuka Fukatsu, Isao Nishihara, Takayuki Nakata
In recent years, glasses-free 3D displays have been studied and are expected to play an active role in various fields. Lenticular glasses-free 3D displays are usually calibrated based on a one-to-one correspondence between the ray vector emitted from the display and the ray vector observed by the camera. However, in the real world, light emitted from a single display pixel affects multiple camera pixels, so a one-to-many correspondence must be considered. In this study, we investigate a calibration method that takes into account the spread of light occurring in the real environment by using the light transport matrix, which gives the correspondence between the display and the camera. We obtained the light transport matrix and conducted an experiment to generate a display image calibrated to the camera position. The generated image was drawn on the display and observed using the same camera, confirming that the image was calibrated to the camera position. In the future, we aim to expand the parallax by increasing the number of cameras.
Interactive 3D character modeling from 2D orthogonal drawings with annotations
Zhengyu Huang, Haoran Xie, Tsukasa Fukusato
We propose an interactive 3D character modeling approach from orthographic drawings (e.g., front and side views) based on 2D-space annotations. First, the system builds partial correspondences between the drawings and generates a base mesh with sweeping splines according to edge information in the 2D images. Next, users annotate the desired parts on the input drawings (e.g., eyes and mouth) by drawing two types of strokes, named addition and erosion, and the system re-optimizes the shape of the base mesh using the annotations. By repeating the 2D-space operations (i.e., revising and modifying the annotations), users can design a desired character model. To validate the efficiency and quality of our system, we compare the generated results with those of state-of-the-art methods.
Traction force presentation in redirected walking
Toshiya Kurosaki, Takayuki Nakata
Walking in real space to move through a virtual environment is one method of improving the immersion of Virtual Reality (VR) content. However, the movable space in a virtual environment is limited by real-space obstacles such as walls, and as a result, the experience of the VR content may be impaired. One solution to this problem is Redirected Walking (RDW), a method that enables walking in a virtual space larger than the limited real space by inducing an illusion of walking direction through rotation of the virtual scene. Previous research has indicated that haptics can enhance the level of this illusion. In this study, we propose an RDW method that combines traction force presentation with virtual scene rotation. We developed an RDW system that presents a traction force as the subject walks and conducted experiments with a few subjects. The experimental results suggest that traction force presentation can enhance the level of the illusion.
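The scene-rotation component of RDW can be illustrated with a tiny sketch (illustrative only; the gain value and update rule are assumptions, not taken from the paper): the virtual view is rotated by a gain applied to the user's real head rotation, injecting unnoticed extra rotation each frame.

```python
# Hedged sketch of a rotational gain, the core mechanism of scene-rotation RDW.
def redirected_yaw(virtual_yaw_deg, head_yaw_delta_deg, gain=1.2):
    """Amplify (gain > 1) or attenuate (gain < 1) real head rotation
    when mapping it into the virtual scene."""
    return virtual_yaw_deg + gain * head_yaw_delta_deg

# Per frame: the user turned 2 degrees in the real world; with gain 1.2 the
# virtual view turns 2.4 degrees, injecting 0.4 degrees of unnoticed rotation.
yaw = 0.0
for head_delta in [2.0, 2.0, -1.0]:
    yaw = redirected_yaw(yaw, head_delta)
print(yaw)  # 3.6
```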
Pose estimation using object contour and projection distortion for dynamic projection mapping
Projection mapping, a spatial augmented reality technology that changes the appearance of an object by projecting images with a projector, has become widespread. We have achieved projection mapping on moving objects by estimating the position and orientation of the object, matching the contour of the object acquired with an infrared camera to the contour of a 3D model acquired in advance. However, when the contour of the object is occluded, the estimation accuracy is degraded, which is a problem for precise projection mapping. In this study, we propose robust position and orientation estimation that uses, in addition to the contour, the distortion of the projected image caused by the motion of the target object. The projection distortion is reproduced in a simulation environment, and the estimation accuracy of the position and orientation is evaluated.
Dynamic photometric stereo for flat bas-relief surfaces
Photometric stereo recovers surface shape with fine details but fails when the images are captured with relative motion between the camera and the object. We therefore propose a dynamic photometric stereo method for 3D reconstruction of flat bas-relief objects. The key contributions of our work are to build a unified world coordinate system between multi-view images by structure from motion while eliminating the mismatched points caused by shadows, and to establish pixel-level dense matching utilizing the homography between the flat object in two views. Finally, we use classic photometric stereo to obtain a high-quality 3D reconstruction. The effectiveness of our method is verified on real datasets.
Deep Learning and Applications I
Quantification of skin using smartphone and Skip-GANomaly deep learning model in beauty industry
Rui Matsuo, Makoto Hasegawa
This study discusses human skin visualization and quantification for the beauty industry using a smartphone and deep learning. Skin was photographed using a medical camera that can simultaneously capture RGB and UV images of the same area, and a training dataset was generated from the two types of images; a U-Net was then trained on this dataset. RGB images of skin captured with a smartphone camera were converted into pseudo-UV images by the trained U-Net. Moles and age spots could be effectively visualized in the pseudo-UV images. The pseudo-UV images of young subjects were then learned by the Skip-GANomaly model to quantify the skin of middle-aged subjects.
Tolerance of CNN watermarking against model optimizations
Yudai Yamaji, Shigeyuki Sakazawa
Models created by companies and individuals by training on large amounts of data are important assets and need copyright protection. As a method of copyright protection, embedding a watermark into the learning model has been investigated. Such a watermark is embedded directly into the parameters of the learning model, but the parameter values change when the model undergoes compression processes such as quantization. In our previous study, we showed that the effect of quantization on the watermark was small and that the embedded watermark could still be retrieved. In this paper, as a further investigation, we conduct experiments on the effect on the watermark of combined pruning and quantization, as well as quantization-aware training, when creating the trained model. In the experiments, we used models of two different scales, one large and one small, applied the above processing to each model, and checked the state of the watermark. The results show that models with both pruning and quantization show significant watermark degradation for small-scale models, but this is eliminated when the models are quantized. In the case of quantization-aware training, there was no effect on the watermark.
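Parameter-space watermarking of this kind is commonly realized as in Uchida et al.'s scheme; the abstract does not specify the exact method used, so the following is a hedged sketch of that general approach: a secret projection matrix X maps flattened weights to bits, enforced during training by an extra binary cross-entropy loss, and the bits are later extracted by thresholding the projection.

```python
# Hedged sketch of Uchida-style weight watermarking (scheme assumed, not confirmed).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
conv_weight = torch.randn(64 * 3 * 3 * 3, requires_grad=True)  # host layer weights
secret_X = torch.randn(32, conv_weight.numel())                # secret key matrix
watermark_bits = torch.randint(0, 2, (32,)).float()            # embedded message

def watermark_loss(w, X, b):
    """BCE between the projected weights and the target bit string."""
    return F.binary_cross_entropy(torch.sigmoid(X @ w), b)

def extract_bits(w, X):
    return (X @ w > 0).float()

loss = watermark_loss(conv_weight, secret_X, watermark_bits)   # added to task loss
loss.backward()

# After training, extraction can be tested under mild (8-bit-like) quantization:
q = torch.round(conv_weight.detach() * 128) / 128
recovered = extract_bits(q, secret_X)
```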
Does modern facial feature extraction network need face normalization? A study on K-Face dataset
We analyzed the effect of a face normalization function on face recognition. In detail, the effect of face normalization on identification accuracy was analyzed for two different feature extraction networks. An identification protocol using the K-Face dataset was designed for training and testing. A positive effect of face normalization was confirmed for large poses. However, normalization had a negative impact on the modern feature extraction network. In conclusion, performance improvement is expected only when more precise face normalization modules are added to advanced feature extraction networks.
Detection and classification of 32 tooth types in dental panoramic radiographs using single CNN model and post-processing
Takumi Morishita, Chisako Muramatsu, Yuta Seino, et al.
The purpose of this study is to analyze dental panoramic radiographs to complete dental files and thereby support diagnosis by dentists. In this study, we recognized 32 tooth types and classified four tooth attributes (tooth, remaining root, pontic, and implant) using 925 dental panoramic radiographs. YOLOv4 and post-processing were used for the recognition of the 32 tooth types. As a result, the tooth detection recall was 99.65%, the number of false positives was 0.10 per image, and the 32-type recognition recall was 98.55%. For the classification of the four tooth attributes, two methods were compared. In Method 1, image classification was performed using images clipped based on the tooth detection results. In Method 2, labels of tooth attributes were added to the labels of tooth types in object detection; by providing two labels for the same bounding box, we performed multi-label object detection. The accuracy of Method 1 was 0.995 and that of Method 2 was 0.990. Method 2 uses a simple and robust model yet achieves accuracy comparable to Method 1 and does not require additional CNN models. This suggests the usefulness of multi-label detection.
Improvement of cell image analysis system based on CNN
Yuma Hotta, Toshiyuki Yoshida, Takuya Kajitani, et al.
The authors have proposed a cell image analysis system that offers mechanisms for cell segmentation, tracking, and fluorescence analysis, where the segmentation step is the crucial one that dominates the overall performance. This paper proposes a CNN-based segmentation technique for cell images to improve the segmentation accuracy of the analysis system. In the segmentation step, the division timing of cells is important, together with the geometrical accuracy of the segmented cells. The proposed technique makes use of two sets of fluorescent (FL) cell images as well as common bright-field (BF) images to improve the accuracy of the division timing. These image sets are fed into an extended version of a multi-input U-Net to generate accurate cell markers utilized as seeds in the watershed postprocessing. The experimental results demonstrate that the proposed technique gives satisfactory accuracy in terms of both the geometrical shape and the division timing in the cell segmentation task.
Automated segmentation of oblique abdominal muscle based on body cavity segmentation in torso CT images using U-Net
N. Kamiya, X. Zhou, H. Kato, et al.
The body cavity region contains the organs and is an essential region for skeletal muscle segmentation. This study proposes a method to segment body cavity regions using U-Net, with a focus on the oblique abdominal muscles. The proposed method comprises two steps. First, the body cavity is segmented using U-Net. Subsequently, the abdominal muscles are identified using recognition techniques. This is achieved by removing the segmented body cavity region from the original computed tomography (CT) images to obtain a simplified CT image for training. In this image, the visceral organ regions are masked by the body cavity, ensuring that the organs therein are excluded from the segmentation target in advance, which has been a primary concern in conventional skeletal muscle segmentation methods. The segmentation accuracies of the body cavity and oblique abdominal muscle in 16 cases were 98.50% and 84.89%, respectively, in terms of the average Dice value. Furthermore, body cavity information reduced the number of over-extracted pixels by 36.21% in the segmentation of the oblique abdominal muscles adjacent to the body cavity, improving the segmentation accuracy. In future studies, it could be beneficial to examine whether the proposed simplification of CT images by body cavity segmentation is also effective for abdominal skeletal muscles adjacent to body cavities divided by tendon ends, such as the rectus abdominis.
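The CT simplification step amounts to masking: replacing the segmented cavity with an air-like value before the muscle network sees the slice. A minimal sketch, with placeholder data and an assumed fill value:

```python
# Hedged sketch of the masking step: remove the body cavity from a CT slice.
import numpy as np

def simplify_ct_slice(ct_hu, cavity_mask, fill_hu=-1000.0):
    """Replace body-cavity pixels with an air-like Hounsfield value.

    ct_hu       : 2D array of Hounsfield units
    cavity_mask : boolean array from the first-stage U-Net (True = cavity)
    """
    out = ct_hu.copy()
    out[cavity_mask] = fill_hu
    return out

ct = np.random.uniform(-200, 200, (512, 512))   # placeholder slice
mask = np.zeros((512, 512), dtype=bool)
mask[150:350, 180:330] = True                   # placeholder cavity region
train_image = simplify_ct_slice(ct, mask)       # fed to the second-stage U-Net
```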
Improved multi-reference makeup transfer with localized attention mechanism
Pin-Hua Lee, Chih-Hsien Hsia
Makeup transfer refers to the methodology of transferring the makeup style of a reference image to a source image. Previous works have achieved satisfactory results in transferring an entire style, but multi-reference localized makeup transfer is still challenging due to the diversity of makeup styles as well as the large variety of image content. Our method builds upon image segmentation to detect the facial silhouette of the portraits. In this study, we present an end-to-end multi-reference makeup transfer framework that generates an output image given multiple reference images. The deep learning (DL) network successfully applies the style from the desired regions of the target reference images to the source image without damaging the original facial features. As demonstrated in the experimental results, our makeup transfer utilizes partial style transfer and achieves state-of-the-art performance on a wide range of makeup styles.
On the instability of unsupervised domain adaptation with ADDA
Kazuki Omi, Toru Tamaki
In this paper we report the instability of Adversarial Discriminative Domain Adaptation (ADDA), an unsupervised domain adaptation method. The accuracy of ADDA is not stable, and we show that the instability comes from the initialization of the CNN for the target domain, not from the pre-training with the source domain.
Siren tracking system with emergency support using self-organizing map
Hoai Thang Tan, Phooi Yee Lau
With the advance of science and technology, there are more and more vehicles on the road, and emergency vehicles such as ambulances have a hard time bypassing busy lanes. This paper proposes an ambulance siren tracking system based on the self-organizing map (SOM) algorithm. Using SOM techniques, the location and direction of the ambulance siren can be tracked. A support vector machine is then used to classify the sound of ambulance sirens. To improve the classification of ambulance sirens, pre-processing steps such as a bandpass filter are adopted, with 600 Hz as the lower and 1600 Hz as the upper cutoff frequency. We developed a mobile application, named Siren Tracking system with Emergency Support (STES), that allows drivers on the road to track ambulance sirens. To examine system performance, we subjected the system to real-time scenarios using St John ambulance sirens. Based on the experimental results, the system is shown to reliably localize the location of the ambulance.
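The 600-1600 Hz bandpass pre-processing step can be sketched directly; the filter order and design below are assumptions, not taken from the paper:

```python
# Hedged sketch of the siren bandpass pre-processing (Butterworth, order assumed).
import numpy as np
from scipy.signal import butter, sosfiltfilt

def siren_bandpass(signal, sr, low=600.0, high=1600.0, order=4):
    """Keep only the band where ambulance siren energy concentrates."""
    sos = butter(order, [low, high], btype="bandpass", fs=sr, output="sos")
    return sosfiltfilt(sos, signal)

sr = 16000
t = np.arange(sr) / sr
noisy = np.sin(2 * np.pi * 960 * t) + 0.5 * np.random.randn(sr)  # siren-like tone + noise
clean = siren_bandpass(noisy, sr)
```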
Deep Learning and Applications II
Blending CNNs with different signal lengths for real-time EEG classification sensitive to the changes
Although a lot of BMI research using CNNs has been performed, a CNN's response to changes in the input EEG is too slow for real-time operation. We propose a method to improve real-time performance by blending multiple CNNs with different input signal lengths. The proposed method generates a classifier that has the advantage of a classifier with a short input signal length, i.e., fast response to changes in the input signal, as well as the advantage of a classifier with a long input signal length, i.e., high classification performance.
A single-target license plate detection with attention
Wenyun Li, Chi-Man Pun
In wild scenes, the detection of license plates (LP) is a fundamental task in realistic intelligent transportation applications. However, most existing license plate detection models are not designed specifically for the license plate detection task. Since license plate detection is a standard single-target object detection task, simply adapting multi-target detection methods to it may not be a good choice. Hence, in this paper, we build on the state-of-the-art single-target detector RetinaNet, which works well in unconstrained environments. We add an attention module to the network and adopt a more suitable bounding-box loss function, CIoU. In the experiments, we test and evaluate on the largest Chinese license plate dataset (CCPD). Experimental results show that our proposed method achieves better performance than other methods, with a very high accuracy of 99.75%.
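The CIoU box loss mentioned above follows Zheng et al.'s formulation: IoU penalized by normalized center distance and an aspect-ratio consistency term. A self-contained sketch, assuming (cx, cy, w, h) box format:

```python
# Hedged sketch of the Complete IoU (CIoU) loss for a single box pair.
import math

def ciou_loss(pred, gt, eps=1e-7):
    px, py, pw, ph = pred
    gx, gy, gw, gh = gt
    # Intersection over union
    ix1, iy1 = max(px - pw / 2, gx - gw / 2), max(py - ph / 2, gy - gh / 2)
    ix2, iy2 = min(px + pw / 2, gx + gw / 2), min(py + ph / 2, gy + gh / 2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = pw * ph + gw * gh - inter
    iou = inter / (union + eps)
    # Center distance normalized by the enclosing box diagonal
    cx1, cy1 = min(px - pw / 2, gx - gw / 2), min(py - ph / 2, gy - gh / 2)
    cx2, cy2 = max(px + pw / 2, gx + gw / 2), max(py + ph / 2, gy + gh / 2)
    c2 = (cx2 - cx1) ** 2 + (cy2 - cy1) ** 2 + eps
    rho2 = (px - gx) ** 2 + (py - gy) ** 2
    # Aspect-ratio consistency term
    v = (4 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(pw / ph)) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v

print(ciou_loss((50, 50, 20, 40), (55, 52, 22, 38)))
```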
Improved lung cancer detection in ultra low dose CT with combined AI-based nodule detection and denoising techniques
Jemyoung Lee, Jae-Hyun Park, Minsu Kim, et al.
In this study, we evaluated the synergy between two artificial intelligence solutions by applying a deep learning-based denoising technique and determining whether the performance of an AI-based lung nodule detection solution is enhanced.
Dataset generation with GAN for reflection image removal on eyeglasses
Sota Watanabe, Makoto Hasegawa
This study investigated a method for removing display images reflected on eyeglasses with blue-light-cutting lenses using deep learning. A dataset of images with and without reflections on eyeglasses was generated with an identical angle of view, and a U-Net was trained on the generated dataset. The trained model removed reflections from images. In this study, in addition to the dataset of actual images, a generative adversarial network (GAN) was trained on that dataset to generate a large number of images. By increasing the number of images in the dataset, the reflection removal accuracy was improved.
Improving the accuracy of the color constancy network by the object detection
Color constancy is a characteristic of human vision that allows the color of an object to be recognized correctly even if the color of the illumination changes. We constructed a network that reproduces color constancy using pix2pix, a generative adversarial network. However, the current network has problems: for example, it cannot output the color and shape of object parts correctly when the illumination has an extreme color, and the object and the background in the image become assimilated. This research tries to improve the accuracy of the color constancy network by using a segmentation technique. We generate a mask image from the input image with a segmentation network, where the object part is white and the background is black. Then, we input the mask image to the network in the same way as the input image and add the information of the mask image to the network's processing of the input image. By inputting the mask image, information about the target object region is added to the color constancy network. This makes it possible to clarify the region of the object in the input image and to reproduce the shape and color of the object, which the existing color constancy network cannot reproduce.
CNN-based realization of Depth from Defocus technique
Mizuki Kaneda, Toshiyuki Yoshida
Depth from Defocus (DFD) techniques estimate the distance/depth to each point of a target object by using a set of multi-focus images. Many of the DFD techniques proposed thus far have the common disadvantage that the depth estimation accuracy decreases for an image set captured with a real, non-ideal lens compared with an artificial set generated from an ideal lens model. The accuracy degradation can be attributed to a deviation from the theoretical model of lens blur, which is quite difficult to formulate mathematically. To overcome this problem, we propose a DFD technique based on a convolutional neural network (CNN) whose accuracy is sufficient for 3D modeling applications. In this paper, the proposed CNN is trained with computer-generated artificial datasets to investigate the potential of the CNN-based DFD approach. The experimental results indicate that the proposed CNN achieves an estimation accuracy on simulation datasets comparable to that of a state-of-the-art DFD technique.
Robust lane detection through automatic trajectory analysis with deep learning and big data environment
Li-Wen Wang, Du Li, Wan-Chi Siu, et al.
This paper focuses on multi-lane detection from traffic cameras, based on automatic trajectory analysis and promoted by advanced deep-learning technologies. Our proposed approach is based on big trajectory data and is robust to complex road scenes, which makes it particularly reliable and practical for Intelligent Transportation Systems. Using deep learning object detection, it first generates big trajectory data on the road. Then, it detects the stop lines on the road and counts the number of lanes from the trajectories. Next, the trajectories are divided into groups, where each group contains the trajectories of one lane. Finally, the lanes are fitted from the grouped trajectories. A large number of experiments have been done, and the results show that the proposed approach can effectively detect the lanes on the road.
Multi-level unsupervised domain adaption for privacy-protected in-bed pose estimation
Ziheng Chi, Shaozhi Wang, Xinyue Li, et al.
In-bed pose estimation is of great value in current health-monitoring systems. In this paper, we solve a cross-domain pose estimation problem, in which a fully annotated uncovered training set is used for pose estimation learning, and a large-scale unlabelled dataset of covered images is employed for unsupervised domain adaptation. To tackle this challenging problem, we propose a multi-level domain adaptation framework, which learns a generalizable pose estimation network based on three levels of adaptation. We evaluate the proposed framework on a public in-bed pose estimation benchmark. The results demonstrate that our framework can effectively generalize the learned knowledge from the uncovered source domain to the covered target domain for privacy-protected in-bed pose estimation.
Deep attentive pixels for face super-resolution
Face super-resolution using a fusion network approach has successfully addressed the problem of face image restoration. Recently, face attributes have been used effectively to guide low-level facial feature points for viable face recovery: the low-resolution image is first enlarged into a super-resolution face image, and landmarks are estimated to guide the network to enhance the super-resolution image repeatedly. However, existing face super-resolution network architectures have redundant parameters, and their learning efficiency in mapping inputs to target outputs is low. This paper proposes deep attentive pixels for face super-resolution, which applies an attention mechanism to optimize feature extraction and fuses channel attention with facial landmark heatmaps. Experimental results demonstrate that the proposed method achieves higher performance than other state-of-the-art face super-resolution methods.
Point Cloud Processing and Applications
Efficient integration of partial point clouds with few geometric features using improved colored ICP algorithm
Arisa Poonsri, Shogo Tokai
Colored ICP is often used to integrate point clouds of objects with few geometric features. However, it suffers from slow convergence and local minima, and its performance depends on the initial positions of the point clouds. This paper proposes a method that deals with these problems by introducing texture smoothing and control of the ICP parameters, improving the conventional colored ICP. Our method first focuses only on geometric information with a large search range and gradually shifts to color information with a reduced search range to determine the final alignment. We experiment on a simulated partial globe and RGB-D object models as models with few geometric features. We compare the performance of different parameter controls and verify the effectiveness by measuring errors between the transformed results and the ground truth. The experiments show that our method improves the convergence for models with insufficient geometric features through this parameter adjustment technique.
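The "large search range first, then shrink" idea resembles the standard coarse-to-fine colored-ICP schedule, which can be sketched with Open3D; the voxel sizes and iteration counts below are assumptions, and the paper's texture-smoothing step is not reproduced here.

```python
# Hedged sketch: coarse-to-fine colored ICP with a shrinking search range.
import numpy as np
import open3d as o3d

def coarse_to_fine_colored_icp(source, target,
                               voxel_sizes=(0.04, 0.02, 0.01),
                               max_iters=(50, 30, 14)):
    T = np.identity(4)
    for voxel, iters in zip(voxel_sizes, max_iters):
        src = source.voxel_down_sample(voxel)
        tgt = target.voxel_down_sample(voxel)
        for pcd in (src, tgt):   # colored ICP needs normals
            pcd.estimate_normals(
                o3d.geometry.KDTreeSearchParamHybrid(radius=voxel * 2, max_nn=30))
        result = o3d.pipelines.registration.registration_colored_icp(
            src, tgt, voxel, T,  # voxel doubles as the correspondence distance
            o3d.pipelines.registration.TransformationEstimationForColoredICP(),
            o3d.pipelines.registration.ICPConvergenceCriteria(max_iteration=iters))
        T = result.transformation
    return T
```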
A study on algorithm to extract stone tool surfaces from measured point clouds of joining materials based on images
Stone tools were created and used in daily life during the Paleolithic and Jomon periods. Excavated stone tools and flakes are joined together to recreate the mother rock, and the result is referred to as a joining material. By analyzing joining materials and reproducing the stone tool manufacturing process, various information can be obtained, such as the stone tool maker's manufacturing intentions, behaviors, technical abilities, and living range. Conventionally, joining materials were recorded with photographs and scale drawings. In recent years, as a more accurate and stable method, recording based on three-dimensional point clouds using 3D scanners has also been performed. Unfortunately, such measurements can capture only the outer flake surfaces of the joining material, which means the stone tools inside the joining material can hardly be recognized. To obtain the assembly order of flakes and the spatial posture of the stone tools from the measured outer point clouds, it is necessary to identify each stone tool by recognizing the outer flake surfaces through segmentation of the surface point clouds. In this paper, we propose a method to segment each stone tool from surface point clouds obtained by 3D measurement of joining materials.
A study of colored point cloud completion for a human head
A depth image from a single RGB-D camera has many occlusions and much noise, so it is not easy to obtain 3D data of the whole human head. Point cloud deep learning, which allows direct input and output of point clouds, has recently attracted much attention. One such task, point cloud completion, which creates a complete point cloud from a partial point cloud, has been studied. However, existing studies of point cloud completion evaluated only the shape and have not focused on colored point clouds. Therefore, this study proposes a colored point cloud completion method for the human head based on machine learning. For training, a CG dataset was created from face and hair datasets. The proposed network inputs and outputs point clouds with XYZ coordinates and L*a*b* color information, and optionally has a discriminator that processes L*a*b*-D images rendered by a differentiable point renderer. This study conducted experiments using the network and the dataset and evaluated them using point-domain and image-domain metrics.
Study on facial part extraction for face similarity evaluation of Japanese terracotta figurines (Haniwa) from 3D point clouds
Haniwa were made for rituals during the Kofun period and were buried with the dead as funerary objects. By analyzing and classifying haniwa, archaeologists are trying to reveal information about their origins and evaluate their artistic value. Specifically, they observe haniwa carefully and classify them based on their characteristics and archaeological knowledge. Since observation is a subjective evaluation, an objective evaluation method is necessary to ensure authenticity. For objective evaluation, analysis based on digital data is effective; for example, 3D point clouds can be easily obtained by photographic measurement. In [1], a 3D mesh is generated from a measured point cloud, and the haniwa face is analyzed based on the mesh. However, generating a mesh from a point cloud is time-consuming. In this paper, to evaluate the similarity of haniwa faces, we investigate a method to extract the parts of haniwa faces, such as the eyes, mouth, and nose, directly from 3D point clouds.
A study on protruding pattern recognition of Jomon potteries from 3D point clouds
Ao Kikuchi, Shurentsetseg Erdenebayar, Tsutomu Kinoshita, et al.
We examine a method for automatically detecting protruding patterns on the surface of Jomon potteries using the watershed method based on the curvature of three-dimensionally measured point clouds.
Geometry reconstruction for spatial scalability in point cloud compression based on neighbour occupancies
Spatial scalability is an important functionality for point cloud compression. The current design of geometry reconstruction for spatial scalability places points at the centers of nodes, ignoring correlations among neighbour nodes. In this work, a geometry reconstruction method based on neighbour occupancies is proposed, in which the distribution of real points in the current node is predicted using neighbour occupancy information. In comparison to the state-of-the-art geometry-based point cloud compression, i.e., G-PCC, performance improvements of 1.15 dB in D1-PSNR and 3.80 dB in D2-PSNR on average are observed with the proposed method.
Image Understanding and Recognition
Food image generation and appetite evaluation based on appetite factor analysis by image features
In recent years, due to the COVID-19 pandemic and the widespread use of technology, the Internet and restaurant websites are often used for take-out orders and restaurant reservations, and information such as reviews and photos on these platforms has a significant impact on revenue. In this study, to develop an appetite-enhancing application, we focus on food images as a factor that strongly influences appetite and analyze which image features stimulate appetite. Then, based on the results of the analysis, we generate appetizing images using a GAN (Generative Adversarial Network).
Three-stage navigation to hand size object for visually impaired
We propose a three-stage navigation method to a hand-sized target object using sound guidance for the visually impaired within walking distance. The advantage of our proposed method is that it lets a visually impaired person reach a target object that he/she should touch with only a camera-equipped wearable device. It can be applied to any indoor situation because our system needs only a vision-based pre-registration process in which a single video trajectory is set in advance. The navigation is decomposed into three stages: path navigation, body navigation, and hand navigation. For the walking stage, we utilize the Clew app, which is sufficient for this purpose. For the following two stages, we introduce an AR anchor, which should be registered on the target object in advance. Our sound guidance is designed to let the subject reach the target with hand-sized resolution, and stage changes are signaled by vibration. We conducted a preliminary evaluation with our smartphone-based system and confirmed that the proposed method can navigate users to a hand-sized target starting from a position 5 meters away.
Active learning for human pose estimation based on temporal pose continuity
Taro Mori, Daisuke Deguchi, Yasutomo Kawanishi, et al.
In recent years, human pose estimation based on deep learning has been actively studied for various applications. A large amount of training data is required to achieve good performance, but annotating human poses is quite an expensive task. Therefore, there is a growing need to improve the efficiency of training data preparation. In this paper, we take an active learning approach to reduce the cost of preparing training data for human pose estimation. We propose an active learning method that automatically selects images effective for improving the performance of a human pose estimation model from unlabeled image sequences, exploiting the fact that the human pose changes continuously between adjacent frames of an image sequence. Specifically, by comparing the estimated human poses between frames, we select incorrectly estimated images as candidates for manual annotation. Then, the human pose estimation model is re-trained by adding a small portion of manually annotated data to the training data. Through experiments, we confirm that the proposed method can effectively select training data candidates from unlabeled image sequences and can improve the performance of the model while reducing the cost of manual annotation.
Depth image restoration algorithm using graph signal processing based image colorization
Tsukasa Kubota, Kairi Ito, Kazunori Uruma
This paper deals with a depth image restoration algorithm. Depth images, used to capture the stereoscopic information of an object, are generally obtained from a depth sensor with low resolution compared with an RGB sensor. Therefore, super-resolution of the depth image is required. This paper proposes a depth image restoration algorithm that uses a color guide image based on graph signal processing. A graph is generated from the color image, and the depth values are treated as a graph signal. The depth values are then restored using graph signal restoration techniques. Numerical experiments show that the proposed algorithm has higher restoration performance than several previous methods.
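A colorization-style instance of this idea can be sketched compactly (the paper's exact graph construction and solver are not specified in the abstract, so the neighborhood, weights, and solver below are assumptions): edge weights come from color similarity in the guide image, and unknown depth values are solved so that each pixel is the weighted average of its neighbors.

```python
# Hedged sketch: guided depth restoration via a graph Laplacian linear system.
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import spsolve

def restore_depth(color, sparse_depth, known_mask, sigma=10.0):
    """color: (h,w,3) float guide image; sparse_depth/known_mask: (h,w)."""
    h, w = sparse_depth.shape
    n = h * w
    A = lil_matrix((n, n))
    b = np.zeros(n)
    idx = lambda y, x: y * w + x
    for y in range(h):
        for x in range(w):
            i = idx(y, x)
            if known_mask[y, x]:            # pin known depth samples
                A[i, i] = 1.0
                b[i] = sparse_depth[y, x]
                continue
            nbrs = [(y + dy, x + dx)
                    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1))
                    if 0 <= y + dy < h and 0 <= x + dx < w]
            ws = [np.exp(-np.sum((color[y, x] - color[ny, nx]) ** 2) / sigma ** 2)
                  for ny, nx in nbrs]       # color-similarity edge weights
            A[i, i] = sum(ws)
            for (ny, nx), wgt in zip(nbrs, ws):
                A[i, idx(ny, nx)] = -wgt    # d_i = weighted average of neighbors
    return spsolve(A.tocsr(), b).reshape(h, w)
```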
Machine Learning and Applications
Empty mug detection in pubs and restaurants using a ceiling camera
Toyokazu Shimekake, Katsuto Nakajima
We propose a system that can detect empty mugs to monitor customers in pubs and restaurants. The evaluation results revealed that our proposed system has significantly high precision and good recall, even for very small and tilted mugs in images captured using wide-angle ceiling cameras, and also has a practical detecting speed.
Raden code: a study of matching algorithm for Raden based on phase information
Y. Yamazaki, T. Nakata
In this paper, we propose an authentication algorithm to identify Raden. The principle underlying the authentication algorithm is matching on Raden encoded by the 2D Fourier transform. We obtained reproducible features in various environments and confirmed the robustness of Raden encoded with phase information. Experimental results show that it is possible to identify Raden with the proposed algorithm.
Protection of SVM model with secret key from unauthorized access
In this paper, we propose a block-wise image transformation method with a secret key for support vector machine (SVM) models. Models trained by using transformed images offer a poor performance to unauthorized users without a key, while they can offer a high performance to authorized users with a key. The proposed method is demonstrated to be robust enough against unauthorized access even under the use of kernel functions in a facial recognition experiment.
Validation of random forest algorithm to monitor land cover classification and change detection using remote sensing data in Google Earth Engine
Monitoring land cover classification and change detection from remote sensing images using machine learning algorithms has become increasingly important. For our case study, we select Vientiane Capital as the study area. Our proposed method performs land cover classification using supervised random forest classification in Google Earth Engine (GEE), and post-classification comparison (PCC) change detection using ArcGIS software, between 1990 and 2020, evaluated at five-year intervals. In this paper, we utilize GEE with multiple sources of satellite optical image time series from three satellites, Landsat 5, Landsat 8, and Sentinel-2, integrating multiple spectral, spatial, temporal, and textural features. Spectral indices such as NDVI and NDBI are calculated to enhance accuracy. Our results show that highly accurate land cover classification is obtained for all six classes, with an overall accuracy of 97.73% for training data and 90.35% for testing data, and kappa statistics of 0.97 for training data and 0.87 for testing data in 2020.
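The spectral-index features and the random forest step can be sketched outside GEE; the band arrays, labels, and split below are placeholders, while the real pipeline runs on satellite time series in GEE.

```python
# Hedged sketch: NDVI/NDBI features feeding a random forest classifier.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def ndvi(nir, red, eps=1e-6):
    return (nir - red) / (nir + red + eps)   # vegetation index

def ndbi(swir, nir, eps=1e-6):
    return (swir - nir) / (swir + nir + eps) # built-up index

# Placeholder reflectance bands (rows = pixels) and land-cover labels.
rng = np.random.default_rng(0)
red, nir, swir = (rng.uniform(0, 1, 1000) for _ in range(3))
features = np.column_stack([red, nir, swir, ndvi(nir, red), ndbi(swir, nir)])
labels = rng.integers(0, 6, 1000)            # six land-cover classes

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(features[:700], labels[:700])
print("test accuracy:", clf.score(features[700:], labels[700:]))
```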
A study of lightweighting method using reinforcement learning
Yoshihiro Harada, Noriko Yata, Yoshitsugu Manabe
Deep neural networks (DNNs) are capable of achieving high performance in various tasks. However, their huge numbers of parameters and floating point operations make them difficult to deploy on edge devices. Therefore, in recent years, much research has been done on lightweighting deep convolutional neural networks. Conventional research prunes based on a set of criteria, but it is unknown whether those criteria are optimal. To solve this problem, this paper proposes a method that selects parameters for pruning automatically. Specifically, all parameter information is used as input, and reinforcement learning selects and prunes parameters that do not affect the accuracy. Our method prunes one filter or node per action and compresses the network by repeating the action. The proposed method was able to compress a CNN highly with minimal degradation in accuracy, reducing about 97.0% of the parameters with 2.53% accuracy degradation on the CIFAR-10 image classification task with VGG16.
LCCCRN: robust deep learning-based speech enhancement
Chun-Yin Yeung, Steve W. Y. Mung, Yat Sze Choy, et al.
Deep learning-based speech enhancement methods make use of their nonlinearity to estimate the speech and noise signals, especially nonstationary noise. DCCRN, in particular, achieves state-of-the-art performance on speech intelligibility. However, the nonlinearity also raises concerns about the robustness of the method: novel and unexpected noises can be generated if the noisy input speech is beyond the operating conditions of the method. In this paper, we propose a hybrid framework called LDCCRN, which integrates a traditional speech enhancement method, LogMMSE-EM, with DCCRN. The proposed framework leverages the strengths of both approaches to improve robustness in speech enhancement. While DCCRN removes the nonstationary noise in the speech, the novel noises generated by DCCRN, if any, are effectively suppressed by LogMMSE-EM. As shown in our experimental results, the proposed method achieves better performance than traditional approaches as measured by standard evaluation methods.
Video Processing and Applications
Performance analysis of generated predictive frames using PredNet bi-directionally
Kanato Sakama, Shunichi Sekiguchi, Wataru Kameyama
There have been many studies on using DNNs, without motion vectors, to generate motion-compensated predictive frames, one of the video coding processes. Conventional DNN-based methods use only the source frames for prediction in the forward direction. However, in standardized video coding schemes, bi-directional prediction, e.g., B-pictures, has been confirmed to improve coding efficiency. Thus, to generate motion-compensated predictive frames for video coding, we propose to apply PredNet, a DNN-based future frame generation model inspired by the prediction process of visual input stimuli in the brain, bi-directionally. In this paper, the quality of the predictive frames generated by the proposed method is evaluated using MSE and SSIM and compared with the prediction accuracy of applying PredNet only in the forward direction. In addition, we investigate whether the prediction accuracy can be improved by increasing the amount of training frames from videos chosen from YouTube-8M. The results show the effectiveness of the proposed method in terms of lower prediction error compared with forward-only PredNet, as well as performance gains from more training data.
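The evaluation protocol can be sketched concretely. The bi-directional prediction below is formed by simply averaging forward and backward outputs, an assumption for illustration since the abstract does not specify the fusion rule; the frames are random placeholders.

```python
# Hedged sketch: score a bi-directional blend against the true frame with MSE/SSIM.
import numpy as np
from skimage.metrics import mean_squared_error, structural_similarity

def blend_bidirectional(pred_forward, pred_backward):
    return 0.5 * (pred_forward + pred_backward)   # assumed fusion rule

true_frame = np.random.rand(128, 128)
pred_fwd = np.clip(true_frame + 0.05 * np.random.randn(128, 128), 0, 1)
pred_bwd = np.clip(true_frame + 0.05 * np.random.randn(128, 128), 0, 1)
pred_bi = blend_bidirectional(pred_fwd, pred_bwd)

for name, p in [("forward", pred_fwd), ("bi-directional", pred_bi)]:
    print(name,
          mean_squared_error(true_frame, p),
          structural_similarity(true_frame, p, data_range=1.0))
```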
A computer simulation method for the motion sharpening phenomenon in human vision system
Human vision is capable of motion sharpening, in which blurred edges look sharper while moving than when stationary. This phenomenon is an optical illusion and an important function of human vision. In this study, we propose a transformation and an inverse transformation method for simulating the motion sharpening phenomenon. First, we developed a digital filter based on the impulse response of the human visual system. Then, we generated images using the created filter and performed a comparison experiment between the unfiltered and filtered images. As this method provides images simulating the appearance of moving objects, it makes it possible to design the appearance of objects when they are moving and not just when they are stationary.
Robust real-time video face recognition system for unconstrained environments
Amir Rajak, Matthew N. Dailey, Mongkol Ekpanyapong
In everyday life, various biometric applications such as iris recognition and fingerprint recognition are used to identify individuals. The face is an important identifier for humans that can be used for surveillance, security, access control, automated attendance, and so on. In this paper, a fully functional end-to-end facial recognition system is presented that can be applied in the real world. A simple face frontalness measure is proposed to filter out non-frontal faces that may result in false recognitions. To further reduce false positives, prediction results are aggregated across multiple frames to form a single decision. Our model achieved a False Acceptance Rate (FAR) of 4.13% and a False Rejection Rate (FRR) of 9.3% at a confidence threshold of 0.8 on the test dataset. Finally, the recognition result is displayed by a Web application. The system also records employees' daily punch-in/punch-out times and presents their monthly timesheet reports in the Web application. Our method achieves considerably lower false positive rates and runs at 18-20 FPS on our testbed machine for a single camera.
Color laparoscopic high-definition video quality assessment for super-resolution
Norifumi Kawabata, Toshiya Nakaguchi
In medical image analysis support, assessing high-definition image quality on high-definition displays is an important factor. When recognizing and detecting body regions and lesions in medical images, it is not sufficient to judge from the information of a still image or a few frames alone. To understand image content and characteristics more accurately, we need to consider image quality based on information obtained from video. In this paper, varying the bit-rate and focusing on image quality for color laparoscopic high-definition video including super-resolution, we carried out objective assessment experiments using PSNR and SSIM across video frames and bit-rates. Finally, we discuss the display of medical images.
IVTEN: integration of visual-textual entities for temporal activity localization via language in video
This study addresses the problem of temporal activity localization via natural language (TALL) in untrimmed video. It is a difficult task since localization of the target temporal activity may be misled by a disordered query. Existing approaches used sliding windows, regression, and ranking to handle the query without grammar-based rules. When a query is out of sequence and cannot be correlated with the relevant activity, these approaches suffer performance deterioration. We introduce visual, action, object, and connecting-word concepts to address the issue of non-sequential queries. Our proposed architecture, the integration of visual-textual entities network (IVTEN), consists of three submodules: (1) a visual graph convolutional network (visual-GCN), (2) a textual graph convolutional network (textual-GCN), and (3) a compatible method for learning embeddings (CME). Visual nodes detect the activity, object, and actor, while textual nodes maintain word sequence using grammar-based rules. CME integrates the modalities (activity, query) and trained grammar-based words into the same embedding space. We also include a stochastic latent variable in CME to align and retain the query sequence with the relevant activity. Our IVTEN approach outperforms the state of the art on three typical benchmark datasets: Charades-STA, TACoS, and ActivityNet-Captions.
Immersive Applications
Immersive 3D flow visualization based on enhanced texture convolution and volume rendering
Jin Guo, Cui Xie
In this paper, we propose a texture-based 3D flow visualization method for an immersive environment. The algorithm uses line integral convolution to display the directional information of the flow field and enhances the contrast between streamlines by injecting a certain percentage of random noise into the convolution-generated texture. We introduce a high-pass filtering process to further improve the quality of the enhanced texture. In addition, this paper designs a volume rendering transfer function for the immersive environment, which can effectively extract the significant features of the flow field and highlight the feature areas of interest to users.
An immersive self-training system of receive motion for volleyball beginners
In this article, we propose a VR-based immersive self-training system for volleyball beginners practicing receives, in which a virtual training environment is presented to a practitioner through a head-mounted display (HMD). Without wearing motion capture markers or haptic devices, the practitioner has free hands to receive a virtual ball, so the proposed system is expected to let him/her perform receiving motions the same as in the real world and thereby enhance skill acquisition effectively.
An atlas generation method with patch trimming for efficient immersive video coding
Sung-Gyun Lim, Jae-Gon Kim
MPEG is developing an immersive video coding standard called MPEG immersive video (MIV) to provide users with an immersive visual experience with six degrees of freedom (6DoF) of view position and orientation within a limited viewing space. In the MIV reference software, called the test model for immersive video (TMIV), atlases are generated to reduce the pixel rate of the input source views to be coded by removing inter-view redundancy. There are many discontinuities in the generated atlases, in which patches containing both valid and invalid regions are packed, and this decreases coding efficiency. To address this problem, this paper presents an atlas generation method that trims the packed patches of the atlas through hole filling and tiny-block removal. The proposed method shows a BD-rate bit saving of 1.4% on average compared to the original TMIV.
Mixed reality visualization of room impulse responses on two planes by moving microphone
Yasuaki Watanabe, Yusuke Ikeda, Yasuhiro Oikawa
Mixed reality (MR) can be used to visualize three-dimensional (3D) sound fields in real space. In our previous study, we proposed a sound intensity visualization system using MR, which visualizes the flow of sound energy in a stationary sound field by measuring sound intensity. However, room impulse responses (RIRs) are essential data when investigating the sound field of a room; to understand the time variation of the sound field, it is crucial to visualize the spatial distribution of RIRs. However, measuring multipoint RIRs requires considerable time and effort and a large microphone array. In this paper, we propose an MR visualization system for RIR mapping on two planes based on dynamic RIR measurement using a moving microphone. The proposed system simplifies the measurement of RIRs at multiple points owing to the dynamic measurement capability of a hand-held microphone. In a simulation experiment, the RIRs at grid points were estimated from the microphone signal and the moving path of the microphone. The estimated results were visualized as animated RIR maps in real space using MR. The experimental results show that the MR animation of RIR maps on two orthogonal planes can help clarify 3D sound propagation.
VR and 3D Applications
Center of gravity correction method for self-support of output model in 3D printer
Jumpei Nakatsuka, Yuta Muraki, Kenichi Kobori
The widespread use of 3D printers has made it possible to create 3D objects easily at home. However, if a 3D model is created without considering the position of the center of gravity, the printed 3D object may not be able to stand on its own. In such cases, it is necessary to adjust the 3D model in 3D modeling software so that it can stand, which requires considerable knowledge and experience. The purpose of this research is to calculate the density of the printed object and adjust the center of gravity of a 3D model that cannot stand on its own. This eliminates the need to re-edit the 3D model. In addition, the impression of the 3D model does not change because no supporting parts are needed to help the model stand.
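The analysis step relies on computing the center of gravity of a closed mesh. A minimal sketch using signed tetrahedra against the origin (a standard divergence-theorem construction; the paper's density handling and correction step are not reproduced): if the center's projection onto the ground plane falls outside the contact polygon, the model cannot stand.

```python
# Hedged sketch: center of gravity of a closed, uniform-density triangle mesh.
import numpy as np

def mesh_center_of_gravity(vertices, faces):
    """vertices: (n,3) array; faces: (m,3) index array with outward orientation."""
    v0, v1, v2 = (vertices[faces[:, i]] for i in range(3))
    signed_vol = np.einsum("ij,ij->i", v0, np.cross(v1, v2)) / 6.0
    centroids = (v0 + v1 + v2) / 4.0        # tetra centroid including the origin
    total = signed_vol.sum()
    return (signed_vol[:, None] * centroids).sum(axis=0) / total

# Unit cube sanity check: the center of gravity should be (0.5, 0.5, 0.5).
verts = np.array([[x, y, z] for z in (0, 1) for y in (0, 1) for x in (0, 1)], float)
faces = np.array([[0,2,1],[1,2,3],[4,5,6],[5,7,6],[0,1,4],[1,5,4],
                  [2,6,3],[3,6,7],[0,4,2],[2,4,6],[1,3,5],[3,7,5]])
print(mesh_center_of_gravity(verts, faces))
```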
A method of reflection representation in mixed reality using light source estimation
In recent years, research has been conducted on reflecting real space onto virtual objects. However, conventional methods have not been able to achieve accurate reflections due to problems such as mismatches in the direction and angle of the reflected objects. In addition, because the light source is not estimated, the color of the virtual object appears detached from its surroundings and shadows are not generated. In this study, we propose a method to integrate virtual objects and real space without visual discomfort by using light source estimation.
On the colored holographic moving pictures employing the blue-violet color light source
To perform high-quality spatial projection in the reconstruction process using a hologram containing the information of red, green, and blue laser light, it is strongly required to improve the quality of the images reconstructed from the holograms generated by each of the three color lasers and to establish a stable, smooth process for presenting holographic moving pictures. Furthermore, to extend the range of reproducible colors, it appears effective to employ, for the blue channel, a blue-violet laser of shorter wavelength, corresponding to the minimum wavelength of visible light. However, this in turn introduces some difficulty in image reproduction using diffraction. Since the reproducibility of moving pictures generally depends on the refresh rate of the display elements, it is important to study the relation between the number of frames required to form the reproduced picture and the number of hologram elements required for one still picture. In this paper, we report results concerning the number of object points that can be reproduced as point objects using blue-violet laser light without applying a time-shared multiplex reproduction technique.
Region-aware point cloud sampling for hand gesture recognition
In this paper, we propose a region-aware point cloud sampling method that adaptively divides regions containing important information for hand gesture analysis. Experimental results prove that the proposed method contributes to improving hand gesture recognition performance.
Real-time volume rendering running on an AR device in medical applications
Kota Hashimoto, Toru Higaki, Raytchev Bisser Roumenov, et al.
Recently, the use of augmented reality in clinical environments has been increasing. Volume rendering is very useful in medical applications for visualizing volume data such as CT and MRI data. In this paper, we propose a real-time volume rendering method that runs on AR glasses with limited computational power. In the proposed method, we introduce a new volume data structure based on the concept of level of detail (LOD).
Image Processing and Applications
Fingerprint minutiae detection using improved Tamura’s thinning
Takeru Yamanaka, Akira Kitsuda, Tomohiko Ohtsuka
This study proposes a new fingerprint minutiae detection method with a short processing time and high accuracy. The method removes spurs while preserving the connectivity of the original image following Tamura thinning. In the proposed method, fingerprint minutiae can be correctly detected by inspecting the pixels in the 8-neighborhood of the pixel of interest, even if the line width is not equal to one. Experimental results demonstrated that the detection rates of the proposed and conventional methods are approximately the same, while the processing time of the proposed method is 1/30th that of the conventional method.
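Minutiae detection on a thinned ridge image is classically done with the crossing-number test over the 8-neighborhood; the following is a hedged sketch of that classic test (the paper's spur removal and wide-line handling are not reproduced here):

```python
# Hedged sketch: crossing-number minutiae detection on a thinned binary image.
import numpy as np

# 8-neighborhood offsets in circular order around the pixel of interest.
NBR = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def minutiae(skel):
    """Return (y, x) lists of ridge endings (CN==1) and bifurcations (CN==3)."""
    endings, bifurcations = [], []
    h, w = skel.shape
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            if not skel[y, x]:
                continue
            p = [int(skel[y + dy, x + dx]) for dy, dx in NBR]
            cn = sum(abs(p[i] - p[(i + 1) % 8]) for i in range(8)) // 2
            if cn == 1:
                endings.append((y, x))
            elif cn == 3:
                bifurcations.append((y, x))
    return endings, bifurcations
```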
A high-accuracy circle detection method using a multi-angle rotating equilateral triangle
Opard Kokaphan, Wannida Sae-Tang
Nowadays, visual inspection is widely used in manufacturing, for example in the electronics industry, and the most common shape detection task is circle detection. Although many methods have been proposed for circle detection, there is room to improve the detection accuracy. This paper therefore proposes a high-accuracy circle detection method using a multi-angle rotating equilateral triangle. The main idea is to fit an equilateral triangle into the circle at various rotation angles to select the most accurate circle center. Triangles at various rotation angles are fitted into the circle by iteratively adjusting the ratios of the triangle base to the other sides. Triangles that cannot be adjusted to be equilateral are discarded, because such isosceles triangles, which occur when a detected triangle corner is not on the circumference but on noise inside the circle, cause inaccurately detected circle centers. The experimental results show that the proposed method is superior to conventional methods in terms of accuracy and processing time.
Swin transformer and fusion for underwater image enhancement
Jinghao Sun, Junyu Dong, Qingxuan Lv
In underwater scenes, refraction, absorption, and scattering of light by suspended particles in the water degrade the visibility of images, causing low contrast, blurred details, and color distortion. Based on the characteristics of underwater image degradation, we propose a fusion neural network that builds on the blending of two images derived from a white-balanced and color-compensated version of the raw underwater image; the two images are fused through an image enhancement module. In previous works, convolutional neural networks (CNNs) have been widely used in underwater image enhancement tasks. However, the local computational characteristics of convolutional operations limit the enhancement effect. Recently, transformers have shown impressive performance on low-level vision tasks. In this paper, we propose SwinMT, an image enhancement module based on the Swin transformer. First, we generate two inputs by respectively applying white balance (WB) and gamma correction (GC) algorithms to an underwater image. Second, the SwinMT module extracts features with two parts, a low-frequency feature extraction module and a high-frequency feature extraction module, to restore high-quality images. We conduct experiments on rendered synthetic underwater images, which show that our method produces visually pleasing results, and we compare the results with state-of-the-art techniques.
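The two network inputs described above are standard operations and easy to sketch; the gray-world white balance and the gamma value below are assumptions, since the paper does not state its exact WB/GC parameters:

```python
# Hedged sketch: the two fusion-network inputs (white balance + gamma correction).
import numpy as np

def gray_world_wb(img):
    """img: float RGB in [0, 1]; scale channels toward a common gray mean."""
    mean = img.reshape(-1, 3).mean(axis=0)
    return np.clip(img * (mean.mean() / (mean + 1e-6)), 0.0, 1.0)

def gamma_correct(img, gamma=0.7):
    """gamma < 1 brightens the dark, low-contrast underwater regions."""
    return np.power(np.clip(img, 0.0, 1.0), gamma)

raw = np.random.rand(64, 64, 3)                     # placeholder underwater image
inputs = (gray_world_wb(raw), gamma_correct(raw))   # fed to the fusion network
```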
A method of composition evaluation for photographs
Recently, with the development of smartphones and other small cameras, there are more and more opportunities to take photos even for non-professional photographers. In general, photos with good composition are more impressive. Although there have been many studies on composition in photos, none have focused on the accuracy of determining and evaluating composition. In this study, we propose a method to determine and evaluate the composition of photos. In the proposed method, composition is determined and evaluated in terms of visual and structural features. We believe that our method will assist in the selection of photos.
A method of automatic photo selection for photo albums
Keitaro Kawamori, Yuta Muraki, Kenichi Kobori
Recently, the number of photos people take has increased, especially in unusual situations such as travel and ceremonies. Photo groups taken in unusual situations contain similar images and low-quality images, such as under/overexposed and blurred images. For this reason, it is time-consuming to summarize photos when creating photo albums. Therefore, we propose a method of automatic photo selection for photo albums. The proposed method automatically avoids selecting low-quality images as candidate images. Additionally, similar images are grouped and narrowed down to the single highest-quality image, which is selected as a candidate. The proposed method calculates a score for each candidate image and selects the final images according to the score and the scene.
Simultaneous scanning of both feet shapes using multiple depth sensors
Kazuma Kassai, Yuta Muraki, Kenichi Kobori
To choose the best shoes, it is important to know the size of your feet. Furthermore, to make shoes that fit one's feet, it is necessary to take foot measurements using a 3D measuring device. However, current foot measurement devices have four kinds of problems: long measuring time, measuring one foot at a time, lack of measurement points, and high cost. Therefore, we propose a method for automatic 3D foot shape acquisition that solves these problems. The proposed method uses four depth sensors and AR markers to obtain the shapes of both feet simultaneously. In addition, it achieves low cost by reducing the number of sensors used.
Automatic generation of impossible shapes from line-drawing characters
Mitsuhiro Mori, Yuta Muraki, Kenichi Kobori
In recent years, signs and advertisements that use optical illusions have appeared. Using optical illusions makes it easy not only to attract attention but also to leave an impression. In addition, characters are always used in advertisements, and designing characters to be impressive is called lettering. Accordingly, we expect that adding an illusion to the lettering process can generate more impressive characters. However, it takes a lot of experience and time to take the illusion into account when lettering. In this paper, we add the illusion of impossible shapes to the lettering process to automatically generate more memorable characters. An impossible shape is a figure that can be recognized visually as a projection of a three-dimensional object but cannot exist in reality. In this method, the strokes and contours are extracted from line-drawing characters, and a projected image is created from them. Finally, an impossible shape is generated by applying an optical illusion to the projected image.
Kernel estimation for super-resolution with flow-based kernel prior
Single-image super-resolution methods typically assume that a low-resolution image is degraded from a high-resolution one through convolution with a "bicubic" kernel followed by downscaling. However, this induces a domain gap between training datasets and real-world test images, which are down-sampled from images convolved with arbitrary unknown kernels. Hence, correct kernel estimation for a given real-world image is necessary for better super-resolution. One kernel estimation method, KernelGAN, places the input image in the same domain as the high-resolution image for accurate estimation. However, using only a low-resolution image cannot fully exploit the high-frequency information in the original image. To increase estimation accuracy, we adopt a super-resolved image for kernel estimation. We also use a flow-based kernel prior to obtain a reasonable super-resolved image and to stabilize the whole estimation process.
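The degradation model underlying this line of work can be written compactly: the LR image is the HR image convolved with an unknown kernel k, then subsampled. A minimal sketch follows, with a Gaussian standing in for the unknown kernel; kernel size, sigma, and scale are illustrative assumptions.

```python
# Sketch of the assumed degradation model: LR = downsample(HR * k, scale).
import numpy as np
from scipy.signal import convolve2d

def degrade(hr: np.ndarray, kernel: np.ndarray, scale: int = 2) -> np.ndarray:
    """Blur a single-channel HR image with `kernel`, then subsample by `scale`."""
    blurred = convolve2d(hr, kernel, mode="same", boundary="symm")
    return blurred[::scale, ::scale]

# An isotropic Gaussian stands in for the unknown kernel to be estimated.
ax = np.arange(-7, 8)
xx, yy = np.meshgrid(ax, ax)
kernel = np.exp(-(xx**2 + yy**2) / (2 * 1.5**2))
kernel /= kernel.sum()

lr = degrade(np.random.rand(128, 128), kernel, scale=2)
```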
Template matching using a small number of pixels selected by distinctiveness of quantized hue values
We propose a template matching method that uses a small number of distinctive pixels selected on the basis of hue values as color information. By analyzing the template image, we generate a distribution of co-occurrence probabilities over the hue values of pixel pairs. A small number of pixels with low co-occurrence probability are selected and used for matching. Since such pixels are highly distinctive, reliable matching can be achieved. The recognition success rate of our method is 97% when 571 pixels are selected, and the processing time is 219 ms.
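One plausible reading of the selection step is sketched below: quantize hue, count co-occurrences of neighboring hue pairs, and keep the pixels belonging to the rarest pairs. The bin count, the horizontal pair offset, and the selection budget (set to the abstract's 571) are assumptions.

```python
# Illustrative selection of distinctive pixels by quantized-hue co-occurrence.
import numpy as np

def select_distinctive_pixels(hue: np.ndarray, bins: int = 16,
                              offset: int = 1, budget: int = 571):
    """Pick pixels whose (hue, neighbor-hue) pair is rare in the template."""
    q = (hue.astype(np.int64) * bins) // 256          # quantize hue to bins
    a, b = q[:, :-offset], q[:, offset:]              # horizontal pixel pairs
    pair_ids = (a * bins + b).ravel()
    counts = np.bincount(pair_ids, minlength=bins * bins)
    prob = counts[pair_ids] / pair_ids.size           # co-occurrence probability
    order = np.argsort(prob)[:budget]                 # rarest pairs first
    ys, xs = np.unravel_index(order, a.shape)
    return list(zip(ys.tolist(), xs.tolist()))

template_hue = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
pixels = select_distinctive_pixels(template_hue)      # pixels used for matching
```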
Video Coding for Machine
Deep learning-based feature compression for video coding for machine
We previously trained a compression network by jointly optimizing bit rate and distortion (feature-domain MSE) [1]. In this paper, we propose a feature map compression method for video coding for machines (VCM) based on a deep learning-based compression network that is jointly trained to optimize both the compressed bit rate and machine vision task performance. We use the bmshj2018-hyperprior model in CompressAI [2] as the compression network and compress the feature map output by the stem layer of the Faster R-CNN X101-FPN network in Detectron2 [3]. We evaluated the proposed method with the MPEG VCM evaluation framework; it shows better results than the VVC anchor of MPEG VCM.
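The rate-distortion objective can be sketched with CompressAI's public zoo model, as below. The paper compresses Detectron2 stem features; here a 3-channel tensor stands in, since the zoo model expects 3-channel input, and the lambda weighting is an assumption.

```python
# Hedged sketch of a rate-distortion objective with CompressAI's
# bmshj2018-hyperprior model [2]; a random tensor stands in for the
# (suitably normalized) Detectron2 stem feature map.
import math
import torch
from compressai.zoo import bmshj2018_hyperprior

net = bmshj2018_hyperprior(quality=3, pretrained=True).train()
x = torch.rand(1, 3, 256, 256)          # stand-in for the feature map
out = net(x)                            # {'x_hat': ..., 'likelihoods': {...}}

num_pixels = x.shape[0] * x.shape[2] * x.shape[3]
bpp = sum(torch.log(l).sum() for l in out["likelihoods"].values()) \
      / (-math.log(2) * num_pixels)     # rate term in bits per pixel
distortion = torch.nn.functional.mse_loss(out["x_hat"], x)  # feature-domain MSE
loss = bpp + 0.01 * 255**2 * distortion  # lambda = 0.01 is an assumption
```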
Compression of thermal images for machine vision based on objectness measure
The recent development of intelligent object detection systems requires high-definition images for reliable detection accuracy, which can heavily burden both network bandwidth and archival storage. In this paper, we propose an objectness measure-based compression method for thermal images intended for machine vision. Based on the objectness of each area, the bounding box of an area with high objectness is adjusted so as not to affect potential object detection performance, and the image is compressed so that areas with high objectness receive a lower compression ratio than other areas. Experiments indicate that the proposed scheme achieves superior object detection accuracy at comparable bits per pixel (BPP) compared with a state-of-the-art video compression method.
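The mapping from objectness to compression strength might look like the following sketch, where high-objectness blocks receive a lower quantization parameter (QP). The block granularity, QP range, and linear mapping are assumptions; the paper's box adjustment step is not reproduced here.

```python
# Illustrative per-block QP assignment driven by an objectness map.
import numpy as np

def qp_map(objectness: np.ndarray, qp_low: int = 22, qp_high: int = 42):
    """Objectness 1.0 -> qp_low (least compression); 0.0 -> qp_high."""
    return np.round(qp_high - objectness * (qp_high - qp_low)).astype(int)

blocks = np.random.rand(8, 8)   # stand-in objectness score per image block
print(qp_map(blocks))           # QPs to feed a block/tile-based encoder
```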
Descriptor-based video coding for machine for multi-task
The coding objective for images and videos targeted at machine consumption may differ from that for human consumption. For example, a machine may use only the part of an image or video requested or required by an application, whereas human consumption requires the whole captured area. In addition, a machine may require only grayscale or a certain light spectrum, whereas human consumption requires the full visible spectrum. To identify an object of interest, a neural network-based image or video analysis task may be performed; the output of such a task is an identified feature (latent) and an associated descriptor (inference). Depending on the usage, multiple tasks can be performed in parallel or in series, and as the number of identified features increases, so does the chance that feature areas overlap. We propose a descriptor-based video coding for machines pipeline for multi-task scenarios. The proposed method is expected to increase coding efficiency when multiple tasks are performed by minimizing redundant encoding of overlapping areas of objects of interest, and to increase the utilization and re-utilization of features by transmitting the inference separately.
A method of feature map reordering for machine vision based on channel correlation
Dong-Ha Kim, Yong-Uk Yoon, Jae-Gon Kim
As the need has emerged for video coding technology for machines that perform intelligent analysis such as object detection, segmentation, and tracking on massive video data, MPEG is developing a standard called video coding for machines (VCM). VCM is a standard technology for compressing images/videos or their features for the vision tasks of intelligent machines. In this paper, we propose methods that convert the multichannel features extracted from an analysis network into a reordered feature map sequence for enhanced compression with VVC. The proposed methods exploit the correlation between channel feature maps, using their mean values and the sum of absolute differences (SAD) between feature maps for the reordering. Although the proposed methods do not reach the anchor performance of VCM, they show better coding performance than compressing the features without channel reordering.
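One plausible form of such a reordering is sketched below: start from the channel with the lowest mean, then greedily chain channels by minimum SAD so consecutive "frames" handed to the video codec are similar. The greedy strategy and starting rule are assumptions, not the paper's exact algorithm.

```python
# Sketch of SAD-driven channel reordering before video (VVC) encoding.
import numpy as np

def reorder_channels(fmap: np.ndarray) -> list:
    """fmap: (C, H, W) feature tensor; returns a channel visiting order."""
    order = [int(np.argsort(fmap.mean(axis=(1, 2)))[0])]  # start at lowest mean
    remaining = set(range(fmap.shape[0])) - set(order)
    while remaining:
        last = fmap[order[-1]]
        # Pick the remaining channel with the smallest SAD to the last one.
        nxt = min(remaining, key=lambda c: np.abs(fmap[c] - last).sum())
        order.append(nxt)
        remaining.remove(nxt)
    return order

features = np.random.rand(64, 32, 32)   # stand-in multichannel feature map
sequence = [features[c] for c in reorder_channels(features)]  # frames to encode
```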
Intelligent System Design
Diversity-promoting human motion interpolation via conditional variational auto-encoder
Chunzhi Gu, Shuofeng Zhao, Chao Zhang
In this paper, we present a deep generative model-based method to generate diverse human motion interpolation results. We resort to the conditional variational auto-encoder (CVAE) to learn human motion conditioned on a given pair of start and end motions, leveraging a recurrent neural network (RNN) structure for both the encoder and the decoder. Additionally, we introduce a regularization loss to further promote sample diversity. Once trained, our method is able to generate multiple plausible, coherent motions by repeatedly sampling from the learned latent space. Experiments on a publicly available dataset demonstrate the effectiveness of our method in terms of sample plausibility and diversity.
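A compact PyTorch skeleton of such a conditional VAE is shown below, with GRUs as the RNN structure and the start/end poses as the condition. All dimensions, the teacher-forced decoding, and the KL weight are assumptions for illustration; the paper's diversity regularizer is not reproduced.

```python
# Minimal CVAE sketch: GRU encoder/decoder conditioned on start/end poses.
import torch
import torch.nn as nn

class MotionCVAE(nn.Module):
    def __init__(self, pose_dim=63, hidden=256, latent=32):
        super().__init__()
        self.encoder = nn.GRU(pose_dim, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden + 2 * pose_dim, latent)
        self.to_logvar = nn.Linear(hidden + 2 * pose_dim, latent)
        self.z_to_h = nn.Linear(latent + 2 * pose_dim, hidden)
        self.decoder = nn.GRU(pose_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, pose_dim)

    def forward(self, motion, start, end):
        cond = torch.cat([start, end], dim=-1)           # endpoint condition
        _, h = self.encoder(motion)                      # summarize the clip
        stats_in = torch.cat([h[-1], cond], dim=-1)
        mu, logvar = self.to_mu(stats_in), self.to_logvar(stats_in)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        h0 = self.z_to_h(torch.cat([z, cond], dim=-1)).unsqueeze(0)
        dec, _ = self.decoder(motion, h0)                # teacher-forced decode
        return self.out(dec), mu, logvar

model = MotionCVAE()
motion = torch.rand(4, 30, 63)                           # batch of 30-frame clips
recon, mu, logvar = model(motion, motion[:, 0], motion[:, -1])
kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
loss = nn.functional.mse_loss(recon, motion) + 1e-3 * kl  # 1e-3 is an assumption
```

At inference, sampling different z values while keeping the same start/end condition yields multiple distinct interpolations, which is the diversity property the paper targets.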
Hierarchical visual interface for educational video retrieval and summarization
Jiahao Weng, Chao Zhang, Xi Yang, et al.
With the emergence of large-scale open online courses and online academic conferences, accessing online educational resources has become increasingly feasible and convenient. However, it is time consuming and challenging for common users to effectively retrieve and browse numerous lecture videos. In this work, we propose a hierarchical visual interface for retrieving and summarizing lecture videos. Users can explore the required video information through video summaries generated at different layers. We match the input keywords against a video layer with timestamps, a frame layer with slides, and a poster layer that summarizes the lecture videos. We verified the proposed interface in a user study comparing it with conventional interfaces; the results confirmed that the proposed interface achieves high retrieval accuracy and a good user experience.
Measuring speaking time from privacy-preserving videos
Shun Maeda, Chunzhi Gu, Chao Zhang
The ongoing COVID-19 pandemic is affecting many aspects of daily life, for example by restricting conversation time. A vision-based face analysis system could measure and manage per-person speaking time; however, pointing a camera directly at people would be offensive and intrusive. In addition, privacy-sensitive content, such as the identifiable faces of the speakers, should not be recorded during measurement. In this paper, we adopt a deep multimodal clustering method, DMC, to perform unsupervised audiovisual learning that matches preprocessed audio with the corresponding locations in videos. We place the camera above the speakers, and by feeding pairs of captured audio and visual data to a pre-trained DMC, we generate a series of heatmaps that identify the locations of the speaking people. Each speaker's speaking time is then measured by accumulating the duration for which the corresponding heatmap remains active.
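The final accumulation step might be implemented as in the sketch below, which sums frame durations whenever a speaker's heatmap region is active. The activation threshold, fixed region assignment, and frame rate are all assumptions for illustration.

```python
# Sketch: per-speaker speaking time from a stack of per-frame heatmaps.
import numpy as np

def speaking_seconds(heatmaps: np.ndarray, regions, fps: float = 25.0,
                     thresh: float = 0.5):
    """heatmaps: (T, H, W) in [0, 1]; regions: {name: (y0, y1, x0, x1)}."""
    totals = {name: 0.0 for name in regions}
    for frame in heatmaps:
        for name, (y0, y1, x0, x1) in regions.items():
            if frame[y0:y1, x0:x1].max() > thresh:   # speaker active this frame
                totals[name] += 1.0 / fps
    return totals

maps = np.random.rand(250, 64, 64)                   # 10 s of stand-in heatmaps
print(speaking_seconds(maps, {"left": (0, 64, 0, 32),
                              "right": (0, 64, 32, 64)}))
```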
Investigation of influence of loss function weights of cycle-consistent adversarial networks on generation of pareidolia stimuli
Yoshitaka Endo, Shinsuke Shimojo, Takuya Akashi
Pareidolia is the psychological tendency to perceive non-facial objects as faces. The pareidolia test exploits this tendency for diagnosis, to identify patients suffering from Lewy body dementia. A typical symptom of Lewy body dementia is visual hallucination of non-existent individuals, which can be induced artificially by pareidolia stimuli. No research has examined how dementia progression relates to the pareidolia test, primarily because it is difficult to systematically generate test stimuli with different strengths of pareidolia-inducing power. To overcome this difficulty, we utilize cycle-consistent adversarial networks (CycleGAN). Two loss functions are associated with CycleGAN; in this paper, we investigate the influence of the weight of one of them, the cycle consistency loss. The results demonstrate that, as expected, there are systematic differences in the induction of pareidolia-like facial perception.
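For reference, the cycle consistency loss whose weight the paper varies is, in the standard CycleGAN formulation, an L1 reconstruction penalty over both translation directions. A minimal PyTorch sketch follows; the default weight of 10.0 comes from the original CycleGAN paper and is only a placeholder here.

```python
# Standard CycleGAN cycle consistency term; G: X->Y and F: Y->X generators.
import torch

def cycle_consistency_loss(G, F, x, y, weight=10.0):
    """L_cyc = E[||F(G(x)) - x||_1] + E[||G(F(y)) - y||_1], scaled by weight."""
    forward_cycle = torch.nn.functional.l1_loss(F(G(x)), x)
    backward_cycle = torch.nn.functional.l1_loss(G(F(y)), y)
    return weight * (forward_cycle + backward_cycle)

# Identity mappings stand in for the generators in this toy check.
x, y = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
print(cycle_consistency_loss(lambda t: t, lambda t: t, x, y))  # -> 0
```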
Interactive drawing interface with 3D animal model retrieval
In this work, we propose an interactive drawing guidance interface with 3D animal model retrieval, which helps common users draw 2D animal sketches by exploring desired animal models in a pre-collected dataset. We first construct an animal model dataset and generate line-drawing images of the 3D models from different viewpoints. We then develop the drawing interface, which presents retrieved models by matching freehand sketch inputs with the line-drawing images. We utilize a state-of-the-art sketch-based image retrieval algorithm for sketch matching, which describes the appearance and relative positions of multiple objects by measuring compositional similarity. The proposed system can accurately retrieve similar partial images and provides blended shadow guidance underlying the user's strokes to guide the drawing process. Our user study verified that the proposed interface can improve the drawing quality of users' animal sketches.
Sketch-based 3D shape modeling from sparse point clouds
Xusheng Du, Yi He, Xi Yang, et al.
3D modeling based on point clouds is an efficient way to reconstruct and create detailed 3D content. However, the geometric procedure may lose accuracy because point clouds are highly redundant and lack explicit structure. In this work, we propose a human-in-the-loop, sketch-based point cloud reconstruction framework that leverages the user's cognitive ability in geometry extraction. We present an interactive drawing interface for creating 3D models from point cloud data with the help of user sketches. We adopt an optimization method in which the user can continuously edit the contours extracted from the obtained 3D model and retrieve the model iteratively. Finally, we verify the proposed user interface for modeling from sparse point clouds.
Generation of dynamic images for fake-face detection
Recently, videos of famous politicians and celebrities giving speeches have surfaced online, causing severe political and commercial problems; although they seemed authentic, the videos were fake. Therefore, in this study, using 37,936 videos sampled from the Deepfake Detection Challenge (DFDC), we developed an efficient and highly accurate deepfake detection system using EfficientNet with dynamic images. A dynamic image condenses a video sequence into a single frame while preserving spatiotemporal information. The experimental results and comparative analysis indicate that EfficientNet with dynamic images outperforms EfficientNet alone. We also found that dynamic images generated from 20 frames yield higher fake-face detection accuracy than single still images.
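Dynamic images are commonly built with approximate rank pooling (Bilen et al.), which reduces to a fixed weighted sum over frames. A sketch is given below; the harmonic-number coefficient formula follows that literature rather than this paper's stated implementation, and the 20-frame window echoes the abstract's finding.

```python
# Sketch of approximate rank pooling to build a dynamic image from T frames.
import numpy as np

def dynamic_image(frames: np.ndarray) -> np.ndarray:
    """frames: (T, H, W, C) float array -> single (H, W, C) dynamic image."""
    T = frames.shape[0]
    # Harmonic numbers H[0..T], with H[0] = 0.
    H = np.concatenate([[0.0], np.cumsum(1.0 / np.arange(1, T + 1))])
    # Per-frame coefficients of approximate rank pooling.
    alphas = np.array([2 * (T - t + 1) - (T + 1) * (H[T] - H[t - 1])
                       for t in range(1, T + 1)])
    di = np.tensordot(alphas, frames, axes=(0, 0))   # weighted sum over time
    di -= di.min()
    return di / max(di.max(), 1e-8)                  # normalize to [0, 1]

clip = np.random.rand(20, 112, 112, 3)               # 20-frame stand-in clip
img = dynamic_image(clip)                            # input to the CNN classifier
```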