Proceedings Paper

Video question answering by frame attention
Author(s): Jiannan Fang; Lingling Sun; Yaqi Wang

Paper Abstract

In recent years, Visual Question Answering (VisualQA) has gradually become one of the research hotspots of video understanding. However, most research has focused on Image Question Answering (ImageQA), while comparatively little has addressed Video Question Answering (VideoQA). Inspired by ImageQA models, we propose a model that uses videos and questions to generate answers. We also redesign and simplify the Joint Sequence Fusion (JSFusion) model into a soft-attention mechanism, called Frame Attention, which refines its attention over frame objects with the help of the question. Frame Attention first fuses the multi-modal features with the Hadamard product and then generates attention probabilities by encoding. In addition, we propose a new training strategy for the ZJL dataset that takes full advantage of all the answers to each question during training. Experiments show the advantages of our model, which achieves an accuracy of 0.509.
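The abstract describes Frame Attention only at a high level: multi-modal features are fused by a Hadamard product, and attention probabilities are then produced "by encoding". The following is a minimal, hypothetical sketch of that two-step pattern in plain Python, not the authors' implementation. It assumes one feature vector per frame, a question vector of the same dimension, and a single linear scoring vector `w` followed by a softmax as a stand-in for the unspecified encoder; the names `frame_attention`, `hadamard`, and `w` are illustrative only.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def hadamard(a: Sequence[float], b: Sequence[float]) -> Vector:
    """Element-wise (Hadamard) product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [x * y for x, y in zip(a, b)]


def softmax(scores: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def frame_attention(
    frames: Sequence[Sequence[float]],
    question: Sequence[float],
    w: Sequence[float],
) -> Tuple[Vector, Vector]:
    """Question-guided soft attention over video frames (illustrative sketch).

    frames:   T frame feature vectors, each of dimension d
    question: question feature vector of dimension d
    w:        hypothetical scoring vector of dimension d; an assumed linear
              stand-in for the paper's unspecified "encoding" step
    Returns (attention probabilities over the T frames, attended video vector).
    """
    # 1) Fuse each frame with the question via the Hadamard product.
    fused = [hadamard(f, question) for f in frames]
    # 2) "Encode" each fused vector to a scalar score (assumed: linear projection).
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in fused]
    # 3) Normalise the scores into attention probabilities over frames.
    probs = softmax(scores)
    # 4) Attention-weighted sum of the original frame features.
    d = len(question)
    attended = [sum(p * f[k] for p, f in zip(probs, frames)) for k in range(d)]
    return probs, attended


if __name__ == "__main__":
    # Toy example: three 2-D frame features and a question that
    # "looks for" the first feature dimension.
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    question = [2.0, 0.0]
    w = [1.0, 1.0]
    probs, attended = frame_attention(frames, question, w)
    print(probs, attended)
```

In this toy input, frames 1 and 3 share the feature dimension the question activates and so receive equal, higher attention than frame 2, whose fused vector is all zeros. In a trained model, `w` (or a deeper encoder) would be learned jointly with the feature extractors rather than fixed by hand.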

Paper Details

Date Published: 14 August 2019
PDF: 6 pages
Proc. SPIE 11179, Eleventh International Conference on Digital Image Processing (ICDIP 2019), 111793B (14 August 2019); doi: 10.1117/12.2539615
Jiannan Fang, Hangzhou Dianzi Univ. (China)
Lingling Sun, Hangzhou Dianzi Univ. (China)
Yaqi Wang, Hangzhou Dianzi Univ. (China)

Published in SPIE Proceedings Vol. 11179:
Eleventh International Conference on Digital Image Processing (ICDIP 2019)
Jenq-Neng Hwang; Xudong Jiang, Editor(s)

© SPIE.