
Proceedings Paper

Spatial-temporal attention in Bi-LSTM networks based on multiple features for video captioning
Author(s): Chu-yi Li; Wei-yu Yu

Paper Abstract

Automatically generating rich natural-language descriptions for open-domain videos is among the most challenging tasks at the intersection of computer vision, natural language processing, and machine learning. Building on the general encoder-decoder framework, we propose a bidirectional long short-term memory (Bi-LSTM) network with spatial-temporal attention over multiple features of objects, activities, and scenes. The network learns valuable and complementary high-level visual representations and dynamically focuses on the most informative context across frames within different segments of a video. Experimental results show that the proposed method achieves performance competitive with or better than the state of the art on the MSVD video dataset.
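The temporal side of the spatial-temporal attention described above can be illustrated with a minimal sketch: at each decoding step, an attention module scores every frame feature against the decoder's current hidden state and forms a weighted context vector. The additive (tanh) scoring form, the function name, and all dimensions below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def temporal_attention(frame_feats, hidden, W_f, W_h, w):
    """Additive temporal attention over video frame features (illustrative sketch).

    frame_feats: (T, d_f) one feature vector per frame
    hidden:      (d_h,)   current decoder hidden state
    W_f, W_h, w: learned projection parameters (randomly initialized here)
    """
    # Score each frame: e_t = w^T tanh(W_f f_t + W_h h)
    scores = np.tanh(frame_feats @ W_f + hidden @ W_h) @ w
    # Softmax over frames gives the attention weights
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()
    # Context vector: attention-weighted sum of frame features
    context = alphas @ frame_feats
    return context, alphas

rng = np.random.default_rng(0)
T, d_f, d_h, d_a = 10, 16, 8, 12  # frames, feature dim, hidden dim, attention dim
frame_feats = rng.standard_normal((T, d_f))
hidden = rng.standard_normal(d_h)
W_f = rng.standard_normal((d_f, d_a))
W_h = rng.standard_normal((d_h, d_a))
w = rng.standard_normal(d_a)

context, alphas = temporal_attention(frame_feats, hidden, W_f, W_h, w)
```

In a full captioning model, `frame_feats` would come from multiple encoders (object, activity, and scene features as the abstract describes), and the weights would be trained end-to-end with the Bi-LSTM decoder rather than sampled at random.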

Paper Details

Date Published: 29 October 2018
PDF: 8 pages
Proc. SPIE 10836, 2018 International Conference on Image and Video Processing, and Artificial Intelligence, 1083616 (29 October 2018); doi: 10.1117/12.2514651
Author Affiliations:
Chu-yi Li, South China Univ. of Technology (China)
Wei-yu Yu, South China Univ. of Technology (China)

Published in SPIE Proceedings Vol. 10836:
2018 International Conference on Image and Video Processing, and Artificial Intelligence
Ruidan Su, Editor(s)

© SPIE.