Paper 13034-8

Efficient and consistent zero-shot video generation with diffusion models

On demand | Presented live 22 April 2024

Abstract

Recent diffusion-based generative models often adapt an image diffusion model to video generation through one-shot fine-tuning. However, fine-tuning leads to long video generation times and poor efficiency. To avoid these costs, zero-shot text-to-video models eliminate fine-tuning entirely and generate novel videos from a text prompt alone. While zero-shot generation greatly reduces generation time, many models still rely on inefficient cross-frame attention processors, which hinders the use of diffusion models for real-time video generation. We address this issue by introducing more efficient attention processors into a video diffusion model. Specifically, we use attention processors (namely xFormers, FlashAttention, and HyperAttention) that are highly optimized for efficiency and hardware parallelism. We apply these processors to a zero-shot video generator and evaluate them with both an older diffusion model, Stable Diffusion 1.5, and a newer, higher-quality model, Stable Diffusion XL. Our results show that efficient attention processors alone reduce generation time by around 25% with no measurable change in video quality. Combined with higher-quality base models, the use of efficient attention processors in zero-shot generation yields a substantial gain in both efficiency and quality, greatly expanding the applicability of video diffusion models to real-time video generation.
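To illustrate the kind of change the abstract describes, here is a minimal sketch (not the authors' code) of swapping an efficient attention processor into a zero-shot text-to-video pipeline. It assumes Hugging Face diffusers' Text2Video-Zero pipeline as a stand-in for the paper's generator; the model checkpoint, processor choice, and prompt are illustrative assumptions.

```python
# Hedged sketch: efficient attention in zero-shot text-to-video generation,
# assuming the diffusers library's Text2Video-Zero pipeline.
import torch
from diffusers import TextToVideoZeroPipeline
from diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zero import (
    CrossFrameAttnProcessor2_0,  # cross-frame attention via PyTorch 2 SDPA
)

model_id = "runwayml/stable-diffusion-v1-5"  # assumed SD 1.5 checkpoint
pipe = TextToVideoZeroPipeline.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")

# Route cross-frame attention through torch.nn.functional's
# scaled_dot_product_attention, which can dispatch to a fused
# FlashAttention kernel on supported GPUs. (Recent diffusers versions
# select this processor automatically under PyTorch 2.)
pipe.unet.set_attn_processor(CrossFrameAttnProcessor2_0(batch_size=2))

# With xformers installed, pipe.enable_xformers_memory_efficient_attention()
# enables its memory-efficient kernels instead; note that in stock diffusers
# this replaces the cross-frame processor, whereas the paper's approach
# integrates efficient kernels into cross-frame attention itself.

# Zero-shot generation: a short clip from a text prompt alone, no fine-tuning.
frames = pipe(prompt="a horse galloping on a beach", video_length=8).images
```

The same pattern would apply to an SDXL-based pipeline, where the larger UNet makes the choice of attention kernel weigh more heavily on per-frame latency.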

Presenter

Ethan Frakes
Univ. of Central Florida (United States)
Ethan is a junior undergraduate student studying computer science at the University of Central Florida (UCF) in Orlando, Florida. He has a long-standing interest in computer vision, image and video generation, and software engineering. His work in computer vision began in the summer of 2023 as an undergraduate research intern at the Center for Research in Computer Vision at UCF. His research focuses on improving the efficiency and quality of diffusion-based generative video models; his contributions include applying state-of-the-art attention processors such as FlashAttention to video diffusion models and adopting recent model innovations such as Stable Diffusion XL.
Application tracks: AI/ML
Presenter/Author
Ethan Frakes
Univ. of Central Florida (United States)
Author
Univ. of Central Florida (United States)
Author
Univ. of Central Florida (United States)