Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision

Stanford University, Google Research

Abstract

Video-STaR introduces a novel self-training approach for video instruction tuning, leveraging labeled video datasets to enhance Large Vision Language Models (LVLMs). By iteratively generating and filtering answers containing the correct video labels, Video-STaR improves general video understanding and adapts LVLMs to new tasks. Our results show significant performance gains in video QA and downstream tasks, demonstrating the effectiveness of Video-STaR in utilizing existing video labels as weak supervision.

💡 Introduction

Video-STaR is a self-training approach for video language models that allows any labeled video dataset to be used for video instruction tuning. It cycles between generating answers and filtering them so that only those containing the correct video labels are used for training, effectively leveraging existing video labels as weak supervision. This iterative process enhances both general video understanding and the adaptability of LVLMs to novel tasks, yielding significant performance improvements in video question answering and various downstream applications.

(3.1) We initialize by prompting a large vision-language model to generate an answer for a given video. (3.3) We then filter the generated answers, keeping only those that contain the original video labels.
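To make the filtering step concrete, here is a minimal sketch that treats the label filter as a case-insensitive containment check between the generated answer and the ground-truth labels; the function name and matching rule are illustrative assumptions, not the paper's exact implementation.

def label_filter(generated_answer: str, labels: list[str]) -> bool:
    """Keep a generated answer only if every ground-truth label appears in it.

    Hypothetical sketch: simple case-insensitive containment; the actual
    Video-STaR filter may use more elaborate matching.
    """
    answer = generated_answer.lower()
    return all(label.lower() in answer for label in labels)

# Example: an answer about a diving video with the label "forward somersault".
label_filter("The diver performs a forward somersault with a twist.",
             ["forward somersault"])   # True
label_filter("The athlete jumps into the pool.",
             ["forward somersault"])   # False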

(3.2) The videos whose generated answers did not contain the ground-truth labels are then sent to label rationalization, where, given the video, question, and label, the model is expected to rationalize the label. (3.3) The generated rationalizations are filtered again to those containing the ground-truth labels.
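One plausible way to prompt for rationalization is to reveal the ground-truth label and ask the model to justify it from the video; the template below is an assumption for illustration, not the exact prompt used in Video-STaR.

def build_rationalization_prompt(question: str, label: str) -> str:
    # Hypothetical prompt: the label is revealed and the model must explain
    # why it fits the video (label rationalization).
    return (
        f"{question}\n"
        f"The correct answer is: {label}.\n"
        "Explain, based on what happens in the video, why this answer is correct."
    )

build_rationalization_prompt("What action is being performed in the video?",
                             "rock climbing")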

(3) The large vision-language model is then instruction-tuned from the pre-trained checkpoint on the resulting dataset, and the cycle repeats.
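Putting the steps together, the Video-STaR loop can be sketched as below. The generate_answer, rationalize, and finetune_from_pretrained hooks are hypothetical stand-ins for the LVLM inference and training routines; only the control flow mirrors the description above.

def contains_labels(text: str, labels: list[str]) -> bool:
    """Case-insensitive containment check (same idea as label_filter above)."""
    return all(label.lower() in text.lower() for label in labels)

def video_star_cycle(model, dataset, num_iterations: int = 3):
    """Sketch of the Video-STaR self-training loop (not the official code).

    dataset: iterable of (video, question, labels) triples.
    model:   object exposing hypothetical generate_answer / rationalize /
             finetune_from_pretrained hooks (names are assumptions).
    """
    for _ in range(num_iterations):
        kept = []
        for video, question, labels in dataset:
            # (3.1) Answer generation with the current model.
            answer = model.generate_answer(video, question)
            if contains_labels(answer, labels):          # (3.3) first filter
                kept.append((video, question, answer))
                continue
            # (3.2) Label rationalization: reveal the label, ask for a rationale.
            rationale = model.rationalize(video, question, labels)
            if contains_labels(rationale, labels):       # (3.3) second filter
                kept.append((video, question, rationale))
        # (3) Instruction-tune from the pre-trained checkpoint on the kept data.
        model = model.finetune_from_pretrained(kept)
    return model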

🚀 Adapting to New Datasets

LVLMs often struggle with complex and diverse video tasks, and collecting data to adapt these models is resource-intensive. Video-STaR addresses this by leveraging auxiliary labels from existing datasets for model adaptation. We found that Video-STaR efficiently adapted LVLMs to new tasks, with a 20% accuracy increase on Kinetics700 and an improvement from 17.6 to 20.2 in FineDiving score prediction accuracy. These results demonstrate Video-STaR's ability to enhance LVLM performance using auxiliary labels, making it a versatile tool for various video understanding applications.
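As a concrete illustration of how auxiliary labels could be turned into instruction-tuning targets, the snippet below converts an action label (e.g., from Kinetics700) or a numeric score (e.g., from FineDiving) into a question/label pair; the templates are assumptions for illustration, not the paper's exact prompts.

def qa_from_action_label(action: str) -> tuple[str, str]:
    # Hypothetical template: turn an action-recognition label into a QA pair.
    return ("What action is being performed in the video?", action)

def qa_from_dive_score(score: float) -> tuple[str, str]:
    # Hypothetical template: turn a diving score annotation into a QA pair.
    return ("What score did the judges award this dive?", f"{score:.1f}")

qa_from_action_label("playing violin")   # ("What action ...?", "playing violin")
qa_from_dive_score(86.4)                 # ("What score ...?", "86.4")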

📊 General VQA Performance

Collecting high-quality and diverse data is challenging, and scaling it is even harder. Video-STaR facilitates the collection of diverse, high-quality data, enabling improvements in frontier models like Gemini and ChatGPT. In the TempCompass evaluation, Video-STaR consistently outperformed Video-LLaVA with a 10% performance boost. The fine-grained nature of TempCompass underscores Video-STaR's ability to maintain high accuracy without increasing hallucinations.

BibTeX

@article{zohar2024videostar,
    title   = {Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision},
    author  = {Zohar, Orr and Wang, Xiaohan and Bitton, Yonatan and Szpektor, Idan and Yeung-Levy, Serena},
    year    = {2024},
    journal = {arXiv preprint arXiv:2407.06189},
}