Efficient Temporal Action Localization via Vision-Language Modelling: An Empirical Study on the STALE Model's Efficiency and Generalizability in Resource-constrained Environments

Wang, Yunhan

Title

Efficient Temporal Action Localization via Vision-Language Modelling: An Empirical Study on the STALE Model's Efficiency and Generalizability in Resource-constrained Environments

Author

Wang, Yunhan (TU Delft Electrical Engineering, Mathematics and Computer Science)

Contributor

van Gemert, J.C. (mentor)
Bruintjes, R. (mentor)
Lengyel, A. (mentor)
Strafforello, O. (mentor)
Kellnhofer, P. (graduation committee)

Degree granting institution

Delft University of Technology

Programme

Computer Science and Engineering

Project

CSE3000 Research Project

Date

2023-06-29

Abstract

Temporal Action Localization (TAL) aims to localize the start and end times of actions in untrimmed videos and classify the corresponding action types. TAL plays an important role in video understanding. Existing TAL approaches rely heavily on deep learning and require large-scale data and expensive training processes. Recent advances in Contrastive Language-Image Pre-Training (CLIP) have brought vision-language modeling into the field of TAL. While current CLIP-based TAL methods have proven effective, their capabilities under data-limited and compute-limited settings have not been explored. In this paper, we investigate the data and compute efficiency of the CLIP-based STALE model. We evaluate the model's performance in data-limited open-set and closed-set scenarios and find that STALE demonstrates adequate generalizability with limited data. We measure the training time, inference time, GPU utilization, multiply-accumulate operations (MACs), and memory consumption of STALE for inputs of varying video length, and we identify an optimal input length for STALE inference. Using model quantization, we find a significant reduction in forward time for STALE on a single CPU. Our findings shed light on the capabilities and limitations of CLIP-based TAL methods under constrained data and compute resources. The insights gained from this research contribute to enhancing the efficiency and applicability of CLIP-based TAL techniques in real-world scenarios. The results provide guidance for future work on CLIP-based TAL models and their potential for broader adoption in resource-constrained environments.

Subject

Vision-language Modelling
Temporal Action Localization
Data efficiency
Compute Efficiency
Computer Vision
CLIP
Multi-modal Learning
Machine Learning
AI
STALE

To reference this document use:

http://resolver.tudelft.nl/uuid:61a3357a-95d5-4371-bf46-a34e1466642b

Part of collection

Student theses

Document type

bachelor thesis

Rights

Files

PDF

Yunhan_Wang_BSc_Thesis.pdf

1.66 MB