Momentum-Guided Semantic Forecasting (MoFore) for Self-Supervised Video Representation Learning

Qinwu Xu

doi:10.31224/7329

Authors

Qinwu Xu, UT Austin

DOI:

https://doi.org/10.31224/7329

Keywords:

multimodal, video, latent semantic analysis, representation

Abstract

Self-supervised video representation learning has recently advanced through contrastive learning, masked reconstruction, and predictive representation learning. Reconstruction-based approaches such as MAE and VideoMAE learn representations by recovering masked visual content [1, 2], while contrastive methods such as CLIP learn semantically meaningful embedding spaces through representation alignment [3].

In this work, we introduce a Momentum-Guided Semantic Forecasting framework (MoFore) for self-supervised video representation learning. Instead of optimizing for pixel-level reconstruction or task-specific semantic alignment, the proposed method learns temporally predictive video representations by forecasting future latent embeddings from temporally distant context clips. To improve robustness across temporal scales, we further introduce randomized temporal-gap forecasting during training. The framework combines predictive latent forecasting with contrastive regularization to encourage temporal consistency while preventing representation collapse.

Experiments on the UCF101 dataset demonstrate that the proposed framework learns temporally consistent and semantically meaningful video representations without using action labels during training. Quantitative analysis shows strong temporal stability and emergent category-level structure in the learned embedding space, while qualitative retrieval experiments reveal motion-aware organization across related activities. Overall, the results suggest that long-range latent forecasting provides an effective and computationally efficient approach for self-supervised video representation learning without relying on reconstruction-based objectives.
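
The training signal described in the abstract (forecasting a future latent embedding across a randomly sampled temporal gap, with a contrastive term to discourage collapse) can be illustrated with a minimal sketch. The abstract does not give the loss forms, the momentum update rule, or any hyperparameters, so everything below is an assumption made for illustration: the cosine forecasting loss, the InfoNCE-style contrastive term, the exponential-moving-average update, and all names (`sample_gap`, `ema_update`, `forecasting_loss`, `info_nce_loss`, `combined_loss`, `contrastive_weight`) are hypothetical and not taken from the paper.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def sample_gap(rng, min_gap, max_gap):
    """Assumed form of randomized temporal-gap sampling: a uniform
    integer gap (in clips) between the context and the target clip."""
    return rng.randint(min_gap, max_gap)  # both bounds inclusive


def ema_update(target, online, momentum=0.99):
    """Assumed momentum update: move target weights slowly toward
    the online weights (exponential moving average)."""
    return [momentum * t + (1.0 - momentum) * o for t, o in zip(target, online)]


def forecasting_loss(preds, targets):
    """Mean (1 - cosine similarity) between predicted and actual
    future embeddings; zero when every forecast points the right way."""
    return sum(1.0 - cosine(p, t) for p, t in zip(preds, targets)) / len(preds)


def info_nce_loss(preds, targets, temperature=0.1):
    """InfoNCE-style term: each prediction should match its own target
    better than the other targets in the batch."""
    total = 0.0
    for i, p in enumerate(preds):
        logits = [cosine(p, t) / temperature for t in targets]
        peak = max(logits)  # subtract the max for numerical stability
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[i]
    return total / len(preds)


def combined_loss(preds, targets, contrastive_weight=0.5):
    """Forecasting loss plus a weighted contrastive regularizer."""
    return forecasting_loss(preds, targets) + contrastive_weight * info_nce_loss(preds, targets)


# Toy usage: two target embeddings, forecast either correctly or swapped.
rng = random.Random(0)
gap = sample_gap(rng, 2, 8)
targets = [[1.0, 0.0], [0.0, 1.0]]
perfect = combined_loss(targets, targets)
swapped = combined_loss(targets[::-1], targets)
```

In this toy setup the correctly matched forecasts give a lower combined loss than the swapped ones, which is the behavior both terms are meant to reward. In a real pipeline the embeddings would come from video encoders rather than hand-written vectors.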

