SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning
Published in CVPR, 2025
This paper proposes SMILE, a self-supervised masked video learning method that improves semantic and motion representations by incorporating CLIP-guided spatial semantics and synthetic motion patterns. It achieves state-of-the-art results across 7 datasets.