Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition [ICCV-2023]

1Mohamed bin Zayed University of AI, 2Australian National University, 3Linköping University, 4University of Central Florida
*Joint first authors

Overall Architecture

Accuracy vs. computational complexity trade-off: We compare Video-FocalNets against recent methods for video action recognition, plotting top-1 accuracy on the Kinetics-400 dataset against GFLOPs/view. Video-FocalNets perform favorably compared to their counterparts across a range of model sizes (Tiny, Small, and Base).

Abstract

Recent video recognition models utilize Transformer models for long-range spatio-temporal context modeling. Video transformer designs are based on self-attention that can model global context at a high computational cost. In comparison, convolutional designs for videos offer an efficient alternative but lack long-range dependency modeling. Towards achieving the best of both designs, this work proposes Video-FocalNet, an effective and efficient architecture for video recognition that models both local and global contexts. Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention for better efficiency. Further, the aggregation step and the interaction step are both implemented using efficient convolution and element-wise multiplication operations that are computationally less expensive than their self-attention counterparts on video representations. We extensively explore the design space of focal modulation-based spatio-temporal context modeling and demonstrate our parallel spatial and temporal encoding design to be the optimal choice. Video-FocalNets perform favorably against the state-of-the-art transformer-based models for video recognition on three large-scale datasets (Kinetics-400, Kinetics-600, and SS-v2) at a lower computational cost. Our code/models are publicly released.
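To make the reversed interaction/aggregation order concrete, below is a minimal PyTorch sketch of focal modulation for a single frame (the image-level operator that the video design builds on): context is first aggregated hierarchically with cheap depthwise convolutions and level-wise gates, and then injected into each query token by element-wise multiplication, instead of computing pairwise attention weights. Layer names, the number of focal levels, and tensor layouts here are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class FocalModulation2D(nn.Module):
    """Sketch of image-level focal modulation: aggregate first, then interact."""
    def __init__(self, dim, focal_levels=3, kernel_size=3):
        super().__init__()
        self.focal_levels = focal_levels
        # One linear layer produces the query, the initial context, and per-level gates.
        self.f = nn.Linear(dim, 2 * dim + (focal_levels + 1))
        self.h = nn.Conv2d(dim, dim, kernel_size=1)     # modulator projection
        self.proj = nn.Linear(dim, dim)
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2,
                          groups=dim),                  # depthwise conv = cheap aggregation
                nn.GELU(),
            )
            for _ in range(focal_levels)
        ])
        self.act = nn.GELU()

    def forward(self, x):                               # x: (B, H, W, C)
        q, ctx, gates = torch.split(
            self.f(x), (x.shape[-1], x.shape[-1], self.focal_levels + 1), dim=-1)
        ctx = ctx.permute(0, 3, 1, 2)                   # (B, C, H, W)
        gates = gates.permute(0, 3, 1, 2)               # (B, L+1, H, W)
        ctx_all = 0
        for level, layer in enumerate(self.layers):     # hierarchical (local-to-coarse) aggregation
            ctx = layer(ctx)
            ctx_all = ctx_all + ctx * gates[:, level:level + 1]
        ctx_global = self.act(ctx.mean(dim=(2, 3), keepdim=True))
        ctx_all = ctx_all + ctx_global * gates[:, self.focal_levels:]
        modulator = self.h(ctx_all).permute(0, 2, 3, 1)
        return self.proj(q * modulator)                 # interaction: element-wise multiplication
```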

Visualizations

We visualize the spatial and temporal modulators for sample videos from Kinetics-600 and Something-Something-V2. Note how the temporal modulator fixates on the global motion across frames while the spatial modulator captures local variations.

Video-FocalNets architecture

(a) The overall architecture of Video-FocalNets: a four-stage design, with each stage comprising a patch-embedding layer followed by a number of Video-FocalNet blocks. (b) A single Video-FocalNet block: structured like a standard transformer block, but with self-attention replaced by Spatio-Temporal Focal Modulation.
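As a reading aid, here is a hedged sketch of how such a block might be wired, assuming a pre-norm residual structure as in a standard transformer block; stochastic depth and layer scale are omitted, and the `modulation` argument stands in for the spatio-temporal focal modulation layer sketched after the next caption. This follows the figure's description rather than the released code.

```python
import torch.nn as nn

class VideoFocalNetBlock(nn.Module):
    """Transformer-style block with self-attention swapped for focal modulation."""
    def __init__(self, dim, modulation: nn.Module, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.modulation = modulation            # spatio-temporal focal modulation module
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(               # standard transformer MLP
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):                       # x: (B, T, H, W, C) token grid
        x = x + self.modulation(self.norm1(x))  # replaces the self-attention sub-layer
        x = x + self.mlp(self.norm2(x))
        return x
```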

The Spatio-Temporal Focal Modulation layer: models spatial and temporal information through independent, parallel branches.
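Below is a hedged sketch of the parallel spatial/temporal design this figure describes, assuming a single aggregation level per branch for brevity (the actual layer uses multiple hierarchical focal levels with gating, as in the image-level sketch above): a 2D depthwise convolution aggregates context within each frame, a 1D depthwise convolution aggregates across frames at each spatial location, and both modulators interact with the query via element-wise multiplication. Combining the two modulators as a product is an illustrative assumption; see the paper for the exact formulation.

```python
import torch.nn as nn

class SpatioTemporalFocalModulation(nn.Module):
    """Sketch of parallel spatial and temporal focal modulation (single level)."""
    def __init__(self, dim, k_spatial=3, k_temporal=3):
        super().__init__()
        self.f = nn.Linear(dim, 3 * dim)             # query + spatial/temporal contexts
        self.spatial = nn.Sequential(                # 2D depthwise conv within a frame
            nn.Conv2d(dim, dim, k_spatial, padding=k_spatial // 2, groups=dim),
            nn.GELU(),
        )
        self.temporal = nn.Sequential(               # 1D depthwise conv across frames
            nn.Conv1d(dim, dim, k_temporal, padding=k_temporal // 2, groups=dim),
            nn.GELU(),
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                            # x: (B, T, H, W, C)
        B, T, H, W, C = x.shape
        q, ctx_s, ctx_t = self.f(x).chunk(3, dim=-1)

        # Spatial branch: fold time into the batch and aggregate per frame.
        m_s = self.spatial(ctx_s.reshape(B * T, H, W, C).permute(0, 3, 1, 2))
        m_s = m_s.permute(0, 2, 3, 1).reshape(B, T, H, W, C)

        # Temporal branch: fold space into the batch and aggregate across frames.
        m_t = ctx_t.permute(0, 2, 3, 4, 1).reshape(B * H * W, C, T)
        m_t = self.temporal(m_t).reshape(B, H, W, C, T).permute(0, 4, 1, 2, 3)

        # Interaction: element-wise modulation of the query by both contexts.
        return self.proj(q * m_s * m_t)
```

With the block sketch above, something like `VideoFocalNetBlock(dim=96, modulation=SpatioTemporalFocalModulation(96))` would stack the two pieces into one stage's building block (the names and dimensions are illustrative).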

Comparisons with State-of-The-Art

Video-FocalNets show consistent improvements across multiple large-scale benchmarks. We present results below for the Kinetics-400, Kinetics-600, and Something-Something-v2 datasets.

Results on Kinetics-400

| Method | Pre-training | Top-1 (%) | Views | FLOPs (G/view) |
|---|---|---|---|---|
| MTV-B (CVPR'22) | ImageNet-21K | 81.8 | 4 x 3 | 399 |
| MTV-B (320p) (CVPR'22) | ImageNet-21K | 82.4 | 4 x 3 | 967 |
| Video-Swin-T (CVPR'22) | ImageNet-1K | 78.8 | 4 x 3 | 88 |
| Video-Swin-S (CVPR'22) | ImageNet-1K | 80.6 | 4 x 3 | 166 |
| Video-Swin-B (CVPR'22) | ImageNet-1K | 80.6 | 4 x 3 | 282 |
| Video-Swin-B (CVPR'22) | ImageNet-21K | 82.7 | 4 x 3 | 282 |
| MViTv2-B (CVPR'22) | - | 82.9 | 5 x 1 | 226 |
| Uniformer-B (ICLR'22) | ImageNet-1K | 83.0 | 4 x 3 | 259 |
| Video-FocalNet-T | ImageNet-1K | 79.8 | 4 x 3 | 63 |
| Video-FocalNet-S | ImageNet-1K | 81.4 | 4 x 3 | 124 |
| Video-FocalNet-B | ImageNet-1K | 83.6 | 4 x 3 | 149 |

Results on Kinetics-600

| Method | Pre-training | Top-1 (%) |
|---|---|---|
| MTV-B (CVPR'22) | ImageNet-21K | 83.6 |
| MTV-B (320p) (CVPR'22) | ImageNet-21K | 84.0 |
| Video-Swin-B (CVPR'22) | ImageNet-21K | 84.0 |
| Uniformer-B (ICLR'22) | ImageNet-1K | 84.5 |
| MoViNet-A6 (CVPR'21) | ImageNet-21K | 84.8 |
| MViTv2-B (CVPR'22) | - | 85.5 |
| Video-FocalNet-B | ImageNet-1K | 86.7 |

Results on Something-Something V2

| Method | Pre-training | Top-1 (%) |
|---|---|---|
| MTV-B (CVPR'22) | ImageNet-21K | 67.6 |
| MTV-B (320p) (CVPR'22) | ImageNet-21K | 68.5 |
| Video-Swin-B (CVPR'22) | Kinetics-400 | 69.6 |
| Uniformer-B (ICLR'22) | Kinetics-400 | 70.4 |
| MViTv2-B (CVPR'22) | Kinetics-400 | 70.5 |
| Video-FocalNet-B | Kinetics-400 | 71.1 |

Conclusion

To learn spatio-temporal representations that effectively capture both local and global contexts, this paper introduces Video-FocalNets for video action recognition. The architecture is derived from focal modulation for images and effectively models both short- and long-term dependencies to learn strong spatio-temporal representations. We extensively evaluate several design choices to arrive at the proposed Video-FocalNet block. Specifically, Video-FocalNet uses a parallel design that models hierarchical contextualization by combining spatial and temporal convolution and multiplication operations in a computationally efficient manner, making it more efficient than transformer-based architectures that rely on expensive self-attention operations. We demonstrate the effectiveness of Video-FocalNets through evaluations on three representative large-scale video datasets, where our approach outperforms previous transformer- and CNN-based methods.


For more details about the proposed video recognition framework and results/comparisons over additional benchmarks, please refer to our main paper. Thank you!

BibTeX

@InProceedings{Wasim_2023_ICCV,
    author    = {Wasim, Syed Talal and Khattak, Muhammad Uzair and Naseer, Muzammal and Khan, Salman and Shah, Mubarak and Khan, Fahad Shahbaz},
    title     = {Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {13778-13789}
}