Diffusion-Based Action Recognition Generalizes to Untrained Domains

California Institute of Technology
*Indicates Equal Contribution

ActionDiff. Our method uses the highly semantic features extracted from a frozen Stable Video Diffusion backbone to perform action recognition in tasks that require generalization across domains. Our model generalizes to new agents (species), viewing angles ($1^{st}$- to $3^{rd}$-person), and contexts (sports vs. movies) that were not present in the training data.

Abstract

Humans can recognize the same actions despite large variations, such as differences between species (walking in spiders vs. horses), viewpoints (egocentric vs. third-person), and contexts (real life vs. movies).

Current deep learning models struggle with such generalization. We propose using features generated by a Vision Diffusion Model (VDM), aggregated via a transformer, to achieve human-like action recognition across these challenging conditions.

We find that generalization is enhanced by conditioning the model on earlier timesteps of the diffusion process, which emphasizes semantic information over pixel-level detail in the extracted features. We experimentally explore the generalization properties of our approach by classifying actions across animal species, viewing angles, and recording contexts.

Our model sets a new state-of-the-art across all three generalization benchmarks, bringing machine action recognition closer to human-like robustness.

Domain Shift Tasks


Domain Shift Tasks. We present the three domain-shift tasks we use to measure the generalization performance of ActionDiff. Left: samples from the Animal Kingdom dataset, which contains examples of actions (eating and swimming) being performed by different animal species. Middle: samples from CharadesEgo, which contains examples of the same actions (typing and grabbing a pillow) captured from first- and third-person perspectives. Right: samples from UCF-101 (top) and HMDB51 (bottom), which contain examples of the same actions (shooting a bow and kicking a ball) in different contexts. UCF-101 consists mostly of amateur sports footage, while HMDB51 also includes other sources (such as movies, TV, and video games).

Methods


Overview of our architecture. We split a longer video into shorter segments and extract frame features for each segment using a frozen Stable Video Diffusion backbone. The segment frames are encoded into the diffusion latent space by $\mathcal{E}$. They are processed together by the denoiser $\epsilon_\theta$, guided through cross-attention by a condition $c$ (the middle frame $x^{mid}$ embedded by a CLIP encoder $\tau_\theta$). We extract the outputs of a middle layer $l$ of the denoiser $\epsilon_\theta$ and average-pool them across the spatial dimensions to obtain a feature vector for each frame in the segment. We then collect the sequence of frame features from all video segments, concatenate a learned class token to the beginning of the sequence, and pass it through a learned transformer encoder. From the output, we apply a linear layer and a normalization function $\sigma$ to the class token to obtain probabilities $\hat{y}$ for each action class.
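To make the aggregation stage concrete, below is a minimal PyTorch sketch, assuming the per-frame diffusion features have already been extracted from layer $l$ of the frozen denoiser and average-pooled over space. The module name ActionDiffHead, the feature dimension, and the hyperparameters are illustrative placeholders, not the released implementation.

import torch
import torch.nn as nn

class ActionDiffHead(nn.Module):
    """Hypothetical aggregation head: a transformer encoder over per-frame
    diffusion features with a learned class token (a sketch, not the
    authors' code)."""

    def __init__(self, feat_dim: int, num_classes: int,
                 num_layers: int = 4, num_heads: int = 8):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, feat_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, feat_dim), i.e. the spatially
        # average-pooled outputs of the chosen denoiser layer, concatenated
        # across all video segments.
        b = frame_feats.size(0)
        cls = self.cls_token.expand(b, -1, -1)           # (b, 1, d)
        tokens = torch.cat([cls, frame_feats], dim=1)    # prepend class token
        encoded = self.encoder(tokens)
        # Sigmoid over the class-token output for multi-label benchmarks;
        # softmax would be the choice for single-label classification.
        return torch.sigmoid(self.classifier(encoded[:, 0]))

# Example: 16 pooled frame features of dimension 1024, 140 action classes.
head = ActionDiffHead(feat_dim=1024, num_classes=140)
probs = head(torch.randn(2, 16, 1024))   # shape (2, 140)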

Results

Animal Kingdom


Charades-Ego


UCF-HMDB


Animal Kingdom dataset. ActionDiff beats the SOTA and other self-supervised frozen backbones in action recognition both on the full dataset and the unseen species partition, in which the species used at test time were not seen during training.

Charades-Ego dataset. ActionDiff beats the SOTA and other self-supervised frozen backbones in action recognition in both the $1^{st}$-to-$1^{st}$-person and $3^{rd}$-to-$1^{st}$-person viewpoint settings.

UCF101 (U) to HMDB51 (H) domain shift and vice versa. ActionDiff beats the previous SOTA and other self-supervised frozen backbones when trained on either dataset and tested on the other.

Analysis


Diffusion Layer and Timestep Conditioning. We test the performance of our model with diffusion features extracted from different layers and conditioned on different timesteps to analyze which features are best for in-domain and out-of-domain tasks. We use two example tasks where the model can train on the same training set for both in-domain and out-of-domain evaluation. Each heatmap shows the results (mAP for CharadesEgo and accuracy for HMDB to UCF) for each layer (y-axis) and timestep (x-axis), and the plot below each heatmap shows the best result across layers for each timestep. Comparing which features are best for in-domain vs. out-of-domain testing, we see a shift toward features obtained with earlier timesteps when testing out-of-domain. Layers are indexed in the direction from the input to the bottleneck, and timesteps are indexed in the direction of the generative process (reverse diffusion).
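A minimal sketch of the sweep behind these heatmaps, assuming a hypothetical evaluate(layer, timestep) routine that trains and evaluates the head on features extracted at that layer/timestep pair and returns mAP or accuracy; the curve plotted below each heatmap is then just a column-wise maximum.

import numpy as np

def sweep(evaluate, layers, timesteps):
    # Hypothetical grid search over denoiser layers (rows) and conditioning
    # timesteps (columns); `evaluate` is assumed to return mAP or accuracy.
    scores = np.zeros((len(layers), len(timesteps)))
    for i, layer in enumerate(layers):
        for j, t in enumerate(timesteps):
            scores[i, j] = evaluate(layer, t)
    # Best result across layers for each timestep (the curve under each heatmap).
    best_per_timestep = scores.max(axis=0)
    return scores, best_per_timestep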

BibTeX


@article{guimaraes2025diffusion,
  title={Diffusion-Based Action Recognition Generalizes to Untrained Domains},
  author={Guimaraes, Rogerio and Xiao, Frank and Perona, Pietro and Marks, Markus},
  journal={arXiv preprint arXiv:2509.08908},
  year={2025}
}