Simple yet Strong Baseline towards Controllable Character Animation in the Wild

1Alibaba DAMO Academy, 2Zhejiang University, 3Hupan Lab,
4Southern University of Science and Technology, 5Shenzhen University
Given a character image and pose sequences, the proposed RealisDance-DiT generates impressive videos, handling the challenges of character-object interactions, stylized characters, rare poses, and complex gestures. RealisDance-DiT even shows potential for multi-character control, despite being trained on a single-character dataset.



Abstract

Controllable character animation remains a challenging problem, particularly in handling rare poses, stylized characters, character-object interactions, complex illumination, and dynamic scenes. To tackle these issues, prior work has largely focused on injecting pose and appearance guidance via elaborate bypass networks, but often struggles to generalize to open-world scenarios. In this paper, we propose a new perspective: as long as the foundation model is powerful enough, straightforward model modifications combined with flexible fine-tuning strategies can largely address the above challenges, taking a step towards controllable character animation in the wild. Specifically, we introduce RealisDance-DiT, built upon the Wan-2.1 video foundation model. Our analysis reveals that the widely adopted Reference Net design is suboptimal for large-scale DiT models. Instead, we demonstrate that minimal modifications to the foundation model architecture yield a surprisingly strong baseline. We further propose the low-noise warmup and "large batches and small iterations" strategies to accelerate model convergence during fine-tuning while maximally preserving the priors of the foundation model. In addition, we introduce a new test dataset that captures diverse real-world challenges, complementing existing benchmarks such as the TikTok dataset and the UBC Fashion video dataset, to comprehensively evaluate the proposed method. Extensive experiments show that RealisDance-DiT outperforms existing methods by a large margin.

Method

Illustration of architecture modifications and fine-tunable model parameters. The proposed RealisDance-DiT is fine-tuned under the final setting. We use the same three pose conditions as those used in RealisDance, i.e., HaMeR, DWPose, and SMPL-CS. All pose conditions and the reference image are encoded using the original Wan-2.1 VAE. The encoded pose latents are concatenated along the channel dimension. Then, the concatenated pose latent and the reference latent are fed into the pose and reference patchifiers, respectively. The pose patchifier is initialized randomly, while the reference patchifier is initialized with the weights of the noise patchifier. Finally, the pose latent is added to the noise latent, and the reference latent is concatenated with the noise latent along the sequence length, before being sent to the subsequent DiT blocks.
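To make the wiring concrete, here is a minimal PyTorch sketch of the conditioning path described above. The Patchify module, the build_dit_input helper, and all tensor shapes are illustrative assumptions rather than the released implementation.

import torch
import torch.nn as nn

class Patchify(nn.Module):
    # Flatten a (B, C, T, H, W) latent into a token sequence via a 3D conv.
    def __init__(self, in_channels, dim, patch_size=(1, 2, 2)):
        super().__init__()
        self.proj = nn.Conv3d(in_channels, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, C, T, H, W)
        x = self.proj(x)                     # (B, dim, T', H', W')
        return x.flatten(2).transpose(1, 2)  # (B, T'*H'*W', dim)

def build_dit_input(noise_lat, pose_lats, ref_lat,
                    noise_patchify, pose_patchify, ref_patchify):
    # Concatenate the three pose conditions (HaMeR, DWPose, SMPL-CS) along channels.
    pose_lat = torch.cat(pose_lats, dim=1)              # (B, 3*C, T, H, W)

    noise_tokens = noise_patchify(noise_lat)             # (B, N, dim)
    pose_tokens = pose_patchify(pose_lat)                # (B, N, dim), randomly initialized
    ref_tokens = ref_patchify(ref_lat)                   # (B, M, dim), init from noise patchifier weights

    # Pose guidance is frame-aligned, so it is added to the noise tokens; the
    # reference image is appended along the sequence so self-attention can read it.
    return torch.cat([noise_tokens + pose_tokens, ref_tokens], dim=1)  # (B, N + M, dim)

The pose signal is dense and spatially aligned with the video, which is why it can be injected by simple addition, whereas the reference image is a separate input and is therefore attended to as extra tokens.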

Illustration of spatially shifted RoPE for the reference latent. We also replace the Rotary Position Embedding (RoPE) used in self-attention with a shifted RoPE. The reference latent does not share RoPE with the noise latent. Instead, it uses a spatially shifted RoPE at the first frame, where the shift offsets are determined by the height and width of the noise latent.
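A hedged sketch of how the shifted RoPE indices could be built is given below. We assume a 3D RoPE over (frame, height, width) axes and show only the position-index construction; the rotary embedding itself follows the standard formulation, and the latent grid sizes are placeholders.

import torch

def rope_positions(t, h, w, t_off=0, h_off=0, w_off=0):
    # Return (t*h*w, 3) integer positions over a (frame, height, width) grid.
    ts = torch.arange(t) + t_off
    hs = torch.arange(h) + h_off
    ws = torch.arange(w) + w_off
    grid = torch.stack(torch.meshgrid(ts, hs, ws, indexing="ij"), dim=-1)
    return grid.reshape(-1, 3)

T, H, W = 21, 30, 52                     # placeholder noise-latent grid size

# Video (noise) tokens use the ordinary grid.
noise_pos = rope_positions(T, H, W)

# Reference tokens sit at the first frame, with spatial indices shifted by the
# noise latent's height and width so they never overlap the video positions.
ref_pos = rope_positions(1, H, W, t_off=0, h_off=H, w_off=W)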

Illustration of the low-noise warmup strategy. When iteration i is smaller than the maximum warmup threshold τ, small timesteps are sampled with higher probability. As the iteration i increases, the probability of sampling large timesteps rises. Once i exceeds τ, the sampling distribution degenerates to uniform sampling.
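One simple way to realize such a schedule is a power-law bias toward small timesteps that anneals to uniform sampling by iteration τ. The sketch below is an assumed instantiation; the exact distribution and the exponent k_max are illustrative, not necessarily those used in the paper.

import torch

def sample_timesteps(batch_size, iteration, tau, num_timesteps=1000, k_max=3.0):
    u = torch.rand(batch_size)
    if iteration < tau:
        # An exponent > 1 pushes samples toward 0 (low noise); it anneals to 1 at tau.
        k = 1.0 + k_max * (1.0 - iteration / tau)
        u = u ** k
    # Once iteration >= tau, this is plain uniform sampling over [0, num_timesteps).
    return (u * num_timesteps).long().clamp(max=num_timesteps - 1)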

Visualization of different batch configurations on the RealisDance-Val dataset. '#' denotes the batch size. We suggest using a large batch size along with a small number of iterations for fine-tuning, which facilitates rapid convergence while maximally preserving the prior knowledge of the foundation model.
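In practice, a large effective batch can be reached on limited GPU memory via gradient accumulation; the toy sketch below illustrates that idea, with a dummy model and placeholder numbers rather than the paper's actual configuration.

import torch
import torch.nn as nn

model = nn.Linear(16, 16)                          # stand-in for the fine-tuned DiT
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

effective_batch, micro_batch = 256, 8              # "large batches"
accum_steps = effective_batch // micro_batch
max_iterations = 100                               # "small iterations" (optimizer steps)

optimizer.zero_grad()
for step in range(max_iterations * accum_steps):
    x = torch.randn(micro_batch, 16)               # dummy micro-batch
    loss = (model(x) - x).pow(2).mean() / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()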

More results

Citation and Related Work

If you find our work useful, please consider citing us!
@article{zhou2025realisdance-dit,
  title={RealisDance-DiT: Simple yet Strong Baseline towards Controllable Character Animation in the Wild},
  author={Zhou, Jingkai and Wu, Yifan and Li, Shikai and Wei, Min and Fan, Chao
          and Chen, Weihua and Jiang, Wei and Wang, Fan},
  journal={arXiv preprint arXiv:2504.14977},
  year={2025}
}
If you would like to control the camera and character poses together, please refer to Uni3C.