TokenWarping

Zero-Shot Video Translation via Token Warping

IEEE TVCG 2025
Training-free • Inversion-free • Zero-shot
Video-to-video translation via token warping in self-attention

Abstract

With the rapid progress of generative AI, video-related tasks have been widely studied. However, current state-of-the-art video models still lag behind image models in visual quality and in the control users have over the generated content. In this paper, we introduce TokenWarping, a novel framework for temporally coherent video translation. Existing diffusion-based video editing approaches rely solely on key and value patches in self-attention to ensure temporal consistency, often sacrificing the preservation of local and structural regions. Critically, these methods overlook the role of the query patches in achieving accurate feature aggregation and temporal coherence. In contrast, TokenWarping leverages complementary token priors by constructing temporal correlations across frames. Our method begins by extracting optical flows from the source video. During the denoising process of the diffusion model, these optical flows are used to warp the previous frame's query, key, and value patches so that they align with the current frame's patches. Warping the query patches directly enhances feature aggregation in self-attention, while warping the key and value patches ensures temporal consistency across frames. This token warping imposes explicit constraints on the outputs of the self-attention layers, effectively ensuring temporally coherent translation. Our framework requires no additional training or fine-tuning and can be seamlessly integrated with existing text-to-image editing methods. Extensive experiments on various video translation tasks demonstrate that TokenWarping surpasses state-of-the-art methods both qualitatively and quantitatively.
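
To make the mechanism concrete, below is a minimal PyTorch sketch (not the released implementation) of flow-guided token warping in a single self-attention layer. It assumes the query/key/value tokens are kept on a spatial grid of shape (B, C, H, W), that flow is the backward optical flow from the current frame to the previous one, and that the warped query is blended with the current query by simple averaging; the function names and the 0.5 blending weight are illustrative choices. For simplicity it attends only to the warped keys and values; the anchor-token combination used in the full model is sketched in the Ablations section.

import torch
import torch.nn.functional as F

def warp_tokens(tokens, flow):
    """Backward-warp a (B, C, H, W) token map with a flow field (B, 2, H, W).
    flow[:, 0] holds the horizontal (x) offsets, flow[:, 1] the vertical (y) offsets."""
    B, C, H, W = tokens.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=tokens.device),
        torch.arange(W, device=tokens.device),
        indexing="ij")
    base = torch.stack((xs, ys)).float()             # (2, H, W) pixel coordinates
    coords = base.unsqueeze(0) + flow                # where to sample in the previous frame
    gx = 2.0 * coords[:, 0] / max(W - 1, 1) - 1.0    # normalize to [-1, 1] for grid_sample
    gy = 2.0 * coords[:, 1] / max(H - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1).to(tokens.dtype)   # (B, H, W, 2), x first
    return F.grid_sample(tokens, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

def warped_self_attention(q_cur, q_prev, k_prev, v_prev, flow):
    """Self-attention for the current frame using flow-aligned tokens from the
    previous frame. All feature maps are (B, C, H, W); flow maps the current
    frame's pixels to their locations in the previous frame (backward flow)."""
    B, C, H, W = q_cur.shape
    q_warp = warp_tokens(q_prev, flow)               # aligned query prior
    k_warp = warp_tokens(k_prev, flow)               # aligned key prior
    v_warp = warp_tokens(v_prev, flow)               # aligned value prior
    q = 0.5 * (q_cur + q_warp)                       # illustrative query blending
    to_seq = lambda x: x.flatten(2).transpose(1, 2)  # (B, H*W, C) token sequences
    out = F.scaled_dot_product_attention(to_seq(q), to_seq(k_warp), to_seq(v_warp))
    return out.transpose(1, 2).reshape(B, C, H, W)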

Overview

TokenWarping Overview

Longer Video Results

Extended Sequences: Our method handles videos of 120 to 240 frames while maintaining robust temporal consistency throughout.
Input (240 frames)
Blue headphones, closed eyes
Input (240 frames)
Dark forest, holy sword
Input (120 frames)
White hair, cartoon style

Optical Flow Results

Input (60 frames)
CG style
Handsome grandpa, white hair
Input
White, snow
Van Gogh style
Input
Snow, scarf, cartoon style
Cheongsam, scarf, cartoon
Input (52 frames)
Pink, CG style
Blue
Input
Cartoon style
Input (80 frames)
Cartoon style
Input
Cotton
Sunflower
Input
Cartoon style
White top
Input
Cartoon style
CG style
Input
Marble sculpture
Greek sculpture

Appearance Flow for Large Gap Editing

Human Pose Editing: We use appearance flow to handle large editing gaps, with pose as an additional condition.
Input
Condition
Pixar style
Panda with moon

Comparison with State-of-the-Art Methods

A fluffy wolf in hand-drawn animation, cartoon style
Input Video
Ours
FRESCO
Flatten
TokenFlow
ControlVideo
Rerender
Text-to-Video-Zero
An orange SUV in sunny snowy winter
Input Video
Ours
FRESCO
Flatten
TokenFlow
ControlVideo
Rerender
Text-to-Video-Zero
A cartoon Spiderman in black suit and black shoes, dancing
Input Video
Ours
FRESCO
Flatten
TokenFlow
ControlVideo
Rerender
Text-to-Video-Zero
A black boxer wearing black boxing gloves punches towards the camera, cartoon style
Input Video
Ours
FRESCO
Flatten
TokenFlow
ControlVideo
Rerender
Text-to-Video-Zero
A white deer in the snow
Input Video
Ours
FRESCO
Flatten (failed: error creating optical flow index)
TokenFlow
ControlVideo
Rerender
Text-to-Video-Zero
A white cat in pink background
Input Video
Ours
FRESCO
Flatten (failed: error creating optical flow index)
TokenFlow
ControlVideo
Rerender
Text-to-Video-Zero
A man in a castle, cartoon style
Input Video
Ours
FRESCO
Flatten
TokenFlow
ControlVideo
Rerender
Text-to-Video-Zero

Comparison Results with Inversion Code

A sculpture of a woman running
Input Video
Ours
FRESCO
Flatten
TokenFlow
ControlVideo
Rerender
Text-to-Video-Zero

Ablations

Exp1: Combining anchor token and warped token yields superior results.
A white deer in the snow
Warping the query token only
Warping query + first-frame KV
Warping query + warped KV
Full: warping query + [first-frame KV, warped KV]
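
A minimal sketch of the "Full" setting above, under the assumption that the anchor tokens are the first frame's keys and values and that they are simply concatenated with the warped keys and values before attention; the exact composition in the paper may differ.

import torch
import torch.nn.functional as F

def anchor_plus_warped_attention(q, k_first, v_first, k_warp, v_warp):
    """q and all K/V tensors are token sequences of shape (B, L, C).
    The current frame attends jointly to the first frame's tokens (anchor)
    and to the flow-warped tokens from the previous frame."""
    k = torch.cat([k_first, k_warp], dim=1)   # (B, 2L, C) extended key bank
    v = torch.cat([v_first, v_warp], dim=1)   # (B, 2L, C) matching value bank
    return F.scaled_dot_product_attention(q, k, v)
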
Exp2: Aligned query patches facilitate correct feature aggregation in attention mechanisms.
A white cat in pink background
Input
Warping KV without warping the query
Warping attention-output features
Full (Ours)
Exp3: ControlNet provides the source video's structure, aiding flow propagation.
ControlNet input
Results without ControlNet
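
As an illustration of how per-frame structural conditioning can be supplied, the sketch below uses ControlNet through the diffusers library; the Canny condition and the model identifiers are example choices rather than the exact configuration used in our experiments.

import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

def translate_frame(frame_rgb: np.ndarray, prompt: str) -> Image.Image:
    """Extract the frame's structure (Canny edges here) and condition the
    diffusion model on it so the edited frame keeps the source layout."""
    gray = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2GRAY)
    edges = cv2.Canny(gray, 100, 200)
    condition = Image.fromarray(np.stack([edges] * 3, axis=-1))
    return pipe(prompt, image=condition, num_inference_steps=20).images[0]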

Challenging Cases

Single scene: Our aligned QKV flow-based attention can tolerate flow errors in scenarios involving single objects and simple backgrounds.
Input
Condition input
A white cat in pink background
Backward flow
Backward flow occlusion mask
Complex scene: Changing the color of a shopping bag results in optical flow failure caused by scene changes. This limitation will be addressed in future work.
Input
Condition input
White hair, white top and jeans, CG style
Backward flow
Occlusion mask
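
For reference, here is a minimal sketch of how the backward flow and occlusion mask visualized above could be computed, using torchvision's RAFT as an example flow estimator (not necessarily the one used in the paper) and a standard forward-backward consistency check with a fixed pixel threshold.

import torch
import torch.nn.functional as F
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

raft = raft_large(weights=Raft_Large_Weights.DEFAULT).eval()

def backwarp(x, flow):
    """Backward-warp x (B, C, H, W) with a flow field (B, 2, H, W)."""
    B, _, H, W = x.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys)).float().to(x.device)   # (2, H, W) pixel coordinates
    coords = base.unsqueeze(0) + flow
    gx = 2.0 * coords[:, 0] / max(W - 1, 1) - 1.0       # normalize to [-1, 1]
    gy = 2.0 * coords[:, 1] / max(H - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1).to(x.dtype)
    return F.grid_sample(x, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

@torch.no_grad()
def backward_flow_and_occlusion(prev, cur, thresh=1.0):
    """prev, cur: (B, 3, H, W) frames preprocessed with the RAFT transforms."""
    fwd = raft(prev, cur)[-1]   # flow prev -> cur
    bwd = raft(cur, prev)[-1]   # flow cur -> prev (the backward flow)
    # Forward-backward consistency: where bwd + warp(fwd) has a large residual,
    # the pixel has no reliable correspondence and is marked as occluded.
    residual = (bwd + backwarp(fwd, bwd)).norm(dim=1)   # (B, H, W)
    return bwd, residual > thresh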

Citation

If you find our work helpful, please consider citing:

@article{zhu2025tokenwarping,
  author  = {Zhu, Haiming and Xu, Yangyang and Yu, Jun and He, Shengfeng},
  title   = {Zero-Shot Video Translation via Token Warping},
  journal = {IEEE Transactions on Visualization and Computer Graphics},
  year    = {2025},
  month   = {11},
  volume  = {PP},
  pages   = {1-11},
  doi     = {10.1109/TVCG.2025.3636949}
}

Acknowledgement

Most of our code is adapted from FRESCO and ControlVideo. We sincerely thank the authors for their great contributions.