Abstract
With the rapid advances in generative AI, video-related tasks have been widely studied. However, current state-of-the-art video models still lag behind image models in visual quality and user control over generated content. In this paper, we introduce TokenWarping, a novel framework for temporally coherent video translation. Existing diffusion-based video editing approaches rely solely on key and value patches in self-attention to ensure temporal consistency, often at the cost of preserving local and structural regions. Critically, these methods overlook the significance of the query patches in achieving accurate feature aggregation and temporal coherence. In contrast, TokenWarping leverages complementary token priors by constructing temporal correlations across different frames. Our method begins by extracting optical flows from source videos. During the denoising process of the diffusion model, these optical flows are used to warp the previous frame's query, key, and value patches, aligning them with the current frame's patches. By directly warping the query patches, we enhance feature aggregation in self-attention, while warping the key and value patches ensures temporal consistency across frames. This token warping imposes explicit constraints on the self-attention layer outputs, effectively ensuring temporally coherent translation. Our framework does not require any additional training or fine-tuning and can be seamlessly integrated with existing text-to-image editing methods. We conduct extensive experiments on various video translation tasks, demonstrating that TokenWarping surpasses state-of-the-art methods both qualitatively and quantitatively.
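To make the token-warping idea concrete, the sketch below shows how flow-aligned previous-frame tokens could enter a self-attention layer. It is a minimal illustration, not our exact implementation: the helper warp_tokens, the blend weight alpha, the flow convention, and the concatenation of warped keys/values are illustrative assumptions.

import torch
import torch.nn.functional as F

def warp_tokens(tokens, flow):
    """Warp a (B, C, H, W) token map with a backward optical flow (B, 2, H, W)
    via bilinear grid sampling. The flow is assumed to map current-frame
    coordinates to the corresponding previous-frame locations (hypothetical)."""
    b, _, h, w = tokens.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=tokens.device, dtype=tokens.dtype),
        torch.arange(w, device=tokens.device, dtype=tokens.dtype),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0) + flow      # (B, 2, H, W)
    grid_x = 2.0 * grid[:, 0] / max(w - 1, 1) - 1.0              # normalize to [-1, 1]
    grid_y = 2.0 * grid[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)                 # (B, H, W, 2)
    return F.grid_sample(tokens, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

def temporally_aligned_attention(q_cur, k_cur, v_cur,
                                 q_prev, k_prev, v_prev,
                                 flow, alpha=0.5):
    """Sketch of flow-guided token warping inside self-attention.
    q_*, k_*, v_* are (B, C, H, W) feature maps of the current/previous frames."""
    # Align previous-frame tokens with the current frame before attention.
    q_warp = warp_tokens(q_prev, flow)
    k_warp = warp_tokens(k_prev, flow)
    v_warp = warp_tokens(v_prev, flow)

    b, c, h, w = q_cur.shape
    flat = lambda t: t.flatten(2).transpose(1, 2)                # (B, HW, C)

    # Blend warped queries into the current queries to steer feature
    # aggregation toward temporally corresponding tokens (alpha is assumed).
    q = flat((1.0 - alpha) * q_cur + alpha * q_warp)

    # Append warped keys/values so current tokens can also attend to the
    # aligned previous-frame content, encouraging temporal coherence.
    k = torch.cat([flat(k_cur), flat(k_warp)], dim=1)
    v = torch.cat([flat(v_cur), flat(v_warp)], dim=1)

    attn = torch.softmax(q @ k.transpose(1, 2) / c ** 0.5, dim=-1)
    out = attn @ v                                               # (B, HW, C)
    return out.transpose(1, 2).reshape(b, c, h, w)

Because the warping only rearranges existing tokens at inference time, a sketch like this slots into a pretrained text-to-image diffusion backbone without any additional training or fine-tuning, which is the property the framework relies on.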
Overview
Longer Video Results
Optical Flow Results
Appearance Flow for Large Gap Editing
Comparison with State-of-the-Art Methods
Comparison Results with Inversion Code
Ablations
Challenging Cases
Citation
If you find our work helpful, please consider citing:
@article{zhu2025tokenwarping,
  author  = {Zhu, Haiming and Xu, Yangyang and Yu, Jun and He, Shengfeng},
  title   = {Zero-Shot Video Translation via Token Warping},
  journal = {IEEE Transactions on Visualization and Computer Graphics},
  year    = {2025},
  month   = {11},
  volume  = {PP},
  pages   = {1-11},
  doi     = {10.1109/TVCG.2025.3636949}
}
Acknowledgement
Most of our code is adapted from FRESCO and ControlVideo. We sincerely thank the authors for their great contributions.