Zero-Shot Video Translation via Token Warping

Supplementary Materials


Longer video results

Input (240 frames)
blue headphones, closed eyes
Input (240 frames)
dark forest, holy sword
Input (120 frames)
white hair, cartoon style


Optical flow results

Input (60 frames)
CG style
A handsome grandpa, white hair
Input
white, snow
Van Gogh style
Input
snow, scarf, cartoon style
cheongsam, scarf, cartoon style
Input
Input (52 frames)
Pink, CG style
Blue
Input
cartoon style
Input (80 frames)
cartoon style
Input
Input
cotton
sunflower
Input
cartoon style
White top
Input
Cartoon style
CG style
Input
marble sculpture
white ancient Greek sculpture


Appearance flow for large gap editing

Input
Condition
pixar style
panda with moon


Comparison with SOTA Methods

A hand-drawn animation of a fluffy wolf, cartoon style
Input Video
Ours
FRESCO
Flatten
TokenFlow
ControlVideo
Rerender
Text-to-Video-Zero
Orange SUV in sunny snow winter
Input Video
Ours
FRESCO
Flatten
TokenFlow
ControlVideo
Rerender
Text-to-Video-Zero
A cartoon spiderman in black suit, black shoes is dancing
Input Video
Ours
FRESCO
Flatten
TokenFlow
ControlVideo
Rerender
Text-to-Video-Zero
A black boxer wearing black boxing gloves punches towards the camera, cartoon style
Input Video
Ours
FRESCO
Flatten
TokenFlow
ControlVideo
Rerender
Text-to-Video-Zero
A white deer in the snow
Input Video
Ours
FRESCO
Flatten: Error creating optical flow index
TokenFlow
ControlVideo
Rerender
Text-to-Video-Zero
A white cat in pink background
Input Video
Ours
FRESCO
Flatten: Error creating optical flow index
TokenFlow
ControlVideo
Rerender
Text-to-Video-Zero
Cartoon style, a man, in a castle
Input Video
Ours
FRESCO
Flatten
TokenFlow
ControlVideo
Rerender
Text-to-Video-Zero


Comparison results with inversion code

A sculpture of a woman running
Input Video
Ours
FRESCO
Flatten
TokenFlow
ControlVideo
Rerender
Text-to-Video-Zero


Ablations

Exp1: Combining anchor tokens and warped tokens gives better results.
A white deer in the snow
Only warping query token
warping query + first kv
warping query + warping kv
full: warping query + [first kv, warping kv]
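The "full" setting in Exp1 attends from warped queries to the concatenation of first-frame (anchor) keys/values and warped keys/values. The following is a minimal NumPy sketch of that concatenation only, assuming single-head scaled dot-product attention; the function name, argument names, and shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def anchor_plus_warped_attention(q_warped, k_first, v_first, k_warped, v_warped):
    """Attend from warped queries to [first kv, warping kv].

    Every input has shape (num_tokens, dim). All names are hypothetical.
    Returns the attention output (num_tokens, dim) and the attention
    weights (num_tokens, 2 * num_tokens).
    """
    # Concatenate anchor and warped tokens along the token axis.
    k = np.concatenate([k_first, k_warped], axis=0)
    v = np.concatenate([v_first, v_warped], axis=0)
    scores = q_warped @ k.T / np.sqrt(q_warped.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

# Toy usage with random tokens.
rng = np.random.default_rng(0)
n, d = 6, 8
q = rng.normal(size=(n, d))
k0, v0 = rng.normal(size=(n, d)), rng.normal(size=(n, d))
kw, vw = rng.normal(size=(n, d)), rng.normal(size=(n, d))
out, attn = anchor_plus_warped_attention(q, k0, v0, kw, vw)
```

Dropping `k_first`/`v_first` or `k_warped`/`v_warped` from the concatenation gives analogues of the two intermediate settings listed above.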
Exp2: Aligned query patches help attention aggregate features correctly.
A white cat in pink background
input
warping kv, no warping query
warping attention-output features
full
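One way to read "warping" in these settings is resampling a frame's token grid along optical flow, so that the token at a given grid position describes the same content across frames. The sketch below is a hedged illustration: nearest-neighbour sampling, border clamping, and the (dx, dy) flow layout in token units are all assumptions, since this page does not specify them.

```python
import numpy as np

def warp_tokens(tokens, backward_flow):
    """Warp a source-frame token grid into the current frame.

    tokens:        (H, W, C) tokens of the source frame.
    backward_flow: (H, W, 2) flow from the current frame to the source
                   frame, stored as (dx, dy) in token units (assumed).
    Returns (H, W, C) with warped[y, x] = tokens[y + dy, x + dx], using
    nearest-neighbour sampling and clamping at the border.
    """
    h, w, _ = tokens.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_x = np.clip(np.rint(xs + backward_flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + backward_flow[..., 1]).astype(int), 0, h - 1)
    return tokens[src_y, src_x]

# Toy usage: a 3x4 grid whose single channel stores the flat token index.
tokens = np.arange(3 * 4, dtype=float).reshape(3, 4, 1)
flow = np.zeros((3, 4, 2))
flow[..., 0] = 1.0  # every token looks one step to the right in the source
warped = warp_tokens(tokens, flow)
identity = warp_tokens(tokens, np.zeros((3, 4, 2)))
```

The same function can be applied to query, key, or value grids, which is the distinction the panels above vary.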
Exp3: ControlNet provides the source video's structure, which helps flow propagation.
ControlNet input
Results of no ControlNet input


Flow in challenging cases

Single scene: our aligned-QKV flow-based attention can tolerate flow errors when the video has a single object and a simple background.
Input
condition input
A white cat in pink background
backward flow
backward flow occlusion mask
Complex scene: the shopping bag changes color because optical flow fails under the scene change; we will address this in future work.
Input
condition input
white hair, white top and jeans, CG style
backward flow
occlusion mask
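This page does not state how the occlusion masks shown above are computed. A common approach is a forward-backward consistency check: follow the backward flow to the matched location, add the forward flow found there, and flag pixels whose round trip does not return close to the start. The sketch below assumes that approach; the function name, the (dx, dy) layout, and the tolerance `tol` are hypothetical.

```python
import numpy as np

def occlusion_mask(backward_flow, forward_flow, tol=1.0):
    """Forward-backward consistency check (an assumed method).

    backward_flow: (H, W, 2) flow from the current frame to the source frame.
    forward_flow:  (H, W, 2) flow from the source frame to the current frame.
    Both store (dx, dy) in pixels. Returns a boolean (H, W) mask that is
    True where the round trip is off by more than `tol` pixels or where
    the match falls outside the frame.
    """
    h, w, _ = backward_flow.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    tx = xs + backward_flow[..., 0]
    ty = ys + backward_flow[..., 1]
    out_of_frame = (tx < 0) | (tx > w - 1) | (ty < 0) | (ty > h - 1)
    ix = np.clip(np.rint(tx).astype(int), 0, w - 1)
    iy = np.clip(np.rint(ty).astype(int), 0, h - 1)
    # For consistent flows the two vectors cancel out.
    round_trip = backward_flow + forward_flow[iy, ix]
    error = np.linalg.norm(round_trip, axis=-1)
    return out_of_frame | (error > tol)

# Toy usage: uniform one-pixel shift with one inconsistent source pixel.
bwd = np.zeros((4, 5, 2)); bwd[..., 0] = 1.0
fwd = np.zeros((4, 5, 2)); fwd[..., 0] = -1.0
fwd[2, 3] = (2.0, 0.0)  # corrupt the forward flow at source pixel (x=3, y=2)
mask = occlusion_mask(bwd, fwd)
```

In the toy example the last column is flagged because its match leaves the frame, and pixel (x=2, y=2) is flagged because it maps to the corrupted source pixel.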