DreamMotion: Space-Time Self-Similarity Score Distillation for Zero-Shot Video Editing


Anonymous Authors

Teaser

Input video
→ Crow
→ Pigeon
→ Chicken on the ice
→ Stork on the snow
→ Duck on the mud
→ Eagle on the mud
→ Owl on the grass
→ Flamingo on the grass

Input video
→ Astronaut
→ Firefighter
→ Oil painting
→ Pixel art
→ Watercolor painting



DreamMotion with Zeroscope T2V

Input video
Masked Regions
   
→ Taxi,
under sunset
→ School bus,
under aurora
→ Truck,
under fireworks
→ Vintage car,
under dark clouds

Input video
Masked Regions
   
→ Convertible
→ Police car
→ Porsche
→ Lamborghini, on sunset

Input video
Masked Regions
   
→ Fox
→ Horse
→ Tiger
→ Goat

Input video
Masked Regions
   
Man → Child
Dog → Corgi
Man → Child
Dog → Pig
Man → Woman
Dog → Goat
Man → Woman
Dog → Tiger

Input video
Masked Regions
   
→ Chicken
→ Duck
→ Eagle
→ Flamingo
→ Pigeon



DreamMotion with Show-1 Cascaded T2V

Input video
Masked Regions
     
→ Shark, under water
→ Spaceship, in space

Input video
Masked Regions
     
→ Boat, on the sea
→ Military aircraft

Input video
Masked Regions
     
→ Buses
→ Locomotives

Input video
Masked Regions
→ Pink swan
   
Input video
Masked Regions
→ Lamborghinis



Comparison to Baselines

A dog is jumping into a river. → A horse is jumping into a river.


Input video
DreamMotion w/ Zeroscope
 
Tune-A-Video
 
ControlVideo
 
Masked Regions
Control-A-Video
Gen-1
TokenFlow


A seagull is walking. → A duck is walking on the mud.


Input video
DreamMotion w/ Zeroscope
 
Tune-A-Video
 
ControlVideo
 
Masked Regions
Control-A-Video
Gen-1
TokenFlow


A car is driving on the road. → A lamborghini is driving on the road, on sunset.


Input video
DreamMotion w/ Zeroscope
 
Tune-A-Video
 
ControlVideo
 
Masked Regions
Control-A-Video
Gen-1
TokenFlow


A man is skateboarding. → A firefighter is skateboarding.


Input video
Masked Regions
 
DreamMotion w/ Show-1
 
DDIM inversion + Word swap
 
VMC
 

Cars are running on the bridge. → Buses are running on the bridge.


Input video
Masked Regions
 
DreamMotion w/ Show-1
 
DDIM inversion + Word swap
 
VMC
 



Ablation: Roughly annotated masks effectively filter noisy gradients

 
Input video
with masks annotated
  
With mask
→ German shepherd
Without mask
→ German shepherd
 
Input video
with masks annotated
  
With mask
→ Tiger
Without mask
→ Tiger


Ablation: Appearance injection necessitates space-time self-similarity

 
 
Input video
     
Appearance Injection
with Structure Correction
→ Spider Man
Appearance Injection
without Structure Correction
→ Spider Man

 
Input video
   
$\mathcal{L}_{\text{V-DDS}} + \mathcal{L}_{\text{S-SSM}} + \mathcal{L}_{\text{T-SSM}}$
→ Flamingo
$\mathcal{L}_{\text{V-DDS}} + \mathcal{L}_{\text{S-SSM}}$
→ Flamingo
$\mathcal{L}_{\text{V-DDS}}$
→ Flamingo
 
   
$\mathcal{L}_{\text{V-DDS}} + \mathcal{L}_{\text{S-SSM}} + \mathcal{L}_{\text{T-SSM}}$
→ Eagle
$\mathcal{L}_{\text{V-DDS}} + \mathcal{L}_{\text{S-SSM}}$
→ Eagle
$\mathcal{L}_{\text{V-DDS}}$
→ Eagle
 
   
$\mathcal{L}_{\text{V-DDS}} + \mathcal{L}_{\text{S-SSM}} + \mathcal{L}_{\text{T-SSM}}$
→ Chicken
$\mathcal{L}_{\text{V-DDS}} + \mathcal{L}_{\text{S-SSM}}$
→ Chicken
$\mathcal{L}_{\text{V-DDS}}$
→ Chicken

 
Input video
   
$\mathcal{L}_{\text{V-DDS}} + \mathcal{L}_{\text{S-SSM}} + \mathcal{L}_{\text{T-SSM}}$
→ Corgi
$\mathcal{L}_{\text{V-DDS}} + \mathcal{L}_{\text{S-SSM}}$
→ Corgi
$\mathcal{L}_{\text{V-DDS}}$
→ Corgi
 


Ablation: Visualization of Optimization Progress

Optimize:   Car → School bus, under aurora
 

Optimize:   Dog → Fox
 

Optimize:   Man → Astronaut
 


Additional Comparisons to Baselines (Video-P2P[1], DMT[2])

A seagull is walking. → A flamingo is walking on the grass.


Input video
  
DreamMotion w/ Zeroscope
 
Video-P2P
DMT
 

A car is driving on the road. → A lamborghini is driving on the road, on sunset.


Input video
  
DreamMotion w/ Zeroscope
 
Video-P2P
DMT
 

A man is walking a dog on the road. → A child is walking a pig on the road.


Input video
  
DreamMotion w/ Zeroscope
 
Video-P2P
DMT
 

A man is walking a dog on the road. → A woman is walking a tiger on the road.


Input video
  
DreamMotion w/ Zeroscope
 
Video-P2P
DMT
 

A car is driving on the road under the sky. → A school bus is driving on the road under aurora.


Input video
  
DreamMotion w/ Zeroscope
 
Video-P2P
DMT
 

A car is driving on the road under the sky. → A truck is driving on the road under fireworks.


Input video
  
DreamMotion w/ Zeroscope
 
Video-P2P
DMT
 

[1] Liu, Shaoteng, et al. "Video-p2p: Video editing with cross-attention control." CVPR 2024.

[2] Yatim, Danah, et al. "Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer." CVPR 2024.




Additional Video Style Transfer Results

Input video
   
→ Pixel art
→ Watercolor painting


References

• Sterling, Spencer. Zeroscope. https://huggingface.co/cerspense/zeroscope_v2_576w (2023).

• Zhang, David Junhao, et al. "Show-1: Marrying pixel and latent diffusion models for text-to-video generation." arXiv preprint arXiv:2309.15818 (2023).

• Wu, Jay Zhangjie, et al. "Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation." ICCV 2023.

• Zhang, Yabo, et al. "Controlvideo: Training-free controllable text-to-video generation." ICLR 2024.

• Chen, Weifeng, et al. "Control-a-video: Controllable text-to-video generation with diffusion models." arXiv preprint arXiv:2305.13840 (2023).

• Esser, Patrick, et al. "Structure and content-guided video synthesis with diffusion models." ICCV 2023.

• Geyer, Michal, et al. "Tokenflow: Consistent diffusion features for consistent video editing." ICLR 2024.

• Jeong, Hyeonho, et al. "VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models." CVPR 2024.