DreamMotion

Teaser

Input video	→ Crow	→ Pigeon	→ Chicken on the ice	→ Stork on the snow	→ Duck on the mud	→ Eagle on the mud	→ Owl on the grass	→ Flamingo on the grass

Input video	→ Astronaut	→ Firefighter	→ Oil painting	→ Pixel art	→ Watercolor painting

DreamMotion with Zeroscope T2V

Input video	Masked Regions		→ Taxi, under sunset	→ School bus, under aurora	→ Truck, under fireworks	→ Vintage car, under dark clouds

Input video	Masked Regions		→ Convertible	→ Police car	→ Porsche	→ Lamborghini, on sunset

Input video	Masked Regions		→ Fox	→ Horse	→ Tiger	→ Goat

Input video	Masked Regions		Man → Child, Dog → Corgi	Man → Child, Dog → Pig	Man → Woman, Dog → Goat	Man → Woman, Dog → Tiger

Input video	Masked Regions		→ Chicken	→ Duck	→ Eagle	→ Flamingo	→ Pigeon

DreamMotion with Show-1 Cascaded T2V

Input video	Masked Regions		→ Shark, under water	→ Spaceship, in space

Input video	Masked Regions		→ Boat, on the sea	→ Military aircraft

Input video	Masked Regions		→ Buses	→ Locomotives

Input video	Masked Regions	→ Pink swan		Input video	Masked Regions	→ Lamborghinis

Comparison to Baselines

A dog is jumping into a river. → A horse is jumping into a river.

Input video	DreamMotion w/ Zeroscope	Tune-A-Video	ControlVideo
Masked Regions	Control-A-Video	Gen-1	TokenFlow

A seagull is walking. → A duck is walking on the mud.

Input video	DreamMotion w/ Zeroscope	Tune-A-Video	ControlVideo
Masked Regions	Control-A-Video	Gen-1	TokenFlow

A car is driving on the road. → A Lamborghini is driving on the road, on sunset.

Input video	DreamMotion w/ Zeroscope	Tune-A-Video	ControlVideo
Masked Regions	Control-A-Video	Gen-1	TokenFlow

A man is skateboarding. → A firefighter is skateboarding.

Input video	Masked Regions		DreamMotion w/ Show-1	DDIM inversion + Word swap	VMC

Cars are running on the bridge. → Buses are running on the bridge.

Input video	Masked Regions		DreamMotion w/ Show-1	DDIM inversion + Word swap	VMC

Ablation: Roughly annotated masks effectively filter noisy gradients

Input video with masks annotated		With mask → German shepherd	Without mask → German shepherd

Input video with masks annotated		With mask → Tiger	Without mask → Tiger
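
The idea in this heading can be illustrated with a minimal sketch. It assumes the rough mask is a binary array applied element-wise to the update gradient, so that locations outside the annotated region receive no update. The function name `mask_gradient` and the toy values are hypothetical and are not taken from the DreamMotion code.

```python
def mask_gradient(grad, mask):
    """Zero out gradient entries outside the annotated region.

    grad and mask are same-shaped 2D lists (floats and 0/1 flags).
    Assumption for illustration: the mask simply gates the gradient.
    """
    same_rows = len(grad) == len(mask)
    if not same_rows or any(len(g) != len(m) for g, m in zip(grad, mask)):
        raise ValueError("grad and mask must have the same shape")
    return [[g * m for g, m in zip(g_row, m_row)]
            for g_row, m_row in zip(grad, mask)]


# Toy 2x3 gradient: the right-hand column lies outside the rough mask.
grad = [[0.5, -1.0, 2.0],
        [1.5, 0.25, -3.0]]
mask = [[1, 1, 0],
        [1, 1, 0]]
masked = mask_gradient(grad, mask)
```

In this toy case the two masked-out entries become zero while the remaining entries pass through unchanged, which is the sense in which a coarse mask can suppress gradients in regions that should not be edited.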

Ablation: Appearance injection necessitates space-time self-similarity

Input video		Appearance Injection with Structure Correction → Spider Man	Appearance Injection without Structure Correction → Spider Man

Input video		$\mathcal{L}_{\text{V-DDS}} + \mathcal{L}_{\text{S-SSM}} + \mathcal{L}_{\text{T-SSM}}$ → Flamingo	$\mathcal{L}_{\text{V-DDS}} + \mathcal{L}_{\text{S-SSM}}$ → Flamingo	$\mathcal{L}_{\text{V-DDS}}$ → Flamingo
		$\mathcal{L}_{\text{V-DDS}} + \mathcal{L}_{\text{S-SSM}} + \mathcal{L}_{\text{T-SSM}}$ → Eagle	$\mathcal{L}_{\text{V-DDS}} + \mathcal{L}_{\text{S-SSM}}$ → Eagle	$\mathcal{L}_{\text{V-DDS}}$ → Eagle
		$\mathcal{L}_{\text{V-DDS}} + \mathcal{L}_{\text{S-SSM}} + \mathcal{L}_{\text{T-SSM}}$ → Chicken	$\mathcal{L}_{\text{V-DDS}} + \mathcal{L}_{\text{S-SSM}}$ → Chicken	$\mathcal{L}_{\text{V-DDS}}$ → Chicken

Input video		$\mathcal{L}_{\text{V-DDS}} + \mathcal{L}_{\text{S-SSM}} + \mathcal{L}_{\text{T-SSM}}$ → Corgi	$\mathcal{L}_{\text{V-DDS}} + \mathcal{L}_{\text{S-SSM}}$ → Corgi	$\mathcal{L}_{\text{V-DDS}}$ → Corgi
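
The labels above name three loss terms that this page does not define. As a rough illustration only, the sketch below assumes that the S-SSM and T-SSM terms compare self-similarity matrices of source and edited features: vectors at spatial locations within one frame for the spatial term, and one descriptor per frame for the temporal term. The helper names, the cosine similarity, and the squared-difference penalty are assumptions made for this example, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / max(norm, 1e-12)  # guard against zero-norm vectors


def self_similarity(features):
    """Pairwise cosine similarity between all feature vectors."""
    return [[cosine(u, v) for v in features] for u in features]


def ssm_loss(src_features, edit_features):
    """Sum of squared differences between the two self-similarity matrices."""
    src = self_similarity(src_features)
    edit = self_similarity(edit_features)
    return sum((a - b) ** 2
               for s_row, e_row in zip(src, edit)
               for a, b in zip(s_row, e_row))


# Three toy feature vectors (spatial locations, or frames for a temporal term).
source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rescaled = [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]   # appearance changed, relations kept
collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]  # relations between vectors lost

loss_preserved = ssm_loss(source, rescaled)
loss_collapsed = ssm_loss(source, collapsed)
```

Because the penalty depends only on how the vectors relate to each other, rescaling every feature leaves it near zero, while collapsing all vectors onto one direction is penalized. That is the intuition behind using such a term to preserve structure or motion while the appearance changes.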

Ablation: Visualization of Optimization Progress


Optimize: Car → School bus, under aurora


Optimize: Dog → Fox


Optimize: Man → Astronaut

Additional Comparisons to Baselines (Video-P2P[1], DMT[2])

A seagull is walking. → A flamingo is walking on the grass.

Input video		DreamMotion w/ Zeroscope	Video-P2P	DMT

A car is driving on the road. → A Lamborghini is driving on the road, on sunset.

Input video		DreamMotion w/ Zeroscope	Video-P2P	DMT

A man is walking a dog on the road. → A child is walking a pig on the road.

Input video		DreamMotion w/ Zeroscope	Video-P2P	DMT

A man is walking a dog on the road. → A woman is walking a tiger on the road.

Input video		DreamMotion w/ Zeroscope	Video-P2P	DMT

A car is driving on the road under the sky. → A school bus is driving on the road under aurora.

Input video		DreamMotion w/ Zeroscope	Video-P2P	DMT

A car is driving on the road under the sky. → A truck is driving on the road under fireworks.

Input video		DreamMotion w/ Zeroscope	Video-P2P	DMT

[1] Liu, Shaoteng, et al. "Video-P2P: Video editing with cross-attention control." CVPR 2024.

[2] Yatim, Danah, et al. "Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer." CVPR 2024.

Additional Video Style Transfer Results

Input video		→ Pixel art	→ Watercolor painting

References

• Sterling, Spencer. Zeroscope. https://huggingface.co/cerspense/zeroscope_v2_576w (2023).

• Zhang, David Junhao, et al. "Show-1: Marrying pixel and latent diffusion models for text-to-video generation." arXiv preprint arXiv:2309.15818 (2023).

• Wu, Jay Zhangjie, et al. "Tune-A-Video: One-shot tuning of image diffusion models for text-to-video generation." ICCV 2023.

• Zhang, Yabo, et al. "ControlVideo: Training-free controllable text-to-video generation." ICLR 2024.

• Chen, Weifeng, et al. "Control-A-Video: Controllable text-to-video generation with diffusion models." arXiv preprint arXiv:2305.13840 (2023).

• Esser, Patrick, et al. "Structure and content-guided video synthesis with diffusion models." ICCV 2023.

• Geyer, Michal, et al. "TokenFlow: Consistent diffusion features for consistent video editing." ICLR 2024.

• Jeong, Hyeonho, et al. "VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models." CVPR 2024.

DreamMotion: Space-Time Self-Similarity Score Distillation for Zero-Shot Video Editing
