CVPR 2026

PhysVid: Physics Aware Local Conditioning for Generative Video Models

Saurabh Pathak · Elahe Arani · Mykola Pechenizkiy · Bahram Zonooz

Eindhoven University of Technology

Videos generated by PhysVid (1.7B parameters) vs Wan-14B on VideoPhy captions. PhysVid achieves better physical realism despite the smaller model size.

PhysVid
Wan-14B

A wine bottle pours a red blend into a glass.

PhysVid
Wan-14B

A blender spins, mixing squeezed juice within it.

PhysVid
Wan-14B

A car gliding over a road slick with rainwater.

PhysVid
Wan-14B

A waterfall cascades over jagged rocks.

PhysVid
Wan-14B

An electric beater whips cream in a bowl.

PhysVid
Wan-14B

Water flows freely from a fully turned faucet.

PhysVid
Wan-14B

Hand holds the phone.

PhysVid
Wan-14B

Skateboard rolls swiftly over the bumpy sidewalk.

PhysVid
Wan-14B

Raindrops disturb quiet puddles.

PhysVid
Wan-14B

Honey pours into a cup of tea.

PhysVid
Wan-14B

Cheese is grating through the stainless steel grater.

PhysVid
Wan-14B

Water gushes from a green garden hose.

Abstract

Generative video models achieve high visual fidelity but often violate basic physical principles, limiting reliability in real‑world settings. Prior attempts to inject physics rely on conditioning: frame‑level signals are domain‑specific and short‑horizon, while global text prompts are coarse and noisy, missing fine‑grained dynamics. We present PhysVid, a physics‑aware local conditioning scheme that operates over temporally contiguous chunks of frames. Each chunk is annotated with physics‑grounded descriptions of states, interactions, and constraints, which are fused with the global prompt via chunk‑aware cross‑attention during training. At inference, we introduce negative physics prompts (descriptions of locally relevant law violations) to steer generation away from implausible trajectories. On VideoPhy, PhysVid improves physical commonsense scores by 33% over baseline video generators, and by up to 8% on VideoPhy2. These results show that local, physics‑aware guidance substantially increases physical plausibility in generative video and marks a step toward physics‑grounded video models.
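The negative physics prompts mentioned above can be illustrated with the negative-prompt guidance rule commonly used in diffusion-style samplers. This is a minimal sketch, not PhysVid's confirmed formulation: this page does not give the guidance equation. The function name `guided_prediction`, the `scale` parameter, and the toy vectors are hypothetical. The idea is that the sampler extrapolates away from the model output conditioned on the violation description and toward the output conditioned on the physics-grounded description.

```python
# Sketch of negative-prompt guidance (an assumption, not the paper's exact rule).
# pred_pos: model output conditioned on the physics-grounded (positive) prompt.
# pred_neg: model output conditioned on the negative physics prompt.

def guided_prediction(pred_pos, pred_neg, scale):
    """Extrapolate from the negative-prompt prediction toward the positive one."""
    return [n + scale * (p - n) for p, n in zip(pred_pos, pred_neg)]


# Toy values standing in for one denoising step's model outputs.
pred_pos = [1.0, 0.0, 2.0]
pred_neg = [0.0, 0.0, 4.0]

print(guided_prediction(pred_pos, pred_neg, 1.0))  # scale 1 returns pred_pos
print(guided_prediction(pred_pos, pred_neg, 3.0))  # larger scale pushes further from pred_neg
```

With `scale` equal to 1 the negative prompt has no effect. Larger values move the result further from the prediction associated with the physical violation.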

Contributions

Approach

PhysVid aims to improve the physical plausibility of the phenomena observable in generated videos. To that end, we add text conditioning that describes the local physical phenomena observed within short temporal segments (chunks) of the video. This local conditioning is used together with the global text-to-video (T2V) prompt conditioning to enhance the physical realism of the generated videos.

Figure: Architecture of PhysVid, showing the local information pathways with chunk-aware cross-attention.
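One way to picture chunk-aware cross-attention is as a mask over text tokens. This is a hedged sketch under stated assumptions, since this page does not specify the mechanism. It assumes that frames are grouped into fixed-size contiguous chunks, that every frame may attend to all global prompt tokens, and that a frame may attend only to the local physics tokens of its own chunk. The names `chunk_ids` and `cross_attention_mask` and the token counts are hypothetical.

```python
# Sketch of a chunk-aware cross-attention mask (assumed design, not the paper's code).
# Text token layout per row: [global tokens | chunk 0 local tokens | chunk 1 local tokens | ...]

def chunk_ids(num_frames, chunk_size):
    """Assign each frame index to a temporally contiguous chunk."""
    return [frame // chunk_size for frame in range(num_frames)]


def cross_attention_mask(num_frames, chunk_size, n_global, n_local):
    """Return one boolean row per frame; True means the frame may attend to that token."""
    ids = chunk_ids(num_frames, chunk_size)
    num_chunks = -(-num_frames // chunk_size)  # ceiling division
    mask = []
    for own_chunk in ids:
        row = [True] * n_global  # the global prompt is visible to every frame
        for chunk in range(num_chunks):
            # local physics tokens are visible only to frames in the same chunk
            row += [chunk == own_chunk] * n_local
        mask.append(row)
    return mask


# 6 frames in chunks of 3, with 2 global tokens and 2 local tokens per chunk.
for row in cross_attention_mask(num_frames=6, chunk_size=3, n_global=2, n_local=2):
    print(row)
```

In an actual model the mask would be applied to the attention logits of the video latent tokens belonging to each frame. The sketch only shows which text tokens each frame would be allowed to see.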

Ground Truth (WISA) vs PhysVid Comparisons

Comparison between ground truth videos and PhysVid-generated videos for the same captions.

Real
Generated

A car is parked on a flooded street at night with its headlights on as it rains heavily. Buildings are visible in the background.

Real
Generated

A woman jogs along a riverbank with earphones on, wearing a beige puffer vest over a pink long-sleeve shirt.

Real
Generated

A person uses a hair dryer to blow a piece of paper suspended in the air, causing it to move back and forth. A pink cone-shaped object is visible in the background.

Real
Generated

A group of dolphins swim in the ocean. They move in different directions, sometimes close together and sometimes apart. The water is clear and blue, and the light shines through from above.

Real
Generated

A man uses a fire extinguisher to put out a fire in a container, producing a lot of smoke. The background features a fence with pink flags hanging on it.

Real
Generated

A woman in a colorful outfit performs a dance routine in front of a mirror in a well-lit room with white walls and a light-colored floor.

Real
Generated

A person is making a rubber band ball by wrapping rubber bands around their hand and then placing the ball on a table.

Real
Generated

Ducks swimming in a lake at sunset. The sky is filled with clouds and the sun is setting, creating a beautiful reflection on the water.

Real
Generated

A busy street at night with people walking, talking, and driving. Motorcycles are parked on the side of the road, and cars are passing by. The buildings in the background have signs and lights. The streetlights illuminate the scene.

Real
Generated

A pot with steam rising from it against a black background. The steam is thick and white, and it's rising steadily from the pot. The pot appears to be made of metal and has two handles on either side.

Real
Generated

A close-up of flames consuming wood. The flames are bright orange and yellow, and they are moving in a fluid motion. The wood is dark brown and blackened by the fire.

Real
Generated

A stack of wood is being crushed by a large metal cylinder. The wood is light brown and appears to be untreated. The cylinder is metallic and has a dull finish. The wood is compressed and broken into smaller pieces as the cylinder moves down. The background is a plain, light-colored wall. The video is a close-up shot, focusing on the action of the cylinder crushing the wood.

Ablations

Qualitative comparison of model variants on the same prompts.

Wan-1.3B
WISA Finetuned
PhysVid (w/o Counterfactual Guidance)
PhysVid

A mixing spoon stirring hot chocolate in a cup.

Wan-1.3B
WISA Finetuned
PhysVid (w/o Counterfactual Guidance)
PhysVid

A reed diffuser diffusing perfume oil into the room.

Wan-1.3B
WISA Finetuned
PhysVid (w/o Counterfactual Guidance)
PhysVid

Wooden swing dangles over the sand in the sandpit.

Wan-1.3B
WISA Finetuned
PhysVid (w/o Counterfactual Guidance)
PhysVid

A wine bottle pours a red blend into a glass.

Wan-1.3B
WISA Finetuned
PhysVid (w/o Counterfactual Guidance)
PhysVid

Paint swirling in jar of water.

Failure Cases

Representative failure cases of PhysVid.

Limbs bend in implausible directions.

Splashes are over-pronounced and do not settle realistically.

Objects vanish and morph implausibly.

Implausible interaction between fire, water and human skin.

Fluid volume is not conserved and appears to vanish.

Object does not obey gravity.

Object materializing out of another.

Objects drift and do not conserve momentum as expected.

Objects interpenetrate and morph instead of maintaining their identities.

Object moves, disappears and reappears without a visible cause.

Unnatural object interaction.

Liquids do not mix and exhibit unnatural dynamics.

Citation

If you find this useful, please cite:

@inproceedings{pathak2026physvid,
  title     = {PhysVid: Physics Aware Local Conditioning for Generative Video Models},
  author    = {Pathak, Saurabh and Arani, Elahe and Pechenizkiy, Mykola and Zonooz, Bahram},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}

Acknowledgements

This work is supported by the EU-funded SYNERGIES project (Grant Agreement No. 101146542). We also gratefully acknowledge the TUE supercomputing team for providing the SPIKE-1 compute infrastructure used to carry out the experiments reported in this paper.