Abstract
Generative video models achieve high visual fidelity but often violate basic physical principles, limiting reliability in real‑world settings. Prior attempts to inject physics rely on conditioning: frame‑level signals are domain‑specific and short‑horizon, while global text prompts are coarse and noisy, missing fine‑grained dynamics. We present PhysVid, a physics‑aware local conditioning scheme that operates over temporally contiguous chunks of frames. Each chunk is annotated with physics‑grounded descriptions of states, interactions, and constraints, which are fused with the global prompt via chunk‑aware cross‑attention during training. At inference, we introduce negative physics prompts (descriptions of locally relevant law violations) to steer generation away from implausible trajectories. On VideoPhy, PhysVid improves physical commonsense scores by 33% over baseline video generators, and by up to 8% on VideoPhy2. These results show that local, physics‑aware guidance substantially increases physical plausibility in generative video and marks a step toward physics‑grounded video models.
Contributions
- We incorporate additional text conditioning pathways into a text-to-video (T2V) generator. In contrast to frame-level action conditioning and global text conditioning, our method acts on groups of frames. Working at the chunk level preserves enough temporal information to observe physical phenomena such as motion locally, while leaving out parts of the global text that are irrelevant to the chunk.
- We create a separate text prompt for each chunk using a VLM. During generation, each chunk is conditioned on its own physics-based text prompt in addition to the global T2V prompt.
- During inference, we also generate counterfactual prompts for each chunk that describe violations of locally observable physical laws. We use these prompts to guide video generation away from physically implausible scenarios.
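To make the role of the counterfactual prompts concrete, here is a minimal sketch of per-chunk negative-prompt guidance. The exact guidance rule used by PhysVid is not stated here, so the code assumes the common classifier-free-guidance form in which the negative-prompt prediction takes the place of the unconditional one. The names `guided_prediction`, `guide_per_chunk`, and `scale` are hypothetical, and plain lists of floats stand in for model outputs.

```python
from typing import List

Vector = List[float]


def guided_prediction(pos: Vector, neg: Vector, scale: float) -> Vector:
    """Extrapolate from the negative-prompt prediction toward the positive one.

    Assumed rule (not confirmed by the source): out = neg + scale * (pos - neg).
    scale = 0 returns the negative prediction, scale = 1 the positive one,
    and scale > 1 pushes further away from the negative prediction.
    """
    if len(pos) != len(neg):
        raise ValueError("pos and neg must have the same length")
    return [n + scale * (p - n) for p, n in zip(pos, neg)]


def guide_per_chunk(
    pos_chunks: List[Vector], neg_chunks: List[Vector], scale: float
) -> List[Vector]:
    """Apply the same rule independently to each chunk's pair of predictions."""
    if len(pos_chunks) != len(neg_chunks):
        raise ValueError("expected one negative prediction per chunk")
    return [guided_prediction(p, n, scale) for p, n in zip(pos_chunks, neg_chunks)]


if __name__ == "__main__":
    # Two chunks, each with a toy two-dimensional prediction.
    pos = [[1.0, 2.0], [0.5, 0.5]]
    neg = [[0.0, 1.0], [0.5, 1.5]]
    print(guide_per_chunk(pos, neg, scale=2.0))
```

Because each chunk carries its own negative prediction, the guidance direction can differ from chunk to chunk, which is what lets a violation that is only relevant in one part of the video be suppressed there without affecting the rest.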
Approach
Our goal is to improve the physical plausibility of the phenomena observable in generated videos. To that end, we add text conditioning based on the local physical phenomena observed within shorter temporal segments (chunks) of the video. This local conditioning is used together with the global T2V conditioning to enhance the physical realism of the generated videos.
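One way to picture the combination of local and global conditioning is as a visibility mask over text tokens: every frame may attend to the global prompt, but only to the local prompt of its own chunk. The sketch below builds such a mask. The token layout (global tokens first, then one block of local tokens per chunk), the fixed chunk size, and the names `split_into_chunks` and `conditioning_mask` are illustrative assumptions, not details taken from the PhysVid implementation.

```python
from typing import List, Tuple


def split_into_chunks(num_frames: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) frame ranges that cover the clip in order."""
    if num_frames <= 0 or chunk_size <= 0:
        raise ValueError("num_frames and chunk_size must be positive")
    return [
        (start, min(start + chunk_size, num_frames))
        for start in range(0, num_frames, chunk_size)
    ]


def conditioning_mask(
    num_frames: int, chunk_size: int, n_global: int, n_local: int
) -> List[List[bool]]:
    """Build a frames x text-tokens mask; True means the frame may attend.

    Assumed column layout: [global tokens | chunk 0 tokens | chunk 1 tokens | ...].
    """
    chunks = split_into_chunks(num_frames, chunk_size)
    width = n_global + len(chunks) * n_local
    mask = []
    for frame in range(num_frames):
        chunk_idx = frame // chunk_size
        row = [False] * width
        # Every frame sees the global prompt.
        for col in range(n_global):
            row[col] = True
        # Each frame sees only the local prompt of its own chunk.
        first = n_global + chunk_idx * n_local
        for col in range(first, first + n_local):
            row[col] = True
        mask.append(row)
    return mask


if __name__ == "__main__":
    print(split_into_chunks(10, 4))
    for row in conditioning_mask(num_frames=4, chunk_size=2, n_global=1, n_local=2):
        print(row)
```

A mask of this shape could be passed to a cross-attention layer so that irrelevant local descriptions never reach a frame, while the global prompt still ties all chunks to the same scene.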
Ground Truth (WISA) vs PhysVid Comparisons
Comparison between ground truth videos and PhysVid-generated videos for the same captions.
A car is parked on a flooded street at night with its headlights on as it rains heavily. Buildings are visible in the background.
A woman jogs along a riverbank with earphones on, wearing a beige puffer vest over a pink long-sleeve shirt.
A person uses a hair dryer to blow a piece of paper suspended in the air, causing it to move back and forth. A pink cone-shaped object is visible in the background.
A group of dolphins swim in the ocean. They move in different directions, sometimes close together and sometimes apart. The water is clear and blue, and the light shines through from above.
A man uses a fire extinguisher to put out a fire in a container, producing a lot of smoke. The background features a fence with pink flags hanging on it.
A woman in a colorful outfit performs a dance routine in front of a mirror in a well-lit room with white walls and a light-colored floor.
A person is making a rubber band ball by wrapping rubber bands around their hand and then placing the ball on a table.
Ducks swimming in a lake at sunset. The sky is filled with clouds and the sun is setting, creating a beautiful reflection on the water.
A busy street at night with people walking, talking, and driving. Motorcycles are parked on the side of the road, and cars are passing by. The buildings in the background have signs and lights. The streetlights illuminate the scene.
A pot with steam rising from it against a black background. The steam is thick and white, and it's rising steadily from the pot. The pot appears to be made of metal and has two handles on either side.
A close-up of flames consuming wood. The flames are bright orange and yellow, and they are moving in a fluid motion. The wood is dark brown and blackened by the fire.
A stack of wood is being crushed by a large metal cylinder. The wood is light brown and appears to be untreated. The cylinder is metallic and has a dull finish. The wood is compressed and broken into smaller pieces as the cylinder moves down. The background is a plain, light-colored wall. The video is a close-up shot, focusing on the action of the cylinder crushing the wood.
Ablations
Qualitative comparison of model variants on the same prompts.
A mixing spoon stirring hot chocolate in a cup.
A reed diffuser diffusing perfume oil into the room.
Wooden swing dangles over the sand in the sandpit.
A wine bottle pours a red blend into a glass.
Paint swirling in jar of water.
Failure Cases
Representative failure cases of PhysVid.
Limbs bend in implausible directions.
Splashes are over-pronounced and do not settle realistically.
Objects vanish and morph implausibly.
Implausible interaction between fire, water and human skin.
Fluid volume is not conserved and appears to vanish.
Object does not obey gravity.
Object materializing out of another.
Objects drift and do not conserve momentum as expected.
Objects interpenetrate and morph instead of maintaining their identities.
Object moves, disappears and reappears without a visible cause.
Unnatural object interaction.
Liquids do not mix and exhibit unnatural dynamics.
Citation
If you find this useful, please cite:
@inproceedings{pathak2026physvid,
title = {PhysVid: Physics Aware Local Conditioning for Generative Video Models},
author = {Pathak, Saurabh and Arani, Elahe and Pechenizkiy, Mykola and Zonooz, Bahram},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026}
}
Acknowledgements
This work is supported by the EU-funded SYNERGIES project (Grant Agreement No. 101146542). We also gratefully acknowledge the TUE supercomputing team for providing the SPIKE-1 compute infrastructure used for the experiments reported in this paper.