Ai2

MolmoMotion: Language-guided 3D motion forecasting

June 17, 2026

Ai2


Machines have become remarkably good at perceiving motion. Given a video, modern models can track how objects and points move through a scene with exceptionally high confidence. But perception is inherently retrospective: it explains motion that has already happened. Many of the systems and applications we want to build need to look forward instead. A robot reaching for a cup has to anticipate how the cup will move before it touches it. A video generator has to know what realistic motion comes next if it's going to produce physically plausible frames. 

Predicting motion is harder than observing it, but it's also far more useful in many scenarios.

This idea was the motivation behind MolmoMotion, a new motion forecasting model we're releasing today. Given a video frame, 3D points marked on an object, and a written instruction describing the intended action (e.g., “Move and rotate the wooden bowl with fruit on the table”), MolmoMotion predicts where those points will move in 3D space over the next few seconds. On our PointMotionBench benchmark, it outperforms the existing forecasting methods we tested.

Given an RGB observation, a set of query points on an object, and an action description, MolmoMotion predicts the object's future 3D point trajectory. These predicted trajectories can then drive downstream applications such as robotics planning and trajectory-conditioned video generation.

Alongside the model, we're publishing MolmoMotion-1M, to our knowledge the largest collection of 3D point trajectories paired with action descriptions, drawn from 1.16M videos. We're also releasing PointMotionBench, a human-validated benchmark of 2.7K video clips designed to measure object-centric 3D motion forecasting accuracy.

We find that motion forecasters like MolmoMotion can be useful across a range of downstream tasks, from robot planning to controllable video generation. We're releasing the model weights, the MolmoMotion-1M dataset, and our PointMotionBench benchmark openly for the community to study, improve, and customize.

MolmoMotion: Under the hood

MolmoMotion represents motion compactly, as object-attached 3D points in world space, which capture motion without the cost of rendering full video. We chose this representation because we needed a general way to describe motion with three properties:

  1. Class-agnostic: not tied to templates for human bodies, hands, rigid objects, or any other fixed category.
  2. View-stable: the same physical motion should be represented consistently across cameras and viewpoints.
  3. Directly usable by downstream systems that need to reason about physical motion.

Among the representations we considered, it was the only one that satisfied all three. A sparse set of surface points can describe rigid, articulated, and (within limits) deformable motion without assuming the type of object being moved. Because the points live in a shared world frame, their trajectories remain stable across camera motion and viewpoint change. And because they're compact explicit trajectories in 3D space, they can be passed directly to systems such as robot policies or video generation models.
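
To make the view-stability property concrete, here is a minimal sketch of the underlying geometry. It assumes each frame comes with a camera-to-world pose (a rotation `R` and translation `t`); the function names and example poses are hypothetical and are not taken from the MolmoMotion code. A point that is physically at rest has different camera-frame coordinates in every frame of a moving camera, but a single constant position in the world frame.

```python
import math

def cam_to_world(point_cam, rotation, translation):
    """Map a camera-frame 3D point into the world frame: p_w = R p_c + t."""
    return tuple(
        sum(rotation[i][j] * point_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )

def world_to_cam(point_world, rotation, translation):
    """Inverse map: p_c = R^T (p_w - t)."""
    d = [point_world[j] - translation[j] for j in range(3)]
    return tuple(sum(rotation[j][i] * d[j] for j in range(3)) for i in range(3))

def yaw(angle_rad):
    """Rotation matrix about the vertical (y) axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

# A point that does not move in the world.
static_point_world = (0.5, 0.0, 2.0)

# Hypothetical camera-to-world poses for three frames: the camera pans and translates.
poses = [
    (yaw(0.0), (0.0, 0.0, 0.0)),
    (yaw(0.2), (0.1, 0.0, 0.0)),
    (yaw(0.4), (0.2, 0.0, 0.1)),
]

# What each camera sees: the coordinates change from frame to frame.
cam_track = [world_to_cam(static_point_world, R, t) for R, t in poses]

# Mapping back with the known poses recovers one constant world position.
world_track = [cam_to_world(p, R, t) for p, (R, t) in zip(cam_track, poses)]
```

In the camera frame the track appears to move even though nothing moved; in the world frame the same track is constant, which is the property a downstream planner needs.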

To forecast those trajectories, MolmoMotion uses Molmo 2 as its backbone, allowing it to connect language instructions to objects and points in an image. Given a short video history, an action description, and a set of query points with their initial 3D positions, the model first identifies the object being referred to, the query points, and the motion the instruction describes. It then predicts the future 3D trajectory of each point.

We train two variants of MolmoMotion:

  • The autoregressive variant (MolmoMotion-AR) predicts future coordinates step by step. It represents 3D coordinates as structured text, following the coordinate-style prediction used by VLMs, and writes out the future trajectory in temporal order. Because each new coordinate is conditioned on the trajectory already generated, this encourages smooth rollouts and gives the strongest accuracy when the future path is well-defined.
  • The flow-matching variant (MolmoMotion-FM) predicts trajectories in continuous 3D space by transforming noise into motion, which makes it better suited for representing uncertainty when an instruction admits multiple plausible futures.
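
The two output styles can be illustrated with a toy sketch. This post does not specify the exact text format used by MolmoMotion-AR or the network used by MolmoMotion-FM, so everything below is an assumption: the `x y z; x y z` text layout is hypothetical, and `toy_velocity` is an analytic stand-in for a learned velocity network. The sampler uses plain Euler integration, one common way to turn a velocity field into samples.

```python
import random

def trajectory_to_text(points, decimals=3):
    """Serialize one point's future (x, y, z) positions as text, in temporal order.
    The 'x y z; x y z' layout is a hypothetical format chosen for illustration."""
    return "; ".join(" ".join(f"{c:.{decimals}f}" for c in p) for p in points)

def text_to_trajectory(text):
    """Parse the text form back into a list of (x, y, z) tuples."""
    return [tuple(float(c) for c in step.split()) for step in text.split("; ")]

def toy_velocity(x_t, t, target):
    """Stand-in for a learned velocity network. For the straight-line path
    x_t = (1 - t) * noise + t * target, the velocity is (target - x_t) / (1 - t)."""
    return [(b - a) / (1.0 - t) for a, b in zip(x_t, target)]

def sample_flow(target, num_steps=10, seed=0):
    """Euler integration from t = 0 (Gaussian noise) to t = 1 (a trajectory).
    `target` is a flattened trajectory; a trained model would not be given it."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = toy_velocity(x, k * dt, target)
        x = [a + dt * b for a, b in zip(x, v)]
    return x

# Autoregressive style: the trajectory is written out as text, step by step.
future = [(0.10, 0.20, 1.00), (0.12, 0.20, 0.98)]
text = trajectory_to_text(future)
restored = text_to_trajectory(text)

# Flow-matching style: noise is transported to a continuous trajectory.
flat_target = [c for p in future for c in p]
sample = sample_flow(flat_target)
```

Because the toy field always points at one fixed target, every seed ends at the same trajectory. A learned field conditioned on the image and instruction could instead carry different noise samples to different plausible futures, which is the property the flow-matching bullet describes.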

Introducing MolmoMotion-1M and PointMotionBench

To train MolmoMotion, we needed data that didn’t yet exist: large-scale videos with 3D point trajectories grounded to specific objects and paired with action descriptions. Existing 3D-track datasets were small and domain-limited, and while internet videos have all the scale and diversity we wanted for a forecaster like MolmoMotion, they didn’t include 3D annotations. So we built an automatic pipeline that extracts object-grounded 3D trajectories from unconstrained video.

Given an input video and its action description, our annotation pipeline produces object-grounded 3D point trajectories in metric world coordinates. (The figure below shows each stage.) The challenging part is that raw tracks from unconstrained video are noisy – with depth and tracking errors that leave points jittering and drifting – and that objects often stay still for much of a video. To make the data more trustworthy, we filter out points that don't move coherently with the rest of the object, smooth the remaining trajectories, and segment each clip to the window where the object actually moves.
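
The three cleanup steps (filter, smooth, segment) can be sketched as follows. This is a simplified illustration, not the pipeline's actual implementation: the coherence test only compares each track's per-frame displacement with the median displacement of all tracks (so it assumes roughly translational motion), and the thresholds `max_dev` and `min_speed` are hypothetical values.

```python
import math
from statistics import median

def step_vectors(track):
    """Per-frame displacement vectors of one track [(x, y, z), ...]."""
    return [tuple(b[i] - a[i] for i in range(3)) for a, b in zip(track, track[1:])]

def filter_incoherent(tracks, max_dev=0.05):
    """Keep tracks whose per-frame displacement stays within max_dev (meters)
    of the per-frame median displacement over all tracks."""
    steps = [step_vectors(tr) for tr in tracks]
    num_steps = len(steps[0])
    med = [tuple(median(s[f][i] for s in steps) for i in range(3))
           for f in range(num_steps)]
    kept = []
    for tr, s in zip(tracks, steps):
        worst = max(math.dist(s[f], med[f]) for f in range(num_steps))
        if worst <= max_dev:
            kept.append(tr)
    return kept

def smooth(track, radius=1):
    """Centered moving average; the window shrinks at the clip boundaries."""
    out = []
    for f in range(len(track)):
        window = track[max(0, f - radius): f + radius + 1]
        out.append(tuple(sum(p[i] for p in window) / len(window) for i in range(3)))
    return out

def motion_window(tracks, min_speed=0.01):
    """Return (start, end) frame indices covering every step whose mean
    displacement exceeds min_speed, or None if the object never moves."""
    moving = []
    for f in range(len(tracks[0]) - 1):
        mean_step = sum(math.dist(tr[f], tr[f + 1]) for tr in tracks) / len(tracks)
        if mean_step > min_speed:
            moving.append(f)
    if not moving:
        return None
    return moving[0], moving[-1] + 1

def make_track(x0, offsets):
    """Toy track that only moves along x."""
    return [(x0 + dx, 0.0, 1.0) for dx in offsets]

# Three coherent tracks (still, then moving, then still) plus one drifting outlier.
offsets = [0.0, 0.0, 0.1, 0.2, 0.3, 0.3]
tracks = [make_track(x0, offsets) for x0 in (0.0, 0.1, 0.2)]
tracks.append([(0.0, 0.2 * f, 1.0) for f in range(6)])

kept = filter_incoherent(tracks)                   # the outlier is dropped
smoothed = [smooth(tr) for tr in kept]
window = motion_window(smoothed, min_speed=0.05)   # frames where the object moves
start, end = window
clipped = [tr[start:end + 1] for tr in smoothed]
```

On this toy input the drifting track is removed and the clip is trimmed to frames 1 through 4, the interval in which the object is actually moving.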

Running our pipeline at scale yielded MolmoMotion-1M—to our knowledge the largest corpus of action-described, object-grounded 3D point trajectories assembled to date, spanning 736 motion types and 5.6K distinct objects.

An overview of our data annotation pipeline. Given a video of an action event and its description, we first ground the moving object and sample query points on it. We then track dense 2D points on the object, lift these tracks into a shared metric 3D frame, and use object-level spatial and temporal consistency priors to filter unreliable trajectories. Finally, we clip the video around intervals where the grounded object undergoes meaningful motion.
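
The "lift" step in the overview can be illustrated with standard pinhole unprojection. The post does not say which depth or camera-pose estimators the pipeline uses, so the sketch below simply assumes that per-frame metric depth, intrinsics `(fx, fy, cx, cy)`, and camera-to-world poses are already available; all names and numbers are hypothetical.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into camera-frame 3D coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def lift_track(pixels, depths, intrinsics, poses):
    """Lift one 2D track into a shared world frame.

    pixels: [(u, v)] per frame; depths: [d] per frame (meters);
    poses:  [(R, t)] per frame, camera-to-world rotation and translation.
    """
    fx, fy, cx, cy = intrinsics
    track = []
    for (u, v), d, (R, t) in zip(pixels, depths, poses):
        p = unproject(u, v, d, fx, fy, cx, cy)
        track.append(tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i]
                           for i in range(3)))
    return track

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
intrinsics = (500.0, 500.0, 320.0, 240.0)

# The camera slides 0.2 m to the right between the two frames, so a static point
# at 2 m depth shifts 50 px to the left in the image.
pixels = [(320.0, 240.0), (270.0, 240.0)]
depths = [2.0, 2.0]
poses = [(identity, (0.0, 0.0, 0.0)), (identity, (0.2, 0.0, 0.0))]

world_track = lift_track(pixels, depths, intrinsics, poses)
```

Both frames lift to the same world position, roughly (0, 0, 2), even though the 2D track moved. In real footage, depth and tracking errors break this consistency, which is why the filtering stage follows the lift.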

Left instruction: "Move and rotate wooden bowl with fruits on the table." Right instruction: "Roll a lint roller on a blue cloth."

Left instruction: "A silver car follows the road and slowly turns to the right." Right instruction: "A flamingo dips its beak into the water while walking to the right."

To evaluate MolmoMotion’s forecasting performance, we also built PointMotionBench, a human-validated benchmark of held-out 3D trajectories. It covers 2.7K clips spanning 111 object categories and 61 motion types, including indoor manipulation, egocentric hand-object interaction, and outdoor dynamic scenes. For each clip, models are given the current observation, object query points, and an action description, and are evaluated on how accurately their predicted 3D point trajectories match the object’s actual future motion. This gives us a direct quantitative test of 3D motion forecasting rather than relying on whether a generated point track merely looks plausible.
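
The benchmark results are reported as 3D average displacement error in meters. A minimal sketch of that metric, under its conventional definition, is shown below; the exact evaluation protocol (forecast horizon, any alignment, and how clips are averaged) is defined in the paper and is not reproduced here, and the function name is our own.

```python
import math

def average_displacement_error(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth 3D positions,
    averaged over all points and all future time steps.

    pred, gt: [num_points][num_steps] lists of (x, y, z) in the same units (e.g. meters).
    """
    total, count = 0.0, 0
    for pred_track, gt_track in zip(pred, gt):
        for p, g in zip(pred_track, gt_track):
            total += math.dist(p, g)
            count += 1
    return total / count

# One query point, two future steps: the errors are 0.1 m and 0.5 m.
gt = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
pred = [[(0.0, 0.0, 0.1), (1.0, 0.3, 0.4)]]
ade = average_displacement_error(pred, gt)  # about 0.3
```

Because the error is measured in 3D world coordinates rather than pixels, it does not depend on how far the object is from the camera or on the camera's viewpoint.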

Experiments and performance

We evaluate MolmoMotion in three ways. First, we test whether it forecasts future 3D motion more accurately than existing methods. Second, we test whether what it has learned about motion helps a robot carry out manipulation tasks. Third, we test whether that same knowledge can help guide the motion in generated video.

3D motion forecasting

On PointMotionBench, the strongest MolmoMotion variant outperforms every existing 3D motion forecasting method we tested (including pixel-space video generators, parametric 3D methods, and a simple constant-velocity baseline) on each of the three benchmark splits, across a range of objects, scenes, and actions.
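
For reference, the two trivial baselines in the comparison are easy to state in code. The benchmark notes describe Static as keeping each 3D point fixed and Extrapolate as projecting prior motion forward linearly; the sketch below is one plausible reading of those descriptions. Estimating velocity from only the last two observed frames is our assumption, as are the function names.

```python
def static_baseline(history, horizon):
    """Predict that the point stays at its last observed position."""
    return [history[-1]] * horizon

def extrapolate_baseline(history, horizon):
    """Constant-velocity forecast using the displacement between the last two
    observed positions (an assumed choice of velocity estimate)."""
    last, prev = history[-1], history[-2]
    velocity = tuple(a - b for a, b in zip(last, prev))
    return [tuple(p + (k + 1) * v for p, v in zip(last, velocity))
            for k in range(horizon)]

# A point observed moving 0.1 m per frame along x.
history = [(0.0, 0.0, 1.0), (0.1, 0.0, 1.0)]

static_pred = static_baseline(history, horizon=3)
linear_pred = extrapolate_baseline(history, horizon=3)
# static_pred repeats (0.1, 0.0, 1.0); linear_pred continues to x of about 0.2, 0.3, 0.4
```

Neither baseline looks at the image or the instruction, so beating them is a minimum bar: a forecaster adds value only where the object's future motion departs from standing still or continuing in a straight line.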

MolmoMotion can forecast many kinds of object and scene motion: how a lint roller will move back and forth on cloth, how a bowl will slide and rotate on a table, how a flamingo will walk to the right while dipping its beak in the water, or how a car will follow a road as it turns. In each of these examples, the predicted path follows the instruction MolmoMotion was given and stays close to the ground-truth motion in our benchmark.

PointMotionBench benchmark results
3D average displacement error in meters on each PointMotionBench split. Lower is better; "n/a" marks a split on which a method is not reported.

Method                  HOT3D    WorldTrack   DAVIS
MolmoMotion-AR (3f)     0.109    0.143        1.227
MolmoMotion-AR (1f)     0.157    0.148        1.146
MolmoMotion-FM (3f)     0.135    0.158        1.480
MolmoMotion-FM (1f)     0.183    0.165        1.380
ObjectForesight         0.129    n/a          n/a
EgoScaler               0.170    n/a          n/a
Robot4DGen              0.212    0.548        2.120
Wan2.2-5B               0.200    0.852        3.074
Cosmos Predict          0.225    0.831        4.191
Track2Act               0.294    1.230        4.853
Static baseline         0.180    0.167        2.281
Extrapolate baseline    0.159    0.184        2.683

Source: MolmoMotion paper, Table 1 (3D point trajectory prediction on PointMotionBench). MolmoMotion (3f) and (1f) denote the 3-frame and single-frame input variants. ObjectForesight and EgoScaler are reported on HOT3D only (elsewhere in Table 1). Static keeps each 3D point fixed; Extrapolate projects motion from prior frames forward linearly.

Downstream evaluation: robotics planning

What MolmoMotion learns about motion should carry over from one setting to another—lifting a cup with a human hand and lifting it with a robot gripper are very different actions, but the cup itself follows a similar path through 3D space. That makes MolmoMotion a natural fit for robotics, where a robot has to plan how objects should move before moving them.

After fine-tuning on DROID, a large open dataset of real-world robot manipulation videos, we find that MolmoMotion can predict sensible object paths across different objects, camera viewpoints, scenes, and tasks for a wide range of robot planning scenarios.

Left instruction: “Take cloth out of container.” Right instruction: “Move lid on pot.”

In simulation, a control policy built on MolmoMotion succeeds on 76.3% of pick-and-place tasks, versus 56.0% for the same policy built on Molmo 2. It also learns faster, reaching 51% success after 10K training steps, where the Molmo 2 version tops out at 19%. On real robots (after fine-tuning), MolmoMotion needs only about 2K training steps to reach the test L2 error that the Molmo 2 baseline reaches after 12K steps.

MolmoMotion improves robotics planning
Final closed-loop success rate (%) for the same MolmoBot policy with different backbone initializations. Higher is better.

Evaluation setting             MolmoMotion initialized   Molmo 2 initialized
Seen scene, seen object        85.0                      70.0
Seen scene, unseen object      74.5                      51.2
Unseen scene, seen object      72.0                      50.0
Unseen scene, unseen object    74.2                      48.7
Average                        76.3                      56.0

Source: MolmoMotion paper, Figure 5a and Section 5.2. The two MolmoBot policies use the same flow-matching action head and 20K released episodes, differing only in backbone initialization.

Downstream evaluation: video generation

Instruction: “A flamingo dips its beak into the water while walking to the right.” From left to right: DaS + MolmoMotion, CogVideoX-5B, and WAN-14B.

Instruction: “Take the round light brown plate from the table.” From left to right: DaS + MolmoMotion, CogVideoX-5B, and WAN-14B.

MolmoMotion's predicted paths can also steer video generation. Instead of letting an image-to-video model guess motion from a text instruction alone, you can feed in MolmoMotion's predictions. The result is generated video that follows requested actions more closely, especially for small and precise movements a prompt can only describe vaguely.

The metrics back this up. Used to guide a video generator, MolmoMotion improves motion quality over the base model on all five motion-related metrics we measure, and beats a much larger image-to-video model on four of the five.

MolmoMotion-guided video generation results
DaS + MolmoMotion improves on CogVideoX-5B across all five metrics and beats the larger Wan2.2-I2V-A14B on four of five. Higher is better.

Metric                    DaS + MolmoMotion   Wan2.2-I2V-A14B   CogVideoX-5B
Temporal consistency      0.968               0.965             0.964
Subject consistency       0.950               0.940             0.939
Motion smoothness         0.990               0.983             0.988
Dynamic degree            0.876               0.908             0.861
Background consistency    0.948               0.947             0.941

Source: MolmoMotion paper, Table 2. Scores shown are the reported values.
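
The "all five" and "four of five" claims can be checked directly against the reported scores. The short script below copies the values from the results above (MolmoMotion paper, Table 2) and counts the metrics on which DaS + MolmoMotion scores strictly higher; the helper name is our own.

```python
# Reported scores (higher is better), copied from the results above.
scores = {
    "Temporal consistency":   {"DaS + MolmoMotion": 0.968, "Wan2.2-I2V-A14B": 0.965, "CogVideoX-5B": 0.964},
    "Subject consistency":    {"DaS + MolmoMotion": 0.950, "Wan2.2-I2V-A14B": 0.940, "CogVideoX-5B": 0.939},
    "Motion smoothness":      {"DaS + MolmoMotion": 0.990, "Wan2.2-I2V-A14B": 0.983, "CogVideoX-5B": 0.988},
    "Dynamic degree":         {"DaS + MolmoMotion": 0.876, "Wan2.2-I2V-A14B": 0.908, "CogVideoX-5B": 0.861},
    "Background consistency": {"DaS + MolmoMotion": 0.948, "Wan2.2-I2V-A14B": 0.947, "CogVideoX-5B": 0.941},
}

def count_wins(table, ours, other):
    """Number of metrics on which `ours` scores strictly higher than `other`."""
    return sum(1 for row in table.values() if row[ours] > row[other])

wins_vs_base = count_wins(scores, "DaS + MolmoMotion", "CogVideoX-5B")      # 5
wins_vs_large = count_wins(scores, "DaS + MolmoMotion", "Wan2.2-I2V-A14B")  # 4

# The single metric where the larger model scores higher.
losses = [m for m, row in scores.items()
          if row["DaS + MolmoMotion"] < row["Wan2.2-I2V-A14B"]]              # ["Dynamic degree"]
```

The one exception is dynamic degree, where the larger Wan2.2-I2V-A14B model scores higher.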

Limitations and what's next

MolmoMotion is a capable model, but there are still some limitations to note. It uses eight query points per object during training—enough to forecast a useful trajectory but not enough to densely represent surface geometry. This limits the model's handling of complex deformable motion.

We think forecasting – anticipating how objects in the world will move before they move – is as fundamental to machine intelligence as perceiving what's already there. MolmoMotion is a step toward this—3D motion prediction that generalizes across object categories without per-category templates, learned from ordinary video, and the most accurate 3D motion forecaster we've measured on PointMotionBench. We expect many applications will follow in robotics, video, and beyond.

We encourage you to try MolmoMotion by downloading the weights, inspecting the training data, and evaluating our methods against PointMotionBench.

Join us

At Ai2, we’re building the future of transparent, open-source AI, built in the open to empower scientific progress and a fundamental understanding of this world-changing technology. We’re not here to make profits; we’re here to make sure the benefits of AI are shared widely, for the benefit of humanity. If this appeals to you, please take a look at our open roles.
