Molmo 2

SOTA video understanding, pointing, and tracking VLM

Molmo 2, a new suite of state-of-the-art vision-language models with open weights, training data, and training code, can analyze videos and multiple images at once.

Top comment

Hi everyone!

Ai2 is back with a massive upgrade. If you liked the original Molmo for images, you are going to love this. Molmo 2 brings that same "pointing" capability to video.

The coolest part is how it handles Space + Time. You don't just get a text summary; you get exact timestamps and coordinates. Ask it "how many times did the ball hit the ground?" and it points to every single instance.

It reportedly outperforms Gemini 3 Pro in video tracking 🤯, all while being trained on less than one-eighth of the data Meta used for PerceptionLM. That is some serious efficiency.