A state-of-the-art VLM for video understanding, pointing, and tracking
Molmo 2, a new suite of state-of-the-art vision-language models with open weights, training data, and training code, can analyze videos and multiple images at once.
Hi everyone!
Ai2 is back with a massive upgrade. If you liked the original Molmo for images, you are going to love this. Molmo 2 brings that same "pointing" capability to video.
The coolest part is how it handles Space + Time. You don't just get a text summary; you get exact timestamps and coordinates. Ask it "how many times did the ball hit the ground?" and it points to every single instance.
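To make that Space + Time idea concrete, here is a minimal Python sketch of how such pointing output could be represented and tallied. This is not the Molmo 2 API; the `VideoPoint` structure, the normalized coordinates, and the `count_events` helper are all assumptions made purely for illustration.

```python
# Minimal sketch (not the official Molmo 2 API): one way to represent the kind of
# spatio-temporal "points" described above, where each detected event carries a
# timestamp plus normalized frame coordinates. All names here are hypothetical.
from dataclasses import dataclass


@dataclass
class VideoPoint:
    """A single pointed-at event: when it happened and where in the frame."""
    t_seconds: float  # timestamp within the video
    x: float          # horizontal position, normalized to [0, 1]
    y: float          # vertical position, normalized to [0, 1]


def count_events(points: list[VideoPoint], min_gap_s: float = 0.5) -> int:
    """Count distinct events, merging points closer in time than `min_gap_s`."""
    if not points:
        return 0
    ordered = sorted(points, key=lambda p: p.t_seconds)
    count = 1
    last_t = ordered[0].t_seconds
    for p in ordered[1:]:
        if p.t_seconds - last_t >= min_gap_s:
            count += 1
        last_t = p.t_seconds
    return count


# Example: hypothetical output for "how many times did the ball hit the ground?"
bounces = [
    VideoPoint(t_seconds=1.2, x=0.48, y=0.91),
    VideoPoint(t_seconds=2.7, x=0.52, y=0.90),
    VideoPoint(t_seconds=4.1, x=0.55, y=0.92),
]
print(count_events(bounces))  # -> 3
```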
It reportedly outperforms Gemini 3 Pro at video tracking 🤯, all while being trained on less than 1/8th of the data Meta used for PerceptionLM. That is some serious efficiency.