Open MLLM family (1B-78B) from OpenGVLab. Excels at vision, reasoning, long context & agents via native multimodal pre-training. Outperforms base LLMs on text tasks.
Check out InternVL3 from OpenGVLab – a new family of open vision-language models.
Its native multimodal pre-training mixes vision and text data from the start, which reportedly leads to strong performance on both image/video understanding and pure text tasks.
The models also show solid reasoning abilities and handle long-context inputs. The weights and code are openly available.
You can try the models directly on their Chat Web and HF Space.
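If you'd rather run a checkpoint locally, here is a rough sketch using Hugging Face transformers. The model ID OpenGVLab/InternVL3-8B, the trust_remote_code loading pattern, and the chat() helper are assumptions based on how earlier InternVL releases were packaged on the Hub, so check the model card for the exact usage (including image preprocessing for vision inputs):

```python
# Minimal sketch (assumed API): loading an InternVL3 checkpoint for a
# text-only chat turn. Image/video inputs need the preprocessing code
# shown on the official model card, omitted here for brevity.
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL3-8B"  # assumed model ID; other sizes (1B-78B) live under the OpenGVLab org

model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,  # the InternVL modeling code ships with the checkpoint
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# chat() is the convenience method exposed by the remote code in earlier
# InternVL releases; passing pixel_values=None keeps the turn text-only.
generation_config = dict(max_new_tokens=256)
question = "In one paragraph, what does native multimodal pre-training mean?"
response = model.chat(tokenizer, None, question, generation_config)
print(response)
```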
The Open MLLM family is truly impressive! What stands out is how well these models handle vision and reasoning tasks while outperforming base LLMs even on text benchmarks. The native multimodal pre-training approach seems to be a game-changer.
Can't wait to see what the community will build with these models. Wishing the OpenGVLab team continued success with this project!
Hey Zac Zuo & the OpenGVLab team (congrats on the hunt/launch!), this looks like a significant step forward for open vision-language models. Exciting to see strong performance reported from native multimodal pre-training, especially in reasoning and handling long context alongside vision tasks.
For those of us building AI experiences (@UNI AI), having powerful, open models like InternVL3 available is fantastic for the ecosystem. The ability to handle both image/video and text tasks well from the start is key.
Question: Regarding the long context handling – what architectural innovations or training techniques allow InternVL3 to maintain strong performance on extended inputs compared to other MLLMs?
Great contribution to the open-source community. Wishing you success with the launch! 👁️🗨️🧠