GLM-4.6V is GLM's newest open-source multimodal model with a 128k context window. It features native function calling, bridging visual perception with executable actions for complex agentic workflows like web search and coding.
Hi everyone!
GLM-4.6V is a significant iteration of the GLM multimodal series. It scales the training context window to 128k tokens and reaches SOTA visual understanding for its size.
The biggest update here is native Function Calling. For the first time in the GLM architecture, tool use is integrated directly into the visual model, which effectively bridges the gap from "visual perception" to "executable action."
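To make that concrete, here is a minimal sketch of multimodal function calling, assuming an OpenAI-compatible chat completions endpoint. The base URL, model id, and the `search_product` tool are placeholders for illustration, not the official API surface; check the docs for the real values.

```python
# Minimal sketch: vision input + tool definitions in one request.
# base_url, model id, and the tool below are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="https://example.com/v1", api_key="YOUR_API_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "search_product",  # hypothetical tool for the shopping example
        "description": "Search an online store for a product and return prices.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Product keywords."}
            },
            "required": ["query"],
        },
    },
}]

response = client.chat.completions.create(
    model="glm-4.6v",  # assumed model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/product.jpg"}},
            {"type": "text", "text": "Find this product online and compare prices."},
        ],
    }],
    tools=tools,
)

# When the model decides a tool is needed, it returns a structured call
# instead of (or alongside) plain text; your agent loop executes it and
# feeds the result back as a tool message.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```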
The model can automatically generate high-quality interleaved image-text content and handle complete workflows on its own, like browsing products, comparing prices, and generating a shopping list. The frontend replication and visual interaction capabilities are also impressive, and they significantly shorten the path from design to code for developers.
Try it on Z.ai or find the open weights on HF.
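If you'd rather run the open weights locally, here is a rough sketch using transformers. The repo id, the auto classes, and the chat-template call are assumptions on my part; the model card on HF will have the exact, up-to-date snippet.

```python
# Rough local-inference sketch; repo id and loading classes are assumed,
# verify against the HF model card before using.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "zai-org/GLM-4.6V"  # assumed repo id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/screenshot.png"},
        {"type": "text", "text": "Describe the layout of this page."},
    ],
}]

# Build model inputs from the multimodal chat messages.
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens.
print(processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```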