SmolVLM2, from Hugging Face, is a series of tiny, open-source multimodal models for video understanding. It processes video, images, and text. Ideal for on-device applications.
Sharing SmolVLM2, a new open-source multimodal model series from Hugging Face that's surprisingly small, with the smallest version at only 256M parameters! It's designed specifically for video understanding, opening up interesting possibilities for on-device AI.
What's cool about it:
📹 Video Understanding: Designed specifically for analyzing video content, not just images.
🤏 Tiny Size: The smallest version is only 256M parameters, meaning it can potentially run on devices with limited resources.
🖼️ Multimodal: Handles video, images, and text, and you can even interleave them in your prompts.
👐 Open Source: Apache 2.0 license.
🤗 Hugging Face Transformers: Easy to use with the transformers library (see the sketch below).
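If you want to poke at it yourself, here's a minimal sketch of video description with transformers. It assumes a recent transformers release with SmolVLM2 video support; the model ID, video path, and generation settings are illustrative, not prescriptive.

```python
# Minimal sketch: video description with SmolVLM2 via transformers.
# Assumes a recent transformers release with SmolVLM2 video support;
# the checkpoint ID and video path below are illustrative.
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"  # smaller checkpoints also exist
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to("cuda")  # swap for "cpu" or "mps" when experimenting on-device

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "my_clip.mp4"},  # hypothetical local file
            {"type": "text", "text": "Describe this video in detail."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

out = model.generate(**inputs, do_sample=False, max_new_tokens=128)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```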
It's based on Idefics3 and supports tasks like video captioning, visual question answering, and even storytelling from visual content.
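And since prompts can interleave modalities, visual question answering over multiple images is just a different messages list fed into the same pipeline as above. The image URLs here are placeholders:

```python
# Interleaved image + text prompt for visual question answering.
# Plugs into the same processor.apply_chat_template(...) / model.generate(...)
# calls as the video sketch above; the image URLs are placeholders.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What changed between these two images?"},
            {"type": "image", "url": "https://example.com/before.png"},
            {"type": "image", "url": "https://example.com/after.png"},
        ],
    }
]
```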
You can try a video highlight generation demo here.
VLMs this small could run on our phones and on many other personal devices, like glasses. That's the future.