SmolVLA

Powerful robotics VLA that runs on consumer hardware

2025-06-06

SmolVLA is a compact, open-source Vision-Language-Action (VLA) model designed for robotics. With just 450 million parameters, it runs efficiently on consumer hardware, such as a single GPU or even a MacBook, while outperforming larger models. Trained on publicly available community datasets, SmolVLA supports asynchronous inference, enabling 30% faster response times and twice the task throughput of synchronous execution. The model pairs a Vision-Language Model (VLM) with a flow-matching action expert, optimized for real-time control. Key design choices, such as visual token reduction, layer skipping, and interleaved attention, improve both speed and robustness. SmolVLA performs strongly in both simulation and real-world tasks, generalizing across diverse environments. By providing accessible training recipes and compatibility with affordable hardware, SmolVLA aims to democratize robotics research and accelerate progress toward generalist robotic agents.
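The asynchronous-inference idea can be illustrated with a short sketch: while the robot executes the current chunk of predicted actions, the next chunk is requested in a background thread, so model latency is hidden behind execution instead of stalling the control loop. This is a minimal illustration under assumed names; `DummyPolicy`, `predict_chunk`, `async_control_loop`, and `trigger_ratio` are hypothetical and not LeRobot's or SmolVLA's actual API.

```python
import queue
import threading
import time


class DummyPolicy:
    """Stand-in policy with a hypothetical predict_chunk(obs) interface
    that returns a chunk of low-level actions (not the real SmolVLA API)."""

    def predict_chunk(self, obs, chunk_size=10):
        time.sleep(0.05)  # stand-in for model inference latency
        return [f"action({obs}, step={i})" for i in range(chunk_size)]


def async_control_loop(policy, get_obs, execute, steps=100, trigger_ratio=0.5):
    """Overlap inference with execution: while the current action chunk is
    being executed, request the next chunk once the queue runs low."""
    actions = queue.Queue()
    pending = None  # in-flight inference thread, if any

    def infer():
        for action in policy.predict_chunk(get_obs()):
            actions.put(action)

    infer()  # prime the queue with a first, blocking chunk
    chunk_size = actions.qsize()

    for _ in range(steps):
        # Kick off the next inference before the queue empties, so the
        # robot never has to stop and wait for the model.
        if actions.qsize() <= trigger_ratio * chunk_size and (
            pending is None or not pending.is_alive()
        ):
            pending = threading.Thread(target=infer)
            pending.start()
        execute(actions.get())  # blocks only if inference falls behind


# Example usage with stubbed observation/execution functions:
async_control_loop(DummyPolicy(), get_obs=lambda: "obs", execute=print, steps=30)
```

Because the next chunk is requested before the current one is exhausted, inference latency overlaps with execution, which is the intuition behind the reported latency and throughput gains over a strictly synchronous predict-then-act loop.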
Tags: Open Source, Robots, Artificial Intelligence