Magma
Foundation Model for Multimodal AI Agents
2025-02-27

Magma, from Microsoft Research, is presented as the first foundation model for multimodal AI agents, designed to handle complex interactions across both virtual and real environments.
Magma processes text, images, and videos through a shared vision encoder and a large language model (LLM), which gives it a unified basis for action grounding and planning. It targets tasks such as robot manipulation, UI navigation, and video understanding, where it is reported to outperform state-of-the-art models in zero-shot and few-shot evaluations. Two techniques, Set-of-Mark (SoM) and Trace-of-Mark (ToM), improve its spatial reasoning, action prediction, and understanding of temporal dynamics. These capabilities carry across domains, from gaming strategies to tasks on real robots, which makes Magma a versatile base for multimodal AI development.
Open Source
Artificial Intelligence
Bots