OmniParser V2
Turn any LLM into a Computer Use Agent
2025-02-15

OmniParser ‘tokenizes’ UI screenshots, converting raw pixels into structured elements that LLMs can interpret. This lets an LLM perform retrieval-based next-action prediction over the set of parsed interactable elements.
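To make the idea concrete, here is a minimal sketch of what a ‘tokenized’ screen might look like by the time it reaches the LLM. The `ParsedElement` schema, field names, and prompt format below are illustrative assumptions, not OmniParser’s actual output format or API.

```python
from dataclasses import dataclass

@dataclass
class ParsedElement:
    """One interactable element recovered from a screenshot (hypothetical schema)."""
    element_id: int
    kind: str        # e.g. "icon", "button", "text field"
    caption: str     # functional description from the icon caption model
    bbox: tuple[float, float, float, float]  # normalized (x1, y1, x2, y2)

def build_action_prompt(task: str, elements: list[ParsedElement]) -> str:
    """Render parsed elements as a numbered menu, so next-action prediction
    reduces to the LLM choosing an element ID plus an action verb."""
    lines = [f"Task: {task}", "Interactable elements on screen:"]
    for el in elements:
        lines.append(f"  [{el.element_id}] {el.kind}: {el.caption} at {el.bbox}")
    lines.append("Answer with one of: CLICK <id> | TYPE <id> <text> | DONE")
    return "\n".join(lines)

elements = [
    ParsedElement(0, "icon", "settings gear", (0.91, 0.02, 0.95, 0.06)),
    ParsedElement(1, "text field", "search box", (0.30, 0.10, 0.70, 0.14)),
]
print(build_action_prompt("Open the settings page", elements))
```

Because the LLM never sees raw pixels, it only has to choose among a small set of already-grounded candidates, which is what makes the approach work with any sufficiently capable model.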
OmniParser V2 turns any large language model (LLM) into a capable computer-use agent by converting UI screenshots into structured, interpretable elements. It tackles the key challenges of GUI automation, reliably detecting interactable icons and understanding the semantics of on-screen elements, so that an LLM can predict and execute the next action from the parsed data.

Compared with its predecessor, the upgraded version detects small elements more accurately and runs faster, with a 60% reduction in latency. Trained on larger sets of interactive element and icon caption data, OmniParser V2 achieves state-of-the-art performance, especially when paired with GPT-4o.

Additionally, OmniTool, a dockerized Windows system, makes it easy to experiment with a variety of LLMs, integrating the full agent loop of screen understanding, grounding, action planning, and execution (sketched below). In line with responsible AI practices, OmniParser emphasizes ethical use and risk mitigation, ensuring safe and effective automation.
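A rough sketch of that agent loop follows. Every function here is a placeholder standing in for an OmniTool component (screenshot capture of the Windows VM, OmniParser inference, the plugged-in LLM, and input injection); none of these names come from the actual codebase.

```python
import re
from dataclasses import dataclass

@dataclass
class ParsedElement:
    element_id: int
    caption: str
    bbox: tuple[float, float, float, float]  # normalized (x1, y1, x2, y2)

def capture_screen() -> bytes:
    """Stub: grab a screenshot from the dockerized Windows VM."""
    return b""

def parse_screenshot(image: bytes) -> list[ParsedElement]:
    """Stub: OmniParser's detection + captioning models would run here."""
    return [ParsedElement(0, "OK button", (0.45, 0.80, 0.55, 0.85))]

def llm_complete(prompt: str) -> str:
    """Stub: call whichever LLM is plugged in (GPT-4o, etc.)."""
    return "CLICK 0"

def click_at(x: float, y: float) -> None:
    """Stub: inject a mouse click into the VM at normalized coordinates."""
    print(f"click at ({x:.2f}, {y:.2f})")

def agent_loop(task: str, max_steps: int = 20) -> None:
    """One screen-understanding -> grounding -> planning -> execution cycle per step."""
    for _ in range(max_steps):
        elements = parse_screenshot(capture_screen())
        listing = "\n".join(f"[{e.element_id}] {e.caption}" for e in elements)
        decision = llm_complete(f"Task: {task}\nElements:\n{listing}\nNext action?")
        if decision == "DONE":
            return
        m = re.match(r"CLICK (\d+)", decision)
        if m:
            el = elements[int(m.group(1))]
            x1, y1, x2, y2 = el.bbox  # grounding: element ID -> screen region
            click_at((x1 + x2) / 2, (y1 + y2) / 2)

agent_loop("Dismiss the dialog", max_steps=1)
```

The design point worth noting is the separation of concerns: parsing and grounding stay fixed while the planning LLM is swappable, which is what lets OmniTool host different models behind the same loop.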