OmniParser V2
Turn any LLM into a Computer Use Agent
2025-02-15

OmniParser ‘tokenizes’ UI screenshots, converting raw pixels into structured elements that LLMs can interpret. This lets an LLM perform retrieval-based next-action prediction over the set of parsed interactable elements.
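To make the idea concrete, here is a minimal sketch of what a ‘tokenized’ screen might look like by the time it reaches the LLM. The `ParsedElement` schema, field names, and prompt format below are illustrative assumptions, not OmniParser’s actual output format or API.

```python
from dataclasses import dataclass

@dataclass
class ParsedElement:
    """One interactable element recovered from a screenshot (hypothetical schema)."""
    element_id: int
    kind: str        # e.g. "icon", "button", "text field"
    caption: str     # functional description from the icon caption model
    bbox: tuple[float, float, float, float]  # normalized (x1, y1, x2, y2)

def build_action_prompt(task: str, elements: list[ParsedElement]) -> str:
    """Render parsed elements as a numbered menu, so next-action prediction
    reduces to the LLM choosing an element ID plus an action verb."""
    lines = [f"Task: {task}", "Interactable elements on screen:"]
    for el in elements:
        lines.append(f"  [{el.element_id}] {el.kind}: {el.caption} at {el.bbox}")
    lines.append("Answer with one of: CLICK <id> | TYPE <id> <text> | DONE")
    return "\n".join(lines)

elements = [
    ParsedElement(0, "icon", "settings gear", (0.91, 0.02, 0.95, 0.06)),
    ParsedElement(1, "text field", "search box", (0.30, 0.10, 0.70, 0.14)),
]
print(build_action_prompt("Open the settings page", elements))
```

Because the LLM never sees raw pixels, it only has to choose among a small set of already-grounded candidates, which is what makes the approach work with any sufficiently capable model.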
OmniParser V2 turns any large language model (LLM) into a capable computer-use agent by converting UI screenshots into structured, interpretable elements. It tackles the key challenges of GUI automation, reliably detecting interactable icons and understanding the semantics of on-screen elements, so that an LLM can predict and execute the next action from the parsed data.

Compared with its predecessor, the upgraded version detects small elements more accurately and runs faster, with a 60% reduction in latency. Trained on larger sets of interactive element and icon caption data, OmniParser V2 achieves state-of-the-art performance, especially when paired with GPT-4o.

Additionally, OmniTool, a dockerized Windows system, makes it easy to experiment with a variety of LLMs, integrating the full agent loop of screen understanding, grounding, action planning, and execution (sketched below). In line with responsible AI practices, OmniParser emphasizes ethical use and risk mitigation, ensuring safe and effective automation.
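A rough sketch of that agent loop follows. Every function here is a placeholder standing in for an OmniTool component (screenshot capture of the Windows VM, OmniParser inference, the plugged-in LLM, and input injection); none of these names come from the actual codebase.

```python
import re
from dataclasses import dataclass

@dataclass
class ParsedElement:
    element_id: int
    caption: str
    bbox: tuple[float, float, float, float]  # normalized (x1, y1, x2, y2)

def capture_screen() -> bytes:
    """Stub: grab a screenshot from the dockerized Windows VM."""
    return b""

def parse_screenshot(image: bytes) -> list[ParsedElement]:
    """Stub: OmniParser's detection + captioning models would run here."""
    return [ParsedElement(0, "OK button", (0.45, 0.80, 0.55, 0.85))]

def llm_complete(prompt: str) -> str:
    """Stub: call whichever LLM is plugged in (GPT-4o, etc.)."""
    return "CLICK 0"

def click_at(x: float, y: float) -> None:
    """Stub: inject a mouse click into the VM at normalized coordinates."""
    print(f"click at ({x:.2f}, {y:.2f})")

def agent_loop(task: str, max_steps: int = 20) -> None:
    """One screen-understanding -> grounding -> planning -> execution cycle per step."""
    for _ in range(max_steps):
        elements = parse_screenshot(capture_screen())
        listing = "\n".join(f"[{e.element_id}] {e.caption}" for e in elements)
        decision = llm_complete(f"Task: {task}\nElements:\n{listing}\nNext action?")
        if decision == "DONE":
            return
        m = re.match(r"CLICK (\d+)", decision)
        if m:
            el = elements[int(m.group(1))]
            x1, y1, x2, y2 = el.bbox  # grounding: element ID -> screen region
            click_at((x1 + x2) / 2, (y1 + y2) / 2)

agent_loop("Dismiss the dialog", max_steps=1)
```

The design point worth noting is the separation of concerns: parsing and grounding stay fixed while the planning LLM is swappable, which is what lets OmniTool host different models behind the same loop.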