Ferret

Refer and ground anything anywhere at any granularity

2024-01-02

Ferret
A multimodal large language model (MLLM) from Apple that excels at both image understanding and language processing, with particular strength in understanding spatial references.
Ferret is built to refer to and ground objects within images at any granularity or shape. It uses a hybrid region representation that combines discrete coordinates with continuous visual features, so it can accept diverse spatial inputs such as points, bounding boxes, and free-form shapes. Trained with the GRIT dataset, which contains about 1.1M samples rich in hierarchical spatial knowledge, Ferret achieves strong performance on referring and grounding tasks, reduces object hallucination, and produces more detailed region descriptions, making it well suited to multimodal applications such as region-based chat and localization.
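To make the idea of a hybrid region representation concrete, here is a minimal Python/PyTorch sketch. It is not Ferret's actual code: the bin count is a made-up assumption, and simple mask-average pooling stands in for Ferret's spatial-aware visual sampler. It only illustrates how discrete coordinate tokens and a continuous region feature can be produced side by side for a referred region.

```python
# Illustrative sketch only -- not Apple's Ferret implementation.
import torch

NUM_BINS = 1000  # assumption: coordinates quantized into 1000 bins


def coord_tokens(box, image_size, num_bins=NUM_BINS):
    """Quantize a pixel-space box (x1, y1, x2, y2) into discrete coordinate bins."""
    w, h = image_size
    x1, y1, x2, y2 = box
    return [
        int(x1 / w * (num_bins - 1)),
        int(y1 / h * (num_bins - 1)),
        int(x2 / w * (num_bins - 1)),
        int(y2 / h * (num_bins - 1)),
    ]


def region_feature(feature_map, mask):
    """Pool a continuous feature for an arbitrarily shaped region.

    feature_map: (C, H, W) visual features from an image encoder.
    mask:        (H, W) binary mask of the referred region (a point, box,
                 or free-form shape rasterized onto the feature grid).
    """
    mask = mask.float()
    denom = mask.sum().clamp(min=1.0)
    # Average the features inside the mask -- a simple stand-in for a
    # learned spatial-aware sampler.
    return (feature_map * mask.unsqueeze(0)).sum(dim=(1, 2)) / denom


if __name__ == "__main__":
    # Toy inputs: a 2048-channel 24x24 feature map and a box-shaped region mask.
    feats = torch.randn(2048, 24, 24)
    mask = torch.zeros(24, 24)
    mask[6:12, 8:16] = 1.0

    tokens = coord_tokens(box=(120, 80, 260, 200), image_size=(448, 448))
    feat = region_feature(feats, mask)
    print("discrete coordinate tokens:", tokens)
    print("continuous region feature shape:", feat.shape)  # torch.Size([2048])
```

In the real model, the discrete tokens and the pooled region feature are both fed to the language model alongside the text, which is what lets the model handle points, boxes, and free-form shapes with one interface.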
Open Source, Artificial Intelligence, GitHub, Apple