NVLM 1.0

Open frontier-class multimodal LLMs

2024-10-03

A family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2).
NVLM 1.0 matches or surpasses leading proprietary and open-access models such as GPT-4o and Llama 3-V on vision-language tasks. Notably, it also improves accuracy on text-only tasks after multimodal training, rather than trading text performance for multimodal ability. The 72B variant achieves top scores on benchmarks such as OCRBench and VQAv2 and outperforms competing models on math, coding, and reasoning tasks. NVLM 1.0 is open source: the model weights are available, and the training code is offered via Megatron-Core. Its results are attributed to three elements: the model architecture, dynamic high-resolution image handling, and curated training datasets. Together these support applications ranging from image understanding to complex problem-solving.
Open Source Artificial Intelligence