NVIDIA LocateAnything: Spotting Objects 10x Faster By Predicting Whole Bounding Boxes At Once

NVIDIA’s 3-billion-parameter LocateAnything emits each bounding box atomically instead of token by token, reaching 12.7 boxes per second on a single H100 (10x faster than Qwen3-VL) while improving accuracy on LVIS by 3.8%.
artificial-intelligence
Author

Kabui, Charles

Published

2026-06-01

Keywords

locateanything, nvidia-research, parallel-box-decoding, vision-language-model, visual-grounding