VLG-Loc: Vision-Language Global Localization
from Labeled Footprint Maps

Mizuho Aoki1, Kohei Honda1, 2, Yasuhiro Yoshimura2, Takeshi Ishita2, Ryo Yonetani2

1 Nagoya University, 2 CyberAgent AI Lab

Paper | Code | Dataset | Presentation

Overview

Vision-Language Global Localization (VLG-Loc) is a global localization method that uses camera images and a human-readable labeled footprint map containing only names and areas of distinctive visual landmarks.

[Figure: inputs of VLG-Loc]

Abstract

This study presents Vision-Language Global Localization (VLG-Loc), a novel global localization method that uses human-readable labeled footprint maps containing only names and areas of distinctive visual landmarks in an environment. While humans naturally localize themselves using such maps, translating this capability to robotic systems remains highly challenging due to the difficulty of establishing correspondences between observed landmarks and those in the map without geometric and appearance details. To address this challenge, VLG-Loc leverages a vision-language model (VLM) to search the robot’s multi-directional image observations for the landmarks noted in the map. The method then identifies robot poses within a Monte Carlo localization framework, where the found landmarks are used to evaluate the likelihood of each pose hypothesis. Experimental validation in simulated and real-world retail environments demonstrates superior robustness compared to existing scan-based methods, particularly under environmental changes. Further improvements are achieved through the probabilistic fusion of visual and scan-based localization.

Key Idea

Our core idea is to perform global localization using only cameras and a labeled footprint map. The method works by listing visible landmarks identified from camera images and matching this list against the landmarks on the footprint map.
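
As a concrete (and deliberately simplified) illustration of this matching step, the sketch below scores a candidate pose by the overlap between the label set the VLM reports and the label set expected to be visible from that pose. The Jaccard overlap and the name match_score are our own illustrative choices, not the paper's exact scoring rule.

  def match_score(observed_labels, expected_labels):
      """Jaccard overlap between the landmark labels the VLM reports and the
      labels expected to be visible from a candidate pose (illustrative)."""
      union = observed_labels | expected_labels
      if not union:
          return 1.0
      return len(observed_labels & expected_labels) / len(union)

  # Example: the robot reports a "bakery" and a "snack" shelf, while the
  # candidate pose should also reveal a "register".
  print(match_score({"bakery", "snack"}, {"bakery", "snack", "register"}))  # ~0.67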

Architecture

The VLG-Loc architecture operates as follows. Given camera images and a labeled footprint map, (i) a vision-language model (VLM) converts the images into a list of observed landmark labels, while (ii) simulated landmark visibility is computed for pose hypotheses uniformly distributed over the map. (iii) Finally, the hypothesis whose simulated visibility best matches the VLM's observed landmarks is selected as the global pose estimate.

[Figure: architecture of VLG-Loc]
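
The following end-to-end sketch ties these steps together under simplifying assumptions: the VLM output is already reduced to a set of label strings, pose hypotheses are drawn uniformly over the map extent, and visibility_fn stands in for the map-based visibility simulation. The helper names, the toy geometry, and the argmax selection are illustrative, not the released implementation.

  import math
  import random

  def match_score(observed, expected):
      # Same Jaccard overlap as in the Key Idea sketch.
      union = observed | expected
      return len(observed & expected) / len(union) if union else 1.0

  def vlg_loc_estimate(observed_labels, visibility_fn, bounds, n_hypotheses=2000, seed=0):
      """Return the pose hypothesis whose simulated landmark visibility best
      matches the labels the VLM extracted from the camera images.

      observed_labels : set of landmark names reported by the VLM      -- step (i)
      visibility_fn   : (x, y, theta) -> set of labels visible there   -- step (ii)
      bounds          : ((x_min, x_max), (y_min, y_max)) map extent
      """
      rng = random.Random(seed)
      (x_min, x_max), (y_min, y_max) = bounds
      best_pose, best_score = None, -1.0
      for _ in range(n_hypotheses):                      # uniformly distributed hypotheses
          pose = (rng.uniform(x_min, x_max),
                  rng.uniform(y_min, y_max),
                  rng.uniform(-math.pi, math.pi))
          score = match_score(observed_labels, visibility_fn(*pose))
          if score > best_score:                         # step (iii): keep the best match
              best_pose, best_score = pose, score
      return best_pose, best_score

  # Toy usage: "bakery" is visible only in the left half of a 10 m x 10 m map,
  # "register" only in the top half (purely illustrative geometry).
  def toy_visibility(x, y, theta):
      labels = set()
      if x < 5.0:
          labels.add("bakery")
      if y > 5.0:
          labels.add("register")
      return labels

  pose, score = vlg_loc_estimate({"bakery", "register"}, toy_visibility,
                                 bounds=((0.0, 10.0), (0.0, 10.0)))
  print(pose, score)  # a pose in the upper-left quadrant, score 1.0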

Experiments

We compare the results of the following three approaches: (i) scan-based Monte Carlo localization (MCL) using only 2D LiDAR scans, (ii) vision-language global localization (VLG-Loc) using only camera images, and (iii) probabilistic fusion of (i) scan and (ii) vision.
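
The page does not spell out the fusion rule, so the sketch below only assumes the standard way of combining two independent observation models inside Monte Carlo localization: multiply the per-particle likelihoods and renormalize. The NumPy arrays of per-particle weights are likewise an assumption for illustration.

  import numpy as np

  def fuse_particle_weights(scan_likelihoods, vision_likelihoods, eps=1e-12):
      """Fuse scan-based and vision-based (VLG-Loc) likelihoods per particle by
      treating them as independent observations: product, then normalization."""
      fused = (np.asarray(scan_likelihoods) + eps) * (np.asarray(vision_likelihoods) + eps)
      return fused / fused.sum()

  # Example: the scan model is ambiguous between two regions of the map,
  # while the vision model clearly prefers the second one.
  scan   = np.array([0.45, 0.45, 0.10])
  vision = np.array([0.10, 0.80, 0.10])
  print(fuse_particle_weights(scan, vision))  # mass concentrates on the second particle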

In Case (a), scan-based MCL performs poorly because local geometric features are similar across multiple locations. Conversely, VLG-Loc performs well in this environment, as it exploits location-specific landmarks.

[Figure: result of VLG-Loc, case a]

Case (b) is challenging for vision-only VLG-Loc because identical pieces of furniture are scattered across multiple locations. This repetition creates a multi-modal likelihood distribution, which makes an accurate single-pose estimate difficult (see the sketch after the figure below).

[Figure: result of VLG-Loc, case b]
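
This ambiguity follows directly from the matching sketch above: if two well-separated poses expose the same label set, their overlap scores tie, so the vision-only likelihood has several equal peaks. The example below reuses match_score from the Key Idea sketch; the "shelf" labels are hypothetical.

  # Two distant pose hypotheses that expose the same labels, e.g., identical
  # "shelf" furniture placed in two corners of the store (hypothetical labels).
  observed = {"shelf"}
  expected_near_first_shelf  = {"shelf"}
  expected_near_second_shelf = {"shelf"}

  # Both hypotheses tie at 1.0 with match_score from the Key Idea sketch,
  # so the vision-only likelihood is multi-modal and cannot decide between them.
  print(match_score(observed, expected_near_first_shelf),
        match_score(observed, expected_near_second_shelf))  # 1.0 1.0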

Cases (c) and (d) show examples using real-world observations. Here, visual cues can be ambiguous, either due to VLM misidentification or the use of abstract landmark labels (e.g., "snack"). The fusion of vision and scan effectively compensates for the weaknesses of each modality, achieving the highest accuracy.

[Figure: result of VLG-Loc, case c]
[Figure: result of VLG-Loc, case d]

Citation



  @article{vlgloc2025,
    author  = {Mizuho Aoki and Kohei Honda and Yasuhiro Yoshimura and Takeshi Ishita and Ryo Yonetani},
    title   = {{VLG-Loc: Vision-Language Global Localization from Labeled Footprint Maps}},
    journal = {arXiv preprint arXiv:2512.12793},
    year    = {2025},
    doi     = {10.48550/arXiv.2512.12793},
    url     = {https://arxiv.org/abs/2512.12793},
  }